Active projects

Estimation of potential predictivity

Project led by Tian Zheng and Professor Shaw-Hwa Lo

Lo, Adeline, et al. "Why significant variables aren’t automatically good predictors." Proceedings of the National Academy of Sciences 112.45 (2015): 13892-13897.

Lo, Adeline, et al. "Making Good Prediction: A Theoretical Framework." (2016).

Topic-adjusted visibility measure for citation networks

Project led by Linda Tan, Tim Jones
Measuring the impact of scientific articles is important for evaluating the research output of individual scientists, academic institutions and journals. While citations are the raw data for constructing impact measures, biases and other issues arise if factors affecting citation patterns are not properly accounted for. In this work, we address the problem of field variation and introduce an article-level metric useful for evaluating individual articles’ visibility. This measure derives from joint probabilistic modeling of the content of the articles and the citations among them, using latent Dirichlet allocation (LDA) and the mixed membership stochastic blockmodel (MMSB). Our proposed model provides a visibility metric for individual articles adjusted for field variation in citation rates, a structural understanding of citation behavior in different fields, and article recommendations that take into account article visibility and citation patterns. We develop an efficient algorithm for model fitting using variational methods. To scale up to large networks, we develop an online variant using stochastic gradient methods and a case-control likelihood approximation. We apply our methods to the benchmark KDD Cup 2003 dataset of approximately 30,000 high-energy physics papers.

Tan, Linda SL, Aik Hui Chan, and Tian Zheng. "Topic-adjusted visibility metric for scientific articles." The Annals of Applied Statistics 10.1 (2016): 1-31.

[figure: Estimated graph limit object]

Estimation of ERGMs via Graph Limit

Project led by Ran He
Exponential random graph models (ERGMs) are very popular for modeling complex network data. Traditional estimation methods cannot scale to large networks, which limits their application to real network data. This project provides a new approach to parameter estimation in network models. Specifically, we incorporate the graph limit into the estimation of ERGMs. We develop a numerical method to obtain the graph limit of ERGMs for large networks. Based on this graph limit, we then approximate the likelihood function and maximize it to obtain the MLE. Our method further extends to network inference using sampled data.
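To make the scaling problem concrete, the standard ERGM likelihood can be written as follows. The notation here is an assumption introduced for illustration and is not taken from the project description: $T(G)$ is a vector of network statistics, $\theta$ the parameter vector, $\mathcal{G}_n$ the set of simple undirected graphs on $n$ nodes, and $g_{\mathrm{obs}}$ the observed network.

```latex
P_\theta(G) = \frac{\exp\{\theta^\top T(G)\}}{Z(\theta)},
\qquad
Z(\theta) = \sum_{G' \in \mathcal{G}_n} \exp\{\theta^\top T(G')\},
\qquad
\hat{\theta} = \arg\max_{\theta} \left\{ \theta^\top T(g_{\mathrm{obs}}) - \log Z(\theta) \right\}
```

The normalizing constant $Z(\theta)$ sums over $2^{n(n-1)/2}$ graphs, so it cannot be evaluated exactly for anything but very small $n$. As described above, the graph-limit approach substitutes an approximation of the likelihood before maximizing it.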

Publications:

He, Ran, and Tian Zheng. "Estimation of exponential random graph models for large social networks via graph limits." ASONAM '13 Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (2013): 248-255. Presentation at DIMACS workshop on Statistical Analysis of Network Dynamics and Interactions, Nov. 7-8, 2013. [Source code] in Columbia's Academic Commons.

He, Ran, and Tian Zheng. "GLMLE: graph-limit enabled fast computation for fitting exponential random graph models to large social networks." Social Network Analysis and Mining 5.1 (2015): 1-19.

He, Ran, and Tian Zheng. "Estimating Exponential Random Graph Models using Sampled Network Data via Graphon." ASONAM '16 Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (2016).

Latent-state models for social interaction events

Project led by Jing Wu, in collaboration with Curley's lab.
In this project, we examine event time observations of social interactions among a cohort of mice.

[figure: Level plots of the pressure level data (500mb geopotential heights)]

Spectral filtering and predictive modeling of spatio-temporal dynamics

Project led by Lu Meng, Nathan Lessen, Mohammed Khabbazian, in collaboration with Hillman's lab.

Many applications generate spatial-temporal data that exhibit lower-rank smooth movements mixed with higher-rank noise. Separating the signal from the noise is important for visualizing and understanding the lower-rank movements. The lower-rank dynamics also often have multiple independent components that correspond to different trends or functionality of the system under study. In this talk, we present a novel filtering method for identifying lower-rank dynamics and their components embedded in a high-dimensional spatial-temporal system, with applications to climate data.
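As a rough illustration of separating lower-rank movement from higher-rank noise, the sketch below applies a plain truncated SVD to a synthetic time-by-space matrix. This is a generic baseline chosen for illustration, not the filtering method developed in this project. The function name `low_rank_filter`, the synthetic signal, and the choice of rank 2 are all assumptions.

```python
import numpy as np

def low_rank_filter(data, rank):
    """Return a rank-`rank` approximation of a (time x space) matrix."""
    # Remove the temporal mean at each location so the SVD captures variation.
    mean = data.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(data - mean, full_matrices=False)
    # Keep only the leading temporal modes, strengths, and spatial modes.
    smooth = (u[:, :rank] * s[:rank]) @ vt[:rank] + mean
    return smooth, u[:, :rank], s[:rank], vt[:rank]

# Synthetic example: two smooth spatio-temporal components plus white noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)[:, None]   # 200 time points
x = np.linspace(0, 1, 50)[None, :]    # 50 spatial locations
signal = np.sin(2 * np.pi * t) * np.cos(np.pi * x) + 0.5 * t * x
noisy = signal + 0.1 * rng.standard_normal(signal.shape)

filtered, temporal_modes, strengths, spatial_modes = low_rank_filter(noisy, rank=2)

# Compare reconstruction error against the known synthetic signal.
err_before = np.linalg.norm(noisy - signal)
err_after = np.linalg.norm(filtered - signal)
print(f"error before filtering: {err_before:.2f}, after: {err_after:.2f}")
```

The rows of `spatial_modes` and columns of `temporal_modes` give one candidate decomposition into components, although SVD components are only orthogonal and need not match the independent components described above.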

Flexible sparse learning of feature subspaces

Project led by Yuting Ma, Chengliang Tang
By treating multivariate observations as points in a high-dimensional space, distance measures have been used as natural measures of (dis)similarity and have served as the foundation of various learning methods. The efficiency of these learning methods depends heavily on the chosen distance measure. Much effort has been devoted to improving the performance of classifiers by learning an appropriate distance metric, particularly for the K-Nearest Neighbor classifier. As the dimension of the data increases, however, traditional metric learning methods face two challenges: many input variables introduce noise that masks the true signal hidden in a low-dimensional subspace, and the computational cost becomes formidable. In this project, we address these issues by adaptively learning a sparse distance metric in high-dimensional space, with simultaneous feature selection. More specifically, we construct a basis classifier based on a Mahalanobis-type distance metric, which unifies the ideas of nearest neighbor and large margin classification. Using this basis learner as a building block, a gradient boosting algorithm is adopted to learn one sparse rank-one matrix at each step. The sparsity is controlled by both a stepwise feature selection mechanism and a total complexity penalty. Moreover, we extend our method to nonlinear metric learning via a hierarchical expansion with interactions. Close connections to kernel methods are drawn via the representer theorem and Taylor expansion, which illustrates the rudiments of our approach.
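The structure of such a metric can be sketched as a weighted sum of sparse rank-one matrices. The example below only shows how the metric is assembled and evaluated. It does not implement the boosting, feature selection, or penalty steps, and the helper names, directions, and weights are hypothetical values chosen for illustration.

```python
import numpy as np

def build_metric(directions, weights):
    """Assemble M = sum_t w_t * u_t u_t^T from sparse direction vectors u_t."""
    dim = directions[0].shape[0]
    metric = np.zeros((dim, dim))
    for u, w in zip(directions, weights):
        # Each term is rank one, and positive semidefinite when w >= 0.
        metric += w * np.outer(u, u)
    return metric

def mahalanobis_sq(x, y, metric):
    """Squared Mahalanobis-type distance (x - y)^T M (x - y)."""
    diff = x - y
    return float(diff @ metric @ diff)

# Hypothetical example in 6 dimensions where only features 0 and 3 carry signal.
u1 = np.array([1.0, 0, 0, 0, 0, 0])
u2 = np.array([0, 0, 0, 1.0, 0, 0])
M = build_metric([u1, u2], weights=[2.0, 0.5])

a = np.array([1.0, 5, 5, 2, 5, 5])
b = np.array([0.0, -5, -5, 0, -5, -5])

# Large differences in the unselected features do not affect the distance.
dist = mahalanobis_sq(a, b, M)          # 2.0 * 1^2 + 0.5 * 2^2 = 4.0
selected = np.flatnonzero(np.diag(M))   # features with nonzero diagonal weight
print(dist, selected)
```

Because every direction vector is sparse, the features that influence the distance can be read directly off the nonzero entries of the assembled matrix, which is what makes simultaneous feature selection possible in this representation.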

Ma, Yuting, and Tian Zheng. "Boosted sparse nonlinear distance metric learning." Statistical Analysis and Data Mining: The ASA Data Science Journal (2016).

Ma, Yuting, and Tian Zheng. "Stabilized Sparse Online Learning for Sparse Data." arXiv preprint arXiv:1604.06498 (2016).

[figure]

Network enrichment analysis

Project led by Julia Yang
In this project, we systematically investigate the association between a list of nodes and a known graph.

[figure: A low dimensional association pattern in high dimensional space.]

Kernel-based Association Measures

Project led by Ying Liu
Numerical measures of association are important summaries for describing statistical relationships between two sets of variables. Traditionally, such association measures were proposed and studied under specific settings, which has limited their applicability to complex and high-dimensional data. There have been recent advances in model-free generalized measures of association. In this project, we provide a unifying summary of existing measures. We then introduce a general framework for association measures that includes most commonly used conventional measures. It further allows novel and intuitive extensions based on kernels. Under this framework, we introduce association mining and variable screening via the maximization of the proposed kernel-based association measures.
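For a sense of what a kernel-based association measure looks like in practice, the sketch below computes the Hilbert-Schmidt Independence Criterion (HSIC), a well-known existing measure of this kind. It is used here only as an example and is not necessarily the measure proposed in this project. The Gaussian kernel, the bandwidths, and the synthetic data are assumptions.

```python
import numpy as np

def gaussian_gram(x, bandwidth):
    """Gram matrix of the Gaussian kernel for the rows of x."""
    sq_norms = np.sum(x**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * x @ x.T
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dists, 0) / (2 * bandwidth**2))

def hsic(x, y, bandwidth_x=1.0, bandwidth_y=1.0):
    """Biased empirical HSIC, trace(K H L H) / n^2."""
    n = x.shape[0]
    K = gaussian_gram(x, bandwidth_x)
    L = gaussian_gram(y, bandwidth_y)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return float(np.trace(K @ H @ L @ H)) / n**2

# Synthetic data: y depends on x nonlinearly, z is independent of x.
rng = np.random.default_rng(0)
n = 300
x = rng.standard_normal((n, 1))
y = x**2 + 0.1 * rng.standard_normal((n, 1))
z = rng.standard_normal((n, 1))

dependent = hsic(x, y)
independent = hsic(x, z)
print(f"HSIC(x, y) = {dependent:.4f}, HSIC(x, z) = {independent:.4f}")
```

The relationship between `x` and `y` is quadratic, so their linear correlation is close to zero, yet the kernel measure separates the dependent pair from the independent one. Some references normalize the biased estimator by (n - 1)^2 rather than n^2.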
