In 2022, the Institute of Statistics and Big Data Published Six High-Level Papers in the Four Top International Statistics Journals

2023-01-06

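Time does not stand still, and the seasons flow on. Looking back on 2022, the Institute of Statistics and Big Data stayed true to its founding mission, actively served the national big data strategy, and led its faculty and students to work at the frontier and pursue excellence. Through continued innovation in statistical methods, theory, and applications, the Institute published six high-quality papers in the four top international statistics journals (AOS, JASA, JRSSB, Biometrika), steadily raising its research strength and international influence and strengthening its support for the university's statistics discipline.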

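Forging ahead with determination and persevering in action. Looking ahead to 2023, the faculty and students of the Institute of Statistics and Big Data will continue to be guided by Xi Jinping Thought on Socialism with Chinese Characteristics for a New Era, carry out the national requirement to "implement the strategy of invigorating the country through science and education and strengthen talent support for modernization," serve the needs of society and the people, and contribute the strength of statistics and data science to the university's goal of building a world-class university with Chinese characteristics.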

1. BIOMETRIKA Mar 17 2022

High-dimensional semi-supervised learning: in search of optimal inference of the mean

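Authors: Yuqian Zhang (张宇谦; Institute of Statistics and Big Data, Renmin University of China); Jelena Bradic (Department of Mathematics, University of California San Diego, USA)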

Abstract: A fundamental challenge in semi-supervised learning lies in the observed data's disproportional size when compared with the size of the data collected with missing outcomes. An implicit understanding is that the dataset with missing outcomes, being significantly larger, ought to improve estimation and inference. However, it is unclear to what extent this is correct. We illustrate one clear benefit: root-n inference of the outcome's mean is possible while only requiring a consistent estimation of the outcome, possibly at a rate slower than root n. This is achieved by a novel k-fold, cross-fitted, double robust estimator. We discuss both linear and nonlinear outcomes. Such an estimator is particularly suited for models that naturally do not admit root-n consistency, such as high-dimensional, nonparametric or semiparametric models. We apply our methods to estimating heterogeneous treatment effects.

Keywords: Coefficient of determination; Double robustness; Missing data; Model-lean inference
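
The cross-fitting idea mentioned in the abstract above can be illustrated with a small sketch. This is not the paper's estimator: it assumes labels are missing completely at random, uses a single covariate with a working linear model, and the names `fit_line` and `cross_fitted_mean` are hypothetical helpers introduced only for illustration. Each fold fits the outcome model on the other labeled folds, averages its predictions over all covariates (labeled and unlabeled), and adds the mean held-out residual as a bias correction.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with one covariate."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def cross_fitted_mean(x_lab, y_lab, x_unlab, k=5):
    """Illustrative k-fold cross-fitted estimate of E[Y] (assumption: MCAR labels)."""
    n = len(x_lab)
    x_all = x_lab + x_unlab
    fold_estimates = []
    for fold in range(k):
        held_out = [i for i in range(n) if i % k == fold]
        training = [i for i in range(n) if i % k != fold]
        # Fit the working outcome model without the held-out fold.
        a, b = fit_line([x_lab[i] for i in training],
                        [y_lab[i] for i in training])
        # Plug-in term: average prediction over every observed covariate.
        plug_in = statistics.fmean([a + b * x for x in x_all])
        # Correction term: average residual on the held-out labeled fold.
        correction = statistics.fmean(
            [y_lab[i] - (a + b * x_lab[i]) for i in held_out])
        fold_estimates.append(plug_in + correction)
    return statistics.fmean(fold_estimates)


# Simulated data (an assumption for the demo): Y = 2 + 3X + noise, so E[Y] = 2.
rng = random.Random(0)
x_lab = [rng.gauss(0.0, 1.0) for _ in range(200)]
y_lab = [2.0 + 3.0 * x + rng.gauss(0.0, 1.0) for x in x_lab]
x_unlab = [rng.gauss(0.0, 1.0) for _ in range(5000)]

theta = cross_fitted_mean(x_lab, y_lab, x_unlab)
print(f"cross-fitted estimate of E[Y]: {theta:.3f}")
```

The large unlabeled sample pins down the average prediction, while the held-out residuals guard against bias in the fitted model; the paper's setting is far more general (high-dimensional and nonlinear outcome models).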

2. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Mar 2022 (Early Access)

Transfer learning in large-scale graphical models with false discovery rate control

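Authors: Sai Li (李赛; Institute of Statistics and Big Data, Renmin University of China); T. Tony Cai (Department of Statistics, The Wharton School, University of Pennsylvania); Hongzhe Li (Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania)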

Abstract: Transfer learning for high-dimensional Gaussian graphical models (GGMs) is studied. The target GGM is estimated by incorporating the data from similar and related auxiliary studies, where the similarity between the target graph and each auxiliary graph is characterized by the sparsity of a divergence matrix. An estimation algorithm, Trans-CLIME, is proposed and shown to attain a faster convergence rate than the minimax rate in the single-task setting. Furthermore, we introduce a universal debiasing method that can be coupled with a range of initial graph estimators and can be analytically computed in one step. A debiased Trans-CLIME estimator is then constructed and is shown to be element-wise asymptotically normal. This fact is used to construct a multiple testing procedure for edge detection with false discovery rate control. The proposed estimation and multiple testing procedures demonstrate superior numerical performance in simulations and are applied to infer the gene networks in a target brain tissue by leveraging the gene expressions from multiple other brain tissues. A significant decrease in prediction errors and a significant increase in power for link detection are observed.

Keywords: Debiased estimator; Inverse covariance matrix; Meta learning; Multiple testing

3. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Mar 2022 (Early Access)

Testing the Effects of High-Dimensional Covariates via Aggregating Cumulative Covariances

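Authors: Runze Li (李润泽; Department of Statistics, Pennsylvania State University); Kai Xu (许凯; School of Mathematics and Statistics, Anhui Normal University); Yeqing Zhou (周叶青; School of Mathematical Sciences, Tongji University); Liping Zhu (朱利平; Institute of Statistics and Big Data, Renmin University of China) (corresponding author)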

Abstract: In this article, we test for the effects of high-dimensional covariates on the response. In many applications, different components of covariates usually exhibit various levels of variation, which is ubiquitous in high-dimensional data. To simultaneously accommodate such heteroscedasticity and high dimensionality, we propose a novel test based on an aggregation of the marginal cumulative covariances, requiring no prior information on the specific form of regression models. Our proposed test statistic is scale-invariance, tuning-free and convenient to implement. The asymptotic normality of the proposed statistic is established under the null hypothesis. We further study the asymptotic relative efficiency of our proposed test with respect to the state-of-art universal tests in two different settings: one is designed for high-dimensional linear model and the other is introduced in a completely model-free setting. A remarkable finding reveals that, thanks to the scale-invariance property, even under the high-dimensional linear models, our proposed test is asymptotically much more powerful than existing competitors for the covariates with heterogeneous variances while maintaining high efficiency for the homoscedastic ones.

Keywords: Conditional mean independence; Cumulative covariance; High dimension; Martingale difference divergence

4. BIOMETRIKA Jun 2022 (Early Access)

Lasso-adjusted treatment effect estimation under covariate-adaptive randomization

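Authors: Hanzhong Liu (刘汉中; Center for Statistical Science, Tsinghua University); Fuyi Tu (涂富艺; PhD student, Institute of Statistics and Big Data, Renmin University of China); Wei Ma (马维; Institute of Statistics and Big Data, Renmin University of China) (corresponding author)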

Abstract: We consider the problem of estimating and inferring treatment effects in randomized experiments. In practice, stratified randomization, or more generally, covariate-adaptive randomization, is routinely used in the design stage to balance treatment allocations with respect to a few variables that are most relevant to the outcomes. Then, regression is performed in the analysis stage to adjust the remaining imbalances to yield more efficient treatment effect estimators. Building upon and unifying recent results obtained for ordinary-least-squares adjusted estimators under covariate-adaptive randomization, this paper presents a general theory of regression adjustment that allows for model mis-specification and the presence of a large number of baseline covariates. We exemplify the theory on two lasso-adjusted treatment effect estimators, both of which are optimal in their respective classes. In addition, nonparametric consistent variance estimators are proposed to facilitate valid inferences, which work irrespective of the specific randomization methods used. The robustness and improved efficiency of the proposed estimators are demonstrated through numerical studies.

Keywords: Causal inference; Lasso; Minimization; Regression adjustment; Stratified randomization

5. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Jul 2022 (Early Access)

Causal Structural Learning on MPHIA Individual Dataset

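Authors: Le Bao (Department of Statistics, Pennsylvania State University, University Park); Changcheng Li (李长城; School of Mathematical Sciences, Dalian University of Technology); Runze Li (李润泽; Department of Statistics, Pennsylvania State University); Songshan Yang (杨松山; Institute of Statistics and Big Data, Renmin University of China) (in alphabetical order)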

Abstract: The Population-based HIV Impact Assessment (PHIA) is an ongoing project that conducts nationally representative HIV-focused surveys for measuring national and regional progress toward UNAIDS' 90-90-90 targets, the primary strategy to end the HIV epidemic. We believe the PHIA survey offers a unique opportunity to better understand the key factors that drive the HIV epidemics in the most affected countries in sub-Saharan Africa. In this article, we propose a novel causal structural learning algorithm to discover important covariates and potential causal pathways for 90-90-90 targets. Existing constraint-based causal structural learning algorithms are quite aggressive in edge removal. The proposed algorithm preserves more information about important features and potential causal pathways. It is applied to the Malawi PHIA (MPHIA) dataset and leads to interesting results. For example, it discovers age and condom usage to be important for female HIV awareness; the number of sexual partners to be important for male HIV awareness; and knowing the travel time to HIV care facilities leads to a higher chance of being treated for both females and males. We further compare and validate the proposed algorithm using BIC and using Monte Carlo simulations, and show that the proposed algorithm achieves improvement in true positive rates in important feature discovery over existing algorithms.

Keywords: 90-90-90 targets; Causal structural learning; HIV; PHIA

6. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Sep 2022 (Early Access)

A New and Unified Family of Covariate Adaptive Randomization Procedures and Their Properties

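Authors: Wei Ma (马维; Institute of Statistics and Big Data, Renmin University of China); Ping Li (李平; Baidu Research USA); Li-Xin Zhang (张立新; Center for Data Science, Zhejiang University); Feifang Hu (胡飞芳; Department of Statistics, George Washington University)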

Abstract: In clinical trials and other comparative studies, covariate balance is crucial for credible and efficient assessment of treatment effects. Covariate adaptive randomization (CAR) procedures are extensively used to reduce the likelihood of covariate imbalances occurring. In the literature, most studies have focused on balancing of discrete covariates. Applications of CAR with continuous covariates remain rare, especially when the interest goes beyond balancing only the first moment. In this article, we propose a family of CAR procedures that can balance general covariate features, such as quadratic and interaction terms. Our framework not only unifies many existing methods, but also introduces a much broader class of new and useful CAR procedures. We show that the proposed procedures have superior balancing properties; in particular, the convergence rate of imbalance vectors is $O_P(n^{\epsilon})$ for any $\epsilon > 0$ if all of the moments are finite for the covariate features, relative to $O_P(\sqrt{n})$ under complete randomization, where $n$ is the sample size. Both the resulting convergence rate and its proof are novel. These favorable balancing properties lead to increased precision of treatment effect estimation in the presence of nonlinear covariate effects. The framework is applied to balance covariate means and covariance matrices simultaneously. Simulation and empirical studies demonstrate the excellent and robust performance of the proposed procedures.

Keywords: Covariate adaptive randomization; Covariate balance; Imbalance vector; Markov chain; Treatment effect estimation
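
To make the contrast between covariate-adaptive and complete randomization in the abstract above concrete, here is a minimal simulation sketch. It is an assumption-laden illustration, not the paper's procedure: it uses a simple biased-coin rule that, with probability `rho`, assigns each new unit to the arm that shrinks the squared norm of a running imbalance vector built from the hypothetical feature map `features(x) = (x, x^2)`. Setting `rho = 0.5` recovers complete randomization.

```python
import math
import random
import statistics


def features(x):
    """Hypothetical covariate features to balance: first and second moments."""
    return (x, x * x)


def imbalance_norm(n, rho, rng):
    """Euclidean norm of sum_i s_i * features(x_i), with s_i = +1/-1 for the two arms."""
    lam = [0.0, 0.0]
    for _ in range(n):
        phi = features(rng.gauss(0.0, 1.0))
        dot = lam[0] * phi[0] + lam[1] * phi[1]
        # Assigning sign s changes ||lam||^2 by 2*s*dot + ||phi||^2,
        # so the arm with sign opposite to dot reduces imbalance the most.
        if dot > 0:
            preferred = -1
        elif dot < 0:
            preferred = 1
        else:
            preferred = rng.choice((-1, 1))
        # Biased coin: take the preferred arm with probability rho.
        s = preferred if rng.random() < rho else -preferred
        lam[0] += s * phi[0]
        lam[1] += s * phi[1]
    return math.hypot(lam[0], lam[1])


rng = random.Random(1)
reps, n = 20, 1000
adaptive = statistics.fmean([imbalance_norm(n, 0.85, rng) for _ in range(reps)])
complete = statistics.fmean([imbalance_norm(n, 0.5, rng) for _ in range(reps)])
print(f"mean imbalance norm, adaptive (rho=0.85): {adaptive:.2f}")
print(f"mean imbalance norm, complete (rho=0.50): {complete:.2f}")
```

In this toy setting the adaptive rule should keep the imbalance vector far smaller than complete randomization does, which is the qualitative behavior the abstract describes; the specific rate results belong to the paper's own family of procedures.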