
Estimation and inference in sparse multivariate regression and conditional Gaussian graphical models under an unbalanced distributed setting. (English) Zbl 07823223

Summary: This paper proposes a distributed estimation and inferential framework for sparse multivariate regression and conditional Gaussian graphical models under the unbalanced splitting setting. This type of data splitting arises when the datasets from different sources cannot be aggregated on a single machine or when the available machines differ in computational power. The numbers of covariates, responses and machines are allowed to grow with the sample size, while sparsity is imposed. Debiased estimators of the coefficient matrix and of the precision matrix are proposed on every single machine, with theoretical guarantees. Moreover, new aggregated estimators that pool information across the machines using a pseudo log-likelihood function are proposed; they are shown to enjoy efficiency and asymptotic normality as the number of machines grows with the sample size. The performance of these estimators is investigated via a simulation study and a real data example, and is shown empirically to be close to that of the non-distributed estimators that use the entire dataset.
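The pipeline summarized above (fit and debias locally on each machine, then pool across machines) can be sketched in a toy, low-dimensional form. This is not the paper's pseudo log-likelihood aggregation: as a simple stand-in, each machine debiases a local lasso fit using an estimated precision matrix and the local estimates are pooled by sample-size weighting, which handles unbalanced splits. All names (`lasso_cd`, `debias`, the sizes and tuning constants) are illustrative assumptions, not from the paper.

```python
# Toy sketch of one-shot distributed debiased estimation under unbalanced splits.
import numpy as np


def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent (soft-thresholding updates)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            resid = y - X @ beta + X[:, j] * beta[j]   # partial residual
            rho = X[:, j] @ resid / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta


def debias(X, y, beta, theta):
    """One-step correction: beta + Theta X'(y - X beta)/n."""
    n = X.shape[0]
    return beta + theta @ X.T @ (y - X @ beta) / n


rng = np.random.default_rng(0)
p = 10
beta_true = np.zeros(p)
beta_true[:3] = [1.5, -1.0, 0.5]          # sparse truth

# Unbalanced split: the machines hold different sample sizes.
sizes = [150, 250, 400]
local, weights = [], []
for n_k in sizes:
    X = rng.standard_normal((n_k, p))
    y = X @ beta_true + 0.5 * rng.standard_normal(n_k)
    # Toy surrogate for a precision-matrix estimate (exact inverse is fine
    # here because n_k > p; high-dimensional settings need a sparse estimate).
    theta = np.linalg.inv(X.T @ X / n_k)
    local.append(debias(X, y, lasso_cd(X, y, lam=0.1), theta))
    weights.append(n_k)

# Pool the debiased local estimates, weighting each machine by its sample size.
beta_agg = np.average(local, axis=0, weights=weights)
```

The sample-size weights are what distinguish the unbalanced setting from naive averaging: a machine holding more data contributes proportionally more to the aggregate.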

MSC:

62H12 Estimation in multivariate analysis
62H22 Probabilistic graphical models
62J07 Ridge regression; shrinkage estimators (Lasso)

Software:

WONDER; glasso

References:

[1] AKBANI, R., AKDEMIR, K. C., AKSOY, B. A., ALBERT, M., ALLY, A., AMIN, S. B., ARACHCHI, H., ARORA, A., AUMAN, J. T., AYALA, B. et al. (2015). Genomic classification of cutaneous melanoma. Cell 161 1681-1696.
[2] BANERJEE, O., GHAOUI, L. E. and D’ASPREMONT, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research 9 485-516. MathSciNet: MR2417243 · Zbl 1225.68149
[3] BATTEY, H., FAN, J., LIU, H., LU, J. and ZHU, Z. (2018). Distributed testing and estimation under sparse high dimensional models. The Annals of Statistics 46 1352-1382. MathSciNet: MR3798006 · Zbl 1392.62060
[4] BICKEL, P. J., RITOV, Y. and TSYBAKOV, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics 37 1705-1732. MathSciNet: MR2533469 · Zbl 1173.62022
[5] BÜHLMANN, P. and VAN DE GEER, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer. MathSciNet: MR2807761 · Zbl 1273.62015
[6] CAI, T., LIU, W. and LUO, X. (2011). A constrained \(\ell_1\) minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association 106 594-607. MathSciNet: MR2847973 · Zbl 1232.62087
[7] CAI, T., LIU, W. and ZHOU, H. (2016). Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation. The Annals of Statistics 44 455-488. MathSciNet: MR3476606 · Zbl 1341.62115
[8] CAI, T. T., LI, H., LIU, W. and XIE, J. (2013). Covariate-adjusted precision matrix estimation with an application in genetical genomics. Biometrika 100 139-156. MathSciNet: MR3034329 · Zbl 1284.62648 · doi:10.1093/biomet/ass058
[9] CHEN, M., REN, Z., ZHAO, H. and ZHOU, H. (2016). Asymptotically normal and efficient estimation of covariate-adjusted Gaussian graphical model. Journal of the American Statistical Association 111 394-406. MathSciNet: MR3494667
[10] CHEN, X. and XIE, M. (2014). A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica 24 1655-1684. MathSciNet: MR3308656 · Zbl 1480.62258
[11] CLAESKENS, G., MAGNUS, J. R., VASNEV, A. L. and WANG, W. (2016). The forecast combination puzzle: A simple theoretical explanation. International Journal of Forecasting 32 754-762. MathSciNet: MR3042813
[12] DOBRIBAN, E. and SHENG, Y. (2020). WONDER: weighted one-shot distributed ridge regression in high dimensions. The Journal of Machine Learning Research 21 2483-2534. MathSciNet: MR4095345 · Zbl 1498.68232
[13] FRIEDMAN, J., HASTIE, T. and TIBSHIRANI, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 432-441. · Zbl 1143.62076
[14] GOLOSNOY, V., GRIBISCH, B. and SEIFERT, M. I. (2022). Sample and realized minimum variance portfolios: Estimation, statistical inference, and tests. Wiley Interdisciplinary Reviews: Computational Statistics 14 1-18. MathSciNet: MR4483683 · Zbl 07910980
[15] GUT, A. (2005). Probability: a graduate course 5. Springer. MathSciNet: MR2125120 · Zbl 1076.60001
[16] HUO, X. and CAO, S. (2019). Aggregated inference. Wiley Interdisciplinary Reviews: Computational Statistics 11 e1451. MathSciNet: MR3897175 · Zbl 07909147
[17] JANKOVA, J. and VAN DE GEER, S. (2015). Confidence intervals for high-dimensional inverse covariance estimation. Electronic Journal of Statistics 9 1205-1229. MathSciNet: MR3354336 · Zbl 1328.62458
[18] JAVANMARD, A. and MONTANARI, A. (2018). Debiasing the Lasso: Optimal sample size for Gaussian designs. The Annals of Statistics 46 2593-2622. MathSciNet: MR3851749 · Zbl 1407.62270
[19] JORDAN, M. I., LEE, J. D. and YANG, Y. (2018). Communication-efficient distributed statistical inference. Journal of the American Statistical Association 114 668-681. MathSciNet: MR3963171 · Zbl 1420.62097
[20] KEMPF, A. and MEMMEL, C. (2006). Estimating the global minimum variance portfolio. Schmalenbach Business Review 58 332-348.
[21] LAM, C. and FAN, J. (2009). Sparsistency and rates of convergence in large covariance matrix estimation. The Annals of Statistics 37 4254-4278. MathSciNet: MR2572459 · Zbl 1191.62101
[22] LEE, J. D., LIU, Q., SUN, Y. and TAYLOR, J. E. (2017). Communication-efficient sparse regression. The Journal of Machine Learning Research 18 115-144. MathSciNet: MR3625709 · Zbl 1434.62157
[23] LIU, J., LICHTENBERG, T., HOADLEY, K. A., POISSON, L. M., LAZAR, A. J., CHERNIACK, A. D., KOVATICH, A. J., BENZ, C. C., LEVINE, D. A., LEE, A. V. et al. (2018). An integrated TCGA pan-cancer clinical data resource to drive high-quality survival outcome analytics. Cell 173 400-416.
[24] LOH, P.-L. and TAN, X. L. (2018). High-dimensional robust precision matrix estimation: Cellwise corruption under \(\epsilon\)-contamination. Electronic Journal of Statistics 12 1429-1467. MathSciNet: MR3804842 · Zbl 1412.62057 · doi:10.1214/18-EJS1427
[25] MCMAHAN, B., MOORE, E., RAMAGE, D., HAMPSON, S. and Y ARCAS, B. A. (2017). Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics 1273-1282. PMLR.
[26] MEINSHAUSEN, N. and BÜHLMANN, P. (2006). High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics 34 1436-1462. MathSciNet: MR2278363 · Zbl 1113.62082
[27] CANCER GENOME ATLAS RESEARCH NETWORK (2017). Integrated genomic and molecular characterization of cervical cancer. Nature 543 378-384.
[28] NEZAKATI, E. and PIRCALABELU, E. (2023). Unbalanced distributed estimation and inference for the precision matrix in Gaussian graphical models. Statistics and Computing 33 1-14. MathSciNet: MR4554147 · Zbl 1516.62021
[29] OBOZINSKI, G., WAINWRIGHT, M. J. and JORDAN, M. I. (2011). Support union recovery in high-dimensional multivariate regression. The Annals of Statistics 39 1-47. MathSciNet: MR2797839 · Zbl 1373.62372
[30] PENG, J., ZHU, J., BERGAMASCHI, A., HAN, W., NOH, D.-Y., POLLACK, J. R. and WANG, P. (2010). Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. The Annals of Applied Statistics 4 53-77. MathSciNet: MR2758084 · Zbl 1189.62174
[31] RASKUTTI, G., WAINWRIGHT, M. J. and YU, B. (2010). Restricted eigenvalue properties for correlated Gaussian designs. The Journal of Machine Learning Research 11 2241-2259. MathSciNet: MR2719855 · Zbl 1242.62071
[32] RAVIKUMAR, P., WAINWRIGHT, M. J., RASKUTTI, G. and YU, B. (2011). High-dimensional covariance estimation by minimizing \(\ell_1\)-penalized log-determinant divergence. Electronic Journal of Statistics 5 935-980. MathSciNet: MR2836766 · Zbl 1274.62190
[33] ROTHMAN, A. J., LEVINA, E. and ZHU, J. (2010). Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics 19 947-962. MathSciNet: MR2791263
[34] VAN DE GEER, S., BÜHLMANN, P., RITOV, Y. and DEZEURE, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics 42 1166-1202. MathSciNet: MR3224285 · Zbl 1305.62259
[35] VAN DER VAART, A. W. (2000). Asymptotic Statistics. Cambridge University Press. MathSciNet: MR1652247 · Zbl 0943.62002
[36] WANG, J. (2015). Joint estimation of sparse multivariate regression and conditional graphical models. Statistica Sinica 25 831-851. MathSciNet: MR3409726 · Zbl 1415.62051
[37] YIN, J. and LI, H. (2011). A sparse conditional Gaussian graphical model for analysis of genetical genomics data. The Annals of Applied Statistics 5 2630-2650. MathSciNet: MR2907129 · Zbl 1234.62151
[38] YIN, J. and LI, H. (2013). Adjusting for high-dimensional covariates in sparse precision matrix estimation by \(\ell_1\)-penalization. Journal of Multivariate Analysis 116 365-381. MathSciNet: MR3049910 · Zbl 1277.62146
[39] YUAN, M. and LIN, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94 19-35. MathSciNet: MR2367824 · Zbl 1142.62408 · doi:10.1093/biomet/asm018
[40] ZHANG, C.-H. and ZHANG, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 217-242. MathSciNet: MR3153940 · Zbl 1411.62196
[41] ZHAO, P. and YU, B. (2006). On model selection consistency of Lasso. The Journal of Machine Learning Research 7 2541-2563. MathSciNet: MR2274449 · Zbl 1222.62008