Heterogeneous large datasets integration using Bayesian factor regression. (English) Zbl 1531.62013

Summary: Two key challenges in modern statistical applications are the large amount of information recorded per individual, and that such data are often not collected all at once but in batches. These batch effects can be complex, causing distortions in both mean and variance. We propose a novel sparse latent factor regression model to integrate such heterogeneous data. The model provides a tool for data exploration via dimensionality reduction and sparse low-rank covariance estimation while correcting for a range of batch effects. We study the use of several sparse priors (local and non-local) to learn the dimension of the latent factors. We provide a flexible methodology for sparse factor regression which is not limited to data with batch effects. Our model is fitted in a deterministic fashion by means of an EM algorithm for which we derive closed-form updates, contributing a novel scalable algorithm for non-local priors of interest beyond the immediate scope of this paper. We present several examples, with a focus on bioinformatics applications. Our results show an increase in the accuracy of the dimensionality reduction, with non-local priors substantially improving the reconstruction of factor cardinality. The results of our analyses illustrate how failing to properly account for batch effects can result in unreliable inference. Our model provides a novel approach to latent factor regression that balances sparsity with sensitivity in scenarios both with and without batch effects and is highly computationally efficient.
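The summary's point that batch effects distort both mean and variance can be illustrated with a toy simulation. The sketch below is not the authors' model or EM algorithm: it assumes a generic factor model with two batches, and every name and constant in it (`simulate_batches`, `center_by_batch`, the shift of 3.0, the noise scales) is hypothetical. It only shows how pooling batches without adjustment inflates a covariance estimate, and how a plain per-batch centering (a simple stand-in, not the paper's method) removes the mean part of the distortion.

```python
import numpy as np

def simulate_batches(n_per_batch=200, p=6, q=2, seed=0):
    """Draw x_i = shift_b + M z_i + e_i for two batches b = 0, 1 (toy assumption)."""
    rng = np.random.default_rng(seed)
    loadings = rng.normal(size=(p, q))        # shared loading matrix M
    shifts = [np.zeros(p), np.full(p, 3.0)]   # batch-specific mean shifts
    noise_sd = [0.5, 1.5]                     # batch-specific noise scales
    data, labels = [], []
    for b in range(2):
        z = rng.normal(size=(n_per_batch, q))                     # latent factors
        e = rng.normal(scale=noise_sd[b], size=(n_per_batch, p))  # idiosyncratic noise
        data.append(shifts[b] + z @ loadings.T + e)
        labels.append(np.full(n_per_batch, b))
    return np.vstack(data), np.concatenate(labels), loadings

def center_by_batch(x, labels):
    """Subtract each batch's own column means (corrects the mean shift only)."""
    out = x.copy()
    for b in np.unique(labels):
        out[labels == b] -= x[labels == b].mean(axis=0)
    return out

x, labels, loadings = simulate_batches()
signal_cov = loadings @ loadings.T            # low-rank part of the covariance
off_diag = ~np.eye(x.shape[1], dtype=bool)    # noise is diagonal, so compare off-diagonals

naive_cov = np.cov(x, rowvar=False)
adjusted_cov = np.cov(center_by_batch(x, labels), rowvar=False)

naive_err = np.abs(naive_cov - signal_cov)[off_diag].mean()
adjusted_err = np.abs(adjusted_cov - signal_cov)[off_diag].mean()
print(f"mean off-diagonal error, pooled: {naive_err:.3f}, batch-centered: {adjusted_err:.3f}")
```

Per-batch centering leaves the batch-specific noise scales untouched, so the variance distortion persists on the diagonal; handling both effects jointly is what the summary attributes to the proposed model.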

MSC:

62F15 Bayesian inference
62H25 Factor analysis and principal components; correspondence analysis
62J07 Ridge regression; shrinkage estimators (Lasso)
62P10 Applications of statistics to biology and medical sciences; meta analysis

References:

[1] Alter, O., Brown, P. O., and Botstein, D. (2000). “Singular value decomposition for genome-wide expression data processing and modeling.” Proceedings of the National Academy of Sciences, 97(18): 10101-10106. URL http://www.pnas.org/content/97/18/10101.abstract
[2] Avalos-Pacheco, A., Rossell, D., and Savage, R. S. (2020). “Supplement to “Heterogeneous large datasets integration using Bayesian factor regression”.” Bayesian Analysis. · doi:10.1214/20-BA1240SUPP
[3] Avio, C. G., Gorbi, S., Milan, M., Benedetti, M., Fattorini, D., d’Errico, G., Pauletto, M., Bargelloni, L., and Regoli, F. (2015). “Pollutants bioavailability and toxicological risk from microplastics to marine mussels.” Environmental Pollution, 198: 211-222. URL http://www.sciencedirect.com/science/article/pii/S0269749114005211
[4] Bar, H., Booth, J., and Wells, M. T. (2018). “A scalable empirical Bayes approach to variable selection in generalized linear models.” arXiv:1803.09735, 1-20. · Zbl 07909152 · doi:10.1002/wics.1455
[5] Benito, M., Parker, J., Du, Q., Wu, J., Xiang, D., Perou, C. M., and Marron, J. S. (2004). “Adjustment of systematic microarray data biases.” Bioinformatics, 20(1): 105-114. URL http://bioinformatics.oxfordjournals.org/content/20/1/105.abstract
[6] Bentink, S., Haibe-Kains, B., Risch, T., Fan, J.-B., Hirsch, M. S., Holton, K., Rubio, R., April, C., Chen, J., Wickham-Garcia, E., Liu, J., Culhane, A., Drapkin, R., Quackenbush, J., and Matulonis, U. A. (2012). “Angiogenic mRNA and microRNA Gene Expression Signature Predicts a Novel Subtype of Serous Ovarian Cancer.” PLOS ONE, 7(2): 1-9. · doi:10.1371/journal.pone.0030269
[7] Bersanelli, M., Mosca, E., Remondini, D., Giampieri, E., Sala, C., Castellani, G., and Milanesi, L. (2016). “Methods for the integration of multi-omics data: mathematical aspects.” BMC Bioinformatics, 17(2): 167-177. · doi:10.1186/s12859-015-0857-9
[8] Burges, C. J. C. (2010). “Dimension Reduction: A Guided Tour.” Foundations and Trends in Machine Learning, 2(4): 276-365. · Zbl 1211.68126 · doi:10.1561/2200000002
[9] Calon, A., Espinet, E., Palomo-Ponce, S., Tauriello, D. V., Iglesias, M., Céspedes, M. V., Sevillano, M., Nadal, C., Jung, P., Zhang, X. H.-F., Byrom, D., Riera, A., Rossell, D., and Mangues, R. (2012). “Dependency of Colorectal Cancer on a TGF-Beta-Driven Program in Stromal Cells for Metastasis Initiation.” Cancer Cell, 22(5): 571-584.
[10] Carvalho, C., Polson, N., and Scott, J. (2009). “Handling sparsity via the horseshoe.” Journal of Machine Learning Research, 5: 73-80.
[11] Carvalho, C. M., Chang, J., Lucas, J. E., Nevins, J. R., Wang, Q., and West, M. (2008). “High-Dimensional Sparse Factor Modeling: Applications in Gene Expression Genomics.” Journal of the American Statistical Association, 103(484): 1438-1456. · Zbl 1286.62091 · doi:10.1198/016214508000000869
[12] Cox, D. R. (1972). “Regression models and life-tables.” Journal of the Royal Statistical Society, Series B: Methodological, 34: 187-220. · Zbl 0243.62041 · doi:10.1111/j.2517-6161.1972.tb00899.x
[13] Cunningham, J. P. and Ghahramani, Z. (2015). “Linear Dimensionality Reduction: Survey, Insights, and Generalizations.” Journal of Machine Learning Research, 16: 2859-2900. URL http://jmlr.org/papers/v16/cunningham15a.html. · Zbl 1351.62123
[14] De Vito, R., Bellio, R., Trippa, L., and Parmigiani, G. (2018a). “Bayesian Multi-study Factor Analysis for High-throughput Biological Data.” arXiv:1806.09896, 1-35.
[15] De Vito, R., Bellio, R., Trippa, L., and Parmigiani, G. (2018b). “Multi-study Factor Analysis.” Biometrics, 75: 337-346. · Zbl 1436.62538 · doi:10.1111/biom.12974
[16] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). “Maximum likelihood from incomplete data via the EM algorithm.” Journal of the Royal Statistical Society, Series B: Statistical Methodology, 39(1): 1-38. · Zbl 0364.62022 · doi:10.1111/j.2517-6161.1977.tb01600.x
[17] Dunson, D. and Bhattacharya, A. (2011). “Sparse Bayesian infinite factor models.” Biometrika, 98: 291-306. · Zbl 1215.62025 · doi:10.1093/biomet/asr013
[18] Ferriss, J. S., Kim, Y., Duska, L., Birrer, M., Levine, D. A., Moskaluk, C., Theodorescu, D., and Lee, J. K. (2012). “Multi-Gene Expression Predictors of Single Drug Responses to Adjuvant Chemotherapy in Ovarian Carcinoma: Predicting Platinum Resistance.” PLOS ONE, 7(2): 1-9. · doi:10.1371/journal.pone.0030550
[19] Fortin, J.-P., Sweeney, E. M., Muschelli, J., Crainiceanu, C. M., and Shinohara, R. T. (2016). “Removing inter-subject technical variability in magnetic resonance imaging studies.” NeuroImage, 132: 198-212.
[20] Fox, E. B. and Dunson, D. B. (2015). “Bayesian Nonparametric Covariance Regression.” Journal of Machine Learning Research, 16: 2501-2542. URL http://jmlr.org/papers/v16/fox15a.html. · Zbl 1351.62090
[21] Frühwirth-Schnatter, S. and Lopes, H. F. (2018). “Sparse Bayesian Factor Analysis when the Number of Factors is Unknown.” arXiv:1804.04231, 1-34.
[22] Fúquene, J., Steel, M., and Rossell, D. (2018). “On choosing mixture components via non-local priors.” arXiv:1604.00314, 1-72. · Zbl 1429.62243 · doi:10.1111/rssb.12333
[23] Ganzfried, B. F., Riester, M., Haibe-Kains, B., Risch, T., Tyekucheva, S., Jazic, I., Wang, X. V., Ahmadifar, M., Birrer, M., Parmigiani, G., Huttenhower, C., and Waldron, L. (2013). “curatedOvarianData: Clinically Annotated Data for the Ovarian Cancer Transcriptome.” Database, 2013. URL http://database.oxfordjournals.org/content/2013/bat013.abstract
[24] George, E. and McCulloch, R. (1993). “Variable selection via Gibbs sampling.” Journal of the American Statistical Association, 88(423): 881-889.
[25] George, E. and McCulloch, R. (1997). “Approaches for Bayesian variable selection.” Statistica Sinica, 339-374. · Zbl 0884.62031
[26] Ghahramani, Z. and Beal, M. J. (2000). “Variational Inference for Bayesian Mixtures of Factor Analysers.” In Solla, S. A., Leen, T. K., and Müller, K. (eds.), Advances in Neural Information Processing Systems 12, 449-455. MIT Press. URL http://papers.nips.cc/paper/1672-variational-inference-for-bayesian-mixtures-of-factor-analysers.pdf
[27] Goh, W. W. B., Wang, W., and Wong, L. (2017). “Why Batch Effects Matter in Omics Data, and How to Avoid Them.” Trends in Biotechnology, 35: 498-507.
[28] Griffiths, T. L. and Ghahramani, Z. (2011). “The Indian Buffet Process: An Introduction and Review.” Journal of Machine Learning Research, 12: 1185-1224. · Zbl 1280.62038
[29] Harrell Jr., F. E., Califf, R. M., Pryor, D. B., Lee, K. L., and Rosati, R. A. (1982). “Evaluating the yield of medical tests.” JAMA, 247(18): 2543-2546.
[30] Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics. New York, NY, USA: Springer New York Inc. · Zbl 0973.62007 · doi:10.1007/978-0-387-84858-7
[31] Hirose, K. and Yamamoto, M. (2015). “Sparse estimation via nonconcave penalized likelihood in factor analysis model.” Statistics and Computing, 25(5): 863-875. · Zbl 1332.62194 · doi:10.1007/s11222-014-9458-0
[32] Hirose, K., Yamamoto, M., and Nagata, H. (2016). fanc: Penalized Likelihood Factor Analysis via Nonconvex Penalty. R package version 2.2. URL https://CRAN.R-project.org/package=fanc
[33] Hoff, P. and Niu, X. (2012). “A Covariance Regression Model.” Statistica Sinica, 22: 729-753. URL http://www.stat.washington.edu/hoff/Code/hoff_niu_2009_ss. · Zbl 1238.62065 · doi:10.5705/ss.2010.051
[34] Hornung, R., Boulesteix, A.-L., and Causeur, D. (2016). “Combining location-and-scale batch effect adjustment with data cleaning by latent factor adjustment.” BMC Bioinformatics, 17(1): 1-19. · doi:10.1186/s12859-015-0870-z
[35] Johnson, R. A. and Wichern, D. W. (eds.) (1988). Applied Multivariate Statistical Analysis. Upper Saddle River, NJ, USA: Prentice-Hall, Inc. · Zbl 0663.62061
[36] Johnson, V. E. and Rossell, D. (2010). “On the use of non-local prior densities in Bayesian hypothesis tests.” Journal of the Royal Statistical Society Series B: Statistical Methodology, 72(2): 143-170. · Zbl 1411.62019 · doi:10.1111/j.1467-9868.2009.00730.x
[37] Johnson, V. E. and Rossell, D. (2012). “Bayesian model selection in high-dimensional settings.” Journal of the American Statistical Association, 107(498): 649-660. · Zbl 1261.62024 · doi:10.1080/01621459.2012.682536
[38] Johnson, W. E. and Li, C. (2009). Adjusting Batch Effects in Microarray Experiments with Small Sample Size Using Empirical Bayes Methods, 113-129. John Wiley & Sons, Ltd. · doi:10.1002/9780470685983.ch10
[39] Johnson, W. E., Li, C., and Rabinovic, A. (2007). “Adjusting batch effects in microarray expression data using empirical Bayes methods.” Biostatistics (Oxford, England), 8(1): 118-127. URL http://www.ncbi.nlm.nih.gov/pubmed/16632515 · Zbl 1170.62389 · doi:10.1093/biostatistics/kxj037
[40] Kaiser, H. F. (1958). “The varimax criterion for analytic rotation in factor analysis.” Psychometrika, 23(3): 187-200. · Zbl 0095.33603 · doi:10.1007/BF02289233
[41] Knowles, D. A. and Ghahramani, Z. (2011). “Nonparametric Bayesian sparse factor models with application to gene expression modeling.” The Annals of Applied Statistics, 5(2B): 1534-1552. · Zbl 1223.62013 · doi:10.1214/10-AOAS435
[42] Leek, J. T., Johnson, W. E., Parker, H. S., Fertig, E. J., Jaffe, A. E., Storey, J. D., Zhang, Y., and Torres, L. C. (2017). sva: Surrogate Variable Analysis. R package version 3.26.0.
[43] Leek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead, B., Johnson, W. E., Geman, D., Baggerly, K., and Irizarry, R. A. (2010). “Tackling the widespread and critical impact of batch effects in high-throughput data.” Nature Reviews Genetics, 11(10): 733-739. · doi:10.1038/nrg2825
[44] Leek, J. T. and Storey, J. D. (2007). “Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis.” PLoS Genetics, 3(9): 1-12.
[45] Lopes, H. F. and West, M. (2004). “Bayesian model assessment in factor analysis.” Statistica Sinica, 14: 41-67. · Zbl 1035.62060
[46] Lucas, J., Carvalho, C., Wang, Q., Bild, A., Nevins, J., and West, M. (2006). “Sparse statistical modelling in gene expression genomics.” In Bayesian Inference for Gene Expression and Proteomics, 155-176. Cambridge University Press.
[47] Mitchell, T. J. and Beauchamp, J. J. (1988). “Bayesian variable selection in linear regression.” Journal of the American Statistical Association, 83(404): 1023-1032. · Zbl 0673.62051 · doi:10.1080/01621459.1988.10478694
[48] Olivetti, E., Greiner, S., and Avesani, P. (2012). “ADHD diagnosis from multiple data sources with batch effects.” Frontiers in Systems Neuroscience, 6: 70.
[49] Parker, H. S., Corrada Bravo, H., and Leek, J. T. (2014). “Removing batch effects for prediction problems with frozen surrogate variable analysis.” PeerJ, 2: e561. · doi:10.7717/peerj.561
[50] Rhodes, D. R., Yu, J., Shanker, K., Deshpande, N., Varambally, R., Ghosh, D., Barrette, T., Pandey, A., and Chinnaiyan, A. M. (2004). “Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression.” Proceedings of the National Academy of Sciences of the United States of America, 101(25): 9309-9314. URL http://www.pnas.org/content/101/25/9309.abstract
[51] Rossell, D. and Telesca, D. (2017). “Nonlocal Priors for High-Dimensional Estimation.” Journal of the American Statistical Association, 112(517): 254-265. · doi:10.1080/01621459.2015.1130634
[52] Ročková, V. and George, E. I. (2014). “EMVS: The EM approach to Bayesian variable selection.” Journal of the American Statistical Association, 109(506): 828-846. · Zbl 1367.62049 · doi:10.1080/01621459.2013.869223
[53] Ročková, V. and George, E. I. (2017). “Fast Bayesian Factor Analysis via Automatic Rotations to Sparsity.” Journal of the American Statistical Association, 111(516): 1608-1622. · doi:10.1080/01621459.2015.1100620
[54] Ročková, V. and George, E. I. (2018). “The Spike-and-Slab LASSO.” Journal of the American Statistical Association, 113(521): 431-444. · Zbl 1398.62186 · doi:10.1080/01621459.2016.1260469
[55] Schadt, E. E., Li, C., Ellis, B., and Wong, W. H. (2001). “Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data.” Journal of Cellular Biochemistry, 84(S37): 120-125. · doi:10.1002/jcb.10073
[56] Scherer, A. (2009). Batch Effects and Noise in Microarray Experiments: Sources and Solutions. Wiley Series in Probability and Statistics. Wiley.
[57] Schröder, M. S., Culhane, A., Quackenbush, J., and Haibe-Kains, B. (2011). “survcomp: an R/Bioconductor package for performance assessment and comparison of survival models.” Bioinformatics, 27(22): 3206-3208.
[58] Schwarz, G. (1978). “Estimating the Dimension of a Model.” The Annals of Statistics, 6(2): 461-464. · Zbl 0379.62005 · doi:10.1214/aos/1176344136
[59] Seber, G. (1984). Multivariate observations. Wiley series in probability and mathematical statistics. New York, NY: Wiley. · Zbl 0627.62052 · doi:10.1002/9780470316641
[60] Shah, M., Xiao, Y., Subbanna, N., Francis, S., Arnold, D. L., Collins, D. L., and Arbel, T. (2011). “Evaluating intensity normalization on MRIs of human brain with multiple sclerosis.” Medical Image Analysis, 15(2): 267-282. URL http://www.sciencedirect.com/science/article/pii/S1361841510001337
[61] Shi, G., Lim, C. Y., and Maiti, T. (2019). “Model selection using mass-nonlocal prior.” Statistics & Probability Letters, 147(C): 36-44. URL https://ideas.repec.org/a/eee/stapro/v147y2019icp36-44.html. · Zbl 1450.62028 · doi:10.1016/j.spl.2018.11.027
[62] Shinohara, R. T., Sweeney, E. M., Goldsmith, J., Shiee, N., Mateen, F. J., Calabresi, P. A., Jarso, S., Pham, D. L., Reich, D. S., and Crainiceanu, C. M. (2014). “Statistical normalization techniques for magnetic resonance imaging.” NeuroImage: Clinical, 6: 9-19.
[63] Therneau, T. M. (2015). A Package for Survival Analysis in S. Version 2.38. URL https://CRAN.R-project.org/package=survival
[64] Wan, Y.-W., Allen, G. I., Anderson, M. L., and Liu, Z. (2015). TCGA2STAT: Simple TCGA Data Access for Integrated Statistical Analysis in R. R package version 1.2. URL https://CRAN.R-project.org/package=TCGA2STAT
[65] Wan, Y.-W., Allen, G. I., and Liu, Z. (2016). “TCGA2STAT: simple TCGA data access for integrated statistical analysis in R.” Bioinformatics, 32(6): 952-954. · doi:10.1093/bioinformatics/btv677
[66] Wang, J. and Zhao, Q. (2015). cate: High Dimensional Factor Analysis and Confounder Adjusted Testing and Estimation. R package version 1.0.4. URL https://CRAN.R-project.org/package=cate
[67] West, M. (2003). “Bayesian factor regression models in the “large p, small n” paradigm.” In Bayesian Statistics 7, 723-732. Oxford University Press. URL http://ftp.isds.duke.edu/WorkingPapers/02-12.html.
[68] Witten, D. M., Tibshirani, R., and Hastie, T. (2009). “A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis.” Biostatistics, 10(3): 515-534. · Zbl 1437.62658 · doi:10.1093/biostatistics/kxp008
[69] Yang, Y. H., Dudoit, S., Luu, P., Lin, D. M., Peng, V., Ngai, J., and Speed, T. P. (2002). “Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation.” Nucleic Acids Research, 30(4): e15. URL http://nar.oxfordjournals.org/content/30/4/e15
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.