Multi-study factor analysis. (English) Zbl 1436.62538

Summary: We introduce a novel class of factor analysis methodologies for the joint analysis of multiple studies. The goal is to separately identify and estimate (1) common factors shared across multiple studies, and (2) study-specific factors. We develop an Expectation Conditional-Maximization algorithm for parameter estimation and provide a procedure for choosing the numbers of common and specific factors. We present simulations evaluating the performance of the method and illustrate it by applying it to gene expression data in ovarian cancer. In both, we clarify the benefits of a joint analysis compared to standard factor analysis. We have provided a tool to accelerate the pace at which we can combine unsupervised analyses across multiple studies and understand the cross-study reproducibility of signal in multivariate data. An R package (MSFA) implementing the method is available on GitHub.
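
The split into common and study-specific factors can be pictured through the covariance each study is assumed to have. The summary does not write the model out, so the sketch below assumes the usual formulation in which study s has covariance Phi Phi^T + Lambda_s Lambda_s^T + Psi_s, with Phi the common loadings, Lambda_s the study-specific loadings and Psi_s a diagonal matrix of unique variances. All numbers and names (`phi`, `studies`, `implied_covariance`) are hypothetical, and nothing here reproduces the MSFA package interface or the ECM estimation step.

```python
# Sketch of the covariance structure assumed for a multi-study factor model.
# All loadings and variances are made-up illustrative values.

def matmul_t(a):
    """Return a @ a^T for a matrix stored as a list of rows."""
    return [[sum(x * y for x, y in zip(row_i, row_j)) for row_j in a] for row_i in a]

def add(*mats):
    """Elementwise sum of equally sized matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*mats)]

def diag(values):
    """Diagonal matrix with the given values on the diagonal."""
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

# Common loadings Phi: 3 variables, 1 factor shared by every study.
phi = [[0.9], [0.8], [0.7]]

# Study-specific loadings Lambda_s and unique variances Psi_s (diagonal entries).
studies = {
    "study_1": {"lam": [[0.5], [0.0], [-0.5]], "psi": [0.2, 0.3, 0.4]},
    "study_2": {"lam": [[0.0], [0.6], [0.6]], "psi": [0.3, 0.3, 0.3]},
}

def implied_covariance(phi, lam, psi):
    """Sigma_s = Phi Phi^T + Lambda_s Lambda_s^T + Psi_s."""
    return add(matmul_t(phi), matmul_t(lam), diag(psi))

# The shared part is identical across studies; only the remainder differs.
shared = matmul_t(phi)
sigmas = {
    name: implied_covariance(phi, s["lam"], s["psi"]) for name, s in studies.items()
}

if __name__ == "__main__":
    for name, sigma in sigmas.items():
        print(name)
        for row in sigma:
            print("  ", [round(v, 2) for v in row])
```

Subtracting `shared` from each study's covariance leaves only the study-specific and unique parts, which is the separation the joint analysis aims to recover from data.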

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
62H20 Measures of association (correlation, canonical correlation, etc.)
62H25 Factor analysis and principal components; correspondence analysis
62J15 Paired and multiple comparisons; multiple testing

Software:

R; MSFA

References:

[1] Abdi, H., Williams, L. J., and Valentin, D. (2013). Multiple factor analysis: principal component analysis for multitable and multiblock data sets. Wiley Interdiscip Rev Comput Stat 5, 149-179. · Zbl 1540.62004
[2] Andreasen, N. C. et al. (2005). Remission in schizophrenia: proposed criteria and rationale for consensus. Am J Psychiatry 162, 441-449.
[3] Bernau, C. et al. (2014). Cross‐study validation for the assessment of prediction algorithms. Bioinformatics 30, 105-112.
[4] Bhattacharya, A. and Dunson, D. B. (2011). Sparse Bayesian infinite factor models. Biometrika 98, 291-306. · Zbl 1215.62025
[5] Burnham, K. P. and Anderson, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information‐theoretic Approach. Second edition. New York: Springer. · Zbl 1005.62007
[6] Byrne, B. M., Shavelson, R. J., and Muthén, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychol Bull 105, 456-466.
[7] Carrera, P. M., Gao, X., and Tucker, K. L. (2007). A study of dietary patterns in the Mexican‐American population and their association with obesity. J Am Diet Assoc 107, 1735-1742.
[8] Carvalho, C. M. et al. (2008). High‐dimensional sparse factor modeling: applications in gene expression genomics. J Am Stat Assoc 103, 1438-1456. · Zbl 1286.62091
[9] Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behav Res 1, 245-276.
[10] Chen, J. and Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95, 759-771. · Zbl 1437.62415
[11] Cope, L., Naiman, D. Q., and Parmigiani, G. (2014). Integrative correlation: Properties and relation to canonical correlations. J Multivariate Anal 123, 270-280. · Zbl 1278.62087
[12] De Vito, R. et al. (2019). Shared and study‐specific dietary patterns. Epidemiology 30, 93-102.
[13] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc, Ser B 39, 1-38. · Zbl 0364.62022
[14] Dolédec, S. and Chessel, D. (1994). Co‐inertia analysis: an alternative method for studying species-environment relationships. Freshwater Biol 31, 277-294.
[15] Dray, S., Chessel, D., and Thioulouse, J. (2003). Co‐inertia analysis and the linking of ecological data tables. Ecology 84, 3078-3089.
[16] Edefonti, V. et al. (2012). Nutrient‐based dietary patterns and the risk of head and neck cancer: a pooled analysis in the international head and neck cancer epidemiology consortium. Ann Oncol 23, 1869-1880.
[17] Flury, B. N. (1984). Common principal components in k groups. J Am Stat Assoc 79, 892-898.
[18] Frühwirth‐Schnatter, S. and Lopes, H. F. (2010). Parsimonious Bayesian factor analysis when the number of factors is unknown. Unpublished Working Paper, Booth Business.
[19] Garrett‐Mayer, E. et al. (2008). Cross‐study validation and combined analysis of gene expression microarray data. Biostatistics 9, 333-354. · Zbl 1143.62077
[20] Geweke, J. and Zhou, G. (1996). Measuring the pricing error of the arbitrage pricing theory. Rev Financial Stud 9, 557-587.
[21] Hirose, K. and Yamamoto, M. (2014). Estimation of an oblique structure via penalized likelihood factor analysis. Comput Stat Data Anal 79, 120-132. · Zbl 1506.62080
[22] Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika 30, 179-185. · Zbl 1367.62186
[23] Irizarry, R. A. et al. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249-264. · Zbl 1141.62348
[24] Jöreskog, K. G. (1967). Some contributions to maximum likelihood factor analysis. Psychometrika 32, 443-482. · Zbl 0183.24603
[25] Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika 36, 409-426. · Zbl 0227.62061
[26] Kaiser, H. F. (1958). The varimax criterion for analytic rotation in factor analysis. Psychometrika 23, 187-200. · Zbl 0095.33603
[27] Kerr, K. F. (2007). Extended analysis of benchmark datasets for Agilent two‐color microarrays. BMC Bioinformatics 8, 371-377.
[28] Lopes, H. F. and West, M. (2004). Bayesian model assessment in factor analysis. Stat Sinica 14, 41-68. · Zbl 1035.62060
[29] Meng, C. et al. (2014). A multivariate approach to the integration of multi‐omics datasets. BMC Bioinformatics 15, 162-175.
[30] Meng, X.‐L. and Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika 80, 267-278. · Zbl 0778.62022
[31] Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika 58, 525-543. · Zbl 0826.62046
[32] Parmigiani, G. et al. (2004). A cross‐study comparison of gene expression studies for the molecular classification of lung cancer. Clin Cancer Res 10, 2922-2927.
[33] Preacher, K. J. and Merkle, E. C. (2012). The problem of model selection uncertainty in structural equation modeling. Psychol Methods 17, 1-14.
[34] Riester, M. et al. (2014). Risk prediction for late‐stage ovarian cancer by meta‐analysis of 1525 patient samples. J Natl Cancer Inst 106, 1-12.
[35] Robert, P. and Escoufier, Y. (1976). A unifying tool for linear multivariate statistical methods: the RV‐coefficient. Appl Stat 25, 257-265.
[36] Ryman, T. K. et al. (2015). Characterising the reproducibility and reliability of dietary patterns among Yup’ik Alaska native people. Br J Nutr 113, 634-643.
[37] Scaramella, L. V., Conger, R. D., Spoth, R., and Simons, R. L. (2002). Evaluation of a social contextual model of delinquency: A cross‐study replication. Child Dev 73, 175-195.
[38] Scharpf, R. et al. (2009). A Bayesian model for cross‐study differential gene expression. J Am Stat Assoc 104, 1295-1310. · Zbl 1205.62182
[39] Shi, L. et al. (2006). The microarray quality control (MAQC) project shows inter‐ and intraplatform reproducibility of gene expression measurements. Nat Biotechnol 24, 1151-1161.
[40] Steiger, J. H. and Lind, J. M. (1980). Statistically based tests for the number of common factors. Paper presented at Psychometric Society Meeting, Iowa City, May.
[41] Subramanian, A. et al. (2005). Gene set enrichment analysis: a knowledge‐based approach for interpreting genome‐wide expression profiles. Proc Natl Acad Sci USA 102, 15545-15550.
[42] Thurstone, L. L. (1931). Multiple factor analysis. Psychol Rev 38, 406-427.
[43] Tyekucheva, S. et al. (2011). Integrating diverse genomic data using gene sets. Genome Biol 12, 105-129.
[44] Waldron, L. et al. (2014). Comparative meta‐analysis of prognostic gene signatures for late‐stage ovarian cancer. J Natl Cancer Inst 106, 49-61.
[45] Wang, X. V. et al. (2011). Unifying gene expression measures from multiple platforms using factor analysis. PLoS One 6, 1932-1943.