Power transformations of relative count data as a shrinkage problem. (English) Zbl 07711392

Summary: Here we show an application of our recently proposed information-geometric approach to compositional data analysis (CoDA). This application regards relative count data, which are, e.g., obtained from sequencing experiments. First we review in some detail a variety of necessary concepts ranging from basic count distributions and their information-geometric description over the link between Bayesian statistics and shrinkage to the use of power transformations in CoDA. We then show that powering, i.e., the equivalent to scalar multiplication on the simplex, can be understood as a shrinkage problem on the tangent space of the simplex. In information-geometric terms, traditional shrinkage corresponds to an optimization along a mixture (or \(m\)-) geodesic, while powering (or, as we call it, exponential shrinkage) can be optimized along an exponential (or \(e\)-) geodesic. While the \(m\)-geodesic corresponds to the posterior mean of the multinomial counts using a conjugate prior, the \(e\)-geodesic corresponds to an alternative parametrization of the posterior where prior and data contributions are weighted by geometric rather than arithmetic means. To optimize the exponential shrinkage parameter, we use mean-squared error as a cost function on the tangent space. This is just the expected squared Aitchison distance from the true parameter. We derive an analytic solution for its minimum based on the delta method and test it via simulations. We also discuss exponential shrinkage as an alternative to zero imputation for dimension reduction and data normalization.
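
The contrast between the two kinds of shrinkage described in the summary can be illustrated with a minimal numerical sketch. It is not the authors' implementation and does not reproduce their delta-method optimum for the shrinkage parameter. The function names (`closure`, `power`, `m_shrink`, `e_shrink`, `clr`, `aitchison_distance`), the uniform shrinkage target, and the example counts are assumptions made for illustration only, and all parts are assumed strictly positive so that logarithms exist.

```python
import numpy as np

def closure(x):
    """Rescale a positive vector so that its parts sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def power(p, a):
    """Powering on the simplex: raise each part to the power a, then re-close."""
    return closure(np.asarray(p, dtype=float) ** a)

def m_shrink(p, target, lam):
    """Shrink p toward target with an arithmetic (mixture) weighted mean."""
    p, target = np.asarray(p, dtype=float), np.asarray(target, dtype=float)
    return (1.0 - lam) * p + lam * target

def e_shrink(p, target, lam):
    """Shrink p toward target with a re-closed geometric (exponential) weighted mean."""
    p, target = np.asarray(p, dtype=float), np.asarray(target, dtype=float)
    return closure(p ** (1.0 - lam) * target ** lam)

def clr(p):
    """Centred log-ratio transform (requires strictly positive parts)."""
    logp = np.log(np.asarray(p, dtype=float))
    return logp - logp.mean()

def aitchison_distance(p, q):
    """Euclidean distance between the clr-transformed compositions."""
    return float(np.linalg.norm(clr(p) - clr(q)))

# Hypothetical example data: three categories, ten counts in total.
counts = np.array([5.0, 3.0, 2.0])
alpha = np.ones(3)                        # symmetric Dirichlet pseudocounts (assumed prior)
p_hat = closure(counts)                   # observed relative frequencies
uniform = closure(alpha)                  # shrinkage target implied by the prior
lam = alpha.sum() / (counts.sum() + alpha.sum())

# Arithmetic shrinkage with this weight equals the Dirichlet-multinomial posterior mean.
posterior_mean = (counts + alpha) / (counts.sum() + alpha.sum())
m_estimate = m_shrink(p_hat, uniform, lam)

# Geometric shrinkage toward the uniform composition equals powering by (1 - lam).
e_estimate = e_shrink(p_hat, uniform, lam)
powered = power(p_hat, 1.0 - lam)
```

With a uniform target, powering by `a` scales the clr coordinates by `a`, so `aitchison_distance(power(p_hat, a), uniform)` equals `a * aitchison_distance(p_hat, uniform)`. This is the sense in which powering acts as scalar multiplication and shrinks a composition toward the centre of the simplex.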

MSC:

62Hxx Multivariate analysis
94Axx Communication, information
62Jxx Linear inference, regression

Software:

MDiNE

References:

[1] Greenacre, M., Compositional data analysis, Annu. Rev. Stat. Appl., 8, 1, 271-299 (2021) · doi:10.1146/annurev-statistics-042720-124436
[2] Aitchison, J., The Statistical Analysis of Compositional Data (1986), London: Chapman and Hall · Zbl 0688.62004 · doi:10.1007/978-94-009-4109-0
[3] Egozcue, JJ; Pawlowsky-Glahn, V., Compositional data: the sample space and its structure, TEST, 28, 3, 599-638 (2019) · Zbl 1428.62220 · doi:10.1007/s11749-019-00670-6
[4] Erb, I.; Gloor, GB; Quinn, TP, Editorial: Compositional data analysis and related methods applied to genomics-a first special issue from NAR Genomics and Bioinformatics, NAR Genom Bioinform, 2, 4, lqaa103 (2020) · doi:10.1093/nargab/lqaa103
[5] Amari, S., Information Geometry and Its Applications. Applied Mathematical Sciences (2016), Berlin: Springer · Zbl 1350.94001 · doi:10.1007/978-4-431-55978-8
[6] Erb, I.; Ay, N.; Filzmoser, P.; Hron, K.; Martín-Fernández, JA; Palarea-Albaladejo, J., The information-geometric perspective of compositional data analysis, Advances in Compositional Data Analysis, 21-43 (2021), New York: Springer · doi:10.1007/978-3-030-71175-7_2
[7] Greenacre, M., Log-ratio analysis is a limiting case of correspondence analysis, Math. Geosci., 42, 129 (2010) · doi:10.1007/s11004-008-9212-2
[8] Ledoit, O.; Wolf, M., Improved estimation of the covariance matrix of stock returns with an application to portfolio selection, J. Empir. Financ., 10, 603-621 (2003) · doi:10.1016/S0927-5398(03)00007-0
[9] Hausser, J.; Strimmer, K., Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks, J. Mach. Learn. Res., 10, 1469-1484 (2009) · Zbl 1235.62006
[10] Quinn, TP; Erb, I.; Richardson, MF; Crowley, TM, Understanding sequencing data as compositions: an outlook and review, Bioinformatics, 34, 16, 2870-2878 (2018) · doi:10.1093/bioinformatics/bty175
[11] Jeganathan, P.; Holmes, SP, A statistical perspective on the challenges in molecular microbial biology, J. Agric. Biol. Environ. Stat., 26, 131-160 (2021) · Zbl 07603058 · doi:10.1007/s13253-021-00447-1
[12] Breda, J.; Zavolan, M.; van Nimwegen, E., Bayesian inference of gene expression states from single-cell RNA-seq data, Nat. Biotechnol., 39, 1008-1016 (2021) · doi:10.1038/s41587-021-00875-x
[13] Robinson, MD; Oshlack, A., A scaling normalization method for differential expression analysis of RNA-seq data, Genome Biol., 11, R25 (2010) · doi:10.1186/gb-2010-11-3-r25
[14] Lovén, J.; Orlando, DA; Sigova, AA; Lin, CY; Rahl, PB; Burge, CB; Levens, DL; Lee, TI; Young, RA, Revisiting global gene expression analysis, Cell, 151, 476-482 (2012) · doi:10.1016/j.cell.2012.10.012
[15] Townes, FW; Hicks, SC; Aryee, MJ; Irizarry, RA, Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model, Genome Biol., 20, 295 (2019) · doi:10.1186/s13059-019-1861-6
[16] de Finetti, B., Theory of Probability, A Critical Introductory Treatment (2017), Oxford: Wiley · Zbl 1375.60008 · doi:10.1002/9781119286387
[17] Billheimer, D.; Guttorp, P.; Fagan, WF, Statistical interpretation of species composition, J. Am. Stat. Assoc., 96, 1205-1214 (2001) · Zbl 1073.62573 · doi:10.1198/016214501753381850
[18] Xia, F.; Chen, J.; Fung, WK; Li, H., A logistic normal multinomial regression model for microbiome compositional data analysis, Biometrics, 69, 1053-1063 (2013) · Zbl 1288.62171 · doi:10.1111/biom.12079
[19] McGregor, K.; Labbe, A.; Greenwood, CMT, MDiNE: a model to estimate differential co-occurrence networks in microbiome studies, Bioinformatics, 36, 6, 1840-1847 (2020) · doi:10.1093/bioinformatics/btz824
[20] Avalos, M.; Nock, R.; Ong, CS; Rouar, J.; Sun, K., Representation learning of compositional data, Adv. Neural Inf. Process. Syst., 31 (2018)
[21] Gzyl, H.; Nielsen, F., Geometry of the probability simplex and its connection to the maximum entropy method, J. Appl. Math. Stat. Inform., 16, 1, 25-35 (2020) · Zbl 1538.60012 · doi:10.2478/jamsi-2020-0003
[22] Ay, N.; Jost, J.; Le, HV; Schwachhöfer, L., Information Geometry. A Series of Modern Surveys in Mathematics (2017), Berlin: Springer
[23] Cover, TM; Thomas, JA, Elements of Information Theory (2006), Oxford: Wiley · Zbl 1140.94001
[24] Diaconis, P.; Ylvisaker, D., Conjugate priors for exponential families, Ann. Stat., 7, 2, 269-281 (1979) · Zbl 0405.62011 · doi:10.1214/aos/1176344611
[25] Agresti, A.; Hitchcock, DB, Bayesian inference for categorical data analysis, Stat. Methods Appl., 14, 297-330 (2005) · Zbl 1124.62307 · doi:10.1007/s10260-005-0121-y
[26] Agarwal, A.; Daumé III, H., A geometric view of conjugate priors, Mach. Learn., 81, 99-113 (2010) · Zbl 1470.68067 · doi:10.1007/s10994-010-5203-x
[27] Berger, JO, Statistical Decision Theory and Bayesian Analysis (1985), Berlin: Springer · Zbl 0572.62008 · doi:10.1007/978-1-4757-4286-2
[28] Johnson, BM, On the admissible estimators for certain fixed sample binomial problems, Ann. Math. Stat., 42, 5, 1579-1587 (1971) · Zbl 0246.62017 · doi:10.1214/aoms/1177693156
[29] Stein, C., Inadmissibility of the usual estimator for the mean of a multivariate distribution, Proc. Third Berkeley Symp. Math. Statist. Probab., vol. 1, 197-206 (1956), Univ. California Press · Zbl 0073.35602
[30] James, W.; Stein, C., Estimation with quadratic loss, Proc. Fourth Berkeley Symp. Math. Statist. Probab., vol. 1, 361-379 (1961), Univ. California Press · Zbl 1281.62026
[31] Efron, B.; Morris, C., Stein’s estimation rule and its competitors—an empirical Bayes approach, J. Am. Stat. Assoc., 68, 341, 117-130 (1973) · Zbl 0275.62005
[32] Schäfer, J.; Strimmer, K., A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics, Stat. Appl. Genet. Mol. Biol., 4, 1, 32 (2005) · doi:10.2202/1544-6115.1175
[33] Greenacre, M., Power transformations in correspondence analysis, Comput. Stat. Data Anal., 53, 8, 3107-3116 (2009) · Zbl 1453.62099 · doi:10.1016/j.csda.2008.09.001
[34] Greenacre, M., ‘Size’ and ‘shape’ in the measurement of multivariate proximity, Methods Ecol. Evol., 8, 11, 1415-1424 (2017) · doi:10.1111/2041-210X.12776
[35] Greenacre, M., Biplots in Practice (2010), Fundación BBVA
[36] Box, GEP; Cox, DR, An analysis of transformations, J. R. Stat. Soc. B, 26, 2, 211-252 (1964) · Zbl 0156.40104
[37] Greenacre, M.; Grunsky, E.; Bacon-Shone, J.; Erb, I.; Quinn, T., Aitchison's compositional data analysis 40 years on: a reappraisal, Stat. Sci., Advance Publication, 1-25 (2023) · doi:10.1214/22-STS880
[38] Booeshaghi, AS; Hallgrímsdóttir, IB; Gálvez-Merchán, A.; Pachter, L., Depth normalization for single-cell genomics count data, bioRxiv 2022.05.06.490859 (2022)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases the data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.