×

Low-rank matrix denoising for count data using unbiased Kullback-Leibler risk estimation. (English) Zbl 07476353

Summary: Many statistical studies are concerned with the analysis of observations organized in a matrix form whose elements are count data. When these observations are assumed to follow a Poisson or a multinomial distribution, it is of interest to focus on the estimation of either the intensity matrix (Poisson case) or the compositional matrix (multinomial case) when it is assumed to have a low rank structure. In this setting, it is proposed to construct an estimator minimizing the regularized negative log-likelihood by a nuclear norm penalty. Such an approach easily yields a low-rank matrix-valued estimator with positive entries which belongs to the set of row-stochastic matrices in the multinomial case. Then, as a main contribution, a data-driven procedure is constructed to select the regularization parameter in the construction of such estimators by minimizing (approximately) unbiased estimates of the Kullback-Leibler (KL) risk in such models, which generalize Stein’s unbiased risk estimation originally proposed for Gaussian data. The evaluation of these quantities is a delicate problem, and novel methods are introduced to obtain accurate numerical approximation of such unbiased estimates. Simulated data are used to validate this way of selecting regularizing parameters for low-rank matrix estimation from count data. For data following a multinomial distribution, the performances of this approach are also compared to \(K\)-fold cross-validation. Examples from a survey study and metagenomics also illustrate the benefits of this methodology for real data analysis.

MSC:

62H12 Estimation in multivariate analysis
62J07 Ridge regression; shrinkage estimators (Lasso)
62P10 Applications of statistics to biology and medical sciences; meta analysis
94A12 Signal theory (characterization, reconstruction, filtering, etc.)

References:

[1] Aitchison, J., The Statistical Analysis of Compositional Data (2003), Blackburn Press: Blackburn Press Caldwell, NJ, USA · Zbl 0688.62004
[2] Bauschke, H. H.; Combettes, P. L., Convex Analysis and Monotone Operator Theory in Hilbert Spaces (2017), Springer: Springer New York · Zbl 1359.26003
[3] Bazerque, J.; Mateos, G.; Giannakis, G., Inference of Poisson count processes using low-rank tensor data, (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings (2013)), 5989-5993
[4] Beck, A.; Teboulle, M., A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2, 1, 183-202 (2009) · Zbl 1175.94009
[5] Bigot, J.; Deledalle, C.; Féral, D., Generalized sure for optimal shrinkage of singular values in low-rank matrix denoising, J. Mach. Learn. Res., 18, 137, 1-50 (2017) · Zbl 1440.94012
[6] Candès, E. J.; Sing-Long, C. A.; Trzasko, J. D., Unbiased risk estimates for singular value thresholding and spectral estimators, IEEE Trans. Signal Process., 61, 19, 4643-4657 (2013) · Zbl 1393.94187
[7] Cao, Y.; Xie, Y., Poisson matrix recovery and completion, IEEE Trans. Signal Process., 64, 6, 1609-1620 (2016) · Zbl 1412.94024
[8] Cao, Y.; Zhang, A.; Li, H., Multisample estimation of bacterial composition matrices in metagenomics data, Biometrika (2020) · Zbl 1435.62188
[9] Chaffron, S.; Rehrauer, H.; Pernthaler, J.; von Mering, C., A global network of coexisting microbes from environmental and whole-genome sequence data, Genome Res., 20, 947-959 (2010)
[10] Combettes, P. L.; Wajs, V. R., Signal recovery by proximal forward-backward splitting, Multiscale Model. Simul., 4, 4, 1168-1200 (2005) · Zbl 1179.94031
[11] Daubechies, I.; Defrise, M.; De Mol, C., An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, Commun. Pure Appl. Math., 57, 11, 1413-1457 (2004) · Zbl 1077.65055
[12] Deledalle, C.-A., Estimation of Kullback-Leibler losses for noisy recovery problems within the exponential family, Electron. J. Stat., 11, 2, 3141-3164 (2017) · Zbl 1373.62126
[13] Deledalle, C.-A.; Vaiter, S.; Fadili, J.; Peyré, G., Stein unbiased gradient estimator of the risk (sugar) for multiple parameter selection, SIAM J. Imaging Sci., 7, 4, 2448-2487 (2014) · Zbl 1361.94012
[14] Donoho, D.; Gavish, M., Minimax risk of matrix denoising by singular value thresholding, Ann. Stat., 42, 6, 2413-2440 (2014) · Zbl 1310.62014
[15] Faust, K.; Sathirapongsasuti, J. F.; Izard, J.; Segata, N.; Gevers, D.; Raes, J.; Huttenhower, C., Microbial co-occurrence relationships in the human microbiome, PLoS Comput. Biol., 8, 7, 1-17 (2012)
[16] Gavish, M.; Donoho, D. L., The optimal hard threshold for singular values is \(4 / \sqrt{ 3} \), IEEE Trans. Inf. Theory, 60, 8, 5040-5053 (2014) · Zbl 1360.94071
[17] Hudson, H. M., A natural identity for exponential families with applications in multiparameter estimation, Ann. Stat., 6, 3, 473-484 (1978) · Zbl 0391.62006
[18] Klopp, O., Noisy low-rank matrix completion with general sampling distribution, Bernoulli, 20, 1, 282-303 (2014) · Zbl 1400.62115
[19] Klopp, O.; Lafond, J.; Moulines, E.; Salmon, J., Adaptive multinomial matrix completion, Electron. J. Stat., 9, 2, 2950-2975 (2015) · Zbl 1329.62304
[20] Lewis, A. S., Derivatives of spectral functions, Math. Oper. Res., 21, 3, 576-588 (1996) · Zbl 0860.49017
[21] Liu, L. T.; Dobriban, E.; Singer, A., epca: high dimensional exponential family pca, Ann. Appl. Stat., 12, 4, 2121-2150 (2018) · Zbl 1411.62376
[22] Miller, G. A., Wordnet: a lexical database for English, Commun. ACM, 38, 11, 39-41 (1995)
[23] Moreau, J.-J., Proximité et dualité dans un espace hilbertien, Bull. Soc. Math. Fr., 93, 273-299 (1965) · Zbl 0136.12101
[24] Nadakuditi, R. R., OptShrink: an algorithm for improved low-rank signal matrix denoising by optimal, data-driven singular value shrinkage, IEEE Trans. Inf. Theory, 60, 5, 3002-3018 (2014) · Zbl 1360.62399
[25] Porter, M.F., October 2001. Snowball: a language for stemming algorithms. Published online, accessed 11.03.2008, 15.00h.
[26] Robin, G.; Josse, J.; Moulines, E.; Sardy, S., Low-rank model with covariates for count data with missing values, J. Multivar. Anal., 173, 416-434 (2019) · Zbl 1422.62192
[27] Salmon, J.; Harmany, Z. T.; Deledalle, C.; Willett, R., Poisson noise reduction with non-local PCA, J. Math. Imaging Vis., 48, 2, 279-294 (2014) · Zbl 1365.94050
[28] Shabalin, A. A.; Nobel, A. B., Reconstruction of a low-rank matrix in the presence of Gaussian noise, J. Multivar. Anal., 118, 67-76 (2013) · Zbl 1280.15022
[29] Stein, C. M., Estimation of the mean of a multivariate normal distribution, Ann. Stat., 9, 6, 1135-1151 (1981) · Zbl 0476.62035
[30] Udell, M.; Horn, C.; Zadeh, R.; Boyd, S., Generalized low rank models, Found. Trends Mach. Learn., 9, 1, 1-118 (2016) · Zbl 1350.68221
[31] Wang, H.; Lu, Y.; Zhai, C., Latent aspect rating analysis on review text data: a rating regression approach, (Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2010), ACM), 783-792
[32] Wang, H.; Lu, Y.; Zhai, C., Latent aspect rating analysis without aspect keyword supervision, (Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2011), ACM), 618-626
[33] Wu, G. D.; Chen, J.; Hoffmann, C.; Bittinger, K.; Chen, Y.-Y.; Keilbaugh, S. A.; Bewtra, M.; Knights, D.; Walters, W. A.; Knight, R.; Sinha, R.; Gilroy, R.; Gupta, K.; Baldassano, R.; Nessel, L.; Li, H.; Bushman, F. D.; Lewis, J. D., Linking long-term dietary patterns with gut microbial enterotypes, Science, 334, 6052, 105-108 (2011)
[34] Zhang, A.; Cai, T.; Wu, Y., Heteroskedastic pca: algorithm, optimality, and applications, Preprint · Zbl 1486.62183
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.