A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. (English) Zbl 1437.62658

Summary: We present a penalized matrix decomposition (PMD), a new framework for computing a rank-\(K\) approximation for a matrix. We approximate the matrix \(\mathbf{X}\) as \(\widehat{\mathbf{X}}=\sum^K_{k=1}d_k\mathbf{u}_k\mathbf{v}^T_k\), where \(d_{k}\), \(\mathbf{u}_{k}\), and \(\mathbf{v}_{k}\) minimize the squared Frobenius norm of \(\mathbf{X}-\widehat{\mathbf{X}}\), subject to penalties on \(\mathbf{u}_{k}\) and \(\mathbf{v}_{k}\). This results in a regularized version of the singular value decomposition. Of particular interest is the use of \(L_1\)-penalties on \(\mathbf{u}_{k}\) and \(\mathbf{v}_{k}\), which yields a decomposition of \(\mathbf{X}\) using sparse vectors. We show that when the PMD is applied using an \(L_1\)-penalty on \(\mathbf{v}_{k}\) but not on \(\mathbf{u}_{k}\), a method for sparse principal components results. In fact, this yields an efficient algorithm for the “SCoTLASS” proposal [I. Jolliffe et al., “A modified principal component technique based on the Lasso”, J. Comput. Graph. Stat. 12, No. 3, 531–547 (2003; doi:10.1198/1061860032148)] for obtaining sparse principal components. This method is demonstrated on a publicly available gene expression data set. We also establish connections between the SCoTLASS method for sparse principal component analysis and the method of H. Zou et al. [“Sparse principal component analysis”, J. Comput. Graph. Stat. 15, No. 2, 265–286 (2006; doi:10.1198/106186006X113430)]. In addition, we show that when the PMD is applied to a cross-products matrix, it results in a method for penalized canonical correlation analysis (CCA). We apply this penalized CCA method to simulated data and to a genomic data set consisting of gene expression and DNA copy number measurements on the same set of samples.
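
As an illustration of the sparse rank-one idea in the summary, the following is a minimal sketch of an alternating power-type iteration in which \(\mathbf{v}\) is soft-thresholded and \(\mathbf{u}\) is left unpenalized, the configuration the summary associates with sparse principal components. It is not taken from the paper or from the PMA package: it assumes a fixed soft-threshold level (a Lagrangian form) instead of an explicit bound on the \(L_1\) norm, and the names `sparse_rank_one`, `lam_v`, and `n_iter` are hypothetical.

```python
import math

def soft_threshold(a, lam):
    """Elementwise soft-thresholding: sign(a_i) * max(|a_i| - lam, 0)."""
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in a]

def normalize(a):
    """Scale a to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in a))
    return [x / norm for x in a] if norm > 0 else list(a)

def matvec(X, v):
    """Compute X v for a matrix stored as a list of rows."""
    return [sum(xij * vj for xij, vj in zip(row, v)) for row in X]

def rmatvec(X, u):
    """Compute X^T u for a matrix stored as a list of rows."""
    return [sum(X[i][j] * u[i] for i in range(len(X)))
            for j in range(len(X[0]))]

def sparse_rank_one(X, lam_v, n_iter=100):
    """Sketch of a sparse rank-one approximation d * u * v^T.

    Assumption: a fixed threshold lam_v stands in for the L1 penalty
    on v; u is only normalized, never thresholded.
    """
    v = normalize([1.0] * len(X[0]))  # hypothetical starting value
    for _ in range(n_iter):
        u = normalize(matvec(X, v))                           # update u
        v = normalize(soft_threshold(rmatvec(X, u), lam_v))   # sparse v
    u = normalize(matvec(X, v))
    d = sum(ui * xi for ui, xi in zip(u, matvec(X, v)))       # d = u^T X v
    return d, u, v

# Toy matrix whose third column is small noise (made-up data).
X = [[2.0, 2.0, 0.1],
     [4.0, 4.0, -0.1],
     [2.0, 2.0, 0.1]]
d, u, v = sparse_rank_one(X, lam_v=0.5)
print(d, u, v)
```

On this toy input the thresholding step sets the third entry of `v` exactly to zero, which is the sense in which the resulting decomposition uses sparse vectors. A larger `lam_v` zeroes more entries; if it is large enough to zero every entry, the sketch returns all-zero vectors.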

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis

Software:

PMA; rda
