
Stepwise feature selection using generalized logistic loss. (English) Zbl 1452.62838

Summary: Microarray experiments have raised challenging questions such as how to accurately identify a set of marker genes responsible for various cancers. In statistics, this task can be posed as the feature selection problem. Since a support vector machine can deal with a vast number of features, it has gained widespread use in microarray data analysis. We propose a stepwise feature selection using the generalized logistic loss, a smooth approximation of the usual hinge loss. We compare the proposed method with the support vector machine with recursive feature elimination on both real and simulated datasets. We illustrate that the proposed method can improve the quality of feature selection through standardization while retaining predictive performance similar to that of recursive feature elimination.
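The smooth approximation mentioned in the summary can be sketched as follows. A common form of the generalized logistic loss (e.g. as in the modified logistic regression of Zhang et al. [25]) is \(g_\gamma(z) = \gamma^{-1}\log(1 + e^{\gamma(1-z)})\), which upper-bounds the hinge loss \(\max(0, 1-z)\) and approaches it as the scale parameter \(\gamma\) grows, while staying differentiable everywhere. This is a minimal illustrative sketch, not the authors' exact formulation; the function names and the choice of \(\gamma\) are ours.

```python
import math

def hinge(z):
    """Hinge loss used by the SVM: max(0, 1 - z)."""
    return max(0.0, 1.0 - z)

def gen_logistic(z, gamma=5.0):
    """Generalized logistic loss (illustrative form):
    (1/gamma) * log(1 + exp(gamma * (1 - z))).

    A smooth upper bound on the hinge loss; larger gamma gives a
    tighter approximation. Computed via a numerically stable softplus.
    """
    t = gamma * (1.0 - z)
    if t > 0:
        return (t + math.log1p(math.exp(-t))) / gamma
    return math.log1p(math.exp(t)) / gamma
```

For a large scale parameter the two losses nearly coincide: at the margin boundary `gen_logistic(0.0, gamma=100.0)` is close to `hinge(0.0) == 1.0`, and for well-classified points (`z > 1`) both losses vanish; unlike the hinge, the smooth version has a gradient everywhere, which is what makes stepwise selection with standard optimizers tractable.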

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62-08 Computational methods for problems pertaining to statistics
68T05 Learning and adaptive systems in artificial intelligence

Software:

ElemStatLearn
Full Text: DOI

References:

[1] Alizadeh, A. A., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature, 403, 503-510 (2000)
[2] Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., Levine, A., 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon cancer tissues probed by oligonucleotide arrays. In: Proceedings of the National Academy of Sciences of the United States of America 96, pp. 6745-6750
[3] Ambroise, C., McLachlan, G.J., 2002. Selection bias in gene extraction on the basis of microarray gene-expression data. In: Proceedings of the National Academy of Sciences of the United States of America 99, pp. 6562-6566 · Zbl 1034.92013
[4] Bradley, P.S., Mangasarian, O.L., 1998. Feature selection via concave minimization and support vector machines. In: Proceedings of 13th International Conference on Machine Learning, 82-90
[5] Cortes, C.; Vapnik, V., Support-vector networks, Mach. Learn., 20, 273-297 (1995) · Zbl 0831.68098
[6] Ding, Y.; Wilkins, D., Improving the performance of SVM-RFE to select genes in microarray data, BMC Bioinformatics, 7, Suppl 2, S12 (2006)
[7] Duan, K.-B.; Rajapakse, J. C.; Wang, H.; Azuaje, F., Multiple SVM-RFE for gene selection in cancer classification with expression data, IEEE Trans. Nanobioscience, 4, 228-234 (2005)
[8] Friedman, J., Multivariate adaptive regression splines (with discussions), Ann. Stat., 19, 1-141 (1991) · Zbl 0765.62064
[9] Gordon, G. J., Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma, Cancer Res., 62, 4963-4967 (2002)
[10] Guyon, I.; Elisseeff, A., An introduction to variable and feature selection, J. Mach. Learn. Res., 3, 1157-1182 (2003) · Zbl 1102.68556
[11] Guyon, I.; Weston, J.; Barnhill, S.; Vapnik, V., Gene selection for cancer classification using support vector machines, Mach. Learn., 46, 389-422 (2002) · Zbl 0998.68111
[12] Hastie, T.; Tibshirani, R.; Friedman, J., The elements of statistical learning: Data mining, inference, and prediction (2001), Springer: Springer New York · Zbl 0973.62007
[13] Ishak, A.B., Ghattas, B., 2005. An Efficient Method for Variable Selection Using SVM-Based Criteria. Preprint, Institut de Mathématiques de Luminy
[14] Koo, J.-Y., Kooperberg, C., 2005. Quantile multivariate adaptive regression splines. Manuscript
[15] Koo, J.-Y., Lee, Y., Kim, Y., Park, C., 2006. A Bahadur representation of the linear support vector machine. Technical Report No. 792. Department of Statistics, The Ohio State University · Zbl 1225.68191
[16] Lee, Y.; Kim, Y.; Lee, S.; Koo, J.-Y., Structured multicategory support vector machine with ANOVA decomposition, Biometrika, 93, 555-571 (2006) · Zbl 1108.62059
[17] Lee, Y.; Lin, Y.; Wahba, G., Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data, J. Amer. Statist. Assoc., 99, 67-81 (2004) · Zbl 1089.62511
[18] Mao, Y.; Zhou, X.; Pi, D.; Sun, Y.; Wong, S. T. C., Multiclass cancer classification by using fuzzy support vector machine and binary decision tree with gene selection, J. Biomedicine and Biotechnology, 2005, 160-171 (2005)
[19] Niijima, S.; Kuhara, S., Recursive gene selection based on maximum margin criterion: A comparison with SVM-RFE, BMC Bioinformatics, 7, 543 (2006)
[20] Rao, C. R., Linear Statistical Inference and its Applications (1973), Wiley: Wiley New York · Zbl 0256.62002
[21] Schwarz, G., Estimating the dimension of a model, Ann. Stat., 6, 461-464 (1978) · Zbl 0379.62005
[22] Singh, D.; Febbo, P. G.; Ross, K.; Jackson, D. G.; Manola, J.; Ladd, C.; Tamayo, P.; Renshaw, A. A.; D’Amico, A. V.; Richie, J. P.; Lander, E. S.; Loda, M.; Kantoff, P. W.; Golub, T. R.; Sellers, W. R., Gene expression correlates of clinical prostate cancer behavior, Cancer Cell, 1, 203-209 (2002)
[23] Tang, E. K.; Suganthan, P. N.; Yao, X., Gene selection algorithms for microarray data based on least square support vector machine, BMC Bioinformatics, 7, 95 (2006)
[24] Zhang, H. H., Variable selection for support vector machines via smoothing spline ANOVA, Statist. Sinica, 16, 659-674 (2006) · Zbl 1096.62072
[25] Zhang, J., Jin, R., Yang, Y., Hauptmann, A.G., 2003. Modified logistic regression: An approximation to SVM and its applications in large-scale text categorization. In: Proceedings of the Twentieth International Conference on Machine Learning, ICML-2003, Washington, DC
[26] Zhang, T.; Oles, F. J., Text categorization based on regularized linear classification methods, Information Retrieval, 4, 5-31 (2001) · Zbl 1030.68910
[27] Zhou, X.; Tuck, D. P., MSVM-RFE: Extensions of SVM-RFE for multiclass gene selection on DNA microarray data, Bioinformatics, 23, 1106-1114 (2007)
[28] Zhu, J., Rosset, S., Hastie, T., Tibshirani, R., 2003. 1-norm Support Vector Machines. Technical Report. Stanford University