×

Fast community detection by SCORE. (English) Zbl 1310.62076

Summary: Consider a network where the nodes split into \(K\) different communities. The community labels for the nodes are unknown and it is of major interest to estimate them (i.e., community detection). Degree Corrected Block Model (DCBM) is a popular network model. How to detect communities with the DCBM is an interesting problem, where the main challenge lies in the degree heterogeneity.
We propose a new approach to community detection which we call the Spectral Clustering On Ratios-of-Eigenvectors (SCORE). Compared to classical spectral methods, the main innovation is to use the entry-wise ratios between the first leading eigenvector and each of the other leading eigenvectors for clustering. Let \(A\) be the adjacency matrix of the network. We first obtain the \(K\) leading eigenvectors of \(A\), say, \(\widehat\eta_1,\dots, \widehat\eta_K\), and let \(\widehat R\) be then \(n\times(K-1)\) matrix such that \(\widehat R(i, k)=\widehat\eta_{k+ 1}(i)/\widehat\eta_1(i)\), \(1\leq i\leq n\), \(1\leq k\leq K- 1\). We then use \(k\) for clustering by applying the \(k\)-means method.
The central surprise is, the effect of degree heterogeneity is largely ancillary, and can be effectively removed by taking entry-wise ratios between \(\widehat\eta_{k+1}\) and \(\widehat\eta_1\), \(1\leq k\leq K-1\).
The method is successfully applied to the web blogs data and the karate club data, with error rates of 58/1222 and 1/34, respectively. These results are more satisfactory than those by the classical spectral methods. Additionally, compared to modularity methods, SCORE is easier to implement, computationally faster, and also has smaller error rates.
We develop a theoretic framework where we show that under mild conditions, the SCORE stably yields consistent community detection. In the core of the analysis is the recent development on random matrix theory, where the matrix-form Bernstein inequality is especially helpful.

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
91C20 Clustering in the social and behavioral sciences
62P25 Applications of statistics to social sciences
91D30 Social networks; opinion dynamics

References:

[1] Adamic, L. and Glance, N. (2005). The political blogosphere and the 2004 U.S. election: Divided they blog. In Proceedings of the 3 rd International Workshop on Link Discovery 36-43. ACM, New York.
[2] Amini, A. A., Chen, A., Bickel, P. J. and Levina, E. (2013). Pseudo-likelihood methods for community detection in large sparse networks. Ann. Statist. 41 2097-2122. · Zbl 1277.62166 · doi:10.1214/13-AOS1138
[3] Bai, Z. and Silverstein, J. W. (2009). Spectral Analysis of Large Dimensional Random Matrices , 2nd ed. Springer, New York. · Zbl 1301.60002 · doi:10.1007/978-1-4419-0661-8
[4] Bickel, P. J. and Chen, A. (2009). A nonparametric view of network models and Newman-Girvan and other modularities. Proc. Natl. Acad. Sci. USA 106 21068-21073. · Zbl 1359.62411 · doi:10.1073/pnas.0907096106
[5] Boots, B. and Gordon, G. (2011). Online spectral identification of dynamical systems. In NIPS Workshop on Sparse Representation and Low-Rank Approximation . Sierra Nevada, Spain.
[6] Box, G. E. P. and Draper, N. R. (1987). Empirical Model-Building and Response Surfaces . Wiley, New York. · Zbl 0614.62104
[7] Candès, E. J., Li, X., Ma, Y. and Wright, J. (2011). Robust principal component analysis? J. ACM 58 Art. 11, 37. · Zbl 1327.62369 · doi:10.1145/1970392.1970395
[8] Chaudhuri, K., Fan, C. and Tsiatas, A. (2012). Spectral clustering of graphs with general degrees in the extended planted partition of model. J. Mach. Learn. Res. 35 1-23.
[9] Choi, D. S., Wolfe, P. J. and Airoldi, E. M. (2012). Stochastic blockmodels with a growing number of classes. Biometrika 99 273-284. · Zbl 1318.62207 · doi:10.1093/biomet/asr053
[10] Chung, F. R. K. (1997). Spectral Graph Theory , 1st ed. CBMS Regional Conference Series in Mathematics 92 . AMS, Providence, RI. · Zbl 0867.05046
[11] Erdős, L., Yau, H.-T. and Yin, J. (2012). Bulk universality for generalized Wigner matrices. Probab. Theory Related Fields 154 341-407. · Zbl 1277.15026 · doi:10.1007/s00440-011-0390-3
[12] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348-1360. · Zbl 1073.62547 · doi:10.1198/016214501753382273
[13] Fishkind, D., Sussman, D., Tang, M. and Vogelstein, J. (2012). Consistent adjacency-spectral clustering partitioning for the stochastic model when the model parameters are unknown. Available at . · Zbl 1314.05186 · doi:10.1137/120875600
[14] Goldenberg, A., Zheng, A., Fienberg, S. and Airoldi, E. (2009). A survey of statistical network models. Faund. Trends Mach. Learn. 2 129-233. · Zbl 1184.68030 · doi:10.1561/2200000005
[15] Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations , 3rd ed. Johns Hopkins Univ. Press, Baltimore, MD. · Zbl 0865.65009
[16] Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning . Springer, New York. · Zbl 0973.62007
[17] Hoff, P. (2007). Modeling homophily and stochastic equivalence in symmetric relational data. In Advances in Neural Information Processing Systems . Cambridge.
[18] Horn, R. A. and Johnson, C. R. (1985). Matrix Analysis . Cambridge Univ. Press, Cambridge. · Zbl 0576.15001 · doi:10.1017/CBO9780511810817
[19] Jin, J. (2014). Supplement to “Fast community detection by SCORE”. . · Zbl 1310.62076 · doi:10.1214/14-AOS1265
[20] Jin, J. and Wang, W. (2012). Optimal spectral clustering by Higher Criticism thresholding. Working manuscript.
[21] Jin, J. and Zhang, Q. (2012). New spectral methods for community detection with bipartite networks. Working manuscript.
[22] Karrer, B. and Newman, M. E. J. (2011). Stochastic blockmodels and community structure in networks. Phys. Rev. E (3) 83 016107, 10. · doi:10.1103/PhysRevE.83.016107
[23] Kolaczyk, E. D. (2009). Statistical Analysis of Network Data : Methods and Models . Springer, New York. · Zbl 1277.62021 · doi:10.1007/978-0-387-88146-1
[24] Krzakala, F., Moore, C., Mossel, E., Neeman, J., Sly, A., Zdeborova, L. and Zhang, P. (2012). Spectral redemption: Clustering sparse networks. Available at . · Zbl 1359.62252 · doi:10.1073/pnas.1312486110
[25] Liu, H., Xu, M., Gu, H., Gupta, A., Lafferty, J. and Wasserman, L. (2011). Forest density estimation. J. Mach. Learn. Res. 12 907-951. · Zbl 1280.62045
[26] Nayak, R., Kearns, M., Spielman, R. and Cheung, V. (2009). Coexpression network based on natural variation in human gene expression reveals gene interactions and functions. Genome Res. 19 1953-1962.
[27] Newman, M. E. J. (2006). Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E (3) 74 036104, 19. · doi:10.1103/PhysRevE.74.036104
[28] Newman, M. E. J. (2006). Modularity and community structure in networks. Proc. Natl. Acad. Sci. USA 103 8577-8582.
[29] Perry, P. and Wolfe, P. (2012). Null models for network data. Available at .
[30] Prakah, B. A., Sridharan, A., Seshadri, M., Machiraju, S. and Faloutsos, C. (2010). Eigenspokes: Surprising patterns and scalable community chipping in large graphs. In Advances in Knowledge Discovery and Data Mining 435-448. Springer, Berlin.
[31] Rohe, K., Chatterjee, S. and Yu, B. (2011). Spectral clustering and the high-dimensional stochastic blockmodel. Ann. Statist. 39 1878-1915. · Zbl 1227.62042 · doi:10.1214/11-AOS887
[32] Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12 389-434. · Zbl 1259.60008 · doi:10.1007/s10208-011-9099-z
[33] Tukey, J. W. (1965). Which part of the sample contains the information? Proc. Natl. Acad. Sci. USA 53 127-134. · Zbl 0168.40205 · doi:10.1073/pnas.53.1.127
[34] Yan, X., Jensen, J., Krzakala, F. et al. (2014). Model selection for degree-corrected block model. J. Stat. Mech. Theor. Exp. 2014 P05007.
[35] Zachary, W. (1977). An information flow model for conflict and fission in small groups. J. Anthropo. Res. 33 452-473.
[36] Zhang, S. and Zhao, H. (2012). Community identification in networks with unbalanced structure. Phys. Rev. E 85 066114.
[37] Zhao, Y., Levina, L. and Zhu, J. (2011). Consistency of community detection in network under degree-corrected stochastic block models. Available at . · Zbl 1257.62095 · doi:10.1214/12-AOS1036
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.