×

Embedding-based silhouette community detection. (English) Zbl 1523.68069

Summary: Mining complex data in the form of networks is of increasing interest in many scientific disciplines. Network communities correspond to densely connected subnetworks, and often represent key functional parts of real-world systems. This paper proposes the embedding-based Silhouette community detection (SCD), an approach for detecting communities, based on clustering of network node embeddings, i.e. real valued representations of nodes derived from their neighborhoods. We investigate the performance of the proposed SCD approach on 234 synthetic networks, as well as on a real-life social network. Even though SCD is not based on any form of modularity optimization, it performs comparably or better than state-of-the-art community detection algorithms, such as the InfoMap and Louvain. Further, we demonstrate that SCD’s outputs can be used along with domain ontologies in semantic subgroup discovery, yielding human-understandable explanations of communities detected in a real-life protein interaction network. Being embedding-based, SCD is widely applicable and can be tested out-of-the-box as part of many existing network learning and exploration pipelines.

MSC:

68T05 Learning and adaptive systems in artificial intelligence
05C82 Small world graphs, complex networks (graph-theoretic aspects)
92C42 Systems biology, networks

References:

[1] Adhikari, PR; Vavpetič, A.; Kralj, J.; Lavrač, N.; Hollmén, J., Explaining mixture models through semantic pattern mining and banded matrix visualization, Machine Learning, 105, 1, 3-39 (2016) · Zbl 1392.68336
[2] Aranganayagi, S., & Thangavel, K. (2007). Clustering categorical data using silhouette coefficient as a relocating measure. In International conference on computational intelligence and multimedia applications (ICCIMA 2007) (vol. 2, pp. 13-17). IEEE.
[3] Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms (pp. 1027-1035). Society for Industrial and Applied Mathematics. · Zbl 1302.68273
[4] Ashburner, M.; Ball, CA; Blake, JA; Botstein, D.; Butler, H.; Cherry, JM; Davis, AP; Dolinski, K.; Dwight, SS; Eppig, JT, Gene ontology: Tool for the unification of biology, Nature Genetics, 25, 1, 25-29 (2000)
[5] Bachem, O., Lucic, M., Hassani, H., & Krause, A. (2016). Fast and provably good seedings for k-means. In Advances in neural information processing systems 29 (pp. 55-63). Curran Associates Inc.
[6] Barabási, AL, Scale-free networks: a decade and beyond, Science, 325, 5939, 412-413 (2009) · Zbl 1226.91052
[7] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., et al. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for scientific computing conference (SciPy) (Vol. 4). Austin, TX.
[8] Brandes, U., Delling, D., Gaertler, M., Görke, R., Hoefer, M., Nikoloski, Z., et al. (2006). Maximizing modularity is hard. arXiv preprint, arXiv:physics/0608255. · Zbl 1141.68519
[9] Cai, H.; Zheng, VW; Chang, KCC, A comprehensive survey of graph embedding: Problems, techniques, and applications, IEEE Transactions on Knowledge and Data Engineering, 30, 9, 1616-1637 (2018)
[10] Clauset, A.; Newman, ME; Moore, C., Finding community structure in very large networks, Physical Review E, 70, 6, 066111 (2004)
[11] Cordasco, G., & Gargano, L. (2010). Community detection via semi-synchronous label propagation algorithms. In 2010 IEEE international workshop on: business applications of social network analysis (BASNA) (pp. 1-8). IEEE.
[12] Davies, DL; Bouldin, DW, A cluster separation measure, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1, 2, 224-227 (1979)
[13] De Meo, P., Ferrara, E., Fiumara, G., & Provetti, A. (2011). Generalized louvain method for community detection in large networks. In 2011 proceedings of the 11th international conference on intelligent systems design and applications (pp. 88-93). IEEE. · Zbl 1311.68133
[14] Defferrard, M., Bresson, X., & Vandergheynst, P. (2016). Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems 29 (pp. 3844-3852). Curran Associates Inc.
[15] Drineas, P.; Frieze, A.; Kannan, R.; Vempala, S.; Vinay, V., Clustering large graphs via the singular value decomposition, Machine Learning, 56, 1-3, 9-33 (2004) · Zbl 1089.68090
[16] Fortunato, S.; Barthelemy, M., Resolution limit in community detection, Proceedings of the National Academy of Sciences, 104, 1, 36-41 (2007)
[17] Fowlkes, EB; Mallows, CL, A method for comparing two hierarchical clusterings, Journal of the American Statistical Association, 78, 383, 553-569 (1983) · Zbl 0545.62042
[18] Fürnkranz, J.; Gamberger, D.; Lavrač, N., Foundations of rule learning (2012), Berlin: Springer, Berlin · Zbl 1263.68002
[19] Good, BH; De Montjoye, YA; Clauset, A., Performance of modularity maximization in practical contexts, Physical Review E, 81, 4, 046106 (2010)
[20] Grover, A., & Leskovec, J. (2016). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 855-864). ACM.
[21] Guédon, O.; Vershynin, R., Community detection in sparse networks via grothendieck’s inequality, Probability Theory and Related Fields, 165, 3-4, 1025-1049 (2016) · Zbl 1357.90111
[22] Hagberg, A., Swart, P., & S Chult, D. (2008). Exploring network structure, dynamics, and function using networkx. Technical report, Los Alamos National Lab.(LANL), Los Alamos, NM (United States).
[23] Hamilton, W., Ying, Z., & Leskovec, J. (2017). Inductive representation learning on large graphs. In Advances in neural information processing systems 30 (pp. 1024-1034). Curran Associates Inc.
[24] Harenberg, S.; Bello, G.; Gjeltema, L.; Ranshous, S.; Harlalka, J.; Seay, R.; Padmanabhan, K.; Samatova, N., Community detection in large-scale networks: A survey and empirical evaluation, Wiley Interdisciplinary Reviews: Computational Statistics, 6, 6, 426-439 (2014) · Zbl 07912746
[25] Honghao, C., Zuren, F., & Zhigang, R. (2013). Community detection using ant colony optimization. In 2013 IEEE congress on evolutionary computation (pp. 3072-3078). IEEE.
[26] Hotho, A.; Maedche, A.; Staab, S., Ontology-based text document clustering, KI, 16, 4, 48-54 (2002)
[27] Jin, J., Fast community detection by score, The Annals of Statistics, 43, 1, 57-89 (2015) · Zbl 1310.62076
[28] Kipf, T. N., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks. In International conference on learning representations (ICLR).
[29] Kozak, M., “A dendrite method for cluster analysis” by Caliński and Harabasz: A classical work that is far too often incorrectly cited, Communications in Statistics - Theory and Methods, 41, 12, 2279-2280 (2012) · Zbl 1250.01015
[30] Kralj, J.; Robnik-Šikonja, M.; Lavrač, N., Hinmine: Heterogeneous information network mining with information retrieval heuristics, Journal of Intelligent Information Systems, 50, 1, 29-61 (2018)
[31] Lancichinetti, A.; Fortunato, S., Community detection algorithms: A comparative analysis, Physical Review E, 80, 5, 056117 (2009)
[32] Lancichinetti, A.; Fortunato, S.; Radicchi, F., Benchmark graphs for testing community detection algorithms, Physical Review E, 78, 4, 046110 (2008)
[33] Langohr, L.; Podpečan, V.; Petek, M.; Mozetič, I.; Gruden, K.; Lavrač, N.; Toivonen, H., Contrasting subgroup discovery, The Computer Journal, 56, 3, 289-303 (2012)
[34] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, GS; Dean, J.; Burges, CJC; Bottou, L.; Welling, M.; Ghahramani, Z.; Weinberger, KQ, Distributed representations of words and phrases and their compositionality, Advances in neural information processing systems 26, 3111-3119 (2013), Red Hook: Curran Associates Inc, Red Hook
[35] Nickel, M., & Kiela, D. (2017). Poincaré embeddings for learning hierarchical representations. In Advances in neural information processing systems 30 (pp. 6338-6347). Curran Associates Inc.
[36] Novak, PK; Lavrač, N.; Webb, GI, Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining, Journal of Machine Learning Research, 10, Feb, 377-403 (2009) · Zbl 1235.68178
[37] Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Stanford InfoLab: Technical report.
[38] Park, HS; Jun, CH, A simple and fast algorithm for k-medoids clustering, Expert Systems with Applications, 36, 2, 3336-3341 (2009)
[39] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., & Antiga, L. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (pp. 8024-8035). Curran Associates Inc.
[40] Perozzi, B., Al-Rfou, R., & Skiena, S. (2014). Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 701-710). ACM.
[41] Qiu, J., Dong, Y., Ma, H., Li, J., Wang, K., & Tang, J. (2018). Network embedding as matrix factorization: Unifying deepwalk, line, PTE, and node2vec. In Proceedings of the eleventh ACM international conference on web search and data mining (pp. 459-467). ACM.
[42] Rand, WM, Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, 66, 336, 846-850 (1971)
[43] Ribeiro, L. F., Saverese, P. H., & Figueiredo, D. R. (2017). struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 385-394). ACM.
[44] Rosenberger, G.; Meien, S.; Kutsche, K., Oncogenic HRAS mutations cause prolonged PI3K signaling in response to epidermal growth factor in fibroblasts of patients with costello syndrome, Human Mutation, 30, 3, 352-362 (2009)
[45] Rosvall, M.; Axelsson, D.; Bergstrom, CT, The map equation, The European Physical Journal-Special Topics, 178, 1, 13-23 (2009)
[46] Rosvall, M.; Axelsson, D.; Bergstrom, CT, The map equation, The European Physical Journal Special Topics, 178, 1, 13-23 (2009)
[47] Rousseeuw, PJ, Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics, 20, 53-65 (1987) · Zbl 0636.62059
[48] Schaub, MT; Delvenne, JC; Rosvall, M.; Lambiotte, R., The many facets of community detection in complex networks, Applied Network Science, 2, 1, 4 (2017)
[49] Sculley, D. (2010). Web-scale k-means clustering. In Proceedings of the 19th international conference on World wide web (pp. 1177-1178). ACM.
[50] Škrlj, B., Kralj, J., & Lavrač, N. (2018). Targeted end-to-end knowledge graph decomposition. In International conference on inductive logic programming (pp. 157-171). Berlin: Springer. · Zbl 1455.68202
[51] Škrlj, B., Kralj, J., & Lavrač, N. (2019a). CBSSD: Community-based semantic subgroup discovery. Journal of Intelligent Information Systems, 53, 265-304.
[52] Škrlj, B.; Kralj, J.; Lavrač, N.; Aiello, LM; Cherifi, C.; Cherifi, H.; Lambiotte, R.; Lió, P.; Rocha, LM, Py3plex: A library for scalable multilayer network analysis and visualization, Complex networks and their applications VII, 757-768 (2019), Cham: Springer International Publishing, Cham
[53] Skrlj, B.; Kralj, J.; Lavrac, N., Py3plex toolkit for visualization and analysis of multilayer networks, Applied Network Science, 4, 1, 94 (2019)
[54] Škrlj, B.; Kralj, J.; Vavpetič, A.; Lavrač, N.; Appice, A.; Loglisci, C.; Manco, G.; Masciari, E.; Ras, ZW, Community-based semantic subgroup discovery, New frontiers in mining complex patterns, 182-196 (2018), Berlin: Springer International Publishing, Berlin
[55] Tang, J., Qu, M., & Mei, Q. (2015). PTE: Predictive text embedding through large-scale heterogeneous text networks. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 1165-1174). ACM.
[56] Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., & Mei, Q. (2015). Line: Large-scale information network embedding. In Proceedings of the 24th international conference on world wide web (pp. 1067-1077). International World Wide Web Conferences Steering Committee.
[57] Thomas, JA; Cover, T., Elements of information theory (1991), New York: Wiley, New York · Zbl 0762.94001
[58] Toni, T.; Welch, D.; Strelkowa, N.; Ipsen, A.; Stumpf, MPH, Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems, Journal of the Royal Society Interface, 6, 187-202 (2009)
[59] Vavpetič, A., Novak, P. K., Grčar, M., Mozetič, I., & Lavrač, N. (2013). Semantic data mining of financial news articles. In Proceedings of the international conference on discovery science (pp. 294-307). Berlin: Springer.
[60] Vavpetič, A. (2017). Semantic subgroup discovery. Ph.D. thesis, Jožef Stefan International Postgraduate School.
[61] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2017). Graph attention networks. arXiv preprint arXiv:1710.10903.
[62] Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., & Yu, P. S. (2019). A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596.
[63] Yang, Z.; Algesheimer, R.; Tessone, CJ, A comparative analysis of community detection algorithms on artificial networks, Scientific Reports, 6, 30750 (2016)
[64] Yin, H., Benson, A. R., Leskovec, J., & Gleich, D. F. (2017). Local higher-order graph clustering. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 555-564). ACM.
[65] Zhang, Q.; Yang, LT; Chen, Z.; Li, P., A survey on deep learning for big data, Information Fusion, 42, 146-157 (2018)
[66] Zhang, XS; Wang, RS; Wang, Y.; Wang, J.; Qiu, Y.; Wang, L.; Chen, L., Modularity optimization in community detection of complex networks, EPL (Europhysics Letters), 87, 3, 38002 (2009)
[67] Zhao, W. X., Huang, J., & Wen, J. R. (2016). Learning distributed representations for recommender systems with a network embedding approach. In Asia information retrieval symposium (pp. 224-236). Berlin: Springer.
[68] Zhu, Y.; Knolhoff, BL; Meyer, MA; Nywening, TM; West, BL; Luo, J.; Wang-Gillam, A.; Goedegebuure, SP; Linehan, DC; DeNardo, DG, CSF1/CSF1R blockade reprograms tumor-infiltrating macrophages and improves response to t-cell checkpoint immunotherapy in pancreatic cancer models, Cancer Research, 74, 18, 5057-5069 (2014)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.