Abstract
We introduce a new approach for finding overlapping clusters given pairwise similarities of objects. In particular, we relax the problem of correlation clustering by allowing an object to be assigned to more than one cluster. At the core of our approach is an optimization problem in which each data point is mapped to a small set of labels, representing membership in different clusters. The objective is to find a mapping so that the given similarities between objects agree as much as possible with similarities taken over their label sets. The number of labels can vary across objects. To define a similarity between label sets, we consider two measures: (i) a 0–1 function indicating whether the two label sets have non-zero intersection and (ii) the Jaccard coefficient between the two label sets. The algorithm we propose is an iterative local-search method. The definitions of label set similarity give rise to two non-trivial optimization problems, which, for the measures of set-intersection and Jaccard, we solve using a greedy strategy and non-negative least squares, respectively. We also develop a distributed version of our algorithm based on the BSP model and implement it using a Pregel framework. Our algorithm uses as input pairwise similarities of objects and can thus be applied when clustering structured objects for which feature vectors are not available. As a proof of concept, we apply our algorithms on three different and complex application domains: trajectories, amino-acid sequences, and textual documents.
Similar content being viewed by others
References
Ailon N, Charikar M, Newman A (2005) Aggregating inconsistent information: ranking and clustering. In: Proceedings of the ACM symposium on theory of computing (STOC)
Ailon N, Liberty E (2009) Correlation clustering revisited: the “true“ cost of error minimization problems. In: Automata, languages and programming, 36th international colloquium (ICALP)
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3): 403–410
Arabie P, Carroll JD, DeSarbo W, Wind J (1981) Overlapping clustering: a new method for product positioning. J Mark Res 18(3):310–317
Banerjee A, Krumpelman C, Ghosh J, Basu S, Mooney RJ (2005) Model-based overlapping clustering. In: Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery and data mining (KDD)
Bansal N, Blum A, Chawla S (2004) Correlation clustering. Mach Learn 56(1–3):89–113
Basu S, Banerjee A, Mooney RJ (2004) Active semi-supervision for pairwise constrained clustering. In: Proceedings of the Fourth SIAM international conference on data mining (SDM)
Basu S, Bilenko M, Mooney RJ (2004) A probabilistic framework for semi-supervised clustering. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (KDD)
Battle A, Segal E, Koller D (2004) Probabilistic discovery of overlapping cellular processes and their regulation. In: Proceedings of the 8th international conference on research in computational molecular biology (RECOMB)
Bezdek, JC, Pal, SK (eds) (1992) Fuzzy models for pattern recognition—methods that search for structures in data. IEEE Press, New York
Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw 30(1–7): 107–117
Broder AZ, Charikar M, Frieze AM, Mitzenmacher M (1998) Min-wise independent permutations. In: Proceedings of the 13th annual ACM symposium on theory of computing (STOC)
Charikar M, Guruswami V, Wirth A (2003) Clustering with qualitative information. In: Proceedings of the IEEE symposium on foundations of computer science (FOCS)
Chen L, Özsu MT, Oria V (2005) Robust and fast similarity search for moving object trajectories. In: Proceedings of the 2005 ACM SIGMOD international conference on management of data (SIGMOD’05)
Cheng Y, Church GM (2000) Biclustering of expression data. In: Proceedings of the eighth international conference on intelligent systems for molecular biology (ISMB)
Chierichetti F, Kumar R, Pandey S, Vassilvitskii S (2010) Finding the jaccard median. In: Proceedings of the 21st annual ACM-SIAM symposium on discrete algorithms (SODA)
Coe PK, Johnson BK, Stewart KM, Kie JG (2004) Spatial and temporal interactions of elk, mule deer, and cattle. In: Transactions of the 69th North American wildlife and natural resources conference, pp 656–669
Davidson I, Ravi SS (2005) Clustering with constraints: feasibility issues and the k-means algorithm. In: Proceedings of the Fifth SIAM international conference on data mining (SDM)
Davidson I, Ravi SS (2007) Intractability and clustering with constraints. In: Proceedings of the 24th international conference on machine learning (ICML)
Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1): 107–113
Demaine ED, Emanuel D, Fiat A, Immorlica N (2006) Correlation clustering in general weighted graphs. Theor Comput Sci 361:172–187
Ding C, He X, Simon HD (2005) On the equivalence of nonnegative matrix factorization and spectral clustering. In: Proceedings of the SIAM data mining conference
Fortunato S (2010) Community detection in graphs. Phys Rep 486:75–174
Fu Q, Banerjee A (2008) Multiplicative mixture models for overlapping clustering. In: Proceedings of the 8th IEEE international conference on data mining (ICDM)
Fu Q, Banerjee A (2009) Bayesian overlapping subspace clustering. In: Proceedings of the 9th IEEE international conference on data mining (ICDM)
Gaffney S, Smyth P (1999) Trajectory clustering with mixtures of regression models. In: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’99
Garey MR, Johnson DS (1979) Computers and intractability: a guide to the theory of NP-completeness. W. H. Freeman & Co., San Francisco
Gionis A, Mannila H, Tsaparas P (2007) Clustering aggregation. TKDD 1(1):Article 4
Giotis I, Guruswami V (2006) Correlation clustering with a fixed number of clusters. In: Proceedings of the seventeenth annual ACM-SIAM symposium on discrete algorithms (SODA)
Hasan M, Salem S, Zaki M (2011) Simclus: an effective algorithm for clustering with a lower bound on similarity. Knowl Inf Syst 28: 665–685
Hathaway RJ, Davenport JW, Bezdek JC (1989) Relational duals of the c-means clustering algorithms. Pattern Recognit 22(2): 205–212
Hathaway RJ, Hu Y (2009) Density-weighted fuzzy c-means clustering. IEEE T Fuzzy Syst 17(1): 243–252
He Z, Xie S, Zdunek R, Zhou G, Cichocki A (2011) Symmetric nonnegative matrix factorization: algorithms and applications to probabilistic clustering. IEEE Trans Neural Netw 22(12): 2117–2131
Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the 13th annual ACM symposium on theory of computing (STOC)
Klein D, Kamvar SD, Manning CD (2002) From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering. In: Proceedings of the nineteenth international conference on machine learning (ICML)
Kobayashi M, Aono M (2006) Exploring overlapping clusters using dynamic re-scaling and sampling. Knowl Inf Syst 10: 295–313
Lee DD, Seung HS (2001) Algorithms for Non-negative Matrix Factorization. In: Advances in neural information processing systems 13:556–562
Lee JG, Han J, Whang KY (2007) Trajectory clustering: a partition-and-group framework. In: Proceedings of the 2007 ACM SIGMOD international conference on management of data, SIGMOD ’07
Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform 1(1):24–45
Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: SIGMOD conference, pp 135–146
Mei JP, Chen L (2010) Fuzzy clustering with weighted medoids for relational data. Pattern Recognit 43(5): 1964–1974
Miettinen P (2008) On the positive-negative partial set cover problem. Inf Process Lett 108(4):219–221
Murzin A, Brenner S, Hubbard T, Chothia C (1995) Scop—a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247(4): 536–540
Nanni M, Pedreschi D (2006) Time-focused clustering of trajectories of moving objects. J Intell Inf Syst 27(3): 267–289
Nepusz T, Sasidharan R, Paccanaro A (2010) Scps: a fast implementation of a spectral method for detecting protein families on a genome-wide scale. BMC Bioinform 11(1): 120
Paccanaro A, Casbon JA, Saqi MAS (2006) Spectral clustering of protein sequences. Nucleic Acids Res 34(5): 1571–1580
Palla G, Derenyi I, Farkas I, Vicsek T (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature
Scheinerman ER, Tucker K (2010) Modeling graphs using dot product representations. Comput Stat 25(1):1–16
Scripps J, Tan PN (2006) Clustering in the presence of bridge-nodes. In: Proceedings of the sixth SIAM international conference on data mining (SDM)
Segal E, Battle A, Koller D (2003) Decomposing gene expression into cellular processes. In: Proceedings of the 8th Pacific symposium on biocomputing (PSB)
Shafiei MM, Milios EE (2006) Model-based overlapping co-clustering. In: Proceedings of the fourth workshop on text mining
Shepard RN, Arabie P (1979) Additive clustering: representation of similarities as combinations of discrete overlapping properties. Psychol Rev 86(2):87–123
Swamy C (2004) Correlation clustering: maximizing agreements via semidefinite programming. In: Proceedings of the ACM-SIAM symposium on discrete algorithms (SODA)
Tang L, Liu H (2009) Scalable learning of collective behavior based on sparse social dimensions. In: Proceedings of the 18th ACM conference on information and knowledge management (CIKM)
Valiant LG (1990) A bridging model for parallel computation. Commun ACM 33(8): 103–111
Wagstaff K, Cardie C (2000) Clustering with instance-level constraints. In: Proceedings of the 17th international conference on machine learning (ICML)
Wagstaff K, Cardie C, Rogers S, Schrödl S (2001) Constrained k-means clustering with background knowledge. In: Proceedings of the 18th international conference on machine learning (ICML)
Wang X, Tang L, Gao H, Liu H (2010) Discovering overlapping groups in social media. In: The 10th IEEE international conference on data mining (ICDM)
Xiong H, Steinbach M, Ruslim A, Kumar V (2009) Characterizing pattern preserving clustering. Knowl Inf Syst 19: 311–336
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Bonchi, F., Gionis, A. & Ukkonen, A. Overlapping correlation clustering. Knowl Inf Syst 35, 1–32 (2013). https://doi.org/10.1007/s10115-012-0522-9
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-012-0522-9