Overlapping correlation clustering

1108 Accesses
40 Citations
Explore all metrics

Abstract

We introduce a new approach for finding overlapping clusters given pairwise similarities of objects. In particular, we relax the problem of correlation clustering by allowing an object to be assigned to more than one cluster. At the core of our approach is an optimization problem in which each data point is mapped to a small set of labels, representing membership in different clusters. The objective is to find a mapping so that the given similarities between objects agree as much as possible with similarities taken over their label sets. The number of labels can vary across objects. To define a similarity between label sets, we consider two measures: (i) a 0–1 function indicating whether the two label sets have non-zero intersection and (ii) the Jaccard coefficient between the two label sets. The algorithm we propose is an iterative local-search method. The definitions of label set similarity give rise to two non-trivial optimization problems, which, for the measures of set-intersection and Jaccard, we solve using a greedy strategy and non-negative least squares, respectively. We also develop a distributed version of our algorithm based on the BSP model and implement it using a Pregel framework. Our algorithm uses as input pairwise similarities of objects and can thus be applied when clustering structured objects for which feature vectors are not available. As a proof of concept, we apply our algorithms on three different and complex application domains: trajectories, amino-acid sequences, and textual documents.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

References

Ailon N, Charikar M, Newman A (2005) Aggregating inconsistent information: ranking and clustering. In: Proceedings of the ACM symposium on theory of computing (STOC)
Ailon N, Liberty E (2009) Correlation clustering revisited: the “true“ cost of error minimization problems. In: Automata, languages and programming, 36th international colloquium (ICALP)
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3): 403–410
Google Scholar
Arabie P, Carroll JD, DeSarbo W, Wind J (1981) Overlapping clustering: a new method for product positioning. J Mark Res 18(3):310–317
Google Scholar
Banerjee A, Krumpelman C, Ghosh J, Basu S, Mooney RJ (2005) Model-based overlapping clustering. In: Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery and data mining (KDD)
Bansal N, Blum A, Chawla S (2004) Correlation clustering. Mach Learn 56(1–3):89–113
Google Scholar
Basu S, Banerjee A, Mooney RJ (2004) Active semi-supervision for pairwise constrained clustering. In: Proceedings of the Fourth SIAM international conference on data mining (SDM)
Basu S, Bilenko M, Mooney RJ (2004) A probabilistic framework for semi-supervised clustering. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (KDD)
Battle A, Segal E, Koller D (2004) Probabilistic discovery of overlapping cellular processes and their regulation. In: Proceedings of the 8th international conference on research in computational molecular biology (RECOMB)
Bezdek, JC, Pal, SK (eds) (1992) Fuzzy models for pattern recognition—methods that search for structures in data. IEEE Press, New York
Google Scholar
Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw 30(1–7): 107–117
Google Scholar
Broder AZ, Charikar M, Frieze AM, Mitzenmacher M (1998) Min-wise independent permutations. In: Proceedings of the 13th annual ACM symposium on theory of computing (STOC)
Charikar M, Guruswami V, Wirth A (2003) Clustering with qualitative information. In: Proceedings of the IEEE symposium on foundations of computer science (FOCS)
Chen L, Özsu MT, Oria V (2005) Robust and fast similarity search for moving object trajectories. In: Proceedings of the 2005 ACM SIGMOD international conference on management of data (SIGMOD’05)
Cheng Y, Church GM (2000) Biclustering of expression data. In: Proceedings of the eighth international conference on intelligent systems for molecular biology (ISMB)
Chierichetti F, Kumar R, Pandey S, Vassilvitskii S (2010) Finding the jaccard median. In: Proceedings of the 21st annual ACM-SIAM symposium on discrete algorithms (SODA)
Coe PK, Johnson BK, Stewart KM, Kie JG (2004) Spatial and temporal interactions of elk, mule deer, and cattle. In: Transactions of the 69th North American wildlife and natural resources conference, pp 656–669
Davidson I, Ravi SS (2005) Clustering with constraints: feasibility issues and the k-means algorithm. In: Proceedings of the Fifth SIAM international conference on data mining (SDM)
Davidson I, Ravi SS (2007) Intractability and clustering with constraints. In: Proceedings of the 24th international conference on machine learning (ICML)
Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1): 107–113
Article Google Scholar
Demaine ED, Emanuel D, Fiat A, Immorlica N (2006) Correlation clustering in general weighted graphs. Theor Comput Sci 361:172–187
Google Scholar
Ding C, He X, Simon HD (2005) On the equivalence of nonnegative matrix factorization and spectral clustering. In: Proceedings of the SIAM data mining conference
Fortunato S (2010) Community detection in graphs. Phys Rep 486:75–174
Google Scholar
Fu Q, Banerjee A (2008) Multiplicative mixture models for overlapping clustering. In: Proceedings of the 8th IEEE international conference on data mining (ICDM)
Fu Q, Banerjee A (2009) Bayesian overlapping subspace clustering. In: Proceedings of the 9th IEEE international conference on data mining (ICDM)
Gaffney S, Smyth P (1999) Trajectory clustering with mixtures of regression models. In: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’99
Garey MR, Johnson DS (1979) Computers and intractability: a guide to the theory of NP-completeness. W. H. Freeman & Co., San Francisco
MATH Google Scholar
Gionis A, Mannila H, Tsaparas P (2007) Clustering aggregation. TKDD 1(1):Article 4
Giotis I, Guruswami V (2006) Correlation clustering with a fixed number of clusters. In: Proceedings of the seventeenth annual ACM-SIAM symposium on discrete algorithms (SODA)
Hasan M, Salem S, Zaki M (2011) Simclus: an effective algorithm for clustering with a lower bound on similarity. Knowl Inf Syst 28: 665–685
Article Google Scholar
Hathaway RJ, Davenport JW, Bezdek JC (1989) Relational duals of the c-means clustering algorithms. Pattern Recognit 22(2): 205–212
Article MathSciNet MATH Google Scholar
Hathaway RJ, Hu Y (2009) Density-weighted fuzzy c-means clustering. IEEE T Fuzzy Syst 17(1): 243–252
Article Google Scholar
He Z, Xie S, Zdunek R, Zhou G, Cichocki A (2011) Symmetric nonnegative matrix factorization: algorithms and applications to probabilistic clustering. IEEE Trans Neural Netw 22(12): 2117–2131
Article Google Scholar
Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the 13th annual ACM symposium on theory of computing (STOC)
Klein D, Kamvar SD, Manning CD (2002) From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering. In: Proceedings of the nineteenth international conference on machine learning (ICML)
Kobayashi M, Aono M (2006) Exploring overlapping clusters using dynamic re-scaling and sampling. Knowl Inf Syst 10: 295–313
Article Google Scholar
Lee DD, Seung HS (2001) Algorithms for Non-negative Matrix Factorization. In: Advances in neural information processing systems 13:556–562
Lee JG, Han J, Whang KY (2007) Trajectory clustering: a partition-and-group framework. In: Proceedings of the 2007 ACM SIGMOD international conference on management of data, SIGMOD ’07
Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform 1(1):24–45
Google Scholar
Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: SIGMOD conference, pp 135–146
Mei JP, Chen L (2010) Fuzzy clustering with weighted medoids for relational data. Pattern Recognit 43(5): 1964–1974
Article MATH Google Scholar
Miettinen P (2008) On the positive-negative partial set cover problem. Inf Process Lett 108(4):219–221
Google Scholar
Murzin A, Brenner S, Hubbard T, Chothia C (1995) Scop—a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247(4): 536–540
Google Scholar
Nanni M, Pedreschi D (2006) Time-focused clustering of trajectories of moving objects. J Intell Inf Syst 27(3): 267–289
Article Google Scholar
Nepusz T, Sasidharan R, Paccanaro A (2010) Scps: a fast implementation of a spectral method for detecting protein families on a genome-wide scale. BMC Bioinform 11(1): 120
Article Google Scholar
Paccanaro A, Casbon JA, Saqi MAS (2006) Spectral clustering of protein sequences. Nucleic Acids Res 34(5): 1571–1580
Article Google Scholar
Palla G, Derenyi I, Farkas I, Vicsek T (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature
Scheinerman ER, Tucker K (2010) Modeling graphs using dot product representations. Comput Stat 25(1):1–16
Google Scholar
Scripps J, Tan PN (2006) Clustering in the presence of bridge-nodes. In: Proceedings of the sixth SIAM international conference on data mining (SDM)
Segal E, Battle A, Koller D (2003) Decomposing gene expression into cellular processes. In: Proceedings of the 8th Pacific symposium on biocomputing (PSB)
Shafiei MM, Milios EE (2006) Model-based overlapping co-clustering. In: Proceedings of the fourth workshop on text mining
Shepard RN, Arabie P (1979) Additive clustering: representation of similarities as combinations of discrete overlapping properties. Psychol Rev 86(2):87–123
Google Scholar
Swamy C (2004) Correlation clustering: maximizing agreements via semidefinite programming. In: Proceedings of the ACM-SIAM symposium on discrete algorithms (SODA)
Tang L, Liu H (2009) Scalable learning of collective behavior based on sparse social dimensions. In: Proceedings of the 18th ACM conference on information and knowledge management (CIKM)
Valiant LG (1990) A bridging model for parallel computation. Commun ACM 33(8): 103–111
Article Google Scholar
Wagstaff K, Cardie C (2000) Clustering with instance-level constraints. In: Proceedings of the 17th international conference on machine learning (ICML)
Wagstaff K, Cardie C, Rogers S, Schrödl S (2001) Constrained k-means clustering with background knowledge. In: Proceedings of the 18th international conference on machine learning (ICML)
Wang X, Tang L, Gao H, Liu H (2010) Discovering overlapping groups in social media. In: The 10th IEEE international conference on data mining (ICDM)
Xiong H, Steinbach M, Ruslim A, Kumar V (2009) Characterizing pattern preserving clustering. Knowl Inf Syst 19: 311–336
Article Google Scholar

Download references

Author information

Authors and Affiliations

Yahoo! Research, Avinguda Diagonal 177, 08018, Barcelona, Spain
Francesco Bonchi, Aristides Gionis & Antti Ukkonen

Authors

Francesco Bonchi
View author publications
You can also search for this author in PubMed Google Scholar
Aristides Gionis
View author publications
You can also search for this author in PubMed Google Scholar
Antti Ukkonen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Antti Ukkonen.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Bonchi, F., Gionis, A. & Ukkonen, A. Overlapping correlation clustering. Knowl Inf Syst 35, 1–32 (2013). https://doi.org/10.1007/s10115-012-0522-9

Download citation

Received: 22 December 2011
Revised: 23 April 2012
Accepted: 16 June 2012
Published: 22 July 2012
Issue Date: April 2013
DOI: https://doi.org/10.1007/s10115-012-0522-9

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Information Theoretic Pairwise Clustering

Simple Iterative Heuristics for Correlation Clustering

Review of Basic Local Searches for Solving the Minimum Sum-of-Squares Clustering Problem

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

Overlapping correlation clustering

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Information Theoretic Pairwise Clustering

Simple Iterative Heuristics for Correlation Clustering

Review of Basic Local Searches for Solving the Minimum Sum-of-Squares Clustering Problem

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation