×

A general framework for evaluating and comparing soft clusterings. (English) Zbl 07838237

Summary: In this article, we propose a general framework for the development of external evaluation measures for soft clustering. Our proposal is based on the interpretation of soft clustering as representing uncertain information about an underlying, unknown hard clustering. We present a general construction, based on optimal transport theory, by which any evaluation measure can be naturally extended to soft clustering. The proposed “transport-based measure” provides an objective, interval-valued comparison index that represents the range of compatibility between two soft clusterings. We study the metric and complexity properties of the proposed approach, as well as its relationship with other existing proposals. We also propose approximation and bounding algorithms that make the approach practical for large datasets. Finally, we illustrate the application of the proposed method through two computational experiments.

MSC:

68-XX Computer science
62-XX Statistics

Software:

MBCbook
Full Text: DOI

References:

[1] Ahmed, M.; Mahmood, A. N.; Hu, J., A survey of network anomaly detection techniques, J. Network Comput. Appl., 60, 19-31 (2016)
[2] Anderson, D. T.; Bezdek, J. C.; Popescu, M.; Keller, J. M., Comparing fuzzy, probabilistic, and possibilistic partitions, IEEE Trans. Fuzzy Syst., 18, 906-918 (2010)
[3] Anderson, D. T.; Zare, A.; Price, S., Comparing fuzzy, probabilistic, and possibilistic partitions using the earth mover’s distance, IEEE Trans. Fuzzy Syst., 21, 766-775 (2012)
[4] Ashtari, P.; Haredasht, F. N.; Beigy, H., Supervised fuzzy partitioning, Pattern Recogn., 97, Article 107013 pp. (2020)
[5] Bassetti, F.; Gualandi, S.; Veneroni, M., On the computation of Kantorovich-Wasserstein distances between two-dimensional histograms by uncapacitated minimum cost flows, SIAM J. Optim., 30, 2441-2469 (2020) · Zbl 1450.90008
[6] Bezdek, J. C., Pattern recognition with fuzzy objective function algorithms (1981), Springer Science & Business Media · Zbl 0503.68069
[7] Bouveyron, C.; Celeux, G.; Murphy, T. B.; Raftery, A. E., Model-based clustering and classification for data science: with applications in R (2019), Cambridge University Press · Zbl 1436.62006
[8] Brouwer, R. K., Extending the Rand, adjusted Rand and Jaccard indices to fuzzy partitions, J. Intell. Inform. Syst., 32, 213-235 (2009)
[9] Campagner, A.; Ciucci, D., Orthopartitions and soft clustering: soft mutual information measures for clustering validation, Knowl.-Based Syst., 180, 51-61 (2019)
[10] Campello, R. J., A fuzzy extension of the rand index and other related indexes for clustering and classification assessment, Pattern Recogn. Lett., 28, 833-841 (2007)
[11] Day, W. H., The complexity of computing metric distances between partitions, Mathematical Social Sciences, 1, 269-287 (1981) · Zbl 0497.62049
[12] Dempster, A., Upper and lower probabilities induced by a multivalued mapping, Ann. Math. Stat., 38, 325-339 (1967) · Zbl 0168.17501
[13] Denœux, T., Inner and outer approximation of belief structures using a hierarchical clustering approach, Int. J. Uncertainty, Fuzziness Knowl.-Based Syst., 9, 437-460 (2001) · Zbl 1113.68494
[14] Denœux, T., Calibrated model-based evidential clustering using bootstrapping, Inf. Sci., 528, 17-45 (2020) · Zbl 1459.68168
[15] Denoeux, T., Nn-evclus: Neural network-based evidential clustering, Inf. Sci., 572, 297-330 (2021) · Zbl 1528.68372
[16] T. Denœux, D. Dubois, H. Prade, Representations of uncertainty in ai: beyond probability and possibility, in: A Guided Tour of Artificial Intelligence Research. Springer, 2020, pp. 119-150.
[17] Denoeux, T.; Kanjanatarakul, O., Evidential clustering: a review, International symposium on integrated uncertainty in knowledge modelling and decision making, Springer., 24-35 (2016)
[18] Denœux, T.; Li, S.; Sriboonchitta, S., Evaluating and comparing soft partitions: An approach based on Dempster-Shafer theory, IEEE Trans. Fuzzy Syst., 26, 1231-1244 (2017)
[19] Denoeux, T.; Masson, M., Multidimensional scaling of interval-valued dissimilarity data, Pattern Recogn. Lett., 21, 83-92 (2000)
[20] Denœux, T.; Masson, M. H., EVCLUS: evidential clustering of proximity data, IEEE Trans. Syst., Man, Cybern. Part B (Cybern.), 34, 95-109 (2004)
[21] Depaolini, M. R.; Ciucci, D.; Calegari, S.; Dominoni, M., External indices for rough clustering, International Joint Conference on Rough Sets, Springer, 378-391 (2018) · Zbl 1518.68297
[22] D’Urso, P., Informational paradigm, management of uncertainty and theoretical formalisms in the clustering framework: A review, Inf. Sci., 400-401, 30-62 (2017)
[23] Fowlkes, E. B.; Mallows, C. L., A method for comparing two hierarchical clusterings, J. Am. Stat. Assoc., 78, 553-569 (1983) · Zbl 0545.62042
[24] Frigui, H.; Hwang, C.; Rhee, F. C.H., Clustering and aggregation of relational data with applications to image database categorization, Pattern Recogn., 40, 3053-3068 (2007) · Zbl 1118.68620
[25] Gagolewski, M.; Bartoszuk, M.; Cena, A., Are cluster validity measures (in) valid?, Inf. Sci., 581, 620-636 (2021)
[26] Harenberg, S.; Bello, G.; Gjeltema, L.; Ranshous, S.; Harlalka, J.; Seay, R.; Padmanabhan, K.; Samatova, N., Community detection in large-scale networks: a survey and empirical evaluation, Wiley Interdisc. Rev.: Comput. Stat., 6, 426-439 (2014) · Zbl 07912746
[27] Hullermeier, E.; Rifqi, M.; Henzgen, S.; Senge, R., Comparing fuzzy partitions: A generalization of the Rand index and related measures, IEEE Trans. Fuzzy Syst., 20, 546-556 (2011)
[28] Jousselme, A. L.; Maupin, P., Distances in evidence theory: Comprehensive survey and generalizations, Int. J. Approximate Reason., 53, 118-145 (2012) · Zbl 1280.68258
[29] Kantorovich, L. V., Mathematical methods of organizing and planning production, Manage. Sci., 6, 366-422 (1960) · Zbl 0995.90532
[30] Ko, K. I.; Lin, C. L., On the complexity of min-max optimization problems and their approximation, Minimax and Applications. Springer, 219-239 (1995) · Zbl 0847.90117
[31] Krishnapuram, R.; Keller, J. M., A possibilistic approach to clustering, IEEE Trans. Fuzzy Syst., 1, 98-110 (1993)
[32] Lei, Y.; Bezdek, J. C.; Romano, S.; Vinh, N. X.; Chan, J.; Bailey, J., Ground truth bias in external cluster validity indices, Pattern Recogn., 65, 58-70 (2017)
[33] Lingras, P.; West, C., Interval set clustering of web users with rough k-means, J. Intell. Inform. Syst., 23, 5-16 (2004) · Zbl 1074.68586
[34] Lipor, J.; Balzano, L., Clustering quality metrics for subspace clustering, Pattern Recogn., 104, Article 107328 pp. (2020)
[35] Liu, X.; Song, W.; Wong, B. Y.; Zhang, T.; Yu, S.; Lin, G. N.; Ding, X., A comparison framework and guideline of clustering methods for mass cytometry data, Genome Biol., 20, 1-18 (2019)
[36] Masson, M. H.; Denoeux, T., ECM: an evidential version of the fuzzy c-means algorithm, Pattern Recogn., 41, 1384-1397 (2008) · Zbl 1131.68081
[37] Meilă, M., Comparing clusterings-an information based distance, J. Multivar. Anal., 98, 873-895 (2007) · Zbl 1298.91124
[38] Peters, G., Rough clustering utilizing the principle of indifference, Inf. Sci., 277, 358-374 (2014)
[39] Peters, G.; Crespo, F.; Lingras, P.; Weber, R., Soft clustering: Fuzzy and rough approaches and their extensions and derivatives, Int. J. Approximate Reason., 54, 307-322 (2013)
[40] Rand, W. M., Objective criteria for the evaluation of clustering methods, J. Am. Stat. Assoc., 66, 846-850 (1971)
[41] Ruspini, E. H.; Bezdek, J. C.; Keller, J. M., Fuzzy clustering: A historical perspective, IEEE Comput. Intell. Mag., 14, 45-55 (2019)
[42] Schütze, H.; Manning, C. D.; Raghavan, P., Introduction to information retrieval, vol. 39 (2008), Cambridge University Press Cambridge · Zbl 1160.68008
[43] Shafer, G., A mathematical theory of evidence (1976), Princeton University Press · Zbl 0359.62002
[44] Steinley, D.; Brusco, M. J., A note on the expected value of the rand index, Br. J. Math. Stat. Psychol., 71, 287-299 (2018) · Zbl 1460.62191
[45] Sutherland, W. A., Introduction to metric and topological spaces (2009), Oxford University Press · Zbl 1182.54001
[46] Villani, C., Topics in optimal transportation, vol. 58 (2021), American Mathematical Society
[47] Vinh, N. X.; Epps, J.; Bailey, J., Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance, J. Mach. Learn. Res., 11, 2837-2854 (2010) · Zbl 1242.62062
[48] Xia, Q., The geodesic problem in quasimetric spaces, J. Geometr. Anal., 19, 452-479 (2009) · Zbl 1200.54013
[49] Xiong, H.; Li, Z., Clustering validation measures, Data Clustering. Chapman and Hall/CRC, 571-606 (2018)
[50] Zhou, D.; Li, J.; Zha, H., A new Mallows distance based metric for comparing clusterings, (Proceedings of the 22nd International Conference on Machine Learning (2005)), 1028-1035
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.