×

Approaches for scaling DBSCAN algorithm to large spatial databases. (English) Zbl 0970.68583

Summary: The huge amount of information stored in databases owned by corporations (e.g., retail, financial, telecom) has spurred a tremendous interest in the area of knowledge discovery and data mining. Clustering, in data mining, is a useful technique for discovering interesting data distributions and patterns in the underlying data, and has many application fields, such as statistical data analysis, pattern recognition, image processing, and other business applications. Although researchers have been working on clustering algorithms for decades, and a lot of algorithms for clustering have been developed, there is still no efficient algorithm for clustering very large databases and high dimensional data. As an outstanding representative of clustering algorithms, DBSCAN algorithm shows good performance in spatial data clustering. However, for large spatial databases, DBSCAN requires large volume of memory support and could incur substantial I/O costs because it operates directly on the entire database. In this paper, several approaches are proposed to scale DBSCAN algorithm to large spatial databases. To begin with, a fast DBSCAN algorithm is developed, which considerably speeds up the original DBSCAN algorithm. Then a sampling based DBSCAN algorithm, a partitioning-based DBSCAN algorithm, and a parallel DBSCAN algorithm are introduced consecutively. Following that, based on the above-proposed algorithms, a synthetic algorithm is also given. Finally, some experimental results are given to demonstrate the effectiveness and efficiency of these algorithms.

MSC:

68U99 Computing methodologies and applications
68P15 Database theory

Software:

clusfind
Full Text: DOI

References:

[1] Chen M S, Han J H, Yu P S. Data mining: An overview from a database perspective.IEEE Trans. KDE, 1996, 8(6): 866–883.
[2] Ng R T, Han J. Efficient and effective clustering methods for spatial data mining. InProceedings of the 20th VLDB Conference, Santiago, Chile, 1994, pp. 144–155.
[3] Zhang T, Ramakrishnan R, Livny M. BIRCH: An efficient data clustering method for very large databases. InProceedings of the ACM SIGMOD International Conference on Management of Data, Montreal, Canada, 1996, pp.103–114.
[4] Ester M, Kriegel H P, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. InProceedings of 2nd International Conference on Knowledge Discovering in Databases and Data Mining (KDD-96), Portland, Oregon, August 1996.
[5] Guha S, Rastogi R, Shim K. CURE: An efficient clustering algorithm for large databases. InProceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, ACM Press, 1998, pp.428–439. · Zbl 1006.68661
[6] Zhang W, Yang J, Muntz R. STING: A statistical information grid approach to spatial data mining. InProceedings of the 23rd VLDB Conference, Athens, Greece, 1997, pp.186–195.
[7] Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. InProceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, ACM Press, 1998, pp.73–84.
[8] Sheikholeslami G, Chatterjee S, Zhang A. Wave Cluster: A multi-resolution clustering approach for very large spatial databases. InProceedings of the 24th VLDB Conference, New York City, August 1998, pp.428–439.
[9] Bechmann N, Kriegel H P, Schneider R, Seeger B. The R*-tree: An efficient and robust access method for points and rectangles. InProc. ACM SIGMOD Int. Conf. Management of Data, Altantic City, NJ, 1990, pp.322–331.
[10] Kaufman L, Rousseeuw P J. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990. · Zbl 1345.62009
[11] Ester M, Kriegel H P, Xu X. Knowledge discovery in large spatial database: Focusing techniques for efficient class identification. InProceedings of 4th International Symposium on Large Spatial Databases, Portland, ME, 1995, InLecture Notes in Computer Science, Vol.951, Springer, 1995, pp.67–82.
[12] Vitter J. Random sampling with reservoir.ACM Transactions on Mathematical Software, 1985, 11(1): 37–57. · Zbl 0562.68028 · doi:10.1145/3147.3165
[13] Motwani R, Raghavan P. Randomized Algorithms. Cambridge University Press, 1995. · Zbl 0849.68039
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.