Clustering is a fundamental technique for revealing structure within data. Density-based clustering has the advantage of detecting structure without prior knowledge, such as the data distribution or the number of clusters. However, existing density-based clustering algorithms are usually limited in the latent density patterns they can handle, which hinders their applicability to complex datasets. We showed that this limitation may stem from the bias of standard point-density estimators, such as kernel density estimation (KDE), which tend to favor compact clusters. To address this, we introduced the concept of local comparative density and proposed a specialized technique well suited to capturing it, allowing us to balance density variations across clusters. Building on this foundation, we presented a halo-based clustering method that can effectively discover clusters with arbitrary shapes, heterogeneous densities, peakless distributions, and unbalanced sizes. The method can also identify high-quality noise within the data. We demonstrated the superiority of the proposed algorithm by comparing it with state-of-the-art density-based clustering algorithms on artificial and real-world datasets.
Citation: Le Li, Fei Wang. HBC: halo-based clustering using local comparative density[J]. Applied Computing and Intelligence, 2024, 4(2): 164-183. doi: 10.3934/aci.2024010
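The idea of balancing density variations across clusters can be illustrated with a generic relative-density measure: normalize each point's kNN density by the mean density of its neighbors. This is a hedged sketch of the general principle (cf. relative-density measures in the literature), not necessarily the paper's exact definition of local comparative density; the function name and parameters are illustrative:

```python
# Generic relative (locally comparative) density: each point's kNN density
# is divided by the mean density of its k nearest neighbors, putting
# compact and sparse clusters on a comparable scale near 1.
import numpy as np

def knn_relative_density(X, k=5):
    # pairwise Euclidean distances between all points
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, 1:k + 1]        # k nearest neighbors, self excluded
    knn_dist = np.take_along_axis(d, idx, axis=1)
    rho = k / knn_dist.sum(axis=1)                 # absolute kNN density
    rel = rho / rho[idx].mean(axis=1)              # density relative to neighborhood
    return rho, rel

rng = np.random.default_rng(1)
compact = rng.normal(0.0, 0.1, size=(50, 2))       # tight cluster
sparse = rng.normal(5.0, 1.0, size=(50, 2))        # diffuse cluster
X = np.vstack([compact, sparse])
rho, rel = knn_relative_density(X)
# absolute densities (rho) differ by orders of magnitude between the two
# clusters, while relative densities (rel) sit on a comparable scale in both
```

Under such a normalization, a single threshold on the relative density treats both clusters symmetrically, which is the property the abstract attributes to local comparative density.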