The density peaks clustering (DPC) algorithm plays an important role in data mining: it quickly identifies cluster centers from a decision graph and can detect clusters of arbitrary shape. However, the decision graph introduces uncertainty in determining the cluster centers, which can lead to an incorrect number of clusters. In addition, the cut-off distance parameter relies on prior knowledge, which limits the algorithm in practice. To address these issues, we propose an improved automatic density peaks clustering (ADPC) algorithm. First, a novel clustering validity index called the density-distance clustering (DDC) index is introduced. The DDC index is inspired by the density and distance characteristics of cluster centers; it is applicable to DPC and consistent with the general definition of clustering. Based on the DDC index, the ADPC algorithm automatically selects a suitable cut-off distance and obtains the optimal number of clusters without additional parameters. Numerical experiments validate that the proposed ADPC algorithm automatically determines the optimal number of clusters and cut-off distance, and that it significantly outperforms the DPC, AP and DBSCAN algorithms.
Citation: Xiao Xu, Hong Liao, Xu Yang. An automatic density peaks clustering based on a density-distance clustering index[J]. AIMS Mathematics, 2023, 8(12): 28926-28950. doi: 10.3934/math.20231482
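For readers unfamiliar with DPC, the following minimal sketch (not the authors' implementation) illustrates the two quantities the algorithm computes for every point: the local density rho within the cut-off distance d_c, and the distance delta to the nearest point of higher density. Candidate cluster centers are points for which both values are large, which is exactly where the decision graph becomes ambiguous and where the cut-off distance enters as a user-supplied parameter. The Gaussian-kernel density and all names below are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dpc_rho_delta(X, dc):
    """Compute DPC's local density (rho) and separation distance (delta).

    X  : (n, d) array of data points
    dc : cut-off distance (the parameter that ADPC selects automatically)
    """
    D = squareform(pdist(X))                        # pairwise Euclidean distances
    rho = np.exp(-(D / dc) ** 2).sum(axis=1) - 1.0  # Gaussian-kernel density, self term removed
    order = np.argsort(-rho)                        # point indices by decreasing density
    delta = np.empty(len(rho))
    delta[order[0]] = D[order[0]].max()             # densest point: use its largest distance
    for rank in range(1, len(order)):
        i = order[rank]
        delta[i] = D[i, order[:rank]].min()         # distance to nearest higher-density point
    return rho, delta

# Decision-graph heuristic: candidate centers maximize gamma = rho * delta,
# e.g. centers = np.argsort(-(rho * delta))[:k] for a chosen k.
```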