Clustering is a core task in data analysis, and K-means is widely used for its simplicity and efficiency. Its performance, however, is affected by three challenges: the handling of outliers, the transformation of non-spherical data into a spherical form, and the selection of the optimal number of clusters. This paper addressed these challenges by developing and enhancing specific models, with the primary objective of improving the robustness and accuracy of K-means clustering. To handle outliers, the research employed the winsorization method, which uses threshold values to limit the influence of extreme data points. To transform non-spherical data into a spherical form, the KROMD method was introduced, which combines Manhattan distance with a Gaussian kernel; this gave a more accurate representation of the data and facilitated better clustering performance. The third objective was to enhance the gap statistic for selecting the optimal number of clusters. This was achieved by standardizing the expected value of the reference data using an exponential distribution, which provides a more reliable criterion for determining the appropriate number of clusters. Experimental results demonstrated that winsorization handled outliers effectively and improved clustering stability. The KROMD method significantly improved the accuracy of converting non-spherical data into spherical form, achieving an accuracy level of 0.83 percent and an execution time of 0.14 seconds. The enhanced gap statistic outperformed other techniques in selecting the optimal number of clusters, achieving an accuracy of 93.35 percent and an execution time of 0.1433 seconds. Together, these advancements make K-means clustering more robust and effective for complex data analysis tasks.
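As a rough illustration of two ingredients named in the abstract, the sketch below shows percentile-based winsorization and a Gaussian kernel evaluated on Manhattan distance. It is not the paper's implementation: the abstract does not specify the thresholds or the full KROMD definition, so the function names, the default 5th/95th percentile cut-offs, and the bandwidth `sigma` are all assumptions made for this example.

```python
import math


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 1]."""
    if not sorted_vals:
        raise ValueError("empty input")
    pos = q * (len(sorted_vals) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def winsorize(values, lower_q=0.05, upper_q=0.95):
    """Clip values to the [lower_q, upper_q] percentile thresholds.

    The cut-offs are illustrative defaults, not values taken from the paper.
    Extreme points are pulled in to the thresholds rather than removed.
    """
    s = sorted(values)
    lo_t = percentile(s, lower_q)
    hi_t = percentile(s, upper_q)
    return [min(max(v, lo_t), hi_t) for v in values]


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def gaussian_kernel_manhattan(a, b, sigma=1.0):
    """Gaussian kernel applied to the Manhattan distance.

    Only one assumed building block; the paper's KROMD method involves
    more than this single similarity computation.
    """
    d = manhattan(a, b)
    return math.exp(-(d ** 2) / (2 * sigma ** 2))
```

For example, `winsorize([1, 2, 3, 4, 100], 0.25, 0.75)` pulls the extreme value 100 down to the upper threshold while leaving the middle values unchanged, which is the behaviour that limits the effect of outliers on cluster centroids.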
Citation: Iliyas Karim khan, Hanita Binti Daud, Nooraini binti Zainuddin, Rajalingam Sokkalingam, Abdussamad, Abdul Museeb, Agha Inayat. Addressing limitations of the K-means clustering algorithm: outliers, non-spherical data, and optimal cluster selection[J]. AIMS Mathematics, 2024, 9(9): 25070-25097. doi: 10.3934/math.20241222