In clustering, the cost function that is commonly used involves calculating all-pairwise squared distances. In this paper, we formulate the cost function using mean squared error and show that this leads to more balanced clustering compared to centroid-based distance functions, like the sum of squared distances in $ k $-means. The clustering method has been formulated as a cut-based approach, more intuitively called Squared cut (Scut). We introduce an algorithm for the problem which is faster than the existing one based on the Stirling approximation. Our algorithm is a sequential variant of a local search algorithm. We show by experiments that the proposed approach provides better overall optimization of both mean squared error and cluster balance compared to existing methods.
Citation: Mikko I. Malinen, Pasi Fränti. All-pairwise squared distances lead to more balanced clustering[J]. Applied Computing and Intelligence, 2023, 3(1): 93-115. doi: 10.3934/aci.2023006
In clustering, the cost function that is commonly used involves calculating all-pairwise squared distances. In this paper, we formulate the cost function using mean squared error and show that this leads to more balanced clustering compared to centroid-based distance functions, like the sum of squared distances in $ k $-means. The clustering method has been formulated as a cut-based approach, more intuitively called Squared cut (Scut). We introduce an algorithm for the problem which is faster than the existing one based on the Stirling approximation. Our algorithm is a sequential variant of a local search algorithm. We show by experiments that the proposed approach provides better overall optimization of both mean squared error and cluster balance compared to existing methods.
[1] | J. H. Ward Jr, Hierarchical grouping to optimize an objective function, J. Am. Stat. Assoc., 58 (1963), 236–244. https://doi.org/10.1080/01621459.1963.10500845 doi: 10.1080/01621459.1963.10500845 |
[2] | T. Kohonen, Median strings, Pattern Recogn. Lett., 3 (1985), 309–313. https://doi.org/10.1016/0167-8655(85)90061-3 doi: 10.1016/0167-8655(85)90061-3 |
[3] | V. Hautamäki, P. Nykänen, P. Fränti, Time-series clustering by approximate prototypes, 19th International conference on pattern recognition, (2008), 1–4. IEEE. https://doi.org/10.1109/ICPR.2008.4761105 |
[4] | P. Fränti, R. Mariescu-Istodor, Averaging gps segments: competition 2019, Pattern Recogn., 112 (2021), 107730. https://doi.org/10.1016/j.patcog.2020.107730 doi: 10.1016/j.patcog.2020.107730 |
[5] | P. Fränti, S. Sieranoja, K. Wikström, T. Laatikainen, Clustering diagnoses from 58m patient visits in Finland 2015-2018, 2022. |
[6] | M. Fatemi, P. Fränti, Clustering nordic twitter users based on their connections, 2023. |
[7] | M. I. Malinen, P. Fränti, Clustering by analytic functions, Inform. Sciences, 217 (2012), 31–38. https://doi.org/10.1016/j.ins.2012.06.018 doi: 10.1016/j.ins.2012.06.018 |
[8] | M. I. Malinen, P. Fränti, Balanced $k$-means for clustering, in: Joint Int. Workshop on Structural, Syntactic, and Statistical Pattern Recognition (S+SSPR 2014), LNCS 8621, Joensuu, Finland, 2014. |
[9] | D. Aloise, A. Deshpande, P. Hansen, P. Popat, NP-hardness of Euclidean sum-of-squares clustering, Mach. Learn., 75 (2009), 245–248. https://doi.org/10.1007/s10994-009-5103-0 doi: 10.1007/s10994-009-5103-0 |
[10] | M. Inaba, N. Katoh, H. Imai, Applications of Weighted Voronoi Diagrams and Randomization to Variance-Based $k$-Clustering, ACM symposium on computational geometry (SCG 1994), (1994), 332–339. https://doi.org/10.1145/177424.178042 doi: 10.1145/177424.178042 |
[11] | J. MacQueen, Some methods of classification and analysis of multivariate observations, Berkeley Symp. Mathemat. Statist. Probab., 1 (1967), 281–297. |
[12] | W. H. Equitz, A New Vector Quantization Clustering Algorithm, IEEE Trans. Acoust., Speech, Signal Processing, 37 (1989), 1568–1575. https://doi.org/10.1109/29.35395 doi: 10.1109/29.35395 |
[13] | P. Fränti, O. Virmajoki, V. Hautamäki, Fast agglomerative clustering using a k-nearest neighbor graph, IEEE T. Pattern Anal., 28 (2006), 1875–1881. https://doi.org/10.1109/TPAMI.2006.227 doi: 10.1109/TPAMI.2006.227 |
[14] | P. Fränti, O. Virmajoki, Iterative shrinking method for clustering problems, Pattern Recogn., 39 (2006), 761–765. https://doi.org/10.1016/j.patcog.2005.09.012 doi: 10.1016/j.patcog.2005.09.012 |
[15] | P. Fränti, Efficiency of random swap clustering, Journal of Big Data, 5 (2018), 1–29. https://doi.org/10.1186/s40537-018-0122-y doi: 10.1186/s40537-018-0122-y |
[16] | B. Fritzke, Breathing k-means, arXiv: 2006.15666. |
[17] | C. Baldassi, Recombinator-k-means:an evolutionary algorithm that exploits k-means++ for recombination, IEEE T. Evolut. Comput., 26 (2022), 991–1003. |
[18] | A. P. Dempster, N. M. Laird, D. B. Rubin, Maximun likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. B, 39 (1977), 1–38. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x doi: 10.1111/j.2517-6161.1977.tb01600.x |
[19] | Q. Zhao, V. Hautamäki, I. Kärkkäinen, P. Fränti, Random swap EM algorithm for finite mixture models in image segmentation, IEEE International Conference on Image Processing (ICIP), (2009), 2397–2400. https://doi.org/10.1109/ICIP.2009.5414459 doi: 10.1109/ICIP.2009.5414459 |
[20] | J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE T. Pattern Anal., 22 (2000), 888–905. https://doi.org/10.1109/34.868688 doi: 10.1109/34.868688 |
[21] | C. H. Q. Ding, X. He, H. Zha, M. Gu, H. D. Simon, A min-max cut algorithm for graph partitioning and data clustering, IEEE International Conference on Data Mining (ICDM), (2001), 107–114. |
[22] | M. I. Malinen, P. Fränti, K-means*: Clustering by gradual data transformation, Pattern Recogn., 47 (2014), 3376–3386. https://doi.org/10.1016/j.patcog.2014.03.034 doi: 10.1016/j.patcog.2014.03.034 |
[23] | R. Nallusamy, K. Duraiswamy, R. Dhanalaksmi, P. Parthiban, Optimization of non-linear multiple traveling salesman problem using k-means clustering, shrink wrap algorithm and meta-heuristics, International Journal of Nonlinear Science, 9 (2010), 171–177. |
[24] | R. Mariescu-Istodor, P. Fränti, Solving the large-scale tsp problem in 1 h: Santa claus challenge 2020, Front. Robot. AI, (2021), 1–20. https://doi.org/10.3389/frobt.2021.689908 doi: 10.3389/frobt.2021.689908 |
[25] | D. W. Sambo, B. O. Yenke, A. Förster, P. Dayang, Optimized clustering algorithms for large wireless sensor networks: A review, Sensors, 19 (2019), 322. |
[26] | J. Singh, R. Kumar, A. K. Mishra, Clustering algorithms for wireless sensor networks: A review, International Conference on Computing for Sustainable Global Development (INDIACom), (2015), 637–642. |
[27] | Y. Liao, H. Qi, W. Li, Load-Balanced Clustering Algorithm With Distributed Self-Organization for Wireless Sensor Networks, IEEE Sens. J., 13 (2013), 1498–1506. https://doi.org/10.1109/JSEN.2012.2227704 doi: 10.1109/JSEN.2012.2227704 |
[28] | L. Yao, X. Cui, M. Wang, An energy-balanced clustering routing algorithm for wireless sensor networks, IEEE World Congress on Computer Science and Information Engineering, 3 (2009), 316–320. |
[29] | P. S. Bradley, K. P. Bennett, A. Demiriz, Constrained k-means clustering, Tech. rep., MSR-TR-2000-65, Microsoft Research, 2000. |
[30] | S. Zhu, D. Wang, T. Li, Data clustering with size constraints, Knowledge-Based Syst., 23 (2010), 883–889. https://doi.org/10.1016/j.knosys.2010.06.003 doi: 10.1016/j.knosys.2010.06.003 |
[31] | A. Banerjee, J. Ghosh, Frequency sensitive competitive learning for balanced clustering on high-dimensional hyperspheres, IEEE Transactions on Neural Networks, 15 (2004), 702–719. https://doi.org/10.1109/TNN.2004.824416 doi: 10.1109/TNN.2004.824416 |
[32] | C. T. Althoff, A. Ulges, A. Dengel, Balanced clustering for content-based image browsing, in: GI-Informatiktage 2011, Gesellschaft für Informatik e.V., 2011. |
[33] | A. Banerjee, J. Ghosh, On scaling up balanced clustering algorithms, SIAM International Conference on Data Mining, (2002), 333–349. https://doi.org/10.1137/1.9781611972726.20 doi: 10.1137/1.9781611972726.20 |
[34] | Y. Chen, Y. Zhang, X. Ji, Size regularized cut for data clustering, Advances in Neural Information Processing Systems, 2005. |
[35] | Y. Kawahara, K. Nagano, Y. Okamoto, Submodular fractional programming for balanced clustering, Pattern Recogn. Lett., 32 (2011), 235–243. https://doi.org/10.1016/j.patrec.2010.08.008 doi: 10.1016/j.patrec.2010.08.008 |
[36] | G. Tzortzis, A. Likas, The minmax k-means clustering algorithm, Pattern Recogn., 47 (2014), 2505–2516. https://doi.org/10.1016/j.patcog.2014.01.015 doi: 10.1016/j.patcog.2014.01.015 |
[37] | W. Tang, Y. Yang, L. Zeng, Y. Zhan, Optimizing mse for clustering with balanced size constraints, Symmetry, 11 (2019), 338. https://doi.org/10.3390/sym11030338 doi: 10.3390/sym11030338 |
[38] | L. Hagen, A. B. Kahng, New spectrxal methods for ratio cut partitioning and clustering, IEEE T. Computer-Aided D., 11 (1992), 1074–1085. https://doi.org/10.1109/43.159993 doi: 10.1109/43.159993 |
[39] | T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, Introduction to algorithms (2nd ed.), MIT Press and McGraw-Hill, 2001. |
[40] | M. X. Goemans, D. P. Williamson, Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming, J. ACM, 42 (1995), 1115–1145. https://doi.org/10.1145/227683.227684 doi: 10.1145/227683.227684 |
[41] | S. Arora, S. Rao, U. Vazirani, Expander flows, geometric embeddings and graph partitioning, J. ACM, 56 (2009), 1–37. https://doi.org/10.1145/1502793.1502794 doi: 10.1145/1502793.1502794 |
[42] | U. von Luxburg, A tutorial on spectral clustering, Stat. Comput., 17 (2007), 395–416. https://doi.org/10.1007/s11222-007-9033-z doi: 10.1007/s11222-007-9033-z |
[43] | M. R. Garey, D. S. Johnson, Computers and intractability: A guide to the theory of NP-completeness, W. H. Freeman, 1979. |
[44] | T. D. Bie, N. Cristianini, Fast sdp relaxations of graph cut clustering, transduction, and other combinatorial problems, J. Mach. Learn. Res., 7 (2006), 1409–1436. |
[45] | A. Frieze, M. Jerrum, Improved approximation algorithms for max-$k$-cut and max bisection, Algorithmica, 18 (1997), 67–81. https://doi.org/10.1007/BF02523688 doi: 10.1007/BF02523688 |
[46] | W. Zhu, C. Guo, A local search approximation algorithm for max-$k$-cut of graph and hypergraph, International Symposium on Parallel Architectures, Algorithms and Programming, (2011), 236–240. https://doi.org/10.1109/PAAP.2011.35 doi: 10.1109/PAAP.2011.35 |
[47] | A. V. Kel'manov, A. V. Pyatkin, On the complexity of some quadratic euclidean 2-clustering problems, Comput. Math. Math. Phys., 56 (2016), 491–497. https://doi.org/10.1134/S096554251603009X doi: 10.1134/S096554251603009X |
[48] | L. J. Schulman, Clustering for edge-cost minimization, Ann. ACM Symp. on Theory of Computing (STOC), (2000), 547–555. https://doi.org/10.1145/335305.335373 doi: 10.1145/335305.335373 |
[49] | S. Sahni, T. Gonzalez, P-complete approximation problems, J. ACM, 23 (1976), 555–565. https://doi.org/10.1145/321958.321975 doi: 10.1145/321958.321975 |
[50] | W. F. de la Vega, M. Karpinski, C. Kenyon, Y. Rabani, Approximation schemes for clustering problems, ACM symposium on Theory of computing (STOC '03), (2003), 50–58. https://doi.org/10.1145/780542.780550 doi: 10.1145/780542.780550 |
[51] | N. Guttmann-Beck, R. Hassin, Approximation algorithms for min-sum p-clustering, Discrete Appl. Math., 89 (1998), 125–142. https://doi.org/10.1016/S0166-218X(98)00100-0 doi: 10.1016/S0166-218X(98)00100-0 |
[52] | H. Späth, Cluster analysis algorithms for data reduction and classification of objects, Wiley, New York, 1980. |
[53] | P. Fränti, S. Sieranoja, Clustering datasets, University of Eastern Finland, 2020. Available from: http://cs.uef.fi/sipu/datasets/. |
[54] | P. Fränti, M. Rezaei, Q. Zhao, Centroid index: Cluster level similarity measure, Pattern Recogn., 47 (2014), 3034–3045. https://doi.org/10.1016/j.patcog.2014.03.017 doi: 10.1016/j.patcog.2014.03.017 |
[55] | S. Sieranoja, P. Fränti, Fast and general density peaks clustering, Pattern Recogn. Lett., 128 (2019), 551–558. https://doi.org/10.1016/j.patrec.2019.10.019 doi: 10.1016/j.patrec.2019.10.019 |
[56] | P. Fränti, Genetic algorithm with deterministic crossover for vector quantization, Pattern Recogn. Lett., 21 (2000), 61–68. https://doi.org/10.1016/S0167-8655(99)00133-6 doi: 10.1016/S0167-8655(99)00133-6 |
[57] | T. Cour, S. Yu, J. Shi, Normalized Cut Segmentation Code, 2004. |