
Fast clustering algorithm based on MST of representative points


  • Minimum spanning tree (MST)-based clustering algorithms are widely used to detect clusters with diverse densities and irregular shapes. However, most algorithms require the entire dataset to construct an MST, which leads to significant computational overhead. To alleviate this issue, our proposed algorithm R-MST utilizes representative points instead of all sample points for constructing MST. Additionally, based on the density and nearest neighbor distance, we improved the representative point selection strategy to enhance the uniform distribution of representative points in sparse areas, enabling the algorithm to perform well on datasets with varying densities. Furthermore, traditional methods for eliminating inconsistent edges generally require prior knowledge about the number of clusters, which is not always readily available in practical applications. Therefore, we propose an adaptive method that employs mutual neighbors to identify inconsistent edges and determine the optimal number of clusters automatically. The experimental results indicate that the R-MST algorithm not only improves the efficiency of clustering but also enhances its accuracy.

    Citation: Hui Du, Depeng Lu, Zhihe Wang, Cuntao Ma, Xinxin Shi, Xiaoli Wang. Fast clustering algorithm based on MST of representative points[J]. Mathematical Biosciences and Engineering, 2023, 20(9): 15830-15858. doi: 10.3934/mbe.2023705




    The rapid development of big data technology has been driving research progress in fields such as biomedicine [1,2] and geography [3]. Clustering is an important tool for big data analysis that can help researchers extract useful information from complex and massive data. Existing clustering algorithms can be broadly categorized into partitional clustering, hierarchical clustering, density-based clustering, deep clustering and so on [4]. Partitional clustering approaches optimize an objective function by iteratively controlling the division of N data points into K clusters until the optimal solution is found or the termination condition is met, where K is much less than N [5]. While partitional clustering performs well on datasets with spherical structures and has linear time complexity, it is not suitable for non-convex datasets [6]. Hierarchical clustering and density-based clustering algorithms perform well on datasets with non-convex shapes. Hierarchical clustering techniques cluster datasets through either aggregation or splitting [7]. Aggregation methods merge closely related data points in a tree structure until a specified similarity threshold between nodes is reached [8]. Conversely, splitting methods recursively divide large datasets into two groups of points according to certain thresholds [9]. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a classical density-based clustering algorithm that can effectively cluster datasets of any shape and detect noise points [10]. However, it depends on two input parameters, and its convergence time tends to be long on large datasets [11]. Density Peak Clustering (DPC) is a density-based algorithm that can quickly identify non-spherical clusters; it selects cluster centers under the assumption that centers are surrounded by neighbors with lower local density and are relatively far from other high-density points [12]. However, the quadratic time complexity of DPC is a drawback that renders it unsuitable for processing large-scale datasets. Sami and Pasi [13] proposed a fast density peaks algorithm, called FastDP, which uses an efficient and adaptable construction of an approximate k-nearest neighbor graph for fast density and delta computation. With this mechanism, FastDP overcomes the quadratic time complexity limitation of DPC. Traditional clustering algorithms rely on geometric concepts such as distance or density. However, they are less effective at clustering non-linear high-dimensional data and identifying complex embedded features and hierarchical structures. This limitation has paved the way for the development of deep clustering algorithms [14]. Xie et al. [15] proposed deep embedded clustering (DEC) based on the autoencoder, a commonly used technique in deep clustering. DEC jointly performs unsupervised representation learning and clustering, and it has exhibited improved clustering accuracy on large-scale high-dimensional datasets. However, the performance of this algorithm heavily relies on the quality of the representations learned by the autoencoder. To better extract structural and attribute information from graph data, there has been a surge of interest in graph neural networks [16]. Wang et al. [17] proposed deep attentional embedded graph clustering (DAEGC), which uses a graph neural network to capture the structural information of graph data, adds node attribute information to the input, fuses node and structural information for representation learning, and uses an attention mechanism to aggregate a node's neighbors more effectively.

    Minimum spanning tree (MST) is a highly popular graph structure in graph theory and is extensively employed in clustering analysis owing to its ability to detect clusters with irregular boundaries [18]. Gower and Ross [19] introduced the MST to clustering algorithms in 1969, proposing a single linkage clustering analysis achieved by pruning the MST. This method first constructs the MST and then removes the k-1 longest edges to obtain a single linkage k-partition. In 1971, Zahn [20] formally introduced a clustering algorithm based on the MST, which constructs the MST and iteratively removes inconsistent edges according to edge weight features. Grygorash et al. [21] proposed HEMST, which removes edges from the MST to achieve a reduction in the standard deviation of the edge weights. Müller et al. [22] proposed the ITM algorithm, which employs an entropy-based information-theoretic criterion to identify inconsistent edges, considering both cluster size and the average weight of intra-cluster edges. Genie [23] is a single linkage clustering optimization variant that greedily optimizes the total edge length but only allows the smallest clusters to merge when the Gini index of the cluster sizes exceeds a given threshold. CTCEHC [24] constructs an initial partition based on vertex degree and then merges clusters based on the geodesic distance between cluster centroids. Mishra et al. [25] proposed a hybrid fast MST-based clustering method to improve efficiency: it first divides the data into a large number of sub-clusters based on discreteness, constructs an MST on the sub-cluster centroids to identify adjacent pairs, and finally merges adjacent pairs based on their internal similarity and cohesion. A clustering algorithm based on the MST and the critical distance method (MST-CDC) [26] uses a critical distance value as a threshold, removing inconsistent edges from the MST to obtain sub-clusters and then merging these sub-clusters using shorter inter-cluster distances. These MST-based clustering algorithms mostly suffer from two problems. First, the computational cost of constructing the MST is too high, especially when dealing with large datasets. HEMST, ITM, Genie, CTCEHC and MST-CDC all need to construct an MST on a complete graph generated from the entire dataset at the initial stage, which is a key factor in their low efficiency. The hybrid fast MST-based clustering method uses a very small number of sub-cluster centroids to construct the MST, so its efficiency is less affected by MST construction. The second problem is that identifying inconsistent edges, which determines the quality of clustering, is very challenging. In the single linkage scheme, the goal is to maximize the sum of the weights of the excised inconsistent edges; this method requires a specified number of clusters, is sensitive to noise and is extremely ineffective on datasets with large differences in density distribution. The MST-based clustering algorithm proposed by Zahn considers inconsistent edges to be those whose weights are significantly larger than the average weight of nearby edges; this method cannot easily control the number of inconsistent edges and may yield too many or too few clusters.
    HEMST removes inconsistent edges based on the standard deviation of the edge weights, and this method also requires specifying the number of clusters. ITM utilizes entropy-based information theory to accurately identify inconsistent edges, but it relies on knowing the number of clusters. MST-CDC uses a critical distance value as a threshold to remove inconsistent edges and obtain sub-clusters, which makes the algorithm more robust in the presence of outliers; however, this method may overlook density changes within a region. Genie, CTCEHC and the hybrid fast MST-based clustering method do not require identifying inconsistent edges. Genie adopts an agglomerative strategy, but its limitation is that finding a suitable threshold is difficult. The MST-based partitioning strategy of CTCEHC reduces the complexity of the merging process. However, the clustering results of the hybrid fast MST-based clustering method are easily influenced by the initial partitioning. It is therefore valuable to study how to improve the efficiency of MST-based clustering algorithms while also identifying inconsistent edges automatically, without specifying the number of clusters beforehand.

    If the information of some key points in the dataset can reflect the overall structure of the dataset, then in efficiency-focused algorithms these key points can be used in place of the entire dataset to complete the main work. This replacement idea can reduce the computational cost of the algorithm without significantly affecting clustering quality. Motivated by this idea, we propose an algorithm that constructs the MST of representative points instead of all sample points. The proposed algorithm performs the following major steps and makes the following contributions. First, it divides the dataset into two categories, core points and noncore points. Second, it uses a novel selection strategy to pick a set of representative points from the core points. Third, it constructs an MST of the representative points and then uses an adaptive method to identify and eliminate inconsistent edges. Finally, the algorithm assigns each nonrepresentative point among the core points to a cluster and then assigns the noncore points. Experimental analyses were performed on eight synthetic datasets and twelve UCI datasets. The results show that the algorithm has relatively low execution time and significantly improved clustering quality.

    The most basic MST-based clustering algorithm consists of two steps. First, construct an MST on the complete graph of all points. Then, remove inconsistent edges from the MST to complete the clustering. Under ideal conditions, where there are no outliers and the clusters are well separated, the inconsistent edges are simply the longest edges. The MST-based clustering algorithm is usually divided into three phases: 1) constructing the MST; 2) eliminating inconsistent edges from the MST graph to create the set of connected components; 3) repeating phase 2 until the termination condition is satisfied. Figure 1 shows the main steps of the MST-based clustering algorithm. Figure 1(a) is the initial graph of the Spiral dataset. Figure 1(b) constructs a minimum spanning tree using all the points in the dataset. The black and yellow lines are the edges of the MST, and the two yellow lines (E1 and E2) are the two longest edges of the minimum spanning tree. After cutting off these two longest edges, three subtrees are obtained, each representing a cluster. Figure 1(c) shows the final clustering result, which is divided into three clusters.

    Figure 1.  The main steps of the MST-based clustering algorithm. (a) Spiral dataset; (b) Constructing an MST using all points; (c) Final clustering results.
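
    To make the basic procedure concrete, the following minimal Python sketch (using SciPy, not any code from the paper) clusters a point set by building the MST of the complete distance graph and cutting the k-1 longest edges, mirroring the ideal-case behavior described above; the function name and parameters are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components


def basic_mst_clustering(X, n_clusters):
    """Cut the n_clusters - 1 longest MST edges and label the resulting subtrees."""
    # Complete graph of pairwise Euclidean distances.
    dist = squareform(pdist(X))
    # MST as a dense array; nonzero entries are the tree edges.
    mst = minimum_spanning_tree(dist).toarray()
    rows, cols = np.nonzero(mst)
    weights = mst[rows, cols]
    # Indices of the heaviest edges ("inconsistent" edges in the ideal case).
    cut = np.argsort(weights)[-(n_clusters - 1):] if n_clusters > 1 else []
    mst[rows[cut], cols[cut]] = 0
    # Each connected component of the pruned tree is one cluster.
    _, labels = connected_components(mst, directed=False)
    return labels
```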

    However, in the presence of outliers in the dataset, the longest edge does not necessarily correspond to an inconsistent edge. Therefore, a drawback of this algorithm is its vulnerability to outliers. Figure 2 shows a simple example of the impact of outliers on the MST-based clustering algorithm. Figure 2(a) is the initial graph of the Flame dataset. Figure 2(b) constructs an MST using all the points in the dataset. The black, red and yellow lines are all edges of the MST, and the red edge is the longest edge in the MST. If we cut off this longest edge, we obviously will not obtain the correct clustering result. In this MST, we should instead remove the yellow edge to achieve the desired result, but this yellow edge is often difficult to find. To solve this problem, our algorithm first eliminates some noise and boundary points that are not conducive to finding inconsistent edges before constructing the minimum spanning tree, creating a more ideal environment that alleviates the difficulty of finding inconsistent edges.

    Figure 2.  The impact of outliers on the MST-based clustering algorithm. (a) Flame dataset; (b) Constructing an MST using all points.

    For n data points, constructing an MST on the complete graph costs O(n^2) in total. Constructing the MST is the key step of an MST-based clustering algorithm and the main reason for its low efficiency. To address this issue, our proposed algorithm uses selected representative points in place of all points to construct the MST, significantly reducing the data size without affecting clustering quality and thereby improving the efficiency of the MST-based clustering algorithm.

    Most MST-based clustering algorithms use the entire dataset to construct the MST, which leads to high computational overhead and sensitivity to outliers. Therefore, we propose a clustering algorithm based on an MST of representative points (R-MST). A simple example is given in Figure 3. First, we find the core points, which are shown by the circular points (sky-blue points) in Figure 3(a). The square points (black points) in Figure 3(a) are noncore points. Second, we select representative points from the core points, which are shown by the star-shaped points (red points) in Figure 3(b). The circular points (sky-blue points) in Figure 3(b) are the nonrepresentative points from the core points. Third, we construct an MST of representative points, as shown in Figure 3(c). Then, the inconsistent edges in the MST are identified and cut off. Figure 3(d) shows two subtrees obtained by cutting off an inconsistent edge. Each subtree represents a cluster, and the representative points on the subtree belong to the same cluster. Finally, each nonrepresentative point from the core points is assigned to the cluster to which its representative point belongs, as shown in Figure 3(e). Each noncore point is assigned to the cluster to which the nearest core point belongs, as shown in Figure 3(f). The specific process of R-MST is illustrated in Algorithm 4.

    Figure 3.  The major steps of the R-MST. (a) Dividing the dataset into core points and noncore points; (b) Selecting representative points from the core points; (c) Constructing an MST of representative points; (d) Cutting out inconsistent edges in MST; (e) Assigning nonrepresentative points from core points; (f) Assigning noncore points.

    Under the interference of noise and boundary points, it is difficult for the MST-based clustering algorithm to find the correct inconsistent edges. Therefore, we eliminate noise and boundary points before constructing the MST and create a relatively ideal environment, which makes it easier for the algorithm to find inconsistent edges. As shown in Figure 2, the two clusters in the Flame dataset are tightly connected, making it difficult to find suitable inconsistent edges. By removing the noise and boundary points, the core points of the two clusters are clearly separated, which is very conducive to the subsequent search for inconsistent edges. In our algorithm, noise and boundary points are classified as noncore points, and all other points are classified as core points. The subsequent steps are mainly performed on the core points, and the noncore points are assigned to nearby clusters in the final stage of the algorithm.

    The reverse nearest neighbors of a point are the data objects in the dataset that consider this point one of their nearest neighbors. According to the concept of natural neighborhood, the number of reverse neighbors of noise and boundary points is relatively small, which means that there are fewer points around them. By contrast, core points have relatively many reverse neighbors and are closer to the points around them. Therefore, we screen the core points based on reverse nearest neighbors and distance.

    Let $X = \{x_1, x_2, \ldots, x_n\}$ be a dataset containing n samples. For each point $x_i$, $N_k(x_i)$ denotes the k-th nearest point to $x_i$, and $d(x_i, x_j)$ is the Euclidean distance between points $x_i$ and $x_j$. The set of the k nearest neighbors of $x_i$, denoted by $KN_k(x_i)$ [27], is given in Definition 1. If point $x_i$ is one of the nearest neighbors of point $x_m$, i.e., $x_i \in KN_k(x_m)$, then $x_m$ is a reverse neighbor of $x_i$. The set of reverse nearest neighbors of point $x_i$, denoted by $RN_k(x_i)$ [27], is given in Definition 2. The sum of the distances from $x_i$ to its k nearest points, denoted $DN_k(x_i)$, is given in Definition 3. $LRN_k(x_i)$ denotes the number of reverse nearest neighbors of point $x_i$ and is given in Definition 4. $LRN_{k\_med}$ denotes the median of the numbers of reverse nearest neighbors over all points and is given in Definition 5.

    Definition 1: (k nearest neighbors)

    $KN_k(x_i) = \{x_j \in X \mid d(x_i, x_j) \le d(x_i, N_k(x_i))\}$ (1)

    Definition 2: (Reverse nearest neighbors)

    $RN_k(x_i) = \{x_m \in X \mid x_i \in KN_k(x_m)\}$ (2)

    Definition 3: (The sum of the distances from $x_i$ to its k nearest points)

    $DN_k(x_i) = \sum_{j=1}^{k} d(x_i, N_j(x_i))$ (3)

    Definition 4: (The number of the reverse nearest neighbors of point $x_i$)

    $LRN_k(x_i) = |RN_k(x_i)|$ (4)

    Definition 5: (The median of the number of the reverse nearest neighbors of all points)

    $LRN_{k\_med} = \mathrm{Median}(\{LRN_k(x_i) \mid x_i \in X\})$ (5)

    Definition 6: (Core point) For any point $x_i \in D$, where $D = \{x_1, x_2, \ldots, x_n\}$ denotes the initial dataset comprising n samples: if the point satisfies Condition 1 or Condition 2, it is considered a core point; if it satisfies neither Condition 1 nor Condition 2, it is considered a noncore point. Condition 1 identifies core points from the perspective of reverse nearest neighbors: the number of reverse nearest neighbors of point $x_i$ is not less than the median of the reverse-neighbor counts of all points. Condition 2 identifies core points from the perspective of distance: $DN_{k_1}(x_i)$ is not greater than the average, over all points, of the sum of the distances to their $k_1$ nearest neighbors. The parameter $k_1$ is the number of nearest neighbors used in the initial dataset D.

    Condition 1:

    $LRN_{k_1}(x_i) \ge LRN_{k_1\_med}$ (6)

    Condition 2:

    $DN_{k_1}(x_i) \le \frac{1}{n}\sum_{j=1}^{n} DN_{k_1}(x_j)$ (7)

    Let $C = \{c_1, c_2, \ldots, c_m\}$ denote the set of core points, where m is the number of core points. The procedure for dividing the dataset into core points and noncore points is shown in Algorithm 1.

    Algorithm 1: Dividing the dataset into core points and noncore points
    Input: Dataset D = {x_1, x_2, x_3, ..., x_n}, k_1
    Output: The set of core points C = {c_1, c_2, c_3, ..., c_m}, the set of noncore points O = {o_1, o_2, o_3, ..., o_{n-m}}
    /* Obtain the k_1 nearest neighbors of each point in the initial dataset. */
    For each x_i in D do
      Calculate KN_k1(x_i)
    End for
    /* Obtain the reverse neighbors of each point in the initial dataset. */
    For each x_i in D do
      Calculate RN_k1(x_i)
    End for
    Create C = ∅, O = ∅
    /* Determine whether each point is a core point or a noncore point. */
    For each x_i in D do
      If x_i meets Condition 1 or Condition 2 then
        C ← C ∪ {x_i}
      Else
        O ← O ∪ {x_i}
      End if
    End for
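
    As an illustration of Algorithm 1, the sketch below is one plausible Python implementation (not the authors' code): it computes the $k_1$ nearest neighbors, the reverse-neighbor counts $LRN_{k_1}$ and the summed distances $DN_{k_1}$, and then applies Conditions 1 and 2; all names are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def split_core_noncore(X, k1):
    """Split dataset X into core and noncore point indices (Conditions 1 and 2)."""
    n = X.shape[0]
    # k1 nearest neighbors of every point, excluding the point itself.
    dist, idx = NearestNeighbors(n_neighbors=k1 + 1).fit(X).kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]

    # LRN_k1: how many points count x_i among their k1 nearest neighbors.
    lrn = np.zeros(n, dtype=int)
    for neighbors in idx:
        lrn[neighbors] += 1

    dn = dist.sum(axis=1)              # DN_k1(x_i), Definition 3
    cond1 = lrn >= np.median(lrn)      # Condition 1: enough reverse neighbors
    cond2 = dn <= dn.mean()            # Condition 2: close enough to its neighbors
    core_mask = cond1 | cond2
    return np.where(core_mask)[0], np.where(~core_mask)[0]
```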

    Representative points can be seen as an important subset of a dataset, as they often have higher information value or representativeness, allowing for a reduction in computational and storage costs without sacrificing accuracy. When selecting representative points, it is important to consider factors such as the distribution, data density and distance measurement of the dataset to ensure an effective representation of its diversity. In addition, the number of representative points should be significantly smaller than the size of the original dataset. Inspired by the concept of representative points, Chowdhury et al. treat each representative point found as a subcluster, and they then complete the clustering by merging the subclusters [27]. The method they proposed for selecting representative points is to find the point with the highest density within a neighborhood of a point as its representative point. This method is effective for datasets with relatively uniform density distributions. However, in datasets with varying densities, it may lead to the scarcity and dispersion of representative points within clusters composed of low-density points. Therefore, we propose an improved strategy for selecting representative points, aiming to achieve a more uniform distribution of representative points within clusters composed of low-density points.

    Our improved selection strategy for representative points considers both density and nearest neighbor distance. On top of the strategy of selecting the highest-density point in the neighborhood as the representative point, an additional criterion allows some points with low density and a large average distance to their surrounding neighbors to select themselves as representative points. If every point chooses the point with the highest density within its nearest-neighbor range as its representative point, abnormally large distances may appear between adjacent representative points, such as the adjacent representative points v3 and v4 in Figure 5(d). If the point v0 between v3 and v4 is chosen as a representative point, this problem can be avoided, as shown in Figure 5(b). Key points like v0 are characterized by low density and a large average distance from their surrounding neighbors, and they play a very important role in sparse areas. If these points are lost, the overall structure of the sparse area is damaged, causing abnormally long edges inside the sparse cluster during MST construction, which may split a sparse cluster into multiple clusters and produce incorrect results. Therefore, selecting such points as representative points greatly improves the distribution uniformity of representative points in sparse areas. In addition, our improved strategy has almost no impact on the selection of representative points in dense regions, because the point density there is generally high and such points do not meet the additional criterion. The flow diagram of our improved method for selecting representative points is shown in Figure 4. Algorithm 2 shows the details of selecting representative points.

    Figure 4.  Flow diagram for selecting representative points.

    Algorithm 2: Selecting representative points
    Input: The set of core points C = {c_1, c_2, c_3, ..., c_m}, k_2
    Output: The set of representative points DREP = {drep_1, drep_2, ..., drep_nrep}, the set of nonrepresentative points NDREP = {ndrep_1, ndrep_2, ..., ndrep_{m-nrep}}
    /* Obtain the k_2 nearest neighbors of each core point in the set C. */
    For each c_i in C do
      Calculate KN_k2(c_i)
    End for
    /* Obtain the reverse neighbors of each core point in the set C. */
    /* Obtain the sum of distances between each core point and its k_2 nearest neighbors within the set C. */
    For each c_i in C do
      Calculate RN_k2(c_i)
      Calculate DN_k2(c_i)
    End for
    /* Obtain the density of each core point. */
    For each c_i in C do
      Calculate ρ(c_i)
    End for
    Calculate ρ_med, max_d, mean_d
    /* The set REP stores the representative point corresponding to each core point. */
    Create REP = ∅
    For each c_i in C do
      If c_i meets Condition 3 then  /* Scenario (1) for selecting representative points. */
        R(c_i) = c_i
      Else if c_i meets Condition 4 then  /* Scenario (2) for selecting representative points. */
        R(c_i) = c_i
      Else  /* Scenario (3) for selecting representative points. */
        Calculate R(c_i) according to Eq (9)
      End if
      /* Add the representative point corresponding to c_i to REP. */
      REP ← REP ∪ {R(c_i)}
    End for
    Remove duplicate elements from REP to obtain the final representative point set DREP.
    Obtain the set of nonrepresentative points NDREP according to Eq (16).
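
    The following Python sketch illustrates one way Algorithm 2 could be implemented (an assumption, not the authors' code), using the density of Eq (8), Conditions 3 and 4 as reconstructed in Eqs (10)-(12) and the Scenario (3) rule of Eq (9); the helper names are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def select_representatives(C, k2):
    """Sketch of Algorithm 2: choose a representative point for each core point.

    C is an (m, d) array of core points. Returns rep (index of each point's
    representative) and drep (the de-duplicated representative indices).
    """
    m = C.shape[0]
    dist, idx = NearestNeighbors(n_neighbors=k2 + 1).fit(C).kneighbors(C)
    dist, idx = dist[:, 1:], idx[:, 1:]          # drop the self-neighbor

    # Reverse-neighbor counts LRN_k2 and density rho as in Eq (8).
    lrn = np.zeros(m, dtype=int)
    rev = [[] for _ in range(m)]
    for i, neighbors in enumerate(idx):
        lrn[neighbors] += 1
        for j in neighbors:
            rev[j].append(i)
    rho = np.array([lrn[i] + lrn[rev[i]].sum() if lrn[i] > 0 else 0
                    for i in range(m)])

    dn = dist.sum(axis=1)                        # DN_k2
    rho_med, max_d, mean_d = np.median(rho), dn.max(), dn.mean()

    rep = np.empty(m, dtype=int)
    for i in range(m):
        cond3 = rho[i] <= rho_med and dn[i] > mean_d + (max_d - mean_d) / 2
        cond4 = (rho[i] > rho[idx[i]]).all()     # denser than every k2-neighbor
        if cond3 or cond4:
            rep[i] = i                           # Scenarios (1)/(2): represent itself
        else:
            rep[i] = idx[i][np.argmax(rho[idx[i]])]  # Scenario (3): densest neighbor
    drep = np.unique(rep)
    return rep, drep
```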

    We specify the representative point of each core point in three different scenarios. Scenario (1): if a point has low density and is far from its neighbors, its representative point is itself. Scenario (2): if a point does not satisfy Scenario (1), but its density is higher than that of every point within its $k_2$ nearest neighbor range, its representative point is also itself. Scenario (3): if a point satisfies neither Scenario (1) nor Scenario (2), its representative point is the neighbor with the highest density within its $k_2$ nearest neighbor range. Density is given by Definition 7, and the representative point by Definition 8. Here, for any point $c_i \in C$, $C = \{c_1, c_2, \ldots, c_m\}$ denotes the core point set comprising m core points, and the parameter $k_2$ is the number of nearest neighbors used in the core point set C.

    Definition 7: (Density) It is difficult for the traditional radius-based density estimation method to accurately evaluate the density level of each data point. Therefore, in order to better adapt to datasets with different densities, the nearest-neighbor characteristics of the adjacent points around each point are considered when evaluating its density. If the number of reverse nearest neighbors of a point is greater than 0, its density is the number of its reverse nearest neighbors plus the sum of the reverse-neighbor counts of each of its reverse nearest neighbors. If the number of reverse nearest neighbors of a point is 0, its density is 0. $\rho(c_i)$ represents the density of point $c_i$.

    $\rho(c_i) = \begin{cases} LRN_{k_2}(c_i) + \sum_{c_j \in RN_{k_2}(c_i)} LRN_{k_2}(c_j), & LRN_{k_2}(c_i) > 0 \\ 0, & LRN_{k_2}(c_i) = 0 \end{cases}$ (8)

    Definition 8: (Representative point) For any $c_i \in C$, $R(c_i)$ denotes the representative point of $c_i$. If $c_i$ satisfies Condition 3, then $R(c_i) = c_i$. If $c_i$ does not satisfy Condition 3 but satisfies Condition 4, then $R(c_i) = c_i$. If $c_i$ satisfies neither Condition 3 nor Condition 4, then

    $R(c_i) = \underset{c_j \in KN_{k_2}(c_i)}{\arg\max}\ \rho(c_j)$ (9)

    Condition 3:

    $\rho(c_i) \le \rho_{med}$ (10)
    and $DN_{k_2}(c_i) > mean_d + \frac{max_d - mean_d}{2}$ (11)

    Condition 4:

    $\forall c_j \in KN_{k_2}(c_i),\ \rho(c_i) > \rho(c_j)$ (12)

    where the constant 2 in Eq (11) was determined based on a large number of experiments. $\rho_{med}$ is the median of the densities of all points in the core point set C. $max_d$ is the maximum, over the core point set C, of the sum of distances between a point and the points within its $k_2$ nearest neighbor range. $mean_d$ is the average of the sums of distances between all core points and the points within their $k_2$ nearest neighbor ranges.

    $\rho_{med} = \mathrm{Median}(\{\rho(c_i) \mid c_i \in C\})$ (13)
    $max_d = \max_{c_j \in C} DN_{k_2}(c_j)$ (14)
    $mean_d = \frac{1}{m}\sum_{i=1}^{m} DN_{k_2}(c_i)$ (15)

    When the selection process finishes, we obtain the set of representative points $REP = \{R(c_1), R(c_2), \ldots, R(c_m)\}$. Some points with higher density may serve as the representative points of multiple points, as shown in Figure 5. Therefore, the number of representative points is much smaller than the number of sample points. By removing duplicate elements from REP, we obtain the final set of representative points, denoted $DREP = \{drep_1, drep_2, \ldots, drep_{nrep}\}$, where nrep is the number of representative points. All nonrepresentative points in the set C form the set NDREP, which is defined as:

    $NDREP = \{c_i \in C \mid c_i \notin DREP\}$ (16)
    Figure 5.  The selection result of the representative point. (a) The selection results of our improved representative point selection method; (b) Construct an MST using the representative points from Figure 5(a); (c) The selection results of the representative point method based solely on density. (d) Construct an MST using the representative points from Figure 5(c).

    In Figure 5(a), (c), the black lines with arrows point to the representative points of each point. If a point is not pointing to any other point, it means that it has chosen itself as the representative point. All selected representative points are denoted by red star-shaped points, while nonrepresentative points are represented by blue circular points. In Figure 5(b), the yellow edge connecting vertex v1 and vertex v2 is the longest edge on MST, and in Figure 5(d), the yellow edge connecting vertex v3 and vertex v4 is the longest edge on MST.

    We use the selected representative points to construct the MST. The MST of representative points is described in Definition 9. To construct the MST we adopt the Prim algorithm [28], a greedy algorithm that obtains the MST of a weighted connected graph. The basic flow of the algorithm is as follows: randomly select a node of the graph as the starting point and add it to the minimum spanning tree; then repeatedly add the node connected to the current tree by the minimum-weight edge, until the minimum spanning tree contains all nodes.

    Definition 9: (The MST of representative points) Assume that $G(X) = \langle V, E \rangle$ denotes a weighted undirected connected graph with vertex set $V = DREP$ and edge set $E = \{e_{ij} = \{x_i, x_j\} \mid x_i, x_j \in DREP, i \ne j\}$, where the weight of each edge $e_{ij}$ in graph G(X) is denoted $w(x_i, x_j)$. The MST of the graph G(X) is the subset of E that connects all vertices in V without cycles and has minimum total weight, $W(MST) = \min\{\sum_{x_i, x_j \in V,\ e_{ij} \in E} w(x_i, x_j)\}$.
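
    A minimal Prim implementation over the representative points, matching the description above (a sketch assuming plain Euclidean edge weights; not the paper's code), is shown below. It runs in O(nrep^2), which is affordable because nrep is small.

```python
import numpy as np


def prim_mst(points):
    """Prim's algorithm on the complete graph of an (n_rep, d) point array.

    Returns the MST as a list of (i, j, weight) edges.
    """
    n = len(points)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True                                        # grow the tree from vertex 0
    best_dist = np.linalg.norm(points - points[0], axis=1)   # cheapest link to the tree
    best_from = np.zeros(n, dtype=int)

    edges = []
    for _ in range(n - 1):
        # Pick the out-of-tree vertex with the cheapest connecting edge.
        v = int(np.argmin(np.where(in_tree, np.inf, best_dist)))
        edges.append((int(best_from[v]), v, float(best_dist[v])))
        in_tree[v] = True
        # Relax: the new vertex may offer cheaper links to the remaining vertices.
        dv = np.linalg.norm(points - points[v], axis=1)
        closer = (~in_tree) & (dv < best_dist)
        best_dist[closer] = dv[closer]
        best_from[closer] = v
    return edges
```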

    In most MST-based clustering algorithms, the number of clusters or a distance threshold must be provided when cutting inconsistent edges. However, in practice we may not know the number of clusters in advance, and a suitable distance threshold is difficult to find. Therefore, the challenge is how to identify inconsistent edges adaptively. As mentioned earlier, after removing the noise and some boundary points before building the minimum spanning tree, the core points of two adjacent clusters are clearly separated, which creates favorable conditions for identifying inconsistent edges. At this point, we can easily achieve adaptive recognition of inconsistent edges using the nearest neighbor method, without providing the number of clusters or a distance threshold. The main steps of our adaptive recognition of inconsistent edges are: 1) arrange all edges of the MST in descending order of edge length; 2) starting from the longest edge, determine whether the condition for an inconsistent edge is met; 3) examine the remaining edges in the sorted order until the first edge that does not meet the condition is encountered, and then end the search for inconsistent edges. After the search ends, the number of clusters is obtained automatically: the initial cluster number is 1, and each time an inconsistent edge is cut the number of clusters increases by 1, so the number of clusters equals the number of inconsistent edges plus 1. The specific procedure is shown in Algorithm 3.

    Algorithm 3: Identifying inconsistent edges
    Input: An MST of representative points, k_2
    Output: The set of inconsistent edges S, the number of clusters nc
    /* E denotes the set of edges of the MST, e(drep_i, drep_j) ∈ E, and w(drep_i, drep_j) is the weight corresponding to e(drep_i, drep_j). */
    Sort all MST edges in descending order by weight to obtain a weight list w_sorted
    Create S = ∅, S1 = ∅, S2 = ∅, nc = 1
    For w(drep_i, drep_j) in w_sorted do
      /* The edge e(drep_i, drep_j) associated with weight w(drep_i, drep_j) connects the two representative points drep_i and drep_j, which are added to the empty sets S1 and S2, respectively. */
      S1 ← S1 ∪ {drep_i}
      S2 ← S2 ∪ {drep_j}
      Add all the points in the core point set whose representative point is drep_i to S1.
      Add all the points in the core point set whose representative point is drep_j to S2.
      /* When an edge is found not to be inconsistent for the first time, the iteration is halted, concluding the process of identifying inconsistent edges. */
      If there exist c_i ∈ S1, c_j ∈ S2 such that c_i ∈ KN_k2(c_j) and c_j ∈ KN_k2(c_i) then
        break
      Else
        /* The identified inconsistent edge is added to the set S of inconsistent edges. */
        S ← S ∪ {e(drep_i, drep_j)}
        /* S1 and S2 are emptied for the next iteration. */
        S1 = ∅
        S2 = ∅
        /* Each time an inconsistent edge is detected, the number of clusters increases by 1. */
        nc = nc + 1
      End if
    End for

    Definition 10: (mutual k nearest neighbor)

    For points $c_i$ and $c_j$, if $c_i \in KN_k(c_j)$ and $c_j \in KN_k(c_i)$, then $c_i$ and $c_j$ are mutual k nearest neighbors.

    The condition for determining inconsistent edges: As shown in Figure 6, suppose an edge connects two vertices p1 and p2. Then p1 and its affiliated points (a point whose representative point is p1 is called an affiliated point of p1, such as p3, p4, p7 and p8 in Figure 6) form the set S1, and p2 and its affiliated points (such as p5, p6 and p9 in Figure 6) form the set S2. If no point in S1 has a mutual $k_2$-neighbor relationship with any point in S2, the edge is inconsistent. If at least one point in S1 and one point in S2 are mutual $k_2$ nearest neighbors, the edge is not inconsistent. The mutual k nearest neighbor relation [29] is given in Definition 10.

    Figure 6.  Identify inconsistent edges.
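
    The sketch below illustrates the adaptive edge test of Algorithm 3 together with the mutual-neighbor condition of Definition 10 in Python; the data structures (`members`, `knn`) are hypothetical stand-ins for the paper's bookkeeping, not its actual implementation.

```python
def find_inconsistent_edges(mst_edges, members, knn):
    """Adaptive edge test: mst_edges is a list of (i, j, weight) between
    representative points, members[r] is the set of core points represented
    by r, and knn[p] is the set of k2 nearest neighbors of core point p."""
    inconsistent = []
    # Examine MST edges from the longest to the shortest.
    for i, j, _w in sorted(mst_edges, key=lambda e: e[2], reverse=True):
        s1 = members[i] | {i}
        s2 = members[j] | {j}
        # The edge is kept if any cross pair are mutual k2 nearest neighbors.
        linked = any(p in knn[q] and q in knn[p] for p in s1 for q in s2)
        if linked:
            break                          # first consistent edge: stop searching
        inconsistent.append((i, j))
    n_clusters = len(inconsistent) + 1     # clusters = inconsistent edges + 1
    return inconsistent, n_clusters
```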

    Cutting the identified inconsistent edges from the MST yields a number of subtrees equal to the number of inconsistent edges plus one; each subtree represents a cluster, so this is also the number of clusters. Representative points on the same subtree belong to the same cluster, and representative points on different subtrees belong to different clusters. The next step is to assign each nonrepresentative point to the cluster to which its representative point belongs: if point $x_i$ is a nonrepresentative point with representative point $x_j$, i.e., $R(x_i) = x_j$, and $x_j$ belongs to cluster c, then $x_i$ joins c. The last step is to assign the noncore points (noise and some boundary points); each such point is assigned to the cluster of its nearest core point: if point $x_i$ is a noncore point and its closest core point $x_j$ belongs to cluster c, then $x_i$ joins c. Having introduced all the steps, we provide Algorithm 4 to describe the entire process of R-MST.
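
    These final assignment steps (Steps 5 and 6 of the overall procedure) can be sketched as follows, with illustrative names and data structures rather than the authors' code: nonrepresentative core points inherit the label of their representative, and noncore points join the cluster of their nearest core point.

```python
import numpy as np
from scipy.spatial.distance import cdist


def assign_all_points(X, core_idx, noncore_idx, rep_of, rep_labels):
    """core_idx / noncore_idx index into X; rep_of[c] gives the representative
    (a core-point index) of core point c; rep_labels maps each representative
    index to its cluster label."""
    labels = np.full(len(X), -1, dtype=int)
    # Core points (representative or not) inherit the label of their representative.
    for c in core_idx:
        labels[c] = rep_labels[rep_of[c]]
    # Noncore points join the cluster of their nearest core point.
    if len(noncore_idx) > 0:
        nearest = np.argmin(cdist(X[noncore_idx], X[core_idx]), axis=1)
        labels[noncore_idx] = labels[np.asarray(core_idx)[nearest]]
    return labels
```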

    Algorithm 4: R-MST: fast clustering algorithm based on MST of representative points
    Input: Dataset D = {x_1, x_2, x_3, ..., x_n}, k_1, k_2
    Output: Clustering results of dataset D
    Step 1: According to Algorithm 1, divide the points in dataset D into core points and noncore points.
    /* At the end of Step 1, the set of core points C = {c_1, c_2, c_3, ..., c_m} and the set of noncore points O = {o_1, o_2, o_3, ..., o_{n-m}} are obtained, where m is the number of core points and n-m is the number of noncore points. */
    Step 2: According to Algorithm 2, select the representative point of each core point.
    /* At the end of Step 2, the set of representative points DREP = {drep_1, drep_2, ..., drep_nrep} and the set of nonrepresentative points NDREP = {ndrep_1, ndrep_2, ..., ndrep_{m-nrep}} are obtained, where nrep is the number of representative points and m-nrep is the number of nonrepresentative points. */
    Step 3: Construct an MST using the Prim algorithm on the complete graph generated by the representative points.
    Step 4: Identify inconsistent edges in the MST according to Algorithm 3, and then remove all inconsistent edges.
    /* At the end of Step 4, the clustering results of the representative points are obtained. */
    Step 5: Assign each nonrepresentative point.
    /* At the end of Step 5, the clustering results of the core points are obtained. */
    Step 6: Assign each noncore point.
    /* At the end of Step 6, the clustering results of dataset D are obtained. */

    For a dataset containing n sample points, the time overhead of R-MST to complete the clustering comes mainly from the following aspects: 1) the time complexity of finding core points is O(n); 2) the time complexity of selecting representative points is O(m), where m is the number of core points and m < n; 3) the time complexity of constructing an MST of representative points is O(nrep^2), where nrep is the number of representative points and nrep is much smaller than n; 4) the time complexity of identifying inconsistent edges is O(nrep); 5) the time complexity of assigning nonrepresentative points is O(m); 6) the time complexity of assigning noncore points is less than O(n). In summary, the time complexity of R-MST is approximately O(nrep^2).

    Based on the experimental analysis that follows, in the ideal case the number of representative points is approximately 1/20 of the entire dataset; that is, nrep : n ≈ 1 : 20, so O(nrep^2) ≈ O(n^2)/400. Although R-MST is relatively efficient, the number of representative points still increases proportionally as the dataset grows, which limits the application of the algorithm to extremely large-scale datasets. To alleviate this constraint, in future work we will investigate how to reduce the dependency of the number of representative points on dataset size, so that the number of representative points can be kept very low even for very large datasets. Additionally, it is crucial to avoid the quadratic time complexity of constructing minimum spanning trees; this future work is discussed in detail in the conclusion.

    For the experiments, we tested R-MST on synthetic datasets and UCI datasets. The comparison algorithms included DPC [12], FastDP [13], DBSCAN [10] and MST-CDC [26]. These four algorithms are all distinctive and allow a comprehensive comparison with our proposed algorithm (R-MST) from different perspectives. The advantage of DPC is its ability to handle non-spherical, complexly distributed datasets without requiring an artificially set number of clusters. FastDP is an optimization of DPC and has the advantage of handling large datasets quickly and efficiently. The advantage of DBSCAN is that it can automatically handle clusters of arbitrary shape and size and can efficiently handle noisy data points. MST-CDC can also identify inconsistent edges on datasets containing noisy points to obtain optimal clusters.

    It is difficult to comprehensively assess the merits of clustering results with a single metric. Different clustering metrics focus on different aspects, help us understand the clustering results from different perspectives and lead to a better evaluation of the effectiveness of clustering algorithms. Therefore, we used three metrics: the adjusted Rand index (ARI) [30], normalized mutual information (NMI) [31] and homogeneity (Homo) [32]. ARI measures the similarity between the clustering result and the true cluster labels; it accounts for different arrangements of the clustering labels, so it reflects the similarity between clusterings more accurately, and it also corrects for agreement due to random chance, which improves the reliability of the comparison. NMI scores the similarity between two clusterings using normalized mutual information from information theory, treating clustering results as random variables and measuring the mutual information between different clusterings of the same dataset; higher NMI values generally indicate higher-quality clustering results. Homo measures the extent to which each cluster contains only samples of a single class: it calculates, for each true class, the proportion of its samples that belong to the same cluster and averages these values over all true classes. Simply put, the higher the Homo value, the more likely each cluster in the result contains only one class, and the more accurate the clustering result.
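
    For reference, all three metrics are available in scikit-learn; a small helper such as the one below (an illustrative sketch, not the authors' evaluation script) computes them for a predicted labeling against the ground truth.

```python
from sklearn.metrics import (adjusted_rand_score, homogeneity_score,
                             normalized_mutual_info_score)


def evaluate_clustering(y_true, y_pred):
    """Return the three external validity metrics used in the experiments."""
    return {
        "ARI": adjusted_rand_score(y_true, y_pred),
        "NMI": normalized_mutual_info_score(y_true, y_pred),
        "Homo": homogeneity_score(y_true, y_pred),
    }


# A labeling that matches the ground truth up to a permutation scores 1.0 on all three.
print(evaluate_clustering([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))
```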

    We used datasets of different sizes and dimensions to examine the performance of our algorithm. The eight synthetic datasets are ED-Hexagon [33], Jain [34], Three-circles [34], Heart-shaped [33], Ls3 [34], D31 [34], 2d-20c-no0 [34] and T7 [34]; their details are shown in Table 1. The twelve UCI datasets [35] are Zoo, Cancer, Seeds, WBC, Wine, Ecoli, Iris, Vote, Vowel, WDBC, Dermatology and Pendigits; their details are shown in Table 2. In addition, for the efficiency test we generated moons datasets with 2000, 4000, 6000, 8000, 10,000 and 12,000 sample points. For the discussion of the effect of the parameter k2 on the number of representative points and the running time of the algorithm, we generated a dataset with 10,000 sample points.

    Table 1.  Synthetic datasets.
    Dataset Number of instances Dimension Number of categories
    ED-Hexagon 361 2 2
    Jain 373 2 2
    Three-circles 299 2 3
    Heart-shaped 213 2 3
    Ls3 1735 2 6
    D31 3100 2 31
    2d-20c-no0 1517 2 20
    T7 8000 2 9

    Table 2.  UCI datasets.
    Dataset Number of instances Dimension Number of categories
    Zoo 101 16 7
    Cancer 683 9 2
    Seeds 210 7 3
    WBC 683 9 2
    Wine 178 13 3
    Ecoli 336 8 8
    Iris 150 4 3
    Vote 435 16 2
    Vowel 871 3 6
    WDBC 569 30 2
    Dermatology 358 34 6
    Pendigits 3498 16 10


    The experiments were conducted on a PC with an Intel Core i5 3.6 GHZ with 4 GB RAM, Windows 10 and Python 3.7.

    In this paper, we have chosen eight synthetic datasets to validate the clustering quality of R-MST and four other algorithms. These eight datasets cover different types of datasets, including varying density, rich noise interference, as well as datasets with diverse linearity, circularity and sphericity. Such datasets can effectively simulate complex distribution scenarios in real-world settings, facilitating the validation of algorithm generalizability. Additionally, in the presence of significant noise, the robustness of the algorithm can be tested as well. The key information for these datasets is shown in Table 1. The optimal parameters of the five algorithms for the eight synthetic datasets are shown in Table 3.

    Table 3.  Optimal parameters of 5 algorithms on 8 synthetic datasets.
    Algorithm DPC FastDP DBSCAN MST-CDC R-MST
    Parameters dc k Eps/MinPts None k1/k2
    ED-Hexagon 15 18 20/4 --- 2/6
    Jain 4.65 18 3.1/8 --- 2/9
    Threecircles 0.08 18 0.06/4 --- 1/6
    Heart-shaped 18 18 20/4 --- 9/6
    Ls3 10.28 18 10/8 --- 1/7
    D31 1.27 18 0.5/6 --- 15/5
    2d-20c-no0 1.19 18 1.2/20 --- 10/20
    T7 29.17 18 10/12 --- 80/9


    Figure 7 shows the clustering result of ED-Hexagon. The ED-Hexagon dataset contains a convex cluster and a non-convex cluster. The density distribution of this dataset is relatively uniform. Both the R-MST and DBSCAN identified the appropriate clusters, while the DPC, FastDP and MST-CDC produced incorrect clustering results.

    Figure 7.  The result of ED-Hexagon. (a) DPC; (b) FastDP; (c) DBSCAN; (d) MST-CDC; (e) R-MST.

    Figure 8 shows the clustering results of Jain. Jain is composed of two clusters with large differences in density distribution. The R-MST was able to correctly identify the clusters on this dataset. The other four algorithms produced incorrect results.

    Figure 8.  The result of Jain. (a) DPC; (b) FastDP; (c) DBSCAN; (d) MST-CDC; (e) R-MST.

    Figure 9 shows the clustering results on Three-circles, which consists of two rings and one solid circle. R-MST and DBSCAN produced acceptable clustering results, while DPC, FastDP and MST-CDC produced incorrect results. The concentration of multiple high-density points within one cluster easily leads to a biased selection of cluster centers, which in turn leads to incorrect assignment of the noncentral points. MST-CDC cuts off too many inconsistent edges, so the points of the outermost ring form multiple subtrees, and the merging process does not merge all of the outermost subtrees.

    Figure 9.  The result of Three-circles. (a) DPC; (b) FastDP; (c) DBSCAN; (d) MST-CDC; (e) R-MST.

    Figure 10 shows the clustering results of Heart-shaped. Heart-shaped consists of three heart shapes with a large difference in density distribution. The R-MST, DPC, FastDP and DBSCAN produced satisfactory clustering results on this dataset. The MST-CDC produced incorrect results because it identified two normal points as noise.

    Figure 10.  The result of Heart-shaped. (a) DPC; (b) FastDP; (c) DBSCAN; (d) MST-CDC; (e) R-MST.

    Figure 11 shows the clustering results of Ls3. The Ls3 dataset contains four spherical clusters and two linear clusters. Except for the DPC and FastDP, the other three algorithms obtained correct results. Because the DPC could not choose the clustering centers reasonably, it produced incorrect results.

    Figure 11.  The result of Ls3. (a) DPC; (b) FastDP; (c) DBSCAN; (d) MST-CDC; (e) R-MST.

    Figure 12 shows the clustering results of D31. The R-MST, DPC and FastDP produced satisfactory clustering results on this dataset. DBSCAN identifies the noise, and the clustering results are relatively good. The MST-CDC causes unsatisfactory results due to excessive noise interference.

    Figure 12.  The result of D31. (a) DPC; (b) FastDP; (c) DBSCAN; (d) MST-CDC; (e) R-MST.

    Figure 13 shows the clustering results of 2d-20c-no0. 2d-20c-no0 is composed of a number of linear clusters and blocky clusters. The DPC, FastDP and R-MST all obtained correct clustering results. DBSCAN identified one of the clusters as noise and identified some close clusters as one cluster, but the overall clustering results were relatively good. The MST-CDC produced incorrect clustering results.

    Figure 13.  The result of 2d-20c-no0. (a) DPC; (b) FastDP; (c) DBSCAN; (d) MST-CDC; (e) R-MST.

    Figure 14 shows the clustering results of T7. T7 consists of 9 clusters of different shapes and some noise. DBSCAN and R-MST achieved relatively good clustering results, while the other three algorithms showed unsatisfactory clustering results. The clustering centers of DPC and FastDP were selected incorrectly. MST-CDC identified the noise, but the clustering results were poor.

    Figure 14.  The result of T7. (a) DPC; (b) FastDP; (c) DBSCAN; (d) MST-CDC; (e) R-MST.

    R-MST demonstrates excellent clustering performance across these eight datasets of different types, detecting and aggregating clusters of diverse shapes and densities. Additionally, R-MST achieves remarkable results on complex datasets such as T7, which contain a considerable amount of noise, highlighting the strong robustness of the algorithm.

    It is necessary to test the efficiency of clustering algorithms, as efficiency directly affects an algorithm's usability and practicality. In the efficiency test, we measured the runtime of R-MST and the four comparison algorithms on datasets with varying numbers of data points. To ensure fairness, we generated moons datasets with 2000, 4000, 6000, 8000, 10,000 and 12,000 sample points; all five algorithms achieved correct clustering results on these datasets. Figure 15 shows the moons dataset, and Table 4 shows the running times of the algorithms on the different dataset sizes. From Table 4, we can see that the fastest algorithm is FastDP, because it eliminates the quadratic time complexity limitation of DPC. The second fastest is our proposed R-MST, which uses representative points instead of all points to construct the MST, reducing the computational overhead to a certain extent. R-MST is not as fast as FastDP, but its clustering quality is better. The third fastest algorithm is DPC, followed by DBSCAN, and the slowest is MST-CDC. As the number of sample points increases, the running times of all five algorithms increase to different degrees; however, the running times of FastDP and R-MST grow relatively slowly, while those of the other three algorithms grow more rapidly. In summary, R-MST has high efficiency and is capable of handling large datasets.

    Figure 15.  The moons dataset.
    Table 4.  Running time of 5 algorithms on moons datasets (s).
    Algorithm 2000 4000 6000 8000 10,000 12,000
    DPC 9 36 84 154 312 603
    FastDP 0.1 0.4 0.7 0.9 1.3 1.6
    DBSCAN 27 112 249 442 902 1872
    MST-CDC 54 201 460 802 1532 3553
    R-MST 2 6 14 24 54 89


    As shown in Figure 16, we generated a 10-blobs dataset (10,000 points) to test the effect of the parameter k2 on the number of representative points and on the running time of the algorithm. These experiments were carried out with the parameter k1 held constant. Figure 17 shows how the number of representative points changes as k2 increases, and Figure 18 shows how the running time of the algorithm changes as k2 increases.

    Figure 16.  The 10-blobs dataset.
    Figure 17.  The influence of the value of k2 on the number of representative points.
    Figure 18.  The influence of the value of k2 on the running time.

    As shown in Figure 17, as k2 gradually increases, the number of representative points initially decreases, reaches a critical value (at k2 = 150) and remains relatively stable, and finally, after another critical value (at k2 = 400), begins to increase again. When the number of representative points is in the stable state, its ratio to the number of all points in the dataset is approximately 1:20. The time complexity of constructing the MST of all points in the dataset using the Prim algorithm is O(n^2), where n is the number of points, so the time complexity of constructing the MST of the representative points is O((n/20)^2). Therefore, the performance of R-MST is significantly improved.

    As shown in Figure 18, as k2 gradually increases, the running time of the algorithm initially stays within a stable range, and after the critical value k2 = 450 the running time begins to increase significantly. At k2 = 50 and k2 = 100 the number of representative points is relatively large, but the running time of the algorithm is low, because when k2 is small the time spent computing the k2 neighbors decreases, so the overall time changes little. When k2 > 450, the number of representative points increases and the time spent computing the k2 neighbors also increases, so the overall time increases significantly. Therefore, the optimal range for the parameter k2 is approximately n/200 ≤ k2 ≤ n/25, where n is the size of the original dataset. Within this range, changes in the value of k2 have a relatively minor impact on the efficiency of the algorithm.

    In this study, we conducted comparative experiments on 12 real high-dimensional UCI datasets to verify the effectiveness of the R-MST algorithm on high-dimensional data. These datasets come from various real-world domains, including healthcare, biology and finance, so their dimensions and characteristics are highly diverse, making them suitable for testing the generalizability of clustering algorithms. Additionally, UCI datasets often exhibit issues such as outliers, missing values and duplicates, which makes it possible to evaluate the robustness and resilience of clustering algorithms. The comparison algorithms in this experiment include DPC, DBSCAN and MST-CDC. FastDP aims to improve the efficiency of DPC and does not significantly change clustering quality compared with DPC, so we did not use it for the clustering-quality comparison on the UCI datasets. The basic information of the test datasets is shown in Table 2. The optimal parameters of the four algorithms on the 12 UCI datasets are shown in Table 5, and the clustering performance is shown in Table 6. The results on the UCI datasets indicate that, except for the Cancer and WBC datasets, all three metrics of the proposed algorithm are superior to those of the other three comparison algorithms. On the Cancer and WBC datasets, R-MST achieved the highest ARI and NMI values, while its Homo value was second only to that of DBSCAN. In summary, R-MST performs exceptionally well on the UCI datasets, exhibiting good generalizability and robustness in practical applications.

    Table 5.  Optimal parameters of 4 algorithms on 12 UCI datasets.
    Dataset     | DPC (dc) | DBSCAN (Eps/MinPts) | MST-CDC (none) | R-MST (k1/k2)
    Zoo         | 0.82     | 1.2/4               | ---            | 16/3
    Cancer      | 0.98     | 0.4/6               | ---            | 10/19
    Seeds       | 0.08     | 0.2/9               | ---            | 5/16
    WBC         | 1.01     | 0.4/6               | ---            | 10/19
    Wine        | 0.42     | 0.5/6               | ---            | 7/6
    Ecoli       | 0.48     | 0.2/10              | ---            | 5/2
    Iris        | 0.17     | 0.4/3               | ---            | 9/6
    Vote        | 0.03     | 0.9/10              | ---            | 24/17
    Vowel       | 0.15     | 0.1/10              | ---            | 23/8
    WDBC        | 0.08     | 0.4/20              | ---            | 26/14
    Dermatology | 0.16     | 1.4/19              | ---            | 20/4
    Pendigits   | 0.28     | 0.3/5               | ---            | 23/2
    Table 6.  Clustering performance of 4 algorithms on 12 UCI datasets.
    Dataset     | Algorithm | ARI     | NMI    | Homo
    Zoo         | DPC       | 0.4972  | 0.7224 | 0.7490
                | DBSCAN    | 0.9326  | 0.8968 | 0.8978
                | MST-CDC   | 0       | 0      | 0
                | R-MST     | 0.9515  | 0.9137 | 0.9109
    Cancer      | DPC       | 0.4934  | 0.4404 | 0.3964
                | DBSCAN    | 0.8362  | 0.7456 | 0.7537
                | MST-CDC   | 0.0050  | 0.0271 | 0.0048
                | R-MST     | 0.8522  | 0.7530 | 0.7530
    Seeds       | DPC       | 0.7448  | 0.7194 | 0.7169
                | DBSCAN    | 0.3693  | 0.5062 | 0.5788
                | MST-CDC   | 0       | 0.0233 | 0.0061
                | R-MST     | 0.8109  | 0.7707 | 0.7702
    WBC         | DPC       | 0.4934  | 0.4404 | 0.3964
                | DBSCAN    | 0.8362  | 0.7456 | 0.7537
                | MST-CDC   | 0.0050  | 0.0271 | 0.0048
                | R-MST     | 0.8522  | 0.7530 | 0.7530
    Wine        | DPC       | 0.6724  | 0.7104 | 0.7096
                | DBSCAN    | 0.4264  | 0.5266 | 0.4978
                | MST-CDC   | -0.0087 | 0.0881 | 0.0344
                | R-MST     | 0.7847  | 0.7872 | 0.7896
    Ecoli       | DPC       | 0.5618  | 0.5761 | 0.5017
                | DBSCAN    | 0.4999  | 0.5109 | 0.4104
                | MST-CDC   | 0.0610  | 0.1849 | 0.0796
                | R-MST     | 0.7691  | 0.7279 | 0.7048
    Iris        | DPC       | 0.8857  | 0.8642 | 0.8640
                | DBSCAN    | 0.5681  | 0.7337 | 0.5794
                | MST-CDC   | 0.5681  | 0.7337 | 0.5794
                | R-MST     | 0.9222  | 0.9011 | 0.9009
    Vote        | DPC       | 0.5921  | 0.5150 | 0.5241
                | DBSCAN    | 0.4481  | 0.3977 | 0.5035
                | MST-CDC   | 0.0746  | 0.0951 | 0.0349
                | R-MST     | 0.6353  | 0.5438 | 0.5520
    Vowel       | DPC       | 0.4596  | 0.5658 | 0.5564
                | DBSCAN    | 0.0076  | 0.0187 | 0.0105
                | MST-CDC   | 0.2487  | 0.4717 | 0.5834
                | R-MST     | 0.5132  | 0.6030 | 0.6603
    WDBC        | DPC       | 0.4964  | 0.4822 | 0.4374
                | DBSCAN    | 0.4515  | 0.3560 | 0.3622
                | MST-CDC   | 0.0048  | 0.0102 | 0.0053
                | R-MST     | 0.6879  | 0.5828 | 0.5680
    Dermatology | DPC       | 0.5293  | 0.6851 | 0.5531
                | DBSCAN    | 0.4639  | 0.6522 | 0.5661
                | MST-CDC   | 0.2048  | 0.4514 | 0.2959
                | R-MST     | 0.7756  | 0.8484 | 0.8122
    Pendigits   | DPC       | 0.6478  | 0.7776 | 0.7630
                | DBSCAN    | 0.5633  | 0.7384 | 0.7922
                | MST-CDC   | 0.2053  | 0.0951 | 0.0349
                | R-MST     | 0.7356  | 0.8323 | 0.9020


    This paper proposes a fast clustering algorithm based on an MST of representative points. The algorithm replaces all points in the dataset with representative points when constructing the MST, which removes a significant amount of computational overhead. In addition, we propose an adaptive method to identify inconsistent edges in the MST; after removing these inconsistent edges, the number of clusters is obtained automatically. Experimental results demonstrate that the algorithm achieves both high efficiency and high clustering quality. However, as the amount of data increases, the number of representative points also increases, which is unfavorable for handling large-scale datasets.

    We have found that only a few edges of the complete graph actually participate in the MST, and these useful edges mostly connect vertices to their neighbors. This suggests constructing a graph based on neighbor relationships that connects all vertices with very few edges, so that the number of edges is only a few times the number of vertices, and then applying the Kruskal algorithm [36] to this graph to obtain the minimum spanning tree. Because the time complexity of the Kruskal algorithm is $O(e \log e)$, where e is the number of edges in the graph, it is particularly suitable for finding the minimum spanning tree of sparse graphs. This would reduce the time complexity of our algorithm to $O(n \log n)$ and break the quadratic time barrier of MST construction in clustering algorithms. The Kruskal algorithm is not used on the complete graph because the complete graph has $n(n-1)/2$ edges, which would raise the overall time complexity of the clustering algorithm to $O(n^2 \log n)$.

    Furthermore, keeping the set of representative points small for extremely large datasets remains a challenge. In future work, we plan to alleviate this problem with grid-based techniques. First, we divide the data space into regular grid cells. Then, based on characteristics such as the data volume and density within each cell, together with the features of adjacent cells, we select one or more suitable representative points from each cell. The number of representative points can be controlled efficiently by adjusting the grid size; for instance, enlarging the grid reduces the number of representative points. The trade-off between grid size and accuracy is itself a challenge: a smaller grid provides more detailed information but increases computational complexity and memory requirements, whereas a larger grid may sacrifice accuracy or fail to capture local variation.

    Finally, we will extend the combination of "selecting core points + selecting representative points" to more clustering algorithms to improve their performance. The essence of selecting core points is to roughly remove noise and boundary points from the dataset, since these points can strongly affect the clustering result. Selecting representative points from the core points then reduces the data size while preserving the characteristics of the clusters. This combination can be viewed as a preprocessing step for clustering algorithms, which not only mitigates the interference of noise and boundary points but also improves clustering efficiency. However, the combination also faces challenges. First, the criteria for defining noise and boundary points may need to be determined according to the specific application scenario, which can make the selection of core points unstable. For example, noise and boundary points can be identified by distance thresholds: in the density-based algorithm DBSCAN, a point whose neighborhood contains fewer points than a given threshold is treated as a noise or boundary point. The criteria can also be defined from domain knowledge; in image processing, for instance, changes in pixel intensity or texture continuity can be used to identify noise and boundary points. Second, selecting representative points requires balancing the preservation of key cluster features against the goal of compressing the data, to avoid information loss, which may call for different selection strategies and metrics. For example, spectral clustering can choose, within each partition, the nodes that are highly connected to other partitions as representative points; such nodes not only represent the characteristics of their own partition but also reflect differences between partitions. Density peak clustering selects the center point of each cluster as a representative point to better represent the cluster's characteristics.
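    The neighbor-graph idea discussed above can be sketched as follows. This is our own illustration under stated assumptions (a symmetric k-nearest-neighbor graph with k = 10, and scipy's MST routine standing in for a hand-written Kruskal with union-find), not an implementation of R-MST itself.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from sklearn.neighbors import kneighbors_graph

X = np.random.default_rng(0).normal(size=(2000, 2))  # hypothetical data

# Symmetric k-NN graph: roughly k*n edges instead of the n(n-1)/2 edges of the complete graph.
knn = kneighbors_graph(X, n_neighbors=10, mode="distance")
knn = knn.maximum(knn.T)

# If k is too small the graph may split into several components; the result is then a
# spanning forest, so k should be increased or the components bridged afterwards.
n_components, _ = connected_components(knn, directed=False)
mst = minimum_spanning_tree(knn)  # cost governed by the sparse edge set, not n^2
print("components:", n_components, "MST edges:", mst.nnz)
```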

    The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

    The financial support for this project is provided by the National Natural Science Foundation of China [61962054].

    The authors declare there is no conflict of interest.



    [1] X. Xue, J. Chen, Matching biomedical ontologies through compact differential evolution algorithm with compact adaption schemes on control parameters, Neurocomputing, 458 (2021), 526–534. https://doi.org/10.1016/j.neucom.2020.03.122 doi: 10.1016/j.neucom.2020.03.122
    [2] X. Xue, Y. Wang, Ontology alignment based on instance using NSGA-Ⅱ, J. Inf. Sci., 41 (2015), 58–70. https://doi.org/10.1177/0165551514550142 doi: 10.1177/0165551514550142
    [3] D. S. Silva, M. Holanda, Applications of geospatial big data in the Internet of Things, Trans. GIS, 26 (2022), 41–71. https://doi.org/10.1111/tgis.12846 doi: 10.1111/tgis.12846
    [4] T. Xu, J. Jiang, A graph adaptive density peaks clustering algorithm for automatic centroid selection and effective aggregation, Expert Syst. Appl., 195 (2022), 116539. https://doi.org/10.1016/j.eswa.2022.116539 doi: 10.1016/j.eswa.2022.116539
    [5] F. U. Siddiqui, A. Yahya, F. U. Siddiqui, A. Yahya, Partitioning clustering techniques, in Clustering Techniques for Image Segmentation, Springer, (2022), 35–67. https://doi.org/10.1007/978-3-030-81230-0_2
    [6] F. U. Siddiqui, A. Yahya, F. U. Siddiqui, A. Yahya, Novel partitioning clustering, in Clustering Techniques for Image Segmentation, Springer, (2022), 69–91. https://doi.org/10.1007/978-3-030-81230-0_3
    [7] C. K. Reddy, B. Vinzamuri, A survey of partitional and hierarchical clustering algorithms, in Data Clustering, Chapman and Hall/CRC, (2018), 87–110. https://doi.org/10.1201/9781315373515-4
    [8] S. Zhou, Z. Xu, F. Liu, Method for determining the optimal number of clusters based on agglomerative hierarchical clustering, IEEE Trans. Neural Networks Learn. Syst., 28 (2016), 3007–3017. https://doi.org/10.1109/TNNLS.2016.2608001 doi: 10.1109/TNNLS.2016.2608001
    [9] E. C. Chi, K. Lange, Splitting methods for convex clustering, J. Comput. Graphical Stat., 24 (2015), 994–1013. https://doi.org/10.1080/10618600.2014.948181 doi: 10.1080/10618600.2014.948181
    [10] M. Ester, H. P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in KDD, 96 (1996), 226–231.
    [11] P. Bhattacharjee, P. Mitra, A survey of density based clustering algorithms, Front. Comput. Sci., 15 (2021), 1–27. https://doi.org/10.1007/s11704-019-9059-3 doi: 10.1007/s11704-019-9059-3
    [12] A. Rodriguez, A. Laio, Clustering by fast search and find of density peaks, Science, 344 (2014), 1492–1496. https://doi.org/10.1126/science.1242072 doi: 10.1126/science.1242072
    [13] S. Sieranoja, P. Fränti, Fast and general density peaks clustering, Pattern Recognit. Lett., 128 (2019), 551–558. https://doi.org/10.1016/j.patrec.2019.10.019 doi: 10.1016/j.patrec.2019.10.019
    [14] A. Joshi, E. Fidalgo, E. Alegre, L. Fernández-Robles, SummCoder: An unsupervised framework for extractive text summarization based on deep auto-encoders, Expert Syst. Appl., 129 (2019), 200–215. https://doi.org/10.1016/j.eswa.2019.03.045 doi: 10.1016/j.eswa.2019.03.045
    [15] J. Xie, R. Girshick, A. Farhadi, Unsupervised deep embedding for clustering analysis, in International Conference on Machine Learning, PMLR, (2016), 478–487. https://doi.org/10.48550/arXiv.1511.06335
    [16] M. Gori, G. Monfardini, F. Scarselli, A new model for learning in graph domains, in Proceedings. 2005 IEEE International Joint Conference on Neural Networks, IEEE, (2005), 729–734. https://doi.org/10.1109/IJCNN.2005.1555942
    [17] C. Wang, S. Pan, R. Hu, G. Long, J. Jiang, C. Zhang, Attributed graph clustering: A deep attentional embedding approach, preprint, arXiv: 1906.06532.
    [18] R. Jothi, S. K. Mohanty, A. Ojha, Fast approximate minimum spanning tree based clustering algorithm, Neurocomputing, 272 (2018), 542–557. https://doi.org/10.1016/j.neucom.2017.07.038 doi: 10.1016/j.neucom.2017.07.038
    [19] J. C. Gower, G. J. Ross, Minimum spanning trees and single linkage cluster analysis, J. R. Stat. Soc. C, 18 (1969), 54–64. https://doi.org/10.2307/2346439 doi: 10.2307/2346439
    [20] C. T. Zahn, Graph-theoretical methods for detecting and describing gestalt clusters, IEEE Trans. Comput., 100 (1971), 68–86. https://doi.org/10.1109/T-C.1971.223083 doi: 10.1109/T-C.1971.223083
    [21] O. Grygorash, Y. Zhou, Z. Jorgensen, Minimum spanning tree based clustering algorithms, in 2006 18th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'06), IEEE, (2006), 73–81. https://doi.org/10.1109/ICTAI.2006.83
    [22] A. C. Müller, S. Nowozin, C. H. Lampert, Information theoretic clustering using minimum spanning trees, in Joint DAGM (German Association for Pattern Recognition) and OAGM Symposium, Springer, (2012), 205–215. https://doi.org/10.1007/978-3-642-32717-9_21
    [23] M. Gagolewski, M. Bartoszuk, A. Cena, Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Inf. Sci., 363 (2016), 8–23. https://doi.org/10.1016/j.ins.2016.05.003 doi: 10.1016/j.ins.2016.05.003
    [24] Y. Ma, H. Lin, Y. Wang, H. Huang, X. He, A multi-stage hierarchical clustering algorithm based on centroid of tree and cut edge constraint, Inf. Sci., 557 (2021), 194–219. https://doi.org/10.1016/j.ins.2020.12.016 doi: 10.1016/j.ins.2020.12.016
    [25] G. Mishra, S. K. Mohanty, A fast hybrid clustering technique based on local nearest neighbor using minimum spanning tree, Expert Syst. Appl., 132 (2019), 28–43. https://doi.org/10.1016/j.eswa.2019.04.048 doi: 10.1016/j.eswa.2019.04.048
    [26] F. Şaar, A. E. Topcu, Minimum spanning tree‐based cluster analysis: A new algorithm for determining inconsistent edges, Concurrency Comput. Pract. Exper., 34 (2022), e6717. https://doi.org/10.1002/cpe.6717 doi: 10.1002/cpe.6717
    [27] H. A. Chowdhury, D. K. Bhattacharyya, J. K. Kalita, UIFDBC: Effective density based clustering to find clusters of arbitrary shapes without user input, Expert Syst. Appl., 186 (2021), 115746. https://doi.org/10.1016/j.eswa.2021.115746 doi: 10.1016/j.eswa.2021.115746
    [28] R. C. Prim, Shortest connection networks and some generalizations, Bell Syst. Tech. J., 36 (1957), 1389–1401. https://doi.org/10.1002/j.1538-7305.1957.tb01515.x doi: 10.1002/j.1538-7305.1957.tb01515.x
    [29] F. Ros, S. Guillaume, Munec: a mutual neighbor-based clustering algorithm, Inf. Sci., 486 (2019), 148–170. https://doi.org/10.1016/j.ins.2019.02.051 doi: 10.1016/j.ins.2019.02.051
    [30] D. Steinley, Properties of the hubert-arable adjusted rand index, Psychol. Methods, 9 (2004), 386. https://doi.org/10.1037/1082-989X.9.3.386 doi: 10.1037/1082-989X.9.3.386
    [31] P. A. Estévez, M. Tesmer, C. A. Perez, J. M. Zurada, Normalized mutual information feature selection, IEEE Trans. Neural Networks, 20 (2009), 189–201. https://doi.org/10.1109/TNN.2008.2005601 doi: 10.1109/TNN.2008.2005601
    [32] M. Sato-Ilic, On evaluation of clustering using homogeneity analysis, in IEEE International Conference on Systems, Man and Cybernetics, IEEE, 5 (2000), 3588–3593. https://doi.org/10.1109/ICSMC.2000.886566
    [33] P. Fränti, Clustering datasets, 2017. Available from: https://cs.uef.fi/sipu/datasets.
    [34] P. Fränti, S. Sieranoja, K-means properties on six clustering benchmark datasets, Appl. Intell., 48 (2018), 4743–4759. https://doi.org/10.1007/s10489-018-1238-7 doi: 10.1007/s10489-018-1238-7
    [35] D. Dua, C. Graff, UCI Machine Learning Repository, 2017. Available from: https://archive.ics.uci.edu/ml.
    [36] J. B. Kruskal, On the shortest spanning subtree of a graph and the traveling salesman problem, Proc. Am. Math. Soc., 7 (1956), 48–50.
    © 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)