A Novel Clustering Algorithm Based On DPC and PSO
Digital Object Identifier 10.1109/ACCESS.2020.2992903
ABSTRACT Analyzing the fast search and find of density peaks clustering (DPC) algorithm, we find that its cluster centers cannot be determined automatically, that the selected cluster centers may fall into a local optimum, and that the value of the cut-off distance parameter dc is selected arbitrarily. To overcome these problems,
a novel clustering algorithm based on DPC & PSO (PDPC) is proposed. Particle swarm optimization (PSO) is
introduced because of its simple concept and strong global search ability, which can find the optimal solution
in relatively few iterations. First, to solve the effect of the selection of the parameter dc on the density calculation and the clustering results, this paper proposes a method to calculate that parameter. Second, a new fitness criterion function is proposed, with which the PSO algorithm iteratively searches for K global optimal solutions that serve as the initial cluster centers. Third, each sample is assigned to the K initial center points according
to the minimum distance principle. Finally, we update the cluster centers and redistribute the remaining
objects to the clusters closest to the cluster centers. Furthermore, the effectiveness of the proposed algorithm
is verified on nine typical benchmark data sets. The experimental results show that the PDPC can effectively
solve the problem of cluster center selection in the DPC algorithm, avoiding the subjectivity of the manual
selection process and overcoming the influence of the parameter dc . Compared with the other six algorithms,
the PDPC algorithm has a stronger global search ability, higher stability and a better clustering effect.
INDEX TERMS Clustering, density peak, particle swarm optimization, fitness function.
which may affect the clustering results. Therefore, it is necessary to propose a new method of calculating dc.
• The cluster centers selected by the DPC algorithm are likely to fall into a local optimum. This problem also impacts the clustering results and needs to be solved.
• Since the DPC algorithm visually identifies the cluster centers on the decision diagram (see Section III.A.2), human judgment may directly affect the clustering results. Therefore, it is necessary to overcome the influence of human factors and achieve the automatic identification of cluster centers.

Motivation 1: In the density formula of the DPC algorithm, there is a cut-off distance parameter dc, which is set to 1% to 2% of the size of the data set [1]. This empirically chosen value is uncertain and unreliable, which may affect the calculation of density and in turn affect the clustering results. Therefore, a new method for calculating dc based on the Gaussian distance is proposed.

Motivation 2: The deficiencies of the DPC clustering algorithm must be overcome; its selected cluster centers may fall into a local optimum, and its initial centers may be located in the same cluster or may not be found at all. These issues can affect the clustering results. Considering these problems, this paper introduces an intelligent optimization algorithm for clustering analysis.

Motivation 3: The DPC algorithm selects cluster centers visually and intuitively on the decision diagram. Some improved clustering methods use the same strategy, such as DP_K-medoids [3] and DPNM_K-medoids [3]. These methods show good performance on different data sets. However, there are human factors in the process of selecting cluster centers that may directly affect the clustering results. This insufficiency motivates us to propose a method that automatically identifies the cluster centers in the data set.

B. CONTRIBUTIONS
Inspired by the above motivations, the PDPC clustering algorithm is proposed. First, to solve the influence of the parameter dc, this paper proposes a method to calculate it. Second, a new fitness criterion function based on the DPC algorithm is proposed, with which the PSO algorithm iteratively searches for K initial cluster centers. Then, each sample is assigned to the K initial center points according to the minimum distance principle. Finally, we update the cluster centers and redistribute the remaining objects to the clusters closest to the cluster centers. The process iterates until the reallocation of objects no longer changes any cluster or the termination condition of the iteration is reached. The experimental results show that, compared with the other methods, the PDPC algorithm has a stronger global search ability, higher stability and a better clustering effect on the benchmark data sets.

The main contributions of this work are summarized as follows:
• To solve the influence of the cut-off distance parameter dc on the clustering results, a method of calculating dc is proposed. First, the Gaussian distance between the data points is calculated. Second, the maximum and minimum Gaussian distances are found. Finally, dc is set based on the mean value of the maximum and minimum Gaussian distances.
• Aiming at the problem that the cluster centers selected by the DPC algorithm easily fall into a local optimum, the PSO intelligent optimization algorithm is introduced for clustering analysis, and the global search ability of PSO is used to find K approximate optimal solutions. We use these optimal solutions as the initial cluster centers. The PDPC algorithm thus achieves the purpose of automatically selecting the cluster centers and avoids the subjectivity of the manual selection process.
• Reference [1] observed that a cluster center is characterized by a high density ρi and a long distance δi. According to this feature of the DPC algorithm, a new fitness function is proposed. Setting the fitness function is a key step in solving the optimization problem, and the design of the fitness function should be as simple as possible. Therefore, we use the inverse of the product of density and distance as the fitness function.
• We use multiple typical benchmark data sets to test the performance of the PDPC algorithm and use three well-known indicators of cluster quality (the accuracy, the precision and the recall) to evaluate the clustering results. Comparison experiments with six other algorithms show the effectiveness and correctness of the proposed clustering algorithm.

C. ROADMAP
The rest of this paper is organized as follows. Section II summarizes the related work. Section III gives the theoretical basis and some related concepts. In Section IV, a novel clustering algorithm based on DPC & PSO (PDPC) is proposed and introduced in detail. Section V analyzes the experimental results on typical benchmark data sets and then analyzes the characteristics of the proposed algorithm; six improved clustering algorithms (DP_K-medoids [3], DPNM_K-medoids [3], Improved K-means [4], K-means [5], Hybrid PSO and K-means [6] and DPC [1]) were selected for comparison. Finally, a summary of this work is given in Section VI.

II. RELATED WORKS
Clustering is a dynamic research field in data mining. It is also an important unsupervised learning technique in machine learning. Clustering is the process of grouping a set of data objects into multiple groups or clusters so that objects within a cluster have high similarity but are very dissimilar to objects in other clusters. As a data mining tool, clustering has its roots in many application areas such as biology, security, business intelligence, pattern recognition, Web search [7]–[9], trajectory clustering [10], [11] and astronomy [12]–[14].
Traditional approaches in clustering can be broadly categorized into partition-based, hierarchical-based, density-based, model-based, grid-based and soft computing methods [15]. Partitioning methods such as K-means [5] and K-medoids [16] relocate points by moving them from one category to another according to distance. These methods always need the number of clusters to be set in advance, and they are sensitive to the initial cluster centers. For the problem of cluster center selection, [17] proposed a novel algorithm for initial cluster center selection, which uses MNN (M nearest neighbors), density and distance to determine the initial cluster centers; the authors show that the method obtains high-quality initial cluster centers. Hierarchical methods [18] structure categories by recursively classifying the data in either a top-down or bottom-up fashion. Density-based methods assume that the points that belong to each cluster are drawn from a specific probability distribution [19]; clusters of arbitrary shape can be discovered by density-based methods such as DBSCAN [20] and Denclue [21]. Model-based methods [22] obtain the clustering results by optimizing the fit between the given data and certain mathematical models. Reference [23] developed a simple clustering model inspired by the way in which the human visual system associates patterns spatially; the approach is based on Cellular Neural Networks (CNNs), similar to the biological model. In grid-based methods, the data space is divided into a finite number of unit grid structures [24]; therefore, such methods have a high processing speed. The evolutionary approaches that belong to the soft computing methods [25], [26] are also used to deal with clustering problems. Algorithms such as the genetic algorithm (GA), the artificial bee colony (ABC) algorithm and PSO [27], [28] can obtain satisfactory results by optimizing an objective function.

In 2014, there was a large breakthrough in density-based clustering approaches: Rodriguez and Laio proposed the DPC algorithm [1]. DPC is based on the concept that cluster centers are characterized by a higher density than that of their neighbors and by a relatively large distance from points with higher densities. The algorithm uses these two features to obtain a scatter graph called a decision diagram, which is used to visually judge the potential cluster centers. Finally, each remaining point is assigned to a cluster according to its nearest neighbor of higher density. The algorithm is simple, and the clustering can be completed in one step without iteration. However, the algorithm involves human factors when selecting the cluster centers, which may directly affect the clustering results.

In response to the problems of the DPC algorithm, researchers have proposed many different algorithms. Reference [3] used DPC to optimize the initial medoids of the K-medoids clustering algorithm. To obtain better clustering, a new measure function is proposed as the ratio of the intra-distance of clusters to the inter-distance between clusters. The authors proposed two new K-medoids clustering algorithms: the DP_K-medoids algorithm and the DPNM_K-medoids algorithm. In [29], a new clustering algorithm that finds density peaks based on Chebyshev's inequality (CDP) obtains a judgment index by screening density and distance, which are normalized. The points whose judgment indexes are above the upper bound based on Chebyshev's inequality are selected as the cluster centers. Then, the remaining points are assigned by their nearest neighbor of higher density. Inspired by the visual selection rule of DPC, reference [30] proposed a judgment index that approximately follows the generalized extreme value (GEV) distribution, and each cluster center's judgment index is much higher. Hence, it is reasonable to select points as cluster centers if their judgment indexes are larger than the upper quantile of the GEV. This proposed method is called density peaks clustering based on the generalized extreme value distribution (DPC-GEV).

Reference [31] introduced the ideas of K-nearest neighbors (KNN) and principal component analysis (PCA) into DPC to improve the performance of the DPC algorithm. Reference [32] used the techniques of K-nearest neighbors and fuzzy weighted K-nearest neighbors to overcome the deficiencies of the DPC algorithm. Reference [33] enhanced DPC to make it suitable for hyperspectral band selection; the proposed approach is named the enhanced FDPC (E-FDPC), and it can use an exponential-based learning rule to adjust different numbers of cut-off thresholds and determine cluster centers automatically. Reference [34] presented a density peak based hierarchical clustering method (DenPEHC), which directly generates clusters on each possible clustering layer, and introduced a grid granulation framework to enable the clustering of large-scale and high-dimensional (LSHD) data sets.

To solve the shortcomings of clustering algorithms in initial cluster center selection and their tendency to fall easily into a local optimum, some researchers try to use intelligent optimization algorithms for clustering analysis and treat the clustering problem as an optimization problem. Among these strategies, the PSO algorithm is very popular due to its flexibility, robustness, discreteness and self-organization. PSO clustering focuses on solving clustering problems by using group behavior. Therefore, the global search ability of the PSO algorithm is used to find an approximate optimal solution.

PSO is a group intelligence optimization method proposed by Kennedy and Eberhart in 1995 [2]. It is derived from research on bird predation behavior and is an iteration-based optimization tool. The system is initialized to a set of random solutions that search for the optimal value by iteration. The PSO algorithm is simple, easy to implement, and does not have many parameters to adjust. It has been widely used in function optimization, neural network training, and fuzzy system control.

In recent years, the PSO optimization algorithm and PSO-based improved clustering methods have been studied and applied. Reference [35] proposed a PSO clustering algorithm based on different learning methods. The author proposed two improved fitness functions, which greatly improved the …
… the DPC algorithm. Setting the fitness function is a crucial step in solving the optimization problem. In Section IV.C, the parameters of the velocity update formula of the PSO algorithm are redefined. In Section IV.D, the proposed PDPC algorithm is introduced in detail and the algorithm steps are given. Finally, in Section IV.E, the time complexity of PDPC and the comparison algorithms is analyzed.

A. SETTING THE PARAMETER
In the density peak clustering algorithm proposed in [1], the cut-off distance parameter dc is difficult to determine; it mainly relies on subjective experience, is generally set to approximately 1% to 2% of the size of the data set, and lacks a definite selection basis. Therefore, its impact on the clustering results is great. To solve the influence of the parameter dc value on the clustering results, a new method for calculating dc is proposed in this paper. The specific steps are as follows:
1: Calculate the Gaussian distance between data points:

Distance = 1 − e^{−d_ij²/2}    (5)

FIGURE 2. The PSO algorithm flowchart.

From the analysis of equations (1) and (7), we know that (1) calculates a discrete value and that (7) calculates a continuous value. In comparison, the probability of conflict in (7) is small; that is, the probability that different data points have the same local density is small. The value of dc in (7) can be calculated by (6) and is no longer selected according to an empirical value. Therefore, the local density calculated by (7) is better. In view of this consideration, we design the fitness function as follows:

f(d_ij) = 1/(ρ × δ) = 1 / (∑_j e^{−(d_ij/d_c)²} × min_{j:ρ_j>ρ_i} d_ij)    (8)

where i and j denote different particles, d_ij is the Euclidean distance between particles, and dc is the cut-off distance mentioned in Section IV.A. For a general particle, δ = min d_ij; however, for the particle with the largest density value, δ = max d_ij. The smaller the value of f(d) is, the greater the probability that the particle becomes a cluster center point. If f(d)_n < f(d)_{n−1}, the optimal position needs to be updated.

We set a convergence condition as the termination condition of the iteration of the PSO algorithm to ensure the performance of the proposed algorithm. The convergence formula is as follows:

|f(d)_n − f(d)_{n−1}| ≤ ε,  n ≥ 2    (9)

where ε is the convergence parameter and n is the number of iterations. When a certain number of iterations is reached, the difference between f(d)_n and f(d)_{n−1} is very small, and it is determined that the particle swarm algorithm has reached convergence.
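To make these pieces concrete, the following is a minimal NumPy sketch of the dc setting of Section IV.A, the fitness of equation (8), and one standard PSO velocity/position update in the form of [2] and [41]. All function and variable names are ours; equation (6) is not reproduced in this excerpt, so dc is taken here as the mean of the maximum and minimum Gaussian distances, as stated in the contribution list; the excerpt also does not specify whether the density of (7) is computed on the raw Euclidean or the Gaussian distances, so we use the Euclidean d_ij with the Gaussian-derived dc; and the redefined velocity parameters of Section IV.C are not shown, so the update below uses the textbook constants. This is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def pdpc_parameters(X):
    """Sketch of Section IV.A and equation (8); all names are ours."""
    n = len(X)
    # Pairwise Euclidean distances d_ij.
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

    # Step 1, eq. (5): Gaussian distance between data points.
    g = 1.0 - np.exp(-(d ** 2) / 2.0)

    # dc as the mean of the max and min off-diagonal Gaussian distances
    # (our reading of the contribution list; eq. (6) is not shown here).
    off = ~np.eye(n, dtype=bool)
    dc = 0.5 * (g[off].max() + g[off].min())

    # Gaussian local density, eq. (7): rho_i = sum_{j != i} exp(-(d_ij/dc)^2).
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0  # subtract the j = i term

    # delta_i: distance to the nearest denser point; max distance for the
    # globally densest point, as described after eq. (8).
    delta = np.empty(n)
    for i in range(n):
        denser = rho > rho[i]
        delta[i] = d[i, denser].min() if denser.any() else d[i].max()

    # Fitness, eq. (8): a smaller f(d) means a more likely cluster center.
    fitness = 1.0 / (rho * delta)
    return dc, rho, delta, fitness

def pso_step(x, v, pbest, gbest, w=0.7, c1=2.0, c2=2.0, rng=np.random):
    """One standard PSO velocity/position update in the form of [2], [41];
    the paper's Section IV.C redefines these parameters, which this sketch
    does not reproduce."""
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v, v
```

A convergence test in the sense of equation (9) then amounts to stopping once |f(d)_n − f(d)_{n−1}| ≤ ε holds for the best particle.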
In the center-updating phase, each new center is recomputed from the members of its cluster, where Center_i is a new center, x_i is a data point that belongs to cluster C_i, and n_i is the number of data points that belong to cluster C_i.

The particle swarm optimization algorithm first divides the particle swarm into several "subgroups" according to the clustering algorithm and finds the optimal position of each "subgroup"; then, the particles in the particle swarm update their velocity and position values based on their individual extremum and the optimal position in each "subgroup". By clustering the particle swarm, the algorithm exchanges information between the particles and finds the optimal solution in the iterative process, which makes the global convergence of the algorithm stronger.
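As an illustration of the allocation and center-update loop described above, here is a minimal sketch. It assumes that the K initial centers come from the PSO search and that each new center is the mean of the points assigned to its cluster, matching the Center_i description above; the function name and defaults are ours, not the authors'.

```python
import numpy as np

def assign_and_update(X, centers, max_iter=100):
    """Minimum-distance assignment plus centroid update, iterated until no
    object changes cluster. 'centers' is assumed to be the K initial
    centers returned by the PSO search; a sketch only."""
    labels = None
    for _ in range(max_iter):
        # Distance from every sample to every current center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # reallocation no longer changes any cluster
        labels = new_labels
        for k in range(len(centers)):
            members = X[labels == k]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[k] = members.mean(axis=0)  # Center_k = (1/n_k) * sum(x)
    return labels, centers
```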
E. COMPLEXITY ANALYSIS
In this subsection, the calculation costs are analyzed for PDPC, DP_K-medoids [3], DPNM_K-medoids [3], Improved K-means [4], K-means [5], Hybrid PSO and K-means [6] and DPC [1], as shown in Table 1. Each method differs in its calculation complexity. In addition, the total clustering complexity includes updating the centers and calculating the distance between each pair of objects.

TABLE 1. Summary of the time complexity for each of the seven algorithms.

For a data set containing n objects, for all algorithms except K-means, the time complexity of calculating the distance matrix is O(n²). The K-means algorithm does not need to calculate the distance matrix and density between data points during the implementation process; the time complexity of calculating the distance from each sample point to the "cluster center" is O(n).

For all algorithms except the K-means and Hybrid PSO and K-means algorithms, the time complexity of calculating all sample densities is O(n²). The Hybrid PSO and K-means clustering algorithm first executes K-means once. The result of the K-means algorithm is then used as one of the particles, while the rest of the swarm is initialized randomly. Therefore, the algorithm does not need to calculate the density between data points, and its total time complexity is O(n²).

The time complexity of the cluster center iterative process of the six algorithms, except DPC, is O(tnK), where t is the number of iterations of the algorithm, n is the number of data points, and K is the number of clusters. After obtaining the initial cluster centers, the DPC algorithm assigns each remaining point to the cluster of the nearest neighbor sample whose density is larger than that of the sample, so the sample allocation time complexity is O(n). Therefore, the time complexity of the DPC algorithm in calculating all objects is O(n²), without accounting for the process of determining the cluster centers manually [32].

For the PDPC algorithm, the number of particles in each iteration does not change. Assume that the number of particles in the i-th iteration is n_i, where i = 1, 2, ..., t and t represents the maximum number of iterations, so n_1 = n_2 = ... = n_t = n. The complexity of calculating the distance matrix is O(n²), and the time complexity of calculating all sample densities is O(n²). It can be concluded that the time complexity of selecting the initial centers using the PSO algorithm is O(tn). In the center-updating phase, the complexity of updating the K centers is O(tnK). From this, we can determine that the total time complexity of the PDPC algorithm is O(n²).

The complexity of each of the seven algorithms is summarized in Table 1. The time complexity of the K-means algorithm is small, but K-means iterates many times during the running process. Intuitively, our PDPC has the same time complexity as the DP_K-medoids, DPNM_K-medoids, Improved K-means, Hybrid PSO and K-means and DPC algorithms. However, we introduced the PSO algorithm, which reduces the number of iterations because of its strong global search capabilities. Overall, the running time of the proposed algorithm is lower, as the following experimental analysis shows.

V. EXPERIMENTAL RESULTS AND DISCUSSION
All experiments are performed on an Intel Xeon E-2186M processor with 2.90 GHz and 32.0 GB RAM running Windows 10 Ultimate. All programs are compiled and executed using Eclipse 4.3.2 on a Java HotSpot 64-bit Server Virtual Machine.

In this section, we discuss the testing and verification of the clustering performance of the proposed PDPC algorithm and compare the results with those of the other six algorithms (DP_K-medoids [3], DPNM_K-medoids [3], Improved K-means [4], K-means [5], Hybrid PSO and K-means [6] and DPC [1]) using both classical synthetic data sets and real data sets. The clustering results of the algorithms were evaluated using the clustering time, the number of iterations, the accuracy of the clustering [45], and the precision and recall external validity indicators.
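Since the paper's exact formulas for these indicators are not reproduced in this excerpt, the following hedged sketch shows one common way to compute AC, PR and RE for a clustering: map each predicted cluster to its majority ground-truth class, then macro-average per-class precision and recall. All names are ours, and this is only a stand-in for the definitions used in the paper (e.g., its formula (15)).

```python
import numpy as np
from collections import Counter

def clustering_metrics(y_true, y_pred):
    """AC, PR and RE under one common convention: map each cluster to its
    majority ground-truth class, then score the mapped labels per class
    and macro-average. Illustrative only; not the paper's formulas."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mapped = np.empty_like(y_true)
    for c in np.unique(y_pred):
        mask = y_pred == c
        mapped[mask] = Counter(y_true[mask].tolist()).most_common(1)[0][0]
    ac = float((mapped == y_true).mean())
    classes = np.unique(y_true)
    pr = float(np.mean([(y_true[mapped == c] == c).mean()
                        if (mapped == c).any() else 0.0 for c in classes]))
    re = float(np.mean([(mapped[y_true == c] == c).mean() for c in classes]))
    return ac, pr, re
```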
FIGURE 4. Test convergence.

From the experimental results in Figure 4 for the PDPC clustering algorithm, we can find that after the number of iterations reaches 40, the algorithm tends to converge, and the cluster centers no longer show obvious changes. We take the convergence parameter ε = 0.02. In the process of the improved particle swarm optimization, the cluster centers that the algorithm outputs are the cluster centers obtained when the algorithm achieves convergence and stability.

D. PERFORMANCE ANALYSIS OF THE PDPC ALGORITHM
Before clustering, we used the method of calculating parameters proposed in this paper to obtain the threshold value dc of each data set, as shown in Table 3. These values were adopted in the following experiments.

TABLE 3. The threshold value dc of each data set.

In this subsection, the PDPC algorithm is compared with the DP_K-medoids [3], DPNM_K-medoids [3], Improved K-means [4], K-means [5], Hybrid PSO and K-means [6] and DPC [1] algorithms on the data sets in Table 2. Twenty experiments were performed on each data set; the AC, PR, and RE of each experiment were statistically analyzed; and the best value, worst value and average value over the 20 clustering experiments were recorded for each algorithm, as shown in Tables 4-6. The best results are in bold.

The experimental results in Tables 4-6 show that, compared with the other clustering algorithms, the PDPC clustering algorithm obtained relatively high average values of AC, PR, and RE on most of the data sets listed in Table 2. This result shows that the proposed algorithm has a good clustering effect and high stability. First, the experimental results of each algorithm on the two synthetic data sets are analyzed. For the Spiral data set, the DPC algorithm is optimal, and PDPC has higher values than the other algorithms and ranks second. Whether considering the best value, the worst value or the average value, the PDPC algorithm is second only to DPC. Moreover, the difference between the best and worst values of the PDPC algorithm is much smaller than that of DPC, indicating that the introduction of the PSO algorithm can improve the stability of the DPC algorithm. For the Aggregation data set, compared with the other six algorithms, the PDPC algorithm achieved the best clustering results. This result shows that the introduction of the PSO optimization algorithm in this paper overcomes the shortcoming of the manual selection of centers in DPC, in which the centers easily fall into a local optimum.

Furthermore, the experimental results of each algorithm on the real data sets are analyzed. For the Wdbc data set, the PDPC algorithm achieves the optimal average values of AC and RE, while the DPNM_K-medoids algorithm obtains the optimal average value of PR. The PDPC algorithm has a lower average PR value than the DP_K-medoids and DPNM_K-medoids algorithms but a higher value than DPC. The DPNM_K-medoids algorithm was likewise run for 20 experiments on this data set. First, the DPNM_K-medoids algorithm selects cluster centers on the decision diagram, and the centers selected by the algorithm may differ in each experiment. Second, in the data object allocation stage, some data points that originally belong to a cluster are not fully allocated to that cluster, while no data points belonging to other clusters are wrongly assigned to it. Therefore, from the analysis of formula (15), the DPNM_K-medoids algorithm may obtain a higher PR value in several experiments; that is, it may obtain the optimal average value of PR. For the remaining data sets, the PDPC algorithm performs well, and its average values of the three indicators were optimal. In general, the algorithm proposed in this paper has a good clustering effect and high stability. Our proposed algorithm overcomes the shortcoming of the DPC algorithm in which it easily falls into a local optimum, and it achieves the purpose of automatically selecting cluster centers.

Based on the above analysis, we show the average value of each indicator (AC, PR, and RE) in line charts in Figure 5. Taking the data sets as the x-axis values and the evaluation index results as the y-axis values, the index value curves of the data sets can be constructed. The purpose is to test the effectiveness of the proposed algorithm in terms of clustering performance.

FIGURE 5. The AC, PR and RE of seven algorithms on synthetic and real data sets.

According to the AC value curve shown in Figure 5(a), the PDPC algorithm (red line) achieves the best clustering accuracy of all algorithms on eight of the nine data sets. PDPC is followed by the DPC algorithm, which achieves the best clustering accuracy on one data set. The worst methods are the DP_K-medoids, DPNM_K-medoids, Improved K-means, K-means and Hybrid PSO and K-means algorithms, which do not obtain the best evaluation index value on any data set. The most significant improvement achieved by the PDPC algorithm was observed for the Aggregation data set, where the AC improved from 0.5977 using the DPC algorithm to 0.7850 using the PDPC algorithm.
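The construction of such index curves can be sketched in a few lines, assuming each algorithm's index values over the data sets are collected in a dictionary; all names below are ours and this is purely illustrative of the chart style, not of how the paper's figures were produced.

```python
import matplotlib.pyplot as plt

def plot_index_curves(datasets, results, index_name="AC"):
    """Line chart in the style of Figure 5: data sets on the x-axis, one
    evaluation index on the y-axis, one curve per algorithm. 'results'
    maps an algorithm name to its list of index values."""
    for algorithm, values in results.items():
        plt.plot(datasets, values, marker="o", label=algorithm)
    plt.xlabel("Data set")
    plt.ylabel(index_name)
    plt.xticks(rotation=45)
    plt.legend()
    plt.tight_layout()
    plt.show()
```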
We also find that for the Waveform data set, the AC values of the six algorithms other than PDPC are very close, and PDPC is greatly improved. However, for the Spiral data set, the AC value of the PDPC algorithm is 0.3471, which is significantly lower than that of the DPC algorithm but still higher than those of the other five algorithms. The results indicate that the proposed algorithm may not be suitable for the Spiral data set. This is related to the distribution of the data set, because Spiral is a path-based 3-spiral data set designed for spectral clustering.

Figure 5 shows similar trends in the metrics of the different algorithms on the different data sets. Compared with the other six algorithms, the PDPC algorithm showed the best clustering performance on most data sets. However, there are subtle differences. For example, the DPNM_K-medoids algorithm achieved top clustering performance for one data set when using PR (Figure 5(b)), compared with no data set when using AC (Figure 5(a)). Alternatively, the DP_K-medoids and DPNM_K-medoids algorithms had similar clustering performance for all of the indexes on all data sets except Electrical Grid. This is because the initial cluster center selection methods of these two clustering algorithms are the same; the difference is that their clustering criterion functions, that is, the stopping conditions of the clustering, are different [3]. For the Wdbc data set, the PDPC algorithm obtains the highest values of AC and RE on the evaluation indexes of clustering performance, while DPNM_K-medoids obtains the highest value of PR. The PDPC algorithm performed best on seven data sets when using the PR index; on the other hand, it is still the best-performing algorithm on eight data sets when using the RE value, just as it is when using the AC value. For the Waveform data set, the PDPC algorithm showed the best clustering performance on AC, PR and RE. Furthermore, for the Waveform(noise) data set, which has nineteen additional attributes with noise data relative to Waveform, the performance of the PDPC algorithm is still better than that of the other six algorithms. Therefore, the PDPC algorithm is the best method for processing the Waveform(noise) data set, which indicates that the PDPC algorithm is more stable than the other six algorithms.

Table 7 gives the number of data sets on which each of the seven algorithms showed the top clustering performance for the different evaluation indexes when using the synthetic data sets and real data sets. For AC, the PDPC algorithm achieved the best clustering performance by attaining the highest value on eight of the nine data sets. PR and RE showed results similar to those for AC. In all cases, the PDPC algorithm demonstrated the best clustering performance in each evaluation index. These results demonstrate that the PDPC algorithm is effective and excellent regardless of the evaluation index chosen.

It can be seen from Tables 4-6 that the clustering quality of the PDPC algorithm is better than that of DPC on most of the data sets in Table 2. Further, Figure 5 visually shows that PDPC (red line) is superior to DPC (blue line) on most of the data sets. From the above analysis, combined with the advantages of the PSO algorithm, the PDPC algorithm proposed in this paper resolves the disadvantages of DPC. A method for calculating the parameter dc is proposed to solve the uncertainty and unreliability of the DPC selection based on empirical values. For some unevenly distributed data sets, the initial centers found by the DPC algorithm may be located in the same cluster or may not be found; DPC may take non-cluster centers in the dense clusters as the center points of the sparse clusters, causing the cluster centers found to fall into a local optimum. Our algorithm solves this problem well. Moreover, the PDPC algorithm resolves the limitation that traditional DPC cannot automatically determine the cluster centers and avoids the subjectivity of the manual selection process. The experimental results show that our algorithm has a stronger global search ability, higher stability and a better clustering effect.

TABLE 7. The number of data sets in which each of the seven algorithms showed top clustering performance for the average value of the different evaluation indexes when using synthetic data sets and real data sets.

E. EVALUATION OF CLUSTERING TIME AND NUMBER OF ITERATIONS
In Section IV.E, we theoretically analyzed the complexity of the DP_K-medoids [3], DPNM_K-medoids [3], Improved K-means [4], K-means [5], Hybrid PSO and K-means [6], DPC [1] and PDPC algorithms; Table 1 gives the detailed theoretical results. In this subsection, we compare the actual clustering times and the numbers of iterations of the six algorithms other than DPC, measured as the average clustering time and number of iterations over 20 repeated clustering processes. The DPC algorithm does not perform an iterative clustering process, as it distributes the remaining data points directly to the nearest cluster centers, so it is not compared with this method.

Figure 6(a) shows the average clustering time of the six clustering algorithms in milliseconds on the nine data sets. As shown, the difference in clustering time between the six methods is not large. However, compared with the other five algorithms, the clustering time of the proposed PDPC algorithm is relatively low, although the time complexity is not greatly improved. We can see that the clustering time of the DP_K-medoids algorithm was close to that of DPNM_K-medoids. Although the time required to manually select the centers was excluded, the DP_K-medoids and DPNM_K-medoids algorithms must generate a decision diagram, which is time consuming; this is one reason why their computational efficiency was lower. We can also see that the K-means algorithm has a longer clustering time because it requires more iterations than the other algorithms on most data sets, as shown in Figure 6(b). Figure 6(b) shows the average number of iterations of the six clustering algorithms on the nine data sets. Overall, the number of iterations of PDPC is lower than that of the other algorithms.

This paper introduces the PSO optimization algorithm; because of its simple concept, strong global search capability and high stability, it can find the optimal solution in relatively few iterations. The above analysis shows that the PDPC algorithm runs faster than the other algorithms.
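The 20-run measurement protocol just described can be sketched as follows; cluster_fn stands for any of the compared clustering routines and is assumed to return its label assignment and iteration count. The function name and signature are ours, for illustration only.

```python
import time
import numpy as np

def average_time_and_iterations(cluster_fn, X, repeats=20):
    """Average wall-clock time (in milliseconds) and iteration count over
    repeated runs, mirroring the 20-run protocol described above."""
    times_ms, iterations = [], []
    for _ in range(repeats):
        start = time.perf_counter()
        _, n_iter = cluster_fn(X)
        times_ms.append((time.perf_counter() - start) * 1000.0)
        iterations.append(n_iter)
    return float(np.mean(times_ms)), float(np.mean(iterations))
```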
FIGURE 6. The six algorithms evaluate the clustering time and number of iterations on different data sets.

Therefore, the PDPC algorithm reduces the number of iterations and the clustering time and improves the efficiency of the DPC algorithm.

VI. SUMMARY
To overcome the disadvantages in the DPC algorithm, a novel clustering algorithm based on DPC & PSO (PDPC) is proposed. Particle swarm optimization (PSO) is introduced because of its simple concept and strong global search ability, which can find the optimal solution in relatively few iterations. Furthermore, to address the influence of the selection of the cut-off distance parameter dc value on the clustering results, a method for calculating the parameter dc is proposed. Finally, the PDPC and six typical algorithms are tested on classical synthetic data sets and real data sets, and the experiments verified that the clustering results, the clustering time and the number of iterations of the PDPC algorithm are better than those of the other algorithms. The PDPC algorithm achieves the purpose of automatically selecting cluster centers and overcomes the effects of the parameter dc. Compared with the other six algorithms, the PDPC algorithm has a stronger global search ability, higher stability and a better clustering effect.

REFERENCES
[1] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, Jun. 2014.
[2] R. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proc. 6th Int. Symp. Micro Mach. Human Sci. (MHS), 1995, pp. 39–43.
[3] X. Juanying and Y. Qu, "K-medoids clustering algorithms with optimized initial seeds by density peaks," J. Frontiers Comput. Sci. Technol., vol. 10, no. 2, pp. 230–247, 2016.
[4] E. Zhu and R. Ma, "An effective partitional clustering algorithm based on new clustering validity index," Appl. Soft Comput., vol. 71, pp. 608–621, Oct. 2018.
[5] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symp. Math. Statist. Probab., 1967, vol. 1, no. 14, pp. 281–297.
[6] D. W. van der Merwe and A. P. Engelbrecht, "Data clustering using particle swarm optimization," in Proc. Congr. Evol. Comput. (CEC), vol. 1, 2003, pp. 215–220.
[7] Y. Si, P. Liu, P. Li, and T. P. Brutnell, "Model-based clustering for RNA-seq data," Bioinformatics, vol. 30, no. 2, pp. 197–205, Jan. 2014.
[8] L. H. Son and T. M. Tuan, "A cooperative semi-supervised fuzzy clustering framework for dental X-ray image segmentation," Expert Syst. Appl., vol. 46, pp. 380–393, Mar. 2016.
[9] A. Mehta and O. Dikshit, "Comparative study on projected clustering methods for hyperspectral imagery classification," Geocarto Int., vol. 31, no. 3, pp. 296–307, Mar. 2016.
[10] Y. Yang, "TAD: A trajectory clustering algorithm based on spatial-temporal density analysis," Expert Syst. Appl., vol. 139, Jan. 2020, Art. no. 112846, doi: 10.1016/j.eswa.2019.112846.
[11] C. Jiang-Hui, "Spectral analysis of sky light based on trajectory clustering," Spectrosc. Spectral Anal., vol. 39, no. 4, pp. 1301–1306, 2019.
[12] C. Qu, H. Yang, J. Cai, J. Zhang, and Y. Zhou, "DoPS: A double-peaked profiles search method based on the RS and SVM," IEEE Access, vol. 7, pp. 106139–106154, 2019, doi: 10.1109/ACCESS.2019.2927251.
[13] Q. Cai-Xia, Y. Hai-Feng, C. Jiang-Hui, and X. Ya-Ling, "P-Cygni profile analysis of the spectrum: LAMOST J152238.11+333136.1," Spectrosc. Spectral Anal., vol. 40, no. 4, pp. 1304–1308, 2020.
[14] H. Yang, C. Qu, J. Cai, S. Zhang, and X. Zhao, "SVM-Lattice: A recognition & evaluation frame for double-peaked profiles," IEEE Access, early access, Apr. 27, 2020, doi: 10.1109/ACCESS.2020.2990801.
[15] J. Han, M. Kamber, and J. Pei, Data Mining Concepts and Techniques (Series in Data Management Systems), 3rd ed. San Mateo, CA, USA: Morgan Kaufmann, 2011, pp. 83–124.
[16] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, vol. 344. Hoboken, NJ, USA: Wiley, 2009.
[17] Y. Li, J. Cai, H. Yang, J. Zhang, and X. Zhao, "A novel algorithm for initial cluster center selection," IEEE Access, vol. 7, pp. 74683–74693, 2019, doi: 10.1109/ACCESS.2019.2921320.
[18] F. Murtagh and P. Contreras, "Algorithms for hierarchical clustering: An overview," WIREs Data Mining Knowl. Discovery, vol. 2, no. 1, pp. 86–97, Jan. 2012.
[19] J. D. Banfield and A. E. Raftery, "Model-based Gaussian and non-Gaussian clustering," Biometrics, vol. 49, no. 3, pp. 803–821, Sep. 1993.
[20] M. Ester, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proc. KDD, 1996, vol. 96, no. 34, pp. 226–231.
[21] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proc. 4th Int. Conf. Knowl. Discovery Data Mining, vol. 98, Aug. 1998, pp. 58–65.
[22] D. McParland and I. C. Gormley, "Model based clustering for mixed data: ClustMD," Adv. Data Anal. Classification, vol. 10, no. 2, pp. 155–169, Jun. 2016.
[23] A. Rodríguez, E. Cuevas, D. Zaldivar, and L. Castañeda, "Clustering with biological visual models," Phys. A, Stat. Mech. Appl., vol. 528, Aug. 2019, Art. no. 121505.
[24] L. Rokach, "A survey of clustering algorithms," in Data Mining and Knowledge Discovery Handbook. Boston, MA, USA: Springer, 2009, pp. 269–298.
[25] Y.-J. Zheng, H.-F. Ling, S.-Y. Chen, and J.-Y. Xue, "A hybrid neuro-fuzzy network based on differential biogeography-based optimization for online population classification in earthquakes," IEEE Trans. Fuzzy Syst., vol. 23, no. 4, pp. 1070–1083, Aug. 2015.
[26] Y.-J. Zheng and H.-F. Ling, "Emergency transportation planning in disaster relief supply chain management: A cooperative fuzzy optimization approach," Soft Comput., vol. 17, no. 7, pp. 1301–1314, Jul. 2013.
[27] B. Jiang and N. Wang, "Cooperative bare-bone particle swarm optimization for data clustering," Soft Comput., vol. 18, no. 6, pp. 1079–1091, Jun. 2014.
[28] Y.-J. Zheng, H.-F. Ling, J.-Y. Xue, and S.-Y. Chen, "Population classification in fire evacuation: A multiobjective particle swarm optimization approach," IEEE Trans. Evol. Comput., vol. 18, no. 1, pp. 70–81, Feb. 2014.
[29] J. Ding, Z. Chen, X. He, and Y. Zhan, "Clustering by finding density peaks based on Chebyshev's inequality," in Proc. 35th Chin. Control Conf. (CCC), Jul. 2016, pp. 7169–7172.
[30] J. Ding, X. He, J. Yuan, and B. Jiang, "Automatic clustering based on density peak detection using generalized extreme value distribution," Soft Comput., vol. 22, no. 9, pp. 2777–2796, May 2018.
[31] M. Du, S. Ding, and H. Jia, "Study on density peaks clustering based on k-nearest neighbors and principal component analysis," Knowl.-Based Syst., vol. 99, pp. 135–145, May 2016.
[32] J. Xie, H. Gao, W. Xie, X. Liu, and P. W. Grant, "Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors," Inf. Sci., vol. 354, pp. 19–40, Aug. 2016.
[33] S. Jia, G. Tang, J. Zhu, and Q. Li, "A novel ranking-based clustering approach for hyperspectral band selection," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 1, pp. 88–102, Jan. 2016.
[34] J. Xu, G. Wang, and W. Deng, "DenPEHC: Density peak based efficient hierarchical clustering," Inf. Sci., vol. 373, pp. 200–218, Dec. 2016.
[35] A. A. A. Esmin, D. L. Pereira, and F. P. A. de Araujo, "Study of different approach to clustering data by using the particle swarm optimization algorithm," in Proc. IEEE Congr. Evol. Comput. (IEEE World Congr. Comput. Intell.), Jun. 2008, pp. 1817–1822.
[36] I. W. Kao, C. Y. Tsai, and Y. C. Wang, "An effective particle swarm optimization method for data clustering," in Proc. IEEE Int. Conf. Ind. Eng. Eng. Manage., Dec. 2007, pp. 548–552.
[37] R. Chouhan and A. Purohit, "An approach for document clustering using PSO and K-means algorithm," in Proc. 2nd Int. Conf. Inventive Syst. Control (ICISC), Jan. 2018, pp. 1380–1384.
[38] A. Khatami, "A new PSO-based approach to fire flame detection using K-medoids clustering," Expert Syst. Appl., vol. 68, pp. 69–80, Feb. 2017.
[39] Y. Jiang, C. Liu, C. Huang, and X. Wu, "Improved particle swarm algorithm for hydrological parameter optimization," Appl. Math. Comput., vol. 217, no. 7, pp. 3207–3215, Dec. 2010.
[40] A. O'Hagan, T. B. Murphy, I. C. Gormley, P. D. McNicholas, and D. Karlis, "Clustering with the multivariate normal inverse Gaussian distribution," Comput. Statist. Data Anal., vol. 93, pp. 18–30, Jan. 2016.
[41] Y. Shi and R. Eberhart, "A modified particle swarm optimizer," in Proc. IEEE Int. Conf. Evol. Comput., IEEE World Congr. Comput. Intell., May 1998, pp. 69–73.
[42] A. Ratnaweera, S. K. Halgamuge, and H. C. Watson, "Self-organizing hierarchical particle swarm optimizer with time-varying acceleration coefficients," IEEE Trans. Evol. Comput., vol. 8, no. 3, pp. 240–255, Jun. 2004.
[43] H. Chang and D.-Y. Yeung, "Robust path-based spectral clustering," Pattern Recognit., vol. 41, no. 1, pp. 191–203, Jan. 2008.
[44] A. Gionis, H. Mannila, and P. Tsaparas, "Clustering aggregation," ACM Trans. Knowl. Discovery Data (TKDD), vol. 1, no. 1, p. 4, 2007.
[45] C. M. Stein, "Estimation of the mean of a multivariate normal distribution," Ann. Statist., vol. 9, no. 6, pp. 1135–1151, Nov. 1981.

JIANGHUI CAI is a Chief Professor of computer application technology with the Taiyuan University of Science and Technology, Taiyuan, China. He is a long-term member of the Institute for Intelligent Information and Data Mining. His research interests concern data mining and machine learning methods in the specific backgrounds of astronomical informatics, seismology, and mechanical engineering. He is a Senior Member of the China Computer Federation (CCF).

HUILING WEI was born in Shanxi, China, in 1993. She is currently pursuing the M.S. degree with the Department of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan, China. Her current research interests include data mining and artificial intelligence.

HAIFENG YANG is a Professor of computer application technology with the Taiyuan University of Science and Technology, Taiyuan, China. He is a long-term member of the Institute for Intelligent Information and Data Mining. His research interests concern data mining and machine learning methods in specific backgrounds, especially astronomical big data. He is a member of the China Computer Federation (CCF) and the Chinese Astronomical Society (CAS).

XUJUN ZHAO received the M.S. degree in computer science and technology from the Taiyuan University of Technology, China. He is currently pursuing the Ph.D. degree with the Taiyuan University of Science and Technology. His research interests include data mining and parallel computing.