
Received April 20, 2020, accepted May 3, 2020, date of publication May 6, 2020, date of current version May 21, 2020.

Digital Object Identifier 10.1109/ACCESS.2020.2992903

A Novel Clustering Algorithm Based on DPC and PSO
JIANGHUI CAI1,2, HUILING WEI1, HAIFENG YANG1,2, AND XUJUN ZHAO1
1School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China
2Shanxi Key Laboratory of Advanced Control and Equipment Intelligence, Taiyuan 030024, China

Corresponding author: Haifeng Yang ([email protected])


This work was supported in part by the National Natural Science Foundation of China under Grant U1731126 and Grant U1931209, and in part by the Shanxi Province Key Research and Development Program under Grant 201803D121059 and Grant 201903D121116.

The associate editor coordinating the review of this manuscript and approving it for publication was Anandakumar Haldorai.

ABSTRACT Analyzing the clustering by fast search and find of density peaks (DPC) algorithm, we find that its cluster centers cannot be determined automatically, that the selected cluster centers may fall into a local optimum, and that the value of the cut-off distance parameter dc is selected arbitrarily. To overcome these problems, a novel clustering algorithm based on DPC & PSO (PDPC) is proposed. Particle swarm optimization (PSO) is introduced because of its simple concept and strong global search ability, which can find the optimal solution in relatively few iterations. First, to mitigate the effect of the choice of the parameter dc on the density calculation and the clustering results, this paper proposes a method to calculate that parameter. Second, a new fitness criterion function is proposed, and K global optimal solutions, that is, the initial cluster centers, are searched iteratively through the PSO algorithm. Third, each sample is assigned to the K initial center points according to the minimum distance principle. Finally, we update the cluster centers and redistribute the remaining objects to the clusters closest to the cluster centers. Furthermore, the effectiveness of the proposed algorithm is verified on nine typical benchmark data sets. The experimental results show that PDPC can effectively solve the problem of cluster center selection in the DPC algorithm, avoiding the subjectivity of the manual selection process and overcoming the influence of the parameter dc. Compared with six other algorithms, the PDPC algorithm has a stronger global search ability, higher stability and a better clustering effect.

INDEX TERMS Clustering, density peak, particle swarm optimization, fitness function.

I. INTRODUCTION
In 2014, Alex Rodriguez et al. proposed a new algorithm, the clustering by fast search and find of density peaks (DPC) algorithm [1]. Because DPC has the advantages of simple algorithmic principles, easy implementation and the ability to quickly find clusters of arbitrary shapes, many researchers have studied and applied it since the algorithm was published. The advantages of the DPC clustering algorithm are outstanding, but its disadvantages are also obvious. The DPC algorithm has the following disadvantages:
(1) It is difficult to determine the value of the parameter cut-off distance dc, which mainly depends on subjective experience and lacks a certain basis for selection;
(2) The selection of cluster centers requires human participation and easily falls into a local optimum, which cannot guarantee the objectivity and accuracy of the clustering results.

For these shortcomings of the DPC algorithm, the particle swarm optimization (PSO) algorithm is introduced; it has a simple concept and a strong global search ability that can find the optimal solution in a relatively small number of iterations [2]. In this paper, a new fitness function based on the DPC algorithm is proposed, and a method for calculating the parameter dc is proposed. On these bases, a novel clustering algorithm based on DPC & PSO (PDPC) is proposed. The effectiveness and advantages of the PDPC algorithm are verified by experiments on typical benchmark data sets. Experiments show that our algorithm can effectively solve the problem of cluster center selection in the DPC algorithm, avoiding the subjectivity of the manual selection process and overcoming the influence of the parameter dc.

A. MOTIVATIONS
The motivation of this study can be summarized as follows:
• There is a parameter cut-off distance dc in the DPC algorithm that is selected according to an empirical value,


which may affect the clustering results. Therefore, it is necessary to propose a new method of calculating dc.
• The cluster centers selected by the DPC algorithm are likely to fall into a local optimum. This problem also impacts the clustering results and needs to be solved.
• Since the DPC algorithm visually identifies the cluster centers on the decision diagram (see Section III.A.2), human factors may directly affect the clustering results. Therefore, it is necessary to overcome the influence of human factors and achieve the automatic identification of cluster centers.

Motivation 1: In the density formula of the DPC algorithm, there is a parameter cut-off distance dc, which is chosen empirically so that the average number of neighbors is 1% to 2% of the size of the data set [1]. This empirically chosen value is uncertain and unreliable, which may affect the calculation of density and in turn affect the clustering results. Therefore, a new method for calculating dc based on the Gaussian distance is proposed.

Motivation 2: The deficiencies of the DPC clustering algorithm must be overcome; its selected cluster centers may fall into a local optimum, and its initial centers may be located in the same cluster or may not be found. These issues can affect the clustering results. Considering the above problems, this paper introduces an intelligent optimization algorithm for clustering analysis.

Motivation 3: The DPC algorithm selects cluster centers visually and intuitively on the decision diagram. Some improved clustering methods use the same strategy, such as DP_K-medoids [3] and DPNM_K-medoids [3]. These methods show good performance on different data sets. However, there are human factors in the process of selecting cluster centers that may directly affect the clustering results. This insufficiency motivates us to propose a method that automatically identifies the cluster centers in the data set.

B. CONTRIBUTIONS
Inspired by the above motivations, the PDPC clustering algorithm is proposed. First, to eliminate the influence of the parameter dc, this paper proposes a method to calculate it. Second, a new fitness criterion function based on the DPC algorithm is proposed, and K initial cluster centers are searched iteratively by the PSO algorithm. Then, each sample is assigned to the K initial center points according to the minimum distance principle. Finally, we update the cluster centers and redistribute the remaining objects to the clusters closest to the cluster centers. The process iterates until the allocation of objects no longer changes in any cluster or the termination condition of iteration is reached. The experimental results show that, compared to the other methods, the PDPC algorithm has a stronger global search ability, higher stability and a better clustering effect on the benchmark data sets.

The main contributions of this work are summarized as follows:
• To eliminate the influence of the parameter cut-off distance dc on the clustering results, a method of calculating the parameter dc is proposed. First, the Gaussian distance between the data points is calculated. Second, the maximum and minimum Gaussian distances are found. Finally, dc is set to the mean value of the maximum and minimum Gaussian distances.
• To address the problem that the cluster centers selected by the DPC algorithm easily fall into a local optimum, the PSO intelligent optimization algorithm is introduced for clustering analysis, and the global search ability of PSO is used to find K approximate optimal solutions. We use these optimal solutions as the initial cluster centers. The PDPC algorithm thus selects the cluster centers automatically and avoids the subjectivity of the manual selection process.
• Reference [1] observed that a cluster center is characterized by a high density ρi and a long distance δi. Based on this feature of the DPC algorithm, a new fitness function is proposed. Setting the fitness function is a key step in solving the optimization problem, and the design of the fitness function should be as simple as possible. Therefore, we use the inverse of the product of density and distance as the fitness function.
• We use multiple typical benchmark data sets to test the performance of the PDPC algorithm and use three well-known clustering quality indicators (accuracy, precision and recall) to evaluate the clustering results. Comparison experiments with six other algorithms show the effectiveness and correctness of the proposed clustering algorithm.

C. ROADMAP
The rest of this paper is organized as follows. Section II summarizes the related work. Section III gives the theoretical basis and some related concepts. In Section IV, a novel clustering algorithm based on DPC & PSO (PDPC) is proposed and introduced in detail. Section V analyzes the experimental results on typical benchmark data sets and then analyzes the characteristics of the proposed algorithm. Six improved clustering algorithms (DP_K-medoids [3], DPNM_K-medoids [3], Improved K-means [4], K-means [5], Hybrid PSO and K-means [6] and DPC [1]) were selected for comparison. Finally, a summary of this work is given in Section VI.

II. RELATED WORKS
Clustering is a dynamic research field in data mining. It is also an important unsupervised learning technique in machine learning. Clustering is the process of grouping a set of data objects into multiple groups or clusters so that objects within a cluster have high similarity but are very dissimilar to objects in other clusters. Clustering as a data mining tool has its roots in many application areas such as biology, security, business intelligence, pattern recognition, Web search [7]–[9], trajectory clustering [10], [11] and astronomy [12]–[14].


Traditional approaches to clustering can be broadly categorized into partition-based, hierarchical-based, density-based, model-based, grid-based and soft computing methods [15]. Partitioning methods such as K-means [5] and K-medoids [16] relocate points by moving them from one category to another according to distance. These methods always need the number of clusters to be set in advance, and they are sensitive to the initial cluster centers. For the problem of cluster center selection, [17] proposed a novel algorithm for initial cluster center selection, which uses MNN (M nearest neighbors), density and distance to determine the initial cluster centers. The authors show that the method obtains high-quality initial cluster centers. Hierarchical methods [18] structure categories by recursively classifying the data in either a top-down or bottom-up fashion. Density-based methods assume that the points that belong to each cluster are drawn from a specific probability distribution [19]. Clusters of arbitrary shape can be discovered by density-based methods such as DBSCAN [20] and Denclue [21]. Model-based methods [22] obtain the clustering results by optimizing the fit between the given data and certain mathematical models. Reference [23] developed a simple clustering model inspired by the way in which the human visual system associates patterns spatially; the approach is based on Cellular Neural Networks (CNNs), similar to the biological model. In grid-based methods, the data space is divided into a finite number of unit grid structures [24]; therefore, such methods have a high processing speed. Evolutionary approaches belonging to the soft computing methods [25], [26] are also used to deal with clustering problems. These algorithms, such as the genetic algorithm (GA), artificial bee colony (ABC) and PSO [27], [28], can obtain satisfactory results by optimizing an objective function.

In 2014, there was a large breakthrough in density-based clustering approaches: Rodriguez and Laio proposed the DPC algorithm [1]. DPC is based on the concept that cluster centers are characterized by a higher density than that of their neighbors and by a relatively larger distance from points with higher densities. This algorithm uses these two features to obtain a scatter graph called a decision diagram, which is used to visually judge the potential cluster centers. Finally, each remaining point is assigned to a cluster according to its nearest neighbor of higher density. The algorithm is simple, and the clustering can be completed in one step without iteration. However, the algorithm involves human factors when selecting the cluster centers, which may directly affect the clustering results.

In response to the problems of the DPC algorithm, researchers have proposed many different algorithms. Reference [3] used DPC to optimize the initial medoids of the K-medoids clustering algorithm. To obtain better clustering, a new measure function is proposed as the ratio of the intra-distance of clusters to the inter-distance between clusters. The authors proposed two new K-medoids clustering algorithms: the DP_K-medoids algorithm and the DPNM_K-medoids algorithm. In [29], the new clustering algorithm, which finds density peaks based on Chebyshev's inequality (CDP), obtains a judgment index by screening density and distance, which are normalized. The points whose judgment indexes are above the upper bound based on Chebyshev's inequality are selected as the cluster centers; then, the remaining points are assigned to their nearest neighbor of higher density. Inspired by the visual selection rule of DPC, reference [30] proposed a judgment index that approximately follows the generalized extreme value (GEV) distribution, and each cluster center's judgment index is much higher. Hence, it is reasonable that points are selected as cluster centers if their judgment indexes are larger than the upper quantile of GEV. This proposed method is called density peaks clustering based on generalized extreme value distribution (DPC-GEV).

Reference [31] introduced the ideas of K-nearest neighbors (KNN) and principal component analysis (PCA) into DPC to improve the performance of the DPC algorithm. Reference [32] used the techniques of K-nearest neighbors and fuzzy weighted K-nearest neighbors to overcome the deficiencies of the DPC algorithm. Reference [33] enhanced DPC to make it suitable for hyperspectral band selection. The proposed approach is named the enhanced FDPC (E-FDPC), and it uses an exponential-based learning rule to adjust different numbers of cut-off thresholds and determine cluster centers automatically. Reference [34] presented a density peak based hierarchical clustering method (DenPEHC), which directly generates clusters on each possible clustering layer, and introduced a grid granulation framework to enable the clustering of large-scale and high-dimensional (LSHD) data sets.

To overcome the shortcomings of clustering algorithms in initial cluster center selection and their tendency to fall into a local optimum, some researchers use intelligent optimization algorithms for clustering analysis and treat the clustering problem as an optimization problem. Among these strategies, the PSO algorithm is very popular due to its flexibility, robustness, discreteness and self-organization. PSO clustering focuses on solving clustering problems by using group behavior. Therefore, the global search ability of the PSO algorithm is used to find an approximate optimal solution.

PSO is a group intelligent optimization method proposed by Kennedy and Eberhart in 1995 [2]. It is derived from research on bird predation behavior and is an iteration-based optimization tool. The system is initialized to a set of random solutions and searches for the optimal value by iteration. The PSO algorithm is simple, easy to implement, and does not have many parameters to adjust. It has been widely used in function optimization, neural network training, and fuzzy system control.

In recent years, the PSO optimization algorithm and PSO-based improved clustering methods have been studied and applied. Reference [35] proposed a PSO clustering algorithm based on different learning methods. The authors proposed two improved fitness functions, which greatly improved the


classification accuracy of the clustering algorithm. Reference [36] proposed an effective PSO clustering method; to address the shortcoming that, when PSO is applied to large data sets, particles on the boundary of the search space cannot be moved to a better position, a mapping method is proposed. Reference [37] proposed an approach for document clustering using the particle swarm optimization method. This method is applied before K-means to find optimal points in the search space, and these points are used as initial cluster centroids for the K-means algorithm to find the final clusters of documents. Reference [38] used K-medoids clustering to provide a fitness metric for the particle swarm optimization procedure to distinguish between active and inactive pixels in a scheme. Reference [6] proposed two new approaches to using PSO to cluster data and showed how PSO can be used to find the centroids of a user-specified number of clusters. The algorithm is then extended to use K-means clustering to seed the initial swarm; this second algorithm primarily uses PSO to refine the clusters formed by K-means.

Reference [39] proposed a new method named MSSE-PSO (master-slave swarm shuffling evolution algorithm based on particle swarm optimization). MSSE-PSO combines the strengths of particle swarm optimization, competitive evolution and sub-swarm shuffling, which greatly enhances survivability by sharing the information gained independently by each swarm. Besides, MSSE-PSO adopts a hierarchical idea, by which the master swarm guides the whole group in the optimal direction to control the balance between exploration and exploitation.

In summary, the main problems of the DPC algorithm are as follows: (1) there is a parameter cut-off distance dc in the local density calculation formula that is selected according to an empirical value, but this value is unreliable and may have an impact on the clustering results; (2) the selected cluster centers may fall into a local optimum, and the initial centers may be located in the same cluster or not be found; and (3) the selection of cluster centers involves human factors, which may directly affect the clustering results.

In contrast, the main advantages of the PDPC algorithm are as follows: (1) a new method for calculating the parameter cut-off distance dc is proposed; when calculating the density of data points, dc no longer needs to be selected according to an empirical value; (2) this study introduces the PSO intelligent optimization algorithm because it has strong global search ability, which prevents the cluster centers selected by the DPC algorithm from falling into a local optimum; (3) a new fitness function is proposed based on the DPC algorithm, which iteratively searches for K global optimal solutions, that is, the initial cluster centers, by the PSO algorithm; this approach overcomes the influence of human factors and realizes the purpose of automatically identifying cluster centers; and (4) compared with the six other algorithms, the proposed algorithm improves clustering performance and computational efficiency and has a good clustering effect.

III. THEORETICAL BASIS
In this section, we introduce two classic algorithms: the density peak clustering algorithm and the particle swarm optimization algorithm.

A. DENSITY PEAK CLUSTERING ALGORITHM
Reference [1] describes a new density-based clustering algorithm, the density peak clustering algorithm, which uses novel ideas and is simple and clear. The premise of the algorithm is that a cluster center is surrounded by points whose densities are smaller than its own and that the center has a larger distance from other high-density points. The algorithm defines two parameters for each data point i: one is the density ρi of the data point, and the other is the distance δi from the data point to a local high-density point. It uses these two features to obtain a scatter graph called the decision diagram, selects the points where ρi and δi are both large as the cluster centers on the decision diagram, and assigns the remaining points to the cluster of the high-density point closest to them.

1) DENSITY AND DISTANCE CALCULATION
We define a density for each data point i based on the distances between the data points:

    ρi = Σj χ(dij − dc)    (1)

where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise; i and j are different data points; dij is the Euclidean distance between data points; and dc is the cut-off distance, a hyperparameter chosen so that the average number of neighbors is approximately 1% to 2% of the total number of points in the data set. ρi is thus the number of points that fall within a radius dc of point i.

The minimum distance from each data point to a point of higher local density is

    δi = min_{j: ρj > ρi} dij    (2)

For the point with the globally highest density, the opposite is taken: its δi is the maximum distance to any other sample. Note that δi is much larger than the typical nearest-neighbor distance only for points that are local or global maxima in the density. Thus, cluster centers are recognized as points for which the value of δi is unusually large.

2) DECISION DIAGRAM
A decision diagram is a novel method for identifying the cluster centers of a data set proposed in [1]. This method determines cluster centers by constructing a decision diagram from the local density ρ and distance δ of each sample point in the data set. When both the density value ρ and the distance value δ of a point are large, the point may be a cluster center.

Based on the values of the local density and distance of the sample points, cluster centers can be selected intuitively.
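To make the two quantities concrete, the following sketch computes ρi with the cut-off kernel of equation (1) and δi as in equation (2) for a small data matrix; this NumPy-based helper is our own illustration of the definitions above, not code from [1].

    import numpy as np

    def dpc_rho_delta(X, dc):
        # Pairwise Euclidean distances d_ij between all points.
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        # Eq. (1): rho_i counts how many points lie closer than dc.
        rho = (d < dc).sum(axis=1) - 1  # subtract 1 for the zero self-distance
        delta = np.empty(len(X))
        for i in range(len(X)):
            higher = np.flatnonzero(rho > rho[i])  # points of larger density
            # Eq. (2): distance to the nearest higher-density point; the
            # global density peak takes the maximum distance instead.
            delta[i] = d[i, higher].min() if higher.size else d[i].max()
        return rho, delta

Candidate centers are then the points for which both returned arrays are large, which is exactly what the decision diagram visualizes.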


FIGURE 1. (a) Point distribution in two dimensions. (b) Decision diagram for each sample point.

Reference [1] uses the data set shown in Figure 1 to illustrate the process of selecting cluster centers in the decision diagram. There are 28 sample points arranged in descending order of density in Figure 1(a), and the sample points can be divided into two clusters. Figure 1(b) is the decision diagram drawn with ρ as the horizontal axis and δ as the vertical axis. It can be seen that sample points 1 and 10 are located in the upper right corner of the decision diagram. Their local density and distance are both large, so these two points are the cluster centers. Points 26, 27, and 28 have a relatively high δ and a low ρ because they are isolated, and they can be considered outliers.

3) CLUSTERING PROCESS
Points with relatively large local density ρi and distance δi are considered the cluster centers: they are inherently dense, surrounded by lower-density neighbors, and relatively far from other higher-density points. After the cluster centers are determined, the remaining points are attributed to the cluster of the highest-density point closest to them, and the final clustering result is obtained.

The DPC algorithm has the advantages of simple principles and easy implementation and can quickly find clusters of any shape. However, in the clustering process, the decision diagram plays a decisive role in determining cluster centers using qualitative selection instead of quantitative analysis. Selecting the data points for which both ρi and δi are large is subjective: for the same decision diagram, different people may make different choices. As a result, the selected cluster centers may be located in the same cluster or may not be found.

B. PARTICLE SWARM OPTIMIZATION ALGORITHM
PSO is a common evolutionary algorithm based on the concepts of group and fitness [2]. An individual of the particle swarm represents a possible solution to the problem. Starting from a random solution, the PSO algorithm uses iteration to find a possible optimal solution and uses fitness to determine the quality of the solution. The algorithm randomly initializes a group of particles and then iterates to find the optimal solution. In each iteration, a particle tracks the individual extremum and the global extremum to dynamically update its velocity and position.

1) PARTICLE VELOCITY AND POSITION
In a D-dimensional target search space, the PSO algorithm refers to individuals as ''particles''. The position of each particle represents a solution to the problem, and a particle constantly adjusts its position x to search for new solutions [2]. The total number of particles is set to m, where the position of the i-th particle in the d-th dimension is xid, the flying velocity is vid, the best position the particle has searched so far is Pid, and the best position found by the particle swarm as a whole is Pgd. The update formulas for velocity and position are as follows:

    vid(t + 1) = w × vid(t) + c1·r1·[Pid(t) − xid(t)] + c2·r2·[Pgd(t) − xid(t)]    (3)

    xid(t + 1) = xid(t) + vid(t + 1)    (4)

where i = 1, 2, . . . , m; d = 1, 2, . . . , D; w is the inertia weight; c1 and c2 are learning factors, which are nonnegative constants and usually take c1 = c2 = 2; r1 and r2 are random numbers in (0, 1); Pid is the individual extremum; and Pgd is the global extremum.

In the velocity update formula (3), the first term is the product of the inertia weight and the particle's current velocity, which represents the degree of trust of the particle in its current movement and is based on the inertial motion of the original velocity; the second term indicates self-awareness, which is the particle's judgment of its own history; and the third term represents social awareness, which is the mutual cooperation and information sharing of particles in the group.

2) ALGORITHM STEP
The flowchart of the PSO algorithm is shown in Figure 2. The PSO algorithm is initialized as a group of random particles (random solutions), and the optimal solution is then found through iteration. In each iteration, a particle updates itself by tracking two extreme values: the individual extreme value Pid and the global extreme value Pgd. All particles have a fitness value determined by the function being optimized, and each particle also has a velocity that determines the direction and distance of its flight. The particles follow the current optimal particle and search in the solution space until the maximum number of iterations is reached; otherwise, execution continues.

FIGURE 2. The PSO algorithm flowchart.
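As a minimal sketch of equations (3) and (4), the helper below performs one synchronous update of a whole swarm held in NumPy arrays; the fixed c1 = c2 = 2 follows the description above, and all names are our own scaffolding rather than a reference implementation.

    import numpy as np

    def pso_step(x, v, p_best, g_best, w, c1=2.0, c2=2.0):
        # x, v, p_best: (m, D) arrays; g_best: (D,) array.
        m, D = x.shape
        r1 = np.random.rand(m, D)  # r1, r2 ~ U(0, 1), drawn per dimension
        r2 = np.random.rand(m, D)
        # Eq. (3): inertia term + cognitive term + social term.
        v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
        # Eq. (4): move each particle along its new velocity.
        return x + v, v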


IV. PDPC CLUSTERING ALGORITHM
In this section, based on the advantages of the PSO algorithm, a novel clustering algorithm based on DPC & PSO (PDPC) is proposed. The rest of this section is organized as follows. In Section IV.A, a method of calculating the parameter dc is proposed to solve the problem of arbitrarily selecting dc according to empirical values in the DPC algorithm. In Section IV.B, a fitness function is proposed based on the DPC algorithm; setting the fitness function is a crucial step in solving the optimization problem. In Section IV.C, the parameters of the velocity update formula are redefined in the PSO algorithm. In Section IV.D, the proposed PDPC algorithm is introduced in detail and the algorithm steps are given. Finally, in Section IV.E, the time complexity of PDPC and the comparison algorithms is analyzed.

A. SETTING THE PARAMETER
In the density peak clustering algorithm proposed in [1], the parameter cut-off distance dc is difficult to determine; it mainly relies on subjective experience, is generally set using approximately 1% to 2% of the size of the data set, and lacks a definite selection basis. Therefore, its impact on the clustering results is great.

To solve the influence of the parameter dc value on the clustering results, a new method for calculating dc is proposed in this paper. The specific steps are as follows:
1: Calculate the Gaussian distance between data points:

    Distance = 1 − e^(−dij²/2)    (5)

2: Get the maximum and minimum values of the Gaussian distance, denoted maxDistance and minDistance, respectively;
3: Take the mean value of maxDistance and minDistance to obtain the value of dc:

    dc = (maxDistance + minDistance) / 2    (6)

B. FITNESS FUNCTION
The quality of the current position of a particle is measured by the fitness function, which assigns each particle a corresponding fitness value. We hope that the algorithm will automatically recognize cluster centers. The probability that a particle is ultimately identified as a cluster center is proportional to the product of its density ρi and distance δi [40].

Here, we use the density formula as follows:

    ρi = Σj e^(−(dij/dc)²)    (7)

From the analysis of equations (1) and (7), we know that (1) calculates a discrete value and that (7) calculates a continuous value. In comparison, the probability of conflict in (7) is small; that is, the probability that different data points have the same local density is small. The value of dc in (7) can be calculated by (6) and is no longer selected according to an empirical value. Therefore, the local density calculated by (7) is better. In view of this consideration, we design the fitness function as follows:

    f(dij) = 1 / (ρ × δ) = 1 / (Σj e^(−(dij/dc)²) × min_{j: ρj > ρi} dij)    (8)

where i and j denote different particles, dij is the Euclidean distance between particles, and dc is the cut-off distance mentioned in Section IV.A. For a general particle, δ = min dij; however, for the particle with the largest density value, δ = max dij. The smaller the value of f(d) is, the greater the probability that the particle becomes a cluster center point. If f(d)n < f(d)n−1, the optimal position needs to be updated.

We set a convergence condition as the termination condition of iteration for the PSO algorithm to ensure the performance of the proposed algorithm. The convergence formula is as follows:

    |f(d)n − f(d)n−1| ≤ ε,  n ≥ 2    (9)

where ε is the convergence parameter and n is the number of iterations. When a certain number of iterations is reached, the difference between f(d)n and f(d)n−1 becomes very small, and it is determined that the particle swarm algorithm has reached convergence.
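Putting Sections IV.A and IV.B together, the dc rule of equations (5)-(6) and the fitness of equations (7)-(8) could be computed as in the sketch below. This is our own hedged rendering with invented function names; note that the density sum of equation (7) here includes the constant j = i term, which shifts every ρi equally.

    import numpy as np

    def cutoff_distance(d):
        # Eq. (5): Gaussian distance for every pair of points;
        # d is the (n, n) matrix of Euclidean distances d_ij.
        g = 1.0 - np.exp(-d ** 2 / 2.0)
        off_diag = g[~np.eye(len(d), dtype=bool)]  # ignore the zero diagonal
        # Eq. (6): dc is the mean of the extreme Gaussian distances.
        return (off_diag.max() + off_diag.min()) / 2.0

    def fitness(i, d, dc):
        # Eq. (7): continuous local density of every point.
        rho = np.exp(-(d / dc) ** 2).sum(axis=1)
        higher = np.flatnonzero(rho > rho[i])
        delta_i = d[i, higher].min() if higher.size else d[i].max()
        # Eq. (8): smaller fitness means a more center-like particle.
        return 1.0 / (rho[i] * delta_i)

The PSO search then minimizes this fitness and stops once |f(d)n − f(d)n−1| ≤ ε, as in equation (9).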


C. SELECTION OF PARAMETERS IN VELOCITY UPDATE FORMULA
The velocity and position update formulas of the PDPC algorithm proposed in this paper use (3) and (4). However, we redefine the parameters in the velocity update formula (3). The selected values for the parameters are as follows:
1: w is the inertia weight. Shi and Eberhart added the inertia weight to make the particles gradually slow down, which also affects exploration and exploitation [41]. High values of w prevent particles from slowing down more than lower values do, which is good for exploring the search space; lower values of w allow particles to exploit a good region without overshooting positions too much. It was found that linearly decreasing the inertia weight from 0.9 to 0.4 produces good results [41]. The inertia weight is decreased according to

    w = wmax − t × (wmax − wmin) / tmax    (10)

where wmax and wmin are the initial and final values of the inertia weight, respectively, t is the number of iterations and tmax is the maximum number of iterations.
2: c1 and c2 are learning factors. Ratnaweera et al. modified PSO by changing the acceleration coefficients over time [42]. This variant is called time-varying acceleration coefficient PSO (TVAC-PSO). The cognitive acceleration (c1) starts with a higher value than c2 and linearly decreases, while the social acceleration (c2) starts with a lower value and linearly increases. The ranges of the values are the following: c1 decreases linearly from 2.5 to 0, and c2 increases linearly from 0 to 2.5. The linear change is performed using

    c1(t + 1) = (c1,final − c1,initial) × t / tmax + c1,initial    (11)

    c2(t + 1) = (c2,final − c2,initial) × t / tmax + c2,initial    (12)

where cfinal and cinitial are the final and initial values of the acceleration coefficient, respectively, t is the number of iterations and tmax is the maximum number of iterations. Note that c1 and c2 are now functions of time.
3: r1 and r2 are random numbers obeying the U(0, 1) distribution.

D. PDPC ALGORITHM
This paper proposes the PDPC clustering algorithm, which is mainly developed to address the defects of the DPC algorithm. Its main contribution is to introduce the PSO intelligent optimization algorithm for clustering analysis.

1) ALGORITHM IDEA
The shortcomings of the DPC algorithm urgently need to be addressed. In this paper, first, to mitigate the influence of the selection of the parameter dc on the clustering results, a method for calculating dc is proposed that uses the mean value of the maximum and minimum values of the Gaussian distance. Second, the PDPC algorithm introduces the PSO intelligent optimization algorithm for clustering analysis: based on the density and distance of the data points, a new fitness function is proposed, and the global search ability of the PSO algorithm is used to find the K approximate optimal solutions. Then, each sample is assigned to the K initial center points according to the minimum distance principle. Finally, we update the cluster centers and redistribute the remaining objects to the clusters closest to the cluster centers. The process iterates until the reallocation of objects no longer changes in any cluster or the termination condition of iteration is reached. The experimental results show that the PDPC algorithm has a strong global search ability, high stability and a good clustering effect.

2) ALGORITHM DESCRIPTION
According to the above description, the specific steps of the PDPC algorithm are shown in Algorithm 1.

Algorithm 1 PDPC: A Novel Clustering Algorithm Based on DPC & PSO
Input: A data set containing n objects, the number of clusters K.
Output: K cluster center points, the final clustering results.
Begin:
1: Initialization; set the number of particles m and the convergence condition.
2: Calculate the fitness function value of each particle according to (8).
3: Update Pid and Pgd by comparing the fitness of each particle with the fitness of its best position Pid and the fitness of the optimal position Pgd.
4: Update the velocity and position of each particle using (3) and (4).
5: Verify whether the final condition is met. If the termination condition of iteration is satisfied, stop the iteration; otherwise, go to Step 2.
6: Take the K optimal points given by the PSO algorithm as the initial cluster centers.
7: Calculate the distance of each data point to each cluster center.
8: According to the current positions, assign each sample to the K initial cluster centers according to the principle of minimum distance.
9: Based on the new classification, calculate the new cluster center of each cluster using (13).
10: Perform an iterative process of assigning the remaining data points and updating the cluster centers. Stop iterating when the clustering results remain the same or the termination condition of iteration is reached.
11: Output the final clustering results.
End

In order to overcome the defects of the DPC algorithm, this paper proposes the PDPC clustering algorithm. First, a new fitness function based on the DPC algorithm is proposed. Second, the K optimal solutions found by the PSO method are used as the initial cluster centers. Finally, the iterative process is performed and the clusters are created. Steps 1-6 are the PSO optimization process, where the fitness function used in Step 2 is based on the DPC algorithm, and Steps 7-11 are the clustering process. Step 9 determines the new centers using formula (13); it computes the new mean using the objects assigned to the cluster. All the objects are then reassigned using the updated means as the new cluster centers. The iterations continue until the assignment is stable, that is, the clusters formed in the current round are the same as those formed in the previous round, or the termination condition of iteration is reached.

    Centeri = (1 / ni) Σ_{xi ∈ Ci} xi    (13)

where Centeri is the new center of cluster Ci, xi is a data point that belongs to cluster Ci, and ni is the number of data points that belong to cluster Ci.
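To connect Sections IV.C and IV.D, the sketch below first evaluates the time-varying coefficients of equations (10)-(12) and then runs the assignment/update loop of Steps 7-11 of Algorithm 1 from given initial centers. It is our own minimal rendering under the stated formulas; the PSO search of Steps 1-6 is assumed to have already produced the centers array.

    import numpy as np

    def tvac(t, t_max, w_max=0.9, w_min=0.4, c_hi=2.5, c_lo=0.0):
        w = w_max - t * (w_max - w_min) / t_max   # eq. (10)
        c1 = (c_lo - c_hi) * t / t_max + c_hi     # eq. (11): 2.5 -> 0
        c2 = (c_hi - c_lo) * t / t_max + c_lo     # eq. (12): 0 -> 2.5
        return w, c1, c2

    def refine_clusters(X, centers, max_iter=100):
        for _ in range(max_iter):
            # Steps 7-8: minimum-distance assignment to the K current centers.
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Step 9, eq. (13): each new center is the mean of its cluster;
            # an empty cluster keeps its previous center.
            new_centers = np.array([X[labels == k].mean(axis=0)
                                    if np.any(labels == k) else centers[k]
                                    for k in range(len(centers))])
            # Step 10: stop once the centers (hence the assignment) are stable.
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers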


TABLE 1. Summary of the time complexity for each of the seven algorithms.

The particle swarm optimization algorithm first divides the particle swarm into several ''subgroups'' according to the clustering algorithm and finds the optimal position of each ''subgroup''; then, the particles in the particle swarm update their velocity and position values based on their individual extremum and the optimal position in each ''subgroup''. By clustering the particle swarm, the algorithm exchanges information between the particles and finds the optimal solution in the iterative process, which makes the global convergence of the algorithm stronger.

E. COMPLEXITY ANALYSIS
In this subsection, the calculation costs are analyzed for PDPC, DP_K-medoids [3], DPNM_K-medoids [3], Improved K-means [4], K-means [5], Hybrid PSO and K-means [6] and DPC [1], as shown in Table 1. Each method differs in its calculation complexity. In addition, the total clustering complexity includes updating the centers and calculating the distance between each pair of objects.

For a data set containing n objects, for all algorithms except K-means, the time complexity of calculating the distance matrix is O(n²). The K-means algorithm does not need to calculate the distance matrix or the density between data points during the implementation process; its time complexity for calculating the distance from each sample point to each ''cluster center'' is O(n).

For all algorithms except the K-means and Hybrid PSO and K-means algorithms, the time complexity for calculating all sample densities is O(n²). The Hybrid PSO and K-means clustering algorithm first executes K-means once. The result of the K-means algorithm is then used as one of the particles, while the rest of the swarm is initialized randomly. Therefore, the algorithm does not need to calculate the density between data points, and its total time complexity is O(n²).

The time complexity of the cluster center iterative process of the six algorithms other than DPC is O(tnK), where t is the number of iterations of the algorithm, n is the number of data points, and K is the number of clusters. After obtaining the initial cluster centers, the DPC algorithm assigns each remaining point to the cluster of the nearest neighbor whose density is larger than that of the point, so the sample allocation time complexity is O(n). Therefore, the time complexity of the DPC algorithm in calculating all objects is O(n²), without accounting for the process of determining the cluster centers manually [32].

For the PDPC algorithm, the number of particles in each iteration does not change. Assume that the number of particles in the i-th iteration is ni, where i = 1, 2, ..., t and t represents the maximum number of iterations, so n1 = n2 = ... = nt = n. The complexity of calculating the distance matrix is O(n²), and the time complexity for calculating all sample densities is O(n²). It can be concluded that the time complexity of selecting the initial centers using the PSO algorithm is O(tn). In the center-updating phase, the complexity of updating the K centers is O(tnK). From this, we can determine that the total time complexity of the PDPC algorithm is O(n²).

The complexity of each of the seven algorithms is summarized in Table 1. The time complexity of the K-means algorithm is small, but K-means iterates many times during the running process. Intuitively, our PDPC has the same time complexity as the DP_K-medoids, DPNM_K-medoids, Improved K-means, Hybrid PSO and K-means and DPC algorithms. However, we introduced the PSO algorithm, which reduces the number of iterations because of its strong global search capability. Overall, the running time of the proposed algorithm is lower, as shown by the following experimental analysis.

V. EXPERIMENTAL RESULTS AND DISCUSSION
All experiments are performed on an Intel Xeon E-2186M processor with 2.90 GHz and 32.0 GB RAM running Windows 10 Ultimate. All programs are compiled and executed using Eclipse 4.3.2 on a Java HotSpot 64-bit server Virtual Machine.

In this section, we discuss the testing and verification of the clustering performance of the proposed PDPC algorithm and compare the results with those of the other six algorithms (DP_K-medoids [3], DPNM_K-medoids [3], Improved K-means [4], K-means [5], Hybrid PSO and K-means [6] and DPC [1]) using both classical synthetic data sets and real data sets. The clustering results of the algorithms were evaluated using the clustering time, the number of iterations, the accuracy of the clustering [45], and the precision and recall external validity evaluation indexes.

TABLE 2. Characteristics of the data sets.

A. DATA SET SELECTION AND INTRODUCTION
The data sets are divided into two groups, synthetic and real-world data sets. To verify the validity of the algorithm, we used two synthetic data sets and seven real data sets to test the performance of the clustering algorithms, as shown in Table 2. The synthetic data sets come from the research published in [43], [44]. Spiral has three clusters in the 3-spiral data set. Aggregation consists of seven distinct groups that are non-Gaussian clusters. These data sets are labeled, and their descriptions are as follows.

Wdbc: Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

Wireless: These data were collected to perform experimentation on how WiFi signal strength can be used to determine an indoor location.

Waveform and Waveform(noise): These are different versions of a waveform with 3 classes of waves. Waveform contains 5000 instances with 21 attributes, and Waveform(noise) has 19 more noise attributes.

Frogs-MFCCs: This data set was used in several classification tasks related to the challenge of anuran species recognition through their calls.

Electrical Grid: This is a local stability analysis of a 4-node star system (where the electricity producer is in the center) implementing the decentral smart grid control concept.

Pendigits: This is a digit database that collects 250 samples from 44 writers.

A detailed description of the nine experimental data sets is shown in Table 2. In Table 2, column ''Points'' specifies the number of sample points in each data set; ''Attributes'' gives the dimension of each data set; and ''Clusters'' denotes the number of clusters in each data set. The data sets differ in data size, attribute number and/or cluster number. We use labeled data sets to test the performance of the algorithm, which is helpful for evaluating its clustering quality. Therefore, K represents the number of clusters, and the K value is directly input as a constant. Generally, the number of clusters K cannot be set too large. Therefore, for data sets with an unknown distribution, we determine the number of clusters K by experiment.

FIGURE 3. Determination of the K value on the Aggregation and Wdbc data sets.

The specific method is as follows. First, the range of K is set to 2-10. Then, the algorithm is run for each K value, and the accuracy (AC) (see Section V.B) of the current result is calculated to determine the optimal number of clusters; the K corresponding to the highest accuracy is regarded as the final number of clusters. Taking the Aggregation and Wdbc data sets as examples, as shown in Figure 3, it can be seen that the final K value is the same as the actual number of clusters.

B. EVALUATE CLUSTERING QUALITY
To evaluate the performance of these different clustering algorithms, three metrics are adopted in this paper. The first measure is the accuracy (AC) of the clustering results, which was proposed by Huang and Ng [45]:

    AC = (Σ_{i=1}^{K} ai) / |N|    (14)

where ai is the number of samples that are classified correctly, K is the number of clusters, and N is the number of data points in the data set. The remaining two metrics are precision (PR) and recall (RE):

    PR = (Σ_{i=1}^{K} ai / (ai + bi)) / K    (15)

    RE = (Σ_{i=1}^{K} ai / (ai + ci)) / K    (16)

where K is the number of classes of data; ai is the number of objects that are correctly assigned to class Ci (1 ≤ i ≤ K); bi is the number of objects that are incorrectly assigned to class Ci; and ci is the number of objects that should be in class Ci but are not correctly assigned to it.

For AC, PR and RE, higher values indicate better clustering quality; when their values are 1, the clustering result is entirely correct. In addition, we used the clustering time and the number of iterations to evaluate the efficiency of each algorithm.
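Given per-class counts, the three indicators reduce to a few lines; in the sketch below, the arrays a, b and c follow the notation of equations (14)-(16), and the helper itself is our own illustration rather than code from the paper.

    import numpy as np

    def ac_pr_re(a, b, c, n_points):
        # a[i]: objects correctly assigned to class C_i; b[i]: objects
        # wrongly assigned to C_i; c[i]: objects of C_i assigned elsewhere.
        a, b, c = (np.asarray(v, dtype=float) for v in (a, b, c))
        ac = a.sum() / n_points        # eq. (14)
        pr = (a / (a + b)).mean()      # eq. (15)
        re = (a / (a + c)).mean()      # eq. (16)
        return ac, pr, re

The same AC value also drives the K-selection experiment above: run the algorithm for K = 2, ..., 10 and keep the K with the highest AC.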


C. TEST CONVERGENCE OF THE PDPC ALGORITHM
We observe the convergence of the PDPC algorithm on different data sets over different numbers of iterations to determine the convergence parameter ε, as shown in Figure 4.

FIGURE 4. Test convergence.

From Figure 4, we can see that after the number of iterations reaches 40, the PDPC algorithm tends to converge, and the cluster centers no longer change noticeably. We therefore take the convergence parameter ε = 0.02. In the improved particle swarm optimization process, the cluster centers that the algorithm outputs are those obtained when the algorithm achieves convergence and stability.

D. PERFORMANCE ANALYSIS OF THE PDPC ALGORITHM
Before clustering, we used the parameter calculation method proposed in this paper to obtain the threshold value dc of each data set, as shown in Table 3. These values were adopted in the following experiments.

TABLE 3. The threshold value dc of each data set.

In this subsection, the PDPC algorithm is compared with DP_K-medoids [3], DPNM_K-medoids [3], Improved K-means [4], K-means [5], Hybrid PSO and K-means [6] and DPC [1] on the data sets in Table 2. Twenty experiments were performed on each data set; the AC, PR, and RE of each experiment were statistically analyzed; and the best value, worst value and average value over the 20 clustering experiments were recorded for each algorithm, as shown in Tables 4-6. The best results are in bold.

The experimental results in Tables 4-6 show that, compared with the other clustering algorithms, the PDPC clustering algorithm obtained relatively high average values of AC, PR, and RE on most of the data sets listed in Table 2. This result shows that the proposed algorithm has a good clustering effect and high stability. First, the experimental results of each algorithm on the two synthetic data sets are analyzed. For the Spiral data set, the DPC algorithm is optimal, and PDPC has higher values than the other algorithms and ranks second: whether considering the best value, the worst value or the average value, it is second only to DPC. The difference between the best and worst values of the PDPC algorithm is much smaller than that of DPC, indicating that the introduction of the PSO algorithm can improve the stability of the DPC algorithm. For the Aggregation data set, compared to the other six algorithms, the PDPC algorithm achieved the best clustering results. This result shows that the introduction of the PSO optimization algorithm in this paper overcomes the shortcoming of the manual center selection of DPC, which easily falls into a local optimum.

Furthermore, the experimental results of each algorithm on the real data sets are analyzed. For the Wdbc data set, the PDPC algorithm achieves the optimal average values of AC and RE, while the DPNM_K-medoids algorithm obtains the optimal average value of PR. The PDPC algorithm has a lower average value of PR than the DP_K-medoids and DPNM_K-medoids algorithms but a higher value than DPC. The DPNM_K-medoids algorithm was likewise run for 20 experiments on this data set. First, the DPNM_K-medoids algorithm selects cluster centers on the decision diagram, and the centers selected may differ in each experiment. Second, in the data object allocation stage, some data points that originally belong to a cluster are not fully allocated to it, while no data points belonging to other clusters are assigned to it. Therefore, from the analysis of formula (15), the DPNM_K-medoids algorithm may obtain a higher PR value in several experiments, that is, it may obtain the optimal average value of PR. For the remaining data sets, the PDPC algorithm performs well, and its average values of the three indicators were optimal. In general, the algorithm proposed in this paper has a good clustering effect and high stability. Our proposed algorithm overcomes the shortcoming of the DPC algorithm in which it easily falls into a local optimum, and it achieves the purpose of automatically selecting cluster centers.

Based on the above analysis, we show the average value of each indicator (AC, PR, and RE) in a line chart in Figure 5. Taking the data sets as the x-axis values and the evaluation index results as the y-axis values, the data set index value curves can be constructed. The purpose is to test the effectiveness of the proposed algorithm for clustering performance. According to the AC value curve shown in Figure 5(a), the PDPC algorithm (red line) achieves the best clustering accuracy of all algorithms on eight of the nine data sets. PDPC is followed by the DPC algorithm, which achieves the best clustering accuracy on one data set. The worst methods are the DP_K-medoids, DPNM_K-medoids, Improved K-means, K-means and Hybrid PSO and K-means algorithms, which do not obtain the best evaluation index value on any data set. The most significant improvement achieved by using the PDPC algorithm was observed for the Aggregation data set, with an improvement from 0.5977 using the DPC algorithm to 0.7850 using the PDPC algorithm. We also find that for the Waveform data set, the AC values of the


TABLE 4. The AC tested by seven algorithms on each data set.

FIGURE 5. The AC, PR and RE of seven algorithms on synthetic and real data sets.

six algorithms other than PDPC are very close, and PDPC is greatly improved. However, for the Spiral data set, the AC value of the PDPC algorithm is 0.3471, which is significantly lower than that of the DPC algorithm but still higher than those of the other five algorithms. The results indicate that the proposed algorithm may not be suitable for the Spiral data set. This is related to the distribution of the data set, because Spiral is a path-based 3-spiral data set designed for spectral clustering.

Figure 5 shows similar trends in the metrics of different algorithms on different data sets. Compared with the other six algorithms, the PDPC algorithm showed the best clustering performance on most data sets. However, there are subtle differences. For example, the DPNM_K-medoids algorithm achieved top clustering performance for one data set when using the PR (Figure 5(b)), compared to no data sets when using the AC (Figure 5(a)). Additionally, the DP_K-medoids and DPNM_K-medoids algorithms had similar clustering performance for all of the indexes on all data sets except Electrical Grid. This is because the initial cluster center selection methods of these two clustering algorithms are the same; the difference is that the clustering criterion function is


TABLE 5. The PR tested by seven algorithms on each data set.

For the Wdbc data set, the PDPC algorithm obtains the highest values of AC and RE among the clustering performance evaluation indexes, while DPNM_K-medoids obtains the highest value of PR. The PDPC algorithm performed best on seven data sets when using the PR index; on the other hand, it is still the best-performing algorithm on eight data sets when using the RE value, just as it is when using the AC value. For the Waveform data set, the PDPC algorithm showed the best clustering performance on AC, PR and RE. Furthermore, for the Waveform(noise) data set, which adds nineteen noise attributes to Waveform, the performance of the PDPC algorithm is still better than that of the other six algorithms. Therefore, the PDPC algorithm is the best method for processing the Waveform(noise) data set, which indicates that the PDPC algorithm is more stable than the other six algorithms.

Table 7 gives the number of data sets on which each of the seven algorithms showed the top clustering performance for the different evaluation indexes when using synthetic data sets and real data sets. For AC, the PDPC algorithm tied for the best clustering performance by achieving the highest value on eight of the nine data sets. PR and RE showed results similar to those for AC. In all cases and for every evaluation index, the PDPC algorithm demonstrated the best clustering performance. These results demonstrate that the PDPC algorithm is effective regardless of the evaluation index chosen.

It can be seen from Tables 4-6 that the clustering quality of the PDPC algorithm is better than that of DPC on most of the data sets in Table 2. Further, Figure 5 visually shows that PDPC (red line) is superior to DPC (blue line) on most of the data sets. From the above analysis, combined with the advantages of the PSO algorithm, the PDPC algorithm proposed in this paper overcomes the disadvantages of DPC. A method for calculating the parameter dc is proposed to resolve the uncertainty and unreliability of choosing dc from empirical values, as DPC does. For some unevenly distributed data sets, the initial centers found by the DPC algorithm may be located in the same cluster or may not be found at all; DPC may take non-cluster centers in dense clusters as the center points of sparse clusters, causing the cluster centers found to fall into a local optimum. Our algorithm solves this problem well. Moreover, the PDPC algorithm removes the limitation that traditional DPC cannot automatically determine the cluster centers, avoiding the subjectivity of the manual selection process. The experimental results show that our algorithm has a stronger global search ability, higher stability and a better clustering effect.
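The dc rule itself is developed in Section III and is not repeated here. As background, the original DPC paper suggests choosing dc so that each point has, on average, a neighborhood containing roughly 1-2% of all points. A minimal sketch of that classic percentile-style heuristic follows (our illustration only, with the hypothetical function name cutoff_distance; it is not the exact formula proposed in this paper):

    # Minimal sketch of the classic DPC cut-off heuristic: pick dc as a low
    # percentile of all pairwise distances, so that each point's
    # dc-neighborhood holds about `percent`% of the data on average.
    import numpy as np
    from scipy.spatial.distance import pdist

    def cutoff_distance(X, percent=2.0):
        """Return dc such that ~percent% of pairwise distances fall below it."""
        return np.percentile(pdist(X), percent)

    X = np.random.rand(300, 2)
    dc = cutoff_distance(X)
    # Local density rho_i = number of points closer than dc (excluding i).
    rho = np.array([np.sum(np.linalg.norm(X - x, axis=1) < dc) - 1 for x in X])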


TABLE 6. The RE tested by seven algorithms on each data set.

TABLE 7. The number of data sets on which each of the seven algorithms showed top clustering performance for the average value of the different evaluation indexes when using synthetic data sets and real data sets.

E. EVALUATION OF CLUSTERING TIME AND NUMBER OF ITERATIONS
In Section IV.E, we theoretically analyze the complexity of the DP_K-medoids [3], DPNM_K-medoids [3], Improved K-means [4], K-means [5], Hybrid PSO and K-means [6], DPC [1] and PDPC algorithms; Table 1 gives the detailed theoretical results. In this subsection, we compare the actual clustering time and the number of iterations of the six algorithms other than DPC, measured as the averages over 20 repeated clustering processes. The DPC algorithm does not perform an iterative clustering process; it distributes the remaining data points directly to the nearest cluster centers, so it is not compared with this method.

Figure 6(a) shows the average clustering time of the six clustering algorithms, in milliseconds, on the nine data sets. As shown, the difference in clustering time between the six methods is not large. However, compared with the other five algorithms, the clustering time of the proposed PDPC algorithm is relatively low, although its time complexity is not greatly improved. We can see that the clustering time of the DP_K-medoids algorithm was close to that of DPNM_K-medoids. Although the time required to manually select the centers was excluded, the DP_K-medoids and DPNM_K-medoids algorithms must generate a decision diagram, which is time consuming; this was one reason why their computational efficiency was lower. We can also see that the K-means algorithm has a longer clustering time because it needs more iterations than the other algorithms on most data sets, as shown in Figure 6(b). Figure 6(b) shows the average number of iterations of the six clustering algorithms on the nine data sets. Overall, the number of iterations of PDPC is less than that of the other algorithms.

This paper introduces the PSO optimization algorithm; because of its simple concept, strong global search capability and high stability, it can find the optimal solution in relatively few iterations. The above analysis shows that the PDPC algorithm runs faster than the other algorithms.
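To make the non-iterative nature of DPC concrete: once the centers are fixed, each remaining point simply inherits the label of its nearest neighbor of higher density, in a single pass over the points sorted by decreasing density. The sketch below follows the assignment rule of the original DPC paper [1]; the variable names, the gamma = rho * delta shortcut for picking centers, and the assumption that the global density peak is among the K centers (as is typical) are our illustrative choices:

    # Illustrative sketch of DPC's one-pass assignment: no iterative phase.
    import numpy as np

    def dpc_assign(X, dc, K):
        n = X.shape[0]
        D = np.linalg.norm(X[:, None] - X[None], axis=2)  # pairwise distances
        rho = (D < dc).sum(axis=1) - 1                    # local densities
        # delta_i: distance to the nearest point of strictly higher density;
        # the global density maximum keeps the largest distance by convention.
        delta = np.full(n, D.max())
        nearest_denser = np.full(n, -1)
        for i in range(n):
            denser = np.flatnonzero(rho > rho[i])
            if denser.size:
                j = denser[D[i, denser].argmin()]
                delta[i], nearest_denser[i] = D[i, j], j
        centers = np.argsort(rho * delta)[-K:]            # K largest gamma values
        labels = np.full(n, -1)
        labels[centers] = np.arange(K)
        # One pass in decreasing-density order: each point copies the label of
        # its nearest denser neighbor, which has already been labeled.
        for i in np.argsort(-rho):
            if labels[i] == -1 and nearest_denser[i] != -1:
                labels[i] = labels[nearest_denser[i]]
        return labels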


FIGURE 6. The clustering time and number of iterations of the six algorithms on the different data sets.

Therefore, the PDPC algorithm reduces the number of iterations and the clustering time and improves the efficiency of the DPC algorithm.

VI. SUMMARY
To overcome the disadvantages of the DPC algorithm, a novel clustering algorithm based on DPC & PSO (PDPC) is proposed. Particle swarm optimization (PSO) is introduced because of its simple concept and strong global search ability, which allow it to find the optimal solution in relatively few iterations. Furthermore, to address the influence of the selection of the cut-off distance parameter dc on the clustering results, a method for calculating dc is proposed. Finally, PDPC and six typical algorithms are tested on classical synthetic data sets and real data sets, and the experiments verify that the clustering results, the clustering time and the number of iterations of the PDPC algorithm are better than those of the other algorithms. The PDPC algorithm achieves the purpose of automatically selecting cluster centers and overcomes the effects of the parameter dc. Compared with the other six algorithms, the PDPC algorithm has a stronger global search ability, higher stability and a better clustering effect.
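As a compact recap of how these pieces fit together, the following end-to-end sketch strings the earlier fragments into a PDPC-style loop: candidate cluster centers are encoded as particles, moved by the standard inertia-weight PSO update [41], and scored by a clustering criterion. The fitness function, parameter values (w, c1, c2) and function names here are our illustrative assumptions, not the paper's exact formulation:

    # Hypothetical end-to-end sketch of a PDPC-style search for K centers.
    import numpy as np

    def fitness(centers, X):
        # Illustrative criterion: total squared distance to the nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        return np.sum(d.min(axis=1) ** 2)

    def pdpc_like(X, K, n_particles=20, iters=50, w=0.7, c1=1.5, c2=1.5, seed=0):
        rng = np.random.default_rng(seed)
        n, dim = X.shape
        # Each particle encodes K candidate centers, flattened to one vector.
        pos = X[rng.integers(0, n, size=(n_particles, K))].reshape(n_particles, -1)
        vel = np.zeros_like(pos)
        pbest = pos.copy()
        pbest_f = np.array([fitness(p.reshape(K, dim), X) for p in pos])
        gbest = pbest[pbest_f.argmin()].copy()
        for _ in range(iters):
            r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
            # Standard inertia-weight velocity and position updates.
            vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
            pos = pos + vel
            f = np.array([fitness(p.reshape(K, dim), X) for p in pos])
            better = f < pbest_f
            pbest[better], pbest_f[better] = pos[better], f[better]
            gbest = pbest[pbest_f.argmin()].copy()
        centers = gbest.reshape(K, dim)
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        return centers, labels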
REFERENCES
[1] A. Rodriguez and A. Laio, ‘‘Clustering by fast search and find of density peaks,’’ Science, vol. 344, no. 6191, pp. 1492–1496, Jun. 2014.
[2] R. Eberhart and J. Kennedy, ‘‘A new optimizer using particle swarm theory,’’ in Proc. 6th Int. Symp. Micro Mach. Human Sci. (MHS), 1995, pp. 39–43.
[3] X. Juanying and Y. Qu, ‘‘K-medoids clustering algorithms with optimized initial seeds by density peaks,’’ J. Frontiers Comput. Sci. Technol., vol. 10, no. 2, pp. 230–247, 2016.
[4] E. Zhu and R. Ma, ‘‘An effective partitional clustering algorithm based on new clustering validity index,’’ Appl. Soft Comput., vol. 71, pp. 608–621, Oct. 2018.
[5] J. MacQueen, ‘‘Some methods for classification and analysis of multivariate observations,’’ in Proc. 5th Berkeley Symp. Math. Statist. Probab., 1967, vol. 1, no. 14, pp. 281–297.
[6] D. W. van der Merwe and A. P. Engelbrecht, ‘‘Data clustering using particle swarm optimization,’’ in Proc. Congr. Evol. Comput. (CEC), vol. 1, 2003, pp. 215–220.
[7] Y. Si, P. Liu, P. Li, and T. P. Brutnell, ‘‘Model-based clustering for RNA-seq data,’’ Bioinformatics, vol. 30, no. 2, pp. 197–205, Jan. 2014.
[8] L. H. Son and T. M. Tuan, ‘‘A cooperative semi-supervised fuzzy clustering framework for dental X-ray image segmentation,’’ Expert Syst. Appl., vol. 46, pp. 380–393, Mar. 2016.
[9] A. Mehta and O. Dikshit, ‘‘Comparative study on projected clustering methods for hyperspectral imagery classification,’’ Geocarto Int., vol. 31, no. 3, pp. 296–307, Mar. 2016.
[10] Y. Yang, ‘‘TAD: A trajectory clustering algorithm based on spatial-temporal density analysis,’’ Expert Syst. Appl., vol. 139, Jan. 2020, Art. no. 112846, doi: 10.1016/j.eswa.2019.112846.
[11] C. Jiang-Hui, ‘‘Spectral analysis of sky light based on trajectory clustering,’’ Spectrosc. Spectral Anal., vol. 39, no. 4, pp. 1301–1306, 2019.
[12] C. Qu, H. Yang, J. Cai, J. Zhang, and Y. Zhou, ‘‘DoPS: A double-peaked profiles search method based on the RS and SVM,’’ IEEE Access, vol. 7, pp. 106139–106154, 2019, doi: 10.1109/ACCESS.2019.2927251.
[13] Q. Cai-Xia, Y. Hai-Feng, C. Jiang-Hui, and X. Ya-Ling, ‘‘P-Cygni profile analysis of the spectrum: LAMOST J152238.11+333136.1,’’ Spectrosc. Spectral Anal., vol. 40, no. 4, pp. 1304–1308, 2020.
[14] H. Yang, C. Qu, J. Cai, S. Zhang, and X. Zhao, ‘‘SVM-Lattice: A recognition & evaluation frame for double-peaked profiles,’’ IEEE Access, early access, Apr. 27, 2020, doi: 10.1109/ACCESS.2020.2990801.
[15] J. Han, M. Kamber, and J. Pei, Data Mining Concepts and Techniques (Series in Data Management Systems), 3rd ed. San Mateo, CA, USA: Morgan Kaufmann, 2011, pp. 83–124.
[16] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, vol. 344. Hoboken, NJ, USA: Wiley, 2009.
[17] Y. Li, J. Cai, H. Yang, J. Zhang, and X. Zhao, ‘‘A novel algorithm for initial cluster center selection,’’ IEEE Access, vol. 7, pp. 74683–74693, 2019, doi: 10.1109/ACCESS.2019.2921320.
[18] F. Murtagh and P. Contreras, ‘‘Algorithms for hierarchical clustering: An overview,’’ WIREs Data Mining Knowl. Discovery, vol. 2, no. 1, pp. 86–97, Jan. 2012.
[19] J. D. Banfield and A. E. Raftery, ‘‘Model-based Gaussian and non-Gaussian clustering,’’ Biometrics, vol. 49, no. 3, pp. 803–821, Sep. 1993.
[20] M. Ester, ‘‘A density-based algorithm for discovering clusters in large spatial databases with noise,’’ in Proc. KDD, 1996, vol. 96, no. 34, pp. 226–231.
[21] A. Hinneburg and D. A. Keim, ‘‘An efficient approach to clustering in large multimedia databases with noise,’’ in Proc. 4th Int. Conf. Knowl. Discovery Data Mining, vol. 98, Aug. 1998, pp. 58–65.
[22] D. McParland and I. C. Gormley, ‘‘Model based clustering for mixed data: ClustMD,’’ Adv. Data Anal. Classification, vol. 10, no. 2, pp. 155–169, Jun. 2016.
[23] A. Rodríguez, E. Cuevas, D. Zaldivar, and L. Castañeda, ‘‘Clustering with biological visual models,’’ Phys. A, Stat. Mech. Appl., vol. 528, Aug. 2019, Art. no. 121505.
[24] L. Rokach, ‘‘A survey of clustering algorithms,’’ in Data Mining and Knowledge Discovery Handbook. Boston, MA, USA: Springer, 2009, pp. 269–298.
[25] Y.-J. Zheng, H.-F. Ling, S.-Y. Chen, and J.-Y. Xue, ‘‘A hybrid neuro-fuzzy network based on differential biogeography-based optimization for online population classification in earthquakes,’’ IEEE Trans. Fuzzy Syst., vol. 23, no. 4, pp. 1070–1083, Aug. 2015.
[26] Y.-J. Zheng and H.-F. Ling, ‘‘Emergency transportation planning in disaster relief supply chain management: A cooperative fuzzy optimization approach,’’ Soft Comput., vol. 17, no. 7, pp. 1301–1314, Jul. 2013.
[27] B. Jiang and N. Wang, ‘‘Cooperative bare-bone particle swarm optimization for data clustering,’’ Soft Comput., vol. 18, no. 6, pp. 1079–1091, Jun. 2014.


[28] Y.-J. Zheng, H.-F. Ling, J.-Y. Xue, and S.-Y. Chen, ‘‘Population classification in fire evacuation: A multiobjective particle swarm optimization approach,’’ IEEE Trans. Evol. Comput., vol. 18, no. 1, pp. 70–81, Feb. 2014.
[29] J. Ding, Z. Chen, X. He, and Y. Zhan, ‘‘Clustering by finding density peaks based on Chebyshev’s inequality,’’ in Proc. 35th Chin. Control Conf. (CCC), Jul. 2016, pp. 7169–7172.
[30] J. Ding, X. He, J. Yuan, and B. Jiang, ‘‘Automatic clustering based on density peak detection using generalized extreme value distribution,’’ Soft Comput., vol. 22, no. 9, pp. 2777–2796, May 2018.
[31] M. Du, S. Ding, and H. Jia, ‘‘Study on density peaks clustering based on k-nearest neighbors and principal component analysis,’’ Knowl.-Based Syst., vol. 99, pp. 135–145, May 2016.
[32] J. Xie, H. Gao, W. Xie, X. Liu, and P. W. Grant, ‘‘Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors,’’ Inf. Sci., vol. 354, pp. 19–40, Aug. 2016.
[33] S. Jia, G. Tang, J. Zhu, and Q. Li, ‘‘A novel ranking-based clustering approach for hyperspectral band selection,’’ IEEE Trans. Geosci. Remote Sens., vol. 54, no. 1, pp. 88–102, Jan. 2016.
[34] J. Xu, G. Wang, and W. Deng, ‘‘DenPEHC: Density peak based efficient hierarchical clustering,’’ Inf. Sci., vol. 373, pp. 200–218, Dec. 2016.
[35] A. A. A. Esmin, D. L. Pereira, and F. P. A. de Araujo, ‘‘Study of different approach to clustering data by using the particle swarm optimization algorithm,’’ in Proc. IEEE Congr. Evol. Comput. (IEEE World Congr. Comput. Intell.), Jun. 2008, pp. 1817–1822.
[36] I. W. Kao, C. Y. Tsai, and Y. C. Wang, ‘‘An effective particle swarm optimization method for data clustering,’’ in Proc. IEEE Int. Conf. Ind. Eng. Eng. Manage., Dec. 2007, pp. 548–552.
[37] R. Chouhan and A. Purohit, ‘‘An approach for document clustering using PSO and K-means algorithm,’’ in Proc. 2nd Int. Conf. Inventive Syst. Control (ICISC), Jan. 2018, pp. 1380–1384.
[38] A. Khatami, ‘‘A new PSO-based approach to fire flame detection using K-medoids clustering,’’ Expert Syst. Appl., vol. 68, pp. 69–80, Feb. 2017.
[39] Y. Jiang, C. Liu, C. Huang, and X. Wu, ‘‘Improved particle swarm algorithm for hydrological parameter optimization,’’ Appl. Math. Comput., vol. 217, no. 7, pp. 3207–3215, Dec. 2010.
[40] A. O’Hagan, T. B. Murphy, I. C. Gormley, P. D. McNicholas, and D. Karlis, ‘‘Clustering with the multivariate normal inverse Gaussian distribution,’’ Comput. Statist. Data Anal., vol. 93, pp. 18–30, Jan. 2016.
[41] Y. Shi and R. Eberhart, ‘‘A modified particle swarm optimizer,’’ in Proc. IEEE Int. Conf. Evol. Comput., IEEE World Congr. Comput. Intell., May 1998, pp. 69–73.
[42] A. Ratnaweera, S. K. Halgamuge, and H. C. Watson, ‘‘Self-organizing hierarchical particle swarm optimizer with time-varying acceleration coefficients,’’ IEEE Trans. Evol. Comput., vol. 8, no. 3, pp. 240–255, Jun. 2004.
[43] H. Chang and D.-Y. Yeung, ‘‘Robust path-based spectral clustering,’’ Pattern Recognit., vol. 41, no. 1, pp. 191–203, Jan. 2008.
[44] A. Gionis, H. Mannila, and P. Tsaparas, ‘‘Clustering aggregation,’’ ACM Trans. Knowl. Discovery Data (TKDD), vol. 1, no. 1, p. 4, 2007.
[45] C. M. Stein, ‘‘Estimation of the mean of a multivariate normal distribution,’’ Ann. Statist., vol. 9, no. 6, pp. 1135–1151, Nov. 1981.

JIANGHUI CAI is a Chief Professor of computer application technology with the Taiyuan University of Science and Technology, Taiyuan, China. He is a long-term member of the Institute for Intelligent Information and Data Mining. His research interests concern data mining and machine learning methods in specific backgrounds of astronomical informatics, seismology, and mechanical engineering. He is a Senior Member of the China Computer Federation (CCF).

HUILING WEI was born in Shanxi, China, in 1993. She is currently pursuing the M.S. degree with the Department of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan, China. Her current research interests include data mining and artificial intelligence.

HAIFENG YANG is a Professor of computer application technology with the Taiyuan University of Science and Technology, Taiyuan, China. He is a long-term member of the Institute for Intelligent Information and Data Mining. His research interests concern data mining and machine learning methods in specific backgrounds, especially on astronomical big data. He is a member of the China Computer Federation (CCF) and the Chinese Astronomical Society (CAS).

XUJUN ZHAO received the M.S. degree in computer science and technology from the Taiyuan University of Technology, China. He is currently pursuing the Ph.D. degree with the Taiyuan University of Science and Technology. His research interests include data mining and parallel computing.
