A Graph Adaptive Density Peaks Clustering Algorithm For Automatic Centroid Selection and Effective Aggregation
MSC: 00-01; 68T10
Keywords: Density Peaks Clustering algorithm (DPC); GA-DPC; Graph Theory; Centroid selection

Abstract: As a clustering approach based on density, the Density Peaks Clustering algorithm (DPC) has conspicuous advantages in searching for and finding density peaks. Nevertheless, DPC has obvious deficiencies in the centroid selection and aggregation processes when affected by differences in data shape and density distribution, which can easily cause errors in centroid selection and trigger a domino effect. Therefore, a Graph Adaptive Density Peaks Clustering algorithm based on Graph Theory (called GADPC) is proposed to select centroids automatically and aggregate points more effectively. The improvement of GADPC can be subdivided into two steps. First, the clustering centroids are selected automatically based on the turning angle 𝜃 and the graph connectivity of centroids. Second, the remaining points are aggregated towards the corresponding clustering centroid: according to the improved principle, each point belongs to the closer point that has stronger graph connectivity and higher density. Theoretical analyses and experimental data indicate that GADPC, compared with DBSCAN, K-means and DPC, is more feasible and effective in processing data sets with varying density and non-spherical distribution such as Jain and Spiral.
✩ The authors thank the financial support from the Foundation of the Education Department of Jilin Province, China (Nos. JJKH20210133KJ, JJKH20200141KJ), the Foundation of Social Science of Jilin Province, China (No. 2020C053) and the Foundation of Jilin University of Finance and Economics (Nos. 2020ZY14, 2020ZY09).
∗ Corresponding author at: School of Management Science and Information Engineering, Jilin University of Finance and Economics, Changchun 130117, PR China.
E-mail addresses: [email protected] (T. Xu), [email protected] (J. Jiang).
DPC has the following inadequacies. On the one hand, it is difficult for DPC to pick out cluster centroids in the decision graph, which consists of density and distance, when processing data with an uneven density distribution (Jiang et al., 2018a; Li & Tang, 2018). On the other hand, in the process of assignment, when one point is assigned incorrectly, the remaining points associated with it are also affected, thus triggering the domino effect (Jiang et al., 2019b, 2019c, 2018b, 2019a; Seyedi et al., 2019). For example, the DPC algorithm does not achieve good results on Jain and Spiral.

After the DPC algorithm was introduced, researchers proposed many variations of it. These improved variations can be roughly divided into three categories. The first focuses on improving the calculation of local density. For example, some researchers have published an algorithm called CFSFDP-HD, which attempts to estimate density efficiently by employing the heat equation, thus reducing the sensitivity to the setting of the cut-off distance (Mehmood et al., 2016). However, its high computational cost limits its use on large-scale and high-dimensional data. At the same time, a novel algorithm named DPC-PCA-KNN was proposed by Du et al. (2016), which integrates KNN and PCA into DPC. Note, however, that this approach obtains poor results in detecting non-spherical clusters and identifying overlapping clusters; in addition, the nearest neighbor number N needs to be set in advance. In view of this, a series of DPC extensions based on the idea of KNN have been proposed, including FKNNDPC (Xie et al., 2016), SNNDPC (Liu et al., 2018) and DPC-DLP (Seyedi et al., 2019). FKNNDPC improves the calculation of local density based on the sum of the distances from sample points to their K nearest neighbors. SNNDPC improves the local density through innovative definitions such as SNN similarity. DPC-DLP introduces KNN into the computation of the parameter 𝑑𝑐 and of the local density. In addition, DPC variations based on Kernel Density Estimation (KDE), including HDDPC (Mehmood et al., 2016) and IVDPC (Zhou et al., 2018), provide a new idea of deparameterization. However, due to the inherent limitations of KDE, these algorithms have high computational complexity and easily fall into the curse of dimensionality.

The second type focuses on the process of selecting the centroids. The improvement of selecting clustering centroids can happen in two ways: selecting centroids automatically and increasing the identifiability of centers in the decision graph. For example, a hierarchical density method entitled DENPEHC (Sun et al., 2016) is proposed to automatically detect all possible centroids. DENPEHC implements clustering of high-dimensional and large-scale data by integrating grids. However, the biggest problem of this variant is that the local structure of data points is ignored in the calculation of the density value, leading to incorrect detection of some clusters. Some scholars try to use statistical tests instead of the decision graph to identify clustering centroids (Wang & Song, 2016). Although the improved algorithm performs well in identifying clustering centroids, it shows insufficient ability in dealing with complex-manifold data sets. Bian et al. (2020) design a new fuzzy density peak clustering algorithm entitled FDPC. From the perspective of soft partitions, FDPC measures the density peak by fuzzy distance, and it has some advantages in the automatic detection of the number of clusters and in the assignment of objects with ambiguity and uncertainty. An improved density peak clustering method called GDPC (Jiang et al., 2018a), introducing gravitation and nearby distances, offers an alternative to choosing centroids in the decision graph. Compared with DPC, it is easier for GDPC to select clustering centroids on some data sets such as Aggregation, but this advantage is not obvious on data sets with uneven densities.

The third type of DPC variant aims at improving the assignment of instances. An improved density peak clustering (IDPC) was proposed by Lotfi et al. (2017), adopting a two-step strategy for detecting complex clusters. However, this algorithm has strong parameter sensitivity: the larger the number of nearest neighbors K, the better the clustering effect, but also the higher the computational cost. Liu et al. (2017) published an adaptive density peak clustering based on K nearest neighbors with an aggregating strategy (ADPC-KNN). In this algorithm, the idea of KNN is introduced into the calculation of local density, and the remaining objects are merged to the selected cluster centroids. DFC (Jiang et al., 2018b) is proposed as an improved density peak clustering method. It improves the assignment process by applying density fragments and network structural similarity, achieving advantages in automatically determining the number of clusters, aggregating samples more reasonably and detecting outliers.

In view of these issues, the Graph Adaptive Density Peaks Clustering algorithm (GADPC) is proposed to avoid the defects of the DPC clustering algorithm. GADPC inherits the local density calculation of DPC and focuses on the second and third improvement directions above. The innovations of GADPC are listed as follows:

1. Unlike the DPC algorithm, GADPC can automatically select centroids, avoiding the subjectivity and difficulty caused by the manual selection of cluster centroids in data sets with varying densities. In GADPC, clustering centroids are selected automatically and accurately according to the turning angle 𝜃 and the graph connectivity of clustering centroids.
2. The new aggregation principle of GADPC can eliminate the domino effect of the original algorithm. Once the cluster centroids are selected, each remaining point is aggregated to the closer point with higher density and stronger graph connectivity.
3. A new idea of detecting abnormal points and edge points with reference to Graph Theory is added in response to the original DPC algorithm's shortcomings in abnormal point detection and its lack of edge point labeling. According to the degree centrality and connectivity of each point, the edge points and outliers can be easily detected.

2. Related work

The GADPC algorithm is inspired by DPC (Rodriguez & Laio, 2014) and by Graph Theory (Diestel, 2000). Brief introductions to the DPC algorithm and to Graph Theory are given in the following sections.

2.1. An introduction to the DPC algorithm and defect analysis

The Density Peaks Clustering algorithm (DPC) relies on two important assumptions: (1) the local density of the cluster centroids is the greatest, and the distance between centroids is relatively far; (2) each remaining point belongs to its nearest neighbor of greater density.

DPC mainly has two quantities to be calculated, the local density 𝜌𝑖 and the relative distance 𝛿𝑖, which are defined by Eqs. (1), (2) and (3).

$$\rho_i = \sum_{j} \chi(d_{ij} - d_c) \tag{1}$$

where 𝜒(𝑑𝑖𝑗 − 𝑑𝑐) = 1 if (𝑑𝑖𝑗 − 𝑑𝑐) < 0 and 𝜒(𝑑𝑖𝑗 − 𝑑𝑐) = 0 otherwise, and 𝑑𝑖𝑗 is the distance between data points 𝑖 and 𝑗, given by Eq. (2).

$$d_{ij} = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2} \tag{2}$$

The cutoff distance 𝑑𝑐 is a parameter that needs to be entered by the user according to the specific situation. Following the suggestion of the authors of DPC, 𝑑𝑐 can be determined on the premise that the average number of neighbor points is maintained at 1% to 2% of the total number of data points (Rodriguez & Laio, 2014; Jiang et al., 2018a, 2019b). In addition, a Gaussian kernel density is usually applied for measuring the density in practical applications, aiming to overcome the defect that the density change of Eq. (1) is not obvious when processing small samples. The specific formula is given by Eq. (3).

$$\rho_i = \sum_{j \neq i} \exp\left(-\left(\frac{d_{ij}}{d_c}\right)^2\right) \tag{3}$$
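To make Eqs. (1)–(3) concrete, the following Python sketch computes the distance matrix, picks 𝑑𝑐 by the 1%–2% neighborhood rule quoted above, and evaluates the Gaussian-kernel density. The helper names and the percentile-based choice of 𝑑𝑐 are our illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pairwise_distances(X):
    # Eq. (2): Euclidean distance between every pair of points.
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def choose_dc(D, p=2.0):
    # Heuristic from the DPC paper: pick d_c so that each point has,
    # on average, about p% of the data set as neighbors.
    triu = D[np.triu_indices_from(D, k=1)]
    return np.percentile(triu, p)

def gaussian_density(D, dc):
    # Eq. (3): rho_i = sum_{j != i} exp(-(d_ij / d_c)^2).
    return np.exp(-(D / dc) ** 2).sum(axis=1) - 1.0  # subtract the j == i term
```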
The relative distance 𝛿𝑖 of point 𝑖 is given by Eq. (4).

$$\delta_i = \min_{j:\,\rho_j > \rho_i} (d_{ij}) \tag{4}$$

If a point has the maximum local density, its distance 𝛿𝑖 is calculated according to Eq. (5).

$$\delta_i = \max_{j} (d_{ij}) \tag{5}$$
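Continuing the NumPy sketch above, Eqs. (4) and (5) can be written directly as follows; the loop-based form is a deliberately simple assumption rather than an optimized implementation.

```python
def relative_distance(D, rho):
    # Eqs. (4)-(5): delta_i is the distance to the nearest denser neighbor;
    # the globally densest point receives the maximum distance instead.
    n = len(rho)
    delta = np.zeros(n)
    for i in range(n):
        denser = np.where(rho > rho[i])[0]
        delta[i] = D[i, denser].min() if denser.size else D[i].max()
    return delta
```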
Although the decision graph of DPC provides a good heuristic method for selecting the centroids, failures of the selection process still exist. The basic reason is that the local density formula of DPC is suitable for detecting the local structure of sample points but ignores the overall relationship between them. As a result, it may lead to the wrong selection, or the missed selection, of clustering centroids (Jiang et al., 2018a, 2019b). This error is particularly obvious in the following situations: (1) affected by changes in the density difference between clusters, the densities and relative distances of low-density clustering centroids and high-density clustering centroids are significantly different; (2) affected by the shape and density distribution of the data points in a class, there are multiple points with higher density and larger relative distance within the class. To illustrate the shortcoming of DPC in selecting centroids more clearly, the data set Jain is presented in Fig. 1. According to the centroid selection principle, the cluster centroids manually selected from the decision graph are marked as colored dots in Fig. 1(a). However, the true centroids of Jain are the colored triangular points in Fig. 1(b).

In addition, the aggregation strategy of the DPC algorithm also has some shortcomings. According to the aggregation principle of DPC, each object is aggregated to its nearest neighbor with higher density. In general, this aggregation strategy can ensure that every object is accurately aggregated to the corresponding centroid, but in some cases it fails to do so. In Fig. 2(a), the two closest neighbors of higher density to point No. 220 are points No. 46 and No. 226. Because the distance 𝑑(220,226) is longer than the distance 𝑑(220,46), point No. 220 is aggregated into the same cluster to which point No. 46 belongs. It is obvious that point No. 220 is incorrectly aggregated, which causes the remaining points of lower density around point No. 220 to be aggregated incorrectly as well. In this case, a point that is far from the correct class centroid yet close to a wrongly clustered point will be assigned to the wrong class centroid under the aggregation strategy of DPC, which triggers the domino effect (Jiang et al., 2019b, 2019c, 2018b, 2019a). Fig. 2(b) also shows that the DPC algorithm makes an aggregation error on Jain.

2.2. A brief introduction to Graph Theory

A graph is regarded as a network structure consisting of a set of vertices that is mapped to a set of edges (Diestel, 2000; Shimon et al., 1975; Chen et al., 2003). A pair of sets G = (V, E) can be used to describe a graph, where V represents the vertices or nodes of the graph G and E is its set of edges or lines. To avoid notational ambiguities, it is always tacitly assumed that V ∩ E = ∅. Based on direction, graphs can be classified into two categories, directed graphs and undirected graphs. There are some important concepts in Graph Theory.

The degree of a node refers to the number of edges associated with that node, conventionally denoted by 𝑑𝐺(𝑣) or d(v) (Wang et al., 2016; Penrose, 2015). In an undirected graph, degree centrality measures the degree to which one vertex in the graph is related to all other nodes. The degree and degree centrality of vertices directly determine the importance of vertices in a graph network.

$$C_D(N_i) = \sum_{j=1}^{n} x_{ij}, \quad i \neq j \tag{6}$$

where 𝐶𝐷(𝑁𝑖) represents the degree centrality of node 𝑖, which counts the number of direct connections between node 𝑖 and the other nodes.

The connectivity of a graph describes whether there is a path between nodes (Diestel, 2000; Mathew & Sunitha, 2010; Harrison, 2016). A graph is connected when there is a path between any two vertices in the graph, and a maximal connected subgraph is referred to as a connected component. In an undirected graph, a vertex is called a cutvertex if the number of connected components of the graph increases after the vertex and its associated edges are deleted. Similarly, if deleting an edge increases the number of connected components of the graph, the edge is known as a bridge (Diestel, 2000; Goodrich et al., 2011). Cutvertices and bridges play important roles in the process of transforming a connected graph into several connected subgraphs. As shown in Fig. 3(a), v, x, y and w are cutvertices, and x-y is a bridge. When all cutvertices and bridges are removed from a graph, multiple components are generated. As shown in Fig. 3(b), B is a connected component.
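All of the graph-theoretic quantities used in this section (degree centrality, connected components, cutvertices, bridges) are available in standard graph libraries. A minimal sketch with networkx, which is our assumed tooling choice (the paper does not name a library):

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("u", "v"), ("v", "x"), ("x", "y"), ("y", "w"), ("w", "z")])

centrality = nx.degree_centrality(G)           # Eq. (6), normalized by n - 1
components = list(nx.connected_components(G))  # maximal connected subgraphs
cutvertices = set(nx.articulation_points(G))   # removal increases the number of components
bridges = list(nx.bridges(G))                  # edge removal increases the number of components
```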
3. Methods

The proposed GADPC algorithm develops the DPC algorithm in many aspects: it improves the clustering centroid selection method of DPC and the aggregation principle for the remaining points. GADPC includes four major steps: (1) construct the sparse graph using the cutoff distance 𝑑𝑐; (2) select clustering centroids automatically and accurately based on the turning angle 𝜃 and the graph connectivity of clustering centers; (3) aggregate the remaining points according to the principle that non-centroid points belong to the closer point with higher density and stronger connectivity; (4) detect outliers and edge points according to the degree centrality and relative distance of nodes.

3.1. Construct the sparse graph

In an undirected graph with n vertices, theoretically every vertex has some degree of proximity to every other object. However, each vertex has a high degree of proximity to only a few objects and weak similarity to most other objects. Therefore, the cutoff distance 𝑑𝑐 is adopted to break the edges whose distance is greater than 𝑑𝑐, which sparsifies the proximity graph. The degree of each node in this graph reflects the density of the node at a certain level, and whether there is a path between nodes reflects their internal relationship. The generation of the sparse graph provides the basis for the selection of clustering centroids and for the implementation of the aggregation process. Fig. 4 illustrates the sparse graph generated on the basis of the cutoff distance 𝑑𝑐.
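A sketch of the sparse-graph construction just described, reusing the distance matrix and 𝑑𝑐 from the earlier snippets; the helper name is hypothetical:

```python
def build_sparse_graph(D, dc):
    # Keep an edge (i, j) only if d_ij <= d_c; longer edges are broken,
    # which sparsifies the proximity graph.
    G = nx.Graph()
    G.add_nodes_from(range(len(D)))
    i_idx, j_idx = np.where(np.triu(D <= dc, k=1))
    G.add_edges_from(zip(i_idx.tolist(), j_idx.tolist()))
    return G
```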
3.2. Select clustering centroids

Considering the difficulty of manually picking the clustering centroids in the decision graph, the authors of DPC give us a hint to select the centroids according to the gamma value calculated by the formula 𝛾𝑖 = 𝜌𝑖 × 𝛿𝑖 (Rodriguez & Laio, 2014). As shown in Fig. 5(a), the true centroids of the clusters are marked with colored dots. This approach makes it much easier to pick out clustering centroids on the data set Flame. However, because the value of 𝛾 only magnifies the product of density and distance, it is still difficult to select the clustering centroids correctly in clusters with large density differences like Jain, as shown in Fig. 5(b).

Therefore, the clustering centroid selection method is improved: clustering centroids are selected automatically and accurately based on the turning angle 𝜃 and the graph connectivity of clustering centroids. The turning angle 𝜃 is given by Eq. (7).

$$\theta_i = \arccos\left(\frac{a^2 - b^2 - c^2}{-2bc}\right) \tag{7}$$

where a, b and c are the lengths of the sides of the triangle formed by the three points 𝛾𝑖−1, 𝛾𝑖 and 𝛾𝑖+1 on the sorted 𝛾 curve, a being the side opposite the angle at 𝛾𝑖, so that Eq. (7) is the law of cosines. In Fig. 6(a), the turning angles 𝜃 of the first 20 points are calculated from the 𝛾 values based on Eq. (7).
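A sketch of the turning-angle computation: treating the sorted 𝛾 sequence as points (i, 𝛾𝑖) in the plane and applying the law of cosines at each interior point, as in Eq. (7). How the two axes are scaled against each other is our assumption; the paper does not specify it.

```python
def turning_angles(gamma_sorted):
    # Angle (in degrees) at each interior point (i, gamma_i) of the gamma curve.
    pts = np.column_stack([np.arange(len(gamma_sorted)), gamma_sorted])
    angles = []
    for i in range(1, len(pts) - 1):
        a = np.linalg.norm(pts[i + 1] - pts[i - 1])  # side opposite the angle
        b = np.linalg.norm(pts[i] - pts[i - 1])
        c = np.linalg.norm(pts[i + 1] - pts[i])
        cos_theta = (b**2 + c**2 - a**2) / (2 * b * c)  # law of cosines
        angles.append(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))
    return np.array(angles)
```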
It can be clearly seen from Fig. 6 that the angle value of each point after the turning point is close to 180° and the changing trend is stable, which means that there is little chance for those points to be the centroid of a cluster. Therefore, according to the size and trend of the angle values, GADPC can automatically select the potential clustering centroids before the turning point. After that, the final clustering centroids are determined by whether there is a path between the possible clustering centroids. The possible clustering centroids are examined one by one in order of decreasing density, and a low-density candidate is excluded when a path exists between it and another candidate. However, a clustering centroid with low density is likely to be wrongly eliminated if only simple connectivity between points is considered. As shown in Fig. 7, two clusters were incorrectly detected because of connectivity between classes. Therefore, GADPC takes whether the connected path contains a cutvertex or bridge into account in the screening of clustering centroids. If the connected path between two points contains a cutvertex or bridge, GADPC considers the two points to be non-connected, which ensures that each cluster is a connected component.

The accuracy and convenience of selecting centroids are greatly increased by considering the connectivity among points with high local density and selecting centroids in two steps. Moreover, these improvements to the selection method and selection criteria of clustering centroids make it possible for GADPC to adaptively select clustering centers. As illustrated in Fig. 8, the centroids can easily be selected from the decision graph, leading to an automatic selection based on the turning point.

3.3. Aggregate remaining points

In order to solve the domino effect in the assignment procedure of the DPC algorithm, the aggregation strategy is updated to a new version: each point now belongs to the closer point with higher density and stronger connectivity. Combining this with Graph Theory, GADPC takes the relationship between a certain point and the other points into consideration in the assignment process. As shown in Fig. 9, although the distance between point No. 220 and No. 46 is smaller, since there is no path between them, point No. 220 is assigned to point No. 226, which has a higher density and stronger connectivity with point No. 220.
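A simplified reading of this aggregation principle in code, under the assumptions of the earlier sketches: candidates are scanned from nearest to farthest, but a point may only inherit the label of a denser neighbor that is reachable in the sparse graph.

```python
def aggregate(D, rho, labels, G):
    # labels: cluster index for centroids, -1 for unassigned points.
    # Process points from high to low density so that every point can
    # inherit the label of an already-assigned denser neighbor.
    for i in np.argsort(-rho):
        if labels[i] != -1:
            continue
        for j in np.argsort(D[i]):  # nearest candidates first
            if rho[j] > rho[i] and labels[j] != -1 and nx.has_path(G, int(i), int(j)):
                labels[i] = labels[j]  # distance + density + connectivity
                break
    return labels
```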
3.4. Detect outliers and mark edge points

After determining the clustering centroids and assigning the remaining points according to the new aggregation principle, GADPC also detects outliers and edge points. Firstly, clustering edge points have lower degree centrality. Secondly, potential outliers hidden among the edge points can be singled out from the possible edge points by their relative distance and degree centrality.
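A sketch of this detection rule; the degree-centrality threshold is an assumed value, since the paper does not state a numeric cutoff:

```python
def mark_edges_and_outliers(G, centroid_ids, threshold=0.05):
    # Low degree centrality -> candidate edge point; if, in addition,
    # the node cannot reach any centroid, treat it as an outlier.
    centrality = nx.degree_centrality(G)
    tags = {}
    for v, c in centrality.items():
        if c >= threshold:
            continue
        connected = any(nx.has_path(G, v, m) for m in centroid_ids)
        tags[v] = "edge" if connected else "outlier"
    return tags
```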
The GADPC algorithm is depicted in Algorithm 1, and the GADPC flowchart is shown in Fig. 10.
Algorithm 1 Density peaks clustering based on Graph Theory
Require: Initial points 𝑃𝑖 ∈ R^(N×M) (the matrix of N points with M dimensions); the cutoff distance 𝑑𝑐
Ensure: The label vector of cluster indexes y ∈ R^N
Step 1: Determine 𝑑𝑐
  1.1 Determine 𝑑𝑖𝑗 based on Eq. (2);
  1.2 Compute 𝑑𝑐 from a given percentage of neighbors fluctuating between 1% and 2%;
  1.3 Calculate 𝜌𝑖 according to Eq. (1) or Eq. (3).
Step 2: Construct the sparse graph
  2.1 Add vertices from the data according to their coordinates;
  2.2 Add edges based on the cutoff distance 𝑑𝑐;
  2.3 Construct the sparse graph from these vertices and edges.
Step 3: Select clustering centroids
  3.1 Calculate 𝛾 from 𝛾𝑖 = 𝜌𝑖 × 𝛿𝑖 and sort it in descending order;
  3.2 Generate the 𝛾 graph with serial number i and 𝛾;
  3.3 Calculate the turning angle 𝜃 based on Eq. (7);
  3.4 Select potential clustering centroids from the 𝜃 graph;
  3.5 Determine the final cluster centroids according to the connectivity of the possible centroids and whether the connected path contains a cutvertex or bridge.
Step 4: Aggregate each point to a cluster
  4.1 Point 𝑃𝑖 is aggregated to the point that has higher density and stronger connectivity;
  4.2 Iterate until all points are assigned.
Step 5: Detect outliers and mark edge points
  5.1 Mark edge points with lower degree centrality;
  5.2 Detect outliers with lower degree centrality and longer distance 𝛿.
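Putting the preceding sketches together, a high-level skeleton of Algorithm 1 might look as follows. The turning-point rule and the connectivity screening are crude stand-ins for the procedure described in Section 3.2 (the cutvertex/bridge refinement is omitted), so this is an illustrative outline, not the authors' code.

```python
def screen_by_connectivity(G, candidates, rho):
    # Keep a candidate only if no already-kept (denser) candidate is
    # reachable from it; cutvertex/bridge screening is omitted here.
    kept = []
    for c in sorted(candidates, key=lambda v: -rho[v]):
        if not any(nx.has_path(G, c, m) for m in kept):
            kept.append(c)
    return kept

def gadpc(X, p=2.0, n_candidates=20):
    D = pairwise_distances(X)              # Step 1: distances, d_c and density
    dc = choose_dc(D, p)
    rho = gaussian_density(D, dc)
    G = build_sparse_graph(D, dc)          # Step 2: sparse graph
    delta = relative_distance(D, rho)
    gamma = rho * delta                    # Step 3: gamma curve and turning angle
    order = np.argsort(-gamma)
    theta = turning_angles(gamma[order][:n_candidates])
    flat = theta > 170                     # near-180 angles: past the turning point
    k = int(np.argmax(flat)) + 1 if flat.any() else len(theta)
    centroids = screen_by_connectivity(G, order[: k + 1].tolist(), rho)
    labels = np.full(len(X), -1)           # Step 4: aggregation
    for lab, c in enumerate(centroids):
        labels[c] = lab
    labels = aggregate(D, rho, labels, G)
    tags = mark_edges_and_outliers(G, centroids)  # Step 5: outliers and edges
    return labels, tags
```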
4. Experiments

4.1. Experiment design and data sets

For the sake of evaluating the clustering effect of GADPC, GADPC is compared with DBSCAN, K-means and DPC on five real-world data sets obtained from UCI (Iris, Seeds, Diabetes, Vehicle, Sonar) and six synthetic data sets (Flame (Fu & Medico, 2007), Aggregation, Jain, Spiral (Chang & Yeung, 2008), R15 (Veenman et al., 2002) and Compounds (Zahn, 1971)). The attributes of these data sets are illustrated in Table 1.

The experiments can be divided into five categories. Firstly, an experiment was carried out on Spiral to evaluate GADPC's ability to detect irregular shapes. Secondly, we use Jain to test the algorithm's performance in identifying data sets with uneven density. Thirdly, Compounds and Aggregation are used to evaluate whether the algorithm can successfully identify data sets with different shapes. Fourthly, we also verify the algorithm's ability to detect edge points and outliers when necessary. Finally, we carried out some experiments to test the sensitivity of GADPC to its parameters. In addition, experiments on real data sets further evaluate the clustering performance of GADPC. All the data sets are listed in Table 1.

Table 1
Eleven diverse data sets.

Name               Clusters  Dimensions  Magnitude
Aggregation (SD1)         7           2        788
Flame (SD2)               2           2        240
Jain (SD3)                2           2        373
Compound (SD4)            6           2        399
Spiral (SD5)              3           2        312
R15 (SD6)                15           2        600
Iris                      3           4        150
Seeds                     3           7        210
Diabetes                  2           8        768
Vehicle                   3          18        846
Sonar                     2          60        208

4.2. Evaluation index

In this section, in order to evaluate its performance, GADPC is compared with seven algorithms (K-means, DBSCAN, DB-KDTree, DPC, DPC-PCA-KNN, GDPC and DFC) on different data sets. All the experiments are carried out in MATLAB R2016b and Python 3.8 on a computer with an Intel Core i7 CPU and 8 GB RAM.

In this research, the evaluation indexes F-Measure (Powers, 2011), NMI and ARI (Vinh et al., 2010) are used to comprehensively evaluate the clustering performance of the eight algorithms described above.

As a comprehensive evaluation index, the F-Measure is the harmonic mean of the precision rate 𝑃 and the recall rate 𝑅. The definitions of precision 𝑃, recall 𝑅 and F-Measure are given in Eqs. (8), (9) and (10). 𝑃 is the ratio between the number of correct objects and the number of all objects extracted, and 𝑅 is the ratio between the number of correct objects extracted and the number of all objects in the sample. 𝐾𝑖 represents the set of all samples that should be recognized as positive, and 𝑀𝑗 is the set of all positive samples detected by the model.

$$P(K_i, M_j) = \frac{|K_i \cap M_j|}{|M_j|} \tag{8}$$

$$R(K_i, M_j) = \frac{|K_i \cap M_j|}{|K_i|} \tag{9}$$

$$F(K_i, M_j) = \frac{2 \times P(K_i, M_j) \times R(K_i, M_j)}{P(K_i, M_j) + R(K_i, M_j)} \tag{10}$$
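Eqs. (8)–(10) in code form, written out directly for one (class, cluster) pair:

```python
def f_measure(K_i, M_j):
    # K_i: set of samples truly in class i; M_j: set assigned to cluster j.
    overlap = len(K_i & M_j)
    if overlap == 0:
        return 0.0
    precision = overlap / len(M_j)   # Eq. (8)
    recall = overlap / len(K_i)      # Eq. (9)
    return 2 * precision * recall / (precision + recall)  # Eq. (10)
```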
The mutual information (MI) (Vinh et al., 2010) has also been proposed as a measure of clustering effects. Suppose that there is a set S consisting of n objects, and that X and Y are two partitions of S. If an object is randomly selected from S, the probability of it landing in set 𝑋𝑖 is

$$P(i) = \frac{|X_i|}{n} \tag{11}$$

Entropy can be regarded as a measure of the degree of chaos in a system. The entropy of 𝑋 can be calculated by Eq. (12), the MI (Vinh et al., 2010) by Eq. (13), and the NMI (Pfitzner et al., 2009) by Eq. (14).

$$H(X) = -\sum_{k=1}^{r} P(k)\log P(k) \tag{12}$$

$$I(X, Y) = \sum_{k=1}^{r}\sum_{m=1}^{s} P(k, m)\log\frac{P(k, m)}{P(k)P(m)} \tag{13}$$

$$NMI(X, Y) = \frac{2 I(X, Y)}{H(X) + H(Y)} \tag{14}$$

ARI can be described as a method to measure diversity in cluster ensembles; it also measures how well two data distributions fit together. The contingency table is listed in Table 2, where 𝑛𝑖𝑗 is the number of objects in both cluster 𝑋𝑖 and cluster 𝑌𝑗. ARI is defined by Eq. (15).

$$ARI = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\right] \big/ \binom{N}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\right] \big/ \binom{N}{2}} \tag{15}$$

where 𝑛𝑖𝑗 are the entries of the contingency table, and 𝑎𝑖 and 𝑏𝑗 are its marginal sums.
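NMI (Eq. (14)) and ARI (Eq. (15)) are available off the shelf; a sketch using scikit-learn, whose arithmetic-mean NMI normalization matches Eq. (14). The library choice is ours; the paper only names the indexes.

```python
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 0, 1, 2, 2, 2]

nmi = normalized_mutual_info_score(true_labels, pred_labels)  # Eq. (14)
ari = adjusted_rand_score(true_labels, pred_labels)           # Eq. (15)
```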
Table 2
Contingency table.

X \ Y    Y1     Y2     ⋯    Ys     Sums
X1       n11    n12    ⋯    n1s    a1
X2       n21    n22    ⋯    n2s    a2
⋮        ⋮      ⋮      ⋱    ⋮      ⋮
Xr       nr1    nr2    ⋯    nrs    ar
Sums     b1     b2     ⋯    bs
Table 3
The value of F-Measure.
Name K-means DBSCAN DB-KDTree DPC DPC-PCA-KNN GDPC DFC GADPC
Aggregation 0.8159 0.9003 0.2157 1 0.9987 1 0.9598 1 (𝑝 = 2.0)
Flame 0.7586 0.9840 1 1 1 1 1 1 (𝑝 = 1.4)
Spiral 0.3276 1 0.9978 0.7795 0.9987 0.5219 0.9598 1 (𝑝 = 1.1)
Jain 0.6977 0.9767 0.9544 0.4504 0.9226 1 0.9598 1 (𝑝 = 2.0)
R15 0.9932 0.9830 0.1497 0.9966 0.9787 0.9933 0.9937 0.9966 (𝑝 = 1.5)
Compound 0.1065 0.1503 0.0025 0.4210 0.8357 0.7679 0.6155 0.8402 (𝑝 = 1.6)
Iris 0.4319 0.2459 0.1616 0.6600 0.8319 0.8505 0.8121 0.9066 (𝑝 = 2.0)
Seeds 0.8106 0.5585 0.0774 0.8822 0.8897 0.7249 0.8792 0.8822 (𝑝 = 0.8)
Diabetes 0.1324 0.0001 0.2434 0.4830 0.5600 0.6794 0.6891 0.5054 (𝑝 = 2.5)
Vehicle 0.1058 0.2622 0.1000 0.3475 0.4072 0.4667 0.4275 0.3274 (𝑝 = 1.2)
Sonar 0.1690 0.0015 0.0012 0.5096 0.5731 0.6622 0.6891 0.5950 (𝑝 = 1.4)
Table 4
NMI evaluation.
Name K-means DBSCAN DB-KDTree DPC DPC-PCA-KNN GDPC DFC GADPC
Aggregation 0.8805 0.9207 0.7037 1 0.9987 1 0.9598 1 (p=2.0)
Flame 0.4622 0.9275 0.9342 1 1 1 1 1
Spiral 0.0007 1 0.9978 0.6951 0.8157 0.1879 1 1
Jain 0.3672 0.8729 1 0.6222 0.6373 1 0.8835 1
R15 0.9942 0.9817 0.7271 0.9942 0.9860 0.9893 0.9662 0.9942
Compound 0.8015 0.8703 0.5868 0.8373 0.8594 0.7954 0.6771 0.9122
Iris 0.6188 0.5672 0.4596 0.4240 0.7221 0.7222 0.6445 0.8057
Seeds 0.6949 0.0753 0.4122 0.6982 0.7309 0.4981 0.6457 0.6982
Vehicle 0.1052 0.0979 0.1141 0.1166 0.0884 0.1852 0.1834 0.1166
Diabetes 0.0293 0.1774 0.0032 0.0090 0.0027 0.0298 0.0192 0.0090
Sonar 0.0384 0.0189 0.0101 0.0011 0.0126 0.0384 0.0083 0.0011
Table 5
ARI evaluation.
Name K-means DBSCAN DB-KDTree DPC DPC-PCA-KNN GDPC DFC GADPC
Aggregation 0.7624 0.8662 0.5558 1 0.9978 1 0.9884 1
Flame 0.4998 0.9659 0.9621 1 1 1 1 1
Spiral −0.0057 1 0.995 0.6686 0.8203 0.1629 1 1
Jain 0.3181 0.9411 0.9589 0.4412 0.6965 1 1 1
R15 0.9928 0.9797 0.2958 0.9928 0.9745 0.9857 0.9662 0.9928
Compound 0.7746 0.8774 0.3691 0.6397 0.7826 0.6106 0.5054 0.8513
Iris 0.5862 0.4750 0.4177 0.4240 0.6423 0.6859 0.6130 0.8057
Seeds 0.6949 0.0753 0.1782 0.6982 0.7132 0.4341 0.6729 0.6982
Diabetes 0.0744 0.0013 −0.0082 0.0019 0.0075 0.0210 0.0192 0.0025
Vehicle 0.0737 0.0547 0.0648 0.0581 0.0319 0.0913 0.0930 0.0581
Sonar 0.0071 0.0051 0.0014 −0.0044 0.0161 −0.0033 0.0030 −0.0044
Table 4 shows the NMI values of the various algorithms. The ARI results are depicted in Table 5.

The clustering results are also depicted in Fig. 11(a), where the comparison of clustering effects is clearer. Overall, the GADPC algorithm performs better than DPC, DBSCAN and K-means on the data sets listed above. On the data sets Jain, Flame, Spiral and Aggregation, the index values of GADPC all reach 1, which represents a perfect clustering effect. On the remaining data sets, GADPC also achieves a better clustering effect than DPC, K-means and DBSCAN. As illustrated in Fig. 11(b), the four algorithms have diverse clustering effects on low- and high-dimensional data. The performance of the four algorithms on the high-dimensional data sets is significantly less satisfying than on the low-dimensional data sets. However, compared with DPC, GADPC has the same or sometimes a better clustering effect on some high-dimensional data. As shown in Figs. 11(c) and 11(d), GADPC still has the same or similar advantages compared with some variants of the DPC algorithm, including DB-KDTree, DPC-PCA-KNN, GDPC and DFC.

In addition to using the clustering evaluation indexes to quantitatively analyze the performance of the GADPC algorithm, many experiments have been conducted to visually evaluate its capability in dealing with clusters of irregular shapes, varying densities and different shapes, and in identifying edge points and outliers. All the clustering results of the different clustering methods on the different data sets are aggregated in Fig. 12 for further study.

5.2. Detect clusters with irregular shapes

As a typical data set with an irregular shape, Spiral is often used to evaluate the ability of an algorithm to detect clusters with irregular shapes. It can be seen from SD5 in Fig. 12 that algorithms such as K-means, DPC and DPC-PCA-KNN cannot successfully cluster Spiral. The specific situation is shown in Fig. 13: K-means is unable to find the correct cluster centroids and aggregate the remaining points; DPC can identify the number of clusters correctly, but it cannot aggregate the remaining points accurately; and DPC-PCA-KNN has not completely solved this problem of the original DPC algorithm. Only GADPC can correctly identify the number of clustering centroids and effectively aggregate the remaining points.

5.3. Detect clusters with varying densities

In some data sets, the distribution of points shows the characteristic of varying density, as in the data set Jain. In SD3, depicted in Fig. 12, it can be observed that K-means, DPC, DPC-PCA-KNN and DFC are not capable of detecting the Jain data set correctly: they find more than two clusters and obtain wrong cluster results. It is worth noting that this data set is very difficult to cluster correctly unless the parameters are within a very ideal range. To be more convincing, DPC, DPC-PCA-KNN and DFC are selected for comparison with GADPC, as shown in Fig. 14.

5.4. Detect clusters with different shapes

In Fig. 12, it is obvious that none of the algorithms achieves a perfect effect on Compounds. However, the DPC algorithm still achieves better clustering results compared with the other test algorithms. As shown in Fig. 15, we choose DB-KDTree, DPC-PCA-KNN, GDPC, DFC and DPC for comparative analysis.

As illustrated in Fig. 12, almost all the test algorithms can obtain ideal results on SD1 except for K-means and DBSCAN. However, through further study, it can be found that those algorithms still differ considerably in recognition difficulty and clustering accuracy. In Fig. 16, the decision graphs of DPC, DPC-PCA-KNN, GDPC and GADPC are used as references to illustrate this problem.

5.5. Detect outliers and mark edge points

The original DPC algorithm cannot detect outliers. As shown in Fig. 17(a), the two abnormal points located in the upper left corner of the picture are not detected by DPC. By contrast, as illustrated in Fig. 17(b), the GADPC algorithm can accurately identify the abnormal points in the cluster. In addition, the GADPC algorithm can also mark the edge points of the cluster when necessary, as shown in Fig. 17(c).
6. Discussion

6.1. Analysis of detecting clusters with irregular shapes

In general, most clustering algorithms have greater advantages in dealing with regular-shaped clusters than with irregular ones. For example, many clustering algorithms fail to achieve ideal results on the non-spherical data set Spiral. In Fig. 13, K-means, DPC and DPC-PCA-KNN all fail to detect Spiral successfully; GADPC, however, is capable of aggregating Spiral efficiently. K-means has certain advantages in dealing with regular-shaped clusters, but fails to achieve excellent clustering performance on non-spherical clusters. The DPC algorithm is capable of detecting clusters of irregular shapes, but the defect of its aggregation strategy triggers the domino effect when dealing with Spiral. DPC-PCA-KNN does improve the allocation strategy of DPC by combining KNN, but it fails to aggregate successfully due to the difficulty of manually selecting the centroids in the decision graph; in addition, it is also difficult to select its parameters, such as n and p.

GADPC improves the aggregation strategy of the original DPC algorithm to enhance the ability to detect clusters of irregular shapes. In the assignment process of the GADPC algorithm, the remaining points belong to the closer point with higher density and stronger connectivity according to the new principle of the algorithm. A significant step in the process is to take density, distance and connectivity into consideration. In Fig. 18, although the distance between point No. 220 and point No. 46 is the smallest, the two do not have connectivity in the sparse graph, so they cannot be assigned to the same cluster. Therefore, the attribution of the point needs to be examined further; it can be judged iteratively by the improved aggregation strategy. According to that, No. 220 should be aggregated to the same cluster to which No. 226 belongs. The success of the new aggregation strategy means that other points of lower density around point No. 220 are also aggregated correctly, thus solving the domino effect of DPC. Integrated with node connectivity, this improved aggregation strategy greatly enhances the accuracy of the clustering results, making the GADPC algorithm effective in processing irregularly-shaped clusters like Spiral.
6.2. Analysis of detecting clusters with varying densities

In the data set Jain, the distribution of clusters shows the characteristic of large density change, which causes problems for most density-based clustering algorithms. As shown in Fig. 14, except for GADPC, the other three algorithms all fail to obtain satisfying clustering results on Jain.

The DPC algorithm follows an important assumption that the local density of cluster centroids is the greatest and the distance between centroids is comparatively long, which is very efficient for selecting cluster centroids in the decision graph. But for data sets with large density differences, DPC cannot find the cluster centroids of low density in the decision graph. In Fig. 19(a), it is difficult for us to manually select the cluster centroids from the decision graph, and this can lead to the clustering error shown in Fig. 19(b). GDPC does not make a fundamental improvement on the sample point allocation principle of DPC, but only improves DPC's ability to detect the number of clusters and identify outliers; even so, it is easy to choose the wrong clustering centers on Jain. Due to the same problem with the decision graph as DPC and GDPC, DPC-PCA-KNN is also unable to successfully detect Jain.

In the process of selecting the cluster centroids, GADPC adds node-connectivity detection of the possible cluster centroids to the assumption of the original DPC, to improve the accuracy of centroid selection. Affected by the data shape and density distribution, there are often multiple points with higher density and larger distance in high-density clusters, and multiple points with smaller density but larger distance in low-density clusters. According to the cluster centroid hypothesis of DPC, these points are all possible clustering centroids. The proposed GADPC algorithm performs node detection on the possible cluster centroids, thereby avoiding the tendency to select all the high-density points in dense clusters as true centroids while preventing the centroids of low-density clusters from being missed. This improvement can effectively solve the error whereby multiple cluster centroids are selected in high-density clusters while the cluster centroids in low-density clusters are ignored.

6.3. Analysis of detecting clusters with varying sizes

It is difficult for K-means to determine the initial partitions because it easily falls into a local optimal solution; thus, K-means is incapable of detecting clusters of arbitrary shapes. The failure of DBSCAN and DB-KDTree on Compounds and Aggregation is due to their high sensitivity to input parameters and their use of global density parameters.
The method of selecting the cluster centroids and the aggregation strategy for the remaining points enable the DPC algorithm to obtain better clustering results on Compounds and Aggregation. However, the DPC algorithm still has some limitations in selecting centroids from the decision graph and in identifying outliers. As shown in Fig. 16(a), it is difficult for us to judge which points in the red circle are clustering centroids by subjective judgment. DPC-PCA-KNN also has the same problem.

Fig. 18. GA-DPC's aggregation process on the Spiral data set, 𝑑𝑐 = 1.1.
Fig. 19. The DPC decision graph and clustering result of Jain.
When the local density and density-based distance of the centroids and of other points are not much different, this leads to sensitivity to parameter settings. Therefore, GDPC and GADPC improve the method and criteria for selecting centroids. GDPC reconsiders the calculation of local density and relative distance by introducing gravitation, which makes the clustering centroids clearer in the decision graph, as shown in Fig. 16(c). For GADPC, the accuracy and convenience of selecting centroids are greatly increased by considering the connectivity among points with high local density and selecting centroids in multiple steps. Inheriting the superiorities of DPC and making up for its original deficiencies in centroid selection and outlier detection, GADPC is capable of successfully detecting Compounds and Aggregation.

6.4. Analysis of detecting outliers and marking edge points

The original DPC algorithm does not provide detection measures for outliers. GADPC detects outliers and marks edge points based on the degree centrality of each node and its connectivity with the cluster centroids. If a node has low degree centrality and is not connected to a cluster centroid, it is marked as an abnormal point. If a node's degree centrality is low but it is connected to a cluster centroid, it is marked as an edge point. This simple detection principle can effectively detect the edge points and outliers of a cluster, as shown in Figs. 17(b) and 17(c).
6.5. Analysis of sensitivity

GADPC takes only one input parameter P, and its setting range is close to that given for the DPC algorithm. Since the numbers and ranges of parameters of the improved algorithms listed above are different, several comparative experiments with DPC were conducted to evaluate the parameter sensitivity of GADPC. In Figs. 20 and 21, we test the clustering quality of the two algorithms under different parameter values ranging from 1.2 to 3. Compared with the DPC algorithm, GADPC achieves a better clustering effect for different parameter values in most cases. Therefore, GADPC has good adaptability in terms of parameter sensitivity.
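A sketch of the kind of sensitivity sweep reported in Figs. 20 and 21, reusing the gadpc skeleton above and assuming a labeled benchmark (X, y); the grid and the choice of ARI as the recorded index are illustrative:

```python
# Sweep the percentage parameter p and record ARI on a labeled set (X, y).
for p in np.arange(1.2, 3.01, 0.2):
    labels, _ = gadpc(X, p=p)
    print(f"p = {p:.1f}  ARI = {adjusted_rand_score(y, labels):.4f}")
```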
7. Conclusion

The GADPC algorithm integrates Graph Theory into the processes of centroid selection and aggregation to remedy the shortcomings of the DPC algorithm in these respects. Experimental results and theoretical analysis show that GADPC is more reliable and efficient than DBSCAN, DPC and K-means in terms of parameter adaptability, clustering accuracy and outlier detection. GADPC achieves excellent performance when processing clusters of various sizes, varying densities and non-spherical shapes. In addition, the selection of cluster centroids in GADPC largely avoids the potential subjective errors that are unavoidable in the original algorithm. Clearly, DPC has shortcomings in the processes of centroid selection and aggregation: for data sets with large density changes, such as Jain, it has difficulty selecting the cluster centroids in the decision graph because of the manual work involved, and for some data sets with non-spherical shape, such as Spiral, it is prone to trigger the domino effect. Moreover, DPC is less convincing in the detection of outliers and edge points.

Absorbing the advantages of Graph Theory and of the DPC algorithm, GADPC makes the cluster centroid selection automatic and accurate and the allocation process more reasonable, which guarantees its good effect in processing non-spherical clusters and clusters with various sizes and varying densities. The detection of abnormal points and edge points in GADPC makes it possible to put the algorithm into practical applications. However, further research is still required on how to choose the size of 𝑑𝑐, how to improve the accuracy of the algorithm on high-dimensional data, and how to balance the clustering accuracy and the time complexity of the algorithm.

CRediT authorship contribution statement

Tengfei Xu: Code modification, Document collection, Simulation experiment, Writing – review & editing, Validation, Formal analysis, Supervision, Final review. Jianhua Jiang: Supervision, Final review.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Bian, Z., Chung, F. L., & Wang, S. (2020). Fuzzy density peaks clustering. IEEE Transactions On Fuzzy Systems, PP(99), 1.
Castellanos-Garzón, J. A., García, C. A., Novais, P., & Díaz, F. (2013). A visual analytics framework for cluster analysis of DNA microarray data. Expert Systems With Applications, 40(2), 758–774.
Chang, H., & Yeung, D. (2008). Robust path-based spectral clustering. Pattern Recognition, 41, 191–203.
Chen, G., Gould, R., & Yu, X. (2003). Graph connectivity after path removal. Combinatorica, 23, 185–203.
de Andrades, R. K., Dorn, M., Farenzena, D. S., & Lamb, L. C. (2013). A cluster-DEE-based strategy to empower protein design. Expert Systems With Applications, 40(13), 5210–5218.
Diestel, R. (2000). Graph theory. Mathematical Gazette, 173(502), 67–128.
Du, M., Ding, S., & Jia, H. (2016). Study on density peaks clustering based on k-nearest neighbors and principal component analysis. Knowledge-Based Systems, 99, 135–145.
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the second international conference on knowledge discovery and data mining (pp. 226–231).
Frey, B. J., & Dueck, D. (2007). Clustering by passing messages between data points. Science, 315(5814), 972–976.
Fu, L., & Medico, E. (2007). FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinformatics, 8, 3.
Goodrich, M. T., Tamassia, R., & Triandopoulos, N. (2011). Efficient authenticated data structures for graph connectivity and geometric search problems. Algorithmica, 60(3), 505–552.
Han, J., & Kamber, M. (2006). Data mining: Concepts and techniques. Data Mining Concepts Models Methods & Algorithms Second Edition, 5(4), 1–18.
Harrison, W. K. (2016). The role of graph theory in system of systems engineering. IEEE Access, 4, 1716–1742.
Hruschka, E. R., Campello, R. J. G. B., Freitas, A. A., & Ponce Leon F. de Carvalho, A. C. (2009). A survey of evolutionary algorithms for clustering. IEEE Transactions On Systems, Man, And Cybernetics, Part C (Applications And Reviews), 39(2), 133–155.
Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651–666.
Jain, A., Murty, M., & Flynn, P. (1999). Data clustering: a review. ACM Computing Surveys, 31(3), 264–323.
Jiang, J., Chen, Y., Hao, D., & Li, K. (2019). DPC-LG: Density peaks clustering based on logistic distribution and gravitation. Physica A: Statistical Mechanics And Its Applications, 514, 25–35.
Jiang, J., Chen, Y., Meng, X., Wang, L., & Li, K. (2019). A novel density peaks clustering algorithm based on k nearest neighbors for improving assignment process. Physica A: Statistical Mechanics and its Applications, 523, 702–713.
Jiang, J., Hao, D., Chen, Y., Parmar, M., & Li, K. (2018). GDPC: Gravitation-based density peaks clustering algorithm. Physica A: Statistical Mechanics And Its Applications, 502(15), 345–355.
Jiang, J., Tao, X., & Li, K. (2018). DFC: Density fragment clustering without peaks. Journal Of Intelligent And Fuzzy Systems, 34(1), 525–536.
Jiang, J., Zhou, W., Wang, L., Tao, X., & Li, K. (2019). HaloDPC: An improved recognition method on halo node for density peak clustering algorithm. International Journal Of Pattern Recognition And Artificial Intelligence, 33(8), Article 1950012.
Koster, K., & Spann, M. (2000). MIR: An approach to robust clustering - Application to range image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(5), 430–444.
Li, Z., & Tang, Y. (2018). Comparative density peaks clustering. Expert Systems With Applications, 95, 236–247.
Liu, Y., Ma, Z., & Fang, Y. (2017). Adaptive density peak clustering based on K-nearest neighbors with aggregating strategy. Knowledge-Based Systems, 133, 208–220.
Liu, R., Wang, H., & Yu, X. (2018). Shared-nearest-neighbor-based clustering by fast search and find of density peaks. Information Sciences, 450, 200–226.
Lotfi, A., Seyedi, S. A., & Moradi, P. (2017). An improved density peaks method for data clustering. In International conference on computer & knowledge engineering.
Mathew, S., & Sunitha, M. S. (2010). Node connectivity and arc connectivity of a fuzzy graph. Information Sciences, 180(4), 519–531.
Mehmood, R., Zhang, G., Bie, R., Dawood, H., & Ahmad, H. (2016). Clustering by fast search and find of density peaks via heat diffusion. Neurocomputing, 208, 210–217.
Penrose, M. D. (2015). On k-connectivity for a geometric random graph. Random Structures & Algorithms, 15(2), 145–164.
Pfitzner, D., Leibbrandt, R., & Powers, D. (2009). Characterization and evaluation of similarity measures for pairs of clusterings. Knowledge And Information Systems, 19(3), 361–394.
Powers, D. M. W. (2011). Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal Of Machine Learning Technologies, 2(1), 37–63.
Rodriguez, A., & Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492–1496.
Rui, X., & Wunsch, D. I. (2005). Survey of clustering algorithms. IEEE Transactions On Neural Networks, 16(3), 645–678.
Seyedi, S. A., Lotfi, A., Moradi, P., & Qader, N. N. (2019). Dynamic graph-based label propagation for density peaks clustering. Expert Systems With Applications, 115, 314–328.
Shibla, T. P., & Kumar, K. (2018). Improving efficiency of DBSCAN by parallelizing kd-tree using spark. In 2018 Second international conference on intelligent computing and control systems.
Shimon, Even, Endre, R., & Tarjan (1975). Network flow and testing graph connectivity. SIAM Journal On Computing, 4(4), 507–518.
Sun, S., Yuan, D., Xu, Y., Wang, A., & Deng, Z. (2016). Ligand-mediated synthesis of shape-controlled cesium lead halide perovskite nanocrystals via reprecipitation process at room temperature. Acs Nano, 3648.
Veenman, C., Reinders, M., & Backer, E. (2002). A maximum variance cluster algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 1273–1280.
Vinh, N. X., Epps, J., & Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal Of Machine Learning Research, 11, 2837–2854.
Wang, G., & Song, Q. (2016). Automatic clustering via outward statistical testing on density metrics. IEEE Transactions on Knowledge & Data Engineering, 1971–1985.
Wang, X., Trajanovski, S., Kooij, R., & Mieghem, P. (2016). Degree distribution and assortativity in line graphs of complex networks. Physica A: Statistical Mechanics And Its Applications, 445, 343–356.
Wja, B., Cz, B., & Chao, J. B. (2019). An improvement method of DBSCAN algorithm on cloud computing. Procedia Computer Science, 147, 596–604.
Wu, B., Zhang, Y., Hu, B.-G., & Ji, Q. (2013). Constrained clustering and its application to face clustering in videos. IEEE Computer Society Conference On Computer Vision And Pattern Recognition, 3507–3514.
Xie, J., Gao, H., Xie, W., Liu, X., & Grant, P. W. (2016). Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors. Information Sciences, 19–40.
Xu, X., Ding, S., Wang, Y., Wang, L., & Jia, W. (2021). A fast density peaks clustering algorithm with sparse search. Information Sciences, 554, 61–83.
Zahn, C. T. (1971). Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, C-20(1), 68–86.
Zhou, Z., Si, G., Zhang, Y., & Zheng, K. (2018). Robust clustering by identifying the veins of clusters based on kernel density estimation. Knowledge-Based Systems, 159, 309–320.