
Information Systems 91 (2020) 101504


A novel graph-based clustering method using noise cutting



Lin-Tao Li, Zhong-Yang Xiong∗, Qi-Zhu Dai, Yong-Fang Zha, Yu-Fang Zhang, Jing-Pei Dan (∗ corresponding author)
Key Laboratory of Dependable Service Computing in Cyber Physical Society, Ministry of Education, Chongqing University, Chongqing 400044, China

Article history: Received 12 December 2019; Received in revised form 14 January 2020; Accepted 18 January 2020; Available online 21 January 2020. Recommended by Dennis Shasha.

Keywords: Graph-based clustering; Natural neighbors; Noise cutting

Abstract: Recently, many methods have appeared in the field of cluster analysis. Most existing clustering algorithms have considerable limitations in dealing with local and nonlinear data patterns. Algorithms based on graphs provide good results for this problem. However, some widely used graph-based clustering methods, such as spectral clustering algorithms, are sensitive to noise and outliers. In this paper, a cut-point clustering algorithm (CutPC) based on a natural neighbor graph is proposed. The CutPC method performs noise cutting when a cut-point value is above the critical value. Normally, the method can automatically identify clusters with arbitrary shapes and detect outliers without any prior knowledge or preparatory parameter settings. The user can also adjust a coefficient to better adapt the clustering solution to a particular problem. Experimental results on various synthetic and real-world datasets demonstrate the obvious superiority of CutPC compared with k-means, DBSCAN, DPC, SC, and DCore.

© 2020 Elsevier Ltd. All rights reserved.

1. Introduction

Clustering analysis is considered the most important technique in the field of machine learning. The purpose of clustering is to divide a dataset into clusters depending on high intracluster similarity and low intercluster similarity [1,2]. It has been widely applied to image segmentation, pattern recognition, document clustering [3], and social networks [4]. In recent years, a large number of clustering algorithms have been proposed. Generally, clustering can be classified into many categories: partition-based clustering algorithms [5,6], hierarchical clustering algorithms [7,8], center-based clustering algorithms [9–11], density-based clustering algorithms [12,13], graph-based clustering algorithms [14–18], etc.

The purpose of partition-based clustering algorithms is to group the data into a settled number of clusters based on an objective function such as the sum of squared error. K-means [6], a well-known partition-based clustering algorithm, is widely used because of its efficiency and simplicity. However, k-means has two fatal flaws. First, the number of clusters needs to be determined manually in advance. Second, since the initial cluster center is randomly selected, it easily falls into local optimal solutions. Hierarchical clustering algorithms [7] divide the data into several clusters represented by a tree of nodes. Clusters are identified by merging the groups at different levels of the tree using a minimum distance criterion. Generally, these algorithms not only have a high computational cost but also require a threshold to define an appropriate stopping condition for splitting or merging partitions [8,19].

Recently, a variety of clustering algorithms have been proposed for detecting nonspherical clusters. One of the most famous is the density peak clustering (DPC) algorithm [9]. The idea of DPC is that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities. However, DPC still has some defects. First, a threshold dc needs to be manually determined. Moreover, the cluster centers are obtained from the decision graph, so the method involves certain human factors. To improve the performance of DPC, algorithms such as DPC-KNN-PCA [10] and SNN-DPC [11] have been proposed. However, this center-based approach has difficulty dealing with clusters containing manifold distributions [20]. Therefore, Chen et al. [21] proposed a hybrid decentralized method named DCore. The underlying idea is that each cluster is considered to have a shrunken density core that roughly retains the shape of the cluster. The approach has a similar motivation to the proposed method, that is, to eliminate objects with relatively low energy. Thus, we compare our proposed approach with the method in Ref. [21].

To identify clusters with arbitrary shapes, density-based clustering methods are introduced, where dense regions of data points are considered clusters [13]. One of the most representative density-based clustering approaches is the density-based spatial clustering of applications with noise (DBSCAN) [12]. It clusters data points with their closest neighbors, and data points that lie isolated in low-density regions are called outliers.
Although DBSCAN can detect clusters with arbitrary shapes and outliers, its performance depends on the scanning radius Eps and the density threshold MinPts determined by the user.

Traditional clustering methods have considerable limitations in dealing with local and nonlinear data patterns. To overcome this issue, many graph-based clustering algorithms have been proposed [14–18]. In graph-based clustering methods, the dataset is considered a graph G = (V, E) in which the vertex set (V) represents the objects, and the edge set (E) represents the similarity between objects. However, spectral clustering (SC) [14], the most widely used graph-based clustering method, is vulnerable to noise and outliers [18].

To compensate for the above-mentioned deficiencies, in this paper, we consider clusters as connected regions with high energy. To determine the high-energy connected regions, we propose a cut-point clustering method named CutPC based on a natural neighbor graph. First, we construct a sparse graph using the concept of natural neighbors. Then, the algorithm partitions the graph into subgroups by cutting the points whose cut-point values are above the critical value. Last, the other objects are assigned to clusters according to the connectivity of the subgroups.

The remainder of this paper is organized as follows. In Section 2, a brief overview of the natural neighbor stable structure is presented. The process of the proposed clustering algorithm and its computational complexity are presented in Section 3. The experimental results with some synthetic and real-world datasets are demonstrated in Section 4. Finally, a summary of this paper and future work are presented in Section 5.

2. Related work

The Natural Neighbor Stable Structure [22], a new concept coming from objective reality, is inspired by interpersonal relationships in human society: the number of one's true friends should be the number of people who take him/her as a friend and who are considered by him/her as friends at the same time. The key idea of the Natural Neighbor Stable Structure is that objects lying in a sparse region have low energy, whereas objects lying in a dense region have high energy [23]. Based on the above statement, the Natural Neighbor Stable Structure of data objects can be formulated as follows:

(∀xi)(∃xj)(k ∈ N) ∧ (i ≠ j) → (xi ∈ NNk(xj) ∧ xj ∈ NNk(xi))   (1)

where NNk(xi) is the kth nearest neighbor of object xi. The definitions of k nearest neighbors and reverse neighbors can be given as follows:

Definition 1 (k Nearest Neighbors). The k nearest neighbors of point xi are a set of points x in D with d(xi, x) ≤ d(xi, o), which is:

NNk(xi) = { xj ∈ D | d(xi, xj) ≤ d(xi, o) }   (2)

where o is the kth nearest neighbor of object xi.

Definition 2 (Reverse Neighbors). The reverse neighbors of point xi are the set of points x that consider xi as one of their k nearest neighbors, which is:

RNN(xi) = { x ∈ D | xi ∈ NNk(x) }   (3)

The formation of the Natural Neighbor Stable Structure is achieved as follows: continuously expand the neighbor searching range k, and compute the number of reverse neighbors for each object at the same time. The stopping criteria of the iteration are that all objects have reverse neighbors or that the number of objects without reverse neighbors does not change anymore. At this time, the search range k is the natural neighbor characteristic value λ [23]. Therefore, λ is defined as follows:

λ = min{ k | Σ(i=1..n) f(Nbk(xi)) = 0 or Σ(i=1..n) f(Nbk(xi)) = Σ(i=1..n) f(Nbk−1(xi)) }   (4)

where k is initialized with 1, Nbk(xi) is the number of object xi's reverse neighbors in the kth iteration, and f(x) = 1 if x == 0 and f(x) = 0 otherwise. The details of this method are presented in Algorithm 1, and the natural neighbor of object xi is defined as follows:

Definition 3 (Natural Neighbors). For each object x, the natural neighbors are the k nearest neighbors where k is equal to the natural characteristic λ, denoted as NaN(x).

Obviously, natural neighbors are different from traditional k nearest neighbors, and the whole computational procedure of the Natural Neighbor Stable Structure can be fulfilled automatically without any parameter.

3. Proposed clustering method

To effectively identify the clusters that are considered connected regions with high energy, the proposed clustering method consists of three main steps. The first step constructs a sparse graph using natural neighbors. In this graph, all the observations are represented by nodes, with edges that connect each observation to its natural neighbors. Second, we detect the noise points (Definition 7) and perform the noise cutting (Definition 8). Finally, the subgraphs formed after the above operations are clustered using a method to identify the connected components. Each connected component is determined as a cluster. The details of the proposed clustering algorithm are shown in Algorithm 2.

3.1. Constructing the natural neighbor graph

The first step of the proposed method is to represent the data by a graph structure. Representing a dataset as a graph is useful for clustering local and nonlinear patterns. There are many methods for constructing a graph from a given dataset: the fully connected graph, the ε-connected graph, and the k-nearest neighbor graph. In a fully connected graph, any two points have an edge between them, and all the edge weights are calculated universally using the Gaussian kernel function. However, this method not only increases the time complexity but also requires manual determination of the variance parameter δ of the Gaussian kernel. An ε-connected graph connects all points whose pairwise distance is smaller than ε. Apparently, the appropriate ε value is hard to determine. In a k-nearest neighbor graph, an edge between two objects is created if and only if one belongs to the k-nearest neighbor set of the other. However, k must be determined manually. Considering the uncontrollable factors of the above methods, we use the concept of natural neighbors to construct a graph named the natural neighbor graph. The definition of a natural neighbor graph is as follows:

Definition 4 (Natural Neighbor Graph). A natural neighbor-based graph with n nodes is constructed as follows. An edge eij between nodes xi and xj is defined as:

eij = 1, if xi ∈ NaN(xj) or xj ∈ NaN(xi); eij = 0, otherwise   (5)

where NaN(xi) is the set of natural neighbors of object xi.

Algorithm 1 NaN-searching
Input: D (the data set)
Output: λ, NaN
1: Initialize k = 1, Nb(xi) = 0, NNk(xi) = ∅, RNNk(xi) = ∅;
2: Create a KD-tree T from the data set D;
3: while true do
4:   for each object xi in D do
5:     Find the kth neighbor xj of xi using T;
6:     Nb(xj) = Nb(xj) + 1;
7:     NNk(xi) = NNk−1(xi) ∪ {xj};
8:   end for
9:   Nbk = Nb;
10:  Find the number of objects with Nb(xi) == 0, denoted by Num;
11:  if Num does not change any more or Num == 0 then
12:    break;
13:  end if
14:  k = k + 1;
15: end while
16: λ = k;
17: for each object xi in D do
18:   NaN(xi) = NNλ(xi);
19: end for
20: Return λ, NaN
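For readers who want to experiment with the stable-structure search, the following Python fragment is a minimal sketch of Algorithm 1 built on a SciPy KD-tree. It is an illustrative reimplementation under our reading of the pseudocode, not the authors' released code, and the names nan_searching, nb, and nn are ours.

import numpy as np
from scipy.spatial import cKDTree

def nan_searching(D):
    """Return the natural characteristic value lambda and the natural neighbors of each point."""
    D = np.asarray(D)
    n = len(D)
    tree = cKDTree(D)
    nb = np.zeros(n, dtype=int)           # Nb(xi): reverse-neighbor counts
    nn = [set() for _ in range(n)]        # NNk(xi): accumulated nearest-neighbor sets
    k = 1
    prev_num = -1
    while True:
        for i in range(n):
            # query k+1 points because the closest result is the point itself
            _, idx = tree.query(D[i], k=k + 1)
            j = int(idx[-1])              # the kth neighbor of xi
            nb[j] += 1                    # xi becomes one of xj's reverse neighbors
            nn[i].add(j)
        num = int(np.sum(nb == 0))        # objects still without reverse neighbors
        if num == 0 or num == prev_num:   # stopping criteria of Eq. (4)
            break
        prev_num = num
        k += 1
    lam = k                               # natural characteristic value lambda
    nan = {i: nn[i] for i in range(n)}    # NaN(xi) = NN_lambda(xi)
    return lam, nan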

Algorithm 2 CutPC
Input: D (the data set)
Output: CL (the results of clustering)
1: Initialize CL = −1 for all objects;
2: [k, NaN] = NaN-searching(D);
3: Construct the natural neighbor graph G according to Definition 4;
4: NS = NS-finding(D);
5: for each object p in NS do
6:   CL(p) = 0;
7: end for
8: Perform noise cutting in graph G according to Definition 8;
9: CL = Assigning(G);
10: Return CL
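Putting the pieces of Algorithm 2 together, a hedged end-to-end sketch in Python could look as follows; nan_searching, build_nn_graph, ns_finding, and assign_clusters are the illustrative helpers sketched elsewhere in this section, and label 0 marks noise as in the pseudocode.

import numpy as np

def cutpc(D, alpha=1.0):
    lam, nan = nan_searching(D)              # Step 1: natural neighbors (Algorithm 1)
    graph = build_nn_graph(nan)              # natural neighbor graph (Definition 4)
    noise = ns_finding(D, lam, nan, alpha)   # Step 2: noise points (Algorithm 3)
    for p in noise:                          # noise cutting (Definition 8)
        for q in list(graph[p]):
            graph[q].discard(p)
        graph[p].clear()
    labels = assign_clusters(graph)          # Step 3: connected components (Algorithm 4)
    labels[list(noise)] = 0                  # noise objects keep the label 0
    return labels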

Using natural neighbors to construct the graph not only requires no parameters to be set but also achieves the natural neighbor stable structure. Moreover, the natural neighbor graph can more clearly reflect that objects lying in a sparse region have low energy, whereas objects lying in a dense region have high energy. A natural neighbor graph constructed from a dataset of 513 objects is shown in Fig. 1.

Fig. 1. Constructing the natural neighbor graph.
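As a concrete illustration of Definition 4, the small Python sketch below builds the natural neighbor graph as an adjacency structure; an undirected edge eij is added whenever one point is a natural neighbor of the other. The function name build_nn_graph is ours, not from the paper.

def build_nn_graph(nan):
    """nan: dict mapping each point index to the set of its natural-neighbor indices."""
    graph = {i: set() for i in nan}
    for i, neighbors in nan.items():
        for j in neighbors:
            graph[i].add(j)   # eij = 1 if xj is in NaN(xi) ...
            graph[j].add(i)   # ... or xi is in NaN(xj), so the edge is symmetric
    return graph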

3.2. Detecting the noise points and performing noise cutting

The idea of using the reverse density to extract noise points is straightforward and effective. To identify the high-energy regions, we compute the reverse density for each object and then obtain the critical reverse density. The noise points are determined by comparing the reverse density with the critical reverse density. As shown in Fig. 2, the red points represent the noise points. Thus, the definitions of reverse density and critical reverse density are as follows:
Definition 5 (Reverse Density). We can use NaN(xi) to calculate the reverse density of point xi. This method of reverse density calculates the mean distance to the natural neighbors as follows:

τ(xi) = (1/k) Σ(xj ∈ NaN(xi)) d(xi, xj)   (6)

where k is the natural neighbor characteristic value λ, NaN(xi) represents the natural neighbors of point xi, and d(xi, xj) is the Euclidean distance between xi and xj.

Fig. 2. Detection of noise points.
Fig. 3. Performing noise cutting.
Fig. 4. Assigning the objects to clusters.

Definition 6 (Critical Reverse Density). The critical reverse density θ is defined as follows:

θ = mean(τ(x)) + α × std(τ(x)) (∀x ∈ D)   (7)

where mean(τ(x)) represents the average reverse density of all objects in data set D, std(τ(x)) represents the standard deviation of all objects' reverse densities in dataset D, and α is a tuning coefficient. A value of α = 1 is suitable for most data sets, so we set it to 1 by default.

Definition 7 (Noise Point). If one object can be considered a noise point, it must satisfy the following formula:

NOISE = { x | ∀x ∈ D, τ(x) > θ }   (8)

The proposed approach for determining the noise points is described in Algorithm 3. Based on the noise points identified above, we conduct noise cutting, which is defined as follows:

Definition 8 (Noise Cutting). The noise point and the edges connected to this point are all removed from the natural neighbor graph.

The result after the noise cutting operation is shown in Fig. 3. We find that five connected subgraphs have been formed here.

3.3. Assigning the other points to clusters

In the last step, our algorithm assigns the data objects to clusters based on the remaining connections. It is straightforward to determine the connected components of a graph using a natural neighbor search, and the idea originates from the nearest neighbor search [24]. The details of this algorithm are shown in Algorithm 4. A search process that begins with a node identifies the whole connected component before returning. Having identified a connected component in this way, we seed an index that has not yet been searched. We then apply the natural neighbor search to find a new connected component. Finally, when all the nodes have been searched, the algorithm is terminated, and the group index vector is returned.

The final clustering result is shown in Fig. 4. The symbol '+' represents the noise objects, and different colors represent different clusters.

3.4. Complexity analysis

We assume that n is the total number of points in the dataset. The time complexity of the proposed method depends on the following: (1) According to the natural neighbor algorithm optimized by the KD-tree [25], the time complexity for finding the natural neighbors is O(n log n). Therefore, the time complexity for constructing the natural neighbor graph is O(n log n). (2) The time complexity for computing the reverse density of each point and finding the noise points through the critical reverse density is O(n). (3) Assigning the remaining points to clusters can be solved by a linear time algorithm. In summary, the overall complexity of the proposed clustering method is O(n log n).

Table 1
The characteristics of the ten synthetic datasets.
Datasets  Number of instances  Number of attributes  Number of clusters
D1   513   2  5
D2   420   2  2
D3   1532  2  2
D4   3883  2  3
D5   1114  2  4
D6   1064  2  2
D7   1915  2  6
D8   1427  2  4
D9   8533  2  7
D10  8000  2  6

Algorithm 3 NS-finding
Input: D (the data set)
Output: NS (the noise point set)
1: Initialize NS = ∅, τ(x) = ∅;
2: [k, NaN] = NaN-searching(D);
3: for each object xi in D do
4:   Calculate the reverse density τ(xi) using Eq. (6);
5:   τ(x) = τ(x) ∪ τ(xi);
6: end for
7: Calculate the critical reverse density θ using Eq. (7);
8: for each object xi in D do
9:   if the object xi satisfies Eq. (8) then
10:    NS = NS ∪ {xi};
11:  end if
12: end for
13: Return NS
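A compact Python sketch of Algorithm 3 under Eqs. (6)–(8) is given below: the reverse density is the mean distance to the natural neighbors, the critical reverse density adds α standard deviations to the mean, and every object whose reverse density exceeds the critical value is returned as noise. This is an illustrative rendering with our own names, not the authors' code.

import numpy as np

def ns_finding(D, lam, nan, alpha=1.0):
    D = np.asarray(D)
    tau = np.zeros(len(D))
    for i, neighbors in nan.items():
        dists = [np.linalg.norm(D[i] - D[j]) for j in neighbors]
        tau[i] = np.sum(dists) / lam                        # Eq. (6): mean distance over k = lambda
    theta = tau.mean() + alpha * tau.std()                  # Eq. (7): critical reverse density
    return {i for i in range(len(D)) if tau[i] > theta}     # Eq. (8): noise point set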

Algorithm 4 Natural neighbor search (Assigning)
Input: G (including the node information (D) and the natural neighbor information of each node (NaN))
Output: CL (the results of clustering)
1: Initialize label = 0;
2: for each object xi in D do
3:   if CL(xi) == −1 then // no label has been assigned to this point
4:     label = label + 1;
5:     NaNx = NaN(xi) ∪ {xi};
6:     CL(xi) = label;
7:     while ∃ xj ∈ NaNx with CL(xj) == −1 do
8:       CL(xj) = label;
9:       temp = NaN(xj) − {xi};
10:      for each object xk in temp do
11:        if (xk is not in NaNx) and CL(xk) == −1 then
12:          NaNx = NaNx ∪ {xk};
13:        end if
14:      end for
15:    end while
16:  end if
17: end for
18: Return CL
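The assignment step of Algorithm 4 amounts to labelling the connected components of the cut graph. The sketch below does this with a simple breadth-first traversal over the natural-neighbor adjacency and returns one integer label per object; it is an illustrative equivalent, not the published implementation.

from collections import deque
import numpy as np

def assign_clusters(graph):
    """graph: dict mapping each node index to the set of its adjacent node indices."""
    labels = np.full(len(graph), -1, dtype=int)
    label = 0
    for start in graph:
        if labels[start] != -1:
            continue
        label += 1                         # a new connected component = a new cluster
        labels[start] = label
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if labels[v] == -1:
                    labels[v] = label
                    queue.append(v)
    return labels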

Fig. 5. Ten original synthetic datasets.

4. Experimental analysis

In this section, a set of experiments are conducted to evaluate the performance of the proposed method, and we compare it to well-known and state-of-the-art clustering methods including k-means [6], DBSCAN [12], DPC [9], SC [14] and DCore [21]. The experiments are performed on a PC with an Intel i5-8250U CPU, 8 GB RAM, Windows 10 64-bit OS, and the MATLAB 2018a programming environment. The code of CutPC is available for download from the link https://github.com/lintao6/CutPC.

4.1. Experiments on synthetic datasets

In this part, we use ten synthetic datasets to evaluate the performance of our method. As shown in Table 1, the characteristics of the synthetic data sets are described. The ten original synthetic datasets are displayed in Fig. 5.

Table 2
The parameter settings of each clustering method in the ten synthetic datasets.
Datasets k-means DBSCAN DPC SC DCore CutPC
D1 N = 5 Eps = 10 dc = 2% k = 10 σ = 1 r1 = 10 r2 = 5 R = 10 −
Minpts = 5 N = 5 T1 = 10 Tn = 5
D2 N = 2 Eps = 0.25 dc = 2% k = 10 σ = 1 r1 = 1 r2 = 0.5 R = 1 −
Minpts = 5 N = 2 T1 = 30 Tn = 5
D3 N = 2 Eps = 0.1 dc = 2% k = 10 σ = 1 r1 = 0.3 r2 = 0.1 R = 0.2 −
Minpts = 8 N = 2 T1 = 20 Tn = 5
D4 N = 3 Eps = 0.3 dc = 2% k = 10 σ = 1 r1 = 0.25 r2 = 0.1 R = 0.5 −
Minpts = 10 N = 3 T1 = 30 Tn = 10
D5 N = 4 Eps = 10 dc = 2% k = 20 σ = 5 r1 = 10 r2 = 10 R = 15 −
Minpts = 5 N = 4 T1 = 30 Tn = 10
D6 N = 2 Eps = 0.3 dc = 2% k = 3 σ = 0.15 r1 = 1 r2 = 1.5 R = 1.1 −
Minpts = 5 N = 2 T1 = 30 Tn = 5
D7 N = 6 Eps = 8 dc = 2% k = 5 σ = 0.4 r1 = 10 r2 = 14 R = 14 −
Minpts = 5 N = 6 T1 = 30 Tn = 10
D8 N = 4 Eps = 0.25 dc = 2% k = 10 σ = 10 r1 = 0.35 r2 = 0.3 R = 0.5 −
Minpts = 4 N = 4 T1 = 10 Tn = 3
D9 N = 7 Eps = 1 dc = 2% k = 10 σ = 0.15 r1 = 1 r2 = 0.9 R = 2 −
Minpts = 10 N = 7 T1 = 30 Tn = 10
D10 N = 6 Eps = 6 dc = 2% K = 12 σ = 0.15 r1 = 15 r2 = 15 R = 15 α = 0.5
Minpts = 5 N = 6 T1 = 35 Tn = 15

Fig. 6. The results of 5 clustering algorithms in D1.

D1 and D2 contain spherical clusters with different numbers. D1 consists of five clusters with 513 objects, including some noise objects. D2 contains two clusters with 420 objects, including noise objects. By contrast, the remaining datasets contain clusters with arbitrary shapes. D3 is composed of two moon manifold clusters with 1532 objects, including noise objects. D4 consists of three circle clusters with some noise objects and a total of 3883 objects. D5 contains one circle cluster and three spherical clusters, including noise objects, for a total of 1114 objects. D6 includes two spiral clusters with noise objects, for a total of 1064 objects. D7 consists of four spherical clusters and two right-angle line clusters with some noise objects, for a total of 1915 objects. D8 contains three spherical clusters and one manifold cluster with 1427 objects, including noise objects. D9 consists of three circle clusters, two spiral clusters and two spherical clusters with a total of 8533 objects, including noise objects. D10 contains four right-angle line clusters, one line cluster and one spherical cluster with a total of 8000 objects, including some noise objects.

Table 2 illustrates the parameter settings of each clustering method on the ten synthetic datasets. For the k-means method, we need to set an initial cluster number N. DBSCAN requires two parameters to be set: Eps and Minpts. The cutoff distance dc of DPC needs to be set, and the density is obtained by the exponential kernel suggested by DPC. For SC, k needs to be set to construct a k-nearest neighbor graph, the variance σ of the Gaussian kernel needs to be set to compute the similarity between two objects, and the initial cluster number N needs to be set. In regard to DCore, the selection of the parameters r1, r2, R, T1 and Tn affects the clustering results. We test different parameter settings to obtain better results. With regard to CutPC, although we set the tuning coefficient α to 0.5, which achieves a better result in D10, generally no parameters need to be set.
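To make the parameter story concrete, calling the illustrative cutpc sketch from Section 3 would look like this; the variable names are hypothetical, and the default α = 1 corresponds to the parameter-free setting used for the other datasets.

# Hypothetical usage of the cutpc sketch from Section 3 (not the authors' interface).
labels_default = cutpc(X)              # parameter-free run, alpha defaults to 1
labels_d10 = cutpc(X_d10, alpha=0.5)   # tighter noise cut, as used for D10 in Table 2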

Fig. 7. The results of 5 clustering algorithms in D2.

Fig. 8. The results of 5 clustering algorithms in D3.

Figs. 6 and 7 show that all clustering methods can offer impressive results in D1 and D2. This means that all the algorithms work well for spherical clusters. However, all approaches other than CutPC need to set parameters.

The clustering shown in Fig. 8 demonstrates that, except for k-means and DPC, the other methods have a good result in D3. In fact, k-means and DPC cannot detect the nonspherical clusters. From Fig. 9, we find that DBSCAN, DCore, and CutPC perform better than the other algorithms in D4, while DCore mistakenly assigns some objects. Although k-means, DPC, and SC can find the correct clusters of spherical shape, they cannot deal with circle clusters.

The clustering shown in Fig. 10 reveals that DBSCAN, SC, and CutPC have similar results in that all of them can obtain correct clustering in D5. Due to improper choices of parameters, DCore finds three clusters. The experiment again illustrates that k-means and DPC are not suitable for circle clusters. Fig. 11 shows whether those algorithms can process spiral clusters. The results reveal that DBSCAN, SC, and CutPC obtain the correct clustering, while k-means, DPC, and DCore do not.

The clustering displayed in Fig. 12 demonstrates that DBSCAN and CutPC recognize the clusters in the D7 dataset, while k-means, DPC, SC, and DCore do not. The results of DPC and DCore are similar, but none of them obtain the correct clusters, and neither does SC. Although k-means finds the correct number of clusters, it cannot obtain the correct clustering result.

Fig. 9. The results of 5 clustering algorithms in D4.

Fig. 10. The results of 5 clustering algorithms in D5.

The clustering shown in Fig. 13 illustrates that DCore and CutPC are similar in that both of them obtain a correct clustering in D8, while k-means, DBSCAN, DPC, and SC do not. In fact, incorrect parameter choices lead to incorrect clustering in DBSCAN and SC. Additionally, k-means and DPC cannot deal with nonspherical clusters.

The experimental results displayed in Figs. 14 and 15 are used to evaluate the clustering performance on D9 and D10, which are more complex patterns. In Fig. 14, only the CutPC algorithm can obtain the correct result. DBSCAN and DCore have similar results in that neither of them can deal with spiral clusters, nor do k-means and DPC. Although SC can detect spiral clusters, it fails to detect circle clusters. The clustering shown in Fig. 15 demonstrates that only DCore and CutPC can detect rectangle clusters. Because of inappropriate parameter selection, DBSCAN and SC do not obtain impressive clustering results, and neither do k-means nor DPC. Therefore, CutPC can be applied to more complex situations.

From Figs. 6 to 15, we can see that CutPC performs better than the other algorithms. Moreover, CutPC can automatically detect noise and outliers.

Fig. 11. The results of 5 clustering algorithms in D6.

Fig. 12. The results of 5 clustering algorithms in D7.

4.2. Metrics for measurement

Cluster validity indexes have been used to measure the quality of the clustering results. In general, there are two approaches for clustering validity indexes: internal criteria and external criteria. Internal validity indexes evaluate the fitness of clusters produced by clustering methods based on the properties of the clusters themselves. However, external validity indexes evaluate the performance of clustering by comparing the predicted class labels with the target class labels. In this paper, we evaluate the clustering performance on real-world datasets using external criteria such as Accuracy [26], F-measure, and NMI [27].

Our first choice of evaluation criterion is accuracy [26] (ACC). For n objects xi ∈ Rj, pi and ci are the inherent category label and the predicted cluster label of xi, respectively, and the calculation formula of ACC is as follows:

ACC = Σ(i=1..n) δ(pi, map(ci)) / n   (9)

Fig. 13. The results of 5 clustering algorithms in D8.

Fig. 14. The results of 5 clustering algorithms in D9.

where map(·) is a mapping function that maps each predicted cluster label to its inherent cluster label by the Hungarian algorithm [28], and δ(a, b) equals 1 if a = b and 0 otherwise. ACC ∈ [0, 1]; in other words, the higher the value of ACC is, the better the clustering performance will be.

The second evaluation criterion is the F-measure (F1), and the formula of F1 is as follows:

F1 = (2 ∗ P ∗ R) / (P + R)   (10)

where P represents Precision and R represents Recall. The F1 value also ranges from 0 to 1.

Table 3
The characteristics of six real-world datasets.
Datasets      Number of instances  Number of attributes  Number of clusters
Pageblock     5473    10  5
Htru2         17 898  8   2
Thyroid       215     5   3
BreastCancer  699     9   2
Banknote      1372    4   2
Ionosphere    351     34  2
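The following Python sketch shows one way to compute the ACC of Eq. (9), using SciPy's Hungarian solver for the map(·) step, together with the NMI recalled later in Eq. (11) via scikit-learn; the F-measure is omitted here because it additionally requires a precision/recall convention. Function names and the label encoding are our assumptions, not the authors' code.

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # contingency table: rows are predicted clusters, columns are true classes
    w = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            w[i, j] = np.sum((y_pred == c) & (y_true == t))
    rows, cols = linear_sum_assignment(-w)     # Hungarian mapping, maximizing matches
    return w[rows, cols].sum() / len(y_true)   # Eq. (9)

# NMI, Eq. (11): normalized_mutual_info_score(y_true, y_pred)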

Fig. 15. The results of 5 clustering algorithms in D10.

Table 4
The parameter setting of each clustering algorithm in six real-world datasets.
Datasets k-means DBSCAN DPC SC DCore CutPC
Pageblock N = 5 Eps = 1 dc = 2% k = 10 σ = 1 r1 = 0.2 r2 = 0.15 R = 0.2 −
Minpts = 8 N = 5 T1 = 5 Tn = 2
Htru2 N = 2 Eps = 2 dc = 2% k = 10 σ = 1 r1 = 5 r2 = 1 R = 15 −
Minpts = 6 N = 2 T1 = 10 Tn = 5
Thyroid N = 3 Eps = 2 dc = 2% k = 10 σ = 0.5 r1 = 5 r2 = 1 R = 15 −
Minpts = 4 N = 3 T1 = 10 Tn = 5
BreastCancer N = 2 Eps = 0.5 dc = 2% k = 5 σ = 1 r1 = 1 r2 = 5 R = 5 α = 0.55
Minpts = 4 N = 2 T1 = 10 Tn = 5
Banknote N = 2 Eps = 0.5 dc = 2% k = 5 σ = 1 r1 = 1 r2 = 5 R = 5 −
Minpts = 4 N = 2 T1 = 10 Tn = 5
Ionosphere N = 2 Eps = 1 dc = 2% k = 10 σ = 0.5 r1 = 1 r2 = 5 R = 1 −
Minpts = 6 N = 2 T1 = 10 Tn = 5

Finally, Normalized Mutual Information (NMI) [27] is a well-known metric to evaluate clustering algorithms. This measure employs information theory to quantify the differences between two clustering partitions, and it is defined as follows:

NMI = 2 ∗ I(p, c) / (H(p) + H(c))   (11)

where I(p, c) is the mutual information between p and c, H is the entropy of a random variable, and the value of NMI is also in the range of [0, 1].

4.3. Experiments on real-world data sets

To further demonstrate the superiority of CutPC, we test the performance by using several benchmark real-world datasets, including Pageblock, Htru2, Thyroid, BreastCancer, Banknote, and Ionosphere, obtained from the University of California, Irvine (UCI) machine learning repository [29]. The details of those datasets are given in Table 3. Table 4 illustrates the parameter setting of each clustering algorithm on the six real-world datasets of UCI. It is worth noting that CutPC can obtain a better result on the BreastCancer dataset when the tuning coefficient α is 0.55.

Considering that the selection of the initial centers of the k-means algorithm is random, multiple experiments with given parameters may result in relatively different results. The same goes for spectral clustering. Therefore, for fairness, we run all the methods 20 times on all real datasets. The means and standard deviations (in parentheses) of Accuracy, F-measure, and NMI (20 runs) generated by the different methods on the six real-world datasets with given parameters are shown in Tables 5 to 7. In these tables, the best results are boldfaced, and the second-best results are indicated by the star (∗) notation.

To reflect the experimental results on real-world datasets more intuitively, we draw histograms of the Accuracy, F-measure, and NMI indexes of the different clustering algorithms for the six real-world datasets, as shown in Figs. 16 to 18.

In terms of Accuracy, as shown in Fig. 16, it is obvious that although CutPC does not have much of a gap compared to the other algorithms on the Pageblock dataset, the second-best results are achieved on the Thyroid and Ionosphere datasets, and the best results are obtained on the Htru2, BreastCancer, and Banknote datasets. In the aspect of the F-measure, as displayed in Fig. 17, we clearly find that CutPC obtains the best results in all cases except that it achieves the second-best result on the Banknote dataset. Considering the NMI aspect, as demonstrated in Fig. 18, it is apparent that CutPC achieves the best results on all datasets.

Table 5
The means and standard deviations(in parentheses) of Accuracy (20 runs) generated by different methods on six real-world datasets with given parameters.
Datasets Kmeans DBSCAN DPC SC Dcore CutPC
Pageblock 0.8998(0.000) 0.8990∗ (0.000) 0.8986(0.000) 0.8977(0.000) 0.8977(0.000) 0.8979(0.000)
Htru2 0.9084∗ (0.000) 0.9084∗ (0.000) 0.9084∗ (0.000) 0.9084∗ (0.000) 0.9084∗ (0.000) 0.9086(0.000)
Thyroid 0.7986(0.047) 0.6977(0.000) 0.7209(0.000) 0.7121(0.018) 0.9163(0.000) 0.8000∗ (0.000)
BreastCancer 0.9585∗ (0.000) 0.9313(0.000) 0.6738(0.000) 0.6552(0.000) 0.9299(0.000) 0.9599(0.000)
Banknote 0.6122(0.000) 0.7813(0.000) 0.7413(0.000) 0.8231∗ (0.001) 0.6436(0.000) 0.9308(0.000)
Ionosphere 0.7011(0.025) 0.8319(0.000) 0.6410(0.000) 0.6410(0.000) 0.8120(0.000) 0.8234∗ (0.000)

Table 6
The means and standard deviations(in parentheses) of F-measure (20 runs) generated by different methods on six real-world datasets with given parameters.
Datasets Kmeans DBSCAN DPC SC Dcore CutPC
Pageblock 0.8132(0.014) 0.8602∗ (0.000) 0.8602∗ (0.000) 0.5279(0.051) 0.8595(0.000) 0.8647(0.000)
Htru2 0.8057(0.000) 0.8747(0.000) 0.8799(0.000) 0.8802∗ (0.000) 0.8251(0.000) 0.8828(0.000)
Thyroid 0.7542∗ (0.088) 0.5985(0.000) 0.6678(0.000) 0.5839(0.031) 0.5574(0.000) 0.7682(0.000)
BreastCancer 0.9584∗ (0.000) 0.9325(0.000) 0.6942(0.000) 0.6046(0.051) 0.8957(0.000) 0.9588(0.000)
Banknote 0.6026(0.000) 0.5363(0.000) 0.7311(0.000) 0.8236(0.001) 0.6260(0.000) 0.7648∗ (0.000)
Ionosphere 0.7127(0.010) 0.7456(0.000) 0.7813∗ (0.000) 0.5877(0.059) 0.6316(0.000) 0.8062(0.000)

Table 7
The means and standard deviations(in parentheses) of NMI (20 runs) generated by different methods on six real-world datasets with given parameters.
Datasets Kmeans DBSCAN DPC SC Dcore CutPC
Pageblock 0.0554(0.003) 0.0245(0.000) 0.0280(0.000) 0.0649∗ (0.006) 0.0492(0.000) 0.0709(0.000)
Htru2 0.0265∗ (0.000) 0.0053(0.000) 6.3227e−06(0.000) 2.4445e−04(0.000) 0.0203(0.000) 0.0858(0.000)
Thyroid 0.3323(0.099) 0.2224(0.000) 0.1021(0.000) 0.0989(0.027) 0.3482∗ (0.000) 0.3832(0.000)
BreastCancer 0.7361∗ (0.000) 0.6931(0.000) 0.0547(0.000) 0.0265(0.024) 0.5406(0.000) 0.7455(0.000)
Banknote 0.0303(0.000) 0.1924(0.000) 0.3464∗ (0.000) 0.3275(0.001) 0.0671(0.000) 0.4566(0.000)
Ionosphere 0.1151(0.046) 0.2950∗ (0.000) 6.8026e−16(0.000) 0.0080(0.007) 0.1918(0.000) 0.3810(0.000)

Fig. 16. The Accuracy of different clustering algorithms for the six real-world datasets.

From the above analysis, we conclude that CutPC provides an overall good performance in clustering compared to the other clustering algorithms.

4.4. Runtime

In this part, the algorithms are compared based on their time performance on the synthetic and real-world datasets. The consuming time (in seconds) of the six clustering methods on the synthetic and real-world datasets is shown in Tables 8 and 9. The results on the synthetic and real-world datasets all show that although CutPC requires more computational resources than k-means and DBSCAN, it apparently runs faster than DPC and SC. Moreover, the computational resources that CutPC requires are similar to those of DCore.

Table 8
The consuming time (in seconds) of the six clustering methods on the synthetic datasets.
Datasets Kmeans DBSCAN DPC SC Dcore CutPC
D1 0.080 0.035 2.303 0.439 0.34 0.242
D2 0.066 0.026 2.114 0.182 0.129 0.167
D3 0.073 0.066 4.877 2.263 0.414 0.449
D4 0.112 0.239 38.196 30.854 1.523 1.595
D5 0.096 0.046 3.723 0.893 0.310 0.334
D6 0.097 0.037 3.018 0.907 0.485 0.277
D7 0.121 0.095 10.437 4.695 0.752 0.604
D8 0.098 0.078 4.808 1.854 0.391 0.430
D9 0.138 0.859 341.659 430.447 20.665 6.104
D10 0.130 0.919 283.578 267.695 5.574 7.133

Table 9
The consuming time (in seconds) of the six clustering methods on the real-world datasets.
Datasets Kmeans DBSCAN DPC SC Dcore CutPC
Pageblock 0.138 0.358 73.629 109.610 1.832 3.116
Htru2 0.569 3.705 2239.426 4609.402 159.271 62.167
Thyroid 0.089 0.026 2.122 0.313 0.118 0.133
BreastCancer 0.095 0.025 1.963 0.310 0.205 0.347
Banknote 0.100 0.045 3.602 1.636 0.655 0.436
Ionosphere 0.094 0.022 2.290 0.907 0.192 0.164

Fig. 17. The F-measure index of different clustering algorithms for the six real-world datasets.

Fig. 18. The NMI index of different clustering algorithms for the six real-world datasets.

5. Conclusions and future work

In this paper, we propose a novel clustering algorithm named CutPC. CutPC consists of three main steps. The first step constructs a natural neighbor graph using natural neighbors. Unlike the traditional k-nearest neighbor graph, the natural neighbor graph does not need any parameters to be set to achieve a natural neighbor stable structure, whose main idea is that noise objects lie in low-energy regions. In the second step, we define the reverse density to reflect the energy of each object, where the lower the reverse density value is, the higher the energy of the object, and vice versa. Additionally, the critical reverse density is obtained from statistical characteristics to determine the noise objects. Then, the noise cutting process in the natural neighbor graph is completed. In the last step, we identify the connected components and set them as clusters. The proposed clustering algorithm can automatically identify arbitrary-shape clusters and detect noise objects without prior parameter settings. To evaluate the superiority of the proposed method, extensive experiments are conducted on both synthetic and real-world datasets. The results prove the effectiveness and robustness of the proposed method compared with other algorithms.

However, the proposed method does not perform well when the datasets have large variations in density. In addition, it cannot escape the curse of dimensionality because the Euclidean distance is used to measure the energy of each point. Future research will improve the performance of the CutPC algorithm on high-dimensional datasets with different densities.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to thank the editor and anonymous reviewers for their valuable comments and suggestions. This work is funded by the National Natural Science Foundation of China (no. 51608070) and the Fundamental Research Funds for the Central Universities, China (no. 2019CDCGJSJ329).

References

[1] R. Liu, H. Wang, X. Yu, Shared-nearest-neighbor-based clustering by fast search and find of density peaks, Inform. Sci. 450 (2018) 200–226, http://dx.doi.org/10.1016/j.ins.2018.03.031.
[2] P.N. Tan, Introduction to Data Mining, Pearson Education India, 2006.
[3] R. Janani, S. Vijayarani, Text document clustering using spectral clustering algorithm with particle swarm optimization, Expert Syst. Appl. 134 (2019) 192–200, https://doi.org/10.1016/j.eswa.2019.05.030.
[4] K. McGarry, Discovery of functional protein groups by clustering community links and integration of ontological knowledge, Expert Syst. Appl. 40 (2013) 5101–5112, http://dx.doi.org/10.1016/j.eswa.2013.03.027.
[5] S. Nanda, G. Panda, A survey on nature inspired metaheuristic algorithms for partitional clustering, Swarm Evol. Comput. 16 (2014) 1–18, http://dx.doi.org/10.1016/j.swevo.2013.11.003.
[6] A.K. Jain, Data clustering: 50 years beyond k-means, Pattern Recognit. Lett. 31 (8) (2010) 651–666.
[7] F. Murtagh, P. Contreras, Algorithms for hierarchical clustering: an overview, Wiley Interdiscip. Rev.: Data Min. Knowl. Discov. 2 (2012) 86–97, http://dx.doi.org/10.1002/widm.53.
[8] S. Zhou, Z. Xu, F. Liu, Method for determining the optimal number of clusters based on agglomerative hierarchical clustering, IEEE Trans. Neural Netw. Learn. Syst. 28 (2017) 3007–3017.
[9] A. Rodriguez, A. Laio, Clustering by fast search and find of density peaks, Science 344 (6191) (2014) 1492–1496.
[10] Study on density peaks clustering based on k-nearest neighbors and principal component analysis, Knowl.-Based Syst. 99 (2016) 135–145, http://dx.doi.org/10.1016/j.knosys.2016.02.001.
[11] R. Liu, H. Wang, X. Yu, Shared-nearest-neighbor-based clustering by fast search and find of density peaks, Inform. Sci. 450 (2018) 200–226, http://dx.doi.org/10.1016/j.ins.2018.03.031.
[12] M. Ester, H.P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: International Conference on Knowledge Discovery and Data Mining, Vol. 96, 1996, pp. 226–231.
[13] H.P. Kriegel, P. Kröger, J. Sander, A. Zimek, Density-based clustering, Wiley Interdiscip. Rev.: Data Min. Knowl. Discov. 1 (2011) 231–240, http://dx.doi.org/10.1002/widm.30.
[14] A.Y. Ng, M.I. Jordan, Y. Weiss, On spectral clustering: analysis and an algorithm, Adv. Neural Inf. Process. Syst. (2002) 849–856.
[15] O. Grygorash, Y. Zhou, Z. Jorgensen, Minimum spanning tree based clustering algorithms, in: 18th International Conference on Tools with Artificial Intelligence (ICTAI), 2006, pp. 73–81, http://dx.doi.org/10.1109/ICTAI.2006.83.
[16] A. Aksac, T. Özyer, R. Alhajj, CutESC: Cutting edge spatial clustering technique based on proximity graphs, Pattern Recognit. 96 (2019), http://dx.doi.org/10.1016/j.patcog.2019.06.014.
[17] M. Deng, Q. Liu, T. Cheng, Y. Shi, An adaptive spatial clustering algorithm based on Delaunay triangulation, Comput. Environ. Urban Syst. 35 (2011) 320–332, http://dx.doi.org/10.1016/j.compenvurbsys.2011.02.003.
[18] Y. Kim, H. Do, S. Kim, Outer-points shaver: Robust graph-based clustering via node cutting, Pattern Recognit. 97 (2019), http://dx.doi.org/10.1016/j.patcog.2019.107001.
[19] S. Johnson, Hierarchical clustering schemes, Psychometrika 32 (1967) 241–254.
[20] J. Xie, Z. Xiong, Y. Zhang, Y. Feng, J. Ma, Density core-based clustering algorithm with dynamic scanning radius, Knowl.-Based Syst. 142 (2018) 58–70.
[21] Y. Chen, S. Tang, L. Zhou, C. Wang, J. Du, T. Wang, S. Pei, Decentralized clustering by finding loose and distributed density cores, Inform. Sci. 433–434 (2018) 510–526, http://dx.doi.org/10.1016/j.ins.2016.08.009.
[22] Q. Zhu, J. Feng, A self-adaptive neighborhood method without parameter k, Pattern Recognit. Lett. 80 (2016) 30–36, http://dx.doi.org/10.1016/j.patrec.2016.05.007.
[23] J. Huang, Q. Zhu, L. Yang, J. Feng, A non-parameter outlier detection algorithm based on natural neighbor, Knowl.-Based Syst. 92 (2016) 71–77, http://dx.doi.org/10.1016/j.knosys.2015.10.014.
[24] G. Bounova, O. Weck, Overview of metrics and their correlation patterns for multiple-metric topology analysis on heterogeneous graph ensembles, Phys. Rev. E 85 (2012), http://dx.doi.org/10.1103/PhysRevE.85.016117.
[25] J. Bentley, Multidimensional binary search trees used for associative searching, Commun. ACM 18 (9) (1975) 509–517.
[26] M. Wu, B. Schölkopf, A local learning approach for clustering, Proc. Adv. Neural Inf. Process. Syst. (2006) 1529–1536.
[27] N.X. Vinh, J. Epps, J. Bailey, Information theoretic measures for clusterings comparison, in: International Conference on Machine Learning, 2009, pp. 1073–1080.
[28] C. Papadimitriou, K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Courier Dover Publications, 1998.
[29] M. Lichman, UCI machine learning repository, 2013, http://archive.ics.uci.edu/ml.
