A Novel Graph-Based Clustering Method Using Noise Cutting

Information Systems 91 (2020) 101504

Abstract

Recently, many methods have appeared in the field of cluster analysis. Most existing clustering algorithms have considerable limitations in dealing with local and nonlinear data patterns. Algorithms based on graphs provide good results for this problem. However, some widely used graph-based clustering methods, such as spectral clustering algorithms, are sensitive to noise and outliers. In this paper, a cut-point clustering algorithm (CutPC) based on a natural neighbor graph is proposed. The CutPC method performs noise cutting when a cut-point value is above the critical value. Normally, the method can automatically identify clusters with arbitrary shapes and detect outliers without any prior knowledge or preparatory parameter settings. The user can also adjust a coefficient to adapt clustering solutions for particular problems better. Experimental results on various synthetic and real-world datasets demonstrate the obvious superiority of CutPC compared with k-means, DBSCAN, DPC, SC, and DCore.

Keywords: Graph-based clustering; Natural neighbors; Noise cutting
Although it can detect clusters with arbitrary shapes and outliers, the performance of DBSCAN depends on the scanning radius Eps and the density threshold MinPts determined by the user.

Traditional clustering methods have considerable limitations in dealing with local and nonlinear data patterns. To overcome this issue, many graph-based clustering algorithms have been proposed [14–18]. In graph-based clustering methods, the dataset is considered a graph G = (V, E) in which the vertex set (V) represents the objects, and the edge set (E) represents the similarity between objects. However, spectral clustering (SC) [14], the most widely used graph-based clustering method, is vulnerable to noise and outliers [18].

To compensate for the above-mentioned deficiencies, in this paper we consider clusters as connected regions with high energy. To determine the high-energy connected regions, we propose a cut-point clustering method named CutPC based on a natural neighbor graph. First, we construct a sparse graph using the concept of natural neighbors. Then, the algorithm partitions the graph into subgroups by cutting points whose cut-point values are above the critical value. Last, the other objects are assigned to clusters according to the connectivity of the subgroups.

The remainder of this paper is organized as follows. In Section 2, a brief overview of the natural neighbor stable structure is presented. The process of the proposed clustering algorithm and its computational complexity are presented in Section 3. The experimental results on some synthetic and real-world datasets are demonstrated in Section 4. Finally, a summary of this paper and future work are presented in Section 5.

2. Related work

The Natural Neighbor Stable Structure [22], a new concept coming from objective reality, is inspired by interpersonal relationships in human society, namely that the number of one's true friends should be the number of people taking him/her as a friend and considered by him/her as a friend at the same time. The key idea of the Natural Neighbor Stable Structure is that objects lying in a sparse region have low energy, whereas objects lying in a dense region have high energy [23]. Based on the above statement, the Natural Neighbor Stable Structure of data objects can be formulated as follows:

(∀xi)(∃xj)(k ∈ N) ∧ (i ≠ j) → (xi ∈ NNk(xj) ∧ xj ∈ NNk(xi))    (1)

where NNk(xi) is the kth nearest neighbor of object xi. The definitions of k nearest neighbors and reverse neighbors can be given as follows:

Definition 1 (k Nearest Neighbors). The k nearest neighbors of point xi are a set of points x in D with d(xi, x) ≤ d(xi, o), which is:

NNk(xi) = { xj ∈ D | d(xi, xj) ≤ d(xi, o) }    (2)

where o is the kth nearest neighbor of object xi.

Definition 2 (Reverse Neighbors). The reverse neighbors of point xi are the set of points x that consider xi as one of their k nearest neighbors, which is:

RNN(xi) = { x ∈ D | xi ∈ NNk(x) }    (3)

The formation of the Natural Neighbor Stable Structure is achieved as follows: continuously expand the neighbor searching range k, and compute the number of reverse neighbors of each object at the same time. The stopping criteria of the iteration are that all objects have reverse neighbors or that the number of objects without reverse neighbors does not change anymore. At this time, the search range k is the natural neighbor characteristic value λ [23]. Therefore, λ is defined as follows:

λ = min{ k | ∑i=1..n f(Nbk(xi)) = 0  or  ∑i=1..n f(Nbk(xi)) = ∑i=1..n f(Nbk−1(xi)) }    (4)

where k is initialized with 1, Nbk(xi) is the number of object xi's reverse neighbors in the kth iteration, and f(x) is defined as follows:

f(x) = { 1, if x == 0; 0, otherwise }

The details of this method are presented in Algorithm 1, and the natural neighbors of object xi are defined as follows:

Definition 3 (Natural Neighbors). For each object x, the natural neighbors are the k nearest neighbors where k is equal to the natural characteristic λ, denoted as NaN(x).

Obviously, natural neighbors are different from traditional k nearest neighbors, and the whole computational procedure of the Natural Neighbor Stable Structure can be fulfilled automatically without any parameter.

3. Proposed clustering method

To effectively identify the clusters, which are considered connected regions with high energy, the proposed clustering method consists of three main steps. The first step constructs a sparse graph using natural neighbors. In this graph, all the observations are represented by nodes, with edges that connect each observation to its natural neighbors. Second, we detect the noise points (Definition 7) and perform the noise cutting (Definition 8). Finally, the subgraphs formed after the above operations are clustered by identifying the connected components. Each connected component is determined to be a cluster. The details of the proposed clustering algorithm are shown in Algorithm 2.

3.1. Constructing the natural neighbor graph

The first step of the proposed method is to represent the data by a graph structure. Representing a dataset as a graph is useful for clustering local and nonlinear patterns. There are many methods for constructing a graph from a given dataset: the fully connected graph, the ε-connected graph, and the k-nearest neighbor graph. In a fully connected graph, any two points have an edge between them, and all the edge weights are calculated using the Gaussian kernel function. However, this method not only increases the time complexity but also requires manual determination of the variance parameter δ of the Gaussian kernel. An ε-connected graph connects all points whose pairwise distance is smaller than ε. Apparently, the appropriate ε value is hard to determine. In a k-nearest neighbor graph, an edge between two objects is created if and only if one belongs to the k-nearest neighbor set of the other. However, k must be determined manually. Considering the uncontrollable factors of the above methods, we use the concept of natural neighbors to construct a graph named the natural neighbor graph. The definition of a natural neighbor graph is as follows:

Definition 4 (Natural Neighbor Graph). A natural neighbor-based graph with n nodes is constructed as follows. An edge eij between nodes xi and xj is defined as:

eij = { 1, if xi ∈ NaN(xj) or xj ∈ NaN(xi); 0, otherwise }    (5)
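As a reading aid for Definitions 1 and 2 (not part of the original paper), the short Python sketch below computes the k-nearest-neighbor sets NNk(xi) and the reverse-neighbor sets RNN(xi) with a brute-force distance matrix; the function names are illustrative only.

import numpy as np

def knn_sets(X, k):
    """NN_k(x_i) for every object: indices of its k nearest neighbors (Definition 1)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    np.fill_diagonal(dist, np.inf)                                # exclude the point itself
    return [set(np.argsort(row)[:k]) for row in dist]

def reverse_neighbor_sets(X, k):
    """RNN(x_i): objects that count x_i among their own k nearest neighbors (Definition 2)."""
    nn = knn_sets(X, k)
    rnn = [set() for _ in range(len(X))]
    for i, neighbors in enumerate(nn):
        for j in neighbors:
            rnn[j].add(i)       # x_i considers x_j a neighbor, so x_i is a reverse neighbor of x_j
    return rnn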
Algorithm 1 NaN-searching
Input: D(The data set)
Output: λ, NaN
1: Initialize k = 1, Nb(xi) = 0, NNk(xi) = ∅, RNNk(xi) = ∅;
2: create a KD-tree T from the data set D;
3: while true do
4: for each object xi in D do
5: Find the kth nearest neighbor xj of xi using T;
6: Nb(xj) = Nb(xj) + 1;
7: NNk(xi) = NNk−1(xi) ∪ {xj};
8: end for
9: Nbk = Nb;
10: Find the number of objects with Nb(xi) == 0, which is denoted by Num;
11: if Num does not change any more or Num==0 then
12: break;
13: end if
14: k = k + 1;
15: end while
16: λ = k;
17: for each object xi in D do
18: NaN (xi ) = NNλ (xi );
19: end for
20: Return λ, NaN
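The following Python sketch is an illustrative transcription of Algorithm 1, not the authors' released code; it uses SciPy's cKDTree for the KD-tree of step 2, and the name nan_searching is ours.

import numpy as np
from scipy.spatial import cKDTree

def nan_searching(X):
    """Natural-neighbor search (Algorithm 1): returns lambda and the NaN set of each object."""
    n = len(X)
    tree = cKDTree(X)
    nb = np.zeros(n, dtype=int)          # Nb(x_i): reverse-neighbor counts
    nn = [set() for _ in range(n)]       # NN_k(x_i), grown by one neighbor per round
    prev_num = -1
    k = 1
    while True:
        for i in range(n):
            # query k+1 points because the closest "neighbor" of x_i is x_i itself
            _, idx = tree.query(X[i], k=k + 1)
            j = idx[-1]                  # the kth nearest neighbor found in this round
            nb[j] += 1
            nn[i].add(j)
        num = int(np.sum(nb == 0))       # objects still without reverse neighbors
        if num == 0 or num == prev_num:  # stopping criteria of Eq. (4)
            break
        prev_num = num
        k += 1
    lam = k                              # natural neighbor characteristic value
    nan = {i: nn[i] for i in range(n)}   # NaN(x_i) = NN_lambda(x_i) (Definition 3)
    return lam, nan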
Algorithm 2 CutPC
Input: D(The data set)
Output: CL(the results of clustering)
1: Initialize CL = −1 for all objects;
2: [λ, NaN] = NaN-searching(D);
3: Construct the natural neighbor graph G according to Definition 4;
4: NS = NS-finding(D);
5: for each object p in NS do
6: CL (p) = 0;
7: end for
8: Perform noise cutting in graph G according to Definition 8;
9: CL = Assigning(G);
10: Return CL
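To show how the pieces of Algorithm 2 fit together, here is a hedged Python sketch of the overall flow (an illustrative reconstruction, not the authors' MATLAB implementation): it builds the natural neighbor graph of Definition 4 and Eq. (5), removes the noise points and their incident edges (Definition 8), and labels each remaining connected component as one cluster. The helper ns_finding is sketched after Algorithm 3 below; cut_pc and build_nn_graph are hypothetical names.

from collections import deque

def build_nn_graph(nan):
    """Adjacency lists of the natural neighbor graph: e_ij = 1 iff x_i in NaN(x_j) or x_j in NaN(x_i) (Eq. (5))."""
    adj = {i: set() for i in nan}
    for i, neighbors in nan.items():
        for j in neighbors:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def cut_pc(X, alpha=None):
    """Sketch of Algorithm 2: noise objects get label 0, clusters get labels 1, 2, ..."""
    lam, nan = nan_searching(X)
    adj = build_nn_graph(nan)
    noise = ns_finding(X, nan, alpha)            # noise point set, see the sketch after Algorithm 3
    for p in noise:                              # noise cutting (Definition 8)
        for q in adj[p]:
            adj[q].discard(p)
        adj[p] = set()
    labels = [0 if i in noise else -1 for i in range(len(X))]
    cluster = 0
    for start in range(len(X)):                  # label connected components via BFS
        if labels[start] != -1:
            continue
        cluster += 1
        labels[start] = cluster
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if labels[v] == -1:
                    labels[v] = cluster
                    queue.append(v)
    return labels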
Table 1
The characteristics of the ten synthetic datasets.
Datasets Number of instances Number of attributes Number of clusters
D1 513 2 5
D2 420 2 2
D3 1532 2 2
D4 3883 2 3
D5 1114 2 4
D6 1064 2 2
D7 1915 2 6
D8 1427 2 4
D9 8533 2 7
D10 8000 2 6
Definition 7 (Noise Point). If an object is to be considered a noise point, it must satisfy the following formula:

NOISE = { x | ∀x ∈ D, τ(x) > θ }    (8)

The proposed approach for determining the noise points is described in Algorithm 3. Based on the noise points identified above, we conduct noise cutting, which is defined as follows:

Definition 8 (Noise Cutting). The noise point and the edges connected to this point are all removed from the natural neighbor graph.

The result after the noise cutting operation is shown in Fig. 3. We find that five connected subgraphs have been formed here.

We assume that n is the total number of points in the dataset. The time complexity of the proposed method depends on the following: (1) According to the natural neighbor algorithm optimized by a KD-tree [25], the time complexity for finding the natural neighbors is O(n log n). Therefore, the time complexity for constructing the natural neighbor graph is O(n log n). (2) The time complexity for computing the reverse density of each point and finding the noise points through the critical reverse density is O(n). (3) Assigning the remaining points to clusters can be solved by a linear-time algorithm. In summary, the overall complexity of the proposed clustering method is O(n log n).
Algorithm 3 NS-finding
Input: D(The data set)
Output: NS(the noise point set)
1: Initialize NS = ∅; τ (x) = ∅
2: [λ, NaN] = NaN-searching(D);
3: for each object xi in D do
4: Calculate the reverse density τ (xi ) using Eq. (6);
5: τ (x) = τ (x) ∪ τ (xi )
6: end for
7: Calculate the critical reverse density θ using Eq. (7);
8: for each object xi in D do
9: if the object xi satisfies Eq. (8) then
10: NS = NS ∪ {xi }
11: end if
12: end for
13: Return NS
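Eqs. (6) and (7) are not reproduced in this excerpt, so the sketch below only mirrors the structure of Algorithm 3 under explicit assumptions: the reverse density τ(x) is approximated here by the mean distance from x to its natural neighbors, and the critical value θ by the mean of τ plus α times its standard deviation (with an arbitrary default coefficient). These stand-ins, and the function name ns_finding, are ours, not the paper's.

import numpy as np

def ns_finding(X, nan, alpha=None):
    """Sketch of Algorithm 3: return the set of noise points satisfying tau(x) > theta (Eq. (8)).

    ASSUMPTION: tau and theta below are placeholders standing in for the paper's
    Eqs. (6) and (7), which are not shown in this excerpt.
    """
    alpha = 1.0 if alpha is None else alpha      # arbitrary default tuning coefficient
    tau = np.zeros(len(X))
    for i, neighbors in nan.items():
        if neighbors:                            # objects with no natural neighbors keep tau = 0 here
            tau[i] = np.mean([np.linalg.norm(X[i] - X[j]) for j in neighbors])
    theta = tau.mean() + alpha * tau.std()       # critical reverse density (placeholder for Eq. (7))
    return {i for i in range(len(X)) if tau[i] > theta}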
4. Experimental analysis

In this section, a set of experiments is conducted to evaluate the performance of the proposed method, and we compare it to well-known and state-of-the-art clustering methods including k-means [6], DBSCAN [12], DPC [9], SC [14] and DCore [21]. The experiments are performed on a PC with an Intel i5-8250U CPU, 8 GB RAM, Windows 10 64-bit OS, and the MATLAB 2018a programming environment. The code of CutPC is available for download from https://github.com/lintao6/CutPC.

4.1. Experiments on synthetic datasets

In this part, we use ten synthetic datasets to evaluate the performance of our method. The characteristics of the synthetic datasets are described in Table 1. The ten original synthetic datasets are displayed in Fig. 5.
Table 2
The parameter settings of each clustering method in the ten synthetic datasets.
Datasets k-means DBSCAN DPC SC DCore CutPC
D1 N = 5 Eps = 10 dc = 2% k = 10 σ = 1 r1 = 10 r2 = 5 R = 10 −
Minpts = 5 N = 5 T1 = 10 Tn = 5
D2 N = 2 Eps = 0.25 dc = 2% k = 10 σ = 1 r1 = 1 r2 = 0.5 R = 1 −
Minpts = 5 N = 2 T1 = 30 Tn = 5
D3 N = 2 Eps = 0.1 dc = 2% k = 10 σ = 1 r1 = 0.3 r2 = 0.1 R = 0.2 −
Minpts = 8 N = 2 T1 = 20 Tn = 5
D4 N = 3 Eps = 0.3 dc = 2% k = 10 σ = 1 r1 = 0.25 r2 = 0.1 R = 0.5 −
Minpts = 10 N = 3 T1 = 30 Tn = 10
D5 N = 4 Eps = 10 dc = 2% k = 20 σ = 5 r1 = 10 r2 = 10 R = 15 −
Minpts = 5 N = 4 T1 = 30 Tn = 10
D6 N = 2 Eps = 0.3 dc = 2% k = 3 σ = 0.15 r1 = 1 r2 = 1.5 R = 1.1 −
Minpts = 5 N = 2 T1 = 30 Tn = 5
D7 N = 6 Eps = 8 dc = 2% k = 5 σ = 0.4 r1 = 10 r2 = 14 R = 14 −
Minpts = 5 N = 6 T1 = 30 Tn = 10
D8 N = 4 Eps = 0.25 dc = 2% k = 10 σ = 10 r1 = 0.35 r2 = 0.3 R = 0.5 −
Minpts = 4 N = 4 T1 = 10 Tn = 3
D9 N = 7 Eps = 1 dc = 2% k = 10 σ = 0.15 r1 = 1 r2 = 0.9 R = 2 −
Minpts = 10 N = 7 T1 = 30 Tn = 10
D10 N = 6 Eps = 6 dc = 2% k = 12 σ = 0.15 r1 = 15 r2 = 15 R = 15 α = 0.5
Minpts = 5 N = 6 T1 = 35 Tn = 15
D1 and D2 contain spherical clusters with different numbers of clusters. D1 consists of five clusters with 513 objects, including some noise objects. D2 contains two clusters with 420 objects, including noise objects. By contrast, the remaining datasets contain clusters with arbitrary shapes. D3 is composed of two moon manifold clusters with 1532 objects, including noise objects. D4 consists of three circle clusters with some noise objects and a total of 3883 objects. D5 contains one circle cluster and three spherical clusters, including noise objects, for a total of 1114 objects. D6 includes two spiral clusters with noise objects, for a total of 1064 objects. D7 consists of four spherical clusters in two right-angle line clusters with some noise objects, for a total of 1915 objects. D8 contains three spherical clusters and one manifold cluster with 1427 objects, including noise objects. D9 consists of three circle clusters, two spiral clusters and two spherical clusters with a total of 8533 objects, including noise objects. D10 contains four right-angle line clusters, one line cluster and one spherical cluster with a total of 8000 objects, including some noise objects.

Table 2 illustrates the parameter settings of each clustering method on the ten synthetic datasets. For the k-means method, we need to set an initial cluster number N. DBSCAN requires two parameters to be set, Eps and Minpts. The cutoff distance dc of DPC needs to be set, and the density is obtained by the exponential kernel suggested by DPC. For SC, k needs to be set to construct a k-nearest neighbor graph, the variance σ of the Gaussian kernel needs to be set to compute the similarity between two objects, and the initial cluster number N needs to be set. In regard to DCore, the selection of the parameters r1, r2, R, T1 and Tn affects the clustering results. We test different parameter settings to obtain better results. With regard to CutPC, although we set the tuning coefficient α to 0.5 to achieve a better result on D10, generally no parameters need to be set.
Figs. 6 and 7 show that all the clustering methods offer impressive results on D1 and D2. This means that all the algorithms work well for spherical clusters. However, all approaches other than CutPC need parameters to be set.

The clustering shown in Fig. 8 demonstrates that, except for k-means and DPC, the other methods obtain a good result on D3. In fact, k-means and DPC cannot detect the nonspherical clusters. From Fig. 9, we find that DBSCAN, DCore, and CutPC perform better than the other algorithms on D4, while DCore mistakenly assigns some objects. Although k-means, DPC, and SC can find the correct clusters in a spherical shape, they cannot deal with circle clusters.

The clustering shown in Fig. 10 reveals that DBSCAN, SC, and CutPC have similar results in that all of them obtain the correct clustering on D5. Due to improper parameter choices, DCore finds three clusters. The experiment again illustrates that k-means and DPC are not suitable for circle clusters. Fig. 11 shows whether these algorithms can process datasets with spiral clusters. The results reveal that DBSCAN, SC, and CutPC obtain the correct clustering, while k-means, DPC, and DCore do not.

The clustering displayed in Fig. 12 demonstrates that DBSCAN and CutPC recognize the clusters in the D7 dataset, while k-means, DPC, SC, and DCore do not. The results of DPC and DCore are similar, but none of them obtain the correct clusters, and neither does SC. Although k-means finds the correct number of clusters, it cannot obtain the correct clustering result. The clustering shown in Fig. 13 illustrates that DCore and CutPC are
similar in that both of them obtain a correct clustering on D8, while k-means, DBSCAN, DPC, and SC do not. In fact, incorrect parameter choices lead to incorrect clustering in DBSCAN and SC. Additionally, k-means and DPC cannot deal with nonspherical clusters.

The experimental results displayed in Figs. 14 and 15 are used to evaluate the clustering performance on D9 and D10, which are more complex patterns. In Fig. 14, only the CutPC algorithm obtains the correct result. DBSCAN and DCore have similar results in that neither of them can deal with spiral clusters, nor can k-means and DPC. Although SC can detect spiral clusters, it fails to detect circle clusters. The clustering shown in Fig. 15 demonstrates that only DCore and CutPC can detect the rectangle clusters. Because of inappropriate parameter selection, DBSCAN and SC do not obtain impressive clustering results, and neither do k-means and DPC. Therefore, CutPC can be applied to more complex situations.

From Figs. 6 to 15, we can see that CutPC performs better than the other algorithms. Moreover, CutPC can automatically detect noise and outliers.
4.2. Metrics for measurement

Cluster validity indexes have been used to measure the quality of clustering results. In general, there are two approaches for clustering validity indexes: internal criteria and external criteria. Internal validity indexes evaluate the fitness of the clusters produced by clustering methods based on the properties of the clusters themselves. External validity indexes, by contrast, evaluate the performance of clustering by comparing the predicted class labels with the target class labels. In this paper, we evaluate the clustering performance on real-world datasets using external criteria such as Accuracy [26], F-measure, and NMI [27].

Our first choice of evaluation criterion is Accuracy (ACC) [26]. For the n objects xi in dataset D, let pi and ci be the inherent category label and the predicted cluster label of xi, respectively; the calculation formula of ACC is as follows:

ACC = ∑i=1..n δ(pi, map(ci)) / n    (9)

where map(·) is a mapping function that maps each predicted cluster label to its inherent cluster label by the Hungarian algorithm [28], and δ(a, b) equals 1 if a = b and 0 otherwise. ACC ∈ [0, 1]; in other words, the higher the value of ACC, the better the clustering performance.
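As an illustration of Eq. (9) (not taken from the paper's code), the sketch below computes ACC by building a contingency table and finding the best one-to-one label mapping with SciPy's linear_sum_assignment, which implements the Hungarian algorithm.

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, pred_labels):
    """ACC of Eq. (9): best one-to-one mapping between predicted and true labels (Hungarian algorithm)."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    true_ids = np.unique(true_labels)
    pred_ids = np.unique(pred_labels)
    # contingency[i, j] = number of objects with predicted label pred_ids[i] and true label true_ids[j]
    contingency = np.zeros((len(pred_ids), len(true_ids)), dtype=int)
    for p, t in zip(pred_labels, true_labels):
        contingency[np.where(pred_ids == p)[0][0], np.where(true_ids == t)[0][0]] += 1
    rows, cols = linear_sum_assignment(-contingency)   # maximise the number of matched objects
    return contingency[rows, cols].sum() / len(true_labels)

For the NMI of Eq. (11) introduced below, scikit-learn's normalized_mutual_info_score with its default arithmetic averaging computes the same 2·I(p, c)/(H(p) + H(c)) normalization.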
The second evaluation criterion is the F-measure (F1), the formula of which is as follows:

F1 = (2 ∗ P ∗ R) / (P + R)    (10)

where P and R denote precision and recall, respectively.

Table 3
The characteristics of the six real-world datasets.
Datasets Number of instances Number of attributes Number of clusters
Pageblock 5473 10 5
Htru2 17898 8 2
Thyroid 215 5 3
BreastCancer 699 9 2
Banknote 1372 4 2
Ionosphere 351 34 2
Table 4
The parameter setting of each clustering algorithm in six real-world datasets.
Datasets k-means DBSCAN DPC SC DCore CutPC
Pageblock N = 5 Eps = 1 dc = 2% k = 10 σ = 1 r1 = 0.2 r2 = 0.15 R = 0.2 −
Minpts = 8 N = 5 T1 = 5 Tn = 2
Htru2 N = 2 Eps = 2 dc = 2% k = 10 σ = 1 r1 = 5 r2 = 1 R = 15 −
Minpts = 6 N = 2 T1 = 10 Tn = 5
Thyroid N = 3 Eps = 2 dc = 2% k = 10 σ = 0.5 r1 = 5 r2 = 1 R = 15 −
Minpts = 4 N = 3 T1 = 10 Tn = 5
BreastCancer N = 2 Eps = 0.5 dc = 2% k = 5 σ = 1 r1 = 1 r2 = 5 R = 5 α = 0.55
Minpts = 4 N = 2 T1 = 10 Tn = 5
Banknote N = 2 Eps = 0.5 dc = 2% k = 5 σ = 1 r1 = 1 r2 = 5 R = 5 −
Minpts = 4 N = 2 T1 = 10 Tn = 5
Ionosphere N = 2 Eps = 1 dc = 2% k = 10 σ = 0.5 r1 = 1 r2 = 5 R = 1 −
Minpts = 6 N = 2 T1 = 10 Tn = 5
Finally, Normalized Mutual Information (NMI) [27] is a well-known metric for evaluating clustering algorithms. This measure employs information theory to quantify the difference between two clustering partitions and is defined as follows:

NMI = 2 ∗ I(p, c) / (H(p) + H(c))    (11)

where I(p, c) is the mutual information between p and c, H is the entropy of a random variable, and the value of NMI also lies in the range [0, 1].

4.3. Experiments on real-world datasets

To further demonstrate the superiority of CutPC, we test its performance on several benchmark real-world datasets, including Pageblock, Htru2, Thyroid, BreastCancer, Banknote, and Ionosphere, obtained from the University of California, Irvine (UCI) machine learning repository [29]. The details of those datasets are given in Table 3. Table 4 illustrates the parameter setting of each clustering algorithm on the six real-world datasets. It is worth noting that CutPC obtains a better result on the BreastCancer dataset when the tuning coefficient α is 0.55.

Considering that the selection of the initial centers of the k-means algorithm is random, multiple experiments with the given parameters may produce relatively different results. The same goes for spectral clustering. Therefore, for fairness, we run all the methods 20 times on all the real datasets. The means and standard deviations (in parentheses) of Accuracy, F-measure, and NMI (20 runs) generated by the different methods on the six real-world datasets with the given parameters are shown in Tables 5 to 7. In these tables, the best results are boldfaced, and the second-best results are indicated by a star (∗).

To reflect the experimental results on the real-world datasets more intuitively, we draw histograms of the Accuracy, F-measure, and NMI indexes of the different clustering algorithms for the six real-world datasets, as shown in Figs. 16 to 18. In terms of Accuracy, as shown in Fig. 16, although CutPC does not have much of a gap compared with the other algorithms on the Pageblock dataset, it achieves the second-best results on the Thyroid and Ionosphere datasets and the best results on the Htru2, BreastCancer, and Banknote datasets. In terms of the F-measure, as displayed in Fig. 17, CutPC obtains the best results in all cases except the Banknote dataset, where it achieves the second-best result. In terms of NMI, as demonstrated in Fig. 18, CutPC achieves the best results on all datasets.
Table 5
The means and standard deviations (in parentheses) of Accuracy (20 runs) generated by different methods on six real-world datasets with given parameters.
Datasets Kmeans DBSCAN DPC SC Dcore CutPC
Pageblock 0.8998(0.000) 0.8990∗ (0.000) 0.8986(0.000) 0.8977(0.000) 0.8977(0.000) 0.8979(0.000)
Htru2 0.9084∗ (0.000) 0.9084∗ (0.000) 0.9084∗ (0.000) 0.9084∗ (0.000) 0.9084∗ (0.000) 0.9086(0.000)
Thyroid 0.7986(0.047) 0.6977(0.000) 0.7209(0.000) 0.7121(0.018) 0.9163(0.000) 0.8000∗ (0.000)
BreastCancer 0.9585∗ (0.000) 0.9313(0.000) 0.6738(0.000) 0.6552(0.000) 0.9299(0.000) 0.9599(0.000)
Banknote 0.6122(0.000) 0.7813(0.000) 0.7413(0.000) 0.8231∗ (0.001) 0.6436(0.000) 0.9308(0.000)
Ionosphere 0.7011(0.025) 0.8319(0.000) 0.6410(0.000) 0.6410(0.000) 0.8120(0.000) 0.8234∗ (0.000)
Table 6
The means and standard deviations (in parentheses) of F-measure (20 runs) generated by different methods on six real-world datasets with given parameters.
Datasets Kmeans DBSCAN DPC SC Dcore CutPC
Pageblock 0.8132(0.014) 0.8602∗ (0.000) 0.8602∗ (0.000) 0.5279(0.051) 0.8595(0.000) 0.8647(0.000)
Htru2 0.8057(0.000) 0.8747(0.000) 0.8799(0.000) 0.8802∗ (0.000) 0.8251(0.000) 0.8828(0.000)
Thyroid 0.7542∗ (0.088) 0.5985(0.000) 0.6678(0.000) 0.5839(0.031) 0.5574(0.000) 0.7682(0.000)
BreastCancer 0.9584∗ (0.000) 0.9325(0.000) 0.6942(0.000) 0.6046(0.051) 0.8957(0.000) 0.9588(0.000)
Banknote 0.6026(0.000) 0.5363(0.000) 0.7311(0.000) 0.8236(0.001) 0.6260(0.000) 0.7648∗ (0.000)
Ionosphere 0.7127(0.010) 0.7456(0.000) 0.7813∗ (0.000) 0.5877(0.059) 0.6316(0.000) 0.8062(0.000)
Table 7
The means and standard deviations (in parentheses) of NMI (20 runs) generated by different methods on six real-world datasets with given parameters.
Datasets Kmeans DBSCAN DPC SC Dcore CutPC
Pageblock 0.0554(0.003) 0.0245(0.000) 0.0280(0.000) 0.0649∗ (0.006) 0.0492(0.000) 0.0709(0.000)
Htru2 0.0265∗ (0.000) 0.0053(0.000) 6.3227e−06(0.000) 2.4445e−04(0.000) 0.0203(0.000) 0.0858(0.000)
Thyroid 0.3323(0.099) 0.2224(0.000) 0.1021(0.000) 0.0989(0.027) 0.3482∗ (0.000) 0.3832(0.000)
BreastCancer 0.7361∗ (0.000) 0.6931(0.000) 0.0547(0.000) 0.0265(0.024) 0.5406(0.000) 0.7455(0.000)
Banknote 0.0303(0.000) 0.1924(0.000) 0.3464∗ (0.000) 0.3275(0.001) 0.0671(0.000) 0.4566(0.000)
Ionosphere 0.1151(0.046) 0.2950∗ (0.000) 6.8026e−16(0.000) 0.0080(0.007) 0.1918(0.000) 0.3810(0.000)
Fig. 16. The Accuracy of different clustering algorithms for the six real-world datasets.
From the above analysis, we conclude that CutPC provides an overall good clustering performance compared with the other clustering algorithms.

4.4. Runtime

In this part, the algorithms are compared based on their running time on the synthetic and real-world datasets. The running times (in seconds) of the six clustering methods on the synthetic and real-world datasets are shown in Tables 8 and 9. The results on both the synthetic and real-world datasets show that although CutPC requires more computational resources than k-means and DBSCAN, it clearly runs faster than DPC and SC. Moreover, the computational resources that CutPC requires are similar to those of DCore.
Fig. 17. The F-measure index of different clustering algorithms for the six real-world datasets.
Fig. 18. The NMI index of different clustering algorithms for the six real-world datasets.
5. Conclusions and future work

In this paper, we propose a novel clustering algorithm named CutPC. CutPC consists of three main steps. The first step constructs a natural neighbor graph using natural neighbors. Unlike the traditional k-nearest neighbor graph, the natural neighbor graph does not require any parameters to be set to achieve a natural neighbor stable structure, whose main idea is that noise objects lie in low-energy regions. In the second step, we define the reverse density to reflect the energy of each object, where the lower the reverse density value is, the higher the energy of the object, and vice versa. Additionally, the critical reverse density is obtained from statistical characteristics to determine the noise objects. Then, the noise cutting process in the natural neighbor graph is completed. In the last step, we identify the connected components and set them as clusters. The proposed clustering algorithm can automatically identify clusters of arbitrary shape and detect noise objects without prior parameter settings. To evaluate the superiority of the proposed method, extensive experiments are conducted on both synthetic and real-world datasets. The results prove the effectiveness and robustness of the proposed method compared with other algorithms.