
Accepted Manuscript

Study on Density Peaks Clustering Based on k-Nearest Neighbors and Principal Component Analysis

Mingjing Du, Shifei Ding, Hongjie Jia

PII: S0950-7051(16)00079-4
DOI: 10.1016/j.knosys.2016.02.001
Reference: KNOSYS 3421

To appear in: Knowledge-Based Systems

Received date: 11 July 2015


Revised date: 31 January 2016
Accepted date: 1 February 2016

Please cite this article as: Mingjing Du , Shifei Ding , Hongjie Jia , Study on Density Peaks Clustering
Based on k-Nearest Neighbors and Principal Component Analysis, Knowledge-Based Systems (2016),
doi: 10.1016/j.knosys.2016.02.001

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.

Study on Density Peaks Clustering Based on k-Nearest Neighbors and Principal Component Analysis

Mingjing Du 1,2, Shifei Ding 1,2, Hongjie Jia 1,2
1 School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China
2 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100090, China

Abstract: The density peaks clustering (DPC) algorithm, published in the journal Science in 2014, is a novel density-based clustering algorithm. It requires neither an iterative process nor many parameters. However, the original algorithm only takes the global structure of the data into account, which can cause many clusters to be missed. In addition, DPC does not perform well when data sets have relatively high dimension; in particular, it often generates the wrong number of clusters on real-world data sets. To overcome the first problem, we propose a density peaks clustering based on k nearest neighbors (DPC-KNN), which introduces the idea of k nearest neighbors (KNN) into DPC and offers another option for the local density computation. To overcome the second problem, we introduce principal component analysis (PCA) into the DPC-KNN model and further bring forward a method based on PCA (DPC-KNN-PCA), which preprocesses high-dimensional data. Experiments on synthetic data sets demonstrate the feasibility of our algorithms. Experiments on real-world data sets compare the proposed algorithms with the k-means algorithm and the spectral clustering (SC) algorithm in terms of accuracy. The experimental results show that our algorithms are feasible and effective.

Keywords: Data clustering, Density peaks, k nearest neighbors (KNN), Principal component analysis (PCA)
1. Introduction
Clustering, used mostly as an unsupervised learning method, is a major technique for data mining. The main aim of cluster analysis is to divide a given population into groups or clusters with common characteristics: similar objects are grouped together, while dissimilar objects belong to different clusters. Clustering is useful in exploratory pattern analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification [1]. Clustering methods are generally divided into five groups: hierarchical clustering, partitioning clustering, density-based clustering, grid-based clustering and model-based clustering [2]. Each method has its own strengths and weaknesses.
Density-based clustering [3-7] is represented by DBSCAN [3]. In density-based clustering, clusters are defined as areas of higher density than the remainder of the data set, so density-based clusters can have an arbitrary shape in the feature space. In addition, DBSCAN does not require one to specify the number of clusters in the data a priori. However, it is very sensitive to the user-defined parameter values, often producing very different clustering results on the same data set even for slightly different parameter settings [2].
Like DBSCAN and the mean-shift method [8], the density peaks clustering (DPC) algorithm [9] proposed by Rodriguez and Laio is able to detect non-spherical clusters and does not require one to specify the number of clusters. The method is robust with respect to the choice of $d_c$, its only parameter. DPC is based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities. Several studies [10-15] have built on this method.
However, DPC still has some defects. The local structure of the data is not taken into account when DPC calculates the local density. For example, DPC does not perform well when clusters have different densities, which is a very common situation in real data sets; the local density used by DPC then causes many clusters to be missed. Figure 1 shows that not all clusters can be detected with the local density of DPC: if p is small, the two clusters in the lower-left corner are detected as a single cluster, and if p is large, the two clusters near the bottom are detected as a single cluster. In this case, DPC is not able to find all the clusters.

Fig.1. DPC on clusters of different densities (panels: Ground Truth, p=2%, p=20%, p=50%)
In order to overcome this problem, we propose a novel DPC based on k nearest neighbors (DPC-KNN). The proposed method makes use of the k nearest neighbors for the local density computation.
In addition, DPC does a poor job of finding clusters in high-dimensional data and may generate the wrong number of clusters on real-world data sets. This is because many of the dimensions in high-dimensional data are often irrelevant, and these irrelevant dimensions can confuse DPC by hiding clusters in noisy data. Another reason is its overwhelming dependence on the distances between points: the two main quantities of DPC are both derived from distances, and for this reason the "curse of dimensionality" is exacerbated. As the number of dimensions in a data set increases, distance measures become increasingly meaningless [16]; additional dimensions spread out the points until, in very high dimensions, they are almost equidistant from each other. More specific details are shown in Section 4. On the basis of DPC-KNN, we further bring forward a method based on principal component analysis (DPC-KNN-PCA).
We test our algorithms on synthetic data sets to demonstrate their feasibility. To assess the performance of the proposed algorithms, we compare them with other algorithms on several UCI data sets; our algorithms achieve satisfactory results on most of them. The rest of this paper is organized as follows. In Section 2, we describe the principle of the DPC method and introduce k nearest neighbors and principal component analysis. In Section 3, we give a detailed description of DPC-KNN and DPC-KNN-PCA. In Section 4, we present experimental results on synthetic and UCI data sets and analyze the performance of the proposed algorithms. Finally, conclusions and future work are given in the last section.

2. Related works
The proposed DPC-KNN is based on DPC and KNN. The proposed DPC-KNN-PCA is based on the former theories and PCA. This section provides brief reviews of DPC, KNN, and PCA.
2.1. Density peaks clustering
Rodriguez and Laio proposed this algorithm, published in the journal Science, based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities [9]. The method utilizes two important quantities: one is the local density $\rho_i$ of each point $i$, and the other is its distance $\delta_i$ from points of higher density. The two quantities correspond to two assumptions with respect to the cluster centers: one is that cluster centers are surrounded by neighbors with a lower local density, and the other is that they have a relatively large distance to points of higher density. In the following, we describe the computation of $\rho_i$ and $\delta_i$ in more detail.
Assume that the data set is $\boldsymbol{X}_{N\times M} = [\boldsymbol{x}_1, \boldsymbol{x}_2, \cdots, \boldsymbol{x}_N]^T$, where $\boldsymbol{x}_i = [x_{1i}, x_{2i}, \cdots, x_{Mi}]$ is the vector with $M$ attributes and $N$ is the number of points. The distance matrix of the data set needs to be computed first. Let $d(\boldsymbol{x}_i, \boldsymbol{x}_j)$ denote the Euclidean distance between the points $\boldsymbol{x}_i$ and $\boldsymbol{x}_j$, as follows:

$$d(\boldsymbol{x}_i, \boldsymbol{x}_j) = \|\boldsymbol{x}_i - \boldsymbol{x}_j\|_2 \qquad (1)$$

The local density of a point $\boldsymbol{x}_i$, denoted by $\rho_i$, is defined as:

$$\rho_i = \sum_j \chi\big(d(\boldsymbol{x}_i, \boldsymbol{x}_j) - d_c\big), \qquad \chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \ge 0 \end{cases} \qquad (2)$$

where $d_c$ is a cutoff distance. $\rho_i$ is thus the number of points that are adjacent to (i.e., within $d_c$ of) point $\boldsymbol{x}_i$.

There is another local density computation in the code presented by Rodriguez and Laio. If the former is called a hard threshold, the latter can be called a soft threshold. Specifically, $\rho_i$ is defined with a Gaussian kernel function, as follows:

$$\rho_i = \sum_j \exp\left(-\frac{d(\boldsymbol{x}_i, \boldsymbol{x}_j)^2}{d_c^2}\right) \qquad (3)$$

where $d_c$ is an adjustable parameter controlling the weight degradation rate.

$d_c$ is the only variable in Formulas (2) and (3). The process of selecting $d_c$ is in effect the process of selecting the average number of neighbors of all points in the data set. In the code, $d_c$ is defined as:

$$d_c = d_{\lceil N_d \times \frac{p}{100} \rceil} \qquad (4)$$

where $N_d = \binom{N}{2}$ and $d_{\lceil N_d \times \frac{p}{100} \rceil} \in D = [d_1, d_2, \cdots, d_{N_d}]$. Here $D$ is the set of all distances between every two points in the data set, sorted in ascending order, and $N$ denotes the number of points in the data set. $\lceil N_d \times \frac{p}{100} \rceil$ is the subscript of $d_{\lceil N_d \times \frac{p}{100} \rceil}$, where $\lceil\cdot\rceil$ is the ceiling function and $p$ is a percentage.
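To make this selection concrete, the following is a minimal Python sketch of Formula (4), assuming the Euclidean distance of Formula (1); the function name `cutoff_distance` is ours, not the authors'.

```python
import math

import numpy as np
from scipy.spatial.distance import pdist


def cutoff_distance(X, p):
    """Formula (4): d_c is the ceil(N_d * p/100)-th smallest of the
    N_d = N*(N-1)/2 pairwise Euclidean distances (p is a percentage)."""
    d = np.sort(pdist(X))                    # all pairwise distances, ascending
    idx = math.ceil(len(d) * p / 100.0)      # 1-based subscript in the paper
    return d[min(max(idx, 1), len(d)) - 1]   # clamp and convert to 0-based
```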
The computation of $\delta_i$ is quite simple. The minimum distance between the point $\boldsymbol{x}_i$ and any other point with higher density, denoted by $\delta_i$, is defined as:

$$\delta_i = \begin{cases} \min_{j:\,\rho_j > \rho_i} d(\boldsymbol{x}_i, \boldsymbol{x}_j), & \text{if } \exists j \text{ s.t. } \rho_j > \rho_i \\ \max_j d(\boldsymbol{x}_i, \boldsymbol{x}_j), & \text{otherwise} \end{cases} \qquad (5)$$

Only points with both relatively high $\rho_i$ and high $\delta_i$ are considered cluster centers. Points with high $\rho_i$ and $\delta_i$ values are also called peaks, since they have higher densities than the surrounding points. A point is assigned to the same cluster as its nearest peak.
After the cluster centers have been found, DPC assigns each remaining point to the same cluster as its nearest neighbor with higher density. A representation named the decision graph is introduced to help make this decision: it is the plot of $\delta_i$ as a function of $\rho_i$ for each point. The following algorithm is a summary of DPC; a code sketch is given after the listing.
Algorithm 1. DPC algorithm.
Inputs:
The samples $\boldsymbol{X} \in \mathbb{R}^{N\times M}$
The parameter $d_c$
Outputs:
The label vector of cluster indices $\boldsymbol{y} \in \mathbb{R}^{N\times 1}$
Method:
Step 1: Calculate the distance matrix according to Formula (1)
Step 2: Calculate $\rho_i$ for each point $i$ according to Formula (2) or (3)
Step 3: Calculate $\delta_i$ for each point $i$ according to Formula (5)
Step 4: Plot the decision graph and select the cluster centers
Step 5: Assign each remaining point to the nearest cluster center
Step 6: Return $\boldsymbol{y}$
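As an illustration of Algorithm 1, below is a compact Python sketch of its steps under our reading of the paper. Since the paper selects cluster centers visually from the decision graph, the sketch substitutes an automatic surrogate (the `n_centers` points with the largest product of density and distance); that surrogate and all function names are our own assumptions, not the authors' code.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def dpc_core(D, rho, n_centers):
    """Steps 3-5: compute delta (Formula (5)), pick centers by largest rho*delta
    (a stand-in for the manual decision-graph selection), assign the rest."""
    N = len(rho)
    order = np.argsort(-rho)                        # indices by decreasing density
    delta = np.zeros(N)
    nearest_higher = np.zeros(N, dtype=int)
    delta[order[0]] = D[order[0]].max()             # densest point: max distance
    nearest_higher[order[0]] = order[0]
    for pos in range(1, N):
        i = order[pos]
        higher = order[:pos]                        # points with higher density
        j = higher[np.argmin(D[i, higher])]
        delta[i], nearest_higher[i] = D[i, j], j
    centers = np.argsort(-(rho * delta))[:n_centers]
    labels = np.full(N, -1)
    labels[centers] = np.arange(n_centers)
    if labels[order[0]] < 0:                        # safeguard: attach densest point
        labels[order[0]] = labels[centers[np.argmin(D[order[0], centers])]]
    for i in order:                                 # decreasing-density order, so the
        if labels[i] < 0:                           # higher-density neighbor is labeled
            labels[i] = labels[nearest_higher[i]]
    return labels


def dpc(X, dc, n_centers):
    """Algorithm 1 with the Gaussian density of Formula (3)."""
    D = squareform(pdist(X))                        # Formula (1): distance matrix
    rho = np.exp(-(D / dc) ** 2).sum(axis=1) - 1.0  # Formula (3), minus the self term
    return dpc_core(D, rho, n_centers)
```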

2.2. K nearest neighbors
K nearest neighbors (KNN) has long been exploited for classification [17-21] and has also been shown to be a powerful technique for density estimation [22], clustering [23-25] and other fields. As the name implies, the goal of this approach is to find the k nearest neighbors of a sample among N samples. In general, the distances between points are computed as Euclidean distances.
To compute the local density more appropriately, we use an idea based on k nearest neighbors (KNN). We assume that we are given $N$ data points $\boldsymbol{X} = [\boldsymbol{x}_1, \boldsymbol{x}_2, \cdots, \boldsymbol{x}_N]^T$ drawn from an $M$-dimensional space. As the distance function between points we use the Euclidean distance, denoted by $d(\cdot,\cdot)$. We need to compute the k nearest neighbors of a sample $\boldsymbol{x}_i$ among $\{\boldsymbol{x}_1, \cdots, \boldsymbol{x}_{i-1}, \boldsymbol{x}_{i+1}, \cdots, \boldsymbol{x}_N\}$: sorting these distances in ascending order yields the first k distances, and the k-th distance corresponds to the k-th nearest neighbor. By $\mathrm{kNN}(\boldsymbol{x}_i)$ we denote the set of the k nearest neighbors of $\boldsymbol{x}_i$ among $\{\boldsymbol{x}_1, \cdots, \boldsymbol{x}_{i-1}, \boldsymbol{x}_{i+1}, \cdots, \boldsymbol{x}_N\}$. More specifics of the idea are discussed in Section 3.
The nearest-neighbor step is simple to implement, though it scales in the worst case as $O(N^2)$ if performed without any optimizations. Efficient methods such as K-D trees [26] or ball trees [27] can be used to compute the neighbors in $O(N \log N)$ time.
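A brute-force version of this neighbor search takes only a few lines of Python; this is a sketch rather than the authors' implementation, and for large N the full distance matrix could be replaced by a K-D tree [26] or a ball tree [27].

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def knn_indices(X, k):
    """For every point, return the indices of its k nearest neighbors
    (self excluded), using a brute-force O(N^2) distance matrix."""
    D = squareform(pdist(X))
    np.fill_diagonal(D, np.inf)           # exclude the point itself
    return np.argsort(D, axis=1)[:, :k]   # first k columns = k nearest neighbors
```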


2.3. Principal components analysis
Principal components analysis (PCA) is a dimensionality-reduction algorithm that can be used to significantly speed up unsupervised feature learning algorithms. The basic idea of PCA is to project the original data onto a lower-dimensional subspace that highlights the principal directions of variation of the data.
The following steps describe the procedure; a minimal code sketch is given after the list.
(1) Make each of the features have the same mean (zero) and variance.
(2) Calculate the covariance matrix $\Sigma$.
(3) Calculate the eigenvectors $\boldsymbol{u}_i$ and the eigenvalues $\lambda_i$ of $\Sigma$.
(4) Sort the eigenvalues in decreasing order and stack the eigenvectors $\boldsymbol{u}_i$ corresponding to the eigenvalues $\lambda_i$ in columns to form the matrix $U$.
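A minimal numpy sketch of steps (1)-(4) follows; it only centers the features (the variance scaling of step (1) is omitted), and the function name `pca_basis` is ours.

```python
import numpy as np


def pca_basis(X):
    """Steps (1)-(4): center the features, form the covariance matrix, and
    return its eigenvectors (columns of U) sorted by decreasing eigenvalue."""
    Xc = X - X.mean(axis=0)                     # zero-mean features
    Sigma = Xc.T @ Xc / Xc.shape[0]             # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)    # eigh: Sigma is symmetric
    order = np.argsort(eigvals)[::-1]           # decreasing order
    return eigvecs[:, order], eigvals[order]    # U and its eigenvalues
```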

3. Density peaks clustering based on KNN & density peaks clustering based on KNN and PCA
There are still some defects in DPC. To solve these problems, we propose the following solutions.

3.1. Density peaks clustering based on k nearest neighbors
Firstly, the local structure of the data is not assessed by the local density in DPC. To address this, we propose a novel DPC based on k nearest neighbors (DPC-KNN).
A shortcoming of the original local density is that it is not sensitive to the local geometry of the data. In particular, when the clusters differ greatly in density, the cluster centers also differ greatly in local density; the low-density cluster centers are then hard to distinguish on the decision graph for a small $d_c$, which makes selecting the cluster centers difficult. In our work, we introduce the idea of KNN into the calculation of the local density.
Let $\boldsymbol{x}_i \in \boldsymbol{X}$, let $d(\cdot,\cdot)$ be the Euclidean distance function, and let $\mathrm{NN}_k(\boldsymbol{x}_i)$ be the $k$-th nearest point to $\boldsymbol{x}_i$ according to $d$. The k nearest neighbors $\mathrm{kNN}(\cdot)$ of $\boldsymbol{x}_i$ are defined as:

$$\mathrm{kNN}(\boldsymbol{x}_i) = \{j \in \boldsymbol{X} \mid d(\boldsymbol{x}_i, \boldsymbol{x}_j) \le d(\boldsymbol{x}_i, \mathrm{NN}_k(\boldsymbol{x}_i))\} \qquad (6)$$

We can use $\mathrm{kNN}(\boldsymbol{x}_i)$ to calculate the local density. This new local density is based on the mean distance to the k nearest neighbors, as follows:

$$\rho_i = \exp\left(-\frac{1}{k}\sum_{\boldsymbol{x}_j \in \mathrm{kNN}(\boldsymbol{x}_i)} d(\boldsymbol{x}_i, \boldsymbol{x}_j)^2\right) \qquad (7)$$

where k is computed as a percentage p of the number of points N, so $k = \lceil p \times N \rceil$. Because a higher value of the local density should indicate a denser neighborhood, Formula (7) uses a Gaussian kernel function as an inverse measure of the distance.
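The following Python sketch computes the local density of Formula (7) with k = ⌈p·N⌉; it follows our reading of the formula (the squared distances to the k nearest neighbors are averaged inside the exponential), and the function name `knn_density` is hypothetical.

```python
import math

import numpy as np
from scipy.spatial.distance import pdist, squareform


def knn_density(X, p):
    """Formula (7): rho_i = exp(-(1/k) * sum of squared distances to the
    k nearest neighbors of x_i), with k = ceil(p * N)."""
    N = X.shape[0]
    k = max(1, math.ceil(p * N))
    D = squareform(pdist(X))
    np.fill_diagonal(D, np.inf)                   # exclude the point itself
    knn_dist = np.sort(D, axis=1)[:, :k]          # distances to the k nearest neighbors
    return np.exp(-np.mean(knn_dist ** 2, axis=1))
```

Plugging this density into the DPC sketch of Section 2.1 in place of Formula (3) gives the DPC-KNN procedure summarized below.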
Then we can adopt the idea of KNN for the local density in DPC. The following algorithm is a summary of the proposed DPC-KNN.
Algorithm 2. DPC-KNN algorithm.
Inputs:
The samples $\boldsymbol{X} \in \mathbb{R}^{N\times M}$
The parameter $p$
Outputs:
The label vector of cluster indices $\boldsymbol{y} \in \mathbb{R}^{N\times 1}$
Method:
Step 1: Calculate the distance matrix according to Formula (1)
Step 2: Calculate $\rho_i$ for each point $i$ according to Formula (7)
Step 3: Calculate $\delta_i$ for each point $i$ according to Formula (5)
Step 4: Plot the decision graph and select the cluster centers
Step 5: Assign each remaining point to the nearest cluster center
Step 6: Return $\boldsymbol{y}$
Complexity Analysis: Suppose N is the total number of points in the data set. The complexity of calculating the similarity matrix is $O(N^2)$, and DPC-KNN also needs $O(N^2)$ to compute the local density. In addition, the sorting process with quicksort costs $O(N \log N)$. We do not count the time needed to determine the cluster centers. As the complexity of the assignment procedure is $O(N)$, the total time complexity of our DPC-KNN method is $O(N^2) + O(N^2) + O(N \log N) + O(N) + O(N) \sim O(N^2)$.
3.2. Density peaks clustering based on k nearest neighbors and principal component analysis
In addition, DPC does not perform well when the data have relatively high dimension; in particular, it often generates the wrong number of clusters on real-world data sets. On the basis of DPC-KNN, we further bring forward a method based on principal component analysis (DPC-KNN-PCA).
Firstly, we make each of the features have the same mean (zero). Then we compute the matrix $\Sigma$ as follows:

$$\Sigma = \frac{1}{N}\sum_{i=1}^{N} \boldsymbol{x}_i \boldsymbol{x}_i^T \qquad (8)$$

If $\boldsymbol{x}$ has zero mean, then $\Sigma$ is exactly the covariance matrix of $\boldsymbol{x}$. We can compute the eigenvectors of $\Sigma$ and stack them in columns to form the matrix $U$:

$$U = \begin{bmatrix} | & | & & | \\ \boldsymbol{u}_1 & \boldsymbol{u}_2 & \cdots & \boldsymbol{u}_M \\ | & | & & | \end{bmatrix} \qquad (9)$$
where $\boldsymbol{u}_1$ is the principal eigenvector (corresponding to the largest eigenvalue), $\boldsymbol{u}_2$ is the second eigenvector, and so on. Also, let $\lambda_1, \lambda_2, \cdots, \lambda_M$ be the corresponding eigenvalues.
We can represent $\boldsymbol{x}$ in the $(\boldsymbol{u}_1, \boldsymbol{u}_2, \cdots, \boldsymbol{u}_M)$-basis by computing

$$\boldsymbol{x}_{rot} = U^T\boldsymbol{x} = \begin{bmatrix} \boldsymbol{u}_1^T\boldsymbol{x} \\ \boldsymbol{u}_2^T\boldsymbol{x} \\ \vdots \\ \boldsymbol{u}_M^T\boldsymbol{x} \end{bmatrix} \qquad (10)$$

The subscript "rot" comes from the observation that this corresponds to a rotation (and possibly a reflection) of the original data. We can compute $\boldsymbol{x}_{rot}^{(i)} = U^T\boldsymbol{x}^{(i)}$ for every point $i$. If we want to reduce the data to one dimension (the principal direction of variation of the data), we can set

$$\tilde{\boldsymbol{x}}^{(i)} = x_{rot,1}^{(i)} = \boldsymbol{u}_1^T\boldsymbol{x}^{(i)} \in \mathbb{R} \qquad (11)$$
If $\boldsymbol{x} \in \mathbb{R}^M$ and we want to reduce it to a $k$-dimensional representation $\tilde{\boldsymbol{x}} \in \mathbb{R}^k$ (where $k < M$), we would take the first $k$ components of $\boldsymbol{x}_{rot}$, which correspond to the top $k$ directions of variation. In other words, our definition of $\tilde{\boldsymbol{x}}$ can also be arrived at by using an approximation to $\boldsymbol{x}_{rot}$ where all but the first $k$ components are zero, as follows:

$$\tilde{\boldsymbol{x}} = \begin{bmatrix} x_{rot,1} \\ \vdots \\ x_{rot,k} \\ 0 \\ \vdots \\ 0 \end{bmatrix} \approx \begin{bmatrix} x_{rot,1} \\ \vdots \\ x_{rot,k} \\ x_{rot,k+1} \\ \vdots \\ x_{rot,M} \end{bmatrix} = \boldsymbol{x}_{rot} \qquad (12)$$
The reason for dropping the later components of $\boldsymbol{x}_{rot}$ is that the first few components are considerably larger than the later ones.
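As a small illustration of Formulas (10)-(12), the rotation and truncation can be sketched as follows, assuming the eigenvector matrix U has been computed as in Section 2.3; the function name `project` is ours.

```python
import numpy as np


def project(X, U, k):
    """Rotate the centered data into the eigenbasis (Formula (10)) and keep
    only the first k components (Formulas (11)-(12))."""
    Xc = X - X.mean(axis=0)       # the same centering used to build U
    X_rot = Xc @ U                # row i equals (U^T x_i)^T
    return X_rot[:, :k]           # top-k directions of variation
```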
To decide how to set $k$, we usually look at the percentage of variance retained for different values of $k$. Generally, let $\lambda_1, \lambda_2, \cdots, \lambda_M$ be the eigenvalues of $\Sigma$ (sorted in decreasing order), so that $\lambda_i$ is the eigenvalue corresponding to the eigenvector $\boldsymbol{u}_i$. Then, if we retain $k$ principal components, the percentage of variance retained is given by:

$$\frac{\sum_{i=1}^{k}\lambda_i}{\sum_{i=1}^{M}\lambda_i} \qquad (13)$$

In this paper, we pick the smallest value of $k$ that satisfies

$$\frac{\sum_{i=1}^{k}\lambda_i}{\sum_{i=1}^{M}\lambda_i} \ge 0.99 \qquad (14)$$
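The retained-variance criterion of Formulas (13)-(14) can be sketched as follows, assuming the eigenvalues are already sorted in decreasing order; `choose_k` is an illustrative name.

```python
import numpy as np


def choose_k(eigvals, threshold=0.99):
    """Smallest k whose leading eigenvalues retain at least `threshold`
    of the total variance (Formulas (13)-(14))."""
    ratio = np.cumsum(eigvals) / np.sum(eigvals)       # retained variance for k = 1..M
    return int(np.searchsorted(ratio, threshold) + 1)  # first k reaching the threshold
```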

The following algorithm is a summary of the proposed DPC-KNN-PCA; a code sketch follows the listing.
Algorithm 3. DPC-KNN-PCA algorithm.
Inputs:
The samples $\boldsymbol{X} \in \mathbb{R}^{N\times M}$
The parameter $p$
Outputs:
The label vector of cluster indices $\boldsymbol{y} \in \mathbb{R}^{N\times 1}$
Method:
Step 1: Make all of the features have the same mean (zero) and variance
Step 2: Compute the covariance matrix $\Sigma$ according to Formula (8)
Step 3: Compute the eigenvectors $\boldsymbol{u}_i$ and the eigenvalues $\lambda_i$ of $\Sigma$
Step 4: Reduce the data, keeping 99% of the variance according to Formula (14)
Step 5: Calculate the distance matrix according to Formula (1)
Step 6: Calculate $\rho_i$ for each point $i$ according to Formula (7)
Step 7: Calculate $\delta_i$ for each point $i$ according to Formula (5)
Step 8: Plot the decision graph and select the cluster centers
Step 9: Assign each remaining point to the nearest cluster center
Step 10: Return $\boldsymbol{y}$
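Putting the pieces together, Algorithm 3 can be sketched as below. The sketch reuses the illustrative helpers introduced earlier (`pca_basis`, `choose_k`, `project`, `knn_density`, `dpc_core`), shows only the centering part of Step 1, and keeps the automatic center-selection surrogate; none of this is the authors' code.

```python
from scipy.spatial.distance import pdist, squareform


def dpc_knn_pca(X, p, n_centers, retain=0.99):
    """Sketch of Algorithm 3: PCA preprocessing followed by DPC-KNN."""
    U, eigvals = pca_basis(X)               # Steps 1-3 (centering + eigendecomposition)
    k_dim = choose_k(eigvals, retain)       # Step 4: keep 99% of the variance
    Z = project(X, U, k_dim)                # reduced representation of the data
    D = squareform(pdist(Z))                # Step 5: distance matrix, Formula (1)
    rho = knn_density(Z, p)                 # Step 6: local density, Formula (7)
    return dpc_core(D, rho, n_centers)      # Steps 7-9: delta, centers, assignment
```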
Complexity Analysis: Suppose N is the total number of points in the data set and M is the number of features of each point. The covariance matrix computation is $O(M^2N)$ and its eigenvalue decomposition is $O(M^3)$, so the complexity of PCA is $O(M^3 + M^2N)$. The total time complexity of our DPC-KNN-PCA method is therefore $O(M^3 + M^2N + N^2)$.
4. Experiments and results
In this section, we test the performance of DPC-KNN-PCA through two types of experiments. Experiments on synthetic data sets demonstrate the feasibility of the algorithm, and experiments on real-world data sets compare it with the k-means algorithm and the spectral clustering (SC) algorithm in terms of accuracy.
The experiments are run on a workstation with an Intel Core i7 (DMI2) 3.6 GHz processor and 18 GB of RAM running MATLAB 2012b. The k-means and SC algorithms are run 10 times on the real-world data sets. This paper measures the similarity between data points with the Euclidean distance, which is widely used for spatial data, as shown in Formula (1). In DPC-KNN and DPC-KNN-PCA, we select the parameter p from [0.1% 0.2% 0.5% 1% 2% 6%]. In DPC, the parameter $d_c$ is also selected from [0.1% 0.2% 0.5% 1% 2% 6%]. The kernel of the spectral clustering (SC) algorithm is the Gaussian kernel, and we select its parameter δ from [0.5 1 2 3 4].
4.1. Experiments on synthetic data sets
We test the performance of our algorithms on synthetic data sets. The synthetic data sets are two-dimensional, which makes visualization easy. Because the performances of DPC-KNN and DPC-KNN-PCA are similar on 2-dimensional data sets, we only report the performance of DPC-KNN-PCA.
4.1.1. Synthetic data sets
Our algorithms are tested on eleven data sets whose geometric shapes are shown in Figure 2. The first data set, R15 [28], is generated as 15 similar 2-D Gaussian distributions positioned in rings. The second data set, Aggregation [29], consists of seven perceptually distinct groups of points, some of which form non-Gaussian clusters. The third data set, Flame [30], contains clusters of different size and shape. The fourth data set, D, contains clusters of different densities. S1, S2, S3 and S4 [31] are four 2-D data sets with varying complexity in terms of spatial data distribution: in S1 the overlap is the smallest, whereas in S4 the overlap is the greatest. A1, A2 and A3 [32] are two-dimensional sets with a varying number of circular clusters (M=20, 35, 50). We demonstrate the power of our algorithms on these test cases.

Fig.2. Visualization of the two-dimensional data sets (panels: R15, Aggregation, Flame, D, S1, S2, S3, S4, A1, A2, A3)
4.1.2. The evaluation of clustering results on synthetic data sets
There are 15 clusters and 600 points in R15. In this case, our approaches are robust with respect to the parameter p. Figure 3 presents the clusters found by our algorithm with different values of p and the corresponding selections of cluster centers. As shown, DPC-KNN-PCA gets perfect results, except when p=0.5%.
Fig.3. DPC-KNN-PCA on the R15 set with different values of p (panels: p=0.1%, p=0.2%, p=0.5%, p=1%, p=2%, p=6%)
There are 7 clusters and 788 points in the Aggregation set, and 2 clusters and 240 points in the Flame set. These two data sets consist of clusters of different size and shape. The D set has 2 clusters and 97 points, and its clusters are of different densities. The clustering results produced by DPC-KNN-PCA are given in Figure 4.
Fig.4. DPC-KNN-PCA on the Aggregation, Flame, and D sets (panels: Aggregation, p=1%; Flame, p=0.5%; D, p=6%)
The data sets S1 to S4 are two-dimensional sets with varying complexity in terms of spatial data distribution. Each has 5000 points around 15 clusters with varying degrees of overlap. As shown in Figure 5, the performance of DPC-KNN-PCA is perfect for data sets with varying complexity.
Fig.5. DPC-KNN-PCA on the S1, S2, S3, and S4 sets (panels: S1, p=1%; S2, p=1%; S3, p=1%; S4, p=1%)
The A1, A2 and A3 sets are large data sets with a varying number of clusters: A1 has 3000 points around 20 clusters, A2 has 5250 points around 35 clusters, and A3 has 7500 points and 50 clusters. We demonstrate the robustness of DPC-KNN-PCA with respect to the number of clusters, as shown in Figure 6.
Fig.6. DPC-KNN-PCA on the A1, A2, and A3 sets (panels: A1, p=2%; A2, p=2%; A3, p=2%)
As these experiments illustrate, our algorithms are very effective in finding clusters of arbitrary shape, density, distribution and number.
4.2. Experiments on real-world data sets
The performance of our algorithms is compared with classical methods (the k-means algorithm and the spectral clustering algorithm).
4.2.1. Real-world data sets
The data sets used in the experiments are all from the UCI Machine Learning Repository and include Iris, LED digits, Seeds, Heart, Pen-based digits, Waveform, and Sonar. The details of these data sets are given in Table 1.

Table 1 The details of the UCI data sets

Data Sets          Clusters   Dimension   N
Iris               3          4           150
LED digits         10         7           500
Seeds              3          7           210
Heart              2          13          270
Pen-based digits   10         16          109962
Waveform           3          21          5000
Sonar              2          60          208

4.2.2. Quality of the clustering results
This paper uses clustering accuracy (ACC) [33] to measure the quality of the clustering results. For $N$ distinct samples $\boldsymbol{x}_i$, let $y_i$ and $c_i$ be the inherent category label and the predicted cluster label of $\boldsymbol{x}_i$. The ACC is calculated as

$$\mathrm{ACC} = \frac{1}{N}\sum_{i=1}^{N} \delta\big(y_i, \mathrm{map}(c_i)\big) \qquad (15)$$

where $\mathrm{map}(\cdot)$ maps each cluster label to a category label by the Hungarian algorithm [34] (this mapping is optimal), and $\delta(y_i, c_i)$ equals 1 if $y_i = c_i$ and 0 otherwise. The higher the value of ACC, the better the clustering performance.
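For reference, ACC as in Formula (15) can be computed as sketched below, obtaining the optimal mapping with the Hungarian algorithm via SciPy's `linear_sum_assignment`; the function name is ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def clustering_accuracy(y_true, y_pred):
    """Formula (15): best one-to-one mapping between predicted clusters and
    true classes (Hungarian algorithm), then the fraction of matched samples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    agree = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            agree[i, j] = np.sum((y_pred == c) & (y_true == t))
    row, col = linear_sum_assignment(-agree)      # maximize total agreement
    return agree[row, col].sum() / len(y_true)
```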
The comparison of these algorithms is shown in Table 2, where the symbol "-" means that the algorithm does not work on that data set.
Table 2 The performance comparison of the proposed algorithms

Data sets          Accuracy    DPC           DPC-KNN   DPC-KNN-PCA   SC                k-means
Iris               Mean        0.94          0.96      0.88          0.8867 ± 0        0.8560 ± 0.097
                   Parameter   d_c = 0.1%    p=1%      p=4%          δ = 0.5
LED digits         Mean        -             0.7460    0.6700        0.6360 ± 0.0552   0.5208 ± 0.0713
                   Parameter                 p=6%      p=6%          δ = 0.5
Seeds              Mean        0.8952        0.9143    0.9143        0.9048 ± 0        0.8905 ± 0
                   Parameter   d_c = 1%      p=2%      p=2%          δ = 1
Heart              Mean        -             0.8111    0.8259        0.7963 ± 0        0.7166 ± 0.0640
                   Parameter                 p=1%      p=6%          δ = 4
Pen-based digits   Mean        -             0.7618    0.7623        0.7177 ± 0.0276   0.7004 ± 0.0466
                   Parameter                 p=1%      p=0.2%        δ = 3
Waveform           Mean        0.5676        0.5840    0.6452        0.5054 ± 0        0.5012 ± 0
                   Parameter   d_c = 0.5%    p=0.2%    p=0.1%        δ = 0.5
Sonar              Mean        -             -         0.6442        0.5433 ± 0        0.5433 ± 0.0124
                   Parameter                           p=1%          δ = 2

The first three data sets have fewer than ten features, and the remaining data sets have more than ten features. As shown in Table 2, DPC-KNN outperforms the others on low-dimensional data sets such as Iris, LED digits and Seeds. However, DPC-KNN-PCA performs better than DPC-KNN on relatively high-dimensional data sets; specifically, the more features a data set has, the greater the advantage of DPC-KNN-PCA over DPC-KNN. It is clear that the DPC-KNN-PCA algorithm achieves gratifying results on most data sets.
We now discuss the reasons for these results. DPC-KNN, which does not discard any feature information, performs better than DPC-KNN-PCA on low-dimensional data sets. Due to the so-called curse of dimensionality, the similarity between samples becomes meaningless in high-dimensional data spaces. Although similarity between high-dimensional samples is not meaningful, similarity computed on subsets of attributes is still meaningful. PCA not only reduces the dimensionality of the data, but also retains as much information as possible. When a data set has relatively high dimension, DPC does a poor job of finding the clusters, which requires extra attention. For example, the Sonar data set has relatively high dimension compared to its number of samples. Figure 7 shows that only one cluster center is found by DPC on the decision graph for different values of $d_c$; in this case, we are unable to make the right choice of centers. DPC-KNN-PCA performs favorably compared to the original algorithm, as shown in Figure 8. In consequence, DPC-KNN-PCA outperforms DPC-KNN on high-dimensional data sets.
Fig.7. DPC on the Sonar set with different values of $d_c$ (panels: $d_c$ = 0.1%, 0.2%, 0.5%, 1%, 2%, 6%)
Fig.8. DPC-KNN-PCA on the Sonar set
5. Conclusions
DPC-KNN offers another option, based on k nearest neighbors (KNN), for the local density computation. On the basis of DPC-KNN, a method based on principal component analysis (DPC-KNN-PCA) is presented to improve its performance on real-world data sets. The presented algorithms show their power on several synthetic data sets. Besides their good feasibility, the proposed algorithms achieve better clustering performance than classical methods (the k-means algorithm and the spectral clustering algorithm) and DPC on UCI data sets. On low-dimensional data sets, DPC-KNN outperforms the others, and DPC-KNN-PCA achieves gratifying results on relatively high-dimensional data sets.
However, the proposed algorithm does not perform well when there is a collection of points forming vertical streaks in the data set. Future research will aim to improve the performance of the DPC-KNN-PCA algorithm on manifold data sets.
6. Acknowledgements
This work is supported by the National Natural Science Foundation of China (No. 61379101) and the National Key Basic Research Program of China (No. 2013CB329502).

References
[1] A. K. Jain, M. N. Murty, P. J. Flynn. Data clustering: a review. ACM computing surveys (CSUR),
1999, 31(3): 264-323.
[2] J. Han, M. Kamber. Data mining: concepts and techniques. Morgan Kaufman, San Francisco,
California, USA, 2000.
[3] M. Ester, H. P. Kriegel, J. Sander, X. W. Xu. A density-based algorithm for discovering clusters in

large spatial databases with noise. Proceedings of Second International Conference on Knowledge
Discovery and Data Mining, 1996, 96(34): 226-231.
[4] J. Sander, M. Ester, H. P. Kriegel, X. Xu. Density-based clustering in spatial databases: The
algorithm GDBSCAN and its applications. Data mining and knowledge discovery, 1998, 2(2):
169-194.
[5] M. Ankerst, M. M. Breunig, H.-P. Kriegel, J. Sander. OPTICS: ordering points to identify the
clustering structure. Proceedings of ACM SIGMOD Conference, 1999, 28(2): 49-60.
[6] X. Xu, M. Ester, H. P. Kriegel, J. Sander. A distribution-based clustering algorithm for mining in
large spatial databases. Proceedings of the 14th International Conference on Data Engineering,
1998: 324-331.

[7] R. J. G. B. Campello, D. Moulavi, J. Sander. Density-based clustering based on hierarchical

density estimates. Proceedings 17th Pacific-Asia Conference on Knowledge Discovery and Data
Mining, 2013: 160-172.

[8] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 1995, 17(8): 790-799.
[9] A. Rodriguez, A. Laio. Clustering by fast search and find of density peaks. Science, 2014,

344(6191):1492-1496.
[10] K. Sun, X. Geng, L. Ji. Exemplar Component Analysis: A fast band selection method for
hyperspectral imagery. IEEE Geoscience and Remote Sensing Letters, 2015, 12(5): 998-1002.
[11] Y. Zhang, Y. Xia, Y. Liu, W. M. Wang. Clustering sentences with density peaks for
multi-document summarization. Proceedings of Human Language Technologies: The 2015
Annual Conference of the North American Chapter of the ACL, 2015: 1262-1267.
[12] K. Xie, J. Wu, W. Yang, C. Y. Sun. K-means clustering based on density for scene image

classification. Proceedings of the 2015 Chinese Intelligent Automation Conference, 2015: 379-38.
[13] Y. W. Chen, D. H. Lai, H. Qi, J. L. Wang, J. X. Du. A new method to estimate ages of facial image
for large database. Multimedia Tools and Applications, 2015: 1-19.

[14] W. Zhang, J. Li. Extended fast search clustering algorithm: widely density clusters, no density
peaks. arXiv preprint arXiv:1505.05610, 2015.
[15] G. Chen, X. Zhang, Z. J. Wang, F. L. Li. Robust support vector data description for outlier

detection with noise or uncertain data. Knowledge-Based Systems, 2015, 90: 129-137.
[16] L. Parsons, E. Haque, H. Liu. Subspace clustering for high dimensional data: a review. ACM

SIGKDD Explorations Newsletter, 2004, 6(1): 90-105.


[17] H. L. Chen, B. Yang, G. Wang, J. Liu, X. Xu, S. J. Wang, D. Y. Liu. A novel bankruptcy prediction
model based on an adaptive fuzzy k-nearest neighbor method. Knowledge-Based Systems, 2011,

24(8): 1348-1359.
[18] T. Basu, C. A. Murthy. Towards enriching the quality of k-nearest neighbor rule for document
classification. International Journal of Machine Learning and Cybernetics, 2014, 5(6): 897-905.
[19] T. M. Cover, P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information
Theory, 1967, 13(1): 21-27.
[20] M. Hao, Z. Qiao. Identification of the pesticide fluorescence spectroscopy based on the PCA and
KNN. Proceedings of 2010 3rd International Conference on Advanced Computer Theory and
Engineering (ICACTE), 2010, 3: 184-186.
[21] P. S. Hiremath, M. Hiremath. 3D face recognition based on radon transform, PCA, LDA using
KNN and SVM. International Journal of Image Graphics & Signal Processing, 2014, 6(7): 36-43.

[22] D. O. Loftsgaarden, C. P. Quesenberry. A nonparametric estimate of a multivariate density


function. Annals of Mathematical Statistics, 1965, 36(1):1049-1051.
[23] R. Jarvis, E. Patrick. Clustering using a similarity measure based on shared near neighbors. IEEE
Transactions on Computers, 1973, 100(11): 1025-1034.
[24] J. Schneider, M. Vlachos. Fast parameterless density-based clustering via random projections.
Proceedings of the 22nd ACM International Conference on Conference on Information &
Knowledge Management, 2013: 861-866.
[25] E. Aksehirli, B. Goethals, E. Muller, J. Vreeken. Cartification: A neighborhood preserving
transformation for mining high dimensional data. Proceedings of the 13th IEEE International
Conference on Data Mining (ICDM), 2013: 937-942.

[26] J. L. Bentley. Multidimensional Binary Search Trees Used for Associated Searching.

Communications of The ACM, 1975, 18(9):509-517.
[27] S. M. Omohundro. Five balltree construction algorithms. Technical Report TR-89-063,

International Computer Science Institute, 1989.
[28] C. J. Veenman, M. J. T. Reinders, E. Backer. A maximum variance cluster algorithm. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(9): 1273-1280.

[29] A. Gionis, H. Mannila, P. Tsaparas. Clustering aggregation. ACM Transactions on Knowledge
Discovery from Data (TKDD), 2007, 1(4):341-352.
[30] L. Fu, E. Medico. FLAME, a novel fuzzy clustering method for the analysis of DNA microarray
data. BMC bioinformatics, 2007, 8(1):399-408.
[31] P. Fränti, O. Virmajoki. Iterative shrinking method for clustering problems. Pattern Recognition,
2006, 39(5): 761-775.
[32] I. Kärkkäinen, P. Fränti. Dynamic local search algorithm for the clustering problem. Technical

Report A-2002-6, Department of Computer Science, University of Joensuu, 2002.


[33] S. F. Ding, H. J. Jia, Z. Z. Shi. Spectral clustering algorithm based on adaptive Nyström sampling
for big data analysis. Journal of Software, 2014, 25(9): 2037-2049.

[34] C. H. Papadimitriou, K. Steiglitz. Combinatorial optimization: algorithms and complexity. Courier


Dover Publications, Mineola, New York, USA, 1998.