Outer-Points Shaver: Robust Graph-Based Clustering via Node Cutting
Pattern Recognition
Article history: Received 4 September 2018; Revised 30 May 2019; Accepted 13 August 2019; Available online 14 August 2019.

Keywords: Graph-based clustering; Unsupervised learning; Spectral clustering; Pseudo-density reconstruction; Node cutting.

Abstract: Graph-based clustering is an efficient method for identifying clusters in local and nonlinear data patterns. Among the existing methods, spectral clustering is one of the most prominent algorithms. However, this method is vulnerable to noise and outliers. This study proposes a robust graph-based clustering method that removes the data nodes of relatively low density. The proposed method calculates the pseudo-density from a similarity matrix, and reconstructs it using a sparse regularization model. In this process, noise and the outer points are determined and removed. Unlike previous edge cutting-based methods, the proposed method is robust to noise while detecting clusters because it cuts out irrelevant nodes. We use a simulation and real-world data to demonstrate the usefulness of the proposed method by comparing it to existing methods in terms of clustering accuracy and robustness to noisy data. The comparison results confirm that the proposed method outperforms the alternatives.

https://doi.org/10.1016/j.patcog.2019.107001
© 2019 Elsevier Ltd. All rights reserved.
Fig. 1. Graphical illustration of the proposed clustering algorithm: (A) Detecting noise and outer points in the similarity graph, (B) cutting out the detected nodes and
identifying connected components, and (C) assigning the noise and outer points to the nearest cluster.
Second, it is necessary to predetermine the number of clusters because the k-means algorithm is applied after spectral embedding into a low-dimensional space.

Various methods have been proposed to address these issues. They mainly focus on constructing a robust similarity matrix or on robustly decomposing a similarity matrix. Li et al. [11] proposed a noise-robust spectral clustering method, focusing on data to which uniform noise has been added. The data are transformed to a new space in which the noise points are clustered together. In this manner, noise points are detected and robust clustering results are produced. Huang et al. [12] aimed to improve the robustness of clustering algorithms based on heat kernel theory. The heat kernel statistically depicts the traces of a random walk, and the concept has an intrinsic connection with the diffusion distance, which produces more robust clustering results than the basic Euclidean distance. The authors integrated the heat distributed along the time scale to measure the distance between each point pair in their eigenspace. Li et al. [2] proposed the construction of fuzzy-set-based affinity graphs by identifying and exploiting discriminative features. The method captures subtle similarity information distributed over discriminative feature subspaces to reveal the latent data distribution, and the resulting affinity graph leads to more accurate clustering results than the Gaussian kernel-based similarity graph. Bojchevski et al. [13] proposed a sparse and latent decomposition of the similarity graph. The method jointly learns the spectral embedding as well as the noisy data; by decomposing the similarity graph into sparse corruptions and clean data, it enhances the robustness of spectral clustering. Li et al. [14] investigated the robustness of spectral clustering methods for grouping multi-scale data and proposed an algorithm that computes an affinity matrix that simultaneously considers the feature similarity and reachability similarity of objects. The methods proposed by Li et al. [11] and Bojchevski et al. [13] have a motivation similar to that of the proposed method: both decompose the data into clean and noisy parts. Thus, in our simulation and case studies, we compare our proposed method with the methods of Refs. [11,13].

In this study, we propose a graph-based clustering method for clustering local and nonlinear pattern data with noise. To enhance the robustness of the clustering results, we propose the use of density-reconstruction-based node cutting, which constitutes a new method for robust graph-based clustering. The proposed method calculates the pseudo-density from a similarity matrix and reconstructs it using a sparse regularization model. In the process, noise and outer points of clusters are identified and removed from the similarity graph. After the clusters are identified, the removed outer points are assigned to the nearest cluster. Unlike previous edge cutting-based methods, the proposed method is robust to noise while detecting clusters because it cuts out irrelevant nodes. Fig. 1 illustrates the main concept of the proposed method. The removal of outer points resembles shaving, and thus we named the algorithm an "outer-points shaver." In addition, it is not necessary to predefine the number of clusters, which can be determined from the sparsity parameter in the optimization model.

The main contributions of this study can be summarized as follows:

(1) We propose a new graph-based clustering method with a linear regularization model that minimizes the density reconstruction error, subject to data-node selection constraints. The method is robust to noise and outliers because it cuts out low-density and noisy points to detect the fundamental structures of clusters. To the best of our knowledge, this is the first attempt to utilize a linear regularization model to perform density-based clustering.

(2) We present theoretical results that guarantee the grouping effect in selecting nodes. The grouping effect implies that if two observations have similar nearest neighbors and one of the two is selected, then the other observation can be selected simultaneously with high probability.

(3) The proposed method does not require significant effort to determine the number of clusters in advance. The optimal cluster number can be determined through a parameter sensitivity analysis of the proposed formulation.

(4) To demonstrate the usefulness of the proposed method, we conduct simulation and real-world case studies. The results demonstrate that our proposed method outperforms the alternatives.

The remainder of this paper is organized as follows. Section 2 presents the details of the proposed outer-points shaver method. Section 3 presents a simulation study to examine the performance of the proposed method and compare it with other methods under various scenarios. Section 4 presents a case study to demonstrate the applicability of the proposed method. Finally, Section 5 presents our concluding remarks.

2. Proposed method

The proposed outer-points shaver (OPS) algorithm consists of four main steps. The first is to represent the data as a k-nearest neighbor graph. In this graph, all the observations are represented by nodes, with edges that connect each observation to its nearest neighbors. Second, we conduct pseudo-density reconstruction using a linear regularization model to determine the outer points. The sparse selection property of the regularization model constrains the coefficients of outer points to be zero. The points selected as the outer points are temporarily excluded from the clustering procedure. Third, the subgraphs containing the inner points
are clustered using an algorithm that identifies connected components, and each connected component is determined as a cluster. Finally, the temporarily excluded outer points are each assigned to their nearest cluster.

2.1. Constructing the k-nearest neighbor graph

The first step of the proposed algorithm is to represent the data by a graph structure. As mentioned in Section 1, representing a data set as a graph is useful for clustering local and nonlinear patterns. There are several choices for constructing a graph from the given data: the fully connected graph, the ε-neighborhood graph, and the k-nearest neighbor graph. A fully connected graph connects all points with positive similarities to each other and weights all edges using a similarity function such as a Gaussian kernel function. An ε-neighborhood graph connects all points whose pairwise distances are smaller than ε. Both graphs have a scaling factor as a parameter, and it is challenging to determine an appropriate scaling factor. On the other hand, the k-nearest neighbor graph is known to be robust to the choice of the parameter k. Therefore, utilizing the k-nearest neighbor graph for graph-based clustering is recommended [10]. Thus, in this study, we use the k-nearest neighbor graph to cluster the data. The definition of the k-nearest neighbor graph is as follows:

Definition 1. k-nearest neighbor graph: A k-nearest neighbor-based graph with n nodes is constructed as follows. An edge e_ij between nodes i and j is defined as

e_{ij} = \begin{cases} 1 & \text{if } x_i \in N_k(x_j) \text{ or } x_j \in N_k(x_i) \\ 0 & \text{otherwise,} \end{cases}   (1)

where N_k(x_i) denotes the set of the k nearest neighbors of x_i.

2.2. Pseudo-density reconstruction with regularization

The pseudo-density vector S^s is computed from the similarity matrix S of the k-nearest neighbor graph; its ith entry S_i^s is the total similarity (i.e., the number of neighbors) of node i, so that S^s = S\mathbf{1}. The outer points are determined by investigating the importance of the observations to reconstructing S^s. Problem (3) effectively selects the top T most important observations to reconstruct S^s. Note that we use the unnormalized similarity matrix because normalization is unnecessary for the proposed pseudo-density reconstruction with regularization. The normalization procedure is essential for methods based on graph cut criteria [8]: the unnormalized cut favors cutting off small sets of isolated nodes because the cut objective increases with the number of edges going across the two partitioned parts. However, our formulation consists of a reconstruction error objective with regularization terms. The coefficients of the formulation are optimized to reduce the reconstruction error with sparsity, considering the connection status of the corresponding nodes. Therefore, the formulation does not require the normalized Laplacian matrix.

Although the optimization formulation selects the optimal subset of observations, the cardinality constraint makes Problem (3) NP-hard [16]. To address the computational challenges of the best subset problem, the computationally tractable convex-optimization-based "least absolute shrinkage and selection operator (LASSO)" has been proposed [17]. Our proposed problem represents a special case of the LASSO problem in which the numbers of features and observations are equal. The LASSO formulation can address the computational challenges. However, the LASSO tends to select only one feature from a group of collinear features. For example, if two observations have similar nearest neighbors, then the correlation of the two columns corresponding to the observations becomes high, and one of the observations is determined as an outer point. Thus, to address this problem, we propose using the elastic net [18] form of the formulation, which selects collinear features simultaneously. The optimization formulation for estimating the coefficients of the OPS model is

\hat{\beta} = \arg\min_{\beta} L(\beta), \qquad L(\beta) = \left\| S^s - S\beta \right\|_2^2 + \lambda_1 \left\| \beta \right\|_1 + \lambda_2 \left\| \beta \right\|_2^2.
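To make the first two steps concrete, the sketch below builds the binary symmetric adjacency matrix of Eq. (1), takes the pseudo-density as the node-degree vector S^s = S·1, and fits the elastic-net reconstruction. It is a minimal illustration under our own assumptions: the variable names are ours, scikit-learn's ElasticNet is used only as a stand-in solver (the paper's experiments used a MATLAB implementation), and the (λ1, λ2) values are mapped onto scikit-learn's (alpha, l1_ratio) parameterization up to the rescaling noted in the comments.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import ElasticNet


def knn_adjacency(X, k):
    """Binary, symmetric k-nearest neighbor adjacency matrix (Eq. (1))."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    n = X.shape[0]
    S = np.zeros((n, n))
    for i in range(n):
        S[i, idx[i, 1:]] = 1.0          # connect i to its k nearest neighbors
    return np.maximum(S, S.T)           # e_ij = 1 if i in N_k(j) or j in N_k(i)


def reconstruct_pseudo_density(S, lam1, lam2):
    """Elastic-net reconstruction of the pseudo-density S^s = S @ 1.

    Returns the coefficient vector beta; outer points are {i : beta_i = 0}.
    """
    n = S.shape[0]
    s_density = S.sum(axis=1)           # pseudo-density (node degree) vector
    # scikit-learn minimizes ||y - Xb||^2/(2n) + alpha*l1_ratio*||b||_1
    #                        + alpha*(1 - l1_ratio)*||b||_2^2 / 2,
    # so (lam1, lam2) map onto (alpha, l1_ratio) up to this rescaling.
    alpha = (lam1 + 2.0 * lam2) / (2.0 * n)
    l1_ratio = lam1 / (lam1 + 2.0 * lam2)
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                       fit_intercept=False, max_iter=10000)
    return model.fit(S, s_density).coef_   # Lemma 1: the optimum is nonnegative


# Example usage with a hypothetical data matrix X:
# S = knn_adjacency(X, k=150)
# beta = reconstruct_pseudo_density(S, lam1=0.5, lam2=0.3)
# outer = np.where(np.isclose(beta, 0.0))[0]
```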
Lemma 1 states that the optimal coefficients of this formulation are nonnegative. Its proof proceeds by contradiction: if the optimal solution βˆ contained a negative coefficient, a competing solution βˆ_p could be constructed whose objective value is strictly smaller,

\left\| S^s - S\hat{\beta}_p \right\|_2^2 + \lambda_1 \left\| \hat{\beta}_p \right\|_1 + \lambda_2 \left\| \hat{\beta}_p \right\|_2^2 < \left\| S^s - S\hat{\beta} \right\|_2^2 + \lambda_1 \left\| \hat{\beta} \right\|_1 + \lambda_2 \left\| \hat{\beta} \right\|_2^2.   (6)

This result implies that L(βˆ_p) < L(βˆ), which is a contradiction. Therefore, the optimal coefficients βˆ must have nonnegative values, i.e., βˆ_i ≥ 0, i = 1, 2, ..., n.

Lemma 1 guarantees that the optimal coefficients have nonnegative values. Using this result, we can derive the following grouping effect theorem for the OPS.

Theorem 1. Let βˆ be the solution of the OPS given data (S^s, S) with parameters λ1 and λ2, and let D(i, j) = \frac{1}{\|S^s\|_2} |\hat{\beta}_i - \hat{\beta}_j|. Then,

D(i, j) \le \frac{\sqrt{2}}{\lambda_2} \sqrt{\frac{S_i^s + S_j^s}{2} - S_i^T S_j},

where S_i^s is the ith entry of S^s and S_i is the ith column vector of S. Note that S_i^T S_j equals the number of observations connected to both observations i and j.

Proof. The optimal solution βˆ satisfies ∂L(β)/∂β_k |_{β=βˆ} = 0 if βˆ_k ≠ 0. Therefore, we have

-2 S_i^T \left( S^s - S\hat{\beta} \right) + \lambda_1 \operatorname{sgn}\!\left( \hat{\beta}_i \right) + 2 \lambda_2 \hat{\beta}_i = 0,   (7)

-2 S_j^T \left( S^s - S\hat{\beta} \right) + \lambda_1 \operatorname{sgn}\!\left( \hat{\beta}_j \right) + 2 \lambda_2 \hat{\beta}_j = 0,   (8)

where sgn is the sign function. Subtracting Eq. (8) from Eq. (7) yields

\left( S_j^T - S_i^T \right)\left( S^s - S\hat{\beta} \right) + \lambda_2 \left( \hat{\beta}_i - \hat{\beta}_j \right) = 0,   (9)

which is equivalent to

\hat{\beta}_i - \hat{\beta}_j = \frac{1}{\lambda_2} \left( S_i - S_j \right)^T \left( S^s - S\hat{\beta} \right).   (10)

From Eq. (10) we have the following inequality:

\left| \hat{\beta}_i - \hat{\beta}_j \right| \le \frac{1}{\lambda_2} \left\| S_i - S_j \right\|_2 \left\| S^s - S\hat{\beta} \right\|_2.   (11)

From Lemma 1, we must have L(βˆ) ≤ L(0), i.e., \| S^s - S\hat{\beta} \|_2^2 + \lambda_1 \| \hat{\beta} \|_1 + \lambda_2 \| \hat{\beta} \|_2^2 \le \| S^s \|_2^2, which yields

\left\| S^s - S\hat{\beta} \right\|_2^2 \le \left\| S^s \right\|_2^2.   (12)

Utilizing Eqs. (11) and (12), we have the following inequalities:

D(i, j) \le \frac{1}{\lambda_2 \left\| S^s \right\|_2} \left\| S^s - S\hat{\beta} \right\|_2 \left\| S_i - S_j \right\|_2 \le \frac{1}{\lambda_2} \left\| S_i - S_j \right\|_2.   (13)

Since S is the similarity matrix of the nearest neighbor graph, \| S_i - S_j \|_2^2 = S_i^s + S_j^s - 2 S_i^T S_j holds. Therefore, D(i, j) \le \frac{\sqrt{2}}{\lambda_2} \sqrt{\frac{S_i^s + S_j^s}{2} - S_i^T S_j}, as desired.

Here, D(i, j) describes the difference between the coefficients of observations i and j. The upper bound in the above inequality provides a quantitative description of the grouping effect of the OPS formulation. D(i, j) = |\hat{\beta}_i - \hat{\beta}_j| / \|S^s\|_2 = \left( \| S_i - S_j \|_2 \, \| S^s - S\hat{\beta} \|_2 \, |\cos\theta| \right) / \left( \lambda_2 \| S^s \|_2 \right), where θ denotes the angle between the vectors S_i − S_j and S^s − Sβˆ. The residual vector can be calculated as S^s - S\hat{\beta} = \sum_{q=1}^{n} (1 - \beta_q) S_q. Thus, if we assume that the rank of S is equal to n, equality in the first inequality of Eq. (13) holds when the following conditions are satisfied: β_q = 1, ∀q ∈ {1, 2, ..., n}\{i, j} and |(1 − β_i)/(1 − β_j)| = 1. On the other hand, the inequality holds strictly when these conditions are not satisfied. In particular, if two observations are connected to exactly the same set of nodes, then S_i = S_j, the upper bound becomes zero, and βˆ_i = βˆ_j; such nodes are therefore shaved or retained together, which is the grouping effect stated in Theorem 1.

Fig. 2 illustrates the effect of the sparsity parameter λ1 through a toy example. As mentioned previously, a larger λ1 value increases the number of outer points. We conduct a sensitivity analysis on λ1 and λ2 for the clustering results in Section 3.3.

After determining the outer points, we conduct node cutting, defined as follows:

Definition 2. Node cutting: The outer points O = {i : β_i = 0} and the edges connected to the outer points {e_ij : β_i = 0 or β_j = 0} are removed from the k-nearest neighbor graph. (In practice, we remove the columns and rows of the outer points in the similarity matrix.)

2.3. Finding connected components by clustering

Removing the outer points breaks the connections inside the supergraph and creates subgraphs. In this situation, we can perform straightforward clustering by identifying the connected components and setting them as clusters. A connected component is a subgraph in which any two nodes are connected to each other by paths and which is not connected to additional nodes in the supergraph. If we identify all the connected components in the supergraph, the clustering process is complete.

It is straightforward to determine the connected components of a graph using a nearest neighbor search [22]. A search beginning at a node will identify the entire connected component before returning. To identify all the connected components of a graph, the nearest neighbors of the nodes in each connected component set are iteratively obtained and added to the set. When searching for a nearest neighbor, the search considers only nodes that are not already present in the set. We execute the algorithm until no more nearest neighbors are identified. Having identified a connected component in this manner, we seed an index that has not yet been searched and then apply the nearest neighbor search to find a new connected component. Finally, when all the nodes have been searched, the algorithm terminates and the group index vector is returned.

2.4. Assigning outer points to the nearest cluster

In this step, we assign the outer points to the clusters that have already been identified. If an outer point belongs to a neighboring set of previously clustered nodes, it is assigned to that cluster. Each outer point is assigned to a cluster via a k-nearest neighbor classification algorithm. In our experimental study, we varied the number of nearest neighbors (k) from one to 20 and found that an appropriate value, which does not affect the final clustering results, is 10. The outer-point assignment scheme is expressed as follows:

y_i = \arg\max_{C_m} \sum_{j \in C_m} e_{ij}, \quad i \in O,   (14)

where C_m denotes the mth cluster identified by the OPS and y_i is the cluster label assigned to outer point i. In the previous step, the clusters reflecting the intrinsic structure have been reasonably identified. Therefore, the straightforward k-nearest neighbor algorithm performs adequately in assigning the outer points to clusters. In cases where the number of features is large and the performance of a distance-based classification algorithm is inadequate, a classifier such as a regularization method or partial least squares can be used to address the dimensionality issue.

2.5. Computational complexity

This section examines the computational complexity of the proposed OPS. Constructing the k-nearest neighbor graph consists of
Fig. 2. Outer-point identification results for the sparsity parameter λ1 : (A) λ1 = 0.1, (B) λ1 = 0.2, (C) λ1 = 0.3, (D) λ1 = 0.4, and (E) λ1 = 0.5. The other parameters λ2 and k
were set to 0.3 and 150, respectively. The blue dots denote original data, and the red dots denote outer points. (F) illustrates the shaved result, with outer points removed,
when λ1 = 0.5.
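Continuing the sketch given after Section 2.2, the node cutting of Definition 2, the connected-component search of Section 2.3, and the assignment rule of Eq. (14) can be illustrated as follows. This is a minimal sketch under our own assumptions (binary adjacency matrix S and coefficient vector beta from the reconstruction step; all function names are hypothetical), not the authors' implementation.

```python
import numpy as np
from collections import deque


def node_cutting(S, beta):
    """Definition 2: drop outer points O = {i : beta_i = 0} and their edges."""
    outer = np.where(np.isclose(beta, 0.0))[0]
    inner = np.setdiff1d(np.arange(S.shape[0]), outer)
    return S[np.ix_(inner, inner)], inner, outer   # shaved similarity matrix


def connected_components(S_inner):
    """Breadth-first search over the shaved graph; each component is a cluster."""
    n = S_inner.shape[0]
    labels = -np.ones(n, dtype=int)
    current = 0
    for seed in range(n):
        if labels[seed] >= 0:
            continue                                # node already assigned
        labels[seed] = current
        queue = deque([seed])
        while queue:
            node = queue.popleft()
            for nbr in np.flatnonzero(S_inner[node]):
                if labels[nbr] < 0:
                    labels[nbr] = current
                    queue.append(nbr)
        current += 1
    return labels


def assign_outer_points(S, inner, outer, inner_labels):
    """Eq. (14): give each outer point the cluster it is most connected to."""
    labels = -np.ones(S.shape[0], dtype=int)
    labels[inner] = inner_labels
    for i in outer:
        weights = np.bincount(inner_labels, weights=S[i, inner])
        # If an outer point has no edge to any inner point, the paper instead
        # relies on a k-nearest neighbor classifier on the original features.
        labels[i] = int(np.argmax(weights)) if weights.sum() > 0 else 0
    return labels


# Example usage:
# S_inner, inner, outer = node_cutting(S, beta)
# labels = assign_outer_points(S, inner, outer, connected_components(S_inner))
```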
Fig. 3. Distributions of the simulation data: (A) three clusters, (B) two densities, (C) forest, (D) crescent full moon, (E) two kernels, (F) two spirals, (G) chameleon1, (H)
chameleon2, and (I) chameleon3.
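The paper does not give generating equations for the simulation patterns shown in Fig. 3. Purely for illustration, one common way to produce a two-spirals pattern, together with the uniform bounding-box noise used later in the noise sensitivity analysis of Section 3.4, is sketched below; the function name, parameter values, and spiral construction are our own assumptions.

```python
import numpy as np


def two_spirals_with_noise(n_per_arm=500, noise_ratio=0.1, seed=0):
    """Two interleaved spirals plus uniformly distributed noise points.

    Noise points are drawn uniformly between the per-coordinate minimum and
    maximum of the clean data, at a ratio of the original sample size
    (cf. Section 3.4); the spiral construction itself is illustrative only.
    """
    rng = np.random.default_rng(seed)
    t = np.sqrt(rng.uniform(0.25, 1.0, n_per_arm)) * 3.0 * np.pi
    arm = np.column_stack((t * np.cos(t), t * np.sin(t)))
    X = np.vstack((arm, -arm)) + rng.normal(scale=0.2, size=(2 * n_per_arm, 2))
    y = np.repeat([0, 1], n_per_arm)

    n_noise = int(noise_ratio * len(X))
    lo, hi = X.min(axis=0), X.max(axis=0)
    noise = rng.uniform(lo, hi, size=(n_noise, 2))
    y_noise = -np.ones(n_noise, dtype=int)          # -1 marks injected noise
    return np.vstack((X, noise)), np.concatenate((y, y_noise))
```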
number of clusters. Note that in the proposed OPS, it is unnecessary to predefine the number of clusters because this number can be intuitively determined during the adjustment of the sparsity parameter. Section 3.3 presents the details of the parameter search for the proposed method.

To compare the performances of the proposed and comparative methods, we used two performance measures: the adjusted Rand index [27] and the Hubert index [28], which are both variants of the Rand index [29]. The Rand index is a measure of agreement between two data partitions. The adjusted Rand index is a normalized version of the Rand index [27], which measures the degree of agreement more precisely by removing the effect of the expected similarity. Given two partitions A = {A_1, A_2, ..., A_r} and B = {B_1, B_2, ..., B_s} of n data points, the adjusted Rand index is defined as follows:

\text{Adjusted Rand Index} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[ \sum_i \binom{n_{i\cdot}}{2} \sum_j \binom{n_{\cdot j}}{2} \right] \Big/ \binom{n}{2}}{\frac{1}{2} \left[ \sum_i \binom{n_{i\cdot}}{2} + \sum_j \binom{n_{\cdot j}}{2} \right] - \left[ \sum_i \binom{n_{i\cdot}}{2} \sum_j \binom{n_{\cdot j}}{2} \right] \Big/ \binom{n}{2}},   (15)

where n_ij is the number of data points in A_i ∩ B_j, n_{i\cdot} = \sum_j n_{ij}, and n_{\cdot j} = \sum_i n_{ij}. On the other hand, the Hubert index is calculated by subtracting the disagreement proportion of partitions from the Rand index. The Hubert index is defined as follows:

\text{Hubert Index} = \frac{\binom{n}{2} + 4 \sum_{ij} \binom{n_{ij}}{2} - 2 \left[ \sum_i \binom{n_{i\cdot}}{2} + \sum_j \binom{n_{\cdot j}}{2} \right]}{\binom{n}{2}}.   (16)

Intuitively, the Hubert index measures the correlation between two data partitions.

3.2. Clustering performance comparison

Tables 2 and 3 present the clustering performance results in terms of the adjusted Rand index and Hubert index, respectively. The proposed OPS method outperformed the other clustering algorithms, exhibiting the highest adjusted Rand index and Hubert index across all the data sets. The OPS method performed adequately for local and nonlinear pattern data, such as the chameleons, crescent full moon, half kernels, and two spirals. The OPS method outperformed k-means and affinity propagation even for local and linear pattern data, such as forest and three clusters. Fig. 4 graphically illustrates the clustering results for the proposed OPS method. From the numerical results, we can verify that when
Table 2
Clustering performance results for simulation data measured by the adjusted Rand index.
Three clusters 0.8704 0.8760 0.8417 0.8750 0.8620 0.8694 0.8713 0.8825
Two densities 0.6971 0.6991 0.5379 0.5931 0.7725 0.8262 0.8915 0.9447
Forest 0.8124 0.8128 0.6010 0.7833 0.6340 0.7613 0.8466 0.9021
Crescent full moon 0.0002 0.0104 0.7485 0.7781 0.4789 0.5059 0.6939 0.8006
Half kernels 0.0018 0.0042 0.7645 0.7472 0.1913 0.5358 0.7120 0.8183
Two spirals 0.0430 0.0472 0.8386 0.7564 0.0564 0.6841 0.7918 0.8506
Chameleon1 0.5608 0.6003 0.9275 0.9335 0.7936 0.7945 0.8961 0.9673
Chameleon2 0.3451 0.3867 0.8742 0.8637 0.7524 0.8191 0.9156 0.9927
Chameleon3 0.3638 0.4034 0.8173 0.8891 0.6789 0.7587 0.9330 0.9931
Table 3
Clustering performance results for simulation data measured by the Hubert index.
Three clusters 0.8848 0.8898 0.8718 0.8488 0.8773 0.8949 0.8558 0.8956
Two densities 0.6971 0.6986 0.5379 0.6270 0.7725 0.8835 0.9168 0.9447
Forest 0.9534 0.9534 0.8747 0.9045 0.9000 0.9500 0.9612 0.9757
Crescent full moon 0.0002 0.0107 0.7655 0.7936 0.4789 0.3288 0.6402 0.8003
Half kernels 0.0018 0.0042 0.7645 0.7449 0.1328 0.0901 0.4261 0.8216
Two spirals 0.0430 0.0489 0.8556 0.8320 0.0564 0.1105 0.4669 0.8506
Chameleon1 0.7338 0.7609 0.9567 0.8905 0.8687 0.8889 0.9299 0.9795
Chameleon2 0.6663 0.6857 0.9196 0.8644 0.8381 0.7443 0.8169 0.9956
Chameleon3 0.6789 0.6930 0.8842 0.8716 0.8221 0.6929 0.7731 0.9961
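For reference, the adjusted Rand and Hubert indices reported in Tables 2 and 3 (Eqs. (15) and (16)) can be computed directly from the contingency table of two partitions. The sketch below is our own illustration, assuming integer cluster labels 0, 1, ..., K−1 in NumPy arrays; it is not code from the paper (scikit-learn's adjusted_rand_score yields the same adjusted Rand index).

```python
import numpy as np
from scipy.special import comb


def pair_counts(labels_a, labels_b):
    """Contingency-table pair counts used in Eqs. (15) and (16)."""
    table = np.zeros((labels_a.max() + 1, labels_b.max() + 1))
    for a, b in zip(labels_a, labels_b):
        table[a, b] += 1
    nij = comb(table, 2).sum()                 # sum_ij C(n_ij, 2)
    ni = comb(table.sum(axis=1), 2).sum()      # sum_i  C(n_i., 2)
    nj = comb(table.sum(axis=0), 2).sum()      # sum_j  C(n_.j, 2)
    n2 = comb(len(labels_a), 2)                # C(n, 2)
    return nij, ni, nj, n2


def adjusted_rand_index(labels_a, labels_b):
    nij, ni, nj, n2 = pair_counts(labels_a, labels_b)
    expected = ni * nj / n2
    return (nij - expected) / (0.5 * (ni + nj) - expected)   # Eq. (15)


def hubert_index(labels_a, labels_b):
    nij, ni, nj, n2 = pair_counts(labels_a, labels_b)
    return (n2 + 4.0 * nij - 2.0 * (ni + nj)) / n2           # Eq. (16)
```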
Fig. 4. Graphical clustering results for the proposed OPS algorithm: (A) three clusters, (B) two densities, (C) forest, (D) crescent full moon, (E) half kernels, (F) two spirals,
(G) chameleon1, (H) chameleon2, and (I) chameleon3.
Fig. 5. Parameter study on the number of nearest neighbors to construct the similarity graph (k) and the sparsity adjusting parameter λ1: (A) three clusters, (B) two densities,
(C) forest, (D) crescent full moon, (E) half kernels, (F) two spirals, (G) chameleon1, (H) chameleon2, and (I) chameleon3. The colored bars indicate that the height of the bar
is equal to the actual number of clusters.
applied to noisy data, the proposed OPS method provides robust clustering capability.

3.3. Determining the number of clusters

The number of clusters is an important parameter for clustering. In general, the number of clusters that maximizes a clustering performance measure is selected as the optimum. However, different clustering performance measures provide different optimal numbers of clusters [20]. This section demonstrates that the proposed OPS can determine the number of clusters in a straightforward manner. In addition, we conducted a sensitivity analysis of the parameters, namely the number of nearest neighbors k used to construct the k-nearest neighbor graph and the smoothing parameter λ2.

Fig. 5 illustrates the parameter study results. The x-axis represents the sparsity adjusting parameter λ1, the y-axis represents the number of nearest neighbors (k) used to construct the k-nearest neighbor graph, and the z-axis represents the number of clusters. The colored bars indicate that the number of clusters determined by the OPS is equal to the actual number of clusters. The sparsity parameter λ1 was varied from 0 to 0.75 at intervals of 0.05. The number of nearest neighbors was varied from 60 to 150 and from 260 to 350, at intervals of 10. The experimental results have the following two implications:

First, the results demonstrate that the sparsity parameter λ1 is a key factor in determining the number of clusters. Given the number of nearest neighbors k, the number of clusters varies relatively significantly as λ1 changes. It is worth noting that the number of clusters is stable as λ1 varies when the number of clusters determined by the OPS is equal to the actual number of clusters. If λ1 is small, the number of outer points removed is insufficient to reveal the actual cluster structures, so connections between the clusters are maintained; these redundant connections result in a number of clusters that is lower than the actual number. As λ1 increases, a sufficient number of outer points are removed, and the number of clusters becomes equal to the actual number of clusters. In this state, because the intrinsic cluster structure has been determined, the number of clusters does not change significantly regardless of the number of outer points removed. If λ1 increases excessively, the intrinsic cluster structure is divided into smaller clusters, so the number of clusters increases; alternatively, the number of clusters can decrease because all the points composing a cluster are removed. Through these results, we can determine the number of clusters by simply identifying an interval in which the number of clusters is stable regardless of λ1. Note that the affinity propagation algorithm [25] also determines the number of clusters by identifying a stable interval, as is suggested for the proposed OPS method.
Fig. 6. Parameter study on the sparsity adjusting parameter λ1 and smoothing parameter λ2: (A) three clusters, (B) two densities, (C) forest, (D) crescent full moon, (E) half
kernels, (F) two spirals, (G) chameleon1, (H) chameleon2, and (I) chameleon3. The colored bars indicate that the height of the bar is equal to the actual number of clusters.
Fig. 8. Noise sensitivity analysis: (A) three clusters, (B) two densities, (C) forest, (D) crescent full moon, (E) half kernels, (F) two spirals, (G) chameleon1, (H) chameleon2,
and (I) chameleon3.
3.4. Noise sensitivity analysis

We analyzed the robustness of the proposed OPS method by examining its clustering performance with different amounts of noise in a data set. We used the nine simulation data sets and randomly added noise points, with the noise ratio α increased from 0 to 50%. The noise ratio is the ratio of the number of added noise points to the number of given data points. The noise points are generated from a uniform distribution whose maximum and minimum values for each coordinate are estimated from the given data set. Fig. 7 illustrates an example plot of noise-added data. To measure the quality of clustering, we averaged the results over 10 data sets for each noise parameter. When calculating the clustering index, only the given data points are considered. The parameters for each clustering method were optimized for the data with a noise ratio of 0% and kept fixed while the noise ratio was increased. We determined the parameters for the proposed OPS based on the parameter sensitivity analysis.

Fig. 8 presents the results for the proposed method and the seven clustering benchmark methods. The x-axis represents the noise ratio, and the y-axis represents the adjusted Rand index obtained by averaging over 10 data sets. A method that maintains a larger adjusted Rand index over the noise ratio is considered the better one. We can observe that the proposed method evidently outperformed the other methods in terms of robustness to noise: the adjusted Rand index values for the proposed OPS tended to be higher than those of the other methods and exhibited smaller changes as the noise ratio increased. This implies that the proposed method efficiently removed noise data and thus accurately identified the intrinsic cluster structures. If the amount of noise data is increased, many similarity edges of high weight are generated by these noise data. Because spectral clustering is based on edge cutting, it is challenging to identify an optimal edge cut that reflects the intrinsic cluster structures. Noise-robust spectral clustering achieves a higher performance than spectral clustering; however, being based on edge cutting, it also exhibited poor performance as the noise increased. The clustering performance of DBSCAN also degraded significantly with an increase in noise because the method is sensitive to its tuning parameters: an increase in noise changes the distribution of the base data and requires the DBSCAN parameters to be adjusted. On the other hand, the proposed OPS method is robust to noise because it is based on node cutting, so it removes the generated noise data and identifies the intrinsic cluster structures.

3.5. Experimental results of computational complexity

We presented the theoretical results on the computational complexity of the proposed OPS in Section 2.5. This section examines the experimental complexity of the OPS. To generate the data sets, we referred to the distribution of the three clusters data set. We checked the computational time by varying the number of observations (n = 6,000, 12,000, …, 30,000) and features (p = 10, 20, …, 50). To perform this experiment, we used MATLAB to implement the OPS on a personal computer (Intel® Core™ i7-8700 CPU @ 3.20 GHz, 32.00 GB RAM). Fig. 9 illustrates the results. When the number of observations is relatively small (less than or equal to 12,000), the experimental complexity is O(n^1.89); as the number of observations doubles, the computational time increases by a factor of about 3.7 (2^1.89 ≈ 3.7). As the number of observations increases to 30,000, the experimental complexity increases to approximately O(n^2.21). We expect that the computational complexity asymptotically converges to the theoretical result O(n^3). Note that the results show that the impact of the number of features on the computational time is limited.

Fig. 9. Experimental results of the computational complexity. The x-axis and y-axis represent the number of observations and computational time, respectively.

4. Case study

To demonstrate the applicability to real situations, we conducted experiments on seven benchmark data sets, summarized in Table 4. The segment data were drawn randomly from a database of seven outdoor images, which were hand-segmented to create a label for each pixel; the data set contains features extracted for image segmentation. The Modified National Institute of Standards and Technology (MNIST) data set contains handwritten digits and is commonly used for training and testing in the field of machine learning; in our case study, we used the testing part only. The Canadian Institute for Advanced Research (CIFAR) data set contains images of animals and vehicles, separated into 10 classes; the original color images were transformed to grayscale images, and our case study used only the testing part of the data set. The Columbia Object Image Library (COIL) contains grayscale images of 20 objects, which were placed on a motorized turntable against a black background; the turntable was rotated through 360° to vary the object pose, and we resized the images from 128 × 128 to 32 × 32. The human activity data set contains accelerometer and gyroscope sensor data measured by smartphones to classify human activities into six categories (walking, walking upstairs, walking downstairs, sitting, standing, and lying down). The pen digit data set contains coordinate information for digit pixels from a hand-written digit data set; the researchers resampled eight points of the pixels determined by the x and y coordinates, such that the data set has 16 features. The gesture phase data set contains features extracted from seven videos of people gesticulating. The United States Postal Service (USPS) data set is composed of 11,000 scaled handwritten digit images.

Table 4
Summary of the actual case data sets.

Table 5 presents the clustering results in terms of the adjusted Rand index. The proposed OPS method outperformed the other clustering algorithms, exhibiting the highest clustering performance for five out of seven data sets. Although NRSC and KM achieved the highest clustering results for the COIL and gesture phase data sets, the OPS performed comparably. Table 6 presents the clustering results measured by the Hubert index, indicating that the OPS method achieved the best results for five out of seven data sets. These results demonstrate that the proposed OPS method provides robust and accurate clustering performance when applied to actual case data.
Table 5
Clustering performance results for actual case data measured by the adjusted Rand index.
Table 6
Clustering performance results for actual case data measured by the Hubert index.
For the COIL and gesture phase data, RSC and KM achieved good clustering performance. In these data sets, there are large correlations between the observations: COIL is image data obtained by taking pictures of each object at various angles, and the gesture phase data record the change of motion sensors over time, so COIL has large spatial correlations and the gesture phase data have large temporal correlations between observations. In this situation, an approach that removes observations, as the OPS does, can easily break graph connections inside clusters and lead to degraded clustering performance. For the USPS data, graph-based methods such as the OPS and RSC showed good performance, which suggests that the data have nonlinear characteristics. In terms of the adjusted Rand index the OPS performs better, whereas in terms of the Hubert index RSC achieves better performance. We believe that there are few noisy points in these data, so the difference between the two methods was not significant. Based on the results of the noise sensitivity analysis, we expect a more obvious difference in performance when there is substantial noise in the data.

5. Conclusions

In this study, we have examined the problem of robustly clustering data points on a similarity graph with noise. We proposed the OPS method to detect and remove outer points to reveal the intrinsic cluster structure. Determining outer points is a type of best subset selection problem, which is NP-hard. To solve the problem efficiently, we relaxed the mixed integer programming formulation to a convex optimization formulation. Through a simulation and an actual case study, we compared the performance of the proposed OPS method with those of other clustering methods. We observed that the proposed OPS method outperformed the other methods in that it clustered the noisy data accurately and exhibited a robust performance when the proportion of noise was increased. Although the method shows promising results, the formulation contains the symmetric similarity matrix, which increases the memory and computational complexity to O(n²). To handle this issue, it would be interesting to solve the formulation with the alternating direction method of multipliers, which can solve the formulation in parallel. The OPS also has some limitations in clustering data that have large temporal or spatial correlations between observations. We expect that adding constraints that consider these correlations could address this limitation.

The proposed method shows that the clustering problem can be solved optimally and efficiently by regularized pseudo-density reconstruction. Depending on how the regularization is used, there is potential for solving various other pattern recognition problems. One direction is to extend our method to a label propagation algorithm for semi-supervised learning: we can derive the solution path of the elastic net over the sparsity parameter, and the solution path can be used to obtain an effective label propagation path and accurately predict the labels of unlabeled data. Another direction is to use the proposed method as a noise removal step to enhance classification performance: the proposed OPS removes low-density points in the nearest neighbor graph, so we can construct the graph from a classification data set and remove the noisy points with the pseudo-density reconstruction with regularization. As the noise sensitivity analysis shows, we expect that the method would efficiently reduce the degree of noise while preserving the normal data points. In addition, an interesting future study is to extend the proposed method to identify outliers. The outer points provided by the OPS algorithm are possible outliers because they lie in low-density regions. Therefore, it would be worthwhile to consider an additional decision rule that identifies outliers among the outer points instead of directly assigning cluster labels to all of them.

Acknowledgments

The authors would like to thank the editor and reviewers for their useful comments and suggestions, which were of great help in improving the quality of the paper. This research was supported by Brain Korea PLUS; the Basic Science Research Program through the National Research Foundation of Korea, funded by the Ministry of Science, ICT and Future Planning (NRF-2016R1A2B1008994); the Ministry of Trade, Industry & Energy under the Industrial Technology Innovation Program (R1623371); and by an Institute for Information & communications Technology Promotion grant funded by the Korea government (No. 2018-0-00440, ICT-based Crime Risk Prediction and Response Platform Development for Early Awareness of Risk Situation).

References

[1] X.L. Sui, L. Xu, X. Qian, T. Liu, Convex clustering with metric learning, Pattern Recognit. 81 (2018) 575–584. https://doi.org/10.1016/j.patcog.2018.04.019.
[2] Q. Li, Y. Ren, L. Li, W. Liu, Fuzzy based affinity learning for spectral clustering, Pattern Recognit. 60 (2016) 531–542. https://doi.org/10.1016/j.patcog.2016.06.011.
[3] J.A. Hartigan, M.A. Wong, Algorithm AS 136: a k-means clustering algorithm, J. R. Stat. Soc. C-Appl. 28 (1979) 100–108. https://doi.org/10.2307/2346830.
[4] A. Lukasová, Hierarchical agglomerative clustering procedure, Pattern Recognit. 11 (1979) 365–381. https://doi.org/10.1016/0031-3203(79)90049-9.
[5] F. De Morsier, D. Tuia, M. Borgeaud, V. Gass, J.-P. Thiran, Cluster validity measure and merging system for hierarchical clustering considering outliers, Pattern Recognit. 48 (2015) 1478–1489. https://doi.org/10.1016/j.patcog.2014.10.003.
[6] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Trans. Pattern Anal. Mach. Intell. 24 (2002) 603–619. https://doi.org/10.1109/34.1000236.
[7] Y. Qin, Z.L. Yu, C.-D. Wang, Z. Gu, Y. Li, A novel clustering method based on hybrid k-nearest-neighbor graph, Pattern Recognit. 74 (2018) 1–14. https://doi.org/10.1016/j.patcog.2017.09.008.
[8] A.Y. Ng, M.I. Jordan, Y. Weiss, On spectral clustering: analysis and an algorithm, in: Advances in Neural Information Processing Systems, 2002, pp. 849–856.
[9] W. Jiang, W. Liu, F.L. Chung, Knowledge transfer for spectral clustering, Pattern Recognit. 81 (2018) 484–496. https://doi.org/10.1016/j.patcog.2018.04.018.
[10] U. von Luxburg, A tutorial on spectral clustering, Stat. Comput. 17 (2007) 395–416. https://doi.org/10.1007/s11222-007-9033-z.
[11] Z. Li, J. Liu, S. Chen, X. Tang, Noise robust spectral clustering, in: IEEE 11th International Conference on Computer Vision, 2007, pp. 1–8. https://doi.org/10.1109/ICCV.2007.4409061.
[12] H. Huang, S. Yoo, H. Qin, D. Yu, A robust clustering algorithm based on aggregated heat kernel mapping, in: IEEE 11th International Conference on Data Mining, 2011, pp. 270–279. https://doi.org/10.1109/ICDM.2011.15.
[13] A. Bojchevski, Y. Matkovic, S. Günnemann, Robust spectral clustering for noisy data: modeling sparse corruptions improves latent embeddings, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 737–746. https://doi.org/10.1145/3097983.3098156.
[14] X. Li, B. Kao, S. Luo, M. Ester, ROSC: robust spectral clustering on multi-scale data, in: Proceedings of the 2018 World Wide Web Conference on World Wide Web, 2018, pp. 157–166. https://doi.org/10.1145/3178876.3185993.
[15] K. Kim, J. Lee, Nonlinear dynamic projection for noise reduction of dispersed manifolds, IEEE Trans. Pattern Anal. Mach. Intell. 36 (2014) 2303–2309. https://doi.org/10.1109/TPAMI.2014.2318727.
[16] B.K. Natarajan, Sparse approximate solutions to linear systems, SIAM J. Comput. 24 (1995) 227–234. https://doi.org/10.1137/S0097539792240406.
[17] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. B-Met. (1996) 267–288.
[18] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, J. R. Stat. Soc. B 67 (2005) 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x.
[19] J. Friedman, T. Hastie, H. Höfling, R. Tibshirani, Pathwise coordinate optimization, Ann. Appl. Stat. 1 (2007) 302–332. https://doi.org/10.1214/07-AOAS131.
[20] J. Liu, S. Ji, J. Ye, SLEP: Sparse Learning with Efficient Projections, 6, Arizona State University, 2009, p. 7.
[21] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Found. Trends Mach. Learn. 3 (2011) 1–122. https://doi.org/10.1561/2200000016.
[22] G. Bounova, O. de Weck, Overview of metrics and their correlation patterns for multiple-metric topology analysis on heterogeneous graph ensembles, Phys. Rev. E 85 (2012) 016117. https://doi.org/10.1103/PhysRevE.85.016117.
[23] P. Fränti, O. Virmajoki, Iterative shrinking method for clustering problems, Pattern Recognit. 39 (2006) 761–775. https://doi.org/10.1016/j.patcog.2005.09.012.
[24] G. Karypis, E.-H. Han, V. Kumar, Chameleon: hierarchical clustering using dynamic modeling, Computer 32 (1999) 68–75. https://doi.org/10.1109/2.781637.
[25] B.J. Frey, D. Dueck, Clustering by passing messages between data points, Science 315 (2007) 972–976. https://doi.org/10.1126/science.1136800.
[26] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: KDD, 1996, pp. 226–231.
[27] L. McInnes, J. Healy, Accelerated hierarchical density based clustering, in: 2017 IEEE International Conference on Data Mining Workshops, 2017, pp. 33–42. https://doi.org/10.1109/ICDMW.2017.12.
[28] L. Hubert, P. Arabie, Comparing partitions, J. Classif. 2 (1985) 193–218. https://doi.org/10.1007/BF01908075.
[29] L.J. Hubert, F.B. Baker, The comparison and fitting of given classification schemes, J. Math. Psychol. 16 (1977) 233–253. https://doi.org/10.1016/0022-2496(77)90054-2.
[30] C. Blake, C. Merz, UCI Repository of Machine Learning Databases, University of California, Department of Information and Computer Science, Irvine, CA, 1998.
[31] Y. LeCun, The MNIST database of handwritten digits, http://yann.lecun.com/exdb/mnist/, 1998.
[32] S.A. Nene, S.K. Nayar, H. Murase, Columbia Object Image Library (COIL-20), 1996.
[33] D. Anguita, A. Ghio, L. Oneto, X. Parra, J.L. Reyes-Ortiz, A public domain dataset for human activity recognition using smartphones, ESANN, 2013.
[34] F. Alimoglu, E. Alpaydin, Y. Denizhan, Combining multiple classifiers for pen-based handwritten digit recognition, 1996.
[35] R.C.B. Madeo, S.M. Peres, C.A. de Moraes Lima, Gesture phase segmentation using support vector machines, Expert Syst. Appl. 56 (2016) 100–115. https://doi.org/10.1016/j.eswa.2016.02.021.
[36] J.J. Hull, A database for handwritten text recognition research, IEEE Trans. Pattern Anal. Mach. Intell. 16 (1994) 550–554. https://doi.org/10.1109/34.291440.

Younghoon Kim received a Ph.D. in Industrial Management Engineering in 2019 from Korea University, Seoul, Korea. His research interests include feature selection algorithms for high-dimensional data and discrete optimization-based machine learning methods.

Hyungrok Do received his B.S. in 2014 and is currently a Ph.D. candidate in the Department of Industrial Management Engineering at Korea University, Seoul, Korea. His research interests include machine learning algorithms and optimization.

Seoung Bum Kim is a Professor in the Department of Industrial Management Engineering at Korea University. From 2005–2009, he was an Assistant Professor in the Department of Industrial & Manufacturing Systems Engineering at the University of Texas at Arlington. He received an M.S. in Industrial and Systems Engineering in 2001, an M.S. in Statistics in 2004, and a Ph.D. in Industrial and Systems Engineering in 2005 from the Georgia Institute of Technology. Dr. Kim's research interests utilize data mining methodologies to create new methods for various problems appearing in engineering and science. He has expertise in machine learning algorithms for feature extraction/selection problems. He has published more than 100 internationally recognized journal papers and refereed conference proceedings. He was awarded the Jack Youden Prize for the best expository paper in Technometrics for the year 2003. He is actively involved with INFORMS, serving as president of the INFORMS Section on Data Mining.