K-Means Clustering With Outlier Removal

G. Gan∗, M.K.-P. Ng

Pattern Recognition Letters 90 (2017) 8–14
http://dx.doi.org/10.1016/j.patrec.2017.03.008

Article history: Received 5 July 2016; Available online 8 March 2017
MSC: 62H30, 68T10, 91C20, 62P10
Keywords: Data clustering; k-means; Outlier detection

Abstract: Outlier detection is an important data analysis task in its own right and removing the outliers from clusters can improve the clustering accuracy. In this paper, we extend the k-means algorithm to provide data clustering and outlier detection simultaneously by introducing an additional “cluster” to the k-means algorithm to hold all outliers. We design an iterative procedure to optimize the objective function of the proposed algorithm and establish the convergence of the iterative procedure. Numerical experiments on both synthetic data and real data are provided to demonstrate the effectiveness and efficiency of the proposed algorithm.

∗ Corresponding author. E-mail address: [email protected] (G. Gan).
1. Introduction

The goal of data clustering is to identify homogeneous groups or clusters from a set of objects. In other words, data clustering aims to divide a set of objects into groups or clusters such that objects in the same cluster are more similar to each other than to objects from other clusters [3,11]. As an unsupervised learning process, data clustering is often used as a preliminary step for data analytics. For example, data clustering is used to identify the patterns hidden in gene expression data [30], to produce a good quality of clusters or summaries for big data to address the associated storage and analytical issues [9], and to select representative insurance policies from a large portfolio in order to build metamodels [12,14].

Many clustering algorithms have been developed in the past sixty years. Among these algorithms, the k-means algorithm is one of the oldest and most commonly used clustering algorithms [22,31]. Despite being used widely, the k-means algorithm has several drawbacks. One drawback is that it is sensitive to noisy data and outliers. For example, the k-means algorithm is not able to recover correctly the two clusters shown in Fig. 1(a) due to the outliers. As we can see from Fig. 1(b), three points were clustered incorrectly.

Motivated by Dave and Krishnapuram [7], we propose in this paper the KMOR (k-means with outlier removal) algorithm by extending the k-means algorithm for outlier detection. Dave and Krishnapuram [7] proposed to use an additional “cluster” for the fuzzy c-means algorithm to hold all outliers. In the KMOR algorithm, we use the same idea of introducing an additional “cluster” that contains all outliers. Given a desired number of clusters k, the KMOR algorithm partitions the dataset into k + 1 groups, which include k clusters and a group of outliers that cannot fit into the k clusters. Unlike most existing clustering algorithms with outlier detection, the KMOR algorithm assigns all outliers into a group naturally during the clustering process.

The remaining part of this paper is organized as follows. In Section 2, we give a review of clustering algorithms that can detect outliers. In Section 3, we present the KMOR algorithm in detail. In Section 4, we demonstrate the performance of the KMOR algorithm with numerical results on both synthetic and real datasets. Section 5 concludes the paper with some remarks.

2. Related work

Kadam and Pund [27] and Aggarwal [2, Chapter 8] reviewed several approaches to detect outliers, including the cluster-based approach. Aggarwal [1] devoted a whole book to outlier analysis. Yu et al. [36] proposed the OEDP k-means algorithm by removing outliers from the dataset before applying the k-means algorithm. Aparna and Nair [5] proposed the CHB-K-Means algorithm by using a weighted attribute matrix to detect outliers. Jiang et al. [24] proposed two initialization methods for the k-modes algorithm to choose initial cluster centers that are not outliers.
Fig. 1. An illustration showing that the k-means algorithm is sensitive to outliers. (a) A data set with two clusters and two outliers. The two clusters are plotted by triangles
and circles, respectively. The two outliers are denoted by plus signs. (b) Two clusters found by the k-means algorithm. The two found clusters are plotted by triangles and
circles, respectively.
Although much work has been done on outlier analysis, few of them perform clustering and detect outliers simultaneously. In this section, we focus on clustering methods with a built-in mechanism of outlier detection and give a review of those methods.

Jiang et al. [25] proposed a two-phase clustering algorithm for outlier detection. In the first phase, the k-means algorithm is modified to partition the data in such a way that a data point is assigned to be a new cluster center if the data point is far away from all clusters. In the second phase, a minimum spanning tree is constructed based on the cluster centers obtained from the first phase. Clusters in small subtrees are considered as outliers. He et al. [19] introduced the concept of cluster-based local outlier and designed a measure, called the cluster-based local outlier factor (CBLOF), to identify such outliers.

Hautamäki et al. [18] proposed the ORC (Outlier Removal Clustering) algorithm to identify clusters and outliers from a dataset simultaneously. The ORC algorithm consists of two consecutive stages: the first stage is a pure k-means algorithm; the second stage iteratively removes the data points that are far away from their cluster centroids. Rehm et al. [34] defined outliers in terms of a noise distance. The data points that are about the noise distance or further away from any other cluster centers get high membership degrees to the outlier cluster.

Jiang and An [26] also proposed a two-stage algorithm, called CBOD (Clustering Based Outlier Detection), to detect outliers from datasets. In the first stage, a one-pass clustering algorithm is applied to divide a dataset into hyperspheres with almost the same radius. In the second stage, outlier factors for all clusters obtained from the first stage are calculated and the clusters are sorted according to their outlier factors. Clusters with high outlier factors are considered outliers. Zhou et al. [38] proposed a three-stage k-means algorithm to cluster data and detect outliers. In the first stage, the fuzzy c-means algorithm is applied to cluster the data. In the second stage, local outliers are identified and the cluster centers are recalculated. In the third stage, certain clusters are merged and global outliers are identified. Zhang et al. [37] introduced a measure called the Local Distance-based Outlier Factor (LDOF) to measure the outlier-ness of objects in scattered datasets. Pamula et al. [33] used the k-means algorithm to prune some points around the cluster centers and the LDOF measure to identify outliers from the remaining points. Jayakumar and Thomas [23] proposed an approach to detect outliers based on the Mahalanobis distance.

Ahmed and Naser [4] proposed the ODC (Outlier Detection and Clustering) algorithm to detect outliers. The ODC algorithm is a modified version of the k-means algorithm. In the ODC algorithm, a data point that is at least p times the average distance away from its centroid is considered as an outlier. Chawla and Gionis [6] proposed the k-means– algorithm to provide data clustering and outlier detection simultaneously. The k-means– algorithm requires two parameters, k and l, which specify the desired number of clusters and the desired number of top outliers, respectively.

Ott et al. [32] extended the facility location formulation to model the joint clustering and outlier detection problem and proposed a subgradient-based algorithm to solve the resulting optimization problem. The model requires the pairwise distances of the dataset and the number of outliers as input. Whang et al. [35] proposed the NEO-k-means (Non-exhaustive Overlapping k-means) algorithm, which is also able to identify outliers during the clustering process.
Some of the aforementioned algorithms perform clustering and outlier detection in stages. In these algorithms, a clustering algorithm is used to divide the dataset into clusters and some measure is calculated for the data points based on the clusters to identify outliers. The ODC algorithm, the k-means– algorithm, and the NEO-k-means algorithm integrate outlier detection into the clustering process. However, data points that are removed as outliers during the iterative process of the ODC algorithm cannot be used as normal points again when the centroids are updated. Determining the parameters α and β in the NEO-k-means algorithm is time consuming.

3. The KMOR algorithm
In the KMOR algorithm, a data point that is at least γ × d_avg away from all the cluster centers is considered as an outlier, where γ is a multiplier and d_avg is the average distance calculated dynamically during the clustering process.

To describe the KMOR algorithm, let X = {x_1, x_2, ..., x_n} be a numerical dataset containing n data points, each of which is described by d numerical attributes. Let k be the desired number of clusters. Let U = (u_{il})_{n×(k+1)} be an n × (k + 1) binary matrix (i.e., u_{il} ∈ {0, 1}) such that for each i = 1, 2, ..., n,

\sum_{l=1}^{k+1} u_{il} = 1.    (1)

The binary matrix U has k + 1 columns. The last column of U is used to indicate whether a data point is an outlier. If x_i is an outlier, then u_{i,k+1} = 1. If x_i is a normal point, then u_{i,k+1} = 0. If u_{i,k+1} = 0, then u_{il} = 1 for some l ∈ {1, 2, ..., k}, where l is the index of the cluster to which x_i belongs. The binary matrix U is a partition matrix that divides the dataset X into k + 1 groups, which include k normal clusters and one special “cluster” that contains the outliers.

The KMOR algorithm divides X into k clusters and a group of outliers by minimizing the following objective function

P(U, Z) = \sum_{i=1}^{n} \left( \sum_{l=1}^{k} u_{il} \|x_i - z_l\|^2 + u_{i,k+1} D(U, Z) \right),    (2)

subject to

\sum_{i=1}^{n} u_{i,k+1} \le n_0,    (3)

where 0 ≤ n_0 < n is a parameter, Z = {z_1, z_2, ..., z_k} is a set of cluster centers, \|·\| is the L_2 norm, and

D(U, Z) = \frac{\gamma}{n - \sum_{j=1}^{n} u_{j,k+1}} \sum_{l=1}^{k} \sum_{j=1}^{n} u_{j,l} \|x_j - z_l\|^2.    (4)

Here γ ≥ 0 is also a parameter. The parameters n_0 and γ are used to control the number of outliers. How to select appropriate values for n_0 and γ is discussed at the end of this section. The first term of P(U, Z) is the objective employed in the standard k-means algorithm to put similar data points into a cluster. The second term of P(U, Z) is used to decide whether a point should be treated as an outlier, based on the average distance captured by D(U, Z).
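To make Eqs. (2)–(4) concrete, the following NumPy sketch evaluates the scaled average distance D(U, Z) and the objective P(U, Z) for a given binary assignment matrix U (with the outlier indicator in its last column) and a set of centers Z stored as the rows of a k × d array. The function names and the array layout are our own illustrative choices, not anything prescribed by the paper.

```python
import numpy as np

def kmor_scaled_average_distance(X, U, Z, gamma):
    """D(U, Z) from Eq. (4): gamma times the average squared distance of the
    points currently labelled normal to their assigned cluster centers."""
    n = X.shape[0]
    n_outliers = U[:, -1].sum()
    sq_dist = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)  # shape (n, k)
    within = (U[:, :-1] * sq_dist).sum()
    return gamma * within / (n - n_outliers)  # n0 < n keeps the denominator positive

def kmor_objective(X, U, Z, gamma):
    """P(U, Z) from Eq. (2): the k-means cost of the normal points plus
    D(U, Z) charged once for every point labelled as an outlier."""
    sq_dist = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    d = kmor_scaled_average_distance(X, U, Z, gamma)
    return (U[:, :-1] * sq_dist).sum() + U[:, -1].sum() * d
```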
The condition given in Eq. (3) limits the number of outliers to be at most n_0. In fact, the purpose of this condition is to make sure that

\sum_{i=1}^{n} u_{i,k+1} < n.

It is worth noting that this condition is necessary for the objective function to be nontrivial. Without the condition given in Eq. (3), the objective function reaches zero with u_{i,k+1} = 1 and u_{i,l} = 0 for i = 1, 2, ..., n and l = 1, 2, ..., k. In other words, the objective function without the condition is minimized when all data points are put into the group of outliers. When n_0 = 0, the condition given in Eq. (3) implies that u_{i,k+1} = 0 for i = 1, 2, ..., n. In this case, the KMOR objective function becomes the standard k-means objective function. When 0 < n_0 < n, the KMOR objective function also becomes the standard k-means objective function when γ → ∞: in this setting, the second term u_{i,k+1} D(U, Z) dominates, so the minimization procedure will favour u_{i,k+1} = 0, i.e., no points will be assigned to the group of outliers. It is also obvious from the KMOR objective function that we do not allow n_0 ≥ n, in order to prevent the algorithm from assigning all points to the group of outliers.

Like the k-means algorithm, the KMOR algorithm starts with a set of k initial cluster centers and then keeps updating U and Z until some stopping criterion is achieved. However, the objective function of the KMOR algorithm involves the interaction of the two sets of variables u_{i,l} (l = 1, 2, ..., k) and u_{i,k+1}. As a result, the iterative process of the KMOR algorithm is different from that of the standard k-means algorithm. To describe the iterative process of the KMOR algorithm, we define

Q(U, V, Z) = \sum_{i=1}^{n} \left( \sum_{l=1}^{k} u_{i,l} \|x_i - z_l\|^2 + u_{i,k+1} D(V, Z) \right),    (5)

where V = (v_{i,l})_{n×(k+1)} is an n × (k + 1) binary matrix that satisfies the conditions given in (1) and (3). Comparing Eqs. (2) and (5), we have Q(U, U, Z) = P(U, Z). According to Q(U, V, Z), we can optimize Q by solving three subproblems, Q(U, ·, ·), Q(·, V, ·), and Q(·, ·, Z), iteratively. The pseudo-code of the KMOR algorithm is given in Algorithm 1. For each subproblem, we have the following theorems to guarantee the optimality.

Algorithm 1: Pseudo-code of the KMOR algorithm, where δ and Nmax are two parameters used to terminate the algorithm.
  Input: X, k, γ, n_0, δ, Nmax
  Output: Optimal values of U and Z
  1  Initialize Z^(0) = {z_1^(0), z_2^(0), ..., z_k^(0)} by selecting k points from X randomly;
  2  Update U^(0) by assigning x_i to its nearest center for i = 1, 2, ..., n;
  3  s ← 0;
  4  P^(0) ← 0;
  5  while True do
  6      Update U^(s+1) by minimizing Q(U, U^(s), Z^(s)) according to Theorem 1;
  7      Update Z^(s+1) by minimizing Q(U^(s+1), U^(s+1), Z) according to Theorem 3;
  8      s ← s + 1;
  9      P^(s+1) ← P(U^(s+1), Z^(s+1));
 10      if |P^(s+1) − P^(s)| < δ or s ≥ Nmax then
 11          Break;
 12      end
 13  end
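As a companion to Algorithm 1, the sketch below is a minimal NumPy rendition of the whole iteration. It follows the verbal assignment rule given at the beginning of this section (a point whose smallest squared distance to the centers exceeds the scaled average distance becomes an outlier, with at most n_0 such points kept, namely the farthest ones) together with the center update of Theorem 3. It is an illustrative reimplementation under these assumptions, not the authors' code; a faithful implementation would follow the exact assignment rule of Theorem 1. All helper names are ours.

```python
import numpy as np

def _sq_dist(X, Z):
    # pairwise squared Euclidean distances, shape (n, k)
    return ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)

def kmor(X, k, gamma=3.0, n0=None, delta=1e-6, n_max=100, seed=0):
    """Sketch of the KMOR iteration. labels[i] is a cluster index in
    {0, ..., k-1} for normal points and -1 for points in the outlier group."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n0 = int(0.1 * n) if n0 is None else n0             # cap on the number of outliers
    Z = X[rng.choice(n, size=k, replace=False)].astype(float)  # random initial centers
    labels = _sq_dist(X, Z).argmin(axis=1)              # U(0): nearest-center assignment
    history, p_old = [], np.inf
    for _ in range(n_max):
        sq = _sq_dist(X, Z)
        nearest = sq.argmin(axis=1)
        d_near = sq[np.arange(n), nearest]
        prev_normal = labels >= 0
        # scaled average distance D over the previously normal points (Eq. (4))
        d_avg = gamma * sq[prev_normal, labels[prev_normal]].sum() / max(int(prev_normal.sum()), 1)
        # assignment step: points farther than D from every center become outliers,
        # keeping at most the n0 farthest candidates (constraint (3))
        cand = np.flatnonzero(d_near > d_avg)
        if cand.size > n0:
            cand = cand[np.argsort(d_near[cand])[-n0:]]
        labels = nearest
        labels[cand] = -1
        for l in range(k):                              # center update (Theorem 3)
            pts = X[labels == l]
            if len(pts):
                Z[l] = pts.mean(axis=0)
        # objective P(U, Z) after the update, used for the stopping test
        sq = _sq_dist(X, Z)
        normal = labels >= 0
        within = sq[normal, labels[normal]].sum()
        d_new = gamma * within / max(int(normal.sum()), 1)
        p_new = within + (labels < 0).sum() * d_new
        history.append(p_new)
        if abs(p_old - p_new) < delta:
            break
        p_old = p_new
    return labels, Z, history
```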
Theorem 1. Let V = V∗ and Z = Z∗ be fixed. Let m_1, m_2, ..., m_n ∈ {1, 2, ..., k} be such that

d_{i,m_i} = \min_{1 \le l \le k} d_{i,l},

where d_{i,l} = \|x_i − z^∗_l\|^2, that is, m_i is the index of the center to which the point x_i is closest. Let (i_1, i_2, ..., i_n) be a permutation of

Proof. We only need to show that for any binary matrix U satisfying the conditions (1) and (3), we have

Q(U^∗, V^∗, Z^∗) ≤ Q(U, V^∗, Z^∗).

To do that, we let U be an arbitrary binary matrix that satisfies conditions (1) and (3) and let

O = {i ∈ {1, 2, ..., n} : u_{i,k+1} = 1}

and t = |O|. Let

a^∗_i = \sum_{l=1}^{k} u^∗_{il} d_{il} + u^∗_{i,k+1} D(V^∗, Z^∗),  i = 1, 2, ..., n.

Let us first consider the case when t = t^∗. In this case, we have a^∗_{i_j} = a_{s_j} = D(V^∗, Z^∗) for j = 1, 2, ..., t and a^∗_{i_j} ≤ a_{s_j} for j = t + 1, t + 2, ..., n. Hence we have

Q(U^∗, V^∗, Z^∗) = \sum_{j=1}^{n} a^∗_{i_j} \le \sum_{j=1}^{n} a_{s_j} = Q(U, V^∗, Z^∗).

Now let us consider the case when t < t^∗. In this case, we have a^∗_{i_j} = a_{s_j} = D(V^∗, Z^∗) for j = 1, 2, ..., t. For j = t + 1, t + 2, ..., t^∗, we have

a_{s_j} = \sum_{l=1}^{k} u_{s_j,l} d_{s_j,l} \ge d_{s_j,m_{s_j}} > D(V^∗, Z^∗) = a^∗_{i_j}.

For j = t^∗ + 1, ..., n, we have a^∗_{i_j} ≤ a_{s_j}. Hence we have

Q(U^∗, V^∗, Z^∗) = \sum_{j=1}^{n} a^∗_{i_j} \le \sum_{j=1}^{n} a_{s_j} = Q(U, V^∗, Z^∗).

For the case when t > t^∗, we have t^∗ < t ≤ n_0 because U satisfies the condition (3). In this case, we have a^∗_{i_j} = a_{s_j} = D(V^∗, Z^∗) for j = 1, 2, ..., t^∗. For j = t^∗ + 1, ..., t, we have

a^∗_{i_j} = d_{i_j,m_{i_j}} \le D(V^∗, Z^∗) = a_{s_j}.

For j = t + 1, t + 2, ..., n, we have a^∗_{i_j} ≤ a_{s_j}. In this case, we also have

Q(U^∗, V^∗, Z^∗) = \sum_{j=1}^{n} a^∗_{i_j} \le \sum_{j=1}^{n} a_{s_j} = Q(U, V^∗, Z^∗).

Q(U^∗, U^∗, Z^∗) ≤ Q(U^∗, V^∗, Z^∗).

Proof. We only need to show that

D(U^∗, Z^∗) ≤ D(V^∗, Z^∗).    (7)

By Theorem 1, we know that U^∗ is a binary matrix that satisfies conditions (1) and (3) and minimizes Q(U, V^∗, Z^∗). Hence we have

Q(U^∗, V^∗, Z^∗) ≤ Q(V^∗, V^∗, Z^∗),

or

\sum_{i=1}^{n} \left( \sum_{l=1}^{k} u^∗_{i,l} \|x_i − z^∗_l\|^2 + u^∗_{i,k+1} D(V^∗, Z^∗) \right) \le \sum_{i=1}^{n} \left( \sum_{l=1}^{k} v^∗_{i,l} \|x_i − z^∗_l\|^2 + v^∗_{i,k+1} D(V^∗, Z^∗) \right).

Eq. (7) follows from the above inequality, the assumption, and the definition of D(V, Z). This completes the proof.

According to Theorem 2, we can set V^∗ equal to U^∗ and guarantee that the objective function value is always non-increasing.

Theorem 3. Let U = U^∗ and V = U^∗ be fixed. Then the cluster centers Z^∗ that minimize the function (5) are given by

z^∗_{l,s} = \frac{\sum_{i=1}^{n} u^∗_{i,l} x_{i,s}}{\sum_{i=1}^{n} u^∗_{i,l}}    (9)

for l = 1, 2, ..., k and s = 1, 2, ..., d, where x_i = [x_{i,1}, x_{i,2}, ..., x_{i,d}].

Proof. By combining Eqs. (4) and (5), we get

Q(U^∗, U^∗, Z) = \left( 1 + \frac{\gamma \sum_{j=1}^{n} u^∗_{j,k+1}}{n − \sum_{j=1}^{n} u^∗_{j,k+1}} \right) \sum_{i=1}^{n} \sum_{l=1}^{k} u^∗_{i,l} \|x_i − z_l\|^2.

Minimizing the above expression with respect to Z is equivalent to minimizing the following function

f(Z) = \sum_{i=1}^{n} \sum_{l=1}^{k} u^∗_{i,l} \|x_i − z_l\|^2.

Taking the derivative of the above function with respect to z_{l,s} and equating the derivative to zero lead to the result. This completes the proof.

According to Theorem 3, we see that the update of the cluster centers is the same as that in the standard k-means algorithm.
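For completeness, the differentiation step that the proof of Theorem 3 leaves to the reader can be written out as follows (our own expansion; it leads directly to Eq. (9)):

```latex
\frac{\partial f(Z)}{\partial z_{l,s}}
  = \frac{\partial}{\partial z_{l,s}}\sum_{i=1}^{n}\sum_{l'=1}^{k} u^{*}_{i,l'}\,\lVert x_i - z_{l'}\rVert^{2}
  = -2\sum_{i=1}^{n} u^{*}_{i,l}\,\bigl(x_{i,s} - z_{l,s}\bigr) = 0
  \;\Longrightarrow\;
  z^{*}_{l,s} = \frac{\sum_{i=1}^{n} u^{*}_{i,l}\,x_{i,s}}{\sum_{i=1}^{n} u^{*}_{i,l}}.
```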
By using the results in Theorems 1–3, we can show that the KMOR algorithm converges. In particular, we have

P(U^{(s+1)}, Z^{(s+1)}) = Q(U^{(s+1)}, U^{(s+1)}, Z^{(s+1)}) \le Q(U^{(s+1)}, U^{(s+1)}, Z^{(s)}) \le Q(U^{(s+1)}, U^{(s)}, Z^{(s)}) \le Q(U^{(s)}, U^{(s)}, Z^{(s)}) = P(U^{(s)}, Z^{(s)}).

We see that the objective is non-increasing. As U is finite and the objective function value is bounded below by zero, the algorithm will terminate as the objective function value is not changed after a finite number of iterations.
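The monotone behaviour is easy to observe numerically. The toy example below feeds two well-separated Gaussian blobs plus a few scattered points to the hypothetical kmor sketch given earlier in this section and prints the objective value per iteration; the data are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),      # cluster 1
               rng.normal(5.0, 0.3, size=(50, 2)),      # cluster 2
               rng.uniform(-10.0, 15.0, size=(5, 2))])  # a few scattered points
labels, Z, history = kmor(X, k=2, gamma=3.0, n0=10, seed=3)  # kmor from the earlier sketch
print([round(p, 3) for p in history])   # the values should be non-increasing
```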
As shown in Algorithm 1, the KMOR algorithm requires three main parameters k, n_0, and γ. The first parameter k is the desired number of clusters. The second parameter n_0 is the maximum number of outliers. The purpose of this parameter is to prevent the algorithm from assigning all the points to the group of outliers. The third parameter γ is used to classify normal points and outliers. In general, when the value of γ increases, the number of outliers decreases. The two additional parameters δ and Nmax are used to terminate the algorithm.

The parameters n_0 and γ are used together to control the number of outliers. If we know the percentage of outliers in a dataset, then we can set n_0 to that number and set γ to be 1 so that the algorithm will identify a group of n_0 outliers and divide the remaining data points into k clusters. If we do not know the percentage of outliers in a dataset, then we can set n_0 to a reasonably large number (e.g., 0.5n) and set γ appropriately to capture the outliers. For example, we can set γ in such a way that the scaled average distance D(U, Z) is approximately equal to the maximum of \sum_{l=1}^{k} u_{i,l}\|x_i − z_l\|^2 (1 ≤ i ≤ n). Suppose that

L = \max_{1 \le i \le n} \sum_{l=1}^{k} u_{i,l} \|x_i − z_l\|

and \sum_{l=1}^{k} u_{i,l}\|x_i − z_l\| (1 ≤ i ≤ n) are uniformly distributed on [0, L]. Then such a γ can be derived from the following relation:

\frac{\gamma}{n_1} \sum_{s=1}^{n_1} \left( \frac{s}{n_1} L \right)^2 \approx L^2,

where n_1 = n − \sum_{j=1}^{n} u_{j,k+1}. Noting that

\sum_{s=1}^{n_1} s^2 = \frac{n_1^3}{3} + \frac{n_1^2}{2} + \frac{n_1}{6},

we get γ ≈ 3.

If we do not know the percentage of outliers in a dataset, setting γ = 3 is a good initial guess. To apply the KMOR algorithm, we use the following default values for the parameters: γ = 3, n_0 = 0.1n, δ = 10^{-6}, and Nmax = 100.

4. Numerical experiments

In this section, we demonstrate the performance of the KMOR algorithm using both synthetic and real datasets. We shall compare the KMOR algorithm with the ODC algorithm [4], the k-means– algorithm [6], and the NEO-k-means algorithm [35], which are clustering algorithms that perform clustering and outlier detection simultaneously.

To measure the performance of the KMOR algorithm, we use the following two measures: the corrected Rand index [15,21] and the distance of a classifier on the Receiver Operating Characteristic (ROC) graph from the perfect classifier [4]. The first measure, denoted by R, is used to measure the overall accuracy of the clustering algorithm in terms of clustering and outlier detection. The value of R ranges from −1 to 1. A value of 1 indicates a perfect agreement between the two partitions, while a negative value indicates agreement by chance. The second measure, denoted by ME, is used to measure the performance of the clustering algorithm in terms of outlier detection. The measure ME ranges from 0 to √2. A smaller value of ME indicates a better result.

4.1. Experiments on synthetic data sets

To show that the proposed algorithm works, we generated two synthetic datasets with some outliers. The two synthetic datasets are shown in Fig. 2. The first synthetic dataset contains 106 data points, including 2 clusters and 6 outliers. The second synthetic dataset contains 816 data points, including 7 clusters and 16 outliers.

The KMOR algorithm has two main parameters: n_0 and γ. The parameter n_0 specifies the maximum number of outliers. The parameter γ specifies the multiplier of the average squared distance for outlier detection. In our experiments, we used n_0 = 0.5n and γ = 3, as discussed at the end of Section 3. For the ODC algorithm, the parameter p is used to control the number of outliers. A smaller value of p leads to more outliers. In our experiments, we set p = 6 for ODC. For the k-means– algorithm, the parameter l refers to the number of top outliers. In our experiments, we set l = 0.5n for k-means–, which is the same as the parameter n_0 in KMOR. In the NEO-k-means algorithm, α captures the degree of overlap and βn is the maximum number of outliers. For comparison purposes, we set α = 0 and β = 0.5 for NEO-k-means. Since all four algorithms can be affected by the cluster center initialization problem, we run each algorithm 100 times with different initial cluster centers selected randomly from the datasets.

Table 1 summarizes the average accuracy and runtime when the four algorithms are applied to the two synthetic datasets. From Table 1(a), we see that the KMOR algorithm performs the best among the four algorithms in terms of overall accuracy as measured by the corrected Rand index. The k-means– algorithm and the NEO-k-means algorithm produced similar results. In addition, the KMOR algorithm identified 6.14 outliers on average, which is close to the actual number of outliers in the first synthetic dataset. The average number of outliers identified by the k-means– algorithm and the NEO-k-means algorithm is close to the specified number of outliers.

Table 1
Average statistics of 100 runs of the four algorithms on synthetic datasets. The runtime is measured in seconds. (a) Results on the first synthetic dataset with k = 2. (b) Results on the second synthetic dataset with k = 7.

(a)
            KMOR     ODC      k-means–   NEO-k-means
R           0.91     0.87     0.32       0.34
ME          0.05     0.16     0.95       0.94
Outliers    6.14     5.04     53         52.27
Runtime     0.003    0.003    0.019      0.015

(b)
            KMOR     ODC      k-means–   NEO-k-means
R           0.562    0.782    0.291      0.292
ME          0.707    1        0.709      0.717
Outliers    272.37   0        408        407.43
Runtime     0.029    0.01     0.044      0.078
Fig. 2. Two synthetic datasets with outliers. (Two panels, (a) and (b), plotted against the attributes V1 and V2.)
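The synthetic datasets themselves are not distributed with the paper. The snippet below generates data of the same flavour as the first dataset (two Gaussian clusters plus a handful of uniformly scattered outliers, 106 points in total); the cluster means, spreads, and the outlier box are our own illustrative assumptions.

```python
import numpy as np

def make_two_cluster_data(seed=0):
    """Toy data resembling the first synthetic dataset: 2 clusters + 6 outliers."""
    rng = np.random.default_rng(seed)
    c1 = rng.normal(loc=[-2.0, 2.0], scale=0.5, size=(50, 2))
    c2 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
    out = rng.uniform(low=-6.0, high=6.0, size=(6, 2))
    X = np.vstack([c1, c2, out])
    y = np.concatenate([np.zeros(50, int), np.ones(50, int), np.full(6, -1)])
    return X, y   # y uses -1 for outliers, matching the kmor sketch above
```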
Table 1(b) shows the average statistics of 100 runs of the four algorithms on the second synthetic dataset. For the second synthetic dataset, the ODC algorithm achieved the best overall performance. However, the ODC algorithm did not identify any outliers, as the average number of identified outliers is zero. Again, the number of outliers identified by the k-means– algorithm and the NEO-k-means algorithm is close to the specified number of outliers. The KMOR algorithm identified 272.37 outliers on average, which is less than the specified number of outliers n_0. We can increase γ to decrease the number of outliers.

In terms of runtime, the k-means– algorithm and the NEO-k-means algorithm were slower than the KMOR algorithm and the ODC algorithm. For the second synthetic dataset, the NEO-k-means algorithm is the slowest because it needs to sort the distances between all points and all cluster centers.

4.2. Experiments on real data sets

To test the performance of the proposed algorithm, we also obtained real datasets from the UCI Machine Learning Repository [10]: the WBC dataset and the Shuttle dataset. The WBC dataset contains 699 records, each of which is described by 9 numerical attributes. The WBC dataset contains 2 clusters: malignant and benign. The benign cluster contains 458 records and the malignant cluster contains 241 records. We treat the benign records as normal and the malignant records as outliers. The WBC dataset was used to study outlier detection by He et al. [19], Jiang and An [26], and Duan et al. [8]. The Shuttle dataset contains 58,000 records, which are described by 9 numerical features. The Shuttle dataset consists of a training set and a test set. We use the training set in our experiments. The training set contains 43,500 records and 7 classes. The largest three classes contain 99.57% of the points. We treat the points in the three largest classes as normal points and the points in the remaining four classes as outliers. The Shuttle dataset was used to study outliers by Chawla and Gionis [6].

We applied KMOR, ODC, k-means–, and NEO-k-means to the real datasets 100 times with different initial cluster centers, which are selected randomly from the datasets. The average corrected Rand index, the average ME measure, the average number of outliers, and the average runtime of these 100 runs on the real datasets are summarized in Table 2(a) and (b). For the WBC dataset, we used the parameter values mentioned before. Since the Shuttle dataset is a large dataset, we used a larger value for γ and a smaller value for n_0 in order to control the number of outliers. In particular, we use n_0 = 0.1n and γ = 9 for KMOR. Similar to the way we select parameter values for the synthetic datasets, we use p = 9 for ODC, l = 0.1n for k-means–, and α = 0 and β = 0.1 for NEO-k-means.

Table 2
Average statistics of 100 runs of the four algorithms on real datasets. (a) Results on the WBC dataset with k = 1. (b) Results on the Shuttle dataset with k = 3.

(a)
            KMOR     ODC      k-means–   NEO-k-means
R           0.695    0        0.477      0.481
ME          0.127    1        0.236      0.234
Outliers    299      0        349        348
Runtime     0.023    0.009    0.037      0.021

(b)
            KMOR     ODC      k-means–   NEO-k-means
R           0.46     0.44     0.36       0.36
ME          0.99     1.002    1.009      1.009
Outliers    1106.7   72.07    4350       4349.96
Runtime     1.262    1.646    2.689      5.26

From Table 2(a), we see that each of the four algorithms produced identical clustering results for the 100 runs. Since the desired number of clusters is 1 for the WBC dataset, all the 100 runs produced the same results. The standard deviations of the corrected Rand index, the ME measure, and the number of outliers are zero. The runtime of each run was different due to the operating system. By comparing the average corrected Rand indices and the average classifier distances, we see that the KMOR algorithm achieved the best performance in terms of overall accuracy and outlier detection.

Table 2(b) summarizes the performance of the four algorithms on the Shuttle dataset. From this table we see that the KMOR algorithm achieved the best performance in terms of overall accuracy. This test shows that the KMOR algorithm is able to converge fast for large datasets. The NEO-k-means algorithm was the slowest algorithm due to the fact that it needs to sort nk distances for record assignments.
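For reference, both reported measures are straightforward to compute. The corrected Rand index of Hubert and Arabie [21] is available in scikit-learn as adjusted_rand_score, applied here with the outlier group treated as one additional cluster label (one reasonable reading of how R is used above); for ME we use the usual distance of a detector from the perfect classifier on the ROC plane, sqrt((1 − TPR)^2 + FPR^2), which matches the stated range [0, √2]. The exact convention of [4] may differ slightly, so treat this as an illustrative implementation rather than the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def corrected_rand(true_labels, pred_labels):
    # corrected (adjusted) Rand index of Hubert and Arabie; outliers enter as
    # one extra label value (e.g. -1), so R reflects clustering and detection
    return adjusted_rand_score(true_labels, pred_labels)

def me_measure(true_outlier, pred_outlier):
    """Distance from the perfect classifier on the ROC plane (assumed ME)."""
    true_outlier = np.asarray(true_outlier, bool)
    pred_outlier = np.asarray(pred_outlier, bool)
    tpr = (true_outlier & pred_outlier).sum() / max(true_outlier.sum(), 1)
    fpr = (~true_outlier & pred_outlier).sum() / max((~true_outlier).sum(), 1)
    return float(np.hypot(1.0 - tpr, fpr))
```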
In summary, the tests on both synthetic data and real data have shown that the KMOR algorithm is able to cluster data and detect outliers simultaneously. In addition, the tests also show that the KMOR algorithm is able to outperform the ODC algorithm, the k-means– algorithm, and the NEO-k-means algorithm in terms of overall accuracy and outlier detection. For large datasets, the KMOR algorithm is also able to outperform the other algorithms in terms of speed.

5. Conclusions

Both clustering and outlier detection are important data analysis tasks. In this paper, we proposed the KMOR algorithm by extending the k-means algorithm to provide data clustering and outlier detection simultaneously. In the KMOR algorithm, two parameters n_0 and γ are used to control the number of outliers. The parameter n_0 is the maximum number of outliers the proposed algorithm will produce regardless of the value of γ. For fixed n_0, a larger value of γ leads to a smaller number of outliers. We can also estimate the two parameters within the algorithm. For example, we can follow the approach proposed in [35] by running the traditional k-means algorithm on a dataset to estimate n_0 and γ.

We compared the KMOR algorithm, the ODC algorithm [4], the k-means– algorithm [6], and the NEO-k-means algorithm [35]. The experiments on both synthetic data and real data have shown that the KMOR algorithm is able to cluster data and detect outliers at the same time. The tests have also shown that the KMOR algorithm was able to outperform the other algorithms in terms of accuracy and runtime. Since outlier detection in the KMOR algorithm is a natural part of the clustering process, points can move between normal clusters and the outlier cluster. In the ODC algorithm, however, points assigned to the outlier cluster cannot be reassigned to a normal cluster.

In the future, we would like to extend the KMOR algorithm in the following directions. First, we would like to investigate other ways to control the number of outliers. Currently we control the number of outliers by the parameter n_0. Second, we would like to extend the KMOR algorithm for subspace clustering [13,15–17,20]. Currently the KMOR algorithm is not designed to identify clusters embedded in subspaces of the original data space. Finally, it is also interesting to investigate how to select an appropriate value for the parameter k required by the KMOR algorithm [28,29]. In the current version of the algorithm, we assume that k is given.

References

[1] C.C. Aggarwal, Outlier Analysis, Springer, New York, NY, 2013.
[2] C.C. Aggarwal, Data Mining: The Textbook, Springer, New York, NY, 2015.
[3] C.C. Aggarwal, C.K. Reddy (Eds.), Data Clustering: Algorithms and Applications, CRC Press, Boca Raton, FL, USA, 2013.
[4] M. Ahmed, A. Naser, A novel approach for outlier detection and clustering improvement, in: Proceedings of the 8th IEEE Conference on Industrial Electronics and Applications (ICIEA), 2013, pp. 577–582.
[5] K. Aparna, M.K. Nair, Computational Intelligence in Data Mining, vol. 2, Springer, pp. 25–35.
[6] S. Chawla, A. Gionis, k-means–: a unified approach to clustering and outlier detection, SIAM, pp. 189–197.
[7] R. Dave, R. Krishnapuram, Robust clustering methods: a unified view, IEEE Trans. Fuzzy Syst. 5 (2) (1997) 270–293.
[8] L. Duan, L. Xu, Y. Liu, J. Lee, Cluster-based outlier detection, Ann. Oper. Res. 168 (1) (2009) 151–168.
[9] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, A.Y. Zomaya, I. Khalil, S. Foufou, A. Bouras, A survey of clustering algorithms for big data: taxonomy and empirical analysis, IEEE Trans. Emerg. Top. Comput. PP (99) (2014) 1–1.
[10] A. Frank, A. Asuncion, UCI machine learning repository, 2010. Available at http://archive.ics.uci.edu/ml.
[11] G. Gan, Data Clustering in C++: An Object-Oriented Approach, Data Mining and Knowledge Discovery Series, Chapman & Hall/CRC Press, Boca Raton, FL, USA, 2011.
[12] G. Gan, Application of data clustering and machine learning in variable annuity valuation, Insurance: Math. Econ. 53 (3) (2013) 795–801.
[13] G. Gan, K. Chen, A soft subspace clustering algorithm with log-transformed distances, Big Data and Inf. Anal. 1 (1) (2016) 93–109.
[14] G. Gan, S. Lin, Valuation of large variable annuity portfolios under nested simulation: a functional data approach, Insurance: Math. Econ. 62 (2015) 138–150.
[15] G. Gan, M.K.-P. Ng, Subspace clustering using affinity propagation, Pattern Recognit. 48 (4) (2015) 1451–1460.
[16] G. Gan, M.K.-P. Ng, Subspace clustering with automatic feature grouping, Pattern Recognit. 48 (11) (2015) 3703–3713.
[17] G. Gan, J. Wu, Z. Yang, A fuzzy subspace algorithm for clustering high dimensional data, in: X. Li, S. Wang, Z. Dong (Eds.), Lecture Notes in Artificial Intelligence, 4093, Springer-Verlag, 2006, pp. 271–278.
[18] V. Hautamäki, S. Cherednichenko, I. Kärkkäinen, T. Kinnunen, P. Fränti, Improving k-means by outlier removal, in: Proceedings of the 14th Scandinavian Conference on Image Analysis, SCIA'05, 2005, pp. 978–987.
[19] Z. He, X. Xu, S. Deng, Discovering cluster-based local outliers, Pattern Recognit. Lett. 24 (9–10) (2003) 1641–1650.
[20] J. Huang, M. Ng, H. Rong, Z. Li, Automated variable weighting in k-means type clustering, IEEE Trans. Pattern Anal. Mach. Intell. 27 (5) (2005) 657–668.
[21] L. Hubert, P. Arabie, Comparing partitions, J. Classification 2 (1985) 193–218.
[22] A. Jain, Data clustering: 50 years beyond k-means, Pattern Recognit. Lett. 31 (8) (2010) 651–666.
[23] G.S.D.S. Jayakumar, B.J. Thomas, A new procedure of clustering based on multivariate outlier detection, J. Data Sci. 11 (2013) 69–84.
[24] F. Jiang, G. Liu, J. Du, Y. Sui, Initialization of k-modes clustering using outlier detection techniques, Inf. Sci. 332 (2016) 167–183.
[25] M. Jiang, S. Tseng, C. Su, Two-phase clustering process for outliers detection, Pattern Recognit. Lett. 22 (6–7) (2001) 691–700.
[26] S.-Y. Jiang, Q. An, Clustering-based outlier detection method, in: Fifth International Conference on Fuzzy Systems and Knowledge Discovery, 2, 2008, pp. 429–433.
[27] N.V. Kadam, M.A. Pund, Joint approach for outlier detection, Int. J. Comput. Sci. Appl. 6 (2) (2013) 445–448.
[28] S.-S. Kim, Variable selection and outlier detection for automated k-means clustering, Commun. Stat. Appl. Methods 22 (1) (2015) 55–67.
[29] D. Lei, Q. Zhu, J. Chen, H. Lin, P. Yang, Information Engineering and Applications: International Conference on Information Engineering and Applications (IEA 2011), Springer London, London, pp. 363–372.
[30] J.D. MacCuish, N.E. MacCuish, Clustering in Bioinformatics and Drug Discovery, CRC Press, Boca Raton, FL, 2010.
[31] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: L. LeCam, J. Neyman (Eds.), Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1, University of California Press, Berkeley, CA, USA, 1967, pp. 281–297.
[32] L. Ott, L. Pang, F.T. Ramos, S. Chawla, On integrated clustering and outlier detection, in: Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, K. Weinberger (Eds.), Advances in Neural Information Processing Systems 27, Curran Associates, Inc., 2014, pp. 1359–1367.
[33] R. Pamula, J. Deka, S. Nandi, An outlier detection method based on clustering, in: Second International Conference on Emerging Applications of Information Technology, 2011, pp. 253–256.
[34] F. Rehm, F. Klawonn, R. Kruse, A novel approach to noise clustering for outlier detection, Soft Comput. 11 (5) (2007) 489–494.
[35] J. Whang, I.S. Dhillon, D. Gleich, Non-exhaustive, overlapping k-means, in: SIAM International Conference on Data Mining (SDM), 2015.
[36] Q. Yu, Y. Luo, C. Chen, X. Ding, Outlier-eliminated k-means clustering algorithm based on differential privacy preservation, Applied Intelligence (2016) 1–13.
[37] K. Zhang, M. Hutter, H. Jin, A new local distance-based outlier detection approach for scattered real-world data, in: Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD '09, 2009, pp. 813–822.
[38] Y. Zhou, H. Yu, X. Cai, A novel k-means algorithm for clustering and outlier detection, in: Second International Conference on Future Information Technology and Management Engineering, 2009, pp. 476–480.