K-Means Clustering With Outlier Removal

G. Gan∗, M.K.-P. Ng

Pattern Recognition Letters 90 (2017) 8–14
http://dx.doi.org/10.1016/j.patrec.2017.03.008

Article history: Received 5 July 2016; Available online 8 March 2017
MSC: 62H30, 68T10, 91C20, 62P10
Keywords: Data clustering; k-means; Outlier detection

Abstract: Outlier detection is an important data analysis task in its own right and removing the outliers from clusters can improve the clustering accuracy. In this paper, we extend the k-means algorithm to provide data clustering and outlier detection simultaneously by introducing an additional “cluster” to the k-means algorithm to hold all outliers. We design an iterative procedure to optimize the objective function of the proposed algorithm and establish the convergence of the iterative procedure. Numerical experiments on both synthetic data and real data are provided to demonstrate the effectiveness and efficiency of the proposed algorithm.

∗ Corresponding author. E-mail address: [email protected] (G. Gan).
1. Introduction

The goal of data clustering is to identify homogeneous groups or clusters from a set of objects. In other words, data clustering aims to divide a set of objects into groups or clusters such that objects in the same cluster are more similar to each other than to objects from other clusters [3,11]. As an unsupervised learning process, data clustering is often used as a preliminary step for data analytics. For example, data clustering is used to identify the patterns hidden in gene expression data [30], to produce a good quality of clusters or summaries for big data to address the associated storage and analytical issues [9], and to select representative insurance policies from a large portfolio in order to build metamodels [12,14].

Many clustering algorithms have been developed in the past sixty years. Among these algorithms, the k-means algorithm is one of the oldest and most commonly used clustering algorithms [22,31]. Despite being used widely, the k-means algorithm has several drawbacks. One drawback is that it is sensitive to noisy data and outliers. For example, the k-means algorithm is not able to recover correctly the two clusters shown in Fig. 1(a) due to the outliers. As we can see from Fig. 1(b), three points were clustered incorrectly.

Motivated by Dave and Krishnapuram [7], we propose in this paper the KMOR (k-means with outlier removal) algorithm by extending the k-means algorithm for outlier detection. Dave and Krishnapuram [7] proposed to use an additional “cluster” for the fuzzy c-means algorithm to hold all outliers. In the KMOR algorithm, we use the same idea of introducing an additional “cluster” that contains all outliers. Given a desired number of clusters k, the KMOR algorithm partitions the dataset into k + 1 groups, which include k clusters and a group of outliers that cannot fit into the k clusters. Unlike most existing clustering algorithms with outlier detection, the KMOR algorithm assigns all outliers into a group naturally during the clustering process.

The remaining part of this paper is organized as follows. In Section 2, we give a review of clustering algorithms that can detect outliers. In Section 3, we present the KMOR algorithm in detail. In Section 4, we demonstrate the performance of the KMOR algorithm with numerical results on both synthetic and real datasets. Section 5 concludes the paper with some remarks.

2. Related work

Kadam and Pund [27] and Aggarwal [2, Chapter 8] reviewed several approaches to detect outliers, including the cluster-based approach. Aggarwal [1] devoted a whole book to outlier analysis. Yu et al. [36] proposed the OEDP k-means algorithm by removing outliers from the dataset before applying the k-means algorithm. Aparna and Nair [5] proposed the CHB-K-Means algorithm by using a weighted attribute matrix to detect outliers. Jiang et al. [24] proposed two initialization methods for the k-modes algorithm to choose initial cluster centers that are not outliers.
Fig. 1. An illustration showing that the k-means algorithm is sensitive to outliers. (a) A data set with two clusters and two outliers. The two clusters are plotted by triangles
and circles, respectively. The two outliers are denoted by plus signs. (b) Two clusters found by the k-means algorithm. The two found clusters are plotted by triangles and
circles, respectively.
Although much work has been done on outlier analysis, few of them perform clustering and detect outliers simultaneously. In this section, we focus on clustering methods with a built-in mechanism of outlier detection and give a review of those methods.

Jiang et al. [25] proposed a two-phase clustering algorithm for outlier detection. In the first phase, the k-means algorithm is modified to partition the data in such a way that a data point is assigned to be a new cluster center if the data point is far away from all clusters. In the second phase, a minimum spanning tree is constructed based on the cluster centers obtained from the first phase. Clusters in small subtrees are considered as outliers. He et al. [19] introduced the concept of cluster-based local outlier and designed a measure, called the cluster-based local outlier factor (CBLOF), to identify such outliers.

Hautamäki et al. [18] proposed the ORC (Outlier Removal Clustering) algorithm to identify clusters and outliers from a dataset simultaneously. The ORC algorithm consists of two consecutive stages: the first stage is a pure k-means algorithm; the second stage iteratively removes the data points that are far away from their cluster centroids. Rehm et al. [34] defined outliers in terms of a noise distance. The data points that are about the noise distance or further away from any other cluster centers get high membership degrees to the outlier cluster.

Jiang and An [26] also proposed a two-stage algorithm, called CBOD (Clustering Based Outlier Detection), to detect outliers from datasets. In the first stage, a one-pass clustering algorithm is applied to divide a dataset into hyperspheres with almost the same radius. In the second stage, outlier factors for all clusters obtained from the first stage are calculated and the clusters are sorted according to their outlier factors. Clusters with high outlier factors are considered outliers. Zhou et al. [38] proposed a three-stage k-means algorithm to cluster data and detect outliers. In the first stage, the fuzzy c-means algorithm is applied to cluster the data. In the second stage, local outliers are identified and the cluster centers are recalculated. In the third stage, certain clusters are merged and global outliers are identified. Zhang et al. [37] introduced a measure called the Local Distance-based Outlier Factor (LDOF) to measure the outlier-ness of objects in scattered datasets. Pamula et al. [33] used the k-means algorithm to prune some points around the cluster centers and the LDOF measure to identify outliers from the remaining points. Jayakumar and Thomas [23] proposed an approach to detect outliers based on the Mahalanobis distance.

Ahmed and Naser [4] proposed the ODC (Outlier Detection and Clustering) algorithm to detect outliers. The ODC algorithm is a modified version of the k-means algorithm. In the ODC algorithm, a data point that is at least p times the average distance away from its centroid is considered as an outlier. Chawla and Gionis [6] proposed the k-means– algorithm to provide data clustering and outlier detection simultaneously. The k-means– algorithm requires two parameters, k and l, which specify the desired number of clusters and the desired number of top outliers, respectively.

Ott et al. [32] extended the facility location formulation to model the joint clustering and outlier detection problem and proposed a subgradient-based algorithm to solve the resulting optimization problem. The model requires the pairwise distances of the dataset and the number of outliers as input. Whang et al. [35] proposed the NEO-k-means (Non-exhaustive Overlapping k-means) algorithm, which is also able to identify outliers during the clustering process.
Some of the aforementioned algorithms perform clustering and outlier detection in stages. In these algorithms, a clustering algorithm is used to divide the dataset into clusters and some measure is calculated for the data points based on the clusters to identify outliers. The ODC algorithm, the k-means– algorithm, and the NEO-k-means algorithm integrate outlier detection into the clustering process. However, data points that are removed as outliers during the iterative process of the ODC algorithm cannot be used as normal points again when the centroids are updated. Determining the parameters α and β in the NEO-k-means algorithm is time consuming.

3. The KMOR algorithm
In the KMOR algorithm, a data point that is at least γ × d_avg away from all the cluster centers is considered as an outlier, where γ is a multiplier and d_avg is the average distance calculated dynamically during the clustering process.

To describe the KMOR algorithm, let X = {x_1, x_2, ..., x_n} be a numerical dataset containing n data points, each of which is described by d numerical attributes. Let k be the desired number of clusters. Let U = (u_{il})_{n×(k+1)} be an n × (k + 1) binary matrix (i.e., u_{il} ∈ {0, 1}) such that for each i = 1, 2, ..., n,

\sum_{l=1}^{k+1} u_{il} = 1.    (1)

The binary matrix U has k + 1 columns. The last column of U is used to indicate whether a data point is an outlier. If x_i is an outlier, then u_{i,k+1} = 1. If x_i is a normal point, then u_{i,k+1} = 0. If u_{i,k+1} = 0, then u_{il} = 1 for some l ∈ {1, 2, ..., k}, where l is the index of the cluster to which x_i belongs. The binary matrix U is a partition matrix that divides the dataset X into k + 1 groups, which include k normal clusters and one special “cluster” that contains the outliers.

The KMOR algorithm divides X into k clusters and a group of outliers by minimizing the following objective function

P(U, Z) = \sum_{i=1}^{n} \left( \sum_{l=1}^{k} u_{il} \|x_i - z_l\|^2 + u_{i,k+1} D(U, Z) \right),    (2)

subject to

\sum_{i=1}^{n} u_{i,k+1} \le n_0,    (3)

where 0 ≤ n_0 < n is a parameter, Z = {z_1, z_2, ..., z_k} is a set of cluster centers, \|·\| is the L_2 norm, and

D(U, Z) = \frac{\gamma}{n - \sum_{j=1}^{n} u_{j,k+1}} \sum_{l=1}^{k} \sum_{j=1}^{n} u_{j,l} \|x_j - z_l\|^2.    (4)

Here γ ≥ 0 is also a parameter. The parameters n_0 and γ are used to control the number of outliers. How to select appropriate values for n_0 and γ is discussed at the end of this section. The first term of P(U, Z) is the objective employed in the standard k-means algorithm to put similar data points into a cluster. The second term of P(U, Z) is used to decide whether a point should be treated as an outlier, based on the average distance captured by D(U, Z).
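To make Eqs. (2)–(4) concrete, the following NumPy sketch evaluates the scaled average distance D(U, Z) and the objective P(U, Z) for a given binary assignment matrix U (with the outlier indicator in its last column) and a set of centers Z stored as the rows of a k × d array. The function names and the array layout are our own illustrative choices, not anything prescribed by the paper.

```python
import numpy as np

def kmor_scaled_average_distance(X, U, Z, gamma):
    """D(U, Z) from Eq. (4): gamma times the average squared distance of the
    points currently labelled normal to their assigned cluster centers."""
    n = X.shape[0]
    n_outliers = U[:, -1].sum()
    sq_dist = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)  # shape (n, k)
    within = (U[:, :-1] * sq_dist).sum()
    return gamma * within / (n - n_outliers)  # n0 < n keeps the denominator positive

def kmor_objective(X, U, Z, gamma):
    """P(U, Z) from Eq. (2): the k-means cost of the normal points plus
    D(U, Z) charged once for every point labelled as an outlier."""
    sq_dist = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    d = kmor_scaled_average_distance(X, U, Z, gamma)
    return (U[:, :-1] * sq_dist).sum() + U[:, -1].sum() * d
```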
The condition given in Eq. (3) limits the number of outliers to be at most n_0. In fact, the purpose of this condition is to make sure that

\sum_{i=1}^{n} u_{i,k+1} < n.

It is worth noting that this condition is necessary for the objective function to be nontrivial. Without the condition given in Eq. (3), the objective function reaches zero with u_{i,k+1} = 1 and u_{i,l} = 0 for i = 1, 2, ..., n and l = 1, 2, ..., k. In other words, the objective function without the condition is minimized when all data points are put into the group of outliers. When n_0 = 0, the condition given in Eq. (3) implies that u_{i,k+1} = 0 for i = 1, 2, ..., n. In this case, the KMOR objective function becomes the standard k-means objective function. When 0 < n_0 < n, the KMOR objective function also becomes the standard k-means objective function when γ → ∞: in this setting, the second term u_{i,k+1} D(U, Z) dominates, so the minimization procedure will favour u_{i,k+1} = 0, i.e., no points will be assigned to the group of outliers. It is also obvious from the KMOR objective function that we do not allow n_0 ≥ n, in order to prevent the algorithm from assigning all points to the group of outliers.

Like the k-means algorithm, the KMOR algorithm starts with a set of k initial cluster centers and then keeps updating U and Z until some stopping criterion is achieved. However, the objective function of the KMOR algorithm involves the interaction of the two sets of variables u_{i,l} (l = 1, 2, ..., k) and u_{i,k+1}. As a result, the iterative process of the KMOR algorithm is different from that of the standard k-means algorithm. To describe the iterative process of the KMOR algorithm, we define

Q(U, V, Z) = \sum_{i=1}^{n} \left( \sum_{l=1}^{k} u_{i,l} \|x_i - z_l\|^2 + u_{i,k+1} D(V, Z) \right),    (5)

where V = (v_{i,l})_{n×(k+1)} is an n × (k + 1) binary matrix that satisfies the conditions given in (1) and (3). Comparing Eqs. (2) and (5), we have Q(U, U, Z) = P(U, Z). According to Q(U, V, Z), we can optimize Q by solving three subproblems, Q(U, ·, ·), Q(·, V, ·), and Q(·, ·, Z), iteratively. The pseudo-code of the KMOR algorithm is given in Algorithm 1. For each subproblem, we have the following theorems to guarantee the optimality.

Algorithm 1: Pseudo-code of the KMOR algorithm, where δ and Nmax are two parameters used to terminate the algorithm.
  Input: X, k, γ, n_0, δ, Nmax
  Output: Optimal values of U and Z
  1  Initialize Z^(0) = {z_1^(0), z_2^(0), ..., z_k^(0)} by selecting k points from X randomly;
  2  Update U^(0) by assigning x_i to its nearest center for i = 1, 2, ..., n;
  3  s ← 0;
  4  P^(0) ← 0;
  5  while True do
  6      Update U^(s+1) by minimizing Q(U, U^(s), Z^(s)) according to Theorem 1;
  7      Update Z^(s+1) by minimizing Q(U^(s+1), U^(s+1), Z) according to Theorem 3;
  8      s ← s + 1;
  9      P^(s+1) ← P(U^(s+1), Z^(s+1));
 10      if |P^(s+1) − P^(s)| < δ or s ≥ Nmax then
 11          Break;
 12      end
 13  end
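As a companion to Algorithm 1, the sketch below is a minimal NumPy rendition of the whole iteration. It follows the verbal assignment rule given at the beginning of this section (a point whose smallest squared distance to the centers exceeds the scaled average distance becomes an outlier, with at most n_0 such points kept, namely the farthest ones) together with the center update of Theorem 3. It is an illustrative reimplementation under these assumptions, not the authors' code; a faithful implementation would follow the exact assignment rule of Theorem 1. All helper names are ours.

```python
import numpy as np

def _sq_dist(X, Z):
    # pairwise squared Euclidean distances, shape (n, k)
    return ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)

def kmor(X, k, gamma=3.0, n0=None, delta=1e-6, n_max=100, seed=0):
    """Sketch of the KMOR iteration. labels[i] is a cluster index in
    {0, ..., k-1} for normal points and -1 for points in the outlier group."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n0 = int(0.1 * n) if n0 is None else n0             # cap on the number of outliers
    Z = X[rng.choice(n, size=k, replace=False)].astype(float)  # random initial centers
    labels = _sq_dist(X, Z).argmin(axis=1)              # U(0): nearest-center assignment
    history, p_old = [], np.inf
    for _ in range(n_max):
        sq = _sq_dist(X, Z)
        nearest = sq.argmin(axis=1)
        d_near = sq[np.arange(n), nearest]
        prev_normal = labels >= 0
        # scaled average distance D over the previously normal points (Eq. (4))
        d_avg = gamma * sq[prev_normal, labels[prev_normal]].sum() / max(int(prev_normal.sum()), 1)
        # assignment step: points farther than D from every center become outliers,
        # keeping at most the n0 farthest candidates (constraint (3))
        cand = np.flatnonzero(d_near > d_avg)
        if cand.size > n0:
            cand = cand[np.argsort(d_near[cand])[-n0:]]
        labels = nearest
        labels[cand] = -1
        for l in range(k):                              # center update (Theorem 3)
            pts = X[labels == l]
            if len(pts):
                Z[l] = pts.mean(axis=0)
        # objective P(U, Z) after the update, used for the stopping test
        sq = _sq_dist(X, Z)
        normal = labels >= 0
        within = sq[normal, labels[normal]].sum()
        d_new = gamma * within / max(int(normal.sum()), 1)
        p_new = within + (labels < 0).sum() * d_new
        history.append(p_new)
        if abs(p_old - p_new) < delta:
            break
        p_old = p_new
    return labels, Z, history
```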
Theorem 1. Let V = V∗ and Z = Z∗ be fixed. Let m_1, m_2, ..., m_n ∈ {1, 2, ..., k} be such that

d_{i,m_i} = \min_{1 \le l \le k} d_{i,l},

where d_{i,l} = \|x_i − z^∗_l\|^2, that is, m_i is the index of the center to which the point x_i is closest. Let (i_1, i_2, ..., i_n) be a permutation of

Proof. We only need to show that for any binary matrix U satisfying the conditions (1) and (3), we have

Q(U^∗, V^∗, Z^∗) ≤ Q(U, V^∗, Z^∗).

To do that, we let U be an arbitrary binary matrix that satisfies conditions (1) and (3) and let

O = {i ∈ {1, 2, ..., n} : u_{i,k+1} = 1}

and t = |O|. Let

a^∗_i = \sum_{l=1}^{k} u^∗_{il} d_{il} + u^∗_{i,k+1} D(V^∗, Z^∗),  i = 1, 2, ..., n.

Let us first consider the case when t = t^∗. In this case, we have a^∗_{i_j} = a_{s_j} = D(V^∗, Z^∗) for j = 1, 2, ..., t and a^∗_{i_j} ≤ a_{s_j} for j = t + 1, t + 2, ..., n. Hence we have

Q(U^∗, V^∗, Z^∗) = \sum_{j=1}^{n} a^∗_{i_j} \le \sum_{j=1}^{n} a_{s_j} = Q(U, V^∗, Z^∗).

Now let us consider the case when t < t^∗. In this case, we have a^∗_{i_j} = a_{s_j} = D(V^∗, Z^∗) for j = 1, 2, ..., t. For j = t + 1, t + 2, ..., t^∗, we have

a_{s_j} = \sum_{l=1}^{k} u_{s_j,l} d_{s_j,l} \ge d_{s_j,m_{s_j}} > D(V^∗, Z^∗) = a^∗_{i_j}.

For j = t^∗ + 1, ..., n, we have a^∗_{i_j} ≤ a_{s_j}. Hence we have

Q(U^∗, V^∗, Z^∗) = \sum_{j=1}^{n} a^∗_{i_j} \le \sum_{j=1}^{n} a_{s_j} = Q(U, V^∗, Z^∗).

For the case when t > t^∗, we have t^∗ < t ≤ n_0 because U satisfies the condition (3). In this case, we have a^∗_{i_j} = a_{s_j} = D(V^∗, Z^∗) for j = 1, 2, ..., t^∗. For j = t^∗ + 1, ..., t, we have

a^∗_{i_j} = d_{i_j,m_{i_j}} \le D(V^∗, Z^∗) = a_{s_j}.

For j = t + 1, t + 2, ..., n, we have a^∗_{i_j} ≤ a_{s_j}. In this case, we also have

Q(U^∗, V^∗, Z^∗) = \sum_{j=1}^{n} a^∗_{i_j} \le \sum_{j=1}^{n} a_{s_j} = Q(U, V^∗, Z^∗).

Q(U^∗, U^∗, Z^∗) ≤ Q(U^∗, V^∗, Z^∗).

Proof. We only need to show that

D(U^∗, Z^∗) ≤ D(V^∗, Z^∗).    (7)

By Theorem 1, we know that U^∗ is a binary matrix that satisfies conditions (1) and (3) and minimizes Q(U, V^∗, Z^∗). Hence we have

Q(U^∗, V^∗, Z^∗) ≤ Q(V^∗, V^∗, Z^∗),

or

\sum_{i=1}^{n} \left( \sum_{l=1}^{k} u^∗_{i,l} \|x_i − z^∗_l\|^2 + u^∗_{i,k+1} D(V^∗, Z^∗) \right) \le \sum_{i=1}^{n} \left( \sum_{l=1}^{k} v^∗_{i,l} \|x_i − z^∗_l\|^2 + v^∗_{i,k+1} D(V^∗, Z^∗) \right).

Eq. (7) follows from the above inequality, the assumption, and the definition of D(V, Z). This completes the proof.

According to Theorem 2, we can set V^∗ equal to U^∗ and guarantee that the objective function value is always non-increasing.

Theorem 3. Let U = U^∗ and V = U^∗ be fixed. Then the cluster centers Z^∗ that minimize the function (5) are given by

z^∗_{l,s} = \frac{\sum_{i=1}^{n} u^∗_{i,l} x_{i,s}}{\sum_{i=1}^{n} u^∗_{i,l}}    (9)

for l = 1, 2, ..., k and s = 1, 2, ..., d, where x_i = [x_{i,1}, x_{i,2}, ..., x_{i,d}].

Proof. By combining Eqs. (4) and (5), we get

Q(U^∗, U^∗, Z) = \left( 1 + \frac{\gamma \sum_{j=1}^{n} u^∗_{j,k+1}}{n − \sum_{j=1}^{n} u^∗_{j,k+1}} \right) \sum_{i=1}^{n} \sum_{l=1}^{k} u^∗_{i,l} \|x_i − z_l\|^2.

Minimizing the above expression with respect to Z is equivalent to minimizing the following function

f(Z) = \sum_{i=1}^{n} \sum_{l=1}^{k} u^∗_{i,l} \|x_i − z_l\|^2.

Taking the derivative of the above function with respect to z_{l,s} and equating the derivative to zero lead to the result. This completes the proof.

According to Theorem 3, we see that the update of the cluster centers is the same as that in the standard k-means algorithm.
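For completeness, the differentiation step that the proof of Theorem 3 leaves to the reader can be written out as follows (our own expansion; it leads directly to Eq. (9)):

```latex
\frac{\partial f(Z)}{\partial z_{l,s}}
  = \frac{\partial}{\partial z_{l,s}}\sum_{i=1}^{n}\sum_{l'=1}^{k} u^{*}_{i,l'}\,\lVert x_i - z_{l'}\rVert^{2}
  = -2\sum_{i=1}^{n} u^{*}_{i,l}\,\bigl(x_{i,s} - z_{l,s}\bigr) = 0
  \;\Longrightarrow\;
  z^{*}_{l,s} = \frac{\sum_{i=1}^{n} u^{*}_{i,l}\,x_{i,s}}{\sum_{i=1}^{n} u^{*}_{i,l}}.
```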
By using the results in Theorems 1–3, we can show that the KMOR algorithm converges. In particular, we have

P(U^{(s+1)}, Z^{(s+1)}) = Q(U^{(s+1)}, U^{(s+1)}, Z^{(s+1)}) \le Q(U^{(s+1)}, U^{(s+1)}, Z^{(s)}) \le Q(U^{(s+1)}, U^{(s)}, Z^{(s)}) \le Q(U^{(s)}, U^{(s)}, Z^{(s)}) = P(U^{(s)}, Z^{(s)}).

We see that the objective is non-increasing. As U is finite and the objective function value is bounded below by zero, the algorithm will terminate as the objective function value is not changed after a finite number of iterations.
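The monotone behaviour is easy to observe numerically. The toy example below feeds two well-separated Gaussian blobs plus a few scattered points to the hypothetical kmor sketch given earlier in this section and prints the objective value per iteration; the data are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),      # cluster 1
               rng.normal(5.0, 0.3, size=(50, 2)),      # cluster 2
               rng.uniform(-10.0, 15.0, size=(5, 2))])  # a few scattered points
labels, Z, history = kmor(X, k=2, gamma=3.0, n0=10, seed=3)  # kmor from the earlier sketch
print([round(p, 3) for p in history])   # the values should be non-increasing
```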
As shown in Algorithm 1, the KMOR algorithm requires three main parameters k, n_0, and γ. The first parameter k is the desired number of clusters. The second parameter n_0 is the maximum number of outliers. The purpose of this parameter is to prevent the algorithm from assigning all the points to the group of outliers. The third parameter γ is used to classify normal points and outliers. In general, when the value of γ increases, the number of outliers decreases. The two additional parameters δ and Nmax are used to terminate the algorithm.

The parameters n_0 and γ are used together to control the number of outliers. If we know the percentage of outliers in a dataset, then we can set n_0 to that number and set γ to be 1 so that the algorithm will identify a group of n_0 outliers and divide the remaining data points into k clusters. If we do not know the percentage of outliers in a dataset, then we can set n_0 to a reasonably large number (e.g., 0.5n) and set γ appropriately to capture the outliers. For example, we can set γ in such a way that the scaled average distance D(U, Z) is approximately equal to the maximum of \sum_{l=1}^{k} u_{i,l}\|x_i − z_l\|^2 (1 ≤ i ≤ n). Suppose that

L = \max_{1 \le i \le n} \sum_{l=1}^{k} u_{i,l} \|x_i − z_l\|

and \sum_{l=1}^{k} u_{i,l}\|x_i − z_l\| (1 ≤ i ≤ n) are uniformly distributed on [0, L]. Then such a γ can be derived from the following relation:

\frac{\gamma}{n_1} \sum_{s=1}^{n_1} \left( \frac{s}{n_1} L \right)^2 \approx L^2,

where n_1 = n − \sum_{j=1}^{n} u_{j,k+1}. Noting that

\sum_{s=1}^{n_1} s^2 = \frac{n_1^3}{3} + \frac{n_1^2}{2} + \frac{n_1}{6},

we get γ ≈ 3.

If we do not know the percentage of outliers in a dataset, setting γ = 3 is a good initial guess. To apply the KMOR algorithm, we use the following default values for the parameters: γ = 3, n_0 = 0.1n, δ = 10^{-6}, and Nmax = 100.

4. Numerical experiments

In this section, we demonstrate the performance of the KMOR algorithm using both synthetic and real datasets. We shall compare the KMOR algorithm with the ODC algorithm [4], the k-means– algorithm [6], and the NEO-k-means algorithm [35], which are clustering algorithms that perform clustering and outlier detection simultaneously.

To measure the performance of the KMOR algorithm, we use the following two measures: the corrected Rand index [15,21] and the distance of a classifier on the Receiver Operating Characteristic (ROC) graph from the perfect classifier [4]. The first measure, denoted by R, is used to measure the overall accuracy of the clustering algorithm in terms of clustering and outlier detection. The value of R ranges from −1 to 1. A value of 1 indicates a perfect agreement between the two partitions, while a negative value indicates agreement by chance. The second measure, denoted by ME, is used to measure the performance of the clustering algorithm in terms of outlier detection. The measure ME ranges from 0 to √2. A smaller value of ME indicates a better result.

4.1. Experiments on synthetic data sets

To show that the proposed algorithm works, we generated two synthetic datasets with some outliers. The two synthetic datasets are shown in Fig. 2. The first synthetic dataset contains 106 data points, including 2 clusters and 6 outliers. The second synthetic dataset contains 816 data points, including 7 clusters and 16 outliers.

The KMOR algorithm has two main parameters: n_0 and γ. The parameter n_0 specifies the maximum number of outliers. The parameter γ specifies the multiplier of the average squared distance for outlier detection. In our experiments, we used n_0 = 0.5n and γ = 3, as discussed at the end of Section 3. For the ODC algorithm, the parameter p is used to control the number of outliers. A smaller value of p leads to more outliers. In our experiments, we set p = 6 for ODC. For the k-means– algorithm, the parameter l refers to the number of top outliers. In our experiments, we set l = 0.5n for k-means–, which is the same as the parameter n_0 in KMOR. In the NEO-k-means algorithm, α captures the degree of overlap and βn is the maximum number of outliers. For comparison purposes, we set α = 0 and β = 0.5 for NEO-k-means. Since all four algorithms can be affected by the cluster center initialization problem, we run each algorithm 100 times with different initial cluster centers selected randomly from the datasets.

Table 1 summarizes the average accuracy and runtime when the four algorithms are applied to the two synthetic datasets. From Table 1(a), we see that the KMOR algorithm performs the best among the four algorithms in terms of overall accuracy as measured by the corrected Rand index. The k-means– algorithm and the NEO-k-means algorithm produced similar results. In addition, the KMOR algorithm identified 6.14 outliers on average, which is close to the actual number of outliers in the first synthetic dataset. The average number of outliers identified by the k-means– algorithm and the NEO-k-means algorithm is close to the specified number of outliers.

Table 1
Average statistics of 100 runs of the four algorithms on synthetic datasets. The runtime is measured in seconds. (a) Results on the first synthetic dataset with k = 2. (b) Results on the second synthetic dataset with k = 7.

(a)
            KMOR     ODC      k-means–   NEO-k-means
R           0.91     0.87     0.32       0.34
ME          0.05     0.16     0.95       0.94
Outliers    6.14     5.04     53         52.27
Runtime     0.003    0.003    0.019      0.015

(b)
            KMOR     ODC      k-means–   NEO-k-means
R           0.562    0.782    0.291      0.292
ME          0.707    1        0.709      0.717
Outliers    272.37   0        408        407.43
Runtime     0.029    0.01     0.044      0.078
Fig. 2. Two synthetic datasets with outliers. (Two panels, (a) and (b), plotted against the attributes V1 and V2.)
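The synthetic datasets themselves are not distributed with the paper. The snippet below generates data of the same flavour as the first dataset (two Gaussian clusters plus a handful of uniformly scattered outliers, 106 points in total); the cluster means, spreads, and the outlier box are our own illustrative assumptions.

```python
import numpy as np

def make_two_cluster_data(seed=0):
    """Toy data resembling the first synthetic dataset: 2 clusters + 6 outliers."""
    rng = np.random.default_rng(seed)
    c1 = rng.normal(loc=[-2.0, 2.0], scale=0.5, size=(50, 2))
    c2 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
    out = rng.uniform(low=-6.0, high=6.0, size=(6, 2))
    X = np.vstack([c1, c2, out])
    y = np.concatenate([np.zeros(50, int), np.ones(50, int), np.full(6, -1)])
    return X, y   # y uses -1 for outliers, matching the kmor sketch above
```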
Table 1(b) shows the average statistics of 100 runs of the four algorithms on the second synthetic dataset. For the second synthetic dataset, the ODC algorithm achieved the best overall performance. However, the ODC algorithm did not identify any outliers, as the average number of identified outliers is zero. Again, the number of outliers identified by the k-means– algorithm and the NEO-k-means algorithm is close to the specified number of outliers. The KMOR algorithm identified 272.37 outliers on average, which is less than the specified number of outliers n_0. We can increase γ to decrease the number of outliers.

In terms of runtime, the k-means– algorithm and the NEO-k-means algorithm were slower than the KMOR algorithm and the ODC algorithm. For the second synthetic dataset, the NEO-k-means algorithm is the slowest because it needs to sort the distances between all points and all cluster centers.

4.2. Experiments on real data sets

To test the performance of the proposed algorithm, we also obtained real datasets from the UCI Machine Learning Repository [10]: the WBC dataset and the Shuttle dataset. The WBC dataset contains 699 records, each of which is described by 9 numerical attributes. The WBC dataset contains 2 clusters: malignant and benign. The benign cluster contains 458 records and the malignant cluster contains 241 records. We treat the benign records as normal and the malignant records as outliers. The WBC dataset was used to study outlier detection by He et al. [19], Jiang and An [26], and Duan et al. [8]. The Shuttle dataset contains 58,000 records, which are described by 9 numerical features. The Shuttle dataset consists of a training set and a test set. We use the training set in our experiments. The training set contains 43,500 records and 7 classes. The largest three classes contain 99.57% of the points. We treat the points in the three largest classes as normal points and the points in the remaining four classes as outliers. The Shuttle dataset was used to study outliers by Chawla and Gionis [6].

We applied KMOR, ODC, k-means–, and NEO-k-means to the real datasets 100 times with different initial cluster centers, which are selected randomly from the datasets. The average corrected Rand index, the average ME measure, the average number of outliers, and the average runtime of these 100 runs on the real datasets are summarized in Table 2(a) and (b). For the WBC dataset, we used the parameter values mentioned before. Since the Shuttle dataset is a large dataset, we used a larger value for γ and a smaller value for n_0 in order to control the number of outliers. In particular, we use n_0 = 0.1n and γ = 9 for KMOR. Similar to the way we select parameter values for the synthetic datasets, we use p = 9 for ODC, l = 0.1n for k-means–, and α = 0 and β = 0.1 for NEO-k-means.

Table 2
Average statistics of 100 runs of the four algorithms on real datasets. (a) Results on the WBC dataset with k = 1. (b) Results on the Shuttle dataset with k = 3.

(a)
            KMOR     ODC      k-means–   NEO-k-means
R           0.695    0        0.477      0.481
ME          0.127    1        0.236      0.234
Outliers    299      0        349        348
Runtime     0.023    0.009    0.037      0.021

(b)
            KMOR     ODC      k-means–   NEO-k-means
R           0.46     0.44     0.36       0.36
ME          0.99     1.002    1.009      1.009
Outliers    1106.7   72.07    4350       4349.96
Runtime     1.262    1.646    2.689      5.26

From Table 2(a), we see that each of the four algorithms produced identical clustering results for the 100 runs. Since the desired number of clusters is 1 for the WBC dataset, all the 100 runs produced the same results. The standard deviations of the corrected Rand index, the ME measure, and the number of outliers are zero. The runtime of each run was different due to the operating system. By comparing the average corrected Rand indices and the average classifier distances, we see that the KMOR algorithm achieved the best performance in terms of overall accuracy and outlier detection.

Table 2(b) summarizes the performance of the four algorithms on the Shuttle dataset. From this table we see that the KMOR algorithm achieved the best performance in terms of overall accuracy. This test shows that the KMOR algorithm is able to converge fast for large datasets. The NEO-k-means algorithm was the slowest algorithm due to the fact that it needs to sort nk distances for record assignments.
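For reference, both reported measures are straightforward to compute. The corrected Rand index of Hubert and Arabie [21] is available in scikit-learn as adjusted_rand_score, applied here with the outlier group treated as one additional cluster label (one reasonable reading of how R is used above); for ME we use the usual distance of a detector from the perfect classifier on the ROC plane, sqrt((1 − TPR)^2 + FPR^2), which matches the stated range [0, √2]. The exact convention of [4] may differ slightly, so treat this as an illustrative implementation rather than the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def corrected_rand(true_labels, pred_labels):
    # corrected (adjusted) Rand index of Hubert and Arabie; outliers enter as
    # one extra label value (e.g. -1), so R reflects clustering and detection
    return adjusted_rand_score(true_labels, pred_labels)

def me_measure(true_outlier, pred_outlier):
    """Distance from the perfect classifier on the ROC plane (assumed ME)."""
    true_outlier = np.asarray(true_outlier, bool)
    pred_outlier = np.asarray(pred_outlier, bool)
    tpr = (true_outlier & pred_outlier).sum() / max(true_outlier.sum(), 1)
    fpr = (~true_outlier & pred_outlier).sum() / max((~true_outlier).sum(), 1)
    return float(np.hypot(1.0 - tpr, fpr))
```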
In summary, the tests on both synthetic data and real data have shown that the KMOR algorithm is able to cluster data and detect outliers simultaneously. In addition, the tests also show that the KMOR algorithm is able to outperform the ODC algorithm, the k-means– algorithm, and the NEO-k-means algorithm in terms of overall accuracy and outlier detection. For large datasets, the KMOR algorithm is also able to outperform the other algorithms in terms of speed.

5. Conclusions

Both clustering and outlier detection are important data analysis tasks. In this paper, we proposed the KMOR algorithm by extending the k-means algorithm to provide data clustering and outlier detection simultaneously. In the KMOR algorithm, two parameters n_0 and γ are used to control the number of outliers. The parameter n_0 is the maximum number of outliers the proposed algorithm will produce regardless of the value of γ. For fixed n_0, a larger value of γ leads to a smaller number of outliers. We can also estimate the two parameters within the algorithm. For example, we can follow the approach proposed in [35] by running the traditional k-means algorithm on a dataset to estimate n_0 and γ.

We compared the KMOR algorithm, the ODC algorithm [4], the k-means– algorithm [6], and the NEO-k-means algorithm [35]. The experiments on both synthetic data and real data have shown that the KMOR algorithm is able to cluster data and detect outliers at the same time. The tests have also shown that the KMOR algorithm was able to outperform the other algorithms in terms of accuracy and runtime. Since outlier detection in the KMOR algorithm is a natural part of the clustering process, points can move between normal clusters and the outlier cluster. In the ODC algorithm, however, points assigned to the outlier cluster cannot be reassigned to a normal cluster.

In the future, we would like to extend the KMOR algorithm in the following directions. First, we would like to investigate other ways to control the number of outliers. Currently we control the number of outliers by the parameter n_0. Second, we would like to extend the KMOR algorithm for subspace clustering [13,15–17,20]. Currently the KMOR algorithm is not designed to identify clusters embedded in subspaces of the original data space. Finally, it is also interesting to investigate how to select an appropriate value for the parameter k required by the KMOR algorithm [28,29]. In the current version of the algorithm, we assume that k is given.

References

[1] C.C. Aggarwal, Outlier Analysis, Springer, New York, NY, 2013.
[2] C.C. Aggarwal, Data Mining: The Textbook, Springer, New York, NY, 2015.
[3] C.C. Aggarwal, C.K. Reddy (Eds.), Data Clustering: Algorithms and Applications, CRC Press, Boca Raton, FL, USA, 2013.
[4] M. Ahmed, A. Naser, A novel approach for outlier detection and clustering improvement, in: Proceedings of the 8th IEEE Conference on Industrial Electronics and Applications (ICIEA), 2013, pp. 577–582.
[5] K. Aparna, M.K. Nair, Computational Intelligence in Data Mining, vol. 2, Springer, pp. 25–35.
[6] S. Chawla, A. Gionis, k-means–: a unified approach to clustering and outlier detection, SIAM, pp. 189–197.
[7] R. Dave, R. Krishnapuram, Robust clustering methods: a unified view, IEEE Trans. Fuzzy Syst. 5 (2) (1997) 270–293.
[8] L. Duan, L. Xu, Y. Liu, J. Lee, Cluster-based outlier detection, Ann. Oper. Res. 168 (1) (2009) 151–168.
[9] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, A.Y. Zomaya, I. Khalil, S. Foufou, A. Bouras, A survey of clustering algorithms for big data: taxonomy and empirical analysis, IEEE Trans. Emerg. Top. Comput. PP (99) (2014) 1–1.
[10] A. Frank, A. Asuncion, UCI machine learning repository, 2010. Available at http://archive.ics.uci.edu/ml.
[11] G. Gan, Data Clustering in C++: An Object-Oriented Approach, Data Mining and Knowledge Discovery Series, Chapman & Hall/CRC Press, Boca Raton, FL, USA, 2011.
[12] G. Gan, Application of data clustering and machine learning in variable annuity valuation, Insurance: Math. Econ. 53 (3) (2013) 795–801.
[13] G. Gan, K. Chen, A soft subspace clustering algorithm with log-transformed distances, Big Data and Inf. Anal. 1 (1) (2016) 93–109.
[14] G. Gan, S. Lin, Valuation of large variable annuity portfolios under nested simulation: a functional data approach, Insurance: Math. Econ. 62 (2015) 138–150.
[15] G. Gan, M.K.-P. Ng, Subspace clustering using affinity propagation, Pattern Recognit. 48 (4) (2015) 1451–1460.
[16] G. Gan, M.K.-P. Ng, Subspace clustering with automatic feature grouping, Pattern Recognit. 48 (11) (2015) 3703–3713.
[17] G. Gan, J. Wu, Z. Yang, A fuzzy subspace algorithm for clustering high dimensional data, in: X. Li, S. Wang, Z. Dong (Eds.), Lecture Notes in Artificial Intelligence, 4093, Springer-Verlag, 2006, pp. 271–278.
[18] V. Hautamäki, S. Cherednichenko, I. Kärkkäinen, T. Kinnunen, P. Fränti, Improving k-means by outlier removal, in: Proceedings of the 14th Scandinavian Conference on Image Analysis, SCIA'05, 2005, pp. 978–987.
[19] Z. He, X. Xu, S. Deng, Discovering cluster-based local outliers, Pattern Recognit. Lett. 24 (9–10) (2003) 1641–1650.
[20] J. Huang, M. Ng, H. Rong, Z. Li, Automated variable weighting in k-means type clustering, IEEE Trans. Pattern Anal. Mach. Intell. 27 (5) (2005) 657–668.
[21] L. Hubert, P. Arabie, Comparing partitions, J. Classification 2 (1985) 193–218.
[22] A. Jain, Data clustering: 50 years beyond k-means, Pattern Recognit. Lett. 31 (8) (2010) 651–666.
[23] G.S.D.S. Jayakumar, B.J. Thomas, A new procedure of clustering based on multivariate outlier detection, J. Data Sci. 11 (2013) 69–84.
[24] F. Jiang, G. Liu, J. Du, Y. Sui, Initialization of k-modes clustering using outlier detection techniques, Inf. Sci. 332 (2016) 167–183.
[25] M. Jiang, S. Tseng, C. Su, Two-phase clustering process for outliers detection, Pattern Recognit. Lett. 22 (6–7) (2001) 691–700.
[26] S.-Y. Jiang, Q. An, Clustering-based outlier detection method, in: Fifth International Conference on Fuzzy Systems and Knowledge Discovery, 2, 2008, pp. 429–433.
[27] N.V. Kadam, M.A. Pund, Joint approach for outlier detection, Int. J. Comput. Sci. Appl. 6 (2) (2013) 445–448.
[28] S.-S. Kim, Variable selection and outlier detection for automated k-means clustering, Commun. Stat. Appl. Methods 22 (1) (2015) 55–67.
[29] D. Lei, Q. Zhu, J. Chen, H. Lin, P. Yang, Information Engineering and Applications: International Conference on Information Engineering and Applications (IEA 2011), Springer London, London, pp. 363–372.
[30] J.D. MacCuish, N.E. MacCuish, Clustering in Bioinformatics and Drug Discovery, CRC Press, Boca Raton, FL, 2010.
[31] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: L. LeCam, J. Neyman (Eds.), Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1, University of California Press, Berkeley, CA, USA, 1967, pp. 281–297.
[32] L. Ott, L. Pang, F.T. Ramos, S. Chawla, On integrated clustering and outlier detection, in: Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, K. Weinberger (Eds.), Advances in Neural Information Processing Systems 27, Curran Associates, Inc., 2014, pp. 1359–1367.
[33] R. Pamula, J. Deka, S. Nandi, An outlier detection method based on clustering, in: Second International Conference on Emerging Applications of Information Technology, 2011, pp. 253–256.
[34] F. Rehm, F. Klawonn, R. Kruse, A novel approach to noise clustering for outlier detection, Soft Comput. 11 (5) (2007) 489–494.
[35] J. Whang, I.S. Dhillon, D. Gleich, Non-exhaustive, overlapping k-means, in: SIAM International Conference on Data Mining (SDM), 2015.
[36] Q. Yu, Y. Luo, C. Chen, X. Ding, Outlier-eliminated k-means clustering algorithm based on differential privacy preservation, Applied Intelligence (2016) 1–13.
[37] K. Zhang, M. Hutter, H. Jin, A new local distance-based outlier detection approach for scattered real-world data, in: Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD '09, 2009, pp. 813–822.
[38] Y. Zhou, H. Yu, X. Cai, A novel k-means algorithm for clustering and outlier detection, in: Second International Conference on Future Information Technology and Management Engineering, 2009, pp. 476–480.