Efficient K-Means Clustering Algorithm Using Feature Weight and Min-Max Normalization

Ei Ei Phyo
Department of Information Technology
Technological University (Thanlyin), Myanmar

Ei Ei Myat
Department of Information Technology
Technological University (Thanlyin), Myanmar
Abstract: Clustering is a process of partitioning a set of data into meaningful sub-classes, called clusters. K-means is an effective clustering technique used to separate similar data into groups based on the initial centroids of the clusters. In this paper, the proposed algorithm applies normalization to the available data prior to clustering and calculates the initial centroids based on feature weights. Experimental results demonstrate the improvement of the proposed clustering algorithm over the existing K-means clustering algorithm in terms of computational complexity and overall performance.

Keywords: clustering, k-means clustering, min-max normalization, gain ratio, initial centroid
1. INTRODUCTION
Data mining [1] [2], or knowledge discovery, is the process of analyzing large amounts of data and extracting useful information. Data mining is widely used in areas such as financial data analysis, the retail and telecommunication industries, biological data analysis, fraud detection, spatial data analysis and other scientific applications. Clustering is categorized as a descriptive data analysis technique that builds clusters of data objects in such a way that objects within a cluster are closer to each other than to the objects of other clusters. K-means uses Euclidean distance to calculate the centroids of the clusters. The method is less effective when new data sets are added and these have no effect on the measured distance between data objects. The computational complexity of the k-means algorithm is also very high [1] [3]. K-means is the most popular and best understood traditional clustering algorithm. It starts by selecting random initial centroids, computes the distances between the centroids and the data objects, and then assigns each object to the centroid at minimum distance [4]. The algorithm iteratively groups the data objects by minimum distance until there is no change in the centroids or in the membership of the clusters. Normalization is used to eliminate redundant data and ensures that good-quality clusters are generated, which can improve the efficiency of clustering algorithms. It therefore becomes an essential step before clustering, as Euclidean distance is very sensitive to differences in scale [5]. A feature weighting algorithm can be seen as the combination of a search technique for proposing new feature subsets with an evaluation measure that scores the different feature subsets. The simplest algorithm is to test each possible subset of features and find the one which minimizes the error rate. This is an exhaustive search of the space and is computationally intractable for all but the smallest feature sets [15].

The remaining sections of the paper are organized as follows: Section 2 reviews related works, Section 3 describes the methodology, Section 4 presents the comparison methods, Section 5 explicates the experimental results and finally Section 6 gives the conclusion of the proposed work.

2. RELATED WORKS
Zhang Chen et al. [6] proposed an initial-centroid algorithm based on k-means that avoids the randomness of the initial centers. Fang Yuan [7] also proposed an initial-centroid algorithm. The standard k-means algorithm selects k objects randomly from the given data set as the initial centroids; if different initial values are given for the centroids, the accuracy of the output of the standard k-means algorithm can be affected. In Yuan's method the initial centroids are calculated systematically. The efficient k-means clustering method proposed in this paper aims to overcome the loss of computing efficiency and accuracy that occurs as the dataset grows.

3. METHODOLOGY
3.1 Min-Max Normalization
Min-max normalization [14] performs a linear transformation on the original data. It is a technique that helps to normalize a dataset by scaling it between 0 and 1. Suppose that min_A and max_A are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A] by computing

    v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside the original data range for A.
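As an illustration, the transformation above can be written as a short NumPy routine that rescales every attribute column of a dataset. This is only a sketch; the function name and the default target range [0, 1] are illustrative choices, not taken from the paper.

import numpy as np

def min_max_normalize(X, new_min=0.0, new_max=1.0):
    # Linearly rescale each column (attribute) of X into [new_min, new_max].
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)      # min_A for every attribute
    col_max = X.max(axis=0)      # max_A for every attribute
    span = np.where(col_max > col_min, col_max - col_min, 1.0)   # guard against constant columns
    return (X - col_min) / span * (new_max - new_min) + new_min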
3.2 Gain Ratio
Information gain applied to attributes that can take on a large number of distinct values might learn the training set too well. The information gain measure is biased toward tests with many outcomes; that is, it prefers to select attributes having a large number of values. The gain ratio [15] is defined as

    GainRatio(A) = Gain(A) / SplitInfo(A)

The attribute with the maximum gain ratio is selected as the splitting attribute. As the split information approaches 0, the ratio becomes unstable. A constraint is added to avoid this, whereby the information gain of the test selected must be at least as great as the average gain over all tests examined.
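A minimal sketch of the gain ratio computation is given below, assuming the attribute has already been discretized into a finite set of values and a class label is available for each record; the helper names and the choice of log base are illustrative, not prescribed by the paper.

import numpy as np

def entropy(labels):
    # Shannon entropy (base 2) of a discrete label vector.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(attribute, labels):
    # GainRatio(A) = Gain(A) / SplitInfo(A) for one discrete attribute.
    attribute = np.asarray(attribute)
    labels = np.asarray(labels)
    values, counts = np.unique(attribute, return_counts=True)
    weights = counts / counts.sum()
    # Expected entropy of the class labels after splitting on the attribute.
    cond_entropy = sum(w * entropy(labels[attribute == v])
                       for v, w in zip(values, weights))
    gain = entropy(labels) - cond_entropy
    split_info = -np.sum(weights * np.log2(weights))
    return gain / split_info if split_info > 0 else 0.0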
3.3 K-Means Clustering Algorithm
The basic idea of the K-means algorithm is to classify a dataset D into k different clusters, where D is a dataset of n records and k is the number of desired clusters. The algorithm consists of two basic phases [12]. The first phase is to select the initial centroids for each cluster randomly. The second and final phase is to take each point in the dataset and assign it to the nearest centroid [12]. The Euclidean distance method is used to measure the distance between points. When a new point is assigned to a cluster, the cluster mean is immediately updated by calculating the average of all the points in that cluster [13]. After all the points have been placed in some cluster, the early grouping is done. Each data object is then assigned to a cluster based on its closeness to the cluster center, where closeness is measured by Euclidean distance. This process of assigning data points to clusters and updating the cluster centroids continues until the convergence criterion is met, that is, until the centroids do not differ between two consecutive iterations. Once the centroids no longer move, the algorithm ends. The k-means clustering algorithm is given below.

Step 1: Begin with a decision on the value of k = number of clusters.

Step 2: Put any initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as follows:
    1. Take the first k training samples as single-element clusters.
    2. Assign each of the remaining (N-k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.

Step 3: Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch this sample to that cluster and update the centroids of the cluster gaining the new sample and the cluster losing the sample.

Step 4: Repeat Step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments.
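The steps above can be sketched as a compact NumPy implementation; the random initialization and the convergence test on cluster assignments mirror the description, while the function and variable names are illustrative.

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    # Standard k-means: random initial centroids, then alternate assignment
    # and centroid update until the assignments stop changing.
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assign every point to the nearest centroid (Euclidean distance).
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # no new assignments -> converged
            break
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels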
3.4 Proposed Methodology
The proposed efficient k-means clustering upgrades the original k-means clustering to reduce its computational complexity. In the proposed efficient k-means clustering method, normalization and feature weighting are applied. Firstly, the methodology normalizes the dataset using min-max normalization to improve the efficiency of the clustering algorithm. After that, the gain ratio method computes feature weights for each attribute of the data to minimize the error rate. The centroids are then passed to the traditional clustering algorithm, which executes in the way it normally does. The results of the proposed work are validated against the number of iterations and the accuracy obtained, and compared with randomly selected initial centroids.

Step 1: Accept the dataset to cluster as input values.

Step 2: Perform a linear transformation on the original dataset using min-max normalization.

Step 3: Compute the feature weight for each attribute and update the dataset.

Step 4: Initialize the first K clusters.

Step 5: Calculate the centroid point of each cluster formed in the dataset.

Step 6: Assign each record in the dataset to exactly one of the initial clusters using the Euclidean distance measure.

Step 7: Repeat Steps 5 and 6 until convergence is achieved, that is, until a pass through the training samples causes no new assignments.
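Putting the steps together, one possible reading of the pipeline is sketched below: min-max normalization, gain-ratio feature weights, and then k-means on the weighted data. It reuses the min_max_normalize, gain_ratio and k_means sketches given earlier and assumes class labels are available for the gain-ratio computation (as they are for Iris). The binning of continuous attributes and the use of the weights as column multipliers are assumptions, since the paper does not spell out these details.

import numpy as np

def weighted_k_means(X, y, k, n_bins=5):
    # Step 2: min-max normalization of the raw attributes.
    Xn = min_max_normalize(X)
    # Step 3: one gain-ratio feature weight per attribute; continuous
    # attributes are binned first (an assumption, not stated in the paper).
    weights = []
    for j in range(Xn.shape[1]):
        bins = np.digitize(Xn[:, j], np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
        weights.append(gain_ratio(bins, y))
    weights = np.asarray(weights)
    # Scale each attribute by its weight before clustering (assumed usage).
    Xw = Xn * weights
    # Steps 4-7: run k-means on the weighted, normalized data.
    return k_means(Xw, k)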
4. COMPARISON METHODS
4.1 Normalized Mutual Information (NMI)
The normalized mutual information [11] is a good measure for determining the quality of clustering. By comparing the NMI between different clusterings with different numbers of clusters, the proposed efficient k-means clustering can be assessed. The larger the value of NMI, the better the cluster quality.

    NMI(Y, C) = 2 * I(Y, C) / [H(Y) + H(C)]

where I(Y, C) is the mutual information between the class labels Y and the cluster labels C, and H(.) denotes entropy.
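A small sketch of the NMI computation from label vectors is shown below; it builds the contingency counts directly and follows the 2*I/(H(Y)+H(C)) form above. The function names are illustrative; scikit-learn's normalized_mutual_info_score computes an equivalent quantity with its default arithmetic averaging.

import numpy as np

def _entropy(counts):
    # Shannon entropy (natural log) from a vector of counts.
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def nmi(y, c):
    # NMI(Y, C) = 2 * I(Y, C) / (H(Y) + H(C)), from the contingency table
    # of class labels y against cluster labels c.
    y_vals, y_idx = np.unique(y, return_inverse=True)
    c_vals, c_idx = np.unique(c, return_inverse=True)
    table = np.zeros((len(y_vals), len(c_vals)))
    np.add.at(table, (y_idx, c_idx), 1)
    joint = table / table.sum()
    py = joint.sum(axis=1, keepdims=True)
    pc = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / (py * pc)[nz]))
    hy, hc = _entropy(table.sum(axis=1)), _entropy(table.sum(axis=0))
    return 2 * mi / (hy + hc) if (hy + hc) > 0 else 0.0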
4.2 Silhouette Coefficient (SC)
The silhouette coefficient [9] is used to compare the quality of clustering for the original k-means and the proposed method. The larger the value of SC, the better the cluster quality.

    SC = (1/N) * sum_{i=1}^{N} s(x_i)

    s(x) = (b(x) - a(x)) / max{a(x), b(x)}

where a(x) is the average distance from x to the other points in its own cluster and b(x) is the smallest average distance from x to the points of any other cluster.
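A direct (O(N^2)) sketch of the silhouette coefficient is given below; the function name is illustrative, and scikit-learn's silhouette_score offers an equivalent, optimized computation.

import numpy as np

def silhouette_coefficient(X, labels):
    # Average silhouette s(x) = (b(x) - a(x)) / max(a(x), b(x)) over all points.
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = (labels == labels[i])
        same[i] = False
        if not same.any():               # singleton cluster: silhouette taken as 0
            scores.append(0.0)
            continue
        a = dist[i, same].mean()
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))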
4.3 Sum of Square Error (SSE)
The sum of square error [10] is used to measure the quality of clustering, that is, the difference in error between the original k-means and the proposed method. The smaller the value of SSE, the better the cluster quality.

    SSE = sum_{i=1}^{K} sum_{x in C_i} dist^2(m_i, x)

where m_i is the centroid of cluster C_i.
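The SSE of a clustering result can be computed directly from the formula; a short sketch follows (function name illustrative). With scikit-learn's KMeans, the same quantity is available as the fitted model's inertia_ attribute.

import numpy as np

def sum_of_square_error(X, labels, centroids):
    # SSE = sum over clusters of the squared Euclidean distances from each
    # member point to its cluster centroid m_i.
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    diffs = X - centroids[labels]    # vector from each point to its own centroid
    return float(np.sum(diffs ** 2))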
5. EXPERIMENTAL RESULTS
In this section, we use the Iris [8] dataset to validate the proposed algorithm. The performance of our proposed algorithm is examined in terms of the quality of clustering obtained on this real-world dataset and compared with the original k-means algorithm.
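For reference, a minimal evaluation harness of this kind could look as follows, assuming scikit-learn is used for the baseline k-means and the quality measures. This is only a sketch of the experimental setup, not the authors' code; the proposed method would replace the plain KMeans call with the normalization and feature-weighting pipeline of Section 3.4, and the exact numbers in Tables 1-4 will not necessarily be reproduced.

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score, silhouette_score

X, y = load_iris(return_X_y=True)

for k in (2, 3, 4, 5):
    # Baseline: standard k-means with random initial centroids.
    km = KMeans(n_clusters=k, init="random", n_init=1, random_state=0).fit(X)
    print(f"k={k}  NMI={normalized_mutual_info_score(y, km.labels_):.4f}  "
          f"SC={silhouette_score(X, km.labels_):.4f}  SSE={km.inertia_:.4f}")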
Table 1 and Figure 1 show the comparison of the proposed algorithm and k-means with respect to the quality of clustering for two clusters.

Table 1. Evaluation result of Iris data for the quality of clustering, k = 2

    Number of Clusters   Method     NMI      SC       SSE
    K = 2                K-Means    0.6565   0.7813   0.9077
                         Proposed   0.6565   0.8149   0.0047

Figure 1. The quality of clustering for cluster 2

The comparison of the proposed algorithm and k-means with respect to the quality of clustering for three clusters is shown in Figure 2 and Table 2.

Table 2. Evaluation result of Iris data for the quality of clustering, k = 3

    Number of Clusters   Method     NMI      SC       SSE
    K = 3                K-Means    0.7419   0.8149   0.5281
                         Proposed   0.8642   0.8515   0.0023

Figure 2. The quality of clustering for cluster 3

Table 3 and Figure 3 show the comparison of the proposed algorithm and k-means with respect to the quality of clustering for four clusters.

Table 3. Evaluation result of Iris data for the quality of clustering, k = 4

    Number of Clusters   Method     NMI      SC       SSE
    K = 4                K-Means    0.7006   0.681    0.409
                         Proposed   0.8      0.9141   0.0019

Figure 3. The quality of clustering for cluster 4

The comparison of the proposed algorithm and k-means with respect to the quality of clustering for five clusters is shown in Figure 4 and Table 4.

Table 4. Evaluation result of Iris data for the quality of clustering, k = 5

    Number of Clusters   Method     NMI      SC       SSE
    K = 5                K-Means    0.6939   0.6591   0.3167
                         Proposed   0.7036   0.5604   0.0015

(In Tables 1-4, NMI = Normalized Mutual Information, SC = Silhouette Coefficient, SSE = Sum of Square Error. Figures 1-4 plot the same three measures as bar charts for k-means and the proposed algorithm.)