
International Journal of Engineering Research & Technology (IJERT)

ISSN: 2278-0181
Vol. 2 Issue 7, July - 2013

Analysis And Study Of K-Means Clustering Algorithm


Sudhir Singh and Nasib Singh Gill
Department of Computer Science & Applications
M. D. University, Rohtak, Haryana

Abstract

This paper studies the behavior of the K-means algorithm and attempts to overcome its limitations through a proposed algorithm. The standard K-means algorithm takes a long time when applied to a large database; the proposed clustering concept is therefore introduced to provide a quick and efficient clustering technique for large data sets. In this paper, the performance of the proposed algorithm is evaluated on the Max Hospital diabetic patient dataset.

Keywords: Clustering, K-means, Threshold, Outlier, Square Error.

1. Introduction

Clustering is the process of partitioning or grouping a given set of patterns into disjoint clusters, such that patterns in the same cluster are alike and patterns belonging to two different clusters are different. Clustering has been widely studied in a variety of application domains, and several algorithms have been proposed in the literature: CLARA, CLARANS [6], Focusing Techniques [4], P-CLUSTER [5], DBSCAN [3] and BIRCH [7]. The k-means method has been shown to be effective in producing good clustering results for many practical applications. However, a direct implementation of the k-means method requires time proportional to the product of the number of patterns and the number of clusters per iteration, which is computationally very expensive, especially for large datasets. We propose a novel algorithm for implementing the k-means method. It produces the same or comparable (up to round-off errors) clustering results as the direct k-means algorithm, with significantly better performance in most cases.

The rest of this paper is organized as follows. Section 2 reviews the k-means algorithm, Section 3 presents our proposed algorithm, Section 4 analyzes the time complexity of both algorithms, Section 5 describes the experimental results, and Section 6 concludes.

2. K-MEANS CLUSTERING

The K-means algorithm is one of the partitioning-based clustering algorithms [2]. The general objective is to obtain a fixed number of partitions/clusters that minimize the sum of squared Euclidean distances between objects and cluster centroids.

Let X = {x_i | i = 1, 2, ..., n} be a data set with n objects, let k be the number of clusters, and let m_j be the centroid of cluster c_j, where j = 1, 2, ..., k. The algorithm measures the distance between a data object and a centroid with the Euclidean distance formula [1]. For two points X and Y with N attributes (characteristics, in data mining terminology), it is defined by [5]:

d(X, Y) = \sqrt{ \sum_{i=1}^{N} (X_i - Y_i)^2 }    (1)

Equivalently, the distance between an object x_i and a centroid m_j is \sqrt{ \sum (x_i - m_j)^2 }.

Starting from an initial distribution of cluster centers in data space, each object is assigned to the cluster with the closest center, after which each center is updated as the center of mass of all objects belonging to that particular cluster. The procedure is repeated until convergence.
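As a small illustration, Equation (1) can be computed directly with NumPy; the two three-attribute objects below are hypothetical, and the function name is our own:

```python
import numpy as np

def euclidean_distance(x, y):
    """Equation (1): square root of the sum of squared attribute differences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# Two hypothetical objects with N = 3 attributes each.
print(euclidean_distance([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # 5.0
```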


2.1. K-MEANS ALGORITHM [1]

INPUT:
// A set of n items to cluster.
D = {d1, d2, d3, ..., dn}
// A number of clusters (temporary clusters) chosen randomly, i.e. k.
// Below, K is the set of subsets of D forming the temporary clusters and C is the set of centroids of those clusters.
K = {k1, k2, k3, ..., kk},
C = {c1, c2, c3, ..., ck}
where k1 = {d1}, k2 = {d2}, k3 = {d3}, ..., kk = {dk}
and c1 = d1, c2 = d2, c3 = d3, ..., ck = dk
// here k <= n

OUTPUT:
// K is the set of subsets of D forming the final clusters and C is the set of centroids of these clusters.
K = {k1, k2, k3, ..., kk},
C = {c1, c2, c3, ..., ck}

Algorithm: K-means(D, K, C)
1. Arbitrarily choose k objects from D as the initial cluster centers.
2. Repeat:
3. (Re)assign each object to the cluster to which the object is most similar, based on the mean value of the objects in the cluster.
4. Update the cluster means, i.e., calculate the mean value of the objects for each cluster.
5. Until no change.
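To make steps 1-5 concrete, here is a minimal runnable sketch in Python/NumPy. It is our own illustration, not the paper's C# implementation, and all names (`kmeans`, `init`, etc.) are ours; the optional `init` argument, used in a later example, lets the caller fix the starting centroids.

```python
import numpy as np

def kmeans(D, k, init=None, max_iter=100, seed=None):
    """Direct K-means (steps 1-5 above). D is an (n, d) array of objects."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    # Step 1: arbitrarily choose k objects from D as the initial centers
    # (or accept caller-supplied centers via `init`).
    if init is None:
        C = D[rng.choice(len(D), size=k, replace=False)]
    else:
        C = np.asarray(init, dtype=float)
    for _ in range(max_iter):
        # Step 3: (re)assign each object to the nearest centroid, eq. (1).
        dists = np.linalg.norm(D[:, None, :] - C[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: update each centroid as the mean of its assigned objects
        # (an empty cluster keeps its old centroid).
        new_C = np.array([D[labels == j].mean(axis=0) if np.any(labels == j)
                          else C[j] for j in range(k)])
        # Step 5: stop when no centroid changes.
        if np.allclose(new_C, C):
            break
        C = new_C
    return labels, C
```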

2.2. LIMITATIONS OF THE K-MEANS CLUSTERING ALGORITHM

A critical look at the available literature indicates the following shortcomings in existing K-means clustering algorithms [13]. (A small demonstration of limitation 2 follows this list.)

1. In partitioning-based K-means clustering algorithms, the number of clusters (k) needs to be determined beforehand.
2. The algorithm is sensitive to the initial seed selection (starting cluster centroids). Because of this it is susceptible to a local optimum and may miss the global optimum, i.e., it may converge to suboptimal solutions. A suboptimal classification may be found, requiring multiple runs with different initial conditions. The selection of spurious data points as a center may also leave a class with no data points, so that its center can never be updated.
3. It can model only spherical clusters; non-convex cluster shapes cannot be modeled by center-based clustering.
4. It is sensitive to outliers, since even a small number of outliers can substantially influence a mean value.
5. Because of the iterative scheme, the algorithm begins at starting cluster centroids and iteratively updates them to decrease the square error, but the number of iterations required is not known in advance, which is problematic for large data sets. It may take a huge number of iterations to converge; this number cannot be determined beforehand and may change from run to run. Results may also be poor with high-dimensional data.
6. It cannot be used for clustering problems whose results cannot fit in main memory, which is the case when the data set has very high dimensionality or the desired number of clusters is too big.
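As referenced above, a small hypothetical example of limitation 2, reusing the `kmeans` sketch from Section 2.1: four points at the corners of a tall rectangle, where a poor seeding converges to a stable but clearly worse split.

```python
import numpy as np

# Four hypothetical points at the corners of a 2 x 10 rectangle.
pts = np.array([[0., 0.], [0., 10.], [2., 0.], [2., 10.]])

# Seeding on the vertical midline splits the points by x-coordinate;
# the assignment is already stable, so K-means stops with SSE = 100.
labels, C = kmeans(pts, k=2, init=[[0., 5.], [2., 5.]])

# Seeding on the horizontal midline splits the points by y-coordinate,
# the global optimum here, with SSE = 4.
labels, C = kmeans(pts, k=2, init=[[1., 0.], [1., 10.]])
```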


3. PROPOSED CLUSTERING ALGORITHM

Input: // A set D of n objects to cluster and a threshold value Tth.
D = {d1, d2, d3, ..., dn}, Tth

Output: // A set K of k subsets of D as final clusters and a set C of centroids of these clusters.
K = {k1, k2, k3, ..., kk},
C = {c1, c2, c3, ..., ck}

Algorithm: Proposed-cluster(D, Tth)
1. Let k = 1.
2. Randomly choose an object from D; let it be p, and set k1 = {p}.
3. K = {k1}
4. c1 = p
5. C = {c1}
6. Assign a constant value to Tth.
7. For l = 2 to n do:
8.   Choose the next random point from D, other than the already chosen points; let it be q.
9.   Determine m, the index of the centroid cm (1 <= m <= k) in C whose distance to q is minimum, using eq. (1).
10.  If (distance <= Tth) then
11.    km = km ∪ {q}
12.    Calculate the new mean (centroid cm) for cluster km using eq. (2).
13.  Else k = k + 1
14.    kk = {q}
15.    K = K ∪ {kk}
16.    ck = q
17.    C = C ∪ {ck}
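A minimal runnable sketch of steps 1-17, again in Python/NumPy and under the same caveat (our own names, not the paper's implementation). Note the single pass over the data: the loop body runs exactly once per object, so the number of iterations is known in advance (advantage 4 in Section 3.1).

```python
import numpy as np

def threshold_cluster(D, t_th, seed=None):
    """Single-pass threshold clustering (steps 1-17 above)."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    order = rng.permutation(len(D))     # steps 2 and 8: visit objects randomly
    clusters = [[order[0]]]             # step 2: k1 = {p}
    C = [D[order[0]].copy()]            # steps 4-5: c1 = p
    for i in order[1:]:                 # step 7: the remaining n-1 objects
        q = D[i]
        # Step 9: nearest centroid under eq. (1).
        dists = np.linalg.norm(np.asarray(C) - q, axis=1)
        m = int(dists.argmin())
        if dists[m] <= t_th:            # step 10
            clusters[m].append(i)       # step 11: km = km U {q}
            C[m] = D[clusters[m]].mean(axis=0)   # step 12: new centroid
        else:                           # steps 13-17: open a new cluster
            clusters.append([i])
            C.append(q.copy())
    return clusters, np.asarray(C)
```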
3.1. ADVANTAGES OF THE PROPOSED CLUSTERING

A look at the available literature indicates the following advantages of the proposed clustering over the K-means clustering algorithm.

1. In K-means clustering algorithms the number of clusters (k) needs to be determined beforehand, but the proposed clustering algorithm does not require this: it generates the number of clusters automatically.
2. K-means depends upon the initial selection of cluster points, so it is susceptible to a local optimum and may miss the global optimum. The proposed clustering algorithm improves the chances of finding the global optimum.
3. K-means is sensitive to outliers, since a small number of outliers can substantially influence a mean value. In the proposed clustering algorithm outliers cannot influence the mean value; they can be easily identified and removed (if desired).
4. In K-means the number of iterations is not known in advance, but in the proposed clustering it is known.
5. Data are stored in secondary memory and data objects are transferred to main memory one at a time for clustering. Only the cluster representations, i.e. the centroids, are stored permanently in main memory to alleviate space limitations, so the space requirement of the proposed algorithm is very small: only the centroids of the clusters are needed. K-means requires more memory, since each object is stored permanently in memory along with the centroids.

4. TIME COMPLEXITY

4.1 TIME COMPLEXITY OF THE K-MEANS CLUSTERING ALGORITHM [1]

To calculate the running time of the K-means algorithm it is necessary to know the number of times each statement runs and the cost of running it. Sometimes the number of steps is not known, so it has to be assumed. For example, let the number of times the first statement runs, with cost m_1, be q (q >= 1). For each of the q iterations, the next statement, for i = 1, 2, ..., n where n is the number of data objects, runs n+1 times with cost m_2. For each q and for each n, the next statement runs k+1 times, where k is the number of clusters, with cost m_3. The 4th statement runs one time for each q and for each n with cost m_4. Calculating the new mean for each cluster requires k+1 runs for each q with cost m_5.

The running time of the algorithm is the sum of the running times of each statement executed, i.e.

T(n) = m_1 q + m_2 \sum_{j=1}^{q} (n+1) + m_3 \sum_{j=1}^{q} \sum_{i=1}^{n} (k+1) + m_4 \sum_{j=1}^{q} \sum_{i=1}^{n} 1 + m_5 \sum_{j=1}^{q} (k+1)

     = m_1 q + m_2 q (n+1) + m_3 q n (k+1) + m_4 q n + m_5 q (k+1)

     = m_1 q + m_2 q n + m_2 q + m_3 q n k + m_3 q n + m_4 q n + m_5 q k + m_5 q


     = (m_1 + m_2 + m_5) q + (m_2 + m_3 + m_4) q n + m_3 q n k.

In the worst case this is O(n^i) with 2 <= i < 3; in the best case it is O(n); in the average case it is O(n^2).

4.2 TIME COMPLEXITY OF THE PROPOSED CLUSTERING ALGORITHM

The time taken by an algorithm depends on the input data set: clustering a thousand data objects takes longer than clustering one object, and K-means and the proposed algorithm take different amounts of time to cluster the same data objects. In general, the time taken by an algorithm grows with the size of the input, so it is traditional to describe the running time of a program as a function of the size of its input. To do so, the terms "running time" and "size of input" must be defined more carefully. The most natural measure of input size is the number of objects in the input, represented here by n. The running time of an algorithm on a particular input is the number of primitive operations or "steps" executed, defined so as to be as machine-independent as possible: a constant amount of time is required to execute each line of the algorithm, and although one line may take a different amount of time than another, it is assumed that each execution of the ith line takes time m_i, where m_i is a constant. In the following discussion, the expression for the running time of both algorithms evolves from a messy formula that uses all the statement costs m_i to a much simpler notation that is concise and more easily manipulated. This simpler notation makes it easy to determine whether one algorithm is more efficient than another.

In the proposed clustering algorithm, as in incremental K-means, the number of times each statement runs is known. The 1st, 2nd, 3rd, 4th, 5th and 6th statements run one time only, with costs m_1, m_2, m_3, m_4, m_5 and m_6 respectively. The next statement, for i = 2, 3, ..., n, runs n times with cost m_7, where n is the number of data objects. The 8th statement finds the next random object to cluster, with cost m_8 (q runs in total). The 9th statement scans the centroid of each cluster with cost m_9, so it runs k+1 times, where k is the number of clusters. The rest of the statements form the if-then-else body, which runs n-1 times with condition cost m_10: let the if-part (steps 11-12) run r times with costs m_11 and m_12, and the else-part (steps 13-17) run n-1-r times with costs m_13, m_14, m_15, m_16 and m_17.

The running time of the algorithm is the sum of the running times of each statement executed, i.e.

T(n) = m_1 + m_2 + m_3 + m_4 + m_5 + m_6 + m_7 n + m_8 q + m_9 \sum_{i=2}^{n} (k+1) + m_10 (n-1) + (m_11 + m_12) r + (m_13 + m_14 + m_15 + m_16 + m_17)(n-1-r)

     = m_1 + m_2 + m_3 + m_4 + m_5 + m_6 + (m_7 + m_10 + m_13 + m_14 + m_15 + m_16 + m_17) n - (m_10 + m_13 + m_14 + m_15 + m_16 + m_17) + (m_11 + m_12 - m_13 - m_14 - m_15 - m_16 - m_17) r + m_9 \sum_{i=2}^{n} (k+1) + m_8 q.

For the worst case, let k increase with i (a new cluster formed at almost every step); then

\sum_{i=2}^{n} (k+1) = 2 + 3 + ... + n = n(n+1)/2 - 1

so T(n) = ... + m_9 (n(n+1)/2 - 1) + m_8 q, giving T(n) = O(n^2).

For the best case, let k = 1 for 2 <= i <= n; then \sum_{i=2}^{n} (k+1) = 2(n-1), so T(n) = O(n).

For the average case it will be O(n^i) with 1 <= i <= 2.

Table 1: Comparison of the algorithms' running times

Name of algorithm  | Worst case          | Average case        | Best case
K-means            | O(n^i), 2 <= i < 3  | O(n^2)              | O(n)
Proposed algorithm | O(n^2)              | O(n^i), 1 <= i <= 2 | O(n)
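As a compact restatement of the two derivations above (a summary sketch, not an additional result), the dominant terms can be written as:

```latex
% K-means: q passes over the data, each computing about n(k+1) distances,
% so the m_3 q n k term dominates:
T_{\text{K-means}}(n) = \Theta(q\,n\,k).
% Proposed: one pass; object i is compared with the k_i + 1 centroids
% existing at step i, giving
T_{\text{proposed}}(n) = \Theta\Big(\sum_{i=2}^{n}(k_i + 1)\Big),
\qquad 2(n-1) \le \sum_{i=2}^{n}(k_i + 1) \le \frac{n(n+1)}{2} - 1.
```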


5. Experimental Results

The proposed algorithm was implemented in Visual Studio 2008 (.NET) using the C# language, with Microsoft SQL Server 2008 as the backend. We evaluated the algorithm on the Max Hospital data set of diabetic patients. All the experimental results reported were obtained on an Intel Core i3 with a 3.0 GHz processor clock and 4 GB of memory, running Windows 7 Home Basic.

Table 2: Experimental results obtained by the proposed algorithm

Test case | Threshold value | Square error ×100 | Min. no. of objects in a cluster | No. of objects as outliers | No. of clusters formed
1 | 12 | 17.57 | 2 | 2  | 9
1 | 11 | 15.18 | 2 | 1  | 11
1 | 10 | 9.14  | 2 | 4  | 12
1 | 9  | 7.64  | 2 | 3  | 13
1 | 8  | 6.22  | 2 | 6  | 12
1 | 7  | 4.84  | 2 | 8  | 12
1 | 6  | 3.78  | 2 | 11 | 12
2 | 12 | 17.2  | 3 | 6  | 7
2 | 11 | 14.79 | 3 | 7  | 8
2 | 10 | 8.42  | 3 | 12 | 8
2 | 9  | 6.9   | 3 | 11 | 9
2 | 8  | 5.58  | 3 | 14 | 8
2 | 7  | 4.35  | 3 | 14 | 9
2 | 6  | 3.56  | 3 | 15 | 10
3 | 12 | 17.21 | 4 | 6  | 7
3 | 11 | 14.13 | 4 | 10 | 7
3 | 10 | 7.49  | 4 | 18 | 6
3 | 9  | 5.8   | 4 | 20 | 6
3 | 8  | 5.32  | 4 | 17 | 7
3 | 7  | 3.92  | 4 | 20 | 7
3 | 6  | 2.78  | 4 | 27 | 6

The table shows three test cases with a minimum number of objects per cluster of 2, 3 and 4 respectively, with the threshold value varying from 6 to 12 within each test case. For different threshold values we obtained different values of the square error, the number of objects flagged as outliers, and the number of clusters formed.

Figure 1: Graph for test case 1 (threshold value, number of clusters formed, and square error ×100).

Figure 2: Graph for test case 2 (threshold value, number of clusters formed, and square error ×100).
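A sweep in the spirit of Table 2 can be imitated on synthetic data with the `threshold_cluster` sketch from Section 3. The synthetic dataset, the minimum-cluster-size rule for flagging outliers, and the error accounting below are our assumptions; the paper's exact protocol on the hospital data is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the patient data: 200 objects in 4 loose groups.
data = rng.normal(scale=3.0, size=(200, 2)) + rng.integers(0, 4, (200, 1)) * 10.0

min_size = 2                      # as in test case 1
for t_th in range(12, 5, -1):     # threshold swept from 12 down to 6
    clusters, C = threshold_cluster(data, t_th, seed=0)
    kept = [(j, c) for j, c in enumerate(clusters) if len(c) >= min_size]
    outliers = sum(len(c) for c in clusters if len(c) < min_size)
    sq_err = sum(((data[c] - C[j]) ** 2).sum() for j, c in kept)
    print(t_th, len(kept), outliers, round(sq_err, 2))
```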


Figure 3: Graph for test case 3 (threshold value, number of clusters formed, and square error ×100).

The graphs above show that:

1. As the threshold value decreases, the square error decreases. A lower square error means more compact and better-separated clusters, so decreasing the threshold value increases cluster quality.
2. As the threshold value decreases, the number of clusters formed increases.
3. As the threshold value decreases, the number of objects flagged as outliers increases.

6. Conclusions

In this paper we presented an algorithm for performing K-means-style clustering. Our experimental results demonstrate that our scheme can improve on the direct K-means algorithm. This paper also analyzes the time complexity of K-means and of our proposed algorithm.

Several improvements to the basic strategy presented in this paper are possible. One approach would be to use the concept of the Nearest Neighbour Clustering Algorithm to improve the compactness of clusters.

7. References

1. Han, J. & Kamber, M. (2012). Data Mining: Concepts and Techniques. 3rd ed. Boston: Morgan Kaufmann Publishers.
2. Sudhir Singh, Nasib Singh Gill. Comparative Study of Different Data Mining Techniques: A Review. IJLTEMAS (www.ijltemas.in), Volume II, Issue IV, April 2013, ISSN 2278-2540.
3. M. Ester, H. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proc. of the 2nd Int'l Conf. on Knowledge Discovery and Data Mining, August 1996.
4. M. Ester, H. Kriegel, and X. Xu. Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification. Proc. of the Fourth Int'l Symposium on Large Spatial Databases, 1995.
5. D. Judd, P. McKinley, and A. Jain. Large-Scale Parallel Data Clustering. Proc. of the Int'l Conference on Pattern Recognition, August 1996.
6. R. T. Ng and J. Han. Efficient and Effective Clustering Methods for Spatial Data Mining. Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, pages 144–155, 1994.
7. T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. Proc. of the 1996 ACM SIGMOD Int'l Conf. on Management of Data, Montreal, Canada, pages 103–114, June 1996.
8. Sanjay Chakraborty, N. K. Nagwani. Performance Evaluation of Incremental K-means Clustering Algorithm. National Institute of Technology (NIT) Raipur, CG, India. IIJDWM, journal homepage: www.ifrsa.org.
9. Performance Analysis of Partitional and Incremental Clustering. Seminar Nasional Aplikasi Teknologi Informasi 2005 (SNATI 2005), ISBN 979-756-061-6, Yogyakarta, 18 June 2005.
10. Performance Evaluation of Incremental K-means Clustering Algorithm. IFRSA International Journal of Data Warehousing & Mining, Vol. 1, Issue 1, Aug 2011.
11. M. H. Dunham. Data Mining: Introductory and Advanced Topics. Prentice Hall, 2002.
12. R. C. Dubes, A. K. Jain. Algorithms for Clustering Data. Prentice Hall, 1988.
13. Stéphane Tufféry. Data Mining and Statistics for Decision Making. Wiley, page 251.
