
K Mean Clustering

The K-means clustering algorithm is used to partition n observations into k clusters where each observation belongs to the cluster with the nearest mean. It works by iteratively assigning datapoints to centroids and updating the centroid positions until convergence is reached. The algorithm starts by randomly initializing k centroids and then calculates the distance between each datapoint and centroid, assigning each datapoint to its closest centroid. It then recalculates the positions of the k centroids as the means of the datapoints in each cluster and repeats this process until the centroids no longer move.

Uploaded by

Rexline S J
Copyright: © All Rights Reserved

K-MEANS CLUSTERING
INTRODUCTION
What is clustering?

• Clustering is the classification of objects into
different groups, or more precisely, the
partitioning of a data set into subsets
(clusters), so that the data in each subset
(ideally) share some common trait, often
according to some defined distance measure.
• The output image clearly shows the five clusters in different colors.
• The clusters are formed over two features of the dataset: the customer's
annual income and spending.
• We can change the colors and labels as required.
• Cluster1 shows the customers with average salary and average
spending so we can categorize these customers as
• Cluster2 shows customers with high income but low spending, so
we can categorize them as careful.
• Cluster3 shows customers with low income and low spending, so they can be
categorized as sensible.
• Cluster4 shows customers with low income but very high
spending, so they can be categorized as careless.
• Cluster5 shows customers with high income and high spending, so
they can be categorized as target; these customers can be the
most profitable customers for the mall owner.
How does the K-Means Algorithm Work?

• Step-1: Choose the number K of clusters.
• Step-2: Select K random points as the initial centroids. (They need not be
points from the input dataset.)
• Step-3: Assign each data point to its closest centroid, which
forms the K clusters.
• Step-4: Compute the mean of the points in each cluster and place that
cluster's new centroid there.
• Step-5: Repeat Step-3, that is, reassign each data point to
the new closest centroid.
• Step-6: If any reassignment occurred, go to Step-4; otherwise go to
FINISH.
• Step-7: The model is ready.
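The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `kmeans`, its `init`, `seed` and `max_iter` parameters, and the six sample points are all assumptions made for the example. Starting centroids are drawn at random from the data unless `init` is given.

```python
import math
import random

def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Cluster points into k groups by alternating assignment and update."""
    # Step-2: use the given centroids, or pick k distinct data points at random.
    centroids = list(init) if init is not None else random.Random(seed).sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step-3 / Step-5: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Step-6: stop as soon as no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step-4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical sample data: two well-separated blobs.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
# Fixed starting centroids make the run reproducible.
labels, centroids = kmeans(points, k=2, init=[(1.0, 1.0), (8.0, 8.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
print(centroids)
```

Passing `init` explicitly is only a convenience for reproducible runs; omitting it gives the random start described in Step-2.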
K-MEANS CLUSTERING
• The k-means algorithm is an algorithm to cluster
n objects based on attributes into k partitions,
where k < n.
• It is similar to the
expectation-maximization algorithm for mixtures of
Gaussians in that they both attempt to find the
centers of natural clusters in the data.
• It assumes that the object attributes form a
vector space.
• It is an algorithm for partitioning (or clustering) N
data points into K disjoint subsets $S_j$
of data points so as to minimize the
sum-of-squares criterion

$$J = \sum_{j=1}^{K} \sum_{n \in S_j} \lVert x_n - \mu_j \rVert^2$$

where $x_n$ is a vector representing the nth
data point and $\mu_j$ is the geometric centroid of
the data points in $S_j$.
• Simply speaking, k-means clustering is an
algorithm to classify or group objects
based on their attributes/features into K
groups.
• K is a positive integer.
• The grouping is done by minimizing the sum
of squared distances between the data points and the
corresponding cluster centroids.
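The sum-of-squares criterion described above can be evaluated directly for any candidate grouping. The sketch below is illustrative only: the helper name `sum_of_squares` is an assumption, and the four points are the medicine data from the numerical example in this deck. It compares two groupings of the same points, and the lower value marks the tighter clustering.

```python
def sum_of_squares(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total

points = [(1.0, 1.0), (2.0, 1.0), (4.0, 3.0), (5.0, 4.0)]

# Grouping {A, B} / {C, D}, with each centroid at the mean of its members.
tight = sum_of_squares(points, [0, 0, 1, 1], [(1.5, 1.0), (4.5, 3.5)])

# Grouping {A} / {B, C, D}, again with centroids at the member means.
loose = sum_of_squares(points, [0, 1, 1, 1], [(1.0, 1.0), (11 / 3, 8 / 3)])

print(tight)  # 1.5
print(loose)
```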
How does the K-Means Clustering algorithm work?
• Step 1: Begin with a decision on the value of k =
number of clusters.
• Step 2: Choose any initial partition that classifies the
data into k clusters. You may assign the
training samples randomly, or systematically
as follows:
1. Take the first k training samples as single-
element clusters.
2. Assign each of the remaining (N-k) training
samples to the cluster with the nearest centroid.
After each assignment, recompute the centroid of
the gaining cluster.
• Step 3: Take each sample in sequence and
compute its distance from the centroid
of each of the clusters. If a sample is not
currently in the cluster with the
closest centroid, switch this
sample to that cluster and update the
centroids of both the cluster gaining the
sample and the cluster losing the
sample.
• Step 4: Repeat step 3 until convergence is
achieved, that is, until a pass through
the training samples causes no new
assignments.
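A minimal Python sketch of this sequential variant follows. It assumes the systematic initialization from Step 2 (the first k samples seed the clusters). The function names are illustrative, and the sample data are the four medicine points from the numerical example in this deck. A sample only switches when another centroid is strictly closer, which is an added assumption to avoid oscillating on ties.

```python
import math

def mean(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(members) for c in zip(*members))

def sequential_kmeans(samples, k):
    # Step 2.1: the first k samples become single-element clusters.
    clusters = [[s] for s in samples[:k]]
    centroids = [mean(c) for c in clusters]
    labels = list(range(k))
    # Step 2.2: add each remaining sample to the nearest cluster,
    # recomputing the centroid of the gaining cluster each time.
    for s in samples[k:]:
        j = min(range(k), key=lambda c: math.dist(s, centroids[c]))
        clusters[j].append(s)
        centroids[j] = mean(clusters[j])
        labels.append(j)
    # Steps 3-4: sweep through the samples until a full pass moves nothing.
    changed = True
    while changed:
        changed = False
        for i, s in enumerate(samples):
            old = labels[i]
            new = min(range(k), key=lambda c: math.dist(s, centroids[c]))
            if math.dist(s, centroids[new]) < math.dist(s, centroids[old]):
                clusters[old].remove(s)
                clusters[new].append(s)
                # Update both the losing and the gaining cluster.
                centroids[old] = mean(clusters[old])
                centroids[new] = mean(clusters[new])
                labels[i] = new
                changed = True
    return labels, centroids

samples = [(1, 1), (2, 1), (4, 3), (5, 4)]
labels, centroids = sequential_kmeans(samples, k=2)
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [(1.5, 1.0), (4.5, 3.5)]
```

Because a lone member sits exactly on its own centroid, the strict comparison also prevents a cluster from being emptied.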
A simple example showing the
implementation of the k-means algorithm
(using K=2)
Step 1:
Initialization: We randomly choose the following two centroids
(k=2) for the two clusters.
In this case the two centroids are m1=(1.0,1.0) and
m2=(5.0,7.0).
Step 2:
• Thus, we obtain two clusters
containing {1,2,3} and {4,5,6,7}.
• Their new centroids are:
Step 3:
• Now, using these centroids,
we compute the Euclidean
distance of each object, as
shown in the table.
• Therefore, the new
clusters are
{1,2} and {3,4,5,6,7}.
• The next centroids are
m1=(1.25,1.5) and m2=(3.9,5.1).
Step 4:
• The clusters obtained are
{1,2} and {3,4,5,6,7}.
• Therefore, there is no
change in the clusters.
• Thus, the algorithm comes
to a halt here, and the final
result consists of 2 clusters:
{1,2} and {3,4,5,6,7}.
PLOT (with K=3): panels for Step 1 and Step 2
Real-Life Numerical Example
of K-Means Clustering
We have 4 medicines as our training data points,
and each medicine has 2 attributes. Each attribute
represents a coordinate of the object. We have to
determine which medicines belong to cluster 1 and
which medicines belong to cluster 2.

Object       Attribute 1 (X): weight index   Attribute 2 (Y): pH
Medicine A   1                               1
Medicine B   2                               1
Medicine C   4                               3
Medicine D   5                               4
Step 1:
• Initial value of
centroids: Suppose
we use medicine A and
medicine B as the first
centroids.
• Let c1 and c2
denote the coordinates
of the centroids; then
c1=(1,1) and c2=(2,1).
• Objects-Centroids distance: We calculate the
distance from each cluster centroid to each object.
Using Euclidean distance, the
distance matrix at iteration 0 is

$$D^0 = \begin{bmatrix} 0 & 1 & 3.61 & 5 \\ 1 & 0 & 2.83 & 4.24 \end{bmatrix}$$

• Each column in the distance matrix corresponds to an
object (A, B, C, D from left to right).
• The first row of the distance matrix corresponds to the
distance of each object to the first centroid and the
second row is the distance of each object to the second
centroid.
• For example, the distance from medicine C = (4, 3) to the
first centroid is $\sqrt{(4-1)^2 + (3-1)^2} = 3.61$, and its distance to the
second centroid is $\sqrt{(4-2)^2 + (3-1)^2} = 2.83$, etc.
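The iteration-0 distance matrix can be checked with a few lines of Python. This is a small sketch using only the coordinates given in the slides; the variable names are illustrative.

```python
import math

medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
centroids = [(1, 1), (2, 1)]  # c1 = medicine A, c2 = medicine B

# One row per centroid and one column per object, matching the slide layout.
distance_matrix = [
    [round(math.dist(c, p), 2) for p in medicines.values()]
    for c in centroids
]
for row in distance_matrix:
    print(row)
# [0.0, 1.0, 3.61, 5.0]
# [1.0, 0.0, 2.83, 4.24]
```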
Step 2:
• Objects clustering: We
assign each object to the group
with the minimum distance.
• Medicine A is assigned to
group 1, medicine B to
group 2, medicine C to
group 2 and medicine D to
group 2.
• An element of the Group
matrix below is 1 if and
only if the object is
assigned to that group.

$$G^0 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 \end{bmatrix}$$
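• Iteration-1, determine centroids: Group 1 has only one
member (medicine A), so its centroid stays at c1=(1,1).
Group 2 has three members (B, C, D), so its centroid is their
average: c2 = ((2+4+5)/3, (1+3+4)/3) = (11/3, 8/3).
• Iteration-1, Objects-Centroids distances: The
next step is to compute the distance of all
objects to the new centroids.
• As in the first distance computation, the distance matrix at
iteration 1 is

$$D^1 = \begin{bmatrix} 0 & 1 & 3.61 & 5 \\ 3.14 & 2.36 & 0.47 & 1.89 \end{bmatrix}$$

• Iteration-1, Objects
clustering: Based on the new
distance matrix, we move
medicine B to Group 1 while
all the other objects remain.
The Group matrix is

$$G^1 = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}$$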

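• Iteration 2, determine
centroids: We now recompute
the centroid coordinates
based on the
clustering of the previous iteration.
Group 1 and Group 2 both have
two members, thus the new
centroids are
$c_1 = \left(\frac{1+2}{2}, \frac{1+1}{2}\right) = (1.5, 1)$ and $c_2 = \left(\frac{4+5}{2}, \frac{3+4}{2}\right) = (4.5, 3.5)$.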
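The centroid update is just a coordinate-wise average over each group's members. A short sketch, assuming the grouping reached after iteration 1 (A and B in group 1, C and D in group 2); the helper name `centroid` is illustrative.

```python
def centroid(members):
    """Coordinate-wise mean of a list of 2-D points."""
    return tuple(sum(c) / len(members) for c in zip(*members))

medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
groups = {1: ["A", "B"], 2: ["C", "D"]}  # grouping after iteration 1

new_centroids = {
    g: centroid([medicines[name] for name in names]) for g, names in groups.items()
}
print(new_centroids)  # {1: (1.5, 1.0), 2: (4.5, 3.5)}
```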
• Iteration-2, Objects-Centroids distances:
Repeating the distance computation, we have the new distance
matrix at iteration 2 as

$$D^2 = \begin{bmatrix} 0.5 & 0.5 & 3.20 & 4.61 \\ 4.30 & 3.54 & 0.71 & 0.71 \end{bmatrix}$$

• Iteration-2, Objects clustering: Again, we
assign each object based on the minimum
distance.
• We obtain the result that $G^2 = G^1$, that is, the Group matrix
is unchanged from iteration 1. Comparing the
grouping of the last iteration and this iteration reveals
that the objects no longer change groups.
• Thus, the computation of k-means clustering
has reached stability and no more iterations are
needed.
We get the final grouping as the result:

Object       Feature 1 (X): weight index   Feature 2 (Y): pH   Group (result)
Medicine A   1                             1                   1
Medicine B   2                             1                   1
Medicine C   4                             3                   2
Medicine D   5                             4                   2
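The whole worked example can be replayed in a short loop. The sketch below assumes the same starting centroids as Step 1 (medicines A and B) and records the grouping found at each iteration; the variable names are illustrative.

```python
import math

medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
names = list(medicines)
centroids = [medicines["A"], medicines["B"]]  # initial centroids from Step 1

history = []  # grouping (1 or 2 per medicine) found at each iteration
while True:
    # Assign every medicine to the group of its nearest centroid.
    groups = [
        1 + min(range(2), key=lambda j: math.dist(medicines[n], centroids[j]))
        for n in names
    ]
    if history and groups == history[-1]:
        break  # no object changed group: stable
    history.append(groups)
    # Recompute each centroid as the mean of its group's members.
    centroids = [
        tuple(
            sum(medicines[n][axis] for n, g in zip(names, groups) if g == j + 1)
            / groups.count(j + 1)
            for axis in range(2)
        )
        for j in range(2)
    ]

final = dict(zip(names, groups))
print(history)  # [[1, 2, 2, 2], [1, 1, 2, 2]]
print(final)    # {'A': 1, 'B': 1, 'C': 2, 'D': 2}
```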
K-Means Clustering Visual Basic Code

Sub kMeanCluster(Data() As Variant, numCluster As Integer)
    ' main function to cluster data into k number of clusters
    ' input:
    '   + Data matrix (0 to 2, 1 to totalData);
    '     row 0 = cluster, 1 = X, 2 = Y; data in columns
    '   + numCluster: number of clusters the user wants the data grouped into
    '   + private (module-level) variables: Centroid, totalData
    ' output:
    '   o) update centroid
    '   o) assign cluster number to the Data (= row 0 of Data)

    Dim i As Integer
    Dim j As Integer
    Dim X As Single
    Dim Y As Single
    Dim min As Single
    Dim cluster As Integer
    Dim d As Single
    Dim sumXY()
    Dim isStillMoving As Boolean

    isStillMoving = True
    If totalData <= numCluster Then
        ' only the last data point is handled here because the routine is designed to be interactive
        Data(0, totalData) = totalData ' cluster No = total data
        Centroid(1, totalData) = Data(1, totalData) ' X
        Centroid(2, totalData) = Data(2, totalData) ' Y
    Else
        ' calculate minimum distance to assign the new data
        min = 10 ^ 10 ' big number
        X = Data(1, totalData)
        Y = Data(2, totalData)
        ' (loop body missing from the slide; reconstructed to mirror the assignment loop below)
        For i = 1 To numCluster
            d = dist(X, Y, Centroid(1, i), Centroid(2, i))
            If d < min Then
                min = d
                cluster = i
            End If
        Next i
        Data(0, totalData) = cluster

        Do While isStillMoving
            ' this loop stops once no data point changes cluster
            ' calculate new centroids
            ' 1 = X, 2 = Y, 3 = count number of data
            ReDim sumXY(1 To 3, 1 To numCluster)
            For i = 1 To totalData
                sumXY(1, Data(0, i)) = Data(1, i) + sumXY(1, Data(0, i))
                sumXY(2, Data(0, i)) = Data(2, i) + sumXY(2, Data(0, i))
                sumXY(3, Data(0, i)) = 1 + sumXY(3, Data(0, i))
            Next i
            For i = 1 To numCluster
                ' skip empty clusters to avoid division by zero
                If sumXY(3, i) > 0 Then
                    Centroid(1, i) = sumXY(1, i) / sumXY(3, i)
                    Centroid(2, i) = sumXY(2, i) / sumXY(3, i)
                End If
            Next i

            ' assign all data to the new centroids
            isStillMoving = False
            For i = 1 To totalData
                min = 10 ^ 10 ' big number
                X = Data(1, i)
                Y = Data(2, i)
                For j = 1 To numCluster
                    d = dist(X, Y, Centroid(1, j), Centroid(2, j))
                    If d < min Then
                        min = d
                        cluster = j
                    End If
                Next j
                If Data(0, i) <> cluster Then
                    Data(0, i) = cluster
                    isStillMoving = True
                End If
            Next i
        Loop
    End If
End Sub

' Assumed helper: dist is called above but not shown in the slides;
' Euclidean distance is used here, as in the numerical example.
Function dist(X1 As Single, Y1 As Single, X2 As Single, Y2 As Single) As Single
    dist = Sqr((X2 - X1) ^ 2 + (Y2 - Y1) ^ 2)
End Function
Weaknesses of K-Means Clustering
1. When there are few data points, the initial
grouping largely determines the clusters.
2. The number of clusters, K, must be determined
beforehand. In addition, the algorithm does not yield the same
result with each run, since the resulting clusters depend
on the initial random assignments.
3. We never know the real clusters: with the same data,
a different input order may
produce different clusters if the number of data points is small.
4. It is sensitive to initial conditions. Different initial conditions
may produce different clusterings, and the algorithm
may be trapped in a local optimum.
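The sensitivity to initial conditions is easy to reproduce on a toy dataset. The sketch below is illustrative only: the four rectangle corners and the function name `run_kmeans` are assumptions, not data from the slides. Both runs converge, but the second start settles in a local optimum with a much larger sum of squared distances.

```python
import math

def run_kmeans(points, centroids, max_iter=100):
    """Batch k-means from the given starting centroids; returns (labels, sse)."""
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # no reassignment: converged
            break
        labels = new_labels
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Mean of the members; an empty cluster keeps its old centroid.
            updated.append(
                tuple(sum(c) / len(members) for c in zip(*members))
                if members else centroids[j]
            )
        centroids = updated
    sse = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, sse

# Corners of a 4 x 1 rectangle: the natural split is left pair vs right pair.
corners = [(0, 0), (0, 1), (4, 0), (4, 1)]
good_labels, good_sse = run_kmeans(corners, [(0, 0), (4, 0)])
bad_labels, bad_sse = run_kmeans(corners, [(0, 0), (0, 1)])
print(good_labels, good_sse)  # [0, 0, 1, 1] 1.0
print(bad_labels, bad_sse)    # [0, 1, 0, 1] 16.0
```

A common mitigation is to run the algorithm from several starts and keep the result with the lowest sum of squares.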
Applications of K-Means
Clustering
• It is relatively efficient and fast: it runs
in O(tkn) time, where n is the number of objects or points, k
is the number of clusters and t is the number of iterations.
• K-means clustering can be applied in machine
learning and data mining.
• It is used on acoustic data in speech understanding to
convert waveforms into one of k categories (known
as vector quantization); the same idea is applied to image segmentation.
• It is also used for choosing color palettes on old-
fashioned graphical display devices and for image
quantization.
CONCLUSION
• The K-means algorithm is useful for undirected
knowledge discovery and is relatively simple.
K-means has found widespread use in many
fields, including unsupervised learning
in neural networks, pattern recognition,
classification analysis, artificial intelligence,
image processing, machine vision, and many
others.
