K-Means Clustering

K-means clustering is an unsupervised machine learning algorithm that groups unlabeled data points into a specified number (k) of clusters. It works by assigning each data point to the cluster with the nearest mean and recalculating the means for each cluster iteratively until convergence is reached. Key steps include initializing k cluster centers, calculating distances between data points and centers, assigning points to the closest clusters, and updating cluster centers. The algorithm aims to minimize within-cluster variance, but its results depend on initialization, and it assumes spherical clusters of equal size and density.


K-means Clustering
• What is clustering?
• Why would we want to cluster?
• How would you determine clusters?
• How can you do this efficiently?
Clustering
• Unsupervised learning
– Requires data, but no labels
• Detect patterns, e.g., in
– Group emails or search results
– Customer shopping patterns
– Regions of images
• Useful when you don’t know what you’re looking for
• Basic idea: group together similar instances
• What could “similar” mean?
– One option: Euclidean distance
• Clustering results are crucially dependent on the
measure of similarity (or distance) between “points”
to be clustered
K-means Clustering
Basic Algorithm:
• Step 0: Select K.
• Step 1: Randomly select any K data points as the initial cluster centers.
• Step 2: Calculate the distance from each object to each cluster center.
• What type of distance should we use?
– Squared Euclidean distance, or another given distance function
K-means Clustering
• Step 3: Assign each object to the closest cluster
• Step 4: Compute the new centroid for each cluster
– The center of a cluster is computed by taking the mean of all the data points contained in that cluster.
• Iterate steps 2 to 4:
– Calculate the distance from each object to each cluster centroid.
– Assign each object to its closest cluster.
– Recalculate the new centroids.
K-means Clustering
• Stop based on a convergence criterion (see the sketch below):
– Centers of the newly formed clusters do not change
– Data points remain in the same clusters
– The maximum number of iterations is reached
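
A minimal NumPy sketch of steps 0–4 and these stopping rules; the function name, the empty-cluster guard, and the default arguments are illustrative choices, not prescribed by the slides:

    import numpy as np

    def kmeans(X, k, max_iters=100, seed=0):
        """Plain K-means on an (n, d) float array X; returns (centers, labels)."""
        rng = np.random.default_rng(seed)
        # Step 1: randomly select k data points as the initial cluster centers
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        labels = None
        for _ in range(max_iters):                   # stop: max iterations reached
            # Step 2: squared Euclidean distance from every point to every center
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            # Step 3: assign each object to the closest cluster
            new_labels = dists.argmin(axis=1)
            if labels is not None and np.array_equal(new_labels, labels):
                break                                # stop: assignments unchanged
            labels = new_labels
            # Step 4: move each center to the mean of its assigned points
            for j in range(k):
                if np.any(labels == j):              # guard against empty clusters
                    centers[j] = X[labels == j].mean(axis=0)
        return centers, labels

    # usage: centers, labels = kmeans(np.random.rand(100, 2), k=3)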
Example 3
• Cluster the following eight points (with (x, y) representing locations) into three clusters:
– A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
• Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
• The distance function between two points a = (x1, y1) and b = (x2, y2) is the Manhattan distance:
– Ρ(a, b) = |x2 – x1| + |y2 – y1|
• Use the K-Means algorithm to find the three cluster centers after the second iteration.
• Calculating the distance between A1(2, 10) and C1(2, 10):
– Ρ(A1, C1) = |x2 – x1| + |y2 – y1| = |2 – 2| + |10 – 10| = 0
• Calculating the distance between A1(2, 10) and C2(5, 8):
– Ρ(A1, C2) = |x2 – x1| + |y2 – y1| = |5 – 2| + |8 – 10| = 3 + 2 = 5
• Calculating the distance between A1(2, 10) and C3(1, 2):
– Ρ(A1, C3) = |x2 – x1| + |y2 – y1| = |1 – 2| + |2 – 10| = 1 + 8 = 9
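
These three distances are easy to check in a few lines of Python (a quick verification, not part of the original slides):

    # Manhattan distance between two points a and b
    def manhattan(a, b):
        return abs(b[0] - a[0]) + abs(b[1] - a[1])

    A1 = (2, 10)
    for name, c in [("C1", (2, 10)), ("C2", (5, 8)), ("C3", (1, 2))]:
        print(name, manhattan(A1, c))   # prints 0, 5, 9 -> A1 joins Cluster-01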
• In the same way, calculate the distance of every remaining point from each of the three centers and assign each point to its closest center. The new clusters are:

• Cluster-01: First cluster contains points:
– A1(2, 10)

• Cluster-02: Second cluster contains points:
– A3(8, 4)
– A4(5, 8)
– A5(7, 5)
– A6(6, 4)
– A8(4, 9)

• Cluster-03: Third cluster contains points:
– A2(2, 5)
– A7(1, 2)
• Re-compute the new cluster centers.
• The new cluster center is computed by taking the mean of all the points contained in that cluster.

• For Cluster-01:
– We have only one point A1(2, 10) in Cluster-01.
– So, the cluster center remains the same.

• For Cluster-02:
– Center of Cluster-02 = ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)

• For Cluster-03:
– Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)

• This completes Iteration-01.


Iteration-02
• Calculate the distance of each point from each of the three cluster centers.
• The distance is calculated using the given distance function.
• Calculating the distance between A1(2, 10) and C1(2, 10):
– Ρ(A1, C1) = |x2 – x1| + |y2 – y1| = |2 – 2| + |10 – 10| = 0
• Calculating the distance between A1(2, 10) and C2(6, 6):
– Ρ(A1, C2) = |x2 – x1| + |y2 – y1| = |6 – 2| + |6 – 10| = 4 + 4 = 8
• Calculating the distance between A1(2, 10) and C3(1.5, 3.5):
– Ρ(A1, C3) = |x2 – x1| + |y2 – y1| = |1.5 – 2| + |3.5 – 10| = 0.5 + 6.5 = 7
New clusters:

• Cluster-01: First cluster contains points:
– A1(2, 10)
– A8(4, 9)

• Cluster-02: Second cluster contains points:
– A3(8, 4)
– A4(5, 8)
– A5(7, 5)
– A6(6, 4)

• Cluster-03: Third cluster contains points:
– A2(2, 5)
– A7(1, 2)
New cluster centers:

• For Cluster-01:
– Center of Cluster-01 = ((2 + 4)/2, (10 + 9)/2) = (3, 9.5)

• For Cluster-02:
– Center of Cluster-02 = ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4) = (6.5, 5.25)

• For Cluster-03:
– Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)

• These are the three cluster centers after the second iteration.
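
The whole walkthrough can be reproduced in a short script; its output matches the centers derived above (a sketch with illustrative variable names):

    import numpy as np

    points = np.array([(2, 10), (2, 5), (8, 4), (5, 8),
                       (7, 5), (6, 4), (1, 2), (4, 9)], dtype=float)  # A1..A8
    centers = np.array([(2, 10), (5, 8), (1, 2)], dtype=float)        # A1, A4, A7

    for iteration in (1, 2):
        # Manhattan distance from every point to every center
        dists = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)   # assign each point to its closest center
        centers = np.array([points[labels == j].mean(axis=0) for j in range(3)])
        print(f"after iteration {iteration}:", centers.round(2).tolist())
    # after iteration 1: [[2.0, 10.0], [6.0, 6.0], [1.5, 3.5]]
    # after iteration 2: [[3.0, 9.5], [6.5, 5.25], [1.5, 3.5]]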
K-means Clustering
• Strengths
– Simple iterative method
– Guaranteed to converge in a finite number of iterations
– User provides “K”
– Running time per iteration:
• Assigning data points to the closest cluster center takes O(KN) time
• Updating each cluster center to the average of its assigned points takes O(N) time
• Weaknesses
– Often too simple → bad results
– Cannot handle noisy data and outliers
– Difficult to guess the correct “K”
– Not suitable for identifying clusters with varying sizes, different densities, or non-convex shapes
K-means Issues
• Distance measure is squared Euclidean
– Scale should be similar in all dimensions
• Rescale data?
– Not good for nominal data. Why?
• Approach tries to minimize the within-cluster sum of
squares error (WCSS) or inertia
– Implicit assumption that SSE is similar for each group
WCSS
• The overall WCSS is given by:
– WCSS = Σ_{i=1..k} Σ_{x ∈ Ci} ||x – μi||², where μi is the centroid (mean) of cluster Ci

• The goal is to find the cluster assignment with the smallest WCSS
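
For a given assignment, the WCSS can be computed with a small helper (a sketch, assuming centers and labels in the form returned by the kmeans sketch earlier):

    import numpy as np

    def wcss(X, centers, labels):
        """Sum of squared distances from each point to its cluster's centroid."""
        return sum(((X[labels == j] - c) ** 2).sum()
                   for j, c in enumerate(centers))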


• Does this depend on the initial seed values?
• Possibly.
• The figure shows two suboptimal solutions that the algorithm can converge to if you are not lucky with the random initialization step.
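
A common remedy is to run K-means several times from different random initializations and keep the run with the lowest WCSS. scikit-learn's KMeans does this through its n_init parameter; a minimal sketch on toy data:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(0).normal(size=(200, 2))   # toy data, not from the slides

    # n_init=10 fits K-means from 10 random initializations and keeps
    # the solution with the lowest inertia (WCSS)
    km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
    print(km.inertia_)   # WCSS of the best of the 10 runs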
Finding the optimal number of clusters

• The inertia is not a good performance metric when trying to


choose k because it keeps getting lower as we increase k.
• Indeed, the more clusters there are, the closer each instance
will be to its closest centroid, and therefore the lower the
inertia will be
• When plotting the inertia as a function of the number of clusters k, the curve often contains an inflexion point called the “elbow”: the curve has roughly the shape of an arm, and the “elbow” marks a reasonable choice of k (see the sketch below).
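
A typical elbow plot loops over candidate values of k and records each fit's inertia (a sketch using scikit-learn and matplotlib; the random data is a stand-in):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    X = np.random.default_rng(0).normal(size=(300, 2))   # stand-in data

    ks = range(1, 10)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in ks]

    plt.plot(ks, inertias, "o-")
    plt.xlabel("number of clusters k")
    plt.ylabel("inertia (WCSS)")
    plt.show()   # look for the “elbow” where the curve stops dropping sharply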
• K-Means can fail to cluster ellipsoidal (elongated) blobs properly.
Image segmentation using K-Means with various numbers of color clusters
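
Color-based segmentation clusters the pixel colors and repaints every pixel with its cluster's centroid color. A sketch, where photo.png is a hypothetical input file:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    img = plt.imread("photo.png")[..., :3]   # hypothetical image; keep RGB, drop alpha
    pixels = img.reshape(-1, 3)              # one row per pixel

    km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(pixels)
    segmented = km.cluster_centers_[km.labels_].reshape(img.shape)  # mean colors

    plt.imshow(segmented)
    plt.show()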
Bottom Line
• K-means
– Easy to use
– Need to know K
– May need to scale data
– A good first method to try
• Local optima
– No guarantee of optimal solution
– Repeat with different starting values
