KMeans Clustering Report
1. Introduction
K-Means Clustering is one of the most popular unsupervised machine learning
algorithms used for data partitioning and pattern recognition. It is particularly effective in
grouping data into clusters based on similarity. The name “K-Means” derives from its
method of locating the centroids (means) of K clusters.
In clustering, the goal is to divide a dataset into distinct groups such that data points in
the same group are more similar to each other than to those in other groups. K-Means is
efficient, easy to implement, and widely used in various domains such as image
compression, market segmentation, social network analysis, and anomaly detection.
The objective is to minimize the intra-cluster variance, also called the within-cluster sum
of squares (WCSS).
3. Mathematical Formulation
Given a set of data points X = {x₁, x₂, ..., xₙ}, K-Means clustering aims to partition them
into K clusters C = {C₁, C₂, ..., Cₖ} by minimizing the within-cluster sum of squares:
J = Σ_{i=1}^{K} Σ_{x ∈ Cᵢ} ||x - μᵢ||²
where μᵢ is the centroid of cluster Cᵢ, and ||x - μᵢ||² is the squared Euclidean distance
between a point and its cluster centroid.
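The within-cluster sum of squares can be computed directly from these definitions. The following is a minimal sketch in plain Python; the helper names (squared_distance, centroid, wcss) and the toy data are illustrative assumptions, not part of the original report.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def wcss(clusters):
    """Sum, over all clusters, of squared distances from each point to its centroid."""
    total = 0.0
    for points in clusters:
        mu = centroid(points)
        total += sum(squared_distance(x, mu) for x in points)
    return total


# Hypothetical toy data: two clusters of 2-D points.
clusters = [
    [(1.0, 1.0), (1.0, 3.0)],  # centroid is (1, 2)
    [(5.0, 5.0), (7.0, 5.0)],  # centroid is (6, 5)
]
print(wcss(clusters))  # 4.0
```

Each point in this example lies exactly one unit from its centroid, so every point contributes 1 to the total.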
Another approach to choosing K is the Silhouette Score, which measures how similar a point is
to its own cluster compared to other clusters. The score ranges from -1 to 1, and higher
values indicate better-defined clusters.
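As an illustration, the silhouette coefficient of a point is commonly defined as (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below assumes Euclidean distance and assigns a score of 0 to singleton clusters; the function name and the sample data are hypothetical.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical data: two tight pairs of points that lie far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the pairs
bad = silhouette_score(points, [0, 1, 0, 1])   # labels split each pair
print(good > bad)  # True
```

To compare candidate values of K, one would cluster the data once per candidate and keep the K with the highest score.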
Disadvantages:
- Requires the number of clusters K to be specified in advance.
- Sensitive to initial placement of centroids.
- Struggles with clusters of non-spherical shapes or varying densities.
- Not suitable for categorical data without preprocessing.
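One common way to reduce the sensitivity to initial centroids is to run the algorithm several times from different random starts and keep the run with the lowest WCSS. The sketch below assumes the standard Lloyd iteration (assign each point to its nearest centroid, then recompute the means) with centroids initialized from randomly chosen data points. The names kmeans_once and kmeans, the parameters n_init and seed, and the sample data are all illustrative assumptions.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def kmeans_once(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm, started from k randomly chosen data points."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean(members)
    wcss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return wcss, centroids, labels


def kmeans(points, k, n_init=10, seed=0):
    """Run kmeans_once n_init times and return the run with the lowest WCSS."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_init))
    return min(runs, key=lambda run: run[0])


# Hypothetical data: three small, well-separated groups of 2-D points.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]
best_wcss, centroids, labels = kmeans(points, k=3)
print(best_wcss, labels)
```

A single run can settle in a poor local minimum, for example when two initial centroids fall in the same group. Taking the best of several runs makes that outcome less likely but does not rule it out.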
6. Applications of K-Means
- Customer Segmentation: Grouping customers based on behavior, purchase history, etc.
- Image Compression: Reducing the number of colors using cluster centroids.
- Document Clustering: Grouping articles or texts by similarity, without predefined labels.
- Anomaly Detection: Identifying outliers in network traffic or transaction data.
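To make the image compression case concrete, the sketch below shows only the final replacement step: each pixel is mapped to its nearest centroid color. It assumes the centroids have already been found by K-Means; the palette values, the pixel data, and the function name quantize are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def quantize(pixels, palette):
    """Map each pixel to its nearest palette color.

    Returns the palette index chosen for each pixel and the recolored pixels.
    """
    indices = [
        min(range(len(palette)), key=lambda j: squared_distance(px, palette[j]))
        for px in pixels
    ]
    return indices, [palette[i] for i in indices]


# Hypothetical RGB pixels and a 2-color palette standing in for K-Means centroids.
pixels = [(250, 10, 10), (240, 0, 20), (5, 5, 200), (0, 20, 255)]
palette = [(245, 5, 15), (2, 12, 228)]

indices, compressed = quantize(pixels, palette)
print(indices)  # [0, 0, 1, 1]
```

The saving comes from storing one small palette index per pixel, plus the K palette colors, instead of three full color channels per pixel.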
7. Conclusion
K-Means Clustering is a fundamental and powerful technique in machine learning and
data analysis. Its intuitive approach and speed make it a strong choice for many practical
applications, especially where the data structure is relatively simple. Despite its
limitations, K-Means often serves as a good baseline and is frequently used in
exploratory data analysis.