Kmeansfinal
Clustering
Group-2
Thejaswi S
Samir K
Swathi G
Vatsalya k
Sruthi N
Overview of Clustering
• The task of grouping data points based on their similarity with each other is called clustering
• Defined under Unsupervised Learning, which derives insights from unlabeled data
• Evaluates similarity between points using metrics such as Euclidean distance and cosine similarity
• The Within-Cluster Sum of Squares (WCSS) is the sum of squared distances between each data point
(p) and its respective centroid (C₁, C₂, C₃) within each cluster.
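The WCSS description above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the toy points, the three centroids, and the helper names `squared_distance` and `wcss` are all assumptions made for the example.

```python
def squared_distance(p, c):
    """Squared Euclidean distance between a point p and a centroid c."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def wcss(clusters, centroids):
    """Sum, over all clusters, of squared distances from each point to its centroid."""
    return sum(
        squared_distance(p, c)
        for points, c in zip(clusters, centroids)
        for p in points
    )


# Hypothetical toy data: three clusters and their centroids C1, C2, C3.
clusters = [
    [(1.0, 1.0), (1.0, 3.0)],   # points assigned to C1
    [(5.0, 5.0), (7.0, 5.0)],   # points assigned to C2
    [(9.0, 1.0)],               # points assigned to C3
]
centroids = [(1.0, 2.0), (6.0, 5.0), (9.0, 1.0)]

total = wcss(clusters, centroids)
print(total)  # 1 + 1 + 1 + 1 + 0 = 4.0
```

A lower WCSS means the points sit closer to their centroids, which is why the value is tracked when comparing different choices of K.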
Choosing the Number of Clusters (K)
Steps:
• Perform K-means clustering on the dataset for
different K values (typically from 1 to 10).
• Calculate WCSS for each K value.
• Plot WCSS values against the number of clusters
(K).
• Identify the "elbow" point in the graph (sharp
bend). The K value corresponding to the "elbow" is
considered optimal.
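The steps above can be sketched with a small, dependency-free K-means (Lloyd's algorithm). This is an illustrative sketch under stated assumptions: the synthetic three-blob dataset, the function names (`kmeans`, `wcss`, `elbow_curve`), and the choice of five random restarts per K are all made up for the example, and the plotting step is replaced by printing the values.

```python
import random


def squared_distance(p, c):
    return sum((a - b) ** 2 for a, b in zip(p, c))


def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = mean_point(members)
    return centroids, labels


def wcss(points, centroids, labels):
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))


def elbow_curve(points, k_values, restarts=5):
    """WCSS for each K, keeping the best of a few random restarts."""
    return {
        k: min(wcss(points, *kmeans(points, k, seed=s)) for s in range(restarts))
        for k in k_values
    }


# Hypothetical data: three noisy blobs, so the bend is expected near K = 3.
rng = random.Random(42)
centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
points = [
    (cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
    for cx, cy in centers
    for _ in range(30)
]

curve = elbow_curve(points, range(1, 11))
for k, value in curve.items():
    print(k, round(value, 1))
```

Reading the printed values from K = 1 upward, the elbow is the K after which the drop in WCSS becomes small. Restarts are used here because a single run of K-means can settle in a poor local minimum and distort the curve.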
Advantages
1. Simplicity and Efficiency:
⚬ Easy to implement
⚬ Computationally efficient for large datasets
2. Scalability:
⚬ Handles large datasets well
3. Versatility:
⚬ Suitable for various data types.
⚬ Works well with well-separated clusters.
4. Flexibility:
⚬ Can be used for market segmentation, anomaly detection, etc.
Disadvantages
1. Choosing the Right K:
Applications
2. Image Compression:
⚬ Image compression tools (e.g., TinyPNG) reduce file size through color quantization, a task
K-Means is commonly used for, without losing much quality.
⚬ In medical imaging, images (like X-rays) are segmented into different regions for
efficient storage and analysis.
⚬ Replacing each pixel's color with its cluster centroid reduces the total number of colors in the
image, saving memory and computational resources.
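As a rough illustration of the image compression application, the sketch below clusters the colors of a tiny "image" and replaces every pixel with its cluster's centroid color. The pixel values, the function names (`assign`, `quantize_colors`), and the fixed iteration count are assumptions for the example. A real tool would read pixels from an image file and is not claimed to work this way.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(pixels, palette):
    """Index of the nearest palette color for every pixel."""
    return [
        min(range(len(palette)), key=lambda j: squared_distance(px, palette[j]))
        for px in pixels
    ]


def quantize_colors(pixels, k, iterations=20, seed=0):
    """Reduce an image to at most k colors using K-means on RGB values."""
    rng = random.Random(seed)
    palette = rng.sample(sorted(set(pixels)), k)  # k distinct starting colors
    for _ in range(iterations):
        labels = assign(pixels, palette)
        for j in range(k):
            members = [px for px, lab in zip(pixels, labels) if lab == j]
            if members:  # leave the color unchanged if no pixel chose it
                palette[j] = tuple(sum(ch) / len(members) for ch in zip(*members))
    # Round centroids to integer RGB values and map each pixel to its color.
    palette = [tuple(round(ch) for ch in color) for color in palette]
    labels = assign(pixels, palette)
    return [palette[lab] for lab in labels], palette


# Hypothetical 8-pixel image: four reddish and four bluish pixels.
pixels = [
    (250, 10, 10), (240, 20, 15), (255, 0, 5), (245, 5, 20),
    (10, 10, 250), (20, 15, 240), (0, 5, 255), (5, 20, 245),
]

compressed, palette = quantize_colors(pixels, k=2)
print(len(set(pixels)), "colors before,", len(set(compressed)), "after")
```

Storing one small palette plus a palette index per pixel takes less space than storing a full RGB triple for every pixel, which is where the memory saving comes from.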
3. Document Clustering:
⚬ To categorize a large collection of text documents into clusters/topics for easier retrieval,
management, and understanding.
⚬ News organizations (e.g., BBC, Google News) use document clustering to group news articles into
topics like "sports," "politics," "technology," etc.
⚬ In customer support systems, K-Means is used to group support tickets based on issue types for
faster resolution.
4. Anomaly Detection:
⚬ To identify outliers or unusual data points that do not conform to the general pattern of the dataset.
⚬ Fraud Detection: Banks and financial institutions use K-Means to identify fraudulent transactions by
flagging transactions that deviate significantly from normal patterns.
⚬ Example: Credit card purchases in unusual locations or at irregular times.
⚬ Cybersecurity: Detecting unusual user behavior, such as login attempts from suspicious locations.
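One common way to turn clusters into an anomaly detector is to flag points that lie far from every centroid. The sketch below assumes the centroids have already been produced by K-means on normal data. The centroid values, the two-feature "transactions", the threshold, and the function names are all hypothetical.

```python
import math


def nearest_centroid_distance(point, centroids):
    """Euclidean distance from a point to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)


def flag_anomalies(points, centroids, threshold):
    """Return the points farther than `threshold` from every centroid."""
    return [p for p in points if nearest_centroid_distance(p, centroids) > threshold]


# Hypothetical centroids learned from normal behavior (features already scaled).
centroids = [(0.0, 0.0), (5.0, 5.0)]

# Hypothetical new transactions described by the same two features.
transactions = [(0.2, -0.1), (4.8, 5.3), (0.5, 0.4), (9.0, -3.0), (5.1, 4.9)]

anomalies = flag_anomalies(transactions, centroids, threshold=2.0)
print(anomalies)  # [(9.0, -3.0)]
```

The threshold is a design choice. It is often derived from the distribution of distances seen in normal data, for example a high percentile, rather than fixed by hand as it is here.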
Comparison with Other Clustering Methods
Conclusion
• K-means Clustering is a powerful tool for grouping data into meaningful clusters.
• It is simple, easy to implement, and widely used in practice for tasks such as
segmentation and anomaly detection.
• Choosing K (number of clusters) is a critical step; methods like the Elbow Method can
help.
• While efficient for large datasets, K-means has limitations like sensitivity to initial
centroids and assumptions about cluster shape.
• Despite its limitations, K-means remains a go-to algorithm for unsupervised learning
and exploratory data analysis.