
K-Means Clustering

Group-2
Thejaswi S
Samir K
Swathi G
Vatsalya K
Sruthi N
Overview of Clustering
• The task of grouping data points based on their similarity with each other is called Clustering or Cluster Analysis.
• Defined under Unsupervised Learning, which derives insights from unlabeled data without a target variable.
• Forms groups of homogeneous data points from a heterogeneous dataset.
• Evaluates similarity between points using metrics such as Euclidean Distance, Cosine Similarity, and Manhattan Distance; the sketch below computes each of these.

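These metrics drive every cluster assignment. A minimal sketch of the three (in Python with NumPy; the tooling is an assumption, since the slides name no libraries):

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: straight-line distance between the points
euclidean = np.sqrt(np.sum((a - b) ** 2))                        # 5.0

# Manhattan distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(a - b))                                # 7.0

# Cosine similarity: cosine of the angle between the vectors
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, manhattan, cosine)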

Types of Clustering
1. Centroid-based Clustering (Partitioning methods):
⚬ Groups data based on proximity, using metrics like Euclidean Distance.
⚬ Example algorithms: K-Means, K-Medoids.
2. Density-based Clustering:
⚬ Finds clusters in dense regions of the data, automatically determining the number of clusters.
⚬ Example algorithm: DBSCAN.
3. Connectivity-based Clustering (Hierarchical clustering):
⚬ Builds clusters hierarchically, creating a dendrogram (tree structure).
⚬ Two approaches: Agglomerative (Bottom-Up) and Divisive (Top-Down).
4. Distribution-based Clustering (Model-based methods):
⚬ Groups data points based on statistical probability distributions.
⚬ Example: Gaussian Mixture Model (GMM).
Each family has a standard implementation, sketched below.
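For reference, each family above has a scikit-learn implementation; the following sketch (assuming scikit-learn is installed) instantiates one algorithm from each family on the same toy data:

import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

X = np.random.RandomState(0).rand(200, 2)  # toy 2-D dataset

labels_centroid = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_density = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)            # -1 marks noise
labels_hierarchy = AgglomerativeClustering(n_clusters=3).fit_predict(X)
labels_distribution = GaussianMixture(n_components=3, random_state=0).fit_predict(X)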
K-Means clustering:
• K-means clustering is an unsupervised machine learning algorithm used to partition a
dataset into K clusters, where each data point belongs to the cluster with the nearest
mean.
• It iteratively assigns each point to the closest cluster center, recalculates the cluster
centers, and repeats the process until convergence.
• The goal of clustering is to divide a dataset into groups (clusters) such that:
⚬ Data points within the same group are more similar to each other.
⚬ Data points from different groups are more different from each other.
• It’s about grouping data based on similarity and difference to reveal patterns or
insights in the data.
Key Concepts
• Centroids: Central points that represent the center of each cluster. They are
calculated as the mean of all points assigned to a cluster.
• Clusters: Groups of data points that are similar to each other based on proximity to a
centroid. The number of clusters is defined as K.
• Distance Metrics: Methods to calculate the similarity or dissimilarity between points.
⚬ Euclidean Distance: the most common metric, calculated as the straight-line distance between two points: d(p, q) = √Σᵢ(pᵢ − qᵢ)². The sketch below computes a centroid and this distance.
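A centroid is just the per-cluster mean, so both key concepts fit in a few lines of NumPy (a sketch; the points and labels are made up):

import numpy as np

points = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [8.0, 8.0]])
labels = np.array([0, 0, 0, 1])            # cluster assignment of each point

# Centroid of cluster 0: the mean of all points assigned to it
centroid_0 = points[labels == 0].mean(axis=0)      # [2.0, 3.0]

# Euclidean distance from a new point to that centroid
p = np.array([2.0, 2.0])
dist = np.linalg.norm(p - centroid_0)              # 1.0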
Algorithm Workflow:
• Step-1: Choose the number K of clusters.
• Step-2: Select K random points as initial centroids (they need not come from the input dataset).
• Step-3: Assign each data point to its closest centroid, forming the K clusters.
• Step-4: Recompute the centroid of each cluster as the mean of its assigned points.
• Step-5: Repeat Step-3, reassigning each data point to the new closest centroid.
• Step-6: If any reassignment occurred, go back to Step-4; otherwise stop.
• Step-7: The model is ready. (The sketch below implements this loop.)
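The steps above map directly onto a short from-scratch implementation. A minimal NumPy sketch (the function name and data are assumptions; it also assumes no cluster ever becomes empty):

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: K is given; pick k distinct points from X as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 5-6: stop when centroids (and hence assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)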
• Suppose we have a dataset with two variables, M1 and M2, shown as a scatter plot. We aim to divide it into K = 2 clusters.
• To start, we randomly select two points as centroids; they need not be part of the dataset. Next, we assign each data point to its nearest centroid by calculating the distance between the points.
• A median line (the perpendicular bisector between the two centroids) helps visualize this assignment.
• The center of gravity (mean) of the points assigned to each cluster becomes its new centroid.
• The assignment step is repeated with the new centroids, and data points are reassigned to their closest centroid.
• The process continues until no data point switches clusters, forming the final clusters.
• The assumed initial centroids are discarded, leaving the two final clusters.
Choosing the Number of Clusters (K)
• Elbow Method:
• Objective: find the optimal number of clusters (K) by evaluating how well the clusters fit the data.
• WCSS (Within-Cluster Sum of Squares) measures the total variation within the clusters:

WCSS = Σᵢ₌₁ᴷ Σ_{p in Cluster i} distance(p, Cᵢ)²

• The formula sums the squared distances between each data point p and the centroid Cᵢ of its own cluster, across all K clusters.
Choosing the Number of Clusters (K)
Steps:
• Perform K-Means clustering on the dataset for different K values (typically from 1 to 10).
• Calculate WCSS for each K value.
• Plot the WCSS values against the number of clusters (K).
• Identify the "elbow" point in the graph (the sharp bend); the K value at the elbow is considered optimal. (See the sketch below.)
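As a sketch of this procedure with scikit-learn (an assumed tool), whose inertia_ attribute is exactly the WCSS defined above:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # synthetic data

wcss = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)           # inertia_ is the WCSS for this K

plt.plot(ks, wcss, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("WCSS")
plt.title("Elbow Method")
plt.show()                             # look for the sharp bend (the elbow)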
Advantages
1. Simplicity and Efficiency:
⚬ Easy to implement.
⚬ Computationally efficient, even for large datasets.
2. Scalability:
⚬ Handles large datasets well.
3. Versatility:
⚬ Suitable for various data types.
⚬ Works well with well-separated clusters.
4. Flexibility:
⚬ Can be used for market segmentation, anomaly detection, and more.
Disadvantages
1. Choosing the Right K:
⚬ The optimal number of clusters (K) is hard to determine.
2. Sensitive to Initial Centroids:
⚬ The algorithm can converge to different solutions depending on the initial centroids (demonstrated in the sketch below).
3. Assumes Spherical Clusters:
⚬ Performs poorly when clusters are non-spherical or have different sizes and densities.
4. Sensitive to Outliers:
⚬ Outliers can significantly affect the clustering results, since centroids are means.
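Point 2 is easy to observe directly: a small sketch (scikit-learn assumed) runs K-Means with a single random initialization per seed and compares the resulting WCSS:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=2.0, random_state=7)

# n_init=1 disables scikit-learn's usual restarts, exposing the initialization effect
for seed in range(3):
    km = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}  WCSS={km.inertia_:.1f}")

# Different seeds can converge to different local optima (different WCSS);
# init="k-means++" and n_init > 1 are the standard mitigations.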
Applications
1. Customer Segmentation:
⚬ Grouping customers based on purchasing behavior for targeted marketing.
⚬ E-commerce platforms like Amazon or Flipkart use K-Means clustering to segment customers into categories such as "frequent buyers," "occasional shoppers," and "high-value customers."
⚬ Marketing teams design personalized ads, product recommendations, and discount strategies for each cluster.

2. Image Compression:
⚬ Image compression tools (e.g., TinyPNG) use K-Means to compress images without losing much quality.
⚬ Clustering pixel colors reduces the total number of colors in the image, saving memory and computational resources; in medical imaging, images (like X-rays) are likewise segmented into regions for efficient storage and analysis. (A sketch of this follows below.)
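Color quantization is the usual mechanism here: cluster the pixels in RGB space and replace each pixel with its centroid color, so only K colors remain. A minimal sketch (scikit-learn's bundled sample image is used purely as an assumed stand-in):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image

image = load_sample_image("china.jpg") / 255.0     # H x W x 3, values in [0, 1]
pixels = image.reshape(-1, 3)                      # one row per pixel

# K is the number of colors kept after compression
km = KMeans(n_clusters=16, n_init=4, random_state=0).fit(pixels)

# Replace every pixel with the centroid color of its cluster
compressed = km.cluster_centers_[km.labels_].reshape(image.shape)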
Applications
3. Document Clustering:
⚬ Categorizes a large collection of text documents into clusters/topics for easier retrieval, management, and understanding.
⚬ News organizations (e.g., BBC, Google News) use document clustering to group news articles into topics like "sports," "politics," "technology," etc.
⚬ In customer support systems, K-Means is used to group support tickets based on issue types for faster resolution.
4. Anomaly Detection:
⚬ Identifies outliers or unusual data points that do not conform to the general pattern of the dataset (sketched below).
⚬ Fraud Detection: banks and financial institutions use K-Means to identify fraudulent transactions by flagging transactions that deviate significantly from normal patterns.
⚬ Example: credit card purchases in unusual locations or at irregular times.
⚬ Cybersecurity: detecting unusual user behavior, such as login attempts from suspicious locations.
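A common K-Means pattern for anomaly detection is to flag points that lie unusually far from their nearest centroid; a minimal sketch (the data and the 3-sigma threshold are assumptions):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),          # one group of normal activity
               rng.normal(6, 1, (100, 2)),          # a second normal group
               [[3.0, 12.0], [-5.0, 8.0]]])         # two anomalous points

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance from each point to the centroid of its own cluster
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Flag points whose distance is unusually large (3-sigma rule)
threshold = dists.mean() + 3 * dists.std()
outliers = np.where(dists > threshold)[0]           # indices of flagged points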
Comparison with Other Clustering Methods

Method                 | Needs K upfront?        | Cluster shapes handled   | Notes
K-Means                | Yes                     | Spherical, similar sizes | Fast and scalable; sensitive to outliers
DBSCAN                 | No                      | Arbitrary shapes         | Handles noise; needs density parameters
Hierarchical (Agglo.)  | No (cut the dendrogram) | Depends on linkage       | Produces a dendrogram; costly on large data
Gaussian Mixture Model | Yes (components)        | Elliptical               | Soft (probabilistic) assignments
Conclusion
• K-means Clustering is a powerful tool for grouping data into meaningful clusters.
• It is simple, easy to implement, and widely used in practice for tasks such as
segmentation and anomaly detection.
• Choosing K (number of clusters) is a critical step; methods like the Elbow Method can
help.
• While efficient for large datasets, K-means has limitations like sensitivity to initial
centroids and assumptions about cluster shape.
• Despite its limitations, K-means remains a go-to algorithm for unsupervised learning
and exploratory data analysis.
