Clustering
Introduction to Clustering
Clustering is an essential unsupervised learning technique used in data
analysis to group similar data points into clusters based on certain
characteristics or features. The goal of clustering is to identify patterns or
structures in data without predefined labels. These methods are widely used in
fields such as marketing (customer segmentation), biology (gene expression
analysis), and social network analysis.
Cut the tree (dendrogram) at the height where there are exactly k branches (clusters)
below the cut.
Cut the tree at this threshold height.
Merges that would happen above this height are not performed, so the data stay
split into multiple clusters.
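The two ways of cutting the tree can be sketched in a few lines of Python. This is only an illustration: it assumes single linkage (cluster distance = distance between the two nearest members), which the notes do not specify, and the function name `single_linkage` and the sample points are made up for the example.

```python
import math

def single_linkage(points, k=None, threshold=None):
    """Merge the two closest clusters until a stopping rule is hit.

    k: stop as soon as exactly k clusters remain (cut by cluster count).
    threshold: stop when the closest pair of clusters is farther apart
               than this height (cut by distance threshold).
    """
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        if k is not None and len(clusters) <= k:
            break
        # Find the closest pair of clusters (assumption: single linkage).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if threshold is not None and d > threshold:
            break  # any further merge would happen above the cut height
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical sample data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
by_count = single_linkage(points, k=2)            # cut where k = 2 clusters remain
by_height = single_linkage(points, threshold=2.0)  # forbid merges above height 2.0
print(len(by_count), len(by_height))
```

With a fixed k the merging continues until that many clusters are left, however far apart they are. With a threshold the number of clusters falls out of the data.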
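We plot the WCSS (within-cluster sum of squares) against the number of clusters.
The WCSS generally keeps falling as clusters are added and only reaches 0 when
every point is its own cluster, so a WCSS of 0 does not mark the best choice.
Instead, look for the "elbow": the point where the curve flattens and adding
more clusters gives little further reduction.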
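As a rough sketch of how the WCSS curve can be computed, the snippet below runs a basic k-means for several values of k and records the WCSS each time. The helper names (`farthest_first`, `kmeans_wcss`), the deterministic farthest-first seeding, and the three-blob sample data are assumptions made for this example. A real analysis would more likely use a library implementation.

```python
import math

def farthest_first(points, k):
    """Deterministic seeding (an assumption for reproducibility): start at the
    first point, then keep adding the point farthest from the chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans_wcss(points, k, max_iter=100):
    """Run a basic k-means and return the within-cluster sum of squares."""
    centers = farthest_first(points, k)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # Update step: each center moves to the mean of its group.
        new_centers = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centers[c]
            for c, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)

# Hypothetical data: three well-separated blobs of three points each.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0),
]
wcss = {k: kmeans_wcss(points, k) for k in range(1, 6)}
for k, value in wcss.items():
    print(k, round(value, 2))
```

On data like this the WCSS drops sharply up to k = 3 and only slightly afterwards, which is the elbow shape to look for in the plot.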
Silhouette Analysis
Measures how similar each point is to its own cluster compared to other
clusters.
Create multiple random datasets with the same dimensions and range
as your original data.
Cluster the random datasets for each k and compute their dispersion.
4. Calculate Gap:
5. Choose Optimal k:
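For steps 4 and 5, a commonly used formulation of the gap statistic (due to Tibshirani, Walther, and Hastie) is given below. The notation is introduced here and is not taken from the notes: W_k is the dispersion of the original data clustered into k groups, W*_kb is the dispersion of the b-th random dataset for the same k, B is the number of random datasets, and sd_k is the standard deviation of log W*_kb over the B datasets.

```latex
\mathrm{Gap}(k) = \frac{1}{B}\sum_{b=1}^{B} \log W^{*}_{kb} \;-\; \log W_k

s_k = \mathrm{sd}_k \sqrt{1 + \tfrac{1}{B}}

\hat{k} = \min\bigl\{\, k : \mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1} \,\bigr\}
```

A large gap means the original data are clustered much more tightly than random data with the same range. The chosen k is the smallest one whose gap is within one standard error of the next gap.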
What is Clustering?
Clustering is a method to group similar data points together based on their characteristics.
It’s used to find patterns in data without labels (unsupervised learning).
Examples: In marketing (to group similar customers), in biology (for gene analysis), or in social networks (to find similar
users).
Key Terms in Clustering:
Clusters: Groups of similar data points.
Similarity/Dissimilarity: Measures how close or far apart data points are from each other (e.g., using distance metrics like
Euclidean distance).
Applications: Used for things like compressing data, detecting unusual data points (anomalies), or exploring data.
How to Decide the Number of Clusters:
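Specify the Number of Clusters:
Choose k in advance and cut the dendrogram where exactly k clusters remain.
Distance Threshold:
Set a limit for how different clusters can be before they are considered separate.
If the distance between two clusters is above the limit, don’t merge them.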
Highest Jump (Elbow Method):
After clustering, look at the dendrogram and find the biggest jump in distance.
Cut just before this jump to find a reasonable number of clusters.
The Elbow Method helps you find the number of clusters by plotting how "spread out" the data is within its clusters (the WCSS)
against the number of clusters. When adding another cluster no longer reduces the spread by much, you’ve found a reasonable
number of clusters.
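The "biggest jump" rule for a dendrogram can be sketched as follows. The function name `cut_before_biggest_jump` and the merge heights are hypothetical, chosen only to illustrate the idea. With n points there are n - 1 merges, so every merge kept below the cut reduces the cluster count by one.

```python
def cut_before_biggest_jump(merge_heights):
    """Given the heights at which a dendrogram merges clusters, place the cut
    inside the largest gap between consecutive heights.

    Returns (cut_height, n_clusters).
    """
    heights = sorted(merge_heights)
    jumps = [b - a for a, b in zip(heights, heights[1:])]
    i = max(range(len(jumps)), key=jumps.__getitem__)  # index of the largest jump
    cut_height = (heights[i] + heights[i + 1]) / 2      # midpoint of that gap
    n_points = len(heights) + 1
    n_clusters = n_points - (i + 1)                     # i + 1 merges lie below the cut
    return cut_height, n_clusters

# Hypothetical merge heights for 5 points: two cheap merges, then two expensive ones.
cut_height, n_clusters = cut_before_biggest_jump([1.0, 1.0, 6.4, 7.1])
print(cut_height, n_clusters)
```

Here the largest jump is between the second and third merge, so cutting there keeps the two cheap merges and leaves three clusters.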
Evaluating Clustering Quality:
Silhouette Analysis:
Measures how similar a point is to its own cluster compared to other clusters.
Score:
+1 = Well clustered.
0 = On the border between clusters.
-1 = Likely in the wrong cluster.
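To make the score concrete, here is a minimal sketch of the usual silhouette computation: for each point, a is the mean distance to the other points in its own cluster, b is the mean distance to the nearest other cluster, and the score is (b - a) / max(a, b). The function name `silhouette_scores` and the sample points are assumptions for this example, and the code expects at least two clusters.

```python
import math

def silhouette_scores(points, labels):
    """Return one silhouette value per point (requires at least two clusters)."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from p to every other point, grouped by cluster label.
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.pop(labels[i], None)
        if not own:
            scores.append(0.0)  # common convention for a cluster of size one
            continue
        a = sum(own) / len(own)                          # cohesion: own cluster
        b = min(sum(d) / len(d) for d in dists.values())  # separation: nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores

# Hypothetical data: two tight pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = silhouette_scores(points, [0, 0, 1, 1])  # labels match the two pairs
bad = silhouette_scores(points, [0, 1, 0, 1])   # labels split each pair
print([round(s, 2) for s in good])
print([round(s, 2) for s in bad])
```

The sensible labelling gives values close to +1, while the mismatched labelling gives negative values, matching the interpretation of the scale above.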
Gap Statistic: