
Clustering

Introduction to Clustering
Clustering is an essential unsupervised learning technique used in data
analysis to group similar data points into clusters based on certain
characteristics or features. The goal of clustering is to identify patterns or
structures in data without predefined labels. These methods are widely used in
fields such as marketing (customer segmentation), biology (gene expression
analysis), and social network analysis.

Core Concepts of Clustering

1. Clusters: Collections of data points grouped together based on similarity.

2. Similarity/Dissimilarity: Measured using distance metrics such as Euclidean distance, Manhattan distance, or cosine similarity (see the sketch after this list).

3. Applications: Data compression, anomaly detection, and exploratory data analysis.
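
To make the metrics in point 2 concrete, here is a minimal sketch using SciPy's distance functions (the vectors a and b are made-up examples, not from these notes):

import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine

# Two hypothetical feature vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(euclidean(a, b))   # Euclidean distance: sqrt(sum((a - b)**2))
print(cityblock(a, b))   # Manhattan (city-block) distance: sum(|a - b|)
print(cosine(a, b))      # cosine distance: 1 - cosine similarity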

How to cut the dendrogram to identify the number of clusters:

a. Specify the Number of Clusters

Decide beforehand how many clusters you want (e.g., k).

Cut the tree at the height where there are exactly k branches (clusters) below the cut.

This is a straightforward method if you have a target number of clusters in mind (see the sketch below).
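
A minimal sketch of this cut using SciPy's hierarchical clustering (the data X and the target k = 3 are hypothetical choices for illustration):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                     # hypothetical 2-D data

Z = linkage(X, method="ward")                    # build the dendrogram
labels = fcluster(Z, t=3, criterion="maxclust")  # cut so that exactly 3 clusters remain
print(labels)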

b. Use a Dissimilarity Threshold (meaning: you cut the tree when a link's height passes a certain level, because the clusters on either side of it are too different; how that height is computed is not covered here)

Define a maximum allowable distance (or dissimilarity) for merging clusters.

Cut the tree at this threshold height.

Any cluster merging above this height is not allowed, resulting in multiple clusters (a sketch follows below).
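
A sketch of the same cut with SciPy's fcluster, this time using a distance criterion (both the data and the threshold t = 2.5 are hypothetical):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))          # hypothetical data
Z = linkage(X, method="ward")

# Merges whose height exceeds t are disallowed; every subtree that
# stays below t becomes its own cluster.
labels = fcluster(Z, t=2.5, criterion="distance")
print(len(set(labels)), "clusters below the threshold")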

c. Highest Jump (Elbow Method in Dendrograms): a weak method, per these notes

After hierarchical clustering, examine the dendrogram for the largest vertical gap (or "jump") in the linkage distance.

This jump indicates a significant dissimilarity between clusters. By cutting the dendrogram just before this jump, you can determine a reasonable number of clusters (sketched below).
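
One way to locate that jump programmatically; this is a rough sketch of the heuristic, not a standard library routine (the data are made up):

import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))     # hypothetical data
Z = linkage(X, method="ward")

heights = Z[:, 2]                # merge heights, in increasing order
jumps = np.diff(heights)         # vertical gaps between successive merges
i = int(np.argmax(jumps))        # index of the largest jump
k = len(heights) - i             # clusters left if we cut just before the jump
print("suggested number of clusters:", k)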

ELBOW METHOD (don't dwell on these too much, per these notes)

The elbow is calculated using the within-cluster sum of squares (WCSS).

We plot the WCSS against the number of clusters; the point where the WCSS stops decreasing sharply (the "elbow") marks a good number of clusters (see the plot sketch below).
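
A minimal sketch of the elbow plot with scikit-learn, whose KMeans exposes the WCSS as the inertia_ attribute (the data are hypothetical):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))    # hypothetical data

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in ks]             # inertia_ is the WCSS for that k

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WCSS")
plt.show()                       # look for the bend ("elbow") in the curve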

Now, how do we evaluate the clustering?

Silhouette Analysis

Measures how similar each point is to its own cluster compared to other clusters.

The Silhouette Score ranges from −1 to +1:

+1: Point is well-clustered.

0: Point is on the boundary between clusters.

−1: Point is likely misclassified.
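
For each point i, the score is s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other points in its own cluster and b(i) is the mean distance to the points of the nearest other cluster. A minimal sketch with scikit-learn (the two blobs are made up):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two hypothetical, well-separated blobs
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(6, 1, size=(50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))   # mean s(i) over all points, in [-1, +1]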

Gap Statistic Clustering: the best method for finding the number of clusters

1. Compute Within-Cluster Dispersion (W_k):

Measure how compact the clusters are in your data.

2. Generate Random Reference Data:

Create multiple random datasets with the same dimensions and range as your original data.

3. Calculate Dispersion for Random Data:

Cluster the random datasets for each k and compute their dispersion.

4. Calculate Gap:

Gap(k) = (average of log W_k over the random reference datasets) − (log W_k for your data); a larger gap indicates better clustering.

5. Choose Optimal k:

Select k where the gap is maximized or stabilizes significantly.

The gap statistic determines the optimal number of clusters (k) by comparing the within-cluster dispersion of your data to that of random data. It identifies how well-separated the clusters are compared to a random baseline (a sketch follows below).
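
A rough sketch of the procedure, using k-means inertia as W_k and uniform reference data (the full published method, Tibshirani et al., also uses the standard error of the reference dispersions when picking k):

import numpy as np
from sklearn.cluster import KMeans

def log_wk(X, k):
    # log of the within-cluster dispersion (here: k-means WCSS) for k clusters
    return np.log(KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)

def gap(X, k, n_ref=10, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Uniform reference datasets spanning the same range as X
    refs = [rng.uniform(lo, hi, size=X.shape) for _ in range(n_ref)]
    return np.mean([log_wk(R, k) for R in refs]) - log_wk(X, k)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(6, 1, size=(50, 2))])
for k in range(1, 6):
    print(k, round(gap(X, k), 3))     # pick k where the gap is largest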

What is Clustering?
Clustering is a method to group similar data points together based on their characteristics.
It’s used to find patterns in data without labels (unsupervised learning).
Examples: In marketing (to group similar customers), in biology (for gene analysis), or in social networks (to find similar
users).
Key Terms in Clustering:
Clusters: Groups of similar data points.
Similarity/Dissimilarity: Measures how close or far apart data points are from each other (e.g., using distance metrics like
Euclidean distance).
Applications: Used for things like compressing data, detecting unusual data points (anomalies), or exploring data.
How to Decide the Number of Clusters:
Specify the Number of Clusters:
Decide how many clusters you want (e.g., 3 clusters).
Cut the tree (dendrogram) at the point where there are exactly 3 branches.
Use a Dissimilarity Threshold:
Set a limit for how different clusters can be before they are considered separate.
If the distance between clusters is too high, don’t merge them.
Highest Jump (Elbow Method):
After clustering, look at the dendrogram and find the biggest jump in distance.
Cut just before this jump to find a reasonable number of clusters.
The Elbow Method helps you find the number of clusters by plotting how "spread out" the data is. When the spread stops changing a lot, you’ve found the right number of clusters.
Evaluating Clustering Quality:
Silhouette Analysis:

Measures how similar a point is to its own cluster compared to other clusters.
Score:
+1 = Well clustered.
0 = On the border between clusters.
-1 = Likely in the wrong cluster.
Gap Statistic:
Measures how well-separated your clusters are.
Steps:
Calculate how tight (compact) the clusters are.
Compare this with random data to see if your clusters are better.
If there’s a big gap, it means your clusters are good.
Summary:
Clustering groups similar data together to find hidden patterns.
You decide how many clusters to have, often by using methods like the Elbow Method or Gap Statistic.
To evaluate how good the clustering is, you can use Silhouette Analysis or the Gap Statistic.
