Clustering

Clustering is an unsupervised learning technique that groups similar data points to identify patterns without predefined labels, with applications in various fields such as marketing and biology. Methods for determining the number of clusters include specifying a target number, using a dissimilarity threshold, and applying the Elbow Method. Evaluation techniques like Silhouette Analysis and the Gap Statistic help assess the quality and separation of the clusters formed.

Uploaded by

Rana Ben Fraj

Clustering

Introduction to Clustering
Clustering is an essential unsupervised learning technique used in data
analysis to group similar data points into clusters based on certain
characteristics or features. The goal of clustering is to identify patterns or
structures in data without predefined labels. These methods are widely used in
fields such as marketing (customer segmentation), biology (gene expression
analysis), and social network analysis.

Core Concepts of Clustering

1. Clusters: Collections of data points grouped together based on similarity.

2. Similarity/Dissimilarity: Measured using distance or similarity measures
such as Euclidean distance, Manhattan distance, or cosine similarity.

3. Applications: Data compression, anomaly detection, and exploratory data
analysis.
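The three measures named above can be sketched in a few lines. This is only an illustration using the Python standard library; the function names and the sample points `p` and `q` are made up for the example.

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of the absolute differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between the two vectors (1 means same direction).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))   # 5.0
print(manhattan(p, q))   # 7.0
print(cosine_similarity(p, q))
```

Note that cosine similarity grows when points are more alike, while the two distances shrink, so a clustering routine that expects a dissimilarity usually takes 1 minus the cosine similarity.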

How to cut the dendrogram to identify the number of clusters:
a. Specify the Number of Clusters
Decide beforehand how many clusters you want (e.g., k).

Cut the tree at the height where there are exactly k branches (clusters)
below the cut.

This is a straightforward method if you have a target number of clusters in
mind.
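As a rough sketch of method (a), assuming SciPy is available: `linkage` builds the tree and `fcluster` with `criterion="maxclust"` cuts it so that at most k flat clusters remain. The toy data and the choice of average linkage are assumptions made for the example, not part of the notes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (made up): three tight, well-separated pairs of points.
X = np.array([[0.0, 0.0], [0.1, 0.2],
              [5.0, 5.0], [5.2, 4.9],
              [10.0, 0.0], [10.1, 0.3]])

# Build the hierarchical tree; average linkage is an arbitrary choice here.
Z = linkage(X, method="average")

# Cut the tree so that no more than k flat clusters are formed.
k = 3
labels = fcluster(Z, t=k, criterion="maxclust")
print(labels)
```

Each entry of `labels` is the cluster number of the matching row of `X`, so the two points of each pair should share a label.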

b. Use a Dissimilarity Threshold
(That is, cut the tree where a branch grows longer than a chosen level,
because at that point the two clusters being joined differ too much. How that
level is calculated is not covered in these notes.)
Define a maximum allowable distance (or dissimilarity) for merging clusters.

Cut the tree at this threshold height.

Any cluster merging above this height is not allowed, resulting in multiple
clusters.
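Method (b) can be sketched the same way, again assuming SciPy: `fcluster` with `criterion="distance"` keeps only the merges whose height is at or below the threshold. The data and the two threshold values are made up to show that a stricter threshold leaves more clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (made up): three tight pairs, the last pair slightly looser.
X = np.array([[0.0, 0.0], [0.1, 0.2],
              [5.0, 5.0], [5.2, 4.9],
              [10.0, 0.0], [10.1, 0.3]])

Z = linkage(X, method="average")

# Merges above the threshold height are not allowed.
labels_loose = fcluster(Z, t=2.0, criterion="distance")    # pairs stay together
labels_strict = fcluster(Z, t=0.25, criterion="distance")  # loosest pair is split

print(len(set(labels_loose.tolist())), len(set(labels_strict.tolist())))
```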

c. Highest Jump (Elbow Method in Dendrograms) (note: a weak method)
After hierarchical clustering, examine the dendrogram for the largest
vertical gap (or "jump") in the linkage distance.

This jump indicates a significant dissimilarity between the clusters being
merged. By cutting the dendrogram just before this jump, you can determine a
reasonable number of clusters.
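A possible sketch of the highest-jump rule, assuming SciPy: the third column of the linkage matrix holds the merge heights, so the largest difference between consecutive heights marks the jump, and the cut is placed halfway across it. The toy data and variable names are assumptions for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (made up): three tight, well-separated pairs of points.
X = np.array([[0.0, 0.0], [0.1, 0.2],
              [5.0, 5.0], [5.2, 4.9],
              [10.0, 0.0], [10.1, 0.3]])

Z = linkage(X, method="average")
heights = Z[:, 2]            # merge distances, in the order the merges happen
jumps = np.diff(heights)     # gap between each merge and the next one
i = int(np.argmax(jumps))    # the largest jump sits between merge i and merge i + 1

# Cut halfway across the jump, i.e. just before the expensive merge.
threshold = (heights[i] + heights[i + 1]) / 2
labels = fcluster(Z, t=threshold, criterion="distance")

# n points and i + 1 accepted merges leave n - (i + 1) clusters.
n_clusters = len(X) - (i + 1)
print(n_clusters, labels)
```

With noisy data the largest jump is often the very last merge, which always suggests two clusters; that is one reason the rule is best treated as a hint rather than a decision.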

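Elbow Method (note: do not focus on these too much)

It is calculated using the within-cluster sum of squares (WCSS).

We plot the WCSS against the number of clusters. The WCSS keeps decreasing as
clusters are added and only reaches 0 when every point is its own cluster, so
a value of 0 is not the target. The suggested number of clusters is at the
"elbow", the point after which the curve flattens and adding more clusters
brings little further reduction.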
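The numbers behind an elbow plot can be sketched as follows, assuming scikit-learn's `KMeans`, whose `inertia_` attribute is the WCSS of the fitted model. The synthetic three-blob data, the range of k, and the seeds are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (made up): three Gaussian blobs of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [6, 6], [12, 0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit k-means for each candidate k and record the WCSS.
k_values = list(range(1, 7))
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)

for k, w in zip(k_values, wcss):
    print(k, round(w, 1))
```

Plotting `wcss` against `k_values` gives the elbow curve. On data like this the drop should be steep up to k = 3 and small afterwards, while the WCSS itself stays above 0.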

Now, how do we evaluate it?

Silhouette Analysis
Measures how similar each point is to its own cluster compared to other
clusters.

Silhouette Score ranges from −1 to +1:

+1: Point is well-clustered.

0: Point is on the boundary between clusters.

−1: Point is likely misclassified.
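To make the score concrete, here is a small sketch of the usual per-point definition s = (b − a) / max(a, b), where a is the mean distance to the other points of the same cluster and b is the lowest mean distance to the points of any other cluster. The function name and toy data are made up; it assumes NumPy, Euclidean distance, at least two clusters, and no duplicate points.

```python
import numpy as np

def silhouette_values(X, labels):
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # a point alone in its cluster gets 0 by convention
        # a: mean distance to the other members (D[i, i] is 0, so divide by n - 1).
        a = D[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(D[i, labels == c].mean()
                for c in set(labels.tolist()) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

X = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette_values(X, [0, 0, 1, 1])   # sensible grouping
bad = silhouette_values(X, [0, 1, 0, 1])    # deliberately wrong grouping
print(good.round(2), bad.round(2))
```

The sensible grouping should give values close to +1 and the wrong grouping negative values. In practice a library routine such as scikit-learn's `silhouette_score` would normally be used instead of a hand-written loop.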

Gap Statistic (the best one for finding the number of clusters)

1. Compute Within-Cluster Dispersion (W_k):

Measure how compact the clusters are in your data for each candidate k.

2. Generate Random Reference Data:

Create multiple random datasets with the same dimensions and range
as your original data.

3. Calculate Dispersion for Random Data:

Cluster the random datasets for each k and compute their dispersion.

4. Calculate Gap:

For each k, the gap is the average log dispersion of the random datasets
minus the log dispersion of your data, log(W_k). A larger gap indicates
better clustering.

5. Choose Optimal k:

Select k where the gap is maximized or stabilizes significantly.

The gap statistic determines the optimal number of clusters (k) by
comparing the within-cluster dispersion of your data to that of random data. It
identifies how well-separated the clusters are compared to a random baseline.
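The five steps can be sketched roughly as below. This is a simplified illustration, not a reference implementation: it assumes scikit-learn's `KMeans` as the clustering method, uses its `inertia_` as W_k, draws the reference data uniformly over the bounding box of the data, and leaves out the standard-error correction used in the full procedure. The helper names `log_dispersion` and `gap_statistic` and the synthetic data are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

def log_dispersion(X, k, seed=0):
    # Steps 1 and 3: log of the within-cluster dispersion W_k for k clusters.
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return np.log(model.inertia_)

def gap_statistic(X, k_values, n_refs=10, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in k_values:
        # Step 2: random datasets with the same shape and range as X.
        ref_logs = [log_dispersion(rng.uniform(lo, hi, size=X.shape), k, seed)
                    for _ in range(n_refs)]
        # Step 4: mean reference log dispersion minus the data's log dispersion.
        gaps.append(np.mean(ref_logs) - log_dispersion(X, k, seed))
    return np.array(gaps)

# Synthetic data (made up): three Gaussian blobs of 30 points each.
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [6, 6], [12, 0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

k_values = [1, 2, 3, 4, 5]
gaps = gap_statistic(X, k_values)
# Step 5: here simply the k with the largest gap.
best_k = k_values[int(np.argmax(gaps))]
print(gaps.round(2), best_k)
```

On well-separated blobs like these, the gap for k = 3 should sit far above the gaps for k = 1 and k = 2; the differences between k = 3 and larger k are smaller, which is why the choice in step 5 is sometimes made with a tolerance rather than a plain maximum.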

What is Clustering?
Clustering is a method to group similar data points together based on their characteristics.
It’s used to find patterns in data without labels (unsupervised learning).
Examples: In marketing (to group similar customers), in biology (for gene analysis), or in social networks (to find similar
users).
Key Terms in Clustering:
Clusters: Groups of similar data points.
Similarity/Dissimilarity: Measures how close or far apart data points are from each other (e.g., using distance metrics like
Euclidean distance).
Applications: Used for things like compressing data, detecting unusual data points (anomalies), or exploring data.
How to Decide the Number of Clusters:
Specify the Number of Clusters:

Decide how many clusters you want (e.g., 3 clusters).

Cut the tree (dendrogram) at the point where there are exactly 3 branches.
Use a Dissimilarity Threshold:

Set a limit for how different clusters can be before they are considered separate.
If the distance between clusters is too high, don’t merge them.
Highest Jump (Elbow Method):

After clustering, look at the dendrogram and find the biggest jump in distance.
Cut just before this jump to find a reasonable number of clusters.
The Elbow Method helps you find the number of clusters by plotting how "spread out" the data is inside its clusters (the WCSS) against the number of clusters. The point where adding another cluster stops reducing the spread by much is the suggested number of clusters.
Evaluating Clustering Quality:
Silhouette Analysis:

Measures how similar a point is to its own cluster compared to other clusters.
Score:
+1 = Well clustered.
0 = On the border between clusters.
-1 = Likely in the wrong cluster.
Gap Statistic:

Measures how well-separated your clusters are.

Steps:
Calculate how tight (compact) the clusters are.
Compare this with random data to see if your clusters are better.
If there’s a big gap, it means your clusters are good.
Summary:
Clustering groups similar data together to find hidden patterns.
You decide how many clusters to have, often by using methods like the Elbow Method or Gap Statistic.
To evaluate how good the clustering is, you can use Silhouette Analysis or the Gap Statistic.