Week 10
27-11-23
Reminder
• "Machine learning is the subfield of computer science that gives computers the
ability to learn without being explicitly programmed."
Arthur Samuel, 1959
4
What kind of AI do you know?
• Supervised learning
6
Reinforcement
learning
• In reinforcement learning, we train agents that take
actions in an environment, such as a self‐driving car
on the road or an asset manager taking positions.
While we do not have labels (that is, we cannot say
what the correct action is in any given situation), we can
assign rewards or punishments.
7
Let us concentrate
today on Unsupervised
learning
8
What can we do with unsupervised
learning?
• Clustering
• K-means, K-means++
• Agglomerative hierarchical clustering (CAH)
• DBSCAN
• Dimensionality reduction
• PCA
• Auto-encoder
• Generative models
• GAN
9
What Is
Clustering?
10
Clustering
Models
• So far, we have discussed supervised learning
where we were predicting a known class label
• Clustering models are unsupervised
• We are trying to learn and understand patterns in
unlabeled data
• The goal is to group similar data points into
segments/clusters
• You may hear "clustering" and "segmentation"
both used to describe these models; they are
synonymous
• Business stakeholders are often more familiar with
"segmentation" than "clustering"
11
Mathematically
12
Clustering
Main ingredients
• The number of clusters, k
• The distance between points, d
• Evaluation of the quality of clusters
• Comparison between different clustering results
• The optimization procedure
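For centroid-based methods, one standard way to make this precise (a sketch, assuming squared Euclidean distance, i.e., the k-means setting) is to choose the partition into k clusters that minimizes the within-cluster distances:

\min_{C_1,\dots,C_k} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i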
13
Clustering
Approaches
• Hierarchical (divisive or agglomerative)
• Centroid or partition-based
• Density-based
• Statistical modeling-based
14
Clustering Use
Cases
• Customer segmentation
• Rewards data misuse detection
• Segmentation for product and customer
strategy
• Anomaly detection
15
K‐Means
Clustering
16
K‐Means
Procedure
1. Select the number of clusters before running the model, often called k
2. Randomly choose k centroids (cluster centers)
• Can use K‐Means++ to reduce randomness by placing the initial cluster centers far apart
3. Calculate the distance of each data point to all cluster centers
and assign each data point to the closest cluster
4. Find the new centroid of each cluster by taking the mean of all data
points in the cluster
5. Use the new centroids and repeat steps 3 and 4 until the cluster
centers stop moving
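A minimal sketch of these steps with scikit-learn (the data X and the choice of k=4 are illustrative assumptions):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))          # placeholder data, for illustration only

# init="k-means++" spreads the initial centroids far apart (step 2);
# fitting then alternates assignment (step 3) and centroid updates (step 4)
# until the centers stop moving (step 5).
kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X)         # cluster index for each row of X
centroids = kmeans.cluster_centers_    # final cluster centers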
17
K‐Means Visually
18
Image source: https://towardsdatascience.com/k-means-clustering-explained-4528df86a120
K‐means—pitfalls
19
K‐means—pitfalls
20
K‐means—pitfalls
21
K‐means—pitfalls
22
K‐means—pitfalls
23
K‐Means Pros and
Cons
Pros
• Easy to interpret
• Scalable to large data sets
Cons
• Easy to overfit, and only a small number of features can be used
• Does not handle highly correlated features well
• Number of clusters has to be preset
• Can only draw linear boundaries; if your data has non‐linear
boundaries, it will not perform well
• Sensitive to outliers
• Slows down substantially as the number of samples increases
because distances between all data points and centroids
must be calculated with each adjustment
24
Clustering
Evaluation
25
Cluster Evaluation Metrics:
Inertia
• Inertia: The sum of squared distances of all samples to their closest centroid
(cluster center)
• Distortion: The weighted sum of the squared distances from each data point to
its centroid
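As a small illustration (reusing the fitted kmeans, X, and labels from the earlier sketch), inertia is exposed directly by scikit-learn, and distortion is computed here as the average squared distance to the assigned centroid, one common convention:

import numpy as np

# Inertia: sum of squared distances of all samples to their closest centroid.
inertia = kmeans.inertia_

# Distortion, computed here as an average squared distance (one common convention).
sq_dists = np.sum((X - kmeans.cluster_centers_[labels]) ** 2, axis=1)
distortion = sq_dists.mean()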
26
Cluster Evaluation Metrics:
Inertia
• Inertia will always decrease as the number of clusters grows; look for a leveling‐off point
• In the example plot, leveling off is seen at 4 and 5 segments
• There is no rule of thumb for a "good" inertia; you can only compare
multiple models to each other
27
Cluster Evaluation Metrics:
Distortion
• Distortion will always decrease as the number of clusters grows; look for a leveling‐off
point
• In the example plot, leveling off is seen at 4 and 5 segments
• Again, there is no rule of thumb for "good" distortion; you can only
compare multiple models to each other
28
Image source:
https://livebook.manning.com/concept/r/dunn-index
Cluster Evaluation Metrics: Elbow
Method
• Used to choose the optimal number of clusters
• Vary the number of clusters and monitor the evaluation metrics
• Look for where the slope becomes less steep and the metric improves less rapidly, showing "diminishing returns"
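A rough sketch of the elbow method with scikit-learn (X is assumed to be the same array as in the earlier k-means sketch):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit one model per candidate k and record its inertia.
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# The "elbow" is where the curve stops dropping steeply (diminishing returns).
plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()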
29
Cluster Evaluation Metrics:
Silhouette Score
Silhouette Score:
For each sample, compare its mean intra‐cluster distance a
with its mean distance b to the nearest neighboring cluster,
normalized by the maximum of the two:
s = (b - a) / max(a, b).
The overall score is the average over all samples.
30
Source:
https://www.analyticsvidhya.com/blog/2021/05/k-mean-getting-the-optimal-number-of-clusters/
Cluster Evaluation Metrics:
Silhouette Score
• Individual scores vary from -1 to +1
• The silhouette score is the average across all
data points
Interpretation
+1: The sample is far away from the neighboring
cluster
0: The sample is on or very near to the decision
boundary of a neighboring cluster
‐1: The sample may have been assigned to the
wrong cluster
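As a quick sketch with scikit-learn (X and labels assumed from the earlier k-means example):

from sklearn.metrics import silhouette_score

# Mean silhouette coefficient over all samples; values closer to +1 are better.
score = silhouette_score(X, labels)
print(f"average silhouette score: {score:.3f}")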
31
Cluster Evaluation Metrics:
Silhouette Plots
• The thickness of each cluster's band represents the cluster size
• The silhouette scores are shown on the horizontal axis
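The per-sample values behind such a plot can be computed with scikit-learn's silhouette_samples; a minimal sketch, assuming X and labels as before:

import numpy as np
from sklearn.metrics import silhouette_samples

values = silhouette_samples(X, labels)   # one value in [-1, +1] per sample
for c in np.unique(labels):
    cluster_vals = values[labels == c]
    # The band for cluster c in the plot is len(cluster_vals) rows thick.
    print(f"cluster {c}: size={len(cluster_vals)}, mean silhouette={cluster_vals.mean():.3f}")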
32
Image source:
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
33
Image source:
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
Tips and
Tricks
34
Incorporating Business Knowledge
37
Utilize
Weights!
• Weights are one of the best tools you
have in both segmentation and predictive
models
• Often useful to give higher weight to rows
demonstrating patterns of high business value
• Often creates smaller “good” groups and larger
“worse” groups
• Example: In an automotive dealer repair
segmentation we weighted rows with higher
dealership repair spend and shorter recency as
being 20% more important
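A minimal sketch of this idea, assuming scikit-learn's KMeans and a hypothetical high_value flag marking the rows to up-weight by roughly 20%:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical flag: 1 where a row shows high repair spend and short recency.
high_value = rng.integers(0, 2, size=len(X))

# Give those rows ~20% more influence on where the centroids land.
weights = np.where(high_value == 1, 1.2, 1.0)

weighted_km = KMeans(n_clusters=4, n_init=10, random_state=0)
weighted_labels = weighted_km.fit_predict(X, sample_weight=weights)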
38
Criteria for Dividing
Clusters
The linkage criterion defines how the distance between clusters is measured
when choosing the closest clusters to merge. It determines the rules for
combining clusters.
Ward's Linkage: Minimizes the variance of the merged clusters; the aim is to choose the merge with the smallest increase in variance.
  Pros/Cons: biased towards globular clusters; good with noisy data.

Average Linkage: Uses the average distance between the points in the two clusters.
  Pros/Cons: biased towards globular clusters; good with noisy data.

Centroid Linkage: Uses the distance between the cluster centroids (the means of all data points).
  Pros/Cons: good with noisy data; best with globular clusters.

Complete Linkage: Uses the distance between the two farthest data points across the two clusters.
  Pros/Cons: good with noisy data; often breaks data into large clusters; best with globular clusters.

Single Linkage: Uses the distance between the two closest data points across the two clusters.
  Pros/Cons: impacted less by outliers; prone to noise.
39
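As a sketch, scikit-learn's AgglomerativeClustering exposes several of these criteria through its linkage parameter ('ward', 'complete', 'average', 'single'; centroid linkage is available via SciPy's hierarchy module instead), assuming X as before:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Compare how the cluster sizes change under different linkage rules.
for linkage in ["ward", "complete", "average", "single"]:
    model = AgglomerativeClustering(n_clusters=4, linkage=linkage)
    hier_labels = model.fit_predict(X)
    print(linkage, np.bincount(hier_labels))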
Criteria for Dividing Clusters
Source:
https://dataaspirant.com/hierarchical-clustering-algorithm/#t-1608531820
Choosing the Number of Clusters
• When clusters are combined, you create a dendrogram
recording each combination
• The vertical line represents the distance between the two
clusters being merged
• The larger the distance of the vertical line, the more
dissimilar the clusters are from one another
• To choose the number of clusters, draw a horizontal line
and separate the dendrogram across the tallest vertical line
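A minimal sketch with SciPy (X assumed as before; Ward linkage and the cut into 4 clusters are illustrative choices):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Z records the full merge history of agglomerative clustering on X.
Z = linkage(X, method="ward")

# Tall vertical lines correspond to merges of dissimilar clusters.
dendrogram(Z)
plt.ylabel("merge distance")
plt.show()

# Cut the tree into a chosen number of clusters (here 4, for illustration).
hier_labels = fcluster(Z, t=4, criterion="maxclust")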
41
Hierarchical Clustering Pros
and Cons
Pros
• No need to set the number of clusters before modeling
• There are more "levers to pull" and tweak in the model to fit
it to your data
Cons
• More complex to understand and explain than K‐Means
• More difficult to tune
• Not scalable to large data sets
42
DBSCAN
43
DBSCAN
44
DBSCAN—Algorithm
45
DBSCAN ‐ Large Eps
46
DBSCAN ‐ Optimal Eps
47
In application
48
DBSCAN
49
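A minimal DBSCAN sketch with scikit-learn, assuming X as before; the eps and min_samples values are illustrative:

from sklearn.cluster import DBSCAN

# eps: neighborhood radius; min_samples: points needed to form a dense region.
db = DBSCAN(eps=0.5, min_samples=5)
db_labels = db.fit_predict(X)        # label -1 marks noise / outliers

# Too large an eps merges everything into a single cluster;
# a well-chosen eps separates the dense regions and flags outliers as noise.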
Thank You
50