Week 10
27-11-23
Reminder
• "Machine learning is the subfield of computer science that gives computers the
ability to learn without being explicitly programmed."
Arthur Samuel, 1959
4
What kind of AI do you know?
• Supervised learning
6
Reinforcement
learning
• In reinforcement learning, we train agents that take
actions in an environment, such as a self‐driving car
on the road or an asset manager taking positions.
While we do not have labels (that is, we cannot say
what the correct action is in any given situation), we can
assign rewards or punishments.
7
Let us concentrate
today on Unsupervised
learning
8
What can we do with unsupervised
learning?
• Clustering
• K-means, K-means++
• Agglomerative hierarchical clustering (CAH)
• DBSCAN
• Dimensionality reduction
• PCA
• Auto-encoder
• Generative models
• GAN
9
What Is
Clustering?
10
Clustering
Models
• So far, we have discussed supervised learning
where we were predicting a known class label
• Clustering models are unsupervised
• We are trying to learn and understand patterns in
unlabeled data
• The goal is to group similar data points into
segments/clusters
• You may hear "clustering" and "segmentation"
both used to describe these models; they are
synonymous
• Business stakeholders are often more familiar with
"segmentation" than "clustering"
11
Mathematically
12
Clustering
Main ingredients
• The number of clusters, k
• The distance between points, d
• Evaluation of the quality of clusters
• Comparison between different clustering results
• The optimization procedure
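For centroid-based methods, one standard way to make this precise (a sketch, assuming squared Euclidean distance, i.e., the k-means setting) is to choose the partition into k clusters that minimizes the within-cluster distances:

\min_{C_1,\dots,C_k} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i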
13
Clustering
Approaches
• Hierarchical (divisive or agglomerative)
• Centroid or partition-based
• Density-based
• Statistical modeling-based
14
Clustering Use
Cases
• Customer segmentation
• Rewards data misuse detection
• Segmentation for product and customer
strategy
• Anomaly detection
15
K‐Means
Clustering
16
K‐Means
Procedure
1. Select the number of clusters before running the model, often called k
2. Randomly choose k centroids (cluster centers)
• Can use K‐Means++ to reduce randomness by placing the initial cluster centers far apart
3. Calculate the distance of each data point to all cluster centers
and assign each data point to the closest cluster
4. Find the new centroid of each cluster by taking the mean of all data
points in the cluster
5. Use the new centroids and repeat steps 3 and 4 until the cluster
centers stop moving
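A minimal sketch of these steps with scikit-learn (the data X and the choice of k=4 are illustrative assumptions):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))          # placeholder data, for illustration only

# init="k-means++" spreads the initial centroids far apart (step 2);
# fitting then alternates assignment (step 3) and centroid updates (step 4)
# until the centers stop moving (step 5).
kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X)         # cluster index for each row of X
centroids = kmeans.cluster_centers_    # final cluster centers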
17
K‐Means Visually
18
Image source: https://towardsdatascience.com/k-means-clustering-explained-4528df86a120
K‐means—pitfalls
19
K‐means—pitfalls
20
K‐means—pitfalls
21
K‐means—pitfalls
22
K‐means—pitfalls
23
K‐Means Pros and
Cons
Pros
• Easy to interpret
• Scalable to large data sets
Cons
• Easy to overfit, and only a small number of features can be used
• Does not handle highly correlated features well
• Number of clusters has to be preset
• Can only draw linear boundaries; if your data has non‐linear
boundaries, it will not perform well
• Sensitive to outliers
• Slows down substantially as the number of samples increases
because distances between all data points and centroids
must be calculated with each adjustment
24
Clustering
Evaluation
25
Cluster Evaluation Metrics:
Inertia
• Inertia: The sum of squared distances of all samples to their closest centroid
(cluster center)
• Distortion: The weighted sum of the squared distances from each data point to
its centroid
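As a small illustration (reusing the fitted kmeans, X, and labels from the earlier sketch), inertia is exposed directly by scikit-learn, and distortion is computed here as the average squared distance to the assigned centroid, one common convention:

import numpy as np

# Inertia: sum of squared distances of all samples to their closest centroid.
inertia = kmeans.inertia_

# Distortion, computed here as an average squared distance (one common convention).
sq_dists = np.sum((X - kmeans.cluster_centers_[labels]) ** 2, axis=1)
distortion = sq_dists.mean()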
26
Cluster Evaluation Metrics:
Inertia
• Inertia will always decrease as the number of clusters grows; look for a leveling‐off point
• In the example plot, leveling off is seen at 4 and 5 segments
• There is no rule of thumb for a "good" inertia; you can only compare
multiple models to each other
27
Cluster Evaluation Metrics:
Distortion
• Distortion will always decrease as the number of clusters grows; look for a leveling‐off
point
• In the example plot, leveling off is seen at 4 and 5 segments
• Again, there is no rule of thumb for "good" distortion; you can only
compare multiple models to each other
28
Image source:
https://livebook.manning.com/concept/r/dunn-index
Cluster Evaluation Metrics: Elbow
Method
• Used to choose the optimal number of clusters
• Vary the number of clusters and monitor the evaluation metrics
• Look for where the slope becomes less steep and the metric improves less rapidly, showing "diminishing returns"
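A rough sketch of the elbow method with scikit-learn (X is assumed to be the same array as in the earlier k-means sketch):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit one model per candidate k and record its inertia.
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# The "elbow" is where the curve stops dropping steeply (diminishing returns).
plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()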
29
Cluster Evaluation Metrics:
Silhouette Score
Silhouette Score:
For each sample, compare its mean intra‐cluster distance a
with its mean distance b to the nearest neighboring cluster,
normalized by the maximum of the two:
s = (b - a) / max(a, b).
The overall score is the average over all samples.
30
Source:
https://www.analyticsvidhya.com/blog/2021/05/k-mean-getting-the-optimal-number-of-clusters/
Cluster Evaluation Metrics:
Silhouette Score
• Individual scores vary from -1 to +1
• The silhouette score is the average across all
data points
Interpretation
+1: The sample is far away from the neighboring
cluster
0: The sample is on or very near to the decision
boundary of a neighboring cluster
‐1: The sample may have been assigned to the
wrong cluster
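As a quick sketch with scikit-learn (X and labels assumed from the earlier k-means example):

from sklearn.metrics import silhouette_score

# Mean silhouette coefficient over all samples; values closer to +1 are better.
score = silhouette_score(X, labels)
print(f"average silhouette score: {score:.3f}")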
31
Cluster Evaluation Metrics:
Silhouette Plots
• The thickness of each cluster's band represents the cluster size
• The silhouette scores are shown on the horizontal axis
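The per-sample values behind such a plot can be computed with scikit-learn's silhouette_samples; a minimal sketch, assuming X and labels as before:

import numpy as np
from sklearn.metrics import silhouette_samples

values = silhouette_samples(X, labels)   # one value in [-1, +1] per sample
for c in np.unique(labels):
    cluster_vals = values[labels == c]
    # The band for cluster c in the plot is len(cluster_vals) rows thick.
    print(f"cluster {c}: size={len(cluster_vals)}, mean silhouette={cluster_vals.mean():.3f}")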
32
Image source:
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
33
Image source:
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
Tips and
Tricks
34
Incorporating Business Knowledge
37
Utilize
Weights!
• Weights are one of the best tools you
have in both segmentation and predictive
models
• Often useful to give higher weight to rows
demonstrating patterns of high business value
• Often creates smaller “good” groups and larger
“worse” groups
• Example: In an automotive dealer repair
segmentation we weighted rows with higher
dealership repair spend and shorter recency as
being 20% more important
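A minimal sketch of this idea, assuming scikit-learn's KMeans and a hypothetical high_value flag marking the rows to up-weight by roughly 20%:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical flag: 1 where a row shows high repair spend and short recency.
high_value = rng.integers(0, 2, size=len(X))

# Give those rows ~20% more influence on where the centroids land.
weights = np.where(high_value == 1, 1.2, 1.0)

weighted_km = KMeans(n_clusters=4, n_init=10, random_state=0)
weighted_labels = weighted_km.fit_predict(X, sample_weight=weights)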
38
Criteria for Dividing
Clusters
The linkage criterion defines how the distance between clusters is measured
when choosing the closest clusters to merge. It determines the rules for
combining clusters.
Ward's Linkage: Minimizes the variance of the merged clusters; the aim is to choose the merge with the smallest increase in variance.
  Pros/Cons: biased towards globular clusters; good with noisy data.

Average Linkage: Uses the average distance between the points in the two clusters.
  Pros/Cons: biased towards globular clusters; good with noisy data.

Centroid Linkage: Uses the distance between the cluster centroids (the means of all data points).
  Pros/Cons: good with noisy data; best with globular clusters.

Complete Linkage: Uses the distance between the two farthest data points across the two clusters.
  Pros/Cons: good with noisy data; often breaks data into large clusters; best with globular clusters.

Single Linkage: Uses the distance between the two closest data points across the two clusters.
  Pros/Cons: impacted less by outliers; prone to noise.
39
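As a sketch, scikit-learn's AgglomerativeClustering exposes several of these criteria through its linkage parameter ('ward', 'complete', 'average', 'single'; centroid linkage is available via SciPy's hierarchy module instead), assuming X as before:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Compare how the cluster sizes change under different linkage rules.
for linkage in ["ward", "complete", "average", "single"]:
    model = AgglomerativeClustering(n_clusters=4, linkage=linkage)
    hier_labels = model.fit_predict(X)
    print(linkage, np.bincount(hier_labels))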
Criteria for Dividing Clusters
Source:
https://dataaspirant.com/hierarchical-clustering-algorithm/#t-1608531820
Choosing the Number of Clusters
• When clusters are combined, you create a dendrogram
recording each combination
• The vertical line represents the distance between the two
clusters being merged
• The larger the distance of the vertical line, the more
dissimilar the clusters are from one another
• To choose the number of clusters, draw a horizontal line
and separate the dendrogram across the tallest vertical line
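A minimal sketch with SciPy (X assumed as before; Ward linkage and the cut into 4 clusters are illustrative choices):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Z records the full merge history of agglomerative clustering on X.
Z = linkage(X, method="ward")

# Tall vertical lines correspond to merges of dissimilar clusters.
dendrogram(Z)
plt.ylabel("merge distance")
plt.show()

# Cut the tree into a chosen number of clusters (here 4, for illustration).
hier_labels = fcluster(Z, t=4, criterion="maxclust")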
41
Hierarchical Clustering Pros
and Cons
Pros
• No need to set the number of clusters before modeling
• There are more "levers to pull" and tweak in the model to fit
it to your data
Cons
• More complex to understand and explain than K‐Means
• More difficult to tune
• Not scalable to large data sets
42
DBSCAN
43
DBSCAN
44
DBSCAN—Algorithm
45
DBSCAN ‐ Large Eps
46
DBSCAN ‐ Optimal Eps
47
In application
48
DBSCAN
49
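A minimal DBSCAN sketch with scikit-learn, assuming X as before; the eps and min_samples values are illustrative:

from sklearn.cluster import DBSCAN

# eps: neighborhood radius; min_samples: points needed to form a dense region.
db = DBSCAN(eps=0.5, min_samples=5)
db_labels = db.fit_predict(X)        # label -1 marks noise / outliers

# Too large an eps merges everything into a single cluster;
# a well-chosen eps separates the dense regions and flags outliers as noise.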
Thank You
50