ML + Clustering
Supervised Learning
Purpose
Given a dataset $\{(x_i, y_i) \in \mathcal{X} \times \mathcal{Y},\ i = 1, \dots, N\}$, learn the dependencies between $\mathcal{X}$ and $\mathcal{Y}$.
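A minimal sketch of this setup with scikit-learn (introduced later in these slides); the dataset and classifier choices are illustrative assumptions:

    # Learn the dependency between features X and labels y from (x_i, y_i) pairs.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(model.score(X_test, y_test))  # accuracy on held-out pairs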
Unsupervised Learning
Purpose
From observations $\{x_i \in \mathcal{X},\ i = 1, \dots, N\}$, learn the organisation of $\mathcal{X}$ and discover homogeneous subsets.
► Example: Categorize customers. Each $x_i$ encodes a customer with features describing their social condition and behavior.
Semi-Supervised Learning
Purpose
Within a dataset, only a small subset of the samples has a corresponding label, i.e. $\{(x_1, y_1), \dots, (x_k, y_k), x_{k+1}, \dots, x_N\}$. The goal is to infer the classes of the unlabeled data.
► Example: Filter webpages. The number of webpages is tremendous; only a few of them can be labeled by an expert.
► Methods: Bayesian methods, SVM, Graph Neural Networks, ...
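A sketch using label propagation, one standard semi-supervised method (the slides name Bayesian methods, SVM, and GNNs; this particular choice is ours):

    # Unlabeled samples are marked with -1; the model infers their classes.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.semi_supervised import LabelPropagation

    X, y = load_iris(return_X_y=True)
    rng = np.random.RandomState(0)
    y_partial = y.copy()
    y_partial[rng.rand(len(y)) < 0.9] = -1  # keep labels for only ~10% of samples

    model = LabelPropagation().fit(X, y_partial)
    print((model.transduction_ == y).mean())  # fraction of correctly inferred labels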
Supervised vs. Unsupervised
• Supervised Approaches
  • Target (what the model is predicting) is provided
  • ‘Labeled’ data
  • Classification & regression are supervised.
• Unsupervised Approaches
  • Target is unknown or unavailable
  • ‘Unlabeled’ data
  • Cluster analysis & association analysis are unsupervised.
Supervised (target is available)    Unsupervised (target is not available)
Classification                      Cluster Analysis
Regression                          Association Analysis
[Figure: classification example, weather images labeled Windy, Rainy, Cloudy.]
Image source: http://www.davidson.k12.nc.us/parents students/inclement_weather
Regression
Goal: predict a numeric value.
[Figure: cluster analysis example, a cluster of customers labeled ‘Teenagers’.]
Association Analysis
scikit-learn
Preprocessing Tools
Different Tasks
► Supervised Learning
► Unsupervised Learning
► Semi-Supervised Learning
http://scikit-learn.org/stable/documentation.html
Dimensionality Reduction
• Enables you to reduce the number of features while preserving variance
• scikit-learn has capabilities for (a PCA sketch follows this list):
  • Principal Component Analysis (PCA)
  • Singular Value Decomposition
  • Factor Analysis
  • Independent Component Analysis
  • Matrix Factorization
  • Latent Dirichlet Allocation
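A minimal PCA sketch; the dataset and the 2-component choice are illustrative assumptions:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)  # 4 features -> 2 components
    print(pca.explained_variance_ratio_)  # variance preserved per component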
Model Selection
Summary of scikit-learn
Clustering
Clustering
http://scikit-learn.org/stable/modules/clustering.html#clustering
Cluster Analysis
• Divides data into clusters
• Similar items are placed in the same cluster
• Intra-cluster differences are minimized
[Figure: panels illustrating Euclidean distance, cosine similarity, and Manhattan distance between two points A and B.]
► Minkowski Distance (norm $p$): $d_p(x_1, x_2) = \left( \sum_{j=1}^{d} |x_{1j} - x_{2j}|^p \right)^{1/p}$
► Manhattan distance ($p = 1$): $d_1(x_1, x_2) = \sum_{j=1}^{d} |x_{1j} - x_{2j}|$
Distance $d_m(x_1, x_2)$ I
► Euclidean distance ($p = 2$): $d_2(x_1, x_2) = \sqrt{\sum_{j=1}^{d} (x_{1j} - x_{2j})^2}$
Distance $d_m(x_1, x_2)$ II
► Jaccard: $d_J(A, B) = \dfrac{|A \cup B| - |A \cap B|}{|A \cup B|}$
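These distances in code, using scipy (our library choice; the slides do not prescribe one):

    from scipy.spatial import distance

    x1, x2 = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
    print(distance.minkowski(x1, x2, p=3))  # Minkowski, norm p
    print(distance.cityblock(x1, x2))       # Manhattan (p = 1)
    print(distance.euclidean(x1, x2))       # Euclidean (p = 2)
    print(distance.cosine(x1, x2))          # cosine distance = 1 - cosine similarity
    A, B = {1, 2, 3}, {2, 3, 4}
    print((len(A | B) - len(A & B)) / len(A | B))  # Jaccard distance on sets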
Distance properties
A distance $d$ satisfies, for all $x$, $y$, $z$:
► Non-negativity: $d(x, y) \ge 0$, and $d(x, y) = 0 \iff x = y$
► Symmetry: $d(x, y) = d(y, x)$
► Triangle inequality: $d(x, z) \le d(x, y) + d(y, z)$
Caveats:
• A lower WSSE does not mean that cluster set 1 is more ‘correct’ than cluster set 2
• Larger values of k will always reduce WSSE
A Good Clustering
[Figure: four clusters C1-C4 with their centers g1-g4. Good partition?]
A Good Partition
[Figure: the same data partitioned around centers g1-g4.]
$s_m(x, z) = \exp\left( -\frac{\|x - z\|^2}{\sigma} \right)$
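A direct NumPy transcription of this similarity measure; the value sigma = 1.0 is an illustrative assumption:

    import numpy as np

    def gaussian_similarity(x, z, sigma=1.0):
        # s_m(x, z) = exp(-||x - z||^2 / sigma)
        return np.exp(-np.linalg.norm(np.asarray(x) - np.asarray(z)) ** 2 / sigma)

    print(gaussian_similarity([0.0, 0.0], [1.0, 1.0]))  # exp(-2) ≈ 0.135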
Scaled Values
[Figure: weight vs. height scatter plot after feature scaling.]
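A feature-scaling sketch; StandardScaler and the sample values are our illustrative choices, the slides only show scaled weight/height values:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[70.0, 1.75], [55.0, 1.60], [90.0, 1.90]])  # [weight kg, height m]
    X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
    print(X_scaled)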
There is no ‘correct’ clustering
[Figure: example book clusters labeled ‘non-fiction’ and ‘children’s’.]
[Figure: a cluster provides labeled samples for science-fiction customers.]
[Figure: points far from any cluster are anomalies that require further analysis.]
Questions raised:
Clustering Methods
► Many methods exist (see the sketch after this list) . . .
► Hierarchical Clustering
  ► Agglomerative Clustering
    ► Distances used
    ► Agglomeration strategies
  ► Splitting (divisive) Clustering
► K-means and derivatives
► DBSCAN
► Spectral Clustering
► ...
► Model-based Clustering
  ► Gaussian Mixture Models
  ► One-Class SVM
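A sketch of the scikit-learn counterparts of a few methods above; the parameters and data are illustrative assumptions:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, DBSCAN
    from sklearn.mixture import GaussianMixture

    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
    print(AgglomerativeClustering(n_clusters=2).fit_predict(X))  # hierarchical
    print(DBSCAN(eps=0.8, min_samples=5).fit_predict(X))         # -1 marks noise
    print(GaussianMixture(n_components=2, random_state=0).fit_predict(X))  # GMM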
k-Means Clustering
k-Means Algorithm
Select k initial centroids (cluster centers)
Repeat
    Assign each sample to the closest centroid
    Recompute each centroid as the mean of its cluster
Until some stopping criterion is reached
[Figure: samples assigned to centroids, marked X.]
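A from-scratch NumPy sketch of this loop; random-sample initialization and a fixed iteration cap are our simplifying assumptions:

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        rng = np.random.RandomState(seed)
        centroids = X[rng.choice(len(X), k, replace=False)]  # k initial centroids
        for _ in range(n_iters):
            # assign each sample to the closest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # recompute each centroid as the mean of its cluster
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):  # stopping criterion
                break
            centroids = new_centroids
        return labels, centroids

    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
    labels, centroids = kmeans(X, k=2)
    print(centroids)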
Brute Force
1. Build all possible partitions
2. Evaluate each clustering and keep the best one
Problem
The number of possible clusterings grows exponentially with the number of samples.
• A better solution
► Minimize the intra-class inertia $J_w = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$ with respect to $\mu_k$, $k = 1, \dots, K$
K-Means: Illustration
Clustering into K = 2 clusters
[Figure: panels showing the ground truth, the initialisation, and the clusters obtained at iterations 1 onward.]
Solution: run k-means multiple times with different random initial centroids, and keep the best result (see the snippet below).
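In scikit-learn this restart strategy is the n_init parameter: KMeans keeps the run with the lowest inertia. Data and values here are illustrative:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # 10 random restarts
    print(km.inertia_)  # WSSE of the best of the 10 runs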
• Approaches to choosing k:
  • Visualization
  • Application-dependent
  • Data-driven
K clusters
► Hard problem; depends on the data
► May be fixed a priori by the problem at hand
► Search for the best partition for different K > 1; look for an elbow in the decrease of $J_w(K)$
► Constrain the density and/or volume of clusters
► Use criteria to evaluate clusterings (a sketch follows this list):
  ► Compute a clustering for each $K = 1, \dots, K_{max}$
  ► Compute the criterion $J(K)$
  ► Choose $K^*$, the $K$ with the best criterion value
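A sketch of the criterion sweep: inertia ($J_w$) for the elbow, plus the silhouette score as one example criterion; the data and K range are illustrative assumptions:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
    for K in range(2, 7):
        km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
        print(K, km.inertia_, silhouette_score(X, km.labels_))
    # an elbow in inertia or the maximum silhouette suggests K*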
K-Means: Discussion
Stopping Criteria
• No (or minimal) change in cluster assignments
• Centroids move less than a threshold
• A maximum number of iterations is reached
Some Criteria
Interpreting Results
• Examine cluster centroids
• How are clusters different?
[Figure: compare centroids (marked X) to see how clusters differ.]
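Inspecting fitted centroids in scikit-learn; the data is an illustrative assumption:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_)  # one row per cluster; compare rows feature by feature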
K-Means Summary