ML+Clustering

The document provides an overview of machine learning concepts, including supervised, unsupervised, and semi-supervised learning, along with various techniques such as clustering and classification. It discusses the scikit-learn library, its functionalities for preprocessing, model selection, and evaluation, as well as clustering methods like k-means. Additionally, it highlights the importance of distance metrics in clustering and the evaluation of clustering quality.

11/22/2023

Machine Learning is…

… learning from data
… on its own
… discovering hidden patterns
… data-driven decisions

Supervised Learning

Purpose
Given a dataset {(xi, yi) ∈ X × Y, i = 1, ..., N}, learn the
dependencies between X and Y.

► Example: Learn the links between cardiac risk and food habits.
xi is one person described by d features concerning their food
habits; yi is a binary category (risky, not risky).

► The labels yi are essential for the learning process.

► Methods: K-Nearest Neighbors, SVM, Decision Tree, ...
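As a minimal sketch of this setting with scikit-learn (the toy feature matrix and labels below are invented for illustration, not the cardiac-risk data):

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy data: each row is one person described by d = 2 food-habit
# features; yi is the binary risk category (1 = risky, 0 = not risky)
X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
y = [1, 1, 0, 0]

# K-Nearest Neighbors: predict the majority label among the 3 closest samples
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)
pred = model.predict([[0.15, 0.85]])
```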


Unsupervised Learning

Purpose
From observations {xi ∈ X, i = 1, ..., N}, learn the organisation
of X and discover homogeneous subsets.

► Example: Categorize customers. xi encodes a customer with
features describing their social condition and behavior.

► Methods: Hierarchical clustering, K-Means, DBSCAN, ...


Semi-Supervised Learning

Purpose
Within a dataset, only a small part of the samples have a
corresponding label, i.e. {(x1, y1), ..., (xk, yk), xk+1, ..., xN}.
The goal is to infer the classes of the unlabeled data.
► Example: Filter webpages. The number of webpages is tremendous;
only a few of them can be labeled by an expert.
► Methods: Bayesian methods, SVM, Graph Neural Networks, ...
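scikit-learn's sklearn.semi_supervised module implements this idea with graph-based label propagation; a minimal sketch on toy 1-D data, with unlabeled samples marked -1 as the API expects:

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Six samples in two obvious groups; only one sample per group is
# labeled, the rest carry the "unlabeled" marker -1
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y = np.array([0, -1, -1, 1, -1, -1])

model = LabelPropagation()
model.fit(X, y)
inferred = model.transduction_  # labels inferred for every sample
```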


Supervised vs.
Unsupervised
•Supervised Approaches
• Target (what model is predicting) is provided
• ‘Labeled’ data
• Classification & regression are supervised.

•Unsupervised Approaches
• Target is unknown or unavailable
• ‘unlabeled’ data
• Cluster analysis & association analysis are
unsupervised.


Categories of Machine Learning Techniques

Supervised (target is available): Classification, Regression
Unsupervised (target is not available): Cluster Analysis, Association Analysis


Classification

Goal: Predict a category
(figure: weather photos labeled Sunny, Windy, Rainy, Cloudy)

Image source:
http://www.davidson.k12.nc.us/parents students/inclement_weather


Regression

Goal: Predict a numeric value


Cluster Analysis

Goal: Organize similar items into groups.
(figure: customers grouped into Seniors, Adults, Teenagers)

Image source:
http://www.monetate.com/blog/the-intrinsic-value-of-customer-segmentation


Association Analysis

Goal: Find rules to capture associations between items.


scikit-learn

• Open source library for Machine Learning in Python
• Built on top of NumPy, SciPy, matplotlib
• Active community for development
• Improved continuously by developers


Preprocessing Tools

• Utility functions for transforming raw feature vectors into a
suitable format

• Provides an API for:
• Scaling of features: remove the mean and keep unit variance
• Normalization to have unit norm
• Binarization to turn data into 0 or 1 format
• One Hot Encoding for categorical features
• Handling of missing values
• Generating higher order features
• Building custom transformations


Different Tasks
► Supervised Learning
► Unsupervised Learning
► Semi Supervised Learning


The scikit-learn documentation provides organized tutorials with specifics:

http://scikit-learn.org/stable/documentation.html


Dimensionality Reduction
• Enables you to reduce features while preserving variance
• scikit-learn has capabilities for:
• Principal Component Analysis (PCA)
• Singular Value Decomposition
• Factor Analysis
• Independent Component Analysis
• Matrix Factorization
• Latent Dirichlet Allocation
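For instance, PCA can be sketched as follows (synthetic data lying almost on a single direction, an assumption made here for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# 3-D points that really vary along one direction, plus a little noise
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = np.hstack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(100, 3))

# Keep the single component that explains most of the variance
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
explained = pca.explained_variance_ratio_[0]
```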


Model Selection

• Provides methods for Cross Validation

• Library functions for tuning hyperparameters

• Model Evaluation mechanisms to measure model performance

• Plotting methods for visualizing scores to evaluate models
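A sketch combining cross validation and hyperparameter tuning (the iris dataset and the grid of k values are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Cross validation: one accuracy score per held-out fold
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)

# Hyperparameter tuning: grid search over n_neighbors, scored by CV
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X, y)
best_k = search.best_params_["n_neighbors"]
```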


Summary of scikit-learn

• Extensive set of tools for full pipeline in Machine Learning

• Dependable due to community support

• Provides an easy-to-use API for training and making predictions

• Collection of the best, most popular, algorithms in one place


Clustering


Clustering http://scikit-learn.org/stable/modules/clustering.html#clustering

• sklearn.cluster provides algorithms for grouping unlabeled data
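A minimal example with k-means (synthetic blobs, invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of unlabeled points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(5, 0.3, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_            # cluster index of each sample
centers = kmeans.cluster_centers_  # one centroid per cluster
```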


Cluster Analysis Overview


Goal: Organize similar items into groups


Cluster Analysis Examples


• Segment customer base into groups
• Characterize different weather patterns for a region
• Group news articles into topics
• Discover crime hot spots
• NLP: Find sets of similar texts
• Documents: Automatic classification (driver license, ID, passport)
• Marketing: Client profiles


Cluster Analysis
• Divides data into clusters
• Similar items are placed in same cluster
• Intra-cluster differences are minimized
• Inter-cluster differences are maximized


Distance – main focus

Euclidean Distance: dm(x1, x2) = ǁx1 − x2ǁ2


Distance – other methods

(figure: two panels comparing paths between points A and B)

► Cosine Similarity
► Manhattan Distance
► Minkowski Distance (norm p); Manhattan distance is the case p = 1


Distance dm(x1, x2) I

► Euclidean distance: Minkowski distance with p = 2


Distance dm(x1, x2) II

► Matrix-based distance, with W symmetric positive definite:
dm(x1, x2)² = (x1 − x2)ᵀ W (x1 − x2)
► Mahalanobis distance: W = C⁻¹ with C the covariance matrix


Distance between discrete values


► Let x1 ∈ {c1, ..., ck} and x2 ∈ {d1, ..., dh}
► Contingency table A(x1, x2) = [aij]
► aij: number of times x1 = ci AND x2 = dj

► Hamming Distance: number of positions where the vectors differ

► Jaccard: dJ(A, B) = (|A∪B| − |A∩B|) / |A∪B|
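The distances above are all available in scipy.spatial.distance (assuming SciPy is installed); a quick sketch on two toy binary vectors:

```python
from scipy.spatial import distance

a, b = [1, 0, 1, 1], [1, 1, 0, 1]

d_euclidean = distance.euclidean(a, b)    # Minkowski with p = 2
d_manhattan = distance.cityblock(a, b)    # Minkowski with p = 1
d_minkowski = distance.minkowski(a, b, p=3)
d_cosine = distance.cosine(a, b)          # 1 - cosine similarity
d_hamming = distance.hamming(a, b)        # fraction of differing positions
d_jaccard = distance.jaccard(a, b)        # (|A∪B| - |A∩B|) / |A∪B| on boolean vectors
```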


Distance properties

Four properties of a metric dm : X × X → [0, ∞)

1. Non-negativity: dm(x, y) ≥ 0
2. Symmetry: dm(x, y) = dm(y, x)
3. Identity: dm(x, y) = 0 ⇔ x = y
4. Triangle inequality: dm(x, y) ≤ dm(x, z) + dm(z, y)


Distance between Clusters


How to estimate dm (C1, C2) ?


Illustration

(figure: four strategies for the distance between clusters — Single
Linkage, Complete Linkage, Average Linkage, Centers of Gravity)
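These four strategies correspond to the method argument of SciPy's hierarchical clustering; a sketch on toy blobs invented here:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(10, 2)),
               rng.normal(5, 0.3, size=(10, 2))])

# Each method is one definition of the distance dm(C1, C2) between clusters
Z_single = linkage(X, method="single")      # closest pair of points
Z_complete = linkage(X, method="complete")  # farthest pair of points
Z_average = linkage(X, method="average")    # mean pairwise distance
Z_centroid = linkage(X, method="centroid")  # distance between centers of gravity

# Cut the complete-linkage tree into 2 flat clusters
labels = fcluster(Z_complete, t=2, criterion="maxclust")
```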


How to evaluate the quality of a clustering?


How to evaluate the quality of a clustering?

error = distance between a sample and its centroid

squared error = error²

Sum of squared errors between all samples and their centroid,
summed over all clusters:

WSSE = Within-Cluster Sum of Squared Errors
     = Intra-Cluster Inertia Jw
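The WSSE can be computed by hand and compared with the value scikit-learn reports (toy blobs; inertia_ is scikit-learn's name for this quantity):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(4, 0.5, size=(50, 2))])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# WSSE: squared error between each sample and its own centroid,
# summed over all clusters
wsse = sum(np.sum((X[kmeans.labels_ == k] - c) ** 2)
           for k, c in enumerate(kmeans.cluster_centers_))
```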


How to evaluate the quality of a clustering?

If WSSE1 < WSSE2, clustering 1 is better numerically.

Caveats:
• This does not mean that cluster set 1 is more 'correct' than cluster set 2
• Larger values of k will always reduce WSSE



18
11/22/2023

A Good Clustering

(figure: four clusters C1..C4 with centers of gravity g1..g4 and
global center g)

Total Inertia = Intra-Cluster Inertia + Inter-Cluster Inertia

Good partition? Minimise the intra-cluster inertia and maximise the
inter-cluster inertia.


A Good Partition

(figure: two partitions with centers g1..g4 — left: high inter-cluster
inertia and low intra-cluster inertia; right: low inter-cluster inertia
and high intra-cluster inertia)


Terms used: Similarity and Dissimilarity

► Dissimilarity dm: a small value means the points are close (e.g.
squared Euclidean distance)

dm(x, z) = ǁx − zǁ2²

► Similarity sm: a big value means the points are close (e.g. RBF)

sm(x, z) = exp(−ǁx − zǁ² / σ)


Normalizing Input Variables

(figure: Weight vs. Height scatter plot, before and after scaling)


Cluster Analysis Notes

• Unsupervised: there is no 'correct' clustering
• Clusters don't come with labels
• Interpretation and analysis are required to make sense of
clustering results!


Uses of Cluster Results

• Data segmentation
• Analysis of each segment can provide insights
(figure: book customers segmented into science fiction, non-fiction,
and children's)

Uses of Cluster Results


• Categories for classifying new data
• New sample assigned to closest cluster
• Label of closest cluster used to
classify new sample
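With scikit-learn's k-means this is exactly what predict does (toy blobs again):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(30, 2)),
               rng.normal(6, 0.3, size=(30, 2))])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# A new sample is assigned to the cluster with the closest centroid
new_label = kmeans.predict([[5.8, 6.1]])[0]
```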


Uses of Cluster Results

• Labeled data for classification
• Cluster samples used as labeled data

(figure: cluster of science-fiction customers used as labeled samples)


Uses of Cluster Results

• Basis for anomaly detection
• Cluster outliers are anomalies

(figure: outliers flagged as anomalies that require further analysis)
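One simple way to sketch this (an assumption of this example, not a prescribed method): flag samples whose distance to their own centroid is unusually large.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(5, 0.3, size=(50, 2)),
               [[20.0, 20.0]]])  # one obvious outlier

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of each sample to the centroid of its own cluster;
# very large distances flag potential anomalies
dist = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)
anomalies = np.where(dist > dist.mean() + 3 * dist.std())[0]
```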


• Organize similar items into groups


• Analyzing clusters often leads to useful
insights about data
• Clusters require analysis and interpretation


Questions
raised:

► Data Nature: Binary, texts, numeric, trees, . . .


► Similarity between data
► What is a cluster ?
► What is a good cluster ?
► How many clusters ?
► Which algorithm ?
► Evaluation of clustering results


Clustering Methods

► Many methods exist ...
► Hierarchical Clustering
► Agglomerative Clustering
► Distances used
► Agglomeration strategies
► Divisive (splitting) Clustering
► K-means and derivatives
► DBSCAN
► Spectral Clustering
► ...
► Model-based Clustering
► Gaussian Mixture Models
► One-Class SVM

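Two of the listed alternatives as one-liners in scikit-learn (toy blobs; parameters like eps=0.5 are arbitrary choices for this data):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, size=(30, 2)),
               rng.normal(4, 0.2, size=(30, 2))])

# Density-based: clusters are dense regions; no k, but eps/min_samples
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Model-based: fit a mixture of Gaussians, assign by highest posterior
gm_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)
```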

k-Means Clustering


Cluster Analysis
• Divides data into clusters
• Similar items are in same cluster
• Intra-cluster differences are minimized
• Inter-cluster differences are maximized


k-Means Algorithm
Select k initial centroids (cluster centers)
Repeat
Assign each sample to closest centroid
Calculate mean of cluster to determine new centroid
Until some stopping criterion is reached
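The pseudocode above can be sketched directly in NumPy (a bare-bones version that ignores the empty-cluster edge case):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: assign each sample to the closest centroid,
    then recompute each centroid as its cluster mean."""
    rng = np.random.default_rng(seed)
    # Select k initial centroids: k distinct random samples
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each sample to the closest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute the mean of each cluster as the new centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```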



K-means for Clustering


Purpose
► D = {xi ∈ R^d}, i = 1, ..., N
► Cluster the data into K < N clusters Ck

Brute Force
1. Build all possible partitions
2. Evaluate each clustering and keep the best one

Problem
The number of possible clusterings increases exponentially.
For N = 10 and K = 4, there are 34105 possible clusterings!


K-means for Clustering

• A better solution
► Minimize the intra-class inertia Jw with respect to µk, k = 1, ..., K
► Use a heuristic: we will get a good clustering, but not necessarily
the best one according to Jw


K-means for Clustering

A famous algorithm: K-means

1. Start with gravity centers (centroids) µk, k = 1, ..., K
2. Assign each xi to the closest cluster Cl
3. Recompute µk for each Ck, k = 1, ..., K
4. Repeat steps 2-3 until convergence


K-means algorithm


K-Means: illustration

Clustering into K = 2 clusters.

(figure: six scatter plots — the ground truth, the initialisation, and
the clusters obtained at iterations 1, 2, 3, and 5)

Choosing Initial Centroids


Issue:
Final clusters are sensitive to initial centroids

Solution:
Run k-means multiple times with
different random initial centroids,
and choose best results
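scikit-learn automates exactly this with the n_init parameter: it runs n_init fits from different random centroids and keeps the one with the lowest WSSE (toy three-blob data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.4, size=(40, 2)),
               rng.normal(4, 0.4, size=(40, 2)),
               rng.normal(8, 0.4, size=(40, 2))])

# 10 restarts from random initial centroids; the best fit (lowest
# inertia, i.e. WSSE) is kept automatically
kmeans = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)
```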


Choosing Value for k

• Approaches:
• Visualization

• Application-Dependent

• Data-Driven


How to choose the number of clusters?

► Hard problem; depends on the data
► May be fixed a priori by the application
► Search for the best partition for different K > 1; look for a
break in the decrease of Jw(K)
► Constrain the density and/or volume of clusters
► Use criteria to evaluate clusterings:
► Compute a clustering for each K = 1, ..., Kmax
► Compute the criterion J(K)
► Choose K* as the K with the best criterion value


Elbow Method for Choosing k

(figure: WSSE vs. k curve — the "elbow" suggests the value for k
should be 3)
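The elbow curve can be computed by fitting k-means for a range of k and recording the WSSE (toy data with three true blobs, so the drop flattens after k = 3):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 4, 8)])

# WSSE (inertia) for each candidate k; the "elbow" is where the
# curve stops dropping sharply
wsse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in range(1, 7)]
```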


K-Means: Discussion

► Jw decreases at each iteration
► It converges towards a local minimum of Jw
► Quick convergence
► Initialisation of µk:
► Randomly within the domain of the xi
► Randomly pick K samples among X
► Different initializations lead to different clusterings


Stopping Criteria

When to stop iterating?
• No changes to the centroids
• The number of samples changing clusters is below a threshold


Some Criteria

(two slides of criterion formulas; content lost in extraction)
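The original criterion formulas did not survive the extraction; one commonly used criterion (chosen here as an example, not necessarily the one on the slide) is the silhouette score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(40, 2)),
               rng.normal(5, 0.3, size=(40, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Silhouette: in [-1, 1]; higher means tighter, better separated clusters
score = silhouette_score(X, labels)
```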


Interpreting Results
• Examine cluster centroids
• How are clusters different?

(figure: three centroids marked X — compare centroids to see how the
clusters differ)


K-Means Summary

• Classic algorithm for cluster analysis


• Simple to understand and implement, and efficient
• The value of k must be specified
• Final clusters are sensitive to the initial centroids

