ML+Clustering

The document provides an overview of machine learning concepts, including supervised, unsupervised, and semi-supervised learning, along with various techniques such as clustering and classification. It discusses the scikit-learn library, its functionalities for preprocessing, model selection, and evaluation, as well as clustering methods like k-means. Additionally, it highlights the importance of distance metrics in clustering and the evaluation of clustering quality.

11/22/2023

Machine Learning is…

… learning from data
… on its own
… discovering hidden patterns
… data-driven decisions

Supervised Learning

Purpose
Given a dataset {(xi, yi) ∈ X × Y, i = 1, ..., N}, learn the
dependencies between X and Y.

► Example: Learn the links between cardiac risk and food habits.
xi is one person described by d features concerning their food
habits; yi is a binary category (risky, not risky).

► The labels yi are essential for the learning process.

► Methods: K-Nearest Neighbors, SVM, Decision Tree, ...
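As a minimal sketch of this setting with scikit-learn (the toy feature matrix and labels below are invented for illustration, not the cardiac-risk data):

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy data: each row is one person described by d = 2 food-habit
# features; yi is the binary risk category (1 = risky, 0 = not risky)
X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
y = [1, 1, 0, 0]

# K-Nearest Neighbors: predict the majority label among the 3 closest samples
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)
pred = model.predict([[0.15, 0.85]])
```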


Unsupervised Learning

Purpose
From observations {xi ∈ X, i = 1, ..., N}, learn the organisation
of X and discover homogeneous subsets.

► Example: Categorize customers. xi encodes a customer with
features describing their social condition and behavior.

► Methods: Hierarchical clustering, K-Means, DBSCAN, ...


Semi-Supervised Learning

Purpose
Within a dataset, only a small part of the samples have a
corresponding label, i.e. {(x1, y1), ..., (xk, yk), xk+1, ..., xN}.
The goal is to infer the classes of the unlabeled data.
► Example: Filter webpages. The number of webpages is tremendous;
only a few of them can be labeled by an expert.
► Methods: Bayesian methods, SVM, Graph Neural Networks, ...
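scikit-learn's sklearn.semi_supervised module implements this idea with graph-based label propagation; a minimal sketch on toy 1-D data, with unlabeled samples marked -1 as the API expects:

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Six samples in two obvious groups; only one sample per group is
# labeled, the rest carry the "unlabeled" marker -1
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y = np.array([0, -1, -1, 1, -1, -1])

model = LabelPropagation()
model.fit(X, y)
inferred = model.transduction_  # labels inferred for every sample
```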


Supervised vs.
Unsupervised
•Supervised Approaches
• Target (what model is predicting) is provided
• ‘Labeled’ data
• Classification & regression are supervised.

•Unsupervised Approaches
• Target is unknown or unavailable
• ‘unlabeled’ data
• Cluster analysis & association analysis are
unsupervised.


Categories of Machine Learning Techniques

Supervised (target is available): Classification, Regression
Unsupervised (target is not available): Cluster Analysis, Association Analysis


Classification

Goal: Predict a category
(figure: weather photos labeled Sunny, Windy, Rainy, Cloudy)

Image source:
http://www.davidson.k12.nc.us/parents students/inclement_weather


Regression

Goal: Predict a numeric value


Cluster Analysis

Goal: Organize similar items into groups.
(figure: customers grouped into Seniors, Adults, Teenagers)

Image source:
http://www.monetate.com/blog/the-intrinsic-value-of-customer-segmentation


Association Analysis

Goal: Find rules to capture associations between items.


scikit-learn

• Open source library for Machine Learning in Python
• Built on top of NumPy, SciPy, matplotlib
• Active community for development
• Improved continuously by developers


Preprocessing Tools

• Utility functions for transforming raw feature vectors into a
suitable format

• Provides an API for:
• Scaling of features: remove the mean and keep unit variance
• Normalization to have unit norm
• Binarization to turn data into 0 or 1 format
• One Hot Encoding for categorical features
• Handling of missing values
• Generating higher order features
• Building custom transformations


Different Tasks
► Supervised Learning
► Unsupervised Learning
► Semi Supervised Learning


The scikit-learn documentation provides organized tutorials with specifics:

http://scikit-learn.org/stable/documentation.html


Dimensionality Reduction
• Enables you to reduce features while preserving variance
• scikit-learn has capabilities for:
• Principal Component Analysis (PCA)
• Singular Value Decomposition
• Factor Analysis
• Independent Component Analysis
• Matrix Factorization
• Latent Dirichlet Allocation
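For instance, PCA can be sketched as follows (synthetic data lying almost on a single direction, an assumption made here for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# 3-D points that really vary along one direction, plus a little noise
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = np.hstack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(100, 3))

# Keep the single component that explains most of the variance
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
explained = pca.explained_variance_ratio_[0]
```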


Model Selection

• Provides methods for Cross Validation

• Library functions for tuning hyperparameters

• Model Evaluation mechanisms to measure model performance

• Plotting methods for visualizing scores to evaluate models
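A sketch combining cross validation and hyperparameter tuning (the iris dataset and the grid of k values are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Cross validation: one accuracy score per held-out fold
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)

# Hyperparameter tuning: grid search over n_neighbors, scored by CV
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X, y)
best_k = search.best_params_["n_neighbors"]
```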


Summary of scikit-learn

• Extensive set of tools for full pipeline in Machine Learning

• Dependable due to community support

• Provides an easy-to-use API for training and making predictions

• Collection of the best, most popular, algorithms in one place


Clustering


Clustering http://scikit-learn.org/stable/modules/clustering.html#clustering

• sklearn.cluster provides algorithms for grouping unlabeled data
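A minimal example with k-means (synthetic blobs, invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of unlabeled points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(5, 0.3, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_            # cluster index of each sample
centers = kmeans.cluster_centers_  # one centroid per cluster
```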


Cluster Analysis Overview


Goal: Organize similar items into groups


Cluster Analysis Examples


• Segment customer base into groups
• Characterize different weather patterns for a region
• Group news articles into topics
• Discover crime hot spots
• NLP: Find sets of similar texts
• Documents: Automatic classification (driver license, ID, passport)
• Marketing: Client profiles


Cluster Analysis
• Divides data into clusters
• Similar items are placed in same cluster
• Intra-cluster differences are minimized
• Inter-cluster differences are maximized


Distance – main focus

Euclidean Distance: dm(x1, x2) = ǁx1 − x2ǁ2


Distance – other methods

(figure: two panels comparing paths between points A and B)

► Cosine Similarity
► Manhattan Distance
► Minkowski Distance (norm p); Manhattan distance is the case p = 1


Distance dm(x1, x2) I

► Euclidean distance: Minkowski distance with p = 2


Distance dm(x1, x2) II

► Matrix-based distance, with W symmetric positive definite:
dm(x1, x2)² = (x1 − x2)ᵀ W (x1 − x2)
► Mahalanobis distance: W = C⁻¹ with C the covariance matrix


Distance between discrete values


► Let x1 ∈ {c1, ..., ck} and x2 ∈ {d1, ..., dh}
► Contingency table A(x1, x2) = [aij]
► aij: number of times x1 = ci AND x2 = dj

► Hamming Distance: number of positions where the vectors differ

► Jaccard: dJ(A, B) = (|A∪B| − |A∩B|) / |A∪B|
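The distances above are all available in scipy.spatial.distance (assuming SciPy is installed); a quick sketch on two toy binary vectors:

```python
from scipy.spatial import distance

a, b = [1, 0, 1, 1], [1, 1, 0, 1]

d_euclidean = distance.euclidean(a, b)    # Minkowski with p = 2
d_manhattan = distance.cityblock(a, b)    # Minkowski with p = 1
d_minkowski = distance.minkowski(a, b, p=3)
d_cosine = distance.cosine(a, b)          # 1 - cosine similarity
d_hamming = distance.hamming(a, b)        # fraction of differing positions
d_jaccard = distance.jaccard(a, b)        # (|A∪B| - |A∩B|) / |A∪B| on boolean vectors
```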


Distance properties

Four properties of a metric dm : X × X → [0, ∞)

1. Non-negativity: dm(x, y) ≥ 0
2. Symmetry: dm(x, y) = dm(y, x)
3. Identity: dm(x, y) = 0 ⇔ x = y
4. Triangle inequality: dm(x, y) ≤ dm(x, z) + dm(z, y)


Distance between Clusters


How to estimate dm (C1, C2) ?


Illustration

(figure: four strategies for the distance between clusters — Single
Linkage, Complete Linkage, Average Linkage, Centers of Gravity)
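These four strategies correspond to the method argument of SciPy's hierarchical clustering; a sketch on toy blobs invented here:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(10, 2)),
               rng.normal(5, 0.3, size=(10, 2))])

# Each method is one definition of the distance dm(C1, C2) between clusters
Z_single = linkage(X, method="single")      # closest pair of points
Z_complete = linkage(X, method="complete")  # farthest pair of points
Z_average = linkage(X, method="average")    # mean pairwise distance
Z_centroid = linkage(X, method="centroid")  # distance between centers of gravity

# Cut the complete-linkage tree into 2 flat clusters
labels = fcluster(Z_complete, t=2, criterion="maxclust")
```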


How to evaluate the quality of a clustering?


How to evaluate the quality of a clustering?

error = distance between a sample and its centroid

squared error = error²

Sum of squared errors between all samples and their centroid,
summed over all clusters:

WSSE = Within-Cluster Sum of Squared Errors
     = Intra-Cluster Inertia Jw
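The WSSE can be computed by hand and compared with the value scikit-learn reports (toy blobs; inertia_ is scikit-learn's name for this quantity):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(4, 0.5, size=(50, 2))])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# WSSE: squared error between each sample and its own centroid,
# summed over all clusters
wsse = sum(np.sum((X[kmeans.labels_ == k] - c) ** 2)
           for k, c in enumerate(kmeans.cluster_centers_))
```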


How to evaluate the quality of a clustering?

If WSSE1 < WSSE2, clustering 1 is better numerically.

Caveats:
• This does not mean that cluster set 1 is more 'correct' than cluster set 2
• Larger values of k will always reduce WSSE



18
11/22/2023

A Good Clustering

(figure: four clusters C1..C4 with centers of gravity g1..g4 and
global center g)

Total Inertia = Intra-Cluster Inertia + Inter-Cluster Inertia

Good partition? Minimise the intra-cluster inertia and maximise the
inter-cluster inertia.


A Good Partition

(figure: two partitions with centers g1..g4 — left: high inter-cluster
inertia and low intra-cluster inertia; right: low inter-cluster inertia
and high intra-cluster inertia)


Terms used: Similarity and Dissimilarity

► Dissimilarity dm: a small value means the points are close (e.g.
squared Euclidean distance)

dm(x, z) = ǁx − zǁ2²

► Similarity sm: a big value means the points are close (e.g. RBF)

sm(x, z) = exp(−ǁx − zǁ² / σ)


Normalizing Input Variables

(figure: Weight vs. Height scatter plot, before and after scaling)


Cluster Analysis Notes

• Unsupervised: there is no 'correct' clustering
• Clusters don't come with labels
• Interpretation and analysis are required to make sense of
clustering results!


Uses of Cluster Results

• Data segmentation
• Analysis of each segment can provide insights
(figure: book customers segmented into science fiction, non-fiction,
and children's)

Uses of Cluster Results


• Categories for classifying new data
• New sample assigned to closest cluster
• Label of closest cluster used to
classify new sample
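With scikit-learn's k-means this is exactly what predict does (toy blobs again):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(30, 2)),
               rng.normal(6, 0.3, size=(30, 2))])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# A new sample is assigned to the cluster with the closest centroid
new_label = kmeans.predict([[5.8, 6.1]])[0]
```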


Uses of Cluster Results

• Labeled data for classification
• Cluster samples used as labeled data

(figure: cluster of science-fiction customers used as labeled samples)


Uses of Cluster Results

• Basis for anomaly detection
• Cluster outliers are anomalies

(figure: outliers flagged as anomalies that require further analysis)
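One simple way to sketch this (an assumption of this example, not a prescribed method): flag samples whose distance to their own centroid is unusually large.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(5, 0.3, size=(50, 2)),
               [[20.0, 20.0]]])  # one obvious outlier

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of each sample to the centroid of its own cluster;
# very large distances flag potential anomalies
dist = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)
anomalies = np.where(dist > dist.mean() + 3 * dist.std())[0]
```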


• Organize similar items into groups


• Analyzing clusters often leads to useful
insights about data
• Clusters require analysis and interpretation


Questions
raised:

► Data Nature: Binary, texts, numeric, trees, . . .


► Similarity between data
► What is a cluster ?
► What is a good cluster ?
► How many clusters ?
► Which algorithm ?
► Evaluation of clustering results


Clustering Methods

► Many methods exist ...
► Hierarchical Clustering
► Agglomerative Clustering
► Distances used
► Agglomeration strategies
► Divisive (splitting) Clustering
► K-means and derivatives
► DBSCAN
► Spectral Clustering
► ...
► Model-based Clustering
► Gaussian Mixture Models
► One-Class SVM

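Two of the listed alternatives as one-liners in scikit-learn (toy blobs; parameters like eps=0.5 are arbitrary choices for this data):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, size=(30, 2)),
               rng.normal(4, 0.2, size=(30, 2))])

# Density-based: clusters are dense regions; no k, but eps/min_samples
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Model-based: fit a mixture of Gaussians, assign by highest posterior
gm_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)
```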

k-Means Clustering


Cluster Analysis
• Divides data into clusters
• Similar items are in same cluster
• Intra-cluster differences are minimized
• Inter-cluster differences are maximized


k-Means Algorithm
Select k initial centroids (cluster centers)
Repeat
Assign each sample to closest centroid
Calculate mean of cluster to determine new centroid
Until some stopping criterion is reached
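The pseudocode above can be sketched directly in NumPy (a bare-bones version that ignores the empty-cluster edge case):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: assign each sample to the closest centroid,
    then recompute each centroid as its cluster mean."""
    rng = np.random.default_rng(seed)
    # Select k initial centroids: k distinct random samples
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each sample to the closest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute the mean of each cluster as the new centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```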



K-means for Clustering


Purpose
► D = {xi ∈ R^d}, i = 1, ..., N
► Cluster the data into K < N clusters Ck

Brute Force
1. Build all possible partitions
2. Evaluate each clustering and keep the best one

Problem
The number of possible clusterings increases exponentially.
For N = 10 and K = 4, there are 34105 possible clusterings!


K-means for Clustering

• A better solution
► Minimize the intra-class inertia Jw with respect to µk, k = 1, ..., K
► Use a heuristic: we will get a good clustering, but not necessarily
the best one according to Jw


K-means for Clustering

A famous algorithm: K-means

1. Start with gravity centers (centroids) µk, k = 1, ..., K
2. Assign each xi to the closest cluster Cl
3. Recompute µk for each Ck, k = 1, ..., K
4. Repeat steps 2-3 until convergence


K-means algorithm


K-Means: illustration

Clustering into K = 2 clusters.

(figure: six scatter plots — the ground truth, the initialisation, and
the clusters obtained at iterations 1, 2, 3, and 5)

Choosing Initial Centroids


Issue:
Final clusters are sensitive to initial centroids

Solution:
Run k-means multiple times with
different random initial centroids,
and choose best results
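scikit-learn automates exactly this with the n_init parameter: it runs n_init fits from different random centroids and keeps the one with the lowest WSSE (toy three-blob data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.4, size=(40, 2)),
               rng.normal(4, 0.4, size=(40, 2)),
               rng.normal(8, 0.4, size=(40, 2))])

# 10 restarts from random initial centroids; the best fit (lowest
# inertia, i.e. WSSE) is kept automatically
kmeans = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)
```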


Choosing Value for k

• Approaches:
• Visualization

• Application-Dependent

• Data-Driven


How to choose the number of clusters?

► Hard problem; depends on the data
► May be fixed a priori by the application
► Search for the best partition for different K > 1; look for a
break in the decrease of Jw(K)
► Constrain the density and/or volume of clusters
► Use criteria to evaluate clusterings:
► Compute a clustering for each K = 1, ..., Kmax
► Compute the criterion J(K)
► Choose K* as the K with the best criterion value


Elbow Method for Choosing k

(figure: WSSE vs. k curve — the "elbow" suggests the value for k
should be 3)
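The elbow curve can be computed by fitting k-means for a range of k and recording the WSSE (toy data with three true blobs, so the drop flattens after k = 3):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 4, 8)])

# WSSE (inertia) for each candidate k; the "elbow" is where the
# curve stops dropping sharply
wsse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in range(1, 7)]
```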


K-Means: Discussion

► Jw decreases at each iteration
► It converges towards a local minimum of Jw
► Quick convergence
► Initialisation of µk:
► Randomly within the domain of the xi
► Randomly pick K samples among X
► Different initializations lead to different clusterings


Stopping Criteria

When to stop iterating?
• No changes to the centroids
• The number of samples changing clusters is below a threshold


Some Criteria

(two slides of criterion formulas; content lost in extraction)
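The original criterion formulas did not survive the extraction; one commonly used criterion (chosen here as an example, not necessarily the one on the slide) is the silhouette score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(40, 2)),
               rng.normal(5, 0.3, size=(40, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Silhouette: in [-1, 1]; higher means tighter, better separated clusters
score = silhouette_score(X, labels)
```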


Interpreting Results
• Examine cluster centroids
• How are clusters different?

(figure: three centroids marked X — compare centroids to see how the
clusters differ)


K-Means Summary

• Classic algorithm for cluster analysis


• Simple to understand and implement, and efficient
• The value of k must be specified
• Final clusters are sensitive to the initial centroids

