
SESSION 4: Unsupervised learning

BIG DATA ANALYTICS

Hierarchical Clustering

K-means Clustering

Jeroen VK Rombouts
Distance
Imagine 2 groups, A and B. Where does the new individual belong?

§ A?
§ B?
§ …

[Figure: scatter plot of Size (1.5 m to 2 m) against Weight (90 kg to 100 kg), showing the average value of group A, the average value of group B, and the new individual between them.]

Choice of distance

• Euclidean: $d(\mu, p) = \sum_i (\mu_i - p_i)^2$

– Very easy, but the data has to be prepared (e.g. standardized) first

• Mahalanobis (generalization): $d(\mu, p) = (\mu - p)^T S^{-1} (\mu - p)$

– With $S$ the covariance matrix

– Automatic, but does not solve the issue of different group variances

• Sebestyen: $d(\mu_g, p) = (\mu_g - p)^T S_g^{-1} (\mu_g - p)$

– With $S_g$ the covariance matrix of group $g$

[Figure: two groups of points with centroids μ1 and μ2 and a new point p between them; the two groups have visibly different spreads.]

p is closer to the centroid of group 1 than to the centroid of group 2. However, the different spread of the two groups shows that p is more logically considered to belong to group 2.

The Sebestyen distance, which considers the k distances from the centroids, each weighted by the inverse of the variability of the corresponding group, solves this problem (see the sketch below).
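A minimal sketch of this effect in R (the language suggested by the course's BDA_RSCRIPT notebooks), on simulated data; the two groups, their spreads, and the point p are all hypothetical:

```r
set.seed(1)
g1 <- cbind(rnorm(100, 0, 0.5), rnorm(100, 0, 0.5))  # tight group 1
g2 <- cbind(rnorm(100, 3, 2.0), rnorm(100, 0, 2.0))  # spread-out group 2
p  <- c(1.2, 0)                                      # new point between them
mu1 <- colMeans(g1); mu2 <- colMeans(g2)

# Squared Euclidean distances: p looks closer to group 1
sum((p - mu1)^2); sum((p - mu2)^2)

# Sebestyen-style distances: Mahalanobis with each group's own covariance
mahalanobis(p, mu1, cov(g1))  # large: p is far in units of group 1's spread
mahalanobis(p, mu2, cov(g2))  # small: p plausibly belongs to group 2
```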
Clustering Definition

• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that

– Data points in one cluster are more similar to one another.

– Data points in separate clusters are less similar to one another.

• Similarity measures:

– Mostly used: Euclidean distance, if the attributes are continuous

– Mahalanobis, Sebestyen, or another measure such as the Levenshtein distance for text analytics? (See the sketch below.)
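As a small illustration of a text-oriented similarity measure, base R's adist() computes the Levenshtein (edit) distance; the example strings are hypothetical:

```r
# Levenshtein (edit) distance between strings, via base R's adist()
adist("kmeans", "k-means")             # 1: one character inserted
adist("segmentation", "segmentacion")  # 1: one character substituted
```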

Clustering Application: Market Segmentation

• Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

• Approach:

– Collect different attributes of customers based on their activities, profitability, or other business-related information.

– Find clusters of similar customers.

– Measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those in different clusters.

Example: customer profitability segmentation in banking, CLV vs. Relationship Intensity (from a real project in the banking sector)

[Figure: map of Customer Lifetime Value against Relationship Intensity (based on the share-of-wallet), with five labelled segments: customers to retain or nurture intensively; customers to acquire aggressively; customers to retain or nurture; margin improvement needed; potential leads.]
Building a Typology of the Statistical Units

[Figure: two scatter plots of points marked o, *, and +. Left: data clearly structured in three groups. Right: a real situation where the groups are close to each other (if not partially overlapping).]

Hierarchical Clustering
BIG DATA ANALYTICS

Jeroen VK Rombouts
Hierarchical (agglomerative) Clustering: intuition

Initial step

Each individual forms its own cluster. We group together the two nearest individuals.

Successive steps

At each step, we group together the two clusters Gi and Gj minimising the Ward criterion d(Gi, Gj), as in the sketch below.
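A minimal sketch in R (the course code is in BDA_RSCRIPT_Data_Clustering.ipynb; the data set here is a stand-in):

```r
# Agglomerative clustering with the Ward criterion on standardized numeric data
X  <- scale(iris[, 1:4])                   # stand-in data; standardize first
hc <- hclust(dist(X), method = "ward.D2")  # Ward's minimum-variance criterion
plot(hc)                                   # dendrogram: choose the cutting level
groups <- cutree(hc, k = 4)                # cut the tree into 4 clusters
table(groups)                              # cluster sizes
```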

How many clusters? Using a dendrogram

2 groups? 3 groups? 4 groups? 19 groups?

[Figure: dendrogram with a horizontal line marking the chosen "cutting" level; the branches it crosses define the clusters.]

Ex: segmenting the respondents based on preferences (see the sketch below)
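One way to inspect candidate cuts on the dendrogram hc built above:

```r
plot(hc)                                # full dendrogram
rect.hclust(hc, k = 4)                  # boxes around a 4-cluster cut
rect.hclust(hc, k = 2, border = "red")  # compare with a 2-cluster cut
```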

Selecting the number of clusters with the dendrogram
§ CODE: BDA_RSCRIPT_Data_Clustering.ipynb

Characterize the groups: cluster interpretation

• Group 1: mostly BBP and Palm

• Group 2: Nokia

• Group 3: BBP

• Group 4: Sidekick

Visualize if possible, e.g. BBP vs. Sidekick

Group 1: mostly BBP and Palm

Group 2: Nokia

Group 3: BBP

Group 4: Sidekick

Quality of the typology in K classes

• The sum of squares explained by the typology in K classes is equal to the total sum of squares minus the intra-class sum of squares of the typology in K classes.

• From a statistical perspective, the quality of the typology is measured by the share of the total sum of squares it explains (see the sketch below).

• From a business perspective, the quality of the typology is assessed by how clear and relevant the resulting typology is.
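A minimal sketch of this computation in R, reusing X and groups from the hierarchical-clustering example above:

```r
# Share of the total sum of squares explained by the typology
tss <- sum(scale(X, scale = FALSE)^2)      # total sum of squares
wss <- sum(sapply(unique(groups), function(g) {
  sum(scale(X[groups == g, , drop = FALSE], scale = FALSE)^2)
}))                                        # intra-class (within) sum of squares
(tss - wss) / tss                          # explained share, between 0 and 1
```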

K-means Clustering
BIG DATA ANALYTICS

Jeroen VK Rombouts
K-means clustering

• Hierarchical clustering does not work on big data sets, since the algorithm is computationally intensive: O(n²)

• K-means clustering works on big data, since its computing-time complexity is linear, i.e. O(n)

• K-means clustering requires selecting the number of clusters K

• Algorithm: K-means for K clusters (a minimal sketch follows the list)

– 1 Randomly assign each observation to a cluster

– 2 For each cluster, compute the centroid (average)

– 3 Re-assign each observation to the closest centroid

– 4 Repeat steps 2 and 3 until no observation changes cluster anymore
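A minimal sketch in base R, assuming a numeric matrix X as above and K = 4:

```r
# K-means; nstart repeats the random initialization and keeps the best fit
set.seed(42)
km <- kmeans(X, centers = 4, nstart = 25)
km$centers       # the K centroids
km$cluster       # cluster assignment of each observation
km$tot.withinss  # the loss: total within-cluster sum of squares
```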
K-means clustering loss function

• Criterion function to gauge the quality of fit and to determine the number of clusters:

– Loss $= \sum_{i=1}^{n} \sum_{k=1}^{K} I_{ik} \, \lVert x_i - \mu_k \rVert^2$

with $I_{ik}$ an indicator function equal to 1 if data point $x_i$ is assigned to cluster $k$ and 0 otherwise

– Elbow plot to determine the number of clusters (see the sketch below)
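A minimal elbow-plot sketch, again assuming the numeric matrix X:

```r
# Loss for K = 1..10; look for the bend where it stops dropping sharply
wss <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```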

K-means clustering: Free public wifi New York

• Code: “BDA_RSCRIPT_Data_Clustering_NewYork.ipynb”

• Data file: “NYC_Free_Public_WiFi_03292017.csv”
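A minimal sketch of how such an analysis might start; the latitude/longitude column names LAT and LON are an assumption, not confirmed by the slides:

```r
# K-means on the locations of free public wifi hotspots in New York
wifi   <- read.csv("NYC_Free_Public_WiFi_03292017.csv")
coords <- na.omit(wifi[, c("LAT", "LON")])   # assumed column names
km     <- kmeans(coords, centers = 5, nstart = 25)
plot(coords$LON, coords$LAT, col = km$cluster, pch = 20,
     xlab = "Longitude", ylab = "Latitude")
```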

Gaussian Mixtures
BIG DATA ANALYTICS

Clustering can also be done with Gaussian mixtures, which can be estimated with maximum likelihood techniques.
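A minimal sketch using the third-party mclust package (an assumption: the slides do not name a library), which fits Gaussian mixtures by maximum likelihood (EM) and selects the number of components by BIC:

```r
# Model-based clustering with Gaussian mixtures (package choice: mclust)
library(mclust)
gm <- Mclust(X, G = 1:5)   # fit mixtures with 1 to 5 components, choose by BIC
summary(gm)                # selected model and number of components
head(gm$classification)    # most probable component for each observation
```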

Jeroen VK Rombouts
