Unsupervised Learning
Hierarchical Clustering
K-means Clustering
Jeroen VK Rombouts
Distance
Imagine two groups, A and B. To which group does the new individual belong?
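[Figure: groups A and B and a new individual, plotted on a weight axis (90 kg to 100 kg) and an axis in metres (1.5 m to 2 m). Does the new individual belong to A, to B, or elsewhere?]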
Choice of distance
¡ Euclidean: $d(\mu, p) = \sum_i [\mu_i - p_i]^2$
– Automatic but does not solve the issue of different group variances
¡ Sebestyen: $d(\mu_g, p) = (\mu_g - p)^T S_g^{-1} (\mu_g - p)$
¡ Similarity Measures:
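
To make the Euclidean and Sebestyen distances above concrete, here is a minimal sketch in Python with NumPy. It assumes that $S_g$ is the covariance matrix of group $g$. The group means, covariances and the new individual are hypothetical values chosen for illustration, not data from the course.

```python
import numpy as np

def squared_euclidean(mu, p):
    """Sum of squared coordinate differences between a group mean and a point."""
    diff = np.asarray(mu, dtype=float) - np.asarray(p, dtype=float)
    return float(diff @ diff)

def sebestyen(mu_g, S_g, p):
    """(mu_g - p)^T S_g^{-1} (mu_g - p); S_g is assumed to be the covariance of group g."""
    diff = np.asarray(mu_g, dtype=float) - np.asarray(p, dtype=float)
    # Solve S_g x = diff rather than forming the explicit inverse.
    return float(diff @ np.linalg.solve(S_g, diff))

# Hypothetical groups described by (height in m, weight in kg).
mu_A, S_A = np.array([1.6, 70.0]), np.array([[0.01, 0.0], [0.0, 25.0]])
mu_B, S_B = np.array([1.9, 100.0]), np.array([[0.01, 0.0], [0.0, 400.0]])
new = np.array([1.75, 85.0])

for name, mu, S in [("A", mu_A, S_A), ("B", mu_B, S_B)]:
    print(name, squared_euclidean(mu, new), sebestyen(mu, S, new))
```

In this made-up example the new individual is equally far from both means in Euclidean terms, while the Sebestyen distance favours B because B has the larger weight variance. This is the issue of different group variances that the Euclidean distance does not solve.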
Clustering Application: Market Segmentation
¡ Approach:
Example of customer profitability segmentation in banking: CLV vs. Relationship Intensity
(From real project, Banking sector)
[Figure: customers plotted by CLV against Relationship Intensity (based on the Share-of-Wallet), with regions labelled "Customers to retain or nurture", "Margin improvement needed" and "Potential leads".]
Building a Typology of the Statistical Units
[Figure: scatter plot of the statistical units, drawn with three marker types (o, * and +) that form three groups.]
Hierarchical Clustering
BIG DATA ANALYTICS
Hierarchical (agglomerative) Clustering: intuition
Initial Step
Successive Steps
At each step, we group together the two clusters Gi and Gj minimising
the Ward criterion d(Gi, Gj).
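
The initial and successive steps above can be sketched as a naive loop. This is an illustration only, assuming the usual form of the Ward criterion: the increase in within-cluster sum of squares caused by a merge, $d(G_i, G_j) = \frac{n_i n_j}{n_i + n_j} \lVert c_i - c_j \rVert^2$, with $c_i$ the centroid and $n_i$ the size of $G_i$. The function names and the toy data are hypothetical.

```python
import numpy as np

def ward_cost(X, a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    ca, cb = X[a].mean(axis=0), X[b].mean(axis=0)
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * float(np.sum((ca - cb) ** 2))

def agglomerate(X, n_clusters):
    """Start from singletons and repeatedly merge the pair with the smallest Ward cost."""
    clusters = [[i] for i in range(len(X))]      # initial step: one cluster per point
    while len(clusters) > n_clusters:            # successive steps
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: ward_cost(X, clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters

# Toy data: two visibly separate groups of three points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
groups = agglomerate(X, 2)
print(groups)
```

The loop re-evaluates every pair of clusters at every step, which is easy to read but slow. Library implementations update the pairwise costs incrementally instead.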
How many clusters? Using a dendrogram?
[Figure: dendrogram with candidate cuts giving 2, 3, 4 or 19 groups.]
Choosing the “cutting” level defines the clusters.
Ex: Segmenting the respondents based on preferences
Selecting the number of clusters with the dendrogram
§ CODE: BDA_RSCRIPT_Data_Clustering.ipynb
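
As a rough illustration of cutting a dendrogram (this is not the content of the notebook named above), the sketch below uses SciPy's `linkage` and `fcluster` on hypothetical simulated data with three well-separated groups. The data, the seed and the variable names are assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Hypothetical data: three well-separated blobs of 20 points in 2 dimensions.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Merge history under the Ward criterion; column 2 holds the merge heights.
Z = linkage(X, method="ward")

# Option 1: ask directly for at most 3 clusters.
labels_k = fcluster(Z, t=3, criterion="maxclust")

# Option 2: choose a cutting level just below the last two merges.
heights = Z[:, 2]
cut = (heights[-3] + heights[-2]) / 2
labels_h = fcluster(Z, t=cut, criterion="distance")

print(len(set(labels_k)), len(set(labels_h)))
```

A large jump between successive merge heights suggests a natural cutting level. `scipy.cluster.hierarchy.dendrogram(Z)` draws the tree when matplotlib is available.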
Characterize the groups: cluster interpretation
¡ Group 2: Nokia
¡ Group 3: BBP
¡ Group 4: Sidekick
Visualize if possible, e.g. BBP vs. Sidekick
Group 2: Nokia
Group 3: BBP
Group 4: Sidekick
Quality of the typology in K classes
K-means Clustering
K-means clustering
¡ Hierarchical clustering does not scale to big data sets, since the
algorithm is computationally intensive: $O(n^2)$
With $I_{ik}$ an indicator function equal to 1 if the data point $x_i$ is
assigned to cluster $k$ and 0 otherwise
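
The indicator $I_{ik}$ can be made concrete with a short sketch of the standard K-means iteration (Lloyd's algorithm). The objective is assumed here to take the usual form $J = \sum_i \sum_k I_{ik} \lVert x_i - \mu_k \rVert^2$, with $\mu_k$ the centre of cluster $k$; the function names, the random initialisation and the toy data are hypothetical choices for illustration.

```python
import numpy as np

def assign(X, centers):
    """Return squared distances (n, k) and the 0/1 indicator matrix I (n, k)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    I = np.eye(len(centers))[d2.argmin(axis=1)]  # I[i, k] = 1 if x_i is in cluster k
    return d2, I

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise the centres with k distinct data points picked at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        _, I = assign(X, centers)                # assignment step
        counts = I.sum(axis=0)
        new_centers = centers.copy()
        filled = counts > 0                      # an empty cluster keeps its old centre
        new_centers[filled] = (I.T @ X)[filled] / counts[filled, None]  # update step
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2, I = assign(X, centers)
    labels = I.argmax(axis=1)
    objective = float((I * d2).sum())            # J = sum_i sum_k I_ik ||x_i - mu_k||^2
    return labels, centers, objective

# Toy data: two visibly separate groups of three points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centers, objective = kmeans(X, k=2)
print(labels, objective)
```

Each iteration computes only the n × k distances between points and centres, rather than all pairwise distances between points.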
K-means clustering: Free public wifi New York
¡ code “BDA_RSCRIPT_Data_Clustering_NewYork.ipynb”
¡ datafile “NYC_Free_Public_WiFi_03292017.csv”
Gaussian Mixtures