
SESSION 4: Unsupervised learning

BIG DATA ANALYTICS

Hierarchical Clustering

K-means Clustering

Jeroen VK Rombouts
Distance
Imagine 2 groups, A and B. Where does the new individual belong?

§ A?
§ B?
§ …

[Figure: scatter plot of Size (1.5 m to 2 m) against Weight (90 kg to 100 kg), showing the average value of group A, the average value of group B, and the new individual between them.]

Choice of distance

• Euclidean: $d(\mu, p) = \sum_i (\mu_i - p_i)^2$

– Very easy, but the data has to be prepared (e.g. standardized) first

• Mahalanobis (generalization): $d(\mu, p) = (\mu - p)^T S^{-1} (\mu - p)$

– With $S$ the covariance matrix

– Automatic, but does not solve the issue of different group variances

• Sebestyen: $d(\mu_g, p) = (\mu_g - p)^T S_g^{-1} (\mu_g - p)$

– With $S_g$ the covariance matrix of group $g$

[Figure: two groups of points with centroids μ1 and μ2 and a new point p between them; the two groups have visibly different spreads.]

p is closer to the centroid of group 1 than to the centroid of group 2. However, the different spread of the two groups shows that p is more logically considered to belong to group 2.

The Sebestyen distance, which considers the k distances from the centroids, each weighted by the inverse of the variability of the corresponding group, solves this problem (see the sketch below).
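A minimal sketch of this effect in R (the language suggested by the course's BDA_RSCRIPT notebooks), on simulated data; the two groups, their spreads, and the point p are all hypothetical:

```r
set.seed(1)
g1 <- cbind(rnorm(100, 0, 0.5), rnorm(100, 0, 0.5))  # tight group 1
g2 <- cbind(rnorm(100, 3, 2.0), rnorm(100, 0, 2.0))  # spread-out group 2
p  <- c(1.2, 0)                                      # new point between them
mu1 <- colMeans(g1); mu2 <- colMeans(g2)

# Squared Euclidean distances: p looks closer to group 1
sum((p - mu1)^2); sum((p - mu2)^2)

# Sebestyen-style distances: Mahalanobis with each group's own covariance
mahalanobis(p, mu1, cov(g1))  # large: p is far in units of group 1's spread
mahalanobis(p, mu2, cov(g2))  # small: p plausibly belongs to group 2
```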
Clustering Definition

• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that

– Data points in one cluster are more similar to one another.

– Data points in separate clusters are less similar to one another.

• Similarity measures:

– Mostly used: Euclidean distance, if the attributes are continuous

– Mahalanobis, Sebestyen, or another measure such as the Levenshtein distance for text analytics? (See the sketch below.)
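As a small illustration of a text-oriented similarity measure, base R's adist() computes the Levenshtein (edit) distance; the example strings are hypothetical:

```r
# Levenshtein (edit) distance between strings, via base R's adist()
adist("kmeans", "k-means")             # 1: one character inserted
adist("segmentation", "segmentacion")  # 1: one character substituted
```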

Clustering Application: Market Segmentation

• Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

• Approach:

– Collect different attributes of customers based on their activities, profitability, or other business-related information.

– Find clusters of similar customers.

– Measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those in different clusters.

Example: customer profitability segmentation in banking, CLV vs. Relationship Intensity (from a real project in the banking sector)

[Figure: map of Customer Lifetime Value against Relationship Intensity (based on the share-of-wallet), with five labelled segments: customers to retain or nurture intensively; customers to acquire aggressively; customers to retain or nurture; margin improvement needed; potential leads.]
Building a Typology of the Statistical Units

[Figure: two scatter plots of points marked o, *, and +. Left: data clearly structured in three groups. Right: a real situation where the groups are close to each other (if not partially overlapping).]

Hierarchical Clustering
BIG DATA ANALYTICS

Jeroen VK Rombouts
Hierarchical (agglomerative) Clustering: intuition

Initial step

Each individual forms its own cluster. We group together the two nearest individuals.

Successive steps

At each step, we group together the two clusters Gi and Gj minimising the Ward criterion d(Gi, Gj), as in the sketch below.
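A minimal sketch in R (the course code is in BDA_RSCRIPT_Data_Clustering.ipynb; the data set here is a stand-in):

```r
# Agglomerative clustering with the Ward criterion on standardized numeric data
X  <- scale(iris[, 1:4])                   # stand-in data; standardize first
hc <- hclust(dist(X), method = "ward.D2")  # Ward's minimum-variance criterion
plot(hc)                                   # dendrogram: choose the cutting level
groups <- cutree(hc, k = 4)                # cut the tree into 4 clusters
table(groups)                              # cluster sizes
```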

How many clusters? Using a dendrogram

2 groups? 3 groups? 4 groups? 19 groups?

[Figure: dendrogram with a horizontal line marking the chosen "cutting" level; the branches it crosses define the clusters.]

Ex: segmenting the respondents based on preferences (see the sketch below)
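One way to inspect candidate cuts on the dendrogram hc built above:

```r
plot(hc)                                # full dendrogram
rect.hclust(hc, k = 4)                  # boxes around a 4-cluster cut
rect.hclust(hc, k = 2, border = "red")  # compare with a 2-cluster cut
```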

Selecting the number of clusters with the dendrogram
§ CODE: BDA_RSCRIPT_Data_Clustering.ipynb

Characterize the groups: cluster interpretation

• Group 1: mostly BBP and Palm

• Group 2: Nokia

• Group 3: BBP

• Group 4: Sidekick

Visualize if possible, e.g. BBP vs. Sidekick

Group 1: mostly BBP and Palm

Group 2: Nokia

Group 3: BBP

Group 4: Sidekick

Quality of the typology in K classes

• The sum of squares explained by the typology in K classes is equal to the total sum of squares minus the intra-class sum of squares of the typology in K classes.

• From a statistical perspective, the quality of the typology is measured by the share of the total sum of squares it explains (see the sketch below).

• From a business perspective, the quality of the typology is assessed by how clear and relevant the resulting typology is.
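A minimal sketch of this computation in R, reusing X and groups from the hierarchical-clustering example above:

```r
# Share of the total sum of squares explained by the typology
tss <- sum(scale(X, scale = FALSE)^2)      # total sum of squares
wss <- sum(sapply(unique(groups), function(g) {
  sum(scale(X[groups == g, , drop = FALSE], scale = FALSE)^2)
}))                                        # intra-class (within) sum of squares
(tss - wss) / tss                          # explained share, between 0 and 1
```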

K-means Clustering
BIG DATA ANALYTICS

Jeroen VK Rombouts
K-means clustering

• Hierarchical clustering does not work on big data sets, since the algorithm is computationally intensive: O(n²)

• K-means clustering works on big data, since its computing-time complexity is linear, i.e. O(n)

• K-means clustering requires selecting the number of clusters K

• Algorithm: K-means for K clusters (a minimal sketch follows the list)

– 1 Randomly assign each observation to a cluster

– 2 For each cluster, compute the centroid (average)

– 3 Re-assign each observation to the closest centroid

– 4 Repeat steps 2 and 3 until no observation changes cluster anymore
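A minimal sketch in base R, assuming a numeric matrix X as above and K = 4:

```r
# K-means; nstart repeats the random initialization and keeps the best fit
set.seed(42)
km <- kmeans(X, centers = 4, nstart = 25)
km$centers       # the K centroids
km$cluster       # cluster assignment of each observation
km$tot.withinss  # the loss: total within-cluster sum of squares
```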
K-means clustering loss function

• Criterion function to gauge the quality of fit and to determine the number of clusters:

– Loss $= \sum_{i=1}^{n} \sum_{k=1}^{K} I_{ik} \, \lVert x_i - \mu_k \rVert^2$

with $I_{ik}$ an indicator function equal to 1 if data point $x_i$ is assigned to cluster $k$ and 0 otherwise

– Elbow plot to determine the number of clusters (see the sketch below)
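A minimal elbow-plot sketch, again assuming the numeric matrix X:

```r
# Loss for K = 1..10; look for the bend where it stops dropping sharply
wss <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```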

K-means clustering: Free public wifi New York

• Code: “BDA_RSCRIPT_Data_Clustering_NewYork.ipynb”

• Data file: “NYC_Free_Public_WiFi_03292017.csv”
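A minimal sketch of how such an analysis might start; the latitude/longitude column names LAT and LON are an assumption, not confirmed by the slides:

```r
# K-means on the locations of free public wifi hotspots in New York
wifi   <- read.csv("NYC_Free_Public_WiFi_03292017.csv")
coords <- na.omit(wifi[, c("LAT", "LON")])   # assumed column names
km     <- kmeans(coords, centers = 5, nstart = 25)
plot(coords$LON, coords$LAT, col = km$cluster, pch = 20,
     xlab = "Longitude", ylab = "Latitude")
```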

Gaussian Mixtures
BIG DATA ANALYTICS

Clustering can also be done with Gaussian mixtures, which can be estimated with maximum likelihood techniques.
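A minimal sketch using the third-party mclust package (an assumption: the slides do not name a library), which fits Gaussian mixtures by maximum likelihood (EM) and selects the number of components by BIC:

```r
# Model-based clustering with Gaussian mixtures (package choice: mclust)
library(mclust)
gm <- Mclust(X, G = 1:5)   # fit mixtures with 1 to 5 components, choose by BIC
summary(gm)                # selected model and number of components
head(gm$classification)    # most probable component for each observation
```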

Jeroen VK Rombouts
