Data Analytics
Algorithms
Topic 4
Sulfeeza Mohd Drus, CISB474
Topic Outline
Unsupervised Machine Learning algorithms
Association Rules
Clustering
1. Partitioning
2. Hierarchical
Source: Baesens
Clustering
Grouping data into small groups based on similarity
such that data in the same group (cluster) are as similar
as possible and data in different groups are as different
as possible.
Helps users understand the natural grouping or
structure in a data set. Used either as a stand-alone
tool to gain insight into the data distribution or as a
preprocessing step for other algorithms.
Source: Wikipedia, Stefanowski (2008)
Clustering
Example: How do we want to group these fruits?
Clustering
a) Grouping based on colour
Clustering
b) Grouping based on shape
Clustering
Other examples:
Cluster customers based on their purchase histories, so
that a targeted marketing program can be developed
Cluster products based on the sets of customers who
purchased them
Cluster documents based on similar words
Cluster DNA sequences based on edit distance
Source: Wikipedia
Clustering
A good clustering method produces high-quality
clusters with:
High intra-cluster similarity
Low inter-cluster similarity
[Figure: clusters plotted against Age, Income and Education; inter-cluster distance is maximized, intra-cluster distance is minimized]
Clustering
The quality of a clustering result depends on
the similarity measure used
implementation of the similarity measure
The quality of a clustering method is also measured by
its ability to discover some or all of the hidden
patterns
Clustering
Steps to perform cluster analysis:
Step 1:
Formulate the problem: decide on the clustering variables
Step 2:
Decide on the clustering procedure
Step 3:
Decide on the number of clusters
Step 4:
Validate and Interpret cluster solution
Source: Mooi & Sarstedt, 2011
Clustering
Steps to perform cluster analysis:
Step 1:
Formulate the problem: decide on the clustering variables
The objective of this step is:
To select variables that could provide a clear-cut
differentiation between segments/groups regarding a
specific managerial objective
Source: Mooi & Sarstedt, 2011
Clustering
Steps to perform cluster analysis:
Step 1:
Formulate the problem: decide on the clustering variables
Types and examples of clustering variables
                                   General                               Specific
Observable (directly measurable)   Cultural, geographic, demographic,    User status, usage frequency,
                                   socio-economic                        store and brand loyalty
Unobservable (inferred)            Psychographics, values, personality,  Benefits, perceptions, attitudes,
                                   lifestyle                             intentions, preferences
Source: Mooi & Sarstedt, 2011
Clustering
Steps to perform cluster analysis:
Step 2:
Decide on the clustering procedure
Option 1: Hierarchical methods
Option 2: Partitioning methods
Source: Mooi & Sarstedt, 2011
Clustering
Hierarchical methods
Agglomerative: Each object starts in its own separate cluster. The two
closest (most similar) clusters are then combined, and this is repeated
until all objects are in one cluster.
Divisive: All objects start in the same cluster, which is gradually split
up until each object is in its own individual cluster.
Source: Mooi & Sarstedt, 2011
Clustering
Steps in hierarchical methods:
1. Measure of similarity/dissimilarity
Euclidean distance
Manhattan distance
[Figure: distance between Point A and Point B]
Source: Mooi & Sarstedt, 2011
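The two distance measures can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original slides: the function names `euclidean` and `manhattan` and the two sample points are assumptions chosen for the example.

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of the absolute differences per coordinate."""
    return sum(abs(a - b) for a, b in zip(p, q))

point_a = (1, 2)  # hypothetical Point A
point_b = (4, 6)  # hypothetical Point B

print(euclidean(point_a, point_b))  # 5.0
print(manhattan(point_a, point_b))  # 7
```

For the same pair of points the Manhattan distance is never smaller than the Euclidean distance, because it follows the axes instead of the straight line.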
Clustering
Let's do an exercise:
Calculate the Euclidean distances between the following data points:
A1 = (2,10)
A2 = (2,5)
A3 = (8,4)
A4 = (5,8)
A5 = (7,5)
A6 = (6,4)
A7 = (1,2)
A8 = (4,9)
Source: Mooi & Sarstedt, 2011
Clustering
To calculate the Euclidean distance from A1 to A2:
$$
\begin{aligned}
d(A_1, A_2) &= \sqrt{(2-2)^2 + (10-5)^2} \\
            &= \sqrt{0^2 + 5^2} \\
            &= \sqrt{25} \\
            &= 5
\end{aligned}
$$
Source: Mooi & Sarstedt, 2011
Clustering
Euclidean distance matrix
      A1    A2    A3    A4    A5    A6    A7    A8
A1    0     5     ?     ?     ?     ?     ?     ?
A2
A3
A4
A5
A6
A7
A8
Source: Mooi & Sarstedt, 2011
Clustering
Euclidean distance matrix
      A1    A2    A3    A4    A5    A6    A7    A8
A1    0     5     8.5   3.6   7.1   7.2   8.1   2.2
A2          0     6.1   4.2   5     4.1   3.2   4.5
A3                0     5     1.4   2     7.3   6.4
A4                      0     3.6   4.1   7.2   1.4
A5                            0     1.4   6.7   5
A6                                  0     5.4   5.4
A7                                        0     7.6
A8                                              0
Source: Mooi & Sarstedt, 2011
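As a check on the exercise, the matrix can be recomputed from the eight coordinates. The sketch below assumes the points A1 to A8 exactly as listed in the exercise and rounds to one decimal place; the names `points`, `euclidean` and `dist` are illustrative only.

```python
import math

points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}

def euclidean(p, q):
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

names = list(points)

# dist[(a, b)] holds the rounded distance for every pair with a listed before b.
dist = {
    (a, b): round(euclidean(points[a], points[b]), 1)
    for i, a in enumerate(names)
    for b in names[i + 1:]
}

# Print the upper triangle: the diagonal 0 followed by the remaining entries.
for i, a in enumerate(names):
    row = [f"{dist[(a, b)]:g}" for b in names[i + 1:]]
    print(a, "0", *row)
```

Only the upper triangle is stored because the Euclidean distance is symmetric, so d(A1, A2) equals d(A2, A1).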
Clustering
2. Choose clustering algorithm
Single linkage
The distance between two clusters is defined as the smallest distance
between any member of one cluster and any member of the other, i.e. the
distance between their two closest members.
Complete linkage
This method is much like single linkage, but instead of the minimum
distance, it uses the maximum distance between members of the two clusters.
Average linkage
The distance between two clusters is defined as the average distance between
all pairs of members of the two clusters.
Source: Mooi & Sarstedt, 2011
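The three linkage rules differ only in how they summarise the pairwise distances between two clusters. The following is a minimal sketch, assuming Euclidean distance and using the exercise points A4, A8 and A5, A6 as two example clusters; the function names are hypothetical.

```python
import math
from itertools import product

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pair_distances(cluster_a, cluster_b):
    """All distances between a member of cluster_a and a member of cluster_b."""
    return [euclidean(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    return min(pair_distances(cluster_a, cluster_b))   # two closest members

def complete_linkage(cluster_a, cluster_b):
    return max(pair_distances(cluster_a, cluster_b))   # two farthest members

def average_linkage(cluster_a, cluster_b):
    d = pair_distances(cluster_a, cluster_b)
    return sum(d) / len(d)                             # mean over all pairs

cluster_48 = [(5, 8), (4, 9)]  # A4, A8
cluster_56 = [(7, 5), (6, 4)]  # A5, A6

for rule in (single_linkage, complete_linkage, average_linkage):
    print(rule.__name__, round(rule(cluster_48, cluster_56), 1))
```

For these two clusters single linkage gives about 3.6, complete linkage about 5.4 and average linkage about 4.5, so the choice of rule changes which clusters look closest.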
Clustering
2. Perform hierarchical clustering on the sample data set using the
single linkage algorithm
      A1    A2    A3    A4    A5    A6    A7    A8
A1    0     5     8.5   3.6   7.1   7.2   8.1   2.2
A2          0     6.1   4.2   5     4.1   3.2   4.5
A3                0     5     1.4   2     7.3   6.4
A4                      0     3.6   4.1   7.2   1.4
A5                            0     1.4   6.7   5
A6                                  0     5.4   5.4
A7                                        0     7.6
A8                                              0
Source: Mooi & Sarstedt, 2011
Clustering
Initial state: 8 clusters
      A1    A2    A3    A4    A5    A6    A7    A8
A1    0     5     8.5   3.6   7.1   7.2   8.1   2.2
A2          0     6.1   4.2   5     4.1   3.2   4.5
A3                0     5     1.4   2     7.3   6.4
A4                      0     3.6   4.1   7.2   1.4
A5                            0     1.4   6.7   5
A6                                  0     5.4   5.4
A7                                        0     7.6
A8                                              0
Source: Mooi & Sarstedt, 2011
Clustering
1st clustering cycle
A4 and A8 merge, and A5 and A6 merge (each pair at distance 1.4).
        A1      A2      A3      A4, A8  A5, A6  A7
A1      0       5       8.5     2.2     7.1     8.1
A2              0       6.1     4.2     4.1     3.2
A3                      0       5       1.4     7.3
A4, A8                          0       3.6     7.2
A5, A6                                  0       5.4
A7                                              0
[Figure: dendrogram after the 1st cycle; leaves shown: A4, A8, A5, A6, A3; distance axis from 1.0 to 5.0]
Source: Mooi & Sarstedt, 2011
Clustering
2nd clustering cycle
A3 joins the cluster A5, A6 (distance 1.4).
            A1          A2          A4, A8      A3, A5, A6  A7
A1          0           5           2.2         7.1         8.1
A2                      0           4.2         4.1         3.2
A4, A8                              0           3.6         7.2
A3, A5, A6                                      0           5.4
A7                                                          0
[Figure: dendrogram after the 2nd cycle; leaves shown: A4, A8, A5, A6, A3, A1; distance axis from 1.0 to 5.0]
Source: Mooi & Sarstedt, 2011
Clustering
3rd clustering cycle
A1 joins the cluster A4, A8 (distance 2.2).
            A2          A1, A4, A8  A3, A5, A6  A7
A2          0           4.2         4.1         3.2
A1, A4, A8              0           3.6         7.2
A3, A5, A6                          0           5.4
A7                                              0
[Figure: dendrogram after the 3rd cycle; leaves shown: A4, A8, A5, A6, A3, A1, A2, A7; distance axis from 1.0 to 5.0]
Source: Mooi & Sarstedt, 2011
Clustering
4th clustering cycle
A2 and A7 merge (distance 3.2).
            A2, A7      A1, A4, A8  A3, A5, A6
A2, A7      0           4.2         4.1
A1, A4, A8              0           3.6
A3, A5, A6                          0
[Figure: dendrogram after the 4th cycle; leaves shown: A4, A8, A5, A6, A3, A1, A2, A7; distance axis from 1.0 to 5.0]
Source: Mooi & Sarstedt, 2011
Clustering
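5th clustering cycle
The clusters A1, A4, A8 and A3, A5, A6 merge (distance 3.6). The two remaining clusters are 4.1 apart, which is the height of the final merge.
                        A2, A7      A1, A4, A8, A3, A5, A6
A2, A7                  0           4.1
A1, A4, A8, A3, A5, A6              0
[Figure: dendrogram after the 5th cycle; leaves shown: A4, A8, A5, A6, A3, A1, A2, A7; distance axis from 1.0 to 5.0]
Source: Mooi & Sarstedt, 2011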
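The clustering cycles can be reproduced with a short agglomerative loop. This is a sketch under stated assumptions, not the procedure of any particular library: it uses the exercise coordinates, Euclidean distance and single linkage, and it merges exactly one pair per pass, so the tied merges at distance 1.4 show up as separate steps instead of one cycle. The function name `agglomerate` is hypothetical.

```python
import math
from itertools import combinations

points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(c1, c2):
    """Smallest distance between a member of c1 and a member of c2."""
    return min(euclidean(points[a], points[b]) for a in c1 for b in c2)

def agglomerate(names):
    """Merge the two closest clusters repeatedly; return the merge history."""
    clusters = [(n,) for n in names]          # every point starts alone
    history = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: single_linkage(*pair))
        height = single_linkage(c1, c2)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        history.append((c1, c2, round(height, 1)))
    return history

history = agglomerate(list(points))
for c1, c2, height in history:
    print(f"merge {c1} + {c2} at distance {height}")
```

Stopping the loop when three clusters remain gives the groups A1, A4, A8; A3, A5, A6; and A2, A7.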
Clustering
Steps to perform cluster analysis:
Partitioning methods
The most common one is k-means.
However, for partitioning methods, we first need to
determine the initial clusters.
We will get back to k-means later.
Source: Mooi & Sarstedt, 2011
Clustering
Steps to perform cluster analysis:
Step 3:
Decide on the number of clusters
The main objective is:
To achieve maximum between-cluster (inter-cluster) variance and minimum
within-cluster (intra-cluster) variance
One method to determine the number of clusters is
the elbow plot
Source: Mooi & Sarstedt, 2011
Clustering
Elbow criterion:
Choose the number of clusters at which adding another cluster
no longer adds sufficient additional information
Source: Derek Kane, Venkatesan (2007)
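One common way to draw an elbow plot is to record the within-cluster sum of squares (WCSS) for each candidate number of clusters and look for the point where the decrease flattens. The sketch below is illustrative only: it assumes the eight exercise points, a basic k-means, and the first k points as starting centroids (a simplification made here so the run is repeatable; other starting points can give different values). The name `kmeans_wcss` is hypothetical.

```python
data = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def kmeans_wcss(data, k, max_iter=100):
    """Run a basic k-means and return the within-cluster sum of squares."""
    centroids = list(data[:k])  # assumption: first k points as starting centroids
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in data:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            groups[nearest].append(p)
        # Keep the old centroid if a cluster happens to end up empty.
        updated = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    return sum(sq_dist(p, centroids[i]) for i, g in enumerate(groups) for p in g)

wcss = {k: kmeans_wcss(data, k) for k in range(1, len(data) + 1)}
for k, value in wcss.items():
    print(k, round(value, 2))
```

Plotting k on the horizontal axis against the printed values gives the elbow plot; the bend is usually judged by eye, so it is a guide, not a strict rule.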
Clustering - revisit
Steps to perform k-means cluster analysis:
1. Choose the number of clusters, k
2. Generate k random points as cluster centroids
3. Assign each point to the nearest cluster centroid
4. Recompute the new cluster centroids
5. Repeat the two previous steps until the convergence
criterion is met
Source: Venkatesan, 2007
Clustering - revisit
Let's say we use the same data set:
      A1    A2    A3    A4    A5    A6    A7    A8
A1    0     5     8.5   3.6   7.1   7.2   8.1   2.2
A2          0     6.1   4.2   5     4.1   3.2   4.5
A3                0     5     1.4   2     7.3   6.4
A4                      0     3.6   4.1   7.2   1.4
A5                            0     1.4   6.7   5
A6                                  0     5.4   5.4
A7                                        0     7.6
A8                                              0
Source: Venkatesan, 2007
Clustering - revisit
Steps to perform k-means cluster analysis:
1. Choose the number of clusters, k
Let's say we start with k=3
2. Generate k random points as cluster centroids
And we choose the following points as the cluster centroids:
A1, A4, A7
Source: Venkatesan, 2007
Clustering - revisit
Then, we need to calculate the distance from each point to each
centroid (seed):
        Seed 1 (A1)   Seed 2 (A4)   Seed 3 (A7)
A1      0             3.6           8.1
A2      5             4.2           3.2
A3      8.5           5             7.3
A4      3.6           0             7.2
A5      7.1           3.6           6.7
A6      7.2           4.1           5.4
A7      8.1           7.2           0
A8      2.2           1.4           7.6
Clustering - revisit
3. Assign each point to the nearest cluster centroid
        Seed 1   Seed 2   Seed 3   Nearest
A1      0        3.6      8.1      Seed 1
A2      5        4.2      3.2      Seed 3
A3      8.5      5        7.3      Seed 2
A4      3.6      0        7.2      Seed 2
A5      7.1      3.6      6.7      Seed 2
A6      7.2      4.1      5.4      Seed 2
A7      8.1      7.2      0        Seed 3
A8      2.2      1.4      7.6      Seed 2
Cluster 1: A1
Cluster 2: A3, A4, A5, A6, A8
Cluster 3: A2, A7
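The assignment step can be sketched as follows, assuming the exercise coordinates and the three starting seeds A1, A4 and A7. The helper name `nearest_seed` is hypothetical.

```python
import math

points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}
seeds = {"Seed 1": points["A1"], "Seed 2": points["A4"], "Seed 3": points["A7"]}

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_seed(p, seeds):
    """Name of the seed with the smallest Euclidean distance to p."""
    return min(seeds, key=lambda name: euclidean(p, seeds[name]))

assignment = {name: nearest_seed(p, seeds) for name, p in points.items()}
for name, seed in assignment.items():
    print(name, "->", seed)
```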
Clustering - revisit
4. Recompute the new cluster centroid
Each new centroid is the mean of the points assigned to it:
Seed 1 = (2, 10)
Seed 2 = ((8+5+7+6+4)/5, (4+8+5+4+9)/5) = (6, 6)
Seed 3 = ((2+1)/2, (5+2)/2) = (1.5, 3.5)
Clustering - revisit
Updated centroids:
Seed 1 = (2, 10)
New Seed 2 = (6, 6)
New Seed 3 = (1.5, 3.5)
[Figure: plot marking Seed 1, New Seed 2 and New Seed 3]
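The centroid update is a coordinate-wise mean. A minimal sketch, assuming the cluster memberships from the first assignment pass (A1 alone; A3, A4, A5, A6, A8; A2, A7); the name `centroid` is illustrative.

```python
clusters = {
    "Seed 1": [(2, 10)],                                 # A1
    "Seed 2": [(8, 4), (5, 8), (7, 5), (6, 4), (4, 9)],  # A3, A4, A5, A6, A8
    "Seed 3": [(2, 5), (1, 2)],                          # A2, A7
}

def centroid(members):
    """Coordinate-wise mean of the points in one cluster."""
    n = len(members)
    return tuple(sum(coord) / n for coord in zip(*members))

new_seeds = {name: centroid(members) for name, members in clusters.items()}
print(new_seeds)
```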
Clustering - revisit
5. Repeat the two previous steps until the convergence
criterion is met
The convergence criterion is met when the assignment of points to clusters
no longer changes between iterations
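Putting the steps together, the loop below alternates assignment and centroid update until the assignments stop changing. It is a sketch under stated assumptions: the exercise coordinates, the starting seeds A1, A4 and A7, and squared Euclidean distance for the comparison (which gives the same nearest centroid as the plain distance). The function name `kmeans` and the `max_iter` safety limit are additions made here.

```python
points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(members):
    return tuple(sum(c) / len(members) for c in zip(*members))

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update until the assignments stop changing."""
    assignment = None
    for iteration in range(1, max_iter + 1):
        new_assignment = {
            name: min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            for name, p in points.items()
        }
        if new_assignment == assignment:   # convergence: no point moved
            return assignment, centroids, iteration
        assignment = new_assignment
        groups = [[points[n] for n, c in assignment.items() if c == i]
                  for i in range(len(centroids))]
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [centroid(g) if g else old for g, old in zip(groups, centroids)]
    return assignment, centroids, max_iter

start = [points["A1"], points["A4"], points["A7"]]
assignment, centroids, iterations = kmeans(points, start)
clusters = [sorted(n for n, c in assignment.items() if c == i) for i in range(3)]
print(clusters, "after", iterations, "assignment passes")
```

With these seeds the assignments stop changing on the fourth pass, ending with the clusters A1, A4, A8; A3, A5, A6; and A2, A7.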
Clustering
Steps to perform cluster analysis:
Step 4:
Validate and Interpret cluster solution
Stability and Validity
Stability is evaluated by using different clustering procedures
on the same data and testing whether these yield the same
results.
E.g., for hierarchical clustering, use different distance
measures
Validity can be evaluated using criterion validity
Source: Mooi & Sarstedt, 2011
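One way to quantify whether two procedures "yield the same results" is the Rand index, the share of point pairs on which two cluster solutions agree. The slides do not name a specific measure, so this choice is an assumption made for illustration, as are the two label lists below (a three-cluster single-linkage cut and the first k-means assignment pass on the exercise data).

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Share of point pairs that both solutions treat the same way."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Cluster labels for A1..A8; the label numbers themselves are arbitrary.
hierarchical = [1, 3, 2, 1, 2, 2, 3, 1]   # A1,A4,A8 | A3,A5,A6 | A2,A7
kmeans_first = [1, 3, 2, 2, 2, 2, 3, 2]   # A1 | A3,A4,A5,A6,A8 | A2,A7

print(round(rand_index(hierarchical, kmeans_first), 2))
print(rand_index(hierarchical, hierarchical))
```

A value of 1.0 means the two solutions group every pair of points identically; lower values indicate a less stable solution.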
Clustering
Steps to perform cluster analysis:
Step 4:
Validate and Interpret cluster solution
Profiling of clusters
Interpreting the clusters by examining the cluster centroids
This helps to shed light on whether the segments are
conceptually distinguishable
This information also helps to find a meaningful label
or name for each cluster that adequately reflects the objects in
it
Source: Mooi & Sarstedt, 2011