Cluster Analysis
Session 11 – APG
What do you think about "Cluster"?
Outline
• Introduction
• Conceptual support of Cluster Analysis
• How does Cluster Analysis work?
• Measuring Similarity
• Forming clusters:
• Hierarchical Clustering Methods
• Nonhierarchical Clustering Method
• Measuring Heterogeneity (number of clusters)
Introduction
• Searching the data for a structure of "natural" groupings is an important exploratory
technique.
• Groupings can provide an informal means of:
• assessing dimensionality,
• identifying outliers,
• and suggesting interesting hypotheses concerning relationships.
The Objectives of Cluster Analysis
1. Taxonomy description
• The most traditional use of cluster analysis has been for exploratory purposes and the formation of a taxonomy (an empirically based classification of objects).
• It can also generate hypotheses related to the structure of the objects.
• It can also be used for confirmatory purposes.
2. Data simplification
• Cluster analysis also develops a simplified perspective by grouping observations for further analysis.
• Whereas factor analysis attempts to provide dimensions/structure to variables, cluster analysis performs the same task for observations.
3. Relationship identification
• With the clusters defined, the researcher has a means of revealing relationships among the observations that typically is not possible with the individual observations.
Conceptual support of Cluster Analysis
The most common criticisms must be addressed by conceptual support:
• Cluster analysis is descriptive, atheoretical, and noninferential:
• it is only an exploratory technique;
• nothing guarantees unique solutions.
• Cluster analysis always creates clusters, regardless of the actual existence of any structure in the data.
• When using cluster analysis, the researcher is making an assumption of some structure among the objects.
• The researcher should always remember that just because clusters can be found does not validate their existence.
• The cluster solution is not generalizable because it is totally dependent upon the variables used as the basis for the similarity measures.
How does Cluster Analysis work?
The primary objective of cluster analysis is to define the structure of the data by placing the most
similar observations into groups.
Measuring Similarity
Distance measures:

Euclidean distance (often preferred for clustering).
The Euclidean distance between two $p$-dimensional observations (items),
$\mathbf{x}' = (x_1, x_2, \ldots, x_p)$ and $\mathbf{y}' = (y_1, y_2, \ldots, y_p)$, is

$d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})'(\mathbf{x} - \mathbf{y})} = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$

Mahalanobis distance (statistical distance). When the variables are correlated, the Mahalanobis distance is likely the most appropriate:

$d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})' \mathbf{S}^{-1} (\mathbf{x} - \mathbf{y})}$

Minkowski metric:

$d(\mathbf{x}, \mathbf{y}) = \left[ \sum_{i=1}^{p} |x_i - y_i|^m \right]^{1/m}$

For $m = 1$, it measures the "city-block" distance.
For $m = 2$, it becomes the Euclidean distance.
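These distance measures can be computed with base R's dist() function. A minimal sketch with two hypothetical 2-dimensional observations:

```r
# Sketch: computing the distance measures above with base R's dist().
# Two hypothetical observations, x = (1, 2) and y = (4, 6).
x <- matrix(c(1, 2,
              4, 6), nrow = 2, byrow = TRUE)
dist(x, method = "euclidean")         # sqrt((1-4)^2 + (2-6)^2) = 5
dist(x, method = "manhattan")         # city-block (m = 1): |1-4| + |2-6| = 7
dist(x, method = "minkowski", p = 2)  # Minkowski with m = 2 equals Euclidean
# For the Mahalanobis distance, base R's mahalanobis(x, center, cov)
# returns the SQUARED distance (x - center)' S^-1 (x - center).
```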
Two additional popular measures of "distance" or dissimilarity are given by the Canberra metric and the Czekanowski coefficient. Both of these measures are defined for nonnegative variables only.

Canberra metric:

$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{p} \frac{|x_i - y_i|}{x_i + y_i}$

Czekanowski coefficient:

$d(\mathbf{x}, \mathbf{y}) = 1 - \frac{2 \sum_{i=1}^{p} \min(x_i, y_i)}{\sum_{i=1}^{p} (x_i + y_i)}$

"The researcher is encouraged to explore alternative cluster solutions obtained when using different distance measures in an effort to best represent the underlying data patterns."
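A short sketch of these two measures in R: the Canberra metric is built into dist(), while the Czekanowski coefficient can be computed directly. (For nonnegative data, R's Canberra denominator |x_i| + |y_i| equals x_i + y_i.)

```r
# Sketch: Canberra metric via dist(); Czekanowski coefficient computed
# directly. Both assume nonnegative variables.
x <- c(1, 2, 3)
y <- c(2, 2, 5)
dist(rbind(x, y), method = "canberra")  # 1/3 + 0 + 2/8 = 0.5833
1 - 2 * sum(pmin(x, y)) / sum(x + y)    # Czekanowski: 1 - 12/15 = 0.2
```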
Forming Clusters
There are three types of clustering methods (Everitt & Hothorn, 2011):
• agglomerative hierarchical techniques,
• k-means clustering, and
• model-based clustering.
Hierarchical Clustering Method
The two basic types of hierarchical clustering procedures are:
• Agglomerative (linkage methods)
Start with the individual objects (N clusters). The most similar objects are first grouped, and these initial groups are merged according to their similarities. Eventually, as the similarity decreases, all subgroups are fused into a single cluster.
• Divisive (works in the opposite direction)
An initial single group of objects is divided into two subgroups such that the objects in one subgroup are "far from" the objects in the other. These subgroups are then further divided into dissimilar subgroups; the process continues until there are as many subgroups as objects, that is, until each object forms a group.
The results of both agglomerative and divisive methods may be displayed in the form of a two-dimensional diagram known as a dendrogram.
The following are the steps in the agglomerative hierarchical clustering algorithm for grouping N objects (items or variables):
1. Start with N clusters, each containing a single entity, and an N × N symmetric matrix of distances D = {d_ik}.
2. Search the distance matrix for the nearest (most similar) pair of clusters, say U and V, with distance d_UV.
3. Merge clusters U and V into the new cluster (UV). Update the distance matrix by deleting the rows and columns for U and V and adding a row and column giving the distances between (UV) and the remaining clusters.
4. Repeat Steps 2 and 3 a total of N − 1 times, recording the clusters merged and the distances at which the merges take place.
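The agglomerative algorithm is carried out on a distance matrix by base R's hclust(). A minimal sketch with four hypothetical one-dimensional objects:

```r
# Sketch: hclust() performs the agglomerative algorithm on a distance
# matrix. Four hypothetical one-dimensional objects at 1, 2, 5, 9.
d  <- dist(c(1, 2, 5, 9))
hc <- hclust(d, method = "single")
hc$merge   # which clusters were joined at each of the N - 1 = 3 steps
hc$height  # merge distances: 1, 3, 4 (merges occur at decreasing similarity)
```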
Linkage Methods:
• single linkage (minimum distance or nearest neighbor),
• complete linkage (maximum distance or farthest neighbor), and
• average linkage (average distance).
• Centroid method (the similarity between two clusters is the distance between the cluster centroids): $d(\mathbf{x}, \mathbf{y}) = \|\bar{\mathbf{x}} - \bar{\mathbf{y}}\|^2$
• Ward's method (the similarity between two clusters is not a single measure of similarity, but rather the sum of squares within clusters, summed over all variables).
Nonhierarchical Clustering Method
• The number of clusters, 𝐾, may either be specified in advance or determined as
part of the clustering procedure.
• Nonhierarchical methods can be applied to much larger data sets than can
hierarchical techniques.
• Nonhierarchical methods start from either (1) an initial partition of items into
groups or (2) an initial set of seed points, which will form the nuclei of clusters
• One of the more popular nonhierarchical procedures is the K-means method.
K-Means Method
1. Partition the items into K initial clusters.
2. Proceed through the list of items, assigning an item to the cluster whose centroid
(mean) is nearest. (Distance is usually computed using Euclidean distance with
either standardized or unstandardized observations.) Recalculate the centroid for
the cluster receiving the new item and for the cluster losing the item.
3. Repeat Step 2 until no more reassignments take place.
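The three steps above can be sketched with base R's kmeans(); its "MacQueen" algorithm reassigns items one at a time and recalculates the affected centroids, as in Steps 2 and 3. The toy data set here is purely illustrative:

```r
# Sketch of the K-means steps with base R's kmeans(). Toy data: two
# well-separated groups of 10 two-dimensional observations each.
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))
km <- kmeans(x, centers = 2, algorithm = "MacQueen")
km$centers  # final cluster centroids (means)
km$cluster  # final cluster assignment of each item
```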
Measuring Heterogeneity
Evaluating cluster size (application in SPSS)
How many clusters should we have? The basic rationale is that when large increases in heterogeneity occur in moving from one stage to the next, the researcher selects the prior cluster solution.
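This stopping rule can be sketched in R using the agglomeration heights from hclust(): look for the largest jump in heterogeneity between successive stages and keep the solution just before it. The built-in USArrests data set is used purely for illustration:

```r
# Sketch of the stopping rule: find the largest increase in agglomeration
# height (heterogeneity) between stages and keep the prior solution.
hc    <- hclust(dist(scale(USArrests)), method = "ward.D2")
jumps <- diff(hc$height)                     # increase at each successive stage
k     <- nrow(USArrests) - which.max(jumps)  # cluster count before largest jump
k
```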
Four-cluster solution
Hierarchical clustering in R
# 'measure' is assumed to be the body-measurement data set
# (chest, waist, hips) used by Everitt & Hothorn (2011)
dm <- dist(measure[, c("chest", "waist", "hips")]) # Euclidean distances
plot(cs <- hclust(dm, method = "single"))   # single linkage dendrogram
plot(cc <- hclust(dm, method = "complete")) # complete linkage dendrogram
plot(ca <- hclust(dm, method = "average"))  # average linkage dendrogram
# PCA of the raw measurements (not the distance matrix) provides a
# two-dimensional view on which to display the cluster labels
body_pc <- princomp(measure[, c("chest", "waist", "hips")], cor = TRUE)
xlim <- range(body_pc$scores[, 1])
plot(body_pc$scores[, 1:2], type = "n", xlim = xlim, ylim = xlim)
lab <- cutree(cs, h = 3.8) # cut the single-linkage tree at height 3.8
text(body_pc$scores[, 1:2], labels = lab, cex = 0.6)

Single linkage solutions often contain long "straggly" clusters that do not give a useful description of the data (Everitt & Hothorn, 2011).
K-Means Clustering Method
library(tidyverse)
library(cluster)
library(factoextra)
library(gridExtra)
data('USArrests')
d_frame <- USArrests
d_frame <- na.omit(d_frame) #Removing the missing values
d_frame <- scale(d_frame) # standardizing
head(d_frame) # show the data
kmeans2 <- kmeans(d_frame, centers = 2, nstart = 25) # 25 random starts each
kmeans3 <- kmeans(d_frame, centers = 3, nstart = 25)
kmeans4 <- kmeans(d_frame, centers = 4, nstart = 25)
kmeans5 <- kmeans(d_frame, centers = 5, nstart = 25)
# Comparing the plots (the `+` must stay on the same line as the call,
# or R treats the first line as a complete statement)
plot1 <- fviz_cluster(kmeans2, geom = "point", data = d_frame) + ggtitle("k = 2")
plot2 <- fviz_cluster(kmeans3, geom = "point", data = d_frame) + ggtitle("k = 3")
plot3 <- fviz_cluster(kmeans4, geom = "point", data = d_frame) + ggtitle("k = 4")
plot4 <- fviz_cluster(kmeans5, geom = "point", data = d_frame) + ggtitle("k = 5")
grid.arrange(plot1, plot2, plot3, plot4, nrow = 2)
Model-based clustering
library(mclust)
# On loading, mclust prints a note: "by using mclust, invoked on its own
# or through another package, you accept the license agreement in the
# mclust LICENSE file and at
# http://www.stat.washington.edu/mclust/license.txt"
mc <- Mclust(X) # X: a numeric data matrix; Mclust() fits Gaussian
                # mixture models and selects the best one by BIC