Cluster Analysis
Session 11 – APG
What do you think about "Cluster"?
Outline
• Introduction
• Conceptual support of Cluster Analysis
• How does Cluster Analysis work?
• Measuring Similarity
• Forming clusters:
• Hierarchical Clustering Methods
• Nonhierarchical Clustering Method
• Measuring Heterogeneity (number of clusters)
Introduction
• Searching the data for a structure of "natural" groupings is an important exploratory
technique.
• Groupings can provide an informal means of:
• assessing dimensionality,
• identifying outliers,
• and suggesting interesting hypotheses concerning relationships.
The Objectives of Cluster Analysis
1. Taxonomy description
• The most traditional use of cluster analysis has been for exploratory purposes and the formation of a taxonomy (an empirically based classification of objects).
• It can also generate hypotheses related to the structure of the objects.
• It can also be used for confirmatory purposes.
2. Data simplification
• Cluster analysis also develops a simplified perspective by grouping observations for further analysis.
• Whereas factor analysis attempts to provide dimensions/structure to variables, cluster analysis performs the same task for observations.
3. Relationship identification
• With the clusters defined, the researcher has a means of revealing relationships among the observations that typically is not possible with the individual observations.
Conceptual support of Cluster Analysis
The most common criticisms must be addressed by conceptual support:
• Cluster analysis is descriptive, atheoretical, and noninferential:
• it is only an exploratory technique;
• nothing guarantees unique solutions.
• Cluster analysis always creates clusters, regardless of the actual existence of any structure in the data.
• When using cluster analysis, the researcher is making an assumption of some structure among the objects.
• The researcher should always remember that just because clusters can be found does not validate their existence.
• The cluster solution is not generalizable because it is totally dependent upon the variables used as the basis for the similarity measures.
How does Cluster Analysis work?
The primary objective of cluster analysis is to define the structure of the data by placing the most
similar observations into groups.
Measuring Similarity
Distance measures:

Euclidean distance (often preferred for clustering).
The Euclidean distance between two $p$-dimensional observations (items),
$\mathbf{x}' = (x_1, x_2, \ldots, x_p)$ and $\mathbf{y}' = (y_1, y_2, \ldots, y_p)$, is

$d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})'(\mathbf{x} - \mathbf{y})} = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$

Mahalanobis distance (statistical distance). When the variables are correlated, the Mahalanobis distance is likely the most appropriate:

$d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})' \mathbf{S}^{-1} (\mathbf{x} - \mathbf{y})}$

Minkowski metric:

$d(\mathbf{x}, \mathbf{y}) = \left[ \sum_{i=1}^{p} |x_i - y_i|^m \right]^{1/m}$

For $m = 1$, it measures the "city-block" distance.
For $m = 2$, it becomes the Euclidean distance.
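These distance measures can be computed with base R's dist() function. A minimal sketch with two hypothetical 2-dimensional observations:

```r
# Sketch: computing the distance measures above with base R's dist().
# Two hypothetical observations, x = (1, 2) and y = (4, 6).
x <- matrix(c(1, 2,
              4, 6), nrow = 2, byrow = TRUE)
dist(x, method = "euclidean")         # sqrt((1-4)^2 + (2-6)^2) = 5
dist(x, method = "manhattan")         # city-block (m = 1): |1-4| + |2-6| = 7
dist(x, method = "minkowski", p = 2)  # Minkowski with m = 2 equals Euclidean
# For the Mahalanobis distance, base R's mahalanobis(x, center, cov)
# returns the SQUARED distance (x - center)' S^-1 (x - center).
```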
Two additional popular measures of "distance" or dissimilarity are given by the Canberra metric and the Czekanowski coefficient. Both of these measures are defined for nonnegative variables only.

Canberra metric:

$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{p} \frac{|x_i - y_i|}{x_i + y_i}$

Czekanowski coefficient:

$d(\mathbf{x}, \mathbf{y}) = 1 - \frac{2 \sum_{i=1}^{p} \min(x_i, y_i)}{\sum_{i=1}^{p} (x_i + y_i)}$

"The researcher is encouraged to explore alternative cluster solutions obtained when using different distance measures in an effort to best represent the underlying data patterns."
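A short sketch of these two measures in R: the Canberra metric is built into dist(), while the Czekanowski coefficient can be computed directly. (For nonnegative data, R's Canberra denominator |x_i| + |y_i| equals x_i + y_i.)

```r
# Sketch: Canberra metric via dist(); Czekanowski coefficient computed
# directly. Both assume nonnegative variables.
x <- c(1, 2, 3)
y <- c(2, 2, 5)
dist(rbind(x, y), method = "canberra")  # 1/3 + 0 + 2/8 = 0.5833
1 - 2 * sum(pmin(x, y)) / sum(x + y)    # Czekanowski: 1 - 12/15 = 0.2
```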
Forming Clusters
There are three types of clustering methods (Everitt & Hothorn, 2011):
• agglomerative hierarchical techniques,
• k-means clustering, and
• model-based clustering.
Hierarchical Clustering Method
The two basic types of hierarchical clustering procedures are:
• Agglomerative (linkage methods)
Start with the individual objects (N clusters). The most similar objects are first grouped, and these initial groups are merged according to their similarities. Eventually, as the similarity decreases, all subgroups are fused into a single cluster.
• Divisive (works in the opposite direction)
An initial single group of objects is divided into two subgroups such that the objects in one subgroup are "far from" the objects in the other. These subgroups are then further divided into dissimilar subgroups; the process continues until there are as many subgroups as objects, that is, until each object forms a group.
The results of both agglomerative and divisive methods may be displayed in the form of a two-dimensional diagram known as a dendrogram.
The following are the steps in the agglomerative hierarchical clustering algorithm for grouping N objects (items or variables):
1. Start with N clusters, each containing a single entity, and an N × N symmetric matrix of distances D = {d_ik}.
2. Search the distance matrix for the nearest (most similar) pair of clusters, say U and V, with distance d_UV.
3. Merge clusters U and V into the new cluster (UV). Update the distance matrix by deleting the rows and columns for U and V and adding a row and column giving the distances between (UV) and the remaining clusters.
4. Repeat Steps 2 and 3 a total of N − 1 times, recording the clusters merged and the distances at which the merges take place.
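The agglomerative algorithm is carried out on a distance matrix by base R's hclust(). A minimal sketch with four hypothetical one-dimensional objects:

```r
# Sketch: hclust() performs the agglomerative algorithm on a distance
# matrix. Four hypothetical one-dimensional objects at 1, 2, 5, 9.
d  <- dist(c(1, 2, 5, 9))
hc <- hclust(d, method = "single")
hc$merge   # which clusters were joined at each of the N - 1 = 3 steps
hc$height  # merge distances: 1, 3, 4 (merges occur at decreasing similarity)
```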
Linkage Methods:
• single linkage (minimum distance or nearest neighbor),
• complete linkage (maximum distance or farthest neighbor), and
• average linkage (average distance).
• Centroid method (the similarity between two clusters is the distance between the cluster centroids): $d(\mathbf{x}, \mathbf{y}) = \|\bar{\mathbf{x}} - \bar{\mathbf{y}}\|^2$
• Ward's method (the similarity between two clusters is not a single measure of similarity, but rather the sum of squares within clusters, summed over all variables).
Nonhierarchical Clustering Method
• The number of clusters, 𝐾, may either be specified in advance or determined as
part of the clustering procedure.
• Nonhierarchical methods can be applied to much larger data sets than can
hierarchical techniques.
• Nonhierarchical methods start from either (1) an initial partition of items into
groups or (2) an initial set of seed points, which will form the nuclei of clusters
• One of the more popular nonhierarchical procedures is the K-means method.
K-Means Method
1. Partition the items into K initial clusters.
2. Proceed through the list of items, assigning an item to the cluster whose centroid
(mean) is nearest. (Distance is usually computed using Euclidean distance with
either standardized or unstandardized observations.) Recalculate the centroid for
the cluster receiving the new item and for the cluster losing the item.
3. Repeat Step 2 until no more reassignments take place.
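The three steps above can be sketched with base R's kmeans(); its "MacQueen" algorithm reassigns items one at a time and recalculates the affected centroids, as in Steps 2 and 3. The toy data set here is purely illustrative:

```r
# Sketch of the K-means steps with base R's kmeans(). Toy data: two
# well-separated groups of 10 two-dimensional observations each.
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))
km <- kmeans(x, centers = 2, algorithm = "MacQueen")
km$centers  # final cluster centroids (means)
km$cluster  # final cluster assignment of each item
```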
Measuring Heterogeneity
Evaluating cluster size (application in SPSS)
How many clusters should we have? The basic rationale is that when large increases in heterogeneity occur in moving from one stage to the next, the researcher selects the prior cluster solution.
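This stopping rule can be sketched in R using the agglomeration heights from hclust(): look for the largest jump in heterogeneity between successive stages and keep the solution just before it. The built-in USArrests data set is used purely for illustration:

```r
# Sketch of the stopping rule: find the largest increase in agglomeration
# height (heterogeneity) between stages and keep the prior solution.
hc    <- hclust(dist(scale(USArrests)), method = "ward.D2")
jumps <- diff(hc$height)                     # increase at each successive stage
k     <- nrow(USArrests) - which.max(jumps)  # cluster count before largest jump
k
```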
Four-cluster solution
Hierarchical clustering in R
# 'measure' is assumed to be the body-measurement data set
# (chest, waist, hips) used by Everitt & Hothorn (2011)
dm <- dist(measure[, c("chest", "waist", "hips")]) # Euclidean distances
plot(cs <- hclust(dm, method = "single"))   # single linkage dendrogram
plot(cc <- hclust(dm, method = "complete")) # complete linkage dendrogram
plot(ca <- hclust(dm, method = "average"))  # average linkage dendrogram
# PCA of the raw measurements (not the distance matrix) provides a
# two-dimensional view on which to display the cluster labels
body_pc <- princomp(measure[, c("chest", "waist", "hips")], cor = TRUE)
xlim <- range(body_pc$scores[, 1])
plot(body_pc$scores[, 1:2], type = "n", xlim = xlim, ylim = xlim)
lab <- cutree(cs, h = 3.8) # cut the single-linkage tree at height 3.8
text(body_pc$scores[, 1:2], labels = lab, cex = 0.6)

Single linkage solutions often contain long "straggly" clusters that do not give a useful description of the data (Everitt & Hothorn, 2011).
K-Means Clustering Method
library(tidyverse)
library(cluster)
library(factoextra)
library(gridExtra)
data('USArrests')
d_frame <- USArrests
d_frame <- na.omit(d_frame) #Removing the missing values
d_frame <- scale(d_frame) # standardizing
head(d_frame) # show the data
kmeans2 <- kmeans(d_frame, centers = 2, nstart = 25) # 25 random starts each
kmeans3 <- kmeans(d_frame, centers = 3, nstart = 25)
kmeans4 <- kmeans(d_frame, centers = 4, nstart = 25)
kmeans5 <- kmeans(d_frame, centers = 5, nstart = 25)
# Comparing the plots (the `+` must stay on the same line as the call,
# or R treats the first line as a complete statement)
plot1 <- fviz_cluster(kmeans2, geom = "point", data = d_frame) + ggtitle("k = 2")
plot2 <- fviz_cluster(kmeans3, geom = "point", data = d_frame) + ggtitle("k = 3")
plot3 <- fviz_cluster(kmeans4, geom = "point", data = d_frame) + ggtitle("k = 4")
plot4 <- fviz_cluster(kmeans5, geom = "point", data = d_frame) + ggtitle("k = 5")
grid.arrange(plot1, plot2, plot3, plot4, nrow = 2)
Model-based clustering
library(mclust)
# On loading, mclust prints a note: "by using mclust, invoked on its own
# or through another package, you accept the license agreement in the
# mclust LICENSE file and at
# http://www.stat.washington.edu/mclust/license.txt"
mc <- Mclust(X) # X: a numeric data matrix; Mclust() fits Gaussian
                # mixture models and selects the best one by BIC