
MEDITERRANEAN SCHOOL OF BUSINESS

COURSE: DATA ANALYTICS

PROFESSOR: Dr. Ramla Jarrar

Lecture 02: Cluster Analysis (3h)

Review

Factor Analysis
Factor analysis is a class of procedures used for data reduction and
summarization.
It is an interdependence technique: no distinction between dependent
and independent variables.

Factor analysis is used:


◦ To identify underlying dimensions, or factors, that explain the
correlations among a set of variables.
◦ To identify a new, smaller, set of uncorrelated variables to replace the
original set of correlated variables.
Factor Analysis Model
The common factors themselves can be expressed as linear
combinations of the observed variables.

Fi = Wi1X1 + Wi2X2 + Wi3X3 + . . . + WikXk

Where:
Fi = estimate of ith factor
Wij = weight or factor score coefficient of the jth variable on the ith factor
k = number of variables
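
As a minimal illustration of this weighted-sum formula (the data and weight matrices below are made up for demonstration; this is not the lecture's example), factor scores can be computed as a matrix product:

```python
import numpy as np

# Hypothetical standardized data: 5 respondents x 3 variables (X1..X3)
X = np.array([
    [ 0.5, -1.2,  0.3],
    [-0.7,  0.4,  1.1],
    [ 1.3,  0.2, -0.9],
    [-0.2, -0.5,  0.8],
    [ 0.1,  1.0, -1.3],
])

# Hypothetical factor score coefficients W (2 factors x 3 variables),
# i.e. the weights W_ij in F_i = W_i1*X1 + ... + W_ik*Xk
W = np.array([
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
])

# Each factor score is the weighted sum of the observed variables
F = X @ W.T          # shape: (5 respondents, 2 factors)
print(F)
```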
The Factor Analysis Process
1. Formulate the problem
2. Look at the correlation matrix
3. Determine the method of extraction
4. Determine the number of factors
5. Rotate your factors
6. Interpret your factors
7. Calculate your factor scores
This Session: Cluster Analysis
Course Structure: 24h including 9h labs

Data Analytics
◦ Underlying theory: Recap of Business Statistics (3h incl. Lab)
◦ Multidimensional techniques:
  - Factor Analysis (3h) + Lab (3h)
  - Cluster Analysis (3h) + Lab (3h)
  - Regression (4.5h) + Lab (3h)
◦ Capstone Session (1.5h)
Multivariate Methods

Type of relationship tested:

Dependence
◦ Metric dependent variable: Regression (ANOVA if the regressors are nonmetric)
◦ Nonmetric dependent variable: MDA (Multiple Discriminant Analysis)

Interdependence
◦ The relationship is among the variables: Factor Analysis
◦ The relationship is among cases/respondents: Cluster Analysis
Example
In our database we count more than 10,000 customers; we know
their age, city, income, employment status, and designation (i.e.
level of seniority).

You have to sell 100 BlackBerry phones (each costs $1,000) to the
people in this group.

How can you be efficient in your sales strategy?


Example of Clustering
◦ Divide the whole population into two groups: employed / unemployed.
◦ Further divide the employed population into two groups: high / low salary.
◦ Further divide that group into high / low designation (i.e. senior vs. less senior).
Cluster Analysis
Cluster analysis is a class of techniques used to classify objects or cases into relatively
homogeneous groups called clusters. Objects in each cluster tend to be similar to each
other and dissimilar to objects in the other clusters. Cluster analysis is also called
classification analysis, or numerical taxonomy.

Both cluster analysis and discriminant analysis are concerned with classification.
◦ However, discriminant analysis requires prior knowledge of the cluster or group
membership for each object or case included, to develop the classification rule.
◦ In contrast, in cluster analysis there is no a priori information about the group or cluster
membership for any of the objects. Groups or clusters are suggested by the data, not
defined a priori.
Cluster Analysis vs. Factor Analysis

CLUSTER ANALYSIS
◦ Grouping is based on distance (proximity).
◦ We form groups of people based on their responses to several variables.

FACTOR ANALYSIS
◦ Grouping is based on patterns of variation (correlation).
◦ We form groups of variables based on several people’s responses to those variables.
An Ideal Clustering Situation
Groups are distinct (scatter plot of Variable 1 against Variable 2 showing well-separated groups).

A Practical Clustering Situation
Groups are not that distinct (scatter plot of Variable 1 against Variable 2 showing overlapping groups).
Statistics Associated with Cluster Analysis
Agglomeration schedule: An agglomeration schedule gives information on the
objects or cases being combined at each stage of a hierarchical clustering
process.
Cluster centroid: The cluster centroid is the mean values of the variables for all
the cases or objects in a particular cluster.
Cluster centers: The cluster centers are the initial starting points in
nonhierarchical clustering. Clusters are built around these centers, or seeds.
Cluster membership: Cluster membership indicates the cluster to which each
object or case belongs.
Statistics Associated with Cluster Analysis
Dendrogram: A dendrogram, or tree graph, is a graphical device for
displaying clustering results. Vertical lines represent clusters that are
joined together. The position of the line on the scale indicates the
distances at which clusters were joined. The dendrogram is read from left
to right.
Distances between cluster centers: These distances indicate how
separated the individual pairs of clusters are. Clusters that are widely
separated are distinct, and therefore desirable.
Statistics Associated with Cluster Analysis
Icicle diagram: An icicle diagram is a graphical display of clustering
results, so called because it resembles a row of icicles hanging from the
eaves of a house. The columns correspond to the objects being
clustered, and the rows correspond to the number of clusters. An icicle
diagram is read from bottom to top.
Similarity/distance coefficient matrix: A similarity/distance coefficient
matrix is a lower-triangle matrix containing pairwise distances between
objects or cases.
Conducting Cluster Analysis

I.   Formulate the Problem
II.  Select a Distance Measure
III. Select a Clustering Procedure
IV.  Decide on the Number of Clusters
V.   Interpret & Profile the Clusters
VI.  Assess the Validity of Clustering
I - Formulate the Problem
• Perhaps the most important part of formulating the clustering problem is
selecting the variables on which the clustering is based.
• Inclusion of even one or two irrelevant variables may distort an otherwise
useful clustering solution.
• Basically, the set of variables selected should describe the similarity
between objects in terms that are relevant to the research problem.
• The variables should be selected based on past research, theory, or a
consideration of the hypotheses being tested. In exploratory research,
the researcher should exercise judgment and intuition.
II – Select a Distance Measure (1)
Several distance measures are available, each with
specific characteristics.

◦ Euclidean distance. The most commonly used measure


of similarity. It is the square root of the sum of the
squared differences in values for each variable.
◦ Squared Euclidean distance. The sum of the squared
differences without taking the square root.
◦ City-block (Manhattan) distance. Uses the sum of the
variables’ absolute differences.
◦ Chebychev distance. Is the maximum absolute
difference in the clustering variables’ values. Frequently
used when working with metric (or ordinal) data.
◦ Mahalanobis distance (D2). Is a generalized distance
measure that accounts for the correlations among
variables in a way that weights each variable equally.
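
As a quick sketch of these measures (not part of the lecture; the two example vectors are invented), the common distances can be computed with scipy:

```python
import numpy as np
from scipy.spatial import distance

# Two hypothetical respondents measured on three variables
x = np.array([6.0, 4.0, 7.0])
y = np.array([2.0, 3.0, 1.0])

print(distance.euclidean(x, y))     # square root of the sum of squared differences
print(distance.sqeuclidean(x, y))   # squared Euclidean: sum of squared differences
print(distance.cityblock(x, y))     # Manhattan: sum of absolute differences
print(distance.chebyshev(x, y))     # Chebychev: maximum absolute difference
# distance.mahalanobis(x, y, VI) additionally needs VI, the inverse covariance
# matrix of the variables, which is why it accounts for their correlations
```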
II – Select a Distance Measure (2)
• If the variables are measured in vastly different units, the clustering
solution will be influenced by the units of measurement. In these cases,
before clustering respondents, we must standardize the data by rescaling
each variable to have a mean of zero and a standard deviation of one.
It is also desirable to eliminate outliers (cases with atypical values).
• Use of different distance measures may lead to different clustering
results. Hence, it is advisable to use different measures and compare
the results.
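
A minimal sketch of this standardization step (the small data matrix below is made up for illustration; it is not the lecture's example):

```python
import numpy as np

# Hypothetical raw data: 4 respondents x 2 variables measured in very different units
X = np.array([
    [25_000.0, 34.0],   # income in dollars, age in years
    [48_000.0, 51.0],
    [39_000.0, 29.0],
    [61_000.0, 42.0],
])

# Rescale each variable to mean 0 and standard deviation 1 (a z-score)
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z.mean(axis=0))   # approximately [0, 0]
print(Z.std(axis=0))    # approximately [1, 1]
```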
III – Select a Clustering Procedure (1)

Clustering procedures:

◦ Hierarchical
  - Agglomerative
    · Linkage methods: single, complete, average
    · Variance methods (e.g. Ward’s procedure)
    · Centroid method
  - Divisive
◦ Nonhierarchical: K-Means
◦ Two-Step
III – Select a Clustering Procedure (2)
Hierarchical clustering is characterized by the development of a
hierarchy or tree-like structure. Hierarchical methods can be
agglomerative or divisive.
◦ Agglomerative clustering starts with each object in a separate
cluster. Clusters are formed by grouping objects into bigger and
bigger clusters. This process is continued until all objects are
members of a single cluster.
◦ Divisive clustering starts with all the objects grouped in a single
cluster. Clusters are divided or split until each object is in a
separate cluster.
III – Select a Clustering Procedure (3)-HP
Linkage Methods: The single Linkage
The single linkage method is based on minimum distance, or the
nearest neighbor rule:
◦ At every stage, the distance between two clusters is the distance
between their two closest points.

(Figure: single linkage — the minimum distance between the closest points of Cluster 1 and Cluster 2.)
III – Select a Clustering Procedure (3)- HP
Linkage Methods: The complete Linkage
The complete linkage method is similar to single linkage, except
that it is based on the maximum distance or the furthest neighbor
approach:
◦ In complete linkage, the distance between two clusters is calculated
as the distance between their two furthest points.

(Figure: complete linkage — the maximum distance between the furthest points of Cluster 1 and Cluster 2.)
III – Select a Clustering Procedure (3)- HP
Linkage Methods: Average Linkage
The average linkage method works similarly. However, in this
method, the distance between two clusters is defined as the
average of the distances between all pairs of objects, where one
member of the pair is from each of the clusters.

(Figure: average linkage — the average distance between all pairs of points, one from Cluster 1 and one from Cluster 2.)
III – Select a Clustering Procedure (3)-HP
Variance Method
• The variance methods attempt to generate clusters to minimize the
within-cluster variance.
• A commonly used variance method is the Ward's procedure:
◦ For each cluster, the means for all the variables are computed.
◦ Then, for each object, the squared Euclidean distance to the cluster means is
calculated.
◦ These distances are summed for all the objects.

Ward’s Procedure
III – Select a Clustering Procedure (3)-HP
Centroid Method
• In the centroid methods, the distance between two clusters is the distance
between their centroids (means for all the variables). Every time objects are
grouped, a new centroid is computed.
• Of the hierarchical methods, average linkage and Ward's methods have been
shown to perform better than the other procedures.

Centroid Method
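
To see how these hierarchical options map onto code (a sketch, not the lecture's material; the small data matrix is invented), scipy exposes each linkage rule by name:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical data: 6 objects measured on 2 variables
X = np.array([[1, 2], [2, 1], [1, 1],
              [8, 9], [9, 8], [8, 8]], dtype=float)

# Each call returns an agglomeration schedule (which clusters merge, at what distance)
Z_single   = linkage(X, method="single")    # minimum distance / nearest neighbour
Z_complete = linkage(X, method="complete")  # maximum distance / furthest neighbour
Z_average  = linkage(X, method="average")   # average of all pairwise distances
Z_centroid = linkage(X, method="centroid")  # distance between cluster centroids
Z_ward     = linkage(X, method="ward")      # minimizes within-cluster variance
print(Z_ward)
```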
III – Select a Clustering Procedure (4) - NHP
K-Means Method
• The nonhierarchical clustering methods are frequently referred to as k-means clustering:
◦ Note that in this procedure the number k of clusters is fixed.
• In the sequential threshold method, a cluster center is selected and all objects within a
prespecified threshold value from the center are grouped together. Then a new cluster
center or seed is selected, and the process is repeated for the unclustered points. Once
an object is clustered with a seed, it is no longer considered for clustering with subsequent
seeds.
• Algorithm (sketched in code below):
1. Place K points (or seeds) into the space represented by the objects being clustered;
these points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids no longer move.
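
The loop above can be written out directly; the following is a minimal sketch (invented data, K fixed at 2), not the lecture's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))          # hypothetical data: 30 objects, 2 variables
K = 2

# Step 1: pick K objects as initial seeds / centroids
centroids = X[rng.choice(len(X), size=K, replace=False)]

for _ in range(100):
    # Step 2: assign each object to the closest centroid
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    # Step 3: recalculate centroid positions
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # Step 4: stop when the centroids no longer move
    # (a production version would also guard against empty clusters)
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels, centroids)
```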
Hierarchical vs. Nonhierarchical Methods

HIERARCHICAL CLUSTERING
◦ No decision about the number of clusters is needed up front
◦ Problems when data contain a high level of error
◦ Can be very slow
◦ Initial decisions are more influential (one step only)

NONHIERARCHICAL CLUSTERING
◦ Faster, more reliable
◦ Need to specify the number of clusters (arbitrary)
◦ Need to set the initial seeds (arbitrary)
III – Select a Clustering Procedure (5)
Two-Step Method
• It has been suggested that the hierarchical and
nonhierarchical methods be used in tandem.
• First, an initial clustering solution is obtained using a
hierarchical procedure, such as average linkage or Ward's.
• The number of clusters and cluster centroids so obtained
are used as inputs to the k-means procedure (see the
sketch below).
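
A minimal sketch of this tandem approach (invented data; scikit-learn's KMeans stands in for the lecture's SPSS procedure): Ward's method is run first, and its cluster centroids become the k-means starting seeds:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))                 # hypothetical data: 50 cases, 4 variables

# Step 1: hierarchical (Ward's) solution, cut at a chosen number of clusters
k = 3
labels_hier = fcluster(linkage(X, method="ward"), t=k, criterion="maxclust")

# Step 2: centroids of the hierarchical clusters become the k-means seeds
seeds = np.array([X[labels_hier == c].mean(axis=0) for c in range(1, k + 1)])
km = KMeans(n_clusters=k, init=seeds, n_init=1).fit(X)
print(km.labels_)
```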
IV - Decide on the Number of Clusters
•Theoretical, conceptual, or practical considerations may suggest a certain
number of clusters.
•In hierarchical clustering, the distances at which clusters are combined can be
used as criteria. This information can be obtained from the agglomeration
schedule or from the dendrogram.
•In nonhierarchical clustering, the ratio of total within-group variance to
between-group variance can be plotted against the number of clusters.
• The point at which an elbow or a sharp bend occurs indicates an appropriate
number of clusters (see the sketch after this list).
•The relative sizes of the clusters should be meaningful.
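
A sketch of the elbow idea mentioned above (made-up data; a common variant that plots the total within-cluster sum of squares against the number of clusters, rather than the exact variance ratio named in the slide):

```python
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Hypothetical data with three loose groups
X = np.vstack([rng.normal(loc=c, size=(20, 2)) for c in (0, 4, 8)])

ks = range(1, 8)
wss = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]  # within-cluster SS

plt.plot(list(ks), wss, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Total within-cluster sum of squares")
plt.show()   # the 'elbow' suggests an appropriate number of clusters
```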
V - Interpreting and Profiling the Clusters
• Interpreting and profiling clusters involves examining the
cluster centroids. The centroids enable us to describe
each cluster by assigning it a name or label.

• It is often helpful to profile the clusters in terms of


variables that were not used for clustering. These may
include demographic, psychographic, product usage,
media usage, or other variables.
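
As a small sketch of this profiling step (hypothetical variable names and cluster labels; not the lecture's output), the centroids are the per-cluster means of the clustering variables, and the same grouping can be applied to variables that were not used for clustering:

```python
import pandas as pd

# Hypothetical respondents: two clustering variables plus one profiling variable (age)
df = pd.DataFrame({
    "fun":     [6, 2, 7, 1, 5, 3],
    "budget":  [4, 3, 2, 3, 4, 5],
    "age":     [24, 41, 27, 45, 30, 38],   # not used for clustering
    "cluster": [1, 2, 1, 2, 1, 2],         # membership from a previous clustering run
})

centroids = df.groupby("cluster")[["fun", "budget"]].mean()   # describes / labels each cluster
profile   = df.groupby("cluster")["age"].mean()               # profile on a non-clustering variable
print(centroids)
print(profile)
```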
VI - Assess Reliability and Validity
1. Perform cluster analysis on the same data using different distance
measures. Compare the results across measures to determine the
stability of the solutions.
2. Use different methods of clustering and compare the results (see the sketch after this list).
3. Split the data randomly into halves. Perform clustering separately on
each half. Compare cluster centroids across the two subsamples.
4. Delete variables randomly. Perform clustering based on the reduced
set of variables. Compare the results with those obtained by
clustering based on the entire set of variables.
5. In nonhierarchical clustering, the solution may depend on the order of
cases in the data set. Make multiple runs using different order of
cases until the solution stabilizes.
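
Check 2 in the list above (different clustering methods on the same data) can be sketched as follows; the data are invented and the agreement measure (adjusted Rand index) is a common choice rather than something prescribed in the lecture:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))                    # hypothetical data

labels_ward = fcluster(linkage(X, method="ward"), t=3, criterion="maxclust")
labels_km   = KMeans(n_clusters=3, n_init=10).fit_predict(X)

# Agreement close to 1 suggests a stable solution; near 0 suggests instability
print(adjusted_rand_score(labels_ward, labels_km))
```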
Illustration
Example: Attitudinal Data
Lifestyle Questionnaire
We asked a sample of 19 individuals to rate their attitudes
towards the following statements on a 1-7 Scale:
1. I like having fun
2. Going out is bad for budget
3. I like eating out
4. I always look for bargains and best buys
5. I don’t care about going out often
6. I like comparing prices
Attitudinal Data for Clustering

Case No.   V1   V2   V3   V4   V5   V6
    1       6    4    7    3    2    3
    2       2    3    1    4    5    4
    3       7    2    6    4    1    3
    4       4    6    4    5    3    6
    5       1    3    2    2    6    4
    6       6    4    6    3    3    4
    7       5    3    6    3    3    4
    8       7    3    7    4    1    4
    9       2    4    3    3    6    3
   10       3    5    3    6    4    6
   11       1    3    2    3    5    3
   12       5    4    5    4    2    4
   13       2    2    1    5    4    4
   14       4    6    4    6    4    7
   15       6    5    4    2    1    4
   16       3    5    4    6    4    7
   17       4    4    7    2    2    5
   18       3    7    2    6    4    3
   19       4    6    3    7    2    7
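
The table above is small enough to cluster directly in code. A sketch (not the lecture's SPSS run; scipy's Ward coefficients are scaled differently from SPSS's agglomeration schedule, so the merge values will not match exactly, although the merge order and memberships should be comparable):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# The 19 cases x 6 attitude ratings from the table above
X = np.array([
    [6, 4, 7, 3, 2, 3], [2, 3, 1, 4, 5, 4], [7, 2, 6, 4, 1, 3],
    [4, 6, 4, 5, 3, 6], [1, 3, 2, 2, 6, 4], [6, 4, 6, 3, 3, 4],
    [5, 3, 6, 3, 3, 4], [7, 3, 7, 4, 1, 4], [2, 4, 3, 3, 6, 3],
    [3, 5, 3, 6, 4, 6], [1, 3, 2, 3, 5, 3], [5, 4, 5, 4, 2, 4],
    [2, 2, 1, 5, 4, 4], [4, 6, 4, 6, 4, 7], [6, 5, 4, 2, 1, 4],
    [3, 5, 4, 6, 4, 7], [4, 4, 7, 2, 2, 5], [3, 7, 2, 6, 4, 3],
    [4, 6, 3, 7, 2, 7],
], dtype=float)

Z = linkage(X, method="ward")                     # agglomeration schedule (18 merge steps)
labels_3 = fcluster(Z, t=3, criterion="maxclust") # 3-cluster membership
labels_2 = fcluster(Z, t=2, criterion="maxclust") # 2-cluster membership
print(labels_3)
print(labels_2)
```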
Results of Hierarchical Clustering

• Stage 1: individuals 14 and 16 are the first to be joined together.
• Stage 2: individuals 6 and 7 now form a segment.
• Stage 6: individual 10 joins the cluster formed at stage 1 (14 & 16); we now have a cluster containing 14, 16 and 10.
• The coefficient is the amount of error created at each clustering stage. A large jump in the value of the error term indicates that two different things have been brought together.

Agglomeration Schedule

Stage   Cluster 1   Cluster 2   Coefficients   Stage Cluster 1 First Appears   Stage Cluster 2 First Appears   Next Stage
  1         14          16          1.000                  0                              0                        6
  2          6           7          2.000                  0                              0                        7
  3          2          13          3.500                  0                              0                       14
  4          5          11          5.000                  0                              0                        8
  5          3           8          6.500                  0                              0                       15
  6         10          14          8.167                  0                              1                        9
  7          6          12         10.500                  2                              0                       10
  8          5           9         13.000                  4                              0                       14
  9          4          10         15.583                  0                              6                       11
 10          1           6         18.500                  0                              7                       12
 11          4          19         23.250                  9                              0                       16
 12          1          17         28.600                 10                              0                       13
 13          1          15         36.833                 12                              0                       15
 14          2           5         46.533                  3                              8                       17
 15          1           3         59.200                 13                              5                       18
 16          4          18         74.367                 11                              0                       17
 17          2           4        154.545                 14                             16                       18
 18          1           2        300.632                 15                             17                        0
Dendrogram
• Graphical representation (tree graph) of the results of a hierarchical procedure,
starting with each object as a separate cluster.
• The dendrogram shows graphically how the clusters are combined at each step of
the procedure until all are contained in a single cluster.
• Here, 3 clusters (labelled Cluster 1, Cluster 2 and Cluster 3 on the plot) seem to give
a satisfactory overall similarity.
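
A dendrogram like the one described here can be drawn from the same linkage output; a sketch with an invented stand-in data matrix rather than the lecture's figure:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
X = rng.normal(size=(19, 6))          # hypothetical stand-in for the 19 x 6 attitude ratings

Z = linkage(X, method="ward")
dendrogram(Z)                         # tree graph of the merge sequence
plt.xlabel("Cases")
plt.ylabel("Distance at which clusters are joined")
plt.show()
```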
Cluster Membership
Solution 1: 3 clusters / Solution 2: 2 clusters

Case   3 Clusters   2 Clusters
  1        1            1
  2        2            2
  3        1            1
  4        3            2
  5        2            2
  6        1            1
  7        1            1
  8        1            1
  9        2            2
 10        3            2
 11        2            2
 12        1            1
 13        2            2
 14        3            2
 15        1            1
 16        3            2
 17        1            1
 18        3            2
 19        3            2

• Individuals 1 & 3 belong to cluster 1; individuals 5 & 9 belong to cluster 2;
individuals 14 & 16 belong to cluster 3.
• The last column gives each individual's membership for the 2-cluster solution.
Cluster Membership: Icicle Plot
(Columns correspond to the individuals or cases.)

• 3-cluster solution: individuals 9, 11, 5, 13 and 2 belong to cluster 2;
individuals 18, 19, 16, 14, 10 and 4 belong to cluster 3.
• 5-cluster solution: individuals 9, 11, 5, 13 and 2 belong to cluster 3;
individual 18 belongs to cluster 1.
Cluster Centroids: Description of Clusters
Examine the cluster centroids (we retained 3 clusters here).

Report (mean of each variable per cluster)

                                 Cluster 1   Cluster 2   Cluster 3    Total
I like having fun                  5.7500      1.6000      3.5000     3.9474
Going out is bad for budget        3.6250      3.0000      5.8333     4.1579
I like eating out                  6.0000      1.8000      3.3333     4.0526
I always look for best buys        3.1250      3.4000      6.0000     4.1053
I don't care about going out       1.8750      5.2000      3.5000     3.2632
I like comparing prices            3.8750      3.6000      6.0000     4.4737

• Cluster 1: 3eich (party people)
• Cluster 2: Zeid nekess (don’t care)
• Cluster 3: El Hedhek (the stingy)
Results of Nonhierarchical Clustering (K-Means)
• The centroids of the initial solution are the starting seeds; individuals are finally
classified relative to the centroids of the final solution.
• We go through a series of iterations until the centroids stabilize and do not change
from one iteration to the next.

Initial Cluster Centers
                                           Cluster 1   Cluster 2   Cluster 3
I like having fun                             4.00        1.00        7.00
Going out is bad for budget                   6.00        3.00        2.00
I like eating out                             3.00        2.00        6.00
I always look for best buys and bargains      7.00        2.00        4.00
I don't care about going out                  2.00        6.00        1.00
I like comparing prices                       7.00        4.00        3.00

Final Cluster Centers
                                           Cluster 1   Cluster 2   Cluster 3
I like having fun                             3.50        1.67        5.75
Going out is bad for budget                   5.83        3.00        3.63
I like eating out                             3.33        1.83        6.00
I always look for best buys and bargains      6.00        3.50        3.13
I don't care about going out                  3.50        5.50        1.88
I like comparing prices                       6.00        3.67        3.88
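
The same kind of k-means output can be produced in code. A sketch (not the lecture's SPSS run; the cluster numbering and exact centers may differ because the result depends on the starting seeds):

```python
import numpy as np
from sklearn.cluster import KMeans

# The 19 x 6 attitudinal ratings from the earlier table
X = np.array([
    [6, 4, 7, 3, 2, 3], [2, 3, 1, 4, 5, 4], [7, 2, 6, 4, 1, 3],
    [4, 6, 4, 5, 3, 6], [1, 3, 2, 2, 6, 4], [6, 4, 6, 3, 3, 4],
    [5, 3, 6, 3, 3, 4], [7, 3, 7, 4, 1, 4], [2, 4, 3, 3, 6, 3],
    [3, 5, 3, 6, 4, 6], [1, 3, 2, 3, 5, 3], [5, 4, 5, 4, 2, 4],
    [2, 2, 1, 5, 4, 4], [4, 6, 4, 6, 4, 7], [6, 5, 4, 2, 1, 4],
    [3, 5, 4, 6, 4, 7], [4, 4, 7, 2, 2, 5], [3, 7, 2, 6, 4, 3],
    [4, 6, 3, 7, 2, 7],
], dtype=float)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # final cluster centers
print(km.labels_)            # final cluster membership for each case
```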
Cluster Membership (K-Means) for 3 Clusters

Case Number   Cluster   Distance
     1           3        1.414
     2           2        1.190
     3           3        2.550
     4           1        1.404
     5           2        1.756
     6           3        1.225
     7           3        1.500
     8           3        2.121
     9           2        1.848
    10           1        1.143
    11           2        1.190
    12           3        1.581
    13           2        2.533
    14           1        1.404
    15           3        2.828
    16           1        1.624
    17           3        2.598
    18           1        3.555
    19           1        2.154
    20           2        1.658

• Final cluster membership.
• Distance is the distance between each individual and its cluster centroid.
Distance Between the 3 Cluster Centroids

Distances Between Final Cluster Centers

Cluster      1        2        3
   1                5.416    5.698
   2       5.416             6.910
   3       5.698    6.910

The between-cluster distance should be bigger than the within-cluster distance.
Thank you
ANOVA Analysis

Simple Example
Suppose a marketing researcher wishes to determine market segments in a
community based on patterns of loyalty to brands and stores.
A small sample of seven respondents is selected as a pilot test of how cluster
analysis is applied.
◦ Two measures of loyalty were measured for each respondent on a 0–10 scale:
◦ V1 (store loyalty)
◦ V2 (brand loyalty)
Scatter Plot of the responses
How do we measure similarity?
Proximity Matrix of Euclidean Distance Between Observations
Observations
Observation
A B C D E F G

A ---
B 3.162 ---
C 5.099 2.000 ---
D 5.099 2.828 2.000 ---
E 5.000 2.236 2.236 4.123 ---
F 6.403 3.606 3.000 5.000 1.414 ---
G 3.606 2.236 3.606 5.000 2.000 3.162 ---
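
Since only the distance matrix is reported here (not the raw V1/V2 scores), the hierarchical procedure can be run directly on these pairwise distances. A sketch assuming single linkage, which matches the nearest-neighbour rule used in the agglomerative process shown next:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

labels = ["A", "B", "C", "D", "E", "F", "G"]
# Euclidean distances between observations, copied from the proximity matrix above
D = np.array([
    [0.000, 3.162, 5.099, 5.099, 5.000, 6.403, 3.606],
    [3.162, 0.000, 2.000, 2.828, 2.236, 3.606, 2.236],
    [5.099, 2.000, 0.000, 2.000, 2.236, 3.000, 3.606],
    [5.099, 2.828, 2.000, 0.000, 4.123, 5.000, 5.000],
    [5.000, 2.236, 2.236, 4.123, 0.000, 1.414, 2.000],
    [6.403, 3.606, 3.000, 5.000, 1.414, 0.000, 3.162],
    [3.606, 2.236, 3.606, 5.000, 2.000, 3.162, 0.000],
])

# squareform turns the symmetric matrix into the condensed form linkage expects
Z = linkage(squareform(D), method="single")   # nearest-neighbour merging
print(Z)   # the first merge is E-F at distance 1.414
```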
How do we form clusters?
SIMPLE RULE:
◦ Identify the two most similar (closest) observations not already in the same cluster and
combine them.

◦ We apply this rule repeatedly to generate a number of cluster solutions, starting with
each observation as its own “cluster” and then combining two clusters at a time until all
observations are in a single cluster.
◦ This process is termed a hierarchical procedure because it moves in a stepwise fashion to form an entire range
of cluster solutions. It is also an agglomerative method because clusters are formed by combining existing
clusters
How do we form clusters?

AGGLOMERATIVE PROCESS AND CLUSTER SOLUTION

Step               Minimum Distance Between    Observation   Cluster Membership       Number of   Overall Similarity Measure
                   Unclustered Observations    Pair                                   Clusters    (Average Within-Cluster Distance)
Initial solution             —                    —          (A)(B)(C)(D)(E)(F)(G)        7                 0
       1                   1.414                 E-F         (A)(B)(C)(D)(E-F)(G)         6                 1.414
       2                   2.000                 E-G         (A)(B)(C)(D)(E-F-G)          5                 2.192
       3                   2.000                 C-D         (A)(B)(C-D)(E-F-G)           4                 2.144
       4                   2.000                 B-C         (A)(B-C-D)(E-F-G)            3                 2.234
       5                   2.236                 B-E         (A)(B-C-D-E-F-G)             2                 2.896
       6                   3.162                 A-B         (A-B-C-D-E-F-G)              1                 3.420

• In steps 1, 2, 3 and 4, the overall similarity measure does not change substantially, which
indicates that we are forming other clusters with essentially the same heterogeneity as the
existing clusters.
• When we get to step 5, we see a large increase. This indicates that joining clusters (B-C-D)
and (E-F-G) resulted in a single cluster that was markedly less homogeneous.
How many groups do we form?
Therefore, the three-cluster solution of Step 4 seems the most appropriate for a
final cluster solution, with two equally sized clusters, (B-C-D) and (E-F-G), and a
single outlying observation (A).

This approach is particularly useful in identifying outliers, such as Observation A. It


also depicts the relative size of varying clusters, although it becomes unwieldy
when the number of observations increases.
Graphical Portrayals: Dendrogram
A dendrogram is a graphical representation (tree graph) of the results of a hierarchical
procedure, starting with each object as a separate cluster.
The dendrogram shows graphically how the clusters are combined at each step of the
procedure until all are contained in a single cluster.
Clustering in SPSS
To select these procedures using SPSS for Windows click:

Analyze>Classify>Hierarchical Cluster …

Analyze>Classify>K-Means Cluster …

Analyze>Classify>Two-Step Cluster …
SPSS Windows: Hierarchical Clustering
1. Select ANALYZE from the SPSS menu bar.
2. Click CLASSIFY and then HIERARCHICAL CLUSTER.
3. Move “Fun [v1],” “Bad for Budget [v2],” “Eating Out [v3],” “Best Buys [v4],” “Don’t Care [v5],”
and “Compare Prices [v6]” into the VARIABLES box.
4. In the CLUSTER box check CASES (default option). In the DISPLAY box check STATISTICS and
PLOTS (default options).
5. Click on STATISTICS. In the pop-up window, check AGGLOMERATION SCHEDULE. In the
CLUSTER MEMBERSHIP box check RANGE OF SOLUTIONS. Then, for MINIMUM NUMBER OF
CLUSTERS: enter 2 and for MAXIMUM NUMBER OF CLUSTERS enter 4. Click CONTINUE.
6. Click on PLOTS. In the pop-up window, check DENDROGRAM. In the ICICLE box check ALL
CLUSTERS (default). In the ORIENTATION box, check VERTICAL. Click CONTINUE.
7. Click on METHOD. For CLUSTER METHOD select WARD’S METHOD. In the MEASURE box check
INTERVAL and select SQUARED EUCLIDEAN DISTANCE. Click CONTINUE
8. Click OK.
SPSS Windows: K-Means
Clustering
1. Select ANALYZE from the SPSS menu bar.
2. Click CLASSIFY and then K-MEANS CLUSTER.
3. Move “Fun [v1],” “Bad for Budget [v2],” “Eating Out [v3],” “Best Buys [v4],”
“Don’t Care [v5],” and “Compare Prices [v6]” into the VARIABLES box.
4. For NUMBER OF CLUSTERS select 3.
5. Click on OPTIONS. In the pop-up window, In the STATISTICS box, check
INITIAL CLUSTER CENTERS and CLUSTER INFORMATION FOR EACH CASE. Click
CONTINUE.
6. Click OK.
SPSS Windows: Two-Step
Clustering
1. Select ANALYZE from the SPSS menu bar.
2. Click CLASSIFY and then TWO-STEP CLUSTER.
3. Move “Fun [v1],” “Bad for Budget [v2],” “Eating Out [v3],” “Best Buys [v4],”
“Don’t Care [v5],” and “Compare Prices [v6]” into the CONTINUOUS
VARIABLES box.
4. For DISTANCE MEASURE select EUCLIDEAN.
5. For NUMBER OF CLUSTERS select DETERMINE AUTOMATICALLY.
6. For CLUSTERING CRITERION select AKAIKE’S INFORMATION CRITERION (AIC).
7. Click OK.
