
Cluster Analysis

-Prof. Chitvan Mehrotra


What is Cluster Analysis?
 Cluster Analysis is a multivariate interdependence technique whose primary objective is to classify objects into relatively homogeneous groups.

 A cluster is a collection of data objects:

 Similar (cohesive) to one another within the same cluster (high intra-class similarity)

 Dissimilar to the objects in other clusters (low inter-class similarity)


 Cluster Analysis involves finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups

(Figure: intra-cluster distances are minimized; inter-cluster distances are maximized.)
Examples of Clustering Applications
 Medicine – What are the diagnostic clusters? To answer this question the researcher
would devise a diagnostic questionnaire that includes possible symptoms (for
example, in psychology, anxiety, depression etc.). The cluster analysis can then
identify groups of patients that have similar symptoms.

 Marketing – What are the customer segments? To answer this question a market
researcher may conduct a survey covering needs, attitudes, demographics, and
behavior of customers. The researcher may then use cluster analysis to identify
homogeneous groups of customers that have similar needs and attitudes.

 Education – What are student groups that need special attention? Researchers may
measure psychological, aptitude, and achievement characteristics. A cluster analysis
then may identify what homogeneous groups exist among students (for example, high
achievers in all subjects, or students that excel in certain subjects but fail in others).

 Fraud Detection – Combinations of rules are used to explore 'fraudulent' and 'non-fraudulent' transactions and group them into clusters. These clusters are then used to train a supervised machine learning algorithm that classifies new records as either fraudulent or non-fraudulent.
Clustering procedures
 Hierarchical procedures – a tree-like structure for understanding the levels of observation
 – Agglomerative (start from n clusters to get to 1 cluster)
 – Divisive (start from 1 cluster to get to n clusters)

 Non-hierarchical procedures – a centroid is chosen and the distance from the centroid is used to form clusters
 – K-means clustering
Concept of Euclidean Distance
 The objective of clustering is to group similar objects together.
 Euclidean distance (based on the Pythagorean theorem) is used to assess how similar or different the objects are.
 It measures similarity in terms of the distance between pairs of objects.
 Objects with a smaller distance between them are more similar to each other than those with a larger distance (a small numeric example follows below).
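
To make this concrete, here is a minimal sketch in Python using NumPy; the two score vectors are invented purely for illustration.

```python
# Euclidean distance between two hypothetical objects measured on four variables.
import numpy as np

a = np.array([22.0, 15.0, 8.0, 11.0])   # scores for object A (made-up values)
b = np.array([25.0, 13.0, 9.0, 14.0])   # scores for object B (made-up values)

# Pythagoras in n dimensions: square root of the sum of squared differences
distance = np.sqrt(np.sum((a - b) ** 2))
print(distance)   # the smaller the distance, the more similar the two objects
```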
SPSS Example 1 – Psychiatric Treatment
 We wanted to look at clusters of cases referred for psychiatric
treatment.
 We measured each subject on four questionnaires: Spielberger Trait
Anxiety Inventory (STAI), the Beck Depression Inventory (BDI), a
measure of Intrusive Thoughts and Rumination (IT) and a measure of
Impulsive Thoughts and Actions (Impulse).
 The rationale behind this analysis is that people with the same
disorder should report a similar pattern of scores across the measures
(so the profiles of their responses should be clustered).
 To check the analysis, trained psychologists were asked to agree on a diagnosis for each case based on the DSM-IV (GAD – Generalized Anxiety Disorder, DEP – Depression, OCD – Obsessive-Compulsive Disorder)
SPSS Commands
 First perform a hierarchical method to define the number of clusters.
 Then use the k-means procedure to actually form the clusters (a rough scripted equivalent of the hierarchical step is sketched after this list).
 Analyze/Classify/Hierarchical Cluster Analysis
 Select the four diagnostic questionnaires from the list on the left-
hand side and drag them to the box labelled Variables.
 Statistics (Agglomeration Schedule)
 Plots (Dendrogram, Vertical)
 Method (Ward's Method; Squared Euclidean Distance; Standardize – Z Scores, By Variables)
 Save (Single Solution; Number of clusters = 3)
 Ok
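
For readers who want to reproduce the hierarchical step outside SPSS, here is a minimal sketch using Python's scipy; the file name and the variable names (stai, bdi, it, impulse) are assumptions for illustration, not part of the original example.

```python
# Hierarchical (agglomerative) clustering: Ward's method on z-scored variables.
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import zscore

raw = pd.read_csv("psychiatric_test.csv")                              # hypothetical data file
scores = raw[["stai", "bdi", "it", "impulse"]].apply(zscore, ddof=1)   # standardize by variables

merge_history = linkage(scores, method="ward")                   # analogue of the agglomeration schedule
membership = fcluster(merge_history, t=3, criterion="maxclust")  # save a 3-cluster solution

print(merge_history[:5])   # first merge steps: items joined and the fusion distance
print(membership)          # cluster number assigned to each case
```

The fusion distances scipy reports are on a slightly different scale from SPSS's squared Euclidean coefficients, but the merge order, and hence the resulting clusters, should be the same.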
Interpretation 1
 The 'Stage Cluster First Appears' columns show the stage at which each cluster first appears; zeroes indicate single cases that existed before the analysis.
 The Coefficients column indicates the Euclidean distance between the two clusters (or cases) joined at each stage.
 Reading from the bottom to the top, look for the biggest jump in the coefficients. The largest jump (56 - 19 = 37) corresponds to a one-cluster solution, which is not preferred, so the next biggest jump, between stage 13 and stage 12 (19 - 4 = 15), is used; stage 13 is therefore where we should stop.
 Number of clusters = 2
Agglomeration Schedule

This table shows how the cases are clustered together at each stage of the cluster analysis.

Stage   Cluster Combined        Coefficients   Stage Cluster First Appears   Next Stage
        Cluster 1   Cluster 2                  Cluster 1   Cluster 2
  1         14          16          2.000          0           0                 3
  2          6           7          2.000          0           0                 7
  3         10          14          2.667          0           1                 8
  4          2          13          3.000          0           0                15
  5          5          11          3.000          0           0                 9
  6          3           8          3.000          0           0                16
  7          6          12          3.333          2           0                10
  8          4          10          3.500          0           3                13
  9          5           9          4.000          5           0                11
 10          1           6          4.167          0           7                12
 11          5          20          5.667          9           0                15
 12          1          17          5.800         10           0                14
 13          4          19          6.000          8           0                17
 14          1          15          7.933         12           0                16
 15          2           5          8.200          4          11                18
 16          1           3          9.714         14           6                19
 17          4          18         10.067         13           0                18
 18          2           4         25.212         15          17                19
 19          1           2         34.589         16          18                 0
For instance, in this example, cases 14 and 16 are joined at stage 1. This is shown in the Cluster Combined columns. When clusters or cases are joined, they are subsequently labelled with the smaller of the two cluster numbers.
The Coefficients column indicates the Euclidean distance between the two clusters (or cases) joined at each stage.
STEPS TO FIND THE NUMBER OF CLUSTERS USING THE AGGLOMERATION SCHEDULE

• The agglomeration schedule shows the order in which cases or clusters combine with each other. Reading upwards from the bottom two rows, find the maximum difference between the coefficients at consecutive stages. The last row corresponds to a one-cluster solution, the row before it to a two-cluster solution, and so on.
• Wherever the maximum difference between coefficients occurs, the lower of the two rows indicates the number of clusters (a small sketch of this rule follows below).
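
Continuing the hypothetical scipy sketch from earlier, the same "biggest jump" rule can be applied to the linkage output; the helper below is illustrative only.

```python
# Suggest a number of clusters from the biggest jump in the fusion coefficients.
import numpy as np

def suggest_n_clusters(merge_history):
    coefficients = merge_history[:, 2]        # fusion distance at each stage of the linkage matrix
    jumps = np.diff(coefficients)             # increase in the coefficient from stage to stage
    stop_stage = int(np.argmax(jumps)) + 1    # 1-based stage just before the largest jump
    n_cases = merge_history.shape[0] + 1
    return n_cases - stop_stage               # stopping after that stage leaves this many clusters

print(suggest_n_clusters(merge_history))      # merge_history from the hierarchical sketch
```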
For a good cluster solution, you will see a sudden jump in the distance coefficient (or a sudden drop in the similarity coefficient), reading from the bottom to the top. Here the stopping stage is 18, which means a two-cluster solution.
The next part of the table shows the stage at which each cluster first appears. Single cases existed before we started the analysis, so they are indicated by zeroes here.
For example, consider case 14 at stage 1 and stage 3: the 1 shown at stage 3 indicates that cluster 14 had already appeared (it was formed at stage 1).
The last column (Next Stage) shows the subsequent stage at which the newly merged cluster is combined with yet another cluster. For example, consider case 14 at stages 1 and 3.
Interpretation 2
 The dendrogram additionally provides a rescaled distance measure between the various clusters combined at the various stages.
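
A minimal sketch of producing the equivalent plot from the earlier (hypothetical) linkage output:

```python
# Draw the dendrogram for the hierarchical solution computed above.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

dendrogram(merge_history)          # merge_history from the hierarchical sketch
plt.xlabel("Case number")
plt.ylabel("Fusion distance")
plt.show()
```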
SPSS – K-means Cluster Analysis
 K-means performs the clustering, assigns each respondent to a cluster, and describes the clusters based on the dimensions (a rough scripted equivalent is sketched after this list).
 Open the Psychiatric Test – CA file.
 Analyze/Classify/K-Means Cluster Analysis
 Move the clustering variables into the Variables box; Number of Clusters = 3 (from the earlier hierarchical clustering)
 Iterate (number of iterations = 99)
 Save (Cluster Membership, Distance from Cluster Centers)
 Options (Initial Clusters, ANOVA Table, Exclude Cases Pairwise)
 OK
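
As a rough scripted counterpart of this step, scikit-learn's KMeans can be run on the same hypothetical z-scored variables used in the hierarchical sketch; the settings below only approximate the SPSS options.

```python
# K-means with k = 3 (taken from the hierarchical solution) and up to 99 iterations.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, max_iter=99, n_init=10, random_state=0)
labels = kmeans.fit_predict(scores)     # scores: the z-scored variables from the earlier sketch

print(kmeans.cluster_centers_)          # final cluster centers (mean of each variable per cluster)
print(kmeans.n_iter_)                   # how many iterations were needed to converge
print(labels)                           # saved cluster membership for each case
```

SPSS seeds the procedure with k well-spaced cases as initial centers; n_init and random_state here simply make the sketch reproducible rather than replicating that choice.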
Interpretation 1
 The initial cluster centers are the variable values of the k well-spaced observations. (A cluster center is the arithmetic mean of all the points belonging to that cluster.)

• The iteration history shows the progress of the clustering process at each step.
• In early iterations, the cluster centers shift.
• By the 3rd iteration, the cluster centers have converged; the process stops when there is no further change in the cluster centers.
Interpretation 2
 Each case is allotted to one of the clusters here. With the help of this table we can see how many cases fall in each cluster (the cluster size).
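
Continuing the sketch, the cluster sizes can be read off the saved membership:

```python
# Number of cases in each cluster (labels from the k-means sketch).
import numpy as np

cluster_ids, sizes = np.unique(labels, return_counts=True)
for cluster_id, size in zip(cluster_ids, sizes):
    print(f"Cluster {cluster_id + 1}: {size} cases")   # +1 to mimic SPSS's 1-based numbering
```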
Interpretation 3
 The ANOVA table indicates which variables contribute the most to your cluster solution. If the 'Sig.' value for a variable is less than 0.05, that variable's contribution to differentiating the clusters is significant.
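
The same per-variable check can be sketched with one-way ANOVAs across the clusters (again continuing the hypothetical example):

```python
# One-way ANOVA of each variable across the k-means clusters.
from scipy.stats import f_oneway

for variable in scores.columns:                       # scores, labels from the sketches above
    groups = [scores.loc[labels == k, variable] for k in sorted(set(labels))]
    f_value, p_value = f_oneway(*groups)
    print(f"{variable}: F = {f_value:.2f}, p = {p_value:.4f}")
```

Keep in mind that these F tests are descriptive only, because the clusters were chosen precisely to maximise the differences between cases in different clusters; SPSS notes this beneath its ANOVA table.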
Interpretation 4

The final cluster centers are computed as the mean for each variable within each final
cluster. The final cluster centers reflect the characteristics of the typical case for each
cluster.
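
A one-line sketch of the same computation on the hypothetical data:

```python
# Final cluster centers: per-cluster mean of each variable.
print(scores.groupby(labels).mean())    # scores, labels from the earlier sketches
```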
Interpretation 5
 Cluster size – the number of cases in each cluster.
Thank you.
