Cluster Analysis: Hierarchical Agglomerative Cluster Analysis and Use of a Created Cluster Variable in Secondary Analysis

Cluster analysis is a technique used to group subjects into clusters based on similarities across multiple variables. It involves creating a distance matrix to quantify the similarity between each pair of subjects, then using clustering algorithms to sort subjects into groups that are as internally homogeneous as possible but distinctly different from other groups. There are many options for measuring distances, clustering algorithms, and determining the optimal number of clusters, and results can vary, so there is no single best approach to cluster analysis.


Cluster Analysis

• Hierarchical agglomerative cluster analysis
• Use of a created cluster variable in secondary analysis

Cluster Analysis: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University

KEY CONCEPTS
Cluster Analysis

Research questions addressed by cluster analysis


Cluster analysis assumptions
Alternative names for cluster analysis
Caveats in using cluster analysis
Similarity/dissimilarity matrix, also called a distance matrix
• Squared Euclidean distance
• Euclidean distance
• Cosine of vector variables
• City block (Manhattan distance)
• Chebychev distance metric
• Distances in absolute power metric
• Pearson product-moment correlation coefficient
• Minkowski metric
• Mahalanobis D2
• Jaccard's coefficient(s)
• Gower's coefficient
• Simple matching coefficient
Cluster-seeking vs. cluster-imposing methods
Clustering algorithms
• Hierarchical Methods
Agglomerative Methods
Single linkage (nearest neighbor)
Complete linkage (furthest neighbor)
Average linkage
Ward's error sum of squares
Centroid method
Median clustering
Divisive Methods
K-means clustering
Trace methods
A Splinter-Average Distance method
Automatic Interaction Detection (AID)
• Non-Hierarchical Methods
Iterative Methods
Sequential threshold method
Parallel threshold method
Optimizing methods

KEY CONCEPTS (CONT.)


Factor Analysis
Q-Analysis
Density Methods
Multivariate probability approaches
(NORMIX, NORMAP)
Clumping Methods
Graphic Methods
Glyphs & Metroglyphs
Fourier Series
Chernoff Faces
Agglomeration Schedule
Fusion coefficient
Alternative ways to determine the optimal number of clusters
Criteria: clusters as internally homogeneous and significantly different from each other
Dendrogram
Scaled distance
Cluster scores
Profiling clusters
Using a cluster variable as an IV or DV in secondary analysis
Sokal, Robert & Sneath, Peter, Principles of Numerical Taxonomy (1963)
Steps in cluster analysis
Variable selection, construction of data base, testing assumptions
Selecting measure of similarity/distance
Selecting clustering algorithm
Determining number of clusters
Profile clusters
Validation


Cluster Analysis

Interdependency Technique

• Designed to group a sample of subjects into significantly different groups, based upon a number of variables
• The groups are constructed to be as different from one another as statistically possible, and as internally homogeneous as statistically possible

Assumptions

• The sample needs to be representative of the population
• Multicollinearity among the variables should be minimal
• Absence of outliers and a good N-to-k ratio


Cluster Analysis by Other Names

Similar techniques have been developed independently in various fields (e.g. biology, archeology, etc.), giving rise to different names for this statistical technique:

• Cluster Analysis
• Numerical Taxonomy
• Q-Analysis
• Typology Analysis
• Classification Analysis

There are a number of different clustering techniques, depending upon …

• The procedure used to measure the similarity or distance among subjects
• And the clustering algorithm used


Caveats in Using Cluster Analysis

• There is no one best way to perform a cluster analysis
• There are many methods, and most lack rigorous statistical reasoning or proofs
• Cluster analysis is used in different disciplines, which favor different techniques for measuring the similarity or distance among subjects relative to the variables, and for the clustering algorithm used
• Different clustering techniques can produce different cluster solutions
• Cluster analysis is supposed to be "cluster-seeking", but in fact it is "cluster-imposing"


Applications of Cluster Analysis

Cluster analysis seeks to reduce a sample of cases to a few statistically different groups, i.e. clusters, based upon differences/similarities across a set of multiple variables.

A useful tool for constructing typologies among cases.

Example

Is each case filed with the court unique, or can cases be sorted into distinctly different types based upon the amount of the evidence, quality of the defense, complexity of the charges, etc.?

Example

Is a murder a murder, or can cases be sorted into distinctively different types on the basis of victim/offender characteristics, circumstances, motives, etc.?


The Logic of Cluster Analysis

Step 1 Cluster analysis begins with an N x k database

Step 2 Using one of several methods, an N x N matrix is created that indicates the similarity (or dissimilarity) of every case to every other case, based on the k number of variables

Matrix of Dissimilarities

Subjects        1         2         3       …        N
   1                    1.782     2.538     …     47.236
   2          1.782               0.821     …     39.902
   3          2.538     0.821               …     41.652
   …            …         …         …       …        …
   N         47.236    39.902    41.652     …
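As a sketch of Step 2, SciPy's distance utilities can build such an N x N matrix; the data below are hypothetical, not the slides' values:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical N x k database: 4 subjects measured on 3 variables
X = np.array([[1.0, 2.0, 3.0],
              [1.5, 2.5, 2.0],
              [8.0, 7.0, 6.0],
              [7.5, 6.0, 5.5]])

# pdist returns the condensed pairwise distances; squareform expands
# them into the symmetric N x N matrix of dissimilarities
D = squareform(pdist(X, metric="sqeuclidean"))
print(D.shape)  # (4, 4); the diagonal is zero
```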


The Logic of Cluster Analysis (cont.)

Step 3 Using one of several clustering algorithms, the subjects are sorted into significantly different groups where …

• The subjects within each group are as homogeneous as possible, and …
• The groups are as different from one another as possible


Measures of Similarity or Difference

Cluster analysis begins by creating a matrix indicating the similarity between (or the distance between) each pair of subjects relative to the k variables in the database.

There are a number of ways that this can be done.

Squared Euclidean Distance *             Pearson Correlation Coefficient *
Euclidean Distance *                     Mahalanobis D2 *
Cosine of Vector Variables *             Minkowski Metric *
City Block or Manhattan Distances *      Jaccard's Coefficient
Chebychev Distance Metric *              Gower's Coefficient
Distances in the Absolute Power Metric   Simple Matching Coefficient

* Available in SPSS


An Example of Squared Euclidean Distances

Variable   Subject 1   Subject 2   (Si - Sj)   (Si - Sj)2

X1            18          19          -1           1
X2            15          17          -2           4
X3             9          10          -1           1
X4            12          10          +2           4
X5             0           1          -1           1
X6             1           1           0           0
X7             9           8          +1           1

Totals        NA          NA          NA          12

Squared Euclidean Distance = Σ (Si - Sj)2 = 12
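The same arithmetic takes a few lines of Python, reusing the two subjects' values from the table above:

```python
# Squared Euclidean distance between the two subjects from the table
s1 = [18, 15, 9, 12, 0, 1, 9]   # Subject 1
s2 = [19, 17, 10, 10, 1, 1, 8]  # Subject 2

# Sum the squared differences (Si - Sj)^2 across the k = 7 variables
sq_euclidean = sum((a - b) ** 2 for a, b in zip(s1, s2))
print(sq_euclidean)  # 12, matching the worked example
```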


A Variety of Clustering Algorithms

• There is no proven best way to cluster subjects into homogeneous groups
• Different techniques have been developed in different fields based upon different logics (e.g. biology, archeology, etc.)
• Given the same database, similar clustering results can be achieved using different clustering algorithms, but not always
• Clustering algorithms are generally classified into two broad types …

Hierarchical methods
Non-hierarchical methods


Hierarchical Clustering Algorithms

Agglomerative Methods                    Divisive Methods
Single Linkage (Nearest Neighbor) *      K-Means Clustering *
Complete Linkage (Furthest Neighbor) *   Trace Methods
Average Linkage *                        Splinter-Average Distance Method
Ward's Error Sum of Squares *            Automatic Interaction Detection (AID)
Centroid Method *
Median Clustering

* Available in SPSS


Non-hierarchical Clustering Algorithms

Iterative Methods
• Sequential Threshold Method
• Parallel Threshold Method
• Optimization Methods

Factor Analysis
• Q-Factor Analysis

Density Methods
• Multivariate Probability Approaches (NORMIX, NORMAP)

Clumping Methods

Graphic Methods
• Glyphs & Metroglyphs
• Fourier Series
• Chernoff Faces


An Example of a Clustering Algorithm

Ward's Error Sum of Squares Algorithm

Imagine that data on seven variables (Xk) were gathered on 70 subjects (N).

Imagine further that a dissimilarity matrix was constructed indicating the differences among all pairs of subjects, using squared Euclidean distances.

Step 1 Ward's algorithm begins with each of the 70 subjects in its own cluster

Step 2 Next it finds the two subjects that are most similar and creates a cluster with two subjects

• Now there are 69 clusters: one with two subjects, and 68 with one subject each

Step 3 Now it finds the next two most similar subjects and creates a second two-subject cluster

• Now there are 68 clusters: two with two subjects each, and 66 with one subject each


An Example of a Clustering Algorithm

Ward's Error Sum of Squares Algorithm (cont.)

As Ward's algorithm progresses, it will begin to combine single subjects into pre-existing clusters, and then begins to combine one pre-existing cluster with another.

This process continues until all 70 subjects are finally combined into one cluster.

Ward's algorithm forms clusters by selecting that subject (or that cluster, if combining clusters) which minimizes the within-cluster sum of squares (i.e. the error sum of squares).
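This agglomeration loop is what SciPy's `linkage` routine implements; a minimal sketch, using simulated data as a stand-in for the 70-subject example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(70, 7))  # simulated stand-in for the 70 x 7 database

# method="ward": each merge is chosen to minimize the increase in the
# within-cluster (error) sum of squares
Z = linkage(X, method="ward")
print(Z.shape)  # (69, 4): one row per merge, 70 - 1 = 69 stages
```

The last row of `Z` is the final merge combining everything into a single 70-case cluster.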


A Seven-Variable Example of Cluster Analysis

The database: 70 subjects and 7 variables

The variables:
• Sentence in years: sentence
• Number of prior convictions: pr_conv
• Degree of drug dependency: dr_score
• Age: age
• Age at first arrest: age_firs
• Educational equivalency: educ_eqv
• Level of work skill: skl_indx


Steps in the Cluster Analysis

Step 1 Transform the seven variables to standard scores, i.e. Z-scores

Step 2 Create a dissimilarity matrix using squared Euclidean distances

Squared Euclidean Distances

Subjects        1         2         3       …       70
   1                    1.782     2.538     …     47.236
   2          1.782               0.821     …     39.902
   3          2.538     0.821               …     41.652
   …            …         …         …       …        …
  70         47.236    39.902    41.652     …
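Steps 1 and 2 can be sketched with SciPy; the 70 x 7 data here are simulated, not the worked example's actual values:

```python
import numpy as np
from scipy.stats import zscore
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(42)
X = rng.normal(size=(70, 7))   # simulated 70 x 7 database

# Step 1: standardize each variable so that no variable dominates the
# distances simply because of its measurement scale
Xz = zscore(X, axis=0)

# Step 2: the 70 x 70 matrix of squared Euclidean distances
D = squareform(pdist(Xz, metric="sqeuclidean"))
print(D.shape)  # (70, 70)
```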


Steps in the Cluster Analysis (cont.)

Step 3 Use Ward's algorithm to cluster the 70 subjects, beginning with 70 clusters of one subject each and terminating with one cluster containing all 70 subjects

Agglomeration Schedule

Stage   Cluster 1   Cluster 2   Coefficient   First Appears: Cluster 1   Cluster 2   Next Stage

1 62 63 .255 0 0 40
2 31 33 .610 0 0 37
3 2 3 1.021 0 0 43
4 7 8 1.502 0 0 31
5 29 30 1.984 0 0 45
6 14 15 2.495 0 0 31
7 52 67 3.031 0 0 34
8 18 19 3.588 0 0 49
9 46 47 4.191 0 0 35
10 27 28 4.803 0 0 44
11 36 40 5.437 0 0 33
12 9 13 6.095 0 0 49
13 48 49 6.760 0 0 51
14 32 38 7.435 0 0 42
15 20 21 8.128 0 0 39
16 22 64 8.844 0 0 39
17 35 39 9.580 0 0 52
18 5 12 10.324 0 0 36
19 23 24 11.093 0 0 29
20 57 59 11.878 0 0 32
21 37 43 12.702 0 0 42
22 6 10 13.551 0 0 55
23 1 4 14.439 0 0 28
24 11 45 15.358 0 0 46
25 41 44 16.284 0 0 33
26 55 56 17.220 0 0 41
27 51 66 18.237 0 0 48


28 1 50 19.329 23 0 47
29 17 23 20.483 0 19 38
30 54 69 21.732 0 0 41
31 7 14 23.076 4 6 46
32 57 58 24.425 20 0 53
33 36 41 25.784 11 25 40
34 52 53 27.173 7 0 51
35 42 46 28.626 0 9 58
36 5 16 30.251 18 0 54
37 31 34 32.018 2 0 62
38 17 68 33.905 29 0 59
39 20 22 35.806 15 16 57
40 36 62 37.855 33 1 56
41 54 55 39.918 30 26 50
42 32 37 42.118 14 21 52
43 2 65 44.428 3 0 47
44 25 27 46.758 0 10 45
45 25 29 49.344 44 5 59
46 7 11 52.395 31 24 54
47 1 2 55.709 28 43 63
48 26 51 59.223 0 27 61
49 9 18 62.772 12 8 57
50 54 70 66.383 41 0 65
51 48 52 70.076 13 34 60
52 32 35 73.798 42 17 58
53 57 60 77.659 32 0 65
54 5 7 81.736 36 46 55
55 5 6 86.189 54 22 64
56 36 61 90.955 40 0 66
57 9 20 97.853 49 39 60
58 32 42 105.430 52 35 62
59 17 25 114.736 38 45 67
60 9 48 125.105 57 51 61
61 9 26 136.517 60 48 63
62 31 32 150.461 37 58 68
63 1 9 167.695 47 61 64
64 1 5 194.756 63 55 66
65 54 57 222.045 50 53 67
66 1 36 258.210 64 56 68
67 17 54 298.955 59 65 69
68 1 31 361.556 66 62 69
69 1 17 483.000 68 67 0


Interpretation of the Agglomeration Schedule

Stage 1 Cases 62 and 63 are combined into a cluster. Now there is one cluster with two cases and 68 clusters with one case each, for 69 total clusters (70 - 1 = 69).

Coefficient The squared Euclidean distance over which these two cases were joined = 0.255, called a fusion coefficient.

Next Stage The next stage at which one of these cases is joined to a cluster is Stage 40, when case 62 is joined to case 36.

Stage 33 Cases 36 and 41 are joined together over a distance = 25.784. At this stage 37 clusters remain (70 - 33 = 37).

Stage Cluster First Appears

Cluster 1 Notice that case 36 was previously joined with case 40 at Stage 11.

Cluster 2 Again, notice that case 41 was previously joined with case 44 at Stage 25.
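SciPy's linkage matrix records the same bookkeeping as SPSS's agglomeration schedule (which clusters were combined, the fusion coefficient, and the size of the new cluster); a small simulated sketch:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 2))   # six hypothetical cases

Z = linkage(X, method="ward")
# Each row of Z is one stage of the agglomeration schedule:
# [cluster joined, cluster joined, fusion coefficient, cases in new cluster]
print(Z[0, 3])  # 2.0: the first stage always joins two single cases
```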


Interpretation of the Agglomeration Schedule (cont.)

Next Stage The next stage at which one of these cases is joined to a cluster is Stage 40, when case 36 is joined with case 62.

Stage 69 Case 1 is joined with case 17 at a squared Euclidean distance of 483.0, clearly joining two clusters that are very dissimilar.

At Stage 69 all 70 cases have been included in a single cluster. Obviously this one cluster is heterogeneous, containing many very dissimilar cases.


How Do You Determine the Optimal Number of Clusters in the Final Solution?

In this example, Ward's algorithm yields solutions ranging from 70 clusters with one case each to one cluster containing all 70 cases.

Somewhere between these two extremes is an optimal number of clusters which best satisfies the following conditions …

• The clusters are as internally homogeneous as possible (i.e. minimum within-cluster sum of squares)
• And the various clusters are as different from one another as possible

Determining the optimal number of clusters:

• Theory about the number of underlying groups
• Ease of profiling the groups
• Magnitude of change in the fusion coefficient
• Dendrogram with rescaled distance measure
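The "magnitude of change in the fusion coefficient" heuristic can be sketched as follows, with two deliberately well-separated simulated groups so that the jump is obvious:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two well-separated hypothetical groups of 10 cases each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(10, 2)),
               rng.normal(5.0, 0.5, size=(10, 2))])

Z = linkage(X, method="ward")
heights = Z[:, 2]          # fusion coefficients, one per stage
jumps = np.diff(heights)   # change between successive stages
# The largest jump marks the stage where dissimilar clusters are being
# forced together; stop just before that merge
print(int(np.argmax(jumps)))
```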


What is a Dendrogram?
* * * * * * H I E R A R C H I C A L C L U S T E R A N A L Y S I S * * * * * *

Dendrogram using Ward Method

Rescaled Distance Cluster Combine

C A S E 0 5 10 15 20 25
Label Num +---------+---------+---------+---------+---------+

Case 62 62 -+
Case 63 63 -+
Case 36 36 -+
Case 40 40 -+-------------+
Case 41 41 -+ |
Case 44 44 -+ |
Case 61 61 -+ |
Case 6 6 -+ |
Case 10 10 -+ |
Case 5 5 -+---------+ +---------+
Case 12 12 -+ | | |
Case 16 16 -+ | | |
Case 11 11 -+ | | |
Case 45 45 -+ | | |
Case 7 7 -+ | | |
Case 8 8 -+ | | |
Case 14 14 -+ +---+ |
Case 15 15 -+ | |
Case 1 1 -+ | |
Case 4 4 -+ | |
Case 50 50 -+-----+ | |
Case 2 2 -+ | | |
Case 3 3 -+ | | |
Case 65 65 -+ | | |
Case 51 51 -+ +---+ |
Case 66 66 -+---+ | |
Case 26 26 -+ | | +-----------------------+
Case 48 48 -+ | | | |
Case 49 49 -+---+-+ | |
Case 52 52 -+ | | |
Case 67 67 -+ | | |
Case 53 53 -+ | | |
Case 20 20 -+ | | |
Case 21 21 -+-+ | | |
Case 22 22 -+ | | | |
Case 64 64 -+ +-+ | |
Case 18 18 -+ | | |
Case 19 19 -+-+ | |
Case 9 9 -+ | |
Case 13 13 -+ | |
Case 31 31 -+ | |
Case 33 33 -+---+ | |
Case 34 34 -+ | | |
Case 46 46 -+ +-------------------+ |
Case 47 47 -+-+ | |
Case 42 42 -+ +-+ |
Case 35 35 -+ | |
Case 39 39 -+-+ |
Case 32 32 -+ |
Case 38 38 -+ |
Case 37 37 -+ |
Case 43 43 -+ |
Case 23 23 -+ |
Case 24 24 -+ |
Case 17 17 -+-+ |
Case 68 68 -+ +-------------+ |
Case 29 29 -+ | | |
Case 30 30 -+-+ | |
Case 27 27 -+ | |
Case 28 28 -+ | |
Case 25 25 -+ +-------------------------------+
Case 55 55 -+ |
Case 56 56 -+ |
Case 54 54 -+---------+ |
Case 69 69 -+ | |
Case 70 70 -+ +-----+
Case 57 57 -+ |
Case 59 59 -+ |
Case 58 58 -+---------+
Case 60 60 -+


What is a Dendrogram? (cont.)

The Scaled Distance

• The fusion coefficient transformed to a scale ranging from 0 to 25

The Dendrogram

• The dendrogram shows which cases were joined together into clusters and at what distance, and at later stages, which clusters were joined together into larger clusters, and at what distance

Interpretation

• The point at which the "foothills" become the "mountain peaks" is probably the optimal number of clusters

Optimal Number of Clusters

• A five-cluster solution appears about optimal


Computing a Five-Cluster Solution

Having hypothesized that a five-cluster solution may be optimal …

• The next step is to compute a five-cluster solution and …
• Save the cluster scores

Cluster scores

• In this case, a cluster score is a number between 1 and 5 assigned to each case, indicating the cluster to which that particular case has been assigned

5-Cluster Solution

• This is accomplished by repeating the cluster analysis and specifying that five clusters are to be extracted and the cluster scores saved
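With SciPy, extracting and saving the five cluster scores amounts to cutting the hierarchical tree at five clusters; a sketch on simulated data standing in for the worked example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = rng.normal(size=(70, 7))   # simulated stand-in for the 70 x 7 database

Z = linkage(X, method="ward")
# Cut the tree so that five clusters remain; each of the 70 cases
# receives a cluster score between 1 and 5
scores = fcluster(Z, t=5, criterion="maxclust")
print(sorted(set(scores)))  # [1, 2, 3, 4, 5]
```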


Saved Cluster Scores

Case    Cluster
1.0 1
2.0 1
3.0 1
4.0 1
5.0 1
6.0 1
7.0 1
8.0 1
9.0 1
10.0 1
11.0 1
12.0 1
13.0 1
14.0 1
15.0 1
16.0 1
17.0 2
18.0 1
19.0 1
20.0 1
21.0 1
22.0 1
23.0 2
24.0 2
25.0 2
26.0 1
27.0 2



46.0 3
47.0 3
48.0 1
49.0 1
50.0 1
51.0 1
52.0 1
53.0 1
54.0 5
55.0 5
56.0 5
57.0 5
58.0 5
59.0 5
60.0 5
61.0 4
62.0 4
63.0 4
64.0 1
65.0 1
66.0 1
67.0 1
68.0 2
69.0 5
70.0 5


Profiling the Five Clusters

One way to profile the characteristics of the five clusters is to compute the means of the seven variables for each of the five clusters.

Ward Method

Cluster 1
+------------------------+-----------+
| | Mean |
+------------------------+-----------+
|SENTENCE | 4.6 |
| | |
|PR_CONV | 1.5 |
| | |
|DR_SCORE | 7.5 |
| | |
|AGE | 21.6 |
| | |
|AGE_FIRS | 16.2 |
| | |
|EDUC_EQV | 7.3 |
| | |
|SKL_INDX | 6.0 |
+------------------------+-----------+

Ward Method

Cluster 2
+------------------------+-----------+
| | Mean |
+------------------------+-----------+
|SENTENCE | 7.3 |
| | |
|PR_CONV | 4.8 |
| | |
|DR_SCORE | 5.7 |
| | |
|AGE | 24.7 |
| | |
|AGE_FIRS | 14.4 |
| | |
|EDUC_EQV | 3.4 |
| | |
|SKL_INDX | 2.8 |
+------------------------+-----------+


Profiling the Five Clusters (cont.)


Ward Method

Cluster 3
+------------------------+-----------+
| | Mean |
+------------------------+-----------+
|SENTENCE | 2.4 |
| | |
|PR_CONV | .9 |
| | |
|DR_SCORE | 3.3 |
| | |
|AGE | 21.3 |
| | |
|AGE_FIRS | 19.3 |
| | |
|EDUC_EQV | 3.3 |
| | |
|SKL_INDX | 2.5 |
+------------------------+-----------+

Ward Method

Cluster 4
+------------------------+-----------+
| | Mean |
+------------------------+-----------+
|SENTENCE | 3.1 |
| | |
|PR_CONV | .9 |
| | |
|DR_SCORE | 3.0 |
| | |
|AGE | 20.6 |
| | |
|AGE_FIRS | 19.0 |
| | |
|EDUC_EQV | 10.7 |
| | |
|SKL_INDX | 8.1 |
+------------------------+-----------+

Ward Method

Cluster 5
+------------------------+-----------+
| | Mean |
+------------------------+-----------+
|SENTENCE | 16.3 |
| | |
|PR_CONV | 2.1 |
| | |
|DR_SCORE | 8.1 |
| | |
|AGE | 30.2 |
| | |
|AGE_FIRS | 14.7 |
| | |
|EDUC_EQV | 5.3 |
| | |
|SKL_INDX | 3.8 |
+------------------------+-----------+
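The per-cluster means above can be computed generically as a group-by mean; the data and scores below are small hypothetical values, not the slides' output:

```python
import numpy as np

# Hypothetical: four cases measured on two variables, with a saved
# cluster score for each case
X = np.array([[4.0, 1.0],
              [5.0, 2.0],
              [16.0, 8.0],
              [17.0, 9.0]])
scores = np.array([1, 1, 2, 2])   # cluster score for each case

# Profile each cluster as the mean of each variable within it
profiles = {c: X[scores == c].mean(axis=0) for c in np.unique(scores)}
print(profiles[1])  # variable means within cluster 1
```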


Ranking the Variable Means of the Five Clusters

Variable     Cluster 1   Cluster 2   Cluster 3   Cluster 4   Cluster 5

Age             M           H           L           LL          HH
Age_Firs        M           L           HH          H           LL
Dr_Score        H           M           L           LL          HH
Educ_Eqv        H           L           LL          HH          M
Pr_Conv         M           HH          L           LL          H
Sentence        M           H           LL          L           HH
Skl_Indx        H           L           LL          HH          M

LL = lowest   L = low   M = median   H = high   HH = highest


Profile Descriptions of the Five Clusters

Cluster 1
Better-educated drug users who are highly skilled workers, about median age

Cluster 2
Older offenders, unskilled, poorly educated, with some history of drug use; career criminals serving long sentences

Cluster 3
Young first offenders, unskilled, poorly educated, with little drug history, serving very short sentences

Cluster 4
Very young, highly educated, skilled first offenders serving short sentences, with little history of drug use

Cluster 5
Severely drug-dependent older offenders with long criminal careers, serving very long sentences


Secondary Applications of the Results of a Cluster Analysis

Some statistical techniques, such as analysis of variance or discriminant analysis, use a priori categorical independent or dependent variables.

Cluster analysis allows us to create an empirically derived categorical variable wherein the groups or clusters are determined to be homogeneous and significantly different from each other.

Other statistical tests can then be conducted using the cluster variable as a categorical IV or DV.

Example

Do the five clusters of offenders differ significantly in the seriousness of the crime of which they were convicted? This is a one-way ANOVA problem.
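A sketch of that one-way ANOVA with SciPy, using hypothetical seriousness scores for three of the clusters:

```python
from scipy.stats import f_oneway

# Hypothetical seriousness scores for three clusters of offenders
c1 = [3, 4, 3, 4, 3]
c2 = [6, 6, 7, 5, 6]
c3 = [2, 2, 3, 2, 2]

# One-way ANOVA: do the cluster means differ significantly?
F, p = f_oneway(c1, c2, c3)
print(p < 0.05)  # True: these hypothetical group means clearly differ
```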


Secondary Applications of the Results of a Cluster Analysis (cont.)

Univariate Analysis of Variance

Between-Subjects Factors

Ward Method     N
1              33
2               9
3              12
4               7
5               9

Tests of Between-Subjects Effects

Dependent Variable: SER_INDX

Source            Type III Sum of Squares    df    Mean Square       F       Sig.
Corrected Model         152.593a              4       38.148      19.471     .000
Intercept               853.296               1      853.296     435.527     .000
CLU5_1                  152.593               4       38.148      19.471     .000
Error                   127.350              65        1.959
Total                  1306.000              70
Corrected Total         279.943              69

a. R Squared = .545 (Adjusted R Squared = .517)

Post Hoc Tests


Ward Method

Interpretation

There are significant mean differences in the crime seriousness of the offences committed by the five clusters of offenders.

Tukey's HSD test is used to determine which group mean differences are significant.


Secondary Applications of the Results of a Cluster Analysis (cont.)

Multiple Comparisons

Dependent Variable: SER_INDX


Tukey HSD

(I) Ward Method   (J) Ward Method   Mean Difference (I-J)   Std. Error   Sig.   95% CI Lower Bound   95% CI Upper Bound
1 2 -2.5152* .5264 .000 -3.9921 -1.0382
3 1.2348 .4718 .079 -8.9081E-02 2.5588
4 1.3420 .5825 .157 -.2923 2.9763
5 -2.8485* .5264 .000 -4.3254 -1.3716
2 1 2.5152* .5264 .000 1.0382 3.9921
3 3.7500* .6172 .000 2.0182 5.4818
4 3.8571* .7054 .000 1.8779 5.8364
5 -.3333 .6598 .987 -2.1847 1.5181
3 1 -1.2348 .4718 .079 -2.5588 8.908E-02
2 -3.7500* .6172 .000 -5.4818 -2.0182
4 .1071 .6657 1.000 -1.7607 1.9750
5 -4.0833* .6172 .000 -5.8152 -2.3515
4 1 -1.3420 .5825 .157 -2.9763 .2923
2 -3.8571* .7054 .000 -5.8364 -1.8779
3 -.1071 .6657 1.000 -1.9750 1.7607
5 -4.1905* .7054 .000 -6.1697 -2.2112
5 1 2.8485* .5264 .000 1.3716 4.3254
2 .3333 .6598 .987 -1.5181 2.1847
3 4.0833* .6172 .000 2.3515 5.8152
4 4.1905* .7054 .000 2.2112 6.1697
Based on observed means.
*. The mean difference is significant at the .05 level.


Secondary Applications of the Results of a Cluster Analysis (cont.)

SER_INDX (Tukey HSD a,b,c)

                            Subset
Ward Method     N        1         2
4               7      2.1429
3              12      2.2500
1              33      3.4848
2               9                6.0000
5               9                6.3333
Sig.                   .196      .982
Means for groups in homogeneous subsets are displayed.
Based on Type III Sum of Squares
The error term is Mean Square(Error) = 1.959.
a. Uses Harmonic Mean Sample Size = 10.445.
b. The group sizes are unequal. The harmonic mean
of the group sizes is used. Type I error levels are
not guaranteed.
c. Alpha = .05.


Using the Categorical Cluster Variable as a Dependent Variable

Example

To what extent does the type of defense counsel, pretrial jail time, and time to case disposition predict differences among the five groups of offenders?

This is a discriminant analysis problem with the cluster variable as the DV. (If the cluster variable were used as the IV, this would be a MANOVA problem.)

Discriminant analysis results

Three discriminant functions were extracted, since there are 3 IVs, which is less than the 5 groups. (The number of functions is the smaller of g - 1 and k.)

Only the 1st discriminant function is significant.

Z1 = -0.313 - 0.866 counsel + 0.021 jail_tm - 0.002 tm_disp
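Applying the function above to score a case is simple arithmetic; the predictor values below are hypothetical, chosen only to illustrate the computation:

```python
# Scoring a hypothetical case with the first discriminant function as
# written above; these predictor values are made up for illustration
counsel, jail_tm, tm_disp = 1, 30, 90

z1 = -0.313 - 0.866 * counsel + 0.021 * jail_tm - 0.002 * tm_disp
print(round(z1, 3))  # -0.729
```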


Using the Cluster Variable as a Dependent Variable (cont.)

Discriminant

Group Statistics

Valid N (listwise)
Ward Method Unweighted Weighted
1 COUNSEL 33 33.000
JAIL_TM 33 33.000
TM_DISP 33 33.000
2 COUNSEL 9 9.000
JAIL_TM 9 9.000
TM_DISP 9 9.000
3 COUNSEL 12 12.000
JAIL_TM 12 12.000
TM_DISP 12 12.000
4 COUNSEL 7 7.000
JAIL_TM 7 7.000
TM_DISP 7 7.000
5 COUNSEL 9 9.000
JAIL_TM 9 9.000
TM_DISP 9 9.000
Total COUNSEL 70 70.000
JAIL_TM 70 70.000
TM_DISP 70 70.000

Analysis 1
Summary of Canonical Discriminant Functions
Eigenvalues

Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1 .492a 89.5 89.5 .574
2 .042a 7.6 97.1 .200
3 .016a 2.9 100.0 .125
a. First 3 canonical discriminant functions were used in the
analysis.

Wilks' Lambda

Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1 through 3 .633 29.686 12 .003
2 through 3 .945 3.678 6 .720
3 .984 1.019 2 .601


Using the Cluster Variable as a Dependent Variable (cont.)

Standardized Canonical Discriminant Function Coefficients

Function
1 2 3
COUNSEL .549 .863 .523
JAIL_TM -.627 .807 .607
TM_DISP .102 .384 -.962

Structure Matrix

Function
1 2 3
JAIL_TM -.867* .488 .103
COUNSEL .848* .455 .271
TM_DISP -.086 .555 -.827*
Pooled within-groups correlations between discriminating
variables and standardized canonical discriminant functions
Variables ordered by absolute size of correlation within function.
*. Largest absolute correlation between each variable and
any discriminant function

Canonical Discriminant Function Coefficients

Function
1 2 3
COUNSEL 1.235 1.943 1.176
JAIL_TM -.016 .020 .015
TM_DISP .004 .015 -.039
(Constant) -.304 -3.221 2.205
Unstandardized coefficients

Functions at Group Centroids

Function
Ward Method 1 2 3
1 .213 -.115 -9.76E-02
2 -.803 .140 -2.89E-02
3 .673 .366 4.822E-02
4 .618 -.266 .291
5 -1.357 -1.51E-03 9.600E-02
Unstandardized canonical discriminant functions
evaluated at group means
