Clad Cluster Analysisi Slides-Clusteranalysis
Cluster Analysis
Objective: Group data points into classes of similar points based on a
series of variables
Useful for finding the true groups that are assumed to exist. If the analysis generates unexpected groupings, it could also reveal new relationships you might want to investigate
Also useful for data reduction: finding which data points are similar allows subsampling of the original dataset with little loss of information
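As a rough sketch of the data-reduction idea, the code below keeps one representative row per cluster. It assumes R's built-in iris data as a stand-in dataset, scaled predictors, average linkage and 10 clusters; all of these choices are illustrative and not taken from the slides.

```r
# Stand-in data: the four numeric variables of the built-in iris dataset (assumption)
predictors <- scale(iris[, 1:4])
d <- as.matrix(dist(predictors))

# Cluster the rows and cut the tree into 10 groups (the number 10 is arbitrary)
groups <- cutree(hclust(as.dist(d), method = "average"), k = 10)

# For each group, keep the row with the smallest total distance
# to the other members of its group
representatives <- sapply(split(seq_len(nrow(d)), groups), function(idx) {
  idx[which.min(rowSums(d[idx, idx, drop = FALSE]))]
})

# The reduced dataset has one row per cluster
reduced <- iris[representatives, ]
nrow(reduced)
```

Whether 10 representatives are enough depends on how tight the clusters are, so the number of groups is something to tune rather than a fixed rule.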
If distances are not equal between points we can draw a “hanging tree” to illustrate distances
[Figure: hanging tree joining points A, B, C and D; vertical axis: distance, from 0 to 2]
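A hanging tree for four points can be sketched in R as follows. The pairwise distances are hypothetical values chosen for illustration (they are not read off the slide's figure), and average linkage is an assumed choice.

```r
# Hypothetical pairwise distances between four points A, B, C and D
labels <- c("A", "B", "C", "D")
d <- matrix(c(0.0, 0.5, 1.0, 2.0,
              0.5, 0.0, 1.0, 2.0,
              1.0, 1.0, 0.0, 2.0,
              2.0, 2.0, 2.0, 0.0),
            nrow = 4, byrow = TRUE,
            dimnames = list(labels, labels))
distMatrix <- as.dist(d)

# Build the tree; the merge heights are the distances at which groups join
tree <- hclust(distMatrix, method = "average")
tree$height

# Draw the tree with all leaves hanging down to the baseline
plot(tree, hang = -1, ylab = "distance")
```

With these distances A and B join first, C joins them next, and D joins last, so the heights of the branches show how far apart the groups are.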
Building trees & creating groups
Good news: if your data really has clear groups, all methods will find them and give you similar results
When the groups are less clear-cut, methods can disagree, so it is best to try multiple algorithms and see which groups logically make sense
If you have a dummy dataset with pre-determined groups you can use it to see
which algorithm best recreates what you expect
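One way to run such a check is sketched below. The dummy dataset (three well-separated groups of 20 points), the helper name `recovers_groups` and the four methods compared are all assumptions made for illustration.

```r
set.seed(1)

# Hypothetical dummy data: three pre-determined groups in two variables
known <- rep(1:3, each = 20)
centres <- matrix(c(0, 0,
                    10, 0,
                    0, 10), ncol = 2, byrow = TRUE)
dummy <- centres[known, ] + matrix(rnorm(120, sd = 0.5), ncol = 2)
distMatrix <- dist(dummy)

# TRUE when every known group maps onto exactly one cluster and vice versa
recovers_groups <- function(known, cluster) {
  tab <- table(known, cluster)
  all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

methods <- c("ward.D2", "single", "complete", "average")
recovered <- sapply(methods, function(m) {
  cluster <- cutree(hclust(distMatrix, method = m), k = 3)
  recovers_groups(known, cluster)
})
recovered
```

Because the groups here are far apart, every method should recreate them; moving the centres closer together is a simple way to see where the methods start to disagree.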
Cluster analysis in R
CA in R: hclust(distMatrix, method) (stats package)
distMatrix: distance matrix of your data rows based on your predictor variables
You need to calculate this before running the cluster analysis (we create distance matrices in Lab 5)
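A minimal sketch of the two steps, distance matrix first and clustering second, is shown here. The iris data, the scaling of the predictors, Euclidean distance, average linkage and the cut into 3 groups are assumptions for illustration only.

```r
# Stand-in predictor variables: the four numeric columns of iris (assumption)
predictors <- scale(iris[, 1:4])

# Step 1: distance matrix between data rows
distMatrix <- dist(predictors, method = "euclidean")

# Step 2: hierarchical cluster analysis on that distance matrix
tree <- hclust(distMatrix, method = "average")

# Cut the tree into a chosen number of groups and count the members of each
groups <- cutree(tree, k = 3)
table(groups)
```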
method: the type of algorithm used to cluster points and define groups
"ward.D" = Ward-type minimum variance method; it matches Ward's criterion only when given squared distances
"ward.D2" = Ward's minimum variance method; dissimilarities are squared before cluster updating
"single" = nearest neighbour method; distance between two clusters is defined as the minimum distance between an observation in one cluster and an observation in the other cluster
"complete" = distance between two clusters is defined as the maximum distance between an observation in one cluster and an observation in the other cluster
"average" = distance between two clusters is defined as the mean distance between an observation in one cluster and an observation in the other cluster
"mcquitty" = when two clusters are joined, the distance of the new cluster to any other cluster is calculated as the average of the distances of the two joined clusters to that other cluster
"median" = uses group median
"centroid" = uses group centroid
Cluster analysis in R