Module 12.02: Unsupervised Learning

Statistical Learning

Clustering

• Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set.
• We seek a partition of the data into distinct groups so that the observations within each group are quite similar to each other.
PCA vs Clustering

• PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance.
• Clustering looks for homogeneous subgroups among the observations.
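As a rough illustration of the contrast, the sketch below (assuming NumPy and scikit-learn are installed; the data matrix X is simulated purely for illustration, not taken from the slides) reduces the same data to two dimensions with PCA and, separately, partitions it into three groups with K-means.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))          # 150 observations, 10 features

# PCA: a low-dimensional representation explaining much of the variance
scores = PCA(n_components=2).fit_transform(X)   # shape (150, 2)

# Clustering: homogeneous subgroups among the observations
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # shape (150,)
```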
Two clustering methods

• In K-means clustering, observations are partitioned into a pre-specified number of clusters.
• In hierarchical clustering, the number of clusters is not known beforehand.
• A tree-like visual representation of the observations, called a dendrogram, is created to view at once the clusterings obtained for each possible number of clusters, from 1 to n.
K-means clustering

[Figure: three panels showing K-means results for K = 2, K = 3, and K = 4.]

A simulated data set with 150 observations in 2-dimensional space. The panels show the results of applying K-means clustering with different values of K, the number of clusters. The color of each observation indicates the cluster to which it was assigned using the K-means clustering algorithm. Note that there is no ordering of the clusters, so the cluster coloring is arbitrary. These cluster labels were not used in clustering; instead, they are the output of the clustering procedure.
Details of K-means clustering

Let $C_1, \ldots, C_K$ denote sets containing the indices of the observations in each cluster. These sets satisfy two properties:

1. $C_1 \cup C_2 \cup \cdots \cup C_K = \{1, \ldots, n\}$. In other words, each observation belongs to at least one of the K clusters.
2. $C_k \cap C_{k'} = \emptyset$ for all $k \neq k'$. In other words, the clusters are non-overlapping: no observation belongs to more than one cluster.

For instance, if the ith observation is in the kth cluster, then $i \in C_k$.

• The within-cluster variation for cluster $C_k$ is a measure $\mathrm{WCV}(C_k)$ of the amount by which the observations within a cluster differ from each other.
• Hence K-means clustering solves the optimization problem

$$\underset{C_1, \ldots, C_K}{\text{minimize}} \left\{ \sum_{k=1}^{K} \mathrm{WCV}(C_k) \right\} \qquad (2)$$

• In words, this formula says: partition the observations into K clusters such that the total within-cluster variation, summed over all K clusters, is as small as possible.
How to define within-cluster variation?

• Typically squared Euclidean distance is used:

$$\mathrm{WCV}(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \qquad (3)$$

where $|C_k|$ denotes the number of observations in the kth cluster.

• Combining (2) and (3) gives the optimization problem that defines K-means clustering:

$$\underset{C_1, \ldots, C_K}{\text{minimize}} \left\{ \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right\} \qquad (4)$$
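A minimal NumPy sketch of the objective in (4), computing the total within-cluster variation for a given assignment of observations to clusters. The function name is illustrative, not part of the slides.

```python
import numpy as np

def total_within_cluster_variation(X, labels):
    """Sum of WCV(C_k) over all clusters, as in equations (3) and (4)."""
    total = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]                      # observations in cluster k
        diffs = Xk[:, None, :] - Xk[None, :, :]  # all pairwise differences
        total += (diffs ** 2).sum() / len(Xk)    # (1/|C_k|) * sum of squared distances
    return total
```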
K-Means Clustering Algorithm

1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations.
2. Iterate until the cluster assignments stop changing:
   (a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.
   (b) Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
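The two steps can be written out directly. The following is a rough NumPy sketch of this procedure, not the slides' own code; for brevity it ignores edge cases such as a cluster becoming empty during the iterations.

```python
import numpy as np

def kmeans(X, K, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=X.shape[0])      # step 1: random initial assignments
    for _ in range(max_iter):
        # step 2(a): each centroid is the vector of feature means for its cluster
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # step 2(b): reassign each observation to the closest centroid (Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):         # stop when assignments stop changing
            break
        labels = new_labels
    return labels, centroids
```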
Properties of the Algorithm

• This algorithm is guaranteed to decrease the value of the objective (4) at each step. Why? Note that

$$\frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2,$$

where $\bar{x}_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij}$ is the mean for feature j in cluster $C_k$.

• However, the algorithm is not guaranteed to give the global minimum.
• This is why K-means clustering should be run from a number of different random initial assignments (see the sketch below).
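One common way to guard against a poor local minimum is sketched below using scikit-learn (assumed available); n_init is scikit-learn's parameter for the number of random starts, not something defined in these slides, and the data are simulated for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(150, 2))   # illustrative data

# Run the algorithm from 20 different random initial assignments and keep
# the solution with the smallest objective (scikit-learn does this internally).
km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)
print(km.inertia_)   # within-cluster sum of squares of the best of the 20 runs
```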
Hierarchical Clustering

• K-means clustering requires pre-specification of the number of clusters K.
• Hierarchical clustering is an alternative approach which does not require that we commit to a particular choice of K.
• Hierarchical clustering also provides a tree-like visualization of the observations, called a dendrogram.
Hierarchical Clustering: the idea
Builds a hierarchy in a “bottom-up” fashion...

[Figure: points A, B, C, D are merged step by step, with the closest pair of clusters fused at each stage.]
Hierarchical Clustering Algorithm
The approach in words:
• Start with each point in its own cluster.
• Identify the closest two clusters and merge them.
• Repeat.
• Ends when all points are in a single cluster.

Dendrogram

[Figure: dendrogram for the points A–E; the vertical axis (height 0–4) gives the dissimilarity at which each pair of clusters is fused.]
Types of Linkage

Complete: Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.

Single: Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities.

Average: Mean inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.

Centroid: Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.
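A sketch of how these linkages might be computed with SciPy (assumed available); scipy.cluster.hierarchy.linkage implements the complete, single, average, and centroid rules described above, and the data are simulated for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(45, 2))    # illustrative data

Z_complete = linkage(X, method="complete", metric="euclidean")
Z_single   = linkage(X, method="single",   metric="euclidean")
Z_average  = linkage(X, method="average",  metric="euclidean")
Z_centroid = linkage(X, method="centroid")           # centroid linkage assumes Euclidean distance

# dendrogram(Z_complete) would draw the corresponding tree (requires matplotlib)
```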
An Example

[Figure: scatterplot of the observations, with X1 on the horizontal axis and X2 on the vertical axis.]

45 observations generated in 2-dimensional space. In reality there are three distinct classes, shown in separate colors. However, we will treat these class labels as unknown and will seek to cluster the observations in order to discover the classes from the data.
Application of hierarchical clustering

[Figure: three dendrograms of the 45 observations; the vertical axis (height 0–10) gives the dissimilarity at which clusters are fused.]
Details of previous figure
• Left: Dendrogram obtained from hierarchically clustering the data from the previous slide, with complete linkage and Euclidean distance.
• Center: The dendrogram from the left-hand panel, cut at a height of 9 (indicated by the dashed line). This cut results in two distinct clusters, shown in different colors.
• Right: The dendrogram from the left-hand panel, now cut at a height of 5. This cut results in three distinct clusters, shown in different colors. Note that the colors were not used in clustering, but are simply used for display purposes in this figure.
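Cutting a dendrogram at a given height can be done with SciPy's fcluster; the sketch below runs on simulated data, and the specific heights 9 and 5 belong to the figure above and would generally differ for other data sets.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(45, 2))    # illustrative data
Z = linkage(X, method="complete", metric="euclidean")

two_clusters   = fcluster(Z, t=9, criterion="distance")   # cut at height 9
three_clusters = fcluster(Z, t=5, criterion="distance")   # cut at height 5
```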
Choice of Dissimilarity Measure

• So far we have used Euclidean distance.
• An alternative is correlation-based distance, which considers two observations to be similar if their features are highly correlated.
• Here the correlation is computed between the observation profiles for each pair of observations.
• Correlation-based distance cares more about the shapes of the observation profiles than about their levels.
[Figure: profiles of Observations 1, 2, and 3 plotted against Variable Index (20 variables), illustrating how correlation-based distance depends on the shape of the profiles rather than on their levels.]
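A sketch of correlation-based distance using SciPy (assumed available): pdist with metric="correlation" computes one minus the Pearson correlation between each pair of observation profiles, so highly correlated profiles are "close". The three simulated observations are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(0).normal(size=(3, 20))    # 3 observations, 20 variables
D = squareform(pdist(X, metric="correlation"))        # 1 - correlation, as a 3x3 matrix
Z = linkage(pdist(X, metric="correlation"), method="average")   # correlation-based hierarchical clustering
```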
Practical Issues for Clustering

1. Scaling of the features matters.
2. In some cases, standardization may be useful (a sketch follows this list).
3. What dissimilarity measure and linkage should be used (for hierarchical clustering)?
4. Choice of K for K-means clustering.
5. Which features should be used to drive the clustering?
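A sketch (scikit-learn assumed) of standardizing the features before K-means so that variables measured on large scales do not dominate the Euclidean distances; the scale factors used to simulate the data are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 5)) * [1.0, 10.0, 100.0, 1.0, 1.0]  # mixed scales
X_std = StandardScaler().fit_transform(X)             # each feature: mean 0, standard deviation 1
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)
```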
Example

• Gene expression measurements for 8,000 genes, from samples collected from 88 women with breast cancer.
• Average linkage, correlation metric.
• A subset of 500 intrinsic genes was studied, before and after chemotherapy (which genes were varying, by how much, within women and between women).
Heatmap

[Figure: heatmap of the gene expression data; based on the gene expression, the samples were clustered.]

[Figure: survival curves for the different groups.]