Module 12 - Unsupervised Learning
[Figure: scatterplot of Ad Spending versus Population.]
The population size (pop) and ad spending (ad) for 100 different cities are shown as
purple circles. The green solid line indicates the first principal component direction,
and the blue dashed line indicates the second principal component direction.
Computation of Principal Components
• Since each of the x_ij has mean zero, then so does z_i1. Hence the sample variance of the z_i1 can be written as

  (1/n) Σ_{i=1}^n z_{i1}^2
• Plugging in (1), the first principal component loading vector solves the optimization problem

  maximize_{φ_11, …, φ_p1}  (1/n) Σ_{i=1}^n ( Σ_{j=1}^p φ_j1 x_ij )^2   subject to   Σ_{j=1}^p φ_j1^2 = 1
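This maximization is solved by the singular value decomposition of the centered data matrix: the first loading vector is the top right singular vector, and the variance of the resulting scores is (1/n) Σ z_i1^2. A small NumPy sketch on simulated data (the data set is invented for illustration, not the slide's example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))       # n = 100 observations, p = 3 features
X = X - X.mean(axis=0)              # column-center so each x_ij has mean zero

# SVD of the centered data matrix: rows of Vt are the loading vectors phi
U, s, Vt = np.linalg.svd(X, full_matrices=False)
phi1 = Vt[0]                        # first principal component loading vector
z1 = X @ phi1                       # first principal component scores z_i1
var_z1 = (z1 ** 2).sum() / len(z1)  # sample variance (1/n) * sum z_i1^2
```

Note that phi1 automatically satisfies the unit-norm constraint Σ φ_j1^2 = 1, and var_z1 equals s[0]^2 / n, the largest variance achievable by any unit-norm linear combination of the features.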
[Figure: biplot of the first two principal components of the USArrests data; states (e.g. California, Alaska, Mississippi) are plotted by their scores, with loading vectors for variables such as Murder and UrbanPop.]
The ith PC is given by Z_i = φ_1i X_1 + φ_2i X_2 + … + φ_pi X_p.
[Figure: two biplots of the USArrests data, with variables scaled to unit standard deviation versus unscaled; axes show the first and second principal components, with loading vectors for Murder, Assault, and UrbanPop.]
      ⎡  1  −2   0 ⎤
  Σ = ⎢ −2   5   0 ⎥
      ⎣  0   0   2 ⎦
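For a known covariance matrix like this Σ, PCA reduces to an eigendecomposition: the eigenvectors are the loading vectors and the eigenvalues are the component variances. A NumPy check (the matrix is taken from the slide; the rest is illustration):

```python
import numpy as np

# Covariance matrix from the slide
Sigma = np.array([[1., -2., 0.],
                  [-2., 5., 0.],
                  [0.,  0., 2.]])

# eigh handles symmetric matrices and returns eigenvalues in ascending order
evals, evecs = np.linalg.eigh(Sigma)
order = np.argsort(evals)[::-1]       # sort descending by variance explained
evals, evecs = evals[order], evecs[:, order]

pve = evals / evals.sum()             # proportion of variance explained per PC
```

Here the component variances come out to 3 + 2√2 ≈ 5.83, 2, and 3 − 2√2 ≈ 0.17, and they sum to the trace of Σ.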
[Figure: observations plotted against their first and second principal component scores.]
• The first two principal components of a data set span the plane
that is closest to the n observations
  minimize_{A ∈ R^{n×M}, B ∈ R^{p×M}}  Σ_{j=1}^p Σ_{i=1}^n ( x_ij − Σ_{m=1}^M a_im b_jm )^2

  x̂_ij ← Σ_{m=1}^M â_im b̂_jm
• â_im represents the strength with which the ith user belongs to the mth clique, a group of customers that enjoys movies of the mth genre;
• b̂_jm represents the strength with which the jth movie belongs to the mth genre.
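A minimal sketch of this idea, assuming a simple iterate-SVD-and-refill scheme (in the spirit of hard-impute; the `low_rank_complete` helper and the toy ratings matrix are invented for illustration):

```python
import numpy as np

def low_rank_complete(X, M=2, n_iter=50):
    """Fill missing entries of X (marked np.nan) using a rank-M fit:
    start from column means, then repeatedly refit the SVD and refill."""
    miss = np.isnan(X)
    Xhat = X.copy()
    col_means = np.nanmean(X, axis=0)
    Xhat[miss] = col_means[np.where(miss)[1]]   # initialize with column means
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Xhat, full_matrices=False)
        A = U[:, :M] * s[:M]           # a_im: user-by-clique strengths
        B = Vt[:M].T                   # b_jm: movie-by-genre strengths
        Xhat[miss] = (A @ B.T)[miss]   # update only the missing entries
    return Xhat, A, B

# Toy users-by-movies ratings matrix with one missing rating
X = np.array([[5., 4., np.nan],
              [4., 5., 1.],
              [1., 1., 5.],
              [2., 1., 4.]])
Xhat, A, B = low_rank_complete(X, M=2)
```

Observed entries are left untouched; only the missing cells are imputed from the current rank-M approximation A Bᵀ.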
Clustering
A simulated data set with 150 observations in 2-dimensional space. Panels show the results of applying K-means clustering with different values of K, the number of clusters. The color of each observation indicates the cluster to which it was assigned using the K-means clustering algorithm. Note that there is no ordering of the clusters, so the cluster coloring is arbitrary.
These cluster labels were not used in clustering; instead, they are the outputs of the
clustering procedure.
Details of K-means clustering
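The standard algorithm: randomly assign each observation to one of K clusters, then alternate between computing the cluster centroids and reassigning each observation to its closest centroid, until the assignments stop changing. A minimal NumPy sketch (the `kmeans` helper and the two-blob data set are invented for illustration):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means sketch: random initial assignment, then alternate
    centroid computation and nearest-centroid reassignment."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))   # step 1: random initial clusters
    for _ in range(n_iter):
        # step 2(a): centroid of each cluster (re-seed any empty cluster)
        centroids = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                              else X[rng.integers(len(X))] for k in range(K)])
        # step 2(b): reassign each observation to its closest centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if (new_labels == labels).all():       # stop when assignments are stable
            break
        labels = new_labels
    return labels, centroids

# Two well-separated blobs of 50 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, .5, (50, 2)), rng.normal(5, .5, (50, 2))])
labels, centroids = kmeans(X, K=2)
```

Because each step can only decrease the total within-cluster variation, the algorithm always converges, though possibly to a local optimum; in practice it is run from several random initializations.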
[Figure: successive steps of hierarchical clustering on points A, B, C, D, showing clusters being fused one pair at a time.]
Hierarchical Clustering Algorithm
The approach in words:
• Start with each point in its own cluster.
• Identify the closest two clusters and merge them.
• Repeat.
• Ends when all points are in a single cluster.
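The steps above can be sketched directly. This naive implementation (the `agglomerative` helper and the three-point data set are invented for illustration; only single and complete linkage are handled) records each merge together with its height:

```python
import numpy as np

def agglomerative(X, linkage="single"):
    """Sketch of the algorithm in words: start with singleton clusters,
    repeatedly merge the closest pair, and record each merge (the merge
    heights are what a dendrogram displays)."""
    clusters = {i: [i] for i in range(len(X))}
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best, pair = np.inf, None
        for a in range(len(keys)):
            for b in range(a + 1, len(keys)):
                ds = [D[i, j] for i in clusters[keys[a]] for j in clusters[keys[b]]]
                d = min(ds) if linkage == "single" else max(ds)
                if d < best:
                    best, pair = d, (keys[a], keys[b])
        a, b = pair
        merges.append((clusters[a], clusters[b], best))  # height = dissimilarity
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges

# Three points on a line: 0 and 1 fuse first, then the pair fuses with 2
X = np.array([[0.0], [0.1], [5.0]])
merges = agglomerative(X)
```

With single linkage, the first merge joins points 0 and 1 at height 0.1, and the final merge happens at height 4.9, the distance from point 1 to point 2.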
Dendrogram
[Figure: five points A–E in two dimensions, with the corresponding dendrogram; leaves A, B, C, D, E sit at height 0 and fusions occur at heights up to 4.]
Types of Linkage
Complete: Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.

Single: Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities.

Average: Mean inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.

Centroid: Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.
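The four rules can be computed directly for two small hand-made clusters (an illustrative NumPy sketch; the points are invented):

```python
import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.0]])   # observations in cluster A
B = np.array([[4.0, 0.0], [6.0, 0.0]])   # observations in cluster B

# All pairwise dissimilarities between A's and B's observations
pairwise = np.linalg.norm(A[:, None] - B[None, :], axis=2)

complete = pairwise.max()    # largest pairwise dissimilarity
single   = pairwise.min()    # smallest pairwise dissimilarity
average  = pairwise.mean()   # mean of all pairwise dissimilarities
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # centroid distance
```

Here the pairwise distances are 4, 6, 3, and 5, so complete = 6, single = 3, and average = 4.5; the centroids sit at (0.5, 0) and (5, 0), so centroid = 4.5 as well.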
An Example
[Figure: scatterplot of the example data, X1 versus X2.]
[Figure: three copies of the dendrogram. Left: no cut, all observations in one cluster. Center: cut at a height of 9, yielding 2 clusters. Right: cut at a height of 5, yielding 3 clusters.]
Choice of Dissimilarity Measure
• So far we have used Euclidean distance.
• An alternative is correlation-based distance, which considers two observations to be similar if their features are highly correlated.
• Here correlation is computed between the observation profiles for each pair of observations.
• Correlation cares more about the shape of a profile than about its levels.
[Figure: profiles of Observations 1, 2 and 3 plotted against Variable Index.]
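A quick illustration of the shape-versus-level point, with invented profiles: obs2 is obs1 shifted upward (identical shape, far away in Euclidean terms), while obs3 has similar levels but a different shape:

```python
import numpy as np

obs1 = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
obs2 = obs1 + 10.0                 # same shape, much higher level
obs3 = obs1[::-1].copy()           # similar level, reversed shape

def corr_dist(a, b):
    """Correlation-based distance between two observation profiles: 1 - cor(a, b)."""
    return 1.0 - np.corrcoef(a, b)[0, 1]

eucl = np.linalg.norm(obs1 - obs2)   # large, because the levels differ
d12 = corr_dist(obs1, obs2)          # ~0, because the shapes match exactly
d13 = corr_dist(obs1, obs3)          # large, because the shapes differ
```

So correlation-based distance treats obs1 and obs2 as essentially identical even though their Euclidean distance is large.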
Practical Issues for Clustering
1. Scaling of the features matters.
2. In some cases, standardizing the features (mean zero, standard deviation one) may be useful.
3. What dissimilarity measure and linkage should be used (for hierarchical clustering)?
4. Choice of K for K-means clustering.
5. Which features should be used to drive the clustering?
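For issues 1 and 2, a standardization sketch (the feature names and scales are invented): each feature is centered and scaled to unit standard deviation so that Euclidean distances are not dominated by whichever feature happens to have the largest scale:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two features on very different scales, e.g. income in dollars and age in years
X = np.column_stack([rng.normal(50_000, 10_000, 100),
                     rng.normal(40, 10, 100)])

# Without scaling, distances are driven almost entirely by the income feature.
# Standardize each feature to mean 0 and standard deviation 1 before clustering.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
```

After this step both features contribute comparably to any dissimilarity measure used by K-means or hierarchical clustering.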
Example