Module 12 - Unsupervised Learning

Unsupervised Learning
Reference Books

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112). New York: Springer.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer.

Johnson, R. A., & Wichern, D. W. (2002). Applied Multivariate Statistical Analysis.
Unsupervised Learning

• Unsupervised learning refers to algorithms that learn patterns from unlabeled data.
• Unsupervised learning is more subjective than supervised learning, as there is no simple goal for the analysis, such as prediction of a response.
• We will discuss two unsupervised learning methods:
1. Principal components analysis
2. Clustering
Principal Components Analysis

• PCA produces a low-dimensional representation of a dataset.


• It finds a sequence of linear combinations of the variables
that have maximal variance, and are mutually uncorrelated.
• Apart from producing derived variables for use in supervised
learning problems, PCA also serves as a tool for data
visualization.
Principal Components Analysis: details

• The first principal component of a set of features $X_1, X_2, \ldots, X_p$ is the normalized linear combination of the features

$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p$$

that has the largest variance. By normalized, we mean that

$$\sum_{j=1}^{p} \phi_{j1}^2 = 1$$

• The elements $\phi_{11}, \ldots, \phi_{p1}$ are the loadings of the first principal component; together, the loadings make up the principal component loading vector

$$\phi_1 = (\phi_{11}\ \phi_{21}\ \cdots\ \phi_{p1})^T$$
PCA: example

[Figure: scatter plot of Ad Spending versus Population for 100 cities, with the two principal component directions overlaid.]

The population size (pop) and ad spending (ad) for 100 different cities are shown as purple circles. The green solid line indicates the first principal component direction, and the blue dashed line indicates the second principal component direction.
Computation of Principal Components

• Suppose we have an n × p data set X.
• Assume the variables in X have been centered to have mean zero.
• We look for the linear combination of the sample feature values of the form

$$z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \cdots + \phi_{p1} x_{ip} \qquad (1)$$

for i = 1, . . . , n that has the largest sample variance, subject to the constraint that

$$\sum_{j=1}^{p} \phi_{j1}^2 = 1$$

• Since each of the $x_{ij}$ has mean zero, so does $z_{i1}$. Hence the sample variance of the $z_{i1}$ can be written as

$$\frac{1}{n} \sum_{i=1}^{n} z_{i1}^2$$

• Plugging in (1), the first principal component loading vector solves the optimization problem

$$\underset{\phi_{11}, \ldots, \phi_{p1}}{\text{maximize}}\; \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{p} \phi_{j1} x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1$$

• This problem can be solved via a singular value decomposition of the matrix X.
• We refer to $Z_1$ as the first principal component, with realized values $z_{11}, \ldots, z_{n1}$.
• The second principal component is the linear combination of $X_1, X_2, \ldots, X_p$ that has maximal variance among all linear combinations that are uncorrelated with $Z_1$.
• The second principal component scores $z_{12}, z_{22}, \ldots, z_{n2}$ take the form

$$z_{i2} = \phi_{12} x_{i1} + \phi_{22} x_{i2} + \cdots + \phi_{p2} x_{ip}$$

where $\phi_2$ is the second principal component loading vector, with elements $\phi_{12}, \phi_{22}, \ldots, \phi_{p2}$.
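A minimal NumPy sketch of this computation via the SVD (illustrative only; the toy matrix X below is made up and is centered by hand):

```python
import numpy as np

# Toy centered data matrix X (n = 6 observations, p = 3 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
X = X - X.mean(axis=0)              # center each column to have mean zero

# Singular value decomposition: X = U diag(d) V^T.
U, d, Vt = np.linalg.svd(X, full_matrices=False)

phi = Vt.T                          # columns of phi are the loading vectors phi_1, phi_2, ...
Z = X @ phi                         # principal component scores z_im

print("first loading vector:", phi[:, 0].round(3))               # satisfies sum_j phi_j1^2 = 1
print("sample variance of Z1:", (Z[:, 0] ** 2).mean().round(3))   # equals d[0]**2 / n
```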
Illustration

• USArrests data: For each of the fifty states in the United States, the data set contains the number of arrests per 100,000 residents for each of three crimes: Assault, Murder, and Other (labelled Rape in the original data). We also record UrbanPop (the percent of the population in each state living in urban areas).
• The principal component score vectors have length n = 50, and the principal component loading vectors have length p = 4.
• PCA was performed after standardizing each variable to have mean zero and standard deviation one.
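A hedged sketch of this analysis in Python. Fetching USArrests via statsmodels' Rdatasets helper is an assumption (it requires an internet connection); any DataFrame containing the four variables would work the same way:

```python
import numpy as np
import pandas as pd
from statsmodels.datasets import get_rdataset   # assumption: used only to fetch USArrests

# Fetch the USArrests data (Murder, Assault, UrbanPop, Rape for the 50 states).
data = get_rdataset("USArrests").data
num = data.select_dtypes("number")              # keep the four numeric columns

# Standardize each variable to mean zero and standard deviation one.
X = (num - num.mean()) / num.std()

# PCA via SVD of the standardized data matrix.
U, d, Vt = np.linalg.svd(X.values, full_matrices=False)
loadings = pd.DataFrame(Vt.T, index=num.columns,
                        columns=[f"PC{m + 1}" for m in range(num.shape[1])])
scores = pd.DataFrame(X.values @ Vt.T, index=X.index, columns=loadings.columns)

print(loadings.round(2))        # loading vectors (length p = 4)
print(scores.head().round(2))   # score vectors (length n = 50)
```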
USArrests data: PCA plot

[Figure: biplot of the first two principal components of the USArrests data. The 50 states are plotted at their first and second principal component scores, and arrows show the loadings of Murder, Assault, Other, and UrbanPop (loading axes on the top and right).]


Figure details

The first two principal components for the USArrests data.

• The blue state names represent the scores for the first two principal components.
• The orange arrows indicate the first two principal component loading vectors (with axes on the top and right). For example, the loading for Other (Rape in the original data) on the first component is 0.54, and its loading on the second principal component is 0.17; the corresponding label is centered at the point (0.54, 0.17).
• This figure is known as a biplot, because it displays both the principal component scores and the principal component loadings.
Figure details

• The first loading vector places approximately equal weights on Assault, Murder, and Other.
• This indicates that the first principal component represents overall crime.
• The second loading vector places most of its weight on UrbanPop.
• The second principal component therefore represents the level of urbanization.
• The crime-related variables are correlated with each other (a high murder rate is associated with a high assault rate).
• UrbanPop is less correlated with the other three variables.
How to Determine Principal Components

Let $\Sigma$ be the covariance matrix of the random vector $X^T = (X_1, X_2, \ldots, X_p)$.

Let $\Sigma$ have eigenvalue–eigenvector pairs $(\lambda_1, e_1), (\lambda_2, e_2), \ldots, (\lambda_p, e_p)$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$.

The $i$th principal component is given by

$$Z_i = e_{i1} X_1 + e_{i2} X_2 + \cdots + e_{ip} X_p, \qquad i = 1, 2, \ldots, p$$

with the following properties:

$$\mathrm{Var}(Z_i) = e_i^T \Sigma e_i = \lambda_i, \qquad i = 1, 2, \ldots, p$$
$$\mathrm{Cov}(Z_i, Z_k) = 0 \quad \text{for } i \ne k$$
Another Interpretation of Principal Components

• The first principal component loading vector has a very special property: it defines the line in p-dimensional space that is closest to the n observations (using average squared Euclidean distance as a measure of closeness).
• The notion of principal components as the dimensions that are closest to the n observations extends beyond just the first principal component.
Scaling of the variables

• If the variables are in different units, scaling each to have standard deviation equal to one is recommended.
• The variances of Murder, Other, Assault, and UrbanPop are 18.97, 87.73, 6945.16, and 209.5, respectively.
• If the variables are in the same units, scaling is not mandatory.
[Figure: two biplots of the USArrests data. Left (Scaled): each variable standardized to unit standard deviation. Right (Unscaled): the raw variables.]
Proportion of Variance Explained

• To understand the strength of each component, we measure the proportion of variance explained (PVE) by each one.
• The total variance present in a data set (assuming that the variables have been centered to have mean zero) is defined as

$$\sum_{j=1}^{p} \mathrm{Var}(X_j) = \sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2$$

and the variance explained by the mth principal component is

$$\frac{1}{n} \sum_{i=1}^{n} z_{im}^2 = \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{p} \phi_{jm} x_{ij} \Big)^2$$

• Therefore, the PVE of the mth principal component is the quantity (between 0 and 1)

$$\mathrm{PVE}_m = \frac{\sum_{i=1}^{n} \big( \sum_{j=1}^{p} \phi_{jm} x_{ij} \big)^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}$$
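A short NumPy sketch of the PVE calculation on a made-up centered matrix (the PVE can equivalently be read off the squared singular values):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)                       # centered data matrix

U, d, Vt = np.linalg.svd(X, full_matrices=False)
scores = X @ Vt.T                            # z_im for each component m

total_ss = (X ** 2).sum()                    # n times the total variance
pve = (scores ** 2).sum(axis=0) / total_ss   # PVE of each principal component

print("PVE by component:", pve.round(3))
print("cumulative PVE:  ", pve.cumsum().round(3))   # reaches 1.0 at the last component
```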
Scree Plots

Left: the proportion of variance explained by each of the four principal components in the USArrests data.
Right: the cumulative proportion of variance explained by the four principal components in the USArrests data.
Example

Suppose the random variables $X_1, X_2, X_3$ have the covariance matrix

$$\Sigma = \begin{pmatrix} 1 & -2 & 0 \\ -2 & 5 & 0 \\ 0 & 0 & 2 \end{pmatrix}$$

Determine the principal components.
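A quick numerical check of this example, sketched with NumPy's eigendecomposition (the eigenvalues work out to approximately 5.83, 2, and 0.17):

```python
import numpy as np

Sigma = np.array([[ 1., -2., 0.],
                  [-2.,  5., 0.],
                  [ 0.,  0., 2.]])

# Eigendecomposition of the covariance matrix (eigh assumes a symmetric matrix).
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]            # sort so that lambda_1 >= lambda_2 >= lambda_3
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

for i in range(3):
    print(f"lambda_{i + 1} = {eigvals[i]:.3f},  e_{i + 1} = {np.round(eigvecs[:, i], 3)}")
# The i-th principal component is Z_i = e_i1 X1 + e_i2 X2 + e_i3 X3, with Var(Z_i) = lambda_i.
```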


PCA for Missing Values and Matrix Completion

• Often datasets have missing values, which can be a nuisance.
• Removing the rows that contain missing observations and performing the analysis on the complete rows is wasteful and, depending on the fraction missing, could be unrealistic.
• Alternatively, if $x_{ij}$ is missing, it can be replaced by the mean of the jth column (using the non-missing entries to compute the mean).
• Although this is a common and convenient strategy, it does not exploit the correlation between the variables.
• Principal components can be used to impute missing values through a process known as matrix completion.
• Sometimes data is missing by necessity, as in a matrix of movie ratings where each customer rates only some of the movies.


PCA for Missing Values and Matrix Completion

[Figure: excerpt of the Netflix movie rating data, with many missing entries.]


Another Interpretation of Principal Components

[Figure: simulated observations plotted against their first and second principal component scores.]
PCA for Missing Values and Matrix Completion

• Principal components provide low-dimensional linear surfaces that are closest to the observations.
• The first two principal components of a data set span the plane that is closest to the n observations.
• The first three principal components span the three-dimensional hyperplane that is closest to the n observations, and so forth.
• Using this interpretation, the first M principal component score vectors and the first M principal component loading vectors together provide the best M-dimensional approximation (in terms of Euclidean distance) to each observation $x_{ij}$:

$$x_{ij} \approx \sum_{m=1}^{M} z_{im} \phi_{jm}$$
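A brief NumPy sketch of this approximation property on made-up data: the rank-M reconstruction from the first M scores and loadings leaves a residual sum of squares equal to the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
X = X - X.mean(axis=0)                     # column-centered data matrix

U, d, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T                               # principal component scores z_im
Phi = Vt.T                                 # loading vectors phi_jm (as columns)

M = 2
X_approx = Z[:, :M] @ Phi[:, :M].T         # x_ij ≈ sum_{m=1}^{M} z_im * phi_jm

rss = ((X - X_approx) ** 2).sum()          # residual sum of squares of the rank-M fit
print(f"RSS of the rank-{M} PCA approximation: {rss:.3f}")
print("matches discarded squared singular values:", np.isclose(rss, (d[M:] ** 2).sum()))
```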
PCA for Missing Values and Matrix Completion

• More formally, this can be expressed as an optimization problem.
• Suppose the data matrix X is column-centered.
• Out of all approximations of the form $x_{ij} \approx \sum_{m=1}^{M} a_{im} b_{jm}$, the one with the smallest RSS is given by the solution of

$$\underset{A \in \mathbb{R}^{n \times M},\, B \in \mathbb{R}^{p \times M}}{\text{minimize}}\; \sum_{j=1}^{p} \sum_{i=1}^{n} \Big( x_{ij} - \sum_{m=1}^{M} a_{im} b_{jm} \Big)^2$$

• It can be shown that, for any value of M, the columns of the matrices A and B that solve the above problem are the first M principal component score and loading vectors.
• The smallest possible value of the above objective is

$$\sum_{j=1}^{p} \sum_{i=1}^{n} \Big( x_{ij} - \sum_{m=1}^{M} z_{im} \phi_{jm} \Big)^2$$

• This property can be used for missing value imputation.
PCA for Missing Values and Matrix Completion

• A modified optimization problem:

$$\underset{A \in \mathbb{R}^{n \times M},\, B \in \mathbb{R}^{p \times M}}{\text{minimize}}\; \sum_{(i,j) \in \mathcal{O}} \Big( x_{ij} - \sum_{m=1}^{M} a_{im} b_{jm} \Big)^2$$

• $\mathcal{O}$ is the set of all observed pairs (i, j).
• A missing observation can be estimated by

$$\hat{x}_{ij} = \sum_{m=1}^{M} \hat{a}_{im} \hat{b}_{jm}$$

• An iterative approach can be used to solve the above optimization problem.
Iterative Algorithm for Matrix Completion

1. Initialize: create a complete data matrix $\tilde{X}$ by imputing each missing value with its column mean.
2. Repeat steps (a) to (c) until the objective no longer decreases:
   a. Solve
      $$\underset{A \in \mathbb{R}^{n \times M},\, B \in \mathbb{R}^{p \times M}}{\text{minimize}}\; \sum_{(i,j) \in \mathcal{O}} \Big( \tilde{x}_{ij} - \sum_{m=1}^{M} a_{im} b_{jm} \Big)^2$$
      by computing the principal components of $\tilde{X}$.
   b. For each $(i, j) \notin \mathcal{O}$, set $\tilde{x}_{ij} \leftarrow \sum_{m=1}^{M} \hat{a}_{im} \hat{b}_{jm}$.
   c. Compute the objective
      $$\sum_{(i,j) \in \mathcal{O}} \Big( x_{ij} - \sum_{m=1}^{M} \hat{a}_{im} \hat{b}_{jm} \Big)^2$$
3. Return the estimated missing entries $\tilde{x}_{ij}$ for $(i, j) \notin \mathcal{O}$.
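A minimal NumPy sketch of this algorithm. The rank M, tolerance, and toy ratings matrix are arbitrary choices, and for simplicity the SVD is applied to the completed matrix directly (the slides assume a column-centered X):

```python
import numpy as np

def matrix_complete(X, M=1, max_iter=100, tol=1e-7):
    """Impute NaN entries of X with the iterative rank-M PCA algorithm."""
    observed = ~np.isnan(X)
    Xtilde = X.copy()

    # Step 1: initialize missing entries with the column means of the observed values.
    col_means = np.nanmean(X, axis=0)
    Xtilde[~observed] = col_means[np.where(~observed)[1]]

    prev_obj = np.inf
    for _ in range(max_iter):
        # Step 2a: best rank-M approximation of the completed matrix via SVD
        # (for a column-centered X these are exactly the first M principal components).
        U, d, Vt = np.linalg.svd(Xtilde, full_matrices=False)
        approx = (U[:, :M] * d[:M]) @ Vt[:M]

        # Step 2b: overwrite only the missing entries with their approximations.
        Xtilde[~observed] = approx[~observed]

        # Step 2c: objective measured over the observed entries only.
        obj = ((X[observed] - approx[observed]) ** 2).sum()
        if prev_obj - obj < tol:
            break
        prev_obj = obj
    return Xtilde

# Tiny illustration: a 4 x 3 ratings-style matrix with two missing entries.
X = np.array([[5., 4., np.nan],
              [4., np.nan, 2.],
              [1., 2., 5.],
              [2., 1., 4.]])
print(matrix_complete(X, M=1).round(2))
```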


PCA for Missing Values and Matrix Completion

The ith customer's rating for movie j can be approximated by

$$\hat{x}_{ij} \approx \sum_{m=1}^{M} \hat{a}_{im} \hat{b}_{jm}$$

• $\hat{a}_{im}$ represents the strength with which the ith user belongs to the mth clique, a group of customers who enjoy movies of the mth genre.
• $\hat{b}_{jm}$ represents the strength with which the jth movie belongs to the mth genre.
Clustering
Clustering

• Clustering refers to a very broad set of techniques for


finding subgroups, or clusters, in a data set.
• We seek a partition of the data into distinct groups so that
the observations within each group are quite similar to
each other.
PCA vs Clustering

• PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance.
• Clustering looks for homogeneous subgroups among the observations.
Two clustering methods

• In K-means clustering, observations are partitioned into a pre-specified number of clusters.
• In hierarchical clustering, the number of clusters is not known beforehand.
• A tree-like visual representation of the observations, called a dendrogram, allows us to view at once the clusterings obtained for each possible number of clusters, from 1 to n.
K-means clustering

[Figure: three panels, K = 2, K = 3, K = 4.]

A simulated data set with 150 observations in 2-dimensional space. The panels show the results of applying K-means clustering with different values of K, the number of clusters. The color of each observation indicates the cluster to which it was assigned using the K-means clustering algorithm. Note that there is no ordering of the clusters, so the cluster coloring is arbitrary.
These cluster labels were not used in clustering; instead, they are the outputs of the clustering procedure.
Details of K-means clustering

Let $C_1, \ldots, C_K$ denote sets containing the indices of the observations in each cluster. These sets satisfy two properties:

1. $C_1 \cup C_2 \cup \cdots \cup C_K = \{1, \ldots, n\}$. In other words, each observation belongs to at least one of the K clusters.
2. $C_k \cap C_{k'} = \emptyset$ for all $k \ne k'$. In other words, the clusters are non-overlapping: no observation belongs to more than one cluster.

For instance, if the ith observation is in the kth cluster, then $i \in C_k$.

• The within-cluster variation for cluster $C_k$ is a measure $W(C_k)$ of the amount by which the observations within a cluster differ from each other.
• Hence K-means clustering solves the optimization problem

$$\underset{C_1, \ldots, C_K}{\text{minimize}}\; \sum_{k=1}^{K} W(C_k) \qquad (2)$$

• In words, this formula says: partition the observations into K clusters such that the total within-cluster variation, summed over all K clusters, is as small as possible.
How to define within-cluster variation?

• Typically, squared Euclidean distance is used:

$$W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \qquad (3)$$

where $|C_k|$ denotes the number of observations in the kth cluster.

• Combining (2) and (3) gives the optimization problem that defines K-means clustering:

$$\underset{C_1, \ldots, C_K}{\text{minimize}}\; \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \qquad (4)$$
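A small sketch of how the objective in (3)–(4) could be evaluated for a given cluster assignment (the toy data and the arbitrary labeling are for illustration only):

```python
import numpy as np

def within_cluster_variation(X, labels):
    """Objective (4): sum over clusters of the average pairwise squared distance."""
    total = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        # Pairwise squared Euclidean distances within cluster k (each ordered pair counted).
        diffs = Xk[:, None, :] - Xk[None, :, :]
        total += (diffs ** 2).sum() / len(Xk)
    return total

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 2))
labels = rng.integers(0, 2, size=10)       # an arbitrary assignment into K = 2 clusters
print(within_cluster_variation(X, labels))
```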
K-Means Clustering Algorithm

1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations.
2. Iterate until the cluster assignments stop changing:
   a. For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.
   b. Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
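A compact NumPy sketch of this algorithm on made-up data (the value of K, the toy data, and the stopping rule are illustrative choices; in practice a library routine such as scikit-learn's KMeans would normally be used):

```python
import numpy as np

def kmeans(X, K, seed=0, max_iter=100):
    """Basic K-means: random initial assignments, then alternate centroid and assignment steps."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign one of the K clusters to each observation.
    labels = rng.integers(0, K, size=len(X))
    for _ in range(max_iter):
        # Step 2a: centroid of each cluster (for simplicity we assume no cluster becomes empty).
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 2b: reassign each observation to the cluster with the closest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (25, 2)), rng.normal(4, 1, (25, 2))])
labels, centroids = kmeans(X, K=2)
print(np.bincount(labels), centroids.round(2))
```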
Properties of the Algorithm

• This algorithm is guaranteed to decrease the value of the objective (4) at each step. Why? Note that

$$\frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2$$

where $\bar{x}_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij}$ is the mean for feature j in cluster $C_k$.

• However, it is not guaranteed to give the global minimum.
• This is why the clustering should be run from a number of different initial assignments.
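A quick numerical check of this identity on random data (both sides should agree up to floating-point error):

```python
import numpy as np

rng = np.random.default_rng(5)
Xk = rng.normal(size=(8, 3))               # observations in one cluster C_k

# Left-hand side: average pairwise squared distance within the cluster.
diffs = Xk[:, None, :] - Xk[None, :, :]
lhs = (diffs ** 2).sum() / len(Xk)

# Right-hand side: twice the sum of squared deviations from the cluster centroid.
centroid = Xk.mean(axis=0)
rhs = 2 * ((Xk - centroid) ** 2).sum()

print(np.isclose(lhs, rhs))                # True: the identity holds
```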
Hierarchical Clustering

• K-means clustering requires pre-specification of the number


of clusters K .
• Hierarchical clustering is an alternative approach which
does not require that we commit to a particular choice of
K.
• HC also provides a tree-like visualization
Hierarchical Clustering: the idea

Builds a hierarchy in a "bottom-up" fashion...

[Figure: a sequence of panels in which the points A, B, C, D are merged successively, starting with the closest pair, until all points belong to one cluster.]
Hierarchical Clustering Algorithm
The approach in words:
• Start with each point in its own cluster.
• Identify the closest two clusters and merge them.
• Repeat.
• Ends when all points are in a single cluster.
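These steps are implemented, for example, in SciPy's hierarchical clustering routines. The sketch below uses complete linkage (one of the linkage types described later) on made-up data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (15, 2)),
               rng.normal(5, 1, (15, 2)),
               rng.normal([0, 5], 1, (15, 2))])

# Agglomerative clustering with complete linkage and Euclidean distance.
Z = linkage(X, method="complete", metric="euclidean")

# Cut the dendrogram to obtain a chosen number of clusters, e.g. 3.
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(labels))        # cluster sizes

# dendrogram(Z) would draw the tree (requires matplotlib).
```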

Dendrogram

[Figure: a small data set of five points A–E and the corresponding dendrogram; the height at which two clusters are merged reflects their dissimilarity.]
Types of Linkage

• Complete: Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.
• Single: Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities.
• Average: Mean inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.
• Centroid: Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.
An Example

[Figure: scatter plot of 45 observations in two dimensions (X1, X2).]

45 observations generated in 2-dimensional space. In reality there are three distinct classes, shown in separate colors. However, we will treat these class labels as unknown and will seek to cluster the observations in order to discover the classes from the data.
Application of hierarchical clustering

[Figure: three dendrograms of the 45 observations. Left: all observations in one cluster. Center: the dendrogram cut at a height of 9, giving 2 clusters. Right: the dendrogram cut at a height of 5, giving 3 clusters.]
Choice of Dissimilarity Measure

• So far we have used Euclidean distance.
• An alternative is correlation-based distance, which considers two observations to be similar if their features are highly correlated.
• Here correlation is computed between the observation profiles for each pair of observations.
• Correlation-based distance cares more about the shape of the profiles than about their levels.

[Figure: profiles of three observations plotted against the variable index, illustrating the difference between Euclidean and correlation-based distance.]
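A short sketch of how a correlation-based dissimilarity could be computed between observation profiles and passed to hierarchical clustering (1 − correlation is one common choice of dissimilarity):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(7)
X = rng.normal(size=(6, 20))                 # 6 observations, 20 variables

# Correlation between observation profiles (rows), turned into a dissimilarity.
corr = np.corrcoef(X)                        # 6 x 6 correlation matrix between observations
D = 1.0 - corr                               # correlation-based distance
np.fill_diagonal(D, 0.0)                     # guard against tiny numerical offsets

# linkage accepts a condensed distance vector for a precomputed dissimilarity.
Z = linkage(squareform(D, checks=False), method="average")
print(Z.round(2))
```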
Practical Issues for Clustering

1. Scaling of the variables matters.
2. In some cases, standardization may be useful.
3. What dissimilarity measure and what linkage should be used (for hierarchical clustering)?
4. How should K be chosen for K-means clustering?
5. Which features should be used to drive the clustering?
Example

• Gene expression measurements for 8,000 genes, from samples collected from 88 women with breast cancer.
• Hierarchical clustering was applied with average linkage and a correlation-based metric.
• A subset of 500 "intrinsic" genes was studied: genes selected according to how much they varied between women relative to within the same woman, measured before and after chemotherapy.
Heatmap

• Based on the gene expression profiles, the samples were clustered into groups.
• Survival curves were then compared for the different groups.

[Figure: heatmap of the clustered gene expression data and survival curves for the resulting groups.]
