
Lecture 7: Unsupervised Learning

Part I: Hierarchical Clustering, EM clustering


Part II: Dimensionality Reduction, Association rule
mining

Prof. Alexandra Chouldechova


95-791: Data Mining

1 / 74
Agenda for Part I

• Hierarchical clustering

• Mixture models (EM clustering)

2 / 74
Hierarchical clustering

• K-means is an objective-based approach that requires us to pre-specify the number of clusters K
• The answer it gives is somewhat random: it depends on the random initialization we started with
• Hierarchical clustering is an alternative approach that does not require a pre-specified choice of K, and which provides a deterministic answer (no randomness)
• We'll focus on bottom-up or agglomerative hierarchical clustering
  ◦ top-down or divisive clustering is also good to know about, but we won't directly cover it here

3 / 74
[Figure: building a hierarchical clustering step by step on a small set of labeled points]
• Each point starts as its own cluster
• We merge the two clusters (points) that are closest to each other
• Then we merge the next two closest clusters
• Then the next two closest clusters…
• Until at last all of the points are in a single cluster
4 / 74
Agglomerative Hierarchical Clustering

The approach in words:
• Start with each point in its own cluster.
• Identify the two closest clusters and merge them.
• Repeat until all points are in a single cluster.

To visualize the results, we can look at the resulting dendrogram.

[Figure: dendrogram for the five points A–E]
The y-axis on the dendrogram is (proportional to) the distance between the clusters that got merged at that step.
5 / 74
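A minimal R sketch of this procedure (illustrative data and cut height, not the example from the slides): base R's hclust and cutree implement exactly these steps.

set.seed(1)
X <- matrix(rnorm(60 * 2), ncol = 2)   # 60 illustrative points in R^2
d <- dist(X)                           # pairwise Euclidean distances
hc <- hclust(d, method = "average")    # also: "single", "complete"
plot(hc)                               # draw the dendrogram
clusters <- cutree(hc, h = 2.5)        # cut the tree at height h = 2.5
table(clusters)                        # how many points land in each cluster

Cutting at a height h corresponds to drawing a horizontal line across the dendrogram; cutree can alternatively be given k, a desired number of clusters.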
Linkages
• Let d_ij = d(x_i, x_j) denote the dissimilarity¹ (distance) between observations x_i and x_j
• At our first step, each cluster is a single point, so we start by merging the two observations that have the lowest dissimilarity
• But after that… we need to think about distances not between points, but between sets (clusters)
• The dissimilarity between two clusters is called the linkage
• i.e., Given two sets of points, G and H, a linkage is a dissimilarity measure d(G, H) telling us how different the points in these sets are
• Let's look at some examples

¹ We'll talk more about dissimilarities in a moment
6 / 74
Common linkage types

Complete: Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.

Single: Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities.

Average: Mean inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.

Centroid: Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.
7 / 74
Single linkage
In single linkage (i.e., nearest-neighbor linkage), the dissimilarity between G, H is the smallest dissimilarity between two points in different groups:

    d_single(G, H) = min_{i ∈ G, j ∈ H} d(x_i, x_j)

Example (dissimilarities d_ij are distances, groups are marked by colors): the single linkage score d_single(G, H) is the distance of the closest pair.

[Figure: two groups of points; single linkage is the distance between the closest pair]
8 / 74
Single linkage example
Here n = 60, x_i ∈ R², d_ij = ∥x_i − x_j∥_2. Cutting the tree at h = 0.9 gives the clustering assignments marked by colors.

[Figure: scatterplot of the clustered points and the dendrogram cut at height 0.9]

Cut interpretation: for each point x_i, there is another point x_j in its cluster such that d(x_i, x_j) ≤ 0.9
9 / 74
Complete linkage
In complete linkage (i.e., furthest-neighbor linkage), the dissimilarity between G, H is the largest dissimilarity between two points in different groups:

    d_complete(G, H) = max_{i ∈ G, j ∈ H} d(x_i, x_j)

Example (dissimilarities d_ij are distances, groups are marked by colors): the complete linkage score d_complete(G, H) is the distance of the furthest pair.

[Figure: two groups of points; complete linkage is the distance between the furthest pair]
10 / 74
Complete linkage example
Same data as before. Cutting the tree at h = 5 gives the clustering assignments marked by colors.

[Figure: scatterplot of the clustered points and the dendrogram cut at height 5]

Cut interpretation: for each point x_i, every other point x_j in its cluster satisfies d(x_i, x_j) ≤ 5
11 / 74
Average linkage
In average linkage, the dissimilarity between G, H is the average dissimilarity over all points in opposite groups:

    d_average(G, H) = (1 / (|G| · |H|)) ∑_{i ∈ G, j ∈ H} d(x_i, x_j)

Example (dissimilarities d_ij are distances, groups are marked by colors): the average linkage score d_average(G, H) is the average distance across all pairs. (The plot here only shows distances between the green points and one orange point.)

[Figure: two groups of points; average linkage averages the distances over all pairs]
12 / 74
Average linkage example
Same data as before. Cutting the tree at h = 2.5 gives clustering assignments marked by the colors.

[Figure: scatterplot of the clustered points and the dendrogram cut at height 2.5]

Cut interpretation: there really isn't a good one!
13 / 74
Shortcomings of Single and Complete linkage

Single and complete linkage have some practical problems:


• Single linkage suffers from chaining.
  ◦ In order to merge two groups, we only need one pair of points to be close, irrespective of all others. Therefore clusters can be too spread out, and not compact enough.
• Complete linkage avoids chaining, but suffers from crowding.
  ◦ Because its score is based on the worst-case dissimilarity between pairs, a point can be closer to points in other clusters than to points in its own cluster. Clusters are compact, but not far enough apart.

Average linkage tries to strike a balance. It uses average pairwise dissimilarity, so clusters tend to be relatively compact and relatively far apart.

14 / 74
Example of chaining and crowding
[Figure: the same data set clustered with Single linkage, Complete linkage, and Average linkage]
15 / 74
Shortcomings of average linkage

Average linkage has its own problems:

• Unlike single and complete linkage, average linkage doesn't give us a nice interpretation when we cut the dendrogram
• Results of average linkage clustering can change if we simply apply a monotone increasing transformation to our dissimilarity measure
  ◦ E.g., d → d² or d → e^d / (1 + e^d)
  ◦ This can be a big problem if we're not sure precisely what dissimilarity measure we want to use
  ◦ Single and Complete linkage do not have this problem

16 / 74
Average linkage monotone dissimilarity transformation
[Figure: average linkage clusterings using the distance (left) vs the squared distance (right)]

The left panel uses d(x_i, x_j) = ∥x_i − x_j∥_2 (Euclidean distance), while the right panel uses ∥x_i − x_j∥_2². The left and right panels would be the same as one another if we used single or complete linkage. For average linkage, we see that the results can be different.
17 / 74
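A minimal R sketch of this sensitivity check (illustrative data, not the data from the slides): cluster once on the Euclidean distances and once on their squares, then compare the resulting partitions.

set.seed(1)
X <- matrix(rnorm(60 * 2), ncol = 2)     # illustrative 2-D data
d <- dist(X)                             # Euclidean distances

hc.avg  <- hclust(d,   method = "average")   # average linkage on d
hc.avg2 <- hclust(d^2, method = "average")   # average linkage on the monotone transform d^2
table(cutree(hc.avg, k = 3), cutree(hc.avg2, k = 3))   # the 3-cluster solutions can disagree

# Single linkage merges in the same order under any monotone transform of d
identical(cutree(hclust(d,   method = "single"), k = 3),
          cutree(hclust(d^2, method = "single"), k = 3))   # typically TRUE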
Where should we place cell towers?

[source: http://imagicdigital.com/, Mark & Danya Henninger]

Suppose we wanted to place cell towers in a way that ensures that no building
is more than 3000ft away from a cell tower. What linkage should we use to
cluster buildings, and where should we cut the dendrogram, to solve this
problem?
18 / 74
Dissimilarity measures

• The choice of linkage can greatly affect the structure and quality of the resulting clusters
• The choice of dissimilarity (equivalently, similarity) measure is arguably even more important
• To come up with a similarity measure, you may need to think carefully and use your intuition about what it means for two observations to be similar. E.g.,
  ◦ What does it mean for two people to have similar purchasing behaviour?
  ◦ What does it mean for two people to have similar music listening habits?
• You can apply hierarchical clustering to any similarity measure s(x_i, x_j) you come up with. The difficult part is coming up with a good similarity measure in the first place.

19 / 74
Example: Clustering time series

Here's an example of using hierarchical clustering to cluster time series.

You can quantify the similarity between two time series by calculating the correlation between them. There are different kinds of correlations out there.

[Figure: dendrogram from hierarchically clustering a set of benchmark time series]

[source: A Scalable Method for Time Series Clustering, Wang et al]
20 / 74
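As a hedged sketch of one common recipe (not necessarily the exact method in the cited paper; the data are made up), a correlation matrix can be turned into a dissimilarity and handed to hclust:

set.seed(1)
ts.mat <- matrix(rnorm(20 * 50), nrow = 20)   # illustrative: one time series per row

# Correlation-based dissimilarity: highly correlated series end up "close"
d.cor <- as.dist(1 - cor(t(ts.mat)))

hc <- hclust(d.cor, method = "average")
plot(hc)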
K-means vs Hierarchical clustering
• K-means:
  + Low memory usage
  + Essentially O(n) compute time
  − Results are sensitive to random initialization
  − Number of clusters is pre-defined
  − Awkward with categorical variables

• Hierarchical clustering:
  + Deterministic algorithm
  + Dendrogram shows us clusterings for various choices of K
  + Requires only a distance matrix, quantifying how dissimilar observations are from one another
    ◦ We can use a dissimilarity measure that gracefully handles categorical variables, missing values, etc.
  − Memory-heavy, more computationally intensive than K-means

There are lots of practical considerations...


21 / 74
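For example (a sketch assuming the cluster package is available; it is not used elsewhere in these notes), Gower dissimilarity via cluster::daisy handles mixed numeric/categorical data and plugs straight into hclust:

library(cluster)

# Illustrative mixed-type data frame (made up for this sketch)
df <- data.frame(
  income  = c(40, 85, 62, 30, 120),
  region  = factor(c("east", "west", "west", "east", "south")),
  student = c(TRUE, FALSE, FALSE, TRUE, FALSE)
)

d <- daisy(df, metric = "gower")        # dissimilarities in [0, 1]
hc <- hclust(as.dist(d), method = "average")
plot(hc)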
What does it mean for two things to be “similar”?

[source: photographs by Sebastian Magnani]


22 / 74
K-means Limitations Illustrated
What shape might the clusters be expected to have?
• Non-convex/non-round-shaped clusters: Standard K-means fails!
• Clusters with different densities

[Figure: examples of non-convex clusters and clusters with different densities]
[source: Piyush Rai, CS5350/6350 Data Clustering, University of Utah, October 4, 2011; picture courtesy: Christof Monz (Queen Mary, Univ. of London)]
23 / 74
Not that your data will ever look like this, but...

[Figure: Single linkage (left) vs Complete linkage (right) clusterings]

The chaining property of Single linkage actually helps us in this case. Single linkage reproduces the clusters exactly, while Complete linkage strives to find compact clusters, and hence fails on this problem.

[source: Ajda Pretnar, Orange, University of Ljubljana]
24 / 74
You might get data that looks like this

• Of course, we don’t get to observe the labels


• This is a problem that K-means doesn’t do well on
[source: Wikipedia, Public Domain image]
25 / 74
Here’s what K-means does

• K-means struggles when the clusters have different densities


26 / 74
Gaussian Mixture Models (EM Clustering)

[source: Wikipedia, Public Domain image]

• If this was a classification problem, we'd try to use QDA
• In the unsupervised setting, we can apply the same kind of intuition
• Basic idea: Let's assume that the data generating process is a mixture of Multivariate Normals
27 / 74
Gaussian Mixture Models

• Assume each observation has probability π_k of coming from cluster k
• Assume that all observations from cluster k are drawn randomly from a MVN(µ_k, Σ_k) distribution

[Figure: simulated data plotted as X2 vs X1 and coloured by latent class (1, 2, 3)]

i.e., We're assuming that there are latent class labels that we don't observe.

28 / 74
[Figure: the same data, X2 vs X1, shown without the class labels]

• We don't know what the mixture components look like ahead of time
• We'll use an iterative algorithm that feels kind of like K-means
• Start by fixing K at some value
• Now, if we have estimates µ̂_k, Σ̂_k, π̂_k, we can calculate the posterior probability that observation i comes from class k (E-step)
• If we have the posterior probabilities, we can get updated estimates of µ̂_k, Σ̂_k, π̂_k (M-step)
29 / 74
EM Algorithm (sketch) for Gaussian mixtures
• Take initial guesses for the parameters µ̂_k, Σ̂_k, π̂_k
• Repeat until convergence:
  ◦ (Expectation step) Using the current parameter estimates, calculate the responsibilities

        θ̂_{i,k} = P(i ∈ C_k | x_i)

  ◦ (Maximization step) Using the current responsibilities, re-estimate the parameters. This is done with weighted averaging. E.g., the mean estimates work out to

        µ̂_k = ( ∑_{i=1}^n θ̂_{i,k} x_i ) / ( ∑_{i=1}^n θ̂_{i,k} )
30 / 74
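In practice one rarely codes the EM updates by hand. A minimal sketch using the mclust package (an assumption: mclust is not used elsewhere in these notes, and the data below are made up):

library(mclust)

set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),   # two illustrative Gaussian clumps
           matrix(rnorm(100, mean = 4), ncol = 2))

fit <- Mclust(X, G = 2)          # fit a 2-component Gaussian mixture by EM
summary(fit)
head(fit$z)                      # responsibilities (posterior class probabilities)
fit$parameters$mean              # estimated component means
plot(fit, what = "classification")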
Step 0

[Figure: initial configuration; figures © 2015 Artur Dubrawski, 95-791 Data Mining Lecture 7, "Example: Learning a Gaussian Mixture with E-M"]

• Each circle (mini pie-chart) is an observation
• Large ovals in the background represent the initial µ̂_k, Σ̂_k. π̂_k = 1/3 for all 3 classes
• Pie chart segments correspond to the responsibility estimates from the current µ̂_k, Σ̂_k, π̂_k.
31 / 74

Step 1: one iteration of both the E-step and M-step
Step 2: another iteration of both the E-step and M-step
Step 3: another iteration of both the E-step and M-step
Step 4: another iteration of both the E-step and M-step
Step 5: another iteration of both the E-step and M-step
Step 6: another iteration of both the E-step and M-step
Step 20: final picture; the algorithm has converged
32 / 74
Gaussian Mixture Modeling vs. K-means

• GMM's do better on this example because they essentially allow for a data-adaptive notion of distance when assigning points to centroids
  ◦ i.e., In the original data, we have 2 clumps with small variance, and one clump with large variance
  ◦ K-means can't capture this added information
  ◦ GMM's say: An observation belongs to C_k if its variance-adjusted (i.e., Σ_k-adjusted) distance to µ_k is small
33 / 74
A note on variable scaling
• Variable scaling can matter a lot
• Here’s an example of data in some original scaling.

[Figure: labeled data (left) and K = 3-means clustering (right)]

34 / 74
A note on variable scaling
• Here’s what happens if we rescale X2 via X2 ← X2 /3

[Figure: labeled rescaled data (left) and K-means clustering of the rescaled data (right)]

35 / 74
A note on variable scaling
• To make it easier to compare the results, in the right panel of the
Figure below we colour the points in the original data based on the
clustering obtained from the scaled data

[Figure: K-means clustering from the unscaled data (left) vs K-means clustering from the scaled data (right)]


36 / 74
[Figure 10.14 from ISL: three bar-chart panels of Socks vs Computers purchases under different variable scalings]

Figure 10.14 from ISL. Here we have # socks and computers purchased. Each
observation is represented by a coloured bar. How we scale the Socks and
Computers variables affects how dissimilar we view the people to be.
One general approach: Rescale all variables to have variance 1.
37 / 74
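A minimal sketch of that recipe (illustrative data; scale() standardizes each column to mean 0 and variance 1 before running K-means):

set.seed(1)
X <- cbind(rnorm(100, sd = 1), rnorm(100, sd = 30))   # illustrative data with very different scales

km.raw    <- kmeans(X,        centers = 3, nstart = 20)   # clusters dominated by the large-scale variable
km.scaled <- kmeans(scale(X), centers = 3, nstart = 20)   # both variables contribute after rescaling

table(km.raw$cluster, km.scaled$cluster)   # the two clusterings can differ substantially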
38 / 74
Agenda for Part II

• Dimensionality reduction
◦ Principal Components Analysis
◦ Correspondence analysis
◦ Multidimensional scaling

• Association rules

39 / 74
Dimensionality reduction
• Dimensionality reduction describes a family of methods for identifying directions along which the data varies the most
• E.g., the plot below shows Ad Spending vs. Population for n = 100 different cities.
• Most of the variation is along the direction of the green diagonal line
• A smaller amount of variation is in the direction of the dashed blue line

[Figure: Ad Spending vs Population, with the first (green, solid) and second (blue, dashed) principal directions]
40 / 74
Principal components analysis

[Figure: the population size (pop) and ad spending (ad) for 100 different cities shown as purple circles; the green solid line indicates the first principal component direction, and the blue dashed line indicates the second principal component direction]

• Principal components analysis (PCA) takes a data frame and computes the directions of greatest variation
• Essentially: PCA finds linear combinations of the original features that explain as much of the variation in the data as possible
• The green diagonal line in the Figure above is called the first principal direction
41 / 74
Computation of Principal Components
• Start with an n × p data set X. We only care about variation, so assume all of the columns (variables) have mean 0.
• To find the first principal component, look for the linear combination of features

      z_{i1} = ϕ_{11} x_{i1} + ϕ_{21} x_{i2} + … + ϕ_{p1} x_{ip}

  for i = 1, …, n that has the largest sample variance, subject to the constraint ∑_{j=1}^p ϕ_{j1}² = 1
• We started by assuming that (1/n) ∑_{i=1}^n x_{ij} = 0, which ensures that the sample mean (1/n) ∑_{i=1}^n z_{i1} = 0 also
• So we just need to find values of ϕ_{j1} that maximize the sample variance

      (1/n) ∑_{i=1}^n z_{i1}²

  subject to ∑_{j=1}^p ϕ_{j1}² = 1.
42 / 74
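A minimal sketch of this computation in R (illustrative data; prcomp centers each column by default, matching the mean-zero assumption above):

set.seed(1)
X <- matrix(rnorm(100 * 3), ncol = 3)
X[, 2] <- X[, 1] + 0.1 * X[, 2]      # make two columns highly correlated

pc <- prcomp(X)                      # principal components of the centered data
phi1 <- pc$rotation[, 1]             # first loading vector
sum(phi1^2)                          # = 1: the unit-norm constraint
z1 <- pc$x[, 1]                      # first principal component scores
var(z1)                              # the (maximized) sample variance of the scores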
What does this give us?
• We get a vector Z_1 = (z_{11}, z_{21}, …, z_{n1}) called the first principal component
• The vector ϕ_1 = (ϕ_{11}, ϕ_{21}, …, ϕ_{p1}) is called a loading vector, and defines a direction in feature space along which the data varies the most
• By projecting the n data points x_1, …, x_n onto this direction, we get the principal component scores (z_{11}, z_{21}, …, z_{n1})
• To find the second principal component Z_2, we repeat the process, further requiring that Z_2 be uncorrelated with Z_1
  ◦ This amounts to requiring that the second principal direction ϕ_2 be orthogonal to the first direction ϕ_1
• In general, for the kth principal component Z_k, we figure out the direction ϕ_k that maximizes the variance, requiring that ϕ_k be orthogonal to ϕ_1, ϕ_2, …, ϕ_{k−1}
43 / 74
A picture

[source: https://onlinecourses.science.psu.edu/stat857]
• This Figure shows an observation xi = (xi1 , xi2 ) along with zi1 , its
projection onto the first principal component direction, and zi2 , its
projection onto the second principal component direction

44 / 74
Example: USArrests Data

• USArrests data: For each of the n = 50 states in the United States, the data set contains the number of arrests per 100,000 residents for each of three crimes: Assault, Murder, and Rape.
• We also record UrbanPop, the percent of the population in each state living in urban areas.
• The principal component score vectors have length n = 50, and the principal component loading vectors (directions) have length p = 4.
• PCA was performed after standardizing each variable to have mean zero and standard deviation one.
  ◦ Normalization is very important!

45 / 74
USArrests biplot
arrests.scaled <- scale(USArrests) # Normalize the data
arrests.pca <- princomp(arrests.scaled) # Perform PCA
biplot(arrests.pca, scale = 0) # Construct biplot
[Figure: biplot of the first two principal components of the scaled USArrests data; states are plotted by their scores (Comp.1, Comp.2) and Murder, Assault, Rape, UrbanPop appear as loading arrows]

46 / 74
What happened?
[Figure: our biplot (left) vs ISL Figure 10.1 (right)]

• The direction of the arrows and location of the points is now flipped...
• The PCA solution is non-unique in this sense: Loading vectors ϕ_1 and −ϕ_1 will produce the same variance for the scores z_{i1}, but will result in opposite signs.
47 / 74
Let’s look at the loadings
Here's what ISL reports for the loadings ϕ_1, ϕ_2:

                PC1        PC2
Murder    0.5358995 -0.4181809
Assault   0.5831836 -0.1879856
UrbanPop  0.2781909  0.8728062
Rape      0.5434321  0.1673186

Here's what we got (all 4 loading vectors are shown):

           Comp.1   Comp.2   Comp.3   Comp.4
Murder   -0.53590  0.41818 -0.34123  0.64923
Assault  -0.58318  0.18799 -0.26815 -0.74341
UrbanPop -0.27819 -0.87281 -0.37802  0.13388
Rape     -0.54343 -0.16732  0.81778  0.08902

• If we flip the sign of all the values in our table, we'll get the same answer as ISL.
• We'll want to flip the signs on the loadings ϕ_j and the scores z_j
48 / 74
Here’s our new biplot
arrests.pca$loadings = -arrests.pca$loadings # Flip loadings (phi's)
arrests.pca$scores = -arrests.pca$scores # Flip scores (z's)
biplot(arrests.pca, scale = 0) # Construct biplot

[Figure: the corrected biplot, which now matches ISL Figure 10.1]
49 / 74
Loadings

           ϕ_1    ϕ_2
Murder    0.54  -0.42
Assault   0.58  -0.19
UrbanPop  0.28   0.87
Rape      0.54   0.17

[Figure: the biplot, with the loading arrows referenced to the top and right coordinate axes]

• The word UrbanPop is centered at (0.28, 0.87) (in terms of the top and right side coordinate axes)
50 / 74
Let’s look at the scores (z’s)

Comp.1 Comp.2 Comp.3 Comp.4


Alabama 0.98 -1.12 0.44 -0.15
Alaska 1.93 -1.06 -2.02 0.43
Arizona 1.75 0.74 -0.05 0.83
Arkansas -0.14 -1.11 -0.11 0.18
California 2.50 1.53 -0.59 0.34
Colorado 1.50 0.98 -1.08 -0.00

• Just like we get 4 loading vectors, we get 4 score vectors.


• The table above shows the scores for all 4 principal components
• The biplot is constructed by plotting the points with Comp.1 on the
x-axis and Comp.2 on the y-axis

51 / 74
Scores

            Comp.1  Comp.2
Alabama       0.98   -1.12
Alaska        1.93   -1.06
Arizona       1.75    0.74
Arkansas     -0.14   -1.11
California    2.50    1.53
Colorado      1.50    0.98
...            ...     ...

[Figure: the biplot, with the state scores referenced to the bottom and left coordinate axes]

• The word California is centered at (2.5, 1.53) (in terms of the bottom and left side coordinate axes)
52 / 74
Why does PCA work well on this data?

[Figure: pairs plot of Murder, Assault, Rape, and UrbanPop with pairwise correlations: Murder-Assault 0.80, Murder-Rape 0.56, Assault-Rape 0.67, Murder-UrbanPop 0.07, Assault-UrbanPop 0.26, Rape-UrbanPop 0.41]

In this pairs plot we can clearly see that the crime rate variables (Murder, Assault, and Rape) are highly correlated with one another. They provide redundant information. The first loading vector winds up forming a combination of these three features, essentially compressing 3 features into 1.

53 / 74
Proportion Variance Explained

• Dimensionality reduction techniques such as PCA work well when the data is essentially low dimensional
  ◦ i.e., when there are many groups of highly correlated features
• i.e., We may have p features, but we might be able to describe the data with just k ≪ p linear combinations of them
  ◦ For instance, if we have data on children, height, weight and age will all be highly correlated
  ◦ PCA will be able to identify a linear combination of these features that we'll roughly be able to interpret as the child's size
• In general, to understand how well PCA is doing, we look at the proportion of variance explained (PVE) of each principal component.

54 / 74
Proportion Variance Explained
• Assume as usual that all the variables have been centered to have mean 0.
• The total variance present in the data is defined as

      ∑_{j=1}^p var(X_j) = ∑_{j=1}^p (1/n) ∑_{i=1}^n x_{ij}²

  and the variance explained by the mth principal component is

      var(Z_m) = (1/n) ∑_{i=1}^n z_{im}²

• The PVE of the mth component is the ratio of these quantities:

      PVE(Z_m) = var(Z_m) / (total variance)

55 / 74
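A minimal sketch of computing the PVE in R (assuming the arrests.pca fit from the earlier slide; both princomp and prcomp store the component standard deviations):

pve <- arrests.pca$sdev^2 / sum(arrests.pca$sdev^2)   # PVE of each component
pve
cumsum(pve)                                           # cumulative PVE

plot(pve, type = "b", xlab = "Principal Component",
     ylab = "Prop. Variance Explained")
plot(cumsum(pve), type = "b", xlab = "Principal Component",
     ylab = "Cumulative Prop. Variance Explained")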
[Figure 10.4 from ISL: PVE (left) and cumulative PVE (right) of the four principal components in the USArrests data]

• The left panel shows PVE(Z_m) for all 4 principal components in the USArrests data
• The right panel shows the cumulative PVE: i.e., values of ∑_{m≤k} PVE(Z_m) for k = 1, 2, 3, 4
56 / 74
How many principal components should we use?

• There's no simple answer to this question
• Cross-validation is not available for this problem
  ◦ CV allows us to estimate test error... but we're doing unsupervised learning here and we don't really have a notion of test error to work with
  ◦ If we treated our principal components as derived features in a regression or classification task, we could certainly run Cross-validation in that setting
• Often, people like to look at so-called scree plots, which is what we showed on the previous slide

57 / 74
Finding elbows in scree plots

[source: https://gugginotes.wordpress.com/]
• The "Eigenvalue" label on the scree plot's y-axis should be interpreted as the PVE here
• Rule-of-thumb: Stop at the elbow in the scree plot. (k = 5 here)
58 / 74
Principal Components Regression
• Suppose we're back in the supervised learning setting where we have observations (x_i, y_i)
• We have a lot of features (p is large), and we suspect that many of them may be redundant/highly correlated
• We can perform PCA on the data matrix X, treating this as a feature engineering step
• This will give us k < p new features z_1, …, z_k, corresponding to the top k principal components
• Then we can model y on z_1, …, z_k instead of on the x_j
• This method is called Principal Components Regression (PCR)
  ◦ Of course, there's a classification version of PCR as well
59 / 74
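A hedged sketch of the two-step recipe (illustrative data; the pls package also offers a one-call pcr(), but the manual version makes the feature-engineering step explicit):

set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- x1 + 0.1 * rnorm(n)   # two highly correlated features
y  <- 1 + 2 * x1 + rnorm(n)
X  <- cbind(x1, x2)

pc <- prcomp(X, scale. = TRUE)   # feature-engineering step
z  <- pc$x[, 1]                  # keep only the first principal component

fit <- lm(y ~ z)                 # regress the outcome on the PC score
summary(fit)

# Classification version: glm(y ~ z, family = binomial) for a binary y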
Simple PCR example
• Suppose that we have two features: age and height
• Everyone in our sample is between 10 and 17 years old, so these features are highly correlated
• Instead of using both features in our models, we could just use the 1st principal component.

[Figure: scatterplot of height vs Age showing the strong positive relationship]
60 / 74
PCR Success case
• Will PCR work? This depends on the outcome, y
• Let's look at a classification setting where y = 1 if the individual is in high school, and 0 otherwise.

[Figure: height vs Age, colour-coded by in.highschool (0/1)]

• Using just the first principal component for classification will work really well here!
61 / 74
PCR Success case

[Figure: the data plotted on the first two principal components, colour-coded by in.highschool (0/1)]

• This is what the data looks like when plotted in terms of the two principal components
• We can clearly see that a logistic model with y ~ z_1 will classify really well

62 / 74
PCR Failure case
• Now what if our outcome was instead y = 1 if the person is tall for their age, and 0 otherwise.
• Here's a scatterplot of the data, colour-coded by outcome.

[Figure: height vs Age, colour-coded by tall.for.age (0/1)]

• The first principal component is going to be entirely orthogonal to the interesting direction for classification!
63 / 74
PCR Failure case

[Figure: the data plotted on the first two principal components, colour-coded by tall.for.age (0/1)]

• A logistic model with y ~ z_1 will fail completely!
• Take-away: PCR will work well when the leading principal components define directions that capture variation in the outcome y
64 / 74
PCA captures linear directions of variation
• PCA works well when the relationships between the features are linear
(e.g., when features are linearly correlated)
• Shown below is an example where the two axes are clearly strongly
associated, but the association is non-linear
• The PCA directions are reasonable... but don’t really capture the key trend
in the data
• Principal curves can be applied in such settings: This method generalizes
PCA by fitting 1-dimensional curves instead of lines

[Figure: PCA directions (left) vs a fitted principal curve (right)]

[img source: http://what-when-how.com/]
65 / 74
Correspondence Analysis

• PCA is great and widely used
• But it's limited to numeric or ordinal data
• We may want to perform dimensionality reduction with categorical features
• Correspondence analysis is an extension of PCA that gracefully handles categorical features
• The figure on the next page shows a biplot obtained by running Correspondence analysis on a dataset where consumers were asked to check whether or not they associated particular attributes (blue points) with various car models (red points)

66 / 74
[Figure: correspondence analysis biplot of car models and attributes]
[source: http://joelcadwell.blogspot.com/]
67 / 74
Multidimensional scaling
• Multidimensional scaling (MDS) is a dimensionality reduction method for visualizing the level of similarity among individuals in a dataset
• Instead of feeding in the data set X, we feed in a distance matrix specifying the pairwise distances between all observations
  ◦ Recall: For hierarchical clustering, we also don't need X. We just operate on the distance matrix
• Here's MDS output from an analysis of voting similarity between Republicans and Democrats (blue) in the house of representatives

[Figure: MDS plot of voting similarity in the House of Representatives]

68 / 74
Multidimensional scaling: US cities
• Here's an example of what MDS would do if applied to a matrix giving pairwise distances between all of the "major" cities in the US

[Figure: MDS reconstruction of the US city map from pairwise distances]

• Awesome! It doesn't get the right rotation, but we can't expect it to. The reconstruction is otherwise excellent.
69 / 74
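A minimal sketch in R using the built-in UScitiesD distance matrix (classical MDS via cmdscale; as noted above, the recovered map may come out rotated or reflected):

mds <- cmdscale(UScitiesD, k = 2)        # embed the pairwise city distances into 2 dimensions

plot(mds[, 1], mds[, 2], type = "n", xlab = "", ylab = "", asp = 1)
text(mds[, 1], mds[, 2], labels = labels(UScitiesD))   # label each point with its city name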
Association rules (Market Basket Analysis)
• Association rule learning has both a supervised and an unsupervised learning flavour
• We didn't discuss the supervised version when we were talking about regression and classification, but you should know that it exists.
  ◦ Look up: Apriori algorithm (Agrawal, Srikant, 1994)
  ◦ In R: apriori from the arules package
• Basic idea: Suppose you're consulting for a department store, and your client wants to better understand patterns in their customers' purchases
• Patterns or rules look something like:

      {suit, belt} ⇒ {dress shoes}
      {bath towels} ⇒ {bed sheets}

  where the item set on the left is the LHS of the rule and the item set on the right is the RHS
  ◦ In words: People who buy a new suit and belt are more likely to also buy dress shoes.
  ◦ People who buy bath towels are more likely to buy bed sheets
70 / 74
Basic concepts
• Association rule learning gives us an automated way of identifying these types of patterns
• There are three important concepts in rule learning: support, confidence, and lift
• The support of an item or an item set is the fraction of transactions that contain that item or item set.
  ◦ We want rules with high support, because these will be applicable to a large number of transactions
  ◦ {suit, belt, dress shoes} likely has sufficiently high support to be interesting
  ◦ {luggage, dehumidifier, teapot} likely has low support
• The confidence of a rule is the probability that a new transaction containing the LHS item(s) {suit, belt} will also contain the RHS item(s) {dress shoes}
• The lift of a rule is

      lift = support(LHS, RHS) / ( support(LHS) · support(RHS) )
           = P({suit, belt, dress shoes}) / ( P({suit, belt}) · P({dress shoes}) )
71 / 74
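A small worked example with made-up numbers (purely hypothetical supports, just to exercise the definitions): if 2% of transactions contain {suit, belt, dress shoes}, 4% contain {suit, belt}, and 10% contain {dress shoes}, then

support_lhs_rhs <- 0.02   # hypothetical support of {suit, belt, dress shoes}
support_lhs     <- 0.04   # hypothetical support of {suit, belt}
support_rhs     <- 0.10   # hypothetical support of {dress shoes}

confidence <- support_lhs_rhs / support_lhs                  # 0.5
lift       <- support_lhs_rhs / (support_lhs * support_rhs)  # 5: buying the LHS makes the
                                                             # RHS 5x more likely than baseline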
An example

> # require(arules)
> a_list <- list(
+   c("CrestTP","CrestTB"),
+   c("OralBTB"),
+   c("BarbSC"),
+   c("ColgateTP","BarbSC"),
+   c("OldSpiceSC"),
+   c("CrestTP","CrestTB"),
+   c("AIMTP","GUMTB","OldSpiceSC"),
+   c("ColgateTP","GUMTB"),
+   c("AIMTP","OralBTB"),
+   c("CrestTP","BarbSC"),
+   c("ColgateTP","GilletteSC"),
+   c("CrestTP","OralBTB"),
+   c("AIMTP"),
+   c("AIMTP","GUMTB","BarbSC"),
+   c("ColgateTP","CrestTB","GilletteSC"),
+   c("CrestTP","CrestTB","OldSpiceSC"),
+   c("OralBTB"),
+   c("AIMTP","OralBTB","OldSpiceSC"),
+   c("ColgateTP","GilletteSC"),

• A subset of drug store transactions is displayed above
• First transaction: Crest ToothPaste, Crest ToothBrush
• Second transaction: OralB ToothBrush
• etc…

[source: Stephen B. Vardeman, STAT502X at Iowa State University]
72 / 74
> rules <- apriori(trans, parameter = list(supp = .02, conf = .5, target = "rules"))

parameter specification:
 confidence minval smax arem aval originalSupport support minlen maxlen target ext
        0.5    0.1    1 none FALSE            TRUE    0.02      1     10  rules FALSE

algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09)    (c) 1996-2004    Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[9 item(s), 100 transaction(s)] done [0.00s].
sorting and recoding items ... [9 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [5 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].

• This says: Consider only those rules where the item sets have support at least 0.02, and confidence at least 0.5
• Here's what we wind up with

> inspect(head(sort(rules, by = "lift"), n = 20))
  lhs             rhs            support confidence     lift
1 {GilletteSC} => {ColgateTP}       0.03  1.0000000 20.00000
2 {ColgateTP}  => {GilletteSC}      0.03  0.6000000 20.00000
3 {CrestTB}    => {CrestTP}         0.03  0.7500000 15.00000
4 {CrestTP}    => {CrestTB}         0.03  0.6000000 15.00000
5 {GUMTB}      => {AIMTP}           0.02  0.6666667 13.33333

[source: Stephen B. Vardeman, STAT502X at Iowa State University]

73 / 74
Acknowledgements

All of the lecture notes for this class feature content borrowed with or without modification from the following sources:
• 36-462/36-662 Lecture notes (Prof. Tibshirani, Prof. G'Sell, Prof. Shalizi)
• 95-791 Lecture notes (Prof. Dubrawski)
• An Introduction to Statistical Learning, with applications in R (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
• Applied Predictive Modeling (Springer, 2013), Max Kuhn and Kjell Johnson
74 / 74
