Lecture07 95791
1 / 74
Agenda for Part I
• Hierarchical clustering
2 / 74
Hierarchical clustering
3 / 74
[Figure: five points (A–E) being merged step by step]
Each point starts as its own cluster.
We merge the two clusters (points) that are closest to each other.
Then we merge the next two closest clusters.
Then the next two closest clusters…
Until all of the points are in a single cluster.
4 / 74
Agglomerative Hierarchical Clustering Algorithm
The approach in words:
• Start with each point in its own cluster.
• Identify the closest two clusters and merge them.
• Repeat until all points are in a single cluster.
To visualize the results, we can look at the resulting dendrogram.
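A minimal sketch of the algorithm in R (the toy data below is an assumption, not the slides' data): dist builds the pairwise distance matrix, hclust does the agglomeration, and plot draws the dendrogram.
set.seed(1)
X <- matrix(rnorm(10 * 2), ncol = 2)          # 10 toy points in R^2
hc <- hclust(dist(X), method = "complete")    # agglomerative clustering
plot(hc)                                      # dendrogram; merge heights on the y-axis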
Dendrogram
[Figure: dendrogram for the five points A–E]
The y-axis on the dendrogram is (proportional to) the distance between the clusters that got merged at that step.
5 / 74
Linkages
• Let dij = d(xi, xj) denote the dissimilarity (distance) between observations xi and xj
7 / 74
Single linkage
In single linkage (i.e., nearest-neighbor linkage), the dissimilarity between
G, H is the smallest dissimilarity between two points in different groups:
dsingle(G, H) = min_{i∈G, j∈H} d(xi, xj)

[Figure: two groups of points marked by colors]
Example (dissimilarities dij are distances, groups are marked by colors): the single linkage score dsingle(G, H) is the distance of the closest pair.
8 / 74
Single linkage example
Here n = 60, xi ∈ R2 , dij = ∥xi − xj ∥2 . Cutting the tree at h = 0.9
gives the clustering assignments marked by colors
[Figure: the data coloured by cluster, and the single-linkage dendrogram cut at height 0.9]
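A hedged sketch of this kind of analysis in R (simulated stand-in data, since the slide's data isn't provided):
set.seed(1)
X <- matrix(rnorm(60 * 2), ncol = 2)          # stand-in for the n = 60 points
hc <- hclust(dist(X), method = "single")      # single linkage
clusters <- cutree(hc, h = 0.9)               # cut the dendrogram at height 0.9
plot(X, col = clusters)                       # colour points by cluster assignment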
9 / 74
Complete linkage
In complete linkage (i.e., furthest-neighbor linkage), dissimilarity
between G, H is the largest dissimilarity between two points in different
groups:
dcomplete(G, H) = max_{i∈G, j∈H} d(xi, xj)
[Figure: two groups of points marked by colors]
Example (dissimilarities dij are distances, groups are marked by colors): the complete linkage score dcomplete(G, H) is the distance of the furthest pair.
10 / 74
Complete linkage example
Same data as before. Cutting the tree at h = 5 gives the clustering
assignments marked by colors
[Figure: the data coloured by cluster, and the complete-linkage dendrogram cut at height 5]
Cut interpretation: for each point xi , every other point xj in its cluster
satisfies d(xi , xj ) ≤ 5
11 / 74
Average linkage
In average linkage, the dissimilarity between G, H is the average
dissimilarity over all points in opposite groups:
daverage(G, H) = (1 / (|G| · |H|)) ∑_{i∈G, j∈H} d(xi, xj)
[Figure: two groups of points marked by colors]
daverage(G, H) is the average distance across all pairs.
(The plot here only shows the distances between the green points and one orange point.)
12 / 74
Average linkage example
Same data as before. Cutting the tree at h = 2.5 gives clustering
assignments marked by the colors
[Figure: the data coloured by cluster, and the average-linkage dendrogram cut at height 2.5]
14 / 74
Example of chaining and crowding
[Figure: the same data clustered with single, complete, and average linkage; single linkage shows chaining, complete linkage shows crowding]
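To see the chaining and crowding tendencies for yourself, here is a short sketch that runs all three linkages on the same simulated data (reusing X from the earlier sketch):
d <- dist(X)                                   # X: the simulated stand-in data
op <- par(mfrow = c(1, 3))
for (m in c("single", "complete", "average")) {
  cl <- cutree(hclust(d, method = m), k = 3)   # ask each linkage for 3 clusters
  plot(X, col = cl, main = m)
}
par(op)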
15 / 74
Shortcomings of average linkage
• Unlike single and complete linkage, the result of average linkage can change if we apply a monotone transformation to the dissimilarities (e.g., replace distances with squared distances)
◦ This can be a big problem if we're not sure precisely what dissimilarity measure we want to use
16 / 74
Average linkage monotone dissimilarity transformation
Avg linkage: distance vs. distance^2
[Figure: the average-linkage clusterings obtained using distance (left panel) and squared distance (right panel)]
Suppose we wanted to place cell towers in a way that ensures that no building
is more than 3000ft away from a cell tower. What linkage should we use to
cluster buildings, and where should we cut the dendrogram, to solve this
problem?
18 / 74
Dissimilarity measures
• The choice of linkage can greatly affect the structure and quality of
the resulting clusters
19 / 74
Example: Clustering time series
[Figure: clustering of benchmark time series; source: Wang et al.]
20 / 74
K-means vs Hierarchical clustering
• K-means:
  (+) Low memory usage
  (+) Essentially O(n) compute time
  (−) Results are sensitive to random initialization
  (−) Number of clusters is pre-defined
  (−) Awkward with categorical variables
• Hierarchical clustering:
  (+) Deterministic algorithm
  (+) Dendrogram shows us clusterings for various choices of K
  (+) Requires only a distance matrix, quantifying how dissimilar observations are from one another
      ◦ We can use a dissimilarity measure that gracefully handles categorical variables, missing values, etc. (e.g., Gower distance; see the sketch below)
  (−) Memory-heavy, more computationally intensive than K-means
[source: Piyush Rai, CS5350/6350 (Data Clustering) @ Utah, October 4, 2011]
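As a hedged illustration of the dissimilarity-measure point above, the cluster package's daisy function computes Gower dissimilarities for mixed numeric/categorical data, and its output can be passed straight to hclust (toy data assumed):
library(cluster)
df <- data.frame(income = c(40, 55, 90, 38),              # numeric (toy values)
                 region = factor(c("N", "S", "S", "N")))  # categorical
d  <- daisy(df, metric = "gower")   # dissimilarity matrix for mixed-type data
hc <- hclust(d)                     # hierarchical clustering needs only d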
23 / 74
Not that your data will ever look like this, but...
[Figure: scatterplot of the data (X1 vs. X2), with points coloured by class (1, 2, or 3)]
i.e., We’re assuming that there are latent class labels that we don’t observe.
28 / 74
[Figure: the same data plotted without the class labels]
• We don’t know what the mixture components look like ahead of time
• We’ll use an iterative algorithm that feels kind of like K-means
• Start by fixing K at some value
• Now, if we have estimates µ̂k, Σ̂k, π̂k, we can calculate the posterior probability that observation i comes from class k (E-step)
• If we have the posterior probabilities, we can get updated estimates of µ̂k, Σ̂k, π̂k (M-step); both steps are sketched in code below
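A minimal sketch of these two steps in R, simplified to a one-dimensional mixture (toy code under that assumption, not the multivariate fit used for the figures):
em_gmm1d <- function(x, K, n_iter = 100) {
  n  <- length(x)
  mu <- sample(x, K); sg <- rep(sd(x), K); pk <- rep(1 / K, K)  # initial guesses
  for (it in 1:n_iter) {
    # E-step: posterior probability that observation i comes from component k
    dens  <- sapply(1:K, function(k) pk[k] * dnorm(x, mu[k], sg[k]))
    gamma <- dens / rowSums(dens)
    # M-step: update the parameter estimates, weighting by the posteriors
    Nk <- colSums(gamma)
    mu <- colSums(gamma * x) / Nk
    sg <- sqrt(colSums(gamma * outer(x, mu, "-")^2) / Nk)
    pk <- Nk / n
  }
  list(mu = mu, sigma = sg, pi = pk, posterior = gamma)
}
set.seed(1)
fit <- em_gmm1d(c(rnorm(150, 0), rnorm(150, 4)), K = 2)   # toy usage
fit$mu   # component means, typically close to 0 and 4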
29 / 74
EM Algorithm (sketch) for Gaussian mixtures
• Take initial guesses for the parameters µ̂k , Σ̂k , π̂k
30 / 74
[Figures: snapshots of the EM fit at Step 0, Step 2, Step 3, Step 4, Step 5, Step 6, and Step 20]
32 / 74
Gaussian Mixture Modeling vs. K-means
[Figure: cluster assignments from the fitted Gaussian mixture model and from K-means, shown side by side on the same data]
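A hedged sketch of this comparison on simulated data (it assumes the mclust package, which is not used in the slides, is installed):
library(mclust)
set.seed(1)
X <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 3), ncol = 2))   # two toy 2-D clusters
gmm <- Mclust(X, G = 2)                  # Gaussian mixture fitted by EM (soft assignments)
km  <- kmeans(X, centers = 2)            # K-means (hard assignments)
table(gmm$classification, km$cluster)    # compare the two labelings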
34 / 74
A note on variable scaling
• Here’s what happens if we rescale X2 via X2 ← X2 /3
[Figure: the clustering results before and after rescaling X2, shown side by side]
35 / 74
A note on variable scaling
• To make it easier to compare the results, in the right panel of the
Figure below we colour the points in the original data based on the
clustering obtained from the scaled data
[Figure: the two scatterplot panels described above]
[Figure: three bar-chart panels of Socks vs. Computers under different variable scalings; see the caption below]
Figure 10.14 from ISL. Here we have # socks and computers purchased. Each
observation is represented by a coloured bar. How we scale the Socks and
Computers variables affects how dissimilar we view the people to be.
One general approach: Rescale all variables to have variance 1.
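In R this is a one-liner (using the built-in USArrests data as a stand-in):
X.scaled <- scale(USArrests)    # centre each column and rescale it to variance 1
apply(X.scaled, 2, var)         # check: every column now has variance 1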
37 / 74
38 / 74
Agenda for Part II
• Dimensionality reduction
◦ Principal Components Analysis
◦ Correspondence analysis
◦ Multidimensional scaling
• Association rules
39 / 74
Dimensionality reduction
• Dimensionality reduction describes a family of methods for identifying the directions along which the data varies the most
• E.g., the plot below shows Ad Spending vs. Population for n = 100
different cities.
PCA: example
• Most of the variation is along the direction of the green diagonal line
• A smaller amount of variation is in the direction of the dashed blue line
[Figure: Ad Spending vs. Population for the 100 cities, with the first (solid green) and second (dashed blue) principal component directions drawn in]
40 / 74
Principal components analysis
[Figure: the same plot. The population size (pop) and ad spending (ad) for 100 different cities are shown as purple circles. The green solid line indicates the first principal component direction, and the blue dashed line indicates the second principal component direction.]
• Principal components analysis (PCA) takes a data frame and computes the directions of greatest variation
• Essentially: PCA finds linear combinations of the original features that explain as much of the variation in the data as possible
• The green diagonal line in the Figure above is called the first principal direction
41 / 74
Computation of Principal Components
• Start with an n × p data set X. We only care about variation, so
assume all of the columns (variables) have mean 0.
• To find the first principal component, look for the linear combination of features
  z_{i1} = ϕ_{11} x_{i1} + ϕ_{21} x_{i2} + · · · + ϕ_{p1} x_{ip}
  for i = 1, . . . , n that has the largest sample variance, subject to the constraint ∑_{j=1}^p ϕ_{j1}^2 = 1
• We started by assuming that (1/n) ∑_{i=1}^n x_{ij} = 0, which ensures that the sample mean (1/n) ∑_{i=1}^n z_{i1} = 0 also
• So we just need to find the values of ϕ_{j1} that maximize the sample variance
  (1/n) ∑_{i=1}^n z_{i1}^2
  subject to ∑_{j=1}^p ϕ_{j1}^2 = 1.
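A small sketch of this computation in R, using the scaled USArrests data and the same princomp call that appears on the later USArrests slides:
X  <- scale(USArrests)              # centre (and standardize) the columns
pc <- princomp(X)
phi1 <- pc$loadings[, 1]            # first loading vector; sum(phi1^2) = 1
z1   <- X %*% phi1                  # first principal component scores z_i1
var(as.vector(z1))                  # the (maximized) sample variance of z1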
42 / 74
What does this give us?
• We get a vector Z1 = (z11 , z21 , . . . , zn1 ) called the first principal
component
[source: https://onlinecourses.science.psu.edu/stat857]
• This Figure shows an observation xi = (xi1 , xi2 ) along with zi1 , its
projection onto the first principal component direction, and zi2 , its
projection onto the second principal component direction
44 / 74
Example: USArrests Data
• The principal component score vectors have length n = 50, and the
principal component loading vectors (directions) have length p = 4.
45 / 74
USArrests biplot
arrests.scaled <- scale(USArrests) # Normalize the data
arrests.pca <- princomp(arrests.scaled) # Perform PCA
biplot(arrests.pca, scale = 0) # Construct biplot
[Figure: biplot of the first two principal components (Comp.1, Comp.2); states are plotted at their scores, and arrows show the loadings of Murder, Assault, UrbanPop, and Rape]
46 / 74
What happened?
[Figure: our biplot shown next to the corresponding figure from ISL; the two differ only by a sign flip]
• If we flip the sign of all the values in our table, we’ll get the same
answer as ISL.
• We’ll want to flip the signs on the loadings ϕj and the scores zj
48 / 74
Here’s our new biplot
arrests.pca$loadings = -arrests.pca$loadings # Flip loadings (phi's)
arrests.pca$scores = -arrests.pca$scores # Flip scores (z's)
biplot(arrests.pca, scale = 0) # Construct biplot
[Figure: the biplot after flipping the signs of the loadings and scores; it now matches the ISL figure]
49 / 74
[Figure: the sign-flipped biplot again, shown alongside the table of loadings]
Loadings
          ϕ1     ϕ2
Murder    0.54  -0.42
Assault   0.58  -0.19
UrbanPop  0.28   0.87
Rape      0.54   0.17
• The word UrbanPop is centered at (0.28, 0.87) (in terms of the top
and right side coordinate axes)
50 / 74
Let’s look at the scores (z’s)
51 / 74
[Figure: the biplot again, shown alongside the table of scores]
Scores
            Comp.1  Comp.2
Alabama       0.98   -1.12
Alaska        1.93   -1.06
Arizona       1.75    0.74
Arkansas     -0.14   -1.11
California    2.50    1.53
Colorado      1.50    0.98
...            ...     ...
52 / 74
Why does PCA work well on this data?
[Figure: pairs plot of the four USArrests variables. Pairwise correlations: Murder-Assault 0.80, Murder-Rape 0.56, Assault-Rape 0.67, Murder-UrbanPop 0.07, Assault-UrbanPop 0.26, Rape-UrbanPop 0.41]
In this pairs plot we can clearly see that the crime rate variables—Murder, Assault, and Rape—are highly correlated with one another. They provide redundant information. The first loading vector winds up forming a combination of these three features, essentially compressing 3 features into 1.
53 / 74
Proportion Variance Explained
54 / 74
Proportion Variance Explained
• Assume as usual that all the variables have been centered to have
mean 0.
• The total variance present in the data is defined as
  ∑_{j=1}^p var(Xj) = ∑_{j=1}^p (1/n) ∑_{i=1}^n x_{ij}^2
• The variance explained by the m-th principal component is
  var(Zm) = (1/n) ∑_{i=1}^n z_{im}^2
• The proportion of variance explained (PVE) by the m-th component is
  PVE(Zm) = var(Zm) / total variance
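Continuing the princomp sketch from a few slides back (pc fit to the scaled USArrests data), PVE is easy to compute:
pve <- pc$sdev^2 / sum(pc$sdev^2)   # proportion of variance explained by each PC
pve                                  # for scaled USArrests the first PC explains roughly 62%
cumsum(pve)                          # cumulative PVE
plot(pve, type = "b"); plot(cumsum(pve), type = "b")   # as plotted on the next slide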
55 / 74
[Figure: the proportion of variance explained and the cumulative proportion of variance explained, plotted against the principal component number (1–4)]
57 / 74
Finding elbows in scree plots
[source: https://gugginotes.wordpress.com/]
• Eigenvalue y-axis label should be interpreted as PVE
• Rule-of-thumb: Stop at the elbow in the Scree plot. (k = 5 here)
58 / 74
Principal Components Regression
• Suppose we’re back in the supervised learning setting where we have
observations (xi , yi )
[Figure: scatterplot of height vs. Age]
60 / 74
PCR Success case
• Will PCR work? This depends on the outcome, y
[Figure: height vs. Age, coloured by the outcome in.highschool (0 or 1)]
• Using just the first principal component for classification will work
really well here!
61 / 74
PCR Success case
[Figure: the data plotted in terms of its first two principal components, coloured by in.highschool]
• This is what the data looks like when plotted in terms of the two
principal components
• We can clearly see that a logistic model with y ∼ z1 will classify
really well
62 / 74
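A hedged sketch of this "success case" in R, with simulated Age/height data standing in for the slide's (unavailable) dataset:
set.seed(1)
Age    <- runif(100, 10, 17)                    # simulated ages
height <- 100 + 4 * Age + rnorm(100, sd = 5)    # height roughly linear in age
in.highschool <- as.numeric(Age >= 14)          # simulated outcome
z1  <- princomp(scale(cbind(Age, height)))$scores[, 1]   # first PC scores
fit <- glm(in.highschool ~ z1, family = binomial)        # logistic model y ~ z1
summary(fit)                                             # z1 is a strong predictor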
PCR Failure case
• Now what if our outcome was instead y = 1 if the person is tall for
their age, and 0 otherwise.
• Here’s a scatterplot of the data, colour-coded by outcome.
[Figure: height vs. Age, coloured by the outcome tall.for.age (0 or 1)]
[Figure: the same data plotted in terms of its first two principal components, coloured by tall.for.age]
• Using just the first principal component will not work well here: tall.for.age varies mainly along the second principal component
66 / 74
[source: http://joelcadwell.blogspot.com/] 67 / 74
Multidimensional scaling
• Multidimensional scaling (MDS) is a dimensionality reduction method for visualizing the level of similarity among individuals in a dataset
• Instead of feeding in the data set X, we feed in a distance matrix
specifying the pairwise distances between all observations
◦ Recall: For hierarchical clustering, we also don’t need X. We just
operate on the distance matrix
• Here’s MDS output from an analysis of voting similarity between
Republicans and Democrats (blue) in the house of representatives
68 / 74
Multidimensional scaling: US cities
• Here’s an example of what MDS would do if applied to a matrix
giving pairwise distances between all of the “major” cities in the US
• Awesome! It doesn’t get the right rotation, but we can’t expect it to.
The reconstruction is otherwise excellent.
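A hedged sketch of the US-cities example, using base R's cmdscale and the built-in UScitiesD distance matrix (assumed to be close to the data behind the slide's figure):
mds <- cmdscale(UScitiesD, k = 2)                  # classical MDS into 2 dimensions
# MDS output is only determined up to rotation/reflection, so flip an axis by hand
plot(mds[, 1], -mds[, 2], type = "n", asp = 1)
text(mds[, 1], -mds[, 2], labels = rownames(mds))  # a recognizable map of the cities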
69 / 74
Association rules (Market Basket Analysis)
• Association rule learning has both a supervised and unsupervised
learning flavour
• We didn’t discuss the supervised version when we were talking about
regression and classification, but you should know that it exists.
◦ Look up: Apriori algorithm (Agrawal, Srikant, 1994)
◦ In R: apriori from the arules package
• Basic idea: Suppose you’re consulting for a department store, and
your client wants to better understand patterns in their customers’
purchases
• The patterns or rules look something like:
{suit, belt} ⇒ {dress shoes}
{bath towels} ⇒ {bed sheets}
(the item set on the left of each rule is the LHS; the one on the right is the RHS)
◦ In words: People who buy a new suit and belt are more likely to also buy dress shoes.
◦ People who buy bath towels are more likely to buy bed sheets
70 / 74
Basic concepts
• Association rule learning gives us an automated way of identifying
these types of patterns
• There are three important concepts in rule learning: support,
confidence, and lift
• The support of an item or an item set is the fraction of transactions
that contain that item or item set.
◦ We want rules with high support, because these will be applicable to a
large number of transactions
◦ {suit, belt, dress shoes} likely has sufficiently high support to be
interesting
◦ {luggage, dehumidifer, teapot} likely has low support
• The confidence of a rule is the probability that a new transaction
containing the LHS item(s) {suit, belt} will also contain the RHS
item(s) {dress shoes}
• The lift of a rule is
lift = support(LHS, RHS) / (support(LHS) · support(RHS)) = P({suit, belt, dress shoes}) / (P({suit, belt}) · P({dress shoes}))
71 / 74
An example
>
> #require(arules)
>
> a_list <- list(
+   c("CrestTP","CrestTB"),
+   c("OralBTB"),
+   c("BarbSC"),
+   c("ColgateTP","BarbSC"),
+   c("OldSpiceSC"),
+   c("CrestTP","CrestTB"),
+   c("AIMTP","GUMTB","OldSpiceSC"),
+   c("ColgateTP","GUMTB"),
+   c("AIMTP","OralBTB"),
+   c("CrestTP","BarbSC"),
+   c("ColgateTP","GilletteSC"),
+   c("CrestTP","OralBTB"),
+   c("AIMTP"),
+   c("AIMTP","GUMTB","BarbSC"),
+   c("ColgateTP","CrestTB","GilletteSC"),
+   c("CrestTP","CrestTB","OldSpiceSC"),
+   c("OralBTB"),
+   c("AIMTP","OralBTB","OldSpiceSC"),
+   c("ColgateTP","GilletteSC"),
• A subset of drug store transactions is displayed above
• First transaction: Crest ToothPaste, Crest ToothBrush
• Second transaction: OralB ToothBrush
• etc…
[source: Stephen B. Vardeman, STAT502X at Iowa State University]
72 / 74
98  {}                       Tr98
99  {}                       Tr99
100 {}                       Tr100
>
> rules <- apriori(trans, parameter = list(supp = .02, conf = .5, target = "rules"))
parameter specification:
 confidence minval smax arem aval originalSupport support minlen maxlen target ext
        0.5    0.1    1 none FALSE           TRUE    0.02      1     10  rules FALSE
algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE
apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09)       (c) 1996-2004  Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[9 item(s), 100 transaction(s)] done [0.00s].
sorting and recoding items ... [9 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [5 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
• This says: Consider only those rules where the item sets have support at least 0.02, and confidence at least 0.5
• Here's what we wind up with
> inspect(head(sort(rules, by = "lift"), n = 20))
  lhs             rhs           support confidence     lift
1 {GilletteSC} => {ColgateTP}      0.03  1.0000000 20.00000
2 {ColgateTP}  => {GilletteSC}     0.03  0.6000000 20.00000
3 {CrestTB}    => {CrestTP}        0.03  0.7500000 15.00000
4 {CrestTP}    => {CrestTB}        0.03  0.6000000 15.00000
5 {GUMTB}      => {AIMTP}          0.02  0.6666667 13.33333
73 / 74
Acknowledgements
All of the lectures notes for this class feature content borrowed with or
without modification from the following sources:
• 36-462/36-662 Lecture notes (Prof. Tibshirani, Prof. G’Sell, Prof. Shalizi)
• Applied Predictive Modeling, (Springer, 2013), Max Kuhn and Kjell Johnson
74 / 74