Lec 2
Rajan Patel
1 / 19
Classification problem
Recall:
▶ X = (X1, X2) are inputs.
▶ Color Y ∈ {Yellow, Blue} is the output.
▶ Purple line is the Bayes boundary: the best we could do if we knew the joint distribution of (X, Y).
Figure: Figure 2.13 (scatter of the samples of the two classes in the (X1, X2) plane with the Bayes decision boundary).
2 / 19
K-nearest neighbors
To assign a color to the input ×, we look at its K = 3 nearest
neighbors. We predict the color of the majority of the neighbors.
Figure: Figure 2.14 (the test point × and its K = 3 nearest neighbors).
3 / 19
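A minimal sketch of the K = 3 majority vote described above, using NumPy on synthetic data (the data and the function name knn_predict are illustrative choices, not part of the lecture):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic training inputs X = (X1, X2) and labels Y in {Yellow, Blue}.
X_train = rng.normal(size=(100, 2))
y_train = np.where(X_train[:, 0] + X_train[:, 1] > 0, "Blue", "Yellow")

def knn_predict(x_new, X, y, K=3):
    """Predict the majority label among the K nearest training points."""
    dists = np.linalg.norm(X - x_new, axis=1)   # Euclidean distance to every sample
    nearest = np.argsort(dists)[:K]             # indices of the K nearest neighbors
    labels, counts = np.unique(y[nearest], return_counts=True)
    return labels[np.argmax(counts)]            # majority vote

print(knn_predict(np.array([0.5, 0.2]), X_train, y_train, K=3))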
K-nearest neighbors also has a decision boundary
Figure: Figure 2.15 (KNN decision boundary with K = 10 in the (X1, X2) plane).
4 / 19
The higher K, the smoother the decision boundary
Figure: Figure 2.16 (two KNN fits with different values of K, showing a jagged versus a smooth decision boundary).
5 / 19
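To see this effect numerically, one can compare a very flexible fit (K = 1) with much smoother ones on held-out data. A sketch using scikit-learn and synthetic data (both are my assumptions; the lecture does not prescribe any software):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] > 0.5).astype(int)   # nonlinear true boundary

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for K in (1, 10, 100):
    clf = KNeighborsClassifier(n_neighbors=K).fit(X_tr, y_tr)
    print(f"K={K:3d}  train acc={clf.score(X_tr, y_tr):.2f}  test acc={clf.score(X_te, y_te):.2f}")

Small K fits the training data almost perfectly but gives a jagged boundary; large K smooths the boundary at the cost of flexibility.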
Clustering
▶ K-means clustering
▶ Hierarchical clustering
6 / 19
K-means clustering
▶ K is the number of clusters and must be fixed in advance.
▶ The goal of this method is to maximize the similarity of the samples within each cluster, i.e. to minimize the within-cluster dissimilarity:

\[
\min_{C_1,\dots,C_K} \sum_{\ell=1}^{K} W(C_\ell), \qquad
W(C_\ell) = \frac{1}{|C_\ell|} \sum_{i,j \in C_\ell} \mathrm{Distance}^2(x_{i,:}, x_{j,:}).
\]

Figure: Figure 10.5 (K-means results for K = 2, K = 3, and K = 4).
7 / 19
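A direct translation of the objective above into NumPy (the function name and loop structure are mine; only the formula comes from the slide):

import numpy as np

def within_cluster_dissimilarity(X, labels):
    """Return sum_l W(C_l) for data X (n x p) and cluster labels of length n."""
    total = 0.0
    for l in np.unique(labels):
        C = X[labels == l]                      # samples assigned to cluster l
        diffs = C[:, None, :] - C[None, :, :]   # all pairwise differences
        sq_dists = (diffs ** 2).sum(axis=2)     # squared Euclidean distances
        total += sq_dists.sum() / len(C)        # W(C_l) = (1/|C_l|) * sum_{i,j} dist^2
    return total

K-means searches over assignments C_1, ..., C_K to make this total as small as possible.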
K-means clustering algorithm
▶ Find the centroid of each cluster ℓ, i.e. the average $\bar{x}_{\ell,:}$ of all the samples in the cluster:

\[
\bar{x}_{\ell,j} = \frac{1}{|C_\ell|} \sum_{i \in C_\ell} x_{i,j} \quad \text{for } j = 1, \dots, p.
\]
8 / 19
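The centroid computation above is one half of the standard K-means iteration (Lloyd's algorithm): alternate between recomputing centroids and reassigning each sample to its nearest centroid. A minimal NumPy sketch (the initialization and stopping rule are my choices; the slide only specifies the centroid step):

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))        # random initial assignment
    for _ in range(n_iter):
        # Centroid step: x_bar_{l,j} = (1/|C_l|) sum_{i in C_l} x_{i,j}
        # (assumes no cluster becomes empty).
        centroids = np.array([X[labels == l].mean(axis=0) for l in range(K)])
        # Assignment step: move each sample to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):    # no change: converged
            break
        labels = new_labels
    return labels, centroids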
K-means clustering algorithm
Figure: Figure 10.6 (panels: Data; Step 1; Iteration 1, Step 2a).
9 / 19
Properties of K-means clustering
10 / 19
Example: K-means output with different initializations

Figure: Figure 10.7 (objective values for different random initializations: 320.9, 235.8, 235.8).

11 / 19
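Because each run only reaches a local optimum, the usual remedy is to restart K-means from several random initializations and keep the best result, which is what scikit-learn's n_init does. A sketch (scikit-learn and the synthetic data are assumptions; inertia_ is its within-cluster sum of squares, which equals the slide-7 objective up to a factor of 2):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(2).normal(size=(150, 2))

for n_init in (1, 10):
    km = KMeans(n_clusters=3, init="random", n_init=n_init, random_state=0).fit(X)
    print(f"n_init={n_init:2d}  objective={km.inertia_:.1f}")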
Hierarchical clustering
▶ Most algorithms for hierarchical clustering are agglomerative.
▶ The output of the algorithm is a dendrogram.
▶ We must be careful about how we interpret the dendrogram.

Figure: the nine samples (labelled 1 to 9) in the (X1, X2) plane, merged step by step, and the resulting dendrogram.

12 / 19
Hierarchical clustering

▶ Hierarchical clustering is not always appropriate.
▶ Example: male and female consumers of 3 different nationalities.
▶ Natural 2 clusters: gender
▶ Natural 3 clusters: nationality
▶ These clusterings are not nested or hierarchical.

Figure: Figure 10.9 (a dendrogram).

13 / 19
Notion of distance between clusters

At each step, we link the 2 clusters that are “closest” to each other.

Complete linkage:
The distance between 2 clusters is the maximum distance between any pair of samples, one in each cluster.

Average linkage:
The distance between 2 clusters is the average of all pairwise distances between samples, one in each cluster.

Single linkage:
The distance between 2 clusters is the minimum distance between any pair of samples, one in each cluster.
Suffers from the chaining phenomenon.

14 / 19
Example
Figure: Figure 10.12 (panels: Average Linkage, Complete Linkage, Single Linkage).
15 / 19
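A sketch of agglomerative clustering with the three linkages above, using SciPy on synthetic data (the library and data are my choices, not the lecture's):

import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))

for method in ("complete", "average", "single"):
    Z = linkage(X, method=method)                     # (n-1) x 4 table of merges
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes
# dendrogram(Z) draws the tree (requires matplotlib), as on slide 12.

Single linkage often produces one large cluster plus a few singletons, the chaining phenomenon mentioned above.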
Clustering is riddled with questions and choices
16 / 19
Clustering is riddled with questions and choices
17 / 19
Correlation distance
Figure: Observations 1, 2, and 3 plotted against Variable Index (1 to 20).
18 / 19
Correlation distance
▶ Euclidean distance would cluster all customers who purchase few things (orange and purple).
▶ Perhaps we want to cluster customers who purchase similar things (orange and teal).
▶ Then, the correlation distance may be a more appropriate measure of dissimilarity between samples.

Figure: Observations 1, 2, and 3 plotted against Variable Index (1 to 20).
19 / 19
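A sketch contrasting the two dissimilarities on three toy shoppers like those in the figure (the toy observations are illustrative, not the figure's data):

import numpy as np

rng = np.random.default_rng(4)
pattern = rng.random(20)                                # a purchase pattern over 20 items

obs1 = pattern                                          # buys few things
obs2 = 10 * pattern + rng.normal(scale=0.5, size=20)    # buys a lot, same pattern
obs3 = rng.random(20)                                   # buys few things, unrelated pattern

def euclidean(a, b):
    return np.linalg.norm(a - b)

def correlation_distance(a, b):
    return 1 - np.corrcoef(a, b)[0, 1]                  # small when patterns match

print("Euclidean:    d(1,2) =", round(euclidean(obs1, obs2), 1),
      " d(1,3) =", round(euclidean(obs1, obs3), 1))
print("Correlation:  d(1,2) =", round(correlation_distance(obs1, obs2), 2),
      " d(1,3) =", round(correlation_distance(obs1, obs3), 2))

On this toy data, Euclidean distance pairs the two low-volume shoppers (1 and 3), while correlation distance pairs the two shoppers with the same buying pattern (1 and 2).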