
Lecture 2: Classification, Clustering

STATS 202: Data mining and analysis

Rajan Patel

1 / 19
Classification problem

Recall:

I X = (X1, X2) are inputs.
I Color Y ∈ {Yellow, Blue} is the output.
I (X, Y) have a joint distribution.
I Purple line is the Bayes boundary: the best we could do if we
  knew the joint distribution of (X, Y).

Figure 2.13: the training observations plotted in the (X1, X2) plane
and colored by class, with the Bayes decision boundary drawn in purple.

2 / 19
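
Since the joint distribution behind Figure 2.13 is not given here, the following sketch assumes a simple two-class Gaussian mixture just to illustrate how a Bayes classifier would be computed if the distribution were known; the means, covariance, and prior are made up.

import numpy as np
from scipy.stats import multivariate_normal

prior_blue = 0.5                                                  # P(Y = Blue), assumed
f_blue = multivariate_normal(mean=[1.0, 1.0], cov=np.eye(2))      # f(x | Y = Blue), assumed
f_yellow = multivariate_normal(mean=[-1.0, -1.0], cov=np.eye(2))  # f(x | Y = Yellow), assumed

def bayes_classifier(x):
    # Predict the class with the larger posterior probability P(Y = k | X = x).
    p_blue = prior_blue * f_blue.pdf(x)
    p_yellow = (1 - prior_blue) * f_yellow.pdf(x)
    return "Blue" if p_blue > p_yellow else "Yellow"

print(bayes_classifier([0.5, 0.8]))  # close to the Blue mean, so "Blue"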
K-nearest neighbors
To assign a color to the input ×, we look at its K = 3 nearest
neighbors. We predict the color of the majority of the neighbors.

Figure 2.14: a test point × shown together with its three nearest
training observations; the majority of their colors gives the prediction.

3 / 19
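
A minimal sketch of the K = 3 majority-vote rule described above, written with NumPy; the five training points and their labels are assumptions for illustration, not the data of Figure 2.14.

import numpy as np
from collections import Counter

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.1], [0.9, 1.0], [1.2, 0.8]])
y_train = np.array(["Yellow", "Yellow", "Blue", "Blue", "Blue"])

def knn_predict(x_new, K=3):
    # Predict the majority class among the K nearest training points.
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:K]                  # indices of the K closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

print(knn_predict(np.array([1.0, 0.9])))  # -> "Blue"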
K-nearest neighbors also has a decision boundary

Figure 2.15: the KNN decision boundary for K = 10, plotted over the
training data in the (X1, X2) plane.
4 / 19
The higher K, the smoother the decision boundary

Figure 2.16: KNN decision boundaries for K = 1 (left) and K = 100
(right); the K = 1 boundary is highly irregular, while the K = 100
boundary is much smoother.

5 / 19
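
The effect of K can be reproduced with scikit-learn, assuming it is available; the two-moons data below are a stand-in for the simulated data of Figure 2.16.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=600, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for K in (1, 10, 100):
    clf = KNeighborsClassifier(n_neighbors=K).fit(X_train, y_train)
    # Small K: very flexible, irregular boundary; large K: smoother boundary.
    print(K, clf.score(X_train, y_train), clf.score(X_test, y_test))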
Clustering

As in classification, we assign a class to each sample in the data
matrix. However, the class is not an output variable; we only use
input variables.

Clustering is an unsupervised procedure, whose goal is to find
homogeneous subgroups among the observations.

We will discuss 2 algorithms:

I K-means clustering
I Hierarchical clustering

6 / 19
K-means clustering
I K is the number of clusters and must be fixed in advance.
I The goal of this method is to maximize the similarity of
  samples within each cluster:
  \[
  \min_{C_1, \ldots, C_K} \sum_{\ell=1}^{K} W(C_\ell);
  \qquad
  W(C_\ell) = \frac{1}{|C_\ell|} \sum_{i, j \in C_\ell}
  \mathrm{Distance}^2(x_{i,:}, x_{j,:}).
  \]

Figure 10.5: the same data partitioned by K-means with K = 2, K = 3,
and K = 4 (one panel per value of K).
7 / 19
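
A direct NumPy translation of the within-cluster dissimilarity W(C_ℓ) above, using squared Euclidean distance; the four points and the two-cluster assignment are made up for illustration.

import numpy as np

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9]])
clusters = {1: [0, 1], 2: [2, 3]}     # cluster label -> row indices of X

def W(indices):
    # (1 / |C|) * sum over all pairs (i, j) in C of the squared Euclidean distance
    pts = X[indices]
    diffs = pts[:, None, :] - pts[None, :, :]
    return (diffs ** 2).sum() / len(indices)

objective = sum(W(idx) for idx in clusters.values())  # quantity K-means minimizes
print(objective)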
K-means clustering algorithm

1. Assign each sample to a cluster from 1 to K arbitrarily, e.g. at
   random.

2. Iterate these two steps until the clustering is constant:

   I Find the centroid of each cluster ℓ, i.e. the average \(\bar{x}_{\ell,:}\) of all
     the samples in the cluster:
     \[
     \bar{x}_{\ell, j} = \frac{1}{|C_\ell|} \sum_{i \in C_\ell} x_{i, j}
     \quad \text{for } j = 1, \ldots, p.
     \]
   I Reassign each sample to the nearest centroid.

8 / 19
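
A minimal sketch of the two-step algorithm exactly as stated: a random initial assignment, then alternating centroid computation and nearest-centroid reassignment until the clustering stops changing. Reseeding empty clusters at a random data point is an implementation detail the slide does not specify.

import numpy as np

def kmeans(X, K, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    labels = rng.integers(0, K, size=n)               # step 1: arbitrary assignment
    while True:
        # step 2a: centroid (mean) of each cluster; empty clusters are reseeded
        centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else X[rng.integers(n)]
            for k in range(K)
        ])
        # step 2b: reassign each sample to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):         # clustering is constant: stop
            return labels, centroids
        labels = new_labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 4])
labels, centroids = kmeans(X, K=2)
print(centroids)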
K-means clustering algorithm

Figure 10.6: progress of the algorithm on example data; panels show the
raw data, the arbitrary initial assignment (Step 1), Steps 2a and 2b of
the first iteration, Step 2a of the second iteration, and the final results.
9 / 19
Properties of K-means clustering

I The algorithm always converges to a local minimum of
  \[
  \min_{C_1, \ldots, C_K} \sum_{\ell=1}^{K} W(C_\ell);
  \qquad
  W(C_\ell) = \frac{1}{|C_\ell|} \sum_{i, j \in C_\ell}
  \mathrm{Distance}^2(x_{i,:}, x_{j,:}).
  \]

I Each initialization could yield a different local minimum.

10 / 19
Example: K-means output with different initializations

In practice, we start from many random initializations and choose
the output which minimizes the objective function.

Figure 10.7: six runs of K-means from different random initializations;
the objective value attained by each run is printed above its panel
(320.9, 235.8, 235.8, 235.8, 235.8, and 310.9).

11 / 19
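
In practice this restarting is usually delegated to a library. Assuming scikit-learn is available, KMeans runs n_init random initializations and keeps the one with the lowest inertia (the within-cluster sum of squared distances to the centroids, which equals the objective above up to a factor of 2); the three-cluster toy data below are an assumption.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)) + c for c in ([0, 0], [4, 0], [0, 4])])

km = KMeans(n_clusters=3, n_init=50, random_state=0).fit(X)
print(km.inertia_)        # objective value of the best of the 50 runs
print(km.labels_[:10])    # cluster assignments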
Hierarchical clustering

Most algorithms for hierarchical clustering are agglomerative.

Figure: nine observations, labeled 1 to 9, plotted against X1 and X2;
successive panels fuse the closest pairs of clusters, and the final panel
shows the resulting dendrogram (heights from 0.0 to about 3.0).

The output of the algorithm is a dendrogram.

We must be careful about how we interpret the dendrogram.

12 / 19
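
A sketch of agglomerative clustering with SciPy, assuming scipy and matplotlib are available: linkage records the sequence of merges and dendrogram draws the tree. The nine random points stand in for the labeled observations on the slide.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(9, 2))            # nine observations in (X1, X2)

Z = linkage(X, method="complete")      # agglomerative merges, complete linkage
dendrogram(Z, labels=np.arange(1, 10)) # leaves labeled 1..9
plt.ylabel("Height")
plt.show()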
Hierarchical clustering

I The number of clusters is not fixed: cutting the dendrogram at
  different heights gives different numbers of clusters.

I Hierarchical clustering is not always appropriate.
  e.g. Market segmentation for consumers of 3 different
  nationalities.
  I Natural 2 clusters: gender
  I Natural 3 clusters: nationality
  These clusterings are not nested or hierarchical.

Figure 10.9: a dendrogram with heights from 0 to 10, shown cut at
different heights to produce different clusterings.

13 / 19
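
Because the number of clusters is not fixed in advance, the same linkage tree can be cut at different levels after the fact. A sketch with SciPy's fcluster, on assumed synthetic data:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(20, 2)) + c for c in ([0, 0], [5, 0], [0, 5])])

Z = linkage(X, method="complete")
two_clusters = fcluster(Z, t=2, criterion="maxclust")    # cut into 2 clusters
three_clusters = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
print(np.unique(two_clusters), np.unique(three_clusters))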
Notion of distance between clusters

At each step, we link the 2 clusters that are “closest” to each other.

Hierarchical clustering algorithms are classified according to the
notion of distance between clusters.

Complete linkage:
The distance between 2 clusters is the maximum distance between
any pair of samples, one in each cluster.

Average linkage:
The distance between 2 clusters is the average of all pairwise
distances.

Single linkage:
The distance between 2 clusters is the minimum distance between
any pair of samples, one in each cluster.
Suffers from the chaining phenomenon.

14 / 19
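
The three linkage rules can be computed directly from the matrix of pairwise distances between two clusters. A small sketch with made-up points:

import numpy as np
from scipy.spatial.distance import cdist

cluster_a = np.array([[0.0, 0.0], [0.0, 1.0]])
cluster_b = np.array([[3.0, 0.0], [4.0, 1.0], [5.0, 0.0]])

D = cdist(cluster_a, cluster_b)        # all pairwise Euclidean distances

print("complete linkage:", D.max())    # maximum pairwise distance
print("average linkage:", D.mean())    # average of all pairwise distances
print("single linkage:", D.min())      # minimum pairwise distance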
Example

Figure 10.12: the same data clustered with average linkage, complete
linkage, and single linkage (one panel per method).
15 / 19
Clustering is riddled with questions and choices

I Is clustering appropriate? i.e. Could a sample belong to more
  than one cluster?
  I Mixture models, soft clustering, topic models.
I How many clusters are appropriate?
  I Choose subjectively; it depends on the inference sought.
  I There are formal methods based on gap statistics, mixture
    models, etc.
I Are the clusters robust? (see the sketch after this slide)
  I Run the clustering on different random subsets of the data. Is
    the structure preserved?
  I Try different clustering algorithms. Are the conclusions
    consistent?
I Most important: temper your conclusions.

16 / 19
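
One way to probe robustness, as suggested above, is to recluster a random subset and compare the two labelings on the shared points. A sketch assuming scikit-learn, using the adjusted Rand index (1 means identical partitions up to relabeling); the data are an assumption.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(60, 2)) + c for c in ([0, 0], [5, 0], [0, 5])])

full_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

subset = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
sub_labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X[subset])

# Compare the two clusterings restricted to the subsampled points.
print(adjusted_rand_score(full_labels[subset], sub_labels))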
Clustering is riddled with questions and choices

I Should we scale the variables before doing the clustering?
  (see the sketch after this slide)
  I Variables with larger variance have a larger effect on the
    Euclidean distance between two samples.
I Does Euclidean distance capture dissimilarity between samples?

17 / 19
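
A small sketch of why scaling matters: when one variable has a much larger variance than another, it dominates the Euclidean distance until both are standardized. The two made-up variables below are only an illustration.

import numpy as np

rng = np.random.default_rng(0)
income = rng.normal(50_000, 10_000, size=100)    # large variance
age = rng.normal(40, 10, size=100)               # small variance
X = np.column_stack([income, age])

d_raw = np.linalg.norm(X[0] - X[1])              # dominated by income
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)  # z-score each column
d_scaled = np.linalg.norm(X_scaled[0] - X_scaled[1])

print(d_raw, d_scaled)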
Correlation distance

Example: Suppose that we want to cluster customers at a store
for market segmentation.
I Samples are customers.
I Each variable corresponds to a specific product and measures
  the number of items bought by the customer during a year.

Figure: number of items purchased (from 0 to 20) plotted against the
variable index (1 to 20) for Observations 1, 2, and 3.

18 / 19
Correlation distance
I Euclidean distance would cluster all customers who purchase
  few things (orange and purple).
I Perhaps we want to cluster customers who purchase similar
  things (orange and teal).
I Then, the correlation distance may be a more appropriate
  measure of dissimilarity between samples.

Figure: the same three observations, with the number of items
purchased plotted against the variable index (1 to 20).
19 / 19
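
A sketch contrasting Euclidean distance with correlation distance for three hypothetical customers, in the spirit of the figure: observations 1 and 3 buy few items overall, while observations 1 and 2 buy the products in similar proportions. The purchase vectors are made up.

import numpy as np
from scipy.spatial.distance import euclidean, correlation

rng = np.random.default_rng(0)
pattern = rng.uniform(size=20)        # shared purchasing pattern over 20 products

obs_1 = 2 * pattern                   # low-volume customer
obs_2 = 20 * pattern                  # high-volume customer, same pattern
obs_3 = 2 * rng.uniform(size=20)      # low-volume customer, unrelated pattern

# Euclidean distance groups the two low-volume customers (1 and 3) ...
print(euclidean(obs_1, obs_3) < euclidean(obs_1, obs_2))
# ... while correlation distance groups the two similar patterns (1 and 2).
print(correlation(obs_1, obs_2) < correlation(obs_1, obs_3))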
