QSRI Lecture 4: Clustering
Seth Flaxman
3 July 2019
Based on slides from Simon Rogers & Maurizio Filippone
A problem - too many features
• Aim: to build a classifier that can diagnose leukaemia using gene expression data.
• Data: 27 healthy samples, 11 leukaemia samples (N = 38). Each sample is the expression (activity) level for 3751 genes. (We also have an independent test set.)
[Figure: a 3-dimensional object and its 2-dimensional projection, like a shadow cast by a light]
Projection
• Once we've chosen W we can project test data into this new space too: Z_new = X_new W
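As a concrete illustration (not from the slides), a minimal numpy sketch of this projection step; the array shapes and the random W are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))    # N = 5 training samples, d = 3 features
W = rng.normal(size=(3, 2))    # d x M projection matrix (M = 2 new dimensions)

Z = X @ W                      # project training data: N x M

# Test data must be projected with the SAME W
X_new = rng.normal(size=(4, 3))
Z_new = X_new @ W
print(Z.shape, Z_new.shape)    # (5, 2) (4, 2)
```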
Choosing W
• Different W will give us different projections (imagine moving the light).
• Which should we use?
• Not all will represent our data well...
Principal Components Analysis
• Principal Components Analysis (PCA) is a method for choosing W.
• It finds the columns of W one at a time (define the jth column as w_j).
• Each d × 1 column defines one new dimension.
• Consider one of the new dimensions (columns of Z): z_j = X w_j
• PCA chooses w_1 as the direction that maximises the sample variance of the projection:
  σ²_{z_j} = (1/N) Σ_{n=1}^{N} (z_{jn} − µ_j)²,  where  µ_j = (1/N) Σ_{n=1}^{N} z_{jn}
• Once the first one has been found, w_2 is found that maximises the variance and is orthogonal to the first one, and so on.
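To make the maximisation step concrete, here is the standard derivation, which the slides leave implicit; it assumes the data have been mean-centred so that µ_j = 0:

```latex
% Variance of the projection z_1 = X w_1 for centred data:
%   sigma^2_{z_1} = (1/N) w_1^T X^T X w_1 = w_1^T C w_1,
% where C = X^T X / N is the sample covariance matrix.
% Maximise subject to the unit-norm constraint:
\max_{\mathbf{w}_1}\; \mathbf{w}_1^\top \mathbf{C}\,\mathbf{w}_1
  \quad \text{subject to} \quad \mathbf{w}_1^\top \mathbf{w}_1 = 1
% The Lagrangian w_1^T C w_1 - lambda (w_1^T w_1 - 1) is stationary when
\mathbf{C}\,\mathbf{w}_1 = \lambda\,\mathbf{w}_1,
% i.e. w_1 is an eigenvector of C, and the variance achieved equals lambda.
% The maximiser is therefore the eigenvector with the largest eigenvalue.
```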
PCA – a visualisation
[Figure: the same 2-D data plotted against x1 and x2 with three candidate projection directions, whose projected variances are σ²_z = 0.39, σ²_z = 1.2 and σ²_z = 1.9; the last direction captures the most variance]
PCA – analytic solution
• The maximising directions can be computed analytically: the columns w_j of W are the eigenvectors of the sample covariance matrix, ordered by decreasing eigenvalue.
[Figure: left, the data with the best direction (σ²_z = 1.9); right, the data projected into the new coordinates z1 and z2]
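A minimal numpy sketch of this analytic solution; the toy data is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))      # toy data: N = 200 samples, d = 5 features

Xc = X - X.mean(axis=0)            # centre each feature at zero

C = Xc.T @ Xc / Xc.shape[0]        # d x d sample covariance matrix

# eigh is appropriate because C is symmetric
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]  # sort directions by decreasing variance
W = eigvecs[:, order[:2]]          # keep the top M = 2 principal directions

Z = Xc @ W                         # N x 2 projection, as in the z1-z2 plot
print(eigvals[order[:2]])          # the variance captured by each direction
```

For data like the leukaemia set, where d = 3751 is much larger than N = 38, one would normally obtain the same W from an SVD of the centred data matrix rather than forming the d × d covariance matrix.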
Clustering
• For example:
  • x_n is a binary vector indicating which products customer n has bought.
  • We can group customers that buy similar products.
  • We can group products that are bought together (a tiny example follows).
• This task is known as clustering...
• ...and it is an example of unsupervised learning.
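To make the representation concrete, a tiny assumed example of such a binary purchase matrix; clustering its rows groups customers, and clustering the rows of its transpose groups products:

```python
import numpy as np

# Rows are customers, columns are products; 1 means "customer bought it".
# (Assumed toy data for illustration.)
X = np.array([
    [1, 1, 0, 0],   # customers 0 and 1 buy products 0 and 1
    [1, 1, 0, 0],
    [0, 0, 1, 1],   # customers 2 and 3 buy products 2 and 3
    [0, 0, 1, 1],
])

customers = X      # cluster these rows to group similar customers
products = X.T     # cluster these rows to group products bought together
```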
Clustering
[Figure: two panels of the same 2-D points x_n = [x_n1, x_n2]^T]
• Left: the data.
• Right: the data after clustering (points coloured according to cluster membership).
What we'll cover
• K-means
• But note: there are dozens and dozens of other clustering methods out there!
K-means
• Assume that there are K clusters.
• Each cluster is defined by a position in the input space: µ_k = [µ_k1, µ_k2]^T
• Each x_n is assigned to its closest cluster:
[Figure: 2-D data with each point coloured according to its closest cluster centre]
• Distance is normally the squared Euclidean distance: d_nk = (x_n − µ_k)^T (x_n − µ_k)
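A minimal numpy sketch of this assignment step; the function name and array shapes are assumptions for illustration:

```python
import numpy as np

def assign_clusters(X, mu):
    """Assign each row of X (N x d) to its closest centre in mu (K x d)."""
    # d_nk = (x_n - mu_k)^T (x_n - mu_k) for all n, k at once via broadcasting
    diff = X[:, None, :] - mu[None, :, :]   # N x K x d differences
    d = np.sum(diff ** 2, axis=2)           # N x K squared Euclidean distances
    return np.argmin(d, axis=1)             # N assignments in {0, ..., K-1}
```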
How do we find µk?
• Iterate between two steps, starting from an initial guess for the µ_k:
  1. Assign each x_n to its closest µ_k.
  2. Update each µ_k to be the mean of the points assigned to it (see the sketch after this slide):
     µ_k = (1/N_k) Σ_{n assigned to k} x_n
• Repeat until the assignments stop changing.
[Figure: a sequence of panels alternating the two steps: points recoloured after each assignment, means moved after each update]
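Putting the two steps together, a minimal sketch of the full K-means loop (Lloyd's algorithm), reusing the assign_clusters helper sketched above; the initialisation scheme is an assumption:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Basic K-means: returns centres mu (K x d) and assignments z (N,)."""
    rng = np.random.default_rng(seed)
    # Initialise the means with K randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    z = None
    for _ in range(n_iters):
        z_new = assign_clusters(X, mu)          # step 1: assign points
        if z is not None and np.array_equal(z_new, z):
            break                               # assignments unchanged: converged
        z = z_new
        for k in range(K):                      # step 2: update the means
            if np.any(z == k):
                mu[k] = X[z == k].mean(axis=0)
    return mu, z
```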
K-means – example
[Figure: a run of K-means shown panel by panel: after each assignment the points are recoloured, then each mean is updated and moves towards its points]
• Update means, reassign, and repeat.
• Solution at convergence.
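For completeness, a usage sketch of the kmeans function above on toy data matching the example slides; the Gaussian blobs are an assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
# Three well-separated Gaussian blobs (assumed toy data)
true_centres = np.array([[0.0, 0.0], [4.0, 4.0], [4.0, -4.0]])
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in true_centres])

mu, z = kmeans(X, K=3)
print(np.round(mu, 2))   # recovered centres, close to true_centres
```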
When does K-means break?
[Figure: two panels (x1 vs x2) of a dataset on which K-means fails; the clusters are not well described by distance to a single central point]
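As a hedged illustration of this failure mode (the slides do not specify the dataset, but two concentric rings are a standard example of clusters that a single central point cannot describe), the following sketch runs the kmeans function from above on ring data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two concentric rings: the clusters are defined by shape, not by a centre.
# (Assumed example; the slide's exact dataset is not specified.)
n = 200
theta = rng.uniform(0, 2 * np.pi, size=n)
radius = np.concatenate([np.full(n // 2, 0.5), np.full(n // 2, 1.5)])
radius += rng.normal(scale=0.05, size=n)
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

mu, z = kmeans(X, K=2)
# K-means cuts the plane in half instead of separating the rings,
# because each cluster is modelled by a single central point.
```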