
Unsupervised learning: PCA and k-means clustering

Seth Flaxman

Imperial College London

3 July 2019

Based on slides from Simon Rogers & Maurizio Filippone
A problem - too many features

I Aim: to build a classifier that can diagnose leukaemia using gene expression data.
I Data: 27 healthy samples, 11 leukaemia samples (N = 38). Each sample is the expression (activity) level for 3751 genes. (Also have an independent test set.)
I In general, the number of parameters will increase with the number of features – d = 3751.
I e.g. Logistic regression – w would have length 3751!
I Fitting lots of parameters is hard.

Features

I For visualisation, most examples we've seen have had only 2 features x = [x1 , x2 ]T.
I Now, we've been given lots (3751) to start with.
I We need to reduce this number.
I 2 general schemes:
I Use a subset of the originals.
I Make new ones by combining the originals.
Making new features

I An alternative to choosing features is making new ones.
I Cluster:
I Cluster the features (turn our clustering problem around).
I If we use say K-means, our new features will be the K mean vectors (a sketch follows this slide).
I Projection/combination:
I Reduce the number of features by projecting into a lower dimensional space.
I Do this by making new features that are (linear) combinations of the old ones.
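A minimal sketch (not from the slides) of the clustering idea, assuming the data sit in an N × d numeric matrix X with samples in rows and genes in columns; the toy matrix below is a stand-in for the real expression data:

# Sketch: make K new features by clustering the original features with K-means.
# X is a random stand-in for the N x d (38 x 3751) expression matrix.
set.seed(1)
X <- matrix(rnorm(38 * 200), nrow = 38)
K <- 5
fit <- kmeans(t(X), centers = K)   # cluster the columns (features) of X
Z <- t(fit$centers)                # N x K: each sample described by K feature-cluster means
dim(Z)                             # 38 x 5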
Projection

[Figure: a 3-dimensional object (a hand) and its 2-dimensional projection (a shadow)]
Projection

I We can project data (d dimensions) into a lower number of dimensions (m).
I Z = XW
I X is N × d
I W is d × m
I Z is N × m – an m-dimensional representation of our N objects.
I W defines the projection.
I Changing W is like changing where the light is coming from for the shadow (or rotating the hand). (X is the hand, Z is the shadow.)
I Once we've chosen W we can project test data into this new space too: Znew = Xnew W
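A minimal sketch of the projection itself; the matrices below are random stand-ins (a sensible W would come from PCA, next):

set.seed(1)
N <- 38; d <- 10; m <- 2
X <- matrix(rnorm(N * d), nrow = N)    # N x d data
W <- matrix(rnorm(d * m), nrow = d)    # d x m projection matrix (arbitrary here)
Z <- X %*% W                           # N x m representation of the N objects
Xnew <- matrix(rnorm(5 * d), nrow = 5)
Znew <- Xnew %*% W                     # test data projected into the same space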
Choosing W

I Different W will give us different projections (imagine moving the light).
I Which should we use?
I Not all will represent our data well...

[Figure: a badly chosen projection – "This doesn't look like a hand!"]
Principal Components Analysis

I Principal Components Analysis (PCA) is a method for choosing W.
I It finds the columns of W one at a time (define the jth column as wj).
I Each d × 1 column defines one new dimension.
I Consider one of the new dimensions (columns of Z): zj = Xwj
I PCA chooses wj to maximise the variance of zj:

\frac{1}{N} \sum_{n=1}^{N} (z_{jn} - \mu_j)^2, \qquad \mu_j = \frac{1}{N} \sum_{n=1}^{N} z_{jn}

I Once the first column has been found, w2 is chosen to maximise the variance while being orthogonal to w1, and so on.
PCA – a visualisation

[Figure: scatter plot of the original data in two dimensions, x1 vs x2]

I Original data in 2 dimensions.
I We'd like a 1-dimensional projection.
PCA – a visualisation

[Figure: the same data with three candidate projection directions drawn through it; the projected variances are σz² = 0.39, 1.2 and 1.9]

I Pick some arbitrary w.
I Project the data onto it.
I Compute the variance (on the line).
I The position on the line is our 1-dimensional representation.
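A minimal sketch of the "pick w, project, compute the variance" step on toy 2-d data; the candidate directions below are illustrative, not the ones in the figure:

set.seed(1)
X <- matrix(rnorm(200), ncol = 2)       # toy 2-d data, one row per point
proj_var <- function(w, X) {
  w <- w / sqrt(sum(w^2))               # unit-length direction
  z <- drop(X %*% w)                    # position of each point on the line
  var(z)                                # variance along the line (sample variance)
}
proj_var(c(1, 0), X)                    # variance along one candidate direction
proj_var(c(1, 1), X)                    # another; PCA picks the w maximising this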
PCA – analytic solution

I Could search for w1 , . . . , wM.
I But an analytic solution is available.
I The wj are the eigenvectors of the covariance matrix of X.
I R: prcomp or princomp
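A minimal sketch checking the analytic solution on toy data: the principal directions returned by prcomp agree (up to sign) with the eigenvectors of the covariance matrix:

set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
pca <- prcomp(X)                       # centres X by default
W <- pca$rotation                      # columns are w1, w2
eigen(cov(X))$vectors                  # the same directions, up to sign
Z <- pca$x                             # the projected data: centred X %*% W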
PCA – analytic solution

[Figure: the data with the first principal component drawn along the direction of maximum variance, σz² = 1.9]

I What would be the second component?


PCA – leukaemia data

[Figure: scatter plot of the first two principal components, z1 vs z2]

First two principal components in our leukaemia data (points labeled by class).
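A hedged sketch of how such a plot could be produced with prcomp; expr and labels below are hypothetical stand-ins for the 38 × 3751 expression matrix and the class labels, which aren't reproduced here:

set.seed(1)
expr <- matrix(rnorm(38 * 3751), nrow = 38)             # stand-in for the expression data
labels <- c(rep("healthy", 27), rep("leukaemia", 11))   # stand-in class labels
pca <- prcomp(expr)
plot(pca$x[, 1], pca$x[, 2], pch = 19,
     col = ifelse(labels == "healthy", "blue", "red"),
     xlab = "z1", ylab = "z2")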
Summary

I Sometimes we have too much data (too many dimensions).
I Need to select features.
I Features can be dimensions that already exist.
I Or we can make new ones.
I We've seen one example of each.
Clustering

I What if we just have the xn (and no labels)?
Clustering

I For example:
I xn is a binary vector indicating the products customer n has bought.
I Can group customers that buy similar products.
I Can group products bought together.
I Known as Clustering.
I And is an example of unsupervised learning.
Clustering

[Figure: two scatter plots of the same 2-d data – left: the raw data; right: the data after clustering]

I In this example each object has two attributes: xn = [xn1 , xn2 ]T
I Left: data.
I Right: data after clustering (points coloured according to cluster membership).
What we’ll cover

I K-means
I But note: there are dozens and dozens of other clustering
methods out there!
K-means

I Assume that there are K clusters.
I Each cluster is defined by a position in the input space: µk = [µk1 , µk2 ]T
I Each xn is assigned to its closest cluster:

[Figure: 2-d data points coloured according to the cluster centre they are closest to]

I Distance is normally the squared Euclidean distance: dnk = (xn − µk )T (xn − µk )
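A minimal sketch of this assignment step: squared Euclidean distances from every point to every centre, then the closest centre for each point (the toy data and centres are illustrative):

set.seed(1)
X <- matrix(rnorm(200), ncol = 2)            # N x 2 data
mu <- X[sample(nrow(X), 3), ]                # K = 3 centres, here just random points
# dnk = (xn - muk)^T (xn - muk) for every point n and centre k
D <- sapply(1:nrow(mu), function(k) colSums((t(X) - mu[k, ])^2))
cluster <- max.col(-D)                       # index of the closest centre for each point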
How do we find µk ?

I No analytical solution – we can't write down µk as a function of X.
I Use an iterative algorithm (a sketch in R follows):
1. Guess µ1 , µ2 , . . . , µK
2. Assign each xn to its closest µk
3. znk = 1 if xn is assigned to µk (0 otherwise)
4. Update µk to the average of the xn assigned to µk :

\mu_k = \frac{\sum_{n=1}^{N} z_{nk} x_n}{\sum_{n=1}^{N} z_{nk}}

5. Return to 2 until the assignments do not change.
I The algorithm will converge: it will reach a point where the assignments don't change.
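A minimal sketch of the loop above on toy data (in practice R's built-in kmeans() does this; the helper name my_kmeans is just illustrative, and for simplicity it assumes no cluster ever ends up empty):

my_kmeans <- function(X, K, max_iter = 100) {
  mu <- X[sample(nrow(X), K), , drop = FALSE]    # 1. guess the K means
  assign <- rep(0L, nrow(X))
  for (iter in 1:max_iter) {
    # 2./3. assign each xn to its closest mu_k (squared Euclidean distance)
    D <- sapply(1:K, function(k) colSums((t(X) - mu[k, ])^2))
    new_assign <- max.col(-D)
    if (all(new_assign == assign)) break         # 5. stop when assignments don't change
    assign <- new_assign
    # 4. update each mu_k to the average of the points assigned to it
    for (k in 1:K) mu[k, ] <- colMeans(X[assign == k, , drop = FALSE])
  }
  list(centers = mu, cluster = assign)
}
set.seed(1)
X <- rbind(matrix(rnorm(100), ncol = 2), matrix(rnorm(100, mean = 4), ncol = 2))
fit <- my_kmeans(X, K = 2)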
K-means – example

[Figures: a sequence of scatter plots of 2-d data showing the algorithm step by step]

I Cluster means randomly assigned (top left); points assigned to their closest mean.
I Cluster means updated to the mean of their assigned points.
I Points re-assigned to the closest mean and the means updated again – the two steps alternate.
I Solution at convergence.
When does K-means break?

[Figure: data arranged in two concentric rings]

I Data has clear cluster structure.
I Outer cluster cannot be represented as a single point.
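A minimal sketch of this failure mode on synthetic two-ring data: kmeans with K = 2 cuts straight across both rings instead of separating the inner ring from the outer one:

set.seed(1)
theta <- runif(200, 0, 2 * pi)
ring <- rep(c(0.3, 1.0), each = 100)                      # inner and outer radii
X <- cbind(ring * cos(theta), ring * sin(theta)) +
     matrix(rnorm(400, sd = 0.05), ncol = 2)              # noisy concentric rings
fit <- kmeans(X, centers = 2)
table(fit$cluster, rep(c("inner", "outer"), each = 100))  # each cluster contains both rings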
