Week 9 Notes

8 Additional Topic: Kernel Methods (still ST323)

Kernel methods are a broad family of statistical techniques, which extend many
well-known methods to work with complex data. In much of this module, we have
(sometimes implicitly) exploited linear structure in data, but for some data this is
not directly possible and our favourite statistical analyses fail. For example, if data
is far from being normally distributed, then it cannot be characterised simply by a
mean and covariance matrix.
The main idea of kernel methods is to map our data to a new (usually higher
dimensional) vector space, where the data does have some linear structure. Once we
have done this, we can then use our earlier techniques more successfully. Instead of
working with the observations (xi )i directly, we work with a nonlinear transformation
of the observations, that is we work with yi , Φ(xi ), i = 1, . . . , n, where Φ : Rp → H
is called the feature map, and is typically nonlinear. Here H may be Rq with q  p,
but often it is taken to be an infinite dimensional function space.
We focus on the familiar setting where the xi ∈ Rp , but these techniques can be
used to analyse much more complicated data types, including genomic data, graphs,
text, and images. All we need is a good choice of feature map Φ.

8.1 Kernel Principal Component Analysis


Kernel PCA (KPCA) is a tool for exploratory data analysis that extends PCA. It
can be useful in cases where the data has some structure that is not linear. For
instance, Figure 72 shows data coming from two groups. The two groups are not
linearly separable, even after performing a PCA, but a KPCA with a Gaussian
kernel (explained below) does separate the two groups linearly.
As in PCA, the goal of KPCA is to summarise each datapoint xi ∈ Rp into
uncorrelated scores sik, k = 1, . . . , K, where K ≤ p. In standard PCA, working
with X directly, Exercise 4.8 tells us that the PC scores are given by the spectral
decomposition of X̃X̃ᵀ, where X̃ is the column-centered version of X. We know that
X̃ = HX, where H := I − 11ᵀ/n is the n × n matrix with 1 − 1/n on the diagonal
and −1/n in the off-diagonal entries. It follows that if

X̃X̃ᵀ = HXXᵀH = UΛUᵀ    (8.1.1)

is the spectral decomposition of HXXᵀH, then the kth PC scores (sik)i are given
by the kth column of UΛ^{1/2}. The matrix XXᵀ has (i, j)th entry given by xiᵀxj.
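
As a quick sanity check (this snippet is not part of the original notes), the following R sketch verifies numerically that the PC scores returned by prcomp coincide, up to column signs, with the columns of UΛ^{1/2} computed from the spectral decomposition of HXXᵀH:

## Minimal sketch (not from the notes): PC scores via eigen(H X X^T H) vs prcomp
set.seed(1)
X0 <- matrix(rnorm(20*3), ncol=3)            # small toy data set, n = 20, p = 3
n <- nrow(X0)
H <- diag(n) - matrix(1, n, n)/n             # centering matrix H = I - 11^T/n
e <- eigen(H %*% X0 %*% t(X0) %*% H, symmetric=TRUE)
scores <- e$vectors[, 1:3] %*% diag(sqrt(pmax(e$values[1:3], 0)))
max(abs(abs(scores) - abs(prcomp(X0)$x)))    # ~0: same scores up to column signs
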
In KPCA we replace this inner product between xi and xj by the inner product
between Φ(xi ) and Φ(xj ) in the vector space H. Another way of thinking about this
bypasses the choice of Φ, which can be difficult. Instead, we can directly specify a
kernel function: k(·, ·) : Rp × Rp → R, which is thought of as

k(x, x′) := ⟨Φ(x), Φ(x′)⟩H.    (8.1.2)

Definition 8.1.1 (Kernel function).
A function
k(·, ·) : Rp × Rp → R
is called a kernel function if for all m ≥ 1 and all x1, . . . , xm ∈ Rp, the
m × m matrix K with (K)ij := k(xi, xj) is symmetric positive semi-definite.

The following two results tell us that defining a kernel function is equivalent to
defining the feature map Φ.

Proposition 8.1.2 (Every feature map defines a kernel).


If k(·, ·) is defined by (8.1.2), then it is a kernel function.

Proof. Left as an exercise.

Theorem 8.1.3 (Moore–Aronszajn theorem: every kernel has an
associated feature map).
If k(·, ·) is a kernel function, then there exists a Hilbert space H with
inner-product h·, ·iH , and a mapping Φ : Rp → H such that

k(x, x′) = ⟨Φ(x), Φ(x′)⟩H, ∀ x, x′ ∈ Rp.

Proof. The proof uses advanced analysis, and is not examinable.


We now know that we can work with a kernel instead of using the feature map.
Here are examples of well-known kernels:
linear: k(x, x′) := xᵀx′,
polynomial: k(x, x′) := (1 + xᵀx′)^d, for some d ∈ {1, 2, . . .},
Gaussian: k(x, x′) := exp(−‖x − x′‖²/(2σ²)), where σ > 0.
In particular, the linear kernel is equivalent to taking the feature map to be the
identity on Rp .
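
To make Definition 8.1.1 concrete, here is a short R check (not from the notes, using an illustrative helper gauss.k) that the Gaussian kernel matrix built from arbitrary points is symmetric positive semi-definite, i.e. has no negative eigenvalues beyond rounding error:

## Minimal sketch (not from the notes): empirical check of Definition 8.1.1
gauss.k <- function(x, y, sigma=1) exp(-sum((x - y)^2)/(2*sigma^2))
Z <- matrix(rnorm(15*2), ncol=2)             # m = 15 arbitrary points in R^2
Kmat <- outer(1:nrow(Z), 1:nrow(Z),
              Vectorize(function(i, j) gauss.k(Z[i,], Z[j,])))
isSymmetric(Kmat)                            # TRUE
min(eigen(Kmat, symmetric=TRUE)$values)      # >= 0 up to rounding error
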
Let us now turn again to our PC scores computed from the yi's. We know
that the first step is to compute the matrix HYYᵀH, and then compute its spectral
decomposition. Using the kernel k(·, ·) defined by (8.1.2), we replace (YYᵀ)ij by
⟨yi, yj⟩H = ⟨Φ(xi), Φ(xj)⟩H = k(xi, xj).
This leads to the following algorithm for KPCA:

Algorithm 8.1.1 (Kernel PCA).


Given data x1 , . . . , xn and a kernel function k(·, ·),

1. Compute the n × n matrix K, where (K)ij = k(xi , xj ).

2. Compute the row and column centered version of K, that is K̃ := HKH,
   where H = I − 11ᵀ/n.

3. Compute a spectral decomposition K̃ = UΛUᵀ.

4. Compute UΛ^{1/2}, whose columns will be the PC scores.

Figure 72 (see code below) shows an example of a dataset with two groups, with
the (ordinary) PCA, and two KPCAs with polynomial kernel (d = 4), and Gaussian
kernel (σ = 1). We notice that the polynomial kernel PCA does not separate the
groups linearly, but the Gaussian kernel PCA does. The choice of kernel depends on
the application, and it is common to try out different kernels (and kernel parameters)
to see which gives the “best” result.

library(magrittr)  # provides the pipe %>% used below

## Group 1: 100 points from a standard bivariate normal
x1 <- matrix(rnorm(2*100), ncol=2)
## Group 2: 100 points on a noisy ring of radius roughly 3
theta <- runif(100, min=0, max=2*pi)
x2 <- ((3 + runif(100))*c(cos(theta), sin(theta))) %>%
  matrix(ncol=2, byrow=F)
X <- rbind(x1, x2)
pch <- rep(c(1, 20), each=100)  # one plotting symbol per group
plot(X, pch=pch, xaxt='n', yaxt='n', xlab='', ylab='')
prcomp(X)$x[,1:2] %>% plot(pch=pch, xaxt='n', yaxt='n',
                           xlab='PC1', ylab='PC2') ## looks the same as the data

## Polynomial kernel with d = 4 (despite the name "quadratic")
quadratic <- function(x, y){
  (1 + sum(x*y))^4
}

## Gaussian kernel with sigma = 1
gaussian <- function(x, y){
  exp(-sum((x - y)^2)/2)
}

kernel.pca <- function(X, kernel){
  ## Kernel matrix K with (K)_ij = k(x_i, x_j); fill the upper triangle only
  K <- matrix(0, nrow=nrow(X), ncol=nrow(X))
  for(i in 1:nrow(K)){
    for(j in i:ncol(K)){
      K[i,j] <- kernel(X[i,], X[j,])
    }
  }
  ## Symmetrise: copy the upper triangle into the lower triangle
  K <- K + t(K)
  diag(K) <- diag(K)/2
  ## Row- and column-centre K, i.e. compute K.tilde = HKH
  K <- scale(K, scale=F)
  K <- scale(t(K), scale=F) %>% t

  ## Spectral decomposition; keep only the components with positive eigenvalues
  K.eigen <- eigen(K, symmetric=TRUE)
  npos <- sum(K.eigen$values > 0)

  return(list(scores=K.eigen$vectors[,1:npos]
                %*% diag(sqrt(K.eigen$values[1:npos])),
              variances=K.eigen$values[1:npos]))
}

## KPCA with the polynomial kernel (d = 4)
X.kpca <- kernel.pca(X, quadratic)
X.kpca$scores[,1:2] %>% plot(pch=pch, xaxt='n', yaxt='n', xlab='KPC1', ylab='KPC2')

## KPCA with the Gaussian kernel (sigma = 1)
X.kpca <- kernel.pca(X, gaussian)
X.kpca$scores[,1:2] %>% plot(pch=pch, xaxt='n', yaxt='n', xlab='KPC1', ylab='KPC2')
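
As a cross-check (not part of the notes), the kpca function in the kernlab package implements kernel PCA directly; its rbfdot kernel is parameterised as exp(−σ‖x − x′‖²), so sigma = 0.5 corresponds to the Gaussian kernel with σ = 1 used above. The resulting scores may differ from those of kernel.pca by sign and scaling conventions, but the group structure is the same.

## Hedged sketch: Gaussian-kernel KPCA via kernlab (assumes kernlab is installed)
library(kernlab)
kp <- kpca(X, kernel="rbfdot", kpar=list(sigma=0.5), features=2)
rotated(kp) %>% plot(pch=pch, xaxt='n', yaxt='n', xlab='KPC1', ylab='KPC2')
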

8.2 Kernel Mean Embeddings


When working with normal data, the means and covariance matrices of data play
a crucial role as they completely characterise the underlying distributions. For
example, in Section 5.3.3, where we wanted to test whether two samples came from
the same distribution, we tested whether or not they had the same mean. With
more complex data types this can be insufficient, as it can happen that two different
distributions have the same mean, even after assuming that they have the same
covariance matrix.
Kernel mean embeddings work by mapping distributions to a more complex
space, where the mean characterises the distribution. Once this is done, we are free
to base our statistical analysis on the analysis of means, which is something that we
are familiar with. Recalling our feature map Φ : Rp → H, we represent a random
variable X by its kernel mean embedding

µX := E{Φ(X)},

which is an element of H. In fact, for some choices of H, the mapping from the
distribution of X to its embedding µX is injective, so that µX characterises the
distribution of X. As usual, we estimate µX by the sample mean

µ̂X := (1/n) ∑_{i=1}^{n} Φ(xi).

Now suppose that we have two samples x1 , . . . , xm and y1 , . . . , yn , and we would


like to know whether these two samples could have come from the same distribution.
As in Section 5.3.3, we compare µ̂X to µ̂Y by looking at the size of µ̂X − µ̂Y . Here,
this is represented by its norm in the space H:

‖µ̂X − µ̂Y‖²H = ⟨µ̂X − µ̂Y, µ̂X − µ̂Y⟩H.

As with KPCA, we can derive a simple expression for this which only depends on
Φ through the kernel function k.

[Figure 72 appears here as four scatter plots: (a) original data (from 2 groups); (b) (ordinary) PCA of the data; (c) kernel PCA of the data using a polynomial kernel; (d) kernel PCA of the data using a Gaussian kernel.]

Figure 72: Illustration of the advantage of kernel PCA for data with multiple groups
that are not linearly separable. Notice that the Gaussian kernel PCA separates the
two groups linearly.
Proposition 8.2.1. We have that

‖µ̂X − µ̂Y‖²H = (1/m²) ∑_{i,j=1}^{m} k(xi, xj) − (2/(mn)) ∑_{i=1}^{m} ∑_{j=1}^{n} k(xi, yj) + (1/n²) ∑_{i,j=1}^{n} k(yi, yj).

Proof. See video.
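
The formula in Proposition 8.2.1 is straightforward to implement. The sketch below is not part of the original notes; it reuses the gaussian kernel function defined in the KPCA code above and computes ‖µ̂X − µ̂Y‖²H for two small samples:

## Minimal sketch (not from the notes): squared distance between kernel mean embeddings
mmd2 <- function(x, y, kernel){
  Kxx <- outer(1:nrow(x), 1:nrow(x), Vectorize(function(i, j) kernel(x[i,], x[j,])))
  Kyy <- outer(1:nrow(y), 1:nrow(y), Vectorize(function(i, j) kernel(y[i,], y[j,])))
  Kxy <- outer(1:nrow(x), 1:nrow(y), Vectorize(function(i, j) kernel(x[i,], y[j,])))
  mean(Kxx) - 2*mean(Kxy) + mean(Kyy)         # matches Proposition 8.2.1 term by term
}

set.seed(3)
x <- matrix(rnorm(100*2), ncol=2)             # m = 100 observations from N(0, I)
y <- matrix(rnorm(80*2, mean=1), ncol=2)      # n = 80 observations from a shifted normal
mmd2(x, y, gaussian)                          # larger values suggest different distributions
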


This measure of the difference between two distributions can be used to derive
hypothesis tests, but the theory of this is beyond the scope of the course. For a
biological data example in R, see
https://bioconductor.org/packages/release/bioc/html/MMDiff2.html,
for an image analysis example in Python, see
https://docs.seldon.io/projects/alibi-detect/en/latest/examples/cd_mmd_cifar10.html,
and for a text analysis example in Python, see
https://docs.seldon.io/projects/alibi-detect/en/stable/examples/cd_text_imdb.html.
A simple implementation in R is given by kmmd in the package kernlab.
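
A hedged usage sketch (assuming the kernlab package is installed; all arguments other than the data are left at their defaults):

## Hedged sketch: two-sample kernel test with kernlab's kmmd
library(kernlab)
set.seed(2)
x <- matrix(rnorm(100*2), ncol=2)             # sample 1: N(0, I)
y <- matrix(rnorm(80*2, mean=0.5), ncol=2)    # sample 2: shifted normal
kmmd(x, y)                                    # prints the MMD statistics and the H0 decision
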
