Week 9 Notes
Kernel methods are a broad family of statistical techniques which extend many
well-known methods to work with complex data. In much of this module we have
(sometimes implicitly) exploited linear structure in the data, but for some data this is
not directly possible and our favourite statistical analyses fail. For example, if the data
are far from being normally distributed, then they cannot be characterised simply by a
mean vector and covariance matrix.
The main idea of kernel methods is to map our data into a new (usually higher-dimensional)
vector space in which the data do have some linear structure. Once we
have done this, we can apply our earlier techniques more successfully. Instead of
working with the observations (xi)i directly, we work with a nonlinear transformation
of the observations, that is, we work with yi := Φ(xi), i = 1, . . . , n, where Φ : Rp → H
is called the feature map and is typically nonlinear. Here H may be Rq with q > p,
but often it is taken to be an infinite-dimensional function space.
We focus on the familiar setting where the xi ∈ Rp , but these techniques can be
used to analyse much more complicated data types, including genomic data, graphs,
text, and images. All we need is a good choice of feature map Φ.
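As a concrete illustration (a standard example, not taken from these notes): for p = 2 we can map each observation to the monomials of degree at most two,
\[
\Phi(x) = \bigl(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2\bigr) \in \mathbb{R}^6,
\qquad
\langle \Phi(x), \Phi(x') \rangle = \bigl(1 + x^T x'\bigr)^2,
\]
so inner products in the feature space can be computed directly from inner products of the original observations. This is a degree-2 version of the polynomial kernel used in the code below.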
Recall that if HXX^T H = UΛU^T is the spectral decomposition (with H the usual centring
matrix), then the kth PC scores (sik)i are given by the kth column of UΛ^{1/2}. The
matrix XX^T has (i, j)th entry given by xi^T xj. In kernel PCA (KPCA) we replace this
inner product between xi and xj by the inner product between Φ(xi) and Φ(xj) in the
vector space H. Another way of thinking about this bypasses the choice of Φ, which can
be difficult. Instead, we can directly specify a kernel function k(·, ·) : Rp × Rp → R,
which is thought of as the inner product ⟨Φ(x), Φ(x′)⟩H between features, without ever
writing Φ down explicitly.
Definition 8.1.1 (Kernel function). A function

k(·, ·) : Rp × Rp → R

is called a kernel function if for all m ≥ 1 and all x1, . . . , xm ∈ Rp, the
m × m matrix K with (K)ij := k(xi, xj) is symmetric positive semi-definite.
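The positive semi-definiteness in Definition 8.1.1 can be checked numerically for a particular kernel and sample. The short R snippet below is my own illustration (not from the notes); gauss.k is a stand-in for the Gaussian kernel defined in the code further below:

gauss.k <- function(x, y) exp(-sum((x - y)^2)/2)   # Gaussian kernel, sigma = 1
z <- matrix(rnorm(2*20), ncol=2)                   # 20 random points in R^2
K <- outer(1:20, 1:20,
           Vectorize(function(i, j) gauss.k(z[i,], z[j,])))
isSymmetric(K)                                     # TRUE
min(eigen(K, symmetric=TRUE)$values)               # non-negative, up to rounding error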
The following two results tell us that defining a kernel function is equivalent to
defining the feature map Φ.
The kernel PCA procedure is then:
1. Compute the kernel matrix K with (K)ij = k(xi, xj).
2. Centre it to obtain K̃ = HKH.
3. Compute a spectral decomposition K̃ = UΛU^T.
4. Take the kth column of UΛ^{1/2} as the kth kernel PC scores, keeping only the
components whose eigenvalues are positive.
Figure 72 (see code below) shows an example of a dataset with two groups, together
with the (ordinary) PCA and two KPCAs, one with a polynomial kernel (d = 4) and one
with a Gaussian kernel (σ = 1). We see that the polynomial kernel PCA does not
separate the groups linearly, but the Gaussian kernel PCA does. The choice of kernel
depends on the application, and it is common to try out different kernels (and kernel
parameters) to see which gives the “best” result.
library(magrittr)   ## provides the pipe %>%

x1 <- matrix(rnorm(2*100), ncol=2)                 ## group 1: standard normal
theta <- runif(100, min=0, max=2*pi)
x2 <- ((3 + runif(100))*c(cos(theta), sin(theta))) %>%
  matrix(ncol=2, byrow=FALSE)                      ## group 2: points on an annulus
X <- rbind(x1, x2)
pch <- rep(c(1, 20), each=100)
plot(X, pch=pch, xaxt='n', yaxt='n', xlab='', ylab='')
prcomp(X)$x[,1:2] %>% plot(pch=pch, xaxt='n', yaxt='n',
                           xlab='PC1', ylab='PC2') ## looks the same as the data
## Polynomial kernel (degree d = 4, despite the function name)
quadratic <- function(x, y){
  (1 + sum(x*y))^4
}
## Gaussian kernel (sigma = 1)
gaussian <- function(x, y){
  exp(-sum((x - y)^2)/2)
}
kpca <- function(X, kern) {
  n <- nrow(X)
  K <- outer(1:n, 1:n, Vectorize(function(i, j) kern(X[i,], X[j,])))  ## kernel matrix
  H <- diag(n) - 1/n                       ## centring matrix, so H %*% K %*% H = K-tilde
  K.eigen <- eigen(H %*% K %*% H, symmetric=TRUE)
  npos <- sum(K.eigen$values > 1e-8)       ## keep only (numerically) positive eigenvalues
  return(list(scores=K.eigen$vectors[,1:npos]
              %*% diag(sqrt(K.eigen$values[1:npos])),
              variances=K.eigen$values[1:npos]))
}
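The calls below are my reconstruction of how panels (c) and (d) of Figure 72 were produced (the original plotting calls are not shown above): apply kpca with each kernel and plot the first two kernel PC scores.

kp.poly <- kpca(X, quadratic)    ## polynomial kernel, d = 4
kp.gauss <- kpca(X, gaussian)    ## Gaussian kernel, sigma = 1
plot(kp.poly$scores[,1:2], pch=pch, xaxt='n', yaxt='n',
     xlab='KPC1', ylab='KPC2')
plot(kp.gauss$scores[,1:2], pch=pch, xaxt='n', yaxt='n',
     xlab='KPC1', ylab='KPC2')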
Kernel methods also give a way to compare two samples x1, . . . , xm and y1, . . . , yn.
The idea is to compare the feature means

µX := E{Φ(X)},   µY := E{Φ(Y)},

estimated by the sample averages µ̂X = (1/m) Σ Φ(xi) and µ̂Y = (1/n) Σ Φ(yj), through
the distance ‖µ̂X − µ̂Y‖H. As with KPCA, we can derive a simple expression for this
which depends on Φ only through the kernel function k.
[Figure 72: four scatter plots. (a) Original data (from 2 groups). (b) (Ordinary) PCA of the data (axes PC1, PC2). (c) Kernel PCA of the data using a polynomial kernel (axes KPC1, KPC2). (d) Kernel PCA of the data using a Gaussian kernel (axes KPC1, KPC2).]
Figure 72: Illustration of the advantage of kernel PCA for data with multiple groups
that are not linearly separable. Notice that the Gaussian kernel PCA separates the two
groups linearly.
Proposition 8.2.1. We have that
\[
\|\hat{\mu}_X - \hat{\mu}_Y\|_{\mathcal{H}}^2
  = \frac{1}{m^2}\sum_{i,j=1}^{m} k(x_i, x_j)
  - \frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} k(x_i, y_j)
  + \frac{1}{n^2}\sum_{i,j=1}^{n} k(y_i, y_j).
\]
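To illustrate Proposition 8.2.1 in code (my own sketch, not part of the notes): the right-hand side is a combination of means of kernel matrices, which we can compute directly for the two groups x1 and x2 generated earlier, using, say, the gaussian kernel defined above.

kmat <- function(A, B, kern){                      ## matrix of kern(A[i,], B[j,])
  outer(1:nrow(A), 1:nrow(B),
        Vectorize(function(i, j) kern(A[i,], B[j,])))
}
mmd2 <- function(x, y, kern){                      ## squared norm in Proposition 8.2.1
  mean(kmat(x, x, kern)) - 2*mean(kmat(x, y, kern)) + mean(kmat(y, y, kern))
}
mmd2(x1, x2, gaussian)                             ## distance between mean embeddings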