Week 9 Notes

8 Additional Topic: Kernel Methods (still ST323)

Kernel methods are a broad family of statistical techniques, which extend many
well-known methods to work with complex data. In much of this module, we have
(sometimes implicitly) exploited linear structure in data, but for some data this is
not directly possible and our favourite statistical analyses fail. For example, if data
is far from being normally distributed, then it cannot be characterised simply by a
mean and covariance matrix.
The main idea of kernel methods is to map our data to a new (usually higher
dimensional) vector space, where the data does have some linear structure. Once we
have done this, we can then use our earlier techniques more successfully. Instead of
working with the observations (xi )i directly, we work with a nonlinear transformation
of the observations, that is we work with yi , Φ(xi ), i = 1, . . . , n, where Φ : Rp → H
is called the feature map, and is typically nonlinear. Here H may be Rq with q  p,
but often it is taken to be an infinite dimensional function space.
We focus on the familiar setting where the xi ∈ Rp , but these techniques can be
used to analyse much more complicated data types, including genomic data, graphs,
text, and images. All we need is a good choice of feature map Φ.

8.1 Kernel Principal Component Analysis


Kernel PCA (KPCA) is a tool for exploratory data analysis that extends PCA. It
can be useful in cases where the data has some structure that is not linear. For
instance, Figure 72 shows data coming from two groups. The two groups are not
linearly separable, even after performing a PCA, but a KPCA with a Gaussian
kernel (explained below) does separate the two groups linearly.
As in PCA, the goal of KPCA is to summarise each datapoint xi ∈ Rp into
uncorrelated scores sik, k = 1, . . . , K, where K ≤ p. In standard PCA, working
with X directly, Exercise 4.8 tells us that the PC scores are given by the spectral
decomposition of X̃X̃ᵀ, where X̃ is the column-centered version of X. We know that
X̃ = HX, where H := I − 11ᵀ/n is the n × n matrix with 1 − 1/n on the diagonal
and −1/n in the off-diagonal entries. It follows that if

X̃X̃ᵀ = HXXᵀH = UΛUᵀ    (8.1.1)

is the spectral decomposition of HXXᵀH, then the kth PC scores (sik)i are given
by the kth column of UΛ^{1/2}. The matrix XXᵀ has (i, j)th entry given by xiᵀxj.
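
As a quick sanity check (this snippet is not part of the original notes), the following R sketch verifies numerically that the PC scores returned by prcomp coincide, up to column signs, with the columns of UΛ^{1/2} computed from the spectral decomposition of HXXᵀH:

## Minimal sketch (not from the notes): PC scores via eigen(H X X^T H) vs prcomp
set.seed(1)
X0 <- matrix(rnorm(20*3), ncol=3)            # small toy data set, n = 20, p = 3
n <- nrow(X0)
H <- diag(n) - matrix(1, n, n)/n             # centering matrix H = I - 11^T/n
e <- eigen(H %*% X0 %*% t(X0) %*% H, symmetric=TRUE)
scores <- e$vectors[, 1:3] %*% diag(sqrt(pmax(e$values[1:3], 0)))
max(abs(abs(scores) - abs(prcomp(X0)$x)))    # ~0: same scores up to column signs
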
In KPCA we replace this inner product between xi and xj by the inner product
between Φ(xi ) and Φ(xj ) in the vector space H. Another way of thinking about this
bypasses the choice of Φ, which can be difficult. Instead, we can directly specify a
kernel function: k(·, ·) : Rp × Rp → R, which is thought of as

k(x, x′) := ⟨Φ(x), Φ(x′)⟩H.    (8.1.2)

Definition 8.1.1 (Kernel function).
A function
k(·, ·) : Rp × Rp → R
is called a kernel function if for all m ≥ 1 and all x1, . . . , xm ∈ Rp, the
m × m matrix K with (K)ij := k(xi, xj) is symmetric positive semi-definite.

The following two results tell us that defining a kernel function is equivalent to
defining the feature map Φ.

Proposition 8.1.2 (Every feature map defines a kernel).


If k(·, ·) is defined by (8.1.2), then it is a kernel function.

Proof. Left as an exercise.

Theorem 8.1.3 (Moore–Aronszajn theorem: every kernel has an
associated feature map).
If k(·, ·) is a kernel function, then there exists a Hilbert space H with
inner-product h·, ·iH , and a mapping Φ : Rp → H such that

k(x, x′) = ⟨Φ(x), Φ(x′)⟩H, ∀ x, x′ ∈ Rp.

Proof. The proof uses advanced analysis, and is not examinable.


We now know that we can work with a kernel instead of using the feature map.
Here are examples of well-known kernels:
linear: k(x, x′) := xᵀx′,
polynomial: k(x, x′) := (1 + xᵀx′)^d, for some d ∈ {1, 2, . . .},
Gaussian: k(x, x′) := exp(−‖x − x′‖²/(2σ²)), where σ > 0.
In particular, the linear kernel is equivalent to taking the feature map to be the
identity on Rp .
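
To make Definition 8.1.1 concrete, here is a short R check (not from the notes, using an illustrative helper gauss.k) that the Gaussian kernel matrix built from arbitrary points is symmetric positive semi-definite, i.e. has no negative eigenvalues beyond rounding error:

## Minimal sketch (not from the notes): empirical check of Definition 8.1.1
gauss.k <- function(x, y, sigma=1) exp(-sum((x - y)^2)/(2*sigma^2))
Z <- matrix(rnorm(15*2), ncol=2)             # m = 15 arbitrary points in R^2
Kmat <- outer(1:nrow(Z), 1:nrow(Z),
              Vectorize(function(i, j) gauss.k(Z[i,], Z[j,])))
isSymmetric(Kmat)                            # TRUE
min(eigen(Kmat, symmetric=TRUE)$values)      # >= 0 up to rounding error
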
Let us now turn again to our PC scores computed from the yi's. We know
that the first step is to compute the matrix HYYᵀH, and then compute its spectral
decomposition. Using the kernel k(·, ·) defined by (8.1.2), we replace (YYᵀ)ij by
⟨yi, yj⟩H = ⟨Φ(xi), Φ(xj)⟩H = k(xi, xj).
This leads to the following algorithm for KPCA:

Algorithm 8.1.1 (Kernel PCA).


Given data x1 , . . . , xn and a kernel function k(·, ·),

1. Compute the n × n matrix K, where (K)ij = k(xi , xj ).

2. Compute the row and column centered version of K, that is K̃ := HKH,
   where H = I − 11ᵀ/n.

3. Compute a spectral decomposition K̃ = UΛUᵀ.

4. Compute UΛ^{1/2}, whose columns will be the PC scores.

Figure 72 (see code below) shows an example of a dataset with two groups, with
the (ordinary) PCA, and two KPCAs with polynomial kernel (d = 4), and Gaussian
kernel (σ = 1). We notice that the polynomial kernel PCA does not separate the
groups linearly, but the Gaussian kernel PCA does. The choice of kernel depends on
the application, and it is common to try out different kernels (and kernel parameters)
to see which gives the “best” result.

library(magrittr)  # provides the pipe %>% used below

## Group 1: 100 points from a standard bivariate normal
x1 <- matrix(rnorm(2*100), ncol=2)
## Group 2: 100 points on a noisy ring of radius roughly 3
theta <- runif(100, min=0, max=2*pi)
x2 <- ((3 + runif(100))*c(cos(theta), sin(theta))) %>%
  matrix(ncol=2, byrow=F)
X <- rbind(x1, x2)
pch <- rep(c(1, 20), each=100)  # one plotting symbol per group
plot(X, pch=pch, xaxt='n', yaxt='n', xlab='', ylab='')
prcomp(X)$x[,1:2] %>% plot(pch=pch, xaxt='n', yaxt='n',
                           xlab='PC1', ylab='PC2') ## looks the same as the data

## Polynomial kernel with d = 4 (despite the name "quadratic")
quadratic <- function(x, y){
  (1 + sum(x*y))^4
}

## Gaussian kernel with sigma = 1
gaussian <- function(x, y){
  exp(-sum((x - y)^2)/2)
}

kernel.pca <- function(X, kernel){
  ## Kernel matrix K with (K)_ij = k(x_i, x_j); fill the upper triangle only
  K <- matrix(0, nrow=nrow(X), ncol=nrow(X))
  for(i in 1:nrow(K)){
    for(j in i:ncol(K)){
      K[i,j] <- kernel(X[i,], X[j,])
    }
  }
  ## Symmetrise: copy the upper triangle into the lower triangle
  K <- K + t(K)
  diag(K) <- diag(K)/2
  ## Row- and column-centre K, i.e. compute K.tilde = HKH
  K <- scale(K, scale=F)
  K <- scale(t(K), scale=F) %>% t

  ## Spectral decomposition; keep only the components with positive eigenvalues
  K.eigen <- eigen(K, symmetric=TRUE)
  npos <- sum(K.eigen$values > 0)

  return(list(scores=K.eigen$vectors[,1:npos]
                %*% diag(sqrt(K.eigen$values[1:npos])),
              variances=K.eigen$values[1:npos]))
}

## KPCA with the polynomial kernel (d = 4)
X.kpca <- kernel.pca(X, quadratic)
X.kpca$scores[,1:2] %>% plot(pch=pch, xaxt='n', yaxt='n', xlab='KPC1', ylab='KPC2')

## KPCA with the Gaussian kernel (sigma = 1)
X.kpca <- kernel.pca(X, gaussian)
X.kpca$scores[,1:2] %>% plot(pch=pch, xaxt='n', yaxt='n', xlab='KPC1', ylab='KPC2')
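
As a cross-check (not part of the notes), the kpca function in the kernlab package implements kernel PCA directly; its rbfdot kernel is parameterised as exp(−σ‖x − x′‖²), so sigma = 0.5 corresponds to the Gaussian kernel with σ = 1 used above. The resulting scores may differ from those of kernel.pca by sign and scaling conventions, but the group structure is the same.

## Hedged sketch: Gaussian-kernel KPCA via kernlab (assumes kernlab is installed)
library(kernlab)
kp <- kpca(X, kernel="rbfdot", kpar=list(sigma=0.5), features=2)
rotated(kp) %>% plot(pch=pch, xaxt='n', yaxt='n', xlab='KPC1', ylab='KPC2')
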

8.2 Kernel Mean Embeddings


When working with normal data, the means and covariance matrices of data play
a crucial role as they completely characterise the underlying distributions. For
example, in Section 5.3.3, where we wanted to test whether two samples came from
the same distribution, we tested whether or not they had the same mean. With
more complex data types this can be insufficient, as it can happen that two different
distributions have the same mean, even after assuming that they have the same
covariance matrix.
Kernel mean embeddings work by mapping distributions to a more complex
space, where the mean characterises the distribution. Once this is done, we are free
to base our statistical analysis on the analysis of means, which is something that we
are familiar with. Recalling our feature map Φ : Rp → H, we represent a random
variable X by its kernel mean embedding

µX := E{Φ(X)},

which is an element of H. In fact, for some choices of H, the mapping from the
distribution of X to its embedding µX is injective, so that µX characterises the
distribution of X. As usual, we estimate µX by the sample mean

µ̂X := (1/n) ∑_{i=1}^{n} Φ(xi).

Now suppose that we have two samples x1 , . . . , xm and y1 , . . . , yn , and we would


like to know whether these two samples could have come from the same distribution.
As in Section 5.3.3, we compare µ̂X to µ̂Y by looking at the size of µ̂X − µ̂Y . Here,
this is represented by its norm in the space H:

‖µ̂X − µ̂Y‖²H = ⟨µ̂X − µ̂Y, µ̂X − µ̂Y⟩H.

As with KPCA, we can derive a simple expression for this which only depends on
Φ through the kernel function k.

[Figure 72 appears here as four scatter plots: (a) original data (from 2 groups); (b) (ordinary) PCA of the data; (c) kernel PCA of the data using a polynomial kernel; (d) kernel PCA of the data using a Gaussian kernel.]

Figure 72: Illustration of the advantage of kernel PCA for data with multiple groups
that are not linearly separable. Notice that the Gaussian kernel PCA separates the
two groups linearly.
Proposition 8.2.1. We have that

‖µ̂X − µ̂Y‖²H = (1/m²) ∑_{i,j=1}^{m} k(xi, xj) − (2/(mn)) ∑_{i=1}^{m} ∑_{j=1}^{n} k(xi, yj) + (1/n²) ∑_{i,j=1}^{n} k(yi, yj).

Proof. See video.
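
The formula in Proposition 8.2.1 is straightforward to implement. The sketch below is not part of the original notes; it reuses the gaussian kernel function defined in the KPCA code above and computes ‖µ̂X − µ̂Y‖²H for two small samples:

## Minimal sketch (not from the notes): squared distance between kernel mean embeddings
mmd2 <- function(x, y, kernel){
  Kxx <- outer(1:nrow(x), 1:nrow(x), Vectorize(function(i, j) kernel(x[i,], x[j,])))
  Kyy <- outer(1:nrow(y), 1:nrow(y), Vectorize(function(i, j) kernel(y[i,], y[j,])))
  Kxy <- outer(1:nrow(x), 1:nrow(y), Vectorize(function(i, j) kernel(x[i,], y[j,])))
  mean(Kxx) - 2*mean(Kxy) + mean(Kyy)         # matches Proposition 8.2.1 term by term
}

set.seed(3)
x <- matrix(rnorm(100*2), ncol=2)             # m = 100 observations from N(0, I)
y <- matrix(rnorm(80*2, mean=1), ncol=2)      # n = 80 observations from a shifted normal
mmd2(x, y, gaussian)                          # larger values suggest different distributions
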


This measure of the difference between two distributions can be used to derive
hypothesis tests, but the theory of this is beyond the scope of the course. For a
biological data example in R, see
https://bioconductor.org/packages/release/bioc/html/MMDiff2.html,
for an image analysis example in Python, see
https://docs.seldon.io/projects/alibi-detect/en/latest/examples/cd_mmd_cifar10.html,
and for a text analysis example in Python, see
https://docs.seldon.io/projects/alibi-detect/en/stable/examples/cd_text_imdb.html.
A simple implementation in R is given by kmmd in the package kernlab.
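
A hedged usage sketch (assuming the kernlab package is installed; all arguments other than the data are left at their defaults):

## Hedged sketch: two-sample kernel test with kernlab's kmmd
library(kernlab)
set.seed(2)
x <- matrix(rnorm(100*2), ncol=2)             # sample 1: N(0, I)
y <- matrix(rnorm(80*2, mean=0.5), ncol=2)    # sample 2: shifted normal
kmmd(x, y)                                    # prints the MMD statistics and the H0 decision
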
