
Kernel Ridge Regression
Mohammad Emtiyaz Khan
EPFL
Oct 27, 2015

©Mohammad Emtiyaz Khan 2015


Motivation
The ridge solution β∗ ∈ R^D has a counterpart α∗ ∈ R^N. Using duality, we will establish a relationship between β∗ and α∗ which leads the way to kernels.

Ridge regression
Throughout, we assume that there
is no intercept term β0 to make the
math easier.

The following is true for ridge regression:

\beta^* = (X^T X + \lambda I_D)^{-1} X^T y = X^T (X X^T + \lambda I_N)^{-1} y := X^T \alpha^*,

where \alpha^* := (X X^T + \lambda I_N)^{-1} y.

This can be proved using the following identity: let P be an N × M matrix and Q an M × N matrix; then

(PQ + I_N)^{-1} P = P (QP + I_M)^{-1}.
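As a quick sanity check, here is a minimal NumPy sketch (dimensions chosen arbitrarily) that verifies the identity numerically:

    import numpy as np

    rng = np.random.default_rng(0)
    N, M = 5, 3
    P = rng.standard_normal((N, M))
    Q = rng.standard_normal((M, N))

    lhs = np.linalg.solve(P @ Q + np.eye(N), P)   # (PQ + I_N)^{-1} P
    rhs = P @ np.linalg.inv(Q @ P + np.eye(M))    # P (QP + I_M)^{-1}
    assert np.allclose(lhs, rhs)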

What are the computational complexities for these two ways of computing β∗?
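Roughly, the first form solves a D × D system (about O(ND² + D³) work), while the second solves an N × N system (about O(N²D + N³)), so the dual route pays off when D ≫ N. A minimal NumPy sketch comparing the two routes (shapes and λ chosen arbitrarily):

    import numpy as np

    rng = np.random.default_rng(1)
    N, D, lam = 50, 200, 0.1                  # D >> N: the regime where the dual form is cheaper
    X = rng.standard_normal((N, D))
    y = rng.standard_normal(N)

    # Primal route: solve a D x D system.
    beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

    # Dual route: solve an N x N system, then map back via beta* = X^T alpha*.
    alpha = np.linalg.solve(X @ X.T + lam * np.eye(N), y)
    beta_dual = X.T @ alpha

    assert np.allclose(beta_primal, beta_dual)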

With this, we know that β∗ = X^T α∗ lies in the row space of X. Previously, we have seen that ŷ = Xβ∗ lies in the column space of X. In other words,
\beta^* = \sum_{n=1}^{N} \alpha_n^* x_n, \qquad \hat{y} = \sum_{d=1}^{D} \beta_d^* \bar{x}_d,

where

X = \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1D} \\ x_{21} & x_{22} & \dots & x_{2D} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \dots & x_{ND} \end{pmatrix} = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} = \begin{pmatrix} \bar{x}_1 & \bar{x}_2 & \dots & \bar{x}_D \end{pmatrix}.

The representer theorem


The representer theorem generalizes this result: for a β∗ minimizing the following function, for any loss L,

\min_{\beta} \; \sum_{n=1}^{N} L(y_n, x_n^T \beta) + \lambda \sum_{j=1}^{D} \beta_j^2,

there exists an α∗ such that β∗ = X^T α∗.

See an even more general statement on Wikipedia; the result was originally proved in Schölkopf, Herbrich and Smola (2001).
Kernelized ridge regression
The representer theorem allows us to write an equivalent optimization problem in terms of α. For example, for ridge regression, the following two problems are equivalent:

\beta^* = \arg\min_{\beta} \; \tfrac{1}{2} (y - X\beta)^T (y - X\beta) + \tfrac{\lambda}{2} \beta^T \beta
\alpha^* = \arg\max_{\alpha} \; -\tfrac{1}{2} \alpha^T (X X^T + \lambda I_N) \alpha + \alpha^T y

i.e. they both return the same optimal value and there is a one-to-one mapping between α∗ and β∗. Note that the optimization over α is a maximization problem.

Most importantly, the second problem is expressed in terms of the matrix XX^T. This is our first example of a kernel matrix.
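To make this concrete, here is a minimal NumPy sketch (variable names and λ chosen for illustration) of kernelized ridge regression with the linear kernel: fit by solving for α∗ from the kernel matrix, and predict using only inner products with the training points.

    import numpy as np

    rng = np.random.default_rng(2)
    N, D, lam = 30, 5, 0.5
    X = rng.standard_normal((N, D))
    y = rng.standard_normal(N)

    K = X @ X.T                                      # kernel (Gram) matrix, N x N
    alpha = np.linalg.solve(K + lam * np.eye(N), y)  # alpha* = (K + lam I_N)^{-1} y

    X_new = rng.standard_normal((4, D))
    k_new = X_new @ X.T                              # kernels between new and training points
    y_hat = k_new @ alpha                            # same as X_new @ beta* with beta* = X^T alpha*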

Note: We don't give a detailed derivation of the second problem here, but to show the equivalence you can check that the two problems attain equal optimal values. We will do a derivation later using the duality principle. You can also see a derivation at http://www.ics.uci.edu/~welling/classnotes/papers_class/Kernel-Ridge.pdf.
Advantages of kernelized ridge regression
First, it can be computationally more efficient in some cases: when D ≫ N, solving the N × N system for α is cheaper than solving the D × D system for β.

Second, by defining K = XX^T, we can work directly with K and never have to worry about X. This is the kernel trick.

Third, working with α is sometimes advantageous (e.g. in SVMs many entries of α will be zero).

Kernel functions
The linear kernel is defined below:
K = X X^T = \begin{pmatrix} x_1^T x_1 & x_1^T x_2 & \dots & x_1^T x_N \\ x_2^T x_1 & x_2^T x_2 & \dots & x_2^T x_N \\ \vdots & \vdots & \ddots & \vdots \\ x_N^T x_1 & x_N^T x_2 & \dots & x_N^T x_N \end{pmatrix}.
The kernel with basis functions φ(x), with K := ΦΦ^T, is shown below:

K = \Phi \Phi^T = \begin{pmatrix} \phi(x_1)^T \phi(x_1) & \phi(x_1)^T \phi(x_2) & \dots & \phi(x_1)^T \phi(x_N) \\ \phi(x_2)^T \phi(x_1) & \phi(x_2)^T \phi(x_2) & \dots & \phi(x_2)^T \phi(x_N) \\ \vdots & \vdots & \ddots & \vdots \\ \phi(x_N)^T \phi(x_1) & \phi(x_N)^T \phi(x_2) & \dots & \phi(x_N)^T \phi(x_N) \end{pmatrix}.

The kernel trick
A big advantage of using kernels
is that we do not need to specify
φ(x) explicitly, since we can work
directly with K.

We will use a kernel function k(x, x′) and compute the (i, j)-th entry of K as K_ij = k(x_i, x_j). For example, for the linear kernel and for a basis function expansion, the kernel functions are the following:

k(x, x') := x^T x', \qquad k(x, x') := \phi(x)^T \phi(x').

However, a kernel function k is usually associated with a φ, e.g. k(x, x') = x^2 (x')^2 corresponds to φ(x) = x^2, and k(x, x') = (x_1 x'_1 + x_2 x'_2 + x_3 x'_3)^2 corresponds to

\phi(x) = \begin{pmatrix} x_1^2 & x_2^2 & x_3^2 & \sqrt{2}\, x_1 x_2 & \sqrt{2}\, x_1 x_3 & \sqrt{2}\, x_2 x_3 \end{pmatrix}^T.

The good news is that evaluating a kernel is usually faster with k than with φ.
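A minimal sketch checking this for the 3-dimensional degree-2 example above: k and the explicit feature map φ give the same value, but k only touches the 3 input coordinates, whereas φ builds a 6-dimensional vector (and the expansion grows quickly with the input dimension and polynomial degree).

    import numpy as np

    def k(x, xp):
        # Degree-2 polynomial kernel k(x, x') = (x^T x')^2.
        return float(x @ xp) ** 2

    def phi(x):
        # Explicit feature map for the 3-dimensional example above.
        x1, x2, x3 = x
        s = np.sqrt(2.0)
        return np.array([x1**2, x2**2, x3**2, s*x1*x2, s*x1*x3, s*x2*x3])

    rng = np.random.default_rng(3)
    x, xp = rng.standard_normal(3), rng.standard_normal(3)
    assert np.isclose(k(x, xp), phi(x) @ phi(xp))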

Examples of kernels
The above kernel is an example of the polynomial kernel. Another example is the Radial Basis Function (RBF) kernel:

k(x, x') = \exp\left[ -\tfrac{1}{2} (x - x')^T (x - x') \right].
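A minimal sketch (helper name chosen for illustration) computing the RBF kernel matrix for a set of points, with the bandwidth fixed to 1 as in the formula above:

    import numpy as np

    def rbf_kernel_matrix(X):
        # K_ij = exp(-0.5 * ||x_i - x_j||^2), matching the formula above.
        sq_norms = np.sum(X**2, axis=1)
        sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
        return np.exp(-0.5 * sq_dists)

    X = np.random.default_rng(4).standard_normal((5, 2))
    K = rbf_kernel_matrix(X)      # 5 x 5, symmetric, with ones on the diagonal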

See more examples in Section 14.2 of the KPM book.

A natural question is the following: how can we ensure that there exists a φ corresponding to a given kernel K? The answer is: as long as the kernel satisfies certain properties.

Properties of a kernel
A kernel function must be an inner product in some feature space. Here are a few properties that ensure this is the case (see the sketch after this list for a quick numerical check).
1. K should be symmetric, i.e. k(x, x') = k(x', x).
2. For any arbitrary input set {x_n} and all N, K should be positive semidefinite.
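A minimal sketch checking both properties numerically for an RBF kernel matrix on arbitrary points (the eigenvalue threshold allows for floating-point round-off):

    import numpy as np

    # Build an RBF kernel matrix K_ij = exp(-0.5 * ||x_i - x_j||^2) on arbitrary points.
    X = np.random.default_rng(5).standard_normal((20, 3))
    sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    K = np.exp(-0.5 * sq_dists)

    assert np.allclose(K, K.T)            # 1. symmetry
    eigvals = np.linalg.eigvalsh(K)
    assert np.all(eigvals >= -1e-10)      # 2. positive semidefinite, up to round-off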

An important subclass is that of positive-definite kernel functions, which give rise to infinite-dimensional feature spaces.

Read about Mercer and Matérn kernels in Kevin Murphy's Section 14.2. There is a small note about Reproducing Kernel Hilbert Spaces on the website (written by Matthias Seeger); please read that as well.

To do
1. Clearly understand the relationship β∗ = X^T α∗. Understand the statement of the representer theorem.
2. Show that ridge regression and kernel ridge regression are equivalent. Hint: show that the optimization problems corresponding to β and α have the same optimal value.
3. Get familiar with various examples of kernels. See Section 6.2 of Bishop on examples of kernel construction. Read Section 14.2 of the KPM book for examples of kernels.
4. Revise and understand the difference between positive-definite and positive-semidefinite matrices.
5. If curious about infinite φ, see Matthias Seeger's notes (uploaded on the website).
