
Kernel Ridge Regression
Mohammad Emtiyaz Khan
EPFL
Oct 27, 2015

©Mohammad Emtiyaz Khan 2015


Motivation
The ridge solution β∗ ∈ R^D has a counterpart α∗ ∈ R^N. Using duality, we will establish a relationship between β∗ and α∗ which leads the way to kernels.

Ridge regression
Throughout, we assume that there
is no intercept term β0 to make the
math easier.

The following is true for ridge regression:

\beta^* = (X^T X + \lambda I_D)^{-1} X^T y = X^T (X X^T + \lambda I_N)^{-1} y := X^T \alpha^*,

where \alpha^* := (X X^T + \lambda I_N)^{-1} y.

This can be proved using the following identity: let P be an N × M matrix and Q an M × N matrix; then

(PQ + I_N)^{-1} P = P (QP + I_M)^{-1}.
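As a quick sanity check, here is a minimal NumPy sketch (dimensions chosen arbitrarily) that verifies the identity numerically:

    import numpy as np

    rng = np.random.default_rng(0)
    N, M = 5, 3
    P = rng.standard_normal((N, M))
    Q = rng.standard_normal((M, N))

    lhs = np.linalg.solve(P @ Q + np.eye(N), P)   # (PQ + I_N)^{-1} P
    rhs = P @ np.linalg.inv(Q @ P + np.eye(M))    # P (QP + I_M)^{-1}
    assert np.allclose(lhs, rhs)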

What are the computational complexities for these two ways of computing β∗?
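Roughly, the first form solves a D × D system (about O(ND² + D³) work), while the second solves an N × N system (about O(N²D + N³)), so the dual route pays off when D ≫ N. A minimal NumPy sketch comparing the two routes (shapes and λ chosen arbitrarily):

    import numpy as np

    rng = np.random.default_rng(1)
    N, D, lam = 50, 200, 0.1                  # D >> N: the regime where the dual form is cheaper
    X = rng.standard_normal((N, D))
    y = rng.standard_normal(N)

    # Primal route: solve a D x D system.
    beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

    # Dual route: solve an N x N system, then map back via beta* = X^T alpha*.
    alpha = np.linalg.solve(X @ X.T + lam * np.eye(N), y)
    beta_dual = X.T @ alpha

    assert np.allclose(beta_primal, beta_dual)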

With this, we know that β∗ = X^T α∗ lies in the row space of X. Previously, we have seen that ŷ = Xβ∗ lies in the column space of X. In other words,
\beta^* = \sum_{n=1}^{N} \alpha_n^* x_n, \qquad \hat{y} = \sum_{d=1}^{D} \beta_d^* \bar{x}_d,

where

X = \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1D} \\ x_{21} & x_{22} & \dots & x_{2D} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \dots & x_{ND} \end{pmatrix} = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} = \begin{pmatrix} \bar{x}_1 & \bar{x}_2 & \dots & \bar{x}_D \end{pmatrix}.

The representer theorem


The representer theorem generalizes this result: for a β∗ minimizing the following function, for any loss L,

\min_{\beta} \; \sum_{n=1}^{N} L(y_n, x_n^T \beta) + \lambda \sum_{j=1}^{D} \beta_j^2,

there exists an α∗ such that β∗ = X^T α∗.

See an even more general statement on Wikipedia; the result was originally proved in Schölkopf, Herbrich and Smola (2001).
Kernelized ridge regression
The representer theorem allows us to write an equivalent optimization problem in terms of α. For example, for ridge regression, the following two problems are equivalent:

\beta^* = \arg\min_{\beta} \; \tfrac{1}{2} (y - X\beta)^T (y - X\beta) + \tfrac{\lambda}{2} \beta^T \beta
\alpha^* = \arg\max_{\alpha} \; -\tfrac{1}{2} \alpha^T (X X^T + \lambda I_N) \alpha + \alpha^T y

i.e. they both return the same optimal value and there is a one-to-one mapping between α∗ and β∗. Note that the optimization over α is a maximization problem.

Most importantly, the second problem is expressed in terms of the matrix XX^T. This is our first example of a kernel matrix.
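To make this concrete, here is a minimal NumPy sketch (variable names and λ chosen for illustration) of kernelized ridge regression with the linear kernel: fit by solving for α∗ from the kernel matrix, and predict using only inner products with the training points.

    import numpy as np

    rng = np.random.default_rng(2)
    N, D, lam = 30, 5, 0.5
    X = rng.standard_normal((N, D))
    y = rng.standard_normal(N)

    K = X @ X.T                                      # kernel (Gram) matrix, N x N
    alpha = np.linalg.solve(K + lam * np.eye(N), y)  # alpha* = (K + lam I_N)^{-1} y

    X_new = rng.standard_normal((4, D))
    k_new = X_new @ X.T                              # kernels between new and training points
    y_hat = k_new @ alpha                            # same as X_new @ beta* with beta* = X^T alpha*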

Note: We don't give a detailed derivation of the second problem here, but to show the equivalence you can check that the two problems attain equal optimal values. We will do a derivation later using the duality principle. You can also see a derivation at http://www.ics.uci.edu/~welling/classnotes/papers_class/Kernel-Ridge.pdf.
Advantages of kernelized ridge regression
First, it can be computationally more efficient in some cases: when D ≫ N, solving the N × N system for α is cheaper than solving the D × D system for β.

Second, by defining K = XX^T, we can work directly with K and never have to worry about X. This is the kernel trick.

Third, working with α is sometimes advantageous (e.g. in SVMs many entries of α will be zero).

Kernel functions
The linear kernel is defined below:
K = X X^T = \begin{pmatrix} x_1^T x_1 & x_1^T x_2 & \dots & x_1^T x_N \\ x_2^T x_1 & x_2^T x_2 & \dots & x_2^T x_N \\ \vdots & \vdots & \ddots & \vdots \\ x_N^T x_1 & x_N^T x_2 & \dots & x_N^T x_N \end{pmatrix}.
The kernel with basis functions φ(x), with K := ΦΦ^T, is shown below:

K = \Phi \Phi^T = \begin{pmatrix} \phi(x_1)^T \phi(x_1) & \phi(x_1)^T \phi(x_2) & \dots & \phi(x_1)^T \phi(x_N) \\ \phi(x_2)^T \phi(x_1) & \phi(x_2)^T \phi(x_2) & \dots & \phi(x_2)^T \phi(x_N) \\ \vdots & \vdots & \ddots & \vdots \\ \phi(x_N)^T \phi(x_1) & \phi(x_N)^T \phi(x_2) & \dots & \phi(x_N)^T \phi(x_N) \end{pmatrix}.

The kernel trick
A big advantage of using kernels
is that we do not need to specify
φ(x) explicitly, since we can work
directly with K.

We will use a kernel function k(x, x′) and compute the (i, j)-th entry of K as K_ij = k(x_i, x_j). For example, for the linear kernel and for a basis function expansion, the kernel functions are the following:

k(x, x') := x^T x', \qquad k(x, x') := \phi(x)^T \phi(x').

However, a kernel function k is usually associated with a φ, e.g. k(x, x') = x^2 (x')^2 corresponds to φ(x) = x^2, and k(x, x') = (x_1 x'_1 + x_2 x'_2 + x_3 x'_3)^2 corresponds to

\phi(x) = \begin{pmatrix} x_1^2 & x_2^2 & x_3^2 & \sqrt{2}\, x_1 x_2 & \sqrt{2}\, x_1 x_3 & \sqrt{2}\, x_2 x_3 \end{pmatrix}^T.

The good news is that evaluating a kernel is usually faster with k than with φ.
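A minimal sketch checking this for the 3-dimensional degree-2 example above: k and the explicit feature map φ give the same value, but k only touches the 3 input coordinates, whereas φ builds a 6-dimensional vector (and the expansion grows quickly with the input dimension and polynomial degree).

    import numpy as np

    def k(x, xp):
        # Degree-2 polynomial kernel k(x, x') = (x^T x')^2.
        return float(x @ xp) ** 2

    def phi(x):
        # Explicit feature map for the 3-dimensional example above.
        x1, x2, x3 = x
        s = np.sqrt(2.0)
        return np.array([x1**2, x2**2, x3**2, s*x1*x2, s*x1*x3, s*x2*x3])

    rng = np.random.default_rng(3)
    x, xp = rng.standard_normal(3), rng.standard_normal(3)
    assert np.isclose(k(x, xp), phi(x) @ phi(xp))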

Examples of kernels
The above kernel is an example of the polynomial kernel. Another example is the Radial Basis Function (RBF) kernel:

k(x, x') = \exp\left[ -\tfrac{1}{2} (x - x')^T (x - x') \right].
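A minimal sketch (helper name chosen for illustration) computing the RBF kernel matrix for a set of points, with the bandwidth fixed to 1 as in the formula above:

    import numpy as np

    def rbf_kernel_matrix(X):
        # K_ij = exp(-0.5 * ||x_i - x_j||^2), matching the formula above.
        sq_norms = np.sum(X**2, axis=1)
        sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
        return np.exp(-0.5 * sq_dists)

    X = np.random.default_rng(4).standard_normal((5, 2))
    K = rbf_kernel_matrix(X)      # 5 x 5, symmetric, with ones on the diagonal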

See more examples in Section 14.2 of the KPM book.

A natural question is the following: how can we ensure that there exists a φ corresponding to a given kernel K? The answer is: as long as the kernel satisfies certain properties.

Properties of a kernel
A kernel function must be an inner product in some feature space. Here are a few properties that ensure this is the case (see the sketch after this list for a quick numerical check).
1. K should be symmetric, i.e. k(x, x') = k(x', x).
2. For any arbitrary input set {x_n} and all N, K should be positive semidefinite.
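A minimal sketch checking both properties numerically for an RBF kernel matrix on arbitrary points (the eigenvalue threshold allows for floating-point round-off):

    import numpy as np

    # Build an RBF kernel matrix K_ij = exp(-0.5 * ||x_i - x_j||^2) on arbitrary points.
    X = np.random.default_rng(5).standard_normal((20, 3))
    sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    K = np.exp(-0.5 * sq_dists)

    assert np.allclose(K, K.T)            # 1. symmetry
    eigvals = np.linalg.eigvalsh(K)
    assert np.all(eigvals >= -1e-10)      # 2. positive semidefinite, up to round-off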

An important subclass is that of positive-definite kernel functions, which give rise to infinite-dimensional feature spaces.

Read about Mercer and Matérn kernels in Kevin Murphy's Section 14.2. There is a small note about Reproducing Kernel Hilbert Spaces on the website (written by Matthias Seeger); please read that as well.

To do
1. Clearly understand the relationship β∗ = X^T α∗. Understand the statement of the representer theorem.
2. Show that ridge regression and kernel ridge regression are equivalent. Hint: show that the optimization problems corresponding to β and α have the same optimal value.
3. Get familiar with various examples of kernels. See Section 6.2 of Bishop on examples of kernel construction. Read Section 14.2 of the KPM book for examples of kernels.
4. Revise and understand the difference between positive-definite and positive-semidefinite matrices.
5. If curious about infinite φ, see Matthias Seeger's notes (uploaded on the website).
