Lecture 4: Introduction to RKHS, and basic kernel algorithms
Arthur Gretton
October 16, 2019
1 Outline
In this document, we give a nontechnical introduction to reproducing kernel
Hilbert spaces (RKHSs), and describe some basic algorithms in the RKHS.
1. What is a kernel, how do we construct it?
2. Operations on kernels that allow us to combine them
3. The reproducing kernel Hilbert space
4. Application 1: Difference in means in feature space
5. Application 2: Kernel PCA
6. Application 3: Ridge regression
2 Motivating examples
For the XOR example, we have variables in two dimensions, x ∈ R2 , arranged
in an XOR pattern. We would like to separate the red patterns from the blue,
using only a linear classifier. This is clearly not possible, in the original space.
If we map points to a higher dimensional feature space
φ(x) = [x_1, x_2, x_1 x_2]^⊤ ∈ R^3,
it is possible to use a linear classifier to separate the points. See Figure 2.1.
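As a concrete, if toy, illustration of this idea, the following Python/numpy sketch generates points labelled by the sign of x_1 x_2 (an XOR-style pattern), applies the feature map φ(x) = [x_1, x_2, x_1 x_2]^⊤, and separates the classes with a linear rule on the third feature. The data generation and names are illustrative choices, not taken from the figure.

import numpy as np

# Toy XOR-style data: the label is the sign of x1 * x2 (an assumption for this sketch).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1])

# Feature map phi(x) = [x1, x2, x1*x2]: the third coordinate carries the class label,
# so a linear rule in feature space (w = (0, 0, 1), threshold 0) separates the classes.
Phi = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])
pred = np.sign(Phi[:, 2])
print((pred == y).mean())  # 1.0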
Feature spaces can be used to compare objects which have much more com-
plex structure. An illustration is in Figure 2.2, where we have two sets of doc-
uments (the red ones on dogs, and the blue on cats) which we wish to classify.
In this case, features of the documents are chosen to be histograms over words
(there are much more sophisticated features we could use, e.g. string kernels [4]).
To use the terminology from the first example, these histograms represent a
mapping of the documents to feature space. Once we have histograms, we can
compare documents, classify them, cluster them, etc.
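To make the histogram feature map concrete, here is a small Python sketch (the vocabulary, documents, and helper names are invented for illustration) that maps documents to normalized word-count vectors and compares them with a dot product:

from collections import Counter

def word_histogram(doc, vocabulary):
    # Map a document to a normalised word-count vector (a histogram over words).
    counts = Counter(doc.lower().split())
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in vocabulary]

def linear_kernel(phi_x, phi_y):
    # Dot product between two feature vectors.
    return sum(a * b for a, b in zip(phi_x, phi_y))

vocab = ["dog", "cat", "leash", "litter"]
doc_dogs = "the dog chased another dog down the street on a leash"
doc_cats = "the cat sat near the litter box watching the other cat"
print(linear_kernel(word_histogram(doc_dogs, vocab), word_histogram(doc_cats, vocab)))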
Figure 2.1: XOR example. On the left, the points are plotted in the original
space. There is no linear classifier that can separate the red crosses from the
blue circles. Mapping the points to a higher dimensional feature space, we
obtain linearly separable classes. A possible decision boundary is shown as a
gray plane.
The classification of objects via well-chosen features is of course not an
unusual approach. What distinguishes kernel methods is that they can (and
often do) use infinitely many features. This can be achieved as long as our
learning algorithms are defined in terms of dot products between the features,
where these dot products can be computed in closed form. The term “kernel”
simply refers to a dot product between (possibly infinitely many) features.
Alternatively, kernel methods can be used to control smoothness of a func-
tion used in regression or classification. An example is given in Figure 2.3,
where different parameter choices determine whether the regression function
overfits, underfits, or fits optimally. The connection between feature spaces and
smoothness is not obvious, and is one of the things we’ll discuss in the course.
Figure 2.2: Document classification example: each document is represented as
a histogram over words.
To easily prove the above, we will need to use a property of kernels introduced
later, namely positive definiteness. We provide this proof at the end of Section
3.2. A difference of kernels may not be a kernel: if k1 (x, x) − k2 (x, x) < 0, then
condition 3 of Definition 1 is violated.
Lemma 5 (Mappings between spaces). Let X and X̃ be sets, and define a map
A : X → X̃. Define a kernel k on X̃. Then k(A(x), A(x′)) is a
kernel on X.
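The sum, product, and mapping rules can be checked numerically on small examples. The following Python sketch (using numpy; the specific kernels and the map A are arbitrary choices of mine) builds Gram matrices from these combinations and verifies that their smallest eigenvalues are non-negative, up to rounding:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))

def min_eig(K):
    # Smallest eigenvalue of a Gram matrix; non-negative (up to rounding) for a valid kernel.
    return np.linalg.eigvalsh(K).min()

K1 = X @ X.T                           # linear kernel
K2 = (X @ X.T + 1.0) ** 2              # polynomial kernel
print(min_eig(2.0 * K1 + 0.5 * K2))    # non-negative weighted sum of kernels
print(min_eig(K1 * K2))                # entrywise (Schur) product of kernels
A = np.hstack([np.sin(X), np.cos(X)])  # a map A : R^2 -> R^4, as in Lemma 5
print(min_eig(A @ A.T))                # linear kernel composed with the mapping A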
Then
k_1 k_2 = k_1 p^⊤ q
        = k_1 q^⊤ p
        = k_1 trace(q^⊤ p)
        = k_1 trace(p q^⊤)
        = trace(p (u^⊤ v) q^⊤)        (using u^⊤ v = k_1)
        = ⟨A, B⟩,
is a valid kernel.
To prove: expand out this expression into a sum (with non-negative scalars)
of kernels ⟨x, x′⟩ raised to integer powers. These individual terms are valid
kernels by the product rule.
Can we extend this combination of sum and product rule to sums with
infinitely many terms? It turns out we can, as long as these don’t blow up.
Definition 8. The space ℓ_p of p-summable sequences is defined as all sequences
(a_i)_{i≥1} for which
∑_{i=1}^{∞} |a_i|^p < ∞.
Lemma 9. Let X be a set, and let (φ_i(x))_{i≥1} be a sequence of functions on X such that (φ_i(x))_{i≥1} ∈ ℓ_2 for every x ∈ X. Then
k(x, x′) := ∑_{i=1}^{∞} φ_i(x) φ_i(x′)
is a well-defined kernel on X.
Proof. We write the norm ‖a‖_{ℓ_2} associated with the inner product (3.2) as
‖a‖_{ℓ_2} := ( ∑_{i=1}^{∞} a_i^2 )^{1/2},
where we write a to represent the sequence with terms ai . The Cauchy-Schwarz
inequality states
|k(x, x′)| = | ∑_{i=1}^{∞} φ_i(x) φ_i(x′) | ≤ ‖φ(x)‖_{ℓ_2} ‖φ(x′)‖_{ℓ_2} < ∞,
Proof. Non-negative weighted sums of kernels are kernels, and products of ker-
nels are kernels, so the following is a kernel if it converges,
k(x, x′) = ∑_{n=0}^{∞} a_n ⟨x, x′⟩^n.
We may combine all the results above to obtain the following (the proof is
an exercise - you will need the product rule, the mapping rule, and the result
of Example 11).
Example 12 (Gaussian kernel). The Gaussian kernel on Rd is defined as
k(x, x′) := exp(−γ^{−2} ‖x − x′‖^2).
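For reference, a short numpy sketch of the Gaussian kernel as written above, computing the Gram matrix on a sample and checking that it is positive (semi-)definite (the helper name and sample data are illustrative):

import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    # Gram matrix K_ij = exp(-||x_i - y_j||^2 / gamma^2).
    sq_x = np.sum(X**2, axis=1)[:, None]
    sq_y = np.sum(Y**2, axis=1)[None, :]
    return np.exp(-(sq_x + sq_y - 2.0 * X @ Y.T) / gamma**2)

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3))
K = gaussian_kernel(X, X)
print(np.linalg.eigvalsh(K) >= -1e-10)  # all True: positive (semi-)definite up to rounding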
3.2 Positive definiteness of an inner product in a Hilbert
space
All kernel functions are positive definite. In fact, if we have a positive definite
function, we know there exist one (or more) feature spaces for which the kernel
defines the inner product; we are not obliged to define the feature spaces explicitly.
We begin by defining positive definiteness [1, Definition 2], [11, Definition 4.15].
Definition 13 (Positive definite functions). A symmetric function k : X × X → R is positive definite if ∀n ≥ 1, ∀(a_1, …, a_n) ∈ R^n, ∀(x_1, …, x_n) ∈ X^n,
∑_{i=1}^{n} ∑_{j=1}^{n} a_i a_j k(x_i, x_j) ≥ 0.
The function k(·, ·) is strictly positive definite if, for mutually distinct x_i, the
equality holds only when all the a_i are zero.
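Definition 13 can be probed numerically: for a Gram matrix K built from a kernel, the quadratic form ∑_{i,j} a_i a_j k(x_i, x_j) = a^⊤ K a should be non-negative for any choice of coefficients. A small sketch (the Gaussian kernel and random data are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 3))
# Gaussian kernel matrix with gamma = 1: K_ij = exp(-||x_i - x_j||^2).
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

# Definition 13: the quadratic form a^T K a is non-negative for any coefficients a.
for _ in range(5):
    a = rng.normal(size=20)
    print(a @ K @ a >= -1e-10)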
Every inner product is a positive definite function, and more generally:
Lemma 14. Let H be any Hilbert space (not necessarily an RKHS), X a non-
empty set and φ : X → H. Then k(x, y) := hφ(x), φ(y)iH is a positive definite
function.
Proof.
∑_{i=1}^{n} ∑_{j=1}^{n} a_i a_j k(x_i, x_j) = ∑_{i=1}^{n} ∑_{j=1}^{n} ⟨a_i φ(x_i), a_j φ(x_j)⟩_H
= ⟨ ∑_{i=1}^{n} a_i φ(x_i), ∑_{j=1}^{n} a_j φ(x_j) ⟩_H
= ‖ ∑_{i=1}^{n} a_i φ(x_i) ‖^2_H ≥ 0.
∑_{i=1}^{n} ∑_{j=1}^{n} a_i a_j [k_1(x_i, x_j) + k_2(x_i, x_j)]
= ∑_{i=1}^{n} ∑_{j=1}^{n} a_i a_j k_1(x_i, x_j) + ∑_{i=1}^{n} ∑_{j=1}^{n} a_i a_j k_2(x_i, x_j)
≥ 0.
φ : R^2 → R^3
x = [x_1, x_2]^⊤ ↦ φ(x) = [x_1, x_2, x_1 x_2]^⊤,
The notation f (·) refers to the function itself, in the abstract (and in fact, this
function might have multiple equivalent representations). We sometimes write
f rather than f (·), when there is no ambiguity. The notation f (x) ∈ R refers
to the function evaluated at a particular point (which is just a real number).
With this notation, we can write
f(x) = f(·)^⊤ φ(x) := ⟨f(·), φ(x)⟩_H.
In other words, the evaluation of f at x can be written as an inner product in
feature space (in this case, just the standard inner product in R3 ), and H is a
space of functions mapping R2 to R. This construction can still be used if there
are infinitely many features: from the Cauchy-Schwarz argument in Lemma 9,
we may write
f(x) = ∑_{ℓ=1}^{∞} f_ℓ φ_ℓ(x),   (4.2)
where the expression is bounded in absolute value as long as {f_ℓ}_{ℓ=1}^{∞} ∈ ℓ_2 (of
course, we can't write this function explicitly, since we'd need to enumerate all
the f_ℓ).
This line of reasoning leads us to a conclusion that might seem counterin-
tuitive at first: we’ve seen that φ(x) is a mapping from R2 to R3 , but it also
defines (the parameters of) a function mapping R2 to R. To see why this is so,
we write
k(·, y) = [y_1, y_2, y_1 y_2]^⊤ = φ(y),
using the same convention as in (4.1). This is certainly valid: if you give me a
y, I’ll give you a vector k(·, y) in H such that
⟨k(·, y), φ(x)⟩_H = a x_1 + b x_2 + c x_1 x_2,
where a = y_1, b = y_2, and c = y_1 y_2 (i.e., for every y, we get a different vector
[a, b, c]^⊤). But due to the symmetry of the arguments, we could equally
well have written
⟨k(·, x), φ(y)⟩ = u y_1 + v y_2 + w y_1 y_2
= k(x, y).
In other words, we can write φ(x) = k(·, x) and φ(y) = k(·, y) without ambiguity.
This way of writing the feature mapping is called the canonical feature map
[11, Lemma 4.19].
This example illustrates the two defining features of an RKHS:
• The feature map of every point is in the feature space: ∀x ∈ X, k(·, x) ∈ H,
Figure 4.1: Feature space and mapping of input points.
f̂_{−ℓ} = f̂_ℓ.
[Plot panels: "Top hat" function f(x) against x; Fourier series coefficients f̂_ℓ against ℓ.]
Figure 4.2: “Top hat” function (red) and its approximation via a Fourier series
(blue). Only the first 21 terms are used; as more terms are added, the Fourier
representation gets closer to the desired function.
Due to the symmetry of the Fourier coefficients and the asymmetry of the sine
function, the sum can be written over positive `, and only the cosine terms
remain. See Figure 4.2.
Assume the kernel takes a single argument, which is the difference in its inputs,
k(x, y) = k(x − y),
and define the Fourier series representation of k as
k(x) = ∑_{l=−∞}^{∞} k̂_l exp(ılx),   (4.4)
where k̂_{−l} = k̂_l and k̂_l = k̂_l^* (a real and symmetric k has a real and symmetric
Fourier transform). For instance, when the kernel is a Jacobi Theta function ϑ
(which looks close to a Gaussian when σ^2 is sufficiently narrower than [−π, π]),
k(x) = (1/(2π)) ϑ(x/(2π), ıσ^2/(2π)),      k̂_ℓ ≈ (1/(2π)) exp(−σ^2 ℓ^2 / 2),
[Plot panels: Jacobi Theta kernel k(x) against x; its Fourier series coefficients against ℓ.]
Figure 4.3: Jacobi Theta kernel (red) and its Fourier series representation, which
is Gaussian (blue). Again, only the first 21 terms are retained, however the
approximation is already very accurate (bearing in mind the Fourier series co-
efficients decay exponentially).
and the Fourier coefficients are Gaussian (evaluated on a discrete grid). See
Figure 4.3.
Recall the standard dot product in L2 , where we take the conjugate of the
right-hand term due to the complex valued arguments,
⟨f, g⟩_{L_2} = ⟨ ∑_{ℓ=−∞}^{∞} f̂_ℓ exp(ıℓx), ∑_{m=−∞}^{∞} ĝ_m exp(ımx) ⟩_{L_2}
= ∑_{ℓ=−∞}^{∞} ∑_{m=−∞}^{∞} f̂_ℓ ĝ_m^* ⟨exp(ıℓx), exp(ımx)⟩_{L_2}
= ∑_{ℓ=−∞}^{∞} f̂_ℓ ĝ_ℓ^*.
additional technical conditions are required of the kernel for a valid RKHS to be obtained.
These conditions are given by Mercer’s theorem [11, Theorem 4.49], which when satisfied,
imply that the expansion (4.4) converges absolutely and uniformly.
The squared norm of a function f in H enforces smoothness:
‖f‖^2_H = ⟨f, f⟩_H = ∑_{ℓ=−∞}^{∞} f̂_ℓ f̂_ℓ^* / k̂_ℓ = ∑_{ℓ=−∞}^{∞} |f̂_ℓ|^2 / k̂_ℓ.   (4.6)
If k̂_ℓ decays fast, then so must f̂_ℓ if we want ‖f‖^2_H < ∞. From this norm
definition, we see that the RKHS functions are a subset of the functions in L_2,
for which only finiteness of the norm ‖f‖^2_{L_2} = ∑_{ℓ=−∞}^{∞} |f̂_ℓ|^2 is required (this being
a less restrictive condition than (4.6)).
5 Exercise: what happens if we change the order, and write ⟨f(·), k(·, x)⟩_H? Hint: f(x) = f(x)^*, since the function is real-valued.
Applying the dot product definition in H, we obtain
⟨k(·, y), k(·, z)⟩_H = ⟨f, g⟩_H
= ∑_{ℓ=−∞}^{∞} f̂_ℓ ĝ_ℓ^* / k̂_ℓ
= ∑_{ℓ=−∞}^{∞} [k̂_ℓ exp(−ıℓy)] [k̂_ℓ exp(−ıℓz)]^* / k̂_ℓ
= ∑_{ℓ=−∞}^{∞} k̂_ℓ exp(ıℓ(z − y)) = k(z − y).
You might be wondering how the dot product in (4.5) relates to our original
definition of an RKHS function in (4.2): the latter equation, updated to reflect
that the features are complex-valued (and changing the sum index to run from
−∞ to ∞) is
f(x) = ∑_{ℓ=−∞}^{∞} f_ℓ φ_ℓ(x),
to the multivariate case Rd ). Our discussion follows [2, Sections 3.1 - 3.3]. We
start by defining the eigenexpansion of k(x, x0 ) with respect to a non-negative
finite measure µ on X := R,
λ_i e_i(x) = ∫ k(x, x′) e_i(x′) dµ(x′),      ∫ e_i(x) e_j(x) dµ(x) = 1 if i = j, and 0 if i ≠ j,   (4.8)
i.e., the eigenfunctions are orthonormal in L_2(µ).
For the purposes of this example, we’ll use the Gaussian density µ, meaning
dµ(x) = (1/√(2π)) exp(−x^2) dx.   (4.9)
We can write
k(x, x′) = ∑_{ℓ=1}^{∞} λ_ℓ e_ℓ(x) e_ℓ(x′),   (4.10)
λ_k ∝ b^k   (b < 1),
e_k(x) ∝ exp(−(c − a) x^2) H_k(x √(2c)),
As with the Fourier case, we will define the dot product in H to have a roughness
penalty, yielding
⟨f, g⟩_H = ∑_{ℓ=1}^{∞} f̂_ℓ ĝ_ℓ / λ_ℓ,      ‖f‖^2_H = ∑_{ℓ=1}^{∞} f̂_ℓ^2 / λ_ℓ,   (4.12)
6 As with the Fourier example in Section 4.1.2, there are certain technical conditions
needed when defining an RKHS kernel, to ensure that the sum in (4.10) converges in a stronger
sense than L2 (µ). This requires a generalization of Mercer’s theorem to non-compact domains.
Figure 4.4: First three eigenfunctions for the exponentiated quadratic kernel
with respect to a Gaussian measure µ.
where you should compare with (4.5) and (4.6). The RKHS functions are a subset of the functions in L_2(µ), with norm ‖f‖^2_{L_2} = ∑_{ℓ=1}^{∞} f̂_ℓ^2 < ∞ (a less restrictive condition than (4.12)).
Also just like the Fourier case, we can explicitly construct the feature map
that gives our original expression of the RKHS function in (4.2), namely
f(x) = ∑_{ℓ=1}^{∞} f_ℓ φ_ℓ(x),
hence
f_ℓ = f̂_ℓ / √(λ_ℓ),      φ_ℓ(x) = √(λ_ℓ) e_ℓ(x),   (4.13)
and the reproducing property holds,7
∑_{ℓ=1}^{∞} φ_ℓ(x) φ_ℓ(x′) = k(x, x′).
The eigendecomposition in (4.10) and the feature definition in (4.13) yield
k(x, x′) = ∑_{ℓ=1}^{∞} [√(λ_ℓ) e_ℓ(x)] [√(λ_ℓ) e_ℓ(x′)],
where the first bracketed factor is φ_ℓ(x) and the second is φ_ℓ(x′).
A function in the RKHS can then be written
f(x) = ∑_{i=1}^{m} α_i k(x_i, x) = ∑_{ℓ=1}^{∞} f_ℓ φ_ℓ(x),   (4.14)
where
f_ℓ = ∑_{i=1}^{m} α_i √(λ_ℓ) e_ℓ(x_i).
The key point is that we need never explicitly compute the eigenfunctions e`
or the eigenexpansion (4.10) to specify functions in the RKHS: we simply write
our functions in terms of the kernels, as in (4.14).
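As a small illustration of this point, the following sketch (the kernel choice, centres, and coefficients are arbitrary) evaluates an RKHS function purely through kernel evaluations, exactly as in (4.14):

import numpy as np

def gauss_k(x, y, gamma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / gamma**2)

# An RKHS function f(.) = sum_i alpha_i k(x_i, .), specified entirely by the points
# x_i and coefficients alpha_i; the eigenfunctions e_l never appear explicitly.
rng = np.random.default_rng(4)
centres = rng.normal(size=(5, 1))
alphas = rng.normal(size=5)

def f(x):
    return sum(a * gauss_k(xi, x) for a, xi in zip(alphas, centres))

print(f(np.array([0.3])))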
• ∀x ∈ X, k(·, x) ∈ H,
• ∀x ∈ X, ∀f ∈ H, ⟨f, k(·, x)⟩_H = f(x) (the reproducing property).
8 We’ve deliberately used the same notation for the kernel as we did for positive definite
kernels earlier. We will see in the next section that we are referring in both cases to the same
object.
In particular, for any x, y ∈ X ,
k(x, y) = ⟨k(·, x), k(·, y)⟩_H.   (4.16)
Recall that a kernel is an inner product between feature maps: then φ(x) =
k(·, x) is a valid feature map (so every reproducing kernel is indeed a kernel in
the sense of Definition (3)).
The reproducing property has an interesting consequence for functions in H.
We define δx to be the operator of evaluation at x, i.e.
δx f = f (x) ∀f ∈ H, x ∈ X .
We then get the following equivalent definition for a reproducing kernel Hilbert
space.
Definition 16 (Reproducing kernel Hilbert space (second definition)). [11,
Definition 4.18] H is an RKHS if for all x ∈ X, the evaluation operator δ_x is
bounded: there exists a corresponding λ_x ≥ 0 such that ∀f ∈ H,
|f(x)| = |δ_x f| ≤ λ_x ‖f‖_H.
This definition means that when two functions are identical in the RKHS
norm, they agree at every point:
|f(x) − g(x)| = |δ_x(f − g)| ≤ λ_x ‖f − g‖_H      ∀f, g ∈ H.⁹
This is a particularly useful property if we’re using RKHS functions to make
predictions at a given point x, by optimizing over f ∈ H. That these definitions
are equivalent is shown in the following theorem.
Theorem 17 (Reproducing kernel equivalent to bounded δx ). [1, Theorem
1] H is a reproducing kernel Hilbert space (i.e., its evaluation operators δx are
bounded linear operators), if and only if H has a reproducing kernel.
Proof. We only prove here that if H has a reproducing kernel, then δx is a
bounded linear operator. The proof in the other direction is more complicated
[11, Theorem 4.20], and will be covered in the advanced part of the course
(briefly, it uses the Riesz representer theorem).
Given that a Hilbert space H has a reproducing kernel k with the reproducing
property ⟨f, k(·, x)⟩_H = f(x), then
|δ_x[f]| = |f(x)|
         = |⟨f, k(·, x)⟩_H|
         ≤ ‖k(·, x)‖_H ‖f‖_H
         = ⟨k(·, x), k(·, x)⟩_H^{1/2} ‖f‖_H
         = k(x, x)^{1/2} ‖f‖_H,
where the third line uses the Cauchy-Schwarz inequality. Consequently, δ_x :
H → R is a bounded linear operator, with λ_x = k(x, x)^{1/2}.
9 This property certainly does not hold for all Hilbert spaces: for instance, it fails to hold
Finally, the following fundamental theorem [1, Theorem 3 p. 19], [11,
Theorem 4.21] will be proved in the advanced part of the course:
Theorem 18 (Moore-Aronszajn). [1, Theorem 3] Every positive definite kernel
k is associated with a unique RKHS H.
Note that the feature map is not unique (as we saw earlier): only the kernel
is. Functions in the RKHS can be written as linear combinations of feature
maps,
f(·) := ∑_{i=1}^{m} α_i k(x_i, ·),
as in Figure 4.5, as well as the limits of Cauchy sequences (where we can allow
m → ∞).
What might this distance be useful for? In the case φ(x) = x, we can use this
statistic to distinguish distributions with different means. If we use the feature
mapping φ(x) = [x, x^2]^⊤, we can distinguish both means and variances. More
complex feature spaces permit us to distinguish increasingly complex features
of the distributions. As we’ll see in much more detail later in the course, there
are kernels that can distinguish any two distributions [3, 10].
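As an illustration, the squared distance between the empirical feature means of two samples can be computed entirely from kernel evaluations, since ‖(1/m)∑_i φ(x_i) − (1/n)∑_j φ(y_j)‖^2_H expands into averages of the three Gram matrices. A sketch with a Gaussian kernel (the data and bandwidth are arbitrary choices of mine):

import numpy as np

def gauss_gram(A, B, gamma=1.0):
    d = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d / gamma**2)

def mean_feature_distance_sq(X, Y, gamma=1.0):
    # ||mu_X - mu_Y||_H^2 = mean(K_XX) + mean(K_YY) - 2 mean(K_XY).
    return (gauss_gram(X, X, gamma).mean()
            + gauss_gram(Y, Y, gamma).mean()
            - 2 * gauss_gram(X, Y, gamma).mean())

rng = np.random.default_rng(5)
X = rng.normal(0.0, 1.0, size=(200, 1))
Y = rng.normal(0.5, 1.0, size=(200, 1))  # shifted mean
print(mean_feature_distance_sq(X, X[::-1].copy()), mean_feature_distance_sq(X, Y))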
Figure 6.1: PCA in R3 , for data in a two-dimensional subspace. The blue lines
represent the first two principal directions, and the grey dots represent the 2-D
plane in R3 on which the data lie (figure by Kenji Fukumizu).
u_1 = arg max_{‖u‖≤1} (1/n) ∑_{i=1}^{n} ( u^⊤ ( x_i − (1/n) ∑_{j=1}^{n} x_j ) )^2
    = arg max_{‖u‖≤1} u^⊤ C u,
where
C = (1/n) ∑_{i=1}^{n} ( x_i − (1/n) ∑_{j=1}^{n} x_j ) ( x_i − (1/n) ∑_{j=1}^{n} x_j )^⊤
  = (1/n) X H X^⊤,
where X = [x_1 … x_n], H = I − n^{−1} 1_{n×n}, and 1_{n×n} is an n × n matrix
of ones (note that H = HH, i.e. the matrix H is idempotent). We’ve looked
at the first principal component, but all of the principal components ui are
the eigenvectors of the covariance matrix C (thus, each is orthogonal to all the
previous ones). We have the eigenvalue equation
λi ui = Cui .
We now do this in feature space:
f_1 = arg max_{‖f‖_H≤1} (1/n) ∑_{i=1}^{n} ⟨ f, φ(x_i) − (1/n) ∑_{j=1}^{n} φ(x_j) ⟩^2_H
    = arg max_{‖f‖_H≤1} var(f).
where k̃(xi , xj ) is the (i, j)th entry of the matrix K̃ := HKH (this is an easy
exercise!). Thus,
C f_ℓ = (1/n) ∑_{i=1}^{n} β_{ℓi} φ̃(x_i),      β_{ℓi} = ∑_{j=1}^{n} α_{ℓj} k̃(x_i, x_j).
We can now project both sides of (6.2) onto each of the centred mappings
φ̃(x_q): this gives a set of equations which must all be satisfied to get an
eigenproblem equivalent to (6.2). This gives
⟨φ̃(x_q), LHS⟩_H = λ_ℓ ⟨φ̃(x_q), f_ℓ⟩_H = λ_ℓ ∑_{i=1}^{n} α_{ℓi} k̃(x_q, x_i)      ∀q ∈ {1, …, n},
⟨φ̃(x_q), RHS⟩_H = ⟨φ̃(x_q), C f_ℓ⟩_H = (1/n) ∑_{i=1}^{n} k̃(x_q, x_i) ∑_{j=1}^{n} α_{ℓj} k̃(x_i, x_j)      ∀q ∈ {1, …, n},
or equivalently
n λ_ℓ α_ℓ = K̃ α_ℓ.   (6.3)
Thus the α_ℓ are eigenvectors of K̃: it is not necessary to ever use the feature
map φ(x_i) explicitly!
How do we ensure the eigenfunctions f have unit norm in feature space? We have
‖f‖^2_H = ⟨ ∑_{i=1}^{n} α_i φ̃(x_i), ∑_{j=1}^{n} α_j φ̃(x_j) ⟩_H
        = ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j ⟨φ̃(x_i), φ̃(x_j)⟩_H
        = ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j k̃(x_i, x_j)
        = α^⊤ K̃ α = nλ α^⊤ α = nλ ‖α‖^2,
so to obtain ‖f‖_H = 1 we rescale the eigenvector, α ← α / √(nλ ‖α‖^2) (i.e., α ← α/√(nλ) when α is normalized to unit length).
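Putting the pieces of this section together, a minimal kernel PCA sketch in Python/numpy (Gaussian kernel; the function and variable names are illustrative, and we assume the leading eigenvalues are positive):

import numpy as np

def kernel_pca(X, n_components=2, sigma=1.0):
    # Kernel PCA with a Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
    n = X.shape[0]
    sq = np.sum(X**2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma**2))
    H = np.eye(n) - np.ones((n, n)) / n       # centring matrix
    Kt = H @ K @ H                            # K tilde = H K H
    evals, evecs = np.linalg.eigh(Kt)         # ascending eigenvalues of K tilde
    evals, evecs = evals[::-1], evecs[:, ::-1]
    lambdas = evals[:n_components] / n        # n * lambda * alpha = K tilde * alpha
    alphas = evecs[:, :n_components]
    alphas = alphas / np.sqrt(n * lambdas)    # rescale so each f_l has unit RKHS norm
    return Kt @ alphas, alphas, lambdas       # projections of the centred training points

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 3))
proj, alphas, lambdas = kernel_pca(X)
print(proj.shape, lambdas)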
Figure 6.2: Hand-written digit denoising example (from Kenji Fukumizu’s
slides).
In many cases, it will not be possible to reduce the squared error to zero, as
there will be no single y ∗ corresponding to an exact solution. As in linear PCA,
we can use the projection onto a subspace for denoising. By doing this in feature
space, we can take into account the fact that data may not be distributed as a
simple Gaussian, but can lie in a submanifold in input space, which nonlinear
PCA can discover. See Figure 6.2.
7.1 A loss-based interpretation
7.1.1 Finite dimensional case
This discussion may be found in a number of sources. We draw from [9, Section
2.2]. We are given n training points in R^D, which we arrange in a matrix
X = [x_1 … x_n] ∈ R^{D×n}. To each of these points, there corresponds an
output y_i, which we arrange in a column vector y := [y_1 … y_n]^⊤. Define
some λ > 0. Our goal is:
a^∗ = arg min_{a∈R^D} ( ∑_{i=1}^{n} (y_i − x_i^⊤ a)^2 + λ‖a‖^2 )
    = arg min_{a∈R^D} ( ‖y − X^⊤ a‖^2 + λ‖a‖^2 ),
where the second term λ‖a‖^2 is chosen to avoid problems in high dimensional
spaces (see below). Expanding out the above term, we get
(∗) := ‖y − X^⊤ a‖^2 + λ‖a‖^2 = y^⊤ y − 2 y^⊤ X^⊤ a + a^⊤ X X^⊤ a + λ a^⊤ a.
Define b = (X X^⊤ + λI)^{1/2} a, where the square root is well defined since the
matrix is positive definite (it may be that X X^⊤ is not invertible, for instance
when D > n, so adding λI ensures we can substitute a = (X X^⊤ + λI)^{−1/2} b).
Then
(∗) = y^⊤ y − 2 y^⊤ X^⊤ (X X^⊤ + λI)^{−1/2} b + b^⊤ b
    = y^⊤ y + ‖(X X^⊤ + λI)^{−1/2} X y − b‖^2 − ‖(X X^⊤ + λI)^{−1/2} X y‖^2,
is not the approach we use, since we are later going to extend our reasoning to feature spaces:
derivatives in feature space also exist when the space is infinite dimensional, however for the
purposes of ridge regression they can be avoided). We use [5, eqs. (61) and (73)]:
∂(a^⊤ U a)/∂a = (U + U^⊤) a,      ∂(v^⊤ a)/∂a = ∂(a^⊤ v)/∂a = v.
Taking the derivative of the expanded expression (∗) and setting it to zero,
(∂/∂a) ( ‖y − X^⊤ a‖^2 + λ‖a‖^2 ) = −2 X y + 2 (X X^⊤ + λI) a = 0,
a = (X X^⊤ + λI)^{−1} X y.
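As a quick numerical illustration (the toy dimensions, data, and regularizer are chosen arbitrarily), this formula can be evaluated directly with numpy:

import numpy as np

# Ridge regression in the primal form: a = (X X^T + lambda I_D)^{-1} X y,
# with X of size D x n as in the text (toy data for illustration).
rng = np.random.default_rng(7)
D, n, lam = 3, 50, 0.1
X = rng.normal(size=(D, n))
a_true = np.array([1.0, -2.0, 0.5])
y = X.T @ a_true + 0.1 * rng.normal(size=n)

a = np.linalg.solve(X @ X.T + lam * np.eye(D), X @ y)
print(a)  # close to a_true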
7.1.2 Finite dimensional case: more informative expression
We may rewrite this expression in a way that is more informative (and more
easily kernelized). Assume without loss of generality that D > n (this will
be useful when we move to feature spaces, where D can be very large or even
infinite). We can perform an SVD on X, i.e.
X = U SV > ,
where
S̃ 0
U= u1 ... uD S= V = Ṽ 0 .
0 0
Here U is D × D and U > U = U U > = ID (the subscript denotes the size of
the unit matrix), S is D × D, where the top left diagonal S̃ has n non-zero
entries, and V is n × D, where only the first n columns are non-zero, and
Ṽ > Ṽ = Ṽ Ṽ > = In .11 Then
a^∗ = (X X^⊤ + λI_D)^{−1} X y
    = (U S^2 U^⊤ + λI_D)^{−1} U S V^⊤ y
    = U (S^2 + λI_D)^{−1} U^⊤ U S V^⊤ y
    = U (S^2 + λI_D)^{−1} S V^⊤ y
    = U S (S^2 + λI_D)^{−1} V^⊤ y
    = U S V^⊤ V (S^2 + λI_D)^{−1} V^⊤ y      (a)
    = X (X^⊤ X + λI_n)^{−1} y,               (b)
Step (a) is allowed since both S and V^⊤ V are non-zero in the same sized top-left
block, and V^⊤ V is just the unit matrix in that block. Step (b) occurs as follows:
V (S^2 + λI_D)^{−1} V^⊤ = [ Ṽ  0 ] [ (S̃^2 + λI_n)^{−1}  0 ; 0  (λI_{D−n})^{−1} ] [ Ṽ^⊤ ; 0 ]
= Ṽ (S̃^2 + λI_n)^{−1} Ṽ^⊤
= (X^⊤ X + λI_n)^{−1}.
What's interesting about this result is that a^∗ = ∑_{i=1}^{n} α_i^∗ x_i, i.e., a^∗ is a
weighted sum of the columns of X. Again, one can obtain this result straightforwardly
by applying established linear algebra results: the proof here is informative,
however, since we are explicitly demonstrating the steps we take, and
hence we can be assured the same steps will still work even if D is infinite.¹²
11 Another, more economical, way to write the SVD would be X = U [ S̃ ; 0 ] Ṽ^⊤,
but as we'll see, we will need the larger form.
13 For infinite dimensional feature spaces, the operator X still has a singular value decom-
Making these replacements, we get a^∗ = ∑_{i=1}^{n} α_i φ(x_i) with α = (K + λI_n)^{−1} y, where K is the kernel matrix with entries K_{ij} = k(x_i, x_j).
Note that the proof becomes much easier if we begin with the knowledge
that a is a linear combination of feature space mappings of points,14
a = ∑_{i=1}^{n} α_i φ(x_i).
Then
∑_{i=1}^{n} (y_i − ⟨a, φ(x_i)⟩_H)^2 + λ‖a‖^2_H = ‖y − Kα‖^2 + λ α^⊤ K α
= y^⊤ y − 2 y^⊤ K α + α^⊤ (K^2 + λK) α.
Differentiating with respect to α and setting the result to zero gives 2(K^2 + λK)α − 2Ky = 0, which is satisfied by
α^∗ = (K + λI_n)^{−1} y,
as before.
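A compact kernel ridge regression sketch following this expression (Gaussian kernel; the toy data, bandwidth, and λ are illustrative, loosely echoing the σ = 0.6 setting in the figures):

import numpy as np

def gauss_gram(A, B, sigma):
    d = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d / sigma**2)

# Kernel ridge regression: alpha = (K + lambda I_n)^{-1} y, and the prediction at a
# new point x is f(x) = sum_i alpha_i k(x_i, x).
rng = np.random.default_rng(8)
X = rng.uniform(-0.5, 1.5, size=(60, 1))
y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=60)

lam, sigma = 0.1, 0.6
alpha = np.linalg.solve(gauss_gram(X, X, sigma) + lam * np.eye(60), y)

X_test = np.linspace(-0.5, 1.5, 5)[:, None]
print(gauss_gram(X_test, X, sigma) @ alpha)  # predictions at the test points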
where the eigenfunctions e_ℓ(x) were illustrated in Figure 4.4, and satisfy the
orthonormality condition (4.8) for the measure (4.9). The constraint ‖f‖^2_H < ∞
means that the f̂_ℓ^2 must decay faster than λ_ℓ with increasing ℓ. In other words,
basis functions e_ℓ(x) with larger ℓ are given less weight: these are the non-smooth
functions.
The same effect can be seen if we use the feature space in Section 4.1.2.
Recall that functions on the periodic domain [−π, π] have the representation
f(x) = ∑_{ℓ=−∞}^{∞} f̂_ℓ exp(ıℓx).
14 This is a specific instance of the representer theorem, which we will encounter later.
Again,
‖f‖^2_H = ⟨f, f⟩_H = ∑_{ℓ=−∞}^{∞} |f̂_ℓ|^2 / k̂_ℓ.
This means |f̂_ℓ|^2 must decay faster than k̂_ℓ for the norm to be finite.¹⁵ This
serves to suppress the terms exp(ıℓx) for large ℓ, which are the non-smooth
terms.
15 The rate of decay of k̂_ℓ will depend on the properties of the kernel. Some relevant results may be found at http://en.wikipedia.org/wiki/Convergence_of_Fourier_series
[Figure: regression fits for (λ = 10, σ = 0.6), (λ = 1e−07, σ = 0.6), and (λ = 0.1, σ = 0.6).]
2. Break the training set into m equally sized chunks, each of size n_val = n_tr/m. Call these X_{val,i}, Y_{val,i} for i ∈ {1, …, m}.
3. For each (λ, σ) pair, compute the average validation error across the m folds, and retain the pair with the lowest error (see the sketch below).
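A minimal sketch of this procedure, under the assumption that the quantity being validated is the squared prediction error of kernel ridge regression with a Gaussian kernel of width σ (all function names and the toy data are illustrative):

import numpy as np

def gauss_gram(A, B, sigma):
    d = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d / sigma**2)

def cv_error(X, y, lam, sigma, m=5):
    # Average validation (squared) error of kernel ridge regression over m folds.
    n = len(y)
    folds = np.array_split(np.arange(n), m)
    errs = []
    for val_idx in folds:
        tr_idx = np.setdiff1d(np.arange(n), val_idx)
        K_tr = gauss_gram(X[tr_idx], X[tr_idx], sigma)
        alpha = np.linalg.solve(K_tr + lam * np.eye(len(tr_idx)), y[tr_idx])
        pred = gauss_gram(X[val_idx], X[tr_idx], sigma) @ alpha
        errs.append(np.mean((pred - y[val_idx]) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(9)
X_tr = rng.uniform(-0.5, 1.5, size=(100, 1))
y_tr = np.sin(4 * X_tr[:, 0]) + 0.1 * rng.normal(size=100)
grid = [(lam, s) for lam in (1e-7, 0.1, 10.0) for s in (0.2, 0.6, 2.0)]
best = min(grid, key=lambda p: cv_error(X_tr, y_tr, p[0], p[1]))
print(best)  # the (lambda, sigma) pair with the lowest cross-validation error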
Table 1: Fourier series relations in 1-D.

Description of rule      Input space                              Frequency space
Shift                    f(x − x_0)                               f̃_l exp(−ıl(2π/T)x_0)
Input real               f^*(x) = f(x)                            f̃_{−l} = f̃_l^*
Input even, real         f^*(x) = f(x), f(−x) = f(x)              f̃_{−l} = f̃_l = f̃_l^*
Scaling                  f(ax)                                    T changes accordingly
Differentiation          (d/dx) f(x)                              ıl(2π/T) f̃_l
Parseval's theorem       ∫_{−T/2}^{T/2} f(x) g^*(x) dx            ∑_{l=−∞}^{∞} f̃_l g̃_l^*
8 Acknowledgements
Thanks to Gergo Bohner, Peter Dayan, Agnieszka Grabska-Barwinska, Wittawat Jitkrittum, Peter Latham, Arian Maleki, Kirsty McNaughton, Sam Patterson, Dino Sejdinovic, and Yingjian Wang for providing feedback on the notes, and correcting errors.
A   The Fourier series on [−T/2, T/2] with periodic boundary conditions
We consider the case in which f(x) is periodic with period T, so that we need
only specify f(x) : [−T/2, T/2] → R. In this case, we obtain the Fourier series
expansion
f̃_l = (1/T) ∫_{−T/2}^{T/2} f(x) exp(−ıl(2π/T)x) dx,   (A.1)
such that
f(x) = ∑_{l=−∞}^{∞} f̃_l exp(ıl(2π/T)x).   (A.2)
References
[1] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in
Probability and Statistics. Kluwer, 2004.
[2] Felipe Cucker and Steve Smale. Best choices for regularization parameters
in learning theory: On the bias–variance problem. Foundations of Compu-
tational Mathematics, 2(4):413–428, October 2002.
[3] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A
kernel method for the two-sample problem. In Advances in Neural Infor-
mation Processing Systems 15, pages 513–520, Cambridge, MA, 2007. MIT
Press.
[4] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins.
Text classification using string kernels. Journal of Machine Learning Re-
search, 2:419–444, February 2002.
[5] K. B. Petersen and M. S. Pedersen. The matrix cookbook, 2008. Version
20081110.
[8] Bernhard Schölkopf and A. J. Smola. Learning with Kernels. MIT Press,
Cambridge, MA, 2002.
[9] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis.
Cambridge University Press, Cambridge, UK, 2004.