2.2 The Representation of Similarities in Linear Spaces
Figure 2.1 Toy example of a binary classification problem mapped into feature space. We
assume that the true decision boundary is an ellipse in input space (left panel). The task
of the learning process is to estimate this boundary based on empirical data consisting of
training points in both classes (crosses and circles, respectively). When mapped into feature
space via the nonlinear map Φ_2(x) = (z_1, z_2, z_3) = ([x]_1^2, [x]_2^2, √2 [x]_1 [x]_2) (right panel), the
ellipse becomes a hyperplane (in the present simple case, it is parallel to the z_3 axis, hence
all points are plotted in the (z_1, z_2) plane). This is due to the fact that ellipses can be written
as linear equations in the entries of (z_1, z_2, z_3). Therefore, in feature space, the problem
reduces to that of estimating a hyperplane from the mapped data points. Note that via the
polynomial kernel (see (2.12) and (2.13)), the dot product in the three-dimensional space
can be computed without computing Φ2 . Later in the book, we shall describe algorithms
for constructing hyperplanes which are based on dot products (Chapter 7).
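To make the caption concrete, here is a short Python sketch (with arbitrarily chosen sample points) verifying that the dot product of the images under Φ_2 coincides with the squared dot product in input space, so that the feature map never needs to be computed explicitly:

```python
# A minimal numerical check of the map Phi_2 from Figure 2.1: it turns the
# degree-2 polynomial kernel into an ordinary dot product in R^3.
import numpy as np

def phi2(x):
    """Map R^2 -> R^3: ([x]_1^2, [x]_2^2, sqrt(2)*[x]_1*[x]_2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def poly2_kernel(x, xp):
    """Degree-2 homogeneous polynomial kernel <x, x'>^2 (cf. (2.12), (2.13))."""
    return np.dot(x, xp) ** 2

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)

# The dot product in the three-dimensional feature space ...
lhs = np.dot(phi2(x), phi2(xp))
# ... equals the kernel evaluated in the two-dimensional input space.
rhs = poly2_kernel(x, xp)
assert np.isclose(lhs, rhs)
```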
In what follows, we will look at things the other way round, and start with the
kernel rather than with the feature map. Given some kernel, can we construct a
feature space such that the kernel computes the dot product in that feature space;
that is, such that (2.2) holds? This question has been brought to the attention
of the machine learning community in a variety of contexts, especially during
recent years [4, 152, 62, 561, 480]. In functional analysis, the same problem has
been studied under the heading of Hilbert space representations of kernels. A good
monograph on the theory of kernels is the book of Berg, Christensen, and Ressel
[42]; indeed, a large part of the material in the present chapter is based on this
work. We do not aim to be fully rigorous; instead, we try to provide insight into
the basic ideas. As a rule, all the results that we state without proof can be found
in [42]. Other standard references include [16, 455].
There is one more aspect in which this section differs from the previous one:
the latter dealt with vectorial data, and the domain 𝒳 was assumed to be a subset
of ℝ^N. By contrast, the results in the current section hold for data drawn from
domains 𝒳 which need no structure other than their being nonempty sets. This
generalizes kernel learning algorithms to a large number of situations where a
vectorial representation is not readily available, and where one directly works
with pairwise similarities between objects (cf. Section 2.2.7).
We start with some basic definitions and results. As in the previous chapter, indices
i and j are understood to run over 1, …, m.
An m × m matrix K over ℂ satisfying ∑_{i,j=1}^{m} c_i \overline{c_j} K_{ij} ≥ 0 (2.15) for all c_i ∈ ℂ is called positive definite.1 Similarly, a real symmetric m × m matrix K
satisfying (2.15) for all c_i ∈ ℝ is called positive definite.
Note that a symmetric matrix is positive definite if and only if all its eigenvalues
are nonnegative (Problem 2.4). The left hand side of (2.15) is often referred to as
the quadratic form induced by K.
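As a sketch of how these two equivalent criteria can be checked numerically (the Gaussian kernel and the sample points below are our own choices, not prescribed by the text):

```python
# Checking positive definiteness of a Gram matrix in the sense used here,
# i.e. nonnegativity of the quadratic form (2.15), which for a symmetric
# matrix is equivalent to nonnegative eigenvalues.
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                       # arbitrary points x_1, ..., x_m
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])

# Eigenvalue test: all eigenvalues of the symmetric matrix K are >= 0
# (up to numerical round-off).
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)

# Quadratic form test: sum_ij c_i c_j K_ij >= 0 for random coefficients c.
for _ in range(100):
    c = rng.normal(size=len(X))
    assert c @ K @ c >= -1e-10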
Remark 2.6 (Terminology) The term kernel stems from the first use of this type of
function in the field of integral operators as studied by Hilbert and others [243, 359, 112].
A function k which gives rise to an operator T_k via
(T_k f)(x) = ∫_𝒳 k(x, x') f(x') dx' (2.16)
is called the kernel of T_k.
Simply using the term positive kernel, on the other hand, could be mistaken as referring
to a kernel whose values are positive. Finally, the term positive semidefinite kernel
becomes rather cumbersome if it is to be used throughout a book. Therefore, we follow
the convention used for instance in [42], and employ the term positive definite both for
kernels and matrices in the way introduced above. The case where the value 0 is only
attained if all coefficients are 0 will be referred to as strictly positive definite.
We shall mostly use the term kernel. Whenever we want to refer to a kernel k(x, x')
which is not positive definite in the sense stated above, it will be clear from the context.
The definitions for positive definite kernels and positive definite matrices differ in
the fact that in the former case, we are free to choose the points on which the kernel
is evaluated — for every choice, the kernel induces a positive definite matrix.
Positive definiteness implies positivity on the diagonal (Problem 2.12),
k(x, x) ≥ 0 for all x ∈ 𝒳, (2.17)
and symmetry (Problem 2.13),
k(x_i, x_j) = \overline{k(x_j, x_i)}. (2.18)
To also cover the complex-valued case, our definition of symmetry includes com-
plex conjugation. The definition of symmetry of matrices is analogous; that is,
Ki j K ji .
Real-Valued Kernels   For real-valued kernels it is not sufficient to stipulate that (2.15) hold for real coefficients c_i. To get away with real coefficients only, we must additionally require that the kernel be symmetric (Problem 2.14); k(x_i, x_j) = k(x_j, x_i) (cf. Problem 2.13).
It can be shown that whenever k is a (complex-valued) positive definite kernel,
its real part is a (real-valued) positive definite kernel. Below, we shall largely be
dealing with real-valued kernels. Most of the results, however, also apply for
complex-valued kernels.
Kernels can be regarded as generalized dot products. Indeed, any dot product
is a kernel (Problem 2.5); however, linearity in the arguments, which is a standard
property of dot products, does not carry over to general kernels. However, another
property of dot products, the Cauchy-Schwarz inequality, does have a natural
generalization to kernels:

Proposition 2.7 (Cauchy-Schwarz Inequality for Kernels) If k is a positive definite kernel, and x_1, x_2 ∈ 𝒳, then
|k(x_1, x_2)|² ≤ k(x_1, x_1) · k(x_2, x_2). (2.19)
Proof  For the sake of brevity, we give a non-elementary proof using some basic facts
of linear algebra. The 2 × 2 Gram matrix with entries K_{ij} = k(x_i, x_j) (i, j ∈ {1, 2})
is positive definite. Hence both its eigenvalues are nonnegative, and so is their
product, the determinant of K. Therefore
0 ≤ K_{11} K_{22} − K_{12} K_{21} = K_{11} K_{22} − K_{12} \overline{K_{12}} = K_{11} K_{22} − |K_{12}|². (2.20)
Substituting k(x_i, x_j) for K_{ij}, we get the desired inequality.
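A quick numerical illustration of this argument, using a polynomial kernel and randomly drawn points of our own choosing, checks the resulting inequality |k(x_1, x_2)|² ≤ k(x_1, x_1) k(x_2, x_2) via the 2 × 2 Gram determinant:

```python
# The kernel Cauchy-Schwarz inequality via the 2x2 Gram determinant (2.20).
import numpy as np

def poly_kernel(x, xp, d=3):
    return np.dot(x, xp) ** d

rng = np.random.default_rng(2)
for _ in range(100):
    x1, x2 = rng.normal(size=5), rng.normal(size=5)
    K = np.array([[poly_kernel(x1, x1), poly_kernel(x1, x2)],
                  [poly_kernel(x2, x1), poly_kernel(x2, x2)]])
    # det K = K11*K22 - |K12|^2 >= 0, i.e. |k(x1,x2)|^2 <= k(x1,x1) k(x2,x2)
    assert K[0, 1] ** 2 <= K[0, 0] * K[1, 1] * (1 + 1e-12)
```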
We begin by constructing a dot product space containing the images of the input
patterns under Φ. To this end, we first need to define a vector space. This is done
Vector Space   by taking linear combinations of the form
f(·) = ∑_{i=1}^{m} α_i k(·, x_i). (2.22)
Here, m ∈ ℕ, the α_i ∈ ℝ and the x_1, …, x_m ∈ 𝒳 are arbitrary. Next, we define a dot product between f and another function of this form, g(·) = ∑_{j=1}^{m'} β_j k(·, x'_j) (where m' ∈ ℕ, β_j ∈ ℝ and x'_1, …, x'_{m'} ∈ 𝒳), as
⟨f, g⟩ := ∑_{i=1}^{m} ∑_{j=1}^{m'} α_i β_j k(x_i, x'_j). (2.24)
This expression explicitly contains the expansion coefficients, which need not be
unique. To see that it is nevertheless well-defined, note that
⟨f, g⟩ = ∑_{j=1}^{m'} β_j f(x'_j), (2.25)
using k(x'_j, x_i) = k(x_i, x'_j). The sum in (2.25), however, does not depend on the
particular expansion of f. Similarly, for g, note that
⟨f, g⟩ = ∑_{i=1}^{m} α_i g(x_i). (2.26)
This also shows that ⟨·,·⟩ is bilinear and symmetric. Moreover, since k is positive definite, we have, for any function f of the form (2.22),
⟨f, f⟩ = ∑_{i,j=1}^{m} α_i α_j k(x_i, x_j) ≥ 0. (2.27)
The latter implies that ⟨·,·⟩ is actually itself a positive definite kernel, defined
on our space of functions. To see this, note that given functions f_1, …, f_n, and
coefficients γ_1, …, γ_n ∈ ℝ, we have
∑_{i,j=1}^{n} γ_i γ_j ⟨f_i, f_j⟩ = ⟨∑_{i=1}^{n} γ_i f_i, ∑_{j=1}^{n} γ_j f_j⟩ ≥ 0. (2.28)
Here, the left hand equality follows from the bilinearity of ⟨·,·⟩, and the right hand
inequality from (2.27). For the last step in proving that it qualifies as a dot product,
we will use the following interesting property of Φ, which follows directly from
the definition: for all functions of the form (2.22), we have
⟨k(·, x), f⟩ = f(x); (2.29)
in particular,
⟨k(·, x), k(·, x')⟩ = k(x, x'). (2.30)
Reproducing Kernel   By virtue of these properties, positive definite kernels k are also called reproducing kernels [16, 42, 455, 578, 467, 202]. By (2.29) and Proposition 2.7, we have
|f(x)|² = |⟨k(·, x), f⟩|² ≤ k(x, x) · ⟨f, f⟩. (2.31)
Therefore, ⟨f, f⟩ = 0 directly implies f = 0, which is the last property that required
proof in order to establish that ⟨·,·⟩ is a dot product (cf. Section B.2).
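The construction can be mimicked numerically. The following sketch (Gaussian kernel, random expansion points and coefficients chosen for illustration) evaluates the dot product (2.24) and checks the identities (2.25)–(2.27):

```python
# Functions f = sum_i alpha_i k(., x_i), g = sum_j beta_j k(., x'_j), and the
# dot product (2.24) between them, with the consistency checks (2.25)-(2.27).
import numpy as np

def k(x, xp, sigma=1.0):
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 2))          # expansion points x_1, ..., x_m of f
Xp = rng.normal(size=(4, 2))         # expansion points x'_1, ..., x'_m' of g
alpha = rng.normal(size=5)           # coefficients of f
beta = rng.normal(size=4)            # coefficients of g

def f(x):
    return sum(a * k(xi, x) for a, xi in zip(alpha, X))

def g(x):
    return sum(b * k(xj, x) for b, xj in zip(beta, Xp))

# <f, g> = sum_ij alpha_i beta_j k(x_i, x'_j)  (2.24)
K_cross = np.array([[k(xi, xj) for xj in Xp] for xi in X])
fg = alpha @ K_cross @ beta

# (2.25) and (2.26): the value does not depend on which expansion we use.
assert np.isclose(fg, sum(b * f(xj) for b, xj in zip(beta, Xp)))
assert np.isclose(fg, sum(a * g(xi) for a, xi in zip(alpha, X)))

# (2.27): <f, f> >= 0, as required for a dot product.
K_ff = np.array([[k(xi, xj) for xj in X] for xi in X])
assert alpha @ K_ff @ alpha >= -1e-12
```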
The case of complex-valued kernels can be dealt with using the same construc-
tion; in that case, we will end up with a complex dot product space [42].
The above reasoning has shown that any positive definite kernel can be thought
of as a dot product in another space: in view of (2.21), the reproducing kernel
property (2.30) amounts to
⟨Φ(x), Φ(x')⟩ = k(x, x').
This suggests a procedure that is sometimes referred to as the kernel trick: given an algorithm which is formulated in terms of dot products, one can construct an alternative algorithm by replacing the kernel k with another positive definite kernel k̃.
In view of the material in the present section, the justification for this procedure is
the following: effectively, the original algorithm can be thought of as a dot prod-
uct based algorithm operating on vectorial data Φ(x_1), …, Φ(x_m). The algorithm
obtained by replacing k by k̃ then is exactly the same dot product based algorithm,
only that it operates on Φ̃(x_1), …, Φ̃(x_m).
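As a minimal sketch of this substitution (the distance computation below is our own illustrative choice, not an algorithm from the text): any quantity that is expressed through dot products of mapped points, such as the feature-space distance, can be evaluated with kernel calls alone, for the linear kernel and for a nonlinear k̃ alike.

```python
# ||Phi(x) - Phi(x')||^2 = <Phi(x),Phi(x)> - 2<Phi(x),Phi(x')> + <Phi(x'),Phi(x')>
# can be evaluated with kernel calls alone, for any positive definite kernel k.
import numpy as np

def feature_space_distance_sq(x, xp, k):
    return k(x, x) - 2 * k(x, xp) + k(xp, xp)

linear = lambda x, xp: np.dot(x, xp)          # k = canonical dot product
poly = lambda x, xp: np.dot(x, xp) ** 2       # nonlinear kernel k~

rng = np.random.default_rng(4)
x, xp = rng.normal(size=3), rng.normal(size=3)

# With the linear kernel, this is just the squared Euclidean distance ...
assert np.isclose(feature_space_distance_sq(x, xp, linear),
                  np.linalg.norm(x - xp) ** 2)
# ... replacing k by the polynomial kernel gives the distance of the
# (implicitly) mapped points, without ever computing the feature map.
print(feature_space_distance_sq(x, xp, poly))
```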
The best known application of the kernel trick is in the case where k is the dot
product in the input domain (cf. Problem 2.5). The trick is not limited to that case,
however: k and k̃ can both be nonlinear kernels. In general, care must be exercised
in determining whether the resulting algorithm will be useful: sometimes, an
algorithm will only work subject to additional conditions on the input data, e.g.,
the data set might have to lie in the positive orthant. We shall later see that certain
kernels induce feature maps which enforce such properties for the mapped data
(cf. (2.73)), and that there are algorithms which take advantage of these aspects
(e.g., in Chapter 8). In such cases, not every conceivable positive definite kernel will be a suitable choice.
2. This is illustrated by the following quotation from an excellent machine learning textbook published in the seventies (p. 174 in [152]): “The familiar functions of mathematical physics
are eigenfunctions of symmetric kernels, and their use is often suggested for the construction of po-
tential functions. However, these suggestions are more appealing for their mathematical beauty than
their practical usefulness.”
RKHS In view of the properties (2.29) and (2.30), this space is usually called a reproducing
kernel Hilbert space (RKHS).
In general, an RKHS can be defined as follows.
Definition 2.9 (Reproducing Kernel Hilbert Space) Let 𝒳 be a nonempty set (often
called the index set) and ℋ a Hilbert space of functions f : 𝒳 → ℝ. Then ℋ is called
a reproducing kernel Hilbert space endowed with the dot product ⟨·,·⟩ (and the norm
‖f‖ := √⟨f, f⟩) if there exists a function k : 𝒳 × 𝒳 → ℝ with the following properties.
Reproducing Property   1. k has the reproducing property3
⟨f, k(x, ·)⟩ = f(x) for all f ∈ ℋ; (2.34)
in particular,
⟨k(x, ·), k(x', ·)⟩ = k(x, x'). (2.35)
2. k spans ℋ, i.e., ℋ is the closure of span{k(x, ·) | x ∈ 𝒳}. (2.36)
It follows directly from (2.35) that k(x, x') is symmetric in its arguments (see Problem 2.28).
Uniqueness of k   Suppose k' is another function with the reproducing property for ℋ. From Problem 2.28 we know that both k and k' must be symmetric. Moreover,
k(x, x') = ⟨k(x, ·), k'(x', ·)⟩ = k'(x', x). (2.37)
In the second equality we used the symmetry of the dot product. Finally, symmetry in the arguments of k' yields k(x, x') = k'(x, x'), which proves our claim.
Section 2.2.2 has shown that any positive definite kernel can be represented as a
dot product in a linear space. This was done by explicitly constructing a (Hilbert)
space that does the job. The present section will construct another Hilbert space.
One could argue that this is superfluous, given that any two separable Hilbert
spaces are isometrically isomorphic, in other words, it is possible to define a one-
to-one linear map between the spaces which preserves the dot product. However,
the tool that we shall presently use, Mercer’s theorem, has played a crucial role
in the understanding of SVMs, and it provides valuable insight into the geometry
of feature spaces, which more than justifies its detailed discussion. In the SVM
literature, the kernel trick is usually introduced via Mercer’s theorem.
Mercer's Theorem   We start by stating the version of Mercer's theorem given in [606]. We assume
(𝒳, μ) to be a finite measure space.4 The term almost all (cf. Appendix B) means
except for sets of measure zero. For the commonly used Lebesgue-Borel measure,
countable sets of individual points are examples of zero measure sets. Note that
the integral with respect to a measure is explained in Appendix B. Readers who
do not want to go into mathematical detail may simply want to think of the dμ(x) below as dx.
Theorem 2.10 (Mercer) Suppose k ∈ L_∞(𝒳²) is a symmetric real-valued function such that the integral operator T_k of (2.16), with the integral taken with respect to μ, is positive definite; that is, ∫_𝒳 ∫_𝒳 k(x, x') f(x) f(x') dμ(x) dμ(x') ≥ 0 for all f ∈ L_2(𝒳).
Let ψ_j ∈ L_2(𝒳) be the normalized orthogonal eigenfunctions of T_k associated with the
eigenvalues λ_j > 0, sorted in non-increasing order. Then
1. (λ_j)_j ∈ ℓ_1,
2. k(x, x') = ∑_{j=1}^{N_ℋ} λ_j ψ_j(x) ψ_j(x') holds for almost all (x, x'). Either N_ℋ ∈ ℕ, or N_ℋ = ∞;
in the latter case, the series converges absolutely and uniformly for almost all (x, x').
For the converse of Theorem 2.10, see Problem 2.23. For a data-dependent approx-
imation and its relationship to kernel PCA (Section 1.7), see Problem 2.26.
From statement 2 it follows that k(x, x') corresponds to a dot product in ℓ_2^{N_ℋ},
since k(x, x') = ⟨Φ(x), Φ(x')⟩ with
Φ : 𝒳 → ℓ_2^{N_ℋ},
x ↦ (√λ_j ψ_j(x))_{j=1,…,N_ℋ}, (2.40)
for almost all x ∈ 𝒳.
4. A finite measure space is a set 𝒳 with a σ-algebra (Definition B.1) defined on it, and a
measure μ (Definition B.2) defined on the latter, satisfying μ(𝒳) < ∞ (so that, up to a scaling
factor, μ is a probability measure).
Note that we use the same Φ as in (2.21) to denote the feature map, although the target spaces are different. However, this distinction is not
important for the present purposes — we are interested in the existence of some
Hilbert space in which the kernel corresponds to the dot product, and not in what
particular representation of it we are using.
In fact, it has been noted [467] that the uniform convergence of the series implies
that given any ε > 0, there exists an n ∈ ℕ such that even if N_ℋ = ∞, k can be
approximated within accuracy ε as a dot product in ℝ^n: for almost all x, x' ∈ 𝒳,
|k(x, x') − ⟨Φ_n(x), Φ_n(x')⟩| < ε, where Φ_n : x ↦ (√λ_1 ψ_1(x), …, √λ_n ψ_n(x)). The
feature space can thus always be thought of as finite-dimensional within some
accuracy ε. We summarize our findings in the following proposition.
Proposition 2.11 If k is a kernel satisfying the conditions of Theorem 2.10, then for every ε > 0 there exists a map Φ_n into an n-dimensional dot product space such that
|k(x, x') − ⟨Φ_n(x), Φ_n(x')⟩| ≤ ε (2.42)
for almost all x, x' ∈ 𝒳.
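In the spirit of the data-dependent approximation mentioned above (cf. Problem 2.26), a rough numerical analogue of this truncation can be obtained from the eigenvalue decomposition of a Gram matrix on a sample; the kernel, sample, and truncation level below are our own choices:

```python
# Truncating the eigenvalue decomposition of a Gram matrix to its n leading
# terms gives a finite-dimensional feature map whose dot product closely
# approximates the kernel on the sample.
import numpy as np

def k(x, xp, sigma=1.0):
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(200, 1))
K = np.array([[k(a, b) for b in X] for a in X])

lam, U = np.linalg.eigh(K)                 # eigenvalues in increasing order
lam, U = lam[::-1], U[:, ::-1]             # sort in non-increasing order

n = 10                                     # keep the n leading terms
Phi_n = U[:, :n] * np.sqrt(np.maximum(lam[:n], 0))   # rows: Phi_n(x_i)

# The truncated expansion already reproduces the Gram matrix closely.
print(np.abs(Phi_n @ Phi_n.T - K).max())
```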
Both Mercer kernels and positive definite kernels can thus be represented as dot
products in Hilbert spaces. The following proposition, showing a case where the
two types of kernels coincide, thus comes as no surprise.
Proposition 2.12 (Mercer Kernels are Positive Definite [359, 42]) Let 𝒳 = [a, b] be
a compact interval and let k : [a, b] × [a, b] → ℂ be continuous. Then k is a positive definite
kernel if and only if
∫_a^b ∫_a^b k(x, x') f(x) \overline{f(x')} dx dx' ≥ 0 (2.43)
for each continuous function f : 𝒳 → ℂ.
Being positive definite, Mercer kernels are thus also reproducing kernels.
We next show how the reproducing kernel map is related to the Mercer kernel
map constructed from the eigenfunction decomposition [202, 467]. To this end, let
us consider a kernel which satisfies the condition of Theorem 2.10, and construct
a dot product ⟨·,·⟩ such that k becomes a reproducing kernel for the Hilbert space
containing the functions
f(x) = ∑_i α_i k(x_i, x) = ∑_i α_i ∑_{j=1}^{N_ℋ} λ_j ψ_j(x_i) ψ_j(x). (2.44)
If the dot product is chosen such that ⟨√λ_j ψ_j, √λ_n ψ_n⟩ = δ_{jn}, then the reproducing property holds:
⟨f, k(·, x')⟩ = ∑_i α_i ∑_{j=1}^{N_ℋ} λ_j ψ_j(x_i) ψ_j(x') = f(x'). (2.46)
More generally, suppose Φ_1 and Φ_2 are two maps of 𝒳 into dot product spaces ℋ_1 and ℋ_2, respectively, both satisfying
⟨Φ_i(x), Φ_i(x')⟩_i = k(x, x') for i = 1, 2. (2.48)
Then it will usually not be the case that Φ_1(x) = Φ_2(x); due to (2.48), however,
we always have ⟨Φ_1(x), Φ_1(x')⟩_1 = ⟨Φ_2(x), Φ_2(x')⟩_2. Therefore, as long as we are
only interested in dot products, the two spaces can be considered identical.
An example of this identity is the so-called large margin regularizer that is
usually used in SVMs, as discussed in the introductory chapter (cf. also Chapters
4 and 7),
⟨w, w⟩, where w = ∑_{i=1}^{m} α_i Φ(x_i). (2.49)
No matter whether Φ is the RKHS map Φ(x_i) = k(·, x_i) of (2.21) or the Mercer map
Φ(x_i) = (√λ_j ψ_j(x_i))_{j=1,…,N_ℋ} of (2.40), the value of ‖w‖² will not change.
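A small numerical check of this invariance (using the explicit map Φ_2 of Figure 2.1 and random coefficients of our own choosing): the value ⟨w, w⟩ = ∑_{i,j} α_i α_j k(x_i, x_j) = α^⊤Kα can be computed from the kernel alone and agrees with the explicit feature-space computation.

```python
# <w, w> for w = sum_i a_i Phi(x_i) depends only on the kernel values,
# not on which feature map realizes them.
import numpy as np

def phi2(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def k(x, xp):
    return np.dot(x, xp) ** 2

rng = np.random.default_rng(6)
X = rng.normal(size=(8, 2))
a = rng.normal(size=8)

w = sum(ai * phi2(xi) for ai, xi in zip(a, X))      # explicit feature map
K = np.array([[k(xi, xj) for xj in X] for xi in X])

assert np.isclose(np.dot(w, w), a @ K @ a)
```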
This point is of great importance, and we hope that all readers are still with us.
It is fair to say, however, that Section 2.2.5 can be skipped at first reading.
Using Mercer’s theorem, we have shown that one can think of the feature map
as a map into a high- or infinite-dimensional Hilbert space. The argument in the
remainder of the section shows that this typically entails that the mapped data
Φ(𝒳) lie in some box with rapidly decaying side lengths [606]. By this we mean
that the range of the data decreases as the dimension index j increases, with a rate
that depends on the size of the eigenvalues.
Let us assume that for all j ∈ ℕ, we have sup_{x ∈ 𝒳} λ_j ψ_j(x)² < ∞. Define the
sequence
l_j := sup_{x ∈ 𝒳} λ_j ψ_j(x)². (2.50)
Note that if
C_k := sup_j sup_{x ∈ 𝒳} |ψ_j(x)| (2.51)
exists (see Problem 2.24), then we have l_j ≤ λ_j C_k². However, if the λ_j decay rapidly,
then (2.50) can be finite even if (2.51) is not.
By construction, Φ(𝒳) is contained in an axis-parallel parallelepiped in ℓ_2^{N_ℋ} with
side lengths 2√l_j (cf. (2.40)).5
Consider an example of a common kernel, the Gaussian, and let μ (see Theorem 2.10) be the Lebesgue measure. In this case, the eigenfunctions are sine and
cosine functions (with supremum one), and thus the sequence of the l_j coincides
with the sequence of the eigenvalues λ_j. Generally, whenever (2.51) is finite, the l_j decay as fast as the λ_j. We shall see in Sections 4.4, 4.5 and Chapter 12
that for many common kernels, this decay is very rapid.
It will be useful to consider operators that map Φ(𝒳) into balls of some radius
R centered at the origin. A class of such operators is determined by the sequence
(l_j)_j. Given a sequence (s_j)_j of positive numbers, consider the operator S defined by
S : (x_j)_j ↦ S(x_j)_j := (s_j x_j)_j. (2.52)
5. In fact, it is sufficient to use the essential supremum in (2.50). In that case, subsequent
statements also only hold true almost everywhere.
If the scaling factors satisfy ∑_j s_j² l_j ≤ R², then for almost all x ∈ 𝒳,
‖SΦ(x)‖² = ∑_j s_j² λ_j ψ_j(x)² ≤ ∑_j s_j² l_j ≤ R². (2.53)
The converse is not necessarily the case: the second inequality in (2.53) replaces each
λ_j ψ_j(x)² by its supremum l_j, so the left hand side can remain bounded by R² even
when ∑_j s_j² l_j is not finite.
To see how the freedom to rescale Φ(𝒳) effectively restricts the class of functions
we are using, we first note that everything in the feature space ℓ_2^{N_ℋ} is done
in terms of dot products. Therefore, we can compensate any invertible symmetric
linear transformation of the data in feature space by the inverse transformation on the set
of admissible weight vectors. In other words, for any invertible symmetric
operator S on ℓ_2^{N_ℋ}, we have ⟨S^{-1} w, SΦ(x)⟩ = ⟨w, Φ(x)⟩ for all x ∈ 𝒳.
As we shall see below (cf. Theorem 5.5, Section 12.4, and Problem 7.5), there
exists a class of generalization error bounds that depend on the radius R of the
smallest sphere containing the data. If the (l_j)_j decay rapidly, we are not actually
“making use” of the whole sphere. In this case, we may construct a diagonal
scaling operator S which inflates the sides of the above parallelepiped as much
as possible, while ensuring that it is still contained within a sphere of the original
radius R (Figure 2.3). By effectively reducing the size of the function class, this
will provide a way of strengthening the bounds. A similar idea, using kernel PCA
(Section 14.2) to determine empirical scaling coefficients, has been successfully
applied by [101].
We conclude this section with another useful insight that characterizes a prop-
erty of the feature map Φ. Note that most of what was said so far applies to the
case where the input domain 𝒳 is a general set. In this case, it is not possible to
make nontrivial statements about continuity properties of Φ. This changes if we
assume 𝒳 to be endowed with a notion of closeness, by turning it into a so-called
topological space. Readers not familiar with this concept will be reassured to hear
Continuity of Φ that Euclidean vector spaces are particular cases of topological spaces.
Figure 2.3 Since everything is done in terms of dot products, scaling up the data by
an operator S can be compensated by scaling the weight vectors with S^{-1} (cf. text). By
choosing S such that the data are still contained in a ball of the same radius R, we effectively
reduce our function class (parametrized by the weight vector), which can lead to better
generalization bounds, depending on the kernel inducing the map Φ.
The map Φ, defined in (2.21), transforms each input pattern into a function on ,
that is, into a potentially infinite-dimensional object. For any given set of points,
however, it is possible to approximate Φ by only evaluating it on these points (cf.
[232, 350, 361, 547, 474]):
matrix.6 Enforcing (2.57) on the training patterns, this yields the self-consistency
condition [478, 512]
K = KMK, (2.58)
6. Every dot product in ℝ^m can be written in this form. We do not require strict definiteness
of M, as the null space can be projected out, leading to a lower-dimensional feature space.
where K is the Gram matrix. The condition (2.58) can be satisfied for instance
by the (pseudo-)inverse M = K^{-1}. Equivalently, we could have incorporated this
whitening step directly into the map, by using
Φ_m^w : x ↦ K^{-1/2} (k(x_1, x), …, k(x_m, x)), (2.59)
where K^{-1/2} is computed using the eigenvalue decomposition of K; the λ_i are the eigenvalues of K.7 This parallels the rescaling of the eigenfunctions
of the integral operator belonging to the kernel, given by (2.47). It turns out that
this map can equivalently be performed using kernel PCA feature extraction (see
Problem 14.8), which is why we refer to this map as the kernel PCA map.
Note that we have thus constructed a data-dependent feature map into an m-
dimensional space which satisfies ⟨Φ_m^w(x), Φ_m^w(x')⟩ = k(x, x'); i.e., we have found an
m-dimensional feature space associated with the given kernel. In the case where K
is invertible, Φ_m^w(x) computes the coordinates of Φ(x) when represented in a basis
of the m-dimensional subspace spanned by Φ(x_1), …, Φ(x_m).
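A sketch of the map (2.59), assuming K is invertible (kernel and data below are our own choices); on the training points, the dot products of the whitened map reproduce the Gram matrix:

```python
# Phi_m^w(x) = K^{-1/2} (k(x_1, x), ..., k(x_m, x)); on the training points its
# dot products reproduce the Gram matrix K.
import numpy as np

def k(x, xp, sigma=1.0):
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(8)
X = rng.normal(size=(10, 2))
K = np.array([[k(a, b) for b in X] for a in X])

# K^{-1/2} via the eigenvalue decomposition K = U D U^T.
lam, U = np.linalg.eigh(K)
K_inv_sqrt = U @ np.diag(1.0 / np.sqrt(lam)) @ U.T

def phi_w(x):
    return K_inv_sqrt @ np.array([k(xi, x) for xi in X])

Phi = np.array([phi_w(xi) for xi in X])                # rows: Phi_m^w(x_i)
assert np.allclose(Phi @ Phi.T, K, atol=1e-6)          # <Phi_m^w(x_i), Phi_m^w(x_j)> = K_ij
```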
For data sets where the number of examples is smaller than their dimension,
it can actually be computationally attractive to carry out Φ_m^w explicitly, rather
than using kernels in subsequent algorithms. Moreover, algorithms which are not
readily “kernelized” may benefit from explicitly carrying out the kernel PCA map.
We end this section with two notes which illustrate why the use of (2.56) need
not be restricted to the special case we just discussed.
More general kernels. When using non-symmetric kernels k in (2.56), together with
the canonical dot product, we effectively work with the positive definite matrix
K^⊤K. Note that each positive definite matrix can be written as K^⊤K. Therefore,
working with positive definite kernels leads to an equally rich set of nonlinearities
as working with an empirical kernel map using general non-symmetric kernels.
If we wanted to carry out the whitening step, we would have to use (K^⊤K)^{-1/4} (cf. (2.59)).
Different expansion sets. Instead of evaluating the kernel on the training examples themselves, we may use an arbitrary expansion set {z_1, …, z_n} ⊂ 𝒳 and the map
Φ_n^w : x ↦ K_n^{-1/2} (k(z_1, x), …, k(z_n, x)), (2.60)
where (K_n)_{ij} := k(z_i, z_j). The expansion set can either be a subset of the training
set,8 or some other set of points. We will later return to the issue of how to choose
the best set (see Section 10.2 and Chapter 18). As an aside, note that in the case of
Kernel PCA (see Section 1.7 and Chapter 14 below), one does not need to worry
about the whitening step in (2.59) and (2.60): using the canonical dot product in
ℝ^m (rather than the whitened one) will simply lead to diagonalizing K² instead of K, which
yields the same eigenvectors with squared eigenvalues. This was pointed out by
[350, 361]. The study [361] reports experiments where (2.56) was employed to
speed up Kernel PCA by choosing z_1, …, z_n as a subset of x_1, …, x_m.
2.2.7 A Kernel Map Defined from Pairwise Similarities
Proof  First assume that K is positive definite. In this case, it can be diagonalized
as K = SDS^⊤, with an orthogonal matrix S and a diagonal matrix D with nonnegative entries. Then
K_{ij} = (SDS^⊤)_{ij} = ⟨D^{1/2}S_i, D^{1/2}S_j⟩, (2.62)
where we have defined the S_i as the rows of S (note that the columns of S would be
K's eigenvectors). Therefore, K is the Gram matrix of the vectors D^{1/2}S_i.9 Hence
the following map Φ, defined on x_1, …, x_m, will satisfy (2.61):
Φ : x_i ↦ D^{1/2}S_i. (2.63)
Thus far, Φ is only defined on a set of points, rather than on a vector space.
Therefore, it makes no sense to ask whether it is linear. We can, however, ask
whether it can be extended to a linear map, provided the x i are elements of a vector
space. The answer is that if the xi are linearly dependent (which is often the case),
then this will not be possible, since a linear map would then typically be overdetermined.
8. The map (2.60) can then be computed as D_n^{-1/2} U_n^⊤ (k(z_1, x), …, k(z_n, x)), where U_n D_n U_n^⊤ is the eigenvalue decomposition of
K_n. Note that the columns of U_n are the eigenvectors of K_n. We discard all columns that correspond to zero eigenvalues, as well as the corresponding dimensions of D_n. To approximate
the map, we may actually discard all eigenvalues smaller than some ε > 0.
9. In fact, every positive definite matrix is the Gram matrix of some set of vectors [46].
In particular, this result implies that given data x_1, …, x_m, and a kernel k which
gives rise to a positive definite matrix K, it is always possible to construct a feature
space of dimension at most m that we are implicitly working in when using
kernels (cf. Problem 2.32 and Section 2.2.6).
If we perform an algorithm which requires k to correspond to a dot product in
some other space (as for instance the SV algorithms described in this book), it is
possible that even though k is not positive definite in general, it still gives rise to
a positive definite Gram matrix K with respect to the training data at hand. In this
case, Proposition 2.16 tells us that nothing will go wrong during training when we
work with these data. Moreover, if k leads to a matrix with some small negative
eigenvalues, we can add a small multiple of some strictly positive definite kernel
k' (such as the identity k'(x_i, x_j) = δ_{ij}) to obtain a positive definite matrix. To see
this, suppose that λ_min < 0 is the minimal eigenvalue of k's Gram matrix. Note that,
being strictly positive definite, the Gram matrix K' of k' satisfies
min_{‖α‖ = 1} ⟨α, K'α⟩ ≥ λ'_min > 0, (2.65)
where λ'_min denotes its minimal eigenvalue, and the first inequality follows from the
variational (Rayleigh) characterization of the smallest eigenvalue. Hence, adding εk' to k
with ε ≥ |λ_min| / λ'_min yields a positive definite Gram matrix on the given data.
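A toy illustration of this regularization argument with a synthetic symmetric similarity matrix of our own construction (using k'(x_i, x_j) = δ_ij, whose Gram matrix is the identity):

```python
# Adding eps times the identity, with eps >= |lambda_min|, removes negative
# eigenvalues of a symmetric similarity matrix.
import numpy as np

rng = np.random.default_rng(9)
A = rng.normal(size=(10, 10))
K = (A + A.T) / 2                       # symmetric, but in general indefinite
lam_min = np.linalg.eigvalsh(K).min()

if lam_min < 0:
    eps = -lam_min                      # any eps >= |lambda_min| works
    K_fixed = K + eps * np.eye(10)
    assert np.linalg.eigvalsh(K_fixed).min() >= -1e-10
```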
Polynomial   A simple example of a kernel defined on a dot product space is the homogeneous polynomial kernel of degree d ∈ ℕ,
k(x, x') = ⟨x, x'⟩^d. (2.67)
Gaussian   Boser, Guyon, and Vapnik [62, 223, 561] suggest the usage of Gaussian radial basis function kernels [26, 4],
k(x, x') = exp(−‖x − x'‖² / (2σ²)), (2.68)
and sigmoid kernels,
k(x, x') = tanh(κ⟨x, x'⟩ + ϑ). (2.69)