


[Figure 2.1: left panel, the data in input space with axes $x_1$ and $x_2$; right panel, the mapped data in feature space with axes $z_1$, $z_2$, $z_3$.]

Figure 2.1  Toy example of a binary classification problem mapped into feature space. We assume that the true decision boundary is an ellipse in input space (left panel). The task of the learning process is to estimate this boundary based on empirical data consisting of training points in both classes (crosses and circles, respectively). When mapped into feature space via the nonlinear map $\Phi_2(x) = (z_1, z_2, z_3) = ([x]_1^2, [x]_2^2, \sqrt{2}\,[x]_1 [x]_2)$ (right panel), the ellipse becomes a hyperplane (in the present simple case, it is parallel to the $z_3$ axis, hence all points are plotted in the $(z_1, z_2)$ plane). This is due to the fact that ellipses can be written as linear equations in the entries of $(z_1, z_2, z_3)$. Therefore, in feature space, the problem reduces to that of estimating a hyperplane from the mapped data points. Note that via the polynomial kernel (see (2.12) and (2.13)), the dot product in the three-dimensional space can be computed without computing $\Phi_2$. Later in the book, we shall describe algorithms for constructing hyperplanes which are based on dot products (Chapter 7).

2.2 The Representation of Similarities in Linear Spaces

In what follows, we will look at things the other way round, and start with the
kernel rather than with the feature map. Given some kernel, can we construct a
feature space such that the kernel computes the dot product in that feature space;
that is, such that (2.2) holds? This question has been brought to the attention
of the machine learning community in a variety of contexts, especially during
recent years [4, 152, 62, 561, 480]. In functional analysis, the same problem has
been studied under the heading of Hilbert space representations of kernels. A good
monograph on the theory of kernels is the book of Berg, Christensen, and Ressel
[42]; indeed, a large part of the material in the present chapter is based on this
work. We do not aim to be fully rigorous; instead, we try to provide insight into
the basic ideas. As a rule, all the results that we state without proof can be found
in [42]. Other standard references include [16, 455].
There is one more aspect in which this section differs from the previous one: the latter dealt with vectorial data, and the domain $\mathcal{X}$ was assumed to be a subset of $\mathbb{R}^N$. By contrast, the results in the current section hold for data drawn from domains which need no structure, other than their being nonempty sets. This generalizes kernel learning algorithms to a large number of situations where a vectorial representation is not readily available, and where one directly works with pairwise distances or similarities between non-vectorial objects [246, 467, 154, 210, 234, 585]. This theme will recur in several places throughout the book, for instance in Chapter 13.

2.2.1 Positive Definite Kernels

We start with some basic definitions and results. As in the previous chapter, indices $i$ and $j$ are understood to run over $1, \dots, m$.

Definition 2.3 (Gram Matrix) Given a function $k : \mathcal{X}^2 \to \mathbb{K}$ (where $\mathbb{K} = \mathbb{C}$ or $\mathbb{K} = \mathbb{R}$) and patterns $x_1, \dots, x_m \in \mathcal{X}$, the $m \times m$ matrix $K$ with elements
$$K_{ij} := k(x_i, x_j) \tag{2.14}$$
is called the Gram matrix (or kernel matrix) of $k$ with respect to $x_1, \dots, x_m$.

Definition 2.4 (Positive Definite Matrix) A complex $m \times m$ matrix $K$ satisfying
$$\sum_{i,j} c_i \bar{c}_j K_{ij} \geq 0 \tag{2.15}$$
for all $c_i \in \mathbb{C}$ is called positive definite.¹ Similarly, a real symmetric $m \times m$ matrix $K$ satisfying (2.15) for all $c_i \in \mathbb{R}$ is called positive definite.

Note that a symmetric matrix is positive definite if and only if all its eigenvalues
are nonnegative (Problem 2.4). The left hand side of (2.15) is often referred to as
the quadratic form induced by K.
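As an illustration (a Python sketch added here, not part of the original text; the kernel and function names are ours), the Gram matrix (2.14) of a kernel on a few patterns can be computed directly from Definition 2.3, and its positive definiteness checked via the eigenvalue criterion just mentioned.

import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), cf. (2.68)
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def gram_matrix(k, patterns):
    # K_ij := k(x_i, x_j), cf. (2.14)
    m = len(patterns)
    return np.array([[k(patterns[i], patterns[j]) for j in range(m)]
                     for i in range(m)])

patterns = [np.array([0.0, 0.0]), np.array([1.0, 0.5]), np.array([-0.3, 2.0])]
K = gram_matrix(gaussian_kernel, patterns)
# A symmetric matrix is positive definite in the sense of (2.15)
# iff all of its eigenvalues are nonnegative (Problem 2.4).
eigenvalues = np.linalg.eigvalsh(K)
print(eigenvalues, np.all(eigenvalues >= -1e-12))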

Definition 2.5 ((Positive Definite) Kernel) Let $\mathcal{X}$ be a nonempty set. A function $k$ on $\mathcal{X} \times \mathcal{X}$ which for all $m \in \mathbb{N}$ and all $x_1, \dots, x_m \in \mathcal{X}$ gives rise to a positive definite Gram matrix is called a positive definite (pd) kernel. Often, we shall refer to it simply as a kernel.

Remark 2.6 (Terminology) The term kernel stems from the first use of this type of function in the field of integral operators as studied by Hilbert and others [243, 359, 112]. A function $k$ which gives rise to an operator $T_k$ via
$$(T_k f)(x) = \int_{\mathcal{X}} k(x, x') f(x')\, dx' \tag{2.16}$$
is called the kernel of $T_k$.


In the literature, a number of different terms are used for positive definite kernels, such
as reproducing kernel, Mercer kernel, admissible kernel, Support Vector kernel,
nonnegative definite kernel, and covariance function. One might argue that the term
positive definite kernel is slightly misleading. In matrix theory, the term definite is

sometimes reserved for the case where equality in (2.15) only occurs if $c_1 = \dots = c_m = 0$.
1. The bar in $\bar{c}_j$ denotes complex conjugation; for real numbers, it has no effect.

Simply using the term positive kernel, on the other hand, could be mistaken as referring
to a kernel whose values are positive. Finally, the term positive semidefinite kernel
becomes rather cumbersome if it is to be used throughout a book. Therefore, we follow
the convention used for instance in [42], and employ the term positive definite both for
kernels and matrices in the way introduced above. The case where the value 0 is only
attained if all coefficients are 0 will be referred to as strictly positive definite.
We shall mostly use the term kernel. Whenever we want to refer to a kernel $k(x, x')$ which is not positive definite in the sense stated above, it will be clear from the context.

The definitions for positive definite kernels and positive definite matrices differ in
the fact that in the former case, we are free to choose the points on which the kernel
is evaluated — for every choice, the kernel induces a positive definite matrix.
Positive definiteness implies positivity on the diagonal (Problem 2.12),
$$k(x, x) \geq 0 \quad \text{for all } x \in \mathcal{X}, \tag{2.17}$$
and symmetry (Problem 2.13),
$$k(x_i, x_j) = \overline{k(x_j, x_i)}. \tag{2.18}$$
To also cover the complex-valued case, our definition of symmetry includes complex conjugation. The definition of symmetry of matrices is analogous; that is, $K_{ij} = \overline{K}_{ji}$.
For real-valued kernels, it is not sufficient to stipulate that (2.15) hold for real coefficients $c_i$. To get away with real coefficients only, we must additionally require that the kernel be symmetric (Problem 2.14): $k(x_i, x_j) = k(x_j, x_i)$ (cf. Problem 2.13).
It can be shown that whenever k is a (complex-valued) positive definite kernel,
its real part is a (real-valued) positive definite kernel. Below, we shall largely be
dealing with real-valued kernels. Most of the results, however, also apply for
complex-valued kernels.
Kernels can be regarded as generalized dot products. Indeed, any dot product
is a kernel (Problem 2.5); however, linearity in the arguments, which is a standard property of dot products, does not carry over to general kernels. Another property of dot products, the Cauchy-Schwarz inequality, does have a natural generalization to kernels:

Proposition 2.7 (Cauchy-Schwarz Inequality for Kernels) If $k$ is a positive definite kernel, and $x_1, x_2 \in \mathcal{X}$, then
$$|k(x_1, x_2)|^2 \leq k(x_1, x_1) \cdot k(x_2, x_2). \tag{2.19}$$

Proof  For sake of brevity, we give a non-elementary proof using some basic facts of linear algebra. The $2 \times 2$ Gram matrix with entries $K_{ij} = k(x_i, x_j)$ $(i, j \in \{1, 2\})$ is positive definite. Hence both its eigenvalues are nonnegative, and so is their product, the determinant of $K$. Therefore
$$0 \leq K_{11} K_{22} - K_{12} K_{21} = K_{11} K_{22} - K_{12} \overline{K_{12}} = K_{11} K_{22} - |K_{12}|^2. \tag{2.20}$$
Substituting $k(x_i, x_j)$ for $K_{ij}$, we get the desired inequality.
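A quick numerical sanity check of (2.19) (an illustrative Python snippet, not from the original text), here using the Gaussian kernel that will be introduced in (2.68):

import numpy as np

def k(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

x1, x2 = np.array([0.2, -1.0]), np.array([1.5, 0.3])
lhs = k(x1, x2) ** 2              # |k(x1, x2)|^2
rhs = k(x1, x1) * k(x2, x2)       # k(x1, x1) * k(x2, x2)
assert lhs <= rhs                 # Cauchy-Schwarz inequality (2.19)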

Figure 2.2  One instantiation of the feature map associated with a kernel is the map (2.21), which represents each pattern (in the picture, $x$ or $x'$) by a kernel-shaped function sitting on the pattern. In this sense, each pattern is represented by its similarity to all other patterns. In the picture, the kernel is assumed to be bell-shaped, e.g., a Gaussian $k(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2))$. In the text, we describe the construction of a dot product $\langle \cdot, \cdot \rangle$ on the function space such that $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$.
We now show how the feature spaces in question are defined by the choice of
kernel function.

2.2.2 The Reproducing Kernel Map

Assume that $k$ is a real-valued positive definite kernel, and $\mathcal{X}$ a nonempty set. We define a map from $\mathcal{X}$ into the space of functions mapping $\mathcal{X}$ into $\mathbb{R}$, denoted as $\mathbb{R}^{\mathcal{X}} := \{ f : \mathcal{X} \to \mathbb{R} \}$, via
$$\Phi : \mathcal{X} \to \mathbb{R}^{\mathcal{X}}, \quad x \mapsto k(\cdot, x). \tag{2.21}$$
Here, $\Phi(x)$ denotes the function that assigns the value $k(x', x)$ to $x' \in \mathcal{X}$, i.e., $\Phi(x)(\cdot) = k(\cdot, x)$ (as shown in Figure 2.2).
We have thus turned each pattern into a function on the domain $\mathcal{X}$. In this sense, a pattern is now represented by its similarity to all other points in the input domain $\mathcal{X}$. This seems a very rich representation; nevertheless, it will turn out that the kernel allows the computation of the dot product in this representation. Below, we show how to construct a feature space associated with $\Phi$, proceeding in the following steps:
1. Turn the image of $\Phi$ into a vector space,
2. define a dot product; that is, a strictly positive definite bilinear form, and
3. show that the dot product satisfies $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$.
We begin by constructing a dot product space containing the images of the input patterns under $\Phi$. To this end, we first need to define a vector space. This is done by taking linear combinations of the form
$$f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i). \tag{2.22}$$
Here, $m \in \mathbb{N}$, $\alpha_i \in \mathbb{R}$ and $x_1, \dots, x_m \in \mathcal{X}$ are arbitrary. Next, we define a dot product
between $f$ and another function
$$g(\cdot) = \sum_{j=1}^{m'} \beta_j k(\cdot, x'_j), \tag{2.23}$$
where $m' \in \mathbb{N}$, $\beta_j \in \mathbb{R}$ and $x'_1, \dots, x'_{m'} \in \mathcal{X}$, as
$$\langle f, g \rangle := \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x'_j). \tag{2.24}$$

This expression explicitly contains the expansion coefficients, which need not be unique. To see that it is nevertheless well-defined, note that
$$\langle f, g \rangle = \sum_{j=1}^{m'} \beta_j f(x'_j), \tag{2.25}$$
using $k(x'_j, x_i) = k(x_i, x'_j)$. The sum in (2.25), however, does not depend on the particular expansion of $f$. Similarly, for $g$, note that
$$\langle f, g \rangle = \sum_{i=1}^{m} \alpha_i g(x_i). \tag{2.26}$$
The last two equations also show that $\langle \cdot, \cdot \rangle$ is bilinear. It is symmetric, as $\langle f, g \rangle = \langle g, f \rangle$. Moreover, it is positive definite, since positive definiteness of $k$ implies that for any function $f$, written as (2.22), we have
$$\langle f, f \rangle = \sum_{i,j=1}^{m} \alpha_i \alpha_j k(x_i, x_j) \geq 0. \tag{2.27}$$

The latter implies that $\langle \cdot, \cdot \rangle$ is actually itself a positive definite kernel, defined on our space of functions. To see this, note that given functions $f_1, \dots, f_n$, and coefficients $\gamma_1, \dots, \gamma_n \in \mathbb{R}$, we have
$$\sum_{i,j=1}^{n} \gamma_i \gamma_j \langle f_i, f_j \rangle = \Big\langle \sum_{i=1}^{n} \gamma_i f_i, \sum_{j=1}^{n} \gamma_j f_j \Big\rangle \geq 0. \tag{2.28}$$
Here, the left hand equality follows from the bilinearity of $\langle \cdot, \cdot \rangle$, and the right hand inequality from (2.27). For the last step in proving that it qualifies as a dot product, we will use the following interesting property of $\Phi$, which follows directly from the definition: for all functions (2.22), we have
$$\langle k(\cdot, x), f \rangle = f(x); \tag{2.29}$$
that is, $k$ is the representer of evaluation. In particular,
$$\langle k(\cdot, x), k(\cdot, x') \rangle = k(x, x'). \tag{2.30}$$
By virtue of these properties, positive definite kernels $k$ are also called reproducing kernels [16, 42, 455, 578, 467, 202]. By (2.29) and Proposition 2.7, we have
$$|f(x)|^2 = |\langle k(\cdot, x), f \rangle|^2 \leq k(x, x) \cdot \langle f, f \rangle. \tag{2.31}$$

Therefore, $\langle f, f \rangle = 0$ directly implies $f = 0$, which is the last property that required proof in order to establish that $\langle \cdot, \cdot \rangle$ is a dot product (cf. Section B.2).
The case of complex-valued kernels can be dealt with using the same construc-
tion; in that case, we will end up with a complex dot product space [42].
The above reasoning has shown that any positive definite kernel can be thought
of as a dot product in another space: in view of (2.21), the reproducing kernel
property (2.30) amounts to

$$\langle \Phi(x), \Phi(x') \rangle = k(x, x'). \tag{2.32}$$
Therefore, the dot product space $\mathcal{H}$ constructed in this way is one possible instantiation of the feature space associated with a kernel.
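The construction just described can be mirrored in a few lines of Python (an illustrative sketch with our own helper names, not code from the book): a function of the form (2.22) is stored as its coefficients and expansion points, the dot product is (2.24), and both the reproducing property (2.29) and the identity (2.32) can be verified numerically.

import numpy as np

def k(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

# A function f(.) = sum_i alpha_i k(., x_i) is stored as (alphas, points).
def evaluate(f, x):
    alphas, points = f
    return sum(a * k(xi, x) for a, xi in zip(alphas, points))

def dot(f, g):
    # <f, g> := sum_i sum_j alpha_i beta_j k(x_i, x'_j), cf. (2.24)
    alphas, xs = f
    betas, zs = g
    return sum(a * b * k(xi, zj) for a, xi in zip(alphas, xs)
                                 for b, zj in zip(betas, zs))

x, xp = np.array([0.0, 1.0]), np.array([2.0, -0.5])
Phi = lambda x0: ([1.0], [x0])            # Phi(x) = k(., x), cf. (2.21)
f = ([0.7, -1.2], [x, xp])                # an arbitrary element of the span
print(np.isclose(dot(Phi(x), f), evaluate(f, x)))     # reproducing property (2.29)
print(np.isclose(dot(Phi(x), Phi(xp)), k(x, xp)))     # identity (2.32)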
Above, we have started with the kernel, and constructed a feature map. Let us now consider the opposite direction. Whenever we have a mapping $\Phi$ from $\mathcal{X}$ into a dot product space, we obtain a positive definite kernel via $k(x, x') := \langle \Phi(x), \Phi(x') \rangle$. This can be seen by noting that for all $c_i \in \mathbb{R}$, $x_i \in \mathcal{X}$, $i = 1, \dots, m$, we have
$$\sum_{i,j} c_i c_j k(x_i, x_j) = \Big\langle \sum_i c_i \Phi(x_i), \sum_j c_j \Phi(x_j) \Big\rangle = \Big\| \sum_i c_i \Phi(x_i) \Big\|^2 \geq 0, \tag{2.33}$$
due to the nonnegativity of the norm.
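As a concrete instance of this direction (an illustrative check, not part of the original text), the map $\Phi_2$ of Figure 2.1 yields the homogeneous second-degree polynomial kernel: the canonical dot product of the mapped points equals $\langle x, x' \rangle^2$.

import numpy as np

def Phi2(x):
    # ([x]_1^2, [x]_2^2, sqrt(2) [x]_1 [x]_2), cf. Figure 2.1
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(np.isclose(np.dot(Phi2(x), Phi2(xp)), np.dot(x, xp) ** 2))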


This has two consequences. First, it allows us to give an equivalent definition of positive definite kernels as functions with the property that there exists a map $\Phi$ into a dot product space such that (2.32) holds true. Second, it allows us to construct kernels from feature maps. For instance, it is in this way that powerful linear representations of 3D heads proposed in computer graphics [575, 59] give rise to kernels. The identity (2.32) forms the basis for the kernel trick:

Remark 2.8 (“Kernel Trick”) Given an algorithm which is formulated in terms of a positive definite kernel $k$, one can construct an alternative algorithm by replacing $k$ by another positive definite kernel $\tilde{k}$.

In view of the material in the present section, the justification for this procedure is the following: effectively, the original algorithm can be thought of as a dot product based algorithm operating on vectorial data $\Phi(x_1), \dots, \Phi(x_m)$. The algorithm obtained by replacing $k$ by $\tilde{k}$ is then exactly the same dot product based algorithm, only that it operates on $\tilde{\Phi}(x_1), \dots, \tilde{\Phi}(x_m)$.
The best known application of the kernel trick is in the case where k is the dot
product in the input domain (cf. Problem 2.5). The trick is not limited to that case,
however: k and k̃ can both be nonlinear kernels. In general, care must be exercised
in determining whether the resulting algorithm will be useful: sometimes, an
algorithm will only work subject to additional conditions on the input data, e.g.,
the data set might have to lie in the positive orthant. We shall later see that certain
kernels induce feature maps which enforce such properties for the mapped data
(cf. (2.73)), and that there are algorithms which take advantage of these aspects
(e.g., in Chapter 8). In such cases, not every conceivable positive definite kernel will make sense.
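As an elementary illustration of the kernel trick (a Python sketch added here, not from the original text), consider squared distances between mapped points. Since $\|\Phi(x) - \Phi(x')\|^2 = k(x, x) - 2k(x, x') + k(x', x')$ involves only dot products, the same code works for any positive definite kernel $k$ or $\tilde{k}$ that is plugged in.

import numpy as np

def feature_distance(k, x, xp):
    # ||Phi(x) - Phi(x')||^2 expressed purely through kernel evaluations
    return k(x, x) - 2 * k(x, xp) + k(xp, xp)

linear   = lambda x, xp: np.dot(x, xp)                      # cf. Problem 2.5
gaussian = lambda x, xp: np.exp(-np.sum((x - xp) ** 2) / 2) # cf. (2.68)
poly3    = lambda x, xp: np.dot(x, xp) ** 3                 # cf. (2.67)

x, xp = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for kernel in (linear, gaussian, poly3):
    print(feature_distance(kernel, x, xp))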


Historical Even though the kernel trick had been used in the literature for a fair amount of
Remarks time [4, 62], it took until the mid 1990s before it was explicitly stated that any al-
gorithm that only depends on dot products, i.e., any algorithm that is rotationally
invariant, can be kernelized [479, 480]. Since then, a number of algorithms have
benefitted from the kernel trick, such as the ones described in the present book, as
well as methods for clustering in feature spaces [479, 215, 199].
Moreover, the machine learning community took time to comprehend that the
definition of kernels on general sets (rather than dot product spaces) greatly
extends the applicability of kernel methods [467], to data types such as texts and
other sequences [234, 585, 23]. Indeed, this is now recognized as a crucial feature
of kernels: they lead to an embedding of general data types in linear spaces.
Not surprisingly, the history of methods for representing kernels in linear spaces
(in other words, the mathematical counterpart of the kernel trick) dates back
significantly further than their use in machine learning. The methods appear to
have first been studied in the 1940s by Kolmogorov [304] for countable $\mathcal{X}$ and
Aronszajn [16] in the general case. Pioneering work on linear representations of
a related class of kernels, to be described in Section 2.4, was done by Schoenberg
[465]. Further bibliographical comments can be found in [42].
We thus see that the mathematical basis for kernel algorithms has been around
for a long time. As is often the case, however, the practical importance of mathe-
matical results was initially underestimated.2

2.2.3 Reproducing Kernel Hilbert Spaces

In the last section, we described how to define a space of functions which is a


valid realization of the feature spaces associated with a given kernel. To do this,
we had to make sure that the space is a vector space, and that it is endowed with
a dot product. Such spaces are referred to as dot product spaces (cf. Appendix B),
or equivalently as pre-Hilbert spaces. The reason for the latter is that one can turn
them into Hilbert spaces (cf. Section B.3) by a fairly simple mathematical trick. This
additional structure has some mathematical advantages. For instance, in Hilbert
spaces it is always possible to define projections. Indeed, Hilbert spaces are one of
the favorite concepts of functional analysis.
So let us again consider the pre-Hilbert space of functions (2.22), endowed with the dot product (2.24). To turn it into a Hilbert space (over $\mathbb{R}$), one completes it in the norm corresponding to the dot product, $\|f\| := \sqrt{\langle f, f \rangle}$. This is done by adding
the limit points of sequences that are convergent in that norm (see Appendix B).

2. This is illustrated by the following quotation from an excellent machine learning textbook published in the seventies (p. 174 in [152]): “The familiar functions of mathematical physics
are eigenfunctions of symmetric kernels, and their use is often suggested for the construction of po-
tential functions. However, these suggestions are more appealing for their mathematical beauty than
their practical usefulness.”

In view of the properties (2.29) and (2.30), this space is usually called a reproducing kernel Hilbert space (RKHS).
In general, an RKHS can be defined as follows.

Definition 2.9 (Reproducing Kernel Hilbert Space) Let $\mathcal{X}$ be a nonempty set (often called the index set) and $\mathcal{H}$ a Hilbert space of functions $f : \mathcal{X} \to \mathbb{R}$. Then $\mathcal{H}$ is called a reproducing kernel Hilbert space endowed with the dot product $\langle \cdot, \cdot \rangle$ (and the norm $\|f\| := \sqrt{\langle f, f \rangle}$) if there exists a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with the following properties.
1. $k$ has the reproducing property³
$$\langle f, k(x, \cdot) \rangle = f(x) \quad \text{for all } f \in \mathcal{H}; \tag{2.34}$$
in particular,
$$\langle k(x, \cdot), k(x', \cdot) \rangle = k(x, x'). \tag{2.35}$$
2. $k$ spans $\mathcal{H}$, i.e. $\mathcal{H} = \overline{\mathrm{span}\{ k(x, \cdot) \mid x \in \mathcal{X} \}}$, where $\overline{X}$ denotes the completion of the set $X$ (cf. Appendix B).

On a more abstract level, an RKHS can be defined as a Hilbert space of functions $f$ on $\mathcal{X}$ such that all evaluation functionals (the maps $f \mapsto f(x')$, where $x' \in \mathcal{X}$) are continuous. In that case, by the Riesz representation theorem (e.g., [429]), for each $x' \in \mathcal{X}$ there exists a unique function of $x$, called $k(x, x')$, such that
$$f(x') = \langle f, k(\cdot, x') \rangle. \tag{2.36}$$
It follows directly from (2.35) that $k(x, x')$ is symmetric in its arguments (see Problem 2.28) and satisfies the conditions for positive definiteness.


Note that the RKHS uniquely determines $k$. This can be shown by contradiction: assume that there exist two kernels, say $k$ and $k'$, spanning the same RKHS $\mathcal{H}$. From Problem 2.28 we know that both $k$ and $k'$ must be symmetric. Moreover, from (2.34) we conclude that
$$\langle k(x, \cdot), k'(x', \cdot) \rangle = k(x, x') = k'(x', x). \tag{2.37}$$
In the second equality we used the symmetry of the dot product. Finally, symmetry in the arguments of $k'$ yields $k(x, x') = k'(x, x')$, which proves our claim.

2.2.4 The Mercer Kernel Map

Section 2.2.2 has shown that any positive definite kernel can be represented as a
dot product in a linear space. This was done by explicitly constructing a (Hilbert)
space that does the job. The present section will construct another Hilbert space.

3. Note that this implies that each $f \in \mathcal{H}$ is actually a single function whose values at any $x \in \mathcal{X}$ are well-defined. In contrast, $L_2$ Hilbert spaces usually do not have this property. The elements of these spaces are equivalence classes of functions that disagree only on sets of measure 0; cf. footnote 15 in Section B.3.

One could argue that this is superfluous, given that any two separable Hilbert
spaces are isometrically isomorphic, in other words, it is possible to define a one-
to-one linear map between the spaces which preserves the dot product. However,
the tool that we shall presently use, Mercer’s theorem, has played a crucial role
in the understanding of SVMs, and it provides valuable insight into the geometry
of feature spaces, which more than justifies its detailed discussion. In the SVM
literature, the kernel trick is usually introduced via Mercer’s theorem.
We start by stating the version of Mercer's theorem given in [606]. We assume $(\mathcal{X}, \mu)$ to be a finite measure space.⁴ The term almost all (cf. Appendix B) means except for sets of measure zero. For the commonly used Lebesgue-Borel measure, countable sets of individual points are examples of zero measure sets. Note that the integral with respect to a measure is explained in Appendix B. Readers who do not want to go into mathematical detail may simply want to think of the $d\mu(x')$ as a $dx'$, and of $\mathcal{X}$ as a compact subset of $\mathbb{R}^N$. For further explanations of the terms involved in this theorem, cf. Appendix B, especially Section B.3.

Theorem 2.10 (Mercer [359, 307]) Suppose $k \in L_\infty(\mathcal{X}^2)$ is a symmetric real-valued function such that the integral operator (cf. (2.16))
$$T_k : L_2(\mathcal{X}) \to L_2(\mathcal{X}), \quad (T_k f)(x) := \int_{\mathcal{X}} k(x, x') f(x')\, d\mu(x') \tag{2.38}$$
is positive definite; that is, for all $f \in L_2(\mathcal{X})$, we have
$$\int_{\mathcal{X}^2} k(x, x') f(x) f(x')\, d\mu(x)\, d\mu(x') \geq 0. \tag{2.39}$$
Let $\psi_j \in L_2(\mathcal{X})$ be the normalized orthogonal eigenfunctions of $T_k$ associated with the eigenvalues $\lambda_j > 0$, sorted in non-increasing order. Then
1. $(\lambda_j)_j \in \ell_1$,
2. $k(x, x') = \sum_{j=1}^{N_{\mathcal{H}}} \lambda_j \psi_j(x) \psi_j(x')$ holds for almost all $(x, x')$. Either $N_{\mathcal{H}} \in \mathbb{N}$, or $N_{\mathcal{H}} = \infty$; in the latter case, the series converges absolutely and uniformly for almost all $(x, x')$.

For the converse of Theorem 2.10, see Problem 2.23. For a data-dependent approx-
imation and its relationship to kernel PCA (Section 1.7), see Problem 2.26.
From statement 2 it follows that $k(x, x')$ corresponds to a dot product in $\ell_2^{N_{\mathcal{H}}}$, since $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$ with
$$\Phi : \mathcal{X} \to \ell_2^{N_{\mathcal{H}}}, \quad x \mapsto (\sqrt{\lambda_j}\,\psi_j(x))_{j=1,\dots,N_{\mathcal{H}}}, \tag{2.40}$$
for almost all $x \in \mathcal{X}$. Note that we use the same $\Phi$ as in (2.21) to denote the feature map, although the target spaces are different. However, this distinction is not important for the present purposes; we are interested in the existence of some Hilbert space in which the kernel corresponds to the dot product, and not in what particular representation of it we are using.
In fact, it has been noted [467] that the uniform convergence of the series implies that given any $\epsilon > 0$, there exists an $n \in \mathbb{N}$ such that even if $N_{\mathcal{H}} = \infty$, $k$ can be approximated within accuracy $\epsilon$ as a dot product in $\mathbb{R}^n$: for almost all $x, x' \in \mathcal{X}$, $|k(x, x') - \langle \Phi_n(x), \Phi_n(x') \rangle| < \epsilon$, where $\Phi_n : x \mapsto (\sqrt{\lambda_1}\,\psi_1(x), \dots, \sqrt{\lambda_n}\,\psi_n(x))$. The feature space can thus always be thought of as finite-dimensional within some accuracy $\epsilon$. We summarize our findings in the following proposition.

4. A finite measure space is a set with a $\sigma$-algebra (Definition B.1) defined on it, and a measure (Definition B.2) defined on the latter, satisfying $\mu(\mathcal{X}) < \infty$ (so that, up to a scaling factor, $\mu$ is a probability measure).

Proposition 2.11 (Mercer Kernel Map) If $k$ is a kernel satisfying the conditions of Theorem 2.10, we can construct a mapping $\Phi$ into a space where $k$ acts as a dot product,
$$\langle \Phi(x), \Phi(x') \rangle = k(x, x'), \tag{2.41}$$
for almost all $x, x' \in \mathcal{X}$. Moreover, given any $\epsilon > 0$, there exists a map $\Phi_n$ into an $n$-dimensional dot product space (where $n \in \mathbb{N}$ depends on $\epsilon$) such that
$$|k(x, x') - \langle \Phi_n(x), \Phi_n(x') \rangle| < \epsilon \tag{2.42}$$
for almost all $x, x' \in \mathcal{X}$.
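In practice the eigenfunctions $\psi_j$ are rarely available in closed form. A data-based surrogate, sometimes called the Nyström method and closely related to the empirical kernel map of Section 2.2.6 and to Problem 2.26, estimates them from the eigenvectors of a Gram matrix on sample points. The Python sketch below is an illustration under the assumption that $\mu$ is the uniform measure on $[-1, 1]$, approximated by Monte Carlo sampling; the scalings by $1/n$ and $\sqrt{n}$ reflect this approximation.

import numpy as np

def k(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
n = 200
Z = rng.uniform(-1.0, 1.0, size=(n, 1))       # Monte-Carlo sample from mu
K = np.array([[k(zi, zj) for zj in Z] for zi in Z])
evals, evecs = np.linalg.eigh(K)              # ascending eigenvalues
evals, evecs = evals[::-1], evecs[:, ::-1]    # sort non-increasingly
lambdas = evals / n                           # approximate Mercer eigenvalues
psi = np.sqrt(n) * evecs                      # psi[i, j] ~ psi_j(z_i)
# Statement 2 of Theorem 2.10, truncated after the leading r terms (cf. (2.42)):
r = 10
K_approx = (psi[:, :r] * lambdas[:r]) @ psi[:, :r].T
print(np.max(np.abs(K - K_approx)))           # small when the lambdas decay rapidly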


Both Mercer kernels and positive definite kernels can thus be represented as dot
products in Hilbert spaces. The following proposition, showing a case where the
two types of kernels coincide, thus comes as no surprise.

Proposition 2.12 (Mercer Kernels are Positive Definite [359, 42]) Let $\mathcal{X} = [a, b]$ be a compact interval and let $k : [a, b] \times [a, b] \to \mathbb{R}$ be continuous. Then $k$ is a positive definite kernel if and only if
$$\int_a^b \int_a^b k(x, x') f(x) f(x')\, dx\, dx' \geq 0 \tag{2.43}$$
for each continuous function $f : \mathcal{X} \to \mathbb{R}$.


Note that the conditions in this proposition are actually more restrictive than
those of Theorem 2.10. Using the feature space representation (Proposition 2.11),
however, it is easy to see that Mercer kernels are also positive definite (for almost

all x x ) in the more general case of Theorem 2.10: given any c  m , we have


 2
 
∑ ci c j k(xi  x j )  ∑ ci c j
i j i j

Φ(xi ) Φ(x j )

  
∑ ci Φ(xi )
 i 
0  (2.44)

Being positive definite, Mercer kernels are thus also reproducing kernels.
We next show how the reproducing kernel map is related to the Mercer kernel
map constructed from the eigenfunction decomposition [202, 467]. To this end, let
us consider a kernel which satisfies the condition of Theorem 2.10, and construct

 
a dot product $\langle \cdot, \cdot \rangle$ such that $k$ becomes a reproducing kernel for the Hilbert space $\mathcal{H}$ containing the functions
$$f(x) = \sum_{i=1}^{\infty} \alpha_i k(x, x_i) = \sum_{i=1}^{\infty} \alpha_i \sum_{j=1}^{N_{\mathcal{H}}} \lambda_j \psi_j(x) \psi_j(x_i). \tag{2.45}$$
By linearity, which holds for any dot product, we have
$$\langle f, k(\cdot, x') \rangle = \sum_{i=1}^{\infty} \alpha_i \sum_{j,n=1}^{N_{\mathcal{H}}} \lambda_j \psi_j(x_i) \langle \psi_j, \psi_n \rangle \lambda_n \psi_n(x'). \tag{2.46}$$
Since $k$ is a Mercer kernel, the $\psi_i$ ($i = 1, \dots, N_{\mathcal{H}}$) can be chosen to be orthogonal with respect to the dot product in $L_2(\mathcal{X})$. Hence it is straightforward to choose $\langle \cdot, \cdot \rangle$ such that
$$\langle \psi_j, \psi_n \rangle = \frac{\delta_{jn}}{\lambda_j} \tag{2.47}$$
(using the Kronecker symbol $\delta_{jn}$, see (B.30)), in which case (2.46) reduces to the reproducing kernel property (2.36) (using (2.45)). For a coordinate representation in the RKHS, see Problem 2.29.
The above connection between the Mercer kernel map and the RKHS map is instructive, but we shall rarely make use of it. In fact, we will usually identify the different feature spaces. Thus, to avoid confusion in subsequent chapters, the following comments are necessary. As described above, there are different ways of constructing feature spaces for any given kernel. In fact, they can even differ in terms of their dimensionality (cf. Problem 2.22). The two feature spaces that we will mostly use in this book are the RKHS associated with $k$ (Section 2.2.2) and the Mercer $\ell_2^{N_{\mathcal{H}}}$ feature space. We will mostly use the same symbol $\mathcal{H}$ for all feature spaces that are associated with a given kernel. This makes sense provided that everything we do, at the end of the day, reduces to dot products. For instance, let us assume that $\Phi_1, \Phi_2$ are maps into the feature spaces $\mathcal{H}_1, \mathcal{H}_2$ respectively, both associated with the kernel $k$; in other words,
$$k(x, x') = \langle \Phi_i(x), \Phi_i(x') \rangle_{\mathcal{H}_i} \quad \text{for } i = 1, 2. \tag{2.48}$$
Then it will usually not be the case that $\Phi_1(x) = \Phi_2(x)$; due to (2.48), however, we always have $\langle \Phi_1(x), \Phi_1(x') \rangle_{\mathcal{H}_1} = \langle \Phi_2(x), \Phi_2(x') \rangle_{\mathcal{H}_2}$. Therefore, as long as we are only interested in dot products, the two spaces can be considered identical.
An example of this identity is the so-called large margin regularizer that is usually used in SVMs, as discussed in the introductory chapter (cf. also Chapters 4 and 7),
$$\langle w, w \rangle, \quad \text{where } w = \sum_{i=1}^{m} \alpha_i \Phi(x_i). \tag{2.49}$$
No matter whether $\Phi$ is the RKHS map $\Phi(x_i) = k(\cdot, x_i)$ of (2.21) or the Mercer map $\Phi(x_i) = (\sqrt{\lambda_j}\,\psi_j(x_i))_{j=1,\dots,N_{\mathcal{H}}}$ of (2.40), the value of $\|w\|^2$ will not change.
This point is of great importance, and we hope that all readers are still with us. It is fair to say, however, that Section 2.2.5 can be skipped at first reading.

2.2.5 The Shape of the Mapped Data in Feature Space

Using Mercer’s theorem, we have shown that one can think of the feature map
as a map into a high- or infinite-dimensional Hilbert space. The argument in the
remainder of the section shows that this typically entails that the mapped data
$\Phi(\mathcal{X})$ lie in some box with rapidly decaying side lengths [606]. By this we mean
that the range of the data decreases as the dimension index j increases, with a rate
that depends on the size of the eigenvalues.

Let us assume that for all $j \in \mathbb{N}$, we have $\sup_{x \in \mathcal{X}} \lambda_j \psi_j(x)^2 < \infty$. Define the sequence
$$l_j := \sup_{x \in \mathcal{X}} \lambda_j \psi_j(x)^2. \tag{2.50}$$
Note that if
$$C_k := \sup_j \sup_{x \in \mathcal{X}} |\psi_j(x)| \tag{2.51}$$
exists (see Problem 2.24), then we have $l_j \leq \lambda_j C_k^2$. However, if the $\lambda_j$ decay rapidly, then (2.50) can be finite even if (2.51) is not.
By construction, $\Phi(\mathcal{X})$ is contained in an axis-parallel parallelepiped in $\ell_2^{N_{\mathcal{H}}}$ with side lengths $2\sqrt{l_j}$ (cf. (2.40)).⁵
Consider an example of a common kernel, the Gaussian, and let $\mu$ (see Theorem 2.10) be the Lebesgue measure. In this case, the eigenvectors are sine and cosine functions (with supremum one), and thus the sequence of the $l_j$ coincides with the sequence of the eigenvalues $\lambda_j$. Generally, whenever $\sup_{x \in \mathcal{X}} |\psi_j(x)|^2$ is finite, the $l_j$ decay as fast as the $\lambda_j$. We shall see in Sections 4.4, 4.5 and Chapter 12 that for many common kernels, this decay is very rapid.
It will be useful to consider operators that map $\Phi(\mathcal{X})$ into balls of some radius $R$ centered at the origin. The following proposition characterizes a class of such operators, determined by the sequence $(l_j)_{j \in \mathbb{N}}$. Recall that $\mathbb{R}^{\infty}$ denotes the space of all real sequences.

Proposition 2.13 (Mapping $\Phi(\mathcal{X})$ into $\ell_2$) Let $S$ be the diagonal map
$$S : \mathbb{R}^{\infty} \to \mathbb{R}^{\infty}, \quad (x_j)_j \mapsto S(x_j)_j = (s_j x_j)_j, \tag{2.52}$$
where $(s_j)_j \in \mathbb{R}^{\infty}$. If $(s_j \sqrt{l_j})_j \in \ell_2$, then $S$ maps $\Phi(\mathcal{X})$ into a ball centered at the origin whose radius is $R = \|(s_j \sqrt{l_j})_j\|_{\ell_2}$.

5. In fact, it is sufficient to use the essential supremum in (2.50). In that case, subsequent
statements also only hold true almost everywhere.

Proof  Suppose $(s_j \sqrt{l_j})_j \in \ell_2$. Using the Mercer map (2.40), we have
$$\|S\Phi(x)\|^2 = \sum_j s_j^2 \lambda_j \psi_j(x)^2 \leq \sum_j s_j^2 l_j = R^2 \tag{2.53}$$
for any $x \in \mathcal{X}$. Hence $S\Phi(\mathcal{X}) \subset \ell_2$.

The converse is not necessarily the case. To see this, note that if $(s_j \sqrt{l_j})_j \notin \ell_2$, amounting to saying that
$$\sum_j s_j^2 \sup_{x \in \mathcal{X}} \lambda_j \psi_j(x)^2 \tag{2.54}$$
is not finite, then there need not always exist an $x \in \mathcal{X}$ such that $S\Phi(x) = (s_j \sqrt{\lambda_j}\,\psi_j(x))_j \notin \ell_2$, i.e., that
$$\sum_j s_j^2 \lambda_j \psi_j(x)^2 \tag{2.55}$$
is not finite.
To see how the freedom to rescale $\Phi(\mathcal{X})$ effectively restricts the class of functions we are using, we first note that everything in the feature space $\mathcal{H} = \ell_2^{N_{\mathcal{H}}}$ is done in terms of dot products. Therefore, we can compensate any invertible symmetric linear transformation of the data in $\mathcal{H}$ by the inverse transformation on the set of admissible weight vectors in $\mathcal{H}$. In other words, for any invertible symmetric operator $S$ on $\mathcal{H}$, we have $\langle S^{-1} w, S\Phi(x) \rangle = \langle w, \Phi(x) \rangle$ for all $x \in \mathcal{X}$.


As we shall see below (cf. Theorem 5.5, Section 12.4, and Problem 7.5), there exists a class of generalization error bounds that depend on the radius $R$ of the smallest sphere containing the data. If the $(l_i)_i$ decay rapidly, we are not actually "making use" of the whole sphere. In this case, we may construct a diagonal scaling operator $S$ which inflates the sides of the above parallelepiped as much as possible, while ensuring that it is still contained within a sphere of the original radius $R$ in $\mathcal{H}$ (Figure 2.3). By effectively reducing the size of the function class, this will provide a way of strengthening the bounds. A similar idea, using kernel PCA (Section 14.2) to determine empirical scaling coefficients, has been successfully applied by [101].
We conclude this section with another useful insight that characterizes a property of the feature map $\Phi$. Note that most of what was said so far applies to the case where the input domain $\mathcal{X}$ is a general set. In this case, it is not possible to make nontrivial statements about continuity properties of $\Phi$. This changes if we assume $\mathcal{X}$ to be endowed with a notion of closeness, by turning it into a so-called topological space. Readers not familiar with this concept will be reassured to hear that Euclidean vector spaces are particular cases of topological spaces.

Proposition 2.14 (Continuity of the Feature Map [402]) If $\mathcal{X}$ is a topological space and $k$ is a continuous positive definite kernel on $\mathcal{X} \times \mathcal{X}$, then there exists a Hilbert space $\mathcal{H}$ and a continuous map $\Phi : \mathcal{X} \to \mathcal{H}$ such that for all $x, x' \in \mathcal{X}$, we have $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$.


[Figure 2.3: two panels, each showing the feature space (with a mapped point Φ(x) inside a ball of radius R) and the corresponding weight vector w.]

Figure 2.3  Since everything is done in terms of dot products, scaling up the data by an operator $S$ can be compensated by scaling the weight vectors with $S^{-1}$ (cf. text). By choosing $S$ such that the data are still contained in a ball of the same radius $R$, we effectively reduce our function class (parametrized by the weight vector), which can lead to better generalization bounds, depending on the kernel inducing the map $\Phi$.

2.2.6 The Empirical Kernel Map

The map $\Phi$, defined in (2.21), transforms each input pattern into a function on $\mathcal{X}$, that is, into a potentially infinite-dimensional object. For any given set of points, however, it is possible to approximate $\Phi$ by only evaluating it on these points (cf. [232, 350, 361, 547, 474]):

Definition 2.15 (Empirical Kernel Map) For a given set $\{z_1, \dots, z_n\} \subseteq \mathcal{X}$, $n \in \mathbb{N}$, we call
$$\Phi_n : \mathcal{X} \to \mathbb{R}^n, \quad x \mapsto k(\cdot, x)\big|_{\{z_1, \dots, z_n\}} = (k(z_1, x), \dots, k(z_n, x)) \tag{2.56}$$
the empirical kernel map w.r.t. $\{z_1, \dots, z_n\}$.
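A direct transcription of (2.56) into Python might look as follows (an illustrative sketch; the function names are ours).

import numpy as np

def empirical_kernel_map(k, Z):
    # Phi_n(x) = (k(z_1, x), ..., k(z_n, x)), cf. (2.56)
    def Phi_n(x):
        return np.array([k(z, x) for z in Z])
    return Phi_n

gaussian = lambda x, xp: np.exp(-np.sum((x - xp) ** 2) / 2)
Z = [np.array([0.0]), np.array([1.0]), np.array([2.5])]
Phi3 = empirical_kernel_map(gaussian, Z)
print(Phi3(np.array([0.7])))   # a pattern mapped into R^3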


As an example, consider first the case where $k$ is a positive definite kernel, and $\{z_1, \dots, z_n\} = \{x_1, \dots, x_m\}$; we thus evaluate $k(\cdot, x)$ on the training patterns. If we carry out a linear algorithm in feature space, then everything will take place in the linear span of the mapped training patterns. Therefore, we can represent the $k(\cdot, x)$ of (2.21) as $\Phi_m(x)$ without losing information. The dot product to use in that representation, however, is not simply the canonical dot product in $\mathbb{R}^m$, since the $\Phi(x_i)$ will usually not form an orthonormal system. To turn $\Phi_m$ into a feature map associated with $k$, we need to endow $\mathbb{R}^m$ with a dot product $\langle \cdot, \cdot \rangle_m$ such that
$$k(x, x') = \langle \Phi_m(x), \Phi_m(x') \rangle_m. \tag{2.57}$$
To this end, we use the ansatz $\langle \cdot, \cdot \rangle_m = \langle \cdot, M \cdot \rangle$, with $M$ being a positive definite matrix.⁶ Enforcing (2.57) on the training patterns, this yields the self-consistency condition [478, 512]
$$K = KMK, \tag{2.58}$$

6. Every dot product in $\mathbb{R}^m$ can be written in this form. We do not require strict definiteness of $M$, as the null space can be projected out, leading to a lower-dimensional feature space.

where $K$ is the Gram matrix. The condition (2.58) can be satisfied for instance by the (pseudo-)inverse $M = K^{-1}$. Equivalently, we could have incorporated this rescaling operation, which corresponds to a Kernel PCA “whitening” ([478, 547, 474], cf. Section 11.4), directly into the map, by whitening (2.56) to get
$$\Phi^w_m : x \mapsto K^{-1/2} (k(x_1, x), \dots, k(x_m, x)). \tag{2.59}$$
This simply amounts to dividing the eigenvector basis vectors of $K$ by $\sqrt{\lambda_i}$, where the $\lambda_i$ are the eigenvalues of $K$.⁷ This parallels the rescaling of the eigenfunctions of the integral operator belonging to the kernel, given by (2.47). It turns out that this map can equivalently be performed using kernel PCA feature extraction (see Problem 14.8), which is why we refer to this map as the kernel PCA map.
Note that we have thus constructed a data-dependent feature map into an $m$-dimensional space which satisfies $\langle \Phi^w_m(x), \Phi^w_m(x') \rangle = k(x, x')$, i.e., we have found an $m$-dimensional feature space associated with the given kernel. In the case where $K$ is invertible, $\Phi^w_m(x)$ computes the coordinates of $\Phi(x)$ when represented in a basis of the $m$-dimensional subspace spanned by $\Phi(x_1), \dots, \Phi(x_m)$.
For data sets where the number of examples is smaller than their dimension, it can actually be computationally attractive to carry out $\Phi^w_m$ explicitly, rather than using kernels in subsequent algorithms. Moreover, algorithms which are not readily “kernelized” may benefit from explicitly carrying out the kernel PCA map.
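A minimal Python sketch of the whitened map (2.59) (illustrative, not from the book; a pseudo-inverse is used for $K^{-1/2}$ in the spirit of footnote 7): on the training patterns, the canonical dot product of the whitened images reproduces the Gram matrix.

import numpy as np

def kernel_pca_map(k, X):
    # Phi^w_m(x) = K^{-1/2} (k(x_1, x), ..., k(x_m, x)), cf. (2.59)
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    evals, U = np.linalg.eigh(K)
    inv_sqrt = np.array([1.0 / np.sqrt(e) if e > 1e-12 else 0.0 for e in evals])
    K_inv_sqrt = (U * inv_sqrt) @ U.T          # pseudo-inverse of K^{1/2}
    Phi_w = lambda x: K_inv_sqrt @ np.array([k(xi, x) for xi in X])
    return Phi_w, K

gaussian = lambda x, xp: np.exp(-np.sum((x - xp) ** 2) / 2)
X = [np.array([0.0, 0.0]), np.array([1.0, 0.2]), np.array([0.3, -1.0])]
Phi_w, K = kernel_pca_map(gaussian, X)
mapped = np.array([Phi_w(x) for x in X])
print(np.allclose(mapped @ mapped.T, K))       # <Phi_w(x_i), Phi_w(x_j)> = k(x_i, x_j)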
We end this section with two notes which illustrate why the use of (2.56) need
not be restricted to the special case we just discussed.
More general kernels. When using non-symmetric kernels $k$ in (2.56), together with the canonical dot product, we effectively work with the positive definite matrix $K^\top K$. Note that each positive definite matrix can be written as $K^\top K$. Therefore, working with positive definite kernels leads to an equally rich set of nonlinearities as working with an empirical kernel map using general non-symmetric kernels. If we wanted to carry out the whitening step, we would have to use $(K^\top K)^{-1/4}$ (cf. footnote 7 concerning potential singularities).


Different evaluation sets. Things can be sped up by using expansion sets of the form $\{z_1, \dots, z_n\}$, mapping into an $n$-dimensional space, with $n < m$, as done in [100, 228]. In that case, one modifies (2.59) to
$$\Phi^w_n : x \mapsto K_n^{-1/2} (k(z_1, x), \dots, k(z_n, x)), \tag{2.60}$$
where $(K_n)_{ij} := k(z_i, z_j)$. The expansion set can either be a subset of the training set,⁸ or some other set of points. We will later return to the issue of how to choose the best set (see Section 10.2 and Chapter 18). As an aside, note that in the case of Kernel PCA (see Section 1.7 and Chapter 14 below), one does not need to worry about the whitening step in (2.59) and (2.60): using the canonical dot product in $\mathbb{R}^m$ (rather than $\langle \cdot, \cdot \rangle_m$) will simply lead to diagonalizing $K^2$ instead of $K$, which yields the same eigenvectors with squared eigenvalues. This was pointed out by [350, 361]. The study [361] reports experiments where (2.56) was employed to speed up Kernel PCA by choosing $\{z_1, \dots, z_n\}$ as a subset of $\{x_1, \dots, x_m\}$.

7. It is understood that if $K$ is singular, we use the pseudo-inverse of $K^{1/2}$, in which case we get an even lower dimensional subspace.
8. In [228] it is recommended that the size $n$ of the expansion set is chosen large enough to ensure that the smallest eigenvalue of $K_n$ is larger than some predetermined $\epsilon > 0$. Alternatively, one can start off with a larger set, and use kernel PCA to select the most important components for the map, see Problem 14.8. In the kernel PCA case, the map (2.60) is computed as $D_n^{-1/2} U_n^\top (k(z_1, x), \dots, k(z_n, x))$, where $U_n D_n U_n^\top$ is the eigenvalue decomposition of $K_n$. Note that the columns of $U_n$ are the eigenvectors of $K_n$. We discard all columns that correspond to zero eigenvalues, as well as the corresponding dimensions of $D_n$. To approximate the map, we may actually discard all eigenvalues smaller than some $\epsilon > 0$.
2.2.7 A Kernel Map Defined from Pairwise Similarities

In practice, we are given a finite amount of data $x_1, \dots, x_m$. The following simple observation shows that even if we do not want to (or are unable to) analyze a given kernel $k$ analytically, we can still compute a map $\Phi$ such that $k$ corresponds to a dot product in the linear span of the $\Phi(x_i)$:

Proposition 2.16 (Data-Dependent Kernel Map [467]) Suppose the data $x_1, \dots, x_m$ and the kernel $k$ are such that the kernel Gram matrix $K_{ij} = k(x_i, x_j)$ is positive definite. Then it is possible to construct a map $\Phi$ into an $m$-dimensional feature space $\mathcal{H}$ such that
$$k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle. \tag{2.61}$$
Conversely, given an arbitrary map $\Phi$ into some feature space $\mathcal{H}$, the matrix $K_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle$ is positive definite.

Proof  First assume that $K$ is positive definite. In this case, it can be diagonalized as $K = SDS^\top$, with an orthogonal matrix $S$ and a diagonal matrix $D$ with nonnegative entries. Then
$$k(x_i, x_j) = (SDS^\top)_{ij} = \langle S_i, D S_j \rangle = \langle \sqrt{D}\, S_i, \sqrt{D}\, S_j \rangle, \tag{2.62}$$
where we have defined the $S_i$ as the rows of $S$ (note that the columns of $S$ would be $K$'s eigenvectors). Therefore, $K$ is the Gram matrix of the vectors $\sqrt{D}\, S_i$.⁹ Hence the following map $\Phi$, defined on $x_1, \dots, x_m$, will satisfy (2.61):
$$\Phi : x_i \mapsto \sqrt{D}\, S_i. \tag{2.63}$$
Thus far, $\Phi$ is only defined on a set of points, rather than on a vector space. Therefore, it makes no sense to ask whether it is linear. We can, however, ask whether it can be extended to a linear map, provided the $x_i$ are elements of a vector space. The answer is that if the $x_i$ are linearly dependent (which is often the case), then this will not be possible, since a linear map would then typically be over-determined by the $m$ conditions (2.63).


For the converse, assume an arbitrary $\alpha \in \mathbb{R}^m$, and compute
$$\sum_{i,j=1}^{m} \alpha_i \alpha_j K_{ij} = \Big\langle \sum_{i=1}^{m} \alpha_i \Phi(x_i), \sum_{j=1}^{m} \alpha_j \Phi(x_j) \Big\rangle = \Big\| \sum_{i=1}^{m} \alpha_i \Phi(x_i) \Big\|^2 \geq 0. \tag{2.64}$$

9. In fact, every positive definite matrix is the Gram matrix of some set of vectors [46].
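The constructive half of the proof translates directly into Python (an illustrative sketch, not code from the book): diagonalize the Gram matrix and map each $x_i$ to $\sqrt{D}\,S_i$ as in (2.63).

import numpy as np

def data_dependent_map(K):
    # Diagonalize K = S D S^T (columns of S are K's eigenvectors) and
    # return the mapped points: row i is Phi(x_i) = sqrt(D) S_i, cf. (2.62)-(2.63).
    evals, S = np.linalg.eigh(K)
    evals = np.clip(evals, 0.0, None)      # guard against tiny negative round-off
    return S * np.sqrt(evals)

gaussian = lambda x, xp: np.exp(-np.sum((x - xp) ** 2) / 2)
X = [np.array([0.0]), np.array([1.0]), np.array([-2.0])]
K = np.array([[gaussian(xi, xj) for xj in X] for xi in X])
Phi = data_dependent_map(K)
print(np.allclose(Phi @ Phi.T, K))         # reproduces (2.61)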

In particular, this result implies that given data $x_1, \dots, x_m$, and a kernel $k$ which gives rise to a positive definite matrix $K$, it is always possible to construct a feature space $\mathcal{H}$ of dimension at most $m$ that we are implicitly working in when using kernels (cf. Problem 2.32 and Section 2.2.6).
If we perform an algorithm which requires k to correspond to a dot product in
some other space (as for instance the SV algorithms described in this book), it is
possible that even though k is not positive definite in general, it still gives rise to
a positive definite Gram matrix K with respect to the training data at hand. In this
case, Proposition 2.16 tells us that nothing will go wrong during training when we
work with these data. Moreover, if k leads to a matrix with some small negative
eigenvalues, we can add a small multiple of some strictly positive definite kernel
$k'$ (such as the identity $k'(x_i, x_j) = \delta_{ij}$) to obtain a positive definite matrix. To see this, suppose that $\lambda_{\min} < 0$ is the minimal eigenvalue of $k$'s Gram matrix. Note that being strictly positive definite, the Gram matrix $K'$ of $k'$ satisfies
$$\min_{\|\alpha\|=1} \alpha^\top K' \alpha \geq \lambda'_{\min} > 0, \tag{2.65}$$
where $\lambda'_{\min}$ denotes its minimal eigenvalue, and the first inequality follows from Rayleigh's principle (B.57). Therefore, provided that $\lambda \geq -\lambda_{\min}/\lambda'_{\min} \geq 0$, we have
$$\alpha^\top (K + \lambda K') \alpha = \alpha^\top K \alpha + \lambda\, \alpha^\top K' \alpha \geq \|\alpha\|^2 (\lambda_{\min} + \lambda \lambda'_{\min}) \geq 0 \tag{2.66}$$
for all $\alpha \in \mathbb{R}^m$, rendering $(K + \lambda K')$ positive definite.
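A minimal sketch of this fix in Python (illustrative; the indefinite similarity measure chosen below is a hypothetical example): shift the Gram matrix by $\lambda$ times the identity, with $\lambda$ determined by the most negative eigenvalue as in (2.65) and (2.66).

import numpy as np

def make_positive_definite(K, margin=0.0):
    # Add lambda * K' with K' = I (i.e., k'(x_i, x_j) = delta_ij). Since
    # lambda'_min = 1 here, lambda >= -lambda_min suffices, cf. (2.65)-(2.66).
    lam_min = np.min(np.linalg.eigvalsh(K))
    lam = max(0.0, -lam_min + margin)
    return K + lam * np.eye(K.shape[0])

# A similarity measure that is generally not positive definite, used purely
# for illustration: the negative distance between scalars.
sim = lambda x, xp: -np.abs(x - xp)
X = np.array([0.0, 1.0, 3.0, 7.0])
K = np.array([[sim(xi, xj) for xj in X] for xi in X])
K_pd = make_positive_definite(K, margin=1e-10)
print(np.min(np.linalg.eigvalsh(K)), np.min(np.linalg.eigvalsh(K_pd)))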


2.3 Examples and Properties of Kernels

For the following examples, let us assume that $\mathcal{X} \subseteq \mathbb{R}^N$. Besides homogeneous polynomial kernels (cf. Proposition 2.1),
$$k(x, x') = \langle x, x' \rangle^d, \tag{2.67}$$
Boser, Guyon, and Vapnik [62, 223, 561] suggest the usage of Gaussian radial basis function kernels [26, 4],
$$k(x, x') = \exp\!\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right), \tag{2.68}$$
where $\sigma > 0$, and sigmoid kernels,
$$k(x, x') = \tanh(\kappa \langle x, x' \rangle + \vartheta), \tag{2.69}$$
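The kernels (2.67)-(2.69) are straightforward to implement; the following Python functions (an illustrative sketch, with arbitrary default parameter values) can be plugged into any of the snippets given earlier in this chapter.

import numpy as np

def polynomial(x, xp, d=2):
    # homogeneous polynomial kernel, cf. (2.67)
    return np.dot(x, xp) ** d

def gaussian(x, xp, sigma=1.0):
    # Gaussian radial basis function kernel, cf. (2.68)
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def sigmoid(x, xp, kappa=1.0, theta=-1.0):
    # sigmoid kernel, cf. (2.69); not positive definite for all parameter choices
    return np.tanh(kappa * np.dot(x, xp) + theta)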
