2.2 The Representation of Similarities in Linear Spaces
Figure 2.1 Toy example of a binary classification problem mapped into feature space. We
assume that the true decision boundary is an ellipse in input space (left panel). The task
of the learning process is to estimate this boundary based on empirical data consisting of
training points in both classes (crosses and circles, respectively). When mapped into feature
space via the nonlinear map Φ_2(x) = (z_1, z_2, z_3) = ([x]_1^2, [x]_2^2, √2 [x]_1 [x]_2) (right panel), the
ellipse becomes a hyperplane (in the present simple case, it is parallel to the z_3 axis, hence
all points are plotted in the (z_1, z_2) plane). This is due to the fact that ellipses can be written
as linear equations in the entries of (z_1, z_2, z_3). Therefore, in feature space, the problem
reduces to that of estimating a hyperplane from the mapped data points. Note that via the
polynomial kernel (see (2.12) and (2.13)), the dot product in the three-dimensional space
can be computed without computing Φ2 . Later in the book, we shall describe algorithms
for constructing hyperplanes which are based on dot products (Chapter 7).
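To make the caption concrete, here is a short Python sketch (with arbitrarily chosen sample points) verifying that the dot product of the images under Φ_2 coincides with the squared dot product in input space, so that the feature map never needs to be computed explicitly:

```python
# A minimal numerical check of the map Phi_2 from Figure 2.1: it turns the
# degree-2 polynomial kernel into an ordinary dot product in R^3.
import numpy as np

def phi2(x):
    """Map R^2 -> R^3: ([x]_1^2, [x]_2^2, sqrt(2)*[x]_1*[x]_2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def poly2_kernel(x, xp):
    """Degree-2 homogeneous polynomial kernel <x, x'>^2 (cf. (2.12), (2.13))."""
    return np.dot(x, xp) ** 2

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)

# The dot product in the three-dimensional feature space ...
lhs = np.dot(phi2(x), phi2(xp))
# ... equals the kernel evaluated in the two-dimensional input space.
rhs = poly2_kernel(x, xp)
assert np.isclose(lhs, rhs)
```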
In what follows, we will look at things the other way round, and start with the
kernel rather than with the feature map. Given some kernel, can we construct a
feature space such that the kernel computes the dot product in that feature space;
that is, such that (2.2) holds? This question has been brought to the attention
of the machine learning community in a variety of contexts, especially during
recent years [4, 152, 62, 561, 480]. In functional analysis, the same problem has
been studied under the heading of Hilbert space representations of kernels. A good
monograph on the theory of kernels is the book of Berg, Christensen, and Ressel
[42]; indeed, a large part of the material in the present chapter is based on this
work. We do not aim to be fully rigorous; instead, we try to provide insight into
the basic ideas. As a rule, all the results that we state without proof can be found
in [42]. Other standard references include [16, 455].
There is one more aspect in which this section differs from the previous one:
the latter dealt with vectorial data, and the domain 𝒳 was assumed to be a subset
of ℝ^N. By contrast, the results in the current section hold for data drawn from
domains 𝒳 which need no structure other than their being nonempty sets. This
generalizes kernel learning algorithms to a large number of situations where a
vectorial representation is not readily available, and where one directly works
with pairwise similarities between objects (cf. Section 2.2.7).
We start with some basic definitions and results. As in the previous chapter, indices
i and j are understood to run over 1, …, m.
An m × m matrix K over ℂ satisfying ∑_{i,j=1}^{m} c_i \overline{c_j} K_{ij} ≥ 0 (2.15) for all c_i ∈ ℂ is called positive definite.1 Similarly, a real symmetric m × m matrix K
satisfying (2.15) for all c_i ∈ ℝ is called positive definite.
Note that a symmetric matrix is positive definite if and only if all its eigenvalues
are nonnegative (Problem 2.4). The left hand side of (2.15) is often referred to as
the quadratic form induced by K.
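As a sketch of how these two equivalent criteria can be checked numerically (the Gaussian kernel and the sample points below are our own choices, not prescribed by the text):

```python
# Checking positive definiteness of a Gram matrix in the sense used here,
# i.e. nonnegativity of the quadratic form (2.15), which for a symmetric
# matrix is equivalent to nonnegative eigenvalues.
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                       # arbitrary points x_1, ..., x_m
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])

# Eigenvalue test: all eigenvalues of the symmetric matrix K are >= 0
# (up to numerical round-off).
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)

# Quadratic form test: sum_ij c_i c_j K_ij >= 0 for random coefficients c.
for _ in range(100):
    c = rng.normal(size=len(X))
    assert c @ K @ c >= -1e-10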
Remark 2.6 (Terminology) The term kernel stems from the first use of this type of
function in the field of integral operators as studied by Hilbert and others [243, 359, 112].
A function k which gives rise to an operator T_k via
(T_k f)(x) = ∫_𝒳 k(x, x') f(x') dx' (2.16)
is called the kernel of T_k.
Simply using the term positive kernel, on the other hand, could be mistaken as referring
to a kernel whose values are positive. Finally, the term positive semidefinite kernel
becomes rather cumbersome if it is to be used throughout a book. Therefore, we follow
the convention used for instance in [42], and employ the term positive definite both for
kernels and matrices in the way introduced above. The case where the value 0 is only
attained if all coefficients are 0 will be referred to as strictly positive definite.
We shall mostly use the term kernel. Whenever we want to refer to a kernel k(x, x')
which is not positive definite in the sense stated above, it will be clear from the context.
The definitions for positive definite kernels and positive definite matrices differ in
the fact that in the former case, we are free to choose the points on which the kernel
is evaluated — for every choice, the kernel induces a positive definite matrix.
Positive definiteness implies positivity on the diagonal (Problem 2.12),
k(x, x) ≥ 0 for all x ∈ 𝒳, (2.17)
and symmetry (Problem 2.13),
k(x_i, x_j) = \overline{k(x_j, x_i)}. (2.18)
To also cover the complex-valued case, our definition of symmetry includes com-
plex conjugation. The definition of symmetry of matrices is analogous; that is,
Ki j K ji .
Real-Valued Kernels   For real-valued kernels it is not sufficient to stipulate that (2.15) hold for real coefficients c_i. To get away with real coefficients only, we must additionally require that the kernel be symmetric (Problem 2.14); k(x_i, x_j) = k(x_j, x_i) (cf. Problem 2.13).
It can be shown that whenever k is a (complex-valued) positive definite kernel,
its real part is a (real-valued) positive definite kernel. Below, we shall largely be
dealing with real-valued kernels. Most of the results, however, also apply for
complex-valued kernels.
Kernels can be regarded as generalized dot products. Indeed, any dot product
is a kernel (Problem 2.5); however, linearity in the arguments, which is a standard
property of dot products, does not carry over to general kernels. However, another
property of dot products, the Cauchy-Schwarz inequality, does have a natural
generalization to kernels:

Proposition 2.7 (Cauchy-Schwarz Inequality for Kernels) If k is a positive definite kernel, and x_1, x_2 ∈ 𝒳, then
|k(x_1, x_2)|² ≤ k(x_1, x_1) · k(x_2, x_2). (2.19)
Proof  For the sake of brevity, we give a non-elementary proof using some basic facts
of linear algebra. The 2 × 2 Gram matrix with entries K_{ij} = k(x_i, x_j) (i, j ∈ {1, 2})
is positive definite. Hence both its eigenvalues are nonnegative, and so is their
product, the determinant of K. Therefore
0 ≤ K_{11} K_{22} − K_{12} K_{21} = K_{11} K_{22} − K_{12} \overline{K_{12}} = K_{11} K_{22} − |K_{12}|². (2.20)
Substituting k(x_i, x_j) for K_{ij}, we get the desired inequality.
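A quick numerical illustration of this argument, using a polynomial kernel and randomly drawn points of our own choosing, checks the resulting inequality |k(x_1, x_2)|² ≤ k(x_1, x_1) k(x_2, x_2) via the 2 × 2 Gram determinant:

```python
# The kernel Cauchy-Schwarz inequality via the 2x2 Gram determinant (2.20).
import numpy as np

def poly_kernel(x, xp, d=3):
    return np.dot(x, xp) ** d

rng = np.random.default_rng(2)
for _ in range(100):
    x1, x2 = rng.normal(size=5), rng.normal(size=5)
    K = np.array([[poly_kernel(x1, x1), poly_kernel(x1, x2)],
                  [poly_kernel(x2, x1), poly_kernel(x2, x2)]])
    # det K = K11*K22 - |K12|^2 >= 0, i.e. |k(x1,x2)|^2 <= k(x1,x1) k(x2,x2)
    assert K[0, 1] ** 2 <= K[0, 0] * K[1, 1] * (1 + 1e-12)
```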
We begin by constructing a dot product space containing the images of the input
patterns under Φ. To this end, we first need to define a vector space. This is done
Vector Space   by taking linear combinations of the form
f(·) = ∑_{i=1}^{m} α_i k(·, x_i). (2.22)
Here, m ∈ ℕ, the α_i ∈ ℝ and the x_1, …, x_m ∈ 𝒳 are arbitrary. Next, we define a dot product between f and another function of this form, g(·) = ∑_{j=1}^{m'} β_j k(·, x'_j) (where m' ∈ ℕ, β_j ∈ ℝ and x'_1, …, x'_{m'} ∈ 𝒳), as
⟨f, g⟩ := ∑_{i=1}^{m} ∑_{j=1}^{m'} α_i β_j k(x_i, x'_j). (2.24)
This expression explicitly contains the expansion coefficients, which need not be
unique. To see that it is nevertheless well-defined, note that
⟨f, g⟩ = ∑_{j=1}^{m'} β_j f(x'_j), (2.25)
using k(x'_j, x_i) = k(x_i, x'_j). The sum in (2.25), however, does not depend on the
particular expansion of f. Similarly, for g, note that
⟨f, g⟩ = ∑_{i=1}^{m} α_i g(x_i). (2.26)
This also shows that ⟨·,·⟩ is bilinear and symmetric. Moreover, since k is positive definite, we have, for any function f of the form (2.22),
⟨f, f⟩ = ∑_{i,j=1}^{m} α_i α_j k(x_i, x_j) ≥ 0. (2.27)
The latter implies that ⟨·,·⟩ is actually itself a positive definite kernel, defined
on our space of functions. To see this, note that given functions f_1, …, f_n, and
coefficients γ_1, …, γ_n ∈ ℝ, we have
∑_{i,j=1}^{n} γ_i γ_j ⟨f_i, f_j⟩ = ⟨∑_{i=1}^{n} γ_i f_i, ∑_{j=1}^{n} γ_j f_j⟩ ≥ 0. (2.28)
Here, the left hand equality follows from the bilinearity of ⟨·,·⟩, and the right hand
inequality from (2.27). For the last step in proving that it qualifies as a dot product,
we will use the following interesting property of Φ, which follows directly from
the definition: for all functions of the form (2.22), we have
⟨k(·, x), f⟩ = f(x); (2.29)
in particular,
⟨k(·, x), k(·, x')⟩ = k(x, x'). (2.30)
Reproducing Kernel   By virtue of these properties, positive definite kernels k are also called reproducing kernels [16, 42, 455, 578, 467, 202]. By (2.29) and Proposition 2.7, we have
|f(x)|² = |⟨k(·, x), f⟩|² ≤ k(x, x) · ⟨f, f⟩. (2.31)
Therefore, ⟨f, f⟩ = 0 directly implies f = 0, which is the last property that required
proof in order to establish that ⟨·,·⟩ is a dot product (cf. Section B.2).
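The construction can be mimicked numerically. The following sketch (Gaussian kernel, random expansion points and coefficients chosen for illustration) evaluates the dot product (2.24) and checks the identities (2.25)–(2.27):

```python
# Functions f = sum_i alpha_i k(., x_i), g = sum_j beta_j k(., x'_j), and the
# dot product (2.24) between them, with the consistency checks (2.25)-(2.27).
import numpy as np

def k(x, xp, sigma=1.0):
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 2))          # expansion points x_1, ..., x_m of f
Xp = rng.normal(size=(4, 2))         # expansion points x'_1, ..., x'_m' of g
alpha = rng.normal(size=5)           # coefficients of f
beta = rng.normal(size=4)            # coefficients of g

def f(x):
    return sum(a * k(xi, x) for a, xi in zip(alpha, X))

def g(x):
    return sum(b * k(xj, x) for b, xj in zip(beta, Xp))

# <f, g> = sum_ij alpha_i beta_j k(x_i, x'_j)  (2.24)
K_cross = np.array([[k(xi, xj) for xj in Xp] for xi in X])
fg = alpha @ K_cross @ beta

# (2.25) and (2.26): the value does not depend on which expansion we use.
assert np.isclose(fg, sum(b * f(xj) for b, xj in zip(beta, Xp)))
assert np.isclose(fg, sum(a * g(xi) for a, xi in zip(alpha, X)))

# (2.27): <f, f> >= 0, as required for a dot product.
K_ff = np.array([[k(xi, xj) for xj in X] for xi in X])
assert alpha @ K_ff @ alpha >= -1e-12
```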
The case of complex-valued kernels can be dealt with using the same construc-
tion; in that case, we will end up with a complex dot product space [42].
The above reasoning has shown that any positive definite kernel can be thought
of as a dot product in another space: in view of (2.21), the reproducing kernel
property (2.30) amounts to
⟨Φ(x), Φ(x')⟩ = k(x, x').
This suggests a procedure that is sometimes referred to as the kernel trick: given an algorithm which is formulated in terms of dot products, one can construct an alternative algorithm by replacing the kernel k with another positive definite kernel k̃.
In view of the material in the present section, the justification for this procedure is
the following: effectively, the original algorithm can be thought of as a dot prod-
uct based algorithm operating on vectorial data Φ(x_1), …, Φ(x_m). The algorithm
obtained by replacing k by k̃ then is exactly the same dot product based algorithm,
only that it operates on Φ̃(x_1), …, Φ̃(x_m).
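As a minimal sketch of this substitution (the distance computation below is our own illustrative choice, not an algorithm from the text): any quantity that is expressed through dot products of mapped points, such as the feature-space distance, can be evaluated with kernel calls alone, for the linear kernel and for a nonlinear k̃ alike.

```python
# ||Phi(x) - Phi(x')||^2 = <Phi(x),Phi(x)> - 2<Phi(x),Phi(x')> + <Phi(x'),Phi(x')>
# can be evaluated with kernel calls alone, for any positive definite kernel k.
import numpy as np

def feature_space_distance_sq(x, xp, k):
    return k(x, x) - 2 * k(x, xp) + k(xp, xp)

linear = lambda x, xp: np.dot(x, xp)          # k = canonical dot product
poly = lambda x, xp: np.dot(x, xp) ** 2       # nonlinear kernel k~

rng = np.random.default_rng(4)
x, xp = rng.normal(size=3), rng.normal(size=3)

# With the linear kernel, this is just the squared Euclidean distance ...
assert np.isclose(feature_space_distance_sq(x, xp, linear),
                  np.linalg.norm(x - xp) ** 2)
# ... replacing k by the polynomial kernel gives the distance of the
# (implicitly) mapped points, without ever computing the feature map.
print(feature_space_distance_sq(x, xp, poly))
```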
The best known application of the kernel trick is in the case where k is the dot
product in the input domain (cf. Problem 2.5). The trick is not limited to that case,
however: k and k̃ can both be nonlinear kernels. In general, care must be exercised
in determining whether the resulting algorithm will be useful: sometimes, an
algorithm will only work subject to additional conditions on the input data, e.g.,
the data set might have to lie in the positive orthant. We shall later see that certain
kernels induce feature maps which enforce such properties for the mapped data
(cf. (2.73)), and that there are algorithms which take advantage of these aspects
(e.g., in Chapter 8). In such cases, not every conceivable positive definite kernel will be a suitable choice.
2. This is illustrated by the following quotation from an excellent machine learning textbook published in the seventies (p. 174 in [152]): “The familiar functions of mathematical physics
are eigenfunctions of symmetric kernels, and their use is often suggested for the construction of po-
tential functions. However, these suggestions are more appealing for their mathematical beauty than
their practical usefulness.”
RKHS In view of the properties (2.29) and (2.30), this space is usually called a reproducing
kernel Hilbert space (RKHS).
In general, an RKHS can be defined as follows.
Definition 2.9 (Reproducing Kernel Hilbert Space) Let 𝒳 be a nonempty set (often
called the index set) and ℋ a Hilbert space of functions f : 𝒳 → ℝ. Then ℋ is called
a reproducing kernel Hilbert space endowed with the dot product ⟨·,·⟩ (and the norm
‖f‖ := √⟨f, f⟩) if there exists a function k : 𝒳 × 𝒳 → ℝ with the following properties.
Reproducing Property   1. k has the reproducing property3
⟨f, k(x, ·)⟩ = f(x) for all f ∈ ℋ; (2.34)
in particular,
⟨k(x, ·), k(x', ·)⟩ = k(x, x'). (2.35)
2. k spans ℋ, i.e., ℋ is the closure of span{k(x, ·) | x ∈ 𝒳}. (2.36)
It follows directly from (2.35) that k(x, x') is symmetric in its arguments (see Problem 2.28).
Uniqueness of k   Suppose k' is another function with the reproducing property for ℋ. From Problem 2.28 we know that both k and k' must be symmetric. Moreover,
k(x, x') = ⟨k(x, ·), k'(x', ·)⟩ = k'(x', x). (2.37)
In the second equality we used the symmetry of the dot product. Finally, symmetry in the arguments of k' yields k(x, x') = k'(x, x'), which proves our claim.
Section 2.2.2 has shown that any positive definite kernel can be represented as a
dot product in a linear space. This was done by explicitly constructing a (Hilbert)
space that does the job. The present section will construct another Hilbert space.
One could argue that this is superfluous, given that any two separable Hilbert
spaces are isometrically isomorphic, in other words, it is possible to define a one-
to-one linear map between the spaces which preserves the dot product. However,
the tool that we shall presently use, Mercer’s theorem, has played a crucial role
in the understanding of SVMs, and it provides valuable insight into the geometry
of feature spaces, which more than justifies its detailed discussion. In the SVM
literature, the kernel trick is usually introduced via Mercer’s theorem.
Mercer's Theorem   We start by stating the version of Mercer's theorem given in [606]. We assume
(𝒳, μ) to be a finite measure space.4 The term almost all (cf. Appendix B) means
except for sets of measure zero. For the commonly used Lebesgue-Borel measure,
countable sets of individual points are examples of zero measure sets. Note that
the integral with respect to a measure is explained in Appendix B. Readers who
do not want to go into mathematical detail may simply want to think of the dμ(x) below as dx.
Theorem 2.10 (Mercer) Suppose k ∈ L_∞(𝒳²) is a symmetric real-valued function such that the integral operator T_k of (2.16), with the integral taken with respect to μ, is positive definite; that is, ∫_𝒳 ∫_𝒳 k(x, x') f(x) f(x') dμ(x) dμ(x') ≥ 0 for all f ∈ L_2(𝒳).
Let ψ_j ∈ L_2(𝒳) be the normalized orthogonal eigenfunctions of T_k associated with the
eigenvalues λ_j > 0, sorted in non-increasing order. Then
1. (λ_j)_j ∈ ℓ_1,
2. k(x, x') = ∑_{j=1}^{N_ℋ} λ_j ψ_j(x) ψ_j(x') holds for almost all (x, x'). Either N_ℋ ∈ ℕ, or N_ℋ = ∞;
in the latter case, the series converges absolutely and uniformly for almost all (x, x').
For the converse of Theorem 2.10, see Problem 2.23. For a data-dependent approx-
imation and its relationship to kernel PCA (Section 1.7), see Problem 2.26.
From statement 2 it follows that k(x, x') corresponds to a dot product in ℓ_2^{N_ℋ},
since k(x, x') = ⟨Φ(x), Φ(x')⟩ with
Φ : 𝒳 → ℓ_2^{N_ℋ},
x ↦ (√λ_j ψ_j(x))_{j=1,…,N_ℋ}, (2.40)
for almost all x ∈ 𝒳.
4. A finite measure space is a set 𝒳 with a σ-algebra (Definition B.1) defined on it, and a
measure μ (Definition B.2) defined on the latter, satisfying μ(𝒳) < ∞ (so that, up to a scaling
factor, μ is a probability measure).
Note that we use the same Φ as in (2.21) to denote the feature map, although the target spaces are different. However, this distinction is not
important for the present purposes — we are interested in the existence of some
Hilbert space in which the kernel corresponds to the dot product, and not in what
particular representation of it we are using.
In fact, it has been noted [467] that the uniform convergence of the series implies
that given any ε > 0, there exists an n ∈ ℕ such that even if N_ℋ = ∞, k can be
approximated within accuracy ε as a dot product in ℝ^n: for almost all x, x' ∈ 𝒳,
|k(x, x') − ⟨Φ_n(x), Φ_n(x')⟩| < ε, where Φ_n : x ↦ (√λ_1 ψ_1(x), …, √λ_n ψ_n(x)). The
feature space can thus always be thought of as finite-dimensional within some
accuracy ε. We summarize our findings in the following proposition.
Proposition 2.11 If k is a kernel satisfying the conditions of Theorem 2.10, then for every ε > 0 there exists a map Φ_n into an n-dimensional dot product space such that
|k(x, x') − ⟨Φ_n(x), Φ_n(x')⟩| ≤ ε (2.42)
for almost all x, x' ∈ 𝒳.
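In the spirit of the data-dependent approximation mentioned above (cf. Problem 2.26), a rough numerical analogue of this truncation can be obtained from the eigenvalue decomposition of a Gram matrix on a sample; the kernel, sample, and truncation level below are our own choices:

```python
# Truncating the eigenvalue decomposition of a Gram matrix to its n leading
# terms gives a finite-dimensional feature map whose dot product closely
# approximates the kernel on the sample.
import numpy as np

def k(x, xp, sigma=1.0):
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(200, 1))
K = np.array([[k(a, b) for b in X] for a in X])

lam, U = np.linalg.eigh(K)                 # eigenvalues in increasing order
lam, U = lam[::-1], U[:, ::-1]             # sort in non-increasing order

n = 10                                     # keep the n leading terms
Phi_n = U[:, :n] * np.sqrt(np.maximum(lam[:n], 0))   # rows: Phi_n(x_i)

# The truncated expansion already reproduces the Gram matrix closely.
print(np.abs(Phi_n @ Phi_n.T - K).max())
```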
Both Mercer kernels and positive definite kernels can thus be represented as dot
products in Hilbert spaces. The following proposition, showing a case where the
two types of kernels coincide, thus comes as no surprise.
Proposition 2.12 (Mercer Kernels are Positive Definite [359, 42]) Let 𝒳 = [a, b] be
a compact interval and let k : [a, b] × [a, b] → ℂ be continuous. Then k is a positive definite
kernel if and only if
∫_a^b ∫_a^b k(x, x') f(x) \overline{f(x')} dx dx' ≥ 0 (2.43)
for each continuous function f : 𝒳 → ℂ.
Being positive definite, Mercer kernels are thus also reproducing kernels.
We next show how the reproducing kernel map is related to the Mercer kernel
map constructed from the eigenfunction decomposition [202, 467]. To this end, let
us consider a kernel which satisfies the condition of Theorem 2.10, and construct
a dot product ⟨·,·⟩ such that k becomes a reproducing kernel for the Hilbert space
containing the functions
f(x) = ∑_i α_i k(x_i, x) = ∑_i α_i ∑_{j=1}^{N_ℋ} λ_j ψ_j(x_i) ψ_j(x). (2.44)
If the dot product is chosen such that ⟨√λ_j ψ_j, √λ_n ψ_n⟩ = δ_{jn}, then the reproducing property holds:
⟨f, k(·, x')⟩ = ∑_i α_i ∑_{j=1}^{N_ℋ} λ_j ψ_j(x_i) ψ_j(x') = f(x'). (2.46)
More generally, suppose Φ_1 and Φ_2 are two maps of 𝒳 into dot product spaces ℋ_1 and ℋ_2, respectively, both satisfying
⟨Φ_i(x), Φ_i(x')⟩_i = k(x, x') for i = 1, 2. (2.48)
Then it will usually not be the case that Φ_1(x) = Φ_2(x); due to (2.48), however,
we always have ⟨Φ_1(x), Φ_1(x')⟩_1 = ⟨Φ_2(x), Φ_2(x')⟩_2. Therefore, as long as we are
only interested in dot products, the two spaces can be considered identical.
An example of this identity is the so-called large margin regularizer that is
usually used in SVMs, as discussed in the introductory chapter (cf. also Chapters
4 and 7),
⟨w, w⟩, where w = ∑_{i=1}^{m} α_i Φ(x_i). (2.49)
No matter whether Φ is the RKHS map Φ(x_i) = k(·, x_i) of (2.21) or the Mercer map
Φ(x_i) = (√λ_j ψ_j(x_i))_{j=1,…,N_ℋ} of (2.40), the value of ‖w‖² will not change.
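A small numerical check of this invariance (using the explicit map Φ_2 of Figure 2.1 and random coefficients of our own choosing): the value ⟨w, w⟩ = ∑_{i,j} α_i α_j k(x_i, x_j) = α^⊤Kα can be computed from the kernel alone and agrees with the explicit feature-space computation.

```python
# <w, w> for w = sum_i a_i Phi(x_i) depends only on the kernel values,
# not on which feature map realizes them.
import numpy as np

def phi2(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def k(x, xp):
    return np.dot(x, xp) ** 2

rng = np.random.default_rng(6)
X = rng.normal(size=(8, 2))
a = rng.normal(size=8)

w = sum(ai * phi2(xi) for ai, xi in zip(a, X))      # explicit feature map
K = np.array([[k(xi, xj) for xj in X] for xi in X])

assert np.isclose(np.dot(w, w), a @ K @ a)
```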
This point is of great importance, and we hope that all readers are still with us.
It is fair to say, however, that Section 2.2.5 can be skipped at first reading.
Using Mercer’s theorem, we have shown that one can think of the feature map
as a map into a high- or infinite-dimensional Hilbert space. The argument in the
remainder of the section shows that this typically entails that the mapped data
Φ(𝒳) lie in some box with rapidly decaying side lengths [606]. By this we mean
that the range of the data decreases as the dimension index j increases, with a rate
that depends on the size of the eigenvalues.
Let us assume that for all j ∈ ℕ, we have sup_{x ∈ 𝒳} λ_j ψ_j(x)² < ∞. Define the
sequence
l_j := sup_{x ∈ 𝒳} λ_j ψ_j(x)². (2.50)
Note that if
C_k := sup_j sup_{x ∈ 𝒳} |ψ_j(x)| (2.51)
exists (see Problem 2.24), then we have l_j ≤ λ_j C_k². However, if the λ_j decay rapidly,
then (2.50) can be finite even if (2.51) is not.
By construction, Φ(𝒳) is contained in an axis-parallel parallelepiped in ℓ_2^{N_ℋ} with
side lengths 2√l_j (cf. (2.40)).5
Consider an example of a common kernel, the Gaussian, and let μ (see Theorem 2.10) be the Lebesgue measure. In this case, the eigenfunctions are sine and
cosine functions (with supremum one), and thus the sequence of the l_j coincides
with the sequence of the eigenvalues λ_j. Generally, whenever (2.51) is finite, the l_j decay as fast as the λ_j. We shall see in Sections 4.4, 4.5 and Chapter 12
that for many common kernels, this decay is very rapid.
It will be useful to consider operators that map Φ(𝒳) into balls of some radius
R centered at the origin. A class of such operators is determined by the sequence
(l_j)_j. Given a sequence (s_j)_j of positive numbers, consider the operator S defined by
S : (x_j)_j ↦ S(x_j)_j := (s_j x_j)_j. (2.52)
5. In fact, it is sufficient to use the essential supremum in (2.50). In that case, subsequent
statements also only hold true almost everywhere.
If the scaling factors satisfy ∑_j s_j² l_j ≤ R², then for almost all x ∈ 𝒳,
‖SΦ(x)‖² = ∑_j s_j² λ_j ψ_j(x)² ≤ ∑_j s_j² l_j ≤ R². (2.53)
The converse is not necessarily the case: the second inequality in (2.53) replaces each
λ_j ψ_j(x)² by its supremum l_j, so the left hand side can remain bounded by R² even
when ∑_j s_j² l_j is not finite.
To see how the freedom to rescale Φ(𝒳) effectively restricts the class of functions
we are using, we first note that everything in the feature space ℓ_2^{N_ℋ} is done
in terms of dot products. Therefore, we can compensate any invertible symmetric
linear transformation of the data in feature space by the inverse transformation on the set
of admissible weight vectors. In other words, for any invertible symmetric
operator S on ℓ_2^{N_ℋ}, we have ⟨S^{-1} w, SΦ(x)⟩ = ⟨w, Φ(x)⟩ for all x ∈ 𝒳.
As we shall see below (cf. Theorem 5.5, Section 12.4, and Problem 7.5), there
exists a class of generalization error bounds that depend on the radius R of the
smallest sphere containing the data. If the (l_j)_j decay rapidly, we are not actually
“making use” of the whole sphere. In this case, we may construct a diagonal
scaling operator S which inflates the sides of the above parallelepiped as much
as possible, while ensuring that it is still contained within a sphere of the original
radius R (Figure 2.3). By effectively reducing the size of the function class, this
will provide a way of strengthening the bounds. A similar idea, using kernel PCA
(Section 14.2) to determine empirical scaling coefficients, has been successfully
applied by [101].
We conclude this section with another useful insight that characterizes a prop-
erty of the feature map Φ. Note that most of what was said so far applies to the
case where the input domain 𝒳 is a general set. In this case, it is not possible to
make nontrivial statements about continuity properties of Φ. This changes if we
assume 𝒳 to be endowed with a notion of closeness, by turning it into a so-called
topological space. Readers not familiar with this concept will be reassured to hear
Continuity of Φ that Euclidean vector spaces are particular cases of topological spaces.
Figure 2.3 Since everything is done in terms of dot products, scaling up the data by
an operator S can be compensated by scaling the weight vectors with S^{-1} (cf. text). By
choosing S such that the data are still contained in a ball of the same radius R, we effectively
reduce our function class (parametrized by the weight vector), which can lead to better
generalization bounds, depending on the kernel inducing the map Φ.
The map Φ, defined in (2.21), transforms each input pattern into a function on ,
that is, into a potentially infinite-dimensional object. For any given set of points,
however, it is possible to approximate Φ by only evaluating it on these points (cf.
[232, 350, 361, 547, 474]):
matrix.6 Enforcing (2.57) on the training patterns, this yields the self-consistency
condition [478, 512]
K = KMK, (2.58)
6. Every dot product in ℝ^m can be written in this form. We do not require strict definiteness
of M, as the null space can be projected out, leading to a lower-dimensional feature space.
where K is the Gram matrix. The condition (2.58) can be satisfied for instance
by the (pseudo-)inverse M = K^{-1}. Equivalently, we could have incorporated this
whitening step directly into the map, by using
Φ_m^w : x ↦ K^{-1/2} (k(x_1, x), …, k(x_m, x)), (2.59)
where K^{-1/2} is computed using the eigenvalue decomposition of K; the λ_i are the eigenvalues of K.7 This parallels the rescaling of the eigenfunctions
of the integral operator belonging to the kernel, given by (2.47). It turns out that
this map can equivalently be performed using kernel PCA feature extraction (see
Problem 14.8), which is why we refer to this map as the kernel PCA map.
Note that we have thus constructed a data-dependent feature map into an m-
dimensional space which satisfies ⟨Φ_m^w(x), Φ_m^w(x')⟩ = k(x, x'); i.e., we have found an
m-dimensional feature space associated with the given kernel. In the case where K
is invertible, Φ_m^w(x) computes the coordinates of Φ(x) when represented in a basis
of the m-dimensional subspace spanned by Φ(x_1), …, Φ(x_m).
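A sketch of the map (2.59), assuming K is invertible (kernel and data below are our own choices); on the training points, the dot products of the whitened map reproduce the Gram matrix:

```python
# Phi_m^w(x) = K^{-1/2} (k(x_1, x), ..., k(x_m, x)); on the training points its
# dot products reproduce the Gram matrix K.
import numpy as np

def k(x, xp, sigma=1.0):
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(8)
X = rng.normal(size=(10, 2))
K = np.array([[k(a, b) for b in X] for a in X])

# K^{-1/2} via the eigenvalue decomposition K = U D U^T.
lam, U = np.linalg.eigh(K)
K_inv_sqrt = U @ np.diag(1.0 / np.sqrt(lam)) @ U.T

def phi_w(x):
    return K_inv_sqrt @ np.array([k(xi, x) for xi in X])

Phi = np.array([phi_w(xi) for xi in X])                # rows: Phi_m^w(x_i)
assert np.allclose(Phi @ Phi.T, K, atol=1e-6)          # <Phi_m^w(x_i), Phi_m^w(x_j)> = K_ij
```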
For data sets where the number of examples is smaller than their dimension,
it can actually be computationally attractive to carry out Φ_m^w explicitly, rather
than using kernels in subsequent algorithms. Moreover, algorithms which are not
readily “kernelized” may benefit from explicitly carrying out the kernel PCA map.
We end this section with two notes which illustrate why the use of (2.56) need
not be restricted to the special case we just discussed.
More general kernels. When using non-symmetric kernels k in (2.56), together with
the canonical dot product, we effectively work with the positive definite matrix
K^⊤K. Note that each positive definite matrix can be written as K^⊤K. Therefore,
working with positive definite kernels leads to an equally rich set of nonlinearities
as working with an empirical kernel map using general non-symmetric kernels.
If we wanted to carry out the whitening step, we would have to use (K^⊤K)^{-1/4} (cf. (2.59)).
Different expansion sets. Instead of evaluating the kernel on the training examples themselves, we may use an arbitrary expansion set {z_1, …, z_n} ⊂ 𝒳 and the map
Φ_n^w : x ↦ K_n^{-1/2} (k(z_1, x), …, k(z_n, x)), (2.60)
where (K_n)_{ij} := k(z_i, z_j). The expansion set can either be a subset of the training
set,8 or some other set of points. We will later return to the issue of how to choose
the best set (see Section 10.2 and Chapter 18). As an aside, note that in the case of
Kernel PCA (see Section 1.7 and Chapter 14 below), one does not need to worry
about the whitening step in (2.59) and (2.60): using the canonical dot product in
ℝ^m (rather than the whitened one) will simply lead to diagonalizing K² instead of K, which
yields the same eigenvectors with squared eigenvalues. This was pointed out by
[350, 361]. The study [361] reports experiments where (2.56) was employed to
speed up Kernel PCA by choosing z_1, …, z_n as a subset of x_1, …, x_m.
2.2.7 A Kernel Map Defined from Pairwise Similarities
Proof  First assume that K is positive definite. In this case, it can be diagonalized
as K = SDS^⊤, with an orthogonal matrix S and a diagonal matrix D with nonnegative entries. Then
K_{ij} = (SDS^⊤)_{ij} = ⟨D^{1/2}S_i, D^{1/2}S_j⟩, (2.62)
where we have defined the S_i as the rows of S (note that the columns of S would be
K's eigenvectors). Therefore, K is the Gram matrix of the vectors D^{1/2}S_i.9 Hence
the following map Φ, defined on x_1, …, x_m, will satisfy (2.61):
Φ : x_i ↦ D^{1/2}S_i. (2.63)
Thus far, Φ is only defined on a set of points, rather than on a vector space.
Therefore, it makes no sense to ask whether it is linear. We can, however, ask
whether it can be extended to a linear map, provided the x i are elements of a vector
space. The answer is that if the xi are linearly dependent (which is often the case),
then this will not be possible, since a linear map would then typically be overdetermined.
8. The map (2.60) can then be computed as D_n^{-1/2} U_n^⊤ (k(z_1, x), …, k(z_n, x)), where U_n D_n U_n^⊤ is the eigenvalue decomposition of
K_n. Note that the columns of U_n are the eigenvectors of K_n. We discard all columns that correspond to zero eigenvalues, as well as the corresponding dimensions of D_n. To approximate
the map, we may actually discard all eigenvalues smaller than some ε > 0.
9. In fact, every positive definite matrix is the Gram matrix of some set of vectors [46].
In particular, this result implies that given data x_1, …, x_m, and a kernel k which
gives rise to a positive definite matrix K, it is always possible to construct a feature
space of dimension at most m that we are implicitly working in when using
kernels (cf. Problem 2.32 and Section 2.2.6).
If we perform an algorithm which requires k to correspond to a dot product in
some other space (as for instance the SV algorithms described in this book), it is
possible that even though k is not positive definite in general, it still gives rise to
a positive definite Gram matrix K with respect to the training data at hand. In this
case, Proposition 2.16 tells us that nothing will go wrong during training when we
work with these data. Moreover, if k leads to a matrix with some small negative
eigenvalues, we can add a small multiple of some strictly positive definite kernel
k' (such as the identity k'(x_i, x_j) = δ_{ij}) to obtain a positive definite matrix. To see
this, suppose that λ_min < 0 is the minimal eigenvalue of k's Gram matrix. Note that,
being strictly positive definite, the Gram matrix K' of k' satisfies
min_{‖α‖ = 1} ⟨α, K'α⟩ ≥ λ'_min > 0, (2.65)
where λ'_min denotes its minimal eigenvalue, and the first inequality follows from the
variational (Rayleigh) characterization of the smallest eigenvalue. Hence, adding εk' to k
with ε ≥ |λ_min| / λ'_min yields a positive definite Gram matrix on the given data.
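A toy illustration of this regularization argument with a synthetic symmetric similarity matrix of our own construction (using k'(x_i, x_j) = δ_ij, whose Gram matrix is the identity):

```python
# Adding eps times the identity, with eps >= |lambda_min|, removes negative
# eigenvalues of a symmetric similarity matrix.
import numpy as np

rng = np.random.default_rng(9)
A = rng.normal(size=(10, 10))
K = (A + A.T) / 2                       # symmetric, but in general indefinite
lam_min = np.linalg.eigvalsh(K).min()

if lam_min < 0:
    eps = -lam_min                      # any eps >= |lambda_min| works
    K_fixed = K + eps * np.eye(10)
    assert np.linalg.eigvalsh(K_fixed).min() >= -1e-10
```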
Polynomial   A simple example of a kernel defined on a dot product space is the homogeneous polynomial kernel of degree d ∈ ℕ,
k(x, x') = ⟨x, x'⟩^d. (2.67)
Gaussian   Boser, Guyon, and Vapnik [62, 223, 561] suggest the usage of Gaussian radial basis function kernels [26, 4],
k(x, x') = exp(−‖x − x'‖² / (2σ²)), (2.68)
and sigmoid kernels,
k(x, x') = tanh(κ⟨x, x'⟩ + ϑ). (2.69)