
Discussiones Mathematicae

Probability and Statistics 30 (2010) 179–201

SOME METHODS OF CONSTRUCTING KERNELS


IN STATISTICAL LEARNING
Tomasz Górecki
Faculty of Mathematics and Computer Science
Adam Mickiewicz University
Umultowska 87, 61–614 Poznań, Poland
e-mail: [email protected]

and
Maciej Luczak
Department of Civil and Environmental Engineering
Koszalin University of Technology
Śniadeckich 2, 75–453 Koszalin, Poland
e-mail: [email protected]

Abstract
This paper is a collection of numerous methods and results concerning the design of kernel functions. It gives a short overview of methods of building kernels in metric spaces, especially R^n and S^n. We also present some new theory. The introduction of kernels was motivated by the search for non-linear patterns using linear functions in a feature space created by a non-linear feature map.
Keywords: positive definite kernel, dot product kernel, statistical kernel, SVM, kPCA.
2010 Mathematics Subject Classification: 62H30, 68T10.

1. Introduction

The mid-1990s yielded a major advance in machine learning: the support vector machine (SVM). The fundamental idea behind this method is

especially far-reaching. SVM utilizes positive definite kernels. What does this mean? Most machine learning methods are very well developed in the linear case. In practice, however, we have real data and we need non-linear methods to detect the kind of dependencies that enable us to predict successfully the properties of interest. The kernel corresponds to a dot product in a feature space (usually high-dimensional, even infinite-dimensional for the Gaussian kernel). In this space our methods are linear, but as long as we can formulate everything in terms of kernel evaluations, we never have to compute explicitly in the high-dimensional feature space. If we can show that a linear algorithm depends on the data matrix X only through

K ≡ XX^T,

then it can easily be "kernelized". In general, this procedure is known as the kernel trick. The kernel trick transforms any algorithm that depends solely on dot products: wherever a dot product is used, it is replaced with the kernel function. Thus a linear algorithm can easily be transformed into a non-linear algorithm. The algorithms capable of operating with kernels include (apart from SVM) Fisher's linear discriminant analysis (LDA), principal component analysis (PCA), canonical correlation analysis (CCA), ridge regression, spectral clustering, and many others. A full review of kernel methods is given in (Hofmann, Schölkopf and Smola, 2008).
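As an illustration of this kernelization step (our sketch, not part of the original paper), the code below contrasts the linear Gram matrix K = XX^T with a Gaussian kernel matrix computed on the same data; any algorithm formulated purely in terms of such a matrix can be kernelized by swapping one for the other. The synthetic data and the bandwidth value are arbitrary assumptions.

import numpy as np

def linear_gram(X):
    # K = X X^T: entries are ordinary dot products <x_i, x_j>.
    return X @ X.T

def gaussian_gram(X, sigma=1.0):
    # Entries are k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / (2 * sigma**2))

X = np.random.default_rng(0).normal(size=(5, 3))   # 5 points in R^3
K_lin = linear_gram(X)
K_rbf = gaussian_gram(X, sigma=2.0)
# Both matrices are symmetric and positive semidefinite (up to rounding),
# so either can play the role of K in a "kernelized" algorithm.
print(np.linalg.eigvalsh(K_lin).min(), np.linalg.eigvalsh(K_rbf).min())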
In this article we focus our attention on constructing kernels. Modularity is an important advantage of kernel methods: to solve a different problem, we can use a different kernel function. Hence it is useful to have as many kernel functions as possible, because we never know in advance which kernel will be the best (to be effective in practice we should obviously use the correct kernel function with the right parameters (Zu, 2008)).
In our paper we first present the main ideas behind kernel methods in machine learning (Section 2). We review especially SVM and kernel PCA as members of the big family of "kernelized" methods. Then we describe basic kernels and the main methods of constructing kernels (Section 3). In Section 4 we show how kernels are constructed using superpositions of kernels with other functions, namely functions with "good" Taylor or Legendre polynomial series expansions. We pay particular attention to sigmoidal kernels (Corollary 2, Examples 4–7). In the same section we propose superpositions of kernels with maps from R^n to the unit sphere (Corollaries 3–6). Section 5 concerns a special case of the method proposed in Section 4.

We construct the inverse of the stereographic projection. This enables us to construct new kernels as superpositions of this projection with kernels on the sphere. In Section 6 we generalize the results of Section 4 to the case of multivariable functions.

2. Kernels in machine learning

Suppose we are given empirical data (x1, y1), . . . , (xn, yn) ∈ X × {±1}. Here the domain X is some nonempty set from which the patterns xi are taken; the yi are called class labels. In order to study the problem of learning, we need additional structure. In learning, we want to generalize to unseen data points. In the case of pattern recognition this means that, given some new pattern x ∈ X, we want to predict the corresponding y ∈ {±1}. Although most kernel methods can handle multi-class classification problems as well, we limit ourselves mostly to the two-class classification problem. We choose y such that (x, y) is in some way similar to the training examples. So we need a similarity measure on X, that is, a function

k : X × X → R

which, given two examples x1 and x2, returns a real number characterizing their similarity. The function k is called a kernel.
A type of similarity measure of particular mathematical appeal is the dot product. In order to use a dot product as a similarity measure, we first need to embed the patterns into some dot product space F, which may not be identical to R^n. To this end, we use a map

Φ : X → F.

The space F is called a feature space. Embedding the data into F has some benefits:

• It lets us define a similarity measure from the dot product in F:

k(x1, x2) = ⟨Φ(x1), Φ(x2)⟩

(see the short numerical sketch after this list).

• It allows us to deal with the patterns geometrically, and thus lets us study learning algorithms using linear algebra and analytic geometry. Geometrically, the dot product computes the cosine of the angle between two vectors, provided they are normalized to length 1. Moreover, it allows computation of the length of a vector, and of the distance between two vectors as the length of their difference.

• The ability to choose the mapping Φ enables us to design a large variety of learning algorithms.
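For instance (a small illustration of ours, not from the paper), on X = R^2 the feature map Φ(x) = (x1^2, x2^2, √2 x1 x2) realizes the kernel k(x, z) = ⟨x, z⟩^2, so the similarity can be computed without ever constructing Φ explicitly:

import numpy as np

def phi(x):
    # Explicit feature map for the homogeneous degree-2 polynomial kernel on R^2.
    return np.array([x[0]**2, x[1]**2, np.sqrt(2.0) * x[0] * x[1]])

def k_poly2(x, z):
    # The same similarity computed without constructing Phi at all.
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))   # 1.0
print(k_poly2(x, z))            # 1.0 as well: <Phi(x), Phi(z)> = <x, z>^2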

Now we give two basic "kernelization" examples of well-known linear methods.
2.1. SVM
In the case of support vector machines, a data point is viewed as a p-dimensional vector. We want to know whether we can separate such points with a (p − 1)-dimensional hyperplane; this is called a linear classifier. There are many hyperplanes that might classify the data. However, we are additionally interested in whether we can achieve maximum separation (margin) between the two classes. By this we mean that we choose the hyperplane such that the distance from the hyperplane to the nearest data point is maximized; in other words, the distance between the two parallel supporting hyperplanes described below is maximized. If such a hyperplane exists, it is clearly of interest and is known as the maximum-margin hyperplane, and the corresponding linear classifier is known as a maximum margin classifier. The samples on the margin are called the support vectors (the maximum margin hyperplane, and hence the classification task, is a function of the support vectors only). To calculate the margin, two parallel hyperplanes are constructed, one on each side of the separating hyperplane, which are "pushed up against" the two data sets.

Figure 1. (a) H1 does not separate the two classes; H2 does, but with a small margin; H3 separates them with the maximum margin. (b) A correct map (kernel) turns a non-linear classifier into a linear one in a higher-dimensional space.

If two classes are perfectly separable, then there exist infinitely many separating hyperplanes. The SVM method is based on the idea that the best hyperplane for separating the two classes is the one that is farthest away from the training points. Intuitively, a good separation is achieved by the hyperplane with the largest distance to the neighboring data points of both classes; in general, the larger the margin, the better the generalization error of the classifier (Figure 1a).
Cortes and Vapnik (1995) suggested a modified maximum margin idea (the soft margin) allowing for mislabeled examples (in general we cannot assume that the two classes are always perfectly separable). If there exists no hyperplane splitting the examples, the method chooses a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples. The method introduces slack variables ξi ≥ 0, which measure the degree of misclassification of the datum xi. The objective function is then augmented by a term that penalizes non-zero ξi, and the optimization becomes a trade-off between a large margin and a small error penalty.
However, one cannot expect a linear classifier to succeed in general situations, no matter how optimal the hyperplane is. Boser, Guyon and Vapnik (1992) suggested a way to create non-linear classifiers by applying the kernel trick to maximum-margin hyperplanes. The resulting algorithm is formally similar, except that every dot product is replaced by a non-linear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in the transformed feature space. The transformation may be non-linear and the transformed space may be high-dimensional. Although the classifier is a hyperplane in the high-dimensional feature space, it may be non-linear in the original input space (Figure 1b).
SVMs belong to the family of generalized linear classifiers. They can also be considered a special case of Tikhonov regularization (the most commonly used method of regularizing ill-posed problems; in statistics the method is also known as ridge regression – Tarantola, 2005).
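A minimal sketch of the kernelized soft-margin SVM described above (ours, using the scikit-learn library, which the paper does not mention); the synthetic data, the choice of an RBF kernel and the penalty parameter C are arbitrary assumptions.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two classes that are not linearly separable: points inside vs. outside a circle.
X = rng.normal(size=(200, 2))
y = (np.sum(X**2, axis=1) > 1.0).astype(int)

# Soft-margin SVM with a Gaussian (RBF) kernel; C controls the margin/error trade-off.
clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.n_support_)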

2.2. Kernel PCA


Principal Component Analysis (PCA) is a vector space transformation often
used to reduce multidimensional data sets to lower dimensions for analysis.
PCA is an orthogonal linear transformation of the coordinate system in which we describe our data, such that the greatest variance of any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. Unfortunately, we have to assume that the observed data are linear combinations of a certain basis. Kernel principal component analysis (kPCA) is an extension of PCA using the techniques of kernel methods (without assuming linearity). Using a kernel, the originally linear operations of PCA are carried out in a reproducing kernel Hilbert space with a non-linear mapping (Figure 2). It is a successful example of "kernelizing" a well-known linear algorithm. Schölkopf et al. (1998) showed that finding and projecting onto principal components depends only on inner products, so the kernel trick can be used (a minimal sketch of this computation is given after Figure 2). There are several important points to note about the behavior of the kPCA components, which should be contrasted with the behavior of classic PCA:

• The maximum number of components is determined not by the dimensionality of the input data, but by the number of input data points.

• Not all sets of values of the components correspond to an actual point in input space.

Figure 2. (a) Input points before kernel PCA. (b) Output after kernel PCA. The three groups are distinguishable using the first component only.
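A compact sketch (ours, not the authors') of the kPCA computation outlined above: form the Gram matrix, center it in feature space, and project onto its leading eigenvectors. The Gaussian kernel and its width are arbitrary assumptions.

import numpy as np

def kernel_pca(X, kernel, n_components=2):
    n = X.shape[0]
    K = np.array([[kernel(x, z) for z in X] for x in X])
    # Center the kernel matrix, i.e. center the data in feature space.
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one
    # Eigendecomposition; eigenvalues are returned in ascending order, so reverse.
    vals, vecs = np.linalg.eigh(Kc)
    vals, vecs = vals[::-1], vecs[:, ::-1]
    # Scale eigenvectors so feature-space components have unit norm, then project:
    # the scores of the training points are Kc @ alphas.
    alphas = vecs[:, :n_components] / np.sqrt(np.maximum(vals[:n_components], 1e-12))
    return Kc @ alphas

rbf = lambda x, z, s=1.0: np.exp(-np.sum((x - z) ** 2) / (2 * s**2))
X = np.random.default_rng(1).normal(size=(30, 5))
scores = kernel_pca(X, rbf, n_components=2)   # 30 x 2 matrix of kPCA scores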

3. Basic kernels

Following (Schölkopf and Smola, 2002), we introduce the basic definitions.

Definition 1. Given a function k : X × X → R and x1, . . . , xn ∈ X, the n × n matrix K with elements

Kij = k(xi, xj)

is called the Gram matrix (or kernel matrix) of k with respect to x1, . . . , xn ∈ X.

Definition 2. A real symmetric n × n matrix K satisfying

Σ_{i,j} ci cj Kij ≥ 0

for all ci, cj ∈ R is called positive definite*.

(* Some authors call this nonnegative definite.)

Definition 3. Let X be a nonempty set. A function k on X × X which for all n ∈ N and all x1, . . . , xn ∈ X gives rise to a positive definite Gram matrix is called a positive definite kernel or, for short, a kernel.

The key idea of the kernel technique is to invert the chain of arguments, i.e., to choose a kernel k rather than a mapping before applying a learning algorithm. Not every symmetric function can be a kernel; a necessary and sufficient condition is given by Mercer's theorem.

Theorem 1 (Mercer's theorem). Suppose k ∈ L∞(X × X) is a symmetric function such that the integral operator Tk : L2(X) → L2(X) given by

(Tk f)(·) = ∫_X k(·, x) f(x) dx

is positive definite, that is,

∫_X ∫_X k(x1, x2) f(x1) f(x2) dx1 dx2 ≥ 0

for all f ∈ L2(X). Such a kernel we call a Mercer kernel.
Let ψi ∈ L2(X) be the eigenfunction of Tk associated with the eigenvalue λi ≥ 0 and normalized such that ||ψi||2 = ∫_X ψi²(x) dx = 1, i.e.,

∫_X k(x1, x2) ψi(x2) dx2 = λi ψi(x1)   for all x1 ∈ X.

Then

1. (λi)_{i∈N} ∈ ℓ¹,

2. ψi ∈ L∞(X),

3. k can be expanded in a uniformly convergent series, i.e.,

k(x1, x2) = Σ_{i=1}^∞ λi ψi(x1) ψi(x2)

holds for all x1, x2 ∈ X.

Proposition 1 (Herbrich, 2002). The function k : X × X → R is a Mercer kernel if and only if it is a kernel in the sense of Definition 3 (for almost all x).

If we have a positive definite kernel as in Definition 3 which, however, is not in L∞(X × X), then in order to use Mercer's theorem we can take any compact subset of X containing all the observations.

Example 1. A few simple examples of functions that are kernels or not.

• Kernels: ⟨x1, x2⟩, exp(−‖x1 − x2‖²), exp(⟨x1, x2⟩);

• Not kernels: ‖x1 − x2‖², −‖x1 − x2‖², −⟨x1, x2⟩, exp(‖x1 − x2‖²), exp(−⟨x1, x2⟩).

Example 2. Let us show that k(x1, x2) = ⟨x1, x2⟩, x1, x2 ∈ X ⊂ R^n, is a kernel in the sense of Definition 3 and Theorem 1.

For xi = (xi^1, . . . , xi^n) ∈ X, ci ∈ R, i = 1, . . . , m, we have

Σ_{i,j} ci cj ⟨xi, xj⟩ = Σ_{i,j} ci cj Σ_k xi^k xj^k = Σ_k Σ_{i,j} (ci xi^k)(cj xj^k)
= Σ_k (Σ_i ci xi^k)(Σ_j cj xj^k) = Σ_k (Σ_i ci xi^k)² ≥ 0.

For f ∈ L2(X) we have

∫_X ∫_X ⟨x1, x2⟩ f(x1) f(x2) dx1 dx2 = ∫_X ∫_X (Σ_i x1^i x2^i) f(x1) f(x2) dx1 dx2
= Σ_i ∫_X ∫_X x1^i f(x1) x2^i f(x2) dx1 dx2 = Σ_i (∫_X x1^i f(x1) dx1)(∫_X x2^i f(x2) dx2)
= Σ_i (∫_X x^i f(x) dx)² ≥ 0.

Example 3. We can show that the function k(x1, x2) = ‖x1 − x2‖² is not a kernel. Let m = 2, x1, x2 ∈ X, c1, c2 ∈ R. Then

Σ_{i,j∈{1,2}} ci cj ‖xi − xj‖² = 2 c1 c2 ‖x1 − x2‖² < 0

for x1 ≠ x2 and c1 c2 < 0.
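These examples can also be checked numerically (our sketch, under the assumption of randomly drawn points): the Gram matrix of the Gaussian kernel from Example 1 has no negative eigenvalues, while the Gram matrix of ‖x1 − x2‖² from Example 3 does.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_gauss = np.exp(-sq_dists)   # a kernel (Example 1)
K_sqdist = sq_dists           # not a kernel (Example 3)

# The smallest eigenvalue of a Gram matrix must be >= 0 for a positive definite kernel.
print(np.linalg.eigvalsh(K_gauss).min())   # nonnegative up to rounding
print(np.linalg.eigvalsh(K_sqdist).min())  # clearly negative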

4. Functions of kernels

The following facts show that we can create new kernels from existing kernels using a number of simple operations. In this way we can build complex kernels from simpler ones.
Theorem 2 (Herbrich, 2002). Let ki : X × X → R be any kernels. Then the functions k : X × X → R given by

1. k(x1, x2) = k1(x1, x2) + k2(x1, x2),

2. k(x1, x2) = c k1(x1, x2) for all c ∈ R+,

3. k(x1, x2) = k1(x1, x2) k2(x1, x2),

4. k(x1, x2) = f(x1) f(x2) for any function f : X → R,

5. k(x1, x2) = x1′ B x2 for any symmetric positive definite matrix B,

6. k(x1, x2) = lim_{i→∞} ki(x1, x2), if the limit exists,

are also kernels.

The combination of kernels given in part (3) is often referred to as the Schur product. We can easily decompose any kernel into the Schur product of its normalisation and the one-dimensional kernel of part (4) with f(x) = √k(x, x).

Corollary 1 (Herbrich, 2002). Let k1 : X × X → R be a kernel. Then the functions k : X × X → R given by

1. k(x1, x2) = (k1(x1, x2) + θ1)^θ2 for all θ1 ∈ R+ and θ2 ∈ N,

2. k(x1, x2) = exp( k1(x1, x2) / σ² ) for all σ ∈ R+,

3. k(x1, x2) = exp( −( k1(x1, x1) − 2 k1(x1, x2) + k1(x2, x2) ) / (2σ²) ) for all σ ∈ R+,

4. k(x1, x2) = k1(x1, x2) / √( k1(x1, x1) k1(x2, x2) ),

are also kernels.
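An illustrative sketch (ours) of the constructions in Corollary 1, starting from the linear kernel k1(x1, x2) = ⟨x1, x2⟩; the parameter values are arbitrary assumptions.

import numpy as np

def k1(x, z):
    return np.dot(x, z)                      # base kernel: ordinary dot product

def k_poly(x, z, theta1=1.0, theta2=3):      # Corollary 1(1): (k1 + theta1)^theta2
    return (k1(x, z) + theta1) ** theta2

def k_exp(x, z, sigma=1.0):                  # Corollary 1(2): exp(k1 / sigma^2)
    return np.exp(k1(x, z) / sigma**2)

def k_gauss(x, z, sigma=1.0):                # Corollary 1(3): Gaussian kernel built from k1
    return np.exp(-(k1(x, x) - 2 * k1(x, z) + k1(z, z)) / (2 * sigma**2))

def k_norm(x, z):                            # Corollary 1(4): normalized kernel, values in [-1, 1]
    return k1(x, z) / np.sqrt(k1(x, x) * k1(z, z))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(k_poly(x, z), k_exp(x, z), k_gauss(x, z), k_norm(x, z))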



From Theorem 2 we see that the following theorems are true.

Theorem 3. Let k : X × X → R be a kernel. Let P be a polynomial of degree n with nonnegative coefficients:

P(t) = Σ_{i=0}^n ai t^i   (ai ≥ 0).

Then the function k̃ : X × X → R defined by

k̃(x1, x2) := P(k(x1, x2))

is a kernel.

Theorem 4. Let k : X × X → R be a kernel. Let f : k(X, X) → R be a function whose Taylor expansion has only nonnegative coefficients:

f(t) = Σ_{i=0}^∞ ai t^i   (ai ≥ 0).

Then the function k̃ : X × X → R defined by

k̃(x1, x2) := f(k(x1, x2))

is a kernel.

For example, functions with "good" (nonnegative coefficient) Taylor expansions: e^x, arcsin x, sinh x, cosh x, tan x, arctanh x. Functions with "bad" Taylor expansions: sin x, cos x, arccos x, arctan x, arcsinh x, tanh x (a small symbolic check of such coefficients is sketched below).
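A brief symbolic check of this classification (our sketch, using the SymPy library, which the paper does not use): inspect the low-order Taylor coefficients at 0 and flag any negative one.

import sympy as sp

x = sp.symbols('x')
candidates = {
    'exp':   sp.exp(x),      # "good": all Taylor coefficients nonnegative
    'sinh':  sp.sinh(x),
    'tan':   sp.tan(x),
    'atanh': sp.atanh(x),
    'sin':   sp.sin(x),      # "bad": some coefficients negative
    'tanh':  sp.tanh(x),
    'atan':  sp.atan(x),
}
for name, f in candidates.items():
    poly = sp.Poly(sp.series(f, x, 0, 10).removeO(), x)
    has_negative = any(c < 0 for c in poly.all_coeffs())
    print(name, "has a negative Taylor coefficient up to order 10:", has_negative)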

Corollary 2. Let k : X × X → R be a function and let f : k(X, X) → R be a function for which there is an inverse f^{−1} with nonnegative Taylor expansion coefficients. Then k is a kernel if the superposition f ◦ k is a kernel.

Example 4. If k is not a kernel, then the functions

k1(x1, x2) = tanh(k(x1, x2)),

k2(x1, x2) = arctan(k(x1, x2))

are not kernels. Indeed, the inverse functions arctanh and tan have "good" Taylor expansions, so Corollary 2 applies.

The next theorems concern dot product kernels.

Definition 4 (Abramowitz and Stegun, 1972). The solutions of Legendre's differential equation

d/dx [ (1 − x²) d/dx P(x) ] + n(n + 1) P(x) = 0

are called Legendre functions. When n is an integer, the solution Pn(x) is a polynomial. These solutions for n = 0, 1, . . . (with the normalization Pn(1) = 1) form a family of orthogonal polynomials called the Legendre polynomials. Each Legendre polynomial Pn(x) is an nth-degree polynomial. It may be expressed using the Rodrigues formula:

Pn(x) = 1/(2^n n!) d^n/dx^n (x² − 1)^n.

Associated Legendre functions are the canonical solutions of the general Legendre equation

(1 − x²) y′′ − 2x y′ + [ n(n + 1) − m²/(1 − x²) ] y = 0.

This equation has solutions that are nonsingular on [−1, 1] only if n and m are integers with 0 ≤ m ≤ n. When in addition m is even, the function is a polynomial. When m is zero and n is an integer, these functions are identical to the Legendre polynomials. These functions are denoted Pn^m(x). Their most straightforward definition is in terms of derivatives of the ordinary Legendre polynomials (m ≥ 0):

Pn^m(x) = (−1)^m (1 − x²)^{m/2} d^m/dx^m Pn(x).

Hence, by the Rodrigues formula, one obtains

Pn^m(x) = (−1)^m/(2^n n!) (1 − x²)^{m/2} d^{n+m}/dx^{n+m} (x² − 1)^n.

The coefficients cn and cn^m of the expansion of a function f(x) into a series of Legendre and associated Legendre polynomials are calculated as

cn = (2n + 1)/2 ∫_{−1}^{1} f(x) Pn(x) dx,

cn^m = (2n + 1)(n − m)! / (2(n + m)!) ∫_{−1}^{1} f(x) Pn^m(x) dx.

Theorem 5 (Schoenberg, 1942). Let k(x1, x2) = f(⟨x1, x2⟩) be a function defined on S × S ⊂ R^{m+3} × R^{m+3}, where S is the unit sphere, and f : [−1, 1] → R is a function with an expansion into Legendre polynomials Pn^m,

f(t) = Σ_{n=0}^∞ an Pn^m(t).

Then k is a kernel if and only if all an ≥ 0.

Theorem 6 (Schoenberg, 1942). Let k(x1, x2) = f(⟨x1, x2⟩) be a function defined on S × S ⊂ H × H, where S is the unit sphere in an infinite-dimensional Hilbert space H, and f : [−1, 1] → R is a function with a power series expansion

f(t) = Σ_{n=0}^∞ an t^n.

Then k is a kernel if and only if all an ≥ 0.

Remark 1. In order to prove positive definiteness for arbitrary dimensions, it suffices to show that the Taylor expansion contains only nonnegative coefficients. On the other hand, in order to prove that a candidate kernel function will never be positive definite, it is sufficient to exhibit a negative coefficient in its expansion into the Legendre polynomials Pn.

Example 5. The function

k(x1, x2) = tanh(a⟨x1, x2⟩ + b)

is not a kernel for any a, b ∈ R, a ≠ 0. We would have to show that the function does not satisfy the conditions of Theorem 5. Since this is very technical, we refer the reader to the work of Ovari (2000) for details, and explain how the method works in the simpler case of Theorem 6. The Taylor series of tanh(at + b) equals

tanh b + (1 − tanh² b) a t + (tanh³ b − tanh b) a² t² + . . .

Since the coefficients have to be nonnegative, we have tanh b ≥ 0 and tanh³ b − tanh b ≥ 0. Hence b ≥ 0, and if b ≠ 0 then tanh² b ≥ 1, a contradiction. If b = 0, the expansion equals at − a³t³/3 + . . . , and then a = 0, a contradiction.
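The failure can also be observed numerically (our sketch, for one convenient parameter choice): with b chosen so that tanh(a + b) < 0, the diagonal of the Gram matrix on the unit sphere is already negative, so the matrix cannot be positive semidefinite.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # 40 points on the unit sphere in R^3

a, b = 1.0, -2.0
K = np.tanh(a * (X @ X.T) + b)    # candidate sigmoid "kernel" matrix
# Here tanh(a + b) < 0, so the diagonal entries are negative and the matrix
# cannot be positive semidefinite; other choices of (a, b) also fail, although
# exhibiting a negative eigenvalue may then require more sample points.
print(K[0, 0], np.linalg.eigvalsh(K).min())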

Example 6. By computer computations we obtain that for parameters a, b ∈ {−3, −2, . . . , 2, 3}, a ≠ 0, every function

f(x) := 1 / (1 + exp(ax + b))

has a negative coefficient in its expansion into the Legendre polynomial series; therefore k(x1, x2) = f(⟨x1, x2⟩) is not a kernel.
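One way to carry out such a computation (our sketch, not necessarily the authors' code) is Gauss-Legendre quadrature of the coefficient formula cn = (2n + 1)/2 ∫ f(x)Pn(x) dx; eval_legendre comes from SciPy, and the parameter pair below is one of those listed above.

import numpy as np
from scipy.special import eval_legendre

def legendre_coeffs(f, n_max=20, n_quad=200):
    # c_n = (2n+1)/2 * integral_{-1}^{1} f(x) P_n(x) dx, by Gauss-Legendre quadrature.
    nodes, weights = np.polynomial.legendre.leggauss(n_quad)
    fx = f(nodes)
    return np.array([(2 * n + 1) / 2 * np.sum(weights * fx * eval_legendre(n, nodes))
                     for n in range(n_max + 1)])

a, b = 2.0, 1.0                      # one of the parameter pairs from Example 6
f = lambda x: 1.0 / (1.0 + np.exp(a * x + b))
c = legendre_coeffs(f)
print(c.min())                       # a negative coefficient appears, so f(<x1, x2>) is not a kernel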

Example 7. Consider the function

f(x) := 1 / (1 − exp(ax − b)).

The function f is well defined for 0 < a < b, x ∈ [−1, 1], and its Taylor series equals

e^b/(e^b − 1) + e^b a x/((e^b)² − 2e^b + 1) + ((e^b)² + e^b) a² x²/(2(e^b)³ − 6(e^b)² + 6e^b − 2) + . . .
= Σ_{n=0}^∞ cn a^n x^n/(e^b − 1)^{n+1},   cn > 0.

Thus all coefficients of the series are nonnegative and k(x1, x2) = f(⟨x1, x2⟩) is a kernel on the product of the unit spheres.

The next corollaries are simple consequences of the definition of kernels and Remark 1.

Corollary 3. Let k : Y × Y → R be a map and let T : X → Y be a map such that Y = T(X). Then the map

k̃(x1, x2) := k(T(x1), T(x2))

is a kernel on X × X if and only if k is a kernel.
Corollary 4. Let T : X → H be a map such that T(X) ⊂ S, where S is the unit sphere in a Hilbert space H (finite-dimensional (R^n) or infinite-dimensional). Let f : [−1, 1] → R be a function with Taylor expansion

f(t) = Σ_{i=0}^∞ ai t^i.

If all ai ≥ 0, then the map

k(x1, x2) := f(⟨T(x1), T(x2)⟩)

is a kernel on X × X.
Corollary 5. Let T : X → H be a map such that S ⊂ T(X), where S is the unit sphere in a Hilbert space H (finite-dimensional (R^n) or infinite-dimensional). Let f : [−1, 1] → R be a function with an expansion into Legendre polynomials

f(t) = Σ_{i=0}^∞ ai Pi(t).

If some ai < 0, then the map

k(x1, x2) := f(⟨T(x1), T(x2)⟩)

is not a kernel on X × X.
Example 8. For any transformation T from X onto the unit sphere S, the function

k(x1, x2) := tanh(a⟨T(x1), T(x2)⟩ + b)

is not a kernel for any parameters a, b with a ≠ 0.

Corollary 6. Let f : D ⊂ R → R be a function which can be written as f(t) = g(at + b), where g is some function, and suppose the Legendre polynomial expansion of f has some negative coefficient for any a, b. Then for any kernel k0(x1, x2) = ⟨Φ(x1), Φ(x2)⟩ such that S(0, r) ⊂ Φ(X) (S(0, r) being the sphere with radius r > 0 and center at 0), the function

k(x1, x2) := f(k0(x1, x2)) = g(a k0(x1, x2) + b)

is not a kernel for any parameters a, b.

Proof. Since S(0, r) ⊂ Φ(X), we have S(0, 1) ⊂ (1/r)Φ(X). Then, by Corollary 5,

g(r² a ⟨(1/r)Φ(x1), (1/r)Φ(x2)⟩ + b) = g(a⟨Φ(x1), Φ(x2)⟩ + b) = k(x1, x2)

is not a kernel.

Example 9. For any kernel k0(x1, x2) = ⟨Φ(x1), Φ(x2)⟩ such that S(0, r) ⊂ Φ(X), the function

k(x1, x2) := tanh(a k0(x1, x2) + b)

is not a kernel for any parameters a, b with a ≠ 0.

Theorem 7 (Burges, 1999). Let k(x1, x2) = f(⟨x1, x2⟩) be a dot product kernel, where f : R → R is a differentiable function. Then

f(t) ≥ 0,   f′(t) ≥ 0,   f′(t) + t f′′(t) ≥ 0

for any t ≥ 0.

Example 10. Let k(x1, x2) = exp(−⟨x1, x2⟩). Then f(t) = e^{−t} and f′(t) = −e^{−t} < 0, thus k is not a kernel.

5. Inverse of the stereographic projection

We see that we have many methods of checking whether a function is a kernel on the sphere. So if we have observations in R^n, we can use the inverse of the stereographic projection T into S ⊂ R^{n+1} (see below), and then the superposition of any kernel k on the sphere with T will be a kernel. Similarly, we can use this technique if we have a kernel on the sphere which is not a kernel on the whole space.

Example 11. Let us take the second Legendre polynomial P2(t) = −1/2 + (3/2)t². The function

k(x1, x2) := P2(⟨x1, x2⟩) = −1/2 + (3/2)⟨x1, x2⟩²

is, by Theorem 5, a kernel on the unit sphere S ⊂ R³. But k is not a kernel on any subset of R² containing zero. Indeed, for c ≠ 0 we have c² k(0, 0) < 0, so, by Definition 3, k is not a kernel. Even if we exclude zero from the subset, k is not a kernel, because lim_{x→0} k(x, x) = −1/2.

We construct the inverse of the stereographic projection and introduce a new metric on R^n induced from the Euclidean metric on the unit sphere in R^{n+1}. This metric could be used not only for constructing kernels but also directly in, for example, classification methods.
We define a map h : R^n ∪ {∞} → R^{n+1},

R^n ∋ x = (x¹, . . . , xⁿ) ↦ y = (y¹, . . . , yⁿ, y^{n+1}),   ∞ ↦ (0, . . . , 0, 1).

To find y = h(x) for x ∈ R^n, we take the n-dimensional sphere in R^{n+1} with center at the point (0, . . . , 0, 1/2) and radius 1/2. We draw a line through the points (x¹, . . . , xⁿ, 0) and (0, . . . , 0, 1). The intersection of the line and the sphere (other than (0, . . . , 0, 1)) is the resulting point y = h(x).
The equation of the sphere is (y¹)² + · · · + (yⁿ)² + (y^{n+1} − 1/2)² = 1/4, and the parametric equations of the line are

y¹ = x¹t, . . . , yⁿ = xⁿt, y^{n+1} = 1 − t,   t ∈ R.

Then the intersection point y = (y¹, . . . , yⁿ, y^{n+1}) has the coordinates

y¹ = x¹/(‖x‖² + 1), . . . , yⁿ = xⁿ/(‖x‖² + 1), y^{n+1} = ‖x‖²/(‖x‖² + 1).

Now we define a metric d on R^n by

d(x1, x2) := d^{(n+1)}(h(x1), h(x2)),

where d^{(n+1)} is the usual Euclidean metric on R^{n+1}. We have

d(x1, x2) = ‖x1 − x2‖ / ( √(‖x1‖² + 1) √(‖x2‖² + 1) ).

The metric d has the following properties:

d(x1, x2) < 1,   d(x, ∞) = 1/√(‖x‖² + 1),   d(0, ∞) = 1

for x1, x2, x ∈ R^n.
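A short numerical sketch (ours) of the map h and the induced metric d, checking that the closed-form expression for d agrees with the Euclidean distance between the images h(x1) and h(x2); the test points are arbitrary.

import numpy as np

def h(x):
    # Inverse stereographic projection of R^n onto the sphere with
    # center (0, ..., 0, 1/2) and radius 1/2 in R^{n+1}.
    s = np.sum(x**2)
    return np.append(x, s) / (s + 1.0)

def d(x1, x2):
    # Closed-form expression for the induced metric on R^n.
    return np.linalg.norm(x1 - x2) / np.sqrt((np.sum(x1**2) + 1) * (np.sum(x2**2) + 1))

x1 = np.array([1.0, -2.0, 0.5])
x2 = np.array([0.3, 4.0, -1.0])
print(np.linalg.norm(h(x1) - h(x2)))  # Euclidean distance between the images
print(d(x1, x2))                      # the same value, always < 1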
If we need a map onto the unit sphere S = S(0, 1), we can define

h̃(x) := 2( h(x) − (0, . . . , 0, 1/2) ).

Then we have d̃(x1, x2) = 2 d(x1, x2).
If we need to transform the closed ball B̄ ⊂ R^n (with center x0 and radius r > 0) onto a sphere S ⊂ R^{n+1}, we can take a transformation T : B̄ ⊂ R^n → R^n ∪ {∞} defined by

T(x) := g(‖x − x0‖/r)(x − x0) for x ∈ B,   T(x) := ∞ for x ∈ B̄ \ B,

where g is some function which maps [0, 1) onto [0, ∞), for example arctanh(t) or tan(πt/2). Then we define ĥ : B̄ ⊂ R^n → S ⊂ R^{n+1} as ĥ = h ◦ T or ĥ = h̃ ◦ T. Note that all points from the boundary of B̄ are mapped to the same point in R^{n+1}.

Example 12. If we have all the data in the unit ball B ⊂ R^n, we can transform the ball into the unit sphere S ⊂ R^{n+1}. We have

T(x) = g(‖x‖) x,

where g is a function as mentioned above. Then

⟨y1, y2⟩ = ( 4g̃(x1, x2) + (g̃(x1) − 1)(g̃(x2) − 1) ) / ( (g̃(x1) + 1)(g̃(x2) + 1) ),

where

g̃(x1, x2) = g(‖x1‖) g(‖x2‖) ⟨x1, x2⟩,   g̃(x) = g̃(x, x),

and yi = h̃(T(xi)). Thus we have a kernel on B × B,

k(x1, x2) := f(⟨y1, y2⟩),

for any function f which satisfies the conditions of Corollary 4.

6. Multivariable functions of kernels

We can generalize the method of superposing functions of one variable with kernels (Section 4) to the case of multivariable functions. If we have a function of n variables with a "good" Taylor expansion, then its superposition with n kernels is a kernel. This enables us to construct new kernels and simplifies checking that a function is a kernel.
The next theorems concern multivariable functions and follow directly from Theorem 2.

Theorem 8. Let ki : X × X → R (i = 1, . . . , n) be kernels. Let P : R^n → R be a polynomial of several variables with nonnegative coefficients:

P(t1, . . . , tn) = Σ_{i=1}^m ai t1^{i1} · · · tn^{in},   ai ≥ 0.

Then the function k : X × X → R defined by

k(x1, x2) := P(k1(x1, x2), . . . , kn(x1, x2))

is a kernel.

Theorem 9. Let ki : X × X → R (i = 1, . . . , n) be kernels. Let f : k1(X, X) × · · · × kn(X, X) → R be a function of several variables whose Taylor expansion has only nonnegative coefficients:

f(t1, . . . , tn) = Σ_{i=1}^∞ ai t1^{i1} · · · tn^{in},   ai ≥ 0.

Then the function k : X × X → R defined by

k(x1, x2) := f(k1(x1, x2), . . . , kn(x1, x2))

is a kernel.

Example 13. Let n ∈ N. If ki : X × X → (−1, 1), i = 1, . . . , n, are kernels, then the following functions are kernels:

K1(x1, x2) := Π_{i=1}^n 1/(1 − ki(x1, x2)),

K2(x1, x2) := Π_{i=1}^n (1 + ki(x1, x2))/(1 − ki(x1, x2)).

For ti ∈ (−1, 1), i = 1, . . . , n, we have

Σ_{α1,...,αn ∈ N∪{0}} t1^{α1} · · · tn^{αn} = Π_{i=1}^n 1/(1 − ti),

Σ_{α1,...,αn ∈ Z} t1^{|α1|} · · · tn^{|αn|} = Π_{i=1}^n (1 + ti)/(1 − ti).

Therefore, by Theorem 9, K1 and K2 are kernels.
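A small numerical illustration (ours) of Example 13: build K1 and K2 from a base kernel taking values in (−1, 1) (here a scaled Gaussian kernel, an arbitrary assumption) and check that the resulting Gram matrices have no negative eigenvalues.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 4))

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_base = 0.9 * np.exp(-sq_dists)          # a kernel with values in (0, 0.9], inside (-1, 1)

K1 = 1.0 / (1.0 - K_base)                 # Example 13 with n = 1
K2 = (1.0 + K_base) / (1.0 - K_base)

print(np.linalg.eigvalsh(K1).min())       # >= 0 up to rounding error
print(np.linalg.eigvalsh(K2).min())       # >= 0 up to rounding error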



Example 14. If k1, k2 are kernels and a, b, d ≥ 0, c > 0, m, n ∈ N, then

k(x1, x2) := (a + b k1(x1, x2))^m / (c − d k2(x1, x2))^n

is a kernel. Indeed, we have k(x1, x2) = f(k1(x1, x2), k2(x1, x2)), where

f(t1, t2) = (a + b t1)^m / (c − d t2)^n.

The partial derivatives are

∂^k/∂t1^k (a + b t1)^m = (m!/(m − k)!) b^k (a + b t1)^{m−k} for k ≤ m, and 0 for k > m,

∂^l/∂t2^l (c − d t2)^{−n} = ((n − 1 + l)!/(n − 1)!) d^l (c − d t2)^{−(n+l)} for l ∈ N,

and

∂^{k+l} f/∂t1^k ∂t2^l (0, 0) = (m!/(m − k)!) ((n − 1 + l)!/(n − 1)!) a^{m−k} b^k d^l / c^{n+l} for k ≤ m, and 0 for k > m.

Hence all coefficients in the Taylor expansion of the function f are nonnegative and, by Theorem 9, k is a kernel.

Example 15. Let k1i, k2i be kernels and ai, bi, di ≥ 0, ci > 0, mi, ni ∈ N, i = 1, . . . , n. Then

k(x1, x2) := Π_{i=1}^n (ai + bi k1i(x1, x2))^{mi} / (ci − di k2i(x1, x2))^{ni}

is a kernel and a generalization of the kernels from Examples 13–14.



In all the above examples we have to note that the Taylor expansions converge to the appropriate functions only if the kernel values lie in the region of convergence. In this example it must hold that |k2i(x1, x2)| < ci/di (when di ≠ 0) for i = 1, . . . , n.
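A brief sketch (ours) of the rational construction from Example 14, including the convergence check just mentioned; the base kernels and parameter values are arbitrary assumptions.

import numpy as np

def rational_kernel_matrix(K1, K2, a=1.0, b=2.0, c=3.0, d=1.0, m=2, n=1):
    # k = (a + b*k1)^m / (c - d*k2)^n, a kernel (Example 14) when a, b, d >= 0,
    # c > 0 and, as noted above, |k2| < c/d wherever d != 0.
    if d != 0 and np.max(np.abs(K2)) >= c / d:
        raise ValueError("k2 values must satisfy |k2| < c/d for the expansion to converge")
    return (a + b * K1) ** m / (c - d * K2) ** n

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_gauss = np.exp(-sq_dists)          # values in (0, 1], and 1 < c/d = 3
K_lin = X @ X.T

K = rational_kernel_matrix(K_lin, K_gauss)
print(np.linalg.eigvalsh(K).min())   # >= 0 up to rounding error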

7. Conclusion

We have shown a few methods of constructing kernels on metric spaces. We hope that this will be useful for researchers using kernel methods. The choice of a proper kernel is very difficult and corresponds to:
• choosing a similarity measure for the data,

• choosing a linear representation of the data,

• choosing a regularization functional,

• choosing a covariance function for correlated observations.

Therefore, this choice should reflect prior knowledge about the problem at
hand.

References

[1] M. Abramowitz and I.A. Stegun, chapters on Legendre functions and orthogonal polynomials, in: Handbook of Mathematical Functions, Dover Publications, New York 1972.
[2] B.E. Boser, I.M. Guyon and V.N. Vapnik, A training algorithm for optimal margin classifiers, in: D. Haussler (ed.), 5th Annual ACM Workshop on COLT, ACM Press, Pittsburgh (1992), 144–152.
[3] C.J.C. Burges, Geometry and invariance in kernel based methods, in: B. Schölkopf, C.J.C. Burges and A.J. Smola (eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge (1999), 89–116.
[4] C. Cortes and V. Vapnik, Support-Vector Networks, Machine Learning 20 (1995), 273–297.
[5] R. Herbrich, Learning Kernel Classifiers, MIT Press, London 2002.
[6] T. Hofmann, B. Schölkopf and A.J. Smola, Kernel methods in machine learning, Annals of Statistics 36 (2008), 1171–1220.
[7] Z. Ovari, Kernels, eigenvalues and support vector machines, Honours thesis, Australian National University, Canberra 2000.
[8] B. Schölkopf and A.J. Smola, Learning with Kernels, MIT Press, London 2002.
[9] B. Schölkopf, A.J. Smola and K.R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10 (1998), 1299–1319.
[10] I.J. Schoenberg, Positive definite functions on spheres, Duke Mathematical Journal 9 (1942), 96–108.
[11] A. Tarantola, Inverse Problem Theory and Methods for Model Parameter Estimation, SIAM, Philadelphia 2005.
[12] M. Zu, Kernels and ensembles: perspective on statistical learning, American Statistician 62 (2008), 97–109.

Received 8 March 2010
