
Discussiones Mathematicae

Probability and Statistics 30 (2010) 179–201

SOME METHODS OF CONSTRUCTING KERNELS


IN STATISTICAL LEARNING
Tomasz Górecki
Faculty of Mathematics and Computer Science
Adam Mickiewicz University
Umultowska 87, 61–614 Poznań, Poland
e-mail: [email protected]

and
Maciej Luczak
Department of Civil and Environmental Engineering
Koszalin University of Technology
Śniadeckich 2, 75–453 Koszalin, Poland
e-mail: [email protected]

Abstract
This paper is a collection of numerous methods and results concerning the design of kernel functions. It gives a short overview of methods of building kernels in metric spaces, especially R^n and S^n. We also present some new theory. The introduction of kernels was motivated by the search for non-linear patterns using linear functions in a feature space created by a non-linear feature map.
Keywords: positive definite kernel, dot product kernel, statistical kernel, SVM, kPCA.
2010 Mathematics Subject Classification: 62H30, 68T10.

1. Introduction

The mid-1990s yielded a major advance in machine learning: the support vector machine (SVM). The fundamental idea behind this method is

especially far-reaching. SVM utilizes positive definite kernels. What does this mean? Most machine learning methods are very well developed in the linear case. In practice, however, we have real data and we need non-linear methods to detect the kind of dependencies that enable us to predict successfully the properties of interest. The kernel corresponds to a dot product in a feature space (usually high-dimensional, even infinite-dimensional for the Gaussian kernel). In this space our methods are linear, but as long as we can formulate everything in terms of kernel evaluations, we never have to compute explicitly in the high-dimensional feature space. If we can show that a linear algorithm depends on the data matrix X only through

K ≡ XX^T,

then it can easily be "kernelized". In general, this procedure is known as the kernel trick. The kernel trick transforms any algorithm that depends solely on dot products: wherever a dot product is used, it is replaced with the kernel function. Thus a linear algorithm can easily be transformed into a non-linear algorithm. The algorithms capable of operating with kernels include (apart from SVM) Fisher's linear discriminant analysis (LDA), principal component analysis (PCA), canonical correlation analysis (CCA), ridge regression, spectral clustering, and many others. A full review of kernel methods is given in (Hofmann, Schölkopf and Smola, 2008).
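As an illustration of this kernelization step (our sketch, not part of the original paper), the code below contrasts the linear Gram matrix K = XX^T with a Gaussian kernel matrix computed on the same data; any algorithm formulated purely in terms of such a matrix can be kernelized by swapping one for the other. The synthetic data and the bandwidth value are arbitrary assumptions.

import numpy as np

def linear_gram(X):
    # K = X X^T: entries are ordinary dot products <x_i, x_j>.
    return X @ X.T

def gaussian_gram(X, sigma=1.0):
    # Entries are k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / (2 * sigma**2))

X = np.random.default_rng(0).normal(size=(5, 3))   # 5 points in R^3
K_lin = linear_gram(X)
K_rbf = gaussian_gram(X, sigma=2.0)
# Both matrices are symmetric and positive semidefinite (up to rounding),
# so either can play the role of K in a "kernelized" algorithm.
print(np.linalg.eigvalsh(K_lin).min(), np.linalg.eigvalsh(K_rbf).min())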
In this article we focus our attention on constructing kernels. Modularity is an important advantage of kernel methods: to solve a different problem, we can use a different kernel function. Hence it is useful to have as many kernel functions as possible, because we never know in advance which kernel will be the best (to be effective in practice we should obviously use the correct kernel function with the right parameters (Zu, 2008)).
In our paper we first present the main ideas behind kernel methods in machine learning (Section 2). We review especially SVM and kernel PCA as members of the big family of "kernelized" methods. Then we describe basic kernels and the main methods of constructing kernels (Section 3). In Section 4 we show how kernels are constructed using superpositions of kernels with other functions, namely functions with "good" Taylor or Legendre polynomial series expansions. We pay particular attention to sigmoidal kernels (Corollary 2, Examples 4–7). In the same section we propose superpositions of kernels with maps from R^n to the unit sphere (Corollaries 3–6). Section 5 concerns a special case of the method proposed in Section 4.

We construct the inverse of the stereographic projection. This enables us to construct new kernels as superpositions of this projection with kernels on the sphere. In Section 6 we generalize the results of Section 4 to the case of multivariable functions.

2. Kernels in machine learning

Suppose we are given empirical data (x1, y1), . . . , (xn, yn) ∈ X × {±1}. Here the domain X is some nonempty set from which the patterns xi are taken; the yi are called class labels. In order to study the problem of learning, we need additional structure. In learning, we want to generalize to unseen data points. In the case of pattern recognition this means that, given some new pattern x ∈ X, we want to predict the corresponding y ∈ {±1}. Although most kernel methods can handle multi-class classification problems as well, we limit ourselves mostly to the two-class classification problem. We choose y such that (x, y) is in some way similar to the training examples. So we need a similarity measure on X, that is, a function

k : X × X → R

which, given two examples x1 and x2, returns a real number characterizing their similarity. The function k is called a kernel.
A type of similarity measure of particular mathematical appeal is the dot product. In order to use a dot product as a similarity measure, we first need to embed the patterns into some dot product space F, which may not be identical to R^n. To this end, we use a map

Φ : X → F.

The space F is called a feature space. Embedding the data into F has some benefits:

• It lets us define a similarity measure from the dot product in F:

k(x1, x2) = ⟨Φ(x1), Φ(x2)⟩

(see the short numerical sketch after this list).

• It allows us to deal with the patterns geometrically, and thus lets us study learning algorithms using linear algebra and analytic geometry. Geometrically, the dot product computes the cosine of the angle between two vectors, provided they are normalized to length 1. Moreover, it allows computation of the length of a vector, and of the distance between two vectors as the length of their difference.

• The ability to choose the mapping Φ enables us to design a large variety of learning algorithms.
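For instance (a small illustration of ours, not from the paper), on X = R^2 the feature map Φ(x) = (x1^2, x2^2, √2 x1 x2) realizes the kernel k(x, z) = ⟨x, z⟩^2, so the similarity can be computed without ever constructing Φ explicitly:

import numpy as np

def phi(x):
    # Explicit feature map for the homogeneous degree-2 polynomial kernel on R^2.
    return np.array([x[0]**2, x[1]**2, np.sqrt(2.0) * x[0] * x[1]])

def k_poly2(x, z):
    # The same similarity computed without constructing Phi at all.
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))   # 1.0
print(k_poly2(x, z))            # 1.0 as well: <Phi(x), Phi(z)> = <x, z>^2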

Now we give two basic "kernelization" examples of well-known linear methods.
2.1. SVM
In the case of support vector machines, a data point is viewed as a p-dimensional vector. We want to know whether we can separate such points with a (p − 1)-dimensional hyperplane; this is called a linear classifier. There are many hyperplanes that might classify the data. However, we are additionally interested in whether we can achieve maximum separation (margin) between the two classes. By this we mean that we choose the hyperplane such that the distance from the hyperplane to the nearest data point is maximized; in other words, the distance between the two parallel supporting hyperplanes described below is maximized. If such a hyperplane exists, it is clearly of interest and is known as the maximum-margin hyperplane, and the corresponding linear classifier is known as a maximum margin classifier. The samples on the margin are called the support vectors (the maximum margin hyperplane, and hence the classification task, is a function of the support vectors only). To calculate the margin, two parallel hyperplanes are constructed, one on each side of the separating hyperplane, which are "pushed up against" the two data sets.

Figure 1. (a) H1 does not separate the two classes; H2 does, but with a small margin; H3 separates them with the maximum margin. (b) A correct map (kernel) turns a non-linear classifier into a linear one in a higher-dimensional space.

If two classes are perfectly separable, then there exist infinitely many separating hyperplanes. The SVM method is based on the idea that the best hyperplane for separating the two classes is the one that is farthest away from the training points. Intuitively, a good separation is achieved by the hyperplane with the largest distance to the neighboring data points of both classes; in general, the larger the margin, the better the generalization error of the classifier (Figure 1a).
Cortes and Vapnik (1995) suggested a modified maximum margin idea (the soft margin) allowing for mislabeled examples (in general we cannot assume that the two classes are always perfectly separable). If there exists no hyperplane splitting the examples, the method chooses a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples. The method introduces slack variables ξi ≥ 0, which measure the degree of misclassification of the datum xi. The objective function is then augmented by a term that penalizes non-zero ξi, and the optimization becomes a trade-off between a large margin and a small error penalty.
However, one cannot expect a linear classifier to succeed in general situations, no matter how optimal the hyperplane is. Boser, Guyon and Vapnik (1992) suggested a way to create non-linear classifiers by applying the kernel trick to maximum-margin hyperplanes. The resulting algorithm is formally similar, except that every dot product is replaced by a non-linear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in the transformed feature space. The transformation may be non-linear and the transformed space may be high-dimensional. Although the classifier is a hyperplane in the high-dimensional feature space, it may be non-linear in the original input space (Figure 1b).
SVMs belong to the family of generalized linear classifiers. They can also be considered a special case of Tikhonov regularization (the most commonly used method of regularizing ill-posed problems; in statistics the method is also known as ridge regression – Tarantola, 2005).
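A minimal sketch of the kernelized soft-margin SVM described above (ours, using the scikit-learn library, which the paper does not mention); the synthetic data, the choice of an RBF kernel and the penalty parameter C are arbitrary assumptions.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two classes that are not linearly separable: points inside vs. outside a circle.
X = rng.normal(size=(200, 2))
y = (np.sum(X**2, axis=1) > 1.0).astype(int)

# Soft-margin SVM with a Gaussian (RBF) kernel; C controls the margin/error trade-off.
clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.n_support_)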

2.2. Kernel PCA


Principal Component Analysis (PCA) is a vector space transformation often
used to reduce multidimensional data sets to lower dimensions for analysis.
PCA is an orthogonal linear transformation of the coordinate system in which we describe our data, such that the greatest variance of any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. Unfortunately, we have to assume that the observed data are linear combinations of a certain basis. Kernel principal component analysis (kPCA) is an extension of PCA using the techniques of kernel methods (without assuming linearity). Using a kernel, the originally linear operations of PCA are carried out in a reproducing kernel Hilbert space with a non-linear mapping (Figure 2). It is a successful example of "kernelizing" a well-known linear algorithm. Schölkopf et al. (1998) showed that finding and projecting onto principal components depends only on inner products, so the kernel trick can be used (a minimal sketch of this computation is given after Figure 2). There are several important points to note about the behavior of the kPCA components, which should be contrasted with the behavior of classic PCA:

• The maximum number of components is determined not by the dimensionality of the input data, but by the number of input data points.

• Not all sets of values of the components correspond to an actual point in input space.

Figure 2. (a) Input points before kernel PCA. (b) Output after kernel PCA. The three groups are distinguishable using the first component only.
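A compact sketch (ours, not the authors') of the kPCA computation outlined above: form the Gram matrix, center it in feature space, and project onto its leading eigenvectors. The Gaussian kernel and its width are arbitrary assumptions.

import numpy as np

def kernel_pca(X, kernel, n_components=2):
    n = X.shape[0]
    K = np.array([[kernel(x, z) for z in X] for x in X])
    # Center the kernel matrix, i.e. center the data in feature space.
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one
    # Eigendecomposition; eigenvalues are returned in ascending order, so reverse.
    vals, vecs = np.linalg.eigh(Kc)
    vals, vecs = vals[::-1], vecs[:, ::-1]
    # Scale eigenvectors so feature-space components have unit norm, then project:
    # the scores of the training points are Kc @ alphas.
    alphas = vecs[:, :n_components] / np.sqrt(np.maximum(vals[:n_components], 1e-12))
    return Kc @ alphas

rbf = lambda x, z, s=1.0: np.exp(-np.sum((x - z) ** 2) / (2 * s**2))
X = np.random.default_rng(1).normal(size=(30, 5))
scores = kernel_pca(X, rbf, n_components=2)   # 30 x 2 matrix of kPCA scores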

3. Basic kernels

Following (Schölkopf and Smola, 2002), we introduce the basic definitions.

Definition 1. Given a function k : X × X → R and x1, . . . , xn ∈ X, the n × n matrix K with elements

Kij = k(xi, xj)

is called the Gram matrix (or kernel matrix) of k with respect to x1, . . . , xn ∈ X.

Definition 2. A real symmetric n × n matrix K satisfying

Σ_{i,j} ci cj Kij ≥ 0

for all ci, cj ∈ R is called positive definite*.

(* Some authors call this nonnegative definite.)

Definition 3. Let X be a nonempty set. A function k on X × X which for all n ∈ N and all x1, . . . , xn ∈ X gives rise to a positive definite Gram matrix is called a positive definite kernel or, for short, a kernel.

The key idea of the kernel technique is to invert the chain of arguments, i.e., to choose a kernel k rather than a mapping before applying a learning algorithm. Not every symmetric function can be a kernel; a necessary and sufficient condition is given by Mercer's theorem.

Theorem 1 (Mercer's theorem). Suppose k ∈ L∞(X × X) is a symmetric function such that the integral operator Tk : L2(X) → L2(X) given by

(Tk f)(·) = ∫_X k(·, x) f(x) dx

is positive definite, that is,

∫_X ∫_X k(x1, x2) f(x1) f(x2) dx1 dx2 ≥ 0

for all f ∈ L2(X). Such a kernel we call a Mercer kernel.
Let ψi ∈ L2(X) be the eigenfunction of Tk associated with the eigenvalue λi ≥ 0 and normalized such that ||ψi||2 = ∫_X ψi²(x) dx = 1, i.e.,

∫_X k(x1, x2) ψi(x2) dx2 = λi ψi(x1)   for all x1 ∈ X.

Then

1. (λi)_{i∈N} ∈ ℓ¹,

2. ψi ∈ L∞(X),

3. k can be expanded in a uniformly convergent series, i.e.,

k(x1, x2) = Σ_{i=1}^∞ λi ψi(x1) ψi(x2)

holds for all x1, x2 ∈ X.

Proposition 1 (Herbrich, 2002). The function k : X × X → R is a Mercer kernel if and only if it is a kernel in the sense of Definition 3 (for almost all x).

If we have a positive definite kernel as in Definition 3 which, however, is not in L∞(X × X), then in order to use Mercer's theorem we can take any compact subset of X containing all the observations.

Example 1. A few simple examples of functions that are kernels or not.

• Kernels: ⟨x1, x2⟩, exp(−‖x1 − x2‖²), exp(⟨x1, x2⟩);

• Not kernels: ‖x1 − x2‖², −‖x1 − x2‖², −⟨x1, x2⟩, exp(‖x1 − x2‖²), exp(−⟨x1, x2⟩).

Example 2. Let us show that k(x1, x2) = ⟨x1, x2⟩, x1, x2 ∈ X ⊂ R^n, is a kernel in the sense of Definition 3 and Theorem 1.

For xi = (xi^1, . . . , xi^n) ∈ X, ci ∈ R, i = 1, . . . , m, we have

Σ_{i,j} ci cj ⟨xi, xj⟩ = Σ_{i,j} ci cj Σ_k xi^k xj^k = Σ_k Σ_{i,j} (ci xi^k)(cj xj^k)
= Σ_k (Σ_i ci xi^k)(Σ_j cj xj^k) = Σ_k (Σ_i ci xi^k)² ≥ 0.

For f ∈ L2(X) we have

∫_X ∫_X ⟨x1, x2⟩ f(x1) f(x2) dx1 dx2 = ∫_X ∫_X (Σ_i x1^i x2^i) f(x1) f(x2) dx1 dx2
= Σ_i ∫_X ∫_X x1^i f(x1) x2^i f(x2) dx1 dx2 = Σ_i (∫_X x1^i f(x1) dx1)(∫_X x2^i f(x2) dx2)
= Σ_i (∫_X x^i f(x) dx)² ≥ 0.

Example 3. We can show that the function k(x1, x2) = ‖x1 − x2‖² is not a kernel. Let m = 2, x1, x2 ∈ X, c1, c2 ∈ R. Then

Σ_{i,j∈{1,2}} ci cj ‖xi − xj‖² = 2 c1 c2 ‖x1 − x2‖² < 0

for x1 ≠ x2 and c1 c2 < 0.
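These examples can also be checked numerically (our sketch, under the assumption of randomly drawn points): the Gram matrix of the Gaussian kernel from Example 1 has no negative eigenvalues, while the Gram matrix of ‖x1 − x2‖² from Example 3 does.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_gauss = np.exp(-sq_dists)   # a kernel (Example 1)
K_sqdist = sq_dists           # not a kernel (Example 3)

# The smallest eigenvalue of a Gram matrix must be >= 0 for a positive definite kernel.
print(np.linalg.eigvalsh(K_gauss).min())   # nonnegative up to rounding
print(np.linalg.eigvalsh(K_sqdist).min())  # clearly negative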

4. Functions of kernels

The following facts show that we can create new kernels from existing kernels using a number of simple operations. In this way we can build complex kernels from simpler ones.
Theorem 2 (Herbrich, 2002). Let ki : X × X → R be any kernels. Then the functions k : X × X → R given by

1. k(x1, x2) = k1(x1, x2) + k2(x1, x2),

2. k(x1, x2) = c k1(x1, x2) for all c ∈ R+,

3. k(x1, x2) = k1(x1, x2) k2(x1, x2),

4. k(x1, x2) = f(x1) f(x2) for any function f : X → R,

5. k(x1, x2) = x1′ B x2 for any symmetric positive definite matrix B,

6. k(x1, x2) = lim_{i→∞} ki(x1, x2), if the limit exists,

are also kernels.

The combination of kernels given in part (3) is often referred to as the Schur product. We can easily decompose any kernel into the Schur product of its normalisation and the one-dimensional kernel of part (4) with f(x) = √k(x, x).

Corollary 1 (Herbrich, 2002). Let k1 : X × X → R be a kernel. Then the functions k : X × X → R given by

1. k(x1, x2) = (k1(x1, x2) + θ1)^θ2 for all θ1 ∈ R+ and θ2 ∈ N,

2. k(x1, x2) = exp( k1(x1, x2) / σ² ) for all σ ∈ R+,

3. k(x1, x2) = exp( −( k1(x1, x1) − 2 k1(x1, x2) + k1(x2, x2) ) / (2σ²) ) for all σ ∈ R+,

4. k(x1, x2) = k1(x1, x2) / √( k1(x1, x1) k1(x2, x2) ),

are also kernels.
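An illustrative sketch (ours) of the constructions in Corollary 1, starting from the linear kernel k1(x1, x2) = ⟨x1, x2⟩; the parameter values are arbitrary assumptions.

import numpy as np

def k1(x, z):
    return np.dot(x, z)                      # base kernel: ordinary dot product

def k_poly(x, z, theta1=1.0, theta2=3):      # Corollary 1(1): (k1 + theta1)^theta2
    return (k1(x, z) + theta1) ** theta2

def k_exp(x, z, sigma=1.0):                  # Corollary 1(2): exp(k1 / sigma^2)
    return np.exp(k1(x, z) / sigma**2)

def k_gauss(x, z, sigma=1.0):                # Corollary 1(3): Gaussian kernel built from k1
    return np.exp(-(k1(x, x) - 2 * k1(x, z) + k1(z, z)) / (2 * sigma**2))

def k_norm(x, z):                            # Corollary 1(4): normalized kernel, values in [-1, 1]
    return k1(x, z) / np.sqrt(k1(x, x) * k1(z, z))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(k_poly(x, z), k_exp(x, z), k_gauss(x, z), k_norm(x, z))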



From Theorem 2 we see that the following theorems are true.

Theorem 3. Let k : X × X → R be a kernel. Let P be a polynomial of degree n with nonnegative coefficients:

P(t) = Σ_{i=0}^n ai t^i   (ai ≥ 0).

Then the function k̃ : X × X → R defined by

k̃(x1, x2) := P(k(x1, x2))

is a kernel.

Theorem 4. Let k : X × X → R be a kernel. Let f : k(X, X) → R be a function whose Taylor expansion has only nonnegative coefficients:

f(t) = Σ_{i=0}^∞ ai t^i   (ai ≥ 0).

Then the function k̃ : X × X → R defined by

k̃(x1, x2) := f(k(x1, x2))

is a kernel.

For example, functions with "good" (nonnegative coefficient) Taylor expansions: e^x, arcsin x, sinh x, cosh x, tan x, arctanh x. Functions with "bad" Taylor expansions: sin x, cos x, arccos x, arctan x, arcsinh x, tanh x (a small symbolic check of such coefficients is sketched below).
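A brief symbolic check of this classification (our sketch, using the SymPy library, which the paper does not use): inspect the low-order Taylor coefficients at 0 and flag any negative one.

import sympy as sp

x = sp.symbols('x')
candidates = {
    'exp':   sp.exp(x),      # "good": all Taylor coefficients nonnegative
    'sinh':  sp.sinh(x),
    'tan':   sp.tan(x),
    'atanh': sp.atanh(x),
    'sin':   sp.sin(x),      # "bad": some coefficients negative
    'tanh':  sp.tanh(x),
    'atan':  sp.atan(x),
}
for name, f in candidates.items():
    poly = sp.Poly(sp.series(f, x, 0, 10).removeO(), x)
    has_negative = any(c < 0 for c in poly.all_coeffs())
    print(name, "has a negative Taylor coefficient up to order 10:", has_negative)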

Corollary 2. Let k : X × X → R be a function and let f : k(X, X) → R be a function for which there is an inverse f^{−1} with nonnegative Taylor expansion coefficients. Then k is a kernel if the superposition f ◦ k is a kernel.

Example 4. If k is not a kernel, then the functions

k1(x1, x2) = tanh(k(x1, x2)),

k2(x1, x2) = arctan(k(x1, x2))

are not kernels. Indeed, the inverse functions arctanh and tan have "good" Taylor expansions, so Corollary 2 applies.

The next theorems concern dot product kernels.

Definition 4 (Abramowitz and Stegun, 1972). The solutions of Legendre's differential equation

d/dx [ (1 − x²) d/dx P(x) ] + n(n + 1) P(x) = 0

are called Legendre functions. When n is an integer, the solution Pn(x) is a polynomial. These solutions for n = 0, 1, . . . (with the normalization Pn(1) = 1) form a family of orthogonal polynomials called the Legendre polynomials. Each Legendre polynomial Pn(x) is an nth-degree polynomial. It may be expressed using the Rodrigues formula:

Pn(x) = 1/(2^n n!) d^n/dx^n (x² − 1)^n.

Associated Legendre functions are the canonical solutions of the general Legendre equation

(1 − x²) y′′ − 2x y′ + [ n(n + 1) − m²/(1 − x²) ] y = 0.

This equation has solutions that are nonsingular on [−1, 1] only if n and m are integers with 0 ≤ m ≤ n. When in addition m is even, the function is a polynomial. When m is zero and n is an integer, these functions are identical to the Legendre polynomials. These functions are denoted Pn^m(x). Their most straightforward definition is in terms of derivatives of the ordinary Legendre polynomials (m ≥ 0):

Pn^m(x) = (−1)^m (1 − x²)^{m/2} d^m/dx^m Pn(x).

Hence, by the Rodrigues formula, one obtains

Pn^m(x) = (−1)^m/(2^n n!) (1 − x²)^{m/2} d^{n+m}/dx^{n+m} (x² − 1)^n.

The coefficients cn and cn^m of the expansion of a function f(x) into a series of Legendre and associated Legendre polynomials are calculated as

cn = (2n + 1)/2 ∫_{−1}^{1} f(x) Pn(x) dx,

cn^m = (2n + 1)(n − m)! / (2(n + m)!) ∫_{−1}^{1} f(x) Pn^m(x) dx.

Theorem 5 (Schoenberg, 1942). Let k(x1, x2) = f(⟨x1, x2⟩) be a function defined on S × S ⊂ R^{m+3} × R^{m+3}, where S is the unit sphere, and f : [−1, 1] → R is a function with an expansion into Legendre polynomials Pn^m,

f(t) = Σ_{n=0}^∞ an Pn^m(t).

Then k is a kernel if and only if all an ≥ 0.

Theorem 6 (Schoenberg, 1942). Let k(x1, x2) = f(⟨x1, x2⟩) be a function defined on S × S ⊂ H × H, where S is the unit sphere in an infinite-dimensional Hilbert space H, and f : [−1, 1] → R is a function with a power series expansion

f(t) = Σ_{n=0}^∞ an t^n.

Then k is a kernel if and only if all an ≥ 0.

Remark 1. In order to prove positive definiteness for arbitrary dimensions, it suffices to show that the Taylor expansion contains only nonnegative coefficients. On the other hand, in order to prove that a candidate kernel function will never be positive definite, it is sufficient to exhibit a negative coefficient in its expansion into the Legendre polynomials Pn.

Example 5. The function

k(x1, x2) = tanh(a⟨x1, x2⟩ + b)

is not a kernel for any a, b ∈ R, a ≠ 0. We would have to show that the function does not satisfy the conditions of Theorem 5. Since this is very technical, we refer the reader to the work of Ovari (2000) for details, and explain how the method works in the simpler case of Theorem 6. The Taylor series of tanh(at + b) equals

tanh b + (1 − tanh² b) a t + (tanh³ b − tanh b) a² t² + . . .

Since the coefficients have to be nonnegative, we have tanh b ≥ 0 and tanh³ b − tanh b ≥ 0. Hence b ≥ 0, and if b ≠ 0 then tanh² b ≥ 1, a contradiction. If b = 0, the expansion equals at − a³t³/3 + . . . , and then a = 0, a contradiction.
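The failure can also be observed numerically (our sketch, for one convenient parameter choice): with b chosen so that tanh(a + b) < 0, the diagonal of the Gram matrix on the unit sphere is already negative, so the matrix cannot be positive semidefinite.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # 40 points on the unit sphere in R^3

a, b = 1.0, -2.0
K = np.tanh(a * (X @ X.T) + b)    # candidate sigmoid "kernel" matrix
# Here tanh(a + b) < 0, so the diagonal entries are negative and the matrix
# cannot be positive semidefinite; other choices of (a, b) also fail, although
# exhibiting a negative eigenvalue may then require more sample points.
print(K[0, 0], np.linalg.eigvalsh(K).min())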

Example 6. By computer computations we obtain that for parameters a, b ∈ {−3, −2, . . . , 2, 3}, a ≠ 0, every function

f(x) := 1 / (1 + exp(ax + b))

has a negative coefficient in its expansion into the Legendre polynomial series; therefore k(x1, x2) = f(⟨x1, x2⟩) is not a kernel.
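One way to carry out such a computation (our sketch, not necessarily the authors' code) is Gauss-Legendre quadrature of the coefficient formula cn = (2n + 1)/2 ∫ f(x)Pn(x) dx; eval_legendre comes from SciPy, and the parameter pair below is one of those listed above.

import numpy as np
from scipy.special import eval_legendre

def legendre_coeffs(f, n_max=20, n_quad=200):
    # c_n = (2n+1)/2 * integral_{-1}^{1} f(x) P_n(x) dx, by Gauss-Legendre quadrature.
    nodes, weights = np.polynomial.legendre.leggauss(n_quad)
    fx = f(nodes)
    return np.array([(2 * n + 1) / 2 * np.sum(weights * fx * eval_legendre(n, nodes))
                     for n in range(n_max + 1)])

a, b = 2.0, 1.0                      # one of the parameter pairs from Example 6
f = lambda x: 1.0 / (1.0 + np.exp(a * x + b))
c = legendre_coeffs(f)
print(c.min())                       # a negative coefficient appears, so f(<x1, x2>) is not a kernel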

Example 7. Consider the function

f(x) := 1 / (1 − exp(ax − b)).

The function f is well defined for 0 < a < b, x ∈ [−1, 1], and its Taylor series equals

e^b/(e^b − 1) + e^b a x/((e^b)² − 2e^b + 1) + ((e^b)² + e^b) a² x²/(2(e^b)³ − 6(e^b)² + 6e^b − 2) + . . .
= Σ_{n=0}^∞ cn a^n x^n/(e^b − 1)^{n+1},   cn > 0.

Thus all coefficients of the series are nonnegative and k(x1, x2) = f(⟨x1, x2⟩) is a kernel on the product of the unit spheres.

The next corollaries are simple consequences of the definition of kernels and Remark 1.

Corollary 3. Let k : Y × Y → R be a map and let T : X → Y be a map such that Y = T(X). Then the map

k̃(x1, x2) := k(T(x1), T(x2))

is a kernel on X × X if and only if k is a kernel.
Corollary 4. Let T : X → H be a map such that T(X) ⊂ S, where S is the unit sphere in a Hilbert space H (finite-dimensional (R^n) or infinite-dimensional). Let f : [−1, 1] → R be a function with Taylor expansion

f(t) = Σ_{i=0}^∞ ai t^i.

If all ai ≥ 0, then the map

k(x1, x2) := f(⟨T(x1), T(x2)⟩)

is a kernel on X × X.
Corollary 5. Let T : X → H be a map such that S ⊂ T(X), where S is the unit sphere in a Hilbert space H (finite-dimensional (R^n) or infinite-dimensional). Let f : [−1, 1] → R be a function with an expansion into Legendre polynomials

f(t) = Σ_{i=0}^∞ ai Pi(t).

If some ai < 0, then the map

k(x1, x2) := f(⟨T(x1), T(x2)⟩)

is not a kernel on X × X.
Example 8. For any transformation T from X onto the unit sphere S, the function

k(x1, x2) := tanh(a⟨T(x1), T(x2)⟩ + b)

is not a kernel for any parameters a, b with a ≠ 0.

Corollary 6. Let f : D ⊂ R → R be a function which can be written as f(t) = g(at + b), where g is some function, and suppose the Legendre polynomial expansion of f has some negative coefficient for any a, b. Then for any kernel k0(x1, x2) = ⟨Φ(x1), Φ(x2)⟩ such that S(0, r) ⊂ Φ(X) (S(0, r) being the sphere with radius r > 0 and center at 0), the function

k(x1, x2) := f(k0(x1, x2)) = g(a k0(x1, x2) + b)

is not a kernel for any parameters a, b.

Proof. Since S(0, r) ⊂ Φ(X), we have S(0, 1) ⊂ (1/r)Φ(X). Then, by Corollary 5,

g(r² a ⟨(1/r)Φ(x1), (1/r)Φ(x2)⟩ + b) = g(a⟨Φ(x1), Φ(x2)⟩ + b) = k(x1, x2)

is not a kernel.

Example 9. For any kernel k0(x1, x2) = ⟨Φ(x1), Φ(x2)⟩ such that S(0, r) ⊂ Φ(X), the function

k(x1, x2) := tanh(a k0(x1, x2) + b)

is not a kernel for any parameters a, b with a ≠ 0.

Theorem 7 (Burges, 1999). Let k(x1, x2) = f(⟨x1, x2⟩) be a dot product kernel, where f : R → R is a differentiable function. Then

f(t) ≥ 0,   f′(t) ≥ 0,   f′(t) + t f′′(t) ≥ 0

for any t ≥ 0.

Example 10. Let k(x1, x2) = exp(−⟨x1, x2⟩). Then f(t) = e^{−t} and f′(t) = −e^{−t} < 0, thus k is not a kernel.

5. Inverse of the stereographic projection

We see that we have many methods of checking whether a function is a kernel on the sphere. So if we have observations in R^n, we can use the inverse of the stereographic projection T into S ⊂ R^{n+1} (see below), and then the superposition of any kernel k on the sphere with T will be a kernel. Similarly, we can use this technique if we have a kernel on the sphere which is not a kernel on the whole space.

Example 11. Let us take the second Legendre polynomial P2(t) = −1/2 + (3/2)t². The function

k(x1, x2) := P2(⟨x1, x2⟩) = −1/2 + (3/2)⟨x1, x2⟩²

is, by Theorem 5, a kernel on the unit sphere S ⊂ R³. But k is not a kernel on any subset of R² containing zero. Indeed, for c ≠ 0 we have c² k(0, 0) < 0, so, by Definition 3, k is not a kernel. Even if we exclude zero from the subset, k is not a kernel, because lim_{x→0} k(x, x) = −1/2.

We construct the inverse of the stereographic projection and introduce a new metric on R^n induced from the Euclidean metric on the unit sphere in R^{n+1}. This metric could be used not only for constructing kernels but also directly in, for example, classification methods.
We define a map h : R^n ∪ {∞} → R^{n+1},

R^n ∋ x = (x¹, . . . , xⁿ) ↦ y = (y¹, . . . , yⁿ, y^{n+1}),   ∞ ↦ (0, . . . , 0, 1).

To find y = h(x) for x ∈ R^n, we take the n-dimensional sphere in R^{n+1} with center at the point (0, . . . , 0, 1/2) and radius 1/2. We draw a line through the points (x¹, . . . , xⁿ, 0) and (0, . . . , 0, 1). The intersection of the line and the sphere (other than (0, . . . , 0, 1)) is the resulting point y = h(x).
The equation of the sphere is (y¹)² + · · · + (yⁿ)² + (y^{n+1} − 1/2)² = 1/4, and the parametric equations of the line are

y¹ = x¹t, . . . , yⁿ = xⁿt, y^{n+1} = 1 − t,   t ∈ R.

Then the intersection point y = (y¹, . . . , yⁿ, y^{n+1}) has the coordinates

y¹ = x¹/(‖x‖² + 1), . . . , yⁿ = xⁿ/(‖x‖² + 1), y^{n+1} = ‖x‖²/(‖x‖² + 1).

Now we define a metric d on R^n by

d(x1, x2) := d^{(n+1)}(h(x1), h(x2)),

where d^{(n+1)} is the usual Euclidean metric on R^{n+1}. We have

d(x1, x2) = ‖x1 − x2‖ / ( √(‖x1‖² + 1) √(‖x2‖² + 1) ).

The metric d has the following properties:

d(x1, x2) < 1,   d(x, ∞) = 1/√(‖x‖² + 1),   d(0, ∞) = 1

for x1, x2, x ∈ R^n.
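A short numerical sketch (ours) of the map h and the induced metric d, checking that the closed-form expression for d agrees with the Euclidean distance between the images h(x1) and h(x2); the test points are arbitrary.

import numpy as np

def h(x):
    # Inverse stereographic projection of R^n onto the sphere with
    # center (0, ..., 0, 1/2) and radius 1/2 in R^{n+1}.
    s = np.sum(x**2)
    return np.append(x, s) / (s + 1.0)

def d(x1, x2):
    # Closed-form expression for the induced metric on R^n.
    return np.linalg.norm(x1 - x2) / np.sqrt((np.sum(x1**2) + 1) * (np.sum(x2**2) + 1))

x1 = np.array([1.0, -2.0, 0.5])
x2 = np.array([0.3, 4.0, -1.0])
print(np.linalg.norm(h(x1) - h(x2)))  # Euclidean distance between the images
print(d(x1, x2))                      # the same value, always < 1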
If we need a map onto the unit sphere S = S(0, 1), we can define

h̃(x) := 2( h(x) − (0, . . . , 0, 1/2) ).

Then we have d̃(x1, x2) = 2 d(x1, x2).
If we need to transform the closed ball B̄ ⊂ R^n (with center x0 and radius r > 0) onto a sphere S ⊂ R^{n+1}, we can take a transformation T : B̄ ⊂ R^n → R^n ∪ {∞} defined by

T(x) := g(‖x − x0‖/r)(x − x0) for x ∈ B,   T(x) := ∞ for x ∈ B̄ \ B,

where g is some function which maps [0, 1) onto [0, ∞), for example arctanh(t) or tan(πt/2). Then we define ĥ : B̄ ⊂ R^n → S ⊂ R^{n+1} as ĥ = h ◦ T or ĥ = h̃ ◦ T. Note that all points from the boundary of B̄ are mapped to the same point in R^{n+1}.

Example 12. If we have all the data in the unit ball B ⊂ R^n, we can transform the ball into the unit sphere S ⊂ R^{n+1}. We have

T(x) = g(‖x‖) x,

where g is a function as mentioned above. Then

⟨y1, y2⟩ = ( 4g̃(x1, x2) + (g̃(x1) − 1)(g̃(x2) − 1) ) / ( (g̃(x1) + 1)(g̃(x2) + 1) ),

where

g̃(x1, x2) = g(‖x1‖) g(‖x2‖) ⟨x1, x2⟩,   g̃(x) = g̃(x, x),

and yi = h̃(T(xi)). Thus we have a kernel on B × B,

k(x1, x2) := f(⟨y1, y2⟩),

for any function f which satisfies the conditions of Corollary 4.

6. Multivariable functions of kernels

We can generalize the method of superposing functions of one variable with kernels (Section 4) to the case of multivariable functions. If we have a function of n variables with a "good" Taylor expansion, then its superposition with n kernels is a kernel. This enables us to construct new kernels and simplifies checking that a function is a kernel.
The next theorems concern multivariable functions and follow directly from Theorem 2.

Theorem 8. Let ki : X × X → R (i = 1, . . . , n) be kernels. Let P : R^n → R be a polynomial of several variables with nonnegative coefficients:

P(t1, . . . , tn) = Σ_{i=1}^m ai t1^{i1} · · · tn^{in},   ai ≥ 0.

Then the function k : X × X → R defined by

k(x1, x2) := P(k1(x1, x2), . . . , kn(x1, x2))

is a kernel.

Theorem 9. Let ki : X × X → R (i = 1, . . . , n) be kernels. Let f : k1(X, X) × · · · × kn(X, X) → R be a function of several variables whose Taylor expansion has only nonnegative coefficients:

f(t1, . . . , tn) = Σ_{i=1}^∞ ai t1^{i1} · · · tn^{in},   ai ≥ 0.

Then the function k : X × X → R defined by

k(x1, x2) := f(k1(x1, x2), . . . , kn(x1, x2))

is a kernel.

Example 13. Let n ∈ N. If ki : X × X → (−1, 1), i = 1, . . . , n, are kernels, then the following functions are kernels:

K1(x1, x2) := Π_{i=1}^n 1/(1 − ki(x1, x2)),

K2(x1, x2) := Π_{i=1}^n (1 + ki(x1, x2))/(1 − ki(x1, x2)).

For ti ∈ (−1, 1), i = 1, . . . , n, we have

Σ_{α1,...,αn ∈ N∪{0}} t1^{α1} · · · tn^{αn} = Π_{i=1}^n 1/(1 − ti),

Σ_{α1,...,αn ∈ Z} t1^{|α1|} · · · tn^{|αn|} = Π_{i=1}^n (1 + ti)/(1 − ti).

Therefore, by Theorem 9, K1 and K2 are kernels.
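A small numerical illustration (ours) of Example 13: build K1 and K2 from a base kernel taking values in (−1, 1) (here a scaled Gaussian kernel, an arbitrary assumption) and check that the resulting Gram matrices have no negative eigenvalues.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 4))

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_base = 0.9 * np.exp(-sq_dists)          # a kernel with values in (0, 0.9], inside (-1, 1)

K1 = 1.0 / (1.0 - K_base)                 # Example 13 with n = 1
K2 = (1.0 + K_base) / (1.0 - K_base)

print(np.linalg.eigvalsh(K1).min())       # >= 0 up to rounding error
print(np.linalg.eigvalsh(K2).min())       # >= 0 up to rounding error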



Example 14. If k1, k2 are kernels and a, b, d ≥ 0, c > 0, m, n ∈ N, then

k(x1, x2) := (a + b k1(x1, x2))^m / (c − d k2(x1, x2))^n

is a kernel. Indeed, we have k(x1, x2) = f(k1(x1, x2), k2(x1, x2)), where

f(t1, t2) = (a + b t1)^m / (c − d t2)^n.

The partial derivatives are

∂^k/∂t1^k (a + b t1)^m = (m!/(m − k)!) b^k (a + b t1)^{m−k} for k ≤ m, and 0 for k > m,

∂^l/∂t2^l (c − d t2)^{−n} = ((n − 1 + l)!/(n − 1)!) d^l (c − d t2)^{−(n+l)} for l ∈ N,

and

∂^{k+l} f/∂t1^k ∂t2^l (0, 0) = (m!/(m − k)!) ((n − 1 + l)!/(n − 1)!) a^{m−k} b^k d^l / c^{n+l} for k ≤ m, and 0 for k > m.

Hence all coefficients in the Taylor expansion of the function f are nonnegative and, by Theorem 9, k is a kernel.

Example 15. Let k1i, k2i be kernels and ai, bi, di ≥ 0, ci > 0, mi, ni ∈ N, i = 1, . . . , n. Then

k(x1, x2) := Π_{i=1}^n (ai + bi k1i(x1, x2))^{mi} / (ci − di k2i(x1, x2))^{ni}

is a kernel and a generalization of the kernels from Examples 13–14.



In all the above examples we have to note that the Taylor expansions converge to the appropriate functions only if the kernel values lie in the region of convergence. In this example it must hold that |k2i(x1, x2)| < ci/di (when di ≠ 0) for i = 1, . . . , n.
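A brief sketch (ours) of the rational construction from Example 14, including the convergence check just mentioned; the base kernels and parameter values are arbitrary assumptions.

import numpy as np

def rational_kernel_matrix(K1, K2, a=1.0, b=2.0, c=3.0, d=1.0, m=2, n=1):
    # k = (a + b*k1)^m / (c - d*k2)^n, a kernel (Example 14) when a, b, d >= 0,
    # c > 0 and, as noted above, |k2| < c/d wherever d != 0.
    if d != 0 and np.max(np.abs(K2)) >= c / d:
        raise ValueError("k2 values must satisfy |k2| < c/d for the expansion to converge")
    return (a + b * K1) ** m / (c - d * K2) ** n

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_gauss = np.exp(-sq_dists)          # values in (0, 1], and 1 < c/d = 3
K_lin = X @ X.T

K = rational_kernel_matrix(K_lin, K_gauss)
print(np.linalg.eigvalsh(K).min())   # >= 0 up to rounding error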

7. Conclusion

We have shown a few methods of constructing kernels on metric spaces. We hope that this will be useful for researchers using kernel methods. The choice of a proper kernel is very difficult and corresponds to:
• choosing a similarity measure for the data,

• choosing a linear representation of the data,

• choosing a regularization functional,

• choosing a covariance function for correlated observations.

Therefore, this choice should reflect prior knowledge about the problem at
hand.

References

[1] M. Abramowitz and I.A. Stegun, chapters on Legendre functions and orthogonal polynomials, in: Handbook of Mathematical Functions, Dover Publications, New York 1972.
[2] B.E. Boser, I.M. Guyon and V.N. Vapnik, A training algorithm for optimal margin classifiers, in: D. Haussler (ed.), 5th Annual ACM Workshop on COLT, ACM Press, Pittsburgh (1992), 144–152.
[3] C.J.C. Burges, Geometry and invariance in kernel based methods, in: B. Schölkopf, C.J.C. Burges and A.J. Smola (eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge (1999), 89–116.
[4] C. Cortes and V. Vapnik, Support-Vector Networks, Machine Learning 20 (1995), 273–297.
[5] R. Herbrich, Learning Kernel Classifiers, MIT Press, London 2002.
[6] T. Hofmann, B. Schölkopf and A.J. Smola, Kernel methods in machine learning, Annals of Statistics 36 (2008), 1171–1220.
[7] Z. Ovari, Kernels, eigenvalues and support vector machines, Honours thesis, Australian National University, Canberra 2000.
[8] B. Schölkopf and A.J. Smola, Learning with Kernels, MIT Press, London 2002.
[9] B. Schölkopf, A.J. Smola and K.R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10 (1998), 1299–1319.
[10] I.J. Schoenberg, Positive definite functions on spheres, Duke Mathematical Journal 9 (1942), 96–108.
[11] A. Tarantola, Inverse Problem Theory and Methods for Model Parameter Estimation, SIAM, Philadelphia 2005.
[12] M. Zu, Kernels and ensembles: perspective on statistical learning, American Statistician 62 (2008), 97–109.

Received 8 March 2010
