Some Methods of Constructing Kernels

T. Górecki and Maciej Luczak
Department of Civil and Environmental Engineering
Koszalin University of Technology
Śniadeckich 2, 75–453 Koszalin, Poland
e-mail: [email protected]
Abstract
This paper is a collection of numerous methods and results concerning the design of kernel functions. It gives a short overview of methods of building kernels in metric spaces, especially R^n and S^n. However, we also present a new theory. The introduction of kernels was motivated by the search for non-linear patterns by means of linear functions in a feature space created by a non-linear feature map.
Keywords: positive definite kernel, dot product kernel, statistical
kernel, SVM, kPCA.
2010 Mathematics Subject Classification: 62H30, 68T10.
1. Introduction
K ≡ XX^T,
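For K ≡ XX^T, the (i, j) entry is the dot product ⟨x_i, x_j⟩ of the i-th and j-th data points (the rows of X), i.e., K is the Gram matrix. A minimal NumPy sketch of this fact (data and variable names are ours, for illustration only):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))     # 5 data points in R^3 (rows of X)

K = X @ X.T                     # Gram matrix: K[i, j] = <x_i, x_j>

# check a single entry against an explicit dot product
assert np.isclose(K[1, 2], X[1] @ X[2])
print(K.shape)                  # (5, 5), symmetric and positive semidefinite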
We now give two basic "kernelization" examples of well-known linear methods.
2.1. SVM
In the case of support vector machines, a data point is viewed as a p-dimensional vector. We want to know whether we can separate such points with a (p − 1)-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. However, we are additionally interested in finding out whether we can achieve maximum separation (margin) between the two classes. By this we mean that we choose the hyperplane such that the distance from the hyperplane to the nearest data point is maximized. In other words, the nearest distance between a point in one separated class and a point in the other separated class is maximized. If such a hyperplane exists, it is clearly of interest and is known as the maximum-margin hyperplane. Furthermore, such a linear classifier is known as a maximum-margin classifier. The samples lying on the margin are called the support vectors (the maximum-margin hyperplane, and hence the classification task, depends only on the support vectors). To calculate the margin, two parallel hyperplanes are constructed, one on each side of the separating hyperplane, which are "pushed up against" the two data sets.
Figure 1. (a) H1 does not separate the two classes; H2 does, but only with a small margin; H3 separates them with the maximum margin. (b) A correct map (kernel) changes a non-linear classifier into a linear one in a higher-dimensional space.
If two classes are perfectly separable, then there exists an infinite number of separating hyperplanes. The SVM method is based on the idea that the best hyperplane for separating the two classes is the one that is farthest away from the training points.
Intuitively, a good separation is achieved by the hyperplane with the largest distance to the neighboring data points of both classes. In general, the larger the margin, the lower the generalization error of the classifier (Figure 1a).
Cortes and Vapnik (1995) suggested a modified maximum-margin idea (the soft margin) allowing for mislabeled examples (in general we cannot assume that the two classes are always perfectly separable). If there exists no hyperplane splitting the examples, the method chooses a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples. The method introduces slack variables ξ_i ≥ 0, which measure the degree of misclassification of the datum x_i. The objective function is then increased by a term which penalizes non-zero ξ_i, and the optimization becomes a trade-off between a large margin and a small error penalty.
However, one cannot expect a linear classifier to succeed in general situations, no matter how optimal the hyperplane is. Boser, Guyon and Vapnik (1992) suggested a way to create non-linear classifiers by applying the kernel trick to maximum-margin hyperplanes. The resulting algorithm is formally similar, except that every dot product is replaced by a non-linear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be non-linear and the transformed space may be high-dimensional. Although the classifier is a hyperplane in the high-dimensional feature space, it may be non-linear in the original input space (Figure 1b).
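A minimal sketch of this effect using scikit-learn (the data set, kernel choice and parameters below are our own illustration, not taken from the paper): a linear SVM fails on two concentric classes, while the same algorithm with an RBF kernel separates them almost perfectly.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# two concentric classes: not linearly separable in the input space
X, y = make_circles(n_samples=300, noise=0.05, factor=0.4, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)  # dot products replaced by a Gaussian kernel

print("linear kernel accuracy:", linear_svm.score(X, y))  # near chance level
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))     # close to 1.0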
SVMs belong to the family of generalized linear classifiers. They can also be considered a special case of Tikhonov regularization (the most commonly used method of regularization of ill-posed problems; in statistics the method is also known as ridge regression, see Tarantola, 2005).
Figure 2. (a) Input points before kernel PCA. (b) Output after kernel PCA. The three groups are distinguishable using the first component only.
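The kernel PCA step referred to in Figure 2 can be sketched as follows (a minimal scikit-learn example; the data set and the RBF kernel parameter are our own choices, not the paper's):

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# two concentric groups, which linear PCA cannot pull apart
X, y = make_circles(n_samples=300, noise=0.05, factor=0.4, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=5.0)
Z = kpca.fit_transform(X)

# after the non-linear map the leading components reflect the radial structure,
# so the groups become (nearly) separable along the first component
print("group 0, first component range:", Z[y == 0, 0].min(), Z[y == 0, 0].max())
print("group 1, first component range:", Z[y == 1, 0].min(), Z[y == 1, 0].max())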
3. Basic kernels
K_{ij} = k(x_i, x_j)
The key idea of the kernel technique is to invert the chain of arguments, i.e., to choose a kernel k rather than a mapping before applying a learning algorithm. Not every symmetric function can be a kernel. The necessary and sufficient conditions for this are given by Mercer's theorem.
* Some authors call this nonnegative definite.
Then
1. (λ_i)_{i∈N} ∈ l^1,
2. ψ_i ∈ L^∞(X),
3. k(x1, x2) = Σ_{i=1}^∞ λ_i ψ_i(x1) ψ_i(x2).
• Kernels: ⟨x1, x2⟩, e^{−‖x1 − x2‖^2}, e^{⟨x1, x2⟩};
• Not kernels: ‖x1 − x2‖^2, −‖x1 − x2‖^2, −⟨x1, x2⟩, e^{‖x1 − x2‖^2}, e^{−⟨x1, x2⟩}.
Σ_{i,j} c_i c_j ⟨x_i, x_j⟩ = Σ_{i,j} c_i c_j Σ_k x_i^k x_j^k
                          = Σ_k Σ_{i,j} c_i x_i^k c_j x_j^k
                          = Σ_k (Σ_i c_i x_i^k) (Σ_j c_j x_j^k)
                          = Σ_k (Σ_i c_i x_i^k)^2 ≥ 0.
∫_X ∫_X ⟨x1, x2⟩ f(x1) f(x2) dx1 dx2 = ∫_X ∫_X Σ_i x1^i x2^i f(x1) f(x2) dx1 dx2
                                     = Σ_i ∫_X ∫_X x1^i f(x1) x2^i f(x2) dx1 dx2
                                     = Σ_i (∫_X x1^i f(x1) dx1) (∫_X x2^i f(x2) dx2)
                                     = Σ_i (∫_X x^i f(x) dx)^2 ≥ 0.
for x1 ≠ x2, c1 c2 < 0.
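A quick numerical way to expose a non-kernel (our own sketch, not part of the paper): build the Gram matrix of a candidate function on a few random points and look at its smallest eigenvalue. A true kernel must yield a positive semidefinite matrix for every data set, so a single negative eigenvalue already disqualifies the candidate.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

def gram(func, X):
    """Gram matrix K[i, j] = func(x_i, x_j)."""
    return np.array([[func(a, b) for b in X] for a in X])

candidates = {
    "<x1, x2>":            lambda a, b: a @ b,
    "exp(-||x1 - x2||^2)": lambda a, b: np.exp(-np.sum((a - b) ** 2)),
    "-<x1, x2>":           lambda a, b: -(a @ b),
    "||x1 - x2||^2":       lambda a, b: np.sum((a - b) ** 2),
}

for name, func in candidates.items():
    lam_min = np.linalg.eigvalsh(gram(func, X)).min()
    print(f"{name:22s} smallest eigenvalue: {lam_min: .4f}")
# the first two stay (numerically) nonnegative, the last two go clearly negative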
4. Functions of kernels
The following facts show that we can create new kernels from existing ones using a number of simple operations. In this way we can build complex kernels by combining simpler kernels through basic operations.
Theorem 2 (Herbrich, 2002). Let k_i : X × X → R be any kernels. Then the functions k : X × X → R given by
4. k(x1, x2) = k1(x1, x2) / √(k1(x1, x1) k1(x2, x2)),
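Item 4 is the usual cosine-type normalization of a kernel; the diagonal of the resulting Gram matrix is identically 1. A small NumPy sketch of this operation on a precomputed Gram matrix (our own helper, for illustration):

import numpy as np

def normalize_kernel(K):
    """Item-4 normalization: K[i, j] / sqrt(K[i, i] * K[j, j])."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
K = np.exp(X @ X.T)                 # e^{<x1, x2>} is a kernel (see Section 3)
Kn = normalize_kernel(K)

print(np.allclose(np.diag(Kn), 1.0))                 # True
print(np.linalg.eigvalsh(Kn).min() >= -1e-10)        # still positive semidefinite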
k̃(x1, x2) := P(k(x1, x2))
is a kernel.
k̃(x1, x2) := f(k(x1, x2))
is a kernel.
k2(x1, x2) = arctan(k(x1, x2))
P_n(x) = 1/(2^n n!) d^n/dx^n (x^2 − 1)^n.
Associated Legendre functions are the canonical solutions of the general
Legendre equation
(1 − x^2) y″ − 2x y′ + [n(n + 1) − m^2/(1 − x^2)] y = 0.
This equation has solutions that are nonsingular on [−1, 1] only if n and m are integers with 0 ≤ m ≤ n. When in addition m is even, the function is a polynomial. When m is zero and n is an integer, these functions are identical to the Legendre polynomials. These functions are denoted P_n^m(x). Their most straightforward definition is in terms of derivatives of the ordinary Legendre polynomials (m ≥ 0):
P_n^m(x) = (−1)^m (1 − x^2)^{m/2} d^m/dx^m P_n(x).
P_n^m(x) = (−1)^m/(2^n n!) (1 − x^2)^{m/2} d^{n+m}/dx^{n+m} (x^2 − 1)^n.
The coefficients c_n and c_n^m of the expansion of a function f(x) into a series of Legendre and associated Legendre polynomials are calculated as
c_n = (2n + 1)/2 ∫_{−1}^{1} f(x) P_n(x) dx,
c_n^m = (2n + 1)(n − m)! / (2(n + m)!) ∫_{−1}^{1} f(x) P_n^m(x) dx.
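A short numerical sketch of the first formula (our own example; SciPy's eval_legendre and quad are used, and f(x) = e^x is an arbitrary test function):

import numpy as np
from scipy.integrate import quad
from scipy.special import eval_legendre

def legendre_coefficient(f, n):
    """c_n = (2n + 1)/2 * integral_{-1}^{1} f(x) P_n(x) dx."""
    integral, _ = quad(lambda x: f(x) * eval_legendre(n, x), -1.0, 1.0)
    return (2 * n + 1) / 2.0 * integral

f = np.exp                                   # test function on [-1, 1]
coeffs = [legendre_coefficient(f, n) for n in range(6)]

# a partial sum of the Legendre series should approximate f on [-1, 1]
x = 0.3
approx = sum(c * eval_legendre(n, x) for n, c in enumerate(coeffs))
print(approx, f(x))                          # the two values are close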
k(x1, x2) = tanh(a⟨x1, x2⟩ + b)
is not a kernel for any a, b ∈ R, a ≠ 0. We have to show that this function does not satisfy the conditions of Theorem 5. Since this is very technical, we refer the reader to the work of Ovari (2000) for details, and explain how the method works in the simpler case of Theorem 6. The Taylor series of tanh(at + b) is equal to
f(x) := 1/(1 + exp(ax + b)),
f(x) := 1/(1 − exp(ax − b)).
The function f is well defined for 0 < a < b, x ∈ [−1, 1], and its Taylor series equals
e^b/(e^b − 1) + e^b a x/((e^b)^2 − 2 e^b + 1) + (e^{2b} + e^b) a^2 x^2/(2 (e^b)^3 − 6 (e^b)^2 + 6 e^b − 2) + ...
= Σ_{n=0}^∞ c_n a^n x^n/(e^b − 1)^{n+1},   c_n > 0.
Thus all coefficients of the series are nonnegative and k(x1, x2) = f(⟨x1, x2⟩) is a kernel on the product of the unit spheres.
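The positivity of these coefficients can be checked symbolically (a small SymPy sketch of our own; the particular values a = 1, b = 2 satisfy 0 < a < b and are chosen only for illustration):

import sympy as sp

x = sp.symbols('x')
a, b = 1, 2                      # any 0 < a < b works; these values are arbitrary
f = 1 / (1 - sp.exp(a * x - b))

# Taylor expansion of f around x = 0 up to order 5
poly = sp.series(f, x, 0, 6).removeO()
for n in range(6):
    coeff = sp.simplify(poly.coeff(x, n))
    print(n, sp.N(coeff))        # all printed coefficients come out positive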
The next corollaries are simple consequences of the definition of kernels and Remark 1.
Corollary 3. Let k : Y × Y → R be a map and let T : X → Y be a map such that Y = T(X). Then the map
k̃(x1, x2) := k(T(x1), T(x2))
is a kernel on X × X if and only if k is a kernel.
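A sketch of Corollary 3 in code (our own illustration): precompose an ordinary kernel with a map T. Here T is a hypothetical feature map that we made up for the example, and the base kernel is the linear one.

import numpy as np

def T(x):
    """Hypothetical feature map X -> Y (here: append the squared norm)."""
    return np.append(x, x @ x)

def k(y1, y2):
    """Base kernel on Y x Y (linear kernel)."""
    return y1 @ y2

def k_tilde(x1, x2):
    """Corollary 3: k_tilde(x1, x2) := k(T(x1), T(x2)) is again a kernel."""
    return k(T(x1), T(x2))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
K = np.array([[k_tilde(a, b) for b in X] for a in X])
print(np.linalg.eigvalsh(K).min() >= -1e-10)     # positive semidefinite: True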
Corollary 4. Let T : X → H be a map such that T(X) ⊂ S, where S is the unit sphere in a Hilbert space H (finite-dimensional (R^n) or infinite-dimensional). Let f : [−1, 1] → R be a function with Taylor expansion
f(t) = Σ_{i=0}^∞ a_i t^i.
is not a kernel.
Example 9. For any kernel k0(x1, x2) = ⟨Φ(x1), Φ(x2)⟩ such that S(0, r) ⊂ Φ(X), the function
k(x1, x2) := tanh(a k0(x1, x2) + b)
for any t ≥ 0.
Example 10. Let k(x1, x2) = exp(−⟨x1, x2⟩). Then f(t) = e^{−t}, f′(t) = −e^{−t},
R^n ∋ x = (x^1, ..., x^n) ↦ y = (y^1, ..., y^n, y^{n+1}),    ∞ ↦ (0, ..., 0, 1),
y^1 = x^1 t, ..., y^n = x^n t, y^{n+1} = 1 − t, t ∈ R,
y^1 = x^1/(‖x‖^2 + 1), ..., y^n = x^n/(‖x‖^2 + 1), y^{n+1} = ‖x‖^2/(‖x‖^2 + 1),
d(x1, x2) = ‖x1 − x2‖ / (√(‖x1‖^2 + 1) √(‖x2‖^2 + 1)),
d(x1, x2) < 1, d(x, ∞) = 1/√(‖x‖^2 + 1), d(0, ∞) = 1
for x1, x2 ∈ R^n.
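A numerical sanity check of the map and of the induced distance formula above (our own sketch; h below implements the displayed coordinates y^1, ..., y^{n+1}):

import numpy as np

def h(x):
    """Map R^n -> R^{n+1}: y^i = x^i/(||x||^2 + 1), y^{n+1} = ||x||^2/(||x||^2 + 1)."""
    s = x @ x
    return np.append(x, s) / (s + 1.0)

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 3))

lhs = np.linalg.norm(h(x1) - h(x2))                       # distance between the images
rhs = np.linalg.norm(x1 - x2) / (np.sqrt(x1 @ x1 + 1) * np.sqrt(x2 @ x2 + 1))
print(np.isclose(lhs, rhs))                               # True

inf_image = np.array([0.0, 0.0, 0.0, 1.0])                # image of the point at infinity
print(np.isclose(np.linalg.norm(h(x1) - inf_image), 1 / np.sqrt(x1 @ x1 + 1)))  # True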
If we need a map onto the unit sphere S = S(0, 1), we can define
where g is some function which maps [0, 1] onto [0, ∞), for example arctanh(t) or tan(πt/2). Then we define ĥ : B̄ ⊂ R^n → S ⊂ R^{n+1} as ĥ = h ∘ T or ĥ = h̃ ∘ T. Note that all points of the boundary of B̄ are mapped to the same point in R^{n+1}.
Example 12. If we have all data in the unit ball B ⊂ R^n, we can transform the ball into the unit sphere S ⊂ R^{n+1}. We have
T(x) = g(‖x‖) x,
where
is a kernel.
f(t1, ..., tn) = Σ_{i=1}^∞ a_i t1^{i_1} ··· tn^{i_n},   a_i ≥ 0.
k(x1, x2) := f(k1(x1, x2), ..., kn(x1, x2))
is a kernel.
K1(x1, x2) := ∏_{i=1}^n 1/(1 − k_i(x1, x2)),
K2(x1, x2) := ∏_{i=1}^n (1 + k_i(x1, x2))/(1 − k_i(x1, x2)).
Σ_{α1,...,αn ∈ N∪{0}} t1^{α1} ··· tn^{αn} = ∏_{i=1}^n 1/(1 − t_i),
Σ_{α1,...,αn ∈ Z} t1^{|α1|} ··· tn^{|αn|} = ∏_{i=1}^n (1 + t_i)/(1 − t_i).
k(x1, x2) := (a + b k1(x1, x2))^m / (c − d k2(x1, x2))^n
is a kernel. Indeed, we have k(x1, x2) = f(k1(x1, x2), k2(x1, x2)), where
f(t1, t2) = (a + b t1)^m / (c − d t2)^n.
∂^l/∂t2^l (c − d t2)^{−n} = (n − 1 + l)!/(n − 1)! · d^l (c − d t2)^{−(n+l)}   for l ∈ N
and
∂^{k+l}f/(∂t1^k ∂t2^l)(0, 0) = m!/(m − k)! · (n − 1 + l)!/(n − 1)! · a^{m−k} b^k d^l / c^{n+l}   for k ≤ m,
∂^{k+l}f/(∂t1^k ∂t2^l)(0, 0) = 0   for k > m.
k(x1, x2) := ∏_{i=1}^n (a_i + b_i k_{1i}(x1, x2))^{m_i} / (c_i − d_i k_{2i}(x1, x2))^{n_i}
In all of the above examples we have to note that the Taylor expansions converge to the appropriate functions only where the kernel values lie in the region of convergence. In this example it must hold that |k_{2i}(x1, x2)| < c_i/d_i (d_i ≠ 0) for i = 1, ..., n.
7. Conclusion
The choice of kernel should therefore reflect prior knowledge about the problem at hand.
References
[1] M. Abramowitz and I.A. Stegun, chapters Legendre Functions and Orthogonal Polynomials in: Handbook of Mathematical Functions, Dover Publications, New York 1972.
[2] B.E. Boser, I.M. Guyon and V.N. Vapnik, A training algorithm for optimal margin classifiers, in: D. Haussler (ed.), 5th Annual ACM Workshop on COLT, ACM Press, Pittsburgh (1992), 144–152.
[3] C.J.C. Burges, Geometry and invariance in kernel based methods, in: B. Schölkopf, C.J.C. Burges and A.J. Smola (eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge (1999), 89–116.
[4] C. Cortes and V. Vapnik, Support-Vector Networks, Machine Learning 20
(1995), 273–297.
[5] R. Herbrich, Learning Kernel Classifiers, MIT Press, London 2002.
[6] T. Hofmann, B. Schölkopf and A.J. Smola, Kernel methods in machine learning, Annals of Statistics 36 (2008), 1171–1220.
[7] Z. Ovari, Kernels, eigenvalues and support vector machines, Honours thesis,
Australian National University, Canberra 2000.
[8] B. Schölkopf and A.J. Smola, Learning with Kernels, MIT Press, London 2002.
[9] B. Schölkopf, A.J. Smola and K.R. Müller, Nonlinear component analysis as
a kernel eigenvalue problem, Neural Computation 10 (1998), 1299–1319.
[10] I.J. Schoenberg, Positive definite functions on spheres, Duke Mathematical
Journal 9 (1942), 96–108.
[11] A. Tarantola, Inverse Problem Theory and Methods for Model Parameter Estimation, SIAM, Philadelphia 2005.
[12] M. Zhu, Kernels and ensembles: perspectives on statistical learning, The American Statistician 62 (2008), 97–109.