Lecture 8 - Kernels

The document discusses the kernel trick in support vector machines (SVMs), explaining how kernel functions allow for efficient computation of dot products in high-dimensional feature spaces without explicitly mapping the data. It also covers the properties of valid kernel functions, including symmetry and positive semidefiniteness, as well as Mercer’s theorem, which provides a way to verify whether a function is a valid kernel. Additionally, it outlines how to construct new kernels from existing ones and covers kernelized linear and logistic regression.


Artificial Intelligence II (CS4442 & CS9542)

More on Kernels

Boyu Wang
Department of Computer Science
University of Western Ontario
Kernel Trick

Kernel trick

▶ Recall: in SVMs, for a feature mapping function $\phi: \mathbb{R}^n \to \mathbb{R}^{n'},\ x \mapsto \phi(x)$, we define the kernel function as

  $k(x, z) = \phi(x)^\top \phi(z)$

▶ In other words, kernel functions are ways of expressing dot products in some feature space.

▶ If we work with a dual formulation of the learning algorithm, we do not actually have to deal with the feature mapping $\phi$. We just have to compute the kernel function $k(x, z)$.
Kernel trick

$x \in \mathbb{R}^n \to \phi(x) \in \mathbb{R}^{n'}, \qquad k(x, z) = \phi(x)^\top \phi(z)$

Training

  $\max_\alpha \; \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j k(x_i, x_j)$

  $\text{s.t.} \;\; \sum_{i=1}^m \alpha_i y_i = 0 \;\text{ and }\; \alpha_i \geq 0, \; i = 1, \ldots, m$

Prediction

  $f(x) = \operatorname{sgn}\left( \sum_{i=1}^m \alpha_i y_i k(x_i, x) + b \right)$

▶ The computation does not depend on $n$ or $n'$, but on the number of training instances $m$.

▶ $\phi$ can map $x \in \mathbb{R}^n$ to $\phi(x) \in \mathbb{R}^{n'}$, where $n'$ can be much larger than $n$ (even infinite).
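The lecture contains no code, but a small sketch may make the point concrete: with the dual formulation, a solver only ever needs kernel values, never $\phi$. The sketch below is illustrative, assuming NumPy and scikit-learn (neither is named in the slides); the helper `rbf_kernel_matrix`, the synthetic data, and $\sigma = 1$ are my own choices.

```python
# Minimal sketch (not from the lecture): an SVM trained from kernel values alone,
# using scikit-learn's SVC with kernel='precomputed'. All names and data are illustrative.
import numpy as np
from sklearn.svm import SVC

def rbf_kernel_matrix(A, B, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for every pair of rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
y_train = np.where(X_train[:, 0] ** 2 + X_train[:, 1] ** 2 > 1.0, 1, -1)  # nonlinearly separable labels
X_test = rng.normal(size=(20, 2))

K_train = rbf_kernel_matrix(X_train, X_train)   # m x m matrix of k(x_i, x_j)
K_test = rbf_kernel_matrix(X_test, X_train)     # kernel values between test and training points

clf = SVC(kernel='precomputed')                 # the solver never sees phi(x)
clf.fit(K_train, y_train)
y_pred = clf.predict(K_test)                    # sgn(sum_i alpha_i y_i k(x_i, x) + b)
```

Both training and prediction consume only kernel values, so the cost is governed by $m$, not by $n$ or $n'$, as noted above.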
Nonlinear mapping and kernel trick

[Figure: illustration of nonlinear mapping and the kernel trick, from https://towardsdatascience.com/the-kernel-trick-c98cdbcaeb3f]
How to find a valid kernel

By definition, a kernel function $k$ must be the inner product in the feature space defined by $\phi$: $k(x, z) = \phi(x)^\top \phi(z)$.

▶ First define $\phi(x)$, then $k(x, z) = \phi(x)^\top \phi(z)$:

  $\phi(x) = x \;\Rightarrow\; k(x, z) = x^\top z$

▶ In practice, we define a kernel function $k$ directly:

  $k(x, z) = (x^\top z)^2$

  Is this a kernel?

▶ Let $x = [x_1, \ldots, x_n]^\top$ and $z = [z_1, \ldots, z_n]^\top$ (notation is overloaded); we have

  $k(x, z) = \left( \sum_{i=1}^n x_i z_i \right)^2 = \sum_{i,j=1}^n x_i z_i x_j z_j = \sum_{i,j=1}^n (x_i x_j)(z_i z_j)$

▶ Hence, it is a valid kernel, with feature mapping

  $\phi(x) = [x_1^2, x_1 x_2, \ldots, x_1 x_n, x_2 x_1, x_2^2, \ldots, x_n^2]^\top \in \mathbb{R}^{n^2}$

  The feature vector includes all squares of elements and all cross terms.
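As a quick numerical sanity check of the expansion above (my own addition, assuming NumPy), the sketch below builds the explicit feature map containing all products $x_i x_j$ and confirms that its inner product reproduces $(x^\top z)^2$; the helper name `phi` and the random test vectors are illustrative.

```python
# Sketch: numerically confirm that k(x, z) = (x^T z)^2 is the inner product of the
# explicit feature map phi(x) = [x_i * x_j for all i, j] in R^{n^2}.
import numpy as np

def phi(x):
    # outer product flattened: entries x_i * x_j in a fixed order
    return np.outer(x, x).ravel()

rng = np.random.default_rng(1)
x, z = rng.normal(size=5), rng.normal(size=5)

k_direct = (x @ z) ** 2
k_via_phi = phi(x) @ phi(z)
assert np.isclose(k_direct, k_via_phi)   # same value, without ever needing phi at prediction time
```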
How to find a valid kernel

By definition, a kernel function $k$ must be the inner product in the feature space defined by $\phi$: $k(x, z) = \phi(x)^\top \phi(z)$.

▶ How about a Gaussian kernel?

  $k(x, z) = \exp\!\left( -\frac{\|x - z\|_2^2}{2\sigma^2} \right)$

▶ It is also a valid kernel, with an infinite-dimensional feature mapping.

▶ For one-dimensional input $x \in \mathbb{R}$, the mapping function is

  $\phi(x) = e^{-x^2/2\sigma^2} \left[ 1,\; \sqrt{\tfrac{1}{1!\,\sigma^2}}\, x,\; \sqrt{\tfrac{1}{2!\,\sigma^4}}\, x^2,\; \sqrt{\tfrac{1}{3!\,\sigma^6}}\, x^3,\; \ldots \right]^\top$

▶ In general, given a kernel function $k: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, under what conditions can $k(x, z)$ be written as a dot product $\phi(x)^\top \phi(z)$ for some feature mapping $\phi$?

▶ We want a general recipe that does not require explicitly defining $\phi$ every time.
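To make the infinite-dimensional mapping tangible, here is a small sketch (my own, assuming NumPy) that truncates the one-dimensional feature map above after 30 terms and checks that its inner product already matches the Gaussian kernel value; the truncation length, $\sigma = 1$, and the test points are arbitrary choices.

```python
# Sketch: for 1-D inputs, a truncated version of the infinite feature map
# phi(x) = e^{-x^2/2sigma^2} [1, sqrt(1/(1! sigma^2)) x, sqrt(1/(2! sigma^4)) x^2, ...]
# reproduces the Gaussian kernel value. Truncation at 30 terms is an illustrative choice.
import numpy as np
from math import factorial, exp

def phi_truncated(x, sigma=1.0, n_terms=30):
    coeffs = np.array([np.sqrt(1.0 / (factorial(k) * sigma ** (2 * k))) * x ** k
                       for k in range(n_terms)])
    return exp(-x ** 2 / (2 * sigma ** 2)) * coeffs

x, z, sigma = 0.7, -0.3, 1.0
k_exact = exp(-(x - z) ** 2 / (2 * sigma ** 2))
k_series = phi_truncated(x, sigma) @ phi_truncated(z, sigma)
assert np.isclose(k_exact, k_series)     # the series converges quickly for these values
```

The check works because $\phi(x)^\top \phi(z) = e^{-(x^2 + z^2)/2\sigma^2}\, e^{xz/\sigma^2} = e^{-(x-z)^2/2\sigma^2}$, i.e., the series sums exactly to the Gaussian kernel.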
Kernel matrix

▶ Suppose we have an arbitrary set of input vectors $\{x_i\}_{i=1}^m$.

▶ The kernel matrix (or Gram matrix) $K \in \mathbb{R}^{m \times m}$ corresponding to kernel function $k$ is an $m \times m$ matrix such that $K_{ij} = k(x_i, x_j)$.

▶ What properties does the kernel matrix $K$ have if $k$ is a valid kernel function?

  1. $K$ is a symmetric matrix (i.e., $K_{ij} = K_{ji}$)

  2. $K$ is positive semidefinite (i.e., $\alpha^\top K \alpha \geq 0$ for all $\alpha \in \mathbb{R}^m$)
Proofs

1. $K_{ij} = \phi(x_i)^\top \phi(x_j) = \phi(x_j)^\top \phi(x_i) = K_{ji}$

2. For any vector $\alpha = [\alpha_1, \ldots, \alpha_m]^\top \in \mathbb{R}^m$ and $\phi(x) = [\phi_1(x), \ldots, \phi_{n'}(x)]^\top \in \mathbb{R}^{n'}$, we have

  $\alpha^\top K \alpha = \sum_{i=1}^m \sum_{j=1}^m \alpha_i K_{ij} \alpha_j$  (definition of matrix-vector product)

  $= \sum_{i=1}^m \sum_{j=1}^m \alpha_i \left( \phi(x_i)^\top \phi(x_j) \right) \alpha_j$  (definition of kernel matrix)

  $= \sum_{i=1}^m \sum_{j=1}^m \alpha_i \left( \sum_{k=1}^{n'} \phi_k(x_i)\, \phi_k(x_j) \right) \alpha_j$  (definition of inner product)

  $= \sum_{k=1}^{n'} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \phi_k(x_i)\, \phi_k(x_j)\, \alpha_j$  (change the order of summation)

  $= \sum_{k=1}^{n'} \left( \sum_{i=1}^m \alpha_i \phi_k(x_i) \right)^2 \geq 0$  (since $\left(\sum_i a_i\right)^2 = \sum_i \sum_j a_i a_j$)
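The final step can also be spot-checked numerically. The sketch below (illustrative, assuming NumPy) uses the explicit feature map of the $(x^\top z)^2$ kernel from earlier, builds $K = \Phi\Phi^\top$, and verifies that $\alpha^\top K \alpha$ equals $\sum_k \left(\sum_i \alpha_i \phi_k(x_i)\right)^2$ for a random $\alpha$.

```python
# Sketch: numeric spot-check of the PSD argument, using the explicit feature map
# phi(x) = outer(x, x).ravel() of the kernel k(x, z) = (x^T z)^2.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))                            # 6 points in R^3
Phi = np.array([np.outer(x, x).ravel() for x in X])    # rows are phi(x_i), shape (m, n')
K = Phi @ Phi.T                                        # K_ij = phi(x_i)^T phi(x_j)

alpha = rng.normal(size=6)
quad_form = alpha @ K @ alpha                          # alpha^T K alpha
sum_of_squares = np.sum((alpha @ Phi) ** 2)            # sum_k (sum_i alpha_i phi_k(x_i))^2
assert np.isclose(quad_form, sum_of_squares) and quad_form >= 0
```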
Mercer’s theorem

▶ We have shown that if $k$ is a valid kernel function, then for any data set, the corresponding kernel matrix $K$ defined such that $K_{ij} = k(x_i, x_j)$ is symmetric and positive semidefinite.

▶ Mercer’s theorem states that the reverse is also true:
  Given a function $k: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, $k$ is a valid kernel function if and only if, for any data set, the corresponding kernel matrix $K$ is symmetric and positive semidefinite.

▶ The reverse direction of the proof is much harder.

▶ It gives us a way to check if a given function is a kernel, by checking these two properties of its kernel matrix.
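In practice this check is run numerically on a sample of points: build $K$, confirm symmetry, and confirm that the eigenvalues are (numerically) nonnegative. A single data set can only refute a candidate kernel, since Mercer’s theorem requires the property for every data set. A minimal sketch, assuming NumPy; the Gaussian kernel, sample size, and tolerance are illustrative choices.

```python
# Sketch: Mercer-style sanity check of a candidate kernel on one sample of points.
import numpy as np

def gaussian_k(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
K = np.array([[gaussian_k(xi, xj) for xj in X] for xi in X])

assert np.allclose(K, K.T)              # symmetry
eigvals = np.linalg.eigvalsh(K)         # eigenvalues of a symmetric matrix
assert eigvals.min() > -1e-10           # positive semidefinite, up to floating-point error
```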
Construct a kernel with kernels

Let $k_1$ and $k_2$ be valid kernels over $\mathbb{R}^n \times \mathbb{R}^n$, $a \in \mathbb{R}^+$ be a positive number, $\phi: \mathbb{R}^n \to \mathbb{R}^{n'}$ be a mapping function with a kernel $k_3$ defined over $\mathbb{R}^{n'} \times \mathbb{R}^{n'}$, and $A \in \mathbb{R}^{n \times n}$ be a symmetric positive semidefinite matrix. Then, the following functions are kernels:

1. $k(x, z) = k_1(x, z) + k_2(x, z)$

2. $k(x, z) = a\, k_1(x, z)$

3. $k(x, z) = k_1(x, z)\, k_2(x, z)$

4. $k(x, z) = \phi(x)^\top \phi(z)$

5. $k(x, z) = k_3(\phi(x), \phi(z))$

6. $k(x, z) = x^\top A z$
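These rules can be exercised numerically. The sketch below (assuming NumPy; the data, base kernels, and matrix $A$ are illustrative choices of mine) builds kernel matrices following rules 1-3 and 6 and checks that each remains symmetric and positive semidefinite.

```python
# Sketch: composing kernel matrices from existing ones and checking the Mercer conditions.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 3))
G = X @ X.T                                        # linear kernel matrix k1(x, z) = x^T z
K1 = (G + 1.0) ** 2                                # polynomial kernel (x^T z + 1)^2
d = G.diagonal()
K2 = np.exp(-(d[:, None] + d[None, :] - 2 * G) / 2.0)   # Gaussian kernel matrix, sigma = 1

B = rng.normal(size=(3, 3))
A = B @ B.T                                        # symmetric PSD matrix for rule 6

candidates = {
    "sum k1 + k2": K1 + K2,                        # rule 1
    "scale 3 * k1": 3.0 * K1,                      # rule 2
    "product k1 * k2": K1 * K2,                    # rule 3 (elementwise product of matrices)
    "x^T A z": X @ A @ X.T,                        # rule 6
}
for name, K in candidates.items():
    sym = np.allclose(K, K.T)
    psd = np.linalg.eigvalsh(K).min() > -1e-8
    print(f"{name}: symmetric={sym}, PSD={psd}")
```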
Kernelized Linear Regression

Linear regression revisited

▶ (Regularized) linear regression aims to minimize the loss function (we omit the bias term $b$ for simplicity):

  $L(w) = \frac{1}{2}\|Xw - y\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$

▶ If we use a mapping function $\phi: \mathbb{R}^n \to \mathbb{R}^{n'},\ x \mapsto \phi(x)$ for data pre-processing, then we have

  $L(u) = \frac{1}{2}\|\Phi u - y\|_2^2 + \frac{\lambda}{2}\|u\|_2^2$

  where $\Phi = [\phi(x_1), \ldots, \phi(x_m)]^\top \in \mathbb{R}^{m \times n'}$ is the matrix of data points in the new feature space, and $u \in \mathbb{R}^{n'}$ is the corresponding weight vector.
Kernelized linear regression

$L(u) = \frac{1}{2}\|\Phi u - y\|_2^2 + \frac{\lambda}{2}\|u\|_2^2$

▶ Assume that $u$ can be represented as a linear combination of the $\phi(x_i)$:

  $u = \Phi^\top \alpha$

  where $\alpha = [\alpha_1, \ldots, \alpha_m]^\top \in \mathbb{R}^m$ (analogous to the $\alpha_i$'s in SVMs)

▶ Then, the objective function becomes:

  $L(\alpha) = \frac{1}{2}\|\Phi\Phi^\top \alpha - y\|_2^2 + \frac{\lambda}{2}\|\Phi^\top \alpha\|_2^2$

  $= \frac{1}{2}\alpha^\top \Phi\Phi^\top \Phi\Phi^\top \alpha - \alpha^\top \Phi\Phi^\top y + \frac{1}{2} y^\top y + \frac{\lambda}{2}\alpha^\top \Phi\Phi^\top \alpha$

  $= \frac{1}{2}\alpha^\top K K \alpha - \alpha^\top K y + \frac{1}{2} y^\top y + \frac{\lambda}{2}\alpha^\top K \alpha$

  where $K \triangleq \Phi\Phi^\top \in \mathbb{R}^{m \times m}$, with $K_{ij} = \phi(x_i)^\top \phi(x_j) = k(x_i, x_j)$.

▶ In other words, $K$ is a kernel matrix!
Kernelized linear regression

$L(\alpha) = \frac{1}{2}\alpha^\top K K \alpha - \alpha^\top K y + \frac{1}{2} y^\top y + \frac{\lambda}{2}\alpha^\top K \alpha$

▶ This is a quadratic function with respect to $\alpha$, and we can find the solution by setting the gradient of $L(\alpha)$ to zero and solving for $\alpha$:

  $\alpha = (K + \lambda I_m)^{-1} y$

▶ Once we obtain $\alpha$, we can predict the value at $x$ by using

  $f(x) = \phi(x)^\top u = \phi(x)^\top \Phi^\top \alpha = \sum_{i=1}^m \alpha_i k(x_i, x)$

  Again, the feature mapping function $\phi$ is not needed!

▶ Recall the SVM prediction: $f(x) = \operatorname{sgn}\left( \sum_{i=1}^m \alpha_i y_i k(x_i, x) + b \right)$ – a similar form for prediction!

▶ Demo!
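The closed-form solution above is what is usually called kernel ridge regression. A minimal sketch of it, assuming NumPy; the Gaussian kernel, the synthetic 1-D data, and $\lambda = 0.1$ are illustrative choices, not from the lecture demo.

```python
# Sketch of kernelized linear regression: alpha = (K + lambda I)^{-1} y,
# then f(x) = sum_i alpha_i k(x_i, x).
import numpy as np

def gaussian_kernel_matrix(A, B, sigma=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)        # noisy nonlinear target

lam = 0.1
K = gaussian_kernel_matrix(X, X)                        # m x m training kernel matrix
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)    # (K + lambda I_m)^{-1} y

X_new = np.linspace(-3, 3, 200)[:, None]
K_new = gaussian_kernel_matrix(X_new, X)                # k(x_i, x) for every new x
f_new = K_new @ alpha                                   # sum_i alpha_i k(x_i, x)
```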
Kernelized Logistic Regression

Kernelized logistic regression

▶ Recall: in linear logistic regression, we have

  $p(y = 1 \mid x; w) \triangleq \sigma(h_w(x)) = \frac{1}{1 + e^{-w^\top x}}$

▶ Similarly, we use a mapping function $\phi: x \mapsto \phi(x)$:

  $\sigma(h_u(x)) = \frac{1}{1 + e^{-u^\top \phi(x)}}$

  and assume that $u = \Phi^\top \alpha$

▶ Then, we have

  $\sigma(h_\alpha(x)) = \frac{1}{1 + e^{-\sum_{i=1}^m \alpha_i k(x, x_i)}}$
Kernelized logistic regression

▶ Recall: given a training set, the objective function (cross-entropy loss) of linear logistic regression is

  $J(w) = -\sum_{i=1}^m \left( y_i \log t_i + (1 - y_i) \log(1 - t_i) \right), \qquad (1)$

  where $t_i = \sigma(h_w(x_i)) = \frac{1}{1 + e^{-w^\top x_i}}$

▶ For kernelized logistic regression, the objective function is still the same as (1); the only difference is $t_i$:

  $t_i = \sigma(h_\alpha(x_i)) = \frac{1}{1 + e^{-\sum_{j=1}^m \alpha_j k(x_i, x_j)}}$

▶ The training procedure is the same as for linear logistic regression (e.g., gradient descent or Newton's method)
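A minimal training sketch, assuming NumPy: gradient descent on the cross-entropy loss in (1) with $t_i = \sigma\!\left(\sum_j \alpha_j k(x_i, x_j)\right)$. The gradient with respect to $\alpha$ is $K(t - y)$; here it is averaged over the $m$ examples purely to make a fixed step size stable, which does not change the minimizer. The learning rate, iteration count, kernel width, and data are illustrative; the bias term is omitted, as in the slides.

```python
# Sketch: kernelized logistic regression trained with plain gradient descent.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def gaussian_kernel_matrix(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(6)
X = rng.normal(size=(80, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(float)   # labels in {0, 1}

K = gaussian_kernel_matrix(X, X)            # m x m kernel matrix
alpha = np.zeros(len(X))
lr = 0.1
for _ in range(2000):
    t = sigmoid(K @ alpha)                  # t_i for every training point
    grad = K @ (t - y) / len(y)             # averaged gradient of (1) w.r.t. alpha
    alpha -= lr * grad

x_new = np.array([[0.2, 0.1]])
p_new = sigmoid(gaussian_kernel_matrix(x_new, X) @ alpha)   # p(y = 1 | x_new)
```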
