Lecture 8 - Kernels
More on Kernels
Boyu Wang
Department of Computer Science
University of Western Ontario
Kernel Trick
Kernel trick
$x \in \mathbb{R}^n \to \phi(x) \in \mathbb{R}^{n'}$, with $k(x, z) = \phi(x)^\top \phi(z)$

Training:
$$\max_\alpha \; \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)$$
$$\text{s.t.} \quad \sum_{i=1}^m \alpha_i y_i = 0 \quad \text{and} \quad \alpha_i \ge 0, \; i = 1, \dots, m$$

Prediction:
$$f(x) = \operatorname{sgn}\!\left( \sum_{i=1}^m \alpha_i y_i \, k(x_i, x) + b \right)$$
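A minimal sketch of the dual prediction rule above, assuming α and b have already been obtained from a dual solver (the RBF kernel and all names are my choices):

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # Gaussian (RBF) kernel: k(x, z) = exp(-gamma * ||x - z||^2).
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_predict(x, X_train, y_train, alpha, b, kernel=rbf_kernel):
    # f(x) = sgn(sum_i alpha_i y_i k(x_i, x) + b); only points with
    # alpha_i > 0 (the support vectors) contribute to the sum.
    s = sum(a * yi * kernel(xi, x)
            for a, yi, xi in zip(alpha, y_train, X_train) if a > 0)
    return np.sign(s + b)
```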
Nonlinear mapping and kernel trick
[Figure illustrating a nonlinear mapping and the kernel trick.]
Source: https://towardsdatascience.com/the-kernel-trick-c98cdbcaeb3f
How to find a valid kernel
By definition, a kernel function k must be the inner product in the feature space defined by φ: $k(x, z) = \phi(x)^\top \phi(z)$.
▶ First define φ(x), then set $k(x, z) = \phi(x)^\top \phi(z)$, as in the sketch below:
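A standard instance of this recipe (the feature map is my choice of example): on $\mathbb{R}^2$, take $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$; the induced kernel collapses to $(x^\top z)^2$, so it can be evaluated without ever forming φ:

```python
import numpy as np

def phi(x):
    # Explicit feature map for the quadratic kernel on R^2.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# phi(x)^T phi(z) equals (x^T z)^2: the kernel trick in one line.
assert np.isclose(phi(x) @ phi(z), (x @ z) ** 2)
```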
Kernel matrix
▶ What properties does the kernel matrix K have if k is a valid kernel function?
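As a numerical preview of the answer (the RBF kernel and random data are my choices), one can build K and check the two properties established on the following slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                   # 20 points in R^3

# K_ij = k(x_i, x_j) for the RBF kernel k(x, z) = exp(-||x - z||^2).
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists)

print(np.allclose(K, K.T))                     # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # positive semidefinite
```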
Proofs
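The standard argument for the forward direction (a valid kernel yields a symmetric PSD matrix) runs as follows:

```latex
% Symmetry: K_{ij} = \phi(x_i)^\top \phi(x_j) = \phi(x_j)^\top \phi(x_i) = K_{ji}.
% Positive semidefiniteness: for any v \in \mathbb{R}^m,
\begin{align*}
v^\top K v
  &= \sum_{i,j=1}^m v_i v_j \, \phi(x_i)^\top \phi(x_j)
   = \Big\lVert \sum_{i=1}^m v_i \, \phi(x_i) \Big\rVert^2 \;\ge\; 0.
\end{align*}
```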
Mercer's theorem

▶ We have shown that if k is a valid kernel function, then for any data set, the corresponding kernel matrix K, defined such that $K_{ij} = k(x_i, x_j)$, is symmetric and positive semidefinite.
▶ Mercer's theorem states that the converse is also true: given a function $k : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, k is a valid kernel function if and only if, for any data set, the corresponding kernel matrix K is symmetric and positive semidefinite.
▶ The converse direction of the proof is much harder.
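Mercer's criterion also works in the negative direction: a single data set whose matrix has a negative eigenvalue rules a candidate function out. A small sketch (the candidate function is my example):

```python
import numpy as np

# Candidate "kernel": k(x, z) = |x - z|, the distance itself.
X = np.array([[0.0], [1.0]])
K = np.abs(X - X.T)                    # K = [[0, 1], [1, 0]]

# Eigenvalues are -1 and 1; the negative one shows, via Mercer's
# theorem, that the distance function is not a valid kernel.
print(np.linalg.eigvalsh(K))
```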
Construct a kernel with kernels
4. $k(x, z) = \phi(x)\,\phi(z)$
6. $k(x, z) = x^\top A z$
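For rule 6, A must be symmetric positive semidefinite; a quick numerical check via Mercer's criterion (random data of my choosing):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(3, 3))
A = B @ B.T                            # symmetric PSD by construction

X = rng.normal(size=(15, 3))
K = X @ A @ X.T                        # K_ij = x_i^T A x_j

# Symmetric with nonnegative eigenvalues, as a valid kernel requires.
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-10)
```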
Kernelized Linear Regression
Linear regression revisited
Kernelized linear regression
$$L(u) = \frac{1}{2} \lVert \Phi u - y \rVert_2^2 + \frac{\lambda}{2} \lVert u \rVert_2^2$$

▶ Assume that u can be represented as a linear combination of the φ(x_i):
$$u = \Phi^\top \alpha$$
where $\alpha = [\alpha_1, \dots, \alpha_m]^\top \in \mathbb{R}^m$ (analogous to the $\alpha_i$'s in SVM).
▶ Then, the objective function becomes:
$$\begin{aligned}
L(\alpha) &= \frac{1}{2} \lVert \Phi \Phi^\top \alpha - y \rVert_2^2 + \frac{\lambda}{2} \lVert \Phi^\top \alpha \rVert_2^2 \\
&= \frac{1}{2} \alpha^\top \Phi \Phi^\top \Phi \Phi^\top \alpha - \alpha^\top \Phi \Phi^\top y + \frac{1}{2} y^\top y + \frac{\lambda}{2} \alpha^\top \Phi \Phi^\top \alpha \\
&= \frac{1}{2} \alpha^\top K K \alpha - \alpha^\top K y + \frac{1}{2} y^\top y + \frac{\lambda}{2} \alpha^\top K \alpha
\end{aligned}$$
where $K \triangleq \Phi \Phi^\top \in \mathbb{R}^{m \times m}$, with $K_{ij} = \phi(x_i)^\top \phi(x_j) = k(x_i, x_j)$.
▶ In other words, K is a kernel matrix!
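The rewriting above is easy to sanity-check numerically: with $u = \Phi^\top \alpha$, the original objective and its kernelized form agree (random Φ, y, and α of my choosing):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 8, 5                        # m training points, d feature dimensions
Phi = rng.normal(size=(m, d))      # rows of Phi are phi(x_i)^T
y = rng.normal(size=m)
alpha = rng.normal(size=m)
lam = 0.1

u = Phi.T @ alpha                  # u = Phi^T alpha
K = Phi @ Phi.T                    # kernel matrix

L_u = 0.5 * np.sum((Phi @ u - y) ** 2) + 0.5 * lam * np.sum(u ** 2)
L_a = (0.5 * alpha @ K @ K @ alpha - alpha @ K @ y
       + 0.5 * y @ y + 0.5 * lam * alpha @ K @ alpha)
assert np.isclose(L_u, L_a)
```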
Kernelized linear regression
$$L(\alpha) = \frac{1}{2} \alpha^\top K K \alpha - \alpha^\top K y + \frac{1}{2} y^\top y + \frac{\lambda}{2} \alpha^\top K \alpha$$

▶ This is a quadratic function with respect to α, and we can find the solution by setting the gradient of L(α) to zero and solving for α:
$$\alpha = (K + \lambda I_m)^{-1} y$$
▶ Once we obtain α, we can predict the value at a point x by using
$$f(x) = \phi(x)^\top u = \phi(x)^\top \Phi^\top \alpha = \sum_{i=1}^m \alpha_i \, k(x_i, x)$$
▶ Demo!
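The lecture demo itself is not reproduced here; a minimal stand-in (RBF kernel, noisy sine data, and the regularization value are my choices) implementing the closed-form solution above:

```python
import numpy as np

def rbf_gram(A, B, gamma=10.0):
    # Gram matrix G_ij = exp(-gamma * (a_i - b_j)^2) for 1-D inputs.
    return np.exp(-gamma * (A[:, None] - B[None, :]) ** 2)

rng = np.random.default_rng(3)
x_train = rng.uniform(0, 1, size=30)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=30)

lam = 1e-2
K = rbf_gram(x_train, x_train)
# alpha = (K + lam * I)^{-1} y, via a linear solve instead of an inverse.
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

x_test = np.linspace(0, 1, 5)
y_pred = rbf_gram(x_test, x_train) @ alpha   # f(x) = sum_i alpha_i k(x_i, x)
print(np.c_[x_test, y_pred])
```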
Kernelized Logistic Regression
Kernelized logistic regression
$$p(y = 1 \mid x; w) \triangleq \sigma(h_w(x)) = \frac{1}{1 + e^{-w^\top x}}$$

▶ Similarly, we use a mapping function φ: x → φ(x):
$$\sigma(h_u(x)) = \frac{1}{1 + e^{-u^\top \phi(x)}}$$
and assume that $u = \Phi^\top \alpha$.
▶ Then, we have
$$\sigma(h_\alpha(x)) = \frac{1}{1 + e^{-\sum_{i=1}^m \alpha_i k(x, x_i)}}$$
Kernelized logistic regression
where $t_i = \sigma(h_w(x_i)) = \frac{1}{1 + e^{-w^\top x_i}}$

▶ For kernelized logistic regression, the objective function is still the same as (1); the only difference is in $t_i$:
$$t_i = \sigma(h_\alpha(x_i)) = \frac{1}{1 + e^{-\sum_{j=1}^m \alpha_j k(x_i, x_j)}}$$
▶ The training procedure is the same as for linear logistic regression (e.g., gradient descent or Newton's method).
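A minimal gradient-descent sketch of that procedure (the data, kernel, step size, and the cross-entropy gradient $K(t - y)$ are my assumptions, since equation (1) is not reproduced above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)     # nonlinearly separable labels

# RBF kernel matrix on the training set.
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-d2)

alpha = np.zeros(len(X))
lr = 0.1
for _ in range(500):
    t = sigmoid(K @ alpha)       # t_i = sigma(sum_j alpha_j k(x_i, x_j))
    alpha -= lr * K @ (t - y)    # gradient of the negative log-likelihood

print(np.mean((sigmoid(K @ alpha) > 0.5) == (y == 1)))  # training accuracy
```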