Lecture 17: Kernels
George Lan
$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_d x^d + \epsilon$
$y = \theta^\top \phi(x)$, where $\phi(x) = (1, x, x^2, \ldots, x^d)^\top$
Problem of explicitly constructing features
Explicitly constructing the feature map $\phi(x): \mathbb{R}^d \mapsto F$: the feature space can grow very large, very quickly.
Can we avoid expanding the features?
Rather than computing the features explicitly and then taking an inner product, compute the inner product $\phi(x)^\top \phi(y)$ directly as a function of $x$ and $y$.
Typical kernels for vector data
Polynomial kernel of degree $d$:
$k(x, y) = (x^\top y)^d$
Polynomial kernel of degree up to $d$:
$k(x, y) = (x^\top y + c)^d$
Example ($d = 2$, $x, y \in \mathbb{R}^2$, $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)^\top$):
$\phi(x)^\top \phi(y) = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2 = (x^\top y)^2$
What 𝑘(𝑥, 𝑦) can be called a kernel function?
$k(x, y)$ is a kernel function if it is equivalent to first computing the features $\phi(x)$ and $\phi(y)$, and then taking their inner product: $k(x, y) = \phi(x)^\top \phi(y)$.
A dataset $D = \{x^1, x^2, x^3, \ldots, x^m\}$
$\text{s.t. } \left(w^\top x^j + b\right) y^j \ge 1 - \xi^j, \quad \xi^j \ge 0, \quad \forall j$
(the $x^j$ here can be high-order polynomial features $\phi(x^j)$)
Lagrangian:
$L(w, \xi, \alpha, \beta) = \frac{1}{2} w^\top w + C \sum_j \xi^j + \sum_j \alpha^j \left(1 - \xi^j - \left(w^\top x^j + b\right) y^j\right) - \sum_j \beta^j \xi^j$
$\alpha^j \ge 0, \quad \beta^j \ge 0$
Illustration of kernel SVM
Kernel SVM
implicitly map data to a new nonlinear feature space
find linear decision boundary in the new space
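A minimal sketch of this idea, assuming scikit-learn is available (the dataset helper `make_circles` and the RBF kernel choice are illustrative, not part of the slides):
```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Data that is not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A kernel SVM never builds the feature map explicitly: the RBF kernel
# corresponds to a feature space in which the decision boundary is linear.
clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```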
Ridge regression and matrix inversion lemma
Matrix inversion lemma ($B \in \mathbb{R}^{d \times m}$): $\left(B B^\top + \lambda I\right)^{-1} B = B \left(B^\top B + \lambda I\right)^{-1}$
Applied to the ridge regression solution:
$x^\top \hat{\theta} = x^\top \left(X X^\top + \lambda I\right)^{-1} X y = x^\top X \left(X^\top X + \lambda I\right)^{-1} y$
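A numerical sanity check of this identity (a small numpy sketch; the dimensions and regularization value are arbitrary):
```python
import numpy as np

# Numerically check the identity used above:
# (X X^T + lam*I_d)^{-1} X y = X (X^T X + lam*I_m)^{-1} y
rng = np.random.default_rng(0)
d, m, lam = 5, 50, 0.1
X = rng.normal(size=(d, m))   # columns are data points x^1, ..., x^m
y = rng.normal(size=m)

lhs = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)   # d x d solve
rhs = X @ np.linalg.solve(X.T @ X + lam * np.eye(m), y)   # m x m solve

print(np.allclose(lhs, rhs))  # True
```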
Kernel ridge regression
$f(x) = \theta^\top x = y^\top \left(X^\top X + \lambda I\right)^{-1} X^\top x$ only depends on inner products!
$X^\top x = \begin{pmatrix} x^{1\top} x \\ \vdots \\ x^{m\top} x \end{pmatrix}$
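A minimal kernel ridge regression sketch in numpy, assuming an RBF kernel and data stored as columns to match the slide's notation (the helper names and toy data are illustrative):
```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # A: d x n, B: d x p (columns are points); returns the n x p kernel matrix.
    sq = (A**2).sum(0)[:, None] + (B**2).sum(0)[None, :] - 2 * A.T @ B
    return np.exp(-gamma * sq)

def kernel_ridge_fit_predict(X, y, X_test, lam=0.1, gamma=1.0):
    # Kernelized form of f(x) = y^T (X^T X + lam*I)^{-1} X^T x from the slide:
    # f(x) = y^T (K + lam*I)^{-1} k(X, x).
    K = rbf_kernel(X, X, gamma)                   # m x m Gram matrix
    alpha = np.linalg.solve(K + lam * np.eye(K.shape[0]), y)
    return rbf_kernel(X_test, X, gamma) @ alpha   # prediction for each test column

# Toy example: learn y = sin(x).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1, 40))              # 1 x m, columns are points
y = np.sin(X[0]) + 0.1 * rng.normal(size=40)
X_test = np.linspace(-3, 3, 5).reshape(1, -1)
print(kernel_ridge_fit_predict(X, y, X_test))
```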
Principal component analysis
Given a set of $m$ centered observations $x^i \in \mathbb{R}^d$, PCA finds the direction that maximizes the variance
$X = (x^1, x^2, \ldots, x^m)$
$w^* = \arg\max_{\|w\| = 1} \frac{1}{m} \sum_i \left(w^\top x^i\right)^2 = \arg\max_{\|w\| = 1} \frac{1}{m}\, w^\top X X^\top w$
With $C = \frac{1}{m} X X^\top$, $w^*$ can be found by solving the following eigenvalue problem:
$C w = \lambda w$
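A small numpy sketch of PCA via this eigenvalue problem (the synthetic data and explicit centering step are illustrative):
```python
import numpy as np

# PCA by eigendecomposition of C = (1/m) X X^T, with the columns of X as
# centered observations, matching the slide's notation.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 200)) * np.array([[3.0], [0.5]])   # d x m
X = X - X.mean(axis=1, keepdims=True)                      # center the data

C = X @ X.T / X.shape[1]
eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns ascending eigenvalues
w_star = eigvecs[:, -1]                # direction of maximal variance
print("top eigenvalue:", eigvals[-1], "direction:", w_star)
```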
Alternative expression for PCA
The principal component lies in the span of the data
$w = \sum_i \alpha_i x^i = X \alpha$
$w = \frac{1}{\lambda} C w = \frac{1}{\lambda m} X X^\top w = X \left(\frac{1}{\lambda m} X^\top w\right) = X \alpha \quad \text{for any } \lambda > 0$
Plugging this in, we have
$C w = \frac{1}{m} X X^\top X \alpha = \lambda w = \lambda X \alpha$
Kernel PCA:
$x^j \mapsto \phi(x^j), \quad X \mapsto \Phi = \left(\phi(x^1), \ldots, \phi(x^m)\right), \quad K = \Phi^\top \Phi$
Nonlinear principal component $w = \Phi \alpha$
$\frac{1}{m} K K \alpha = \lambda K \alpha$, equivalent to $\frac{1}{m} K \alpha = \lambda \alpha$
The solutions of the above two linear systems differ only for eigenvectors of $K$ with zero eigenvalue:
$K \left(\frac{1}{m} K \alpha - \lambda \alpha\right) = 0$, and $\frac{1}{m} K \alpha - \lambda \alpha$ cannot belong to the null space of $K$, since neither $K\alpha$ nor $\alpha$ does (under the assumption that $K\alpha$ is nonzero).
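A minimal kernel PCA sketch, assuming an RBF kernel; the feature-space centering step is added for concreteness, and the usual normalization of $\alpha$ (so that $\|w\| = 1$) is omitted for brevity:
```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # X: d x m (columns are points); returns the m x m Gram matrix K = Phi^T Phi.
    sq = (X**2).sum(0)[:, None] + (X**2).sum(0)[None, :] - 2 * X.T @ X
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))
m = X.shape[1]

K = rbf_kernel(X)
H = np.eye(m) - np.ones((m, m)) / m   # centering in feature space
K = H @ K @ H

# (1/m) K alpha = lambda alpha: eigenvectors of K give the expansion
# coefficients of the nonlinear principal component w = Phi alpha.
# (Normalization of alpha so that ||w|| = 1 is omitted here.)
eigvals, eigvecs = np.linalg.eigh(K / m)
alpha = eigvecs[:, -1]
proj = K @ alpha   # projections of the training points onto the first component
print(proj[:5])
```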
Canonical correlation analysis
Given $D = \left\{(x^1, y^1), \ldots, (x^m, y^m)\right\} \sim P(x, y)$, CCA finds directions $w_x$ and $w_y$ such that the projections $w_x^\top x$ and $w_y^\top y$ are maximally correlated.
$X = (x^1, x^2, \ldots, x^m)$
$Y = (y^1, y^2, \ldots, y^m)$
Matrix form of CCA
Define the covariance matrix of $(x, y)$:
$C = \mathbb{E}_{(x, y)} \begin{pmatrix} x \\ y \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}^\top = \begin{pmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{pmatrix}$
$\rho = \max_{w_x, w_y} \frac{w_x^\top C_{xy} w_y}{\sqrt{w_x^\top C_{xx} w_x}\,\sqrt{w_y^\top C_{yy} w_y}}$
CCA as generalized eigenvalue problem
The optimality conditions say
$C_{xy} w_y = \lambda C_{xx} w_x$
$C_{yx} w_x = \lambda C_{yy} w_y$
$\lambda = \frac{w_x^\top C_{xy} w_y}{w_x^\top C_{xx} w_x}$ (set the gradient equal to zero).
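A small sketch that solves this generalized eigenvalue problem numerically, assuming scipy is available (the synthetic data and the tiny ridge on the right-hand side are illustrative choices, not part of the slides):
```python
import numpy as np
from scipy.linalg import eigh

# Solve the CCA generalized eigenvalue problem
#   [0   Cxy] [wx]       [Cxx  0 ] [wx]
#   [Cyx  0 ] [wy] = lam [0   Cyy] [wy]
rng = np.random.default_rng(0)
m = 500
z = rng.normal(size=m)                                               # shared latent signal
X = np.vstack([z + 0.1 * rng.normal(size=m), rng.normal(size=m)])    # 2 x m
Y = np.vstack([rng.normal(size=m), z + 0.1 * rng.normal(size=m)])    # 2 x m
X = X - X.mean(1, keepdims=True)
Y = Y - Y.mean(1, keepdims=True)

Cxx, Cyy, Cxy = X @ X.T / m, Y @ Y.T / m, X @ Y.T / m
reg = 1e-6 * np.eye(2)   # keep the right-hand side positive definite

A = np.block([[np.zeros((2, 2)), Cxy], [Cxy.T, np.zeros((2, 2))]])
B = np.block([[Cxx + reg, np.zeros((2, 2))], [np.zeros((2, 2)), Cyy + reg]])

lams, vecs = eigh(A, B)                # generalized symmetric eigenproblem
wx, wy = vecs[:2, -1], vecs[2:, -1]    # top canonical directions
print("top canonical correlation:", lams[-1])
```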
CCA in inner product format
Similar to PCA, the directions of projection lie in the span of the data $X = (x^1, x^2, \ldots, x^m)$, $Y = (y^1, y^2, \ldots, y^m)$:
$w_x = X \alpha, \quad w_y = Y \beta$
$C_{xy} = \frac{1}{m} X Y^\top, \quad C_{xx} = \frac{1}{m} X X^\top, \quad C_{yy} = \frac{1}{m} Y Y^\top$
Earlier we have $\rho = \max_{w_x, w_y} \frac{w_x^\top C_{xy} w_y}{\sqrt{w_x^\top C_{xx} w_x}\,\sqrt{w_y^\top C_{yy} w_y}}$
Plugging in $w_x = X\alpha$, $w_y = Y\beta$, the data only appear in inner products:
$\rho = \max_{\alpha, \beta} \frac{\alpha^\top X^\top X Y^\top Y \beta}{\sqrt{\alpha^\top X^\top X X^\top X \alpha}\,\sqrt{\beta^\top Y^\top Y Y^\top Y \beta}}$
Kernel CCA
Replace the inner product matrices with kernel matrices $K_x$ and $K_y$:
$\rho = \max_{\alpha, \beta} \frac{\alpha^\top K_x K_y \beta}{\sqrt{\alpha^\top K_x K_x \alpha}\,\sqrt{\beta^\top K_y K_y \beta}}$
$\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \lambda \begin{pmatrix} K_x K_x & 0 \\ 0 & K_y K_y \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}$
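A minimal kernel CCA sketch along these lines; note that a small ridge (`kappa` below, an assumption not on the slide) is added to the right-hand side, since unregularized kernel CCA with invertible kernel matrices degenerates to correlation 1:
```python
import numpy as np
from scipy.linalg import eigh

def linear_kernel(X):
    # X: d x m with columns as points; K = X^T X.
    return X.T @ X

rng = np.random.default_rng(0)
m = 200
z = rng.normal(size=m)
X = np.vstack([z, rng.normal(size=m)]) + 0.1 * rng.normal(size=(2, m))
Y = np.vstack([rng.normal(size=m), z]) + 0.1 * rng.normal(size=(2, m))

Kx, Ky = linear_kernel(X), linear_kernel(Y)
H = np.eye(m) - np.ones((m, m)) / m   # centering in feature space
Kx, Ky = H @ Kx @ H, H @ Ky @ H

# Generalized eigenproblem from the slide, with a small ridge (kappa) added
# to the right-hand side for numerical stability / regularization.
kappa = 1e-2
A = np.block([[np.zeros((m, m)), Kx @ Ky], [Ky @ Kx, np.zeros((m, m))]])
B = np.block([[Kx @ Kx + kappa * np.eye(m), np.zeros((m, m))],
              [np.zeros((m, m)), Ky @ Ky + kappa * np.eye(m)]])

lams, vecs = eigh(A, B)
alpha, beta = vecs[:m, -1], vecs[m:, -1]   # expansion coefficients for wx, wy
print("top kernel canonical correlation:", lams[-1])
```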