
Kernel Methods

George Lan

A. Russell Chandler III Chair Professor


H. Milton Stewart School of Industrial & Systems Engineering

Nonlinear regression

Want to fit a polynomial regression model

y = θ_0 + θ_1 x + θ_2 x^2 + ⋯ + θ_d x^d + ε

Let x̃ = (1, x, x^2, …, x^d)^T and θ = (θ_0, θ_1, θ_2, …, θ_d)^T

Then the model is linear in the features: y = θ^T x̃
Problem of explicitly constructing features
Explicitly constructing a feature map φ(x): R^n ↦ F, the feature space can grow very large very quickly

E.g., polynomial features of degree d:

x_1^d, x_1 x_2 ⋯ x_d, x_1^2 x_2 ⋯ x_{d-1}, …

The total number of such features is

C(d + n − 1, d) = (d + n − 1)! / (d! (n − 1)!)

For d = 6, n = 100, there are about 1.6 billion terms
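As a quick sanity check of this combinatorial blow-up, the count below reproduces the 1.6 billion figure; a minimal sketch using only Python's standard library.

```python
from math import comb

# Number of degree-d monomials in n variables: C(d + n - 1, d)
d, n = 6, 100
num_features = comb(d + n - 1, d)
print(num_features)  # 1,609,344,100 -- roughly 1.6 billion terms
```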
Can we avoid expanding the features?
Rather than computing the features explicitly and then computing the inner product,
can we merge the two steps using a clever function k(x_i, x_j)?

E.g., polynomial with d = 2:

φ(x)^T φ(y) = (x_1^2, x_1 x_2, x_2^2, x_2 x_1) (y_1^2, y_1 y_2, y_2^2, y_2 y_1)^T
            = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2
            = (x_1 y_1 + x_2 y_2)^2 = (x^T y)^2

O(n) computation!

Polynomial kernel of degree d: k(x, y) = (x^T y)^d = φ(x)^T φ(y)
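A small numerical check of the identity above; a sketch assuming NumPy is available, using the 4-dimensional feature map written on the slide and toy input values chosen here for illustration.

```python
import numpy as np

def phi(v):
    # Explicit degree-2 feature map for a 2-dimensional input
    return np.array([v[0]**2, v[0]*v[1], v[1]**2, v[1]*v[0]])

x, y = np.array([1.5, -2.0]), np.array([0.3, 4.0])

explicit = phi(x) @ phi(y)        # build features, then inner product
kernel   = (x @ y) ** 2           # O(n) computation with the kernel trick
print(np.isclose(explicit, kernel))  # True
```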
Typical kernels for vector data
Polynomial of degree d
k(x, y) = (x^T y)^d

Polynomial of degree up to d
k(x, y) = (x^T y + c)^d

Exponential kernel (infinite-degree polynomials)
k(x, y) = exp(s ⋅ x^T y)

Gaussian radial basis function (RBF) kernel
k(x, y) = exp(−‖x − y‖^2 / (2σ^2))

Laplace kernel
k(x, y) = exp(−‖x − y‖ / (2σ^2))

Exponentiated distance
k(x, y) = exp(−d(x, y)^2 / σ^2)
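These kernels are straightforward to implement directly; a sketch assuming NumPy, with function names and default parameter values chosen here for illustration.

```python
import numpy as np

def poly_kernel(x, y, d=3):
    return (x @ y) ** d

def poly_kernel_up_to(x, y, d=3, c=1.0):
    return (x @ y + c) ** d

def exp_kernel(x, y, s=1.0):
    return np.exp(s * (x @ y))

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def laplace_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) / (2 * sigma ** 2))
```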
Feature space not unique!
E.g., polynomial with d = 2

With the 4-dimensional feature map φ(x) = (x_1^2, x_1 x_2, x_2^2, x_2 x_1)^T:

φ(x)^T φ(y) = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2 = (x^T y)^2

With the 3-dimensional feature map φ(x) = (x_1^2, √2 x_1 x_2, x_2^2)^T:

φ(x)^T φ(y) = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2 = (x^T y)^2
What 𝑘(𝑥, 𝑦) can be called a kernel function?
k(x, y) should be equivalent to first computing the features φ(x) and then taking the inner product: k(x, y) = φ(x)^T φ(y)

Given a dataset D = {x_1, x_2, x_3, …, x_n}

Compute the pairwise kernel function k(x_i, x_j) and form an n×n kernel matrix (Gram matrix)

K = [ k(x_1, x_1) … k(x_1, x_n) ]
    [      ⋮      ⋱      ⋮      ]
    [ k(x_n, x_1) … k(x_n, x_n) ]

k(x, y) is a kernel function iff the matrix K is positive semidefinite:

∀ v ∈ R^n, v^T K v ≥ 0
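A sketch of forming the Gram matrix on a small dataset and checking positive semidefiniteness numerically; assumes NumPy, and the random data and helper name gram_matrix are mine, for illustration only.

```python
import numpy as np

def gram_matrix(X, kernel):
    # X: (n, p) array with one data point per row
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
K = gram_matrix(X, lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2))  # RBF kernel

# K should be (numerically) positive semidefinite: all eigenvalues >= 0
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)  # True
```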
Support Vector Machines
Primal problem:

min_{w,b,ξ} (1/2) w^T w + C Σ_j ξ_j
s.t. (w^T x^j + b) y^j ≥ 1 − ξ_j,  ξ_j ≥ 0,  ∀j

(Here x^j can be replaced by high-order polynomial features φ(x^j).)

Lagrangian:

L(w, ξ, α, β) = (1/2) w^T w + C Σ_j ξ_j + Σ_j α_j (1 − ξ_j − (w^T x^j + b) y^j) − Σ_j β_j ξ_j,
with α_j ≥ 0, β_j ≥ 0

Taking the derivative of L(w, ξ, α, β) with respect to w and ξ, we have

w = Σ_j α_j y^j x^j
b = y^k − w^T x^k for any k such that 0 < α_k < C
SVM dual problem and kernelization
Plugging w back into the Lagrangian, the dual problem is

max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y^i y^j (x^i)^T x^j
s.t. Σ_i α_i y^i = 0
     0 ≤ α_i ≤ C

(The inner product (x^i)^T x^j can be replaced by φ(x^i)^T φ(x^j), i.e., by k(x^i, x^j).)

The other steps also depend only on inner products:

w^T x = Σ_j α_j y^j (x^j)^T x
b = y^k − w^T x^k for any k such that 0 < α_k < C
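In practice the kernelized dual is rarely coded by hand; a minimal sketch, assuming scikit-learn is installed, fits this dual with an RBF kernel on a nonlinearly separable toy dataset (the data and hyperparameter values are illustrative, not from the slides).

```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Toy data that is not linearly separable in the original space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma=1.0)  # solves the kernelized dual problem
clf.fit(X, y)

# Decision value for test points: sum_j alpha_j y_j k(x_j, x) + b,
# computed from the support vectors only
print(clf.decision_function(X[:3]))
print(clf.support_vectors_.shape)
```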
Illustration of kernel SVM
Kernel SVM:
implicitly maps the data to a new nonlinear feature space
finds a linear decision boundary in the new space
Ridge regression and matrix inversion lemma
Matrix inversion lemma (B ∈ R^{n×m}):

(B B^T + λI)^{−1} B = B (B^T B + λI)^{−1}

Note that X = (x^{(1)}, x^{(2)}, …, x^{(m)})

Evaluate the ridge regression solution θ̂ = (X X^T + λI)^{−1} X y on a new test point x:

x^T θ̂ = x^T (X X^T + λI)^{−1} X y
       = x^T X (X^T X + λI)^{−1} y

This depends only on inner products between data points
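A quick numerical check of the matrix inversion lemma; a sketch assuming NumPy, with an arbitrary random B and λ chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 5, 8, 0.1
B = rng.normal(size=(n, m))

lhs = np.linalg.solve(B @ B.T + lam * np.eye(n), B)   # (B B^T + lam*I)^{-1} B
rhs = B @ np.linalg.inv(B.T @ B + lam * np.eye(m))    # B (B^T B + lam*I)^{-1}
print(np.allclose(lhs, rhs))  # True
```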
Kernel ridge regression
f(x) = θ^T x = y^T (X^T X + λI_m)^{−1} X^T x only depends on inner products!

X^T X = [ x_1^T x_1 … x_1^T x_m ]
        [     ⋮     ⋱     ⋮     ]
        [ x_m^T x_1 … x_m^T x_m ]

X^T x = [ x_1^T x ]
        [    ⋮    ]
        [ x_m^T x ]

Kernel ridge regression: replace the inner products by a kernel function

X^T X → K = [ k(x_i, x_j) ]_{m×m}
X^T x → k_x = [ k(x_i, x) ]_{m×1}

f(x) = y^T (K + λI_m)^{−1} k_x
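A sketch of kernel ridge regression using the formula f(x) = y^T (K + λI)^{−1} k_x; assumes NumPy, and the RBF kernel, function names, and toy data are illustrative choices.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def krr_fit(X, y, kernel, lam):
    # X: (m, p) training inputs, y: (m,) targets
    m = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    # alpha = (K + lam*I)^{-1} y, so that f(x) = sum_i alpha_i k(x_i, x)
    return np.linalg.solve(K + lam * np.eye(m), y)

def krr_predict(X, alpha, kernel, x_new):
    k_x = np.array([kernel(xi, x_new) for xi in X])
    return alpha @ k_x

# Toy 1-d regression example
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
alpha = krr_fit(X, y, rbf, lam=0.1)
print(krr_predict(X, alpha, rbf, np.array([0.5])))
```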
Kernel ridge regression
Use Gaussian rbf kernel
k(x, y) = exp(−‖x − y‖^2 / (2σ^2))

(Figure: regression fits for large σ, large λ; small σ, small λ; and small σ, large λ)

Use cross-validation to choose the parameters
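One way to choose σ and λ by cross-validation; a sketch assuming scikit-learn, where alpha plays the role of λ and gamma = 1/(2σ^2) for the RBF kernel, and the toy data and parameter grid are illustrative.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

# 5-fold cross-validation over a small grid of (lambda, bandwidth) values
param_grid = {"alpha": [1e-3, 1e-2, 1e-1, 1.0], "gamma": [0.1, 0.5, 1.0, 5.0]}
search = GridSearchCV(KernelRidge(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```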
Principal component analysis
Given a set of m centered observations x^i ∈ R^n, PCA finds the direction that maximizes the variance

X = (x^1, x^2, …, x^m)

w* = argmax_{‖w‖=1} (1/m) Σ_i (w^T x^i)^2
   = argmax_{‖w‖=1} (1/m) w^T X X^T w

w* can be found by solving the following eigenvalue problem, where C = (1/m) X X^T:

C w = λ w
Alternative expression for PCA
The principal component lies in the span of the data
w = Σ_i α_i x^i = X α

w = (1/λ) C w = (1/(λm)) X X^T w = X ((1/(λm)) X^T w) = X α for any λ > 0.

Plugging this in, we have

C w = (1/m) X X^T X α = λ w = λ X α

Furthermore, for each data point x^i, the following relation holds
(it depends only on the inner product matrix):

(x^i)^T C w = (1/m) (x^i)^T X X^T X α = λ (x^i)^T X α,  ∀i

In matrix form, (1/m) X^T X X^T X α = λ X^T X α
Kernel PCA
Key Idea: Replace inner product matrix by kernel matrix
"
PCA: 𝑋 ! 𝑋𝑋 ! 𝑋𝛼 = 𝜆𝑋 ! 𝑋𝛼
-

Kernel PCA:
𝑥. ↦ 𝜙 𝑥. , X ↦ Φ = 𝜙 𝑥" , … , 𝜙 𝑥. , 𝐾 = Φ ! Φ
Nonlinear principal component 𝑤 = Φ𝛼
" "
𝐾𝐾𝛼 = 𝜆𝐾𝛼, equivalent to 𝐾𝛼 = 𝜆 𝛼
- -
The solutions of the above two linear systems differ only for eigenvectors
of 𝐾 with zero eigenvalue.
! !
𝐾(" 𝐾𝛼 − 𝜆𝛼) = 0, " 𝐾𝛼 − 𝜆𝛼 can not belong to the null space of 𝐾 since
neither 𝐾𝛼 nor 𝛼 does (under the assumption that 𝐾𝛼 is nonzero.

Key computation: form an 𝑚 by 𝑚 kernel matrix 𝐾, and then


perform eigen-decomposition on 𝐾
16
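A sketch of this key computation, assuming NumPy; the normalization step (scaling α so that w = Φα has unit norm) and the RBF kernel on random toy data are my choices, and for uncentered data the kernel matrix would normally be centered in feature space first, a step not shown on the slide.

```python
import numpy as np

def kernel_pca(K, num_components):
    # K: (m, m) kernel matrix; solve (1/m) K alpha = lambda alpha
    m = K.shape[0]
    eigvals, eigvecs = np.linalg.eigh(K / m)          # ascending order
    idx = np.argsort(eigvals)[::-1][:num_components]  # largest eigenvalues
    lambdas, alphas = eigvals[idx], eigvecs[:, idx]
    # Normalize so each component w = Phi alpha has unit norm:
    # w^T w = alpha^T K alpha = m * lambda * (alpha^T alpha) = 1
    alphas = alphas / np.sqrt(m * lambdas)
    return lambdas, alphas

def project(K_test, alphas):
    # K_test: (num_test, m) matrix of k(x_test, x_i); returns projections
    return K_test @ alphas

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2)  # RBF kernel
lambdas, alphas = kernel_pca(K, num_components=2)
print(project(K, alphas).shape)  # (30, 2)
```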
Kernel PCA example
Gaussian radial basis function (RBF) kernel
"
TIT !
exp − #U"
over 2 dimensional space
Eigen-vector evaluated at a test point 𝑥 is a function
𝑤 % 𝜙 𝑥 = ∑& 𝛼& < 𝜙 𝑥 & , 𝜙 𝑥 > = ∑& 𝛼& 𝑘(𝑥 & , 𝑥)

17
Canonical correlation analysis
Given D = {(x^1, y^1), …, (x^m, y^m)} ~ P(x, y)
X = (x^1, x^2, …, x^m)
Y = (y^1, y^2, …, y^m)

Find two vectors w_x and w_y, and project the data onto them:
X w_x and Y w_y

such that the correlation of the projected data is maximized:

ρ = max_{w_x, w_y} corr(X w_x, Y w_y)
  = max_{w_x, w_y} ⟨X w_x, Y w_y⟩ / (‖X w_x‖ ‖Y w_y‖)
Matrix form of CCA
Define the covariance matrix of (x, y):

C = E_{(x,y)} [ (x; y) (x; y)^T ] = [ C_xx  C_xy ]
                                    [ C_yx  C_yy ]

The optimization problem is equivalent to

ρ = max_{w_x, w_y} (w_x^T C_xy w_y) / ( sqrt(w_x^T C_xx w_x) sqrt(w_y^T C_yy w_y) )
CCA as generalized eigenvalue problem
The optimality conditions say

C_xy w_y = λ C_xx w_x
C_yx w_x = λ C_yy w_y

λ = (w_x^T C_xy w_y) / (w_x^T C_xx w_x)  (set the gradient equal to zero).

Put these conditions into matrix form:

[ 0     C_xy ] [ w_x ]  =  λ [ C_xx  0    ] [ w_x ]
[ C_yx  0    ] [ w_y ]       [ 0     C_yy ] [ w_y ]

This is a generalized eigenvalue problem A w = λ B w
CCA in inner product format
Similar to PCA, the directions of projection lie in the span of the data X = (x^1, x^2, …, x^m), Y = (y^1, y^2, …, y^m):

w_x = X α,  w_y = Y β

C_xy = (1/m) X Y^T,  C_xx = (1/m) X X^T,  C_yy = (1/m) Y Y^T

Earlier we had

ρ = max_{w_x, w_y} (w_x^T C_xy w_y) / ( sqrt(w_x^T C_xx w_x) sqrt(w_y^T C_yy w_y) )

Plugging in w_x = X α and w_y = Y β, the data appear only through inner products:

ρ = max_{α, β} (α^T X^T X Y^T Y β) / ( sqrt(α^T X^T X X^T X α) sqrt(β^T Y^T Y Y^T Y β) )
Kernel CCA
Replace inner product matrix by kernel matrix
ρ = max_{α, β} (α^T K_x K_y β) / ( sqrt(α^T K_x K_x α) sqrt(β^T K_y K_y β) )

where K_x is the kernel matrix for data X, with entries K_x(i, j) = k(x_i, x_j)

Solve the generalized eigenvalue problem

[ 0        K_x K_y ] [ α ]  =  λ [ K_x K_x  0       ] [ α ]
[ K_y K_x  0       ] [ β ]       [ 0        K_y K_y ] [ β ]
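A sketch of kernel CCA as a generalized eigenvalue problem, assuming NumPy and SciPy; the small ridge term kappa added to the right-hand-side blocks is a standard regularization not shown on the slide (to keep the problem well posed), and the toy two-view data are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def kernel_cca(Kx, Ky, kappa=1e-3):
    # Kx, Ky: (m, m) kernel matrices for the two views
    m = Kx.shape[0]
    zero = np.zeros((m, m))
    A = np.block([[zero, Kx @ Ky], [Ky @ Kx, zero]])
    B = np.block([[Kx @ Kx + kappa * np.eye(m), zero],
                  [zero, Ky @ Ky + kappa * np.eye(m)]])
    # Generalized symmetric eigenproblem A v = lambda B v
    eigvals, eigvecs = eigh(A, B)
    v = eigvecs[:, -1]            # direction with the largest correlation
    return eigvals[-1], v[:m], v[m:]

# Toy example: two one-dimensional views of the same latent signal
rng = np.random.default_rng(0)
z = rng.normal(size=(40, 1))
X = z + 0.1 * rng.normal(size=(40, 1))
Y = -z + 0.1 * rng.normal(size=(40, 1))
Kx, Ky = X @ X.T, Y @ Y.T        # linear kernels for simplicity
rho, alpha, beta = kernel_cca(Kx, Ky)
print(rho)
```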
