Kernels and Kernelized Perceptron: Instructor: Alan Ritter

The kernelized perceptron algorithm allows for non-linear classification by mapping data points to a higher dimensional feature space without explicitly computing the mapping. It does so using a kernel function K(u,v) that computes the dot product between two points u and v in the feature space, avoiding the need to explicitly compute the feature vectors φ(u) and φ(v). The kernelized perceptron performs the same update rule as the standard perceptron, but replaces the dot product between weights and features with the kernel function K, allowing it to learn non-linear decision boundaries while working entirely in terms of the kernel function.


Kernels and Kernelized Perceptron

Instructor: Alan Ritter

Many Slides from Carlos Guestrin and Luke Zettlemoyer


What if the data is not linearly separable?
Use features of features of features of features….

   φ(x) = ( x1, …, xn, x1x2, x1x3, …, e^{x1}, … )

Feature space can get really large really quickly!


Non-linear features: 1D input
•  Datasets that are linearly separable with some noise work out great:
   [figure: 1D data on the x axis, separated by a threshold at 0]
•  But what are we going to do if the dataset is just too hard?
   [figure: 1D data on the x axis with no single separating threshold]
•  How about… mapping data to a higher-dimensional space:
   [figure: the same data mapped to (x, x²), now linearly separable]
Feature spaces
•  General idea: map to higher dimensional space
   –  if x is in Rⁿ, then φ(x) is in Rᵐ for m > n
   –  Can now learn feature weights w in Rᵐ and predict:
         y = sign(w · φ(x))
   –  Linear function in the higher dimensional space will be non-linear in the original space

x → φ(x)
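A tiny sketch of this idea in Python/numpy, using a made-up 1D dataset in the spirit of the figures above: mapping x → (x, x²) turns data that no single threshold on the line separates into data a linear rule handles. The dataset and the particular weights are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Toy 1D dataset: positives in the middle, negatives on both sides,
# so no single threshold on x separates the classes.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([-1, -1, +1, +1, +1, -1, -1])

# Feature map phi(x) = (x, x^2): each 1D point becomes a 2D point.
phi = np.stack([x, x**2], axis=1)

# In the mapped space a linear rule works: predict +1 iff x^2 < 1,
# i.e. sign(w . phi(x) + b) with w = (0, -1), b = 1.
w, b = np.array([0.0, -1.0]), 1.0
pred = np.sign(phi @ w + b)
print(np.all(pred == y))   # True: linearly separable after the map
```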
Higher order polynomials

[figure: number of monomial terms vs. number of input dimensions, one curve each for d = 2, 3, 4 – grows fast!]

m – number of input features
d – degree of polynomial

For d = 6, m = 100: about 1.6 billion terms
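The 1.6-billion figure can be checked directly. The number of monomials of degree exactly d in m variables is C(m + d − 1, d) (the standard stars-and-bars count; the formula itself is not stated on the slide), as in this minimal check:

```python
from math import comb

def num_monomials(m: int, d: int) -> int:
    """Number of monomials of degree exactly d in m variables."""
    return comb(m + d - 1, d)

print(num_monomials(100, 6))   # 1609344100, i.e. about 1.6 billion terms
```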
Efficient dot-product of polynomials

Polynomials of degree exactly d

d = 1:
   φ(u) · φ(v) = (u1, u2) · (v1, v2) = u1 v1 + u2 v2 = u · v

d = 2:
   φ(u) · φ(v) = (u1², u1 u2, u2 u1, u2²) · (v1², v1 v2, v2 v1, v2²)
               = u1² v1² + 2 u1 v1 u2 v2 + u2² v2²
               = (u1 v1 + u2 v2)²
               = (u · v)²

For any d (we will skip the proof):
   K(u, v) = φ(u) · φ(v) = (u · v)^d
•  Cool! Taking a dot product and raising it to a power gives the same result as mapping into the high-dimensional space and then taking the dot product there
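A quick numerical sanity check of the d = 2 identity above, using the explicit four-dimensional feature map from the derivation (numpy is assumed only for the arithmetic; the test vectors are arbitrary):

```python
import numpy as np

def phi2(x):
    """Explicit degree-2 feature map for 2D input: (x1^2, x1 x2, x2 x1, x2^2)."""
    x1, x2 = x
    return np.array([x1 * x1, x1 * x2, x2 * x1, x2 * x2])

u = np.array([1.0, 3.0])
v = np.array([2.0, -1.0])

lhs = phi2(u) @ phi2(v)   # dot product in the 4-dimensional feature space
rhs = (u @ v) ** 2        # kernel computed in the original 2D space
print(lhs, rhs)           # both equal (1*2 + 3*(-1))^2 = 1.0
```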
The "Kernel Trick"
•  A kernel function defines a dot product in some feature space.
      K(u,v) = φ(u) · φ(v)
•  Example:
   2-dimensional vectors u = [u1 u2] and v = [v1 v2]; let K(u,v) = (1 + u·v)².
   Need to show that K(xi,xj) = φ(xi) · φ(xj):
   K(u,v) = (1 + u·v)² = 1 + u1²v1² + 2u1v1u2v2 + u2²v2² + 2u1v1 + 2u2v2
          = [1, u1², √2 u1u2, u2², √2 u1, √2 u2] · [1, v1², √2 v1v2, v2², √2 v1, √2 v2]
          = φ(u) · φ(v),   where φ(x) = [1, x1², √2 x1x2, x2², √2 x1, √2 x2]
•  Thus, a kernel function implicitly maps data to a high-dimensional space
   (without the need to compute each φ(x) explicitly).
•  But, it isn't obvious yet how we will incorporate it into actual learning
   algorithms…
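The same kind of numerical check works for this inhomogeneous kernel, using the six-dimensional map written out on the slide (again with arbitrary test vectors):

```python
import numpy as np

def phi(x):
    """Feature map for K(u,v) = (1 + u.v)^2 with 2D input, as on the slide."""
    x1, x2 = x
    s2 = np.sqrt(2.0)
    return np.array([1.0, x1 * x1, s2 * x1 * x2, x2 * x2, s2 * x1, s2 * x2])

u = np.array([1.0, 2.0])
v = np.array([3.0, 1.0])

print(phi(u) @ phi(v))      # 36.0: dot product in the 6-dimensional space
print((1.0 + u @ v) ** 2)   # 36.0: (1 + u.v)^2 computed in the original space
```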
"Kernel trick" for the Perceptron!
•  Never compute features explicitly!!!
   –  Compute dot products in closed form K(u,v) = Φ(u) · Φ(v)

•  Standard Perceptron:
   •  set wi = 0 for each feature i
   •  For t = 1..T, i = 1..n:
      –  y' = sign(w · φ(xi))
      –  if y' ≠ yi
         •  w = w + yi φ(xi)

•  Kernelized Perceptron:
   •  set ai = 0 for each example i
   •  For t = 1..T, i = 1..n:
      –  y' = sign( (Σk ak φ(xk)) · φ(xi) )
            = sign( Σk ak K(xk, xi) )
      –  if y' ≠ yi
         •  ai += yi

•  At all times during learning:
      w = Σk ak φ(xk)
   Exactly the same computations, but can use K(u,v) to avoid enumerating the features!!!
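A minimal, runnable sketch of the kernelized perceptron pseudocode above (Python/numpy; the class name and parameters are my own, not from the slides; sign(0) is taken to be −1 to match the worked example that follows):

```python
import numpy as np

def sign(z):
    # The worked example's convention: sign(0) = -1
    return 1 if z > 0 else -1

class KernelizedPerceptron:
    """Kernelized perceptron: stores one coefficient a_k per training example."""

    def __init__(self, kernel, T=10):
        self.kernel = kernel          # kernel(u, v) -> float
        self.T = T                    # number of passes over the training data

    def fit(self, X, y):
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        n = len(self.X)
        self.a = np.zeros(n)          # set a_i = 0 for each example i
        # Precompute the Gram matrix K[k, i] = K(x_k, x_i)
        K = np.array([[self.kernel(xk, xi) for xi in self.X] for xk in self.X])
        for _ in range(self.T):                   # for t = 1..T
            for i in range(n):                    #   for i = 1..n
                y_hat = sign(self.a @ K[:, i])    #   y' = sign(sum_k a_k K(x_k, x_i))
                if y_hat != self.y[i]:            #   if y' != y_i
                    self.a[i] += self.y[i]        #     a_i += y_i
        return self

    def predict(self, x):
        s = sum(ak * self.kernel(xk, x) for ak, xk in zip(self.a, self.X))
        return sign(s)                # sign(sum_k a_k K(x_k, x))
```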
Example
•  set ai = 0 for each example i
•  For t = 1..T, i = 1..n:
   –  y' = sign( Σk ak K(xk, xi) )
   –  if y' ≠ yi
      •  ai += yi

Data:
   x1   x2    y
    1    1    1
   -1    1   -1
   -1   -1    1
    1   -1   -1

Kernel: K(u,v) = (u·v)²
   e.g. K(x1,x2) = K([1,1], [-1,1]) = (1·(-1) + 1·1)² = 0

Kernel matrix K:
        x1   x2   x3   x4
   x1    4    0    4    0
   x2    0    4    0    4
   x3    4    0    4    0
   x4    0    4    0    4

Initial:
•  a = [a1, a2, a3, a4] = [0, 0, 0, 0]
t=1, i=1
•  Σk ak K(xk,x1) = 0·4 + 0·0 + 0·4 + 0·0 = 0, sign(0) = -1
•  a1 += y1  →  a1 += 1, new a = [1, 0, 0, 0]
t=1, i=2
•  Σk ak K(xk,x2) = 1·0 + 0·4 + 0·0 + 0·4 = 0, sign(0) = -1
t=1, i=3
•  Σk ak K(xk,x3) = 1·4 + 0·0 + 0·4 + 0·0 = 4, sign(4) = 1
t=1, i=4
•  Σk ak K(xk,x4) = 1·0 + 0·4 + 0·0 + 0·4 = 0, sign(0) = -1
t=2, i=1
•  Σk ak K(xk,x1) = 1·4 + 0·0 + 0·4 + 0·0 = 4, sign(4) = 1
…
Converged!!!
•  y = Σk ak K(xk, x)
     = 1·K(x1,x) + 0·K(x2,x) + 0·K(x3,x) + 0·K(x4,x)
     = K(x1,x)
     = K([1,1], x)     (because x1 = [1,1])
     = (x1 + x2)²      (because K(u,v) = (u·v)²)
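Running the KernelizedPerceptron sketch from above on this dataset reproduces the trace (the variable names here are mine; the data and kernel are exactly those on the slide):

```python
import numpy as np

# Data from the slide: (x1, x2) -> y
X = np.array([[1, 1], [-1, 1], [-1, -1], [1, -1]], dtype=float)
y = np.array([1, -1, 1, -1])

poly2 = lambda u, v: float(np.dot(u, v)) ** 2   # K(u,v) = (u.v)^2

clf = KernelizedPerceptron(kernel=poly2, T=2).fit(X, y)
print(clf.a)                          # [1. 0. 0. 0.] as in the trace
print([clf.predict(x) for x in X])    # [1, -1, 1, -1] -> all training points correct
# Final classifier: y = sign(K([1,1], x)) = sign((x1 + x2)^2) with sign(0) = -1
```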
Common kernels
•  Polynomials of degree exactly d
•  Polynomials of degree up to d
•  Gaussian kernels
•  Sigmoid
•  And many others: very active area of research!
Overfitting?
•  Huge feature space with kernels, what about overfitting???
   –  Often robust to overfitting, e.g. if you don't make too many Perceptron updates
   –  SVMs (which we will see next) will have a clearer story for avoiding overfitting
   –  But everything overfits sometimes!!!
•  Can control by:
   –  Choosing a better Kernel
   –  Varying parameters of the Kernel (width of Gaussian, etc.)
Kernels in logistic regression

   P(Y = 0 | X = x, w, w0) = 1 / (1 + exp(w0 + w·x))

•  Define weights in terms of data points:

   w = Σj αj φ(xj)

   P(Y = 0 | X = x, w, w0) = 1 / (1 + exp(w0 + Σj αj φ(xj)·φ(x)))
                           = 1 / (1 + exp(w0 + Σj αj K(xj, x)))

•  Derive gradient descent rule on αj, w0
•  Similar tricks for all linear models: SVMs, etc.
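A minimal sketch of the kernelized prediction rule on this slide (training by gradient descent on the αj and w0 is left out; the function and variable names are my own):

```python
import numpy as np

def prob_y0(x, X_train, alphas, w0, kernel):
    """P(Y=0 | x) = 1 / (1 + exp(w0 + sum_j alpha_j K(x_j, x)))."""
    s = w0 + sum(a * kernel(xj, x) for a, xj in zip(alphas, X_train))
    return 1.0 / (1.0 + np.exp(s))
```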
What you need to know
•  The kernel trick
•  Derive polynomial kernel
•  Common kernels
•  Kernelized perceptron
