03 - Kernelization
Linear models predict $y = \sum_{i=1}^{p} w_i x_i + w_0 = w_0 + w_1 x_1 + \ldots + w_p x_p$
When we cannot fit the data well, we can add non-linear transformations of the features
Feature map (or basis expansion): $\phi : X \rightarrow \mathbb{R}^d$
$y = \mathbf{w}^T \mathbf{x} \;\rightarrow\; y = \mathbf{w}^T \phi(\mathbf{x})$
E.g. polynomial feature map: all polynomials up to degree $d$ and all products:
$[1, x_1, \ldots, x_p] \xrightarrow{\phi} [1, x_1, \ldots, x_p, x_1^2, \ldots, x_p^2, \ldots, x_p^d, x_1 x_2, \ldots, x_{p-1} x_p]$
Example with p = 1, d = 3 :
$y = w_0 + w_1 x_1 \;\xrightarrow{\phi}\; y = w_0 + w_1 x_1 + w_2 x_1^2 + w_3 x_1^3$
Ridge regression example
Fitting a linear Ridge regression gives weights: [0.418]
Add all polynomials up to degree 10 and fit again:
The Ridge solution in the expanded feature space:
$w^{*} = (\Phi(X)^T \Phi(X) + \alpha I)^{-1} \Phi(X)^T Y$
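A rough sketch of this in code (the toy 1-D data and the degree-10 expansion here are illustrative assumptions, not the lecture's exact example): scikit-learn's PolynomialFeatures combined with Ridge performs the basis expansion and refits.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy 1-D regression data (assumed for illustration)
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(30, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=30)

# Plain linear Ridge: one weight w1 (plus intercept w0)
linear = Ridge(alpha=1.0).fit(X, y)
print("linear weights:", linear.coef_)

# Basis expansion: add all polynomials up to degree 10, then fit Ridge again
poly_ridge = make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                           Ridge(alpha=1.0)).fit(X, y)
print("expanded weights:", poly_ridge.named_steps["ridge"].coef_)
```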
Consider the following (quadratic) feature map:
$\Phi(x) = (x_1, \ldots, x_p, x_1^2, \ldots, x_p^2, \sqrt{2}\, x_1 x_2, \ldots, \sqrt{2}\, x_{p-1} x_p)$
A kernel function exists for this feature map to compute dot products
$k_{quad}(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j) = x_i \cdot x_j + (x_i \cdot x_j)^2$
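As a quick numerical sanity check (the vectors below are made up), the explicit feature map and the kernel give the same dot products:

```python
import numpy as np

def phi_quad(x):
    """Explicit quadratic feature map: x_k, x_k^2, and sqrt(2)*x_k*x_l cross terms."""
    p = len(x)
    cross = [np.sqrt(2) * x[k] * x[l] for k in range(p) for l in range(k + 1, p)]
    return np.concatenate([x, x**2, cross])

def k_quad(xi, xj):
    """Quadratic kernel: dot product plus squared dot product."""
    return xi @ xj + (xi @ xj) ** 2

xi = np.array([1.0, 2.0, -1.0])
xj = np.array([0.5, -1.0, 3.0])

print(phi_quad(xi) @ phi_quad(xj))  # dot product computed in feature space
print(k_quad(xi, xj))               # same value, without computing phi
```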
The dual loss of the SVM contains the training inputs only as dot products:
$\mathcal{L}_{Dual}(a_i) = \sum_{i=1}^{l} a_i - \frac{1}{2} \sum_{i,j=1}^{l} a_i a_j y_i y_j (x_i \cdot x_j)$
Replacing the dot product with a kernel yields the kernelized dual:
$\mathcal{L}_{Dual}(a_i, k) = \sum_{i=1}^{l} a_i - \frac{1}{2} \sum_{i,j=1}^{l} a_i a_j y_i y_j k(x_i, x_j)$
Which kernels exist?
Intuitively, a kernel acts as a similarity measure: $k(x_1, x_2) \geq 0$, and the kernel (Gram) matrix of all pairwise kernel values must be positive semi-definite. For the linear kernel, this matrix is simply $K = XX^T$:
$K = \begin{bmatrix} k(x_1, x_1) & \ldots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \ldots & k(x_n, x_n) \end{bmatrix}$
In practice, you can either supply a kernel function or precompute the kernel matrix
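For example (a sketch on an assumed toy dataset), scikit-learn's SVC accepts either a named kernel or kernel='precomputed' together with a full kernel matrix:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.2, random_state=0)

# Option 1: supply a kernel function by name
svm = SVC(kernel="rbf", gamma=1.0).fit(X, y)

# Option 2: precompute the full kernel (Gram) matrix yourself
K = rbf_kernel(X, X, gamma=1.0)                # shape (n_samples, n_samples)
svm_pre = SVC(kernel="precomputed").fit(K, y)

# Predicting with a precomputed kernel needs kernels between test and training points
K_test = rbf_kernel(X[:5], X, gamma=1.0)
print(svm_pre.predict(K_test))
print(svm.predict(X[:5]))                      # same predictions
```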
Linear kernel
The input space is the same as the feature space: $X = \mathcal{H} = \mathbb{R}^d$
Kernel: $k_{linear}(x_i, x_j) = x_i \cdot x_j$
Polynomial kernel
Kernel: $k_{poly}(x_1, x_2) = (\gamma (x_1 \cdot x_2) + c_0)^d$, with degree $d$, scaling $\gamma$, and constant term $c_0$
RBF (Gaussian) kernel
The Radial Basis Function (RBF) feature map builds the Taylor series expansion of $e^x$:
$\Phi(x) = e^{-x^2 / 2\gamma^2} \left[1, \sqrt{\tfrac{1}{1!\,\gamma^2}}\, x, \sqrt{\tfrac{1}{2!\,\gamma^4}}\, x^2, \sqrt{\tfrac{1}{3!\,\gamma^6}}\, x^3, \ldots \right]^T$
It's a local kernel: every data point only influences data points nearby
Linear and polynomial kernels are global: every point affects the whole space
Similarity depends on the closeness of the points and the kernel width:
the value goes up for closer points and wider kernels (larger overlap)
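A small sketch of this locality (the points and gamma values are assumptions), using scikit-learn's rbf_kernel; note that sklearn parametrizes the kernel as exp(-gamma * ||x - y||^2), so a larger gamma means a narrower kernel:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Points at increasing distance from the origin (assumed for illustration)
x0 = np.zeros((1, 2))
others = np.array([[0.5, 0.0], [1.0, 0.0], [2.0, 0.0], [4.0, 0.0]])

# sklearn uses k(x, y) = exp(-gamma * ||x - y||^2): larger gamma = narrower kernel
for gamma in [0.1, 1.0, 10.0]:
    sims = rbf_kernel(x0, others, gamma=gamma).ravel()
    print(f"gamma={gamma:>4}: similarities {np.round(sims, 4)}")
```

Nearby points keep high similarity, distant points drop towards zero; with a wider kernel (smaller gamma) the similarities decay more slowly.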
Kernelized SVMs in practice
You can use SVMs with any kernel to learn non-linear decision boundaries
SVM with RBF kernel
Every support vector locally influences predictions, according to the kernel width ($\gamma$)
The prediction for a test point $u$ is the sum of the remaining influence of each support vector:
$f(u) = \sum_{i=1}^{l} a_i y_i k(x_i, u)$
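As a rough check of this formula (on assumed toy data), a fitted SVC exposes the dual coefficients $a_i y_i$ in dual_coef_; summing their kernel-weighted influence and adding the learned bias reproduces decision_function:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

u = X[:3]                                       # a few "test" points
K = rbf_kernel(u, clf.support_vectors_, gamma=0.5)

# dual_coef_ holds a_i * y_i for each support vector; add the learned bias
manual = K @ clf.dual_coef_.ravel() + clf.intercept_
print(manual)
print(clf.decision_function(u))                 # matches the manual computation
```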
Tuning RBF SVMs
gamma (kernel width)
high values cause narrow Gaussians, more support vectors, overfitting
low values cause wide Gaussians, underfitting
C (cost of margin violations)
high values punish margin violations, cause narrow margins, overfitting
low values cause wider margins, more support vectors, underfitting
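A quick sketch (on assumed toy data) showing how the number of support vectors and the training accuracy change as gamma grows:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

# Narrower Gaussians (higher gamma) fit the training data ever more closely
for gamma in [0.01, 0.1, 1, 10, 100]:
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print(f"gamma={gamma:>6}: {clf.n_support_.sum()} support vectors, "
          f"train accuracy {clf.score(X, y):.2f}")
```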
Kernel overview
SVMs in practice
C and gamma always need to be tuned: they are interacting regularizers. Find a good C first, then fine-tune gamma (see the sketch after this list)
SVMs expect all features to be approximately on the same scale
Data needs to be scaled beforehand
Can learn complex decision boundaries, even with few features
Work well on both low- and high-dimensional data
Especially good on small, high-dimensional datasets
Hard to interpret, although the support vectors can be inspected
In sklearn, you can use SVC for classification with a range of kernels
SVR for regression
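A sketch of these practical points (the dataset and the parameter grid are assumptions): scale the features in a pipeline and tune C and gamma jointly with a grid search.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features first, then fit an RBF SVM
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# C and gamma interact, so tune them together on a (log-scaled) grid
param_grid = {"svc__C": [0.01, 0.1, 1, 10, 100],
              "svc__gamma": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```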
Other kernels
There are many more possible kernels
If no kernel function exists, we can still precompute the kernel matrix
All you need is some similarity measure, and you can use SVMs
Text kernels:
Word kernels: build a bag-of-words representation of the text (e.g. TFIDF)
Kernel is the inner product between these vectors (see the sketch after this list)
Subsequence kernels: sequences are similar if they share many sub-sequences
Build a kernel matrix based on pairwise similarities
Graph kernels: Same idea (e.g. find common subgraphs to measure similarity)
These days, deep learning embeddings are more frequently used
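A minimal sketch of such a word kernel (the tiny corpus is made up): TF-IDF vectors whose inner products form the kernel matrix for an SVM.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.svm import SVC

# Tiny made-up corpus with two sentiment classes
texts = ["good movie", "great film", "wonderful acting",
         "terrible movie", "bad film", "awful acting"]
labels = [1, 1, 1, 0, 0, 0]

# Bag-of-words (TF-IDF) representation
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)

# The word kernel is the inner product between TF-IDF vectors
K = linear_kernel(X, X)
clf = SVC(kernel="precomputed").fit(K, labels)

# Predict for a new document: kernel between the new and training documents
K_new = linear_kernel(tfidf.transform(["great movie"]), X)
print(clf.predict(K_new))
```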
The Representer Theorem
We can kernelize many other loss functions as well
The Representer Theorem states that if we have a loss function consisting of a data-dependent loss $L$ and a (non-decreasing) regularizer $R$ on the norm of the weights,
$\mathcal{L}'(w) = L(y, f(x)) + \lambda R(\|w\|)$
then the weights can be described as a linear combination of the training samples:
$w = \sum_{i=1}^{n} a_i y_i x_i$
Hence, we can also kernelize Ridge regression, Logistic regression, Perceptrons, Support Vector Regression, ...
Kernelized Ridge regression
The linear Ridge regression loss (with $x_0 = 1$):
$\mathcal{L}_{Ridge}(w) = \sum_{i=1}^{n} (y_i - w \cdot x_i)^2 + \lambda \|w\|^2$
Filling in $w = \sum_{i=1}^{n} \alpha_i y_i x_i$ yields the dual formulation:
$\mathcal{L}_{Ridge}(\alpha) = \sum_{i=1}^{n} \left(y_i - \sum_{j=1}^{n} \alpha_j y_j (x_i \cdot x_j)\right)^2 + \lambda \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
Generalize $x_i \cdot x_j$ to $k(x_i, x_j)$:
$\mathcal{L}_{KernelRidge}(\alpha, k) = \sum_{i=1}^{n} \left(y_i - \sum_{j=1}^{n} \alpha_j y_j k(x_i, x_j)\right)^2 + \lambda \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$
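In practice, a kernelized Ridge regression is available in scikit-learn as KernelRidge (its internal formulation differs slightly from the derivation above); a sketch on assumed toy data:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge

# Toy 1-D regression data (assumed for illustration)
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(50, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=50)

linear = Ridge(alpha=1.0).fit(X, y)                          # linear fit underfits the sine
kernelized = KernelRidge(alpha=1.0, kernel="rbf", gamma=1.0).fit(X, y)

print("linear Ridge R^2:      ", round(linear.score(X, y), 3))
print("kernel Ridge (RBF) R^2:", round(kernelized.score(X, y), 3))
```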
A kernel matrix can be precomputed using any similarity measure (e.g. for text, graphs,...)
Any loss function where inputs appear only as dot products can be kernelized
E.g. Linear SVMs: simply replace the dot product with a kernel of choice
The Representer theorem states which other loss functions can also be kernelized and how
Ridge regression, Logistic regression, Perceptrons,...