
Lecture 3: Kernelization

Making linear models non-linear


Joaquin Vanschoren
Feature Maps
Linear models: $\hat{y} = \mathbf{w} \cdot \mathbf{x} + w_0 = \sum_{i=1}^{p} w_i x_i + w_0 = w_0 + w_1 x_1 + \ldots + w_p x_p$

When we cannot fit the data well, we can add non-linear transformations of the features
Feature map (or basis expansion): $\phi: X \rightarrow \mathbb{R}^d$

$y = \mathbf{w}^T \mathbf{x} \;\rightarrow\; y = \mathbf{w}^T \phi(\mathbf{x})$

E.g. polynomial feature map: all polynomials up to degree $d$ and all products:

$\phi: [1, x_1, \ldots, x_p] \;\rightarrow\; [1, x_1, \ldots, x_p, x_1^2, \ldots, x_p^2, \ldots, x_p^d, x_1 x_2, \ldots, x_{p-1} x_p]$

Example with $p = 1$, $d = 3$:

$y = w_0 + w_1 x_1 \;\xrightarrow{\phi}\; y = w_0 + w_1 x_1 + w_2 x_1^2 + w_3 x_1^3$
Ridge regression example
Weights: [0.418]
Add all polynomials $x^d$ up to degree 10 and fit again (e.g. using sklearn's PolynomialFeatures; a code sketch follows the output below):


   x0         x0^2      x0^3       x0^4       x0^5        x0^6        x0^7         x0^8         ...
0  -0.752759  0.566647  -0.426548  0.321088   -0.241702   0.181944    -0.136960    0.103098     ...
1   2.704286  7.313162  19.776880  53.482337  144.631526  391.124988  1057.713767  2860.360362  ...
2   1.391964  1.937563  2.697017   3.754150   5.225640    7.273901    10.125005    14.093639    ...
3   0.591951  0.350406  0.207423   0.122784   0.072682    0.043024    0.025468     0.015076     ...
4  -2.063888  4.259634  -8.791409  18.144485  -37.448187  77.288869   -159.515582  329.222321   ...
(columns for degrees 9 and 10 not shown)
Weights: [ 0.643 0.297 -0.69 -0.264 0.41 0.096 -0.076 -0.014 0.004 0.001]
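A minimal sketch of this workflow in scikit-learn (the toy 1D dataset and hyperparameters here are illustrative, not the lecture's data):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

# Illustrative 1D regression data (not the lecture's dataset)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=30)

# Expand x into [x, x^2, ..., x^10]; include_bias=False leaves the intercept to Ridge
poly = PolynomialFeatures(degree=10, include_bias=False)
X_poly = poly.fit_transform(X)

ridge = Ridge(alpha=1.0).fit(X_poly, y)
print(ridge.coef_)  # one weight per polynomial feature
```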
How expensive is this?

You may need MANY dimensions to fit the data


Memory and computational cost
More weights to learn, more likely overfitting
Ridge has a closed-form solution which we can compute with linear algebra:
$w^{*} = (X^T X + \alpha I)^{-1} X^T Y$

Since $X$ has $n$ rows (examples) and $d$ columns (features), $X^T X$ has dimensionality $d \times d$

Hence Ridge is quadratic in the number of features: $O(d^2 n)$
After the feature map $\Phi$, we get:

$w^{*} = (\Phi(X)^T \Phi(X) + \alpha I)^{-1} \Phi(X)^T Y$

Since $\Phi$ increases $d$ a lot, $\Phi(X)^T \Phi(X)$ becomes huge
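A small numpy sketch of this closed-form solution (illustrative data), showing that the matrix being solved is $d \times d$, which is exactly what blows up after a feature map:

```python
import numpy as np

rng = np.random.RandomState(0)
n, d, alpha = 100, 5, 1.0
X = rng.randn(n, d)
Y = rng.randn(n)

# w* = (X^T X + alpha I)^-1 X^T Y : solve a d x d linear system
w = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)
print(w.shape)  # (d,)

# After a feature map Phi, d grows (e.g. all degree-10 polynomial terms),
# so Phi(X)^T Phi(X) and the cost of solving the system grow with it
```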
Linear SVM example (classification)
We can add a new feature by taking the squares of feature1 values
Now we can fit a linear model
As a function of the original features, the decision boundary is now a polynomial as well
$y = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_2^2 > 0$
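A sketch of this trick (the data and class labels are illustrative): append the square of the second feature and fit a linear SVM on the augmented features.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 1] ** 2 > 1).astype(int)  # not linearly separable in (x1, x2)

# Append x2^2 as an extra feature; the boundary is linear in (x1, x2, x2^2)
X_aug = np.hstack([X, (X[:, 1] ** 2)[:, None]])
clf = SVC(kernel='linear', C=1.0).fit(X_aug, y)
print(clf.coef_, clf.intercept_)  # (w1, w2, w3) and w0
```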
The kernel trick
Computations in explicit, high-dimensional feature maps are expensive
For some feature maps, we can, however, compute distances between points cheaply
Without explicitly constructing the high-dimensional space at all
Example: quadratic feature map for $x = (x_1, \ldots, x_p)$:

$\Phi(x) = (x_1, \ldots, x_p, x_1^2, \ldots, x_p^2, \sqrt{2}\, x_1 x_2, \ldots, \sqrt{2}\, x_{p-1} x_p)$

A kernel function exists for this feature map to compute dot products:

$k_{quad}(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j) = x_i \cdot x_j + (x_i \cdot x_j)^2$

Skip the computation of $\Phi(x_i)$ and $\Phi(x_j)$ and compute $k(x_i, x_j)$ directly
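A numerical check of this identity (a sketch; the explicit map is written out only to show that both sides agree):

```python
import numpy as np

rng = np.random.RandomState(0)
xi, xj = rng.randn(4), rng.randn(4)

def phi_quad(x):
    # explicit quadratic feature map: (x_1..x_p, x_1^2..x_p^2, sqrt(2)*x_a*x_b for a < b)
    p = len(x)
    cross = [np.sqrt(2) * x[a] * x[b] for a in range(p) for b in range(a + 1, p)]
    return np.concatenate([x, x ** 2, cross])

lhs = phi_quad(xi) @ phi_quad(xj)   # dot product in the explicit feature space
rhs = xi @ xj + (xi @ xj) ** 2      # kernel trick: no feature map needed
print(np.isclose(lhs, rhs))         # True
```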


Kernelization
Kernel $k$ corresponding to a feature map $\Phi$: $k(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$

Computes the dot product between $x_i$ and $x_j$ in a high-dimensional space $\mathcal{H}$

Kernels are sometimes called generalized dot products

$\mathcal{H}$ is called the reproducing kernel Hilbert space (RKHS)

The dot product is a measure of the similarity between $x_i$ and $x_j$

Hence, a kernel can be seen as a similarity measure for high-dimensional spaces

If we have a loss function based on dot products $x_i \cdot x_j$, it can be kernelized
Simply replace the dot products with $k(x_i, x_j)$
Example: SVMs
Linear SVMs (dual form, for $l$ support vectors with dual coefficients $a_i$ and classes $y_i$):

$\mathcal{L}_{Dual}(a_i) = \sum_{i=1}^{l} a_i - \frac{1}{2} \sum_{i,j=1}^{l} a_i a_j y_i y_j (x_i \cdot x_j)$

Kernelized SVM, using any existing kernel $k$ we want:

$\mathcal{L}_{Dual}(a_i, k) = \sum_{i=1}^{l} a_i - \frac{1}{2} \sum_{i,j=1}^{l} a_i a_j y_i y_j \, k(x_i, x_j)$
Which kernels exist?

A (Mercer) kernel is any function $k: X \times X \rightarrow \mathbb{R}$ with these properties:

Symmetry: $k(x_1, x_2) = k(x_2, x_1) \;\; \forall x_1, x_2 \in X$
Positive definite: the kernel matrix $K$ is positive semi-definite
Intuitively, $k(x_1, x_2) \geq 0$

The kernel matrix (or Gram matrix) for $n$ points $x_1, \ldots, x_n \in X$ is defined as:

$K = XX^T = \begin{bmatrix} k(x_1, x_1) & \ldots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \ldots & k(x_n, x_n) \end{bmatrix}$

Once computed ($O(n^2)$), simply look up $k(x_1, x_2)$ for any two points

In practice, you can either supply a kernel function or precompute the kernel matrix
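Both options are supported by scikit-learn's SVC; a sketch with illustrative data, using the RBF kernel simply as an example kernel:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Option 1: supply a kernel function (any callable returning a kernel matrix)
svm1 = SVC(kernel=lambda A, B: rbf_kernel(A, B, gamma=0.5)).fit(X, y)

# Option 2: precompute the n x n Gram matrix and pass kernel='precomputed'
K = rbf_kernel(X, X, gamma=0.5)
svm2 = SVC(kernel='precomputed').fit(K, y)

# At prediction time, the precomputed variant needs k(test points, training points)
X_test = rng.randn(5, 3)
print(svm2.predict(rbf_kernel(X_test, X, gamma=0.5)))
```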
Linear kernel
Input space is the same as the output space: $X = \mathcal{H} = \mathbb{R}^d$

Feature map: $\Phi(x) = x$

Kernel: $k_{linear}(x_i, x_j) = x_i \cdot x_j$

Geometrically, the dot product is the projection of $x_j$ on the hyperplane defined by $x_i$
It becomes larger if $x_i$ and $x_j$ point in the same 'direction'

Linear kernel between the point $(0, 1)$ and another unit vector at angle $a$ (in radians):
Points with similar angles are deemed similar
Polynomial kernel
If $k_1$, $k_2$ are kernels, then $\lambda \cdot k_1$ (for $\lambda \geq 0$), $k_1 + k_2$, and $k_1 \cdot k_2$ are also kernels

The polynomial kernel (for degree $d \in \mathbb{N}$) reproduces the polynomial feature map:

$\gamma$ is a scaling hyperparameter (default $\frac{1}{p}$)
$c_0$ is a hyperparameter (default 1) to trade off the influence of higher-order terms

$k_{poly}(x_1, x_2) = (\gamma (x_1 \cdot x_2) + c_0)^d$
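As a quick sanity check, this formula matches scikit-learn's polynomial_kernel (a sketch with arbitrary values for the hyperparameters):

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.RandomState(0)
X = rng.randn(5, 3)
gamma, c0, d = 1.0 / 3, 1.0, 3   # gamma = 1/p with p = 3 features

# k_poly(x1, x2) = (gamma * (x1 . x2) + c0)^d, computed two ways
K_manual = (gamma * (X @ X.T) + c0) ** d
K_sklearn = polynomial_kernel(X, degree=d, gamma=gamma, coef0=c0)
print(np.allclose(K_manual, K_sklearn))  # True
```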
RBF (Gaussian) kernel

The Radial Basis Function (RBF) feature map builds the Taylor series expansion of $e^x$:

$\Phi(x) = e^{-x^2 / 2\gamma^2} \left[ 1, \sqrt{\tfrac{1}{1!\,\gamma^2}}\, x, \sqrt{\tfrac{1}{2!\,\gamma^4}}\, x^2, \sqrt{\tfrac{1}{3!\,\gamma^6}}\, x^3, \ldots \right]^T$

RBF (or Gaussian) kernel with kernel width $\gamma \geq 0$:

$k_{RBF}(x_1, x_2) = \exp(-\gamma ||x_1 - x_2||^2)$

The RBF kernel does not use a dot product: it only considers the distance between $x_1$ and $x_2$
It's a local kernel: every data point only influences data points nearby
Linear and polynomial kernels are global: every point affects the whole space
Similarity depends on the closeness of points and the kernel width:
the value goes up for closer points and wider kernels (larger overlap)
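A tiny sketch of this locality using scikit-learn's rbf_kernel (points and $\gamma$ values chosen arbitrarily):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x1 = np.array([[0.0, 0.0]])
others = np.array([[0.5, 0.0], [2.0, 0.0], [5.0, 0.0]])  # increasingly far from x1

# Similarity drops with distance; smaller gamma (wider kernel) decays more slowly
for gamma in (0.1, 1.0, 10.0):
    print(gamma, rbf_kernel(x1, others, gamma=gamma).round(4))
```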
Kernelized SVMs in practice
You can use SVMs with any kernel to learn non-linear decision boundaries
SVM with RBF kernel
Every support vector locally influences predictions, according to the kernel width ($\gamma$)

The prediction for a test point $u$ is the sum of the remaining influence of each support vector:

$f(u) = \sum_{i=1}^{l} a_i y_i \, k(x_i, u)$
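With scikit-learn's SVC we can reproduce this sum from the fitted attributes (a sketch with illustrative data; note that dual_coef_ stores the products $a_i y_i$, and that SVC also adds an intercept term not shown in the formula above):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

gamma = 0.5
svm = SVC(kernel='rbf', gamma=gamma, C=1.0).fit(X, y)

u = rng.randn(3, 2)                                     # test points
K = rbf_kernel(u, svm.support_vectors_, gamma=gamma)    # k(x_i, u) for each support vector

f_manual = K @ svm.dual_coef_.ravel() + svm.intercept_  # sum_i a_i y_i k(x_i, u) + b
print(np.allclose(f_manual, svm.decision_function(u)))  # True
```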
Tuning RBF SVMs
gamma (kernel width)
high values cause narrow Gaussians, more support vectors, overfitting
low values cause wide Gaussians, underfitting
C (cost of margin violations)
high values punish margin violations, cause narrow margins, overfitting
low values cause wider margins, more support vectors, underfitting
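In practice, C and gamma are usually tuned together with a cross-validated grid search; a minimal sketch (data and parameter ranges are illustrative):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

param_grid = {'C': [0.01, 0.1, 1, 10, 100],
              'gamma': [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```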
Kernel overview
SVMs in practice
C and gamma always need to be tuned
Interacting regularizers: find a good C, then fine-tune gamma
SVMs expect all features to be approximately on the same scale
Data needs to be scaled beforehand (see the sketch after this list)
They allow learning complex decision boundaries, even with few features
Work well on both low- and high-dimensional data
Especially good for small, high-dimensional datasets
Hard to inspect, although the support vectors can be inspected
In sklearn, you can use SVC for classification with a range of kernels, and SVR for regression
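A minimal sketch of the scaling advice above: put the scaler and the SVM in one pipeline so the scaling is learned from the training data (the toy features are illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two features on wildly different scales
X = np.hstack([rng.randn(100, 1), 1000 * rng.randn(100, 1)])
y = (X[:, 0] > 0).astype(int)

# StandardScaler is fitted inside the pipeline, so the SVC sees comparable scales
model = make_pipeline(StandardScaler(), SVC(kernel='rbf')).fit(X, y)
print(model.score(X, y))
```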
Other kernels
There are many more possible kernels
If no kernel function exists, we can still precompute the kernel matrix
All you need is some similarity measure, and you can use SVMs
Text kernels:
Word kernels: build a bag-of-words representation of the text (e.g. TF-IDF); the kernel is the inner product between these vectors (sketched below)
Subsequence kernels: sequences are similar if they share many subsequences; build a kernel matrix based on pairwise similarities
Graph kernels: same idea (e.g. find common subgraphs to measure similarity)
These days, deep learning embeddings are more frequently used
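A sketch of the word-kernel idea (the toy documents and labels are made up): TF-IDF vectors whose inner products form a precomputed kernel matrix for an SVM.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.svm import SVC

docs = ["the cat sat on the mat", "dogs chase cats",
        "stocks fell sharply", "markets rallied today"]
labels = [0, 0, 1, 1]                          # e.g. animals vs. finance

tfidf = TfidfVectorizer().fit(docs)
X = tfidf.transform(docs)

K = linear_kernel(X, X)                        # inner products between TF-IDF vectors
svm = SVC(kernel='precomputed').fit(K, labels)

new = tfidf.transform(["the dog sat today"])
print(svm.predict(linear_kernel(new, X)))      # kernel between new and training docs
```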
The Representer Theorem
We can kernelize many other loss functions as well
The Representer Theorem states that, if we have a loss function $\mathcal{L}$ with

$L$ an arbitrary loss function using some function $f$ of the inputs $x$
$R$ a (non-decreasing) regularization score (e.g. L1 or L2) and a constant $\lambda$

$\mathcal{L}(w) = L(y, f(x)) + \lambda R(||w||)$

then the weights $w$ can be described as a linear combination of the training samples:

$w = \sum_{i} a_i y_i f(x_i)$

Note that this is exactly what we found for SVMs: $w = \sum_{i=1}^{l} a_i y_i x_i$

Hence, we can also kernelize Ridge regression, Logistic regression, Perceptrons, Support Vector Regression, ...
Kernelized Ridge regression
The linear Ridge regression loss (with $x_0 = 1$):

$\mathcal{L}_{Ridge}(w) = \sum_{i=0}^{n} (y_i - w x_i)^2 + \lambda \|w\|^2$

Filling in $w = \sum_{i=1}^{n} \alpha_i y_i x_i$ yields the dual formulation:

$\mathcal{L}_{Ridge}(\alpha) = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{n} \alpha_j y_j \, x_i \cdot x_j \right)^2 + \lambda \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$

Generalize $x_i \cdot x_j$ to $k(x_i, x_j)$:

$\mathcal{L}_{KernelRidge}(\alpha, k) = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{n} \alpha_j y_j \, k(x_i, x_j) \right)^2 + \lambda \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)$


Example of kernelized Ridge
The prediction (red) is now a linear combination of kernels (blue): $y = \sum_{j=1}^{n} \alpha_j y_j k(x, x_j)$

We learn a dual coefficient for each point
Fitting our regression data with KernelRidge
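A sketch of such a fit with scikit-learn's KernelRidge (RBF kernel; the toy data and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=40)

model = KernelRidge(kernel='rbf', alpha=0.1, gamma=1.0).fit(X, y)
print(model.dual_coef_.shape)   # one dual coefficient per training point
print(model.predict(X[:5]))     # predictions are weighted sums of kernels
```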
Other kernelized methods
Same procedure can be done for logistic regression
For perceptrons, $\alpha_i \rightarrow \alpha_i + 1$ after every misclassification:

$\mathcal{L}_{DualPerceptron}(x_i, k) = \max\left(0, \; y_i \sum_{j=1}^{n} \alpha_j y_j \, k(x_j, x_i)\right)$

Support Vector Regression behaves similarly to Kernel Ridge
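A sketch of the dual (kernel) perceptron update described above, with an RBF kernel and labels in {-1, +1} (data, kernel width and the number of epochs are illustrative):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # labels in {-1, +1}

K = rbf_kernel(X, X, gamma=1.0)              # precomputed kernel matrix
alpha = np.zeros(len(X))

for epoch in range(10):
    for i in range(len(X)):
        # dual prediction for x_i: sum_j alpha_j y_j k(x_j, x_i)
        if y[i] * np.sum(alpha * y * K[:, i]) <= 0:
            alpha[i] += 1                    # alpha_i -> alpha_i + 1 on a mistake

pred = np.sign(K @ (alpha * y))
print((pred == y).mean())                    # training accuracy
```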


Summary
Feature maps $\Phi(x)$ transform features to create a higher-dimensional space
Allow learning non-linear functions or boundaries, but are very expensive/slow

For some $\Phi(x)$, we can compute dot products without constructing this space
Kernel trick: $k(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$
A kernel $k$ (generalized dot product) is a measure of similarity between $x_i$ and $x_j$

There are many such kernels
Polynomial kernel: $k_{poly}(x_1, x_2) = (\gamma (x_1 \cdot x_2) + c_0)^d$
RBF (Gaussian) kernel: $k_{RBF}(x_1, x_2) = \exp(-\gamma ||x_1 - x_2||^2)$
A kernel matrix can be precomputed using any similarity measure (e.g. for text, graphs, ...)

Any loss function where the inputs appear only as dot products can be kernelized
E.g. linear SVMs: simply replace the dot product with a kernel of choice
The Representer Theorem states which other loss functions can also be kernelized, and how
Ridge regression, Logistic regression, Perceptrons, ...
