03 - Kernelization
Linear models predict $y = \sum_{i=1}^{p} w_i x_i + w_0 = w_0 + w_1 x_1 + \ldots + w_p x_p$
When we cannot fit the data well, we can add non-linear transformations of the features
Feature map (or basis expansion): $\phi : X \rightarrow \mathbb{R}^d$
$y = \mathbf{w}^T \mathbf{x} \;\rightarrow\; y = \mathbf{w}^T \phi(\mathbf{x})$
E.g. polynomial feature map: all polynomials up to degree $d$ and all products:
$[1, x_1, \ldots, x_p] \xrightarrow{\phi} [1, x_1, \ldots, x_p, x_1^2, \ldots, x_p^2, \ldots, x_p^d, x_1 x_2, \ldots, x_{p-1} x_p]$
Example with p = 1, d = 3 :
$y = w_0 + w_1 x_1 \;\xrightarrow{\phi}\; y = w_0 + w_1 x_1 + w_2 x_1^2 + w_3 x_1^3$
Ridge regression example
Fitting a linear Ridge regression gives weights: [0.418]
Add all polynomials up to degree 10 and fit again:
The Ridge solution in the expanded feature space:
$w^{*} = (\Phi(X)^T \Phi(X) + \alpha I)^{-1} \Phi(X)^T Y$
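A rough sketch of this in code (the toy 1-D data and the degree-10 expansion here are illustrative assumptions, not the lecture's exact example): scikit-learn's PolynomialFeatures combined with Ridge performs the basis expansion and refits.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy 1-D regression data (assumed for illustration)
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(30, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=30)

# Plain linear Ridge: one weight w1 (plus intercept w0)
linear = Ridge(alpha=1.0).fit(X, y)
print("linear weights:", linear.coef_)

# Basis expansion: add all polynomials up to degree 10, then fit Ridge again
poly_ridge = make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                           Ridge(alpha=1.0)).fit(X, y)
print("expanded weights:", poly_ridge.named_steps["ridge"].coef_)
```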
Consider the following (quadratic) feature map:
$\Phi(x) = (x_1, \ldots, x_p, x_1^2, \ldots, x_p^2, \sqrt{2}\, x_1 x_2, \ldots, \sqrt{2}\, x_{p-1} x_p)$
A kernel function exists for this feature map to compute dot products
$k_{quad}(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j) = x_i \cdot x_j + (x_i \cdot x_j)^2$
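As a quick numerical sanity check (the vectors below are made up), the explicit feature map and the kernel give the same dot products:

```python
import numpy as np

def phi_quad(x):
    """Explicit quadratic feature map: x_k, x_k^2, and sqrt(2)*x_k*x_l cross terms."""
    p = len(x)
    cross = [np.sqrt(2) * x[k] * x[l] for k in range(p) for l in range(k + 1, p)]
    return np.concatenate([x, x**2, cross])

def k_quad(xi, xj):
    """Quadratic kernel: dot product plus squared dot product."""
    return xi @ xj + (xi @ xj) ** 2

xi = np.array([1.0, 2.0, -1.0])
xj = np.array([0.5, -1.0, 3.0])

print(phi_quad(xi) @ phi_quad(xj))  # dot product computed in feature space
print(k_quad(xi, xj))               # same value, without computing phi
```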
The dual loss of the SVM contains the training inputs only as dot products:
$\mathcal{L}_{Dual}(a_i) = \sum_{i=1}^{l} a_i - \frac{1}{2} \sum_{i,j=1}^{l} a_i a_j y_i y_j (x_i \cdot x_j)$
Replacing the dot product with a kernel yields the kernelized dual:
$\mathcal{L}_{Dual}(a_i, k) = \sum_{i=1}^{l} a_i - \frac{1}{2} \sum_{i,j=1}^{l} a_i a_j y_i y_j k(x_i, x_j)$
Which kernels exist?
Intuitively, a kernel acts as a similarity measure: $k(x_1, x_2) \geq 0$, and the kernel (Gram) matrix of all pairwise kernel values must be positive semi-definite. For the linear kernel, this matrix is simply $K = XX^T$:
$K = \begin{bmatrix} k(x_1, x_1) & \ldots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \ldots & k(x_n, x_n) \end{bmatrix}$
In practice, you can either supply a kernel function or precompute the kernel matrix
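For example (a sketch on an assumed toy dataset), scikit-learn's SVC accepts either a named kernel or kernel='precomputed' together with a full kernel matrix:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.2, random_state=0)

# Option 1: supply a kernel function by name
svm = SVC(kernel="rbf", gamma=1.0).fit(X, y)

# Option 2: precompute the full kernel (Gram) matrix yourself
K = rbf_kernel(X, X, gamma=1.0)                # shape (n_samples, n_samples)
svm_pre = SVC(kernel="precomputed").fit(K, y)

# Predicting with a precomputed kernel needs kernels between test and training points
K_test = rbf_kernel(X[:5], X, gamma=1.0)
print(svm_pre.predict(K_test))
print(svm.predict(X[:5]))                      # same predictions
```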
Linear kernel
The input space is the same as the feature space: $X = \mathcal{H} = \mathbb{R}^d$
Kernel: $k_{linear}(x_i, x_j) = x_i \cdot x_j$
Polynomial kernel
Kernel: $k_{poly}(x_1, x_2) = (\gamma (x_1 \cdot x_2) + c_0)^d$, with degree $d$, scaling $\gamma$, and constant term $c_0$
RBF (Gaussian) kernel
The Radial Basis Function (RBF) feature map builds the Taylor series expansion of $e^x$:
$\Phi(x) = e^{-x^2 / 2\gamma^2} \left[1, \sqrt{\tfrac{1}{1!\,\gamma^2}}\, x, \sqrt{\tfrac{1}{2!\,\gamma^4}}\, x^2, \sqrt{\tfrac{1}{3!\,\gamma^6}}\, x^3, \ldots \right]^T$
It's a local kernel: every data point only influences data points nearby
Linear and polynomial kernels are global: every point affects the whole space
Similarity depends on the closeness of the points and the kernel width:
the value goes up for closer points and wider kernels (larger overlap)
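A small sketch of this locality (the points and gamma values are assumptions), using scikit-learn's rbf_kernel; note that sklearn parametrizes the kernel as exp(-gamma * ||x - y||^2), so a larger gamma means a narrower kernel:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Points at increasing distance from the origin (assumed for illustration)
x0 = np.zeros((1, 2))
others = np.array([[0.5, 0.0], [1.0, 0.0], [2.0, 0.0], [4.0, 0.0]])

# sklearn uses k(x, y) = exp(-gamma * ||x - y||^2): larger gamma = narrower kernel
for gamma in [0.1, 1.0, 10.0]:
    sims = rbf_kernel(x0, others, gamma=gamma).ravel()
    print(f"gamma={gamma:>4}: similarities {np.round(sims, 4)}")
```

Nearby points keep high similarity, distant points drop towards zero; with a wider kernel (smaller gamma) the similarities decay more slowly.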
Kernelized SVMs in practice
You can use SVMs with any kernel to learn non-linear decision boundaries
SVM with RBF kernel
Every support vector locally influences predictions, according to the kernel width ($\gamma$)
The prediction for a test point $u$ is the sum of the remaining influence of each support vector:
$f(u) = \sum_{i=1}^{l} a_i y_i k(x_i, u)$
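As a rough check of this formula (on assumed toy data), a fitted SVC exposes the dual coefficients $a_i y_i$ in dual_coef_; summing their kernel-weighted influence and adding the learned bias reproduces decision_function:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

u = X[:3]                                       # a few "test" points
K = rbf_kernel(u, clf.support_vectors_, gamma=0.5)

# dual_coef_ holds a_i * y_i for each support vector; add the learned bias
manual = K @ clf.dual_coef_.ravel() + clf.intercept_
print(manual)
print(clf.decision_function(u))                 # matches the manual computation
```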
Tuning RBF SVMs
gamma (kernel width)
high values cause narrow Gaussians, more support vectors, overfitting
low values cause wide Gaussians, underfitting
C (cost of margin violations)
high values punish margin violations, cause narrow margins, overfitting
low values cause wider margins, more support vectors, underfitting
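A quick sketch (on assumed toy data) showing how the number of support vectors and the training accuracy change as gamma grows:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

# Narrower Gaussians (higher gamma) fit the training data ever more closely
for gamma in [0.01, 0.1, 1, 10, 100]:
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print(f"gamma={gamma:>6}: {clf.n_support_.sum()} support vectors, "
          f"train accuracy {clf.score(X, y):.2f}")
```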
Kernel overview
SVMs in practice
C and gamma always need to be tuned: they are interacting regularizers. Find a good C first, then fine-tune gamma (see the sketch after this list)
SVMs expect all features to be approximately on the same scale
Data needs to be scaled beforehand
Can learn complex decision boundaries, even with few features
Work well on both low- and high-dimensional data
Especially good on small, high-dimensional datasets
Hard to interpret, although the support vectors can be inspected
In sklearn, you can use SVC for classification with a range of kernels
SVR for regression
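A sketch of these practical points (the dataset and the parameter grid are assumptions): scale the features in a pipeline and tune C and gamma jointly with a grid search.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features first, then fit an RBF SVM
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# C and gamma interact, so tune them together on a (log-scaled) grid
param_grid = {"svc__C": [0.01, 0.1, 1, 10, 100],
              "svc__gamma": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```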
Other kernels
There are many more possible kernels
If no kernel function exists, we can still precompute the kernel matrix
All you need is some similarity measure, and you can use SVMs
Text kernels:
Word kernels: build a bag-of-words representation of the text (e.g. TFIDF)
Kernel is the inner product between these vectors (see the sketch after this list)
Subsequence kernels: sequences are similar if they share many sub-sequences
Build a kernel matrix based on pairwise similarities
Graph kernels: Same idea (e.g. find common subgraphs to measure similarity)
These days, deep learning embeddings are more frequently used
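A minimal sketch of such a word kernel (the tiny corpus is made up): TF-IDF vectors whose inner products form the kernel matrix for an SVM.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.svm import SVC

# Tiny made-up corpus with two sentiment classes
texts = ["good movie", "great film", "wonderful acting",
         "terrible movie", "bad film", "awful acting"]
labels = [1, 1, 1, 0, 0, 0]

# Bag-of-words (TF-IDF) representation
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)

# The word kernel is the inner product between TF-IDF vectors
K = linear_kernel(X, X)
clf = SVC(kernel="precomputed").fit(K, labels)

# Predict for a new document: kernel between the new and training documents
K_new = linear_kernel(tfidf.transform(["great movie"]), X)
print(clf.predict(K_new))
```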
The Representer Theorem
We can kernelize many other loss functions as well
The Representer Theorem states that if we have a loss function consisting of a data-dependent loss $L$ and a (non-decreasing) regularizer $R$ on the norm of the weights,
$\mathcal{L}'(w) = L(y, f(x)) + \lambda R(\|w\|)$
then the weights can be described as a linear combination of the training samples:
$w = \sum_{i=1}^{n} a_i y_i x_i$
Hence, we can also kernelize Ridge regression, Logistic regression, Perceptrons, Support Vector Regression, ...
Kernelized Ridge regression
The linear Ridge regression loss (with $x_0 = 1$):
$\mathcal{L}_{Ridge}(w) = \sum_{i=1}^{n} (y_i - w \cdot x_i)^2 + \lambda \|w\|^2$
Filling in $w = \sum_{i=1}^{n} \alpha_i y_i x_i$ yields the dual formulation:
$\mathcal{L}_{Ridge}(\alpha) = \sum_{i=1}^{n} \left(y_i - \sum_{j=1}^{n} \alpha_j y_j (x_i \cdot x_j)\right)^2 + \lambda \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
Generalize $x_i \cdot x_j$ to $k(x_i, x_j)$:
$\mathcal{L}_{KernelRidge}(\alpha, k) = \sum_{i=1}^{n} \left(y_i - \sum_{j=1}^{n} \alpha_j y_j k(x_i, x_j)\right)^2 + \lambda \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$
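In practice, a kernelized Ridge regression is available in scikit-learn as KernelRidge (its internal formulation differs slightly from the derivation above); a sketch on assumed toy data:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge

# Toy 1-D regression data (assumed for illustration)
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(50, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=50)

linear = Ridge(alpha=1.0).fit(X, y)                          # linear fit underfits the sine
kernelized = KernelRidge(alpha=1.0, kernel="rbf", gamma=1.0).fit(X, y)

print("linear Ridge R^2:      ", round(linear.score(X, y), 3))
print("kernel Ridge (RBF) R^2:", round(kernelized.score(X, y), 3))
```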
A kernel matrix can be precomputed using any similarity measure (e.g. for text, graphs,...)
Any loss function where inputs appear only as dot products can be kernelized
E.g. Linear SVMs: simply replace the dot product with a kernel of choice
The Representer theorem states which other loss functions can also be kernelized and how
Ridge regression, Logistic regression, Perceptrons,...