Kernel methods allow algorithms that traditionally operate on vectors to instead operate on arbitrary similarity measures between pairs of data points. This is done through kernels, which compute similarities without needing to explicitly map data into a feature space. Three key points:
1. Mercer kernels define valid similarity measures and correspond to an inner product in some potentially infinite-dimensional feature space.
2. Common kernels include linear, polynomial, Gaussian, and domain-specific kernels for text, graphs, etc.
3. The "kernel trick" allows kernelizing algorithms like SVM, ridge regression, and k-means by replacing inner products with kernels, avoiding explicit feature mappings.

Kernel Methods

Mercer Kernels
In general, $\mathcal{H}$ has to be a Hilbert space which, technicalities aside, is very much like a real vector space such as $\mathbb{R}^d$ but is possibly infinite dimensional
It is always possible to define inner/dot products on Hilbert spaces
Suppose $\mathbf{u}, \mathbf{v} \in \mathbb{R}^d$ are any two unit vectors
The dot product $\langle\mathbf{u},\mathbf{v}\rangle$ is a natural notion of similarity between these vectors
It is highest when the vectors are the same, i.e. when $\mathbf{u} = \mathbf{v}$ we have $\langle\mathbf{u},\mathbf{v}\rangle = 1$
It is lowest when the vectors are diametrically opposite, i.e. $\mathbf{u} = -\mathbf{v}$ and $\langle\mathbf{u},\mathbf{v}\rangle = -1$
Mercer kernels are notions of similarity that extend such nice behaviour
Given a set $\mathcal{X}$ of objects (images, videos, strings, genome sequences), a similarity function $K: \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is called a Mercer kernel if there exists a map $\phi: \mathcal{X}\to\mathcal{H}$ s.t. for all $\mathbf{x},\mathbf{z}\in\mathcal{X}$, $K(\mathbf{x},\mathbf{z}) = \langle\phi(\mathbf{x}),\phi(\mathbf{z})\rangle_{\mathcal{H}}$
$\phi$ is often called the feature map or feature embedding; $\mathcal{H}$ can be $\mathbb{R}^D$ for some large/moderate $D$, and can even be infinite dimensional
Thus, when asked to give the similarity between two objects, all that a Mercer kernel does is first map those objects to two (high-dim) vectors and return the dot/inner product between those two vectors
Examples of Kernels
Poly kernels are called homogeneous if $c = 0$; $c < 0$ makes the kernel non-Mercer
When $\mathbf{x}, \mathbf{z} \in \mathbb{R}^d$ are vectors
Linear kernel $K(\mathbf{x},\mathbf{z}) = \langle\mathbf{x},\mathbf{z}\rangle$
Quadratic kernel $K(\mathbf{x},\mathbf{z}) = \left(\langle\mathbf{x},\mathbf{z}\rangle + c\right)^2$, $c \ge 0$
Polynomial kernel $K(\mathbf{x},\mathbf{z}) = \left(\langle\mathbf{x},\mathbf{z}\rangle + c\right)^q$, $c \ge 0$, $q \in \mathbb{N}$
Gaussian kernel $K(\mathbf{x},\mathbf{z}) = \exp\left(-\gamma\|\mathbf{x}-\mathbf{z}\|_2^2\right)$, $\gamma > 0$
Laplacian kernel $K(\mathbf{x},\mathbf{z}) = \exp\left(-\gamma\|\mathbf{x}-\mathbf{z}\|_1\right)$, $\gamma > 0$
All of the above are Mercer kernels
There indeed exist feature maps for each of them (proving so is a bit tedious)
$c, q, \gamma$ need to be tuned. Large $q$, $\gamma$ can cause overfitting
Notice all of the above are indeed notions of similarity
Take two unit vectors $\mathbf{x}, \mathbf{z}$ (unit for the sake of normalization). Easy to verify that $K(\mathbf{x},\mathbf{z})$ is largest when $\mathbf{x} = \mathbf{z}$ and smallest when $\mathbf{x} = -\mathbf{z}$ (a small NumPy sketch of these kernels follows)
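A minimal NumPy sketch (not part of the original slides) of the kernels listed above; the parameter names c, q and gamma mirror the slide notation.

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, c=1.0, q=3):
    return (x @ z + c) ** q          # the quadratic kernel is the special case q = 2

def gaussian_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def laplacian_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum(np.abs(x - z)))

# Sanity check on unit vectors: similarity is largest when the vectors coincide
u = np.array([1.0, 0.0]); v = np.array([0.0, 1.0])
assert gaussian_kernel(u, u) > gaussian_kernel(u, v)
```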
Mercer Kernel Feature Maps
Homogeneous poly kernels ($c = 0$) only use features of the form $x_{i_1}x_{i_2}\cdots x_{i_q}$, i.e. monomials of degree exactly $q$. In contrast, if we have $c > 0$ then the kernels use all features of the form $x_{i_1}\cdots x_{i_k}$ where $k \le q$, i.e. monomials of degree up to $q$ – non-homogeneous poly kernels use more expressive feature maps
Linear kernel $K(\mathbf{x},\mathbf{z}) = \langle\mathbf{x},\mathbf{z}\rangle$
Use $\phi(\mathbf{x}) = \mathbf{x}$, i.e. $\mathcal{H} = \mathbb{R}^d$. Called "linear" for a reason: any linear function over the feature map $\phi(\mathbf{x})$ is just a linear function over the original features $\mathbf{x}$
Quadratic kernel $K(\mathbf{x},\mathbf{z}) = \left(\langle\mathbf{x},\mathbf{z}\rangle + 1\right)^2$
We have $K(\mathbf{x},\mathbf{z}) = \langle\phi(\mathbf{x}),\phi(\mathbf{z})\rangle$ when $\phi(\mathbf{x}) = \left(1, \sqrt{2}x_1,\ldots,\sqrt{2}x_d,\ x_1^2,\ldots,x_d^2,\ \sqrt{2}x_1x_2,\ldots,\sqrt{2}x_{d-1}x_d\right)$, where $\mathcal{H} = \mathbb{R}^{\binom{d+2}{2}}$
Similar constructions (more tedious to write) exist for the polynomial kernel of degree $q$
Called "quadratic" for a reason: any linear function over $\phi(\mathbf{x})$ is a quadratic function over the original features $\mathbf{x}$. Verify this for the simple case $d = 1$ yourself (a small numerical check follows below)
If we use a linear ML algo over $\phi(\mathbf{x})$, we can learn any quadratic function over the original features $\mathbf{x}$. The polynomial kernel of degree $q$ similarly allows learning of degree-$q$ polynomial functions over the original data
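A hedged illustration (not from the slides) of the explicit feature map for the quadratic kernel $(\langle\mathbf{x},\mathbf{z}\rangle + 1)^2$ in $d = 2$, checking that the kernel value equals the inner product of the mapped vectors; phi_quad is an illustrative name.

```python
import numpy as np

def phi_quad(x):
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,       # degree-1 monomials
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2]) # degree-2 monomials

x = np.array([0.3, -1.2]); z = np.array([2.0, 0.5])
lhs = (x @ z + 1.0) ** 2            # kernel evaluation, no feature map needed
rhs = phi_quad(x) @ phi_quad(z)     # explicit inner product in R^6
assert np.isclose(lhs, rhs)
```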
Mercer Kernel Feature Maps
Learning a linear function over the features $\phi(\mathbf{x})$ amounts to learning an infinite-degree polynomial over $\mathbf{x}$. Gaussian/Laplacian are very powerful kernels, often called universal kernels. By using these kernels, theoretically speaking, one can learn any function over data (details beyond scope of CS771)
Warning: there may exist more than one map for the same kernel, e.g. a suitably rescaled/reordered version of the map we used for the quadratic kernel gives the same kernel values
Gaussian/Laplacian kernels correspond to infinite dimensional maps
The Gaussian kernel is an infinite linear combination of poly kernels of all orders:
$\exp\left(-\gamma\|\mathbf{x}-\mathbf{z}\|_2^2\right) = e^{-\gamma\|\mathbf{x}\|_2^2}\, e^{-\gamma\|\mathbf{z}\|_2^2}\sum_{q=0}^{\infty}\frac{(2\gamma)^q}{q!}\langle\mathbf{x},\mathbf{z}\rangle^q$
Let $\phi_q$ be a map for the poly kernel $\langle\mathbf{x},\mathbf{z}\rangle^q$. Then a map for the Gaussian kernel is the (infinite) concatenation $\phi(\mathbf{x}) = e^{-\gamma\|\mathbf{x}\|_2^2}\left(\sqrt{\tfrac{(2\gamma)^0}{0!}}\,\phi_0(\mathbf{x}),\ \sqrt{\tfrac{(2\gamma)^1}{1!}}\,\phi_1(\mathbf{x}),\ \ldots\right)$ (a small numerical check of the expansion follows)
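A small numerical sketch (illustrative, not from the slides) of the expansion above: the Gaussian kernel value is recovered by truncating the series of polynomial kernels; gamma and the truncation length Q are arbitrary choices.

```python
import numpy as np
from math import factorial

def gaussian_via_poly(x, z, gamma=0.5, Q=30):
    # Truncated series: sum_q (2*gamma)^q / q! * <x, z>^q
    series = sum((2 * gamma) ** q / factorial(q) * (x @ z) ** q for q in range(Q))
    return np.exp(-gamma * (x @ x)) * np.exp(-gamma * (z @ z)) * series

x = np.array([0.4, -0.7]); z = np.array([1.1, 0.2])
exact = np.exp(-0.5 * np.sum((x - z) ** 2))
assert np.isclose(exact, gaussian_via_poly(x, z, gamma=0.5))
```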
Some Domain Specific Kernels
Over the years people have designed innovative and powerful Mercer
kernels specifically for NLP, vision and other domains
When $\mathbf{x}, \mathbf{z}$ are bag-of-words features for strings/documents
Let the dictionary have $d$ words in it
Let $x_i$ be the count of word $i$ in string $\mathbf{x}$
Intersection kernel (a Mercer kernel): $K(\mathbf{x},\mathbf{z}) = \sum_{i=1}^d \min(x_i, z_i)$
Normalized intersection kernel: $K(\mathbf{x},\mathbf{z}) = \frac{\sum_{i=1}^d \min(x_i, z_i)}{\sqrt{\left(\sum_i x_i\right)\left(\sum_i z_i\right)}}$
More generally, when $A, B$ are sets
Simply represent a set $A$ using an indicator vector $\mathbf{x}\in\{0,1\}^d$ with $x_i = 1$ if $i \in A$, else $x_i = 0$. In this case,
Intersection kernel $K(A,B) = |A \cap B|$
Normalized intersection kernel $K(A,B) = \frac{|A\cap B|}{\sqrt{|A|\cdot|B|}}$ (notice that $K(A,B) \le 1$)
The above are just the linear kernel in disguise and hence clearly Mercer (a small sketch follows below)
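A brief sketch (not from the slides) of the intersection kernel on bag-of-words counts and on sets; the variable names (counts_x, set_a, ...) are illustrative.

```python
import numpy as np

def intersection_kernel(x, z):
    return np.minimum(x, z).sum()

def normalized_intersection_kernel(x, z):
    return intersection_kernel(x, z) / np.sqrt(x.sum() * z.sum())

# Bag-of-words counts over a 5-word dictionary
counts_x = np.array([2, 0, 1, 3, 0]); counts_z = np.array([1, 1, 0, 2, 0])
print(intersection_kernel(counts_x, counts_z))            # 3
print(normalized_intersection_kernel(counts_x, counts_z))

# For sets, the indicator-vector representation reduces it to the linear kernel
set_a = {0, 2, 3}; set_b = {2, 3, 4}
ind_a = np.array([i in set_a for i in range(5)], dtype=float)
ind_b = np.array([i in set_b for i in range(5)], dtype=float)
assert ind_a @ ind_b == len(set_a & set_b)                 # |A intersection B|
```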
Some Domain Specific Kernels
N-gram, substring, Fisher kernels: other kernels between two strings
Random walk kernels between two graphs
Subtree, convolutional kernels between two trees
Pyramid kernel used in vision … combination of intersection kernels
In practice, we often use a linear method first, e.g. SVM/ridge regression. If that gives unsatisfactory performance, we often jump directly to the Gaussian kernel, although we should not neglect polynomial/other domain-specific kernels
There exist "kernel learning" methods that can learn the most appropriate kernel for us, or else tune the kernel parameters (e.g. $\gamma$) for us automatically
Creating New Kernels
The normalized kernel actually normalizes the feature map as well. Verify that if $\phi$ is a map for $K$ then a map for the normalized kernel $\hat{K}$ is $\hat\phi$ where $\hat\phi(\mathbf{x}) = \phi(\mathbf{x})/\|\phi(\mathbf{x})\|_2$
Method 1: combine old kernels. If $K_1, K_2$ are existing Mercer kernels
$K(\mathbf{x},\mathbf{z}) = \alpha\, K_1(\mathbf{x},\mathbf{z}) + \beta\, K_2(\mathbf{x},\mathbf{z})$ is also a Mercer kernel if $\alpha, \beta \ge 0$
$K(\mathbf{x},\mathbf{z}) = K_1(\mathbf{x},\mathbf{z}) \cdot K_2(\mathbf{x},\mathbf{z})$ is also a nice (Mercer) kernel
If $K$ gives very large values (in magnitude), some algorithms may suffer. The normalized version will always give values between $-1$ and $1$:
$\hat{K}(\mathbf{x},\mathbf{z}) = \frac{K(\mathbf{x},\mathbf{z})}{\sqrt{K(\mathbf{x},\mathbf{x})\,K(\mathbf{z},\mathbf{z})}}$ gives a normalized kernel
Method 2: find a new feature representation $\psi(\mathbf{x})$ for the data and use $K(\mathbf{x},\mathbf{z}) = \langle\psi(\mathbf{x}),\psi(\mathbf{z})\rangle$
Method 3: mix and match. Take a new data representation $\psi$ and an old kernel $K_{\text{old}}$, and use $K(\mathbf{x},\mathbf{z}) = K_{\text{old}}(\psi(\mathbf{x}),\psi(\mathbf{z}))$ (a small sketch of these recipes follows)
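A minimal sketch of the recipes above, assuming two existing Mercer kernels k1 (linear) and k2 (Gaussian); the helper names are illustrative, not from the slides.

```python
import numpy as np

k1 = lambda x, z: x @ z                                   # linear kernel
k2 = lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2))     # Gaussian kernel

def combined_kernel(x, z, alpha=1.0, beta=2.0):
    return alpha * k1(x, z) + beta * k2(x, z)             # Mercer if alpha, beta >= 0

def product_kernel(x, z):
    return k1(x, z) * k2(x, z)                            # product of Mercer kernels

def normalized_kernel(k, x, z):
    return k(x, z) / np.sqrt(k(x, x) * k(z, z))           # values always in [-1, 1]

x = np.array([1.0, 2.0]); z = np.array([0.5, -1.0])
print(combined_kernel(x, z), product_kernel(x, z), normalized_kernel(k2, x, z))
```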
Kernelized Algorithms
 Several algorithms we studied till now work with kernels too!
 Supervised: kNN, LWP, SVM, ridge regression
 Unsupervised: k-means, PCA
 Others like LASSO are harder to get working with kernels
 Probabilistic/Bayesian algorithms also possible with kernels
Special Case of Naïve Bayes Learning
Use a standard Gaussian $\mathcal{N}(\boldsymbol\mu_c, I_d)$ to model points of class $c$
MLE estimate of $\boldsymbol\mu_c$: the mean of points of class $c$ – a "prototype" of class $c$
Can have multiple prototypes too – multiple clusters for class $c$
If we assume equal class priors, then to classify a test point $\mathbf{x}$, simply find the class whose prototype is nearest, i.e. $\hat{y} = \arg\min_c \|\mathbf{x} - \boldsymbol\mu_c\|_2$
Called the "Learning with Prototypes" (LwP) model – linear decision boundary
Suppose we let every train point be its own cluster ⇒ kNN algorithm!
Let the training set be $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$ where $y^i$ are the labels. Given a test point $\mathbf{x}$, find $\hat{i} = \arg\min_{i\in[n]} \|\mathbf{x} - \mathbf{x}^i\|_2$
Give $y^{\hat{i}}$ as the output – may consult more than one neighbour too! (a brief sketch of both rules follows)
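A brief NumPy sketch (not from the slides) of the plain LwP and 1-NN rules described above, on placeholder data X, y, x_test.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2)); y = np.array([+1] * 10 + [-1] * 10)
x_test = np.array([0.2, -0.3])

# LwP: one prototype (mean) per class, predict the class of the nearer prototype
mu_pos, mu_neg = X[y == +1].mean(axis=0), X[y == -1].mean(axis=0)
lwp_pred = +1 if np.linalg.norm(x_test - mu_pos) < np.linalg.norm(x_test - mu_neg) else -1

# 1-NN: every training point is its own "prototype"
nn_pred = y[np.argmin(np.linalg.norm(X - x_test, axis=1))]
print(lwp_pred, nn_pred)
```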
The Kernel Trick Revisited
An algorithmically effective way of using linear models on non-linearly separable data
Every kernel is associated with a map $\phi$ such that $K(\mathbf{x},\mathbf{z}) = \langle\phi(\mathbf{x}),\phi(\mathbf{z})\rangle$
The map $\phi$ is usually (very) non-linear and (very) high dimensional, i.e. a good candidate for our overall goal of using linear models over non-linear feature maps
This peculiar property is often called kernelizability. An ML algo is said to be kernelizable if we can show that it works identically if, instead of feature vectors, we supply pairwise train-train and test-train dot products of feature vectors
Peculiar property of several ML algos
So far we have seen ML algos work with the feature vectors of train/test points
However, many of them work even if feature vectors are not provided directly, but instead the pairwise dot/inner products between feature vectors are provided!
For training, pairwise dot products between train points are needed
For testing, dot products between the test point and all train points are needed
Thus, we can say we want to work with high-dim feature vectors $\phi(\mathbf{x})$ and, when the ML algo asks us for dot products, give it $K(\mathbf{x},\mathbf{z}) = \langle\phi(\mathbf{x}),\phi(\mathbf{z})\rangle$ instead
Would get the same result as working directly with $\phi(\mathbf{x})$ but without ever having to compute $\phi(\mathbf{x})$ (a small Gram-matrix sketch follows)
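The sketch below (illustrative, not from the slides) shows what a kernelized algorithm actually consumes: the train-train Gram matrix at training time and the test-train kernel values at test time, here for a Gaussian kernel on placeholder data.

```python
import numpy as np

def gaussian_kernel_matrix(A, B, gamma=1.0):
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

X_train = np.random.default_rng(1).normal(size=(50, 3))
x_test = np.zeros(3)

K_train = gaussian_kernel_matrix(X_train, X_train)          # n x n, needed for training
k_test = gaussian_kernel_matrix(x_test[None, :], X_train)   # shape (1, n), needed at test time
```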
kNN with Kernels
This is a recurring theme in kernel learning: never ever compute $\phi(\mathbf{x})$ explicitly. Instead, express all operations in the ML algo in terms of inner product computations, which are then expressible as kernel computations
All that is needed to execute kNN is to compute Euclidean distances
If working with a kernel $K$ with map $\phi$, we need distances $\|\phi(\mathbf{x}) - \phi(\mathbf{z})\|_2$
Indeed, $\|\phi(\mathbf{x}) - \phi(\mathbf{z})\|_2^2 = K(\mathbf{x},\mathbf{x}) - 2K(\mathbf{x},\mathbf{z}) + K(\mathbf{z},\mathbf{z})$
Computing $K(\mathbf{x},\mathbf{z})$ usually takes $\mathcal{O}(d)$ time, but computing $\phi(\mathbf{x})$ may take much longer, e.g. for the Gaussian kernel it would take forever since $\phi(\mathbf{x})$ is infinite dimensional
Thus, distances in $\mathcal{H}$ can be computed without computing $\phi$ first
1NN: given training points $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$ and a test point $\mathbf{x}$
Find the nearest neighbour in $\mathcal{H}$, i.e. $\hat{i} = \arg\min_i \|\phi(\mathbf{x}) - \phi(\mathbf{x}^i)\|_2$, which is the same as $\arg\min_i \left\{K(\mathbf{x},\mathbf{x}) - 2K(\mathbf{x},\mathbf{x}^i) + K(\mathbf{x}^i,\mathbf{x}^i)\right\}$, and predict $y^{\hat{i}}$ as the label
Note: if $K(\mathbf{x},\mathbf{x}) = 1$ for all $\mathbf{x}$ (e.g. a normalized kernel) then this finds the most "similar" point, i.e. $\hat{i} = \arg\max_i K(\mathbf{x},\mathbf{x}^i)$
Similarly, we can execute kNN for $k > 1$ as well (a small sketch follows below)

KERNEL kNN (usually k = 1)
1. Choose a kernel $K$ with map $\phi$
2. Training: receive and store points $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$
3. Prediction:
   1. Receive a test point $\mathbf{x}$
   2. Find the nearest neighbour $\hat{i} = \arg\min_i \|\phi(\mathbf{x}) - \phi(\mathbf{x}^i)\|_2$ using only kernel computations
   3. Predict $y^{\hat{i}}$
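A short sketch of kernel 1-NN as described above: feature-space distances are obtained purely from kernel evaluations; the Gaussian kernel is chosen for illustration.

```python
import numpy as np

def K(x, z, gamma=1.0):                                   # Gaussian kernel, for illustration
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_1nn_predict(X_train, y_train, x_test):
    # ||phi(x) - phi(x_i)||^2 = K(x, x) - 2 K(x, x_i) + K(x_i, x_i)
    d2 = [K(x_test, x_test) - 2 * K(x_test, xi) + K(xi, xi) for xi in X_train]
    return y_train[int(np.argmin(d2))]

# With a normalized kernel (K(x, x) = 1), minimizing this distance is the same as
# maximizing the similarity K(x_test, x_i), i.e. finding the most "similar" point.
```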
LwP with Kernels
Observe that in LwP with kernels, we now have to store the entire training data whereas earlier we just had to store two prototypes. This is common in kernel learning – larger model sizes and longer prediction times
Given train data $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$ with $y^i \in \{-1,+1\}$, we earlier found prototypes $\boldsymbol\mu_+ = \frac{1}{n_+}\sum_{i: y^i = +1}\mathbf{x}^i$ and $\boldsymbol\mu_- = \frac{1}{n_-}\sum_{i: y^i = -1}\mathbf{x}^i$ and used them to predict on a test point $\mathbf{x}$ as the class whose prototype is nearer
If using a kernel $K$ with map $\phi$, we should now compute new prototypes as $\boldsymbol\mu_+^\phi = \frac{1}{n_+}\sum_{i: y^i = +1}\phi(\mathbf{x}^i)$ and $\boldsymbol\mu_-^\phi = \frac{1}{n_-}\sum_{i: y^i = -1}\phi(\mathbf{x}^i)$ and predict using these
Need to be careful now – we cannot compute these new prototypes explicitly
Instead, as before, we reduce the above to kernel computations
Find the distance to the positive prototype, $\|\phi(\mathbf{x}) - \boldsymbol\mu_+^\phi\|_2^2$, using the shortcut $\|\phi(\mathbf{x}) - \boldsymbol\mu_+^\phi\|_2^2 = \langle\phi(\mathbf{x}),\phi(\mathbf{x})\rangle - 2\langle\phi(\mathbf{x}),\boldsymbol\mu_+^\phi\rangle + \langle\boldsymbol\mu_+^\phi,\boldsymbol\mu_+^\phi\rangle$
The first term is simply $K(\mathbf{x},\mathbf{x})$, the second term is $\frac{2}{n_+}\sum_{i:y^i=+1}K(\mathbf{x},\mathbf{x}^i)$, and the third term is $\frac{1}{n_+^2}\sum_{i:y^i=+1}\sum_{j:y^j=+1}K(\mathbf{x}^i,\mathbf{x}^j)$, which can be pre-calculated at train time (a small sketch follows below)
There are ways in which kernel methods can be sped up and model sizes reduced. Will see those techniques later

KERNEL LwP
1. Choose a kernel $K$ with map $\phi$
2. Training: receive and store points $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$
3. Prediction:
   1. Receive a test point $\mathbf{x}$
   2. Find the distance to the positive prototype using kernel computations
   3. Find the distance to the negative prototype similarly
   4. Predict the class whose prototype is nearer
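A sketch of the kernel LwP prediction rule above: the squared distance to each class's feature-space prototype reduces to sums of kernel values; the third term could be precomputed at train time as the slide notes.

```python
import numpy as np

def K(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def dist2_to_prototype(x_test, X_class):
    n = len(X_class)
    term1 = K(x_test, x_test)
    term2 = -2.0 / n * sum(K(x_test, xi) for xi in X_class)
    term3 = 1.0 / n ** 2 * sum(K(xi, xj) for xi in X_class for xj in X_class)  # precomputable
    return term1 + term2 + term3

def kernel_lwp_predict(x_test, X_pos, X_neg):
    return +1 if dist2_to_prototype(x_test, X_pos) < dist2_to_prototype(x_test, X_neg) else -1
```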
Kernel SVM
PRIMAL FORMULATION: $\min_{\mathbf{w},\boldsymbol\xi}\ \frac{1}{2}\|\mathbf{w}\|_2^2 + C\sum_{i=1}^n\xi_i$ s.t. $y^i\langle\mathbf{w},\mathbf{x}^i\rangle \ge 1 - \xi_i$ and $\xi_i \ge 0$
DUAL FORMULATION: $\max_{\boldsymbol\alpha}\ \sum_{i=1}^n\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y^i y^j\langle\mathbf{x}^i,\mathbf{x}^j\rangle$ s.t. $0 \le \alpha_i \le C$
Let's see what happens if we execute the SVM after applying a (non-linear) feature map $\phi$
Kernel SVM
PRIMAL FORMULATION: $\min_{\mathbf{w},\boldsymbol\xi}\ \frac{1}{2}\|\mathbf{w}\|_{\mathcal{H}}^2 + C\sum_{i=1}^n\xi_i$ s.t. $y^i\langle\mathbf{w},\phi(\mathbf{x}^i)\rangle \ge 1 - \xi_i$ and $\xi_i \ge 0$
DUAL FORMULATION: $\max_{\boldsymbol\alpha}\ \sum_{i=1}^n\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y^i y^j\langle\phi(\mathbf{x}^i),\phi(\mathbf{x}^j)\rangle$ s.t. $0 \le \alpha_i \le C$
Note that $\langle\phi(\mathbf{x}^i),\phi(\mathbf{x}^j)\rangle = K(\mathbf{x}^i,\mathbf{x}^j)$, so the dual needs only kernel values
Note that if we solve the dual, then the model itself is $\mathbf{w} = \sum_{i=1}^n\alpha_i y^i\phi(\mathbf{x}^i) \in \mathcal{H}$
Kernel SVM
DUAL FORMULATION (kernelized): $\max_{\boldsymbol\alpha}\ \sum_{i=1}^n\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y^i y^j K(\mathbf{x}^i,\mathbf{x}^j)$ s.t. $0 \le \alpha_i \le C$
Finding/storing the model $\mathbf{w}$ explicitly is not feasible even if we solve the kernel SVM dual problem perfectly, since $\mathbf{w} = \sum_i\alpha_i y^i\phi(\mathbf{x}^i)$ and $\phi(\mathbf{x}^i)$ is never computed (and may be infinite dimensional, e.g. for the Gaussian kernel)
Solving the primal is infeasible: computing $\phi(\mathbf{x})$ is usually infeasible, e.g. a single SGD step would take infinitely long. We can still solve the dual problem using SDCA. Each step of SDCA still takes time apart from the kernel evaluations; if the time taken to compute $K(\mathbf{x}^i,\mathbf{x}^j)$ is added, then each SDCA step takes roughly $\mathcal{O}(nd)$ time
So instead we can store all the $\alpha_i$ values (only $n$ of them). At test time, given a test point $\mathbf{x}$, we can predict using $\hat{y} = \text{sign}\left(\sum_{i=1}^n\alpha_i y^i K(\mathbf{x}^i,\mathbf{x})\right)$
Note that if the test data point is very similar to one of the training points, i.e. $K(\mathbf{x}^i,\mathbf{x})$ is large, then that label $y^i$ influences the prediction much more. If we think this way, kernel SVM almost looks like a "soft" form of kNN. If there are $s$ support vectors, then prediction requires $s$ kernel computations, i.e. roughly $\mathcal{O}(sd)$ time since each kernel computation takes roughly $\mathcal{O}(d)$ time
Training is more expensive, the model size is larger, and prediction takes more time for kernel SVM than was the case for linear SVM – very typical of non-linear models (a small prediction sketch follows)

KERNEL SVM
1. Choose a kernel $K$ with map $\phi$
2. Training: receive train points $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$
   1. Solve the dual problem to obtain $\boldsymbol\alpha$
   2. Implicitly store $\mathbf{w}$ by storing $\alpha_i$ for all support vectors, i.e. those $i$ with $\alpha_i > 0$
3. Prediction:
   1. Receive a test point $\mathbf{x}$
   2. Predict $\hat{y} = \text{sign}\left(\sum_{i:\alpha_i > 0}\alpha_i y^i K(\mathbf{x}^i,\mathbf{x})\right)$
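A hedged sketch of kernel SVM prediction from the dual: assuming the dual variables alpha_i for the support vectors have already been obtained (e.g. via SDCA), the prediction is the sign of the kernel-weighted sum below. The variable names are illustrative.

```python
import numpy as np

def K(x, z, gamma=1.0):                       # Gaussian kernel, for illustration
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_svm_predict(x_test, X_sv, y_sv, alpha_sv):
    # score = sum over support vectors of alpha_i * y_i * K(x_i, x_test)
    score = sum(a * y * K(xi, x_test) for a, y, xi in zip(alpha_sv, y_sv, X_sv))
    return np.sign(score)
```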
Kernel Ridge Regression
Given data $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$, the RR solution is simply $\hat{\mathbf{w}} = \left(X^\top X + \lambda I_d\right)^{-1}X^\top\mathbf{y}$
Is ridge regression kernelizable? It does not seem so at first
In fact, it is – by using the dual problem of ridge regression
Deriving the dual for RR: RR solves $\min_{\mathbf{w}}\ \sum_{i=1}^n\left(y^i - \langle\mathbf{w},\mathbf{x}^i\rangle\right)^2 + \lambda\|\mathbf{w}\|_2^2$
The dual requires constraints – there are none here, so let's deliberately introduce some!
New (but equivalent) formulation: $\min_{\mathbf{w},\mathbf{e}}\ \sum_{i=1}^n e_i^2 + \lambda\|\mathbf{w}\|_2^2$ s.t. $e_i = y^i - \langle\mathbf{w},\mathbf{x}^i\rangle$
To handle equality constraints: Method 1: convert each into a pair of inequality constraints; Method 2: use a Lagrangian variable that has no sign constraints
The Lagrangian becomes $\mathcal{L}(\mathbf{w},\mathbf{e},\boldsymbol\alpha) = \sum_i e_i^2 + \lambda\|\mathbf{w}\|_2^2 + \sum_i\alpha_i\left(y^i - \langle\mathbf{w},\mathbf{x}^i\rangle - e_i\right)$
Applying first-order optimality gives us $\mathbf{w} = \frac{1}{2\lambda}\sum_i\alpha_i\mathbf{x}^i$ and $e_i = \frac{\alpha_i}{2}$
The dual becomes a problem purely in $\boldsymbol\alpha$ in which the data appear only through the inner products $\langle\mathbf{x}^i,\mathbf{x}^j\rangle$
Kernel Ridge Regression
Note, however, that we can use this dual trick to solve RR even in the linear case when $\phi(\mathbf{x}) = \mathbf{x}$. Solving linear RR in the primal requires $\mathcal{O}(d^3)$ time (to invert a $d\times d$ matrix) whereas solving linear RR in the dual requires $\mathcal{O}(n^3)$ time (to invert an $n\times n$ matrix). Thus, if $n < d$, the dual solution is cheaper
Thus, RR does have a dual problem (which makes it kernelizable too)
Solve the dual to obtain $\boldsymbol\alpha \in \mathbb{R}^n$
The model is $\mathbf{w} = \sum_i\alpha_i\phi(\mathbf{x}^i)$ (absorbing constants into $\boldsymbol\alpha$) – it cannot be stored explicitly
Given a test point $\mathbf{x}$, predict as $\hat{y} = \langle\mathbf{w},\phi(\mathbf{x})\rangle = \sum_{i=1}^n\alpha_i K(\mathbf{x}^i,\mathbf{x})$
Some simplifications
Let $G \in \mathbb{R}^{n\times n}$ with $G_{ij} = K(\mathbf{x}^i,\mathbf{x}^j)$ denote the "Gram matrix" of the training points
The dual of kernel RR can be rewritten purely in terms of $G$ and $\mathbf{y}$
The solution is available in closed form: $\boldsymbol\alpha = \left(G + \lambda I_n\right)^{-1}\mathbf{y}$
Requires inverting an $n\times n$ matrix (linear RR required inverting a $d\times d$ matrix)
As before, kernel RR requires more train time, more test time, and a larger model size (a small sketch follows below)
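A sketch of kernel ridge regression in closed form as above: alpha = (G + lambda I)^{-1} y on the Gram matrix, with predictions given by kernel-weighted sums; the Gaussian kernel and parameter values are illustrative choices.

```python
import numpy as np

def gaussian_gram(A, B, gamma=1.0):
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_rr_fit(X, y, lam=0.1, gamma=1.0):
    G = gaussian_gram(X, X, gamma)                        # n x n Gram matrix
    return np.linalg.solve(G + lam * np.eye(len(X)), y)   # alpha; O(n^3) time

def kernel_rr_predict(X_train, alpha, X_test, gamma=1.0):
    return gaussian_gram(X_test, X_train, gamma) @ alpha  # sum_i alpha_i K(x_i, x)
```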
Kernel Clustering
Should be relatively simple given our experience with kernel LwP, kNN

K-MEANS / LLOYD'S ALGORITHM
1. Initialize centroids $\boldsymbol\mu_1,\ldots,\boldsymbol\mu_k$
2. For $i = 1,\ldots,n$, do cluster assignment
   1. Let $z_i = \arg\min_{c\in[k]}\|\mathbf{x}^i - \boldsymbol\mu_c\|_2$
3. Update $\boldsymbol\mu_c = \frac{1}{|\{i: z_i = c\}|}\sum_{i: z_i = c}\mathbf{x}^i$
4. Repeat until convergence

All we need to do is kernelize the distance computations and keep track of which points are assigned to which cluster, rather than the centroids themselves
Kernel K-means
Note that cluster centres in k-means are always the average of the data points that were assigned to that cluster – the assignment variables maintain this info
Need to maintain this information a bit differently for easy processing
Let $z_{ic} = 1$ if point $i$ is assigned to cluster $c$ and $z_{ic} = 0$ if not, i.e. $\mathbf{z}_c \in \{0,1\}^n$ indicates the members of cluster $c$ and $n_c = \sum_i z_{ic}$
This lets us write $\boldsymbol\mu_c = \frac{1}{n_c}\sum_{i=1}^n z_{ic}\,\phi(\mathbf{x}^i)$
Let $G$ denote the Gram matrix of the training points, $G_{ij} = K(\mathbf{x}^i,\mathbf{x}^j)$
Using this, we can rewrite the distance computations as $\|\phi(\mathbf{x}^j) - \boldsymbol\mu_c\|_2^2 = G_{jj} - \frac{2}{n_c}\mathbf{z}_c^\top\mathbf{g}^j + \frac{1}{n_c^2}\mathbf{z}_c^\top G\,\mathbf{z}_c$, where $\mathbf{g}^j$ is the $j$-th column of $G$ – which is nothing but the same shortcut we used for kernel LwP (a small sketch follows below)

KERNEL K-MEANS
1. Initialize the cluster assignments $\mathbf{z}_1,\ldots,\mathbf{z}_k$ randomly
2. For $i = 1,\ldots,n$, do cluster assignment
   1. Let $\hat{c} = \arg\min_{c\in[k]}\|\phi(\mathbf{x}^i) - \boldsymbol\mu_c\|_2^2$, computed using only entries of $G$
3. For each point, update the assignments
   1. For $i = 1,\ldots,n$, set $z_{i\hat{c}} = 1$ and $z_{ic} = 0$ for all $c \ne \hat{c}$
4. Repeat until convergence
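A small sketch of the kernel k-means distance computation above: with a 0/1 membership vector z for cluster c, the distance of point j to that cluster's feature-space mean uses only entries of the Gram matrix G.

```python
import numpy as np

def dist2_to_cluster_mean(G, j, z):
    # ||phi(x_j) - mu_c||^2 = G_jj - (2/n_c) * sum_i z_i G_ji + (1/n_c^2) * sum_{i,i'} z_i z_i' G_ii'
    idx = np.flatnonzero(z)                  # members of cluster c
    n_c = len(idx)
    return (G[j, j]
            - 2.0 / n_c * G[j, idx].sum()
            + 1.0 / n_c ** 2 * G[np.ix_(idx, idx)].sum())
```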
Parametric vs Non-parametric ML models
ML models for which the model size is independent of the number of training points are called parametric
Linear SVM, LwP, logistic regression, ridge regression
ML models for which the model size depends on the number of training points are called non-parametric (the name is a bit non-intuitive)
Kernel SVM with a non-linear kernel – need to store all non-zero $\alpha_i$ and the corresponding $\mathbf{x}^i$, and it is possible that all train points become support vectors
Kernel algorithms in general are non-parametric
kNN – need to store all training points