Kernel methods allow algorithms that traditionally operate on vectors to instead operate on arbitrary similarity measures between pairs of data points. This is done through kernels, which compute similarities without needing to explicitly map data into a feature space. Three key points:
1. Mercer kernels define valid similarity measures and correspond to an inner product in some potentially infinite-dimensional feature space.
2. Common kernels include linear, polynomial, Gaussian, and domain-specific kernels for text, graphs, etc.
3. The "kernel trick" allows kernelizing algorithms like SVM, ridge regression, and k-means by replacing inner products with kernels, avoiding explicit feature mappings.

Kernel Methods

Mercer Kernels
In general, $\mathcal{H}$ has to be a Hilbert space which, technicalities aside, is very much like a real vector space such as $\mathbb{R}^d$ but is possibly infinite dimensional
It is always possible to define inner/dot products on Hilbert spaces
Suppose $\mathbf{u}, \mathbf{v} \in \mathbb{R}^d$ are any two unit vectors
The dot product $\langle\mathbf{u},\mathbf{v}\rangle$ is a natural notion of similarity between these vectors
It is highest when the vectors are the same, i.e. when $\mathbf{u} = \mathbf{v}$ we have $\langle\mathbf{u},\mathbf{v}\rangle = 1$
It is lowest when the vectors are diametrically opposite, i.e. $\mathbf{u} = -\mathbf{v}$ and $\langle\mathbf{u},\mathbf{v}\rangle = -1$
Mercer kernels are notions of similarity that extend such nice behaviour
Given a set $\mathcal{X}$ of objects (images, videos, strings, genome sequences), a similarity function $K: \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is called a Mercer kernel if there exists a map $\phi: \mathcal{X}\to\mathcal{H}$ s.t. for all $\mathbf{x},\mathbf{z}\in\mathcal{X}$, $K(\mathbf{x},\mathbf{z}) = \langle\phi(\mathbf{x}),\phi(\mathbf{z})\rangle_{\mathcal{H}}$
$\phi$ is often called the feature map or feature embedding; $\mathcal{H}$ can be $\mathbb{R}^D$ for some large/moderate $D$, and can even be infinite dimensional
Thus, when asked to give the similarity between two objects, all that a Mercer kernel does is first map those objects to two (high-dim) vectors and return the dot/inner product between those two vectors
Examples of Kernels
Poly kernels are called homogeneous if $c = 0$; $c < 0$ makes the kernel non-Mercer
When $\mathbf{x}, \mathbf{z} \in \mathbb{R}^d$ are vectors
Linear kernel $K(\mathbf{x},\mathbf{z}) = \langle\mathbf{x},\mathbf{z}\rangle$
Quadratic kernel $K(\mathbf{x},\mathbf{z}) = \left(\langle\mathbf{x},\mathbf{z}\rangle + c\right)^2$, $c \ge 0$
Polynomial kernel $K(\mathbf{x},\mathbf{z}) = \left(\langle\mathbf{x},\mathbf{z}\rangle + c\right)^q$, $c \ge 0$, $q \in \mathbb{N}$
Gaussian kernel $K(\mathbf{x},\mathbf{z}) = \exp\left(-\gamma\|\mathbf{x}-\mathbf{z}\|_2^2\right)$, $\gamma > 0$
Laplacian kernel $K(\mathbf{x},\mathbf{z}) = \exp\left(-\gamma\|\mathbf{x}-\mathbf{z}\|_1\right)$, $\gamma > 0$
All of the above are Mercer kernels
There indeed exist feature maps for each of them (proving so is a bit tedious)
$c, q, \gamma$ need to be tuned. Large $q$, $\gamma$ can cause overfitting
Notice all of the above are indeed notions of similarity
Take two unit vectors $\mathbf{x}, \mathbf{z}$ (unit for the sake of normalization). Easy to verify that $K(\mathbf{x},\mathbf{z})$ is largest when $\mathbf{x} = \mathbf{z}$ and smallest when $\mathbf{x} = -\mathbf{z}$ (a small NumPy sketch of these kernels follows)
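A minimal NumPy sketch (not part of the original slides) of the kernels listed above; the parameter names c, q and gamma mirror the slide notation.

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, c=1.0, q=3):
    return (x @ z + c) ** q          # the quadratic kernel is the special case q = 2

def gaussian_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def laplacian_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum(np.abs(x - z)))

# Sanity check on unit vectors: similarity is largest when the vectors coincide
u = np.array([1.0, 0.0]); v = np.array([0.0, 1.0])
assert gaussian_kernel(u, u) > gaussian_kernel(u, v)
```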
Mercer Kernel Feature Maps
Homogeneous poly kernels ($c = 0$) only use features of the form $x_{i_1}x_{i_2}\cdots x_{i_q}$, i.e. monomials of degree exactly $q$. In contrast, if we have $c > 0$ then the kernels use all features of the form $x_{i_1}\cdots x_{i_k}$ where $k \le q$, i.e. monomials of degree up to $q$ – non-homogeneous poly kernels use more expressive feature maps
Linear kernel $K(\mathbf{x},\mathbf{z}) = \langle\mathbf{x},\mathbf{z}\rangle$
Use $\phi(\mathbf{x}) = \mathbf{x}$, i.e. $\mathcal{H} = \mathbb{R}^d$. Called "linear" for a reason: any linear function over the feature map $\phi(\mathbf{x})$ is just a linear function over the original features $\mathbf{x}$
Quadratic kernel $K(\mathbf{x},\mathbf{z}) = \left(\langle\mathbf{x},\mathbf{z}\rangle + 1\right)^2$
We have $K(\mathbf{x},\mathbf{z}) = \langle\phi(\mathbf{x}),\phi(\mathbf{z})\rangle$ when $\phi(\mathbf{x}) = \left(1, \sqrt{2}x_1,\ldots,\sqrt{2}x_d,\ x_1^2,\ldots,x_d^2,\ \sqrt{2}x_1x_2,\ldots,\sqrt{2}x_{d-1}x_d\right)$, where $\mathcal{H} = \mathbb{R}^{\binom{d+2}{2}}$
Similar constructions (more tedious to write) exist for the polynomial kernel of degree $q$
Called "quadratic" for a reason: any linear function over $\phi(\mathbf{x})$ is a quadratic function over the original features $\mathbf{x}$. Verify this for the simple case $d = 1$ yourself (a small numerical check follows below)
If we use a linear ML algo over $\phi(\mathbf{x})$, we can learn any quadratic function over the original features $\mathbf{x}$. The polynomial kernel of degree $q$ similarly allows learning of degree-$q$ polynomial functions over the original data
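A hedged illustration (not from the slides) of the explicit feature map for the quadratic kernel $(\langle\mathbf{x},\mathbf{z}\rangle + 1)^2$ in $d = 2$, checking that the kernel value equals the inner product of the mapped vectors; phi_quad is an illustrative name.

```python
import numpy as np

def phi_quad(x):
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,       # degree-1 monomials
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2]) # degree-2 monomials

x = np.array([0.3, -1.2]); z = np.array([2.0, 0.5])
lhs = (x @ z + 1.0) ** 2            # kernel evaluation, no feature map needed
rhs = phi_quad(x) @ phi_quad(z)     # explicit inner product in R^6
assert np.isclose(lhs, rhs)
```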
Mercer Kernel Feature Maps
Learning a linear function over the features $\phi(\mathbf{x})$ amounts to learning an infinite-degree polynomial over $\mathbf{x}$. Gaussian/Laplacian are very powerful kernels, often called universal kernels. By using these kernels, theoretically speaking, one can learn any function over data (details beyond scope of CS771)
Warning: there may exist more than one map for the same kernel, e.g. a suitably rescaled/reordered version of the map we used for the quadratic kernel gives the same kernel values
Gaussian/Laplacian kernels correspond to infinite dimensional maps
The Gaussian kernel is an infinite linear combination of poly kernels of all orders:
$\exp\left(-\gamma\|\mathbf{x}-\mathbf{z}\|_2^2\right) = e^{-\gamma\|\mathbf{x}\|_2^2}\, e^{-\gamma\|\mathbf{z}\|_2^2}\sum_{q=0}^{\infty}\frac{(2\gamma)^q}{q!}\langle\mathbf{x},\mathbf{z}\rangle^q$
Let $\phi_q$ be a map for the poly kernel $\langle\mathbf{x},\mathbf{z}\rangle^q$. Then a map for the Gaussian kernel is the (infinite) concatenation $\phi(\mathbf{x}) = e^{-\gamma\|\mathbf{x}\|_2^2}\left(\sqrt{\tfrac{(2\gamma)^0}{0!}}\,\phi_0(\mathbf{x}),\ \sqrt{\tfrac{(2\gamma)^1}{1!}}\,\phi_1(\mathbf{x}),\ \ldots\right)$ (a small numerical check of the expansion follows)
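A small numerical sketch (illustrative, not from the slides) of the expansion above: the Gaussian kernel value is recovered by truncating the series of polynomial kernels; gamma and the truncation length Q are arbitrary choices.

```python
import numpy as np
from math import factorial

def gaussian_via_poly(x, z, gamma=0.5, Q=30):
    # Truncated series: sum_q (2*gamma)^q / q! * <x, z>^q
    series = sum((2 * gamma) ** q / factorial(q) * (x @ z) ** q for q in range(Q))
    return np.exp(-gamma * (x @ x)) * np.exp(-gamma * (z @ z)) * series

x = np.array([0.4, -0.7]); z = np.array([1.1, 0.2])
exact = np.exp(-0.5 * np.sum((x - z) ** 2))
assert np.isclose(exact, gaussian_via_poly(x, z, gamma=0.5))
```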
Some Domain Specific Kernels
Over the years people have designed innovative and powerful Mercer
kernels specifically for NLP, vision and other domains
When $\mathbf{x}, \mathbf{z}$ are bag-of-words features for strings/documents
Let the dictionary have $d$ words in it
Let $x_i$ be the count of word $i$ in string $\mathbf{x}$
Intersection kernel (a Mercer kernel): $K(\mathbf{x},\mathbf{z}) = \sum_{i=1}^d \min(x_i, z_i)$
Normalized intersection kernel: $K(\mathbf{x},\mathbf{z}) = \frac{\sum_{i=1}^d \min(x_i, z_i)}{\sqrt{\left(\sum_i x_i\right)\left(\sum_i z_i\right)}}$
More generally, when $A, B$ are sets
Simply represent a set $A$ using an indicator vector $\mathbf{x}\in\{0,1\}^d$ with $x_i = 1$ if $i \in A$, else $x_i = 0$. In this case,
Intersection kernel $K(A,B) = |A \cap B|$
Normalized intersection kernel $K(A,B) = \frac{|A\cap B|}{\sqrt{|A|\cdot|B|}}$ (notice that $K(A,B) \le 1$)
The above are just the linear kernel in disguise and hence clearly Mercer (a small sketch follows below)
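A brief sketch (not from the slides) of the intersection kernel on bag-of-words counts and on sets; the variable names (counts_x, set_a, ...) are illustrative.

```python
import numpy as np

def intersection_kernel(x, z):
    return np.minimum(x, z).sum()

def normalized_intersection_kernel(x, z):
    return intersection_kernel(x, z) / np.sqrt(x.sum() * z.sum())

# Bag-of-words counts over a 5-word dictionary
counts_x = np.array([2, 0, 1, 3, 0]); counts_z = np.array([1, 1, 0, 2, 0])
print(intersection_kernel(counts_x, counts_z))            # 3
print(normalized_intersection_kernel(counts_x, counts_z))

# For sets, the indicator-vector representation reduces it to the linear kernel
set_a = {0, 2, 3}; set_b = {2, 3, 4}
ind_a = np.array([i in set_a for i in range(5)], dtype=float)
ind_b = np.array([i in set_b for i in range(5)], dtype=float)
assert ind_a @ ind_b == len(set_a & set_b)                 # |A intersection B|
```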
Some Domain Specific Kernels
N-gram, substring, Fisher kernels: other kernels between two strings
Random walk kernels between two graphs
Subtree, convolutional kernels between two trees
Pyramid kernel used in vision … combination of intersection kernels
In practice, we often use a linear method first, e.g. SVM/ridge regression. If that gives unsatisfactory performance, we often jump directly to the Gaussian kernel, although we should not neglect polynomial/other domain-specific kernels
There exist "kernel learning" methods that can learn the most appropriate kernel for us, or else tune the kernel parameters (e.g. $\gamma$) for us automatically
Creating New Kernels
The normalized kernel actually normalizes the feature map as well. Verify that if $\phi$ is a map for $K$ then a map for the normalized kernel $\hat{K}$ is $\hat\phi$ where $\hat\phi(\mathbf{x}) = \phi(\mathbf{x})/\|\phi(\mathbf{x})\|_2$
Method 1: combine old kernels. If $K_1, K_2$ are existing Mercer kernels
$K(\mathbf{x},\mathbf{z}) = \alpha\, K_1(\mathbf{x},\mathbf{z}) + \beta\, K_2(\mathbf{x},\mathbf{z})$ is also a Mercer kernel if $\alpha, \beta \ge 0$
$K(\mathbf{x},\mathbf{z}) = K_1(\mathbf{x},\mathbf{z}) \cdot K_2(\mathbf{x},\mathbf{z})$ is also a nice (Mercer) kernel
If $K$ gives very large values (in magnitude), some algorithms may suffer. The normalized version will always give values between $-1$ and $1$:
$\hat{K}(\mathbf{x},\mathbf{z}) = \frac{K(\mathbf{x},\mathbf{z})}{\sqrt{K(\mathbf{x},\mathbf{x})\,K(\mathbf{z},\mathbf{z})}}$ gives a normalized kernel
Method 2: find a new feature representation $\psi(\mathbf{x})$ for the data and use $K(\mathbf{x},\mathbf{z}) = \langle\psi(\mathbf{x}),\psi(\mathbf{z})\rangle$
Method 3: mix and match. Take a new data representation $\psi$ and an old kernel $K_{\text{old}}$, and use $K(\mathbf{x},\mathbf{z}) = K_{\text{old}}(\psi(\mathbf{x}),\psi(\mathbf{z}))$ (a small sketch of these recipes follows)
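A minimal sketch of the recipes above, assuming two existing Mercer kernels k1 (linear) and k2 (Gaussian); the helper names are illustrative, not from the slides.

```python
import numpy as np

k1 = lambda x, z: x @ z                                   # linear kernel
k2 = lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2))     # Gaussian kernel

def combined_kernel(x, z, alpha=1.0, beta=2.0):
    return alpha * k1(x, z) + beta * k2(x, z)             # Mercer if alpha, beta >= 0

def product_kernel(x, z):
    return k1(x, z) * k2(x, z)                            # product of Mercer kernels

def normalized_kernel(k, x, z):
    return k(x, z) / np.sqrt(k(x, x) * k(z, z))           # values always in [-1, 1]

x = np.array([1.0, 2.0]); z = np.array([0.5, -1.0])
print(combined_kernel(x, z), product_kernel(x, z), normalized_kernel(k2, x, z))
```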
Kernelized Algorithms
 Several algorithms we studied till now work with kernels too!
 Supervised: kNN, LWP, SVM, ridge regression
 Unsupervised: k-means, PCA
 Others like LASSO are harder to get working with kernels
 Probabilistic/Bayesian algorithms also possible with kernels
Special Case of Naïve Bayes Learning
Use a standard Gaussian $\mathcal{N}(\boldsymbol\mu_c, I_d)$ to model points of class $c$
MLE estimate of $\boldsymbol\mu_c$: the mean of points of class $c$ – a "prototype" of class $c$
Can have multiple prototypes too – multiple clusters for class $c$
If we assume equal class priors, then to classify a test point $\mathbf{x}$, simply find the class whose prototype is nearest, i.e. $\hat{y} = \arg\min_c \|\mathbf{x} - \boldsymbol\mu_c\|_2$
Called the "Learning with Prototypes" (LwP) model – linear decision boundary
Suppose we let every train point be its own cluster ⇒ kNN algorithm!
Let the training set be $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$ where $y^i$ are the labels. Given a test point $\mathbf{x}$, find $\hat{i} = \arg\min_{i\in[n]} \|\mathbf{x} - \mathbf{x}^i\|_2$
Give $y^{\hat{i}}$ as the output – may consult more than one neighbour too! (a brief sketch of both rules follows)
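A brief NumPy sketch (not from the slides) of the plain LwP and 1-NN rules described above, on placeholder data X, y, x_test.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2)); y = np.array([+1] * 10 + [-1] * 10)
x_test = np.array([0.2, -0.3])

# LwP: one prototype (mean) per class, predict the class of the nearer prototype
mu_pos, mu_neg = X[y == +1].mean(axis=0), X[y == -1].mean(axis=0)
lwp_pred = +1 if np.linalg.norm(x_test - mu_pos) < np.linalg.norm(x_test - mu_neg) else -1

# 1-NN: every training point is its own "prototype"
nn_pred = y[np.argmin(np.linalg.norm(X - x_test, axis=1))]
print(lwp_pred, nn_pred)
```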
The Kernel Trick Revisited
An algorithmically effective way of using linear models on non-linearly separable data
Every kernel is associated with a map $\phi$ such that $K(\mathbf{x},\mathbf{z}) = \langle\phi(\mathbf{x}),\phi(\mathbf{z})\rangle$
The map $\phi$ is usually (very) non-linear and (very) high dimensional, i.e. a good candidate for our overall goal of using linear models over non-linear feature maps
This peculiar property is often called kernelizability. An ML algo is said to be kernelizable if we can show that it works identically if, instead of feature vectors, we supply pairwise train-train and test-train dot products of feature vectors
Peculiar property of several ML algos
So far we have seen ML algos work with the feature vectors of train/test points
However, many of them work even if feature vectors are not provided directly, but instead the pairwise dot/inner products between feature vectors are provided!
For training, pairwise dot products between train points are needed
For testing, dot products between the test point and all train points are needed
Thus, we can say we want to work with high-dim feature vectors $\phi(\mathbf{x})$ and, when the ML algo asks us for dot products, give it $K(\mathbf{x},\mathbf{z}) = \langle\phi(\mathbf{x}),\phi(\mathbf{z})\rangle$ instead
Would get the same result as working directly with $\phi(\mathbf{x})$ but without ever having to compute $\phi(\mathbf{x})$ (a small Gram-matrix sketch follows)
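The sketch below (illustrative, not from the slides) shows what a kernelized algorithm actually consumes: the train-train Gram matrix at training time and the test-train kernel values at test time, here for a Gaussian kernel on placeholder data.

```python
import numpy as np

def gaussian_kernel_matrix(A, B, gamma=1.0):
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

X_train = np.random.default_rng(1).normal(size=(50, 3))
x_test = np.zeros(3)

K_train = gaussian_kernel_matrix(X_train, X_train)          # n x n, needed for training
k_test = gaussian_kernel_matrix(x_test[None, :], X_train)   # shape (1, n), needed at test time
```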
kNN with Kernels
This is a recurring theme in kernel learning: never ever compute $\phi(\mathbf{x})$ explicitly. Instead, express all operations in the ML algo in terms of inner product computations, which are then expressible as kernel computations
All that is needed to execute kNN is to compute Euclidean distances
If working with a kernel $K$ with map $\phi$, we need distances $\|\phi(\mathbf{x}) - \phi(\mathbf{z})\|_2$
Indeed, $\|\phi(\mathbf{x}) - \phi(\mathbf{z})\|_2^2 = K(\mathbf{x},\mathbf{x}) - 2K(\mathbf{x},\mathbf{z}) + K(\mathbf{z},\mathbf{z})$
Computing $K(\mathbf{x},\mathbf{z})$ usually takes $\mathcal{O}(d)$ time, but computing $\phi(\mathbf{x})$ may take much longer, e.g. for the Gaussian kernel it would take forever since $\phi(\mathbf{x})$ is infinite dimensional
Thus, distances in $\mathcal{H}$ can be computed without computing $\phi$ first
1NN: given training points $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$ and a test point $\mathbf{x}$
Find the nearest neighbour in $\mathcal{H}$, i.e. $\hat{i} = \arg\min_i \|\phi(\mathbf{x}) - \phi(\mathbf{x}^i)\|_2$, which is the same as $\arg\min_i \left\{K(\mathbf{x},\mathbf{x}) - 2K(\mathbf{x},\mathbf{x}^i) + K(\mathbf{x}^i,\mathbf{x}^i)\right\}$, and predict $y^{\hat{i}}$ as the label
Note: if $K(\mathbf{x},\mathbf{x}) = 1$ for all $\mathbf{x}$ (e.g. a normalized kernel) then this finds the most "similar" point, i.e. $\hat{i} = \arg\max_i K(\mathbf{x},\mathbf{x}^i)$
Similarly, we can execute kNN for $k > 1$ as well (a small sketch follows below)

KERNEL kNN (usually k = 1)
1. Choose a kernel $K$ with map $\phi$
2. Training: receive and store points $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$
3. Prediction:
   1. Receive a test point $\mathbf{x}$
   2. Find the nearest neighbour $\hat{i} = \arg\min_i \|\phi(\mathbf{x}) - \phi(\mathbf{x}^i)\|_2$ using only kernel computations
   3. Predict $y^{\hat{i}}$
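A short sketch of kernel 1-NN as described above: feature-space distances are obtained purely from kernel evaluations; the Gaussian kernel is chosen for illustration.

```python
import numpy as np

def K(x, z, gamma=1.0):                                   # Gaussian kernel, for illustration
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_1nn_predict(X_train, y_train, x_test):
    # ||phi(x) - phi(x_i)||^2 = K(x, x) - 2 K(x, x_i) + K(x_i, x_i)
    d2 = [K(x_test, x_test) - 2 * K(x_test, xi) + K(xi, xi) for xi in X_train]
    return y_train[int(np.argmin(d2))]

# With a normalized kernel (K(x, x) = 1), minimizing this distance is the same as
# maximizing the similarity K(x_test, x_i), i.e. finding the most "similar" point.
```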
LwP with Kernels
Observe that in LwP with kernels, we now have to store the entire training data whereas earlier we just had to store two prototypes. This is common in kernel learning – larger model sizes and longer prediction times
Given train data $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$ with $y^i \in \{-1,+1\}$, we earlier found prototypes $\boldsymbol\mu_+ = \frac{1}{n_+}\sum_{i: y^i = +1}\mathbf{x}^i$ and $\boldsymbol\mu_- = \frac{1}{n_-}\sum_{i: y^i = -1}\mathbf{x}^i$ and used them to predict on a test point $\mathbf{x}$ as the class whose prototype is nearer
If using a kernel $K$ with map $\phi$, we should now compute new prototypes as $\boldsymbol\mu_+^\phi = \frac{1}{n_+}\sum_{i: y^i = +1}\phi(\mathbf{x}^i)$ and $\boldsymbol\mu_-^\phi = \frac{1}{n_-}\sum_{i: y^i = -1}\phi(\mathbf{x}^i)$ and predict using these
Need to be careful now – we cannot compute these new prototypes explicitly
Instead, as before, we reduce the above to kernel computations
Find the distance to the positive prototype, $\|\phi(\mathbf{x}) - \boldsymbol\mu_+^\phi\|_2^2$, using the shortcut $\|\phi(\mathbf{x}) - \boldsymbol\mu_+^\phi\|_2^2 = \langle\phi(\mathbf{x}),\phi(\mathbf{x})\rangle - 2\langle\phi(\mathbf{x}),\boldsymbol\mu_+^\phi\rangle + \langle\boldsymbol\mu_+^\phi,\boldsymbol\mu_+^\phi\rangle$
The first term is simply $K(\mathbf{x},\mathbf{x})$, the second term is $\frac{2}{n_+}\sum_{i:y^i=+1}K(\mathbf{x},\mathbf{x}^i)$, and the third term is $\frac{1}{n_+^2}\sum_{i:y^i=+1}\sum_{j:y^j=+1}K(\mathbf{x}^i,\mathbf{x}^j)$, which can be pre-calculated at train time (a small sketch follows below)
There are ways in which kernel methods can be sped up and model sizes reduced. Will see those techniques later

KERNEL LwP
1. Choose a kernel $K$ with map $\phi$
2. Training: receive and store points $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$
3. Prediction:
   1. Receive a test point $\mathbf{x}$
   2. Find the distance to the positive prototype using kernel computations
   3. Find the distance to the negative prototype similarly
   4. Predict the class whose prototype is nearer
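A sketch of the kernel LwP prediction rule above: the squared distance to each class's feature-space prototype reduces to sums of kernel values; the third term could be precomputed at train time as the slide notes.

```python
import numpy as np

def K(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def dist2_to_prototype(x_test, X_class):
    n = len(X_class)
    term1 = K(x_test, x_test)
    term2 = -2.0 / n * sum(K(x_test, xi) for xi in X_class)
    term3 = 1.0 / n ** 2 * sum(K(xi, xj) for xi in X_class for xj in X_class)  # precomputable
    return term1 + term2 + term3

def kernel_lwp_predict(x_test, X_pos, X_neg):
    return +1 if dist2_to_prototype(x_test, X_pos) < dist2_to_prototype(x_test, X_neg) else -1
```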
Kernel SVM
PRIMAL FORMULATION: $\min_{\mathbf{w},\boldsymbol\xi}\ \frac{1}{2}\|\mathbf{w}\|_2^2 + C\sum_{i=1}^n\xi_i$ s.t. $y^i\langle\mathbf{w},\mathbf{x}^i\rangle \ge 1 - \xi_i$ and $\xi_i \ge 0$
DUAL FORMULATION: $\max_{\boldsymbol\alpha}\ \sum_{i=1}^n\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y^i y^j\langle\mathbf{x}^i,\mathbf{x}^j\rangle$ s.t. $0 \le \alpha_i \le C$
Let's see what happens if we execute the SVM after applying a (non-linear) feature map $\phi$
Kernel SVM
PRIMAL FORMULATION: $\min_{\mathbf{w},\boldsymbol\xi}\ \frac{1}{2}\|\mathbf{w}\|_{\mathcal{H}}^2 + C\sum_{i=1}^n\xi_i$ s.t. $y^i\langle\mathbf{w},\phi(\mathbf{x}^i)\rangle \ge 1 - \xi_i$ and $\xi_i \ge 0$
DUAL FORMULATION: $\max_{\boldsymbol\alpha}\ \sum_{i=1}^n\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y^i y^j\langle\phi(\mathbf{x}^i),\phi(\mathbf{x}^j)\rangle$ s.t. $0 \le \alpha_i \le C$
Note that $\langle\phi(\mathbf{x}^i),\phi(\mathbf{x}^j)\rangle = K(\mathbf{x}^i,\mathbf{x}^j)$, so the dual needs only kernel values
Note that if we solve the dual, then the model itself is $\mathbf{w} = \sum_{i=1}^n\alpha_i y^i\phi(\mathbf{x}^i) \in \mathcal{H}$
Kernel SVM
DUAL FORMULATION (kernelized): $\max_{\boldsymbol\alpha}\ \sum_{i=1}^n\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y^i y^j K(\mathbf{x}^i,\mathbf{x}^j)$ s.t. $0 \le \alpha_i \le C$
Finding/storing the model $\mathbf{w}$ explicitly is not feasible even if we solve the kernel SVM dual problem perfectly, since $\mathbf{w} = \sum_i\alpha_i y^i\phi(\mathbf{x}^i)$ and $\phi(\mathbf{x}^i)$ is never computed (and may be infinite dimensional, e.g. for the Gaussian kernel)
Solving the primal is infeasible: computing $\phi(\mathbf{x})$ is usually infeasible, e.g. a single SGD step would take infinitely long. We can still solve the dual problem using SDCA. Each step of SDCA still takes time apart from the kernel evaluations; if the time taken to compute $K(\mathbf{x}^i,\mathbf{x}^j)$ is added, then each SDCA step takes roughly $\mathcal{O}(nd)$ time
So instead we can store all the $\alpha_i$ values (only $n$ of them). At test time, given a test point $\mathbf{x}$, we can predict using $\hat{y} = \text{sign}\left(\sum_{i=1}^n\alpha_i y^i K(\mathbf{x}^i,\mathbf{x})\right)$
Note that if the test data point is very similar to one of the training points, i.e. $K(\mathbf{x}^i,\mathbf{x})$ is large, then that label $y^i$ influences the prediction much more. If we think this way, kernel SVM almost looks like a "soft" form of kNN. If there are $s$ support vectors, then prediction requires $s$ kernel computations, i.e. roughly $\mathcal{O}(sd)$ time since each kernel computation takes roughly $\mathcal{O}(d)$ time
Training is more expensive, the model size is larger, and prediction takes more time for kernel SVM than was the case for linear SVM – very typical of non-linear models (a small prediction sketch follows)

KERNEL SVM
1. Choose a kernel $K$ with map $\phi$
2. Training: receive train points $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$
   1. Solve the dual problem to obtain $\boldsymbol\alpha$
   2. Implicitly store $\mathbf{w}$ by storing $\alpha_i$ for all support vectors, i.e. those $i$ with $\alpha_i > 0$
3. Prediction:
   1. Receive a test point $\mathbf{x}$
   2. Predict $\hat{y} = \text{sign}\left(\sum_{i:\alpha_i > 0}\alpha_i y^i K(\mathbf{x}^i,\mathbf{x})\right)$
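A hedged sketch of kernel SVM prediction from the dual: assuming the dual variables alpha_i for the support vectors have already been obtained (e.g. via SDCA), the prediction is the sign of the kernel-weighted sum below. The variable names are illustrative.

```python
import numpy as np

def K(x, z, gamma=1.0):                       # Gaussian kernel, for illustration
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_svm_predict(x_test, X_sv, y_sv, alpha_sv):
    # score = sum over support vectors of alpha_i * y_i * K(x_i, x_test)
    score = sum(a * y * K(xi, x_test) for a, y, xi in zip(alpha_sv, y_sv, X_sv))
    return np.sign(score)
```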
Kernel Ridge Regression
Given data $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$, the RR solution is simply $\hat{\mathbf{w}} = \left(X^\top X + \lambda I_d\right)^{-1}X^\top\mathbf{y}$
Is ridge regression kernelizable? It does not seem so at first
In fact, it is – by using the dual problem of ridge regression
Deriving the dual for RR: RR solves $\min_{\mathbf{w}}\ \sum_{i=1}^n\left(y^i - \langle\mathbf{w},\mathbf{x}^i\rangle\right)^2 + \lambda\|\mathbf{w}\|_2^2$
The dual requires constraints – there are none here, so let's deliberately introduce some!
New (but equivalent) formulation: $\min_{\mathbf{w},\mathbf{e}}\ \sum_{i=1}^n e_i^2 + \lambda\|\mathbf{w}\|_2^2$ s.t. $e_i = y^i - \langle\mathbf{w},\mathbf{x}^i\rangle$
To handle equality constraints: Method 1: convert each into a pair of inequality constraints; Method 2: use a Lagrangian variable that has no sign constraints
The Lagrangian becomes $\mathcal{L}(\mathbf{w},\mathbf{e},\boldsymbol\alpha) = \sum_i e_i^2 + \lambda\|\mathbf{w}\|_2^2 + \sum_i\alpha_i\left(y^i - \langle\mathbf{w},\mathbf{x}^i\rangle - e_i\right)$
Applying first-order optimality gives us $\mathbf{w} = \frac{1}{2\lambda}\sum_i\alpha_i\mathbf{x}^i$ and $e_i = \frac{\alpha_i}{2}$
The dual becomes a problem purely in $\boldsymbol\alpha$ in which the data appear only through the inner products $\langle\mathbf{x}^i,\mathbf{x}^j\rangle$
Kernel Ridge Regression
Note, however, that we can use this dual trick to solve RR even in the linear case when $\phi(\mathbf{x}) = \mathbf{x}$. Solving linear RR in the primal requires $\mathcal{O}(d^3)$ time (to invert a $d\times d$ matrix) whereas solving linear RR in the dual requires $\mathcal{O}(n^3)$ time (to invert an $n\times n$ matrix). Thus, if $n < d$, the dual solution is cheaper
Thus, RR does have a dual problem (which makes it kernelizable too)
Solve the dual to obtain $\boldsymbol\alpha \in \mathbb{R}^n$
The model is $\mathbf{w} = \sum_i\alpha_i\phi(\mathbf{x}^i)$ (absorbing constants into $\boldsymbol\alpha$) – it cannot be stored explicitly
Given a test point $\mathbf{x}$, predict as $\hat{y} = \langle\mathbf{w},\phi(\mathbf{x})\rangle = \sum_{i=1}^n\alpha_i K(\mathbf{x}^i,\mathbf{x})$
Some simplifications
Let $G \in \mathbb{R}^{n\times n}$ with $G_{ij} = K(\mathbf{x}^i,\mathbf{x}^j)$ denote the "Gram matrix" of the training points
The dual of kernel RR can be rewritten purely in terms of $G$ and $\mathbf{y}$
The solution is available in closed form: $\boldsymbol\alpha = \left(G + \lambda I_n\right)^{-1}\mathbf{y}$
Requires inverting an $n\times n$ matrix (linear RR required inverting a $d\times d$ matrix)
As before, kernel RR requires more train time, more test time, and a larger model size (a small sketch follows below)
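A sketch of kernel ridge regression in closed form as above: alpha = (G + lambda I)^{-1} y on the Gram matrix, with predictions given by kernel-weighted sums; the Gaussian kernel and parameter values are illustrative choices.

```python
import numpy as np

def gaussian_gram(A, B, gamma=1.0):
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_rr_fit(X, y, lam=0.1, gamma=1.0):
    G = gaussian_gram(X, X, gamma)                        # n x n Gram matrix
    return np.linalg.solve(G + lam * np.eye(len(X)), y)   # alpha; O(n^3) time

def kernel_rr_predict(X_train, alpha, X_test, gamma=1.0):
    return gaussian_gram(X_test, X_train, gamma) @ alpha  # sum_i alpha_i K(x_i, x)
```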
Kernel Clustering
Should be relatively simple given our experience with kernel LwP, kNN

K-MEANS / LLOYD'S ALGORITHM
1. Initialize centroids $\boldsymbol\mu_1,\ldots,\boldsymbol\mu_k$
2. For $i = 1,\ldots,n$, do cluster assignment
   1. Let $z_i = \arg\min_{c\in[k]}\|\mathbf{x}^i - \boldsymbol\mu_c\|_2$
3. Update $\boldsymbol\mu_c = \frac{1}{|\{i: z_i = c\}|}\sum_{i: z_i = c}\mathbf{x}^i$
4. Repeat until convergence

All we need to do is kernelize the distance computations and keep track of which points are assigned to which cluster, rather than the centroids themselves
Kernel K-means
Note that cluster centres in k-means are always the average of the data points that were assigned to that cluster – the assignment variables maintain this info
Need to maintain this information a bit differently for easy processing
Let $z_{ic} = 1$ if point $i$ is assigned to cluster $c$ and $z_{ic} = 0$ if not, i.e. $\mathbf{z}_c \in \{0,1\}^n$ indicates the members of cluster $c$ and $n_c = \sum_i z_{ic}$
This lets us write $\boldsymbol\mu_c = \frac{1}{n_c}\sum_{i=1}^n z_{ic}\,\phi(\mathbf{x}^i)$
Let $G$ denote the Gram matrix of the training points, $G_{ij} = K(\mathbf{x}^i,\mathbf{x}^j)$
Using this, we can rewrite the distance computations as $\|\phi(\mathbf{x}^j) - \boldsymbol\mu_c\|_2^2 = G_{jj} - \frac{2}{n_c}\mathbf{z}_c^\top\mathbf{g}^j + \frac{1}{n_c^2}\mathbf{z}_c^\top G\,\mathbf{z}_c$, where $\mathbf{g}^j$ is the $j$-th column of $G$ – which is nothing but the same shortcut we used for kernel LwP (a small sketch follows below)

KERNEL K-MEANS
1. Initialize the cluster assignments $\mathbf{z}_1,\ldots,\mathbf{z}_k$ randomly
2. For $i = 1,\ldots,n$, do cluster assignment
   1. Let $\hat{c} = \arg\min_{c\in[k]}\|\phi(\mathbf{x}^i) - \boldsymbol\mu_c\|_2^2$, computed using only entries of $G$
3. For each point, update the assignments
   1. For $i = 1,\ldots,n$, set $z_{i\hat{c}} = 1$ and $z_{ic} = 0$ for all $c \ne \hat{c}$
4. Repeat until convergence
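A small sketch of the kernel k-means distance computation above: with a 0/1 membership vector z for cluster c, the distance of point j to that cluster's feature-space mean uses only entries of the Gram matrix G.

```python
import numpy as np

def dist2_to_cluster_mean(G, j, z):
    # ||phi(x_j) - mu_c||^2 = G_jj - (2/n_c) * sum_i z_i G_ji + (1/n_c^2) * sum_{i,i'} z_i z_i' G_ii'
    idx = np.flatnonzero(z)                  # members of cluster c
    n_c = len(idx)
    return (G[j, j]
            - 2.0 / n_c * G[j, idx].sum()
            + 1.0 / n_c ** 2 * G[np.ix_(idx, idx)].sum())
```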
Parametric vs Non-parametric ML models
ML models for which the model size is independent of the number of training points are called parametric
Linear SVM, LwP, logistic regression, ridge regression
ML models for which the model size depends on the number of training points are called non-parametric (the name is a bit non-intuitive)
Kernel SVM with a non-linear kernel – need to store all non-zero $\alpha_i$ and the corresponding $\mathbf{x}^i$, and it is possible that all train points become support vectors
Kernel algorithms in general are non-parametric
kNN – need to store all training points