Mercer Kernels
In general, $\mathcal{H}$ has to be a Hilbert space which, technicalities aside, is very much like a real vector space such as $\mathbb{R}^d$, but is possibly infinite dimensional. It is always possible to define inner/dot products on Hilbert spaces.
Suppose $\mathbf{u}, \mathbf{v} \in \mathbb{R}^d$ are any two unit vectors
The dot product $\langle \mathbf{u}, \mathbf{v} \rangle$ is a natural notion of similarity between these vectors
It is highest when the vectors are the same, i.e. when $\mathbf{u} = \mathbf{v}$ we have $\langle \mathbf{u}, \mathbf{v} \rangle = 1$
It is lowest when the vectors are diametrically opposite, i.e. $\mathbf{u} = -\mathbf{v}$ and $\langle \mathbf{u}, \mathbf{v} \rangle = -1$
Mercer kernels are notions of similarity that extend such nice behaviour
Given a set of objects $\mathcal{X}$ (images, videos, strings, genome sequences), a similarity function $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a Mercer kernel if there exists a map $\phi: \mathcal{X} \to \mathcal{H}$ s.t. for all $x, y \in \mathcal{X}$, $K(x, y) = \langle \phi(x), \phi(y) \rangle$
$\phi$ is often called the feature map or feature embedding; $\mathcal{H}$ can be $\mathbb{R}^D$ for some large/moderate $D$, and can even be infinite dimensional
Thus, when asked to give the similarity between two objects, all that a Mercer kernel does is first map those objects to two (high-dimensional) vectors and return the dot/inner product between those two vectors
Examples of Kernels
When $\mathbf{x}, \mathbf{z} \in \mathbb{R}^d$ are vectors:
Linear kernel: $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top \mathbf{z}$
Quadratic kernel: $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z} + c)^2$
Polynomial kernel: $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z} + c)^r$, $c \geq 0$, $r \in \mathbb{N}$ (poly kernels are called homogeneous if $c = 0$; $c < 0$ makes the kernel non-Mercer)
Gaussian kernel: $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \|\mathbf{x} - \mathbf{z}\|_2^2)$
Laplacian kernel: $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \|\mathbf{x} - \mathbf{z}\|_1)$
All of the above are Mercer kernels
There indeed exist feature maps for each of them (proving so is a bit tedious)
Hyperparameters such as $c$, $r$, $\gamma$ need to be tuned. Large $r$ or $\gamma$ can cause overfitting
Notice all of the above are indeed notions of similarity
Take two unit vectors (unit for the sake of normalization). Easy to verify that $K(\mathbf{x}, \mathbf{z})$ is largest when $\mathbf{x} = \mathbf{z}$ and smallest when $\mathbf{x} = -\mathbf{z}$
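As an illustration (not part of the slides), a minimal NumPy sketch of the kernels listed above; the hyperparameter names c, r, gamma are illustrative choices:

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def quadratic_kernel(x, z, c=0.0):
    return (x @ z + c) ** 2

def polynomial_kernel(x, z, c=1.0, r=3):
    return (x @ z + c) ** r

def gaussian_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def laplacian_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum(np.abs(x - z)))

# similarity behaviour on unit vectors: largest when z == x, smallest when z == -x
u = np.array([0.6, 0.8])
print(gaussian_kernel(u, u), gaussian_kernel(u, -u))  # 1.0 vs exp(-4*gamma)
```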
Mercer Kernel Feature Maps
Linear kernel: use $\phi(\mathbf{x}) = \mathbf{x}$. Called "linear" for a reason: any linear function over this feature map is just a linear function over the original features
Homogeneous poly kernels ($c = 0$) only use degree-$r$ monomial features of the form $x_{i_1} x_{i_2} \cdots x_{i_r}$; in contrast, if we have $c > 0$ then the kernels use all monomial features of degree up to $r$, so non-homogeneous poly kernels are more expressive
Quadratic kernel
We have $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z})^2 = \langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle$ when $\phi(\mathbf{x}) = (x_i x_j)_{i,j \in [d]} \in \mathbb{R}^{d^2}$, where $[d] = \{1, \dots, d\}$
Similar constructions (more tedious to write down) exist for the polynomial kernel of degree $r$
Called "quadratic" for a reason: any linear function over $\phi(\mathbf{x})$ is a quadratic function over the original features $\mathbf{x}$. Verify this for a simple case yourself
If we use a linear ML algo over $\phi(\mathbf{x})$, we can learn any quadratic function over the original features $\mathbf{x}$. A polynomial kernel of degree $r$ similarly allows learning of degree-$r$ polynomial functions over the original data
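A small sketch (illustrative, not from the slides) verifying the quadratic feature map: with $\phi(\mathbf{x})$ collecting all products $x_i x_j$, the explicit inner product matches $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z})^2$:

```python
import numpy as np

def quad_feature_map(x):
    # all d^2 products x_i * x_j, i.e. phi(x) lives in R^(d^2)
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

print(quad_feature_map(x) @ quad_feature_map(z))  # <phi(x), phi(z)>
print((x @ z) ** 2)                               # K(x, z): same value
```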
Mercer Kernel Feature Maps
Gaussian/Laplacian kernels correspond to infinite dimensional maps
Learning a linear function over the features $\phi(\mathbf{x})$ amounts to learning an infinite-degree polynomial over $\mathbf{x}$. Gaussian/Laplacian kernels are very powerful, often called universal kernels: by using these kernels, theoretically speaking, one can learn any function over data (details beyond scope of CS771)
Warning: there may exist more than one map for the same kernel. Example: we used one map for the quadratic kernel; however, a different map can give the same kernel
Method 3: mix and match. Take a new data representation and an old kernel
Kernelized Algorithms
Several algorithms we have studied till now work with kernels too!
Supervised: kNN, LWP, SVM, ridge regression
Unsupervised: k-means, PCA
Others like LASSO are harder to get working with kernels
Probabilistic/Bayesian algorithms also possible with kernels
Special Case of Naïve Bayes Learning
Use a standard Gaussian $\mathcal{N}(\boldsymbol{\mu}_c, I)$ to model points of class $c$
MLE estimate of $\boldsymbol{\mu}_c$: the mean of points of class $c$ – a "prototype" of class $c$
Can have multiple prototypes too – multiple clusters per class
If we assume equal class priors, then to classify a test point $\mathbf{x}$, simply find $\hat{c} = \arg\min_c \|\mathbf{x} - \boldsymbol{\mu}_c\|_2$
Give $\hat{c}$ as the output – may consult more than one neighbour too!
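A toy sketch of this special case (synthetic data and equal class priors are assumptions made for illustration): the MLE prototype is just the class mean and prediction picks the nearest prototype:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical toy data: two Gaussian blobs, one per class
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(3.0, 1.0, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# MLE of the class mean under N(mu_c, I) is the class average -- the "prototype"
prototypes = {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(x):
    # equal class priors assumed -> pick the class with the nearest prototype
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

print(predict(np.array([2.5, 2.5])))  # most likely class 1
```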
The Kernel Trick Revisited
An algorithmically effective way of using linear models on non-linear maps
Every kernel $K$ is associated with a map $\phi$ such that $K(\mathbf{x}, \mathbf{z}) = \langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle$
The map $\phi$ is usually (very) non-linear and (very) high dimensional, i.e. a good candidate for our overall goal of using linear models over non-linear feature maps
Peculiar property of several ML algos
So far we have seen ML algos work with feature vectors of train/test points
However, many of them work even if feature vectors are not provided directly but instead pairwise dot/inner products b/w feature vectors are provided!
For training, pairwise dot products between train points are needed
For testing, dot products between the test point and all train points are needed
This peculiar property is often called kernelizability. An ML algo is said to be kernelizable if we can show that it works identically if, instead of feature vectors, we supply pairwise train-train and test-train dot products of feature vectors
Thus, we can say we want to work with high-dimensional feature vectors $\phi(\mathbf{x})$ and, when the ML algo asks us for dot products, give it $K(\mathbf{x}, \mathbf{z}) = \langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle$
We would get the same result as working directly with $\phi(\mathbf{x})$ but without ever having to compute $\phi(\mathbf{x})$
This is a recurring theme in kernel learning. Never ever compute $\phi(\mathbf{x})$ explicitly. Instead, express all operations in the ML algo in terms of inner product computations, which are then expressible as kernel computations
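A short sketch of the two kernel quantities a kernelizable algorithm needs, using a Gaussian kernel and a hypothetical helper gaussian_gram (both are illustrative choices):

```python
import numpy as np

def gaussian_gram(A, B, gamma=1.0):
    # pairwise squared distances between rows of A and rows of B
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

X_train = np.random.randn(50, 4)
X_test = np.random.randn(5, 4)

K_train_train = gaussian_gram(X_train, X_train)  # (50, 50): all the algo needs to train
K_test_train = gaussian_gram(X_test, X_train)    # (5, 50): all it needs to predict
```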
kNN with Kernels
All that is needed to execute kNN is to compute Euclidean distances
If working with a kernel $K$ with map $\phi$, we need distances $\|\phi(\mathbf{x}) - \phi(\mathbf{z})\|_2$
Indeed, computing $K(\mathbf{x}, \mathbf{z})$ usually takes $\mathcal{O}(d)$ time if $\mathbf{x}, \mathbf{z} \in \mathbb{R}^d$, but computing $\phi(\mathbf{x})$ may take much longer, e.g. for the Gaussian kernel it would take infinite time since $\phi$ is infinite dimensional
Thus, distances in $\mathcal{H}$ can be computed without computing $\phi$ first: $\|\phi(\mathbf{x}) - \phi(\mathbf{z})\|_2^2 = K(\mathbf{x}, \mathbf{x}) - 2K(\mathbf{x}, \mathbf{z}) + K(\mathbf{z}, \mathbf{z})$
1NN: given training points $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$ and a test point $\mathbf{x}$
Find the nearest neighbour in $\mathcal{H}$, i.e. $\arg\min_i \|\phi(\mathbf{x}) - \phi(\mathbf{x}^i)\|_2$, which is the same as $\arg\min_i \{K(\mathbf{x}^i, \mathbf{x}^i) - 2K(\mathbf{x}, \mathbf{x}^i)\}$, and predict its label
Note: if $K(\mathbf{x}^i, \mathbf{x}^i)$ is the same for all $i$ (e.g. the Gaussian kernel) then this finds the most "similar" point, i.e. $\arg\max_i K(\mathbf{x}, \mathbf{x}^i)$
Similarly we can execute kNN for $k > 1$ as well
KERNEL KNN ($k$ usually = 1)
1. Choose a kernel $K$ with map $\phi$
2. Training: receive and store points $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$
3. Prediction:
   1. Receive a test point $\mathbf{x}$
   2. Find the closest neighbour in $\mathcal{H}$ using kernel computations
   3. Predict its label
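A minimal sketch of kernel 1NN under the identity above; the Gaussian kernel here is an illustrative choice:

```python
import numpy as np

def kernel_1nn_predict(x, X_train, y_train, k):
    # squared distance in feature space, computed purely via kernel evaluations:
    # ||phi(x) - phi(x_i)||^2 = K(x, x) - 2 K(x, x_i) + K(x_i, x_i)
    d2 = np.array([k(x, x) - 2 * k(x, xi) + k(xi, xi) for xi in X_train])
    return y_train[np.argmin(d2)]

gauss = lambda a, b, gamma=0.5: np.exp(-gamma * np.sum((a - b) ** 2))

X_train = np.array([[0.0, 0.0], [3.0, 3.0]])
y_train = np.array([-1, +1])
print(kernel_1nn_predict(np.array([2.5, 2.8]), X_train, y_train, gauss))  # +1
```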
LwP with Kernels
Observe that in LwP with kernels, we now have to store entire training
data whereas earlier we just had to store two prototypes. This is common
in kernel learning – larger model sizes and longer prediction times
Given train data $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$, we earlier found prototypes $\boldsymbol{\mu}_+ = \frac{1}{n_+} \sum_{y^i = +1} \mathbf{x}^i$ and $\boldsymbol{\mu}_- = \frac{1}{n_-} \sum_{y^i = -1} \mathbf{x}^i$ and used them to predict on a test point $\mathbf{x}$ by choosing the class whose prototype is nearer
If using a kernel with map $\phi$, we should now compute the new prototypes as $\boldsymbol{\mu}_+ = \frac{1}{n_+} \sum_{y^i = +1} \phi(\mathbf{x}^i)$ and $\boldsymbol{\mu}_- = \frac{1}{n_-} \sum_{y^i = -1} \phi(\mathbf{x}^i)$ and predict using distances in $\mathcal{H}$
Need to be careful now – cannot compute these new prototypes explicitly
Instead, as before, we reduce the above to kernel computations
Distance to the positive prototype via the shortcut: $\|\phi(\mathbf{x}) - \boldsymbol{\mu}_+\|_2^2 = K(\mathbf{x}, \mathbf{x}) - \frac{2}{n_+} \sum_{y^i = +1} K(\mathbf{x}, \mathbf{x}^i) + \frac{1}{n_+^2} \sum_{y^i = +1} \sum_{y^j = +1} K(\mathbf{x}^i, \mathbf{x}^j)$
The first term is simply $K(\mathbf{x}, \mathbf{x})$, the second term is an average of test-train kernel values, and the third term is a constant which can be pre-calculated at train time
KERNEL LWP
1. Choose a kernel $K$ with map $\phi$
2. Training: receive and store points $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$
3. Prediction:
   1. Receive a test point $\mathbf{x}$
   2. Find distance to the positive prototype using the shortcut
   3. Find distance to the negative prototype similarly
   4. Predict the class with the nearer prototype
There are ways in which kernel methods can be sped up and model sizes reduced. Will see those techniques later
Kernel SVM
PRIMAL FORMULATION DUAL FORMULATION
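The primal/dual details are not reproduced here; as a hedged sketch only, scikit-learn's SVC accepts precomputed Gram matrices, so a kernel SVM can be fit by supplying the train-train Gram matrix and queried with test-train kernel values (the Gaussian kernel and gamma value are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_gram(A, B, gamma=0.5):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(3.0, 1.0, (20, 2))])
y_train = np.array([-1] * 20 + [+1] * 20)
X_test = np.array([[3.0, 3.0], [0.0, 0.0]])

svm = SVC(kernel='precomputed', C=1.0)
svm.fit(gaussian_gram(X_train, X_train), y_train)    # train-train Gram matrix
print(svm.predict(gaussian_gram(X_test, X_train)))   # test-train kernel values
```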