Lecture 19: Nonlinear Learning With Kernels (1)
CS771: Intro to ML
Linear Models for Nonlinear Problems
Consider the following one-dimensional inputs from two classes
[Figure: points from the two classes interleaved along the x-axis]
Can't separate the two classes using a linear hyperplane (in 1-D, a hyperplane is just a single threshold point).
Consider mapping each input x to two dimensions as
x → z = [z1, z2] = [x, cos(x)]
Note that this is not a linear relationship between the inputs and the new features.
[Figure: in the new (z1, z2) space, the two classes become separable by a linear hyperplane]
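As a quick illustration (a minimal sketch, not from the lecture: the synthetic dataset, the labels given by the sign of cos(x), and the use of scikit-learn's LogisticRegression are all our assumptions), a linear classifier fails on the raw 1-D inputs but succeeds after the mapping x → [x, cos(x)]:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.uniform(-6, 6, size=200)        # 1-D inputs
y = (np.cos(x) > 0).astype(int)         # synthetic labels: class given by sign of cos(x) (an assumption)

# linear model on the raw 1-D input: classes are interleaved, so accuracy is poor
acc_raw = LogisticRegression(max_iter=1000).fit(x.reshape(-1, 1), y).score(x.reshape(-1, 1), y)

# linear model after the nonlinear mapping x -> z = [x, cos(x)]: now separable by z2 = 0
Z = np.column_stack([x, np.cos(x)])
acc_mapped = LogisticRegression(max_iter=1000).fit(Z, y).score(Z, y)

print(f"raw 1-D accuracy: {acc_raw:.2f}, mapped accuracy: {acc_mapped:.2f}")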
More generally, we can assume a feature mapping φ that maps/transforms the inputs to a "nice" space where the problem becomes linear.
How to get these “good” (nonlinear) mappings?
Can try to learn the mapping from the data itself (e.g., using deep learning - later)
Can use pre-defined "good" mappings (e.g., defined by kernel functions - today's topic)
A natural worry: even if we knew a good mapping, it seems we would need to apply it to every input. Won't this be computationally expensive? Also, won't the increased number of features slow down the learning algorithm?
Thankfully, using kernels, you don't need to compute these mappings explicitly: the kernel will define an "implicit" feature mapping.
Kernel: a function k(.,.) that gives the dot product similarity between two inputs, say x and z, in a high-dim space implicitly defined by an underlying mapping φ associated with the kernel function.
Important: As we will see, computing k(x, z) does not require computing the mapping φ.
Important: The idea can be applied to any ML algo in which the training and test stages only require computing pairwise similarities between inputs.
Kernels as (Implicit) Feature Maps
Consider two inputs x = [x1, x2] and z = [z1, z2] (in the same two-dim feature space).
Suppose we have a function k which takes the two inputs x and z and computes
k(x, z) = (x^T z)^2
This function is called the "kernel function". Can think of k(x, z) as a notion of similarity between x and z - not the plain dot/inner product, but similarity using a more general function of x and z (here, the square of their dot product).
Expanding the square shows that k is still a dot product, just in a new feature space:
k(x, z) = (x1 z1 + x2 z2)^2 = x1^2 z1^2 + 2 x1 x2 z1 z2 + x2^2 z2^2 = φ(x)^T φ(z), where φ(x) = [x1^2, √2 x1 x2, x2^2]
We didn't need to compute φ explicitly - just using the definition of the kernel implicitly gave us this mapping for each input. Thus the kernel function implicitly defined a feature mapping φ such that k(x, z) = φ(x)^T φ(z), the dot product similarity in the new feature space defined by the mapping.
Remember that a kernel does two things: it maps the data implicitly into a new feature space (feature transformation), and it computes the pairwise similarity between any two inputs under the new feature representation.
The new feature space needs to be a vector space with a dot product defined on it (a.k.a. a Hilbert space).
Is any function k(x, z) of two inputs a kernel function? No. The function must satisfy Mercer's Condition.
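A small numerical check of this expansion (a sketch; the helper names k_quad and phi are ours):

import numpy as np

def k_quad(x, z):
    # quadratic kernel: square of the ordinary dot product
    return float(np.dot(x, z)) ** 2

def phi(x):
    # the implicit 3-D feature map for 2-D inputs: [x1^2, sqrt(2) x1 x2, x2^2]
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(k_quad(x, z), phi(x) @ phi(z))  # both give 1.0 (up to round-off)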
Kernel Functions
For k(.,.) to be a kernel function, k must define a dot product for some Hilbert space H.
The above is true if k is a symmetric and positive semi-definite (p.s.d.) function (though there are exceptions; there are also "indefinite" kernels). Loosely speaking, a p.s.d. function here means that if we evaluate the function on all pairs from any set of N inputs, the resulting N x N matrix (also called a kernel matrix) will be p.s.d.
Formally, Mercer's Condition requires:
k(x, z) = k(z, x)   (symmetry)
∬ f(x) k(x, z) f(z) dx dz ≥ 0 for all "square integrable" functions f (such functions satisfy ∫ f(x)^2 dx < ∞)
New kernels can also be built from existing kernels k1 and k2:
k = k1 + k2: simple sum
k = α k1 (for α > 0): scalar product
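The p.s.d. property can be checked empirically; below is a sketch (the choice of the RBF kernel with γ = 1 and the random data are assumptions) that builds a kernel matrix and confirms it has no negative eigenvalues:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # 20 random 3-D inputs

# build the 20 x 20 kernel matrix using the RBF kernel k(x, z) = exp(-||x - z||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dists)

eigvals = np.linalg.eigvalsh(K)  # K is symmetric, so eigvalsh applies
print("smallest eigenvalue:", eigvals.min())  # >= 0, up to numerical round-off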
Some Pre-defined Kernel Functions
Remember that kernels are a notion of similarity between pairs of inputs. Kernels can have a pre-defined form or can be learned from data (a bit advanced for this course). Several other kernels have been proposed for non-vector data, such as trees, strings, etc.
Linear kernel: k(x, z) = x^T z
Quadratic kernel: k(x, z) = (x^T z)^2 or (1 + x^T z)^2
Polynomial kernel (of degree d): k(x, z) = (x^T z)^d or (1 + x^T z)^d
Radial Basis Function (RBF) or "Gaussian" kernel: k(x, z) = exp(-γ ||x - z||^2)
The Gaussian kernel gives a similarity score between 0 and 1. Here γ is a hyperparameter (called the kernel bandwidth parameter) that controls how the distance between two inputs should be converted into a similarity.
The RBF kernel corresponds to an infinite-dim. feature space (i.e., you can't actually write down or store the map explicitly - but we don't need to do that anyway).
The RBF kernel is also called a "stationary kernel": it only depends on the distance between x and z (translating both by the same amount won't change the value of k(x, z)).
Kernel hyperparameters (e.g., γ) can be set via cross-validation.
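These kernels are one-liners to implement; a sketch (function names are ours, with d and γ as the hyperparameters discussed above):

import numpy as np

def linear_kernel(x, z):
    return float(np.dot(x, z))

def polynomial_kernel(x, z, d=3, add_one=True):
    # d = 2 with add_one=False gives the quadratic kernel (x^T z)^2
    base = float(np.dot(x, z)) + (1.0 if add_one else 0.0)
    return base ** d

def rbf_kernel(x, z, gamma=1.0):
    # similarity in (0, 1]; gamma is the kernel bandwidth hyperparameter
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z, d=2), rbf_kernel(x, z, gamma=0.5))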
RBF Kernel = Infinite Dimensional Mapping
We saw that the RBF/Gaussian kernel is defined as k(x, z) = exp(-γ ||x - z||^2).
Using this kernel corresponds to mapping the data to an infinite dimensional space. Assuming γ = 1 and x, z to be scalars:
k(x, z) = exp[-(x - z)^2]
        = exp(-x^2) exp(-z^2) exp(2xz)
        = exp(-x^2) exp(-z^2) Σ_{k=0}^∞ (2^k x^k z^k) / k!
        = φ(x)^T φ(z)
Here φ(x) = exp(-x^2) [1, √(2/1!) x, √(2^2/2!) x^2, √(2^3/3!) x^3, ...], which is an infinite-dim vector (ignoring the constants coming from the exp(-x^2) and exp(-z^2) terms, each coordinate is proportional to a power of x).
But again, note that we never need to compute φ(x) to compute k(x, z); k(x, z) is easily computable from the kernel's definition itself (exp[-(x - z)^2] in this case).
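The infinite expansion can be sanity-checked by truncation; the following sketch (the helper phi_truncated and the cutoff M are ours) keeps only the first M coordinates of φ and shows the dot product converging to the exact kernel value:

import numpy as np
from math import factorial

def phi_truncated(x, M=20):
    # first M coordinates of the infinite-dim map: exp(-x^2) * sqrt(2^k / k!) * x^k
    coords = [np.sqrt(2.0 ** k / factorial(k)) * x ** k for k in range(M)]
    return np.exp(-x ** 2) * np.array(coords)

x, z = 0.7, -0.3
approx = phi_truncated(x) @ phi_truncated(z)
exact = np.exp(-(x - z) ** 2)
print(approx, exact)  # nearly identical already at M = 20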
Kernel Matrix
Kernel-based ML algos work with kernel matrices rather than feature vectors.
Given N inputs, the kernel function k can be used to construct an N x N Kernel Matrix K, with each entry defined as
K_ij = k(x_i, x_j) = φ(x_i)^T φ(x_j)
Note again that we don't need to compute φ and this dot product explicitly.
K_ij is the similarity between the i-th and j-th inputs in the kernel-induced feature space.
K is a symmetric and positive semi-definite matrix.
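A sketch of the construction for any kernel passed as a plain function (build_kernel_matrix is our helper name; real libraries vectorize this instead of looping):

import numpy as np

def build_kernel_matrix(X, kernel):
    # X: (N, D) array of inputs; returns K with K[i, j] = k(x_i, x_j)
    N = X.shape[0]
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = kernel(X[i], X[j])
    return K

X = np.random.default_rng(0).normal(size=(5, 2))
K = build_kernel_matrix(X, lambda x, z: float(np.dot(x, z)) ** 2)  # quadratic kernel
print(np.allclose(K, K.T))  # True: K is symmetric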