Lecture 19 - Nonlinear Learning With Kernels

Linear models are interpretable but cannot learn complex nonlinear patterns. Kernel methods address this by mapping inputs to a higher dimensional feature space where nonlinear relationships appear linear. This is done implicitly through kernel functions, which compute the similarity between any two inputs in the feature space without explicitly computing the mapping. A kernel function defines a valid feature space if it is symmetric and positive semi-definite, satisfying Mercer's condition for defining an inner product. This allows linear models to be applied to the feature space to solve nonlinear problems in the original input space.

Turning Linear Models into Nonlinear Models using Kernel Methods

CS771: Introduction to Machine Learning


Piyush Rai

Linear Models
- Nice and interpretable but can’t learn “difficult” nonlinear patterns
- So, are linear models useless for such problems?

Linear Models for Nonlinear Problems
- Consider the following one-dimensional inputs from two classes
  [Figure: the inputs from the two classes shown along a one-dimensional axis x]
- Can’t separate using a linear hyperplane

Linear Models for Nonlinear Problems
- Consider mapping each input x to two dimensions (via a nonlinear mapping, e.g., x → z = [z_1, z_2] = [x, x^2])
  [Figure: the mapped points in the two-dimensional space, now separated by a linear hyperplane]
- Classes are now linearly separable in the two-dimensional space
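
As a quick illustration of the above idea (a minimal sketch, not part of the original slides; the toy data and the choice of a logistic regression classifier are assumptions made for the example):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D data: one class sits in the middle, the other on both sides,
# so no single threshold on x can separate them
x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
y = np.array([1, 1, 1, 0, 0, 0, 1, 1, 1])

# Explicit feature map: x -> z = [x, x^2]
Z = np.column_stack([x, x**2])

# A plain linear classifier in the mapped 2-D space ...
clf = LogisticRegression().fit(Z, y)

# ... corresponds to a nonlinear (quadratic) decision rule in the original 1-D space
print(clf.score(Z, y))  # should be 1.0 on this linearly-separable toy data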

Linear Models for Nonlinear Problems
- The same idea can be applied for nonlinear regression as well

  x → z = [z_1, z_2] = [x, cos(x)]

- The relationship between the inputs and outputs is not linear, so a linear regression model on the original one-dim inputs will not work well
- A linear regression model will work well with this new two-dim representation of the original one-dim inputs
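
A similar sketch for the regression case (not from the slides; the synthetic data-generating function below is an assumption made purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy 1-D regression data with a cosine-shaped nonlinear component
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 0.5 * x + 2.0 * np.cos(x) + 0.1 * rng.standard_normal(100)

# Linear regression on the raw 1-D input
r2_raw = LinearRegression().fit(x.reshape(-1, 1), y).score(x.reshape(-1, 1), y)

# Linear regression on the mapped 2-D representation z = [x, cos(x)]
Z = np.column_stack([x, np.cos(x)])
r2_mapped = LinearRegression().fit(Z, y).score(Z, y)

print(r2_raw, r2_mapped)  # the mapped representation should fit much better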

Linear Models for Nonlinear Problems
- Can assume a feature mapping φ that maps/transforms the inputs x to a “nice” space
- .. and then happily apply a linear model in the new space!
- The linear model in the new feature space corresponds to a nonlinear model in the original feature space

Not Every Mapping is Helpful
- Not every higher-dim mapping helps in learning nonlinear patterns
- Must be a nonlinear mapping
- For the nonlinear classification problem we saw earlier, consider some possible mappings
  [Figure: a few candidate mappings of the earlier one-dimensional inputs]

How to get these “good” (nonlinear) mappings?
- Can try to learn the mapping from the data itself (e.g., using deep learning - later)
- Can use pre-defined “good” mappings (e.g., defined by kernel functions - today’s topic)
- Even if I knew a good mapping, it seems I would need to apply it to every input. Won’t this be computationally expensive? Also, the number of features will increase. Won’t it slow down the learning algorithm?
- Thankfully, using kernels, you don’t need to compute these mappings explicitly. The kernel will define an “implicit” feature mapping
- Kernel: A function k(.,.) that gives the dot product similarity b/w two inputs, say x and z, in a high-dim space implicitly defined by an underlying mapping φ associated with this kernel function k(.,.)
- Important: As we will see, computing k(.,.) does not require computing the mapping φ(.)
- Important: The idea can be applied to any ML algo in which the training and test stages only require computing pairwise similarities b/w inputs

Kernels as (Implicit) Feature Maps
- Consider two inputs x = [x_1, x_2] and z = [z_1, z_2] (in the same two-dim feature space)
- Suppose we have a function k which takes two inputs x and z and computes

  k(x, z) = (x^T z)^2

- This is called the “kernel function”. It is not a dot/inner product of x and z itself, but can be thought of as a notion of similarity b/w x and z using a more general function of x and z (here, the square of their dot product)
- Expanding the definition,

  k(x, z) = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = φ(x)^T φ(z),  where φ(x) = [x_1^2, √2 x_1 x_2, x_2^2]

- Thus the kernel function implicitly defined a feature mapping φ such that k(x, z) is the dot product similarity in the new feature space defined by the mapping φ
- Didn’t need to compute φ(x) or φ(z) explicitly; just using the definition of the kernel implicitly gave us this mapping for each input
- Remember that a kernel does two things: maps the data implicitly into a new feature space (feature transformation) and computes pairwise similarity between any two inputs under the new feature representation
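
A minimal numerical check of this identity (an addition, not from the slides; the particular input values are arbitrary):

import numpy as np

def quadratic_kernel(x, z):
    # k(x, z) = (x^T z)^2, computed directly from the original inputs
    return float(np.dot(x, z)) ** 2

def phi(x):
    # The (otherwise implicit) feature map for the 2-D quadratic kernel
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])   # arbitrary 2-D inputs for the check
z = np.array([3.0, -1.0])

print(quadratic_kernel(x, z))         # (1*3 + 2*(-1))^2 = 1.0
print(float(np.dot(phi(x), phi(z))))  # the same value, via the explicit map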

Kernel Functions
- As we saw, the kernel function k(x, z) = (x^T z)^2 implicitly defines a feature mapping φ such that, for a two-dim x, φ(x) = [x_1^2, √2 x_1 x_2, x_2^2]
- Every kernel function k implicitly defines a feature mapping φ
- φ takes an input x ∈ X (e.g., R^D) and maps it to a new “feature space” F
- The kernel function k(x, z) can be seen as taking two points as inputs and computing their inner-product based similarity in the F space: k(x, z) = φ(x)^T φ(z)
- For some kernels, as we will see shortly, φ (and thus the new feature space F) can be very high-dimensional or even infinite dimensional (but we don’t need to compute it anyway, so it is not an issue)
- F needs to be a vector space with a dot product defined on it (a.k.a. a Hilbert space)
- Is any function k(x, z) of two inputs a kernel function?
- No. The function must satisfy Mercer’s Condition

Kernel Functions
- For k(.,.) to be a kernel function, k must define a dot product for some Hilbert Space F
- The above is true if k is a symmetric and positive semi-definite (p.s.d.) function (though there are exceptions; there are also “indefinite” kernels):

  k(x, z) = k(z, x)

  ∬ f(x) k(x, z) f(z) dx dz ≥ 0   for all “square integrable” functions f (such functions satisfy ∫ f(x)^2 dx < ∞)

- Loosely speaking, a p.s.d. function here means that if we evaluate this function for N inputs (all pairs), then the resulting N x N matrix (also called a kernel matrix) will be PSD
- The above condition is essentially known as Mercer’s Condition
- Let k_1, k_2 be two kernel functions; then the following are kernel functions as well (can easily verify that Mercer’s Condition holds)
  - k(x, z) = k_1(x, z) + k_2(x, z): simple sum
  - k(x, z) = α k_1(x, z): scalar product (α > 0)
- Can also combine these rules and the resulting function will also be a kernel function
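
A small sketch of checking these properties in matrix form (an addition, not from the slides): build a kernel matrix from a sum and scaling of two standard kernels on random inputs, then verify symmetry and non-negative eigenvalues. The data, the kernel choices, and the scaling constant 2.0 are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))            # 20 random inputs in 3 dimensions

def linear_kernel_matrix(X):
    return X @ X.T                          # entries k1(x, z) = x^T z

def rbf_kernel_matrix(X, gamma=0.5):
    sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    return np.exp(-gamma * sq_dists)        # entries k2(x, z) = exp(-gamma ||x - z||^2)

# Simple sum and scalar product rules: still a valid kernel
K = linear_kernel_matrix(X) + 2.0 * rbf_kernel_matrix(X)

# Mercer's condition in matrix form: symmetric with non-negative eigenvalues
print(np.allclose(K, K.T))
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # tolerance for floating-point round-off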

Some Pre-defined Kernel Functions
- Remember that kernels are a notion of similarity between pairs of inputs. Kernels can have a pre-defined form or can be learned from data (a bit advanced for this course). Several other kernels have been proposed for non-vector data, such as trees, strings, etc.
- Linear kernel: k(x, z) = x^T z
- Quadratic Kernel: k(x, z) = (x^T z)^2 or (1 + x^T z)^2
- Polynomial Kernel (of degree d): k(x, z) = (x^T z)^d or (1 + x^T z)^d
- Radial Basis Function (RBF) or “Gaussian” Kernel: k(x, z) = exp(-γ ||x - z||^2)
  - Gaussian kernel gives a similarity score between 0 and 1
  - γ is a hyperparameter (called the kernel bandwidth parameter); it controls how the distance between two inputs should be converted into a similarity
  - The RBF kernel corresponds to an infinite dim. feature space (i.e., you can’t actually write down or store the map φ explicitly – but we don’t need to do that anyway)
  - Also called a “stationary kernel”: it only depends on the distance between x and z (translating both by the same amount won’t change the value of k(x, z))
- Kernel hyperparameters (e.g., γ) can be set via cross-validation
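
For concreteness (an addition, not from the slides), these pre-defined kernels written as plain functions; the default parameter values and the test inputs are arbitrary:

import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def polynomial_kernel(x, z, d=3, c=1.0):
    # c = 0 gives the homogeneous form (x^T z)^d; c = 1 gives (1 + x^T z)^d
    return (c + np.dot(x, z)) ** d

def rbf_kernel(x, z, gamma=1.0):
    # similarity in (0, 1]; gamma is the bandwidth hyperparameter
    return np.exp(-gamma * np.sum((x - z) ** 2))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z, d=2), rbf_kernel(x, z, gamma=0.1))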

RBF Kernel = Infinite Dimensional Mapping
- We saw that the RBF/Gaussian kernel is defined as k(x, z) = exp(-γ ||x - z||^2)
- Using this kernel corresponds to mapping the data to an infinite dimensional space

  k(x, z) = exp[-(x - z)^2]   (assuming γ = 1 and x, z to be scalars)
          = exp(-x^2) exp(-z^2) exp(2xz)
          = exp(-x^2) exp(-z^2) Σ_{k=0}^{∞} (2^k x^k z^k) / k!
          = φ(x)^T φ(z)

- Thus φ(x) is an infinite-dim vector (ignoring the constants coming from the exp(-x^2) and exp(-z^2) terms): here φ(x) = [1, √(2/1!) x, √(2^2/2!) x^2, √(2^3/3!) x^3, ...], and similarly for φ(z)
- But again, note that we never need to compute φ(x) to compute k(x, z)
- k(x, z) is easily computable from its definition itself (exp[-(x - z)^2] in this case)
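
A quick numerical sanity check (an addition, not from the slides) that truncating this infinite expansion approximates the exact RBF kernel value; the truncation order and the scalar inputs are arbitrary:

import numpy as np
from math import factorial

def rbf_exact(x, z):
    return np.exp(-(x - z) ** 2)   # gamma = 1, scalar inputs

def phi_truncated(x, order=20):
    # First `order` coordinates of the infinite feature map, including the
    # exp(-x^2) constant so that the dot product matches k(x, z)
    return np.array([np.exp(-x**2) * np.sqrt(2**k / factorial(k)) * x**k
                     for k in range(order)])

x, z = 0.8, 0.3
print(rbf_exact(x, z))                                     # exact kernel value
print(float(np.dot(phi_truncated(x), phi_truncated(z))))   # converges to it as order grows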

Kernel Matrix
- Kernel based ML algos work with kernel matrices rather than feature vectors
- Given N inputs, the kernel function k can be used to construct a Kernel Matrix K
- The kernel matrix K is of size N x N with each entry defined as K_ij = k(x_i, x_j) = φ(x_i)^T φ(x_j)
- Note again that we don’t need to compute φ and this dot product explicitly
- K_ij: similarity between the i-th and j-th inputs in the kernel induced feature space
- K is a symmetric and positive semi-definite matrix; also, all eigenvalues of K are non-negative
  [Figure: the N x D feature matrix of the inputs alongside the corresponding N x N kernel matrix]
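
A sketch of building such a kernel matrix entry by entry (an addition, not from the slides; the random data and the choice of an RBF kernel are assumptions):

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 2))            # N = 5 inputs, D = 2 features

def k_rbf(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

N = X.shape[0]
K = np.array([[k_rbf(X[i], X[j]) for j in range(N)] for i in range(N)])

print(K.shape)                          # (5, 5): one similarity per pair of inputs
print(np.allclose(K, K.T))              # K is symmetric
print(np.linalg.eigvalsh(K) >= -1e-10)  # eigenvalues non-negative (up to round-off)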

Coming up next..
- Applying kernel methods for SVM and ridge regression
