
10-601 Introduction to Machine Learning

Machine Learning Department


School of Computer Science
Carnegie Mellon University

Support Vector Machines


+
Kernels
Matt Gormley
Lecture 27
Apr. 22, 2020

1
Reminders
Homework 8: Reinforcement Learning
Out: Fri, Apr 10
Due: Wed, Apr 22 at 11:59pm
Homework 9: Learning Paradigms
Out: Wed, Apr. 22
Due: Wed, Apr. 29 at 11:59pm
Can only be submitted up to 3 days late, so we can return grades before the final exam

Today’s In-Class Poll


http://poll.mlcourse.org
2
CONSTRAINED OPTIMIZATION

7
Constrained Optimization

8
Quadratic Program

9
SVM: Optimization Background
Whiteboard
Constrained Optimization
Linear programming
Quadratic programming
Example: 2D quadratic function with linear constraints

10
Quadratic Program
[Slides 11–15: the quadratic program example, presented as equations and figures.]
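To make the quadratic-program idea concrete, here is a minimal sketch of a 2D quadratic objective minimized under linear constraints with an off-the-shelf solver. The numbers and the use of the cvxpy package are illustrative assumptions, not taken from the slides.

```python
# Minimize a 2D quadratic objective subject to linear constraints.
# Illustrative example only; any off-the-shelf QP solver would do.
import cvxpy as cp
import numpy as np

Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])      # positive definite quadratic term
c = np.array([-1.0, -2.0])      # linear term

x = cp.Variable(2)
objective = cp.Minimize(0.5 * cp.quad_form(x, Q) + c @ x)
constraints = [x >= 0,          # linear inequality constraints
               x[0] + x[1] <= 1]
prob = cp.Problem(objective, constraints)
prob.solve()

print("optimal x:", x.value)
print("optimal objective:", prob.value)
```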
SUPPORT VECTOR MACHINE
(SVM)

16
Example: Building Walls

18
https://www.facebook.com/Mondobloxx/
SVM
Whiteboard
SVM Primal (Linearly Separable Case)

This section borrows ideas from Nina Balcan’s SVM lectures at CMU and Patrick Winston’s “widest street” SVM lecture at MIT (https://www.youtube.com/watch?v=_PwhiWxHK8o).
19
SVM QP
[Slides 20–25: the SVM optimization problem written as a quadratic program, presented as equations and figures.]
Support Vector Machines (SVMs)
Hard margin SVM (Primal) Hard margin SVM (Lagrangian Dual)

Instead of minimizing the primal, we can maximize the dual problem.
For the SVM, these two problems give the same answer (i.e., the minimum of one is the maximum of the other).
Definition: support vectors are those points x^(i) for which α^(i) > 0.
26
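The equations on this slide were not captured in the export; for reference, the standard hard-margin primal and Lagrangian dual, in the notation used above (weights w, bias b, multipliers α^(i)), are:

```latex
% Hard-margin SVM (primal)
\min_{\mathbf{w},\, b}\ \tfrac{1}{2}\|\mathbf{w}\|_2^2
\quad \text{s.t.} \quad y^{(i)}\big(\mathbf{w}^T\mathbf{x}^{(i)} + b\big) \ge 1, \qquad i = 1, \dots, N

% Hard-margin SVM (Lagrangian dual)
\max_{\boldsymbol{\alpha} \ge 0}\ \sum_{i=1}^{N} \alpha^{(i)}
  - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \big(\mathbf{x}^{(i)}\big)^T \mathbf{x}^{(j)}
\quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha^{(i)} y^{(i)} = 0
```

The primal solution is recovered as w = Σ_i α^(i) y^(i) x^(i), and the support vectors are exactly the training points with α^(i) > 0.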
METHOD OF LAGRANGE
MULTIPLIERS

27
Method of Lagrange Multipliers
[Slides 28–35: equations and figures illustrating the method of Lagrange multipliers.]
Figures from http://tutorial.math.lamar.edu/Classes/CalcIII/LagrangeMultipliers.aspx
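A small worked example of the method, added here for reference (it is not the example from the whiteboard or the linked figures): maximize f(x, y) = xy subject to the constraint x + y = 1.

```latex
% Lagrangian
\mathcal{L}(x, y, \lambda) = xy - \lambda\,(x + y - 1)

% Stationarity conditions
\frac{\partial \mathcal{L}}{\partial x} = y - \lambda = 0, \qquad
\frac{\partial \mathcal{L}}{\partial y} = x - \lambda = 0, \qquad
\frac{\partial \mathcal{L}}{\partial \lambda} = -(x + y - 1) = 0

% Solving: x = y = \lambda, so 2\lambda = 1, giving x = y = 1/2 and f(x, y) = 1/4.
```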
SVM DUAL

36
Method of Lagrange Multipliers
Whiteboard
Lagrangian Duality
Example: SVM Dual

This section borrows ideas from Nina Balcan’s SVM lectures at CMU and Patrick Winston’s “widest street” SVM lecture at MIT (https://www.youtube.com/watch?v=_PwhiWxHK8o).
37
Support Vector Machines (SVMs)
Hard margin SVM (Primal) Hard margin SVM (Lagrangian Dual)

Instead of minimizing the primal, we can maximize the dual problem.
For the SVM, these two problems give the same answer (i.e., the minimum of one is the maximum of the other).
Definition: support vectors are those points x^(i) for which α^(i) > 0.
39
SVM EXTENSIONS

42
Soft Margin SVM
Hard margin SVM (Primal)
Question: If the dataset is not linearly separable, can we still use an SVM?
Answer: Not the hard margin version. It will never find a feasible solution.
Soft margin SVM (Primal)
In the soft margin version, we add “slack variables” that allow some points to violate the large margin constraints.
The constant C dictates how large we should allow the slack variables to be.
43
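For reference, since the slide's equations did not survive the export, the standard soft-margin primal with slack variables ξ^(i) and trade-off constant C is:

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}}\ \tfrac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_{i=1}^{N} \xi^{(i)}
\quad \text{s.t.} \quad
y^{(i)}\big(\mathbf{w}^T\mathbf{x}^{(i)} + b\big) \ge 1 - \xi^{(i)}, \qquad
\xi^{(i)} \ge 0, \qquad i = 1, \dots, N
```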
Soft Margin SVM
Hard margin SVM (Primal)

Soft margin SVM (Primal)

44
Soft Margin SVM
Hard margin SVM (Primal) Hard margin SVM (Lagrangian Dual)

Soft margin SVM (Primal) Soft margin SVM (Lagrangian Dual)

We can also work with the dual of the soft margin SVM
45
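A minimal sketch of what working with the dual looks like in code, using a black-box QP solver as suggested later in the takeaways. This is illustrative only: it assumes numpy and cvxpy are available, and the only change from the hard-margin dual is the box constraint 0 ≤ α^(i) ≤ C.

```python
# Soft-margin SVM dual via a generic QP solver (illustrative sketch).
import numpy as np
import cvxpy as cp

def svm_dual_fit(X, y, C=1.0):
    """X: (N, D) features, y: (N,) labels in {-1, +1}."""
    N = X.shape[0]
    Xy = X * y[:, None]                     # each row scaled by its label
    alpha = cp.Variable(N)
    # Dual objective: sum(alpha) - 1/2 * || sum_i alpha_i y_i x_i ||^2
    objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(Xy.T @ alpha))
    constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
    cp.Problem(objective, constraints).solve()

    a = alpha.value
    w = Xy.T @ a                            # recover primal weights
    sv = (a > 1e-6) & (a < C - 1e-6)        # margin support vectors
    # bias from margin SVs (assumes at least one 0 < alpha < C)
    b = np.mean(y[sv] - X[sv] @ w)
    return w, b, a

# Usage: w, b, a = svm_dual_fit(X_train, y_train, C=1.0)
# Predictions: np.sign(X_test @ w + b)
```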
Multiclass SVMs
The SVM is inherently a binary classification method, but it can be extended to handle K-class classification in many ways.
1. One-vs-rest (sketched in code below):
  build K binary classifiers
  train the k-th classifier to predict whether an instance has label k or something else
  predict the class with the largest score
2. One-vs-one:
  build (K choose 2) binary classifiers
  train one classifier to distinguish between each pair of labels
  predict the class with the most “votes” from any given classifier
46
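A minimal one-vs-rest sketch, as referenced above. It is illustrative only: it assumes scikit-learn is installed, uses its SVC as the binary classifier, and the helper names are made up for this example.

```python
# One-vs-rest multiclass SVM: K binary classifiers, predict the largest score.
import numpy as np
from sklearn.svm import SVC

def ovr_fit(X, y):
    """Train one binary SVM per class label."""
    classifiers = {}
    for k in np.unique(y):
        clf = SVC(kernel="linear")
        clf.fit(X, (y == k).astype(int))   # label k vs. everything else
        classifiers[k] = clf
    return classifiers

def ovr_predict(classifiers, X):
    labels = list(classifiers.keys())
    # decision_function gives a signed score: larger = more confident "class k"
    scores = np.column_stack([classifiers[k].decision_function(X) for k in labels])
    return np.array(labels)[np.argmax(scores, axis=1)]
```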
Learning Objectives
Support Vector Machines
You should be able to…
1. Motivate the learning of a decision boundary with large margin
2. Compare the decision boundary learned by SVM with that of
Perceptron
3. Distinguish unconstrained and constrained optimization
4. Compare linear and quadratic mathematical programs
5. Derive the hard margin SVM primal formulation
6. Derive the Lagrangian dual for a hard margin SVM
7. Describe the mathematical properties of support vectors and provide
an intuitive explanation of their role
8. Draw a picture of the weight vector, bias, decision boundary, training
examples, support vectors, and margin of an SVM
9. Employ slack variables to obtain the soft margin SVM
10. Implement an SVM learner using a black box quadratic programming
(QP) solver

52
KERNELS

53
Kernels: Motivation
Most real-world problems exhibit data that is not linearly separable.
Example: pixel representation for Facial Recognition:
Q: When your data is not linearly separable, how can you still use a linear classifier?
A: Preprocess the data to produce nonlinear features
54
Kernels: Motivation
Motivation #1: Inefficient Features
Non-linearly separable data requires a high-dimensional representation
Might be prohibitively expensive to compute or store
Motivation #2: Memory-based Methods
k-Nearest Neighbors (KNN) for facial recognition allows a distance metric between images; no need to worry about the linearity restriction at all

55
Kernel Methods
Key idea:
1. Rewrite the algorithm so that we only work with dot products x^T z of feature vectors
2. Replace the dot products x^T z with a kernel function k(x, z)

The kernel k(x, z) can be any legal definition of a dot product:
k(x, z) = φ(x)^T φ(z) for any function φ : R^M → R^D
So we only compute the dot product implicitly.

This “kernel trick” can be applied to many algorithms:
classification: perceptron, SVM, …
regression: ridge regression, …
clustering: k-means, …

57
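As a concrete instance of steps 1 and 2 above, here is a sketch of a kernelized perceptron, where every dot product x^T z is replaced by a kernel evaluation k(x, z). The code is an illustrative sketch, not taken from the lecture, and the function names are made up.

```python
# Kernel perceptron: the weight vector is never formed explicitly;
# predictions use only kernel evaluations k(x_i, x).
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_perceptron_train(X, y, kernel, epochs=10):
    """X: (N, D), y: (N,) in {-1, +1}. Returns per-example mistake counts alpha."""
    N = X.shape[0]
    alpha = np.zeros(N)
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    for _ in range(epochs):
        for i in range(N):
            # f(x_i) = sum_j alpha_j y_j k(x_j, x_i)
            if y[i] * np.sum(alpha * y * K[:, i]) <= 0:
                alpha[i] += 1.0          # mistake: "add" this example
    return alpha

def kernel_perceptron_predict(X_train, y_train, alpha, kernel, x):
    score = sum(a * yj * kernel(xj, x)
                for a, yj, xj in zip(alpha, y_train, X_train))
    return 1 if score > 0 else -1
```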
SVM: Kernel Trick
Hard margin SVM (Primal) Hard margin SVM (Lagrangian Dual)

Suppose we do some feature engineering.
Our feature function is φ.
We apply φ to each input vector x.

58
SVM: Kernel Trick
Hard margin SVM (Lagrangian Dual)

We could replace the dot product of the two feature vectors in the transformed space with a function k(x, z).

59
SVM: Kernel Trick
Hard margin SVM (Lagrangian Dual)

We could replace the dot product of the two feature vectors in the transformed space with a function k(x, z).

60
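Written out (the slide's equations were not captured), this substitution turns the hard-margin dual and its decision rule into the standard kernelized form:

```latex
\max_{\boldsymbol{\alpha} \ge 0}\ \sum_{i=1}^{N} \alpha^{(i)}
  - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)}\, k\big(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\big)
\quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha^{(i)} y^{(i)} = 0

% Prediction on a new point x uses only kernel evaluations:
h(\mathbf{x}) = \mathrm{sign}\Big( \sum_{i=1}^{N} \alpha^{(i)} y^{(i)} k\big(\mathbf{x}^{(i)}, \mathbf{x}\big) + b \Big)
```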
Kernel Methods
Key idea:
1. Rewrite the algorithm so that we only work with dot products x^T z of feature vectors
2. Replace the dot products x^T z with a kernel function k(x, z)

The kernel k(x, z) can be any legal definition of a dot product:
k(x, z) = φ(x)^T φ(z) for any function φ : R^M → R^D
So we only compute the dot product implicitly.

This “kernel trick” can be applied to many algorithms:
classification: perceptron, SVM, …
regression: ridge regression, …
clustering: k-means, …

61
Kernel Methods
Q: These are just non-linear features, right?
A: Yes, but…

Q: Can’t we just compute the feature transformation explicitly?
A: That depends...

Q: So, why all the hype about the kernel trick?
A: Because the explicit features might be prohibitively expensive to compute, or might be infinite-length vectors
62
Example: Polynomial Kernel

63
Slide from Nina Balcan
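A small worked instance of a polynomial kernel, added for reference (degree d = 2 with 2-dimensional inputs; this is the standard expansion, not necessarily the one on the slide):

```latex
k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T \mathbf{z})^2
  = (x_1 z_1 + x_2 z_2)^2
  = x_1^2 z_1^2 + 2\, x_1 x_2\, z_1 z_2 + x_2^2 z_2^2
  = \phi(\mathbf{x})^T \phi(\mathbf{z}),
\quad \text{where} \quad
\phi(\mathbf{x}) = \big( x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2 \big)
```

Evaluating k costs O(D) time, while the implicit feature space grows roughly like D^d monomial coordinates.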
Kernel Examples
Name                          Kernel Function (implicit dot product)     Feature Space (explicit dot product)
Linear                        k(x, z) = x^T z                            Same as the original input space
Polynomial (v1)               k(x, z) = (x^T z)^d                        All polynomials of degree d
Polynomial (v2)               k(x, z) = (x^T z + 1)^d                    All polynomials up to degree d
Gaussian                      k(x, z) = exp(-||x - z||^2 / (2σ^2))       Infinite-dimensional space
Hyperbolic Tangent (Sigmoid)  k(x, z) = tanh(γ x^T z + r)                (With SVM, this is equivalent to a 2-layer neural network)
66
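The kernels in the table are simple to implement directly; a short sketch follows. The parameter names (d, σ, γ, r) are the usual conventions and are assumptions here, not notation from the slide.

```python
# Common kernel functions from the table above.
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel_v1(x, z, d=3):
    return (x @ z) ** d                     # polynomials of exactly degree d

def polynomial_kernel_v2(x, z, d=3):
    return (x @ z + 1.0) ** d               # all polynomials up to degree d

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, z, gamma=0.1, r=0.0):
    return np.tanh(gamma * (x @ z) + r)
```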
RBF Kernel Example
RBF Kernel: k(x, z) = exp(-||x - z||^2 / (2σ^2))
[Slides 67–78: a sequence of figures illustrating the RBF kernel.]

RBF Kernel Example: KNN vs. SVM
[Slides 79–82: figures comparing KNN and SVM under the RBF kernel.]
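A minimal sketch of the kind of comparison these slides illustrate: fit a KNN classifier and an RBF-kernel SVM on the same 2D data and compare their decision regions. The dataset, hyperparameters, and use of scikit-learn are illustrative assumptions, not the lecture's setup.

```python
# KNN vs. SVM with an RBF kernel on a toy 2D dataset.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
svm = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)

# Evaluate both decision rules on a grid to compare decision regions.
xx, yy = np.meshgrid(np.linspace(-2, 3, 200), np.linspace(-1.5, 2, 200))
grid = np.c_[xx.ravel(), yy.ravel()]
knn_regions = knn.predict(grid).reshape(xx.shape)
svm_regions = svm.predict(grid).reshape(xx.shape)

print("training accuracy, KNN:", knn.score(X, y))
print("training accuracy, SVM:", svm.score(X, y))
print("fraction of grid where the two classifiers disagree:",
      np.mean(knn_regions != svm_regions))
```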
Kernel Methods
Key idea:
1. Rewrite the algorithm so that we only work with dot products x^T z of feature vectors
2. Replace the dot products x^T z with a kernel function k(x, z)

The kernel k(x, z) can be any legal definition of a dot product:
k(x, z) = φ(x)^T φ(z) for any function φ : R^M → R^D
So we only compute the dot product implicitly.

This “kernel trick” can be applied to many algorithms:
classification: perceptron, SVM, …
regression: ridge regression, …
clustering: k-means, …

83
SVM + Kernels: Takeaways
Maximizing the margin of a linear separator is a good training criterion
Support Vector Machines (SVMs) learn a max-margin linear classifier
The SVM optimization problem can be solved with black-box Quadratic Programming (QP) solvers
The learned decision boundary is defined by its support vectors
Kernel methods allow us to work in a transformed feature space without explicitly representing that space
The kernel trick can be applied to SVMs, as well as many other algorithms

86
Learning Objectives
Kernels
You should be able to…
1. Employ the kernel trick in common learning
algorithms
2. Explain why the use of a kernel produces only
an implicit representation of the transformed
feature space
3. Use the "kernel trick" to obtain a
computational complexity advantage over
explicit feature transformation
4. Sketch the decision boundaries of a linear
classifier with an RBF kernel
87
