Support Vector Machines
Reminders
Homework 8: Reinforcement Learning
Out: Fri, Apr 10
Due: Wed, Apr 22 at 11:59pm
Homework 9: Learning Paradigms
Out: Wed, Apr 22
Due: Wed, Apr 29 at 11:59pm
Can only be submitted up to 3 days late, so we can return grades before the final exam.
Constrained Optimization
Quadratic Program
SVM: Optimization Background
Whiteboard
Constrained Optimization
Linear programming
Quadratic programming
Example: 2D quadratic function with linear constraints (see the sketch below)
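As a concrete instance of the whiteboard example, here is a minimal sketch of handing a 2D quadratic objective with a linear constraint to an off-the-shelf QP solver. The cvxopt package and the specific objective/constraint are assumptions, not taken from the slides.

```python
# Minimal sketch: minimize (1/2)(x1^2 + x2^2) subject to x1 + x2 >= 1.
# cvxopt.solvers.qp solves  min (1/2) x^T P x + q^T x  s.t.  G x <= h.
import numpy as np
from cvxopt import matrix, solvers

P = matrix(np.eye(2))                       # quadratic term
q = matrix(np.zeros((2, 1)))                # linear term (none here)
G = matrix(np.array([[-1.0, -1.0]]))        # -x1 - x2 <= -1  encodes  x1 + x2 >= 1
h = matrix(np.array([[-1.0]]))

solvers.options['show_progress'] = False
sol = solvers.qp(P, q, G, h)
print(np.array(sol['x']).ravel())           # approximately [0.5, 0.5]
```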
SUPPORT VECTOR MACHINE (SVM)
Example: Building Walls
https://www.facebook.com/Mondobloxx/
SVM
Whiteboard
SVM Primal (Linearly Separable Case)
This section borrows ideas from Nina Balcan’s SVM lectures at CMU and Patrick Winston’s “widest street” SVM lecture at MIT (https://www.youtube.com/watch?v=_PwhiWxHK8o).
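For reference, the hard margin primal for the linearly separable case (a standard statement of what the whiteboard derives; the notation x^{(i)}, y^{(i)} ∈ {−1, +1} is an assumption, since the slide formulas are images):

```latex
\begin{aligned}
\min_{\mathbf{w},\, b} \quad & \frac{1}{2}\,\lVert \mathbf{w} \rVert_2^2 \\
\text{s.t.} \quad & y^{(i)}\bigl(\mathbf{w}^{\mathsf{T}}\mathbf{x}^{(i)} + b\bigr) \ge 1, \qquad i = 1, \dots, N
\end{aligned}
```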
SVM QP
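A minimal sketch of posing this QP to a black-box solver (learning objective 10 below). The cvxopt package and the tiny hand-made linearly separable dataset are assumptions; neither appears on the slides.

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy data, linearly separable by construction.
X = np.array([[2.0, 2.0], [1.5, 1.0], [0.0, 0.5], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N, d = X.shape

# Variable u = [w; b].  Objective (1/2) w^T w: ones on the w block of P, ~0 for b.
P = np.zeros((d + 1, d + 1))
P[:d, :d] = np.eye(d)
P[d, d] = 1e-8                  # tiny ridge on b so the solver's KKT system stays nonsingular (numerical convenience)
q = np.zeros((d + 1, 1))

# Constraints y_i (w^T x_i + b) >= 1, rewritten as G u <= h.
G = -y[:, None] * np.hstack([X, np.ones((N, 1))])
h = -np.ones((N, 1))

solvers.options['show_progress'] = False
sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
u = np.array(sol['x']).ravel()
w, b = u[:d], u[d]
print("w =", w, "b =", b)
print("margins:", y * (X @ w + b))   # all >= 1 up to numerical tolerance
```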
Support Vector Machines (SVMs)
Hard margin SVM (Primal) Hard margin SVM (Lagrangian Dual)
Method of Lagrange Multipliers
Figure from http://tutorial.math.lamar.edu/Classes/CalcIII/LagrangeMultipliers.aspx
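A standard textbook-style example of the method (not taken from the slides): maximize f(x, y) = xy subject to x + y = 1. Form the Lagrangian and set its gradient to zero.

```latex
\mathcal{L}(x, y, \lambda) = xy - \lambda\,(x + y - 1),
\qquad
\nabla \mathcal{L} = 0 \;\Rightarrow\;
y = \lambda,\;\; x = \lambda,\;\; x + y = 1
\;\Rightarrow\; x = y = \lambda = \tfrac{1}{2}.
```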
SVM DUAL
Method of Lagrange Multipliers
Whiteboard
Lagrangian Duality
Example: SVM Dual
This section borrows ideas from Nina Balcan’s SVM lectures at CMU and Patrick Winston’s “widest street” SVM lecture at MIT (https://www.youtube.com/watch?v=_PwhiWxHK8o).
Support Vector Machines (SVMs)
Hard margin SVM (Primal) Hard margin SVM (Lagrangian Dual)
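The formulas on this slide did not extract; the standard hard margin Lagrangian dual (same notation as the primal above) is:

```latex
\begin{aligned}
\max_{\boldsymbol{\alpha}} \quad & \sum_{i=1}^{N} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j\, y^{(i)} y^{(j)}\, \mathbf{x}^{(i)\mathsf{T}} \mathbf{x}^{(j)} \\
\text{s.t.} \quad & \alpha_i \ge 0, \;\; i = 1, \dots, N, \qquad \sum_{i=1}^{N} \alpha_i y^{(i)} = 0
\end{aligned}
```

At the optimum the weight vector is a weighted sum of the training points, w = \sum_i \alpha_i y^{(i)} x^{(i)}, and complementary slackness forces \alpha_i = 0 for every point not on the margin; the points with \alpha_i > 0 are the support vectors.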
Soft Margin SVM
Hard margin SVM (Primal)
Question: If the dataset is not linearly separable, can we still use an SVM?
Answer: Not the hard margin version. It will never find a feasible solution.
Soft margin SVM (Primal)
In the soft margin version, we add “slack variables” that allow some points to violate the large margin constraints. The constant C dictates how large we should allow the slack variables to be.
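With slack variables ξ_i ≥ 0, the soft margin primal is as follows (a standard statement; the slide's formula is an image):

```latex
\begin{aligned}
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \quad & \frac{1}{2}\,\lVert \mathbf{w} \rVert_2^2 + C \sum_{i=1}^{N} \xi_i \\
\text{s.t.} \quad & y^{(i)}\bigl(\mathbf{w}^{\mathsf{T}}\mathbf{x}^{(i)} + b\bigr) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, N
\end{aligned}
```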
Soft Margin SVM
Hard margin SVM (Primal)
Soft Margin SVM
Hard margin SVM (Primal) Hard margin SVM (Lagrangian Dual)
We can also work with the dual of the soft margin SVM.
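The soft margin dual has the same form as the hard margin dual, except that each α_i is now box-constrained by C (a standard result, stated here because the slide formula is an image):

```latex
\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y^{(i)} y^{(j)}\, \mathbf{x}^{(i)\mathsf{T}} \mathbf{x}^{(j)}
\quad \text{s.t.} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{N} \alpha_i y^{(i)} = 0
```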
Multiclass SVMs
The SVM is inherently a binary classification method, but it can be extended to handle K-class classification in many ways (see the sketch below).
1. one vs. rest:
build K binary classifiers
train the kth classifier to predict whether an instance has label k or something else
predict the class with the largest score
2. one vs. one:
build (K choose 2) binary classifiers
train one classifier to distinguish between each pair of labels
predict the class with the most “votes” across the classifiers
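A minimal sketch of both reductions; scikit-learn and the iris dataset are assumptions, not mentioned on the slide.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)            # K = 3 classes

# 1. one vs. rest: K binary SVMs, one per class; predict the highest-scoring class.
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)

# 2. one vs. one: K-choose-2 binary SVMs, one per pair; predict the class with the most votes.
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)

print(ovr.predict(X[:5]))
print(ovo.predict(X[:5]))
```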
Learning Objectives
Support Vector Machines
You should be able to…
1. Motivate the learning of a decision boundary with large margin
2. Compare the decision boundary learned by SVM with that of Perceptron
3. Distinguish unconstrained and constrained optimization
4. Compare linear and quadratic mathematical programs
5. Derive the hard margin SVM primal formulation
6. Derive the Lagrangian dual for a hard margin SVM
7. Describe the mathematical properties of support vectors and provide an intuitive explanation of their role
8. Draw a picture of the weight vector, bias, decision boundary, training examples, support vectors, and margin of an SVM
9. Employ slack variables to obtain the soft margin SVM
10. Implement an SVM learner using a black box quadratic programming (QP) solver
KERNELS
Kernels: Motivation
Most real-world problems exhibit data that is not linearly separable.
Example: pixel representation for facial recognition
Kernel Methods
Key idea:
1. Rewrite the algorithm so that we only work with dot products xᵀz of feature vectors
2. Replace the dot products xᵀz with a kernel function k(x, z)
SVM: Kernel Trick
Hard margin SVM (Primal) Hard margin SVM (Lagrangian Dual)
Suppose we do some feature engineering: we define a feature function and apply it to each input vector x.
SVM: Kernel Trick
Hard margin SVM (Lagrangian Dual)
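In the dual, the training inputs appear only through inner products, so applying the feature map and then the kernel trick replaces x^{(i)T} x^{(j)} with k(x^{(i)}, x^{(j)}) without ever computing the feature map explicitly. The resulting kernelized dual (a standard form; the slide formulas are images) is:

```latex
\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y^{(i)} y^{(j)}\, k\!\bigl(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\bigr)
\quad \text{s.t.} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{N} \alpha_i y^{(i)} = 0
```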
Kernel Methods
Key idea:
1. Rewrite the algorithm so that we only work with dot products xᵀz of feature vectors
2. Replace the dot products xᵀz with a kernel function k(x, z)
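A minimal sketch of these two steps for the SVM decision rule, assuming the dual variables α and bias b have already been obtained from a QP solve as sketched earlier; the RBF kernel choice and the toy numbers are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # k(x, z) = exp(-gamma * ||x - z||^2); only kernel values are ever needed.
    return np.exp(-gamma * np.sum((x - z) ** 2))

def predict(x_new, X_train, y_train, alpha, b, kernel=rbf_kernel):
    # The decision rule touches the data only through kernel evaluations.
    score = sum(a * y_i * kernel(x_i, x_new)
                for a, y_i, x_i in zip(alpha, y_train, X_train)) + b
    return 1 if score >= 0 else -1

# Illustrative (made-up) values standing in for the output of a dual QP solve.
X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
y_train = np.array([-1.0, 1.0])
alpha   = np.array([0.5, 0.5])
b       = 0.0
print(predict(np.array([0.9, 1.2]), X_train, y_train, alpha, b))   # prints 1
```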
Kernel Methods
Q: These are just non-linear features, right?
A: Yes, but…
Slide from Nina Balcan
Kernel Examples
Name | Kernel Function (implicit dot product) | Feature Space (explicit dot product)
RBF Kernel:
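The table entries themselves are images; for reference, the RBF (Gaussian) kernel is

```latex
k(\mathbf{x}, \mathbf{z}) = \exp\!\left(-\gamma\, \lVert \mathbf{x} - \mathbf{z} \rVert_2^2\right), \qquad \gamma > 0,
```

and its explicit feature space is infinite-dimensional, which is exactly why the implicit (kernel) form is used.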
RBF Kernel Example
(series of figures for the RBF kernel, followed by a KNN vs. SVM comparison)
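A minimal sketch of the KNN vs. RBF-kernel SVM comparison on a toy dataset; scikit-learn, the dataset, and the hyperparameter values are assumptions, not from the slides.

```python
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
svm = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)   # gamma controls the kernel width

# Both give non-linear decision boundaries; the SVM's is defined only by its
# support vectors, while KNN stores and consults the entire training set.
print("KNN training accuracy:", knn.score(X, y))
print("RBF-SVM training accuracy:", svm.score(X, y))
print("number of support vectors:", svm.support_vectors_.shape[0])
```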
Kernel Methods
Key idea:
1. Rewrite the algorithm so that we only work with dot products xᵀz of feature vectors
2. Replace the dot products xᵀz with a kernel function k(x, z)
SVM + Kernels: Takeaways
Maximizing the margin of a linear separator is a good training criterion
Support Vector Machines (SVMs) learn a max-margin linear classifier
The SVM optimization problem can be solved with black-box Quadratic Programming (QP) solvers
The learned decision boundary is defined by its support vectors
Kernel methods allow us to work in a transformed feature space without explicitly representing that space
The kernel trick can be applied to SVMs, as well as to many other algorithms
Learning Objectives
Kernels
You should be able to…
1. Employ the kernel trick in common learning algorithms
2. Explain why the use of a kernel produces only an implicit representation of the transformed feature space
3. Use the "kernel trick" to obtain a computational complexity advantage over explicit feature transformation
4. Sketch the decision boundaries of a linear classifier with an RBF kernel