Lecture Slides - Week 12
(CS4613)
Course schedule (topics, weeks and readings):
• Bayesian Inference, Naïve Bayes: Weeks 6-7, Fundamentals of Machine Learning for Predictive Data Analytics, Ch 6
• PCA: Week 8, Hands-On Machine Learning, Ch 8
• Linear Regression: Weeks 9-10, Fundamentals of Machine Learning for Predictive Data Analytics, Ch 7
• SVM: Weeks 11-12, Fundamentals of Machine Learning for Predictive Data Analytics, Ch 7
• ANN: Weeks 13-14, Neural Networks: A Systematic Introduction, Ch 1-4 and 7 (selected topics); Hands-On Machine Learning, Ch 10
• K-Nearest Neighbor: Week 15, Fundamentals of Machine Learning for Predictive Data Analytics, Ch 5; Master Machine Learning Algorithms, Ch 22-23
• K-Means Clustering: Week 16, Data Mining: The Textbook, Ch 6
2
Outline Week 12
• SVM Problem Formulation
• Dual Problem Specification
• Intuition behind SVM
• Example
• Kernel Trick
• Code Example
3
[Figure: separating hyperplane and margin; w0 is denoted b in this figure]
4
SVM Training
• We would like to maximize the margin, subject to the constraint that all points from one class must be on one side of the hyperplane and all points from the other class must be on the other side.
• The constraint is ti·(w·di + w0) ≥ 1 for every training instance (di, ti), with ti ∈ {-1, +1}.
• The optimization criterion is defined in terms of the perpendicular distance from any instance to the decision boundary and is given on the next slide.
6
Optimization Problem
Maximize the margin 2/||w||, OR equivalently minimize ||w||, OR minimize (1/2)·||w||²,
subject to ti·(w·di + w0) ≥ 1 for all i.
If we represent the output as y, the input as x and w0 as b, then we can rewrite the constraint as yi·(w·xi + b) ≥ 1.
Note that these are actually n conditions (constraints), one for each training instance.
Together they state that all the training instances are correctly classified and lie on or outside the margin.
This is a constrained optimization problem and can be solved by the Lagrange multiplier method.
7
Lagrange formulation of SVM
• Minimize L(w, w0, α) = (1/2)·||w||² - Σi αi·[ti·(w·di + w0) - 1] with respect to w and w0, and maximize it with respect to the Lagrange multipliers αi ≥ 0.
• Setting the derivatives with respect to w and w0 to zero gives w = Σi αi·ti·di and Σi αi·ti = 0.
9
The Dual Formulation
• Maximize Σi αi - (1/2)·Σi Σj αi·αj·ti·tj·(di·dj)
• Subject to αi ≥ 0 for all i, and Σi αi·ti = 0
10
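This dual objective follows from substituting w = Σi αi·ti·di and Σi αi·ti = 0 back into the Lagrangian from slide 9; a sketch of that standard substitution step, written in LaTeX notation:

```latex
% Substitute w = \sum_i \alpha_i t_i d_i and \sum_i \alpha_i t_i = 0 into
% L(w, w_0, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i\,[t_i(w \cdot d_i + w_0) - 1]
\begin{align}
L &= \tfrac{1}{2}\Big(\sum_i \alpha_i t_i d_i\Big)\cdot\Big(\sum_j \alpha_j t_j d_j\Big)
     - \sum_i \alpha_i t_i\, d_i \cdot \Big(\sum_j \alpha_j t_j d_j\Big)
     - w_0 \underbrace{\sum_i \alpha_i t_i}_{=\,0} + \sum_i \alpha_i \\
  &= \sum_i \alpha_i - \tfrac{1}{2}\sum_i \sum_j \alpha_i \alpha_j t_i t_j\, (d_i \cdot d_j)
\end{align}
```

The result is exactly the expression being maximized above, and it depends on the training data only through the dot products di·dj.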
Why the Dual Formulation?
• Because it lets us solve the problem by computing just the dot products xi·xj of the training instances, which will be very important later on when we want to solve non-linearly separable classification problems using the kernel trick.
11
SVM Training
• In the training phase, we find a weight (αi) for each training point.
• Points whose weight becomes zero are not support vectors; there will be only a few non-zero αi at the end of training.
• Once the support vectors have been identified, the bias term (w0, or b) can be computed using these support vectors, because for every support vector d
ti·(w·d + w0) = 1
12
SVM Prediction Model
• Once we have all the αi and w0, we can make a prediction for a new query instance q using the sign of
Σi αi·ti·(di·q) + w0
where the sum runs over the support vectors (see the sketch below).
15
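To make this concrete, here is a minimal Python sketch (the toy data set is my own choice, and scikit-learn is assumed) that trains a linear SVM and then reproduces its decision values directly from the learned αi·ti values (dual_coef_), the support vectors and the bias, exactly as in the prediction model above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (illustrative only)
X = np.array([[3, 1], [3, -1], [6, 1], [6, -1],   # positive class
              [1, 0], [0, 1], [0, -1], [-1, 0]])  # negative class
t = np.array([1, 1, 1, 1, -1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)  # very large C approximates a hard margin
clf.fit(X, t)

q = np.array([4.0, 2.0])  # new query instance
# dual_coef_ stores alpha_i * t_i for each support vector
manual = clf.dual_coef_[0] @ (clf.support_vectors_ @ q) + clf.intercept_[0]
print("manual decision value :", manual)
print("sklearn decision value:", clf.decision_function([q])[0])
print("predicted class       :", int(np.sign(manual)))
```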
Example
• Suppose we are given the following positively
labeled data points
16
17
Example Contd..
• There are three support vectors: s1, s2 and s3.
18
Example Contd..
• For a support vector d on the positive-side margin we have w·d + w0 = +1.
• For a support vector d on the negative-side margin we have w·d + w0 = -1.
• We know (from slide 9) that w = Σj αj·yj·sj, where the sj are the support vectors and yj their labels.
Example Contd..
• Substituting w = Σj αj·yj·sj into these conditions, we can write 3 equations for the 3 support vectors (using the dot products s1·s1 = 1, s2·s1 = 3, s3·s1 = 3, etc.):
α1·y1·(s1·s1) + α2·y2·(s2·s1) + α3·y3·(s3·s1) + w0 = -1
α1·y1·(s1·s2) + α2·y2·(s2·s2) + α3·y3·(s3·s2) + w0 = +1
α1·y1·(s1·s3) + α2·y2·(s2·s3) + α3·y3·(s3·s3) + w0 = +1
together with the constraint -α1 + α2 + α3 = 0 (that is, Σi αi·yi = 0).
• Solving this linear system gives (a verification sketch follows slide 25):
• α1 = 0.5
• α2 = 0.25
• α3 = 0.25
• w0 = -2
22
Finding the separating hyperplane
• We know that w = Σi αi·yi·si, so the weight vector follows directly from the support vectors and their α values; the separating hyperplane is then w·x + w0 = 0.
23
24
25
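The numbers above can be verified with the short NumPy sketch below. The support-vector coordinates are an assumption on my part (s1 = (1, 0) with y1 = -1, and s2 = (3, 1), s3 = (3, -1) with y2 = y3 = +1); they are consistent with the dot products quoted earlier (s1·s1 = 1, s2·s1 = 3, s3·s1 = 3) but are not stated explicitly in this extract:

```python
import numpy as np

# Assumed support vectors (consistent with s1.s1 = 1, s2.s1 = 3, s3.s1 = 3)
S = np.array([[1.0, 0.0],    # s1, label -1
              [3.0, 1.0],    # s2, label +1
              [3.0, -1.0]])  # s3, label +1
y = np.array([-1.0, 1.0, 1.0])

G = S @ S.T  # Gram matrix of pairwise dot products s_i . s_j

# Unknowns: [alpha1, alpha2, alpha3, w0]
# Support-vector conditions: sum_j alpha_j * y_j * (s_j . s_i) + w0 = y_i
A = np.hstack([G * y[np.newaxis, :], np.ones((3, 1))])
b = y.copy()

# Extra constraint: sum_i alpha_i * y_i = 0
A = np.vstack([A, np.append(y, 0.0)])
b = np.append(b, 0.0)

alpha1, alpha2, alpha3, w0 = np.linalg.solve(A, b)
print("alphas:", alpha1, alpha2, alpha3)        # 0.5, 0.25, 0.25
print("w0    :", w0)                            # -2.0
w = (np.array([alpha1, alpha2, alpha3]) * y) @ S
print("w     :", w)                             # [1. 0.]  -> hyperplane x1 - 2 = 0
```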
Non-Linear SVM and
the Kernel Trick
26
Data that is not Linearly Separable
• So far we have focused only on data that is linearly separable.
• We have learnt how to find a hyperplane that maximizes the margin to the class boundary.
• However, not all data are linearly separable; see, for example, the data below.
27
Mapping the Data to a Higher Dimension
• For the example data given on the previous slide, if we find a way to map the data from 2-D to 3-D, we will be able to find a decision surface that cleanly separates the different classes (a sketch of this idea follows).
28
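Here is a minimal Python sketch of that idea, using my own illustrative data rather than the figure from the slide: points on two concentric rings are not linearly separable in 2-D, but after adding x1² + x2² as a third feature a simple plane separates them.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

# Two concentric rings: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Explicit mapping to 3-D: (x1, x2) -> (x1, x2, x1^2 + x2^2)
X3 = np.column_stack([X, (X ** 2).sum(axis=1)])

# A linear classifier struggles in 2-D but separates the classes in 3-D
print("2-D accuracy:", LinearSVC(max_iter=10000).fit(X, y).score(X, y))
print("3-D accuracy:", LinearSVC(max_iter=10000).fit(X3, y).score(X3, y))
```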
Example: Mapping to 3-D
29
Example: From 3-D to 9-D
• Original Data (3-D)
30
Problem
• Remember SVM's Dual Problem expression? It depends on the training data only through the dot products di·dj, so if we map the data to a very high-dimensional space and compute those dot products there directly, training becomes expensive.
32
Example: Transformation using a polynomial kernel of degree 2
• For this example we are using a polynomial kernel of degree 2, which looks like (1 + x·y)².
• Assume we have two vectors (data instances), x = [x1 x2]T and y = [y1 y2]T.
• We want to transform x and y to a higher dimension.
• To be able to use the kernel trick, we transform x and y into the higher dimension in a particular way.
• Then their dot product in the higher dimension will be the same as computing (x·y + 1)², where x·y is the dot product in the original 2-D space.
• Hence we will not have to first transform the vectors and then compute the dot product; we simply compute their dot product in the original dimensions, add 1, and then square the answer.
33
Dot product of x and y in 2-D (original dimensions)
x·y = x1·y1 + x2·y2
34
Transforming x and y to 6 dimensions (in a certain way)
Here I am writing the vectors in a row and using the transpose symbol; the transformed vectors are written with capital symbols.
x = [x1 x2]T is transformed to X = [1  x1²  x2²  √2·x1  √2·x2  √2·x1·x2]T
y = [y1 y2]T is transformed to Y = [1  y1²  y2²  √2·y1  √2·y2  √2·y1·y2]T
35
Dot product of the transformed vectors (element-wise multiplication and then summation)
X·Y = 1 + x1²·y1² + x2²·y2² + 2·x1·y1 + 2·x2·y2 + 2·x1·x2·y1·y2
This must be the same as (1 + x·y)². Let's verify:
(1 + x·y)² = (1 + x1·y1 + x2·y2)² = 1 + x1²·y1² + x2²·y2² + 2·x1·y1 + 2·x2·y2 + 2·x1·x2·y1·y2   (since x·y = x1·y1 + x2·y2)
36
How did it help?
• Instead of computing the dot product of x and y in 6-D (for which we would first have to transform them to 6-D and then compute the dot product), we computed their dot product in the original 2-D, added 1, and then squared the answer. This saved us a lot of computation.
• This is the kernel trick.
• The important point here is that using a particular kernel gives us a particular transformation.
• In the case of the polynomial kernel, K(x, y) = (x·y + c)^d, we can modify the following two parameters: the degree d and the constant term c (here d = 2 and c = 1); see the code sketch below.
37
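For instance, in scikit-learn's SVC these two parameters appear as degree and coef0 when kernel="poly" (with gamma as an extra scale factor); a minimal sketch, reusing the toy data assumed earlier:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[3, 1], [3, -1], [6, 1], [6, -1],
              [1, 0], [0, 1], [0, -1], [-1, 0]])
t = np.array([1, 1, 1, 1, -1, -1, -1, -1])

# Kernel value is (gamma * x.y + coef0) ** degree, i.e. (x.y + 1)**2 here
clf = SVC(kernel="poly", degree=2, coef0=1, gamma=1)
clf.fit(X, t)
print(clf.predict([[4, 2], [0, 0]]))  # predictions for two new query points
```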
Numerical Example
• x = [5 2]T, y = [3 4]T
• x·y = 5×3 + 2×4 = 23
• (1 + x·y)² = (1 + 23)² = 24² = 576
• X = [1  25  4  √2·5  √2·2  √2·10]T
• Y = [1  9  16  √2·3  √2·4  √2·12]T
• X·Y = 1 + 225 + 64 + 2×15 + 2×8 + 2×120 = 576, the same answer (verified in the sketch below).
39
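A quick NumPy check of this arithmetic, using the explicit degree-2 feature map from the previous slides:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 polynomial feature map for a 2-D vector."""
    v1, v2 = v
    r2 = np.sqrt(2.0)
    return np.array([1.0, v1**2, v2**2, r2 * v1, r2 * v2, r2 * v1 * v2])

x = np.array([5.0, 2.0])
y = np.array([3.0, 4.0])

print(phi(x) @ phi(y))     # 576.0 (dot product computed in 6-D)
print((1.0 + x @ y) ** 2)  # 576.0 (kernel computed in the original 2-D)
```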
Python Example
https://drive.google.com/file/d/1Rj3Pvj9o2jI5FYvuat04JfXGOz7AFVly/view?usp=sharing
40
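Since the linked notebook is not reproduced here, below is a self-contained illustrative sketch along the same lines (my own example, not necessarily the one in the notebook): training kernel SVMs on data that is not linearly separable and comparing kernels.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Non-linearly separable toy data (two interleaving half-moons)
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# The kernel trick lets the SVM separate the classes without an explicit feature mapping
for kernel in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kernel, degree=2, coef0=1, gamma="scale", C=1.0)
    clf.fit(X_train, y_train)
    print(f"{kernel:>6}: test accuracy = {clf.score(X_test, y_test):.2f}, "
          f"support vectors = {len(clf.support_vectors_)}")
```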
That is all for Week 12
41