Lecture Slides-Week12

Machine Learning

(CS4613)

Department of Computer Science


Capital University of Science and Technology (CUST)
Course Outline
Topic | Weeks | Reference
Introduction | Week 1 | Hands-On Machine Learning, Ch 1
Hypothesis Learning | Week 2 | Tom Mitchell, Ch 2
Model Evaluation | Week 3 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch 8
Classification:
  Decision Trees | Week 4, 5 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch 4
  Bayesian Inference, Naïve Bayes | Week 6, 7 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch 6
PCA | Week 8 | Hands-On Machine Learning, Ch 8
Linear Regression | Week 9, 10 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch 7
SVM | Week 11, 12 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch 7
ANN | Week 13, 14 | Neural Networks: A Systematic Introduction, Ch 1-4, 7 (selected topics); Hands-On Machine Learning, Ch 10
K-Nearest Neighbor | Week 15 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch 5; Master Machine Learning Algorithms, Ch 22, 23
K-Means Clustering | Week 16 | Data Mining: The Textbook, Ch 6

2
Outline Week 12
• SVM Problem Formulation
• Dual Problem Specification
• Intuition behind SVM
• Example
• Kernel Trick
• Code Example

3
[Figure; w0 is denoted b in this figure]
4
SVM Training
• We would like to maximize the margin, subject to the constraint that all points from one class must be on one side of the hyperplane and all points from the other class must be on the other side.
• The constraint is ti (w · di + w0) ≥ 1 for every training instance (di, ti), where ti is +1 or −1.
• The optimization criterion is defined in terms of the perpendicular distance from any instance to the decision boundary, |w · di + w0| / ||w|| (see next slide).

• The numerator will be 1 for instances along the margin extents.

• Hence the distance between such an instance (a support vector) and the plane will be 1 / ||w||.
• Because the margin is symmetrical on either side of the decision boundary, the size of the margin is 2 / ||w||, and this is what we want to maximize.
5
Distance between a point and a line
• The distance from a point (x0, y0) to the line Ax + By + C = 0 is |A·x0 + B·y0 + C| / √(A² + B²).

6
Optimization Problem
Maximize 2 / ||w||   OR   minimize ||w|| / 2   OR   minimize ½ ||w||²

(Squaring the norm is done for mathematical convenience.)

Subject to the following constraint:

ti (w · di + w0) ≥ 1,  for i = 1, …, n

If we represent the output as y, the input as x, and w0 as b, then we can rewrite the constraint as yi (w · xi + b) ≥ 1.

Note that these are actually n conditions (constraints), one for each training instance.
Together they state that all the training instances are correctly classified.

This is a constrained optimization problem and can be solved by the Lagrange multiplier method.
7
Lagrange formulation of SVM
• Minimize

  L(w, w0, α) = ½ ||w||² − Σi αi [ ti (w · di + w0) − 1 ]

• This one equation essentially combines the optimization function with the constraint.
• The αi are called Lagrange multipliers. There are n of them, one for each training instance.
• We will not discuss the mathematics behind the Lagrange formulation, as it is beyond the scope of this course.
https://towardsdatascience.com/support-vector-machines-dual-formulation-quadratic-programming-sequential-minimal-optimization-57f4387ce4dd
8
Lagrange formulation Contd..
• Taking the derivative of the Lagrange function w.r.t. w and b (w0) and setting these equal to 0 gives us the following two equations:

  w = Σi αi ti di        Σi αi ti = 0

• Substituting the values from the above two equations into the Lagrange function given on the previous slide lets us get rid of w and b from that equation and, after simplification, gives us the following dual formulation of SVM.
9
The Dual Formulation
• Maximize

  Σi αi − ½ Σi Σj αi αj ti tj (di · dj)

• Subject to

  αi ≥ 0 for all i,  and  Σi αi ti = 0
10
Why the Dual Formulation?
• Because it lets us solve the problem by computing just the dot products xi · xj, which will be very important later on when we want to solve non-linearly separable classification problems using the kernel trick. (A small illustration follows this slide.)

11
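As a small illustration (not part of the original slides): the dual objective touches the training data only through the pairwise dot products, i.e. the Gram matrix. A minimal numpy sketch with made-up 2-D instances, whose dot products happen to match the ones used in the worked example later in the deck:

# Minimal sketch (not from the slides): the dual formulation needs only the
# pairwise dot products xi . xj of the training instances (the Gram matrix).
import numpy as np

X = np.array([[1.0, 0.0], [3.0, 1.0], [3.0, -1.0]])  # made-up training instances
G = X @ X.T                                          # G[i, j] = xi . xj

print(G)
# [[ 1.  3.  3.]
#  [ 3. 10.  8.]
#  [ 3.  8. 10.]]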
SVM Training
• In the training phase, we find a weight (αi) for each training point.
• Points whose weight becomes zero are not support vectors. Only a few αi will be non-zero at the end of training.
• Once the support vectors have been identified, the bias term (w0 or b) can be computed using these support vectors, because for every support vector di
  ti (w · di + w0) = 1
  (A scikit-learn sketch of these quantities follows this slide.)

12
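Below is a minimal scikit-learn sketch (not part of the original slides; the data is made up) showing where the quantities above live after training: support_vectors_ holds the support vectors, dual_coef_ holds the products αi·ti, and intercept_ is the bias w0.

# Minimal sketch (not from the slides): training a linear SVM and inspecting
# the support vectors, the alpha weights, and the bias term.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 1.0], [3.0, -1.0]])  # made-up data
t = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # a very large C approximates a hard margin
clf.fit(X, t)

print(clf.support_vectors_)  # the support vectors di
print(clf.dual_coef_)        # alpha_i * t_i for each support vector
print(clf.intercept_)        # the bias term w0 (b)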
SVM Prediction Model
• Once we have all the αi and w0, we can make a prediction for a new query instance using

  M(q) = Σi=1..s αi ti (di · q) + w0

  q is the set of descriptive features for a query instance.
  (d1, t1), …, (ds, ts) are the s support vectors (instances composed of descriptive features and a target feature).
  w0 is the bias.
  α is the set of parameters determined during the training process.
• When the output of the above equation is greater than 1, we predict the positive target level for the query, and when the output is less than −1, we predict the negative target level. (A small code sketch of this prediction follows this slide.)
13
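A minimal sketch (not part of the original slides) of the prediction model above; the support vectors, α values, and bias used here are illustrative.

# Minimal sketch (not from the slides): computing sum_i alpha_i*t_i*(di . q) + w0.
import numpy as np

def svm_output(q, support_vectors, targets, alphas, w0):
    """Raw SVM output for a query instance q."""
    return sum(a * t * np.dot(d, q)
               for a, t, d in zip(alphas, targets, support_vectors)) + w0

support_vectors = [np.array([1.0, 0.0]), np.array([3.0, 1.0]), np.array([3.0, -1.0])]
targets = [-1, +1, +1]      # ti
alphas = [0.5, 0.25, 0.25]  # weights found during training
w0 = -2.0                   # bias

print(svm_output(np.array([4.0, 5.0]), support_vectors, targets, alphas, w0))  # > 1: positive
print(svm_output(np.array([0.0, 0.0]), support_vectors, targets, alphas, w0))  # < -1: negative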
Intuition Behind SVM Predictions
• We actually compute a weighted sum of similarities between a test point (query) and each support vector, and assign the class based on this weighted sum.
• In M(q) = Σi αi ti (di · q) + w0:
  ti is +1 or −1
  αi is the weight of the support vector
  di · q is the similarity between the test point and a support vector
14
Linear SVM Example

15
Example
• Suppose we are given the following positively
labeled data points

• and the following negatively labeled data points

16
17
Example Contd..
• There are three support vectors: s1 = (1, 0) from the negative class, and s2 = (3, 1) and s3 = (3, −1) from the positive class.
• So we need to find three α and one bias (4 parameters). That means we need 4 equations.

18
Example Contd..
• For a support vector d on the positive-side margin we have w · d + w0 = +1.
• For a support vector d on the negative-side margin we have w · d + w0 = −1.
• We know (from slide 9) that w = Σj αj tj sj.
• Hence Σj αj tj (sj · si) + w0 = +1 (for a +ve SV si)
• And Σj αj tj (sj · si) + w0 = −1 (for a −ve SV si)


19
Example Contd..
• We can write 3 equations for the 3 support vectors (s1 is −ve; s2 and s3 are +ve):

  α1·y1·(s1·s1) + α2·y2·(s2·s1) + α3·y3·(s3·s1) + w0 = −1
  α1·y1·(s1·s2) + α2·y2·(s2·s2) + α3·y3·(s3·s2) + w0 = +1
  α1·y1·(s1·s3) + α2·y2·(s2·s3) + α3·y3·(s3·s3) + w0 = +1

  The required dot products are:
  s1·s1 = 1,  s2·s1 = 3,  s3·s1 = 3,  s2·s2 = 10,  s2·s3 = 8,  s3·s3 = 10

  Substituting the labels (y1 = −1, y2 = y3 = +1) and the dot products:

  α1·(−1)·1 + α2·(+1)·3 + α3·(+1)·3 + w0 = −1
  α1·(−1)·3 + α2·(+1)·10 + α3·(+1)·8 + w0 = +1
  α1·(−1)·3 + α2·(+1)·8 + α3·(+1)·10 + w0 = +1

  which simplifies to:

  −α1 + 3α2 + 3α3 + w0 = −1
  −3α1 + 10α2 + 8α3 + w0 = +1
  −3α1 + 8α2 + 10α3 + w0 = +1
20
Example Contd.
And from the constraint Σi αi ti = 0 we get

−α1 + α2 + α3 = 0

• Hence the 4 equations are

−α1 + α2 + α3 = 0
−α1 + 3α2 + 3α3 + w0 = −1
−3α1 + 10α2 + 8α3 + w0 = +1
−3α1 + 8α2 + 10α3 + w0 = +1
21
Example Contd..
• You can solve these equations however you like (a numpy check follows this slide).
• I used the following online tool (entering the unknowns as x, y, z and w):
• https://www.wolframalpha.com/calculators/system-equation-calculator

• α1 = 0.5
• α2 = 0.25
• α3 = 0.25
• w0 = −2

22
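A minimal numpy check (not part of the original slides) of the same system of equations; the unknowns are ordered [α1, α2, α3, w0].

# Minimal sketch (not from the slides): solving the 4 equations with numpy.
import numpy as np

A = np.array([
    [-1.0,  1.0,  1.0, 0.0],   #  -a1 +   a2 +   a3      =  0
    [-1.0,  3.0,  3.0, 1.0],   #  -a1 +  3a2 +  3a3 + w0 = -1
    [-3.0, 10.0,  8.0, 1.0],   # -3a1 + 10a2 +  8a3 + w0 = +1
    [-3.0,  8.0, 10.0, 1.0],   # -3a1 +  8a2 + 10a3 + w0 = +1
])
b = np.array([0.0, -1.0, 1.0, 1.0])

print(np.linalg.solve(A, b))   # -> [ 0.5   0.25  0.25 -2.  ]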
Finding the separating hyperplane
• We know that w = Σi αi ti si

  w = 0.5 × (−1) × s1 + 0.25 × (+1) × s2 + 0.25 × (+1) × s3 = [1 0]T

• That gives us A = 1 and B = 0 (in A·x + B·y + C = 0). We already know C = w0 = −2.
• So the separating hyperplane is the line x1 − 2 = 0.

23
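A minimal numpy check (not part of the original slides) of the weight vector computation above, assuming the support vectors s1 = (1, 0), s2 = (3, 1), s3 = (3, −1) used in this example.

# Minimal sketch (not from the slides): w = sum_i alpha_i * t_i * s_i.
import numpy as np

S = np.array([[1.0, 0.0], [3.0, 1.0], [3.0, -1.0]])  # s1, s2, s3 (assumed values)
t = np.array([-1.0, 1.0, 1.0])                       # class labels
alpha = np.array([0.5, 0.25, 0.25])                  # from the previous slide
w0 = -2.0

w = (alpha * t) @ S
print(w, w0)   # -> [1. 0.] -2.0, i.e. the separating line x1 - 2 = 0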
24
25
Non-Linear SVM and
the Kernel Trick

26
Data that is not Linearly Separable
• So far we have focused only on data that is linearly separable.
• We have learnt how to find a hyperplane that maximizes the margin to the class boundary.
• However, not all data are linearly separable. For example, see the data below.

[Figure: an example 2-D dataset that is not linearly separable]

27
Mapping the Data to a Higher Dimension
• For the example data given on the previous slide, if we find a way to map the data from 2-D to 3-D, we will be able to find a decision surface that clearly separates the different classes.

28
Example: Mapping to 3-D

29
Example: From 3-D to 9-D
• Original data (3-D): x = [x1 x2 x3]T

• Transformed (by taking all possible pairwise combinations, giving 9-D):
  Φ(x) = [x1x1  x1x2  x1x3  x2x1  x2x2  x2x3  x3x1  x3x2  x3x3]T
30
Problem
• Remember SVM’s dual problem expression?

  Maximize Σi αi − ½ Σi Σj αi αj ti tj (di · dj)

• If we have transformed our data into a higher dimension, then we need to compute the dot products of the transformed vectors.
• That can be computationally very expensive: first we have to transform the data into the higher dimension and then compute the dot product.
31
Kernel Trick
• The kernel trick allows us to get the same answer as the dot product of the transformed vectors without actually performing the transformation.
• The idea is to use specially designed functions (called kernel functions).
• One such function is the polynomial kernel, which looks like the following, where x and y are two vectors and d is the degree of the polynomial (a small Python version follows this slide):

  K(x, y) = (1 + x · y)^d

32
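A minimal Python version of the polynomial kernel (not part of the original slides); the function name and the constant term c are illustrative.

# Minimal sketch (not from the slides): the polynomial kernel as a function.
import numpy as np

def poly_kernel(x, y, degree=2, c=1.0):
    """(c + x . y)^degree -- the dot product in the implicitly mapped space."""
    return (c + np.dot(x, y)) ** degree

print(poly_kernel(np.array([5.0, 2.0]), np.array([3.0, 4.0])))  # -> 576.0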
Example: Transformation using a polynomial kernel of degree 2
• For this example we are using a polynomial kernel of degree 2, which looks like (1 + x·y)².
• Assume we have two vectors (data instances), x = [x1 x2]T and y = [y1 y2]T.
• We want to transform x and y to a higher dimension.
• To be able to use the kernel trick, we transform x and y into the higher dimension in a particular way.
• Then their dot product in the higher dimension will be the same as computing (1 + x·y)², where x·y is the dot product in the original 2-D space.
• Hence we will not have to first transform the vectors and then compute the dot product; we simply compute their dot product in the original dimensions, add 1, and then square the answer.
33
Dot product of x and y in 2-D (original dimensions)

x · y = x1·y1 + x2·y2
34
Transforming x and y to 6 dimensions (in a certain way)
Here I am writing the vectors in a row and using the transpose symbol.
Also, I am representing the transformed vectors using capital symbols.

x = [x1  x2]T is transformed to
X = [1  x1²  x2²  √2·x1  √2·x2  √2·x1·x2]T

y = [y1  y2]T is transformed to
Y = [1  y1²  y2²  √2·y1  √2·y2  √2·y1·y2]T
35
Dot product of transformed vectors (element-wise multiplication and then summation)

X · Y = 1 + x1²·y1² + x2²·y2² + 2·x1·y1 + 2·x2·y2 + 2·x1·x2·y1·y2

This must be the same as (1 + x·y)². Let's verify:
(1 + x·y)² = (1 + x1·y1 + x2·y2)²  (since x·y = x1·y1 + x2·y2)
           = 1 + x1²·y1² + x2²·y2² + 2·x1·y1 + 2·x2·y2 + 2·x1·x2·y1·y2
36
How did it help?
• Instead of computing the dot product of x and y in 6-D (for which we would first have to transform them to 6-D and then compute the dot product), we computed their dot product in the original 2-D, added 1, and then squared the answer. This saved us a lot of computation.
• This is the kernel trick.
• The important point here is that using particular kernels gives us particular transformations.
• In the case of the polynomial kernel, we can modify the following two parameters: the constant term and the degree d in (c + x·y)^d.
37
Numerical Example
• x = [5  2]T,  y = [3  4]T
• x·y = 5·3 + 2·4 = 23
• (1 + x·y)² = (1 + 23)² = 24² = 576

• Using the transformations:

• X = [1  25  4  √2·5  √2·2  √2·10]T
• Y = [1  9  16  √2·3  √2·4  √2·12]T

• X·Y = 1 + 225 + 64 + 30 + 16 + 240 = 576


38
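A minimal sketch (not part of the original slides) that checks the numerical example above: the explicit 6-D mapping and the kernel give the same answer.

# Minimal sketch (not from the slides): explicit 6-D mapping vs. kernel trick.
import numpy as np

def phi(v):
    """Explicit degree-2 polynomial feature map for a 2-D vector."""
    a, b = v
    return np.array([1.0, a * a, b * b,
                     np.sqrt(2) * a, np.sqrt(2) * b, np.sqrt(2) * a * b])

x = np.array([5.0, 2.0])
y = np.array([3.0, 4.0])

print(np.dot(phi(x), phi(y)))     # 576.0  (dot product after transformation)
print((1.0 + np.dot(x, y)) ** 2)  # 576.0  (kernel trick, no transformation)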
Soft Margin SVM
• The description of the support vector machine approach given so far assumes that it is possible to separate the instances of the two different target feature levels with a linear hyperplane.
• This is called a hard margin SVM.
• Sometimes this is not possible, even after using a kernel function to move the data to a higher dimensional feature space.
• In such cases, a margin cannot be defined as we have done in this example.
• An extension of the standard support vector machine approach that allows a soft margin caters for this: it permits some overlap between instances with target features of the two different levels.

39
Python Example
https://drive.google.com/file/d/1Rj3Pvj9o2jI5FYvuat04JfXGOz7AFVly/view?usp=sharing
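The course notebook itself is at the link above and is not reproduced here. The following is an illustrative sketch only (the dataset, parameters, and function choices are assumptions, not taken from the notebook): a soft-margin SVM with a degree-2 polynomial kernel on data that is not linearly separable.

# Illustrative sketch only (the actual notebook is at the link above).
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in 2-D.
X, t = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)
X_train, X_test, t_train, t_test = train_test_split(X, t, random_state=0)

# Polynomial kernel with gamma=1, coef0=1, degree=2, i.e. (1 + x.y)^2;
# C controls the soft margin.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1.0)
clf.fit(X_train, t_train)

print("test accuracy:", clf.score(X_test, t_test))
print("support vectors per class:", clf.n_support_)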

40
That is all for Week 12

41
