
TE (CS)

Spring Semester 2024

Machine Learning (CS-324)

Lecture #20-22
Support Vector Machines

Dr Syed Zaffar Qasim


Assistant Professor (CIS)

Numerical Domains and Support Vector Machines

Which Linear Classifier is better?


▪ Fig 1 shows three linear classifiers, each perfectly separating
the positive examples from the negative.
▪ Knowing that good behavior on the training set does not
guarantee high performance in the future, we then ask:
o which of the three is likely to score best on future examples?

Fig 1: Linearly separable classes can be separated in infinitely many different ways.


Numerical Domains and Support Vector Machines
▪ The Support Vector Machine (SVM): mathematicians who
studied this problem found an answer.
▪ In Fig 1, the dotted-line classifier all but touches the nearest
examples on either side; its margin is small.

▪ Conversely, the margin is greater in the case of the solid-line classifier.


▪ Margin: the perpendicular distance from the decision
boundary to the closest points on either side.
▪ Hence, the greater the margin, the higher the chances that the
classifier will do well on future data.

Numerical Domains and Support Vector Machines


▪ The SVM technique is illustrated in Fig. 2.

Fig 2

▪ The solid line is the best classifier.


▪ The graph also shows two thinner lines,
o parallel to the classifier, each at the same distance from it;
o they pass through the examples called support vectors.
▪ Support vectors: the set of points closest to the maximum-
margin decision boundary.
o They define or support the decision boundary.


Numerical Domains and Support Vector Machines
▪ The task for machine learning is to identify the support vectors
that maximize the margin.

▪ The simplest technique tries all possible n-tuples of examples
and measures the margin implied by each such choice.
▪ This, however, is unrealistic in domains with many examples.
▪ Most of the time, therefore, engineers rely on the software
packages available for free on the internet.

The Case When the Data Are Linearly Separable

▪ To explain SVMs, let’s first look at the simplest case – a


two-class problem where the classes are linearly separable.
▪ Let the data set D be given as (X1, y1), (X2, y2), … , (Xn, yn),
where each Xi is a training tuple with associated class
label yi.
▪ Each yi can take one of two values,
o either +1 or -1 (i.e., yi ∈ {+1, -1}),
o corresponding to the classes buys_computer = yes and
buys_computer = no, respectively.



The Case When the Data Are Linearly Separable
▪ Consider an example based on two attributes, A1 and A2.

Fig 3

▪ An infinite number of separating lines could be drawn.


▪ We want to find the “best” one with the minimum
classification error on previously unseen tuples.
▪ Generalizing to n dimensions, we want to find the best
hyperplane.
▪ An SVM approaches this problem by searching for the
Maximum Marginal Hyperplane (MMH).

The Case When the Data Are Linearly Separable


▪ Consider Fig 4, which shows two possible separating
hyperplanes and their associated margins.

Fig 4

▪ We expect the hyperplane with the larger margin to be


more accurate at classifying future data tuples than the
hyperplane with the smaller margin.
▪ This is why (during the learning phase) the SVM
searches for the hyperplane with the largest margin,
that is, the maximum marginal hyperplane (MMH).


Decision Function

▪ A separating hyperplane can be written as


wT x + b = 0 ----- (1)
o w is a weight vector, namely, wT = [w1, w2, …, wn]; n is the
number of attributes;
o x = [x1, …, xn]T is the vector of attribute values;
o and b is a scalar, often referred to as a bias.
▪ The learning task involves choosing the values of w and b
o based on the training data
o by maximizing the quantity called margin.

Decision Function
▪ If we think of b as an additional weight, w0, we can rewrite
Eq. (1) for the two-attribute case as
w0 + w1x1 + w2x2 = 0

Fig 5

▪ Thus, any point that lies above the separating hyperplane


satisfies
w0 + w1x1 + w2x2 > 0
▪ Similarly, any point that lies below the separating
hyperplane satisfies
w0 + w1x1 + w2x2 < 0
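To make the two inequalities concrete, here is a minimal Python sketch that classifies points by the sign of w0 + w1x1 + w2x2; the weights and points are made-up illustrative values, not taken from the notes.

```python
# Classify 2-D points by the sign of w0 + w1*x1 + w2*x2.
# The weights and points below are illustrative values only.
import numpy as np

w0 = -3.0
w = np.array([1.0, 1.0])                 # [w1, w2]
points = np.array([[2.0, 2.5],           # lies above the hyperplane -> positive side
                   [0.5, 1.0]])          # lies below the hyperplane -> negative side

print(np.sign(w0 + points @ w))          # [ 1. -1.]
```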



Decision Function

▪ Decision function: fnew = sign(wTx + b)

▪ fnew is invariant to scaling its argument by a positive
constant.
▪ This means that we can multiply (wTx + b) by any positive
constant and the output of the sign function will be unchanged.
▪ Therefore we can decide to fix the scaling of w and b such
that wTx + b = ±1 for the closest points on either side.


Decision Function
▪ Hence the hyperplanes defining the “sides” of the margin
can be written as
o H1 : w0 + w1x1 + w2x2 ≥ 1 for yi = +1
o H2 : w0 + w1x1 + w2x2 ≤ -1 for yi = -1

▪ Any tuple that falls on or above H1 belongs to class +1, and
any tuple that falls on or below H2 belongs to class -1.
▪ Combining the two inequalities, we get
yi (w0 + w1x1 + w2x2) ≥ 1 for all i
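A quick numeric check of this combined constraint, again with illustrative weights and data rather than anything from the notes:

```python
# Verify y_i * (w0 + w1*x1 + w2*x2) >= 1 for every training tuple.
import numpy as np

w0 = -3.0
w = np.array([1.0, 1.0])
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.5, 0.5], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])

print(np.all(y * (w0 + X @ w) >= 1))   # True: every point is on or outside its margin hyperplane
```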


Maximizing the margin
▪ It is easiest to compute the margin using one point from
each class.
▪ X1 and X2 are the closest points from the two classes.

Fig 6

▪ 2γ (i.e., twice the margin) is equal to the component
of the vector joining X1 and X2 in the direction
perpendicular to the boundary.
▪ The vector joining X1 and X2 is given by X1 – X2.
▪ The direction perpendicular to the decision boundary
is given by w/‖w‖.

Maximizing the margin


▪ The inner product of these two quantities gives
2γ = wT(X1 − X2) / ‖w‖
• To maximize the margin, we must maximize 1/‖w‖.
• There are, however, some constraints:
yn (wTxn + b) ≥ 1
• It will actually be easier to minimize (1/2)‖w‖2.
• Formally, our optimization problem has become
argmin_w (1/2)‖w‖2
subject to yn (wTxn + b) ≥ 1, for all n
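A small numeric sketch of the margin formula above; w and the two closest points are made-up values, not from the lecture.

```python
# Compute 2*gamma = w^T (X1 - X2) / ||w|| for illustrative values.
import numpy as np

w = np.array([1.0, 1.0])
X1 = np.array([2.0, 2.0])     # closest point from the positive class
X2 = np.array([1.0, 1.0])     # closest point from the negative class

two_gamma = w @ (X1 - X2) / np.linalg.norm(w)
print(two_gamma)              # 1.414..., so the margin gamma is about 0.707
```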


Method of Lagrange multipliers
▪ Consider the problem of finding the minimum or maximum of
the function f(x) subject to the restriction that x must satisfy all
the equations
g1(x) = b1
g2(x) = b2
⋮
gm(x) = bm
▪ A classical method of dealing with this problem is the method of
Lagrange multipliers. The procedure begins by formulating the
Lagrangian function
h(x, λ) = f(x) − Σ_{i=1}^{m} λi [gi(x) − bi]
▪ where the new variables λ = (λ1, λ2, …, λm) are called Lagrange
multipliers. Notice the key fact that, for the feasible values of x,
gi(x) – bi = 0 for all i,
so h(x, λ) = f(x).
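As a small worked example of the method (not one from the lecture), the sketch below uses SymPy to minimize f(x1, x2) = x1² + x2² subject to x1 + x2 = 1:

```python
# Lagrange multipliers with SymPy: build h(x, lambda) and set its partial
# derivatives to zero.  The function and constraint are illustrative only.
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lambda')
f = x1**2 + x2**2            # objective f(x)
g = x1 + x2 - 1              # constraint written as g(x) - b = 0

h = f - lam * g              # Lagrangian h(x, lambda) = f(x) - lambda*(g(x) - b)

stationary = sp.solve([sp.diff(h, v) for v in (x1, x2, lam)], (x1, x2, lam), dict=True)
print(stationary)            # [{x1: 1/2, x2: 1/2, lambda: 1}]
```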

Maximizing the margin


▪ Our new objective function is
argmin_{w,α} (1/2) wTw − Σ_{i=1}^{N} αi (yi (wTxi + b) − 1)
subject to αi ≥ 0 for all i
▪ To solve it, we need to incorporate constraints into the
objective function through a set of Lagrangian multipliers.
▪ Lagrangian multipliers add a new term in objective function
for each constraint such that
o the optimum of the new objective function corresponds to
o the optimum of the original constrained problem.
▪ In our case, we need N Lagrangian terms.
▪ Each has an associated Lagrange multiplier which itself is
constrained to be positive.


Maximizing the margin
▪ Here we have used the fact that ‖w‖2 = wTw.
▪ At an optimum of this new objective function, the partial
derivatives of the objective function with respect to w and b
must be zero.
▪ Setting these derivatives to zero gives
w = Σ_{i=1}^{N} αi yi xi  and  Σ_{i=1}^{N} αi yi = 0.
▪ Substituting these back yields a new objective function which
must be maximized with respect to the αi rather than w:
Σ_{i=1}^{N} αi − (1/2) Σ_{i,j=1}^{N} αi αj yi yj xiTxj

▪ Notice that w doesn’t feature at all in this optimization


problem.
▪ This optimization problem is a constrained quadratic
programming task, quadratic because of the αiαj term.
▪ There is no analytical solution, but it is reasonably
straightforward to solve numerically.
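As an illustration of how the dual can be solved numerically, here is a hedged sketch that uses SciPy's general-purpose SLSQP solver on a tiny made-up dataset (dedicated SVM packages use specialised quadratic-programming routines instead); the equality constraint comes from setting the derivative with respect to b to zero, as noted above.

```python
# Solve the hard-margin dual numerically: maximize
#   sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
# subject to alpha_i >= 0 and sum_i alpha_i y_i = 0.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])   # toy, linearly separable
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)
K = (y[:, None] * X) @ (y[:, None] * X).T         # K[i, j] = y_i y_j x_i^T x_j

def neg_dual(alpha):                              # maximizing the dual = minimizing its negative
    return -(alpha.sum() - 0.5 * alpha @ K @ alpha)

constraints = {'type': 'eq', 'fun': lambda a: a @ y}   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * N                             # alpha_i >= 0

res = minimize(neg_dual, np.zeros(N), method='SLSQP',
               bounds=bounds, constraints=constraints)
alpha = res.x

w = (alpha * y) @ X                               # w = sum_i alpha_i y_i x_i
support = alpha > 1e-6                            # support vectors have alpha_i > 0
b = np.mean(y[support] - X[support] @ w)          # b = y_n - w^T x_n at the closest points
print(alpha.round(3), w.round(3), round(b, 3))
```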

Making predictions

o Given a set of optimal αn, how do we go about


making predictions?
o Our decision function, ynew = sign(wT xnew + b), is
based on w and b, not αn.
o To convert it into a function of the αi, we substitute the
expression for w obtained earlier (w = Σ_{i=1}^{N} αi yi xi),
resulting in
ynew = sign(Σ_{i=1}^{N} αi yi xiT xnew + b)
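A minimal sketch of this prediction rule; the variable names are my own, and alpha and b are assumed to come from a solver such as the one sketched earlier.

```python
# y_new = sign( sum_i alpha_i * y_i * x_i^T x_new + b )
import numpy as np

def predict(x_new, X, y, alpha, b):
    # Only the support vectors (alpha_i > 0) actually contribute to the sum.
    return np.sign((alpha * y) @ (X @ x_new) + b)
```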



Making predictions

o To find b, we will use the fact that for the closest


points, yn(wTxn + b) = 1.
o Substituting the expression for w into this and re-
arranging allows us to calculate b (note that 1/yn = yn):
b = yn − Σ_{j=1}^{N} αj yj xjT xn
o Where xn is any one of the closest points.
o This gives us everything we need to be able to
classify any xnew.
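The same calculation as a short sketch; xn is taken to be any point with a strictly positive multiplier, and the names are mine rather than the lecture's.

```python
# b = y_n - sum_j alpha_j * y_j * x_j^T x_n, evaluated at one closest point.
import numpy as np

def offset(X, y, alpha):
    n = int(np.argmax(alpha))                # index of one support vector (alpha_n > 0)
    return y[n] - (alpha * y) @ (X @ X[n])
```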


Hard vs Soft Margins


▪ Since the decision boundary found by maximizing the margin
o depends only on the closest points,
o we could discard all of the other data and
o end up with exactly the same decision boundary.
▪ This is reflected by the fact that, at the optimum, all the
αi that do not correspond to support vectors will be zero.
▪ For large problems, this can be a very useful feature.
▪ Even for an SVM trained on a large amount of data, the
decision function might involve only a small subset of the
training data.



Hard vs Soft Margins
Hard margins
▪ Fig 7 shows a binary dataset and the resulting decision
boundary (wTx + b = 0, where w = Σ_{i=1}^{N} αi yi xi),
along with the three support vectors (large grey circles).

Fig 7

▪ These are the only points for which αi > 0, and hence the only
ones that need to be used when classifying new data.

Hard vs Soft Margins


Hard margins
▪ Although it can be efficient to base the decision on only three of
the training points, it will not always be a good thing.
▪ To illustrate why, consider fig 8.

Fig 8

▪ There is one difference from fig 7 – the support vector from the
class denoted by grey squares has been moved closer to the other class.
▪ Moving this single data point has had a large effect on the
position of the decision boundary.



Hard vs Soft Margins
Hard margins
▪ This is another example of overfitting – we are allowing the data
to have too much influence.

• To see why this happens, look at our original constraints:


yn(wTxn + b) ≥ 1 -------- (3)
▪ This means that all training points have to sit on the correct
side of the decision boundary.
▪ This type of SVM is known as a hard margin SVM.

Hard vs Soft Margins


▪ It will sometimes be sensible to relax this constraint
using a soft margin.
Soft margins
• To allow points to potentially lie on the wrong side of
the boundary, we need to slacken the constraints in
our original formulation.
• In particular, we need to adapt eqn 3 so that it admits
the possibility of some points lying closer to (or on the
wrong side of) the decision boundary.
• To achieve this, the constraint becomes:
yi (wTxi + b) ≥ 1 − ξi
where ξi ≥ 0



Hard vs Soft Margins
Soft margins
▪ If 0 ≤ ξi ≤ 1, the point lies on the correct side of the
boundary but within the margin band.
▪ If ξi ≥ 1, the point lies on the wrong side of the
boundary.
▪ Our optimization task becomes
Minimize (1/2) wTw + C Σ_{i=1}^{N} ξi
▪ subject to ξi ≥ 0 and yi (wTxi + b) ≥ 1 − ξi, for all i
▪ The new parameter C controls to what extent we are
willing to allow points to sit within the margin band or
on the wrong side of the decision boundary.
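Note that for any fixed w and b the smallest slacks satisfying these constraints are ξi = max(0, 1 − yi(wTxi + b)); the sketch below evaluates them for made-up values.

```python
# Smallest feasible slacks xi_i = max(0, 1 - y_i * (w^T x_i + b)).
# The weights and data are illustrative values only.
import numpy as np

w, b = np.array([1.0, 1.0]), -3.0
X = np.array([[2.0, 2.0], [2.4, 2.4], [1.2, 1.8], [0.0, 0.0]])
y = np.array([1, 1, -1, -1])

xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print(xi)   # 0 outside the margin band, between 0 and 1 inside it, above 1 if misclassified
```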


Hard vs Soft Margins


Soft margins
▪ If we follow the same steps that we took for the hard
margin case, we find that this change in the model has
only a very small effect on the maximization problem.
▪ Omitting the details, we now need to find the
maximum of the following quadratic programming
problem:
Maximize Σ_{i=1}^{N} αi − (1/2) Σ_{i,j=1}^{N} αi αj yi yj xiTxj
▪ subject to
Σ_{i=1}^{N} αi yi = 0 and 0 ≤ αi ≤ C, for all i
▪ The only difference is an upper bound (C) on αn.



Hard vs Soft Margins
Soft margins
• The influence of each training point in our decision
function is proportional to αn.
• We are therefore imposing an upper bound on the
influence that any one training point can have.
• For the example in fig 8, the support vector from the
grey class had αi = 5.45.
• Setting C to 1 would result in a change in the decision
boundary (some other αn from the grey square class will
have to become non-zero), moving the boundary back
towards the other objects in the grey square class.


Hard vs Soft Margins


Soft margins
▪ This is exactly what happens, as we can see from fig 9,
where we plot the decision boundary and support vectors
for C = 1 and C = 0.01.

Fig 9

▪ As C decreases, the maximum potential influence of each


training point is eroded and so more and more of them
become active in the decision function.
▪ Using a soft margin gives us a free parameter (C) that needs
to be fixed.
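This behaviour is easy to reproduce with an off-the-shelf implementation; the sketch below (using scikit-learn on random data rather than the lecture's dataset) fits a linear soft-margin SVM with C = 1 and C = 0.01 and reports how many support vectors each produces.

```python
# Compare soft-margin SVMs with C = 1 and C = 0.01 on a random 2-D dataset.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2.0, 2.0], 0.8, size=(20, 2)),
               rng.normal([-2.0, -2.0], 0.8, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

for C in (1.0, 0.01):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # With smaller C, more alpha_i hit the bound and more points become support vectors.
    print(f"C={C}: {len(clf.support_)} support vectors")
```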
