
TE (CS)

Spring Semester 2024

Machine Learning (CS-324)

Lecture #20-22
Support Vector Machines

Dr Syed Zaffar Qasim


Assistant Professor (CIS)

Numerical Domains and Support Vector Machines

Which Linear Classifier is better?


▪ Fig 1 shows three linear classifiers, each perfectly separating
the positive examples from the negative.
▪ Knowing that good behavior on the training set does not
guarantee high performance in the future, we then ask:
o which of the three is likely to score best on future examples?

Fig 1: Linearly separable classes can be separated in infinitely many different ways.


Numerical Domains and Support Vector Machines
▪ The Support Vector Machine (SVM): mathematicians who
studied this problem found an answer.
▪ In Fig 1, the dotted-line classifier all but touches the nearest
examples on either side; its margin is small.

▪ Conversely, the margin is greater in the case of the solid-line classifier.


▪ Margin: the perpendicular distance from the decision
boundary to the closest points on either side.
▪ Hence, the greater the margin, the higher the chances that the
classifier will do well on future data.

Numerical Domains and Support Vector Machines


▪ The SVM technique is illustrated in Fig. 2.

Fig 2

▪ The solid line is the best classifier.


▪ The graph also shows two thinner lines,
o parallel to the classifier, each at the same distance from it;
o they pass through the examples called support vectors.
▪ Support vectors: the set of points closest to the maximum-
margin decision boundary.
o They define or support the decision boundary.


Numerical Domains and Support Vector Machines
▪ The task for machine learning is to identify the support vectors
that maximize the margin.

▪ The simplest technique tries all possible n-tuples of examples
and measures the margin implied by each such choice.
▪ This, however, is unrealistic in domains with many examples.
▪ Most of the time, therefore, engineers rely on the software
packages available for free on the internet.

The Case When the Data Are Linearly Separable

▪ To explain SVMs, let’s first look at the simplest case – a


two-class problem where the classes are linearly separable.
▪ Let the data set D be given as (X1, y1), (X2, y2), … , (Xn, yn),
where each Xi is a training tuple with associated class
label yi.
▪ Each yi can take one of two values,
o either +1 or -1 (i.e., yi ∈ {+1, -1}),
o corresponding to the classes buys_computer = yes and
buys_computer = no, respectively.



The Case When the Data Are Linearly Separable
▪ Consider an example based on two attributes, A1 and A2.

Fig 3

▪ An infinite number of separating lines could be drawn.


▪ We want to find the “best” one with the minimum
classification error on previously unseen tuples.
▪ Generalizing to n dimensions, we want to find the best
hyperplane.
▪ An SVM approaches this problem by searching for the
Maximum Marginal Hyperplane (MMH).

The Case When the Data Are Linearly Separable


▪ Consider Fig 4, which shows two possible separating
hyperplanes and their associated margins.

Fig 4

▪ We expect the hyperplane with the larger margin to be


more accurate at classifying future data tuples than the
hyperplane with the smaller margin.
▪ This is why (during the learning phase) the SVM
searches for the hyperplane with the largest margin,
that is, the maximum marginal hyperplane (MMH).


Decision Function

▪ A separating hyperplane can be written as


wT x + b = 0 ----- (1)
o w is a weight vector, namely, wT = [w1, w2, …, wn]; n is the
number of attributes;
o x = [x1, …, xn]T is the vector of attribute values;
o and b is a scalar, often referred to as a bias.
▪ The learning task involves choosing the values of w and b
o based on the training data
o by maximizing the quantity called margin.

Decision Function
▪ If we think of b as an additional weight, w0, we can rewrite
Eq. (1) for the two-attribute case as
w0 + w1x1 + w2x2 = 0

Fig 5

▪ Thus, any point that lies above the separating hyperplane


satisfies
w0 + w1x1 + w2x2 > 0
▪ Similarly, any point that lies below the separating
hyperplane satisfies
w0 + w1x1 + w2x2 < 0
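To make the two inequalities concrete, here is a minimal Python sketch that classifies points by the sign of w0 + w1x1 + w2x2; the weights and points are made-up illustrative values, not taken from the notes.

```python
# Classify 2-D points by the sign of w0 + w1*x1 + w2*x2.
# The weights and points below are illustrative values only.
import numpy as np

w0 = -3.0
w = np.array([1.0, 1.0])                 # [w1, w2]
points = np.array([[2.0, 2.5],           # lies above the hyperplane -> positive side
                   [0.5, 1.0]])          # lies below the hyperplane -> negative side

print(np.sign(w0 + points @ w))          # [ 1. -1.]
```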



Decision Function

▪ Decision function: fnew = sign(wTx + b)

▪ fnew is invariant to scaling its argument by a positive
constant.
▪ This means that we can multiply (wTx + b) by any positive
constant and the output of the sign function will be unchanged.
▪ Therefore we can decide to fix the scaling of w and b such
that wTx + b = ±1 for the closest points on either side.


Decision Function
▪ Hence the hyperplanes defining the “sides” of the margin
can be written as
o H1 : w0 + w1x1 + w2x2 ≥ 1 for yi = +1
o H2 : w0 + w1x1 + w2x2 ≤ -1 for yi = -1

▪ Any tuple that falls on or above H1 belongs to class +1, and
any tuple that falls on or below H2 belongs to class -1.
▪ Combining the two inequalities, we get
yi (w0 + w1x1 + w2x2) ≥ 1 for all i
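A quick numeric check of this combined constraint, again with illustrative weights and data rather than anything from the notes:

```python
# Verify y_i * (w0 + w1*x1 + w2*x2) >= 1 for every training tuple.
import numpy as np

w0 = -3.0
w = np.array([1.0, 1.0])
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.5, 0.5], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])

print(np.all(y * (w0 + X @ w) >= 1))   # True: every point is on or outside its margin hyperplane
```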


Maximizing the margin
▪ It is easiest to compute the margin using one point from
each class.
▪ X1 and X2 are the closest points from the two classes.

Fig 6

▪ 2γ (i.e., twice the margin) is equal to the component
of the vector joining X1 and X2 in the direction
perpendicular to the boundary.
▪ The vector joining X1 and X2 is given by X1 – X2.
▪ The direction perpendicular to the decision boundary
is given by w/‖w‖.

Maximizing the margin


▪ The inner product of these two quantities gives
2γ = wT(X1 − X2) / ‖w‖
• To maximize the margin, we must maximize 1/‖w‖.
• There are, however, some constraints:
yn (wTxn + b) ≥ 1
• It will actually be easier to minimize (1/2)‖w‖2.
• Formally, our optimization problem has become
argmin_w (1/2)‖w‖2
subject to yn (wTxn + b) ≥ 1, for all n
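A small numeric sketch of the margin formula above; w and the two closest points are made-up values, not from the lecture.

```python
# Compute 2*gamma = w^T (X1 - X2) / ||w|| for illustrative values.
import numpy as np

w = np.array([1.0, 1.0])
X1 = np.array([2.0, 2.0])     # closest point from the positive class
X2 = np.array([1.0, 1.0])     # closest point from the negative class

two_gamma = w @ (X1 - X2) / np.linalg.norm(w)
print(two_gamma)              # 1.414..., so the margin gamma is about 0.707
```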


Method of Lagrange multipliers
▪ Consider the problem of finding the minimum or maximum of
the function f(x) subject to the restriction that x must satisfy all
the equations
g1(x) = b1
g2(x) = b2
⋮
gm(x) = bm
▪ A classical method of dealing with this problem is the method of
Lagrange multipliers. The procedure begins by formulating the
Lagrangian function
h(x, λ) = f(x) − Σ_{i=1}^{m} λi [gi(x) − bi]
▪ where the new variables λ = (λ1, λ2, …, λm) are called Lagrange
multipliers. Notice the key fact that, for the feasible values of x,
gi(x) – bi = 0 for all i,
so h(x, λ) = f(x).
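As a small worked example of the method (not one from the lecture), the sketch below uses SymPy to minimize f(x1, x2) = x1² + x2² subject to x1 + x2 = 1:

```python
# Lagrange multipliers with SymPy: build h(x, lambda) and set its partial
# derivatives to zero.  The function and constraint are illustrative only.
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lambda')
f = x1**2 + x2**2            # objective f(x)
g = x1 + x2 - 1              # constraint written as g(x) - b = 0

h = f - lam * g              # Lagrangian h(x, lambda) = f(x) - lambda*(g(x) - b)

stationary = sp.solve([sp.diff(h, v) for v in (x1, x2, lam)], (x1, x2, lam), dict=True)
print(stationary)            # [{x1: 1/2, x2: 1/2, lambda: 1}]
```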

Maximizing the margin


▪ Our new objective function is
argmin_{w,α} (1/2) wTw − Σ_{i=1}^{N} αi (yi (wTxi + b) − 1)
subject to αi ≥ 0 for all i
▪ To solve it, we need to incorporate constraints into the
objective function through a set of Lagrangian multipliers.
▪ Lagrangian multipliers add a new term in objective function
for each constraint such that
o the optimum of the new objective function corresponds to
o the optimum of the original constrained problem.
▪ In our case, we need N Lagrangian terms.
▪ Each has an associated Lagrange multiplier which itself is
constrained to be positive.


Maximizing the margin
▪ Here we have used the fact that ‖w‖2 = wTw.
▪ At an optimum of this new objective function, the partial
derivatives of the objective function with respect to w and b
must be zero.
▪ Setting these derivatives to zero gives
w = Σ_{i=1}^{N} αi yi xi  and  Σ_{i=1}^{N} αi yi = 0.
▪ Substituting these back yields a new objective function which
must be maximized with respect to the αi rather than w:
Σ_{i=1}^{N} αi − (1/2) Σ_{i,j=1}^{N} αi αj yi yj xiTxj

▪ Notice that w doesn’t feature at all in this optimization


problem.
▪ This optimization problem is a constrained quadratic
programming task, quadratic because of the αiαj term.
▪ There is no analytical solution, but it is reasonably
straightforward to solve numerically.
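As an illustration of how the dual can be solved numerically, here is a hedged sketch that uses SciPy's general-purpose SLSQP solver on a tiny made-up dataset (dedicated SVM packages use specialised quadratic-programming routines instead); the equality constraint comes from setting the derivative with respect to b to zero, as noted above.

```python
# Solve the hard-margin dual numerically: maximize
#   sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
# subject to alpha_i >= 0 and sum_i alpha_i y_i = 0.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])   # toy, linearly separable
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)
K = (y[:, None] * X) @ (y[:, None] * X).T         # K[i, j] = y_i y_j x_i^T x_j

def neg_dual(alpha):                              # maximizing the dual = minimizing its negative
    return -(alpha.sum() - 0.5 * alpha @ K @ alpha)

constraints = {'type': 'eq', 'fun': lambda a: a @ y}   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * N                             # alpha_i >= 0

res = minimize(neg_dual, np.zeros(N), method='SLSQP',
               bounds=bounds, constraints=constraints)
alpha = res.x

w = (alpha * y) @ X                               # w = sum_i alpha_i y_i x_i
support = alpha > 1e-6                            # support vectors have alpha_i > 0
b = np.mean(y[support] - X[support] @ w)          # b = y_n - w^T x_n at the closest points
print(alpha.round(3), w.round(3), round(b, 3))
```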

Making predictions

o Given a set of optimal αn, how do we go about


making predictions?
o Our decision function, ynew = sign(wT xnew + b), is
based on w and b, not αn.
o To convert it into a function of the αi, we substitute the
expression for w obtained earlier (w = Σ_{i=1}^{N} αi yi xi),
resulting in
ynew = sign(Σ_{i=1}^{N} αi yi xiT xnew + b)
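A minimal sketch of this prediction rule; the variable names are my own, and alpha and b are assumed to come from a solver such as the one sketched earlier.

```python
# y_new = sign( sum_i alpha_i * y_i * x_i^T x_new + b )
import numpy as np

def predict(x_new, X, y, alpha, b):
    # Only the support vectors (alpha_i > 0) actually contribute to the sum.
    return np.sign((alpha * y) @ (X @ x_new) + b)
```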



Making predictions

o To find b, we will use the fact that for the closest


points, yn(wTxn + b) = 1.
o Substituting the expression for w into this and re-
arranging allows us to calculate b (note that 1/yn = yn):
b = yn − Σ_{j=1}^{N} αj yj xjT xn
o Where xn is any one of the closest points.
o This gives us everything we need to be able to
classify any xnew.
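The same calculation as a short sketch; xn is taken to be any point with a strictly positive multiplier, and the names are mine rather than the lecture's.

```python
# b = y_n - sum_j alpha_j * y_j * x_j^T x_n, evaluated at one closest point.
import numpy as np

def offset(X, y, alpha):
    n = int(np.argmax(alpha))                # index of one support vector (alpha_n > 0)
    return y[n] - (alpha * y) @ (X @ X[n])
```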


Hard vs Soft Margins


▪ Since the decision boundary found by maximizing the margin
o depends only on the closest points,
o we could discard all of the other data and
o end up with exactly the same decision boundary.
▪ This is reflected by the fact that, at the optimum, all the
αi that do not correspond to support vectors will be zero.
▪ For large problems, this can be a very useful feature.
▪ Even for an SVM trained on a large amount of data, the
decision function might involve only a small subset of the
training data.



Hard vs Soft Margins
Hard margins
▪ Fig 7 shows a binary dataset and the resulting decision
boundary (wTx + b = 0, where w = Σ_{i=1}^{N} αi yi xi),
along with the three support vectors (large grey circles).

Fig 7

▪ These are the only points for which αi > 0, and hence the only
ones that need to be used when classifying new data.

Hard vs Soft Margins


Hard margins
▪ Although it can be efficient to base the decision on only three of
the training points, it will not always be a good thing.
▪ To illustrate why, consider fig 8.

Fig 8

▪ There is one difference from fig 7 – the support vector from the
class denoted by grey squares has been moved closer to the other class.
▪ Moving this single data point has had a large effect on the
position of the decision boundary.



Hard vs Soft Margins
Hard margins
▪ This is another example of overfitting – we are allowing the data
to have too much influence.

• To see why this happens, look at our original constraints:


yn(wTxn + b) ≥ 1 -------- (3)
▪ This means that all training points have to sit on the correct
side of the decision boundary.
▪ This type of SVM is known as a hard margin SVM.

Hard vs Soft Margins


▪ It will sometimes be sensible to relax this constraint
using a soft margin.
Soft margins
• To allow points to potentially lie on the wrong side of
the boundary, we need to slacken the constraints in
our original formulation.
• In particular, we need to adapt eqn 3 so that it admits
the possibility of some points lying closer to (or on the
wrong side of) the decision boundary.
• To achieve this, the constraint becomes:
yi (wTxi + b) ≥ 1 − ξi
where ξi ≥ 0



Hard vs Soft Margins
Soft margins
▪ If 0 ≤ ξi ≤ 1, the point lies on the correct side of the
boundary but within the margin band.
▪ If ξi ≥ 1, the point lies on the wrong side of the
boundary.
▪ Our optimization task becomes
Minimize (1/2) wTw + C Σ_{i=1}^{N} ξi
▪ subject to ξi ≥ 0 and yi (wTxi + b) ≥ 1 − ξi, for all i
▪ The new parameter C controls to what extent we are
willing to allow points to sit within the margin band or
on the wrong side of the decision boundary.
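Note that for any fixed w and b the smallest slacks satisfying these constraints are ξi = max(0, 1 − yi(wTxi + b)); the sketch below evaluates them for made-up values.

```python
# Smallest feasible slacks xi_i = max(0, 1 - y_i * (w^T x_i + b)).
# The weights and data are illustrative values only.
import numpy as np

w, b = np.array([1.0, 1.0]), -3.0
X = np.array([[2.0, 2.0], [2.4, 2.4], [1.2, 1.8], [0.0, 0.0]])
y = np.array([1, 1, -1, -1])

xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print(xi)   # 0 outside the margin band, between 0 and 1 inside it, above 1 if misclassified
```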


Hard vs Soft Margins


Soft margins
▪ If we follow the same steps that we took for the hard
margin case, we find that this change in the model has
only a very small effect on the maximization problem.
▪ Omitting the details, we now need to find the
maximum of the following quadratic programming
problem:
Maximize Σ_{i=1}^{N} αi − (1/2) Σ_{i,j=1}^{N} αi αj yi yj xiTxj
▪ subject to
Σ_{i=1}^{N} αi yi = 0 and 0 ≤ αi ≤ C, for all i
▪ The only difference is an upper bound (C) on αn.



Hard vs Soft Margins
Soft margins
• The influence of each training point in our decision
function is proportional to αn.
• We are therefore imposing an upper bound on the
influence that any one training point can have.
• For the example in fig 8, the support vector from the
grey class had αi = 5.45.
• Setting C to 1 would result in a change in the decision
boundary (some other αn from the grey square class will
have to become non-zero), moving the boundary back
towards the other objects in the grey square class.


Hard vs Soft Margins


Soft margins
▪ This is exactly what happens, as we can see from fig 9,
where we plot the decision boundary and support vectors
for C = 1 and C = 0.01.

Fig 9

▪ As C decreases, the maximum potential influence of each


training point is eroded and so more and more of them
become active in the decision function.
▪ Using a soft margin gives us a free parameter (C) that needs
to be fixed.
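This behaviour is easy to reproduce with an off-the-shelf implementation; the sketch below (using scikit-learn on random data rather than the lecture's dataset) fits a linear soft-margin SVM with C = 1 and C = 0.01 and reports how many support vectors each produces.

```python
# Compare soft-margin SVMs with C = 1 and C = 0.01 on a random 2-D dataset.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2.0, 2.0], 0.8, size=(20, 2)),
               rng.normal([-2.0, -2.0], 0.8, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

for C in (1.0, 0.01):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # With smaller C, more alpha_i hit the bound and more points become support vectors.
    print(f"C={C}: {len(clf.support_)} support vectors")
```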
