ML Module 4
Learning with Classification
Syllabus
At the end of this unit you should be able to understand and comprehend the following syllabus topics :
• Support Vector Machine
o Constrained Optimization
o Optimal decision boundary
o Margins and support vectors
• SVM is a supervised learning method. You plot each datapoint as a point in an n-dimensional space (where n is the number of datapoint attributes you have). Then, you perform classification by finding the hyperplane that adequately differentiates the two classes. The larger the gap between the boundary datapoints of the two classes (the closest datapoints of one class and the closest datapoints of the other class), the better.
Fig. 4.1.1 : (a) Possible hyperplanes (b) Support vectors
• Hyperplanes are decision boundaries that help classify the datapoints. The dimension of the hyperplane depends upon the number of input data attributes. If the number of input data attributes is 2, then the hyperplane is just a line. If the number of attributes is 3, then the hyperplane becomes a two-dimensional plane, and likewise.
Fig. 4.1.3 : A hyperplane in R² is a line; a hyperplane in R³ is a plane
• The region bounded by these two hyperplanes is called the "margin", and the maximum-margin hyperplane is the hyperplane that lies halfway between them. With a normalised or standardised dataset, these hyperplanes can be described by the following equations.
Hyperplane 1 is described as w · x - b = 1, which classifies the data into the first class.
Hyperplane 2 is described as w · x - b = -1, which classifies the data into the second class.
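As a hedged illustration (the toy data and the use of scikit-learn are my own assumptions, not part of the text), the fitted w and b of a linear SVM can be inspected directly, and the margin hyperplanes w · x - b = ±1 pass through the support vectors:

```python
# A minimal sketch (assumed toy data, scikit-learn API): read off the learned
# hyperplane w.x - b = 0 and check the margin planes w.x - b = +/-1.
import numpy as np
from sklearn.svm import SVC

# Two tiny, linearly separable clusters (hypothetical data).
X = np.array([[0.2, 0.3], [0.3, 0.2], [0.4, 0.4],
              [0.8, 0.9], [0.9, 0.8], [0.7, 0.9]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e3)   # a large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]          # weight vector w
b = -clf.intercept_[0]    # scikit-learn uses w.x + intercept, so b = -intercept
print("w =", w, " b =", b)

# Support vectors land (approximately) on the margin planes, |w.x - b| ~ 1;
# all other training points lie strictly outside the margin.
for x in X:
    print(x, round(float(np.dot(w, x) - b), 3))
```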
4.2 Constrained Optimisation
• Constrained optimisation is the process of optimising an objective function with respect to some variables in the presence of constraints on those variables.
• The objective function is either a cost function or energy function, which is to be minimised, or a reward function
or utility function, which is to be maximised. Constraints can be either hard constraints, which set conditions for the
variables that are required to be satisfied, or soft constraints, which have some variable values that are penalised in
the objective function if, and based on the extent that, the conditions on the variables are not satisfied .
• Typically, constrained optimisation helps you in identifying feasible solutions out of a very large and complex set of
possibilities. Let's understand this by taking a very simplistic example.
• Have you ever gone for trekking? Have you ever enjoyed watching or playing one of the games in Takeshi's castle?
I am sure you would have gone trekking in one terrain or another, or tried some other form of climbing up and then coming down.
Fig. 4.2.1
• Now, let me tell you one thing. Assume that you are on top of a rocky hill (or on one of the stones in a pond) and you want to go down (or cross the series of stones to reach the land from the pond). How do you decide the way to go down the hill or cross the pond with the help of the stones? Do you just start walking as if you are on a plain, flat road or as if you are on a well-laid out footpath? Would you be confident that just walking randomly would not be harmful and would not cause you any injuries due to a fall?
• Surely not. You do not just walk, as you do in normal day-to-day circumstances, on a rocky hill or a stony pond. At every step, you carefully evaluate two things :
1. Which direction to go, and
2. How far to go in that direction
• Both the direction as well as the distance to cover in that direction are equally important, aren't they? A few extra steps in the wrong direction could lead to injury, lost energy, or backtracking. When you need to choose between a set of paths to go down, you usually pick the most promising path, the one that is likely to get you down more quickly than the other potential paths and provide maximum descent downhill.
• That is the general concept behind constrained optimisation. You use derivative information to determine a search direction and then descend in that direction, for a calculated distance, in the hope of finding an optimal value.
• Assume that you have a function f(x). Now, consider its value at a nearby point, given by the first-order approximation
f(x + ε) ≈ f(x) + ε f'(x)
• You can reduce f(x) by moving x in small steps with the opposite sign of the derivative.
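A minimal sketch of this idea (the objective f(x) = (x - 3)² and the step size are my own toy choices, not from the text): repeatedly step against the sign of the derivative to reduce f(x).

```python
# Gradient-descent sketch: move x opposite to the sign of f'(x) in small steps.
def f(x):
    return (x - 3.0) ** 2          # hypothetical objective, minimum at x = 3

def f_prime(x):
    return 2.0 * (x - 3.0)         # its derivative

x = 0.0            # starting point
epsilon = 0.1      # small step size
for _ in range(50):
    x = x - epsilon * f_prime(x)   # step in the direction opposite to the derivative

print(x, f(x))     # x approaches 3, where f is minimal
```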
, • Lagrange multiplier is introduced for each constraint.
L(w, b, α) = ||w||²/2 - Σ (n = 1 to N) αn [yn(w · xn - b) - 1]
Expanding the bracket,
L(w, b, α) = ||w||²/2 - Σ (n = 1 to N) [αn yn (w · xn) - αn yn b - αn]
• To minimise, take partial derivatives of L with respect to w and b and set them to 0.
∂L/∂w = 0, which gives
w = Σ (n = 1 to N) αn yn xn
∂L/∂b = 0, which gives
Σ (n = 1 to N) αn yn = 0
• As you notice, you have eliminated the dependency on w and b, and the minimisation problem now only depends on the training examples (xn, yn) and the values of αn. In most cases, αn would be 0 unless a training instance xn satisfies the equation yn(w · xn - b) = 1. Thus, the training examples with αn > 0 lie on the hyperplane margins and hence are the support vectors.
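As a hedged sketch (the toy data and the use of scikit-learn are assumptions, not part of the text), the fitted dual quantities can be inspected directly: SVC keeps only the training points with αn > 0 as support vectors, and dual_coef_ stores αn · yn for them.

```python
# Sketch: after fitting, only training points with alpha_n > 0 are retained as
# support vectors; dual_coef_ holds alpha_n * y_n for those points.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.2, 0.3], [0.3, 0.1], [0.4, 0.4],
              [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]])
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("alpha_n * y_n  :", clf.dual_coef_[0])

# w can be recovered as the sum over support vectors of (alpha_n * y_n) * x_n.
w = clf.dual_coef_[0] @ clf.support_vectors_
print("w =", w, " (same as clf.coef_[0]:", clf.coef_[0], ")")
```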
Practice Questions
Ex. 4.3.1 : Given the following data, calculate the hyperplane. Also, classify (0.6, 0.9) based on the calculated hyperplane.

x1      x2      y    α
0.38    0.47    +    65.52
0.49    0.61    -    65.52
0.92    0.41    -    0
0.74    0.89    -    0
0.18    0.58    +    0
0.41    0.35    +    0
0.93    0.81    -    0
0.21    0.1     +    0
Soln. :
As you see, the value of α is non-zero for only the first two training examples. Hence, the first two training examples are the support vectors.
w = Σ (n = 1 to N) αn yn xn
w1 = 65.52 × 1 × 0.38 + 65.52 × (-1) × 0.49 = -7.2
w2 = 65.52 × 1 × 0.47 + 65.52 × (-1) × 0.61 = -9.2
The parameter b can be calculated for each support vector as follows.
Fig. P. 4.3.1(a)
Fig. P. 4.3.1(b) : Positive and negative samples with the separating hyperplane
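The value of b and the final classification of (0.6, 0.9) are not fully reproduced above, so the following sketch (my own verification, assuming that w · x - b = y holds with equality at the two support vectors) recomputes the hyperplane from the given α values:

```python
# Sketch: recompute w and b for Ex. 4.3.1 from the two support vectors,
# then classify the query point (0.6, 0.9). The alpha values come from the table.
import numpy as np

support_x = np.array([[0.38, 0.47], [0.49, 0.61]])
support_y = np.array([+1.0, -1.0])
alpha     = np.array([65.52, 65.52])

# w = sum_n alpha_n * y_n * x_n
w = (alpha * support_y) @ support_x
print("w =", w)                       # roughly [-7.2, -9.2]

# b from w.x - b = y at each support vector, averaged over both (an assumption).
b = np.mean(support_x @ w - support_y)
print("b =", b)                       # roughly -8.1

# Classify (0.6, 0.9): the sign of w.x - b decides the class.
query = np.array([0.6, 0.9])
score = w @ query - b
print("score =", round(float(score), 2), "=> class", "+" if score >= 0 else "-")
```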
• So far, you learnt about hyperplanes that could linearly separate the data into two classes using a line very clearly.
But what if your dataset looks like the following examples?
Fig. 4.3.1 : Datasets in which the classes cannot be separated by a straight line
• As you see, you cannot really draw a line separating out the classes. In such scenarios, you use a kernel function to map the datapoints such that they can be "uplifted" to higher dimensions and then could possibly be classified.
Fig. 4.3.3 : A kernel function Φ maps the datapoints into a higher-dimensional space (x'1, x'2) where they become linearly separable
• The kernel function is often denoted by φ(x). You can choose the type of kernel function as appropriate for the problem in hand. Some of the commonly used kernel functions are the polynomial function, the sigmoid function, and the radial basis function. This technique of applying a mapping kernel function to adjust the data is also called the kernel trick.
Kernel          Function
Linear          K(x, y) = (1 + x^T y)
Polynomial      K(x, y) = (1 + x^T y)^d
Sigmoid         K(x, y) = tanh(κ x^T y - δ)
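As a hedged sketch (the circular toy dataset and the scikit-learn calls are my own assumptions), switching between kernel functions is usually just a parameter change, and the non-linear kernels handle data that a linear kernel cannot:

```python
# Sketch: trying different kernel functions on the same (hypothetical) dataset.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)   # circular class boundary

for kernel in ["linear", "poly", "sigmoid", "rbf"]:
    clf = SVC(kernel=kernel, degree=3, gamma="scale").fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))

# The polynomial and RBF kernels separate the circular pattern far better than
# the linear kernel, which is exactly the point of the kernel trick.
```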
4.3.3 Comparison between Logistic Regression and SVM
As you learned, logistic regression and SVM are both used for data classification. However, SVM is considered better than logistic regression, as SVM can classify non-linear data as well. Table 4.3.2 summarises the differences between logistic regression and SVM.
Table 4.3.2 : Comparison between Logistic Regression and SVM
4.4 SVM for Non-linear Classification using Radial Basis Functions (RBF)
• As you briefly learnt in the "Kernels for Learning Non-Linear Functions (Kernel Trick)" section, the Radial Basis Function (RBF) is one of the popular kernel tricks used to classify non-linear data using SVM. Mathematically, it is denoted as following.
K(x, y) = e^(-(x - y)² / (2σ²))
• Often, 1/(2σ²) is represented as γ. So, RBF can be re-written as
K(x, y) = e^(-γ(x - y)²)
• The way it works is that it finds the influence of the datapoint to be classified with respect to the datapoints that are already classified, and assigns the higher-dimensional relationship based on that influence. The higher the influence of an already classified datapoint on the datapoint to be classified, the higher the chances of the new datapoint being classified as the one that influenced it the most.
• So, assume that the point x is known, and the point y is to be classified. Then, RBF calculates the distance between those two points in the infinite dimension as (x - y)². γ is the scaling factor that you may choose as required to make the distance numbers more significant.
Ex. 4.4.1 : Calculate the radial distance of point A having value 6 with respect to points B and C having values 8 and 12 respectively.
Soln. :
Assume γ = 1
Distance of point A with respect to B = e^(-γ(x - y)²) = e^(-1(6 - 8)²) = e^(-4) = 0.018
Distance of point A with respect to C = e^(-γ(x - y)²) = e^(-1(6 - 12)²) = e^(-36) ≈ 0
So, point B has higher influence over point A than point C.
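A quick sketch of the same computation (the small helper below is hypothetical, not from the text):

```python
# Sketch: RBF similarity e^(-gamma * (x - y)^2) for the 1-D points in Ex. 4.4.1.
import math

def rbf(x, y, gamma=1.0):
    """Radial basis similarity between two scalar points."""
    return math.exp(-gamma * (x - y) ** 2)

print(rbf(6, 8))    # ~0.018   -> B influences A noticeably
print(rbf(6, 12))   # ~2.3e-16 -> C has essentially no influence on A
```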
Ex. 4.4.2 : Demonstrate how you could use RBF over the XOR function to draw a linear SVM.
Fig. P. 4.4.2 : The XOR function points
Soln. :
No matter how hard you try, you would not be able to separate out these points as in the case of a linear SVM. Let's see how RBF helps.
Note that there are two classes here, Y = {0, 1}, and four training samples, two for each class. {0, 1} and {1, 0} produce 1 (+ve class) whereas {0, 0} and {1, 1} produce 0 (-ve class). Assume γ = 1. Let's choose the elements {0, 1} and {1, 0} as the basis for RBF.
For the point {1, 1} :
e^(-γ(x - y)²) = e^(-1({1, 1} - {1, 0})²) = e^(-1(1 - 1, 1 - 0)²) = e^(-1) = 0.37
e^(-γ(x - y)²) = e^(-1({1, 1} - {0, 1})²) = e^(-1(1 - 0, 1 - 1)²) = e^(-1) = 0.37
So, RBF does the following transformations to the respective points.
Old Point      New Point
{0, 0}         {0.37, 0.37}
{0, 1}         {1, 0.018}
{1, 0}         {0.018, 1}
{1, 1}         {0.37, 0.37}
If you plot these points, you see that they are now clearly separable, as in the case of a linear SVM. Hence, as you see, RBF transformed the XOR points to make them linearly separable.
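A minimal sketch of the same idea (my own code, not from the text): map the XOR points through the two RBF basis points and fit a linear SVM on the transformed features. Note that the code uses the standard Euclidean squared distance, which gives 0.135 rather than 0.018 for one of the off-diagonal values in the table above; the transformed points remain linearly separable either way.

```python
# Sketch: hand-rolled RBF features for XOR, then a linear SVM on top.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])                       # XOR labels
basis = np.array([[0, 1], [1, 0]], dtype=float)  # the two chosen basis points
gamma = 1.0

def rbf_features(X, basis, gamma):
    """phi(x) = [exp(-gamma * ||x - b||^2) for each basis point b]."""
    d2 = ((X[:, None, :] - basis[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

Z = rbf_features(X, basis, gamma)
print(Z)    # the transformed XOR points are linearly separable

clf = SVC(kernel="linear", C=1e3).fit(Z, y)
print("training accuracy:", clf.score(Z, y))   # 1.0 on the four XOR points
```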
• As you understand, the core concept behind RBF is that inputs that are close together should generate the same
output, whereas inputs that are far apart should not.
• Points that are close together exercise more influence over each other than the points that are far apart.
• As you previously learnt about neural networks, for any input that you present to a set of neurons, some of them will fire strongly, some weakly, and some will not fire at all, depending upon the distance between the weights and the particular input in weight space.
• You can treat these nodes as a hidden layer, just as you did for the Multilayer Perceptrons, and connect up some output nodes in a second layer.
• This simply requires adding weights from each hidden (RBF) neuron to a set of output nodes. This is known as an RBF network.
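The following sketch (my own minimal construction, not from the text) wires a layer of RBF units to a linear output layer, which is the essence of an RBF network; the choice of centres and the least-squares fit of the output weights are simplifying assumptions:

```python
# Sketch of an RBF network: an RBF hidden layer followed by linear output weights.
import numpy as np

def rbf_layer(X, centres, gamma):
    """Hidden-layer activations: one RBF unit per centre."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)        # XOR-like target

# Use 10 randomly chosen training points as the RBF centres (an assumption).
centres = X[rng.choice(len(X), size=10, replace=False)]
H = rbf_layer(X, centres, gamma=2.0)

# Output weights by least squares, a simple stand-in for training the second layer.
W, *_ = np.linalg.lstsq(H, y, rcond=None)

pred = (H @ W > 0.5).astype(float)
print("training accuracy:", (pred == y).mean())
```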
• The RBF network would work in such a way that if the torch was at 2 o'clock or so, then you would do some of the 12 o'clock action and a bit more of the 3 o'clock action, but none of the 6 o'clock or 9 o'clock actions. So, you would move forwards and to the right.
4.5 Support Vector Regression (SVR)
• As you understand, Support Vector Machines (SVM) are popularly and widely used for classification problems in machine learning. Support Vector Regression (SVR) uses the same principle as SVM, but for regression problems.
• The goal of a regression problem is to find a function that approximates the mapping from an input domain to real numbers on the basis of a training sample.
Fig. 4.5.1 : Decision boundaries around the regression hyperplane
• As in the case of SVM, consider the two decision boundaries and a hyperplane. Your objective is to consider the points that lie within the decision boundary lines. The best fit line (or regression line) is the hyperplane that has the maximum number of points. Assume that the decision boundaries are at some distance, say 'a', from the hyperplane; these are the lines that you draw at distance '+a' and '-a' from the hyperplane. Based on SVM, the equation of the hyperplane is as following.
y = wx - b
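A hedged sketch (the noisy toy data and the scikit-learn calls are assumptions): in scikit-learn's SVR the epsilon parameter plays a role similar to the boundary distance 'a' described above, defining a tolerance band around the regression line.

```python
# Sketch: Support Vector Regression with an epsilon-insensitive band.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(60, 1)), axis=0)
y = 2.0 * X.ravel() - 1.0 + rng.normal(scale=0.3, size=60)   # noisy line y = 2x - 1

# epsilon defines the band around the hyperplane within which errors are ignored.
reg = SVR(kernel="linear", C=10.0, epsilon=0.2).fit(X, y)

print("slope     ~", reg.coef_[0][0])       # close to 2
print("intercept ~", reg.intercept_[0])     # close to -1
print("prediction at x = 3:", reg.predict([[3.0]])[0])
```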
4.6 Multi-class Classification Techniques
Multi-class classification is a task with more than two classes, for example, classifying a set of images of fruits which may be oranges, apples, or pears. Multi-class classification makes the assumption that each sample is assigned to one and only one class (or label). For example, a fruit can either be an apple or a pear, but not both at the same time. Another example could be recognising alphabets in an optical character recognition type of problem, where a given alphabet could be one of the 26 alphabets. Let's learn about multi-class classification techniques.
4.6.1 One vs One (OvO)
One vs One (OvO in short) is a heuristic method that splits the multi-class dataset into one binary classification dataset for each pair of classes. For example, consider a multi-class classification problem with four classes - 'red', 'blue', 'green', and 'yellow'. This could then be divided into six binary classification datasets as following.
Binary Classification Problem 1 : red vs blue
Binary Classification Problem 2 : red vs green
Binary Classification Problem 3 : red vs yellow
Binary Classification Problem 4 : blue vs green
Binary Classification Problem 5 : blue vs yellow
Binary Classification Problem 6 : green vs yellow
In the example given earlier, you have 4 classes. Hence, the number of binary datasets that would be created can be calculated as following.
No. of binary datasets for 4 classes = 4 × (4 - 1) / 2 = 6
Each binary classification model may predict one class label, and the class with the most predictions or votes is predicted. Similarly, if the binary classification models predict a numerical class membership, such as a probability, then the argmax of the sum of the scores is predicted as the class label.
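A hedged sketch of this voting scheme (the toy data is assumed; scikit-learn's OneVsOneClassifier is one possible off-the-shelf implementation):

```python
# Sketch: One-vs-One multi-class classification with pairwise linear SVMs.
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Four classes, mirroring the red/blue/green/yellow example.
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

# 4 classes -> 4 * (4 - 1) / 2 = 6 pairwise binary classifiers.
print("number of binary classifiers:", len(ovo.estimators_))
print("predicted class for one point:", ovo.predict(X[:1])[0])
```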
4.6.2 One vs Rest (OvR) (One vs All)
One vs Rest (OvR in short, also referred to as One-vs-All or OvA) is a heuristic method for using binary classification algorithms for multi-class classification. It involves splitting the multi-class dataset into multiple binary classification problems. A binary classifier is then trained on each binary classification problem, and predictions are made using the model that is the most confident. Basically, for each split dataset, you take one class as positive and all other classes as negative.
For example, consider a multi-class classification problem with four classes - 'red', 'blue', 'green', and 'yellow'. This
could be divided into four binary classification datasets as following.
Binary Classification Problem 1 : red vs [blue, green, yellow]
Binary Classification Problem 2 : blue vs [red, green, yellow]
Binary Classification Problem 3 : green vs [red, blue, yellow]
Binary Classification Problem 4 : yellow vs [red, blue, green]
This approach requires that one machine learning model is created for each class; for example, four classes require four models. Each model predicts a class membership probability or a probability-like score. The argmax of these scores (the class index with the largest score) is then used to predict a class. This approach is commonly used for algorithms that naturally predict a numerical class membership probability or score, such as Logistic Regression and Perceptron. One advantage of this approach is its interpretability. Since each class is represented by one and only one classifier, it is possible to gain knowledge about the class by inspecting its corresponding classifier. This is the most commonly used strategy and is a fair default choice.
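A hedged companion sketch (same assumed toy-data idea, using scikit-learn's OneVsRestClassifier) showing one classifier per class and the argmax over the per-class scores:

```python
# Sketch: One-vs-Rest multi-class classification, one binary classifier per class.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=4, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# 4 classes -> 4 binary classifiers (class k vs the rest).
print("number of binary classifiers:", len(ovr.estimators_))
print("per-class scores for one point:", ovr.decision_function(X[:1])[0])
print("predicted class (argmax):", ovr.predict(X[:1])[0])
```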
Table : Comparison between One vs One (OvO) and One vs Rest (OvR)
Comparison Attribute     One vs One (OvO)     One vs Rest (OvR)
Interpretability         Low                  High
Used                     Less commonly        More commonly
Here are a few review questions to help you gauge your understanding of this chapter. Try to attempt these questions and ensure that you can recall the points mentioned in the chapter.
Support Vector Machine (SVM)
Q. 1 Write a short note on SVM.
(4 Marks)
Q.2 With a diagram, explain maximum margin concept behind SVM.
(6 Marks)
Q.3 Write a short note on constrained optimisation.
(4 Marks)
Q.4 Given the following data, calculate hyperplane. Also, classify (0.6, 0.9) based on the calculated hyperplane.
(8 Marks)
x1      x2      y    α
0.38    0.47    +    65.52
0.49    0.61    -    65.52
0.92    0.41    -    0
0.74    0.89    -    0
0.18    0.58    +    0
0.41    0.35    +    0
0.93    0.81    -    0
0.21    0.1     +    0
Q. 10  Demonstrate how you could use RBF over the XOR function to draw a linear SVM.  (8 Marks)
Q. 11  Explain Radial Basis Function Network.  (8 Marks)
Q. 12  Write a short note on Support Vector Regression (SVR).  (4 Marks)
Multi-class Classification Techniques
Q. 13  Write a short note on multi-class classification techniques.  (4 Marks)
Q. 14  Explain One vs One (OvO) multi-class classification technique.  (4 Marks)
Q. 15  Explain One vs Rest (OvR) multi-class classification technique.  (4 Marks)
Q. 16  Explain One vs All (OvA) multi-class classification technique.  (4 Marks)
Q. 17  Compare One vs One (OvO) and One vs All (OvA) multi-class classification techniques.  (4 Marks)
Q. 18  Compare One vs One (OvO) and One vs Rest (OvR) multi-class classification techniques.  (4 Marks)