Learning with Classification

Syllabus

At the end of this unit you should be able to understand and comprehend the following syllabus topics:

• Support Vector Machine
  o Constrained Optimization
  o Optimal decision boundary
  o Margins and support vectors
  o SVM as constrained optimization problem
  o Quadratic Programming
  o SVM for linear and nonlinear classification
  o Basics of Kernel trick
  o Support Vector Regression
• Multiclass Classification

4.1 Support Vector Machines (SVM)

• SVM is a supervised learning method. You plot each datapoint as a point in an n-dimensional space (where n is the number of datapoint attributes you have). Then, you perform classification by finding the hyperplane that adequately differentiates the two classes. The higher the gap between the boundary datapoints of the two classes (the highest boundary datapoints of one class and the lowest boundary datapoints of the other class), the better.

Fig. 4.1.1 : (a) Possible hyperplanes; (b) Support vectors
• Hyperplanes are decision boundaries that help in classifying the datapoints. Datapoints falling on either side of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the number of input data attributes. If the number of input data attributes is 2, then the hyperplane is just a line. If the number of attributes is 3, then the hyperplane becomes a two-dimensional plane, and likewise.
Fig. 4.1.2 : (a) A hyperplane in R² is a line; (b) A hyperplane in R³ is a plane
• Support vectors are datapoints that are closer to the hyperplane and influence the position and orientation of the hyperplane. Using these support vectors, you maximise the margin of the classifier. Deleting the support vectors will change the position of the hyperplane. These are the points that help in building the SVM model.
4.1.1 Maximum Margin Linear Separators (Optimal Decision Boundary)
• As you understand, the larger the margin between the separating hyperplane and the classes, the better. Typically, for linearly separable data, the decision boundary (which, for two attributes, is a line) is given as w·x − b = 0.
• So, any positive point above the decision boundary classifies the input as one class and any negative point below the decision boundary classifies the input as another class. You can select two parallel hyperplanes that separate the two classes of data, so that the distance between them is as large as possible.
Fig. 4.1.3

• The region bounded by these two hyperplanes is called the "margin", and the maximum-margin hyperplane is the hyperplane that lies halfway between them. With a normalised or standardised dataset, these hyperplanes can be described by the following equations.

Hyperplane 1 is described as w·x − b = 1, which classifies the data into the first class.

Hyperplane 2 is described as w·x − b = −1, which classifies the data into the second class.

• Geometrically, the distance between these two hyperplanes (the margin) is given as m = 2/||w||. So, to maximise the distance between the planes, you want to minimise ||w||.
• Note here that w and x in the above equations are vectors. Since SVM is a supervised learning method, x denotes the input sample features such as x1, x2, x3, ..., xn; w denotes the set of weights wi for each feature; and b is the bias that can be added to the hyperplane to shift (adjust) it towards a particular class as required. yi is the resultant classification based on the SVM hyperplanes. So, the datapoints must lie on the correct side of the margin for classifying them correctly.
So,

If w·x − b ≥ 1, then yi = 1 (positive classification)

If w·x − b ≤ −1, then yi = −1 (negative classification)

So, you want to minimise ||w|| (the norm of w) subject to yi(w·xi − b) ≥ 1 for i = 1, ..., n.
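This optimisation can be tried out directly. The sketch below is illustrative only: it assumes scikit-learn and NumPy are available and uses made-up toy points (not data from this chapter). It fits a linear SVM with a very large C to approximate the hard-margin formulation, then reads back w, b and the margin width 2/||w||.

```python
# Illustrative sketch (assumed scikit-learn API, made-up toy data): fit a
# near-hard-margin linear SVM and inspect w, b, the margin and the support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 1.0],    # class -1
              [5.0, 5.0], [6.0, 4.0], [6.0, 6.0]])   # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

w = clf.coef_[0]              # weight vector w
b = -clf.intercept_[0]        # sklearn's decision function is w.x + intercept, i.e. w.x - b
margin = 2.0 / np.linalg.norm(w)

print("w =", w, ", b =", b)
print("margin 2/||w|| =", margin)
print("support vectors:\n", clf.support_vectors_)
```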
4.2 Constrained Optimisation
Definition : Constrained optimisation is the process of optimising an objective function with respect to some variables in the presence of constraints on those variables.

The objective function is either a cost function or energy function, which is to be minimised, or a reward function
or utility function, which is to be maximised. Constraints can be either hard constraints, which set conditions for the
variables that are required to be satisfied, or soft constraints, which have some variable values that are penalised in
the objective function if, and based on the extent that, the conditions on the variables are not satisfied .

• Typically, constrained optimisation helps you in identifying feasible solutions out of a very large and complex set of possibilities. Let's understand this by taking a very simplistic example.

• Have you ever gone trekking? Have you ever enjoyed watching or playing one of the games in Takeshi's Castle? I am sure you would have gone trekking in one terrain or another, or tried some other form of climbing up and then coming down.

Fig. 4.2.1
"- , w te rne one thing. Assume that you are on top of a rocky h.ll I (or on one of the stones in a d) d .
11
, 1'10nt' to 9 down (or cross the series of stones to reach the land from t he pond) How do you de ·d ponth an you
wa 0
down thehill or cross the pond.
w~h the help of stones? 00 you . t st .· . er e e way to go
JUS art walking as if you are on a plane flat road
or as ·t you are on a well-laid out footpath?
• • • Would you be confident th at Just
. walking
. randomly would not be
1
'-•I and would not cause you any mJunes due to fall?
narrt11u
. . · Ok. you do not just walk, as you• do in normal day t 0 day circumstances,
5111111n9 · on a rocky hill or a stony pond. At
1
' tep you carefully evaluate two things:
every s '
Which direction to go and
1.
HoW far to go in that direction
2.
• Both the direction as well as the distance to cover in that direction are equally important, isn't it? A few extra steps in the wrong direction could lead to injury, lost energy, or backtracking. When you need to choose between a set of paths to go down, you usually pick up the most promising path that is likely to get you down more quickly than other potential paths and provide the maximum descent downhill.
• That is the general concept behind constrained optimisation. You use derivative information to determine a search direction and then descend in that direction, for a calculated distance, in the hope of finding an optimal value.

• Generally (mathematically) speaking, consider a simple function

y = f(x)

• The derivative of the function is denoted as f'(x) or as dy/dx.
• The derivative f'(x) gives the slope of f(x) at point x. It specifies how a small change in the input x affects the change in y or f(x).
• Now, consider the following approximation.

f(x + ε) ≈ f(x) + ε f'(x)

• You can reduce f(x) by moving x in small steps with the opposite sign of the derivative.
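As a small numerical illustration of this idea (the function f(x) = (x − 3)² and the step size below are made up for the sketch, not taken from the text), repeatedly stepping against the sign of the derivative drives x towards the minimum:

```python
# Illustrative sketch: reduce f(x) by moving x in small steps with the opposite
# sign of the derivative, using f(x + eps) ~ f(x) + eps * f'(x).
def f(x):
    return (x - 3.0) ** 2        # simple convex function with its minimum at x = 3

def f_prime(x):
    return 2.0 * (x - 3.0)       # derivative of f

x = 0.0                          # starting point
step = 0.1                       # how far to move in the chosen direction
for _ in range(50):
    x -= step * f_prime(x)       # direction = opposite sign of the derivative

print("x after descent:", round(x, 4), ", f(x):", round(f(x), 6))   # x -> 3, f(x) -> 0
```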

4.3 SVM as Constrained Optimisation Problem


nt to minimise Uwll (the absolute value of w without sign) subject to
So, you learnt ea .ier that for SVM, youkwalosely this is a constrained optimisation problem. You are require · d to
rl
. 1 2
Yi(w * Xi - b) 1 for 1 = , ... · n If you Ioo c ,
w * x· _ b) > 1 for i = 1, 2 ... n. In the next section, you would learn how to go
. . . IIwII given
mm1m1se . the constraint. that
. Yi(
. '
roblem - SVM through quadratic programming.
in
. rained opt1m1sat1on p
4.3.1 Quadratic Programming Solution to Finding Maximum Margin Separators

• As you understand, your goal is to minimise ||w|| (the norm of w) subject to yi(w·xi − b) ≥ 1 for i = 1, 2, ..., n.
• Minimising ||w|| is equivalent to minimising ||w||² or ||w||²/2, which is a quadratic and convex objective function. So, you need to solve a constrained and convex optimisation problem, a quadratic program with n linear inequality constraints, to find the maximum margin.
• To solve this quadratic program (or convex optimisation problems in general), Lagrange's multiplier method is commonly used.
• A Lagrange multiplier αn is introduced for each constraint.

L(w, b, α) = ||w||²/2 − Σn αn [yn(w·xn − b) − 1]

which, expanded, is

L(w, b, α) = ||w||²/2 − Σn [αn yn (w·xn) − αn yn b − αn]

where the sum runs over n = 1, ..., N and αn ≥ 0.
• To minimise, take the partial derivatives of L with respect to w and b and set them to 0.

∂L/∂w = 0, which gives w = Σn αn yn xn

∂L/∂b = 0, which gives Σn αn yn = 0
• As you notice, you have eliminated the dependency on w and b, and the minimisation problem now only depends on the training examples (xn, yn) and the values of αn. In most cases, αn would be 0, unless a training instance xn satisfies the equation yn(w·xn − b) = 1. Thus, the training examples with αn > 0 lie on the hyperplane margins and hence are the support vectors.
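The dual problem that falls out of this derivation (maximise Σn αn − ½ Σn Σm αn αm yn ym xn·xm subject to αn ≥ 0 and Σn αn yn = 0) can be handed to a general-purpose solver. The sketch below is an illustration under assumptions: it uses SciPy's SLSQP solver and a made-up four-point dataset, which is not the textbook's example.

```python
# Illustrative sketch (assumed NumPy/SciPy, toy data): solve the SVM dual QP
# and recover w, b and the support vectors from the resulting alphas.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 2.0], [2.0, 1.0], [5.0, 5.0], [6.0, 6.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

K = (X @ X.T) * np.outer(y, y)                 # y_n y_m (x_n . x_m)

def neg_dual(a):                               # negate the dual so we can minimise it
    return 0.5 * a @ K @ a - a.sum()

constraints = {"type": "eq", "fun": lambda a: a @ y}   # sum_n alpha_n y_n = 0
bounds = [(0.0, None)] * len(y)                        # alpha_n >= 0
res = minimize(neg_dual, x0=np.zeros(len(y)), bounds=bounds, constraints=constraints)

alpha = res.x
w = (alpha * y) @ X                            # w = sum_n alpha_n y_n x_n
sv = int(np.argmax(alpha))                     # any point with alpha_n > 0 is a support vector
b = w @ X[sv] - y[sv]                          # from y_sv (w . x_sv - b) = 1
print("alphas:", np.round(alpha, 3))           # non-zero only for the support vectors
print("w =", np.round(w, 3), ", b =", round(b, 3))
```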
Practice Questions
Ex. 4.3.1 : Given the following data, calculate the hyperplane. Also, classify (0.6, 0.9) based on the calculated hyperplane.

x1   | x2   | y | α
0.38 | 0.47 | + | 65.52
0.49 | 0.61 | − | 65.52
0.92 | 0.41 | − | 0
0.74 | 0.89 | − | 0
0.18 | 0.58 | + | 0
0.41 | 0.35 | + | 0
0.93 | 0.81 | − | 0
0.21 | 0.1  | + | 0
Soln. :

As you see, the value of α is non-zero for only the first two training examples. Hence, the first two training examples are the support vectors.

Let's calculate the values of w and b as required to calculate the SVM hyperplane.

Let w = (w1, w2)

w = Σn αn yn xn

w1 = 65.52·1·0.38 + 65.52·(−1)·0.49 = −7.2

w2 = 65.52·1·0.47 + 65.52·(−1)·0.61 = −9.2

Parameter b can be calculated for each support vector as follows.

b1 = 1 − w·x1 = 1 − (−7.2)·0.38 − (−9.2)·0.47 = 8.06

b2 = 1 − w·x2 = 1 − (−7.2)·0.49 − (−9.2)·0.61 ≈ 10

Averaging b1 and b2, you get b = 9.03.

Hence, the hyperplane line is defined as −7.2 x1 − 9.2 x2 + 9.03 = 0

Fig. P. 4.3.1(a)

Now, suppose that there is a new data point (0.6, 0.9) that you want to classify.

Putting the values in the equation −7.2 x1 − 9.2 x2 + 9.03, you get −7.2·0.6 − 9.2·0.9 + 9.03 = −3.57.

Hence, this is classified as a negative sample.

Fig. P. 4.3.1(b) : The new point to be classified and the −ve samples
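The hand calculation above can be replayed numerically. The sketch below (NumPy assumed) follows the same steps: w from the two non-zero α values, b averaged over b_k = 1 − w·x_k for the two support vectors, and then the sign of the score for (0.6, 0.9). Because no intermediate rounding is done, the printed values differ slightly from the rounded figures worked out by hand (for example, b comes out near 9.1 rather than 9.03), but the classification of the new point is the same.

```python
# Numerical re-run of Ex. 4.3.1 (NumPy assumed), following the same steps as above.
import numpy as np

sv = np.array([[0.38, 0.47], [0.49, 0.61]])   # the two support vectors (alpha = 65.52)
y = np.array([1.0, -1.0])
alpha = np.array([65.52, 65.52])

w = (alpha * y) @ sv                          # w = sum_n alpha_n y_n x_n
b = np.mean([1.0 - w @ x for x in sv])        # b_k = 1 - w . x_k, averaged over the SVs
print("w =", np.round(w, 2), ", b =", round(float(b), 2))

x_new = np.array([0.6, 0.9])
score = w @ x_new + b                         # hyperplane: w1*x1 + w2*x2 + b
print("score for (0.6, 0.9):", round(float(score), 2),
      "->", "negative" if score < 0 else "positive")
```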

4.3.2 Kernels for Learning Non-Linear Functions (Kernel Trick)

• So far, you learnt about hyperplanes that could linearly separate the data into two classes using a line very clearly.
But what if your dataset looks like the following examples?

Fig. 4.3.1
• As you see, you cannot really draw a line separating out the classes. In such scenarios, you use a kernel function to map the datapoints such that they can be "uplifted" to higher dimensions and then could possibly be classified.

Fig. 4.3.2 : Mapping datapoints to a higher dimension using the kernel function Φ

Fig. 4.3.3 : The transformed space (x'1, x'2) in which the classes become separable

• The kernel function is often denoted by φ(x). You can choose the type of kernel function as appropriate for the problem at hand. Some of the commonly used kernel functions are the polynomial function, the sigmoid function, and the radial basis function. This technique of applying a mapping kernel function to adjust the data is also called the kernel trick.
• Table 4.3.1 summarises various commonly used kernels.


Table 4.3.1

Kernel                | K(x, y)
Linear                | K(x, y) = (1 + xᵀy)
Polynomial            | K(x, y) = (1 + xᵀy)^p
Sigmoid               | K(x, y) = tanh(κ·xᵀy − δ)
Radial Basis Function | K(x, y) = exp(−||x − y||² / (2σ²))
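The kernels in Table 4.3.1 are plain functions of two vectors and are easy to evaluate directly. The sketch below (NumPy assumed; the vectors and the parameters p, κ, δ, σ are illustrative choices, not values from the text) computes each of them for one pair of points.

```python
# Illustrative sketch of the kernels from Table 4.3.1 (NumPy assumed).
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.5])

def linear(x, y):
    return 1 + x @ y

def polynomial(x, y, p=3):
    return (1 + x @ y) ** p

def sigmoid(x, y, kappa=0.5, delta=1.0):
    return np.tanh(kappa * (x @ y) - delta)

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

for name, k in [("Linear", linear), ("Polynomial", polynomial),
                ("Sigmoid", sigmoid), ("RBF", rbf)]:
    print(name, ": K(x, y) =", round(float(k(x, y)), 4))
```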

4.3.3 Comparison between Logistic Regression and SVM
As you learned, logistic regression and SVM are both used for data classification. However, SVM is considered better than logistic regression, as SVM can classify non-linear data as well. Table 4.3.2 summarises the differences between logistic regression and SVM.
Table 4.3.2

Comparison Attribute | Logistic Regression   | SVM
Good for             | Linear classification | Linear and non-linear classification
Decision boundary    | Multiple              | One (best one)
Approach             | Statistical           | Geometrical
Errors               | Comparatively higher  | Comparatively lower

4.4 SVM for Non-linear Classification using Radial Basis Functions (RBF)

• As you briefly learnt in the "Kernels for Learning Non-Linear Functions (Kernel Trick)" section, the Radial Basis Function (RBF) is one of the popular kernel tricks used to classify non-linear data using SVM. Mathematically, it is denoted as following.

K(x, y) = exp(−||x − y||² / (2σ²))

• Often, 1/(2σ²) is represented as γ. So, RBF could be re-written as

K(x, y) = e^(−γ(x − y)²)

• The way it works is that it finds the influence of the datapoint to be classified with respect to the datapoints that are already classified, and assigns the higher-dimensional relationship based on that influence. The higher the influence of an already classified datapoint on the datapoint to be classified, the higher the chances of the new datapoint being classified as the class of the datapoint that influenced it the most.
• So, assume that the point x is known, and the point y is to be classified. Then, RBF calculates the distance between those two points in the infinite dimension as (x − y)². γ is the scaling factor that you may choose as required to make the distance numbers more significant.

Let's see some examples.


Practice Questions

Ex. 4.4.1 : Calculate the radial distance of point A having value of 6 with respect to points B and C having values of 8 and 12 respectively.

Soln. :

Assume γ = 1

Distance of point A with respect to B = e^(−γ(x − y)²) = e^(−1(6 − 8)²) = e^(−4) = 0.018

Distance of point A with respect to C = e^(−γ(x − y)²) = e^(−1(6 − 12)²) = e^(−36) ≈ 0

So, point B has higher influence over point A than point C.
Ex. 4.4.2 : Demonstrate how you could use RBF over the XOR function to draw a linear SVM.

The truth table for the XOR function is as following.

A | B | Y
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

If you plot these points, you get the layout as shown in Fig. P. 4.4.2.

Fig. P. 4.4.2

No matter how hard you try, you would not be able to separate out these points as in the case of a linear SVM. Let's see how RBF helps.
Note that there are two classes here Y = {0, 1} and four training samples, two for each class.

{0, 1} and {1, 0} produce 1 (+ve class), whereas {0, 0} and {1, 1} produce 0 (−ve class). Assume γ = 1.

Let's choose elements {0, 1} and {1, 0} as the basis for RBF.

e-"((x-y)2 =e-1({0,0)-{1,0J>2 =e-1(0-10-0)2 =e-1


(O, O) e-'((x-y)2 = e-1{{0, O}-{O, 1})2 = e-1(0-0 0-1)2 = e-1 = 0_37
=0.37
e""Y!x-y)2 =e-1({0,11-11.01>2 =e-110-11-0)2 = e-4 \
(0, 1) e-'f!x-y)2 = e-1({0, 1}- {O, 1})2 = e-1(0- o i-1>2 = eo = 1
I I
= 0.018
I \
2 - 0 -1 )
2
(1,0) e""'{(x-y) =e-l({l,0}-{0,1})2=e-1(1-00-1)2 =e-4=0.018 e""Y(x-y) 2 =e-1({1' 0)-{1' 0))2 =e-1(1-10-0) -e -
\

= =e-1
\,
..,,,(x-v)2 _ -l({l,1}-{1,0))2 e-1(1-11-0)2
e, , - e
2
(l, 1) e""'((x-y) = e-1({1, 1}- {O, 1})2 = e-1(1- o 1-1>2 = e-1 :: 0.37
= 0.37

So, RBF does the following transformations to the respective points.

Old Point | New Point
{0, 0}    | {0.37, 0.37}
{0, 1}    | {1, 0.018}
{1, 0}    | {0.018, 1}
{1, 1}    | {0.37, 0.37}

If you plot these points, you see that they are now clearly separable as in the case of a linear SVM.

Fig. P. 4.4.2(a)

Points {l, 0.018} and {0.018, 1} give output y = 1.

Points {0.37, 0.37} and {0.37, 0.37} give output y = 0.

Hence, as you see, RBF transformed the XOR points to make them linearly separable.
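The same transformation can be scripted. The sketch below (NumPy and scikit-learn assumed) maps each XOR point by its RBF similarity to the two basis points {0, 1} and {1, 0} and then fits a linear SVM on the transformed points. Note that it uses the standard squared Euclidean distance, so the cross terms come out as e⁻² ≈ 0.135 rather than the 0.018 obtained in the hand calculation above; the transformed points are linearly separable either way.

```python
# Illustrative sketch: RBF-transform the XOR points and fit a linear SVM.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])                         # XOR labels
bases = np.array([[0, 1], [1, 0]], dtype=float)    # chosen RBF basis points
gamma = 1.0

def rbf_features(x):
    # similarity of x to each basis point: exp(-gamma * squared Euclidean distance)
    return np.exp(-gamma * np.sum((bases - x) ** 2, axis=1))

Z = np.array([rbf_features(x) for x in X])
print(np.round(Z, 3))                              # transformed (new) points

clf = SVC(kernel="linear", C=1e6).fit(Z, y)        # linear SVM in the new space
print("training accuracy:", clf.score(Z, y))       # 1.0 -> now linearly separable
```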

4.4.1 The Radial Basis Function (RBF) Network

• As you understand, the core concept behind RBF is that inputs that are close together should generate the same
output, whereas inputs that are far apart should not.

• Points that are close together exercise more influence over each other than the points that are far apart.

• As you previously learnt about neural networks, for any input that you present to a set of neurons, some of them will fire strongly, some weakly, and some will not fire at all, depending upon the distance between the weights and the particular input in weight space.
• You can treat these nodes as a hidden layer, just as you did for the Multilayer Perceptrons, and connect up some output nodes in a second layer.
• This simply requires adding weights from each hidden (RBF) neuron to a set of output nodes. This is known as an RBF network.
Fig. 4.4.3 : An RBF network with an input layer, an RBF layer, and an output layer

• A Radial Basis Function network consists of input nodes connected by weights to a set of RBF neurons, which fire proportionally to the distance between the input and the neuron in weight space. The activations of these nodes are used as inputs to the second layer, which consists of linear nodes. RBF networks never have more than one layer of non-linear neurons.
• In an RBF network, the RBF nodes will activate according to how close they are to the input, and the combination of these activations will enable the network to decide how to respond.
these a I
• For example, suppose that somebody is signalling directions to you using a torch. If the torch is high (in the 12 o'clock position) you go forwards, low (in the 6 o'clock position) you go backwards, and left and right (in the 9 o'clock and 3 o'clock positions, respectively) mean that you turn.

• The RBF network would work in such a way that if the torch was at 2 o'clock or so, then you would do some of the 12 o'clock action and a bit more of the 3 o'clock action, but none of the 6 o'clock or 9 o'clock actions. So, you would move forwards and to the right.
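A forward pass through such a network is only a few lines. The sketch below (NumPy assumed) hand-picks two RBF centres and one linear output unit so that the network's output is positive exactly for the XOR-positive inputs from the earlier example; the centres, width and output weights are illustrative values chosen by hand, not trained ones.

```python
# Illustrative RBF-network forward pass (NumPy assumed; hand-picked weights).
import numpy as np

centres = np.array([[0.0, 1.0], [1.0, 0.0]])   # RBF-layer "weights" are the centres
sigma = 1.0                                    # width of each RBF neuron
W_out = np.array([1.0, 1.0])                   # hidden-to-output weights (linear layer)
b_out = -1.29                                  # output bias, chosen so the XOR points split at 0

def rbf_layer(x):
    d2 = np.sum((centres - x) ** 2, axis=1)    # squared distance to each centre
    return np.exp(-d2 / (2 * sigma ** 2))      # nearby centres fire strongly

def forward(x):
    h = rbf_layer(np.asarray(x, dtype=float))  # the single non-linear layer
    return h @ W_out + b_out                   # linear output layer

for x in [[0, 1], [1, 0], [0, 0], [1, 1]]:
    out = forward(x)
    print(x, "->", round(float(out), 3), "(class", int(out > 0), ")")
```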

4.5 Support Vector Regression (SVR)

, As you understand, Support Vector Machines (SVM) are popularly and widely used for classification problems in
machine learning. Support Vector Regression (SVR) uses the same principle as SVM, but for regression problems.
The goal of a regression problem is to find a function that approximates mapping from an input domain to real
numbers on the basis of a training sample.
Fig. 4.5.1 : Decision boundaries around the hyperplane in SVR
• As in the case of SVM, consider two decision boundaries and a hyperplane. Your objective is to consider the points that are within the decision boundary lines. The best fit line (or regression line) is the hyperplane that has the maximum number of points. Assume that the decision boundaries are at a distance, say 'a', from the hyperplane. So, these are the lines that you draw at distances '+a' and '−a' from the hyperplane. Based on SVM, the equation of the hyperplane is as following.

y = wx − b

• The equations of decision boundaries are as following.


wx-b = a
wx-b = -a
• Thus, any hyperplane that satisfies SVR should satisfy −a ≤ wx − b ≤ a.
• Your goal is to decide a decision boundary at distance 'a' from the original hyperplane such that the data points closest to the hyperplane (the support vectors) are within that boundary line.
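A concrete SVR fit can make the role of the boundary distance 'a' clearer; in scikit-learn this width is the epsilon parameter. The sketch below is illustrative only (it assumes scikit-learn and NumPy and uses synthetic noisy data, not data from the text).

```python
# Illustrative sketch (assumed scikit-learn API, synthetic data): linear SVR
# where epsilon plays the role of the boundary distance 'a' around the hyperplane.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0.0, 5.0, 40).reshape(-1, 1)
y = 2.0 * X.ravel() - 1.0 + rng.normal(scale=0.2, size=40)   # noisy straight line

svr = SVR(kernel="linear", C=10.0, epsilon=0.3).fit(X, y)

print("w =", svr.coef_[0], ", intercept =", svr.intercept_[0])
print("support vectors (points outside the epsilon tube):",
      len(svr.support_), "of", len(X))
```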

4.6 Multi-class Classification Techniques

Fig. 4.6.1 : Binary classification vs multi-class classification

Multi-class classification is a task with more than two classes, for example, classifying a set of images of fruits which may be oranges, apples, or pears. Multi-class classification makes the assumption that each sample is assigned to one and only one class (or label). For example, a fruit can either be an apple or a pear, but not both at the same time. Another example could be recognising alphabets in an optical character recognition type of problem, where a given alphabet could be one of the 26 alphabets. Let's learn about multi-class classification techniques.

4.6.1 One vs One (OvO)


One vs One (OvO in short) is a heuristic method for using binary classification algorithms for multi-class classification. The One vs One technique splits a multi-class classification dataset into binary classification problems. In this approach, the entire dataset is split into one dataset for each class versus every other class.

For example, consider a multi-class classification problem with four classes - 'red', 'blue', 'green', and 'yellow'. This could then be divided into six binary classification datasets as following.

• Binary Classification Problem 1 : red vs blue


• Binary Classification Problem 2 : red vs green
• Binary Classification Problem 3 : red vs yellow
• Binary Classification Problem 4 : blue vs green
• Binary Classification Problem 5 : blue vs yellow
• Binary Classification Problem 6 : green vs yellow

As you see, this is significantly more datasets. You can calculate the number of binary datasets (or models) that would be created for the One vs One approach as following.

No. of binary datasets (or models) = C × (C − 1) / 2, where C is the number of classes

In the example given earlier, you have 4 classes. Hence, the number of binary datasets that would be created can be calculated as following.

No. of binary datasets for 4 classes = 4 × (4 − 1) / 2 = 6

Each binary classification model may predict one class label, and the class with the most predictions or votes is predicted. Similarly, if the binary classification models predict a numerical class membership, such as a probability, then the argmax of the sum of the scores is predicted as the class label.
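The sketch below shows this bookkeeping in practice (assumed scikit-learn API, synthetic four-class data): OneVsOneClassifier trains C(C − 1)/2 = 6 binary models for four classes and predicts by voting.

```python
# Illustrative sketch (assumed scikit-learn API, synthetic data): One vs One.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier

X, y = make_classification(n_samples=200, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=0)

ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print("number of binary models:", len(ovo.estimators_))    # 4 * (4 - 1) / 2 = 6
print("predicted class of the first sample:", ovo.predict(X[:1])[0])
```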

4.6.2 One vs Rest (OvR) (One vs All)
One vs Rest (OvR in short, also referred to as One-vs-All or OvA) is a heuristic method for using binary classification algorithms for multi-class classification. It involves splitting the multi-class dataset into multiple binary classification problems. Each binary classifier is then trained on one binary classification problem and predictions are made using the model that is the most confident. Basically, for each split dataset, you take one class as positive and all other classes as negative.

For example, consider a multi-class classification problem with four classes - 'red', 'blue', 'green', and 'yellow'. This could be divided into four binary classification datasets as following.
• Binary Classification Problem 1 : red vs [blue, green, yellow]
• Binary Classification Problem 2 : blue vs [red, green, yellow]
• Binary Classification Problem 3 : green vs [red, blue, yellow]
• Binary Classification Problem 4 : yellow vs [red, blue, green]

This approach requires that one machine learning model is created for each class. For example, four classes require four models. Each model predicts a class membership probability or a probability-like score. The argmax of these scores (the class with the largest score) is then used to predict a class. This approach is commonly used for algorithms that naturally predict a numerical class membership probability or score, such as Logistic Regression and Perceptron. One advantage of this approach is its interpretability. Since each class is represented by one and only one classifier, it is possible to gain knowledge about the class by inspecting its corresponding classifier. This is the most commonly used strategy and is a fair default choice.
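The corresponding sketch for One vs Rest (again assumed scikit-learn API and synthetic data) trains one model per class and picks the class whose model gives the largest score (the argmax).

```python
# Illustrative sketch (assumed scikit-learn API, synthetic data): One vs Rest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=200, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print("number of binary models:", len(ovr.estimators_))    # one per class = 4

scores = ovr.decision_function(X[:1])[0]                   # one score per class
print("per-class scores:", np.round(scores, 2))
print("argmax class:", ovr.classes_[np.argmax(scores)])
```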

4.6.3 Comparison between OvO and OvR

Table 4.6.1 provides a quick comparison between OvO and OvR multi-class classification techniques.

Table 4.6.1
Comparison Attribute                           | One vs One (OvO)            | One vs Rest (OvR)
Speed                                          | Slower than OvR             | Faster than OvO
Computation complexity                         | High                        | Low
Suitable for                                   | Algorithms that don't scale | Algorithms that scale
No. of binary datasets or models for C classes | C × (C − 1) / 2             | C
Interpretability                               | Low                         | High
Used                                           | Less commonly               | More commonly

Here are a few review questions to help you gauge your understanding of this chapter. Try to attempt these questions and ensure that you can recall the points mentioned in the chapter.

[A] Support Vector Machines (SVM)
Q. 1  Write a short note on SVM. (4 Marks)
Q. 2  With a diagram, explain the maximum margin concept behind SVM. (6 Marks)
Q. 3  Write a short note on constrained optimisation. (4 Marks)
Q. 4  Given the following data, calculate the hyperplane. Also, classify (0.6, 0.9) based on the calculated hyperplane. (8 Marks)

x1   | x2   | y | α
0.38 | 0.47 | + | 65.52
0.49 | 0.61 | − | 65.52
0.92 | 0.41 | − | 0
0.74 | 0.89 | − | 0
0.18 | 0.58 | + | 0
0.41 | 0.35 | + | 0
0.93 | 0.81 | − | 0
0.21 | 0.1  | + | 0

Q. 5  With a diagram, explain the Kernel Trick. (5 Marks)

Q. 6  How could kernels be used for classifying non-linear functions? (8 Marks)
Q. 7  Compare logistic regression and SVM. (4 Marks)
Q. 8  Write a short note on the Radial Basis Function. (4 Marks)
Q. 9  Calculate the radial distance of point A having value of 6 with respect to points B and C having values of 8 and 12 respectively. (4 Marks)
Q. 10 Demonstrate how you could use RBF over the XOR function to draw a linear SVM. (8 Marks)
Q. 11 Explain the Radial Basis Function Network. (8 Marks)
Q. 12 Write a short note on Support Vector Regression (SVR). (4 Marks)

[B] Multi-class Classification Techniques

Q. 13 Write a short note on multi-class classification techniques. (4 Marks)
Q. 14 Explain the One vs One (OvO) multi-class classification technique. (4 Marks)
Q. 15 Explain the One vs Rest (OvR) multi-class classification technique. (4 Marks)
Q. 16 Explain the One vs All (OvA) multi-class classification technique. (4 Marks)
Q. 17 Compare One vs One (OvO) and One vs All (OvA) multi-class classification techniques. (4 Marks)
Q. 18 Compare One vs One (OvO) and One vs Rest (OvR) multi-class classification techniques. (4 Marks)
