
Machine Learning Algorithms

Support Vector Machine – SVM


Overview
• SVM for a linearly separable binary data set
• Main goal: design a hyperplane that classifies all training vectors into
two classes
• The best model is the one that leaves the maximum margin from both classes
• The two class labels are +1 (positive examples) and -1 (negative examples)

[Figure: two classes of points in the x1-x2 plane separated by a hyperplane]
Overview
• Finding this hyperplane is a constrained optimization problem: split the data
(i.e. place the margin) in the best possible way
• This hyperplane best splits the data because it is as far as possible from the
support vectors, which is another way of saying we maximized the margin

[Figure: the separating hyperplane between the two classes, with the support
vectors lying on the margin boundaries]
Intuition behind SVM
• Points (instances) are treated as vectors p = (x1, x2, ..., xn)
• SVM finds the closest two points from the two classes (see figure); these
points support (define) the best separating line/plane
• Then SVM draws a line connecting them (the orange line in the figure)
• After that, SVM decides that the best separating line is the line that
bisects, and is perpendicular to, the connecting line
Margin in terms of w
Take x2 on the positive margin plane and x1 on the negative margin plane:

w · x2 + b = +1
w · x1 + b = -1

Subtracting the two equations gives w · (x2 - x1) = 2. Projecting (x2 - x1) onto
the unit normal w / ||w|| gives the width of the margin:

width = (w / ||w||) · (x2 - x1) = 2 / ||w||
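As a quick numeric check (not in the original slides), using the values w = (1, 0)
and b = -3 that will appear in the worked example later in this lecture:

\[
  w \cdot x + b = +1 \;\Rightarrow\; x_1 = 4, \qquad
  w \cdot x + b = -1 \;\Rightarrow\; x_1 = 2,
\]
\[
  \text{width} = 4 - 2 = 2 = \frac{2}{\lVert w \rVert} \quad (\lVert w \rVert = 1).
\]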
Support Vector Machine
Linearly separable data

Margin = 2 / ||w||

[Figure: the separating hyperplane wᵀx + b = 0 with the two margin planes
wᵀx + b = +1 and wᵀx + b = -1; the support vectors of each class lie on the
margin planes]
SVM as a minimization problem
• Maximizing 2 / ||w|| is the same as minimizing ||w||^2 / 2
• Hence SVM becomes a minimization problem:

min (1/2) ||w||^2                  (quadratic problem)
s.t.  yi (w · xi + b) ≥ 1  ∀i      (linear constraints)

• We are now optimizing a quadratic function subject to linear constraints
• Quadratic optimization problems are a standard, well-known class of
mathematical optimization problems, and many algorithms exist for solving them
min (1/2) ||w||^2
s.t.  yi (xi · w + b) ≥ 1  ∀i
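To make this concrete, here is a minimal sketch (not from the slides) that feeds
this primal problem to a general-purpose constrained optimizer on a small
hypothetical 2-D data set; the points are chosen so that the hard-margin solution
matches the worked example later in this lecture (w ≈ (1, 0), b ≈ -3). In
practice dedicated QP or SVM solvers are used instead.

import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data: two linearly separable classes in 2-D.
X = np.array([[2.0, 1.0], [2.0, -1.0], [4.0, 0.0], [5.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

# Decision variables theta = [w1, w2, b]; objective is (1/2)||w||^2.
def objective(theta):
    w = theta[:2]
    return 0.5 * w @ w

# One linear inequality constraint per point: y_i (x_i . w + b) - 1 >= 0.
constraints = [
    {"type": "ineq",
     "fun": lambda theta, xi=xi, yi=yi: yi * (xi @ theta[:2] + theta[2]) - 1.0}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))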
In order to cater for the constraints in this minimization, we need to allocate
them Lagrange multipliers αi, where αi ≥ 0 ∀i:
LP = (1/2) ||w||^2 - Σi αi [ yi (xi · w + b) - 1 ]
   = (1/2) ||w||^2 - Σi αi yi (xi · w + b) + Σi αi

(where the sums run over the L training points, i = 1 ... L)
We wish to find the w and b which minimize, and the α which maximizes, LP (whilst
keeping αi ≥ 0 ∀i). We can do this by differentiating LP with respect to w and b
and setting the derivatives to zero:

∂LP/∂w = 0  ⇒  w = Σi αi yi xi
∂LP/∂b = 0  ⇒  Σi αi yi = 0
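The relation w = Σi αi yi xi (and the condition Σi αi yi = 0) can be checked
empirically. Below is a minimal sketch (not from the slides) using scikit-learn's
SVC, whose dual_coef_ attribute stores the products αi yi for the support
vectors; it assumes scikit-learn and NumPy are available and reuses the toy data
from the sketch above.

import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable classes in 2-D.
X = np.array([[2.0, 1.0], [2.0, -1.0], [4.0, 0.0], [5.0, 1.0]])
y = np.array([-1, -1, 1, 1])

# A very large C approximates the hard-margin problem described above.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_   # w = sum_i alpha_i y_i x_i
print("w (from dual):", w_from_dual)                  # matches clf.coef_
print("w (primal):   ", clf.coef_)
print("sum_i alpha_i y_i =", clf.dual_coef_.sum())    # approximately 0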
A Geometrical Interpretation

[Figure: training points of Class 1 and Class 2 labelled with their Lagrange
multipliers; only the support vectors on the margin have non-zero values
(α1 = 0.8, α8 = 0.6, α6 = 1.4), while every other point has αi = 0. Note that
Σi αi yi = 0 holds: 0.8 + 0.6 = 1.4]
Example
• Here we select 3 support vectors to start with.
• They are S1 = (2, 1), S2 = (2, -1) and S3 = (4, 0).

[Figure: S1 and S2 plotted as the negative class and S3 as the positive class in
the x1-x2 plane]
Example
• Here we will use vectors augmented with a 1 as a bias input, and for clarity we
will differentiate these with an over-tilde. That is:

S̃1 = (2, 1, 1),   S̃2 = (2, -1, 1),   S̃3 = (4, 0, 1)
Example
• Since each support vector lies exactly on its margin plane, and w̃ = Σi αi S̃i,
the αi must satisfy:

α1 S̃1 · S̃1 + α2 S̃2 · S̃1 + α3 S̃3 · S̃1 = -1   (-ve class)
α1 S̃1 · S̃2 + α2 S̃2 · S̃2 + α3 S̃3 · S̃2 = -1   (-ve class)
α1 S̃1 · S̃3 + α2 S̃2 · S̃3 + α3 S̃3 · S̃3 = +1   (+ve class)

• Let's substitute the values for S̃1, S̃2 and S̃3 in the above equations:

α1 (2, 1, 1) · (2, 1, 1)  + α2 (2, -1, 1) · (2, 1, 1)  + α3 (4, 0, 1) · (2, 1, 1)  = -1
α1 (2, 1, 1) · (2, -1, 1) + α2 (2, -1, 1) · (2, -1, 1) + α3 (4, 0, 1) · (2, -1, 1) = -1
α1 (2, 1, 1) · (4, 0, 1)  + α2 (2, -1, 1) · (4, 0, 1)  + α3 (4, 0, 1) · (4, 0, 1)  = +1
• After simplification we get:

6 α1 + 4 α2 + 9 α3  = -1
4 α1 + 6 α2 + 9 α3  = -1
9 α1 + 9 α2 + 17 α3 = +1

• Solving the above 3 simultaneous equations we get: α1 = α2 = -3.25 and α3 = 3.5.
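As a quick numerical check (not part of the original slides, assuming NumPy is
available), the 3x3 system can be solved directly:

import numpy as np

# Coefficient matrix of the dot products S~i . S~j, and the class labels.
A = np.array([[6.0, 4.0, 9.0],
              [4.0, 6.0, 9.0],
              [9.0, 9.0, 17.0]])
b = np.array([-1.0, -1.0, 1.0])

alphas = np.linalg.solve(A, b)
print(alphas)   # expected: [-3.25 -3.25  3.5 ]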
• The hyperplane that discriminates the positive class from the negative class is
given by:

w̃ = Σi αi S̃i

• Substituting the values we get:

w̃ = (-3.25) (2, 1, 1) + (-3.25) (2, -1, 1) + (3.5) (4, 0, 1) = (1, 0, -3)

• Our vectors are augmented with a bias.


• Hence we can equate the entry in vv as the
hyper plane with an offset b .
• Therefore the separating hyper plane equation
y wx + b with vv and offset b 3.
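Continuing the numerical sketch above (again not part of the original slides),
w̃ can be recomputed from the αi and checked against the support vectors:

import numpy as np

alphas = np.array([-3.25, -3.25, 3.5])
S = np.array([[2.0, 1.0, 1.0],    # S~1
              [2.0, -1.0, 1.0],   # S~2
              [4.0, 0.0, 1.0]])   # S~3

w_tilde = alphas @ S              # w~ = sum_i alpha_i * S~i
print("w~ =", w_tilde)            # expected: [ 1.  0. -3.]

# Decision values for the augmented support vectors: -1, -1, +1 as required.
print("decision values:", S @ w_tilde)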
Support Vector Machines
• y = w·x + b with w = (1, 0) and b = -3

[Figure: the data points and the resulting separating line x1 = 3 in the x1-x2
plane, with the margin planes x1 = 2 and x1 = 4]
Kernel trick
SVM Algorithm

1- Define an optimal hyperplane: maximize margin


2- Extend the above definition for non-linearly separable
problems: have a penalty term for misclassifications
3- Map data to a high-dimensional space where it is easier to classify with
linear decision surfaces: reformulate the problem so that the data is mapped
implicitly to this space (see the sketch below)
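As a hedged illustration (not from the slides) of how these three steps surface
in practice, scikit-learn's SVC exposes the misclassification penalty as C
(step 2) and the implicit mapping as the kernel parameter (step 3), while the
estimator itself maximizes the margin (step 1); the tiny data set is hypothetical.

from sklearn.svm import SVC

X = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0]]
y = [0, 0, 1, 1]

clf = SVC(
    C=1.0,           # step 2: penalty term for misclassifications (soft margin)
    kernel="rbf",    # step 3: implicit mapping to a high-dimensional space
)                    # step 1: SVC maximizes the margin in that space
clf.fit(X, y)
print(clf.predict([[0.5, 0.5], [2.5, 0.5]]))   # predicted class labels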
Suppose we're in 1-dimension

Not a big surprise

[Figure: points on a line around x = 0; a single threshold splits the negative
"plane" from the positive "plane"]
Harder 1-dimensional dataset

That's wiped the smirk off SVM's face. What can be done about this?

[Figure: positive and negative points interleaved on the line around x = 0, so no
single threshold separates them]
Harder 1-dimensional dataset

Remember how permitting non-linear basis functions made linear regression so much
nicer? Let's permit them here too:

zk = (xk, xk^2)

[Figure: after mapping each point xk to (xk, xk^2), the two classes become
linearly separable in the new 2-D space]
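A minimal sketch of this idea (not from the slides, assuming NumPy and a
hypothetical 1-D data set with the negative class near the origin): after the
mapping zk = (xk, xk^2), a straight line in the new space separates the classes.

import numpy as np

# Hypothetical 1-D data: negative class near the origin, positive class outside.
x = np.array([-4.0, -3.0, -1.0, 0.0, 1.0, 3.0, 4.0])
y = np.array([+1, +1, -1, -1, -1, +1, +1])

# Non-linear basis functions: z_k = (x_k, x_k^2).
Z = np.column_stack([x, x ** 2])

# In the (x, x^2) plane the classes are split by the horizontal line x^2 = 5,
# i.e. a linear decision surface w.z + b = 0 with w = (0, 1) and b = -5.
w, b = np.array([0.0, 1.0]), -5.0
print(np.sign(Z @ w + b))   # matches y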
Non-linear SVMs: Feature spaces

• General idea: the original feature space can always be mapped to some
higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

[Figure: φ maps the non-linearly separable input space to a feature space where
the classes become linearly separable (Input Space → Feature Space)]
SVM for non-linear separability
• The simplest way to separate two groups of data is with a straight line, a flat
plane, or an N-dimensional hyperplane
• However, there are situations where a nonlinear region can separate the groups
more efficiently
• SVM handles this by using a kernel function (nonlinear) to map the data into a
different space where a hyperplane (linear) can be used to do the separation
• It means a non-linear function is learned by a linear learning machine in a
high-dimensional feature space, while the capacity of the system is controlled by
a parameter that does not depend on the dimensionality of the space
• This is called the kernel trick: the kernel function transforms the data into a
higher-dimensional feature space to make it possible to perform the linear
separation (a small numeric demonstration follows below)
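A minimal sketch of the kernel trick (not from the slides, assuming NumPy): for
the degree-2 polynomial kernel, K(x, z) = (1 + xᵀz)^2 equals the ordinary dot
product φ(x) · φ(z) in a 6-dimensional feature space, so the kernel computes
feature-space inner products without ever forming φ explicitly.

import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (1 + x.z)^2 for 2-D inputs."""
    return (1.0 + x @ z) ** 2

def phi(x):
    """Explicit feature map for the same kernel (2-D input -> 6-D feature space)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, s * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(poly_kernel(x, z))   # (1 + 3 - 2)^2 = 4.0
print(phi(x) @ phi(z))     # same value, computed via the explicit mapping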
Kernels
• Why use kernels?
  • Make a non-separable problem separable
  • Map data into a better representational space
• Common kernels
  • Linear
  • Polynomial: K(x, z) = (1 + xᵀz)^d
    • Gives feature conjunctions
  • Radial basis function (infinite-dimensional feature space)
    • Hasn't been very useful in text classification
• A short comparison of these kernels is sketched below
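As a hedged comparison (not from the slides, assuming scikit-learn), the common
kernels can be tried side by side on a hypothetical data set of concentric
circles, where the linear kernel struggles but the polynomial and RBF kernels
separate the classes easily.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Hypothetical data set: two concentric rings, not linearly separable.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, degree=2, C=1.0).fit(X, y)
    print(f"{kernel:>6} kernel: training accuracy = {clf.score(X, y):.2f}")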
Thanks
