
Linear Classifier: Linear Discriminant Function

Compiled by Lakshmi Manasa, CED16I033


Guided by
Dr Umarani Jayaraman

Department of Computer Science and Engineering


Indian Institute of Information Technology Design and Manufacturing
Kancheepuram

April 18, 2022

1 / 31
Discriminant Function

We assume the proper form of the discriminant function is known and use
the training samples to estimate the values of its parameters.
Although the parameters of the discriminant function are estimated, this
is said to be a non-parametric approach, as it does not require any
knowledge of the probability distributions.
Finding the linear discriminant function will be formulated as a problem
of minimizing a criterion function.
Criterion function: the obvious criterion function for classification
purposes is the sample risk, or training error.

2 / 31
Discriminant function

Training error: the average loss incurred in classifying the set of
training samples.
No probability form is assumed: if the parametric form of the
class-conditional density functions is not known, then we have to design
the decision boundary using the samples which are available to us.
Here, we don't assume any parametric form of any probability
distribution function.
But what we do know is that the classes are linearly separable.

3 / 31
Linear Discriminant Function

Non-parametric form
Supervised learning
Classes are linearly separable
Classes: ω1 and ω2
Using this information, as the classes are linearly separable, we can
formulate the linear discriminant function as
g(x) = W^t X + w0
where
X - d-dimensional feature vector
W - d-dimensional weight vector
W^t X - inner product of the two vectors
w0 - bias/threshold weight

4 / 31
Decision criteria

g(x) > 0 ⇒ x ∈ ω1
g(x) < 0 ⇒ x ∈ ω2
g(x) = 0 ⇒ x lies on the decision boundary
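A minimal sketch of this decision rule in Python (the weight vector W, bias w0, and test points below are hypothetical values, purely for illustration):

```python
import numpy as np

# Hypothetical parameters of a learned linear discriminant (illustration only)
W = np.array([2.0, -1.0])   # d-dimensional weight vector
w0 = 0.5                    # bias / threshold weight

def g(x):
    """Linear discriminant g(x) = W^t x + w0."""
    return W @ x + w0

def classify(x):
    value = g(x)
    if value > 0:
        return "omega_1"
    elif value < 0:
        return "omega_2"
    else:
        return "on the decision boundary"

print(classify(np.array([1.0, 0.0])))   # g = 2.5  -> omega_1
print(classify(np.array([0.0, 2.0])))   # g = -1.5 -> omega_2
```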
Now let us analyze the significance of each attribute in the equation
g(x) = W^t X + w0:
Nature of weight vector W
What does g(x) represent?

5 / 31
1. Nature of weight vector w

For any two points X1 and X2 on the decision surface,
g(X1) = g(X2)
W^t X1 + w0 = W^t X2 + w0
W^t (X1 − X2) = 0
We know that A·B = |A||B| cos θ; if A·B = 0, then A is perpendicular to B.
Likewise, W^t (X1 − X2) is the inner product of the weight vector W with
(X1 − X2).
As it is zero, the vector W is orthogonal to any vector lying on the
decision surface.
In d-dimensional space, this surface is called a hyperplane H.

6 / 31
2. What does g(x) represent?

Draw a perpendicular from a point X to the hyperplane H; let its foot be X_p.
Let the distance between X and X_p be r. Then
X = X_p + r · W/||W||
7 / 31
2. What does g(x) represent?

As seen earlier, W is orthogonal to the hyperplane H.
So the direction of W is the same as the direction from X_p to X.
Hence, both the vector from X_p to X and W are orthogonal to the hyperplane H.
8 / 31
2. What does g(x) represent?

The unit vector along W is W/||W||, where ||W|| = √( Σ_{i=1}^d w_i² ).

X = X_p + r · W/||W||

g(X) = W^t X + w0
g(X) = W^t [ X_p + r · W/||W|| ] + w0
g(X) = W^t X_p + w0 + r · (W^t W)/||W||

The point X_p lies on the decision surface, so W^t X_p + w0 is zero.

g(X) = 0 + r · (W^t W)/||W||
g(X) = 0 + r · ||W||²/||W||
g(X) = r · ||W||
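A quick numeric check of this result, assuming hypothetical values of W, w0, and a point X: the signed distance r = g(X)/||W|| should match the Euclidean distance from X to its projection X_p onto the hyperplane.

```python
import numpy as np

# Hypothetical hyperplane parameters and test point (illustration only)
W = np.array([3.0, 4.0])
w0 = -5.0
X = np.array([2.0, 1.0])

g = W @ X + w0                    # g(X) = W^t X + w0
r = g / np.linalg.norm(W)         # signed distance of X from the hyperplane

# From X = X_p + r * W/||W||, the foot of the perpendicular is X_p = X - r * W/||W||
X_p = X - r * W / np.linalg.norm(W)

print(g, np.linalg.norm(W) * r)          # g(X) equals r * ||W||
print(abs(r), np.linalg.norm(X - X_p))   # |r| equals the Euclidean distance ||X - X_p||
print(W @ X_p + w0)                      # X_p lies on the hyperplane: g(X_p) = 0
```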
9 / 31
2. Why is g(x) an algebraic measure?

If ax + by + c = 0 is the equation of a straight line and (x1, y1) is a
point, then the distance of (x1, y1) from the line is
d = (a·x1 + b·y1 + c) / √(a² + b²)
in 2 dimensions.
In d dimensions, the corresponding signed distance of X from the hyperplane is
r = g(X)/||W||
10 / 31
3. Distance of origin from the hyperplane H

The distance of the origin from the hyperplane H is w0/||W||; w0 is the
bias/threshold.
If w0 is +ve, then the origin lies on the positive side of the hyperplane H.
If w0 is -ve, then the origin lies on the negative side of the hyperplane H.
If w0 is zero, then the hyperplane passes through the origin.
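For example (hypothetical numbers): with W = (3, 4) and w0 = 10, ||W|| = 5, so the origin lies at distance w0/||W|| = 10/5 = 2 on the positive side of H; with w0 = −10 it would lie at distance 2 on the negative side.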

11 / 31
3. Distance of origin from the hyperplane H

[Slides 12-14: figures illustrating the hyperplane H, the weight vector W, and the distance of the origin from H]

12-14 / 31
3. Distance of origin from the hyperplane H

If w0 is zero, the discriminant function g(x) takes the particular form
g(x) = W^t X; in this case there is no bias because w0 = 0.
g(x) = W^t X is said to be in homogeneous form.
In mathematics, it is convenient to represent the equation in
homogeneous form.
So, in order to design a linear classifier we should estimate two
parameters: the weight vector W and the bias w0.
Since this is supervised learning, W and w0 are to be estimated from the
samples that are available.

15 / 31
Design of weight vector W

Assumption: two classes, linearly separable case

We have two classes and they are linearly separable.
We need a discriminant function which separates these two classes.
It is of the form g(X) = W^t X + w0.
This expression is not in homogeneous form.
Hence, converting it to homogeneous form makes the analysis easier.

16 / 31
Converting to Homogeneous form

g(X) = W^t X + w0
g(X) = a^t y
where the augmented weight vector and augmented feature vector are
a = [w1, w2, ..., wd, w0]^t and y = [x1, x2, ..., xd, 1]^t
so that
g(X) = a^t y = Σ_{i=1}^d w_i x_i + w0 = W^t X + w0
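A small sketch of this conversion, with hypothetical numbers, showing that the augmented form a^t y gives the same value as W^t X + w0:

```python
import numpy as np

# Hypothetical original parameters and sample (illustration only)
W = np.array([2.0, -1.0, 0.5])    # d-dimensional weight vector
w0 = 0.3                          # bias
X = np.array([1.0, 2.0, -1.0])    # d-dimensional feature vector

# Augmented vectors: a = [w1, ..., wd, w0]^t, y = [x1, ..., xd, 1]^t
a = np.append(W, w0)
y = np.append(X, 1.0)

print(W @ X + w0)   # g(X) in the original form
print(a @ y)        # g(X) = a^t y in homogeneous form (same value)
```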

17 / 31
Decision rule in Homogeneous form

The decision rule remains the same for a^t y:

If a^t y > 0 then decide y ∈ ω1
If a^t y < 0 then decide y ∈ ω2
If a^t y = 0 then no decision can be taken.

18 / 31
How to design weight vector ’W ’ and w 0 using the
samples?

We have n training samples y1, y2, ..., yn, each in augmented form:

y_i = [x_i1, x_i2, ..., x_id, 1]^t,   for i = 1, 2, ..., n

These are the samples which are used to train the classifier.
Some of the samples are labelled ω1 and some are labelled ω2.
Let us consider the i-th sample as y_i.

19 / 31
Two Criterion Decision rule in Homogeneous form

The decision rule remains the same for a^t y_i:

If a^t y_i > 0 then decide y_i ∈ ω1
If a^t y_i < 0 then decide y_i ∈ ω2
If a^t y_i = 0 then no decision can be taken.

20 / 31
Two Criterion Decision rule in Homogeneous form

Given a weight vector a, take all the samples which are labelled ω1.
If a^t y_i > 0 for each of these samples, then the weight vector a
correctly classifies all the samples labelled ω1.
For the same weight vector a, take all the samples belonging to class ω2.
If a^t y_i < 0 for each of these, then the weight vector a also correctly
classifies all the samples of class ω2.
That particular weight vector a is a correct (solution) weight vector,
because it correctly classifies all the samples labelled ω1 as well as
all the samples labelled ω2.

21 / 31
Single Criterion

Instead of the two conditions a^t y_i > 0 and a^t y_i < 0, can't we have a
single criterion for correct classification?
a^t y_i > 0 holds true, irrespective of the class label.
We can then say that y_i is correctly classified if a^t y_i > 0;
otherwise (a^t y_i < 0 or a^t y_i = 0), y_i is misclassified.

22 / 31
Single Criterion: How can we do that?

Samples belonging to class ω1 are taken as they are.
For the samples belonging to class ω2, after augmenting them by
appending 1, we take the negative of the whole vector.
That is, take all the samples which are labelled ω2 and negate them:
instead of considering y_i, consider −y_i.
If we take the negative, then a^t y_i, which is supposed to be < 0, now
becomes > 0.
So we get a single (uniform) decision criterion, a^t y_i > 0, for both
classes, as sketched below.
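A minimal sketch of this normalization step on a hypothetical toy data set (the sample values are made up for illustration):

```python
import numpy as np

# Hypothetical 2-D training samples and labels (illustration only)
X = np.array([[2.0, 1.0],
              [1.5, 2.0],     # class omega_1
              [-1.0, -0.5],
              [-2.0, 1.0]])   # class omega_2
labels = np.array([1, 1, 2, 2])

# Augment every sample with a trailing 1: y = [x1, ..., xd, 1]^t
Y = np.hstack([X, np.ones((X.shape[0], 1))])

# Negate the augmented samples of class omega_2
Y[labels == 2] *= -1

# Now a weight vector a is correct iff a^t y_i > 0 for every row y_i
print(Y)
```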

23 / 31
Single Criterion: How can we do that?

If a^t y_i > 0 for all samples, then all samples are correctly classified,
irrespective of class label.
Now, what should the weight vector a be?
We define a criterion function J(a).
J(a) is minimized when a is a solution (correct weight) vector.
J(a) attains its minimum if the obtained weight vector a classifies all
the training samples correctly.
For the minimization of J(a), we can make use of the gradient descent
procedure.

24 / 31
Gradient Descent Procedure

[Slide 25: figure illustrating the gradient descent procedure]

25 / 31
Gradient Descent Procedure

Initialize the weight vector a(k) with some random values and try to
minimize the training error at every iteration.
At the k-th iteration, we know the value of a(k).
We then update the weight vector to obtain a(k+1):
a(k+1) = a(k) − η(k) ∇J(a(k))
This is called the gradient descent procedure, or steepest descent
procedure.

26 / 31
Algorithm: Gradient Descent

Initialize a, threshold θ, learning rate η(·), k ← 0

do
    k ← k + 1
    a ← a − η(k) ∇J(a)
until ||η(k) ∇J(a)|| < θ
return a
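A generic sketch of this loop in Python; the criterion gradient here is a placeholder (a simple quadratic, chosen only for illustration), while a concrete choice, the perceptron criterion, follows on the next slides:

```python
import numpy as np

def gradient_descent(grad_J, a_init, eta=lambda k: 0.1, theta=1e-6, max_iter=1000):
    """Generic gradient (steepest) descent: a(k+1) = a(k) - eta(k) * grad_J(a(k))."""
    a = a_init.astype(float)
    for k in range(1, max_iter + 1):
        step = eta(k) * grad_J(a)
        a = a - step
        if np.linalg.norm(step) < theta:   # stop when the update is small enough
            break
    return a

# Example with a simple quadratic criterion J(a) = ||a||^2, whose gradient is 2a
a_star = gradient_descent(grad_J=lambda a: 2 * a, a_init=np.array([3.0, -2.0]))
print(a_star)   # converges towards the minimizer [0, 0]
```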
27 / 31
Perceptron Criterion Function

Our aim is to find a weight vector a which classifies all the training
samples correctly.
So, we can design a criterion function which makes use of the samples
that are not correctly classified.
If some samples are not correctly classified by a(k), then the weight
vector is updated to a(k+1).
Accordingly, the criterion function can be defined as
J_p(a) = Σ_{y misclassified} (−a^t y)
Here the subscript p refers to the perceptron criterion.

28 / 31
Perceptron Criterion Function

J_p(a) = Σ_{y misclassified} (−a^t y)
For a misclassified sample y, a^t y ≤ 0, so (−a^t y) is non-negative.
As a result, the criterion J_p(a) never takes a negative value.
Its minimum value is 0, attained when no sample is misclassified.
So J_p(a) has a global minimum, and it can be found by the gradient
descent procedure.

29 / 31
Perceptron Criterion Function

According to the gradient descent procedure, take the gradient of J_p(a)
with respect to the weight vector a:

J_p(a) = Σ_{y misclassified} (−a^t y)
∇J_p(a) = Σ_{y misclassified} (−y)

The update rule is
a(0) ⇒ initial weight vector, arbitrary
a(k+1) = a(k) + η(k) Σ_{y misclassified} y

This is the algorithm to design the weight vector a when the samples are
linearly separable; a runnable sketch follows below.
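A minimal batch-perceptron sketch combining the earlier normalization step with this update rule; the toy data, learning rate η, and iteration limit are hypothetical choices:

```python
import numpy as np

def batch_perceptron(X, labels, eta=1.0, max_iter=1000):
    """Batch perceptron: a(k+1) = a(k) + eta * sum of misclassified (normalized) samples."""
    # Augment with a trailing 1 and negate class omega_2 samples
    Y = np.hstack([X, np.ones((X.shape[0], 1))])
    Y[labels == 2] *= -1

    a = np.zeros(Y.shape[1])            # arbitrary initial weight vector a(0)
    for _ in range(max_iter):
        misclassified = Y[Y @ a <= 0]   # samples with a^t y <= 0
        if len(misclassified) == 0:     # all samples correctly classified: J_p(a) = 0
            break
        a = a + eta * misclassified.sum(axis=0)
    return a

# Hypothetical linearly separable toy data (illustration only)
X = np.array([[2.0, 1.0], [1.5, 2.0],      # omega_1
              [-1.0, -0.5], [-2.0, 1.0]])  # omega_2
labels = np.array([1, 1, 2, 2])

a = batch_perceptron(X, labels)
print(a)                                     # learned augmented weight vector [w1, w2, w0]
print(np.hstack([X, np.ones((4, 1))]) @ a)   # positive for omega_1, negative for omega_2
```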

30 / 31
THANK YOU

31 / 31
