
LOGISTIC REGRESSION
Classifier & Training

Prof. Anil Singh Parihar


Notations

■ Small letters in bold are used to represent a vector; for example, x is a vector.
■ Capital letters in bold are used to represent a matrix; for example, X is a matrix.
Equation of a Plane in n-Dimensional Space
■ Equation of a plane:
$\mathbf{w}^T \mathbf{x} + w_0 = 0$
■ Minimum distance of the plane from any point x (shown in red) outside the plane, considering the distance as positive only, is:
$d_{\min}(\mathbf{x}) = \frac{|\mathbf{w}^T \mathbf{x} + w_0|}{\|\mathbf{w}\|}$
■ If we consider the sign of the distance as well, then:
$d_{\min}(\mathbf{x}) = \frac{\mathbf{w}^T \mathbf{x} + w_0}{\|\mathbf{w}\|}$
■ Distance from the origin:
$d_0 = \frac{|w_0|}{\|\mathbf{w}\|}$
Equation of a Plane in n-Dimensional Space
■ Let us consider a function
$g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$
■ Equation of the plane:
$g(\mathbf{x}) = 0$
■ Minimum distance of a point from the plane is:
$d_{\min}(\mathbf{x}) = \frac{|g(\mathbf{x})|}{\|\mathbf{w}\|}$
■ Remember that the vector w decides the orientation of the plane, and w0 is proportional to the distance of the plane from the origin.
■ If g(x) > 0, then point x is above the plane.
■ If g(x) = 0, then point x is on the plane.
■ If g(x) < 0, then point x is below the plane. (A small sketch of this sign test follows.)
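A minimal sketch (not from the slides) of evaluating g(x) and the signed distance for a point; it assumes NumPy, and the plane parameters and sample point are made up for illustration.

```python
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical orientation vector
w0 = 0.5                    # hypothetical offset

def g(x, w, w0):
    """Evaluate g(x) = w^T x + w0 for a single point x."""
    return w @ x + w0

def signed_distance(x, w, w0):
    """Signed minimum distance of x from the plane g(x) = 0."""
    return g(x, w, w0) / np.linalg.norm(w)

x = np.array([1.0, 1.0])
print(g(x, w, w0))                # > 0: the point lies above the plane
print(signed_distance(x, w, w0))  # signed distance d_min(x)
```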
Discriminant Function
■ In a classification problem, the plane g(x) = 0 can be considered as a discriminant function.
■ Consider a two-class classification problem with a two-dimensional feature space, i.e. $\mathbf{x} = [x_1, x_2]^T$.
[Figure: the plane g(x) = 0 in the (x1, x2) plane, with the region g(x) > 0 above, the region g(x) < 0 below, the normal vector w, and the distance d of a sample point from the plane.]
■ Remember that the minimum distance of a point from the plane is $d_{\min} = |g(\mathbf{x})| / \|\mathbf{w}\|$.
■ Thus, for a given plane (i.e. fixed w), the distance of a point is proportional to |g(x)|.
Discriminant Function
■ Therefore, for a sample point close to the plane (decision boundary), |g(x)| will be smaller, and for a point far away from the plane, |g(x)| will have a large value (ignoring the sign of the distance).
■ Note that for points above (on one side of) the plane g(x) > 0, and for points below (on the other side of) the plane g(x) < 0.
Classification
■ Notice that we can assign the class of a distant point, i.e. a point having a high (absolute) value of g(x), with more confidence as compared to closer points, i.e. points having a low (absolute) value of g(x).
■ For a point very close to the plane, the class assignment may change even with a small variation in the orientation (or location) of the plane, i.e. a slight variation in the vector w (or w0).
Classification
■ i.e., there in uncertainty in class prediction of
point close to plane (decision boundary) and x2
g ( x)  0
g (x)

g ( x)  0
■ This uncertainty to directly proportional to
(absolute) value of g (x) . d
g (x)
w

g ( x)  0
■ The value of g (x) vary between   to   .
However, for classification exact values are not x1
required.
Classification
■ For the purpose of classification, it is sufficient to find some measure of the distance of a point from the hyper-plane, based on g(x).
■ However, an exact measure of the distance is not required.
■ If two points are at a significantly large distance from the plane, we can safely assign their classes, even though one point may be closer to the plane as compared to the other.
Classification
■ For example, the points x(A) and x(B) shown in the figure are both at sufficient distance from the plane and can be classified with almost no uncertainty.
■ This holds even though the distance of x(A) from the plane is less than the distance of x(B) if we calculate exact distances.
■ One way to have a better measure of the distance between the hyper-plane and a sample point is to map g(x) from $(-\infty, +\infty)$ to $[0, 1]$.
Classification
■ The uncertainty in deciding the classes can be better modelled if we restrict the distance measure to the range [0, 1].
■ Indeed, it can be considered as the probability of belonging to a class.
■ For example, let us consider two-class classification (Class C1 and Class C2).
Classification
■ If a point x is above the plane (say Class C1), then g(x) > 0, and the exact value of g(x) may be any value in the range $(0, \infty)$.
■ We can map the values of g(x) into the range [0, 1] using some function, say h(x) = h(g(x)).
■ One such function is the logistic function, or sigmoid function.
Logistic Regression
■ The logistic function or sigmoid function is given as:
$h(g(\mathbf{x})) = \frac{1}{1 + e^{-g(\mathbf{x})}}; \qquad h(\mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + w_0)}}$
[Figure: the sigmoid curve h(x) plotted against $g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$, with g(x) < 0 on the left and g(x) > 0 on the right.]
■ Note that h(x) represents the probability that a point x belongs to class C1.
■ Thus, if h(x) > 0.5 then x belongs to class C1,
■ and, if h(x) < 0.5 then x belongs to class C2.
■ Let us consider a class label y such that
y = 1 denotes Class C1, and
y = 0 denotes Class C2.
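A minimal sketch (not from the slides) of the sigmoid hypothesis and the 0.5 decision rule; it assumes NumPy, and the parameters w, w0 and the sample point are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps (-inf, +inf) to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(x, w, w0):
    """Hypothesis h(x) = sigmoid(w^T x + w0): probability that x belongs to class C1 (y = 1)."""
    return sigmoid(w @ x + w0)

w, w0 = np.array([2.0, -1.0]), 0.5        # hypothetical parameters
x = np.array([1.0, 1.0])
p_c1 = h(x, w, w0)
predicted_class = 1 if p_c1 > 0.5 else 0  # C1 if h(x) > 0.5, else C2
print(p_c1, predicted_class)
```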
Alternate representations of w and x
■ Revisit the equation of the plane (hyper-plane) in n dimensions:
$\mathbf{w}^T \mathbf{x} + w_0 = 0$, where w = [w1, w2, …, wn]T and x = [x1, x2, …, xn]T
i.e. $w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + w_0 = 0$
$w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = 0$
$w_0 x_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = 0$, where $x_0 = 1$
■ Thus, $\mathbf{w}^T \mathbf{x} = 0$ represents a hyper-plane in (n+1)-dimensional space passing through the origin,
where w = [w0, w1, …, wn]T and x = [x0, x1, …, xn]T.
These are the augmented weight and augmented feature vectors. (A small sketch of the augmentation follows.)
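A minimal sketch (not from the slides) of forming the augmented vectors; it assumes NumPy, and the numerical values are made up for illustration.

```python
import numpy as np

x = np.array([3.0, -1.5])           # original n-dimensional feature vector
w = np.array([2.0, -1.0])           # original weights
w0 = 0.5                            # bias term

x_aug = np.concatenate(([1.0], x))  # x = [x0, x1, ..., xn]^T with x0 = 1
w_aug = np.concatenate(([w0], w))   # w = [w0, w1, ..., wn]^T

# The two forms give the same value of g(x):
assert np.isclose(w @ x + w0, w_aug @ x_aug)
```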
Logistic Regression…
■ Let us consider a class label y such that y = 1 denotes Class C1, and y = 0 denotes Class C2.
■ Thus, the hypothesis h(x) is the probability of the output variable y = 1 for the input x; moreover, this probability depends on the hyper-plane g(x) = 0.
■ The location and orientation of the hyper-plane are controlled by the parameter vector w = [w0, w1, …, wn]T.
Logistic Regression…
■ That is, for given (fixed) samples, the probability of a point x belonging to a class is governed by the parameter vector w.
■ Thus, the hypothesis h(x) is better denoted by hw(x).
■ Clearly, it can be observed that the probability of y = 1 depends on the input x and the parameters w, i.e.
$h_{\mathbf{w}}(\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w})$
■ Since we are discussing binary classification,
$p(y = 0 \mid \mathbf{x}, \mathbf{w}) = 1 - p(y = 1 \mid \mathbf{x}, \mathbf{w})$
Learning of Parameter w
■ In classification problems (supervised learning), a labelled dataset (with class labels) is available, and
we need to design a classifier for new samples (a testing set without class labels) with the help of the given labelled dataset.
■ Designing a logistic regression classifier is indeed a problem of finding suitable parameters w (including w0)
such that the hyper-plane divides (separates) the classes without any misclassification.
■ Note that this will work only for data with linearly separable classes.
Learning of Parameter w…
■ Process of learning parameters from data:
– Initially, we start with a random plane (we may use a heuristic to find the initial plane), i.e. with some random values of w.
– There would be some misclassifications in general (unless you are extremely lucky and have no misclassification).
– We then try to update the parameters w (i.e. change the orientation and location of the hyper-plane) such that the classification errors (misclassifications) are reduced. [Such a mechanism is required.]
– We update the parameters until the error is reduced to zero; the parameters w* corresponding to zero classification error represent the final hyper-plane.
Learning of Parameter w…
■ Let us consider that we have m samples points (training set) with class levels:
Training set: { (x(1), y(1)), (x(2), y(2)), …, (x(m), y(m)) }

■ Here x(i) i = 1, 2,…, m are n-dimensional vectors, and


y(i) are corresponding outputs (class levels)
i.e. for two class classification y  {0, 1}.

■ We can write these vectors as augmented vectors: x = [x0 , x1,…,xn ]T where x0 = 1


and w = [w0 , w1,…,wn ]T
Learning of Parameter w…
■ All vectors (augmented) of training set can be put together in a matrix
 | | | 
X  x (1) x ( 2 ) ... x ( m ) 
 
 | | | 
■ and corresponding class levels (outputs) can be put in a row vector.


y  y (1) , y ( 2 ) ,... , y ( m ) 
1
■ Thus, hypothesis is given as: h( x) 
wT x
1 e

■ Our aim is to find parameter w = [w0 , w1,…,wn ]T from given training data.
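A minimal sketch (not from the slides) of stacking the augmented training vectors into a matrix X and evaluating the hypothesis for all samples at once; it assumes NumPy, and the toy data is made up for illustration.

```python
import numpy as np

# Toy training set: m = 4 samples, n = 2 features, labels y in {0, 1}
X_raw = np.array([[0.5, 1.0],
                  [2.0, 3.0],
                  [-1.0, 0.5],
                  [1.5, -0.5]])
y = np.array([1, 1, 0, 0])

# Augment: each column of X is x^(i) = [1, x1, ..., xn]^T, so X has shape (n+1, m)
X = np.vstack([np.ones(X_raw.shape[0]), X_raw.T])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(X.shape[0])      # augmented parameter vector [w0, w1, ..., wn]^T
h = sigmoid(w @ X)            # hypothesis h_w(x^(i)) for every training sample
print(h)                      # all 0.5 when w = 0
```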
Learning of Parameter w
■ As discussed earlier, we start with random parameters (or initial parameters based on some heuristic), and
■ use some mechanism to iteratively update the parameters (weights) to reduce the difference (error) between the predicted output hw(x(i)) and the actual output (given class) y(i).
■ Can we do it (finding suitable parameters) by minimizing a cost function or loss function, as we did in the case of linear regression? Let's see.
Learning of Parameter w
■ In the case of linear regression, we have a cost function based on the mean squared error, i.e.
$J(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left( h_{\mathbf{w}}(\mathbf{x}^{(i)}) - y^{(i)} \right)^2$
■ The above cost function is the error associated with the whole training set; indeed, it is the average error over all m samples.
■ For simplicity, and to avoid confusion, let us call the error for one sample the loss function:
$L(\mathbf{x}^{(i)}, y^{(i)}) = \frac{1}{2} \left( h_{\mathbf{w}}(\mathbf{x}^{(i)}) - y^{(i)} \right)^2, \qquad \Rightarrow \quad J(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} L(\mathbf{x}^{(i)}, y^{(i)})$
Learning of Parameter w
■ In the case of linear regression, $h_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$, the cost function J(w) is a quadratic function, which is convex in nature.
■ Thus, the gradient descent algorithm will converge to the global minimum.
■ However, in the case of logistic regression, the loss function (cost function) involves the sigmoid function:
$h_{\mathbf{w}}(\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}}}$
hence $L(\mathbf{x}^{(i)}, y^{(i)}) = \frac{1}{2} \left( h_{\mathbf{w}}(\mathbf{x}^{(i)}) - y^{(i)} \right)^2$ is not quadratic (not convex).
Learning of Parameter w
■ Thus, the mean-squared-error based cost function
$J(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left( h_{\mathbf{w}}(\mathbf{x}^{(i)}) - y^{(i)} \right)^2$
is non-convex in nature.
■ Therefore, the gradient descent algorithm is not guaranteed to converge to the global minimum.
■ Thus, we require another cost function (rather than the mean squared error), which is convex in nature. (The sketch below probes this non-convexity numerically.)
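A minimal numerical probe (not from the slides) of why the squared-error cost with a sigmoid hypothesis can be non-convex: it scans J(w) along a 1-D slice for a scalar weight and looks for points where the discrete second difference is negative, which indicates the curve is not convex along that slice. It assumes NumPy, and the toy 1-D data is made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([-2.0, -1.0, 3.0, 4.0])   # toy 1-D inputs (no bias, for simplicity)
y = np.array([1.0, 1.0, 0.0, 0.0])     # toy labels

ws = np.linspace(-10.0, 10.0, 401)     # 1-D slice of parameter space
J = np.array([np.mean(0.5 * (sigmoid(w * x) - y) ** 2) for w in ws])

# A convex curve never has a negative second difference on a uniform grid.
second_diff = J[:-2] - 2.0 * J[1:-1] + J[2:]
print("non-convex along this slice:", np.any(second_diff < -1e-9))
```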
Cost function for logistic regression
■ Remember that the hypothesis hw(x) is the probability of the output taking the value 1 (i.e. y = 1) for an input x.
(Let y = 1 represent class C1 and y = 0 represent class C2.)
Thus, $h_{\mathbf{w}}(\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w})$
■ It is the posterior probability of class C1 (y = 1).
■ Since we are discussing two-class classification, the posterior probability of class C2 (y = 0) is given as:
$p(y = 0 \mid \mathbf{x}, \mathbf{w}) = 1 - p(y = 1 \mid \mathbf{x}, \mathbf{w}) = 1 - h_{\mathbf{w}}(\mathbf{x})$
Cost function for logistic regression
■ We can combine both outputs (classes) in a single expression, i.e. p(y | x, w), as:
$p(y \mid \mathbf{x}, \mathbf{w}) = h_{\mathbf{w}}(\mathbf{x})^{y} \left( 1 - h_{\mathbf{w}}(\mathbf{x}) \right)^{1-y}$
■ It can be easily verified:
$y = 1: \quad p(y = 1 \mid \mathbf{x}, \mathbf{w}) = h_{\mathbf{w}}(\mathbf{x})^{1} \cdot \left( 1 - h_{\mathbf{w}}(\mathbf{x}) \right)^{0} = h_{\mathbf{w}}(\mathbf{x})$
$y = 0: \quad p(y = 0 \mid \mathbf{x}, \mathbf{w}) = h_{\mathbf{w}}(\mathbf{x})^{0} \cdot \left( 1 - h_{\mathbf{w}}(\mathbf{x}) \right)^{1} = 1 - h_{\mathbf{w}}(\mathbf{x})$


Cost function for logistic regression
■ Considering the class label as a random variable, p(y | x, w) represents the probability distribution of that random variable.
■ Actually, this is a particular case of the binomial distribution called the Bernoulli distribution.
■ Assume the data points in the dataset are drawn independently from this distribution.
■ The parameters of the distribution can be estimated using maximum likelihood estimation.
– Note that here the parameters are w.
Cost function for logistic regression
■ The likelihood of observing the training set is given as:
$\prod_{i=1}^{m} h_{\mathbf{w}}(\mathbf{x}^{(i)})^{y^{(i)}} \left( 1 - h_{\mathbf{w}}(\mathbf{x}^{(i)}) \right)^{1 - y^{(i)}}$
■ It is more convenient to minimize the negative logarithm of the likelihood.
■ Thus, the cost function (to be minimized) for logistic regression is given as:
$J(\mathbf{w}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_{\mathbf{w}}(\mathbf{x}^{(i)}) + (1 - y^{(i)}) \log \left( 1 - h_{\mathbf{w}}(\mathbf{x}^{(i)}) \right) \right]$
Cost function for logistic regression
■ Let us verify whether this new cost function is convex in nature.
■ For simplicity, let us consider a single training example; the cost function, i.e. the loss function, is then:
$L(h_{\mathbf{w}}(\mathbf{x}), y) = -y \log\left(h_{\mathbf{w}}(\mathbf{x})\right) - (1 - y) \log\left(1 - h_{\mathbf{w}}(\mathbf{x})\right)$
$L(h_{\mathbf{w}}(\mathbf{x}), y) = \begin{cases} -\log\left(h_{\mathbf{w}}(\mathbf{x})\right), & \text{if } y = 1 \\ -\log\left(1 - h_{\mathbf{w}}(\mathbf{x})\right), & \text{if } y = 0 \end{cases}$
Cost function for logistic regression
■ Remember, we are discussing two-class classification, i.e. two actual classes: C1 (y = 1) and C2 (y = 0), and
■ the predicted value hw(x) represents the probability that a training sample x belongs to class C1.
■ The loss L(hw(x), y) represents the error (cost) in the prediction for a training sample x.
Cost function for logistic regression
■ Let us analyse case-wise (for training samples of both classes).
Case-I: y = 1 (actual class given);
predicted value = hw(x), and the associated loss is: $L(h_{\mathbf{w}}(\mathbf{x}), y) = -\log\left(h_{\mathbf{w}}(\mathbf{x})\right)$
[Figure: the curve of L(hw(x), y) against hw(x) over [0, 1] for y = 1.]
■ The curve shows that if the predicted value is near 1, i.e. hw(x) ≈ 1, the loss is near zero (very low).
■ This is the case of proper classification, i.e. objects of class 1 are predicted to be class 1 objects.
Cost function for logistic regression
Case-I: y = 1 (actual class given);
■ Now, if the predicted value is near 0, i.e. hw(x) ≈ 0, and the actual class is class 1 (y = 1),
■ the curve shows that there is a very high loss.
■ This is the case of misclassification, i.e. objects of class 1 are predicted to be class 0 objects.
Cost function for logistic regression
Case-II: y = 0 (actual class given);
predicted value hw(x), and the associated loss is: $L(h_{\mathbf{w}}(\mathbf{x}), y) = -\log\left(1 - h_{\mathbf{w}}(\mathbf{x})\right)$
[Figure: the curve of L(hw(x), y) against hw(x) over [0, 1] for y = 0.]
■ The curve shows that if the predicted value is near 1, i.e. hw(x) ≈ 1, then the loss is very high.
■ This is the case of misclassification, i.e. objects of class 0 are predicted to be class 1 objects.
Cost function for logistic regression
Case-II: y = 0 (actual class given);
■ Now, if the predicted value is near 0, i.e. hw(x) ≈ 0, and the actual class is class 0 (y = 0),
■ the curve shows that there is a very low loss (≈ 0). This is the case of proper classification.
■ In the case of multiple instances (samples), the loss will be low for proper classification and high for misclassification.
■ Thus, the cost function is convex in nature, and the gradient descent algorithm will converge to the global minimum (since there is only one). (A sketch of the two loss curves follows.)
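A minimal sketch (not from the slides) that plots the two per-sample loss curves, -log(h) for y = 1 and -log(1 - h) for y = 0, over h in (0, 1); it assumes NumPy and Matplotlib are available.

```python
import numpy as np
import matplotlib.pyplot as plt

h = np.linspace(0.001, 0.999, 500)           # predicted probability h_w(x)
plt.plot(h, -np.log(h), label="y = 1: -log(h)")
plt.plot(h, -np.log(1.0 - h), label="y = 0: -log(1 - h)")
plt.xlabel("h_w(x)")
plt.ylabel("loss L(h_w(x), y)")
plt.legend()
plt.show()
```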
Gradient Descent Algorithm
■ The gradient descent algorithm is based on the simple notion that if we want to go downhill, we should move in the direction opposite to that of the maximum change in height.
■ The direction and amount of the maximum change in height are given by the gradient.
■ In the given problem, the cost function is analogous to the height, so we use the gradient of the cost function.
■ It should be noted that the gradient descent algorithm may get stuck in a local minimum.
Gradient Descent Algorithm
■ For simplicity, let us consider that the weight vector consists of only one weight w (one-dimensional, or a scalar). The bias component w0 is also zero.
■ Gradient descent algorithm: the weight update is given as:
$w := w - \alpha \frac{d}{dw} J(w)$
(where α is the learning rate)
Gradient Descent Algorithm
■ The figure shows a plot of the cost function against the one-dimensional parameter w.
■ We can start with any random value of w.
■ Then we apply the gradient descent algorithm to find the optimal value of the parameter w.
■ The optimal parameter w corresponds to the minimum value of the cost.
■ The weight update is given as: $w := w - \alpha \frac{d}{dw} J(w)$ (a short training-loop sketch follows).
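A minimal end-to-end sketch (not from the slides) of training logistic regression with batch gradient descent on the cross-entropy cost; it assumes NumPy, the toy data, learning rate, and iteration count are made up, and it uses the standard gradient of the cross-entropy cost, dJ/dw = (1/m) X (h - y), for X of shape (n+1, m).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy linearly separable data: m = 4 samples, n = 2 features
X_raw = np.array([[0.5, 1.0], [2.0, 3.0], [-1.0, 0.5], [1.5, -0.5]])
y = np.array([1.0, 1.0, 0.0, 0.0])
X = np.vstack([np.ones(X_raw.shape[0]), X_raw.T])   # augmented, shape (n+1, m)

w = np.zeros(X.shape[0])   # initial parameter vector (here simply zeros)
alpha = 0.1                # learning rate (hypothetical choice)

for _ in range(5000):
    h = sigmoid(w @ X)               # predictions h_w(x^(i)) for all samples
    grad = X @ (h - y) / y.size      # gradient of the cross-entropy cost
    w -= alpha * grad                # w := w - alpha * dJ/dw

print("learned w:", w)
print("predicted classes:", (sigmoid(w @ X) > 0.5).astype(int))
```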
