Logistic Regression Training DR Anil
Classifier & Training
Equation of a Plane in n-Dimensional Space
■ Let us consider a function g(x) = w^T x + w0.
■ Equation of the plane: g(x) = 0.
■ Minimum distance of a point from the plane is: d_min(x) = g(x) / ||w||.
■ Remember, the vector w decides the orientation of the plane, and w0 is proportional to the distance of the plane from the origin.
■ If g(x) > 0 then point x is above the plane.
■ If g(x) = 0 then point x is on the plane.
■ If g(x) < 0 then point x is below the plane.
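As a quick illustration (not part of the original slides), here is a minimal NumPy sketch of g(x) and the signed distance to the plane; the weights and the sample point are made-up values:

import numpy as np

w = np.array([2.0, 1.0])     # normal vector: decides the orientation of the plane
w0 = -4.0                    # offset: proportional to the plane's distance from the origin

def g(x):
    # g(x) = w^T x + w0; its sign tells which side of the plane x lies on
    return np.dot(w, x) + w0

def signed_distance(x):
    # distance of x from the plane g(x) = 0, with sign
    return g(x) / np.linalg.norm(w)

x = np.array([3.0, 1.0])
print(g(x), signed_distance(x))   # 3.0 and ~1.34 -> x is above the plane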
Discriminant Function
■ In a classification problem, the plane g(x) = 0 can be considered as a discriminant function.
■ Consider a two-class classification problem with a two-dimensional feature space, i.e. x = [x1, x2]^T.
[Figure: feature space (x1, x2) showing the plane g(x) = 0, the regions g(x) > 0 and g(x) < 0, and the distance d = g(x) / ||w|| of a point from the plane.]
■ The confidence with which a point can be assigned to a class is directly proportional to the (absolute) value of g(x); points close to the plane are classified with greater uncertainty.
■ The value of g(x) varies between -∞ and +∞. However, for classification, exact values are not required.
Classification
■ For the purpose of classification, it is sufficient to find some measure of distance from the hyperplane g(x) = 0.
■ However, an exact measure of distance is not required.
■ If two points are at a significantly large distance from the plane, we can safely assign their classes, even though one point may be closer to the plane than the other.
[Figure: points x(A) and x(B) on either side of the plane g(x) = 0, with distance d = g(x) / ||w||.]
Classification
■ For example, the points x(A) and x(B) shown in the figure are both at sufficient distance from the plane and can be classified with almost no uncertainty,
■ even though the distance of x(A) from the plane is less than the distance of x(B) if we calculate exact distances.
■ One way to have a better measure of the distance between the hyperplane and a sample point is to map g(x) from (-∞, +∞) to [0, 1].
Classification
■ The uncertainty in deciding the classes can be better modelled if we restrict the distance measure to the range [0, 1].
■ For example, let us consider a two-class classification problem (Class C1 and Class C2).
Classification
■ If a point x is above the plane (say Class C1) then g(x) > 0, and the exact value of g(x) may be any value in the range (0, ∞).
■ We therefore need a function that maps these values into [0, 1]. One such function is the logistic function or sigmoid function.
Logistic Regression
■ The logistic function or sigmoid function is given as:
h(g(x)) = 1 / (1 + e^(-g(x)));   h(x) = 1 / (1 + e^(-(w^T x + w0)))
■ Thus, w^T x = 0 represents a hyperplane in the (n+1)-dimensional augmented space, passing through its origin,
where w = [w0, w1, …, wn]^T and x = [x0, x1, …, xn]^T (with x0 = 1)
are the augmented weight and augmented feature vectors.
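As an illustration, a minimal NumPy sketch of the sigmoid hypothesis (the function names and example values are mine, not from the slides):

import numpy as np

def sigmoid(z):
    # logistic (sigmoid) function: maps any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(w, x):
    # h(x) = sigmoid(w^T x) for an augmented feature vector x (x0 = 1)
    return sigmoid(np.dot(w, x))

w = np.array([-1.0, 2.0, 1.0])   # augmented weights [w0, w1, w2]
x = np.array([1.0, 3.0, 0.5])    # augmented features [1, x1, x2]
print(hypothesis(w, x))          # ~0.996: a point far on the positive side gets h(x) close to 1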
Logistic Regression…
■ Let us consider a class label y such that y = 1 denotes Class C1,
and y = 0 denotes Class C2.
■ Thus, the hypothesis h(x) is the probability of the output variable y = 1 for input x;
moreover, this probability depends on the hyperplane g(x) = 0.
■ The location and orientation of the hyperplane are controlled by the parameter vector
w = [w0, w1, …, wn]^T, where g(x) = w^T x + w0.
[Figure: the sigmoid curve h(x) plotted against g(x).]
Logistic Regression…
■ That is, for given (fixed) samples, the probability of a point x belonging to a class is governed by the parameter vector w.
■ Note that this works only for data with linearly separable classes.
Learning of Parameter w…
■ Process of learning parameters from data:
– Initially, we start with a random plane (we may use a heuristic to find the initial plane), i.e. with some random values of w.
– In general, there will be some misclassifications (unless you are extremely lucky to have none).
– We then try to update the parameters w (i.e. change the orientation and location of the hyperplane)
– such that the classification errors (misclassifications) are reduced. [Such a mechanism is required.]
■ The given training data consist of m samples with class labels y = { y^(1), y^(2), …, y^(m) }.
■ Thus, the hypothesis is given as: h(x) = 1 / (1 + e^(-w^T x))
■ Our aim is to find the parameter vector w = [w0, w1, …, wn]^T from the given training data.
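A small sketch of how the training data, the augmented feature matrix (x0 = 1), and an initial random w might be set up; the toy data, variable names, and the normal initializer are assumptions for illustration only:

import numpy as np

# toy training data: m = 4 samples, n = 2 features, binary labels
X = np.array([[3.0, 0.5],
              [1.0, 2.0],
              [-2.0, -1.0],
              [-1.5, 0.2]])
y = np.array([1, 1, 0, 0])

# augment each sample with x0 = 1 so that w0 acts as the bias term
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # shape (m, n+1)

# start with small random parameters w = [w0, w1, ..., wn]^T
rng = np.random.default_rng(0)
w = rng.normal(scale=0.01, size=X_aug.shape[1])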
Learning of Parameter w
■ As discussed earlier, we start with random parameters (or initial parameters based on some heuristics) and measure the error
between the predicted output hw(x^(i)) and the actual output (given class) y^(i):

J(w) = (1/m) Σ_{i=1..m} (1/2) (hw(x^(i)) - y^(i))^2

■ The above cost function is the error associated with the whole training set; indeed, it is the average error over all m samples.
■ For simplicity, and to avoid confusion, let us call the error for one sample the loss function:

L(x^(i), y^(i)) = (1/2) (hw(x^(i)) - y^(i))^2,   so that   J(w) = (1/m) Σ_{i=1..m} L(x^(i), y^(i))
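A minimal sketch of this mean-squared-error cost for the logistic hypothesis, reusing the X_aug, y, and w variables from the previous sketch (the function name is mine):

import numpy as np

def mse_cost(w, X_aug, y):
    # J(w) = (1/m) * sum of (1/2) * (hw(x^(i)) - y^(i))^2 over all m samples
    h = 1.0 / (1.0 + np.exp(-X_aug @ w))   # predicted probabilities hw(x^(i))
    return np.mean(0.5 * (h - y) ** 2)

print(mse_cost(w, X_aug, y))   # average squared error of the initial random plane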
Learning of Parameter w
■ In the case of linear regression, hw(x) = w^T x, and the cost function J(w) is a quadratic function, which is convex in nature.
■ However, in the case of logistic regression, the loss function (and hence the cost function) involves the sigmoid function
hw(x) = 1 / (1 + e^(-w^T x)); hence

L(x^(i), y^(i)) = (1/2) (hw(x^(i)) - y^(i))^2

is not quadratic (not convex).
Learning of Parameter w
■ Thus, the mean-squared-error based cost function

J(w) = (1/m) Σ_{i=1..m} (1/2) (hw(x^(i)) - y^(i))^2

is non-convex in nature.
■ Therefore, the gradient descent algorithm is not guaranteed to converge to the global minimum.
■ Thus, we require another cost function (rather than mean squared error), one which is convex in nature.
Cost function for logistic regression
■ Remember that the hypothesis hw(x) is the probability of the output taking value 1 (i.e. y = 1) for an input x.
(Let y = 1 represent Class C1 and y = 0 represent Class C2.)

p(y = 1 | x, w) = hw(x),    p(y = 0 | x, w) = 1 - hw(x)

■ Assume the data points in the data set are drawn independently from this distribution; the likelihood of the training data is then

Π_{i=1..m} hw(x^(i))^(y^(i)) (1 - hw(x^(i)))^(1 - y^(i))

■ Thus, the cost function (to be minimized) for logistic regression is given as:

J(w) = -(1/m) Σ_{i=1..m} [ y^(i) log hw(x^(i)) + (1 - y^(i)) log(1 - hw(x^(i))) ]
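A minimal sketch of this cross-entropy cost, again reusing X_aug, y, and w from the earlier sketches; the small eps added inside the logarithms (to avoid log(0)) is my addition, not part of the slides:

import numpy as np

def cross_entropy_cost(w, X_aug, y, eps=1e-12):
    # J(w) = -(1/m) * sum[ y*log(h) + (1 - y)*log(1 - h) ]
    h = 1.0 / (1.0 + np.exp(-X_aug @ w))
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

print(cross_entropy_cost(w, X_aug, y))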
Cost function for logistic regression
L(hw(x), y) = -log(hw(x))       if y = 1
L(hw(x), y) = -log(1 - hw(x))   if y = 0
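A quick numerical check of this piecewise loss (the probability values 0.9 and 0.1 are only illustrative):

import numpy as np

# y = 1: confident correct prediction gives small loss, confident wrong prediction gives large loss
print(-np.log(0.9))       # ~0.105  (hw(x) close to 1, correct)
print(-np.log(0.1))       # ~2.303  (hw(x) close to 0, misclassified)

# y = 0: the roles are mirrored through log(1 - hw(x))
print(-np.log(1 - 0.1))   # ~0.105  (hw(x) close to 0, correct)
print(-np.log(1 - 0.9))   # ~2.303  (hw(x) close to 1, misclassified)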
Cost function for logistic regression
■ Remember, we are discussing a two-class classification problem.
■ The loss L(hw(x), y) represents the error (cost) in the prediction for a training sample x.
Cost function for logistic regression
■ Let us analyse the loss case by case (for training samples of both classes).
Case-I: y = 1 (actual class given); predicted value hw(x), and the associated loss is:
L(hw(x), y) = -log(hw(x)),   0 ≤ hw(x) ≤ 1
■ The loss is large when hw(x) is close to 0. This is the case of misclassification, i.e. the objects of class 1 are predicted to be class 0 objects.
Cost function for logistic regression
Case-II: y = 0 (actual class given); predicted value hw(x), and the associated loss is:
L(hw(x), y) = -log(1 - hw(x)),   0 ≤ hw(x) ≤ 1
■ The loss is large when hw(x) is close to 1. This is the case of misclassification, i.e. the objects of class 0 are predicted to be class 1 objects.
Cost function for logistic regression
■ In case of multiple instances (samples), the loss will be low for proper classification and high for misclassification.
■ Thus, the cost function is convex in nature, and the gradient descent algorithm will converge to the global minimum (since there is only one).
Gradient Descent Algorithm
■ The gradient descent algorithm is based on the simple notion that, starting from some initial parameters, we repeatedly move the parameters in the direction of the negative gradient of the cost function (the direction of steepest descent).
■ It should be noted that the gradient descent algorithm may get stuck in a local minimum if the cost function is non-convex.
Gradient Descent Algorithm
■ For simplicity, let us consider a weight vector consisting of only one weight w (one-dimensional, or scalar).
The bias component w0 is also taken to be zero.
■ The weight update is given as:   w := w - α (d/dw) J(w),   where α is the learning rate.
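Putting the pieces together, a minimal batch gradient-descent sketch for the full (vector) logistic-regression case; the function name, learning rate, and iteration count are illustrative assumptions, and the gradient (1/m) X^T (h - y) is the standard gradient of the cross-entropy cost:

import numpy as np

def train_logistic_regression(X_aug, y, lr=0.1, n_iters=1000):
    # fit w by batch gradient descent on the cross-entropy cost
    m, n_plus_1 = X_aug.shape
    w = np.zeros(n_plus_1)                      # start from an initial (here zero) plane
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-X_aug @ w))    # predicted probabilities hw(x^(i))
        grad = X_aug.T @ (h - y) / m            # gradient of J(w) with respect to w
        w -= lr * grad                          # w := w - alpha * dJ/dw
    return w

# usage with the toy data from the earlier sketch
X = np.array([[3.0, 0.5], [1.0, 2.0], [-2.0, -1.0], [-1.5, 0.2]])
y = np.array([1, 1, 0, 0])
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
w = train_logistic_regression(X_aug, y)
print((1.0 / (1.0 + np.exp(-X_aug @ w)) >= 0.5).astype(int))   # predicted classes: [1 1 0 0]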