
LOGISTIC REGRESSION
Classifier & Training

Prof. Anil Singh Parihar


Notations

■ Small letters in bold are used to represent a vector; for example, x is a vector.
■ Capital letters in bold are used to represent a matrix; for example, X is a matrix.
Equation of a Plane in n-Dimensional Space
■ Equation of a plane:
$\mathbf{w}^T \mathbf{x} + w_0 = 0$
■ Minimum distance of the plane from any point x (shown in red) outside the plane, considering the distance as positive only, is:
$d_{\min}(\mathbf{x}) = \frac{|\mathbf{w}^T \mathbf{x} + w_0|}{\|\mathbf{w}\|}$
■ If we consider the sign of the distance as well, then:
$d_{\min}(\mathbf{x}) = \frac{\mathbf{w}^T \mathbf{x} + w_0}{\|\mathbf{w}\|}$
■ Distance from the origin:
$d_0 = \frac{|w_0|}{\|\mathbf{w}\|}$
Equation of a Plane in n-Dimensional Space
■ Let us consider a function
$g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$
■ Equation of the plane:
$g(\mathbf{x}) = 0$
■ Minimum distance of a point from the plane is:
$d_{\min}(\mathbf{x}) = \frac{|g(\mathbf{x})|}{\|\mathbf{w}\|}$
■ Remember that the vector w decides the orientation of the plane, and w0 is proportional to the distance of the plane from the origin.
■ If g(x) > 0, then point x is above the plane.
■ If g(x) = 0, then point x is on the plane.
■ If g(x) < 0, then point x is below the plane. (A small sketch of this sign test follows.)
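A minimal sketch (not from the slides) of evaluating g(x) and the signed distance for a point; it assumes NumPy, and the plane parameters and sample point are made up for illustration.

```python
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical orientation vector
w0 = 0.5                    # hypothetical offset

def g(x, w, w0):
    """Evaluate g(x) = w^T x + w0 for a single point x."""
    return w @ x + w0

def signed_distance(x, w, w0):
    """Signed minimum distance of x from the plane g(x) = 0."""
    return g(x, w, w0) / np.linalg.norm(w)

x = np.array([1.0, 1.0])
print(g(x, w, w0))                # > 0: the point lies above the plane
print(signed_distance(x, w, w0))  # signed distance d_min(x)
```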
Discriminant Function
■ In a classification problem, the plane g(x) = 0 can be considered as a discriminant function.
■ Consider a two-class classification problem with a two-dimensional feature space, i.e. $\mathbf{x} = [x_1, x_2]^T$.
[Figure: the plane g(x) = 0 in the (x1, x2) plane, with the region g(x) > 0 above, the region g(x) < 0 below, the normal vector w, and the distance d of a sample point from the plane.]
■ Remember that the minimum distance of a point from the plane is $d_{\min} = |g(\mathbf{x})| / \|\mathbf{w}\|$.
■ Thus, for a given plane (i.e. fixed w), the distance of a point is proportional to |g(x)|.
Discriminant Function
■ Therefore, for a sample point close to the plane (decision boundary), |g(x)| will be smaller, and for a point far away from the plane, |g(x)| will have a large value (ignoring the sign of the distance).
■ Note that for points above (on one side of) the plane g(x) > 0, and for points below (on the other side of) the plane g(x) < 0.
Classification
■ Notice that we can assign the class of a distant point, i.e. a point having a high (absolute) value of g(x), with more confidence as compared to closer points, i.e. points having a low (absolute) value of g(x).
■ For a point very close to the plane, the class assignment may change even with a small variation in the orientation (or location) of the plane, i.e. a slight variation in the vector w (or w0).
Classification
■ i.e., there in uncertainty in class prediction of
point close to plane (decision boundary) and x2
g ( x)  0
g (x)

g ( x)  0
■ This uncertainty to directly proportional to
(absolute) value of g (x) . d
g (x)
w

g ( x)  0
■ The value of g (x) vary between   to   .
However, for classification exact values are not x1
required.
Classification
■ For the purpose of classification, it is sufficient to find some measure of the distance of a point from the hyper-plane, based on g(x).
■ However, an exact measure of the distance is not required.
■ If two points are at a significantly large distance from the plane, we can safely assign their classes, even though one point may be closer to the plane as compared to the other.
Classification
■ For example, the points x(A) and x(B) shown in the figure are both at sufficient distance from the plane and can be classified with almost no uncertainty.
■ This holds even though the distance of x(A) from the plane is less than the distance of x(B) if we calculate exact distances.
■ One way to have a better measure of the distance between the hyper-plane and a sample point is to map g(x) from $(-\infty, +\infty)$ to $[0, 1]$.
Classification
■ The uncertainty in deciding the classes can be better modelled if we restrict the distance measure to the range [0, 1].
■ Indeed, it can be considered as the probability of belonging to a class.
■ For example, let us consider two-class classification (Class C1 and Class C2).
Classification
■ If a point x is above the plane (say Class C1), then g(x) > 0, and the exact value of g(x) may be any value in the range $(0, \infty)$.
■ We can map the values of g(x) into the range [0, 1] using some function, say h(x) = h(g(x)).
■ One such function is the logistic function, or sigmoid function.
Logistic Regression
■ The logistic function or sigmoid function is given as:
$h(g(\mathbf{x})) = \frac{1}{1 + e^{-g(\mathbf{x})}}; \qquad h(\mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + w_0)}}$
[Figure: the sigmoid curve h(x) plotted against $g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$, with g(x) < 0 on the left and g(x) > 0 on the right.]
■ Note that h(x) represents the probability that a point x belongs to class C1.
■ Thus, if h(x) > 0.5 then x belongs to class C1,
■ and, if h(x) < 0.5 then x belongs to class C2.
■ Let us consider a class label y such that
y = 1 denotes Class C1, and
y = 0 denotes Class C2.
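A minimal sketch (not from the slides) of the sigmoid hypothesis and the 0.5 decision rule; it assumes NumPy, and the parameters w, w0 and the sample point are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps (-inf, +inf) to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(x, w, w0):
    """Hypothesis h(x) = sigmoid(w^T x + w0): probability that x belongs to class C1 (y = 1)."""
    return sigmoid(w @ x + w0)

w, w0 = np.array([2.0, -1.0]), 0.5        # hypothetical parameters
x = np.array([1.0, 1.0])
p_c1 = h(x, w, w0)
predicted_class = 1 if p_c1 > 0.5 else 0  # C1 if h(x) > 0.5, else C2
print(p_c1, predicted_class)
```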
Alternate representations of w and x
■ Revisit the equation of the plane (hyper-plane) in n dimensions:
$\mathbf{w}^T \mathbf{x} + w_0 = 0$, where w = [w1, w2, …, wn]T and x = [x1, x2, …, xn]T
i.e. $w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + w_0 = 0$
$w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = 0$
$w_0 x_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = 0$, where $x_0 = 1$
■ Thus, $\mathbf{w}^T \mathbf{x} = 0$ represents a hyper-plane in (n+1)-dimensional space passing through the origin,
where w = [w0, w1, …, wn]T and x = [x0, x1, …, xn]T.
These are the augmented weight and augmented feature vectors. (A small sketch of the augmentation follows.)
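A minimal sketch (not from the slides) of forming the augmented vectors; it assumes NumPy, and the numerical values are made up for illustration.

```python
import numpy as np

x = np.array([3.0, -1.5])           # original n-dimensional feature vector
w = np.array([2.0, -1.0])           # original weights
w0 = 0.5                            # bias term

x_aug = np.concatenate(([1.0], x))  # x = [x0, x1, ..., xn]^T with x0 = 1
w_aug = np.concatenate(([w0], w))   # w = [w0, w1, ..., wn]^T

# The two forms give the same value of g(x):
assert np.isclose(w @ x + w0, w_aug @ x_aug)
```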
Logistic Regression…
■ Let us consider a class label y such that y = 1 denotes Class C1, and y = 0 denotes Class C2.
■ Thus, the hypothesis h(x) is the probability of the output variable y = 1 for the input x; moreover, this probability depends on the hyper-plane g(x) = 0.
■ The location and orientation of the hyper-plane are controlled by the parameter vector w = [w0, w1, …, wn]T.
Logistic Regression…
■ That is, for given (fixed) samples, the probability of a point x belonging to a class is governed by the parameter vector w.
■ Thus, the hypothesis h(x) is better denoted by hw(x).
■ Clearly, it can be observed that the probability of y = 1 depends on the input x and the parameters w, i.e.
$h_{\mathbf{w}}(\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w})$
■ Since we are discussing binary classification,
$p(y = 0 \mid \mathbf{x}, \mathbf{w}) = 1 - p(y = 1 \mid \mathbf{x}, \mathbf{w})$
Learning of Parameter w
■ In classification problems (supervised learning), a labelled dataset (with class labels) is available, and
we need to design a classifier for new samples (a testing set without class labels) with the help of the given labelled dataset.
■ Designing a logistic regression classifier is indeed a problem of finding suitable parameters w (including w0)
such that the hyper-plane divides (separates) the classes without any misclassification.
■ Note that this will work only for data with linearly separable classes.
Learning of Parameter w…
■ Process of learning parameters from data:
– Initially, we start with a random plane (we may use a heuristic to find the initial plane), i.e. with some random values of w.
– There would be some misclassifications in general (unless you are extremely lucky and have no misclassification).
– We then try to update the parameters w (i.e. change the orientation and location of the hyper-plane) such that the classification errors (misclassifications) are reduced. [Such a mechanism is required.]
– We update the parameters until the error is reduced to zero; the parameters w* corresponding to zero classification error represent the final hyper-plane.
Learning of Parameter w…
■ Let us consider that we have m samples points (training set) with class levels:
Training set: { (x(1), y(1)), (x(2), y(2)), …, (x(m), y(m)) }

■ Here x(i) i = 1, 2,…, m are n-dimensional vectors, and


y(i) are corresponding outputs (class levels)
i.e. for two class classification y  {0, 1}.

■ We can write these vectors as augmented vectors: x = [x0 , x1,…,xn ]T where x0 = 1


and w = [w0 , w1,…,wn ]T
Learning of Parameter w…
■ All vectors (augmented) of training set can be put together in a matrix
 | | | 
X  x (1) x ( 2 ) ... x ( m ) 
 
 | | | 
■ and corresponding class levels (outputs) can be put in a row vector.


y  y (1) , y ( 2 ) ,... , y ( m ) 
1
■ Thus, hypothesis is given as: h( x) 
wT x
1 e

■ Our aim is to find parameter w = [w0 , w1,…,wn ]T from given training data.
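A minimal sketch (not from the slides) of stacking the augmented training vectors into a matrix X and evaluating the hypothesis for all samples at once; it assumes NumPy, and the toy data is made up for illustration.

```python
import numpy as np

# Toy training set: m = 4 samples, n = 2 features, labels y in {0, 1}
X_raw = np.array([[0.5, 1.0],
                  [2.0, 3.0],
                  [-1.0, 0.5],
                  [1.5, -0.5]])
y = np.array([1, 1, 0, 0])

# Augment: each column of X is x^(i) = [1, x1, ..., xn]^T, so X has shape (n+1, m)
X = np.vstack([np.ones(X_raw.shape[0]), X_raw.T])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(X.shape[0])      # augmented parameter vector [w0, w1, ..., wn]^T
h = sigmoid(w @ X)            # hypothesis h_w(x^(i)) for every training sample
print(h)                      # all 0.5 when w = 0
```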
Learning of Parameter w
■ As discussed earlier, we start with random parameters (or initial parameters based on some heuristic), and
■ use some mechanism to iteratively update the parameters (weights) to reduce the difference (error) between the predicted output hw(x(i)) and the actual output (given class) y(i).
■ Can we do it (finding suitable parameters) by minimizing a cost function or loss function, as we did in the case of linear regression? Let's see.
Learning of Parameter w
■ In the case of linear regression, we have a cost function based on the mean squared error, i.e.
$J(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left( h_{\mathbf{w}}(\mathbf{x}^{(i)}) - y^{(i)} \right)^2$
■ The above cost function is the error associated with the whole training set; indeed, it is the average error over all m samples.
■ For simplicity, and to avoid confusion, let us call the error for one sample the loss function:
$L(\mathbf{x}^{(i)}, y^{(i)}) = \frac{1}{2} \left( h_{\mathbf{w}}(\mathbf{x}^{(i)}) - y^{(i)} \right)^2, \qquad \Rightarrow \quad J(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} L(\mathbf{x}^{(i)}, y^{(i)})$
Learning of Parameter w
■ In the case of linear regression, $h_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$, the cost function J(w) is a quadratic function, which is convex in nature.
■ Thus, the gradient descent algorithm will converge to the global minimum.
■ However, in the case of logistic regression, the loss function (cost function) involves the sigmoid function:
$h_{\mathbf{w}}(\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}}}$
hence $L(\mathbf{x}^{(i)}, y^{(i)}) = \frac{1}{2} \left( h_{\mathbf{w}}(\mathbf{x}^{(i)}) - y^{(i)} \right)^2$ is not quadratic (not convex).
Learning of Parameter w
■ Thus, the mean-squared-error based cost function
$J(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left( h_{\mathbf{w}}(\mathbf{x}^{(i)}) - y^{(i)} \right)^2$
is non-convex in nature.
■ Therefore, the gradient descent algorithm is not guaranteed to converge to the global minimum.
■ Thus, we require another cost function (rather than the mean squared error), which is convex in nature. (The sketch below probes this non-convexity numerically.)
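A minimal numerical probe (not from the slides) of why the squared-error cost with a sigmoid hypothesis can be non-convex: it scans J(w) along a 1-D slice for a scalar weight and looks for points where the discrete second difference is negative, which indicates the curve is not convex along that slice. It assumes NumPy, and the toy 1-D data is made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([-2.0, -1.0, 3.0, 4.0])   # toy 1-D inputs (no bias, for simplicity)
y = np.array([1.0, 1.0, 0.0, 0.0])     # toy labels

ws = np.linspace(-10.0, 10.0, 401)     # 1-D slice of parameter space
J = np.array([np.mean(0.5 * (sigmoid(w * x) - y) ** 2) for w in ws])

# A convex curve never has a negative second difference on a uniform grid.
second_diff = J[:-2] - 2.0 * J[1:-1] + J[2:]
print("non-convex along this slice:", np.any(second_diff < -1e-9))
```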
Cost function for logistic regression
■ Remember that the hypothesis hw(x) is the probability of the output taking the value 1 (i.e. y = 1) for an input x.
(Let y = 1 represent class C1 and y = 0 represent class C2.)
Thus, $h_{\mathbf{w}}(\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w})$
■ It is the posterior probability of class C1 (y = 1).
■ Since we are discussing two-class classification, the posterior probability of class C2 (y = 0) is given as:
$p(y = 0 \mid \mathbf{x}, \mathbf{w}) = 1 - p(y = 1 \mid \mathbf{x}, \mathbf{w}) = 1 - h_{\mathbf{w}}(\mathbf{x})$
Cost function for logistic regression
■ We can combine both outputs (classes) in a single expression, i.e. p(y | x, w), as:
$p(y \mid \mathbf{x}, \mathbf{w}) = h_{\mathbf{w}}(\mathbf{x})^{y} \left( 1 - h_{\mathbf{w}}(\mathbf{x}) \right)^{1-y}$
■ It can be easily verified:
$y = 1: \quad p(y = 1 \mid \mathbf{x}, \mathbf{w}) = h_{\mathbf{w}}(\mathbf{x})^{1} \cdot \left( 1 - h_{\mathbf{w}}(\mathbf{x}) \right)^{0} = h_{\mathbf{w}}(\mathbf{x})$
$y = 0: \quad p(y = 0 \mid \mathbf{x}, \mathbf{w}) = h_{\mathbf{w}}(\mathbf{x})^{0} \cdot \left( 1 - h_{\mathbf{w}}(\mathbf{x}) \right)^{1} = 1 - h_{\mathbf{w}}(\mathbf{x})$


Cost function for logistic regression
■ Considering the class label as a random variable, p(y | x, w) represents the probability distribution of that random variable.
■ Actually, this is a particular case of the binomial distribution called the Bernoulli distribution.
■ Assume the data points in the dataset are drawn independently from this distribution.
■ The parameters of the distribution can be estimated using maximum likelihood estimation.
– Note that here the parameters are w.
Cost function for logistic regression
■ The likelihood of observing the training set is given as:
$\prod_{i=1}^{m} h_{\mathbf{w}}(\mathbf{x}^{(i)})^{y^{(i)}} \left( 1 - h_{\mathbf{w}}(\mathbf{x}^{(i)}) \right)^{1 - y^{(i)}}$
■ It is more convenient to minimize the negative logarithm of the likelihood.
■ Thus, the cost function (to be minimized) for logistic regression is given as:
$J(\mathbf{w}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_{\mathbf{w}}(\mathbf{x}^{(i)}) + (1 - y^{(i)}) \log \left( 1 - h_{\mathbf{w}}(\mathbf{x}^{(i)}) \right) \right]$
Cost function for logistic regression
■ Let us verify whether this new cost function is convex in nature.
■ For simplicity, let us consider a single training example; the cost function, i.e. the loss function, is then:
$L(h_{\mathbf{w}}(\mathbf{x}), y) = -y \log\left(h_{\mathbf{w}}(\mathbf{x})\right) - (1 - y) \log\left(1 - h_{\mathbf{w}}(\mathbf{x})\right)$
$L(h_{\mathbf{w}}(\mathbf{x}), y) = \begin{cases} -\log\left(h_{\mathbf{w}}(\mathbf{x})\right), & \text{if } y = 1 \\ -\log\left(1 - h_{\mathbf{w}}(\mathbf{x})\right), & \text{if } y = 0 \end{cases}$
Cost function for logistic regression
■ Remember, we are discussing two-class classification, i.e. two actual classes: C1 (y = 1) and C2 (y = 0), and
■ the predicted value hw(x) represents the probability that a training sample x belongs to class C1.
■ The loss L(hw(x), y) represents the error (cost) in the prediction for a training sample x.
Cost function for logistic regression
■ Let us analyse case-wise (for training samples of both classes).
Case-I: y = 1 (actual class given);
predicted value = hw(x), and the associated loss is: $L(h_{\mathbf{w}}(\mathbf{x}), y) = -\log\left(h_{\mathbf{w}}(\mathbf{x})\right)$
[Figure: the curve of L(hw(x), y) against hw(x) over [0, 1] for y = 1.]
■ The curve shows that if the predicted value is near 1, i.e. hw(x) ≈ 1, the loss is near zero (very low).
■ This is the case of proper classification, i.e. objects of class 1 are predicted to be class 1 objects.
Cost function for logistic regression
Case-I: y = 1 (actual class given);
■ Now, if the predicted value is near 0, i.e. hw(x) ≈ 0, and the actual class is class 1 (y = 1),
■ the curve shows that there is a very high loss.
■ This is the case of misclassification, i.e. objects of class 1 are predicted to be class 0 objects.
Cost function for logistic regression
Case-II: y = 0 (actual class given);
predicted value hw(x), and the associated loss is: $L(h_{\mathbf{w}}(\mathbf{x}), y) = -\log\left(1 - h_{\mathbf{w}}(\mathbf{x})\right)$
[Figure: the curve of L(hw(x), y) against hw(x) over [0, 1] for y = 0.]
■ The curve shows that if the predicted value is near 1, i.e. hw(x) ≈ 1, then the loss is very high.
■ This is the case of misclassification, i.e. objects of class 0 are predicted to be class 1 objects.
Cost function for logistic regression
Case-II: y = 0 (actual class given);
■ Now, if the predicted value is near 0, i.e. hw(x) ≈ 0, and the actual class is class 0 (y = 0),
■ the curve shows that there is a very low loss (≈ 0). This is the case of proper classification.
■ In the case of multiple instances (samples), the loss will be low for proper classification and high for misclassification.
■ Thus, the cost function is convex in nature, and the gradient descent algorithm will converge to the global minimum (since there is only one). (A sketch of the two loss curves follows.)
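A minimal sketch (not from the slides) that plots the two per-sample loss curves, -log(h) for y = 1 and -log(1 - h) for y = 0, over h in (0, 1); it assumes NumPy and Matplotlib are available.

```python
import numpy as np
import matplotlib.pyplot as plt

h = np.linspace(0.001, 0.999, 500)           # predicted probability h_w(x)
plt.plot(h, -np.log(h), label="y = 1: -log(h)")
plt.plot(h, -np.log(1.0 - h), label="y = 0: -log(1 - h)")
plt.xlabel("h_w(x)")
plt.ylabel("loss L(h_w(x), y)")
plt.legend()
plt.show()
```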
Gradient Descent Algorithm
■ The gradient descent algorithm is based on the simple notion that if we want to go downhill, we should move in the direction opposite to that of the maximum change in height.
■ The direction and amount of the maximum change in height are given by the gradient.
■ In the given problem, the cost function is analogous to the height, so we use the gradient of the cost function.
■ It should be noted that the gradient descent algorithm may get stuck in a local minimum.
Gradient Descent Algorithm
■ For simplicity, let us consider that the weight vector consists of only one weight w (one-dimensional, or a scalar). The bias component w0 is also zero.
■ Gradient descent algorithm: the weight update is given as:
$w := w - \alpha \frac{d}{dw} J(w)$
(where α is the learning rate)
Gradient Descent Algorithm
■ The figure shows a plot of the cost function against the one-dimensional parameter w.
■ We can start with any random value of w.
■ Then we apply the gradient descent algorithm to find the optimal value of the parameter w.
■ The optimal parameter w corresponds to the minimum value of the cost.
■ The weight update is given as: $w := w - \alpha \frac{d}{dw} J(w)$ (a short training-loop sketch follows).
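A minimal end-to-end sketch (not from the slides) of training logistic regression with batch gradient descent on the cross-entropy cost; it assumes NumPy, the toy data, learning rate, and iteration count are made up, and it uses the standard gradient of the cross-entropy cost, dJ/dw = (1/m) X (h - y), for X of shape (n+1, m).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy linearly separable data: m = 4 samples, n = 2 features
X_raw = np.array([[0.5, 1.0], [2.0, 3.0], [-1.0, 0.5], [1.5, -0.5]])
y = np.array([1.0, 1.0, 0.0, 0.0])
X = np.vstack([np.ones(X_raw.shape[0]), X_raw.T])   # augmented, shape (n+1, m)

w = np.zeros(X.shape[0])   # initial parameter vector (here simply zeros)
alpha = 0.1                # learning rate (hypothetical choice)

for _ in range(5000):
    h = sigmoid(w @ X)               # predictions h_w(x^(i)) for all samples
    grad = X @ (h - y) / y.size      # gradient of the cross-entropy cost
    w -= alpha * grad                # w := w - alpha * dJ/dw

print("learned w:", w)
print("predicted classes:", (sigmoid(w @ X) > 0.5).astype(int))
```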
