
NPTEL

NPTEL ONLINE CERTIFICATION COURSE

Introduction to Machine Learning

Lecture 20

Prof. Balaraman Ravindran


Computer Science and Engineering
Indian Institute of Technology Madras

Logistic Regression

So let us go back to what has been bothering all of us. What we were essentially doing when we did regression here was making sure that the output is either 1 or 0, and then trying to regress on that. So what do we want our function to do? Basically you want your f(x) to give you P(K|X), but trying to do that directly is a little harder, so what we are going to do is look at some kind of transformation of the probability.

(Refer Slide Time: 00:49)


And we are going to try and fit that. Let me look at the logit transformation. This is essentially log( p(x) / (1 − p(x)) ).

(Refer Slide Time: 01:29)

To make my life easier for the next few minutes I am going to assume we are dealing with binary classification. So the class label is either 0 or 1, and p(x) is essentially the probability that the output is 1 given the input is x. This makes my life a little easier when I write the next part.
(Refer Slide Time: 02:12)

So given that p(x) = Pr(G = 1 | X = x), what is 1 − p(x)? It is the probability that the output is 0. Since we are talking about binary classes, p(x) / (1 − p(x)) is sometimes called the probability of success divided by the probability of failure, or the "odds", and log( p(x) / (1 − p(x)) ) is sometimes called the log odds or the logit function. This is essentially the transformation that we want to look at, so what I am going to do is try to fit a linear model to the log odds: log( p(x) / (1 − p(x)) ) = β0 + βx. So what does p(x) look like in this case?
(Refer Slide Time: 03:37)

So what is this function going to look like? That is a sigmoid. So essentially we are saying that my p(x), the probability that the label is 1 given x, is going to be given by a sigmoid: p(x) = e^(β0 + βx) / (1 + e^(β0 + βx)).
(Refer Slide Time: 04:35)
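A minimal sketch of this relationship in Python (not from the lecture; the coefficient values are made up purely for illustration): applying the logit to the sigmoid recovers the linear term β0 + βx.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(p):
    """Logit transformation: log of p / (1 - p)."""
    return np.log(p / (1.0 - p))

# Illustrative coefficients (made up for this sketch), x one-dimensional.
beta0, beta = -1.0, 2.0
x = np.linspace(-3, 3, 7)

p = sigmoid(beta0 + beta * x)                       # p(x) = e^(b0 + b*x) / (1 + e^(b0 + b*x))
print(np.allclose(log_odds(p), beta0 + beta * x))   # True: logit(p(x)) is linear in x
```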

(Refer Slide Time: 04:54)


That is the term in the exponent, except that here there is a minus sign in front of it. So what do we have here? What do we do if p(x) is greater than 0.5?

(Refer Slide Time: 07:46)

We will output the label as 1, and if p(x) is less than 0.5 we will output it as 0. So is this okay? Yes, because even though I am doing linear regression, and linear regression is unbounded, I am going to plug it into this expression, and therefore this will make sure that my probability is between 0 and 1. Where that 0.5 point falls depends on what I had put for my β0. So what about the classifier that I am learning here? What is the separating surface, the decision boundary, between class one and class two? You have p(x) = 0.5, but what does that mean? Look at the expression that we have here: when p(x) is equal to 0.5 the odds are 1, so the log of that will be 0.
So essentially we get β0 + βx = 0, and that is a straight line, I mean assuming x is one-dimensional it is a straight line. So even though I did something complex, I used an exponential to define my probability, the decision surface still turns out to be a hyperplane. Plug in p(x) equal to 0.5 here and I am going to get 0 on the left-hand side, so I am essentially solving β0 + βx = 0. That is just a straight line in one dimension; in general it is a hyperplane. Maybe I should do the whole class in one dimension, it makes it easier for people to visualize things. One thing I should point out is that logistic regression looks simple, but it yields a very powerful classifier; it works very well in practice.
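As a small illustration (again a sketch with made-up coefficients, not from the lecture), p(x) equals 0.5 exactly on the hyperplane β0 + βx = 0 and falls on either side of 0.5 on either side of that hyperplane:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up 2-D coefficients for illustration.
beta0 = 0.5
beta = np.array([1.0, -2.0])

# A point lying on the hyperplane beta0 + beta.x = 0 gets probability exactly 0.5.
x_on_boundary = np.array([0.5, 0.5])                  # 0.5 + 1*0.5 - 2*0.5 = 0
print(sigmoid(beta0 + beta @ x_on_boundary))          # 0.5

# Points on either side of the hyperplane fall on either side of 0.5.
print(sigmoid(beta0 + beta @ np.array([1.0, 0.0])))   # > 0.5, classified as 1
print(sigmoid(beta0 + beta @ np.array([0.0, 1.0])))   # < 0.5, classified as 0
```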

And it is used not just for building classification surfaces; it is also used a lot in what people sometimes call "sensitivity analysis". They look at how each factor contributes to the output, that is, how important each factor is in predicting the class label. For doing that they fit a logistic regression and then look at the β vector to figure out how much each variable contributes to the output. So people use that a lot. Of course you can use anything that we have seen for doing this kind of sensitivity analysis; I am just telling you what people use in practice.
So logistic regression is something that is used very widely in practice, both by machine learning folks and by statisticians. In fact, when I worked with a few doctors it was almost impossible to get them to accept anything other than logistic regression as a valid classifier, because they were so sold on logistic regression, and with good reason, because it does work very well in practice.

So that is for two classes; what do you do for multiple classes? For multiple classes I am essentially going to take recourse to this form. I am going to say the probability that the output is class 1 given x is given by an expression like this, the probability that the output is class 2 given the input x is given by another expression like this, each with a different set of β0 and βs.

Likewise, for every class i the probability is given by a different set of β0 and β. So do we have to do that for all the K classes? I have to do it only for K − 1 classes, because the Kth class probability will be automatically determined. So I will have K − 1 sets of βs, and if I have K classes I have to figure out how to estimate those.
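A minimal Python sketch of what this looks like, assuming the convention discussed below of setting the coefficients of the last (reference) class to zero; the coefficient values here are made up purely for illustration:

```python
import numpy as np

def multiclass_probs(x, betas):
    """Class probabilities for K classes using K-1 sets of coefficients.

    betas is a list of K-1 (beta0, beta) pairs; the K-th (reference) class
    implicitly has its coefficients set to zero, so its linear score is 0.
    """
    scores = np.array([b0 + b @ x for b0, b in betas] + [0.0])
    expo = np.exp(scores)
    return expo / expo.sum()          # probabilities automatically sum to 1

# Made-up example: 3 classes, 2-dimensional input, so 2 sets of coefficients.
betas = [(0.2, np.array([1.0, -1.0])),
         (-0.5, np.array([0.5, 2.0]))]
x = np.array([1.0, 0.5])
p = multiclass_probs(x, betas)
print(p, p.sum())                     # three probabilities, summing to 1
```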

(Refer Slide Time: 10:52)


(Refer Slide Time: 11:46)

So we are going to write it like this for the K − 1 classes; by convention, for either the first class or the last class, whichever arbitrary numbering you choose, the coefficients are set to zero. Setting those coefficients to zero will essentially give you the answer that you want. So now you agree with me that setting it to zero is fine. So how do we estimate the parameters for logistic regression? It is a little tricky, since we are anyway trying to model the probabilities directly. What we are going to try and do is maximize the likelihood of the data. So far we have always looked at some kind of error function and we have been trying to optimize that error function; in linear regression we looked at squared error and then did the optimization, and so on and so forth. But here we are going to look at a slightly different criterion: we are going to optimize the likelihood of the data.
(Refer Slide Time: 13:14)
So just to keep it together I am going to do this today, but I have a whole session planned on maximum likelihood and other ways of estimating parameters, so when we come to that I will do maximum likelihood in more detail in a generic form. Right now I will just look at logistic regression and maximum likelihood. So what is likelihood? Suppose I have some training data D; the training data has been given to me. The probability of D given parameters θ is known as the likelihood of θ.

So D is fixed. Think about it: I am given training data D, D is fixed, so what is it that I am actually looking to find? It is θ. So I will write P(D | θ) as the likelihood of θ. We are used to thinking of whatever comes after the bar as the conditioning variable and whatever comes before the bar as the actual argument, but in this case it turns out that θ is the argument: the probability of D given θ is the likelihood of θ. D is fixed; I am really trying to find what θ is. So the scoring function should be a function of θ, and I am usually interested in the log of the likelihood.
(Refer Slide Time: 15:05)
Because it allows me to simplify a lot of the distributions that I will be considering, and we will mostly denote this by "l". So what is the likelihood in our case? θ is our βs, and my data D is going to consist of {(x1, g1), …, (xn, gn)}; it is going to consist of pairs of data points like this. So x is the input and g is the output; we are talking about classification, so g is the class label. I want to stay in the two-class setting, so g belongs to {0, 1}: 0 means class 0 and 1 means class 1.
(Refer Slide Time: 16:41)
This is a funky-looking expression, and we will come back to it; we will see it again. It is the probability of one pair (x, g) occurring: one factor is the probability that x has label 1, the other is the probability that x has label 0, and the exponent on each is determined by the actual label of x. In log form, the contribution of one point is gi log(p(xi)) + (1 − gi) log(1 − p(xi)). If the actual label of xi is 1, then the term log(p(xi)) appears; if the actual label of xi is 0, then the term log(1 − p(xi)) appears, because the other exponent becomes zero. So if the actual label of x is 1, the probability I should be looking at is the probability that the label is 1, and if the actual label is 0, I should be looking at the probability that the label is 0; that is what these terms are. So you can see that this gives me the probability of one (x, g) pair. I do this for all of them, and assuming that they are all sampled independently, I can take the product over the data points. So now we know why we love logarithms.

(Refer Slide Time: 18:26)

So that is the expression, and it is simple enough. Now comes the interesting part: we want to maximize the likelihood. So we need to take the derivative of this and equate it to zero. That is fine, because log is a monotone transformation: we can take the derivative of the log-likelihood, equate it to zero, and then solve for β. Unfortunately life is not so simple. Let us try the simplification: multiplying this out and gathering the terms, I get gi times the log odds, which we already know is β0 + βxi, so I can insert that and simplify. And what about the other piece, the log of 1 − p(xi)? Since 1 − p(xi) = 1 / (1 + e^(β0 + βxi)), its log gives me −log(1 + e^(β0 + βxi)). So the log-likelihood becomes l(β) = Σi [ gi (β0 + βxi) − log(1 + e^(β0 + βxi)) ].
(Refer Slide Time: 21:33)
So now I can take the derivative of that with respect to β, equate it to zero, and see what I get.
(Refer Slide Time: 22:11)

So take the derivative of the second term: you are going to end up with minus p(xi). The log(1 + e^(β0 + βxi)) goes down into the denominator and e^(β0 + βxi) comes up as the numerator, which together give p(xi); and since I am differentiating with respect to a specific βj, I also get an xij. So from that term I get minus p(xi) times xij, and from the first term, if I take the derivative, I get gi times xij. Putting the two together, ∂l/∂βj = Σi xij (gi − p(xi)). So essentially what I have here looks like a nice and easy expression to solve, but unfortunately it is not, because there is an exponential function in there; it is not really easy to solve this, and you have to look at some iterative method for solving it. The most popular method used is Newton-Raphson. I am not going to go into the depths of Newton-Raphson; people are encouraged to look it up.
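Putting that derivative into code form — a sketch with made-up data, with the intercept folded into X as a leading column of ones — we can check the formula ∂l/∂βj = Σi xij (gi − p(xi)) against a numerical derivative of the log-likelihood:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loglik(beta, X, g):
    """Log-likelihood with the intercept folded into beta (X has a column of ones)."""
    eta = X @ beta
    return np.sum(g * eta - np.log(1 + np.exp(eta)))

def gradient(beta, X, g):
    """dl/dbeta_j = sum_i x_ij * (g_i - p(x_i)), i.e. X^T (g - p) in vector form."""
    p = sigmoid(X @ beta)
    return X.T @ (g - p)

# Made-up data; the first column of ones plays the role of the intercept term.
rng = np.random.default_rng(1)
X = np.hstack([np.ones((6, 1)), rng.normal(size=(6, 2))])
g = np.array([1, 0, 1, 1, 0, 0])
beta = np.array([0.1, -0.4, 0.8])

# Finite-difference check of the first coordinate of the gradient.
eps = 1e-6
e0 = np.zeros_like(beta); e0[0] = eps
numeric = (loglik(beta + e0, X, g) - loglik(beta - e0, X, g)) / (2 * eps)
print(np.isclose(gradient(beta, X, g)[0], numeric))   # True
```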
(Refer Slide Time: 24:11)

But the basic idea is this: people are usually more comfortable looking at it this way. You take your old estimate of your values, your old solution, and you adjust it using the first derivative of the function that you are maximizing divided by the second derivative, l′ divided by l″. That is essentially the basic idea behind Newton-Raphson. Let me just define some terms here: X is going to be my n × (p + 1) matrix as usual, and p is going to be a vector where each entry is the probability of xi. So what will be the dimensionality of p? It will be n; it is an n-vector that tells me the probability of each xi being 1. W is going to be a diagonal matrix where each diagonal entry is p(xi)(1 − p(xi)) for that particular data point xi. This makes it convenient to rewrite things, and I am going to assume that g is the vector of outputs, zeros and ones depending on what class it is.
(Refer Slide Time: 27:13)
(Refer Slide Time: 24:46)

So I can write my ∂l/∂β as Xᵀ(g − p). In terms of the matrices, p is the vector of probabilities, g is the vector of zeros and ones corresponding to the class labels, and X is my input. I have just basically written this in vector notation; you already found the derivative, I have just rewritten it in vector form. Does that make sense? So what about the second derivative? I am not going to work it out, but it is −XᵀWX, where W is essentially the diagonal matrix with those p(xi)(1 − p(xi)) entries. So that is my second derivative. What do I get now, putting this together?
(Refer Slide Time: 28:40)

This is beginning to look something like regression: you are getting your (XᵀX)⁻¹Xᵀ-type terms and all that. We just have to do a little more work, a little bit of algebra, to make it look more like regression, and that is what we will do now. I have just substituted the derivatives here, nothing fancy: βnew = βold + (XᵀWX)⁻¹Xᵀ(g − p). You want to solve the problem of where the gradient becomes 0, and you can see the β in here: the β is inside p, because p(xi) = e^(β0 + βxi) / (1 + e^(β0 + βxi)). I really want to solve for this; I want to find the zero of this function.
But it is not easy to do because of the exponential in there, so what we have to do is look at some kind of iterative method for solving this problem. The way these iterative approaches work is that you start off with a guess, call it βold, then you do some computation and get a new guess, call it βnew. One very popular way of doing this kind of iterative thing is "gradient following", or gradient descent. Have you looked at that? You might have come across it. Suppose I have a function like this and I am here; this is my current solution, call it xold. I compute the gradient here and move in the opposite direction of the gradient to find the minimum; instead of going all the way, I take a small step, and that gives me xnew.
(Refer Slide Time: 31:22)
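For comparison, here is what that gradient-following idea looks like for the logistic log-likelihood itself (a sketch of my own, not from the lecture): since we are maximizing, we take small steps along the gradient rather than against it. The data and the step size are made up purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up data: intercept column plus one feature; labels drawn from a logistic model.
rng = np.random.default_rng(3)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 1))])
g = (rng.random(50) < sigmoid(X @ np.array([0.0, 1.5]))).astype(float)

beta = np.zeros(2)                 # beta_old: start from all zeros
step = 0.05                        # small step size, chosen arbitrarily
for _ in range(1000):
    p = sigmoid(X @ beta)
    beta = beta + step * X.T @ (g - p)   # move a small step along the gradient of l
print(beta)                        # beta after many small gradient steps
```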

Normally what you will do is find the gradient, equate it to zero, and solve, but you can do this in an iterative fashion as well: you take small steps in the direction of the gradient. Likewise, what we are going to do is start off with βold, which is some guess; in fact β equal to all zeros actually works fine, you can start off by setting all your βs to 0, and then try to find a βnew. So what I will essentially be doing is this: the initial β will put me somewhere on the l function, I will find out the first-order and second-order derivatives at that point with respect to β, and then use those to change my β values. So, if people agree with me, this update can be rewritten with (XᵀWX)⁻¹ pulled out in front.

(Refer Slide Time: 32:08)

If you take the product here, (XᵀWX)⁻¹ times XᵀWX is just the identity, so that piece gives me back Xβold; and in the other piece, the W⁻¹ and the W get cancelled out. So I have just done some algebra to get it into this form: βnew = (XᵀWX)⁻¹XᵀWz, where z = Xβold + W⁻¹(g − p). Think about it: what is Xβold? Since this is like linear regression, it is like the original response I would get if βold were my coefficients and I were actually making a linear prediction based on X. So Xβold is that prediction, and I am essentially adjusting it by the second quantity: the first part is the prediction I make with my old parameters, and the second part is some kind of adjustment I am making to the prediction. So z is called the "adjusted response", and βnew turns out to be the solution of something known as weighted linear regression. In weighted linear regression, essentially, this is what you do.

So in linear regression, what you are trying to minimize is the squared error. In weighted linear regression you essentially have a weighting term in your error function: instead of just saying I am going to minimize the squared error, for every term in the squared error I am going to assign a different weight. For some data points I want to be more aggressive in minimizing the error, for some data points I want to be less aggressive.

So data points on which I have to be more aggressive will have a higher weight, and data points for which I want to be less aggressive will have a lower weight; that allows me to trade off the importance of data points. This is the idea behind weighted linear regression, and what we have here is essentially a weighted linear regression: βnew is the minimizer of the weighted squared error with the adjusted response z as the target and W giving the weights. You can do the usual thing now: take the derivative, set it to zero, and solve it. This is easy enough to solve; it is just like actual linear regression.

So the minimizer is (XᵀWX)⁻¹XᵀWz. So essentially what we are saying is that your βnew is obtained by solving a weighted linear regression, a weighted least squares problem, with this adjusted response. This is called iteratively reweighted least squares; IRLS is a separate algorithm, iteratively reweighted least squares, for solving logistic regression.
But all it does is essentially Newton-Raphson. The way iteratively reweighted least squares is described to you is: start off with a guess for β and form the adjusted response. As soon as I have a value for β, I can find out what my p is; g is given to me already in the data, and my W can be constructed once I know p. So I make a guess for β, I construct my p, I construct my W.

I form the adjusted response, solve the weighted least squares problem, get a new β, and keep repeating this until my predictions are accurate enough. So that is basically it. This is the most popular way of solving logistic regression; there are many other ways, and people have come up with more efficient ways of solving logistic regression, but if you pick up any popular package like R or something, IRLS is the base logistic regression solver that will be implemented. This is just to give you a flavor of how hard it can be to optimize things sometimes.
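To tie the procedure together, here is a minimal sketch of the IRLS loop as just described — my own Python on made-up data, assuming the intercept is folded into X as a leading column of ones; a real package would add safeguards such as step-size control and better convergence checks:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls_logistic(X, g, n_iter=25, tol=1e-8):
    """Iteratively reweighted least squares for logistic regression.

    X is n x (p + 1) with a leading column of ones for the intercept;
    g is the n-vector of 0/1 labels.
    """
    beta = np.zeros(X.shape[1])                 # guess: all betas start at zero
    for _ in range(n_iter):
        p = sigmoid(X @ beta)                   # construct p from the current beta
        w = p * (1 - p)                         # diagonal entries of W
        z = X @ beta + (g - p) / w              # adjusted response z = X beta + W^{-1}(g - p)
        # Weighted least squares: beta_new = (X^T W X)^{-1} X^T W z
        XtW = X.T * w                           # same as X^T @ diag(w)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Made-up data to exercise the sketch.
rng = np.random.default_rng(4)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
g = (rng.random(100) < sigmoid(X @ np.array([-0.5, 1.0, 2.0]))).astype(float)
print(irls_logistic(X, g))   # roughly recovers the coefficients used to generate g
```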

IIT Madras Production

Funded by
Department of Higher Education
Ministry of Human Resource Development
Government of India

www.nptel.ac.in
Copyrights Reserved
