Lecture - 42
Logistic regression
Just to recap what we have seen before: we have talked about the binary classification problem. We said classification is the task of identifying what category a new data point, or an observation, belongs to. There could be many categories to which the data could belong, but when the number of categories is 2, it is what we call the binary classification problem. We can also think of binary classification problems as simple yes or no problems, where you either say something belongs to a particular category, or no, it does not belong to that category.
So, you notice that these are quantitative numbers while these are qualitative features. Now, you could convert these all into quantitative features by coding yes as 1 and no as 0; then those also become numbers. This is a very crude way of doing it; there might be much better ways of coding qualitative features into quantitative features and so on. You also have to remember that there are some data analytics approaches that can directly handle these qualitative features without a need to convert them into numbers.
So, you should keep in mind that that is also possible. Now that we have these features, we go back to our pictorial understanding of these things. Just for the sake of illustration, let us take this example, where we have this 2-dimensional data. So, here we would say X is (x1, x2); two variables, let us say x1 is here and x2 is here. So, the data is organized like this. Now let us assume that all the circular data belong to one category and all the starred data belong to another category. Notice that the circled data would have a certain x1 and a certain x2 and, similarly, the starred data would have a certain x1 and a certain x2. So, in other words, all values of x1 and x2 such that the data is here belong to one class and, such that the data is here, belong to another class.
Now, if we were able to come up with a hyperplane such as the one that is shown here, we learnt from our linear algebra module that each side of this hyperplane is a half space and, depending on the way the normal is defined, you would have a positive value on one side of the hyperplane and a negative value on the other. So, this is something that we have dealt with in detail in one of the linear algebra classes.
So, I could have a data point whose true value is here; however, because of noise it could slip to the other side and so on. As I come closer and closer to this boundary, the probability, or the confidence with which I can say it belongs to a particular class, intuitively comes down. So, simply saying yes, this data point and this data point belong to class 0 is one answer, but that is pretty crude. The question that logistic regression answers is: can we do something better using probabilities? I would like to say that the probability that this belongs to class 1 is much higher than for this one, because it is far away from the decision boundary. So, how do we do this is the question that logistic regression addresses.
So, one could say yes, this belongs to a class; a more nuanced answer could be that yes, it belongs to that class, but with a certain probability. As the probability gets higher, you feel more confident about assigning that class to the data. On the other hand, if we model through probabilities, we do not want to lose the binary yes or no answer either. So, if I have probabilities for something, I can easily convert them to yes or no answers through some thresholding, which we will see when we describe the logistic regression methodology. So, by modelling this probability we do not lose the ability to categorically say whether a data point belongs to a particular class or not; on the other hand, we get the benefit of a more nuanced answer, instead of just saying yes or no.
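As a minimal sketch (not part of the lecture), thresholding a predicted probability at 0.5 recovers a yes/no answer while the probability itself remains available:

```python
# Minimal illustration: convert a predicted probability into a hard class label.
# The 0.5 threshold is the usual default; it is an assumption, not from the lecture.
def to_label(p, threshold=0.5):
    return 1 if p >= threshold else 0

print(to_label(0.92))  # far from the boundary -> confidently class 1
print(to_label(0.51))  # close to the boundary -> still class 1, but with low confidence
```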
So, you could think of something slightly different and then say, look, instead of saying p(x) is this, let me say log(p(x)) = β0 + β1x. In this case you will notice that it is bounded only on one side. In other words, if I write log(p(x)) = β0 + β1x, I will ensure that p(x) never becomes negative; however, on the positive side p(x) can go to ∞. That again is a problem, because we need to bound p(x) between 0 and 1. So, this is an important thing to remember: it only bounds this on one side.
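Written out (a restatement of the argument above, not from the slide), the one-sided bound looks like this:

```latex
\log\big(p(x)\big) = \beta_0 + \beta_1 x
\;\;\Longrightarrow\;\;
p(x) = e^{\beta_0 + \beta_1 x} > 0,
\quad\text{but } p(x) \to \infty \text{ as } \beta_0 + \beta_1 x \to \infty .
```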
(Refer Slide Time: 11:16)
So, is there something more sophisticated we can do? The next idea is to write p(X) as what is called a sigmoidal function. The sigmoidal function has relevance in many areas; it is the function that is used in neural networks and other very interesting applications.
So, the sigmoid has an interesting form, which is here. Now, let us look at this form right here. I want you to notice two things. Number one, we are still trying to stick this hyperplane equation into the probability expression, because that is the decision surface. Remember, intuitively we are trying to convert that hyperplane into a probability interpretation; that is the reason why we are still sticking to this β0 + β1x. Now let us look at this equation and then see what happens.
Now, what has happened is that this number is put back into this expression and, depending on what value you get, you get a probabilistic interpretation. That is the beauty of this idea. You can rearrange this in this form and then say log(p(X)/(1 − p(X))) = β0 + β1x. The reason why I show you this form is that the left-hand side can be interpreted as the log of the odds ratio, which is an idea that is used in several places. So, that is the connection here.
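For reference, the sigmoid form referred to on the slide and its log-odds rearrangement can be written as:

```latex
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}
\qquad\Longleftrightarrow\qquad
\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X .
```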
So, we still have to figure out what the values of these parameters are. Once we have values for them, any time I get a new point I simply put it into the p(x) equation that we saw in the last slide and get a probability. So, these still need to be identified and, obviously, if we are looking at a classification problem where I have circles on this side and stars on this side, I want to identify these β0, β11 and β12 in such a way that this classification problem is solved. So, I need to have some objective for identifying these values. Remember, in the optimization lectures I told you that all machine learning techniques can be interpreted, in some sense, as optimization problems.
So, here again we come back to the same thing and we say, look, we want to identify this hyperplane, but I need to have some objective function that I can use to identify these values. So, these β0, β11 and β12 will become the decision variables, but I still need an objective function. And as we discussed before, when we were talking about the optimization techniques, the objective function has to reflect what we want to do with this problem. So, here is an objective function; it looks a little complicated, but I will explain it as we go along. I said in the optimization lectures we could look at maximizing or minimizing. In this case, what we are going to say is: I want to find values for β0, β11 and β12 such that this objective is maximized.
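The objective referred to here is the likelihood of the data. Written out explicitly (with m data points and yi the class label of point i), it is:

```latex
\max_{\beta_0,\,\beta_{11},\,\beta_{12}} \;\; \prod_{i=1}^{m} p(x_i)^{\,y_i}\,\big(1 - p(x_i)\big)^{\,1 - y_i}
```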
So, take a minute to look at this objective and see why someone might want to do something like this. When I look at this objective function, let us say I again draw this picture, and let us say I have these points on one side and the other points on the other side. Let us call this class 0 and let us call this class 1. So, what I would like to do is convert this decision function into probabilities. The way I am going to think about this is: when I am on this line I should have the probability equal to 0.5, which basically says that if I am on the line I cannot make a choice between class 0 and class 1.
So, in other words, we can paraphrase this and say: for any data point on this side belonging to class 0, we want to minimize p(x) when x is substituted into that probability function and, for any point on the other side, when we substitute those data points into the probability function, we want to maximize the probability. So, if you look at this here, what it says is: if a data point belongs to class 0, then yi is 0. Whenever a data point belongs to class 0, anything to the power 0 is 1, so the term p(xi)^yi will become 1 and vanish from the product. The factor that remains is of the form 1 − p(xi), and because yi is 0, its exponent 1 − yi is 1, so the only thing that will remain is 1 − p(xi). If we try to maximize 1 − p(xi), that is equivalent to minimizing p(xi). So, for all the points that belong to class 0 we are minimizing p(xi).
Now, let us look at the other case of a data point belonging to class 1, in which case yi is 1, so 1 − yi is 0. The 1 − p(xi) term will be something to the power 0, which becomes 1, so it drops out. The only thing that will remain is p(xi)^yi and, since yi is 1, we are just left with p(xi). And since this data point belongs to class 1, I want this probability to be very large. So, when I maximize this, it will be a large number.
So, you have to think carefully about this equation. There are many things going on here. Number one, this is a multiplication of the probabilities for each of the data points, so it includes data points from class 0 and class 1. The other thing that you should remember is this: let us say I have a product of several numbers; if I am guaranteed that every number is positive, then the product will be maximized when each of these individual numbers is maximized. That is the principle that is also operating here, and that is why we take this product over all the probabilities.
If a data point belongs to class 1, I want its probability to be high, and the individual term is just written as p(xi), so this term is high for class 1. When a data point belongs to class 0, the individual term is 1 − p(xi); I still want this number to be high, which means p(xi) will be small. So, it automatically takes care of both class 0 and class 1. While this looks a little complicated, it is written in this way because it is easier to write it as one expression.
Now let us take a simple example to see how this will look. Let us say in class 0 I have two data points, X1 and X2, and in class 1 I have two data points, X3 and X4. So, this objective function, when it is written out, would look something like this. When we take the points belonging to class 0, I said the only thing that will be remaining is the 1 − p term; so this will be 1 − p(X1), and for the second data point it will be 1 − p(X2). Then for the third data point it will be p(X3) and for the fourth data point it will be p(X4).
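For these four points the objective therefore expands to the following product (writing out explicitly what was just described):

```latex
\big(1 - p(X_1)\big)\,\big(1 - p(X_2)\big)\; p(X_3)\; p(X_4)
```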
So, this would be the expression from here. Now, when we maximize this, since the p(X)s are bounded between 0 and 1, each of these factors is a positive number and, if the product has to be maximized, then each number has to be individually maximized. That means p(X4) has to be maximized; it will go closer and closer to 1, and the closer to 1 it is, the better. So, you notice that X4 would be pushed towards class 1. Similarly, X3 would be pushed towards class 1 and, when you come to the first two factors, you would see that 1 − p(X1) would be a large number if p(X1) is a small number. A small p(X1) basically means that X1 is pushed towards class 0, and similarly X2 is pushed towards class 0. So, this is an important idea that we have to understand in terms of how this objective function is generated.
Now, one simple trick you can do is take that objective function, take the log of it, and then maximize that. If I am maximizing a positive number X, that is equivalent to maximizing log of X as well; whenever one is maximized, the other is also maximized. The reason why you do this is that it turns the product into a sum, which makes it look simpler. So, remember from our optimization lectures, we said we have got to maximize this objective, and we always write the objective in terms of decision variables; the decision variables in this case are β0, β11 and β12, as we described before. So, what happens is that each of these probability expressions, if you recall from the previous slides, will have these 3 variables, and the xi are the points that are already given. So, you simply substitute them into this expression.
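Taking the log turns the product into the sum below, which is the form that is usually maximized in practice (again, just a restatement of the objective above):

```latex
\log L(\beta) = \sum_{i=1}^{m} \Big[ y_i \log p(x_i) + (1 - y_i)\,\log\big(1 - p(x_i)\big) \Big]
```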
So, in general I will have something like β0 + β11x1 + β12x2 + ... + β1nxn; this will be an (n + 1)-variable problem. There are n + 1 decision variables, and these will be identified by solving this optimization problem. And for any new data point, once we put that data point into the p(x) function, the sigmoidal function that we have described, we get the probability that it belongs to class 0 or class 1.
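To make the whole procedure concrete, here is a minimal sketch in Python (not part of the lecture, which uses R later for the case study; the toy data, learning rate and iteration count are assumptions made purely for illustration). It maximizes the log-likelihood by gradient ascent and then scores new points with the sigmoid:

```python
import numpy as np

def sigmoid(z):
    # p(x) = e^z / (1 + e^z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Find beta (intercept plus one coefficient per feature) that maximizes
    the log-likelihood sum( y*log p + (1-y)*log(1-p) ) by gradient ascent."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a column of 1s for beta_0
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ beta)
        gradient = Xb.T @ (y - p)                    # gradient of the log-likelihood
        beta += lr * gradient / len(y)
    return beta

def predict_proba(X, beta):
    # probability that each row of X belongs to class 1
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return sigmoid(Xb @ beta)

# Two toy clusters in 2-D: class 0 near the origin, class 1 shifted away (made-up data).
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0, 0], scale=0.5, size=(20, 2))
X1 = rng.normal(loc=[2, 2], scale=0.5, size=(20, 2))
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(20), np.ones(20)])

beta = fit_logistic(X, y)
print("beta_0, beta_11, beta_12 =", beta)

# A new point far on the class-1 side gets a probability close to 1,
# a point on the class-0 side gets a probability close to 0.
print(predict_proba(np.array([[2.5, 2.5], [-0.5, -0.5]]), beta))
```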
So, this is the basic idea of logistic regression. In the next lecture, I will take a very simple example with several data points to show you how this works in practice, and I will also introduce the notion of regularization, which helps in avoiding overfitting when we do logistic regression. I will explain what overfitting means in the next lecture as well. With that you will have a theoretical understanding of how logistic regression works and, in a subsequent lecture, Dr. Hemanth Kumar will illustrate how to use this technique in R on a case study problem.
So, that will give you the practical experience of how to use logistic regression and how to make sense of the results that you get from applying logistic regression to an example problem.