What Is Logistic Regression
Logistic regression is the appropriate regression analysis to conduct when the dependent variable is binary. Like all regression analyses, it is a predictive analysis: it is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables.
I found this definition on Google, and now we'll try to understand it.
Here "binary" means a variable that has only two possible outputs, for example, a person will survive this accident or not, a student will pass this exam or not. The outcome can either be yes or no (1 or 0). Binary logistic regression can, for instance, predict whether a customer will churn or not, or whether a patient has a disease or not. Multinomial logistic regression is used when the target variable has three or more possible outcomes, such as the type of product a customer will buy, the rating a customer will give a product, or the political party a person will vote for. Ordinal logistic regression is used when these categories have a natural order, such as the level of customer satisfaction, the severity of a disease, or the stage of cancer.
So why do we use logistic regression rather than linear regression? If you have this doubt, then you're in the right place, my friend. After reading the definition of logistic regression, we now know that it is only used when our dependent variable is binary, whereas in linear regression the dependent variable is continuous.
One problem is that if we add an outlier to our dataset, the best fit line in linear regression shifts to accommodate that point. Suppose we have data of tumor size versus malignancy (0 or 1). If we fit a straight line that minimizes the distance between the predicted values and the actual values, the line will look like this:
Image Source: towardsdatascience.com
Here the threshold value is 0.5, which means that if the value of h(x) is greater than 0.5 we predict a malignant tumor (1), and if it is less than 0.5 we predict a benign tumor (0). Everything seems okay so far, but now let's change it a bit: if we add some outliers to our dataset, the best fit line shifts toward those points.
Image Source: towardsdatascience.com
Do you see the problem here? The blue line represents the old threshold and the yellow line represents the new threshold, which is maybe 0.2 here. To keep our predictions right, we had to lower our threshold value. Hence we can say that linear regression is prone to outliers: now only if h(x) is greater than 0.2 do we predict a malignant tumor.
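Here is a minimal sketch of that shift, using made-up toy numbers for tumor size and a single extreme point as the outlier; the "decision point" is just where the fitted straight line crosses a fixed 0.5 threshold.

```python
# Toy illustration: a single outlier moves the point where a straight
# regression line crosses the 0.5 threshold. All numbers are made up.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])   # tumor sizes
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                   # 0 = benign, 1 = malignant

def threshold_crossing(X, y):
    """Fit a straight line and return the tumor size where it predicts 0.5."""
    model = LinearRegression().fit(X, y)
    # Solve w * x + b = 0.5 for x.
    return (0.5 - model.intercept_) / model.coef_[0]

print("Decision point without outlier:", threshold_crossing(X, y))

# Add one extreme malignant tumor far to the right (the outlier).
X_out = np.vstack([X, [[30]]])
y_out = np.append(y, 1)
print("Decision point with outlier:   ", threshold_crossing(X_out, y_out))
```

With the outlier included, the line flattens and the 0.5 crossing moves, so points that used to be classified correctly at the old cutoff no longer are.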
Another problem with linear regression is that the predicted values may be out of range. We know that a probability must lie between 0 and 1, but a straight line can output values far below 0 or above 1. Logistic regression handles both issues by bending the straight best fit line of linear regression into an S-curve using the sigmoid function, which always gives values between 0 and 1. How does this work, and what is the math behind it? We'll get there shortly.
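To see the squashing in action, here is a tiny numpy sketch of the sigmoid (the formula itself is derived a little later in this article):

```python
# The sigmoid maps any real number into the open interval (0, 1).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-100.0, -2.0, 0.0, 2.0, 100.0])))
# Values stay strictly between 0 and 1, with exactly 0.5 at z = 0.
```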
If you want to know the difference between logistic regression and linear regression in more detail, that comparison deserves its own article. For now, here is how logistic regression works, step by step:
1. Prepare the data: The data should be in a format where each row represents a
single observation and each column represents a different variable. The target
variable (the variable you want to predict) should be binary (yes/no, true/false,
0/1).
2. Train the model: We teach the model by showing it the training data. This
involves finding the values of the model parameters that minimize the error in the
training data.
3. Evaluate the model: The model is evaluated on the held-out test data to assess how well it generalizes to observations it has never seen.
4. Use the model to make predictions: After the model has been trained and evaluated, it can be used to predict the outcome for new observations, as in the short sketch after this list.
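A minimal end-to-end sketch of these four steps with scikit-learn; the data here is synthetic and the feature values are assumptions made purely for illustration.

```python
# Steps 1-4 with scikit-learn on a small synthetic dataset.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Prepare the data: rows are observations, columns are features, target is 0/1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 2. Train the model: fit() finds coefficients that minimize the (regularized) log loss.
model = LogisticRegression()
model.fit(X_train, y_train)

# 3. Evaluate the model on the held-out test data.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 4. Use the model to make predictions on a new observation.
new_observation = np.array([[0.2, -1.0, 0.5]])
print("Predicted class:", model.predict(new_observation)[0])
print("Predicted probability of class 1:", model.predict_proba(new_observation)[0, 1])
```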
You must be wondering how logistic regression squeezes the output of linear regression into the range between 0 and 1. Well, there's a little bit of math behind this, and it is pretty interesting, trust me.
How similar is it to linear regression? If you haven't read my article on linear regression, I'd suggest going through it first.
We all know the equation of the best fit line in linear regression is y = b0 + b1x: an intercept plus a slope times the input. Let's say that instead of y we are modelling probabilities (P). But there is an issue here: the right-hand side can exceed 1 or go below 0, while we know that the range of a probability is only (0, 1). So instead of P itself we model the odds of P, that is, P/(1 - P).
Do you think we are done here? No, we are not. We know that odds are always positive, which means the range will be (0, +∞). Odds are nothing but the ratio of the probability of success to the probability of failure. Now the question arises: out of so many other ways to transform P, why did we take the odds? Because odds are probably the easiest way to do this, that's it.
The problem here is that the range is still restricted, and we don't want a restricted range: it is difficult to model a variable that has a restricted range, because squeezing everything into a bounded interval compresses the variation we are trying to capture. To get around this we take the log of the odds, which has a range from (-∞, +∞).
If you understood what I did here, then you have done 80% of the math. Now we just want a function that gives us P directly, because we want to predict a probability, right? Not the log of odds. To do so we exponentiate both sides and then solve for P, as written out below.
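Written out, the chain of steps above looks like this, using b0 and b1 for the intercept and slope from the linear equation:

```latex
\begin{aligned}
y &= b_0 + b_1 x && \text{best fit line from linear regression}\\
\frac{P}{1-P} &= b_0 + b_1 x && \text{replace } y \text{ with the odds, range } (0,+\infty)\\
\log\!\left(\frac{P}{1-P}\right) &= b_0 + b_1 x && \text{take the log of the odds, range } (-\infty,+\infty)\\
\frac{P}{1-P} &= e^{\,b_0 + b_1 x} && \text{exponentiate both sides}\\
P &= \frac{e^{\,b_0 + b_1 x}}{1 + e^{\,b_0 + b_1 x}} = \frac{1}{1 + e^{-(b_0 + b_1 x)}} && \text{solve for } P
\end{aligned}
```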
Now we have our logistic function, also called a sigmoid function. Its graph is the special "S"-shaped curve we use to predict probabilities, and it ensures that the predicted values always stay between 0 and 1. Although the math may seem complex, the relationship between our inputs (like age, height, etc.) and the outcome (like yes/no) is pretty simple to understand. It's like drawing a line through the data, just bent into an S so it can represent probabilities.
Coefficients: These are just numbers that tell us how much each input affects the outcome in the logistic regression model. For example, if age is a predictor, its coefficient tells us how much the log-odds of the outcome change for every one-year increase in age.
Best Guess: We figure out the best coefficients for the logistic regression model by looking at the data we have and tweaking them until our predictions match the observed outcomes as closely as possible.
Assumptions: We assume that the observations are independent, meaning one doesn't affect the other. We also assume that there's not too much overlap (correlation) between our predictors (like age and height), and that the relationship between our predictors and the log-odds of the outcome is roughly linear.
Probabilities: Instead of a hard yes/no answer, logistic regression gives us probabilities, like saying there's a 70% chance it's a "yes". We can then decide on a cutoff point to make our final decision.
Checking the Model: There are several metrics that help us make sure our predictions are good, like accuracy, precision, recall, and a curve called the ROC curve. These help us see how well our logistic regression model is performing; the sketch right after this list shows how to compute them.
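A sketch of the evaluation side with scikit-learn: predicted probabilities, a decision cutoff, and the metrics mentioned above. The dataset here is synthetic and exists only for illustration.

```python
# Probabilities, a 0.5 cutoff, coefficients, and common evaluation metrics.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

probs = model.predict_proba(X_test)[:, 1]   # predicted P(y = 1) for each test row
preds = (probs >= 0.5).astype(int)          # 0.5 cutoff; this is the tunable decision point

print("Coefficients:", model.coef_[0], "Intercept:", model.intercept_[0])
print("Accuracy: ", accuracy_score(y_test, preds))
print("Precision:", precision_score(y_test, preds))
print("Recall:   ", recall_score(y_test, preds))
print("ROC AUC:  ", roc_auc_score(y_test, probs))
```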
In linear regression, we use the mean squared error (MSE), which measures the squared difference between y_predicted and y_actual, and this cost can itself be derived from the maximum likelihood estimator. The graph of this cost function in linear regression is a nice convex bowl. In logistic regression, however, the prediction is a non-linear (sigmoid) function of the parameters, so if we plug it into the same MSE equation we get a non-convex graph with many local minima, as shown.
Image Source: towardsdatascience.com
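For reference, this is the usual MSE cost with the sigmoid prediction substituted in for the logistic case (written out here as a sketch, with θ the parameters and x_i the i-th input):

```latex
J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\big(y_i - \hat{y}_i\big)^2,
\qquad
\hat{y}_i = \frac{1}{1 + e^{-\theta^{\top} x_i}}
```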
The problem here is that this cost function can trap gradient descent in a local minimum, which is a big problem because then we'll miss our global minimum and our error will stay needlessly large.
In order to solve this problem, we derive a different cost function for logistic
regression called log loss which is also derived from the maximum likelihood
estimation method.
In the next section, we’ll talk a little bit about the maximum likelihood estimator
and what it is used for. We’ll also try to see the math behind this log loss
function.
What is the use of Maximum Likelihood Estimator?
The main aim of maximum likelihood estimation is to find the parameter values that maximize the likelihood function. This function represents the joint probability of the observed data, viewed as a function of the model parameters. In machine learning terms, the process aims to discover parameter values such that, when plugged into the model for P(x), they produce a value close to one for individuals with a malignant tumor and close to zero for those with a benign tumor.
Let's start by defining our likelihood function. We now know that the labels are binary, which means they can be yes/no, pass/fail, and so on. We can also say we have two outcomes, success and failure. This means we can interpret each label as the outcome of a single Bernoulli trial, in the sense of the following random experiment.
Random Experiment
A random experiment whose outcomes are of two types, success S and failure F, occurring with probabilities p and 1 - p respectively, is called a Bernoulli trial. For this experiment, a random variable X is defined such that it takes the value 1 when S occurs and 0 when F occurs, which gives the likelihood written out below.
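In symbols (a standard writing-out of what was just described, where p_i is the model's predicted probability for observation i and θ stands for the parameters):

```latex
P(X = x) = p^{x}(1 - p)^{1 - x}, \qquad x \in \{0, 1\}

L(\theta) = \prod_{i=1}^{n} p_i^{\,y_i}\,(1 - p_i)^{1 - y_i}
\qquad\Longrightarrow\qquad
\log L(\theta) = \sum_{i=1}^{n} \Big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\Big]
```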
We need a value of theta that maximizes this likelihood function. To make our calculations easier we take the log of both sides; the function we get is also called the log-likelihood function, or the sum of the log conditional probabilities. In machine learning it is conventional to minimize a loss function via gradient descent rather than maximize an objective function via gradient ascent, so instead of maximizing the log-likelihood we take its negative and minimize that. We'll talk more about gradient descent in a later section, and then you'll have more clarity.
Also, remember: maximizing log L(θ) is the same as minimizing -log L(θ), that is, argmax log L(θ) = argmin [-log L(θ)].
The negative of this log-likelihood is our cost function, and what do we want from a cost function? We want it to have a minimum we can reach. It is conventional to minimize a cost function in optimization problems, so flipping the sign turns our maximization into a minimization, with one term in the cost for each class.
For the 1 class (y = 1), the right term of the cost vanishes: if the predicted probability is close to 1 then the loss is small, and as the probability approaches 0 the loss shoots up towards infinity. For the 0 class (y = 0), the left term vanishes: if the predicted probability is close to 0 then the loss is small, but as the probability approaches 1 the loss again goes to infinity. If we combine the two graphs, we get a convex curve with only one minimum.
This cost function is also called log loss. It ensures that as the probability of the correct answer is maximized, the probability of the incorrect answer is minimized, and the lower the value of this cost function, the higher the accuracy; the full expression is written out below.
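Putting the two cases together for m training examples, with h_θ(x_i) the predicted probability for example i, the log loss is:

```latex
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y_i \log\big(h_\theta(x_i)\big) + (1 - y_i)\log\big(1 - h_\theta(x_i)\big) \Big]
```

When y_i = 1 the right term drops out, and when y_i = 0 the left term drops out, exactly as described above.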
In this section, we will try to understand how we can use gradient descent to find the minimum of this cost function. Gradient descent changes the value of our weights in such a way that the cost converges to its minimum point; in other words, it aims to find the optimal weights that minimize the loss function of our model. It is an iterative method that finds the minimum of a function by computing the slope at a starting point (chosen at random) and repeatedly stepping downhill.
The intuition is that if you are hiking in a canyon and trying to descend most quickly down to the river at the bottom, you might look around yourself 360 degrees, find the direction where the ground slopes most steeply downward, walk a step in that direction, and repeat.
At first, gradient descent picks a random value for our parameters. Then we need a rule that tells us whether, at the next iteration, we should move left or right to get closer to the minimum point. The gradient descent algorithm computes the slope of the loss function at the current point and, in the next iteration, moves in the opposite direction of that slope to approach the minimum. Since our cost function is now convex, we don't need to worry about local minima. A parameter called the learning rate (alpha) controls how big a step we take at each iteration while moving towards the minimum point. Usually a smaller value of alpha is preferred, because if the learning rate is too large we may overshoot the minimum and keep oscillating across the convex curve.
Image Source: https://stackoverflow.com/
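Each iteration then updates every parameter against the slope, scaled by the learning rate α (written here in the generic form):

```latex
\theta_j := \theta_j - \alpha \, \frac{\partial J(\theta)}{\partial \theta_j}
```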
Now the question is: what is this derivative of the cost function, and how do we compute it? Don't worry, in the next section we'll see how to derive it.
Before we derive the gradient of our cost function, we'll first find the derivative of the sigmoid function, since it will be needed along the way.
Step-1: Use the chain rule to break the partial derivative of the log-likelihood into pieces.
Step-2: Find the derivative of the log-likelihood with respect to p.
Now that we have the derivative of the cost function, we can write out our gradient descent update rule for logistic regression, sketched below.
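Carrying those steps through (a sketch of the standard result), the sigmoid derivative and the resulting gradient of the log loss are:

```latex
\frac{d}{dz}\,\sigma(z) = \sigma(z)\big(1 - \sigma(z)\big),
\qquad
\frac{\partial J(\theta)}{\partial \theta_j}
  = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x_i) - y_i\big)\, x_{ij}
```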
If the slope is negative (a downward slope), then gradient descent adds some value to the parameter, pushing it towards the minimum point of the convex curve, whereas if the slope is positive (an upward slope), gradient descent subtracts some value to direct it towards the minimum point.
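To tie everything together, here is a compact numpy sketch of the whole loop described above: the sigmoid, the gradient of the log loss, and the gradient descent update. The data and variable names are synthetic assumptions, purely for illustration.

```python
# Logistic regression trained by plain gradient descent on log loss.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, alpha=0.1, n_iters=1000):
    """Fit weights (intercept included) by gradient descent on the log loss."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # column of 1s for the intercept
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(n_iters):
        p = sigmoid(X @ theta)            # predicted probabilities h_theta(x)
        gradient = X.T @ (p - y) / m      # (1/m) * sum (h - y) * x, as derived above
        theta -= alpha * gradient         # step against the slope
    return theta

# Tiny synthetic example: the class is 1 whenever the single feature is positive.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1))
y = (X[:, 0] > 0).astype(float)
theta = train_logistic_regression(X, y)
print("Learned intercept and weight:", theta)
print("P(y=1 | x=2):", sigmoid(np.array([1.0, 2.0]) @ theta))
```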