Logistic Regression Notes
Logistic Regression
Submitted by:
Name: Shruti Gupta
Class: BBA (FIA) 2B
Roll No: 18373
Introduction
When the dependent variable is categorical, then the relationship between the dependent
variable and the independent variables can be represented by using a logistic regression
model. Using the logistic regression model, the value of the dependent variable can be
predicted from the values of the independent variables.
Categorical Data:
Categorical data is statistical data whose values fall into a limited set of groups or
categories rather than being measured on a numeric scale. It may consist of variables that are
categorical by nature, or of numeric data that has been grouped into categories.
Examples of Categorical data can be:
The differences in the characteristics of Linear Regression and Logistic Regression are
illustrated below:
Due to the above factors, Linear Regression becomes unsuitable in case the
Dependent variable is categorical.
The model can be made suitable if its output, which is read as a probability, satisfies two
conditions:
• The function must always be positive (greater than 0)
• The function must always be less than 1
Sigmoid Function:
The sigmoid function, which is also called the logistic function, gives an ‘S’ shaped curve that
maps any real-valued number to a value between 0 and 1. As the input goes to positive
infinity, y predicted approaches 1, and as the input goes to negative infinity, y predicted
approaches 0.
If the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or
Yes, and if it is less than 0.5, we can classify it as 0 or No.
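The behaviour described above can be sketched in a few lines of Python. This is only an illustration: the function names `sigmoid` and `classify` are chosen here and do not come from the notes, and an output of exactly 0.5 (which the rule above does not cover) is assumed to be classified as 0.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic (sigmoid) function: maps any real z to a value between 0 and 1."""
    # Two algebraically equal forms are used so that exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def classify(z: float, threshold: float = 0.5) -> int:
    """Return 1 (Yes) if the sigmoid output is above the threshold, else 0 (No)."""
    # Assumption: a value exactly equal to the threshold is classified as 0.
    return 1 if sigmoid(z) > threshold else 0

# Large negative inputs give values near 0, large positive inputs values near 1.
for z in (-10, -1, 0, 1, 10):
    print(z, round(sigmoid(z), 4), classify(z))
```

The input z here stands for the value of the linear expression; its sign decides which side of 0.5 the output falls on.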
To make the value of the RHS positive and less than 1, we apply the sigmoid function to the
linear regression model.
Taking the exponential of the RHS makes it positive, and dividing that exponential by one plus
itself brings the value below 1: p = e^(β0 + β1Xi) / (1 + e^(β0 + β1Xi))
After this transformation, the value of the dependent variable is limited between 0 and 1.
To overcome the residual issue, we identify a threshold probability value. If the probability is
more than the threshold value, it is predicted that the event will happen, and if the
probability is less than the threshold value, it is predicted that the event will not
happen.
This is the Logistic Regression Function which overcomes the limitations of the linear model.
Logistic Regression:
Logistic regression predicts the probability p of the event in the dependent variable using the
independent variables.
p = e^( β0 + β1Xi)*(1-p)
Dividing both sides by (1-p) gives
p/(1-p) = e^( β0 + β1Xi)
When we take the natural log on both sides, the equation becomes
Ln(p/(1-p)) = β0 + β1Xi
Here p/(1-p) represents the odds, i.e. the ratio of the probability of the event happening
to the probability of the event not happening, and Ln(p/(1-p)) is the log of odds.
Though the independent variable is not linearly related to the probability of the dependent
variable, it is linearly related to the log of odds, making this a linear function.
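A short numerical check can make this point concrete. The sketch below uses made-up coefficients (β0 = -1 and β1 = 0.5 are assumptions for illustration only, not estimates from any data) and shows that equal steps in X give unequal steps in the probability but equal steps in the log of odds.

```python
import math

def inv_logit(z: float) -> float:
    """Convert a log of odds value z into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p: float) -> float:
    """Convert a probability p into the log of odds Ln(p/(1-p))."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -1.0, 0.5

xs = [0, 1, 2, 3, 4]
probs = [inv_logit(b0 + b1 * x) for x in xs]
log_odds = [logit(p) for p in probs]

# Change in probability and in log of odds for each one-unit step in X.
prob_steps = [probs[i + 1] - probs[i] for i in range(len(xs) - 1)]
log_odds_steps = [log_odds[i + 1] - log_odds[i] for i in range(len(xs) - 1)]

print([round(s, 4) for s in prob_steps])      # steps differ from one another
print([round(s, 4) for s in log_odds_steps])  # every step equals b1
```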
Logistic Coefficients: The odds are the ratio of the probability of the event happening to
the probability of the event not happening, and the log of odds is the natural log of this ratio.
The slope coefficient is interpreted as the rate of change in the "log odds" as X changes. The
coefficient is used to determine whether a change in a predictor variable makes the event
more likely or less likely. A positive coefficient makes the event more likely and a negative
coefficient makes the event less likely. An estimated coefficient near 0 implies that the
effect of the predictor is small.
Log of Odds: If the β1 value is 1.6, it means that a 1 unit change in X1, while the other
independent variables are held at the same level, produces a 1.6 unit change in the log of
odds. If we take the exponential of the log of odds, we get the odds.
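The β1 = 1.6 example can be worked through numerically. In the sketch below, the coefficient is the one quoted above, while the baseline odds of 0.25 are a hypothetical starting point assumed only to show how the odds and the probability move.

```python
import math

beta1 = 1.6  # coefficient from the example above

# A 1 unit increase in X1 adds beta1 to the log of odds,
# which multiplies the odds by e^beta1.
odds_multiplier = math.exp(beta1)
print(round(odds_multiplier, 3))  # 4.953

# Hypothetical baseline odds (assumption): 0.25, i.e. a probability of 0.2.
baseline_odds = 0.25
new_odds = baseline_odds * odds_multiplier

def odds_to_prob(odds: float) -> float:
    """Convert odds back into a probability."""
    return odds / (1.0 + odds)

print(round(odds_to_prob(baseline_odds), 3), round(odds_to_prob(new_odds), 3))
```

The same coefficient therefore always multiplies the odds by the same factor, while the change in probability depends on where the baseline sits.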
Research Problem:
The research problem taken to analyse the application of Logistic regression is to predict
“whether the birthweight of an infant would be low (< 2500 g) or not”. This would be
affected by a number of factors such as the age of the mother, race of the mother, weight of
the mother, whether she smoked during pregnancy or not etc.
This is a problem of Logistic Regression since the dependent variable i.e. low birthweight is
binary in nature. It can take only two values: the birthweight of the infant is either low or
not low.
Data: For the analysis, sample data on 189 mothers has been taken.
Dependent Variable – Low Birthweight which is labelled as ‘low’
Independent Variables –
“age” – indicates the age of the mother at the time of pregnancy
“smoke” – indicates whether the mother was a smoker or a non-smoker during pregnancy
Binary Predictor:
Under this, the independent (explanatory) variable is also a binary variable which can take
only 2 values. To analyse this, “smoke”, which is a binary independent variable, has been
taken. The Logistic Regression model aims to estimate whether smoking by a pregnant
woman is associated with low birthweight in infants or not.
Ln(p/(1-p)) = β0 + β1*(Smoke), where p is the probability of Low Birthweight
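With a single binary predictor, the fitted coefficients can be read directly from a 2x2 table of counts: β0 is the log of odds in the group where the predictor is 0, and β1 is the log of the ratio of the two groups' odds. The sketch below illustrates this with made-up counts; they are an assumption for illustration and are not the counts from the 189-mother sample.

```python
import math

# Hypothetical counts (NOT the data used in these notes).
unexposed_event, unexposed_no_event = 20, 80   # predictor = 0
exposed_event, exposed_no_event = 20, 30       # predictor = 1

odds_unexposed = unexposed_event / unexposed_no_event
odds_exposed = exposed_event / exposed_no_event

# Coefficients of the logistic model with one binary predictor.
b0 = math.log(odds_unexposed)                 # log of odds when predictor = 0
b1 = math.log(odds_exposed / odds_unexposed)  # log of the odds ratio

def predicted_prob(x: int) -> float:
    """Predicted probability of the event for predictor value x (0 or 1)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# The predicted probabilities reproduce the observed proportions in each group.
print(round(predicted_prob(0), 3), unexposed_event / (unexposed_event + unexposed_no_event))
print(round(predicted_prob(1), 3), exposed_event / (exposed_event + exposed_no_event))
print(round(math.exp(b1), 3))  # odds ratio for the predictor
```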
Logistic Coefficients Method:
Since the coefficient of the predictor variable is positive, it indicates that low birthweight is
more likely in infants whose mothers smoked during pregnancy. The value of the coefficient is
0.7040, which means that smoking is associated with an increase of 0.7040 in the log of odds
of low birthweight. It is not a 70.4% chance of low birthweight. Taking the exponential of the
coefficient, e^0.7040 ≈ 2.02, gives the corresponding odds ratio.
Odds Ratio:
From the results, we can observe that the odds ratio of the independent variable is
2.0219. This indicates that the odds of low birthweight for infants of mothers who smoke
are almost twice the odds for infants of non-smoking mothers. The p-value is less than 0.05,
which implies that the result is statistically significant at the 5% level.
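The coefficient and the odds ratio reported for “smoke” are two views of the same estimate, which can be checked with a line of arithmetic. The sketch below uses only the two values quoted in these notes; the variable names are chosen here for illustration.

```python
import math

coef_smoke = 0.7040        # logistic coefficient reported in the notes
odds_ratio_smoke = 2.0219  # odds ratio reported in the notes

# The odds ratio is the exponential of the coefficient ...
print(round(math.exp(coef_smoke), 2))        # 2.02

# ... and the coefficient is the natural log of the odds ratio.
print(round(math.log(odds_ratio_smoke), 3))  # 0.704
```

Note that the odds ratio compares odds, not probabilities; converting it into a probability for either group would also need the intercept, which is not quoted here.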
Continuous Predictors:
Under this, the independent (explanatory) variable is a continuous variable which can take
any value within a range. To analyse this, “age”, which is a continuous independent variable,
has been taken. The Logistic Regression model aims to estimate whether the age of the pregnant
woman is associated with low birthweight in infants or not.
Ln(p/(1-p)) = β0 + β1*(Age), where p is the probability of Low Birthweight
From the results, we can observe that the odds ratio of the independent variable is
0.9501. Since the odds ratio is less than 1, each one-year increase in the mother's age is
associated with about a 5% decrease in the odds of low birthweight in infants.
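For a continuous predictor, the odds ratio applies to each one-unit step, so its effect compounds over several units. The sketch below takes the reported odds ratio of 0.9501 for age and works out the percentage change in odds per year and over a 10-year difference; the 10-year span is just an illustrative choice.

```python
odds_ratio_age = 0.9501  # odds ratio per one-year increase, from the notes

# An odds ratio below 1 means the odds fall as age rises.
pct_change_per_year = (odds_ratio_age - 1) * 100
print(round(pct_change_per_year, 2))  # -4.99, i.e. about a 5% decrease

# Over several years the ratio is multiplied, not added.
years = 10  # illustrative span (assumption)
odds_ratio_10_years = odds_ratio_age ** years
pct_change_10_years = (odds_ratio_10_years - 1) * 100
print(round(odds_ratio_10_years, 2), round(pct_change_10_years, 1))
```

So a 10-year difference corresponds to roughly 40% lower odds rather than 50%, because each year's 5% applies to the already reduced odds.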
This is the margins graph and margins plot of the predictions. On the vertical axis is the
predicted probability of low birthweight in infants and on the horizontal axis is the age of the
mother. The graphs indicate the probability of low birthweight at various ages of the mother.
Along with the predicted values, the graphs depict the 95% confidence interval of the predicted
probability of low birthweight for ages ranging from 25 to 45.