
Financial Econometrics Assignment 3

Logistic Regression
Submitted by:
Name: Shruti Gupta
Class: BBA (FIA) 2B
Roll No: 18373

Introduction
When the dependent variable is categorical, the relationship between the dependent
variable and the independent variables can be represented using a logistic regression
model. Using the logistic regression model, the value of the dependent variable can be
predicted from the values of the independent variables.
Categorical Data:
Categorical data is a type of statistical data which consists of categorical variables or
grouped data which can be converted into categorical form. Categorical data is divided into
groups according to the variables present in the data.
Examples of categorical data include:

• Gender, such as male or female
• Income level groups, such as low-income, middle-income or high-income group
• Blood type of a person: A, B, AB or O
• Race of a person
• Educational level of a person, etc.
Categorical variables can be of two types (a brief illustration follows this list):
1. Binary Variable: A categorical variable which can take exactly two values is
termed a binary variable or a dichotomous variable. Examples of binary variables are true or
false, male or female, yes or no, etc.
2. Polychotomous Variable: A categorical variable which can take more than two
possible values is called a polychotomous variable. For example, the blood type of a
person, income levels, etc.
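As a brief illustration (a hypothetical sketch, not part of the original notes, with invented column names), both types of categorical variable can be represented numerically in Python with pandas:

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],   # binary variable
    "blood_type": ["A", "O", "AB", "B"],              # polychotomous variable
})

# A binary variable can be coded directly as 0/1
df["gender_binary"] = (df["gender"] == "Female").astype(int)

# A polychotomous variable gets one indicator column per category
print(pd.get_dummies(df["blood_type"]))
```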

Binary Logistic Regression:


Binary logistic regression is a statistical method for predicting binary classes. This
technique is used when the dependent variable is binary in nature, that is, it can
take only two values, such as a yes or a no. It estimates the probability of occurrence of an
event.
Some examples of Logistic regression problems can be:
• To buy or sell a stock?
• Will the car break down or not?
• Should the bank give a loan to the person or not?

Assumptions of Binary Logistic Regression:


The logistic regression model does not require some of the assumptions made under
the classical linear regression model, namely:
1. Under this model, a linear relationship is not required between the dependent and
independent variables. The relationship between the variables can be non-linear, as
the method involves a log transformation.
2. The error or residual terms are not required to be normally distributed.
3. Logistic regression does not require the variances to be homoscedastic for each level of
the independent variables, i.e. homoscedasticity is not required in the data.
4. The independent variables can be ordinal or nominal.
The logistic regression model makes the following assumptions:
1. The binary logistic regression model requires the dependent variable to be binary in
nature, i.e. taking only 2 values.
2. Under this model, the observations should be independent, i.e. the data should not
come from repeated measurements or matched data.
3. There should be no multicollinearity in the data: the independent variables should
not be correlated with one another (a sketch of one common check appears after this
list).
4. The independent variables are linearly related to the log odds. The independent
variables need not be linearly related to the dependent variable, but under the
logistic regression model there is a linear relationship between the independent
variables and the log odds.
5. The model should be fitted correctly: neither overfitting nor underfitting should
occur, and all the meaningful variables should be included in the model.
6. Logistic regression requires large sample sizes, because maximum likelihood
estimates are less powerful than ordinary least squares estimates at small sample sizes.
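Assumption 3 is commonly checked with variance inflation factors (VIFs). A minimal sketch using statsmodels, with hypothetical predictor data and invented variable names:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors, for illustration only
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "age": rng.normal(30, 5, 100),
    "weight": rng.normal(60, 8, 100),
})
X["const"] = 1.0  # VIF computation expects an intercept column

# One VIF per column; values near 1 suggest little multicollinearity,
# large values flag predictors that are strongly correlated with the rest
vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(dict(zip(X.columns, vifs)))
```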

Logistic Regression Vs Linear Regression


There are significant differences between Logistic Regression and Linear Regression. The
Linear Regression models are used to estimate and solve the Regression Problems whereas
the Logistic Regression models are mainly used to estimate and solve the Classification
Problems. However, Logistic Regression Models can also be used for solving Regression
Problems.

The differences in the characteristics of Linear Regression and Logistic Regression are
illustrated below:

1. Linear regression is used to predict a continuous dependent variable from a given
set of independent variables, whereas logistic regression is used to predict a
categorical dependent variable from a given set of independent variables.
2. Linear regression is used to solve regression problems, whereas logistic regression
is used to solve classification problems.
3. Under linear regression, we predict the values of continuous variables, whereas
under logistic regression, we predict the values of categorical variables.
4. Under linear regression, we find the best-fit line in order to predict the output,
whereas under logistic regression, we find the S-curve in order to classify the samples.
5. The least squares method is used for estimation in linear regression, whereas the
maximum likelihood estimation method is used in logistic regression.
6. In linear regression, the relationship between the dependent variable and the
independent variables must be linear, whereas in logistic regression a linear
relationship between the dependent and independent variables is not required.
Limitations of Linear Regression in the case of Categorical Variables:
A linear regression model is not suitable when the dependent variable is categorical
(binary) in nature. A binary dependent variable can assume only two values, i.e. 0 or 1,
just like a dummy variable.
Under Linear Regression, the regression equation of the model is
Yi = β0 + β1Xi + µi
Under the linear regression model, the dependent and independent variables can take any
real value, which causes certain problems when analysing categorical data. These
are stated as follows:
1. In a binary classification problem, we estimate the probability of an outcome
occurring. Probability ranges between 0 and 1, where a probability of 1 means the
event is certain to happen and a probability of 0 means it is certain not to happen.
But in linear regression, we are predicting an absolute number, which can range
outside 0 and 1. Since the linear regression model produces a straight line, some
predicted values may be either less than 0 or more than 1. But a probability value
cannot be greater than 1 or less than 0. Though we can cap any value greater than 1
at 1, and any value lower than 0 at 0, in this case the analysis would not generate
results as accurate as those of a logistic regression model (a short demonstration
follows this list).
2. Since binary classification problems can only have one of two possible values (0 or 1),
the residuals or error terms will not be normally distributed about the regression
line. But under the classical linear regression model, we assume the error terms to
be normally distributed.
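To demonstrate point 1, a minimal sketch (hypothetical data, not from the assignment) comparing the predictions of the two models on a binary outcome:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Invented data: one continuous predictor, binary outcome
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = (X[:, 0] + rng.normal(0, 2, 200) > 5).astype(int)

linear = LinearRegression().fit(X, y)
logistic = LogisticRegression().fit(X, y)

# Evaluate both models beyond the range of the training data
grid = np.linspace(-5, 15, 5).reshape(-1, 1)
print(linear.predict(grid))               # straight line: values fall outside [0, 1]
print(logistic.predict_proba(grid)[:, 1])  # S-curve: always stays within (0, 1)
```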

Due to the above factors, Linear Regression becomes unsuitable in case the
Dependent variable is categorical.

The model can be made suitable if the following two conditions are satisfied:
• The function must always be positive
• The function must be less than 1
Sigmoid Function:

The sigmoid function, which is also called the logistic function, gives an ‘S’-shaped curve
that can take any real-valued number and map it into a value between 0 and 1. As the curve
goes to positive infinity, the predicted y approaches 1, and as the curve goes to negative
infinity, the predicted y approaches 0.

If the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or
yes, and if it is less than 0.5, we can classify it as 0 or no.
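A minimal Python sketch of the sigmoid function and the 0.5 threshold rule just described (illustrative only):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
p = sigmoid(z)
print(p)          # approaches 0 as z -> -inf and 1 as z -> +inf
print(p >= 0.5)   # classify as 1/yes when the probability is at least 0.5
```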

The linear regression model is

Yi (or p) = β0 + β1Xi

To make the value of the RHS positive and less than 1, we apply the sigmoid function to the
linear regression model.

Taking the exponent of the RHS makes its value strictly positive:

p = e^(β0 + β1Xi)

Dividing by 1 + e^(β0 + β1Xi) then keeps the value below 1:

p = e^(β0 + β1Xi)/(1 + e^(β0 + β1Xi))

After this transformation, the value of the dependent variable is limited between 0 and 1.

To overcome the residual issue, we identify a threshold probability value. If the predicted
probability is more than the threshold value, the event is predicted to happen, and if it is
less than the threshold value, the event is predicted not to happen.
This is the Logistic Regression Function which overcomes the limitations of the linear model.

Logistic Regression:

The logistic regression predicts the dependent variable using the independent variables.

The equation of the logistic regression model is

p = 1/(1 + e^−(β0 + β1Xi))

This can be written as

p = e^(β0 + β1Xi)/(1 + e^(β0 + β1Xi))

p(1 + e^(β0 + β1Xi)) = e^(β0 + β1Xi)

p + p·e^(β0 + β1Xi) = e^(β0 + β1Xi)

p = e^(β0 + β1Xi)·(1 − p)

p/(1 − p) = e^(β0 + β1Xi)

When we take the natural log on both sides, the equation becomes

ln(p/(1 − p)) = β0 + β1Xi

This is another form of the logistic regression equation.

Here p/(1 − p) represents the odds, i.e. the ratio of the probability of the event happening
to the probability of the event not happening; its natural log, ln(p/(1 − p)), is the log of odds.
Though the independent variable is not linearly related to the dependent variable, it is
linearly related to the log of odds, which makes this a linear function of the parameters.
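A small numerical check of this linearity, using hypothetical coefficient values:

```python
import numpy as np

b0, b1 = -2.0, 0.5                # invented coefficients, for illustration only
x = np.array([0.0, 2.0, 4.0, 6.0])

p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))  # logistic model probabilities
log_odds = np.log(p / (1.0 - p))          # recover the log of odds from p

print(log_odds)        # equals b0 + b1 * x: linear in x
print(b0 + b1 * x)
```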

Interpretation of the Coefficients:

Logistic Coefficients: The odds are the ratio of the probability of the event happening to
the probability of the event not happening; the log of odds is the natural log of this ratio.

The slope coefficient is interpreted as the rate of change in the "log odds" as X changes. The
coefficient is used to determine whether a change in a predictor variable makes the event
more likely or less likely. A positive coefficient makes the event more likely and negative
coefficient makes the event less likely. An estimated coefficient near 0 implies that the
effect of the predictor is small.

Log of Odds: If the β1 value is 1.6, it means that a 1 unit change in X1, while the other
independent variables are held at the same level, produces a 1.6 unit change in the log of
odds. If we take the exponential of the change in log odds, we get the odds ratio.
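To make this concrete, using the hypothetical β1 = 1.6 from the example above:

```python
import math

b1 = 1.6                     # slope coefficient from the example above
odds_ratio = math.exp(b1)    # exponentiating the change in log odds
print(odds_ratio)            # ~4.95: each unit increase in X1 multiplies
                             # the odds of the event by about 4.95
```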

Research Problem:
The research problem taken to analyse the application of Logistic regression is to predict
“whether the birthweight of an infant would be low (< 2500 g) or not”. This would be
affected by a number of factors such as the age of the mother, race of the mother, weight of
the mother, whether she smoked during pregnancy or not etc.
This is a logistic regression problem, since the dependent variable, low birthweight, is
binary in nature: it can take only two values, i.e. whether the birthweight of the infant
would be low or not.
Data: For the analysis, a sample data of 189 mothers has been taken
Dependent Variable – Low Birthweight which is labelled as ‘low’
Independent Variables –
“age” – indicates the age of the mother at the time of pregnancy
“smoke” – indicates whether the mother was a smoker or a non-smoker during pregnancy

Binary Predictor:
Here the independent (explanatory) variable is also a binary variable which can take
only 2 values. To analyse this, “smoke”, a binary independent variable, has been
taken. The logistic regression model aims to estimate whether smoking by a pregnant
woman causes low birthweight in infants or not.
Low Birthweight = β0 + β1*(Smoke) + µi
Logistic Coefficients Method:

Since the coefficient of the predictor variable is positive, it indicates that low birthweight
becomes more likely in infants when the mother smokes. The value of the coefficient is
0.7040, which implies that smoking increases the log of odds of low birthweight in infants
by 0.7040.

Odds Ratio:
From the results, we can observe that the odds ratio of the independent variable is
2.0219. This indicates that the odds of low birthweight for mothers who smoke are
almost twice the odds for non-smoking mothers. The p-value is less than 5%,
which implies that the result is statistically significant.
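A sketch of how such estimates could be reproduced in Python with statsmodels, assuming the 189-observation sample is available as a CSV file with the 0/1 columns “low” and “smoke” described above (the file name here is hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# df is assumed to hold the 189-mother sample with 0/1 columns "low" and "smoke"
df = pd.read_csv("lowbwt.csv")        # hypothetical file name

model = smf.logit("low ~ smoke", data=df).fit()
print(model.summary())                # coefficient of smoke: ~0.7040 per the text
print(np.exp(model.params))           # odds ratios; smoke: ~2.02 per the text
```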
Continuous Predictors:
Here the independent (explanatory) variable is a continuous variable which can take
any number of values. To analyse this, “age”, a continuous independent variable, has
been taken. The logistic regression model aims to estimate whether the age of the pregnant
woman affects low birthweight in infants or not.
Low Birthweight = β0 + β1*(Age) + µi

Logistic Coefficients Method:

The coefficient of a continuous predictor is the estimated change in the natural log of the
odds of the reference event for each unit increase in the predictor.
Since the coefficient of the predictor variable is negative, it indicates that low birthweight is
less likely to occur in infants as the age of the mother increases. The value of the coefficient
is -0.0511, which implies that each additional year of the mother's age decreases the log of
odds of low birthweight by 0.0511.
Odds Ratio:

From the results, we can observe that the odds ratio of the independent variable is
0.9501. An odds ratio of 0.95 indicates that each one-year increase in age is associated
with an approximately 5% decrease in the odds of low birthweight in infants.
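The relationship between the reported coefficient and odds ratio can be verified directly:

```python
import math

b1 = -0.0511                 # coefficient of "age" from the results above
odds_ratio = math.exp(b1)
print(odds_ratio)            # ~0.9501, matching the reported odds ratio
print(1 - odds_ratio)        # ~0.05: each extra year of age lowers the
                             # odds of low birthweight by about 5%
```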
This is the margin graph and margin plot of the predictions. On the vertical axis is the
probability of low birthweight in infants, and on the horizontal axis is the age of the mother.
The graphs indicate the probability of low birthweight at various levels of the age of the
mother. Along with the predicted values, the graphs depict the 95% confidence interval of
low birthweight for ages ranging between 25 and 45.
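A sketch of how such a margins-style plot could be drawn in Python (hypothetical; the assignment's own plots appear to come from a statistical package's margins output), reusing df from the earlier snippet:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

model_age = smf.logit("low ~ age", data=df).fit()

ages = np.arange(25, 46)
X = np.column_stack([np.ones_like(ages, dtype=float), ages])
eta = X @ model_age.params.values                          # linear predictor
se = np.sqrt(np.sum((X @ model_age.cov_params().values) * X, axis=1))

prob = 1 / (1 + np.exp(-eta))                              # predicted probability
lower = 1 / (1 + np.exp(-(eta - 1.96 * se)))               # 95% CI computed on the
upper = 1 / (1 + np.exp(-(eta + 1.96 * se)))               # log-odds scale

plt.plot(ages, prob)
plt.fill_between(ages, lower, upper, alpha=0.3)
plt.xlabel("Age of the mother")
plt.ylabel("Probability of low birthweight")
plt.show()
```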

Practical Application of Logistic Regression:


The objective of Logistic Regression is to develop a mathematical equation that gives us a
score in the range of 0 to 1. This score provides the probability of the variable taking the
value 1.
Logistic regression can be used in a variety of spheres to model real-life problems:
1. Spam Detection: Spam detection is a binary classification problem where we need to
classify whether or not an email is spam. If the email is spam, we label it 1; if it is
not spam, we label it 0. In order to apply logistic regression to the spam detection
problem, the following features of the email are extracted:
• Sender of the email
• Number of typos in the email
• Occurrence of words/phrases like “offer”, “prize”, “free gift”, etc.
The resulting feature vector is then used to train a logistic classifier which emits a
score in the range 0 to 1. If the score is more than the threshold value, say 0.5, we
label the email as spam; otherwise, we don't (a minimal sketch of this pipeline
appears after this list).
2. Fraud Detection: The credit card fraud detection problem is another application of
logistic regression and is of significant importance to the banking industry, because
banks lose hundreds of millions of dollars to fraud each year. When a credit
card transaction happens, the bank makes a note of several factors: for instance, the
date of the transaction, amount, place, type of purchase, etc. Based on these factors,
they develop a logistic regression model to predict whether or not the transaction is
fraudulent.
We may label fraudulent transactions as 1 and legitimate ones as 0. Through the
logistic regression estimates, we may determine whether a transaction is likely to be
fraudulent according to the probability value.

3. Tumour Prediction: A logistic regression model may be used to identify whether a
tumour is malignant or benign. Several medical imaging techniques are used to
extract various features of tumours: for instance, the size of the tumour, the
affected body area, etc. These features are then fed to a logistic regression classifier
to identify whether the tumour is malignant or benign.

4. Subscription Prediction: Logistic regression can be used to predict whether a person
will subscribe to an OTT platform such as Netflix or Hotstar. We can label
subscribing to the platform as 1 and not subscribing as 0. The decision to subscribe
may depend on factors such as the price of the subscription, the types of shows
available, or the rating of the platform. Through this method, we can predict the
probability of a person subscribing to the OTT platform.
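Finally, a minimal, hypothetical sketch of the spam-detection pipeline from application 1, with invented feature values and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: [number of typos, count of words like "offer"/"prize"/"free gift",
#           1 if the sender is unknown else 0] -- invented for illustration
X = np.array([
    [8, 5, 1],
    [0, 0, 0],
    [5, 3, 1],
    [1, 0, 0],
    [7, 4, 1],
    [0, 1, 0],
])
y = np.array([1, 0, 1, 0, 1, 0])       # 1 = spam, 0 = not spam

clf = LogisticRegression().fit(X, y)

new_email = np.array([[6, 2, 1]])
score = clf.predict_proba(new_email)[0, 1]   # probability the email is spam
print(score)
print("spam" if score > 0.5 else "not spam")  # the 0.5 threshold rule
```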
