Module 6 - Logistic Regression Analysis

This document is a module on Logistic Regression Analysis from the Department of Statistics at Visayas State University. It covers the learning outcomes, applications, assumptions, and methods for conducting logistic regression analysis using SPSS, along with examples and interpretations of results. The module emphasizes the importance of understanding when to use logistic regression and the necessary data assumptions for accurate analysis.

DEPARTMENT OF STATISTICS

Visayas State University


Visca, Baybay City, Leyte PHILIPPINES
Phone/Fax: +63 53 _____ ________
Email: [email protected]
Website: www.vsu.edu.ph

MODULE 6
Logistic Regression Analysis

Learning Outcomes
After completion of this module, participants are expected to:
1. Determine when logistic regression should be used
2. Identify the data assumptions associated with the use of logistic regression
3. Conduct logistic regression analysis using SPSS
4. Interpret the results of the logistic regression

What is logistic regression?


Logistic regression is a form of regression that allows the prediction of a discrete
variable (Y) by a mix of continuous and discrete predictors (Xi). When the dependent variable
(Y) is discrete (nominal or ordinal), a linear regression model is no longer appropriate and may
lead to nonsensical estimates and predictions, such as predicted probabilities below 0 or above 1.
If the dependent variable is binary or dichotomous, such as pass or fail, the model is called the
binary logistic regression model.

Some Applications of Logistic Regression


1. Medical researchers want to know how exercise and weight impact the probability of
having a heart attack. To understand the relationship between the predictor variables
and the probability of having a heart attack, researchers can perform logistic
regression.
The response variable in the model will be “heart attack” and it has two potential
outcomes:
• A heart attack occurs.
• A heart attack does not occur.
The results of the model will tell researchers exactly how changes in exercise and
weight affect the probability that a given individual has a heart attack. The researchers
can also use the fitted logistic regression model to predict the probability that a given
individual has a heart attack, based on their weight and their time spent exercising.

2. Researchers want to know how GPA, ACT score, and number of AP classes taken impact
the probability of getting accepted into a particular university. To understand the
relationship between the predictor variables and the probability of getting accepted,
researchers can perform logistic regression.
The response variable in the model will be “acceptance” and it has two potential
outcomes:
• A student gets accepted.
• A student does not get accepted.
The results of the model will tell researchers exactly how changes in GPA, ACT score,
and number of AP classes taken affect the probability that a given individual gets
accepted into the university. The researchers can also use the fitted logistic regression
model to predict the probability that a given individual gets accepted, based on their
GPA, ACT score, and number of AP classes taken.

NE Milla Page 1 of 10
3. A business wants to know whether word count and country of origin impact the
probability that an email is spam. To understand the relationship between these two
predictor variables and the probability of an email being spam, researchers can perform
logistic regression.
The response variable in the model will be “spam” and it has two potential outcomes:
• The email is spam.
• The email is not spam.
The results of the model will tell the business exactly how changes in word count and
country of origin affect the probability of a given email being spam. The business can
also use the fitted logistic regression model to predict the probability that a given email
is spam, based on its word count and country of origin.

4. A credit card company wants to know whether transaction amount and credit score
impact the probability of a given transaction being fraudulent. To understand the
relationship between these two predictor variables and the probability of a transaction
being fraudulent, the company can perform logistic regression.
The response variable in the model will be “fraudulent” and it has two potential
outcomes:
• The transaction is fraudulent.
• The transaction is not fraudulent.
The results of the model will tell the company exactly how changes in transaction
amount and credit score affect the probability of a given transaction being fraudulent.
The company can also use the fitted logistic regression model to predict the probability
that a given transaction is fraudulent, based on the transaction amount and the credit
score of the individual who made the transaction.

Probability, Odds, and Odds Ratio


Below is a cross tabulation of students’ sex and whether they use drugs or not.

          Drug users   Non-users   TOTAL
Female        85          106       191
Male         120          102       222
TOTAL        205          208       413

Here the dependent variable (DV) is drug use and the independent variable (IV) is sex. The
probability of a student using drugs is $205/413 \approx 0.496$. For the table above, the research
question is whether there is a gender difference in using drugs, that is, whether the probability
of drug use is the same for males and females.

Recall that the odds of an event is the probability of the event occurring (or of belonging
to a group) divided by the probability of the event not occurring (or of not belonging to that
group). Hence, the odds of a male using drugs is $120/102 \approx 1.18$, which means that a male
is 1.18 times as likely to use drugs as not to use them. On the other hand, the odds of a female
using drugs is $85/106 \approx 0.80$, which means that a female is 0.80 times as likely to use
drugs as not to use them.

Odds ratio (OR), as the name implies, is the ratio of two odds. That is,
$$\mathrm{OR} = \frac{\text{odds for target group}}{\text{odds for reference group}}.$$

An odds ratio > 1 indicates that the event is more likely to occur in the target (response)
category than in the reference category of an independent variable, while an odds ratio < 1
indicates that the event is less likely to occur in the target category than in the reference
category. An odds ratio equal to 1 indicates no difference between the two categories.

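For example, if we use males as the target group (1) and females as the reference group (0), then
$$\mathrm{OR} = \frac{1.18}{0.80} \approx 1.48.$$
This means that the odds of using drugs for males are about 1.48 times the odds for females.
(The value 1.48 uses the rounded odds; the unrounded counts give a ratio of about 1.47.)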
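
The calculations in this section can be checked with a few lines of Python. This is only a minimal sketch: the cell counts come from the cross tabulation above, while the dictionary layout and the function name `odds` are choices made here for illustration.

```python
# Cell counts from the cross tabulation of sex by drug use.
counts = {
    "female": {"user": 85, "non_user": 106},
    "male": {"user": 120, "non_user": 102},
}


def odds(group):
    """Odds of drug use in one group: number of users divided by number of non-users."""
    return counts[group]["user"] / counts[group]["non_user"]


total_users = sum(g["user"] for g in counts.values())
total_students = sum(g["user"] + g["non_user"] for g in counts.values())

p_use = total_users / total_students      # overall probability of drug use
odds_male = odds("male")
odds_female = odds("female")
odds_ratio = odds_male / odds_female      # males as target group, females as reference

print(f"P(drug use)   = {p_use:.3f}")
print(f"odds (male)   = {odds_male:.2f}")
print(f"odds (female) = {odds_female:.2f}")
print(f"odds ratio    = {odds_ratio:.2f}")
```

Because the code divides the unrounded odds, the printed odds ratio is 1.47 rather than the 1.48 obtained by dividing the two rounded odds.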

Logistic Regression Model


Let Y be an event. We assign Y a value of 1 if the event occurs and 0 otherwise. Let the
probability that the event occurs be equal to $\pi$, that is, $P(Y = 1) = \pi$. The odds are
defined as the probability that the event will occur divided by the probability that the event will
not occur. That is,
$$\text{Odds} = \frac{P(Y = 1)}{P(Y = 0)} = \frac{\pi}{1 - \pi}.$$
If we take the natural logarithm of the odds, the resulting quantity is called the log odds
or logit. The logit serves as the dependent variable in logistic regression analysis. Thus, the
logistic regression model is given by
$$\ln\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p. \tag{1}$$
Using algebra, we can compute the probability that the event will occur as
$$\pi(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p}}.$$
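
The algebra behind this last step can be written out as follows. The symbol $\eta$ is shorthand introduced here for the linear combination of the predictors (it is not notation used in the module); the steps simply solve the logit equation for $\pi$.

```latex
\begin{align*}
\ln\!\left(\frac{\pi}{1-\pi}\right) &= \eta,
  \qquad \eta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p \\
\frac{\pi}{1-\pi} &= e^{\eta} \\
\pi &= e^{\eta}\,(1-\pi) \\
\pi\,(1 + e^{\eta}) &= e^{\eta} \\
\pi &= \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}}
\end{align*}
```

The final form shows that the predicted probability always lies strictly between 0 and 1, whatever the value of $\eta$.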

Just as in multiple linear regression, the beta coefficients in (1) are unknown
quantities and have to be estimated from data. In logistic regression analysis these
coefficients are estimated using the method of maximum likelihood.
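
To give a feel for what maximum likelihood estimation involves, here is a small sketch for a model with a single predictor. It uses Newton-Raphson iteration, which is one common way of maximizing the log likelihood; it is not claimed to be the exact routine SPSS runs. The eight data points are invented purely for illustration, and all function names are hypothetical.

```python
import math


def sigmoid(z):
    """Inverse logit: maps a log odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def score(b0, b1, x, y):
    """Gradient of the log likelihood with respect to (b0, b1)."""
    resid = [yi - sigmoid(b0 + b1 * xi) for xi, yi in zip(x, y)]
    return sum(resid), sum(r * xi for r, xi in zip(resid, x))


def fit_logistic(x, y, iterations=25):
    """Newton-Raphson maximum likelihood fit of logit(pi) = b0 + b1 * x."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0, g1 = score(b0, b1, x, y)
        # Information matrix (negative Hessian), built from the weights pi * (1 - pi).
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Toy data (invented for illustration): the outcome becomes more likely as x grows.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(x, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print("score at the estimate:", score(b0, b1, x, y))
```

At the maximum likelihood estimate the gradient (score) of the log likelihood is essentially zero, which is what the last line checks. Statistical packages add convergence checks and standard errors on top of this basic idea.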

Assumptions of the Logistic Regression Model


Logistic regression does not make many of the key assumptions of linear regression.
First, logistic regression does not require a linear relationship between the dependent and
independent variables. Second, the error terms (residuals) do not need to be normally
distributed. Third, homoscedasticity is not required. Finally, the dependent variable in logistic
regression is not measured on an interval or ratio scale.
However, some other assumptions still apply. First, binary logistic regression requires
the dependent variable to be binary and ordinal logistic regression requires the dependent
variable to be ordinal. Second, logistic regression requires the observations to be independent
of each other. In other words, the observations should not come from repeated
measurements or matched data. Third, logistic regression requires there to be little or no
multicollinearity among the independent variables. This means that the independent variables
should not be too highly correlated with each other. Fourth, logistic regression assumes
linearity of independent variables and log odds. Although this analysis does not require the
dependent and independent variables to be related linearly, it requires that the independent
variables are linearly related to the log odds.
Finally, logistic regression typically requires a large sample size. A general guideline is
that you need a minimum of 10 cases with the least frequent outcome for each independent
variable in your model. For example, if you have 5 independent variables and the expected
probability of your least frequent outcome is 0.10, then you would need a minimum sample size
of 500 (10 × 5 / 0.10).
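
The sample size guideline above is simple arithmetic and can be wrapped in a small helper. The function name and its parameters are hypothetical, chosen here only to mirror the wording of the rule.

```python
import math


def minimum_sample_size(n_predictors, p_least_frequent, cases_per_predictor=10):
    """Rule-of-thumb sample size: enough cases so that the least frequent
    outcome is expected to occur `cases_per_predictor` times per predictor."""
    needed_rare_cases = cases_per_predictor * n_predictors
    # The tiny offset guards against floating-point noise before rounding up.
    return math.ceil(needed_rare_cases / p_least_frequent - 1e-9)


# Example from the text: 5 independent variables, least frequent outcome at 0.10.
print(minimum_sample_size(5, 0.10))
```

The guideline is only a rough planning figure; it does not replace a formal power analysis.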

The data
A researcher is interested in how variables such as GRE (Graduate Record Exam scores),
GPA (grade point average), and prestige of the undergraduate institution affect admission into
graduate school. The dataset graduateadmission.sav has a binary response (outcome,
dependent) variable called admit (1=admit, 0=do not admit). There are three predictor
variables: gre, gpa and rank. The variables gre and gpa are continuous. The variable rank takes
on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those
with a rank of 4 have the lowest.

Binary Logistic Regression Analysis using IBM SPSS Statistics


1. Open the graduateadmission.sav file.
2. Click the Analyze menu, point to Regression, and click Binary Logistic.
3. In the Logistic Regression dialog box, select the variable admit and move it to the
Dependent box.
4. Select the variables gre, gpa, and rank and move them to the Covariates box.
5. Click the Categorical button.
6. In the Logistic Regression: Define Categorical Variables dialog box, select the variable
rank and move it to the Categorical Covariates box (see Figure 1). In this dialog box, the
default type of contrast is Indicator or dummy variable and the reference category is
set to Last. Later you can make appropriate changes in these settings.

Figure 1- Logistic Regression: Define Categorical Variables Dialog Box

7. Click the Continue button.


8. Click the Save button.
9. In the Logistic Regression: Save dialog box, put check marks on the following boxes:
Probabilities and Membership under Predicted Values, Standardized under Residuals,
Cook’s, Leverage values, and DfBeta(s) under Influence (see Figure 2).

Figure 2- Logistic Regression: Save Dialog Box

10. Click the Continue button.


11. Click the Options button.
12. In the Logistic Regression: Options dialog box, put check marks in the following boxes:
Classification plots, Hosmer-Lemeshow goodness-of-fit, Casewise listing of residuals, and
CI for exp(B) (see Figure 3).
13. Click the Continue button.
14. Click the OK button.

Figure 3- Logistic Regression: Options Dialog Box

OUTPUT AND INTERPRETATIONS

This table displays how the categorical variables were coded, especially those that are
included as covariates. In the case of the rank variable, 3 indicator variables were created
and rank=4 is treated as the reference category (default).
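
The indicator (dummy) coding described here can be sketched in Python. The function name `indicator_code` is hypothetical; the logic only illustrates the coding scheme, with the last category (rank=4) as the reference, and is not SPSS's own implementation.

```python
def indicator_code(value, categories, reference):
    """Return one 0/1 indicator per non-reference category.

    The reference category is coded as all zeros.
    """
    return [1 if value == c else 0 for c in categories if c != reference]


ranks = [1, 2, 3, 4]
coding = {r: indicator_code(r, ranks, reference=4) for r in ranks}

for r in ranks:
    print(f"rank={r}: {coding[r]}")
```

Each of the three indicators compares one rank with rank 4, which is why the fitted model reports separate coefficients for rank=1, rank=2, and rank=3 but none for rank=4.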

Constant-only model (null model): Block 0 (beginning block/step 0) means that only the
constant is in the model and our predictors are not in the equation yet.

Model Fit Statistics

Interpretation:
This tests the null hypothesis that all predictors in the model have no effect on
the dependent variable. The chi-square test statistic is 41.459 with a p-value reported as
0.000 (that is, p < 0.001), which is less than the 0.05 significance level, so we reject the
null hypothesis. We conclude that the predictors together help explain variation in the
dependent variable. This test is analogous to the overall F test in linear regression analysis.
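
SPSS prints very small p-values as 0.000. The sketch below recomputes the p-value of the model chi-square using only the Python standard library. It assumes 5 degrees of freedom (one each for gre and gpa, plus the 3 indicator variables for rank); the degrees of freedom are not quoted in the text above, so treat that value as an assumption. The helper `chi2_sf` is written here for illustration and handles integer degrees of freedom only.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df.

    Starts from the closed forms for df = 1 (odd) or df = 2 (even) and adds
    one term for each step of 2 in the degrees of freedom.
    """
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(x / 2.0)), 1
    else:
        q, k = math.exp(-x / 2.0), 2
    while k < df:
        q += (x / 2.0) ** (k / 2.0) * math.exp(-x / 2.0) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


model_chi_square = 41.459
df = 5  # assumption: gre + gpa + 3 rank indicators

p_value = chi2_sf(model_chi_square, df)
print(f"p-value = {p_value:.2e}")
```

The result is far below 0.001, which is why it is better reported as p < 0.001 than as a literal zero.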

Interpretation:
This table shows the -2 log likelihood statistic of 458.517 for the full model (the
model with 3 predictors). The -2 log likelihood statistic is also referred to as the deviance
statistic. The full model is compared with the null model (the model with the constant only)
through the drop in -2 log likelihood between the two models, which is the chi-square
statistic of 41.459 discussed above. Its p-value is reported as 0.000 (p < 0.001), which is
highly significant, indicating that the model with 3 predictors is significantly better than
the model without a predictor.
Other values reflected in the table are the Cox & Snell R Square and Nagelkerke
R Square values of 0.098 and 0.138, respectively. These are called pseudo-$R^2$
statistics; higher values indicate a better-fitting model.
Note that logistic regression does not have an equivalent to the R-squared that
is found in OLS regression; however, many people have tried to come up with
one. There is a wide variety of pseudo-R-square statistics (these are only two of
them). Because these statistics do not mean what R-squared means in OLS regression
(the proportion of variance explained by the predictors), we suggest interpreting them
with great caution.
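
Both pseudo-R-square values can be reproduced from numbers already reported in this module: the -2 log likelihood of the full model (458.517), the model chi-square (41.459), and the sample size of 400 that appears in the classification results. The sketch below uses the standard Cox & Snell and Nagelkerke formulas; the variable names are chosen here for illustration.

```python
import math

n = 400                       # number of cases
deviance_full = 458.517       # -2 log likelihood of the model with predictors
model_chi_square = 41.459     # drop in -2 log likelihood from the null model
deviance_null = deviance_full + model_chi_square

# Cox & Snell: 1 - (L_null / L_full)^(2/n), written in terms of -2 log likelihoods.
cox_snell = 1.0 - math.exp(-model_chi_square / n)

# Nagelkerke rescales Cox & Snell by its maximum attainable value.
max_cox_snell = 1.0 - math.exp(-deviance_null / n)
nagelkerke = cox_snell / max_cox_snell

print(f"-2LL (null model)  = {deviance_null:.3f}")
print(f"Cox & Snell R^2    = {cox_snell:.3f}")
print(f"Nagelkerke R^2     = {nagelkerke:.3f}")
```

The rescaling step is why the Nagelkerke value is always at least as large as the Cox & Snell value for the same model.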

Interpretation:
The Hosmer-Lemeshow test is used to test the hypothesis that the model
adequately describes the data. This is done by comparing the predicted probabilities
obtained using the model with the observed proportions of events. The Hosmer-Lemeshow
statistic indicates a poor fit if the significance value is less than 0.05. Here, the Hosmer-
Lemeshow test statistic is 11.085 with a p-value of 0.195, which is greater than the 0.05
significance level. Thus, there is no evidence of poor fit, and the model is considered to
fit the data adequately.

[Classification table: the false positive and false negative cells are marked in the original figure.]

Interpretation:
The classification table (also called a confusion matrix) gives us the percentage
of cases correctly classified by the model. The overall percentage of correct classification
is 71% ($(254 + 30)/400 = 0.71$). Notice that in the null model the overall percentage of
correct classification is 68.3%, which means that adding the three predictors to the model
improves the correct classification rate by only 2.7 percentage points (a bit small!).
The percentages 23.6% and 93.0% are called the sensitivity and specificity of the
predictions, respectively. Sensitivity is the percentage of actual admissions that the model
predicts correctly; specificity is the percentage of actual non-admissions that it predicts
correctly.
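
The classification measures can be recomputed from the four cells of the table. Only the two correctly classified counts (254 and 30) and the total of 400 are quoted in the text; the two misclassified counts used below (19 and 97) are inferred here from the reported percentages and should be checked against the actual SPSS output.

```python
# Counts quoted in the text.
true_negative = 254    # not admitted, predicted not admitted
true_positive = 30     # admitted, predicted admitted

# Counts inferred from the reported 93.0% and 23.6% (an assumption, not read from the table).
false_positive = 19    # not admitted, predicted admitted
false_negative = 97    # admitted, predicted not admitted

n = true_negative + true_positive + false_positive + false_negative

accuracy = (true_negative + true_positive) / n
sensitivity = true_positive / (true_positive + false_negative)
specificity = true_negative / (true_negative + false_positive)

# The null model predicts the more frequent outcome (not admitted) for every case.
null_accuracy = (true_negative + false_positive) / n

print(f"n             = {n}")
print(f"accuracy      = {accuracy:.3f}")
print(f"sensitivity   = {sensitivity:.3f}")
print(f"specificity   = {specificity:.3f}")
print(f"null accuracy = {null_accuracy:.4f}")
```

The low sensitivity shows that, at the default cutoff, the model rarely predicts admission even for applicants who were in fact admitted, which is why overall accuracy improves so little over the null model.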

Interpretation:
In the table labeled Variables in the Equation we see the coefficients, their
standard errors, the Wald test statistic with associated degrees of freedom and p-values,
and the exponentiated coefficient (also known as an odds ratio). Both gre and gpa are
statistically significant. The overall (i.e., multiple degree of freedom) test for rank is given
first, followed by the terms for rank=1, rank=2, and rank=3. The overall effect of rank is
statistically significant, as are the terms for rank=1 and rank=2.
The logistic regression coefficients give the change in the log odds of the
outcome for a one-unit increase in the predictor variable.
• For every one-unit increase in gre, the log odds of admission (versus non-admission)
increases by 0.002, assuming all else is equal.
• For a one-unit increase in gpa, the log odds of being admitted to graduate school
increases by 0.804 (the natural logarithm of the odds ratio 2.235), assuming all else is equal.
• The indicator variables for rank have a slightly different interpretation. For
example, having attended an undergraduate institution with a rank of 1, versus an
institution with a rank of 4, increases the log odds of admission by 1.551. In
addition, the log odds of admission of a graduate from a rank 2 institution is 0.876
higher than that of a graduate from a rank 4 institution, assuming all else is equal.

Alternatively, it is easier to understand the effect of each predictor if we use
the odds ratio (Exp(B)).
• For every one-unit increase in gre, the odds of admission are multiplied by a factor of
1.002, assuming all else is equal.
• For a one-unit increase in gpa, the odds of admission are multiplied by a factor of
2.235, assuming all else is equal.
• The odds of admission of a graduate from a rank 1 institution are 4.718 times
the odds of admission of a graduate from a rank 4 institution. Also, the odds of
admission of graduates from rank 2 institutions are 2.401 times the odds
of admission of graduates from rank 4 institutions, assuming all else is equal.
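
The link between a coefficient B and its odds ratio Exp(B) is simply exponentiation. The sketch below uses the rounded values quoted in this section, so the results can differ from the SPSS output in the third decimal place. The 100-point GRE example at the end is an illustration added here, not a result reported in the module.

```python
import math

# Coefficients (log odds scale) quoted in the interpretation above, rounded to 3 decimals.
coefficients = {"gre": 0.002, "rank=1": 1.551, "rank=2": 0.876}

# Exp(B): the factor by which the odds change for a one-unit increase.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}
for name, value in odds_ratios.items():
    print(f"Exp(B) for {name}: {value:.3f}")

# Going the other way: the coefficient behind the reported odds ratio for gpa.
b_gpa = math.log(2.235)
print(f"B for gpa: {b_gpa:.3f}")

# A change of k units multiplies the odds by exp(k * B).
gre_100 = math.exp(100 * coefficients["gre"])
print(f"Odds multiplier for a 100-point GRE increase: {gre_100:.3f}")
```

Because gre is measured in single points, its per-unit odds ratio looks negligible; rescaling to a 100-point change gives a more interpretable effect size.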

NOTE:
1. Residual analysis must ALWAYS be done prior to interpreting the coefficients in the
final model.
2. Cases flagged as outliers in the residual analysis should be reviewed to determine why they
are outliers.
