The document discusses the Linear Probability Model (LPM) for binary dependent variables, explaining how to estimate probabilities using linear regression when the outcome is binary (0 or 1). It provides examples of applying LPM to real data, including factors affecting labor force participation and mortgage application denials, while also addressing the limitations of LPM, such as heteroskedasticity and omitted variable bias. The document emphasizes that the coefficients in LPM represent changes in the probability of success as explanatory variables change.

BSAD 6318/ ECON 5339

MYASAR
SP 2025

Module 6A: Linear Probability Model

BSAD 6318-ECON 5339, SP 2025


Binary Dependent Variables

A dependent variable can be binary: it takes on only two values, 0 and 1.
Whether a person used public or private transportation to get to work
Whether a firm exported during the year or not
Whether another firm took over a firm during a given year
In each case, the dependent variable is coded as a binary variable.
For instance, the dependent variable y = 1 if the firm exported during a given year and y = 0 otherwise.



Binary Dependent Variables
Suppose we are interested in estimating the following model when y is a binary variable:
y = β0 + β1x1 + β2x2 + … + βkxk + u   (1)
Since the dependent variable (y) takes on only two values (0 or 1), βj cannot be interpreted as before: it is no longer the change in y given a one-unit increase in xj, ceteris paribus.
y either changes from 0 to 1, changes from 1 to 0, or does not change.
If we assume the zero conditional mean assumption, E(u|x1, …, xk) = 0, then we have
E(y|x) = β0 + β1x1 + … + βkxk   (2)
where x is shorthand for all of the explanatory variables.


Binary Dependent Variables
When the dependent variable y is binary (y = 1 or 0), P(y=1|x) = E(y|x): the probability of success (the probability that y = 1) is the same as the expected value of y.
Thus, we can write our model as follows:
p(x) = P(y=1|x) = β0 + β1x1 + … + βkxk   (3)
which indicates that the probability of success, p(x) = P(y=1|x), is a linear function of x.
This is an example of a binary response model.
When the dependent variable is binary, the multiple regression model is called the Linear Probability Model (LPM) because the response probability P(y=1|x) is linear in the parameters βj.
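To make the point concrete, here is a minimal sketch in Python (a hypothetical toy dataset, not the course data) showing that OLS fitted values on a 0/1 outcome track the share of successes at each value of x:

```python
# Minimal sketch: OLS fit of a binary outcome on one regressor.
# At each x, the fitted value estimates P(y = 1 | x).
# Toy data (hypothetical): no successes at x = 0, half at x = 1, all at x = 2.
x = [0, 0, 1, 1, 2, 2]
y = [0, 0, 0, 1, 1, 1]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Closed-form simple OLS slope and intercept
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

print(b0, b1)       # intercept 0.0, slope 0.5
# The fitted "probability" at x = 1 equals the observed success rate there
print(b0 + b1 * 1)  # 0.5
```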



Binary Dependent Variables
Using the zero conditional mean assumption, one can show that
E(y|x) = β0 + β1x1 + … + βkxk
The probability that Yi = 1 is Pi = Pr(Yi = 1), and the probability that Yi = 0 is 1 − Pi = Pr(Yi = 0).
Since Yi takes on only the values 0 and 1, it has a Bernoulli distribution, a binomial distribution with n = 1. Thus,
E(Yi) = 1 × Pi + 0 × (1 − Pi) = Pi
Var(Yi) = Pi(1 − Pi)
Probability models aim to examine the determinants of Pi, the probability that Yi = 1 rather than Yi = 0.
The predicted probability P̂i can be determined as Ŷi for a given value of xi.
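The Bernoulli mean and variance above can be checked directly; this short Python sketch uses an arbitrary illustrative value of p:

```python
# Verify E(Y) = p and Var(Y) = p(1 - p) for a Bernoulli variable:
# Y takes the value 1 with probability p and 0 with probability 1 - p.
p = 0.3  # illustrative success probability

mean = 1 * p + 0 * (1 - p)                             # E(Y) = p
var = (1 - mean) ** 2 * p + (0 - mean) ** 2 * (1 - p)  # Var(Y) = E[(Y - EY)^2]

print(mean)  # 0.3
print(var)   # 0.21, i.e. p(1 - p) = 0.3 * 0.7
```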
Binary Dependent Variables: Interpretation
In the LPM, βj measures the change in the probability of success when xj changes, ceteris paribus:
ΔP(y=1|x) = βj Δxj
If we write the estimated equation as
ŷ = β̂0 + β̂1x1 + … + β̂kxk
then:
ŷ is the predicted probability of success
β̂0 is the predicted probability of success when every xj is zero
β̂j measures the predicted change in the probability of success when xj increases by one unit
Example: Binary Dependent Variables

Let’s use the data from Mroz (1987) and estimate a linear
probability model, where 428 out of 753 women in the
sample report being in the labor force at some point during
1975
Let inlf =1 if the woman reports working for a wage and zero
otherwise
Assume that the labor force participation depends on the
following:
other sources of income (nwifeinc, in $1000)
years of education (educ)
experience (exper)
Age
number of children less than six years old (kidslt6)
number of kids between 6 and 18 years of age (kidsge6)
use mroz, clear
reg inlf nwifeinc educ exper expersq age kidslt6 kidsge6, r
Source SS df MS Number of obs = 753
F( 7, 745) = 38.22
Model 48.8080578 7 6.97257969 Prob > F = 0.0000
Residual 135.919698 745 .182442547 R-squared = 0.2642
Adj R-squared = 0.2573
Total 184.727756 752 .245648611 Root MSE = .42713

inlf Coef. Std. Err. t P>|t| [95% Conf. Interval]

nwifeinc -.0034052 .0014485 -2.35 0.019 -.0062488 -.0005616


educ .0379953 .007376 5.15 0.000 .023515 .0524756
exper .0394924 .0056727 6.96 0.000 .0283561 .0506287
expersq -.0005963 .0001848 -3.23 0.001 -.0009591 -.0002335
age -.0160908 .0024847 -6.48 0.000 -.0209686 -.011213
kidslt6 -.2618105 .0335058 -7.81 0.000 -.3275875 -.1960335
kidsge6 .0130122 .013196 0.99 0.324 -.0128935 .0389179
_cons .5855192 .154178 3.80 0.000 .2828442 .8881943

The estimated slope coefficient (β̂j) gives the impact of a unit change in that explanatory variable (xj) on the probability that y = 1.
The coefficient on educ indicates that an extra year of education increases the probability of labor force participation by 0.038, or 3.8 percentage points, ceteris paribus.
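To see what these estimates imply, we can plug in covariate values for a hypothetical woman; a Python sketch (the covariate values are illustrative, the coefficients come from the output above):

```python
# Predicted probability of labor force participation from the estimated LPM.
# Coefficients are taken from the regression output above; the covariate
# values describe a hypothetical woman and are purely illustrative.
b = {
    "_cons":     0.5855192,
    "nwifeinc": -0.0034052,
    "educ":      0.0379953,
    "exper":     0.0394924,
    "expersq":  -0.0005963,
    "age":      -0.0160908,
    "kidslt6":  -0.2618105,
    "kidsge6":   0.0130122,
}
x = {"nwifeinc": 20, "educ": 12, "exper": 10, "expersq": 100,
     "age": 30, "kidslt6": 1, "kidsge6": 0}

p_hat = b["_cons"] + sum(b[k] * v for k, v in x.items())
print(round(p_hat, 3))  # about 0.564
```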




Example 2: Binary Dependent Variables

use hmda_sw2.dta, clear

Let's first create a binary variable:
den = 1 if the loan application is denied; den = 0 if the loan is originated or the application is approved but not accepted by the applicant.
See the hmda.doc file for the variable definitions.

gen den=0
replace den=1 if s7==3

*s13 is the applicant's race: 3 = Black; 5 = White
gen race=0
replace race=1 if s13==3

*s46 is the debt-to-income ratio (the bank's calculation of total obligations/income)
gen PI=s46/100
Example 3: Binary Dependent Variables
regress den PI, r
. regress den PI, r

Linear regression Number of obs = 2,380


F(1, 2378) = 37.56
Prob > F = 0.0000
R-squared = 0.0397
Root MSE = .31828

Robust
den Coefficient std. err. t P>|t| [95% conf. interval]

PI .6035349 .0984826 6.13 0.000 .4104144 .7966555


_cons -.0799096 .0319666 -2.50 0.012 -.1425949 -.0172243

Note that the estimated coefficient on the PI ratio is positive (0.604) and significant at the .01 significance level. Thus, those with higher debt payments as a fraction of income are more likely to have their application denied.
For example, an increase of .10 in the PI ratio raises the probability of denial by .604 × .10 ≈ .06, or almost 6 percentage points.
Example 3: Binary Dependent Variables
Now, let's compute the predicted den probabilities as a
function of the PI ratio
If, for instance, the PI ratio is .30, the predicted value from the estimated equation is
-0.08 + 0.604 × .30 ≈ 0.101
An applicant whose projected debt payments are 30% of
his/her income has a probability of 0.101 that his/her
application will be denied.
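The same arithmetic with the unrounded estimates from the output above can be sketched in Python:

```python
# Predicted denial probability at PI = 0.30, using the unrounded
# estimates from the regression output above.
b0 = -0.0799096   # _cons
b1 = 0.6035349    # coefficient on PI

p_hat = b0 + b1 * 0.30
print(round(p_hat, 3))  # 0.101
```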



Example 3: Binary Dependent Variables
Now, let's examine the effect of race on the probability of denial, holding the
PI constant
regress den PI race, vce(r)
. regress den PI race, vce(r)

Linear regression Number of obs = 2,380


F(2, 2377) = 49.39
Prob > F = 0.0000
R-squared = 0.0760
Root MSE = .31228

Robust
den Coefficient std. err. t P>|t| [95% conf. interval]

PI .5591946 .0886663 6.31 0.000 .3853233 .7330658


race .1774282 .0249463 7.11 0.000 .1285096 .2263469
_cons -.0905136 .0285996 -3.16 0.002 -.1465963 -.0344309

The coefficient on race is 0.177, which indicates that a black applicant has a 17.7
percentage points higher probability of having a mortgage application denied
than the control group, holding PI constant.
But keep in mind that we do not control for many variables. Thus, this difference
may change as we add more explanatory variables. This is just a simple example.

The rest of the material is optional. We will discuss it in the second course next
semester



Marginal Effects
Now let's look at the marginal effects for the LPM model
mfx
. mfx

Marginal effects after regress


y = Fitted values (predict)
= .1197479

variable dy/dx Std. err. z P>|z| [ 95% C.I. ] X

PI .5591946 .08867 6.31 0.000 .385412 .732977 .330814


race* .1774282 .02495 7.11 0.000 .128534 .226322 .142437

(*) dy/dx is for discrete change of dummy variable from 0 to 1

The marginal effects are the same as the slope coefficients.
Why? Because the relationships in the LPM are linear and do not vary with the values of the explanatory variables.
We will see later that this is not the case with logit and probit models.
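This contrast can be sketched numerically in Python; the LPM line uses the estimates from the output above, while the logistic coefficients below are arbitrary illustrative values, not estimates:

```python
import math

# LPM: P(den = 1) = b0 + b1 * PI  (estimates from the output above)
b0, b1 = -0.0905136, 0.5591946
lpm = lambda pi: b0 + b1 * pi

# Marginal effect of PI at two different points: identical for the LPM
me_low  = (lpm(0.31) - lpm(0.30)) / 0.01
me_high = (lpm(0.61) - lpm(0.60)) / 0.01
print(round(me_low, 4), round(me_high, 4))  # both equal b1 = 0.5592

# For comparison, a logistic response curve (coefficients are arbitrary,
# NOT estimates): its slope changes with PI.
logit = lambda pi: 1 / (1 + math.exp(-(-4 + 6 * pi)))
print(round((logit(0.31) - logit(0.30)) / 0.01, 4))  # slope near PI = 0.30
print(round((logit(0.61) - logit(0.60)) / 0.01, 4))  # different slope near 0.60
```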



Instead of mfx, you may use margins, dydx(*) to obtain the marginal effects. The margins command is faster.
. regress den PI i.race, vce(r)

Linear regression Number of obs = 2,380


F(2, 2377) = 49.39
Prob > F = 0.0000
R-squared = 0.0760
Root MSE = .31228

Robust
den Coefficient std. err. t P>|t| [95% conf. interval]

PI .5591946 .0886663 6.31 0.000 .3853233 .7330658


1.race .1774282 .0249463 7.11 0.000 .1285096 .2263469
_cons -.0905136 .0285996 -3.16 0.002 -.1465963 -.0344309

. margins, dydx(*)

Average marginal effects Number of obs = 2,380


Model VCE: Robust

Expression: Linear prediction, predict()


dy/dx wrt: PI 1.race

Delta-method
dy/dx std. err. t P>|t| [95% conf. interval]

PI .5591946 .0886663 6.31 0.000 .3853233 .7330658


1.race .1774282 .0249463 7.11 0.000 .1285096 .2263469
Note: dy/dx for factor levels is the discrete change from the base level.
Limitations of LPM
Based on these results, can we conclude that mortgage decisions
have a racial bias?
No
Many other factors can affect this decision, which are omitted
from the above models.
If these other factors are correlated with the independent
variables, then their omission from the model will cause an
omitted variable bias.
We will include some variables in the following models.

The Linear Probability Model also has the following limitations.



Limitations of LPM: Heteroskedasticity

Problem 1 with LPM: Heteroskedasticity

Let's get the residuals and graph them:

regress den PI, r
predict resid, resid
graph7 resid PI, ylab xlab yline(0)



Limitations of LPM: Heteroskedasticity

[Figure: residuals from the LPM regression of den on PI, plotted against the PI ratio]

This graph illustrates that for a given value of x (PI), there are only two possible values of the residual, indicating that the variance of the error term in the LPM is heteroskedastic.
Limitations of LPM: Heteroskedasticity
The OLS estimators are unbiased if the error term is uncorrelated with the explanatory variables.
However, the errors are heteroskedastic, since Var(y|x) = p(x)[1 − p(x)].
Thus, Var(u) can take on different values for different observations.
There will be heteroskedasticity in the LPM, except in the case where the probability does not depend on any of the independent variables.
The dependent variable takes on only 0 or 1 for given values of the independent variables. Thus, the error term (u) also takes on only two values:
When den (yi) = 1: ui = 1 − β0 − β1 PIi − β2 racei
When den (yi) = 0: ui = 0 − β0 − β1 PIi − β2 racei
Thus, the distribution of u has only two specific values.
Since these two values change with the explanatory variables, the error term cannot be assumed to be homoskedastic.
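The changing error variance can be illustrated with the den-on-PI estimates from the earlier output (a Python sketch; the two PI values are arbitrary):

```python
# Var(u | x) = p(x) * (1 - p(x)) in the LPM, so the error variance
# changes with the regressors. Using the simple den-on-PI estimates:
b0, b1 = -0.0799096, 0.6035349

def err_var(pi):
    p = b0 + b1 * pi      # p(x) = P(den = 1 | PI)
    return p * (1 - p)    # Bernoulli error variance at this PI

print(round(err_var(0.30), 4))  # 0.0909
print(round(err_var(0.60), 4))  # 0.2026 -- a different variance: heteroskedasticity
```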



Limitations of LPM: Heteroskedasticity
This does not cause a bias in the OLS estimates of βj (if the x variables are all exogenous), but we know that homoskedasticity is crucial for the test statistics.
The standard errors must be corrected for the heteroskedasticity.
We should use robust standard errors to calculate the test statistics.
Note also that since the distribution of u has only two values, normality does not hold, which can be an issue for the standard errors and test statistics.



Linear versus Non-Linear

Logit and Probit models are nonlinear and provide predicted probabilities between 0 and 1.
For comparison, the LPM predicted probabilities can be computed directly from the (rounded) estimates:

gen prob=-0.091+.559*PI+0.177*race
tab prob



Limitations of LPM: outside of the 0 and 1 range
Problem 2 with LPM: The range of the predicted probabilities can lie
outside of the 0 and 1 range
Now let's compute the predicted value of den
predict den_hat
Now let's graph them:
graph7 den_hat PI, ylab xlab yline(0,1)
As illustrated on the next slide, the predicted probabilities range from -.0799096 to 1.730695, although probabilities should lie between 0 and 1.

As we will explain later, the LPM can be a good alternative when there are few 1s in the dependent variable.
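The out-of-range predictions quoted above can be reproduced from the estimated line in a short Python sketch (PI = 3 is the value that reproduces the largest quoted prediction):

```python
# The LPM fitted line den_hat = b0 + b1 * PI is not bounded to [0, 1].
b0, b1 = -0.0799096, 0.6035349

p_low = b0 + b1 * 0.0   # smallest fitted value quoted above
p_high = b0 + b1 * 3.0  # PI = 3 reproduces the largest quoted value

print(round(p_low, 7))   # -0.0799096 : a negative "probability"
print(round(p_high, 6))  # 1.730695   : a "probability" above 1
```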



Limitations of LPM: outside of the 0 and 1 range
[Figure: predicted probabilities (den_hat) plotted against the PI ratio; the fitted line falls below 0 for small PI and rises above 1 for large PI]


Sources

Wooldridge (2009)
Stock and Watson (2005)
