Limited Dependent Variables
Updated 2024-03-25
Check-in
We've spent a lot of time thinking about research designs - what we need to control for,
finding natural experiments, etc.
We've spent a little less time on what kinds of statistical methods there are other than
OLS!
This leaves out some important stuff!
(Statisticians, as opposed to econometricians, might say we've barely done any statistics
at all!)
2 / 34
OLS and the Dependent Variable
A typical OLS equation looks like:
Y = β0 + β1 X + ε
The normal distribution is continuous and smooth and has infinite range
And the linear form stretches off to infinity in either direction as X gets small or big
Both of these imply that the dependent variable, Y , is continuous and can take any
value (why is that?)!
If that's not true, then our model will be misspecified in some way
3 / 34
Non-Continuous Dependent Variables
When might dependent variables not be continuous and have infinite range?
4 / 34
Binary Dependent Variables
In many cases, such as variables that must be round numbers or can't be negative, there
are ways of properly handling these issues, but people will usually ignore the problem
and just use OLS, as long as the data is continuous-ish (i.e. doesn't have a LOT of
observations right at 0 next to the impossible negative values, or has enough distinct
values that the round numbers smooth out)
However, the problems of using OLS are a bit worse for binary data, and so they're the
most common case in which we do something special to account for it
Binary dependent variables are also really common! We're often interested in whether a
certain outcome happened or didn't (if we want to know if a drug was effective, we are
likely asking if you are cured or not!)
So, how can we deal with having a binary dependent variable, and why do they give OLS
such problems?
5 / 34
The Linear Probability Model
First off, let's ignore the completely unexplained warnings I've just given you and do it
with OLS anyway, and see what happens
Running OLS with a binary dependent variable is called the "linear probability model"
or LPM
D = β0 + β1 X + ε
6 / 34
The Linear Probability Model
In terms of how we do it, the interpretation is the exact same as regular OLS, so you can
bring in all your intuition
The only difference is that our interpretation of the dependent variable is now in
probability terms
If β^1 = .03, that means that a one-unit increase in X is associated with a three
percentage point increase in the probability that D = 1
(percentage points! Not percentage - an increase from .1 to .13, say, not .1 to .103)
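As a concrete illustration (my own stand-in example, using the built-in mtcars data where am - manual transmission - is already a 0/1 variable), an LPM in fixest is just feols with a binary outcome:
library(fixest)
# LPM: plain OLS with a binary dependent variable
lpm <- feols(am ~ mpg, data = mtcars)
summary(lpm)
# If the mpg coefficient were, say, .05, a one-mpg increase would be associated
# with a five percentage point higher probability that am = 1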
7 / 34
The Linear Probability Model
So what's the problem?
Terrible predictions
Incorrect slopes that don't acknowledge the boundaries of the data
8 / 34
Terrible Predictions
OLS fits a straight line. So if you increase or decrease X enough, eventually you'll
predict that the probability of D = 1 is bigger than 1, or lower than 0. Impossible!
We can address part of this by just not trying to predict outside the range of the data,
but if X has a lot of variation in it, we might get those impossible predictions even for
values in our data. And what do we do with that?
(Also, because errors tend to be small for certain ranges of X and large for others, we
have to use heteroskedasticity-robust standard errors)
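A quick sketch (again on the mtcars stand-in example) of how you might check for impossible predictions and get the robust standard errors:
library(fixest)
lpm <- feols(am ~ mpg, data = mtcars)
# Fitted values below 0 or above 1 are impossible as probabilities
range(predict(lpm))
# LPM errors are heteroskedastic by construction, so use robust SEs
summary(lpm, vcov = 'hetero')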
9 / 34
Terrible Predictions
10 / 34
Incorrect Slopes
Also, OLS requires that the slopes be constant
(Not necessarily if you use a polynomial or logarithm, but the following critique still
applies)
This is not what we want for binary data!
As the prediction gets really close to 0 or 1, the slope should flatten out to nothing
If we predict there's a .50 chance of D = 1 , a one-unit increase in X with β^1 = .03 takes us to .53, no problem - but if we're already at a .99 chance, that same increase would take us past 1, which is impossible, so the true slope must shrink near the edges
11 / 34
Incorrect Slopes
We can see how much the OLS slopes are overstating changes in D as X changes near
the edges by comparing an OLS fit to just regular ol' local means, with no shape
imposed at all
We're not forcing the red line to flatten out - it's doing that naturally as the mean can't
possibly go any lower! OLS barrels on through though
12 / 34
Linear Probability Model
So what can we make of the LPM?
13 / 34
Generalized Linear Models
So LPM has problems. What can we do instead?
Let's introduce the concept of the Generalized Linear Model
Y = β0 + β1 X + ε
14 / 34
Generalized Linear Models
E(D|X) = F (β0 + β1 X)
We can call the β0 + β1 X part, which is the same as in OLS, the index function. It's a
linear function of our variable X (plus whatever other controls we have in there), same
as before
But to get our prediction of what D will be conditional on what X is ( D|X ), we do
one additional step of running it through a function F() first. We call this function a
link function since it links the index function to the outcome
If F (z) = z, then we're basically back to OLS
But if F () is nonlinear, then we can account for all sorts of nonlinear dependent
variables!
So in other words, our prediction of D is still based on the linear index, but we run it
through some nonlinear function first to get our nonlinear output!
15 / 34
Generalized Linear Models
We can also think of this in terms of the latent variable interpretation
D∗ = β0 + β1 X
Where D∗ is an unseen "latent" variable that can take any value, just like a regular OLS
dependent variable (and roughly the same in concept as our index function) - we then
observe D = 1 whenever D∗ is above some cutoff
16 / 34
Probit and Logit
Let's go back to our index-and-function interpretation. What function should we use?
(many many different options depending on your dependent variable - Poisson for
count data, log link for nonnegative skewed values, multinomial logit for categorical
data...)
For binary dependent variables the two most common link functions are the probit and
logistic links. We often call a regression with a logistic link a "logit regression"
Probit(index) = Φ(index)
where Φ() is the standard normal cumulative distribution function (i.e. the probability that
a random standard normal value is less than or equal to index )
Logistic(index) = e^index / (1 + e^index)
For most purposes it doesn't matter whether you use probit or logit, but logit is getting
much more popular recently (due to its common use in data science - it's computationally
easier) so we'll focus on that, and just know that pretty much all of this is the same with
probit
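A small numerical sketch (mine, not part of the lecture data) of both links, using base R's pnorm() and plogis(), which are the standard normal and logistic CDFs:
index <- c(-5, -1, 0, 1, 5)            # some example index values
probit_byhand <- pnorm(index)          # Probit(index) = standard normal CDF
logit_byhand  <- exp(index) / (1 + exp(index))
all.equal(logit_byhand, plogis(index)) # TRUE: plogis() is the logistic link
# Both columns stay strictly between 0 and 1, however extreme the index gets
cbind(index, probit = probit_byhand, logit = logit_byhand)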
17 / 34
Logit
Notice that we can't possibly predict a value below 0 or above 1, no matter how wild X
and our index get
As index goes to −∞,
Logistic(index) → 0 / (1 + 0) = 0
and as index goes to +∞, Logistic(index) → 1
18 / 34
Logit
Also notice that, like the local means did, its slope flattens out near the edges
19 / 34
Probit and Logit in R
We can do probit and logit in R fairly easily
Instead of using feols we use feglm ("generalized linear model") (also available is just
base-R glm() )
And we must specify which kind of model we have ( family = binomial() for binary data)
And the actual link function, which goes inside that family argument
( binomial(link = 'logit') , or link = 'probit' for probit)
Note: One nice thing about feglm is that it lets you do fixed effects right. Don't do your
own fixed effects by de-meaning or adding a bunch of binary controls. It doesn't work
well.
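A minimal sketch of the call (the data and variables here, mtcars and am ~ mpg, are my stand-ins, not this lecture's example):
library(fixest)
mylogit  <- feglm(am ~ mpg, data = mtcars, family = binomial(link = 'logit'))
myprobit <- feglm(am ~ mpg, data = mtcars, family = binomial(link = 'probit'))
# Coefficients differ in scale between the two, but signs and significance line up
etable(mylogit, myprobit)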
20 / 34
Probit and Logit in R
From this we get... uh... hmm, what does this mean? Why are logit and probit so different if it
doesn't matter which you use?
21 / 34
Probit and Logit
The interpretation of the probit and logit coefficients is that they give the effect of a one-
unit change in X on the index, not on D directly
And since the scale of the index depends on the link function, the interpretation
depends on the link function too
From the coefficients themselves we can get direction (positive/negative) and
significance, but not really scale
Which isn't too intuitive. Generally, when trying to interpret probit or logit coefficients,
we instead transform them into statements about the effect of X on the probability
that D = 1 itself, similar to OLS
We'll get to how we do that in a moment!
22 / 34
Concept Checks
Why can't we just use OLS when the dependent variable is binary, even if we are only
interested in the slope?
What features would a link function need to have to model binary data?
Why does the slope on X need to depend on the value of X for this to work?
23 / 34
Interpreting Probit and Logit
We are often interested in getting a result in the form "the effect of a one-unit increase
in X on the probability that D = 1 is..."
But we can't get this with our logit coefficients as-is
So we will generally calculate marginal effects
The marginal effect is what we get if we, well... check what the logit/probit model
predicts happens to the average of D if X increases by 1
24 / 34
Types of Marginal Effects
This is complicated somewhat by the fact that there is no one marginal effect
The effect of X, as we've seen, varies depending on how far left or right we are on the
graph
25 / 34
Types of Marginal Effects
And this isn't so much based on the value of X as it is based on the value of the index
Meaning that the effect of X on P (D = 1) depends on every variable in the regression
E(D|X, Z) = Logistic(β0 + β1 X + β2 Z)
If we estimate β^0 = −2, β^1 = 2, β^2 = 1 , then the marginal effect of X for someone
with X = 3, Z = 2 is .005 , but the marginal effect of X for someone with
X = 1, Z = .5 is .470
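We can verify those two numbers by hand: for a logistic link, the slope with respect to X is Logistic(index) × (1 − Logistic(index)) × β1 (a standard derivative result; the sketch below is mine):
b0 <- -2; b1 <- 2; b2 <- 1
marg_effect_x <- function(x, z) {
  index <- b0 + b1 * x + b2 * z
  p <- plogis(index)    # Logistic(index), the predicted probability
  p * (1 - p) * b1      # derivative of that probability with respect to X
}
marg_effect_x(x = 3, z = 2)    # roughly .005
marg_effect_x(x = 1, z = 0.5)  # roughly .470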
26 / 34
Types of Marginal Effects
There are four common ways people present marginal effects:
Present the whole distribution! - Calculate each individual observation's marginal effect
The Marginal Effect of a Representative (MER): Pick a particular set of right-hand-side
variables you're particularly interested in for some reason and calculate the marginal
effect for them
The Average Marginal Effect (AME) - Calculate each individual observation's marginal
effect, then take the mean
The Marginal Effect at the Mean (MEM) - Calculate the average of each variable, then get
the marginal effect for some hypothetical observation with all those mean values
MEM is easier to calculate (and often easier to interpret), but the AME is generally
considered more appropriate - it takes into account how the variables correlate with each
other, and doesn't produce a marginal effect for some average person with 2.3 kids who
doesn't exist
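To make the AME/MEM distinction concrete, here's a by-hand sketch (my own toy example on mtcars; in practice you'd let the marginaleffects package do this, as on the next slides):
library(fixest)
m <- feglm(am ~ mpg, data = mtcars, family = binomial(link = 'logit'))
b <- coef(m)
# AME: each observation's own marginal effect of mpg, then the average of those
p_i <- plogis(b['(Intercept)'] + b['mpg'] * mtcars$mpg)
ame <- mean(p_i * (1 - p_i) * b['mpg'])
# MEM: the marginal effect for a hypothetical observation at the mean of mpg
p_bar <- plogis(b['(Intercept)'] + b['mpg'] * mean(mtcars$mpg))
mem <- p_bar * (1 - p_bar) * b['mpg']
c(AME = ame, MEM = mem)   # usually similar, but not identical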
27 / 34
Marginal Effects in R
There are a few standard ways in R to estimate marginal effects; none of them is perfect.
We can calculate the individual marginal effects using slopes() and the MER and AME easily
using avg_slopes() in the marginaleffects package.
library(fixest)
library(dplyr)
library(marginaleffects)
data(gss_cat, package = 'forcats')
# Eliminate unused levels and drop missing values
gss_cat <- gss_cat %>% mutate(race = factor(race)) %>% na.omit()
# Note the family argument: without it, feglm defaults to gaussian (an LPM)
marriedlogit <- feglm(I(marital == 'Married') ~ age*tvhours + race, data = gss_cat,
                      family = binomial(link = 'logit'))
# AME below
avg_slopes(marriedlogit)
##
## Term Contrast Estimate Std. Error z Pr(>|z|) S 2.5 %
## age dY/dX 0.00283 0.000272 10.385 <0.001 81.5 0.00229
## race Black - Other -0.18775 0.019475 -9.640 <0.001 70.6 -0.22592
## race White - Other 0.00673 0.016299 0.413 0.68 0.6 -0.02522
## tvhours dY/dX -0.01952 0.001831 -10.656 <0.001 85.7 -0.02311
## 97.5 %
## 0.00336
## -0.14957
## 0.03867
28 / 34
Marginal Effects in R
You can get MERs or MEMs using the datagrid() function. datagrid() by itself will give you
the MEMs. Or datagrid(variable = value, grid_type = 'counterfactual') will set just the
values you want while retaining the original data.
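The exact call that produced the output below isn't shown here; a plausible sketch (argument names per the marginaleffects documentation, so double-check against your installed version) would be:
library(marginaleffects)
# MEM: slopes evaluated at the means/modes of every right-hand-side variable
slopes(marriedlogit, newdata = datagrid())
# MER-flavored version: everyone keeps their own data, but with tvhours set to 4
avg_slopes(marriedlogit, newdata = datagrid(tvhours = 4, grid_type = 'counterfactual'))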
##
## Term Contrast Estimate Std. Error z Pr(>|z|) S 2.5 %
## age dY/dX 0.00325 0.000298 10.894 <0.001 89.4 0.00266
## race Black - Other -0.18775 0.019475 -9.640 <0.001 70.6 -0.22592
## race White - Other 0.00673 0.016299 0.413 0.68 0.6 -0.02522
## tvhours dY/dX -0.02071 0.001834 -11.291 <0.001 95.8 -0.02430
## 97.5 %
## 0.00383
## -0.14957
## 0.03867
## -0.01711
##
## Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.lo
## Type: response
29 / 34
Marginal Effects in R
Upsides of this approach:
Downsides:
Doesn't work with etable() . You'll need another regression-table function like
export_summs() in jtools
Because of the last point in "upsides", doesn't allow you to evaluate interaction effects
(although this is also a plus because you probably don't actually want to do that - we'll
get to this in the assigned paper for next week)
Package author willing to change syntax in future
30 / 34
Concept Checks
Why does each individual have their own marginal effect?
What's one reason we might not want to calculate a MEM?
What should we keep in mind when interpreting an AME that we calculate? Is this the
effect of a one-unit change in X on P (D = 1)?
31 / 34
Hypothesis Testing (briefly!)
How can we calculate hypothesis tests for our logit and probit models?
For single coefficients, we can just use the standard z-statistics that are reported in the
regression output
For multiple coefficients, we need to compare the full model to a restricted model!
We can just use wald() as normal
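For example, a joint test that all of the race coefficients are zero in the earlier marriedlogit model might look like this (a sketch; wald() is from fixest):
library(fixest)
# Wald test of joint significance for every coefficient whose name matches 'race'
wald(marriedlogit, keep = 'race')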
32 / 34
Watch Out!
Before we go, some things to watch out for when using probit or logit:
Doing fixed effects with probit or logit is a lot trickier. Neither de-meaning nor adding
dummies works. You gotta use specialized functions (like feglm() in fixest), and even
then interpretations get trickier, and you're more likely to have the data fail you and get
weird results
Both logit and probit are estimated using maximum likelihood, which doesn't perform
as well as OLS in small samples. So LPM might be worth it if your data set is tiny (say,
below 500)
Interaction terms in probit and logit models are much trickier to interpret, and the
marginal effects for them should be looked at with suspicion - you'll instead want to
work with predict() ed values and see how the interaction plays out there. See Ai and
Norton (2003).
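One way to do that (a sketch using predictions() from marginaleffects on the earlier marriedlogit model; plain predict() on a hand-built data frame works too):
library(marginaleffects)
# Compare predicted probabilities of being married across the age x tvhours interaction,
# holding the other variables at their means/modes
predictions(marriedlogit, newdata = datagrid(age = c(25, 50), tvhours = c(0, 4)))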
33 / 34
Let's go!
Do the Swirl
Do the homeworks
Check out the assigned paper
34 / 34