Assignment9Sol - Copy
Sahar Parsa
Fall 2024
The ninth assignment is due on Friday, November 22nd, 2024. It covers the material related to logit and
probit models as well as panel data methods. For the Data questions, report the output of your analysis in a
“report style” that is pleasing to read, and add the code you used to generate your results.
Question 1
Four hundred driver’s license applicants were randomly selected and asked whether they passed their driving
test ($Pass_i = 1$) or failed their test ($Pass_i = 0$); data were also collected on their gender ($Male_i = 1$ if
male and $= 0$ if female) and their years of driving experience ($Experience_i$, in years). The following table
summarizes several estimated models.
Dependent variable: Pass
               (1)         (2)         (3)
               Probit      Logit       Linear Probability
Experience     0.031       0.040       0.006
               (0.009)     (0.016)     (0.002)
Constant       0.712       1.059       0.774
               (0.126)     (0.221)     (0.034)
Standard errors in parentheses.
a. Using the results in column (1), does the probability of passing the test depend on Experience?
Assume that Matthew has 10 years of driving experience. What is the probability that he will pass the
test? Christopher is a new driver (zero years of experience). What is the probability that he will pass
the test? The sample included values of Experience between 0 and 40 years, and only four people in the
sample had more than 30 years of driving experience. Jed is 95 years old and has been driving since he
was 15. What is the model’s prediction for the probability that Jed will pass the test? Do you think
that this prediction is reliable? Why or why not?
Solution :
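At the 5% significance level, the probability of passing the test depends on Experience. In the probit model, the
coefficient on Experience is 0.031 with a standard error of 0.009. The corresponding t-statistic is:
$t_1 = 0.031 / 0.009 = 3.44$
The t-statistic is greater than the critical value of 1.96 at the 5% significance level. Therefore, we reject the
null hypothesis that the probability of passing the test does not depend on Experience.
For the probit model:
$\Pr(Pass_i = 1) = \Phi(0.712 + 0.031 \times Experience_i)$
Matthew has 10 years of driving experience, so his probability of passing is $\Phi(0.712 + 0.031 \times 10) = \Phi(1.022) = 0.8466$.
Similarly, Christopher’s probability of passing is $\Phi(0.712) = 0.76177$.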
For Jed, the predicted probability is $\Phi(0.712 + 0.031 \times 80) = \Phi(3.192) \approx 0.9993$. This prediction may not
be reliable: our sample only contains individuals with between 0 and 40 years of experience (and only four
with more than 30), so the model might not apply to Jed, who has 80 years of driving experience. In
particular, we could worry that the probability of passing the test conditional on experience decreases after
a certain number of years, as drivers become very old. This could be picked up by a z-index that is
quadratic in experience. But again, we might not have the right sample to pick up the effect for the
population at large, and our model might only apply to the population specific to our study. More generally,
nothing rules out using a model to extrapolate out of sample, but one has to understand the context well
enough to judge whether the extrapolation is credible.
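The probit calculations in part (a) can be checked with a few lines of Python. This is only a sketch: the function names are illustrative, and the standard normal CDF is computed from the error function in the standard library rather than from a statistics package.

```python
from math import erf, sqrt

# Probit estimates from column (1) of the table.
CONSTANT, SLOPE, SLOPE_SE = 0.712, 0.031, 0.009


def std_normal_cdf(z: float) -> float:
    """Standard normal CDF, Phi(z), written in terms of the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def probit_pass_probability(experience: float) -> float:
    """Predicted probability of passing for a given number of years of experience."""
    return std_normal_cdf(CONSTANT + SLOPE * experience)


# t-statistic for the null hypothesis that the Experience coefficient is zero.
t_stat = SLOPE / SLOPE_SE
print(f"t-statistic on Experience: {t_stat:.2f}")

for name, years in [("Matthew", 10), ("Christopher", 0), ("Jed", 80)]:
    print(f"{name:12s} {years:3d} years -> {probit_pass_probability(years):.4f}")
```

Jed's 80 years lie far outside the 0 to 40 range of the sample, so his predicted probability is an extrapolation rather than something the data directly support.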
b. Answer (a) using the results in column (2). Sketch the predicted probabilities from the probit and logit
in columns (1) and (2) for values of Experience between 0 and 60. Are the probit and logit models
similar?
Solution :
At the 5% significance level, the probability of passing the test depends on Experience. In the logit model, the
coefficient on Experience is 0.040 with a standard error of 0.016. The corresponding t-statistic is:
$t_2 = 0.040 / 0.016 = 2.5$
The t-statistic is greater than the critical value of 1.96 at the 5% significance level. Therefore, we reject the
null hypothesis that the probability of passing the test does not depend on Experience.
For the logit model:
$\Pr(Pass_i = 1) = \left(1 + \exp(-(1.059 + 0.040 \times Experience_i))\right)^{-1}$
Matthew has 10 years of driving experience, so his probability of passing is
$\left(1 + \exp(-(1.059 + 0.040 \times 10))\right)^{-1} = 0.81138$.
Similarly, Christopher’s probability of passing is $\left(1 + \exp(-1.059)\right)^{-1} = 0.74250$.
For Jed, the predicted probability is $\left(1 + \exp(-(1.059 + 0.040 \times 80))\right)^{-1} = 0.98606$. The same comment about
the large number of years of experience for Jed applies here as well.
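The logit predictions can be reproduced in the same way. The sketch below uses only the column (2) estimates from the table; the function name is illustrative.

```python
from math import exp

# Logit estimates from column (2) of the table.
CONSTANT, SLOPE, SLOPE_SE = 1.059, 0.040, 0.016


def logit_pass_probability(experience: float) -> float:
    """Logistic CDF evaluated at the z-index: 1 / (1 + exp(-z))."""
    z = CONSTANT + SLOPE * experience
    return 1.0 / (1.0 + exp(-z))


# t-statistic for the null hypothesis that the Experience coefficient is zero.
t_stat = SLOPE / SLOPE_SE
print(f"t-statistic on Experience: {t_stat:.2f}")

for name, years in [("Matthew", 10), ("Christopher", 0), ("Jed", 80)]:
    print(f"{name:12s} {years:3d} years -> {logit_pass_probability(years):.5f}")
```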
[Figure: Logit vs. Probit. Predicted probability (y) against Experience (0 to 60), with one curve each for the Logit and Probit models.]
Over the experience range considered here (0 to 60 years), the probit model always predicts a higher
probability than the logit model, but the predictions are very similar. The two curves also have a similar
shape, and both show a diminishing effect of experience on the probability of passing. An alternative, familiar
plot would show the probit and logit probabilities against the z-index.
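The comparison between the two models can also be tabulated rather than read off a plot. The sketch below evaluates both fitted models on integer values of Experience from 0 to 60 and records the gap between them; the grid and the variable names are choices made for this illustration.

```python
from math import erf, exp, sqrt


def probit_probability(experience: float) -> float:
    """Column (1): Phi(0.712 + 0.031 * Experience)."""
    z = 0.712 + 0.031 * experience
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def logit_probability(experience: float) -> float:
    """Column (2): 1 / (1 + exp(-(1.059 + 0.040 * Experience)))."""
    z = 1.059 + 0.040 * experience
    return 1.0 / (1.0 + exp(-z))


# Gap (probit minus logit) at every integer year in the plotted range.
gaps = {x: probit_probability(x) - logit_probability(x) for x in range(0, 61)}
largest_gap = max(gaps.values())

print("Experience   Probit    Logit     Gap")
for x in range(0, 61, 10):
    print(f"{x:10d}   {probit_probability(x):.4f}   {logit_probability(x):.4f}   {gaps[x]:.4f}")
print(f"Largest gap on the grid: {largest_gap:.4f}")
```

On this grid the gap is positive at every point and stays within a few percentage points, which is the sense in which the two sets of predictions are similar.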
c. Answer (a) using the results in column (3). Sketch the predicted probabilities from the probit and
linear probability in columns (1) and (3) as a function of Experience for values of Experience between 0
and 60. Do you think that the linear probability is appropriate here? Why or why not?
Solution :
At the 5% significance level, the probability of passing the test depends on Experience. In the linear probability
model, the coefficient on Experience is 0.006 with a standard error of 0.002. The corresponding
t-statistic is:
$t_3 = 0.006 / 0.002 = 3$
The t-statistic is greater than the critical value of 1.96 at the 5% significance level. Therefore, we reject the
null hypothesis that the probability of passing the test does not depend on Experience.
For the linear probability model:
$\Pr(Pass_i = 1) = 0.774 + 0.006 \times Experience_i$
Matthew has 10 years of driving experience, so his probability of passing is $0.774 + 0.006 \times 10 = 0.834$.
Similarly, Christopher’s probability of passing is 0.774.
For Jed, the predicted probability is $0.774 + 0.006 \times 80 = 1.254$.
In this case, the linear probability model is inappropriate for predicting probabilities, as the predicted value
can be larger than 1, whereas probabilities are bounded between 0 and 1. For example, for a person with
50 years of experience, the predicted probability of passing is $0.774 + 0.006 \times 50 = 1.074$. The linear
probability plot also shows that the predicted probability exceeds 1 beyond a certain experience level.
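The experience level at which the linear probability model first predicts a value above 1 follows from solving 0.774 + 0.006 × Experience = 1. The short sketch below does this and evaluates the fitted line at a few points; the names are illustrative.

```python
# Linear probability estimates from column (3) of the table.
CONSTANT, SLOPE = 0.774, 0.006


def lpm_prediction(experience: float) -> float:
    """Fitted value of the linear probability model (not bounded by 0 and 1)."""
    return CONSTANT + SLOPE * experience


# Experience level at which the fitted line crosses 1.
crossing_point = (1.0 - CONSTANT) / SLOPE
print(f"Fitted value exceeds 1 beyond {crossing_point:.2f} years of experience")

for years in (0, 10, 40, 50, 80):
    fitted = lpm_prediction(years)
    flag = "  <- not a valid probability" if fitted > 1.0 else ""
    print(f"{years:3d} years -> {fitted:.3f}{flag}")
```

The crossing point falls inside the 0 to 40 year range observed in the sample, so the problem is not limited to extreme extrapolations such as Jed's.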
[Figure: Probit vs. Linear. Predicted probability (y) against Experience (0 to 60), with one curve each for the Linear Prob and Probit models.]
Question 2
Suppose that, for one semester, you can collect the following data on a random sample of college juniors
and seniors for each class taken: a standardized final exam score, percentage of lectures attended, a dummy
variable indicating whether the class is within the student’s major, cumulative grade point average prior to
the start of the semester, and SAT score.
a. Is this dataset clustered data? Why would you classify this data set as a cluster sample? Roughly, how
many observations would you expect for the typical student?
Solution :
This dataset is clustered because the scores within a class are likely to be correlated. In this question,
the final exam scores for each class form a cluster, as different professors likely have different grading criteria
and students might help each other study the material. The observations for a given student also form a
natural cluster, since the same unobserved student characteristics affect that student’s scores in every class.
A typical undergraduate student takes about 4 courses per semester, so I would expect about 4 observations
per student.
b. Write a model that explains the final exam score in terms of the percentage of lectures attended and the
other characteristics. Use s to subscript student and c to subscript class. Which variables do not change
within a student?
Solution :
$$FinalExamScore_{sc} = \beta_0 + \beta_1 Attend_{sc} + \beta_2 Major_{sc} + \beta_3 GPA_s + \beta_4 SATscore_s + A_s + \varepsilon_{sc}$$
where,
$FinalExamScore_{sc}$ = final exam score of student s in class c
$Attend_{sc}$ = percentage of lectures attended by student s in class c
$Major_{sc}$ = a dummy variable indicating whether class c is in student s’s major
$GPA_s$ = cumulative grade point average of student s prior to the start of the semester
$SATscore_s$ = SAT score of student s
$A_s$ = unobserved characteristics of student s (such as ability)
Among these variables, $GPA_s$ and $SATscore_s$ do not change within a student, as they are determined
before the semester starts.
c. If you pool all of the data and use OLS, what are you assuming about the unobserved student
characteristics that affect performance and attendance rate? What roles do SAT score and prior GPA
play in this regard?
Solution :
If we use the OLS estimator on the pooled data, we need the unobserved student characteristics to be
uncorrelated with $Attend_{sc}$ to get unbiased estimators. However, we might worry that ability is correlated
with attendance and is an important omitted variable in this case. SAT scores and prior GPA are fixed
student characteristics in our setting and help alleviate the omitted variable problem, but they do not
deal with it completely, as other variables might matter as well. It is unlikely that SAT scores and GPA
adequately capture a student’s ability.
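The direction of the resulting bias can be seen in a simplified version of the model. As an assumption made only for this illustration, drop the other regressors and suppose the score depends on attendance, an unobserved student effect $A_s$, and an error $\varepsilon_{sc}$ that is uncorrelated with attendance. The pooled OLS slope then converges to the true coefficient plus a term that depends on how attendance covaries with the unobserved effect:

```latex
% Simplified model (illustration only; other regressors omitted):
FinalExamScore_{sc} = \beta_0 + \beta_1 Attend_{sc} + A_s + \varepsilon_{sc}

% Probability limit of the pooled OLS slope when A_s is left in the error term:
\hat{\beta}_1 \xrightarrow{\;p\;} \beta_1
  + \frac{\operatorname{Cov}(Attend_{sc},\, A_s)}{\operatorname{Var}(Attend_{sc})}
```

If more able students also attend more lectures, the covariance is positive and pooled OLS overstates the effect of attendance. Controlling for SAT score and prior GPA shrinks this term only to the extent that they proxy for $A_s$.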
d. If you think SAT score and prior GPA do not adequately capture student ability, how would you
estimate the effect of attendance on final exam performance?
Solution :
We would use the fixed effects model to estimate the effect of attendance on final exam scores. If $GPA_s$
and $SATscore_s$ are unable to capture student ability, then the unobserved student effect $A_s$ is prone to be
correlated with $Attend_{sc}$. As a result, the pooled OLS estimators are biased and inconsistent. The lecture
presents three methods to estimate the fixed effects model. Here we use entity-demeaned OLS. Starting
from the fixed effects regression $FinalExamScore_{sc} = \beta_s + \beta_1 Attend_{sc} + \beta_2 Major_{sc} + \varepsilon_{sc}$,
where $\beta_s$ is the student fixed effect, we average over each student’s classes:
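$$\frac{1}{C}\sum_{c=1}^{C} FinalExamScore_{sc} = \beta_s + \beta_1 \frac{1}{C}\sum_{c=1}^{C} Attend_{sc} + \beta_2 \frac{1}{C}\sum_{c=1}^{C} Major_{sc} + \frac{1}{C}\sum_{c=1}^{C} \varepsilon_{sc}$$
where C is the number of classes a typical student takes, which can be rewritten as:
$$\overline{FinalExamScore}_s = \beta_s + \beta_1 \overline{Attend}_s + \beta_2 \overline{Major}_s + \bar{\varepsilon}_s$$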
Subtracting the entity-averaged regression from the fixed effects regression, we obtain the deviations from the
entity averages and eliminate the students’ fixed effects:
$$FinalExamScore_{sc} - \overline{FinalExamScore}_s = \beta_1 [Attend_{sc} - \overline{Attend}_s] + \beta_2 [Major_{sc} - \overline{Major}_s] + \varepsilon_{sc} - \bar{\varepsilon}_s$$
$$\widetilde{FinalExamScore}_{sc} = \beta_1 \widetilde{Attend}_{sc} + \beta_2 \widetilde{Major}_{sc} + \tilde{\varepsilon}_{sc}$$
The final step is to estimate $\beta_1$ by regressing $\widetilde{FinalExamScore}_{sc}$ on $\widetilde{Attend}_{sc}$ and $\widetilde{Major}_{sc}$ using
OLS.
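The entity-demeaning steps above can be sketched in plain Python. Everything in this example is hypothetical: the three students, their attendance figures, the student effects, and the "true" coefficients (0.5 and 2.0) are made up so that the student effect is positively related to attendance, and the scores are generated without noise so the result is easy to verify. Pooled OLS with a constant is obtained here by demeaning with the overall mean, which gives the same slopes.

```python
from collections import defaultdict

# Hypothetical coefficients and student effects (assumptions for illustration).
TRUE_BETA_ATTEND, TRUE_BETA_MAJOR = 0.5, 2.0
STUDENT_EFFECT = {"s1": 30.0, "s2": 10.0, "s3": 20.0}

# Hypothetical observations: (student, attendance in %, major dummy).
ROWS = [
    ("s1", 90, 1), ("s1", 80, 0), ("s1", 95, 1), ("s1", 85, 0),
    ("s2", 60, 0), ("s2", 70, 1), ("s2", 50, 0), ("s2", 65, 1),
    ("s3", 75, 1), ("s3", 70, 0), ("s3", 85, 0), ("s3", 80, 1),
]
students = [s for s, _, _ in ROWS]
attend = [float(a) for _, a, _ in ROWS]
major = [float(m) for _, _, m in ROWS]
# Noise-free scores: student effect + 0.5 * attend + 2.0 * major.
score = [STUDENT_EFFECT[s] + TRUE_BETA_ATTEND * a + TRUE_BETA_MAJOR * m
         for s, a, m in ROWS]


def demean(values, groups):
    """Subtract each group's mean from the values belonging to that group."""
    totals, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        totals[g] += v
        counts[g] += 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]


def ols_no_constant(x1, x2, y):
    """OLS of y on x1 and x2 without an intercept (2x2 normal equations)."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det


# Within (fixed effects) estimator: demean by student, then OLS without constant.
fe_attend, fe_major = ols_no_constant(
    demean(attend, students), demean(major, students), demean(score, students))

# Pooled OLS with a constant: demeaning by the overall mean gives the same slopes.
everyone = ["all"] * len(ROWS)
pooled_attend, pooled_major = ols_no_constant(
    demean(attend, everyone), demean(major, everyone), demean(score, everyone))

print(f"Fixed effects: attend = {fe_attend:.3f}, major = {fe_major:.3f}")
print(f"Pooled OLS:    attend = {pooled_attend:.3f}, major = {pooled_major:.3f}")
```

Because the made-up student effect rises with attendance, the pooled slope on attendance is pushed above the value used to generate the data, while the demeaned regression recovers it.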
Question 3
From Stock and Watson Chapter 11: Consider a model for new capital investment in a particular industry
(say, manufacturing), where the cross section observations are at the county level and there are T years of
data for each county:
it might be too demanding. Alternatively, we could estimate the model on the entity-demeaned variables.
Then we could run the OLS on T − 1 without constant model. This will give us unbiased estimators as
long as the effects are fixed effects and there are no other omitted variables, changing within states and
over time, that affect capital investment.