GMU Econ535-Applied Econometrics Problem Set3 (PS3) Solutions Spring 2024
7pm, March 25
Part 1: Building a Small Data Set and Revisiting Omitted Variable Bias
Start by identifying two variables where you would hypothesize that one is
causally related to the other: one outcome variable, Y, and one independent
variable, X1.
Then identify a second independent variable, X2. X2 should also be such that
your hypothesis is that it is related to the outcome variable Y.
Once you have put together your data set, start by getting to know your data set by
reporting summary statistics for your (at least three) variables: min, max, mean,
standard deviation, etc.
Thereafter, run three regressions (using robust standard errors): one where you
regress Y on X1, one where you regress Y on X2, and one where you regress Y on
both X1 and X2. Report the results
either using the Stata output, or in a table similar to what you would see in an
academic paper. Interpret all coefficients and discuss their statistical significance.
Observe how the coefficients on X1 and X2 change (or not) between the bivariate and
the multiple regressions, and discuss the change in terms of the existence (or not)
of omitted variable bias.
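The comparison between the bivariate and the multiple regression can be illustrated with a minimal Python sketch. The numbers below are made up for illustration and are not the problem-set data; the "true" coefficients (2 and 3) and the variable names are assumptions. With no error term, the slope from regressing y on x1 alone equals the true coefficient on x1 plus the coefficient on the omitted x2 times the slope from regressing x2 on x1.

```python
# Illustration only: hypothetical numbers, not the problem-set data.

def mean(values):
    return sum(values) / len(values)

def slope(x, y):
    """OLS slope from a regression of y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    return cov_xy / var_x

# Hypothetical regressors: x2 tends to rise with x1, so the two are correlated.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]

# Hypothetical "true" model without an error term: y = 1 + 2*x1 + 3*x2.
beta1, beta2 = 2.0, 3.0
y = [1 + beta1 * a + beta2 * b for a, b in zip(x1, x2)]

short_slope = slope(x1, y)   # regress y on x1 only (x2 omitted)
delta = slope(x1, x2)        # regress the omitted x2 on x1
bias = beta2 * delta         # omitted variable bias in the short regression

print(f"slope on x1 with x2 omitted: {short_slope:.4f}")
print(f"beta1 + beta2 * delta:       {beta1 + bias:.4f}")
```

The two printed numbers coincide. If x2 were uncorrelated with x1 (delta = 0) or unrelated to y (beta2 = 0), the bias term would vanish and the coefficient on x1 would not change between the two regressions.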
In this assignment, you will explore the determinants of income using data from
the National Longitudinal Survey of Youth. Download the file PS3_Data.dta, which
contains a sample of about 1000 respondents. The survey began with roughly
12,000 American teenagers about 20 years ago and has followed them since.
Respondents are classified as white, black or Hispanic.
46.2% of the sample is female. The average age is 20.26 and the minimum,
maximum and average incomes in the sample are $258.96, $2618.18 and $887.74
respectively. 8.5% of the sample is black, and 5.7% of the sample is Hispanic.
gen minority = (black==1 | hispanic==1)
. sum minority
R-squared = 0.1329
The coefficients on male and yrs_educ are significant at the 1% level (at least). The
coefficient on minority is not statistically significant.
R-squared = 0.1397
Root MSE = 319.89
Now the coefficients on male, married, and years of education are significant at the
1% level (at least). The coefficient on “minority” is not significant.
The estimated return to a year of schooling is 46.33 for both men and women, because
this specification constrains the return to be the same for the two groups. (We cannot
know from this regression whether the return to a year of schooling differs between men and
women. To know this, we would have to create a new variable (gen male_educ = male *
yrs_educ) and add it to the list of explanatory variables, but you are not asked to
do that here.)
4. (3 points) Now generate a new variable that is equal to the interaction of the
dummies for “male” and “married” (call it male_married). Run the same regression
as in 3, but include your new variable among the independent variables. Interpret
the coefficient on married. Interpret the coefficient on male. Interpret the
coefficient on the interaction of male and married. Which of these three
coefficients are statistically significant, and at what level?
gen male_married=male*married
R-squared = 0.1459
The coefficient on married decreases from about 65.3 to 2.3, and is no longer statistically
significant. This coefficient (2.3) can be interpreted as the predicted
difference in income between married women and single women in our sample.
The coefficient on male is 205.2 and can be interpreted as the predicted difference
in income between single men and single women. It is statistically significant at
the 1% level (at least). The coefficient on male_married is 123, and is statistically
significant at the 1% level. This coefficient (123) tells us how much larger the
predicted change in income going from single to married is for men than for women:
the predicted change is 2.3 for women and 2.3 + 123 = 125.3 for men.
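The arithmetic behind these interpretations can be sketched in a few lines of Python. The coefficients are the rounded values quoted in the text; the function name `predicted_gap` is a hypothetical helper, and the other regressors are assumed to be held fixed.

```python
# Rounded coefficients as quoted in the solution text.
b_married = 2.3          # married (applies to women, male = 0)
b_male = 205.2           # male (applies to singles, married = 0)
b_male_married = 123.0   # interaction of male and married

def predicted_gap(male, married):
    """Predicted income relative to a single woman, other regressors held fixed."""
    return b_male * male + b_married * married + b_male_married * male * married

premium_women = predicted_gap(0, 1) - predicted_gap(0, 0)
premium_men = predicted_gap(1, 1) - predicted_gap(1, 0)

print(f"married vs. single women: {premium_women:.1f}")
print(f"married vs. single men:   {premium_men:.1f}")
print(f"married men vs. single women: {predicted_gap(1, 1):.1f}")
```

The difference between the two marriage premiums is exactly the interaction coefficient, which is why a significant coefficient on male_married indicates that the association between marriage and income differs by gender.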
5. (2 points) A policy maker comes across your results and notes that married
young people earn more than unmarried young people. She therefore suggests that
it would be good policy to promote marriage among young members of the labor
market. Is this a correct conclusion to draw? Why or why not?
This is probably not a causal relationship. It could, for example, be the case that
young people wait to get married until they have enough income to support a
partner or a family. If this is the case, then the causality would run in the opposite
direction – i.e. higher income would cause marriage and not the other way around.
There are several examples like this that would make for good answers. In general,
because this is simply cross-sectional data, we can’t prove causality, so it would be
irresponsible to encourage policies that don’t have a demonstrated causal effect on
our outcome of interest (here, income).
R-squared = 0.0213
The estimated “return” to a year of schooling for whites would be 31.04, while the
“return” for minorities would be 29.07 (= 31.04 – 1.97). There is no evidence in
this regression, however, to reject the null that the coefficient on min_educ is
equal to zero (p=0.879). Thus, we don’t have any evidence to suggest that
there is a different education premium for whites vs. minorities in this sample.
1. (3 points) Continue using the same data. Make a regression table like those
usually presented in academic papers. The dependent variable is always ln(income)
(this requires you to generate a new variable) and all of these regressions should
use robust standard errors. The independent variables of interest are male, black
and years of education; add these sequentially to each of the 3 models (i.e. your
first model should just be a regression of ln(income) on male, the second should
include male and black as the independent variables and the third should include
all 3 independent variables). Standard errors should be placed below coefficients in
parentheses, along with stars for statistical significance (* p<.1 ** p<.05 ***
p<.01). The R-squared and sample size for each regression should be reported at the
bottom of each column.
gen lincome=ln(income)
. eststo clear
.
. eststo: reg lincome male, robust
------------------------------------------------------------------------------
| Robust
lincome | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
male | .2169101 .0209135 10.37 0.000 .1758704 .2579498
_cons | 6.608175 .0137173 481.74 0.000 6.581257 6.635093
------------------------------------------------------------------------------
(est1 stored)
.
. eststo: reg lincome male black, robust
------------------------------------------------------------------------------
| Robust
lincome | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
male | .2176794 .0209102 10.41 0.000 .1766461 .2587128
black | -.0799085 .0362088 -2.21 0.028 -.1509632 -.0088538
_cons | 6.614588 .0139254 475.00 0.000 6.587262 6.641915
------------------------------------------------------------------------------
(est2 stored)
.
. eststo: reg lincome male black yrs_educ, robust
------------------------------------------------------------------------------
| Robust
lincome | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
male | .2428386 .0210778 11.52 0.000 .2014764 .2842009
black | -.0677965 .0348131 -1.95 0.052 -.1361124 .0005195
yrs_educ | .0492704 .0071684 6.87 0.000 .0352034 .0633373
_cons | 6.010936 .0889598 67.57 0.000 5.836365 6.185507
------------------------------------------------------------------------------
(est3 stored)
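------------------------------------------------------------
                      (1)             (2)             (3)
                  lincome         lincome         lincome
------------------------------------------------------------
male                0.217***        0.218***        0.243***
                 (0.0209)        (0.0209)        (0.0211)
black                             -0.0799**       -0.0678*
                                 (0.0362)        (0.0348)
yrs_educ                                           0.0493***
                                                (0.00717)
_cons               6.608***        6.615***        6.011***
                 (0.0137)        (0.0139)        (0.0890)
------------------------------------------------------------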
3. (3 points) Is the return to education different for blacks and non-blacks? If so, by
how much? Explain in detail how you would answer this question, then answer it
by running a fourth regression.
We want to assess whether the relationship between one independent variable (education) and
the outcome variable (lincome) depends on the level of another independent
variable (black). This is an ideal case for an interaction term.
We first make an interaction variable that is equal to black*yrs_educ. If the
coefficient on this variable is statistically different from zero, then we could argue
that the returns to education are different for blacks and non-blacks.
gen black_ed = black*yrs_educ
------------------------------------------------------------------------------
| Robust
lincome | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
male | .2430082 .0210995 11.52 0.000 .2016033 .2844132
black | -.3642078 .2752017 -1.32 0.186 -.9042534 .1758378
yrs_educ | .0475133 .0074795 6.35 0.000 .0328357 .0621909
black_ed | .0252565 .0232488 1.09 0.278 -.0203661 .0708791
_cons | 6.031893 .092684 65.08 0.000 5.850014 6.213773
Since the coefficient on black_ed is not statistically significant, we fail to reject the
implicit null that this coefficient is equal to zero. Thus, we cannot say with any
certainty that the returns to education for blacks and non-blacks are different.
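As a rough check on this reading, the implied point estimates can be computed directly from the regression output above. This is a minimal Python sketch: the coefficient and standard error are copied from the table, while the helper name `pct` and the use of 1.96 as an approximate 5% critical value are assumptions made for illustration.

```python
import math

# Values copied from the regression output above.
b_educ = 0.0475133        # coefficient on yrs_educ
b_black_ed = 0.0252565    # coefficient on black_ed (black * yrs_educ)
se_black_ed = 0.0232488   # robust standard error on black_ed

# Implied returns to a year of education, in log points.
return_nonblack = b_educ
return_black = b_educ + b_black_ed

def pct(log_points):
    """Convert a log-point change into an exact percentage change."""
    return 100 * (math.exp(log_points) - 1)

# t statistic for the null that the two returns are equal.
t_stat = b_black_ed / se_black_ed

print(f"non-blacks: {return_nonblack:.4f} log points ({pct(return_nonblack):.2f}%)")
print(f"blacks:     {return_black:.4f} log points ({pct(return_black):.2f}%)")
print(f"t statistic on black_ed: {t_stat:.2f}")
```

The point estimate of the return is larger for blacks, but the t statistic is well below conventional critical values, so the difference is not statistically distinguishable from zero in this sample.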
1. (1 point) Continue using the same data. Regress income on education and
education squared (this requires you to generate a new variable). Report your results.
gen ed2 = yrs_educ^2
------------------------------------------------------------------------------
| Robust
income | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
yrs_educ | 43.85142 45.62291 0.96 0.337 -45.67708 133.3799
ed2 | -.5104324 2.016529 -0.25 0.800 -4.467584 3.446719
_cons | 437.5441 258.0703 1.70 0.090 -68.88224 943.9704
------------------------------------------------------------------------------
2. (1 point) Can you interpret the coefficient on yrs_educ? Why or why not? If you
can interpret it, do so.
You cannot directly interpret the magnitude of the coefficient on yrs_educ, since we
can’t change yrs_educ while holding ed2 constant. In order to see the relationship
between education and income, we must look at a particular level of education. We could
create a table using sample education values, the regression constant and the
coefficients on yrs_educ and ed2 to calculate predicted incomes, and then calculate
the changes in income associated with an additional year of education at different
levels of education.
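One way to build such a table is sketched below in Python, using the constant and coefficients from the regression output in question 1. The education values (8 to 16 years, in steps of 2) are an assumption chosen for illustration; the actual range in the sample is not shown here.

```python
# Coefficients copied from the regression of income on yrs_educ and ed2.
b_cons = 437.5441
b_educ = 43.85142
b_ed2 = -0.5104324

def predicted_income(years):
    """Predicted income at a given number of years of education."""
    return b_cons + b_educ * years + b_ed2 * years ** 2

# Illustrative education levels (an assumption, not taken from the data).
print("years  predicted income  change from one more year")
for years in range(8, 17, 2):
    level = predicted_income(years)
    change = predicted_income(years + 1) - level
    print(f"{years:5d}  {level:16.2f}  {change:25.2f}")
```

Each row shows the predicted income at that education level and the predicted change from one additional year, which shrinks as education rises because the coefficient on ed2 is negative.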
3. (1 point) What's the predicted difference in income between people with 11 and
12 years of educ?
4. (1 point) What's the predicted difference in income between people with 15 and
16 years of educ?
5. (1 point) Are the answers to 3 and 4 the same? Why or why not? Discuss.
They are different because the change in income associated with a change in
education now depends on the level of education. Since the coefficient on yrs_educ
is positive and the coefficient on the squared term is negative, predicted income is
increasing in education at a decreasing rate. Therefore, in this sample, going from 15 to 16
years of education is associated with a smaller increase in monthly income
($28.04) than going from 11 to 12 years of education ($32.12).
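The two dollar figures can be checked with a short Python sketch. Because (e + 1)^2 - e^2 = 2e + 1 and the constant cancels, the predicted change from e to e + 1 years is the coefficient on yrs_educ plus the coefficient on ed2 times (2e + 1). The function name `one_year_change` is a hypothetical helper. The figures quoted in the text appear to correspond to coefficients rounded to two decimals (43.85 and -0.51); the full-precision coefficients from the regression output give values about one cent lower.

```python
def one_year_change(years, b_educ, b_ed2):
    """Change in predicted income from `years` to `years + 1` of education.

    Uses (years + 1)**2 - years**2 == 2*years + 1; the constant cancels.
    """
    return b_educ + b_ed2 * (2 * years + 1)

full_precision = (43.85142, -0.5104324)   # from the regression output
rounded = (43.85, -0.51)                  # rounded to two decimals (assumed)

for start in (11, 15):
    exact = one_year_change(start, *full_precision)
    approx = one_year_change(start, *rounded)
    print(f"{start} -> {start + 1} years: {exact:.2f} (full precision), "
          f"{approx:.2f} (rounded coefficients)")
```

Either way the conclusion is the same: the predicted gain from an additional year is smaller at 15 years of education than at 11.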
( 1) yrs_educ = 0
( 2) ed2 = 0
F( 2, 992) = 12.95
Prob > F = 0.0000
No, this does not necessarily mean that education is not a significant predictor of
income. Because yrs_educ and ed2 are highly correlated, neither may appear
individually significant even though they are jointly significant. To determine whether
education is a significant predictor of income, we must test the hypothesis that
the coefficients on both education-related variables, yrs_educ and ed2, are jointly
zero. When we run an F-test for this purpose, we find that education is, in fact,
a statistically significant predictor of income (p=0.000). Hence, we reject the
null hypothesis that both coefficients are zero. (Note that the test above is already
reported in the regression output from 1, so you don’t have to run the test separately;
you can also just refer to that Stata output.)
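The reported p-value can be reproduced by hand from the F statistic. The Python sketch below relies on a special case: when the numerator has exactly two degrees of freedom, the upper-tail probability of an F(2, d2) distribution has the closed form (1 + 2F/d2)^(-d2/2), so no statistics library is needed. The function name is a hypothetical helper, and the formula does not apply to tests with a different number of restrictions.

```python
def f_test_p_value_two_restrictions(f_stat, denom_df):
    """Upper-tail p-value for an F(2, denom_df) statistic.

    Valid only for two numerator degrees of freedom, where the survival
    function reduces to (1 + 2*F/d2) ** (-d2/2).
    """
    return (1 + 2 * f_stat / denom_df) ** (-denom_df / 2)

# F(2, 992) = 12.95, as reported in the test output above.
p_value = f_test_p_value_two_restrictions(12.95, 992)
print(f"p-value: {p_value:.2e}")
```

The result is far below 0.0001, which is why Stata displays Prob > F = 0.0000 after rounding to four decimals.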
Part 5: Before answering the five questions below, please read the following
(slightly edited) excerpt from a science blog about the concept of p-hacking:
“Most scientists are careful and scrupulous in how they collect data and carry out
statistical tests. However, there are ways in which statistical techniques can be
misused and abused to show effects which are not really there. To avoid reporting
spurious results as fact and giving air to bad science, we must be able to recognize
when such methods may be in use. This piece introduces one such technique
known as ‘p-hacking’. It is one of the most common ways in which data analysis is
misused to generate statistically significant results where none exists, and is one
which we should remain vigilant against.
To take a toy example, suppose you wanted to establish a link between chocolate
and baldness. You could then get a group of 10,000 men (a pretty big sample size
by all accounts) to report on their consumption of M&Ms, Twix and Mars Bars over
a period of time. In addition, you record the rate of going bald in the group over
time. Once you have your chocolate and baldness data, you run tests on everything
you can think of. Do men who eat only M&Ms go bald younger? Do young men who
eat both Mars and M&Ms but not Twix go bald on top more often than the front?
Do older, unmarried men who don’t exercise and eat non chocolate bars have a
lower incidence of baldness? Run enough of these tests and you are eventually
bound to get a result that is ‘statistically significant’.