GMU Econ535-Applied Econometrics Problem Set3 (PS3) Solutions Spring 2024

The document outlines solutions for Problem Set 3 in an Applied Econometrics course, focusing on building a small dataset, exploring income patterns, and analyzing regression results. It includes tasks such as generating summary statistics, running regressions with robust standard errors, and interpreting coefficients related to income determinants. The document emphasizes the importance of understanding omitted variable bias and the limitations of drawing causal conclusions from cross-sectional data.

Uploaded by

AmRonPaulian

Econ 535, Applied Econometrics
Problem set 3 - SOLUTIONS
Due 7pm, March 25

Part 1: Building a Small Data Set and Revisiting Omitted Variable Bias

There are multiple sources of data available online (e.g. https://ourworldindata.org/, https://www.worldvaluessurvey.org/wvs.jsp, https://data.worldbank.org/, etc.). In this part of the problem set you will find data
yourself on one of these, or some other, site. The total number of observations can
vary greatly, but should be no less than N=50 (it can be much higher,
depending on your choice of data). You will build a small data set that should
include at least three variables, selected as follows:

• Start by identifying two variables where you would hypothesize that one is
causally related to the other: one outcome variable, Y, and one independent
variable, X1.
• Then identify a second independent variable, X2. X2 should also be such that
your hypothesis is that it is related to the outcome variable Y.

Once you have put together your data set, start by getting to know your data set by
reporting summary statistics for your (at least three) variables: min, max, mean,
standard deviation, etc.

Thereafter, run three regressions (using robust standard errors): one where you
regress Y on X1, one where you regress Y on X2, and one where you regress Y on
both X1 and X2. Report the results
either using the Stata output, or in a table similar to what you would see in an
academic paper. Interpret all coefficients and discuss their statistical significance.
Observe how the coefficients on X1 and X2 change (or not) between the bivariate and
the multiple regressions, and discuss the change in terms of the existence (or not)
of omitted variable bias.

(5 points) Answers will vary depending on the data set chosen.
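
As a rough illustration of the omitted variable bias logic in this part, the sketch below uses four hand-built observations (hypothetical values, not data from any of the sites above). It assumes a true model Y = 1*X1 + 2*X2 in which X2 is correlated with X1. The helper name slope is ours. It shows that the coefficient from regressing Y on X1 alone equals the true coefficient on X1 plus the coefficient on X2 times the slope from regressing X2 on X1.

```python
# Toy illustration of omitted variable bias; all numbers are made up.

def slope(x, y):
    """OLS slope from a regression of y on x with a constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx

x1 = [-1.0, -1.0, 1.0, 1.0]
noise = [-1.0, 1.0, -1.0, 1.0]                      # uncorrelated with x1 by construction
x2 = [0.5 * a + e for a, e in zip(x1, noise)]       # X2 is correlated with X1
y = [1.0 * a + 2.0 * b for a, b in zip(x1, x2)]     # assumed true model: beta1 = 1, beta2 = 2

short_coef = slope(x1, y)    # Y on X1 only, omitting X2
delta = slope(x1, x2)        # X2 on X1
implied = 1.0 + 2.0 * delta  # beta1 + beta2 * delta

print(short_coef, delta, implied)
```

Here the short regression overstates the coefficient on X1 because X2 is positively related to both X1 and Y. With real data the same comparison is made by looking at how the coefficient on X1 moves between the bivariate and the multiple regression.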

Part 2: Explaining Income Patterns

In this assignment, you will explore the determinants of income using data from
the National Longitudinal Survey of Youth. Download the file PS3_Data.dta, which
contains a sample of about 1000 respondents. This data set began with roughly
12,000 American teenagers about 20 years ago and has been following them since.
Respondents are classified as white, black or Hispanic.

1. First, get to know your data set.

a) (1 point) First show summary statistics by using the command “sum”
(summarize). What fraction of the sample is female? What is the average
age? What are the minimum, maximum and average monthly incomes in the
sample? What fraction is black and what fraction is Hispanic?
sum

Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
age | 995 20.2603 1.576226 16 23
black | 995 .0854271 .2796568 0 1
hispanic | 995 .0572864 .232506 0 1
income | 995 887.7364 344.2031 258.9612 2618.175
single | 995 .7366834 .4406542 0 1
-------------+---------------------------------------------------------
married | 995 .2633166 .4406542 0 1
yrs_educ | 995 11.95678 1.50557 4 17
urban | 995 .801005 .399445 0 1
male | 995 .5366834 .4989033 0 1

46.3% of the sample is female (1 - 0.537, since 53.7% is male). The average age is 20.26 and the minimum,
maximum and average incomes in the sample are $258.96, $2618.18 and $887.74
respectively. 8.5% of the sample is black, and 5.7% of the sample is Hispanic.

b) (1 point) Now generate a new variable called “minority” that is equal to 1 if
a person is black or Hispanic, and 0 otherwise. What fraction of the sample
is made up of whites?

gen minority=black==1|hispanic==1

. sum minority

Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
minority | 995 .1427136 .3499564 0 1

85.7% of the sample is made up of whites.
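
The shares reported in question 1 follow directly from the means of the dummy variables. A minimal sketch in Python, using the means from the Stata output above (the variable and function names are ours, and the row-level helper only mirrors the logic of the gen command):

```python
# Means of the 0/1 dummies, copied from the summary statistics above.
mean_male = 0.5366834
mean_black = 0.0854271
mean_hispanic = 0.0572864
mean_minority = 0.1427136

# The mean of a dummy is the share of the sample with the dummy equal to 1.
share_female = 1 - mean_male
share_white = 1 - mean_minority

def minority(black, hispanic):
    """Row-level version of: gen minority = black==1 | hispanic==1."""
    return int(black == 1 or hispanic == 1)

print(round(100 * share_female, 1), round(100 * share_white, 1))
```

The means of black and hispanic add up to the mean of minority (up to rounding), which suggests that no respondent in this sample is coded as both.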


2. (2 points) Regress income on the variables male, minority and years of
education. For this and all subsequent regressions, use robust standard errors.
Report your results. Which coefficients are statistically significant?
reg income male minority yrs_educ, robust

Linear regression Number of obs = 995
F(3, 991) = 50.50
Prob > F = 0.0000
R-squared = 0.1329
Root MSE = 321

------------------------------------------------------------------------------
| Robust
income | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
male | 233.9662 20.32987 11.51 0.000 194.0717 273.8608
minority | -33.09229 29.04318 -1.14 0.255 -90.08548 23.9009
yrs_educ | 44.10299 6.514053 6.77 0.000 31.32007 56.88591
_cons | 239.5634 79.93118 3.00 0.003 82.70959 396.4172

The coefficients on male and yrs_educ are significant at the 1% level (at least). The
coefficient on minority is not statistically significant.

3. (2 points) Now regress income on the variables male, minority, years of
education and married. Which coefficients are significant and at what level? Also,
what is the estimated change in income associated with an additional year of
education for males, and for females?

reg income male minority yrs_educ married, robust

Linear regression Number of obs = 995
F(4, 990) = 39.10
Prob > F = 0.0000
R-squared = 0.1397
Root MSE = 319.89

------------------------------------------------------------------------------
| Robust
income | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
male | 237.9132 20.4348 11.64 0.000 197.8127 278.0137
minority | -24.59293 29.00056 -0.85 0.397 -81.50256 32.3167
yrs_educ | 46.32538 6.546817 7.08 0.000 33.47815 59.17261
married | 65.27789 23.16271 2.82 0.005 19.82424 110.7315
_cons | 192.4708 81.68977 2.36 0.019 32.16577 352.7758
------------------------------------------------------------------------------

Now the coefficients on male, married, and years of education are significant at the
1% level (at least). The coefficient on “minority” is not significant.
The estimated change in income associated with an additional year of education is
46.33 for both men and women, because this specification restricts the education
coefficient to be the same for both groups. (We cannot know from this regression
whether the return to a year of schooling differs between men and women. To know
this, we would have to create a new variable (gen male_educ = male * yrs_educ) and
add it to the list of explanatory variables, but you are not asked to do that here.)
4. (3 points) Now generate a new variable that is equal to the interaction of the
dummies for “male” and “married” (call it male_married). Run the same regression
as in 3, but include your new variable among the independent variables. Interpret
the coefficient on married. Interpret the coefficient on male. Interpret the
coefficient on the interaction of male and married. Which of these three
coefficients are statistically significant, and at what level?
gen male_married=male*married

. reg income male minority yrs_educ married male_married, robust

Linear regression Number of obs = 995
F(5, 989) = 32.03
Prob > F = 0.0000
R-squared = 0.1459
Root MSE = 318.9

------------------------------------------------------------------------------
| Robust
income | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
male | 205.1602 23.37881 8.78 0.000 159.2824 251.0379
minority | -26.09316 28.87601 -0.90 0.366 -82.75845 30.57213
yrs_educ | 46.11461 6.578909 7.01 0.000 33.20438 59.02483
married | 2.281599 22.58367 0.10 0.920 -42.03582 46.59902
male_married | 122.9917 44.75587 2.75 0.006 35.16428 210.819
_cons | 213.3017 81.60683 2.61 0.009 53.15929 373.4441

The coefficient on married decreases from about 65.3 to 2.3, and is no longer statistically
significant. This coefficient (2.3) can be interpreted as the predicted
difference in income between married women and single women in our sample,
holding minority status and education constant.
The coefficient on male is 205.2 and can be interpreted as the predicted difference
in income between single men and single women. It is statistically significant at
the 1% level (at least). The coefficient on male_married is 123, and is statistically
significant at the 1% level. It tells us that the predicted change in income going
from single to married depends on whether you’re male or female: the predicted
difference between married and single men is about 123 larger than the
corresponding difference for women (about 125.3 for men versus 2.3 for women).
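
To make the interaction interpretation concrete, the sketch below combines the three coefficients from the regression above into predicted income differences for each gender and marital status group, relative to single women and holding minority status and education fixed. The dictionary and function names are ours and only illustrate the arithmetic.

```python
# Coefficients copied from the regression with the male_married interaction.
coef = {"male": 205.1602, "married": 2.281599, "male_married": 122.9917}

def premium(male, married):
    """Predicted income difference relative to single women (other regressors held fixed)."""
    return (coef["male"] * male
            + coef["married"] * married
            + coef["male_married"] * male * married)

# Predicted married-vs-single gap within each gender.
marriage_gap_women = premium(0, 1) - premium(0, 0)   # coefficient on married
marriage_gap_men = premium(1, 1) - premium(1, 0)     # married + male_married

for male in (0, 1):
    for married in (0, 1):
        print(male, married, round(premium(male, married), 2))
```

The difference between the two gaps is exactly the coefficient on male_married, which is why that coefficient is read as the additional marriage gap for men.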

5. (2 points) A policy maker comes across your results and notes that married
young people earn more than unmarried young people. She therefore suggests that
it would be good policy to promote marriage among young members of the labor
market. Is this a correct conclusion to draw? Why or why not?

This is probably not a causal relationship. It could, for example, be the case that
young people wait to get married until they have enough income to support a
partner or a family. If this is the case, then the causality would run in the opposite
direction – i.e. higher income would cause marriage and not the other way around.
There are several examples like this that would make for good answers. In general,
because this is simply cross-sectional data, we can’t prove causality, so it would be
irresponsible to encourage policies that don’t have a demonstrated causal effect on
our outcome of interest (here, income).

6. (2 points) Generate a variable called min_educ, the interaction between minority
and educ. Regress income on minority, educ and min_educ. What is the estimated
return to a year of schooling for whites? For minorities? Is there evidence to
suggest that this return is different for whites than for minorities in this sample?
gen min_educ= minority*yrs_educ

. reg income minority yrs_educ min_educ, robust

Linear regression Number of obs = 995
F(3, 991) = 10.31
Prob > F = 0.0000
R-squared = 0.0213
Root MSE = 341.03

------------------------------------------------------------------------------
| Robust
income | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
minority | -18.62319 147.1104 -0.13 0.899 -307.3069 270.0606
yrs_educ | 31.04021 7.919749 3.92 0.000 15.49881 46.58161
min_educ | -1.972122 12.9715 -0.15 0.879 -27.42689 23.48264
_cons | 522.4759 94.9886 5.50 0.000 336.074 708.8778

The estimated “return” to a year of schooling for whites would be 31.04, while the
“return” for minorities would be 29.07 (= 31.04 - 1.97). Based on this regression,
however, we cannot reject the null that the coefficient on min_educ is
equal to zero (p=0.879). Thus, we don’t have any evidence to suggest that
there is a different education premium for whites vs. minorities in this sample.
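
The arithmetic behind this answer can be reproduced from the regression output. The sketch below (variable names are ours) adds the interaction coefficient to the education coefficient to get the estimated return for minorities, and recomputes the t-statistic on min_educ as the coefficient divided by its robust standard error.

```python
# Coefficients and robust standard error copied from the regression above.
b_yrs_educ = 31.04021
b_min_educ = -1.972122
se_min_educ = 12.9715

return_white = b_yrs_educ                  # minority = 0
return_minority = b_yrs_educ + b_min_educ  # minority = 1

# t-statistic for H0: coefficient on min_educ equals zero.
t_min_educ = b_min_educ / se_min_educ

print(round(return_white, 2), round(return_minority, 2), round(t_min_educ, 2))
```

A t-statistic this close to zero is what produces the large p-value in the output, so the roughly 2-unit gap between the two point estimates is well within sampling noise.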

Part 3: Explore the determinants of ln(income).

1. (3 points) Continue using the same data. Make a regression table like those
usually presented in academic papers. The dependent variable is always ln(income)
(this requires you to generate a new variable) and all of these regressions should
use robust standard errors. The independent variables of interest are male, black
and years of education; add these sequentially to each of the 3 models (i.e. your
first model should just be a regression of ln(income) on male, the second should
include male and black as the independent variables and the third should include
all 3 independent variables). Standard errors should be placed below coefficients in
parentheses, along with stars for statistical significance (* p<.1 ** p<.05 ***
p<.01). The R2 and sample size for each regression should be reported at the
bottom of each column.
gen lincome=ln(income)

. eststo clear

.
. eststo: reg lincome male, robust

Linear regression Number of obs = 995


F(1, 993) = 107.57
Prob > F = 0.0000
R-squared = 0.0950
Root MSE = .33408

------------------------------------------------------------------------------
| Robust
lincome | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
male | .2169101 .0209135 10.37 0.000 .1758704 .2579498
_cons | 6.608175 .0137173 481.74 0.000 6.581257 6.635093
------------------------------------------------------------------------------
(est1 stored)

.
. eststo: reg lincome male black, robust

Linear regression Number of obs = 995


F(2, 992) = 54.99
Prob > F = 0.0000
R-squared = 0.0991
Root MSE = .3335

------------------------------------------------------------------------------
| Robust
lincome | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
male | .2176794 .0209102 10.41 0.000 .1766461 .2587128
black | -.0799085 .0362088 -2.21 0.028 -.1509632 -.0088538
_cons | 6.614588 .0139254 475.00 0.000 6.587262 6.641915
------------------------------------------------------------------------------
(est2 stored)

. eststo: reg lincome male black yrs_educ, robust

Linear regression Number of obs = 995


F(3, 991) = 49.46
Prob > F = 0.0000
R-squared = 0.1424
Root MSE = .32556

------------------------------------------------------------------------------
| Robust
lincome | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
male | .2428386 .0210778 11.52 0.000 .2014764 .2842009
black | -.0677965 .0348131 -1.95 0.052 -.1361124 .0005195
yrs_educ | .0492704 .0071684 6.87 0.000 .0352034 .0633373
_cons | 6.010936 .0889598 67.57 0.000 5.836365 6.185507
------------------------------------------------------------------------------
(est3 stored)

. esttab *, se r2 star(* .1 ** .05 *** .01)

------------------------------------------------------------
                      (1)             (2)             (3)
                  lincome         lincome         lincome
------------------------------------------------------------
male                0.217***        0.218***        0.243***
                 (0.0209)        (0.0209)        (0.0211)

black                             -0.0799**       -0.0678*
                                 (0.0362)        (0.0348)

yrs_educ                                           0.0493***
                                                (0.00717)

_cons               6.608***        6.615***        6.011***
                 (0.0137)        (0.0139)        (0.0890)
------------------------------------------------------------
N                     995             995             995
R-sq                0.095           0.099           0.142
------------------------------------------------------------
Standard errors in parentheses
* p<.1, ** p<.05, *** p<.01

2. (2 points) Using column 3, interpret each of the three coefficients on male,
black, and educ, and discuss whether each is statistically significant at the 5%
level.
The coefficient on male is equal to 0.243. This implies that, on average, being male
is associated with approximately 24.3 percent higher earnings than being female, holding race
and education constant. The coefficient is statistically significant at the 1% level,
and therefore also at the 5% level.
The coefficient on black is equal to -0.0678. This implies that, on average, being
black is associated with approximately 6.78 percent lower earnings than being non-black, holding gender and
education constant. The coefficient is statistically significant at the 10% level,
but not at the 5% level (p=0.052).
The coefficient on years of education is equal to 0.0493. This implies that, on
average, an additional year of education is associated with an increase in earnings
of about 4.93 percent, holding race and gender constant. The coefficient is
statistically significant at the 1% level, and therefore also at the 5% level.
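
Reading a log-level coefficient as a percent difference (100 times the coefficient) is an approximation that works well for small coefficients. As a rough check, the sketch below compares it with the exact percent difference implied by the model, 100*(exp(b) - 1), using the column 3 coefficients. The function and variable names are ours.

```python
import math

# Column 3 coefficients from the ln(income) regression above.
coefs = {"male": 0.2428386, "black": -0.0677965, "yrs_educ": 0.0492704}

def approx_pct(b):
    """Usual approximation: 100 times the coefficient."""
    return 100 * b

def exact_pct(b):
    """Exact percent difference in income implied by a log-level coefficient."""
    return 100 * (math.exp(b) - 1)

for name, b in coefs.items():
    print(name, round(approx_pct(b), 2), round(exact_pct(b), 2))
```

The two readings are close for black and yrs_educ. For male the coefficient is large enough that the exact difference (about 27.5 percent) is noticeably bigger than the approximation.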

3. (3 points) Is the return to education different for blacks and non-blacks? If so, by
how much? Explain in detail how you would answer this question, then answer it
by running a fourth regression.
We want to assess whether the relationship between one independent variable (education) and
the outcome variable (lincome) depends on the level of another independent
variable (black). This is an ideal case to use interactions.
We first make an interaction variable that is equal to black*yrs_educ. If the
coefficient on this variable is statistically different from zero, then we could argue
that the returns to education are different for blacks and non-blacks.
gen black_ed = black*yrs_educ

. reg lincome male black yrs_educ black_ed, robust


Linear regression Number of obs = 995
F(4, 990) = 37.89
Prob > F = 0.0000
R-squared = 0.1431
Root MSE = .32558

------------------------------------------------------------------------------
| Robust
lincome | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
male | .2430082 .0210995 11.52 0.000 .2016033 .2844132
black | -.3642078 .2752017 -1.32 0.186 -.9042534 .1758378
yrs_educ | .0475133 .0074795 6.35 0.000 .0328357 .0621909
black_ed | .0252565 .0232488 1.09 0.278 -.0203661 .0708791
_cons | 6.031893 .092684 65.08 0.000 5.850014 6.213773

Since the coefficient on black_ed is not statistically significant, we fail to reject the
implicit null that this coefficient is equal to zero. Thus, we cannot say with any
certainty that the returns to education for blacks and non-blacks are different.

4. (1 point) Why might it make sense to use ln(income) as a dependent variable
rather than income, as you did in part 1?
Using the log of income means that the explanatory variables are
associated with percent changes (or differences) in income, rather than absolute
changes in income. It may make more sense that an additional year of education would
be associated with a 5 percent increase in monthly income rather than a
particular dollar amount (such as $200). The dollar value of an additional year of
education can thus vary by income level.
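
A small sketch of the last point, under the simplifying assumption that one more year of education raises income by roughly 4.93 percent (the rounded column 3 coefficient) at every income level. The three income levels are the sample minimum, mean and maximum from the summary statistics; the names are ours.

```python
# Rounded education coefficient from the ln(income) regression (column 3).
b_educ = 0.0493

# Monthly income levels taken from the summary statistics.
incomes = {"min": 258.96, "mean": 887.74, "max": 2618.18}

# Approximate dollar value of one more year of education at each income level.
dollar_change = {label: b_educ * level for label, level in incomes.items()}

for label, change in dollar_change.items():
    print(label, round(change, 2))
```

The same percent change corresponds to very different dollar amounts at the bottom and the top of the income distribution, which a regression with income in levels cannot capture with a single coefficient.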

Part 4: The impact of education on income.

1. (1 point) Continue using the same data. Regress income on education and
education squared (this requires you to generate a new variable). Report your results.
gen ed2 = yrs_educ^2

. reg income yrs_educ ed2, robust


Linear regression Number of obs = 995
F(2, 992) = 12.95
Prob > F = 0.0000
R-squared = 0.0196
Root MSE = 341.15

------------------------------------------------------------------------------
| Robust
income | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
yrs_educ | 43.85142 45.62291 0.96 0.337 -45.67708 133.3799
ed2 | -.5104324 2.016529 -0.25 0.800 -4.467584 3.446719
_cons | 437.5441 258.0703 1.70 0.090 -68.88224 943.9704
------------------------------------------------------------------------------

2. (1 point) Can you interpret the coefficient on yrs_educ? Why or why not? If you
can interpret it, do so.
You cannot directly interpret the magnitude of the coefficient on yrs_educ, since we
can’t change yrs_educ and hold ed2 constant. In order to see the relationship
between education and income, we must look at a particular level of education: the
predicted change in income from a small increase in education is the coefficient on
yrs_educ plus 2 times the coefficient on ed2 times years of education. We could
create a table using sample education values, the regression constant and the
coefficients on yrs_educ and ed2 to calculate predicted incomes, and then calculate
the changes in income associated with an additional year of education at different
levels of education.
3. (1 point) What's the predicted difference in income between people with 11 and
12 years of educ?

(437.54 + 43.85*12 - 0.51*12^2) - (437.54 + 43.85*11 - 0.51*11^2) = 32.12

4. (1 point) What's the predicted difference in income between people with 15 and
16 years of educ?

(437.54 + 43.85*16 - 0.51*16^2) - (437.54 + 43.85*15 - 0.51*15^2) = 28.04

5. (1 point) Are the answers to 3 and 4 the same? Why or why not? Discuss.
They are different because the change in income associated with a change in
education now depends on the level of education. Since the coefficient on education
is positive and the coefficient on the squared term is negative, we know that predicted income is increasing
in education at a decreasing rate. Therefore, in this sample, going from 15 to 16
years of education is associated with a smaller increase in monthly income
($28.04) than going from 11 to 12 years of education ($32.12).
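
The calculations in questions 3 and 4 can be reproduced with a short sketch that uses the same rounded coefficients as the worked answers (437.54, 43.85 and -0.51). The function names are ours. It also computes the education level at which the fitted quadratic would peak, as a check on the "increasing at a decreasing rate" statement.

```python
# Rounded coefficients from the regression of income on yrs_educ and ed2.
B0, B1, B2 = 437.54, 43.85, -0.51

def predicted_income(educ):
    """Predicted monthly income at a given number of years of education."""
    return B0 + B1 * educ + B2 * educ ** 2

def gain_from_one_more_year(educ):
    """Predicted change in income going from educ to educ + 1 years."""
    return predicted_income(educ + 1) - predicted_income(educ)

gain_11_to_12 = gain_from_one_more_year(11)
gain_15_to_16 = gain_from_one_more_year(15)

# Education level where the fitted curve stops rising: -B1 / (2 * B2).
turning_point = -B1 / (2 * B2)

print(round(gain_11_to_12, 2), round(gain_15_to_16, 2), round(turning_point, 2))
```

The turning point lies far above the maximum of 17 years of education observed in the sample, so within the sample range predicted income rises with education throughout, just more slowly at higher levels.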

6. (1 point) Why might it make sense to include education squared as an explanatory
variable in addition to education?
We might believe that the predicted change in income associated with a change in
education depends on the level of education one receives. Including ed2 in our
regression allows for a nonlinear relationship between education and income. In
other words, we might believe that the return to education is greater for an
additional year of education at the high school level than for an additional year
after graduate school – that would yield coefficients like those that we see, where
the coefficient on education is positive and that on education squared is negative.

7. (2 points) You should find that the coefficient on “educ” is statistically
insignificant. Does this mean that education isn't a significant predictor of income?
test yrs_educ ed2

( 1) yrs_educ = 0
( 2) ed2 = 0

F( 2, 992) = 12.95
Prob > F = 0.0000

No, this does not necessarily mean that education is not a significant predictor of
income. Since yrs_educ and ed2 are highly correlated, they may not appear
individually significant even though they are jointly significant. To determine whether
education is a significant predictor of income, we must test the hypothesis that
both education-related coefficients, on yrs_educ and ed2, are jointly zero.
When we run an F-test for this purpose, we find that education is, in fact,
a statistically significant predictor of income (p=0.000), so we reject the
null hypothesis that both coefficients are zero. (Note that the test above is already reported in the
regression output from question 1, so you don’t have to run the test separately; you can also
just refer to that Stata output.)
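
As a rough check on the reported p-value, the upper-tail probability of an F distribution with 2 numerator degrees of freedom has a simple closed form, P(F > f) = (1 + 2f/d2)^(-d2/2), where d2 is the denominator degrees of freedom. The sketch below applies it to F(2, 992) = 12.95. The function name is ours, and the shortcut is valid only when the numerator degrees of freedom equal 2, as they do in this joint test.

```python
def f_upper_tail_num_df_2(f_stat, denom_df):
    """P(F > f_stat) for an F(2, denom_df) distribution.

    Closed form that holds only for numerator degrees of freedom equal to 2.
    """
    return (1 + 2 * f_stat / denom_df) ** (-denom_df / 2)

# Joint test of yrs_educ = 0 and ed2 = 0 from the Stata output above.
p_value = f_upper_tail_num_df_2(12.95, 992)

print(p_value)
```

The result is far below 0.0001, consistent with Stata displaying Prob > F = 0.0000.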

Part 5: Before answering the five questions below, please read the following
(slightly edited) excerpt from a science blog about the concept p-hacking:

“Most scientists are careful and scrupulous in how they collect data and carry out
statistical tests. However, there are ways in which statistical techniques can be
misused and abused to show effects which are not really there. To avoid reporting
spurious results as fact and giving air to bad science, we must be able to recognize
when such methods may be in use. This piece introduces one such technique
known as ‘p-hacking’. It is one of the most common ways in which data analysis is
misused to generate statistically significant results where none exists, and is one
which we should remain vigilant against.

A scrupulous scientist should go in with a well-motivated hypothesis which she puts
to the test in an experiment or using observational data. This is a baseline
assumption of scientific testing: that the scientist forms a prior hypothesis (ideally
based on a theory) which they then put to the test. Suppose, however, a scientist
took the opposite approach. Suppose they started off with the conclusion they want
to reach, and were not particularly concerned with scientific ethics. In this case,
they could use statistical testing to manufacture this result through selective
reporting.

To take a toy example, suppose you wanted to establish a link between chocolate
and baldness. You could then get a group of 10,000 men (a pretty big sample size
by all accounts) to report on their consumption of M&Ms, Twix and Mars Bars over
a period of time. In addition, you record the rate of going bald in the group over
time. Once you have your chocolate and baldness data, you run tests on everything
you can think of. Do men who eat only M&Ms go bald younger? Do young men who
eat both Mars and M&Ms but not Twix go bald on top more often than the front?
Do older, unmarried men who don’t exercise and eat non chocolate bars have a
lower incidence of baldness? Run enough of these tests and you are eventually
bound to get a result that is ‘statistically significant’.

A p-value of, for example, 0.01 indicates the probability of a result occurring
randomly just 1 in 100 times. This is generally judged to be pretty highly
significant as it’s rather unlikely the association came about by chance. This is
based on the assumption, of course, that you are not running hundreds of tests in
order to find the 1 in a 100 occurrence. P-hacking is particularly insidious because
it can be hard to detect. With a plausible explanation for why the ‘hypothesis’ was
proposed, results generated by torturing the data in this way can be hard to
distinguish from genuine studies.”
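
The excerpt's point about running many tests can be quantified with a short sketch. It assumes that every tested hypothesis is truly null and that the tests are independent of each other, which is a simplification (the function name is ours). Under those assumptions, the chance of at least one "significant" result among n tests at level alpha is 1 - (1 - alpha)^n.

```python
def prob_at_least_one_false_positive(alpha, n_tests):
    """Chance of one or more significant results when all nulls are true.

    Assumes the n_tests tests are independent of each other.
    """
    return 1 - (1 - alpha) ** n_tests

for alpha in (0.05, 0.01, 0.001):
    for n_tests in (1, 20, 100, 1000):
        p = prob_at_least_one_false_positive(alpha, n_tests)
        print(alpha, n_tests, round(p, 3))

p_100_tests_at_1pct = prob_at_least_one_false_positive(0.01, 100)
```

With 100 tests at the 1% level the chance of at least one spurious finding is already about 63 percent. A stricter threshold lowers this number for a given number of tests, but running more tests pushes it back up.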

1. (1 point) What is p-hacking according to the text above?
A) A misuse of data sampling to find patterns in data where no real
underlying effect exists.
B) A misuse of data analysis to find patterns in data where no real
underlying effect exists.
C) A misuse of experimental design to find patterns in data where no real
underlying effect exists.
D) All of the above.

2. (1 point) How should a scientist who is genuinely interested in the
relationship between chocolate consumption and baldness go about
their research?
A) It isn’t possible, so they should not attempt this.
B) They should limit their sample size to at most 1000 men.
C) They should go into their research with a clear, motivated
hypothesis that can be tested with a limited number of tests.
D) A misuse of data sampling and analysis, and of experimental design, to
find patterns in data where no real underlying effect exists.
3. (1 point) Which of the following is a red flag for potential p-hacking?
A) A sensationalized finding.
B) A result which goes against the majority of existing research on a topic.
C) A higher than expected number of studies reporting p-values just
below 0.05.
D) All of the above.

4. (1 point) Pre-registration of scientific hypotheses before
experiments are run and data are analyzed is seen as one weapon in
the quest against p-hacking. Which is the main reason why this is?
A) It forces the scientist to formulate clear, testable hypotheses.
B) It makes it possible for those evaluating the research to make sure that
the scientists have not run an inappropriate number of tests in search of
significant results.
C) The pre-registration can set the threshold for statistical significance ahead of
the analysis being conducted.
D) All of the above.

5. (1 point) Can more stringent standards for the level at which a result is
considered statistically significant eliminate the problem of p-hacking?
A) Yes, because a really low p-value of for example 0.001 is not possible to
generate through p-hacking.
B) Yes, especially if all scientists agree to the new level of statistical
significance.
C) No, but it may make it harder to p-hack.
D) No, p-hacking is as easy regardless of whether the p-value needs to be
0.05 or 0.01 for a result to count as statistically significant.
