1 Assignment Presentation: Patrick Blanchenay Due Wednesday 9th Dec 2020, 11.59pm
Patrick Blanchenay
Due Wednesday 9th Dec 2020, 11.59pm
1 Assignment presentation
This assignment tests your comprehension of Difference-in-Differences. You must submit the answers to the exercises in
Section 8. You will have to submit three elements:
• A PDF document giving the answers to the exercises
• The single Stata do-file that you used to generate the answers to all questions
• The single log file produced by Stata when running the do-file
To make these tables more readable, you are also asked to give a short readable label to any variable that you create:
// Giving variable a short label
gen logwage = ... // creating a new variable
label variable logwage "Log wages" // labelling that same variable
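As an illustration of how a variable label feeds into a regression table, here is a minimal sketch. It uses Stata's built-in auto dataset rather than the assignment data, and the variable logprice is purely hypothetical. It assumes the user-written estout package (which provides esttab) is installed.

```stata
* Minimal sketch on Stata's built-in auto dataset (not the assignment data).
* esttab comes from the user-written estout package: ssc install estout
sysuse auto, clear

* Create a variable, then label that same variable right away
gen logprice = ln(price)
label variable logprice "Log price"

* Run a regression, then print a table that shows labels instead of names
regress logprice mpg weight
esttab, label se
```

The label option tells esttab to display variable labels, and se requests standard errors in parentheses instead of t-statistics.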
5 Documents to upload
5.1 Results PDF document
Filename: ECO372_Assignment4_SURNAME_FirstName.pdf
The Results PDF should be a single document with your answers to all exercises. Most questions will require you to perform
analyses using Stata and provide suitable explanations and interpretations of the results. You are expected to provide,
whenever possible, properly formatted regression results using the esttab command. You do not need to answer
questions that only ask you to generate a new variable.
The answers you provide should rely only on results that are directly produced by your do-file. Conversely, it is not
always necessary to include ALL Stata results into your Results document. Only put the parts that are used to
answer the questions; but keep tables together (do not only copy isolated numbers).
Answers will be graded based on the quality of the explanations. It is not enough to use Stata output. You have to explain
how the output answers the specific question.
The PDF document must be uploaded to the Quercus assignment by the deadline. This is a necessary but not sufficient
condition for your submission to be complete.
Format
• PDF only. No other file type will be accepted (in particular, no MS Word document).
• Letter-sized. Font should be at least 10 points; everything should be readily readable, including the Stata output.
• Top line of the document should contain: [SURNAME] [First name] - ECO372 Assignment 4
• Second line: Student Number: [Student Number]
• Answers should be clearly numbered, but you do not need to copy the questions.
• Filename should be: ECO372_Assignment4_SURNAME_FirstName.pdf. For instance, mine would be called
ECO372_Assignment4_BLANCHENAY_Patrick.pdf. (It is OK if Quercus adds a number for a resubmission.)
5.2 Do-file
Filename: ECO372_Assignment4_SURNAME_FirstName.do
You can insert your commands in the space indicated in the provided template. Your code should produce all analyses
and output necessary for all exercises and questions, from one single do-file.
Your do-file must be able to run in one go if placed on a computer with the same datasets available. The only thing I
should need to change in your do-file, to reproduce exactly your results, is the working directory. In particular,
this requires keeping the do-file in your working directory, and the /datasets/ folder must also be in your working directory.
If you’re not sure, try on a classmate’s computer. If you get an error when running your do-file (red lines in Stata output),
correct the errors, then re-run the do-file, until the whole do-file can execute in one pass.
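The following sketch illustrates why relative paths make a do-file portable. It is not the provided template: the file name example_auto.dta is hypothetical, and the built-in auto dataset stands in for the assignment data. Only the cd line, shown commented out with a placeholder path, would differ from one computer to another.

```stata
* Hypothetical sketch, not the provided template.
* The cd line is the only one that should differ across computers:
* cd "path/to/your/working/directory"

* Create a datasets/ folder inside the working directory (ignored if it exists)
capture mkdir "datasets"

* Save and reload a file using a path relative to the working directory
sysuse auto, clear
save "datasets/example_auto.dta", replace
use "datasets/example_auto.dta", clear
describe, short
```

Because every path after cd is relative, the same commands resolve correctly wherever the working directory is located.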
Comment your code. You do not need to comment every instruction, but you should comment the big steps, or the big
blocks of code. Explain why you are doing such or such instructions, and what you expect Stata to do. Indentation is
also useful to make your code more readable.
Part of your grades depends on code formatting & commenting.
Format
• Text file only.
• Only ASCII characters should be used; no accented characters, no characters from extended alphabets or writing
systems.
• Filename should be: ECO372_Assignment4_SURNAME_FirstName.do, e.g. ECO372_Assignment4_BLANCHENAY_Patrick.do.
(It is OK if Quercus adds a number for a resubmission.)
5.3 Log file
If you followed the steps in Section 4, your log file should be created automatically when you run your do-file, and
it will be automatically named ECO372_Assignment4_SURNAME_FirstName.log, where SURNAME and FirstName have
been replaced by your ACORN surname and your ACORN first name. For instance, mine would be called
ECO372_Assignment4_BLANCHENAY_Patrick.log. Again, this should happen automatically if you are using the do-file
template provided, and if you have configured it appropriately (see step 5 in Section 4).
Anything in your log file should come from your do-file, not from instructions typed in the Stata command window. That is,
if I re-run your do-file, I should obtain exactly the same log file (apart from the path to the working directory).
If you get an error when running your do-file, correct the errors, then re-run the do-file in its entirety to generate an
error-free log file.
Format
• Text file only, not in SMCL.
• Filename should (automatically) be: ECO372_Assignment4_SURNAME_FirstName.log. (It is OK if Quercus adds a
number for a resubmission.)
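For reference, this is roughly how a plain-text (non-SMCL) log is opened and closed in Stata. The provided template is expected to do this for you. The sketch below is only an illustration, with a hypothetical file name and the built-in auto dataset.

```stata
* Illustrative sketch; the assignment template should already handle logging.
capture log close                           // close any log left open by an earlier run
log using "example_run.log", text replace   // text = plain text rather than SMCL

sysuse auto, clear
summarize price

log close
```

The text option is what produces a plain-text file, and replace overwrites the log from a previous run so that the log always reflects the latest complete execution.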
6 Submission instructions
By Wednesday 9th Dec 2020, 11.59pm, you should have uploaded all three documents. Your submission will only be
considered complete when you have done all of those things. Failure to complete one or more of those will count as late
submission.
Only the results file, the do-file and the log-file should be uploaded. Do not include the datasets in your submission.
Do not group files in a zip file.
No submission will be accepted on paper, or by email, regardless of any technological problem.
7 Grading
7.1 Rubric
The assignment is worth 100 points, graded according to the following rubric.
Item                            Points
Question 1                      40
Question 2                      40
Code formatting & commenting    10
PDF formatting                  10
For exercise questions, you will be graded on the quality of the answers to the questions. Emphasis will be put on clear
and concise answers that address specifically the question, and show your understanding of the topic and the statistical
issues it raises. Appropriate use of the Stata output in the answer will also be taken into account: use what is necessary,
leave out the irrelevant.
Note on PDF formatting: You are expected to make use of esttab to produce appropriately formatted regression
results, whenever possible. To make these tables more readable, you are also asked to give a short readable label to any
variable that you create. See section 3.3 for more details.
All results files will be checked. Some do-files and log files will be checked at random.
7.2 Penalties
Note the penalties below, as they can quickly lower your grade:
Problem                                                             Penalty
Late submission (starting immediately at deadline)                  10pts per 24hrs
File names do not follow the prescribed pattern                     5pts per file
Do-file generates errors after modifying working directory          10pts
Do-file does not run in one go after modifying working directory    10pts
Log file does not correspond to do-file                             10pts
Results are used that are not reproducible with the do-file         10pts
8 Questions
Reminder: You are expected to make use of esttab to produce nicely formatted regression tables.
This question asks you to reproduce selected results of the paper “Does Aid Matter? Measuring the Effect of Student Aid
on College Attendance and Completion,” published by Susan Dynarski (2003) in The American Economic Review. This
paper is attached to the assignment, as Dynarski2003.pdf.
a. (10 pts) Describe the outcome, the variable of interest (treatment), and the source of exogenous variation that
the author uses to identify her model. Discuss how this approach is an improvement over attempting to estimate
equation (1) on page 279 of the paper by simply using observational data.
b. Load the dataset Dynarski2003.dta; locate the variable equal to 1 if a youth is a member of a cohort that
graduated from high school before student benefits were eliminated.
c. (10 pts) Replicate the summary statistics highlighted in Table 1 reproduced at the end of this exercise (the table
is also available in the original article). You do not have to replicate the exact formatting. (It is easier to put
the “Father Deceased/Not” and “Before/After Change in Benefits” as rows; and the summary statistics for each
variable “Attend college by 23”, “Yrs of Schooling” as columns.) The suggested way to do this is the table
command. For instance, the command:
table x2 [weight=wt88], by(x1) contents(mean y mean z)
generates a table where each row corresponds to a unique combination of x1 and x2, and each cell in the table
gives the mean of variables y and z respectively.
d. (10 pts) Replicate the highlighted results in Table 2 of the paper reproduced at the end of this assignment. Note
that for Table 2, the standard errors are clustered at the household level, to account for potential correlation of
students within the same household. To cluster standard errors, use the cluster() option instead of robust.
e. (10 pts) State the key assumption for the DD estimates to be valid in this context. What would be the preferred
method for supporting that assumption? What evidence does the author provide in lieu of the preferred evidence
to support this assumption?
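The two Stata techniques mentioned in questions c and d (a table of means by group, and clustered standard errors) can be tried out on Stata's built-in auto dataset before applying them to the assignment data. The sketch below is only an illustration: rep78 and foreign are stand-ins for the grouping variables, rep78 also stands in for a household identifier, and no weights are used.

```stata
* Illustrative sketch on the built-in auto dataset (not the assignment data).
sysuse auto, clear

* One row per combination of foreign and rep78; cells hold mean price and mean mpg.
* This is the table syntax of Stata 16 and earlier; in Stata 17 or later the
* syntax changed, and prefixing the command with "version 16:" should restore it.
table rep78, by(foreign) contents(mean price mean mpg)

* Same regression with robust, then with clustered standard errors
regress price mpg weight, robust
regress price mpg weight, cluster(rep78)
```

With cluster(), observations sharing the same value of the cluster variable are allowed to have correlated errors. Observations with a missing cluster variable are dropped from the estimation.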
Exercise 2: Facezon’s Headquarters
This exercise gets you to estimate the effect of the arrival of big headquarters on the wages of local workers. It has
two parts. First, you will be given the data generating process of wages, and asked to create a fictitious panel dataset
of wages. Then, you will “forget” that you created the dataset and imagine that you are a researcher who just received
the data to estimate the “headquarters” effect.
Make sure to insert your Stata commands in the relevant part of the ECO372_Assignment4_Surname_Firstname.do. It
is important to leave the following commands intact:
clear
set seed `studentnumber'
set obs 780
gen workerID = _n
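The reason for fixing the seed is reproducibility: the same seed yields the same sequence of random draws, so re-running the do-file gives identical data. A small sketch of this, assuming a hypothetical student number stored in a local macro (the template is expected to define the real one):

```stata
* Illustrative sketch; 1234567 is a hypothetical student number.
local studentnumber 1234567

clear
set obs 5

set seed `studentnumber'
gen x1 = runiform()      // first set of draws

set seed `studentnumber'
gen x2 = runiform()      // same seed again, so the same draws

assert x1 == x2          // passes: the two variables are identical
```

Note that a local macro is referenced with an opening backtick and a closing apostrophe, and that it only exists within the do-file run that defines it.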
There are 780 workers in our dataset, spread over two cities, A and B. Allocate approximately 60% of workers
to city A, and the rest to city B by running the following:
gen byte cityA = (runiform() < 0.6) // allocate approximately 60% of workers to city A
gen byte cityB = 1 - cityA // allocate the rest to city B
Duplicate each observation to create 7 years of data for each individual, from 2012 to 2018, by doing the
following:
expand 7, generate(expandy)
sort workerID expandy
bysort workerID: gen year = 2012+ _n -1 // creates year
drop expandy
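After an expand step like this one, it can be worth checking that the panel has the intended structure. The sketch below repeats the expansion on a toy dataset of 3 workers (an assumption made to keep the example small) and then verifies that each worker appears exactly once per year.

```stata
* Illustrative sketch on a toy dataset of 3 workers.
clear
set obs 3
gen workerID = _n

* Same expansion logic as in the exercise
expand 7, generate(expandy)
bysort workerID (expandy): gen year = 2012 + _n - 1
drop expandy

* Sanity checks on the panel structure
isid workerID year                   // one row per worker-year
bysort workerID: assert _N == 7      // 7 years per worker
assert inrange(year, 2012, 2018)     // years span 2012 to 2018
```

Each assert stops the do-file with an error if the condition fails, which makes such checks a cheap safeguard in a file that must run in one go.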
In this world, the hourly wage of worker i in year t is determined by the following equation:
where
• Y_t is the number of years since 2012 (for instance, for observations in 2014, Y_2014 = 2);
• HQ_it is a dummy equal to 1 if worker i is in a city where a big company’s headquarters exist in year t;
• u_it is the error term (white noise).
This is the “true” model of wages, a.k.a. data-generating process.
a. (6 pts) What is the interpretation of κ1 in Equation (2)?
b. (6 pts) What is the interpretation of κ2 and κ3 in Equation (2)?
Suppose now that after a lengthy selection process, the Internet giant Facezon set up new headquarters in city A in
2016.
c. Create a dummy variable POST equal to 1 only for observations in year 2016 and following, and 0 otherwise. Create
a dummy variable HQ equal to 1 for workers who work in a city with big headquarters, that is, for observations in
city A in years 2016 and following; HQ is equal to 0 otherwise.
d. Generate a white “noise” error term u using the following:
gen u = rnormal(0,1.5) if cityA == 1
replace u = rnormal(0,1) if cityB == 1
Do you notice anything about the error term? What is this called? What does it imply for our estimations?
e. Create a variable ys2012 equal to the number of years elapsed since 2012. For instance, it should be equal to 0
for observations in 2012, and equal to 3 for observations in 2015.
f. Using Equation (2), generate hourly wages w for workers, using the following values: κ0 = 5; κ1 = 2.1; κ2 =
0.8; κ3 = 0.4.
We have now created the dataset. Imagine you are a researcher who does not know how the data was generated. You
are only told that city A welcomed Facezon’s new headquarters in 2016. You would like to estimate the effect that this
had on the wages of workers in city A using a difference-in-differences approach. You receive the dataset, but containing
only the following variables: workerID, year, ys2012, cityA, cityB, POST and w.
g. Drop all the other variables.
h. (4 pts) Write down the equation to estimate this using a regression. NB: this equation will be different from
equation (2).
i. (6 pts) Based on the values of the κ’s given in question f, what would a correct estimation of the causal effect of
headquarters on local wages find?
j. (10 pts) Estimate the equation you set up in question h. How does it compare to your answer to question i?
Explain the difference. Be specific, using information on how we generated the data.
k. (8 pts) Can you suggest a way to remedy the problem? Re-do the estimation, and check whether your estimate
is statistically different from the value you were expecting in question i.