Metrics Jan 2021
Part A
Q1. Statistics
a. Let the pdf of random variable $x$ be $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$ and $f(x) = 0$ for $x < 0$.
Define a new random variable $y = x^2$. Find the pdf of $y$.
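A derived density for part (a) can be checked numerically. The sketch below assumes the pdf in part (a) is the exponential density $f(x) = \lambda e^{-\lambda x}$ and compares simulated draws of $y = x^2$ with the change-of-variables candidate $f_Y(y) = \lambda e^{-\lambda\sqrt{y}} / (2\sqrt{y})$ for $y > 0$, whose cdf is $1 - e^{-\lambda\sqrt{y}}$. The rate `lam`, the seed, the sample size and the function names are illustrative choices, not part of the exam.

```python
import math
import random

def cdf_y(y, lam):
    """Candidate cdf of Y = X^2 when X ~ Exponential(lam): P(X <= sqrt(y))."""
    return 1.0 - math.exp(-lam * math.sqrt(y)) if y > 0 else 0.0

def pdf_y(y, lam):
    """Candidate pdf of Y = X^2: derivative of cdf_y with respect to y."""
    return lam * math.exp(-lam * math.sqrt(y)) / (2.0 * math.sqrt(y)) if y > 0 else 0.0

def empirical_cdf(samples, y):
    """Share of simulated draws that are less than or equal to y."""
    return sum(s <= y for s in samples) / len(samples)

random.seed(0)
lam = 1.5  # illustrative rate (assumption, not from the exam)
draws = [random.expovariate(lam) ** 2 for _ in range(200_000)]

# Compare the simulated cdf with the candidate cdf at a few points.
for y in (0.1, 0.5, 2.0):
    print(y, round(empirical_cdf(draws, y), 3), round(cdf_y(y, lam), 3))
```

If the two columns agree up to simulation noise, the candidate density is consistent with the transformation.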
Let $X_1, X_2, \ldots, X_n$ represent a random sample following $\chi^2(1)$. $\bar{X}$ is the sample mean.
b. Find the limiting distribution of $\sqrt{\bar{X}}$. (Note: the $\chi^2(k)$ distribution has mean $k$ and variance $2k$.)
Consider two random variables X1 and X2, whose joint density function is
Q2. OLS estimation
For a dependent variable vector y with n observations, its corresponding independent variable
matrix is X with k variables, parameter vector is $\beta$ and residual vector is e.
a. Write out the matrix representation of linear regression, detail the dimensions of each
matrix/vector.
c. Interpret the goodness-of-fit measure $R^2$ and derive its expression as a function of y,
$\hat{y}$ and e.
d. Derive the distribution of the estimated parameter vector, assuming the residuals are
normally distributed, $N(0, \sigma^2 I_n)$.
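As a notational reference for this question, the standard textbook forms are sketched below. They assume $X$ has full column rank $k$, the regression includes an intercept, and $X$ is treated as fixed (or conditioned on). The symbols $\hat{\beta}$ and $\bar{y}$ (the sample mean of y) are introduced here for illustration. The question uses e for the disturbance in the model, while in the $R^2$ line e stands for the fitted residuals $y - \hat{y}$.

```latex
\begin{aligned}
\underset{n\times 1}{y} &= \underset{n\times k}{X}\;\underset{k\times 1}{\beta} + \underset{n\times 1}{e}, \\
\hat{\beta} &= (X'X)^{-1}X'y, \qquad \hat{y} = X\hat{\beta}, \\
R^2 &= \frac{\sum_i (\hat{y}_i-\bar{y})^2}{\sum_i (y_i-\bar{y})^2}
     = 1 - \frac{e'e}{\sum_i (y_i-\bar{y})^2}, \\
\hat{\beta}\mid X &\sim N\!\left(\beta,\ \sigma^2 (X'X)^{-1}\right)
  \quad \text{when } e \mid X \sim N(0,\sigma^2 I_n).
\end{aligned}
```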
Q3. Regression application
Consider the following time series regression output. Someone is regressing per capita annual
gasoline consumption on a set of explanatory variables including linear trend, income and price
of each year.
log(gas/pop) ~ log(income) + log(price) + ltrend
Residuals:
Min 1Q Median 3Q Max
-0.06302 -0.01648 0.00579 0.01840 0.04726
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Intercept -16.634946 1.002185 -16.599 < 2e-16 ***
log(income) 1.870306 0.114454 16.341 < 2e-16 ***
log(price) -0.114410 0.022667 -5.048 1.73e-05 ***
ltrend -0.017939 0.002599 ??? ???
---
Residual standard error: 0.02956 on 32 degrees of freedom
Multiple R-squared: 0.9653, Adjusted R-squared: 0.962
F-statistic: 296.7 on 3 and 32 DF, p-value: < 2.2e-16
a. Using $t_{0.025} = 2$, calculate the 95% confidence interval for the coefficient on log(income).
Round to the 2nd decimal place.
d. To perform an F test of the hypothesis that the coefficient of log(price) = 0, what is the value
of the test statistic, and why?
e. Suppose below is the residual plot of this time series regression, what type of problem is likely
to exist in the error terms? What kind of problem does it introduce to the estimates?
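The arithmetic behind parts (a) and (d) can be sketched from the printed output alone. The snippet below takes the log(income) and log(price) rows of the coefficient table, uses the critical value 2 given in part (a), and relies on the standard result that an F test of a single zero restriction equals the square of the corresponding t statistic (here with 1 and 32 degrees of freedom). Variable names are illustrative.

```python
# Values copied from the regression output above.
income_est, income_se = 1.870306, 0.114454
price_t = -5.048
t_crit = 2.0  # critical value given in part (a)

# 95% confidence interval: estimate +/- critical value * standard error.
ci_low = round(income_est - t_crit * income_se, 2)
ci_high = round(income_est + t_crit * income_se, 2)

# For one linear restriction, the F(1, df) statistic equals the squared t(df) statistic.
f_price = price_t ** 2

print(f"95% CI for log(income): [{ci_low:.2f}, {ci_high:.2f}]")
print(f"F statistic for log(price) = 0: {f_price:.2f}")
```

With the printed values this gives an interval of [1.64, 2.10] and an F statistic of about 25.48.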
PhD Econometrics Examination
Part B
In econometric analysis, we are often concerned with estimating the causal effect of some treatment
variable, $D_i$, on an outcome variable of interest, $Y_i$. However, obtaining a causal estimate can be
challenging.
a. What is the “Fundamental Problem of Causal Inference”? Please define it and explain
what it means for empirical analysis.
b. Define $Y_{1i}$ as the outcome of individual $i$ if she/he is treated and $Y_{0i}$ as the outcome of
that same individual if she/he were not treated. If treatment were randomly assigned
across individuals in the sample, then the treatment effect of $D_i$ on $Y_i$ is as follows:
where $D_i$ is equal to one if individual $i$ was treated and equal to zero if she/he was not
treated (i.e., was in the control group).
Suppose that treatment was not randomly assigned. Using the Potential Outcomes
Framework, decompose the expectation in equation (B.1) into the “average treatment
effect” and “selection bias”. Say in words what is captured by the average treatment
effect term and the selection bias terms that you derive.
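Assuming equation (B.1) is the difference in mean observed outcomes between treated and control individuals, the standard potential-outcomes decomposition can be sketched as follows. It uses the observed-outcome identity $Y_i = Y_{0i} + (Y_{1i} - Y_{0i})D_i$. In this form the first term is the average effect of treatment on the treated, which coincides with the population average treatment effect only under further assumptions such as random assignment or a constant treatment effect.

```latex
\begin{aligned}
E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]
&= E[Y_{1i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0] \\
&= \underbrace{E[Y_{1i} - Y_{0i} \mid D_i = 1]}_{\text{average treatment effect on the treated}}
 + \underbrace{E[Y_{0i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0]}_{\text{selection bias}}
\end{aligned}
```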
c. A large body of evidence indicates that in utero and early life health affects adult
economic well-being. Suppose that you are interested in estimating the effect of being
born at a low birth weight (an indicator of poor in utero health and health at birth) on
adult economic well-being for a sample of individuals in Ethiopia. Note: in utero refers
to the period during which a child was in his/her mother’s womb (i.e., after conception
but before birth). So, you estimate the following model:
where $Y_i$ is the adult earnings of individual $i$ and $LBW_i$ is equal to one if individual $i$ was
born at a low birth weight and zero otherwise. Ideally, then, $\beta$ would capture the effect of
being born at a low birth weight on adult earnings.
What assumption is required in order for the OLS estimate of $\beta$ to represent the unbiased,
causal effect of low birth weight on earnings? Why is this assumption likely to fail?
d. You decide to instrument for low birth weight using the instrumental variable, $Z$. Which
two requirements are necessary for this instrument to be valid? Please define both
mathematically and in words.
f. Derive and describe the estimator for $\beta$ using the control function approach and using $Z$
as your instrument.
g. Would the following variables be plausible instruments for being born at a low birth
weight? Explain why or why not and be sure to address each of the requirements for a
valid instrument in your explanation.
iii. The prevalence of infectious disease in the area where individual $i$ was born
during his/her in utero period.
iv. Rainfall during the most recent growing season prior to individual $i$'s birth.
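For parts (d) and (f), the mechanics of a control function estimator can be illustrated on simulated data. This is a minimal sketch under stated assumptions, not the exam's expected derivation: it uses a linear outcome equation and a linear first stage with a continuous endogenous regressor `d` (the exam's low birth weight indicator is binary), and every parameter value is hypothetical. The instrument `z` is relevant (it shifts `d`) and exogenous (it is independent of the outcome error `u`) by construction. In this linear setting the coefficient on `d` from the second step coincides numerically with the simple IV estimator.

```python
import random

def demean(v):
    """Subtract the sample mean, which absorbs the intercept in each regression."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(1)
n, beta_true = 20_000, -0.5                       # hypothetical sample size and effect
z = [random.gauss(0, 1) for _ in range(n)]        # instrument
v = [random.gauss(0, 1) for _ in range(n)]        # first-stage error
u = [0.8 * vi + random.gauss(0, 1) for vi in v]   # correlated with v, so d is endogenous
d = [0.7 * zi + vi for zi, vi in zip(z, v)]       # endogenous regressor
y = [1.0 + beta_true * di + ui for di, ui in zip(d, u)]

zc, dc, yc = demean(z), demean(d), demean(y)

# Naive OLS of y on d (biased here because d is correlated with u).
b_ols = dot(dc, yc) / dot(dc, dc)

# Step 1: first stage, d on z; keep the residuals v_hat.
pi_hat = dot(zc, dc) / dot(zc, zc)
v_hat = [di - pi_hat * zi for di, zi in zip(dc, zc)]

# Step 2: regress y on d and v_hat; solve the 2x2 normal equations for the d coefficient.
s_dd, s_dv, s_vv = dot(dc, dc), dot(dc, v_hat), dot(v_hat, v_hat)
s_dy, s_vy = dot(dc, yc), dot(v_hat, yc)
det = s_dd * s_vv - s_dv ** 2
b_cf = (s_vv * s_dy - s_dv * s_vy) / det

# Simple IV estimator for comparison: cov(z, y) / cov(z, d).
b_iv = dot(zc, yc) / dot(zc, dc)

print(round(b_ols, 3), round(b_cf, 3), round(b_iv, 3))
```

Including `v_hat` as an extra regressor absorbs the part of `d` that is correlated with the outcome error, which is why the second-step coefficient on `d` moves from the biased OLS value toward the true effect.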
Q2. Maximum Likelihood
Define $\phi(\theta)$ as the pdf for a standard normal and $\Phi(\theta)$ as the cdf for the standard normal. Note:
$\frac{\partial \Phi(z)}{\partial \theta} = \phi(z) \frac{\partial z}{\partial \theta}$
b. Derive the probabilities that $y_i = 1$ and $y_i = 0$ for individual $i$.
c. Derive the contribution of each individual in your sample to the overall likelihood
function (i.e., derive $\mathcal{L}_i(\theta)$) and the individual log-likelihood function.
e. Explain what is implied by the simplified form of the Score function (i.e., what is the
implied orthogonality condition).
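The objects in parts (b) to (e) can be sketched in code under the assumption that the model is a probit, that is, $P(y_i = 1) = \Phi(x_i'\theta)$ with a standard normal latent error. The index form $x_i'\theta$, the function names and the numerical values are illustrative assumptions. The score is written as a generalized residual times the regressors, so setting the summed score to zero requires that residual to be orthogonal to each regressor in the sample.

```python
import math

def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def loglik_i(y, x, theta):
    """Individual probit log-likelihood with index x'theta."""
    z = sum(xj * tj for xj, tj in zip(x, theta))
    return y * math.log(Phi(z)) + (1 - y) * math.log(1.0 - Phi(z))

def score_i(y, x, theta):
    """Individual score: generalized residual times the regressors."""
    z = sum(xj * tj for xj, tj in zip(x, theta))
    gen_resid = (y - Phi(z)) * phi(z) / (Phi(z) * (1.0 - Phi(z)))
    return [gen_resid * xj for xj in x]

# Hypothetical observation: intercept plus one regressor.
x_i, theta = [1.0, 0.4], [0.2, -0.5]
print(loglik_i(1, x_i, theta), score_i(1, x_i, theta))
```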
Q3. Partitioned Regression and Frisch-Waugh-Lovell Theorem
Consider the model $y = Xb + e$, where $X$ is an $n \times k$ matrix. Let the data matrix $X$ be partitioned
into two matrices, $X = [X_1 : X_2]$, where $X_1$ and $X_2$ have the dimensions $n \times k_1$ and $n \times k_2$,
respectively, and $k_1 + k_2 = k$. Thus, we can rewrite the model as
a. Perform an OLS regression of $X_1$ on $X_2$. Derive the matrix of residuals from this
regression and denote it $e_{12}$. (Hint: use the residual matrix for $X_2$.)
b. Perform an OLS regression of $y$ on $X_2$. Derive the matrix of residuals from this
regression and denote it $e_{y2}$. (Hint: use the residual matrix for $X_2$.)
c. Perform an OLS regression of $e_{y2}$ on $e_{12}$. Derive the OLS coefficient from this
regression and denote it $\tilde{b}_1$. (You may use the normal equations to do this.)
d. Show that $\tilde{b}_1 = \hat{b}_1$, where $\hat{b}_1$ is the OLS coefficient on $X_1$ obtained from a regression
of $y$ on both $X_1$ and $X_2$. (Hint: use the answer derived in part c and substitute it into
the full model for $y$ represented by equation B.3. The residual $e$ in the regression of
$y$ on $X$ is orthogonal to both $X_1$ and $X_2$.)
e. Denote the residuals from the regression of $e_{y2}$ on $e_{12}$ as $\tilde{e}$. Show that these
residuals, based on the model $e_{y2} = e_{12}\tilde{b}_1 + \tilde{e}$, are the same as the residuals obtained
from the regression of $y$ on both $X_1$ and $X_2$. (Hint: decompose $e_{y2}$ and $e_{12}$ into their
original parts. You will also need to use the results from part d.)
f. Suppose that $X_1'X_2 = 0$, meaning that the two sets of variables are orthogonal. Show
that, in this case, $\hat{b}_1 = b_1^*$, where $b_1^*$ is the OLS coefficient on $X_1$ obtained from a
regression of $y$ on $X_1$ alone.
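The equalities in parts (c) to (e) can be checked numerically before proving them. The sketch below is a minimal illustration under simplifying assumptions: $X_1$ and $X_2$ are single columns ($k_1 = k_2 = 1$), there is no intercept, and the data are simulated with hypothetical coefficients. The names `b1_tilde`, `b1_hat`, `e12` and `ey2` mirror the notation of the question.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def resid_on(v, x2):
    """Residuals from regressing v on the single column x2: v - x2 (x2'x2)^{-1} x2'v."""
    coef = dot(x2, v) / dot(x2, x2)
    return [vi - coef * xi for vi, xi in zip(v, x2)]

random.seed(2)
n = 500
x2 = [random.gauss(0, 1) for _ in range(n)]
x1 = [0.6 * a + random.gauss(0, 1) for a in x2]   # correlated with x2
y = [2.0 * a - 1.0 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# Parts a-c: partial x2 out of x1 and y, then regress residual on residual.
e12 = resid_on(x1, x2)
ey2 = resid_on(y, x2)
b1_tilde = dot(e12, ey2) / dot(e12, e12)

# Part d: full regression of y on [x1, x2] via the 2x2 normal equations.
s11, s12, s22 = dot(x1, x1), dot(x1, x2), dot(x2, x2)
s1y, s2y = dot(x1, y), dot(x2, y)
det = s11 * s22 - s12 ** 2
b1_hat = (s22 * s1y - s12 * s2y) / det
b2_hat = (s11 * s2y - s12 * s1y) / det

# Part e: residuals from the two regressions should coincide.
e_full = [yi - b1_hat * a - b2_hat * b for yi, a, b in zip(y, x1, x2)]
e_tilde = [u - b1_tilde * w for u, w in zip(ey2, e12)]
max_gap = max(abs(p - q) for p, q in zip(e_full, e_tilde))

print(b1_tilde, b1_hat, max_gap)
```

The two coefficients agree up to floating-point error and `max_gap` is numerically zero, which is what the Frisch-Waugh-Lovell theorem predicts.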
Part C
Q7. Consider the following study/survey scenario. To find out people's preferences for buying micro
health insurance, a choice experiment study was set up. Three attributes were considered, with
respective levels:
In addition, other socioeconomic variables were collected: Age, Gender, Income, Current Insurance
(Yes/No), Education Level, Distance (in minutes) to Nearest Clinic, and Number of Children under 18.
The objective was to calculate the marginal willingness to pay value.
a. Set up a RUM structure for this model using the indirect utility functions etc.
e. Present the formula for the marginal willingness to pay for each attribute, and interpret them.
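For parts (a) and (e), the usual form of the calculation can be sketched as follows. The attribute list is not shown above, so this assumes a linear-in-attributes indirect utility in which one attribute is a monetary cost $p_{ij}$ (for example the insurance premium) with coefficient $\beta_p$, and the remaining attributes $x_{ijk}$ have coefficients $\beta_k$. All symbols here are illustrative.

```latex
\begin{aligned}
U_{ij} &= V_{ij} + \varepsilon_{ij}, \qquad
V_{ij} = \sum_{k} \beta_k x_{ijk} + \beta_p\, p_{ij}, \\
\text{MWTP}_k &= -\,\frac{\partial V_{ij}/\partial x_{ijk}}{\partial V_{ij}/\partial p_{ij}}
               = -\,\frac{\beta_k}{\beta_p}.
\end{aligned}
```

Under these assumptions, $\text{MWTP}_k$ is the amount of money a respondent would give up for a one-unit increase in attribute $k$ while holding utility constant.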
Q8. In the same survey, new mothers were asked the following health outcome questions: 1) Number
of times the woman visited the clinic for antenatal care; 2) The BMI of the child at birth; 3) Mode of
delivery (At-home by family members, by community midwife; or at Clinic); and 4) Self-rated health
status (4 = Feeling very well … 1 = Not feeling well at all). That is, there were four different data
generating processes to describe the outcome variables (antenatal visits, BMI, mode of delivery, and
self-reported ranked health status).
a. For each health outcome measure, choose an appropriate modeling/estimation method and
describe as to why you chose that estimation/modeling method.
b. For each health outcome case, write out, step by step, all regression equations and the
log-likelihood functions.
c. Also, describe the expected sign for each of the independent variables you chose to include in
your model.
Q9. Define and describe the difference between the Tobit and the Heckman selectivity models. Give a
real-world example for each case, with the corresponding log-likelihood functions. Show your
work.
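For reference, one common textbook parameterization of the two log-likelihoods is sketched below. It assumes a Tobit censored from below at zero and a Heckman model with a probit selection equation whose error variance is normalized to one. The symbols $x_i$, $z_i$, $\gamma$, $s_i$, $\sigma$ and $\rho$ are introduced for illustration, and other parameterizations are equally valid.

```latex
\begin{aligned}
&\text{Tobit: } y_i^* = x_i'\beta + \varepsilon_i,\quad \varepsilon_i \sim N(0,\sigma^2),\quad y_i = \max(0,\, y_i^*) \\
&\ln L = \sum_{y_i > 0} \left[ -\ln\sigma + \ln\phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right) \right]
       + \sum_{y_i = 0} \ln\!\left[ 1 - \Phi\!\left(\frac{x_i'\beta}{\sigma}\right) \right] \\[6pt]
&\text{Heckman: } s_i = 1[z_i'\gamma + u_i > 0],\quad y_i = x_i'\beta + \varepsilon_i \text{ observed only if } s_i = 1, \\
&\qquad (u_i, \varepsilon_i) \text{ bivariate normal},\ \operatorname{Var}(u_i) = 1,\ \operatorname{Var}(\varepsilon_i) = \sigma^2,\ \operatorname{Corr}(u_i,\varepsilon_i) = \rho \\
&\ln L = \sum_{s_i = 0} \ln\!\left[ 1 - \Phi(z_i'\gamma) \right]
       + \sum_{s_i = 1} \left[ -\ln\sigma + \ln\phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right)
       + \ln\Phi\!\left(\frac{z_i'\gamma + \rho\,(y_i - x_i'\beta)/\sigma}{\sqrt{1-\rho^2}}\right) \right]
\end{aligned}
```

The structural difference visible here is that the Tobit uses a single index $x_i'\beta$ for both the censoring probability and the level of the outcome, while the Heckman model has a separate selection index $z_i'\gamma$ linked to the outcome equation through $\rho$.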