ESB2021 Resit With Solution

Uploaded by

so hozen
ESB – Analytics – Resit Exam 2021/22

[Correct answers in bold]

Question 1 – MCQ (25 Marks, 2.5 points for each MCQ Question)

1. Suppose you have estimated the following regression model on a
representative sample of the UK population:

$$Weight_i = \beta_0 + \beta_1 FEMALE_i + \beta_2 AGE_i + \beta_3 AGE_i^2 + \epsilon_i$$

where $Weight_i$ is the weight of a person in 100s of kilograms. Suppose you find
that $\beta_1 = -0.10$. Which of the following is a correct interpretation?

a. The average weight for women is 10kg

b. Women – on average – weigh 10 kg less than men for a given age

c. Women – on average – weigh 10% less than men for a given age

d. None of the above.

2. For the model from the previous question: suppose you find that $\beta_2 = 0.01$
and $\beta_3 = -0.0001$. What does this tell you about the age at which we expect
people to be heaviest?

a. At 25 years

b. At 50 years

c. At 40 years

d. None of the above
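
A sketch of the calculation behind this question, assuming the quadratic age profile from question 1 with $\beta_2$ and $\beta_3$ as the coefficients on AGE and AGE squared (the intermediate steps are not part of the original exam text):

```latex
\begin{aligned}
\frac{\partial Weight_i}{\partial AGE_i} &= \beta_2 + 2\beta_3 AGE_i = 0 \\
AGE^{*} &= -\frac{\beta_2}{2\beta_3} = -\frac{0.01}{2 \times (-0.0001)} = 50
\end{aligned}
```

Because $\beta_3 < 0$ the profile is concave, so the stationary point is a maximum. Under these assumptions expected weight peaks at about 50 years, which corresponds to option b.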

3. Using a univariate model you obtain a parameter estimate $\beta = 2.1$ with a
standard error of 10 obtained using a sample size of 1000 observations.
Should you reject the hypothesis that $\beta = 1$ at the 5% level?

a. Yes
b. No

c. It depends on the variance of the residuals

d. It depends on the variance of the explanatory variable
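
The test can be sketched as follows, assuming a two-sided test and the usual large-sample normal critical value of 1.96 (both assumptions; the exam text only states the 5% level):

```latex
t = \frac{\hat{\beta} - 1}{\mathrm{se}(\hat{\beta})} = \frac{2.1 - 1}{10} = 0.11,
\qquad |t| = 0.11 < 1.96
```

On this reading the hypothesis $\beta = 1$ is not rejected (option b). The standard error already reflects the residual variance and the variation in the explanatory variable, so no further information is needed.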

4. The demand for a new drug is known to be linear and downward sloping; i.e.
a higher price means a lower demand. A researcher provided an estimate of
this demand curve but suspects that a confounding factor led to a downward
bias. This means that the estimated curve is

a. Flatter than it should be

b. Steeper than it should be

c. Upward sloping

d. None of the above

5. The probability that a standardised normal variable takes a value of zero or
less is:

a. Equal to zero.

b. 50%

c. 95% for a significance level of 0.1

d. I need more information on the variance and probability distribution to
answer this question.

6. The R output below shows a regression of COVID19 cases per student among
US Universities in Fall 2020. The variable partyrank ranks Universities
according to the quality of the local party scene (i.e. the University with the
best party scene has rank 1). What does the regression suggest about the
relationship between party rank and covid cases?

#simple regression of case on partyrank
lm(casesOstudent~partyrank,datafinal2) %>% summary()

Call:
lm(formula = casesOstudent ~ partyrank, data = datafinal2)

Residuals:
     Min       1Q   Median       3Q      Max
-0.05011 -0.02694 -0.01124  0.01172  0.20685

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.937e-02  4.823e-03  12.310  < 2e-16 ***
partyrank   -5.224e-05  1.205e-05  -4.336 2.08e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.04017 on 260 degrees of freedom
Multiple R-squared: 0.06743, Adjusted R-squared: 0.06384
F-statistic: 18.8 on 1 and 260 DF, p-value: 2.078e-05

Moving to one lower rank (e.g. from rank 4 to rank 5) ...

a. ... leads to 5.2 fewer students being affected by COVID

b. ... leads to 5.2% fewer COVID cases.

c. ... leads to 5.2 fewer COVID cases per 1000 students.

d. None of the above
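
To check the units, the slope from the output can be rescaled by hand. The following is a minimal sketch in R; the variable names are illustrative and only the coefficient value is taken from the output above.

```r
# Slope on partyrank, copied from the regression output above.
# The dependent variable is COVID cases per student.
beta_partyrank <- -5.224e-05

# Moving from rank 4 to rank 5 raises partyrank by one unit.
change_per_student <- beta_partyrank * 1

# Rescale to cases per 1000 students to compare with option c.
change_per_1000 <- change_per_student * 1000

print(change_per_1000)
```

On this reading a one-rank move is associated with roughly 0.05 fewer cases per 1000 students, not 5.2, and the coefficient is neither a percentage nor a head count, so none of options a to c matches.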

7. The figure below shows the result of a Monte-Carlo study of a parameter
estimate. It's a density plot of the estimated parameter for a large number of
replications along with the true parameter value (vertical line).

a. The estimate is unbiased.

b. The estimate is upward biased.

c. The estimate is downward biased

d. There is not enough information to tell

8. The following R output provides results using a dataset from the UK Health
and Lifestyle Survey (1984-85). In this survey, several thousand people in the
UK were asked questions about their health and lifestyle. The variable
bmi records the body mass index (BMI) of the respondents. The BMI uses
weight and height to work out whether a weight is healthy or whether someone is
overweight. A value between 18.5 and 24.9 indicates a healthy weight. The
variable region is a categorical variable recording the region in which a
respondent is based. According to the output provided, which region is the
least overweight region (on average)?

a) London
b) Scotland
c) Wales
d) South East

summary(halsx$bmi)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   12.31   21.71   23.97   24.54   26.74   55.61    1700

table(halsx$region)
##
##          wales          north     north west   yorks/humber  west midlands
##            498            540           1092            808            823
##  east midlands    east anglia     south west     south east greater london
##            682            333            720           1607            943
##       scotland
##            925

summary(lm(bmi~ region, halsx))

##
## Call:
## lm(formula = bmi ~ region, data = halsx)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12.3808  -2.8505  -0.5398   2.2378  30.3695
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)           25.2405     0.2071 121.860  < 2e-16 ***
## regionnorth           -0.5668     0.2840  -1.996  0.04598 *
## regionnorth west      -0.5400     0.2484  -2.174  0.02973 *
## regionyorks/humber    -0.6353     0.2608  -2.436  0.01487 *
## regionwest midlands   -0.7341     0.2626  -2.796  0.00519 **
## regioneast midlands   -0.5497     0.2694  -2.040  0.04135 *
## regioneast anglia     -0.6755     0.3183  -2.122  0.03385 *
## regionsouth west      -0.4772     0.2676  -1.783  0.07455 .
## regionsouth east      -1.1507     0.2349  -4.899 9.82e-07 ***
## regiongreater london  -1.2294     0.2561  -4.801 1.61e-06 ***
## regionscotland        -0.3269     0.2560  -1.277  0.20161
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.08 on 7260 degrees of freedom
## (1700 observations deleted due to missingness)
## Multiple R-squared: 0.006982, Adjusted R-squared: 0.005614
## F-statistic: 5.105 on 10 and 7260 DF, p-value: 1.825e-07
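
One way to read the dummy coefficients is to turn them back into region averages. The sketch below assumes that wales is the omitted baseline category (it is the only region in the frequency table without its own coefficient), so the intercept is the Wales mean and every other mean is the intercept plus the region coefficient. The object names are illustrative; the numbers are copied from the output above.

```r
# Intercept = average BMI in the omitted baseline region (assumed: wales).
intercept <- 25.2405

# Region coefficients = difference in average BMI relative to the baseline.
region_coefs <- c(
  "north"          = -0.5668,
  "north west"     = -0.5400,
  "yorks/humber"   = -0.6353,
  "west midlands"  = -0.7341,
  "east midlands"  = -0.5497,
  "east anglia"    = -0.6755,
  "south west"     = -0.4772,
  "south east"     = -1.1507,
  "greater london" = -1.2294,
  "scotland"       = -0.3269
)

# Implied average BMI per region.
region_means <- c("wales" = intercept, intercept + region_coefs)

# Region with the lowest average BMI.
lowest_region <- names(which.min(region_means))
print(sort(region_means))
print(lowest_region)
```

Under that assumption the most negative coefficient identifies the region with the lowest average BMI, which is Greater London (option a).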

9. Suppose you have estimated the following equation describing the
relationship between a wind turbine’s monthly electricity output (in MWh) and
the age of a turbine:

$$E = 1000 + 30\,AGE - AGE^2 + \epsilon$$

Based on this, at what age would we expect the highest output?

a) 15 years
b) 5 years
c) 30 years
d) There is not enough information to tell.
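
A short worked derivation, assuming AGE is measured in years and the estimated equation is taken at face value:

```latex
\begin{aligned}
\frac{\partial E}{\partial AGE} &= 30 - 2\,AGE = 0 \quad\Longrightarrow\quad AGE^{*} = 15 \\
\frac{\partial^2 E}{\partial AGE^2} &= -2 < 0
\end{aligned}
```

The negative second derivative indicates a maximum, so under these assumptions expected output peaks at an age of 15 (option a).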

10. A research methods professor (sponsored by a multinational fast food chain)
conducts an experiment among her students. At the beginning of the year she
randomly selects half of the 100 students she teaches. These 50 students are
given a voucher that lets them consume as much as they want, free of charge,
in the outlets of the fast-food chain for the entire academic year. At the end
of the year all students’ weight is measured. The professor notes that the
students with the free voucher have a significantly higher weight than those
without. However, the professor is interested in the effect of free fast food
on academic performance and therefore runs regressions of the form:

$$GPA = \beta_0 + \beta \times FF + \epsilon$$

where GPA is the grade point average of the student throughout the year
and FF is a dummy variable equal to 1 if the student received a free fast food
voucher.
Which of the following would lead to a biased estimate of the causal impact of
fast food vouchers on academic performance?

a) Running the regression without further control variables.
b) Including the weight of the student at the beginning of the year.
c) Including a variable capturing whether the student was off sick during the year.
d) Including the gender of the student.

Question 2 (25 Marks, 5 for each sub question)


Download the following dataset:
https://www.dropbox.com/s/y9blrodauw9k4ya/hotels-vienna.csv?dl=1
This dataset contains hotel prices (in euros) for hotels in Vienna (price) along with TripAdvisor
ratings (ratingsa) for those hotels (ratings range from 1 = poor to 5 = top).
a) Run a regression of prices on ratings. What does the regression output suggest about
the relation between ratings and prices?
b) Can you suggest a causal mechanism that would motivate the finding reported in the
regression; i.e. a reason for why ratings could have a causal effect on prices?
c) Discuss a mechanism that might lead to a bias in the reported regression; i.e. a
reason why the causal effect from ratings to prices might actually be systematically
higher or lower than what is reported in the R output. Explain if you would expect an
upward or downward bias and why.
d) Run a regression where you include the variable distance (defined as the distance in
km from the centre of town) as an additional explanatory variable. Explain whether this
could provide a better causal estimate of the impact of ratings on price.

e) Now include distance squared as an additional explanatory variable. Assume you can
interpret this regression causally. What does it tell you about the relationship
between prices and distance? What is the impact of an additional km of distance on
price at 2 km from the centre? Can you identify a distance from the centre at which
distance has no more impact on price?
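
The mechanics of part e) can be sketched in R. To keep the block self-contained it uses simulated data rather than the hotels file: the data frame `hotels_sim`, the true coefficients (200, -40, 4) and the noise level are all made up for illustration, and the rating variable is left out for brevity. With the real data, the same formulas would be applied to the estimated coefficients on distance and distance squared.

```r
# Simulated stand-in for the hotels data (NOT the real dataset).
set.seed(42)
n <- 500
distance <- runif(n, min = 0, max = 8)                              # km from centre
price <- 200 - 40 * distance + 4 * distance^2 + rnorm(n, sd = 10)   # assumed truth
hotels_sim <- data.frame(price = price, distance = distance)

# Quadratic specification: price on distance and distance squared.
fit <- lm(price ~ distance + I(distance^2), data = hotels_sim)

b1 <- unname(coef(fit)["distance"])
b2 <- unname(coef(fit)["I(distance^2)"])

# Marginal effect of one more km at distance d: d(price)/d(distance) = b1 + 2*b2*d.
marginal_effect <- function(d) b1 + 2 * b2 * d
effect_at_2km <- marginal_effect(2)

# Distance at which the marginal effect is zero: -b1 / (2*b2).
flat_distance <- -b1 / (2 * b2)

print(effect_at_2km)
print(flat_distance)
```

In the simulation the true marginal effect at 2 km is -40 + 2 * 4 * 2 = -24 and the effect vanishes at 5 km; the estimates should land close to those values. A positive coefficient on the squared term means the price penalty for distance shrinks as one moves away from the centre.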

Question 3 (25 Marks, 5 for each sub question)


Download the following dataset:
https://www.dropbox.com/s/f578hptuj9szf12/worldbank-immunization-panel.csv?dl=1
This contains data on child mortality (mort: number of deaths of under-5-year-olds per 1000
live births) for a panel of countries, recorded annually from 1998 to 2017.
The variable imm is the percentage of children aged 12-23 months who have had a measles
immunization.

(a) Run a regression of mort on imm. Provide an interpretation of the parameter related
to imm.
(b) Would you say that the regression reported above provides a causal estimate of the
impact of immunization? Can you suggest a reason why there might be a bias? Discuss
the possible direction of the bias.
(c) Now include year and country fixed effects in the regression from part (a). Discuss the
merits (or lack thereof) of this specification for establishing the causal effect of
immunization on mortality.

(d) What do the results from part (c) suggest about the worldwide trend in child mortality
from 1999 onwards? How much lower or higher is child mortality in 2007 compared to
1999?
(e) Add GDP per capita (gdppc) as an additional explanatory variable to the specification
from part (c). Discuss why this might be a good idea. Could there also be reasons why
it is problematic? Discuss the results shown below. How does this affect the
coefficient for imm?
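
The contrast between parts (a) and (c) can be sketched in R with a simulated panel. Everything in the block is an assumption for illustration: the data frame `panel`, the 30 hypothetical countries, the true imm effect of -0.5 and the way country effects are correlated with immunization. It is not the World Bank data and says nothing about the actual estimates.

```r
# Simulated country-year panel (NOT the World Bank data).
set.seed(1)
countries <- paste0("c", 1:30)
years <- 1998:2017
panel <- expand.grid(country = countries, year = years,
                     stringsAsFactors = FALSE)

# Unobserved country effect: higher values mean higher mortality
# and, by construction, lower immunization (a confounder).
alpha <- rnorm(length(countries), sd = 20)
names(alpha) <- countries
panel$alpha <- alpha[panel$country]

panel$imm <- 70 - 0.3 * panel$alpha + rnorm(nrow(panel), sd = 4)
panel$mort <- 60 + panel$alpha - 1.5 * (panel$year - 1998) -
  0.5 * panel$imm + rnorm(nrow(panel), sd = 3)

# Part (a): pooled regression of mort on imm.
pooled <- lm(mort ~ imm, data = panel)

# Part (c): add country and year fixed effects as dummy variables.
fe <- lm(mort ~ imm + factor(country) + factor(year), data = panel)

b_pooled <- unname(coef(pooled)["imm"])
b_fe <- unname(coef(fe)["imm"])
print(c(pooled = b_pooled, fixed_effects = b_fe))
```

In this simulation the pooled coefficient is far more negative than the true effect because countries with high unobserved mortality also immunize less, while the fixed-effects coefficient is close to -0.5. Fixed effects only remove confounders that are constant within a country or common to all countries in a year; time-varying country-specific factors would still bias the estimate.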

Question 4 (25 Marks, 5 for each sub question)


Below you see a table that is reported in a recent paper. The authors examine daily crime
and air pollution data across boroughs of London in 2004-05. The dependent variable is the
log of the number of crimes committed on a particular day. The main explanatory variable is
an air quality index (AQI) that ranges from 1 (best air quality) to 100 (worst air quality) and
is recorded in units of 10; i.e. if the AQI is 10 the explanatory variable will be 1.

(a) Consider column 1. How can we interpret the regression coefficient reported there?
(b) Can you propose a mechanism that would lead to a causal effect from air quality to
crime?
(c) Columns 3 to 5 include a variety of fixed effects as control variables. Namely: Ward,
Day of week (DOW) and year-month fixed effects. Explain why these might help in
getting a better estimate of the causal effect of pollution. Can you also discuss at
least one confounding factor that is not addressed by these control variables?
(d) The authors propose to use the wind direction on a particular day in a particular
ward interacted with broad city location (central, north, south, east, west) as an
instrumental variable to deal with any remaining confounding factors that might exist
even after including all the fixed effects discussed in part (c). Explain why this might
help. Can you also discuss potential issues that might invalidate this instrumental
variable strategy?
(e) Columns 3 to 6 provide results from an instrumental variable estimation using wind
speeds. Discuss this result. If wind speed is indeed a valid instrument, what do the
results suggest about the direction of the bias in the original regression (repeated in
columns 1 and 2)? Which confounding mechanism would be consistent with this kind
of bias?
