Econ7020X FinalReview (Answers)
Uploaded by Afra Jagger

List and define the four Gauss Markov Assumptions.

Assumptions of the ordinary least squares linear regression model:

 Linearity
o The relationship between the independent and dependent variables is linear in the parameters.
o Y = β0 + β1·X + ε
o Y = β0 + β1·X² + ε → let W = X², so Y = β0 + β1·W + ε is still linear in the parameters.
 Strict Exogeneity
o The independent variable X should not be correlated with the error term.
 Homoskedasticity
o The error term has constant variance across observations.
 No perfect multicollinearity
o No independent variable is an exact linear combination of the other regressors.

BLUE – OLS is the best linear unbiased estimator when all four Gauss-Markov assumptions hold.

Explain which Gauss-Markov assumption each of these models breaks and give a potential solution.
1. The Cobb-Douglas production function is a classic example of a production function
often used in macroeconomics. It is commonly expressed as:
Y = A·K^α·L^β
 Y is the total production (output),
 K is the amount of capital input,
 L is the amount of labor input,
 A represents the total factor productivity, and
 α and β are the output elasticities of capital and labor, respectively.

This violates the linearity assumption: the parameters α and β appear in the exponents, so the functional form is not linear in the parameters.

Taking logarithms linearizes the model:
ln(Y) = ln(A·K^α·L^β) = ln(A) + α·ln(K) + β·ln(L)
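The log-linearized model can be estimated by ordinary OLS. A minimal sketch, using simulated data with made-up parameter values (A = 2, α = 0.3, β = 0.7):

```python
import numpy as np

# Simulate output from Y = A * K^alpha * L^beta with a multiplicative error,
# then recover alpha and beta by least squares on the log-linearized model.
rng = np.random.default_rng(0)
n = 500
K = rng.uniform(1, 100, n)                      # capital input
L = rng.uniform(1, 100, n)                      # labor input
A, alpha, beta = 2.0, 0.3, 0.7                  # hypothetical true values
Y = A * K**alpha * L**beta * np.exp(rng.normal(0, 0.05, n))

# Regress ln(Y) on a constant, ln(K), and ln(L)
X = np.column_stack([np.ones(n), np.log(K), np.log(L)])
coef, *_ = np.linalg.lstsq(X, np.log(Y), rcond=None)
print(coef)  # intercept estimates ln(A); the slopes estimate alpha and beta
```

Note that the intercept recovers ln(A), not A itself, so total factor productivity is exp(intercept).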
2. If we are constructing a model of precipitation and the variables we include in our model
are:
 Relative humidity in percentage
 Cloud coverage in percentage
 Temperature in degrees Celsius
 Temperature in degrees Fahrenheit
R = β0 + β1·H + β2·L + β3·T_C + β4·T_F + ε

Temperature in degrees Celsius and temperature in degrees Fahrenheit are exactly linearly related: Fahrenheit = 32 + (9/5)·Celsius. This breaks our assumption of no perfect multicollinearity.

Solution: exclude one of the temperature variables, e.g.
R = β0 + β1·H + β2·L + β3·T_C + ε
(What is the relationship between β3 and β4?)
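The problem can be seen directly in the design matrix. A sketch with made-up weather data: including both temperature columns makes the matrix rank-deficient, so (X'X) is singular and OLS has no unique solution.

```python
import numpy as np

# Hypothetical weather data for the precipitation model above.
rng = np.random.default_rng(1)
n = 50
H = rng.uniform(0, 100, n)        # relative humidity (%)
C = rng.uniform(0, 100, n)        # cloud coverage (%)
Tc = rng.uniform(-10, 35, n)      # temperature (Celsius)
Tf = 32 + 9 / 5 * Tc              # temperature (Fahrenheit): exact function of Tc

X_both = np.column_stack([np.ones(n), H, C, Tc, Tf])   # 5 columns, but one is redundant
X_drop = np.column_stack([np.ones(n), H, C, Tc])       # exclude the Fahrenheit column

print(np.linalg.matrix_rank(X_both))  # 4: the columns are linearly dependent
print(np.linalg.matrix_rank(X_drop))  # 4: full column rank
```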

3. A simple model of housing prices (Y) based on square footage (X). Consider the error
terms.

Sqft 1,000 → price 200,000
Sqft 1,500 → price 300,000
Sqft 15,000 → price 30,000,000
Sqft 15,000 → price 3,000,000

Y = β0 + β1·X + ε

This likely breaks homoskedasticity: as the square footage of the house increases, the errors are likely to be more dispersed (the two 15,000 sqft homes differ in price by an order of magnitude). A potential solution is a logarithm transformation.

4. Suppose we run a study examining the effect of education on income, where the
researcher fails to account for unobserved factors such as natural ability or motivation. If
these unobserved factors are correlated with both education and income, what assumption
is violated?
Strict exogeneity, because the unobserved factors are causing correlation between the regressor
education and the error term.
Income = β0 + β1·education + ε

corr(Ability, educ) ≠ 0, corr(Ability, inc) ≠ 0

When exogeneity is broken, we can use instrumental variables (IV) estimation to address this endogeneity.
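A two-stage least squares (2SLS) sketch on simulated data: ability is unobserved and drives both education and income, and Z is a hypothetical instrument that shifts education but has no direct effect on income. All names and parameter values here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
ability = rng.normal(0, 1, n)              # unobserved confounder
Z = rng.normal(0, 1, n)                    # instrument: relevant, excluded from income
educ = 12 + 0.8 * Z + 0.9 * ability + rng.normal(0, 1, n)
income = 10 + 2.0 * educ + 3.0 * ability + rng.normal(0, 1, n)  # true effect: 2.0

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
b_ols = ols(np.column_stack([ones, educ]), income)[1]   # biased up by omitted ability

# Stage 1: regress education on the instrument; Stage 2: use the fitted values
stage1 = ols(np.column_stack([ones, Z]), educ)
educ_hat = np.column_stack([ones, Z]) @ stage1
b_iv = ols(np.column_stack([ones, educ_hat]), income)[1]

print(b_ols, b_iv)  # OLS overstates the effect; IV is close to the true 2.0
```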

Explain what a dummy variable is, when they are useful, and what the interpretation of the effect
of a dummy variable is.
E.g., for a dummy variable D, what is the interpretation of γ?
Y = β0 + β1·X1 + γ1·D1 + ε

A dummy variable is a binary variable that takes the value zero or one, and is typically useful for denoting membership in a category or possession of a characteristic.
Examples: Is_female, Is_black, Is_approved, Attended_college
Education via dummy variables:
Less than high school, high school diploma, some college, associate's degree, bachelor's, master's, doctoral, professional

Index  Is_female  Attended_college
0      0          0
1      0          1
2      1          1
3      1          1

Y = β0 + γ1·D1 + ε

γ is the difference in the expected value of Y between observations with D = 1 and the baseline group with D = 0, holding any other regressors constant. In the example above, with attending college as Y and Is_female as D, γ is the marginal effect of being a woman on the expected probability of attending college.
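This interpretation can be checked numerically on made-up data: in Y = β0 + γ·D + ε, the OLS estimate of γ equals the difference in sample means of Y between the D = 1 group and the D = 0 baseline.

```python
import numpy as np

rng = np.random.default_rng(5)
D = rng.integers(0, 2, 1000)                    # binary dummy
Y = 1.0 + 0.4 * D + rng.normal(0, 0.1, 1000)    # hypothetical true gamma = 0.4

X = np.column_stack([np.ones(D.size), D])
b0, g = np.linalg.lstsq(X, Y, rcond=None)[0]
print(g, Y[D == 1].mean() - Y[D == 0].mean())   # the two numbers coincide
```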

Explain the dummy variable trap, and how to avoid it.


If we include a dummy variable for every possible classification or category, then there is linear dependence between the dummy variables and the constant term. The easy solution is to drop one dummy variable from the set of regressors; the omitted category becomes the baseline.
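A sketch of avoiding the trap with pandas (the education labels are illustrative): one dummy per category plus a constant is linearly dependent, since the dummies sum to one in every row; `drop_first=True` omits one category as the baseline.

```python
import pandas as pd

edu = pd.Series(["high school", "some college", "bachelors", "high school",
                 "bachelors", "some college"], name="education")

dummies_all = pd.get_dummies(edu)                  # one column per category: the trap
dummies_ok = pd.get_dummies(edu, drop_first=True)  # one category dropped as baseline

print(dummies_all.shape[1], dummies_ok.shape[1])   # 3 vs 2
print((dummies_all.sum(axis=1) == 1).all())        # rows sum to 1: collinear with a constant
```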

Suppose we estimate an autoregressive, AR(p), model with the following form:


y_t = 5 + y_{t−1} + ε_t

1. What order is this model?


2. Could y t be a stationary process?
3. If this model is estimated using annual data from 2000 to 2020, inclusive, what is N for
this model?

Is this an AR model? Is this an MA model?


AR, because it contains the lagged dependent variable in the set of regressors.
This is an AR(1) model, because the highest-order lag of the dependent variable it contains is the first lag.

No, y contains a unit root: the coefficient on y_{t−1} is 1, so y cannot be stationary.


[Figure: time-series plot titled "Y = Y_{t−1}", values rising from roughly 0 to 180 over the years 1995–2035.]

There are 21 observations (2000–2020 inclusive), but one observation is lost to the lag y_{t−1}, so N = 20 for this model.
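The two points above can be sketched in simulation: with a unit-root coefficient of 1, the level drifts upward by roughly 5 per year instead of reverting to a constant mean, and estimation uses only the 20 pairs (y_t, y_{t−1}).

```python
import numpy as np

# Simulate y_t = 5 + y_{t-1} + e_t for 21 annual observations (2000-2020).
rng = np.random.default_rng(4)
T = 21
y = np.zeros(T)
for t in range(1, T):
    y[t] = 5 + y[t - 1] + rng.normal(0, 1)

print(y[0], y[-1])   # the series trends upward: no constant unconditional mean
print(len(y) - 1)    # 20 usable observations once the lag is formed
```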

Your colleague approaches you about a fancy new trick they learned in Microsoft Excel that allows them to forecast data using a polynomial regression. Excitedly, they explain that by making a graph of a column in their spreadsheet, clicking "add trendline," and formatting the trendline, they can turn that inaccurate linear forecast into a really closely fitting 'polynomial 4' regression. Explain to your colleague what a polynomial 4 regression is, and why it might not be a good idea to use for this dataset, which you happen to know grows exponentially. Why does this forecast look so good? (Specifically, you know the true data generating process is ROUND(0.003*EXP(A2)+A2*2+RANDBETWEEN(0,5), 0), where the A column contains an index from 1 to 10.)
Index Value
1 6
2 4
3 6
4 10
5 15
6 14
7 17
8 26
9 44
10 86

[Figure: "Polynomial Regressions" — scatter of Value against Index (1–10) with two trendlines.
Linear fit: f(x) = 6.788·x − 14.533, R² = 0.666.
Degree-4 polynomial fit: f(x) = 0.0960·x⁴ − 1.7441·x³ + 10.9707·x² − 24.9013·x + 21.75, R² = 0.9989.
Legend: Value, Linear (Value), Polynomial (Value).]

This is a classic example of overfitting. The degree-4 polynomial has five parameters fit to only ten points, so it bends to match sampling variation and fits the data very closely despite not being the true relationship. Out of sample, the polynomial grows like x⁴ while the true process grows exponentially, so the forecasts will diverge. Also, R² is a measure of goodness of fit, not a measure of statistical significance.

Anscombe’s Quartet: descriptive statistics are not everything. Do not only rely on statistics
to justify the model.
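A sketch of the overfitting point, using the ten data points from the table above: the degree-4 polynomial fits almost perfectly in sample, yet extrapolates far below the exponential process (about 0.003·e^x + 2x) that generated the data.

```python
import numpy as np

x = np.arange(1, 11)
y = np.array([6, 4, 6, 10, 15, 14, 17, 26, 44, 86])

coefs = np.polyfit(x, y, deg=4)
fitted = np.polyval(coefs, x)
r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2)                               # in-sample R^2 near 1, as in the chart

# Forecast at x = 15: the true process is roughly 0.003*e^15 + 30 ~ 9,840,
# but the polynomial falls far short of the exponential path.
truth_15 = 0.003 * np.exp(15) + 2 * 15
print(np.polyval(coefs, 15), truth_15)
```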
What is an instrumental variable?
b. What are the two conditions for a valid instrumental variable, and how might we assess the
validity of an instrument?

1. Binary dependent variables (20, 10)


a. In class we discussed three models for binary dependent variables. Give pros and
cons for using a linear probability model vs a probit or logit model.

b. Suppose you are presented with regression results for the following probit model, and your colleague comments that the logit and probit model results cannot be right: there is a 3.5% higher chance of working at the government given a one-year increase in schooling, but the probit model says it should be a 27% increase and the logit says over 55%. How would you respond to this comment?

For context schooling refers to years of schooling, the dependent variable is probability of
working in a government job.

c. How can instrumental variables be used to estimate a model with a binary dependent variable?

Question 1, single variable statistics (10 pts):
Let Y1, Y2, …, Yn be i.i.d. draws from a distribution with mean μ. A test of H0: μ ≥ 10 versus HA: μ < 10 using the usual t-statistic yields a p-value of 0.03.
a. Can we reject the null hypothesis at the 5% significance level (α = 0.05)? Explain.
b. How about at the 1% significance level (α = 0.01)? Explain.

[Draw a figure to explain, if helpful.]


When running a regression analysis, we are often interested in the significance of a coefficient.
Explain what a t-statistic is, how we use it, and how we interpret it.

Mark Questions

11. Which of the following best describes the term "endogeneity" in regression analysis?
a) The independent variable is caused by the dependent variable.
b) The variance of the error term changes over observations.
c) The dependent variable is caused by the independent variable.
d) The error term is correlated with one or more independent variables.

12. Which of the following statistics is commonly used to detect the presence of
autocorrelation in the residuals of a time-series regression?
a) F-statistic
b) T-statistic
c) Durbin-Watson statistic
d) Jarque-Bera statistic

T-statistic of 3 with a standard error of 2


tstat = (β_est − β_hypothesized) / SE(β)

A hypothesized β of zero implies that there is no relationship between the X regressor and the dependent variable Y.
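A sketch of the computation, reading the numbers above as a hypothetical coefficient estimate of 3 with a standard error of 2, tested against the null that β is zero:

```python
beta_est = 3.0    # hypothetical coefficient estimate
se_beta = 2.0     # hypothetical standard error
beta_null = 0.0   # null hypothesis: no relationship

t_stat = (beta_est - beta_null) / se_beta
print(t_stat)  # 1.5: below the usual ~1.96 cutoff, so not significant at the 5% level
```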

13. In the context of OLS regression, which assumption, if violated, can lead to
heteroskedasticity?
a) No endogeneity
b) Linearity in parameters
c) Constant variance of errors
d) No multicollinearity

14. What is the primary purpose of adding dummy variables in a regression model?
a) To account for autocorrelation
b) To represent categorical variables
c) To handle missing data
d) To improve the R-squared value

15. In a simple linear regression, if all data points lie perfectly on a straight line, what will be
the value of R-squared?
a) 0
b) 0.5
c) 0.75
d) 1
Regression Results
Which of the following coefficients are statistically significant results?
Interpret the base case model.

Data Results
Without reading the full paper, tell me everything you can about this research design and its findings, just from this data visualization.
