
ProblemSet Solution

The document discusses various econometric models and their estimation problems, including specification errors due to omitted variables and issues of multicollinearity. It analyzes the impact of different variables on food expenditure, office rental prices, wages, stock prices, and educational outcomes, highlighting the importance of including relevant factors for accurate predictions. Additionally, it addresses potential non-linearity and outlier problems in the data, providing insights into model efficiency and explanatory power.


DATA ANALYSIS FOR ECONOMICS

PROBLEM SET 5: ESTIMATION PROBLEMS - SOLUTION

1 We have the following variables:

Y: Food expenditure in USA.

X: Family income.

P: Price index.

Two different regressions are estimated with the following estimation results
(standard errors are in brackets and sample size is 500):

Regression    Coefficient for X    Coefficient for P    Adjusted R-squared
Y / P                              2.462                0.614
                                   (0.407)
Y / X; P      0.112                -0.739               0.978
              (0.003)              (0.114)

Find and discuss the specification error from which the first model suffers.
Explain it using the estimation results in the above table.

The estimation problem in the first regression model is the omission of a
relevant explanatory variable (X). Omitting X biases the OLS estimator of the
effect of P on Y upward: the estimated effect is greater than it should be.
In fact, $\hat{\beta}_2 = 2.462$ in the first model whereas
$\hat{\beta}_2 = -0.739$ in the second model, so the estimate even changes
sign. Additionally, the OLS estimator is less precise in the first model than
in the second (compare the standard errors of $\hat{\beta}_2$: 0.407 versus
0.114). Finally, the coefficient of determination of the first model is
unreliable: because a relevant factor is omitted, the explanatory power of the
model (adjusted R-squared of 0.614) is lower than the one obtained when X is
included (0.978).
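The sign reversal described above can be reproduced on artificial data. The following is a minimal sketch, not the data behind the table: the sample is simulated, and the coefficients (0.1 on x, -0.7 on p), the correlation between x and p, and the helper name `ols` are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical simulated data: none of these numbers come from the problem set.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(50.0, 10.0, n)                  # stand-in for family income
p = 1.0 + 0.05 * x + rng.normal(0.0, 0.3, n)   # stand-in for the price index, correlated with x
y = 2.0 + 0.1 * x - 0.7 * p + rng.normal(0.0, 0.5, n)


def ols(dep, *regressors):
    """Return OLS coefficients (intercept first) computed by least squares."""
    design = np.column_stack([np.ones(len(dep)), *regressors])
    coef, *_ = np.linalg.lstsq(design, dep, rcond=None)
    return coef


short = ols(y, p)      # regression of y on p only (x omitted)
long = ols(y, x, p)    # regression of y on x and p

print("slope on p with x omitted: ", short[1])
print("slope on p with x included:", long[2])
```

Because x raises y and is positively correlated with p in this simulated design, the regression that omits x attributes part of the effect of x to p, which pushes the estimated slope on p upward and here makes it positive.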

2 We have estimated a simple linear regression model (SLRM) explaining office
rental prices in the city of Madrid (Y) using the distance to the city center
(X). The following two graphs, Figure 1 (Y versus X) and Figure 2 (residuals
versus fitted values of Y), are related to the above model.


[Figure 1: Y versus X]
[Figure 2: residuals versus fitted values of Y]

a- Discuss, according to the two graphs, whether the model may suffer from a
non-linearity problem.
According to Figure 1, the relationship between Y and X appears to be
non-linear, with a diminishing marginal effect: as the distance to the city
center increases, the negative effect of distance on office rental prices
becomes smaller.

In the plot of the residuals versus the fitted values (Figure 2), the
residuals appear to follow a systematic pattern instead of scattering
randomly around zero. OLS residuals are uncorrelated with the fitted values
by construction, so the warning sign is not a linear association but a
systematic pattern in the plot. This signals a non-linearity problem, because
in a correctly specified linear model the residuals should show no
relationship with the predicted values of the dependent variable. This
figure is consistent with the analysis of Figure 1.

b- Provide an economic reason explaining the possible non-linearity in the
above relationship.
As you move further away from the city center, you do not expect the negative
effect of an additional unit of distance on rental prices to be as large as
when you are very close to the city center. This is why the relationship
between rental prices and distance may show a diminishing marginal effect.

c- How should Figure 2 look if the relationship between office rental prices
and distance were linear?

If the relationship were linear, Figure 2 should be a random cloud of points
around zero with no systematic pattern, signaling that the residuals are
unrelated to the predicted values of the dependent variable and therefore
that the linearity assumption is satisfied.
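The diagnostic discussed in this exercise can be illustrated with a short sketch. The data below are hypothetical (rent is set to 100 / distance purely to create a curved relationship), and the variable names are assumptions, not part of the original exercise.

```python
import numpy as np

# Hypothetical data: rent falls with distance, but at a decreasing rate.
distance = np.linspace(1.0, 20.0, 50)
rent = 100.0 / distance

# Fit the (misspecified) straight line rent = b0 + b1 * distance by OLS.
b1, b0 = np.polyfit(distance, rent, 1)
fitted = b0 + b1 * distance
resid = rent - fitted

# OLS residuals are uncorrelated with the fitted values by construction ...
corr = np.corrcoef(fitted, resid)[0, 1]
print("corr(fitted, resid):", corr)

# ... yet they follow a systematic pattern: positive at both ends of the
# sample and negative in the middle, which is what a curved residual plot shows.
print("residuals at nearest, middle and farthest office:",
      resid[0], resid[25], resid[-1])
```

In this sketch the linear correlation is numerically zero while the residuals still change sign in a regular way along the fitted line, which is the kind of pattern that a residuals-versus-fitted plot is meant to reveal.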


3 There is an econometric study at IE University which relates the average
grade in Econometrics to the time students spend on different activities
during the week. Some students are asked how many hours they spend on four
different activities: study, sleep, work and leisure. Any activity must be
included in one of these four categories, such that the time spent on the
four activities adds up to 168 hours for each student.

The model is the following:

$GRADE = \beta_0 + \beta_1\,study + \beta_2\,sleep + \beta_3\,work + \beta_4\,leisure + u$

a- Find the assumption that does not hold in this model and explain why.
The above model suffers from a perfect multicollinearity problem because:
$study_i + sleep_i + work_i + leisure_i = 168 \quad \forall i$

That is, there is an exact linear relationship among the explanatory
variables included in the regression model (their sum is a multiple of the
constant term). Therefore, the OLS minimization problem has no unique
solution and the coefficients cannot be estimated.

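b- How would you rewrite the model in order to solve the problem?

One possible solution is to drop one regressor from the model, for example
leisure:

$GRADE = \beta_0 + \beta_1\,study + \beta_2\,sleep + \beta_3\,work + u$

Each slope is then interpreted as the effect of one more hour of that
activity at the expense of one hour of leisure.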
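The rank condition behind this exercise can be checked numerically. This is a minimal sketch on made-up weekly hours (the sample size, the ranges of hours and the variable names are assumptions): with all four activities plus a constant the design matrix loses one rank, and dropping leisure restores full column rank.

```python
import numpy as np

# Hypothetical weekly hours for 20 students; leisure is whatever remains of 168.
rng = np.random.default_rng(1)
n = 20
study = rng.integers(10, 40, n)
sleep = rng.integers(40, 60, n)
work = rng.integers(0, 30, n)
leisure = 168 - study - sleep - work
const = np.ones(n)

# Design matrix with all four activities, and the same matrix without leisure.
X_full = np.column_stack([const, study, sleep, work, leisure]).astype(float)
X_drop = np.column_stack([const, study, sleep, work]).astype(float)

rank_full = np.linalg.matrix_rank(X_full)
rank_drop = np.linalg.matrix_rank(X_drop)

print("columns / rank with leisure:   ", X_full.shape[1], rank_full)
print("columns / rank without leisure:", X_drop.shape[1], rank_drop)
```

When the rank is below the number of columns, X'X is singular and the normal equations do not have a unique solution, which is the formal version of the statement that the coefficients cannot be estimated.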

4 We have representative data on 30-year-old people in the US.
Levine, Gustafson and Velenchik (1997) estimated a wage equation using
the following variables:

Y = log(wage)

F = a dummy variable that takes a value of 1 for smokers and 0 otherwise

ED = years of education

Two specifications are considered:

MODEL 1 (omitting education):   Y = 5.61 - 0.176 F
                                           (0.031)

R-squared = 0.35

MODEL 2 (including education):  Y = 3.78 - 0.080 F + 0.070 ED
                                           (0.021)   (0.0004)

R-squared = 0.68

Compare the two fitted models and explain what happens when we omit
one relevant variable (in this case, years of education).

When years of education are omitted (Model 1), the negative effect of smoking
on wages is overestimated in absolute value compared with the second
regression model: the coefficient on F is -0.176 in Model 1 and -0.080 in
Model 2. In addition, the standard error associated with the effect of
smoking is higher in the first model (0.031) than in the second (0.021).
That is, omitting education makes the estimator in Model 1 less precise than
in Model 2. Finally, if we compare the two regression models in terms of
explanatory power, we can see that Model 2 is better than Model 1: the
R-squared rises from 0.35 to 0.68, so including education helps to explain a
larger share of the variability in log wages. Furthermore, a test of the
individual significance of education rejects the null hypothesis (the t
ratio is 0.070 / 0.0004 = 175), meaning that education is a statistically
significant variable in explaining wages. All of the above indicates that
Model 1 suffers from the omission of a relevant factor (years of education).
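The change in the coefficient on F can also be read through the in-sample omitted-variable identity for OLS. The following is a sketch that assumes both models were fitted on the same sample; the symbol $\tilde{\delta}_1$ is not in the original and denotes the slope from an auxiliary regression of ED on F, and the reported coefficients are rounded, so the result is approximate.

```latex
\underbrace{-0.176}_{\text{coefficient on } F,\ \text{Model 1}}
\;=\;
\underbrace{-0.080}_{\text{coefficient on } F,\ \text{Model 2}}
\;+\;
\underbrace{0.070}_{\text{coefficient on } ED,\ \text{Model 2}} \cdot \tilde{\delta}_1
\quad\Longrightarrow\quad
\tilde{\delta}_1 \approx \frac{-0.176 + 0.080}{0.070} \approx -1.37
```

Because F is a dummy variable, $\tilde{\delta}_1$ is the difference in average years of education between smokers and non-smokers. Under the stated assumptions, the reported coefficients would imply that smokers in the sample have roughly 1.4 fewer years of education, which is why part of the return to education is attributed to smoking when ED is omitted.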

5 We have the following information for the annual growth rates (%) in
different countries about stock prices (Y) and in consumer prices (X):

Country       Stock prices (Y)   Consumer prices (X)   Predicted Y   Estimation residuals
Australia     5                  4.3
Austria       11.1               4.6
Belgium       3.2                2.4
Canada        7.9                2.4
Denmark       3.8                4.2
Finland       11.1               5.5
France        9.9                4.7
Germany       13.5               2.2
India         1.5                4
Ireland       6.4                4
Israel        8.9                8.4
Italy         8.1                3.3
Japan         13.5               4.7
Mexico        4.7                5.2
Netherlands   7.5                3.6
New Zealand   4.7                3.6
Sweden        8                  4
UK            7.5                3.9
USA           9                  2.1

Knowing that: $\hat{y}_i = 6.83 + 0.201\,x_i$

Answer the following questions:

a- Complete the missing values in the above table.

For example, for Australia: $\hat{y}_{AUSTRALIA} = 6.83 + 0.201\,(4.3) = 7.694$

Country       Stock prices (Y)   Consumer prices (X)   Predicted Y   Residual   Normalised residual
Australia     5                  4.3                   7.694         -2.694     -0.823
Austria       11.1               4.6                   7.755         3.345      0.995
Belgium       3.2                2.4                   7.312         -4.112     -1.245
Canada        7.9                2.4                   7.312         0.588      0.170
Denmark       3.8                4.2                   7.674         -3.874     -1.178
Finland       11.1               5.5                   7.936         3.165      0.938
France        9.9                4.7                   7.775         2.125      0.627
Germany       13.5               2.2                   7.272         6.228      1.870
India         1.5                4                     7.634         -6.134     -1.858
Ireland       6.4                4                     7.634         -1.234     -0.383
Israel        8.9                8.4                   8.518         0.382      0.092
Italy         8.1                3.3                   7.493         0.607      0.174
Japan         13.5               4.7                   7.775         5.725      1.712
Mexico        4.7                5.2                   7.875         -3.175     -0.970
Netherlands   7.5                3.6                   7.554         -0.054     -0.026
New Zealand   4.7                3.6                   7.554         -2.854     -0.869
Sweden        8                  4                     7.634         0.366      0.099
UK            7.5                3.9                   7.614         -0.114     -0.045
USA           9                  2.1                   7.252         1.748      0.521
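The predicted values and residuals in the table follow mechanically from the fitted line given in the exercise. A minimal sketch of that computation is shown here; the data and the line come from the exercise, while the function name `predict` and the layout are assumptions made for illustration.

```python
# Fitted line given in the exercise: y_hat = 6.83 + 0.201 * x
INTERCEPT, SLOPE = 6.83, 0.201

# (country, stock prices Y, consumer prices X), as listed in the exercise.
data = [
    ("Australia", 5.0, 4.3), ("Austria", 11.1, 4.6), ("Belgium", 3.2, 2.4),
    ("Canada", 7.9, 2.4), ("Denmark", 3.8, 4.2), ("Finland", 11.1, 5.5),
    ("France", 9.9, 4.7), ("Germany", 13.5, 2.2), ("India", 1.5, 4.0),
    ("Ireland", 6.4, 4.0), ("Israel", 8.9, 8.4), ("Italy", 8.1, 3.3),
    ("Japan", 13.5, 4.7), ("Mexico", 4.7, 5.2), ("Netherlands", 7.5, 3.6),
    ("New Zealand", 4.7, 3.6), ("Sweden", 8.0, 4.0), ("UK", 7.5, 3.9),
    ("USA", 9.0, 2.1),
]


def predict(x):
    """Predicted stock-price growth for consumer-price growth x."""
    return INTERCEPT + SLOPE * x


# Residual = observed Y minus predicted Y.
results = {country: (predict(x), y - predict(x)) for country, y, x in data}

for country, (y_hat, resid) in results.items():
    print(f"{country:<12} {y_hat:7.3f} {resid:7.3f}")
```

The normalised residuals in the last column are not recomputed here, because the exercise does not state the exact mean and standard deviation used to standardise them.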

b- Show both graphically and formally if the above data suffers from an
outlier problem.

[Scatter plot: Stock prices (Y) on the vertical axis versus Consumer prices (X) on the horizontal axis]

According to the scatter plot, there may be an outlier problem related to the
observation for Israel, which behaves slightly differently from the other
countries: it presents the highest consumer price growth, equal to 8.4.

Formally, we have to compare each normalized estimation residual z with the
critical values, such that:

If z > 2.06 or z < -2.06 (critical values that leave a 2% probability in each
of the right and left tails of the normal distribution), then the data point
associated with that estimation residual can be considered an outlier.

Note that the standard deviation of the estimation residuals is 3.3. See the
normalized residuals in the last column of the table in section (a).

None of the normalized estimation residuals satisfies the above conditions
(the largest in absolute value are 1.870 for Germany and -1.858 for India),
and therefore our model does not suffer from a significant outlier problem.
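The formal check can be written as a short sketch that compares the normalised residuals reported in the table of section (a) with the critical value 2.06. The values are copied from that table; the names `CRITICAL`, `normalised` and `outliers` are assumptions made for illustration.

```python
# Critical value used in the exercise (2% in each tail of the normal distribution).
CRITICAL = 2.06

# Normalised residuals as reported in the last column of the table in section (a).
normalised = {
    "Australia": -0.823, "Austria": 0.995, "Belgium": -1.245, "Canada": 0.170,
    "Denmark": -1.178, "Finland": 0.938, "France": 0.627, "Germany": 1.870,
    "India": -1.858, "Ireland": -0.383, "Israel": 0.092, "Italy": 0.174,
    "Japan": 1.712, "Mexico": -0.970, "Netherlands": -0.026,
    "New Zealand": -0.869, "Sweden": 0.099, "UK": -0.045, "USA": 0.521,
}

# An observation is flagged when its normalised residual lies beyond +/- 2.06.
outliers = [country for country, z in normalised.items() if abs(z) > CRITICAL]

# The observation closest to the threshold, for reference.
largest = max(normalised, key=lambda country: abs(normalised[country]))

print("outliers:", outliers)
print("largest |z|:", largest, normalised[largest])
```

Note that Israel, which stands out in the scatter plot because of its X value, has a normalised residual close to zero: the rule above screens for unusual residuals, not for unusual values of the explanatory variable.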

c- If the answer to b is positive, please explain any strategy you would
perform in order to solve the problem.

No strategy is required, as none of the observations is a significant
outlier.

6 We have data for a sample of high schools in Vietnam where the variable
math denotes the percentage of students who passed a math test. We want to
estimate the effect that spending per student has on the outcomes of this
test and propose the following model:


$\log(math) = \beta_0 + \beta_1 \log(spend) + \beta_2 \log(enroll) + \beta_3\,poverty + u$

where poverty describes the percentage of students living below the poverty
line, spend denotes spending per student and enroll is the number of students
enrolled in the high school.

a- We do not have data for the poverty variable, but the variable lnchprg
describes the percentage of students eligible for a programme subsidising
school lunches. Why is this variable a sensible proxy variable for poverty?

Since we do not have data for the poverty variable, we need to find a proxy
(a similar, observable variable that captures the same effect). lnchprg is a
good proxy because students living below the poverty line will be, on
average, students eligible for the programme subsidising school lunches, so
the two percentages should be closely related.

b- The table below shows the OLS estimates with and without the
inclusion of lnchprg:

Explanatory variables    (1)        (2)
log(spend)               0.13       0.75
                         (0.30)     (0.04)
log(enroll)              0.022      -0.66
                         (0.615)    (0.58)
lnchprg                             -0.324
                                    (0.036)
intercept                -0.24      -0.14
                         (0.74)     (0.99)
n                        408        408
R-squared                0.0293     0.1893

Explain why the effects of spending and enrollment are greater in the first
model than in the second one. What happens if we compare the standard errors
of the two models?

In the above table we have a problem of omission of a relevant explanatory
variable. The first model omits the lnchprg variable (a significant
explanatory variable in the second model). One consequence of omitting a
relevant explanatory variable is that the OLS coefficients are biased. In our
example, the coefficients associated with the spend and enroll variables are
biased (they take greater values than in the second model and therefore
overestimate the effect of both variables on the dependent variable). In
addition, comparing the standard errors associated with each of the
explanatory variables, we can see that the estimators in the first model are
less precise (their standard errors are greater than in the second model).

c- What conclusions can you derive when comparing both models?

We can conclude that the second model is a better specification than the
first one because it includes an additional relevant and significant
explanatory variable, the signs of the coefficients are the expected ones,
the standard errors are smaller than in the first model and it has greater
explanatory power than the first model.
