Data Analysis Exam

This document discusses using regression analysis to model relationships between variables. It provides examples of interpreting regression results, testing hypotheses about coefficients, and assessing overall model significance. Summary statistics and estimation output are included for models relating worker satisfaction to absenteeism, crime to college enrollment, and property prices to characteristics.

Uploaded by

ayushi saraf

DATA ANALYSIS FOR ECONOMICS: PS3

PROBLEM SET 3: HYPOTHESIS TESTING


CORRECTION

1 We have a sample of 45 workers employed in a company. We ask each worker to
evaluate her/his satisfaction level at work ($x$) from 0 to 10. We also know, for each worker,
the number of days of labour absenteeism ($y$) last year. A linear regression line is estimated such
that

$\hat{y}_i = \underset{(0.112)}{12.6} - \underset{(0.088)}{1.2}\,x_i$

$R^2 = 0.321$

a- Interpret the estimated regression model and the value of the R-squared

Intercept: $\hat{\beta}_0 = 12.6$: The average number of absenteeism days per year in this
company is 12.6 when workers are totally unsatisfied ($x = 0$).

Slope: $\hat{\beta}_1 = -1.2$: Predicted absenteeism falls by 1.2 days when the satisfaction
level at work increases by 1 point. When the worker is completely satisfied, that is when
$x = 10$, the model predicts that she will be absent 0.6 days.

R-squared: work satisfaction explains 32.1% of the variability of absenteeism
days for this sample of workers.

b- Test the null hypothesis that work satisfaction does not produce any significant effect
on labour absenteeism at a 1% significance level.

The null hypothesis in this case is postulated as follows:

$H_0: \beta_1 = 0$

$H_1: \beta_1 \neq 0$

We construct the t-ratio under the null:

$t = \dfrac{\hat{\beta}_1}{se(\hat{\beta}_1)} \sim t_{45-2}$

$t = \dfrac{-1.2}{0.088} \approx -13.64$

The critical value of the two-tailed test at the 1% significance level is 2.58
(standard normal approximation).

As a result, we have that $|-13.64| > 2.58$, so we reject the null hypothesis at the
1% significance level. We can conclude that work satisfaction is a statistically significant
determinant of absenteeism (at the 1% significance level).
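
The test in (b) can be reproduced with a short script. This is only a sketch: the helper name t_ratio is illustrative and not part of the problem set, and the critical value is taken from the standard normal distribution, which is the approximation behind the 2.58 used above.

```python
from statistics import NormalDist


def t_ratio(coef, se, null_value=0.0):
    """t-ratio for H0: beta = null_value (hypothetical helper)."""
    return (coef - null_value) / se


# Slope estimate and standard error from the estimated regression.
t_stat = t_ratio(-1.2, 0.088)

# Two-tailed 1% critical value under the standard normal approximation.
critical = NormalDist().inv_cdf(1 - 0.01 / 2)

# Reject H0 when the t-ratio exceeds the critical value in absolute terms.
reject = abs(t_stat) > critical
print(f"t = {t_stat:.2f}, critical value = {critical:.2f}, reject H0: {reject}")
```

The same helper applies to every individual significance test in this problem set; only the coefficient and its standard error change.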

c- The level of work satisfaction of a different worker is 6. Find the predicted labour
absenteeism days per year for this worker.
Using the OLS regression function: a worker from that company who reports a
satisfaction level of 6 is predicted to be absent from work 5.4 days per year.

$\hat{y} = 12.6 - 1.2(6) = 5.4$

d- In your opinion, explain one application of the above model from the perspective of
the Human Resources department of the company.

Check work satisfaction frequently in order to try to motivate those workers who
report being unsatisfied, and give a premium to those who are already satisfied.
CURIOSITY: HENRY FORD CASE STUDY

Henry Ford was perhaps the first to put efficiency-wage theories fully into practice.
The Ford Motor Company began to pay its workers $5.00 per day in 1914,
when the average wage was between $2.00 and $3.00 per day. This
significantly increased the number of people waiting in line for a job
at the company. Henry Ford believed that paying above the equilibrium wage
would secure the business for the future.

He seemed to think that paying his workers a higher wage would inevitably
lower costs, and evidence shows that this has been the case in production for the
Ford Motor Company ever since. Worker productivity increased across the board
because workers knew they would not find comparable pay
anywhere else. It created an incentive for them to stay with the Ford company and
work hard.

2 Consider a simple linear regression model (SLRM) relating the annual number of crimes
on college campuses (crime) to student enrollment (enroll) with the following estimation results:

$\widehat{\log(crime)}_i = \underset{(1.03)}{-6.63} + \underset{(0.11)}{1.27}\,\log(enroll)_i$

$n = 97, \quad R^2 = 0.585$

a- Interpret the estimated slope coefficient.

When enrollment on campus increases by 1%, crime increases on average by 1.27%.


b- Carry out a two-tailed test to determine whether the variable enroll should be included
in the regression model (at the 1% significance level).

$H_0: \beta_1 = 0, \quad H_1: \beta_1 \neq 0$

$t_{\hat{\beta}_1} = \dfrac{1.27}{0.11} \approx 11.55$

We reject the null hypothesis at the 1% significance level since 11.55 > 2.58, and
therefore enrollment is statistically significant in explaining the behavior of crimes.

c- Test whether the model is useful at 5% significance level.

In the SLRM this test coincides with the individual significance test in (b): we showed
there that the variable is significant at the 1% level (and hence also at 5%), which is
equivalent to saying that the model is useful.

You could also use the overall test:

$H_0: \beta_1 = 0$ (q = k = 1)

$H_1: H_0$ is not true

$F = \dfrac{R^2/k}{(1 - R^2)/(n - k - 1)} = \dfrac{0.585/1}{(1 - 0.585)/95} \approx 133.92$

Since $F_{1,95}(0.05) = 3.94$, we reject the null, and therefore the model is useful at the 5%
significance level.
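
As a quick check of the overall test, the F statistic can be computed directly from the R-squared. The function name f_overall is illustrative only, and the critical value 3.94 is simply the one quoted in the text rather than computed here.

```python
def f_overall(r2, n, k):
    """Overall-significance F statistic from R-squared.

    n is the number of observations and k the number of slope coefficients.
    """
    return (r2 / k) / ((1 - r2) / (n - k - 1))


# Values reported for the crime/enrollment regression.
f_stat = f_overall(0.585, n=97, k=1)

# With a single regressor, F is (up to rounding of the inputs) the squared t-ratio.
t_squared = (1.27 / 0.11) ** 2

F_CRIT_5PCT = 3.94  # F(1, 95) at 5%, as quoted in the text
print(f"F = {f_stat:.2f}, t^2 = {t_squared:.2f}, reject H0: {f_stat > F_CRIT_5PCT}")
```

The small gap between F and the squared t-ratio comes from using rounded coefficients, standard errors and R-squared; with unrounded estimates the two coincide in a simple regression.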

3 Consider the model: $y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + \beta_5 x_{5i} + u_i$

Explain how you would test the following null hypothesis:

a- $\beta_2 = 0$
I would perform an individual two-tailed t-test for the second explanatory variable
because it is about testing the individual significance of the second explanatory
variable.

b- $\beta_3 = \beta_4 = \beta_5 = 0$

I would perform an F-test for exclusion restrictions because it is about testing the joint
significance of the last three explanatory variables. In order to test the
above hypothesis I would first need to estimate a restricted model including only the first and
the second explanatory variables, and then the unrestricted model with all the explanatory
variables, so that I could compute the F-statistic by comparing the SSR (or R-squared)
of both estimations.


c- $\beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$


I would perform an F-test for overall significance of the model because it is about
testing whether the model is useful. In order to be able to test the above hypothesis
I would need to estimate a model with all the explanatory variables so that I could
compute the F-statistic by using the simplified R-squared form.
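
For reference, the three test statistics described in (a), (b) and (c) can be written out as follows. This is a sketch that uses the same formulas applied in the other exercises, with $k = 5$ regressors so that $n - k - 1 = n - 6$; the sample size $n$ is not given in the exercise.

```latex
% (a) individual significance of x_2
t = \frac{\hat{\beta}_2}{se(\hat{\beta}_2)} \sim t_{n-6}

% (b) exclusion restrictions, q = 3 (restricted model keeps only x_1 and x_2)
F = \frac{(SSR_r - SSR_{ur})/3}{SSR_{ur}/(n-6)} \sim F_{3,\,n-6}

% (c) overall significance, q = k = 5
F = \frac{R^2/5}{(1-R^2)/(n-6)} \sim F_{5,\,n-6}
```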

4 A consultancy firm has been commissioned by a property investment company to
develop a model that will help their managers assess the value of real estate in the London
area. They have information on the following variables: price (property price in pounds);
floorm2 (size of the property in square meters); dholborn (distance in kilometers from the
property to the city center, Holborn tube station); age (age of the property in years) and
buyage (age of the buyer). The estimation results are given in the next Table.

OLS estimates - Dependent variable: log(price)

                 Model 1     Model 2     Model 3
const            10.89       11.68       11.53
                 (0.036)     (0.056)     (0.072)
floorm2          0.012       0.0104      0.009
                 (0.0005)    (0.0005)    (0.0005)
log(dholborn)                -0.291      -0.251
                             (0.017)     (0.018)
age                                      0.002
                                         (0.0002)
buyage                                   -0.0007
                                         (0.001)
n                1199        1199        1199
R^2              0.314       0.451       0.472
SSR              147.188     117.803     113.431
Note: Standard errors in parentheses.
Conduct all hypothesis tests at the 1% significance level.

a- Interpret the OLS slope coefficient of the SLRM. Is size statistically significant?

When the property size increases by one square meter, price increases on average
by approximately 1.2%.
The t-ratio of the variable floorm2 is equal to 0.012/0.0005 = 24, which is much
greater than the critical value of the standard normal at the 1% significance level (2.58).
This implies that size is statistically significant in explaining property prices in London.

b- Interpret the estimated coefficient on the variable log(dholborn) in Model 2. Is
distance statistically significant?


When there is a 1% increase in the distance from the property to the city center, on
average price decreases by 0.291%, keeping size constant.
The t-ratio of the variable log(dholborn) is equal to -17.12 which in absolute value is
much greater than the critical value of the standard normal with 1% significance level
(2.58), which implies that distance is statistically significant to explain property prices
in London.
c- Predict the price of a 120 square meter property that is 7 km away from the Holborn
station using Model 2.

$\widehat{\log(price)} = 11.68 + 0.0104 \cdot 120 - 0.291 \cdot \log(7)$

$\widehat{\log(price)} = 11.68 + 1.248 - 0.566 = 12.36$

$\widehat{price} = e^{12.36} = 233{,}281.23$ pounds
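
The prediction can be checked numerically. The sketch below plugs the Model 2 coefficients from the table into a small helper (predict_log_price is an illustrative name, not part of the exercise) and shows that the figure above is obtained when the predicted log price is rounded to two decimals before exponentiating.

```python
import math


def predict_log_price(floorm2, dholborn_km):
    """Predicted log(price) using the Model 2 coefficients from the table."""
    return 11.68 + 0.0104 * floorm2 - 0.291 * math.log(dholborn_km)


log_price = predict_log_price(120, 7)

# Rounding the log prediction to two decimals first, as in the worked answer.
price_rounded = math.exp(round(log_price, 2))

# Keeping full precision in the log prediction gives a slightly higher figure.
price_unrounded = math.exp(log_price)

print(f"log(price) = {log_price:.4f}")
print(f"price (log rounded to 2 decimals) = {price_rounded:,.2f}")
print(f"price (no intermediate rounding)  = {price_unrounded:,.2f}")
```

Because the price is an exponential of the log prediction, small rounding differences in the exponent translate into a few hundred pounds in the predicted price.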

d- How does the effect of size on property prices change with respect to the estimation
result in the SLRM (section a)? Why?

If we compare the slope coefficients of size in Model 1 (0.012) and Model 2 (0.0104), we
observe that it is smaller in Model 2. This has to do with assumption SLR4 (zero conditional
mean of the error) being violated in the SLRM, so that the estimator is biased upward. The
reason is the correlation between size and distance: distance is omitted from Model 1 and
lowers price, so an upward bias implies that size and distance are negatively correlated
in this sample.

e- Test whether age and buyage are jointly significant.

$H_0: \beta_3 = 0, \ \beta_4 = 0$ (q = 2)

$H_1: H_0$ is not true

The unrestricted model is Model 3 (k = 4), and the restricted model is Model 2. Thus:

$F = \dfrac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n - k - 1)} = \dfrac{(117.803 - 113.431)/2}{113.431/(1199 - 4 - 1)} = 23.01$

Since $23.01 > F_{2,1194}(0.01) = 4.62$, we reject the null hypothesis at the 1% significance
level and therefore age and buyage are jointly statistically significant. We prefer the
unrestricted version of the model.
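
The exclusion-restriction test can be sketched in a few lines. The function name f_exclusion is illustrative, and the critical value is the one quoted in the text rather than computed.

```python
def f_exclusion(ssr_r, ssr_ur, q, n, k):
    """F statistic for q exclusion restrictions.

    ssr_r / ssr_ur: sum of squared residuals of the restricted / unrestricted model.
    n: observations, k: slope coefficients in the unrestricted model.
    """
    return ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k - 1))


# Model 2 is the restricted model, Model 3 the unrestricted one.
f_stat = f_exclusion(ssr_r=117.803, ssr_ur=113.431, q=2, n=1199, k=4)

F_CRIT_1PCT = 4.62  # F(2, 1194) at 1%, as quoted in the text
print(f"F = {f_stat:.2f}, reject H0: {f_stat > F_CRIT_1PCT}")
```

Both models must be estimated on the same sample for the comparison of SSRs to be valid, which is the case here (n = 1199 in both).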

f- Test the overall significance of Model 3.

The overall test:

$H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$

$H_1: H_0$ is not true

$F = \dfrac{R^2/k}{(1 - R^2)/(n - k - 1)} = \dfrac{0.472/4}{(1 - 0.472)/1194} = 266.84$

Since $F_{4,1194}(0.01) = 3.33$, we reject the null, and therefore the model is useful at the
1% significance level.

g- If you were part of the team in the consultancy firm and had to choose a model,
which one would you choose and why?

I would choose Model 3, given the answers in e) and f). All the tests are consistent
with choosing Model 3; in addition, if we compute the adjusted R-squared, it is the highest
among the three models.

5 We have information about families below the poverty level (POVRATE = percentage of
families with income below the poverty level) in a specific year for 58 counties in California,
combined with information about potential determinants: UNEMP (unemployment rate, in
percent), FAMSIZE (persons per household), EDU (percentage that completed
four years of college or higher), URBAN (percentage of urban population). Estimation
results are presented in the following table:

OLS Estimation Results - Dependent variable: POVRATE

Variable             Model 1     Model 2     Model 3
Constant             2.637       1.906       4.309
                     (0.987)     (4.292)     (4.535)
Unemp                0.731       0.721       0.424
                     (0.092)     (0.106)     (0.166)
Famsize                          0.305       2.388
                                 (1.742)     (1.871)
Edu                                          -0.177
                                             (0.081)
Urban                                        -0.051
                                             (0.022)
n                    58          58          58
Adjusted R squared   0.518       0.510       0.548
SSR                  421.692     421.457     374.675
Note: Standard errors are in parentheses.
Conduct all the hypothesis tests at the 1% significance level.


a- Interpret the slope coefficient in Model 1 and test its individual significance.
If the unemployment rate increases by one percentage point, the poverty rate
increases on average by 0.731 percentage points. This is a realistic result, as the poverty
rate tends to increase if there are many individuals without a job.

$H_0: \beta_1 = 0, \quad H_1: \beta_1 \neq 0$

$t_U = \dfrac{0.731 - 0}{0.092} \approx 7.946$

$c = 2.58$: critical value of the two-tailed test at 1% (the t distribution has 56 degrees
of freedom; 2.58 is the standard normal approximation)

We reject the null hypothesis at the 1% significance level since 7.946 > 2.58, and
therefore the unemployment rate is statistically significant in explaining the
behavior of the poverty rate.

b- Test the individual and global significance in Model 2.

Starting with the variable Unemployment:

$H_0: \beta_1 = 0, \quad H_1: \beta_1 \neq 0$

$t_U = \dfrac{0.721}{0.106} \approx 6.802$

We reject the null hypothesis at the 1% significance level since 6.802 > 2.58, and
therefore the unemployment rate is an individually significant variable
in explaining the behavior of the poverty rate.

Secondly, the variable Family Size:

$H_0: \beta_2 = 0, \quad H_1: \beta_2 \neq 0$

$t_F = \dfrac{0.305}{1.742} = 0.175$

We fail to reject the null hypothesis at the 1% significance level since 0.175 < 2.58, and
therefore family size is not statistically significant at the 1% level in explaining the poverty rate.

Test of overall significance of model 2:

First we need to recover the R-squared, given the relationship between the adjusted
R-squared and the R-squared:

$0.510 = 1 - \dfrac{(1 - R^2)(58 - 1)}{58 - 2 - 1}$

$0.510 = 1 - (1 - R^2) \cdot 1.0364$

$(1 - R^2) \cdot 1.0364 = 0.49$

$R^2 = 0.527$
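
The inversion from adjusted R-squared to R-squared is easy to script. The sketch below assumes the usual definition used in the derivation above; the function name r2_from_adjusted is illustrative.

```python
def r2_from_adjusted(adj_r2, n, k):
    """Recover R-squared from the adjusted R-squared.

    Inverts adj_r2 = 1 - (1 - R2) * (n - 1) / (n - k - 1).
    """
    return 1 - (1 - adj_r2) * (n - k - 1) / (n - 1)


# Model 2 of the poverty-rate table: n = 58 counties, k = 2 regressors.
r2_model2 = r2_from_adjusted(0.510, n=58, k=2)
print(f"R-squared (Model 2) = {r2_model2:.3f}")
```

The recovered value can then be plugged into the overall-significance F formula in the usual way.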

The overall test:

$H_0: \beta_1 = 0, \ \beta_2 = 0$ (q = k = 2)

$H_1: H_0$ is not true

$F = \dfrac{R^2/k}{(1 - R^2)/(n - k - 1)} = \dfrac{0.527/2}{(1 - 0.527)/55} \approx 30.64$

Since $F_{2,55}(0.01) = 5.01$, we reject the null, and therefore $U$ and $F$ are jointly
statistically significant at the 1% level (the model is useful).

c- Comment on the effect of FAMSIZE on POVRATE in the second model. Why do
you think it is a positive and insignificant effect? Does this effect affect the goodness
of fit of Model 2 compared with Model 1? Why?

It is a positive effect because you might expect that more individuals in a household,
holding everything else constant, implies fewer resources per person and, consequently, a
higher probability of poverty. It is not significant because this effect may not be very
relevant for understanding the variability of the poverty rate. Comparing the
adjusted R-squared of Model 1 (0.518) and Model 2 (0.510), we see that it does not
increase. That is, adding the famsize variable does not improve the explanatory power of the
second model compared with the first one. This is consistent with the famsize variable not
being individually significant.

d- In Model 3 we add two new explanatory variables: EDU and URBAN. Test whether
this inclusion helps to improve the quality of the model. Is Model 3 the best in terms
of goodness-of-fit? Why?

$H_0: \beta_3 = 0, \ \beta_4 = 0$

$H_1: H_0$ is not true

$F = \dfrac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n - k - 1)} = \dfrac{(421.457 - 374.675)/2}{374.675/(58 - 4 - 1)} \approx 3.309$

Since $3.309 > F_{2,53}(0.05) = 3.17$, we reject the null hypothesis at the 5% significance
level and therefore the two new explanatory variables introduced in the third model
are jointly statistically significant. (This conclusion relies on the 5% level: at the 1% level
the critical value is roughly 5.0, so the null would not be rejected.) That is, the inclusion of
the two new explanatory variables helps us to understand better the behavior of our
dependent variable. Therefore, our preferred specification will be the third model.

Yes, Model 3 is the best in terms of explanatory power because its adjusted
R-squared is larger than the ones associated with Model 1 and Model 2.
This result is consistent with the result of the above F-test.

e- Are the effects of these two new variables the expected ones?

The effects are the expected ones. More education means better chances of having a job
and therefore of not suffering poverty (education has a negative effect on the poverty
rate). Living in urban areas may imply more opportunities and therefore a lower
poverty rate than in rural areas (negative effect of urban on the poverty rate).

f- What about the individual significance of UNEMP in Model 3 compared with
Model 2? Explain.

$H_0: \beta_1 = 0, \quad H_1: \beta_1 \neq 0$

$t_U = \dfrac{0.424}{0.166} = 2.554$

We fail to reject the null hypothesis at the 1% significance level since 2.554 < 2.58, and
therefore the unemployment rate is not statistically significant in Model 3.

In Model 2 the unemployment rate was individually significant, but in Model 3 it becomes
individually insignificant. This may be due to the inclusion of education and urban in
the third model. Education and the unemployment rate are very likely to be correlated, and
therefore introducing education in Model 3 makes the unemployment rate
insignificant.
