Tutorial Answers
Semester 1 2015
Weeks 11 and 12
1. Recall the Anzac Garage data (AnzacG.xls) used previously and available in the Excel
data subfolder on the Moodle site under Tutorial Questions and Information. We
previously considered the simple linear regression model given by:
$$\text{price}_i = \beta_0 + \beta_1 \, \text{age}_i + u_i$$
where price = the price of a used car, in dollars, and age = the age of the car, in years.
The Excel results obtained using ordinary least squares to estimate this model are
presented below:
Regression Statistics
R²                  0.077
Standard Error      42069
Observations        117

             Coefficient   Standard Error   t Stat    p-value
Intercept    47469         6748             7.035     0.000
Age          -2658         856              -3.106    0.002

(a)
Interpret the t-Stat and the p-values in the output above. What do you need
to assume for these interpretations to be correct?
The t-stats and p-values in the Excel output come from two-tailed tests of the null
hypothesis that the associated population parameter equals 0. Hence, larger t-stats (in
absolute value) and lower p-values provide stronger evidence that the associated population
parameter is nonzero. Here, the p-values for both the intercept and the coefficient on age
are below 1%, so in each case we can reject the null hypothesis that the population
parameter is zero at the 1% level of significance.
For these interpretations to be correct, we need either to assume that the disturbances are
normally distributed or, because the sample size is large, to invoke the CLT.
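The t-stats and two-tailed p-values in the table can be reproduced from the coefficients and standard errors. The following is a minimal sketch that assumes the large-sample (standard normal) approximation rather than the exact t distribution behind the Excel output; the function name `two_sided_p` is illustrative only. Because the inputs are rounded, the results may differ from the table in the last decimal place.

```python
import math

def two_sided_p(t_stat):
    """Two-tailed p-value under the standard normal approximation."""
    # For a standard normal Z, P(|Z| > |t|) = erfc(|t| / sqrt(2))
    return math.erfc(abs(t_stat) / math.sqrt(2))

# (coefficient, standard error) pairs taken from the Excel output
estimates = {"Intercept": (47469, 6748), "Age": (-2658, 856)}

results = {}
for name, (coef, se) in estimates.items():
    t_stat = coef / se  # test statistic for H0: parameter = 0
    results[name] = (t_stat, two_sided_p(t_stat))
    print(f"{name}: t = {t_stat:.3f}, p = {results[name][1]:.4f}")
```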
(b)
The standard normal critical value is 1.96, hence the 95% confidence interval for the
coefficient on age is:
$$-2658 \pm 1.96 \times 856 = -2658 \pm 1678 = (-4336, \; -980)$$
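The interval arithmetic can be checked with a few lines of Python. This is a sketch using only the standard library; it takes the critical value from the standard normal distribution, as the answer does.

```python
from statistics import NormalDist

b1, se_b1 = -2658, 856           # slope estimate and its standard error

# Two-sided 95% interval: 2.5% in each tail of the standard normal
z = NormalDist().inv_cdf(0.975)  # approximately 1.96
margin = z * se_b1
ci = (b1 - margin, b1 + margin)

print(f"95% CI for the coefficient on age: ({ci[0]:.0f}, {ci[1]:.0f})")
```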
(c)
The regression model including age explains 7.7% of the variation in used car prices.
(d)
Test whether the estimated coefficient on Age is significantly less than zero at
the 5% level of significance.
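One way this one-tailed test could be carried out from the output above is sketched below. It assumes the hypotheses H0: the coefficient on age equals 0 versus H1: the coefficient is less than 0, and it uses the large-sample normal critical value rather than the t distribution; both choices are assumptions made for illustration.

```python
from statistics import NormalDist

b1, se_b1 = -2658, 856             # slope estimate and its standard error

t_stat = b1 / se_b1                # test statistic under H0: slope = 0
crit = NormalDist().inv_cdf(0.05)  # lower-tail 5% critical value, about -1.645

# Reject H0 in favour of H1 (slope < 0) when the statistic is below the critical value
reject_h0 = t_stat < crit
print(f"t = {t_stat:.3f}, critical value = {crit:.3f}, reject H0: {reject_h0}")
```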
(e)
Estimate a 95% confidence interval for the mean price for a second-hand
passenger car that is 10 years old, and interpret the result. Note: the sample
mean of age is 6.44 years.
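$$\hat{Y}_p \pm t \times s \sqrt{\frac{1}{n} + \frac{(X_p - \bar{X})^2}{\sum (X_i - \bar{X})^2}}, \qquad \sum (X_i - \bar{X})^2 = \frac{s^2}{\left(se(b_1)\right)^2} = \frac{42069^2}{856^2} = 2415$$
Hence:
$$20889 \pm 1.98 \times 42069 \sqrt{\frac{1}{117} + \frac{(10 - 6.44)^2}{2415}} = 20889 \pm 9783$$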
We are 95% confident that the expected price of a 10-year-old car falls between
$11,106 and $30,672. While the impact of age on price is precisely estimated, the CI is
quite wide because of the large amount of unexplained variation indicated by
the very low R² value reported. (Note: using normal critical values here would be
acceptable given the large sample size and would make little practical difference, as
the critical value would be 1.96 rather than 1.98.)
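The steps of this calculation can be sketched in Python as follows. The sum of squared deviations of age is backed out from the regression standard error and the standard error of the slope, as in the working above; the critical value 1.98 is taken from the answer rather than computed, and the variable names are illustrative.

```python
import math

b0, b1 = 47469, -2658         # intercept and slope estimates
s, se_b1, n = 42069, 856, 117 # regression standard error, se of slope, sample size
x_p, x_bar = 10, 6.44         # age of interest and sample mean of age
t_crit = 1.98                 # critical value used in the answer

# se(b1) = s / sqrt(Sxx), so Sxx = (s / se(b1))^2
sxx = (s / se_b1) ** 2

y_hat = b0 + b1 * x_p         # predicted mean price at age 10
se_mean = s * math.sqrt(1 / n + (x_p - x_bar) ** 2 / sxx)
margin = t_crit * se_mean

print(f"Sxx = {sxx:.0f}")
print(f"Interval: {y_hat} +/- {margin:.0f} = ({y_hat - margin:.0f}, {y_hat + margin:.0f})")
```

Using the unrounded value of the sum of squares rather than 2415 changes the margin only slightly.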
Anzac Garage is worried about its pricing scheme, which is based solely on the age of
the car. When its second-hand car prices are compared with the prices of cars of the
same age at other dealerships, they are often different. A consultant notes that the
value of a second-hand car should depend on both the odometer reading and the age
of the vehicle. This consultant wanted to estimate the following two simple linear
regression models separately:
$$\text{price}_i = \beta_0 + \beta_1 \, \text{age}_i + u_i$$
$$\text{price}_i = \alpha_0 + \alpha_1 \, \text{odometer}_i + v_i$$
where odometer = distance the car has travelled since leaving the factory, in
kilometres. A senior consultant advised the use of a multiple linear regression model
instead, i.e.:
$$\text{price}_i = \gamma_0 + \gamma_1 \, \text{odometer}_i + \gamma_2 \, \text{age}_i + e_i$$
(f)
Discuss why the simple linear regression methods may not be preferable to the
multiple regression method, in general, and in the context of this problem. The
resultant OLS estimates for the multiple regression model are given below:
SUMMARY OUTPUT
Regression Statistics
R Square            0.150
Standard Error      40568
Observations        117

                 Coefficients   Standard Error   t Stat    P-value
Intercept        53867          6825             7.893     0.000
Odometer (km)    -0.270         0.087            -3.110    0.002
Age              -360           1108             -0.325    0.746
In general, the predictive performance of a model improves as relevant variables are added
to a simple regression model.
In addition, the assumption that the disturbance is uncorrelated with the explanatory
variables is critical for unbiased estimation of the coefficients on the included variables. In
the simple regression of price on age, this assumption is violated if variables that affect
price and are correlated with age have been omitted from the model. This is likely to be the
case here with the distance the car has travelled.
The R² has improved (approximately doubled, from 0.077 to 0.150) with the addition of
odometer, and the coefficient on age is now much smaller in magnitude (-360 compared with
-2658) and statistically insignificant.
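The change in the age coefficient can be related to the omitted variable through a standard OLS identity: within the same sample, the simple-regression slope on age equals the multiple-regression slope on age plus the odometer coefficient times the slope from an auxiliary regression of odometer on age. The sketch below backs out that auxiliary slope from the reported estimates. The auxiliary regression is not reported in this tutorial, so the implied figure is an inference from rounded numbers, not a result from the data.

```python
b_age_simple = -2658  # coefficient on age, price-on-age regression
b_age_multi = -360    # coefficient on age, multiple regression
b_odo_multi = -0.270  # coefficient on odometer (km), multiple regression

# Identity: b_age_simple = b_age_multi + b_odo_multi * d,
# where d is the slope from regressing odometer on age (km per year of age)
implied_km_per_year = (b_age_simple - b_age_multi) / b_odo_multi

# Part of the simple-regression slope that is really the odometer effect
bias = b_odo_multi * implied_km_per_year

print(f"Implied extra kilometres per year of age: {implied_km_per_year:.0f}")
print(f"Omitted-variable component of the age slope: {bias:.0f}")
```

On these figures, most of the apparent effect of age in the simple regression is attributable to older cars having travelled further.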
2. Sydney housing prices, encore.
Recall the housing price data for Sydney suburbs used previously. Your statistically
naïve friend has been doing some analysis of Sydney housing prices using these data
and has asked you for help. In addition to the price data, a number of characteristics
associated with the suburb have been collected and are likely to explain some of the
large variation in housing prices across suburbs that are observed in the data. Your
friend is very interested in the impact on housing prices of being located under the
flight path. He ran a regression of housing price on the flightpath variable (Model 1) and
the results surprised him. On your advice he ran a second regression (Model 2) that
included several extra explanatory variables. Results for Model 1 and Model 2 are
presented in the table below. Note that:
Housing price is the mean of the median price of houses sold in each suburb for two
quarters (September and December 2002) measured in thousands of dollars;
Distance to CBD is the distance, measured in kilometres, of the suburb from Sydney's
CBD;
Distance to Airport is the distance, measured in kilometres, of the suburb from the
Sydney airport;
Distance to beach is the distance, measured in kilometres, of the suburb from the
nearest beach;
Flightpath is a dummy variable that equals 1 if the suburb is under the flight path and
0 otherwise.
Multiple regression results for Sydney housing prices*

Explanatory variables    Model 1          Model 2
Intercept                569.9 (20.6)     853.5 (35.5)
Flightpath               216.2 (56.0)     51.5 (50.2)
Distance to CBD                           -21.5 (3.4)
Distance to Airport                       21.0 (2.9)
Distance to beach                         -13.9 (2.3)
Observations             503              503
R squared                0.029            0.372

* Standard errors in parentheses.

(a)
How would you interpret the regression estimates for the parameters in Model 1?
Explain why your friend found these results to be unexpected.
The intercept estimate implies an average price of $569,900 for houses in suburbs not
under the flight path. The estimate of $\beta_1$ is positive, which means that houses in
suburbs under the flight path sell on average for more ($216,200 more) than houses in
suburbs not under the flight path. This is surprising because you would expect the aircraft
noise associated with being under the flight path to be unattractive and hence to lead to
lower, not higher, house prices.
(b)
Explain why the results in Model 1 are unreliable as a basis for determining the
impact on housing prices of being located under the flight path. Which of the
assumptions associated with simple linear regression has clearly been violated in
Model 1?
Your friend would like to make a statement about the impact on prices of being under the
flight path holding other factors constant. This is not possible with Model 1, as it is a
simple linear regression and hence there is the potential for omitted (confounding) variables
that lead to biased estimates of the impact of being situated under the flight path.
For example, given Sydney's geographical layout, proximity to the beach is likely to affect
housing prices and to be correlated with being under the flight path. In Model 1, the
variable capturing distance to beach is in the disturbance term, which leads to a violation
of the assumption that $E(u_i \mid X_i) = 0$.
(c)
Write a brief description of the results for Flightpath in Model 2 in terms of the
parameter estimate, its interpretation, and its statistical significance.
The estimated parameter indicates a $51,500 premium (much smaller than for Model 1) for
suburbs under the flight path relative to those not under the flight path, holding other factors
constant.
For statistical significance:
$H_0: \beta_i = 0$ versus $H_1: \beta_i \neq 0$, where $\beta_i$ is the ith regression coefficient.
Because we have a large sample size we can invoke the CLT and use standard normal critical
values when evaluating the test statistics given by $b_i / se(b_i)$.
If we choose $\alpha = 0.05$ then the decision rule is to reject $H_0$ if $|b_i / se(b_i)| > 1.96$.
The test statistic for flightpath (51.5/50.2 = 1.03) indicates that this parameter is not
statistically different from zero. Hence, ceteris paribus, for houses in the Sydney
suburbs there is no statistically significant effect on price from being located under the
flight path.
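The same decision rule can be applied to every coefficient in Model 2. The sketch below computes each test statistic from the reported estimates and standard errors and compares it with the 1.96 critical value; the dictionary layout and variable names are illustrative.

```python
# (estimate, standard error) pairs from the Model 2 column of the table
model2 = {
    "Intercept": (853.5, 35.5),
    "Flightpath": (51.5, 50.2),
    "Distance to CBD": (-21.5, 3.4),
    "Distance to Airport": (21.0, 2.9),
    "Distance to beach": (-13.9, 2.3),
}

CRIT = 1.96  # two-tailed 5% standard normal critical value

significant = {}
for name, (b, se) in model2.items():
    t_stat = b / se                        # test statistic for H0: coefficient = 0
    significant[name] = abs(t_stat) > CRIT # reject H0 when |t| exceeds 1.96
    print(f"{name}: t = {t_stat:.2f}, significant at 5%: {significant[name]}")
```

On these estimates, Flightpath is the only coefficient that fails to reach significance at the 5% level.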
(d)
Use Model 2 to predict the average house price for the suburb of Randwick, which
is 5.21 km from the CBD, 1.78 km from the beach, 6.62 km from the airport
and is not deemed to be under the flight path.
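A prediction of this kind is obtained by substituting the suburb's characteristics into the estimated Model 2 equation. The following sketch does this with the coefficients from the table; the result is in thousands of dollars, matching the units of the housing price variable, and the variable names are illustrative.

```python
# Model 2 coefficient estimates (housing price in thousands of dollars)
coef = {
    "Intercept": 853.5,
    "Flightpath": 51.5,
    "Distance to CBD": -21.5,
    "Distance to Airport": 21.0,
    "Distance to beach": -13.9,
}

# Randwick's characteristics as stated in the question
randwick = {
    "Flightpath": 0,              # not under the flight path
    "Distance to CBD": 5.21,      # km
    "Distance to Airport": 6.62,  # km
    "Distance to beach": 1.78,    # km
}

# Intercept plus each coefficient times the corresponding characteristic
predicted = coef["Intercept"] + sum(coef[k] * v for k, v in randwick.items())
print(f"Predicted average house price: ${predicted * 1000:,.0f}")
```

On these inputs the prediction is roughly $856,000.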