Final Exam 102 w10 Solutions
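$z^* = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
$t^* = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$
$\Pr[T_{n-k} > t_{\alpha,\,n-k}] = \alpha$
$\Pr[|T_{n-k}| > t_{\alpha/2,\,n-k}] = \alpha$
$\sum_{i=1}^{n} a = na$
$\sum_{i=1}^{n} (a x_i) = a \sum_{i=1}^{n} x_i$
$\sum_{i=1}^{n} (x_i + y_i) = \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i$
$s^2 = \bar{x}(1 - \bar{x})$ for proportions data
$t_{\alpha,\,n-k} = \mathrm{TINV}(2\alpha,\, n-k)$
$\Pr(|T_{n-k}| \ge |t^*|) = \mathrm{TDIST}(|t^*|,\, n-k,\, 2)$
$\hat{y}_i = b_1 + b_2 x_i$
$s_e^2 = \dfrac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
$TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$
$ESS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
$R^2 = 1 - \dfrac{ESS}{TSS}$
$s_{b_2} = \sqrt{\dfrac{s_e^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$
$F = \dfrac{R^2}{1 - R^2} \cdot \dfrac{n-k}{k-1}$
$F = \dfrac{ESS_r - ESS_u}{ESS_u} \cdot \dfrac{n-k}{k-g}$
$F = \dfrac{R_u^2 - R_r^2}{1 - R_u^2} \cdot \dfrac{n-k}{k-g}$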
1. Suppose we regress SAT score on parent’s education and parent’s income. If we run the
regression again but also include the student’s GPA as an additional regressor:
(a) The R2 for the regression will either stay the same or increase.
(b) The adjusted R2 for the regression will either stay the same or increase.
(c) Both (a) and (b) are true.
(d) Neither (a) nor (b) is true.
(a) When adding an additional regressor, our fit should be at least as good as before,
so the R2 for the regression should either stay the same or increase. Adjusted R2
may decrease if the additional regressor had little to no explanatory power.
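To see how adjusted R2 can fall while R2 rises, here is a minimal sketch. It assumes the usual definition, adjusted R2 = 1 − (1 − R2)(n − 1)/(n − k), which is not on the formula sheet above, and the sample size and R2 values are made up for illustration.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for n observations and k estimated coefficients (intercept included)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)


# Hypothetical numbers: adding GPA as a third regressor nudges R^2 from 0.300 to 0.301.
before = adjusted_r2(0.300, n=30, k=3)  # intercept + 2 regressors
after = adjusted_r2(0.301, n=30, k=4)   # intercept + 3 regressors

print(f"adjusted R^2 before: {before:.4f}, after: {after:.4f}")
```

With these numbers R2 goes up slightly, but the penalty for the extra coefficient is larger than the gain, so adjusted R2 goes down.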
2. Suppose we have a sample of the heights of Davis students and want to use the sample mean
to get a confidence interval for the mean height in the population. Which of the following
would increase the width of this confidence interval?
(a) Switching from a 95% confidence interval to a 90% confidence interval.
(b) Increasing the sample size used to calculate the sample mean.
(c) Switching from a 95% confidence interval to a 99% confidence interval.
(d) All of the above.
(c) The smaller we make α, the wider our confidence interval will get. A larger
sample size would make the confidence interval narrower.
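The effect of the confidence level and the sample size on the interval width can be checked numerically. This sketch uses a normal critical value in place of the t critical value (a large-sample simplification), and the sample standard deviation of 3 inches is a hypothetical value.

```python
from math import sqrt
from statistics import NormalDist


def ci_width(s: float, n: int, confidence: float) -> float:
    """Width of a two-sided interval x_bar +/- z * s / sqrt(n) (normal approximation)."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * z * s / sqrt(n)


# Hypothetical sample standard deviation of 3 inches.
for confidence in (0.90, 0.95, 0.99):
    print(confidence, round(ci_width(3.0, 100, confidence), 3))

# Quadrupling the sample size halves the width at a given confidence level.
print("n = 400:", round(ci_width(3.0, 400, 0.95), 3))
```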
3. Suppose we can reject the null hypothesis that β2 ≥ 0 at a 5% significance level where β2 is
the slope coefficient from a bivariate regression. Which of the following is definitely true?
(a) Our test statistic was negative.
(b) We can reject the null hypothesis that β2 = 0 at a 5% significance level.
(c) We can reject the null hypothesis that β2 ≥ 0 at a 2.5% significance level.
(d) We can reject the null hypothesis that β2 < 0 at a 5% significance level.
(a) The critical value for a lower one-tailed hypothesis test will be negative and we
will reject the null when the test statistic is more negative than the critical value.
4. Suppose we regress y on x2 . Which of the following would lead to a biased coefficient for x2 ?
(a) There is a variable x3 that is correlated with y but not with x2
(b) There is a variable x3 that is correlated with x2 but not with y.
(c) y is measured with some random error.
(d) x2 is measured with some random error.
(d) An omitted variable will bias the coefficient on x2 only if it is correlated with
both x2 and with y. Measurement error in x2 will bias the coefficient on x2 since
it will lead to errors that are negatively correlated with x2 . Measurement error in
y will decrease the precision of the estimated slope coefficient but will not bias the
coefficient.
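A small simulation can illustrate the difference between the two kinds of measurement error. All numbers here (true slope of 2, unit variances, sample size, seed) are assumptions chosen for the sketch, and `ols_slope` is a helper written for this example.

```python
import random


def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x with an intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(0)
n, true_slope = 20000, 2.0  # hypothetical values

x = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + true_slope * xi + rng.gauss(0, 1) for xi in x]
x_noisy = [xi + rng.gauss(0, 1) for xi in x]  # random error in the regressor
y_noisy = [yi + rng.gauss(0, 1) for yi in y]  # random error in the dependent variable

slope_clean = ols_slope(x, y)
slope_x_err = ols_slope(x_noisy, y)  # pulled toward zero
slope_y_err = ols_slope(x, y_noisy)  # still centered on the true slope

print(slope_clean, slope_x_err, slope_y_err)
```

With equal variances for x2 and its measurement error, the slope on the mismeasured regressor should come out near half the true value, while error in y leaves the slope close to 2.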
5. When testing the significance of a subset of regressors, the R2 of the unrestricted model will
always be:
regression using snow depth measured in inches, the new estimated coefficient on snowfall
will be:
(a) Larger than 5.
(b) Smaller than 5.
(c) Still equal to 5.
(d) Not enough information.
(b) The coefficient gives us the change in y for a change in snow depth of one
foot. The change in y for a change in snow depth of one inch will be 1/12 of this
value.
10. When regressing annual work hours on income, a researcher finds that the variance of the
residuals increases as work hours increases. This will affect:
(a) The expected value of the slope coefficient for income.
(b) The magnitude of the standard error for the slope coefficient for income.
(c) Both (a) and (b).
(d) Neither (a) nor (b).
(b) This is a case of heteroskedasticity. Heteroskedasticity will change the standard
errors of our estimates but will not bias the coefficients.
11. The histogram for hours of study per week based on a sample of 400 Davis students is
symmetric and centered at 15 hours. Which of the following statements is true?
(a) The sample median is 15 hours.
(b) The sample mean is 15 hours.
(c) The skewness for the sample is zero.
(d) All of the above.
(d) Because the distribution is symmetric, 50 percent of the observations will be
to the right of 15 hours and 50 percent will be to the left of 15 hours, making 15
hours the median. The symmetry will also lead to the mean being equal to 15
hours (for every observation that is larger than 15, there is a corresponding
observation that is smaller than 15 by the same amount). For a symmetric distribution,
skewness is zero.
12. When running a bivariate regression, which of the following is not possible?
(a) The error sum of squares is larger than the total sum of squares.
(b) The error sum of squares is equal to the total sum of squares.
(c) The error sum of squares is zero.
(d) The error sum of squares is positive.
(a) The largest the error sum of squares can ever be is the magnitude of the total
sum of squares. If it were larger you could achieve a better fit by simply setting all
of your slope coefficients to zero.
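The bound on the error sum of squares can be checked on small made-up data sets. The helper `fit_and_sums` and both samples below are illustrative only.

```python
def fit_and_sums(x, y):
    """Return (ESS, TSS) for a bivariate OLS fit of y on x with an intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = sxy / sxx
    b1 = y_bar - b2 * x_bar
    ess = sum((yi - (b1 + b2 * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return ess, tss


x = [1, 2, 3, 4, 5]
ess_a, tss_a = fit_and_sums(x, [2, 1, 4, 3, 5])  # x has explanatory power: ESS < TSS
ess_b, tss_b = fit_and_sums(x, [1, 0, 0, 0, 1])  # sample slope is zero: ESS = TSS

print(ess_a, tss_a)
print(ess_b, tss_b)
```

In the second sample the fitted line is flat at the sample mean of y, which is the worst case: the error sum of squares equals the total sum of squares.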
13. Which of the following would definitely not lead to the error term being correlated with a
regressor x?
22. Suppose we ran a regression of Y on X 1000 times using 1000 different samples and made a
histogram of the resulting slope coefficient values. Which of the following is true about the
distribution shown on the histogram?
(a) It would be centered at zero.
(b) It would look like a normal distribution.
(c) All of the observations would be located at the true value of the slope coefficient.
(d) It would be right skewed.
(b) The estimated slope coefficient is itself a random variable. Under our standard
assumptions it will be distributed (at least approximately) normally, with a mean
equal to the true population value of the slope coefficient.
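The thought experiment in this question can be run directly. The sketch below assumes a hypothetical population line with intercept 2 and slope 1.5, normal errors, and 50 observations per sample; none of these values come from the exam.

```python
import random
from statistics import mean


def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x with an intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(42)
true_intercept, true_slope = 2.0, 1.5  # hypothetical population values

slopes = []
for _ in range(1000):  # 1000 different samples
    x = [rng.uniform(0, 10) for _ in range(50)]
    y = [true_intercept + true_slope * xi + rng.gauss(0, 3) for xi in x]
    slopes.append(ols_slope(x, y))

print("mean of the 1000 slope estimates:", round(mean(slopes), 3))
print("smallest and largest estimate:", round(min(slopes), 3), round(max(slopes), 3))
```

The estimates vary from sample to sample but cluster around the true slope rather than around zero.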
23. Which of the following depends on the units variables are measured in?
(a) Correlation.
(b) Coefficient of variation.
(c) Estimated slope coefficient.
(d) t statistic.
(c) The estimated slope coefficient is in the units of y divided by the units of x.
Changing the units of either y or x will rescale the slope coefficient.
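As an illustration, the sketch below measures the same made-up regressor in feet and in inches. The data and the helper `slope_and_corr` are hypothetical; the point is that the slope is rescaled by a factor of 12 while the correlation is unchanged.

```python
from math import sqrt


def slope_and_corr(x, y):
    """Return the OLS slope of y on x and the sample correlation of x and y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sxx, sxy / sqrt(sxx * syy)


# Hypothetical data: x measured in feet, then converted to inches.
x_feet = [0.5, 1.0, 2.0, 3.0, 4.5]
y = [12.0, 16.0, 19.0, 26.0, 32.0]
x_inches = [12.0 * v for v in x_feet]

slope_ft, corr_ft = slope_and_corr(x_feet, y)
slope_in, corr_in = slope_and_corr(x_inches, y)

print("slope per foot:", slope_ft, "slope per inch:", slope_in)
print("correlations:", corr_ft, corr_in)
```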
24. Suppose the size of your social network grows exponentially over time. Which of the following
equations would be the most appropriate for modeling social network size (S) as a function
of time (T ):
(a) S = β1 + β2 T + ε.
(b) ln(S) = β1 + β2 ln(T ) + ε.
(c) S = β1 + β2 ln(T ) + ε.
(d) ln(S) = β1 + β2 T + ε.
(d) If S grows exponentially, then S increases by a constant percentage for every
one unit change in time. This can be modeled with a log-linear equation.
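A quick check of the log-linear specification, using a hypothetical network that starts at 10 people and grows by 20% per period with no noise: regressing ln(S) on T should return a slope of ln(1.2).

```python
from math import log

# Hypothetical exponential growth: 20% per period, starting from 10.
T = list(range(10))
S = [10 * 1.2 ** t for t in T]
ln_S = [log(s) for s in S]

# Bivariate OLS of ln(S) on T.
n = len(T)
t_bar, l_bar = sum(T) / n, sum(ln_S) / n
b2 = sum((t - t_bar) * (l - l_bar) for t, l in zip(T, ln_S)) / sum(
    (t - t_bar) ** 2 for t in T
)
b1 = l_bar - b2 * t_bar

print("estimated slope:", b2, "ln(1.2):", log(1.2))
```

The relationship between ln(S) and T is exactly linear here, which is why the log-linear form fits exponential growth.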
25. Suppose that we want to test whether eye color influences the likelihood of being hired for a
job. Our dataset includes five different values for the eye color variable. If we want to regress
the probability of being hired on eye color, we will:
(a) Convert each eye color to a number and include this new eye color number variable
in the regression.
(b) Create dummy variables for each eye color and include all of the dummy variables as
regressors.
(c) Create dummy variables for each eye color and include four of the dummy variables as
regressors.
(d) Create dummy variables for each eye color and include three of the dummy variables as
regressors.
(c) We always include one fewer dummy variable than the total number of
categories. If we didn’t do this, we would run into the dummy variable trap and have
a perfect collinearity problem (one of the dummy variables could be rewritten in
terms of the other dummy variables).
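The collinearity behind the dummy variable trap can be seen in a tiny example. The five color names and the sample below are made up; the check is simply whether the included dummies add up to the constant term's column of ones.

```python
# Hypothetical categories and sample.
colors = ["brown", "blue", "green", "hazel", "gray"]
sample = ["brown", "blue", "blue", "green", "hazel", "gray", "brown"]

# One 0/1 dummy column per eye color.
dummies = {c: [1 if s == c else 0 for s in sample] for c in colors}

# With all five dummies, every row sums to 1, which duplicates the constant term.
row_sums_all = [sum(dummies[c][i] for c in colors) for i in range(len(sample))]

# Dropping one category (here "brown" becomes the base group) breaks the exact relationship.
kept = colors[1:]
row_sums_kept = [sum(dummies[c][i] for c in kept) for i in range(len(sample))]

print(row_sums_all)
print(row_sums_kept)
```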
1. (14 points) Suppose that the number of traffic accidents (N ) is a function of the number of
cars on the road (C) and the average speed of cars on the road (S). The true population
relationship between accidents, cars and average speed is given by:
where ε is a random error that meets all of our standard assumptions. The number of
cars on the road is negatively correlated with average speed due to the increased congestion
associated with additional cars. The true population relationship between the number of cars
and average speed is given by:
S = 80 − 4C + ν (2)
where ν is a random error that meets all of our standard assumptions.
(a) If you ran a regression with N as the dependent variable and C and S as the independent
variables, what would the expected value of the estimated slope coefficient for C be?
Assume that you include a constant term in your regression.
Given that ε meets all of our standard assumptions, the estimated coefficient
will be unbiased. So its expected value will be equal to the true population
value of 50.
(b) If you ran a regression with N as the dependent variable and C as the only independent
variable, what would the expected value of the estimated slope coefficient for C be?
Assume that you include a constant term in your regression.
By omitting S from the regression equation, S enters the error term making
the error term correlated with C and creating an omitted variable bias. The
expected value of the estimated coefficient for C will be equal to the true value
plus a bias term that captures the indirect effect of S on N :
$E(\tilde{b}_c) = \beta_c + \beta_s \cdot \gamma_c$
$E(\tilde{b}_c) = 50 + 25 \cdot (-4)$
$E(\tilde{b}_c) = -50$
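The bias calculation in part (b) can be checked by simulation. The sketch uses the slope values from the solution (50 for C, 25 for S, and S = 80 − 4C + ν); the intercept of zero in the accident equation, the error variances, the range of C, and the sample size are assumptions made only for this example.

```python
import random


def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x with an intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(1)
n = 5000

C = [rng.uniform(1, 15) for _ in range(n)]           # hypothetical range for cars
S = [80 - 4 * c + rng.gauss(0, 1) for c in C]        # equation (2)
# Accident equation with an assumed intercept of 0 (the intercept does not affect the slope).
N = [50 * c + 25 * s + rng.gauss(0, 1) for c, s in zip(C, S)]

short_slope = ols_slope(C, N)  # regression of N on C only, omitting S
print("slope on C when S is omitted:", round(short_slope, 2))
```

The estimate should land close to −50, matching 50 + 25 · (−4).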
(c) Suppose that you ran a regression with average speed as the dependent variable and
number of cars on the road as the independent variable but you forced the intercept to
be zero (in other words, you do not include a constant term). Will the expected
value of the estimated slope coefficient be greater than, equal to or less than the true
population value of the slope coefficient? Include a written explanation and a scatter
plot showing speed as a function of number of cars to illustrate your answer. Assume
that we always observe positive numbers of cars and positive average speeds.
We know that all of our data points will have positive values for number of cars
and average speed, so they will all lie above and to the right of the origin on a
graph with S on the vertical axis and C on the horizontal axis. We are forcing
our regression line to pass through the origin and through this scatter of data points
above and to the right of the origin. This means that we will get a positive
slope for the regression line. Given that the true value of the slope coefficient is
negative, the estimated slope will certainly be greater than the true value. This
situation is depicted on the graph below.
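[Figure: scatter plot with S on the vertical axis (80 marked) and C on the horizontal axis, showing the estimated regression line through the origin with slope > 0.]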
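A numerical version of the same argument: for a regression forced through the origin, the least squares slope is the sum of C·S divided by the sum of C squared, which must be positive when every C and S is positive. The data below are hypothetical points placed exactly on S = 80 − 4C.

```python
# Hypothetical data: ten points lying exactly on the true line S = 80 - 4C.
C = list(range(1, 11))
S = [80 - 4 * c for c in C]

# Least squares slope when the intercept is forced to zero: sum(C*S) / sum(C^2).
slope_origin = sum(c * s for c, s in zip(C, S)) / sum(c * c for c in C)

print("slope through the origin:", round(slope_origin, 3))
print("true slope:", -4)
```

Even with no noise at all, the fitted slope is positive (about 7.43 here) although the true slope is −4.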
Regression Statistics
Multiple R           0.747320875
R Square             0.558488491
Adjusted R Square    0.533259262
Standard Error       0.357993439
Observations         75

ANOVA
              df    SS             MS          F           Significance F
Regression     4    11.34802735    2.837007    22.13657    7.68637E-12
Residual      70    8.971151182    0.128159
Total         74    20.31917853

Regression Statistics
Multiple R           0.260580275
R Square             0.06790208
Adjusted R Square    0.055133615
Standard Error       0.509357157
Observations         75

ANOVA
              df    SS             MS          F           Significance F
Regression     1    1.379714485    1.379714    5.317952    0.023948764
Residual      73    18.93946405    0.259445
Total         74    20.31917853
2. (14 points) For this problem, use the regression output shown above. Both
regressions use the same data set. The dataset is a sample of 75 cities. height is a variable
giving the average height in inches of adult males from the city. typhoiddeaths is a variable
giving the number of typhoid deaths per 1,000 people in the city. The variables northeast,
south and west are all dummy variables that are equal to one if the city is in that region
and zero otherwise. All cities are located either in the Northeast, the South, the West or the
Midwest.
(a) Based on the regression results, what is the difference in the average male height between
a city in the South and a city in the Northeast?
E(height|south = 1) = b1 + b2 · 0 + b3 · 1 + b4 · 0 + b5 typhoiddeaths
E(height|south = 1) = b1 + b3 + b5 typhoiddeaths
E(height|northeast = 1) = b1 + b2 northeast + b3 south + b4 west + b5 typhoiddeaths
E(height|northeast = 1) = b1 + b2 · 1 + b3 · 0 + b4 · 0 + b5 typhoiddeaths
E(height|northeast = 1) = b1 + b2 + b5 typhoiddeaths
E(height|south = 1) − E(height|northeast = 1) = b1 + b3 + b5 typhoiddeaths − b1 − b2 − b5 typhoiddeaths
E(height|south = 1) − E(height|northeast = 1) = b3 − b2
E(height|south = 1) − E(height|northeast = 1) = (−.26) − (−.73)
E(height|south = 1) − E(height|northeast = 1) = .47
So the average height in a southern city is .47 inches greater than the average
height in a northeastern city.
(b) What is the average male height for a city in the West with no typhoid deaths?
E(height|west = 1, typhoid = 0) = b1 + b2 · 0 + b3 · 0 + b4 · 1 + b5 · 0
E(height|west = 1, typhoid = 0) = b1 + b4
E(height|west = 1, typhoid = 0) = 68.42 + .26 = 68.68
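The arithmetic in parts (a) and (b) can be reproduced with the coefficient values used in the solutions above. The coefficient on typhoid deaths is not needed: it cancels in (a) and is multiplied by zero in (b), so the value passed for it below is an arbitrary placeholder.

```python
# Coefficients as used in the solutions above (Midwest is the omitted region).
b1, b_ne, b_s, b_w = 68.42, -0.73, -0.26, 0.26


def predicted_height(northeast, south, west, typhoid, b_typhoid):
    """Predicted average height; b_typhoid is a placeholder, not a value from the output."""
    return b1 + b_ne * northeast + b_s * south + b_w * west + b_typhoid * typhoid


placeholder = 0.1  # arbitrary: cancels in (a), multiplied by zero in (b)

# (a) South minus Northeast, holding typhoid deaths fixed at any common value.
diff = predicted_height(0, 1, 0, 2.0, placeholder) - predicted_height(1, 0, 0, 2.0, placeholder)

# (b) West with no typhoid deaths.
west_zero = predicted_height(0, 0, 1, 0.0, placeholder)

print(round(diff, 2), round(west_zero, 2))
```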
(c) Based on the first set of regression results, can you reject the null hypothesis that the
coefficient for typhoid deaths is less than or equal to zero at a 5% significance level? Be
certain to justify your answer.
Notice that the p-value for the typhoid deaths coefficient is .033. This value
corresponds to a two-tailed test and means that we would reject the null hypothesis
that the coefficient is equal to 0 at a 5% significance level (.05 > .033) and that
our t-statistic is to the right of $t_{.025,\,n-k}$ (since our coefficient is positive). For
an upper one-sided test, we would reject the null if the t-statistic is to the
right of $t_{.05,\,n-k}$. Notice that $t^* > t_{.025,\,n-k} > t_{.05,\,n-k}$, so we will reject the null
hypothesis that the coefficient is less than or equal to zero at a 5% significance
level.
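An equivalent way to reach the same conclusion is to convert the two-tailed p-value into a one-tailed p-value. Because the t distribution is symmetric, the upper-tail p-value is half the two-tailed p-value when the estimate is positive. The function name below is illustrative.

```python
def upper_one_sided_p(two_sided_p: float, estimate_is_positive: bool) -> float:
    """p-value for H0: beta <= 0 against Ha: beta > 0, given the two-tailed p-value."""
    half = two_sided_p / 2.0
    return half if estimate_is_positive else 1.0 - half


p_one_sided = upper_one_sided_p(0.033, estimate_is_positive=True)
print(p_one_sided, p_one_sided < 0.05)
```

The one-tailed p-value of .0165 is below .05, so the null hypothesis is rejected at the 5% level.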
(d) Calculate the test statistic you would use to test the following set of hypotheses:
$H_0: \beta_{ne} = \beta_{s} = \beta_{w} = 0$
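One way to carry out this calculation is sketched below. It assumes that the second regression output (one regressor, typhoid deaths) is the restricted model and the first output is the unrestricted model, so n = 75, k = 5 and g = 2. It applies the two subset F formulas from the formula sheet, using the residual sums of squares and the R2 values from the two outputs; the function names are illustrative.

```python
def f_from_ess(ess_r: float, ess_u: float, n: int, k: int, g: int) -> float:
    """F = (ESS_r - ESS_u) / ESS_u * (n - k) / (k - g)."""
    return (ess_r - ess_u) / ess_u * (n - k) / (k - g)


def f_from_r2(r2_u: float, r2_r: float, n: int, k: int, g: int) -> float:
    """F = (R2_u - R2_r) / (1 - R2_u) * (n - k) / (k - g)."""
    return (r2_u - r2_r) / (1.0 - r2_u) * (n - k) / (k - g)


n, k, g = 75, 5, 2  # g = 2 assumes the restricted model keeps the constant and typhoid deaths

f_ess = f_from_ess(ess_r=18.93946405, ess_u=8.971151182, n=n, k=k, g=g)
f_r2 = f_from_r2(r2_u=0.558488491, r2_r=0.06790208, n=n, k=k, g=g)

print(round(f_ess, 2), round(f_r2, 2))
```

Both versions give a test statistic of about 25.9, to be compared with the critical value of an F distribution with 3 and 70 degrees of freedom.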
3. (12 points) Suppose that we are interested in the relationship between hours of weekly exercise
and resting heart rate. The more a person exercises on average, the lower his or her resting
heart rate is. For individuals who don’t exercise at all, males have a lower resting heart
rate on average than females. The decrease in resting heart rate from an additional hour of
exercise per week is bigger for males than females.
(a) Write down the regression model you would use to estimate the relationship between
resting heart rate, gender and weekly exercise. Resting heart rate should be your
dependent variable. Provide clear definitions of all variables you include in your model.
Our regression model will have to include resting heart rate, weekly exercise and
a variable capturing gender. Since gender is a categorical variable, we will need
to use a dummy variable. We have two values for gender (male and female) so
we will need one dummy variable. Let’s make our dummy variable for male, so
it equals one if gender is male and equals zero if gender is female. This
gives us the following set of variables:
• R - resting heart rate
• E - amount of weekly exercise
• M - dummy variable equal to one if male, zero if female
Our regression equation will have R as the dependent variable. E and M will
be independent variables. We also need to include an interaction term between
E and M since the marginal effect of E on R depends on the value of M . This
gives us the following regression model:
R = β1 + β2 E + β3 M + β4 E · M + ε
(b) Based on the information given above, what are the expected signs for each of your
coefficients in the regression model you specified in part (a)?
Notice that β2 is the marginal effect of exercise on resting heart rate for females
(since the interaction term will be zero). According to the problem, more exercise
lowers the resting heart rate, so β2 should be negative. The marginal effect
of exercise on heart rate is larger (more negative) for males than females. This
marginal effect is captured by β2 + β4 , so β4 should be negative. For individuals
who do not exercise (E = 0), the difference between the average male heart rate
and the average female heart rate will be β3 . We are told that, among individuals
who don’t exercise, males have a lower heart rate than females. So β3 should be
negative. Finally, heart rate has to be positive overall, so the constant term β1
should be positive (if it were negative, we would predict that a female who does
not exercise has a negative heart rate). To summarize:
β1 > 0
β2 < 0
β3 < 0
β4 < 0
Note that if you used a dummy variable equal to one for females and zero for
males, your signs for β3 and β4 would be reversed. The signs for β1 and β2
would stay the same.
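The effect of switching the dummy variable can be verified with made-up coefficients that satisfy the signs above. The values 75, −1, −5 and −0.5 are hypothetical; the second set of coefficients is what the same model implies when the dummy is F = 1 − M.

```python
# Hypothetical coefficients for R = b1 + b2*E + b3*M + b4*E*M (M = 1 if male).
b1, b2, b3, b4 = 75.0, -1.0, -5.0, -0.5

# Implied coefficients for R = g1 + g2*E + g3*F + g4*E*F, where F = 1 - M.
g1, g2, g3, g4 = b1 + b3, b2 + b4, -b3, -b4


def rate_male_dummy(E, M):
    return b1 + b2 * E + b3 * M + b4 * E * M


def rate_female_dummy(E, F):
    return g1 + g2 * E + g3 * F + g4 * E * F


# Both codings give the same predictions for every person.
for hours in (0, 3, 7):
    print(hours, rate_male_dummy(hours, 1), rate_female_dummy(hours, 0))  # males
    print(hours, rate_male_dummy(hours, 0), rate_female_dummy(hours, 1))  # females

print("signs with female dummy:", g1 > 0, g2 < 0, g3 > 0, g4 > 0)
```

The constant and the exercise coefficient now describe males, so their values change, but they keep their signs, while the dummy and interaction coefficients flip sign.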
(c) Suppose people tend to make random mistakes when measuring their heart rate. What
effects will this have on the estimation results when you run the regression model specified
in part (a)?
Measurement error in the dependent variable will not bias our coefficients. So
the expected values of the coefficients will stay the same. However, the
measurement error does add variance to the error term which will lead to less precise
estimates of the coefficients (larger standard errors).