Stat 302 Practice Final: Brad McNeney 2017-04-15
Brad McNeney
2017-04-15
Introduction
This practice final is made from the datasets that were analyzed for the final exam of Stat 302 in 2015. We
used a different textbook then, so I have had to modify the questions from their original form. Please note: The
analyses in this document represent a subset of all topics listed in the review document. You are responsible
for all the topics in the review. This practice exam is intended to give you an idea of the style of questions.
Questions
Short questions
1. (1 mark) Which of the following summary statistics measures the spread of a distribution: first quartile,
third quartile, inter-quartile range, or median?
2. (1 mark) Briefly, define the sampling distribution of a statistic.
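As a study aid for question 2, the idea of a sampling distribution can be illustrated by simulation. The sketch below is only an illustration: the standard normal population, the sample size of 19, and the number of replicates are assumptions chosen for the example, not part of the exam.

```r
# Illustrative simulation only: the population (standard normal), the
# sample size and the number of replicates are assumptions for this sketch.
set.seed(302)
n <- 19
n_reps <- 5000

# Draw many samples of size n and record the sample mean of each one.
sample_means <- replicate(n_reps, mean(rnorm(n, mean = 0, sd = 1)))

# The spread of the simulated means should be close to sigma / sqrt(n).
sd(sample_means)
1 / sqrt(n)
```

The collection of simulated means approximates the distribution that the sample mean would follow over repeated samples of the same size.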
An individual’s critical flicker frequency is the highest frequency at which the flicker in a flickering light
source can be detected. At frequencies above the critical frequency, the light source appears to be continuous
even though it is actually flickering. This investigation recorded critical flicker frequency and iris colour of
the eye for 19 subjects. A summary of the flicker frequencies by group (eye colour) is as follows:
dat <- read.table("flicker.txt",header=TRUE)
library(dplyr)
dat %>% group_by(Colour) %>%
  summarize(n=n(),mean=mean(Flicker),sd=sd(Flicker))
## # A tibble: 3 × 4
##   Colour     n     mean       sd
##   <fctr> <int>    <dbl>    <dbl>
## 1   Blue     6 28.16667 1.527962
## 2  Brown     8 25.58750 1.365323
## 3  Green     5 26.92000 1.843095
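The group standard deviations in this summary can be compared directly. A minimal sketch follows; the cutoff of 2 for the ratio of largest to smallest SD is a common rule of thumb and is an assumption here, since the course text may use a different guideline.

```r
# Group standard deviations copied from the summary table above.
group_sd <- c(Blue = 1.527962, Brown = 1.365323, Green = 1.843095)

# Ratio of the largest to the smallest SD. The cutoff of 2 is a common
# rule of thumb and is an assumption, not something stated in the exam.
sd_ratio <- max(group_sd) / min(group_sd)
sd_ratio
sd_ratio < 2
```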
a. (1 mark) Is this a balanced or unbalanced design?
b. (1 mark) Comment on the constant error SD assumption.
c. (2 marks) Using baseline coding with brown eyes as the baseline group, write down the dummy variables
needed for an ANOVA model.
d. Using your dummy variables from (c), write down the model for a one-way ANOVA for these data,
including the error terms. You do not need to define any regression coefficients.
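The baseline coding in (c) can be checked in R. The sketch below rebuilds only the Colour factor from the group sizes in the summary table, so no Flicker values are needed; it assumes R's default treatment contrasts are in effect.

```r
# Rebuild only the Colour factor from the group sizes in the summary table
# (6 Blue, 8 Brown, 5 Green); no Flicker values are needed or invented.
Colour <- factor(rep(c("Blue", "Brown", "Green"), times = c(6, 8, 5)))

# Make Brown the reference (baseline) level.
Colour <- relevel(Colour, ref = "Brown")

# Treatment (baseline) contrasts give an intercept plus one 0/1 column
# for each non-baseline level.
X <- model.matrix(~ Colour)
colnames(X)
colSums(X)
```

Each non-baseline column is an indicator that equals 1 for subjects in that group and 0 otherwise, so the column sums recover the group sizes.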
e. (4 marks) The model from (d) is fit to the data and we obtain the following Q-Q plot:
library(broom)    # augment()
library(ggplot2)  # ggplot(), geom_qq()
ffit <- aov(Flicker~Colour,data=dat)
augment(ffit) %>% ggplot(aes(sample=.resid)) + geom_qq()
[Figure: normal Q-Q plot of the residuals; x-axis: theoretical, y-axis: sample]
What assumption does this plot assess? Does the assumption appear plausible? Justify your answer.
f. (4 marks) The ANOVA summary for the model fit in (e) is as follows:
anova(ffit)
##
## Pairwise comparisons using t tests with pooled SD
##
## data: Flicker and Colour
##
##       Blue   Brown 
## Brown 0.0071 -     
## Green 0.2020 0.1504
##
## P value adjustment method: none
What are the Bonferroni-corrected p-values?
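A Bonferroni correction multiplies each unadjusted p-value by the number of comparisons (three here) and caps the result at 1. A minimal sketch using base R's `p.adjust()` is given below; the names attached to the p-values are labels introduced for this example.

```r
# Unadjusted p-values copied from the pairwise comparison table above.
# The element names are labels made up for this sketch.
p_raw <- c(Blue_vs_Brown = 0.0071,
           Blue_vs_Green = 0.2020,
           Brown_vs_Green = 0.1504)

# Bonferroni: multiply by the number of comparisons, capped at 1.
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_bonf
```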
h. (1 mark) In light of (f), to what do you attribute the significant F test in (e)?
A football team records data on the length of punts made by 13 players at a tryout for the team. The distance
measure for each punter is the average of ten punts. They also record the following information on each
player:
• Hang: Time in air in seconds
• RStrength: Right leg strength in pounds
• LStrength: Left leg strength in pounds
• RFlexibility: Right leg flexibility in degrees
• LFlexibility: Left leg flexibility in degrees
• OStrength: Overall leg strength in pounds
You perform stepwise regression with the BIC criterion to build a predictive model of distance. The largest
model in your search includes all main effects; the smallest model includes only an intercept.
a. (1 mark) What is the BIC penalty term for model selection in this example? Report your answer to
four digits.
b. (2 marks) After several iterations of stepwise selection you obtain a model that includes RStrength,
RFlexibility, LFlexibility and OStrength. Here is a summary of the next iteration:
Distance ~ RStrength + RFlexibility + LFlexibility + OStrength
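For part (a) of the punting question, the BIC penalty per model term is the natural log of the sample size. The sketch below assumes the search is run with R's `step()`, where this value would be supplied through the `k` argument.

```r
# Number of punters in the tryout data.
n <- 13

# BIC penalty per term: log(n). With step(), this is the value passed as
# the k argument (assumption: the search was run with step()).
bic_penalty <- log(n)
signif(bic_penalty, 4)
```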
A soft drink vendor collects data on the relationship between the time in minutes a delivery takes, and two
explanatory variables: (i) the number of cases delivered (Cases), and (ii) the walking distance in feet to make
the delivery (Distance). The following summaries are obtained for 25 deliveries:
dat <- read.table("softdrin.txt",header=TRUE)
summary(dat)
##       Time           Cases          Distance     
##  Min.   : 8.00   Min.   : 2.00   Min.   :  36.0  
##  1st Qu.:13.75   1st Qu.: 4.00   1st Qu.: 150.0  
##  Median :18.11   Median : 7.00   Median : 330.0  
##  Mean   :22.38   Mean   : 8.76   Mean   : 409.3  
##  3rd Qu.:21.50   3rd Qu.:10.00   3rd Qu.: 605.0  
##  Max.   :79.24   Max.   :30.00   Max.   :1460.0
ggplot(dat,aes(x=Time)) + geom_histogram(binwidth=10)
[Figure: histogram of Time; x-axis: Time, y-axis: count]
ggplot(dat,aes(x=Cases)) + geom_histogram(binwidth=5)
[Figure: histogram of Cases; x-axis: Cases, y-axis: count]
ggplot(dat,aes(x=Distance)) + geom_histogram(binwidth=100)
[Figure: histogram of Distance; x-axis: Distance, y-axis: count]
a. (1 mark) How would you describe the distribution of the Distance variable?
b. (2 marks) Write out a linear model for mean Time that includes interaction between Cases and Distance.
Define any notation you use.
c. In terms of your model from the previous question, state formal hypotheses for testing for statistical
interaction.
d. (2 marks) Computer software reports the following VIFs. Do you have any concerns about collinearity?
If so, why? If not, why not?
sfit<-lm(Time~Cases*Distance,data=dat)
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
##     recode
vif(sfit)
[Figure: residuals versus fitted values; x-axis: .fitted, y-axis: .resid]
augment(sfit) %>% ggplot(aes(sample=.resid)) + geom_qq()
[Figure: normal Q-Q plot of the residuals; x-axis: theoretical, y-axis: sample]
f. (2 marks) The following graphic, called a dotplot, shows the hat values for the fitted model. Each
observation is represented by a dot.
augment(sfit) %>% ggplot(aes(x=.hat)) + geom_dotplot(binwidth=.03)
[Figure: dotplot of hat values; x-axis: .hat, y-axis: count]
Are there any observations with very high leverage? If so, why? If not, why not?
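One way to put the hat values in context is to compare them with their average. Hat values always sum to the number of fitted coefficients, so their mean is p/n. The multiples of 2 and 3 used below are common rules of thumb and are an assumption here; the course text may use a different cutoff.

```r
n <- 25   # deliveries
p <- 4    # coefficients: intercept, Cases, Distance, Cases:Distance

# Hat values sum to p, so their average is p / n.
mean_hat <- p / n

# Common rule-of-thumb cutoffs (assumed, not taken from the exam).
c(mean = mean_hat, twice = 2 * mean_hat, thrice = 3 * mean_hat)
```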
g. (2 marks) The Cook’s Distance values are:
round(augment(sfit)$.cooksd,2)
## [1] 0.09 0.00 0.00 0.06 0.00 0.00 0.04 0.00 2.76 0.17 0.19 0.01 0.00 0.01
## [15] 0.01 0.03 0.00 0.05 0.01 0.11 0.02 0.13 0.03 0.07 0.01
Are there any highly influential observations? If so, why? If not, why not?
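The Cook's Distance values listed above can be screened programmatically. The sketch below copies the printed values and flags any that exceed 1; that cutoff is a widely used rule of thumb and is an assumption here rather than something the exam specifies.

```r
# Cook's Distance values copied from the output above (25 deliveries).
cooksd <- c(0.09, 0.00, 0.00, 0.06, 0.00, 0.00, 0.04, 0.00, 2.76, 0.17,
            0.19, 0.01, 0.00, 0.01, 0.01, 0.03, 0.00, 0.05, 0.01, 0.11,
            0.02, 0.13, 0.03, 0.07, 0.01)

# Flag observations above the assumed rule-of-thumb cutoff of 1.
which(cooksd > 1)
```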
h. (2 marks) A summary of the fitted model is as follows:
tidy(sfit)
##             term     estimate    std.error statistic      p.value
## 2          Cases 1.318060e+00 0.1462440700  9.012740 1.157667e-08
## 3       Distance 1.232658e-02 0.0027574849  4.470225 2.111072e-04
## 4 Cases:Distance 7.419211e-04 0.0001749773  4.240100 3.659326e-04
Test for interaction at the 5% level and write two sentences to report your conclusion, one using technical
language and one in language that anyone can understand.
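The p-value for the interaction term can be reproduced from its t statistic. The sketch below assumes all 25 deliveries were used and that the model has four coefficients, giving 21 residual degrees of freedom.

```r
# t statistic for Cases:Distance, copied from the coefficient table.
t_stat <- 4.240100

# Residual degrees of freedom: 25 deliveries minus 4 coefficients (assumed).
df_resid <- 25 - 4

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)
p_value
p_value < 0.05
```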
i. (BONUS question) Write a sentence to interpret the effect of increasing Distance by 100 feet for a
delivery of 10 cases. Ten cases translates to a value of 1.24 of the centred Cases explanatory variable.
To calculate the effect, round coefficients to three significant digits and report the effect size to three
digits.
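The arithmetic for the bonus question can be sketched as follows. In a model with a Cases-by-Distance interaction, the change in mean Time per foot of Distance is the Distance coefficient plus the interaction coefficient times the (centred) Cases value. The variable names below are introduced for this example.

```r
# Coefficients from the fitted-model summary, rounded to three significant digits.
b_distance    <- signif(1.232658e-02, 3)   # Distance
b_interaction <- signif(7.419211e-04, 3)   # Cases:Distance

# Centred Cases value corresponding to a delivery of 10 cases.
cases_centred <- 1.24

# Change in mean Time (minutes) for a 100-foot increase in Distance.
effect_100ft <- 100 * (b_distance + b_interaction * cases_centred)
signif(effect_100ft, 3)
```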
Maple tree seeds look like spinning helicopters when they fall from the tree. A forest scientist studied the
relationship between how fast they fall (Velocity) and their size (Load), taking a total of 35 measurements
on three trees (12 on two of them and 11 on the third). We will analyze the relationship between Velocity
and Load, allowing for different relationships in the three trees (Tree).
a. (1 mark) How many dummy variables are required to include Tree in regression models?
b. (2 marks) A model that allows different lines for mean Velocity as a function of Load for each Tree
gives the following summary.
dat <- read.table("samara.txt",header=TRUE)
dat$Tree <- factor(dat$Tree)
fit1<-lm(Velocity~Load*Tree,data=dat)
tidy(fit1)