Stat 302 practice final

Brad McNeney
2017-04-15

Introduction

This practice final is made from the datasets that were analyzed for the final exam of Stat 302 in 2015. We
used a different textbook then, so I have had to modify the questions from their original form. Please note: The
analyses in this document represent a subset of all topics listed in the review document. You are responsible
for all the topics in the review. This practice exam is intended to give you an idea of the style of questions.

Questions

Short questions

1. (1 mark) Which of the following summary statistics measures the spread of a distribution: first quartile,
third quartile, inter-quartile range, or median?
2. (1 mark) Briefly, define the sampling distribution of a statistic.
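
As an illustration of the idea behind short question 2, the following sketch simulates the sampling distribution of a sample mean. The population (normal with mean 25 and SD 2), the sample size and the number of replicates are assumptions chosen for illustration; they do not come from the exam data.

```r
# Simulate the sampling distribution of the sample mean.
# Assumed (illustrative) population: Normal(mean = 25, sd = 2).
set.seed(302)
n <- 10        # size of each sample
reps <- 5000   # number of repeated samples
xbar <- replicate(reps, mean(rnorm(n, mean = 25, sd = 2)))

# The spread of the simulated means should be close to sd / sqrt(n).
c(simulated_sd = sd(xbar), theoretical_sd = 2 / sqrt(n))
```

A histogram of xbar would show the distribution of the statistic over repeated samples, which is what the term refers to.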

Question 1: Flicker frequency

An individual’s critical flicker frequency is the highest frequency at which the flicker in a flickering light
source can be detected. At frequencies above the critical frequency, the light source appears to be continuous
even though it is actually flickering. This investigation recorded critical flicker frequency and iris colour of
the eye for 19 subjects. A summary of the flicker frequencies by group (eye colour) is as follows:
dat <- read.table("flicker.txt",header=TRUE)
library(dplyr)
dat %>% group_by(Colour) %>%
  summarize(n=n(),mean=mean(Flicker),sd=sd(Flicker))

## # A tibble: 3 × 4
##   Colour     n     mean       sd
##   <fctr> <int>    <dbl>    <dbl>
## 1   Blue     6 28.16667 1.527962
## 2  Brown     8 25.58750 1.365323
## 3  Green     5 26.92000 1.843095
a. (1 mark) Is this a balanced or unbalanced design?
b. (1 mark) Comment on the constant error SD assumption.
c. (2 marks) Using baseline coding with brown eyes as the baseline group, write down the dummy variables
needed for an ANOVA model.
d. Using your dummy variables from (c), write down the model for a one-way ANOVA for these data,
including the error terms. You do not need to define any regression coefficients.
e. (4 marks) The model from (d) is fit to the data and we obtain the following Q-Q plot:
library(broom)    # provides augment()
library(ggplot2)  # provides ggplot() and geom_qq()
ffit <- aov(Flicker ~ Colour, data = dat)
augment(ffit) %>% ggplot(aes(sample = .resid)) + geom_qq()

[Figure: normal Q-Q plot of the residuals from ffit. x-axis: theoretical quantiles; y-axis: sample quantiles.]
What assumption does this plot assess? Does the assumption appear plausible? Justify your answer.
f. (4 marks) The ANOVA summary for the model fit in (e) is as follows:
anova(ffit)

## Analysis of Variance Table
##
## Response: Flicker
##           Df Sum Sq Mean Sq F value  Pr(>F)
## Colour     2 22.997 11.4986  4.8023 0.02325 *
## Residuals 16 38.310  2.3944
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
State the null and alternative hypotheses being tested by the F test, and report the results of a test at the
5% level in technical language and language that anyone can understand.
g. (3 marks) The following raw p-values are obtained from pairwise comparisons.
with(dat,pairwise.t.test(Flicker,Colour,p.adjust.method="none"))

##
## Pairwise comparisons using t tests with pooled SD
##
## data: Flicker and Colour
##
##       Blue   Brown
## Brown 0.0071 -
## Green 0.2020 0.1504
##
## P value adjustment method: none
What are the Bonferroni-corrected p-values?

h. (1 mark) In light of (g), to what do you attribute the significant F test in (f)?
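
For part (g), the Bonferroni correction can be sketched with base R's p.adjust(). This is a minimal illustration, assuming the three raw p-values are typed in by hand from the pairwise output; the vector names are labels introduced here, not part of the original analysis.

```r
# Raw p-values copied from the pairwise.t.test() output in part (g).
raw_p <- c(Blue_Brown = 0.0071, Blue_Green = 0.2020, Brown_Green = 0.1504)

# Bonferroni: multiply each p-value by the number of comparisons, capping at 1.
bonf_p <- p.adjust(raw_p, method = "bonferroni")
bonf_p

# Equivalent by-hand calculation.
pmin(1, raw_p * length(raw_p))
```

The same adjusted values would be reported directly by calling pairwise.t.test() with p.adjust.method="bonferroni".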

Question 2: Football punts

A football team records data on the length of punts made by 13 players at a tryout for the team. The distance
measure for each punter is the average of ten punts. They also record the following information on each
player:
• Hang: Time in air in seconds
• RStrength: Right leg strength in pounds
• LStrength: Left leg strength in pounds
• RFlexibility: Right leg flexibility in degrees
• LFlexibility: Left leg flexibility in degrees
• OStrength: Overall leg strength in pounds
You perform stepwise regression with the BIC criterion to build a predictive model of distance. The largest
model in your search includes all main effects; the smallest model includes only an intercept.
a. (1 mark) What is the BIC penalty term for model selection in this example? Report your answer to
four digits.
b. (2 marks) After several iterations of stepwise selection you obtain a model that includes RStrength,
RFlexibility, LFlexibility and OStrength. Here is a summary of the next iteration:
Distance ~ RStrength + RFlexibility + LFlexibility + OStrength

               Df Sum of Sq    RSS    BIC
- RStrength     1    221.49 1728.3 73.829
- RFlexibility  1    228.58 1735.3 73.882
- LFlexibility  1     64.68 1571.5 72.592
- OStrength     1    660.99 2167.8 76.774
<none>                      1506.8 74.611
+ Hang          1      5.81 1501.0 77.126
+ LStrength     1      4.45 1502.3 77.137
What action would you take next? Justify your answer.
c. (1 mark) After finishing stepwise selection we obtain the following model:
tidy(pfitbyBIC)

##          term   estimate  std.error statistic    p.value
## 1 (Intercept) 12.7675932 24.9925728 0.5108555 0.62054007
## 2  R_Strength  0.5563157  0.2104269 2.6437479 0.02457537
## 3  O_Strength  0.2716885  0.1003015 2.7087170 0.02198197
From this model, what is the predicted Distance for a punter with RStrength 170 and OStrength 266? Round
the coefficients to three significant digits and report your answer to three digits.
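
The arithmetic behind parts (a) and (c) can be sketched as follows. The sketch assumes the selection was run with R's step() function, which scores a linear model as n * log(RSS / n) plus a penalty of k per coefficient, with k = log(n) for BIC; the variable names are introduced here for illustration.

```r
# BIC penalty per coefficient is log(n), used in place of the AIC value of 2.
n <- 13
bic_penalty <- log(n)
signif(bic_penalty, 4)

# Check against the <none> row of the stepwise table:
# five coefficients (intercept plus four explanatory variables), RSS = 1506.8.
bic_none <- n * log(1506.8 / n) + bic_penalty * 5
bic_none

# Predicted Distance from the final model, with coefficients rounded to
# three significant digits, for RStrength = 170 and OStrength = 266.
b <- signif(c(12.7675932, 0.5563157, 0.2716885), 3)
pred <- sum(b * c(1, 170, 266))
signif(pred, 3)
```

Reproducing the tabulated BIC for the current model is a useful check that the penalty is being applied to the right number of coefficients.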

Question 3: Soft drinks

A soft drink vendor collects data on the relationship between the time in minutes a delivery takes and two
explanatory variables: (i) the number of cases delivered (Cases), and (ii) the walking distance in feet to make
the delivery (Distance). The following summaries are obtained for 25 deliveries:
dat <- read.table("softdrin.txt",header=TRUE)
summary(dat)

##       Time           Cases          Distance
##  Min.   : 8.00   Min.   : 2.00   Min.   :  36.0
##  1st Qu.:13.75   1st Qu.: 4.00   1st Qu.: 150.0
##  Median :18.11   Median : 7.00   Median : 330.0
##  Mean   :22.38   Mean   : 8.76   Mean   : 409.3
##  3rd Qu.:21.50   3rd Qu.:10.00   3rd Qu.: 605.0
##  Max.   :79.24   Max.   :30.00   Max.   :1460.0
ggplot(dat,aes(x=Time)) + geom_histogram(binwidth=10)
[Figure: histogram of Time (binwidth 10). x-axis: Time; y-axis: count.]
ggplot(dat,aes(x=Cases)) + geom_histogram(binwidth=5)

[Figure: histogram of Cases (binwidth 5). x-axis: Cases; y-axis: count.]
ggplot(dat,aes(x=Distance)) + geom_histogram(binwidth=100)

[Figure: histogram of Distance (binwidth 100). x-axis: Distance; y-axis: count.]
a. (1 mark) How would you describe the distribution of the Distance variable?
b. (2 marks) Write out a linear model for mean Time that includes interaction between Cases and Distance.
Define any notation you use.
c. In terms of your model from the previous question, state formal hypotheses for testing for statistical
interaction.

d. (2 marks) Computer software reports the following VIFs. Do you have any concerns about collinearity?
If so, why? If not, why not?
sfit<-lm(Time~Cases*Distance,data=dat)
library(car)

##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
##     recode
vif(sfit)

##          Cases       Distance Cases:Distance
##       6.932817       4.842433      10.765414
e. (6 marks) The model is refit with the Cases and Distance variables centred by their means. The fitted
model yields the following residual plots. For each plot, state the assumptions being checked, and give
your opinion about whether or not each assumption is plausible:
dat <- mutate(dat,Cases = Cases-mean(Cases), Distance=Distance-mean(Distance))
sfit<-lm(Time~Cases*Distance,data=dat)
augment(sfit) %>% ggplot(aes(x=.fitted,y=.resid)) + geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'loess'
[Figure: residuals (.resid) versus fitted values (.fitted) for sfit, with a loess smooth.]
augment(sfit) %>% ggplot(aes(sample=.resid)) + geom_qq()

[Figure: normal Q-Q plot of the residuals from sfit. x-axis: theoretical quantiles; y-axis: sample quantiles.]
f. (2 marks) The following graphic, called a dotplot, shows the hat values for the fitted model. Each
observation is represented by a dot.
augment(sfit) %>% ggplot(aes(x=.hat)) + geom_dotplot(binwidth=.03)

[Figure: dotplot of the hat values (.hat) for sfit. x-axis: .hat, from 0.00 to 0.75; y-axis: count.]
Are there any observations with very high leverage? If so, why? If not, why not?
g. (2 marks) The Cook’s Distance values are:
round(augment(sfit)$.cooksd,2)

## [1] 0.09 0.00 0.00 0.06 0.00 0.00 0.04 0.00 2.76 0.17 0.19 0.01 0.00 0.01
## [15] 0.01 0.03 0.00 0.05 0.01 0.11 0.02 0.13 0.03 0.07 0.01
Are there any highly influential observations? If so, why? If not, why not?
h. (2 marks) A summary of the fitted model is as follows:
tidy(sfit)

##             term     estimate    std.error statistic      p.value
## 1    (Intercept) 2.107030e+01 0.5795254762 36.357857 1.895179e-20
## 2          Cases 1.318060e+00 0.1462440700  9.012740 1.157667e-08
## 3       Distance 1.232658e-02 0.0027574849  4.470225 2.111072e-04
## 4 Cases:Distance 7.419211e-04 0.0001749773  4.240100 3.659326e-04
Test for interaction at the 5% level and write two sentences to report your conclusion, one using technical
language and one in language that anyone can understand.
i. (BONUS question) Write a sentence to interpret the effect of increasing Distance by 100 feet for a
delivery of 10 cases. Ten cases translates to a value of 1.24 of the centred Cases explanatory variable.
To calculate the effect, round coefficients to three significant digits and report the effect size to three
digits.
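
The calculation requested in the bonus question can be sketched as below. In a model with a Cases:Distance interaction, the slope of mean Time in Distance depends on Cases, so the effect of 100 extra feet is 100 times (Distance coefficient + interaction coefficient × centred Cases). The variable names are introduced here for illustration.

```r
# Coefficients from tidy(sfit), rounded to three significant digits.
b_distance <- signif(1.232658e-02, 3)      # 0.0123
b_interaction <- signif(7.419211e-04, 3)   # 0.000742
cases_centred <- 1.24                      # 10 cases minus the mean of 8.76

# Change in mean Time (minutes) for a 100-foot increase in Distance
# when the delivery has 10 cases.
effect_100ft <- 100 * (b_distance + b_interaction * cases_centred)
signif(effect_100ft, 3)
```

Centring Distance does not change this slope; only the centred value of Cases enters the calculation.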

Question 4: Maple seeds

Maple tree seeds look like spinning helicopters when they fall from the tree. A forest scientist studied the
relationship between how fast they fall (Velocity) and their size (Load), taking a total of 35 measurements
on three trees (12 on two of them and 11 on the third). We will analyze the relationship between Velocity
and Load, allowing for different relationships in the three trees (Tree).
a. (1 mark) How many dummy variables are required to include Tree in regression models?
b. (2 marks) A model that allows different lines for mean Velocity as a function of Load for each Tree
gives the following summary.
dat <- read.table("samara.txt",header=TRUE)
dat$Tree <- factor(dat$Tree)
fit1<-lm(Velocity~Load*Tree,data=dat)
tidy(fit1)

##          term   estimate std.error  statistic    p.value
## 1 (Intercept)  0.5414479 0.2632359  2.0568924 0.04879063
## 2        Load  3.0628684 1.1599016  2.6406278 0.01318748
## 3       Tree2 -0.8407505 0.3356459 -2.5048736 0.01812005
## 4       Tree3 -0.2986812 0.4454446 -0.6705239 0.50782912
## 5  Load:Tree2  3.7342611 1.4999857  2.4895311 0.01877360
## 6  Load:Tree3  0.8204951 2.2836681  0.3592882 0.72198244
What are the estimated intercept and slope for the line for Tree2? Use three significant digits in your
calculations and report your answer to three digits.
c. (2 marks) The results of a multiple partial F test for interaction are summarized as follows.
fit2 <- lm(Velocity ~ Load+Tree,data=dat)
anova(fit2,fit1)

## Analysis of Variance Table
##
## Model 1: Velocity ~ Load + Tree
## Model 2: Velocity ~ Load * Tree
##   Res.Df     RSS Df Sum of Sq     F  Pr(>F)
## 1     31 0.20344
## 2     29 0.16549  2  0.037949 3.325 0.05011 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What are the degrees of freedom for the F statistic?
d. (2 marks) State the results of testing the no-interaction null hypothesis at the 5% level in (i) technical
language and (ii) non-technical language that anyone can understand.
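
The calculations behind parts (b) and (c) can be sketched from the printed output alone. With baseline coding and Tree 1 as the baseline, the Tree2 line adds the Tree2 and Load:Tree2 terms to the baseline intercept and slope; the partial F statistic compares the residual sums of squares of the two nested models. Variable names are introduced here for illustration, and the RSS values are the rounded ones shown in the table.

```r
# Coefficients from tidy(fit1), rounded to three significant digits.
b <- signif(c(intercept = 0.5414479, load = 3.0628684,
              tree2 = -0.8407505, load_tree2 = 3.7342611), 3)
tree2_intercept <- b[["intercept"]] + b[["tree2"]]
tree2_slope <- b[["load"]] + b[["load_tree2"]]
c(intercept = tree2_intercept, slope = tree2_slope)

# Multiple partial F test rebuilt from the anova(fit2, fit1) table.
rss_reduced <- 0.20344; df_reduced <- 31   # Velocity ~ Load + Tree
rss_full <- 0.16549;    df_full <- 29      # Velocity ~ Load * Tree
df_num <- df_reduced - df_full             # number of interaction terms tested
f_stat <- ((rss_reduced - rss_full) / df_num) / (rss_full / df_full)
p_value <- pf(f_stat, df_num, df_full, lower.tail = FALSE)
c(F = f_stat, df1 = df_num, df2 = df_full, p = p_value)
```

The numerator degrees of freedom equal the number of coefficients dropped from the larger model, and the denominator degrees of freedom are the residual degrees of freedom of the larger model.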
