MATH1541-WE01 Statistics I May 2016



Examination Session: May
Year: 2016
Exam Code: MATH1541-WE01


Time Allowed: 3 hours

Additional Material provided: Graph paper
Tables: Normal distribution, t-distribution, Chi-squared distribution,
F-distribution, Wilcoxon test, Mann-Whitney test.
Materials Permitted: You may keep one folder of notes at your desk.

Calculators Permitted: A Casio fx-83 GTPLUS or a Casio fx-85 GTPLUS electronic
calculator may be used.
Visiting Students may use dictionaries: Yes

Instructions to Candidates: Credit will be given for the best SIX answers.
All questions carry the same marks.
This is an open-book examination: you may keep one folder of
notes at your desk.


University of Durham Copyright
Page 2 of 8, Exam code MATH1541-WE01

1. The metabolic rate of a particular reaction is measured in the presence of different
concentrations of an enzyme thought to catalyse the reaction. The table below
shows the results of 9 experiments, and gives the enzyme concentration x and the
corresponding metabolic rate y.

x, enzyme concentration   2     2.5   3.3   3.8   5     6     7.1   8.2   9
y, metabolic rate         6.7   9.1   12.1  12.9  13.4  14.3  14.2  15.9  16.4

(a) Find the mean and standard deviation of both variables and the correlation
between them.
(b) Draw a scatterplot showing the dependence of metabolic rate on concentration
of enzyme.
(c) Calculate the regression equation ŷ = â + b̂x for predicting metabolic rate from
concentration of enzyme and add the regression line to your plot.
(d) Calculate a prediction for the metabolic rate when the enzyme concentration is
1.2. What assumptions are you making? Does your plot help you to validate
those assumptions?
(e) The regression line can be determined by minimising L, the sum of squared
vertical distances from the data $y_i$ to the corresponding points on the regression
line $\hat{y}_i$, where
$$L = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 .$$
By minimising L with respect to a then b prove that the regression estimates $\hat{a}$
and $\hat{b}$ are given by:
$$\hat{a} = \bar{y} - \hat{b}\bar{x} \quad\text{and}\quad \hat{b} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} .$$
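
As a numerical check on the closed-form expressions in part (e), the sketch below evaluates them on the data from the table in question 1. It is an illustrative aid rather than part of the paper; the variable names and the helper `L` are chosen here and are not taken from the exam.

```python
# Data from the table in question 1.
x = [2, 2.5, 3.3, 3.8, 5, 6, 7.1, 8.2, 9]
y = [6.7, 9.1, 12.1, 12.9, 13.4, 14.3, 14.2, 15.9, 16.4]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope and intercept from the closed-form expressions quoted in part (e).
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_xx = sum(xi * xi for xi in x)
b_hat = (sum_xy - n * x_bar * y_bar) / (sum_xx - n * x_bar ** 2)
a_hat = y_bar - b_hat * x_bar


def L(a, b):
    """Sum of squared vertical distances from the data to the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


print(f"b_hat = {b_hat:.4f}, a_hat = {a_hat:.4f}, L = {L(a_hat, b_hat):.4f}")
```

Moving either estimate away from these values should increase `L`, which is what the minimisation argument in part (e) establishes in general.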

2. A new and cheaper insulating material was tested for electrical resistance. Forty
resistance measurements on the material were made, and the results are shown below:
14.0 15.6 16.1 16.8 16.9 17.0 17.1 17.2 17.7 17.7
17.9 18.1 18.4 18.5 18.6 18.9 19.1 19.1 19.1 19.2
19.2 19.4 19.5 19.7 20.0 20.4 20.4 20.7 20.8 21.0
21.1 21.3 21.5 22.0 22.4 23.0 23.6 24.7 26.0 30.3

(a) Construct a stem and leaf plot of the 40 measurements, choosing an appropriate
class width.
(b) State clearly the endpoint convention you have used and describe the shape of
the distribution.
(c) Construct a box-plot of the data, making the usual modification to show
potential outliers. Show your working.
(d) What effect would you expect on each of the following if any potential
outliers were moved to just fall within the range of non-outlying observations:
(i) median, (ii) mean, (iii) inter-quartile range, and (iv) standard deviation?
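
The box-plot working in part (c) can be checked with a short sketch. It assumes one common convention, which the paper does not specify: the quartiles are taken as the medians of the lower and upper halves of the sorted data, and points more than 1.5 inter-quartile ranges beyond the quartiles are flagged as potential outliers. Other quartile conventions give slightly different fences.

```python
from statistics import median

# The 40 resistance measurements from question 2.
data = [
    14.0, 15.6, 16.1, 16.8, 16.9, 17.0, 17.1, 17.2, 17.7, 17.7,
    17.9, 18.1, 18.4, 18.5, 18.6, 18.9, 19.1, 19.1, 19.1, 19.2,
    19.2, 19.4, 19.5, 19.7, 20.0, 20.4, 20.4, 20.7, 20.8, 21.0,
    21.1, 21.3, 21.5, 22.0, 22.4, 23.0, 23.6, 24.7, 26.0, 30.3,
]
data.sort()
n = len(data)
half = n // 2

# Assumed convention: quartiles are the medians of the two halves.
med = median(data)
lower_q = median(data[:half])
upper_q = median(data[n - half:])
iqr = upper_q - lower_q

# Assumed convention: fences at 1.5 * IQR beyond the quartiles.
lower_fence = lower_q - 1.5 * iqr
upper_fence = upper_q + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]
inside = [v for v in data if lower_fence <= v <= upper_fence]
whiskers = (min(inside), max(inside))

print("median:", med, "quartiles:", lower_q, upper_q)
print("fences:", lower_fence, upper_fence)
print("whiskers:", whiskers, "potential outliers:", outliers)
```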

Page 3 of 8, Exam code MATH1541-WE01

3. The following data set is extracted from the book Weisberg, S. (2005). Applied
Linear Regression, 3rd edition. New York: Wiley.
Data were collected in an experiment in which rats were injected with a dose of a
drug approximately proportional to body weight. At the end of the experiment, the
animal’s liver was weighed, and the fraction of the drug recovered in the liver was
recorded. The response variable, y, is the amount of drug recovered from the liver.
The experimenter expected y to be unrelated to the predictors.
A pairs plot of the data is shown on page 4 of this exam paper. The data are shown
on page 5 of this exam paper as part of the R output.
Multiple regression was used to construct an equation for predicting y. The
regression output from R is shown on page 5 of this exam paper, together with standard
deviations for the variables in the data set.
Do not attempt this question until you have seen the pairs plot and R
output on pages 4 and 5 respectively.

(a) Briefly interpret the pairs plots shown on page 4.

(b) Calculate $s_e$.
(c) Assess the relative value of the predictors.
(d) Calculate the fitted value and residual for the first rat in the data frame.
Calculate a range which might be expected to include the amount of drug recovered
for 95% of rats having the same body characteristics as this rat.
(e) It does not seem wise to include both Body Weight and Dose in the model.
(f) A biologist attempts to fit alternative regression models for y that use fewer
than three of the predictor variables, as given in the table below. The first line
gives the result for the full model containing three predictors. Complete the
table by stating approximately what value of $R^2$ you would expect to find for
the four remaining models, and state a reason why we should be cautious about
such values.

Predictors used          Multiple $R^2$
BodyWt, LiverWt, Dose    0.3639
BodyWt, Dose
BodyWt, LiverWt

Page 4 of 8, Exam code MATH1541-WE01

[Figure: pairs plot of the variables in the rat data set (BodyWt, LiverWt, Dose, y)]

Page 5 of 8, Exam code MATH1541-WE01

> pairs(rat)
> rat
   BodyWt LiverWt Dose    y
1     176     6.5 0.88 0.42
2     176     9.5 0.88 0.25
3     190     9.0 1.00 0.56
4     176     8.9 0.88 0.23
5     200     7.2 1.00 0.23
6     167     8.9 0.83 0.32
7     188     8.0 0.94 0.37
8     195    10.0 0.98 0.41
9     176     8.0 0.88 0.33
10    165     7.9 0.84 0.38
11    158     6.9 0.80 0.27
12    148     7.3 0.74 0.36
13    149     5.2 0.75 0.21
14    163     8.4 0.81 0.28
15    170     7.2 0.85 0.34
16    186     6.8 0.94 0.28
17    146     7.3 0.73 0.30
18    181     9.0 0.90 0.37
19    149     6.4 0.75 0.46

> fit <- lm(y~BodyWt+LiverWt+Dose, data=rat)

> summary(fit)

lm(formula = y ~ BodyWt + LiverWt + Dose, data = rat)

      Min        1Q    Median        3Q       Max
-0.100557 -0.063233  0.007131  0.045971  0.134691

             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.265922   0.194585   1.367   0.1919
BodyWt      -0.021246   0.007974  -2.664   0.0177 *
LiverWt      0.014298   0.017217   0.830   0.4193
Dose         4.178111   1.522625   2.744   0.0151 *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.07729 on 15 degrees of freedom

Multiple R-squared: 0.3639, Adjusted R-squared: 0.2367
F-statistic: 2.86 on 3 and 15 DF, p-value: 0.07197

> sd(rat)
     BodyWt     LiverWt        Dose           y
16.49029486  1.22288127  0.08580203  0.08846647
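
For question 3(d), a fitted value is obtained by substituting a rat's predictor values into the estimated regression equation. The sketch below does this for the first rat using the rounded coefficients printed above, so the result may differ slightly from what R itself would report. The final range uses a rough "fitted value plus or minus two residual standard errors" rule, which is an assumption made here and not necessarily the method the course intends.

```python
# Rounded coefficient estimates copied from the printed summary(fit) output.
coef = {
    "(Intercept)": 0.265922,
    "BodyWt": -0.021246,
    "LiverWt": 0.014298,
    "Dose": 4.178111,
}
residual_se = 0.07729  # residual standard error from the printed output

# First row of the rat data frame.
rat1 = {"BodyWt": 176, "LiverWt": 6.5, "Dose": 0.88, "y": 0.42}

predictors = ("BodyWt", "LiverWt", "Dose")
fitted = coef["(Intercept)"] + sum(coef[k] * rat1[k] for k in predictors)
residual = rat1["y"] - fitted

# Assumption: a rough 95% range of fitted +/- 2 residual standard errors.
rough_range = (fitted - 2 * residual_se, fitted + 2 * residual_se)

print(f"fitted = {fitted:.4f}, residual = {residual:.4f}")
print(f"rough 95% range = ({rough_range[0]:.3f}, {rough_range[1]:.3f})")
```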

Page 6 of 8, Exam code MATH1541-WE01

4. The effects of two brands of pesticides on the yield of three types of barley were
investigated. For each combination of pesticide and barley variety, two repetitions
were performed and the resulting yields for all 12 experiments are shown below.

            Pesticide 1   Pesticide 2
Barley A    5.0, 7.2      3.3, 6.9
Barley B    3.6, 5.3      1.9, 1.6
Barley C    5.4, 3.2      3.0, 2.0

(a) Decompose the data into a table of group means and a table of residuals.
(b) Apply mean polish to the table of means, clearly labelling the components
of the result.
(c) Use your decomposition of the data to draw an “effects and residuals” plot and
briefly comment on it.
(d) Calculate the analysis of variance table and comment on it.
(e) Investigate whether the assumption of homogeneity seems reasonable and
suggest a possible remedy if not.
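
The decomposition asked for in parts (a) and (b) can be sketched as follows. The code splits each observation into a group mean plus a residual, then splits the table of group means into an overall mean, a barley effect, a pesticide effect and a leftover term. The dictionary layout and names are illustrative choices, not notation from the paper.

```python
# Yields from question 4, keyed by (barley variety, pesticide).
yields = {
    ("A", 1): [5.0, 7.2], ("A", 2): [3.3, 6.9],
    ("B", 1): [3.6, 5.3], ("B", 2): [1.9, 1.6],
    ("C", 1): [5.4, 3.2], ("C", 2): [3.0, 2.0],
}
varieties = ["A", "B", "C"]
pesticides = [1, 2]

# Part (a): group means and within-group residuals.
means = {k: sum(v) / len(v) for k, v in yields.items()}
residuals = {k: [obs - means[k] for obs in v] for k, v in yields.items()}

# Part (b): mean polish of the table of means.
overall = sum(means.values()) / len(means)
row_effect = {
    r: sum(means[(r, c)] for c in pesticides) / len(pesticides) - overall
    for r in varieties
}
col_effect = {
    c: sum(means[(r, c)] for r in varieties) / len(varieties) - overall
    for c in pesticides
}
leftover = {
    (r, c): means[(r, c)] - overall - row_effect[r] - col_effect[c]
    for r in varieties
    for c in pesticides
}

print("overall mean:", round(overall, 4))
print("barley effects:", {r: round(e, 4) for r, e in row_effect.items()})
print("pesticide effects:", {c: round(e, 4) for c, e in col_effect.items()})
print("leftover terms:", {k: round(e, 4) for k, e in leftover.items()})
```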

5. A new blood test is developed for a rare but often fatal disease. The test is quicker
and cheaper than existing methods (which are essentially perfect), but it is
sometimes inaccurate. Trials found that the average sensitivity of the test was 96% and
the average specificity was 93%.

(a) Explain in this context what is meant by sensitivity and specificity.

(b) Define and calculate the False Positive and False Negative rates, and explain
why one is worse in this context.
(c) A patient is randomly selected from the UK population and receives a positive
test result. He reads that the average incidence of this disease in the UK
population is approximately 1 in 2000. Calculate the probability that he has the
disease given the positive test result. Comment on the result.
(d) There are plans for widespread screening of the UK population using this test
(the plans involve testing every adult in the UK). Comment on whether you
think this is a good idea.
(e) The patient then reads that the incidence of the disease for men aged 65 to 70,
a group to which he belongs, is suspected to be much higher, but is currently
unknown. How high would his prior probability P(D+) have to be before there
was a 90% posterior probability that he had the disease?
(f) Comment briefly on whether the prior found in part (e) is reasonable.
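
Parts (c) and (e) both rest on Bayes' theorem applied to a positive test result. A minimal sketch follows, with the function names `posterior` and `prior_needed` introduced here for illustration. The second function simply rearranges the first to solve for the prior.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) by Bayes' theorem."""
    true_pos = sensitivity * prior
    false_pos = (1 - specificity) * (1 - prior)
    return true_pos / (true_pos + false_pos)


def prior_needed(target, sensitivity, specificity):
    """Prior P(D+) at which the posterior after a positive test equals target."""
    false_pos_rate = 1 - specificity
    return (target * false_pos_rate) / (
        sensitivity * (1 - target) + target * false_pos_rate
    )


sens, spec = 0.96, 0.93

# Part (c): incidence of 1 in 2000 as the prior.
p_c = posterior(1 / 2000, sens, spec)

# Part (e): prior required for a 90% posterior probability.
p_e = prior_needed(0.90, sens, spec)

print(f"posterior with prior 1/2000: {p_c:.4f}")
print(f"prior needed for 90% posterior: {p_e:.4f}")
```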

Page 7 of 8, Exam code MATH1541-WE01

6. A group of arachnologists are interested in studying the effect of environmental
impacts in Venezuela on the growth of rare species of tarantula spider. They capture
10 adult spiders in each of two locations, and measure the leg span of each spider
in cm, with the results given below. The scientists want to know if there is any
difference between spider size at the two locations, which could be attributable to
various environmental differences (including certain pollutant levels) between the
two locations.

Tarantula leg span (cm)

Location A   12.2  11.8  12.2  11.5   7.9  11.7  11.0  11.5  10.3  10.2
Location B   11.1  10.9  10.6  10.4  11.0   9.1  10.7  10.8   7.3   9.8

Analyse the data and respond to the scientists’ question. Your answer should include
clear justifications of the choices of analysis made.

7. (i) A random variable X has probability density function f (x) given as

x 0≤x≤3
f (x) = 9
0 elsewhere

(a) Find E[X] and Var[X].

(b) A new random variable Y has unknown probability density function, but
we are informed that Var[Y] = 1/50. What is the value of Var[X + 5Y],
assuming X and Y are independent?
(c) What is the maximum possible value of X + 5Y, without the assumption
of independence?
(ii) A study was performed into cardiac problems in later life among athletes
who had taken a performance-enhancing steroid earlier in their careers
(which was legal at the time). The table below shows the counts of such
athletes classified by severe, minor and no cardiac problems against their previous
steroid usage classified as high, low or zero. Test the hypothesis that there is
no association between steroid use and cardiac problems and comment on any
discrepancies between the independence model and the observations.
                      Cardiac Problems
Steroid Usage   Severe   Minor   None   Total
High                67      57     65     189
Low                 42      66     89     197
Zero                78      95    166     339
Total              187     218    320     725
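
Under the independence model in part (ii), each expected count is the row total times the column total divided by the grand total. The sketch below computes the expected counts, the cell-by-cell contributions and the chi-squared statistic for this table; the resulting statistic would then be compared with the chi-squared tables provided. Variable names are illustrative.

```python
# Observed counts: rows are steroid usage (High, Low, Zero),
# columns are cardiac problems (Severe, Minor, None).
observed = [
    [67, 57, 65],
    [42, 66, 89],
    [78, 95, 166],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected counts under independence: row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Contribution of each cell to the chi-squared statistic.
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
chi_sq = sum(sum(row) for row in contributions)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

print(f"chi-squared = {chi_sq:.2f} on {dof} degrees of freedom")
for label, row in zip(("High", "Low", "Zero"), contributions):
    print(label, [round(c, 3) for c in row])
```

The per-cell contributions indicate which cells depart most from the independence model, which is the basis for the comment on discrepancies.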

Page 8 of 8, Exam code MATH1541-WE01

8. A new drug is developed to help reduce the symptoms of migraines. However, it is
suspected that the drug may affect the user’s reaction times, and hence the following
study was performed. On the first day, 12 people took part in an extensive reaction
time test, with their scores listed below. The following day the same people were
then given a standard dose of the new drug and retested. Their new scores are also
found below (higher scores imply longer reaction times).

Test Subject             1     2     3     4     5     6     7     8     9     10    11    12
Day 1 Score (no drug)    51.3  73.8  80.8  75.9  68.5  81.5  84.8  63.5  88.1  49.4  62.3  60.7
Day 2 Score (drug)       51.0  70.1  88.8  81.5  81.2  82.0  88.5  67.4  90.7  56.5  60.2  55.5

(a) Analyse the data and address the scientific question. You must justify any
choices you have made within your analysis.
(b) Suggest any possible ways to improve the experiment.
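
Because the same 12 people are measured on both days, one natural analysis works with the within-subject differences. The sketch below computes those differences and a paired t statistic. This is only one possible choice: it assumes the differences are roughly normally distributed, and a Wilcoxon signed-rank test on the same differences is an alternative when that assumption is doubtful.

```python
from math import sqrt
from statistics import mean, stdev

# Scores from question 8, in subject order 1 to 12.
day1 = [51.3, 73.8, 80.8, 75.9, 68.5, 81.5, 84.8, 63.5, 88.1, 49.4, 62.3, 60.7]
day2 = [51.0, 70.1, 88.8, 81.5, 81.2, 82.0, 88.5, 67.4, 90.7, 56.5, 60.2, 55.5]

# Within-subject differences (drug minus no drug); positive means slower.
diffs = [after - before for before, after in zip(day1, day2)]
n = len(diffs)

mean_d = mean(diffs)
sd_d = stdev(diffs)  # sample standard deviation, divisor n - 1
t_stat = mean_d / (sd_d / sqrt(n))

print(f"mean difference = {mean_d:.3f}, sd = {sd_d:.3f}")
print(f"t = {t_stat:.3f} on {n - 1} degrees of freedom")
```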

ED01/2016 END