Problems

MATH11400 Statistics 1 2013-14
Homepage http://www.stats.bris.ac.uk/~maotj/teaching.html
Problem Sheet 1: Basics of data analysis
Remember: when online, you can access the Statistics 1 data sets from an R console by typing
load(url("http://www.maths.bris.ac.uk/~maotj/teach/stats1.RData"))
Ensure that you can run R. You may wish to investigate how to download R (for free) onto your
own computer, or to try it in the undergraduate Computer Lab. Running R is a central part of this
course, and you do need to be able to do it.
*1. In an experiment to investigate the heat of sublimation of iridium, the following 27 mea-
surements were made, listed across the rows in the order they were taken. The data is
contained in the Statistics 1 data set iridium.
136.6 145.2 151.5 162.7 159.1 159.8 160.8 173.9 160.1
160.4 161.1 160.6 160.2 159.5 160.3 159.2 159.3 159.6
160.0 160.2 160.1 160.0 159.7 159.5 159.5 159.6 159.5
(a) Use the R commands stem, hist, boxplot, and plot to make a stem-and-leaf
plot, a histogram, a boxplot and a plot of the observations in the order they were
taken. Print your plots and comment on the overall pattern and any striking features.
(b) Use the R commands median and mean to find the median and the mean. Use
?mean in R to see how to compute a trimmed mean in R. Compute the 10% and
20% trimmed means for the iridium data set.
Compare how well the mean, median and trimmed means represent the centre
of this data set.
(c) Use the R commands var and sd to find the sample variance and standard
deviation, use the R commands fivenum and summary to find the hinges and the
sample quartiles, and use the R command IQR to find the interquartile range (but
see the comments on Hinges and Quartiles below). Again, compare how these values
represent the spread of the data. [Care! In an ideal data set, would the variance,
standard deviation and interquartile range be the same?]
(d) What conclusions do you draw from your plots and numerical summaries? What
effect do the outliers have on the numerical and graphical summaries? What would
the corresponding results look like if the outliers were removed?
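The commands named in question 1 can be tried without loading the course data file. The sketch below types the 27 values in from the table above (the object name iridium is reused for convenience, and the object names centre and spread are illustrative only); it shows the calls, not the commentary the question asks for.

```r
# Iridium measurements, typed in from the table in question 1
iridium <- c(136.6, 145.2, 151.5, 162.7, 159.1, 159.8, 160.8, 173.9, 160.1,
             160.4, 161.1, 160.6, 160.2, 159.5, 160.3, 159.2, 159.3, 159.6,
             160.0, 160.2, 160.1, 160.0, 159.7, 159.5, 159.5, 159.6, 159.5)

# Measures of centre: mean, median, 10% and 20% trimmed means
centre <- c(mean   = mean(iridium),
            median = median(iridium),
            trim10 = mean(iridium, trim = 0.10),
            trim20 = mean(iridium, trim = 0.20))
print(round(centre, 3))

# Measures of spread: sample variance, standard deviation, interquartile range
spread <- c(var = var(iridium), sd = sd(iridium), IQR = IQR(iridium))
print(round(spread, 3))

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
print(fivenum(iridium))
```

Note that the trim argument of mean gives the fraction removed from each end of the ordered sample, so trim = 0.10 discards the lowest and highest 10% of the observations.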
*2. Make a new version of the iridium data set, excluding the apparent outliers, by typing
ir2 <- iridium[-c(1,2,3,4,8)]. Create a histogram and stem-and-leaf plot of
this new data set. Now make similar plots for an artificial sample made by generating the
same number (22) of observations from a normal distribution (e.g. data <- rnorm(22)).
Visually compare the plots for the real data and the artificial data.
Repeat a similar exercise with the storm.claims data set, comparing it with an artificial
sample of 19 observations drawn from an exponential distribution, created using
data <- rexp(19).
3. Construct an R function for calculating an empirical distribution function by typing in the
following instructions (note that the R prompt will switch from > to + while it is waiting
for the command to be completed):
plot.edf <- function(x) {
  n <- length(x)
  plot(sort(x), (1:n)/(n+1), type="s", xlab="data", ylab="empirical cdf")
}
Having loaded the Statistics 1 data sets, produce an empirical distribution function (edf)
plot of the iridium data by typing the command plot.edf(iridium) and comment on
how the shape of the edf relates to the data.
4. Having loaded the Statistics 1 data sets into R, use stem(us.temp,scale=4) to produce
a stem-and-leaf plot of the data set us.temp. The data give the mean January
temperatures for 60 U.S. metropolitan areas. Comment on any unusual pattern in the data
and try to find a plausible explanation.
5. (a) (From a recent Guardian puzzle) A lazy flea is wandering along a ruler. He knows
that at a certain time, he will receive an instruction to move to the 1 inch mark on the
ruler, the 2 inch mark or the 11 inch mark. Which of these it will be is uncertain, and
he can assume there is 1/3 probability of each of these possibilities. Where should he
position himself to minimise the distance he has to move when instructed?
(b) How does your answer change if instead he wants to minimise the mean squared
distance he has to move?
(c) What if he wants to minimise the maximum distance he might have to move?
(d) What does this question have to do with the issue of giving a numerical summary of
the centre of a sample of data $x_1, x_2, \ldots, x_n$?
*6. Let $\{x_1, \ldots, x_n\}$ be a data set of real numbers and let $y_i = a x_i + b$, for $i = 1, \ldots, n$, for
some $a \neq 0$.
(a) Let $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$. Show that $\bar{y} = a\bar{x} + b$ and $s_y^2 = a^2 s_x^2$.
(b) Find expressions for the median, interquartile range and trimmed mean of $\{y_i\}$ in
terms of those of $\{x_i\}$. (Note: why do you need to consider the cases $a > 0$ and
$a < 0$ separately?)
(c) Let x denote temperature in degrees centigrade and let y denote temperature in
degrees Fahrenheit, so $y = 1.8x + 32$. Assume the $\{x_i\}$ data set has mean 68.1, median
68.9, variance 3.2 and IQR 7.7. Calculate the corresponding quantities for the $\{y_i\}$
data.
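A quick numerical check can make the two identities in part (a) more concrete before attempting the proof. The sketch below uses made-up data and arbitrary constants (none of the values come from this sheet) and simply compares both sides of each identity. The same style of check can be used to test a conjecture for part (b).

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
x <- rnorm(50, mean = 20, sd = 3)    # made-up data, not from the sheet
a <- -2                              # hypothetical slope (note a < 0)
b <- 5                               # hypothetical intercept
y <- a * x + b

# Each row shows the left-hand and right-hand side of one identity
check <- rbind(mean     = c(lhs = mean(y), rhs = a * mean(x) + b),
               variance = c(lhs = var(y),  rhs = a^2 * var(x)))
print(check)
```

A numerical agreement of this kind is evidence, not a proof; the question still asks for the algebra.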
7. Boxplots are most useful for comparing more than one sample. The built-in data set
InsectSprays in R gives the number of insects found on plants subjected to 6 different
treatments labelled A-F. Type the following in R :
data(InsectSprays)
help(InsectSprays)
InsectSprays
boxplot(count ~ spray, data = InsectSprays)
The help command gives some background information about the data, and the command
InsectSprays on its own prints out the data. For this data set, the boxplot command
produces a separate boxplot (on common axes) for each of the treatments. Use this plot to
compare the different treatments. Calculate the mean and variance for each of the treatment
types and see if you come to the same conclusions. (It is good practice to work out how
to do this in R.)
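One possible way to obtain the per-treatment summaries is tapply, which applies a function to each group of a vector. This is only a sketch of one approach (aggregate or a loop over levels would work as well), and the object names group_mean and group_var are illustrative.

```r
data(InsectSprays)   # built-in data set from the datasets package

# Mean and variance of the insect counts within each spray type A to F
group_mean <- tapply(InsectSprays$count, InsectSprays$spray, mean)
group_var  <- tapply(InsectSprays$count, InsectSprays$spray, var)

print(round(cbind(mean = group_mean, variance = group_var), 2))
```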
Hinges and Quartiles
The lower hinge is the median of the set of values $\leq$ the sample median and the upper hinge is
the median of the set of values $\geq$ the sample median. Hinges were introduced by Tukey as a
simple alternative to quartiles, since sources disagreed on how quartiles should be defined.
Loosely speaking, quartiles are values that divide a data set into four equal parts: a quarter of
the data values are greater than the upper quartile, a quarter are between the upper quartile and
the median, a quarter are between the median and the lower quartile, and a quarter are less than
the lower quartile.
Given a data set with n data values $x_1, x_2, \ldots, x_{n-1}, x_n$, denote the ordered values (the order
statistics) by $x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n-1)} \leq x_{(n)}$. This suggests $Q_1$ should be roughly the
n/4th observation (ordered in increasing size). If n/4 is not an integer, then $Q_1$ should lie
between $x_{(i)}$ and $x_{(i+1)}$ where $i = [n/4]$ (the integer part of n/4). The methods actually used to
compute, say, $Q_1$ are more complicated than this, but most have the following common basis:
Set $r = (n + 1 - 2a)/4 + a$, set $i = [r]$ (the integer part of r) and set $\delta = r - i$. Then the required
value is $Q_1 = (1 - \delta) x_{(i)} + \delta x_{(i+1)}$, i.e. the value that lies $\delta$ of the way between $x_{(i)}$
and $x_{(i+1)}$.
Similarly for $Q_3$, set $s = 3(n + 1 - 2a)/4 + a$, set $j = [s]$ and set $\delta = s - j$. Then the required
value is $Q_3 = (1 - \delta) x_{(j)} + \delta x_{(j+1)}$.
Where methods differ is in the value of a: some have used a = 0 (Minitab), some a = 1 (Excel),
some a = 1/2 (S-plus). Rice suggests using a = 0 or a = 1/2. R uses a = 1/2 in some places
and a = 3/8 in other places. The differences between using different values of a, or indeed the
differences between the hinges and the quartiles, have no real practical importance in terms of
interpretation, and they are negligible numerically in larger data sets.
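The interpolation rule described above is short enough to code directly. The sketch below is an illustration under stated assumptions: the function name quartile_a and the sample x are made up, and the function only handles cases where the computed position lies between 1 and n. For comparison, the type argument of R's quantile function selects among several interpolation rules; types 6, 7, 5 and 9 use plotting positions that correspond to a = 0, 1, 1/2 and 3/8 respectively.

```r
# Quartile by linear interpolation between neighbouring order statistics.
# p is 1/4 for Q1 or 3/4 for Q3; a is the method constant from the text.
quartile_a <- function(x, p, a) {
  xs <- sort(x)
  n  <- length(xs)
  r  <- p * (n + 1 - 2 * a) + a    # equals (n+1-2a)/4 + a when p = 1/4
  stopifnot(r >= 1, r <= n)        # sketch only covers positions inside 1..n
  i  <- floor(r)                   # integer part of r
  d  <- r - i                      # fraction of the way from x_(i) to x_(i+1)
  i1 <- min(i + 1, n)              # avoid indexing past the largest value
  (1 - d) * xs[i] + d * xs[i1]
}

x <- c(3, 7, 8, 5, 12, 14, 21, 13, 18)   # made-up sample, n = 9
a_values <- c(a0 = 0, a1 = 1, a_half = 1/2, a_3_8 = 3/8)

q1 <- sapply(a_values, function(a) quartile_a(x, 1/4, a))
q3 <- sapply(a_values, function(a) quartile_a(x, 3/4, a))
print(rbind(Q1 = q1, Q3 = q3))

# Hinges for comparison: second and fourth entries of fivenum
print(fivenum(x)[c(2, 4)])
```

Running this on a small sample shows how much the choice of a matters when n is small, which is the point made in the closing paragraph above.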
Problem Sheet 2: Parametric families and method of moments
*1. Let $X_1, \ldots, X_n$ be a random sample from a Geom($\theta$) distribution, where $\theta$ is a single
unknown parameter. State the value of $E(X; \theta)$ and hence find the method of moments
estimator of the unknown parameter $\theta$.
2. Let $X_1, \ldots, X_n$ be a random sample from a $N(0, \sigma^2)$ distribution, where $\sigma > 0$ is a single
unknown parameter. Find the method of moments estimator of the unknown parameter $\sigma$.
3. The data {2.08, 2.81, 0.04, 1.54, 1.27, 0.74} are thought to come from a Uniform(0,3)
distribution. Calculate the corresponding expected quantiles of the Uniform(0,3) distribution.
Use R to plot the sample quantiles against these expected quantiles, and comment on the
fit of the distribution to the data.
*4. The data in the disasters data set relate to all British coal mining disasters between
March 1851 and March 1962 in which 10 or more men were killed. We will study the 120
gaps in days between successive disasters from the start of the series up to January 1889,
which we can extract as follows:
source("http://www.stats.bris.ac.uk/~maotj/teach/disasters.R")
(beware: the symbol ~ may not cut and paste from pdf into R)
gaps <- disasters$gap[2:121]
(if you are not going to be online when using R, you first have to copy the file from the
link on the website and save it on your computer, then use the "Source R code" item on
the File menu in R to navigate to the saved file and load it in).
(a) Use R to plot a histogram of the gaps data. Does the histogram indicate that an
exponential distribution would not be an appropriate model for this set of data? Note
that if the occurrence of disasters was completely at random, the gaps would have the
lack of memory property, with an exponential distribution.
(b) Assuming an Exponential distribution with parameter $\theta$ is an appropriate model, show
that the method of moments estimate of $\theta$ for this data set is $\hat{\theta} = 0.008681$.
(c) You are given that the distribution function $F_X(x; \theta)$ has inverse
$F_X^{-1}(y; \theta) = -\log(1 - y)/\theta$.
Use R to produce a quantile plot of the sample quantiles (the ordered observations)
against the corresponding approximate expected quantiles for the fitted Exponential
distribution and comment on how well the estimated Exponential distribution fits the
data.
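The mechanics of an exponential quantile plot can be sketched on simulated data. Everything below is an assumption made for illustration: gaps_sim is a simulated stand-in (not the disasters data), the rate 0.01 is arbitrary, and the plotting positions i/(n+1) are one common choice of approximate probabilities.

```r
set.seed(42)                           # arbitrary seed
gaps_sim <- rexp(120, rate = 0.01)     # simulated stand-in, NOT the disasters data

theta_hat <- 1 / mean(gaps_sim)        # method of moments estimate of the rate
n <- length(gaps_sim)
p <- (1:n) / (n + 1)                   # approximate plotting positions (an assumption)

expected <- -log(1 - p) / theta_hat    # inverse distribution function at p
observed <- sort(gaps_sim)             # sample quantiles = ordered observations

# Draw the plot only in an interactive session
if (interactive()) {
  plot(expected, observed,
       xlab = "expected quantiles", ylab = "sample quantiles")
  abline(0, 1)                         # points near this line suggest a good fit
}
```

Replacing gaps_sim by the real gaps vector would give the plot the question asks for.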
*5. In an experiment to investigate the effect of seeding clouds, the rainfall measurements
below were recorded for 25 seeded clouds. The data are contained in the Statistics 1 data
set seeded.rain.
4.1 7.7 17.5 31.4 32.7 40.6 92.4 115.3 118.3
119.0 129.6 198.6 200.7 242.5 255.0 274.7 302.8 334.1
430.0 489.1 703.4 978.0 1656.0 1697.8 2745.6
It is thought that a distribution in the family Gamma($\alpha$, $\lambda$) may provide a good model for
the data. Write down the two equations which determine the method of moments estimates
of the two unknown parameters $\alpha$ and $\lambda$. Solve these equations to find explicit expressions
for $\hat{\alpha}$ and $\hat{\lambda}$, each in terms of the sample moments $m_1$ and $m_2$.
6. The following data come from a series of experiments by Henry Cavendish in 1798, de-
signed to measure the density of the earth, as a multiple of the density of water. The data
are contained in the Statistics 1 data set cavendish.
5.50 5.61 4.88 5.07 5.26 5.55 5.36 5.29 5.58 5.65
5.57 5.53 5.62 5.29 5.44 5.34 5.79 5.10 5.27 5.39
5.42 5.47 5.63 5.34 5.46 5.30 5.75 5.68 5.85
(a) Use exploratory methods (histogram, boxplot etc.) to see if there is any immediate
reason to believe a $N(\mu, \sigma^2)$ distribution would not provide an appropriate model for
these data.
(b) Calculate the method of moments estimates of $\mu$ and $\sigma^2$. Use the estimates to produce
a plot in R of the sample quantiles against the corresponding approximate expected
quantiles for the fitted Normal distribution. Comment on how well the estimated
Normal distribution fits the data.
(c) You can produce an automatic plot of the quantiles of a data set against the
corresponding quantiles of a standard N(0, 1) distribution, and, if desired, a fitted line
through the first and third quartiles, with the commands
> qqnorm(dataset)
> qqline(dataset)
where dataset is the name of the data file. Comment on the similarities and
differences between this plot and the plot in part (b) above.
Problem Sheet 3: Likelihood and Maximum Likelihood Estimation
1. Find an expression for the maximum likelihood estimate of $\theta$ in terms of the observed
values $x_1, \ldots, x_n$ of a random sample of size n from a Poisson($\theta$) distribution where $\theta > 0$
is an unknown parameter.
*2. Find an expression for the maximum likelihood estimate of $\theta$ in terms of the observed
values $x_1, \ldots, x_n$ of a random sample of size n from a continuous distribution with probability
density function
$$f_X(x; \theta) = \begin{cases} \theta x^{\theta - 1} & 0 \leq x \leq 1 \\ 0 & \text{otherwise} \end{cases}$$
where $\theta > 0$ is an unknown parameter.
In a random sample of size n = 5 from this distribution, the observed values of $x_1, x_2, \ldots, x_5$
were 0.07, 0.29, 0.45, 0.61 and 0.30 respectively. Compute the value of the maximum
likelihood estimate of $\theta$ for this set of observations.
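A closed-form answer can be checked numerically by maximising the log-likelihood directly. The sketch below assumes the density is $\theta x^{\theta-1}$ on [0, 1] as stated in the question; the search interval (0.01, 10) is a hypothetical choice assumed to contain the maximiser, and the names loglik and fit are illustrative.

```r
x <- c(0.07, 0.29, 0.45, 0.61, 0.30)   # observations from the question

# Log-likelihood: sum over the sample of log(theta * x^(theta - 1))
loglik <- function(theta) sum(log(theta) + (theta - 1) * log(x))

# One-dimensional numerical maximisation over an assumed search interval
fit <- optimize(loglik, interval = c(0.01, 10), maximum = TRUE)
print(fit$maximum)     # numerical maximiser of the log-likelihood
print(fit$objective)   # value of the log-likelihood at that point
```

The numerical maximiser should agree, to within the tolerance of optimize, with the value obtained from your own expression.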
3. (a) Find an expression for the maximum likelihood estimate of $\theta$ in terms of the observed
values $x_1, \ldots, x_n$ of a random sample of size n from a Binomial(K, $\theta$) distribution
where K is known and where $\theta$ is an unknown parameter such that $0 < \theta < 1$.
(b) Passing a course is based on an exam with 10 multiple choice questions, and a pass
mark of 9. In seven mock exams a student has obtained scores of 9, 8, 10, 5, 8, 10
and 7. Assuming that she has independent probability $\theta$ of correctly answering each
question in the exam, and assuming these seven scores are a random sample of size
seven from a Binomial(10, $\theta$) distribution with the same value of $\theta$, find the maximum
likelihood estimate of the probability that the student will pass the examination.
*4. Consider the following data, recording the failure time, in hours, for a batch of 25 lamps.
The data are contained in the Statistics 1 data set lamp.
5.5 3.8 8.0 7.8 9.3 4.7 4.0 0.3 4.6 0.6 7.9 1.8 4.0
0.7 4.0 1.6 2.6 0.7 0.2 3.1 1.0 3.4 3.7 10.8 1.2
Assuming an Exponential distribution with parameter $\theta$ is an appropriate model, find $\hat{\theta}_{mle}$
based on the above data. Hence find the maximum likelihood estimates of:
(a) the median of the distribution of the lifetimes of lamps in the population;
(b) the probability of a randomly chosen lamp surviving beyond 10 hours.
Compare these to appropriate simple estimates calculated directly from the data, without
assuming an Exponential distribution.
*5. Let $X_1, \ldots, X_n$ be a random sample of size n from a $N(\mu_X, \sigma^2)$ distribution and let
$Y_1, \ldots, Y_m$ be a random sample of size m from a $N(\mu_Y, \sigma^2)$ distribution, and assume
the samples are independent of each other. Note that the means of the two distributions are
assumed to be possibly different, but the variances are assumed to be the same.
(a) Since all n + m random variables are independent, show that the log-likelihood function
for $\mu_X$, $\mu_Y$ and $\sigma^2$ based on all n + m observations is given by
$$\sum_{i=1}^{n} \log f_X(x_i; \mu_X, \sigma^2) + \sum_{j=1}^{m} \log f_Y(y_j; \mu_Y, \sigma^2).$$
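(b) Hence, explain why the likelihood equations for this problem are
$$\frac{\partial}{\partial \mu_X} l(\mu_X, \mu_Y, \sigma^2) = \frac{\sum_{i=1}^{n} x_i - n\mu_X}{\sigma^2} = 0$$
$$\frac{\partial}{\partial \mu_Y} l(\mu_X, \mu_Y, \sigma^2) = \frac{\sum_{j=1}^{m} y_j - m\mu_Y}{\sigma^2} = 0$$
$$\frac{\partial}{\partial \sigma} l(\mu_X, \mu_Y, \sigma^2) = -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n} (x_i - \mu_X)^2}{\sigma^3} - \frac{m}{\sigma} + \frac{\sum_{j=1}^{m} (y_j - \mu_Y)^2}{\sigma^3} = 0$$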
(c) Hence find expressions in terms of $x_1, \ldots, x_n$ and $y_1, \ldots, y_m$ for the joint maximum
likelihood estimates of $\mu_X$, $\mu_Y$ and $\sigma^2$ from the combined sample.
6. Assume n subjects are each given an envelope. Half the envelopes contain the instructions
"Tick box 1 if you have ever cheated on your tax return and tick box 0 otherwise"; the
others contain the instructions "Toss a coin and tick box 1 if it falls heads and box 0 if
it falls tails". The two sets of instructions are allocated to the envelopes at random and
only the subject knows which set of instructions applied to him (or her). Assume that all
subjects follow the instructions in their envelope honestly and correctly.
(a) Assume the probability that any given subject cheated on their tax return is $\theta$ and
let $\phi$ denote the probability that a randomly chosen subject, following the procedure
above, will tick box 1. Express $\phi$ in terms of $\theta$. What are the possible values for $\phi$?
(b) Assume 8 out of a sample of 20 subjects ticked box 1; find the maximum likelihood
estimate of $\phi$ based on these data. Hence estimate $\theta$.
Problem Sheet 4: Linear regression
*1. As part of a study of the relationship between stress and skill, the stress levels for eight
second year student volunteers were assessed and compared with their subject skill levels,
as measured by each student's average mark at the end of their first year.
Summary statistics for the data set are:
n = 8, $\sum x_i$ = 492, $\sum y_i$ = 379, $\sum x_i^2$ = 32,894, $\sum y_i^2$ = 20,115, $\sum y_i x_i$ = 21,087
where $x_i$ is the subject skill level for the ith student and $y_i$ is their assessed stress level.
Calculate by hand the least squares estimates of $\alpha$ and $\beta$ and the equation for the fitted
regression line, under the simple linear regression model $E(Y|x) = \alpha + \beta x$. What broad
conclusion can you draw immediately from the fitted model? What stress level would you
predict for a student with skill level x = 60?
2. The table below shows a data set with five pairs of values $(x_i, y_i)$, $i = 1, \ldots, 5$. It is thought
that the data satisfy the simple linear regression model $E(Y|x) = \alpha + \beta x$, $\mathrm{Var}(Y|x) = \sigma^2$.
x 1 3 4 6 7
y 0 1 2 5 4
(a) Calculate by hand the least squares estimates of $\alpha$ and $\beta$.
(b) For $i = 1, \ldots, 5$, calculate by hand the fitted values $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$ and the residual
values $\hat{e}_i = y_i - \hat{y}_i$. Hence calculate by hand an estimate of the common variance $\sigma^2$.
Also, find the sum of the residuals.
3. The table below shows the average weight (in kg) of piglets in a litter, for seven litters
of varying size. The data are contained in the Statistics 1 data frame pig, variables
littersize and wt respectively.
Litter size (x) 1 3 5 8 8 9 10
Average weight (y) 1.6 1.5 1.5 1.3 1.4 1.2 1.1
Perform a simple linear regression of average weight on litter size and output the results to
the R object piglets, with the commands:
> attach(pig); piglets <- lm(wt ~ littersize)
Calculate by hand the least squares estimates $\hat{\alpha}$ and $\hat{\beta}$ for the simple linear regression
model $E(Y|x) = \alpha + \beta x$ and confirm your answers with the R command:
> coef(piglets)
Draw a scatter plot of the data and add in the fitted regression line, using the commands:
> plot(littersize,wt); abline(coef(piglets))
Use your fitted regression line to predict the average weight of a piglet in a litter of size 6.
Let $x_i$ denote the litter size for the ith litter and let $y_i$ denote the corresponding average
weight for the piglets in that litter. Inspect the fitted values $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$ and the residual
values $\hat{e}_i = y_i - \hat{y}_i$ with the commands:
> fitted(piglets); residuals(piglets)
Plot the residuals against the litter sizes using the command:
> plot(littersize, residuals(piglets))
and comment on the fit of the model.
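The link between the hand calculation and lm can be sketched offline by typing the table in. The data frame below is rebuilt from the table above (the course file is not needed), the names sxx, sxy, alpha_hat and beta_hat are illustrative, and the data argument of lm is used here in place of attach.

```r
# Piglet data typed in from the table in question 3
pig <- data.frame(littersize = c(1, 3, 5, 8, 8, 9, 10),
                  wt         = c(1.6, 1.5, 1.5, 1.3, 1.4, 1.2, 1.1))
x <- pig$littersize
y <- pig$wt

# Least squares estimates from the usual sums of squares and products
sxy <- sum((x - mean(x)) * (y - mean(y)))
sxx <- sum((x - mean(x))^2)
beta_hat  <- sxy / sxx
alpha_hat <- mean(y) - beta_hat * mean(x)

# The same fit via lm, passing the data frame explicitly
piglets <- lm(wt ~ littersize, data = pig)
print(rbind(by_formula = c(alpha_hat, beta_hat), from_lm = coef(piglets)))

# Residuals from a least squares fit with an intercept sum to (numerically) zero
print(sum(residuals(piglets)))
```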
*4. The table below shows the rainfall (in inches) for the spring (April/May) and the following
autumn (September/October) for each of ten consecutive years. The data are contained in
the Statistics 1 data frame rain, in the variables spring and autumn respectively. You can
extract columns using rain$spring etc.
Spring rainfall (x) 1.6 5.3 2.8 9.6 6.7 1.5 5.4 8.5 4.1 3.9
Autumn rainfall (y) 4.6 6.0 2.9 11.1 8.2 1.3 9.1 10.2 5.2 8.3
(a) Let $x_i$ denote the spring rainfall in the ith year and let $y_i$ denote the corresponding
autumn rainfall. Use R to calculate the least squares estimates $\hat{\alpha}$ and $\hat{\beta}$ for the simple
linear regression model $E(Y|x) = \alpha + \beta x$. Use R to produce a scatter plot of the
data and use R to add in the fitted regression line.
(b) For $i = 1, \ldots, 10$, use R to calculate the fitted values $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$ and the residual
values $\hat{e}_i = y_i - \hat{y}_i$. Use R to plot the residuals against the spring rainfalls, and
comment on the fit of the model.
*5. Consider a regression problem where the data values $y_1, \ldots, y_n$ are observed values of
response variables $Y_1, \ldots, Y_n$.
In the notes we assume that, for given values $x_1, \ldots, x_n$ of the predictor variable, the $Y_i$
satisfy the simple linear regression model $Y_i = \alpha + \beta x_i + e_i$, where the $e_i$ are i.i.d.
$N(0, \sigma^2)$. The least squares estimates of the regression parameter(s) are defined to be the
values which minimise the sum of squares of the differences between the observed $y_i$ and
the fitted values. For this model, $E(Y_i \mid x_i) = \alpha + \beta x_i$, so the least squares estimates are
the values minimising $\sum_{i=1}^{n} (y_i - (\alpha + \beta x_i))^2$.
Now consider an alternative model which takes the form $Y_i = \gamma x_i + e_i$, $i = 1, \ldots, n$,
where the $e_i$ satisfy the same assumptions as before but where there is now a single
unknown regression parameter $\gamma$. This model is sometimes used when it is clear from the
problem description that the value of E(Y) must be zero if the corresponding x value is
zero.
Derive an expression, in terms of the $x_i$ and $y_i$ values, for the least squares estimate for
$\gamma$ for this new model and suggest, with reasons, an appropriate estimate for $\sigma^2$.
Problem Sheet 5: Assessing the performance of estimators
*1. Let $X_1, \ldots, X_n$ be a random sample from the Uniform(0, $\theta$) distribution, for which the
population median is $\tau = \theta/2$.
(a) The method of moments estimate of $\tau$ is $\hat{\tau}_{mom} = \bar{X}$. Find $E(\bar{X})$ and $\mathrm{Var}(\bar{X})$, and hence
show that $\hat{\tau}_{mom}$ is unbiased and has mean square error $\theta^2/(12n)$.
(b) The maximum likelihood estimate of $\theta$ is $Y = \max\{X_1, \ldots, X_n\}$. You are given
that Y has probability density function $f_Y(y; \theta) = n y^{n-1}/\theta^n$ for $0 < y < \theta$ (and
$f_Y(y; \theta) = 0$ otherwise). Show that $E(Y) = n\theta/(n + 1)$.
The maximum likelihood estimator of the population median $\tau = \theta/2$ is
$\hat{\tau}_{mle} = \max\{X_1, \ldots, X_n\}/2$. Use the results for Y to show that $\hat{\tau}_{mle}$ has bias
$-\theta/(2(n + 1))$.
*2. The methods and R commands required for this question are similar to those in the notes
for the simulation from the Uniform(0,1) distribution, but with the obvious adjustments to
the names and to the formulae for the various estimators.
(a) Use the R commands below to construct a matrix xsamples with 1000 rows and
10 columns. The 10 data values in each row can be thought of as a random sample of
size 10 from an Exp($\theta$) distribution with rate $\theta = 1$ and the 1000 rows can be thought
of as B = 1000 independent repeated samples.
> xvalues <- rexp(10000,rate=1)
> xsamples <- matrix(xvalues,nrow=1000)
(b) Calculate the maximum likelihood estimate $\hat{\theta} = 1/\bar{x}$ for each sample, and plot a
histogram of the relative frequencies of the 1000 estimates of $\theta$ obtained.
(c) You may assume that the Exp($\theta$) distribution has median $\log(2)/\theta$. For each of the
B = 1000 samples in part (a) above, calculate both the sample median and the maximum
likelihood estimate of the population median $\tau(\theta) = \log(2)/\theta$.
Produce a single plot containing a boxplot of the 1000 values of the sample median
and a boxplot of the 1000 values of the mle of the population median.
(d) Since the observations above were drawn from a population distribution with $\theta = 1$,
add a horizontal line at height log(2) to your plot showing the true value of the median
for this population and use it to compare the sample median and the mle as estimators
of the population median.
(e) Calculate the mean and the variance of the 1000 values of the sample median and the
1000 values of the maximum likelihood estimate of the population median and use
these numerical values to compare the bias, variance and mean square error of the
two estimators.
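One way to organise the row-by-row calculations in this question is with apply, which evaluates a function on each row of a matrix. The sketch below is a possible layout, not the method from the notes: the seed is arbitrary, and the helper summarise and the other object names are illustrative.

```r
set.seed(2014)                                # arbitrary seed
xvalues  <- rexp(10000, rate = 1)
xsamples <- matrix(xvalues, nrow = 1000)      # 1000 samples of size 10

theta_hat  <- 1 / apply(xsamples, 1, mean)    # mle of the rate, one per row
med_sample <- apply(xsamples, 1, median)      # sample median, one per row
med_mle    <- log(2) / theta_hat              # mle of the population median
true_med   <- log(2)                          # true median when the rate is 1

# Simulation estimates of bias, variance and mean square error
summarise <- function(est, truth) {
  c(bias = mean(est) - truth,
    variance = var(est),
    mse = mean((est - truth)^2))
}
results <- rbind(sample_median = summarise(med_sample, true_med),
                 mle_median    = summarise(med_mle, true_med))
print(round(results, 4))

# Side-by-side boxplots with the true median marked (interactive sessions only)
if (interactive()) {
  boxplot(list(sample = med_sample, mle = med_mle))
  abline(h = true_med, lty = 2)
}
```

Because the samples are random, the numbers change with the seed; the comparison between the two rows is what matters.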
3. Let T be the total number of heads obtained when a fair coin is tossed 10 times. Let
$A = \{T \leq 1\}$ be the event that at most one head is obtained and let $B = \{T \geq 6\}$ be the
event that at least 6 heads are obtained.
(a) Calculate the exact values of P(A) and P(B).
(b) Calculate the approximate values of P(A) and P(B) given by applying the central
limit theorem without and with a continuity correction.
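The effect of a continuity correction can be explored with a small helper. The sketch below deliberately uses different numbers from the question (n = 20, p = 0.3, k = 5 are hypothetical), and the function name approx_le is illustrative; it approximates a lower-tail binomial probability with and without shifting the cut-off by one half.

```r
# Normal approximation to P(T <= k) for T ~ Binomial(n, p)
approx_le <- function(k, n, p, correction = FALSE) {
  shift <- if (correction) 0.5 else 0     # continuity correction moves k to k + 1/2
  pnorm((k + shift - n * p) / sqrt(n * p * (1 - p)))
}

n <- 20; p <- 0.3; k <- 5                 # hypothetical values, not those of the question

exact   <- pbinom(k, n, p)                # exact binomial probability
without <- approx_le(k, n, p)
with_cc <- approx_le(k, n, p, correction = TRUE)

print(round(c(exact = exact, without = without, with = with_cc), 4))
```

For an upper-tail event such as at least k successes, the same helper can be applied to the complement, 1 minus the probability of at most k - 1 successes.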
4. An architect is designing the car park for a new apartment block, which has 200 apartments.
She believes that the residents in 20% of the apartments will require 2 parking places, that
70% will require 1 place, and that the remaining 10% will not have a car.
(a) Let X be the number of parking places required by the residents of a randomly chosen
apartment. Find the mean and variance of X.
(b) If the architect provides 230 car parking places for residents, what is the probability
that this will not be enough? How many places would she need to provide for there to
be a 99% chance that there will be enough places to satisfy all the residents' demands?
*5. Opinion polls indicate that support for the government has been about 37%, but it is
thought that this may have changed in the light of recent events. Assume a random sample
of n electors is interviewed. Let $X_i = 1$ if the ith elector sampled supports the government
and let $X_i = 0$ otherwise, so that $T_n = X_1 + \cdots + X_n$ is the total number in the sample that
say they support the government. Assume throughout that $X_1, X_2, X_3, \ldots$ are independent
random variables and that $P(X_i = 1) = 0.37 = 1 - P(X_i = 0)$.
Assume the pollsters take a sample of size n = 1500. Use the central limit theorem to
find the probability that the proportion in the sample supporting the government will differ
from 0.37 by no more than 0.02, i.e. find $P(|T_n/n - 0.37| \leq 0.02)$. Perform this calculation
with and without a continuity correction, to see if it makes a significant difference in this
case.
Problem Sheet 6: Sampling distributions related to the Normal distribution
*1. Let $X_1, \ldots, X_n$ be a random sample of size n from a general distribution with population
mean denoted by $\mu = E(X)$ and population variance denoted by $\sigma^2 = \mathrm{Var}(X)$. (Note:
we are not assuming here that the population has a Normal distribution.)
(a) Show that the sample mean $\bar{X} = (X_1 + \cdots + X_n)/n$ has expected value $\mu$ (and so $\bar{X}$
is always an unbiased estimator of the population mean).
(b) Show also that $\bar{X}$ has variance $\sigma^2/n$ (and so $\bar{X}$ always has mean square error equal
to $\sigma^2/n$ as an estimator of $\mu$, where $\mu$ denotes the population mean and $\sigma^2$ denotes
the population variance).
(c) Assume the fact (see the notes) that $\sum_{j=1}^{n} X_j^2 = \sum_{j=1}^{n} (X_j - \bar{X})^2 + n\bar{X}^2$. Show that,
whatever the distribution of X, the sample variance $S^2 = \sum_{j=1}^{n} (X_j - \bar{X})^2/(n - 1)$
has expected value $\sigma^2$ (and so $S^2$ is always an unbiased estimator of the population
variance $\sigma^2$).
2. Let Y have a Gamma($\alpha$, $\lambda$) distribution. Show that $E(Y) = \alpha/\lambda$, and show that, for $\alpha > 1$,
$E(1/Y) = \lambda/(\alpha - 1)$. [Hint: Recall from your Probability 1 notes that
$\int_0^{\infty} x^{a-1} e^{-bx}\,dx = \Gamma(a)/b^a$ for all a > 0 and b > 0.]
*3. Let $X_1, \ldots, X_n$ be a random sample of size n from the Exponential($\theta$) distribution. We
found earlier that the maximum likelihood estimator of $\theta$ was $\hat{\theta}_{mle} = n/\sum_{i=1}^{n} X_i$.
(a) Find the moment generating function of $\sum_{i=1}^{n} X_i$. Hence show $\sum_{i=1}^{n} X_i$ has the
Gamma(n, $\theta$) distribution and state its mean.
(b) The population mean of the Exponential($\theta$) distribution is $\mu(\theta) = 1/\theta$ and the maximum
likelihood estimator of this population mean is $\hat{\mu}_{mle} = 1/\hat{\theta}_{mle}$. Show that the
maximum likelihood estimator $\hat{\mu}_{mle}$ has expected value $1/\theta$ (and so it is unbiased as
an estimator of the population mean).
(c) Using the fact (see previous question) that for Y with a Gamma($\alpha$, $\lambda$) distribution,
$E(1/Y) = \lambda/(\alpha - 1)$, show that $E(\hat{\theta}_{mle}) = n\theta/(n - 1)$. Hence find the average error
(i.e. the bias) of $\hat{\theta}_{mle}$ as an estimator of $\theta$ and show it is not an unbiased estimator of
$\theta$.
4. Use the R commands help(TDist) and help(Chisquare) to find out how to compute
the probability density function, distribution function and the inverse distribution
function for the t and the $\chi^2$ families of distributions.
(a) Plot the probability density function for the $t_r$ distribution over the interval (−4, 4) for
r = 1, 5, 10 and 15 degrees of freedom, and compare it with the probability density
function for the N(0, 1) distribution.
(b) Plot the probability density function for the $\chi^2_r$ distribution over the interval (0, 20)
for r = 5, 10 and 15 degrees of freedom.
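The density functions described on those help pages are dt and dchisq (with dnorm for the Normal). The sketch below is one possible way to tabulate and draw them; the grid of 201 points and the object names dens and max_gap are arbitrary choices, and the plotting calls are only run in an interactive session.

```r
x <- seq(-4, 4, length.out = 201)        # grid over the interval (-4, 4)

# Columns: t densities for r = 1, 5, 10, 15 and the standard Normal density
dens <- cbind(t1  = dt(x, df = 1),
              t5  = dt(x, df = 5),
              t10 = dt(x, df = 10),
              t15 = dt(x, df = 15),
              normal = dnorm(x))

# Largest vertical gap between each t density and the Normal density on the grid
max_gap <- apply(abs(dens[, 1:4] - dens[, "normal"]), 2, max)
print(round(max_gap, 4))

if (interactive()) {
  matplot(x, dens, type = "l", lty = 1:5, col = 1, ylab = "density")
  legend("topright", legend = colnames(dens), lty = 1:5)
  # A chi-squared density over (0, 20); repeat with add = TRUE for other r
  curve(dchisq(x, df = 5), from = 0, to = 20, ylab = "density")
}
```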
*5. Let $X_1, \ldots, X_n$ be a random sample of size n from the $N(\mu, \sigma^2)$ distribution. Denote the
sample variance by $S^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2/(n - 1)$ and denote the maximum likelihood
estimator of $\sigma^2$ by $\hat{\sigma}^2_{mle} = \sum_{i=1}^{n} (X_i - \bar{X})^2/n$.
(a) State the distribution of $\sum_{i=1}^{n} (X_i - \bar{X})^2/\sigma^2$ and its mean and variance, and thus find
the mean and variance of both $S^2$ and $\hat{\sigma}^2_{mle}$.
(b) Let $\tilde{\sigma}^2$ denote an estimator for $\sigma^2$. The bias of $\tilde{\sigma}^2$ is defined as $E(\tilde{\sigma}^2 - \sigma^2)$, while the
mean square error is defined as $E[(\tilde{\sigma}^2 - \sigma^2)^2]$. Note that it can be easier to calculate the
mean square error from its representation as $E[(\tilde{\sigma}^2 - \sigma^2)^2] = \mathrm{Var}(\tilde{\sigma}^2) + [\mathrm{bias}(\tilde{\sigma}^2)]^2$.
Use the results of part (a) to compare the performance of $S^2$ and $\hat{\sigma}^2_{mle}$ as estimators
of $\sigma^2$ in terms of their bias (i.e. their average error) and their mean square error (i.e.
their average squared error).
*6. Suppose that $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_n$ are all i.i.d. Normal(0, $\sigma^2$) where $\sigma$ is
unknown.
(a) Write down the distribution of the random variable $T_i = X_i^2 + Y_i^2$ for each i. Hence
find the maximum likelihood estimate of $\sigma$ based on observations $t_1, t_2, \ldots, t_n$ of
$T_1, T_2, \ldots, T_n$.
(b) Find also the maximum likelihood estimate of $\sigma$ based on observations $x_1, x_2, \ldots, x_n$
and $y_1, y_2, \ldots, y_n$ of $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_n$, respectively.
Problem Sheet 7: Confidence Intervals
To complete this sheet, you will need values of distributions such as t and $\chi^2$. This requires
access to R (or the use of statistical tables). Use of tables is not taught in the course, and in the
exam, tables are not needed and will not be provided.
All of the standard distribution probability and quantile values needed can be found in the R
output below.
> z
[1] 0.84162 0.95000 0.99000 1.28155 1.64485 1.95996 2.57583 2.84400
> pnorm(z)
[1] 0.80000 0.82894 0.83891 0.90000 0.95000 0.97500 0.99500 0.99777
> qt( c(.9,.95,.975), 8)
[1] 1.396815 1.859548 2.306004
> qt( c(.9,.95,.975), 32)
[1] 1.308573 1.693889 2.036933
> qt( c(.9,.95,.975), 33)
[1] 1.307737 1.692360 2.034515
> qchisq(c(.025,.05,.1,.9,.95,.975),7)
[1] 1.6899 2.1673 2.8331 12.0170 14.0671 16.0128
> qchisq(c(.025,.05,.1,.9,.95,.975),8)
[1] 2.1797 2.7326 3.4895 13.3616 15.5073 17.5345
> qchisq(c(.025,.05,.1,.9,.95,.975),9)
[1] 2.7004 3.3251 4.1682 14.6837 16.9190 19.0228
> qchisq(c(.025,.05,.1,.9,.95,.975),32)
[1] 18.291 20.072 22.271 42.585 46.194 49.480
> qchisq(c(.025,.05,.1,.9,.95,.975),33)
[1] 19.047 20.867 23.110 43.745 47.400 50.725
> qchisq(c(.025,.05,.1,.9,.95,.975),34)
[1] 19.806 21.664 23.952 44.903 48.602 51.966
> qchisq(c(.025,.05,.1,.9,.95,.975),50)
[1] 32.357 34.764 37.689 63.167 67.505 71.420
> qchisq(c(.025,.05,.1,.9,.95,.975),51)
[1] 33.162 35.600 38.560 64.295 68.669 72.616
Similar information may be provided in the exam, and it is important to know how to interpret it
and extract the required information.
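To read the output above, it helps to see how a few of its entries arise. The sketch below recomputes three of them; the object names are illustrative. In each q-function the first argument is the probability to the left of the returned value, and the second (for qt and qchisq) is the degrees of freedom.

```r
# qnorm is the inverse of pnorm: it returns the point with the given
# probability to its left under the N(0,1) distribution
z975 <- qnorm(0.975)
print(z975)
print(pnorm(z975))            # applying pnorm recovers 0.975

# Entries of qt(c(.9,.95,.975), 8): t quantiles with 8 degrees of freedom
t_crit <- qt(c(0.90, 0.95, 0.975), df = 8)
names(t_crit) <- c("p=0.90", "p=0.95", "p=0.975")
print(t_crit)

# Lower and upper 2.5% points of the chi-squared distribution with 8 df,
# i.e. the first and last entries of the corresponding qchisq line
chi_crit <- qchisq(c(0.025, 0.975), df = 8)
print(chi_crit)
```

For a two-sided interval at level $1 - \alpha$, the relevant quantile is the one with probability $1 - \alpha/2$ to its left, for example 0.975 for a 95% interval.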
*1. In an experiment to determine the fuel consumption of a new model of car, a driver was
employed to drive nine new cars, each for 100km. The fuel consumption in litres for each
of the nine 100km drives is displayed in the table below. The data is contained in the
Statistics 1 data set fuel.
12.09 11.18 9.97 10.50 9.92 9.97 11.84 10.93 10.70.
(a) State clearly any assumptions you make. Explore the data (e.g. with a stem-and-leaf
plot) to confirm that your assumptions are appropriate.
(b) Find a 90% confidence interval for the mean fuel consumption per 100km for cars of
this type.
(c) Find a 90% confidence interval for the variance of fuel consumption per 100km for
the population of cars of this type.
2. Neurobiological arguments suggest that learning to play an instrument may improve the
spatial-temporal reasoning of pre-school children. A study measured the spatial-temporal
reasoning of 34 preschool children before and after six months of piano lessons. The
changes in the reasoning scores of the children are displayed in the table below. You may
assume the data represents the observed values of a simple random sample of size n = 34
from a population with unknown mean $\mu$ and unknown variance $\sigma^2$. The data is contained
in the Statistics 1 data set piano.
2 5 7 2 2 7 4 1 0 7 3 4 3 4 9 4 5
2 9 6 0 3 6 1 3 4 6 7 1 7 3 3 4 4
Construct a 95% confidence interval for the population mean improvement in reasoning
scores, stating clearly any assumptions you make. You should summarize and display the
distribution of the data, say in a histogram, and hence check that your assumptions are
appropriate.
Under the same assumptions, construct a 95% confidence interval for the variance of the
improvement in reasoning scores in the population.
Note: Given a single data set xdata, containing n values with sample mean $\bar{x}$ and sample variance
$s^2$, the R command t.test(xdata,conf.level=0.95) will produce output
that includes a 95% confidence interval based on the end points
$c_L = \bar{x} - t_{n-1;\alpha/2}\, s/\sqrt{n}$ and $c_U = \bar{x} + t_{n-1;\alpha/2}\, s/\sqrt{n}$,
with $\alpha = 0.05$. The confidence level can, of course, be changed as desired. You should
answer the questions above by going through the relevant working yourself, but you may
wish to use this command to check your answers in cases where it is appropriate.
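A minimal sketch of such a check is given here, assuming a small made-up vector rather than any of the Statistics 1 data sets. It computes the two end points from the usual t-interval formula and compares them with the interval stored in the conf.int component of the t.test output.

```r
# Made-up data, purely for illustration; not one of the Statistics 1 data sets.
xdata <- c(4.1, 5.3, 3.8, 6.0, 5.1, 4.7)

n <- length(xdata)
alpha <- 0.05
xbar <- mean(xdata)
s <- sd(xdata)

# Upper alpha/2 point of the t distribution with n - 1 degrees of freedom.
tcrit <- qt(1 - alpha / 2, df = n - 1)

# End points computed by hand from the t-interval formula.
manual_ci <- c(xbar - tcrit * s / sqrt(n), xbar + tcrit * s / sqrt(n))

# The same interval as reported by t.test().
r_ci <- as.numeric(t.test(xdata, conf.level = 1 - alpha)$conf.int)

manual_ci
r_ci
```

The two printed vectors should agree up to rounding; a mismatch usually points to a slip in the hand calculation, such as using n rather than n - 1 degrees of freedom.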
*3. Consider again the following failure-time data for the batch of 25 lamps, which you may
assume is a simple random sample from an Exponential distribution with unknown parameter
$\theta$. The data is contained in the Statistics 1 data set lamp.
5.5 3.8 8.0 7.8 9.3 4.7 4.0 0.3 4.6 0.6 7.9 1.8 4.0
0.7 4.0 1.6 2.6 0.7 0.2 3.1 1.0 3.4 3.7 10.8 1.2
(a) Find an equal-tailed 95% confidence interval for the unknown parameter $\theta$ based on
this set of 25 observations.
(b) Let $X_1, \ldots, X_n$ be a simple random sample of size n from the Exp($\theta$) distribution.
You may assume that $E(1/\sum_{i=1}^{n} X_i) = \theta/(n-1)$. Use this result to find the average
length of a 95% confidence interval for $\theta$ based on a random sample of size n = 25,
expressing your answer as a multiple of the unknown parameter $\theta$.
4. Assume that a random sample of 1000 electors are interviewed and that 370 of those
interviewed say that they support the government. Find a 99% confidence interval for the
proportion of electors that support the government.
*5. Assume the 25 observations below are a random sample from the Unif(0, $\theta$) distribution.
1.41 0.11 0.61 4.06 2.81 4.23 2.68 4.43 2.98 4.15 0.10 4.04 5.57
2.04 4.44 5.48 1.53 0.10 4.82 5.99 2.35 0.07 3.24 5.83 1.57
For the Unif(0, $\theta$) distribution we saw earlier that the method of moments estimate
$\hat{\theta}_{mom}$ and the maximum likelihood estimate $\hat{\theta}_{mle}$ were given by
$\hat{\theta}_{mom} = 2\bar{X}$, where $\bar{X}$ is the sample mean, and
$\hat{\theta}_{mle} = X_{(n)}$, where $X_{(n)} = \max(X_1, \ldots, X_n)$ is the sample maximum.
(a) Use the fact that for a random sample of size n from the Unif(0, $\theta$) distribution,
$P(X_{(n)}/\theta \le v) = v^n$ for 0 < v < 1, to find values $v_1$ and $v_2$ such that
$P(X_{(25)}/\theta < v_1) = 0.025$ and $P(X_{(25)}/\theta > v_2) = 0.025$. Hence, following the general idea
seen in construction of other confidence intervals, but with different details, find an
equal-tailed 95% confidence interval for $\theta$ based on $\hat{\theta}_{mle}$.
(b) Find an equal-tailed 95% confidence interval for $\theta$ based on $\hat{\theta}_{mom}$. [Hint: Use the
Normal approximation to the distribution of $\bar{X}$ based on the Central Limit Theorem.]
Comment on whether the interval you get is compatible with the data.
Problem Sheet 8: Hypothesis tests
*1. A tyre company claims its tyres have a mean useful lifetime of 42,000 miles. A consumer
association bought one of the company's tyres from each of 10 randomly chosen outlets
and tested them on a test rig that simulated normal road conditions. The observed lifetimes
(in thousands of miles) were
42 36 46 43 41 35 43 45 40 39.
Thinking carefully about the context, what is an appropriate alternative hypothesis $H_1$ to
use in testing the manufacturer's claim? Use the data to test whether or not there is sufficient
evidence to reject the manufacturer's claim, using a test procedure with significance
level $\alpha = 0.05$. The data are contained in the Statistics 1 data set tyre.lifetimes.
[Your answer should include a statement of any model assumptions, a brief description
of your working at each stage of the test procedure including the p-value and the critical
region for the test, and a summary of your conclusions. You may find it helpful to check
your numerical calculations using the t.test() function in R . If you do this question
by hand calculation, it may help to know that pt(0.8808,9) gives 0.7993325 and
qt(0.95,9) gives 1.833113.]
*2. A certain manufacturer produces packets of biscuits with a nominal weight of 200g. You
may assume that it is known from experience that the standard deviation of the weight of
the packets is 4g. To carry out a control check on the actual weight of the packets produced,
an employee weighs 25 packets selected at random from a day's production and finds that
the average weight of the sample is $\bar{x} = 202.275$g.
Let $\mu$ denote the actual mean weight of 200g packets produced by the manufacturer. Test
the null hypothesis $H_0: \mu = 200$ against the alternative $H_1: \mu \neq 200$, using a test
procedure with significance level $\alpha = 0.01$. For what range of significance levels would
you reject $H_0$ in favour of $H_1$?
[Your answer should include a statement of any model assumptions, a brief description
of your working at each stage of the test procedure including the p-value and the critical
region for the test, and a summary of your conclusions.]
*3. A random variable X is known to have a Normal distribution with mean $\mu$ and variance
25. To test the hypotheses $H_0: \mu = 100$ versus $H_1: \mu > 100$ a test procedure is proposed
which would take a simple random sample of size n from the population distribution of X
and reject $H_0$ in favour of $H_1$ if the sample mean $\bar{x} > 102$, and otherwise accept $H_0$.
Find an expression in terms of the sample size n for the significance level of this test
procedure. Hence find the smallest sample size for which the significance level would be
less than 0.05.
Problem Sheet 9: Comparison of population means
Note: for each of the following questions your answer should include a statement of any model
assumptions, a brief description of your working at each stage of the test procedure including the
approximate p-value and the critical region for the test, and a summary of your conclusions. You
may find it helpful to check your numerical calculations using the t.test() function in R.
*1. You have learned about several different hypothesis tests that are relevant to different situa-
tions. It is vitally important that you apply the correct test in each situation. The following
are brief descriptions of experiments that are designed to test a hypothesis. For each de-
scription write down (i) the null and alternative hypothesis, and (ii) the type of hypothesis
test that should be used. There is no need to write down the test statistic or perform the
test.
(a) A car manufacturer claims that the level of nitrous oxide emissions from its new
engine is lower than from its old engine. A researcher evaluates eight engines of the
old type and nine of the new type. As a preliminary step the researcher establishes
that there is no difference between the variances of nitrous oxide emissions in both
engine types.
(b) A researcher measures the systolic blood pressure (SBP) of 20 men and 20 women
in a clinic. She wants to know whether there is a difference between SBP in men and
women.
(c) Students willingly volunteer for a test of the effects of alcohol on reaction times. A
random sample of 24 students have their reaction times measured and are then given
an alcoholic drink. Their reaction times are measured again, half an hour later.
(d) A manufacturer makes chocolate bars that are advertised as having a weight of 500g.
To carry out a production control check, employees select 30 bars at random from
a day's production. They want to make sure that the company is not manufacturing
underweight chocolate bars.
(e) An opinion polling company wants to know whether the majority of people in Bristol
would support a congestion charge. They telephone a random sample of 1000 people
in Bristol and 466 say they are in favour of a congestion charge.
2. In a study of blood glucose level, measurements were made on a sample of 52 pregnant
women in their third trimester of pregnancy. Their values (in milligrams/100 millilitres of
blood) were found to have a sample mean of 70.12. Healthy women who are not pregnant
are known to have a mean value of 80 (mg/100ml), with a standard deviation of 10
(mg/100ml), and we will assume that the standard deviation for pregnant women is also 10
(mg/100ml).
(a) Use the data to test the hypothesis that pregnant women have a lower glucose level
than women who are not pregnant. [Your answer should include a statement of any
model assumptions, a brief description of your working at each stage of the test pro-
cedure including the p-value, and a summary of your conclusions. You will probably
need to use R to calculate the actual p-value.]
(b) Now assume the test was carried out using a procedure with significance level $\alpha$ =
0.01. Calculate the probability of type II error under the alternative hypothesis that
the true mean value of glucose level in pregnant women is 79 (mg/100 ml). Hence
calculate the power of the test to discriminate between the null hypothesis and this
alternative.
3. (This question reminds you that sometimes the null hypothesis is true but due to chance the
test statistic is such that the null hypothesis is rejected. In particular, under a test procedure
with significance level $\alpha$, about $100\alpha\%$ of samples will result in the null hypothesis being
rejected even when $H_0$ is true.)
In this question we investigate the one-sided one sample t-test of the null hypothesis
$H_0: \mu = \mu_0$ against the alternative $H_1: \mu > \mu_0$, using a test procedure with significance level
$\alpha = 0.05$.
(a) Choose valid numerical values of $\mu_0$ and $\sigma^2$ for normally distributed data. Use the R
commands below to construct a matrix xsamples with 1000 rows and 10 columns.
The 10 data values in each row can be thought of as a random sample of size 10 from
a N($\mu_0$, $\sigma^2$) distribution. You will need to substitute in your own chosen values of
$\mu_0$ and $\sigma$, or store them as separate objects in R.
> xvalues <- rnorm(10000, mean=mu0, sd=sigma)
> xsamples <- matrix(xvalues,nrow=1000)
Now calculate the sample mean and sample standard deviation for each sample, and
store them as sample.mean and sample.sd respectively.
(b) Calculate the observed test statistic for each sample, using the command:
> sample.tobs <- sqrt(10) * (sample.mean - mu0) / sample.sd
again substituting in your chosen value of $\mu_0$. Plot a histogram of the relative frequencies
of the 1000 observed test statistics that you obtain. What sort of distribution
does the histogram portray? What distribution should we expect to see?
(c) For a one-sided test the critical region will be of the form C = {T > c}. Calculate
the exact value of c for n = 10 and mark this point with a vertical line on your
histogram. What do you notice?
(d) On average, how many of the 1000 sample test statistics should be inside the critical
region? How many of the tests in your sample were significant?
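One possible way to carry out the steps of this question is sketched here. The values mu0 = 5 and sigma = 2 and the seed are arbitrary choices made only for illustration, and the names crit and n.reject are hypothetical. The row-wise summaries use apply, which is only one of several ways to do this in R.

```r
set.seed(1)            # arbitrary seed so the run is reproducible
mu0 <- 5               # hypothetical choice of the null mean
sigma <- 2             # hypothetical choice of the standard deviation

xvalues <- rnorm(10000, mean = mu0, sd = sigma)
xsamples <- matrix(xvalues, nrow = 1000)

# Row-wise summaries: one mean and one standard deviation per sample of 10.
sample.mean <- apply(xsamples, 1, mean)
sample.sd <- apply(xsamples, 1, sd)

# Observed one-sample t statistic for each row.
sample.tobs <- sqrt(10) * (sample.mean - mu0) / sample.sd

# Critical value for the one-sided test at level 0.05 with n = 10.
crit <- qt(0.95, df = 9)

# Histogram on the relative-frequency (density) scale, with the critical value marked.
hist(sample.tobs, freq = FALSE, breaks = 30)
abline(v = crit, lty = 2)

# Number of simulated samples whose statistic falls in the critical region.
n.reject <- sum(sample.tobs > crit)
n.reject
```

Rerunning with a different seed changes n.reject, which is the point of part (d): the count varies from run to run around its expected value.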
*4. Eight athletes ran a 400 metre race at sea level and at a later meeting ran a 400 metre
race at high altitude. Their times in seconds are shown in the table below. Test the null
hypothesis that race times are unaffected by altitude, against the alternative that race times
are greater at high altitude, using a test procedure with significance level $\alpha = 0.05$. The
data are contained in the Statistics 1 data set runner.
Runner 1 2 3 4 5 6 7 8
Sea Level 48.3 47.6 49.2 50.3 48.8 51.1 49.0 48.1
High Altitude 50.4 47.3 50.8 52.3 47.7 54.5 48.9 49.9
[These values may be useful: pt(2.1958,7) is 0.9679363 and qt(0.95,7) is
1.894579.]
*5. To investigate the relative size of secretarial starting salaries in the public and private sec-
tor, 9 private sector posts and 10 public sector posts were chosen at random from jobs
advertised on the web. The table below shows the advertised starting salaries (in £1,000;
this data set is quite old!). You may assume that the population variances are the same in
the private and public sectors. Use the data to test at the 0.05-level the hypothesis that starting
salaries are the same in the two sectors against the alternative that private sector starting
salaries are higher. The data are contained in the Statistics 1 data set secretaries.
Private sector 12.1 13.4 11.3 10.6 9.7 12.5 9.6 13.6 11.2
Public sector 9.3 8.5 8.2 13.1 8.8 11.9 10.1 9.8 12.2 10.4
6. A study was conducted of 10 households to see if alerting them to high usage rates of elec-
tricity reduced their actual consumption. A small monitor was installed in each household,
which activated a red flashing light whenever the current rate of usage exceeded a preset
threshold. The monthly usage (in kilowatt-hours) before and after installation of the
monitor is given below. Test at the 0.05-level whether the monitor is effective at reducing
electrical consumption. The data are contained in the Statistics 1 data set kwh.
Household 1 2 3 4 5 6 7 8 9 10
Before 940 1370 1030 2030 1540 2300 1800 910 640 1200
After 900 1230 1060 2100 1250 2200 1820 900 630 1110
7. In a study to examine whether increasing the amount of calcium in the diet reduced blood
pressure, a group of 10 men were given a calcium supplement in their diet for 12 weeks,
and a control group of 11 men received a placebo (a pill that appeared identical, but con-
tained no active substance). The table below shows the relative change (in mm of mercury)
in blood pressure over the 12 week period (before - after) for each subject. What do you
conclude from the results shown? The data are contained in the Statistics 1 data set bp.
Calcium group 7 4 18 17 3 5 1 10 11 2
Placebo group 1 12 1 3 3 5 5 2 11 1 3
Problem Sheet 10: Linear Regression: Confidence Intervals & Hypothesis Tests
1. The table below shows a data set with five pairs of values, $(x_i, y_i)$, i = 1, . . . , 5. Assume
that the data satisfy the simple Normal linear regression model $Y_i = \alpha + \beta x_i + e_i$, where
the $e_i$ are i.i.d. N(0, $\sigma^2$).
x 1 3 4 6 7
y 0 1 2 5 4
(a) Test the null hypothesis $H_0: \alpha = 0$ against the alternative $H_1: \alpha \neq 0$ using a test
procedure with significance level 0.05.
(b) Test the null hypothesis $H_0: \beta = 0$ against the alternative $H_1: \beta \neq 0$ using a test
procedure with significance level 0.05.
In each case calculate the p-value of the observed test statistic.
*2. A study was conducted to examine the dependence of metabolic rate on body mass for 7
dogs, yielding data given in the table below.
Body mass (kg) 31.20 24.00 19.80 18.20 9.61 6.50 3.19
Metabolic rate (kcal/day) 1113.2 981.8 908.2 840.8 626.2 429.5 280.9
It was decided to analyse these data on a log scale; defining x as log body mass and y as
log metabolic rate, we fit a Normal linear regression model $Y_i = \alpha + \beta x_i + e_i$. Summary
statistics calculated from the data are
$\sum x_i = 17.800$, $\sum y_i = 45.590$, $\sum x_i^2 = 49.239$, $\sum y_i^2 = 298.433$, $\sum y_i x_i = 118.365$.
Calculate the end points of 99% confidence intervals for $\alpha$ and $\beta$.
[In R , I found qt(c(0.95,0.975,0.99,0.995),5) gives
[1] 2.015048 2.570582 3.364930 4.032143; some of these values may be
useful.]
3. The table below shows the rainfall (in inches) for the spring and the following autumn for
each of ten consecutive years. Let $x_i$ denote the observed spring rainfall in the ith year
and $y_i$ denote the corresponding observed autumn rainfall. Assume that the data satisfy
the simple Normal linear regression model $Y_i = \alpha + \beta x_i + e_i$, where the $e_i$ are i.i.d.
N(0, $\sigma^2$). The data are contained in the Statistics 1 data set rain.
Spring rainfall (x) 1.6 5.3 2.8 9.6 6.7 1.5 5.4 8.5 4.1 3.9
Autumn rainfall (y) 4.6 6.0 2.9 11.1 8.2 1.3 9.1 10.2 5.2 8.3
(a) Find a 90% confidence interval for $\beta$.
(b) Test the null hypothesis $H_0: \beta = 0$ against the alternative $H_1: \beta \neq 0$, using a test
procedure with significance level equal to 0.10.
(c) Show that in general a test of the null hypothesis $H_0: \beta = 0$ against the alternative
$H_1: \beta \neq 0$ with significance level $\alpha^*$ (say) will accept the null hypothesis if and only
if the corresponding $100(1 - \alpha^*)\%$ confidence interval for $\beta$ contains the value $\beta = 0$.
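The link between a two-sided test and the matching confidence interval can be checked numerically before proving it. The sketch below is an illustration only: it assumes the parameter in question is the slope of the regression, enters the rainfall values from the table by hand under the hypothetical names spring and autumn, and compares the two decisions at a few arbitrarily chosen levels.

```r
# Rainfall data as printed in the table for this question.
spring <- c(1.6, 5.3, 2.8, 9.6, 6.7, 1.5, 5.4, 8.5, 4.1, 3.9)
autumn <- c(4.6, 6.0, 2.9, 11.1, 8.2, 1.3, 9.1, 10.2, 5.2, 8.3)

fit <- lm(autumn ~ spring)

# Two-sided p-value for the test that the slope is zero.
slope_p <- summary(fit)$coefficients["spring", "Pr(>|t|)"]

# Arbitrary significance levels at which to compare the two decisions.
levels <- c(0.01, 0.05, 0.10, 0.50)

agree <- sapply(levels, function(a) {
  ci <- confint(fit, "spring", level = 1 - a)
  ci_contains_zero <- ci[1] <= 0 && 0 <= ci[2]
  test_accepts <- slope_p >= a
  ci_contains_zero == test_accepts
})
agree
```

A vector of TRUE values is what the general result predicts; a numerical check like this is of course not a substitute for the algebraic argument the question asks for.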
*4. The table below shows the average weight (in kg) of piglets in a litter, for seven litters of
varying size. The data are contained in the Statistics 1 data set pig.
Litter size (x) 1 3 5 8 8 9 10
Average weight (y) 1.6 1.5 1.5 1.3 1.4 1.2 1.1
Summary statistics are:
$\sum x_i = 44$, $\sum y_i = 9.6$, $\sum x_i^2 = 344$, $\sum y_i^2 = 13.36$, $\sum y_i x_i = 57$.
When a simple linear regression of average weight on litter size is performed in R using
the command piglets <- lm(wt ~ littersize, data=pig), the output of
the command summary(piglets) includes the following lines:
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.683051   0.064520  26.086 1.55e-06 ***
littersize  -0.049576   0.009204  -5.387  0.00297 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.07558 on 5 degrees of freedom
(a) Show how each of the numerical values on the three lines beginning (Intercept),
littersize and Residual standard error were calculated, and explain
the interpretation of the asterisks *** and ** at the end of the lines.
(b) What conclusion would you reach from the output for testing the null hypothesis
$H_0: \beta = 0$ against the alternative $H_1: \beta \neq 0$?
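As a rough check on a hand calculation, the estimates and standard errors in the output can be rebuilt from the summary statistics. The sketch below is one way to do so, not a model answer: it enters the table values by hand in a hypothetical data frame pig_sketch (the real pig data set may be laid out differently), uses the standard least squares formulae, and compares the results with lm.

```r
# Piglet data as printed in the table for this question.
# Column names wt and littersize are taken from the lm() call in the question.
pig_sketch <- data.frame(
  littersize = c(1, 3, 5, 8, 8, 9, 10),
  wt = c(1.6, 1.5, 1.5, 1.3, 1.4, 1.2, 1.1)
)
x <- pig_sketch$littersize
y <- pig_sketch$wt
n <- length(x)

# Corrected sums of squares and products from the summary statistics.
Sxx <- sum(x^2) - sum(x)^2 / n
Sxy <- sum(x * y) - sum(x) * sum(y) / n
Syy <- sum(y^2) - sum(y)^2 / n

# Least squares estimates of slope and intercept.
beta_hat <- Sxy / Sxx
alpha_hat <- mean(y) - beta_hat * mean(x)

# Residual standard error on n - 2 degrees of freedom, then standard errors.
s <- sqrt((Syy - Sxy^2 / Sxx) / (n - 2))
se_beta <- s / sqrt(Sxx)
se_alpha <- s * sqrt(sum(x^2) / (n * Sxx))

# Compare with the table produced by lm().
fit <- lm(wt ~ littersize, data = pig_sketch)
rbind(by_hand = c(alpha_hat, se_alpha, beta_hat, se_beta),
      from_lm = as.numeric(t(coef(summary(fit))[, 1:2])))
```

Each t value in the output is the estimate divided by its standard error, so the remaining columns follow from the quantities computed here together with the t distribution on n - 2 degrees of freedom.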