ST104a Commentary Autumn 2021

This document provides guidance for students taking the ST104a Statistics 1 examination. It outlines the learning outcomes students should achieve, advises how to plan time during the exam, and what examiners will be looking for in answers. Key advice includes answering the specific question asked, showing calculations clearly, and labeling all diagrams rather than relying on memorizing answers from prior years. The overall message is for students to study the entire syllabus rather than focusing only on past exam questions.

Uploaded by

Nghia Tuan Nghia

Examiners’ commentaries 2021 (Autumn)

ST104a Statistics 1

Important note

This commentary reflects the examination and assessment arrangements for this course in the
academic year 2020–21. The format and structure of the examination may change in future years,
and any such changes will be publicised on the virtual learning environment (VLE).

Information about the subject guide and the Essential reading references

Unless otherwise stated, all cross-references will be to the latest version of the subject guide (2019).
You should always attempt to use the most recent edition of any Essential reading textbook, even if
the commentary and/or online reading list and/or subject guide refer to an earlier edition. If
different editions of Essential reading are listed, please check the VLE for reading supplements – if
none are available, please use the contents list and index of the new edition to find the relevant
section.

General remarks

Learning outcomes

At the end of the half course and having completed the Essential reading and activities you should:

• be familiar with the key ideas of statistics that are accessible to a candidate with a
  moderate mathematical competence
• be able to routinely apply a variety of methods for explaining, summarising and presenting
  data and interpreting results clearly using appropriate diagrams, titles and labels when
  required
• be able to summarise the ideas of randomness and variability, and the way in which these
  link to probability theory to allow the systematic and logical collection of statistical
  techniques of great practical importance in many applied areas
• have a grounding in probability theory and some grasp of the most common statistical
  methods
• be able to perform inference to test the significance of common measures such as means and
  proportions and conduct chi-squared tests of contingency tables
• be able to use simple linear regression and correlation analysis and know when it is
  appropriate to do so.

Planning your time in the examination

You have two hours to complete this paper, which is in two parts. The first part, Section A, is
compulsory; it covers several subquestions and accounts for 50 per cent of the total marks.

Section B contains three questions, each worth 25 per cent, from which you are asked to choose two.
Remember that each of the Section B questions is likely to cover more than one topic. In 2021, for
example, the first part of Question 2 required contingency tables while the second part related to
aspects of sampling design. In Question 3, the first part covered data visualisation and descriptive
statistics while the second part related to statistical inference related to means. Finally, in Question
4, the first part related to correlation and linear regression while the second part covered statistical
inference related to means. This means that it is really important that you make sure you have a
reasonable idea of what topics are covered before you start work on the paper! We suggest you
divide your time as follows during the examination:

• Spend the first 10 minutes annotating the paper. Note the topics covered in each question
  and subquestion.
• Allow yourself 45 minutes for Section A. Do not allow yourself to get stuck on any one
  question, but do not just give up after two minutes!
• Once you have chosen your two Section B questions, give them about 25 minutes each.
• This leaves you with 15 minutes. Do not leave the examination hall at this point! Check
  over any questions you may not have completely finished. Make sure you have labelled and
  given a title to any tables or diagrams which were required and, if you did more than the
  two questions required in Section B, decide which one to delete. Remember that only two of
  your answers will be given credit in Section B and that you must choose which these are!

What are the examiners looking for?

The examiners are looking for very simple demonstrations from you. They want to be sure that you:

have covered the syllabus as described and explained in the subject guide
know the basic formulae given there and when and how to use them
understand and answer the questions set.

You are not expected to write long essays where explanations or descriptions of sampling design
are required, and note-form answers are acceptable. However, clear and accurate language, both
mathematical and written, is expected and marked. The explanations below and in the specific
commentaries for the papers for each zone should make these requirements clear.

Key steps to improvement

The most important thing you can do is answer the question set! This may sound very simple, but
these are some of the things that candidates did not do, though asked, in the 2021 examinations!
Remember the following.

• If you are asked to label a diagram (which is almost always the case!), please do so. Writing
  ‘Histogram’, ‘Stem-and-leaf diagram’, ‘Boxplot’ or ‘Scatter diagram’ in itself is insufficient.
  What do the data describe? What are the units? What are the x-axis and y-axis?
• If you are specifically asked to perform a hypothesis test, or calculate a confidence interval,
  do so. It is not acceptable to do one rather than the other! If you are asked to use a 5%
  significance level, this is what will be marked.
• Do not waste time calculating things which are not required by the examiners. If you are
  asked to find the line of best fit, you will get no extra marks for calculating the correlation
  coefficient as well. If you are asked to use the confidence interval you have just calculated to
  comment on the results, carrying out an additional hypothesis test will not gain you marks.
• When performing calculations, try to use as many decimal places as possible in intermediate
  steps to reach the most accurate solution. It is advised to have at least two decimal places
  in general and at least three decimal places when calculating probabilities.

How should you use the specific comments on each question given in the
Examiners’ commentaries?

We hope that you find these useful. For each question and subquestion, they give:

• further guidance for each question on the points made in the last section
• the answers, or keys to the answers, which the examiners were looking for
• the relevant detailed reference to Newbold, P., W.L. Carlson and B.M. Thorne Statistics for
  Business and Economics. (London: Prentice–Hall, 2012) eighth edition [ISBN
  9780273767060] and the subject guide
• where appropriate, suggested activities from the subject guide which should help you to
  prepare, and similar questions from Newbold et al. (2012).

Any further references you might need are given in the part of the subject guide to which you are
referred for each answer.

Memorising from the Examiners’ commentaries

It was noted recently that a small number of candidates appeared to be memorising answers from
previous years’ Examiners’ commentaries, for example plots, and produced the exact same image of
them without looking at the current year’s examination paper questions! Note that this is very easy
to spot. The Examiners’ commentaries should be used as a guide to practise on sample examination
questions and it is pointless to attempt to memorise them.

Examination revision strategy

Many candidates are disappointed to find that their examination performance is poorer than they
expected. This may be due to a number of reasons, but one particular failing is ‘question
spotting’, that is, confining your examination preparation to a few questions and/or topics which
have come up in past papers for the course. This can have serious consequences.

We recognise that candidates might not cover all topics in the syllabus in the same depth, but you
need to be aware that examiners are free to set questions on any aspect of the syllabus. This
means that you need to study enough of the syllabus to enable you to answer the required number of
examination questions.

The syllabus can be found in the Course information sheet available on the VLE. You should read
the syllabus carefully and ensure that you cover sufficient material in preparation for the
examination. Examiners will vary the topics and questions from year to year and may well set
questions that have not appeared in past papers. Examination papers may legitimately include
questions on any topic in the syllabus. So, although past papers can be helpful during your revision,
you cannot assume that topics or specific questions that have come up in past examinations will
occur again.

If you rely on a question-spotting strategy, it is likely you will find yourself in difficulties
when you sit the examination. We strongly advise you not to adopt this strategy.

Comments on specific questions – Zone A

Candidates should answer THREE questions: all parts of Section A (50 marks in total) and TWO
questions from Section B (25 marks each). Candidates are strongly advised to divide their
time accordingly.

Section A

Answer all parts of question 1 (50 marks in total).

Question 1

(a) Consider the following sample dataset:

18, 12, 16, x and 15.

You are told that the value of the sample mean is x̄ = 16.

i. Calculate the value of x.
(2 marks)
ii. Calculate the range of the data.
(2 marks)
iii. Calculate the sample variance.
(3 marks)

Reading for this question


This question involves the descriptive statistics of the sample mean, range and sample
variance, covered in Sections 4.8.1, 4.9.1 and 4.9.3 of the subject guide, respectively.

Approaching the question


i. Since the sample mean is equal to 16, we can write:

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{18 + 12 + 16 + x + 15}{5} = 16 \]

or else:

\[ 61 + x = 80 \quad \Rightarrow \quad x = 19. \]

ii. The range is:

\[ x_{(n)} - x_{(1)} = x_{(5)} - x_{(1)} = 19 - 12 = 7. \]

Note that the range must be computed as the difference between the sample maximum
and sample minimum, and not reported as ‘12 to 19’.
iii. We have:

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
       = \frac{(18-16)^2 + (12-16)^2 + (16-16)^2 + (19-16)^2 + (15-16)^2}{4}
       = \frac{30}{4} = 7.5. \]

Note the use of the ‘n − 1’ divisor for the sample variance.
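
The three calculations above can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and are not part of the commentary.

```python
from statistics import variance

# The four known observations; the fifth value x is unknown.
known = [18, 12, 16, 15]
n, sample_mean = 5, 16

# i. The mean fixes the total, so x = n * mean - (sum of the known values).
x = n * sample_mean - sum(known)

# ii. Range = sample maximum - sample minimum.
data = known + [x]
data_range = max(data) - min(data)

# iii. statistics.variance uses the n - 1 divisor (sample variance).
sample_var = variance(data)

print(x, data_range, sample_var)
```

Note that `statistics.pvariance` would divide by n instead, which is not what the question asks for.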

(b) Suppose that $x_1 = -2$, $x_2 = -4$, $x_3 = 8$, $x_4 = 5$, and $y_1 = 6$, $y_2 = -10$, $y_3 = 8$,
$y_4 = 5$. Calculate the following quantities:

i. $\sum_{i=1}^{3} y_i^2$    ii. $\sum_{i=3}^{4} \sqrt{x_i y_i}$    iii. $y_4^2 + \sum_{i=1}^{3} \frac{1}{x_i}$.

(6 marks)

Reading for this question


This question refers to the basic bookwork which can be found in Section 2.9 of the subject
guide.
Approaching the question
Be careful to leave the $x_i$s and $y_i$s in the order given and only cover the values of i asked for.
This question was generally well done; the answers are as follows.
i. We have:

\[ \sum_{i=1}^{3} y_i^2 = 6^2 + (-10)^2 + 8^2 = 36 + 100 + 64 = 200. \]

ii. We have:

\[ \sum_{i=3}^{4} \sqrt{x_i y_i} = \sqrt{8 \times 8} + \sqrt{5 \times 5} = 8 + 5 = 13. \]

Note that to be mathematically precise, $\sqrt{64}$ and $\sqrt{25}$ are also equal to −8 and −5,
respectively. For this reason, −13 as a final answer was also accepted as correct.
iii. We have:

\[ y_4^2 + \sum_{i=1}^{3} \frac{1}{x_i} = 5^2 + \left( -\frac{1}{2} - \frac{1}{4} + \frac{1}{8} \right) = 24.375. \]
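
As a quick check of the index ranges, the sums can be evaluated directly. This is a minimal sketch; the dictionaries keyed by the subscript i are an illustrative choice so that the code matches the 1-based notation in the question.

```python
from math import sqrt

# Values keyed by their subscript, so x[3] is x_3 and so on.
x = {1: -2, 2: -4, 3: 8, 4: 5}
y = {1: 6, 2: -10, 3: 8, 4: 5}

# i. Sum of y_i squared for i = 1, 2, 3.
sum_i = sum(y[i] ** 2 for i in range(1, 4))

# ii. Sum of sqrt(x_i * y_i) for i = 3, 4 (principal square roots).
sum_ii = sum(sqrt(x[i] * y[i]) for i in range(3, 5))

# iii. y_4 squared plus the sum of 1 / x_i for i = 1, 2, 3.
sum_iii = y[4] ** 2 + sum(1 / x[i] for i in range(1, 4))

print(sum_i, sum_ii, sum_iii)
```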

(c) State whether the following are true or false and give a brief explanation. (No
marks will be awarded for a simple true/false answer.)

i. The upper quartile of a sample dataset is never smaller than the lower
quartile.
(2 marks)

ii. The probability that a normal random variable is less than one standard
deviation from its mean is 99%.
(2 marks)

iii. Convenience sampling is free of selection bias.
(2 marks)

iv. A correlation coefficient of 0.95 between variables x and y suggests that
there is a strong positive influence of variable x on variable y.
(2 marks)

Reading for this question


This question contains material from various parts of the subject guide. Here, it is more
important to have a good intuitive understanding of the relevant concepts than the technical
level in computations. Part i. refers to Section 4.9.1 which introduces the quartiles of a
sample dataset, whereas part ii. requires knowledge of the normal distribution that can be
found in Section 6.8. Part iii. is about sampling design – see Sections 10.7.1 and 10.8. Part
iv. covers correlation, which is covered in Section 12.8.
Approaching the question
i. True. Since the quartiles are computed based on the ordered observed values, it must be
that $Q_1 \le Q_3$.
ii. False. The probability is approximately 68%. This could be illustrated with a suitable
sketch.
iii. False. In convenience sampling the selection is non-random and hence introduces
selection bias.
iv. False. The correlation may be spurious. Correlation does not imply causality, so we
cannot be certain that x influences y.
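
The figure of approximately 68% quoted in part ii. can be verified with the standard normal distribution. A minimal sketch, assuming Python's built-in `statistics.NormalDist` (the result does not depend on the mean or standard deviation chosen):

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist()

# P(-1 < Z < 1): the probability of being within one standard deviation of the mean.
p_within_one_sd = z.cdf(1) - z.cdf(-1)

print(round(p_within_one_sd, 4))
```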

(d) The probability distribution of a random variable X is given below.


X = x         2      5      7      9
P(X = x)    0.20   0.40   0.30   0.10

i. Find E(X), the expected value of X.
(2 marks)
ii. Find the probability that $X^2 > 30$.
(2 marks)
iii. Does X follow a uniform distribution? Justify your answer.
(2 marks)

Reading for this question


This is a question on a discrete random variable, exploring the concepts of discrete
probability distributions. Reading from Chapter 5 is suggested with a focus on the sections
on these topics.

Approaching the question


i. We have:

\[ E(X) = \sum_x x \, P(X = x) = 2 \times 0.20 + 5 \times 0.40 + 7 \times 0.30 + 9 \times 0.10 = 5.4. \]

ii. The probability distribution of $Z = X^2$ will be:

Z = z         4     25     49     81
P(Z = z)    0.20   0.40   0.30   0.10

Hence the correct probability is 0.40.
Note that this part may be answered without deriving the probability distribution table
of Z. One can note that only the values X = 7 and X = 9 will give $X^2 > 30$, hence the
requested probability is 0.30 + 0.10 = 0.40.
iii. Since the probabilities are not all equal, X does not have a uniform distribution.
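
The same three answers can be reproduced from the probability table. A minimal sketch, with the distribution stored as an illustrative dictionary mapping each value of X to its probability:

```python
# Probability distribution of X: value -> probability.
pmf = {2: 0.20, 5: 0.40, 7: 0.30, 9: 0.10}

# i. E(X) is the probability-weighted sum of the values.
expected = sum(x * p for x, p in pmf.items())

# ii. P(X^2 > 30): add the probabilities of the values whose square exceeds 30.
p_sq_gt_30 = sum(p for x, p in pmf.items() if x ** 2 > 30)

# iii. A discrete uniform distribution needs all probabilities to be equal.
is_uniform = len(set(pmf.values())) == 1

print(round(expected, 2), round(p_sq_gt_30, 2), is_uniform)
```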

(e) The times of marathon runners, participating in the Olympic Games, are
normally distributed with mean 3.5 hours and a standard deviation of 0.75
hours.
i. What is the proportion of runners in the Olympic Games that finish in less
than 3 hours?
(2 marks)

ii. What is the proportion of runners that finish the Olympic Games with times
between 2.5 and 4.5 hours?
(3 marks)

iii. Do you think it is reasonable to assume that the times of marathon runners
follow a normal distribution? Briefly explain your view.
(2 marks)

Reading for this question


This question requires application of the normal distribution, for which Section 6.8 of the
subject guide is relevant.
Approaching the question
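i. We can write:

\[ P(X < 3) = P\left( \frac{X - 3.5}{0.75} < \frac{3 - 3.5}{0.75} \right) = P(Z < -0.67). \]

Continuing from above, we get $P(Z < -0.67) = 1 - \Phi(0.67) = 0.2525$. (This value uses the
unrounded $z = -0.6667$; rounding to $z = -0.67$ before using statistical tables gives 0.2514.)

ii. We can write:

\[ P(2.5 \le X \le 4.5) = P\left( \frac{2.5 - 3.5}{0.75} \le Z \le \frac{4.5 - 3.5}{0.75} \right) = P(-1.33 \le Z \le 1.33). \]

Continuing from above, we get:

\[ P(-1.33 \le Z \le 1.33) = \Phi(1.33) - \Phi(-1.33) = 0.9088 - 0.0912 = 0.8176. \]

(Again, these values use the unrounded $z = \pm 1.3333$; with $z = \pm 1.33$ from statistical tables
the calculation is 0.9082 − 0.0918 = 0.8164.)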

iii. Any reasonable argument was accepted. For example, a candidate could note that time is a
continuous variable, which supports the use of the normal distribution (which is
continuous), and discuss whether or not it is reasonable to assume that the distribution
of times is symmetric (which the normal distribution is).
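
The probabilities in parts i. and ii. can be computed without rounding the z-values. A minimal sketch, assuming Python's built-in `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

# Finishing times: normal with mean 3.5 hours and standard deviation 0.75 hours.
times = NormalDist(mu=3.5, sigma=0.75)

# i. Proportion finishing in less than 3 hours.
p_under_3 = times.cdf(3)

# ii. Proportion finishing between 2.5 and 4.5 hours.
p_between = times.cdf(4.5) - times.cdf(2.5)

print(round(p_under_3, 4), round(p_between, 4))
```

Because no intermediate rounding takes place here, small differences from values read off printed tables are to be expected.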

(f ) An online retailer dispatches products from one of three warehouses (A, B and
C), where these warehouses account for 10%, 40% and 50% of the retailer’s
sales, respectively. It is known that the percentage of defective items are 4%,
6% and 3%, from warehouses A, B and C, respectively. A customer complains
that they have received a defective item. What is the probability this item came
from warehouse A? Provide your answer to four decimal places.
(6 marks)

Reading for this question


This is a question on conditional probability, requiring the use of Bayes’ formula. See
Section 5.10 of the subject guide for details.

Approaching the question


With obvious notation, for warehouse A we have:

\[
\begin{aligned}
P(A \mid D) &= \frac{P(D \mid A)\, P(A)}{P(D \mid A)\, P(A) + P(D \mid B)\, P(B) + P(D \mid C)\, P(C)} \\
            &= \frac{0.04 \times 0.10}{0.04 \times 0.10 + 0.06 \times 0.40 + 0.03 \times 0.50} \\
            &= 0.0930.
\end{aligned}
\]

Note the use of the total probability formula in the denominator (which represents P (D)).
The answer should be reported to four decimal places, as requested in the question.
Note a correct probability tree is also acceptable.
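
The Bayes' formula calculation can be sketched in a few lines. This is only an illustration; the dictionary names `prior` and `p_defect` are assumptions made for the example.

```python
# Share of sales dispatched from each warehouse: P(A), P(B), P(C).
prior = {"A": 0.10, "B": 0.40, "C": 0.50}

# Probability that an item is defective given its warehouse: P(D | warehouse).
p_defect = {"A": 0.04, "B": 0.06, "C": 0.03}

# Total probability formula: P(D) is the sum of P(D | w) * P(w) over warehouses.
p_d = sum(prior[w] * p_defect[w] for w in prior)

# Bayes' formula: P(A | D) = P(D | A) * P(A) / P(D).
posterior_a = prior["A"] * p_defect["A"] / p_d

print(f"{posterior_a:.4f}")
```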

(g) A random sample is drawn from a normal distribution, N (µ, 36). You are told
that a 90% confidence interval for the population mean is (5.78, 7.22). What
was the size of the sample?
(5 marks)

Reading for this question


Sample size determination for estimating population means (and population proportions) is
covered in Section 7.11 of the subject guide.

Approaching the question


The sample mean, $\bar{x}$, would be in the centre of the interval (5.78, 7.22). Adding the two
endpoints and dividing by 2 gives:

\[ \bar{x} = \frac{5.78 + 7.22}{2} = 6.5. \]

The margin of error (half the width of the interval) is therefore 7.22 − 6.5 = 0.72.
To find the sample size, note that the standard error when estimating a single mean is:

\[ \frac{\sigma}{\sqrt{n}} = \frac{6}{\sqrt{n}}. \]

Since this is a 100(1 − α)% = 90% confidence interval, then α = 0.10, so the confidence
coefficient is $z_{\alpha/2} = z_{0.05} = 1.645$. Therefore, to determine n we need to solve:

\[ 1.645 \times \frac{6}{\sqrt{n}} = 0.72 \quad \Rightarrow \quad n = 187.92. \]

Since this is the minimum value of n, and we must have that n is an integer, we round up to
get a sample size of 188.
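
The steps above can be sketched as follows. This is a minimal illustration that takes the table value 1.645 as given; the variable names are not from the commentary.

```python
from math import ceil

lower, upper = 5.78, 7.22   # reported 90% confidence interval
sigma = 6.0                 # population standard deviation, since the variance is 36
z = 1.645                   # z_{0.05} from statistical tables

# The margin of error is half the width of the interval.
half_width = (upper - lower) / 2

# Solve z * sigma / sqrt(n) = half_width for n.
n_exact = (z * sigma / half_width) ** 2

# The sample size must be an integer, so round up.
n = ceil(n_exact)

print(round(n_exact, 2), n)
```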

(h) Explain how you would develop an experimental design to determine the
effectiveness of a vaccine.
(5 marks)

Reading for this question


Chapter 11 of the subject guide discusses aspects of experimental design.

Approaching the question


This was a deliberately open-ended question, although a complete answer should discuss
control and treatment groups, randomisation, the placebo effect and double blinding.

Section B

Answer two out of the three questions from this section (25 marks each).

Question 2

(a) A factory uses four different machines to manufacture a particular type of
machine component. A random sample of 400 components is selected from the
output of the factory. Each component in the sample is inspected to determine
whether or not it is faulty. The machine that produced the component is also
recorded. The results are as follows:

                      Outcome
             Faulty   Non-faulty   Total
Machine 1       4         96        100
Machine 2       2         98        100
Machine 3      11         89        100
Machine 4      14         86        100
Total          31        369        400

i. Based on the data in the table, and without conducting any significance test,
would you say there is an association between the machine number and the
component being faulty?

ii. Calculate the χ2 statistic and use it to test for independence, using a 5%
significance level. What do you conclude?

(14 marks)

Reading for this question


This question targets Chapter 9 on contingency tables and chi-squared tests. Note that in
part i. of the question it does not require any calculations, just understanding and
interpreting contingency tables. Part ii. is a straightforward chi-squared test and the reading
is also given in Chapter 9.

Approaching the question

i. There are some differences in the proportions of faulty components for each machine.
More specifically, 2% of the components in Machine 2 are faulty, whereas the
corresponding proportion for Machine 3 is 11% and for Machine 4 is 14%. Hence there
seems to be an association between machine number and the component being faulty,
although this needs to be investigated further.

ii. Set out the null hypothesis that there is no association between machine number and the
component being faulty against the alternative, that there is an association. Be careful
to get these the correct way round! We test:

H0 : No association between the machine number and the component being faulty

vs.

H1 : Association between machine number and the component being faulty.

Work out the expected values to obtain the table below:

             Faulty   Non-faulty
Machine 1     7.75      92.25
Machine 2     7.75      92.25
Machine 3     7.75      92.25
Machine 4     7.75      92.25

The test statistic formula is:

\[ \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} \]

which gives a value of 13.53. This is a 4 × 2 contingency table so the degrees of freedom
are (4 − 1) × (2 − 1) = 3. For α = 0.05, the critical value is 7.815, hence we reject $H_0$. We
conclude that there is moderate evidence of an association between machine number and
the component being faulty.
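
The expected frequencies and the test statistic can be reproduced as below. This is a minimal sketch in plain Python; the critical value 7.815 is taken from the text rather than computed, and the variable names are illustrative.

```python
# Observed frequencies: one row per machine, columns = (faulty, non-faulty).
observed = [[4, 96], [2, 98], [11, 89], [14, 86]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum of (O - E)^2 / E over all cells.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

critical_value = 7.815  # 5% critical value for 3 degrees of freedom, from tables
reject_h0 = chi2 > critical_value

print(round(chi2, 2), df, reject_h0)
```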

(b) i. Describe how stratified random sampling is performed and explain how it
differs from quota sampling.
ii. A company producing handheld electronic devices (tablets, mobile phones
etc.) wants to understand how people of different ages rate its products. For
this reason, the company’s management has decided to use a survey of its
customers and has asked you to devise an appropriate random sampling
scheme. Outline the key components of your sampling scheme.
(11 marks)

Reading for this question


This was a question on basic material concerning survey designs. Background reading is
given in Chapter 10 of the subject guide which, along with the recommended reading,
should be looked at carefully. Candidates were expected to have studied and understood the
main important constituents of design in random sampling. It is also a good idea to try the
learning activities of Chapter 10.
Approaching the question
i. A description of stratified random sampling can be found on page 245 of the subject
guide. It is a form of probability sampling as opposed to quota sampling which is a form
of non-probability sampling. In stratified random sampling a sampling frame is required,
whereas in quota sampling pre-chosen frequencies in each category are sought.
ii. This is an open-ended question. Possible ‘ingredients’ of a good answer are as follows:
• Propose stratified sampling since customers of all ages are to be surveyed.
• Sampling frame could be the company’s customer database.
• Take a simple random sample from each stratum.
• The appropriate stratification factor would be age group.
• Other stratification factors could be gender, country of residence etc.
• Contact method: mail, phone or email (likely to have all details in the database).
• Minimise non-response through suitable incentive, such as a discount off their next
purchase.

Question 3

(a) The data below represent heights, measured in centimetres, of women from an
adult female population:

162 164 164 165 165
166 166 166 167 167
167 167 167 168 168
168 168 168 168 169
169 169 169 170 170
170 171 172 184 185

i. Carefully construct, draw and label a histogram of these data. The histogram
can be drawn on ordinary paper – no graph paper needed. You should draw
by hand; do not use a computer.

ii. Find the median height among these women and the upper quartile. What
percentage of women were below 165 cm?

iii. Comment on the data given the shape of the histogram without doing any
further calculations.

iv. Name two other types of graphical displays that would be suitable to
represent the data.

(12 marks)

Reading for this question


Chapter 4 provides all the relevant material for this question. More specifically, reading on
histograms can be found in Section 4.7.3, but the entire Section 4.7 is highly relevant. For
measures of location, see Section 4.8.

Approaching the question

i. A histogram compatible with what the examiners were expecting to see is shown below.

This histogram is based on the following class intervals and frequency densities:

Class interval   Interval width   Frequency   Frequency density
[160, 165)             5               3             0.6
[165, 170)             5              20             4.0
[170, 175)             5               5             1.0
[175, 180)             5               0             0.0
[180, 185)             5               1             0.2
[185, 190)             5               1             0.2
Marks were awarded for including the title, labelling correctly and accurately drawing
the figure.
ii. The median is the midpoint of the ordered observations, which is 168 centimetres.
Q3 = 169 centimetres. Note that units for at least one of median or Q3 are required. The
percentage is 3/30 = 10%.
iii. The histogram is positively (right) skewed. There are two women (with heights of 184
cm and 185 cm) who may be regarded as outliers.
iv. Any two of boxplot, dot plot and stem-and-leaf diagram.
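
The frequency table, the median and the percentage below 165 cm can be checked as follows. This is a minimal sketch; the class boundaries are the ones used in the table above, and the variable names are illustrative. (The histogram itself still has to be drawn by hand in the examination.)

```python
from statistics import median

heights = [
    162, 164, 164, 165, 165,
    166, 166, 166, 167, 167,
    167, 167, 167, 168, 168,
    168, 168, 168, 168, 169,
    169, 169, 169, 170, 170,
    170, 171, 172, 184, 185,
]

# Class boundaries for intervals of the form [lower, upper).
edges = [160, 165, 170, 175, 180, 185, 190]

frequencies = []
densities = []
for lower, upper in zip(edges, edges[1:]):
    freq = sum(lower <= h < upper for h in heights)
    frequencies.append(freq)
    # Frequency density = frequency / interval width.
    densities.append(freq / (upper - lower))

median_height = median(heights)
pct_below_165 = 100 * sum(h < 165 for h in heights) / len(heights)

print(frequencies, densities, median_height, pct_below_165)
```

The upper quartile is left out of the sketch on purpose, since software packages use several different quartile conventions that need not agree with the subject guide's method.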

(b) A random sample of 9 people tried a specific diet that lasted 2 months to lose
weight. The weights of these people, measured in kilograms, were measured
both at the beginning and the end of the diet, and are shown in the table below:

Weight before diet   Weight after diet
        75                  73
        76                  72
        90                  92
        92                  93
        89                  89
        63                  61
        65                  62
        80                  76
        90                  84

i. Carry out an appropriate hypothesis test to determine whether the diet is
effective in helping people lose weight. State the test hypotheses, and specify
your test statistic and its distribution under the null hypothesis. Comment
on your findings.
ii. State any assumptions you made in part i.
iii. Give a 90% confidence interval for the difference between the means of the
weights before and after the diet.
(13 marks)

Reading for this question


Look up the sections about hypothesis testing and confidence intervals for differences in
means for paired data in Sections 8.16.4 and 7.13.4, respectively.
Approaching the question
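i. We test:

\[ H_0: \mu_{\text{before}} = \mu_{\text{after}} \quad \text{vs.} \quad H_1: \mu_{\text{before}} > \mu_{\text{after}}. \]

Equivalently, writing $\mu_d = \mu_{\text{after}} - \mu_{\text{before}}$ for the mean difference, we test:

\[ H_0: \mu_d = 0 \quad \text{vs.} \quad H_1: \mu_d < 0. \]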

The differences (weight after minus weight before) are:

−2, −4, 2, 1, 0, −2, −3, −4 and −6.

The sample mean and standard deviation of these differences are:

\[ \bar{x}_d = -2.0 \quad \text{and} \quad s_d = 2.598. \]

Under $H_0$, the test statistic is:

\[ \frac{\bar{X}_d}{S_d / \sqrt{n}} \sim t_{n-1} = t_8 \]

and the test statistic value is −2.309. For α = 0.05, the critical value is
$-t_{8,\,0.05} = -1.860$, hence we reject $H_0$ since −2.309 < −1.860. For α = 0.01, the critical
value is $-t_{8,\,0.01} = -2.896$, hence we do not reject $H_0$ since −2.896 < −2.309. We
conclude that the test is moderately significant, i.e. there is moderate evidence that the
diet is effective in helping people lose weight.
ii. The assumptions are that:
• differences are normally distributed
• pairs of observations are independent.

iii. The confidence coefficient for a 90% confidence interval is $t_{8,\,0.05} = 1.860$, hence a 90%
confidence interval is:

\[ -2.0 \pm 1.860 \times \frac{2.598}{\sqrt{9}} \quad \Rightarrow \quad (-3.611, -0.389). \]
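
The paired-sample calculations in parts i. and iii. can be sketched as below. This is a minimal illustration using Python's standard library; the critical value 1.860 is taken from statistical tables as in the text, and the variable names are assumptions made for the example.

```python
from math import sqrt
from statistics import mean, stdev

before = [75, 76, 90, 92, 89, 63, 65, 80, 90]
after = [73, 72, 92, 93, 89, 61, 62, 76, 84]

# Differences defined as after - before, so weight loss gives negative values.
d = [a - b for a, b in zip(after, before)]
n = len(d)

d_bar = mean(d)           # sample mean of the differences
s_d = stdev(d)            # sample standard deviation (n - 1 divisor)
std_error = s_d / sqrt(n)

# Test statistic for H0: mu_d = 0, with n - 1 = 8 degrees of freedom.
t_stat = d_bar / std_error

# 90% confidence interval using t_{8, 0.05} = 1.860 from tables.
t_crit = 1.860
ci = (d_bar - t_crit * std_error, d_bar + t_crit * std_error)

print(d_bar, round(s_d, 3), round(t_stat, 3), [round(v, 3) for v in ci])
```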

Question 4

(a) The director of a local Tourism Authority would like to know whether a family’s
annual expenditure on recreation (y), measured in $000s, is related to their
annual income (x), also measured in $000s. In order to explore this potential
relationship, the variables x and y were recorded for 10 randomly selected
families that visited the area last year. The results were as follows:

Family #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
x 41.2 50.1 52.0 62.0 44.5 37.7 73.5 37.5 56.7 65.2
y 2.4 2.7 2.8 8.0 3.1 2.1 12.1 2.0 3.9 8.9

The summary statistics for these data are:

Sum of x data: 520.4        Sum of the squares of x data: 28,431.42
Sum of y data: 48           Sum of the squares of y data: 343.74
Sum of the products of x and y data: 2,858.63

i. Draw a scatter diagram of these data. Label the diagram carefully. (The
scatter diagram can be drawn on ordinary paper – no graph paper needed.
You should draw by hand; do not use a computer.)
ii. Calculate the sample correlation coefficient. Interpret your findings.
iii. Calculate the least squares line of y on x and draw the line on the scatter
diagram.
iv. Do you find the analyses in parts ii. and iii. appropriate? Justify your
answer and suggest any alternative ways to model the relationship
between x and y.
(13 marks)

Reading for this question


This is a standard regression question and the reading is to be found in Chapter 12 of
the subject guide. Section 12.6 provides details for scatter diagrams and is suitable for
part i., whereas the remaining parts are on correlation and regression that are covered in
Sections 12.8–12.10. Sample examination question 2 of this chapter is also recommended
for practice on questions of this type.
Approaching the question
i. Candidates are reminded that they are asked to draw and label the scatter diagram
which should include a title (‘Scatter diagram’ alone will not suffice) and labelled
axes which give their units in addition. Far too many candidates threw away marks
by neglecting these points and consequently were only given one mark out of the
possible four allocated for this part of the question.
We have:

ii. The summary statistics can be substituted into the formula for the sample correlation
coefficient (make sure you know which one it is!) to obtain the value 0.9222. An
interpretation of this value is the following: the data suggest that the higher a
family’s annual income is, the higher its expenditure on recreation. The fact that the
value is close to 1 suggests that this is a strong, linear, positive correlation.
Many candidates did not mention all three words (strong, linear, positive). Note that
all of these words provide useful information on interpreting the association and are
therefore required.
iii. The regression line can be written by the equation $\hat{y} = a + bx$ or $y = a + bx + \varepsilon$. The
formula for b is:

\[ b = \frac{\sum x_i y_i - n \bar{x} \bar{y}}{\sum x_i^2 - n \bar{x}^2} \]

and by substituting the summary statistics we get b = 0.267. The formula for a is:

\[ a = \bar{y} - b \bar{x} \]

so we get a = −9.107. Hence the regression line can be written as:

\[ \hat{y} = -9.107 + 0.267x \quad \text{or} \quad y = -9.107 + 0.267x + \varepsilon. \]

It should also be plotted on the scatter diagram.

Many candidates incorrectly reported the regression line as y = −9.107 + 0.267x.
This expression is incorrect; one of the two above is required. Also, many candidates did
not draw this line on the scatter diagram; instead they drew an approximate line
trying to go around the points but without reference to the above equation. No
marks were awarded in such cases.
iv. Some discussion mentioning, for example, ‘no, due to the non-linear shape of the plot’
or ‘no, due to outliers’. Alternative ways to model the relationship between x and y
could include Spearman’s rank correlation coefficient and/or the need to transform
the data, such as a logarithmic transformation.

(b) The fuel consumption of two different car models (A and B) was compared in
the following way. A random sample of 20 cars of model A and 35 cars of
model B were taken and the fuel consumption (in miles per gallon) was
measured for each car. The results are summarised in the table below.
              Sample size   Sample mean   Sample standard deviation
Car Model A       20           30.9               6.11
Car Model B       35           27.1               6.41

i. Use an appropriate hypothesis test to determine whether the model A
cars can do more miles per gallon than model B cars. State clearly the
hypotheses, the test statistic and its distribution under the null
hypothesis, and carry out the test at two appropriate significance levels.
Comment on your findings.
ii. State clearly any assumptions you made in part i.
iii. Provide a 95% confidence interval for the difference between the mean
fuel consumption of the two car models.

(12 marks)

Reading for this question


The first two parts of the question refer to a one-sided hypothesis test comparing two
population means. While the entire chapter on hypothesis testing is relevant, one can focus
on Section 8.16. For part iii., see Section 7.13.

Approaching the question

i. Let µA denote the mean fuel consumption for car model A and µB the mean fuel
consumption for car model B. We test:

H0 : µA = µB vs. H1 : µA > µB .

The test statistic value is 2.150 (using the pooled sample variance of 39.74). (If equal
variances are not assumed the test statistic value is 2.179.) For reference the test
statistic formula is:
$$\frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_p^2\,(1/n_A + 1/n_B)}} \quad \text{or} \quad \frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_A^2/n_A + s_B^2/n_B}}.$$

For α = 0.05, the critical value is 1.684 (based on the t40 distribution) or 1.671 (based on
the t60 distribution) or 1.645 (based on the standard normal distribution). Decision: we
reject H0 . Choosing a smaller α, say α = 0.01, the critical value is 2.423 (based on the
t40 distribution) or 2.390 (based on the t60 distribution) or 2.326 (based on the standard
normal distribution), so do not reject H0 . Hence the test is moderately significant, i.e.
there is moderate evidence that the mean fuel consumption of model A cars is greater
than that of model B cars.


ii. The assumptions made in part i. concern:
• whether the unknown population variances are equal
• whether nA + nB is ‘large’ so that the normality assumption is satisfied
• whether the samples are independent.
Some candidates stated assumptions in this part that were not made in part i. Marks
were not awarded in such cases. Also, some other candidates just copied the phrase
‘assumption about equal variances’ and naturally were not awarded any marks. One
should state whether the calculations were based on the assumption that unknown
variances are equal or unequal.
iii. For a 95% confidence interval, if the t40 distribution is used then the confidence
coefficient is 2.021, or if the standard normal distribution is assumed then it is 1.96.
These result in confidence intervals of (0.229, 7.371) and (0.337, 7.263), respectively.
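
The figures quoted in parts i. and iii. can be checked from the summary table alone. The following is a minimal sketch, not part of the marking scheme: the variable names are illustrative, and the confidence coefficients 2.021 and 1.96 are simply the tabulated values quoted above. It assumes, as the quoted intervals suggest, that both confidence intervals use the pooled standard error.

```python
from math import sqrt

# Summary statistics from the table in the question.
n_a, mean_a, sd_a = 20, 30.9, 6.11
n_b, mean_b, sd_b = 35, 27.1, 6.41

diff = mean_a - mean_b

# Pooled sample variance, used when the unknown variances are assumed equal.
pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
se_pooled = sqrt(pooled_var * (1 / n_a + 1 / n_b))
t_pooled = diff / se_pooled

# Standard error and test statistic when equal variances are not assumed.
se_unequal = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
t_unequal = diff / se_unequal

# 95% confidence intervals with the tabulated coefficients (t40 and z).
ci_t40 = (diff - 2.021 * se_pooled, diff + 2.021 * se_pooled)
ci_z = (diff - 1.96 * se_pooled, diff + 1.96 * se_pooled)

print(round(pooled_var, 2), round(t_pooled, 3), round(t_unequal, 3))
print([round(v, 3) for v in ci_t40], [round(v, 3) for v in ci_z])
```

Comparing `t_pooled` with the critical values 1.684 and 2.423 reproduces the decisions at the 5% and 1% significance levels.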


Examiners’ commentaries 2021 (Autumn)


ST104a Statistics 1

Important note

This commentary reflects the examination and assessment arrangements for this course in the
academic year 2020–21. The format and structure of the examination may change in future years,
and any such changes will be publicised on the virtual learning environment (VLE).

Information about the subject guide and the Essential reading references

Unless otherwise stated, all cross-references will be to the latest version of the subject guide (2019).
You should always attempt to use the most recent edition of any Essential reading textbook, even if
the commentary and/or online reading list and/or subject guide refer to an earlier edition. If
different editions of Essential reading are listed, please check the VLE for reading supplements – if
none are available, please use the contents list and index of the new edition to find the relevant
section.

Comments on specific questions – Zone B

Candidates should answer THREE questions: all parts of Section A (50 marks in total) and TWO
questions from Section B (25 marks each). Candidates are strongly advised to divide their
time accordingly.

Section A

Answer all parts of question 1 (50 marks in total).

Question 1

(a) Consider the following sample dataset:

24, 22, 29, x and 27.

You are told that the value of the sample mean is x̄ = 24.

i. Calculate the value of x.


(2 marks)
ii. Calculate the range of the data.
(2 marks)
iii. Calculate the sample variance.
(3 marks)

Reading for this question


This question involves the descriptive statistics of the sample mean, range and sample
variance, covered in Sections 4.8.1, 4.9.1 and 4.9.3 of the subject guide, respectively.


Approaching the question


i. Since the sample mean is equal to 24, we can write:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{24 + 22 + 29 + x + 27}{5} = 24$$
or else:
102 + x = 120 ⇒ x = 18.

ii. The range is:
$$x_{(n)} - x_{(1)} = x_{(5)} - x_{(1)} = 29 - 18 = 11.$$
Note that the range must be computed as the difference between the sample maximum
and sample minimum, and not reported as ‘18 to 29’.
iii. We have:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{(24-24)^2 + (22-24)^2 + (29-24)^2 + (18-24)^2 + (27-24)^2}{4} = 18.5.$$

Note the use of the ‘n − 1’ divisor for the sample variance.
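
As a quick check of parts i. to iii., the sketch below recovers x from the stated mean and then computes the range and sample variance. It uses only the Python standard library; the variable names are illustrative.

```python
from statistics import mean, variance

known = [24, 22, 29, 27]   # the four observed values
n = 5                      # sample size including the unknown x
target_mean = 24

# The five values must sum to n * mean, so x is the shortfall.
x = n * target_mean - sum(known)

data = known + [x]
data_range = max(data) - min(data)   # maximum minus minimum
s2 = variance(data)                  # statistics.variance uses the n - 1 divisor

print(x, data_range, s2)
```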

(b) Suppose that x1 = −4, x2 = −5, x3 = 10, x4 = 8, and y1 = 4, y2 = −12,
y3 = 10, y4 = 8. Calculate the following quantities:

i. $\sum_{i=1}^{3} y_i^2$    ii. $\sum_{i=3}^{4} \sqrt{x_i y_i}$    iii. $y_4^2 + \sum_{i=1}^{3} \frac{1}{x_i}$.

(6 marks)

Reading for this question


This question refers to the basic bookwork which can be found in Section 2.9 of the subject
guide.
Approaching the question
Be careful to leave the xi and yi values in the order given and only cover the values of i asked for.
This question was generally well done; the answers are as follows.
i. We have:
$$\sum_{i=1}^{3} y_i^2 = 4^2 + (-12)^2 + (10)^2 = 16 + 144 + 100 = 260.$$

ii. We have:
$$\sum_{i=3}^{4} \sqrt{x_i y_i} = \sqrt{10 \times 10} + \sqrt{8 \times 8} = 10 + 8 = 18.$$
Note that to be mathematically precise, √100 and √64 are also equal to −10 and −8,
respectively. For this reason, −18 as a final answer was also accepted as correct.
iii. We have:
$$y_4^2 + \sum_{i=1}^{3} \frac{1}{x_i} = 8^2 + \left(-\frac{1}{4} - \frac{1}{5} + \frac{1}{10}\right) = 63.65.$$
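
The three sums can be checked with a short script. This is only an illustrative sketch: Python lists are indexed from 0, so the observation with subscript i is stored at position i − 1, and `sqrt` returns the positive root only.

```python
from math import sqrt

x = [-4, -5, 10, 8]
y = [4, -12, 10, 8]

# Subscript i in the question corresponds to list position i - 1.
part_i = sum(y[i - 1] ** 2 for i in range(1, 4))                # i = 1, 2, 3
part_ii = sum(sqrt(x[i - 1] * y[i - 1]) for i in range(3, 5))   # i = 3, 4
part_iii = y[3] ** 2 + sum(1 / x[i - 1] for i in range(1, 4))   # y_4 is y[3]

print(part_i, part_ii, round(part_iii, 2))
```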


(c) State whether the following are true or false and give a brief explanation. (No
marks will be awarded for a simple true/false answer.)

i. The mean of a sample dataset is never equal to the median.


(2 marks)

ii. The probability that a normal random variable is less than two standard
deviations from its mean is 68%.
(2 marks)

iii. Systematic sampling suffers from selection bias.


(2 marks)

iv. A correlation coefficient of −0.85 between variables x and y suggests that


there is a strong negative influence of variable x on variable y.
(2 marks)

Reading for this question


This question contains material from various parts of the subject guide. Here, it is more
important to have a good intuitive understanding of the relevant concepts than the technical
level in computations. Part i. refers to Section 4.8 which introduces measures of location,
whereas part ii. requires knowledge of the normal distribution that can be found in Section
6.8. Part iii. is about sampling design – see Sections 10.7.2 and 10.8. Part iv. covers
correlation, which is covered in Section 12.8.

Approaching the question

i. False. They would be equal for a symmetric sample distribution.

ii. False. The probability is approximately 95%. This could be illustrated with a suitable
sketch.

iii. False. In systematic sampling the selection is random and hence eliminates selection bias.

iv. False. The correlation may be spurious. Correlation does not imply causality, so we
cannot be certain that x influences y.
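
For part ii., the probabilities of lying within one and two standard deviations of the mean can be checked numerically rather than from tables. This is a small sketch using `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution, mean 0 and standard deviation 1

# Probability of being within k standard deviations of the mean.
within_1 = z.cdf(1) - z.cdf(-1)
within_2 = z.cdf(2) - z.cdf(-2)

print(round(within_1, 4), round(within_2, 4))
```

The 68% figure in the statement belongs to one standard deviation, not two.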

(d) The probability distribution of a random variable X is given below.

X=x 1 4 7 10
P (X = x) 0.10 0.30 0.40 0.20

i. Find E(X), the expected value of X.


(2 marks)
ii. Find the probability that X² < 20.
(2 marks)
iii. Does X follow a uniform distribution? Justify your answer.
(2 marks)

Reading for this question


This is a question on a discrete random variable, exploring the concepts of discrete
probability distributions. Reading from Chapter 5 is suggested with a focus on the sections
on these topics.


Approaching the question


i. We have:
$$E(X) = \sum_{x} x\,P(X = x) = 1 \times 0.10 + 4 \times 0.30 + 7 \times 0.40 + 10 \times 0.20 = 6.1.$$

ii. The probability distribution of Z = X² will be:


Z=z 1 16 49 100
P (Z = z) 0.10 0.30 0.40 0.20
Hence the correct probability is 0.40.
Note that this part may be answered without deriving the probability distribution table
of Z. One can note that only the values X = 1 and X = 4 will give X² < 20, hence the
requested probability is 0.10 + 0.30 = 0.40.
iii. Since the probabilities are not all equal, X does not have a uniform distribution.
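
The three parts can be reproduced by storing the probability distribution as a dictionary. This is a minimal sketch with illustrative names, not a required method.

```python
# Probability distribution of X: value -> probability.
pmf = {1: 0.10, 4: 0.30, 7: 0.40, 10: 0.20}

# Part i.: expected value is the probability-weighted sum of the values.
expected = sum(x * p for x, p in pmf.items())

# Part ii.: add the probabilities of the values of X whose square is below 20.
prob_sq_below_20 = sum(p for x, p in pmf.items() if x**2 < 20)

# Part iii.: a discrete uniform distribution needs all probabilities equal.
is_uniform = len(set(pmf.values())) == 1

print(round(expected, 2), round(prob_sq_below_20, 2), is_uniform)
```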

(e) The times of marathon runners, participating in the London Marathon, are
normally distributed with mean 3.4 hours and a standard deviation of 0.85
hours.
i. What is the proportion of runners in the London Marathon that finish in
more than 4 hours?
(2 marks)

ii. What is the proportion of runners that finish the London Marathon with
times between 2.75 and 4.75 hours?
(3 marks)

iii. Do you think it is reasonable to assume that the times of marathon runners
follow a normal distribution? Briefly explain your view.
(2 marks)

Reading for this question


This question requires application of the normal distribution, for which Section 6.8 of the
subject guide is relevant.
Approaching the question
i. We can write:
$$P(X > 4) = P\left(\frac{X - 3.4}{0.85} > \frac{4 - 3.4}{0.85}\right) = P(Z > 0.71).$$

Continuing from above, we get P (Z > 0.71) = 1 − Φ(0.71) = 0.2389.


ii. We can write:
$$P(2.75 \le X \le 4.75) = P\left(\frac{2.75 - 3.4}{0.85} \le Z \le \frac{4.75 - 3.4}{0.85}\right) = P(-0.76 \le Z \le 1.59).$$

Continuing from above, we get:

P (−0.76 ≤ Z ≤ 1.59) = Φ(1.59) − Φ(−0.76) = 0.9441 − 0.2236 = 0.7205.

iii. Any reasonable argument accepted. A discussion of time being a continuous variable
supporting the use of the normal distribution (which is continuous), and whether or not
it is reasonable to assume the distribution of times is symmetric (which the normal
distribution is).
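
The table look-ups in parts i. and ii. can be mimicked in code. The sketch below rounds each z-value to two decimal places before evaluating Φ, which is an assumption made here to match the use of statistical tables; without that rounding the probabilities differ slightly in the third decimal place.

```python
from statistics import NormalDist

mu, sigma = 3.4, 0.85
std_normal = NormalDist()  # standard normal, used as the function Phi

# Part i.: standardise 4 hours, rounding z as when reading tables.
z_4 = round((4 - mu) / sigma, 2)
p_more_than_4 = 1 - std_normal.cdf(z_4)

# Part ii.: standardise both endpoints and take the difference of Phi values.
z_low = round((2.75 - mu) / sigma, 2)
z_high = round((4.75 - mu) / sigma, 2)
p_between = std_normal.cdf(z_high) - std_normal.cdf(z_low)

print(z_4, round(p_more_than_4, 4))
print(z_low, z_high, round(p_between, 4))
```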


(f ) A supplier dispatches products from one of three warehouses (A, B and C),
where these warehouses account for 20%, 30% and 50% of the supplier’s sales,
respectively. It is known that the percentage of defective items are 7%, 4% and
5%, from warehouses A, B and C, respectively. A customer complains that they
have received a defective item. What is the probability this item came from
warehouse A? Provide your answer to four decimal places.
(6 marks)

Reading for this question


This is a question on conditional probability, requiring the use of Bayes’ formula. See
Section 5.10 of the subject guide for details.

Approaching the question


With obvious notation, for warehouse A we have:
$$\begin{aligned}
P(A \mid D) &= \frac{P(D \mid A)\,P(A)}{P(D \mid A)\,P(A) + P(D \mid B)\,P(B) + P(D \mid C)\,P(C)} \\
&= \frac{0.07 \times 0.20}{0.07 \times 0.20 + 0.04 \times 0.30 + 0.05 \times 0.50} \\
&= 0.2745.
\end{aligned}$$

Note the use of the total probability formula in the denominator (which represents P (D)).
The answer should be reported to four decimal places, as requested in the question.
Note a correct probability tree is also acceptable.
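
The same calculation can be laid out in code, which also gives the posterior probabilities for warehouses B and C. This is an illustrative sketch; the dictionary names are not from the question.

```python
# Share of sales for each warehouse, and defective rate given the warehouse.
prior = {"A": 0.20, "B": 0.30, "C": 0.50}
p_defective_given = {"A": 0.07, "B": 0.04, "C": 0.05}

# Total probability formula: P(D) is the sum of P(D | W) P(W) over warehouses.
p_defective = sum(prior[w] * p_defective_given[w] for w in prior)

# Bayes' formula: P(W | D) = P(D | W) P(W) / P(D).
posterior = {w: prior[w] * p_defective_given[w] / p_defective for w in prior}

print(round(p_defective, 3), round(posterior["A"], 4))
```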

(g) A random sample is drawn from a normal distribution, N (µ, 25). You are told
that a 99% confidence interval for the population mean is (4.69, 6.31). What
was the size of the sample?
(5 marks)

Reading for this question


Sample size determination for estimating population means (and population proportions) is
covered in Section 7.11 of the subject guide.

Approaching the question


The sample mean, x̄, would be in the centre of the interval (4.69, 6.31). Adding the two
endpoints and dividing by 2 gives:

$$\bar{x} = \frac{4.69 + 6.31}{2} = 5.5.$$
The half-width of the interval is therefore 6.31 − 5.5 = 0.81.
To find the sample size, note that the standard error when estimating a single mean is:

$$\frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{n}}.$$

Since this is a 100(1 − α)% = 99% confidence interval, then α = 0.01, so the confidence
coefficient is zα/2 = z0.005 = 2.576. Therefore, to determine n we need to solve:

$$2.576 \times \frac{5}{\sqrt{n}} = 0.81 \Rightarrow n = 252.85.$$

Since this is the minimum value of n, and we must have that n is an integer, we round up to
get a sample size of 253.
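
The steps above can be sketched as a short calculation. The confidence coefficient 2.576 is the tabulated value quoted in the commentary; the variable names are illustrative.

```python
from math import ceil, sqrt

lower, upper = 4.69, 6.31   # endpoints of the 99% confidence interval
sigma = sqrt(25)            # population standard deviation from N(mu, 25)
z = 2.576                   # tabulated z value for a 99% interval

# Half-width of the interval equals z * sigma / sqrt(n).
half_width = (upper - lower) / 2

# Rearranging for n, then rounding up to a whole number of observations.
n_exact = (z * sigma / half_width) ** 2
n = ceil(n_exact)

print(round(half_width, 2), round(n_exact, 2), n)
```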


(h) Explain how you would develop an experimental design to determine the
effectiveness of a new medicine.
(5 marks)

Reading for this question


Chapter 11 of the subject guide discusses aspects of experimental design.

Approaching the question


This was a deliberately open-ended question, although a complete answer should discuss
control and treatment groups, randomisation, the placebo effect and double blinding.

Section B

Answer two out of the three questions from this section (25 marks each).

Question 2

(a) A sample consisting of 400 randomly selected students was classified in terms of
personality type (introvert or extrovert) and in terms of their favourite colour
(out of red, yellow, green or blue). Their responses are summarised in the table
below:

                 Personality type
          Introvert   Extrovert   Total
Red           32          68       100
Yellow        26          74       100
Green         21          79       100
Blue          46          54       100
Total        125         275       400

i. Based on the data in the table, and without conducting any significance test,
would you say there is an association between the student’s type of
personality and colour preference?

ii. Calculate the χ² statistic and use it to test for independence, using a 5%
significance level. What do you conclude?

(14 marks)

Reading for this question


This question targets Chapter 9 on contingency tables and chi-squared tests. Note that in
part i. of the question it does not require any calculations, just understanding and
interpreting contingency tables. Part ii. is a straightforward chi-squared test and the reading
is also given in Chapter 9.

Approaching the question

i. There are some differences in rates of introvert students for each colour preference. More
specifically, 21% of the students who prefer the green colour are introvert, whereas the
corresponding proportion for students who prefer red is 32% and for students preferring
blue is 46%. Hence there seems to be an association between personality type and colour
preference, although this needs to be investigated further.


ii. Set out the null hypothesis that there is no association between personality type and
colour preference against the alternative, that there is an association. Be careful
to get these the correct way round! We test:

H0 : No association between personality type and colour preference

vs.
H1 : Association between personality type and colour preference.
Work out the expected values to obtain the table below:
31.25 68.75
31.25 68.75
31.25 68.75
31.25 68.75
The test statistic formula is:
$$\sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$$

which gives a value of 16.33. This is a 4 × 2 contingency table so the degrees of freedom
are (4 − 1) × (2 − 1) = 3. For α = 0.05, the critical value is 7.815, hence we reject H0 . We
conclude that there is moderate evidence of an association between personality type and
colour preference.
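
The expected frequencies and the test statistic can be reproduced directly from the observed table. The sketch below does the arithmetic by hand rather than calling a statistics library; the critical value 7.815 is the tabulated value quoted above, and the variable names are illustrative.

```python
# Observed counts: colour -> (introvert, extrovert).
observed = {
    "Red": (32, 68),
    "Yellow": (26, 74),
    "Green": (21, 79),
    "Blue": (46, 54),
}

col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]
grand_total = sum(col_totals)

chi_sq = 0.0
for row in observed.values():
    row_total = sum(row)
    for j, o in enumerate(row):
        # Expected count under independence: row total * column total / n.
        e = row_total * col_totals[j] / grand_total
        chi_sq += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(col_totals) - 1)
critical_5pct = 7.815          # chi-squared critical value for 3 df at 5%
reject_h0 = chi_sq > critical_5pct

print(round(chi_sq, 2), df, reject_h0)
```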

(b) i. Describe how quota sampling is performed and explain how it differs from
stratified random sampling.
ii. A company producing handheld electronic devices (tablets, mobile phones
etc.) wants to understand how men and women rate its products. For this
reason, the company’s management has decided to use a survey of its
customers and has asked you to devise an appropriate random sampling
scheme. Outline the key components of your sampling scheme.
(11 marks)

Reading for this question


This was a question on basic material concerning survey designs. Background reading is
given in Chapter 10 of the subject guide which, along with the recommended reading,
should be looked at carefully. Candidates were expected to have studied and understood the
main important constituents of design in random sampling. It is also a good idea to try the
learning activities of Chapter 10.
Approaching the question
i. A description of quota sampling can be found on page 241 of the subject guide. It is a
form of non-probability sampling as opposed to stratified random sampling which is a
form of probability sampling. In quota sampling a sampling frame is not required,
whereas in stratified random sampling it is.
ii. This is an open-ended question. Possible ‘ingredients’ of a good answer are as follows:
• Propose stratified sampling since male and female customers are to be surveyed.
• Sampling frame could be the company’s customer database.
• Take a simple random sample from each stratum.
• The appropriate stratification factor would be gender.
• Other stratification factors could be age group, country of residence etc.
• Contact method: mail, phone or email (likely to have all details in the database).
• Minimise non-response through suitable incentive, such as a discount off their next
purchase.
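
The core of such a scheme, a simple random sample drawn separately within each stratum, might look like the following. Everything here is hypothetical: the customer records, the `gender` field and the sample sizes are invented purely for illustration, and `stratified_sample` is not a library function.

```python
import random

def stratified_sample(frame, stratum_of, n_per_stratum, seed=0):
    """Draw a simple random sample without replacement from each stratum."""
    rng = random.Random(seed)
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    return {name: rng.sample(units, n_per_stratum[name])
            for name, units in strata.items()}

# Hypothetical sampling frame standing in for the company's customer database.
customers = [{"id": i, "gender": "F" if i % 2 else "M"} for i in range(1000)]

sample = stratified_sample(customers, lambda c: c["gender"], {"F": 50, "M": 50})
print({name: len(units) for name, units in sample.items()})
```

Fixing the seed makes the draw reproducible; in practice the stratum sample sizes would be chosen in proportion to the strata or to meet precision requirements.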


Question 3

(a) A camera recorded the speed of 30 cars on a road with a 30 miles per hour
speed limit. The recorded data are shown below:

25.6 25.7 25.7 25.8 25.8
26.2 26.9 27.5 27.7 27.8
27.9 27.9 28.3 28.4 28.5
28.8 28.9 28.9 29.0 29.1
29.2 29.3 29.5 29.7 29.8
30.1 30.1 30.2 36.2 36.9

i. Carefully construct, draw and label a histogram of these data. The histogram
can be drawn on ordinary paper – no graph paper needed. You should draw
by hand; do not use a computer.

ii. Find the median speed among these cars and the upper quartile. What
percentage of drivers were exceeding the 30 miles per hour speed limit?

iii. Comment on the data given the shape of the histogram without doing any
further calculations.

iv. Name two other types of graphical displays that would be suitable to
represent the data.

(12 marks)

Reading for this question


Chapter 4 provides all the relevant material for this question. More specifically, reading on
histograms can be found in Section 4.7.3, but the entire Section 4.7 is highly relevant. For
measures of location, see Section 4.8.

Approaching the question

i. A histogram compatible with what the examiners were expecting to see is shown below.


This histogram is based on the following class intervals and frequency densities:
Class interval   Interval width   Frequency   Frequency density
[25.0, 27.0)           2              7              3.5
[27.0, 29.0)           2             11              5.5
[29.0, 31.0)           2             10              5.0
[31.0, 33.0)           2              0              0.0
[33.0, 35.0)           2              0              0.0
[35.0, 37.0)           2              2              1.0
Marks were awarded for including the title, labelling correctly and accurately drawing
the figure.
ii. The median is the midpoint of the ordered observations, which is 28.65 miles per hour.
Q3 = 29.45 miles per hour. Note that units for at least one of median or Q3 are required.
The percentage is 5/30 = 16.67% (17% is also acceptable).
iii. The histogram is positively (right) skewed. There are two cars (with recorded speeds of
36.2 and 36.9 miles per hour) which may be regarded as outliers.
iv. Any two of boxplot, dot plot and stem-and-leaf diagram.
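
The class frequencies, median, upper quartile and percentage of speeding drivers can be checked as follows. This is a sketch only: quartile conventions differ between textbooks and software, and the `inclusive` method of `statistics.quantiles` is used here because it happens to reproduce the value 29.45 quoted above; other conventions give slightly different values.

```python
from statistics import median, quantiles

speeds = [
    25.6, 25.7, 25.7, 25.8, 25.8, 26.2, 26.9, 27.5, 27.7, 27.8,
    27.9, 27.9, 28.3, 28.4, 28.5, 28.8, 28.9, 28.9, 29.0, 29.1,
    29.2, 29.3, 29.5, 29.7, 29.8, 30.1, 30.1, 30.2, 36.2, 36.9,
]

# Class intervals [lo, hi) of width 2, as in the table above.
edges = [25, 27, 29, 31, 33, 35, 37]
classes = list(zip(edges, edges[1:]))
freq = [sum(lo <= s < hi for s in speeds) for lo, hi in classes]
density = [f / (hi - lo) for f, (lo, hi) in zip(freq, classes)]

med = median(speeds)
q1, q2, q3 = quantiles(speeds, n=4, method="inclusive")
pct_over_limit = 100 * sum(s > 30 for s in speeds) / len(speeds)

print(freq, density)
print(round(med, 2), round(q3, 2), round(pct_over_limit, 2))
```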

(b) A random sample of 9 students received special training to improve their
performance on IQ tests. Each of the 9 students took an IQ test before and
after the training and their scores are shown in the table below:

IQ score before training IQ score after training


105 107
116 120
120 118
93 92
119 119
133 135
75 78
86 90
90 96

i. Carry out an appropriate hypothesis test to determine whether the special
training is effective at increasing the average IQ score. State the test
hypotheses, and specify your test statistic and its distribution under the null
hypothesis. Comment on your findings.
ii. State any assumptions you made in part i.
iii. Give a 90% confidence interval for the difference between the means of the
IQ scores before and after training.
(13 marks)

Reading for this question


Look up the sections about hypothesis testing and confidence intervals for differences in
means for paired data in Sections 8.16.4 and 7.13.4, respectively.
Approaching the question
i. We test:
H0 : µbefore = µafter vs. H1 : µbefore < µafter .
Equivalently, defining each difference as d = after − before, we test:
H0 : µd = 0 vs. H1 : µd > 0.


The differences are:

2, 4, −2, −1, 0, 2, 3, 4 and 6.

The sample mean and standard deviation of these differences are:

x̄d = 2.0 and sd = 2.598.

Under H0 , the test statistic is:

$$\frac{\bar{X}_d}{S_d/\sqrt{n}} \sim t_{n-1} = t_8$$

and the test statistic value is 2.309. For α = 0.05, the critical value is t8, 0.05 = 1.860,
hence we reject H0 since 1.860 < 2.309. For α = 0.01, the critical value is t8, 0.01 = 2.896,
hence we do not reject H0 since 2.309 < 2.896. We conclude that the test is moderately
significant, i.e. there is moderate evidence that the special IQ training is effective.
ii. The assumptions are that:
• differences are normally distributed
• pairs of observations are independent.

iii. The confidence coefficient for a 90% confidence interval is t8, 0.05 = 1.860, hence a 90%
confidence interval is:
$$2.0 \pm 1.860 \times \frac{2.598}{\sqrt{9}} \Rightarrow (0.389, 3.611).$$
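
The paired calculations in parts i. and iii. can be reproduced from the raw scores. In this sketch the differences are taken as after minus before, matching the differences listed above, and the value 1.860 is the tabulated t8 critical value quoted in the commentary; variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

before = [105, 116, 120, 93, 119, 133, 75, 86, 90]
after = [107, 120, 118, 92, 119, 135, 78, 90, 96]

# Paired differences, after minus before.
d = [a - b for a, b in zip(after, before)]
n = len(d)

d_bar = mean(d)
s_d = stdev(d)                    # sample standard deviation (n - 1 divisor)
se = s_d / sqrt(n)
t_stat = d_bar / se               # compared with the t distribution on n - 1 = 8 df

t_crit = 1.860                    # upper 5% point of t8, from tables
ci_90 = (d_bar - t_crit * se, d_bar + t_crit * se)

print(d, d_bar, round(s_d, 3), round(t_stat, 3))
print([round(v, 3) for v in ci_90])
```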

Question 4

(a) An insurance company wants to relate the amount of fire damage y, measured
in $000s, in major residential fires to the distance between the residence and the
nearest fire station x, measured in miles. For this reason, a study was conducted
in a large suburb of a major city based on a sample of 10 recent fires in this
suburb. For each of these fires, the variables x and y were recorded and are
shown in the table below:
Fire #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
x 3.4 1.8 4.6 2.3 3.1 5.5 0.7 3.0 2.6 4.3
y 2.6 1.8 5.9 2.3 2.8 8.6 1.4 2.3 2.0 5.7

The summary statistics for these data are:

Sum of x data: 31.3 Sum of the squares of x data: 115.85


Sum of y data: 35.4 Sum of the squares of y data: 175.64
Sum of the products of x and y data: 138.08

(a) i. Draw a scatter diagram of these data. Label the diagram carefully. (The
scatter diagram can be drawn on ordinary paper – no graph paper needed.
You should draw by hand; do not use a computer.)
ii. Calculate the sample correlation coefficient. Interpret your findings.
iii. Calculate the least squares line of y on x and draw the line on the scatter
diagram.
iv. Do you find the analyses in parts ii. and iii. appropriate? Justify your
answer and suggest any alternative ways to model the relationship
between x and y.
(13 marks)


Reading for this question


This is a standard regression question and the reading is to be found in Chapter 12 of
the subject guide. Section 12.6 provides details for scatter diagrams and is suitable for
part i., whereas the remaining parts are on correlation and regression that are covered in
Sections 12.8–12.10. Sample examination question 2 of this chapter is also recommended
for practice on questions of this type.
Approaching the question
i. Candidates are reminded that they are asked to draw and label the scatter diagram
which should include a title (‘Scatter diagram’ alone will not suffice) and labelled
axes which give their units in addition. Far too many candidates threw away marks
by neglecting these points and consequently were only given one mark out of the
possible four allocated for this part of the question.
We have:

ii. The summary statistics can be substituted into the formula for the sample correlation
coefficient (make sure you know which one it is!) to obtain the value 0.9093. An
interpretation of this value is the following: The data suggest that the greater the
distance of the residence from the nearest fire station, the higher the amount of fire
damage. The fact that the value is close to 1, suggests that this is a strong, linear,
positive correlation.
Many candidates did not mention all three words (strong, linear, positive). Note that
all of these words provide useful information on interpreting the association and are
therefore required.
iii. The regression line can be written by the equation ŷ = a + bx or y = a + bx + ε. The
formula for b is:
$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}$$
and by substituting the summary statistics we get b = 1.526. The formula for a is:

a = ȳ − bx̄

so we get a = −1.235. Hence the regression line can be written as:

ŷ = −1.235 + 1.526x or y = −1.235 + 1.526x + ε.

It should also be plotted on the scatter diagram.


Many candidates incorrectly reported the regression line as y = −1.235 + 1.526x.
This expression is incorrect, since it has neither the hat on y (for fitted values) nor the
error term ε; one of the two forms above is required. Also, many candidates did
not draw this line on the scatter diagram; instead they drew an approximate line
trying to go around the points but without reference to the above equation. No
marks were awarded in such cases.
iv. Some discussion mentioning, for example, ‘no, due to the non-linear shape of the plot’
or ‘no, due to outliers’. Alternative ways to model the relationship between x and y
could include Spearman’s rank correlation coefficient and/or the need to transform
the data, such as a logarithmic transformation.
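
The values in parts ii. and iii. follow from the summary statistics given in the question. The sketch below uses the standard corrected sums of squares and products; the names `s_xx`, `s_yy` and `s_xy` are shorthand introduced here, not notation from the commentary.

```python
from math import sqrt

# Summary statistics from the question (n = 10 fires).
n = 10
sum_x, sum_y = 31.3, 35.4
sum_x2, sum_y2, sum_xy = 115.85, 175.64, 138.08

x_bar, y_bar = sum_x / n, sum_y / n

# Corrected sums of squares and products.
s_xy = sum_xy - n * x_bar * y_bar
s_xx = sum_x2 - n * x_bar**2
s_yy = sum_y2 - n * y_bar**2

r = s_xy / sqrt(s_xx * s_yy)   # sample correlation coefficient
b = s_xy / s_xx                # least squares slope
a = y_bar - b * x_bar          # least squares intercept

print(round(r, 4), round(b, 3), round(a, 3))
```

To draw the fitted line on the scatter diagram, evaluate a + bx at two convenient values of x within the range of the data and join the two points.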

(b) The 55 university students on a certain course were randomly assigned to
two class groups of size 30 and 25 students, respectively. At the end of the
year, all students took the examination and their marks are summarised in
the table below.
                Sample size   Sample mean   Sample standard deviation
Class Group 1       30           75.33              7.61
Class Group 2       25           71.40              6.37

i. Use an appropriate hypothesis test to determine whether the students of
class group 1 were better in terms of examination marks. State clearly the
hypotheses, the test statistic and its distribution under the null
hypothesis, and carry out the test at two appropriate significance levels.
Comment on your findings.
ii. State clearly any assumptions you made in part i.
iii. Provide a 95% confidence interval for the difference between the mean
examination marks of the two class groups.

(12 marks)

Reading for this question


The first two parts of the question refer to a one-sided hypothesis test comparing two
population means. While the entire chapter on hypothesis testing is relevant, one can focus
on Section 8.16. For part iii., see Section 7.13.

Approaching the question

i. Let µA denote the mean examination mark for class group 1 and µB the mean
examination mark for class group 2. We test:

H0 : µA = µB vs. H1 : µA > µB .

The test statistic value is 2.051 (using the pooled sample variance of 50.06). (If equal
variances are not assumed the test statistic value is 2.085.) For reference the test
statistic formula is:
$$\frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_p^2\,(1/n_A + 1/n_B)}} \quad \text{or} \quad \frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_A^2/n_A + s_B^2/n_B}}.$$

For α = 0.05, the critical value is 1.684 (based on the t40 distribution) or 1.671 (based on
the t60 distribution) or 1.645 (based on the standard normal distribution). Decision: we
reject H0 . Choosing a smaller α, say α = 0.01, the critical value is 2.423 (based on the
t40 distribution) or 2.390 (based on the t60 distribution) or 2.326 (based on the standard
normal distribution), so do not reject H0 . Hence the test is moderately significant, i.e.
there is moderate evidence that the mean examination mark for class group 1 is greater
than that of class group 2.


ii. The assumptions made in part i. concern:
• whether the unknown population variances are equal
• whether nA + nB is ‘large’ so that the normality assumption is satisfied
• whether the samples are independent.
Some candidates stated assumptions in this part that were not made in part i. Marks
were not awarded in such cases. Also, some other candidates just copied the phrase
‘assumption about equal variances’ and naturally were not awarded any marks. One
should state whether the calculations were based on the assumption that unknown
variances are equal or unequal.
iii. For a 95% confidence interval, if the t40 distribution is used then the confidence
coefficient is 2.021, or if the standard normal distribution is assumed then it is 1.96.
These result in confidence intervals of (0.057, 7.802) and (0.175, 7.685), respectively.
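
For reference, the pooled calculation behind the quoted figures can be written out as follows. This is a worked sketch assuming equal variances, with intermediate values rounded; the interval in the last line uses the t40 confidence coefficient quoted above.

```latex
\begin{align*}
s_p^2 &= \frac{(n_A - 1)s_A^2 + (n_B - 1)s_B^2}{n_A + n_B - 2}
       = \frac{29 \times 7.61^2 + 24 \times 6.37^2}{53} \approx 50.06 \\
\text{standard error} &= \sqrt{s_p^2 \left(\frac{1}{n_A} + \frac{1}{n_B}\right)}
       = \sqrt{50.06 \left(\frac{1}{30} + \frac{1}{25}\right)} \approx 1.916 \\
t &= \frac{\bar{x}_A - \bar{x}_B}{\text{standard error}}
       = \frac{75.33 - 71.40}{1.916} \approx 2.051 \\
\text{95\% confidence interval} &: \; (\bar{x}_A - \bar{x}_B) \pm 2.021 \times 1.916
\end{align*}
```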
