ST104a Commentary Autumn 2021
Important note
This commentary reflects the examination and assessment arrangements for this course in the
academic year 2020–21. The format and structure of the examination may change in future years,
and any such changes will be publicised on the virtual learning environment (VLE).
Unless otherwise stated, all cross-references will be to the latest version of the subject guide (2019).
You should always attempt to use the most recent edition of any Essential reading textbook, even if
the commentary and/or online reading list and/or subject guide refer to an earlier edition. If
different editions of Essential reading are listed, please check the VLE for reading supplements – if
none are available, please use the contents list and index of the new edition to find the relevant
section.
General remarks
Learning outcomes
At the end of the half course and having completed the Essential reading and activities you should:
be familiar with the key ideas of statistics that are accessible to a candidate with a
moderate mathematical competence
be able to routinely apply a variety of methods for explaining, summarising and presenting
data and interpreting results clearly using appropriate diagrams, titles and labels when
required
be able to summarise the ideas of randomness and variability, and the way in which these
link to probability theory to allow the systematic and logical collection of statistical
techniques of great practical importance in many applied areas
have a grounding in probability theory and some grasp of the most common statistical
methods
be able to perform inference to test the significance of common measures such as means and
proportions and conduct chi-squared tests of contingency tables
be able to use simple linear regression and correlation analysis and know when it is
appropriate to do so.
You have two hours to complete this paper, which is in two parts. The first part, Section A, is
compulsory; it covers several subquestions and accounts for 50 per cent of the total marks.
ST104a Statistics 1
Section B contains three questions, each worth 25 per cent, from which you are asked to choose two.
Remember that each of the Section B questions is likely to cover more than one topic. In 2021, for
example, the first part of Question 2 required contingency tables while the second part related to
aspects of sampling design. In Question 3, the first part covered data visualisation and descriptive
statistics while the second part related to statistical inference related to means. Finally, in Question
4, the first part related to correlation and linear regression while the second part covered statistical
inference related to means. This means that it is really important that you make sure you have a
reasonable idea of what topics are covered before you start work on the paper! We suggest you
divide your time as follows during the examination:
Spend the first 10 minutes annotating the paper. Note the topics covered in each question
and subquestion.
Allow yourself 45 minutes for Section A. Do not allow yourself to get stuck on any one
question, but do not just give up after two minutes!
Once you have chosen your two Section B questions, give them about 25 minutes each.
This leaves you with 15 minutes. Do not leave the examination hall at this point! Check
over any questions you may not have completely finished. Make sure you have labelled and
given a title to any tables or diagrams which were required and, if you did more than the
two questions required in Section B, decide which one to delete. Remember that only two of
your answers will be given credit in Section B and that you must choose which these are!
The examiners are looking for very simple demonstrations from you. They want to be sure that you:
have covered the syllabus as described and explained in the subject guide
know the basic formulae given there and when and how to use them
understand and answer the questions set.
You are not expected to write long essays where explanations or descriptions of sampling design
are required, and note-form answers are acceptable. However, clear and accurate language, both
mathematical and written, is expected and marked. The explanations below and in the specific
commentaries for the papers for each zone should make these requirements clear.
The most important thing you can do is answer the question set! This may sound very simple, but
these are some of the things that candidates did not do, though asked, in the 2021 examinations!
Remember the following.
If you are asked to label a diagram (which is almost always the case!), please do so. Writing
‘Histogram’, ‘Stem-and-leaf diagram’, ‘Boxplot’ or ‘Scatter diagram’ in itself is insufficient.
What do the data describe? What are the units? What are the x-axis and y-axis?
If you are specifically asked to perform a hypothesis test, or calculate a confidence interval,
do so. It is not acceptable to do one rather than the other! If you are asked to use a 5%
significance level, this is what will be marked.
Do not waste time calculating things which are not required by the examiners. If you are
asked to find the line of best fit, you will get no extra marks if you calculate the correlation
coefficient as well. If you are asked to use the confidence interval you have just calculated to
comment on the results, carrying out an additional hypothesis test will not gain you marks.
When performing calculations try to use as many decimal places as possible in intermediate
steps to reach the most accurate solution. It is advised to have at least two decimal places
in general and at least three decimal places when calculating probabilities.
Examiners’ commentaries 2021 (Autumn)
How should you use the specific comments on each question given in the
Examiners’ commentaries?
We hope that you find these useful. For each question and subquestion, they give:
further guidance for each question on the points made in the last section
the answers, or keys to the answers, which the examiners were looking for
the relevant detailed reference to Newbold, P., W.L. Carlson and B.M. Thorne Statistics for
Business and Economics. (London: Prentice–Hall, 2012) eighth edition [ISBN
9780273767060] and the subject guide
where appropriate, suggested activities from the subject guide which should help you to
prepare, and similar questions from Newbold et al. (2012).
Any further references you might need are given in the part of the subject guide to which you are
referred for each answer.
It was noted recently that a small number of candidates appeared to be memorising answers from
previous years’ Examiners’ commentaries, for example plots, and produced the exact same image of
them without looking at the current year’s examination paper questions! Note that this is very easy
to spot. The Examiners’ commentaries should be used as a guide to practise on sample examination
questions and it is pointless to attempt to memorise them.
Many candidates are disappointed to find that their examination performance is poorer than they
expected. This may be due to a number of reasons, but one particular failing is ‘question
spotting’, that is, confining your examination preparation to a few questions and/or topics which
have come up in past papers for the course. This can have serious consequences.
We recognise that candidates might not cover all topics in the syllabus in the same depth, but you
need to be aware that examiners are free to set questions on any aspect of the syllabus. This
means that you need to study enough of the syllabus to enable you to answer the required number of
examination questions.
The syllabus can be found in the Course information sheet available on the VLE. You should read
the syllabus carefully and ensure that you cover sufficient material in preparation for the
examination. Examiners will vary the topics and questions from year to year and may well set
questions that have not appeared in past papers. Examination papers may legitimately include
questions on any topic in the syllabus. So, although past papers can be helpful during your revision,
you cannot assume that topics or specific questions that have come up in past examinations will
occur again.
If you rely on a question-spotting strategy, it is likely you will find yourself in difficulties
when you sit the examination. We strongly advise you not to adopt this strategy.
Candidates should answer THREE questions: all parts of Section A (50 marks in total) and TWO
questions from Section B (25 marks each). Candidates are strongly advised to divide their
time accordingly.
Section A
Question 1
You are told that the value of the sample mean is x̄ = 16.
or else:
$$61 + x = 80 \Rightarrow x = 19.$$
$$\frac{(18 - 16)^2 + (12 - 16)^2 + (16 - 16)^2 + (19 - 16)^2 + (15 - 16)^2}{4} = 7.5$$
(6 marks)
ii. We have:
$$\sum_{i=3}^{4} \sqrt{x_i y_i} = \sqrt{8 \times 8} + \sqrt{5 \times 5} = 8 + 5 = 13.$$
Note that to be mathematically precise, $\sqrt{64}$ and $\sqrt{25}$ are also equal to −8 and −5,
respectively. For this reason, −13 as a final answer was also accepted as correct.
iii. We have:
$$y_4^2 + \sum_{i=1}^{3} \frac{1}{x_i} = 5^2 + \left(-\frac{1}{2} - \frac{1}{4} + \frac{1}{8}\right) = 24.375.$$
(c) State whether the following are true or false and give a brief explanation. (No
marks will be awarded for a simple true/false answer.)
i. The upper quartile of a sample dataset is never smaller than the lower
quartile.
(2 marks)
ii. The probability that a normal random variable is less than one standard
deviation from its mean is 99%.
(2 marks)
(e) The times of marathon runners, participating in the Olympic Games, are
normally distributed with mean 3.5 hours and a standard deviation of 0.75
hours.
i. What is the proportion of runners in the Olympic Games that finish in less
than 3 hours?
(2 marks)
ii. What is the proportion of runners that finish the Olympic Games with times
between 2.5 and 4.5 hours?
(3 marks)
iii. Do you think it is reasonable to assume that the times of marathon runners
follow a normal distribution? Briefly explain your view.
(2 marks)
iii. Any reasonable argument was accepted, for example a discussion of time being a continuous
variable, which supports the use of the normal distribution (which is continuous), and of
whether or not it is reasonable to assume that the distribution of times is symmetric (as the
normal distribution is).
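Parts i. and ii. both reduce to evaluating the normal cumulative distribution function. The following is a minimal sketch using Python's standard library rather than the statistical tables used in the examination, so the figures are illustrative and may differ from table-based answers in the last decimal place. The variable names are our own.

```python
from statistics import NormalDist

# Marathon times as modelled in the question: N(3.5, 0.75^2), in hours.
times = NormalDist(mu=3.5, sigma=0.75)

# i. Proportion finishing in less than 3 hours: P(X < 3).
p_under_3 = times.cdf(3.0)

# ii. Proportion finishing between 2.5 and 4.5 hours: P(2.5 < X < 4.5).
p_between = times.cdf(4.5) - times.cdf(2.5)

print(f"P(X < 3)         = {p_under_3:.4f}")
print(f"P(2.5 < X < 4.5) = {p_between:.4f}")
```

In the examination the same values would be obtained by standardising, for example z = (3 − 3.5)/0.75, and reading the probability from the tables.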
(f ) An online retailer dispatches products from one of three warehouses (A, B and
C), where these warehouses account for 10%, 40% and 50% of the retailer’s
sales, respectively. It is known that the percentage of defective items are 4%,
6% and 3%, from warehouses A, B and C, respectively. A customer complains
that they have received a defective item. What is the probability this item came
from warehouse A? Provide your answer to four decimal places.
(6 marks)
$$
\begin{aligned}
P(A \mid D) &= \frac{P(D \mid A)\,P(A)}{P(D \mid A)\,P(A) + P(D \mid B)\,P(B) + P(D \mid C)\,P(C)} \\
&= \frac{0.04 \times 0.10}{0.04 \times 0.10 + 0.06 \times 0.40 + 0.03 \times 0.50} \\
&= 0.0930.
\end{aligned}
$$
Note the use of the total probability formula in the denominator (which represents P (D)).
The answer should be reported to four decimal places, as requested in the question.
Note a correct probability tree is also acceptable.
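As an illustrative check of the Bayes' theorem calculation, the sketch below recomputes the posterior probability from the figures given in the question. The function name `posterior` and the dictionary layout are our own choices, not part of the subject guide.

```python
def posterior(priors, likelihoods, target):
    """Return P(target | D) using Bayes' theorem.

    priors[k] is P(warehouse k); likelihoods[k] is P(D | warehouse k).
    """
    # Denominator: total probability of a defective item, P(D).
    p_defective = sum(priors[k] * likelihoods[k] for k in priors)
    return priors[target] * likelihoods[target] / p_defective


# Figures from the question: share of sales and defect rate per warehouse.
priors = {"A": 0.10, "B": 0.40, "C": 0.50}
defect_rates = {"A": 0.04, "B": 0.06, "C": 0.03}

p_a_given_d = posterior(priors, defect_rates, "A")
print(f"P(A | D) = {p_a_given_d:.4f}")
```

Formatting with four decimal places mirrors the instruction in the question.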
(g) A random sample is drawn from a normal distribution, N (µ, 36). You are told
that a 90% confidence interval for the population mean is (5.78, 7.22). What
was the size of the sample?
(5 marks)
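The sample mean is the midpoint of the confidence interval:
$$\bar{x} = \frac{5.78 + 7.22}{2} = 6.5.$$
To find the sample size, note that the standard error when estimating a single mean is:
$$\frac{\sigma}{\sqrt{n}} = \frac{6}{\sqrt{n}}.$$
Since this is a 100(1 − α)% = 90% confidence interval, then α = 0.10, so the confidence
coefficient is $z_{\alpha/2} = z_{0.05} = 1.645$. The half-width of the interval is
(7.22 − 5.78)/2 = 0.72. Therefore, to determine n we need to solve:
$$1.645 \times \frac{6}{\sqrt{n}} = 0.72 \Rightarrow n = 187.92.$$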
Since this is the minimum value of n, and we must have that n is an integer, we round up to
get a sample size of 188.
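The calculation above can be sketched in code as follows. This is only an illustration: the helper name `sample_size` is our own, and it uses the unrounded critical value from Python's standard library rather than the tabulated 1.645, so the intermediate value of n differs slightly in the second decimal place while the rounded-up answer is the same.

```python
import math
from statistics import NormalDist


def sample_size(lower, upper, sigma, confidence):
    """Sample size implied by a z-based interval (lower, upper) with known sigma."""
    half_width = (upper - lower) / 2
    # Two-sided critical value z_{alpha/2}.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Solve z * sigma / sqrt(n) = half_width for n.
    n_exact = (z * sigma / half_width) ** 2
    # n must be an integer, so round up.
    return n_exact, math.ceil(n_exact)


n_exact, n = sample_size(5.78, 7.22, sigma=6, confidence=0.90)
print(f"n = {n_exact:.2f}, rounded up to {n}")
```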
(h) Explain how you would develop an experimental design to determine the
effectiveness of a vaccine.
(5 marks)
Section B
Answer two out of the three questions from this section (25 marks each).
Question 2
Outcome
Faulty Non-faulty Total
Machine 1 4 96 100
Machine 2 2 98 100
Machine 3 11 89 100
Machine 4 14 86 100
Total 31 369 400
i. Based on the data in the table, and without conducting any significance test,
would you say there is an association between the machine number and the
component being faulty?
ii. Calculate the χ² statistic and use it to test for independence, using a 5%
significance level. What do you conclude?
(14 marks)
i. There are some differences in the proportions of faulty components for each machine.
More specifically, 2% of the components in Machine 2 are faulty, whereas the
corresponding proportion for Machine 3 is 11% and for Machine 4 is 14%. Hence there
seems to be an association between machine number and the component being faulty,
although this needs to be investigated further.
ii. Set out the null hypothesis that there is no association between machine number and the
component being faulty against the alternative, that there is an association. Be careful
to get these the correct way round! We test:
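H0 : No association between the machine number and the component being faulty
vs.
H1 : Association between the machine number and the component being faulty.
The test statistic formula is:
$$\sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$$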
which gives a value of 13.53. This is a 4 × 2 contingency table so the degrees of freedom
are (4 − 1) × (2 − 1) = 3. For α = 0.05, the critical value is 7.815, hence we reject H0 . We
conclude that there is moderate evidence of an association between machine number and
the component being faulty.
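For readers who wish to check the arithmetic, here is a minimal sketch of the χ² calculation for the table of machines. The function name `chi_squared` is our own; expected frequencies are computed as (row total × column total) / grand total, and the critical value is the tabulated one quoted above rather than something computed by the code.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency under independence.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Rows: Machines 1 to 4; columns: faulty, non-faulty.
machines = [[4, 96], [2, 98], [11, 89], [14, 86]]
stat, df = chi_squared(machines)

CRITICAL_5PCT_DF3 = 7.815  # tabulated value quoted in the commentary
print(f"chi-squared = {stat:.2f} on {df} df; reject H0: {stat > CRITICAL_5PCT_DF3}")
```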
(b) i. Describe how stratified random sampling is performed and explain how it
differs from quota sampling.
ii. A company producing handheld electronic devices (tablets, mobile phones
etc.) wants to understand how people of different ages rate its products. For
this reason, the company’s management has decided to use a survey of its
customers and has asked you to devise an appropriate random sampling
scheme. Outline the key components of your sampling scheme.
(11 marks)
Question 3
(a) The data below represent heights, measured in centimetres, of women from an
adult female population:
i. Carefully construct, draw and label a histogram of these data. The histogram
can be drawn on ordinary paper – no graph paper needed. You should draw
by hand; do not use a computer.
ii. Find the median height among these women and the upper quartile. What
percentage of women were below 165 cm?
iii. Comment on the data given the shape of the histogram without doing any
further calculations.
iv. Name two other types of graphical displays that would be suitable to
represent the data.
(12 marks)
i. A histogram compatible with what the examiners were expecting to see is shown below.
This histogram is based on the following class intervals and frequency densities:
Class interval   Interval width   Frequency   Frequency density
[160, 165)       5                3           0.6
[165, 170)       5                20          4.0
[170, 175)       5                5           1.0
[175, 180)       5                0           0.0
[180, 185)       5                1           0.2
[185, 190)       5                1           0.2
Marks were awarded for including the title, labelling correctly and accurately drawing
the figure.
ii. The median is the midpoint of the ordered observations, which is 168 centimetres.
Q3 = 169 centimetres. Note that units for at least one of median or Q3 are required. The
percentage is 3/30 = 10%.
iii. The histogram is positively (right) skewed. There are two women (with heights of 184
cm and 185 cm) who may be regarded as outliers.
iv. Any two of boxplot, dot plot and stem-and-leaf diagram.
(b) A random sample of 9 people tried a specific diet that lasted 2 months to lose
weight. The weights of these people, measured in kilograms, were measured
both at the beginning and the end of the diet, and are shown in the table below:
$$\frac{\bar{X}_d}{S_d / \sqrt{n}} \sim t_{n-1} = t_8$$
and the test statistic value is −2.309. For α = 0.05, the critical value is
$-t_{8,\,0.05} = -1.860$, hence we reject H0 since −2.309 < −1.860. For α = 0.01, the critical
value is $-t_{8,\,0.01} = -2.896$, hence we do not reject H0 since −2.896 < −2.309. We
conclude that the test is moderately significant, i.e. there is moderate evidence that the
diet is effective in helping people lose weight.
ii. The assumptions are that:
• differences are normally distributed
• pairs of observations are independent.
iii. The confidence coefficient for a 90% confidence interval is $t_{8,\,0.05} = 1.860$, hence a 90%
confidence interval is:
$$-2.0 \pm 1.860 \times \frac{2.598}{\sqrt{9}} \Rightarrow (-3.611, -0.389).$$
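The paired-sample test statistic and the 90% confidence interval can be reproduced with the short sketch below. Since the data table is not reproduced here, the summary values (n = 9, mean difference −2.0, standard deviation of the differences 2.598) are read off the interval formula above, and the critical value is the tabulated one. The variable names are our own.

```python
import math

# Summary statistics of the paired differences, as they appear in the
# confidence interval formula in the commentary.
n = 9
mean_diff = -2.0
sd_diff = 2.598
t_crit = 1.860  # tabulated t value with 8 degrees of freedom, upper 5% point

# Standard error of the mean difference.
se = sd_diff / math.sqrt(n)

# Test statistic for H0: mean difference = 0.
t_stat = mean_diff / se

# 90% confidence interval for the mean difference.
ci = (mean_diff - t_crit * se, mean_diff + t_crit * se)

print(f"t = {t_stat:.3f}")
print(f"90% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```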
Question 4
(a) The director of a local Tourism Authority would like to know whether a family’s
annual expenditure on recreation (y), measured in $000s, is related to their
annual income (x), also measured in $000s. In order to explore this potential
relationship, the variables x and y were recorded for 10 randomly selected
families that visited the area last year. The results were as follows:
Family #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
x 41.2 50.1 52.0 62.0 44.5 37.7 73.5 37.5 56.7 65.2
y 2.4 2.7 2.8 8.0 3.1 2.1 12.1 2.0 3.9 8.9
(a) i. Draw a scatter diagram of these data. Label the diagram carefully. (The
scatter diagram can be drawn on ordinary paper – no graph paper needed.
You should draw by hand; do not use a computer.)
ii. Calculate the sample correlation coefficient. Interpret your findings.
iii. Calculate the least squares line of y on x and draw the line on the scatter
diagram.
iv. Do you find the analyses in parts ii. and iii. appropriate? Justify your
answer and suggest any alternative ways to model the relationship
between x and y.
(13 marks)
ii. The summary statistics can be substituted into the formula for the sample correlation
coefficient (make sure you know which one it is!) to obtain the value 0.9222. An
interpretation of this value is the following: The data suggest that the higher a
family’s annual income is, the higher its annual expenditure on recreation. The fact that
the value is close to 1 suggests that this is a strong, linear, positive correlation.
Many candidates did not mention all three words (strong, linear, positive). Note that
all of these words provide useful information on interpreting the association and are
therefore required.
iii. The regression line can be written by the equation $\hat{y} = a + bx$ or $y = a + bx + \varepsilon$. The
formula for b is:
$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}$$
and by substituting the summary statistics we get b = 0.267. The formula for a is:
$$a = \bar{y} - b\bar{x}$$
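As a check on parts ii. and iii., the following sketch computes the sample correlation coefficient and the least squares coefficients directly from the ten data pairs in the question, using the summary-statistic formulae. It is illustrative only; the variable names are our own.

```python
import math

# Data from the question: annual income (x) and recreation expenditure (y), in $000s.
income = [41.2, 50.1, 52.0, 62.0, 44.5, 37.7, 73.5, 37.5, 56.7, 65.2]
recreation = [2.4, 2.7, 2.8, 8.0, 3.1, 2.1, 12.1, 2.0, 3.9, 8.9]

n = len(income)
x_bar = sum(income) / n
y_bar = sum(recreation) / n

# Corrected sums of squares and cross-products.
s_xy = sum(x * y for x, y in zip(income, recreation)) - n * x_bar * y_bar
s_xx = sum(x * x for x in income) - n * x_bar ** 2
s_yy = sum(y * y for y in recreation) - n * y_bar ** 2

r = s_xy / math.sqrt(s_xx * s_yy)  # sample correlation coefficient
b = s_xy / s_xx                    # slope of the least squares line
a = y_bar - b * x_bar              # intercept of the least squares line

print(f"r = {r:.4f}, b = {b:.3f}, a = {a:.3f}")
```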
(b) The fuel consumption of two different car models (A and B) was compared in
the following way. A random sample of 20 cars of model A and 35 cars of
model B were taken and the fuel consumption (in miles per gallon) was
measured for each car. The results are summarised in the table below.
Sample size Sample mean Sample standard deviation
Car Model A 20 30.9 6.11
Car Model B 35 27.1 6.41
(12 marks)
i. Let µA denote the mean fuel consumption for car model A and µB the mean fuel
consumption for car model B. We test:
H0 : µA = µB vs. H1 : µA > µB .
The test statistic value is 2.150 (using the pooled sample variance of 39.74). (If equal
variances are not assumed the test statistic value is 2.179.) For reference the test
statistic formula is:
$$\frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_p^2 \,(1/n_A + 1/n_B)}} \quad \text{or} \quad \frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_A^2/n_A + s_B^2/n_B}}.$$
For α = 0.05, the critical value is 1.684 (based on the t40 distribution) or 1.671 (based on
the t60 distribution) or 1.645 (based on the standard normal distribution). Decision: we
reject H0 . Choosing a smaller α, say α = 0.01, the critical value is 2.423 (based on the
t40 distribution) or 2.390 (based on the t60 distribution) or 2.326 (based on the standard
normal distribution), so do not reject H0 . Hence the test is moderately significant, i.e.
there is moderate evidence that the mean fuel consumption of model A cars is greater
than that of model B cars.
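Both versions of the test statistic can be verified from the summary table with the sketch below. It is a minimal illustration with variable names of our own choosing; the decision itself still relies on the tabulated critical values quoted above.

```python
import math

# Summary statistics from the table in the question.
n_a, mean_a, sd_a = 20, 30.9, 6.11
n_b, mean_b, sd_b = 35, 27.1, 6.41

# Pooled sample variance (equal variances assumed).
pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)

# Test statistic with the pooled variance.
t_pooled = (mean_a - mean_b) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Test statistic without assuming equal variances.
t_unpooled = (mean_a - mean_b) / math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)

print(f"pooled variance = {pooled_var:.2f}")
print(f"t (pooled) = {t_pooled:.3f}, t (unpooled) = {t_unpooled:.3f}")
```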
Candidates should answer THREE questions: all parts of Section A (50 marks in total) and TWO
questions from Section B (25 marks each). Candidates are strongly advised to divide their
time accordingly.
Section A
Question 1
You are told that the value of the sample mean is x̄ = 24.
or else:
$$102 + x = 120 \Rightarrow x = 18.$$
$$\frac{(24 - 24)^2 + (22 - 24)^2 + (29 - 24)^2 + (18 - 24)^2 + (27 - 24)^2}{4} = 18.5$$
(6 marks)
ii. We have:
$$\sum_{i=3}^{4} \sqrt{x_i y_i} = \sqrt{10 \times 10} + \sqrt{8 \times 8} = 10 + 8 = 18.$$
Note that to be mathematically precise, $\sqrt{100}$ and $\sqrt{64}$ are also equal to −10 and −8,
respectively. For this reason, −18 as a final answer was also accepted as correct.
iii. We have:
$$y_4^2 + \sum_{i=1}^{3} \frac{1}{x_i} = 8^2 + \left(-\frac{1}{4} - \frac{1}{5} + \frac{1}{10}\right) = 63.65.$$
(c) State whether the following are true or false and give a brief explanation. (No
marks will be awarded for a simple true/false answer.)
ii. The probability that a normal random variable is less than two standard
deviations from its mean is 68%.
(2 marks)
ii. False. The probability is approximately 95%. This could be illustrated with a suitable
sketch.
iii. False. In systematic sampling the selection is random and hence eliminates selection bias.
iv. False. The correlation may be spurious. Correlation does not imply causality, so we
cannot be certain that x influences y.
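The figure quoted in answer ii. can be checked numerically. The sketch below, which uses Python's standard library instead of statistical tables and a helper name (`within`) of our own, evaluates the probability that a normal random variable lies within k standard deviations of its mean for k = 1, 2 and 3.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def within(k):
    """P(|Z| < k): probability of lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {within(k):.4f}")
```

The result does not depend on the mean or variance of the distribution, since any normal random variable can be standardised.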
X=x 1 4 7 10
P (X = x) 0.10 0.30 0.40 0.20
(e) The times of marathon runners, participating in the London Marathon, are
normally distributed with mean 3.4 hours and a standard deviation of 0.85
hours.
i. What is the proportion of runners in the London Marathon that finish in
more than 4 hours?
(2 marks)
ii. What is the proportion of runners that finish the London Marathon with
times between 2.75 and 4.75 hours?
(3 marks)
iii. Do you think it is reasonable to assume that the times of marathon runners
follow a normal distribution? Briefly explain your view.
(2 marks)
iii. Any reasonable argument was accepted, for example a discussion of time being a continuous
variable, which supports the use of the normal distribution (which is continuous), and of
whether or not it is reasonable to assume that the distribution of times is symmetric (as the
normal distribution is).
(f ) A supplier dispatches products from one of three warehouses (A, B and C),
where these warehouses account for 20%, 30% and 50% of the supplier’s sales,
respectively. It is known that the percentage of defective items are 7%, 4% and
5%, from warehouses A, B and C, respectively. A customer complains that they
have received a defective item. What is the probability this item came from
warehouse A? Provide your answer to four decimal places.
(6 marks)
$$
\begin{aligned}
P(A \mid D) &= \frac{P(D \mid A)\,P(A)}{P(D \mid A)\,P(A) + P(D \mid B)\,P(B) + P(D \mid C)\,P(C)} \\
&= \frac{0.07 \times 0.20}{0.07 \times 0.20 + 0.04 \times 0.30 + 0.05 \times 0.50} \\
&= 0.2745.
\end{aligned}
$$
Note the use of the total probability formula in the denominator (which represents P (D)).
The answer should be reported to four decimal places, as requested in the question.
Note a correct probability tree is also acceptable.
(g) A random sample is drawn from a normal distribution, N (µ, 25). You are told
that a 99% confidence interval for the population mean is (4.69, 6.31). What
was the size of the sample?
(5 marks)
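The sample mean is the midpoint of the confidence interval:
$$\bar{x} = \frac{4.69 + 6.31}{2} = 5.5.$$
To find the sample size, note that the standard error when estimating a single mean is:
$$\frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{n}}.$$
Since this is a 100(1 − α)% = 99% confidence interval, then α = 0.01, so the confidence
coefficient is $z_{\alpha/2} = z_{0.005} = 2.576$. The half-width of the interval is
(6.31 − 4.69)/2 = 0.81. Therefore, to determine n we need to solve:
$$2.576 \times \frac{5}{\sqrt{n}} = 0.81 \Rightarrow n = 252.85.$$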
Since this is the minimum value of n, and we must have that n is an integer, we round up to
get a sample size of 253.
(h) Explain how you would develop an experimental design to determine the
effectiveness of a new medicine.
(5 marks)
Section B
Answer two out of the three questions from this section (25 marks each).
Question 2
(a) A sample consisting of 400 randomly selected students was classified in terms of
personality type (introvert or extrovert) and in terms of their favourite colour
(out of red, yellow, green or blue). Their responses are summarised in the table
below:
Personality type
Introvert Extrovert Total
Red 32 68 100
Yellow 26 74 100
Green 21 79 100
Blue 46 54 100
Total 125 275 400
i. Based on the data in the table, and without conducting any significance test,
would you say there is an association between the student’s type of
personality and colour preference?
ii. Calculate the χ² statistic and use it to test for independence, using a 5%
significance level. What do you conclude?
(14 marks)
i. There are some differences in rates of introvert students for each colour preference. More
specifically, 21% of the students who prefer the green colour are introvert, whereas the
corresponding proportion for students who prefer red is 32% and for students preferring
blue is 46%. Hence there seems to be an association between personality type and colour
preference, although this needs to be investigated further.
ii. Set out the null hypothesis that there is no association between personality type and
colour preference against the alternative, that there is an association. Be careful
to get these the correct way round! We test:
H0 : No association between personality type and colour preference
vs.
H1 : Association between personality type and colour preference.
Work out the expected values to obtain the table below:
         Introvert   Extrovert
Red      31.25       68.75
Yellow   31.25       68.75
Green    31.25       68.75
Blue     31.25       68.75
The test statistic formula is:
$$\sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$$
which gives a value of 16.33. This is a 4 × 2 contingency table so the degrees of freedom
are (4 − 1) × (2 − 1) = 3. For α = 0.05, the critical value is 7.815, hence we reject H0 . We
conclude that there is moderate evidence of an association between personality type and
colour preference.
(b) i. Describe how quota sampling is performed and explain how it differs from
stratified random sampling.
ii. A company producing handheld electronic devices (tablets, mobile phones
etc.) wants to understand how men and women rate its products. For this
reason, the company’s management has decided to use a survey of its
customers and has asked you to devise an appropriate random sampling
scheme. Outline the key components of your sampling scheme.
(11 marks)
Question 3
(a) A camera recorded the speed of 30 cars on a road with a 30 miles per hour
speed limit. The recorded data are shown below:
i. Carefully construct, draw and label a histogram of these data. The histogram
can be drawn on ordinary paper – no graph paper needed. You should draw
by hand; do not use a computer.
ii. Find the median speed among these cars and the upper quartile. What
percentage of drivers were exceeding the 30 miles per hour speed limit?
iii. Comment on the data given the shape of the histogram without doing any
further calculations.
iv. Name two other types of graphical displays that would be suitable to
represent the data.
(12 marks)
i. A histogram compatible with what the examiners were expecting to see is shown below.
This histogram is based on the following class intervals and frequency densities:
Class interval   Interval width   Frequency   Frequency density
[25.0, 27.0)     2                7           3.5
[27.0, 29.0)     2                11          5.5
[29.0, 31.0)     2                10          5.0
[31.0, 33.0)     2                0           0.0
[33.0, 35.0)     2                0           0.0
[35.0, 37.0)     2                2           1.0
Marks were awarded for including the title, labelling correctly and accurately drawing
the figure.
ii. The median is the midpoint of the ordered observations, which is 28.65 miles per hour.
Q3 = 29.45 miles per hour. Note that units for at least one of median or Q3 are required.
The percentage is 5/30 = 16.67% (17% is also acceptable).
iii. The histogram is positively (right) skewed. There are two cars (with recorded speeds of
36.2 and 36.9 miles per hour) which may be regarded as outliers.
iv. Any two of boxplot, dot plot and stem-and-leaf diagram.
$$\frac{\bar{X}_d}{S_d / \sqrt{n}} \sim t_{n-1} = t_8$$
and the test statistic value is 2.309. For α = 0.05, the critical value is $t_{8,\,0.05} = 1.860$,
hence we reject H0 since 1.860 < 2.309. For α = 0.01, the critical value is $t_{8,\,0.01} = 2.896$,
hence we do not reject H0 since 2.309 < 2.896. We conclude that the test is moderately
significant, i.e. there is moderate evidence that the special IQ training is effective.
ii. The assumptions are that:
• differences are normally distributed
• pairs of observations are independent.
iii. The confidence coefficient for a 90% confidence interval is $t_{8,\,0.05} = 1.860$, hence a 90%
confidence interval is:
$$2.0 \pm 1.860 \times \frac{2.598}{\sqrt{9}} \Rightarrow (0.389, 3.611).$$
Question 4
(a) An insurance company wants to relate the amount of fire damage y, measured
in $000s, in major residential fires to the distance between the residence and the
nearest fire station x, measured in miles. For this reason, a study was conducted
in a large suburb of a major city based on a sample of 10 recent fires in this
suburb. For each of these fires, the variables x and y were recorded and are
shown in the table below:
Fire #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
x 3.4 1.8 4.6 2.3 3.1 5.5 0.7 3.0 2.6 4.3
y 2.6 1.8 5.9 2.3 2.8 8.6 1.4 2.3 2.0 5.7
(a) i. Draw a scatter diagram of these data. Label the diagram carefully. (The
scatter diagram can be drawn on ordinary paper – no graph paper needed.
You should draw by hand; do not use a computer.)
ii. Calculate the sample correlation coefficient. Interpret your findings.
iii. Calculate the least squares line of y on x and draw the line on the scatter
diagram.
iv. Do you find the analyses in parts ii. and iii. appropriate? Justify your
answer and suggest any alternative ways to model the relationship
between x and y.
(13 marks)
ii. The summary statistics can be substituted into the formula for the sample correlation
coefficient (make sure you know which one it is!) to obtain the value 0.9093. An
interpretation of this value is the following: The data suggest that the greater the
distance of the residence from the nearest fire station, the higher the amount of fire
damage. The fact that the value is close to 1 suggests that this is a strong, linear,
positive correlation.
Many candidates did not mention all three words (strong, linear, positive). Note that
all of these words provide useful information on interpreting the association and are
therefore required.
iii. The regression line can be written by the equation $\hat{y} = a + bx$ or $y = a + bx + \varepsilon$. The
formula for b is:
$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}$$
and by substituting the summary statistics we get b = 1.526. The formula for a is:
$$a = \bar{y} - b\bar{x}$$
(12 marks)
i. Let µA denote the mean examination mark for class group 1 and µB the mean
examination mark for class group 2. We test:
H0 : µA = µB vs. H1 : µA > µB .
The test statistic value is 2.051 (using the pooled sample variance of 50.06). (If equal
variances are not assumed the test statistic value is 2.085.) For reference the test
statistic formula is:
$$\frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_p^2 \,(1/n_A + 1/n_B)}} \quad \text{or} \quad \frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_A^2/n_A + s_B^2/n_B}}.$$
For α = 0.05, the critical value is 1.684 (based on the t40 distribution) or 1.671 (based on
the t60 distribution) or 1.645 (based on the standard normal distribution). Decision: we
reject H0 . Choosing a smaller α, say α = 0.01, the critical value is 2.423 (based on the
t40 distribution) or 2.390 (based on the t60 distribution) or 2.326 (based on the standard
normal distribution), so do not reject H0 . Hence the test is moderately significant, i.e.
there is moderate evidence that the mean examination mark for class group 1 is greater
than that of class group 2.