ST104a Commentary 2021 PDF
Important note
This commentary reflects the examination and assessment arrangements for this course in the
academic year 2020–21. The format and structure of the examination may change in future years,
and any such changes will be publicised on the virtual learning environment (VLE).
Unless otherwise stated, all cross-references will be to the latest version of the subject guide (2019).
You should always attempt to use the most recent edition of any Essential reading textbook, even if
the commentary and/or online reading list and/or subject guide refer to an earlier edition. If
different editions of Essential reading are listed, please check the VLE for reading supplements – if
none are available, please use the contents list and index of the new edition to find the relevant
section.
General remarks
Learning outcomes
At the end of the half course and having completed the Essential reading and activities you should:
∗ be familiar with the key ideas of statistics that are accessible to a candidate with a
moderate mathematical competence
∗ be able to routinely apply a variety of methods for explaining, summarising and presenting
data and interpreting results clearly using appropriate diagrams, titles and labels when
required
∗ be able to summarise the ideas of randomness and variability, and the way in which these
link to probability theory to allow the systematic and logical collection of statistical
techniques of great practical importance in many applied areas
∗ have a grounding in probability theory and some grasp of the most common statistical
methods
∗ be able to perform inference to test the significance of common measures such as means and
proportions and conduct chi-squared tests of contingency tables
∗ be able to use simple linear regression and correlation analysis and know when it is
appropriate to do so.
You have two hours to complete this paper, which is in two parts. The first part, Section A, is
compulsory; it covers several subquestions and accounts for 50 per cent of the total marks.
ST104a Statistics 1
Section B contains three questions, each worth 25 per cent, from which you are asked to choose two.
Remember that each of the Section B questions is likely to cover more than one topic. In 2021, for
example, the first part of Question 2 related to correlation and linear regression while the second
part covered statistical inference related to means. In Question 3, the first part covered data
visualisation and descriptive statistics while the second part related to statistical inference related to
proportions. Finally, in Question 4, the first part required contingency tables while the second part
related to aspects of sampling design. This means that it is really important that you make sure you
have a reasonable idea of what topics are covered before you start work on the paper! We suggest
you divide your time as follows during the examination:
∗ Spend the first 10 minutes annotating the paper. Note the topics covered in each question
and subquestion.
∗ Allow yourself 45 minutes for Section A. Do not allow yourself to get stuck on any one
question, but do not just give up after two minutes!
∗ Once you have chosen your two Section B questions, give them about 25 minutes each.
∗ This leaves you with 15 minutes. Do not leave the examination hall at this point! Check
over any questions you may not have completely finished. Make sure you have labelled and
given a title to any tables or diagrams which were required and, if you did more than the
two questions required in Section B, decide which one to delete. Remember that only two of
your answers will be given credit in Section B and that you must choose which these are!
The examiners are looking for very simple demonstrations from you. They want to be sure that you:
∗ have covered the syllabus as described and explained in the subject guide
∗ know the basic formulae given there and when and how to use them
∗ understand and answer the questions set.
You are not expected to write long essays where explanations or descriptions of sampling design
are required, and note-form answers are acceptable. However, clear and accurate language, both
mathematical and written, is expected and marked. The explanations below and in the specific
commentaries for the papers for each zone should make these requirements clear.
The most important thing you can do is answer the question set! This may sound very simple, but
these are some of the things that candidates did not do, though asked, in the 2019 examinations!
Remember the following.
∗ If you are asked to label a diagram (which is almost always the case!), please do so. Writing
‘Histogram’, ‘Stem-and-leaf diagram’, ‘Boxplot’ or ‘Scatter diagram’ in itself is insufficient.
What do the data describe? What are the units? What are the x-axis and y-axis?
∗ If you are specifically asked to perform a hypothesis test, or calculate a confidence interval,
do so. It is not acceptable to do one rather than the other! If you are asked to use a 5%
significance level, this is what will be marked.
∗ Do not waste time calculating things which are not required by the examiners. If you are
asked to find the line of best fit, you will get no extra marks for calculating the correlation
coefficient as well. If you are asked to use the confidence interval you have just calculated to
comment on the results, carrying out an additional hypothesis test will not gain you marks.
∗ When performing calculations try to use as many decimal places as possible in intermediate
steps to reach the most accurate solution. It is advised to have at least two decimal places
in general and at least three decimal places when calculating probabilities.
Examiners’ commentaries 2021
How should you use the specific comments on each question given in the
Examiners’ commentaries?
We hope that you find these useful. For each question and subquestion, they give:
∗ further guidance for each question on the points made in the last section
∗ the answers, or keys to the answers, which the examiners were looking for
∗ the relevant detailed reference to Newbold, P., W.L. Carlson and B.M. Thorne Statistics for
Business and Economics. (London: Prentice–Hall, 2012) eighth edition [ISBN
9780273767060] and the subject guide
∗ where appropriate, suggested activities from the subject guide which should help you to
prepare, and similar questions from Newbold et al. (2012).
Any further references you might need are given in the part of the subject guide to which you are
referred for each answer.
It was noted recently that a small number of candidates appeared to be memorising answers from
previous years’ Examiners’ commentaries, for example plots, and produced the exact same image of
them without looking at the current year’s examination paper questions! Note that this is very easy
to spot. The Examiners’ commentaries should be used as a guide to practise on sample examination
questions and it is pointless to attempt to memorise them.
Many candidates are disappointed to find that their examination performance is poorer than they
expected. This may be due to a number of reasons, but one particular failing is ‘question
spotting’, that is, confining your examination preparation to a few questions and/or topics which
have come up in past papers for the course. This can have serious consequences.
We recognise that candidates might not cover all topics in the syllabus in the same depth, but you
need to be aware that examiners are free to set questions on any aspect of the syllabus. This
means that you need to study enough of the syllabus to enable you to answer the required number of
examination questions.
The syllabus can be found in the Course information sheet available on the VLE. You should read
the syllabus carefully and ensure that you cover sufficient material in preparation for the
examination. Examiners will vary the topics and questions from year to year and may well set
questions that have not appeared in past papers. Examination papers may legitimately include
questions on any topic in the syllabus. So, although past papers can be helpful during your revision,
you cannot assume that topics or specific questions that have come up in past examinations will
occur again.
If you rely on a question-spotting strategy, it is likely you will find yourself in difficulties
when you sit the examination. We strongly advise you not to adopt this strategy.
Candidates should answer THREE questions: all parts of Section A (50 marks in total) and TWO
questions from Section B (25 marks each). Candidates are strongly advised to divide their
time accordingly.
Section A
Question 1
i. $\sum_{i=2}^{3} \sqrt{x_i}$
ii. $\left( \sum_{i=1}^{3} x_i y_i \right)^2$
iii. $|x_1| + \sum_{i=2}^{3} \frac{y_i^3}{y_i^2}$.
(6 marks)
i. We have:
\[ \sum_{i=2}^{3} \sqrt{x_i} = \sqrt{9} + \sqrt{16} = 3 + 4 = 7. \]
Note that 9 and 16 also have the negative square roots −3 and −4, respectively. For this
reason, −7 as a final answer was also accepted as correct.
ii. We have:
\[ \left( \sum_{i=1}^{3} x_i y_i \right)^2 = \left( (-3 \times -2) + (9 \times 1) + (16 \times 0.5) \right)^2 = (23)^2 = 529. \]
iii. We have:
\[ |x_1| + \sum_{i=2}^{3} \frac{y_i^3}{y_i^2} = |x_1| + \sum_{i=2}^{3} y_i = |-3| + 1 + 0.5 = 4.5. \]
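The three sums can be checked numerically. The sketch below assumes the data values x = (−3, 9, 16) and y = (−2, 1, 0.5); these are inferred from the worked solutions above rather than quoted from the question, and the variable names are illustrative only.

```python
import math

# Data values inferred from the worked solutions (an assumption, not quoted data).
x = [-3, 9, 16]
y = [-2, 1, 0.5]

# i. Sum of the (principal) square roots of x_2 and x_3.
part_i = sum(math.sqrt(v) for v in x[1:])

# ii. Square of the sum of the products x_i * y_i over i = 1, 2, 3.
part_ii = sum(xi * yi for xi, yi in zip(x, y)) ** 2

# iii. |x_1| plus the sum of y_i^3 / y_i^2 over i = 2, 3.
part_iii = abs(x[0]) + sum(v**3 / v**2 for v in y[1:])

print(part_i, part_ii, part_iii)
```

Note that Python lists are indexed from 0, so `x[1:]` picks out the second and third values.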
(b) Classify each one of the following variables as either measurable (continuous) or
categorical. If a variable is categorical, further classify it as either nominal or
ordinal. Justify your answer. (No marks will be awarded without a justification.)
i. Age brackets of 18–30, 31–50, 51–70, 70+.
ii. Passport number.
iii. A country’s inflation rate.
(6 marks)
i. Categorical, ordinal. Age brackets are in a ranked order, with those 18–30 being younger
than those 31–50 etc.
ii. Categorical, nominal. Although numeric, passport numbers are for identification only.
iii. Measurable. Inflation rates can be measured in percentages to several decimal places.
Weak candidates did not provide a justification for their choices, classified a measurable
variable as nominal or categorical, and sometimes answered ordinal when their justification
pointed towards a nominal variable. Phrases such as ‘It is measurable because it can be
measured’ were not awarded any marks.
(c) State whether the following are true or false and give a brief explanation. (No
marks will be awarded for a simple true/false answer.)
i. For a set of observations $x_1, x_2, \ldots, x_n$, with mean $\bar{x}$, then:
\[ \sum_{i=1}^{n} (x_i - \bar{x}) > 0. \]
ii. For two independent events $A$ and $B$ such that $P(A) > 0$ and $P(B) > 0$,
then:
\[ P(A \cup B) < P(A) + P(B). \]
ii. True. Since $A$ and $B$ are independent, with $P(A) > 0$ and $P(B) > 0$, then
$P(A \cap B) = P(A)\,P(B) > 0$ and:
\[
\begin{aligned}
P(A \cup B) &= P(A) + P(B) - P(A \cap B) \\
&= P(A) + P(B) - P(A)\,P(B) \\
&< P(A) + P(B).
\end{aligned}
\]
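A small numerical illustration of this inequality may help. The probabilities 0.3 and 0.5 below are hypothetical values chosen for the sketch, and `union_probability` is an illustrative helper, not part of the question.

```python
def union_probability(p_a: float, p_b: float) -> float:
    """P(A or B) for independent events A and B (addition rule with P(A and B) = P(A)P(B))."""
    return p_a + p_b - p_a * p_b


# Hypothetical probabilities, both strictly positive.
p_a, p_b = 0.3, 0.5
p_union = union_probability(p_a, p_b)

# The union probability falls short of P(A) + P(B) by exactly P(A)P(B).
print(p_union, p_union < p_a + p_b)
```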
iv. False. Rejection of a true null hypothesis is known as a Type I error. Or, power is the
probability of rejecting a false null hypothesis.
v. False. With (4 − 1)(2 − 1) = 3 degrees of freedom, $\chi^2_{0.05,\,3} = 7.815 > 6.724$, hence it is not
statistically significant at the 5% significance level.
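The table look-up in v. can be cross-checked without statistical tables. The sketch below relies on a closed form for the chi-squared distribution function that holds for 3 degrees of freedom only; `chi2_sf_3df` is an illustrative name.

```python
import math


def chi2_sf_3df(x: float) -> float:
    """Upper-tail probability P(X > x) for a chi-squared variable with 3 degrees of freedom.

    Uses the closed form valid for 3 df only:
    P(X <= x) = erf(sqrt(x / 2)) - sqrt(2x / pi) * exp(-x / 2).
    """
    cdf = math.erf(math.sqrt(x / 2)) - math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    return 1 - cdf


p_at_critical = chi2_sf_3df(7.815)  # tail area at the tabulated 5% critical value
p_observed = chi2_sf_3df(6.724)     # tail area at the test statistic value in v.

print(round(p_at_critical, 4), round(p_observed, 4))
```

The tail area at 6.724 exceeds 0.05, which matches the conclusion that the result is not significant at the 5% level.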
where $Z \sim N(0, 1)$. Since $P(Z > 0.84) = P(Z < -0.84) \approx 0.20$, we have:
\[ -0.84 = -\frac{4}{\sigma} \quad\Rightarrow\quad \sigma = 4.76 \]
so, approximately, $\mathrm{Var}(X) = (4.76)^2 = 22.66$.
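As a rough check, the same equation can be solved with the exact normal quantile instead of the rounded table value 0.84. This sketch uses Python's standard-library `statistics.NormalDist`; the two variances differ slightly only because of rounding in the table value.

```python
from statistics import NormalDist

# Quantile z with P(Z < z) = 0.20 for a standard normal Z.
z = NormalDist().inv_cdf(0.20)

# Solve z = -4 / sigma for sigma, as in the equation above.
sigma_exact = -4 / z
sigma_table = 4 / 0.84  # using the table value rounded to two decimal places

print(round(z, 4), round(sigma_exact**2, 2), round(sigma_table**2, 2))
```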
ii. We have:
\[ E(X^2) = \sum_{x=-2}^{2} x^2 \, p(x) \]
(f ) Based on the central limit theorem, you are told that a 90% confidence interval
for a population proportion is (0.7077, 0.7723).
i. What was the sample proportion which resulted in this confidence interval?
(2 marks)
ii. What was the size of the sample used?
(4 marks)
ii. To find the sample size, note that the (estimated) standard error when estimating a
single proportion is:
\[ \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.74 \times 0.26}{n}} = \frac{\sqrt{0.74 \times 0.26}}{\sqrt{n}} = \frac{0.4386}{\sqrt{n}}. \]
Since this is a 100(1 − α)% = 90% confidence interval, then α = 0.10, so the confidence
coefficient is $z_{\alpha/2} = z_{0.05} = 1.645$. Therefore, to determine n we need to solve:
\[ 1.645 \times \frac{0.4386}{\sqrt{n}} = 0.0323. \]
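The steps for recovering the sample proportion and sample size from the interval endpoints can be sketched as follows. The variable names are illustrative; because the published endpoints are rounded, the computed n is not an exact integer and comes out close to 500.

```python
import math
from statistics import NormalDist

lower, upper = 0.7077, 0.7723  # interval endpoints from the question
confidence = 0.90

p_hat = (lower + upper) / 2   # the sample proportion is the centre of the interval
margin = (upper - lower) / 2  # half-width = z * standard error
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # approximately 1.645

# Solve z * sqrt(p(1 - p) / n) = margin for n.
n = (z * math.sqrt(p_hat * (1 - p_hat)) / margin) ** 2

print(round(p_hat, 2), round(margin, 4), round(n))
```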
(g) It is assumed that investors are equally split between those who prefer ‘growth’
stocks and those who prefer ‘value’ stocks. In a random sample of 200 investors,
105 agreed with the statement ‘Growth stocks are better than value stocks’.
For α = 0.05, the critical values are $\pm z_{0.025} = \pm 1.96$. Since 0.7071 < 1.96 we do not
reject H0 , hence there is no evidence that π ≠ 0.50.
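The test statistic value 0.7071 can be reproduced from the figures in the question. This is a minimal sketch of a two-sided z-test for a single proportion; note that the standard error uses the hypothesised value 0.50, not the sample proportion.

```python
import math

n, successes = 200, 105  # sample size and number agreeing, from the question
pi_0 = 0.50              # hypothesised proportion under H0

p_hat = successes / n
standard_error = math.sqrt(pi_0 * (1 - pi_0) / n)  # evaluated under H0
z = (p_hat - pi_0) / standard_error

critical = 1.96  # two-sided critical value at the 5% significance level
reject_h0 = abs(z) > critical

print(round(z, 4), reject_h0)
```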
Section B
Answer two out of the three questions from this section (25 marks each).
Question 2
(a) The manager of a store selling shoes is looking into the association between
daily sales (in hundreds of $) in the store, y, and the number of customers who
visited the store in that day, x. For this reason, in 10 days selected at random
the variables x and y were recorded. They appear in the table below:
Days                 #1    #2    #3   #4   #5   #6    #7   #8   #9   #10
# of customers (x)   90    92    50   74   78   88    87   51   53   42
Sales (y)            11.2  11.1  6.8  9.2  9.4  10.1  9.4  7.7  8.2  6.1
iii. Calculate the least squares line of y on x and draw the line on the scatter
diagram.
iv. Suppose that you observe more data and when you draw the corresponding
scatter diagram a non-linear association is revealed. Discuss how this can be
interpreted in the context of the problem.
(13 marks)
ii. The summary statistics can be substituted into the formula for the sample correlation
coefficient (make sure you know which one it is!) to obtain the value 0.9502. An
interpretation of this value is the following: The data suggest that the higher the number
of customers, the higher the daily sales. The fact that the value is close to 1 suggests
that this is a strong, linear, positive association.
Many candidates did not mention all three words (strong, linear, positive). Note that all
of these words provide useful information on interpreting the association and are
therefore required.
iii. The regression line can be written as $\hat{y} = a + bx$ or $y = a + bx + \varepsilon$. The
formula for $b$ is:
\[ b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} \]
and by substituting the summary statistics we get $b = 0.0835$. The formula for $a$ is
$a = \bar{y} - b\bar{x}$, so we get $a = 3.0314$. Hence the regression line can be written as:
\[ \hat{y} = 3.0314 + 0.0835x. \]
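The least squares coefficients can be reproduced directly from the data table in the question. The sketch below applies the formulae for b and a given above; the intermediate names `s_xy` and `s_xx` are illustrative.

```python
x = [90, 92, 50, 74, 78, 88, 87, 51, 53, 42]               # customers per day
y = [11.2, 11.1, 6.8, 9.2, 9.4, 10.1, 9.4, 7.7, 8.2, 6.1]  # sales (hundreds of $)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope: (sum of x*y - n*x_bar*y_bar) / (sum of x^2 - n*x_bar^2).
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
s_xx = sum(xi**2 for xi in x) - n * x_bar**2
b = s_xy / s_xx

# Intercept: a = y_bar - b * x_bar.
a = y_bar - b * x_bar

print(round(a, 4), round(b, 4))
```

Carrying full precision in b before computing a follows the advice given earlier about keeping decimal places in intermediate steps.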
(b) A study focused on the perception of job satisfaction that may vary between
women and men. For this reason, at random 15 women and 13 men took a job
satisfaction questionnaire that gave a score for each one of them (high values of
the score indicate higher job satisfaction). Summaries of these scores are
presented below.
\[ H_0: \mu_A = \mu_B \quad \text{vs.} \quad H_1: \mu_A \neq \mu_B. \]
The test statistic value is 2.2979 (using the pooled sample variance of 17.0923). (If equal
variances are not assumed the test statistic value is 2.2778.) For reference the test
statistic formula is:
\[ \frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_p^2 \,(1/n_A + 1/n_B)}} \quad \text{or} \quad \frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_A^2/n_A + s_B^2/n_B}}. \]
For α = 0.05, the critical values are ±2.056 (the t26 distribution is used). Decision: we
reject H0 . Choosing a smaller α, say α = 0.01, the critical values are ±2.779, hence we
do not reject H0 (again, the t26 distribution is used). Hence there is moderate evidence
of a difference in the mean scores of job satisfaction between men and women.
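The pooled-variance calculation and the two decisions can be sketched as below. The sample sizes, pooled variance and critical values are taken from the text; the difference in sample means (3.6) is NOT given in the text and is back-calculated from the reported test statistic, so treat it as an assumption. `pooled_t` is an illustrative helper.

```python
import math


def pooled_t(mean_diff: float, pooled_var: float, n_a: int, n_b: int) -> float:
    """Two-sample t statistic under the equal-variance assumption."""
    return mean_diff / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))


# mean_diff = 3.6 is an assumption (back-calculated, not quoted from the question).
t_stat = pooled_t(mean_diff=3.6, pooled_var=17.0923, n_a=15, n_b=13)

crit_5, crit_1 = 2.056, 2.779  # t critical values on 26 df quoted in the text
reject_5 = abs(t_stat) > crit_5
reject_1 = abs(t_stat) > crit_1

print(round(t_stat, 4), reject_5, reject_1)
```

Rejecting at the 5% level but not at the 1% level is what the commentary describes as moderate evidence.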
ii. The assumptions made in i. concerned:
∗ equal variances
∗ whether nA + nB is ‘large’, so that the normality assumption is satisfied
∗ independent samples.
Some candidates stated assumptions in this part that were not made in part (i). Marks
were not awarded in such cases. Also, some other candidates just copied the phrase
‘assumption about equal variances’ and naturally were not awarded any marks. One
should state whether the calculations were based on the assumption that unknown
variances are equal or unequal.
iii. An example related to confounding (association versus causation) is expected, such as
‘if the male participants of the study were working night shifts, this could explain their
lower job satisfaction’.
Question 3
(a) Thirty people were asked about the number of hours they exercise in a week
and their answers were recorded and listed below.
Marks were awarded for including the title, labelling correctly and accurately drawing
the figure.
ii. The requested results are below and are simple to compute.
∗ Mean: 9.2333 hours of exercise per week.
∗ Median: 8.5 hours of exercise per week.
∗ Modal group: 5–10 hours of exercise per week.
Make sure to state the units of measurement. Also, avoid the use of grouped data formulae as
they are approximate.
iii. The distribution of the data appears to be positively/right skewed. This is also
supported by the fact that the mean is larger than the median.
iv. Some graphs are listed below, although there are more.
∗ Boxplot
∗ Stem-and-leaf diagram.
∗ Dot plot.
ii. Let π1 denote the vitamin C group proportion and π2 the placebo group proportion.
Also, denote by p the overall sample proportion of those who got a cold. Marks were
awarded for the following points. We test:
\[ H_0: \pi_1 = \pi_2 \quad \text{vs.} \quad H_1: \pi_1 < \pi_2. \]
The test statistic $(p_1 - p_2)/\mathrm{s.e.}(p_1 - p_2)$ follows a standard normal distribution,
approximately. The calculation of the standard error is:
\[ \mathrm{s.e.}(p_1 - p_2) = \sqrt{p(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = 0.0452. \]
The test statistic value is −2.2145. For α = 0.05, the critical value is −1.645, hence we
reject H0 at the 5% significance level. The probability of getting a cold is lower for the
vitamin C group (can be mentioned in iv.).
iii. Below are a couple of assumptions made:
∗ Sample sizes are large to justify the normality assumption.
∗ Independent samples.
iv. Some brief discussion is expected including the following points:
∗ The evidence from both analyses points in that direction.
∗ One has to be cautious as the above implies association, not causation.
Question 4
(a) A mental health study focused on 300 patients visiting three community mental
health centres. The patients were classified into three groups according to the
primary issue for which they were seen. The data are shown below.
                       Type of Problem
           Social Adjustment   Stress Related   Other   Total
Centre 1          45                 28           27     100
Centre 2          28                 44           28     100
Centre 3          46                 29           25     100
Total            119                101           80     300
i. Based on the data in the table, and without conducting a significance test,
describe the differences in terms of the primary issue for which the patients
were seen across the different centres.
ii. Calculate the χ2 statistic and use it to test for independence, using a 5%
significance level. What do you conclude?
(13 marks)
ii. Set out the null hypothesis that there is no association between community mental
health centre and type of problem against the alternative, that there is an association.
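Be careful to get these the correct way round! We test:
H0 : No association between community mental health centre and type of problem
vs.
H1 : Association between community mental health centre and type of problem.
The test statistic is:
\[ \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
which gives a value of 10.107. This is a 3 × 3 contingency table so the degrees of freedom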
are (3 − 1) × (3 − 1) = 4. For α = 0.05, the critical value is 9.488, hence we reject H0 .
For α = 0.01, the critical value is 13.277, hence we do not reject H0 . We conclude that
there is moderate evidence of an association between community mental health centre
and type of problem.
Many candidates looked up the tables incorrectly and so failed to follow through their
earlier accurate work.
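The χ2 calculation for the table in part (a) can be sketched as follows. The observed counts are from the question; the expected counts follow the usual rule (row total × column total ÷ grand total). The p-value line uses a closed form that holds for 4 degrees of freedom only.

```python
import math

observed = [
    [45, 28, 27],  # Centre 1
    [28, 44, 28],  # Centre 2
    [46, 29, 25],  # Centre 3
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Sum (O - E)^2 / E over all cells, with E the expected count under independence.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / total
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail probability; this closed form is valid for 4 degrees of freedom only.
p_value = math.exp(-chi2 / 2) * (1 + chi2 / 2)

print(round(chi2, 3), df, round(p_value, 4))
```

A p-value between 0.01 and 0.05 corresponds to rejecting H0 at the 5% level but not at the 1% level.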
(b) i. You have been asked to design a nationwide survey in your country to find
out about internet use by children less than 10 years old. Provide a
probability sampling scheme and a sampling frame that you would like to
use. Identify a potential source of selection bias that may occur and discuss
how this issue could be addressed.
ii. Describe what a longitudinal survey is. State two ways in which panel
surveys differ from longitudinal surveys.
(12 marks)
Candidates should answer THREE questions: all parts of Section A (50 marks in total) and TWO
questions from Section B (25 marks each). Candidates are strongly advised to divide their
time accordingly.
Section A
Question 1
i. $\sum_{i=2}^{3} \sqrt{x_i}$
ii. $\left( \sum_{i=1}^{3} x_i y_i \right)^2$
iii. $|y_1| + \sum_{i=2}^{3} \frac{y_i^3}{y_i^2}$.
(6 marks)
i. We have:
\[ \sum_{i=2}^{3} \sqrt{x_i} = \sqrt{36} + \sqrt{4} = 6 + 2 = 8. \]
Note that 36 and 4 also have the negative square roots −6 and −2, respectively. For this
reason, −8 as a final answer was also accepted as correct.
ii. We have:
\[ \left( \sum_{i=1}^{3} x_i y_i \right)^2 = \left( (-5 \times -7) + (36 \times 0.5) + (4 \times 1) \right)^2 = (57)^2 = 3{,}249. \]
iii. We have:
\[ |y_1| + \sum_{i=2}^{3} \frac{y_i^3}{y_i^2} = |y_1| + \sum_{i=2}^{3} y_i = |-7| + 0.5 + 1 = 8.5. \]
(b) Classify each one of the following variables as either measurable (continuous) or
categorical. If a variable is categorical, further classify it as either nominal or
ordinal. Justify your answer. (No marks will be awarded without a justification.)
i. Exchange rate between two currencies.
ii. Candidate number in an examination.
iii. University degree type (in terms of Bachelors, Masters, Ph.D.).
(6 marks)
i. Measurable. Exchange rate is price of one currency in terms of another, and can be
measured to several decimal places.
ii. Categorical, nominal. Candidate numbers are used for identification purposes only.
iii. Categorical, ordinal. A Ph.D. ranks higher than a Masters, which in turn ranks higher
than a Bachelors.
Weak candidates did not provide a justification for their choices, classified a measurable
variable as nominal or categorical, and sometimes answered ordinal when their justification
pointed towards a nominal variable. Phrases such as ‘It is measurable because it can be
measured’ were not awarded any marks.
(c) State whether the following are true or false and give a brief explanation. (No
marks will be awarded for a simple true/false answer.)
i. For a set of observations $x_1, x_2, \ldots, x_n$, with mean $\bar{x}$, then:
\[ \sum_{i=1}^{n} (x_i - \bar{x}) < 0. \]
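ii. We have:
\[ E(X) = \sum_{x} x \, p(x) \]
and:
\[
\begin{aligned}
E(X^2) &= \sum_{x} x^2 \, p(x) \\
&= (10)^2 \times 0.05 + (20)^2 \times 0.15 + (30)^2 \times 0.60 + (40)^2 \times 0.15 + (50)^2 \times 0.05 \\
&= 970
\end{aligned}
\]
hence:
\[ \mathrm{Var}(X) = E(X^2) - (E(X))^2 = 970 - (30)^2 = 70. \]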
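The moments above can be verified with a few lines of code. The probability mass function below is read off the E(X²) calculation (values 10 to 50 with the stated probabilities); the dictionary name `pmf` is illustrative.

```python
# Probability mass function read off the calculation above: value -> probability.
pmf = {10: 0.05, 20: 0.15, 30: 0.60, 40: 0.15, 50: 0.05}

mean = sum(x * p for x, p in pmf.items())              # E(X)
second_moment = sum(x**2 * p for x, p in pmf.items())  # E(X^2)
variance = second_moment - mean**2                     # Var(X) = E(X^2) - (E(X))^2

print(round(mean, 4), round(second_moment, 4), round(variance, 4))
```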
iv. X is discrete, but a normal random variable is continuous, hence X does not have a
normal distribution. Alternatively, an accurate mass function plot vs. a normal curve
could be provided.
(f ) Based on the central limit theorem, you are told that a 99% confidence interval
for a population proportion is (0.5782, 0.7018).
i. What was the sample proportion which resulted in this confidence interval?
(2 marks)
ii. What was the size of the sample used?
(4 marks)
i. The sample proportion, p, would be in the centre of the interval (0.5782, 0.7018). Adding
the two endpoints and dividing by 2 gives:
\[ p = \frac{0.5782 + 0.7018}{2} = 0.64. \]
ii. To find the sample size, note that the (estimated) standard error when estimating a
single proportion is:
\[ \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.64 \times 0.36}{n}} = \frac{\sqrt{0.64 \times 0.36}}{\sqrt{n}} = \frac{0.48}{\sqrt{n}}. \]
Since this is a 100(1 − α)% = 99% confidence interval, then α = 0.01, so the confidence
coefficient is $z_{\alpha/2} = z_{0.005} = 2.576$. Therefore, to determine n we need to solve:
\[ 2.576 \times \frac{0.48}{\sqrt{n}} = 0.0618. \]
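Rearranging this equation for n gives the remaining arithmetic, sketched here with the rounded values quoted above (so the result is approximate):

```latex
\[
\sqrt{n} = \frac{2.576 \times 0.48}{0.0618} \approx 20.01
\quad\Rightarrow\quad
n \approx 400.
\]
```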
(g) It is assumed that investors are equally split between those who prefer ‘growth’
stocks and those who prefer ‘value’ stocks. In a random sample of 500 investors,
255 agreed with the statement ‘Growth stocks are better than value stocks’.
ii. Calculate the p-value of the test statistic value calculated in part i.
(2 marks)
For α = 0.10, the critical values are $\pm z_{0.05} = \pm 1.645$. Since 0.4472 < 1.645 we do not
reject H0 , hence there is no evidence that π ≠ 0.50.
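Part ii. asks for the p-value of this test statistic. A minimal sketch, assuming the two-sided alternative used above and using the standard-library `statistics.NormalDist` for the normal distribution function:

```python
import math
from statistics import NormalDist

n, successes = 500, 255  # sample size and number agreeing, from the question
pi_0 = 0.50              # hypothesised proportion under H0

z = (successes / n - pi_0) / math.sqrt(pi_0 * (1 - pi_0) / n)

# Two-sided p-value: probability of a value at least as extreme as |z| in either tail.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(z, 4), round(p_value, 4))
```

A p-value well above 0.10 agrees with the decision not to reject H0 at the 10% significance level.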
Section B
Answer two out of the three questions from this section (25 marks each).
Question 2
(a) A study is made for a particular allergy medication in order to determine the
length of relief it provides y (in hours) in relation to the dosage of medication x
(in mg). For this reason, ten patients were given different doses of the
medication and were asked to report back when the medication seemed to wear
off.
Patient            #1   #2   #3    #4   #5    #6    #7    #8    #9    #10
Dosage (x)         3    3.5  4     5    6     6.5   7     8     8.5   9
Relief hours (y)   9.1  5.5  12.3  9.2  14.2  16.8  22.0  18.3  24.5  22.7
iii. Calculate the least squares line of y on x and draw the line on the scatter
diagram.
iv. Suppose that you observe more data and when you draw the corresponding
scatter diagram a non-linear association is revealed. Discuss how this can be
interpreted in the context of the problem.
(13 marks)
ii. The summary statistics can be substituted into the formula for the sample correlation
coefficient (make sure you know which one it is!) to obtain the value 0.9180. An
interpretation of this value is the following: The data suggest that the higher the dosage,
the longer the length of relief. The fact that the value is close to 1, suggests that this is a
strong, linear, positive association.
Many candidates did not mention all three words (strong, linear, positive). Note that all
of these words provide useful information on interpreting the association and are
therefore required.
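The correlation coefficient quoted in ii. can be reproduced from the data table in the question. This sketch uses the standard sample correlation formula; the names `s_xy`, `s_xx` and `s_yy` are illustrative.

```python
import math

x = [3, 3.5, 4, 5, 6, 6.5, 7, 8, 8.5, 9]                       # dosage (mg)
y = [9.1, 5.5, 12.3, 9.2, 14.2, 16.8, 22.0, 18.3, 24.5, 22.7]  # relief (hours)

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Corrected sums of products and squares.
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
s_xx = sum(xi**2 for xi in x) - n * x_bar**2
s_yy = sum(yi**2 for yi in y) - n * y_bar**2

r = s_xy / math.sqrt(s_xx * s_yy)

print(round(r, 4))
```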
iii. The regression line can be written as $\hat{y} = a + bx$ or $y = a + bx + \varepsilon$. The
formula for $b$ is:
\[ b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} \]
and by substituting the summary statistics we get $b = 2.7936$. The formula for $a$ is
$a = \bar{y} - b\bar{x}$, so we get $a = -1.4412$. Hence the regression line can be written as:
\[ \hat{y} = -1.4412 + 2.7936x. \]
(b) A study focused on the perception of life satisfaction that may vary between
older and younger people. For this reason 12 adults over the age of 70 and 16
adults aged between 18 and 30 took a life satisfaction questionnaire that gave a
score for each one of them (high values of the score indicate higher life
satisfaction). Summaries of these scores are presented below.
For α = 0.05, the critical values are ±2.056 (the t26 distribution is used). Decision: we
reject H0 . Choosing a smaller α, say α = 0.01, the critical values are ±2.779, hence again
we reject H0 (again, the t26 distribution is used). Hence there is strong evidence of a
difference in the mean scores of life satisfaction between older and younger adults.
ii. The assumptions made in i. concerned:
∗ equal variances
∗ whether nA + nB is ‘large’, so that the normality assumption is satisfied
∗ independent samples.
Some candidates stated assumptions in this part that were not made in part (i). Marks
were not awarded in such cases. Also, some other candidates just copied the phrase
‘assumption about equal variances’ and naturally were not awarded any marks. One
should state whether the calculations were based on the assumption that unknown
variances are equal or unequal.
iii. An example related to confounding (association versus causation) is expected, such as
‘perhaps the younger people who participated in the study were university students who had
just obtained a poor mark in one of their courses’.
Question 3
(a) A variety of a broad bean plant is studied and the number of beans per plant is
counted and listed below.
71 94 62 74 106
76 87 94 76 78
83 56 78 79 80
60 92 54 81 45
72 54 45 85 72
74 65 68 55 66
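The sub-questions for this part are not reproduced here, so the following is only a sketch of the kind of summary statistics such questions typically request (sample size, mean and median), computed directly from the raw counts rather than from grouped data.

```python
from statistics import mean, median

# Number of beans per plant, as listed above.
beans = [
    71, 94, 62, 74, 106,
    76, 87, 94, 76, 78,
    83, 56, 78, 79, 80,
    60, 92, 54, 81, 45,
    72, 54, 45, 85, 72,
    74, 65, 68, 55, 66,
]

n = len(beans)
sample_mean = mean(beans)
sample_median = median(beans)  # average of the 15th and 16th ordered values

print(n, round(sample_mean, 4), sample_median)
```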
\[ H_0: \pi_1 = \pi_2 \quad \text{vs.} \quad H_1: \pi_1 < \pi_2. \]
The test statistic value is 3.7655. For α = 0.05, the critical value is 1.645, hence we reject
H0 at the 5% significance level. There is evidence that the pill increases the chances of
getting better (can be mentioned in iv.).
Question 4
(a) A study looked into the views of workers towards school closure in order to
reduce coronavirus transmission. 300 participants from three sectors
(hospitality, banking and construction) were interviewed and their responses
were classified into three categories, namely positive, neutral and negative. The
data are shown below.
i. Based on the data in the table, and without conducting a significance test,
describe the differences of views across the different sectors.
ii. Calculate the χ2 statistic and use it to test for independence, using a 5%
significance level. What do you conclude?
(13 marks)
ii. Set out the null hypothesis that there is no association between sector and view towards
school closures against the alternative, that there is an association. Be careful to get
these the correct way round! We test:
H0 : No association between sector and view towards school closures
vs.
H1 : Association between sector and view towards school closures.
The test statistic is:
\[ \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
which gives a value of 17.744. This is a 3 × 3 contingency table so the degrees of freedom
are (3 − 1) × (3 − 1) = 4. For α = 0.05, the critical value is 9.488, hence we reject H0 . For
α = 0.01, the critical value is 13.277, hence we again reject H0 . We conclude that there is
strong evidence of an association between sector and view towards school closures.
Many candidates looked up the tables incorrectly and so failed to follow through their
earlier accurate work.
(b) i. Describe what selection bias is and when it may occur. Give an example.
ii. You have been asked to design a nationwide survey in your country to find
out about working conditions among employees in the postal offices. Provide
a probability sampling scheme and a sampling frame that you would like to
use. Identify a potential source of response bias that may occur and discuss
how this issue could be addressed.
(12 marks)