ST104a Commentary 2021 PDF

This document provides examiners' commentary and guidance for students taking the ST104a Statistics 1 exam. It includes: 1. Information about exam format changes, reading materials, and how to use past exam commentaries as guides rather than memorizing exact answers. 2. Learning outcomes and exam structure details to help students plan their time. 3. Advice on what examiners are looking for in answers and keys to improvement like fully labeling diagrams and showing work clearly. 4. A warning against only studying past questions and emphasizing the need to understand the entire syllabus.

Uploaded by

Nghia Tuan Nghia

Examiners’ commentaries 2021


ST104a Statistics 1

Important note

This commentary reflects the examination and assessment arrangements for this course in the
academic year 2020–21. The format and structure of the examination may change in future years,
and any such changes will be publicised on the virtual learning environment (VLE).

Information about the subject guide and the Essential reading references

Unless otherwise stated, all cross-references will be to the latest version of the subject guide (2019).
You should always attempt to use the most recent edition of any Essential reading textbook, even if
the commentary and/or online reading list and/or subject guide refer to an earlier edition. If
different editions of Essential reading are listed, please check the VLE for reading supplements – if
none are available, please use the contents list and index of the new edition to find the relevant
section.

General remarks

Learning outcomes

At the end of the half course and having completed the Essential reading and activities you should:

∗ be familiar with the key ideas of statistics that are accessible to a candidate with a
moderate mathematical competence
∗ be able to routinely apply a variety of methods for explaining, summarising and presenting
data and interpreting results clearly using appropriate diagrams, titles and labels when
required
∗ be able to summarise the ideas of randomness and variability, and the way in which these
link to probability theory to allow the systematic and logical collection of statistical
techniques of great practical importance in many applied areas
∗ have a grounding in probability theory and some grasp of the most common statistical
methods
∗ be able to perform inference to test the significance of common measures such as means and
proportions and conduct chi-squared tests of contingency tables
∗ be able to use simple linear regression and correlation analysis and know when it is
appropriate to do so.

Planning your time in the examination

You have two hours to complete this paper, which is in two parts. The first part, Section A, is
compulsory; it covers several subquestions and accounts for 50 per cent of the total marks.
Section B contains three questions, each worth 25 per cent, from which you are asked to choose two.
Remember that each of the Section B questions is likely to cover more than one topic. In 2021, for
example, the first part of Question 2 related to correlation and linear regression while the second
part covered statistical inference for means. In Question 3, the first part covered data
visualisation and descriptive statistics while the second part covered statistical inference for
proportions. Finally, in Question 4, the first part required contingency tables while the second part
related to aspects of sampling design. This means that it is really important that you make sure you
have a reasonable idea of what topics are covered before you start work on the paper! We suggest
you divide your time as follows during the examination:

∗ Spend the first 10 minutes annotating the paper. Note the topics covered in each question
and subquestion.
∗ Allow yourself 45 minutes for Section A. Do not allow yourself to get stuck on any one
question, but do not just give up after two minutes!
∗ Once you have chosen your two Section B questions, give them about 25 minutes each.
∗ This leaves you with 15 minutes. Do not leave the examination hall at this point! Check
over any questions you may not have completely finished. Make sure you have labelled and
given a title to any tables or diagrams which were required and, if you did more than the
two questions required in Section B, decide which one to delete. Remember that only two of
your answers will be given credit in Section B and that you must choose which these are!

What are the examiners looking for?

The examiners are looking for very simple demonstrations from you. They want to be sure that you:

∗ have covered the syllabus as described and explained in the subject guide
∗ know the basic formulae given there and when and how to use them
∗ understand and answer the questions set.

You are not expected to write long essays where explanations or descriptions of sampling design
are required, and note-form answers are acceptable. However, clear and accurate language, both
mathematical and written, is expected and marked. The explanations below and in the specific
commentaries for the papers for each zone should make these requirements clear.

Key steps to improvement

The most important thing you can do is answer the question set! This may sound very simple, but
these are some of the things that candidates did not do, though asked, in the 2019 examinations!
Remember the following.

∗ If you are asked to label a diagram (which is almost always the case!), please do so. Writing
‘Histogram’, ‘Stem-and-leaf diagram’, ‘Boxplot’ or ‘Scatter diagram’ in itself is insufficient.
What do the data describe? What are the units? What are the x-axis and y-axis?
∗ If you are specifically asked to perform a hypothesis test, or calculate a confidence interval,
do so. It is not acceptable to do one rather than the other! If you are asked to use a 5%
significance level, this is what will be marked.
∗ Do not waste time calculating things which are not required by the examiners. If you are
asked to find the line of best fit, you will get no marks if you calculate the correlation
coefficient as well. If you are asked to use the confidence interval you have just calculated to
comment on the results, carrying out an additional hypothesis test will not gain you marks.
∗ When performing calculations try to use as many decimal places as possible in intermediate
steps to reach the most accurate solution. It is advised to have at least two decimal places
in general and at least three decimal places when calculating probabilities.

How should you use the specific comments on each question given in the Examiners’ commentaries?

We hope that you find these useful. For each question and subquestion, they give:

∗ further guidance for each question on the points made in the last section
∗ the answers, or keys to the answers, which the examiners were looking for
∗ the relevant detailed reference to Newbold, P., W.L. Carlson and B.M. Thorne Statistics for
Business and Economics. (London: Prentice–Hall, 2012) eighth edition [ISBN
9780273767060] and the subject guide
∗ where appropriate, suggested activities from the subject guide which should help you to
prepare, and similar questions from Newbold et al. (2012).

Any further references you might need are given in the part of the subject guide to which you are
referred for each answer.

Memorising from the Examiners’ commentaries

It was noted recently that a small number of candidates appeared to be memorising answers from
previous years’ Examiners’ commentaries, for example plots, and produced the exact same image of
them without looking at the current year’s examination paper questions! Note that this is very easy
to spot. The Examiners’ commentaries should be used as a guide to practise on sample examination
questions and it is pointless to attempt to memorise them.

Examination revision strategy

Many candidates are disappointed to find that their examination performance is poorer than they
expected. This may be due to a number of reasons, but one particular failing is ‘question
spotting’, that is, confining your examination preparation to a few questions and/or topics which
have come up in past papers for the course. This can have serious consequences.

We recognise that candidates might not cover all topics in the syllabus in the same depth, but you
need to be aware that examiners are free to set questions on any aspect of the syllabus. This
means that you need to study enough of the syllabus to enable you to answer the required number of
examination questions.

The syllabus can be found in the Course information sheet available on the VLE. You should read
the syllabus carefully and ensure that you cover sufficient material in preparation for the
examination. Examiners will vary the topics and questions from year to year and may well set
questions that have not appeared in past papers. Examination papers may legitimately include
questions on any topic in the syllabus. So, although past papers can be helpful during your revision,
you cannot assume that topics or specific questions that have come up in past examinations will
occur again.

If you rely on a question-spotting strategy, it is likely you will find yourself in difficulties
when you sit the examination. We strongly advise you not to adopt this strategy.

Comments on specific questions – Zone A

Candidates should answer THREE questions: all parts of Section A (50 marks in total) and TWO
questions from Section B (25 marks each). Candidates are strongly advised to divide their
time accordingly.

Section A

Answer all parts of question 1 (50 marks in total).

Question 1

(a) Suppose that $x_1 = -3$, $x_2 = 9$, $x_3 = 16$, and $y_1 = -2$, $y_2 = 1$, $y_3 = 0.5$.
Calculate the following quantities:

i. $\sum_{i=2}^{3} \sqrt{x_i}$    ii. $\left(\sum_{i=1}^{3} x_i y_i\right)^2$    iii. $|x_1| + \sum_{i=2}^{3} \frac{y_i^3}{y_i^2}$.

(6 marks)

Reading for this question


This question refers to the basic bookwork which can be found in Section 2.9 of the subject
guide.
Approaching the question
Be careful to leave the xi s and yi s in the order given and only cover the values of i asked for.
This question was generally well done; the answers are as follows.

i. We have:
$$\sum_{i=2}^{3} \sqrt{x_i} = \sqrt{9} + \sqrt{16} = 3 + 4 = 7.$$
Note that to be mathematically precise, $\sqrt{9}$ and $\sqrt{16}$ are also equal to −3 and −4,
respectively. For this reason, −7 as a final answer was also accepted as correct.
ii. We have:
$$\left(\sum_{i=1}^{3} x_i y_i\right)^2 = ((-3 \times -2) + (9 \times 1) + (16 \times 0.5))^2 = (23)^2 = 529.$$
iii. We have:
$$|x_1| + \sum_{i=2}^{3} \frac{y_i^3}{y_i^2} = |x_1| + \sum_{i=2}^{3} y_i = |-3| + 1 + 0.5 = 4.5.$$
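
The three sums can be checked numerically. The following is a minimal Python sketch (not part of the examiners' solution) that keeps the values in the order given and only covers the requested indices; the variable names are illustrative.

```python
import math

x = [-3, 9, 16]      # x_1, x_2, x_3
y = [-2, 1, 0.5]     # y_1, y_2, y_3

# Python lists are 0-indexed, so i = 2..3 corresponds to the slice [1:].
part_i = sum(math.sqrt(v) for v in x[1:])                 # principal square roots
part_ii = sum(a * b for a, b in zip(x, y)) ** 2           # square of the whole sum
part_iii = abs(x[0]) + sum(v**3 / v**2 for v in y[1:])    # |x_1| + sum of y_i^3 / y_i^2

print(part_i, part_ii, part_iii)   # 7.0 529.0 4.5
```

Note that `math.sqrt` returns only the non-negative root, so the sketch reproduces 7 rather than −7.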

(b) Classify each one of the following variables as either measurable (continuous) or
categorical. If a variable is categorical, further classify it as either nominal or
ordinal. Justify your answer. (No marks will be awarded without a justification.)
i. Age brackets of 18–30, 31–50, 51–70, 70+.
ii. Passport number.
iii. A country’s inflation rate.
(6 marks)

Reading for this question


This question requires identifying types of variables so reading the relevant section in the
subject guide (Section 4.6) is essential. Candidates should gain familiarity with the notion of
a variable and be able to distinguish between discrete and continuous (measurable) data. In
addition to identifying whether a variable is categorical or measurable, further distinctions
between ordinal and nominal categorical variables should be made by candidates.
Approaching the question
A general tip for identifying continuous and categorical variables is to think of the possible
values they can take. If these are finite and represent specific entities the variable is
categorical. Otherwise, if these consist of numbers corresponding to measurements, the data
are continuous and the variable is measurable. Such variables may also have measurement
units or can be measured to various decimal places.

i. Categorical, ordinal. Age brackets are in a ranked order, with those 18–30 being younger
than those 31–50 etc.
ii. Categorical, nominal. Although numeric, passport numbers are for identification only.
iii. Measurable. Inflation rates can be measured in percentages to several decimal places.

Weak candidates did not provide a justification for their choices, classified a measurable
variable as nominal or categorical, and sometimes answered ordinal when their justification
pointed towards a nominal variable. Phrases like ‘It is measurable because it can be
measured’ were not awarded any marks.

(c) State whether the following are true or false and give a brief explanation. (No
marks will be awarded for a simple true/false answer.)
i. For a set of observations $x_1, x_2, \ldots, x_n$, with mean $\bar{x}$:
$$\sum_{i=1}^{n} (x_i - \bar{x}) > 0.$$
ii. For two independent events A and B such that P(A) > 0 and P(B) > 0:
$$P(A \cup B) < P(A) + P(B).$$
iii. For a random variable X, $E(X^2)$ can be less than $(E(X))^2$.
iv. Rejection of a true null hypothesis is known as the power of a test.
v. A 4-by-2 contingency table which results in a $\chi^2$ test statistic value of 6.724 is
statistically significant at the 5% significance level.
(10 marks)

Reading for this question


This question contains material from various parts of the subject guide. Here, it is more
important to have a good intuitive understanding of the relevant concepts than the technical
level in computations. Part i. refers to Chapter 4 and in particular Section 4.9.3, whereas
part ii. requires knowledge of basic probability properties that can be found in Section 5.9.
Part iii. is about population mean and variance, see Section 6.7. Part iv. targets concepts
related to hypothesis testing, covered in Chapter 8. Finally, part v. focuses on material of
Chapter 9, and more specifically Section 9.7.
Approaching the question
Candidates always find this type of question tricky. It requires a brief explanation of the
reason for a true/false and not just a choice between the two. Some candidates lost marks for
long rambling explanations without a decision as to whether a statement was true or false.
i. False. Since:
$$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0.$$
ii. True. Since A and B are independent, with P(A) > 0 and P(B) > 0, then
$P(A \cap B) = P(A)\,P(B) > 0$ and:
$$P(A \cup B) = P(A) + P(B) - P(A \cap B) = P(A) + P(B) - P(A)\,P(B) < P(A) + P(B).$$
iii. False. Variances can never be negative, hence:
$$\mathrm{Var}(X) = E(X^2) - (E(X))^2 \geq 0 \Rightarrow E(X^2) \geq (E(X))^2.$$
iv. False. Rejection of a true null hypothesis is known as a Type I error. Or, power is the
probability of rejecting a false null hypothesis.
v. False. With (4 − 1)(2 − 1) = 3 degrees of freedom, $\chi^2_{0.05,\,3} = 7.815 > 6.724$, hence it is not
statistically significant at the 5% significance level.
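
Parts i., ii. and v. lend themselves to a quick numerical check. The sketch below is illustrative only: the observations and the event probabilities are hypothetical values chosen for the example, and the chi-squared tail probability uses a closed form that is valid for 3 degrees of freedom only.

```python
import math

# Part i: deviations from the mean sum to zero (up to floating-point error).
data = [2.0, 3.5, 7.0, 11.5]          # hypothetical observations
mean = sum(data) / len(data)
dev_sum = sum(v - mean for v in data)

# Part ii: hypothetical independent events with P(A) = 0.3 and P(B) = 0.5.
p_a, p_b = 0.3, 0.5
p_union = p_a + p_b - p_a * p_b       # independence: P(A and B) = P(A) P(B)

# Part v: upper-tail probability of a chi-squared variable with 3 df.
# This closed form holds for 3 degrees of freedom only.
def chi2_sf_3df(x):
    cdf = math.erf(math.sqrt(x / 2)) - math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    return 1 - cdf

p_value = chi2_sf_3df(6.724)
print(abs(dev_sum) < 1e-12, p_union < p_a + p_b, p_value > 0.05)
```

The tail probability for 6.724 comes out above 0.05, which agrees with the comparison against the tabulated critical value 7.815.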

(d) X is a normal random variable with a mean of µ = 5. If P(X < 1) = 0.20,
approximately what is the value of the variance, $\sigma^2$?
(5 marks)

Reading for this question


This question covers the normal distribution, for which read Section 6.8 of the subject guide.
Approaching the question
We have:
$$P(X < 1) = P\left(Z < \frac{1 - 5}{\sigma}\right) = P\left(Z < -\frac{4}{\sigma}\right) = 0.20$$
where $Z \sim N(0, 1)$. Since $P(Z > 0.84) = P(Z < -0.84) \approx 0.20$, we have:
$$-0.84 = -\frac{4}{\sigma} \Rightarrow \sigma = 4.76$$
so, approximately, $\mathrm{Var}(X) = (4.76)^2 = 22.66$.
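
As a cross-check, the same steps can be sketched in Python using the exact standard normal percentile instead of the tabulated −0.84. Because the exact value is about −0.8416, the resulting variance is roughly 22.6, slightly below the table-based 22.66; either is an acceptable approximation.

```python
from statistics import NormalDist

mu = 5
z = NormalDist().inv_cdf(0.20)   # 20th percentile of N(0, 1), about -0.8416
sigma = (1 - mu) / z             # rearranged from (1 - mu) / sigma = z
variance = sigma ** 2

print(round(z, 4), round(sigma, 2), round(variance, 1))
```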

(e) The probability distribution of a random variable X is given below.


X = x       −2    −1    0     1     2
P(X = x)    k     2k    4k    2k    k

i. Explain why k = 0.10.


(2 marks)
ii. Given that E(X) = 0, calculate the standard deviation of X to four decimal
places.
(3 marks)
iii. Is it possible to calculate E(1/X)? If yes, calculate its value. If no, explain
why.
(3 marks)
iv. Does X have a normal distribution? Briefly explain your answer.
(2 marks)

Reading for this question


This is a question on probability, exploring the concepts of relative frequency and
probability distributions. Reading from Chapter 5 is suggested with a focus on the sections
on these topics. Try Activity A5.1 and the exercises on probability trees. For part iv., and
in particular the discrete uniform distribution, read Section 9.8.
Approaching the question
i. The probabilities must sum to 1, so:
$$\sum_{x=-2}^{2} p(x) = k + 2k + 4k + 2k + k = 10k = 1 \Rightarrow k = 0.10.$$
ii. We have:
$$E(X^2) = \sum_{x=-2}^{2} x^2\, p(x) = (-2)^2 \times 0.10 + (-1)^2 \times 0.20 + 0^2 \times 0.40 + 1^2 \times 0.20 + 2^2 \times 0.10 = 1.2$$
hence $\mathrm{Var}(X) = E(X^2) = 1.2$, since E(X) = 0, hence:
$$\text{Std. dev.} = \sqrt{1.2} = 1.0954.$$
iii. It is not possible since:
$$E\left(\frac{1}{X}\right) = \sum_{x=-2}^{2} \frac{1}{x}\, p(x)$$
is only defined for $x \neq 0$, but P(X = 0) = 0.40 > 0.
iv. X is discrete, but a normal random variable is continuous, hence X does not have a
normal distribution. Alternatively, an accurate mass function plot vs. a normal curve
could be provided.
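
The calculations in parts i. to iii. can be reproduced with a short Python sketch. The list of weights simply encodes the multiples of k from the table; the names are illustrative.

```python
import math

values = [-2, -1, 0, 1, 2]
weights = [1, 2, 4, 2, 1]               # multiples of k in the table
k = 1 / sum(weights)                    # probabilities must sum to 1
probs = [w * k for w in weights]

mean = sum(x * p for x, p in zip(values, probs))
variance = sum(x**2 * p for x, p in zip(values, probs)) - mean**2
std_dev = math.sqrt(variance)

# E(1/X) needs 1/x for every x with positive probability; x = 0 breaks this.
undefined = any(x == 0 and p > 0 for x, p in zip(values, probs))

print(round(k, 2), round(std_dev, 4), undefined)
```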

(f ) Based on the central limit theorem, you are told that a 90% confidence interval
for a population proportion is (0.7077, 0.7723).

i. What was the sample proportion which resulted in this confidence interval?
(2 marks)
ii. What was the size of the sample used?
(4 marks)

Reading for this question


This question contains material on sample size determination in relation to the normal
distribution and the distribution of the sample proportion. Moreover, knowing the concept
of confidence intervals is essential. Sample size determination is covered in Section 7.11. For
confidence intervals, read Section 7.6 for the principle and Section 7.10 for the case of
proportions.
Approaching the question
i. The sample proportion, p, would be in the centre of the interval (0.7077, 0.7723). Adding
the two endpoints and dividing by 2 gives:
$$p = \frac{0.7077 + 0.7723}{2} = 0.74.$$
ii. To find the sample size, note that the (estimated) standard error when estimating a
single proportion is:
$$\sqrt{\frac{p(1 - p)}{n}} = \sqrt{\frac{0.74 \times 0.26}{n}} = \frac{\sqrt{0.74 \times 0.26}}{\sqrt{n}} = \frac{0.4386}{\sqrt{n}}.$$
Since this is a 100(1 − α)% = 90% confidence interval, then α = 0.10, so the confidence
coefficient is $z_{\alpha/2} = z_{0.05} = 1.645$. The margin of error is half the width of the
interval, (0.7723 − 0.7077)/2 = 0.0323. Therefore, to determine n we need to solve:
$$1.645 \times \frac{0.4386}{\sqrt{n}} = 0.0323.$$
The correct sample size is n ≈ 499 (depending on rounding).
Given the approximate values obtained from tables, n = 500 or n = 498 were also taken
as correct depending on the decimal places of the z-value used (t should not be used).
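
A minimal Python sketch of the same back-calculation follows. It uses the exact normal percentile rather than the tabulated 1.645, so the unrounded n differs slightly in the decimals, which is the rounding sensitivity mentioned above.

```python
from statistics import NormalDist

lower, upper = 0.7077, 0.7723
p = (lower + upper) / 2                  # centre of the interval
margin = (upper - lower) / 2             # half-width = z * standard error
z = NormalDist().inv_cdf(0.95)           # two-sided 90% interval, about 1.645

# Solve z * sqrt(p * (1 - p) / n) = margin for n.
n = p * (1 - p) * (z / margin) ** 2

print(round(p, 2), round(margin, 4), round(n))
```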

(g) It is assumed that investors are equally split between those who prefer ‘growth’
stocks and those who prefer ‘value’ stocks. In a random sample of 200 investors,
105 agreed with the statement ‘Growth stocks are better than value stocks’.

i. Conduct a two-sided hypothesis test, at the 5% significance level, to test
whether in the population of investors there are equal preferences for growth
and value stocks. Show all your steps and use the ‘critical value’ approach to
perform the test.
(5 marks)
ii. Calculate the p-value of the test statistic value calculated in part i.
(2 marks)

Reading for this question


This question refers to a hypothesis test for a single proportion, for which read Section 8.14
of the subject guide. The second part of the question looks at p-values and the relevant
section is Section 8.11.

Approaching the question

i. We test $H_0: \pi = 0.50$ vs. $H_1: \pi \neq 0.50$.
The sample proportion is p = 105/200 = 0.525. The test statistic value is:
$$\frac{p - \pi}{\sqrt{\pi(1 - \pi)/n}} = \frac{0.525 - 0.50}{\sqrt{0.50 \times 0.50/200}} = 0.7071.$$
For α = 0.05, the critical values are $\pm z_{0.025} = \pm 1.96$. Since 0.7071 < 1.96 we do not
reject $H_0$, hence there is no evidence that $\pi \neq 0.50$.

ii. The p-value is:
$$2 \times P(Z > 0.71) = 2 \times 0.2389 = 0.4778.$$
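
Both parts can be sketched in Python as follows. The p-value here is computed from the unrounded test statistic, so it is about 0.48 rather than exactly the table-based 0.4778.

```python
import math
from statistics import NormalDist

n, successes, pi0 = 200, 105, 0.50
p = successes / n
z = (p - pi0) / math.sqrt(pi0 * (1 - pi0) / n)   # standard error uses the null value

critical = NormalDist().inv_cdf(0.975)           # about 1.96
reject = abs(z) > critical
p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-sided p-value

print(round(z, 4), reject, round(p_value, 2))
```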

Section B

Answer two out of the three questions from this section (25 marks each).

Question 2

(a) The manager of a store selling shoes is looking into the association between
daily sales (in hundreds of $) in the store, y, and the number of customers who
visited the store in that day, x. For this reason, in 10 days selected at random
the variables x and y were recorded. They appear in the table below:

Days #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
# of customers (x) 90 92 50 74 78 88 87 51 53 42
Sales (y) 11.2 11.1 6.8 9.2 9.4 10.1 9.4 7.7 8.2 6.1

The summary statistics for these data are:

Sum of x data: 705 Sum of the squares of x data: 53,111


Sum of y data: 89.2 Sum of the squares of y data: 822
Sum of the products of x and y data: 6,573.3

i. Draw a scatter diagram of these data. Label the diagram carefully.

ii. Calculate the sample correlation coefficient. Interpret your findings.

iii. Calculate the least squares line of y on x and draw the line on the scatter
diagram.

iv. Suppose that you observe more data and when you draw the corresponding
scatter diagram a non-linear association is revealed. Discuss how this can be
interpreted in the context of the problem.

(13 marks)

Reading for this question


This is a standard regression question and the reading is to be found in Chapter 12. Section
12.6 provides details for scatter diagrams and is suitable for part i., whereas the remaining
parts are on correlation and regression that are covered in Sections 12.8–12.10. Section 12.7
is also relevant. Sample examination question 2 of this chapter is also recommended for
practice on questions of this type.

Approaching the question


i. Candidates are reminded that they are asked to draw and label the scatter diagram
which should include a title (‘Scatter diagram’ alone will not suffice) and labelled axes
which give their units in addition. Far too many candidates threw away marks by
neglecting these points and consequently were only given one mark out of the possible
four allocated for this part of the question.
We have:
[Figure not reproduced: scatter diagram of daily sales, y, against number of customers, x.]

ii. The summary statistics can be substituted into the formula for the sample correlation
coefficient (make sure you know which one it is!) to obtain the value 0.9502. An
interpretation of this value is the following: The data suggest that the higher the number
of customers, the higher the daily sales. The fact that the value is close to 1 suggests
that this is a strong, linear, positive association.
Many candidates did not mention all three words (strong, linear, positive). Note that all
of these words provide useful information on interpreting the association and are
therefore required.
iii. The regression line can be written by the equation $\hat{y} = a + bx$ or $y = a + bx + \varepsilon$. The
formula for b is:
$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}$$
and by substituting the summary statistics we get b = 0.0835. The formula for a is
$a = \bar{y} - b\bar{x}$, so we get a = 3.0314. Hence the regression line can be written as:
$$\hat{y} = 3.0314 + 0.0835x \quad \text{or} \quad y = 3.0314 + 0.0835x + \varepsilon.$$

It should also be plotted on the scatter diagram.


Many candidates reported incorrectly the regression line as y = 3.0314 + 0.0835x. This
expression is false; one of the two above is required. Also, many candidates did not draw
this line on the scatter diagram; instead they drew an approximate line trying to go
around the points but without reference to the above equation. No marks were awarded
in such cases.
iv. Some discussion mentioning, for example, that it may be possible that if too many
customers come in, not all of them will be willing to buy and some will just visit the
shop for browsing.
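
The correlation coefficient and the least squares estimates in parts ii. and iii. can be reproduced from the summary statistics alone. The sketch below is one way to do this in Python; the intermediate names (`s_xy`, `s_xx`, `s_yy`) are illustrative labels for the corrected sums.

```python
import math

n = 10
sum_x, sum_x2 = 705, 53111
sum_y, sum_y2 = 89.2, 822
sum_xy = 6573.3

x_bar, y_bar = sum_x / n, sum_y / n
s_xy = sum_xy - n * x_bar * y_bar        # corrected sum of cross-products
s_xx = sum_x2 - n * x_bar**2             # corrected sum of squares of x
s_yy = sum_y2 - n * y_bar**2             # corrected sum of squares of y

r = s_xy / math.sqrt(s_xx * s_yy)        # sample correlation coefficient
b = s_xy / s_xx                          # slope of the least squares line
a = y_bar - b * x_bar                    # intercept

print(round(r, 4), round(b, 4), round(a, 4))
```

To draw the fitted line on the scatter diagram, evaluate `a + b * x` at two x-values within the observed range (for example 42 and 92) and join the two points.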

(b) A study focused on the perception of job satisfaction that may vary between
women and men. For this reason, at random 15 women and 13 men took a job
satisfaction questionnaire that gave a score for each one of them (high values of
the score indicate higher job satisfaction). Summaries of these scores are
presented below.

         Sample size   Sample mean   Sample variance
Women    15            32.1          15.2
Men      13            28.5          19.3

i. Use an appropriate hypothesis test to determine whether the mean job
satisfaction scores differ between women and men. Test at two appropriate
significance levels, stating clearly the hypotheses, the test statistic and its
distribution under the null hypothesis. Comment on your findings.
ii. State clearly any assumptions you made in part i.
iii. Is it possible that there is no difference between men and women in terms of
their job satisfaction? Discuss.
(12 marks)

Reading for this question


The first two parts of the question refer to a two-sided hypothesis test comparing two
population means. While the entire chapter on hypothesis testing is relevant, one can focus
on Section 8.16. The third part of the question is open-ended but can be approached most
likely as the difference between causation and association. The content of Chapter 11 is
relevant.
Approaching the question
i. Let $\mu_A$ denote the mean score for women and $\mu_B$ the mean score for men. We test:
$$H_0: \mu_A = \mu_B \quad \text{vs.} \quad H_1: \mu_A \neq \mu_B.$$
The test statistic value is 2.2979 (using the pooled sample variance of 17.0923). (If equal
variances are not assumed the test statistic value is 2.2778.) For reference the test
statistic formula is:
$$\frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_p^2\,(1/n_A + 1/n_B)}} \quad \text{or} \quad \frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_A^2/n_A + s_B^2/n_B}}.$$
For α = 0.05, the critical values are ±2.056 (the $t_{26}$ distribution is used). Decision: we
reject $H_0$. Choosing a smaller α, say α = 0.01, the critical values are ±2.779, hence we
do not reject $H_0$ (again, the $t_{26}$ distribution is used). Hence there is moderate evidence
of a difference in the mean scores of job satisfaction between men and women.
ii. The assumptions made in part i. concern:
∗ equal variances
∗ whether $n_A + n_B$ is ‘large’ so that the normality assumption is satisfied
∗ independent samples.
Some candidates stated assumptions in this part that were not made in part (i). Marks
were not awarded in such cases. Also, some other candidates just copied the phrase
‘assumption about equal variances’ and naturally were not awarded any marks. One
should state whether the calculations were based on the assumption that unknown
variances are equal or unequal.
iii. Some example related to confounding (causation and association), such as ‘if the male
participants of the study were working in night shifts, this could explain their lower job
satisfaction’.
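
The two versions of the test statistic in part i. can be checked with the Python sketch below. The critical values are hard-coded from the commentary (t distribution with 26 degrees of freedom) rather than computed, since the standard library has no t quantile function.

```python
import math

n_a, mean_a, var_a = 15, 32.1, 15.2     # women
n_b, mean_b, var_b = 13, 28.5, 19.3     # men

# Pooled-variance version (assumes equal population variances).
s2_pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
t_pooled = (mean_a - mean_b) / math.sqrt(s2_pooled * (1 / n_a + 1 / n_b))

# Unpooled version (no equal-variance assumption).
t_unpooled = (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)

# Two-sided critical values of t with 26 df, as quoted in the commentary.
reject_5pct = abs(t_pooled) > 2.056
reject_1pct = abs(t_pooled) > 2.779

print(round(s2_pooled, 4), round(t_pooled, 3), round(t_unpooled, 3),
      reject_5pct, reject_1pct)
```

Rejecting at the 5% level but not at the 1% level is what the commentary summarises as moderate evidence of a difference.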

Question 3

(a) Thirty people were asked about the number of hours they exercise in a week
and their answers were recorded and listed below.

 2.0   4.0   4.5   5.0   5.5
 6.0   6.5   6.5   7.0   7.0
 7.5   7.5   8.0   8.0   8.5
 8.5   8.5   9.0   9.0  10.0
10.5  10.5  11.0  11.5  12.0
13.0  14.0  17.0  18.0  21.0

i. Carefully construct, draw and label a histogram of these data.
ii. Find the mean (given that the sum of the data is 277), the median and the
modal group.
iii. Comment on the data based on the shape of the histogram and the measures
you have calculated.
iv. Name two other types of graphical displays that would be suitable to
represent the data.
(12 marks)

Reading for this question


Chapter 4 provides all the relevant material for this question. More specifically, reading on
histograms can be found in Section 4.7.3, but the entire Section 4.7 is highly relevant. For
measures of location (mean, median, modal group) see Section 4.8.
Approaching the question
i. A histogram compatible with what the examiners were expecting to see is shown below.
[Figure not reproduced: histogram of hours of exercise per week.]

Marks were awarded for including the title, labelling correctly and accurately drawing
the figure.

ii. The requested results are below and are simple to compute.
∗ Mean: 9.2333 hours of exercise per week.
∗ Median: 8.5 hours of exercise per week.
∗ Modal group: 5–10 hours of exercise per week.
Make sure to use measurement units. Also, avoid the use of grouped data formulae as
they are approximate.
iii. The distribution of the data appears to be positively/right skewed. This is also
supported by the fact that the mean is larger than the median.
iv. Some graphs are listed below, although there are more.
∗ Boxplot
∗ Stem-and-leaf diagram.
∗ Dot plot.
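
The measures of location in part ii. can be verified from the raw data with the sketch below. The class width of 5 hours with intervals closed on the left, [0, 5), [5, 10), and so on, is an assumption made for this example; it is consistent with the modal group of 5–10 hours reported above, but other sensible groupings are possible.

```python
from collections import Counter
from statistics import mean, median

hours = [2.0, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 6.5, 7.0, 7.0,
         7.5, 7.5, 8.0, 8.0, 8.5, 8.5, 8.5, 9.0, 9.0, 10.0,
         10.5, 10.5, 11.0, 11.5, 12.0, 13.0, 14.0, 17.0, 18.0, 21.0]

width = 5  # assumed class width; classes are [0, 5), [5, 10), ...
counts = Counter(int(h // width) * width for h in hours)   # key = class lower bound
modal_start = max(counts, key=counts.get)

print(round(mean(hours), 4), median(hours), (modal_start, modal_start + width))
print(sorted(counts.items()))   # class frequencies, i.e. the histogram bar heights
```

The mean exceeding the median, together with the long run of sparsely filled classes on the right, matches the positive skew described in part iii.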

(b) A researcher is interested in determining whether taking additional vitamin C
helps prevent the common cold. A randomised experiment was conducted to
address this question. The study randomly allocated 279 people to either a
group where vitamin C supplements were given, or a group where a placebo pill
was given. These people were monitored and the numbers of those who got or
did not get a cold were recorded. The results are summarised below:

            Got a cold   Did not get a cold
Vitamin C   17           122
Placebo     31           109

i. Give a 95% confidence interval for the difference in the probabilities of
getting a cold between the vitamin C and the placebo groups.
ii. Carry out an appropriate hypothesis test at the 5% significance level to
determine whether the probability of getting a cold is lower in the vitamin C
group, compared to the probability in the placebo group. State the test
hypotheses, and specify your test statistic and its distribution under the null
hypothesis. Comment on your findings.
iii. State any assumptions you made in part ii.
iv. On the basis of the data alone, would you conclude that a vitamin C pill
reduces the chances of getting a cold? Provide an explanation with your
answer.
(13 marks)

Reading for this question


Look up the sections about hypothesis testing and confidence intervals for differences in
proportions; more specifically Sections 7.12 and 8.15.
Approaching the question
i. Let $p_1$ and $n_1$ refer to the proportion of those who got a cold in the vitamin C group and
to the sample size of the vitamin C group, respectively. Similarly, denote by $p_2$ and $n_2$
the corresponding quantities in the placebo group. Marks were awarded for the following
points:
∗ Calculation of standard error:
$$\text{s.e.}(p_1 - p_2) = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}} = 0.0448.$$
∗ Calculation of lower and upper bounds −0.1869 and −0.0114.
∗ Present as an interval (−0.1869, −0.0114).

ii. Let π1 denote the vitamin C group proportion and π2 the placebo group proportion.
Also, denote by p the overall sample proportion of those who got a cold. Marks were
awarded for the following points. We test:
H0 : π1 = π2 vs. H1 : π1 < π2.
The test statistic (p1 − p2)/s.e.(p1 − p2) follows a standard normal distribution,
approximately. The calculation of the standard error is:
\[ \text{s.e.}(p_1 - p_2) = \sqrt{p(1 - p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = 0.0452. \]
The test statistic value is −2.1935. For α = 0.05, the critical value is −1.645, hence we
reject H0 at the 5% significance level. There is evidence that the probability of getting a
cold is lower for the vitamin C group (can be mentioned in iv.).
iii. Below are a couple of assumptions made:
∗ Sample sizes are large to justify the normality assumption.
∗ Independent samples.
iv. Some brief discussion is expected including the following points:
∗ The evidence from both analyses points in that direction.
∗ One has to be cautious as the above implies association, not causation.
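
The calculations in parts i. and ii. can be checked with a short script. The following is a minimal sketch using only the Python standard library; the variable names are illustrative and not part of the commentary. It uses the unpooled standard error for the confidence interval and the pooled standard error for the test, as in the formulae above.

```python
from math import sqrt
from statistics import NormalDist

# Counts from the table: (number who got a cold, group size)
x1, n1 = 17, 17 + 122   # vitamin C
x2, n2 = 31, 31 + 109   # placebo

p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2

# Part i: 95% confidence interval, using the unpooled standard error
z_975 = NormalDist().inv_cdf(0.975)          # about 1.96
se_ci = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci = (diff - z_975 * se_ci, diff + z_975 * se_ci)

# Part ii: H0: pi1 = pi2 vs. H1: pi1 < pi2, using the pooled standard error
p_pooled = (x1 + x2) / (n1 + n2)
se_test = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = diff / se_test
critical = NormalDist().inv_cdf(0.05)        # about -1.645
reject_h0 = z < critical

print(f"s.e. (CI) = {se_ci:.4f}, CI = ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"s.e. (test) = {se_test:.4f}, z = {z:.4f}, reject H0: {reject_h0}")
```

In the examination these values come from a calculator and statistical tables, so small differences in the last decimal place are to be expected.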

Question 4

(a) A mental health study focused on 300 patients visiting three community mental
health centres. The patients were classified into three groups according to the
primary issue for which they were seen. The data are shown below.
                          Type of Problem
            Social Adjustment   Stress Related   Other   Total
Centre 1           45                 28           27     100
Centre 2           28                 44           28     100
Centre 3           46                 29           25     100
Total             119                101           80     300
i. Based on the data in the table, and without conducting a significance test,
describe the differences in terms of the primary issue for which the patients
were seen across the different centres.
ii. Calculate the χ2 statistic and use it to test for independence, using a 5%
significance level. What do you conclude?
(13 marks)

Reading for this question


This question targets Chapter 9 on contingency tables and chi-squared tests. Note that
part i. of the question does not require any calculations, just understanding and
interpreting contingency tables. Part ii. is a straightforward chi-squared test and the reading
is also given in Chapter 9.
Approaching the question
i. An example of a ‘good’ answer is given below:
There are some differences in the distributions within community mental health centres.
More specifically, the rates of problems related to social adjustment appear to be
higher than problems related to stress in Centres 1 and 3 (45% vs. 28% and 46% vs.
29%, respectively). In Centre 2, however, problems related to stress appear to be more
common (44% vs. 28%). Hence there seems to be an association between community
mental health centre and type of problem, although this needs to be investigated further.


ii. Set out the null hypothesis that there is no association between community mental
health centre and type of problem against the alternative, that there is an association.
Be careful to get these the correct way round! We test:

H0 : No association between community mental health centre and type of problem

vs.

H1 : Association between community mental health centre and type of problem.

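Work out the expected values to obtain the table below:

            Social Adjustment   Stress Related   Other
Centre 1         39.67              33.67        26.67
Centre 2         39.67              33.67        26.67
Centre 3         39.67              33.67        26.67

The test statistic formula is:
\[ \chi^2 = \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} \]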

which gives a value of 10.107. This is a 3 × 3 contingency table so the degrees of freedom
are (3 − 1) × (3 − 1) = 4. For α = 0.05, the critical value is 9.488, hence we reject H0 .
For α = 0.01, the critical value is 13.277, hence we do not reject H0 . We conclude that
there is moderate evidence of an association between community mental health centre
and type of problem.
Many candidates looked up the tables incorrectly and so failed to follow through their
earlier accurate work.
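
The expected values and the test statistic can be reproduced as follows. This is a sketch under the assumption that the critical values are simply copied from the statistical tables quoted above (the standard library has no chi-squared distribution); the variable names are illustrative.

```python
# Observed counts: rows are centres, columns are types of problem
observed = [
    [45, 28, 27],   # Centre 1
    [28, 44, 28],   # Centre 2
    [46, 29, 25],   # Centre 3
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: (row total x column total) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Critical values for 4 degrees of freedom, copied from the text
critical = {0.05: 9.488, 0.01: 13.277}
for alpha, cv in critical.items():
    print(f"alpha = {alpha}: reject H0 = {chi_sq > cv}")
print(f"chi-squared = {chi_sq:.3f} on {df} degrees of freedom")
```

Because every row total is 100, each centre has the same expected counts, which is why the three rows of the expected table are identical.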

(b) i. You have been asked to design a nationwide survey in your country to find
out about internet use by children less than 10 years old. Provide a
probability sampling scheme and a sampling frame that you would like to
use. Identify a potential source of selection bias that may occur and discuss
how this issue could be addressed.
ii. Describe what a longitudinal survey is. State two ways in which panel
surveys differ from longitudinal surveys.
(12 marks)

Reading for this question


This was a question on basic material concerning survey designs. Background reading is
given in Chapters 10 and 11 of the subject guide which, along with the recommended
reading, should be looked at carefully. Longitudinal surveys are described explicitly on page
271. Candidates were expected to have studied and understood the main important
constituents of design in random sampling. It is also a good idea to try the learning
activities of Chapter 10.
Approaching the question
One of the main things to avoid in this part is writing essays without any structure. This
exercise asks for specific things, each of which requires one or two lines. If you are
unsure of what these things are, do not write lengthy essays: doing so gains you
nothing and wastes valuable examination time. If you can identify what is
being asked, keep in mind that the answer should not be long.
Note also that in some cases there is no unique answer to the question.
i. A ‘good’ answer could contain the points given below:
Sampling frame: Note that the target group is ‘children less than ten years old’ hence
candidates might take the view that they only need to look at children who are at school
or nursery school – aged from 4 or 5 to 10, say. If this is the case, they may suggest using
a sampling frame of schools and nurseries and sampling from their lists. Another
example is to use doctors’ lists (if possible).


Sampling scheme: First, it is important to briefly describe your preferred sampling
scheme. Examples include clustering (area of the country/type of school, . . ., junior,
infants, pre-school etc.) or stratified (stratification factors: gender, age group) random
sampling. Second, it is important to briefly explain why these schemes would be
advantageous.
Source of selection bias: Selection bias will arise from the omission of those who are
not at school or pre-school (in most countries, school is compulsory only from the age of
five or six) and those who are home-schooled. (Not to be confused with response bias
– for example, children responding differently if their teachers ask the
questions.)
Way to address it: Reset the target population group to match what the sampling
frame is actually providing.
ii. A good description of longitudinal studies could include the following points:
∗ A longitudinal survey is a survey where the same individuals are re-surveyed over
time.
∗ For example, we may form a group of students from a young age, say 7 years old, and
keep surveying their interaction with the internet every year until they become 10
years old.
Some ways in which panel surveys are different from longitudinal surveys are listed below:
∗ They are more likely to be chosen by quota rather than random methods.
∗ Individuals are interviewed every two to four weeks (rather than every few years).
∗ Individuals are unlikely to be panel members for longer than two years at a time.


Examiners’ commentaries 2021


ST104a Statistics 1

Important note

This commentary reflects the examination and assessment arrangements for this course in the
academic year 2020–21. The format and structure of the examination may change in future years,
and any such changes will be publicised on the virtual learning environment (VLE).

Information about the subject guide and the Essential reading references

Unless otherwise stated, all cross-references will be to the latest version of the subject guide (2019).
You should always attempt to use the most recent edition of any Essential reading textbook, even if
the commentary and/or online reading list and/or subject guide refer to an earlier edition. If
different editions of Essential reading are listed, please check the VLE for reading supplements – if
none are available, please use the contents list and index of the new edition to find the relevant
section.

Comments on specific questions – Zone B

Candidates should answer THREE questions: all parts of Section A (50 marks in total) and TWO
questions from Section B (25 marks each). Candidates are strongly advised to divide their
time accordingly.

Section A

Answer all parts of question 1 (50 marks in total).

Question 1

(a) Suppose that x1 = −5, x2 = 36, x3 = 4, and y1 = −7, y2 = 0.5, y3 = 1.
Calculate the following quantities:

i. \( \sum_{i=2}^{3} \sqrt{x_i} \)    ii. \( \left( \sum_{i=1}^{3} x_i y_i \right)^2 \)    iii. \( |y_1| + \sum_{i=2}^{3} \frac{y_i^3}{y_i^2} \).

(6 marks)

Reading for this question


This question refers to the basic bookwork which can be found in Section 2.9 of the subject
guide.
Approaching the question
Be careful to leave the xi s and yi s in the order given and only cover the values of i asked for.
This question was generally well done; the answers are as follows.


i. We have:
\[ \sum_{i=2}^{3} \sqrt{x_i} = \sqrt{36} + \sqrt{4} = 6 + 2 = 8. \]
Note that 36 and 4 also have the negative square roots −6 and −2,
respectively. For this reason, −8 as a final answer was also accepted as correct.
ii. We have:
\[ \left( \sum_{i=1}^{3} x_i y_i \right)^2 = ((-5 \times -7) + (36 \times 0.5) + (4 \times 1))^2 = (57)^2 = 3{,}249. \]

iii. We have:
\[ |y_1| + \sum_{i=2}^{3} \frac{y_i^3}{y_i^2} = |y_1| + \sum_{i=2}^{3} y_i = |-7| + 0.5 + 1 = 8.5. \]
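
A quick way to check summation answers of this kind is sketched below. The list names are illustrative; the only point to watch is that Python lists are indexed from 0, so the terms i = 2, 3 in the question correspond to the slice `[1:]`.

```python
from math import sqrt

x = [-5, 36, 4]
y = [-7, 0.5, 1]

# Terms i = 2, 3 of the question are the slice [1:] of a 0-indexed list
part_i = sum(sqrt(xi) for xi in x[1:])
part_ii = sum(xi * yi for xi, yi in zip(x, y)) ** 2
part_iii = abs(y[0]) + sum(yi ** 3 / yi ** 2 for yi in y[1:])

print(part_i, part_ii, part_iii)
```

Note that `sqrt` returns the non-negative root only, so the script gives 8 rather than −8 for part i.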

(b) Classify each one of the following variables as either measurable (continuous) or
categorical. If a variable is categorical, further classify it as either nominal or
ordinal. Justify your answer. (No marks will be awarded without a justification.)
i. Exchange rate between two currencies.
ii. Candidate number in an examination.
iii. University degree type (in terms of Bachelors, Masters, Ph.D.).
(6 marks)

Reading for this question


This question requires identifying types of variables so reading the relevant section in the
subject guide (Section 4.6) is essential. Candidates should gain familiarity with the notion of
a variable and be able to distinguish between discrete and continuous (measurable) data. In
addition to identifying whether a variable is categorical or measurable, further distinctions
between ordinal and nominal categorical variables should be made by candidates.
Approaching the question
A general tip for identifying continuous and categorical variables is to think of the possible
values they can take. If these are finite and represent specific entities the variable is
categorical. Otherwise, if these consist of numbers corresponding to measurements, the data
are continuous and the variable is measurable. Such variables may also have measurement
units or can be measured to various decimal places.

i. Measurable. An exchange rate is the price of one currency in terms of another, and
can be measured to several decimal places.
ii. Categorical, nominal. Candidate numbers are used for identification purposes only.
iii. Categorical, ordinal. A Ph.D. ranks higher than a Masters, which in turn ranks higher
than a Bachelors.

Weak candidates did not provide a justification for their choices, classified a measurable
variable as nominal or categorical, and sometimes answered ordinal when their justification
pointed towards a nominal variable. Phrases such as ‘It is measurable
because it can be measured’ were not awarded any marks.

(c) State whether the following are true or false and give a brief explanation. (No
marks will be awarded for a simple true/false answer.)
i. For a set of observations x1, x2, . . . , xn, with mean x̄, then:
\[ \sum_{i=1}^{n} (x_i - \bar{x}) < 0. \]


ii. For two mutually exclusive events A and B, then:
\[ P(A \cup B) > P(A) + P(B). \]
iii. For a random variable X, (E(X))² can be greater than E(X²).
iv. Failure to reject a false null hypothesis is known as the power of a test.
v. A 5-by-3 contingency table which results in a χ² test statistic value of 15.312 is
statistically significant at the 1% significance level.
(10 marks)

Reading for this question


This question contains material from various parts of the subject guide. Here, it is more
important to have a good intuitive understanding of the relevant concepts than the technical
level in computations. Part i. refers to Chapter 4 and in particular Section 4.9.3, whereas
part ii. requires knowledge of basic probability properties that can be found in Section 5.9.
Part iii. is about population mean and variance, see Section 6.7. Part iv. targets concepts
related to hypothesis testing, covered in Chapter 8. Finally, part v. focuses on material of
Chapter 9, and more specifically Section 9.7.
Approaching the question
Candidates always find this type of question tricky. It requires a brief explanation of the
reason for a true/false and not just a choice between the two. Some candidates lost marks for
long rambling explanations without a decision as to whether a statement was true or false.
i. False. Since:
\[ \sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0. \]

ii. False. Since A and B are mutually exclusive then P(A ∩ B) = 0 and:
\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) = P(A) + P(B) - 0 = P(A) + P(B). \]
iii. False. Variances can never be negative, hence:
\[ \text{Var}(X) = E(X^2) - (E(X))^2 \geq 0 \;\Rightarrow\; E(X^2) \geq (E(X))^2. \]
iv. False. Failure to reject a false null hypothesis is known as a Type II error. Or, power is
the probability of rejecting a false null hypothesis.
v. False. With (5 − 1)(3 − 1) = 8 degrees of freedom, \( \chi^2_{0.01,\,8} = 20.09 > 15.312 \), hence it is
not statistically significant at the 1% significance level.

(d) X is a normal random variable with a mean of µ = 7. If P(X < 4) = 0.40,
approximately what is the value of the variance, σ²?
(5 marks)

Reading for this question


This question covers the normal distribution, for which read Section 6.8 of the subject guide.
Approaching the question
We have:
\[ P(X < 4) = P\left(Z < \frac{4 - 7}{\sigma}\right) = P\left(Z < -\frac{3}{\sigma}\right) = 0.40 \]
where Z ∼ N(0, 1). Since P(Z > 0.25) = P(Z < −0.25) ≈ 0.40, we have:
\[ -0.25 = -\frac{3}{\sigma} \;\Rightarrow\; \sigma = 12 \]
so, approximately, Var(X) = (12)² = 144.
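
The standardisation step can be checked numerically. The sketch below (illustrative variable names, standard library only) compares the table-based value z = −0.25 used above with the exact 40th percentile of the standard normal distribution.

```python
from statistics import NormalDist

mu, x, prob = 7, 4, 0.40

# Exact lower 40% point of N(0, 1), roughly -0.2533
z_exact = NormalDist().inv_cdf(prob)
sigma_exact = (x - mu) / z_exact

# Value read from statistical tables in the solution above
sigma_tables = (x - mu) / -0.25

print(f"exact:  sigma = {sigma_exact:.2f}, variance = {sigma_exact ** 2:.1f}")
print(f"tables: sigma = {sigma_tables:.0f}, variance = {sigma_tables ** 2:.0f}")
```

The exact quantile gives a variance of about 140 rather than 144, which is why the question asks only for an approximate value.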


(e) The probability distribution of a random variable X is given below.


X = x        10    20    30    40    50
P(X = x)      k    3k   12k    3k     k

i. Explain why k = 0.05.
(2 marks)
ii. Calculate the variance of X.
(3 marks)
iii. Calculate P (X > E(X) | X > 10).
(3 marks)
iv. Does X have a normal distribution? Briefly explain your answer.
(2 marks)

Reading for this question


This is a question on probability, exploring the concepts of relative frequency, conditional
probability and probability distributions. Reading from Chapter 5 is suggested with a focus
on the sections on these topics. Try Activity A5.1 and the exercises on probability trees. For
part iv., and in particular the discrete uniform distribution, read Section 9.8.
Approaching the question
i. The probabilities must sum to 1, so:
\[ \sum_x p(x) = k + 3k + 12k + 3k + k = 20k = 1 \;\Rightarrow\; k = 0.05. \]

ii. We have:
\[ E(X) = \sum_x x\,p(x) = 10 \times 0.05 + 20 \times 0.15 + 30 \times 0.60 + 40 \times 0.15 + 50 \times 0.05 = 30 \]
and:
\[ E(X^2) = \sum_x x^2\,p(x) = (10)^2 \times 0.05 + (20)^2 \times 0.15 + (30)^2 \times 0.60 + (40)^2 \times 0.15 + (50)^2 \times 0.05 = 970 \]
hence:
\[ \text{Var}(X) = E(X^2) - (E(X))^2 = 970 - (30)^2 = 70. \]

iii. Since E(X) = 30, we have:
\[ P(X > 30 \mid X > 10) = \frac{P(\{X > 30\} \cap \{X > 10\})}{P(X > 10)} = \frac{P(X > 30)}{P(X > 10)} = \frac{0.20}{0.95} = 0.2105. \]

iv. X is discrete, but a normal random variable is continuous, hence X does not have a
normal distribution. Alternatively, an accurate mass function plot vs. a normal curve
could be provided.
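
Parts i. to iii. can be verified with exact arithmetic. The following is a sketch using `fractions.Fraction` so that no rounding is involved; the names `weights` and `pmf` are illustrative.

```python
from fractions import Fraction

values = [10, 20, 30, 40, 50]
weights = [1, 3, 12, 3, 1]            # multiples of k in the table

k = Fraction(1, sum(weights))         # probabilities must sum to 1
pmf = {x: w * k for x, w in zip(values, weights)}

mean = sum(x * p for x, p in pmf.items())
mean_sq = sum(x ** 2 * p for x, p in pmf.items())
variance = mean_sq - mean ** 2

# P(X > E(X) | X > 10) = P(X > E(X) and X > 10) / P(X > 10)
numerator = sum(p for x, p in pmf.items() if x > mean and x > 10)
denominator = sum(p for x, p in pmf.items() if x > 10)
conditional = numerator / denominator

print(float(k), mean, variance, float(conditional))
```

The conditional probability comes out as the exact fraction 4/19, which rounds to the 0.2105 given above.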

(f ) Based on the central limit theorem, you are told that a 99% confidence interval
for a population proportion is (0.5782, 0.7018).


i. What was the sample proportion which resulted in this confidence interval?
(2 marks)
ii. What was the size of the sample used?
(4 marks)

Reading for this question


This question contains material on sample size determination in relation to the normal
distribution and the distribution of the sample proportion. Moreover, knowing the concept
of confidence intervals is essential. Sample size determination is covered in Section 7.11. For
confidence intervals, read Section 7.6 for the principle and Section 7.10 for the case of
proportions.

Approaching the question

i. The sample proportion, p, would be in the centre of the interval (0.5782, 0.7018). Adding
the two endpoints and dividing by 2 gives:
\[ p = \frac{0.5782 + 0.7018}{2} = 0.64. \]

ii. To find the sample size, note that the (estimated) standard error when estimating a
single proportion is:
\[ \sqrt{\frac{p(1 - p)}{n}} = \sqrt{\frac{0.64 \times 0.36}{n}} = \frac{\sqrt{0.64 \times 0.36}}{\sqrt{n}} = \frac{0.48}{\sqrt{n}}. \]

Since this is a 100(1 − α)% = 99% confidence interval, then α = 0.01, so the confidence
coefficient is z_{α/2} = z_{0.005} = 2.576. The margin of error is half the width of the
interval, (0.7018 − 0.5782)/2 = 0.0618. Therefore, to determine n we need to solve:
\[ 2.576 \times \frac{0.48}{\sqrt{n}} = 0.0618. \]

The correct sample size is n ≈ 401 (depending on rounding).
Given the approximate values obtained from tables, n = 400 or n = 402 were also taken
as correct depending on the decimal places of the z-value used (t should not be used).
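
The two steps above (centre of the interval, then solving for n) can be sketched as follows. The variable names are illustrative, and the z-value is computed by `statistics.NormalDist` rather than read from tables, so the result may differ slightly from a hand calculation.

```python
from statistics import NormalDist

lower, upper, level = 0.5782, 0.7018, 0.99

p = (lower + upper) / 2              # centre of the interval
margin = (upper - lower) / 2         # half-width = z * s.e.
z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 2.576

# Solve z * sqrt(p(1 - p) / n) = margin for n
n = z ** 2 * p * (1 - p) / margin ** 2

print(f"p = {p:.2f}, margin = {margin:.4f}, n = {n:.1f}")
```

The unrounded solution lies slightly above 400, which is consistent with the range of answers accepted by the examiners.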

(g) It is assumed that investors are equally split between those who prefer ‘growth’
stocks and those who prefer ‘value’ stocks. In a random sample of 500 investors,
255 agreed with the statement ‘Growth stocks are better than value stocks’.

i. Conduct a two-sided hypothesis test, at the 10% significance level, to test
whether in the population of investors there are equal preferences for growth
and value stocks. Show all your steps and use the ‘critical value’ approach to
perform the test.
(5 marks)

ii. Calculate the p-value of the test statistic value calculated in part i.
(2 marks)

Reading for this question


This question refers to a hypothesis test for a single proportion, for which read Section 8.14
of the subject guide. The second part of the question looks at p-values and the relevant
section is Section 8.11.


Approaching the question

i. We test H0 : π = 0.50 vs. H1 : π ≠ 0.50.

The sample proportion is p = 255/500 = 0.51. The test statistic value is:
\[ \frac{p - \pi}{\sqrt{\pi(1 - \pi)/n}} = \frac{0.51 - 0.50}{\sqrt{0.50 \times 0.50/500}} = 0.4472. \]

For α = 0.10, the critical values are ±z_{0.05} = ±1.645. Since 0.4472 < 1.645 we do not
reject H0, hence there is no evidence that π ≠ 0.50.

ii. The p-value is:
\[ 2 \times P(Z > 0.45) = 2 \times 0.3264 = 0.6528. \]
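
Both parts of this question can be reproduced with a few lines of code. This is a sketch with illustrative variable names; it uses the unrounded test statistic, so the p-value it prints (about 0.655) differs slightly from the table-based value above, which rounds z to 0.45.

```python
from math import sqrt
from statistics import NormalDist

n, successes, pi0, alpha = 500, 255, 0.50, 0.10

p = successes / n
z = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)

std_normal = NormalDist()
critical = std_normal.inv_cdf(1 - alpha / 2)       # about 1.645
reject_h0 = abs(z) > critical

# Two-sided p-value: probability in both tails beyond |z|
p_value = 2 * (1 - std_normal.cdf(abs(z)))

print(f"z = {z:.4f}, critical = ±{critical:.3f}, reject H0: {reject_h0}")
print(f"p-value = {p_value:.4f}")
```

Since the p-value is far greater than 0.10, the p-value approach leads to the same decision as the critical value approach.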

Section B

Answer two out of the three questions from this section (25 marks each).

Question 2

(a) A study is made for a particular allergy medication in order to determine the
length of relief it provides y (in hours) in relation to the dosage of medication x
(in mg). For this reason, ten patients were given different doses of the
medication and were asked to report back when the medication seemed to wear
off.

Patient            #1    #2    #3    #4    #5    #6    #7    #8    #9   #10
Dosage (x)          3   3.5     4     5     6   6.5     7     8   8.5     9
Relief hours (y)  9.1   5.5  12.3   9.2  14.2  16.8  22.0  18.3  24.5  22.7

The summary statistics for these data are:

Sum of x data: 60.5            Sum of the squares of x data: 406.75
Sum of y data: 154.6           Sum of the squares of y data: 2,767.3
Sum of the products of x and y data: 1,049.1

i. Draw a scatter diagram of these data. Label the diagram carefully.

ii. Calculate the sample correlation coefficient. Interpret your findings.

iii. Calculate the least squares line of y on x and draw the line on the scatter
diagram.

iv. Suppose that you observe more data and when you draw the corresponding
scatter diagram a non-linear association is revealed. Discuss how this can be
interpreted in the context of the problem.

(13 marks)

Reading for this question


This is a standard regression question and the reading is to be found in Chapter 12. Section
12.6 provides details for scatter diagrams and is suitable for part i., whereas the remaining
parts are on correlation and regression that are covered in Sections 12.8–12.10. Section 12.7
is also relevant. Sample examination question 2 of this chapter is also recommended for
practice on questions of this type.


Approaching the question


i. Candidates are reminded that they are asked to draw and label the scatter diagram
which should include a title (‘Scatter diagram’ alone will not suffice) and labelled axes
which give their units in addition. Far too many candidates threw away marks by
neglecting these points and consequently were only given one mark out of the possible
four allocated for this part of the question.
[Scatter diagram of relief hours (y) against dosage in mg (x); figure not reproduced here.]

ii. The summary statistics can be substituted into the formula for the sample correlation
coefficient (make sure you know which one it is!) to obtain the value 0.9180. An
interpretation of this value is the following: The data suggest that the higher the dosage,
the longer the length of relief. The fact that the value is close to 1, suggests that this is a
strong, linear, positive association.
Many candidates did not mention all three words (strong, linear, positive). Note that all
of these words provide useful information on interpreting the association and are
therefore required.
iii. The regression line can be written by the equation ŷ = a + bx or y = a + bx + ε. The
formula for b is:
\[ b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} \]
and by substituting the summary statistics we get b = 2.7936. The formula for a is
a = ȳ − bx̄, so we get a = −1.4412. Hence the regression line can be written as:

\[ \hat{y} = -1.4412 + 2.7936x \quad \text{or} \quad y = -1.4412 + 2.7936x + \varepsilon. \]

It should also be plotted on the scatter diagram.

Many candidates incorrectly reported the regression line as y = −1.4412 + 2.7936x. This
expression is incorrect; one of the two forms above (a fitted value on the left, or an error
term on the right) is required. Also, many candidates did not draw this line on the scatter
diagram; instead they drew an approximate line through the points without reference to
the above equation. No marks were awarded in such cases.
iv. Some discussion is expected, mentioning, for example, that a higher dosage
beyond a certain value may not provide extra benefit and may even have adverse effects.
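
The correlation coefficient and the least squares estimates in parts ii. and iii. follow directly from the summary statistics. The sketch below uses illustrative names (`s_xx`, `s_yy`, `s_xy` for the corrected sums of squares and cross-products) and the standard library only.

```python
from math import sqrt

n = 10
sum_x, sum_x2 = 60.5, 406.75
sum_y, sum_y2 = 154.6, 2767.3
sum_xy = 1049.1

x_bar, y_bar = sum_x / n, sum_y / n

# Corrected sums of squares and cross-products
s_xx = sum_x2 - n * x_bar ** 2
s_yy = sum_y2 - n * y_bar ** 2
s_xy = sum_xy - n * x_bar * y_bar

r = s_xy / sqrt(s_xx * s_yy)   # sample correlation coefficient
b = s_xy / s_xx                # least squares slope
a = y_bar - b * x_bar          # least squares intercept

print(f"r = {r:.4f}")
print(f"fitted line: y-hat = {a:.4f} + {b:.4f}x")
```

The intercept may differ from the value above in the fourth decimal place, depending on whether a rounded or unrounded slope is carried forward. To draw the line, evaluate the fitted equation at two dosages within the observed range (for example x = 3 and x = 9) and join the two points.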


(b) A study focused on the perception of life satisfaction that may vary between
older and younger people. For this reason 12 adults over the age of 70 and 16
adults aged between 18 and 30 took a life satisfaction questionnaire that gave a
score for each one of them (high values of the score indicate higher life
satisfaction). Summaries of these scores are presented below.

                 Sample size   Sample mean   Sample variance
Older adults          12           33.5           16.0
Younger adults        16           29.0           15.3

i. Use an appropriate hypothesis test to determine whether the life satisfaction
scores were different between these two age groups. Test at two appropriate
significance levels, stating clearly the hypotheses, the test statistic and its
distribution under the null hypothesis. Comment on your findings.
ii. State clearly any assumptions you made in part i.
iii. Is it possible that there is no difference between older and younger adults in
terms of their life satisfaction? Discuss.
(12 marks)

Reading for this question


The first two parts of the question refer to a two-sided hypothesis test comparing two
population means. While the entire chapter on hypothesis testing is relevant, one can focus
on Section 8.16. The third part of the question is open-ended but can be approached most
likely as the difference between causation and association. The content of Chapter 11 is
relevant.
Approaching the question
i. Let µA denote the mean score for older adults and µB the mean score for younger adults.
We test:
H0 : µA = µB vs. H1 : µA ≠ µB.
The test statistic value is 2.9838 (using the pooled sample variance of 15.5962). (If equal
variances are not assumed the test statistic value is 2.9740.) For reference the test
statistic formula is:
\[ \frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_p^2 (1/n_A + 1/n_B)}} \quad \text{or} \quad \frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_A^2/n_A + s_B^2/n_B}}. \]

For α = 0.05, the critical values are ±2.056 (the t26 distribution is used). Decision: we
reject H0 . Choosing a smaller α, say α = 0.01, the critical values are ±2.779, hence again
we reject H0 (again, the t26 distribution is used). Hence there is strong evidence of a
difference in the mean scores of life satisfaction between older and younger adults.
ii. The assumptions for part i. concerned:
∗ equal variances
∗ whether nA + nB is ‘large’ so that the normality assumption is satisfied
∗ independent samples.
Some candidates stated assumptions in this part that were not made in part (i). Marks
were not awarded in such cases. Also, some other candidates just copied the phrase
‘assumption about equal variances’ and naturally were not awarded any marks. One
should state whether the calculations were based on the assumption that unknown
variances are equal or unequal.
iii. An example related to confounding (causation versus association) is expected, such as
‘perhaps the younger people who participated in the study were university students who
had just obtained a poor mark in one of their courses’.
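
The test statistics quoted in part i. can be checked from the summary table. This is a sketch with illustrative variable names; the critical values of the t distribution with 26 degrees of freedom are copied from the text, since the standard library does not provide them.

```python
from math import sqrt

n_a, mean_a, var_a = 12, 33.5, 16.0   # older adults
n_b, mean_b, var_b = 16, 29.0, 15.3   # younger adults

# Pooled variance, assuming equal population variances
df = n_a + n_b - 2
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
t_pooled = (mean_a - mean_b) / sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Alternative statistic without the equal-variance assumption
t_unpooled = (mean_a - mean_b) / sqrt(var_a / n_a + var_b / n_b)

# Two-sided critical values of t with 26 degrees of freedom, from the text
critical = {0.05: 2.056, 0.01: 2.779}
for alpha, cv in critical.items():
    print(f"alpha = {alpha}: reject H0 = {abs(t_pooled) > cv}")
print(f"pooled variance = {pooled_var:.4f}, t = {t_pooled:.4f} (df = {df})")
print(f"unpooled t = {t_unpooled:.4f}")
```

The two sample variances are very similar, so the pooled and unpooled statistics are close and lead to the same decision at both significance levels.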


Question 3

(a) A variety of a broad bean plant is studied and the number of beans per plant is
counted and listed below.
71 94 62 74 106
76 87 94 76 78
83 56 78 79 80
60 92 54 81 45
72 54 45 85 72
74 65 68 55 66

i. Carefully construct, draw and label a stem-and-leaf diagram of these data.


ii. Find the mean (given that the sum of the data is 2,182), the median and the
modal stem.
iii. Comment on the data based on the shape of the stem-and-leaf diagram and
the measures you have calculated.
iv. Name two other types of graphical displays that would be suitable to
represent the data.
(12 marks)

Reading for this question


Chapter 4 provides all the relevant material for this question. More specifically, reading on
stem-and-leaf diagrams can be found in Section 4.7.4, but the entire Section 4.7 is highly
relevant. For measures of location (mean, median, modal group) see Section 4.8.
Approaching the question
i. A stem-and-leaf plot compatible with what the examiners were expecting to see is shown
below.
Stem-and-leaf plot of number of beans in each plant
Stem = 10s of beans | Leaf = beans
4 | 55
5 | 4456
6 | 02568
7 | 1224466889
8 | 01357
9 | 244
10 | 6
Marks were also awarded for title, sensible choice of stems, stem and leaf labels,
accuracy, and having adequate vertical alignment when drawing the figure.
ii. The requested results are below and are simple to compute.
∗ Mean: 72.7333 beans per plant.
∗ Median: 74 beans per plant.
∗ Modal stem: 70s.
Make sure to use measurement units.
iii. The distribution of the data appears approximately symmetric, with a very slight
negative skew. This is also supported by the fact that the mean is slightly less than the
median.
iv. Some graphs are listed below, although there are more.
∗ Boxplot
∗ Histogram.
∗ Dot plot.
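
For checking accuracy when practising, the stem-and-leaf diagram and the summary measures in parts i. and ii. can be generated as in the sketch below. The variable names are illustrative, and in the examination the diagram must of course be drawn and labelled by hand.

```python
from collections import defaultdict
from statistics import mean, median

beans = [
    71, 94, 62, 74, 106,
    76, 87, 94, 76, 78,
    83, 56, 78, 79, 80,
    60, 92, 54, 81, 45,
    72, 54, 45, 85, 72,
    74, 65, 68, 55, 66,
]

# Group the units digit (leaf) of each value under its tens digit (stem)
stems = defaultdict(list)
for value in sorted(beans):
    stems[value // 10].append(value % 10)

print("Stem = 10s of beans | Leaf = beans")
for stem in range(min(stems), max(stems) + 1):
    leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem:>2} | {leaves}")

# The modal stem is the stem with the most leaves
modal_stem = max(stems, key=lambda s: len(stems[s]))
print(f"mean = {mean(beans):.4f}, median = {median(beans)}, modal stem = {modal_stem}0s")
```

Sorting the data first guarantees that the leaves on each stem appear in ascending order, which is one of the accuracy points the examiners look for.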


(b) A researcher is interested in determining whether a particular pill provides
effective treatment for stomach pain. A randomised experiment was conducted
to address this question. The study randomly allocated 200 people to either a
group where the pill was administered, or a group where a placebo pill was
given. These people were monitored and the numbers of those who got better
(or did not) were recorded. The results are summarised below:

           Did not get better   Got better
Pill               21               74
Placebo            50               55

i. Give a 95% confidence interval for the difference in the probabilities of
getting better from stomach pain between those who took the pill and the
placebo group.
ii. Carry out an appropriate hypothesis test at the 5% significance level to
determine whether the probability of getting better is higher in the pill
group, compared to the probability in the placebo group. State
the test hypotheses, and specify your test statistic and its distribution under
the null hypothesis. Comment on your findings.
iii. State any assumptions you made in part ii.
iv. On the basis of the data alone, would you conclude that the particular pill
increases the chances of getting better from stomach pain? Provide an
explanation with your answer.
(13 marks)

Reading for this question


Look up the sections about hypothesis testing and confidence intervals for differences in
proportions; more specifically Sections 7.12 and 8.15.
Approaching the question
i. Let p1 and n1 refer to the proportion of those who got better in the pill group and to the
sample size of the pill group, respectively. Similarly, denote by p2 and n2 the
corresponding quantities in the placebo group. Marks were awarded for the following
points:
∗ Calculation of standard error:
\[ \text{s.e.}(p_1 - p_2) = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}} = 0.0647. \]
∗ Calculation of lower and upper bounds 0.1283 and 0.3820.
∗ Present as an interval (0.1283, 0.3820).
ii. Let π1 denote the pill group proportion and π2 the placebo group proportion. Also,
denote by p the overall sample proportion of those who got better. Marks were awarded
for the following points. We test:

H0 : π1 = π2 vs. H1 : π1 > π2.

The test statistic (p1 − p2)/s.e.(p1 − p2) follows a standard normal distribution,
approximately. The calculation of the standard error is:
\[ \text{s.e.}(p_1 - p_2) = \sqrt{p(1 - p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = 0.0678. \]

The test statistic value is 3.7655. For α = 0.05, the critical value is 1.645, hence we reject
H0 at the 5% significance level. There is evidence that the pill increases the chances of
getting better (can be mentioned in iv.).


iii. Below are a couple of assumptions made:


∗ Sample sizes are large to justify the normality assumption.
∗ Independent samples.

iv. Some brief discussion is expected including the following points:


∗ The evidence from both analyses points in that direction.
∗ One has to be cautious as the above implies association, not causation.

Question 4

(a) A study looked into the views of workers towards school closure in order to
reduce coronavirus transmission. 300 participants from three sectors
(hospitality, banking and construction) were interviewed and their responses
were classified into three categories, namely positive, neutral and negative. The
data are shown below.

                   View Towards School Closure
               Positive   Neutral   Negative   Total
Hospitality       35         25        40       100
Banking           28         47        25       100
Construction      38         22        40       100
Total            101         94       105       300

i. Based on the data in the table, and without conducting a significance test,
describe the differences of views across the different sectors.

ii. Calculate the χ2 statistic and use it to test for independence, using a 5%
significance level. What do you conclude?
(13 marks)

Reading for this question


This question targets Chapter 9 on contingency tables and chi-squared tests. Note that
part i. of the question does not require any calculations, just understanding and
interpreting contingency tables. Part ii. is a straightforward chi-squared test and the reading
is also given in Chapter 9.

Approaching the question


i. An example of a ‘good’ answer is given below:
There are some differences in the distributions across sectors. More specifically, the
proportion of participants with positive views vs. neutral seems higher in hospitality and
construction sectors (35% vs. 25% and 38% vs. 22%, respectively). However, in the
banking sector the proportion of participants with neutral views is larger (47% vs. 25%).
Hence there seems to be an association between sector and views, although this needs to
be investigated further.

ii. Set out the null hypothesis that there is no association between sector and view towards
school closures against the alternative, that there is an association. Be careful to get
these the correct way round! We test:

H0 : No association between sector and view towards school closures

vs.
H1 : Association between sector and view towards school closures.


Work out the expected values to obtain the table below:

               Positive   Neutral   Negative
Hospitality     33.67      31.33     35.00
Banking         33.67      31.33     35.00
Construction    33.67      31.33     35.00

The test statistic formula is:
\[ \chi^2 = \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} \]
which gives a value of 17.744. This is a 3 × 3 contingency table so the degrees of freedom
are (3 − 1) × (3 − 1) = 4. For α = 0.05, the critical value is 9.488, hence we reject H0 . For
α = 0.01, the critical value is 13.277, hence we again reject H0 . We conclude that there is
strong evidence of an association between sector and view towards school closures.
Many candidates looked up the tables incorrectly and so failed to follow through their
earlier accurate work.
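
The calculation in part ii. can be checked with a short script. The following is only a minimal sketch in Python: the function names are illustrative, and the `observed` table is hypothetical (it is not the table from the examination paper). The upper-tail formula used is the closed form that holds for a chi-squared distribution with an even number of degrees of freedom, which covers the 4 degrees of freedom here.

```python
import math


def expected_counts(observed):
    """Expected counts under independence: row total x column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared_statistic(observed):
    """Sum of (O - E)^2 / E over all cells of the contingency table."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


def degrees_of_freedom(observed):
    """(number of rows - 1) x (number of columns - 1)."""
    return (len(observed) - 1) * (len(observed[0]) - 1)


def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-squared variable X with an even number of degrees of freedom."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


# Hypothetical 3 x 3 table of counts (rows: sectors, columns: views).
observed = [
    [30, 30, 40],
    [40, 30, 30],
    [30, 40, 30],
]

statistic = chi_squared_statistic(observed)
df = degrees_of_freedom(observed)
p_value = chi2_upper_tail_even_df(statistic, df)
print(f"statistic = {statistic:.3f}, df = {df}, p-value = {p_value:.4f}")
print("Reject H0 at the 5% level" if p_value < 0.05 else "Do not reject H0 at the 5% level")
```

Comparing the p-value with 0.05 is equivalent to comparing the test statistic with the tabulated critical value. As a check on table look-ups, `chi2_upper_tail_even_df(9.488, 4)` is approximately 0.05, and the value at 17.744 is below 0.01.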

(b) i. Describe what selection bias is and when it may occur. Give an example.
ii. You have been asked to design a nationwide survey in your country to find
out about working conditions among employees in the postal offices. Provide
a probability sampling scheme and a sampling frame that you would like to
use. Identify a potential source of response bias that may occur and discuss
how this issue could be addressed.
(12 marks)

Reading for this question


This was a question on basic material on survey design. Background reading is given in
Chapters 10 and 11 of the subject guide which, along with the recommended reading, should
be looked at carefully. Selection bias is explicitly discussed on page 249. Candidates were
expected to have studied and understood the main elements of design in random sampling.
It is also a good idea to try the learning activities of Chapter 10.
Approaching the question
One of the main things to avoid in this part is writing essays without any structure. This
exercise asks for specific things, each of which requires only one or two lines. If you are
unsure of what these things are, do not write lengthy essays. Doing so gains you nothing
and wastes valuable examination time. If you can identify what is being asked, keep in
mind that the answer should not be long.
Note also that in some cases there is no unique answer to the question.
i. Selection bias occurs when part of the population is excluded from, or under-represented
in, the sample. It can occur when (i) the sampling frame is not equal to the target
population, or (ii) the sampling frame is not strictly adhered to, or (iii) non-random
sampling is used. As an example, consider an online survey, which will exclude people
with no access to the internet.
ii. An indicative, ‘good’ answer could contain the following points:
Sampling frame: Contact postal offices to obtain a list of employees (and obtain
permission for the study). The list could include email addresses or other contact
information.
Linking sampling frame with selection bias: For example, if email addresses are
used, some employees may be excluded.
Sampling scheme: First, it is important to briefly describe your preferred sampling
scheme. Examples include cluster sampling (by area of the country) or stratified random
sampling (stratification factors: gender, age group). Second, it is important to briefly
explain why these schemes would be advantageous.
Source of response bias: Response bias could arise because employees could be
hesitant to criticise the conditions in their workplace or to report their income.
Way to address it: Make sure that the data will be anonymised and that the
respondents will be assured of that.
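
To make the stratified option concrete, the sketch below draws a simple random sample of the same fraction within each stratum. It is an illustration only: the sampling frame, the employee identifiers, the age-group labels and the 10% sampling fraction are all hypothetical, and `stratified_sample` is an illustrative helper rather than a standard library function.

```python
import random
from collections import defaultdict


def stratified_sample(frame, stratum_of, fraction, seed=0):
    """Simple random sample of the given fraction within each stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for unit in frame:
        strata[stratum_of(unit)].append(unit)

    sample = []
    for name in sorted(strata):
        members = strata[name]
        # Take at least one unit per stratum so that no stratum is left out.
        size = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, size))
    return sample


# Hypothetical sampling frame: (employee id, age group) for 300 employees.
frame = [(i, "40 and over" if i % 3 == 0 else "under 40") for i in range(300)]

sample = stratified_sample(frame, stratum_of=lambda unit: unit[1], fraction=0.1)
print(len(sample), "employees selected")
```

Because the same fraction is sampled in every stratum, each age group appears in the sample in the same proportion as in the sampling frame, which is one of the advantages a good answer could mention.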
