GEA1000 Lecture Notes
GEA1000 Quantitative Reasoning with Data is a module that aims to equip students
with essential data literacy skills to analyse data and make decisions under uncertainty.
It covers the basic principles and practice for collecting data and extracting useful in-
sights, illustrated in a variety of application domains. For example, when two issues are
correlated (e.g. smoking and cancer), how can we tell whether the relationship is causal
(e.g. smoking causes cancer)? How can we analyse categorical data? What about numer-
ical data? What about uncertainty and complex relationships? These and many other
questions will be addressed using data software and computational tools, with real-world
data sets.
The framework that we will be making reference to frequently in this course is the PPDAC cycle.[1] The figure below is a representation of the data problem-solving cycle, "Problem, Plan, Data, Analysis and Conclusion", intended[2]

    "(to) document the stages a person would undertake when solving a problem using numerical evidence, using data which they had collected themselves, or from existing (public) data sets, (where) analysis methods can include machine learning algorithms, as well as more traditional statistical techniques."
The following figure briefly describes what happens at each stage of the PPDAC cycle.[3]

[1] Spiegelhalter, D. (2019). The Art of Statistics. Penguin/Pelican Books.
[2] Wolff, A. et al. (2016). Creating an Understanding of Data Literacy for a Data-driven Society. The Journal of Community Informatics, 12(3), 9–26.
[3] Spiegelhalter, D. (2019). The Art of Statistics. Penguin/Pelican Books.
This set of notes is meant to follow the four chapters of the module closely. The topics
covered in the chapters are summarised below.
Chapter 1: Getting data. Data collection and sampling. Experiments and observa-
tional studies. Data cleaning and recoding. Interpreting summary statistics (mode,
mean, quartiles, standard deviation etc.)
Chapter 2: Categorical data analysis. Bar plots, contingency tables, rates and basic rules on rates. Association, confounders and Simpson's Paradox.
Chapter 3: Dealing with numerical data. Univariate and bivariate data. Histograms,
boxplots and scatter plots. Correlation, ecological and atomistic fallacies and simple
linear regression.
Chapter 4: Statistical inference. Probability, conditional probability and indepen-
dence. Prosecutor’s fallacy, base rate fallacy and conjunction fallacy. Discrete and
continuous random variables. Interpreting confidence intervals. Hypothesis testing
and learning about population based on a sample. Simple simulation.
Exploratory data analysis (EDA) will be incorporated extensively into the content of
the module. Students will appreciate that even simple plots and contingency tables can
give them valuable insights about data. There will be an emphasis on using suitable real-world data sets as motivating examples to introduce content and, through the process of problem solving, elucidate the techniques and materials in the syllabus.
Chapter 1. Getting Data
Discussion 1.1.1 Data exists in our everyday life. As we flip through our newspapers each day, we
see evidence of data being used and many questions being asked about data that has been collected. In
other words, we see that research is becoming data driven and it is fast becoming necessary for one to be
proficient in reasoning quantitatively. The ability to investigate and make sense of a data set is a core 21st century skill that any undergraduate, regardless of discipline, should acquire.
An online article in 2021 reported a fall in marriages and divorces in Singapore in 2020, amid COVID-19 restrictions and uncertainty. (Source: https://www.todayonline.com/singapore/fall-singapore-marriages-divorces-2020-amid-covid-19-restrictions-uncertainty)
After reading the article, it is natural for one to ask questions on how the conclusion was arrived at.
What kind of data was collected that supported this conclusion? Is the conclusion made correctly?
Definition 1.1.2 A population is the entire group (of individuals or objects) that we wish to know
something about.
Definition 1.1.3 A research question is usually one that seeks to investigate some characteristic of a
population.
The following are some examples.
1. What is the average number of hours that students study each week?
2. Are student athletes more likely than non-athletes to do final year projects?
Broadly speaking, we can classify research questions into the following categories.
1. To make an estimate about the population.
2. To test a claim about the population.
3. To compare two sub-populations / to investigate a relationship between two variables in the pop-
ulation.
Example 1.1.5 Having a well designed research question is a critical beginning to any data driven
research problem. While an in-depth discussion on how research questions can be designed is beyond
the scope of this course, the following table gives a few examples and provides some insight into the considerations and desirable features of good research questions.
Considerations: Narrow vs. Less Narrow
Example of a neutral research question (Q1): Do Primary Six students have an average sleep time of 7 hours a day?
Example of a better research question (Q2): Do Primary Six students have an average sleep time of 7 hours a day? What are some variables that may play a part in affecting the number of hours they sleep?
Explanation: Q1 is too narrow as it can be answered with a simple statistic. It does not look at any other context surrounding the issue. Q2 is less narrow and attempts to go beyond simply finding some data or numbers. It seeks to understand the bigger picture too.

Considerations: Unfocussed vs. Focussed
Example of a neutral research question (Q1): What are the effects of eating more than 2 meals of fast food per week?
Example of a better research question (Q2): How does eating more than 2 meals of fast food per week affect the BMI (Body Mass Index) of children between 10 to 12 years old in Singapore?
Explanation: Q1 is too broad, which makes it difficult to identify a research methodology. Q2 is focussed and clear on what data is to be collected and analysed.

Considerations: Simple vs. Complex
Example of a neutral research question (Q1): How are schools in Singapore addressing the issue of mental health among school children?
Example of a better research question (Q2): What are the effects of intervention programs implemented at schools in Singapore on the mental health of school children aged 13 to 16?
Explanation: Q1 is simple and such information can be obtained with an online search, with no analysis required. Q2 is more complex and requires both investigation and evaluation, which may lead the researcher to form an argument.
We will now proceed to describe the process of Exploratory Data Analysis (EDA).
Definition 1.1.6 Exploratory Data Analysis (EDA) is a systematic process where we explore a data set
and its variables and come up with summary statistics as well as plots. EDA is usually done iteratively
until we find useful information that helps us answer the questions we have about the data set.
In general, the steps involved in EDA are:
1. Generate questions about the data set that we are interested in investigating.
2. Search for answers to the research questions using data visualisation tools. In the process of exploration, we could also perform data modelling (e.g. regression analysis).
3. We ask ourselves the following question: to what extent does the data we have answer the questions we are interested in?
4. We refine our existing questions or generate new questions about the data before going back to the
data for further exploration.
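To make these steps concrete, here is a minimal sketch of one EDA iteration in Python using the pandas library (one of many possible tools; the software used in the module may differ). The file name penguins.csv and the column names are hypothetical placeholders.

    import pandas as pd

    # Load the data set (hypothetical file and column names).
    df = pd.read_csv("penguins.csv")

    # Step 1: generate a question, e.g. "do the species differ in body mass?"
    # Step 2: search for answers using summary statistics and plots.
    print(df.describe())                  # summary statistics for numerical variables
    print(df["species"].value_counts())  # counts for a categorical variable
    df.boxplot(column="body_mass_g", by="species")

    # Steps 3 and 4: judge how well the data answers the question,
    # refine the question (or pose new ones) and explore again.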
Definition 1.2.1 A population of interest refers to the group about which we wish to draw conclusions in a study.
Definition 1.2.2 A population parameter is a numerical fact about the population, for example an average, a median or a standard deviation.
Example 1.2.3 The following are some examples of a population and an associated population param-
eter.
1. The average height (population parameter) of all primary six students in a particular primary school
(population).
2. The median number of modules taken (population parameter) by all first year undergraduates in a
University (population).
3. The standard deviation of the number of hours spent on mobile games (population parameter) by
pre-schoolers aged 4 to 6 in Singapore (population).
Definition 1.2.4 A census is an attempt to reach out to the entire population of interest.
While it is obviously desirable to have a census, it is often not possible due to the high cost of conducting one. In addition, some studies are time-sensitive and a census typically takes a long time to complete, even when it is possible to do so. Furthermore, in a census attempt, one may not be able to achieve a 100% response rate.
Definition 1.2.5
1. It is usually not feasible to gather information from every member of the population, so we look at a sample, which is a subset of the population selected for the study.
2. Without the information from every member of the population, we will not be able to know exactly what the population parameter is. The hope is that the sample will be able to give us a reasonably good estimate of the population parameter. An estimate is an inference about the population parameter, based on the information obtained from a sample.
3. A sampling frame is the list from which the sample was obtained. Note that since a census does
not involve sampling, the notion of sampling frame is not applicable in that context.
Remark 1.2.6
1. Suppose the population of interest is people who drink coffee in Singapore. How should we
design a sampling frame for this population? The sampling frame may or may not cover the entire
population or it may contain units not in the population of interest. The all important question is
whether the sample obtained from such a sampling frame is still able to tell us something about
the population parameter. The following are some of the characteristics of the sampling frame that
we should pay attention to:
Does the sampling frame include all available sampling units from the population?
Does the sampling frame contain irrelevant or extraneous sampling units from another popu-
lation?
Does the sampling frame contain duplicated sampling units?
Does the sampling frame contain sampling units in clusters?
2. One of the conditions of generalisability, which is the ability to generalise the findings from a sample to the population, is that the sampling frame must be equal to or greater than the population of
interest. Note that this does not mean that when our sampling frame covers the entire population
of interest, our findings from the sample will always be generalisable to the population. It is
still an important question to know how the sample was collected. (See Remark 1.2.17 for more
information on the criteria for generalisability.)
Definition 1.2.7 When we sample from a population, we must try to avoid introducing bias into our sample. A biased sample will almost surely mean that our conclusion from the sample cannot be generalised to the population of interest. There are two major kinds of biases.
1. Selection bias is associated with the researcher's biased selection of units into the sample. This can be caused by an imperfect sampling frame, which excludes some units from being selected. Selection bias can also be caused by non-probability sampling (see Definition 1.2.15 and Example 1.2.16).
2. Non-response bias arises when responses are not obtained from some of the units selected into the sample, and the non-respondents differ from the respondents in ways that matter to the study (see the second part of Example 1.2.8).
Example 1.2.8
1. Suppose we would like to study the number of modules taken by all first year undergraduates in
a University. To collect a sample, the researcher went to two different lecture theatres to survey
undergraduates who were taking two different first year Engineering foundation (compulsory) mod-
ules. The sampling frame in this case consists of all undergraduates who were registered in the two
modules in the semester. Undergraduates who are not taking either of the two modules will not
have a chance to be sampled and thus the sampling frame is imperfect, leading to selection bias.
2. Suppose we would like to find out the proportion of students living at a boarding school who have
received some form of financial assistance in the past and if they had received financial assistance,
what was the quantum they received. A questionnaire was distributed to all students via a survey
form slipped under their room doors and instructions were given to them to complete the form
and drop it in a collection box if they had received financial assistance before. Students do not
need to return the form if they had not received any form of financial assistance previously. The data collected this way is likely to be biased due to non-response, as students who had actually received financial assistance in the past may be reluctant to share this information or to be seen by their friends when dropping the form into the collection box. This will likely result in an underestimate of the proportion of students who had received financial assistance.
Section 1.2. Sampling 5
Definition 1.2.9 Probability sampling is a sampling scheme such that the selection process is done via
a known randomised mechanism. It is important that every unit in the sampling frame has a known non-zero probability of being selected, but the probability of being selected does not have to be the same for all the units. The randomised mechanism is important as it introduces an element of chance into the selection process so as to eliminate bias.
1. Simple random sampling (SRS) - this happens when units are selected randomly from the sampling
frame. More specifically, a simple random sample of size n consists of n units from the popula-
tion chosen in such a way that every set of n units has an equal chance to be the sample actually
selected. We are referring to sampling without replacement here, where a unit chosen in the sample
is removed and has no chance of being chosen again into the same sample. A useful way to perform
simple random sampling is to use a random number generator. While it is expected that differ-
ent samples sampled from the same sampling frame using SRS would be different, the variability
between the samples is entirely due to chance.
Example 1.2.10 The classic lucky draw that is carried out during dinners is the best example of
simple random sampling. In this case, every attendee has his/her lucky draw ticket placed inside a box and a simple random sample of these tickets is drawn out of the box, one at a time, without
replacement. If we assume that before each draw, the remaining tickets in the box are mixed properly such that every ticket is equally likely to be drawn out, then the probability of each ticket being drawn at any instance is 1/n, where n is the number of tickets remaining inside the box.
Example 1.2.11 Suppose we would like to sample 500 households in Singapore and find out how
many household members there are in each household. Let us assume that every household has a
unique home phone number. If we have a listing of all such phone numbers and list them from 1
to n, we can use a random number generator to select 500 phone numbers from the list to form
our sample. Unique phone calls (i.e. sampling without replacement) can then be made to these
households to survey the number of household members. This is another example of simple random
sampling. Notice that this example also illustrates a common shortcoming of SRS, in that it can
possibly be subjected to non-response from the units that are sampled.
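As a concrete illustration, the following Python sketch performs simple random sampling without replacement from a sampling frame of phone numbers, along the lines of Example 1.2.11; the frame size of 100000 is an assumption made purely for illustration.

    import random

    # Hypothetical sampling frame: household phone numbers labelled 1 to 100000.
    frame = list(range(1, 100001))

    # Simple random sampling without replacement: every set of 500 units
    # has the same chance of being the sample actually selected.
    sample = random.sample(frame, k=500)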
2. Systematic sampling is a method of selecting units from a list by applying a selection interval k
and a random starting point from the first interval. To carry out systematic sampling:
(a) Suppose we know how many sampling units there are in the population (denoted by p);
(b) we decide how big we want our sample to be (denoted by n), which means that we will select one unit from every k = p/n units;
(c) from 1 to k = p/n, select a number at random, say r.
With this, the sample will consist of the following units from the list:
r, r + k, r + 2k, . . . , r + (n − 1)k.
However, it is often the case that we do not know the number of sampling units p in the population. In such a situation, systematic sampling can still be done by deciding on the selection interval k, randomly selecting a unit from the first k units, and then sampling every kth unit thereafter. For example, if k = 10, we can sample the 5th, 15th, 25th units and so on.
Example 1.2.12 Suppose we know there are 110 sampling units in the population (so p = 110)
and we would like to select a sample with 10 units (so n = 10). Imagine the sampling units are
numbered 1 to 110 in a list and arranged according to the table below.
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48 49 50
51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70
71 72 73 74 75 76 77 78 79 80
81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100
101 102 103 104 105 106 107 108 109 110
Since p = 110 and n = 10, we select one unit from every k = 110/10 = 11 units. So we randomly select a number from 1 to 11 which will start off the sampling process. For example, if the number selected was 5, then our sample will comprise the elements
5, 16, 27, 38, 49, 60, 71, 82, 93, 104.
Similarly, if the number selected was 9, then our sample will comprise the elements
9, 20, 31, 42, 53, 64, 75, 86, 97, 108.
From this example, it should be clear that if the sampling units are listed with some inherent
pattern, then it is possible that the sample obtained could have selection bias.
3. Stratified sampling is a method where the sampling frame is divided into groups called strata. Units within the same stratum share similar characteristics, but the size of each stratum does not necessarily have to be the same. We then apply simple random sampling to each stratum to generate the overall sample. While stratified sampling is a commonly used probability sampling method,
there are some situations where it may not be possible to have information on the sampling frame
of each stratum in order to perform simple random sampling properly. Furthermore, depending on
how the strata are defined, we may face ambiguity in determining which stratum a particular unit
belongs to. This can complicate the sampling process.
Example 1.2.13 An example of stratified sampling can be seen during elections, for example,
a Presidential Election. Voters visit their designated polling stations to cast their votes for the
candidate that they wish to support. In countries where the number of voters is very large, it may
take a long time before all the votes are counted. Stratified sampling can be employed if we wish
to make a reasonably good prediction of the outcome. This is done by taking a simple random
sample of the voters at each polling station (stratum) and then computing the weighted average of
the overall vote count, based on the size of each stratum, for each candidate. This way, we would
be able to have a reasonably good estimate of the total votes each candidate would receive.
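A sketch of this idea in Python, with two hypothetical polling stations and made-up vote lists; the station sizes, the sample of 100 per stratum and the vote shares are all assumptions for illustration.

    import random

    # Hypothetical strata: polling station -> list of votes cast there.
    stations = {
        "Station 1": ["A"] * 600 + ["B"] * 400,
        "Station 2": ["A"] * 300 + ["B"] * 700,
    }

    total = sum(len(votes) for votes in stations.values())
    estimate = 0.0
    for votes in stations.values():
        srs = random.sample(votes, k=100)          # SRS within each stratum
        prop_a = srs.count("A") / len(srs)         # stratum-level proportion for candidate A
        estimate += (len(votes) / total) * prop_a  # weight by stratum size

    print(f"Estimated overall vote share for A: {estimate:.3f}")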
4. Cluster sampling is a method where the sampling frame is divided into clusters. A fixed number
of clusters are then selected using simple random sampling. All the units from the selected clusters
are then included in the overall sample. One advantage of this sampling method is that it is
usually simpler, less costly and less resource-intensive than other probability sampling methods.
The clusters are usually naturally defined which makes it easy to determine which cluster a unit
belongs to. The main disadvantage of this sampling method is that depending on which clusters
are selected, we may see high variability in the overall sample if there are largely dissimilar clusters
with distinct characteristics. In addition, if the number of clusters sampled is small, there is also
a risk that the clusters selected will not be representative of the population.
Section 1.2. Sampling 7
Example 1.2.14 Suppose a study wants to survey the mental wellness of Primary school students
in Singapore. Cluster sampling can be done by treating each Primary school as a cluster; this way of clustering the population of interest is natural and unambiguous since every student in the population belongs to exactly one Primary school. A number of schools are then selected using
simple random sampling for this survey and all the students in the selected schools will be part
of the sample while those not in the selected schools will not be included. Another approach is of
course to apply simple random sampling with the list of all students (from all Primary schools) as
the sampling units. If this was done, then there is a possibility that all schools will have students
forming part of the sample. Cluster sampling would not provide such a characteristic.
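For contrast with the earlier sketches, here is a Python sketch of cluster sampling for the school survey in Example 1.2.14; the 50 schools of 30 students each are hypothetical.

    import random

    # Hypothetical clusters: each Primary school is a cluster of students.
    schools = {f"School {i}": [f"S{i}-{j}" for j in range(30)] for i in range(1, 51)}

    # Select a fixed number of clusters by simple random sampling, then
    # include every unit from each selected cluster.
    selected = random.sample(list(schools), k=5)
    sample = [student for school in selected for student in schools[school]]
    print(len(sample))   # 5 schools x 30 students = 150 units in the overall sample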
We have presented four different probability sampling methods. Below is a summary of the advantages and disadvantages of the methods, as discussed above.

Simple random sampling: the variability between samples is entirely due to chance, so the method is free of selection bias; however, it requires a complete sampling frame and can be subject to non-response.
Systematic sampling: can be carried out even when the total number of sampling units is unknown; however, if the units are listed with some inherent pattern, the sample may suffer from selection bias.
Stratified sampling: sampling within strata can reduce the variability across samples; however, it requires a sampling frame for each stratum, and it may be ambiguous which stratum a unit belongs to.
Cluster sampling: usually simpler, less costly and less resource-intensive; however, the overall sample may have high variability when the clusters are dissimilar, and may not be representative if only a few clusters are selected.
Remember:
There is no single universally best probability sampling method as each has its advan-
tages and disadvantages. All probability sampling methods can produce samples that
are representative of the population (that is, the sample is unbiased). However, depending
on the situation, some methods would further reduce the variability, resulting in a more
precise sample.
Definition 1.2.15 A non-probability sampling method is one where the selection of units is not done by randomisation. There is no element of chance in determining which units are selected; instead, selection is usually down to human discretion.
Example 1.2.16
2. Volunteer sampling happens when subjects volunteer themselves into a sample. Such a sample is
also known as a self-selected sample and very often, the sample contains subjects who have stronger opinions (either positive or negative) on the research question than the rest of the population. Such
a sample is unlikely to be representative of the population of interest. For example, the host of a
“popular” radio talkshow may wish to find out how well received his show is. To do this, he asks his listeners to go online and submit a rating of the show, out of a score of 10. Each listener can
voluntarily decide if they wish to be part of this rating exercise or not. By collecting a sample
of opinions this way, it is likely that the sample will be skewed towards a high rating because
listeners who did not like the talkshow would not even be aware of such a survey and therefore
their opinions would have been left out. On the other hand, listeners who are strong supporters of
this show would be more enthusiastic to go online to support their favourite radio show.
Let us summarise our discussion on sampling. In most instances where a census is not possible,
obtaining a sample of the population of interest is necessary. The following outlines the general approach
to sampling:
1. Design a sampling frame. Recall that a sampling frame should ideally contain the population
of interest so that every unit in the population has a chance to be sampled.
2. Decide on the most appropriate sampling method to generate a sample from the sampling frame.
Probability sampling methods are generally preferred over non-probability sampling methods as
non-probability sampling methods have a tendency to generate a biased sample.
3. Remove unwanted units (those that are not from the population) from the generated sample.
Remark 1.2.17 If the following generalisability criteria can be met, we will be more confident in
generalising the conclusion from the sample to the population.
1. Have a good sampling frame that is equal to or larger than the population;
2. Adopt a probability sampling method so that selection bias is minimised;
3. Have a large sample size to reduce the variability or random errors in the sample;
4. Minimise the non-response rate.
Definition 1.3.1
1. A variable is any attribute or characteristic of an individual that can be measured or recorded.
2. A data set is a collection of individuals and the variables pertaining to those individuals. Individuals can refer to either objects or people.
In a research question where we are examining relationships between variables, there is usually a
distinction between which are independent and which are dependent variables.
Definition 1.3.2
1. Independent variables are those that may be subjected to manipulation, either deliberately or
spontaneously, in a study.
2. Dependent variables are those that are hypothesised to change depending on how the independent
variable is manipulated in the study.
It is important to note that the dependent variable is hypothesised to change when the independent
variable is manipulated. It does not mean that the dependent variable must change. It is perfectly
possible that changes to the independent variable do not result in any change in the dependent variable.
Example 1.3.3
1. In a study, if we wish to investigate the relationship between time spent on computer gaming and
examination scores, the independent variable is the amount of time one spends on computer gaming
while the dependent variable is the examination score.
2. In a study where we investigate which brand of tissue paper is able to absorb the most water, the
independent variable is the brand of the tissue paper and the dependent variable is the amount of
water a piece of tissue paper (from a particular brand) can absorb. In this study, we will vary the
different brands of tissue paper used and record the different amounts of water absorbed.
3. We would like to study whether drinking at least 2 glasses of orange juice per day for a year is associated[1] with having lower cholesterol levels in a year’s time. In this case, the independent variable
is whether (or not) a person drinks at least 2 glasses of orange juice a day. Each individual will
have an attribute labelled either as “YES” or “NO” with regards to this variable. The dependent
variable would be whether an individual’s cholesterol level next year is lower than this year’s level.
Again, each individual will have an attribute labelled either as “YES” or “NO” with regards to
this variable.
Definition 1.3.4
1. Categorical variables are those variables that take on categories or label values. The categories
or labels are mutually exclusive, meaning that an observation cannot be placed in two different
categories or given two different labels at the same time.
2. Numerical variables are those variables that take on numerical values and we are able to meaning-
fully perform arithmetic operations like adding and taking average.
3. Among categorical variables, there are generally two sub-types. An ordinal variable is a categorical variable where there is some natural ordering and numbers can be used to represent the ordering. A nominal variable is a categorical variable where there is no intrinsic ordering.
4. Among numerical variables, there are also generally two sub-types. A discrete numerical variable
is one where there are gaps in the set of possible numbers taken on by the variable.
5. A continuous numerical variable is one that can take on all possible numerical values in a given
range or interval.
Example 1.3.5
1. The happiness index used to measure how happy a group of Secondary school students are is an ordinal variable. For instance, we can specify “1” as “not happy”, “2” as “somewhat not happy”, “3” as “neutral”, “4” as “somewhat happy” and “5” as “happy”. Whether a subject drinks at least 2 glasses of orange juice or not is an example of a nominal variable.
2. The number of children in the school who scored an A grade in Mathematics for PSLE is a discrete
numerical variable. In this case, the gaps are the non integer values that lie between every two
integer values. It is clear that we cannot have, for example, 134.5 children scoring A in the school,
so there is a gap between 134 and 135.
[1] The notion of association between variables will be discussed extensively in Chapter 2.
3. The height or the weight of a person is a continuous numerical variable, as the weight can take on
all numerical values, not necessarily only the integer values.
A common way of presenting data is to use a table with rows and columns. Each row of the table
usually gives information pertaining to a particular individual while each column is a variable. So if we
look across a row in the table, we will see the variables’ information for that particular individual.
The table above shows part of a data set involving different species of penguins and some of the
physical attributes of the penguins. Each row represents a particular penguin and the columns are the
variables pertaining to that particular penguin. Some of the variables are categorical variables while
others are numerical. Can you figure out whether the categorical variables are ordinal or nominal? Can
you figure out whether the numerical variables are discrete or continuous?
With a data set, we are able to zoom into a particular individual’s information at a micro level. If we
do this, we can extract all the information on that particular individual for our use. However, we may
also be interested in looking at the entire data set at the macro level, obtaining information on groups of
individuals or the entire population. Useful information like trends and patterns can be observed from the data through data visualisation. While calculations cannot be done on visualisations themselves, we can use summary statistics to make numerical and quantitative comparisons between groups of data.
Summary statistics for numerical variables can be broadly classified into two types. Firstly, there
are those that measure the central tendencies of the data, like mean, median and mode. Secondly,
there are those that measure the level of dispersion (or spread) of the data, like standard deviation and
interquartile range.
Definition 1.4.1 The mean is simply the average value of a numerical variable x. We denote the mean of x by x̄ and the formula to compute x̄ is
x̄ = (x1 + x2 + · · · + xn)/n.
Here, n is the number of data points and x1, x2, . . . , xn are the numerical values of the numerical variable x in the data set.
Example 1.4.2 Suppose the bill lengths (in mm) of 7 penguins were recorded. The mean bill length is then the sum of the 7 recorded lengths, divided by 7.
Remark 1.4.3 The following are some useful properties of the mean.
1. x1 + x2 + · · · + xn = nx̄. This means that we may not know each of the individual values x1, x2, . . . , xn, but we can calculate their sum if we know their mean (x̄) and the number of data points (n) used to compute the mean.
2. Adding a constant value c to all the data points changes the mean by that constant value. So if the mean of the values x1, x2, . . . , xn is x̄, then the mean of
x1 + c, x2 + c, . . . , xn + c
will be x̄ + c. For example, the mean of 1, 6, 8 is (1/3)(1 + 6 + 8) = 5, and the mean of (1 + 3), (6 + 3), (8 + 3) (adding 3 to each of the 3 numbers 1, 6 and 8) is
((1 + 3) + (6 + 3) + (8 + 3))/3 = (4 + 9 + 11)/3 = 8 = 5 + 3.
3. Multiplying all the data points by a constant value c will result in the mean being changed by the same factor of c. So if the mean of the values x1, x2, . . . , xn is x̄, then the mean of
cx1, cx2, . . . , cxn
will be cx̄. For example, the mean of 2, 7, 12 is (1/3)(2 + 7 + 12) = 7, and the mean of (2 × 2), (2 × 7), (2 × 12) (multiplying each of the 3 numbers 2, 7 and 12 by 2) is
((2 × 2) + (2 × 7) + (2 × 12))/3 = 42/3 = 14 = 2 × 7.
Example 1.4.4 Consider a data set where we have daily weather data, collected at various weather
stations in Singapore. Part of the data set is shown below.
With this data set, some of the questions that we can ask are
2. If the mean monthly rainfall in 2020 was 157.22mm, what was the total amount of rainfall recorded
in 2020?
3. Is there any relationship between wind speed and temperature? What about between the amount
of rainfall and wind speed?
4. Does the weather pattern for 2020 allow us to make a good prediction for how the weather will be
like in 2021?
To answer the first question on the month with the most amount of rainfall, we need to add up the amount
of rainfall recorded on each day of a month, for every month in the year in order to do a comparison.
To answer the second question, using the information on the mean monthly rainfall (x̄ = 157.22), together with the fact that
12x̄ = x1 + x2 + · · · + x12,
we can find the total rainfall in 2020 to be 12 × 157.22 = 1886.64mm. This way, we can find the total rainfall in 2020 without having to add up the rainfall for each of the twelve months. It is also useful to note that if the mean monthly rainfall in 2020 was 157.22mm, then
1. It is not possible for the amount of rainfall to be less than (or more than) 157.22mm every month
in 2020.
2. It is not necessarily the case that the amount of rainfall is 157.22mm every month in 2020.
3. In fact, it may not even be the case that there were six (half of twelve) months where the monthly
rainfall were higher than the mean and the other six months lower than the mean.
In conclusion, knowing the mean, while useful, does not tell us how the rainfall was distributed over the
twelve months of 2020. We would not know which months had more than the mean and which months
had less. In order to have further information beyond the mean, we need to know a bit more about the
spread of the data. This will be covered later in this chapter.
Example 1.4.5 Suppose students from two different schools (A and B) took a common examination
and the table below shows the average performance of the students in both schools.
No. of students Average mark
School A 349 32.21
School B 46 30.72
Overall 395 ?
The mean score of students in school A was 32.21 and the mean score of students in school B was
30.72. What would be the mean score of all the students in both schools if we consider them altogether?
Would it be the simple average
(32.21 + 30.72)/2 = 31.465?
The answer is no, because we do not know how many students in each school contributed to the mean scores recorded in their respective schools. Imagine the extreme case where
school A had 500 students who took the examination while school B only had 5. In such a situation,
you would expect that the overall average score of the 505 students in both schools to be very close to
the mean score of school A. In order to know what is the overall mean for the students in both schools,
we need to have the information on the number of students in each school, given below.
Number of students
School A 349
School B 46
With this information, the overall mean can be computed using the weighted average of the two subgroup
means. The overall mean for the 349 + 46 = 395 students would be
(349/395) × 32.21 + (46/395) × 30.72 ≈ 32.04.
The numbers 349/395 and 46/395 that multiply their respective subgroup means are called the weights of the subgroups. Observe that due to the much larger subgroup size of school A compared to that of school B, the overall mean, as we expected, is much closer to the mean of school A.
Another useful observation is that the overall mean of 32.04 lies between the two subgroup means
of 32.21 and 30.72 (although closer to 32.21). This is not a one-off coincidence. Generally, the overall
mean will always be between the smallest and largest means among all the subgroups (when we have
more than just two subgroups). This will be discussed in greater detail in the next chapter.
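The weighted-average computation in Example 1.4.5 can be written down directly; a minimal Python sketch:

    # Overall mean as a weighted average of the subgroup means (Example 1.4.5).
    sizes = {"A": 349, "B": 46}
    means = {"A": 32.21, "B": 30.72}

    total = sum(sizes.values())                                # 395 students altogether
    overall = sum(sizes[s] / total * means[s] for s in sizes)
    print(round(overall, 2))                                   # 32.04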
Example 1.4.6 In this final example on means, we introduce a related concept known as proportions.
Suppose we would like to investigate the effectiveness of a new drug for treating asthma attacks compared
to existing drugs. The table below shows the number of patients taking the new drug and the number
taking the existing drug.
Since there are only 200 asthma attacks among those patients taking the new drug, compared to
300 asthma attacks among those taking the existing drug, can we conclude that the new drug is more
effective? The answer is no. Notice that the number of patients taking the new drug and those taking
the existing drug are vastly different. This means that we should not be simply looking at the absolute
number of asthma attacks observed in the two groups of patients, but instead consider the proportion of
patients in each group having asthma attacks. We see that the proportion is higher in the group taking
the new drug compared to the group taking the existing drug and this makes us a lot less confident that
the new drug is more effective than the existing one.
The computation of proportion can actually be thought of as a mean in the following way. Imagine
that among the 500 patients receiving the new drug, we assign a numerical value of 1 to those who had
an asthma attack after taking the new drug and a numerical value of 0 to those patients who did not
have an asthma attack. If we do this, then the mean of these 500 observations of 0s and 1s (with 200 ones and 300 zeros) would be
(1 + 1 + · · · + 1 + 0 + 0 + · · · + 0)/500 = 200/500 = 0.4,
which coincides with what was computed as the proportion for this group of patients having asthma
attack. Therefore, proportion can be thought of as a special case of mean.
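This observation is one line of code away; the sketch below recomputes the proportion in Example 1.4.6 as the mean of 0/1 indicator values.

    # A proportion is the mean of 0/1 indicators (Example 1.4.6).
    outcomes = [1] * 200 + [0] * 300      # 1 = asthma attack, 0 = no attack
    print(sum(outcomes) / len(outcomes))  # 0.4, the mean of the 500 indicators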
Definition 1.5.1 Recall that in Example 1.4.4, we saw that knowing the mean of a variable does not
tell us about how the data is distributed and the spread of the data points. Standard deviation is one of
the ways to measure the spread of the data about the mean. The computation of the standard deviation
is done via the computation of the sample variance of the data as follows:
Var = ((x1 − x̄)² + (x2 − x̄)² + · · · + (xn − x̄)²)/(n − 1),
and the standard deviation is sx = √Var, the square root of the sample variance. Note that the differences (xi − x̄) are squared before being summed; if we simply summed the unsquared differences, the total would always equal zero, and this would result in the wrong conclusion that there is no variance (and thus no spread) of the data points about the mean. The reason is simply that each data point could be smaller or bigger than the mean, and without squaring, the positive and negative differences (xi − x̄) cancel each other out, giving us the wrong impression that there is no variation or spread among the data points about the mean.
Remark 1.5.2 You may wonder why, in the computation of the sample variance, we divide the sum of the squares (xi − x̄)² by n − 1 instead of n, since we have n data points and not n − 1. The reason is that x1, x2, . . . , xn are assumed to be a sample taken from a population. We are using the variance observed in such a sample to estimate the variance at the population level, which is usually unknown. You can think of dividing by n − 1 instead of n as a ‘correction’ to make since our data is only a sample of the population. A more detailed discussion of this is beyond the scope of this module.
Example 1.5.3 The highest temperature recorded on the 1st day of every month is shown below:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
30.1 31.1 31.8 32.1 31.9 32.6 33.0 32.4 32.0 32.5 31.3 29.6
The mean is
x̄ = (30.1 + 31.1 + 31.8 + 32.1 + 31.9 + 32.6 + 33.0 + 32.4 + 32.0 + 32.5 + 31.3 + 29.6)/12 = 31.7.
The sample variance is
Var = ((30.1 − 31.7)² + (31.1 − 31.7)² + · · · + (31.3 − 31.7)² + (29.6 − 31.7)²)/11 ≈ 1.038.
The standard deviation is
sx = √Var ≈ 1.019.
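Python's standard statistics module computes the same quantities (it also divides by n − 1 for the sample variance), so we can verify Example 1.5.3 as follows.

    import statistics

    temps = [30.1, 31.1, 31.8, 32.1, 31.9, 32.6,
             33.0, 32.4, 32.0, 32.5, 31.3, 29.6]

    print(statistics.mean(temps))      # 31.7
    print(statistics.variance(temps))  # sample variance (divides by n - 1): ~1.038
    print(statistics.stdev(temps))     # sample standard deviation: ~1.019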
Remark 1.5.4 The following are some properties of the standard deviation of a variable x.
1. The standard deviation sx is always non-negative. In fact, sx is almost always positive; the only instance when sx = 0 is when the data points are all identical, that is, x1 = x2 = · · · = xn. In this case, the variance is zero and so is the standard deviation.
2. The standard deviation shares the same unit as the numerical variable x. For example, if x measures
the weight (in kilograms) of adult males in Singapore, then the unit for sx is also kilograms.
3. Adding a constant c to all data points does not change the standard deviation. So the standard
deviation for the set of data
A = {x1 , x2 , x3 , . . . , xn }
is the same as the standard deviation for the set of data
B = {x1 + c, x2 + c, x3 + c, . . . , xn + c}.
Intuitively, since all the data points are adjusted by the same constant c, the spread of the data
points about the new mean will be the same as the spread of the original data about the previous
mean.
4. Multiplying all the data points by a constant c results in the standard deviation being multiplied
by |c|, the absolute value of c. In other words, if sx is the standard deviation for the set of data
A = {x1 , x2 , x3 , . . . , xn },
then the standard deviation for the set of data
B = {cx1 , cx2 , cx3 , . . . , cxn }
will be |c|sx .
Example 1.5.5 Let us return to the data set involving three different species of penguins introduced
earlier in the chapter. The three species were named Chinstrap, Adelie and Gentoo and the data set
contained information on the physical attributes (e.g. mass, bill length, bill depth etc.) of various
penguins in each of the three species. An overarching question that one may be interested to answer is
- how different are these penguins? A common approach to answer this question is to compare those
physical attributes across samples collected for the different species and see if they are significantly
different. For example, we can compute the mean and standard deviation of the mass of the penguins,
summarised as follows:
Mean mass Standard deviation of mass
Chinstrap 3733g 384.3g
Adelie 3710g 458.6g
Gentoo 5076g 504.1g
Overall 4201g 802.0g
1. Observe that the overall mean mass 4201g is indeed between the group with the highest mean mass
(Gentoo) at 5076g and the group with the lowest mean mass (Adelie) at 3710g. This is consistent
with our earlier discussion.
2. Even though the overall mean mass is 4201g with standard deviation 802g, it does not imply that
the heaviest penguin weighs 4201 + 802 = 5003g.
3. Suppose we wish to investigate whether the Adelie and Chinstrap species are similar in terms of
their mass. First, we observe that the mean mass of these two groups are rather similar with the
Adelie species having a mean mass of 3710g while the Chinstrap species has a mean mass of 3733g.
However, the standard deviations of mass for these two species are rather different.
4. To examine the differences in physical attributes between the Adelie and the Chinstrap species further, we need to delve into other factors or variables that we have information on from the data
set, for example, variables like age, gender, location and so on. This is Exploratory Data Analysis
in action, where we start off with a few questions about the data set and with exploration into
the data, we ask new questions and go back to the data set to look more closely at the data in an
attempt to answer the new questions. In data analysis, this process is often repeated several times.
In relation to this penguin data set, here are some further questions that can be asked:
Are male penguins heavier than female penguins?
Is there a relationship between bill length and bill depth across all species?
Do heavier penguins come from colder locations?
Can findings in this data be generalised to all of the three species?
5. The concept of coefficient of variation is often used to quantify the degree of spread relative to the mean. The formula is
coefficient of variation = sx/x̄.
Observe that since sx and x̄ have the same units, the coefficient of variation has no units and is simply a number. The coefficient of variation is a useful statistic for comparing the degree of variation across different variables within a data set, even if the means are drastically different from one another.
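Using the summary values in the mass table above, the coefficients of variation for the three species can be computed directly; a small Python sketch:

    # Coefficient of variation = sd / mean, using the mass summaries above.
    summary = {"Chinstrap": (3733, 384.3), "Adelie": (3710, 458.6), "Gentoo": (5076, 504.1)}
    for species, (mean_mass, sd_mass) in summary.items():
        print(species, round(sd_mass / mean_mass, 3))  # unitless, comparable across groups

For instance, this shows that relative to its mean, the Adelie species has a larger spread in mass than the Chinstrap species, even though the two means are similar.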
Definition 1.6.1 In this section, we will introduce a few other summary statistics. We have already
discussed the mean, which measures the central tendencies of a variable, as well as standard deviation
which measures the spread of the data points about the mean. The median of a numerical variable in
a data set is the middle value of the variable after arranging the values of the data set in ascending or
descending order. If there are two middle values (when there are an even number of data points), we
will take the average of the two middle values as the median. The median is an alternative to the mean
as a measure of central tendencies of a numerical variable.
Remark 1.6.3
1. We have seen that when a constant c is added to every data point in a data set, the mean will also
be increased by c. The median behaves in the same way, so if the median of the values x1 , x2 , . . . , xn
is r, then the median of
x1 + c, x2 + c, . . . , xn + c
is r + c.
2. We have also seen that when a constant c is multiplied to all the data points, then the mean is also
multiplied by c. The effect on the median is similar, so if the median of the values x1 , x2 , . . . , xn is
r, then the median of
cx1 , cx2 , . . . , cxn
is cr.
Example 1.6.4 Returning to Example 1.4.5, we saw that school B had 46 students who took an examination and the mean of their scores was 30.72. The plot below, known as a dot plot, shows the scores obtained by each of the 46 students.
Each dot placed on a particular number represents a student obtaining that score for the examination.
Since there were 46 students, the median score would be the average of the 23rd and 24th ranked students’
scores. The 23rd ranked student scored 30 marks while the 24th ranked student scored 31 marks. So the
median score is 30.5. This also means that 50% of the students scored below 30.5 marks and the other
50% scored more than 30.5 marks.
It is interesting to note that the mean score for school B was 30.72, which is very close to the median score. The main reason for this is that the spread of the scores is quite symmetrical about the mean and the median. Can you construct a data set where the mean and median are far apart?
We can also compute the median score for students in school A, as well as the overall median score
when we combine the students from both schools together. The median and mean (computed in Example 1.4.5) for each subgroup, as well as the overall median and mean scores, are shown in the table below.
Median Mean
School A 32 32.21
School B 30.5 30.72
Overall 32 32.04
Similar to what we observed for means, the overall median score (32) lies between the highest subgroup median (32) and the lowest subgroup median (30.5). This is by no means a coincidence.
Even when there are more than 2 subgroups, the overall median will always be between the lowest median
and the highest median among all the subgroups. However, if we know each of the subgroup medians,
it is not possible to use this information to derive the overall median. This is unlike the case for mean
where, if we know the mean of each subgroup, together with the “weights” of each group (meaning the
number of members in each subgroup) we can take a weighted average to compute the overall mean
exactly.
Definition 1.6.5 We have seen that the median represents a numerical value where 50% of the data is
less than or equal to this value. This is also known as the 50th percentile of the data values. The first
quartile, denoted by Q1 , is the 25th percentile of the data values, while the third quartile, denoted by
Q3 is the 75th percentile of the data values. This means that 25% of the data is less than or equal to Q1
while 75% of the data is less than or equal to Q3 .
Definition 1.6.6 The interquartile range, denoted by IQR is the difference between the third and first
quartiles, so IQR = Q3 − Q1 .
Remark 1.6.7
1. IQR and standard deviation share similar properties. For example, we know that IQR is always non-negative since Q3 is always at least as large as Q1 and so Q3 − Q1 ≥ 0.
2. If we add a positive constant c to all the data points, not only does the median value increase by c,
Q1 and Q3 are increased by c as well. Thus, there will be no change in IQR. Of course, IQR also
remains unchanged if c is subtracted from all data points.
3. If we multiply all data points by a constant c, then IQR will be multiplied by |c|.
Example 1.6.8 Let us consider two simple data sets and compute the first quartile, median, third quartile and interquartile range. The first data set consists of an even number of data points as follows:
1, 5, 8, 9, 10, 16, 19, 22, 28, 30.
1. Since there are 10 data points, the median is the average of the 5th and 6th ranked data points, so the median is (1/2)(10 + 16) = 13.
2. To find the first and third quartiles, we divide the data set into the lower half (1st to 5th ranked data points) and the upper half (6th to 10th ranked data points). The first quartile is the median of the lower half
1, 5, 8, 9, 10,
which is the 3rd ranked data point in this lower half, so Q1 = 8. The third quartile is the median of the upper half
16, 19, 22, 28, 30,
which is the 3rd ranked data point in this upper half, so Q3 = 22.
3. The interquartile range is therefore IQR = Q3 − Q1 = 22 − 8 = 14.
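The median-of-halves method used above is straightforward to code; the Python sketch below reproduces the computation (as the remark that follows notes, software packages may use slightly different conventions for quartiles).

    import statistics

    data = [1, 5, 8, 9, 10, 16, 19, 22, 28, 30]

    # Median-of-halves method: split the sorted data into a lower and an
    # upper half (excluding the middle value itself when n is odd).
    xs = sorted(data)
    half = len(xs) // 2
    q1 = statistics.median(xs[:half])              # median of the lower half -> 8
    q3 = statistics.median(xs[-half:])             # median of the upper half -> 22
    print(q1, statistics.median(xs), q3, q3 - q1)  # 8 13.0 22 14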
Remark 1.6.9
1. In the example above, when a data set has an odd number of data points, we do not include the median in either the lower or the upper half. This is not a universal practice; you may encounter some texts that include the median in both halves.
2. In reality, when the number of data points is large, summary statistics like the median and quartiles are not computed manually; instead, they are computed using software. However, even software packages do not all adopt the same algorithm for computing these statistics. The good news is that we do not have to worry too much about finding the exact value of a quartile since, for large data sets, all the different methods give pretty close answers and the small differences are not an issue. For small data sets, it is also not really meaningful to summarise the data since we have complete information on the entire data set anyway.
Remark 1.6.10 For a numerical variable, we can always use the mean and standard deviation as a pair
of summary statistics to describe the central tendency as well as the dispersion and spread of the data.
Similarly, the median and IQR can also be used. Which choice is more appropriate? There is no clear-cut answer, but very often the choice depends on the distribution of the data. Generally speaking, the median and IQR are preferred if the distribution of the data is not symmetrical or when there are outliers.
We will conclude this section with a final summary statistic that can be used for both numerical and
categorical variables.
Definition 1.6.11 The mode of a numerical variable is the numerical value that appears most often
in the data. For categorical data, a mode is the category that has the highest occurrence in the data.
The mode is generally interpreted as the peak of the distribution and this means that the mode has the
highest probability of being observed if a data point is to be selected randomly from the entire data set.
Recall that, broadly speaking, research questions can be classified into the following categories.
1. To make an estimate about the population.
2. To test a claim about the population.
3. To compare two sub-populations / to investigate a relationship between two variables in the population.
In this section, we will focus on the third type of question, where we investigate a relationship
between two variables in the population. For example, consider the question “does drinking coffee help
students pass the mathematics examination?” The two variables here are drinking coffee (yes or no)
and passing the mathematics examination (yes or no). Here, both variables are nominal categorical
variables. Commonly, a researcher looking at this situation may want to define “drinking coffee” as the
independent variable as it can be controlled and adjusted while “passing the mathematics examination”
is the dependent variable. In order to investigate this relationship, we need to design a study and for this
course, we will discuss two main study designs, namely experimental studies and observational studies.
Definition 1.7.1 In an experimental study (sometimes also known as controlled experiment or simply
an experiment), we intentionally manipulate one variable (the independent variable) to observe whether
it has an effect on another variable (the dependent variable). The primary goal of an experiment is to
provide evidence for a cause-and-effect relationship between two variables.
Example 1.7.2 Returning to the experiment to investigate the relationship between drinking coffee and
passing the mathematics examination, we can set up an experimental study by dividing the subjects,
that is, the students taking the examination, into two groups. The first group will be required to drink
exactly one cup of coffee every day for a month. The second group will not drink any coffee for one
month. The group who are required to drink one cup of coffee every day for a month is often known as
the treatment group since they are thought to be put through the “treatment” of drinking coffee. The
other group who does not drink coffee is known as the control group.
It is important to have a control group to compare against the treatment group. Without a control
group (imagine every subject is required to drink coffee for a month), we would not be able to determine if
there were indeed any difference between drinking coffee or not. However, it should be noted at this point
that sometimes the control group are also subjected to other forms of treatment (not to be mistaken with
the treatment of interest in the study) i.e. a control group does not necessarily mean no treatment at all.
One example is when we are comparing the effects of a new treatment with an existing treatment.
For such instances, the treatment group will be formed by subjects receiving the new treatment while
the control group will be those who continue to receive the existing treatment. Note also that if we were
to design a controlled experiment where the treatment group receives a new treatment, and the control
group receives the existing treatment, it is implicitly assumed, as part of the experimental design, that
we must know the effect of having no treatment. This ensures a meaningful comparison of the effects
between the new treatment and existing treatment with reference to the known baseline of having no
treatment.
A natural question now is how the subjects are to be divided into the two groups. Can we do it any way we like? Can we let the odd-numbered subjects be in the treatment group and the even-
numbered subjects be in the control group? Does it matter? The problem of how to assign subjects to
the two groups is our next topic of discussion.
Discussion 1.7.3 Continuing on with the coffee drinking experiment, suppose one month after the
experiment started, the subjects from both groups took the mathematics examination and the number
of passes in each group is shown below.
No. of students No. of passes
Treatment group (coffee) 1000 900
Control group (no coffee) 1000 450
We see that 90% (900 out of 1000) of the students in the treatment group passed the examination while
only 45% (450 out of 1000) of the students in the control group passed. There seems to be some evidence
that drinking coffee may help a student pass the mathematics examination. Is this evidence convincing?
Can we go one step further and say that coffee causes improvement in passing the examination?
The skeptics among us will probably not be so easily convinced. Possible doubts and questions that could be raised include the following.
1. Maybe the students in the “coffee group” just happen to be better in mathematics and thus have
a higher chance of passing the examination? Or maybe they just have higher IQ than those in the
“no-coffee” group?
2. Maybe many of the students in the “coffee group” had longer revision time before the examination
than those in the “no-coffee” group?
These are some of the possible factors that could have contributed to the difference in passing rates
between the two groups. In trying to establish a cause-and-effect relationship between two variables, we
want to make sure that the independent variable is the only factor that impacts the dependent variable.
In the coffee drinking example, we want to ensure that coffee drinking (or not) is the only variable
that distinguishes the treatment group from the control group. In other words, we need to ensure that
coffee drinking (or not) is the only difference between the subjects in the two groups. All other possible
differentiating factors, for example amount of revision time, should be removed.
How can these factors be “removed”? Surely we cannot mandate that all students in both groups
are only allowed a fixed number of revision hours before the examination! Even if we could, we most
definitely cannot enforce that all students in both groups must have the same IQ! The answer to this is
a powerful statistical method known as random assignment.
Definition 1.7.4 Random assignment is an impartial procedure that uses chance (or probability) to
allocate subjects into treatment and control groups.
How do we perform random assignment for our coffee drinking experiment? The following procedure
can be considered:
1. Write each student's name on an identical piece of paper.
2. Put all the pieces of paper into a box and mix them up.
3. Draw the names out one by one until exactly half the total number of students are chosen. The
names of the students chosen will form the treatment group.
4. The remainder of the students not in the treatment group will form the control group.
The procedure above is just an example of how random assignment can be done. As long as there
is a random element, there can be other procedures to conduct random assignment. It should be noted
that at every draw, each name in the box has an equal chance of being chosen. Perhaps there are still doubters out there who feel that, even with such a chance mechanism assigning the subjects into treatment and control groups, it may still happen that many of the high IQ students will be assigned to the treatment group. However, we can be assured that:
If the number of subjects is large, by the law of probability, the subjects in the treatment
and control groups will tend to be similar in all aspects.
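This assurance can be illustrated with a short simulation. The following Python sketch is not part of the module's materials; the IQ scores, group sizes and random seed are illustrative assumptions.

import random

random.seed(1)  # fixed seed so the simulation is reproducible

# Hypothetical IQ scores for 2000 students (any attribute would do).
iq_scores = [random.gauss(100, 15) for _ in range(2000)]

# Random assignment: shuffle the list, then split it into two halves.
random.shuffle(iq_scores)
treatment, control = iq_scores[:1000], iq_scores[1000:]

def mean(values):
    return sum(values) / len(values)

print(round(mean(treatment), 1), round(mean(control), 1))
# The two group means typically differ by well under one IQ point,
# illustrating how chance assignment tends to balance large groups.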
Example 1.7.5 The S&P Index is a stock market index of the largest US publicly traded companies.
We are interested in the percentage returns of these S&P companies in 2013. Suppose these percentage
returns were written on 1000 tickets and we are aware that the percentage returns range from −4% to 4%.
Using the method of random assignment, the 1000 tickets are separated into two groups, each comprising
500 tickets. The following plots show the distributions of the percentage returns of companies in both
groups.
In each plot, the horizontal axis shows the percentage returns and the vertical axis shows the number
of companies with each percentage return. We observe the effect of random assignment: the
distributions in both groups are rather similar.
Remark 1.7.6
1. While performing random assignment to allocate subjects into treatment and control groups, it is
not always necessary (or even possible) for both groups to have exactly the same number of subjects.
For example, 501 students cannot be divided into two groups of equal size. As long as some form
of random assignment is done and the number of subjects in each group is big enough, we can still
be assured that the two groups are similar in almost every aspect.
2. When we use the term “random” in random assignment, we do not mean that the assignment is
haphazard. The term random in this case refers to the use of an impartial chance mechanism to
assign the subjects into two (or more) groups.
Discussion 1.7.7 While random assignment is an important step to take when we divide our subjects
into the treatment and control groups, there is another important consideration when it comes to design-
ing a controlled experiment. If we make it known to the control group that they are indeed the control
group, and therefore not going to receive any form of treatment, this could possibly lead to bias.
To see why this is so, let us return to the coffee experiment. Suppose the subjects in the control group
are told that they will not be given any coffee for a month. Since we are testing whether coffee helps
a student pass the mathematics examination, students in the control group may feel disadvantaged and
therefore lack the confidence and motivation to study. This may in turn result in these students not
doing well in the examination and performing worse than their friends in the treatment group who were
given coffee. Any observed difference in passing rates between the two groups of students may not be
the result of coffee at all. If this happens, the effect of coffee may be overstated.
On the other hand, knowing that they will not be given coffee may cause the students in the control
group to take certain measures for their own benefit of passing the examination. For example, they may
study harder and spend more time on their revision, which may then result in the control group
performing better than the treatment group in passing the examination. Again, any observed difference
in passing rates between the two groups of students may not be the result of coffee at all. If this
happens, the effect of coffee may be understated.
One way to reduce this anxiety in the control group, which could otherwise influence the study on the
effects of coffee drinking, is to give the subjects in the control group another beverage which tastes and
smells the same as coffee but is without the active ingredients in coffee that are believed to improve
one’s cognitive ability.
Definition 1.7.8 In the previous discussion, the alternative beverage is termed a placebo. A placebo
is an inactive substance or other intervention that looks the same as, and is given the same way as, an
active drug or treatment being tested. In the context of an experiment, a placebo is something given to
the control group that in actual fact, has no effect on the subjects in the group.
However, it has been observed that in some instances, subjects in the control group, upon receiving the
placebo, still showed some positive effects, likely caused by the psychology of believing that they are
actually being “treated”. This is known as the placebo effect.
Definition 1.7.9
1. One way to prevent the placebo effect from interfering with our experiment and observation on the
benefits (if any) of the treatment is to blind the subjects involved in the experiment. By blinding
the subjects, we mean that they do not know whether they belong to the treatment or control
group. To do this, a placebo that is “similar” to the treatment is given to the control group so
that the two treatments appear identical to the subjects. As a result, subjects do not know which
group they belong to. If we can do this, we would have achieved single blinding.
2. To take blinding one step further, other than blinding the subjects, it may be necessary to consider
blinding the researchers conducting the study as well, especially if measuring the effects of the
treatment may involve subjective assessments of the subjects. For example, in the coffee experi-
ment, if the assessors marking the students’ answers are aware of which group each student belongs
to, they may be inclined to award higher marks to students in the treatment group than those in
the control group. This is because the assessors may subconsciously believe that the treatment is
effective and this could introduce bias in the outcome.
Thus, we should also blind the assessors so that they do not know whether they are assessing the
treatment or the control group. We would have achieved double blinding if both the subjects and the
assessors are blinded to the group assignment.
To conclude this discussion on blinding, we should note that sometimes it may not be possible to
blind both the subjects and the assessors (can you think of one such experiment?) but when done right,
double blinding can be very effective in reducing bias in the outcome of the experiment.
Discussion 1.7.10 Besides an experimental study, another study design is an observational study.
Consider the following research question: Does vaccination help reduce the effects of the coronavirus?
If we were to design a controlled experiment, would the following be a possible and reasonable
approach?
1. Enrol a group of participants into the study and inject all the participants with low dosages of the
virus strain.
2. Perform random assignment to divide the group of subjects into the treatment group and control
group.
3. Inject the treatment group with the vaccine and inject a harmless liquid (similar in colour, smell
etc. to the vaccine) into the control group, without revealing to the subjects what they are being
injected with.
4. Observe the number of participants in each group who develop symptoms similar to a coronavirus
patient.
It is interesting to note that this is not a hypothetical situation. In fact, in 2020, during the COVID-
19 pandemic, a Dublin-based commercial clinical research organisation was reported to be planning an
experiment to test the effectiveness of a COVID-19 vaccine2 . The plan was similar to the approach
described above.
2 Ewen Callaway (2020). “Dozens to be deliberately infected with coronavirus in UK ‘human challenge’ trials,” Nature.
You probably realise by now that it is not so straightforward to design a controlled experiment like
this. There are obvious ethical issues that need to be addressed. Some immediate questions that need
to be answered are
1. Is it ethical to deliberately inject healthy participants with the virus strain in the first place?
2. How should we decide who is to be assigned to the treatment group and who is to be assigned to
the control group?
3. Is it fair not to let the subjects know if they are injected with the vaccine or with a placebo? Should
we obtain consent from the subjects at the beginning of the study?
Experiments can give us useful evidence for a cause-and-effect relationship. However, not all research
questions are suitable to be investigated using an experiment, sometimes due to ethical issues like those
listed above. Therefore, we need to consider the pros, cons and feasibility of an experimental study
before deciding if we should proceed.
Definition 1.7.11 An observational study observes individuals and measures the variables of interest,
usually without any direct/deliberate manipulation of the variables by the researchers.
Remark 1.7.12 Observational studies are alternatives to experiments that can be used when we are
faced with ethical issues in experiments. An observational study observes individuals and measures
the variables of interest. As researchers usually do not attempt to directly manipulate or change one
variable to cause an effect in another variable, observational studies do not provide convincing evidence
of a cause-and-effect relationship between two variables.
Example 1.7.13 We would like to investigate whether exercising regularly (defined as exercising at
least 3 times a week, with at least 30 minutes of strenuous exercise each time) is associated3 with having a
healthy body mass index (BMI) (defined as between 18.5 and 22.9 kg/m²) for Singaporean men aged
between 30 and 40 years.
Participants were recruited into the study and, by their own declaration, were classified into
either the “treatment” group (those who exercise regularly) or the “control” group (those who do not).
Participants were then told to proceed with their usual lifestyle habits and their body mass index was
measured after 3 months. The following table summarises the findings at the end of the study.
Treatment Control
(Exercise regularly) (Do not exercise regularly)
Healthy BMI range 320 127
Outside Healthy BMI range 101 191
3 Previously mentioned in Example 1.3.3; the topic of association between variables will be discussed in Chapter 2.
This is an example of an observational study. Do you think there is sufficient evidence of association
between exercising regularly and having a healthy BMI? We will discuss more questions like this in
subsequent chapters.
Let us conclude this chapter with some final remarks on study designs.
Remark 1.7.14
1. Not all research questions can be studied practically using an experiment. For example, if we would
like to investigate whether long-term smoking is linked to heart disease, it is extremely difficult to
design an experiment that puts subjects into a treatment group where they are required to smoke
for the long term, possibly against their will. This is both impractical and unethical. An observational
study may be more suitable for such an investigation.
2. For observational studies, there is no actual treatment being assigned to the subjects but we
normally still use the term treatment and control in the same way as though we are dealing with
an experiment. For the investigation on smoking and heart disease, smokers who are observed to
be smoking over a long period of time will be in the treatment group while non-smokers are in the
control group. Sometimes, we may use the term exposure group instead of treatment group and
non-exposure group instead of control group.
3. For experimental studies, subjects are assigned into either the treatment or control group by the
researcher. For observational studies, subjects assign themselves into either the treatment or con-
trol group.
4. Observational studies cannot provide convincing evidence of cause-and-effect relationships. On the
other hand, experimental studies can provide such evidence if they have the features of random
assignment and blinding (preferably double blinding).
Having a good design is not the only important piece of the puzzle. In order to generalise the
results from a sample to a bigger population, there are other factors that are equally important,
for example, the sampling frame, sampling method, sample size and response rate.
Exercise 1
1. Drug A is a new drug created to treat Congenital Amegakaryocytic Thrombocytopenia (CAMT).
An experiment was done to evaluate survival outcomes of drug A, against the current standard,
Rituximab, for treating CAMT. 150 CAMT patients were assigned to the drug A group, and 150
CAMT patients were assigned to the Rituximab group. A portion of the experimental design is
summarised in the table below.
Drug A Rituximab
Male 82 85
Female 68 65
Total 150 150
3. A new drug Y was created to reduce stomach-aches. Researchers wanted to test the effectiveness of
drug Y. Thus, they conducted a study containing 2000 subjects. The subjects have the following
characteristics:
Random assignment was conducted to assign the 2000 subjects into the treatment and the control
groups. 1200 subjects were assigned to the treatment group, while the remaining 800 subjects were
assigned to the control group. How many young females would we expect to see in the treatment
group?
4. In a large scale experiment, a researcher randomly assigned 6000 subjects to receive either a drug
or a placebo. 4000 patients were assigned to receive the drug, and the other 2000 patients received
the placebo. The researcher did a quick headcount in the drug-receiving group and noted that
there were 3002 males who received the drug. The researcher does not have time to do a headcount
in the placebo group. Which is the most reasonable number of males to be expected in the placebo
group?
(A) 1000.
(B) 1500.
(C) 2000.
(D) 2998.
5. Virus X has been known to cause very severe symptoms in its patients. Previously there has been
no anti-viral medicine to treat virus X. Recently, researchers have finally managed to produce a
trial drug in the form of a tablet. Researchers want to investigate if the trial drug helps to reduce
the duration of symptoms (number of days) in patients. 1000 patients were sampled for the study,
and all consented to join the study.
Which of the following statements is/are true? Select all that apply.
(A) Random sampling should be done to ensure that the subjects’ demographics/characteristics
are similar (in the treatment and control groups).
(B) Blinding the researchers to the subjects’ assigned groups (treatment or control group) is
important because the researchers may have certain bias for/against the drug.
(C) If the study randomly assigns 400 subjects into the treatment group, and 600 subjects into
the control group, the result of the study will be biased due to the unequal number in the two
groups.
6. In order to investigate the relationship between height and weight of people in a country, a re-
searcher draws a simple random sample of the country’s population, and records how height varies
with weight in the sample. What kind of study is this?
(A) An experiment.
(B) An observational study.
7. Alex wants to investigate the effect of listening to music on time taken to solve a maths problem.
Which of the following scenarios is an example of random assignment?
(I) Subjects are assigned based on the time they arrive at the lab. Alex assigns the first 50%
of the participants to arrive to the treatment group, and the remaining 50% to the control
group.
(II) For each subject, Alex asks him/her to choose head or tails. He then tosses a fair coin. If the
result matches the subject’s choice, the subject is assigned to the treatment group. If not, the
subject is assigned to the control group.
8. A study was conducted to investigate if memory foam pillows helped to improve one’s sleep qual-
ity. A large number of subjects were sampled via simple random sampling. The study aimed to
randomly assign the subjects into the treatment and control groups. Thus, a fair coin was flipped
for each subject to determine if the subject should be assigned into the treatment or control group
– if “heads”, the subject is assigned into the treatment group, if “tails”, the subject is assigned into
the control group. Subjects in the treatment group received a memory foam pillow, while subjects
in the control group received a regular pillow. Below is the description of the number of subjects
between the two groups.
9. Which of the following statements must be true about controlled experiments and observational
studies? Select all that apply.
(A) A placebo is used to blind the participants in a controlled experiment to which group they
belong to.
(B) In controlled experiments, we do not have confounders present.
(C) There can be treatment and control groups in observational studies.
(D) If there are unequal numbers of participants in the treatment and control groups in an exper-
iment, the results of the study will be biased.
(E) Selection bias occurs in observational studies.
10. A pharmaceutical company in Singapore has just developed a new drug that helps alleviate a
particular type of skin rash. They wish to compare the performance of the new drug against the
existing treatment and to also generalise their findings to all who are suffering from this form of
rash in Singapore. Specifically, they wish to compare the following:
(i) Number of days taken for the rash to disappear when taking the new drug as compared to
the existing drug.
(ii) Severity of side effects experienced when taking the new drug as compared to the existing
drug. Severity of side effects is classified as 0, 1 and 2, where 0 represents low severity, 1
represents moderate severity and 2 represents high severity.
Both the new drug and existing treatment are administered via an injection, and it is also known
that the rash will worsen if no treatment is administered. A brief description of their study design
is given below.
The company has a partnership with a particular clinic and recruits the first 1000 patients suffering
from this rash from a namelist obtained from the clinic to make up their sample. All 1000 patients
agree to take part in the study.
They set up 2 rooms, called Room A and Room B. Those who are assigned to room A will receive
the new drug and those who are assigned to Room B will receive the existing treatment. The
assignment is done as follows.
For each patient, a fair coin is tossed and if the coin lands on heads, the patient is told to go to
Room A and if it lands on tails, he/she is told to go to Room B. The participants are not told which
room is receiving the new drug and which room is receiving the existing treatment. Furthermore,
the doctors in each room are also not aware of whether the injections they are administering contain
the new drug or the existing treatment. After each person is injected, they conduct follow-ups to
determine how many days it took for the rash to disappear as well as the severity of side effects.
Based on the information above, determine which of the following statements is/are true. Select all that apply.
(A) Number of days taken for the rash to disappear, and severity of side effects are both numerical
variables.
(B) Since this is a double-blind randomised controlled trial, it means the conclusions of the study
in the sample can be generalised to the target population.
(C) The random assignment will likely minimise the effects of confounding variables.
11. Patch Z is a new medicine created to remove muscle soreness. A study was done to investigate
the effectiveness of patch Z. The population of interest was Singaporean adult males. For this
study, the researchers requested the Singapore Sports Association to sample all male athletes who
reported for training over the week. 200 male athletes were sampled. There was no non-response.
The 200 subjects had their identity tags randomly shuffled in a box. The first 100 tags picked
from the box were assigned to the treatment group - administered patch Z. The remaining 100
were administered a placebo. 72% of the group that received patch Z had their muscle soreness
alleviated, while 34% of the other group had their soreness alleviated. Which of the following is/are
true?
(I) We are not able to generalise these results to the population of interest.
(II) Random assignment was not conducted.
12. A researcher wishes to study procrastination and social anxiety levels amongst students majoring
in architecture in University X. He wanted to collect a sample of 100 students out of the 1000
students majoring in architecture. He took a name list of all architecture students in University
X. The researcher rolled a fair six-sided die, which landed on 3.
He then decided to pick the 3rd student in the name list, and every 10th student afterwards until he
collected his desired sample size of 100 students. That is, he selects the 3rd student, 13th student,
23rd student . . . until he gets his desired sample size of 100 students.
What kind of sampling method did the researcher employ?
13. Suppose you are a researcher who is interested in drawing a simple random sample of 200 people
from a population of 5000 individuals. Which of the following would be a correct approach?
Select all that apply.
(A) Sort the names of the entire population by alphabetical order (A to Z) and place the names
in a list. Select the people whose names appear at the top 200 of the list.
(B) Assign each individual in the population a unique integer from 1 to 5000 by random assign-
ment. Then choose the people assigned numbers 4801 to 5000.
(C) Write all the names of the entire population on equal-sized pieces of paper, mix the papers in
a box and draw out 200 pieces of paper at one go. Choose the people whose names appear on
the drawn papers.
14. We use random numbers to take a simple random sample of 50 students from a list of 6000
undergraduate students (45% males, 55% females) at the National University of Singapore. 50
random numbers from 1 to 6000 are selected, and the correspondingly numbered students are
selected. Which of the following statements is TRUE?
(A) We would never get the number 1111, because it is not random.
(B) The draw 1234 is no more or less likely than the draw 1111.
(C) Since the sample is random, it is impossible that it has only females in the sample.
(D) Since the sample is random, it is impossible to get the random numbers 0001, 0002, 0003, . . .,
0049, 0050.
15. There is a population of 200 secondary school students from Singapore that I wish to take a sample
from. Among the students,
(A) Scenario 1: simple random sampling; Scenario 2: stratified sampling; Scenario 3: cluster
sampling.
(B) Scenario 1: non-probability sampling; Scenario 2: cluster sampling; Scenario 3: stratified
sampling.
(C) Scenario 1: non-probability sampling; Scenario 2: stratified sampling; Scenario 3: non-
probability sampling.
(D) Scenario 1: simple random sampling; Scenario 2: cluster sampling; Scenario 3: non-probability
sampling.
16. A class of 150 men and 250 women is seated in an examination hall that has 25 rows of chairs,
with 16 chairs in each row. The men are seated in chairs numbered 1 to 150, and the women are
seated in chairs numbered 151 to 400. Which of the following scenarios will NOT produce a simple
random sample of students from this class of 400? Select all that apply.
(A) Choose a row at random and select all the students in that row.
(B) Use a random number generator to generate 10 integers from 1 to 400. Select the students
seated in chairs corresponding to these numbers.
(C) Use a random number generator to generate 10 integers from 1 to 150, and 10 integers from
151 to 400. Select the students seated in chairs corresponding to these numbers.
(D) Randomly select a letter from the English alphabet. Select for the sample students whose
family names begin with that letter. If no family name begins with that letter, randomly
choose another letter from the alphabet.
17. As a response to initial feedback on social media from its customers, Singapore Airlines is reviewing
the implementation of paper serviceware, which is to be rolled out to the premium economy and
economy class on medium- and long-haul flights. To that end, the company decides to conduct a
study to learn about customers’ opinions on the matter. As the company prepares to collect data,
its research team has to make a choice between two options: either to conduct a census or to obtain
a simple random sample. (Note: For this question, assume that the sampling frame is perfect, and
the simple random sample was drawn from the perfect sampling frame.) Which of the following
considerations regarding data collection is/are true? Select all that apply.
(A) A census may be preferred over a simple random sample because the results of a simple random
sample are likely to be biased.
(B) A simple random sample may be preferred over a census because results can be obtained with
reduced cost.
(C) Neither the results of a census nor a simple random sample are subject to selection bias.
(D) Given the same method of simple random sampling, taking a larger sample increases the
amount of random error of the estimate.
(E) Given the same method of simple random sampling, taking a larger sample reduces the amount
of random error of the estimate.
18. A researcher in University X wanted to conduct a survey to find out the average amount of time
spent studying a week, by students in the university. He obtained the list of email addresses of all
2000 students in the university and sent out a survey form to everyone. As a token of appreciation,
students who filled up the form received a “10% off” coupon from the university’s bookshop. 300
students responded to the survey.
Which of the following statements is/are correct? Select all that apply.
19. A publication is estimated to have about 20000 subscribers. A survey was sent to a random sample
of 5000 of its subscribers. 300 of them returned the survey. Which of the following statements is
true?
(A) The sample results may not be generalisable to the population of subscribers because they
used a self-selected sample.
(B) The sample results may not be generalisable to the population of subscribers because there is
likely to be non-response bias.
(C) The sample results will be generalisable to the population of subscribers because they used a
random sample.
20. Which of the following scenarios involve(s) the use of probability in the sampling process?
Select all that apply.
(A) A student wants to know how receptive bus commuters in Singapore are to the recent an-
nouncement of a price increase. To obtain a sample, he went to a nearby bus interchange and
looked for commuters wearing white clothing (his favourite colour). For each commuter wear-
ing white, if the commuter was wearing spectacles, he approached the commuter. Otherwise,
he did not approach the commuter. We may assume that every commuter that he approached
completed the survey.
(B) An event organiser wants to survey a sample of the participants of his event to find out if
they liked the activities he planned. He announced to all the participants that anyone who
completed the survey would win a prize with a probability of 0.8. The sample was made up
of 75 participants who responded to the survey.
(C) The principal of a primary school wants to select a sample of his primary one students to
find out how they are coping with formal school education. There are 10 primary one classes
and each class has 42 students. The principal went into each class and rolled a six-sided die.
If the die showed k (k = 1, 2, 3, 4, 5 or 6), then students in the class with register numbers
k, k + 6, k + 12, k + 18, k + 24, k + 30 and k + 36 would be included in the sample. A final
sample of 70 students, 7 from each class, was formed.
(D) A researcher is trying to study how much time students from the Faculty of Science sleep
every day. He obtained a list of all Science students from the administration and for each
student on the list, he generated a random number from 1 to 10. If the number generated
was less than 7, the student was not selected. If the number was 7 or more, the student was
selected, and the researcher sent an email to the student with a few survey questions. A total
of 700 students received the survey email and 500 of them replied.
21. We have learnt that the standard deviation and interquartile range (IQR) are examples of summary
statistics that help us to quantify the spread of data points. However, they are not the only ways
of quantifying spread and there are other summary statistics that can also help us to do this. For a
numerical variable x with values x1, x2, . . . , xn and mean x̄, we can define the Mean Absolute Deviation
(commonly abbreviated as MAD) using the formula

MAD = (1/n) × ( |x1 − x̄| + |x2 − x̄| + · · · + |xn − x̄| ).

A short computational sketch follows.
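As an illustration (not part of the original exercise), here is a minimal Python sketch of the MAD computation, assuming the standard definition reconstructed above; the sample values are arbitrary.

def mean_absolute_deviation(values):
    # Average absolute distance of each value from the mean.
    m = sum(values) / len(values)
    return sum(abs(v - m) for v in values) / len(values)

print(mean_absolute_deviation([1, 3, 5, 5, 5, 7, 9]))  # mean is 5, MAD = 12/7 = 1.714...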
22. A teacher has finished marking her students’ test scripts for a Mathematics test. The maximum
mark attainable for the test is 50. She records the following summary statistics for her class.
Mean: 37.4
Median: 35
Standard deviation: 17.22
Quartile 1: 23
Quartile 3: 43
Highest mark: 48
Lowest mark: 16
Range: 48 − 16 = 32.
She returns the test papers to her class and goes through the answers. Whilst going through the
answers, she realises that she has marked a question incorrectly for the whole class. She collects her
students’ scripts back and corrects her mistake. As a result, everyone in the class gets 2 additional
marks. Which of the following summary statistics will change for the class? Select all that apply.
(A) Median.
(B) Standard deviation.
(C) Highest mark.
(D) Quartile 1.
23. A school consists of 53 classes. During a budget meeting, school board members decided to review
class size information to determine budgeting for the classes. Let x be the numerical variable whose
values are the number of students among the 53 classes. Summary statistics for x are shown in the
table below.
Mean (x̄)                 33.39 students
Standard deviation (sx)   5.66 students
Minimum                   17
Q1                        29
Median                    33
Q3                        40
Maximum                   40
During the meeting, the following budget is set for classroom stationery supplies. Every class
receives $12 plus an additional $0.75 for each student in the class. For example, a class with one
student receives $12.75, while a class of 40 students receives $12 + 40($0.75) = $42. Define a
numerical variable y where
y = $(12 + 0.75x).
Basically, y takes values that correspond to the amount of money that classes receive for their
stationery supplies. Based on the summary statistics for x, which of the following statements must
be true regarding the summary statistics of y? Select all that apply.
24. Consider a data set consisting of values for a numerical variable x. Let the values be x1, x2, . . . , xn,
arranged in ascending order. A value y is said to be the balancing point of x in the data set
if the following condition is satisfied:

(y − x1) + (y − x2) + · · · + (y − xk) = (xk+1 − y) + (xk+2 − y) + · · · + (xn − y),

where x1, x2, . . . , xk are the values of x in the data set that are smaller than or equal to y and
xk+1, xk+2, . . . , xn are the values of x in the data set that are larger than y. For example, consider
the small data set {1, 3, 5, 5, 5, 7, 9}. In this case the value 5 is the balancing point of the data set
since
(5 − 5) + (5 − 5) + (5 − 5) + (5 − 3) + (5 − 1) = (7 − 5) + (9 − 5).
Which of the two statements below is/are true?
(I) The median of x is always the balancing point of x in any data set.
(II) The mode of x is always the balancing point of x in any data set.
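As an aside, the balancing-point condition in question 24 can be checked numerically. The following minimal Python sketch (illustrative only; the function name is ours) verifies the worked example given in the question.

def is_balancing_point(y, data):
    # Sum of deviations (y - x) over values at most y ...
    below = sum(y - x for x in data if x <= y)
    # ... must equal the sum of deviations (x - y) over values above y.
    above = sum(x - y for x in data if x > y)
    return below == above

print(is_balancing_point(5, [1, 3, 5, 5, 5, 7, 9]))  # True: 6 == 6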
25. Consider the sample data set WEIGHT comprising the following numerical values:
Obtain WEIGHT2 and WEIGHT3 by multiplying all values in WEIGHT by 2 and 3 respectively.
Which of the following statements is/are true? Select all that apply.
(A) The coefficient of variation of WEIGHT2 is the same as the coefficient of variation of WEIGHT.
(B) The coefficient of variation of WEIGHT3 is the same as the coefficient of variation of WEIGHT.
(C) The coefficient of variation of WEIGHT, correct to 3 decimal places, is 0.108.
26. Let x1, x2, . . . , xn be values of a numerical variable x in a data set. Assume that the data is a
sample (i.e., it is not population data). Which of the following statements is/are definitely true
with regards to the variance? Select all that apply.
27. The Registry of Marriages is interested to see the relationship between husbands’ and wives’ ages in
City X. In a sample of 1000 pairs of husbands and wives, they observed that men always married
women who are younger than them and the difference in age between a man and his wife is less
than or equal to 2 years. In other words, for every couple,

0 < (husband’s age) − (wife’s age) ≤ 2.
Based only on the information given above, which of the following statements must be true?
(I) The difference between the mean age of the husbands and the mean age of the wives in the
sample is less than or equal to 2.
(II) The standard deviation of the wives’ age in the sample is the same as the standard deviation
of the husbands’ age.
28. Let X be a numerical variable and assume that its mean and standard deviation are non-zero.
Which of the following statements is true regarding the coefficient of variation for X?
(A) If we multiply all the values of X by –2, then the coefficient of variation is multiplied by 2.
(B) If the mean of X is positive, then the coefficient of variation of X is also positive.
(C) Adding a constant to all the values of X does not change the coefficient of variation.
(D) Assuming that X is a numerical variable with units, the coefficient of variation has the same
units as X.
29. A study was done to determine the relationship between perceived obstetric violence and the risk
of postpartum depression (PPD). A total of 782 women were asked to report on their baseline
characteristics including the following: Age, Education level (Secondary School/ Pre-university
/ University), Family monthly wage (Euros), and Nationality (Spanish / Portuguese / French /
Others). Which of the following statements is true about the types of variables of Age, Education
level, Family monthly wage and Nationality, respectively?
30. The paper titled “An empirical study on gender, video game play, academic success and complex
problem-solving skills”, published in Computers and Education Journal, contained the following
excerpts from the section on data collection tools:
Gaming time: Two open-ended items were used to measure the daily time spent on video gaming
during weekdays and weekends, respectively (“How much time do you spend playing video games
during the weekdays/weekends?”). The average daily gaming time was calculated with a weighted
formula ((DailytimeForWeekdays*5 + DailyTimeForWeekends*2)/7).
Gaming frequency was measured with a single-item Likert scale: “In the last twelve months,
how often have you played personal computer (PC)/mobile/console games?” Answers to the scale
ranged between “Never” (1), “Less than once a month” (2), “Once a month” (3), “A few times per
month” (4), “3–4 times a week” (5), “5–6 times a week” (6), and “At least once a day ” (7).
Gaming experience was measured for PC, mobile, and console platforms separately. For each
platform, participants were asked how many years they had been playing video games on that
platform. Average gaming experience was calculated through taking the average gaming experience
for the PC, mobile and console platforms.
Playing alone vs. playing with a team: In order to measure playing alone vs. playing with
a team preference, participants were asked: “Do you play video games alone or with a team?”
Participants were allowed to choose only one of the two options.
Based on the above information, choose the correct classification for the variables Gaming time,
Gaming frequency, Gaming experience and Playing alone vs. playing with a team (in this order)
from the options below.
Categorical Data Analysis

In Section 1.3, we learnt that there are two main types of variables, namely categorical variables and
numerical variables. For categorical variables, there are two sub-types, namely ordinal variables and
nominal variables. Ordinal variables are those whose categories come with some natural ordering. On
the other hand, there is no intrinsic ordering for the nominal variables. For numerical variables, there
are those that are continuous and those that are discrete. The focus of this chapter is on categorical
variables and we will discuss numerical variables in the next chapter.
Much of the discussion in this section is centred around the following example.
Example 2.1.1 Suppose a patient newly diagnosed with kidney stones visits his urologist for the first
time since diagnosis to discuss the best possible treatments that he could undergo.
In preparation, the urologist took out the historical records of the various patients he had seen previously
and summarised the data into a table. Part of the table is shown below.
Each row of the table is a particular patient that the urologist had seen previously and the columns
are the variables related to each patient. While the table only shows the first 6 cases, the data in fact
contains 1050 observations (or data points). The four variables are
1. The size of the kidney stone. This is an ordinal categorical variable that has two categories. The
kidney stones can be classified as either small or large.
2. The gender of the patient. This is a nominal categorical variable that has two categories, male or
female.
3. The treatment that the patient underwent. Again, this is a nominal categorical variable and there
are two categories, namely treatment X and treatment Y.
4. The outcome of the treatment is also a nominal categorical variable. The categories are success
and failure.
How should the urologist use the 1050 observations to assist in the decision for this new patient?
Before we continue, let us recall the PPDAC cycle that was introduced as the main process behind
the approach to a data driven problem.
The overarching question faced by the urologist is simply how to treat his patients better, in particular,
this new patient. What kind of insights does the data set give the urologist that will enable him to
better advise his patient?
To apply the PPDAC cycle to this context, let us start with a question that we want to answer. A
simple question to start with is:
Question 1: Are the treatments given to the patients successful? In other words, should this new
patient receive treatment?
Moving on from “Problem” to “Plan”, we next determine the variables that need to be
measured and then proceed to obtain data on 1050 previous cases where the outcome of the treatment
was recorded as either a success or a failure. The PPDAC cycle is a continuous process: after looking
at the data, drawing some preliminary conclusions might lead to more questions, some of which may
not even have been considered at the start. The analysis stage involves sorting the data, tabulating and
plotting graphs of the outcome variable. We may observe interesting trends, and this leads us to ask
more questions about those trends, bringing us back to the top of the cycle again. Some of the new
questions that we can ask include
We will now discuss some of the tables and charts that can be generated from the data that will give
us useful information.
Example 2.1.2 (Analysing 1 categorical variable using a table.) Suppose out of the 1050 pre-
vious patients, there were 831 records of success and 219 records of failure after treatment was given.
Thus from this simple collation, a preliminary conclusion is that we should generally recommend the
new patient to go for treatment since there are more successful outcomes than failed outcomes. We
can present this information on the number of successes and failures in a table, together with two other
columns, namely rate and percentage.
Categories of the       Count   Rate                                  Percentage
“Outcome” variable
Success                 831     rate(Success) = 831/1050 = 0.791      0.791 × 100% = 79.1%
Failure                 219     rate(Failure) = 219/1050 = 0.209      0.209 × 100% = 20.9%
Total                   1050    1050/1050 = 1                         1 × 100% = 100%
1 The concept of rates will soon be discussed in this section.
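For readers following along with software, the rate and percentage columns in the table above can be reproduced with a few lines of Python; this is a minimal sketch using the counts from the table.

counts = {"Success": 831, "Failure": 219}
total = sum(counts.values())  # 1050

for category, count in counts.items():
    rate = count / total
    print(f"{category}: count = {count}, rate = {rate:.3f}, percentage = {rate:.1%}")
# Success: count = 831, rate = 0.791, percentage = 79.1%
# Failure: count = 219, rate = 0.209, percentage = 20.9%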
We can also represent this as a percentage: 79.1% of the total treatments were successful. Similarly,
the rate of failed treatments is 219/1050 = 0.209, or 20.9%.
Example 2.1.3 (Analysing 1 categorical variable using a plot.) Since we are interested in the
variable Outcome, which is a categorical variable, we can illustrate the counts in each of the categories
“Success” and “Failure” in the form of a bar plot. The two bar plots described below were created using
Microsoft Excel.
The bar plot on the left is known as a dodged bar plot. The x-axis indicates the variable Outcome
whereas the y-axis shows the number (that is, the count) of successes and failures in the variable Outcome.
Two bars, one for success counts and the other for failure counts, are placed next to each other. Such
an illustration is useful in comparing the relative numbers in the categories.
The bar plot on the right is known as a stacked bar plot. The x and y-axes are similar to those of the
dodged bar plot but instead of two bars, we now have only one bar, where the count of failures (219) is
stacked on top of the count of successes (831). Such an illustration is useful in comparing the occurrences
of each category as a percentage or fraction of the total number of responses. Instead of showing the
absolute numbers in each category, it is also possible to show the percentages directly in the plot itself,
as seen in the figure below. However, it should be noted that the y-axis then gives the percentages rather
than the actual numbers.
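For readers who prefer code to Excel, a minimal matplotlib sketch along the following lines produces comparable dodged and stacked bar plots; the styling choices here are ours, not the module's.

import matplotlib.pyplot as plt

success, failure = 831, 219

fig, (dodged, stacked) = plt.subplots(1, 2, figsize=(8, 4))

# Dodged bar plot: one bar per category, placed side by side.
dodged.bar(["Success", "Failure"], [success, failure])
dodged.set_xlabel("Outcome")
dodged.set_ylabel("Count")

# Stacked bar plot: the failure count stacked on top of the success count.
stacked.bar(["Outcome"], [success], label="Success")
stacked.bar(["Outcome"], [failure], bottom=[success], label="Failure")
stacked.set_ylabel("Count")
stacked.legend()

plt.tight_layout()
plt.show()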
Regardless of which bar plot is used, we can see that there are many more successes than failures and
based on this, it is reasonable to recommend our patient to go for some form of treatment based on the
information that we have at this stage.
Remark 2.1.4 In this example on treatment of kidney stones, the success of any treatment is defined
as having the kidney stones removed or reduced significantly so that it does not pose any further threat
to the patient. On the other hand, failure means that the stones were not able to be removed. In general,
kidney stones cause little morbidity and mortality. It is useful to note that for other kinds of illness,
where treatments have higher stakes, the conclusion may be different.
Now that we are rather convinced that the new patient should receive treatment, the PPDAC cycle
brings us back with new questions that arise from our investigation into the data set of 1050 previous
patients. It is reasonable to ask the next question as follows:
Question 2: There are two types of treatment, namely X and Y. Which treatment type is better
for our new patient?
To answer this question, we can revisit the PPDAC cycle and define a new problem and plan to look
at new variable(s) of interest and analyse the data again using plots that we have introduced previously.
1. The new problem is as stated above, namely, which treatment is better for our new patient.
2. This means that the key variable that we should look at is the treatment type categorical variable,
which has two categories, treatment X and treatment Y.
3. This does not mean that treatment type is the only variable of interest, but rather, it should
be investigated together with the outcome variable. This is because we want to know how the
treatment type affects the outcome.
Example 2.1.5 (Analysing 2 categorical variables using a table.) When we used a table to
analyse 1 categorical variable (Outcome), the table showed only the number of successes and failures
among the 1050 previously treated patients. When we introduce a second categorical variable (Treatment
type), we have a 2 × 2 contingency table that will summarise the two variables across the 4 (that’s why
it is called 2 × 2) possible combinations of (Treatment, Outcome).
Outcome
Success Failure Row Total
Treatment
X 542 158 700
Y 289 61 350
Column Total 831 219 1050
Recall that out of 1050 previous patients, 831 underwent successful treatments while the other 219
were failed treatments. The 2 × 2 table breaks down the 831 successful treatments according to the
treatment type. As seen from the Success column, 542 were given treatment X while 289 were given
treatment Y. Similarly, for the 219 failed treatments, 158 were given treatment X while 61 were given
treatment Y.
If we look across a row instead of down a column, we could, for example, see that there were 700
previous patients given treatment X, of which 542 were successful and 158 failed. Similarly, looking at
the row for treatment Y, we see that out of 350 people who underwent treatment Y, 289 of them had
successful treatments while 61 did not.
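In practice, such a contingency table is usually tabulated by software from patient-level records. A minimal pandas sketch is shown below; the three example rows are hypothetical stand-ins for the actual 1050 records.

import pandas as pd

# Hypothetical patient-level records (the real data has 1050 rows).
records = pd.DataFrame({
    "Treatment": ["X", "X", "Y"],
    "Outcome": ["Success", "Failure", "Success"],
})

# Convention: independent variable on the rows, dependent variable on the columns.
table = pd.crosstab(records["Treatment"], records["Outcome"], margins=True)
print(table)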
Remark 2.1.6
1. It should be noted that by convention, the dependent variable Outcome is placed on the columns
on the table while the independent variable Treatment type is placed on the rows.
2. The column total values for the success (831) and failure (219) columns should add up to the same
values as the sum of the row total values for Treatment X (700) and Treatment Y (350), which
obviously should both add up to the total number of data points in the data set which is 1050.
Discussion 2.1.7 In order to answer Question 2, it will be useful to ask other related questions, for
example:
1. Question 2a: What proportion of the total number of patients were given treatment Y (or X)?
2. Question 2b: Among those patients given treatment X, what proportion were successful?
3. Question 2c: What proportion of patients were given treatment Y and had a failed treatment
outcome?
To answer Question 2a, we note that there were 350 previous patients who underwent treatment
Y. The proportion of the total number of patients that underwent treatment Y is

350/1050 = 1/3 = 33 1/3 %.

We can also denote this as

rate(Y) = 1/3 or 33 1/3 %.

We have seen earlier that out of 1050 patients, there were 831 successful treatments, so we can write
rate(Success) = 0.791 or 79.1%. We know that

rate(X) = 700/1050 = 2/3 or 66 2/3 %.
Outcome
Success Failure Row Total
Treatment
X 542 158 700
Y 289 61 350
Column Total 831 219 1050
Notice that in the calculations above, we have used two numbers in the margin of the table (for
example, 831 and 1050) that relate to just one of the categorical variables (Outcome) each time; we
call these marginal rates. Similarly, rate(Y) = 350/1050 = 1/3 is also a marginal rate.
How should we answer Question 2b? In this case, we need to zoom in onto the patients who had
undergone treatment X and figure out what proportion of them have had a successful treatment.
Referring to the table again, we see that out of 700 patients who were given treatment X, 542 of them
were successfully treated. Hence the proportion of successful treatments was

542/700 = 0.774 = 77.4%.
This rate of success is computed based on only those patients who were under treatment X, which sets
the condition for the calculation of the rate. Once such a condition is set, those patients on treatment
X will be considered as the population and those on treatment Y will not be part of any consideration.
Such a rate is known as a conditional rate, which is one that is based on a given condition.
A note on the notation used for conditional rates is that we replace the word “given” by a vertical
bar so that rate(Success given treatment X) is written as
rate(Success | X).
Let us consider Question 2c. From the table, we can see easily that there are 61 cases where
treatment Y was given but had an unsuccessful outcome.
Outcome
Success Failure Row Total
Treatment
X 542 158 700
Y 289 61 350
Column Total 831 219 1050
So the rate of patients who were given treatment Y AND had a failure was

61/1050 = 0.0581 = 5.81%.
This rate is known as a joint rate and it is not a conditional rate since we are looking at all the 1050
patients as our baseline. In other words, we are now considering patients on treatment X, as well as
patients on treatment Y as the population. One should be careful with the implicit difference in the
phrasing of the two statements:

What proportion of patients were given treatment Y and had an unsuccessful outcome?
Answer: rate(Unsuccessful and Y) = 61/1050 = 0.0581.

Among the patients who were given treatment Y, what proportion had an unsuccessful outcome?
Answer: rate(Unsuccessful | Y) = 61/350 = 0.174.

The first question refers to the joint rate/proportion/percentage while the second question refers to the
conditional rate/proportion/percentage.
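The difference between the two baselines can be made concrete in code. Below is a minimal Python sketch computing both rates from the counts in the table.

counts = {("X", "Success"): 542, ("X", "Failure"): 158,
          ("Y", "Success"): 289, ("Y", "Failure"): 61}

total = sum(counts.values())                                   # 1050 patients
total_y = counts[("Y", "Success")] + counts[("Y", "Failure")]  # 350 on treatment Y

joint = counts[("Y", "Failure")] / total          # baseline: all 1050 patients
conditional = counts[("Y", "Failure")] / total_y  # baseline: treatment Y patients only

print(f"rate(Unsuccessful and Y) = {joint:.4f}")        # 0.0581
print(f"rate(Unsuccessful | Y)   = {conditional:.3f}")  # 0.174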
Discussion 2.1.8 It should be clear at this point from our discussion of rates and proportions that
our decision on which treatment to suggest to our new patient cannot be based simply on the absolute
number of successes and failures for each treatment type. If we had based the decision on absolute
numbers, we would have gone for treatment X since there were 542 success cases compared to only 289
for treatment Y.
The reason we should look at rates rather than absolute numbers is that the number of patients
undergoing each treatment is different. It would not be surprising if there were more successful cases
for treatment X simply because more patients were given this treatment, rather than because it is more
effective. Finding the rate of success for each treatment before comparing them is a form of
normalisation. At this stage of our analysis, when using the success rates to compare the treatment
types, our conclusion is to recommend treatment Y to our patient.
The rate of success given treatment X is the conditional rate we have already calculated in answering
Question 2b, which is

rate(Success | X) = 542/700 = 0.774 = 77.4%.

Similarly, we can calculate the rate of success given treatment Y, which is

rate(Success | Y) = 289/350 = 0.826 = 82.6%.
We can also look at the conditional rates in another way. For treatment X, a conditional rate of
success of 77.4% means that out of every 100 patients who underwent treatment X, about 77 had a
successful outcome. For treatment Y, the corresponding figure is about 83 successes out of every 100
patients receiving this treatment.
As the rate of success for treatment Y is higher, we can now say that treatment Y is better than
treatment X and advise the patient appropriately. Notice that we would have given the opposite advice,
incorrectly, if we had looked at absolute numbers instead of rates.
We can now add in the rates to the 2 × 2 contingency table given at the beginning of Example
2.1.5.
Outcome
Success Failure Row Total
Treatment
X 542 (77.4%) 158 (22.6%) 700 (100%)
Y 289 (82.6%) 61 (17.4%) 350 (100%)
Column Total 831 (79.1%) 219 (20.9%) 1050 (100%)
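The row percentages added to the table above can be computed in the same way; here is a minimal sketch.

table = {"X": {"Success": 542, "Failure": 158},
         "Y": {"Success": 289, "Failure": 61}}

for treatment, outcomes in table.items():
    row_total = sum(outcomes.values())  # 700 for X, 350 for Y
    rates = {outcome: f"{n / row_total:.1%}" for outcome, n in outcomes.items()}
    print(treatment, rates)
# X {'Success': '77.4%', 'Failure': '22.6%'}
# Y {'Success': '82.6%', 'Failure': '17.4%'}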
Example 2.1.9 (Analysing 2 categorical variables using a plot.) In Example 2.1.3, we introduced
dodged bar plots and stacked bar plots to present the data on a single variable Outcome. We can also
use these plots to present the counts of Outcome broken down by Treatment. These were the two
variables we analysed using a table in Example 2.1.5.
The dodged bar plot on the left shows the success and failure counts for both treatments X and Y.
The numbers above the bars are the success or failure counts for that particular treatment type.
The stacked bar plot on the right shows the same information but with the success and failure bars
under the same treatment stacked instead of being placed side by side.
Both bar plots tell us that there are many more successful treatments under treatment X than under
treatment Y, which may lead to the conclusion that treatment X is more effective (since the green bars
for treatment X are bigger than the green bars for treatment Y). However, it is also obvious from the
stacked bar plot that these two treatments have very different numbers of patients (represented by the
heights of the two bars).
Similar to our analysis using tables, we can also create the plots using rates instead of absolute
numbers.
In this plot, notice that both the treatment X and treatment Y bars have been normalised to the
same height (which is 100%). We are no longer comparing absolute numbers, but instead comparing
the rates of success (the height of the green bars, as a proportion of the total height) between the two
treatments. We can see immediately that treatment Y has a higher rate of success (taller green bar)
compared to treatment X.
To summarise this section, we have discussed how we can analyse two categorical variables. This can
be done either using a 2 × 2 contingency table, or bar plots (dodged or stacked), which make it easier
for us to observe any differences between the categories. We also introduced the concept of rates as a
means of fair comparison when group sizes are unequal. To formally discuss the relationship between
two categorical variables, we will introduce the concept of association in the next section.
Definition 2.2.1 In Section 2.1, we considered the example of two different treatments for patients
with kidney stones. Let’s say that initially, we guessed that the treatment type involved does not affect
the outcome of the treatment, meaning that we could advise our patient to undergo either treatment
because the outcome would not be affected. If this was the case, then we can say that the treatment
type is not related to the outcome of the treatment.
After analysing the data using rates, we found that this was not the case. There was a higher success
rate observed for patients under treatment Y compared to those under treatment X. Due to the difference
in success rates, we say that there is a relationship between the type of treatment and the outcome of
the treatment.
To formalise the notion of such a relationship, we say that treatment type is associated with the
outcome of the treatment. More specifically, treatment Y is positively associated with the success of the
treatment. What this means is that treatment Y and successful treatments tend to occur together.
On the other hand, we say that treatment X is negatively associated with the success of the treatment.
This is because we tend to see treatment X and failed treatments go hand in hand.
Remark 2.2.2
1. Note that treatment X being negatively associated with the success of the treatment does not mean
that a significant proportion of patients undergoing treatment X will see the treatment fail (77.4%
of them still recorded success). The negative association is stated as a comparison between the
two treatment types X and Y, where in this case treatment Y tends to produce more successful
outcomes.
2. We should be conscious of the choice of the word associated because we do not know if the
outcome of the treatment was entirely due to the treatment type received. The data we had came
from an observational study hence it might be erroneous for us to say that the type of treatment
and the outcome of the treatment have a causal relationship. It is important to see the distinction
between association and causation and for the rest of this chapter, we will be focussing on discussing
associative relationships between categorical variables rather than causal relationships.
Discussion 2.2.3 So how do we identify an association between two variables? Suppose the two vari-
ables we are considering represent two characteristics in a population. Let us call these two characteristics
A and B. For example, A could be smoker (so one categorical variable could be smoking habit, with
two categories smoker and non-smoker ) while B could be male (so the other categorical variable could
be gender, with two categories male and female). The population can be any well-defined group of
people. In the population, those “with A” refers to the smokers, while those “without A”, denoted by NA,
refers to the non-smokers. Similarly, those “with B” refers to the males and those “without B”, denoted
by NB, refers to the females.
So if the rate of A given B (proportion of smokers among males) is the same as the rate of A given
NB (proportion of smokers among females), then it means that the rate of A is not affected by the
presence or absence of B. Thus in this case, there is no difference in the proportion of smokers between
both gender groups and we write
rate(A | B) = rate(A | NB).
However, if the rate of A given B is not the same as the rate of A given NB, then there are two
possible situations.
The first possibility is that the rate of A given B is more than the rate of A given NB. This means that
the presence of A when B is present is stronger compared to when B is absent. Hence we say
that there is positive association between A and B. In this case, we write

rate(A | B) > rate(A | NB),

and for the gender/smoking example, this means that there is a higher proportion of smokers among
males than among females. So being male and smoking are positively associated.
The other possibility is that the rate of A given B is less than the rate of A given NB. This means that
the presence of A when B is present is weaker compared to when B is absent. Hence we say
that there is negative association between A and B. In this case, we write

rate(A | B) < rate(A | NB),

and for the gender/smoking example, this means that there is a lower proportion of smokers among
males than among females. So being male and smoking are negatively associated.
The inequality rate(A | B) > rate(A | NB) (resp. rate(A | B) < rate(A | NB)) is not the only one
that allows us to conclude that there is positive (resp. negative) association between A and B. The table
below provides three other comparisons between rates that lead to the same conclusion of a positive
(resp. negative) association between A and B. These different comparisons are mathematically equivalent
to each other and their equivalence will be established using the symmetry rule in Discussion 2.3.1.
Establishing association
Positive association between A and B: Negative association between A and B:
(any of the following) (any of the following)
rate(A | B) > rate(A | NB) rate(A | B) < rate(A | NB)
rate(B | A) > rate(B | NA) rate(B | A) < rate(B | NA)
rate(NA | NB) > rate(NA | B) rate(NA | NB) < rate(NA | B)
rate(NB | NA) > rate(NB | A) rate(NB | NA) < rate(NB | A)
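Since these comparisons involve nothing more than arithmetic on the four cell counts, they are easy to check in code. The following is a minimal Python sketch (our own illustration, not part of the module's materials; the function name and the cell labels w, x, y, z are our choices, matching the derivation in Discussion 2.3.1):

```python
def association(w, x, y, z):
    """Direction of association between A and B from a 2 x 2 table.

    Cell counts: w = (A and B), x = (A and NB),
                 y = (NA and B), z = (NA and NB).
    """
    rate_a_given_b = w / (w + y)    # rate(A | B)
    rate_a_given_nb = x / (x + z)   # rate(A | NB)
    if rate_a_given_b > rate_a_given_nb:
        return "positive association"
    if rate_a_given_b < rate_a_given_nb:
        return "negative association"
    return "no association"

# Kidney stones data, with A = successful outcome and B = treatment X:
# w = 542 (success, X), x = 289 (success, Y),
# y = 158 (failure, X), z = 61 (failure, Y).
print(association(542, 289, 158, 61))   # negative association
```

By the symmetry rule discussed below, checking any one of the four comparisons is enough.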
Example 2.2.4 Let us revisit the earlier example on two different treatments for kidney stones. Recall
that the two variables were treatment outcome and treatment type. For the treatment outcome variable,
let us split the patients into group A, which is the group of patients with successful outcomes and the
group NA will be those with unsuccessful outcomes. For the other variable, we will also split the patients
into group B for those given treatment X and group NB for those given treatment Y.
Let us revisit some conditional rates that were computed previously.
1. rate(A | B) = rate(Success | X) = 542/700 = 0.774.
2. rate(A | NB) = rate(Success | Y) = 289/350 = 0.826.
Since
rate(A | B) < rate(A | NB),
we can say that the presence of A when B is present is weaker than the presence of A when B is absent.
Thus there are fewer successful treatments when looking at treatment X compared to treatment Y and
hence success of the treatment is negatively associated with treatment X.
Conversely, since there are more successful treatments when looking at treatment Y compared to
treatment X, we can conclude that success of the treatment is positively associated with treatment Y.
Discussion 2.3.1 In this section, we will discuss two important rules regarding rates. Suppose we have
a population with two population characteristics A and B. Among the population there are those who
possess characteristic A and those who do not.
For ease of notation, we will denote those who possess characteristic A simply as “A” and those who
do not as “NA”. Similarly for characteristic B, those in the population who possess this characteristic
will be denoted as “B” and those who do not as “NB”.
(Symmetry rule)
The first rule that we will be discussing is known as the symmetry rule. Although there are three
parts to this rule, once we can understand the first part, the second and third parts will follow naturally.

Part 1: rate(A | B) > rate(A | NB) ⇔ rate(B | A) > rate(B | NA).

The above rule states that the rate of A given B is more than the rate of A given NB (call this
statement 1) if and only if the rate of B given A is more than the rate of B given NA (call this
statement 2). The if and only if here, denoted by ⇔, means that statements 1 and 2 happen together,
meaning that if one of the statements is true, the other one will also be true. In other words, the two
statements are either both correct or both incorrect. Another way of understanding (statement 1) if and
only if (statement 2) is
1. If rate(A | B) is more than rate(A | NB), which is (1), then this means that there is a positive
association between A and B;
2. This means that we are more likely to see A when B is present, compared to when B is absent;
3. This in turn means that we are more likely to see B when A is present, compared to when A is
absent;
This is the same as saying that A and B are positively associated. Conversely, we can argue in the other direction:
1. If rate(B | A) is more than rate(B | NA), which is (2), then this means that there is positive
association between B and A;
2. This means that we are more likely to see B when A is present, compared to when A is absent;
3. This in turn means that we are more likely to see A when B is present, compared to when B is
absent;
This is again the same as saying that A and B are positively associated. We have now seen Part 1 of
the Symmetry Rule. Parts 2 and 3 are analogous and can be similarly explained:

Part 2: rate(A | B) < rate(A | NB) ⇔ rate(B | A) < rate(B | NA).
Part 3: rate(A | B) = rate(A | NB) ⇔ rate(B | A) = rate(B | NA).
We will now present a mathematical derivation for Part 1 of the Symmetry Rule. You are encouraged
to go through the same process for Parts 2 and 3. Consider the 2 × 2 contingency table shown below
representing variables A and B. Let w, x, y, z denote the counts in each of the 4 cells.

     B    NB
A    w    x
NA   y    z

From the table, rate(A | B) = w/(w + y) and rate(A | NB) = x/(x + z). Since all the counts are positive,

rate(A | B) > rate(A | NB) ⇔ w(x + z) > x(w + y) ⇔ wz > xy.

Similarly, rate(B | A) = w/(w + x) and rate(B | NA) = y/(y + z), so

rate(B | A) > rate(B | NA) ⇔ w(y + z) > y(w + x) ⇔ wz > xy.

Hence, both inequalities are in fact equivalent to wz > xy and thus they are equivalent to each other.
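For readers who like empirical checks, the short sketch below (our own illustration) draws random 2 × 2 tables and confirms that the two rate inequalities and the product inequality wz > xy always agree:

```python
import random

random.seed(0)
for _ in range(10_000):
    # Random positive cell counts w, x, y, z for a 2 x 2 table.
    w, x, y, z = [random.randint(1, 100) for _ in range(4)]
    ineq_rates   = w / (w + y) > x / (x + z)   # rate(A | B) > rate(A | NB)
    ineq_sym     = w / (w + x) > y / (y + z)   # rate(B | A) > rate(B | NA)
    ineq_product = w * z > x * y               # wz > xy
    assert ineq_rates == ineq_sym == ineq_product
print("Part 1 of the symmetry rule verified on 10,000 random tables")
```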
Example 2.3.2 Let us revisit our kidney stones treatment example. The 2 × 2 contingency table below
gives us the number of patients in each treatment type as well as the number of success and failure
outcomes for each treatment type.
Outcome
Success Failure Row Total
Treatment
X 542 158 700
Y 289 61 350
Column Total 831 219 1050
We have earlier shown that A (successful treatment outcome) is negatively associated with B (treatment
X) since
rate(A | B) = rate(Success | X)
= 0.774
< 0.826 = rate(Success | Y) = rate(A | NB).
By symmetry rule part 2, we should have rate(B | A) < rate(B | NA). Let us verify that this is indeed
the case:

rate(B | A) = rate(X | Success) = 542/831 = 0.652,
rate(B | NA) = rate(X | Failure) = 158/219 = 0.721.

Since 0.652 < 0.721, we have thus verified that rate(B | A) < rate(B | NA) as predicted by symmetry
rule part 2. This also confirms that there is negative association between success of treatment (A) and
treatment X (B).
Discussion 2.3.3 (Basic rule on rates.) The second rule on rates is known as the basic rule on
rates. The main rule, as well as three consequences of the main rule, are shown below.

Basic rule on rates: the overall rate(A) will always lie between rate(A | B) and rate(A | NB).

Consequence 1: the closer rate(B) is to 100%, the closer rate(A) is to rate(A | B); likewise, the closer
rate(NB) is to 100%, the closer rate(A) is to rate(A | NB).
Consequence 2: if rate(B) = 50%, then rate(A) = (1/2) [rate(A | B) + rate(A | NB)].
Consequence 3: if rate(A | B) = rate(A | NB), then rate(A) is equal to this common value.
1. The basic rule on rates states that the overall rate(A) is always between the conditional rates of A
given B and A given not B.
2. The first consequence gives us a little more indication of where the overall rate(A) is going to
be. If rate(B) is closer to 100% (than rate(NB)), then rate(A) is going to be closer to rate(A | B)
compared to rate(A | NB).
3. The second consequence specifically states that if rate(B) is exactly 50%, then rate(A) will be
exactly the mid point between rate(A | B) and rate(A | NB).
4. Finally, the third consequence states that if the two conditional rates, namely
rate(A | B) and rate(A | NB) are the same, then the overall rate(A) will also take the same value
of the two conditional rates.
At this point, the significance of the basic rule and its consequences may not be immediately apparent
or intuitive. The best way towards understanding them is through an example.
Example 2.3.4 Suppose a school has two different classes (call them class Bravo and class Charlie) of
students who took the same data analytics examination. We are interested in studying the passing rate
of students at the entire school level and also at each individual class level. Suppose we are given the
following information: the passing rate of class Bravo is 75%, while the passing rate of class Charlie
is 55%.
For convenience, let us denote class Bravo as “B” and class Charlie as “NB” (not B). Similarly, denote
passing the examination as “A” and not passing as “NA”. So the two pieces of information we have
above are:
rate(A | B) = 0.75 and rate(A | NB) = 0.55.
By the basic rule on rates, the overall passing rate of all students from the school, that is, rate(A) is
between the two conditional rates:

0.55 = rate(A | NB) ≤ rate(A) ≤ rate(A | B) = 0.75.
However, without any further information, we will not be able to determine the exact value of rate(A).
What about the three consequences of the basic rule?
1. The first consequence states that the closer rate(B) is to 100%, then the closer rate(A) will be to
rate(A | B). In our example, it means that if the number of students in Bravo is far more than the
number of students in Charlie (thus rate(B) is closer to 100%), then the overall passing rate of the
school (that is, rate(A)) will be closer to the passing rate of class Bravo (that is, rate(A | B)).
2. The second consequence states that if rate(B) = 50% (which also means that rate(NB) = 50%),
then
rate(A) = (1/2) [rate(A | B) + rate(A | NB)].
That is, rate(A) will be right in between the two conditional rates. In our example, this means
that if the number of students in Bravo and Charlie are exactly the same, then the overall passing
rate of the school will be exactly in between 0.55 and 0.75, that is 0.65.
3. The third consequence states that if the two conditional rates rate(A | B) and
rate(A | NB) are the same, then the overall rate(A) will be the same value as the two conditional
rates. In our example, if the passing rates of class Bravo and class Charlie are the same, then the
overall passing rate of the school will be the same as the passing rate in either class.
Example 2.3.5 Let us continue with Example 2.3.4 and validate the basic rule on rates and the con-
sequences by considering actual numbers.
1. Suppose the total number of students and the number of passes in each of the two classes are given
in the table below.

Class     Passes   Total students   Passing rate
Bravo     450      600              0.75
Charlie   44       80               0.55
School    494      680              0.726

Notice that the passing rates of both classes are what they are supposed to be, but the number of
students in Bravo far exceeds the number in Charlie (so rate(B) = 600/680 ≈ 0.88, which is closer to 100%).
While the overall school passing rate of 0.726 is between 0.55 and 0.75 (in accordance with the basic
rule on rates), it is much closer to the passing rate of Bravo, as predicted by consequence 1.
2. Suppose the total number of students and the number of passes in each of the two classes are as
given below instead:
Again, the passing rates of both classes are what they are supposed to be, but in this case, the
number of students in Bravo and Charlie are the same (so rate(B) = rate(NB) = 0.5). As predicted
by consequence 2, the overall school passing rate will be 0.65, which is right in between the two
class passing rates.
3. To illustrate consequence 3, suppose the passing rates for both classes are the same, as shown
below.
Now the two conditional rates, namely rate(A | B) and rate(A | NB) are equal. By consequence 3,
rate(A) will be the same value as the two conditional rates. This is indeed the case as we see that
the two classes have the same passing rate which will result in the school having the same passing
rate of 0.75. It is important to note that we do not require rate(B) to be the same as rate(NB)
for consequence 3 to hold. For our example, this means that we do not require classes Bravo and
Charlie to have the same number of students. As long as the two class passing rates are the same,
consequence 3 will hold.
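In other words, the basic rule holds because rate(A) is a weighted average of the two conditional rates, with weights rate(B) and rate(NB). The Python sketch below (our own illustration; the function name is made up) checks the three scenarios of the school example against this weighted average:

```python
def overall_rate(rate_a_given_b, rate_a_given_nb, n_b, n_nb):
    """rate(A) as the weighted average of rate(A | B) and rate(A | NB),
    weighted by the relative sizes of groups B and NB."""
    rate_b = n_b / (n_b + n_nb)
    return rate_b * rate_a_given_b + (1 - rate_b) * rate_a_given_nb

# Consequence 1: Bravo (600 students) far outnumbers Charlie (80).
print(overall_rate(0.75, 0.55, 600, 80))    # ~0.726, close to rate(A | B) = 0.75

# Consequence 2: equal class sizes give the exact midpoint.
print(overall_rate(0.75, 0.55, 100, 100))   # 0.65

# Consequence 3: equal conditional rates give that same overall rate.
print(overall_rate(0.75, 0.75, 600, 80))    # 0.75 (up to floating point)
```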
Finally, let us verify the rule on rates using our kidney stones data set.
Example 2.3.6 We have seen the following table from Example 2.3.2.
Outcome
Success Failure Row Total
Treatment
X 542 158 700
Y 289 61 350
Column Total 831 219 1050
The conditional rates of success among the two treatment types are:
rate(Success | X) = 542/700 = 0.774,
rate(Success | Y) = 289/350 = 0.826.
The overall rate of success is
rate(Success) = 831/1050 = 0.791,
which is closer to the rate of success among patients given treatment X. This agrees with the basic
rule on rates and its first consequence, since there were more patients given treatment X (66.67% of
all patients) than treatment Y (33.33%).
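The same verification takes only a few lines of Python (our own sketch, using the counts from the table above):

```python
success_x, total_x = 542, 700    # treatment X
success_y, total_y = 289, 350    # treatment Y

rate_x = success_x / total_x                             # rate(Success | X) = 0.774
rate_y = success_y / total_y                             # rate(Success | Y) = 0.826
overall = (success_x + success_y) / (total_x + total_y)  # 831/1050 = 0.791

# Basic rule: the overall rate lies between the two conditional rates, and
# (consequence 1) closer to rate(Success | X), since X has 2/3 of the patients.
assert min(rate_x, rate_y) <= overall <= max(rate_x, rate_y)
print(round(rate_x, 3), round(overall, 3), round(rate_y, 3))   # 0.774 0.791 0.826
```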
In the three sections of this chapter we have discussed so far, we have seen how we can use the concept
of rates to investigate relationships, in particular, association between categorical variables. Very often,
exact rates (overall or conditional) are unknown to us but if we can apply some general rules like the
symmetry rule, basic rule or the consequences of the basic rule, we can still obtain valuable insights into
the data set we have on our hands. Making the best use of limited information is an important skill
when analysing data.
In the next section, we will discuss a surprising observation that can be counterintuitive to some but
is very important for anyone analysing data to be aware of.
Discussion 2.4.1 From earlier sections, when faced with the problem of advising our new kidney stones
patient, we have gone through two cycles of the PPDAC process.
The first question we asked was whether having any sort of treatment was better than not having
one. By comparing the rate of success (0.791) against the rate of failure (0.209), we concluded that there
were many more successful than failed treatments in past records, so the decision was to advise our
new patient that he should be treated.
But this led us to the next question, as there are two treatment types available, which type of
treatment should we recommend? This made us go back to the data and compare the success rates of
those patients who were given treatment X as opposed to the success rates of those given treatment Y.
Upon delving deeper into the data, we discovered that treatment Y is positively associated with success
rate. This suggests that treatment Y is “better” than treatment X and perhaps we should advise our
patient to undergo treatment Y.
Are we done with our analysis? Is there some lingering doubt in our minds that we may be providing
wrong advice to our patient? If we are convinced that treatment Y is better, should we always send
kidney stone patients for treatment Y? If not always, then when do we do so? What should our decision
be based on? These are again questions that prompt us to go back to our data and see if more information
can be obtained from it.
The data set from Example 2.1.1 shows there are two other variables that we have not used in our
analysis thus far, namely the size of the kidney stone and also the gender of the patient. Would these
variables be an important factor in our consideration? How should we go about analysing them? What
type of visualisation would be useful when doing such analysis? Let us begin by exploring the stone size
variable.
Example 2.4.2 (Analysing 3 categorical variables using a plot.) In Example 2.1.9, we used a
stacked bar plot for “Outcome” by “Treatment” to compare the success rates for treatments X and Y.
From the stacked bar plot, we have concluded that treatment Y is positively associated with success.
We have not taken stone sizes into consideration thus far and the plot was made based on simply counting
the number of successes and failures across all stone sizes. In other words, this plot gave us the overall
success rates of treatments X and Y.
Let us now separate the data by considering the categorical variable of “stone size” which has two
categories, namely large stones and small stones.
Outcome
              Success   Failure   Row Total
Treatment
X             381       145       526
Y             55        25        80
Column Total  436       170       606

The table above shows the outcome of treatments given to patients with large stones. For example,
out of 526 treatment X patients with large kidney stones, 381 had a successful outcome and 145 were
unsuccessful. Similarly, out of 80 treatment Y patients with large kidney stones, 55 were successful while
25 were not. We can present these information using a stacked bar plot like before, as shown below2 .
2 Notice that for the bar plot for treatment Y, the two percentages do not add up to 100%. This is due to rounding off
in Excel, where the success percentage is in fact 68.75% and the failure percentage is 31.25%.
How do the two different treatments compare? Although the margin of difference is not very big,
there is no doubt that treatment X has a higher success rate of 72.4% compared to treatment Y, which
has a success rate of 68.8%. This means that, for treating large kidney stones,
381/526 = 0.724 = rate(Success | X) > rate(Success | Y) = 0.688 = 55/80,
and thus treatment X is positively associated with success for treating large stones. This observation is
surprising, since we have already concluded that for all stone sizes combined together,

rate(Success | X) = 0.774 < 0.826 = rate(Success | Y),

that is, treatment X is negatively associated with success if we do not segregate by stone size.
Why are we observing a different behaviour for large kidney stones as opposed to what we saw earlier
when all kidney stone sizes are combined?
Let us consider the data for small kidney stones.
Outcome
              Success   Failure   Row Total
Treatment
X             161       13        174
Y             234       36        270
Column Total  395       49        444

The table above shows the outcome of treatments given to patients with small stones. For example,
out of 174 treatment X patients with small kidney stones, 161 had a successful outcome and 13 were
unsuccessful. Similarly, out of 270 treatment Y patients with small kidney stones, 234 were successful
while 36 were not. Let us again present these data using a stacked bar plot.
The margin of difference between the two treatment types is again not very big, but again we see
that treatment X has a higher success rate of 92.5% compared to treatment Y, which has a success rate
of 86.7%. This means that, for treating smaller kidney stones,
161/174 = 0.925 = rate(Success | X) > rate(Success | Y) = 234/270 = 0.867,
so again treatment X is positively associated with success for treating small stones, which is the opposite
from what we had when the data was combined and not segregated by stone size.
We can now combine the two previous plots by putting them side by side as shown below.
Notice that the first two bars from the left are for large kidney stones data while the last two bars
are for small kidney stones. This type of plot is sometimes referred to as a sliced stacked bar plot. Such
a plot can be used for comparing across three categorical variables. The three variables here are stone
size, treatment outcome and treatment type.
We are now facing a paradox. Although treatment Y is the better treatment overall, when the stone
sizes are combined and not segregated, we see that if we focus only on the large stones, or only on
the small stones, treatment X is observed to have a higher success rate than treatment Y. This is indeed
strange!
This phenomenon is known as Simpson’s Paradox.

Simpson’s Paradox: a trend that is observed in a majority of the subgroups of the data disappears
or reverses when the subgroups are combined.
We are now back to the same question which we thought we had already answered: Which treatment
is better for our patient? Should we advise him to undergo treatment X or Y?
Remark 2.4.3 In the example of kidney stones, there were only two subgroups for the stone size,
namely, small and large. We claim that Simpson’s Paradox was observed because the trend in both
subgroups is different from the trend observed when the subgroups are combined.
In examples where there are more than two subgroups, we will say that Simpson’s Paradox is observed
as long as a majority of the individual subgroup rates show the opposite trend to the overall rate. For
example, if there are three subgroups, as long as there are at least 2 subgroups showing the opposite
trend to the overall rate, we can say that Simpson’s Paradox is observed.
Example 2.4.4 (Analysing 3 categorical variables using a table.) Let us put the two tables in
Example 2.4.2 for both the large and small kidney stones together into one unified table.

              Large stones               Small stones               Both sizes combined
              Success  Total  Rate       Success  Total  Rate       Success  Total  Rate
Treatment X   381      526    72.4%      161      174    92.5%      542      700    77.4%
Treatment Y   55       80     68.8%      234      270    86.7%      289      350    82.6%
To recap, we have two different treatment types, X and Y. In the row for treatment X, we see that there
were 526 large stones cases that were under treatment X, of which 381 were successful. This gives a
success rate of 72.4%. Similarly, in the row for treatment Y, we see that there were 270 small stones
cases that were under treatment Y, of which 234 were successful. This gives a success rate of 86.7%.
The last 3 columns of the table give the combined numbers for both stone sizes.
Recall we had initially concluded that treatment Y was the better treatment because 82.6% of
patients who were given treatment Y had a successful outcome, compared to 77.4% for treatment X. We
then separated the cases according to the size of the stone, i.e., we created subgroups and this method
of subgroup analysis is called slicing.
This is when we observed Simpson’s Paradox, where the rate of success for both small (92.5%)
and large (72.4%) stones is higher for treatment X than for treatment Y. This reverses the trend
observed when the small and large kidney stones were combined.
Let us look at the patient counts for treatment X in the table. A crucial observation at this point is that
treatment X seems to have been used to treat mostly patients with large stones (526 cases) as compared
to small stones (174 cases).
Thus, by the basic rule on rates, we know that the overall success rate of treatment X will be closer
to the large stones success rate of 72.4% than the small stones success rate of 92.5%. Indeed, we have
the overall treatment X success rate to be 77.4%.
Turning our attention to the patient counts for treatment Y, we observe the opposite of the above.
Treatment Y seems to have been used mostly to treat patients with small stones (270 cases) compared
to large stones (80 cases).
Again, by the basic rule on rates, we would expect the overall success rate of treatment Y to be closer
to the small stones success rate of 86.7% than the large stones success rate of 68.8%. Indeed, we have
the overall treatment Y success rate to be 82.6%.
Combining these two observations, it is no wonder that we have the overall success rate of X to be
lower than the overall success rate of Y.
Another very telling observation from the table is that the range of success rates for treating large
stones is between 68.8% (treatment Y) and 72.4% (treatment X). Compare this with the range of success
rates for treating small stones which is between 86.7% (treatment Y) and 92.5% (treatment X). This
tells us that treatments for large stones have a lower rate of success compared to small stones, which is
not unreasonable to believe.
In conclusion, we can explain Simpson’s Paradox in the following way. Treatment X is in fact a better
treatment than Y. However, because treatment X has mostly been used to treat the more difficult cases
(large kidney stones), its overall success rate is pulled down. It does not change the fact that
in the individual subgroups, regardless of stone size, treatment X achieves a higher success rate than
treatment Y. Slicing the data into the small and large stone subgroups will reveal that treatment X is
indeed a better treatment.
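To see the whole paradox in one place, the following sketch (our own illustration) recomputes the sliced and combined success rates directly from the counts above:

```python
# Kidney stones data, sliced by stone size: (successes, total patients).
data = {
    "X": {"large": (381, 526), "small": (161, 174)},
    "Y": {"large": (55, 80),   "small": (234, 270)},
}

for treatment, slices in data.items():
    s_large, n_large = slices["large"]
    s_small, n_small = slices["small"]
    overall = (s_large + s_small) / (n_large + n_small)
    print(treatment,
          f"large: {s_large / n_large:.3f}",
          f"small: {s_small / n_small:.3f}",
          f"overall: {overall:.3f}")

# X large: 0.724 small: 0.925 overall: 0.774
# Y large: 0.688 small: 0.867 overall: 0.826
# X has the higher rate in every slice, yet Y has the higher rate overall.
```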
Before we conclude this section, let us recap the story so far.
We started off with a new kidney stone patient coming to us for advice. Based on past patient
records, we were convinced that the success rate of undergoing treatment is higher than the failure
rate and thus conclude that the patient should undergo some form of treatment.
We were then faced with the decision between two treatment types. Treatment X and Treatment
Y. In determining which treatment type to recommend to the patient, we looked at the data on
hand of past patients and investigated if there was any association (positive or negative) between
Treatment X and Treatment Success.
(Diagram: “Association?” between Treatment X and Success.)
With further analysis, we found that Treatment X was negatively associated with Success. This
meant that we should recommend Treatment Y to our new patient. However, through another
iteration of the PPDAC cycle, we wondered how another variable like stone size may affect our
conclusion.
By slicing, we segregated our data into past patients with large stone size and others with small
stone size. Surprisingly for both subgroups, we found that Treatment X had a higher success rate
than Treatment Y. This reversed the trend that we saw when the subgroups were combined.
More importantly, we observed that Treatment X was used more often in dealing with large stones
compared to Treatment Y, which was more frequently used to deal with small stones. This means
that large stone size is likely to be associated with Treatment X.
(Diagram: “Association?” between Treatment X and Success; “Association” between Stone size and
Treatment X.)
On the other hand, we also observed that patients with large stones have a lower success rate
(regardless of treatment type) compared to patients with small stones. This is perfectly reasonable
and thus also suggests that large stone size is likely to be associated with treatment success.
(Diagram: “Association?” between Treatment X and Success; Stone size associated with both Treatment
X and Success.)
This means that stone size is a (third) variable that was associated with the other two variables
whose relationship we were initially investigating, thus affecting the conclusion of our initial study.
Such a variable is called a confounder and they will be the focus of our discussion in the next
section.
For now, we will note that when Simpson’s Paradox is observed, it implies that there is definitely a
confounding variable present, that is, a third variable that is associated with the two variables whose
relationship we are investigating. However, the existence of a confounder does not necessarily lead
to us observing Simpson’s Paradox.
Discussion 2.5.1 Continuing our kidney stones patients example, we were fortunate that the data
set contained information that may not seem to be important initially. Without performing further
investigation into the size of the kidney stones, we could have ended up giving the wrong recommendation
to our new patient.
In data collection, it is often important to collect more information on the subjects in addition to
those variables that are immediately apparent to be of importance. This is because we can never be
sure whether there would be some other variables that may be confounders that would influence our
study of association between two variables of interest. Of course, as the owner of the study, we can ask
our subjects (for example, in a survey) as many questions and collect as much data as we want, but
practically, we also know that respondents do not like to see a long list of seemingly unrelated questions
in surveys. There are also cost considerations if we collect more data than necessary. To design a good
study, we need to strike a balance between the two.
Definition 2.5.2 A confounder is a third variable that is associated with both the independent and
dependent variables whose relationship we are investigating. Note that we do not specify the direction
(positive or negative) of association here. As long as the variable is associated in some way to the main
variables, we will call it a confounder, or a confounding variable.
Example 2.5.3 At the end of the previous section, we explained how the variable kidney stone size
is a confounding variable because it is associated with both the (independent) variable Treatment type
and (dependent) variable Treatment outcome. Let us now work through the calculations to justify these
associations. First, let us show that stone size is associated with treatment type.
Stone size
              Large   Small   Total
Treatment X   526     174     700
Treatment Y   80      270     350
Total         606     444     1050

The table shows the number of large and small stones treated by treatments X and Y respectively.
Out of 700 cases treated by treatment X, 526 were large stones and 174 were small stones. Out of 350
cases treated by treatment Y, 80 were large stones and 270 were small stones. Since
rate(Large | X) = 526/700 = 0.751 and rate(Large | Y) = 80/350 = 0.229,
we see that
0.751 = rate(Large | X) > rate(Large | Y) = 0.229,
and so large stones are positively associated with treatment X. This means that there is a higher pro-
portion of large stones being treated by treatment X compared to treatment Y.
Now let us turn our attention to the association between stone size and treatment outcome.
Stone size Success Failure Total
Large 436 170 606
Small 395 49 444
Total 831 219 1050
This table shows the number of success and failure outcomes for patients with large and small stones.
Out of 606 large stones cases, 436 were successfully treated while 170 were not successful. Out of 444
small stones cases, 395 were successfully treated while 49 were not successful. Since
rate(Success | Large) = 436/606 = 0.719 and rate(Success | Small) = 395/444 = 0.890,
we see that
0.719 = rate(Success | Large) < rate(Success | Small) = 0.890,
and so large stones are negatively associated with success outcome. This means that there is a lower
proportion of successful outcomes for large stones cases compared to small stones cases.
As we have now shown that stone size is associated with both the treatment type and the treatment
outcome, we are convinced that stone size is a confounding variable that needs to be managed. The way
to do this, as shown previously, is to use slicing, where we segregate the data by the confounding variable.
This is done by investigating the association between the dependent and independent variables for large
stone cases separately from the small stone cases.
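Both association checks can be reproduced in a few lines of Python (our own sketch, using the counts from the two tables in this example):

```python
# Is stone size associated with treatment type?
rate_large_given_x = 526 / 700     # rate(Large | X) = 0.751
rate_large_given_y = 80 / 350      # rate(Large | Y) = 0.229
print(rate_large_given_x > rate_large_given_y)
# True: stone size is associated with treatment type.

# Is stone size associated with treatment outcome?
rate_success_given_large = 436 / 606   # rate(Success | Large) = 0.719
rate_success_given_small = 395 / 444   # rate(Success | Small) = 0.890
print(rate_success_given_large < rate_success_given_small)
# True: stone size is associated with treatment outcome.
# Being associated with both variables under study, stone size is a confounder.
```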
Discussion 2.5.4 We have now seen the benefits of having more information on the subjects because it
allows us to identify confounding variables which would not have been possible if, for example, information
of stone size was not available or collected. Thus, an important learning point when it comes to designing
a study is to measure and collect data on additional variables that we feel may be relevant in our study.
Whether these additional variables turn out to be confounders or not we would have to probe further,
but we will never know if we do not collect data on them in the first place.
That said, we have to come to terms with the fact that most of the time, collecting information on
variables is costly in practice. Even if we do manage to collect all the information we need, the analysis
can be complicated if the data needs to be sliced along many different variables.
For non-randomised designs like observational studies, it is usually the case that the two groups that
we are comparing are not “identical” except for the treatment. Despite our best efforts, we can never
be totally sure that every single confounder has been identified and controlled for. Thus, observational
studies offer only a limited conclusion in providing evidence of association and not causation.
Example 2.5.5 Fundamentally, confounding variables arise from association, which is a consequence of
having unequal proportions of a third variable in the two groups that we are trying to compare. For the kidney
stones example, stone size was a confounder because patients with large stones were disproportionately
allocated to treatment X instead of treatment Y. Now, if the allocation of large (and small) stone size
cases to the two treatment types was done randomly, which tends to result in an equal proportion across
the two groups, there would no longer be any association between stone size and treatment type. In this
case, stone size would no longer be a confounder. Note that a confounding variable is associated to both
the independent and dependent variables, so removing one of the associations is enough to remove the
confounding variable.
(Diagram: “Association?” between Treatment X and Success; Stone size has no association with
Treatment X, but an association with Success.)
How can we achieve randomised assignment of patients to the two treatment types? One simple way
is, for example, to toss a fair coin when deciding which treatment a patient will be given. Surely, such a
method of randomised assignment tends to give us approximately equal proportions of large (and small)
stone cases across the two treatment types. If we have sufficiently many patients to assign to either
treatment types, the two groups of patients assigned to treatment X and treatment Y will tend to be
similar in all characteristics, including stone size.
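A quick simulation illustrates why coin-toss assignment tends to balance a variable like stone size. The sketch below (our own illustration, using Python's random module) randomly assigns the 1050 patients to the two treatments and compares the proportion of large stones in each group:

```python
import random

random.seed(1)
# 606 large-stone and 444 small-stone patients, as in our data set.
patients = ["large"] * 606 + ["small"] * 444

groups = {"X": [], "Y": []}
for stone_size in patients:
    # A fair coin toss decides the treatment group.
    groups["X" if random.random() < 0.5 else "Y"].append(stone_size)

for name, group in groups.items():
    prop_large = group.count("large") / len(group)
    print(f"Treatment {name}: proportion of large stones = {prop_large:.3f}")
# Both proportions come out close to 606/1050 = 0.577, so after random
# assignment stone size is (approximately) no longer associated with
# treatment type.
```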
Surely this addresses the problem of confounders appropriately, right? Unfortunately, randomisation
is not always possible in every study. Imagine the scenario where the type of treatment given to each
patient is dependent on a coin toss! Would you agree to this if you were one of the patients? Certainly
not! Patients usually have the right to choose which treatment group they want to be in and this would
make the assignment process non-random. Such ethical issues could very well constrain and prevent us
from performing randomised assignment of our subjects. In such a situation, we have no choice but to
fall back on the method of slicing for suspected confounders.
With this, we conclude Chapter 2, where we discussed in detail how we can use rates to study the
association between two (or more) categorical variables. We learnt about Simpson’s Paradox which
led us eventually to the issue of confounders and how they can be managed. In the next Chapter, we
will turn our attention to the other variable type, namely numerical variables.
Exercise 2
1. On 19 June 2021, The Straits Times published results from a population census of Singapore. The
figure below shows the household sizes across all individual ethnic groups in Singapore in the year
2020. Assume that the ethnic groups in the figure make up the entire population.
Using only the information in the figure, fill in the blanks below.
The overall proportion of households having 6 or more persons in 2020 must be greater than
(1) % and smaller than (2) %. (Both answers correct to 1 decimal place.)
2. On 19 June 2021, The Straits Times published the figure below, taken from a population census
of Singapore.
Use only the information shown in the figure to answer the following question.
Suppose that households with 1-3 people are considered “small” whereas those with 4 or more
people are considered “large”. Which of the following statements is/are true among Chinese resident
households in the years 2010 and 2020? Select all that apply.
3. 1000 patients who suffer from a certain cancer took part in a study to determine the effectiveness
of a new medication. They were randomly divided into two groups. Patients in Group A were
given the new medication, while those in Group B received a placebo. The following tabulates the
number of patients in remission from cancer after 1 year.
Based on the data given by the table above, was the new medication effective for cancer remission
for subjects in the study?
(A) Yes, because the rate of patients who are in remission is higher in the group who took the
medication than those who did not.
(B) Yes, because there are more patients in remission than not in remission in both groups.
(C) No, because there are patients who had cancer remissions in the placebo group.
(D) No, because the two groups are not of the same size.
4. The contingency table below shows the income level for a group of adults classified by gender.
                     Income level
            Low   Middle   High   Total
Gender
  Male      23    30       29     82
  Female    26    28       20     74
  Total     49    58       49     156
The marginal rate, rate(Middle income), is calculated to be (1) %; while the joint
rate, rate(Female and Low income), is calculated to be (2) %. Give each answer as a
percentage correct to 2 decimal places.
5. Cheryl suspects that there is a higher percentage of girls in the Faculty of Arts, as compared to
other faculties. To prove her point, she collects data on the total number of males and females
from the Faculty of Arts, as shown below.
Males Females
Number 400 1625
Based on the findings, which of the following statements can Cheryl conclude?
6. How do “forgiveness” (being forgiving) and empathy go together? The study of Toussaint and
Webb on 45 men and 82 women is summarised in the following hypothetical tables:
Distribution of 45 men
Empathy No empathy Row total
Forgiving 10 10 20
Not forgiving 9 16 25
Column total 19 26 45
Distribution of 82 women
Empathy No empathy Row total
Forgiving 30 31 61
Not forgiving 12 9 21
Column total 42 40 82
7. The bar graph below shows the number of gamers and non-gamers among males and females.
Which of the following statements is/are true?
(A) There is a negative association between being female and being a gamer since rate(Female |
Gamer) = 0.33 is less than rate(Female | Non-Gamer) = 0.53.
(B) There is a negative association between being female and being a gamer since rate(Gamer |
Female) = 0.4 is less than rate(Gamer | Male) = 0.67.
(C) There is a negative association between being female and being a gamer since rate(Female |
Gamer) = 0.33 is less than rate(Male | Gamer) = 0.67.
8. Su is investigating the association between blood pressure and “workaholism” in a certain popula-
tion. Someone who works more than 75 hours per week is considered a workaholic.
The income level and blood pressure (high or normal) for each subject and whether or not they are
classified as “workaholic” are recorded and summarised in the table below. Here “HBP” denotes
“high blood pressure” while “NBP” denotes “normal blood pressure”.
Income Group
Low Middle High
HBP NBP HBP NBP HBP NBP
Workaholic 25 75 23 87 26 134
Non-workaholic 25 80 18 72 9 51
(A) For subjects in the “Middle” income level group, there is a positive association between being
a “workaholic” and having “high blood pressure”.
(B) For subjects in the “Middle” income level group, there is no association between being a
‘’workaholic” and having “high blood pressure”.
(C) For subjects in the “Middle” income level group, there is a negative association between being
a “workaholic” and having “high blood pressure”.
9. The table below classifies a city of 250,000 people according to their smoking status and heart
health.
(A) a positive
(B) a negative
(C) no
10. A researcher is interested in finding out if there is an association between eating popcorn and getting
cancer. He gathered data on 90 participants summarised in the table below. Unfortunately, he
spilled coffee on the data, which resulted in the missing numbers in cells A, B, C, D. All he can
remember is that A is greater than B, and C is greater than D.
What was/were the possible result(s) from the original numbers in his data set? Select all that apply.
11. A news station conducted a survey to find out whether one’s political affiliation (i.e., whether one
self-identified as having ‘Liberal’ or ‘Conservative’ views) was associated with whether one favoured
abstinence-only sex education in schools (denoted as ‘Favour’ or ‘Not Favour’). The information
collected is represented in the following contingency table.
12. Jolie is interested to find out if there is any association between heights of individuals and reading
habits. She collected a sample of 60 individuals, all with unique heights. Individuals whose height is
equal to or greater than the median height among the 60 individuals are considered as “Tall”, while
individuals whose height is lower than the median height were considered as “Not Tall”. Individuals
who complete at least 2 books a week were classified as “Frequent Reader”, while individuals who
read less than 2 books a week were classified as “Not Frequent Reader”. She then constructed a 2
by 2 contingency table as follows:
Unfortunately, after constructing the above table on a piece of paper, she spilled coffee on that
paper, and she was only able to recover the above figures. What conclusion could possibly be
deduced from the above?
(A) It is impossible to deduce if “Frequent Reader” is associated with being “Tall” as more infor-
mation is needed.
(B) “Frequent Reader” is positively associated with being “Tall”.
(C) “Frequent Reader” is not associated with being “Tall”.
(D) “Frequent Reader” is negatively associated with being “Tall”.
13. The following is an excerpt modified from a press release by the Housing and Development Board
(HDB) in February 2023:
HDB launched 4,428 flats for sale today, under the February 2023 Build-To-Order (BTO) exer-
cise. The flats are spread across five projects in both mature and non-mature estates – Kallang,
Whampoa, Queenstown, Jurong West and Tengah.
For BTO projects launched in February 2023, the median waiting time is 4.4 years. Those who are
looking to move into their flats sooner can consider applying for flats at Jurong West, and Brickland
Weave in Tengah. Both projects are in non-mature estates where application rates tend to be lower.
They also come with shorter waiting times, at about 47 months and 48 months respectively.
Additionally, with at least 95 per cent of the 4-room and bigger flats set aside for first-timer families,
applicants can look forward to a higher chance of securing flats in these projects.
Based only on the excerpt above, which of the following statements must be true about the February
2023 BTO exercise? Select all that apply.
(A) There is a BTO project that has a waiting time of at least 4.4 years.
(B) There is a BTO project in either Kallang, Whampoa or Queenstown that has a waiting time
of more than 4.4 years.
(C) Out of all the flats that are set aside for first-timer families, at least 95 per cent of them are
4-room or bigger.
(D) There are more 4-room and bigger flats set aside for first-timer families than there are for
families that are not first-timers.
14. Consider the incomplete, 2 × 2 table shown below. It is known that rate(Heart Disease | Smoking)
= 0.16. Based on this information, rate(Heart Disease and Smoking) is (1) %.
15. The following infographic was extracted from the Department of Statistics Singapore, depicting
the usual mode of transport amongst resident students travelling to school in 2020.
Let us classify a student whose usual mode of transport to school is “Public Bus Only”, “MRT/LRT
and Public Bus Only”, “MRT/LRT Only” or “Other combinations of MRT/LRT or Public Bus”
as a student who takes public transport to school. If the proportion of female resident students
travelling to school who take public transport and the proportion of male resident students trav-
elling to school who take public transport are the same, what is the proportion of students who
take public transport to school, out of the male resident students who travel to school? Give your
answer as a percentage correct to one decimal place.
16. A course has an enrolment of 2005 students, of whom 1245 are males and the rest are females.
Among the males, 702 have identified themselves as sleep-deprived, which is measured by having
an average of less than seven hours of sleep a day. The rest of the males have identified themselves
as not sleep-deprived. We assume that we can obtain responses from all 2005 students. What
is the minimum number of females who are sleep-deprived, so that there is a positive association
between females and being sleep-deprived?
17. The Institute of Policy Studies (IPS) conducted a survey to gain insight on Singaporeans’ views
towards issues like family, well-being, work and other areas. Singapore residents from three different
age groups responded to the survey. One of the questions asked in the survey was how often the
respondents feel isolated from others. The figure below summarises the findings for the question
asked.
Use only information shown in the figure to answer the following question. Suppose that those
who are aged 21 to 34 are considered “youths” whereas those who are aged 35 to 64 are considered
“adults”. Suppose also that those who answered “Hardly ever” do not feel isolated from others,
whereas those who responded “Some of the time” or “Often” feel isolated from others. Which one
of the following is true among the respondents?
18. In a certain college, 60% of the students are enrolled in the Arts program, and the remaining 40%
are in the Science program. Surveys show that:
(A) The joint rate of students who are in the Arts program and are members of the Debate Club
is 18%. (TRUE/FALSE)
(B) The conditional rate of being in the Arts program given that a student is a member of the
Debate Club is 30%. (TRUE/FALSE)
(C) The joint rate of students who are in the Science program and are not members of the Debate
Club is 20%. (TRUE/FALSE)
(D) The overall rate of students who are either in the Arts program or are members of the Debate
Club is 82%. (TRUE/FALSE)
20. The rate of lung cancer among females in Singapore is 40%, while the rate of lung cancer among
males in Singapore is also 40%. Researchers also discovered that the rate of lung cancer among
smokers in Singapore is 70%. Which of the following statements is/are true?
(I) Sex is a confounder when discussing the relationship between smoking and lung cancer.
(II) Lung cancer is positively associated with smoking.
21. A researcher is interested in finding out if gender affects whether a student is left-handed. The
information on gender (Male/Female), master hand (Left/Right) and IQ (Above average/Below or
At average) of all students was collected. The data was used to plot the figure below.
(I) IQ can be a confounder when investigating the relationship between gender and master hand.
(II) When finding out if IQ affects whether a person is left-handed, it is possible that Simpson’s
Paradox is observed in the data due to the third variable, gender.
22. Consider a study that intends to examine whether the colour red makes children act impulsively. A
group of 500 children were assigned into two groups by the expert opinion of a child psychologist;
group Red if the psychologist pointed to the child, and group Green if the psychologist did not.
Each child is then led into a room that has a big button in the colour of their group and labelled
“DO NOT PRESS ME!”. It is then recorded whether the child presses the button within 10
minutes. All the children were each then given a candy for participating.
Which of the following conclusions from the analysis of data can establish that wearing spectacles
(whether the child wears spectacles) is a confounder in this study?
(A) Wearing spectacles is positively associated with being in group Red, and negatively associated
with pressing the button.
(B) Wearing spectacles is associated with being in group Red, and is not associated with pressing
the button.
(C) Wearing spectacles is not associated with being in group Red, and is not associated with
pressing the button.
(D) None of the other given options is correct.
23. To investigate whether controlling mothers tend to have obese children, a study was conducted and
it was found that controlling behaviours in mothers is positively associated with obesity in their
children:
rate(Obese child | Controlling mom) = 250/350 = 71%;
rate(Obese child | Non-controlling mom) = 138/350 = 39%.
It was suspected that the health status of the father plays a role in the phenomenon, so the sample
was sliced according to whether the fathers were obese or not.
(I) Obesity in the father is a confounder; furthermore, this is an example of Simpson’s Paradox
being observed.
(II) Obesity in children is positively associated with controlling mothers when the father is obese.
24. In a certain year, it is known that the prevalence of diabetes among Singapore residents is 10%
and the prevalence of diabetes among old (age 60 and above) Singapore residents is 30%. It was
suggested that sex is a possible confounder in the observed association between age and diabetes
among Singapore residents. After further analysis, the researchers concluded that sex is not a
confounder, and there is an association between sex and age. Which of the following statements
is/are true? Select all that apply.
25. A researcher would like to find out if there is any relationship between age (young and old) and
ramen consumption (high and low) among Singaporeans. From the data he obtained, he suspects
that sex is a confounder. Which of the following should hold in order to show that his suspicion is
correct? Select all that apply.
(A) The percentage of old people among males is different from the percentage of old people among
females.
(B) The percentage of males among the high ramen consumers is different from the percentage of
females among the high ramen consumers.
(C) The percentage of high ramen consumers among males is different from the percentage of high
ramen consumers among females.
26. A team of researchers are interested in seeing if there is an association between the amount of sleep
and memory retention. They have also collected information on each subject’s gender. Which of
the following statements is correct?
(A) If Simpson’s Paradox is not observed when combining the 2 subgroups of gender, then gender
is not a confounder when exploring the association between the amount of sleep and memory
retention.
(B) Suppose that when the 2 subgroups of gender are combined, Simpson’s Paradox is observed
when checking for association between the amount of sleep and memory retention. Then
gender must be a confounder.
(C) If gender is a confounder when determining the association between the amount of sleep and
memory retention, Simpson’s Paradox will be observed when combining the 2 subgroups of
gender.
27. A study employed 6 subjects and categorised each subject’s hypertension status (hypertension /
no hypertension), age (young / old) and sex (male / female). The results were then presented
in a contingency table, which is partially shown below. Suppose that sex is a confounder in the
relationship between age and hypertension status. In which cell should the count of the last subject
belong to?
Young Old
Hypertension No hypertension Hypertension No hypertension
Male (A) 1 2 (D)
Female 1 (B) (C) 1
28. Suppose that a population is divided into Young, Middle-aged, and Old people only. You are given
that cancer is negatively associated with being male among the Middle-aged people and among
the Old people. In addition, Simpson’s Paradox is observed when all the Young, Middle-aged, and
Old people are combined. Which of the following statements must be true?
29. The table below shows male and female patients undergoing two treatment types, X or Y. The
outcome of the treatment is designated as either successful or unsuccessful. The success rates of
the respective treatments across genders are also calculated.
Unfortunately, some of the data is missing. We know that all the missing values are non-zero.
Which of the following statements must be true?
(I) We observe Simpson’s Paradox when the subgroups of patients receiving Treatment X and
Treatment Y are combined, when considering the relationship between gender and outcome.
(II) Treatment type can be a confounder in the relationship between gender and outcome.
30. A study investigating the relationship between growing plants and happiness was conducted on
60 males and 80 females, and the results are summarised in the following tables (where x is the
number of men who do not grow plants and are happy). Among the males:
It was found that sex is not a confounder in the association between growing plants and happiness
in this study. What must the value of x be?
Chapter 3
In Chapter 1, we introduced two main types of variables that we will be focussing on, namely categorical
variables and numerical variables. Categorical variables were discussed extensively in Chapter 2 and in
this chapter, we will turn our attention to numerical variables and how they can be analysed.
Consider the following table that shows a portion of a data set relating to COVID-19 cases in Singa-
pore.
An example of a numerical variable in this data set is Age. Can you identify another numerical
variable? The analysis of data, more precisely, Exploratory Data Analysis (or EDA) is a process of
summarising or understanding the data and extracting insights or main characteristics of the data. This
is a critical part of the “Analysis” step of the PPDAC problem solving cycle. In this chapter, we will
discuss how numerical variables can be summarised and understood. To begin, the focus of this section
will be on data exploration techniques for one variable, or univariate exploratory data analysis.
Example 3.1.1 In Chapter 2, the recurring data set that was used to drive the discussion on categorical
variables was the patients with kidney stones data set. In this chapter, we will be using a data set closer
to home.
The data set (Microsoft Excel file partially shown above) that we will be looking at in this chapter
corresponds to sales of Housing Development Board (HDB) resale flats within the period of January 2017
to June 2021. The entire data set contains 99,236 rows and 11 columns. Note that each transaction is
a row of the Excel file and each transaction contains information on variables (the columns) like month
(of sale), flat’s floor area (in square metres), resale price, etc.
The PPDAC cycle starts off with
1. Problem. So what is the problem that we are considering and attempting to answer? If you are
a potential buyer, perhaps a question that you may be interested in investigating could be
What factors may affect the pricing of resale flats sold in Singapore?
2. Plan. Here, we need to decide what are some of the variables that are relevant and possible factors
that answer the question. Suppose these variables were determined to be the 11 columns of the
data set. Some of these variables are the month of sale, the flat’s floor area and the resale price.
3. Data. In this stage, data is collected and prepared as shown in the table above.
4. Analysis. We are now at this stage where the data is going to be analysed in attempting to answer
the Problem.
Definition 3.1.2 A distribution is an orientation of data points, broken down by their observed number
or frequency of occurrence.
Example 3.1.3 Let us look at our HDB resale flats data set. The first few rows of the data set for
transactions from January to June 2021, is reproduced in the table below.
We would like to investigate the distribution of the Age1 variable. To do this, we would need to
collate the number of flats with the same ages when the resale transaction was made and put them in a
frequency table. For example the first two rows of the data indicates that the first two HDB flats in the
data set had the same age of 35 years when they were sold, while the third flat was 45 years old and so
on. Suppose the frequency table collated for the entire data set is as follows:
Age Frequency
2 9
3 8
4 583
5 1105
6 884
7 295
8 255
...      ...
If we simply look at the frequency values in the table, it would be hard to observe any patterns or
gain insights into how the frequencies are distributed across the different age values. We will introduce
two different graphs to present the distribution in a better way.
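As an aside, such a frequency table can be produced in one line with a tool like pandas. The sketch below is our own illustration on a toy stand-in; the column name “Age” and the sample values are assumptions, not the actual data set:

```python
import pandas as pd

# Toy stand-in for the resale data; in practice the "Age" column would be
# computed as (year sold - lease commencement year) after loading the CSV.
df = pd.DataFrame({"Age": [35, 35, 45, 2, 4, 4, 5, 5, 5]})

# Frequency of each distinct age, sorted by age.
freq_table = df["Age"].value_counts().sort_index()
print(freq_table)   # 2 -> 1, 4 -> 2, 5 -> 3, 35 -> 2, 45 -> 1
```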
Example 3.1.4 (Histograms for Univariate EDA) A histogram is a graphical representation that
organises data points into ranges or bins. It is particularly useful when we have large data sets. Let
us see what the histogram will look like when we use Microsoft Excel to create one based on the “Age”
frequency from Example 3.1.3. To create a histogram, the variable values are “grouped” into equal size
intervals called bins. For our “Age” variable, we can use bins with a width of 2 years. The number of
flats in each bin is counted and tabulated.
Bins Frequency
0-2 9
2-4 591 (8 + 583)
4-6 1989 (1105 + 884)
6-8 550 (295 + 255)
8-10 336 (219 + 47)
...      ...
You may notice that for the 2-4 Bin, the frequency is obtained by adding the number of flats sold
at Age 3 and Age 4 and excludes those sold at Age 2. Thus, the left-end point of the interval 2-4 is
excluded. The same is observed for the rest of the bins. The histogram created using Radiant is shown
below:
1 The data set, which can be downloaded from https://data.gov.sg/dataset/resale-flat-prices actually does not contain
the “Age” variable. The “Age” variable was created by subtracting lease commence date from the year the flat was sold.
With the height of each bar representing the frequency for that bin range, the highest bar would
represent the most frequently occurring range of values.
From the histogram above, we see that the range 4-6 years has the highest frequency as it accounts
for 1989 out of the total 11644 transactions, or about 17% of the flats sold.
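If one is working in Python rather than Excel or Radiant, a comparable histogram can be drawn with matplotlib. The sketch below is our own illustration on synthetic stand-in data; note that matplotlib's default bins are left-closed intervals [a, b), the opposite of the (a, b] convention described above:

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the 11,644 Age values in the actual data set.
rng = np.random.default_rng(0)
ages = rng.integers(2, 56, size=11_644)

# Bin edges 0, 2, 4, ..., 56, giving a bin width of 2 years.
bins = np.arange(0, 58, 2)
plt.hist(ages, bins=bins)
plt.xlabel("Age of flat at resale (years)")
plt.ylabel("Frequency")
plt.title("Distribution of Age (bin width = 2 years)")
plt.show()
```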
Remark 3.1.5 You may wonder how we came to the decision to have bin widths of 2 years rather
than 3 years (or any bigger number). There is no correct answer for this. Normally, we would construct
several histograms with different bin widths before deciding which one is most appropriate.
Once we have obtained and visualised the distribution of a numerical variable, we would like to
describe the overall pattern of the distribution as well as whether there are any deviations from the
overall pattern. To describe the overall pattern of the distribution, we will focus on the
1. Shape;
2. Center; and
3. Spread.
For deviations from the overall pattern, this usually refers to identifying outliers which will be discussed
later on in this Chapter. Let us start by looking at how we can describe the shape of a distribution.
Discussion 3.1.6 (Shape - peaks and skewness). There are two important descriptors when we
discuss the shape of a distribution, namely the peaks and the skewness. Let us look at another histogram
plot obtained from the HDB resale data set. Rather than the age of the flat at the point of resale, we
consider another numerical variable of interest, which is the “Resale Price”. The following histogram
was obtained when we set a bin size of 25,000.
There is a peak in the interval [455000, 480000]. The distribution is unimodal, which means that it
has one distinct peak. This tells us that the most frequent resale flat prices lie between $455,000 and
$480,000.
Distributions are not always unimodal. Looking at the histogram we plotted earlier for the Age of
the resale flats, we see that there is more than one distinct peak. In such a situation, we say that the
distribution is multimodal. If a distribution has exactly two distinct peaks, we say it is bimodal.
In the histogram above, we see the highest peak in the 4-6 years range and the second highest peak
occurring in the 34-36 years range. It should be noted that we say these are peaks because they occur
most frequently in their immediate neighbourhoods of age ranges.
For a unimodal distribution, we can use another descriptor to describe the shape of the distribution,
that is, whether the distribution is symmetrical or skewed.
In a symmetrical distribution (middle picture above), the left and right halves of the distribution are
approximate mirror images of each other, with the peak in the middle.
For the picture on the left, the distribution is left skewed, with the peak shifted to the right and a
relatively long “tail” on the left.
The picture on the right shows a distribution that is right skewed. Such a distribution has the peak
shifted to the left and a relatively long “tail” on the right. Referring back to the distribution of resale
prices of HDB flats, we see that the distribution is right skewed, meaning that there are some (but few)
flats sold at very high prices. These data points gave rise to the long tail to the right of the peak.
Example 3.1.7 (Symmetrical distribution - Bell curve) One of the most well-known symmetrical
distributions is the normal distribution or what is commonly known as the bell curve. A famous example
of the normal distribution is that of the IQ scores in a population, based on the Wechsler Intelligence
scale.
From the figure, we see that the peak happens at 100, which means that the average IQ of a person in
the population is 100. We also see that about 68% of the population has IQ scores in the range between
85 and 115.
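As a quick numerical check of this 68% figure, the sketch below uses scipy, assuming a normal
distribution with mean 100 and standard deviation 15 (the standard deviation is implied by the 85 to
115 range quoted above):

from scipy.stats import norm

# P(85 < IQ < 115) for a normal distribution with mean 100 and SD 15.
p = norm.cdf(115, loc=100, scale=15) - norm.cdf(85, loc=100, scale=15)
print(round(p, 4))   # ~0.6827, i.e. about 68% of the population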
Discussion 3.1.8 (Central tendency - mean, median and mode). Besides describing the shape of
the distribution, we can also describe the characteristics of a distribution more precisely using measures
of central tendency. The three most common measures of central tendency are mean, median and mode,
which were all introduced in Chapter 1.
The three possible shapes of a distribution have different relative positions of the mean, median and
mode.
1. For a symmetrical distribution, the mean, median and mode will be very close to each other near
the peak of the distribution.
2. For a left skewed distribution, the mean will typically be smaller than the median, which in turn
is smaller than the mode.
To see why this is the case, notice that the small number of extremely small values which contribute
to the long tail on the left will pull down the mean/average, as compared to the median, which is
less affected by these extremely small values. The mode, found at the peak of the distribution, is
naturally the largest among the three measures of central tendency.
3. For a right skewed distribution, we have the opposite of the left skewed distribution: the mean will
typically be larger than the median, which in turn is larger than the mode.
In this case, there are a small number of extremely big values which contribute to the long tail on
the right. These big values will push up the mean/average as opposed to the median, which is less
affected by these extremely large values. The mode in such a distribution would be the smallest
among the three measures of central tendency.
Example 3.1.9 Referring again to the resale prices distribution, we have seen the shape of the distri-
bution and concluded that the distribution is right skewed.
The mean, median and mode of this distribution were found to be $496,870.40, $468,000 and $420,000
respectively. This indeed agrees with the ordering expected of a right skewed distribution: the mean is
larger than the median, which in turn is larger than the mode.
Discussion 3.1.10 (Spread - standard deviation and range). Besides the shape and center of the
distribution, we can also describe the spread of a distribution. This refers to how the data vary around
the central tendency.
Take a look at the two distributions above, both of which have the same central tendencies. In fact,
the mean, median and mode of both distributions are 10. However, the top distribution has a relatively
lower variability compared to the distribution below. This means that the data in the top distribution
are all relatively close to the center while the data in the bottom distribution are more spread out, or
have more variability. We can also say that the data in the bottom distribution are spread across a much
wider range.
The most commonly used measure of variability is standard deviation which was introduced in Section
1.5. For the two distributions shown here, the top distribution has standard deviation 1.69 while the
bottom distribution has standard deviation 4.30.
A simpler measure of variability is the range of the distribution. This is defined to be the difference
between the largest and the smallest data points in the distribution. The range is simple to compute
but sometimes it can be misleading. For example, if we look at the range of the HDB resale prices data,
we obtain
Range = Highest resale price − Lowest resale price = $1,250,000 − $180,000 = $1,070,000.
The range is very large due to the existence of a few extremely high resale prices. It is not really the
case that there is great variability in resale prices: most of the resale prices are actually much lower,
and the variability is not as large as the range suggests.
Definition 3.1.11 An outlier is an observation that falls well above or below the overall bulk of the
data.
Consider the data set with 11 data points shown above. We can consider 75 and 85 as outliers since
they are much larger than the rest of the data points. At this point, we use our judgement to identify
values that appear to be exceptions to the general trend in the data. Later on, we will be introducing a
more precise method (boxplot) to identify outliers.
Identifying outliers can be useful when we wish to identify any strong skewness in a distribution.
Sometimes the outliers are caused by erroneous data collection or data entry but this may not always be
the case. It is also possible that outliers are legitimate data points that provide us interesting insights into
the behaviour of the data. A general rule when we investigate a data set is that outliers should not be
removed unnecessarily as they do tell us something about the behaviour of the variable and prompt us
to investigate further why such extreme values can happen.
Example 3.1.12 Consider the following data set of 12 data points:
4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 300.
It is not difficult to be convinced that 300 is an outlier in the data set. The table below shows the three
different central tendencies as well as the standard deviation for the entire set and also when the outlier
is removed from the data set.
                        Mean    Median    Mode    Standard deviation
Without removing 300    30      5.5       5       85.03
With 300 removed        5.45    5         5       1.04
We see that among the three central tendencies, the mean is the most affected by the removal of the
outlier, while both the median and the mode either remained the same or changed only slightly. Without
removing the outlier, the mean is pulled away in the direction of the skew (in this example, the
distribution is skewed to the right). In such cases, the mean may no longer be a good measure of the
central tendency of the distribution. Because the median and the mode are much less affected by
outliers, we call them robust statistics.
In addition, the standard deviation also increases greatly from 1.04 to 85.03 because of the outlier.
This is expected because the standard deviation measures the spread of the data points and with the
outlier being far away from the other data points, the variability of the distribution is understandably
high.
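The figures in the table can be verified with a few lines of Python; a minimal sketch using only the
standard library:

import statistics as st

data = [4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 300]

for label, d in [("without removing 300", data), ("with 300 removed", data[:-1])]:
    print(label,
          "| mean =", round(st.mean(d), 2),
          "| median =", st.median(d),
          "| mode =", st.mode(d),
          "| stdev =", round(st.stdev(d), 2))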
As mentioned above, we need to treat outliers with care. If they have minimal effect on the conclusions
and if we cannot figure out why they are there, such outliers may possibly be removed. However, if they
substantially affect the results, then we should not drop them without justification.
Example 3.1.13 Suppose we are interested to find out if there are significant differences in the dis-
tribution of HDB resale prices for different time periods. For example, would the distributions differ
significantly if we compare the period July to December 2020 with January to June 2021? The two
distributions are shown below.
The distribution on the left corresponds to the period of resale from July to December 2020. The
distribution for January to June 2021 is shown on the right. We observe that both distributions have
a similar shape which is right skewed with a single peak. Taking it one step further, we compare the
central tendencies and variabilities of the data points in both periods. The values in the table can be
computed using the Microsoft Excel Data Analysis ToolPak.
Observe that the mean, median and mode are all higher in the time period January to June
2021 compared to those in the time period July to December 2020. The range of the resale prices is lower
in January to June 2021 while the standard deviation is actually higher. In conclusion, we can say that
resale prices in January to June 2021 are higher, but more spread out (in terms of standard deviation)
compared to the resale prices in July to December 2020.
Example 3.1.14 In Example 3.1.4, we described the setting of bin widths when creating a histogram.
Deciding which bin width to use can have a big impact on what the histogram looks like, and thus affect
our observations and conclusions about the shape of the distribution.
The three histograms above are constructed using the same data set of 233 students’ final exam scores
with the only difference being the bin width settings. The histogram on the left has a bin width of 20,
while the one in the middle has bin width of 10. The last histogram has bin width set at 5. What
conclusions can be made on the distribution based on these histograms?
Based on the first histogram, we may conclude that most students score between 60 and 80
marks, and the distribution is rather symmetric. However, with a slightly smaller bin width, the second
histogram reveals that most students actually scored between 70 and 80 marks. This does not contradict
the observation made earlier based on the first histogram but because of the smaller bin width, we are
able to narrow the range of marks that are scored by most students. With an even smaller bin width, the
third histogram suggests that most students scored between 65 and 75 marks. How do you rationalise
this conclusion with the one from the second histogram?
In general, we should bear in mind the following when determining bin widths for histograms.
1. Avoid histograms with bin widths that are too large. This will result in only a few bins and
information in the data will be lost when data points are grouped together into a small number of
groups/bins.
2. Avoid histograms with bin widths that are too small. If the bin width is too small, there may be
bins with very few data points (or none at all), which does not give us a good sense of the distribution.
3. Our initial choice of bin width may not be the most appropriate. Different histograms with various
bin widths should be created before deciding which one is the most useful and informative.
Remark 3.1.15 We should not confuse histograms with bar graphs introduced in Chapter 2. A his-
togram shows the distribution of a numerical variable across a number line. So one of the axes (usually
the horizontal) will display the range of values taken on by the numerical variable. On the other hand,
the horizontal axis of a bar graph will show the different categories of a categorical variable.
In addition, the ordering of the bars in a histogram cannot be changed, as it progresses through the
range of values, usually in an ordered manner, taken on by the numerical variable. On the other hand,
the ordering of the bars in a bar graph can be switched around with little consequence. There are also
usually no gaps between the bars in a histogram.
Discussion 3.1.16 (Boxplots for Univariate EDA) Besides a histogram, another way to visualise
the distribution of a numerical variable is to use a boxplot. To construct a boxplot, we will use the
five-number summary, consisting of
1. Minimum;
2. Quartile 1 (Q1 );
3. Median (Q2 );
4. Quartile 3 (Q3 );
5. Maximum.
The median and quartiles have already been introduced in Definition 1.6.1 and Definition 1.6.5. Fur-
thermore, we have also introduced the Interquartile range
IQR = Q3 − Q1 .
While the median can be viewed as the center of a data set, the IQR is a way to quantify the spread of a
data set. We have defined an outlier in Definition 3.1.11 but did not provide an explicit way to classify
a data point as an outlier. For our purpose, we will adopt the following rule: a data point is considered
an outlier if it is greater than Q3 + 1.5 × IQR or smaller than Q1 − 1.5 × IQR.
With the five-number summary and this rule, we can construct a boxplot as follows:
1. Draw a box from Q1 to Q3.
2. Draw a vertical line in the box where the median (Q2) is located.
3. Mark each outlier (as determined by the rule above) with a symbol such as a dot or an asterisk.
4. Extend a line from Q1 to the smallest value that is not an outlier and another line from Q3 to the
largest value that is not an outlier. These lines are called whiskers.
Example 3.1.17 Consider the following data set, with the data points already sorted in increasing
order.
18, 44, 47, 55, 61, 62, 78, 79, 83, 145.
There are 10 data points. The median (Q2) is the average of the fifth and sixth data points, so
Q2 = (61 + 62)/2 = 61.5.
The first quartile is the median of the first five data points: 18, 44, 47, 55, 61, so Q1 = 47. The third
quartile is the median of the last five data points: 62, 78, 79, 83, 145, so Q3 = 79. Following Remark
1.6.9, it should be pointed out that you may encounter slightly different ways of finding quartiles for a
data set in other texts. For this course, we will adopt what is presented here.
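As a sketch of this computation in Python, following the convention above of splitting the sorted data
into a lower and an upper half (the handling of an odd number of points below, excluding the middle
value from both halves, is an assumption to be checked against Definition 1.6.5):

import statistics as st

data = sorted([18, 44, 47, 55, 61, 62, 78, 79, 83, 145])
n = len(data)
lower, upper = data[: n // 2], data[(n + 1) // 2 :]  # halves exclude the middle value when n is odd

q1, q2, q3 = st.median(lower), st.median(data), st.median(upper)
iqr = q3 - q1
print(q1, q2, q3)                      # 47, 61.5, 79
print(q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # outlier fences: -1.0 and 127.0, so 145 is an outlier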
Example 3.1.18 Let us return to the HDB resale flats data set. The boxplot below is based on the
resale prices of flats sold in January to June 2021.
The boxplot confirms our earlier conclusion that there are outliers that correspond to very high resale
prices. Note that the cross in the box, just above the median line represents the mean resale price. Recall
that we have discussed the shape, center and spread of the distribution using a histogram. What can we
say based on the boxplot?
1. (Shape) From the boxplot, we see that the variability in the upper half of the data, given by (Max
− Median) is significantly larger than the variability in the lower half of the data which is equal
to (Median − Min). This confirms our earlier observation that the distribution is skewed to the
right and there is a relatively long tail to the upper end of the distribution due to the existence of
outliers.
2. (Center) The center, described by the median is easily observed in the boxplot, unlike in a
histogram. We can also compare the relative positions of the median and the mean from the
boxplot.
3. (Spread) The IQR of $204,000 gives us an idea of the spread for the middle 50% of the data set.
On its own it may not be immediately informative but this would be a meaningful measure to
compare across different distributions (see next example).
Example 3.1.19 The three boxplots below show the distributions of resale flat prices in three different
time periods, namely January to June 2020 (call this period P1), July to December 2020 (call this period
P2) and January to June 2021 (call this period P3). What can we say about the three distributions after
comparing the three boxplots?
1. All three distributions are right skewed as the upper halves of the data have greater variability than
the lower halves, due to (large-valued) outliers. However, upon a closer look, it is also apparent
that the upper half variability in period P1 is greater than the upper half variability in P2 which
in turn is greater than the upper half variability in P3.
2. The middle 50% (that is, the IQR) box of resale prices is lowest in P1, followed by P2 and then
P3. Hence, the overall resale prices have increased over time. The spread (given by the height of
the boxes) appears to be similar between P1 and P2 while slightly higher in P3.
3. There appear to be more outliers in P1 and P2 compared to P3.
To conclude this section, we summarise the comparison between using histograms and boxplots to
represent a distribution.
1. A histogram typically gives a better sense of the shape of the distribution of a variable, compared to
a boxplot. When there are great differences among the frequencies of the data points, a histogram
will be able to illustrate this difference better than a boxplot.
2. If we wish to compare the distributions of different data sets, putting the different boxplots side
by side is more illustrative than using histograms.
3. To identify and indicate outliers, boxplots do a better job than histograms.
4. The number of data points we have in a data set is better shown in a histogram than in a boxplot.
In fact, two distributions with very different numbers of data points can have almost identical
boxplots. On the other hand, this difference is apparent when comparing the histograms.
The bottom line is that different graphics and summary statistics have their advantages and disad-
vantages and they are often used together to complement each other.
Discussion 3.2.1 We start off with a relationship between two variables that is deterministic. This
means that the value of one variable can be determined exactly if we know the value of the other variable.
Perhaps the most common type of deterministic relationship is the one that involves the conversion of
units of measurement from one metric to another. For example:
1. The relationship between Fahrenheit (F) and degrees Celsius (C) in the measurement of temperature.
We know that F and C are related by
C = (F − 32) × 5/9.
This is a deterministic relationship between F and C. For example, if the temperature in the oven
now is 450 degrees Fahrenheit (so F = 450), then the temperature in the oven now, measured in
degrees Celsius, is
C = (450 − 32) × 5/9 ≈ 232.22.
2. Meters (M) and feet (F) are both measurements of length (or height) and they are related
(approximately) by
F = 3.2808 × M.
So, if Johnny's height is 5.9 feet (so F = 5.9), then his height in meters will be
M = F/3.2808 = 5.9/3.2808 ≈ 1.8 meters.
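Deterministic relationships translate directly into functions that return exactly one output for each
input; a minimal sketch:

# Each conversion determines the output exactly from the input.
def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9

def feet_to_meters(feet):
    return feet / 3.2808

print(round(fahrenheit_to_celsius(450), 2))   # 232.22
print(round(feet_to_meters(5.9), 2))          # 1.8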
Discussion 3.2.2 The main focus of this section is on a relationship between two variables that is not
deterministic in nature. We say such a relationship is statistical or non-deterministic. Recall that in a
deterministic relationship, given the value of one variable, we can find a unique value of the other variable.
However, this is not possible for a statistical relationship, where given the value of one variable, we can
only describe the average value of the other variable. Such relationships between variables, called associations,
occur quite often in our daily life.
Example 3.2.3 In a Medical News Today article2 published in November 2020, it was reported that in
a study involving more than 150, 000 participants, a clear link was observed between low physical fitness
and the risk of experiencing symptoms of depression, anxiety, or both.
This association between physical fitness and mental health may not be surprising but we wonder if
it could be due to other factors, like a confounder. More interestingly, does having better fitness make
a person mentally healthier or having better mental health make a person exercise more resulting in
better physical fitness? We will not only measure the association (if one exists) between variables but
also attempt to interpret any observed associations.
Bivariate data is data involving two variables. For example, in the HDB resale flat data set, we can
study the two variables Age and Resale Price.
2 https://www.medicalnewstoday.com/articles/large-study-finds-clear-association-between-fitness-and-mental-health
In Section 3.1, we saw two ways to display univariate data, using either a histogram or a boxplot. For
bivariate data, it is clear that using a table like the one above is not really useful if we wish to investigate
if the two variables are associated. Instead, we will use a scatter plot to give us an idea of the pattern
formed by the data between the two variables in question. After looking at the scatter plot, we use a
quantitative measure called the correlation coefficient to quantify the level of linear association (if any)
between the two variables. Finally, we will attempt to fit a line or a curve through the points in the
data set which will enable us to make predictions on the values of the variables. This process is known
as regression analysis. For now, we will focus on scatter plots and defer the discussion on correlation
coefficients and regression analysis to the next few sections.
Example 3.2.4 Returning to our HDB resale flats prices data set, we will focus on the bivariate data
with the variables Age and Resale price. Suppose we wish to know if the age of the flat affects the resale
price, with the ultimate intention of making a prediction, based on past resale prices, of how much a
38 year old resale flat is likely to cost. In this case, we can treat age as the independent (or
explanatory) variable and resale price as the dependent (or response) variable.
Our scatter plot shown above has the age (independent) variable on the x-axis and the resale price
(dependent) variable on the y-axis. Each resale transaction would be represented by an ordered pair
(x, y)
where x is the age of the resale flat and y is the resale price of that flat. For example, the ordered pair
(35, 225000) corresponds to the first resale flat listed in the table above. With a point plotted for each
ordered pair, since there are 11,644 resale transactions in the data set, there will be 11,644 points on the
scatter plot. Observe that in the scatter plot, each value of x (age of flat) corresponds to many different
values of y (the resale price). This is to be expected because there are many different transactions
involving flats of the same age and all these transactions are made at different resale prices.
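A scatter plot like the one described here can be produced with a few lines of Python; a minimal
sketch, assuming the data sits in a CSV file with columns named Age and Resale Price (the file name
and column names are assumptions for illustration):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names; adjust to the actual data set.
df = pd.read_csv("resale_2021H1.csv")

plt.scatter(df["Age"], df["Resale Price"], s=5, alpha=0.3)  # one point per transaction
plt.xlabel("Age of flat at resale (years)")                 # independent variable
plt.ylabel("Resale price (SGD)")                            # dependent variable
plt.show()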
How do we describe the relationship between two numerical variables using a scatter plot?
We have seen that for univariate data, we discussed the shape (symmetrical or skewed), center (me-
dian, mean and mode) and spread (interquartile range, standard deviation and range) of the distribution.
For bivariate data, we will use descriptors like the direction, form and strength to describe the relationship
between the two variables. For both univariate and bivariate data, data points that deviate significantly
from the pattern of the main bulk of data points are called outliers.
Definition 3.2.5 The direction of the relationship can be either positive, negative or neither. We say
that there is a positive relationship between two variables when an increase in one of the variables is
associated with an increase in the other variable.
On the other hand, a negative relationship between two variables means that an increase in one
variable is associated with a decrease in the other.
Not all relationships can be classified as either positive or negative and there are those that do not
behave in one way or the other.
The form of the relationship describes the general shape of the scatter plot. In general, we can classify
the form of the relationship as either linear or non-linear. The form of the relationship is linear when
the data points appear to scatter about a straight line. Later in the chapter, we will use a mathematical
equation to describe the straight line when the form of the relationship between two variables is linear.
When the data points appear to scatter about a smooth curve, we say that the form of the relationship
is non-linear. It is beyond the scope of this course to summarise curve patterns in the data but it is
useful to note that quadratic and exponential equations are examples of non-linear forms of relationship.
The two scatter plots on the left show a linear form of the relationship between the two variables
while the two scatter plots on the right show non-linear forms.
The strength of the relationship indicates how closely the data follow the form of the relationship.
Both scatter plots above suggest that there is a positive, linear relationship between the two vari-
ables. However, the scatter plot on the left shows the data points lying very close to the straight line.
This indicates that the strength of the relationship is strong. The scatter plot on the right shows the
data points scattered loosely around the straight line and thus the strength of the relationship is weaker
than that in the scatter plot on the left.
Example 3.2.6 Let us look at the scatter plot from the HDB resale flats data again. The scatter plot
below is similar to the one from Example 3.2.4 except for an additional trendline drawn in black.
The trendline suggests that as the age of the HDB flat increases, the resale price decreases linearly
on average, in the period of January to June 2021. Is this relationship strong or weak? In fact, one can
argue that without the trendline, one may not even observe that there is a linear relationship between
age and resale price.
At this point, we cannot really tell if there is indeed a linear relationship and if there is, whether the
relationship is strong or weak. Nevertheless, in the next section, we will discuss a more precise measure
of the strength of a relationship.
As mentioned earlier, outliers are data points that deviate significantly from the pattern of the
relationship. Consider the scatter plot shown below that plots the resale price against the floor area of
the HDB resale flats. Do you observe any outliers?
Recall that for univariate data, using a boxplot, we can determine if a data point is an outlier by
checking if its value is greater than Q3 + 1.5 × IQR or smaller than Q1 − 1.5 × IQR. What about for
bivariate data? We will discuss more about outliers in the next section.
In the previous section, using the HDB resale flats data set, we have observed that a flat’s resale price
is associated with the age of the flat. From the scatter plot, we concluded that the relationship between
the age of the flat and the resale price of the flat was negative. This means that flats whose ages were
higher tended to have a lower resale price. This is not surprising. However, can we say anything about
whether this relationship is strong or weak? If possible, can we measure the strength of this relationship
using a number?
More generally, given two numerical variables, is it possible for us to measure the relationship between
the two variables quantitatively?
Definition 3.3.1 The correlation coefficient between two numerical variables is a measure of the linear
association between them. The correlation coefficient, denoted by r, always ranges between −1 and 1. We
can use this number to summarise the direction and strength of linear association between two variables.
The sign of r tells us about the direction of the linear association. If r > 0, then the association
is positive, which means that when one of the variables increases, the other variable will tend to increase
as well. On the other hand, if r < 0, then the association is negative, which means that when one of
the variables increases, the other variable will tend to decrease. In the event that r = 1 (resp. r = −1),
we say that there is perfect positive association (resp. negative association). When r = 0, we say there
is no linear association. Thus, while the sign of r tells us the direction of the linear association, the
magnitude of r (that is, how close r is to 1 or −1) will tell us the strength of the linear association
between two numerical variables.
Example 3.3.2 The two scatter plots below are examples of positive linear association between two
variables.
The plot on the left plots the price index of HDB flats against the price index of condominiums. We
observe that there is a positive linear association between the two indices, which means that as the price
of HDB flats increases, the price of condominiums is likely to increase as well. The value of r
in this case is 0.95 which indicates that the association is strong.
The plot on the right shows the midterm mark of students against the final mark. Again, we observe
that there is positive linear association between the two marks and in this case, r was found to be 0.75.
The next two scatter plots are examples of negative linear association between two variables.
The plot on the left shows the price of oil against the price of gold. In this case, we observe that the
trend is that when the price of gold increases, the price of oil tends to decrease. The value of r was found
to be −0.67 and this indicates that there is negative linear association between gold and oil prices.
The plot on the right shows the amount of financial aid received by students against the students’
family income. It is not surprising to find that as the family income increases, the amount of financial
aid received by students would tend to decrease. The value of r in this case is −0.49 and there is negative
linear association between the two variables.
The two scatter plots above are examples where r = 0. This means that there is no linear association
between the two variables. However, note that while r = 0 for the second plot, we can see that the data
points fit very well onto a curve and there is a clear non-linear relationship between X and Y . More
generally, no linear association between variables does not necessarily mean no association between
variables.
The two plots above show situations where there is perfect (positive or negative) linear correlation
between the two variables. In such cases, all the data points are connected by (and thus lie on) a straight
line. There is, however, one exception, which is when the straight line joining all the data points is
actually a horizontal (or vertical) line. In such instances, the value of r is 0 and there is no
association between the two variables. This is because when the data points are connected by a vertical
or horizontal line, a change of value in one of the variables does not relate to a change in the other
variable.
When describing the strength of a linear relationship, we usually follow the rule of thumb as given
in the diagram below.
When the magnitude of r is between 0.7 and 1, we say that the two variables have a strong linear
association. If the magnitude is between 0.3 and 0.7, the two variables have a moderate linear association.
If the magnitude is between 0 and 0.3, the two variables have a weak linear association. Do note that
other sources may differentiate strong/moderate/weak linear associations at other “cut-off” points that
are different from 0.3 and 0.7.
In general, as the value of r becomes closer to 1 or −1, the data points will increasingly fall more
closely to a straight line. Scatter plots where the data points are loosely dispersed typically mean that
correlation is weak (or non-existent). We will now discuss how to compute the value of r numerically.
Example 3.3.3 We will go through the steps required to compute the correlation coefficient using an
example. Consider the following table that shows a total of 10 data points of bivariate data (x, y):
x 9 4 5 10 6 3 7 2 8 1
y 41 17 28 50 39 26 30 6 4 10
1. First compute the mean and standard deviation of x and y. (Refer to Definition 1.4.1 and Definition
1.5.1 if you have forgotten how these are computed.) For this data set, we find the mean and
standard deviation of x to be 5.5 and 3.03 respectively while the mean and standard deviation of
y are 25.1 and 15.65 respectively.
2. Convert each value of x and y into standard units. To convert x (resp. y) into its standard unit,
we compute
(x − x̄)/sx (resp. (y − ȳ)/sy),
where x̄ and ȳ are the means and sx and sy are the standard deviations of x and y respectively.
The table below shows the values of x and y after they have been converted to standard units.
x 1.16 −0.50 −0.17 1.49 0.17 −0.83 0.50 −1.16 0.83 −1.49
y 1.02 −0.52 0.19 1.59 0.89 0.06 0.31 −1.22 −1.35 −0.96
3. Compute the product xy in their standard units for each data point. The table below has an
additional row for the value xy for each data point.
x 1.16 −0.50 −0.17 1.49 0.17 −0.83 0.50 −1.16 0.83 −1.49
y 1.02 −0.52 0.19 1.59 0.89 0.06 0.31 −1.22 −1.35 −0.96
xy 1.17 0.26 −0.03 2.36 0.15 −0.05 0.15 1.41 −1.11 1.43
4. Sum the products xy obtained in the previous step over all the data points and then divide the
sum by n − 1, where n is the number of data points. The result is the correlation coefficient r. For
the data set above,
r = (1/9)(1.17 + 0.26 − 0.03 + 2.36 + 0.15 − 0.05 + 0.15 + 1.41 − 1.11 + 1.43) ≈ 0.64.
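The four steps translate directly into code. The sketch below uses only the standard library; in
practice one would simply call numpy.corrcoef or pandas' .corr():

import statistics as st

x = [9, 4, 5, 10, 6, 3, 7, 2, 8, 1]
y = [41, 17, 28, 50, 39, 26, 30, 6, 4, 10]

# Step 1: means and (sample) standard deviations.
mx, my = st.mean(x), st.mean(y)
sx, sy = st.stdev(x), st.stdev(y)

# Step 2: convert each value into standard units.
zx = [(v - mx) / sx for v in x]
zy = [(v - my) / sy for v in y]

# Steps 3 and 4: sum the products of standard units, divide by n - 1.
r = sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)
print(round(r, 2))   # ~0.64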
Remark 3.3.4 For the purpose of this module, you will not be required to compute r manually, instead
you should be familiar with the method of how r is computed and thereby develop some basic intuition
on the properties of r.
Example 3.3.5 Let us revisit Example 3.2.6, where the scatter plot of HDB resale flat prices against
the ages of the flat shown below does indeed suggest that these two variables are negatively associated.
Indeed, upon computing the correlation coefficient between these two variables, we find that r =
−0.356, confirming that there is moderate negative linear association between the age and resale price
of HDB flats from the period January to June 2021.
We will now present three properties of correlation coefficients.
1. From the “Age” vs. “Price” of HDB resale flats example, we saw that r = −0.36 when we consider
the scatter plot with Age as the x-axis and Resale price as the y-axis. What would happen to r if
we had done the plot with Resale price as the x-axis and Age as the y-axis? In other words, what
happens to r when we interchange the x and y variables? If we revisit the process that describes
how r is computed from a bivariate data set, we would realise that regardless of which variable is
x (or y), the computation of r would not be affected in any way.
2. What would happen to the value of r if we add a constant to all the values of a variable? For
example, suppose it was discovered that there was an error in the recording of all the resale prices
of HDB flats and that the actual resale prices were all $1000 higher than what was given in the
data set. To correct this error, we would have to add $1000 to all the resale prices in the data set.
It turns out that such a change does not affect the value of r.
While this may not be immediately obvious, you are encouraged to verify this result by using the
data set in Example 3.3.3 and adding some number to all the values of x (or y).
3. Instead of adding the same number to all the values of a variable, what would happen to the value
of r if we multiply all the values of a variable by a positive number instead? For example, what if
the resale prices were converted to US dollars? This means that we have to multiply all the resale
prices in the data set by a factor of 0.73 (assuming an exchange rate of 1 Singapore dollar to 0.73
US dollars). It turns out that such a change again does not affect the value of r.
You are again encouraged to verify this result by adjusting the data set in Example 3.3.3 and
recalculating the correlation coefficient.
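All three properties are easy to verify numerically; a minimal sketch using the data from Example 3.3.3:

import numpy as np

x = np.array([9, 4, 5, 10, 6, 3, 7, 2, 8, 1])
y = np.array([41, 17, 28, 50, 39, 26, 30, 6, 4, 10])

r = np.corrcoef(x, y)[0, 1]
print(np.isclose(r, np.corrcoef(y, x)[0, 1]))          # property 1: interchange x and y
print(np.isclose(r, np.corrcoef(x, y + 1000)[0, 1]))   # property 2: add a constant
print(np.isclose(r, np.corrcoef(x, 0.73 * y)[0, 1]))   # property 3: positive scaling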
While the correlation coefficient between two numerical variables is insightful, there are certain limi-
tations.
Discussion 3.3.6
1. Association is not causation. To confuse association with causation is a common mistake that
is made by many. Very often when there is a strong association between two variables, with a
correlation coefficient of r that is close to 1 or −1, it is mistakenly concluded that any change in
the explanatory variable, say x, will cause the response variable y to change. This is incorrect as
what we can conclude is only a statistical relationship between x and y and not a causal relationship.
Consider the example above of a scatter plot that came from a data set containing information
on the percentage of people that earned a Bachelor’s Degree in 2017 across 3142 counties in the
United States, as well as the per capita income of these counties in 2017.3 Each data point in the
scatter plot represents a county. The x-axis is the per capita income in the past 12 months while
the y-axis is the percentage of the population in the county that earned a Bachelor’s Degree in
2017. The correlation coefficient for the two variables is 0.79, which indicates that there is strong
and positive association between the two variables.
It would be tempting to conclude that the higher the per capita income of a county, the higher
the percentage of the county’s population would have earned a Bachelor’s Degree. This is not
necessarily true. The data here merely suggests association of the two variables and does not
establish any causal relationship.
2. r does not tell us anything about non-linear association. The correlation coefficient r,
as defined and described in this section, measures the degree of linear association between two
numerical variables. Whatever the computed value of r is, it does not give any indication of
whether the two variables could be associated in a non-linear way.
The correlation coefficients for the three scatter plots above are small, yet there is actually a
strong relationship between the variables. The value of r is small because the relationship between
the variables is not a linear one. It is always a good practice to look at a scatter plot of the data
set and not just deduce any relationship between the variables from the computed value of r.
3 Data set can be downloaded from www.openintro.org/data/?data=county_complete.
3. Outliers can affect the correlation coefficient significantly. Outliers are observations that
lie far away from the overall bulk of the data. How do outliers affect the value of the correlation
coefficient? The removal of outliers from a data set can have different effects on the correlation
coefficient, depending on how the outlier is positioned in relation to the rest of the data points.
Consider the scatter plot on the left, where the outlier is circled. The correlation coefficient is 0.22
based on the data set that includes the outlier. However, when we remove the outlier, we see that
there is a strong positive linear association between the remaining data points. Thus, in this case,
the presence of the outlier decreases the strength of the correlation, compared to when the outlier
is removed.
Consider the scatter plot on the right where again the outlier is circled. In this case, the correlation
coefficient is −0.75 based on the data set that includes the outlier. When the outlier is removed,
the remaining data points give a correlation coefficient of 0.01. Thus, in this case, the presence
of the outlier actually increases the strength of the correlation, compared to when the outlier is
removed.
Example 3.3.7 For the HDB data set that we introduced earlier, the scatter plot below shows the
relationship between the resale price and the floor area of the flat. There are three outliers (circled) and
these are resale flats whose floor areas are larger than 200 square meters.
Using a statistical software, it was found that the correlation coefficient was 0.626 before the out-
liers were removed. After the outliers are removed, the correlation coefficient becomes 0.625, which is
practically the same as before.
Definition 3.3.8 So far, we have discussed correlation in the setting where individual data points are
considered. For example, the collection of data points could represent individuals from a population.
However, we can also examine the data at an aggregated level by grouping these individuals based
on factors like ethnic group or education level. An ecological correlation is computed based on the
aggregates rather than on the individuals. Thus, ecological correlation represents relationships observed
at the aggregate level, considering the characteristics of groups rather than individuals.
Example 3.3.9 Consider the scatter plot below for a data set consisting of individuals belonging to
three distinct groups. The three groups are represented by the symbols circle, cross and plus.
The correlation coefficient computed at the individual level is r = 0.85, indicating that there is a
strong and positive linear association between the variables X and Y. Suppose we compute the group
averages (for X and Y ) for the three subgroups and obtain the three red dots as shown in the figure.
These three red dots, or aggregate points, align rather closely along a straight line. In fact, if we compute
the correlation coefficient based on these three aggregate points, the correlation coefficient would be 0.98.
Consequently, this example illustrates that the ecological correlation derived from group averages
suggests a more pronounced (since 0.98 is closer to 1 than 0.85) positive linear association compared to
correlation calculated at the individual level.
This phenomenon does not happen all the time. In general, when the associations for individuals
and for aggregates are in the same direction, the ecological correlation based on aggregates will typically
overstate the strength of the association in individuals. Without getting into details, the intuitive
explanation is that the variability among individuals is no longer as significant when the correlation
is computed based on group aggregates.
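A small simulation can illustrate this effect. The data below is synthetic and purely illustrative (it
is not the data behind the figures above):

import numpy as np

rng = np.random.default_rng(0)
xs, ys = [], []
for cx, cy in [(0, 0), (5, 4), (10, 9)]:   # three subgroup centers
    xs.append(cx + rng.normal(0, 2, 100))  # noisy individuals around each center
    ys.append(cy + rng.normal(0, 2, 100))

x, y = np.concatenate(xs), np.concatenate(ys)
r_individual = np.corrcoef(x, y)[0, 1]

gx = [g.mean() for g in xs]                # aggregate (group average) points
gy = [g.mean() for g in ys]
r_ecological = np.corrcoef(gx, gy)[0, 1]
print(round(r_individual, 2), round(r_ecological, 2))   # the ecological r is larger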
Definition 3.3.10 The previous example reminds us that correlation at the individual level and at
aggregate level may tell us a different story about our data set. We need to be careful not to make
any wrong deductions. Consider the scatter plot below that represents the relationship between two
variables.
There are clearly four distinct subgroups of individuals (grouped by the four ovals). If we consider the
subgroup averages, represented by the four red dots in the diagram, the correlation between these four
subgroup averages suggests that there is a positive linear association, as indicated by the blue regression
line. Can we now conclude that at the individual level, there is also a positive linear association between
the two variables?
This is not the case. If we look at the individual level within each subgroup, we notice a weak, but
nevertheless negative, linear association between the two variables. Thus, we would have been wrong if
we drew conclusions about correlation at the individual level based on what we observed at the aggregate
level. If we do so, we would have committed what is known as the ecological fallacy.
The moral of the story is that we should not assume that correlations based on aggregates will hold
true for individuals. Ecological correlation and correlation based on individuals are not the same and
should not be confused.
Definition 3.3.11 Consider another scatter plot below, again representing the relationship between
two variables X and Y .
Again, there are clearly three distinct subgroups of individuals in the data set and within each
subgroup, we observe a strong positive linear association between the two variables. Can we now conclude
that at the aggregate level, there is also a positive linear association between the two variables?
The three subgroup averages, represented by the three red dots are shown. It turns out that there is
actually no clear correlation between the variables at the aggregate level. Based on the correlation we
observed at the individual level, if we had mistakenly concluded that the same correlation would exist
at the aggregate level, we would have committed what is known as the atomistic fallacy.
Differentiating between the two types of fallacies described above can be confusing at first. The following
table summarises them.
Now that we have seen that the age of a HDB resale flat is negatively associated with its resale price,
it is reasonable to wonder if we can make some predictions about the resale price of a flat given its age.
For example, for a flat that is 40 years old, what is our guess for its resale price?
Definition 3.4.1 If we believe that two variables X and Y are linearly associated, we may model the
relationship between the two variables by fitting a straight line to the observed data. This approach is
known as linear regression. Recall that the equation of a straight line is given by
Y = mX + b,
where b is the y-intercept and m is the slope or gradient of the line. The y-intercept is the value of
Y when the value of X is 0. The slope of the line is the amount of change in Y when the value of X
increases by 1.
In the figure above, the straight line in red is the regression line that is fitted to the observed data,
represented by the blue dots. Consider the i-th observation (Xi , Yi ). The “?” in the figure represents the
residual of the i-th observation, which is the observed value of Y for Xi (that is, Yi ) minus the predicted
value of Y for Xi (predicted by the straight line). This residual, denoted by ei , is sometimes also called
the error of the i-th observation as it measures how far the predicted value is from the observed value.
Example 3.4.2 Let us return to the question we posed at the beginning of this section. What is our
prediction for the resale price of a HDB flat that is 40 years old?
With X representing the age of the resale flat and Y being the resale price, the regression line obtained
from the data set is
Y = −4007X + 591857.
So the predicted resale price of a 40 year old flat is −4007 × 40 + 591857 = $431,577. It is important
to note that we are not concluding that every 40 year old flat will be sold at this price; rather, this is
the predicted average resale price of 40 year old flats.
Furthermore, as the correlation between resale flat price and age of the flat is not strong (r = −0.356),
the prediction obtained from the linear regression above may not be as accurate as it would be in a
scenario where the correlation is stronger.
Now that we have seen how a regression line can be used, the question is how do we obtain such a
line given bivariate data? What method and principle is used to determine the regression line? Among
the many different straight lines that we can use to fit the data points, which one is the “best”?
Discussion 3.4.3 There are several ways to assess which straight line fits the observed data best.
One of the most common is the method of least squares. For this module, we will not go into the
technicalities of this method but instead we will briefly describe the idea behind this method.
Recall that when we fit a straight line through a set of observed data points (xi, yi), the difference
between the observed value yi and the outcome predicted by the straight line is known as the residual
of the i-th observation. This residual, denoted by ei, is also known as the error of the i-th observation,
as it measures how far the observed value is from the predicted value.
In the plot above, we see that each data point gives rise to an error term and it is reasonable to say
that a line of good fit is one that keeps the error terms (considered over all data points) small. However,
instead of looking at the overall error by summing up
e1 + e2 + · · · + en ,
where n is the total number of data points, the method of least squares seeks to find a straight line that
minimises the overall sum of squares of errors,
e1² + e2² + · · · + en².
You may wonder why minimising e1² + e2² + · · · + en² is more appropriate than minimising e1 + e2 + · · · + en.
We will leave you to ponder about this question before having a discussion with your friends or instructor.
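In practice, the least squares line can be obtained with a single function call; a minimal sketch using
the small data set from Example 3.3.3:

import numpy as np

x = np.array([9, 4, 5, 10, 6, 3, 7, 2, 8, 1])
y = np.array([41, 17, 28, 50, 39, 26, 30, 6, 4, 10])

m, b = np.polyfit(x, y, deg=1)   # slope and intercept of the least squares line
residuals = y - (m * x + b)      # e_i = observed y_i minus predicted y_i
print(round(m, 2), round(b, 2))
print(round(float((residuals ** 2).sum()), 2))   # the sum of squared errors being minimised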
Remark 3.4.4
1. It is useful to note that the least squares regression line obtained from a set of observed data points
(xi, yi) will always pass through the point of averages of that data set, that is, (x̄, ȳ). This fact
can be established mathematically, but is beyond the scope of this course.
2. It is important to note that while we have obtained the least squares regression line that allows us
to predict the average resale price for a given age of the resale flat, the same regression line cannot
be used to predict the average age of resale flats for a given resale price. The reason is essentially
because of the way the regression line was obtained.
In obtaining the regression line with the independent variable (x) as age and the dependent variable
(y) as the resale price, the line was fitted to minimise the square of error terms between the observed
and predicted resale prices.
If the intention was to use a given resale price to predict the average age of the resale flats, then
we would be looking at another regression line that minimises the square of error terms between
the observed and predicted ages of resale flats.
The two regression lines are different and thus not interchangeable.
3. The correlation coefficient r between the variables X and Y is closely related to the regression line
Y = mX + b:
the slope is given by m = r × (sY /sX ), where sX and sY are the standard deviations of X and Y
respectively. In particular, the sign of the slope always agrees with the sign of r.
4. Another important point to note about the linear regression line obtained from a data set concerns
the range of the independent variable in the data set.
Recall that we have obtained the linear regression line for the purpose of predicting the average
resale price based on the age of the resale flat. From the data set, the value of the independent
variable (in this case, this is the age of the resale flat) ranges from 2 to 54 years. Thus the prediction
that can be arrived at using the regression line is only applicable for HDB flats whose age is between
2 and 54 years old. Outside this range, we should not use the regression line to make our prediction
as the best fit regression line may change outside this range. For example, we should not use the
regression line to predict the average resale price of flats that are 60 years old as our data set does
not contain any information on resale flats that are more than 54 years old.
Discussion 3.4.5 To conclude this section and also the chapter, we will describe a method to study
the relationship between two variables if the relationship is not linear. The following table shows part of
a data set that provides the total number of confirmed COVID-19 cases in South Africa since 5 March
2020.4 .
4 Data set can be downloaded from www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset.
In this data set, t is the variable representing the number of days since 5 March 2020.
It can be computed, using Microsoft Excel or other statistical software, that the correlation coefficient
between the total number of confirmed cases and t is 0.812, which indicates that there is a strong
positive linear association between the two variables. Is this indeed the case? Perhaps we may make
such a conclusion but, as stated earlier, the correlation coefficient alone does not give the entire picture.
We should create a scatter plot using our bivariate data and verify whether there is really a linear
relationship.
Are the two variables associated linearly? It is quite clear visually that the total number of confirmed
cases increases exponentially when t increases. Thus, if we let y be the variable representing the total
number of confirmed cases, y and t are not linearly associated but instead the relationship between them
seems to be exponential. For such a situation, can we apply our linear regression technique to make
predictions on the total number of confirmed cases? The answer is yes, but it would have to be done
indirectly.
Now, if the relationship between y and t is indeed exponential in nature, we can model this relationship
using the equation
y = cb^t,
where c and b are some constants that we will determine. Using the property of the logarithmic function,
we see that
ln y = ln c + (ln b)t.
Thus, instead of making a scatter plot with y plotted against t, we will make a scatter plot with ln y
plotted against t. If there is indeed an exponential relationship between y and t, then we would expect
to see a linear relationship between ln y and t, as indicated by the equivalent equations above. Let us go
through the steps:
(a) Step 1: For each data point (t, y), compute (t, ln y). For our data set on COVID-19 cases in South
Africa, we have the following table:
(b) Step 2: Find the linear regression line for ln y vs t. For our example, the regression line was found
to be
ln y = 4.287 + 0.066t.
We are now able to write down the exponential equation relating y and t:
y = e^4.287 × (e^0.066)^t ≈ 72.7 × (1.068)^t.
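The whole procedure takes only a few lines in Python. The sketch below runs on synthetic data
generated from an exponential trend, since the actual COVID-19 data set is not reproduced here:

import numpy as np

rng = np.random.default_rng(1)
t = np.arange(60)                                             # days since the start date
y = 72.7 * 1.068 ** t * np.exp(rng.normal(0, 0.05, t.size))   # noisy exponential growth

# Steps 1 and 2: regress ln y on t, then undo the logarithm.
slope, intercept = np.polyfit(t, np.log(y), deg=1)   # ln y = intercept + slope * t
c, b = np.exp(intercept), np.exp(slope)              # recover y = c * b^t
print(round(c, 1), round(b, 3))                      # close to 72.7 and 1.068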
Exercise 3
1. The results of 10 students in a test are given below, where L indicates that a student obtained a
score below 55, and the two H’s (not necessarily the same) indicate scores above 90. The maximum
possible score is 100.
60 80 L 70 H 83 59 H 70 65.
Which of the following statements must be true about the scores given above?
Select all that apply.
2. Which of the statements below will always hold if an outlier for a single numerical variable is
removed?
3. The graph below compares two sets of grades of a cohort of 233 students who took a course.
The histograms of the same two sets of grades are provided below.
4. The Registry of Marriages is interested to see the relationship between the ages of husbands and
wives in City X. They randomly sampled 1000 pairs of husbands and wives from the population
of City X and obtained data of their ages (in years). Looking through the data, they found that
men always marry women who are younger than them. Based only on the information given above,
which of the following statements must be true?
(I) The average age of the husbands is more than the average age of the wives.
(II) The standard deviation of husband’s age is more than the standard deviation of wife’s age.
5. In the scatter plot below, the dotted straight lines mark the average values of X and Y .
(I) The line Y = X cuts through the data points in half, with 50% of the data points on either
side of the line.
(II) The average of Y is larger than the average of X.
6. The following boxplot shows the final examination marks of students from class A.
The passing mark for the final examination is 12 out of 20. Suppose that the boxplot of the final
examination marks for class B is the same as the boxplot for class A. What can be said about the
final examination marks of students from class B? Select all the correct statements.
(A) The proportion of students in class B who passed the examination must be the same as that
for class A.
(B) At least 50% of the students from class B failed the final examination.
(C) The standard deviation of the students’ marks in class B is equal to the standard deviation
of those in class A.
(D) Based on the boxplot, there are no outliers in the grades of students from class B.
7. Which of the following statements must be true? Select all that apply.
8. Suppose that the following are 10 data points for a numerical variable X:
where r is an unknown whole number and r ≠ 138. Based on the definition of an outlier
for a boxplot, if 138 is the only outlier in this data set, the maximum possible value of r is ______.
9. Alice wants to compare the distribution of car transaction prices in 2020 with that in 2021. In
her comparison, she also wants to highlight that the number of transactions in 2021 is much larger
than that in 2020. Which of the following best serves her purpose?
(A) Plotting two boxplots (one for each year) of car transaction prices along the same Y -axis.
(B) Plotting two histograms (one for each year) of car transaction prices along the same Y -axis.
(D) Plotting a scatter diagram of car transaction prices in 2021 against car transaction prices in
2020.
10. The following histogram and boxplot are constructed using 90 observations of the numerical variable
x. Based on the graphics, determine which of the following statements must be true.
11. 100 students sat for a mid-term test. Ms Lee created the following boxplot to visualise the test
scores and noted that only one student had a test score of 0 because he was absent for the test.
The next day, Ms Lee found out that she had missed out grading the last question of the test
paper, and everyone who sat for the test should have 2 marks more.
The new (and correct) median score is (1) and the new (and correct) IQR is (2) .
Fill in the blanks in the statement above, giving your answers correct to 2 decimal places.
12. The graph below shows a scatter plot between exam scores for 100 students and their corresponding
number of days absent from school throughout a year. The maximum mark for the exam is 100.
There is an outlier present in the scatter plot.
From the scatter plot above, which of the following statements is/are true? Select all that apply.
(A) After removing the outlier, the strength of the linear association between the two variables
will increase.
(B) After removing the outlier, the correlation coefficient between the two variables will increase.
(C) Higher absenteeism rate causes lower exam scores.
(D) If a linear regression line is fitted to this data, the gradient of the line will be negative.
13. Mary wanted to find out if the duration spent on playing video games is associated with the amount
of sleep one gets. She collected data from 5000 participants who played video games for 6 to 10
hours each day. She found that the participants’ number of sleep hours varied from 6 to 8 hours
each day, and calculated the formula of the linear regression line to be
y = 11 − 0.5x,
where x represents the number of hours spent playing video games each day and y represents the
number of sleep hours each day. From the data, what is a reliable prediction of the average number
of sleep hours for a person who plays an hour of video games each day?
(A) 10.
(B) 9.
(C) 10.5.
(D) There is insufficient data to make a reliable prediction.
14. Which of the following statements about the correlation coefficient between two numerical variables is/are true? Select all that apply.
(A) The correlation coefficient changes when we add 3 to all the values of both variables.
(B) The correlation coefficient changes when we subtract 5 from all the values of one variable.
(C) The correlation coefficient changes when we multiply all the values of one variable by −1.
(D) The correlation coefficient changes when we interchange the two variables.
15. Which of the following statements about the correlation coefficient is/are true?
(I) A correlation coefficient of 0 means that there is no linear association between the two variables.
(II) A correlation coefficient of −0.8 indicates a weaker linear association than a correlation coefficient of 0.7.
16. A researcher wants to study the relationship between the heights of mothers and daughters. He
surveys 100 mother-daughter pairs and plots their data on a scatter plot: mothers’ heights on the
horizontal axis and their respective daughters’ heights on the vertical axis. He realises that for
every daughter, her mother is at least twice her height.
From this information alone, what can we conclude about the correlation coefficient between the
heights of mothers and daughters?
(A) It is negative.
(B) It is positive.
(C) It is zero.
(D) There is insufficient information to choose any of the other options.
17. To find out if there is a relationship between exercising and intelligence, the researchers of a study
obtained the exercise duration and IQ scores of each participant. The data was plotted on a scatter
plot, with exercise duration on the x-axis and IQ scores on the y-axis. The correlation coefficient
was computed to be 0.8. From this information alone, what can you conclude?
18. If an outlier is removed from a data set of two variables, the gradient of the new regression line without the outlier can be ____ the gradient of the original regression line with the outlier. Which of the following can be used to fill in the blank? Select all that apply.
19. Consider plots A and B shown below. What will happen to the correlation coefficients for both
plots after removing the outliers indicated?
(A) The correlation coefficient in plot A will increase and correlation coefficient in plot B will
decrease.
(B) The correlation coefficient in plot A will decrease and correlation coefficient in plot B will
increase.
(C) The correlation coefficients in plots A and B will both increase.
(D) The correlation coefficients in plots A and B will both decrease.
20. There are two Primary Six classes in a tuition centre. Classes A and B each have 100 students, and all students sat for a mathematics midterm test as well as a final examination. In Class A, every student scored 1 point higher in the final examination than in the midterm. In Class B, every student scored 1 point lower in the final examination than in the midterm. For the midterm test, the average score is 50 and the standard deviation is 20 for both Classes A and B. Suppose now Class C is formed by combining all the students from Classes A and B. Which of the following statements is/are correct?
(I) The correlation coefficient of Class C is smaller than the correlation coefficient of Class A.
(II) The correlation coefficient of Class C is larger than the correlation coefficient of Class B.
21. For the 40 students in a class, the results of their second English test are plotted against the results
of their first English test. It was found that for each student, the result of the second test is better
than that of the first test (i.e., the second test score is higher). Which of the following must be
true about the relationship between the students’ second test results and their first test results?
(I) If student A scores better than student B in the first test, then student A also scores better
than student B in the second test.
(II) The correlation between students’ second test results and first test results is positive.
22. There is a weak linear association between numerical variables X and Y , where X ranges from 0
to 5 (inclusive). Based on the data from X and Y , the regression line is given by the equation
Y = 0.25X + 2. Which of the following statements must be true? Select all that apply.
24. Dash is interested to find out if there is a correlation between the time spent in playing computer
games (in hours) and student’s score (out of 100). After slicing the data according to gender, it
was found that the regression line for males is
Furthermore, Simpson’s Paradox is observed when the 2 subgroups are combined. Which of the
following is/are possible ranges of r, the correlation coefficient between the time spent gaming and
the students’ score when the subgroups are combined?
(A) 0 ≤ r ≤ 1.
(B) −1 ≤ r < 0.
(C) There is not enough information to guess the range of r.
25. A data set of four points A to D were plotted on the graph below. All the points A to D have
positive X and positive Y values. Assume that this graph is drawn to scale.
Which of the following changes would result in the highest increase in the r value?
26. Which of the following gives the most likely correct values of the correlation coefficients obtained
from the four scatter plots? Here, r1 refers to the correlation coefficient of plot (1), r2 refers to
the correlation coefficient of plot (2) and so on.
27. Mass can be measured in at least two units, namely kilograms (kg) and pounds (lbs). The conversion between the two units of measurement is as follows:
1 kg = 2.20 lbs.
Within a data set, it is known that the correlation between numerical variables x and M is 0.72
where M is the mass, measured in kg. The equation of the regression line is M = 0.32x + 300 for
x ranging from 0 to 5 inclusive. Based on this information, if we change the scale of mass from
kilograms to pounds, which of the following statements is true?
(A) Both the correlation coefficient and gradient of the regression line remain unchanged.
(B) Only the correlation coefficient remains unchanged but the gradient of the regression line
changes.
(C) Only the gradient of the regression line remains unchanged but the correlation coefficient
changes.
(D) Both the correlation coefficient and the gradient of the regression line will change.
28. A team of researchers found that countries that mandated Bacillus Calmette-Guérin (BCG) vacci-
nation had lower COVID-19 death rates compared to countries that did not mandate BCG vaccina-
tion. Based on this association alone, the researchers concluded that mandating BCG vaccinations
causes lower COVID-19 death rates. Which one of the following mistakes have they committed?
29. A researcher wishes to examine the association between the number of hours students engage in
gaming and their academic performance. He conducts his study in 15 schools, and calculates for
each school the average number of hours students spend gaming and the average academic score
of students. He notices that there is a correlation of −0.9 between the two sets of averages. He
concludes that
“The correlation between gaming hours and academic score for all students from the 15 schools is
−0.9”.
30. Suppose that the relationship between two variables can be best modelled by a power function of the form y = ax^b, where a and b are positive constants and x > 0. Which set of variables should we plot if we wish to observe a linear relationship?
Statistical Inference
In Chapter 1, we introduced the following types of research questions that are of interest.
3. To compare two sub-populations / to investigate a relationship between two variables in the pop-
ulation.
You would have noticed that a common term that recurs in the three research questions is the word
population. Indeed these are all questions pertaining to the population. In order to answer these
questions, we would need to have complete information about the entire population. This is usually not
possible due to the sheer size of the population.
In order to give an approximate answer to the research questions, we need to use a sample of the
population. The process of drawing conclusions about the population from sample data is known as
statistical inference.
Recall the PPDAC cycle, first introduced in Chapter 1. In particular, when we focus on the second
to fourth phases of the cycle, “Plan”, “Data” and “Analysis”, these phases involve specialised tools
and techniques. These tools and techniques lead us to take a closer look at how these three phases are
inter-related.
The “Plan” and “Data” phases were discussed in Chapter 1. How is a sample obtained from a
population? What are the different methods of sampling and what are the types of biases we need to
avoid? From summary statistics introduced in Chapter 1, to how categorical variables can be analysed
in Chapter 2 and likewise for numerical variables in Chapter 3, these are all tools that we can use under
the Analysis phase.
To conclude the “Analysis” phase, we need to look at the results of our analysis of the sample and
subsequently draw conclusions on the population. This is where statistical inference comes into the
picture. In order to have a meaningful discussion on statistical inference, we need to acquire some
knowledge about probability. Probability and inference will form the main thrust of this chapter. To
begin, we will introduce some basic results in probability which allow us to discuss the tools required in
statistical inference. The two kinds of tools most common in statistical inference are confidence intervals
and hypothesis tests, both of which will be discussed in some detail in the rest of the chapter.
Probability
Let us lay the groundwork for probability by defining some basic terms that are necessary in this
subject.
Discussion 4.1.1 In previous chapters, we have occasionally touched on the notion of uncertainty.
When we use the word “chance” it is understood intuitively that something is not definite, or not certain
to hold or occur. In order to compare the likelihood of occurrence, we use terms like “more likely” or
“less likely”. These terms are common and adequate for everyday use. However, they are not precise
and as we deal with data at a deeper level, we need a rigorous framework to ground the concept of
uncertainty. Probability is a mathematical means that we can use to reason about uncertainty.
Consider a coin, with one side called “heads”, and the other called “tails”. Let’s say the coin is tossed
twice and the side that faces up when the coin lands is observed for both tosses. What are the possible
outcomes after two tosses? If we represent observing heads as H and observing tails as T , then the four
possible outcomes are:
HH, TT, HT, TH.
Here HT is differentiated from TH, as HT means heads was observed in the first toss and tails in the second toss, while TH means tails was observed first, followed by heads. In this example, the procedure of tossing the coin twice is called a probability experiment. The set {HH, TT, HT, TH} contains the
outcomes of the probability experiment.
It should be noted that the probability experiment defined here is narrower than the type of exper-
iments described in Chapter 1. A probability experiment must be repeatable and allows for the exact
listing of all the possible outcomes, like the way we have listed down all the four possible outcomes of
the probability experiment of tossing a coin twice.
A sample space is the collection of all possible outcomes of a probability experiment. A sub-collection
of the sample space is called an event. Referring back to our coin tossing probability experiment:
1. The sample space is
HH, TT, HT, TH.
2. An event could be
HH, TT.
We can describe this event as two in a row or two identical observations.
Having understood a probability experiment, the sample space of the experiment and an event of the
sample space, we are now ready to give context to the mathematical discussion of probability. Informally, the probability of an event measures how likely the outcome of the experiment is to be an element of that event. For example, in our coin tossing experiment, the probability of at least one tail is how likely the outcome of the experiment will be TT, TH or HT.
Example 4.1.2 Another common example of a probability experiment is the rolling of a six-sided die.
It is obvious that such an experiment can be repeated as many times as we wish to and it is also easy to
list down all the outcomes of a single roll of the die.
1. Probability experiment: rolling a six-sided die once and observing the top-facing side.
2. Sample space:
1, 2, 3, 4, 5, 6.
Rather than use a list, we can put all the outcomes into a set,
{1, 2, 3, 4, 5, 6}.
3. An example of an event:
2, 4, 6.
This event can be described as “die shows an even number”. As an event is a sub-collection of
the sample space, when we represent the sample space as a set, an event will be a subset of the
sample space, for example
{2, 4, 6}.
As an exercise, you may wish to write down the sample space (as a set) for the probability experiment
of rolling a six-sided die twice and observing the top facing side on both rolls. Follow this by describing
the event and writing down the subset of the sample space that represents the event.
Notation 4.1.3 Suppose E is an event, then P (E) is the probability of event E. Probabilities are
numerical values between 0 and 1 (both inclusive), so P (E) takes on a numerical value between 0 and 1
and this is the probability assigned to event E.
Discussion 4.1.4 The question now is how do we know which numerical value between 0 and 1 to
assign to an event E? In other words, how do we know the probability P (E)? Mathematically, we
can define P (E) as the long run proportion of observing E when a large number of repetitions of the
experiment is being performed. Thus, we can repeat the probability experiment a large number of times
(say N times) and each time we observe if the outcome is an element of the event E. Suppose the first
experiment’s outcome is in E, then we mark that experiment with a “YES”. Repeat the experiment
again, suppose now the outcome is not in E, then we mark this experiment with a “NO”. Continue this,
till we have done the experiment N times and each time we have either a “YES” or a “NO”.
We now count the number of “YES” we have, out of the N times the experiment was done. Then
the probability of event E, P (E) is estimated by
(number of “YES”) / N.
It should be noted that
1. The estimate of P (E) we obtain from these N repetitions of the experiment is likely to be
different if we repeat the experiment (and get another estimate) another N times.
2. Such estimates get more accurate and closer to the true value of P (E) as the number N becomes
larger.
Example 4.1.5 Consider again the probability experiment of rolling a six-sided die once and observing the top-facing side, and let E be an event of interest, say “the die shows an even number”. What would be an estimate of P(E)? Suppose we repeatedly roll (and observe) the die 500 times and record a “YES” or a “NO” for each roll as described above. If the total number of “YES”, out of the N = 500 times the experiment was carried out, is 268, then an estimate of P(E) is
P(E) ≈ 268/500 = 0.536.
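The long-run proportion interpretation of P(E) is easy to demonstrate by simulation. The sketch below (in Python, assuming a fair die, so the true probability of an even number is 0.5) repeats the experiment N times and reports the proportion of “YES”:

import random

def estimate_probability(num_repetitions):
    """Estimate P(E) for E = "die shows an even number" by repetition."""
    yes_count = 0
    for _ in range(num_repetitions):
        outcome = random.randint(1, 6)  # one roll of a fair six-sided die
        if outcome % 2 == 0:            # does the outcome belong to E?
            yes_count += 1
    return yes_count / num_repetitions

# The estimates fluctuate around the true value and stabilise as N grows.
for n in (100, 10_000, 1_000_000):
    print(n, estimate_probability(n))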
Rules of Probabilities
It is virtually impossible to verify the true probability of an event of a probability experiment. For example, even if we say that a coin is “fair”, does it mean that the probability of “heads” is exactly 0.5 and that of “tails” is exactly 0.5? Probably (pun intended) not! The probabilities that we encounter in everyday life are just estimates of the true probability, but in the analysis of data, it is sufficient to treat these estimates as if they were the true probabilities. What is important and relevant in the use of such estimates is that when assigning probabilities to events of a probability experiment, the following rules of probabilities must be obeyed.
1. The probability of each event E, denoted by P(E), is a number between 0 and 1 (inclusive).
2. The probability of the entire sample space S is 1; that is, P(S) = 1.
3. If E and F are mutually exclusive events (meaning both events cannot occur simultaneously), then the probability of E union F is equal to the sum of the probabilities of E and F. That is,
P(E ∪ F) = P(E) + P(F).
When the sample space contains only a finite number of outcomes, we only need to assign probabilities
to the outcomes so that these probabilities sum up to 1. The probabilities of all other events can then
be derived from there.
Example 4.1.6 Suppose we have a biased six-sided die being rolled once. The following probabilities
are assigned to the six possible outcomes.
Outcome 1 2 3 4 5 6
Probability 0.1 0.1 0.1 0.1 0.1 0.5
Check that the probabilities add up to 1. We are now able to derive the probabilities of certain events by
applying the third rule of probability as stated above. For example, if E is the event “an odd-numbered
face” and F is the event “an even-numbered face”, it is easy to see that
1. P (E) is the sum of P (1), P (3) and P (5), so P (E) = 0.3. (Here “1”, “3”, “5” are mutually exclusive
events.)
2. P (F ) is the sum of P (2), P (4) and P (6), so P (F ) = 0.7. (Here “2”, “4”, “6” are mutually exclusive
events.)
Definition 4.1.7 Uniform probability is the way of assigning probabilities to outcomes such that equal
probability is assigned to every outcome in the finite sample space. Thus, if the sample space contains
a total of N different outcomes, then the probability assigned to each outcome is
1/N.
As an example, if the sample space S contains the outcomes of flipping a fair coin twice, then
S = {HH, HT, T H, T T }.
Using uniform probability, we will assign the probability of 1/4, or 0.25, to each of the four outcomes.
Example 4.1.8 We have in fact seen uniform probability in action much earlier in Chapter 1 when
simple random sampling was introduced. Recall that in simple random sampling, r units from the
sampling frame are randomly selected to be included in the sample. We are conducting a probability
experiment where the sample space is the sampling frame that contains all the units that could possibly
be selected. The probability of selecting a particular unit at the first draw from the sampling frame is
thus 1/N, where N is the size of the sampling frame.
Furthermore, for any subset of the sample space (an event) denoted by A, the probability of this
event, P (A) is interpreted as the likelihood of selecting a unit belonging to A into the sample. This is
equal to the rate of A in the sampling frame.
As a concrete example, suppose the sampling frame consists of 500 adults, comprising 280 males and 220 females. By simple random sampling, each adult, for example John, has a probability of 1/500 of being selected at the first draw. So the probability that a person selected at the first draw is male would be 280/500.
If we define A to be the subset of the sample space consisting of the male adults, then the probability
of event A is the rate of A in the sampling frame. That is,
P(A) = rate(male) = 280/500 = 0.56.
Discussion 4.1.9 (Sally Clark’s story) The story of Sally Clark’s trial sets the stage for our further
discussion on probability. In 1998, Sally Clark, a British woman was accused of having killed her first
child at 11 weeks of age and then her second child at 8 weeks of age. Sir Roy Meadow, who was a
professor and consultant paediatrician appeared as an expert witness for the prosecution.
In his testimony, Sir Meadow commented that the chance of two children from an affluent family
dying from Sudden Infant Death Syndrome (SIDS) is 1 in 73 million.
The jury eventually convicted Sally Clark by a 10-2 majority and she was given the mandatory
sentence of life imprisonment. Sensationally, it was later discovered that the probability that Sir Meadow
calculated was misrepresented. In 2001, the Royal Statistical Society (RSS) decided to issue a public
statement expressing its concern at the misuse of statistics in the courts. In the statement, the RSS asserted that there was no statistical basis for the figure of 1 in 73 million quoted by Sir Meadow. A year
later, in 2002, the Society wrote to the Lord Chancellor pointing out that the calculation leading to the
figure of 1 in 73 million was erroneous.
So what went wrong with Sir Meadow’s computation? To discuss this further, we will need to learn
two concepts of probability namely, conditional probability and independence. They will be covered in
the next section and we will continue with the story of Sally Clark subsequently.
Conditional Probability and Independence
Let us begin this section by using the same example as Example 4.1.8. Suppose we have 500 adults, comprising 280 males and 220 females, as participants in a lucky draw where there is only one prize to be won. Under uniform probability, each person, for example John, has a probability of 1/500 of being the winner of the prize.
Now suppose it is known that the winning ticket will be drawn from the male participants, what is
John’s probability of being the winner of the prize now?
Definition 4.2.1 The scenario described above involves the concept of conditional probability. Condi-
tional probability is normally written using the notation
P (E | F )
and is read as “probability of E given F ”. Here, E and F are events of a particular sample space. With
reference to our lucky draw example above, events E and F are:
E: Winner of the prize is John;
F : Winner of the prize is a male.
So the conditional probability P (E | F ) is the probability that John is the winner given that it
is a male who won. Intuitively, the probability of E given F measures how likely the outcome of the
probability experiment is an element of E, if we already know that it is an element of F . To
compute conditional probabilities, we use the idea of restricting the sample space based on the condition
that event F is known to have occurred.
More precisely, to compute the probability of E given F , we restrict our focus on the given event F
as our restricted sample space (rather than looking at the entire sample space). The event F may or may not overlap with event E. The overlap is denoted by E ∩ F, to be read “E intersect F”.
The probability of E given F is obtained by dividing the probability P (E ∩ F ) by the probability P (F )
which acts as the baseline (restricted sample space). Thus,
P(E | F) = P(E ∩ F) / P(F).
Remark 4.2.2
1. It is perfectly possible that there is no overlap between events E and F , meaning that it is not
possible that E and F happen simultaneously. In such a situation, it is clear that the probability
that event E occurs given that event F is known to have occurred is certainly 0. Indeed, with
P (E ∩ F ) = 0, we see that
P(E | F) = P(E ∩ F) / P(F) = 0.
2. If event F itself cannot occur, that is P (F ) = 0, then we will stipulate by convention that P (E | F )
is also equal to 0.
Example 4.2.3 Returning to the lucky draw, suppose it is known that the winner is a male (event F). The restricted sample space now consists of the 280 male participants, so John's probability of being the winner is
P(E | F) = 1/280 = 1/(number of males in total).
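The restricted-sample-space computation can also be checked directly from the definition P(E | F) = P(E ∩ F)/P(F). A minimal Python sketch for the lucky draw:

# Lucky draw: 500 adults (280 males, 220 females), one winner chosen uniformly.
# E: John (a male) wins; F: the winner is male.
total = 500
males = 280

p_f = males / total      # P(F) = 280/500
p_e_and_f = 1 / total    # P(E ∩ F) = 1/500, since John winning implies a male winner

p_e_given_f = p_e_and_f / p_f
print(p_e_given_f, 1 / males)  # both give 1/280 ≈ 0.00357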
Discussion 4.2.4 (Conditional probabilities as rate) We have seen earlier that uniform probabilities arise in the probability experiment of randomly selecting a unit from a fixed sampling frame, where probabilities correspond to rates. The table below draws the analogy between the two interpretations.
What about for conditional probabilities? Will there be a similar correspondence to conditional rates?
More specifically, is the conditional probability of A given B equal to the rate of A given B whenever A
and B are subgroups of the sampling frame? The following derivation shows that they are indeed equal.
P(A | B) = P(A ∩ B) / P(B)    (by using the idea of restricted sample space)
= rate(A ∩ B) / rate(B)    (by the correspondence between probabilities and rates)
= [size of (A ∩ B) / size of sampling frame] ÷ [size of B / size of sampling frame]    (by the definition of rates as ratios of two sizes)
= size of (A ∩ B) / size of B
= rate(A | B)    (by the definition of conditional rates)
So indeed, this derivation shows that just as probabilities are equivalent to rates in this probability
experiment, conditional probabilities are also equivalent to conditional rates.
Discussion 4.2.5 Let us continue from Discussion 4.1.9 with the story of Sally Clark’s trial. So where
did Sir Meadow make the mistake in his calculation? It turns out that the error was due to misconception
of the conditional probability. The calculation done by Sir Meadow produced, at best, the probability
P(Evidence as observed | Clark is innocent),
but it was misinterpreted as
P(Clark is innocent | Evidence as observed),
which was what the prosecutor really meant to impress on the jury. The intention was to illustrate the chance that Clark is innocent given the evidence as observed.
For two events A and B, the mistake of confusing P (A | B) as P (B | A) is known as the prosecutor’s
fallacy. In general, we note that P (A | B) is not equal to P (B | A). To see why this is so, we use the
definition of conditional probability as described in Definition 4.2.1.
P(A | B) = P(A ∩ B) / P(B) and P(B | A) = P(B ∩ A) / P(A).
It is now easy to see that for P (A | B) = P (B | A), we require either P (A) = P (B) or P (A ∩ B) = 0.
This is not always the case and therefore the two conditional probabilities are not necessarily equal.
The story of Sally Clark does not end here. Other than the confusion with conditional probabil-
ities, Sir Meadow’s calculation was also erroneous on another count, one that involves the concept of
independence which is what we will be discussing next.
Definition 4.2.6 When we say that two events A and B are independent, it means that
P (A) = P (A | B),
that is, the probability of A is the same as the probability of A given B. So, the fact that event B has
occurred does not affect the probability of A occurring. Now, if we express the conditional probability
P (A | B) as
P(A | B) = P(A ∩ B) / P(B),
then A and B being independent means that
P(A) = P(A ∩ B) / P(B), which implies P(A) × P(B) = P(A ∩ B).
We thus have an equivalent definition of what it means for two events to be independent.
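The product rule gives a direct way to check independence numerically. A small sketch, using the two-coin-toss sample space with uniform probability (the events A and B below are chosen for illustration):

from fractions import Fraction

# Uniform probability on the sample space for two tosses of a fair coin.
p = {outcome: Fraction(1, 4) for outcome in ["HH", "HT", "TH", "TT"]}

def prob(event):
    """P(event) as the sum of the probabilities of its outcomes."""
    return sum(p[o] for o in event)

A = {"HH", "HT"}  # A: first toss shows heads
B = {"HH", "TH"}  # B: second toss shows heads

# A and B are independent exactly when P(A ∩ B) = P(A) × P(B).
print(prob(A & B) == prob(A) * prob(B))  # True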
Definition 4.2.7 The notion of independence can be extended to events that are conditionally indepen-
dent. We say that two events A and B are conditionally independent given an event C with P (C) > 0
if
P (A ∩ B | C) = P (A | C) × P (B | C).
Discussion 4.2.8 Let us continue where we left off from Discussion 4.2.5 and discuss the second error
in Sir Meadow's calculation. Recall that the calculation was done in the following manner:
P(Evidence as observed | Clark is innocent) = P(First infant death ∩ Second infant death | Clark is innocent)
= P(First infant death | Clark is innocent) × P(Second infant death | Clark is innocent).
We have already noted that the first conditional probability was misinterpreted as P(Clark is innocent | Evidence as observed), resulting in the prosecutor's fallacy. Now let us focus on the second equality:
P(First infant death ∩ Second infant death | Clark is innocent) = P(First infant death | Clark is innocent) × P(Second infant death | Clark is innocent).
If we represent the events “First infant death”, “Second infant death”, and “Clark is innocent” as A,
B and C respectively, the equation above is now
P (A ∩ B | C) = P (A | C) × P (B | C).
This is precisely what we saw in the definition for conditional independence. So by computing the
probability this way, Sir Meadow had assumed that the event where the first child died of SIDS and the
event where the second child died of SIDS are independent of each other. Is this a reasonable or even
intended assumption?
While we cannot say for certain, there could very well be factors (like genetic or environmental) that
predispose families to SIDS. What this means is that a second case of SIDS within the family becomes
much more likely to happen than it would be in another apparently similar family.
To conclude the story of Sally Clark’s trial, in January 2003, the guilty verdict was overturned on a
second appeal and Sally was eventually released from prison having served more than three years of her
sentence.
Example 4.3.1 Suppose there are two bags, each containing 10 colored balls. Bag A contains 7 red
balls and 3 green balls while Bag B contains 2 red balls and 8 green balls. One bag is randomly selected
and a ball is then randomly selected from the chosen bag. What is the probability that the selected ball
chosen is green?
Let us consider the events E, F and G such that
E: Bag A is chosen;
F: Bag B is chosen;
G: the selected ball is green.
Note that E and F are mutually exclusive and either E or F must occur. The probability required, P(G), is the probability of (G and E) plus the probability of (G and F). That is,
P(G) = P(G ∩ E) + P(G ∩ F).
By the definition of conditional probability, P(G ∩ E) = P(G | E) × P(E) and P(G ∩ F) = P(G | F) × P(F), so
P(G) = P(G | E) × P(E) + P(G | F) × P(F).
In our example, this means that the probability that a green ball is selected is the sum of the probabilities of selecting a green ball through Bag A and through Bag B:
P(G) = (3/10) × (1/2) + (8/10) × (1/2) = 0.15 + 0.40 = 0.55.
Formally, the law of total probability states that if E, F and G are events from the same sample space S such that
(1) E and F are mutually exclusive, and
(2) E ∪ F = S,
then
P(G) = P(G | E) × P(E) + P(G | F) × P(F).
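We can verify the answer P(G) = 0.55 by simulating the two-stage experiment. A minimal sketch in Python:

import random

def draw_once():
    """Pick a bag at random, then pick a ball at random from that bag."""
    if random.choice(["A", "B"]) == "A":
        balls = ["red"] * 7 + ["green"] * 3   # Bag A: 7 red, 3 green
    else:
        balls = ["red"] * 2 + ["green"] * 8   # Bag B: 2 red, 8 green
    return random.choice(balls)

N = 1_000_000
greens = sum(draw_once() == "green" for _ in range(N))
print(greens / N)  # close to 0.5 * 0.3 + 0.5 * 0.8 = 0.55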
Recall that through the recounting of the trial of Sally Clark, we explained what is commonly known
as the prosecutor’s fallacy, where one confuses P(A | B) with P(B | A). In the remainder of this section,
we introduce two more fallacies which are common pitfalls and misunderstandings related to probabilities.
Example 4.3.2 Suppose Joseph was seen to be acting nervously and loitering outside a convenience
store. Furthermore, he had a knife in his pocket. Based on this observation, which would you say is
more probable about Joseph?
If you think that (2) is more probable than (1), you would have committed a conjunction fallacy.
Formally, for two events A and B, one would have committed the conjunction fallacy if one believes that
P(A ∩ B) > P(A) or P(A ∩ B) > P(B).
In words, the fallacy is when one believes that the chance of two things happening together is higher than the chance of one of those things happening alone. In fact, what is actually true is the following:
P(A ∩ B) ≤ P(A) and P(A ∩ B) ≤ P(B).
So going back to the two statements about Joseph, it cannot be the case that Statement (2) is more
probable than Statement (1).
Let us discuss another fallacy that is related to the discussion of rates from Chapter 2.
Discussion 4.3.3 Suppose there is a rise in the number of drink-driving cases in the city. The police
force, with the help of some researchers, has developed a new breathalyser to better detect drivers who
drive after consuming an excessive amount of alcohol (for the ease of discussion, these drivers are said
to be drunk).
Through product testing, it is known that:
(1) When a driver is sober, there is a 5% chance that the breathalyser will falsely detect that the driver
is drunk.
(2) On the other hand, when the driver is in fact drunk, the breathalyser will always detect that the
driver is drunk.
Here comes the important information. Suppose it is known that 1 in 1000 drivers will perform the
dangerous act of driving when drunk.
Situation: Suppose a driver is picked up randomly at a spot check and takes the breathalyser test.
If the breathalyser indicates that he is drunk, what is the probability that he is indeed drunk? In terms
of conditional probability, we are interested in computing P (Drunk | positive test). In reality, this is an
important consideration because a sober person who returns a positive breathalyser test result may end
up being falsely accused of breaking the law!
Now, if we simply consider facts (1) and (2) obtained from the product testing, we may be inclined
to think that P (Drunk | positive test) is quite high, since the breathalyser test never fails in detecting a
drunk driver. Unfortunately, if you are led to make such a conclusion based on (1) and (2) alone, you
would have committed a base rate fallacy.
Definition 4.3.4 The base rate fallacy is a decision making error in which information about the rate
of occurrence of some trait in a population, called the base rate information, is ignored or not given
appropriate weight.
Let us return to our breathalyser example and work out what is in fact the conditional probability
P (Drunk | positive test).
Example 4.3.5 The information where it is known that 1 in 1000 drivers will drive when they are drunk
is precisely that base rate information that cannot be ignored. Otherwise, we would have committed
the base rate fallacy. Similar to the method used in Chapter 2, we will construct a 2 × 2 contingency
table to help us with our calculations. The table below shows how it looks after the cells are populated with numbers. The sequence ((1), (2), (3), etc.) used to populate the cells is given below the table and indicated in each cell.
(1) We choose a large round number, say 100000, as the total number of drivers, so that the calculations result in whole numbers.
(2) Since 1 in 1000 drivers drives after drinking, the total number of drunk drivers is 100000 ÷ 1000 = 100.
(3) The remaining 100000 − 100 = 99900 drivers are sober.
(4) Since the breathalyser never fails to detect a drunk driver, all 100 of them will be tested positive and none will be tested negative.
(5) Since the breathalyser falsely detects 5% of sober drivers as drunk, 0.05 × 99900 = 4995 sober drivers will be tested positive.
(6) This implies that 99900 − 4995 = 94905 of the sober drivers will be tested negative.
(7) We can now write down the total number of drivers tested positive (100 + 4995 = 5095) and the total number of drivers tested negative (94905).
The conditional probability P(Drunk | positive test) can now be calculated easily as 100/5095 = 0.019627, which is approximately 2%. This is a very low probability and contrary to the earlier belief that a driver tested positive is very likely to be drunk. This illustrates the danger of committing the base rate fallacy, when the base rate of the occurrence of a trait in the population is not taken into consideration.
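The whole contingency-table calculation fits in a few lines of Python; the numbers below mirror the steps above:

# Breathalyser example: base rate 1 in 1000, false positive rate 5%,
# and the test never misses a drunk driver.
total_drivers = 100_000
drunk = total_drivers // 1000          # (2): 100 drunk drivers
sober = total_drivers - drunk          # (3): 99900 sober drivers

drunk_positive = drunk                 # (4): all drunk drivers test positive
sober_positive = int(0.05 * sober)     # (5): 4995 sober drivers test positive

total_positive = drunk_positive + sober_positive   # (7): 5095
print(drunk_positive / total_positive)             # 100/5095 ≈ 0.0196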
Designing kits or apparatus to do such tests usually involves balancing the risk of having someone
testing positive when he should not be, or testing negative when he is in fact supposed to be positive.
Such considerations are common, especially in medical diagnostics as the next example further illustrates.
Example 4.3.6 Most of us should be familiar with the term ART, or Antigen Rapid Test, by now.
This test is used to test for the presence of COVID-19 infection in humans. For most medical diagnostic
tests, there are four possible scenarios that can happen when the test is administered to an individual
to assess if the individual is infected. The possible scenarios are:
1. The individual is infected and tests positive (a true positive).
2. The individual is infected but tests negative (a false negative).
3. The individual is not infected but tests positive (a false positive).
4. The individual is not infected and tests negative (a true negative).
Scenario 1 is concerned with the conditional probability of an individual being tested positive, given that the individual is infected. This is known as the true positive rate. This probability,
P(Test positive | Individual is infected),
is known as the sensitivity of the test. For this example, let us assume that this probability is 0.8.
On the other hand, scenario 4 is concerned with the conditional probability of an individual being tested negative, given that the individual is not infected. This is known as the true negative rate. This probability,
P(Test negative | Individual is not infected),
is known as the specificity of the test. For this example, let us assume that this probability is 0.99.
In reality, these two conditional probabilities are not helpful to average users like ourselves because we
do not really know whether we are indeed infected or not. What we do know, with certainty is whether
the individual's test returns positive or negative. Therefore, instead of the conditional probability
P(Test positive | Individual is infected),
for which it is difficult to ascertain whether the “condition” is fulfilled, we look at the conditional probability
P(Individual is infected | Test positive).
It is important to gain insight into this conditional probability as it can cause an individual much distress
after being tested positive only to find out later that the person involved is actually not infected. To
determine this conditional probability, having only the sensitivity and specificity of the test is insufficient.
As discussed in the preceding example, we require one additional piece of information, which is the base
rate of infection in the population. This is the infection rate in the population and we can interpret this
as the probability of a person selected at random from the population is infected with COVID-19. For
this example, let us assume that 1% of the population is infected with COVID-19, so
P(Individual is infected) = 0.01.
We will again use a contingency table to study the rates. To start, we choose a large enough number to
represent the total population such that our calculations would result in whole numbers. Let us assume
that the population consists of 100000 individuals.
Using the information we have for the base rate of infection, we can now fill in the row total for those
that are infected (= 1% × 100000 = 1000) and those that are not infected (= 99% × 100000 = 99000).
Next, using the true positive rate (sensitivity) of 0.8, we see that 80% of those infected would be tested positive, that is,
Number tested positive and Infected = 0.8 × 1000 = 800,
so the remaining 200 infected individuals would be tested negative. Similarly, using the true negative rate (specificity) of 0.99,
Number tested negative and Not infected = 0.99 × 99000 = 98010 and
Number tested positive and Not infected = 0.01 × 99000 = 990.
The table can now be completed by summing up the column totals: 800 + 990 = 1790 tested positive and 200 + 98010 = 98210 tested negative.
By now, you should appreciate the choice of 100000 as the total population, as we did not have to deal
with the awkward situation of not having whole numbers when we are dealing with human individuals.
We are now able to calculate the rate of COVID-19 infection among those tested positive. Since there
are 1790 individuals tested positive, and 800 of them are infected, the rate is
rate(Infected | Tested positive) = 800/1790 = 0.447 (rounded to 3 significant figures).
Using the correspondence between conditional rates and conditional probabilities, we are now able to
say that if an individual is tested positive for COVID-19 infection using an ART, the probability of
him actually being infected is about 0.45. This conditional probability is rather low so typically, more
rigorous tests need to be conducted to ascertain if the individual is indeed infected with COVID-19.
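The same calculation can be phrased directly in terms of sensitivity, specificity and the base rate, using the law of total probability for the denominator. A sketch:

def p_infected_given_positive(sensitivity, specificity, base_rate):
    """P(Infected | Test positive), with
    P(positive) = P(positive | infected) P(infected)
                + P(positive | not infected) P(not infected)."""
    p_pos_and_infected = sensitivity * base_rate
    p_pos_and_not_infected = (1 - specificity) * (1 - base_rate)
    return p_pos_and_infected / (p_pos_and_infected + p_pos_and_not_infected)

# ART example: sensitivity 0.8, specificity 0.99, base rate 1%.
print(p_infected_given_positive(0.8, 0.99, 0.01))  # ≈ 0.447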
To conclude this section, we will give a very brief introduction to the concept of a random variable.
Definition 4.3.7 A random variable is a numerical variable with probabilities assigned to each of the
possible numerical values taken by the numerical variable.
Example 4.3.8 Consider the probability experiment of rolling a six-sided die. The faces of the die are
denoted 1, 2, 3, . . . , 6. In this experiment, the possible outcomes are 1, 2, 3, . . . , 6. If we let Y be the
numerical variable that represents the outcome of this experiment with the assigned probabilities
P(Y = 1) = 1/3, P(Y = 2) = 1/3, P(Y = 3) = 1/12,
P(Y = 4) = 1/12, P(Y = 5) = 1/12, P(Y = 6) = 1/12,
then Y is a random variable.
Definition 4.3.9 If the numerical variable is a discrete variable, we call the random variable a discrete
random variable. On the other hand, if the numerical variable is a continuous variable, then the random
variable is a continuous random variable.
The random variable Y in Example 4.3.8 is a discrete random variable. It is common to use a table
similar to the one below to illustrate a discrete random variable and the associated probabilities for each
outcome.
Outcome      1    2    3     4     5     6
Probability  1/3  1/3  1/12  1/12  1/12  1/12
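We can simulate draws from Y and check that the long-run proportions match the assigned probabilities. A minimal sketch:

import random

# The discrete random variable Y from Example 4.3.8.
values = [1, 2, 3, 4, 5, 6]
probs = [1/3, 1/3, 1/12, 1/12, 1/12, 1/12]

# Draw a large sample from Y and compare observed proportions with the probabilities.
sample = random.choices(values, weights=probs, k=100_000)
for v, p in zip(values, probs):
    print(v, round(sample.count(v) / len(sample), 3), "vs", round(p, 3))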
Unlike discrete random variables, which may take on only a countable number of distinct values, a continuous random variable takes values in a continuum of infinitely many possible values. Height, weight and the time required to run one kilometer are some examples of continuous random variables. Unlike discrete random variables, a continuous random variable does not have probabilities assigned to individual values.
Example 4.3.10 While a discrete random variable X takes on a countable number of distinct values, a
continuous random variable Y can be visualised by a “continuous series of points” which forms a density
curve on the standard x and y-axes.
In particular, the probability that a continuous random variable takes a value in an interval is represented by the corresponding area under the density curve.
For the density curve shown above, the probability that Y assumes a value between 0.3 and 0.5 is the area under the density curve of Y in the interval [0.3, 0.5], as indicated by the shaded region. In this example, this area turns out to be 0.311 and we write
P(0.3 ≤ Y ≤ 0.5) = 0.311.
In general, the probability that a continuous random variable takes on a value in an interval [a, b] is equal
to the area under its density curve from a to b.
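Such areas are usually obtained from software. The sketch below uses a Beta(2, 2) density purely for illustration (it is not the density curve pictured above, so the area differs from 0.311):

from scipy.stats import beta

# P(0.3 <= Y <= 0.5) for a hypothetical Beta(2, 2) random variable Y,
# computed as a difference of cumulative distribution function values.
area = beta.cdf(0.5, 2, 2) - beta.cdf(0.3, 2, 2)
print(area)  # ≈ 0.284 for this particular density curve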
Statistical Inference and Confidence Intervals
Suppose a finding has been obtained from a sample. How do we know whether this finding is also true of the population where the sample was taken from? In order to answer this question, we need to first know the survey methodology used to generate the sample, and then secondly the statistical methods used to infer this finding about the target population. This second phase is statistical inference.
Definition 4.4.1 Statistical inference refers to the use of samples to draw inferences or conclusions
about the population in question.
The figure above shows how statistical inference fits into the exploratory data analysis (EDA) frame-
work. When given a sample, we learnt how to generate questions, visualise and summarise data and refine
our questions before starting another round of further data exploration and visualisation. However, these
findings are all at the sample level and the natural question would be whether similar conclusions can
be made at the population level. Examples of population level information could be about a population
parameter or whether two categorical variables are associated with each other in the population.
Recall the notion of a census, previously defined in Chapter 1. To obtain definitive conclusions about
the population, one would have to take a census, which may not be possible or desirable. Some possible
reasons why taking a sample is preferred over conducting a census are:
(i) Cost. A census requires measurement of every unit in the population which, even when possible,
can be very costly. For example, in a study on the mental health of Singaporeans, a census would amount to measuring the state of mental wellness of every Singaporean. This requires resources which may be beyond the reach of the researcher conducting the study.
(ii) Feasibility. Imagine instead of taking a sample of your blood for a blood test, the doctor tells
you that in order to find out if you are free from a particular illness, he would need to take ALL of
your blood? While this example is a bit far-fetched, many other instances exist where it is simply
not feasible to conduct a census.
Recall that the population parameter is a numerical fact about the population, and something that
is of interest to a researcher conducting a study. When we take a sample from the population, the use of
a sample statistic to estimate the population parameter is subjected to inaccuracies. These inaccuracies
primarily come under two categories, namely bias and random error. So we typically have
sample statistic = population parameter + bias + random error.
Ideally, we want the sample statistic to be as close to the population parameter as possible so that it is
a good estimate of the population parameter. In order to use our sample to make inference about the
population, the fundamental rule for using data for inference should be met.
By adopting good sampling methods (e.g. using simple random sampling) and practices (e.g. having
a good sampling frame), selection bias can be reduced. In addition, having a high response rate will
minimise non-response bias. If bias can be reduced to an insignificant level, this would allow us to say
sample statistic = population parameter + random error.
What stands between the sample statistic and the population parameter is random error. This quantity
refers to the small differences that arise as a result of the sampling variability when using any probability-
based sampling method.
In what follows, we will discuss two types of statistical inference, namely confidence intervals and
hypothesis testing.
Example 4.4.2 Consider the following screenshots that show 10 simple random samples, each of size 2500, drawn from a data set containing information on the distances covered by various airplane flights.
Notice that the sample means (average distances covered by the 2500 flights) of the 10 samples are all
different. What can we infer the population mean to be? We require the concept of confidence intervals.
Definition 4.4.3 A confidence interval is a range of values that is likely to contain a population pa-
rameter based on a certain degree of confidence. This degree of confidence is called the confidence level
and is usually expressed as a percentage (%).
Some examples of population parameters that we would like to construct confidence intervals for are
proportion, mean and also standard deviation. For this course, we will focus on the construction of
confidence intervals for population proportion and mean. We will use two examples to explain the idea
behind the construction of the confidence intervals.
Example 4.4.4 The figure below shows part of the data set “2020 Resale Price Data”. This data set
provides information on the resale transactions of HDB resale flats in the year 2020. There are a total
of 23334 transactions (the population) in 2020 and there are 14 variables in this data set.
To illustrate the construction of the confidence interval for population proportion, we will consider
the variable “flat type”. This variable indicates whether the resale flat is of the type 1-room, 2-rooms,
3-rooms, 4-rooms, 5-rooms, executive or multi-generational. It is clear that “flat type” is a categorical
variable with 7 categories. Suppose we ask the following question on the population parameter:
Among the HDB resale transactions in 2020, what proportion (denoted by p) of them
is for 5-room flats?
Now, let’s say that a simple random sample of 2000 resale transactions is taken and the breakdown
of the 2000 transactions according to flat type is shown in the table below.
Flat type
1-rm 2-rm 3-rm 4-rm 5-rm Executive Multi-Gen.
Frequency 2 41 464 819 508 165 1
Proportion 0.001 0.0205 0.232 0.4095 0.254 0.0825 0.0005
Notice that for this sample, the proportion of resale transactions that are 5-room flats is 508/2000 = 0.254.
The population proportion p is unknown to us and can only be found if we take a census of all the 23334
transactions. What we are interested to know is how good an estimate is our sample proportion of 0.254.
If we assume that there is no bias in our sample, then
sample proportion = population proportion + random error, that is, 0.254 = p + random error.
It should be noted at this point that random error can be positive or negative. If the random error is
positive, then the sample proportion of 0.254 is larger than the population proportion. On the other hand,
if the random error is negative, then the sample proportion is smaller than the population proportion.
To construct a confidence interval for the population proportion, we use the following formula
p* ± z* × √(p*(1 − p*)/n),
where
p* = sample proportion,
z* = “z-value” from the standard normal distribution,
n = sample size.
The exact value of z ∗ depends on the confidence level of the confidence interval we are constructing. For
a 90% confidence interval, the value of z ∗ is 1.645 while for a 95% confidence interval, the value of z ∗ is
1.96. Thus, for this example, the 95% confidence interval for the population proportion is
0.254 ± 1.96 × √(0.254 × (1 − 0.254)/2000) = 0.254 ± 0.0191.
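The formula is straightforward to evaluate with software. A minimal Python sketch that reproduces the interval above:

from math import sqrt
from scipy.stats import norm

p_hat = 0.254   # sample proportion
n = 2000        # sample size
conf = 0.95

z_star = norm.ppf(1 - (1 - conf) / 2)          # ≈ 1.96 for a 95% level
margin = z_star * sqrt(p_hat * (1 - p_hat) / n)

print(f"{p_hat} ± {margin:.4f}")               # 0.254 ± 0.0191
print((p_hat - margin, p_hat + margin))        # ≈ (0.2349, 0.2731)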
Remark 4.4.5
While the computation of the confidence interval is simple, for this course, we will use software to
help us perform these computations.
In particular, the value of z ∗ that is dependent on the chosen confidence level can be found from
statistical tables similar to the one shown below. However, when we use software for the compu-
tation of confidence intervals, these values will be appropriately chosen by the software when we
specify the confidence level we wish to use.
Discussion 4.4.6 Now that we have seen how the computation of a confidence interval for population
proportion is done, what is also important is the interpretation of the interval. What does it mean to
say that the 95% confidence interval for the population proportion of 5-room resale flat transactions in
2020 is 0.254 ± 0.0191?
A confidence interval is reported in 2 parts, namely:
The confidence level (for example, 95% in the example above); and
The interval itself (0.254 ± 0.0191 in the example above).
The value 0.0191 is known as the margin of error, which directly determines the width (how wide/narrow) of the confidence interval.
We are 95% confident that the population proportion (the parameter in this case) of
resale flat transactions in 2020 that are 5-room, lies within the confidence interval.
It is natural to ponder what we mean by “95% confident”. In the context of confidence intervals, this
has a specific meaning which can be explained by repeated sampling. Recall that the sample statistic
of 0.254 was computed from a single sample (collected via Simple Random Sampling) of 2000 resale
transactions. It was from this sample statistic that the confidence interval was constructed.
The idea of repeated sampling is based on the supposition that many simple random samples of the same size are taken, and that from the different sample statistics obtained, different confidence intervals are constructed using the same method as above.
Using the idea of repeated sampling, the interpretation of “95% confident” is that if many simple
random samples of the same size are taken, and a confidence interval is constructed for each of them,
then about 95% of the confidence intervals constructed would contain the population parameter. Thus,
if we collected 100 simple random samples and their 95% confidence intervals were computed in the same
manner, then about 95 out of the 100 confidence intervals will contain the population parameter. So
in the figure above, assuming that the purple line is the actual population proportion of 5-room resale
flats, the confidence intervals constructed for samples 1 and 2 would actually contain the population
parameter while the confidence interval derived from sample 100 would not.
It is important to remember that in actual fact, we do not know what is the exact value of the
population parameter. Confidence intervals certainly give us a better idea of where this parameter lies
but they can never tell us its exact value.
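The repeated-sampling interpretation can be demonstrated by simulation: pretend the population proportion is known, draw many samples, and count how many of the resulting 95% confidence intervals contain it. A sketch, assuming for illustration that the true proportion is 0.254:

import random
from math import sqrt

p_true = 0.254              # pretend-known population proportion (illustrative)
n, z_star = 2000, 1.96
num_samples = 1000

covered = 0
for _ in range(num_samples):
    # One simple random sample, modelled as n independent Bernoulli(p_true) draws.
    p_hat = sum(random.random() < p_true for _ in range(n)) / n
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    if p_hat - margin <= p_true <= p_hat + margin:
        covered += 1

print(covered / num_samples)  # typically close to 0.95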
Remark 4.4.7 Going back to Example 4.4.4, based on the sample of 2000 resale transactions, it is a common mistake to say that there is a 95% chance that the population proportion of 5-room resale flats lies between 0.235 and 0.273. It is actually incorrect to make a statement like this because
The population proportion p is “fixed”, although unknown to us. There is no probabilistic element
in what this proportion is going to be.
For any particular sample, the confidence interval constructed only depends on the sample propor-
tion and the value of z ∗ corresponding to a chosen confidence level. Thus, the confidence interval
is also “fixed” and there is also no probabilistic element in it.
Thus either the population parameter IS in the interval or it IS NOT. It is wrong to say there is a 95%
chance that it is in the interval (and 5% chance that it is not)! The element of chance (or probability)
comes from the uncertainty of sampling rather than the uncertainty in the value of the population
parameter. Therefore, we should always remember the interpretation as the percentage of samples of
the same size collected repeatedly, using the same method of simple random sampling, that give rise to
confidence intervals containing the unknown population parameter.
1. Recall that in Example 4.4.4 we computed the confidence interval using the sample estimate of
0.254, confidence level of 95% and sample size n = 2000:
0.254 ± 1.96 × √(0.254 × (1 − 0.254)/2000) = 0.254 ± 0.0191.
What happens when another sample is taken, using the same sampling frame, same sampling
method (simple random sampling) but a smaller sample size of 1000? If this new sample also
resulted in the sample estimate of 0.254, the 95% confidence interval would be
0.254 ± 1.96 × √(0.254 × (1 − 0.254)/1000) = 0.254 ± 0.0270.
This confidence interval is wider than the previous one because the sample size is smaller.
Similarly, if yet another sample is taken, under exactly the same conditions with the only difference
being that the sample size is 5000, then if the sample estimate is again 0.254, the confidence interval
will now be
0.254 ± 1.96 × √(0.254 × (1 − 0.254)/5000) = 0.254 ± 0.0121.
What we are seeing here is that the larger the sample size, the smaller the random error, which
will then result in a narrower confidence interval. This is not surprising as we have seen in Chapter
1 that increasing sample size can result in reducing random error.
2. Other than the size of the sample, the other factor that affects the width of the confidence interval
is the confidence level. Recall that when we set the confidence level to be 95%, the confidence
interval obtained, based on n = 2000 and the sample proportion of 0.254 was 0.254 ± 0.0191. What
happens if we set the confidence level to be 90%? In this case, the value of z ∗ is 1.645 and the 90%
confidence interval for the population proportion is
0.254 ± 1.645 × √(0.254 × (1 − 0.254)/2000) = 0.254 ± 0.0160.
So the interval is 0.254 ± 0.0160. This interval is narrower than the interval obtained when the confidence level was 95%. Using the idea of repeated sampling, this makes sense since a narrower interval implies that a smaller percentage (90% rather than 95%) of the intervals constructed from repeated samples would contain the population parameter. Generally speaking, the higher the confidence level at
which the confidence interval is constructed, the wider the confidence interval.
Example 4.4.9 Let us consider another example on the construction of a confidence interval, where
we would like to estimate the population mean based on a sample mean. Using the same data set
previously, containing all the resale transactions of HDB flats in 2020, we would like to investigate the
mean resale price of all the transactions by constructing confidence intervals.
We will describe how a confidence interval for population mean resale price is constructed. The
properties and interpretations of the confidence interval for a population mean are similar to those for a
population proportion that we have discussed in Example 4.4.4. Again, we will not be computing these
confidence intervals by hand but instead use software to help us perform these computations.
Suppose we have a sample, obtained via simple random sampling, with sample size 2000. The sample
mean resale price is found to be x̄ = $448727. Let µ be the population mean resale price, a population
parameter of interest that is unknown to us unless we take a census of the population. A 95% confidence
interval for the population mean µ is constructed using the formula
x̄ ± t* × s/√n,
where
x̄ = sample mean,
t* = “t-value” from the t-distribution,
s = sample standard deviation,
n = sample size.
The exact value of t∗ depends on the sample size n and the confidence level of the confidence interval
we are constructing. Without going into the computation details, we will simply state that the 95%
confidence interval for the population mean is found to be 448727 ± 6706.01.
The margin of error, $6706.01, is a way of quantifying the random error, and as discussed previously,
this error can be reduced by increasing the sample size n (everything else being equal). The width of
the confidence interval can also be narrowed if we reduce the confidence level to one that is lower than
95%.
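A sketch of the computation in Python. The notes do not report the sample standard deviation, so the value of s below is hypothetical, back-solved so that the margin of error comes out near the $6706.01 quoted above:

from math import sqrt
from scipy.stats import t

x_bar = 448727.0   # sample mean resale price
s = 152921.0       # hypothetical sample standard deviation (not from the notes)
n = 2000

t_star = t.ppf(0.975, df=n - 1)    # two-sided 95% level, n - 1 degrees of freedom
margin = t_star * s / sqrt(n)
print(f"{x_bar:.0f} ± {margin:.2f}")   # ≈ 448727 ± 6706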
To summarise this section on confidence intervals, recall the following:
1. The use of confidence intervals is a way for us to quantify random error that is present in every
sample, even in those obtained via simple random sampling where the level of bias can be reduced
or assumed negligible.
2. Confidence intervals and the confidence level used to compute the intervals can be understood via
the idea of repeated sampling. We should avoid using the word “chance” or “probability” when
we discuss whether the population parameter lies inside the confidence interval constructed from
a single sample.
3. We have discussed some properties of confidence intervals, in particular how the interval varies
according to the sample size and the confidence level applied.
4. We saw how confidence intervals are constructed for two population parameters, namely the popu-
lation proportion and the population mean. It is useful to understand how the construction is done
although for the purpose of this module, we will rely on software to assist in the computation.
Discussion 4.5.1 The second tool for statistical inference is hypothesis testing. Recall that when a
sample is taken from a population, we can try to use a sample statistic to make inferences about a population
parameter. If biases can be reduced to something negligible, what separates the sample statistic from the
population parameter is the quantity of random error, as seen previously.
In this section, we will assume that our sample is taken from the population using simple random
sampling, from a perfect sampling frame and with 100% response rate.
Definition 4.5.2 A hypothesis test is a statistical inference method used to decide if the data from a
random sample is sufficient to support a particular hypothesis about a population.
Discussion 4.5.3 A typical hypothesis about the population could be anything we want to know about
the population. For this course we will focus on two types of hypotheses about the population, in
particular, whether
(i) in the population, a parameter of interest (such as the population proportion or the population mean) is equal to some stated value;
(ii) in the population, 2 categorical variables A and B are associated with each other.
For example, a survey obtained from a simple random sample of Singaporeans (assuming a good
sampling frame and 100% response rate) found that 2 in 5 Singaporeans struggle with mental health
issues. A hypothesis can be made that in the population of Singaporeans, 3 in 5 (or a proportion of 0.6)
struggle with mental health issues. Using our sample, where the sample proportion is 0.4, we can test
whether there is sufficient evidence to reject the null hypothesis that the population proportion is 0.6.
Confidence intervals discussed in the previous section and hypothesis testing are two related methods.
Using the same example of population proportion above:
(a) we can use our sample proportion of 0.4 to construct a confidence interval and say, with some
degree of confidence that our interval contains the hypothesised population proportion.
(b) we can also answer the question of whether it is likely to observe a sample proportion of 0.4 if the
population proportion is hypothesised to be 0.6.
Hypothesis testing asks if our observed sample proportion’s deviation from the hypothesised popula-
tion proportion can be explained by chance variation. Of course, the bigger the difference between the
sample proportion and the hypothesised population proportion, the less likely this difference is due to
random chance.
(Five steps of hypothesis testing)
Step 1: The first step of hypothesis testing is to identify the question and state the null hypothesis and
alternative hypothesis. How these hypotheses are stated depends on the context of the question and our
aim.
Step 2: Next, we have to set the significance level of our test. The significance level of a hypothesis
test is a measurement of our threshold or tolerance for determining if the deviation of what is observed
for the sample, from what is hypothesised for the population, can be explained by chance variation. The
significance level is often set at 5%, although others like 1% or 10% are also used frequently.
Step 3: Using our sample, we find the relevant sample statistic.
Step 4: With the sample statistic and the hypothesis, we can calculate the p-value (see Definition 4.5.4).
Step 5: We then make a conclusion of the hypothesis test. What the conclusion turns out to be depends
on the p-value calculated and the significance level set for the test.
Definition 4.5.4 The p-value is the probability of obtaining a result as extreme or more extreme than
our observation in the direction of the alternative hypothesis, assuming the null hypothesis is true.
Using examples, we will now describe three types of hypothesis tests commonly used. These examples
are based on the data set StudentsPerformance.csv (the population, with 1000 observations) and a sample
SP Sample A.csv, of size 200, taken from it. You may access these csv files by referring to the Technical
videos provided for this Chapter. The data set comes from a high school with 1000 students. Some
information about each student is provided in the form of categorical variables like gender and ethnicity.
The scores obtained by each student for three tests are also provided.
Example 4.5.5 (Hypothesis test for population proportion) The figure below shows a snapshot
of the sample. For this particular example, we are looking at the categorical variable “test prep”. The
principal of the school believes that his students are generally hardworking and half of them would have
completed the test preparation course before taking the three tests. On the other hand, the teachers of
the school, who claim they know the students better, feel that the students are lazy and less than half of
them would have completed the course before sitting for the three tests. We will conduct a hypothesis
test for population proportion at 5% significance level to check if there is sufficient evidence to reject the
principal’s hypothesis that half of the student population are hardworking.
To proceed, we need to state our null and alternative hypotheses. Recall that the null hypothesis
corresponds to the case where our observation can be explained by chance variation, that is, in this
example, that the principal's belief is correct. On the other hand, the alternative hypothesis corresponds
to the case where our observation is not due to random chance.
Null hypothesis: The proportion of students in the population who completed the test preparation course is 0.5 (p = 0.5).
Alternative hypothesis: The proportion of students in the population who completed the test preparation course is less than 0.5 (p < 0.5).
Note: For this example and Example 4.5.7, when the hypothesis test is conducted for the population
proportion or population mean, the null hypothesis typically takes the form “population parameter = hypothesised value” (here, p = 0.5).
It should be noted that the null and alternative hypotheses should be mutually exclusive, meaning that
they cannot be true simultaneously.
Let us return to our example, where we would like to test the principal's hypothesis that the population
proportion p = 0.5. The next step requires us to choose the significance level; for this example, we
select 5%. Step three is to determine the sample statistic, which in this case is simply the sample
proportion derived from the sample SP Sample A.csv. A quick check of the .csv file using appropriate
software reveals that the sample proportion, denoted by p∗, equals 0.335.
There are two trains of thought right now.
(T1) The principal's hypothesis is correct. The low sample proportion p∗ = 0.335 is due to chance
variation: when the sample was selected, fewer students who completed the test preparation course
happened to be drawn into the sample.
(T2) The population proportion p is really smaller than 0.5 and thus it is natural for the sample
proportion p∗ to be less than 0.5 (p∗ = 0.335).
Without going into the details of how the p-value is computed, we use software to calculate the
p-value in this case, which turns out to be smaller than 0.001. The interpretation of this is that the
probability of obtaining a sample proportion of p∗ = 0.335 or lower, assuming that the null hypothesis
is true (that is, p = 0.5), is smaller than 0.001, which is very small indeed.
To make our conclusion, we compare the p-value (which is smaller than 0.001) with the significance
level that we have set, which is 5% (or 0.05). Since the p-value is smaller than 0.05, our conclusion is
to reject the null hypothesis in favour of the alternative hypothesis. That is, we reject the explanation
given in (T1) in favour of the one given in (T2).
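The notes deliberately skip the computation, but as a sketch, the p-value can be obtained with an exact binomial test (one common software choice; the module's software may use a different method). With n = 200 and p∗ = 0.335, the number of students who completed the course is 0.335 × 200 = 67:

```python
# Sketch: exact binomial test of H0: p = 0.5 against H1: p < 0.5.
from scipy.stats import binomtest

result = binomtest(k=67, n=200, p=0.5, alternative="less")
print(result.pvalue)  # well below 0.001, so we reject H0 at the 5% level
```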
Remark 4.5.6 In Example 4.5.5, since the p-value computed was smaller than the significance level,
we rejected the null hypothesis in favour of the alternative hypothesis. In general, when the computed
p-value is compared to the significance level, there are two possibilities: if the p-value is smaller than the
significance level, we reject the null hypothesis in favour of the alternative; otherwise, we conclude that
there is insufficient evidence to reject the null hypothesis.
Example 4.5.7 (Hypothesis test for population mean) We next discuss the second type of hy-
pothesis test, where we hypothesise the population mean instead of proportion. We will use the t-test
to perform the hypothesis test for population mean.
Returning to the high school example, the principal believes that his students are poor readers and
that the average reading score of all the 1000 students in the school is 69. The teacher who is in charge
of teaching reading skills to all the 1000 students thinks differently and believes that the average reading
score of all the students in the school is greater than 69. Again, we will conduct the test at 5% significance
level.
The null and alternative hypotheses in this case are as follows:
Null hypothesis: The mean reading score of all the students in the school is 69 (µ = 69).
Alternative hypothesis: The mean reading score of all the students in the school is greater than 69 (µ > 69).
The sample statistic in this case is simply the sample mean reading score derived from the sample
SP Sample A.csv. A quick check of the .csv file using appropriate software reveals that the sample
mean reading score x̄ is equal to 70.345. What are the two trains of thought now?
(T1) The principal's hypothesis is correct. The high sample mean x̄ = 70.345 is due to chance
variation: when the sample was selected, more students who scored higher marks for reading
happened to be drawn into the sample.
(T2) The population mean µ is really larger than 69 and thus it is natural for the sample mean
x̄ (= 70.345) to be larger than 69.
Again, without going into how the p-value is computed, we use software to obtain the p-value in this
case, which turns out to be 0.093. Note that in this case, the p-value is greater than the significance level
of 0.05. Thus, the conclusion of our test is that there is insufficient evidence to reject the null hypothesis
that the population mean reading score is 69.
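A sketch of how this t-test could be run in Python is given below; the column name "reading score" is an assumption about the .csv file's layout:

```python
# Sketch: one-sided one-sample t-test of H0: mu = 69 vs H1: mu > 69.
import pandas as pd
from scipy import stats

sample = pd.read_csv("SP Sample A.csv")  # sample of 200 students
t_stat, p_value = stats.ttest_1samp(
    sample["reading score"],             # column name assumed
    popmean=69,
    alternative="greater",
)
print(p_value)  # the notes report about 0.093, so we do not reject H0
```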
Example 4.5.8 (Hypothesis test for association) In this third example, we discuss a hypothesis
test for association between categorical variables in the population. This is done using a chi-squared
test.
The two categorical variables we are interested in investigating are gender and test preparation. In other
words, we would like to test whether gender (male/female) is associated with the test preparation course
(completed/not completed) at the population level. We will also conduct the test at 5% significance
level. The null and alternative hypotheses are stated as follows:
Null hypothesis:
There is no association between gender and test preparation course at the population level.
Alternative hypothesis:
There is an association between gender and test preparation course at the population level.
A simple check on the sample of 200 students reveals the number of male/female students who
completed (or did not complete) the preparation course, as shown in the table below.
On the other hand, if we assume that H0 is true, then we would expect the following:
(i) Based on 200 students (96 females and 104 males), where 67 completed the course and 133 did not
complete the course, the number of female students who completed the course should be
\[
\frac{96}{200} \times 67 \approx 32.16.
\]
(ii) This implies that 67 − 32.16 = 34.84 male students completed the course.
The p-value for the chi-squared test then gives the probability of obtaining our observation, or one
even more extreme (a bigger difference between expected and observed counts), assuming that the null
hypothesis is true. Using software, we obtain a p-value of 0.517. Since the p-value is not smaller than
the significance level of 5%, we conclude that there is insufficient evidence to reject the null hypothesis.
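As a sketch, the chi-squared test can be carried out in Python as follows; the column names "gender" and "test prep" are assumptions about the file's layout:

```python
# Sketch: chi-squared test of association between two categorical variables.
import pandas as pd
from scipy.stats import chi2_contingency

sample = pd.read_csv("SP Sample A.csv")
observed = pd.crosstab(sample["gender"], sample["test prep"])  # 2x2 counts
chi2, p_value, dof, expected = chi2_contingency(observed)
print(p_value)   # the notes report about 0.517: do not reject H0
print(expected)  # expected counts under H0, e.g. ~32.16 females completed
```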
We have only given very brief descriptions of the three types of hypothesis tests. Further explanations
on them, as well as the technicalities involved can be found in standard books on statistics, but are beyond
the scope of this course.
Exercise 4
1. Let the random variable X denote the number of heads obtained in 3 independent tosses of a fair
coin. Which of the following tables correctly displays the probabilities over the possible values of
X?
(A) X: 0, 1, 2, 3; Probability: 0.125, 0.375, 0.375, 0.125
(B) X: 0, 1, 2, 3; Probability: 0.125, 0.5, 0.25, 0.125
(C) X: 0, 1, 2, 3; Probability: 0.333, 0.167, 0.167, 0.333
(D) X: 0, 1, 2, 3; Probability: 0.25, 0.25, 0.25, 0.25
2. The sample space S of a probability experiment comprises the 30 days of September 2021. Let
A denote the first 10 days of September 2021. Let B denote 1st September 2021. Which of the
following statements is/are correct? Select all that apply.
3. A football match is played between Singapore and Italy, at the Singapore National Stadium. 90%
of the spectators in the stadium are wearing red. Among the spectators in the stadium wearing
red, 80% are rooting for Singapore. Among the spectators in the stadium not wearing red, 80%
are rooting for Singapore.
Among all the spectators in the stadium, a spectator is randomly chosen with every spectator
having the same chance of being chosen.
Let A be the event that the chosen spectator is rooting for Singapore, and B be the event that the
chosen spectator is wearing red. Which of the following statements, regarding the 2 events A and
B is true?
5. Benny is a messy student who keeps all his coloured socks in a box. The box contains a total of 4
blue and 2 yellow socks. While running late for class, he randomly selects (without replacement)
two socks out of the box to wear before leaving the house. Assume the socks are indistinguishable
from one another in all respects other than their colour.
What is the probability that Benny will end up wearing a pair of matching coloured socks when
leaving the house?
(A) 1/3.
(B) 2/5.
(C) 7/18.
(D) 7/15.
6. Every resident in a town undergoes a new COVID test. The sensitivity and specificity of the test
are 90% and 99% respectively. Among those who tested positive, 60% are actually infected with
the virus. Which of the following is closest to the base rate for COVID infection in this town?
(A) 60%.
(B) 16%.
(C) 1.6%.
(D) 0.16%.
7. A bag contains four balls numbered 1, 2, 3 and 4. In a game, a ball is drawn once at random from
the bag to have its number read. Next, a fair coin is tossed that number of times independently.
The discrete random variable X is the number of heads observed from the coin tosses.
Fill in the blank in the following statement:
The probability of X being 0 is __________ (give your answer correct to 2 decimal places).
8. A bag consists of 10 balls. 4 of the balls are yellow while the remaining are green. 2 balls are drawn
at random from the bag, one at a time with every ball having the same chance of being chosen on
each draw. Let A be the event that the first ball drawn is yellow. Let B be the event that the
second ball drawn is green. Which of the following statements is true?
(A) If the balls are drawn without replacement, events A and B are independent and mutually
exclusive.
(B) If the balls are drawn without replacement, events A and B are independent but not
mutually exclusive.
(C) If the balls are drawn with replacement, events A and B are independent and mutually
exclusive.
(D) If the balls are drawn with replacement, events A and B are independent but not mutually
exclusive.
9. In an undergraduate statistics class, 70% of the students are males and 30% are females. After all
of them have taken a test, 40% of the males failed and 70% of the females failed. If I randomly
pick a student from the class, what is the probability that the student passed the test?
(A) 0.51.
(B) 0.49.
(C) 0.61.
(D) 0.55.
10. Let E and F be events of a sample space S, such that P (E) and P (F ) are both non-zero. Which
of the following statements is/are true? Select all that apply.
11. Consider the random process of throwing a fair dice once, with the following sample space S =
{1, 2, 3, 4, 5, 6} of possible outcomes. Which of the following subsets of S correspond to events but
do not correspond to outcomes of this random process? Select all that apply.
12. A player is presented with a box containing three balls, each with a whole number written on it
that the player cannot see.
It is known that there is only one mode of the numbers in the box, and the mean of the
numbers is less than the mode.
The player then chooses a measure of central tendency (mean, median, or mode).
The player picks a ball at random from the box. If the number on the ball is less than the
chosen measure of central tendency applied to all three numbers, the player wins.
Which of the following statements are true? Select all that apply.
(A) P(player wins | player chooses median) ≤ P(player wins | player chooses mean).
(B) P(player wins | player chooses median) = 0.5.
(C) The probability of the player winning is the same regardless of the choice of measure of central
tendency.
(D) The player always loses if he or she chooses mean.
13. Based on a random sample of 200 staff members in NUS, the 95% confidence interval for the
proportion of all NUS staff who went on vacation for at least 5 days in 2018 is (0.33, 0.59). Which
of the following statements must be true?
(I) If another sample of size 500 is drawn using the same sampling method, for the same confidence
level, the confidence interval will be wider than (0.33, 0.59).
(II) A maximum of 59% of all NUS staff went on vacation for at least 5 days in 2018.
14. Two random samples, sample A and sample B, of NUS students were taken. Sample A consists
of 100 NUS students, while sample B consists of 250 NUS students. For both samples A and B,
the sample proportion of NUS students who visited the campus at least once a week from January
2021 to March 2021 is 0.8.
For sample A, the 95% confidence interval for the proportion of NUS students who visited the
campus at least once a week from January 2021 to March 2021 is [0.75,0.85].
Which of the following statements must be true? Select all that apply.
(A) A maximum of 85% of all NUS students visit campus at least once a week from January 2021
to March 2021.
(B) If a 90% confidence interval for the proportion of NUS students who visited the campus at
least once a week from January 2021 to March 2021 is constructed using sample A, the width
of the confidence interval will be narrower than [0.75, 0.85].
(C) If a 95% confidence interval for the proportion of NUS students who visited the campus at
least once a week from January 2021 to March 2021 is constructed using sample B, the width
of the confidence interval will be narrower than [0.75, 0.85].
(D) The population proportion of NUS students who visited the campus at least once a week from
January 2021 to March 2021 must be either in the 95% confidence interval of sample A or the
95% confidence interval of sample B.
15. A random sample of size 500 is taken from a population of 10000 people of age 50. From the
sample, a 95% confidence interval for the population mean weight is constructed. Which of the
following statements is/are correct?
(I) The confidence interval will always contain the sample mean weight.
(II) If many samples of the same size are collected using the same sampling method, about 5% of
the confidence intervals from these samples will not contain the population mean weight.
16. A 99% confidence interval for the mean height (in meters) of NUS students is [1.58, 1.80]. It is
constructed using a random sample of 100 students. Using the same sample, which of the following
is a plausible 95% confidence interval for the mean height?
17. The newspaper The Daily Bugle conducted a poll on a random sample of 100 readers. All readers
responded, and results indicated that 58% of readers like Reed Richards, with 95% confidence
interval [48%, 68%]. Which of the following statements is/are true? Select all that apply.
(D) There is a 5% chance that the population parameter lies outside the interval [48%, 68%].
18. Which of the following statements about confidence intervals is/are true? Select all that apply.
(A) A confidence interval is an interval of values computed from sample data that is likely to
include the true population value.
(B) A formula for a 95% confidence interval is sample estimate ± margin of error.
(C) A confidence interval between 20% and 40% means that the population proportion lies between
20% and 40%.
(D) A 99% confidence interval is wider than the 95% confidence interval, when both are constructed
using the same sample.
19. A researcher working for the government wishes to estimate the average daily expense for food, for
singles staying alone in HDB flats in a particular neighbourhood. He manages to obtain a list of
all singles living in the neighbourhood and uses simple random sampling to pick 100 singles. The
sample estimate obtained is $30.20 and the standard deviation is $5.62. Calculate the margin of
error, correct to 2 decimal places, for a 95% confidence interval of the population parameter. The
corresponding t∗ value for the 95% confidence interval is 1.984.
20. Joseph wishes to know how everyone in his class fares for GEA1000 quiz 1. He sent a survey to
everyone in the class asking for their quiz 1 scores and everyone responded truthfully. He then
constructed a 95% confidence interval for the mean score with the data collected and found that
the interval is [4.4, 6.8].
Which of the following statements must be true?
(I) The constructed confidence interval may not contain the population mean score.
(II) The population mean score can be found given the constructed confidence interval.
21. Ah Wang wants to estimate the average value of tar content of a certain kind of cigarette. He took
a random sample of 25 such cigarettes and constructed a 95% confidence interval for the average
tar content value as (14.25, 14.55). Suppose he wanted a more conservative estimate for the average
tar content value and constructed another confidence interval using the same sample as (14.20, b).
What is the exact value of b?
22. Which of the following statements about the p-value is/are true? Select all that apply.
(A) The p-value is the probability of obtaining results at least as favourable to the alternative
hypothesis as the collected data, computed based on the assumption that the null hypothesis
is true.
(B) The p-value gives the probability that the null hypothesis is true.
(C) At a 5% significance level, a p-value smaller than 0.05 provides sufficient evidence that the
null hypothesis should be rejected.
(D) At a 5% significance level, a p-value larger than 0.05 provides sufficient evidence that the null
hypothesis should be rejected.
23. Suppose we want to test if a coin is biased towards heads. We decide to toss the coin 10 times and
record the number of heads. We shall assume the independence of coin tosses, so that the 10 tosses
constitute a probability experiment.
Let X denote the number of heads occurring in 10 tosses of the coin. We will carry out a hypothesis
test with X as the test statistic. Let H be the event that the coin lands on heads, in a single toss.
We set our hypotheses to be
H0 : P (H) = 0.5,
H1 : P (H) > 0.5.
Suppose in our execution of the 10 tosses, we observe 4 heads. This means X = 4 is the test result
we observe.
Recall the definition of p-value to be the probability of obtaining a test result at least as extreme
as the one observed, assuming the null hypothesis is true. What is the range of test results “at
least as extreme as the one observed”, in this scenario?
(A) 0 ≤ X ≤ 4.
(B) 4 ≤ X ≤ 10.
(C) 0 ≤ X ≤ 5.
(D) 5 ≤ X ≤ 10.
24. A student flips a coin 100 times and the coin lands heads on 35 out of the 100 flips. You want to
show that this has not happened due to chance and that the coin is indeed biased against landing
heads. To test such a hypothesis, what should the null and alternative hypotheses be?
25. Mandy and Sue were playing a game - Sue draws a random card from a deck of five cards (red,
blue, yellow, green and black) and hides it out of sight from Mandy. Mandy will try to guess the
colour of that drawn card. Mandy wins if she can correctly guess the colour of the drawn card.
Otherwise, Mandy loses.
They played 5 rounds of the game, and Mandy won 4 out of 5 games. Sue is surprised that Mandy
won so many times, and suspects Mandy may have some method to detect the colour of the cards
instead of just guessing the colour. Based on the above, she wants to conduct a hypothesis test,
with the null hypothesis:
What event(s) need to be considered in the calculation of the p-value? Select all that apply.
26. A hypothesis test is done to find out whether a new drug increases the survival rate of a disease
in hamsters. The current death rate of the disease in hamsters is 0.7. A random sample of
10 hamsters with the disease was selected for the study. All the 10 hamsters received the new
drug and 4 eventually died from the disease. Which of the following statements is/are true?
Select all that apply.
(A) The null hypothesis can be stated as “The new drug has no effect on the survival rate of the
disease in hamsters”.
(B) The p-value is the probability that the new drug is effective in increasing the survival rate of
the disease in hamsters.
(C) The p-value is the probability that 6 out of 10 hamsters survive from the disease, given that
the probability of survival is 0.3.
(D) The p-value is the probability that 6 or more hamsters out of 10 survive from the disease,
given that the probability of survival is 0.3.
27. Ming Xiao and Wei Da work for the registrar’s office in NSU University. For the graduating cohort
of 2020, Wei Da makes a claim that the average CAP for that graduating cohort is equal to 4. To
test Wei Da’s claim, Ming Xiao decides to perform a hypothesis test at 5% level of significance. He
sets up his null and alternative hypotheses as follows.
Null hypothesis: The average CAP for the graduating cohort of 2020 is 4.00.
Alternative hypothesis: The average CAP for the graduating cohort is greater than 4.00.
Since Ming Xiao works for the registrar's office, he accesses the database, which consists of the
individual CAP of all the NSU students who graduated in 2020. He then uses software to
perform the hypothesis test with this data. The software shows that the average CAP is 4.06 with
a standard deviation of 0.23 along with a p-value of 0.09. Assume that the software is working
perfectly, and that the database is completely accurate. Which of the following statements is true?
(A) There is not enough evidence to reject the null hypothesis and he cannot conclude whether
the average CAP for the graduating cohort is greater than 4.00.
(B) He cannot reject the null hypothesis and therefore the average CAP for the graduating cohort
is 4.00.
(C) Instead of carrying out a hypothesis test, he should compute a 95% confidence interval and
use the margin of error of the confidence interval to determine whether Wei Da’s claim is
correct.
(D) Wei Da’s claim is wrong.
28. Jasmine is a sports psychologist who wants to investigate if participation rates in competitive sports
amongst Austrian youths under 12 years old are associated with sports funding to primary schools.
She collects her data from a random sample of 16-year-old Austrian students, in the following
format:
Subject | Gender | Did you participate in any sports competition before you were 12 years old? (Y/N) | Was there sports funding in your primary school? (Y/N)
001 | M | Yes | No
002 | F | Yes | Yes
··· | ··· | ··· | ···
Which of the following should be her null hypothesis?
(A) There is no association between Gender and Participation in sports competition before 12
years old among 16-year-old Austrian students.
(B) There is an association between Gender and Participation in sports competition before 12
years old among 16-year-old Austrian students.
(C) There is no association between Participation before 12 years old and Sports Funding among
16-year-old Austrian students.
(D) There is an association between Participation before 12 years old and Sports Funding among
16-year-old Austrian students.
29. Professor Ali, a sleep researcher, claims that the average amount of sleep that students in school
XYZ get each day is 7 hours. Professor Lily believes that it is less than that, as her students are
always sleeping in her classes. She decided to conduct a hypothesis test at the 5% significance
level. She randomly sampled 153 students from the school and obtained a sample average sleep
duration of 5.5 hours. Which of the following outcomes must Professor Lily consider in calculating
the p-value for her hypothesis test?
30. Jane suspects that among students from the National University of Singapore (NUS), those who
wear shirts of bright colours (defined as red, orange and yellow shirts) to lectures are more likely
to spend more money to purchase the NUS merchandise sold outside the lecture hall, as measured
by their expenditure (Low: $10 and below, High: above $10) compared to NUS students who wear
shirts of other colours.
She conducts a study to test her hypothesis. Which of the following should be her null hypothesis?
(A) There is insufficient evidence to conclude that shirt colour is associated with expenditure
among NUS students.
(B) Shirt colour is not associated with expenditure among NUS students.
(C) Shirt colour is associated with expenditure among NUS students.
(D) There is sufficient evidence to conclude that shirt colour is associated with expenditure among
NUS students.
Index
Sample, 3
Sample Space, 216
Sample variance, 13
Sampling
    frame, 3
    Non-probability, 4, 7
    Probability, 5
Sampling Without Replacement, 5
Scatter plot, 152
Self-select, 7
Sensitivity, 227
Significance Level, 238
Simple Random Sampling, 5
Simpson's Paradox, 81
Sliced stacked bar plot, 81
Slicing, 82