Chapter 4 Part 1
The Road to
Statistical Inference
In this unit, we give a high-level overview of the chapter by introducing its overarching
theme -- statistical inference.
1
In a previous chapter
Statistical inference
2
PPDAC
Cycle
In the greater context of the PPDAC cycle, it should be clear by now that the first and
last steps of the cycle are connected with the help of statistical and data-related tools,
tools which are very much emphasized in this course. The high concentration of
specialized tools and techniques relevant to the middle three steps allows these steps to
be recast in more specialized language.
3
Analysis
Plan &
Data
In the previous chapters, we have gone through means of producing data, which fall
under the steps “plan” and “data” in the PPDAC cycle, as well as methods of exploratory
data analysis, which fall under the step “analysis”.
4
Analysis
Picture source: courses.lumenlearning.com
The end goal of the “analysis” step is very often statistical inference. It links the analysis
of sample data with the drawing of conclusions about the population.
5
This Chapter
• Confidence intervals
• Hypothesis tests
Probability and inference will be the main thrust of this chapter, as we build upon the
foundations of probability to arrive at tools required in statistical inference. We will
briefly explore two kinds of tools common in the art of statistical inference:
• confidence intervals, and
• hypothesis tests.
6
Probability:
Setting the Stage
In this unit, we will lay the groundwork for probability by defining some basic terms used
in the subject.
7
Vernacular
Uncertainty
(chance, likelihood)
Mathematical Probability
In the previous chapters, we occasionally and implicitly dealt with, and relied on, the
concept of uncertainty. Whenever we mentioned the word “chance”, or used the phrases
“more likely” or “less likely”, we were appealing to your intuition about things that are not
definite, or do not always hold true. Such terms are common and adequate in day-to-day
communication. However, in order to deal with data at a deeper level, their meanings
need to be made precise, and it is helpful to have a more rigorous framework to ground
uncertainty in.
8
Possibilities
×2 HH TT
HT TH
Let’s define the basic terms of probability using an example. Say, you flip a coin twice,
and observe the results of the two flips. There are four possibilities here. Namely, getting
two heads, getting two tails, getting a head followed by a tail, and getting a tail followed
by a head. In this example, the procedure of flipping the coin twice is called a probability
experiment, and the four possible results are the outcomes of the probability experiment.
Note that a probability experiment is defined more narrowly than the experiments of
Chapter 1, because we must be able both to repeat it as many times as we want, and to
exactly describe, or list, all its outcomes.
Just like how a probability experiment gives rise to a set of outcomes, every example of
mathematical objects we are going to define, going forward, will arise as a result of some
probability experiment. In this sense, probability experiments form the bedrock of the
study of probability.
9
We will use the term sample space to denote the collection of all possible outcomes of a
probability experiment. We will also use the word event to denote a subcollection of the
sample space. In the probability experiment of flipping a coin twice, we have the
following sample space of four elements. An example of an event of this sample space is
getting either two heads or two tails. Colloquially, we may call this event a “two-in-a-
row”.
It turns out that a sample space and an event of that sample space are enough to give
context to the mathematical discussion of probability. In other words, mathematically
speaking, we only ever talk about the probability of an event of a sample space.
Intuitively, this probability measures how likely the outcome of the probability
experiment -- which gives rise to the sample space in the first place -- is an element of
the event. It is useful to note that, in practice, we often regard outcomes as events, so
that we can talk about the probability of outcomes.
10
Example: Die-rolling
Rolling a six-sided die
Sample space: {1, 2, 3, 4, 5, 6}
A possible event: “an even-numbered face” = {2, 4, 6}
What are the basic properties of this probability experiment? Well, for one, its sample
space consists of six elements, one for each face of the die. The sample space can be
written using set notation as follows. A possible event of this sample space is highlighted
here in green. This event can be described as “the die landing on an even-numbered
face”, and we can check that this description corresponds to the subcollection of the
sample space containing exactly the outcomes ‘2’, ‘4’ and ‘6’.
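As a small sketch (not part of the course materials), the sample space and event of this example can be represented directly as Python sets, with the event checked to be a subcollection of the sample space:

```python
# Sketch: the die-rolling probability experiment's sample space and an event
sample_space = {1, 2, 3, 4, 5, 6}  # one element per face of the die

# the event "the die landing on an even-numbered face"
even_face = {outcome for outcome in sample_space if outcome % 2 == 0}

print(even_face)                   # the subcollection {2, 4, 6}
print(even_face <= sample_space)   # True: an event is a subset of the sample space
```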
11
Probabilities
In this unit, we define the class of mathematical objects that is the linchpin of statistics.
12
Probability experiment → Sample space → Events
Probabilities: numerical values between 0 and 1 (inclusive), assigned to events
We have learnt that a probability experiment gives rise to a sample space, as well as
events of the sample space. What is left to do, in modelling our probability experiment, is
to talk about the probabilities themselves. Conventionally, probabilities are numerical
values between 0 and 1 we assign to events. If E is an event that has been assigned a
probability, we use P of E, “the probability of E”, to denote the probability assigned to E.
All that being said, how can we read off probabilities from a probability experiment?
13
Simple Cases: Finite Sample Spaces
For any event E:
1. Repeat the probability experiment a large number (N) of times
2. For each repetition, check if the outcome is in E
Every event can be assigned a probability
Repetitions: 1, 2, 3, ..., N
This is not too difficult, if our probability experiment has a finite sample space, in which
case, every event can be assigned a probability. We do so in the following manner:
choose any event E, repeat the probability experiment a large number of times, say N
times, and for each repetition, check if the outcome is in E.
For example, in the N repetitions, we might see that the first outcome is in E, so we
mark it with a ‘Yes’. The second outcome might not be in E, so we mark it with a ‘No’.
The third repetition might turn out to be a ‘Yes’ again, and so on and so forth.
After we have marked all N repetitions with either ‘Yes’ or ‘No’, we count the number
of ‘Yes’ marks and divide it by N, to get the proportion of E for this particular set of repetitions.
We can assign this proportion to E as its probability. Of course, it is reasonable to believe
that the proportion of E will be different if we repeat the probability experiment another
N or more times.
The point is, all these proportions are estimates of the true probability of E, and such
estimates get more accurate as N gets larger.
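The repetition procedure above can be sketched in a few lines of Python. This is an illustrative simulation, not from the course materials; the die and the event E = “an even-numbered face” are chosen for concreteness, and the true probability being estimated is 0.5:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def estimate_probability(experiment, event, n):
    """Repeat the probability experiment n times and return the
    proportion of repetitions whose outcome lies in the event."""
    hits = sum(1 for _ in range(n) if experiment() in event)
    return hits / n


def roll_die():
    # the probability experiment: one roll of a fair six-sided die
    return random.randint(1, 6)


E = {2, 4, 6}  # event: "an even-numbered face"

# As N grows, the proportion settles near the true probability 0.5.
print(estimate_probability(roll_die, E, 100_000))
```

Running the estimate with larger and larger N shows the proportions clustering ever more tightly around the true probability, which is exactly the point made above.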
14
Simple Cases: Finite Sample Spaces
An Example
Event E = “an even-numbered face”
1. Repeat the probability experiment a large number (N) of times
2. For each repetition, check if the outcome is in E
Every event can be assigned a probability
Rolling a six-sided die
Repetitions: 1, 2, 3, ..., N
Marks: Yes, No, No, ..., Yes
Proportion of E = (number of ‘Yes’ marks) / N → P(E)
15
Rules of Probabilities
For every assignment of probabilities
For finite sample spaces, it is enough to assign probabilities to outcomes so that they add up to 1
When the sample space is finite, we only need to assign probabilities to outcomes so
that these probabilities sum to 1. The probabilities of all other events can then be
derived from there.
16
For finite sample spaces, it is enough to assign probabilities to outcomes so that they add up to 1.
P(1) = 0.1
P(2) = 0.1
P(3) = 0.1
P(4) = 0.1
P(5) = 0.1
P(6) = 0.5
(add up to 1)
Deriving probabilities of other events
For example, in the rolling of a biased six-sided die, after assigning these probabilities to
all 6 outcomes, and checking that they add up to 1, we can derive the probabilities
of other events by invoking the 3rd rule of probability repeatedly, so that the probability
of each event is the sum of the probabilities of its outcomes.
17
If we have two events E and F, where E is the event the die lands on an odd-numbered
face, and F the event the die lands on an even-numbered face, we can compute the
probabilities of E and F as follows. The probability of E will be the sum of the
probabilities of the die landing on 1, 3 and 5, which evaluates to 0.3.
On the other hand, the probability of F will be the sum of the probabilities of the die
landing on 2, 4 and 6, which equals 0.7. For the rest of this chapter, unless specified
otherwise, we will only concern ourselves with sample spaces which are finite.
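The derivation rule above — the probability of an event is the sum of the probabilities of its outcomes — can be sketched in Python using the biased die's assignment (a sketch, not from the course materials):

```python
# Sketch: outcome probabilities of the biased six-sided die
outcome_prob = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}

# check the outcome probabilities add up to 1 (allowing for float rounding)
assert abs(sum(outcome_prob.values()) - 1) < 1e-9


def prob(event):
    """Probability of an event = sum of the probabilities of its outcomes."""
    return sum(outcome_prob[o] for o in event)


E = {1, 3, 5}  # the die lands on an odd-numbered face
F = {2, 4, 6}  # the die lands on an even-numbered face
print(round(prob(E), 10))  # 0.3
print(round(prob(F), 10))  # 0.7
```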
18
Uniform Probabilities and Rates
Uniform
probability
HH TT
HT TH
In the sample space corresponding to flipping a coin twice, there are a total of four
outcomes, so uniform probability over this sample space simply assigns a probability of a
quarter, or 0.25, to each outcome. As you will learn in the subsequent units, uniform
probability in this case, corresponds to two independent flips of a fair coin.
19
Uniform Probabilities and Rates
Uniform
probability
Randomly select a unit
20
Conditional Probabilities
In this unit, we look into the concept of conditional probability, and give a general
method of computing conditional probabilities.
21
Conditional Probability
E and F are events
P(E | F): “probability of E given F”
P(E | F) = P(E ∩ F) / P(F)
(Venn diagram: E = “even number”, F = “can be divided by 3”, with overlap E ∩ F)
The concept of conditional probability deals with probabilities written in the following
notation, which is usually read as the “probability of E given F”. Here, E and F are
events of a particular sample space. Intuitively, the probability of E given F measures
how likely the outcome of the probability experiment – which again, gives rise to the
sample space – is an element of E, if we already know that it is an element of F. To
compute conditional probabilities, we usually invoke the idea of restricting sample
spaces.
So imagine we have a finite sample space in mind, with two events labelled E and F. To
compute the probability of E given F, we restrict our focus to the given event F, which
may contain some overlap with E, denoted “E intersect F”. F acts as a baseline for the
computation, and we can read off its assigned probability. E intersect F is the part of E
we can find in F; it is also an event of the sample space, so we can read off the
probability of E intersect F. Taking the quotient of these two probabilities, we arrive at a
value which is, justifiably, the probability of E given F.
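The restriction idea can be sketched for the diagram's example: a fair die, with E = “even number” and F = “can be divided by 3”. This is an illustrative sketch, not part of the course materials:

```python
from fractions import Fraction

# Sketch: P(E | F) = P(E ∩ F) / P(F) on a fair six-sided die
sample_space = {1, 2, 3, 4, 5, 6}
E = {2, 4, 6}  # "even number"
F = {3, 6}     # "can be divided by 3"


def prob(event):
    # uniform probability: each face is equally likely
    return Fraction(len(event), len(sample_space))


# restrict the sample space to F, then take the quotient
p_given = prob(E & F) / prob(F)
print(p_given)  # 1/2
```

Exact fractions are used so the quotient is 1/2 rather than a rounded float.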
22
Conditional Probabilities as Rates
A and B are subgroups
Recall:
Randomly select a unit
In the previous unit, we gave a common manifestation of uniform probability, and that
is, the probability experiment of randomly selecting a unit from a fixed sampling frame.
We saw that the sample space of this probability experiment is equal to the sampling
frame, and that for any subgroup A of the sampling frame, A is an event of the sample
space, with probability of A being equal to the rate of A.
Now, we can ask a follow-up question with respect to this probability experiment: does
the probability of A given B equal the rate of A given B, whenever A and B are subgroups
of the sampling frame? Let’s work this out on the board.
23
Conditional Probabilities as Rates
A and B are subgroups
Recall:
Randomly select a unit

Recall:
• Sample space = sampling frame
• P(A) = rate(A), for any subgroup (event) A

P(A | B) = P(A ∩ B) / P(B)
         = rate(A ∩ B) / rate(B)
         = (size(A ∩ B) / size of sampling frame) / (size(B) / size of sampling frame)
         = size(A ∩ B) / size(B)
         = rate(A | B)    Yes!
We start with the probability of A given B, which equals the probability of A intersect B
divided by the probability of B. This is the equation we derived previously using the idea
of restricting the sample space. Since talking about probabilities is the same as talking
about rates, we can substitute probabilities for rates in the expression on the left to get
the rate of A intersect B divided by the rate of B. Unravelling the definitions of rates as
ratios of two sizes, we get the following expression. Cancelling out the common
denominator in the ratio, we find it equal to the size of A intersect B divided by
the size of B, which is precisely the rate of A given B.
This easy verification leads to an affirmative answer to our question. Indeed, just as
probabilities are equivalent to rates in this probability experiment, so too are conditional
probabilities equivalent to conditional rates.
24
Example: ART for Covid-19
Antigen Rapid Test (ART): randomly select a person → apply ART to test for Covid-19 → check Covid-19 status
Sensitivity (true positive rate): P(+ | Covid-19) = 0.80
Specificity (true negative rate): P(- | no Covid-19) = 0.99
Here, the probabilities are formulated with respect to the probability experiment of
randomly selecting a person from the global population, applying the ART on this person
to test for Covid-19, and then checking his or her Covid-19 infection status in a definitive way.
An outcome of this probability experiment thus comprises a test result and a Covid-19
infection status. Now that we know how to make sense of the conditional probabilities
that define sensitivity and specificity, let’s go back to fill in the blanks. According to
studies, the ART has a sensitivity of 0.80 and a specificity of 0.99.
25
Example: ART for Covid-19
Base rate of Covid-19: P(Covid-19) = 0.01
However, these two conditional probabilities do not directly help the average person,
who has no means of knowing for sure his or her Covid-19 infection status without
jumping through a lot of hoops. What people have access to instead are the results of
their ARTs. So what can we say about a person’s Covid-19 infection status given his or
her ART result? For example, what is the probability a person is, in fact, infected with
Covid-19, given that he or she tested positive? This is an important question, because a
false positive test result can cause a lot of grief and inconvenience to a person.
As it turns out, having just the sensitivity and specificity of the test is insufficient to
answer this question. What we are missing is the base rate, or rate of infection, of Covid-
19, which in this probability experiment, is equal to the probability the person selected
at random is infected with Covid-19. Let us take a conservative stance and assume that
1% of the global population is infected with Covid-19.
26
Example: ART for Covid-19
               +       -       Row total
Covid-19       800     200     1000
No Covid-19    990     98010   99000
Column total   1790    98210   100000

Base rate of Covid-19: P(Covid-19) = 0.01
The rate of Covid-19 infection among those tested positive is thus equal to 800 divided
by 1790, or 0.447 when rounded off to 3 significant figures. By the correspondence
between conditional probabilities and conditional rates, this also answers the question
we posed. That is, if one tests positive for Covid-19 in an ART, there is only about a 45%
probability that he or she is really infected with Covid-19. Such a low conditional
probability indicates that more rigorous tests need to be applied for confirmation,
following the ART.
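The count table for 100,000 randomly selected people can be reproduced directly from the stated sensitivity, specificity and base rate. This Python sketch (not from the course materials) carries out that arithmetic:

```python
# Sketch: rebuilding the slide's contingency table from the given rates
N = 100_000
base_rate, sensitivity, specificity = 0.01, 0.80, 0.99

infected = N * base_rate                         # 1000 people have Covid-19
true_pos = infected * sensitivity                # 800 of them test positive
false_pos = (N - infected) * (1 - specificity)   # 990 test positive without Covid-19

# rate of Covid-19 infection among those who tested positive
p_covid_given_pos = true_pos / (true_pos + false_pos)
print(round(p_covid_given_pos, 3))  # 0.447
```

The same quotient, 800 / 1790, is exactly the conditional rate read off the table.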
27
Independence
In this unit, we define what it means for two events to be independent, and look at the
relation between independence and association.
28
Definition 1
Independence of events A, B: P(A) = P(A | B)

Unpacking: P(A) = P(A ∩ B) / P(B)

Definition 2
Independence of events A, B: P(A) × P(B) = P(A ∩ B)
One way to define independence of two events A and B, is to say that the probability of A
equals the probability of A given B. If we unpack the definition of the probability of A
given B, we get the probability of A intersect B divided by the probability of B, so the
equation can be reformulated as shown. Multiplying both sides of the equation by the
probability of B gives us this new equation. We thus have another, equivalent definition
of what it means for two events to be independent. By our second definition, it is clear
that the order of events does not matter when we talk about independence.
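Definition 2 can be checked concretely on the two-coin-flip experiment with uniform probability. In this sketch (an illustration, not from the course materials), A = “first flip is heads” and B = “second flip is heads”:

```python
from fractions import Fraction
from itertools import product

# Sketch: sample space of two coin flips, e.g. ('H', 'T') for heads then tails
sample_space = set(product("HT", repeat=2))


def prob(event):
    # uniform probability: each of the 4 outcomes is equally likely
    return Fraction(len(event), len(sample_space))


A = {o for o in sample_space if o[0] == "H"}  # first flip is heads
B = {o for o in sample_space if o[1] == "H"}  # second flip is heads

# Definition 2 holds, so A and B are independent; the order of A and B
# clearly does not matter in this product.
print(prob(A) * prob(B) == prob(A & B))  # True
```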
29
Independence as Non-Association
Variable 1: A / not A    Variable 2: B / not B
(2 × 2 table: rows A / not A, columns B / not B)
Randomly select a unit → check if A, check if B
Rate(A) = Rate(A | B) : No Association
P(A) = P(A | B) : Independence
Here, the relevant probability experiment involves randomly selecting one unit from
the population we want to study, followed by checking the values of the selected unit with
regard to variable 1 and variable 2. The equivalence of rates and probabilities
immediately leads us to the conclusion that A and B being independent in this probability
experiment is exactly what it means for the two variables to not be associated.
30
Independent Probability Experiments
Carried out independently
Probability experiments P and Q
Sample space of P: S; sample space of Q: T
Combined sample space: all pairings of an element of S with an element of T
We start with two probability experiments, one of which we label P, which gives rise to
the sample space S. The other probability experiment, which we label Q has sample
space T. If these two probability experiments are independent, then we can view them as
two components of a larger experiment: P coupled with Q. It is easy to see that
this combined probability experiment has, as its sample space, all pairings of possible
outcomes of the two components. That the two components are independent is conveyed by how
probabilities are assigned to outcomes of the combined experiment. In short, we want the
probability of each pairing to obey something analogous to “Definition 2” of independent
events.
31
Independent Probability Experiments
Carried out independently
Probability
experiment ×2
×2
Suppose P is the rolling of a particular six-sided die, which gives rise to the sample space
comprising the six possible faces of the die. Also suppose Q is the tossing of a particular
coin twice, which has the familiar 4-element sample space. Then the sample space of the
combined probability experiment consists of the pairs shown in the slide.
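The combined sample space can be sketched with `itertools.product`, pairing each die face with each two-flip result (an illustrative sketch, not from the course materials):

```python
from itertools import product

# Sketch: combined sample space of the die roll (P) and the
# two-coin-flip experiment (Q) as all pairings of their outcomes
S = {1, 2, 3, 4, 5, 6}        # sample space of P
T = {"HH", "HT", "TH", "TT"}  # sample space of Q

combined = set(product(S, T))
print(len(combined))  # 6 * 4 = 24 pairings, e.g. (3, 'HT')
```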
32
Random Variables
33
Probability experiment → Outcome → Numerical value
Examples
Randomly select a person
Consider a probability experiment such that each of its outcomes is given a numerical
value. One such probability experiment is a game of roulette, played in a casino, such
that every outcome is associated with a payoff, which is of course, numerical in nature.
Another example involves randomly selecting a person from a population, checking his
Covid-19 infection status, then giving a 1 or a 0, depending on the status.
34
Random variable = Numerical variable + Probabilities
Discrete variable (values listed by counting) → Discrete random variable
Continuous variable (all values, non-integers as well) → Continuous random variable
Now, abstracting away the motivating probability experiment and its outcomes and
focusing on what is left, we can say that any numerical variable with probabilities
assigned over its possible values, is a random variable.
If the numerical variable is a discrete variable, we call the random variable a discrete
random variable. On the other hand, if the numerical variable is a continuous variable,
we call the random variable a continuous random variable.
35
Random Variables
model
Data Distributions
• Measures of central tendency
• Measures of dispersion
… can be computed
Random variables and data distributions can be thought of as two sides of the same coin. In
fact, random variables were conceived as a mathematical way to model data
distributions. As a result, common summary statistics, like measures of central tendency
and measures of dispersion, can be computed from random variables, just as they
can be computed from data distributions. These summary statistics can be difficult to
compute for many random variables, and we shall not cover them in our course.
36
HDB Household Size
Check size
A discrete random variable of practical use is usually derived from real-world data and/or
real-world probability experiments. Let’s look at one example. Complementary to this
unit is “Unit_5_household_size.csv”, containing modified Singstats data regarding HDB
households. Here, the size of a household refers to the number of individuals living in the
household.
Consider the probability experiment of randomly selecting a household from all HDB
households. The numerical variable we are interested in is the size of the selected
household. From the data given, equating probability with rate, we can draw up a table
detailing the possible household sizes and their respective probabilities, with respect to
this procedure. Clearly, the household size is a finite discrete variable. This particular
table thus represents a discrete random variable.
37
Visualisation of a Discrete Random Variable
Household size (X)   1      2      3      4      5      6
Probability          0.16   0.226  0.204  0.201  0.119  0.09
• Each point
represents a possible
value of X, indicated
by its x-value
• y-value of a point =
probability that X
assumes its x-value
Labelling this discrete random variable X, we can visualise it using a plot of points, in
which each point represents a possible value of X equal to its x-value, and the y-value of
each point equals the probability that X assumes its x-value.
38
Visualisation of a Discrete Random Variable
Discrete collection
of possible values
of X
Vertical line segments can be added to connect the points in the plot to the x-axis. This
is sometimes done to emphasise the height of each point. Since the points in the plot
are separated by gaps, it is clear that their x-values, which are the possible values X can
assume, are discrete. The points in the plot, as a whole, form a visual representation of X.
39
Visualisation of a Discrete Random Variable
Highest point; Mode
Notes:
• Probabilities of the points add up to 1
• x-value of a highest point is a mode
This representation agrees with the rules of probabilities in the sense that the
probabilities of the points add up to 1. Moreover, analogous to what has been taught
about distribution of data, a mode of a discrete random variable is the x-value of a
highest point. In this plot, there is a single highest point, with x-value 2, so 2 is the mode
of this discrete random variable.
40
Probabilities of a Discrete Random Variable
Probability that a
randomly selected HDB
household has size ≥ 5?
i.e. P(X ≥ 5)?
41
Probabilities of a Discrete Random Variable
P(X ≥ 5) = P(X = 5) + P(X = 6)
         = 0.119 + 0.09
         = 0.209
To compute this probability, we simply add up the probabilities of points with x-values at
least 5, which would give us a value of 0.209.
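The household-size random variable and the P(X ≥ 5) computation can be sketched directly from the table (a sketch, not part of the course materials):

```python
# Sketch: the household-size random variable X as a value-to-probability table
X = {1: 0.16, 2: 0.226, 3: 0.204, 4: 0.201, 5: 0.119, 6: 0.09}

# probabilities of the points add up to 1 (allowing for float rounding)
assert abs(sum(X.values()) - 1) < 1e-9

# P(X >= 5): sum the probabilities of values at least 5
p_at_least_5 = sum(p for size, p in X.items() if size >= 5)
print(round(p_at_least_5, 3))  # 0.209
```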
42
Visualisation of a
Continuous Random Variable
Density curve of
continuous
random variable Y
Continuous range
of possible values
Y can assume
Any continuous random variable Y can be visualised with a density curve on the
standard x- and y-axes. A curve can be viewed as a “continuous series of points”, which
makes it analogous to how a discrete random variable is visualised with a plot of
discrete points. Following this parallel, the x-values under such a curve correspond to the
possible values Y can assume.
43
Visualisation of a
Continuous Random Variable
Highest point; Mode
Density curve of continuous random variable Y
Notes:
• Area under the curve = 1
• x-value of a highest point is a mode
Any density curve of a continuous random variable must have area under it equal to 1.
Similar to the case of discrete random variables, the x-value of a highest point of the
curve is a mode. For the curve of Y, there is a single highest point, with x-value 0.2, so
0.2 is the mode of this continuous random variable.
44
Probabilities of a
Continuous Random Variable
Probability that Y assumes a value between 0.3 and 0.5?
i.e. P(0.3 ≤ Y ≤ 0.5)?
Density curve of continuous random variable Y
As in the case of discrete random variables, we are often curious about probabilities of a
random variable taking on values within a range. For example, what is the probability
that Y assumes a value between 0.3 and 0.5?
45
Probabilities of a
Continuous Random Variable
To compute this probability, we calculate the area under the density curve of Y in the
interval 0.3 to 0.5. This area, shaded in the plot here, evaluates to 0.311.
46
Probabilities of a
Continuous Random Variable
In general
Probability that a continuous random variable takes on a value in the interval [a, b]
= area under its density curve from a to b
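The area-under-the-curve rule can be sketched numerically. The density below, f(y) = 2y on [0, 1], is a made-up stand-in (it is NOT the curve of Y from the slides); it has total area 1, and the exact probability on [0.3, 0.5] is 0.25 − 0.09 = 0.16:

```python
# Sketch: P(a <= Y <= b) as the area under a density curve,
# approximated with the trapezoidal rule
def density(y):
    # hypothetical density: f(y) = 2y on [0, 1], total area 1
    return 2 * y


def prob_interval(a, b, steps=10_000):
    """Approximate the area under the density curve from a to b."""
    h = (b - a) / steps
    total = 0.5 * (density(a) + density(b))
    total += sum(density(a + i * h) for i in range(1, steps))
    return total * h


print(round(prob_interval(0.3, 0.5), 4))  # 0.16, matching b^2 - a^2
```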
47
Normal Distributions
Two normal distributions can only differ by their means or their variances.
Common properties:
• Bell-shaped curve
• Peak of the curve occurs at the mean
• Curve is symmetrical about the mean
(Hence: mean = mode = median)
A particular normal distribution is fully described by its mean and variance. In other
words, any two normal distributions can only differ by their means or their variances.
We use the notation N of x comma y, to denote the normal distribution with mean x and
variance y. Every normal distribution possesses the following properties:
• it is a bell-shaped curve,
• it has a peak that occurs at the mean, and
• it is symmetrical about the mean.
The last two points, in conjunction, imply that the mean, the mode and the median of
any normal distribution are the same.
48
Two normal distributions can only differ by their means or their variances.
Mean = 2
Mean = 0
Now let us have a look at how the mean and variance of a normal distribution manifest
visually.
When the mean is 0 and the variance is 1, the resulting normal distribution is called the
standard normal distribution. Its density curve is shown to the left of the slide. The dotted
line indicates the peak of the curve, which as mentioned, occurs at the mean, 0. The
normal distribution with mean 2 and variance 2 has its density curve plotted on the same
axes, to the right of the slide. Here, the peak of the curve is to the right of the peak of the
standard normal distribution, due to the mean being greater.
Contrasting the shapes of the two curves, we can infer that a smaller variance corresponds
to a thinner bell shape, whereas a greater variance corresponds to a fatter bell shape. Note
that, since the area under the density curve of any continuous random variable must be 1,
a fatter bell shape needs to compensate by being shorter.
49
A real-life example of a normal distribution occurs in intelligence quotient, or IQ for
short. The Wechsler Adult Intelligence Scale is designed such that its IQ scores follow a
normal distribution with mean 100 and standard deviation 15 (hence, a variance of 225).
As a result, approximately 68% of the IQ scores fall within the range 85 to 115,
approximately 95% of the scores fall within the range 70 to 130, and roughly 2% of the
scores fall in each of the ranges 55 to 70 and 130 to 145.
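These percentages can be verified from the normal CDF, which the standard library's `math.erf` gives directly. A sketch (not part of the course materials) for the IQ distribution N(100, 225):

```python
from math import erf, sqrt


def normal_cdf(x, mean, sd):
    # CDF of the normal distribution, via the error function
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))


# IQ scores: normal with mean 100 and standard deviation 15
within_1_sd = normal_cdf(115, 100, 15) - normal_cdf(85, 100, 15)
within_2_sd = normal_cdf(130, 100, 15) - normal_cdf(70, 100, 15)

print(round(within_1_sd, 4))  # about 0.6827: the "68%" range 85 to 115
print(round(within_2_sd, 4))  # about 0.9545: the "95%" range 70 to 130
```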
Summary
Sample space
assigned to
Events Probabilities
At this point, a summary of what we have covered so far could be useful. When talking
about formalizing real-life scenarios involving uncertainty, we always start with
describing a probability experiment, which should come equipped with an obvious
sample space. We define events as subcollections of the sample space. The probability
experiment should also give us information about how we can associate probabilities to
events.
51
Sample space
assigned to
Events Probabilities
• Conditional probability
• Independence
• Independent events
• Independent probability experiments
• Random variables
• Discrete random variables
• Continuous random variables
• Normal distributions
All this information allows us, on one hand, to talk about conditional probability and the
related concept of independence, not just in the context of events of one probability
experiment, but also
experiment, but also, among multiple probability experiments. On the other hand, this
information allows us to conceptualise random variables, both discrete and continuous,
including probably the most well-known class of continuous random variables around,
the class of normal distributions.
52