
Probability and Statistics

for the Natural Sciences

201 – SN1

Marianopolis College
Student Version
CC BY-NC-SA 4.0
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

This license requires that reusers give credit to the creator. It allows reusers to distribute,
remix, adapt, and build upon the material in any medium or format, for noncommercial
purposes only. If others modify or adapt the material, they must license the modified material
under identical terms.

BY: Credit must be given to Jean-François Deslandes and to the Marianopolis College
Mathematics Department.
NC: Only noncommercial use of this work is permitted. Noncommercial means not primarily
intended for or directed towards commercial advantage or monetary compensation.
SA: Adaptations must be shared under the same terms.

Probability and Statistics for the Natural Sciences © 2024 by Jean-François Deslandes is
licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. To
view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/

ACKNOWLEDGEMENTS

Jean-François Deslandes would like to thank

o the Marianopolis College Mathematics Department for its contributions to the set of examples
used in this document, for providing insightful feedback and constructive comments on its
content and presentation.

o the Marianopolis College Chemistry, Physics and Biology Disciplines for their wisdom, time and
contributions to applications found in the Problem Sets and Excel Applications.

Special thanks are expressed to Maryse Desrochers, member of the Mathematics Department for her
generosity and effort in proofreading and providing feedback on the entirety of the work contained in
this document.

Finally, unconditional gratitude to Myriam, Émile and, most especially, Nathalie for their steadfast
support and encouragement with regard to all endeavors I have pursued, academic or otherwise.

CONTENTS

SECTION 1 DATA ANALYSIS AND DESCRIPTIVE STATISTICS 6


1.1 DEFINITIONS AND INTRODUCTION TO DATA PRESENTATION ________________________________________________ 6
1.2 QUALITATIVE VARIABLES: NON NUMERICAL OR CATEGORICAL DATA __________________________________________ 6
1.3 QUANTITATIVE VARIABLES: NUMERICAL MEASUREMENTS _________________________________________________ 6
1.4 TABLES AND THEIR CONTENT ____________________________________________________________________ 7
1.5 GRAPHICAL REPRESENTATIONS ___________________________________________________________________ 9
1.6 MEASURES OF CENTRAL TENDENCY _______________________________________________________________ 13
1.7 MEASURES OF DISPERSION ____________________________________________________________________ 15
1.8 MEASURES OF POSITION ______________________________________________________________________ 16
1.9 BOXPLOTS________________________________________________________________________________ 23
1.10 DESCRIPTION OF A TWO-VARIABLE SITUATION _______________________________________________________ 24
1.11 RELATIONSHIP BETWEEN TWO VARIABLES __________________________________________________________ 26
1.12 COVARIANCE AND CORRELATION________________________________________________________________ 28
1.13 LINEAR CORRELATION COEFFICIENT ______________________________________________________________ 29

SECTION 2 INTRODUCTION TO PROBABILITY 32


2.1 BASIC DEFINITIONS __________________________________________________________________________ 32
2.2 OPERATIONS ON EVENTS ______________________________________________________________________ 35
2.3 CONDITIONAL PROBABILITY AND INDEPENDENCE ______________________________________________________ 42
2.4 PARTITIONING AND BAYES’ LAW_________________________________________________________________ 47

SECTION 3 COUNTING TECHNIQUES 52


3.1 MULTIPLICATION RULE _______________________________________________________________________ 52
3.2 ADDITION RULE ____________________________________________________________________________ 55
3.3 ARRANGEMENTS, PERMUTATIONS AND COMBINATIONS _________________________________________________ 60
3.4 MIXED PROBLEMS AND VARYING STRATEGIES ________________________________________________________ 63

SECTION 4 RANDOM VARIABLES 66


4.1 DISCRETE RANDOM VARIABLES __________________________________________________________________ 66
4.2 PROBABILITY DISTRIBUTION TABLE________________________________________________________________ 68
4.3 COMPUTATIONAL FORMULAS FOR THE EXPECTED VALUE AND VARIANCE ______________________________________ 70
4.4 LINEAR TRANSFORMATION OF A RANDOM VARIABLE ___________________________________________________ 72
4.5 SUM OF INDEPENDENT VARIABLES ________________________________________________________________ 75
4.6 DIFFERENCE OF INDEPENDENT VARIABLES ___________________________________________________________ 75
4.7 BERNOULLI AND BINOMIAL DISTRIBUTIONS__________________________________________________________ 77
4.8 EXPECTED VALUE, VARIANCE AND STANDARD DEVIATION OF A BINOMIAL VARIABLE _______________________________ 79
4.9 PROBABILITY DISTRIBUTION OF A BINOMIAL VARIABLE: __________________________________________________ 79

SECTION 5 CONTINUOUS RANDOM VARIABLES 82


5.1 DENSITY FUNCTIONS _________________________________________________________________________ 83
5.2 INTEGRAL NOTATION ________________________________________________________________________ 84
5.3 EXPECTED VALUE, VARIANCE AND STANDARD DEVIATION OF A CONTINUOUS RANDOM VARIABLE ______________________ 86
5.4 LINEAR TRANSFORMATIONS AND SUMS OF INDEPENDENT CONTINUOUS RANDOM VARIABLES ________________________ 91

SECTION 6 NORMAL DISTRIBUTION AND SUMS OF VARIABLES 92
6.1 STANDARD NORMAL DISTRIBUTION _______________________________________________________________ 92
6.2 GENERAL FORMS OF NORMAL DISTRIBUTIONS ________________________________________________________ 97
6.3 PROPERTIES OF NORMAL DISTRIBUTIONS ___________________________________________________________ 99
6.4 CONVERSIONS BETWEEN GENERAL AND STANDARD NORMAL LAWS _________________________________________ 100
6.5 SUMS OF IDENTICAL AND INDEPENDENT RANDOM VARIABLES ____________________________________________ 103
6.6 CONTINUITY CORRECTION (OPTIONAL) ____________________________________________________________ 109

SECTION 7 SAMPLE DISTRIBUTIONS AND THE CENTRAL LIMIT THEOREM 112


7.1 SAMPLE MEANS __________________________________________________________________________ 113
7.2 CENTRAL LIMIT THEOREM ____________________________________________________________________ 115
7.3 SAMPLE PROPORTIONS ______________________________________________________________________ 118
7.4 SAMPLE SIZE AND INDEPENDENCE OF TRIALS ________________________________________________________ 120
7.5 STUDENT T-DISTRIBUTION ____________________________________________________________________ 122

SECTION 8 HYPOTHESIS TESTING 129


8.1 INTRODUCTORY DISCUSSION __________________________________________________________________ 129
8.2 TO REJECT, OR NOT TO REJECT, THAT IS THE QUESTION! ________________________________________________ 134
8.3 TESTS ON MEANS AND PROPORTIONS USING A REJECTION ZONE ___________________________________________ 136
8.4 TEST ON A MEAN – BIG SAMPLE SIZES (N ≥ 30) ______________________________________________ 142
8.5 TEST ON A MEAN – SMALL SAMPLE SIZES (N < 30) ____________________________________________ 145
8.6 HYPOTHESIS TESTING AND P-VALUES _____________________________________________________________ 148

SECTION 9 CONFIDENCE INTERVALS 157


9.1 CONFIDENCE INTERVAL FOR 𝜇 _________________________________________________________ 158
9.2 CONFIDENCE INTERVAL FOR A PROPORTION, P_______________________________________________________ 164
9.3 DETERMINATION OF A SAMPLE SIZE ______________________________________________________________ 169

SECTION 10 CORRELATION AND LINEAR REGRESSION 175


10.1 LINEAR REGRESSIONS ______________________________________________________________________ 176

SECTION 11 CATEGORICAL DATA AND CHI-SQUARE TEST 187


APPENDIX – PROBABILITY TABLES 195
APPENDIX A STANDARD NORMAL DISTRIBUTION ___________________________________________________ 195
APPENDIX B STUDENT T-DISTRIBUTION _________________________________________________________ 196
APPENDIX C CHI-SQUARE DISTRIBUTION ________________________________________________________ 197

Section 1 Data Analysis and descriptive statistics
1.1 Definitions and introduction to data presentation
Individual: statistical unit from which qualitative and/or quantitative data is collected.

Population: the set of all individuals.

Sample: subset of a population.

Variable: aspect or character of an individual that is being studied. A variable is a concept or


property that can take different values (as opposed to data that represents specific values the
variable takes in a given study).

Descriptive statistics: set of techniques, tools and methods used to summarize, organize and
present sample or population data.

Statistic: measure that is calculated from sample data

Parameter: measure that is calculated from population data

Inferential statistics: methods that use statistics to predict or make decisions about
parameters. In other words, inference occurs when sample data is used to predict measures
that characterize the population it was taken from.

1.2 Qualitative variables: non-numerical or categorical data


Nominal data: qualitative data whose order is irrelevant
Ex: Car Fuel, Cégep program, store department, eye color.

Ordinal data: qualitative data whose order is meaningful


Ex: Level of appreciation, Degree obtained (DEC, B.Sc, M.Sc, Ph.D), Military rank.

1.3 Quantitative variables: numerical measurements


Discrete quantitative data: numerical data whose possible values are finite or countable
Ex: Number of children in a household; years of work experience; number of rooms in a home;
number of employees in a business.

Continuous quantitative data: numerical data that can take any of the infinitely many values in
an interval. Note that a discrete variable whose possible values are too numerous is generally
treated as being continuous.
Ex: Salary; height and weight of an individual; time of travel.

1.4 Tables and their content
Frequency table:
A frequency table is used to summarize data, whether it be the results of a discrete variable or
of a grouped continuous variable. To be complete, a frequency table should present the
number of observations for each value of the variable − or each class of values − as well as their
relative frequency and cumulative relative frequency.

Count: total number of individuals who were surveyed. Typically, we use the lowercase letter 𝑛
when referring to a sample and the uppercase letter 𝑁 when referring to a population.

Frequency: number of times a specific value − or a member of a class − of a variable has been
counted in a set of data (𝑛𝑖 ).

Relative frequency: 𝑓𝑖 = 𝑛𝑖 ⁄𝑛, where 𝑛 is the total number of observations1. The relative
frequency represents the proportion of a sample that has a specific value.

Cumulative relative frequency:

𝐹𝑖 = ∑𝑘=1…𝑖 𝑓𝑘

The cumulative relative frequency represents the proportion of the ordered data that has been
tallied up to a specific value.

Example 1.4.1
A survey was conducted on a sample of 500 West Island households. The number of children
per household was the subject of the survey. The collected data is presented in Table 1.4.1
below.

Table 1.4.1
Nb of children (x i) Frequency (n i) Relative frequency (f i) Cumulative relative frequency (F i)
0 87 0.174 0.174
1 136 0.272 0.446
2 158 0.316 0.762
3 72 0.144 0.906
4 23 0.046 0.952
5 15 0.03 0.982
6 9 0.018 1
Total 500
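As a quick check, the last two columns of Table 1.4.1 can be recomputed from the raw counts. A minimal Python sketch (variable names are illustrative):

```python
# Recomputing Table 1.4.1 from the raw counts (n = 500 households)
counts = [87, 136, 158, 72, 23, 15, 9]        # n_i for x_i = 0, 1, ..., 6
n = sum(counts)                               # total count (lowercase n: sample data)

# Relative frequency f_i = n_i / n, rounded to 3 decimals as in the table
rel_freq = [round(n_i / n, 3) for n_i in counts]

# Cumulative relative frequency F_i = f_1 + ... + f_i
cum_rel_freq = []
running = 0.0
for n_i in counts:
    running += n_i / n
    cum_rel_freq.append(round(running, 3))

print(rel_freq)      # matches the f_i column of Table 1.4.1
print(cum_rel_freq)  # matches the F_i column; the last entry is 1.0
```

The last cumulative entry is always 1, since the relative frequencies of all values of the variable must sum to the whole sample.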

1 The same formula is used with 𝑁 rather than 𝑛 when referring to a population.

Data groupings
When data is continuous quantitative (or discrete, but with too many distinct outcomes), one
should group the values into classes. Usually, 5 to 12 classes are appropriate. Having fewer
than 5 classes results in a loss of information. Having more than 12 classes may be viewed
as too much information and can no longer be considered a summary.

Example 1.4.2
A sample of 480 randomly chosen individuals who reside in Granby were asked to reveal their
yearly revenues. The results of the survey are presented in Table 1.4.2 below. As you can
observe, there is no need for classes to be of equal width.

Table 1.4.2
Distribution of annual revenues (Granby, 2021)
Revenues (in thousands of $)   Frequency (𝒏𝒊)   Relative frequency (𝒇𝒊)   Cumulative relative frequency (𝑭𝒊)
0                              6
]0, 25]                        135
]25, 50]                       147
]50, 70]                       78
]70, 100]                      60
]100, 500]                     48
]500, 2000]                    6
Total                          480

1.5 Graphical representations
From a frequency table, a graphical summary can be constructed. The objective of a graph is to
describe visually the behavior of a random variable without exaggerating details nor losing
information. A picture is worth a thousand words… The graph one should use depends on the
type of variable that is being studied. Here are the main types of graphs and in which case they
are most appropriate.

Pie chart:
Although they are ‘cute’, pie charts should be used exclusively to illustrate the relative
frequency of qualitative nominal data. The area covered by each ‘piece of the pie’ must be
proportional to the relative frequency of the observed outcomes.

Example 1.5.1
Figure 1.5.1 shows a pie chart for the distribution, among the different programs, of students
registered in Science differential calculus in the fall semester of 2023.

Figure 1.5.1

Bar graph:
Bar graphs are used to illustrate the results − usually the frequency or relative frequency − of
qualitative or discrete quantitative data. The bars are usually vertical, of equal width and
uniformly spaced.

Example 1.5.2
Refer to Table 1.4.1 which presents the results of a survey concerning the number of children
per household. Figure 1.5.2, shown below, presents the distribution of the results of this survey.

Figure 1.5.2

Histogram:
The graph which is most widely used to describe grouped quantitative data is the histogram. It
differs from a bar graph in two ways: the bars always touch, and their width can vary from one
class to another. Again, no less than 5 and no more than 12 classes should be used to construct
a histogram, none of which should be empty. Since no formal rules about constructing
histograms and how the classes should be divided exist, a variety of histograms can represent
the distribution of the same survey results.

Example 1.5.3
A sample of students currently in the last year of their university program were asked to state
the salary they expect to obtain (in thousands of dollars) once they graduate and find their first
job. The histogram in Figure 1.5.3, shown below, summarizes the data that was compiled.

Figure 1.5.3

Step graphs:
Step graphs are used to represent the cumulative relative frequency of a discrete quantitative
variable.

Example 1.5.4
Refer to Table 1.4.1 which presents the results of a survey concerning the number of children
per household. Figure 1.5.4, on the right, presents the step graph for the cumulative relative
frequencies obtained.

Figure 1.5.4

Ogive graph:
Ogive graphs are used to represent the cumulative relative frequency of a continuous
quantitative variable.

Example 1.5.5
Refer to Table 1.4.2 which presents the distribution of annual revenues of a sample of Granby
residents. The ogive (see Figure 1.5.5) illustrates the cumulative relative frequency of the results
obtained.

Figure 1.5.5

Description of a quantitative variable

1.6 Measures of central tendency


Mode: the value of variable 𝑋 that is most frequently repeated. When a variable is continuous,
we will be interested in identifying the modal class, that is the class that is most frequently
observed.

Median: value of variable 𝑋 that falls in the middle position when the measurements are
ordered from smallest to largest. If data collected has 𝑛 ordered observations, then the median
is…

𝑚𝑒𝑑(𝑋) = 𝑥(𝑛+1)⁄2                          if 𝑛 is odd

𝑚𝑒𝑑(𝑋) = ½ (𝑥𝑛⁄2 + 𝑥𝑛⁄2+1)                if 𝑛 is even
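The two cases of the median formula can be checked against Python's statistics module; a small sketch (sample values are made up):

```python
import statistics

def med(values):
    """Median via the ordered-position formulas: the middle entry when n is
    odd, the average of the two middle entries when n is even."""
    x = sorted(values)                  # measurements ordered smallest to largest
    n = len(x)
    if n % 2 == 1:                      # n odd: entry at rank (n + 1) / 2
        return x[(n + 1) // 2 - 1]      # -1 converts the 1-based rank to an index
    # n even: average of the entries at ranks n/2 and n/2 + 1
    return (x[n // 2 - 1] + x[n // 2]) / 2

odd_data = [7, 1, 5]            # ordered: 1, 5, 7 -> median 5
even_data = [4, 1, 3, 2]        # ordered: 1, 2, 3, 4 -> median 2.5

assert med(odd_data) == statistics.median(odd_data) == 5
assert med(even_data) == statistics.median(even_data) == 2.5
```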

Mean (or average): unlike the mode and median, the mean takes into account the value of
every observation, rather than just its rank or frequency. The general formula must be adapted
to the source and the presentation of the available data.

When the mean is calculated for an entire population (census), we use the symbol 𝜇 to
represent the mean of the variable.

Case 1: If data is available in exhaustive form, that is if all values of variable 𝑋 are enumerated,
the mean is obtained from the formula
𝜇 = (1⁄𝑁) ∑𝑖 𝑥𝑖

Case 2: If the data is discrete and presented in a frequency table, then we must remember
value 𝑥𝑖 is repeated 𝑛𝑖 times. Taking into account that some values are more frequent than
others, then the mean is obtained from
𝜇 = (1⁄𝑁) ∑𝑖 𝑛𝑖 𝑥𝑖

Case 3: If the data is continuous (or treated as such) and grouped into classes, we can only
approximate the mean by considering that all entries within a class are equal to its midpoint (or
equivalently, we assume that the data is distributed uniformly within a class). If 𝑐𝑖 is the center
of the 𝑖 th group, then the mean is obtained from

𝜇 = (1⁄𝑁) ∑𝑖 𝑛𝑖 𝑐𝑖

In the above, we have used 𝜇 to represent the mean of a variable whose values were obtained
from a population of 𝑁 individuals. In the case of sample data, the mean of variable 𝑋 will be
denoted by 𝑥̄. The formulas for the cases mentioned previously are then the following:

𝑥̄ = (1⁄𝑛) ∑𝑖 𝑥𝑖        when all values are enumerated individually

𝑥̄ = (1⁄𝑛) ∑𝑖 𝑛𝑖 𝑥𝑖      when discrete data is summarized in a frequency table

𝑥̄ = (1⁄𝑛) ∑𝑖 𝑛𝑖 𝑐𝑖      when continuous data is summarized in a frequency table
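All three versions are the same computation applied to different presentations of the data. A short Python sketch with made-up numbers:

```python
# Case 1: exhaustive data -- every value is listed
data = [2, 4, 4, 6]
xbar = sum(data) / len(data)

# Case 2: the same data summarized in a frequency table (value -> n_i)
freq = {2: 1, 4: 2, 6: 1}
n = sum(freq.values())
xbar_freq = sum(n_i * x_i for x_i, n_i in freq.items()) / n

# Case 3: grouped data -- each class is represented by its midpoint c_i,
# so this version is only an approximation of the true mean
classes = [((0, 4), 2), ((4, 8), 2)]          # ((lower, upper), n_i)
n_grouped = sum(n_i for _, n_i in classes)
xbar_grouped = sum(n_i * (lo + hi) / 2 for (lo, hi), n_i in classes) / n_grouped

print(xbar, xbar_freq, xbar_grouped)          # all three give 4.0 for this data
```

For this small example the grouped approximation happens to agree with the exact mean; in general it does not, which is the loss of information discussed later for grouped data.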

1.7 Measures of dispersion
Range: distance separating the minimum and maximum values: 𝑅𝑎𝑛𝑔𝑒 = 𝑀𝑎𝑥 − 𝑀𝑖𝑛
Interquartile range: distance separating the first and third quartiles: 𝐼𝑄𝑅 = 𝑄3 − 𝑄1
Variance and standard deviation: In rough terms, the variance measures the average of the
squared deviations of the data relative to its mean.

It is difficult to formulate a fair interpretation of variance given that its units are not those of
the variable, but rather their squares. The standard deviation is obtained from the square root
of the variance; its units are those of the variable.

As you will most likely notice, formulas and notations used (𝜎 versus 𝑠) for variance and
standard deviation vary slightly depending on whether you are using population data or sample
data, or whether the data is grouped.

For population data:

Ungrouped data:
𝑉𝑎𝑟𝑖𝑎𝑛𝑐𝑒 = 𝜎² = (1⁄𝑁) ∑𝑖 (𝑥𝑖 − 𝜇)²          𝜎 = √[(1⁄𝑁) ∑𝑖 (𝑥𝑖 − 𝜇)²]

Discrete data in a frequency table:
𝑉𝑎𝑟𝑖𝑎𝑛𝑐𝑒 = 𝜎² = (1⁄𝑁) ∑𝑖 𝑛𝑖 (𝑥𝑖 − 𝜇)²       𝜎 = √[(1⁄𝑁) ∑𝑖 𝑛𝑖 (𝑥𝑖 − 𝜇)²]

Grouped data (class midpoints 𝑐𝑖):
𝑉𝑎𝑟𝑖𝑎𝑛𝑐𝑒 = 𝜎² = (1⁄𝑁) ∑𝑖 𝑛𝑖 (𝑐𝑖 − 𝜇)²       𝜎 = √[(1⁄𝑁) ∑𝑖 𝑛𝑖 (𝑐𝑖 − 𝜇)²]

For sample data:

Ungrouped data:
𝑉𝑎𝑟𝑖𝑎𝑛𝑐𝑒 = 𝑠² = (1⁄(𝑛−1)) ∑𝑖 (𝑥𝑖 − 𝑥̄)²      𝑠 = √[(1⁄(𝑛−1)) ∑𝑖 (𝑥𝑖 − 𝑥̄)²]

Discrete data in a frequency table:
𝑉𝑎𝑟𝑖𝑎𝑛𝑐𝑒 = 𝑠² = (1⁄(𝑛−1)) ∑𝑖 𝑛𝑖 (𝑥𝑖 − 𝑥̄)²   𝑠 = √[(1⁄(𝑛−1)) ∑𝑖 𝑛𝑖 (𝑥𝑖 − 𝑥̄)²]

Grouped data (class midpoints 𝑐𝑖):
𝑉𝑎𝑟𝑖𝑎𝑛𝑐𝑒 = 𝑠² = (1⁄(𝑛−1)) ∑𝑖 𝑛𝑖 (𝑐𝑖 − 𝑥̄)²   𝑠 = √[(1⁄(𝑛−1)) ∑𝑖 𝑛𝑖 (𝑐𝑖 − 𝑥̄)²]
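The only difference between 𝜎² and 𝑠² is the divisor (𝑁 versus 𝑛 − 1). A quick Python sketch on made-up data, cross-checked against the standard library:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n                          # 40 / 8 = 5.0

sq_dev = sum((x - mean) ** 2 for x in data)   # sum of squared deviations: 32.0

pop_var = sq_dev / n           # divide by N when the data is a population
samp_var = sq_dev / (n - 1)    # divide by n - 1 when the data is a sample

# Cross-check against the standard library implementations
assert pop_var == statistics.pvariance(data) == 4.0
assert abs(samp_var - statistics.variance(data)) < 1e-12

print(pop_var, samp_var, samp_var ** 0.5)     # s is the square root of s^2
```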

Empirical rule
The empirical rule can be used to interpret the meaning – or the magnitude – of a standard
deviation. The rule itself is imperfect and its accuracy depends on the shape of the variable’s
distribution (its accuracy improves whenever the variable’s distribution is relatively symmetric,
unimodal and ‘bell-shaped’).
o Approximately 68.3% of the measurements will lie in the interval 𝜇 ± 1𝜎
o Approximately 95.5% of the measurements will lie in the interval 𝜇 ± 2𝜎
o Approximately 99.7% of the measurements will lie in the interval 𝜇 ± 3𝜎
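For a perfectly bell-shaped (normal) variable, these percentages can be computed exactly from the cumulative distribution function; a sketch using Python's built-in NormalDist (95.5% is the usual rounding of 95.45%):

```python
from statistics import NormalDist

Z = NormalDist()                    # standard normal: mean 0, std. deviation 1

# P(mu - k*sigma <= X <= mu + k*sigma) equals Phi(k) - Phi(-k)
within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} standard deviation(s) of the mean: {p:.4f}")
```

Running this reproduces the 68.3% / 95.5% / 99.7% figures; for real data the rule is only as good as the distribution is symmetric, unimodal and bell-shaped.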

1.8 Measures of position


Minimum: smallest value of variable 𝑋 in the data set
Maximum: greatest value of variable 𝑋 in the data set
𝒛-score: Any value of 𝑋 can be converted into its corresponding 𝑧-score. The 𝑧-score accounts
for the position of 𝑋 relative to the mean, calculated in multiples of the standard deviation.
More specifically, if 𝑥 is the observed value of variable 𝑋, then

𝑧 = (𝑥 − 𝜇) ⁄ 𝜎
As you can tell, the 𝑧-score is
o 0, when 𝑥 = 𝜇,
o positive when 𝑥 > 𝜇,
o negative when 𝑥 < 𝜇.

Example 1.8.1
In a population where variable 𝑋 has mean 𝜇 = 32 and standard deviation 𝜎 = 8, the 𝑧-scores
for 𝑥1 = 34 and 𝑥2 = 20 are

𝑧1 = (𝑥1 − 𝜇)⁄𝜎 = (34 − 32)⁄8 = 0.25

𝑧2 =

𝑧1 = 0.25 means that 𝑥1 = 34 is one quarter of a standard deviation above the mean.
𝑧2 = ______ means that 𝑥2 = 20 is ______ standard deviations ______ the mean.
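The conversion is a one-liner in Python; here it is checked on 𝑥1 = 34 from Example 1.8.1 (the function name is illustrative):

```python
def z_score(x, mu, sigma):
    """Position of x relative to the mean, in multiples of sigma."""
    return (x - mu) / sigma

z1 = z_score(34, 32, 8)
print(z1)     # 0.25: one quarter of a standard deviation above the mean
```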

Percentiles
Textbooks and websites publish a variety of definitions for the percentile. We shall use what is
typically called the interpolation approach. Regardless of the method that is used, it is worth
noting that percentiles are meaningful only when the amount of data at our disposal is
relatively large. The use of percentiles is also more relevant when the data pertains to a
continuous variable. In particular,

1. Percentiles computed from small samples must be taken with a grain of salt: it is more
   relevant in such cases to simply state the rank of observed values.
2. Percentiles are relatively useless for discrete variables given that the frequency table
   already reflects the relative positions of the data.

For 1 ≤ 𝑘 ≤ 99, the 𝑘th percentile of a distribution is a value that is greater or equal to at least
𝑘% of the data and is less than or equal to at least (100 − 𝑘)% of the data.

Notation
We shall often use 𝑃𝑘 (𝑋) to represent the 𝑘th percentile of variable 𝑋.

Interpolation approach
Numerically, if the data consists of 𝑛 values that are ordered from lowest to greatest, then 𝑃𝑘 is
the value that is found at position 𝑘%(𝑛 + 1).

When 𝑘%(𝑛 + 1) is an integer, 𝑃𝑘 is the value that is found in this position.


However, if 𝑘%(𝑛 + 1) is not an integer, then an interpolation between two values will have to
be computed.

Example 1.8.2
Suppose you have access to values for a variable 𝑋 that was obtained from a sample of size 35
(𝑛 = 35), and assume that the data is ordered from smallest to greatest.

The 50th percentile will be the value at rank 50%(35 + 1) = 18. Hence, the 18th entry of the
ordered data will play the role of 𝑃50 .

Likewise, the 9th and 27th entries will play the roles of 𝑃25 and 𝑃75 , respectively, since
25%(35 + 1) = 9 and 75%(35 + 1) = 27 .

The 90th percentile, however, is located at rank 90%(35 + 1) = 32.4, which means that 𝑃90
will actually be located 40% of the distance between the 32nd and 33rd data entries.
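The interpolation approach can be sketched in Python; it mirrors the behavior of Excel's PERCENTILE.EXC. The data (the integers 1 to 35) is chosen so that the value at rank 𝑟 is simply 𝑟:

```python
def percentile_interp(sorted_data, k):
    """k-th percentile by the interpolation approach: rank k%(n + 1)."""
    n = len(sorted_data)
    pos = k / 100 * (n + 1)            # rank within the ordered data
    i = int(pos)                       # integer part: the lower neighboring rank
    frac = pos - i                     # fractional part: distance to the next rank
    if frac == 0:                      # integer rank: take that entry directly
        return sorted_data[i - 1]      # ranks are 1-based, list indices 0-based
    lo, hi = sorted_data[i - 1], sorted_data[i]
    return lo + frac * (hi - lo)       # interpolate between the two entries

# With n = 35 ordered values, the ranks behave as in Example 1.8.2
data = list(range(1, 36))              # 1, 2, ..., 35 (already ordered)
print(percentile_interp(data, 50))     # rank 50%(36) = 18 -> the 18th entry
print(percentile_interp(data, 25))     # rank 25%(36) = 9  -> the 9th entry
print(percentile_interp(data, 90))     # rank 32.4 -> 40% between ranks 32 and 33
```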

Excel has an integrated function to calculate any percentile from a data file. The definition we
have adopted will require that you use the PERCENTILE.EXC function.

Quartiles and median


Quartiles are specific cases of percentiles. The 25th percentile is often referred to as the 1st
quartile and is denoted by 𝑄1. The 75th percentile is also known as the 3rd quartile and denoted
by 𝑄3 . The 50th percentile corresponds to the second quartile, but is better known as the
median. The median, as we discussed earlier, is both an indicator of central tendency and an
indicator of position. All quartiles can be calculated by using the 𝑘th percentile formula seen
previously.

In short:
1st Quartile (𝑸𝟏 ) : value corresponding to the 25th percentile (𝑄1 = 𝑃25 )
Median (med): value corresponding to the 50th percentile (𝑚𝑒𝑑 = 𝑄2 = 𝑃50 )
3rd Quartile (𝑸𝟑 ) : value corresponding to the 75th percentile (𝑄3 = 𝑃75 )

Calculating a percentile from a frequency table


Given that a frequency table presents the cumulative relative frequency of a variable 𝑋,
percentiles can be obtained without having to rewrite an ordered list of the measurements.

Discrete quantitative data


Two situations may occur when observing discrete data:
1. The 𝑘th percentile is the value whose cumulative relative frequency is the first to exceed
𝑘%, or
2. If a value’s cumulative relative frequency is exactly 𝑘%, the 𝑘th percentile is located
between this value and the next one using an interpolation approach.
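Both cases can be expressed in a few lines of Python. The table below the function is hypothetical, chosen so that each case is exercised:

```python
def discrete_percentile(values, cum_rel_freq, k):
    """k-th percentile of a discrete variable, read off its frequency table.

    values: possible values in increasing order
    cum_rel_freq: matching cumulative relative frequencies F_i
    """
    target = k / 100
    for i, F in enumerate(cum_rel_freq):
        if F == target:                # case 2: F_i is exactly k% -> the percentile
            return (values[i] + values[i + 1]) / 2   # sits between this value and the next
        if F > target:                 # case 1: first F_i to exceed k%
            return values[i]

# Hypothetical table: values 1..4 with F = 0.3, 0.5, 0.9, 1.0
vals, F = [1, 2, 3, 4], [0.3, 0.5, 0.9, 1.0]
print(discrete_percentile(vals, F, 20))   # 1   (0.3 is the first F to exceed 0.20)
print(discrete_percentile(vals, F, 50))   # 2.5 (F hits 0.50 exactly at value 2)
```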

Example 1.8.3
Given the frequency table shown below (Table 1.8.1), find the following measures of position:
𝑃20 (𝑋), 𝑚𝑒𝑑(𝑋), 𝑄3 (𝑋), 𝑃80 (𝑋)

Table 1.8.1
Nb of rooms Frequency Relative frequency Cumulative relative frequency
3 3 0.0200 0.0200
4 5 0.0333 0.0533
5 11 0.0733 0.1267
6 20 0.1333 0.2600
7 36 0.2400 0.5000
8 45 0.3000 0.8000
9 18 0.1200 0.9200
10 6 0.0400 0.9600
11 4 0.0267 0.9867
12 2 0.0133 1.0000
Total 150

Answers

Grouped continuous data
When data has been grouped in classes, one must use a process of interpolation to estimate the
position of a given percentile.

𝑃𝑘 = 𝐿𝑐 + ((𝑘% − 𝐹𝑐−1) ⁄ 𝑓𝑐 ) 𝑤𝑐

where
𝑐 is the class where the 𝑘th percentile is found,
𝐿𝑐 is the lower bound of the class containing the 𝑘th percentile,
𝐹𝑐−1 is the cumulative relative frequency at the end of the previous class,
𝑓𝑐 is the relative frequency of the class containing the 𝑘th percentile,
𝑤𝑐 is the width of the class containing the 𝑘th percentile,
and 𝑘% is the desired percentile expressed as a proportion.
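The interpolation formula translates directly into code. A sketch on a hypothetical grouping (bounds and frequencies are made up); 𝑘 is handled as a proportion so that it compares with the cumulative relative frequencies:

```python
def grouped_percentile(classes, k):
    """k-th percentile from grouped data.

    classes: list of (lower_bound, upper_bound, relative_frequency),
    ordered and covering the data; relative frequencies sum to 1.
    """
    target = k / 100
    F_prev = 0.0                       # cumulative relative frequency so far
    for lo, hi, f in classes:
        if F_prev + f >= target:       # the percentile falls in this class
            # P_k = L_c + ((k% - F_{c-1}) / f_c) * w_c
            return lo + (target - F_prev) / f * (hi - lo)
        F_prev += f

# Hypothetical grouping: ]0,10], ]10,20], ]20,30], ]30,40]
classes = [(0, 10, 0.20), (10, 20, 0.30), (20, 30, 0.35), (30, 40, 0.15)]
p40 = grouped_percentile(classes, 40)
print(p40)     # P40 falls in ]10,20]: 10 + (0.40 - 0.20)/0.30 * 10, about 16.67
```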

Example 1.8.4
With the grouped data shown in Table 1.8.2, find the 40th and 80th percentile.

Table 1.8.2

Answers

𝑃40 =

𝑃80 =

Summary Example
The Excel file called ‘Fish size’ contains data concerning the length of a sample of fish randomly
netted in Brome Lake. Researchers want to verify the condition of fish living in this lake and will
repeat this operation every year.
o Find the mean, median, first and third quartiles and standard deviation for this sample.
o Construct a frequency table containing classes of size 10. Produce a histogram of the results
of this survey.
o Find the modal class.
o Use the linear interpolation formula to approximate the 1st and 3rd quartiles from the table.

Answers
Whether it be with the Analysis Toolpak or by using integrated functions, the descriptive
statistics of the sample produce the following results:

Mean Sample Variance Standard Deviation


47.19 878.29 29.64

Minimum Q1 Median Q3 Maximum


1.9 22.575 44.15 74.825 98.3

The frequency table (Table 1.8.3) and histogram (Figure 1.8.1) were obtained by grouping the
quantitative data:

Table 1.8.3
Distribution of fish size
Length (in cm)   Frequency (𝒏𝒊)   Relative frequency (𝒇𝒊)   Cumulative relative frequency (𝑭𝒊)
[0, 10] 11 0.138 0.138
]10, 20] 8 0.100 0.238
]20, 30] 12 0.150 0.388
]30, 40] 5 0.063 0.450
]40, 50] 8 0.100 0.550
]50, 60] 3 0.038 0.588
]60, 70] 8 0.100 0.688
]70, 80] 12 0.150 0.838
]80, 90] 9 0.113 0.950
]90, 100] 4 0.050 1.000
Total 80

Figure 1.8.1

The modal classes are ]20,30] and ]70,80] which both have 15% of the total observations (12
occurrences each).

If the first and third quartiles are evaluated from the grouped data (by linear interpolation), we
find that

𝑄1 = 𝑃25 =

𝑄3 = 𝑃75 =

One can compare the values of 𝑄1 and 𝑄3 obtained by interpolation with the actual values
revealed in the descriptive statistics previously shown. This example shows that grouping
variables does lead to a certain loss of information. Within a class, no one can tell for sure if the
values are distributed uniformly − which is exactly what we assume when performing a linear
interpolation − or if they are all clustered near its upper or lower bound. For this reason, we
recommend the descriptive statistics be computed before any grouping is done.

1.9 Boxplots
It is convenient to visualize a variable’s distribution using a boxplot. The boxplot provides the
first and third quartiles, as well as the median in a central box representation. Lines (usually
referred to as whiskers) are extended at each extremity of the box to illustrate the range of
more extreme values. There is no set rule as to where the whiskers should end but two uses are
generally more common than others:

1. Whiskers end at the minimum and maximum values of the variable.

However, we will be applying the following guidelines:

2. Whiskers end at the minimum and maximum values unless these exceed 1.5 times the
interquartile range from the nearest quartile. Any value beyond these limits are highlighted
by stars and are considered outliers.
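The 1.5 × IQR guideline can be computed directly. A small Python sketch with made-up data, using the exclusive quartile method adopted earlier for percentiles:

```python
import statistics

data = [20, 35, 38, 40, 42, 45, 48, 52, 55, 60, 95, 100]

# Quartiles with the interpolation ("exclusive") method used in this document
q1, med, q3 = statistics.quantiles(data, n=4, method='exclusive')
iqr = q3 - q1

low_fence = q1 - 1.5 * iqr        # whiskers may not extend past these fences
high_fence = q3 + 1.5 * iqr

# Values beyond the fences are flagged as outliers (the starred points)
outliers = [x for x in data if x < low_fence or x > high_fence]

# Whiskers end at the most extreme values still inside the fences
whisker_lo = min(x for x in data if x >= low_fence)
whisker_hi = max(x for x in data if x <= high_fence)

print(q1, med, q3, outliers)      # 38.5 46.5 58.75 [95, 100]
```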

Example 1.9.1
Draw the boxplot for a variable 𝑋 given that it has 2 outliers and is such that
𝑄1 (𝑋) = 40 min(𝑋) = 20
𝑄3 (𝑋) = 60 med(𝑋) = 45
1.5𝐼𝑄𝑅 = 1.5(60 − 40) = 30 max(𝑋) = 100

Recall: by convention, whiskers should be no further than 1.5𝐼𝑄𝑅 from the nearest quartile:

1.10 Description of a two-variable situation
When conducting a survey, more than one character is frequently studied. For example, the
salary and years of education of the individuals in a sample. Pairing data is commonly done,
especially when one wishes to establish a relationship, or a potential dependence between
these characters.

Two-variable tables
Contingency table
Consider 𝑋 and 𝑌, two characters that are being studied. A table containing the frequency, 𝑛𝑖𝑗 ,
for every possible pair of data (𝑥𝑖 , 𝑦𝑗 ) is called a contingency table. Thus, a contingency table
presents the number of individuals that possess both the characters 𝑋 = 𝑥𝑖 and 𝑌 = 𝑦𝑗 .

Example 1.10.1
A company is examining its policies of equity towards women. The average salary of male
employees has been found to be greater than that of female employees. Some speculate this
could simply be due to the difference in educational level. Table 1.10.1, shown below, was
produced using the human resources’ database. Its content pertains to the educational
backgrounds of men and women. Let 𝑋 represent the sex of the employee (M or F) and let 𝑌
represent the highest diploma obtained by the employee.

Table 1.10.1
Highest Diploma
Sex High School Cégep Undergrad Graduate Total
F 9 11 5 12 37
M 10 12 10 10 42
Total 19 23 15 22 79

The contingency table can then be converted into many other forms, depending on the manner
with which you prefer or choose to display information.

Joint relative frequency


A joint relative frequency table presents the proportion of individuals that possess both the
characters 𝑋 = 𝑥𝑖 and 𝑌 = 𝑦𝑗 . The sum of the values contained in such a table is always 1 (or
100 % if the relative frequencies are expressed as percentages). It is obtained from the
contingency table when dividing all entries by the total frequency.

Example 1.10.2
The joint relative frequencies of Sex vs Diploma among the employees of the company are
shown in Table 1.10.2, below.

Table 1.10.2
Highest Diploma
Sex High School Cégep Undergrad Graduate Total
F 11.4% 13.9% 6.3% 15.2% 46.8%
M 12.7% 15.2% 12.7% 12.7% 53.2%
Total 24.1% 29.1% 19.0% 27.8% 100.0%

Conditional relative frequency table


A conditional relative frequency table presents the proportion of individuals by row (or by
column). In other words, the entries of the contingency table are now divided by the total row
(or column) frequencies. In a conditional table, the row (or column) total will be 1 (or 100 %).

Example 1.10.3
The relative frequencies of Diploma conditional to Sex are shown in Table 1.10.3.

Table 1.10.3
Highest Diploma
Sex High School Cégep Undergrad Graduate Total
F 24.3% 29.7% 13.5% 32.4% 100.0%
M 23.8% 28.6% 23.8% 23.8% 100.0%
Total 24.1% 29.1% 19.0% 27.8% 100.0%
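Converting a contingency table into joint or conditional relative frequencies is just a matter of choosing the divisor (the grand total, or a row total). A minimal Python sketch using the counts of Table 1.10.1:

```python
# Counts from Table 1.10.1 (rows: sex; columns: High School, Cegep,
# Undergrad, Graduate)
table = {'F': [9, 11, 5, 12], 'M': [10, 12, 10, 10]}

grand_total = sum(sum(row) for row in table.values())        # 79 employees

# Joint relative frequency: divide every entry by the grand total
joint = {sex: [n / grand_total for n in row] for sex, row in table.items()}

# Conditional on Sex: divide each entry by its own row total
conditional = {sex: [n / sum(row) for n in row] for sex, row in table.items()}

print(round(100 * joint['F'][0], 1))        # 9/79 -> 11.4, as in Table 1.10.2
print(round(100 * conditional['F'][0], 1))  # 9/37 -> 24.3, as in Table 1.10.3
```

Conditioning by column instead (dividing by the column totals) produces the table of Sex conditional to Diploma.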

Example 1.10.4
The relative frequencies of Sex conditional to Diploma are shown in Table 1.10.4.

Table 1.10.4
Highest Diploma
Sex High School Cégep Undergrad Graduate Total
F
M
Total

1.11 Relationship between two variables

Functional dependence
Consider 𝑋 and 𝑌 two variables. There exists a functional dependence between X and Y if there
is a function 𝑓 such that 𝑌 = 𝑓(𝑋).

For example, let 𝑋 represent the number of hours an individual works in a week and 𝑌, the
weekly salary. If this individual's base pay is $100, to which an hourly rate of $20 is added,
then the variables 𝑌 and 𝑋 are related by the function 𝑌 = 20𝑋 + 100. There is a functional
dependence (linear, in this example) between the variable "number of work hours in a week"
and the variable "weekly salary".

Scatter Plot (Cartesian representation of paired quantitative data)


Excel possesses graphical options which allow one to produce two-dimensional scatter plots. This type
of graphical representation is an excellent visual aid in determining whether a relationship exists
between two quantitative variables. The alignment of the dots may even allow one
to recognize the form of the dependence (linear, quadratic, exponential).
Linear relationship
A linear dependence between two quantitative variables means that a linear function could
connect them. In other words, can the experimental observations of the
variables 𝑋 and 𝑌 be explained, or even predicted, by a function of the form

𝑌 = 𝑎𝑋 + 𝑏 ?

When the alignment is imperfect, we usually speak of a ‘linear relationship’: we must avoid
assuming that there is a direct dependence relation or, worse yet, that a cause-and-effect
dependence exists between the variables. The apparent alignment could be coincidental, or
the result of a confounding factor.

Visually, the linear dependence of two variables is observed on a dot plot when the dots line up
perfectly. In practice, this seldom happens. However, to a certain extent, a linear dependence,
weak or strong, can be observed.

26
Example 1.11.1
The dot plot in Figure 1.11.1 was obtained by graphing the pairs of observed values
(area, price). There is a relatively clear trend between the variables, since increasing
values of area generally tend to result in increasing values of price.

[Figure 1.11.1: Dot plot of price vs area (horizontal axis: 1st fl. area; vertical axis: Price)]

Example 1.11.2
Linear dependence is not evident in the dot plot (Figure 1.11.2) of alkalinity vs depth of
lakes. There is no apparent alignment of the data points, indicating a very weak
relationship between the factors "depth" and "alkalinity".

[Figure 1.11.2: Dot plot of alkalinity vs depth (horizontal axis: depth; vertical axis: alkalinity)]

27
1.12 Covariance and correlation
The covariance of two variables, 𝑋 and 𝑌, is defined by either:

𝐶𝑜𝑣(𝑋, 𝑌) = (1⁄𝑁) ∑ᵢ₌₁ᴺ (𝑥𝑖 − 𝜇𝑋 )(𝑦𝑖 − 𝜇𝑌 )   for data obtained from a population

𝑐𝑜𝑣(𝑋, 𝑌) = (1⁄(𝑛 − 1)) ∑ᵢ₌₁ⁿ (𝑥𝑖 − 𝑥̅ )(𝑦𝑖 − 𝑦̅ )   for data obtained from a sample

where 𝑋 = 𝑥𝑖 and 𝑌 = 𝑦𝑖 are quantitative measures obtained from individual 𝑖 (in listed data).

Excel has integrated functions (COVARIANCE.P and COVARIANCE.S) that quickly calculate the
covariance of two-dimensional variables.

If the data was summarized using a contingency table, the frequency of 𝑋, 𝑌 pairs must be
accounted for when calculating covariance. More specifically,

𝐶𝑜𝑣(𝑋, 𝑌) = (1⁄𝑁) ∑ᵢ,ⱼ 𝑛𝑖𝑗 (𝑥𝑖 − 𝜇𝑋 )(𝑦𝑗 − 𝜇𝑌 )   for data obtained from a population

𝑐𝑜𝑣(𝑋, 𝑌) = (1⁄(𝑛 − 1)) ∑ᵢ,ⱼ 𝑛𝑖𝑗 (𝑥𝑖 − 𝑥̅ )(𝑦𝑗 − 𝑦̅ )   for data obtained from a sample

Here, 𝑛𝑖𝑗 is the frequency for a specific pair of responses (𝑥𝑖 , 𝑦𝑗 ).

Note that if a contingency table presents quantitative data that is continuous or grouped,
whether it be for variable 𝑋, for 𝑌 or for both variables, the center of classes may be used to
replace 𝑥𝑖 or 𝑦𝑗 , whenever necessary, to compute the covariance.
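The listed and frequency-weighted covariance formulas can be sketched in plain Python; the data values used in the checks below are made up purely for illustration.

```python
def cov_population(xs, ys):
    """Population covariance: divide the sum of cross-deviations by N."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def cov_sample(xs, ys):
    """Sample covariance: divide by n - 1 instead of n."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def cov_population_grouped(freq):
    """Population covariance from a contingency table,
    where freq maps a pair (x_i, y_j) to its frequency n_ij."""
    N = sum(freq.values())
    mx = sum(n * x for (x, _), n in freq.items()) / N
    my = sum(n * y for (_, y), n in freq.items()) / N
    return sum(n * (x - mx) * (y - my) for (x, y), n in freq.items()) / N

# Made-up data where Y = 2X exactly: the covariance is positive.
xs, ys = [1, 2, 3, 4], [2, 4, 6, 8]
assert cov_population(xs, ys) == 2.5
assert abs(cov_sample(xs, ys) - 10 / 3) < 1e-12
```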

28
1.13 Linear correlation coefficient
The coefficient of linear correlation also known as Pearson's coefficient of correlation, denoted
by 𝜌(𝑋, 𝑌), is closely related to covariance. For population data, Pearson’s coefficient is
obtained using
𝜌(𝑋, 𝑌) = 𝐶𝑜𝑣(𝑋, 𝑌) ⁄ (𝜎𝑋 𝜎𝑌 )

If the data is gathered from a sample, the coefficient is calculated using

𝑟(𝑋, 𝑌) = 𝑐𝑜𝑣(𝑋, 𝑌) ⁄ (𝑠𝑋 𝑠𝑌 )

𝑟(𝑋, 𝑌) provides an estimate of 𝜌(𝑋, 𝑌), the true population correlation.

More easily interpreted than covariance, Pearson's coefficient is a measure of the intensity of
the linear relationship between the variables 𝑋 and 𝑌. Here are the main properties of 𝜌 (and
likewise for 𝑟):

Properties of the linear correlation coefficient


o If 𝑌 = 𝑎𝑋 + 𝑏 where 𝑎 > 0, then 𝜌(𝑋, 𝑌) = 1
o If 𝑌 = 𝑎𝑋 + 𝑏 where 𝑎 < 0, then 𝜌(𝑋, 𝑌) = −1
o −1 ≤ 𝜌(𝑋, 𝑌) ≤ 1

In essence, we could interpret the coefficient of correlation by


1. its sign: the correlation coefficient is positive when, in general, the scatter plot’s trend is to
increase, and negative when a decreasing trend is observed,
2. its size: the closer the correlation coefficient is to 1 or to −1, the closer the scatter plot is to
being perfectly aligned.
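These properties can be illustrated with a short Python sketch of the sample coefficient 𝑟; the data in the checks is invented for illustration.

```python
def pearson_r(xs, ys):
    """Sample linear correlation coefficient: r = cov(X, Y) / (s_X * s_Y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    sx = (sum((x - mx) ** 2 for x in xs) / (n - 1)) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / (n - 1)) ** 0.5
    return cov / (sx * sy)

# Perfect increasing line (Y = 2X + 3): r = 1.
assert abs(pearson_r([1, 2, 3], [5, 7, 9]) - 1.0) < 1e-12
# Perfect decreasing line: r = -1.
assert abs(pearson_r([1, 2, 3], [9, 7, 5]) + 1.0) < 1e-12
```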

29
Here are examples of scatter plots and their corresponding coefficients of correlation.

It is worth mentioning that a correlation coefficient that is close to, or equal to, 0 does not
mean that variables 𝑋 and 𝑌 have no relationship. The scatter plot may indeed show little or no
particular trend, but it may also have a very strong trend that is non-linear.

For instance, the scatter plot shown in Figure 1.13.1 has 𝜌(𝑋, 𝑌) = 0. However,
there is clearly a strong relationship, be it non-linear, between 𝑋 and 𝑌.

[Figure 1.13.1]

30
Example 1.13.1

Figure 1.13.2 shows a scatter plot for a sample of homes, where the variables under study are:

𝑋 = “𝑎𝑟𝑒𝑎 𝑜𝑓 𝑡ℎ𝑒 1𝑠𝑡 𝑓𝑙𝑜𝑜𝑟”
𝑌 = “𝑝𝑟𝑖𝑐𝑒 𝑜𝑓 𝑡ℎ𝑒 ℎ𝑜𝑢𝑠𝑒”

𝑟(𝑋, 𝑌) = 0.781

[Figure 1.13.2: Dot plot of price vs area]

Example 1.13.2

Figure 1.13.3 shows a scatter plot for a sample of lakes in Quebec, where the variables under study are:

𝑋 = “𝑑𝑒𝑝𝑡ℎ 𝑜𝑓 𝑡ℎ𝑒 𝑙𝑎𝑘𝑒”
𝑌 = “𝑎𝑙𝑘𝑎𝑙𝑖𝑛𝑖𝑡𝑦 𝑜𝑓 𝑡ℎ𝑒 𝑤𝑎𝑡𝑒𝑟”

𝑟(𝑋, 𝑌) = −0.021

[Figure 1.13.3: Dot plot of alkalinity vs depth]

31
Section 2 Introduction to probability
2.1 Basic definitions
Random Experiment: A random experiment is an experiment or a process for which the
outcome is the result of chance.

Simple Event (𝝎): A simple event is one of the potential outcomes of a random experiment.

Sample Space (𝛀): The sample space consists of all possible outcomes of a random experiment.

Event: An event is a subset of the sample space. An uppercase letter is typically used to label an
event.

Cardinality of an event: The cardinality of an event is the number of simple events that it is
formed of. We shall use 𝑛(𝐸) to represent the cardinality of event 𝐸.

Example 2.1.1
‘A fair die is thrown and its outcome is noted’ is an example of a random experiment. The list
of possible outcomes is known, but one cannot tell for sure which one will arise on any given
trial.

The sample space is Ω = {1, 2, 3 , 4, 5 , 6}

The individual outcomes are called simple events: 𝜔1 = 1 , 𝜔2 = 2 , … , 𝜔6 = 6 .

If the die is truly fair, every one of these simple events is equally likely.

𝐴 = ‘Obtaining an even number’ is an event of Ω. In listed form, 𝐴 = {2, 4, 6 }.

The cardinality of 𝐴 is 𝑛(𝐴) = 3 whereas the cardinality of Ω is 𝑛(Ω) = 6 .

32
Representation through Venn diagrams
Venn diagrams are a useful and simple way of visualizing the sample space and events that are
contained within it. Although this is rarely done explicitly, simple events of a given event can
also appear in the diagram.

Probability
Consider a random experiment whose sample space, Ω, is composed of simple events that are
all equally likely. If 𝐴 represents an event, then the probability of event 𝐴 is denoted by 𝑃(𝐴),
where
𝑃(𝐴) =

Example 2.1.2
A fair die is thrown and its outcome is noted. As mentioned earlier, the sample space is
Ω = {1, 2, 3, 4, 5, 6}
and the simple events are equally likely.

Let 𝐴 represent the event, ‘Obtaining an even number’.

The cardinality of 𝐴 is 𝑛(A) = 3 whereas the cardinality of Ω is 𝑛(Ω) = 6.


So, the probability that 𝐴 will occur is

3 1
𝑃(𝐴) = = = 0.5 = 50%
6 2

33
Basic properties of the probability

o For all events, 𝐴 ⊆ Ω, 0 ≤ 𝑃(𝐴) ≤ 1


o An event, 𝐴, is certain if and only if 𝑃(𝐴) = 1
o An event, 𝐴, is impossible if and only if 𝑃(𝐴) = 0

Example 2.1.3
Two regular dice are thrown and their respective outcomes are noted.

This time, a pair of outcomes, one for each die, is required to list all of the experiment’s simple
events. The simple events would be:

𝜔1 = (1,1), 𝜔2 = (1,2), … , 𝜔35 = (6,5), 𝜔36 = (6,6)

Hence, the sample space is Ω = {(1,1), (1,2), (1,3), … , (6,4), (6,5), (6,6)}

There are 36 possible outcomes to this random experiment, which we denote as follows:
𝑛(Ω) = 36.

Now consider the event 𝐴 = ‘Obtaining a sum of 6’. In listed form,


𝐴 = {(1,5), (2,4), (3,3), (4,2), (5,1)}

The cardinality of 𝐴 is 𝑛(A) = 5 . So,


5
𝑃(𝐴) =
36

𝐵 = ‘Obtaining a sum that is greater than 1’ is an event of Ω that is certain to occur, hence
𝑃(𝐵) = 1

𝐶 = ‘Obtaining a sum that is greater than 13’ is an impossible event of Ω, and therefore
𝑃(𝐶) = 0
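The counts in this example can be verified by brute-force enumeration of the sample space; a short Python sketch:

```python
from itertools import product

# Sample space of the two-dice experiment: all 36 equally likely pairs.
omega = list(product(range(1, 7), repeat=2))

# A = 'obtaining a sum of 6'
A = [w for w in omega if sum(w) == 6]

# B = 'sum greater than 1' (certain); C = 'sum greater than 13' (impossible).
B = [w for w in omega if sum(w) > 1]
C = [w for w in omega if sum(w) > 13]

assert len(omega) == 36 and len(A) == 5   # P(A) = 5/36
assert len(B) == 36                       # P(B) = 1
assert len(C) == 0                        # P(C) = 0
```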

34
2.2 Operations on events

Complement of an event
If 𝐴 is an event of Ω, its complement is Figure 2.2.1
denoted by 𝐴̅ and contains all simple
events of Ω that are not contained in 𝐴. In
other words, 𝐴̅ is observed whenever 𝐴 is
not. The shaded region in the Venn
diagram (Figure 2.2.1) shown here depicts
𝐴̅ .

Probability of the complement of an event

𝑃(𝐴̅ ) = 𝑛(𝐴̅ )⁄𝑛(Ω) =        ⁄𝑛(Ω) =        = 1 − 𝑃(𝐴)

Property 𝑃(𝐴̅ ) = 1 − 𝑃(𝐴)

Example 2.2.1
Two regular dice are thrown and their respective outcomes are noted.

The sample space is Ω = {(1,1), (1,2), (1,3), … , (6,4), (6,5), (6,6)}. All simple events listed
in Ω are equally likely.
𝑛(Ω) = 36

Consider the event 𝐴 = ‘Obtaining a sum of 6’. In listed form,


𝐴 = {(1,5), (2,4), (3,3), (4,2), (5,1)}

So, 𝑛(𝐴) = 5 ⇒ 𝑃(𝐴) = 𝑛(𝐴)⁄𝑛(Ω) = 5⁄36

The complement of 𝐴 is, 𝐴̅ = ‘Obtaining a sum that differs from 6’.


𝐴̅ consists of all simple events in Ω = {(1,1), (1,2), (1,3), … (6,4), (6,5), (6,6)} except for
those that are elements of 𝐴 = {(1,5), (2,4), (3,3), (4,2), (5,1)}.

So, 𝑛(𝐴̅) = ⇒ 𝑃(𝐴̅ ) =

35
Intersection of events
If 𝐴 and 𝐵 are events of Ω, their Figure 2.2.2
intersection is denoted by 𝐴 ∩ 𝐵 and
consists of all simple events of Ω that are
elements of both 𝐴 and 𝐵.
The shaded region in the Venn diagram
(Figure 2.2.2) depicts the event 𝐴 ∩ 𝐵.

Probability of the intersection

𝑃(𝐴 ∩ 𝐵) = 𝑛(𝐴 ∩ 𝐵)⁄𝑛(Ω)

Example 2.2.2
Two regular dice are thrown and their respective outcomes are noted.
Consider the events 𝐴 = ‘Obtaining a sum of 6’, and
𝐵 = ‘The dice reveal the same number’

In listed form: 𝐴={ }


𝐵={ }

𝐴∩𝐵 ={ }

𝑃(𝐴 ∩ 𝐵) = 𝑛(𝐴 ∩ 𝐵)⁄𝑛(Ω) =

36
Special Cases
𝐴 and 𝐵, as shown in Figure 2.2.3, are Figure 2.2.3
mutually exclusive events when no simple
events lie in their intersection. 𝐴 and 𝐵 are
mutually exclusive when 𝐴 ∩ 𝐵 is an
impossible event. In other words,
𝐴 ∩ 𝐵 = { ∅ } and, in such cases,
𝑃(𝐴 ∩ 𝐵) = 0.

𝐴 is included in 𝐵 or, if you prefer, 𝐵 Figure 2.2.4


includes 𝐴 when all simple events of event
𝐴 also lie in 𝐵. Figure 2.2.4 illustrates such a
situation. In other words, 𝐴 ∩ 𝐵 = 𝐴 and,
therefore, 𝑃(𝐴 ∩ 𝐵) = 𝑃(𝐴).

Union of events
If 𝐴 and 𝐵 are events of Ω, their union is Figure 2.2.5
denoted by 𝐴 ∪ 𝐵 and consists of all simple
events of Ω that are elements of 𝐴, of 𝐵 or
of both 𝐴 and 𝐵. The shaded region in the
Venn diagram (Figure 2.2.5) depicts the
event 𝐴 ∪ 𝐵.

37
Probability of the union of events

𝑃(𝐴 ∪ 𝐵) = 𝑛(𝐴 ∪ 𝐵)⁄𝑛(Ω)

Although we may simply think of a union as ‘combining’ events 𝐴 and 𝐵, it is important not to
simply add their respective probabilities. It is usually false to assume that 𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) +
𝑃(𝐵) because 𝑛(𝐴 ∪ 𝐵) is rarely equal to 𝑛(𝐴) + 𝑛(𝐵).

An appropriate computation of 𝑛(𝐴 ∪ 𝐵) should take into account the fact that some simple
events of the union are shared by both 𝐴 and 𝐵. Simple events counted in 𝐴 that also lie in 𝐵
require an adjustment (typically called the inclusion-exclusion rule), as shown below:

𝑃(𝐴 ∪ 𝐵) = 𝑛(𝐴 ∪ 𝐵)⁄𝑛(Ω) =

Property 𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) − 𝑃(𝐴 ∩ 𝐵)

38
Example 2.2.3
Two regular dice are thrown and their respective outcomes are noted.
Consider the events 𝐴 = ‘Obtaining a sum of 6’, and
𝐵 = ‘The dice reveal the same number’

In listed form: 𝐴 = {(1,5), (2,4), (3,3), (4,2), (5,1)}


𝐵 = {(1,1), (2,2), (3,3), (4,4), (5,5), (6,6)}

Then, the union of 𝐴 and 𝐵 is (in listed form):

𝐴 ∪ 𝐵 = {(1,5), (2,4), (3,3), (4,2), (5,1), (1,1), (2,2), (4,4), (5,5), (6,6)}

𝑃(𝐴 ∪ 𝐵) = 𝑛(𝐴 ∪ 𝐵)⁄𝑛(Ω) = 10⁄36 = 5⁄18

Likewise, using the inclusion-exclusion rule:

𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) − 𝑃(𝐴 ∩ 𝐵) =

Special Cases
o If 𝐴 and 𝐵 are mutually exclusive (see Figure 2.2.3), then 𝐴 ∩ 𝐵 = { ∅ } ⇒ 𝑃(𝐴 ∩ 𝐵) = 0.

Then,
𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) − 𝑃(𝐴 ∩ 𝐵), where 𝑃(𝐴 ∩ 𝐵) = 0. Hence,

𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵)

o If 𝐵 includes 𝐴 (as in Figure 2.2.4) all simple events of 𝐴 are already counted among those of
𝐵. As a result,
𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐵)

Of course, all the operations stated above may need to be combined…
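The operations of this section map directly onto Python set operations, which makes brute-force checks easy; a sketch on the two-dice sample space:

```python
from itertools import product

omega = set(product(range(1, 7), repeat=2))   # two-dice sample space

A = {w for w in omega if sum(w) == 6}         # 'sum of 6'
B = {w for w in omega if w[0] == w[1]}        # 'the dice reveal the same number'

def P(E):
    """Probability of an event when all simple events are equally likely."""
    return len(E) / len(omega)

# Complement: P(A-bar) = 1 - P(A).
assert abs(P(omega - A) - (1 - P(A))) < 1e-12

# Intersection, and the inclusion-exclusion rule for the union.
assert (A & B) == {(3, 3)}
assert abs(P(A | B) - (P(A) + P(B) - P(A & B))) < 1e-12

# De Morgan: the complement of the union is the intersection of the complements.
assert omega - (A | B) == (omega - A) & (omega - B)
```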

39
Example 2.2.4
Assuming 𝐴 and 𝐵 are not mutually exclusive, use a Venn Diagram to illustrate the event
𝐵 ∩ 𝐴̅ .

Based on your diagram, how would you


compute 𝑃(𝐵 ∩ 𝐴̅ )?

Example 2.2.5
Assuming that event 𝐶 includes 𝐷, use a Venn Diagram to illustrate the event 𝐶 ∪ 𝐷.

Based on your diagram, how would you


compute 𝑃( 𝐶 ∪ 𝐷)?

Example 2.2.6
Use Venn diagrams to prove De Morgan’s identities:

(𝐴 ∪ 𝐵)̅ = 𝐴̅ ∩ 𝐵̅          (𝐴 ∩ 𝐵)̅ = 𝐴̅ ∪ 𝐵̅

40
Example 2.2.7
Two regular dice are thrown and their respective outcomes are noted and consider the events:
𝐴 = ‘Obtaining a sum of 6’
𝐵 = ‘The dice show the same number’

What is the probability that neither 𝐴 nor 𝐵 will occur?

Recall (from earlier examples) that we have already found that:

𝑃(𝐴) = 5⁄36        𝑃(𝐴 ∪ 𝐵) = 5⁄18
𝑃(𝐵) = 1⁄6         𝑃(𝐴 ∩ 𝐵) = 1⁄36

The expression ‘neither 𝐴 nor 𝐵 will occur’ can be translated into .

From De Morgan’s identities, we know that .

Therefore:

𝑃(𝐴̅ ∩ 𝐵̅ ) =

41
2.3 Conditional probability and independence
Occasionally, an event is more (or less) plausible to occur when another event is known (or
assumed) to have occurred. In this section, we will examine scenarios where such a
phenomenon occurs, as well as situations where the probability of observing a given event is
unaffected by the occurrence of another.

Conditional probability
We shall denote by 𝑃(𝐴|𝐵) the probability of event 𝐴 conditional to 𝐵;
It is assumed, in this case, that event 𝐵 is known, or assumed, to have occurred;
We can read 𝑃(𝐴 |𝐵) as, “the probability that 𝐴 will occur if/given that/assuming that 𝐵 has
occurred”.

A conditional probability can be computed using either of the rules shown below:
𝑃(𝐴|𝐵) = 𝑛(𝐴 ∩ 𝐵)⁄𝑛(𝐵)   (when all simple events are equally likely)

𝑃(𝐴|𝐵) = 𝑃(𝐴 ∩ 𝐵)⁄𝑃(𝐵)

Example 2.3.1
Two regular dice are thrown and their respective outcomes are noted.
Consider the events 𝐴 = ‘Obtaining a sum of 6’, and
𝐵 = ‘The dice show the same number’

In listed form: 𝐴 = {(1,5), (2,4), (3,3), (4,2), (5,1)}


𝐵 = {(1,1), (2,2), (3,3), (4,4), (5,5), (6,6)}

𝐴 ∩ 𝐵 = {(3,3)} ⟹ 𝑃(𝐴 ∩ 𝐵) = 𝑛(𝐴 ∩ 𝐵)⁄𝑛(Ω) = 1⁄36

The probability of 𝐴 conditional to 𝐵 is

𝑃(𝐴|𝐵) = 𝑛(𝐴 ∩ 𝐵)⁄𝑛(𝐵) = 1⁄6   or   𝑃(𝐴|𝐵) = 𝑃(𝐴 ∩ 𝐵)⁄𝑃(𝐵) = (1⁄36)⁄(6⁄36) = 1⁄6

Recall that 𝑃(𝐴) = 5⁄36.

42
Given that 𝑃(𝐴|𝐵) = 1⁄6 > 𝑃(𝐴), we gather that 𝐵’s occurrence has increased the chances of 𝐴
occurring as well.

It is important to note that events involved in a conditional probability are generally not
interchangeable:

For example, the probability of 𝐵 conditional to 𝐴 is

𝑃(𝐵|𝐴) = 𝑛(𝐵 ∩ 𝐴)⁄𝑛(𝐴) = 1⁄5   or   𝑃(𝐵|𝐴) = 𝑃(𝐵 ∩ 𝐴)⁄𝑃(𝐴) = (1⁄36)⁄(5⁄36) = 1⁄5

As we can observe, 𝑃(𝐵|𝐴) ≠ 𝑃(𝐴|𝐵)

Example 2.3.2
Two regular dice are thrown and their respective outcomes are noted.
Consider the events 𝐶 = ‘The first die reveals a 3’, and
𝐷 = ‘Obtaining a sum of 7’

In listed form: 𝐶 = {(3,1), (3,2), (3,3), (3,4), (3,5), (3,6)}


𝐷 = {(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)}

The intersection of events 𝐶 and 𝐷 consists of a unique simple event: 𝐶 ∩ 𝐷 = {(3,4)}

Thus, 𝑃(𝐶) = 6⁄36, 𝑃(𝐷) = 6⁄36, 𝑃(𝐶 ∩ 𝐷) = 1⁄36

The probability of 𝐶 conditional to 𝐷 is

𝑃(𝐶|𝐷) = 𝑛(𝐶 ∩ 𝐷)⁄𝑛(𝐷) = 1⁄6   or   𝑃(𝐶|𝐷) = 𝑃(𝐶 ∩ 𝐷)⁄𝑃(𝐷) = (1⁄36)⁄(1⁄6) = 1⁄6

Likewise, the probability of 𝐷 conditional to 𝐶 is


𝑃(𝐷|𝐶) = 𝑛(𝐷 ∩ 𝐶)⁄𝑛(𝐶) = 1⁄6   or   𝑃(𝐷|𝐶) = 𝑃(𝐷 ∩ 𝐶)⁄𝑃(𝐶) = (1⁄36)⁄(1⁄6) = 1⁄6

Observe how, in the last examples, 𝑃(𝐶|𝐷) = 𝑃(𝐶) and 𝑃(𝐷|𝐶) = 𝑃(𝐷) which means that the
probability of 𝐶 occurring is unaffected by the knowledge that 𝐷 has occurred, and vice versa.
In other words, 𝐶 and 𝐷 are independent events.
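Both examples can be replayed by enumeration; a Python sketch using the counting rule 𝑃(𝐴|𝐵) = 𝑛(𝐴 ∩ 𝐵)⁄𝑛(𝐵):

```python
from itertools import product

omega = set(product(range(1, 7), repeat=2))   # two-dice sample space

def P(E):
    return len(E) / len(omega)

def P_cond(A, B):
    """P(A|B) = n(A ∩ B) / n(B), valid when all outcomes are equally likely."""
    return len(A & B) / len(B)

A = {w for w in omega if sum(w) == 6}     # sum of 6
B = {w for w in omega if w[0] == w[1]}    # doubles
C = {w for w in omega if w[0] == 3}       # first die reveals a 3
D = {w for w in omega if sum(w) == 7}     # sum of 7

assert P_cond(A, B) == 1 / 6 and P(A) == 5 / 36   # knowing B raises A's chances
assert P_cond(C, D) == P(C)                       # C and D are independent
assert abs(P(C & D) - P(C) * P(D)) < 1e-12        # so P(C ∩ D) = P(C) P(D)
```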

43
Example 2.3.3
Show that the following statement is true:
Suppose that 𝐴 and 𝐵 represent events such that 𝑃(𝐴) ≠ 0 and 𝑃(𝐵) ≠ 0. If
𝑃(𝐴|𝐵) = 𝑃(𝐴), then 𝑃(𝐵|𝐴) = 𝑃(𝐵).

Answer

44
Independent events

Definition: Events 𝐴 and 𝐵, with 𝑃(𝐴) ≠ 0 and 𝑃(𝐵) ≠ 0, are independent whenever
𝑃(𝐴|𝐵) = 𝑃(𝐴) or 𝑃(𝐵|𝐴) = 𝑃(𝐵)

Property Events 𝐴 and 𝐵, with 𝑃(𝐴) ≠ 0 and 𝑃(𝐵) ≠ 0, are independent if and only if
𝑃(𝐴 ∩ 𝐵) = 𝑃(𝐴)𝑃(𝐵).

a) Demonstrate this identity.

b) Show that if 𝐴 and 𝐵 are independent, then 𝐴 and 𝐵̅ are also independent.

Important: Do not confuse exclusive events and independent events…

45
Events that are mutually exclusive cannot occur at the same time. Independence, on the other
hand, does not prevent another event from happening… in fact, the occurrence of one event has
no bearing on the other’s chances of occurring.

Example 2.3.4
Consider events 𝐴, 𝐵, 𝐶 such that
𝑃(𝐴) = 1⁄4     𝑃(𝐴 ∩ 𝐵) = 1⁄18
𝑃(𝐵) = 1⁄5     𝑃(𝐴 ∩ 𝐶) = 1⁄24
𝑃(𝐶) = 1⁄6     𝑃(𝐵 ∩ 𝐶) = 0

a) Find 𝑃(𝐵|𝐴)

b) Find 𝑃(𝐴|𝐶)

c) 𝐴 and 𝐶 are independent. Is this also true of 𝐴 and 𝐶̅ ?

d) Find 𝑃(𝐵|𝐶)

46
2.4 Partitioning and Bayes’ Law
Consider a random experiment whose sample space is Ω. The events 𝐸1 , 𝐸2 ,…, 𝐸𝑛 are a
partition of Ω if
1. they have nonzero probability, and if
2. they are mutually exclusive, and if
3. their union covers all of Ω (exhaustive).

Example 2.4.1
Figure 2.4.1
In the Venn diagram that is shown in Figure
2.4.1, events 𝐸1 , 𝐸2 , 𝐸3 , 𝐸4 form a partition
of Ω since
✓ 𝑃(𝐸𝑘 ) ≠ 0, for all 𝑘
✓ 𝐸𝑖 ∩ 𝐸𝑗 = {∅}, for all 𝑖 ≠ 𝑗
✓ 𝐸1 ∪ 𝐸2 ∪ 𝐸3 ∪ 𝐸4 = Ω

In the same manner, 𝐸1 , 𝐸2 ,…, 𝐸𝑛 can be used to partition an event 𝐴 into ‘smaller’, ‘more
specific’ events.

Example 2.4.2

Figure 2.4.2

In Figure 2.4.2, events 𝐸1 , 𝐸2 , 𝐸3 , 𝐸4 are used to partition event 𝐴 as follows:

𝐴 = (𝐴 ∩ 𝐸1 ) ∪ (𝐴 ∩ 𝐸2 ) ∪ (𝐴 ∩ 𝐸3 ) ∪ (𝐴 ∩ 𝐸4 ).

Because the 𝐸𝑘 ’s are mutually exclusive, the probability of 𝐴 can be decomposed into a sum of
probabilities for events of the form 𝐴 ∩ 𝐸𝑘 :

𝑃(𝐴) = 𝑃[(𝐴 ∩ 𝐸1 ) ∪ (𝐴 ∩ 𝐸2 ) ∪ … ∪ (𝐴 ∩ 𝐸𝑛 )]
= 𝑃(𝐴 ∩ 𝐸1 ) + 𝑃(𝐴 ∩ 𝐸2 ) + ⋯ + 𝑃(𝐴 ∩ 𝐸𝑛 )

47
When 𝐴 is a complex event, the process of partitioning may be of great help, especially when
the probabilities of events 𝐴 ∩ 𝐸𝑘 are simpler to find. In other words, it is very possible that
computing 𝑃(𝐴 ∩ 𝐸𝑘 ) is straightforward, even when 𝑃(𝐴) is not.

Beyond being helpful, in some situations, partitioning is practically essential because ‘knowing’
that 𝐸𝑘 has occurred may affect the occurrence of 𝐴. The use of the term ‘knowing’ is not
accidental: we will often choose to convert 𝑃(𝐴 ∩ 𝐸𝑘 ) into a conditional probability.

Recall that

𝑃(𝐴|𝐸𝑘 ) = 𝑃(𝐴 ∩ 𝐸𝑘 )⁄𝑃(𝐸𝑘 )

which means that


𝑃(𝐴 ∩ 𝐸𝑘 ) = 𝑃(𝐸𝑘 ) 𝑃(𝐴|𝐸𝑘 )

In other words:

𝑃(𝐴) = 𝑃(𝐴 ∩ 𝐸1 ) + 𝑃(𝐴 ∩ 𝐸2 ) + ⋯ + 𝑃(𝐴 ∩ 𝐸𝑛 )


= 𝑃(𝐸1 ) 𝑃(𝐴|𝐸1 ) + 𝑃(𝐸2 ) 𝑃(𝐴|𝐸2 ) + ⋯ + 𝑃(𝐸𝑛 ) 𝑃(𝐴|𝐸𝑛 )

48
Example 2.4.3
General Tire stores sell The Claw, a very popular brand of winter tire. This brand is produced in
three different plants (Abbotsford, Brantford and Crawford), which guarantees an appropriate
supply to its retailers. The Abbotsford plant is smaller and produces 20% of The Claw. Its
machines are a bit outdated and, as a result, 10% of its tires have a defect of some
type. The Brantford plant produces 36% of The Claw and 5% of its production has a
defect. Finally, the Crawford plant handles the rest of the production. State-of-the-art
technology has reduced defects to 1% of its production.

A tire that is on sale at a General Tire store is selected at random to be tested for quality
control. Consider the following events:

𝐴 = “the tire was produced in Abbotsford”,


𝐵 = “the tire was produced in Brantford”,
𝐶 = “the tire was produced in Crawford”,
𝐷 = “the tire is defective”.

Translate all of the information that is provided in the description of the problem into
corresponding probabilities:

Statement Translation
The Abbotsford plant is smaller and produces 20% of
The Claw:
Its machines are a bit outdated and, as a result, 10%
of its tires have a defect of some type:
The Brantford plant produces 36% of The Claw:
and 5% of its production has a defect:
The Crawford plant handles the rest of the
production:
State-of-the-art technology has reduced
defects to 1% of its production:

49
What is the probability that the selected tire is defective?

Here, we are asked to find the value of 𝑃(𝐷).

To compute this probability, it would be great to know where the tire was produced. Since this
information is not provided, partitioning and taking into account all possible plants that may
have produced this particular tire is an interesting strategy to follow:

𝐴, 𝐵 and 𝐶 are mutually exclusive, have non-zero probability and form an exhaustive set of
events. This means we can partition 𝐷 into mutually exclusive events as follows:

𝐷=

The partition allows us to write 𝑃(𝐷) as the sum:

𝑃(𝐷) =

Recall that probabilities involving intersections can always be converted into conditional
probabilities. Indeed,
𝑃(𝐷 ∩ 𝐴) =
𝑃(𝐷 ∩ 𝐵) =
𝑃(𝐷 ∩ 𝐶) =
So,
𝑃(𝐷) =

Numerical values for all of the above have already been found. It is just a matter of computing
the result at this stage:

𝑃(𝐷) = (0.2)(0.1) + (0.36)(0.05) + (0.44)(0.01) = 0.0424 = 4.24%

50
What is the probability that the selected tire was produced at the Crawford plant given that it is
defective?

Here, we are asked to find the value of 𝑃(𝐶|𝐷).

Although we know that 𝑃(𝐷|𝐶) = 0.01, we should not assume that 𝑃(𝐶|𝐷) takes the same
value. We will need to manipulate 𝑃(𝐶|𝐷) until it is expressed in terms of probabilities that are
available or known.

𝑃(𝐶|𝐷) =

Numerically,

𝑃(𝐶|𝐷) = (0.44)(0.01) ⁄ [(0.2)(0.1) + (0.36)(0.05) + (0.44)(0.01)] = 0.1038 = 10.38%

The process we have followed to answer this last part of our question is summarized by Bayes’
Theorem.

Bayes’ Theorem
Let events 𝐸1 , 𝐸2 ,…, 𝐸𝑛 form a partition of the sample space Ω. Then, for any event 𝐴 with non-
zero probability,
𝑃(𝐸𝑘 |𝐴) = 𝑃(𝐸𝑘 ) 𝑃(𝐴|𝐸𝑘 ) ⁄ 𝑃(𝐴)

or, in expanded form,

𝑃(𝐸𝑘 |𝐴) = 𝑃(𝐸𝑘 ) 𝑃(𝐴|𝐸𝑘 ) ⁄ [𝑃(𝐸1 ) 𝑃(𝐴|𝐸1 ) + 𝑃(𝐸2 ) 𝑃(𝐴|𝐸2 ) + ⋯ + 𝑃(𝐸𝑛 ) 𝑃(𝐴|𝐸𝑛 )]

The utility of Bayes’ theorem is highlighted in all situations where the probability of an event 𝐴
conditional to each of the 𝐸𝑖 ’s is easily obtained.
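The tire example can be replayed numerically; a Python sketch of the law of total probability followed by Bayes’ theorem, using the figures from Example 2.4.3:

```python
# Plant shares and defect rates taken from Example 2.4.3.
priors = {"A": 0.20, "B": 0.36, "C": 0.44}          # P(plant)
defect_given = {"A": 0.10, "B": 0.05, "C": 0.01}    # P(D | plant)

# Law of total probability: P(D) = sum over plants of P(E_k) P(D | E_k).
p_D = sum(priors[k] * defect_given[k] for k in priors)

# Bayes' theorem: P(C | D) = P(C) P(D | C) / P(D).
p_C_given_D = priors["C"] * defect_given["C"] / p_D

assert abs(p_D - 0.0424) < 1e-9         # the 4.24% found in the example
assert round(p_C_given_D, 4) == 0.1038  # the 10.38% found in the example
```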

51
Section 3 Counting Techniques
Our discussion regarding probabilities, so far, has covered situations that required very little
counting. The list of outcomes was easily determined and, as a result, calculating probabilities
was relatively straightforward.
In many cases, however, the cardinalities of events – or of the sample space itself – are not
easily computed (and listing simple events may be too time consuming or tedious).
So, in this module, we will examine some basic counting methods…

3.1 Multiplication Rule


If an operation consists of two steps, the first of which has 𝑛1 possible outcomes and the second
𝑛2 , then there are 𝑛1 ∙ 𝑛2 simple events for the experiment.
The term ‘operation’ is meant to represent any procedure, process or method of selections. The
‘steps’ could represent selections that take place in a specific order, or within distinct
categories.

Example 3.1.1
You must answer two questions on a test. Question 1 is a True or False question. Question 2 is a
multiple-choice question where options 𝑎, 𝑏, 𝑐, 𝑑 are offered to you. Given that you have not
studied the topic, and that questions are written in (seemingly) a foreign language, you have no
other choice but to answer both questions at random.

The set of all simple events for this experiment is listed as follows:

Ω = { (𝑇𝑟𝑢𝑒, 𝑎), (𝑇𝑟𝑢𝑒, 𝑏), (𝑇𝑟𝑢𝑒, 𝑐), (𝑇𝑟𝑢𝑒, 𝑑), (𝐹𝑎𝑙𝑠𝑒, 𝑎), (𝐹𝑎𝑙𝑠𝑒, 𝑏), (𝐹𝑎𝑙𝑠𝑒, 𝑐), (𝐹𝑎𝑙𝑠𝑒, 𝑑) }

One could use a tree diagram (see Figure 3.1.1) to visualize the 8 simple events: there are 2
‘paths’ that result from the choice between True and False and, regardless of the chosen answer
to Question 1, there are 4 ‘paths’ that result from the choice between 𝑎, 𝑏, 𝑐 and 𝑑 on
Question 2.

[Figure 3.1.1]

52
The multiplication rule can also be used to verify that 𝑛(Ω) = 8 without an exhaustive list being
elaborated. Indeed, the experiment consists of two steps: to select from TRUE or FALSE on the
first question, and from 𝑎, 𝑏, c, 𝑑 on the second.

Hence,
Question 1 Question 2
Possible outcomes TRUE, FALSE 𝑎, 𝑏, 𝑐, 𝑑

𝑛(Ω) = 2 ∙ 4 = 8

Now consider the event, 𝐴 = “guessing both answers correctly”.

Counting possible outcomes where 𝐴 occurs is also quite straightforward and involves the
multiplication rule. At every question, there is only one correct answer. Thus,
𝑛(𝐴) = 1 ∙ 1 = 1

The probability of event 𝐴 can then be computed:

𝑃(𝐴) = 𝑛(𝐴)⁄𝑛(Ω) = 1⁄8

Obviously, the multiplication rule can be extended to cases where more than two steps are
required to produce an outcome.
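The rule can be checked by enumerating the outcomes with itertools; a sketch of the two-question test, where the answer key (𝑇𝑟𝑢𝑒, 𝑐) is a hypothetical choice:

```python
from itertools import product

# Two-step experiment: a True/False question followed by a 4-option question.
q1 = ["True", "False"]
q2 = ["a", "b", "c", "d"]

# One simple event per 'path' through the tree diagram.
omega = list(product(q1, q2))
assert len(omega) == len(q1) * len(q2) == 8

# A = 'guessing both answers correctly', for a hypothetical key (True, c).
key = ("True", "c")
A = [w for w in omega if w == key]
assert len(A) == 1   # P(A) = 1/8
```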

53
Example 3.1.2
Let us assume that a license plate is to be made up of 3 letters followed by 3 digits (selected
from 1 to 9). There are no restrictions regarding the repetition of a letter, or of a number.

Obviously, listing the possibilities is out of the question, as is the construction of a tree diagram.
Nonetheless, the number of different license plates can be counted using the multiplication
rule since selections are performed in a specific order. The outcomes at ‘every step’ occur when
each letter, or number, is selected.

Choice of 3 letters Choice of 3 digits


Possible
outcomes

𝑛(Ω) = ⏟ ∙⏟ 12 812 904


𝑙𝑒𝑡𝑡𝑒𝑟𝑠 𝑑𝑖𝑔𝑖𝑡𝑠

Let us define event, 𝐴 = “only vowels (𝐴, 𝐸, 𝐼, 𝑂, 𝑈, 𝑌) and odd numbers are obtained”. Its
cardinality can still be obtained by using the multiplication rule since selections are performed
in a specific order.

The cardinality of 𝐴, and thus its probability, is affected by restrictions that are imposed at
every selection:
𝑛(𝐴) = ⏟ ∙⏟ = 27000
3 𝑣𝑜𝑤𝑒𝑙𝑠 3 𝑜𝑑𝑑
𝑛𝑢𝑚𝑏𝑒𝑟𝑠

𝑃(𝐴) = 27 000 ⁄ 12 812 904 ≈ 0.2107%

Finally, consider the event, 𝐵 = “no letter and no number is repeated”. The cardinality of 𝐵 can
once again be obtained by using the multiplication rule. Selections are made in a succession of
‘steps’ (that is, in a specific order), but with different restrictions than those of 𝐴:

𝑛(𝐵) = = 7 862 400

𝑃(𝐵) = 7 862 400 ⁄ 12 812 904 ≈ 61.36%

54
3.2 Addition Rule
If an event can occur through outcomes that are mutually exclusive, then the number of ways
in which each outcome is observed can by added. In other words, suppose that event 𝐴 can be
partitioned into mutually exclusive events 𝐴1 and 𝐴2 , then

𝑛(𝐴) = 𝑛(𝐴1 ) + 𝑛(𝐴2 ).

Example 3.2.1
You must answer two questions on a test. Question 1 is a True or False question. Question 2 is a
multiple-choice question where options 𝑎, 𝑏, 𝑐, 𝑑 are offered to you. Given that you have not
studied the topic, and that questions are written in (seemingly) a foreign language, you have no
other choice but to answer both questions at random.

We have already shown that 𝑛(Ω) = 8, either by using the multiplication rule, by using a tree
diagram, or by listing its simple events:
Ω = { (𝑇𝑟𝑢𝑒, 𝑎), (𝑇𝑟𝑢𝑒, 𝑏), (𝑇𝑟𝑢𝑒, 𝑐), (𝑇𝑟𝑢𝑒, 𝑑), (𝐹𝑎𝑙𝑠𝑒, 𝑎), (𝐹𝑎𝑙𝑠𝑒, 𝑏), (𝐹𝑎𝑙𝑠𝑒, 𝑐), (𝐹𝑎𝑙𝑠𝑒, 𝑑) }.

Consider now event 𝐵 = “getting exactly one correct answer”. This event can occur by either
getting the correct answer on the first question and an incorrect answer to the second. It can
also occur through the distinct outcome where an incorrect answer occurs on the first question,
but the answer to the second question is correct. Therefore,

𝑛(𝐵) = ⏟ + ⏟ =4
𝑐𝑜𝑟𝑟𝑒𝑐𝑡 1𝑠𝑡 𝑖𝑛𝑐𝑜𝑟𝑟𝑒𝑐𝑡 1𝑠𝑡
𝑖𝑛𝑐𝑜𝑟𝑟𝑒𝑐𝑡 2𝑛𝑑 𝑐𝑜𝑟𝑟𝑒𝑐𝑡 2𝑛𝑑

𝑃(𝐵) = 4⁄8 = 1⁄2
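A brute-force check of this count in Python (the answer key is a hypothetical choice; only the partition into two mutually exclusive cases matters):

```python
from itertools import product

omega = list(product(["True", "False"], ["a", "b", "c", "d"]))
key = ("True", "c")   # hypothetical answer key

# B = 'getting exactly one correct answer', partitioned into the two mutually
# exclusive ways it can occur; the addition rule sums their cardinalities.
first_only = [w for w in omega if w[0] == key[0] and w[1] != key[1]]
second_only = [w for w in omega if w[0] != key[0] and w[1] == key[1]]
n_B = len(first_only) + len(second_only)

assert n_B == 4
assert n_B / len(omega) == 0.5   # P(B) = 4/8
```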

55
Example 3.2.2
Let us assume that a license plate is to be made up of 3 letters followed by 3 digits (selected
from 1 to 9), or of 3 numbers followed by 3 letters. There are no restrictions regarding the
repetition of a letter, or of a number.

The number of different license plates can be counted using the addition rule. Let’s count all
ways to print license plates starting with 3 letters, and then count all ways to print license
plates starting with 3 numbers. No simple event is common to these individual events.

𝑛(𝑙𝑖𝑐𝑒𝑛𝑠𝑒 𝑝𝑙𝑎𝑡𝑒𝑠) = ⏟ + ⏟ = 25 625 808
                  𝑙𝑒𝑡𝑡𝑒𝑟𝑠 𝑓𝑜𝑙𝑙𝑜𝑤𝑒𝑑 𝑏𝑦 𝑛𝑢𝑚𝑏𝑒𝑟𝑠   𝑛𝑢𝑚𝑏𝑒𝑟𝑠 𝑓𝑜𝑙𝑙𝑜𝑤𝑒𝑑 𝑏𝑦 𝑙𝑒𝑡𝑡𝑒𝑟𝑠

Example 3.2.3
A lottery proceeds as follows. 5 containers all have 10 balls labeled 0 to 9. A ball is randomly
selected from each container, in order, to form the ‘winning number’.
The number of possible outcomes is:

𝑛(Ω) =

Lottery tickets are printed with 5-digit numbers and prizes are awarded as follows:
o The 5000$ grand prize is awarded to participants whose tickets have the ‘winning number’,
that is correct numbers in the correct order;
o 1000$ are awarded to participants whose tickets have a string of exactly 4 correct
consecutive numbers (4 correct numbers in correct consecutive positions);
o 200$ are awarded to participants whose tickets have a string of exactly 3 correct numbers
(3 correct numbers in correct consecutive positions).

For example, if the winning number is 12345, a ticket with numbers 72345 would earn the
owner 1000$.

A ticket with numbers 72349 would earn its owner 200$.

You have purchased a lottery ticket. Find the probability for each of the following events:

𝑊 = “Winning the 5000$ prize”,


𝑆 = “Winning a 1000$ prize”,
𝑇 = “Winning the 200$ prize”.

56
Answer
The number of ways to win the lottery is 𝑛(𝑊) = +

The probability of winning the lottery is

𝑃(𝑊) = 𝑛(𝑊)⁄𝑛(Ω) =           =

The number of ways to win 1000$ is

𝑛(𝑆) = +

The probability of winning 1000$ is

𝑃(𝑆) = 𝑛(𝑆)⁄𝑛(Ω) =           =

The number of ways to win $200 is

𝑛(𝑇) = + + = 261

The probability of winning 200$ is

𝑃(𝑇) = 𝑛(𝑇)⁄𝑛(Ω) =           =

57
Example 3.2.4
A lottery proceeds as follows. One container has 10 balls that are labeled with numbers 0 to 9,
respectively. 5 balls are randomly selected from the container without replacement. Their
numbers are noted in the order of selection to form the ‘winning number’.

The total number of possible outcomes is:

𝑛(Ω) = = 30240

Lottery tickets are printed with 5-digit numbers and prizes are awarded as follows:
o The 5000$ grand prize is awarded to participants whose tickets have the ‘winning number’,
that is correct digits in the correct order;
o 1000$ are awarded to participants whose tickets have a string of exactly 4 correct
consecutive digits (4 correct digits in correct consecutive positions);
o 200$ are awarded to participants whose tickets have a string of exactly 3 correct digits (3
correct digits in correct consecutive positions).

If the winning number is 12345, for example, a ticket with the 5-digit number 72345 would earn
the owner 1000$. Tickets with numbers 72349 or 17345 would earn the owner 200$.

You have purchased a lottery ticket. Find the probability for each of the following events:

𝑊 = “Winning the 5000$ prize”,


𝑆 = “Winning a 1000$ prize”,
𝑇 = “Winning the 200$ prize”.

58
Answer
The cardinality of 𝑊 is quite straightforward given that every correct number must appear in
the correct position (order). Hence,

𝑛(𝑊) = 1 ∙ 1 ∙ 1 ∙ 1 ∙ 1 = 1

The probability of winning the lottery is

𝑃(𝑊) = 𝑛(𝑊)⁄𝑛(Ω) = 1⁄30240 ≈ 0.00331%

The cardinality of 𝑆 requires that the first 4 digits be correct but not the 5th, or that the last 4
digits be correct but not the first. As a result, we will make use of both the addition and
multiplication rules here. Hence,

𝑛(𝑆) = + = 10

So, the probability of winning 1000$ is

𝑃(𝑆) = 𝑛(𝑆)⁄𝑛(Ω) = 10⁄30240 ≈ 0.03307%

The number of ways to have a string of exactly 3 consecutive correct numbers, that is 𝑛(𝑇), is
much more difficult to elaborate.

𝑛(𝑇) = (1 ∙ 1 ∙ 1 ∙ 6 ∙ 6) + (1 ∙ 1 ∙ 1 ∙ 1 ∙ 1) + (1 ∙ 1 ∙ 1 ∙ 1 ∙ 5) + (5 ∙ 1 ∙ 1 ∙ 1 ∙ 1)
+(5 ∙ 1 ∙ 1 ∙ 1 ∙ 4) + (6 ∙ 6 ∙ 1 ∙ 1 ∙ 1) = 103

Can you explain the reasoning behind this calculation?

So, the probability of winning 200$ is

𝑃(𝑇) = 𝑛(𝑇)⁄𝑛(Ω) = 103⁄30240 ≈ 0.3406%
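The counts 𝑛(𝑊) = 1 and 𝑛(𝑆) = 10 can be verified by brute force over all 30 240 possible draws; a Python sketch (the ticket chosen is arbitrary):

```python
from itertools import permutations

# Every possible winning draw: 5 distinct digits out of 10, order mattering.
draws = list(permutations(range(10), 5))
assert len(draws) == 30240

ticket = (1, 2, 3, 4, 5)   # an arbitrary ticket for the check

# W: all 5 digits correct and in the correct positions.
n_W = sum(1 for d in draws if d == ticket)

# S: exactly the first 4, or exactly the last 4, positions are correct.
n_S = sum(1 for d in draws
          if (d[:4] == ticket[:4] and d[4] != ticket[4])
          or (d[1:] == ticket[1:] and d[0] != ticket[0]))

assert n_W == 1 and n_S == 10
```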

59
3.3 Arrangements, Permutations and Combinations
We call ‘arrangement’ the number of ways (orders) in which 𝑛 elements can be placed into 𝑛
spots without repetition.

In other words, arrangements count every way to scramble 𝑛 elements that are considered
distinct.

According to the multiplication rule, the number of arrangements of 𝑛 elements produces…

𝑛 ∙ (𝑛 − 1) ∙ (𝑛 − 2) ∙∙∙ 3 ∙ 2 ∙ 1 = 𝑛!

distinct simple events.

Example 3.3.1
How many 7-letter words, whether they are meaningful or not, can be formed by scrambling
the letters A – C – F – H – O – P – S ?

Answer
7 letters are to be placed into 7 spots to form a word. There are

= = 5040 arrangements

We call ‘permutation’ the number of ways to place (to order) 𝑟 elements that are selected
without repetition from a list of 𝑛 distinct elements.

In other words, permutations count every way to pick and then place 𝑟 elements selected from
a set of 𝑛 elements.

According to the multiplication rule, permutations of 𝑟 elements selected from 𝑛 elements add
up to
𝑛 ∙ (𝑛 − 1) ∙∙∙ (𝑛 − 𝑟 + 2) ∙ (𝑛 − 𝑟 + 1)

distinct outcomes. Observe how the list stops once there are 𝑛 − 𝑟 leftover elements from the
successive selections.

Mathematically, notice that

𝑛 ∙ (𝑛 − 1) ∙∙∙ (𝑛 − 𝑟 + 2) ∙ (𝑛 − 𝑟 + 1)

can be written as

𝑛 ∙ (𝑛 − 1) ∙∙∙ (𝑛 − 𝑟 + 2) ∙ (𝑛 − 𝑟 + 1) ∙ (𝑛 − 𝑟) ∙ (𝑛 − 𝑟 − 1) ∙∙∙ 3 ∙ 2 ∙ 1
(𝑛 − 𝑟) ∙ (𝑛 − 𝑟 − 1) ∙∙∙ 3 ∙ 2 ∙ 1

Notation
We shall denote the permutation of 𝑟 elements selected from 𝑛 elements as
𝒏𝑷 𝒓
It is read ‘𝑛 permute 𝑟 ’ and is calculated as follows:

𝒏𝑷𝒓 =

Example 3.3.2
Assuming letters cannot be re-used, how many 5-letter words, whether they are meaningful or
not, can be formed using A – C – F – H – O – P – S

Answer
5 letters are to be selected from a set of 7 letters without repetition…

7 ∙ 6 ∙ 5 ∙ 4 ∙ 3 = (7 ∙ 6 ∙ 5 ∙ 4 ∙ 3 ∙ 2 ∙ 1)⁄(2 ∙ 1) = 7!⁄2! = 7𝑃5 = 2520 words

We call ‘combination’ the number of ways to select 𝑟 elements from a list of 𝑛 distinct elements
without repetition and without regard to order.

In other words, combinations count how many distinct sets of 𝑟 elements can be chosen from a
set of 𝑛 elements, without giving any importance to their order of selection. Unlike
permutations, a combination does not distinguish between 𝐴𝐵𝐶, 𝐵𝐶𝐴 or 𝐶𝐵𝐴.

We have already seen that there are 𝑛𝑃𝑟 = 𝑛!⁄(𝑛 − 𝑟)! ways to permute 𝑟 elements selected from a
set of 𝑛.

Keep in mind, however, that this quantity treats the shuffling of the 𝑟 elements as producing
distinct outcomes. Every grouping of 𝑟 selected elements has generated 𝑟! different orders,
which should not be considered as different outcomes when counting combinations.

Therefore, the number of combinations of 𝑟 elements selected from a set of 𝑛 distinct elements
is
𝑛𝑃𝑟⁄𝑟! = 𝑛!⁄[(𝑛 − 𝑟)! 𝑟!]
Notation
The combination of 𝑟 elements selected from 𝑛 elements is denoted by
𝒏𝑪𝒓 or (𝒏 𝒓)

and is read ‘𝑛 choose 𝑟 ’ and is calculated as follows:

𝒏𝑪𝒓 = (𝒏 𝒓) = 𝒏!⁄[(𝒏 − 𝒓)! 𝒓!]

Example 3.3.3
A class has to elect a committee to represent them at a student union meeting. How many 5-
member committees can be formed from a group of 30 students?

Answer
5 representatives are to be selected from a set of 30 students without repetition and with no
regard to order…

= 142 506

Note
If elected representatives had to occupy specific roles on the committee, thus giving
importance to the order (placement) of selections, then there would have been

= 17 100 720

different committees!
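For readers who want to verify these counts numerically, Python's standard math module (version 3.8+) provides factorial, perm and comb directly; this is an optional check of the examples above, not part of the course's required tools.

```python
import math

# Arrangements: 7 letters scrambled into 7 spots (Example 3.3.1)
assert math.factorial(7) == 5040

# Permutations: nPr = n!/(n-r)!
assert math.perm(7, 5) == 2520          # Example 3.3.2
assert math.perm(30, 5) == 17_100_720   # committee with distinct roles

# Combinations: nCr = n!/((n-r)! r!)
assert math.comb(30, 5) == 142_506      # Example 3.3.3
print("all counts confirmed")
```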

3.4 Mixed problems and varying strategies
In many counting problems, a mixture of counting rules we have at our disposal may need to be
used.

Example 3.4.1
Suppose that the alphabet consists of 6 vowels and 20 consonants (we will treat 𝑦 exclusively as
a vowel for the sake of this example). 8 letters are chosen at random and without repetition.
The letters are placed in the order of their selection to produce a ‘word’ (which is unlikely to be
found in a dictionary).
How many 8-letter words can be formed?

All letters are available to be selected (without repetition) and the order of selection matters. A
permutation fits this situation:

Number of 8-letter words = 26𝑃8 = 26!⁄18! ≈ 6.299 × 10¹⁰

How many distinct words can be formed of 3 vowels and 5 consonants?

This time, selections must be made from two categories (vowels and consonants). This implies
that the multiplication rule will be required.

(Incorrect) Attempt #1

( 20𝑃5 ): 5 consonants selected without replacement from the 20 consonants and
rearranged in all possible orders.

( 6𝑃3 ): 3 vowels selected without replacement from the 6 vowels and rearranged in all
possible orders.

Number of words with 3 vowels and 5 consonants = ( 20𝑃5 ) ∙ ( 6𝑃3 )

Perhaps the count shown above looks appropriate because it does take into account the order
of vowels and of consonants, and it does ensure that elements are not repeated.
Sadly, it is incorrect.

The formula we have used would count, for instance, how many words can be formed by 5
consonants, all placed first, followed by 3 vowels, all placed last. The permutations shuffle the
consonants among themselves, and the vowels among themselves, but never has the switch of
positions between a vowel and a consonant been considered.

Here are two approaches that will lead to the correct count.
(Correct) Attempt #2
Choose the 8 letters and then form words with them:

(choose the 5 consonants) ∙ (choose the 3 vowels) ∙ (count all arrangements of the 8 letters) =

(Correct) Attempt #3
Choose where the 5 consonants (or the 3 vowels) will be placed within the 8-letter word and
then fill the spots with all possible variations of letters:

(choose the 5 spots for consonants) ∙ (fill the spots for the 5 consonants) ∙ (fill the spots for the 3 vowels) =

(choose the 3 spots for vowels) ∙ (fill the spots for the 3 vowels) ∙ (fill the spots for the 5 consonants) =
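As a sanity check (not part of the original text), the two correct strategies can be compared numerically; both give the same total, while the product from the incorrect attempt falls short because it never interleaves vowels with consonants.

```python
import math

# First correct strategy: choose the letters, then arrange all 8 of them.
count_a = math.comb(20, 5) * math.comb(6, 3) * math.factorial(8)

# Second correct strategy: choose the 5 consonant positions among the 8,
# then fill the consonant spots (20P5) and the vowel spots (6P3) in order.
count_b = math.comb(8, 5) * math.perm(20, 5) * math.perm(6, 3)

assert count_a == count_b == 12_502_425_600

# The incorrect attempt only shuffles consonants among themselves
# and vowels among themselves, so it undercounts:
assert math.perm(20, 5) * math.perm(6, 3) < count_a
```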

Example 3.4.2
A container holds 40 marbles. 30 marbles are yellow, 9 are green and 1 is red. 5 marbles are
selected at random and without replacement.

A green marble that is selected first is worth 5 points. A green marble that is selected second is
worth 4 points. A green marble that is selected third is worth 3 points. And so on. The player
receives 0 points if no green marble is selected or if the red marble is selected.

How many ways – including order – are there to select 5 marbles?

𝑛(Ω) = = 78 960 960

How many ways – including order – are there to select 5 marbles from which at least one is
green?

𝑛(𝑎𝑡 𝑙𝑒𝑎𝑠𝑡 𝑜𝑛𝑒 𝑔𝑟𝑒𝑒𝑛) = 𝑛(Ω) − 𝑛( )= − = 58 571 640

How many ways – including order – are there to select 5 marbles from which exactly three are
green?

𝑛(3 𝑔𝑟𝑒𝑒𝑛) = ∙ ∙ = 4 687 200

How many ways – including order – are there for the player to obtain 0 points?

𝑛(0 points) = + = 30 259 440

How many ways – including order – are there for the player to obtain 3 points?

𝑛(3 points) = + = 7 673 400
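The first three counts of this example can be verified with the permutation and combination functions seen earlier. This is an optional Python sketch, not part of the original solution; the 31 non-green marbles are the 30 yellow ones plus the red one.

```python
import math

n_omega = math.perm(40, 5)        # ordered selections of 5 of the 40 marbles
n_no_green = math.perm(31, 5)     # all five drawn from the 31 non-green marbles
n_at_least_one_green = n_omega - n_no_green

# Exactly 3 green: choose which 3 of the 5 draws are green (5C3),
# then order the greens (9P3) and the non-greens (31P2) within those draws.
n_three_green = math.comb(5, 3) * math.perm(9, 3) * math.perm(31, 2)

print(n_omega, n_at_least_one_green, n_three_green)
# 78960960 58571640 4687200
```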

Section 4 Random variables
We use the expression ‘random variable’ to denote situations where the simple events of a
random experiment are attributed a numerical value. The outcomes of the experiment are
potentially numerical by default (time, weight, number of defects, number of individuals with a
given characteristic) or can become numerical through a transformation or a correspondence.

4.1 Discrete random variables


If random variable 𝑋 can take a finite, or an infinite but countable, number of values, then it is
called a ‘discrete random variable’. The set of values that 𝑋 can take is called its support and is
denoted by 𝑆(𝑋).

Example 4.1.1
A soccer team will play 2 games this week. We will assume that both games are played against
an equally matched opponent and that every outcome (win, lose or draw) is equally likely to
occur.

The result of each game is a random experiment whose simple events are not numerical at all.
The sample space is formed of 9 distinct combinations of wins (𝑊), losses (𝐿) or draws (𝐷):

Ω = {(𝑊, 𝑊), (𝑊, 𝐷), (𝑊, 𝐿), (𝐷, 𝑊), (𝐷, 𝐷), (𝐷, 𝐿), (𝐿, 𝑊), (𝐿, 𝐷), (𝐿, 𝐿) }

But what if we performed an operation that converts each simple event into a numerical value?

For example, standings in soccer are generated by converting a team’s win into 3 points, a draw
into 1 point, whereas a loss is worth 0 points.

Thus, suppose we let

𝑋 = “the number of points the team will accumulate during the two games”.

𝑋 is now a random variable whose value depends on the games’ outcomes. The support of 𝑋, is
made up of 6 values, as opposed to the 9 elements that formed Ω.

Draw arrows to show the correspondence between the simple events and the values of 𝑋 they
are converted into.

Although obtaining 4 points (𝑋 = 4) is one of 6 possible values 𝑋 can take, explain why it is
false to assume that 𝑃(𝑋 = 4) = 1⁄6.

True or false

𝑃(𝑊, 𝑊) = 𝑃(𝐿, 𝐷) True False

𝑃(𝑋 = 0) = 𝑃(𝑋 = 1) True False

4.2 Probability distribution table
The probability distribution table of a discrete random variable is commonly used to present
the variable’s support as well as the probability associated to its different values.

For convenience, we frequently include the variable’s cumulative probability distribution as


well.

Example 4.2.1
A soccer team will play 2 games this week. We will assume that both games are played against
an equally matched opponent and that every outcome (win, lose or draw) is equally likely to
occur.

Let 𝑋 =“the number of points they will accumulate during the two games”.

Table 4.2.1

𝑿 = 𝒙𝒊 𝝎𝒌 → 𝒙𝒊 𝑷(𝑿 = 𝒙𝒊 ) 𝑷(𝑿 ≤ 𝒙𝒊 )

0 (𝐿, 𝐿) 1⁄9 1⁄9

1 (𝐷, 𝐿), (𝐿, 𝐷)

2 (𝐷, 𝐷)

3 (𝑊, 𝐿), (𝐿, 𝑊)

4 (𝑊, 𝐷), (𝐷, 𝑊)

6 (𝑊, 𝑊)

The third column of Table 4.2.1, which lists probabilities associated to each value of 𝑋, is the
variable’s probability distribution or probability distribution function, often referred to as the
pdf.

The fourth column of the table lists the sum of probabilities leading up to, and including, each
value of 𝑋. This is the variable’s cumulative probability distribution or cumulative distribution
function, often referred to as the cdf.

Typically, the probability distribution table is simplified by removing the list of simple events
that generate the different values of 𝑋, by rewriting 𝑃(𝑋 = 𝑥𝑖 ) as 𝑝(𝑥𝑖 ) and 𝑃(𝑋 ≤ 𝑥𝑖 ) as 𝐹(𝑥𝑖 ).

Example 4.2.2
Shown below, Table 4.2.2 would be deemed equivalent to Table 4.2.1 previously.

Table 4.2.2
Variable (𝑿) pdf cdf
𝒙𝒊 𝒑(𝒙𝒊 ) 𝑭(𝒙𝒊 )
0 1⁄9 1⁄9
1 2⁄9 3⁄9
2 1⁄9 4⁄9
3 2⁄9 6⁄9
4 2⁄9 8⁄9
6 1⁄9 1

Use the table to compute the following probabilities:

𝑃(𝑋 = 3) = 𝑝(3) =
𝑃(𝑋 < 3) = 𝐹(5) =
𝑃(𝑋 > 3) = 𝑃(2 ≤ 𝑋 ≤ 4) =
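The table can also be reproduced mechanically. The sketch below (an optional Python check, not part of the course's required tools) maps each of the 9 equally likely outcomes to its point total and accumulates exact probabilities with Fraction.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

POINTS = {"W": 3, "D": 1, "L": 0}           # soccer scoring
outcomes = list(product("WDL", repeat=2))   # the 9 equally likely simple events

counts = Counter(POINTS[a] + POINTS[b] for a, b in outcomes)
pdf = {x: Fraction(n, 9) for x, n in sorted(counts.items())}

cdf, running = {}, Fraction(0)
for x, p in pdf.items():
    running += p
    cdf[x] = running

# pdf and cdf now reproduce Table 4.2.2, e.g. pdf[3] = 2/9 and cdf[3] = 6/9.
```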

4.3 Computational formulas for the expected value and variance
Let 𝑋 represent a discrete random variable whose support is 𝑆(𝑋). Then…

𝑃(𝑋 = 𝑥𝑖 ) = 𝑝(𝑥𝑖 ) ≥ 0 for all 𝑥𝑖

∑ 𝑃(𝑋 = 𝑥𝑖 ) = ∑ 𝑝(𝑥𝑖 ) = 1
𝑥𝑖 𝜖𝑆(𝑋)

The following computational formulas can be used to obtain the expected value, variance and
standard deviation of a discrete random variable:

Expected value of variable 𝑿 𝐸(𝑋) = ∑ 𝑥𝑖 𝑝(𝑥𝑖 )

Variance of variable 𝑿 𝑉(𝑋) = ∑[𝑥𝑖 − 𝐸(𝑋)]2 𝑝(𝑥𝑖 )

Identity
It is often more convenient, not to mention quicker, to calculate the variance as follows:

𝑉(𝑋) = ∑[𝑥𝑖 − 𝐸(𝑋)]2 𝑝(𝑥𝑖 ) = ∑ 𝑥𝑖2 𝑝(𝑥𝑖 ) − [𝐸(𝑋)]2

Can you demonstrate this identity?

Standard deviation of variable 𝑿 𝜎(𝑋) = √𝑉(𝑋)

Interpretation of the expected value and standard deviation of a discrete random variable
One can make parallels between the formulas shown above and those we have used for the
population mean, variance and standard deviation in the descriptive statistics section.
For example, 𝜇 = (1⁄𝑁) ∑ 𝑛𝑖 𝑥𝑖 can also be written as 𝜇 = ∑ (𝑛𝑖⁄𝑁) 𝑥𝑖 = ∑ 𝑓𝑖 𝑥𝑖 .

The relative frequency, 𝑓𝑖 = 𝑛𝑖⁄𝑁, associated to a measurement 𝑥𝑖 in the context of descriptive
statistics now translates into 𝑝(𝑥𝑖 ), the probability associated to 𝑥𝑖 ’s occurrence.

Parallels between relative frequencies and probabilities are appropriate when population data
is concerned given that, like populations, probabilities are calculated based on all possible
outcomes of a random experiment.

With that in mind, we can interpret a variable’s expected value as ‘the average we would
expect to obtain if the experiment were repeated extremely often’. Likewise, for a random
variable 𝑋, 𝜎(𝑋) calculates what one would expect to observe as the standard deviation of the
outcomes if the experiment were repeated extremely often.

Example 4.3.1
A soccer team will play 2 games this week against same-caliber teams. Find 𝐸(𝑋), 𝑉(𝑋), and
𝜎(𝑋) where 𝑋 = ‘the number of points they will accumulate’.
𝐸(𝑋) = ∑ 𝑥𝑖 𝑝(𝑥𝑖 ) = 0(1⁄9) + 1(2⁄9) + 2(1⁄9) + 3(2⁄9) + 4(2⁄9) + 6(1⁄9) = 24⁄9 ≈ 2.67

𝑉(𝑋) = ∑ 𝑥𝑖² 𝑝(𝑥𝑖 ) − [𝐸(𝑋)]² = 0²(1⁄9) + 1²(2⁄9) + 2²(1⁄9) + 3²(2⁄9) + 4²(2⁄9) + 6²(1⁄9) − (8⁄3)²
     = 28⁄9 ≈ 3.11

𝜎(𝑋) = √28⁄9 ≈ 1.76

Interpret the results in the context of the problem.
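A few lines of Python can confirm these values exactly (an optional check, not part of the course's required tools).

```python
from fractions import Fraction

pdf = {0: Fraction(1, 9), 1: Fraction(2, 9), 2: Fraction(1, 9),
       3: Fraction(2, 9), 4: Fraction(2, 9), 6: Fraction(1, 9)}

E = sum(x * p for x, p in pdf.items())             # expected value: 24/9 = 8/3
V = sum(x**2 * p for x, p in pdf.items()) - E**2   # computational identity: 28/9
sigma = float(V) ** 0.5                            # about 1.76

assert E == Fraction(8, 3) and V == Fraction(28, 9)
```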

4.4 Linear Transformation of a random variable
Consider a random variable, 𝑋, whose expected value, variance and standard deviation are
𝐸(𝑋), 𝑉(𝑋) and 𝜎(𝑋), respectively.

If random variable 𝑌 is obtained from 𝑋 through a relationship of the form


𝑌 = 𝑎𝑋 + 𝑏
then 𝑌 is also a random variable and is called a linear transformation of 𝑋.

Properties on the linear transformation of a variable


If random variable 𝑌 is obtained from 𝑋 through a relationship of the form
𝑌 = 𝑎𝑋 + 𝑏, then
𝐸(𝑌) = 𝐸(𝑎𝑋 + 𝑏) = 𝑎𝐸(𝑋) + 𝑏
𝑉𝑎𝑟(𝑌) = 𝑉𝑎𝑟(𝑎𝑋 + 𝑏) = 𝑎2 𝑉𝑎𝑟(𝑋)
𝜎(𝑌) = 𝜎(𝑎𝑋 + 𝑏) = |𝑎|𝜎(𝑋)

Example 4.4.1
Consider the following casino game: a container has 24 marbles bouncing around. 8 of them are
red. 3 marbles are randomly selected without replacement. The gambler will receive $10 for
every red marble that is selected. It costs $15, however, to participate.

Find the gambler’s expected profit (or loss) when the game is played once.
What is the variability of this number?

Answer
First, let 𝑋 = “number of red marbles that are selected”.

To find 𝐸(𝑋), 𝑉𝑎𝑟(𝑋) and 𝜎(𝑋), we need to determine the probability distribution of 𝑋. Using
counting techniques, we can find that there are:

o ways to choose any 3 of the 24 marbles without replacement and


without regard to order. This is 𝑛(Ω).

o ways to choose any 3 of the 8 red marbles without replacement


and without regard to order. This is 𝑛(𝑋 = 3).

o ways to choose any 2 of the 8 red marbles and one non red
marble without replacement and without regard to order. This is 𝑛(𝑋 = 2).

o ways to choose any one of the 8 red marbles and two non red
marbles without replacement and without regard to order. This is 𝑛(𝑋 = 1).

o ways to choose none of the 8 red marbles and three non red
marbles without replacement and without regard to order. This is 𝑛(𝑋 = 0).

The probability distribution table for variable 𝑋 = “number of red marbles that are selected” is
therefore:
𝒙𝒊 𝒑(𝒙𝒊 )
0 560/2024
1 960/2024
2 448/2024
3 56/2024

𝐸(𝑋) = ∑ 𝑥𝑖 𝑝(𝑥𝑖 ) = = 1

𝑉(𝑋) = ∑ 𝑥𝑖² 𝑝(𝑥𝑖 ) − [𝐸(𝑋)]² = = 1232⁄2024

𝜎(𝑋) = √𝑉(𝑋) ≈ 0.7802

The gambler will receive $10 for every red marble that is selected. It costs $15, however, to
participate. Hence, if we let 𝑌 = “the gambler’s profit when the game is played once”, then 𝑌 is
a linear transformation of 𝑋:
𝑌 = 10𝑋 − 15

According to the properties of linear transformations of a random variable:

𝐸(𝑌) =

𝑉𝑎𝑟(𝑌) =

𝜎(𝑌) =
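The blanks above can be checked numerically. The sketch below (an optional Python check) recomputes 𝐸(𝑋) and 𝑉(𝑋) from the table and then applies the linear-transformation properties to 𝑌 = 10𝑋 − 15; the resulting 𝐸(𝑌) = −5 and 𝑉(𝑌) ≈ 60.87 are the values reused in Section 4.6.

```python
from fractions import Fraction

# pdf of X = "number of red marbles", taken from the table above
pdf = {0: Fraction(560, 2024), 1: Fraction(960, 2024),
       2: Fraction(448, 2024), 3: Fraction(56, 2024)}

E_X = sum(x * p for x, p in pdf.items())
V_X = sum(x**2 * p for x, p in pdf.items()) - E_X**2
assert E_X == 1 and V_X == Fraction(1232, 2024)

E_Y = 10 * E_X - 15            # E(aX + b) = aE(X) + b
V_Y = 10**2 * V_X              # V(aX + b) = a^2 V(X)
sigma_Y = float(V_Y) ** 0.5    # sigma(aX + b) = |a| sigma(X)
print(E_Y, round(float(V_Y), 2), round(sigma_Y, 2))  # -5 60.87 7.8
```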

4.5 Sum of independent variables
Consider independent random variables, 𝑋, 𝑌 whose expected values, variances and standard
deviations are 𝐸(𝑋), 𝑉(𝑋) and 𝜎(𝑋), 𝐸(𝑌), 𝑉(𝑌) and 𝜎(𝑌), respectively.

Properties on sum of independent variables


If 𝑇 = 𝑋 + 𝑌 represents a sum of independent random variables, then

𝐸(𝑇) = 𝐸(𝑋 + 𝑌) = 𝐸(𝑋) + 𝐸(𝑌)


𝑉(𝑇) = 𝑉(𝑋 + 𝑌) = 𝑉(𝑋) + 𝑉(𝑌)

𝜎(𝑇) = √𝑉(𝑇) = √𝜎 2 (𝑋) + 𝜎 2 (𝑌)

4.6 Difference of independent variables


Consider the independent random variables, 𝑋, 𝑌 whose expected values, variances and
standard deviations are 𝐸(𝑋), 𝑉(𝑋) and 𝜎(𝑋), 𝐸(𝑌), 𝑉(𝑌) and 𝜎(𝑌), respectively.

Properties on the difference of independent variables


If 𝑇 = 𝑋 − 𝑌 represents the difference of independent random variables, then

𝐸(𝑇) = 𝐸(𝑋 − 𝑌) = 𝐸(𝑋) − 𝐸(𝑌)


𝑉(𝑇) = 𝑉(𝑋 − 𝑌) = 𝑉(𝑋) + 𝑉(𝑌)

𝜎(𝑇) = √𝑉(𝑇) = √𝜎 2 (𝑋) + 𝜎 2 (𝑌)

Can you explain why 𝑽(𝑻) ≠ 𝑽(𝑿) − 𝑽(𝒀)?

Example 4.6.1
Consider the casino game that was introduced earlier: a container has 24 marbles bouncing
around. 8 of them are red. 3 marbles are randomly selected without replacement. The gambler
will receive $10 for every red marble that is selected. It costs $15, however, to participate.

Find the gambler’s expected total profit (or loss) when the game is played twenty times. And
what is the variability of this number?

Answer
We have already determined that when playing this casino game once, the gambler’s profit
obeys the following characteristics:
𝐸(𝑌) = −5
𝑉(𝑌) = 60.87

Now, let’s assume that the gambler plays twenty times. We can assume that outcomes of every
game are independent. The expected value and the variance of 𝑌𝑘 , the profit obtained at each
trial are equal to those of 𝑌, computed earlier. That is,

𝐸(𝑌𝑘 ) = −5
𝑉(𝑌𝑘 ) = 60.87

The total profit is a random variable that is the sum of the profits observed at each trial:

𝑇=

Using properties of the sum of independent variables, we can predict that

𝐸(𝑇) = 𝐸(𝑌1 + 𝑌2 + ⋯ + 𝑌20 ) = = −100

𝑉(𝑇) = 𝑉(𝑌1 + 𝑌2 + ⋯ + 𝑌20 ) = = 1217.40

𝜎(𝑇) = √1217.40 = 34.89

Observation: although the expected value of 𝑻 is 20 times that of 𝒀, observe that the standard
deviation of 𝑻 is only √20 ≈ 4.5 times that of 𝒀.
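Since 𝐸(𝑇) = 20 𝐸(𝑌) = −100 and 𝑉(𝑇) = 20 𝑉(𝑌) ≈ 1217.4, a quick Monte Carlo simulation should land close to both values. This is an optional sketch; the seed and the number of trials are arbitrary choices.

```python
import random

random.seed(1)
CONTAINER = ["red"] * 8 + ["other"] * 16   # the 24 marbles

def profit_one_game():
    draw = random.sample(CONTAINER, 3)     # 3 marbles, without replacement
    return 10 * draw.count("red") - 15     # Y = 10X - 15

trials = 10_000
totals = [sum(profit_one_game() for _ in range(20)) for _ in range(trials)]

mean = sum(totals) / trials
var = sum((t - mean) ** 2 for t in totals) / trials
print(round(mean), round(var))   # close to E(T) = -100 and V(T) = 1217.4
```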

4.7 Bernoulli and Binomial distributions
A variable 𝑋 obeys a Bernoulli distribution if it is the result of any random experiment whose
outcome is deemed to be either a ‘success’, in which case the value of 𝑋 is 1, or a ‘failure’, in
which case 𝑋 = 0. If we assume that the probability of observing a success is 𝑝, then the pdf of
𝑋 always takes the aspect shown in Table 4.7.1.

Table 4.7.1
𝒙𝒊 𝒑(𝒙𝒊 )
0 1−𝑝
1 𝑝

To denote the fact that 𝑋 obeys a Bernoulli distribution, we can use the notation

𝑋 ~ 𝐵(1, 𝑝)

The probability of observing a ‘failure’ from a Bernoulli experiment is 1 − 𝑝 which is often


denoted as 𝑞.

The expected value, variance and standard deviation of Bernoulli variables are
𝐸(𝑋) = ∑ 𝑥𝑖 𝑝(𝑥𝑖 ) = 0(1 − 𝑝) + 1(𝑝) = 𝑝

𝑉(𝑋) = ∑ 𝑥𝑖2 𝑝(𝑥𝑖 ) − [𝐸(𝑋)]2 = 02 (1 − 𝑝) + 12 (𝑝) − 𝑝2 = 𝑝(1 − 𝑝) = 𝑝𝑞

𝜎(𝑋) = √𝑉(𝑋) = √𝑝(1 − 𝑝) = √𝑝𝑞

Example 4.7.1
A die is thrown. Consider the random variable,

1 ; if a 6 is obtained
𝑋={
0 ; otherwise
Then, the pdf of 𝑋 is
𝒙𝒊 𝒑(𝒙𝒊 )
0
1

𝐸(𝑋) =          𝑉(𝑋) =          𝜎(𝑋) = √(5⁄36) ≈ 0.372

A variable 𝑋 obeys a Binomial distribution if it is the result of a sum of identical Bernoulli
variables. More specifically, a binomial law is used to describe any variable with the following
criteria:

o 𝑛 identical random experiments take place;


o the result of each experiment leads either to a ‘success’ or to a ‘failure’;
o the probability (𝑝) of obtaining a ‘success’ at each trial remains constant, that is, each trial is
independent;
o variable 𝑋 is the total number of successes observed over the 𝑛 trials.

Notation
The following notation is used to denote a Binomial variable consisting of 𝑛 trials with constant
probability 𝑝 of success at each trial
𝑋 ~ 𝐵(𝑛, 𝑝)

Example 4.7.2
Suppose a fair coin is thrown 12 times. The variable, 𝑋 = “number of times the coin lands on
heads” obeys a Binomial law since:
o 12 identical coin tosses take place;
o the result of every toss is either heads (a success) or tails (a failure);
o the probability (𝑝) of the coin landing on ℎ𝑒𝑎𝑑𝑠 remains constant at all tosses (independent
outcomes);
o variable 𝑋 is the total number of heads observed over the 12 tosses.

We can therefore write that 𝑋 ~ 𝐵(12, 0.5)

Example 4.7.3
Suppose we toss two regular dice 36 separate times. The variable, 𝑌 = ‘number of times a sum
of 5 occurs’ obeys a Binomial law since:
o 36 identical dice tosses take place;
o the result of every toss is either a sum of 5 (a success) or it isn’t (a failure);
o the probability (𝑝) of a sum of 5 occurring remains constant at all tosses (independent
outcomes);
o variable 𝑌 is the total number of times a sum of 5 is observed over the 36 tosses.

We can therefore write that 𝑌 ~ 𝐵(36, 1⁄9)

4.8 Expected value, variance and standard deviation of a Binomial variable

If 𝑋 ~ 𝐵(𝑛, 𝑝), then


o 𝐸(𝑋) = 𝑛𝑝
o 𝑉𝑎𝑟(𝑋) = 𝑛𝑝𝑞
o 𝜎(𝑋) = √𝑛𝑝𝑞

Can you demonstrate these identities given that a Binomial variable is a sum of independent
Bernoulli variables?
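One optional way to convince yourself, short of an algebraic proof, is a brute-force check over every success/failure sequence (a Python sketch; the values 𝑛 = 12 and 𝑝 = 1⁄2 match the coin-toss setting of Example 4.7.2).

```python
from fractions import Fraction
from itertools import product

def binomial_EV(n, p):
    """E and V of 'number of successes over n independent Bernoulli(p) trials',
    computed by brute force over all 2^n success/failure sequences."""
    q = 1 - p
    E = EX2 = Fraction(0)
    for seq in product([0, 1], repeat=n):
        k = sum(seq)                # number of successes in this sequence
        prob = p**k * q**(n - k)    # independence: multiply the trial probabilities
        E += k * prob
        EX2 += k**2 * prob
    return E, EX2 - E**2

n, p = 12, Fraction(1, 2)
E, V = binomial_EV(n, p)
assert E == n * p and V == n * p * (1 - p)   # np and npq, exactly
```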

4.9 Probability distribution of a Binomial variable:


Let us suppose that 𝑋 ~ 𝐵(𝑛, 𝑝), then the probability that 𝑋 = 𝑘 where 0 ≤ 𝑘 ≤ 𝑛 is

𝑃(𝑋 = 𝑘) = 𝑝(𝑘) = 𝑛𝐶𝑘 𝑝𝑘 𝑞 𝑛−𝑘

We can justify the formula as follows. Every trial is independent; therefore, the probability of
obtaining 𝑘 successes followed by 𝑛 − 𝑘 failures is

𝑝 ∙ 𝑝 ∙ … ∙ 𝑝 (𝑘 times) ∙ 𝑞 ∙ 𝑞 ∙ … ∙ 𝑞 (𝑛 − 𝑘 times) = 𝑝𝑘 𝑞 𝑛−𝑘

However, one mustn’t forget that the successes need not happen first, nor even consecutively.
From the 𝑛 trials, any 𝑘 of them must result in successes (or, equivalently, 𝑛 − 𝑘 of them must
result in failures). There are 𝑛𝐶𝑘 ways to place the 𝑘 successes over 𝑛 trials. Hence,

𝑃(𝑋 = 𝑘) = 𝑝(𝑘) = 𝑛𝐶𝑘 𝑝𝑘 𝑞 𝑛−𝑘

Probabilities for Binomial variables can be computed on your calculator, with the use of Excel or
using Binomial tables.

Example 4.9.1
Let’s assume that 30% of current Marianopolis students graduated from a public high school.
Suppose that 20 Marianopolis students are selected randomly (and with replacement) from a
list provided by the registrar’s office. We then count how many of the selected students
graduated from a public high school. Obviously, there is no guarantee a priori as to the exact
number of students who graduated from a public high school in the selected sample.

Let 𝑋 = “number of students in the sample who graduated from a public high school”.

Does 𝑿 obey a Binomial distribution?


There are 20 identical ‘trials’ that take place (that is, selecting a student from the list of
Marianopolis students).

The result of each selection leads either to a ‘success’ (the student graduated from a public high
school) or a ‘failure’. The probability (𝑝 = 30%) of obtaining a ‘success’ at each trial remains
constant (since names are replaced and can be chosen again);

Variable 𝑋 is defined as the total number of ‘successes’ observed over the 20 trials. As a result,
𝑋 obeys a Binomial distribution and we can write 𝑋 ~ 𝐵(20, 0.3).

What is the expected value and standard deviation of 𝑿?


𝐸(𝑋) = = =6
𝑉(𝑋) = = = 4.2 ⇒ 𝜎(𝑋) = = ≈ 2.05

What is the probability that exactly 8 students in the sample are graduates of a public high
school?
𝑃(𝑋 = 8) = ≈ 11.4%

What is the probability that fewer than 5 students in the sample are graduates of a public
high school?
𝑃(𝑋 < 5) = ≈ 23.8%

What is the probability that at least 4 students in the sample are graduates of a public high
school?
𝑃(𝑋 ≥ 4) = ≈ 89.3%
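The three answers can be reproduced from the pmf formula of Section 4.9. The helper below is an optional Python sketch (the function name is ours, not standard notation); Excel or a calculator gives the same values.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) = nCk * p^k * q^(n-k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 20, 0.3
print(f"P(X = 8)  = {binom_pmf(8, n, p):.3f}")                             # 0.114
print(f"P(X < 5)  = {sum(binom_pmf(k, n, p) for k in range(5)):.3f}")      # 0.238
print(f"P(X >= 4) = {1 - sum(binom_pmf(k, n, p) for k in range(4)):.3f}")  # 0.893
```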

Example 4.9.2
A rapid antigen test is used to detect an infection to Covid. Let us assume that the antigen test’s
accuracy is 90%, which means that a person that has Covid will test positive with 90%
probability.

100 people who are known to be infected with Covid are tested using this product. The number
of positive tests that will result from this experiment is a random variable. There is no
guarantee as to the exact number of the tested individuals whose tests will come out positive.

Let 𝑌 = “number of tests whose result is positive”.

Does Y obey a Binomial distribution?

As a result, 𝑌 obeys a Binomial distribution and we can write 𝑌 ~ 𝐵( , )

What is the expected value and standard deviation of 𝒀?


𝐸(𝑌) = = = 90
𝑉(𝑌) = = =9 ⇒ 𝜎(𝑌) = =3

What is the probability that exactly 90 positive tests will be observed?

𝑃(𝑌 = 90) = ≈ 13.2%

What is the probability that fewer than 8 false negatives are observed?
If there are fewer than 8 false negative results, then there are more than 92 true positive
results:

𝑃( )= ≈ 20.6%

Section 5 Continuous random variables
A continuous variable can obviously not be dealt with in the same manner as a discrete
variable. For instance, how can we even discuss the probability distribution table of a variable
that has infinitely (indeed, uncountably) many possible outcomes?

Example 5.0.1
From her office window, Awa can see the REM as it rolls between two Griffintown buildings.
The REM runs exactly every 4 minutes. Out of curiosity, Awa conducts the following
experiment: at a randomly selected moment of the day, she uses a stopwatch to measure the
time that passes until she gets a glimpse of the REM.

The time (in seconds) she will measure is a random variable that we shall denote as 𝑇. The
support of variable 𝑇, 𝑆(𝑇), consists of values in the time interval [0, 240].

A priori, what is the probability that Awa’s measurement of time will be 𝑻 = 𝟏𝟐𝟓. 𝟑𝟐 𝐬?

What does the previous question tell you about the probability associated to individual
values that a continuous random variable may take?

5.1 Density functions
Given that a continuous random variable cannot be described adequately by a probability
distribution table, we need a different tool that will allow us to compute probabilities and other
main features (its expected value, variance, standard deviation).

This tool is called the density function (or a probability density function). The density function
of a continuous random variable is denoted by 𝑓(𝑥) and has the following characteristics.

Characteristics of the density function


Let 𝑋 represent a continuous random variable whose support (domain) is 𝑆(𝑋). 𝑓(𝑥) is a
density function of 𝑋 if

1. 𝑓(𝑥) ≥ 0 for all 𝑥 ∈ 𝑆(𝑋), and if


2. the area that is enclosed between 𝑓(𝑥) and the 𝑥-axis on its support is equal to 1.

When these conditions are satisfied, then


𝑃(𝑎 ≤ 𝑋 ≤ 𝑏)
is equal to the area that is enclosed between 𝑓(𝑥) and the 𝑥-axis within the interval [𝑎, 𝑏]

Example 5.1.1
Geologists are studying Mount Bigdely, an active volcano whose eventual eruption is probable.
The geologists suggest there is a strong probability that the next eruption will happen very
soon. It is unlikely, in their opinion, that the eruption will occur far into the future. It is
extremely unlikely that an eventual eruption never takes place.
Let’s define variable 𝑇 as the time delay, counted in days from now, until Mount Bigdely erupts.
What kind of variable is 𝑻 (discrete or continuous)? Explain briefly.

What aspect would 𝑻’s density function, 𝒇(𝒕), have? Sketch a reasonable graph of 𝒇(𝒕) and
mention any of its characteristics.

How would you write and find the probability that the volcano will erupt in the next week?

5.2 Integral notation
As you will learn in your second calculus course, integral notation is used to represent, “the
area that is enclosed between 𝑓(𝑥) and the 𝑥-axis within the interval [𝑎, 𝑏]”, whenever a
function remains non-negative.

We will make use of this notation to avoid


having to write the tedious expression, “the
area that is enclosed between 𝑓(𝑥) and the
𝑥-axis within the interval [𝑎, 𝑏]”, every
single time a probability is calculated for a
continuous random variable.

Note 1: Given that the area enclosed between 𝑓(𝑥) and the 𝑥-axis over the support of 𝑋
is 1, then

∫_𝑆(𝑋) 𝑓(𝑥)𝑑𝑥 = 1

Note 2: Given that probability is associated to area, only intervals can generate non-zero
probabilities. The probability that 𝑋 will take a specific individual value is zero.
𝑃(𝑋 = 𝑎) = 𝑃(𝑎 ≤ 𝑋 ≤ 𝑎) = ∫_𝑎^𝑎 𝑓(𝑥)𝑑𝑥 = 0

Note 3: The boundaries, individually, do not generate area and can therefore be included
or excluded from the interval without modifying the ensuing probability. That is,
𝑃(𝑎 ≤ 𝑋 ≤ 𝑏) = 𝑃(𝑎 < 𝑋 ≤ 𝑏) = 𝑃(𝑎 ≤ 𝑋 < 𝑏) = 𝑃(𝑎 < 𝑋 < 𝑏)

5.3 Expected value, variance and standard deviation of a continuous random variable
Suppose that 𝑋 is a continuous random variable whose support is 𝑆(𝑋), and let 𝑓(𝑥) represent
its density function.

Then, the following computational formulas can be used to obtain the expected value, variance
and standard deviation of a continuous random variable:


Expected value          𝐸(𝑋) = ∫_𝑆(𝑋) 𝑥 𝑓(𝑥)𝑑𝑥

Variance                𝑉(𝑋) = ∫_𝑆(𝑋) [𝑥 − 𝐸(𝑋)]² ∙ 𝑓(𝑥)𝑑𝑥 = ∫_𝑆(𝑋) 𝑥² 𝑓(𝑥)𝑑𝑥 − [𝐸(𝑋)]²

Standard deviation 𝜎(𝑋) = √𝑉(𝑋)

Do you see parallels between the formulas shown here and those you have used when we
discussed discrete random variables?

In the context of our course, you will not be asked to compute the expected values and
variances of continuous random variables unless the integrals can be associated to areas of
well-known geometric shapes.

Example 5.3.1
From her office window, Awa can see the REM as it rolls between two Griffintown buildings.
The REM runs exactly every 4 minutes. Awa conducts the following experiment: at a randomly
selected moment of the day, she uses a stopwatch to measure the time that passes until she
gets a glimpse of the REM.

Let 𝑇 = “the time (in seconds) that elapses until Awa sees the REM pass by”.

The REM runs every 4 minutes, and Awa has started her time measurement at a random
moment. Therefore, the next REM may pass within seconds, or Awa may have to wait up to
another 4 minutes. Neither of these possibilities is more likely than the other and hence, the
density function will be constant. More specifically, 𝑓(𝑡) = 1⁄240 with support 𝑆(𝑇) = [0, 240].

Show that 𝒇(𝒕) is in fact a density function.

We can use the graph of 𝑓 to make this verification:

✓ 𝑓(𝑡) ≥ 0 for all 𝑡 𝜖 𝑆(𝑇)


✓ ∫_𝑆(𝑇) 𝑓(𝑡)𝑑𝑡 = ∫_0^240 𝑓(𝑡) 𝑑𝑡

What is the probability that Awa’s waiting time will be between 1 and 2.5 minutes?

𝑃(60 < 𝑇 < 150) =

What is the probability that Awa’s waiting time will exceed 3 minutes?

𝑃(𝑇 > 180) = 𝑃(180 < 𝑇 ≤ 240)

87
What is the probability that Awa’s waiting time will not exceed 𝒙 minutes, where 𝒙 is in the
interval [𝟎, 𝟐𝟒𝟎]?

𝑃(𝑇 ≤ 𝑥) =

Explain why the rule above is only valid for 𝟎 ≤ 𝒙 ≤ 𝟐𝟒𝟎.

If 𝑥 < 0, then 𝑃(𝑇 ≤ 𝑥) = 0 since the support of 𝑇 is [0, 240]. No values of 𝑇 below 0 can
occur.

If 𝑥 > 240, then 𝑃(𝑇 ≤ 𝑥) = 1 since the support of 𝑇 is [0, 240]. All values of 𝑇 will necessarily
remain below 240.

This means that we could also write 𝑃(𝑇 ≤ 𝑥) as a piecewise function:

0; if 𝑥 < 0
𝑥
𝑃(𝑇 ≤ 𝑥) = { ; if 0 ≤ 𝑥 ≤ 240
240
1; if 𝑥 > 240

Use the rule you have created for 𝑷(𝑻 ≤ 𝒙) to compute:

a) 𝑃(𝑇 ≤ 80) =

b) 𝑃(𝑇 < 80) =

c) 𝑃(𝑇 = 80) =

d) 𝑃(𝑇 > 80) =

e) 𝑃(60 ≤ 𝑇 ≤ 80) =

f) 𝑃(𝑇 ≤ 80 | 𝑇 > 60) =
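The piecewise rule for 𝑃(𝑇 ≤ 𝑥) translates directly into code. The sketch below (an optional check) evaluates exercises a), e) and f), using 𝑃(𝐴 | 𝐵) = 𝑃(𝐴 ∩ 𝐵)⁄𝑃(𝐵) for the conditional one.

```python
from fractions import Fraction

def F(x):
    """P(T <= x) for the waiting time T, uniform on [0, 240] seconds."""
    if x < 0:
        return Fraction(0)
    if x > 240:
        return Fraction(1)
    return Fraction(x, 240)

print(F(80))                          # a), b): P(T <= 80) = P(T < 80) = 1/3
print(F(80) - F(60))                  # e): P(60 <= T <= 80) = 1/12
print((F(80) - F(60)) / (1 - F(60)))  # f): P(T <= 80 | T > 60) = 1/9
```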

Compute the expected value of 𝑻.

𝐸(𝑇) = ∫_0^240 𝑡 𝑓(𝑡)𝑑𝑡 = ∫_0^240 (        ) 𝑑𝑡 = area under the function within [0, 240].

𝐸(𝑇) = = 120

The result above is in seconds and, as you probably would have guessed, equal to
minutes.

Example 5.3.2
The duration of a metro stoppage, calculated in minutes, is a random variable (𝑋) whose
density function is given by the function

         0.01𝑥 ,        for 0 < 𝑥 ≤ 10
𝑓(𝑥) = {
         0.2 − 0.01𝑥 ,  for 10 < 𝑥 ≤ 20

Show that 𝒇(𝒙) is in fact a density function.


We can use the graph of 𝑓 to make this verification:

✓ 𝑓(𝑥) ≥ 0 for all 𝑥 𝜖 𝑆(𝑋) = [0, 20]


✓ ∫𝑆(𝑋) 𝑓(𝑥)𝑑𝑥 =

What is the probability that a stoppage will last more that 12 minutes?

𝑃(𝑋 > 12) = 𝑃(12 < 𝑋 ≤ 20)

What is the probability that a stoppage will last between 8 and 12 minutes?

𝑃(8 < 𝑋 < 12)
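Because the density is piecewise linear, these probabilities can be found with triangle areas. A numerical integration (an optional sketch, with the second branch read as 0.2 − 0.01𝑥 so that the total area is 1) confirms them without any calculus.

```python
def f(x):
    """Density of the stoppage duration; second branch assumed to be 0.2 - 0.01x."""
    if 0 < x <= 10:
        return 0.01 * x
    if 10 < x <= 20:
        return 0.2 - 0.01 * x
    return 0.0

def integrate(a, b, n=200_000):
    """Midpoint Riemann sum of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

print(round(integrate(0, 20), 4))    # total area: 1.0
print(round(integrate(12, 20), 4))   # P(X > 12) = 0.32
print(round(integrate(8, 12), 4))    # P(8 < X < 12) = 0.36
```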

5.4 Linear transformations and sums of independent continuous random variables

Recall that, when discussing discrete random variables, we had highlighted properties
pertaining to expected values, variances and standard deviations of either linear
transformations or sums of independent variables. These properties are also applicable to
continuous random variables. More specifically:

For linear transformations 𝒀 = 𝒂𝑿 + 𝒃 For sums of independent variables 𝑻 = 𝑿 + 𝒀

𝑬(𝒀) = 𝑬(𝒂𝑿 + 𝒃) = 𝒂𝑬(𝑿) + 𝒃 𝐸(𝑇) = 𝐸(𝑋 + 𝑌) = 𝐸(𝑋) + 𝐸(𝑌)


𝑽𝒂𝒓(𝒀) = 𝑽𝒂𝒓(𝒂𝑿 + 𝒃) = 𝒂𝟐 𝑽𝒂𝒓(𝑿) 𝑉(𝑇) = 𝑉(𝑋 + 𝑌) = 𝑉(𝑋) + 𝑉(𝑌)
𝝈(𝒀) = 𝝈(𝒂𝑿 + 𝒃) = |𝒂|𝝈(𝑿) 𝜎(𝑇) = √𝑉(𝑇) = √𝜎 2 (𝑋) + 𝜎 2 (𝑌)

Section 6 Normal distribution and sums of variables
6.1 Standard normal distribution
A continuous random variable that obeys a standard normal distribution is identified using the
notation
𝑍 ~ 𝑁(0, 1)

Note that, as much as possible, the label 𝑍 is strictly used to refer to the standard normal
distribution because of its frequent and varied use. Its parameters, 0 and 1, correspond to the
distribution’s mean (median, expected value) and standard deviation, respectively.

Saying that 𝑍 ~ 𝑁(0, 1) implies that 𝑍 is a continuous variable…


o whose support is ℝ
o whose expected value is 𝐸(𝑍) = 0
o whose standard deviation is 𝜎(𝑍) = 1
o whose density function is 𝑓(𝑧) = (1⁄√2𝜋) 𝑒^(−𝑧²⁄2)

Note that
o 𝑓(𝑧) is strictly positive and symmetric about the vertical axis 𝑧 = 0
o 𝑓(𝑧) has its maximum value at 𝑧 = 0
o 𝑓(𝑧) has inflection points at 𝑧 = ±1
o 𝑓(𝑧) → 0 as 𝑧 → ±∞

In short, 𝑓(𝑧) has the appearance of a ‘Bell Curve’ as shown in Figure 6.1.1. As is the case of all density functions, the area under 𝑓(𝑧) can be interpreted as probability, so

∫₋∞^∞ (1⁄√2𝜋) 𝑒^(−𝑧²⁄2) 𝑑𝑧 = 1

𝑃(𝑎 ≤ 𝑍 ≤ 𝑏) = ∫ₐᵇ (1⁄√2𝜋) 𝑒^(−𝑧²⁄2) 𝑑𝑧

Figure 6.1.1

The main issue with regard to this density function, even if you have learned integral calculus, is in calculating the area it contains within a given interval. More specifically, without computer assistance, we cannot compute an integral of the form:

∫ₐᵇ (1⁄√2𝜋) 𝑒^(−𝑧²⁄2) 𝑑𝑧

Excel will therefore be an essential tool in doing such calculations. Alternatively, we can make use of a 𝑍-table (see Appendix A), a portion of which is shown in Table 6.1.1. Both Excel and the table reveal the probability (area) that is cumulated to the left of a given 𝑧-score (see Figure 6.1.2).

Figure 6.1.2

Table 6.1.1
z 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
-1.1 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
-1.0 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
-0.9 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611
-0.8 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
-0.7 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148
-0.6 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451

You will notice when using the table that the empirical rule we introduced in the Descriptive
statistics section is based on the normal law. That is, given that 𝑍 ~ 𝑁(0, 1), then

o 𝑃(−1 < 𝑍 < 1) ≈ 68.27%


o 𝑃(−2 < 𝑍 < 2) ≈ 95.45%
o 𝑃(−3 < 𝑍 < 3) ≈ 99.73%
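The empirical-rule percentages above can be reproduced without a table. A sketch using Python's standard library, where `statistics.NormalDist` plays the role of Excel's NORM.S.DIST:

```python
# Cumulative areas of the standard normal via the standard library.
from statistics import NormalDist

Z = NormalDist(0, 1)  # standard normal Z ~ N(0, 1)
within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}
print({k: round(v, 4) for k, v in within.items()})
# within 1, 2, 3 standard deviations: ≈ 0.6827, 0.9545, 0.9973
```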

Example 6.1.1
Use a 𝑍-table or Excel to find
a) 𝑃(𝑍 ≤ −1.17) =
b) 𝑃(𝑍 > −1.17) =
c) 𝑃(𝑍 > 1.17) =
d) 𝑃(−1.17 < 𝑍 ≤ −0.64) =

Just as Excel (or a 𝑍-table) provides the cumulative probability left of a given 𝑧-value, one can also use it to find the 𝑧-value that bounds a region with a given cumulative probability.

Example 6.1.2

If 𝑍 ~ 𝑁(0, 1), then:


a) find 𝑎 such that 𝑃(𝑍 ≤ 𝑎) = 10%

𝑃(𝑍 ≤ 𝑎) = 10% 𝑎≈−

b) find 𝑏 such that 𝑃(𝑍 > 𝑏) = 35%

𝑃(𝑍 ≤ 𝑏) = 𝑏≈−

c) find 𝑐 such that 𝑃(−𝑐 ≤ 𝑍 < 𝑐) = 40%

𝑃(𝑍 < −𝑐) = −𝑐 ≈ OR 𝑐 ≈

Notation
We denote by 𝑧𝛼 and −𝑧𝛼 the values of 𝑍 that exclude an area corresponding to 𝛼 at either
extremity of 𝑓(𝑧).

Observe that we could also use 𝑧𝛼⁄2 and


−𝑧𝛼⁄2 as boundaries that would exclude a
total area 𝛼 in two equal regions at each
extremity.

Example 6.1.3
Use Excel (or a 𝑍-table), to find the value corresponding to −𝑧10% .

−𝑧10% is the 𝑍-value that excludes the


leftmost 10% of the distribution.
𝑃(𝑍 < −𝑧10% ) = 10%
−𝑧10% ≈ −1.282

Example 6.1.4
Use Excel (or a 𝑍-table), to find the value corresponding to 𝑧5% .

𝑧5% is the 𝑍-value that excludes the


rightmost 5% of the distribution.
𝑃(𝑍 > 𝑧5% ) = 5%
𝑃(𝑍 ≤ 𝑧5% ) =
𝑧5% ≈

Example 6.1.5
Use Excel (or a 𝑍-table), to find the value of 𝑎 such that 𝑃(−𝑎 ≤ 𝑍 ≤ 𝑎) = 80%.

With the use of a graph, we can deduce that


𝑎 and – 𝑎 must actually correspond to 𝑧10%
and −𝑧10%, respectively, given that these
values are symmetric (as are 𝑎 and – 𝑎), and
exclude 10% at each extremity, leaving the
central 80%. Since 𝑧10% = , then

𝑃( ≤𝑍≤ ) = 80%
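Quantities like −𝑧₁₀% and 𝑧₅% come from the inverse CDF (Excel's NORM.S.INV); in the standard library this is `NormalDist.inv_cdf`. A sketch covering Examples 6.1.3 to 6.1.5:

```python
# Inverse-CDF lookups for the standard normal.
from statistics import NormalDist

Z = NormalDist(0, 1)
neg_z10 = Z.inv_cdf(0.10)   # −z₁₀% : cuts off the leftmost 10%
z5 = Z.inv_cdf(0.95)        # z₅%  : cuts off the rightmost 5%
a = Z.inv_cdf(0.90)         # P(−a ≤ Z ≤ a) = 80% leaves 10% in each tail

print(round(neg_z10, 3), round(z5, 3), round(a, 3))
```

Note that the 80% central region reuses 𝑧₁₀%, exactly as argued in Example 6.1.5.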

6.2 General forms of normal distributions
A variable that obeys a normal distribution is identified using the notation
𝑋 ~ 𝑁(𝜇, 𝜎)
where the mean (median, expected value) and standard deviation of 𝑋 are given by the
parameters 𝜇 and 𝜎.

Saying that 𝑋 ~ 𝑁(𝜇, 𝜎) implies that 𝑋 is a continuous variable…


o whose support is ℝ
o whose expected value is 𝐸(𝑋) = 𝜇
o whose standard deviation is 𝜎(𝑋) = 𝜎
o whose density function is 𝑓(𝑥) = (1⁄(𝜎√2𝜋)) 𝑒^(−(𝑥−𝜇)²⁄(2𝜎²))

Note that this density function has the appearance of a bell-shaped curve. Moreover:
o 𝑓(𝑥) is strictly positive and symmetric about the vertical axis 𝑥 = 𝜇.
o 𝑓(𝑥) has its maximum value at 𝑥 = 𝜇
o 𝑓(𝑥) has inflection points at 𝑥 = 𝜇 ± 𝜎
o 𝑓(𝑥) → 0 as 𝑥 → ±∞

As is the case of all density functions, the area under 𝑓(𝑥) can be interpreted as probability, so

∫₋∞^∞ (1⁄(𝜎√2𝜋)) 𝑒^(−(𝑥−𝜇)²⁄(2𝜎²)) 𝑑𝑥 = 1

𝑃(𝑎 ≤ 𝑋 ≤ 𝑏) = ∫ₐᵇ (1⁄(𝜎√2𝜋)) 𝑒^(−(𝑥−𝜇)²⁄(2𝜎²)) 𝑑𝑥

Once again, you will observe that the empirical rule introduced in the Descriptive statistics
section is based on the normal law: regardless of 𝜇 and 𝜎, if 𝑋 ~ 𝑁(𝜇, 𝜎), then

o 𝑃(𝜇 − 1𝜎 < 𝑋 < 𝜇 + 1𝜎) ≈ 68.27%


o 𝑃(𝜇 − 2𝜎 < 𝑋 < 𝜇 + 2𝜎) ≈ 95.45%
o 𝑃(𝜇 − 3𝜎 < 𝑋 < 𝜇 + 3𝜎) ≈ 99.73%

Example 6.2.1
𝑇, 𝑉 and 𝑊 are random variables such that

𝑇 ~ 𝑁(5, 8) 𝑉 ~ 𝑁(5, 4) 𝑊 ~ 𝑁(−5, 4)

Associate each density function shown in


the graph (𝑓1 , 𝑓2 , and 𝑓3 ) to its
corresponding random variable (𝑇, 𝑉 or 𝑊).

True or false: There is more area under 𝑓1 and 𝑓2 than there is under 𝑓3 .

6.3 Properties of normal distributions
Properties of normal laws will be essential to manipulations we will conduct in the following
sections. These properties are stated (without proof) below:

Property 6.1 - Linear transformations of normally distributed variables


Let 𝑋 represent a normally distributed variable with mean 𝜇𝑋 and standard deviation 𝜎𝑋 and
suppose that variable 𝑌 is a linear transformation of 𝑋 (that is, 𝑌 = 𝑎𝑋 + 𝑏).

Then, 𝑌 will be normally distributed with 𝜇𝑌 = 𝑎𝜇𝑋 + 𝑏 and 𝜎𝑌 = |𝑎|𝜎𝑋 .

To summarize, if 𝑋 ~ 𝑁(𝜇𝑋 , 𝜎𝑋 ) and 𝑌 = 𝑎𝑋 + 𝑏, then

𝑌 ~ 𝑁(𝑎𝜇𝑋 + 𝑏, |𝑎|𝜎𝑋 )

Property 6.2 – Sums of independent normally distributed variables


Let 𝑋 and 𝑌 represent independent, normally distributed variables with means 𝜇𝑋 and 𝜇𝑌 , and
standard deviations 𝜎𝑋 and 𝜎𝑌 , respectively. Let 𝑇 represent the sum of 𝑋 and 𝑌.

Then, 𝑇 = 𝑋 + 𝑌 will be normally distributed with 𝜇 𝑇 = 𝜇𝑋 + 𝜇𝑌 and 𝜎𝑇 = √𝜎𝑋2 + 𝜎𝑌2 .

To summarize, if 𝑋 ~ 𝑁(𝜇𝑋 , 𝜎𝑋 ) and 𝑌 ~ 𝑁(𝜇𝑌 , 𝜎𝑌 ) are independent, then

𝑋 + 𝑌 = 𝑇 ~ 𝑁 (𝜇𝑋 + 𝜇𝑌 , √𝜎𝑋2 + 𝜎𝑌2 )

Property 6.3 – Differences of independent normally distributed variables


Let 𝑋 and 𝑌 represent independent, normally distributed variables with means 𝜇𝑋 and 𝜇𝑌 , and
standard deviations 𝜎𝑋 and 𝜎𝑌 , respectively. Let 𝐷 represent the difference of 𝑋 and 𝑌.

Then, 𝐷 = 𝑋 − 𝑌 will be normally distributed with 𝜇𝐷 = 𝜇𝑋 − 𝜇𝑌 and 𝜎𝐷 = √𝜎𝑋2 + 𝜎𝑌2 .

To summarize, if 𝑋 ~ 𝑁(𝜇𝑋 , 𝜎𝑋 ) and 𝑌 ~ 𝑁(𝜇𝑌 , 𝜎𝑌 ) are independent, then

𝑋 − 𝑌 = 𝐷 ~ 𝑁 (𝜇𝑋 − 𝜇𝑌 , √𝜎𝑋2 + 𝜎𝑌2 )

Can you use Property 6.1 and Property 6.2 to demonstrate Property 6.3?
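Properties 6.2 and 6.3 can be sketched with `statistics.NormalDist`, which supports adding and subtracting two instances as independent normal variables (using the variables of Example 6.2.1):

```python
# Sums and differences of independent normals via statistics.NormalDist.
from math import sqrt
from statistics import NormalDist

X = NormalDist(mu=5, sigma=8)
Y = NormalDist(mu=5, sigma=4)

T = X + Y   # sum of independent normals: N(10, sqrt(8² + 4²))
D = X - Y   # difference of independent normals: N(0, sqrt(8² + 4²))

print(T.mean, round(T.stdev, 4))
print(D.mean, round(D.stdev, 4))
```

Notice that the sum and the difference share the same standard deviation, √(𝜎𝑋² + 𝜎𝑌²), exactly as Property 6.3 states.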

The fact that linear transformations, sums and differences of normally distributed variables
generate normally distributed variables is a newly acquired concept.

How we obtain the parameters of 𝑌 = 𝑎𝑋 + 𝑏 and of 𝑇 = 𝑋 ± 𝑌 are, however, the result of


properties we have discussed earlier.

𝐸(𝑌) = 𝐸(𝑎𝑋 + 𝑏) = 𝑎𝐸(𝑋) + 𝑏
𝑉𝑎𝑟(𝑌) = 𝑉𝑎𝑟(𝑎𝑋 + 𝑏) = 𝑎²𝑉𝑎𝑟(𝑋)
𝜎(𝑌) = 𝜎(𝑎𝑋 + 𝑏) = |𝑎|𝜎(𝑋)

𝐸(𝑇) = 𝐸(𝑋 ± 𝑌) = 𝐸(𝑋) ± 𝐸(𝑌)
𝑉(𝑇) = 𝑉(𝑋 ± 𝑌) = 𝑉(𝑋) + 𝑉(𝑌)
𝜎(𝑇) = √𝑉(𝑇)

Two problems now arise regarding the density function 𝑓(𝑥) = (1⁄(𝜎√2𝜋)) 𝑒^(−(𝑥−𝜇)²⁄(2𝜎²)).

Firstly, the curve’s geometry is not one for which we can easily find the area. Secondly, one
cannot print tables of normal distributions for every possible combination of mean 𝜇 and
standard deviation, 𝜎.

If you are using Excel, luckily, the integrated commands can be adapted to any normal law.
Without Excel, converting 𝑋 ~ 𝑁(𝜇, 𝜎) into a standard normal variable 𝑍 ~ 𝑁(0, 1), for which
a table is available, will be necessary.

6.4 Conversions between general and standard normal laws


Here is where we can make use of Property 6.1. If 𝑋 ~ 𝑁(𝜇, 𝜎), then any transformation of the
form 𝑎𝑋 + 𝑏 will generate a new normally distributed variable.

Hence, consider the linear transformation (𝑋 − 𝜇)⁄𝜎 = (1⁄𝜎)𝑋 + (−𝜇⁄𝜎), that is, 𝑎𝑋 + 𝑏 with 𝑎 = 1⁄𝜎 and 𝑏 = −𝜇⁄𝜎.

The expected value and standard deviation of this new (normally distributed) variable are

𝐸((1⁄𝜎)𝑋 − 𝜇⁄𝜎) = (1⁄𝜎)𝐸(𝑋) − 𝜇⁄𝜎 = (1⁄𝜎)𝜇 − 𝜇⁄𝜎 = 0

𝜎((1⁄𝜎)𝑋 − 𝜇⁄𝜎) = |1⁄𝜎| 𝜎(𝑋) = (1⁄𝜎)𝜎 = 1

As you can see, (𝑋 − 𝜇)⁄𝜎 is not only normally distributed, it obeys a standard normal law.

(𝑋 − 𝜇)⁄𝜎 = 𝑍 ~ 𝑁(0, 1)

Transformations from 𝑋 to 𝑍, or vice-versa,


will be useful whenever the 𝑍-table is
required to perform probability
calculations.

Example 6.4.1
Given that 𝑋 ~ 𝑁(4, 9), use a conversion to 𝑍 to compute the following probabilities:

a) 𝑃(𝑋 ≤ 8) =

b) 𝑃(𝑋 > 6) =

c) 𝑃(−10 < 𝑋 ≤ 20) =
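The probabilities of Example 6.4.1 can be checked two equivalent ways: standardizing to 𝑍, or querying 𝑁(4, 9) directly. A standard-library sketch:

```python
# Example 6.4.1: conversion to Z versus a direct N(4, 9) query.
from statistics import NormalDist

Z = NormalDist(0, 1)
X = NormalDist(4, 9)

a = Z.cdf((8 - 4) / 9)        # a) P(X ≤ 8) via standardization
b = 1 - X.cdf(6)              # b) P(X > 6)
c = X.cdf(20) - X.cdf(-10)    # c) P(−10 < X ≤ 20)

print(round(a, 4), round(b, 4), round(c, 4))
assert abs(a - X.cdf(8)) < 1e-12   # both routes agree
```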

Once again, rather than calculating probabilities from given bounds, you may also need to
obtain bounds from a given probability. It is highly suggested, if you are using a 𝑍-table, to
sketch the region whose bounds you wish to identify. If you are using Excel, a conversion to a
standard normal law is not necessary. Whether you use Excel or a 𝑍-table, one must keep in
mind that only areas left of a given boundary are directly provided.

Example 6.4.2
Let 𝑋 ~ 𝑁(4, 9).

Variable 𝑋 can be converted to 𝑍 as follows:  𝑍 = (𝑋 − 4)⁄9

Likewise, variable 𝑍 can be converted to 𝑋 as follows:  𝑋 = 𝜇 + 𝜎𝑍

Find the values of 𝑎, 𝑏 and 𝑐 such that

o 𝑃(𝑋 ≤ 𝑎) = 10%

We know that 𝑃(𝑍 ≤ −𝑧10% ) = 10%

From the 𝑍-table, find −𝑧10% ≈ −1.282


𝑃(𝑍 ≤ −1.282) = 10%
𝑃(4 + 9𝑍 ≤ 4 + 9(−1.282)) = 10%
𝑃(𝑋 ≤ −7.538) = 10%
Thus, 𝑎 ≈ −7.538

o 𝑃(𝑋 ≥ 𝑏) = 35%

o 𝑃(4 − 𝑐 ≤ 𝑋 ≤ 4 + 𝑐) = 90%

6.5 Sums of identical and independent random variables
Before moving on to the next sections whose focus will be inferential statistics, let us introduce
a valuable fact that applies to sums of independent random variables. This fact justifies the
importance that is given to normal distributions and will lead to the Central Limit Theorem.

First, let us revisit some important properties we have introduced earlier. Whether 𝑋 and 𝑌 are discrete or continuous random variables, then

1. 𝐸(𝑋 + 𝑌) = 𝐸(𝑋) + 𝐸(𝑌)


2. 𝑉(𝑋 + 𝑌) = 𝑉(𝑋) + 𝑉(𝑌) as long as 𝑋 and 𝑌 are independent.

If 𝑋 and 𝑌 are normally distributed variables, we can even conclude that the sum 𝑋 + 𝑌 will be
normally distributed as well.

This is not the case, however, if 𝑋 and 𝑌 obey other distributions. Suppose, for instance, that
𝑋 and 𝑌 are independent and obey a Bernoulli distribution with 𝑝 = 1⁄6, like throwing two dice and considering, for each one, ‘obtaining a 6’ as a success.
𝑋 and 𝑌 look nothing like normal distributions:

𝒙 𝒑(𝒙)
0 5/6
1 1/6

𝐸(𝑋) = 1⁄6
𝑉(𝑋) = (1⁄6)(5⁄6) = 5⁄36

𝒚 𝒑(𝒚)
0 5/6
1 1/6

𝐸(𝑌) = 1⁄6
𝑉(𝑌) = (1⁄6)(5⁄6) = 5⁄36

The sum, 𝑋 + 𝑌, obeys a Binomial distribution: 𝑋 + 𝑌 ~ 𝐵(2, 1⁄6).

𝒙+𝒚 𝒑(𝒙 + 𝒚)
0 25/36
1 10/36
2 1/36

The expected value and variance of 𝑋 + 𝑌 are:


𝐸(𝑋 + 𝑌) = 2(1⁄6) = 1⁄3

𝑉(𝑋 + 𝑌) = 2(1⁄6)(5⁄6) = 5⁄18

One can easily verify that both properties 𝐸(𝑋 + 𝑌) = 𝐸(𝑋) + 𝐸(𝑌) and
𝑉(𝑋 + 𝑌) = 𝑉(𝑋) + 𝑉(𝑌) are satisfied:

𝐸(𝑋) + 𝐸(𝑌) = 1⁄6 + 1⁄6 = 1⁄3 = 𝐸(𝑋 + 𝑌)

𝑉(𝑋) + 𝑉(𝑌) = 5⁄36 + 5⁄36 = 10⁄36 = 5⁄18 = 𝑉(𝑋 + 𝑌)

Nonetheless, the graphical distribution of 𝑋 + 𝑌 looks nothing like a normally distributed variable. This comes as no surprise… why would we expect a bell-shaped curve?

Suppose, however, that the experiment we have described in the previous discussion had been conducted on 180 dice, rather than just 2. For each die we could define the Bernoulli variable:

         1 ; when a 6 is obtained from die ′𝑘′
𝑋𝑘 = {
         0 ; when a 6 is not obtained from die ′𝑘′

For every 𝑋𝑘 , we would have an expected value of 𝐸(𝑋𝑘 ) = 1⁄6 and a variance 𝑉(𝑋𝑘 ) = (1⁄6)(5⁄6).
The sum 𝑇 = 𝑋1 + 𝑋2 + ⋯ + 𝑋180 would obey a Binomial distribution.
More specifically, 𝑇 ~ 𝐵(180, 1⁄6). Its expected value and variance are

𝐸(𝑇) = 180(1⁄6) = 30
𝑉(𝑇) = 180(1⁄6)(5⁄6) = 25
𝜎(𝑇) = √𝑉(𝑇) = 5

You will notice that properties for sums of variables are once again respected. That is,

𝐸(𝑇) = 𝐸(𝑋1 ) + ⋯ + 𝐸(𝑋180 )


𝑉(𝑇) = 𝑉(𝑋1 ) + ⋯ + 𝑉(𝑋180 ) since all 𝑋𝑘 are independent

So far, there is nothing surprising about variable 𝑇… that is, until we look at its probability
distribution graph.

Although 𝑇 is a discrete variable, its distribution is displaying a bell-shaped aspect. The
similitude to a Normal distribution becomes even more obvious when 𝐵(180, 1⁄6) and
𝑁(30, 5) are shown on the same graph:

It would be unlikely that such a resemblance would be the result of a ‘coincidence’! And, in fact,
it is not a coincidence at all.

Fact regarding the sum of identically distributed and independent variables


Let 𝑋1, 𝑋2, 𝑋3, …, 𝑋𝑛 represent variables that are identically distributed and independent.
Then, the sum 𝑇 = 𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 can be approximated by a normal distribution as long as 𝑛 is big enough.

Assuming that 𝐸(𝑋𝑘 ) = 𝜇 and that 𝑉(𝑋𝑘 ) = 𝜎 2 for all 𝑘, then

𝐸(𝑇) = 𝐸(𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 ) = 𝐸(𝑋1 ) + 𝐸(𝑋2 ) + ⋯ + 𝐸(𝑋𝑛 ) = 𝑛𝜇


𝑉(𝑇) = 𝑉(𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 ) = 𝑉(𝑋1 ) + 𝑉(𝑋2 ) + ⋯ + 𝑉(𝑋𝑛 ) = 𝑛𝜎 2
𝜎(𝑇) = √𝑛 𝜎

We can therefore summarize our Fact in the following way:


✓ If 𝑋1, 𝑋2, 𝑋3, …, 𝑋𝑛 are identically distributed variables, and
✓ If 𝑋1, 𝑋2, 𝑋3, …, 𝑋𝑛 are independent, and if
✓ 𝑛 is big enough,

then, 𝑇 = 𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 ~ 𝑁(𝑛𝜇, √𝑛𝜎)
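The 180-dice discussion above can be sketched by simulation: each trial sums 180 Bernoulli(1⁄6) indicators, and the resulting totals should behave roughly like 𝑁(30, 5).

```python
# Simulation sketch: sums of 180 identical, independent Bernoulli variables.
import random
from statistics import fmean, pstdev

random.seed(42)

def one_total(n=180, p=1/6):
    """Sum of n independent Bernoulli(p) indicators (number of sixes)."""
    return sum(1 for _ in range(n) if random.random() < p)

totals = [one_total() for _ in range(20_000)]
print(round(fmean(totals), 2), round(pstdev(totals), 2))  # near 30 and 5
```

Plotting a histogram of `totals` would reproduce the bell-shaped figure described in the text.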

How big must 𝑛 be to be big enough?

In general, we consider 𝑛 ≥ 30 as being an acceptable size, but we will provide more specific
guidelines later. Above all other factors, the size of 𝑛 will depend on how symmetric the
variables 𝑋𝑘 were in the first place. The more asymmetrical 𝑋𝑘 is, the bigger the 𝑛 will have to
be.

Many examples can illustrate how imperfect the 𝑛 ≥ 30 criterion is. Observe how, in the graphs below, a sum of 16 identical variables (Figure 6.5.1) suffices to display a behaviour similar to that of a normal law, whereas a sum of 80 identical variables (Figure 6.5.2) is insufficient to generate a bell-shaped distribution.

Figure 6.5.1 Figure 6.5.2

Example 6.5.1
Suppose that 60 regular dice are thrown. Let 𝑋𝑘 denote the result that is observed from the 𝑘 𝑡ℎ
die, for 𝑘 = 1, 2, … , 60. The probability distribution of every 𝑋𝑘 is identical.

Write the pdf of 𝑿𝒌 :

𝒙𝒌 𝒑(𝒙𝒌 )
1
2
3
4
5
6

Compute the expected value, variance and standard deviation of 𝑿𝒌 :

𝐸(𝑋𝑘 ) = ∑ 𝑥𝑘 𝑝(𝑥𝑘 ) =          = 7⁄2

𝑉(𝑋𝑘 ) = ∑ 𝑥𝑘² 𝑝(𝑥𝑘 ) − [𝐸(𝑋)]² =          = 35⁄12

𝜎(𝑋𝑘 ) = √(35⁄12)

What does the variable 𝑻 = 𝑿𝟏 + 𝑿𝟐 + ⋯ + 𝑿𝟔𝟎 represent and what distribution law does it
approximately obey?

Estimate the probability that the sum will remain less than 200.

𝑃(𝑇 < 200) =

Estimate the 90th percentile of 𝑻.
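A sketch of the normal approximation at work in Example 6.5.1: with 𝜇 = 7⁄2 and 𝜎² = 35⁄12 per die, 𝑇 = 𝑋₁ + ⋯ + 𝑋₆₀ is approximately 𝑁(210, √175).

```python
# Normal approximation for the sum of 60 dice (no continuity correction).
from math import sqrt
from statistics import NormalDist

n, mu, var = 60, 7 / 2, 35 / 12
T = NormalDist(n * mu, sqrt(n * var))   # N(210, 5√7)

p_less_200 = T.cdf(200)   # P(T < 200)
pct90 = T.inv_cdf(0.90)   # 90th percentile of T

print(round(p_less_200, 4), round(pct90, 2))
```

Section 6.6 refines the first probability with a continuity correction.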

6.6 Continuity correction (optional)
As you completed the last example, the following question may have popped into your mind: “Is it really OK to use a continuous distribution (a normal law, in this case) to estimate probabilities of a discrete variable?” The answer is ‘Yes, but…’

Indeed, the variable 𝑇 = “the sum of the faces of the 60 dice” can only produce discrete (integer)
outcomes, whereas the normal law we are using to estimate its probabilities has ℝ as its
support.

This discrepancy may lead to estimation errors or, worst, to contradictions.


For example, the variable 𝑇 = “the sum of the faces of the 60 dice” is a discrete variable. As such, the
probabilities 𝑃(𝑇 < 210) and 𝑃(𝑇 ≤ 210) differ, simply because 𝑃(𝑇 < 210) = 𝑃(𝑇 ≤ 209).
The normal distribution 𝑁(210, 5√7), on the other hand, would conclude that

𝑃(𝑇 < 210) = 𝑃(𝑇 ≤ 210)

Worse yet, the normal law 𝑁(210, 5√7) would predict that 𝑃(𝑇 = 210) = 0, given that an individual value of a continuous random variable has 0 probability of occurring a priori. This is clearly incorrect. You can surely list many ways for the sum of 60 dice to be 210.

More appropriate calculations can be conducted by applying a continuity correction. In essence,


the transition from discrete to continuous requires that every integer of the original variable be
replaced by an interval that begins half an increment before it, and ends half an increment after
it, when using the normal law to perform calculations.

For example, a sum 𝑇 = 210 for the original (discrete) variable would be represented by the
interval [209.5, 210.5[ in its normal law approximation.

At this stage, for notational purposes, you may want to distinguish the original variable 𝑇 from its approximating normal law, 𝑇𝑁 .

Using the continuity correction, a more precise estimate of 𝑃(𝑇 < 200) would be

𝑃(𝑇 < 200) ≈ 𝑃(𝑇𝑁 ≤ 199.5) = 𝑃((𝑇𝑁 − 210)⁄(5√7) < (199.5 − 210)⁄(5√7)) = 𝑃(𝑍 < −0.794) ≈ 21.4%

And rather than concluding that 𝑃(𝑇 = 210) = 0, we would find that

𝑃(𝑇 = 210) ≈ 𝑃(209.5 ≤ 𝑇𝑁 < 210.5)
= 𝑃(𝑇𝑁 < 210.5) − 𝑃(𝑇𝑁 < 209.5)
= 𝑃((𝑇𝑁 − 210)⁄(5√7) < (210.5 − 210)⁄(5√7)) − 𝑃((𝑇𝑁 − 210)⁄(5√7) < (209.5 − 210)⁄(5√7))
= 𝑃(𝑍 < 0.038) − 𝑃(𝑍 < −0.038)
≈ 0.515 − 0.485
≈ 3%
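The continuity-correction calculations can be reproduced numerically; a sketch where 𝑇𝑁 is the approximating normal law 𝑁(210, 5√7):

```python
# Continuity correction for the sum of 60 dice.
from math import sqrt
from statistics import NormalDist

TN = NormalDist(210, 5 * sqrt(7))

p_less_200 = TN.cdf(199.5)                   # P(T < 200) with correction
p_equal_210 = TN.cdf(210.5) - TN.cdf(209.5)  # P(T = 210) with correction

print(round(p_less_200, 3), round(p_equal_210, 3))  # near 0.214 and 0.030
```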

Main take-aways from this section

Linear transformations of normally distributed variables


If 𝑌 = 𝑎𝑋 + 𝑏 where 𝑋 ~ 𝑁(𝜇𝑋 , 𝜎𝑋 ), then 𝑌 ~ 𝑁(𝑎𝜇𝑋 + 𝑏, |𝑎|𝜎𝑋 )

Conversions between normal and standard normal variables:

𝑃(𝑋 ≤ 𝑎) = 𝑃(𝑍 ≤ (𝑎 − 𝜇)⁄𝜎)
𝑃(𝑍 ≤ 𝑧𝛼 ) = 𝑃(𝑋 ≤ 𝜇 + 𝑧𝛼 𝜎)

Sums of independent variables

Case 1
Consider 𝑋𝑘 , independent random variables such that 𝐸(𝑋𝑘 ) = 𝜇𝑘 and 𝜎(𝑋𝑘 ) = 𝜎𝑘 .
If variables 𝑋𝑘 are normally distributed, then the sum 𝑇 = 𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 will also be
normally distributed and 𝑇 ~ 𝑁(𝜇1 + 𝜇2 + ⋯ + 𝜇𝑛 , √(𝜎1² + 𝜎2² + ⋯ + 𝜎𝑛²)) regardless of 𝑛.

Case 2
Consider 𝑋𝑘 , identical and independent random variables such that 𝐸(𝑋𝑘 ) = 𝜇 and 𝜎(𝑋𝑘 ) = 𝜎.
If 𝑛 is big enough (𝑛 ≥ 30), the sum 𝑇 = 𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 will be normally distributed and
𝑇 ~ 𝑁(𝑛𝜇, √𝑛𝜎)

Continuity corrections can be made if 𝑇 is a discrete random variable (which occurs if the 𝑋𝑘
are discrete or Bernoulli variables).

Section 7 Sample distributions and the Central Limit Theorem
Although a census (gathering data from an entire population) would provide exact and
incontestable information, there are many situations where such data collecting is impossible
(too long, too complicated, too costly). Sampling then becomes an essential tool to gathering
information that can potentially shine a light on a population’s characteristics. In this section,
we will introduce two of the most commonly used sampling distributions, the sample mean and
the sample proportion.

Let us consider a scenario where a sample of 𝑛 individuals is selected (with replacement) from a
population and let 𝑋 represent a variable whose average value and standard deviation are 𝜇
and 𝜎, respectively.

The value of 𝑋 that will eventually be collected from the 𝑘 𝑡ℎ individual is a random variable that
we will denote as 𝑋𝑘 . A priori, there is no way to predict with any confidence the exact value
that 𝑋𝑘 will take. However, given that the variable of study has a population mean 𝜇 and
standard deviation 𝜎, then 𝑋𝑘 ’s expected value and standard deviation are 𝐸(𝑋𝑘 ) = 𝜇 and
𝜎(𝑋𝑘 ) = 𝜎, respectively (that is, in line with known parameters of the population).

Example 6.6.1
At Marianopolis, the 𝑅-score among the student population has an average value of 29.2, with a standard deviation of 2.8.

Suppose that two students are randomly selected, with replacement, from the Marianopolis
student population and let 𝑋1 and 𝑋2 represent the 𝑅-scores of these Marianopolis students.
Replacement of the selected individuals guarantees the independence of 𝑋1 and of 𝑋2.

Furthermore, since the students are to be selected from the same population (the Marianopolis
student population), then 𝑋1 and 𝑋2 have the same probability distribution. From what we
know of the population, we deduce that 𝐸(𝑋1 ) = 𝐸(𝑋2 ) = 29.2, and 𝜎(𝑋1 ) = 𝜎(𝑋2 ) = 2.8.

Likewise, if 𝑛 individuals were selected at random, with replacement, from the Marianopolis
student population then 𝐸(𝑋𝑘 ) = 29.2 and 𝜎(𝑋𝑘 ) = 2.8 for all 𝑘 = 1, 2, … , 𝑛.

7.1 Sample Means
The average value of a sample – or sample mean – is denoted by 𝑋̅. It is a random variable
whose value depends on the 𝑛 individuals that will ultimately be selected from the population.
It is essential to comprehend the difference between 𝑋̅ and the population mean 𝜇. The population mean is not a random variable. Its value is not subject to chance or to the effects of random selections. 𝜇 is a fixed value (at a given moment in time) that is characteristic of the population: it is a parameter. 𝜎 and 𝜌 are other examples of parameters, not to be confused with 𝑆 and 𝑅 (which are variables obtained from sample data).

The sample mean is defined as

𝑋̅ = (𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 )⁄𝑛,

or equivalently as

𝑋̅ = (1⁄𝑛)(𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 )

Although the observed value of 𝑋̅ will seldom (never) be equal to 𝜇, it should provide a good estimate for it. In fact, the expected value of 𝑋̅ is equal to 𝜇:

𝐸(𝑋̅) = 𝐸[(1⁄𝑛)(𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 )]

= 𝜇

In other words, if all possible random samples of size 𝑛 were collected from the population, the average of the observed values of 𝑋̅ would be 𝜇. In terms of the variability of 𝑋̅, one should expect it to be less than the variability of individual observations, 𝑋𝑘 . A sample with individuals whose measures, 𝑋𝑘 , are left of 𝜇 will most likely be compensated by those whose measures are right of 𝜇… creating a centering phenomenon.

𝑉(𝑋̅) = 𝑉[(1⁄𝑛)(𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 )]

= 𝜎²⁄𝑛

𝜎(𝑋̅) = √𝑉(𝑋̅) = 𝜎⁄√𝑛

As predicted, the variability of 𝑋̅ is reduced compared to that of 𝑋𝑘 . The ratio 𝜎⁄√𝑛 is called the standard error. It is the standard deviation of sample means, relative to the parameter 𝜇.
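The standard-error formula can be illustrated by simulation. A sketch (not part of the course material) where the population variable is uniform on (0, 1), so that 𝜎 = √(1⁄12):

```python
# Simulation sketch: the standard deviation of sample means is close to σ/√n.
import random
from math import sqrt
from statistics import fmean, pstdev

random.seed(7)
sigma, n = sqrt(1 / 12), 25

# 20 000 samples of size n; record each sample mean.
means = [fmean(random.random() for _ in range(n)) for _ in range(20_000)]
print(round(pstdev(means), 4), round(sigma / sqrt(n), 4))  # the two should agree
```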

Conceptual questions
Based on what was shown above, what can you say about how close 𝑿̅ will be to 𝝁 when…

1. Variable 𝑿 among the population is extremely homogeneous?

2. The sample size is extremely big?

In short, 𝑋̅ will tend to remain close to 𝜇 whenever a population is homogeneous with respect to variable 𝑋, and/or when the sample size is big.
Let us go further in describing the behaviour of the sample mean, 𝑋̅.

Recall from section 6.5 that if

✓ 𝑋1, 𝑋2, 𝑋3, …, 𝑋𝑛 are identically distributed variables, and if


✓ 𝑋1, 𝑋2, 𝑋3, …, 𝑋𝑛 are independent, and if
✓ 𝑛 is big enough (𝑛 ≥ 30), then

𝑇 will be normally distributed and such that 𝑇 = 𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 ~ 𝑁(𝑛𝜇, √𝑛𝜎).

Given that 𝑋̅ is a linear transformation of 𝑇 (that is, 𝑋̅ = (1⁄𝑛)𝑇), it too will be normally distributed. Using the fact that 𝐸(𝑋̅) = 𝜇 and 𝜎(𝑋̅) = 𝜎⁄√𝑛, we can conclude that

𝑋̅ ~ 𝑁(𝜇 , 𝜎⁄√𝑛)

This is, in synonymous terms, the main conclusion of the Central Limit Theorem.

7.2 Central Limit Theorem

The Central Limit Theorem (CLT)

If 𝑋1, 𝑋2, 𝑋3, …, 𝑋𝑛 are identically distributed and independent random variables, and if 𝑛 is big enough (𝑛 ≥ 30), then (𝑋̅ − 𝜇)⁄(𝜎⁄√𝑛) → 𝑍 ~ 𝑁(0, 1) or, equivalently, 𝑋̅ ~ 𝑁(𝜇 , 𝜎⁄√𝑛).

The standard deviation of variable 𝑋̅, known as the standard error and measured by 𝜎⁄√𝑛, now becomes the tool with which variations from the mean 𝜇 are calculated. Empirical rules that predict the probability of a sample mean being within 1, 2 or 3 standard errors of 𝜇 still apply given that 𝑋̅ is distributed normally:

o 𝑃(𝜇 − 1𝜎⁄√𝑛 < 𝑋̅ < 𝜇 + 1𝜎⁄√𝑛) ≈ 68.27%
o 𝑃(𝜇 − 2𝜎⁄√𝑛 < 𝑋̅ < 𝜇 + 2𝜎⁄√𝑛) ≈ 95.45%
o 𝑃(𝜇 − 3𝜎⁄√𝑛 < 𝑋̅ < 𝜇 + 3𝜎⁄√𝑛) ≈ 99.73%

Note on the independent selections.
In theory, samples that are selected at random without replacement do not fulfill the criteria of
independence of the 𝑋𝑘 . However, we treat such cases as independent whenever the sample
size is small compared to the population size.

Example 7.2.1
At Marianopolis, the 𝑅-score of the student population has an average value of 29.2, with a standard deviation of 2.8.

Intuitively, that is without computing, how likely (in %) do you think it is to select a student
whose 𝑅-score is 30 or more?

The population’s 𝑅-score distribution is unknown, so an exact probability calculation cannot be


conducted. However, if the population mean is 29.2 and the standard deviation is 2.8, then
obtaining an R-score of 30 or more requires a relatively small deviation from 𝜇 (a deviation of ¼
of a standard deviation suffices). Hence such an outcome should be quite likely to occur.

If a sample of size 𝒏 = 𝟒𝟗 is selected from the Marianopolis student population, what approximate distribution does the sample mean (that is, 𝑿̅ ) obey?

In virtue of the Central Limit Theorem:

(𝑋̅ − 29.2)⁄(2.8⁄√49) ~ 𝑁(0, 1) or, equivalently, 𝑋̅ ~ 𝑁(     ,     )

What is the probability that the average 𝑹-score computed from a sample of 49 randomly selected Marianopolis students is at least 𝟑𝟎?

𝑃(𝑋̅ ≥ 30) =

It may be quite likely for a student to have an 𝑅-score of 30 or more, individually, but our
calculation predicts that it would be very unlikely for a random sample of 49 students to have
an average 𝑅-score of 30 or more.

116
What values of the sample mean, centered around 𝝁 = 𝟐𝟗. 𝟐, are likely to occur with 95%
probability?
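Both questions of Example 7.2.1 can be checked numerically. Under the CLT, 𝑋̅ ~ 𝑁(29.2, 2.8⁄√49) = 𝑁(29.2, 0.4); a standard-library sketch:

```python
# Example 7.2.1: the sampling distribution of the mean R-score.
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 29.2, 2.8, 49
Xbar = NormalDist(mu, sigma / sqrt(n))   # N(29.2, 0.4) by the CLT

p_at_least_30 = 1 - Xbar.cdf(30)                    # P(X̄ ≥ 30)
lo, hi = Xbar.inv_cdf(0.025), Xbar.inv_cdf(0.975)   # central 95% of sample means

print(round(p_at_least_30, 4), round(lo, 3), round(hi, 3))
```

Note that 30 is two standard errors above 𝜇, which is why the probability is so small.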

117
7.3 Sample proportions
Suppose a particular character is being studied among a population (having blue eyes, using public transit at least once a week, having a yearly salary above 100 000$, …). Certain individuals will have this character, and others will not. The variable of study here is not quantitative, but binary. If a proportion 𝑝 of the population has the suggested character, then a randomly selected individual will have a probability 𝑝 of having the character.

Hence, if we define

1, 𝑖𝑓 𝑡ℎ𝑒 𝑖𝑛𝑑𝑖𝑣𝑖𝑑𝑢𝑎𝑙 ℎ𝑎𝑠 𝑡ℎ𝑒 𝑐ℎ𝑎𝑟𝑎𝑐𝑡𝑒𝑟


𝑋={
0, 𝑖𝑓 𝑡ℎ𝑒 𝑖𝑛𝑑𝑖𝑣𝑖𝑑𝑢𝑎𝑙 𝑑𝑜𝑒𝑠 𝑛𝑜𝑡 ℎ𝑎𝑣𝑒 𝑡ℎ𝑒 𝑐ℎ𝑎𝑟𝑎𝑐𝑡𝑒𝑟

then, 𝑋 is a Bernoulli variable of probability 𝑝. In other words, 𝑋 ~ 𝐵(1, 𝑝) with 𝐸(𝑋) = 𝑝 and
𝜎(𝑋) = √𝑝(1 − 𝑝).

Now, consider a sample of 𝑛 individuals being selected at random and with replacement from
this same population. Due to replacement, the probability that any selected individual has the
character of study will remain equal to 𝑝. That is, if

1, 𝑖𝑓 𝑡ℎ𝑒 𝑘 𝑡ℎ 𝑖𝑛𝑑𝑖𝑣𝑖𝑑𝑢𝑎𝑙 ℎ𝑎𝑠 𝑡ℎ𝑒 𝑐ℎ𝑎𝑟𝑎𝑐𝑡𝑒𝑟


𝑋𝑘 = {
0, 𝑖𝑓 𝑡ℎ𝑒 𝑘 𝑡ℎ 𝑖𝑛𝑑𝑖𝑣𝑖𝑑𝑢𝑎𝑙 𝑑𝑜𝑒𝑠 𝑛𝑜𝑡 ℎ𝑎𝑣𝑒 𝑡ℎ𝑒 𝑐ℎ𝑎𝑟𝑎𝑐𝑡𝑒𝑟

then 𝑋𝑘 ~ 𝐵(1, 𝑝) for all 𝑘 = 1, 2, … , 𝑛

The total number of individuals who have the character in the sample is therefore

𝑇 = 𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛

At this stage, you may recognize that 𝑇 is in fact a Binomial variable since it is a sum of
identically distributed and independent Bernoulli variables. In other words,

𝑇 ~ 𝐵(𝑛, 𝑝) with 𝐸(𝑇) = 𝑛𝑝 and 𝜎(𝑇) = √𝑛𝑝(1 − 𝑝).

However, as we have discussed in section 6.5, a sum of identically distributed and independent
random variables will tend to act as a normal variable as long as 𝑛 is big enough.
Therefore, if the sample size is sufficient, we can predict that

𝑇 ~ 𝑁 (𝑛𝑝, √𝑛𝑝(1 − 𝑝))

The sample proportion is a random variable whose value depends on the relative frequency of individuals in the sample with the character of study, that is 𝑇⁄𝑛. We typically denote the sample proportion as 𝑃̂.

𝑃̂ = 𝑇⁄𝑛 = (1⁄𝑛)𝑇

Then, 𝑇 = 𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 ~ 𝑁(𝑛𝑝, √𝑛𝑝(1 − 𝑝)) will be normally distributed.

Given that 𝑃̂ is a linear transformation of 𝑇 (that is, 𝑃̂ = (1⁄𝑛)𝑇), it too will be normally distributed.

𝐸(𝑃̂) = 𝐸((1⁄𝑛)𝑇) =          = 𝑝

𝜎(𝑃̂) = 𝜎((1⁄𝑛)𝑇) =          = √𝑝(1 − 𝑝)⁄√𝑛

In conclusion:
𝑃̂ ~ 𝑁(𝑝, √𝑝(1 − 𝑝)⁄√𝑛)

or equivalently

(𝑃̂ − 𝑝)⁄(√𝑝(1 − 𝑝)⁄√𝑛) → 𝑍 ~ 𝑁(0, 1)

You have probably recognized, once again, that the relationship above is in fact predicted by
the Central Limit Theorem. 𝑃̂ is, in fact, the average of the 0’s and 1’s resulting from the
Bernoulli variable 𝑋 ~ 𝐵(1, 𝑝), 𝑝 is the expected value of this variable, and √𝑝(1 − 𝑝) is its
standard deviation. Therefore, there is a visible correspondence between saying that

(𝑋̅ − 𝜇)⁄(𝜎⁄√𝑛) → 𝑍 ~ 𝑁(0, 1)

and saying that

(𝑃̂ − 𝑝)⁄(√𝑝(1 − 𝑝)⁄√𝑛) → 𝑍 ~ 𝑁(0, 1)

7.4 Sample size and independence of trials
Note that the Central limit theorem requires that the sample size be big enough for the
distribution of the sum of the variables to develop a bell-shaped appearance. Whereas a small
𝑛 may be sufficient when the variable of study is symmetric from the start, a bigger 𝑛 is
necessary when there is a strong asymmetry in the variable. This is particularly the case for
Bernoulli variables whose 𝑝 is either very close to 0 or to 1. For this reason, conditions that
allow the application of the CLT are a bit more demanding in the case of sample proportions.
As well as requiring that 𝑛 ≥ 30, we will also require that 𝑛𝑝 ≥ 10 and that 𝑛𝑞 ≥ 10 (where 𝑞 = 1 − 𝑝). When it comes to the replacement of the selections, which guarantees the independence of the Bernoulli variables 𝑋𝑘 , we will consider that the CLT is applicable even in non-replacement cases as long as the sample size is relatively small compared to the size of the population.

Example 7.4.1
Let us assume that 72% of current Marianopolis students exercise at least 5 hours (on average) per week. A sample of 81 students is collected at random, and the students are asked to complete a questionnaire about their health and exercise habits.

True or false: 72% of the students in the sample exercise at least 5 hours per week (on average).

Let 𝑷̂ = “The proportion of students in the sample who exercise 5 or more hours per week”.

What distribution does variable 𝑷̂ obey? Justify your claim.

What is the probability that two thirds or less of the surveyed students exercise at least 5
hours per week?

𝑃(𝑃̂ ≤ 2⁄3) =          = 14.253%

What values of the sample proportion, centered around 𝒑 = 𝟎. 𝟕𝟐, are likely to occur with
90% probability?
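Both parts of Example 7.4.1 can be checked numerically. Under the CLT, 𝑃̂ ~ 𝑁(0.72, √(𝑝(1 − 𝑝))⁄√81); a standard-library sketch:

```python
# Example 7.4.1: the sampling distribution of the proportion.
from math import sqrt
from statistics import NormalDist

p, n = 0.72, 81
se = sqrt(p * (1 - p)) / sqrt(n)   # standard error of the sample proportion
Phat = NormalDist(p, se)

p_two_thirds = Phat.cdf(2 / 3)                    # P(P̂ ≤ 2/3)
lo, hi = Phat.inv_cdf(0.05), Phat.inv_cdf(0.95)   # central 90% of P̂

print(round(p_two_thirds, 5), round(lo, 3), round(hi, 3))
```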

7.5 Student t-distribution
We take advantage of the fact that sampling distributions are fresh in mind to introduce the
Student 𝑡-distribution.

For a given variable 𝑋 whose population parameters (μ and σ) are known, we have seen that
the variability of sample means 𝑋̅ from 𝜇 is quite predictable. Deviations, calculated in standard
𝜎
errors, , are used to compute relevant probabilities. In particular, the measure of the random
√𝑛
𝑋̅ −𝜇
variable 𝜎⁄ is normally distributed (even if 𝑋 as not) will depends only on the value taken by
√𝑛
𝑋̅.

An issue we will face in upcoming sections occurs when the population’s parameters are not known and the sample size is too low to assume that the CLT applies.

The measure (𝑋̅ − 𝜇)⁄(𝑆⁄√𝑛) is made up of two variables, 𝑋̅ and 𝑆. Problems are generated because of how volatile the sample standard deviation, 𝑆, can be, especially in small samples.

Two sample means can be equal, thus generating equal deviations, 𝑋̅ − 𝜇, without having the same sample standard deviation, therefore leading to unequal measures of (𝑋̅ − 𝜇)⁄(𝑆⁄√𝑛).

In contrast, two samples could have different means and different standard deviations and still produce the same value of (𝑋̅ − 𝜇)⁄(𝑆⁄√𝑛). All this to say that the random variables (𝑋̅ − 𝜇)⁄(𝑆⁄√𝑛) and (𝑋̅ − 𝜇)⁄(𝜎⁄√𝑛) behave differently. The discrepancies between the two will be alleviated when 𝑛 is big enough, thus increasing the likelihood that 𝑠 ≈ 𝜎.

If 𝑋 is normally distributed in the population, the random variable

𝑇𝑛−1 = (𝑋̅ − 𝜇)⁄(𝑆⁄√𝑛)

is said to obey the Student t-distribution, or simply the t-distribution. The subscript 𝑛 − 1 refers to the ‘degrees of freedom’ of the variable and hints at the fact that there is not ‘just one’ Student law, but rather one for every sample size. Just like a standard normal variable 𝑍, 𝑇𝑛−1 is centered at 0 and is symmetric. Its distribution is different from that of 𝑍, with a higher density in its extremes.

Figure 7.5.1

The density functions shown in Figure 7.5.1 illustrate how a 𝑡-distribution depends on the
degrees of freedom (𝑛 − 1) and how 𝑇𝑛−1 approaches 𝑍 as 𝑛 increases. Using the Student 𝑡-
distribution is therefore essential for small sample sizes.

Similarly to what was done with standard normal distributions, it is common practice to use the
notation 𝑡𝑛−1,𝛼 to denote values of the Student 𝑡-distribution such that

𝑃(𝑇𝑛−1 ≥ 𝑡𝑛−1,𝛼 ) = 𝛼
𝑃(𝑇𝑛−1 ≤ −𝑡𝑛−1,𝛼 ) = 𝛼
𝑃(𝑇𝑛−1 ≤ −𝑡𝑛−1,𝛼/2 𝑜𝑟 𝑇𝑛−1 ≥ 𝑡𝑛−1,𝛼/2 ) = 𝛼

Example 7.5.1
Using Excel, compare the probability that 𝑍 > 2 with that of 𝑇9 > 2, 𝑇19 > 2 and 𝑇29 > 2.

𝑃(𝑍 > 2) =

𝑃(𝑇9 > 2) =

𝑃(𝑇19 > 2) =

𝑃(𝑇29 > 2) =
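The comparisons of Example 7.5.1 can also be made without Excel. A sketch that integrates the Student-t density numerically (standard library only; `math.erf` handles the 𝑍 case) to get upper-tail probabilities:

```python
# Upper-tail probabilities P(T_df > 2) by numeric integration of the t density.
from math import gamma, pi, sqrt, erf

def t_pdf(t, df):
    """Density of the Student t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_sf(x, df, upper=60.0, steps=60_000):
    """P(T_df > x): midpoint Riemann sum over [x, upper] (tail beyond is tiny)."""
    h = (upper - x) / steps
    return sum(t_pdf(x + (i + 0.5) * h, df) for i in range(steps)) * h

z_sf = 0.5 * (1 - erf(2 / sqrt(2)))   # P(Z > 2)
print(round(z_sf, 4))
for df in (9, 19, 29):
    print(df, round(t_sf(2, df), 4))  # tails shrink toward P(Z > 2) as df grows
```

The heavier tails of 𝑇₉, 𝑇₁₉ and 𝑇₂₉ are visible: each exceeds 𝑃(𝑍 > 2), and the gap narrows as the degrees of freedom increase.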

Example 7.5.2
Using Excel, compare the values of 𝑎, 𝑏, 𝑐 and 𝑑 such that

𝑃(𝑍 ≥ 𝑎) = 10% 𝑎≈
𝑃(𝑇9 ≥ 𝑏) = 10% 𝑏≈
𝑃(𝑇19 ≥ 𝑐) = 10% 𝑐≈
𝑃(𝑇29 ≥ 𝑑) = 10% 𝑑≈

Note that the values of 𝑏, 𝑐 and 𝑑 in the previous exercise exclude the 10% right-most values of
𝑇9 , 𝑇19 , 𝑇29 . It is common practice to denote these values as 𝑡9,10% , 𝑡19,10% and 𝑡29,10% in a
similar way that was done with 𝑧10% .

In the same manner, because Student distributions are symmetric about 𝑡 = 0, the expressions
−𝑡9,10%, −𝑡19,10% and −𝑡29,10% are used to represent the boundaries of 𝑇9 , 𝑇19 , 𝑇29 that
exclude their 10% left-most values.

In other words,
𝑃(𝑇19 ≥ 𝑡19,10% ) = 10%, and
𝑃(𝑇19 ≤ −𝑡19,10% ) = 10%, where 𝑡19,10% = 1.328

Example 7.5.3
Consider a population whose mean and standard deviation are 𝜇 = 50 and 𝜎 = 10,
respectively. Let us suppose that a random sample of size 𝑛 = 25 is such that 𝑥 = 54 and
𝑠 = 10.8.

Based on the given information, find


𝒛𝒐𝒃𝒔 , the corresponding measure of 𝒁 = (𝑿̅ − 𝝁)⁄(𝝈⁄√𝒏):    𝑧𝑜𝑏𝑠 =

𝒕𝒐𝒃𝒔 , the corresponding measure of 𝑻𝟐𝟒 = (𝑿̅ − 𝝁)⁄(𝑺⁄√𝒏):    𝑡𝑜𝑏𝑠 =

Observe that the sample mean is to the right of 𝜇 and therefore both 𝑧𝑜𝑏𝑠 and 𝑡𝑜𝑏𝑠 are positive.

What is the probability that a subsequent sample will produce values for 𝒁 and 𝑻𝟐𝟒 that are
greater than those we have observed?

𝑃(𝑍 > 𝑧𝑜𝑏𝑠 ) =

𝑃(𝑇24 > 𝑡𝑜𝑏𝑠 ) =

Example 7.5.4
Consider a population whose mean and standard deviation are unknown. A random sample of
size 𝑛 = 36 is such that 𝑥 = 120 and 𝑠 = 18.

Based on the given information, find an interval that is symmetric relative to 𝜇 and holds 90% of
possible sample means 𝑋.

Answers
We will avoid using a normal distribution here, given that 𝜎 is unknown, but will make use of
the 𝑡-distribution with degrees of freedom instead.

𝑇 = (𝑋̅ − 𝜇)⁄(𝑆⁄√𝑛) ⟹ 𝑇 = (𝑋̅ − 𝜇)⁄(18⁄√36) = (𝑋̅ − 𝜇)⁄3

𝑃(−𝑡 ,5% ≤ 𝑇 ≤ 𝑡 ,5% ) = 90%

Using Excel, we can find that 𝑡 ,5% ≈ 1.690

Therefore, the interval we are looking for is

In short, based on observed results, we would expect that 90% of sample means remain within
a distance of 5.069 of the population mean 𝜇.

Main take-aways from this section
Consider a quantitative variable whose population mean and standard deviation are 𝜇 and 𝜎,
respectively. If a sample of size 𝑛 is selected at random and with replacement from the
population, the value of 𝑋 that is observed for the 𝑘th individual is denoted by 𝑋𝑘 and is such that
𝐸(𝑋𝑘 ) = 𝜇 and 𝜎(𝑋𝑘 ) = 𝜎.

Case 1
If variable 𝑋 is normally distributed, then the sum 𝑇 = 𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 and the sample
mean 𝑋̅ = (1⁄𝑛)(𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 ) will be normally distributed random variables, regardless of the size
of 𝑛. More specifically:

𝑇 ~ 𝑁(𝑛𝜇, √𝑛 𝜎)      𝑋̅ ~ 𝑁(𝜇, 𝜎⁄√𝑛)      (𝑋̅ − 𝜇)⁄(𝜎⁄√𝑛) = 𝑍 ~ 𝑁(0, 1)

Case 2
If variable 𝑋 is not normally distributed, then the sum 𝑇 = 𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 and the sample
mean 𝑋̅ = (1⁄𝑛)(𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 ) are random variables whose distributions will resemble a
normal distribution, as long as 𝑛 is big enough (𝑛 ≥ 30). More specifically:

𝑇 ~ 𝑁(𝑛𝜇, √𝑛 𝜎)      𝑋̅ ~ 𝑁(𝜇, 𝜎⁄√𝑛)      (𝑋̅ − 𝜇)⁄(𝜎⁄√𝑛) = 𝑍 ~ 𝑁(0, 1)

Case 3
If 𝑋 is a Bernoulli variable 𝐵(1, 𝑝), then the sum 𝑇 = 𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 obeys a Binomial law.
However, if the following criteria are met,
𝑛 ≥ 30
𝑛𝑝 ≥ 10
𝑛𝑞 ≥ 10

then the distributions of 𝑇 and of the sample proportion 𝑃̂ will resemble that of a normal law.
More specifically:

𝑇 ~ 𝑁(𝑛𝑝, √𝑛𝑝𝑞)      𝑃̂ ~ 𝑁(𝑝, √𝑝𝑞⁄√𝑛)      (𝑃̂ − 𝑝)⁄(√𝑝𝑞⁄√𝑛) = 𝑍 ~ 𝑁(0, 1)

In the event that the population standard deviation, 𝜎, is not known for a given quantitative
variable, then the sample mean 𝑋̅ can be converted into a Student 𝑡-distribution in order to
perform probability calculations.

(𝑋̅ − 𝜇)⁄(𝑆⁄√𝑛) ~ 𝑇𝑛−1

As 𝑛 grows, the 𝑡-distribution resembles the standard normal law.
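The effect of replacing 𝜎 by 𝑆 can also be seen by simulation. This is a hedged sketch (the population 𝑁(50, 10) and 𝑛 = 10 are invented for illustration): if (𝑋̅ − 𝜇)⁄(𝑆⁄√𝑛) were standard normal, |𝑡| would exceed 1.96 about 5% of the time; with 𝑛 = 10 it happens noticeably more often, as the heavier tails of 𝑇9 predict.

```python
import random
import statistics

random.seed(7)
mu, sigma, n, reps = 50.0, 10.0, 10, 20000   # hypothetical population and n
exceed = 0
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = statistics.mean(xs)
    s = statistics.stdev(xs)                 # sample standard deviation
    t = (xbar - mu) / (s / n ** 0.5)
    exceed += abs(t) > 1.96                  # z-based two-tail cutoff for 5%
print(exceed / reps)                         # noticeably above 0.05
```

With a larger 𝑛 the printed frequency would move back toward 5%, which is exactly the sense in which the 𝑡-distribution approaches 𝑍.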

Section 8 Hypothesis testing
8.1 Introductory discussion
Inferential statistics includes all techniques and procedures with which sample data is used to
make predictions or generalisations to an entire population. You may have guessed that
sampling distributions, discussed in Section 7, will be at the heart of all inference.

An important application of the sampling distributions to inference is Hypothesis Testing.

From the onset, let us highlight the main difference between hypothesis testing and confidence
intervals, a topic that will be discussed in Section 9:

o When constructing confidence intervals, the value of a parameter is considered unknown.
  Sample data is used to estimate the parameter’s value.

o In hypothesis testing, an assumption is made a priori about the value of a parameter
  (population indicators) and sample data is then used to validate, or invalidate, this
  assumption.

In hypothesis testing, two mutually exclusive (but exhaustive) statements are confronted. The
value of a parameter will be the object of most tests we will be conducting. If population data
were available, such tests would not be required. However, the reality is that we often have
access to sample data only. Therefore, any decision that is made about a parameter based on
sample data is subject to error.

The first of the two statements is called the Null Hypothesis and is considered true until proven
otherwise. The null hypothesis usually reflects what we assume is (or at least may have been)
true about a population. It is denoted by 𝐻0 and is often associated with the status quo.
Rejecting 𝐻0 leads to breaking the status quo. Rejecting 𝐻0 when 𝐻0 is true is called the type 1
error. The wrongful rejection of 𝐻0 is deemed to be the decision whose consequences are
gravest, which is why processes and control mechanisms put in place are meant to control the
risk of making such an error.

Not rejecting 𝐻0 when 𝐻0 is false is called the type 2 error.

Table 8.1.1 summarizes the decisions, good or bad, that can be made when conducting a
hypothesis test.

Table 8.1.1

                               Decision that is taken based on sample data
 Reality in the population     Reject 𝑯𝟎                      Do not reject 𝑯𝟎
 𝑯𝟎 is true                    Bad decision (type 1 error)    Good decision
 𝑯𝟎 is false                   Good decision                  Bad decision (type 2 error)
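Table 8.1.1 amounts to a two-input truth table. As a trivial Python sketch (the function name is ours):

```python
def outcome(h0_is_true, reject_h0):
    """Classify a test decision as in Table 8.1.1."""
    if h0_is_true and reject_h0:
        return "bad decision (type 1 error)"
    if not h0_is_true and not reject_h0:
        return "bad decision (type 2 error)"
    return "good decision"

print(outcome(h0_is_true=True, reject_h0=True))    # type 1 error
print(outcome(h0_is_true=False, reject_h0=False))  # type 2 error
```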

The hypothesis that is denoted by 𝐻1 is called the alternative hypothesis. The alternative
hypothesis is only accepted when sufficient statistical evidence demonstrates that 𝐻0 can be
rejected.

On a practical level, when testing the value of a parameter, 𝐻0 must contain some form of
equality ( =, ≤, ≥), whereas 𝐻1 must be a strict inequality ( ≠, < , >). It may be useful to
remember this when you are philosophically unsure about which statement’s “wrongful
dismissal will lead to the gravest consequences”.

Every hypothesis test will require that we be clear about


o the population under study,
o the variable of interest,
o the parameter being tested,
before writing the hypotheses and, eventually, proceeding with the test.

Example 8.1.1
For the longest time, the proportion of students at Marianopolis College that have failed, on a
given semester, more than one course has never exceeded 10%. As a result, the college never
felt the need to modify or improve existing services for students at risk. A survey will be
conducted to see if current practices should be re-examined.

Given the context described above, identify:

The population under study:

o The variable of interest:

     1 , if
𝑋={
     0 , if

o The parameter of interest:

Write the hypotheses that should be confronted.

𝐻0 : 𝐻1 :

Describe the consequences of type 1 and type 2 errors in the context that is described.

Type 1 error = Rejecting 𝐻0 when 𝐻0 is true

Type 2 error = Not rejecting 𝐻0 when 𝐻0 is false

Example 8.1.2
When Marianopolis College considered moving from the Côte-des-Neiges site to the CND
Mother House on Westmount Avenue, one of the main concerns was the impact it would have
on the commute time of students, faculty, and staff. Travel time to the Côte-des-Neiges site
was 45 minutes, on average, for the college community.

Should a study conclude that the average time of travel to the Mother House does not exceed
45 minutes, little resistance to the move is expected, and it would be easier to convince one
and all that the move is a good idea. On the other hand, if the average travel time to the new
location proved to be longer, the administration would have to negotiate with the STM to
provide a better service to the college.

Given the context described above, identify:

The population under study:

o The variable of interest:

o The parameter of interest:

Write the hypotheses that should be confronted.

𝐻0 : 𝐻1 :

Describe the consequences of type 1 and type 2 errors in the context that is described.

Type 1 error = Rejecting 𝐻0 when 𝐻0 is true

Type 2 error = Not rejecting 𝐻0 when 𝐻0 is false

Example 8.1.3 Write appropriate hypotheses to be confronted in the contexts described below.

Case 1
For quality control purposes, a governmental department will evaluate the duration of calls
(including waiting time) for a sample of users of its phone service. If the satisfaction criterion –
that calls should last no longer than 45 minutes, on average – is not respected, then additional
personnel will have to be hired.

Tested parameter:

Hypotheses: 𝐻0 : 𝐻1 : 𝜇

Case 2
Past data from the ministère de l’Éducation Supérieure indicates that the proportion of
students who complete their DEC in the prescribed 2 years is at least 60%. No reform is to be
undertaken unless a deterioration is observed.

Tested parameter:

Hypotheses: 𝐻0 : 𝐻1 : 𝑝

Case 3
The ministère de l’Éducation Supérieure currently has programs in place to control the student
debt. For university graduates, the average debt is believed to be 15,000$. If this amount were
to drop, or to rise, the ministère may want to reconsider its programs.

Tested parameter:

Hypotheses: 𝐻0 : 𝐻1 : 𝜇

Case 4
A government wishes to consult its citizens on a controversial matter. Its strategy, at first, is to
hold a survey and to launch a referendum if and only if sample results can demonstrate that a
majority of citizens would support its plan of action.

Tested parameter:

Hypotheses: 𝐻0 : 𝐻1 : 𝑝

8.2 To reject, or not to reject, that is the question!
It is essential to keep in mind that because sample data is used to guide our decision, either
one, rejecting 𝐻0 or not rejecting 𝐻0 , can turn out to be an error.

Because it will result in changing current practices, rejecting 𝐻0 is a decision that is taken only
when there is a small probability that it is, in fact, a bad decision. In other words, rejecting 𝐻0 is
a decision that is taken as long as a type 1 error is unlikely to occur.

The significance level of the test, denoted by 𝛼, is determined at the onset of the test and
guides the statistician’s decision to reject 𝐻0 or not. One can view 𝛼 as a threshold above which
rejecting 𝐻0 will be considered too risky.

To reject 𝐻0 , sample results must guarantee that the probability of making a type 1 error is less
than the pre-determined significance level. Usual levels for 𝛼 are 1%, 5%, or 10%.

For instance, suppose that 𝐻0 has been rejected, based on sample data, at a significance level
𝛼 = 5%. This means that, if 𝐻0 were in fact true, the probability of observing a sample result as
extreme as ours would not exceed 5%; the risk that rejecting 𝐻0 was an error is controlled.
In contrast, suppose that sample data has not allowed you to reject the null hypothesis at a
significance level 𝛼 = 5%. This means that rejecting 𝐻0 based on the statistical evidence at
hand would be too risky. The probability of committing a type 1 error would have exceeded the
tolerable level of risk (𝛼 = 5%, in this case).

By choosing not to reject 𝐻0 , we are then exposed to a type 2 error (with an unknown
probability, but whose consequences are not as grave as those of a type 1 error).

Every hypothesis test is like judging a case in our legal system using the principle, “innocent
until proven guilty”.

𝐻0 : the accused is innocent


𝐻1 : the accused is guilty

Only conclusive evidence allows us to reject 𝐻0 and to conclude that the accused is guilty.

Wrongfully convicting constitutes a type 1 error in such a scenario, the error whose
consequences are gravest in our justice system. Any reasonable doubt prevents the rejection of
𝐻0 , which may turn out to be a type 2 error (that is, letting a guilty person go free… you may
want to research the O.J. Simpson trial, at this point).

It is crucial to understand that the objective of a hypothesis test is not to prove that 𝐻0 is true.
Although statistical evidence may not justify rejecting 𝐻0 , it does not mean that we accept 𝐻0 .
In the same manner, we will never say that “𝐻1 is rejected”.

Only two outcomes can result from hypothesis testing:

✓ Statistical evidence allows us to reject 𝐻0 and to accept 𝐻1 .


✓ Statistical evidence fails to reject 𝐻0 and we cannot accept 𝐻1 .

8.3 Tests on means and proportions using a rejection zone
Different approaches can be used to guide one’s decision regarding the null hypothesis. In this
section, we present the rejection zone method.

As we have mentioned earlier, wrongfully rejecting 𝐻0 is an error (a type 1 error, to be exact)
that one must try to minimize. Selecting a low significance level, 𝛼, has exactly that purpose.

Due to their random nature, all sample statistics have a probability of deviating from the
population parameter. One should expect that observed values of the sample mean, 𝑋̅, will vary
from 𝜇. Relatively small variations are of no concern. However, when variations are ‘extreme’,
one may reconsider the validity of the parameter that was otherwise assumed accurate.
A hypothesis test that is conducted at a significance level 𝛼 will consider a sample result as
sufficient evidence to reject 𝐻0 as long as

✓ the result contradicts 𝐻0 , and


✓ the result lands so far from the assumed value of the parameter that it is in an interval
whose outcomes have cumulative probability 𝛼;

The interval described in the second criterion above is known as the rejection zone.
Obviously, if 𝐻0 is true, there is a probability 𝛼 that a sample result will land in the rejection
zone. Rejecting 𝐻0 would turn out to be a type 1 error in such a case, but its probability is
controlled by the choice of the threshold, 𝛼.

The rejection zone is constructed using the sampling distributions that were introduced in past
sections.

Test on a proportion
If a proportion 𝑝 of the population has a given character, and if a sample’s size satisfies the
criteria 𝑛 ≥ 30, 𝑛𝑝 ≥ 10, 𝑛𝑞 ≥ 10, we have seen that the sample proportion will be distributed
as follows:
𝑃̂ ~ 𝑁(𝑝, √𝑝𝑞⁄√𝑛)   or equivalently   (𝑃̂ − 𝑝)⁄(√𝑝𝑞⁄√𝑛) = 𝑍 ~ 𝑁(0, 1)

The rejection zone is constructed (under the assumption that 𝐻0 is true) using the extreme 𝛼
outcomes. We already know that for a standard normal law:

𝑃(𝑍 > 𝑧𝛼 ) = 𝛼 ⟹ Right-tail rejection zone: 𝑍 > 𝑧𝛼


𝑃(𝑍 < −𝑧𝛼 ) = 𝛼 ⟹ Left-tail rejection zone: 𝑍 < −𝑧𝛼
𝑃(𝑍 < −𝑧𝛼⁄2 or 𝑍 > 𝑧𝛼⁄2 ) = 𝛼 ⟹ Two-tail rejection zone: 𝑍 < −𝑧𝛼⁄2 or 𝑍 > 𝑧𝛼⁄2

Given that 𝑃̂ is normally distributed and can be converted into 𝑍, we will simply have to
determine if the test statistic
𝑧𝑜𝑏𝑠 = (𝑝̂ − 𝑝)⁄(√𝑝𝑞⁄√𝑛)

does in fact land in the appropriate rejection zone.

Step-by-step guide
1. Identify
o the population under study,
o the variable of interest,
o and the parameter being tested,
2. Write the null and alternative hypotheses,
3. Identify or choose the test’s significance level, 𝛼,
4. Construct the appropriate rejection zone for the test,
5. Conduct the survey, find 𝑝̂ and compute the test statistic 𝑧𝑜𝑏𝑠 ,
6. Determine whether the data allows you to reject 𝐻0 ,
7. Conclude in the context of the problem, discuss possible errors and the validity of the test
itself.
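Steps 4–6 of the guide can be sketched in Python; the helper name and the survey numbers in the demo are invented for illustration, and `NormalDist.inv_cdf` plays the role of Excel's NORM.S.INV:

```python
import math
from statistics import NormalDist

def z_test_proportion(x, n, p0, alpha, tail="right"):
    """Rejection-zone z-test of H0: p = p0. tail is 'right', 'left' or 'two'.
    Returns (z_obs, z_crit, reject)."""
    p_hat = x / n
    z_obs = (p_hat - p0) / (math.sqrt(p0 * (1 - p0)) / math.sqrt(n))
    if tail == "two":
        z_crit = NormalDist().inv_cdf(1 - alpha / 2)
        return z_obs, z_crit, abs(z_obs) > z_crit
    z_crit = NormalDist().inv_cdf(1 - alpha)
    reject = z_obs > z_crit if tail == "right" else z_obs < -z_crit
    return z_obs, z_crit, reject

# Hypothetical survey: 30 successes out of n = 150, testing H0: p = 0.15
# against H1: p > 0.15 at alpha = 5%.
z_obs, z_crit, reject = z_test_proportion(30, 150, 0.15, 0.05, tail="right")
print(round(z_obs, 3), round(z_crit, 3), reject)   # z_obs ≈ 1.715 > 1.645
```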

Example 8.3.1
For the longest time, the proportion of students at Marianopolis College that have failed, on a
given semester, more than one course has never exceeded 10%. As a result, the college never
felt the need to modify or improve existing services for students at risk.
From a random sample of 200 students, 29 had failed at least two courses in their previous
semester. Based on this data, should current practices be re-examined?
Perform a hypothesis test at a significance level 𝛼 = 5%.

Answer
Population, variable and parameter of interest
o Population under study: the students of Marianopolis College
o Variable of interest: whether a student fails more than one course on a given semester.
This is a Bernoulli variable that could be defined as

     1 , if the student failed more than one course on a given semester
𝑋={
     0 , if the student failed at most one course on a given semester

o Tested parameter: 𝑝, the proportion of Marianopolis students who failed more than one
course on a given semester

Hypotheses: 𝐻0 : vs 𝐻1 :

Significance level: 𝛼 = 5%

Construct the appropriate rejection zone for the test:

Rejection zone:

Sample results:
𝑝̂ = 𝑧𝑜𝑏𝑠 =

Decision:.

Possible errors and validity:
Having decided to reject 𝐻0 means that we are exposed to a potential error of type

Can you evaluate the risk of making such an error?

The test is valid because the sample size satisfies all criteria relevant to the application of the
CLT to sample proportions:
𝑛 = 200 ≥ 30
𝑛𝑝 = 200(0.1) = 20 ≥ 10
𝑛𝑞 = 200(0.9) = 180 ≥ 10

Example 8.3.2
Max Poutine is a well-established chain of restaurants in Quebec with 16 locations
throughout the province. The chain’s administrators estimate that 20% of reservations that are
made (either by phone or through their website) lead to no-shows. With this in mind, the
restaurant has adopted overbooking practices. The accuracy of the proportion of no-shows is
regularly tested. Significant changes to the no-show frequency can be compensated for by
adjusting the overbooking.

The most recent survey has shown that from a random sample of 175 reservations in the last 3
months, 41 resulted in no-shows. Based on these results, will current overbooking practices
have to be adapted?

Perform a hypothesis test at a significance level 𝛼 = 5%.

Answer
Population, variable and parameter of interest:
o Population under study: phone and internet reservations to Max Poutine restaurants
o Variable of interest: whether a reservation resulted in a no-show. This is a Bernoulli
variable that could be defined as

     1 , if the reservation resulted in a no-show
𝑋={
     0 , if the reservation was respected

o Tested parameter: 𝑝, the proportion of reservations that resulted in no-shows

Hypotheses: 𝐻0 : vs 𝐻1 :

Significance level: 𝛼 = 5%

Construct the appropriate rejection zone for the test:

Rejection zone:


Sample results:
𝑝̂ = 𝑧𝑜𝑏𝑠 =

Decision:.

Possible errors and validity:


Having decided to reject 𝐻0 means that we are exposed to a potential error of type

Can you evaluate the risk of making such an error?

The test is valid because the sample size satisfies all criteria relevant to the application of the
CLT to sample proportions:
𝑛 = 175 ≥ 30
𝑛𝑝 = 175(0.2) = 35 ≥ 10
𝑛𝑞 = 175(0.8) = 140 ≥ 10

8.4 Test on a mean – big sample sizes (n ≥ 30)
Let 𝑋 represent a quantitative variable whose population mean is assumed to be 𝜇. If a random
sample of size 𝑛 is to be collected, then the sample mean, 𝑋̅, is a random variable whose
distribution depends on the scenarios listed below:

1. 𝑋̅ ~ 𝑁(𝜇, 𝜎⁄√𝑛), or equivalently (𝑋̅ − 𝜇)⁄(𝜎⁄√𝑛) = 𝑍 ~ 𝑁(0, 1), if 𝜎 is known

2. (𝑋̅ − 𝜇)⁄(𝑆⁄√𝑛) ~ 𝑇𝑛−1, if 𝜎 is unknown

Hypothesis tests may speculate on the value of 𝜇 and assume it to be true until proven
otherwise. However, the value of 𝜎 is rarely known per se, which is why we will focus on case 2.

As we did for proportions earlier, the rejection zone is constructed (under the assumption that
𝐻0 is true) using the extreme 𝛼 outcomes.

We already know that for a 𝑡-distribution with 𝑛 − 1 degrees of freedom:

𝑃(𝑇𝑛−1 > 𝑡𝑛−1,𝛼 ) = 𝛼 ⟹ Right-tail rejection zone: 𝑇𝑛−1 > 𝑡𝑛−1,𝛼

𝑃(𝑇𝑛−1 < −𝑡𝑛−1,𝛼 ) = 𝛼 ⟹ Left-tail rejection zone: 𝑇𝑛−1 < −𝑡𝑛−1,𝛼

𝑃(𝑇𝑛−1 < −𝑡𝑛−1,𝛼⁄2 or 𝑇𝑛−1 > 𝑡𝑛−1,𝛼⁄2 ) = 𝛼 ⟹ Two-tail rejection zone: 𝑇𝑛−1 < −𝑡𝑛−1,𝛼⁄2 or 𝑇𝑛−1 > 𝑡𝑛−1,𝛼⁄2

𝑋̅ can be converted into its corresponding Student distribution, in order to determine if the
corresponding test statistic

𝑡𝑜𝑏𝑠 = (𝑥̅ − 𝜇)⁄(𝑠⁄√𝑛)
does in fact land in the appropriate rejection zone. The rest of the procedure has similar
instructions to those for tests on proportions:

Step-by-step guide
1. Identify the population and the variable of interest, as well as the tested parameter,
2. Write the null and alternative hypotheses,
3. Identify or choose the test’s significance level, 𝛼,
4. Construct the appropriate rejection zone for the test,
5. Conduct the survey, find 𝑥 and compute the test statistic 𝑡𝑜𝑏𝑠 ,
6. Determine whether the data allows you to reject 𝐻0 ,
7. Conclude in the context of the problem, discuss possible errors and the validity of the test
itself.
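Steps 5–6 for a mean can be sketched the same way. In this sketch the critical value 𝑡𝑛−1,𝛼 is supplied by the caller, looked up separately in a 𝑡-table or with Excel's T.INV; the function name and the demo numbers are invented, though the critical value 1.690 matches the 𝑡35,5% value used earlier in the text:

```python
import math

def t_test_mean(xbar, s, n, mu0, t_crit, tail="right"):
    """Compute t_obs = (x̄ − μ0)/(s/√n) and compare it with a critical value
    t_{n−1,alpha} obtained separately. Returns (t_obs, reject)."""
    t_obs = (xbar - mu0) / (s / math.sqrt(n))
    if tail == "two":
        return t_obs, abs(t_obs) > t_crit
    if tail == "left":
        return t_obs, t_obs < -t_crit
    return t_obs, t_obs > t_crit

# Hypothetical sample: n = 36, x̄ = 52, s = 12, testing H0: μ = 48 against
# H1: μ > 48; t_35,5% ≈ 1.690.
t_obs, reject = t_test_mean(52, 12, 36, 48, t_crit=1.690, tail="right")
print(t_obs, reject)   # 2.0 True: t_obs lands in the rejection zone
```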

Example 8.4.1
When Marianopolis College considered moving from the Côte-des-Neiges site to the CND
Mother House on Westmount Avenue, one of the main concerns was the impact it would have
on the commute time of students, faculty and staff. Travel time to the Côte-des-Neiges site was
45 minutes, on average, for the college community.

Should a study conclude that the average time of travel to the Mother House does not exceed
45 minutes, little resistance to the move is expected, and it would be easier to convince one
and all that the move is a good idea. On the other hand, if the average travel time to the new
location proved to be longer, the administration would have to negotiate with the STM to
provide a better service to the college.

Travel time samples to the projected site were collected with the following results:
𝑛 = 48 𝑥 = 50 𝑠 = 16

Will the college need to take measures to address this situation? Use a hypothesis test at a
significance level 𝛼 = 10%.

Answer
Population, variable and parameter of interest:
o Population under study: the Marianopolis community (students, faculty and staff)
o The variable of interest: travel time of a member of the community to the Westmount Ave.
site
o The parameter of interest: 𝜇, the average travel time of all members of the community to
the Westmount Ave. site

Hypotheses: 𝐻0 : vs 𝐻1 :

Significance level: 𝛼 = 10%

Construct the appropriate rejection zone for the test:

Rejection zone:


Sample results:
𝑛= 𝑥= 𝑠=

𝑡𝑜𝑏𝑠 =

Decision:.

Possible errors and validity:


Having decided to reject 𝐻0 means that we are exposed to a potential error of type

Can you evaluate the risk of making such an error?

The test is valid given that the sample size is sufficient (𝑛 = 48 ≥ 30).

8.5 Test on a mean – small sample sizes (n &lt; 30)
When small samples are concerned, the methods we are presenting are only valid if variable 𝑋
has a (relatively) normal distribution in the population. Assuming that the population mean is 𝜇,
the sample mean, 𝑋̅, collected from a random sample of size 𝑛 gives rise to a 𝑡-distribution with
𝑛 − 1 degrees of freedom.

(𝑋̅ − 𝜇)⁄(𝑆⁄√𝑛) ~ 𝑇𝑛−1

The rejection zone is constructed (under the assumption that 𝐻0 is true) using the extreme 𝛼
outcomes.

For a 𝑡-distribution with 𝑛 − 1 degrees of freedom:


𝑃(𝑇𝑛−1 > 𝑡𝑛−1,𝛼 ) = 𝛼 ⟹ Right-tail rejection zone: 𝑇𝑛−1 > 𝑡𝑛−1,𝛼
𝑃(𝑇𝑛−1 < −𝑡𝑛−1,𝛼 ) = 𝛼 ⟹ Left-tail rejection zone: 𝑇𝑛−1 < −𝑡𝑛−1,𝛼
𝑃(𝑇𝑛−1 < −𝑡𝑛−1,𝛼⁄2 or 𝑇𝑛−1 > 𝑡𝑛−1,𝛼⁄2 ) = 𝛼 ⟹ Two-tail rejection zone: 𝑇𝑛−1 < −𝑡𝑛−1,𝛼⁄2 or 𝑇𝑛−1 > 𝑡𝑛−1,𝛼⁄2
𝑋̅ can be converted into its corresponding Student distribution, to determine if the
corresponding test statistic

𝑡𝑜𝑏𝑠 = (𝑥̅ − 𝜇)⁄(𝑠⁄√𝑛)
does in fact land in the appropriate rejection zone. The rest of the procedure is a replica of
instructions given for tests on big samples:

Step-by-step guide
1. Identify the population and the variable of interest, as well as the tested parameter,
2. Write the null and alternative hypotheses,
3. Identify or choose the test’s significance level, 𝛼,
4. Construct the appropriate rejection zone for the test,
5. Conduct the survey, find 𝑥 and compute the test statistic 𝑡𝑜𝑏𝑠 ,
6. Determine whether the data allows you to reject 𝐻0 ,
7. Conclude in the context of the problem, discuss possible errors and the validity of the test
itself.

Example 8.5.1
Because of climate change, the caribou population in northern Quebec has migrated. In their
new environment, biologists claim that food is scarce and not adapted to their dietary needs.
The mass of adult male caribou has traditionally been normally distributed with average
𝜇 = 200 kg. If the biologists are correct, a drop in the mass of all caribou, including adult males,
is to be anticipated.

Given the large territory to cover, biologists managed to sample only 16 adult male caribou. The
average mass and standard deviation of the sample (in kg) are 𝑥 = 188 and 𝑠 = 18,
respectively.

Does the sample data corroborate the biologists’ theory? Use a hypothesis test at a significance
level 𝛼 = 5%.

Answer
Population, variable and parameter of interest
o Population under study: adult male caribou in northern Quebec
o The variable of interest: mass of an adult male caribou in northern Quebec
o The parameter of interest: 𝜇, the average mass of adult male caribou in northern Quebec

Hypotheses: 𝐻0 : vs 𝐻1 :

Significance level: 𝛼 = 5%

Construct the appropriate rejection zone for the test:

Rejection zone:


Sample results:
𝑛= 𝑥= 𝑠=

𝑡𝑜𝑏𝑠 =

Decision:.

Possible errors and validity:
Having decided to reject 𝐻0 means that we are exposed to a potential error of type

Can you evaluate the risk of making such an error?

The test is valid because, although the sample is small (𝑛 = 16 < 30), the variable 𝑋 is assumed
to be normally distributed in the population.

8.6 Hypothesis testing and p-values
The 𝑝-value provides an alternative to using rejection zones in assisting one’s decision to reject
(or not to reject) a null hypothesis. In practice, computing the 𝑝-value is only necessary when a
sample result is in contradiction with 𝐻0 , which is assumed true until proven otherwise.

The 𝑝-value is obtained by examining the discrepancy between a sample result and the
assumed value of the corresponding parameter given under 𝐻0 . It measures the probability
that a sample result can be equally far, or even further, from the assumed parameter.

𝑝-value = 𝑃(a sample result is at least as extreme as the one that was observed | 𝐻0 is true)

A 𝑝-value that is relatively high means that the observed deviation from the predicted value
under 𝐻0 is not unusual. As a result, there is no reason to contest the validity of the null
hypothesis.

However, a 𝑝-value that is low occurs when the observed sample result and the predicted value
under 𝐻0 are unusually far from one another. Such an occurrence can be due to chance
(samples do not systematically reveal a fair portrait of the population). Alternatively, it can be
an indication that our assumption of 𝐻0 being true, based on which probabilities were
calculated in the first place, must be reconsidered. In other words, a low 𝑝-value basically says,
“if 𝐻0 were true, this extreme result would have very little chance of occurring”.

Remember that in all hypothesis tests, limiting the probability of a type 1 error is of greatest
importance. The likelihood of rejecting 𝐻0 when 𝐻0 is true is controlled by setting a low value
of 𝛼.

The following principle will therefore guide our decision:


o If 𝑝-value ≤ 𝛼, we reject 𝐻0 .
o If 𝑝-value > 𝛼, we do not reject 𝐻0 .

Step-by-step guide
1. Identify the population and the variable of interest, as well as the tested parameter,
2. Write the null and alternative hypotheses,
3. Identify or choose the test’s significance level, 𝛼,
4. Conduct the survey. If the sample result contradicts 𝐻0 , compute the test statistic and the
corresponding 𝑝-𝑣𝑎𝑙𝑢𝑒.
5. Determine whether the 𝑝-value allows you to reject 𝐻0 ,
6. Conclude in the context of the problem, discuss possible errors and the validity of the test
itself.

Note: for two-tailed tests, any deviation, left or right of the assumed parameter, contradicts 𝐻0 .
The size of the deviation (rather than its sign) is used to compute the 𝑝-𝑣𝑎𝑙𝑢𝑒. Upcoming
examples will clarify how to deal with these situations.
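The decision rule and the two-tailed note above can be sketched for 𝑧-based tests; the function names are ours, and `NormalDist.cdf` plays the role of Excel's NORM.S.DIST:

```python
from statistics import NormalDist

def p_value_z(z_obs, tail="right"):
    """p-value of an observed z statistic. For a two-tailed test only the size
    of the deviation matters, hence the factor 2 on one tail."""
    Z = NormalDist()
    if tail == "right":
        return 1 - Z.cdf(z_obs)
    if tail == "left":
        return Z.cdf(z_obs)
    return 2 * (1 - Z.cdf(abs(z_obs)))

def reject_h0(p_value, alpha):
    """Reject H0 exactly when the p-value does not exceed alpha."""
    return p_value <= alpha

print(p_value_z(2.0, "right"))        # ≈ 0.0228
print(p_value_z(-1.96, "two"))        # ≈ 0.05
print(reject_h0(0.03, alpha=0.05))    # True
```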

Example 8.6.1
For the longest time, the proportion of students at Marianopolis College that have failed, on a
given semester, more than one course has never exceeded 10%. As a result, the college never
felt the need to modify or improve existing services for students at risk.
From a random sample of 200 students, 29 had failed at least two courses in their previous
semester. Based on this data, should current practices be re-examined?
Perform a hypothesis test at a significance level 𝛼 = 5%.

Answer
Population, variable and parameter of interest:
o Population under study: the students of Marianopolis College
o Variable of interest: whether a student fails more than one course on a given semester.
This is a Bernoulli variable that could be defined as

     1 , if the student failed more than one course on a given semester
𝑋={
     0 , if the student failed at most one course on a given semester

o Tested parameter: 𝑝, the proportion of Marianopolis students who failed more than one
course on a given semester

Hypotheses: 𝐻0 : vs 𝐻1 :

This is a -tailed test. We will only consider rejecting 𝐻0 if the sample proportion is
extreme enough to conclude that 𝑝 > 10%.

Significance level: 𝛼 = 5%

Sample result, test statistic and 𝒑-value:

The sample result contradicts 𝐻0 since 𝑝̂ =

Given that 𝑛 = 200 ≥ 30 and 𝑛𝑝 = 20 ≥ 10 and 𝑛𝑞 = 180 ≥ 10, we know that


(𝑃̂ − 𝑝)⁄(√𝑝𝑞⁄√𝑛) = 𝑍 ~ 𝑁(0, 1)
𝑧𝑜𝑏𝑠 =

𝑝-𝑣𝑎𝑙𝑢𝑒 =

Decision:

Possible errors:
.

Example 8.6.2
Max Poutine is a well-established chain of restaurants in Quebec with 16 locations
throughout the province. The chain’s administrators estimate that 20% of reservations that are
made (either by phone or through their website) lead to no-shows. With this in mind, the
restaurant has adopted overbooking practices. The accuracy of the proportion of no-shows is
regularly tested. Significant changes to the no-show frequency can be compensated for by
adjusting the overbooking.

The most recent survey has shown that from a random sample of 175 reservations in the last 3
months, 41 resulted in no-shows. Based on these results, will current overbooking practices
have to be adapted?

Perform a hypothesis test at a significance level 𝛼 = 5%.

Answer
Population, variable and parameter of interest:
o Population under study: phone and internet reservations to Max Poutine restaurants
o Variable of interest: whether a reservation resulted in a no-show. This is a Bernoulli
variable that could be defined as

     1 , if the reservation resulted in a no-show
𝑋={
     0 , if the reservation was respected

o Tested parameter: 𝑝, the proportion of reservations that resulted in no-shows

Hypotheses: 𝐻0 : vs 𝐻1 :

This is a -tailed test. We will only consider rejecting 𝐻0 if the sample proportion is
extreme enough relative to the assumed value 𝑝 = 20%.

Significance level: 𝛼 = 5%

Sample result, test statistic and 𝒑-value:


The sample result contradicts 𝐻0 since 𝑝̂ =

Given that 𝑛 = 175 ≥ 30 and 𝑛𝑝 = 35 ≥ 10 and 𝑛𝑞 = 140 ≥ 10, we know that


(𝑃̂ − 𝑝)⁄(√𝑝𝑞⁄√𝑛) = 𝑍 ~ 𝑁(0, 1)

𝑧𝑜𝑏𝑠 =

𝑝-𝑣𝑎𝑙𝑢𝑒 =

Decision:

Possible errors:
.

Example 8.6.3
When Marianopolis College considered moving from the Côte-des-Neiges site to the CND
Mother House on Westmount Avenue, one of the main concerns was the impact it would have
on the commute time of students, faculty and staff. Travel time to the Côte-des-Neiges site was
45 minutes, on average, for the college community.

Should a study conclude that the average time of travel to the Mother House does not exceed
45 minutes, little resistance to the move is expected, and it would be easier to convince one
and all that the move is a good idea. On the other hand, if the average travel time to the new
location proved to be longer, the administration would have to negotiate with the STM to
provide a better service to the college.

Travel time samples to the projected site were collected with the following results:
𝑛 = 48 𝑥 = 50 𝑠 = 16

Will the college need to take measures to address this situation? Use a hypothesis test at a
significance level 𝛼 = 10%.

Answer
Population, variable and parameter of interest:
o Population under study: the Marianopolis community (students, faculty and staff)
o The variable of interest: travel time of a member of the community to the Westmount Ave.
site
o The parameter of interest: 𝜇, the average travel time of all members of the community to
the Westmount Ave. site
Hypotheses: 𝐻0 : vs 𝐻1 :

This is a -tailed test. We will only consider rejecting 𝐻0 if the sample mean is
extreme enough to conclude that 𝜇 > 45.

Significance level: 𝛼 = 10%

Sample result, test statistic and 𝒑-value:


The sample result contradicts 𝐻0 since 𝑥 =

The sample size (𝑛 = 48 ≥ 30) is sufficient to assume that (𝑋̅ − 𝜇)⁄(𝑆⁄√48) ~ 𝑇47

𝑡𝑜𝑏𝑠 =

𝑝-𝑣𝑎𝑙𝑢𝑒 =

Decision:

Possible errors:
.

Example 8.6.4
Because of climate change, the caribou population in northern Quebec has migrated. In their
new environment, biologists claim that food is scarce and not adapted to their dietary needs.
The mass of adult male caribou has traditionally been normally distributed with average 𝜇 =
200 kg. If the biologists are correct, a drop in the mass of all caribou, including adult males, is to
be anticipated. Given the large territory to cover, biologists managed to sample only 16 adult
male caribou. The average mass and standard deviation of the sample (in kg) are 𝑥̅ = 188 and
𝑠 = 18, respectively.

Does the sample data corroborate the biologists’ theory? Use a hypothesis test at a significance
level 𝛼 = 5%.

Answer
Population, variable and parameter of interest
o Population under study: adult male caribou in northern Quebec
o The variable of interest: mass of an adult male caribou in northern Quebec
o The parameter of interest: 𝜇, the average mass of adult male caribou in northern Quebec

Hypotheses: 𝐻0 : vs 𝐻1 :

This is a -tailed test. We will only consider rejecting 𝐻0 if the sample mean is
extreme enough to conclude that 𝜇 < 200.

Significance level: 𝛼 = 5%

Sample result, test statistic and 𝒑-value:


The sample result contradicts 𝐻0 since 𝑥̅ =

The sample size (𝑛 = 16 < 30) is small, but variable 𝑋 is assumed to be normally distributed.

Therefore, we may assume that (𝑋̅ − 𝜇)⁄(𝑆⁄√16) ~

𝑡𝑜𝑏𝑠 =

𝑝-𝑣𝑎𝑙𝑢𝑒 =

Decision:

Possible errors:
.

Section 9 Confidence intervals
Inferential statistics include all techniques and procedures with which sample data is used to
make predictions or generalisations about an entire population. You may have guessed that
sampling distributions, discussed in Section 7, will be at the heart of all inference.

An important application of the sampling distributions to inference is the construction of


Confidence intervals.

From the onset, let us highlight the main difference between confidence intervals and
hypothesis testing, a topic that is discussed in Section 8:
o In hypothesis testing, an assumption is made a priori about the value of a parameter
(population indicators such as 𝜇, 𝜎 2 , 𝜌) and sample data is then used to validate, or
invalidate, this assumption.

o When constructing confidence intervals, the value of a parameter is considered


unknown. Sample data is used to estimate the parameter’s value.

In this section, we will focus on constructing confidence intervals for means and proportions.

Estimator: random variable that is the result of a sampling process and whose purpose is to
provide an approximate value for a parameter.

Point estimate: the value of the estimator for a given sample.

An estimator is unbiased when its expected value is equal to the parameter. This means that, if
all possible samples of size 𝑛 were collected from a population, the average value of the point
estimates would be equal to the parameter.

o The sample mean 𝑋 is an unbiased estimator of the parameter 𝜇


o The sample proportion 𝑃̂ is an unbiased estimator of the parameter 𝑝
o The sample variance 𝑆 2 is an unbiased estimator of the parameter 𝜎 2
o The maximum value of a sample will tend to underestimate the population’s maximum. It is
therefore a biased estimator of the population maximum.

9.1 Confidence interval for 
The sample mean 𝑋̅ is an unbiased estimator of the parameter 𝜇, which means that 𝐸(𝑋̅) = 𝜇.
You may recall that this identity was introduced when presenting sampling distributions. More
specifically, the distribution of 𝑋̅ was summarized as follows:

If 𝑋 is a quantitative variable whose population mean is 𝜇, then, as long as 𝑋 is normally


distributed in the population, or as long as the sample size is big enough (𝑛 ≥ 30), the sample
mean is such that
(𝑋̅ − 𝜇)⁄(𝑆⁄√𝑛) ~ 𝑇𝑛−1

Given its random nature, we expect that the point estimate 𝑥̅ will rarely (in fact, practically
never) be equal to 𝜇. However, the probability that 𝑋̅ will land in an interval around 𝜇 is
measurable. For instance, the probability that 𝑋̅ will fall within a distance ∆ of 𝜇, that is
within the interval [𝜇 − ∆, 𝜇 + ∆], can be calculated by using the Student 𝑡-distribution:

𝑃(𝜇 − ∆ ≤ 𝑋̅ ≤ 𝜇 + ∆) = 𝑃( (𝜇 − ∆ − 𝜇)⁄(𝑆⁄√𝑛) ≤ (𝑋̅ − 𝜇)⁄(𝑆⁄√𝑛) ≤ (𝜇 + ∆ − 𝜇)⁄(𝑆⁄√𝑛) )

= 𝑃( −∆⁄(𝑆⁄√𝑛) ≤ 𝑇𝑛−1 ≤ ∆⁄(𝑆⁄√𝑛) )

Here we can make use of notation we have introduced in the sampling distribution sections. By
convention, recall that we denote by 𝑡𝑛−1,𝛼/2 the value of the 𝑡-distribution such that

𝑃(𝑇𝑛−1 ≤ −𝑡𝑛−1,𝛼⁄2 or 𝑇𝑛−1 ≥ 𝑡𝑛−1,𝛼⁄2 ) = 𝛼

which also means that

𝑃(−𝑡𝑛−1,𝛼⁄2 ≤ 𝑇𝑛−1 ≤ 𝑡𝑛−1,𝛼⁄2 ) = 1 − 𝛼


By association, we gather that if ∆⁄(𝑆⁄√𝑛) = 𝑡𝑛−1,𝛼⁄2 then

𝑃( −∆⁄(𝑆⁄√𝑛) ≤ 𝑇𝑛−1 ≤ ∆⁄(𝑆⁄√𝑛) ) = 1 − 𝛼

In other words, if we choose ∆ = 𝑡𝑛−1,𝛼⁄2 ∙ 𝑆⁄√𝑛, then

𝑃(𝜇 − ∆ ≤ 𝑋̅ ≤ 𝜇 + ∆) = 𝑃( 𝜇 − 𝑡𝑛−1,𝛼⁄2 ∙ 𝑆⁄√𝑛 ≤ 𝑋̅ ≤ 𝜇 + 𝑡𝑛−1,𝛼⁄2 ∙ 𝑆⁄√𝑛 ) = 1 − 𝛼
Read carefully, this probability essentially states that a proportion 1 − 𝛼 of all possible samples
of size 𝑛 will produce sample means whose distance from 𝜇 will not exceed 𝑡𝑛−1,𝛼⁄2 ∙ 𝑆⁄√𝑛.

The distance ∆ = 𝑡𝑛−1,𝛼⁄2 ∙ 𝑆⁄√𝑛 is called the margin of error and 1 − 𝛼 is referred to as the
confidence level.

Now, remember that in the context of parameter estimation, we are focusing on establishing
the most plausible values of 𝜇, rather than those of 𝑋̅. Luckily, knowing that a proportion 1 − 𝛼
of all samples are within a distance 𝑡𝑛−1,𝛼⁄2 ∙ 𝑆⁄√𝑛 of 𝜇 also means that a proportion 1 − 𝛼 of
intervals 𝑋̅ ± 𝑡𝑛−1,𝛼⁄2 ∙ 𝑆⁄√𝑛 contain 𝜇.

To summarize:
If a random sample of size 𝑛 is collected, with sample mean and standard deviation 𝑥̅ and 𝑠,
respectively, then the 𝟏 − 𝜶 confidence interval for 𝝁 is:

[𝑥̅ − 𝑡𝑛−1,𝛼⁄2 ∙ 𝑠⁄√𝑛 , 𝑥̅ + 𝑡𝑛−1,𝛼⁄2 ∙ 𝑠⁄√𝑛]

The construction of the confidence interval is valid as long as 𝑋 is normally distributed in the
population, or as long as 𝑛 is big enough (𝑛 ≥ 30).

As 𝑛 grows, the 𝑡-distribution begins to resemble the standard normal law. Hence, for big
samples, we can approximate – when necessary – the confidence interval as

[𝑥̅ − 𝑧𝛼⁄2 ∙ 𝑠⁄√𝑛 , 𝑥̅ + 𝑧𝛼⁄2 ∙ 𝑠⁄√𝑛]

The approximation of 𝑡 by 𝑧 was handy when 𝑇-tables were required to identify the value of
𝑡𝑛−1,𝛼⁄2. These tables were usually limited to small sizes 𝑛 and to a few values of 𝛼. With
modern technology, the approximation is practically unnecessary today.
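The large-sample 𝑧 approximation above can indeed be computed with nothing beyond the standard library. A minimal sketch (the summary statistics 𝑛 = 42, 𝑥̅ = 1.42, 𝑠 = 0.48 are the alfalfa survey numbers used later in Example 9.1.1; with 𝑛 this large, the 𝑧- and 𝑡-based intervals agree to two decimal places):

```python
import math
from statistics import NormalDist

def z_confidence_interval(xbar, s, n, conf=0.95):
    """Large-sample CI for the mean: xbar +/- z_{alpha/2} * s / sqrt(n)."""
    alpha = 1 - conf
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}, about 1.96 for 95%
    margin = z * s / math.sqrt(n)             # margin of error
    return xbar - margin, xbar + margin

# Sample summary with n >= 30, so the z approximation is reasonable
lo, hi = z_confidence_interval(xbar=1.42, s=0.48, n=42, conf=0.95)
print(round(lo, 2), round(hi, 2))   # -> 1.27 1.57
```

Swapping `NormalDist().inv_cdf` for an exact 𝑡-quantile (e.g., from a statistical package) reproduces the textbook construction when 𝑛 is small.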

Warning
Although it is tempting to say that there is a probability equal to 1 − 𝛼 that 𝜇 lies in the
confidence interval, statisticians tend to be opposed to this interpretation. The fact of the
matter is that 𝜇 is not a variable; it is a parameter. The randomness is truly that of the
confidence interval itself: both its center and its width (𝑥̅ ± ∆) depend on the selection of the
sample.

A more appropriate interpretation is to say, “With 95% confidence, we estimate that the
population parameter is within the bounds of the confidence interval”.
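A small simulation makes this interpretation concrete: the parameter stays fixed while the intervals move from sample to sample, and roughly a proportion 1 − 𝛼 of them capture it. A sketch using a hypothetical 𝑁(10, 2) population and 𝑧-based intervals for simplicity:

```python
import math
import random
from statistics import NormalDist, mean, stdev

random.seed(1)
z = NormalDist().inv_cdf(0.975)            # z_{2.5%} for a 95% confidence level
mu, sigma, n, trials = 10.0, 2.0, 50, 1000

covered = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar, s = mean(sample), stdev(sample)
    margin = z * s / math.sqrt(n)
    if xbar - margin <= mu <= xbar + margin:   # did this interval capture mu?
        covered += 1

print(covered / trials)   # close to 0.95: about 95% of the intervals contain mu
```

Note that 𝜇 never moved; it is the interval endpoints that varied from sample to sample.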

Example 9.1.1
The production of alfalfa in Saskatchewan is tabulated every year to track the progress of crop
efficiency. It is too tedious to gather crop data from all fields. Therefore, the department of
Agriculture collects yearly data from 1-acre lots selected at random throughout the province.

The results of the 2023 survey are as follows:


𝑛 = 42 𝑥̅ = 1.42 𝑡𝑜𝑛𝑠 𝑠 = 0.48 𝑡𝑜𝑛𝑠

Estimate the average production per acre for the province of Saskatchewan in 2023 using a
confidence level of 95%.

Population under study: acres dedicated to alfalfa crops in Saskatchewan


Variable of interest: 𝑋 = “Production of alfalfa per 1-acre surface in Saskatchewan
(2023)”
Parameter to estimate: 𝜇, the average production of alfalfa, in tons per acre
(Saskatchewan, 2023)”

1 − 𝛼 =           →   𝛼 =           →   𝛼⁄2 =

𝑡41,2.5% =

∆=

95% CI for 𝜇: [ ] or [1.27, 1.57]

With 95% confidence, we estimate that the true average production of Saskatchewan alfalfa is
between 1.27 and 1.57 tons per acre in 2023. The sample size (𝑛 = 42) is sufficient to assume
that our process is valid.

Example 9.1.2
Competitive deep-water (apnea) divers were tested to evaluate their heart rates at rest.
Scientists believed that heart rates lower than those of the rest of the population could explain
their capacity to hold their breath for long periods of time.
There are very few professional divers. It is extremely difficult to collect samples of respectable
size, but the assumption is that heart rates at rest should be normally distributed.

The results of the survey are as follows:


𝑛 = 10 𝑥̅ = 61 𝑏𝑝𝑚 𝑠 = 8 𝑏𝑝𝑚

Estimate the average heart rate at rest of all deep-water divers using a confidence level of
90%.

Answer
Population under study: professional deep-water divers
Variable of interest: 𝑋 = “heart rate at rest of a professional apnea diver”
Parameter to estimate: 𝜇, average heart rate at rest of professional apnea divers

1 − 𝛼 =           →   𝛼 =           →   𝛼⁄2 =

𝑡9,5% =

∆=

90% CI for 𝜇: [ ] or [56.36, 65.64]

With 90% confidence, we estimate that the true average heart rate at rest of professional
apnea divers is between 56.36 and 65.64 bpm.

The process is valid (despite the small sample size) because of the assumed normal distribution
of 𝑋.

Example 9.1.3
When estimating 𝜇, how is the margin of error for a confidence interval affected if…

The sample size is increased?

The confidence level is increased?

The sample is extremely homogeneous?

9.2 Confidence interval for a proportion, p
The sample proportion 𝑃̂ is an unbiased estimator of the parameter 𝑝, which means that
𝐸(𝑃̂) = 𝑝.

As we have seen in the section on sampling distributions, we have shown that

𝑃̂ ~ 𝑁(𝑝, √𝑝𝑞 ⁄√𝑛)

if the following conditions are met: 𝑛 ≥ 30, 𝑛𝑝 ≥ 10, 𝑛𝑞 ≥ 10

Given its random nature, we expect that the point estimate 𝑝̂ may nevertheless differ from 𝑝.

However, the probability that 𝑃̂ will remain in an interval around 𝑝 is measurable. For example,
the probability that 𝑃̂ will remain within a distance ∆ of 𝑝, that is within the interval
[𝑝 − ∆, 𝑝 + ∆], can be calculated using a standard normal law, given that the Central limit
theorem predicts that (𝑃̂ − 𝑝)⁄(√𝑝𝑞⁄√𝑛) = 𝑍 ~ 𝑁(0, 1). Thus,

𝑃(𝑝 − ∆ ≤ 𝑃̂ ≤ 𝑝 + ∆) = 𝑃( (𝑝 − ∆ − 𝑝)⁄(√𝑝𝑞⁄√𝑛) ≤ (𝑃̂ − 𝑝)⁄(√𝑝𝑞⁄√𝑛) ≤ (𝑝 + ∆ − 𝑝)⁄(√𝑝𝑞⁄√𝑛) )

= 𝑃( −∆⁄(√𝑝𝑞⁄√𝑛) ≤ 𝑍 ≤ ∆⁄(√𝑝𝑞⁄√𝑛) )

Now, recall that by convention 𝑧𝛼/2 is the value of the standard normal law such that

𝑃(𝑍 ≤ −𝑧𝛼⁄2 or 𝑍 ≥ 𝑧𝛼⁄2 ) = 𝛼

which also means that

𝑃(−𝑧𝛼⁄2 ≤ 𝑍 ≤ 𝑧𝛼⁄2 ) = 1 − 𝛼


By association, we gather that if ∆⁄(√𝑝𝑞⁄√𝑛) = 𝑧𝛼⁄2 then

𝑃( −∆⁄(√𝑝𝑞⁄√𝑛) ≤ 𝑍 ≤ ∆⁄(√𝑝𝑞⁄√𝑛) ) = 1 − 𝛼

In other words, if we choose ∆ = 𝑧𝛼⁄2 √(𝑝𝑞⁄𝑛), then

𝑃(𝑝 − ∆ ≤ 𝑃̂ ≤ 𝑝 + ∆) = 𝑃( 𝑝 − 𝑧𝛼⁄2 √𝑝𝑞⁄√𝑛 ≤ 𝑃̂ ≤ 𝑝 + 𝑧𝛼⁄2 √𝑝𝑞⁄√𝑛 ) = 1 − 𝛼

This statement can be interpreted to mean that variations of 𝑃̂ from 𝑝 should remain smaller
than 𝑧𝛼⁄2 √(𝑝𝑞⁄𝑛) for most samples. In other words, the point estimate 𝑝̂ will typically be within a
distance 𝑧𝛼⁄2 √(𝑝𝑞⁄𝑛) of the true parameter 𝑝. Indeed, we expect that

𝑃̂ − 𝑧𝛼⁄2 √𝑝𝑞⁄√𝑛 ≤ 𝑝 ≤ 𝑃̂ + 𝑧𝛼⁄2 √𝑝𝑞⁄√𝑛

with probability 1 − 𝛼.

Our main issue, as you may have noticed, is that since we are estimating the parameter 𝑝, its
unknown value cannot be used to construct the interval

𝑝 − 𝑧𝛼⁄2 √𝑝𝑞⁄√𝑛 ≤ 𝑃̂ ≤ 𝑝 + 𝑧𝛼⁄2 √𝑝𝑞⁄√𝑛

We have no choice here but to use the point estimate, that is, 𝑝̂ ≈ 𝑝. The conditions imposed
on the size of 𝑛 should suffice to ensure that this approximation is acceptable enough to
pursue our calculations:

To summarize:
If a random sample of size 𝑛 is collected, with sample proportion 𝑝̂, then the 𝟏 − 𝜶 confidence
interval for 𝒑 is:

[𝑝̂ − 𝑧𝛼⁄2 √𝑝̂𝑞̂⁄√𝑛 , 𝑝̂ + 𝑧𝛼⁄2 √𝑝̂𝑞̂⁄√𝑛]

where the distance ∆ = 𝑧𝛼⁄2 √𝑝̂𝑞̂⁄√𝑛 is the margin of error. 1 − 𝛼 is considered a confidence level
rather than a probability, given the unpredictable reliability of 𝑝̂.

We will deem that the process is valid as long as 𝑛 ≥ 30, 𝑛𝑝̂ ≥ 10 and 𝑛𝑞̂ ≥ 10.

Example 9.2.1
Air Banana is an airline whose frequency of late departures was one of the worst in the country.
Now that it has put new procedures in place, the airline will sample random flights and evaluate
its ‘on-time reliability’.

A random sample of 129 departing flights was gathered. 93 of these flights departed on time.
Construct a 95% confidence interval for true proportion of Air Banana flights that depart
without a delay.

Answer
Population under study: Departing Air Banana flights after new procedures were implemented.
Variable of interest: 𝑋 = 1 if the plane departs on time, 𝑋 = 0 otherwise
Parameter to estimate: 𝑝, proportion of all departures that were on time

1 − 𝛼 =           →   𝛼 =           →   𝛼⁄2 =

𝑧2.5% =

𝑝̂ = 𝑞̂ =

∆=

95% CI for 𝑝: [ ] or [64.4%, 79.8%]

With 95% confidence, we estimate that the true proportion of Air Banana flights that depart on
time is between 64.4% and 79.8%.

The process is valid because the sample size is sufficient: 𝑛 ≥ 30, 𝑛𝑝̂ ≥ 10 and 𝑛𝑞̂ ≥ 10.
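The interval reported above can be reproduced in a few lines from the sample counts (129 flights, 93 on time). This is a sketch of the hand computation, not output from any statistical package:

```python
import math
from statistics import NormalDist

n, on_time = 129, 93
p_hat = on_time / n                      # point estimate of p
q_hat = 1 - p_hat
z = NormalDist().inv_cdf(0.975)          # z_{2.5%} for a 95% confidence level
margin = z * math.sqrt(p_hat * q_hat) / math.sqrt(n)

lo, hi = p_hat - margin, p_hat + margin
print(f"[{lo:.1%}, {hi:.1%}]")           # -> [64.4%, 79.8%]
```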

Conservative confidence interval when estimating a proportion, p
We have seen that under sufficient conditions, the sample proportion 𝑃̂ has a distribution that
resembles that of a normal law.

More specifically,
𝑃̂ ~ 𝑁(𝑝, √𝑝𝑞 ⁄√𝑛)

Constructing a confidence interval for 𝑝 involves using its point estimate, 𝑝̂, as the center of the
interval as well as in computing the margin of error ∆ = 𝑧𝛼⁄2 √𝑝̂𝑞̂⁄√𝑛. In other words, not only is
there a potential of missing the target when estimating the interval’s center, we may also
underestimate the required margin of error.

When estimating population means, the use of a 𝑡-distribution compensates for the use of 𝑠 in
replacing 𝜎. This approach is not suggested, however, when estimating population proportions
(the variable of interest obeys a Bernoulli distribution: not only is it not normally distributed in
the population, it is not quantitative at all).

A conservative approach is to overestimate the margin of error by replacing √𝑝̂ 𝑞̂ by its


maximum value, which is 0.5.

Can you explain why √𝑝̂𝑞̂ is 0.5, at most?

Example 9.2.2
When estimating 𝑝, how is the margin of error for a confidence interval affected if…
the sample size is increased?

the confidence level is increased?

the sample proportion is extreme (either close to 0, or close to 1)?


Hint: use the graph below to orient your reasoning

9.3 Determination of a sample size
In research, the size of the sample is determined with specific objectives in mind. One may
want to obtain results with a limited margin of error at a given level of confidence. The
mathematical issue, here, is that even when ∆ and 1 − 𝛼 are pre-determined, unknowns still
prevent us from determining the size of 𝑛.

Table 9.3.1
                              Confidence interval for 𝒑        Confidence interval for 𝝁
Margin of error formula       ∆ = 𝑧𝛼⁄2 √𝑝̂𝑞̂⁄√𝑛               ∆ = 𝑡𝑛−1,𝛼⁄2 ∙ 𝑠⁄√𝑛

Pre-determined quantities     ∆, 1 − 𝛼                          ∆, 1 − 𝛼

Unknown elements              𝑝̂, 𝑞̂, 𝑛                          𝑠, 𝑛, 𝑡𝑛−1,𝛼⁄2

Below, you will find how we can deal with the unknowns highlighted in Table 9.3.1 and
eventually determine the required sample size.

Sample size for estimation of a proportion, 𝒑


From the onset, a maximum value is set for the desired margin of error and for the confidence
level. Our goal is to find a value for 𝑛 such that

𝑧𝛼⁄2 √𝑝̂𝑞̂⁄√𝑛 ≤ ∆𝑚𝑎𝑥

The estimates 𝑝̂ and 𝑞̂ are not yet known, but we have discussed earlier that in a conservative
scenario, √𝑝̂ 𝑞̂ will never exceed 0.5. Therefore, if we use √𝑝̂ 𝑞̂ = 0.5, we will find the size for
which the margin of error, even in a worst-case scenario, will not be surpassed.

𝑧𝛼⁄2 ∙ 0.5⁄√𝑛 ≤ ∆𝑚𝑎𝑥

√𝑛 ≥ 0.5 𝑧𝛼⁄2 ⁄ ∆𝑚𝑎𝑥

𝑛 ≥ (0.5 𝑧𝛼⁄2 ⁄ ∆𝑚𝑎𝑥)²

Note that 𝑛 must be an integer, so we may have to round this number up.
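The conservative formula can be wrapped in a short helper. The 5% maximum margin used below is an illustrative choice, not one of the worked examples:

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(delta_max, conf=0.95):
    """Conservative n for estimating p, using sqrt(p_hat * q_hat) <= 0.5."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}
    return math.ceil((0.5 * z / delta_max) ** 2)   # round up to an integer

print(sample_size_for_proportion(0.05))   # -> 385
```

The value 385 for a 5% margin at 95% confidence is the familiar sample size quoted in many opinion polls.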

Example 9.3.1
A new drug is to be tested. In addition to its efficiency, the pharmaceutical company must keep
track of the proportion of patients that have side effects (light, moderate, severe) when using
the drug.

What size sample is needed to produce a 95% confidence interval whose margin of error does
not exceed 4%?

1 − 𝛼 =           ⟹   𝛼⁄2 =           ⟹   𝑧2.5% =

∆𝑚𝑎𝑥 =

𝑛 ≥ (0.5 𝑧𝛼⁄2 ⁄ ∆𝑚𝑎𝑥)²   ⟹   𝑛 ≥

Conclusion: The sample size must be at least 𝑛 =

Example 9.3.2
In the previous example, suppose that 608 individuals were tested, 21 of whom had moderate
to severe side effects.

Estimate the true proportion of patients that have moderate to severe side effects if given
the drug in its current form. Use a 95% confidence interval.

Point estimate: 𝑝̂ = 𝑞̂ =

1 − 𝛼 =           ⟹   𝛼⁄2 =           ⟹   𝑧2.5% =

(Note that 𝑛, 𝑛𝑝̂ , 𝑛𝑞̂ are all big enough for this process to be valid)

∆=

95% confidence interval for 𝑝 : [ ] or [2.004% , 4.904%]

Does the result respect the 4% margin of error criterion that was set when determining sample
size?

Sample size for estimation of a mean, 𝝁
From the onset, a maximum value is set for the desired margin of error and for the confidence
level. Our goal is to find a value for 𝑛 such that

𝑡𝑛−1,𝛼⁄2 ∙ 𝑠⁄√𝑛 ≤ ∆𝑚𝑎𝑥

The estimate 𝑠 is not yet known and, given that the sample size is not fixed yet, there is no
way of knowing 𝑡𝑛−1,𝛼⁄2.

In the absence of a sample size, our only resort is to replace 𝑡𝑛−1,𝛼⁄2 by 𝑧𝛼⁄2 . This creates an
underestimation of the margin of error, but we hope that 𝑛 will be big enough to guarantee
that 𝑡𝑛−1,𝛼⁄2 ≈ 𝑧𝛼⁄2 .

As for 𝑠, a conservative estimate of its value is used. That is, we replace 𝑠 by a value – usually
denoted by 𝜎̃ – that we assume 𝑠 will not exceed. Occasionally, a pre-survey can even be
conducted to estimate this value. Thus…
𝑧𝛼⁄2 ∙ 𝜎̃⁄√𝑛 ≤ ∆𝑚𝑎𝑥

√𝑛 ≥ 𝑧𝛼⁄2 𝜎̃ ⁄ ∆𝑚𝑎𝑥

𝑛 ≥ (𝑧𝛼⁄2 𝜎̃ ⁄ ∆𝑚𝑎𝑥)²

Again, 𝑛 must be an integer, so we may have to round this number up.
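The same pattern works for estimating a mean, with the assumed bound 𝜎̃ in place of 0.5. The planning values below (𝜎̃ = 10 units, ∆max = 2 units) are hypothetical, chosen only to illustrate the formula:

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma_guess, delta_max, conf=0.95):
    """n such that z_{alpha/2} * sigma_guess / sqrt(n) <= delta_max."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)       # z_{alpha/2}
    return math.ceil((z * sigma_guess / delta_max) ** 2)

# Hypothetical planning values: assumed sigma of 10, margin of at most 2
print(sample_size_for_mean(10, 2))   # -> 97
```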

Example 9.3.3
Merrier produces and sells carbonated water. Periodically, samples are gathered to make sure
the volume that is poured into its bottles is adequate, at least on average. Pouring too little
liquid exposes the company to lawsuits. Pouring too much liquid increases the production costs.
Past sample analyses have indicated that the standard deviation of the volume that is poured
remains less than 6 𝑚𝑙.

Today, a sample will be collected to estimate the average volume poured into its
750 𝑚𝑙 bottles.

Find the size of the sample for which the margin of error does not exceed 𝟏 𝒎𝒍 at a 90%
confidence level.

1 − 𝛼 =           ⟹   𝛼⁄2 =           ⟹   𝑧5% =

∆𝑚𝑎𝑥 =

𝜎̃ =

𝑛 ≥ (𝑧𝛼⁄2 𝜎̃ ⁄ ∆𝑚𝑎𝑥)²   →   𝑛 ≥

Conclusion: The sample size must be at least 𝑛 =

Example 9.3.4
In the previous example, suppose that 100 bottles were tested. The sample mean and standard
deviation were 752.4 𝑚𝑙 and 5.4 𝑚𝑙, respectively.

Estimate the true mean of the volume poured into all 𝟕𝟓𝟎 𝒎𝒍 bottles. Use a 90% confidence
interval.

Point estimate: 𝑥̅ =

Margin of error: recall that when estimating a population mean, ∆ = 𝑡𝑛−1,𝛼⁄2 ∙ 𝑠⁄√𝑛

1 − 𝛼 =           ⟹   𝛼⁄2 =           ⟹   𝑡99,5% =

𝑠=

∆=

90% confidence interval for 𝜇 : 752.4 ± or [751.5036 , 753.2964]

Does the result respect the 𝟏 ml margin of error criterion that was set when determining the
sample size, and is the result cause for concern?

The margin of error is lower than 1 ml. Merrier should be concerned. The sample seems to
indicate (with 90% confidence) that the average volume that is poured is, in fact, not equal to
750 ml.

Section 10 Correlation and linear regression
We have introduced, when presenting descriptive statistics, the concept of correlation
coefficients. If necessary, you may revisit that section to recall their major attributes. For instance,
what does the sign of the correlation coefficient mean? What about its size? Are non-correlated
variables necessarily independent?

In this section, we will discuss linear relationships between quantitative variables in the context
of hypothesis testing. More specifically, we will examine situations where variables can truly be
considered significantly correlated. Moreover, we will introduce linear regression, which can
not only attest to a significant alignment of the scatter plot, but also provide the ‘most
appropriate’ line with which predictions can potentially be made.

A visual inspection of the scatter plot before going any further is suggested. Trying to explain a
trend that may be quadratic, exponential, or logarithmic by using a line, even if it is the ‘best’
line, is a contradiction.

Example 9.3.1
In the graphs shown below, can you tell which one(s) do not really display a linear trend,
despite the ‘best’ line showing otherwise?

Figure 9.3.1 Figure 9.3.3

Figure 9.3.2 Figure 9.3.4

10.1 Linear Regressions
Many expressions are used to describe the line that ‘best’ describes the trend of a scatter plot:
o Line of best fit
o Line of least squares
o Linear regression equation

The equation itself is obtained through an optimization process. Techniques required to find
the ‘best’ line go beyond the scope of our course (although you may encounter them in
Calculus 3, should you accept that challenge!), but it is worth noting that its equation minimizes
the sum of the squared errors. Errors, also called residuals, occur whenever the 𝑦-value
predicted by the regression equation differs from the 𝑦-value that was actually observed (see
Figure 10.1.1).

Figure 10.1.1

𝐒𝐮𝐦 𝐨𝐟 𝐬𝐪𝐮𝐚𝐫𝐞𝐬 𝐨𝐟 𝐭𝐡𝐞 𝐞𝐫𝐫𝐨𝐫𝐬 = (𝐞𝐫𝐫𝐨𝐫 𝟏)𝟐 + (𝐞𝐫𝐫𝐨𝐫 𝟐)𝟐 + (𝐞𝐫𝐫𝐨𝐫 𝟑)𝟐 + ⋯

You may have guessed that correlations of 1, or of −1, can only occur if the sum of squares of
the errors is 0.
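The least-squares slope and intercept can be computed directly from the standard formulas 𝑏₁ = 𝑆𝑥𝑦⁄𝑆𝑥𝑥 and 𝑏₀ = 𝑦̅ − 𝑏₁𝑥̅ (not derived in this document). A sketch with a hypothetical scatter plot, checking that nudging the slope away from the fitted value can only increase the sum of squared errors:

```python
from statistics import mean

def least_squares(x, y):
    """Slope and intercept minimizing the sum of squared errors (residuals)."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx                # estimated slope
    b0 = ybar - b1 * xbar         # line passes through the point of means
    return b0, b1

def sse(x, y, b0, b1):
    """Sum of squared errors for a candidate line y = b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical scatter plot
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.9]
b0, b1 = least_squares(x, y)

# Any other slope through the same intercept gives a strictly larger SSE
assert sse(x, y, b0, b1) < sse(x, y, b0, b1 + 0.1)
print(round(b1, 3), round(b0, 3))   # -> 1.98 0.1
```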

In addition to checking the linearity of the scatter plot, a linear regression is valid as long as the
response variable (𝑌) is quantitative and as long as the independent variable (𝑋) is either
quantitative or binary (qualitative, but coded as 0 or 1). A qualitative variable that has been
coded into arbitrary numerical values is not permitted.

Before going any further, let us compare the scatter plots shown below.

What do they have in common and how are they different?

Figure 10.1.2 Figure 10.1.3

Common elements
Both scatter plots show pairs of measures obtained from the individuals. In either case, every
dot reveals the number of students and the number of teachers for a selected school.

Differences
o the number of points in each scatter plot
o the equations of each line of ‘best fit’ through the scatter plots
o the notation used to represent the linear correlation coefficient (𝜌 vs 𝑟)
o the value of each correlation coefficient

One who notices that the value of parameter 𝜌 is provided will deduce that population data
was used to construct Figure 10.1.2. Indeed, the scatter plot provides information pertaining to
all high schools of a given region. Figure 10.1.3 shows a scatter plot using sample data, for which
the point estimate 𝑟 is published. This graph was constructed using a sample of high schools
from the same population that generated Figure 10.1.2.

Although the equations for each line through the scatterplots are similar, they are not identical,
nor are their linear correlation coefficients:

Line 1 (population) Line 2 (sample)

𝑦 = 0.0644𝑥 − 11.159 𝑦 = 0.0593𝑥 − 2.6419


𝜌 = 0.847 𝑟 = 0.757

As you can see, sampling does not only affect averages and proportions. Sampling may affect the
trends that we observe in scatter plots. The link between 𝑋 and 𝑌 is only approximate when
using sample data, whereas populational data provides their true relationship.
In terms of notation, certain distinctions are necessary.

The true regression equation is denoted by 𝑦 = 𝛽0 + 𝛽1 𝑥 when obtained from population data.

The values 𝛽0 and 𝛽1 are parameters that represent the line’s 𝑦-intercept and slope,
respectively.

On the other hand, the estimated equation of best fit, denoted by 𝑦̂ = 𝑏0 + 𝑏1 𝑥, is constructed from
sample data. The values of 𝑏0 and 𝑏1 are point estimates for 𝛽0 and 𝛽1, respectively. Their
values depend on the individuals that were selected to form the sample.
Because 𝑏0 and 𝑏1 are subject to estimation errors, the information they convey cannot
systematically be trusted.

𝑋 and 𝑌 may be dependent, while 𝑦̂ = 𝑏0 + 𝑏1 𝑥 may lead you to believe otherwise, as


illustrated in Figure 10.1.4.

Figure 10.1.4

Likewise, as shown in Figure 10.1.5, we may be misled by an apparent trend in 𝑦̂ = 𝑏0 + 𝑏1 𝑥
when 𝑋 and 𝑌 are actually independent.

Figure 10.1.5

In statistical contexts, the latter is the error that we wish to avoid as much as possible… that is,
to conclude there is a relationship between variables when, in fact, no such relationship exists.

So, sample results will be submitted to hypothesis testing before we can acknowledge that the
regression is significant.

We will test the value of the parameter 𝛽1. Recall that this parameter represents the slope of
the best-fit line through a scatter plot drawn from populational data.

If the slope is significantly different from 0, then one may conclude that a trend (a relationship,
a dependency) does exist between the variables 𝑋 and 𝑌.

We will be using an Excel template to conduct the test and to determine its 𝑝-value.

Hypotheses on 𝜷𝟏
𝐻0 : 𝛽1 = 0
𝐻1 : 𝛽1 ≠ 0

If sample results provide sufficient evidence to reject 𝐻0 : 𝛽1 = 0, we accept 𝐻1 : 𝛽1 ≠ 0 and


conclude that the regression is significant: 𝑋 has been proven to influence 𝑌. In turn, using the
regression equation to estimate the values of 𝑌 is allowed.

A significant linear regression is always synonymous with saying there is a significant correlation.


Indeed, a linear regression is only significant when the scatter plot shows a sufficiently strong
linear trend.

In other words, the following hypothesis tests are equivalent:

Table 10.1.1
Does the linear regression equation          Does the scatter plot have a
have a significant slope?                    significant linear trend?
𝑯𝟎 : 𝜷𝟏 = 𝟎                                  𝐻0 : 𝜌 = 0
𝑯𝟏 : 𝜷𝟏 ≠ 𝟎                                  𝐻1 : 𝜌 ≠ 0

The computations required to conduct either test are somewhat tedious, which is why we will
use Excel to conduct the test. This will allow us to focus on interpreting its output.

Example 10.1.1
Gas companies (Shell, Esso, Petro Canada) frequently argue that variations in the price of gas at
the pump are due to the variations of the price of the oil they import and refine.

Can data be used to confirm that this relationship exists and whether it is linear?

A random generator produced a sample of gas stations and dates. On the selected days, the
price of gas at the pump (in cents per liter) and the price of oil (in US$ per barrel) were noted.

A visual inspection of the scatter plot (Figure 10.1.6) clearly indicates that a linear model is
warranted. The scatter plot is evenly distributed above and below the estimated regression
equation.

Figure 10.1.6

At a significance level 𝛼 = 5%, can we conclude that there is a significant linear relationship?
Or, equivalently, is there a significant correlation between the price of gas and the price of oil?

To answer these questions, we must perform the following tests (whose conclusions will always
be identical):

Is the regression significant? Is the correlation significant?


Hypotheses 𝐻0 : 𝛽1 = 0 vs 𝐻1 : 𝛽1 ≠ 0 𝐻0 : 𝜌 = 0 vs 𝐻1 : 𝜌 ≠ 0

Output we can use to guide our decision is obtained from Excel’s Regression option contained
in the Data Analysis Toolpak:

Is the regression significant? Is the correlation significant?

Hypotheses 𝐻0 : 𝛽1 = 0 vs 𝐻1 : 𝛽1 ≠ 0 𝐻0 : 𝜌 = 0 vs 𝐻1 : 𝜌 ≠ 0
Point estimate 𝑏1 = 0.4722 𝑟 = 0.5075
𝒑-value 6.095 × 10−8

Observe that, regardless of the test, the value of the parameter under 𝐻0 (that is, 𝛽1 = 0 or
𝜌 = 0) is contradicted by the point estimates (𝑏1 ≠ 0, 𝑟 ≠ 0). In fact, the discrepancies
between them are big enough to produce a 𝑝-value that does not exceed 𝛼.

6.095 × 10−8 ≤ 5% ⟹ 𝑝-value ≤ 𝛼 ⟹ sufficient evidence to reject 𝐻0

In the context that is described in the current example, we can conclude that (at a level of 5%)
there is a significant linear relationship between the price (per liter) of gas at the pump and the
price (per barrel) of oil.

Additional information provided by the Regression option proves to be extremely helpful.

1. A confidence interval for the true slope and true 𝒚-intercept of the regression is provided.

With 95% confidence, we estimate that the true intercept, 𝛽0, lies in the interval [42.23, 70.70]
and that the true slope, 𝛽1, is in the interval [0.3123, 0.6321].

2. The estimated regression equation is usable when the regression is significant.

𝑦̂ = 56.466642 + 0.4722232𝑥 is the estimated regression equation. Under certain


conditions, it can be used to estimate the price (per liter)
of gas at the pump when the price of one barrel of oil is
provided.

For example, when the price of oil is set at 100 US$/barrel, we can predict that the price at the
pump will be roughly 𝑦̂(𝑥=100) = 56.466642 + 0.4722232(100) = 103.69 (in cents per liter)

Keep in mind that this result is a point estimate of the price of gas since it is obtained from the
estimated regression equation.

3. The efficiency (or the quality) of the regression model is provided.

A measure of efficiency is required to determine whether the linear regression model is a good
predictor of the dependent variable. The coefficient of determination, or simply 𝑅², does just
that. For all one-variable regressions, 𝑅² = 𝑟². That is, the coefficient of determination is equal
to the square of the coefficient of correlation.

The 𝑅² reveals the proportion of the variability of 𝑌 that is explained (or predicted) by the
linear regression equation.

You can find the details of the calculation of 𝑅 2 within the ANOVA table, also provided with the
output of the Regression option.

𝑅² = 𝑆𝑆𝑅𝑒𝑔𝑟𝑒𝑠𝑠𝑖𝑜𝑛 ⁄ 𝑆𝑆𝑇𝑜𝑡𝑎𝑙 = 1138.715 ⁄ 4421.918 = 0.257516 = 25.7516%

The remaining percentage (74.25%, in our example) is due to unexplained variations of 𝑌, which
we typically call the residuals.

When the scatter plot remains very close to the line of best fit, 𝑅² will be close to 1, and the
regression equation is an excellent predictor of values of 𝑌. A value of 𝑅² that is close to 0 is an
indication that the regression will not be an accurate predictor of values of 𝑌.
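The identity 𝑅² = 𝑟² and the reading of 𝑅² as "explained variability" can be verified numerically. A sketch with hypothetical data, computing 𝑅² as 1 − 𝑆𝑆residual⁄𝑆𝑆total:

```python
import math
from statistics import mean

x = [1, 2, 3, 4, 5, 6]
y = [1.2, 2.3, 2.9, 4.4, 4.8, 6.1]

xbar, ybar = mean(x), mean(y)
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
sxx = sum((a - xbar) ** 2 for a in x)
syy = sum((b - ybar) ** 2 for b in y)

r = sxy / math.sqrt(sxx * syy)        # correlation coefficient
b1 = sxy / sxx                        # least-squares slope
b0 = ybar - b1 * xbar                 # least-squares intercept

ss_res = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))  # unexplained (residual)
r_squared = 1 - ss_res / syy          # proportion of variability of Y explained

# For a one-variable regression, R^2 equals the squared correlation
assert abs(r_squared - r ** 2) < 1e-9
print(round(r_squared, 4))
```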

Below, both linear regressions are significant (𝑝-values ≤ 5%). However, ‘experience’ (shown in
Figure 10.1.7) is a better predictor of salary than is ‘age’ (Figure 10.1.8), based on the comparison
of their respective 𝑅² values.

Figure 10.1.7 Figure 10.1.8

Beware of how you use regression


Regression is a powerful tool. There are, however, a certain number of traps we may be
tempted to fall into.

We must avoid…

o using the estimated equation of a regression that has not proven to be significant,
o using a coded categorical variable as 𝑋,
o using a binary variable as 𝑌,
o using a linear regression without investigating the trend of the scatter plot,
o interpreting the constant when the context does not allow variable 𝑋 to be 0,
o making predictions on 𝑌 outside of the domain of observed values of 𝑋,
o considering ‘significant’ and ‘efficient’ as synonymous expressions.

Section 11 Categorical data and Chi-Square Test
We have seen how scatter plots, correlations and regressions can be used to establish the
existence of linear relationships between quantitative variables. Unfortunately, such analyses
are of no use if one, or both, of the variables are categorical (qualitative ordinal or nominal). Of
course, it would make no sense to talk about the ‘alignment’ of variables that are not
quantitative, and thus a scatter plot cannot be produced from such data.
Instead, the approach we will take requires that sample data be collected and used to construct
a contingency table. If necessary, you may refer to the contingency tables section of descriptive
statistics, should you want to review how to generate them.

The objective of a Chi-Square Test is to determine whether or not one can conclude that a
dependence exists between the variables of study. The test bears the name of the distribution
law from which the 𝑝-value is obtained. The Chi-Square distribution is denoted by

𝜒²(𝑚−1)×(𝑛−1)

where the product (𝑚 − 1) × (𝑛 − 1) gives the distribution’s degrees of freedom. 𝑚 and 𝑛


correspond to the number of rows and columns the contingency table has.

The hypotheses we will confront are:

𝐻0 : The variables are independent vs 𝐻1 : The variables are dependent

Notice that this is a non-parametric test. Indeed, the assumed value of a parameter is not
required to perform the Chi-Square Test.

The decision to reject the null hypothesis (or not) will be taken after the ‘observed frequencies’
(𝑜𝑖𝑗 ) of a contingency table are compared with the ‘expected frequencies’ (𝑒𝑖𝑗 ) under the
assumption of independence.

The larger the differences between observed and expected frequencies are, the bigger the
value of 𝜒² will be. We will explain, in our example, how expected frequencies are calculated.

As usual, a significance level must be chosen at the onset of the test and a 𝑝-value will be
calculated from the experimental result of the 𝜒 2 .

The test is valid if the total number of observations is at least 30 (𝑛 ≥ 30), and if all expected
frequencies are at least 5 (𝑒𝑖𝑗 ≥ 5).

187
Example 10.1.1
The following contingency table is inspired by the results of a survey that was conducted
among American citizens registered to vote in the 2020 presidential election. The data was
published by the Pew Research Center (July 12, 2023).

Table of observed frequencies


Highest level of education    Intends to vote Democrat    Intends to vote Republican    Total
Postgraduate                  46                          22                            𝟔𝟖
College graduate              124                         96                            𝟐𝟐𝟎
Some college                  40                          38                            𝟕𝟖
HS or less                    56                          78                            𝟏𝟑𝟒
Total                         𝟐𝟔𝟔                         𝟐𝟑𝟒                           𝟓𝟎𝟎

Based on the data, does the voting intention of an American citizen depend on the voter’s
highest level of education? Use an appropriate test of level 𝜶 = 𝟓%.

Below is a step-by-step procedure to answer this question.

1. Identify the variables of interest


The variables described in the current context are both qualitative:
Voting intention (nominal variable)
Highest level of education (ordinal variable)

2. State the hypotheses and the significance level


𝐻0 : Voting intentions are independent of the voter’s highest level of education
𝐻1 : Voting intentions are dependent on the voter’s highest level of education

𝛼 = 5% is the designated level of significance

3. Identify the distribution law you will need to use


The Chi-Square distribution with (4 − 1) × (2 − 1) = 3 degrees of freedom, that is 𝜒₃², will be
required to perform the Chi-Square Test of independence.

188
4. Construct the table of expected frequencies
All expected frequencies are constructed under the assumption that 𝐻0 is true, that is, if voting
intentions were independent of the highest educational level.

If these factors are independent, then the probability that an individual from the sample will be
found in 𝑟𝑜𝑤 𝑖 and 𝑐𝑜𝑙𝑢𝑚𝑛 𝑗 is:

𝑃(𝑟𝑜𝑤 𝑖 ∩ 𝑐𝑜𝑙𝑢𝑚𝑛 𝑗) = 𝑃(𝑟𝑜𝑤 𝑖) ∙ 𝑃(𝑐𝑜𝑙𝑢𝑚𝑛 𝑗)

                     = (𝑟𝑜𝑤 𝑖 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 / 𝑡𝑜𝑡𝑎𝑙 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦) ∙ (𝑐𝑜𝑙𝑢𝑚𝑛 𝑗 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 / 𝑡𝑜𝑡𝑎𝑙 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦)

The expected frequency in 𝑟𝑜𝑤 𝑖 and 𝑐𝑜𝑙𝑢𝑚𝑛 𝑗 is therefore

𝑒𝑖𝑗 = 𝑡𝑜𝑡𝑎𝑙 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 ∙ 𝑃(𝑟𝑜𝑤 𝑖 ∩ 𝑐𝑜𝑙𝑢𝑚𝑛 𝑗)

    = 𝑡𝑜𝑡𝑎𝑙 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 ∙ (𝑟𝑜𝑤 𝑖 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 / 𝑡𝑜𝑡𝑎𝑙 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦) ∙ (𝑐𝑜𝑙𝑢𝑚𝑛 𝑗 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 / 𝑡𝑜𝑡𝑎𝑙 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦)

In other words, expected frequencies are obtained by computing

𝑒𝑖𝑗 = (𝑟𝑜𝑤 𝑖 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦) ∙ (𝑐𝑜𝑙𝑢𝑚𝑛 𝑗 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦) / (𝑡𝑜𝑡𝑎𝑙 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦)

For instance, among all respondents (𝑛 = 500) there are 68 observations in row 1 and 266
observations in column 1. Therefore:
𝑒11 = (68)(266)/500 = 36.18

Among all respondents (𝑛 = 500) there are 220 observations in row 2 and 266 observations in
column 1. Therefore:
𝑒21 = (220)(266)/500 = 117.04

Among all respondents (𝑛 = 500) there are 78 observations in row 3 and 266 observations in
column 1. Therefore:
𝑒31 = (78)(266)/500 = 41.50
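These repeated products can be computed all at once. Below is a minimal sketch assuming Python with NumPy (not part of the course software, so treat it as an optional illustration): the outer product of the row totals with the column totals, divided by the grand total, produces every expected frequency 𝑒𝑖𝑗 simultaneously.

```python
import numpy as np

# Observed frequencies from the voting-intention survey
# (rows: Postgraduate, College graduate, Some college, HS or less;
#  columns: Democrat, Republican)
observed = np.array([[46, 22],
                     [124, 96],
                     [40, 38],
                     [56, 78]])

row_totals = observed.sum(axis=1)   # 68, 220, 78, 134
col_totals = observed.sum(axis=0)   # 266, 234
n = observed.sum()                  # 500

# e_ij = (row i frequency)(column j frequency) / (total frequency)
expected = np.outer(row_totals, col_totals) / n
print(np.round(expected, 2))
```

The first column of the printed array reproduces 𝑒11 = 36.18, 𝑒21 = 117.04 and 𝑒31 = 41.50 computed above.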

So, here is what we have so far…

189
Table of expected frequencies
Highest level of education    Intends to vote Democrat    Intends to vote Republican    Total
Postgraduate                  36.18
College graduate              117.04
Some college
HS or less
Total

Although we could keep repeating the process of calculating the expected frequencies, it is
convenient to note that the total row and column frequencies must remain equal to those shown
in the observed frequency table. In other words, we can save some time and deduce the
remaining expected frequencies.

Fill in the missing entries of the contingency table shown below:

Table of expected frequencies


Highest level of education    Intends to vote Democrat    Intends to vote Republican    Total
Postgraduate                  36.18                                                     𝟔𝟖
College graduate              117.04                                                    𝟐𝟐𝟎
Some college                                                                            𝟕𝟖
HS or less                                                                              𝟏𝟑𝟒
Total                         𝟐𝟔𝟔                         𝟐𝟑𝟒                           𝟓𝟎𝟎

The Chi-Square Test will be valid since both required conditions (𝑛 ≥ 30 and 𝑒𝑖𝑗 ≥ 5 for all 𝑖, 𝑗)
are satisfied.

190
5. Compute the experimental value of 𝝌² (that is, the test statistic)

𝜒²𝑜𝑏𝑠 = ∑𝑖,𝑗 (𝑜𝑖𝑗 − 𝑒𝑖𝑗 )² / 𝑒𝑖𝑗

𝜒²𝑜𝑏𝑠 can only equal 0 if all observed and expected frequencies are equal, that is if 𝑜𝑖𝑗 − 𝑒𝑖𝑗 = 0
for all 𝑖, 𝑗. However, the larger the differences are between the 𝑜𝑖𝑗 and 𝑒𝑖𝑗 , the further our
observed data is from a scenario of “perfect independence”. In our example:

𝜒²𝑜𝑏𝑠 = (46 − 36.18)²/36.18 + (22 − 31.82)²/31.82 + ⋯ + (78 − 62.71)²/62.71 = 13.706
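The sum over all eight cells is easy to verify numerically; here is a short sketch (again assuming NumPy) that recomputes the statistic from the full observed and expected tables:

```python
import numpy as np

observed = np.array([[46, 22], [124, 96], [40, 38], [56, 78]])
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Sum (o_ij - e_ij)^2 / e_ij over every cell of the table
chi2_obs = ((observed - expected) ** 2 / expected).sum()
print(round(chi2_obs, 3))  # 13.706
```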

6. Use a 𝝌² probability table or Excel to find the 𝒑-value for the computed 𝝌²𝒐𝒃𝒔
The 𝑝-value will represent the probability of observing a value of the Chi-Square that is at least
as extreme as the one produced by our experimental data. In other words, for our example,
the 𝑝-value is the area under the density function of 𝜒₃² that is to the right of 𝜒²𝑜𝑏𝑠 = 13.706.

Figure 10.1.1 shows the graph of the 𝜒₃² distribution. The 𝑝-value is the area that is to the right
of 𝜒²𝑜𝑏𝑠 = 13.706, which seems extremely small.

Figure 10.1.1

Excel will confirm our impression, since 𝑝-value = 𝑃(𝜒₃² ≥ 13.706) = 0.0033 = 0.33%.
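The same area can be obtained outside of Excel; a sketch assuming SciPy (Excel’s CHISQ.DIST.RT plays the same role):

```python
from scipy.stats import chi2

# Area under the chi-square density with 3 degrees of freedom,
# to the right of the observed statistic 13.706
p_value = chi2.sf(13.706, df=3)
print(round(p_value, 4))  # 0.0033
```

For reference, scipy.stats.chi2_contingency(observed) carries out steps 4 through 6 (expected frequencies, statistic and 𝑝-value) in a single call.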

191
7. Conclude and interpret
Since the 𝑝-value = 0.33% ≤ 𝛼, we may reject 𝐻0 and conclude that voting intentions are
dependent on the voter’s highest level of education.

Rejecting 𝐻0 means there is a risk that a type 1 error was made: concluding that voting
intentions and the highest education level are dependent, when there is in fact no such
association.

Example 10.1.2
There is a belief (urban legend?) that the frequency of births varies with the phases of the
moon.

You have set out to test this belief and have collected data from natality reports published by a
sample of hospitals on a variety of dates. Data pertaining to the dates of birth, the phase of the
moon and number of births occurring on those dates were recorded.

The variables of interest were defined in the following way:


𝑋= “phase of the moon”, a categorical variable whose outcomes are either
o ‘FULL’, which includes the day the full moon takes place as well as the 2 days that
precede and follow it, or
o ‘NEW’, which includes the day the new moon takes place as well as the 2 days that
precede and follow it, or
o ‘OTHER’, which includes the remaining days of a lunar cycle.

𝑌= “number of daily births in the selected hospital”, whose outcome is categorized as either
o ‘HIGH’, if the number of births is above the third quartile of daily births in that hospital,
or
o ‘USUAL’, if the number of births is in the interquartile interval of daily births in that
hospital, or
o ‘LOW’, if the number of births is below the first quartile of daily births in that hospital.

192
After compiling the data, the contingency table revealed the following descriptive statistics:

OBSERVED FREQUENCY
FULL NEW OTHER Total
HIGH 16 13 47 76
USUAL 23 26 101 150
LOW 11 15 45 71
Total 50 54 193 297

Interpret the context that is described and use the experimental data to perform a test (𝛼 =
10%) to determine whether the phase of the moon and the frequency of births are dependent.

State the type of test that should be used, its hypotheses and significance level.

A Chi-Square independence test is required since both 𝑋 and 𝑌 are categorical variables.

𝐻0 :

𝐻1 :
𝛼 = 10%

Observe that statistical evidence is needed to conclude that variables are dependent. Unless
shown otherwise, we must not reject the assumption of independence between variables.

What distribution law is required to compute the test’s 𝒑-value?

(𝑚 − 1) × (𝑛 − 1) =

We will need the following distribution law to perform probability calculations: 𝜒₄²

193
Complete the table of expected frequencies:

EXPECTED FREQUENCY
FULL NEW OTHER Total
HIGH 12.7946 13.8182 76
USUAL 25.2525 150
LOW 71
Total 50 54 193 297

The Chi-Square Test will be valid since both required conditions (𝑛 ≥ 30 and 𝑒𝑖𝑗 ≥ 5 for all 𝑖, 𝑗)
are satisfied.

Calculate the test statistic based on the observed data:

𝜒²𝑜𝑏𝑠 = (16 − 12.7946)²/12.7946 + (13 − 13.8182)²/13.8182 + ⋯ + (45 − 46.1380)²/46.1380 = 1.7974
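As a check on the hand computation, the whole test can be run in one call; a sketch assuming SciPy, whose output should match the expected frequencies and statistic above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies for the moon-phase data
# (rows: HIGH, USUAL, LOW; columns: FULL, NEW, OTHER)
observed = np.array([[16, 13, 47],
                     [23, 26, 101],
                     [11, 15, 45]])

stat, p_value, dof, expected = chi2_contingency(observed)
print(round(stat, 4), dof)  # 1.7974 4
```

The returned 𝑝-value completes the next step of the exercise.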

Compute the 𝒑-value:

𝑝-value = 𝑃(𝜒₄² ≥ 1.7974)

State your decision in the context of the problem.


Since the 𝑝-value = ______ , the test is at level 𝛼 = 10%.

We (can | cannot) reject 𝐻0 : there is (sufficient | insufficient) evidence to conclude that the
phase of the moon and the number of births are dependent.

Our decision may lead to an error of type ______ , that is, concluding that the phase of the moon
and the number of births are (dependent | independent) when these variables are in fact
(dependent | independent).

194
Appendix – Probability Tables
Appendix A Standard Normal Distribution: table entries give 𝑷(𝒁 ≤ 𝒛)
0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
-3.4 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0002
-3.3 0.0005 0.0005 0.0005 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0003
-3.2 0.0007 0.0007 0.0006 0.0006 0.0006 0.0006 0.0006 0.0005 0.0005 0.0005
-3.1 0.0010 0.0009 0.0009 0.0009 0.0008 0.0008 0.0008 0.0008 0.0007 0.0007
-3 0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010
-2.9 0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014
-2.8 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019
-2.7 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026
-2.6 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036
-2.5 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048
-2.4 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064
-2.3 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084
-2.2 0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110
-2.1 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143
-2 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183
-1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
-1.8 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
-1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
-1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455
-1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559
-1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681
-1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
-1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
-1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
-1 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
-0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611
-0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
-0.7 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148
-0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451
-0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776
-0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121
-0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
-0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859
-0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
-0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641
0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
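Any entry of this table can be reproduced with statistical software; a sketch assuming SciPy (Excel’s NORM.S.DIST(z, TRUE) gives the same values):

```python
from scipy.stats import norm

# Each entry is P(Z <= z); e.g. row 1.9, column 0.06 corresponds to z = 1.96
p_pos = norm.cdf(1.96)    # table entry 0.9750
p_neg = norm.cdf(-2.50)   # table entry 0.0062
print(round(p_pos, 4), round(p_neg, 4))
```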

195
Appendix B Student 𝒕-distribution

Critical values 𝑡𝑑𝑓,𝛼 with upper-tail area 𝛼: 𝑃(𝑇 ≥ 𝑡𝑑𝑓,𝛼 ) = 𝛼
df 10% 5% 2.50% 1% 0.50%
5 1.4759 2.0150 2.5706 3.3649 4.0321
6 1.4398 1.9432 2.4469 3.1427 3.7074
7 1.4149 1.8946 2.3646 2.9980 3.4995
8 1.3968 1.8595 2.3060 2.8965 3.3554
9 1.3830 1.8331 2.2622 2.8214 3.2498
10 1.3722 1.8125 2.2281 2.7638 3.1693
11 1.3634 1.7959 2.2010 2.7181 3.1058
12 1.3562 1.7823 2.1788 2.6810 3.0545
13 1.3502 1.7709 2.1604 2.6503 3.0123
14 1.3450 1.7613 2.1448 2.6245 2.9768
15 1.3406 1.7531 2.1314 2.6025 2.9467
16 1.3368 1.7459 2.1199 2.5835 2.9208
17 1.3334 1.7396 2.1098 2.5669 2.8982
18 1.3304 1.7341 2.1009 2.5524 2.8784
19 1.3277 1.7291 2.0930 2.5395 2.8609
20 1.3253 1.7247 2.0860 2.5280 2.8453
21 1.3232 1.7207 2.0796 2.5176 2.8314
22 1.3212 1.7171 2.0739 2.5083 2.8188
23 1.3195 1.7139 2.0687 2.4999 2.8073
24 1.3178 1.7109 2.0639 2.4922 2.7969
25 1.3163 1.7081 2.0595 2.4851 2.7874
26 1.3150 1.7056 2.0555 2.4786 2.7787
27 1.3137 1.7033 2.0518 2.4727 2.7707
28 1.3125 1.7011 2.0484 2.4671 2.7633
29 1.3114 1.6991 2.0452 2.4620 2.7564
34 1.3070 1.6909 2.0322 2.4411 2.7284
39 1.3036 1.6849 2.0227 2.4258 2.7079
44 1.3011 1.6802 2.0154 2.4141 2.6923
49 1.2991 1.6766 2.0096 2.4049 2.6800
54 1.2974 1.6736 2.0049 2.3974 2.6700
59 1.2961 1.6711 2.0010 2.3912 2.6618
64 1.2949 1.6690 1.9977 2.3860 2.6549
69 1.2939 1.6672 1.9949 2.3816 2.6490
74 1.2931 1.6657 1.9925 2.3778 2.6439
79 1.2924 1.6644 1.9905 2.3745 2.6395
84 1.2917 1.6632 1.9886 2.3716 2.6356
89 1.2911 1.6622 1.9870 2.3690 2.6322
94 1.2906 1.6612 1.9855 2.3667 2.6291
99 1.2902 1.6604 1.9842 2.3646 2.6264
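The column headings are upper-tail areas, so each entry is the critical value with that area to its right; a sketch assuming SciPy (Excel’s T.INV is analogous):

```python
from scipy.stats import t

# t.ppf(1 - alpha, df) returns the value with upper-tail area alpha
t_df10 = t.ppf(1 - 0.05, df=10)    # table: df 10, 5% column -> 1.8125
t_df20 = t.ppf(1 - 0.025, df=20)   # table: df 20, 2.50% column -> 2.0860
print(round(t_df10, 4), round(t_df20, 4))
```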

196
Appendix C Chi-Square Distribution

Critical values 𝜒²𝑑𝑓,𝛼 with upper-tail area 𝛼: 𝑃(𝜒² ≥ 𝜒²𝑑𝑓,𝛼 ) = 𝛼
df 10% 5% 2.50% 1% 0.50%
5 9.2364 11.0705 12.8325 15.0863 16.7496
6 10.6446 12.5916 14.4494 16.8119 18.5476
7 12.0170 14.0671 16.0128 18.4753 20.2777
8 13.3616 15.5073 17.5345 20.0902 21.9550
9 14.6837 16.9190 19.0228 21.6660 23.5894
10 15.9872 18.3070 20.4832 23.2093 25.1882
11 17.2750 19.6751 21.9200 24.7250 26.7568
12 18.5493 21.0261 23.3367 26.2170 28.2995
13 19.8119 22.3620 24.7356 27.6882 29.8195
14 21.0641 23.6848 26.1189 29.1412 31.3193
15 22.3071 24.9958 27.4884 30.5779 32.8013
16 23.5418 26.2962 28.8454 31.9999 34.2672
17 24.7690 27.5871 30.1910 33.4087 35.7185
18 25.9894 28.8693 31.5264 34.8053 37.1565
19 27.2036 30.1435 32.8523 36.1909 38.5823
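As with the Student 𝑡 table, the column headings are upper-tail areas; a sketch assuming SciPy (Excel’s CHISQ.INV.RT is analogous):

```python
from scipy.stats import chi2

# chi2.ppf(1 - alpha, df) returns the value with upper-tail area alpha
c_df10 = chi2.ppf(1 - 0.05, df=10)   # table: df 10, 5% column -> 18.3070
c_df19 = chi2.ppf(1 - 0.01, df=19)   # table: df 19, 1% column -> 36.1909
print(round(c_df10, 4), round(c_df19, 4))
```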

197

You might also like