
Faculty of Applied Sciences

Department of Mathematics & Physics

Statistics 1A Lecture Notes

Author: Thomas Farrar


Contents
1 Introduction to Statistics and Data

2 Graphical Methods for Presenting Data

3 Descriptive Statistics

4 Basic Principles of Probability

5 Probability Distributions

6 Sampling Distributions

1 Introduction to Statistics and Data


Textbooks for the Course

• The following books may prove useful to you: Keller (2012), Navidi (2015), Devore & Farnum (2005), Miller & Miller (2014) (the last one is more advanced)

• Note that some data sets used in these notes are taken from these books.
Others are taken or adapted from R software (R Core Team (2019), Todorov
& Filzmoser (2009), Wright (2018))

1.1 Statistics and Data: Definitions

What are statistics and data?

• Statistics has been described as “the science of uncertainty” (Tabak 2011). It is a set of rules, concepts and methods used to organize numerical information and use it to make informed estimates, decisions (inferences) or predictions.

• Data are information that come from investigations (e.g. observation, exper-
iments, sampling).

• In order to do statistics, we need data

• Today data generally comes in an electronic format. Common formats for elec-
tronic data include spreadsheets (e.g., MS Excel), text files (comma-delimited
.CSV file, space or tab-delimited), and relational databases

• Designing and accessing electronic databases is a topic you will study more in
your Data Management module

1.2 Types of Data


Categories of Data

• There are many ways of categorising data, such as:

– Qualitative vs. Quantitative Data


– Structured vs. Unstructured Data
– Primary vs. Secondary Data
– The Four Scales of Measurement (Nominal, Ordinal, Interval, Ratio)

Qualitative vs. Quantitative Data

• Qualitative data describes the quality of a thing in words or pictures without using numerical measurements

• For example, qualitative data describing a class of students might include descriptive terms like ‘friendly’ and ‘hard-working’, a story about the class, or a photograph.

• Although qualitative data is useful, it is not conducive to statistical analysis

• There are branches of statistics that are specifically designed to analyse qual-
itative data, such as text mining (methods for statistical analysis of textual
data)

• Quantitative data describes quantities using numerical measurements

• For example, quantitative data describing a class of students might include number of students, ages, genders, and marks

• Statistics is used with quantitative data

• Note, however, that social scientists often develop quantitative measures for
things we think of as qualitative (e.g., ‘depression’), in order to use statistics
in their disciplines.

Structured vs. Unstructured Data

• Structured data is data that is organised into structures such as tables (e.g.,
an Excel spreadsheet) and is therefore ready for statistical analysis

– Structured data often comes with metadata, a separate document that provides information about the data

• Unstructured data is data that is not organised into a structure such as a spreadsheet

– This could be, e.g., a document, or a handwritten list that is scanned into the computer

• Unstructured data needs to be converted into structured data before most common statistical methods can be used on it

• This can be done manually or one can write a computer program to do it automatically

Primary vs. Secondary Data

• Primary data is data that one collects for one’s own research (whether by experiment, observation, survey, etc.)

• Secondary data is data that already existed but that one accesses in order
to use in one’s research

The Four Scales of Measurement

• Structured data can be further classified according to the four scales of measurement

• The scale of measurement of a data set is very important in deciding which statistical method(s) are or are not appropriate for analysing it

• The four scales of measurement are as follows:

1. Nominal
2. Ordinal
3. Interval
4. Ratio

Properties of Scales of Measurement

• There are four properties that will be useful in distinguishing the four scales
of measurement. These are identity, magnitude, interval spacing, and
absolute zero

– The Identity property means that each value on the scale has a unique
meaning; no two values are the same
– The Magnitude property means that any two values on the scale can
be compared in terms of magnitude; therefore all values on the scale can
be ordered from least to greatest
– The Interval Spacing property means that the magnitudes between
values along the scale are equally spaced; therefore it is meaningful to
subtract them
– The Absolute Zero property means that the scale has a meaningful zero
value and no values below this value; therefore it is meaningful to divide
values to obtain ratios

The Nominal Scale

• The nominal scale satisfies the identity property only

• Values can be coded (assigned arbitrary numerical values), but they cannot
be ordered from least to greatest since there is no notion of magnitude

• Variables measured on the nominal scale are usually categorical (describing one or more categories to which observations may belong)

• ‘Country’ and ‘Gender’ are examples of nominal variables

• Take an example of a cup of tea

Figure 1.1: Two Types of Tea, Green and Rooibos (Nominal Scale)

• The kind of tea (green tea, rooibos) would be measured on the nominal scale

The Ordinal Scale

• The ordinal scale satisfies the identity property and the magnitude property
only

• The data can be ordered from least to greatest, but cannot be subtracted

• Examples include university qualification (Diploma, Advanced Diploma, Postgraduate Diploma, Masters, Doctorate), or Likert scale responses on a questionnaire (Agree strongly, agree, disagree, disagree strongly)

• Returning to our tea example, the size of the tea (large, medium, or small)
could be measured on the ordinal scale

Figure 1.2: Two Sizes of Tea, Small and Large (Ordinal Scale)

The Interval Scale

• The interval scale satisfies the identity, magnitude, and interval spacing prop-
erties only

• This means the values can be ordered and can also be meaningfully subtracted, but cannot be meaningfully divided

• The classic example of an interval scale is temperature, measured in degrees Celsius

• If it was 30 degrees yesterday and it is 15 degrees today, we can subtract and say that it was 15 degrees warmer yesterday than today; however, it is not meaningful to divide the values and say that it was twice as warm yesterday as it is today

• This is because the scale has no ‘absolute zero’: zero degrees Celsius does not
mean there is no heat, and in fact the temperature can be negative

The Ratio Scale

• The ratio scale satisfies the identity, magnitude, interval spacing, and absolute
zero properties

• This means the values can be ordered from least to greatest, can be meaningfully subtracted, and can also be meaningfully divided

• Examples would include physical measurements such as the volume of a cup of tea, the mass of a substance, the profit of a company, etc.

What is a variable?

• A variable is a quantity that can vary, or take on different values. It is denoted by a letter to reflect the fact that its value may not be known, or to consider the effect of different possible values on other quantities.

• Fixed vs. Random Variables

– A fixed variable can be measured without error, or established with certainty.
– A random variable is a rule which assigns a value to each outcome of an experiment. It has error or uncertainty.
– Are the following fixed or random?
∗ Age of students when they enter CPUT
∗ Rand-to-dollar exchange rate
∗ pH level of a vat of wine
∗ Rhino population in RSA

Deterministic and Random Systems


It may be seen that most quantities are not by nature fixed or random. This
is something that we as mathematicians decide when we create models to solve
real-world problems.

• What is a deterministic system?

– Has only fixed quantities
– Outcome can be predicted with certainty if we have enough information about what determines it

• What is a random system?

– Has random quantities
– Outcome cannot be predicted with certainty

• Most systems are deterministic but too complex to model in this way.

– E.g. flipping a coin or drawing lottery balls: there are laws of motion that govern how these objects behave, but the systems are far too complex for a deterministic model to be of any practical use
– The same is true of stock market fluctuations. The stock prices are driven
up and down by human buying and selling behaviour, but we would need
billions of variables to model this deterministically.
• We treat these quantities as random for practicality.
The bottom line is, randomness may not actually exist. Even computers can
only generate pseudo-random numbers and not true random numbers. If you are
interested in reading more about pseudo-random number generation, see James
(1990). Randomness is a tool which allows us to make decisions and predictions in
the presence of uncertainty.
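The idea of pseudo-randomness can be illustrated with a linear congruential generator, a classic textbook construction (not covered in these notes; the multiplier and increment below are standard example constants). A minimal Python sketch:

```python
def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Linear congruential generator: a fixed, deterministic update
    rule that produces a sequence which merely *looks* random."""
    state = seed
    while True:
        state = (a * state + c) % m
        yield state / m  # scale to [0, 1)

gen1 = lcg(seed=42)
gen2 = lcg(seed=42)
first = [next(gen1) for _ in range(5)]
second = [next(gen2) for _ in range(5)]
print(first == second)  # True: same seed, same "random" sequence
```

Because the update rule is a fixed formula, the same seed always reproduces exactly the same sequence; the numbers only look random, which is the point made above about computers generating pseudo-random rather than truly random numbers.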

2 Graphical Methods for Presenting Data


Raw Data

• Raw data refers to data as it is initially captured (in the case of primary data)
or as it is initially downloaded or accessed (in the case of secondary data)
• Table 2.1 shows some raw data from a South African schools dataset, displayed in MS Excel:

Table 2.1: Raw Excel Data from South African Schools Dataset

2.1 Presenting Nominal Data


Frequency Distribution Table

• A frequency distribution table is a method for summarising and displaying categorical (nominal or ordinal) data

• This table provides a list of all categories represented in the data and gives
the frequencies (count of observations) for each category

• Other values that sometimes occur in a frequency table include:

– Relative frequency (percentage of total frequency accounted for by each category)
– Cumulative frequency (sum of frequencies accounted for by all categories so far)
– Cumulative relative frequency (percentage of total frequency accounted for by all categories so far)

• The order of categories in a frequency table can be alphabetical, or in descending or ascending order of frequencies, or some other ordering

• A frequency distribution table for one variable is sometimes called a one-way frequency table to distinguish it from a two-way frequency table (see below)

• Table 2.2 gives a frequency distribution table for the Province variable in the
South African schools dataset, ordered in descending order of frequency

Table 2.2: One-Way Frequency Table of Province (Schools Dataset)
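As an illustration (not part of the original notes), a one-way frequency table with relative and cumulative frequencies can be computed in a few lines of Python; the province codes below are a made-up mini-sample, not the actual schools dataset:

```python
from collections import Counter

def one_way_table(values):
    """Build a one-way frequency table as a list of rows
    (category, frequency, relative frequency, cumulative frequency),
    sorted in descending order of frequency."""
    counts = Counter(values)
    n = len(values)
    table = []
    cumulative = 0
    for category, freq in counts.most_common():
        cumulative += freq
        table.append((category, freq, freq / n, cumulative))
    return table

# Hypothetical mini-sample of a nominal variable
provinces = ["KZN", "EC", "KZN", "WC", "KZN", "EC"]
for row in one_way_table(provinces):
    print(row)
```

The same function covers all four columns discussed above: frequency, relative frequency, and (by accumulating down the sorted rows) cumulative frequency.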

Bar Graph

• A bar graph or column graph is a widely used graphical method that visually displays the same information that is found in a frequency table

• In this graph, each category is represented by a bar or column

• The frequency of each category is represented by the height or length of the corresponding bar

• As with frequency tables, the bars can be in alphabetical order, in decreasing or increasing order of frequency, or some other order

• The bars are usually vertical but can also be horizontal; this is a matter of preference

• In a vertical bar graph, the frequencies will be represented numerically along the vertical axis and the categories will be labelled beneath the bars on the horizontal axis

• Sometimes the frequency and/or relative frequency of each category is also printed above or inside the corresponding bar

• Figure 2.1 is an example of a bar graph representing the same data as the
frequency table above

Figure 2.1: Vertical Bar Graph of Province, Ordered by Decreasing Frequency

• Figure 2.2 gives the same bar graph with horizontal bars

Figure 2.2: Horizontal Bar Graph of Province, Ordered by Decreasing Frequency

Pie Graph

• The pie graph is another graphical method for representing categorical (nom-
inal or ordinal) data

• The main difference between the bar graph and the pie graph (besides using a different shape) is that the pie graph is designed to present relative frequencies whereas the bar graph is designed to present frequencies

• Since relative frequencies are also apparent from a bar graph, some statisticians
would argue that a bar graph is always preferable to a pie graph

• Figure 2.3 shows an example of a pie graph (with legend included)

Figure 2.3: Pie Graph of School Phase (Schools Dataset)

Donut Graph

• A donut graph is just an alternative form of a pie graph that leaves a hole in
the middle

• Figure 2.4 shows an example of a donut graph (with legend included)

Figure 2.4: Donut Graph of Boarding and Non-Boarding Schools (Schools Dataset)

Two-Way Frequency Table

• A two-way frequency table, also known as a contingency table or cross-tabulation, shows joint frequencies of two categorical variables

• One variable is represented by rows of the table and the other by columns

• Besides giving the raw frequencies in the individual cells, a two-way frequency
table may include the relative frequencies per row, per column, or per the
entire table (if it includes all of these relative frequencies, it could be rather
confusing)

• It may also include row and/or column totals

• In Table 2.3, we can see three values in each cell:

– The cell frequency
– The cell relative frequency (percent of overall total frequency represented by that cell)
– The column relative frequency (percent of column total frequency represented by that cell)

• In this case, the column relative frequency tells us what percent of schools in each province belong to each quintile. This is a convenient way to compare the distribution of quintiles between provinces. For instance, we can see that Free State has the highest percentage of Quintile 1 schools (49.26%) while Gauteng has the highest percentage of Quintile 5 schools (30.00%)

Table 2.3: Two-Way Frequency Table of Quintile vs. Province (Schools Dataset)
• Note: South African public schools are categorised into five groups called
‘quintiles’ based on the economic status of their surrounding communities.
Quintile 5 schools are in the wealthiest areas and Quintile 1 schools are in the
poorest areas. Quintile is an ordinal variable.
• We can see that this two-way frequency table also gives us the total row
and column frequencies and relative frequencies. In this case, the total row
frequencies tell us how many schools there are in each quintile across the whole
of South Africa. The total column frequencies tell us how many schools there
are in each province (the same information that we earlier represented in a bar
graph).
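A two-way frequency table can be sketched the same way; the (quintile, province) pairs below are hypothetical observations, not the real schools data:

```python
from collections import Counter

def two_way_table(pairs):
    """Cross-tabulate (row_category, column_category) pairs,
    returning cell counts plus row and column totals."""
    cells = Counter(pairs)
    row_totals = Counter(r for r, _ in pairs)
    col_totals = Counter(c for _, c in pairs)
    return cells, row_totals, col_totals

# Hypothetical (quintile, province) observations
obs = [("Q1", "FS"), ("Q1", "FS"), ("Q5", "GP"), ("Q1", "GP"), ("Q5", "GP")]
cells, row_totals, col_totals = two_way_table(obs)
print(cells[("Q1", "FS")])   # joint frequency of that cell
print(col_totals["GP"])      # column total
# column relative frequency: share of GP observations that are Q5
print(cells[("Q5", "GP")] / col_totals["GP"])
```

Dividing a cell count by its column total gives the column relative frequency discussed above; dividing by the row total or the grand total gives the other two kinds of relative frequency.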

Stacked and Grouped Bar Graphs

• Stacked and grouped bar graphs are modified bar graphs that can be used to
jointly present data from two categorical variables
• Stacked and grouped bar graphs relate to two-way frequency tables the way
that basic bar graphs relate to one-way frequency tables

• The bar graphs shown in Figures 2.5 and 2.6 represent two-way frequencies of
schools by province and by locale (urban vs. rural)

Figure 2.5: Stacked Bar Graph of Province vs. Locale (Schools Dataset)

• In a stacked bar graph, the height of each bar represents the overall frequency
of the categories of one categorical variable. Within each of these bars are
two or more stacked sub-bars representing the frequencies of categories of the
second categorical variable within each category of the first

Figure 2.6: Grouped Bar Graph of Province vs. Locale (Schools Dataset)

• In a grouped bar graph, instead of stacking the sub-bars on top of each other,
they are placed alongside each other

• The advantage of a grouped bar graph is that it is easier to visually compare the relative heights of the sub-bars

• The advantage of a stacked bar graph is that one gets a sense of the sub-
category frequencies while still having the overall frequencies for one of the
variables displayed

Other Applications of Bar and Pie Graphs

• Although bar and pie graphs are ideally designed for displaying frequencies and
relative frequencies (respectively) of categorical data, they are sometimes also
used to show the relationship between a numerical (interval or ratio) variable
and a categorical variable

• For example, take the bar graph shown in Figure 2.7 and the pie graph shown
in Figure 2.8, both of which are based on data shown in Table 2.4, taken from
Statistics South Africa (2019)

Province Electricity Volume (Gigawatt-Hours)


Western Cape 1822
Eastern Cape 695
Northern Cape 592
Free State 797
KwaZulu-Natal 3461
North West 2423
Gauteng 4208
Mpumalanga 2892
Limpopo 1252

Table 2.4: Volume of Electricity Delivered by Province, December 2018

Figure 2.7: Bar Graph of Volume of Electricity Delivered by Province, December
2018 (Gigawatt-Hours)

Figure 2.8: Pie Graph of Volume of Electricity Delivered by Province, December


2018 (Gigawatt-Hours)

2.2 Presenting Ordinal Data
Methods for Presenting Ordinal Data
• Ordinal data is still categorical like nominal data; the key difference is that the
values have not only the identity property but also the magnitude property
• Consequently, the tabular and graphical methods used to present ordinal data
are the same as those used to present nominal data: frequency tables, bar
graphs, pie graphs
• The only ‘catch’ is that, when presenting ordinal data, one should normally
order the table (rows and/or columns), bars, or pie slices, in increasing or de-
creasing order of the ordinal categories, rather than in increasing or decreasing
order of frequency
• For example, consider the following data from Statistics South Africa’s 2017
General Household Survey. Each household was asked, ‘In what condition are
the walls, roof, and floor of the main dwelling? Is it very weak, weak, needing
repairs, good or very good?’ In this case, ‘very weak, weak, needing repairs,
good, very good’ are increasing levels of condition on an ordinal scale. Two
frequency tables for roof condition are shown in Tables 2.5 and 2.6
• Which of these tables do you think is presented in a more logical order?

Roof Condition Frequency Relative Frequency


Good 10277 48.6%
Needs minor repairs 4796 22.8%
Very good 3142 14.9%
Weak 1934 9.2%
Very weak 944 4.5%

Table 2.5: Roof Condition Frequencies from 2017 GHS (Ordered by Frequency)

Roof Condition Frequency Relative Frequency


Very good 3142 14.9%
Good 10277 48.6%
Needs minor repairs 4796 22.8%
Weak 1934 9.2%
Very weak 944 4.5%

Table 2.6: Roof Condition Frequencies from 2017 GHS (Ordered by Levels of Ordinal
Variable)

• Clearly Table 2.6 is presented in a more logical order. Similarly, the cross-
tabulation in Table 2.7 between monthly rent cost and child hunger is presented
logically according to the ordering of the two ordinal variables involved

Child Hunger                        Monthly Rent
Occurrence     <R500   R501-R1000   R1001-R3000   R3001-R5000   R5001-R7000   >R7000
Never            497          402           345           374           225      229
Seldom            42           37            15             9             2        1
Sometimes         54           33            16             6             3        2
Often             32           10             4             3             4        4
Always             7            1             0             0             1        0

Table 2.7: Two-Way Frequencies of Child Hunger Occurrence vs. Monthly Rent
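The reordering of Table 2.5 into Table 2.6 amounts to sorting categories by their ordinal levels rather than by frequency. A Python sketch using the roof-condition frequencies from Table 2.5:

```python
# Frequencies typed in from Table 2.5 (ordered by frequency)
roof = {
    "Good": 10277,
    "Needs minor repairs": 4796,
    "Very good": 3142,
    "Weak": 1934,
    "Very weak": 944,
}

# The ordinal levels, from best to worst condition (as in Table 2.6)
levels = ["Very good", "Good", "Needs minor repairs", "Weak", "Very weak"]

# Re-order by ordinal level rather than by frequency
ordered = [(level, roof[level]) for level in levels]
for level, freq in ordered:
    print(f"{level}: {freq}")
```

The explicit `levels` list is what makes the variable ordinal in code: the ordering is part of the scale itself, not something recoverable from the frequencies.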

2.3 Methods for Presenting Interval and Ratio Data


Interval and Ratio Data

• In this section we will make no distinction between interval and ratio data; in
both cases we are dealing with numerical values

Class Intervals

• One way of analysing interval and ratio data is to break it down into ‘class
intervals’

• A class interval is a category representing a numerical range of values

• For example, in Table 2.7, the ‘Monthly Rent’ categories are intervals that
each represent a range of rent values

– A typical convention is that the class interval ‘x1 to x2’ contains observations that are strictly greater than (>) the lower limit x1 and that are less than or equal to (≤) the upper limit x2
– Alternatively, if it is known that all the numerical values to be classified are integers, the upper limit of one class interval can be one value below the lower limit of the next class interval, e.g., 10 to 19 and 20 to 29

• Consider the age data (in years) from a certain population of size 50 in Table
2.8

27 37 32 44 44 72 57 28 25 35
55 52 37 22 45 36 44 82 28 32
71 45 31 36 22 70 24 29 33 71
55 37 43 49 27 38 73 59 54 22
25 41 32 27 41 23 57 26 60 19

Table 2.8: Ages in Years of a Population

• The data in this raw form is not easy to make sense of; one way to present
it more conveniently would be to break it into class intervals by decade, as in
Table 2.9

Age Class Interval Frequency


≤ 19 1
20-29 14
30-39 12
40-49 9
50-59 7
60-69 1
70-79 5
80+ 1

Table 2.9: One-Way Frequencies of Age Classes of a Population
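As a check (not part of the original notes), the class-interval frequencies in Table 2.9 can be reproduced from the raw ages in Table 2.8 with a short Python sketch:

```python
ages = [27, 37, 32, 44, 44, 72, 57, 28, 25, 35,
        55, 52, 37, 22, 45, 36, 44, 82, 28, 32,
        71, 45, 31, 36, 22, 70, 24, 29, 33, 71,
        55, 37, 43, 49, 27, 38, 73, 59, 54, 22,
        25, 41, 32, 27, 41, 23, 57, 26, 60, 19]

# Decade class intervals, labelled as in Table 2.9
labels = ["<=19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80+"]

def decade_class(age):
    """Map an age to its decade class interval label."""
    if age <= 19:
        return "<=19"
    if age >= 80:
        return "80+"
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"

freq = {label: 0 for label in labels}
for age in ages:
    freq[decade_class(age)] += 1
print(freq)  # frequencies 1, 14, 12, 9, 7, 1, 5, 1, matching Table 2.9
```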


Stem-and-Leaf Diagram
• Another way to visualise numerical data that also uses the notion of a class
interval is a stem-and-leaf diagram
• In this type of diagram, data are grouped according to their leading digit
(the ‘stem’) while the other digit(s) (called ‘leaves’) are listed individually
• A stem-and-leaf diagram for the age data above is shown in Table 2.10
Stem Leaf
1 9
2 2 2 2 3 4 5 5 6 7 7 7 8 8 9
3 1 2 2 2 3 5 6 6 7 7 7 8
4 1 1 3 4 4 4 5 5 9
5 2 4 5 5 7 7 9
6 0
7 0 1 1 2 3
8 2

Table 2.10: Stem-and-Leaf Diagram for Ages of a Population
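A stem-and-leaf diagram like Table 2.10 can be generated mechanically: sort the data, then split each two-digit value into a leading digit (the stem) and a trailing digit (the leaf). A Python sketch:

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Group two-digit data by leading digit (stem); the remaining
    digit is the leaf. Leaves are sorted within each stem."""
    stems = defaultdict(list)
    for x in sorted(data):
        stems[x // 10].append(x % 10)
    return dict(stems)

ages = [27, 37, 32, 44, 44, 72, 57, 28, 25, 35,
        55, 52, 37, 22, 45, 36, 44, 82, 28, 32,
        71, 45, 31, 36, 22, 70, 24, 29, 33, 71,
        55, 37, 43, 49, 27, 38, 73, 59, 54, 22,
        25, 41, 32, 27, 41, 23, 57, 26, 60, 19]

for stem, leaves in stem_and_leaf(ages).items():
    print(stem, " ".join(str(leaf) for leaf in leaves))
```

The printed rows reproduce Table 2.10; because the data are sorted first, the leaves within each stem come out in increasing order.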

Histogram
• The histogram is a graph used to visualise the distribution of numerical data.
This is achieved by breaking the data into class intervals (called ‘bins’ in this
case) and then drawing a bar representing the frequency of each class interval
• The key visual difference between a histogram and a bar graph is that in a
histogram there is no horizontal space between the bars; this lack of space
represents how the intervals occupy a continuous numerical scale

• An important decision when constructing a histogram (or even when constructing a frequency table for a numerical variable using class intervals) is how many bins or class intervals to use

• Consider the data in Table 2.11:

2.97 6.80 7.73 8.61 9.60 10.28 11.12 12.31 13.47


4.00 6.85 7.87 8.67 9.76 10.30 11.21 12.62 13.60
5.20 6.94 7.93 8.69 9.82 10.35 11.29 12.69 13.96
5.56 7.15 8.00 8.81 9.83 10.36 11.43 12.71 14.24
5.94 7.16 8.26 9.07 9.83 10.40 11.62 12.91 14.35
5.98 7.23 8.29 9.27 9.84 10.49 11.70 12.92 15.12
6.35 7.39 8.37 9.37 9.96 10.50 11.70 13.11 15.24
6.62 7.62 8.47 9.43 10.04 10.64 12.16 13.38 16.06
6.72 7.62 8.56 9.52 10.21 10.95 12.19 13.42 16.90
6.78 7.69 8.58 9.58 10.28 11.09 12.28 13.43 18.26

Table 2.11: Energy Consumption from 90 Households in British Thermal Units


(BTU)

• A histogram constructed from this data would look very different depending on how many bins are used (see Figure 2.9)

Sturges’ Rule

• Sturges’ Rule is one widely used method for determining the number of class
intervals (bins) to use for a histogram

• The rule is to calculate the number of bins k using the following formula:

k = ⌈log₂(n) + 1⌉    where n is the number of observations

• The ⌈ and ⌉ symbols in the above formula mean that we round up what is in between them to the next highest integer (the number of bins must be an integer)

• In our energy consumption example we have 90 observations, so we would apply Sturges’ Rule as follows:

k = ⌈log₂(90) + 1⌉
  = ⌈6.49185 + 1⌉
  = 8

• A histogram of this data with 8 bins is shown in Figure 2.10
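Sturges’ Rule is a one-line computation. A Python sketch:

```python
import math

def sturges_bins(n):
    """Sturges' Rule: k = ceil(log2(n) + 1) bins for n observations."""
    return math.ceil(math.log2(n) + 1)

print(sturges_bins(90))  # 8 bins for the 90 energy-consumption observations
```

For the energy consumption data this gives ⌈6.49185 + 1⌉ = 8, matching the hand calculation above.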

Figure 2.9: Histograms of Energy Consumption Data with Different Numbers of
Bins

Alternatives to Sturges’ Rule

• Sturges’ Rule usually works reasonably well as long as n > 30

• However, the rule has been criticised for large numbers of observations (n > 200) as well (Hyndman 1995)

• Two alternative rules that can be used to construct histograms are Scott’s
Rule and Freedman and Diaconis’s Rule

• In both of these rules we do not calculate the number of bins but rather the
width of each bin

– Scott’s Rule calculates the width of each bin as

h = 3.5 · s · n^(−1/3)    where s is the sample standard deviation of the data

Figure 2.10: Histogram of Energy Consumption Data with 8 Bins as per Sturges’
Rule

– Freedman and Diaconis’s Rule calculates the width of each bin as

h = 2 · IQ · n^(−1/3)    where IQ is the sample interquartile range of the data

• We have not yet covered descriptive statistics such as the standard deviation
and interquartile range, so we will leave off these rules for now

• In any case, most statistical software is programmed to construct an appropriate histogram

• In practice, the number of bins is often chosen based on convenience, so that the interval endpoints are nice round numbers (e.g., 0, 10, 20, . . . rather than 0, 6.3333, 12.6667, . . .)
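Although the standard deviation and interquartile range are only defined in the next chapter, the two bin-width rules themselves are simple formulas. A sketch that takes s and IQ as given inputs:

```python
def scott_width(s, n):
    """Scott's Rule: bin width h = 3.5 * s * n^(-1/3)."""
    return 3.5 * s * n ** (-1 / 3)

def freedman_diaconis_width(iq, n):
    """Freedman-Diaconis Rule: bin width h = 2 * IQ * n^(-1/3)."""
    return 2 * iq * n ** (-1 / 3)

# With n = 8 observations, n^(-1/3) = 0.5, so the widths are easy to check
print(scott_width(2, 8))              # ~3.5
print(freedman_diaconis_width(4, 8))  # ~4.0
```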

Insights from a Histogram

• A histogram provides us with a snapshot of the whole distribution of a numerical variable.

– It gives us an idea of the range of values covered by the data
– It gives us an idea of the centre of the data (where is the ‘average’ of the data? Which sorts of values tend to occur most frequently in the data?)
– It gives us an idea of the symmetry or skewness of the data (see Figure 2.11)
∗ A distribution with a long right tail and most of the data concen-
trated on the left is called right-skewed or positively skewed
∗ A distribution with a long left tail and most of the data concentrated
on the right is called left-skewed or negatively skewed
∗ A distribution whose two tails appear as mirror images of each other,
without a pronounced skewness in either direction, is called sym-
metrical

– It also assists us in identifying unusual patterns such as a bimodal distribution, a distribution whose histogram has two ‘peaks’ (see example in Figure 2.12)

Figure 2.11: Positively Skewed Histogram (Left), Symmetrical Histogram (Middle)


and Negatively Skewed Histogram (Right)

Figure 2.12: Bimodal Histogram

Histogram: Example

• In a study of warp breakage during the weaving of fabric, 100 specimens of yarn were tested. The number of cycles of strain to breakage was determined for each specimen, resulting in the following data
• Construct a histogram using the class intervals (0, 100], (100, 200], (200, 300],
(300, 400], (400, 500], (500, 600], (600, 700], (700, 800], (800, 900]

86 175 157 282 38 211 497 246 393 198
146 176 220 224 337 180 182 185 396 264
251 76 42 149 65 93 423 188 203 105
653 264 321 180 151 315 185 568 829 203
98 15 180 325 341 353 229 55 239 124
249 364 198 250 40 571 400 55 236 137
400 195 38 196 40 124 338 61 286 135
292 262 20 90 135 279 290 244 194 350
131 88 61 229 597 81 398 20 277 193
169 264 121 166 246 186 71 284 143 188

Table 2.12: Cycles of Strain to Breakage for 100 Yarn Specimens

• Would you describe the distribution of this data as unimodal (having one
peak) or bimodal (having two peaks)? As symmetrical, negatively skewed, or
positively skewed?

– We first need to create a frequency table based on our class intervals (Table 2.13):

Class Interval Frequency


0 < x ≤ 100 21
100 < x ≤ 200 32
200 < x ≤ 300 26
300 < x ≤ 400 14
400 < x ≤ 500 2
500 < x ≤ 600 3
600 < x ≤ 700 1
700 < x ≤ 800 0
800 < x ≤ 900 1

Table 2.13: Class Interval Frequencies of Cycles of Strain to Breakage for 100 Yarn
Specimens

– We then construct a histogram based on this table (Figure 2.13)
– The histogram appears to be unimodal (having a single peak in the interval from 100 to 200) and positively skewed or right-skewed
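The frequencies in Table 2.13 can be verified mechanically. A Python sketch applying the (lower, upper] convention described earlier to the yarn data:

```python
yarn = [ 86, 175, 157, 282,  38, 211, 497, 246, 393, 198,
        146, 176, 220, 224, 337, 180, 182, 185, 396, 264,
        251,  76,  42, 149,  65,  93, 423, 188, 203, 105,
        653, 264, 321, 180, 151, 315, 185, 568, 829, 203,
         98,  15, 180, 325, 341, 353, 229,  55, 239, 124,
        249, 364, 198, 250,  40, 571, 400,  55, 236, 137,
        400, 195,  38, 196,  40, 124, 338,  61, 286, 135,
        292, 262,  20,  90, 135, 279, 290, 244, 194, 350,
        131,  88,  61, 229, 597,  81, 398,  20, 277, 193,
        169, 264, 121, 166, 246, 186,  71, 284, 143, 188]

# Class intervals (0,100], (100,200], ..., (800,900]: lower limit
# excluded, upper limit included
edges = list(range(0, 1000, 100))
freq = [0] * (len(edges) - 1)
for x in yarn:
    for i in range(len(freq)):
        if edges[i] < x <= edges[i + 1]:
            freq[i] += 1
            break
print(freq)  # [21, 32, 26, 14, 2, 3, 1, 0, 1], as in Table 2.13
```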

Box-and-Whisker Plot

• A box-and-whisker plot (sometimes just called a box plot) is another graphical method for representing the distribution of numerical data

Figure 2.13: Histogram of Cycles of Strain to Breakage for 100 Yarn Specimens

• Constructing a box-and-whisker plot requires understanding of some descriptive statistics that we have not covered yet; thus we will leave it for the next chapter

Line Graph

• A line graph is a method for displaying numerical data that are organised
sequentially

• A common example of this is a time series, which consists of data collected on a particular variable at a series of time points

• Consider the number of rhino poached per year in South Africa from 2003 to
2019

• This data can be conveniently represented in a line graph (Figure 2.14) which
provides a clear visualisation of the rapid increase in rhino poaching over the
years 2007 to 2014 followed by a gradual decrease thereafter

• The method to draw a line graph is as follows:

(1) Put the time index variable on the horizontal (x) axis
(2) Put the numerical data variable on the vertical (y) axis
(3) Plot a point on the graph for each observation in the data
(4) Join the points with line segments, moving sequentially through the series

Year No. of Rhino
Poached
2003 22
2004 10
2005 13
2006 24
2007 13
2008 83
2009 122
2010 333
2011 448
2012 668
2013 1004
2014 1215
2015 1175
2016 1054
2017 1028
2018 769
2019 594

Figure 2.14: Data Series and Line Graph of Rhino Poached per year in South Africa

Line Graph: Example

• Table 2.14 gives the year-end unemployment rate for South Africa for each
year from 1994 to 2019. Represent this data using a line graph.

• Notice how the two graphs in Figure 2.15 look different just by changing the
scale on the vertical axis. In general, we should start the axis from 0 for a
variable that is measured on the ratio scale

Year Unemployment Rate (%)
1994 20
1995 16.9
1996 19.3
1997 21
1998 25.2
1999 23.3
2000 23.3
2001 26.2
2002 26.6
2003 28.4
2004 23
2005 23.5
2006 22.1
2007 21
2008 21.5
2009 24.1
2010 23.9
2011 23.8
2012 24.5
2013 24.1
2014 24.3
2015 24.5
2016 26.5
2017 26.7
2018 27.1
2019 29.1

Table 2.14: South African Annual Unemployment Rate, 1994-2019

Figure 2.15: Line Graph of South African Annual Unemployment Rate, 1994-2019

3 Descriptive Statistics
What is a Descriptive Statistic?

• A descriptive statistic is a quantity calculated from a set of numerical data that describes some feature or characteristic of the data

• If the data comes from a sample (a selection of units from a larger population),
descriptive statistics are often used to estimate the characteristics of the whole
population

• However, the descriptive statistic value does not tell us how accurate an estimator it is; determining the accuracy and precision of estimators is something we will look at in Statistics 1B

• For now, we will not worry about whether our data comes from a population
or from a sample from a larger population

• Some of the features of a data set that can be described using descriptive
statistics are central location (also called central tendency), dispersion
(also called variability or spread), relative standing, skewness, and tail
extremity (kurtosis)

• In this module we will look at all except for kurtosis

3.1 Measures of Central Location


The ‘Three M’s’

• The three most well-known descriptive statistics that are used to measure
central location are ones you have probably encountered in secondary school.
They are:

– The Mean
– The Median
– The Mode

• In this section we will also cover another measure of central location called the
Trimmed Mean, and we will discuss measures of central location for grouped
data (numerical data that has been put into class intervals)

Mean

• The mean is simply the arithmetic average of a set of data, calculated by adding up all the data values and dividing by the number of observations

• The mean of a sample is usually denoted by x̄ (pronounced as “x bar”) and the mean of a population (or a probability distribution) is usually denoted by the Greek letter µ (pronounced as ‘mew’)

• The formula for calculating the population mean is as follows:

µ = (1/N)(x1 + x2 + · · · + xN)

• Here, N denotes the ‘population size’, the number of observations of the variable in the whole population

• The formula for calculating the sample mean is as follows:

x̄ = (1/n)(x1 + x2 + · · · + xn)

• Here, n denotes the ‘sample size’, the number of observations of the variable in the sample

• (Notice that the above two formulas are exactly the same except for the notation)

• These formulas can also be expressed using Sigma Notation as follows:

µ = (1/N) Σ_{i=1}^{N} xi

x̄ = (1/n) Σ_{i=1}^{n} xi

• Sigma notation uses the capital Greek letter Σ (Sigma) to denote a sum. The
indices over which the variable is to be summed are indicated below and above
the Σ
Mean: Example
• In Table 3.1 we have the fuel economy (l/100 km) for 32 high-performance
cars
13.5 15.1 12.4 16.3 19.2 13.1 14.6 17.9
13.5 15.6 14.8 18.6 8.7 18.2 10.3 14.3
12.4 19.8 15.9 27.2 9.3 18.6 10.9 18.8
13.2 11.6 17.2 27.2 8.3 21.2 9.3 13.2

Table 3.1: Fuel Economy (l/100 km) for 32 High-Performance Cars

• The sample mean would be:

x̄ = (1/n) Σ_{i=1}^{n} xi = (1/32)(13.5 + 15.1 + 12.4 + · · · + 13.2) = 15.31875
• This gives us a sense of the centre of this data; in statistics we sometimes refer
to the centre as the ‘location’
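The calculation above can be reproduced in Python; this is just a sketch, using the fuel economy values of Table 3.1:

```python
# Fuel economy data from Table 3.1 (litres per 100 km)
fuel = [13.5, 15.1, 12.4, 16.3, 19.2, 13.1, 14.6, 17.9,
        13.5, 15.6, 14.8, 18.6,  8.7, 18.2, 10.3, 14.3,
        12.4, 19.8, 15.9, 27.2,  9.3, 18.6, 10.9, 18.8,
        13.2, 11.6, 17.2, 27.2,  8.3, 21.2,  9.3, 13.2]

n = len(fuel)               # sample size n = 32
x_bar = sum(fuel) / n       # x-bar = (1/n) * (sum of the x_i)

print(round(x_bar, 5))      # 15.31875
```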

Order Statistics

• Before introducing the median we first need to understand what an order statistic is

• If we rank the data from least to greatest, the ith order statistic refers to the
ith value in order

• The ith order statistic is denoted by x(i)

• Thus, for example, x(1) is the minimum of the sample data and x(n) is the
maximum

Order Statistics: Example

• Table 3.2 gives the final exam marks (%) for a class of 22 statistics students,
along with the sex of each student

Sex Exam Mark Sex Exam Mark


Male 71 Female 67
Male 90 Female 47
Male 63 Female 64
Female 49 Male 97
Male 75 Male 73
Male 59 Male 68
Male 74 Female 77
Female 81 Female 65
Male 90 Male 33
Male 59 Male 55
Male 73 Male 48

Table 3.2: Sex and Final Exam Mark (%) for 22 Statistics Students

• What is the third order statistic?

• We simply need to sort the data from least to greatest and then take the third
value:

33, 47, 48, 49, 55, 59, 59, 63, 64, 65, 67
68, 71, 73, 73, 74, 75, 77, 81, 90, 90, 97

• x(3) = 48

Median

• The median can be described as a measure of central location and also as a measure of relative standing (which will be discussed in a later subsection)

• The median can be defined as the middle value in the data by order; as the value for which half of the data lie below it and half of the data lie above it

• The median is sometimes denoted by x̃

• To obtain the median, we first must order the data from least to greatest

– If the number of observations n is odd, there will be a value within the data that is exactly in the middle: (n − 1)/2 values lie below it and (n − 1)/2 values lie above it
– Thus, the median when n is odd is the ((n − 1)/2 + 1)th = ((n + 1)/2)th greatest value in the data, sometimes referred to as the ((n + 1)/2)th order statistic, which can be written as x((n+1)/2)
– If the number of observations n is even, there cannot be a value within the data itself for which half of the observations are below and half are above
– Instead, we take the (n/2)th order statistic and the (n/2 + 1)th order statistic; the median lies between these two values
– By convention, in order to give one median value rather than two values, we take the average of these two values

• In summary, the formula for the median can be written as follows:

x̃ = x((n+1)/2)                   if n is odd
x̃ = (1/2)(x(n/2) + x(n/2+1))     if n is even

Median: Example

• Consider again our sample of n = 32 cars’ fuel economy

• Since n is even, the median is

x̃ = (1/2)(x(32/2) + x(32/2+1)) = (1/2)(x(16) + x(17))

• We order the data to get the 16th and 17th order statistics:

8.3, 8.7, 9.3, 9.3, 10.3, 10.9, 11.6, 12.4,


12.4, 13.1, 13.2, 13.2, 13.5, 13.5, 14.3, 14.6,
14.8, 15.1, 15.6, 15.9, 16.3, 17.2, 17.9, 18.2,
18.6, 18.6, 18.8, 19.2, 19.8, 21.2, 27.2, 27.2

• x(16) = 14.6 and x(17) = 14.8, thus x̃ = (1/2)(14.6 + 14.8) = 14.7

• Consider now a sample of heights (in metres) of n = 13 Loblolly pine trees (Table 3.3):

18.8 18.5 18.0 17.8 18.1
17.3 19.2 17.2 19.3 18.2
18.3 18.6 19.5

Table 3.3: Heights (m) for 13 Loblolly Pine Trees

• In this case, n is odd and so we proceed as follows:

x̃ = x((n+1)/2) = x((13+1)/2) = x(7)

• In order to determine the 7th order statistic x(7) we simply need to sort the data from least to greatest and then take the 7th value:

17.2, 17.3, 17.8, 18.0, 18.1, 18.2, 18.3, 18.5, 18.6, 18.8, 19.2, 19.3, 19.5

• Thus x̃ = 18.3 in this case
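Both conventions (odd and even n) can be captured in a short Python function; a sketch using the data of Tables 3.1 and 3.3:

```python
def median(data):
    """Median as defined in the notes: the ((n+1)/2)th order statistic when n
    is odd, and the average of the (n/2)th and (n/2+1)th when n is even."""
    x = sorted(data)                      # order statistics x_(1), ..., x_(n)
    n = len(x)
    if n % 2 == 1:
        return x[(n + 1) // 2 - 1]        # subtract 1 for 0-based indexing
    return 0.5 * (x[n // 2 - 1] + x[n // 2])

fuel = [13.5, 15.1, 12.4, 16.3, 19.2, 13.1, 14.6, 17.9, 13.5, 15.6, 14.8,
        18.6, 8.7, 18.2, 10.3, 14.3, 12.4, 19.8, 15.9, 27.2, 9.3, 18.6,
        10.9, 18.8, 13.2, 11.6, 17.2, 27.2, 8.3, 21.2, 9.3, 13.2]
heights = [18.8, 18.5, 18.0, 17.8, 18.1, 17.3, 19.2, 17.2, 19.3, 18.2,
           18.3, 18.6, 19.5]

print(round(median(fuel), 4))     # 14.7 (even n: average of 14.6 and 14.8)
print(round(median(heights), 4))  # 18.3 (odd n: the 7th order statistic)
```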

Mode

• The mode is simply the most frequently occurring value in the data

• We could get the mode by creating a one-way frequency table and looking for
the row of the table with the highest frequency

• A stem-and-leaf diagram is also helpful for determining the mode by hand

• There is no standard mathematical symbol for the mode

• If two or more values are tied for the highest frequency, we do not average them; instead we say that there are multiple modes

– E.g., in our sample of n = 22 students’ exam marks (Table 3.2), the modes are 59, 73, and 90 (all of which occur twice)

• This is analogous to concluding from a histogram that a distribution is bimodal rather than unimodal (though of course histograms deal with class intervals rather than individual values)

Mode: Example

• What is the mode of the fuel economy in the sample of n = 32 vehicles in Table 3.1?

• In fact there are six modes: 9.3, 12.4, 13.2, 13.5, 18.6, and 27.2 (all of these values occur twice, and no value occurs more than twice)
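Finding every value tied for the highest frequency amounts to building a one-way frequency table; a sketch in Python using the Table 3.1 data:

```python
from collections import Counter

def modes(data):
    """Return all values sharing the highest frequency, in increasing order."""
    counts = Counter(data)               # one-way frequency table
    top = max(counts.values())           # the highest frequency
    return sorted(v for v, c in counts.items() if c == top)

fuel = [13.5, 15.1, 12.4, 16.3, 19.2, 13.1, 14.6, 17.9, 13.5, 15.6, 14.8,
        18.6, 8.7, 18.2, 10.3, 14.3, 12.4, 19.8, 15.9, 27.2, 9.3, 18.6,
        10.9, 18.8, 13.2, 11.6, 17.2, 27.2, 8.3, 21.2, 9.3, 13.2]

print(modes(fuel))  # [9.3, 12.4, 13.2, 13.5, 18.6, 27.2] -- six modes
```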

Trimmed Mean

• A trimmed mean, denoted x̄tr, is a descriptive statistic measuring central location that is a sort of compromise between the mean and the median

• To obtain the trimmed mean, we specify a proportion r of smallest and largest observations to be removed, 0 ≤ r < 0.5

• We remove the nr smallest observations and the nr greatest observations and then calculate the mean of the remaining observations

– Notice that if r = 0, the trimmed mean reduces to the mean


– Notice further that if r is so large that all but one or two observations
are removed, the mean of the remaining one or two observations is by
definition the median; thus in this case the trimmed mean reduces to the
median

Trimmed Mean: Example

• Consider again our sample of 32 cars’ fuel economy in Table 3.1

• Let us find the trimmed mean in the cases r = 0 and r = 0.25

• We first need to sort the data (already done under Median: Example above)

– If r = 0, then we remove no observations, so the trimmed mean is simply the mean, previously calculated to be 15.31875
– If r = 0.25, then we remove the lowest nr = 32(0.25) = 8 observations (8.3, 8.7, 9.3, 9.3, 10.3, 10.9, 11.6, 12.4) and the highest 8 observations (18.6, 18.6, 18.8, 19.2, 19.8, 21.2, 27.2, 27.2) and are left with 16 remaining observations
– Thus x̄tr = (1/16)(12.4 + 13.1 + 13.2 + · · · + 17.9 + 18.2) = 14.925
– Notice that if we removed all but two observations (r = 0.46875), the trimmed mean would be equivalent to the median
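The trimming step can be sketched in Python; this version assumes nr works out to a whole number, as in the example above:

```python
def trimmed_mean(data, r):
    """Mean after dropping the n*r smallest and n*r largest observations.
    Assumes n*r is a whole number (0 <= r < 0.5)."""
    x = sorted(data)
    k = round(len(x) * r)          # number trimmed from each end
    kept = x[k:len(x) - k]         # k == 0 keeps the whole sample
    return sum(kept) / len(kept)

fuel = [13.5, 15.1, 12.4, 16.3, 19.2, 13.1, 14.6, 17.9, 13.5, 15.6, 14.8,
        18.6, 8.7, 18.2, 10.3, 14.3, 12.4, 19.8, 15.9, 27.2, 9.3, 18.6,
        10.9, 18.8, 13.2, 11.6, 17.2, 27.2, 8.3, 21.2, 9.3, 13.2]

print(round(trimmed_mean(fuel, 0), 5))     # 15.31875 (r = 0 gives the mean)
print(round(trimmed_mean(fuel, 0.25), 3))  # 14.925 (8 trimmed from each end)
```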

Approximate Mean, Median, and Mode for Grouped Data

• Suppose you have numerical data that has been arranged into class intervals,
or perhaps into a histogram, and you do not have the original numerical values

• It is not possible in this case to calculate the exact mean or median of the
data, but there are formulas we can use to approximate this

• To approximate the mean:

• Suppose that the data have been grouped into c class intervals and that each
class interval has a lower limit and an upper limit (so that its midpoint can
be calculated)

• We will refer to the midpoints of the class intervals as mj, j = 1, 2, . . . , c (mj can be calculated as (1/2)(ℓj + uj), where ℓj is the lower limit and uj is the upper limit)

• We will refer to the frequencies of the class intervals as fj, j = 1, 2, . . . , c

• In this case, the mean can be approximated by the grouped mean,

x̄gr = (1/n) Σ_{j=1}^{c} fj mj

• What we are doing in this formula is measuring each class interval using
its midpoint, and weighting each class interval using its frequency; thus the
grouped mean is an example of a weighted average

• To approximate the median, we first need to identify the class interval in which the median falls. The easiest way to do this is by creating a frequency table and including the cumulative relative frequencies. The first class interval for which the cumulative relative frequency is ≥ 50% contains the median; we call the index of this interval m

• Once we have identified the median class interval m, we can approximate the median using the following formula:

x̃gr = ℓm + (um − ℓm) · (n/2 − Σ_{j=1}^{m−1} fj) / fm

• What we are essentially doing in this formula is taking the lower limit of the class interval containing the median (ℓm) and adding a certain proportion of the width of this class interval (the width being um − ℓm)

• The proportion that we are adding is what proportion of the fm observations in this class interval occur before we reach the middle observation in the data, that is, the (n/2)th order statistic. Hence the denominator of the proportion is fm and the numerator is n/2 minus the cumulative frequency prior to the mth class interval, which is Σ_{j=1}^{m−1} fj

– Notice that if the cumulative relative frequency of the mth class interval is exactly 50%, then Σ_{j=1}^{m} fj = n/2, which means that

(n/2 − Σ_{j=1}^{m−1} fj) / fm = (Σ_{j=1}^{m} fj − Σ_{j=1}^{m−1} fj) / fm = fm/fm = 1

– Thus, in this case, x̃gr reduces to ℓm + (um − ℓm) = um, the upper limit of the interval (since the median occurs precisely at the cutpoint between intervals)

• To approximate the mode, we first identify the modal interval m, i.e. the class interval with the highest frequency fj

• We then calculate the grouped mode as follows:

Modegr = ℓm + (um − ℓm) · (fm − fm−1) / [(fm − fm−1) + (fm − fm+1)]

• What we are doing in this formula is shifting the location of the mode within the modal interval towards the lower or upper limit of the interval depending on the frequencies of the intervals just below and just above that interval

– Notice that, if the frequencies of the interval below and the interval above are the same (fm−1 = fm+1), this reduces to Modegr = ℓm + (1/2)(um − ℓm), which is the midpoint of the modal interval

Approximate Mean, Median, and Mode for Grouped Data: Example

• Consider the data in Table 3.4 which shows the NSC examination pass rates
per school grouped into class intervals

Pass Rate (%)  Frequency  Relative Frequency (%)  Cumulative Frequency  Cumulative Relative Frequency (%)
[0, 10]        65         0.96                    65                    0.96
(10, 20]       143        2.10                    208                   3.06
(20, 30]       260        3.83                    468                   6.89
(30, 40]       413        6.08                    881                   12.97
(40, 50]       569        8.38                    1450                  21.34
(50, 60]       659        9.70                    2109                  31.04
(60, 70]       783        11.52                   2892                  42.57
(70, 80]       1097       16.15                   3989                  58.71
(80, 90]       1161       17.09                   5150                  75.80
(90, 100]      1644       24.20                   6794                  100.00

Table 3.4: Frequency Table of 2016 Matric Pass Rates for 6794 South African Schools

• Grouped Mean:

x̄gr = (1/n) Σ_{j=1}^{c} fj mj
    = (1/6794)[65(5) + 143(15) + 260(25) + 413(35) + 569(45) + 659(55) + 783(65) + 1097(75) + 1161(85) + 1644(95)]
    = 473310/6794
    = 69.6659

• Thus we estimate the mean 2016 matric pass rate among South African schools
to be about 69.7%

• Grouped Median: The 8th interval, i.e. (70, 80], is the first one for which the cumulative relative frequency exceeds 50%; thus m = 8

x̃gr = ℓ8 + (u8 − ℓ8) · (n/2 − Σ_{j=1}^{7} fj) / f8
    = 70 + (80 − 70) · (6794/2 − 2892)/1097
    = 74.6035

• Thus we estimate the median 2016 matric pass rate among South African
schools to be about 74.6%

• Grouped Mode: The modal interval is the 10th interval, (90, 100], since it has the highest frequency (1644); thus m = 10

• Since in this case there is no interval above the 10th, we treat the 11th interval as having a frequency of 0

Modegr = ℓ10 + (u10 − ℓ10) · (f10 − f9) / [(f10 − f9) + (f10 − f11)]
       = 90 + (100 − 90) · (1644 − 1161) / [(1644 − 1161) + (1644 − 0)]
       = 92.2708

• Thus we would estimate the most common matric pass rate among South
African schools in 2016 to be about 92.3%

• For interest’s sake, the actual mean matric pass rate among schools in this
data set (based on exact numerical values, not class intervals) was 70.41%, the
actual median pass rate was 75%, and the actual mode pass rate was 100%
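The three grouped calculations can be reproduced in Python; a sketch using the limits and frequencies of Table 3.4:

```python
# Class intervals (l_j, u_j] and frequencies f_j from Table 3.4
limits = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50),
          (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]
freqs = [65, 143, 260, 413, 569, 659, 783, 1097, 1161, 1644]
n = sum(freqs)                                    # 6794 schools

# Grouped mean: frequency-weighted average of the interval midpoints
mids = [(lo + up) / 2 for lo, up in limits]
mean_gr = sum(f * m for f, m in zip(freqs, mids)) / n

# Grouped median: walk the cumulative frequencies until n/2 is reached
cum = 0
for (lo, up), f in zip(limits, freqs):
    if cum + f >= n / 2:
        median_gr = lo + (up - lo) * (n / 2 - cum) / f
        break
    cum += f

# Grouped mode: shift within the modal interval toward the busier neighbour
i = freqs.index(max(freqs))                       # index of the modal interval
f_below = freqs[i - 1] if i > 0 else 0
f_above = freqs[i + 1] if i < len(freqs) - 1 else 0
lo, up = limits[i]
d1, d2 = freqs[i] - f_below, freqs[i] - f_above
mode_gr = lo + (up - lo) * d1 / (d1 + d2)

print(round(mean_gr, 4), round(median_gr, 4), round(mode_gr, 4))
# 69.6659 74.6035 92.2708
```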

3.2 Measures of Dispersion

• ‘Dispersion’, also referred to as ‘spread’ or ‘variability’ (and in certain contexts as ‘scale’ or ‘volatility’), measures the degree to which the data are spread out

• The dispersion of the data would be visualised graphically as the width of the
histogram; however, we also need numerical ways of quantifying it

• In this subsection we will consider the following measures of dispersion:

– Range
– Mean Deviation from the Mean
– Median Deviation from the Median
– Variance and Standard Deviation

Range

• The Range is a statistical measure of dispersion that is easy to calculate: we take the largest value in the data (the maximum) and subtract the smallest value (the minimum):

R = max{x1, x2, . . . , xn} − min{x1, x2, . . . , xn} = x(n) − x(1)   (using our notation for order statistics)

• The Range quantifies for us the distance along the xi axis that is covered by
our data, similar to the distance along the horizontal axis that is covered by
a histogram

• The main problem with this descriptive statistic is that it is only describing
two values in the data set and does not tell us how widely dispersed (between
those two values) the rest of the data are

• It is very sensitive to outliers, which are extreme values

Range: Example

• Table 3.5 gives the hourly mean concentrations of nitrogen oxides (NOx) in
ambient air (parts per billion, ppb) next to a busy motorway, from 15 randomly
selected days

29.40 15.65 24.60 83.50 23.55


73.05 5.00 97.55 138.65 16.50
40.15 41.25 32.55 35.15 28.00

Table 3.5: Hourly Mean Concentrations (ppb) of NOx in Ambient Air

• We first need to sort the data in increasing order:

5.00, 15.65, 16.50, 23.55, 24.60,


28.00, 29.40, 32.55, 35.15, 40.15,
41.25, 73.05, 83.50, 97.55, 138.65

• Range is calculated as follows:

R = x(15) − x(1) = 138.65 − 5.00 = 133.65

A Note on Robust Statistics

• Statistics that are stable and do not radically change in the presence of outliers
(extreme values) and other anomalies are said to be robust; there is a whole
branch of statistical sciences that deals with robust statistics

• Statisticians prefer robust statistics because they better summarise the whole
data rather than just a few unusual values

• The range is an example of a statistic that is not robust. For instance, if the
value of 138.65 were removed from the data, the range would drop by 133.65
to 92.55. Thus a single extreme value has a huge effect on the range

• Returning to measures of central location, which statistic do you think is the least robust between mean, median, mode, and trimmed mean?

Mean Deviation from the Mean

• The Mean Deviation from the Mean (sometimes called Mean Absolute
Deviation, MAD) is the average absolute difference between the mean and
the other values in the data

• The sample mean absolute deviation formula is as follows:

MAD = (1/n) Σ_{i=1}^{n} |xi − x̄|

• Although the mean absolute deviation has a nice intuitive meaning, it is not
used very much by statisticians in practice because it does not have nice math-
ematical properties

• The standard deviation (see below) is much more widely used

Mean Deviation from the Mean: Example

• Consider again the sample of NOx concentrations in Table 3.5

• The mean deviation from the mean is calculated as follows:

x̄ = (1/n) Σ_{i=1}^{n} xi = (1/15)(29.40 + 15.65 + 24.60 + · · · + 28.00) = 45.63667

MAD = (1/n) Σ_{i=1}^{n} |xi − x̄|
    = (1/15)(|29.40 − 45.63667| + |15.65 − 45.63667| + · · · + |28.00 − 45.63667|)
    = 28.0271

• (Reminder: do not round off your first calculation, in this case the mean, if you are going to use it for further calculations; otherwise you will introduce rounding errors. To be safe, allow two extra significant digits in your first calculation beyond the accuracy needed in your final answer.)
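The two steps (full-precision mean, then average absolute deviation) can be sketched in Python using the Table 3.5 data:

```python
# NOx concentrations (ppb) from Table 3.5
nox = [29.40, 15.65, 24.60, 83.50, 23.55,
       73.05,  5.00, 97.55, 138.65, 16.50,
       40.15, 41.25, 32.55, 35.15, 28.00]

n = len(nox)
x_bar = sum(nox) / n                          # keep full precision here
mad = sum(abs(x - x_bar) for x in nox) / n    # mean absolute deviation

print(round(x_bar, 5), round(mad, 4))  # 45.63667 28.0271
```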
Median Deviation from the Median

• Another measure of dispersion that is more robust than the mean deviation from the mean is the median absolute deviation from the median

• It requires us to compute the absolute deviations from the median,

ADMi = |xi − x̃|, i = 1, 2, . . . , n,

and then take the median of these values, which we denote ADM̃

Median Deviation from the Median: Example

• Consider again the same sample of n = 15 NOx concentration readings from Table 3.5

• Since n = 15 (odd), the median is equivalent to the 8th order statistic

• We already sorted the data when calculating the range, so it is easy to see that x̃ = x(8) = 32.55

• To calculate the median deviation from the median we proceed as follows:

Absolute deviations from the median (in the original data order):
|29.40 − 32.55|, |15.65 − 32.55|, |24.60 − 32.55|, . . . , |28.00 − 32.55|
= 3.15, 16.90, 7.95, 50.95, 9.00, 40.50, 27.55, 65.00, 106.10, 16.05, 7.60, 8.70, 0.00, 2.60, 4.55

Placed in increasing order: 0.00, 2.60, 3.15, 4.55, 7.60, 7.95, 8.70, 9.00, 16.05, 16.90, 27.55, 40.50, 50.95, 65.00, 106.10

ADM̃ = ADM(8) = 9.00

• The median deviation from the median is an example of a robust measure of
dispersion

• Check for yourself that, if we removed the largest value of 138.65 from the data and recalculated the median deviation from the median, it would only change slightly (from 9 to 9.725), whereas we saw above that the range would change dramatically
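Both the statistic and the robustness check can be reproduced in Python; a sketch using the Table 3.5 data:

```python
def median(data):
    x = sorted(data)
    n = len(x)
    return x[n // 2] if n % 2 == 1 else 0.5 * (x[n // 2 - 1] + x[n // 2])

nox = [29.40, 15.65, 24.60, 83.50, 23.55, 73.05, 5.00, 97.55,
       138.65, 16.50, 40.15, 41.25, 32.55, 35.15, 28.00]

x_tilde = median(nox)                            # 32.55
adm = median([abs(x - x_tilde) for x in nox])    # median absolute deviation

# Robustness check: drop the largest value and recompute
trimmed = [x for x in nox if x != max(nox)]
adm_trimmed = median([abs(x - median(trimmed)) for x in trimmed])

print(round(adm, 3), round(adm_trimmed, 3))  # 9.0 9.725
```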

Variance

• The variance is the average squared difference between the mean and the other values in the data

• This is one statistic where the formula differs depending on whether we are working with a population or a sample

• Population Variance:

σ² = (1/N) Σ_{i=1}^{N} (xi − µ)²

• Sample Variance:

s² = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)²

• There is a shortcut formula for the sample variance:

s² = (1/(n − 1)) (Σ_{i=1}^{n} xi² − nx̄²)

• Note that the reason why we divide by n − 1 rather than n in the sample variance is to correct for a source of bias in the estimator. This is called Bessel’s Correction.

Standard Deviation

• The standard deviation is simply the square root of the variance, and is a very
widely used statistic in practice

• Population Standard Deviation:

σ = √σ²

• Sample Standard Deviation:

s = √s²

• Variance and standard deviation are not robust, since they are functions of
the mean, which is also not robust

Variance and Standard Deviation Example

• Many traffic experts argue that the most important factor in accidents is not
the average speed of cars but the amount of variation. The speeds of 20 cars
were recorded on a road with a speed limit of 70 km/h and a high rate of
accidents (Table 3.6)

82 69 71 84 73 71 66 84 82 79
78 80 70 71 72 66 78 92 71 70

Table 3.6: Speed (km/h) of a Random Sample of 20 Cars on a Dangerous Road

• The variance is calculated as follows:

x̄ = (1/20)(82 + 69 + 71 + · · · + 70) = 75.45

s² = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)²
   = (1/19)[(82 − 75.45)² + (69 − 75.45)² + · · · + (70 − 75.45)²]
   = 928.95/19 = 48.89211

• Therefore the standard deviation is given by:

s = √s² = √48.89211 = 6.992289
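Both the definitional and shortcut formulas can be checked in Python; a sketch using the Table 3.6 speeds:

```python
import math

# Speeds (km/h) of 20 cars from Table 3.6
speeds = [82, 69, 71, 84, 73, 71, 66, 84, 82, 79,
          78, 80, 70, 71, 72, 66, 78, 92, 71, 70]

n = len(speeds)
x_bar = sum(speeds) / n                                      # 75.45

# Definitional formula, with Bessel's correction (divide by n - 1)
s2 = sum((x - x_bar) ** 2 for x in speeds) / (n - 1)

# Shortcut formula: (sum of squares - n * x_bar^2) / (n - 1)
s2_short = (sum(x * x for x in speeds) - n * x_bar ** 2) / (n - 1)

s = math.sqrt(s2)
print(round(s2, 5), round(s, 6))  # 48.89211 6.992289
```

The two formulas agree exactly in exact arithmetic; in floating point they agree to well within rounding error.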

Approximate Variance and Standard Deviation of Grouped Data

• Recall our notation for the grouped mean:

x̄gr = (1/n) Σ_{j=1}^{c} fj mj

where j = 1, 2, . . . , c are the class intervals, fj is the frequency of the jth class interval, and mj is the midpoint of the jth class interval

• The following formula gives an approximation of a sample variance for grouped data, i.e. the grouped variance:

s²gr = (1/(n − 1)) Σ_{j=1}^{c} fj (mj − x̄gr)² = (1/(n − 1)) (Σ_{j=1}^{c} mj² fj − nx̄²gr)

• The formula on the right is quicker to calculate

• Of course the grouped standard deviation is obtained by taking the square root of the grouped variance

Approximate Variance and Standard Deviation for Grouped Data: Example

• Refer back to the grouped data on NSC examination pass rates from 6794 South African schools (Table 3.4)

• We previously computed the grouped mean to be x̄gr = 69.6659

• The grouped sample variance is computed as follows:

s²gr = (1/(n − 1)) (Σ_{j=1}^{c} mj² fj − nx̄²gr)
    = (1/6793)[5²(65) + 15²(143) + 25²(260) + · · · + 95²(1644) − 6794(69.6659)²]
    = 526.791

• Taking the square root we obtain sgr = 22.9519.
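The shortcut form of the grouped variance can be sketched in Python using the Table 3.4 midpoints and frequencies:

```python
# Midpoints and frequencies from Table 3.4
mids = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]
freqs = [65, 143, 260, 413, 569, 659, 783, 1097, 1161, 1644]
n = sum(freqs)

mean_gr = sum(f * m for f, m in zip(freqs, mids)) / n

# Shortcut form: (sum of m_j^2 * f_j - n * mean^2) / (n - 1)
s2_gr = (sum(f * m * m for f, m in zip(freqs, mids)) - n * mean_gr ** 2) / (n - 1)
s_gr = s2_gr ** 0.5

print(round(s2_gr, 3), round(s_gr, 4))  # 526.791 22.9519
```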

Coefficient of Variation

• The Coefficient of Variation is a descriptive statistic with a fancy name, but it is nothing other than the ratio of the standard deviation to the mean

• This statistic is more easily interpretable than the standard deviation

• For instance, if we know that the population standard deviation is σ = 0.2, is this large or small? The answer is, it depends!

– If the population mean is µ = 0.35, it is pretty large
– If the population mean is µ = 35 million, it is extremely small

• Thus by dividing the standard deviation by the mean, we get an indication of how large the standard deviation is relative to the mean

• Thus the population coefficient of variation is calculated as

CV = σ/µ

while the sample coefficient of variation is calculated as

cv = s/x̄

3.3 Measures for Proportions
Proportions

• When dealing with nominal or ordinal data (or sometimes even with interval
or ratio data), a very simple but very important descriptive statistic is the
proportion with which the observations fall into a certain category or have
a certain attribute

• A proportion (which is essentially the same thing as the relative frequency in the frequency tables discussed earlier) must fall between 0 (none of the observations have the attribute) and 1 (all of the observations have the attribute)

• A proportion can be converted to a percentage by multiplying it by 100

• Population Proportion:

P = A/N

where A is the number of observations in the population which have the attribute

• Sample Proportion:

p = a/n

where a is the number of observations in the sample which have the attribute

Proportions Example

• Take our sample of n = 22 student marks in Table 3.2 (already sorted from
least to greatest earlier):

33, 47, 48, 49, 55, 59, 59, 63, 64, 65, 67
68, 71, 73, 73, 74, 75, 77, 81, 90, 90, 97

• Suppose we are interested in the proportion of students who passed (that is,
achieved a mark of at least 50)

• Let a be the number of marks which are ≥ 50

• Notice that what we are effectively doing here is creating a nominal variable
with two categories, ‘Passed (Mark ≥ 50)’ and ‘Failed (Mark < 50)’

– A nominal variable with only two categories is called a binary variable

• We then calculate the sample proportion as follows:

p = a/n = 18/22 = 0.8182

• Thus the proportion of students who passed is 0.8182, and the percent is
81.82%.

• Suppose instead we were interested in the proportion of students whose sex is
female (recall: the sex of each student is provided in Table 3.2)

• In this case the variable is already nominal and binary, so we just count the frequency of females a and divide by the sample size n:

p = a/n = 7/22 = 0.3182

• Thus the proportion of students in the class that are female is 0.3182
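Both proportions can be computed directly; a sketch using the marks and sexes of Table 3.2 (listed left column first, then right):

```python
# Marks and sexes from Table 3.2, in table order
marks = [71, 90, 63, 49, 75, 59, 74, 81, 90, 59, 73,
         67, 47, 64, 97, 73, 68, 77, 65, 33, 55, 48]
sexes = ["M", "M", "M", "F", "M", "M", "M", "F", "M", "M", "M",
         "F", "F", "F", "M", "M", "M", "F", "F", "M", "M", "M"]

n = len(marks)
p_pass = sum(1 for m in marks if m >= 50) / n   # binary attribute: mark >= 50
p_female = sexes.count("F") / n                 # binary attribute: female

print(round(p_pass, 4), round(p_female, 4))  # 0.8182 0.3182
```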

3.4 Measures of Relative Standing


Order Statistics

• We defined order statistics earlier and saw that statistics such as the median
and range are functions of order statistics

• An order statistic is one example of a measure of relative standing

Percentiles

• The P th percentile is the value below which P % of the data fall; put differently,
P % of the data are less than the P th percentile and (100 − P )% of the data
are greater than the P th percentile

• For example, the median is equivalent to the 50th percentile, since 50% of the
data lie below the median

• Percentiles are often used to characterise scores on standardised tests.

– For instance, the Intelligence Quotient (IQ) is a variable designed to measure intelligence whereby a score of 100 is the median (50th percentile). An IQ score of 120 is at about the 91st percentile, meaning that if you take an IQ test and score 120, you are theoretically more intelligent than 91% of people

• Formula for the rank of the Pth percentile:

RP = (P/100)(n + 1)

• (Note that there are other definitions out there!)

• If RP is an integer, then the Pth percentile is x(RP), i.e. the RPth order statistic

• If RP is a decimal number, then the Pth percentile is calculated as follows:

– Let I be the integer part of RP (before the decimal point)
– Let F be the fractional part of RP (after the decimal point)
– Pth percentile = F · x(I+1) + (1 − F) · x(I)
– We are basically taking a weighted mean of the Ith and (I + 1)th order statistics, with the fractional component F being the weights (if F > 0.5, the percentile should be closer to x(I+1); if F < 0.5, the percentile should be closer to x(I); if F = 0.5, the percentile will be exactly the average of the two)

Percentiles: Example

• Refer again to our data set of 22 student marks in Table 3.2


• What is the 80th percentile of this data set? (the value below which 80% of the marks fall)

Rank of the 80th percentile: R80 = (80/100)(22 + 1) = 18.4
80th percentile = 0.4 · x(19) + (1 − 0.4) · x(18) = 0.4(81) + 0.6(77) = 78.6

• The student who scored a mark of 63 wants to know what percentile he is in

• To determine this, we simply use the rank of this student’s mark as the RP value, and solve for P in the equation RP = (P/100)(n + 1). His mark ranks 8th (going from least to greatest), so RP = 8

• Thus,

P = 100RP/(n + 1) = 100(8)/23 = 34.7826

• The student is in about the 35th percentile, meaning that about 35% of students received an exam mark lower than his mark
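The rank-and-interpolate procedure can be sketched as a Python function; the data are the sorted marks of Table 3.2:

```python
def percentile(data, p):
    """P-th percentile via the rank R_P = (P/100)(n + 1), interpolating
    between order statistics when the rank is not an integer."""
    x = sorted(data)
    r = (p / 100) * (len(x) + 1)
    i = int(r)            # integer part I
    f = r - i             # fractional part F
    if f == 0:
        return x[i - 1]                       # exactly the I-th order statistic
    return f * x[i] + (1 - f) * x[i - 1]      # F*x_(I+1) + (1 - F)*x_(I)

marks = [33, 47, 48, 49, 55, 59, 59, 63, 64, 65, 67,
         68, 71, 73, 73, 74, 75, 77, 81, 90, 90, 97]

print(round(percentile(marks, 80), 4))  # 78.6
print(round(percentile(marks, 50), 4))  # 67.5, agreeing with the median convention
```

Note that this is the (n + 1) rank definition used in these notes; other texts and software use slightly different conventions.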

Quartiles

• Quartiles are special cases of percentiles:


– First Quartile = Q1 = 25th percentile
– Second Quartile = Q2 = 50th percentile = x̃ (Median)
– Third Quartile = Q3 = 75th percentile

Interquartile Range

• The Interquartile Range (IQR) is a measure of dispersion that is obtained by subtracting the first quartile from the third quartile:

IQR = Q3 − Q1

• This is a more robust measure of dispersion than the Range, since it tends to
exclude extreme values (outliers)

1.05 1.10 0.99 1.33 1.13
1.16 1.01 1.30 1.30 1.16
0.96 1.18 1.16 1.17 1.14
1.10 0.70 1.30 1.50 1.10
1.11 0.68 1.01 0.93 1.18

Table 3.7: Peanut Yields (kg) from Plots in a Field

Interquartile Range: Example

• Table 3.7 contains yields (in kg) of peanuts from 25 plots of land in a field.
(Each plot was 3 m by 1 m in dimensions)

• Determine the interquartile range for this data set


R25 = (25/100)(25 + 1) = 6.5
Q1 = 0.5 · x(7) + (1 − 0.5) · x(6) = 0.5(1.01) + 0.5(1.01) = 1.01
R75 = (75/100)(25 + 1) = 19.5
Q3 = 0.5 · x(20) + (1 − 0.5) · x(19) = 0.5(1.18) + 0.5(1.18) = 1.18
IQR = Q3 − Q1 = 1.18 − 1.01 = 0.17

Quantiles

• A quantile is a more general version of percentiles and quartiles

• While quartiles divide the data into four parts, and percentiles divide the data
into one hundred parts, sample quantiles divide the data according to any
proportion

• Thus the first quartile is equivalent to the 25th percentile and to the 0.25
quantile

• Quantiles are often used for the extreme ends of the data, e.g. the 0.99975
sample quantile

Standard Scores (z-Scores)

• A standard score (also called a z-score) tells us how far an observation is from
the population mean, in units of population standard deviations

• The standard score of an observation xi is calculated as

zi = (xi − µ)/σ

• The standard score is closely related to the Normal distribution (or Gaus-
sian distribution) which will be discussed later in the course

• We can easily identify whether an observation is above or below the mean
based on whether its standard score is positive or negative

• Standard scores can also be used to detect outliers (see below)

Standard Scores: Example

• Consider again the car speed data in Table 3.6

• Give the z-scores of all the observations in this data, assuming it is known that the data come from a population with a mean of 75 km/h and a variance of 49 (km/h)², i.e. a standard deviation of 7 km/h

• The results are tabulated in Table 3.8

xi zi xi zi
82 1.0000 78 0.4286
69 -0.8571 80 0.7143
71 -0.5714 70 -0.7143
84 1.2857 71 -0.5714
73 -0.2857 72 -0.4286
71 -0.5714 66 -1.2857
66 -1.2857 78 0.4286
84 1.2857 92 2.4286
82 1.0000 71 -0.5714
79 0.5714 70 -0.7143

Table 3.8: z-Scores for Vehicle Speed Data (Raw Data in Table 3.6)

• For example, z1 is calculated as follows:

z1 = (x1 − µ)/σ = (82 − 75)/√49 = 1

Outliers

• Outliers are extreme values in the data (extremely large or extremely small
relative to the rest of the data)

• There are many different methods one can use to determine whether a partic-
ular observation is an outlier; some methods are very advanced

• In this module we will use two conventional definitions of outliers that is used
when drawing box-and-whisker plots (discussed below):

Definition 3.1. IQR Approach: An observation is an outlier if it falls more
than 1.5 interquartile ranges above the third quartile, or more than 1.5 in-
terquartile ranges below the first quartile. An observation is an extreme
outlier if it falls more than 3 interquartile ranges above the third quartile, or
more than 3 interquartile ranges below the first quartile.

• Definition 3.1 can be expressed mathematically as follows: an observation xi


is an outlier if

xi < Q1 − 1.5IQR, or
xi > Q3 + 1.5IQR

• The quantities Q1 − 1.5IQR and Q3 + 1.5IQR are referred to as the inner


fence

• An observation xi is an extreme outlier if:

xi < Q1 − 3IQR, or
xi > Q3 + 3IQR

• The quantities Q1 − 3IQR and Q3 + 3IQR are referred to as the outer fence
Definition 3.2. Standard Score Approach: An observation is an outlier if
its standard score exceeds 3 in absolute value

• Definition 3.2 can be expressed mathematically as follows: an observation xi


is an outlier if
|zi | > 3

Outliers Example

• The data in Table 3.9 gives the weights (in kg) of 23 single-engine aircraft
built during the years 1947-1979:

3716 3045 4392 5835 4639 6768


6289 5285 6920 8496 8795 9381
5838 6084 6049 13570 13308 11205
12972 3675 13785 20987 8107

Table 3.9: Weights (kg) of 23 Types of Single-Engine Aircraft

• Determine, using the definitions above, whether there are any outliers in this
data, and if so, whether there are any extreme outliers. For standard scores
you may assume a population mean of 8000 kg and a population standard
deviation of 4000 kg.

• Under Definition 3.1:

– We calculate the fences as follows:

R25 = (25/100)(23 + 1) = 6
Q1 = x(6) = 5285
R75 = (75/100)(23 + 1) = 18
Q3 = x(18) = 11205
IQR = Q3 − Q1 = 11205 − 5285 = 5920
Q1 − 1.5IQR = 5285 − 1.5(5920) = −3595
Q3 + 1.5IQR = 11205 + 1.5(5920) = 20085
Q1 − 3IQR = 5285 − 3(5920) = −12475
Q3 + 3IQR = 11205 + 3(5920) = 28695

– By inspection, we can see that there is one outlier, 20987, but no extreme
outliers

• Under Definition 3.2:

– We calculate the z-scores for the plane weights as

z1 = (3716 − 8000)/4000 = −1.0710,  z2 = (3045 − 8000)/4000 = −1.2388,
...
z22 = (20987 − 8000)/4000 = 3.2468,  z23 = (8107 − 8000)/4000 = 0.0268

• Since |z22 | > 3, this aircraft’s weight is also an outlier under the standard score
definition
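• Both outlier checks can be reproduced with a short Python sketch (Python is used purely for illustration; the quartile rank rule Rp = p(n + 1)/100 from these notes is assumed, along with the given µ = 8000 kg and σ = 4000 kg):

```python
# Sketch: outlier detection for the aircraft weight data using both definitions.
weights = [3716, 3045, 4392, 5835, 4639, 6768,
           6289, 5285, 6920, 8496, 8795, 9381,
           5838, 6084, 6049, 13570, 13308, 11205,
           12972, 3675, 13785, 20987, 8107]

x = sorted(weights)
n = len(x)
q1 = x[int(25 / 100 * (n + 1)) - 1]   # 6th ordered value
q3 = x[int(75 / 100 * (n + 1)) - 1]   # 18th ordered value
iqr = q3 - q1

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # inner fence
outer = (q1 - 3 * iqr, q3 + 3 * iqr)       # outer fence

outliers = [w for w in weights if w < inner[0] or w > inner[1]]
extreme = [w for w in weights if w < outer[0] or w > outer[1]]

# Standard-score approach with the assumed mu = 8000, sigma = 4000
z_outliers = [w for w in weights if abs((w - 8000) / 4000) > 3]

print(q1, q3, iqr)        # 5285 11205 5920
print(outliers, extreme)  # [20987] []
print(z_outliers)         # [20987]
```

Both definitions flag the 20987 kg aircraft and nothing else.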

Box-and-Whisker Plots

• A Box-and-Whisker Plot is a graphical representation of data that uses


measures of relative standing known as the ‘five-number summary’

• These five numbers are the minimum x(1) , the first quartile Q1 , the second
quartile (median) Q2 = x̃, the third quartile Q3 , and the maximum x(n) .

• The bottom and top of the ‘box’ represent the first and third quartiles, with
a line through the box representing the median

• The bottom and top ‘whiskers’ represent the minimum and maximum

• The five numbers are not usually numbered as they are in Figure 3.1; this is just to assist you in understanding the structure of the plot

Figure 3.1: Labelled Box-and-Whisker Plot of Some Randomly Generated Data

Box-and-Whisker Plots in the Presence of Outliers

• Box-and-Whisker Plots change when outliers are present.

• The ‘box’ part of the plot remains the same, but the convention is to represent
outliers as individual points above and/or below the whiskers.

• Therefore, when outliers are present, the ends of the whiskers no longer repre-
sent the minimum and maximum of the data but the minimum and maximum
excluding outliers

• For example, we identified one outlier in the aircraft weight data in Table 3.9.
Thus the box-and-whisker plot for this data appears as in Figure 3.2

Figure 3.2: Box-and-Whisker Plot of Aircraft Weight Data

3.5 Measuring Skewness
Skewness

• Skewness refers to the extent to which the data is asymmetric (not symmet-
ric) about its central location

• We saw earlier that a histogram gives us a visual idea of whether a set of


interval or ratio data is positively skewed (most of the data below the central
location) or negatively skewed (most of the data above the central location)

• There are also descriptive statistics that one can use to measure the skewness
of a data set

• A widely used sample skewness statistic is g1 , computed as follows:


g1 = [ Σ (xi − x̄)³ / n ] / [ Σ (xi − x̄)² / n ]^(3/2)

(where Σ denotes summation over i = 1, . . . , n)

• We interpret the statistic as follows:



g1 >> 0 suggests the data is positively skewed
g1 << 0 suggests the data is negatively skewed
g1 ≈ 0 suggests the data is symmetrical

Skewness: Example

• Refer to the peanut yield data in Table 3.7

• Let us calculate and interpret g1 :


x̄ = (1/25)(1.05 + 1.16 + 0.96 + · · · + 1.18) = 1.11

g1 = [ (1/25)((1.05 − 1.11)³ + (1.16 − 1.11)³ + · · · + (1.18 − 1.11)³) ]
     / [ (1/25)((1.05 − 1.11)² + (1.16 − 1.11)² + · · · + (1.18 − 1.11)²) ]^(3/2)
   = −0.5088

• The skewness is negative, which suggests that the data is negatively skewed

• This appears to agree with the histogram of the data in Figure 3.3, which suggests a slight negative skew

Figure 3.3: Histogram of Peanut Yields
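• The g1 statistic can be computed directly from its definition. The Python sketch below is illustrative only and uses small made-up data sets (not the peanut yields) to show the sign convention:

```python
# Sketch: the sample skewness statistic g1 as defined above.
def g1(data):
    n = len(data)
    mean = sum(data) / n
    m3 = sum((xi - mean) ** 3 for xi in data) / n   # third central moment
    m2 = sum((xi - mean) ** 2 for xi in data) / n   # second central moment
    return m3 / m2 ** 1.5

print(g1([1, 2, 3, 4, 5]))    # 0.0 -- symmetric data
print(g1([1, 1, 1, 2, 10]))   # positive -- long right tail
```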

Changing Units

• Important Note: Make sure the units in your data match the units of your
descriptive statistics!

• E.g. Rand vs. Thousands of Rand

• E.g. g vs. kg

4 Basic Principles of Probability


4.1 Introduction to Probability
Probability Definitions

• An experiment is an action that may lead to one or more possible distinct


outcomes (it does not need to be a scientific experiment but can be something
as simple as flipping a coin)

• The set of all possible distinct outcomes of an experiment is called the sample
space and is denoted S

• Each outcome in a sample space can be referred to as an element or simply
as an outcome; and all outcomes are distinct and mutually exclusive (they do
not overlap)

• An event is a collection of one or more outcomes, in other words, a subset of


the sample space

• A probability is a mathematical measure assigned to an event within the


sample space of an experiment. Intuitively, it can be understood as the pro-
portion of time that this event would occur if the experiment were repeated
infinitely many times

• The probability of event A is denoted Pr (A)

• A probability must satisfy the following conditions:

1. 0 ≤ Pr (A) ≤ 1 for any event A ⊆ S


2. Pr (S) = 1

• An event with a probability of 0 certainly will never happen (no matter how
many times the experiment is conducted)

• An event with a probability of 1 certainly will always happen (every time the
experiment is conducted)

• Since probabilities range between 0 and 1, they are sometimes expressed as a


percent (e.g. ‘There is a 20% chance of rain today’ or ‘There is a 50/50 chance
that my team wins today’)

• In this module we will express probabilities as fractions or decimals, and not


as percents

• When expressing a probability as a decimal, our convention will be to express


it correct to four decimal places

Probability: Basic Examples

• Two simple random experiments are the coin flip and the roll of a die

Figure 4.1: Both Sides of a South African Five-Rand Coin

• The sample space for a coin flip consists of two outcomes, ‘Heads’ and ‘Tails’;
thus S = {Heads, Tails}

• Note: the two sides of a coin are conventionally referred to as ‘Heads’ and
‘Tails’, with ‘Heads’ representing the side that displays a portrait of a head of
state or other public figure. As can be seen in Figure 4.1, South African coins
typically do not display the portrait of a person. However, the South African
Mint has officially confirmed on Twitter that the side of the coin displaying
the coat-of-arms is the ‘Heads’ side, which means that the side displaying an
animal is the ‘Tails’ side
• Experience tells us that for an ordinary, ‘fair’ coin, Pr (Heads) = Pr (Tails) = 1/2
• (Whether this is exactly true depends on the coin and how it is flipped)

Figure 4.2: A Six-Sided Die

• The sample space for the roll of a six-sided die is S = {1, 2, 3, 4, 5, 6}

• Note: the word ‘dice’ is the plural of ‘die’

• Again, experience tells us that each of these six outcomes is an event with a probability of 1/6
• In the experiment of rolling a die, one could define an event that includes more than one outcome, such as ‘roll an odd number’ (which includes 1, 3, and 5) or ‘roll at least a 5’ (which includes 5 and 6). Thus

Pr (roll an odd number) = 3/6 = 1/2
Pr (roll at least a 5) = 2/6 = 1/3

• One of the basic principles of probability that we can see at work in the above
probabilities is that, if all outcomes of the experiment are equally likely, we
can calculate the probability of an event as
Pr (A) = (# of outcomes included in event A) / (total # of outcomes in sample space S)
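• This counting rule translates directly into code. A short Python sketch (Python is used here purely for illustration), representing events as sets of die outcomes:

```python
# Sketch: the equally-likely counting rule applied to a six-sided die.
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space of the die

def pr(event):
    # probability = (# outcomes in event) / (# outcomes in S),
    # valid only because every outcome of the die is equally likely
    return Fraction(len(event & S), len(S))

print(pr({1, 3, 5}))   # 1/2  -- roll an odd number
print(pr({5, 6}))      # 1/3  -- roll at least a 5
```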

Probability: Further Example

• Another simple example of an experiment that gives rise to probabilities is


that of a spinner

Figure 4.3: A Spinner with Three Outcomes

• To get an idea of how a spinner works, play around with an online adjustable
spinner tool

• We can see from Figure 4.3 that the three outcomes ‘Red’, ‘Green’, and ‘Blue’ are not all equally probable; rather, their probabilities are equal to the fraction of the circle’s circumference that each occupies

• By measuring angles we could verify that


Pr (Red) = 1/6
Pr (Green) = 1/3
Pr (Blue) = 1/2

• In this case, to find the probability of the event A =‘Green or Blue’, we cannot
simply use the ‘# of outcomes’ formula above, i.e.
Pr (A) = (# of outcomes included in event A) / (total # of outcomes in sample space S) = 2/3

• This value is incorrect because the outcomes are not all equally likely. Instead,
we can argue that
Pr (Green or Blue) = Pr (Green) + Pr (Blue) = 1/3 + 1/2 = 5/6

• This gives us the correct answer because the outcomes ‘Green’ and ‘Blue’ are
mutually exclusive or disjoint, meaning that they cannot both occur in the
same run of the experiment (the spinner cannot land on green and blue)

• The above gives rise to the additive rule of probability for mutually exclusive
events (see below)

Probability Interpretation

• Probability has a long run interpretation:

– If you repeat a random experiment many times, the proportion of times in which an event occurs tends to converge to its probability

• This is related to the Law of Large Numbers, which states that as n → ∞


(where n is the number of times we repeat an experiment), the probability
approaches 1 that the proportion of times event A occurs is within ±c of
Pr (A), for any constant c that we choose

• Therefore,

– Probability does not mean that if you flip a coin twice you will get Heads
once and Tails once
– But it does mean that if you flip a coin 1 million times, you will almost certainly get close to 500 000 Heads and 500 000 Tails
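• A quick simulation sketch illustrates this convergence (the seed and sample sizes below are arbitrary choices, made only for illustration):

```python
# Sketch: the long-run interpretation of probability.  As the number of
# simulated coin flips grows, the proportion of Heads settles near 1/2.
import random

random.seed(1)  # fixed seed so the run is reproducible
for n in (100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)   # proportion drifts toward 0.5 as n increases
```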

Warning to Gamblers

• Casinos, lotteries, sports betting, and other commercial gambling enterprises


are designed in such a way that the probability of the player losing is higher
than the probability of the player winning

• This means that, while you may get lucky at the casino on one particular day,
if you keep going back to the casino many times you are almost certain to lose
more money than you gain

4.2 Additive Probability Rules


Requirements of Probabilities

• To restate what was said earlier, suppose we have a sample space S = {O1 , O2 , . . . , Ok } (where the Oj denote outcomes, which by definition are mutually exclusive). The probabilities of these outcomes must meet two criteria:

1. 0 ≤ Pr (Oj ) ≤ 1 for each j

• That is, each probability must be between 0 and 1. We mentioned before


that 0 represents impossible and 1 represents certain. There is nothing
less than impossible and nothing more than certain!
2. Pr (S) = Pr (O1 ) + Pr (O2 ) + · · · + Pr (Ok ) = 1

• That is, the probabilities of all the possible outcomes must add up to 1.
One of the outcomes in the sample space has to happen, with certainty!

Mutual Exclusivity

• Events are usually denoted with capital letters like A and B

• Two events A and B are said to be mutually exclusive if they cannot both
occur in the same run of an experiment

• Outcomes are always mutually exclusive by definition, but events are not nec-
essarily mutually exclusive

• For example, when rolling a six-sided die,

– the events ‘roll a 3’ and ‘roll an even number’ are mutually exclusive
– The events ‘roll a 3’ and ‘roll an odd number’ are not mutually exclusive

• We can use Venn Diagrams to help us visualise this (Figures 4.4 and 4.5)

Figure 4.4: Events A and B are Mutually Exclusive

Figure 4.5: Events A and B are not Mutually Exclusive

• We can express the mutual exclusivity of events A and B in set notation as


follows:
A∩B =∅

• Here, ∩ is read as ‘Intersect’ and thus A∩B is the area of intersection (overlap)
between events A and B. ∅ denotes the empty set, and so the above statement
says that there are no outcomes in A that are also in B

Additive Rule for Mutually Exclusive Events

• Let A and B be two events relating to the same random experiment

• If A and B are mutually exclusive, the probability that at least one of A or B


will occur is:
Pr (A or B) = Pr (A) + Pr (B)

• This is written using set notation as follows

Pr (A ∪ B) = Pr (A) + Pr (B)

• The symbol ∪ is read as ‘Union’, thus ‘A Union B’

• It is important to recognise that the ‘Union’ of two events A and B consists


of outcomes that are included in event A or event B or both

• This probability is expressed visually (with reference to the Venn Diagram in


Figure 4.4) in Figure 4.6

Figure 4.6: Mathematical/Graphical Visualisation of Additive Rule for Mutually Exclusive Events

Additive Rule for Mutually Exclusive Events: Example

• Suppose we roll a six-sided die, and

– Let A be the event of rolling a 6


– Let B be the event of rolling an odd number

• A and B are mutually exclusive (that is, A ∩ B = ∅), so:

Pr (A ∪ B) = Pr (A) + Pr (B)
= 1/6 + 3/6
= 4/6 = 2/3

Additive Rule for Non-Mutually Exclusive Events

• If A and B are not mutually exclusive, the set A ∩ B ≠ ∅: there is at least one outcome common to events A and B, and therefore Pr (A ∩ B) ≠ 0

• Since the ‘Union’ of A and B consists of outcomes that are included in event A or event B or both, if we apply the previous rule (which is valid only for mutually exclusive events) we will overstate the probability, because we will count A ∩ B twice, once in A and once in B

• To solve this problem, the additive rule for non-mutually exclusive events
requires us to subtract Pr (A ∩ B) so that it is only counted once:

Pr (A or B) = Pr (A) + Pr (B) − Pr (A and B)

• Again, this is written in a more mathematical way as follows:

Pr (A ∪ B) = Pr (A) + Pr (B) − Pr (A ∩ B)

• It is visually represented in Figure 4.7

Figure 4.7: Mathematical/Graphical Visualisation of Additive Rule for Non-Mutually Exclusive Events

Additive Rule for Non-Mutually Exclusive Events: Example

• Suppose we roll a six-sided die, and

– Let A be the event of rolling at least a 5


– Let B be the event of rolling an odd number

• A and B are not mutually exclusive, so:

Pr (A ∪ B) = Pr (A) + Pr (B) − Pr (A ∩ B)
= 2/6 + 3/6 − 1/6
= 4/6 = 2/3
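• Both additive rules can be checked mechanically with set arithmetic. The Python sketch below (for illustration only) re-uses the die events from the two examples above:

```python
# Sketch: checking both additive rules with set arithmetic on the die.
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}

def pr(event):
    return Fraction(len(event), len(S))   # equally likely outcomes

A, B = {6}, {1, 3, 5}            # mutually exclusive: A & B is the empty set
print(pr(A | B), pr(A) + pr(B))  # 2/3 2/3 -- the two sides agree

A, B = {5, 6}, {1, 3, 5}         # not mutually exclusive: they share outcome 5
print(pr(A | B), pr(A) + pr(B) - pr(A & B))   # 2/3 2/3 -- agree again
```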

Complement Rule of Probability

• Let A be any event from an experiment

• Then Ac (read as “A complement” or ‘not A’) is an event consisting of all


outcomes that are not part of event A

• The Complement Rule says that:

Pr (Ac ) = 1 − Pr (A)

• It can be expressed graphically as in Figure 4.8

Figure 4.8: Mathematical/Graphical Visualisation of Complement Rule

• The rule follows from our earlier statement that Pr (S) = 1 (where S de-
notes the sample space, the complete set of possible distinct outcomes of the
experiment)

• Since Ac consists of all outcomes in S that are not in A, it follows that A∪Ac =
S, and therefore (since A and Ac are obviously mutually exclusive events) that

Pr (A) + Pr (Ac ) = 1

• By moving Pr (A) to the right side of the equation, we arrive at the Comple-
ment Rule

Complement Rule: Example

• Suppose we roll a six-sided die, and

– Let A be the event of rolling at least a 5

• Event Ac is, by definition, the event of not rolling at least a 5


• We know that Pr (A) = 2/6 = 1/3

• Thus, by the Complement Rule,

Pr (Ac ) = 1 − Pr (A) = 1 − 1/3 = 2/3

4.3 Conditional Probability and Multiplicative Probability


Rules
The Concept of Conditional Probability

• Consider two events relating to the roll of a six-sided die:

– Let A be the event of rolling an odd number


– Let B be the event of rolling a 2

• Clearly, A ∩ B = ∅ (A and B are mutually exclusive)

• Suppose that your friend rolls the die. You do not see the result but he tells
you that it is an odd number (i.e. event A has occurred)
• Before we had this information, we would have said that Pr (B) = 1/6
• However, now that we have this information, we need to update the probabil-
ity: we know that A and B are mutually exclusive, so if A has occurred, this
means B cannot occur, so the probability of B is 0.

• We cannot write Pr (B) = 0, because this would contradict the earlier statement that Pr (B) = 1/6
• Instead, we define a conditional probability: the probability of an event
conditioning on another event

• We express this as Pr (B|A) = 0; B|A is read as ‘B given A’, because the


conditional probability gives us the probability that event B occurs given the
information that A has occurred

– Obviously, if events A and B are mutually exclusive, then Pr (B|A) = 0


and also Pr (A|B) = 0

– In our above example, Pr (A|B) = 0 means that if your friend told you he rolled a 2, the probability that he rolled an odd number would be updated to 0

• Now, suppose that your friend rolls the die again. You do not see the result
but he tells you that it is an even number (i.e. event A has not occurred, but
event Ac has occurred)

• Clearly, events Ac and B are not mutually exclusive: it is possible that the
number rolled could be even and could also be a 2
• Our original probability for event B was Pr (B) = 1/6; how will we update the probability now that we know that event Ac has occurred?

• Intuitively, we know that there are three even numbers that could have been
rolled (2, 4, 6): event Ac includes three outcomes, which are all equally likely

• Of these three outcomes, one outcome (2) satisfies event B


• Thus, Pr (B|Ac ) = 1/3
• What we are actually doing here is modifying the general principle
Pr (B) = (# of outcomes included in event B) / (total # of outcomes in sample space S)

• Instead of focusing on the entire sample space S, we are restricting our atten-
tion to the subset Ac ; thus
Pr (B|Ac ) = (# of outcomes included in event B and in event Ac ) / (total # of outcomes in event Ac )

• More generally, we are taking the ratio of Pr (B ∩ Ac ) to Pr (Ac ), i.e.

Pr (B|Ac ) = Pr (B ∩ Ac ) / Pr (Ac ) = (1/6) / (3/6) = 1/3

• This gives rise to the general formula for calculating a conditional probability
below (where we will revert to speaking of events A and B, since we are
no longer focusing only on the definition of A, B, and Ac in our die rolling
example)

Mathematical Definition of Conditional Probability

• Suppose that we have events A and B pertaining to some experiment(s)

• Then, provided that Pr (A) ≠ 0, the probability of B conditioning on A is defined as

Pr (B|A) = Pr (A ∩ B) / Pr (A)

• Similarly, provided that Pr (B) ≠ 0, the probability of A conditioning on B is defined as

Pr (A|B) = Pr (A ∩ B) / Pr (B)
• Notice that we can rearrange the above equations to obtain two formulas for
Pr (A ∩ B), which is called the joint probability of A and B and which we
already saw in the additive rule for non-mutually-exclusive events

Pr (A ∩ B) = Pr (B|A) Pr (A)
Pr (A ∩ B) = Pr (A|B) Pr (B)

• The above is called the multiplicative rule for dependent events
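• For equally likely outcomes, the conditional probability formula reduces to counting within the conditioning event. A Python sketch (illustrative only) of the die example, with A = roll an odd number and B = roll a 2:

```python
# Sketch: conditional probability by restricting the sample space.
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5}    # odd number
Ac = S - A       # complement: even number
B = {2}

def pr_given(b, a):
    # Pr(B|A) = Pr(A and B) / Pr(A) = |A and B| / |A| for equally likely outcomes
    return Fraction(len(a & b), len(a))

print(pr_given(B, A))    # 0   -- B cannot occur once A has occurred
print(pr_given(B, Ac))   # 1/3
```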

Independent Events

• Consider a box containing eleven balls, seven grey and four white, as pictured
in Figure 4.9

Figure 4.9: Box Containing Seven Grey Balls and Four White Balls

• Suppose we perform the following procedure:

1. Draw one ball at random from the box


2. Replace the drawn ball (put it back in the box)
3. Draw a second ball at random from the box

• Now, define the following events:

– Let A be the event that the first ball drawn is grey


– Let B be the event that the second ball drawn is white

• Suppose we are on step (3) and we know that the first ball drawn and replaced
was grey (i.e. event A occurred). What is the probability of B conditioning
on A (Pr (B|A))?

• Logically, it is clear that since the first ball drawn is put back in the box, the
probability that the second ball drawn is white is not affected by the colour of
the first ball drawn

• Another way of saying this is that A and B are independent events: they do
not influence each other in any way

– We can express in mathematical notation that A is independent of B by


A⊥B

• If we were on step (3) and had no information about event A, we would say that Pr (B) = 4/11 (since four of the eleven balls are white)
• However, since event B is independent of event A, the information that A has
occurred does not cause us to update the probability; thus
Pr (B|A) = Pr (B) = 4/11

• Another way of seeing that A ⊥ B is to notice that Pr (B|A) = Pr (B|Ac );


that is, the probability of drawing a white ball on the second draw is the same,
regardless of whether the first ball was grey (event A) or white (event Ac )

• Therefore, the occurrence or non-occurrence of A has no relationship to the


occurrence or non-occurrence of B

Multiplicative Rule for Independent Events

• We have seen that, if events A and B are independent, Pr (B|A) = Pr (B)


(and, similarly, Pr (A|B) = Pr (A))

• By substituting this into the multiplicative rule for dependent events, we can
derive the multiplicative rule for independent events:

Pr (A ∩ B) = Pr (B|A) Pr (A)
Pr (A ∩ B) = Pr (B) Pr (A) (since A and B are independent)

• Remember, this multiplicative rule only applies if events A and B are inde-
pendent

• To contrast the independent events in our ‘balls in a box’ example above with
dependent events, let us modify the procedure as follows:
1. Draw one ball at random from the box
2. Do not replace the drawn ball (do not put it back in the box)
3. Draw a second ball at random from the box
• Define events A and B exactly as before:
– Let A be the event that the first ball drawn is grey
– Let B be the event that the second ball drawn is white
• Is it still the case that A ⊥ B?
• Suppose that event A occurs. Since one grey ball is removed from the box and
not replaced, when the second ball is drawn there are 10 balls in the box, 6
grey and 4 white
– Thus, Pr (B|A) = 4/10 = 2/5
• Suppose that event Ac occurs (the first ball drawn is not grey, it is white).
Since one white ball is removed (and not replaced), when the second ball is
drawn there are 10 balls in the box, 7 grey and 3 white
– Thus, Pr (B|Ac ) = 3/10
• Since Pr (B|A) ≠ Pr (B|Ac ), it follows that the probability of event B does depend on whether or not A has occurred; thus A and B are dependent events
• In this case, to find Pr (A ∩ B) we cannot use the multiplicative rule for inde-
pendent events but must use the multiplicative rule for dependent events:
Pr (A ∩ B) = Pr (B|A) Pr (A) = (4/10)(7/11) = 28/110 = 14/55 = 0.2545
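• The 14/55 answer can also be confirmed by brute force: list every ordered pair of distinct balls and count the favourable ones. This is an illustrative Python sketch, not a required method:

```python
# Sketch: Pr(first grey, second white) without replacement, by enumeration.
from fractions import Fraction

balls = ['G'] * 7 + ['W'] * 4   # seven grey, four white

# All ordered pairs of two *different* balls = draws without replacement
pairs = [(balls[i], balls[j])
         for i in range(len(balls))
         for j in range(len(balls)) if i != j]

fav = sum(1 for first, second in pairs if first == 'G' and second == 'W')
print(Fraction(fav, len(pairs)))   # 14/55
```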
10 11 110 55
Independent Events and Gambling

• The concept of independence is often overlooked by gamblers (for instance,


playing the roulette wheel or a slot machine)
• A player may think that because he has won three times in a row, he is on a
‘lucky streak’ and is likely to keep winning
– This is false, because the results of successive plays of the game are in-
dependent
• Conversely, a player may think that because she has not won in a while, she is ‘due for some luck’
– This too is false, because the results of successive plays of the game are
independent

Conditional Probabilities and Contingency Tables

• A contingency table is another name for a two-way frequency table, which


we encountered back in chapter 2 (see Tables 2.3 and 2.7)

• It turns out that conditional probability is closely related to the ‘column rel-
ative frequency’ (and ‘row relative frequency’) in a two-way frequency table

• Let us illustrate this by way of a new example

• Suppose that the Department of Correctional Services invites 100 prisoners to participate in a skills development programme that will help them to be employable after their release

• Some of the prisoners end up participating and others do not

• The Department then keeps records on which of the prisoners reoffend and
which do not, within two years of their release from prison

• The results of the research are summarised in a contingency table (two-way


frequency table) in Table 4.1

                                Re-offends   Does not re-offend   Total
Completes programme                  3               57             60
Does not complete programme         27               13             40
Total                               30               70            100
Table 4.1: Contingency Table of Results of Correctional Services Skills Development
Programme

Calculating Marginal Probabilities from Contingency Table

• Suppose we choose one prisoner at random from the population of 100 (with
all prisoners equally likely to be selected), and define the following events:

– Let A be the event that the prisoner completed the skills development
programme
– Let B be the event that the prisoner reoffended after his/her release

• We can use Table 4.1 to very easily calculate any probabilities such as Pr (A),
Pr (Ac ), Pr (B), Pr (B c ), Pr (A ∩ B), Pr (A ∪ B), Pr (A|B), Pr (B|A), etc.

• To get the marginal probability of A (i.e. Pr (A), without taking into
account the influence of B), we simply divide the total of the first row by the
overall total: 60 out of 100 prisoners completed the programme, so Pr (A) = 60/100
• Similarly, to get the marginal probability of B (i.e. Pr (B), without taking
into account the influence of A), we simply divide the total of the first column
by the overall total: 30 out of 100 prisoners reoffended, so Pr (B) = 30/100
• In the same way, we could divide the total of the second row by the grand total,
and the total of the second column by the grand total, to get the marginal
probabilities Pr (Ac ) and Pr (B c ), respectively

– Of course, these two probabilities could also be calculated using the com-
plement rule, e.g.
Pr (Ac ) = 1 − Pr (A) = 1 − 60/100 = 40/100
• All of this is illustrated in Figure 4.10

Figure 4.10: Calculating Marginal Probabilities from Contingency Table

Calculating Joint Probabilities from a Contingency Table

• Similarly, we can calculate the joint probability Pr (A ∩ B) from the contin-


gency table by dividing the frequency of the top left cell (number of prisoners
who completed the programme and reoffended) by the grand total, to get
Pr (A ∩ B) = 3/100
• This is illustrated in Figure 4.11

Figure 4.11: Calculating Joint Probabilities from Contingency Table

• We can also calculate the probability of Pr (A ∪ B) from the contingency table


by adding the frequencies of the three cells that include either event A or event
B or both (see Figure 4.12)

• Of course, according to the additive rule for non-mutually exclusive events,


this probability could also be calculated by adding the marginal probabilities
Pr (A) and Pr (B) and then subtracting the joint probability Pr (A ∩ B)

Figure 4.12: Calculating Union Probability from Contingency Table

Calculating Conditional Probabilities from a Contingency Table

• When calculating a conditional probability, we no longer divide by the grand


total of 100, because we are not interested in the entire sample space S but
only in the event on which we are conditioning

• Thus, if we are conditioning on A we will divide by the total frequency for event
A (60) and if we are conditioning on B we will divide by the total frequency
of event B (30)

• Thus, for example, to calculate Pr (A|B) we simply take the number of prisoners who completed the programme and reoffended, and divide this by the number of prisoners who reoffended, thus obtaining 3/30, the probability that a prisoner completed the programme given that s/he reoffended (see Figure 4.13)

Figure 4.13: Calculating Conditional Probability from Contingency Table
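• All of the probabilities read off Table 4.1 can be reproduced programmatically. The Python sketch below (illustrative only; variable names are our own) stores the four cell frequencies and applies the rules above:

```python
# Sketch: probabilities from the skills-programme contingency table.
# A = completes programme, B = reoffends.
from fractions import Fraction

completes_reoffend = 3
completes_no = 57
no_completes_reoffend = 27
no_completes_no = 13
total = completes_reoffend + completes_no + no_completes_reoffend + no_completes_no

pr_A = Fraction(completes_reoffend + completes_no, total)            # row total / grand total
pr_B = Fraction(completes_reoffend + no_completes_reoffend, total)   # column total / grand total
pr_A_and_B = Fraction(completes_reoffend, total)                     # cell / grand total
pr_A_or_B = pr_A + pr_B - pr_A_and_B                                 # additive rule
pr_A_given_B = pr_A_and_B / pr_B                                     # cell / column total

print(pr_A, pr_B, pr_A_and_B, pr_A_or_B, pr_A_given_B)
# 3/5 3/10 3/100 87/100 1/10
```

Note that Fraction reduces automatically, so 60/100 prints as 3/5 and 3/30 as 1/10.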

Bayes’ Theorem
• We noted earlier that the two ways of expressing the multiplicative rule for
dependent events can be rearranged as conditional probability formulas:
Pr (B|A) = Pr (A ∩ B) / Pr (A)
Pr (A|B) = Pr (A ∩ B) / Pr (B)

• The two ways of expressing the multiplicative rule for dependent events can
also be set equal to each other to give Bayes’ Theorem:
Pr (A) Pr (B|A) = Pr (B) Pr (A|B)
Pr (B|A) = Pr (B) Pr (A|B) / Pr (A) (Bayes’ Theorem)

• Bayes’ Theorem is named after Thomas Bayes, an English statistician, philoso-


pher, and Presbyterian minister (1706-1761).
• The theorem is the foundation for Bayesian Statistics, which is a whole branch
and philosophy of how to approach statistics
Law of Total Probability
• The Law of Total Probability states that, for event A and for disjoint
(mutually exclusive) events B1 , B2 , . . . , Bb that partition the whole sample
space (i.e., Pr (B1 ) + Pr (B2 ) + · · · + Pr (Bb ) = 1), we have

Pr (A) = Pr (A ∩ B1 ) + · · · + Pr (A ∩ Bb )
       = Pr (A|B1 ) Pr (B1 ) + · · · + Pr (A|Bb ) Pr (Bb )

Figure 4.14: Visualisation of Law of Total Probability

• This law is visualised in Figure 4.14

• The law of total probability also gives us another way of expressing Bayes’
Theorem:
Pr (B1 |A) = Pr (B1 ) Pr (A|B1 ) / Pr (A)
           = Pr (B1 ) Pr (A|B1 ) / [ Pr (A|B1 ) Pr (B1 ) + · · · + Pr (A|Bb ) Pr (Bb ) ]

Law of Total Probability: Example

• Suppose that you have three bags that each contain 10 balls. Bag 1 has 3 blue balls and 7 green balls. Bag 2 has 5 blue balls and 5 green balls. Bag 3 has 8 blue balls and 2 green balls. You choose a bag at random and then choose a ball from this bag at random. There is a 1/3 chance that you choose Bag 1, a 1/2 chance that you choose Bag 2, and a 1/6 chance that you choose Bag 3. What is the probability that the chosen ball is blue?

Let A be the event that the chosen ball is blue


Let Bk be the event that the kth bag is chosen
Pr (A) = Pr (A ∩ B1 ) + Pr (A ∩ B2 ) + Pr (A ∩ B3 )
       = Pr (A|B1 ) Pr (B1 ) + Pr (A|B2 ) Pr (B2 ) + Pr (A|B3 ) Pr (B3 )
       = (3/10)(1/3) + (5/10)(1/2) + (8/10)(1/6)
       = 29/60
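• A short Python sketch (illustrative only) confirming the arithmetic with exact fractions:

```python
# Sketch: the three-bag example via the Law of Total Probability.
from fractions import Fraction

F = Fraction
bags = [
    (F(1, 3), F(3, 10)),   # (Pr(choose bag k), Pr(blue | bag k))
    (F(1, 2), F(5, 10)),
    (F(1, 6), F(8, 10)),
]

# Pr(blue) = sum over bags of Pr(bag) * Pr(blue | bag)
pr_blue = sum(p_bag * p_blue for p_bag, p_blue in bags)
print(pr_blue)   # 29/60
```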

Bayes’ Theorem and Law of Total Probability: Example

• A certain pregnancy test has a 0.96 probability of giving a positive result


when applied to a woman who is actually pregnant (true positive), and a 0.08
probability of giving a positive result when applied to a woman who is not
pregnant (false positive). Consider a population of women of whom 7% are
pregnant. A woman from this population does the pregnancy test (no other
information is available besides that she is from a population of which 7% are
pregnant). Determine the following probabilities.

(i) The probability that the pregnancy test shows a positive result.
We use the Law of Total Probability. Let ‘Pos’ be the event that a
pregnancy test result is positive and let ‘Preg’ be the event that a woman
is pregnant.

Pr (Pos) = Pr (Pos|Preg) Pr (Preg) + Pr (Pos|Pregc ) Pr (Pregc )


(by law of total probability)
= (0.96)(0.07) + (0.08)(1 − 0.07)
= 0.1416

(ii) The probability that the woman is pregnant, given that the pregnancy
test shows a positive result.
We use Bayes’ Theorem.

Pr (Pos|Preg) Pr (Preg)
Pr (Preg|Pos) = (by Bayes’ Theorem)
Pr (Pos)
(0.96)(0.07)
=
0.1416
= 0.474576

(iii) The probability that the woman is not pregnant, given that the pregnancy
test shows a negative result.
We again use Bayes’ Theorem.

Pr (Posc |Pregc ) Pr (Pregc )


Pr (Pregc |Posc ) = (by Bayes’ Theorem)
Pr (Posc )
(1 − 0.08)(1 − 0.07)
=
1 − 0.1416
(0.92)(0.93)
=
0.8584
= 0.996738

(iv) The probability that the pregnancy test gives an incorrect result (false
positive or false negative).
We use the additive rule for mutually exclusive (disjoint) events. The
two events that represent an incorrect result are Pos ∩ Pregc (positive

test and not pregnant =⇒ false positive) and Posc ∩ Preg (negative test
and pregnant =⇒ false negative). We need to find the probability of
the union of these two events, which are obviously mutually exclusive
(disjoint) events.

Pr ((Pos ∩ Pregc ) ∪ (Posc ∩ Preg))


= Pr (Pregc ∩ Pos) + Pr (Preg ∩ Posc )
= Pr (Pregc |Pos) Pr (Pos) + Pr (Preg|Posc ) Pr (Posc )
= (1 − 0.474576)(0.1416) + (1 − 0.996738)(1 − 0.1416)
= 0.0772
OR:
Pr ((Pos ∩ Pregc ) ∪ (Posc ∩ Preg))
= Pr (Pos ∩ Pregc ) + Pr (Posc ∩ Preg)
= Pr (Pos|Pregc ) Pr (Pregc ) + Pr (Posc |Preg) Pr (Preg)
= (0.08)(1 − 0.07) + (1 − 0.96)(0.07)
= 0.0772
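• The four answers can be verified with a few lines of arithmetic; the Python sketch below is illustrative only and follows the four parts in order:

```python
# Sketch: the pregnancy-test example.
p_preg = 0.07
p_pos_given_preg = 0.96   # true positive rate
p_pos_given_not = 0.08    # false positive rate

# (i) Law of Total Probability
p_pos = p_pos_given_preg * p_preg + p_pos_given_not * (1 - p_preg)

# (ii) Bayes' Theorem
p_preg_given_pos = p_pos_given_preg * p_preg / p_pos

# (iii) Bayes' Theorem on the complementary events
p_not_given_neg = (1 - p_pos_given_not) * (1 - p_preg) / (1 - p_pos)

# (iv) an incorrect result: false positive or false negative (disjoint events)
p_wrong = p_pos_given_not * (1 - p_preg) + (1 - p_pos_given_preg) * p_preg

print(round(p_pos, 4), round(p_preg_given_pos, 4),
      round(p_not_given_neg, 4), round(p_wrong, 4))
# 0.1416 0.4746 0.9967 0.0772
```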

4.4 Counting Outcomes


Importance of Counting Outcomes

• In many probability problems, determining the probability requires us to count


outcomes

• Often, situations arise in which the outcomes are the different ways in which
a set of objects can be ordered or arranged

• The concepts of permutations and combinations can assist us with count-


ing such arrangements

Permutations

• A permutation is an arrangement of a set of objects where the order of the


objects is important

• For example, how many three-letter sequences can be formed from the letters
abc, using each letter only once?

– The answer is 6: abc, acb, bac, bca, cab, and cba


– The ordering is the only feature that makes these six outcomes distinct
– If order were not important, there would only be one distinct outcome,
since all six sequences above consist of the same three letters
– This example involves permutations without repetition, since each letter
could only be used once in the sequence

• We can also have permutations with repetition. For instance, how many three-
letter sequences can be formed from the letters abc, if each letter may be used
multiple times?

– The answer is 3 × 3 × 3 = 27, since each of the three letters could be an


a, b, or c
– A nice way to do this is to draw three blank positions joined by multiplication signs,

__ × __ × __

and then fill in the number of possibilities for each position in the permutation

– For abc without repetition, this would be:

3×2×1
– For abc with repetition, this would be:

3×3×3
• See if you can use the above approach to answer the following:

– A combination lock (which should actually be called a permutation lock)


uses a code with three digits (from 0 to 9). A digit may be repeated.
How many different codes are possible?
– Group A in the 2010 FIFA World Cup had four nations: Uruguay, Mexico,
South Africa, and France. How many different possible orderings of the
group were theoretically possible?

Tree Diagrams

• A tree diagram is a useful way of counting permutations (if the number of


permutations is not too large)

• Figure 4.15 gives a tree diagram for counting the six three-letter sequences of
the letters abc where no letter can be repeated

Figure 4.15: Tree Diagram for Non-Repeating Permutations of letters abc

Permutation Formulas

• The number of permutations of n distinct objects (without repetition) is given


by n! (read as ‘n factorial’)

– Hence, if we want to know how many ways five distinct books can be
arranged on a shelf, the answer is 5! = 120

• The number of permutations of r objects from a set of n distinct objects is

      n Pr = n!/(n − r)!

• n Pr is read as ‘n permute r’

  – Suppose South African Idols is down to the last ten contestants. How
    many possible permutations are there of the top three?

      10 P3 = 10!/(10 − 3)! = 3628800/5040 = 720

• A further explanation of the above formula: we have 10! on top because this is
the number of ways that 10 people can be arranged in order (10 × 9 × 8 × · · · × 1).
The order of the last 7 people (outside the top three) does not matter to us,
and there are 7! ways to arrange them, so we divide by 7!
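The factorial formula for n Pr is easy to compute directly; a short Python sketch (illustrative only, not part of the module) verifies the Idols example:

```python
import math

# n permute r: number of ordered arrangements of r objects chosen from n,
# computed as n!/(n - r)!
def n_permute_r(n, r):
    return math.factorial(n) // math.factorial(n - r)

# Top three of the last ten Idols contestants
print(n_permute_r(10, 3))    # 720

# math.perm (Python 3.8+) computes the same quantity directly
print(math.perm(10, 3))      # 720
```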

• A more general version of this permutation formula can be used to count the
  permutations of a set of n objects of which there are k different types, with
  r1 objects of type 1, r2 objects of type 2, . . ., and rk objects of type k. The
  formula is:

      n!/(r1! r2! · · · rk!)
• For example, suppose a lecturer has ten textbooks in her office: five statistics
textbooks, three mathematics textbooks and two chemistry textbooks. (The
textbooks within each subject are all identical.) How many different ways
could she arrange them on her shelf?

– There are 10 total objects: 5 of class 1, 3 of class 2 and 2 of class 3.
  Therefore, the number of ways is

      10!/(5! 3! 2!) = 2520
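The formula with repeated types can also be checked computationally. The Python sketch below (illustrative only) computes the bookshelf count from factorials, and cross-checks the formula by brute force on a smaller shelf:

```python
import math
from itertools import permutations

def arrangements(counts):
    """Distinct orderings of n objects of several types: n!/(r1! r2! ... rk!)."""
    n = sum(counts)
    result = math.factorial(n)
    for r in counts:
        result //= math.factorial(r)
    return result

# 5 identical statistics, 3 mathematics, and 2 chemistry textbooks
print(arrangements([5, 3, 2]))    # 2520

# Brute-force cross-check on a smaller shelf: 2 of one type, 2 of another
small = len(set(permutations("aabb")))
print(small)                      # 6, matching arrangements([2, 2])
```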
Combinations

• A combination is a selection of r objects taken from n distinct objects where
  the order of the r objects is not considered

• In fact, a combination is nothing other than a subset of size r from a set of


size n

• For example, we want to form a committee of 3 people from a group of 12.


How many possible committees are there?

• The formula for the number of ways to choose r objects from a set of n objects
  is:

      n Cr = (n r) = n!/[r! (n − r)!]

• Note: n Cr and (n r) (the binomial coefficient, written with n stacked above r
  in parentheses) are two symbolic ways of expressing the same quantity and
  are both read as ‘n choose r’

  – To solve our committee problem:

      12 C3 = 12!/[3! (12 − 3)!] = 479001600/[(6)(362880)] = 220

• A further explanation of the above formula: we have 12! in the numerator


because this is the number of ways to arrange 12 people in order. We are
dividing these 12 people into two groups: 3 people who are on the committee
and 9 people who are not. It matters to us who is in which group, but the
order within each group doesn’t matter. So we divide out the number of ways
of ordering the 3 people in the committee (3!), and the number of ways of
ordering the 9 people not in the committee (9!).
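The committee count, and the relation between combinations and permutations described above, can be confirmed in a short Python sketch (illustrative only):

```python
import math

# Committees of 3 chosen from 12 people: order within the committee is irrelevant
print(math.comb(12, 3))    # 220

# Relation to permutations: n Cr = n Pr / r!, since each subset of size r
# corresponds to r! ordered arrangements
check = math.perm(12, 3) // math.factorial(3)
print(check)               # 220
```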

5 Probability Distributions
5.1 Basic Concepts
Declaring a Random Variable

• Recall the definition of a random variable we gave way back in chapter


1: a random variable is a rule which assigns a value to each outcome of an
experiment

• (This is a simplified definition, as the formal mathematical definition of a


random variable is very technical)

• We represent random variables with capital letters such as X and Y , and the
values they take on with small letters such as x and y

• Expressions like X = x and Y = y denote outcomes; we can also use expres-


sions to denote events consisting of more than one outcome, such as 2 ≤ X ≤ 5
or Y > y

• We can declare a random variable pertaining to the experiment of a coin flip


as follows:
      Let X = 1 if the coin comes up Heads
            = 0 if the coin comes up Tails

Probability Distribution Definitions

• A probability distribution describes how probabilities are distributed across


all possible outcomes of a random variable

• The support of a probability distribution refers to the set of values it can


take on (like the sample space of an experiment)

• Random variables can be classified into two categories, which result in major
differences in the form of the probability distributions:

– Discrete random variables only take on a countable number of outcomes,


which typically means the support consists of a certain sequence of inte-
gers
– Continuous random variables take on an uncountable number of out-
comes, which typically means the support consists of real numbers over
a certain interval

5.2 Discrete Probability Distributions
Properties of Discrete Probability Distributions

• Discrete probability distributions must satisfy the following two properties:

  (i) 0 < Pr (X = x) ≤ 1 for all x ∈ S

  (ii) Σ_{x∈S} Pr (X = x) = 1

An Introductory Example: Rolling Two Dice

• Consider a roll of two six-sided dice, which is a common practice in board


games such as Monopoly

• Let X be the sum of the numbers rolled on two dice

• Clearly the outcomes are countable (the support is S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12})
and so X is a discrete random variable

• To fully specify the probability distribution of X we can create a table giving


the probability for each outcome (each possible value of X)

• To determine these probabilities, we need to use our counting principles:

– There are 6 × 6 = 36 permutations of the two dice


– For each value of x, x ∈ S, we must count how many of the 36 outcomes
will cause the dice to add to x
– For example, for x = 2, we must get a 1 on both dice, and there is only
  one way this can happen, so Pr (X = 2) = 1/36
– For x = 3, we must get a 1 on one die and a 2 on the other, and there
  are two ways this can happen (1, 2 and 2, 1), so Pr (X = 3) = 2/36
– See if you can reason how the other probabilities in Table 5.1 were arrived
at

      x      Pr (X = x)
      2      1/36
      3      2/36
      4      3/36
      5      4/36
      6      5/36
      7      6/36
      8      5/36
      9      4/36
      10     3/36
      11     2/36
      12     1/36

Table 5.1: Probability Distribution for the Sum of Two Six-Sided Dice

• We can use the probability distribution table to work out the probabilities of
events involving more than one outcome. For example:
– What is the probability that the dice add to a number greater than 9?

      Pr (X > 9) = Pr (X = 10) + Pr (X = 11) + Pr (X = 12)
                 = (3 + 2 + 1)/36 = 6/36 = 1/6

  Note that if we had said ‘at least 9’ we would have written Pr (X ≥ 9)
  and would have included Pr (X = 9) in the calculation
– What is the probability that the dice add to an odd number?

      Pr (X ∈ {3, 5, 7, 9, 11}) = Pr (X = 3) + Pr (X = 5) + Pr (X = 7)
                                  + Pr (X = 9) + Pr (X = 11)
                                = (2 + 4 + 6 + 4 + 2)/36 = 18/36 = 1/2
• Can you verify that the probability distribution in Table 5.1 satisfies the two
properties of a discrete probability distribution?
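The whole distribution in Table 5.1, and the event probabilities worked out above, can be verified by enumerating all 36 equally likely outcomes. The Python sketch below (illustrative only) uses exact fractions so that no rounding occurs:

```python
from itertools import product
from fractions import Fraction

# Enumerate all 36 equally likely (die1, die2) outcomes and tabulate the sum
counts = {}
for d1, d2 in product(range(1, 7), repeat=2):
    s = d1 + d2
    counts[s] = counts.get(s, 0) + 1

pmf = {s: Fraction(c, 36) for s, c in counts.items()}

print(pmf[2], pmf[7])                          # 1/36 1/6
print(sum(pmf[s] for s in (10, 11, 12)))       # Pr(X > 9) = 1/6
print(sum(pmf[s] for s in (3, 5, 7, 9, 11)))   # Pr(X odd) = 1/2
```

The probabilities sum to 1 across the support, confirming property (ii) of a discrete probability distribution.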

Discrete Probability Distributions: Another Example

• Suppose that a salesperson is going to call on three people tomorrow. Let Y
  be the number of sales that he will make tomorrow; the support of Y is
  S = {0, 1, 2, 3}. From past experience it is known that the probability
  distribution for Y is as given in Table 5.2

      y            0     1     2     3
      Pr (Y = y)   0.3   0.4   0.2   0.1

Table 5.2: Probability Distribution for Number of Sales

• What is the probability that the salesperson will make at most (no more than)
  one sale tomorrow?

      Pr (Y ≤ 1) = Pr (Y = 0) + Pr (Y = 1) = 0.3 + 0.4 = 0.7

Probability Mass Functions

• A probability mass function (PMF) is a function fX (x) that inputs a value


x from the support of a discrete random variable X and outputs the probability
that the random variable takes on that value, i.e. Pr (X = x)

• Consider our definition of the random variable X for a coin flip earlier

• E.g. flipping a coin:

      fX (x) = 1/2   if x = 0, 1
             = 0     otherwise

• Here, x = 0, 1 is the support (corresponding to the two outcomes Tails and


Heads)

• The PMF always outputs a value of 0 for any value not in the support of the
random variable

• To make this explicit we should always include the ‘0 otherwise’ in the defini-
tion of the function

• Can you work out the probability mass function for the sum of two six-sided
dice? (Challenging)

      fX (x) = (6 − |x − 7|)/36   if x ∈ {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
             = 0                  otherwise

• Try substituting values of x into this function and see if you get the same
probabilities as in Table 5.1

• The PMF can be represented using a line-point graph as in Figure 5.1

Figure 5.1: Probability Mass Function of Sum of Two Six-Sided Dice

Probability Mass Function Requirements

• The following must be true for all PMFs:

  (i) 0 < fX (x) ≤ 1 for all x ∈ S

  (ii) Σ_{x∈S} fX (x) = 1

• This is just a restatement of the properties of discrete probability distributions


given earlier

Probability Mass Function Example

• A certain type of component is packaged in lots of 4. Let X be the number of


working components in a lot.

• Determine the value of k for which the function fX (x) below is a valid PMF
      fX (x) = kx   for x ∈ S = {1, 2, 3, 4}
             = 0    otherwise

• Solution:

      Σ_{x∈S} fX (x) = 1
      Σ_{x=1}^{4} kx = 1
      k(1 + 2 + 3 + 4) = 1
      k = 1/10

• Thus the PMF is

      fX (x) = x/10   for x ∈ S = {1, 2, 3, 4}
             = 0      otherwise

Expectation and Variance of a Discrete Random Variable

• The expectation of a discrete random variable is the mean of the associated


probability distribution

• In other words, if we were to repeat the experiment associated with the random
variable many times, the mean of the values we obtain will converge to the
expectation (according to the Law of Large Numbers)

• The expectation of a random variable X is denoted as E (X) and also as µ

• The formula for expectation of a discrete random variable is

      E (X) = Σ_{x∈S} x fX (x)

• In other words, it is a weighted average of all the values in the support, with
probabilities serving as weights

• The variance of a discrete random variable is the variance of the associated


probability distribution

• In other words, if we were to repeat the experiment associated with the random
variable many times, the variance of the values we obtain will converge to the
variance of the distribution (according to the Law of Large Numbers)

• The variance of a random variable X is denoted as Var (X) and also as σ 2

• The formula for variance of a discrete random variable is

      Var (X) = E (X²) − [E (X)]² = Σ_{x∈S} x² fX (x) − µ²

Expectation and Variance of a Discrete Random Variable: Example

• Find the expectation and variance of the random variable X representing the
sum of two six-sided dice
  Expectation:

      E (X) = Σ_{x=2}^{12} x fX (x)
            = Σ_{x=2}^{12} x (6 − |x − 7|)/36
            = 2(1/36) + 3(2/36) + 4(3/36) + 5(4/36) + 6(5/36) + 7(6/36)
              + 8(5/36) + 9(4/36) + 10(3/36) + 11(2/36) + 12(1/36)
            = (2 + 6 + 12 + 20 + 30 + 42 + 40 + 36 + 30 + 22 + 12)/36
            = 252/36 = 7

  Variance:

      E (X²) = Σ_{x=2}^{12} x² fX (x)
             = Σ_{x=2}^{12} x² (6 − |x − 7|)/36
             = 2²(1/36) + 3²(2/36) + 4²(3/36) + 5²(4/36) + 6²(5/36) + 7²(6/36)
               + 8²(5/36) + 9²(4/36) + 10²(3/36) + 11²(2/36) + 12²(1/36)
             = (4 + 18 + 48 + 100 + 180 + 294 + 320 + 324 + 300 + 242 + 144)/36
             = 1974/36

      Var (X) = E (X²) − µ²
              = 1974/36 − 7² = 1974/36 − 49
              = 1974/36 − 1764/36 = 210/36 = 5.8333
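The hand calculation above can be verified directly from the PMF. The Python sketch below (illustrative only) uses exact fractions, so E(X) = 7 and Var(X) = 210/36 = 35/6 come out exactly:

```python
from fractions import Fraction

# PMF of the sum of two six-sided dice: f(x) = (6 - |x - 7|)/36
def f(x):
    return Fraction(6 - abs(x - 7), 36) if 2 <= x <= 12 else Fraction(0)

support = range(2, 13)
mean = sum(x * f(x) for x in support)               # E(X)
var = sum(x**2 * f(x) for x in support) - mean**2   # E(X^2) - [E(X)]^2

print(mean)    # 7
print(var)     # 35/6, i.e. 210/36 = 5.8333...
```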

Expectation and Variance of a Discrete Random Variable

• Derive the expectation and variance of the random variable with PMF

      fX (x) = x/10   for x ∈ S = {1, 2, 3, 4}
             = 0      otherwise

5.3 Special Discrete Probability Distributions


Special Discrete Probability Distributions

• In this section, we will explore a few discrete probability distributions that are
of special importance in statistics (though there are many, many others)

• These are:

– Discrete Uniform Distribution


– Binomial Distribution
– Poisson Distribution
– Hypergeometric Distribution

Discrete Uniform Distribution

• The discrete uniform distribution models an experiment with k distinct


outcomes numbered 1, 2, . . . , k that are all equally likely

• The PMF of a discrete uniform random variable is given by:

      fX (x) = 1/k   for x = 1, 2, . . . , k
             = 0     otherwise

• We use the notation X ∼ Uniform(k) to indicate that X is a random variable


with a discrete uniform distribution with support 1, 2, . . . , k

• Figure 5.2 plots the discrete uniform PMF for k = 20

Figure 5.2: Probability Mass Function of X ∼ Uniform(20)

• As is clear from the graph, the distribution is called ‘Uniform’ because the
probability is uniform (the same) for all outcomes

Discrete Uniform Distribution: Example

• Twenty people, including Ntokozo, have had their names entered in a draw for
  a prize. One name is drawn at random. What is the probability that Ntokozo
  wins the prize?
Let us number Ntokozo as the 1st of the 20 outcomes; thus X = 1 corresponds
to the outcome of Ntokozo winning the prize
      fX (1) = 1/k = 1/20

• There is a 1 in 20 chance that Ntokozo wins

Expectation and Variance of Discrete Uniform Random Variable

• We can easily derive the expectation and variance of a discrete uniform random
variable

• Expectation:

      E (X) = Σ_{x=1}^{k} x fX (x)
            = Σ_{x=1}^{k} x/k
            = (1/k) Σ_{x=1}^{k} x
            = (1/k) [k(k + 1)/2]   (formula for the sum of the integers from 1 to k)
            = (k + 1)/2

• Variance:

      E (X²) = Σ_{x=1}^{k} x² fX (x)
             = Σ_{x=1}^{k} x²/k
             = (1/k) Σ_{x=1}^{k} x²
             = (1/k) [k(k + 1)(2k + 1)/6]   (formula for the sum of the squared integers from 1 to k)
             = (k + 1)(2k + 1)/6

      Var (X) = E (X²) − µ²
              = (k + 1)(2k + 1)/6 − [(k + 1)/2]²
              = (k²/3 + k/2 + 1/6) − (k²/4 + k/2 + 1/4)
              = k²/12 − 1/12 = (k² − 1)/12
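The closed forms (k + 1)/2 and (k² − 1)/12 can be checked against a brute-force computation of the weighted sums. A short Python sketch (illustrative only):

```python
from fractions import Fraction

def uniform_mean_var(k):
    """Brute-force E(X) and Var(X) for X ~ Uniform(k) on {1, ..., k}."""
    p = Fraction(1, k)
    mean = sum(x * p for x in range(1, k + 1))
    var = sum(x**2 * p for x in range(1, k + 1)) - mean**2
    return mean, var

for k in (6, 20):
    mean, var = uniform_mean_var(k)
    # Compare against the closed forms (k + 1)/2 and (k^2 - 1)/12
    print(mean == Fraction(k + 1, 2), var == Fraction(k**2 - 1, 12))   # True True
```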

Binomial Distribution

• The binomial distribution models an experiment with the following condi-


tions:

– The experiment consists of n trials (repetitions)


– In each trial there are only two possible outcomes, which we label as
‘success’ and ‘failure’
– In each trial the probability of success is p and the probability of failure
is 1 − p
– The outcomes of all trials are independent

• Let X be the total number of successes among the n trials

• If the above four conditions hold, then X has a binomial distribution with the
  following PMF:

      fX (x) = (n Cx) p^x (1 − p)^(n−x)   if x = 0, 1, 2, . . . , n
             = 0                          otherwise

• A statistical shorthand for saying ‘X has a binomial distribution with n trials


and probability of success p in each trial’ is: X ∼ Binomial(n, p)

• The logic of the PMF can be described as follows:

– Since the n trials are independent, the multiplicative rule for independent
events applies
– If we want to find Pr (X = x), this means we would have x successes and
  n − x failures
– Let Si be the event of a success in the ith trial; then Pr (Si ) = p for
i = 1, 2, . . . , n and Pr (Sic ) = 1 − p for i = 1, 2, . . . , n (by complement
rule)
– One way to achieve x successes and n − x failures would be:

      S1 ∩ S2 ∩ · · · ∩ Sx ∩ S(x+1)^c ∩ S(x+2)^c ∩ · · · ∩ Sn^c

– That is, the first x trials are successes and the rest (the last n − x) are
  failures
– The probability of this happening would be:

      Pr (S1 ∩ S2 ∩ · · · ∩ Sx ∩ S(x+1)^c ∩ S(x+2)^c ∩ · · · ∩ Sn^c)
        = Pr (S1) Pr (S2) · · · Pr (Sx) Pr (S(x+1)^c) Pr (S(x+2)^c) · · · Pr (Sn^c)   (by independence)
        = p × p × · · · × p (x times) × (1 − p) × (1 − p) × · · · × (1 − p) (n − x times)
        = p^x (1 − p)^(n−x)

– However, this instance where the first x trials are successes and the last
n − x are failures is only one of many possible orderings of x successes
and n − x failures
– Thus we need to consider how many ways there are to order these x
  successes and n − x failures. The answer (see above under Counting
  Outcomes) is

      n!/[x!(n − x)!] = n Cx

– By adding each of these ways of getting x successes and n − x failures
  (which are mutually exclusive), following the additive rule for mutually
  exclusive events, we add p^x (1 − p)^(n−x) a total of n Cx times, giving us
  the PMF formula above

Binomial Distribution: Examples

• A fair coin is flipped 10 times. What is the probability that the result is
‘Heads’ exactly seven times?

• Let us define ‘Heads’ as a success and ‘Tails’ as a failure (in general it is not
  important which outcome we define as a success and which as a failure). Then
  X is the number of ‘Heads’ obtained in the ten coin flips, which is a binomial
  experiment with n = 10 and p = 1/2

• Thus X ∼ Binomial(10, 1/2)

      Pr (X = 7) = (n Cx) p^x (1 − p)^(n−x)
                 = (10 C7) (1/2)^7 (1 − 1/2)^(10−7)
                 = 0.1172

• A library knows that from past experience, 42% of books that are borrowed
are returned after the due date. If 15 books are borrowed today, what is the
probability that less than three of them are returned after the due date?

• Assuming that the return/non-return status of these 15 books by the due
  date is independent, then X, the number of books returned after the due date,
  has X ∼ Binomial(15, 0.42)

      Pr (X < 3) = Pr (X = 0) + Pr (X = 1) + Pr (X = 2)
                 = (15 C0) 0.42^0 (1 − 0.42)^(15−0) + (15 C1) 0.42^1 (1 − 0.42)^(15−1)
                   + (15 C2) 0.42^2 (1 − 0.42)^(15−2)
                 = 0.000283 + 0.003071 + 0.015569 = 0.0189
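Both worked examples follow directly from the binomial PMF formula. A short Python sketch (illustrative only) reproduces them:

```python
import math

# Binomial PMF: f(x) = (n C x) p^x (1 - p)^(n - x)
def binom_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

# Ten fair coin flips: probability of exactly seven heads
print(round(binom_pmf(7, 10, 0.5), 4))                      # 0.1172

# Library example: X ~ Binomial(15, 0.42), probability of fewer than 3 late returns
p_less_3 = sum(binom_pmf(x, 15, 0.42) for x in range(3))
print(round(p_less_3, 4))                                   # 0.0189
```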

• Figure 5.3 plots the binomial distribution PMF for the library example above.
Of course the graph of this function will change for different values of n and p

Figure 5.3: Probability Mass Function of X ∼ Binomial(15, 0.42)

Expectation and Variance of Binomially Distributed Random Variable

• The expectation of a binomially distributed random variable X is E (X) = np,


and the variance is Var (X) = np (1 − p)

• In our ten coin flips example, the distribution mean would be:

µ = E (X) = np = (10)(0.5) = 5

• It makes sense intuitively that the average number of heads in ten coin flips
would be five

• In the library example above, the variance would be:

      σ² = np (1 − p) = 15(0.42)(1 − 0.42) = 3.654

Poisson Distribution

• The Poisson distribution is used to model counts of ‘rare’ events that occur
  with a fixed average rate over a specified period of time (or ‘rare’ objects
  that occur with a fixed average rate over a specified region of space: distance,
  area, or volume)

• The distribution can be thought of as a limiting case of the binomial distribu-


tion

– Suppose we are interested in the number of car accidents X that occur


at a busy intersection during one week
– We could divide the week into n intervals of time, with each interval being
so small that at most one accident could occur in that interval
– We define p as the probability that an accident occurs in a particular
sub-interval and 1 − p as the probability that no accident occurs
– We could then think of this as a binomial experiment
– It can then be shown that:

      lim_{n→∞} (n Cx) p^x (1 − p)^(n−x) = (np)^x e^(−np) / x!

– If we let λ = np (this symbol is read as ‘lambda’) then we have the
  probability mass function of the Poisson distribution:

      fX (x) = λ^x e^(−λ) / x!   for x = 0, 1, 2, . . .
             = 0                 otherwise

• The parameter λ is interpreted as the average rate of events per unit of time
(or average rate of objects per unit of space)

• Note that, although a Poisson-distributed random variable X can only take


on non-negative integer values, λ can be any positive real number: λ > 0

• Figure 5.4 plots the PMF of the Poisson distribution for the case λ = 2.5,
for x = 0, 1, 2, . . . , 10. Note that this is not the entire distribution, since the
support goes up to ∞, but of course we cannot plot up to ∞ on the horizontal
axis

• Our shorthand for saying that the random variable X has a Poisson distribu-
tion with average rate parameter λ is X ∼ Poisson(λ)

Figure 5.4: Probability Mass Function of X ∼ Poisson(2.5)

Poisson Distribution: Example

• The number of complaints that a busy laundry facility receives per day is a
random variable X ∼ Poisson(3.3)

1. What is the probability that the facility will receive less than two
   complaints on a particular day?

       Pr (X < 2) = Pr (X = 0) + Pr (X = 1)
                  = λ^0 e^(−λ)/0! + λ^1 e^(−λ)/1!
                  = 0.036883 + 0.121714 = 0.1586
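The calculation above follows directly from the Poisson PMF. A minimal Python sketch (illustrative only):

```python
import math

# Poisson PMF: f(x) = lambda^x e^(-lambda) / x!
def poisson_pmf(x, lam):
    return lam**x * math.exp(-lam) / math.factorial(x)

# Complaints at the laundry facility: X ~ Poisson(3.3)
p_less_2 = poisson_pmf(0, 3.3) + poisson_pmf(1, 3.3)
print(round(p_less_2, 4))    # 0.1586
```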

Expectation and Variance of the Poisson Distribution

• It can be shown that, if X is a Poisson-distributed random variable,

E (X) = λ
Var (X) = λ

• Thus this distribution has the unusual property that its expectation and vari-
ance are equal

• Another property of the Poisson distribution is that it is ‘scalable’ over time


or space

• Consider the above example. Clearly, the average rate of complaints received
per day is 3.3. If we are interested in the average rate of complaints per five-
day work week instead of per day, we are simply scaling the time interval by
a multiple of 5.

• The scalability property of the Poisson distribution means that if the number of
complaints received in one day is Poisson-distributed with parameter λ = 3.3,
then the number of complaints received in five days is Poisson-distributed with
parameter 5λ = 16.5

Hypergeometric Distribution

• The hypergeometric distribution is the last special discrete probability


distribution that we will consider in this module

• This distribution models an experiment that has some similarities to a binomial


experiment but important differences

• A hypergeometric experiment can be described as follows:

– There is a population of N objects that fall into two categories (we can
call them ‘successes’ and ‘failures’ if we like)
– There are K successes in the population and N − K failures
– We randomly choose n objects from the population without replacement
(that is, after choosing the first object, we do not put it back before
choosing the second; thus the same object cannot be chosen twice)

• If we let X be the number of successes among the n chosen objects, then X
  has a hypergeometric probability distribution with the following PMF:

      fX (k) = (K Ck)(N−K Cn−k)/(N Cn)   for k ∈ {max (0, n − (N − K)) , . . . , min (n, K)}
             = 0                          otherwise

• (Note: we could replace k in the above formula with x if we preferred to do
  so)

• Our shorthand for saying that the random variable X has a hypergeometric
distribution with population size N , number of successes in population K, and
number of trials (draws) n, is X ∼ Hypergeometric(N, K, n)

• Several important differences to highlight between the hypergeometric and


binomial distributions:

– The n ‘trials’ in a hypergeometric experiment are not independent, unlike
the n ‘trials’ in a binomial experiment
– The probability of success is different in each trial of a hypergeometric
experiment, unlike the probability of success in a binomial experiment,
which remains constant across trials (p)
– The support of the hypergeometric distribution is not necessarily the
integers from 0 to n like that of the binomial distribution. The lower
limits and upper limits can have restrictions on them in certain cases:
∗ If the number of objects selected (n) is greater than the number of
failures in the population (N −K) then there must be some successes
selected. Thus the minimum possible value of a hypergeometric ran-
dom variable is max (0, n − (N − K))
∗ If the number of objects selected (n) is greater than the number of
successes in the population (K) then there must be some failures
selected. Thus the maximum possible value of a hypergeometric ran-
dom variable is min (n, K)

• Figure 5.5 plots the hypergeometric PMF for N = 15, K = 9, and n = 8

Figure 5.5: Probability Mass Function of X ∼ Hypergeometric(N = 15, K = 9, n = 8)

• Observe that in this case the smallest possible number of successes is 2 (since
we are drawing 8 objects and there are only 6 failures in the population), while
the greatest possible number of successes is 8 (since we are drawing 8 objects
and there are 9 successes in the population)
Hypergeometric Distribution: Examples
• A box (pictured in Figure 5.6) contains 15 balls, of which 9 are grey and 6 are
white

Figure 5.6: Box Containing 9 Grey Balls and 6 White Balls

• 8 balls are drawn at random without replacement


• What is the probability of drawing less than 4 grey balls?
• Let X be the number of grey balls drawn (so we are defining grey balls as
  ‘successes’ and white balls as ‘failures’)

      Pr (X < 4) = Pr (X = 2) + Pr (X = 3)
                   (X cannot be less than 2 in this case; see above)
                 = (9 C2)(6 C6)/(15 C8) + (9 C3)(6 C5)/(15 C8)
                 = 0.005594 + 0.078322 = 0.0839

• An academic department at a university consists of 20 staff, of whom 12 are
men and 8 are women. A committee is to be formed by randomly choosing
four staff members. What is the probability that the committee consists of
two men and two women?

• Let us define ‘women’ as successes and ‘men’ as failures (the men may not
appreciate this, but we can define it the other way around if we prefer and
still get the same answer)

• Then, if we let X be the number of women selected for the committee, X ∼
  Hypergeometric(N = 20, K = 8, n = 4), and we are interested in Pr (X = 2) =
  fX (2)

      fX (2) = (8 C2)(12 C2)/(20 C4)
             = 0.3814

• A lottery game is played as follows. A player writes down six integers between
  1 and 52 on a card. 52 numbered balls are placed in a machine and 6 balls are
  selected at random. A player wins the jackpot if all six balls that come out of
  the machine match the numbers on his/her card. What is the probability of
  winning the jackpot when playing with one card?

• In this case we can define the six balls whose numbers match those written on
the player’s card as ‘successes’ and the other 52 − 6 = 46 balls as ‘failures’.
We then have X ∼ Hypergeometric(N = 52, K = 6, n = 6) and we want to
know Pr (X = 6) = fX (6) (all six selected balls must be successes to win the
jackpot)
      fX (6) = (6 C6)(46 C0)/(52 C6)
             = 1/20358520

• The probability of winning the jackpot in this lottery is 1 in 20 358 520 (less
than 1 in 20 million!)
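All three hypergeometric examples can be reproduced from the PMF formula with binomial coefficients. A short Python sketch (illustrative only):

```python
import math

# Hypergeometric PMF: f(k) = (K C k)(N-K C n-k) / (N C n)
def hyper_pmf(k, N, K, n):
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)

# Box of balls: N = 15, K = 9 grey, n = 8 drawn; Pr(X < 4) = f(2) + f(3)
print(round(hyper_pmf(2, 15, 9, 8) + hyper_pmf(3, 15, 9, 8), 4))   # 0.0839

# Committee: N = 20, K = 8 women, n = 4 chosen; Pr(exactly 2 women)
print(round(hyper_pmf(2, 20, 8, 4), 4))                            # 0.3814

# Lottery: all 6 of the player's numbers drawn from 52; jackpot odds 1 in C(52, 6)
print(math.comb(52, 6))                                            # 20358520
```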

Expectation of a Hypergeometric Random Variable

• The expectation of a random variable X ∼ Hypergeometric(N, K, n) is given by

      E (X) = nK/N

  For example,

  – In the box-of-balls example above with N = 15, K = 9, and n = 8, the
    expected number of grey balls drawn is E (X) = (8)(9)/15 = 4.8

  – In the committee example above with N = 20, K = 8 and n = 4, the
    expected number of women on the committee is E (X) = (4)(8)/20 = 1.6
  – In the lottery example above with N = 52, K = 6 and n = 6, the
    expected number of balls drawn that match those on the player’s card is
    E (X) = (6)(6)/52 = 0.6923
• The variance of a random variable X ∼ Hypergeometric(N, K, n) is given by

      Var (X) = nK(N − K)(N − n) / [N²(N − 1)]

5.4 Continuous Probability Distributions


Continuous Random Variables
• Recall: a continuous random variable has an uncountable number of outcomes,
and is defined over an interval (typically of real numbers)
• This means that we cannot describe the probability of any individual outcome
(value of the random variable), but only the probability that the random
variable falls within a particular sub-interval of that interval
• The function that describes probabilities associated with a continuous ran-
dom variable is called a probability density function (PDF) and not a
probability mass function
• The PDF is also written as fX (x) but the values of the function are not prob-
abilities
• To get probabilities from a PDF we must integrate the function over a certain
sub-interval from a to b to obtain the area under the curve; this gives us
Pr (a ≤ X ≤ b)
• Note that if X is a continuous random variable (unlike for a discrete random
variable), the event a ≤ X ≤ b is identical to the event a < X < b
Properties of a Probability Density Function
• The following must be true for all PDFs:

  (i) fX (x) ≥ 0 for all x ∈ S

  (ii) ∫_{−∞}^{∞} fX (x) dx = 1

Integrating the PDF


• You will be learning how to integrate functions in your Mathematics 1A module
soon, if you have not already done so
• However, integrating PDFs is beyond the scope of Statistics 1A

5.5 Special Continuous Probability Distributions
Special Continuous Probability Distributions

• In this section, we will explore two continuous probability distributions that


are of special importance in statistics (though there are many, many others)

• These are:

– Continuous Uniform Distribution


– Normal Distribution (also known as Gaussian Distribution)

• We will also learn how to use the normal distribution to calculate approximate
probabilities from the binomial distribution

Continuous Uniform Distribution

• The continuous uniform distribution models an experiment where we ran-


domly select any real number within the interval [a, b], and all numbers in the
interval are equally likely

• The PDF of the continuous uniform distribution is given by

      fX (x) = 1/(b − a)   for x ∈ [a, b]
             = 0           otherwise

• We denote that the random variable X has a continuous uniform distribution


defined on the interval [a, b] with the notation X ∼ Uniform(a, b) (the fact
that there are two parameters rather than one allows us to distinguish this
notation from that of the discrete uniform distribution)

• The most basic case involves the interval [0, 1] (a = 0, b = 1); generating a
Uniform(0, 1) random variable is the first step of all pseudo-random number
generators in computer science (which we briefly discussed back in chapter 1)

• The PDF plot for this case of the continuous uniform distribution is displayed
in Figure 5.7

Figure 5.7: Probability Density Function of X ∼ Uniform(a = 0, b = 1)

• A convenient feature of the continuous uniform PDF is that we do not need


to integrate it to get the area under the curve; since it is just a horizontal line
we simply use the formula for area of a rectangle (length × height)
Continuous Uniform Distribution: Example
• Consider a random variable X ∼ Uniform(1, 5)
• Calculate the probability that X falls between 1.2 and 2.45.
• To find this probability we must find the area under the probability density
function fX (x) between the values 1.2 and 2.45 (see Figure 5.8).
• The probability density function fX (x) is as follows:

      fX (x) = 1/(b − a) = 1/(5 − 1) = 1/4 = 0.25
This function has a constant height of 0.25. Thus the required probability is:
Pr (1.2 ≤ X ≤ 2.45) = length × height
= (2.45 − 1.2) × 0.25
= 1.25 × 0.25
= 0.3125

Figure 5.8: Area under PDF of X ∼ Uniform(a = 1, b = 5) representing
Pr (1.2 ≤ X ≤ 2.45)

Introducing the Normal Distribution

• Consider the two histograms of student marks in Figure 5.3, one for a group
of 50 students and one for a group of 5000 students

Figure 5.3: Histograms of Marks for Groups of 50 Students (left) and 5000 Students
(right)

• Do you notice that the histogram on the right is ‘bell shaped’? This is because
marks, like many other phenomena in the world, usually tend to follow a
Normal Distribution (also known as Gaussian Distribution) which has a
probability density function with the famous ‘bell-curve’ shape shown in Figure
5.9

Figure 5.9: Probability Density Function of X ∼ N (µ = 0, σ = 1)

• We can see from Figure 5.3 that as the number of observations increases, the
histogram matches the bell shape of this curve more and more closely

The Normal Distribution

• The Normal Distribution is a very important probability distribution in


terms of applications

• It is a continuous distribution with parameters µ ∈ (−∞, ∞) and σ ∈ (0, ∞)


(so named because µ is the expectation of the distribution and σ 2 is the vari-
ance)

• Shorthand notation for expressing that the random variable X is normally


distributed with parameters µ and σ is X ∼ N (µ, σ)

• The PDF is as follows

      fX (x) = [1/(σ√(2π))] e^(−(x−µ)²/(2σ²))   for −∞ < x < ∞

• This means that in order to find Pr (x1 ≤ X ≤ x2) where X ∼ N (µ, σ), we
  must take the integral

      ∫_{x1}^{x2} fX (x) dx

where fX (x) is as above in order to find the area under the PDF between x1
and x2

• This integral is very difficult to solve (even if you have already learned some
integration techniques in Mathematics 1A)

• Instead we will use two tricks to get probabilities from the normal distribution

• The first trick allows us to transform any random variable X ∼ N (µ, σ) to


a standard normal random variable, which refers to a random variable
Z ∼ N (0, 1) (it is a convention to use Z for standard normal random variables)

– Note that the function in Figure 5.9 appears to drop to 0 as x approaches
  −4 and 4, but in fact it never reaches 0; hence the function is defined for
  x ∈ (−∞, ∞)

The Standard Normal Distribution

• The Standard Normal Distribution (also called Z Distribution) is a special


case of the Normal Distribution where µ = 0 and σ = 1

• Any random variable X ∼ N (µ, σ) can be transformed to a standard normal
  random variable Z ∼ N (0, 1) using this formula:

      Z = (X − µ)/σ

• This transformation (which you may recall as similar to calculating a ‘stan-
dard score’ from chapter 3) is very useful because it means that if we have
probabilities for a standard normal distribution we can use them to obtain
probabilities for any normal distribution
• The PDF of the standard normal distribution is given by
fZ(z) = (1/√(2π)) e^(−z²/2) for −∞ < z < ∞

• If we want to find Pr (Z < z) for any value, this requires us to calculate the
integral ∫_{−∞}^{z} fZ(x) dx

• This is equivalent to finding the area under the PDF as displayed in Figure
5.10

Figure 5.10: Graph of area under Standard Normal PDF between −∞ and z, rep-
resenting Pr (Z < z)

• This is still not possible to solve by hand; but using numerical integration
techniques (which you will learn about in your Numerical Methods modules)
it can be accurately approximated

• At the back of your notes you will find a ‘Z Table’ that gives Pr (Z < z), correct
to four decimal places, for any z value (up to 2 decimal places) between 0 and
3.49
Procedure for Using the Z Table to Calculate Pr (Z < z)
• Suppose we want to know Pr (Z < 0.42)
• We look up the value 0.42 in the Z Table as shown in Figure 5.11 and find
that Pr (Z < 0.42) = 0.6628

Figure 5.11: Lookup of Value z = 0.42 in Z Table
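• The notes mention MS Excel and R for such calculations; as an illustrative sketch (my choice of tool, not one prescribed by the course), Python's standard-library `statistics.NormalDist` class (Python 3.8+) reproduces the table lookup:

```python
from statistics import NormalDist

# Standard normal distribution; cdf(z) returns Pr(Z < z),
# the quantity tabulated in the Z Table
Z = NormalDist(mu=0, sigma=1)

print(round(Z.cdf(0.42), 4))  # 0.6628, matching the table lookup
```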

Procedure for Using the Z Table to Calculate Pr (Z > z)


• The Z Table directly gives us only ‘less than’ probabilities from the standard
normal distribution
• However, since we can ignore ‘equal to’ probabilities when dealing with a
continuous random variable, the event Z > z is the complement of the event
Z < z, i.e. Z > z = [Z < z]c
• Therefore, by the complement rule of probability,
Pr (Z > z) = 1 − Pr (Z < z)

• Thus, in order to find Pr (Z > z) from the Z Table, we look up the value of z
and find Pr (Z < z) and then subtract it from 1 to get Pr (Z > z)
• For example, suppose we want to find Pr (Z > 2)
Pr (Z > 2) = 1 − Pr (Z < 2)
= 1 − 0.9772 (from table)
= 0.0228

Finding Standard Normal Probabilities for Negative z Values

• Observe that the Z Table does not contain negative values of z, yet a standard
normal random variable has a mean of 0 and can take on negative values

• How then would we find the probability Pr (Z < z) where z is a negative


number?

• The answer is that we use the fact that the normal distribution is perfectly
symmetrical. This implies that

Pr (Z < z) = Pr (Z > −z)

where −z is a positive number. We saw above that to take a ‘greater than’


probability we use the complement rule, and therefore

Pr (Z < z) = 1 − Pr (Z < −z)

• Thus we would look up the positive value −z in the table and take 1 minus
the answer to get Pr (Z < z) if z is a negative number

• For example, suppose we want to find Pr (Z < −1.09)

Pr (Z < −1.09) = 1 − Pr (Z < 1.09)


= 1 − 0.8621
= 0.1379

• How would you find Pr (Z > z) where z is a negative number?

– Combine the complement rule approach and the symmetry approach

Pr (Z > z) = 1 − Pr (Z < z)
= 1 − Pr (Z > −z)
= 1 − [1 − Pr (Z < −z)]
= Pr (Z < −z)

– Thus we would look up −z in the table where −z is a positive number


and we would get our final answer directly (no need to subtract)

Finding Standard Normal Probabilities for z Values above 3.49

• You will notice that the Z Table only goes up to 3.49, but the standard normal
distribution is defined for z ∈ (−∞, ∞)

• How then do we find Pr (Z < z) where z > 3.49?

• The answer is going to be close to 1; however, if we need to be more precise than


that, we can use software such as MS Excel or R to get the exact probability
we need

Procedure for Using the Z Table to Calculate Pr (z1 < Z < z2 )

• What if we want to find the probability that a standard normal random vari-
able falls between two values z1 and z2 such that z1 < z2 ?

• If we look at Figure 5.12 we can observe the area under the PDF corresponding
to Pr (z1 < Z < z2 )

Figure 5.12: Graph of area under Standard Normal PDF between z1 and z2 , repre-
senting Pr (z1 < Z < z2 )

• It is clear from the graph that the area we want to calculate is equivalent to
∫_{−∞}^{z2} fZ(x) dx − ∫_{−∞}^{z1} fZ(x) dx

that is, the area under the PDF from −∞ to z2 minus the area under the
PDF from −∞ to z1

• Since we have expressed the ‘between’ probability in terms of two ‘less than’
probabilities, we can use the Z Table to get these two ‘less than’ probabilities
and then subtract them

• In other words,

Pr (z1 < Z < z2 ) = Pr (Z < z2 ) − Pr (Z < z1 )

• For example, suppose we want to find Pr (−0.9 < Z < 1.33)

Pr (−0.9 < Z < 1.33) = Pr (Z < 1.33) − Pr (Z < −0.9)


= Pr (Z < 1.33) − [1 − Pr (Z < 0.9)]
= 0.9082 − [1 − 0.8159]
= 0.7241
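• The same ‘between’ calculation can be checked in software; a minimal Python sketch using the standard library's `statistics.NormalDist`:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, mu = 0, sigma = 1

# Pr(-0.9 < Z < 1.33) = Pr(Z < 1.33) - Pr(Z < -0.9)
p = Z.cdf(1.33) - Z.cdf(-0.9)
print(round(p, 4))  # 0.7242 (the table-based hand answer 0.7241 differs by rounding)
```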

Reverse Lookup in Z Table

• We can also use the Z table for a reverse lookup

• Suppose we want to find out for what value of z the following statement is
true: Pr (Z > z) = 0.025

• Since the table gives us Pr (Z < z) values and not Pr (Z > z) values, we first
rearrange our statement:

Pr (Z > z) = 0.025
1 − Pr (Z < z) = 0.025
Pr (Z < z) = 0.975

• Next we look inside the table for the probability 0.975

• We find that a probability of 0.9750 corresponds to z = 1.96

• Thus z = 1.96 is the value of z for which the statement Pr (Z > z) = 0.025 is
true (approximately)

• Statistical software (e.g., MS Excel, R) contains functions that allow us to do


such reverse lookups more easily and accurately
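• As an illustration of such a reverse lookup in software (a sketch using Python's standard library, not the course's prescribed tool), the inverse CDF does the job directly:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

# Reverse lookup: find z such that Pr(Z > z) = 0.025, i.e. Pr(Z < z) = 0.975
z = Z.inv_cdf(0.975)
print(round(z, 2))  # 1.96, matching the table-based reverse lookup
```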

Normal Distribution Example

• The length of human pregnancies from conception to birth varies according to


a distribution that is approximately normal with mean 266 days and standard
deviation 16 days. Use the normal distribution to determine the following:

1. What is the probability that a pregnancy lasts less than 245 days?
2. What is the probability that a pregnancy lasts between 270 and 280 days?
3. There is an 80% probability that a pregnancy lasts more than x days.
Determine x, correct to the nearest whole number.

• The solution is as follows:

1. Let X be the random variable representing the length of a pregnancy. We
know X is normally distributed, so we can transform X to a standard
normal random variable Z as follows:

Z = (X − µ)/σ = (X − 266)/16

Thus:

Pr(X < 245) = Pr((X − 266)/16 < (245 − 266)/16)
            = Pr(Z < −21/16) = Pr(Z < −1.31)
            = 1 − Pr(Z < 1.31) = 1 − 0.9049 = 0.0951

2. What is the probability that a pregnancy lasts between 270 and 280 days?

 
Pr(270 < X < 280) = Pr((270 − 266)/16 < (X − 266)/16 < (280 − 266)/16)
                  = Pr(4/16 < Z < 14/16) = Pr(0.25 < Z < 0.88)
                  = Pr(Z < 0.88) − Pr(Z < 0.25)
                  = 0.8106 − 0.5987 = 0.2119

3. There is an 80% probability that a pregnancy lasts more than x days. Deter-
mine x, correct to nearest whole number.

Pr(X > x) = 0.8
Pr(Z > (x − 266)/16) = 0.8

Let z = (x − 266)/16
Pr(Z > z) = 0.8
1 − Pr(Z < z) = 0.8
Pr(Z < z) = 0.2
Pr(Z < −z) = 1 − 0.2 = 0.8 (by the symmetry property of the normal distribution)
Pr(Z < 0.84) = 0.7995 ≈ 0.8 (from table)
Thus z ≈ −0.84
Thus −0.84 ≈ (x − 266)/16, which gives us:
x ≈ 252.56 ⇒ 253 (nearest whole number)
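• All three answers can be checked in software; a minimal Python sketch using the standard library (small differences from the hand answers arise because the table method rounds z to two decimals):

```python
from statistics import NormalDist

# X = length of a pregnancy in days, X ~ N(266, 16) as given in the example
X = NormalDist(mu=266, sigma=16)

print(round(X.cdf(245), 4))               # Pr(X < 245), about 0.0947
print(round(X.cdf(280) - X.cdf(270), 4))  # Pr(270 < X < 280), about 0.2105
print(round(X.inv_cdf(0.2)))              # x with Pr(X > x) = 0.8, i.e. 253 days
```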

5.6 Normal Approximation to the Binomial Distribution


Normal Approximation to Binomial Distribution

• It is tedious to calculate probabilities from the binomial distribution when the


number of trials n is large

• For instance, if n = 100 and p = 0.3 and we want to know Pr (25 ≤ X ≤ 75),
we have to calculate Pr (X = 25)+Pr (X = 26)+Pr (X = 27)+· · ·+Pr (X = 75)
which takes a long time

• But notice in the probability mass function graphs in Figure 5.13 that as the
number of trials n increases, the graph more and more closely resembles the
bell-shaped curve of the normal probability density function:

Figure 5.13: Probability Mass Function for X ∼ Binomial(10, 0.3) (left) and for
X ∼ Binomial(100, 0.3) (right)

• This gives us an idea: since when n is large, the binomial distribution behaves
similar to a normal distribution, why don’t we approximate the binomial prob-
abilities using the normal distribution?

• We know that the expectation of a binomial distribution is np and the variance


is np(1 − p); hence we can try to approximate X ∼ Binomial(n, p) with
Y ∼ N(µ = np, σ = √(np(1 − p)))
• We can then perform the transformation Z = (Y − µ)/σ and use the Z Table
to obtain probabilities on Y, which will then be approximations for probabilities
on X

• Consider the following example: we want to calculate Pr (2 ≤ X ≤ 6) for a


random variable X ∼ Binomial(11, 0.5)

– NOTE: in order for the normal approximation to be sufficiently accurate,


a convention is that both of the following conditions must hold:
(i) np ≥ 5
(ii) n(1 − p) ≥ 5
– We should always verify that these two conditions hold before using the
normal approximation to binomial distribution

• In this case, np = 11(0.5) = 5.5 > 5 and n(1 − p) = 11(0.5) = 5.5 > 5 so the
approximation is valid

• It seems we should use the normal approximation as follows:

µ = np = 11(0.5) = 5.5
σ = √(np(1 − p)) = √(11(0.5)(0.5)) = 1.6583
Let Y ∼ N(5.5, 1.6583)
Pr(2 ≤ X ≤ 6) ≈ Pr(2 ≤ Y ≤ 6)
              = Pr((2 − 5.5)/1.6583 ≤ Z ≤ (6 − 5.5)/1.6583)
              = Pr(−2.11 ≤ Z ≤ 0.30)
                (note that ≤ and < are the same for continuous random variables)
              = Pr(Z < 0.30) − Pr(Z < −2.11)
              = 0.6179 − [1 − Pr(Z < 2.11)]
              = 0.6179 − [1 − 0.9826] = 0.6005

• The exact answer to this question, if we used the ordinary binomial method
Pr (2 ≤ X ≤ 6) = Pr (X = 2)+Pr (X = 3)+Pr (X = 4)+Pr (X = 5)+Pr (X = 6),
is 0.7197. So our approximation is very bad. What went wrong?

Normal Approximation to Binomial Distribution: Continuity Correc-


tion

• The problem with our approximation is that we are approximating a discrete


random variable (which can only take on integer values) with a continuous
random variable (which can take on real-numbered values).

• Hence, for the point X = 2, for instance, we need to take into account the
area just to the left and just to the right

• Specifically, any value between X = 1.5 and X = 2.5 is still closer to X = 2


than to any other integer

• This concept is illustrated in Figure 5.14

Figure 5.14: Illustration of Normal Approximation without and with Continuity
Correction for Pr (2 ≤ X ≤ 6)

• In the first graph on the left, we are approximating Pr (2 ≤ X ≤ 6) by taking
the area under the normal distribution curve (with µ = np and σ = √(np(1 − p)))
from x = 2 to x = 6
• However, the regions from 1.5 to 2 and from 6 to 6.5 (shaded in black) should
have been included, because they still ‘belong to’ the integers 2 and 6 respectively,
since we have now moved from a discrete distribution to a continuous distribution
• Hence, in the second graph, we approximate Pr (2 ≤ X ≤ 6) by taking the
area under the normal distribution curve from y = 1.5 to y = 6.5
• Recalculating, we get:
 
Pr(1.5 ≤ Y ≤ 6.5) = Pr((1.5 − 5.5)/1.6583 ≤ Z ≤ (6.5 − 5.5)/1.6583)
                  = Pr(−2.41 ≤ Z ≤ 0.60)
                  = Pr(Z < 0.60) − Pr(Z < −2.41)
                  = 0.7257 − [1 − Pr(Z < 2.41)]
                  = 0.7257 − [1 − 0.9920] = 0.7177

• We can see that this is now much closer to the exact answer of 0.7197
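• The comparison of the exact binomial answer with the approximations, with and without the continuity correction, can be reproduced in software; a minimal Python sketch using the standard library:

```python
from math import comb
from statistics import NormalDist

n, p = 11, 0.5
mu = n * p                        # 5.5
sigma = (n * p * (1 - p)) ** 0.5  # 1.6583

# Exact answer: sum the binomial PMF from 2 to 6
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(2, 7))

Y = NormalDist(mu, sigma)
no_correction = Y.cdf(6) - Y.cdf(2)        # area from 2 to 6
with_correction = Y.cdf(6.5) - Y.cdf(1.5)  # area from 1.5 to 6.5

print(round(exact, 4))            # 0.7197
print(round(no_correction, 4))    # about 0.601 -- a poor approximation
print(round(with_correction, 4))  # about 0.719 -- close to the exact answer
```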

Continuity Correction - Continued!

• What if we wanted to estimate Pr (2 < X < 6)?


• Because X is a discrete, binomially distributed random variable, it makes a
big difference whether we have < or ≤
• In the continuous case, the integer ‘2’ is represented by the interval from 1.5
to 2.5, and the integer ‘6’ is represented by the interval from 5.5 to 6.5

• Hence 2 < X < 6 translates to 2.5 < Y < 5.5 on the continuous scale

• This is illustrated in Figure 5.15

Figure 5.15: Illustration of Normal Approximation without and with Continuity


Correction for Pr (2 < X < 6)

• In the graph on the left, we take the area under the Normal curve from 2 to 6.
But the region from 2 to 2.5 still ‘belongs’ to 2 and is not greater than 2 from
a continuous point of view. Similarly, the region from 5.5 to 6 still ‘belongs’
to 6 and is not less than 6 from a continuous point of view.

• In effect, when translating an integer from a discrete scale to a continuous


scale, we need to think of the integer i as comprising the interval from i − 0.5
to i + 0.5

• Hence, a better approximation for Pr (2 < X < 6) will be achieved when, as


in the graph on the right, we take the area under the normal curve from 2.5
to 5.5, omitting the regions shaded black (from 2 to 2.5 and from 5.5 to 6)

Normal Approximation - Further Example

• Returning to our original motivational example:

• If X is a binomially distributed random variable with n = 100 and p = 0.3,
use the normal approximation to estimate Pr (25 ≤ X ≤ 75)

µ = np = 100(0.3) = 30
σ = √(np(1 − p)) = √(100(0.3)(1 − 0.3)) = 4.5826
Let Y ∼ N(30, 4.5826)
Pr(25 ≤ X ≤ 75) ≈ Pr(24.5 ≤ Y ≤ 75.5)
                = Pr((24.5 − 30)/4.5826 ≤ Z ≤ (75.5 − 30)/4.5826)
                = Pr(−1.20 ≤ Z ≤ 9.93)
                = Pr(Z < 9.93) − Pr(Z < −1.20)
                = 1 − [1 − Pr(Z < 1.20)]
                = 1 − [1 − 0.8849] = 0.8849

• Note: the exact answer (to four decimal places) is 0.8864, so we are not far off

Normal Approximation - Exercises

• Use the conventional rule to identify which of the following probabilities in-
volving a binomial random variable can be adequately approximated using the
normal distribution:

1. Pr (11 < X ≤ 14) where X ∼ Binomial(25, 0.6)


2. Pr (13 < X ≤ 42) where X ∼ Binomial(43, 0.01)
3. Pr (X > 18) where X ∼ Binomial(300, 0.05)
4. Pr (X ≤ 2) where X ∼ Binomial(8, 0.5)

• In those cases where the normal approximation is appropriate, use it to esti-


mate the probability

6 Introduction to Sampling Distributions


6.1 Background to Sampling Distributions
The Concept of Sampling in Statistics

• One of the major tasks of statisticians is to quantify characteristics of a target


population

• For instance, if we take ‘residents of South Africa’ as our target population, the
government agency Statistics South Africa is tasked with providing accurate
information on characteristics such as:

– The size of the population (number of residents in South Africa)


– The unemployment rate (proportion of South African labour force that
is unemployed)

– The average household income (mean income of households)
– The fertility rate (mean number of children born per woman of childbear-
ing age)
– Etc.

• The above population quantities can be referred to as parameters

• It is difficult and very expensive to obtain data from a population of well over
50 million people in order to exactly quantify the parameter

– Though we should note that Statistics South Africa does embark on an


effort to collect data from the entire population once every ten years, in
the Census

• For this reason, Statistics South Africa usually obtains data from a sample,
that is, a subset of the population

• The quantity of interest is calculated on the sample data and used as an


estimate of the quantity for the whole population

– This sample estimate is called a statistic

• In Statistics 2B, you will learn in much greater detail how sampling should be
done in order to obtain good statistics

• For now, it is enough for you to understand what a sample is and what a
statistic is

Statistics as Random Variables

• Recall: a statistic is a quantity calculated from a sample in order to estimate


an unknown quantity pertaining to a population

• Recall also: a random variable is a rule which assigns a value to each outcome
of an experiment. It has error or uncertainty; its value cannot be known for
certain until the experiment takes place

• A random variable is usually described in terms of its probability distribu-


tion, which tells us how likely its various outcomes are

• In frequentist statistics (the branch of statistics we are doing), it is assumed


that population parameters are fixed, not random, although they are unknown

– (In another branch of statistics, called Bayesian statistics, this assumption
is not made: parameters are treated as random variables!)

• Consider this simple example: we are going to flip a coin n times, because we
want to know whether the coin is fair, i.e. whether the probability of ‘Heads’
is equal to the probability of ‘Tails’ (both 0.5)

– In this case, there is (in theory) an infinite population of coin flips in the
universe, from which we are going to take a sample of size n (you can see
that a ‘population’ is not always clearly defined)
– The parameter is p, the probability of getting ‘Heads’
– The random variable is X, the number of times the coin comes up
‘Heads’ in our sample of n flips
– The statistic is p̂ = X/n, the proportion of ‘Heads’ in our sample of n
flips. This will be our estimator of p (when we put a ˆ on a parameter it
denotes a statistic that is an estimator of that parameter).
– The probability distribution that defines X is the binomial distribu-
tion, because what we have described is a binomial experiment

• Now here is the big new insight: a statistic, as we have defined it above,
is also a random variable!

• This sounds strange at first: after all, once we have flipped the coin n times
we know the value of the statistic p̂, so how is it random? But remember,
that is true of any random variable: we know the actual outcome after the
experiment has been done. But not before! Before we flip any coins, we don’t
know the value of p̂ and various outcomes are possible; hence it is random

• In fact, it is easy to see mathematically that p̂ is a random variable, because
its formula, p̂ = X/n, shows that it is a function of a random variable, X. And
any function of a random variable must also be a random variable.

• Hence, a statistic is a random variable. Indeed, another way to define a statistic


is this:

– A statistic is a function of the observable random variables in a sample
and known constants.
– For example, p̂ is a function of a random variable X and a known constant n
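• The fact that p̂ is a random variable can be illustrated by simulation; a minimal Python sketch (the seed, sample size and number of repetitions are arbitrary choices for illustration):

```python
import random

random.seed(1)   # arbitrary seed so the run is reproducible
n, p = 100, 0.5  # flips per experiment; true probability of heads (unknown in practice)

# Repeat the entire n-flip experiment five times: each repetition yields a
# different value of p_hat = X/n, showing that the statistic is itself random
p_hats = []
for _ in range(5):
    x = sum(random.random() < p for _ in range(n))  # X ~ Binomial(n, p)
    p_hats.append(x / n)

print(p_hats)  # five different realisations of the statistic p_hat
```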

Sampling Distributions

• Every random variable behaves according to a probability distribution

• Hence, because a statistic is a random variable, a statistic also has its own
probability distribution

• The probability distribution of a statistic drawn from a sample is called a


sampling distribution

• The focus of this chapter is on describing the sampling distribution of one


commonly used statistic: the sample mean

6.2 Sampling Distribution for a Sample Mean
The Sampling Distribution of the Sample Mean of Normally Distributed
Random Variables

• The probability distribution of the mean of a sample is defined in the following


theorem:

Theorem 1. Let Y1, Y2, . . . , Yn be a sample of independent random variables
such that Yi ∼ N(µ, σ), i = 1, 2, . . . , n. Then Ȳ = (1/n) ∑_{i=1}^{n} Yi ∼ N(µ, σ/√n).

• Stated in words, if we have an independent random sample taken from a
normal distribution with expectation µ and variance σ², the sample mean is
also normally distributed with the same expectation µ but with variance σ²/n
• If you are interested in seeing a proof of this theorem, see (?), pages 331-332.
• It also follows from this theorem that Z = (Ȳ − µȲ)/σȲ = (Ȳ − µ)/(σ/√n) ∼ N(0, 1)

An Illustration of the Sampling Distribution of a Sample Mean

• Let us revisit the human pregnancy example. Let us assume that the length
of a human pregnancy is normally distributed with a mean of 266 days and a
standard deviation of 16 days. But suppose we don’t know this mean, and we
want to estimate it by collecting data from a random sample of mothers

• Let us consider four possible sample sizes: n = 1, n = 5, n = 10, and n = 50.



• According to sampling distribution theory, in each case E(Ȳ) = µ = 266;
that is, the mean of the sampling distribution of Ȳ equals the mean of the
probability distribution of Yi

• However, because Var(Ȳ) = σ²/n, the variance of the sampling distribution of
Ȳ decreases as the sample size increases

• This makes sense: if we collect more data, we would expect to have a more
precise estimate of the average length of a pregnancy

• The effect of increasing sample size on the sampling distribution is shown in


Figure 6.1

Figure 6.1: Probability Distribution of Sample Mean of Pregnancy Durations for
Different Sample Sizes

• What we can see is that if we were to take a sample of just one mother, and
another researcher were to do the same, and a third researcher were to do the
same, and so forth, then when we all compared our results they would be very
spread out: they would have a large variance. One researcher might estimate
the average pregnancy length to be 240 days, and another, 290 days

• However, if we were to take a sample of 50 mothers, and another researcher


were to do the same, and a third researcher, and so forth, then when we all
compared our results they would be much closer together: they would have
a small variance. In fact, almost certainly all the researchers would have
obtained a sample mean somewhere between 256 and 276
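• This thought experiment of many researchers each taking their own sample can be simulated; a minimal Python sketch using the standard library (the seed and repetition count are arbitrary choices):

```python
import random
from statistics import mean, stdev

random.seed(42)
mu, sigma = 266, 16  # pregnancy-length parameters from the example

def sample_mean(n):
    """Draw a sample of n pregnancy lengths and return its sample mean."""
    return mean(random.gauss(mu, sigma) for _ in range(n))

# 2000 'researchers' each take a sample of size n; compare the spread of their means
spread = {}
for n in (1, 5, 10, 50):
    means = [sample_mean(n) for _ in range(2000)]
    spread[n] = stdev(means)
    print(n, round(spread[n], 2), round(sigma / n**0.5, 2))
    # observed spread should be close to the theoretical sigma/sqrt(n)
```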

Sampling Distribution of a Sample Mean: Example Problem 1

• The amount of time university lecturers devote to their jobs per week is nor-
mally distributed with a mean of 52 hours and a standard deviation of 6 hours.
It is assumed that all lecturers behave independently.

1. What is the probability that a lecturer works for more than 60 hours per
week?

Let Y1 be the number of hours worked per week by this lecturer. (Note
that we could equivalently define Ȳ as the sample mean of this sample
of n = 1 observation; in this case we could use the sampling distribution
approach and get the same answer.)
 
Pr(Y1 > 60) = Pr((Y1 − µ)/σ > (60 − µ)/σ)
            = Pr(Z > (60 − 52)/6)
            = Pr(Z > 1.33)
            = 1 − Pr(Z < 1.33) = 1 − 0.9082 = 0.0918

2. What is the probability that the mean amount of work per week for four
randomly selected lecturers is more than 60 hours?
Let Y1 , Y2 , Y3 , Y4 be the number of hours worked per week by these four
respective lecturers. Then, according to the sampling distribution theo-
rem, Ȳ is a normally distributed random variable with a mean of µ = 52
and a standard deviation of σ/√n = 6/√4 = 3.

Pr(Ȳ > 60) = Pr((Ȳ − µ)/(σ/√n) > (60 − µ)/(σ/√n))
           = Pr(Z > (60 − 52)/(6/√4))
           = Pr(Z > 2.67)
           = 1 − Pr(Z < 2.67) = 1 − 0.9962 = 0.0038

We can see that the probability is much smaller in this case. Does this
agree with the graph above in terms of the effect of increasing sample
size on the spread of the sampling distribution?
3. What is the probability that if four lecturers are randomly selected, all
four work for more than 60 hours?
Because we have assumed that all lecturers are independent, we can use
the multiplication rule for independent events, which says that Pr (A ∩ B) =
Pr (A) Pr (B) if events A and B are independent. In this case we have
four events: Y1 > 60, Y2 > 60, Y3 > 60 and Y4 > 60. Of course, the mul-
tiplication rule for independent events can be extended to any number of
independent events. Thus:

Pr(Y1 > 60 ∩ Y2 > 60 ∩ Y3 > 60 ∩ Y4 > 60)
= Pr(Y1 > 60) Pr(Y2 > 60) Pr(Y3 > 60) Pr(Y4 > 60)
= [Pr(Y1 > 60)]⁴ (since the four random variables are identically distributed)
= 0.0918⁴ = 0.000071
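The three answers can be verified in software; a minimal Python sketch using the standard library (small differences from the hand answers arise from rounding z to two decimals in the table method):

```python
from statistics import NormalDist

mu, sigma = 52, 6
Y = NormalDist(mu, sigma)

# 1. One lecturer works more than 60 hours
p_one = 1 - Y.cdf(60)
print(round(p_one, 4))  # about 0.0912 (hand answer 0.0918 uses z rounded to 1.33)

# 2. Sample mean of n = 4 lecturers exceeds 60 hours: Ybar ~ N(52, 6/sqrt(4))
Ybar = NormalDist(mu, sigma / 4**0.5)
print(round(1 - Ybar.cdf(60), 4))  # about 0.0038

# 3. All four lecturers independently exceed 60 hours
print(f"{p_one**4:.6f}")  # about 0.000069
```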

Sampling Distribution of a Sample Mean: Example Problem 2

• The manufacturer of cans of tuna that are supposed to have a net weight of
200 grams tells you that the net weight is actually a normal random variable
with a mean of 201.9 grams and a standard deviation of 5.8 grams. Suppose
you draw a random sample of 32 cans.

1. Find the probability that the mean weight of the sample is less than 199
grams.

 
Pr(Ȳ < 199) = Pr((Ȳ − µ)/(σ/√n) < (199 − µ)/(σ/√n))
            = Pr(Z < (199 − 201.9)/(5.8/√32))
            = Pr(Z < −2.83)
            = 1 − Pr(Z < 2.83) = 1 − 0.9977 = 0.0023

2. Suppose your random sample of 32 cans of tuna produced a mean weight


that is less than 199 grams. Comment on the statement made by the
manufacturer.
If the distribution of the net weight stated by the manufacturer is true,
then we had only a 0.0023 probability of achieving a mean weight of less
than 199 grams in our sample. Since we did achieve such a weight, either
something extremely improbable has happened, or the manufacturer has
given us an incorrect probability distribution. The latter is more likely:
probably either the mean weight is below 201.9 grams or the standard
deviation is greater than 5.8 grams.

Sampling Distribution of a Sample Mean: Example Problem 3 (Challeng-


ing)

• A teacher is taking some of her learners to an annual mathematics competition,


where competitors get a score from 0 to 100. The teacher knows from past
experience that her learners’ scores are normally distributed with a mean of
58 and a standard deviation of 13. The teacher wants to be 90% sure that
the mean score achieved by her learners this year is at least 50. What is the
minimum number of learners she should take to the competition?
To answer this question we must recognize that we will still be using the
sampling distribution of the mean; what has changed is that the unknown is
no longer the probability but the sample size, n.

 
Pr(Ȳ > 50) = Pr((Ȳ − µ)/(σ/√n) > (50 − µ)/(σ/√n)) = 0.9
Pr(Z > (50 − 58)/(13/√n)) = 0.9

Let z = (50 − 58)/(13/√n)
Pr(Z > z) = 0.9
Pr(Z < −z) = 0.9
−z ≈ 1.28
z ≈ −1.28
(50 − 58)/(13/√n) ≈ −1.28
√n ≈ −1.28(13)/(−8) ≈ 2.08
n ≈ 4.33

The calculation gives n ≈ 4.33; since she can only take an integer number of
learners, we must round up. The teacher should take at least 5 learners to the
competition.
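Rather than solving for n algebraically, one can also search for the smallest qualifying n in software; a minimal Python sketch using the standard library:

```python
from statistics import NormalDist

mu, sigma, target, level = 58, 13, 50, 0.9
Z = NormalDist()

# Find the smallest integer n with Pr(Ybar > 50) >= 0.9, where
# Pr(Ybar > 50) = Pr(Z > (50 - mu) / (sigma / sqrt(n)))
n = 1
while 1 - Z.cdf((target - mu) / (sigma / n**0.5)) < level:
    n += 1
print(n)  # 5, agreeing with rounding n ≈ 4.33 up to the next integer
```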

Sampling Distribution of a Sample Mean: Exercises

• An automatic machine in a manufacturing process is operating properly if


lengths of an important subcomponent are normally distributed with mean
117 cm and standard deviation 5.2 cm.

1. Find the probability that one selected subcomponent is shorter than 114
cm
2. Find the probability that if five subcomponents are randomly selected,
their mean length is less than 114 cm
3. Find the probability that if five subcomponents are randomly selected,
all five are shorter than 114 cm

• (Challenging) The time it takes for a statistics lecturer to mark a test is nor-
mally distributed with a mean of 4.8 minutes and a standard deviation of 1.3
minutes. There are 60 students in the lecturer’s class. What is the probability
that he needs more than 5 hours to mark all the tests? (The 60 tests in this
year’s class can be considered a random sample of the many thousands of tests
the lecturer has marked and will mark.)

References
Devore, J. L. & Farnum, N. R. (2005), Applied Statistics for Engineers and Scien-
tists, 2nd edn, Brooks/Cole, Belmont.

Hyndman, R. J. (1995), The problem with Sturges’ rule for constructing histograms.
Unpublished manuscript.

James, F. (1990), ‘A review of pseudorandom number generators’, Computer Physics


Communications 60(3), 329–344.

Keller, G. (2012), Statistics for Management and Economics, 9th edn, Southwestern
Cengage Learning, Mason.

Miller, I. & Miller, M. (2014), John E. Freund’s Mathematical Statistics with Appli-
cations, 8th edn, Pearson, Essex.

Navidi, W. (2015), Statistics for Engineers and Scientists, 4th edn, McGraw-Hill
Education, New York.

R Core Team (2019), R: A Language and Environment for Statistical Computing, R


Foundation for Statistical Computing, Vienna, Austria.
URL: https://www.R-project.org/

Statistics South Africa (2019), Electricity generated and available for distribution
(Preliminary), December 2018, Technical Report P4141, Statistics South Africa.

Tabak, J. (2011), Probability and Statistics: The Science of Uncertainty, 2nd edn,
Facts on File, New York.

Todorov, V. & Filzmoser, P. (2009), ‘An object-oriented framework for robust mul-
tivariate analysis’, Journal of Statistical Software 32(3), 1–47.
URL: http://www.jstatsoft.org/v32/i03/

Wright, K. (2018), agridat: Agricultural Datasets. R package version 1.16.


URL: https://CRAN.R-project.org/package=agridat

