Statistics 1A Lecture Notes Article
Faculty of Applied Sciences
Department of Mathematics and Physics
Contents
1 Introduction to Statistics and Data
3 Descriptive Statistics
5 Probability Distributions
• The following books may prove useful to you: Keller (2012), Navidi (2015),
Devore & Farnum (2005), Miller & Miller (2014) (the last one is more ad-
vanced)
• Note that some data sets used in these notes are taken from these books.
Others are taken or adapted from R software (R Core Team (2019), Todorov
& Filzmoser (2009), Wright (2018))
• Data are information that come from investigations (e.g. observation, exper-
iments, sampling).
• Today data generally comes in an electronic format. Common formats for elec-
tronic data include spreadsheets (e.g., MS Excel), text files (comma-delimited
.CSV file, space or tab-delimited), and relational databases
• Designing and accessing electronic databases is a topic you will study more in
your Data Management module
• There are branches of statistics that are specifically designed to analyse qual-
itative data, such as text mining (methods for statistical analysis of textual
data)
• Note, however, that social scientists often develop quantitative measures for
things we think of as qualitative (e.g., ‘depression’), in order to use statistics
in their disciplines.
Structured vs. Unstructured Data
• Structured data is data that is organised into structures such as tables (e.g.,
an Excel spreadsheet) and is therefore ready for statistical analysis
• Primary data is data that one collects for one’s own research (whether
by experiment, observation, survey, etc.)
• Secondary data is data that already existed but that one accesses in order
to use in one’s research
• Structured data can be further classified according to the four scales of mea-
surement
1. Nominal
2. Ordinal
3. Interval
4. Ratio
Properties of Scales of Measurement
• There are four properties that will be useful in distinguishing the four scales
of measurement. These are identity, magnitude, interval spacing, and
absolute zero
– The Identity property means that each value on the scale has a unique
meaning; no two values are the same
– The Magnitude property means that any two values on the scale can
be compared in terms of magnitude; therefore all values on the scale can
be ordered from least to greatest
– The Interval Spacing property means that the magnitudes between
values along the scale are equally spaced; therefore it is meaningful to
subtract them
– The Absolute Zero property means that the scale has a meaningful zero
value and no values below this value; therefore it is meaningful to divide
values to obtain ratios
The Nominal Scale
• On the nominal scale, values can be coded (assigned arbitrary numerical
values), but they cannot be ordered from least to greatest since there is no
notion of magnitude
Figure 1.1: Two Types of Tea, Green and Rooibos (Nominal Scale)
• The kind of tea (green tea, rooibos) would be measured on the nominal scale
The Ordinal Scale
• The ordinal scale satisfies the identity property and the magnitude property
only
• The data can be ordered from least to greatest, but cannot be subtracted
• Returning to our tea example, the size of the tea (large, medium, or small)
could be measured on the ordinal scale
Figure 1.2: Two Sizes of Tea, Small and Large (Ordinal Scale)
The Interval Scale
• The interval scale satisfies the identity, magnitude, and interval spacing properties only
• This means the values can be ordered and can also be meaningfully subtracted
but cannot be meaningfully divided
• This is because the scale has no ‘absolute zero’: zero degrees Celsius does not
mean there is no heat, and in fact the temperature can be negative
The Ratio Scale
• The ratio scale satisfies the identity, magnitude, interval spacing, and absolute
zero properties
• This means the values can be ordered from least to greatest, can be meaningfully
subtracted, and can also be meaningfully divided
What is a variable?
• Most systems are deterministic but too complex to model deterministically in practice.
– E.g. Flipping a coin; lottery balls - there are laws of motion that govern
how these objects behave, but the systems are far too complex to be of
any practical use
– The same is true of stock market fluctuations. The stock prices are driven
up and down by human buying and selling behaviour, but we would need
billions of variables to model this deterministically.
• We treat these quantities as random for practicality.
The bottom line is, randomness may not actually exist. Even computers can
only generate pseudo-random numbers and not true random numbers. If you are
interested in reading more about pseudo-random number generation, see James
(1990). Randomness is a tool which allows us to make decisions and predictions in
the presence of uncertainty.
• Raw data refers to data as it is initially captured (in the case of primary data)
or as it is initially downloaded or accessed (in the case of secondary data)
• Table 2.1 gives a visual of some raw data from a South African schools dataset
displayed in MS Excel:
Table 2.1: Raw Excel Data from South African Schools Dataset
• This table provides a list of all categories represented in the data and gives
the frequencies (count of observations) for each category
• Table 2.2 gives a frequency distribution table for the Province variable in the
South African schools dataset, ordered in descending order of frequency
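Building a frequency distribution table like the one described above can be sketched in a few lines of Python. This is an illustrative sketch only: the list of province codes below is hypothetical, not the actual schools dataset.

```python
from collections import Counter

# Hypothetical sample of province codes (NOT the actual schools dataset)
provinces = ["KZN", "EC", "KZN", "GP", "LP", "KZN", "EC", "GP", "KZN", "LP"]

def frequency_table(values):
    """Return (category, frequency) pairs, sorted by descending frequency."""
    counts = Counter(values)
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

for category, freq in frequency_table(provinces):
    print(category, freq)
```

Because Python's sort is stable, categories tied on frequency keep their first-seen order.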
Bar Graph
• In this graph, each category is represented by a bar or column
• The bars are usually vertical but can also be horizontal; this is a matter of
preference
• Figure 2.1 is an example of a bar graph representing the same data as the
frequency table above
• Figure 2.2 gives the same bar graph with horizontal bars
Figure 2.2: Horizontal Bar Graph of Province, Ordered by Decreasing Frequency
Pie Graph
• The pie graph is another graphical method for representing categorical (nom-
inal or ordinal) data
• The main difference between the bar graph and the pie graph (besides us-
ing a different shape) is that the pie graph is designed to present relative
frequencies whereas the bar graph is designed to present frequencies
• Since relative frequencies are also apparent from a bar graph, some statisticians
would argue that a bar graph is always preferable to a pie graph
Figure 2.3: Pie Graph of School Phase (Schools Dataset)
Donut Graph
• A donut graph is just an alternative form of a pie graph that leaves a hole in
the middle
Figure 2.4: Donut Graph of Boarding and Non-Boarding Schools (Schools Dataset)
• One variable is represented by rows of the table and the other by columns
• Besides giving the raw frequencies in the individual cells, a two-way frequency
table may include the relative frequencies per row, per column, or per the
entire table (if it includes all of these relative frequencies, it could be rather
confusing)
• In this case, the column relative frequency tells us what percent of schools in
each province belong to each quintile. This is a convenient way to compare the
distribution of quintiles between provinces. For instance, we can see that Free
State has the highest percentage of Quintile 1 schools (49.26%) while Gauteng
has the highest percentage of Quintile 5 schools (30.00%)
Table 2.3: Two-Way Frequency Table of Quintile vs. Province (Schools Dataset)
• Note: South African public schools are categorised into five groups called
‘quintiles’ based on the economic status of their surrounding communities.
Quintile 5 schools are in the wealthiest areas and Quintile 1 schools are in the
poorest areas. Quintile is an ordinal variable.
• We can see that this two-way frequency table also gives us the total row
and column frequencies and relative frequencies. In this case, the total row
frequencies tell us how many schools there are in each quintile across the whole
of South Africa. The total column frequencies tell us how many schools there
are in each province (the same information that we earlier represented in a bar
graph).
• Stacked and grouped bar graphs are modified bar graphs that can be used to
jointly present data from two categorical variables
• Stacked and grouped bar graphs relate to two-way frequency tables the way
that basic bar graphs relate to one-way frequency tables
• The bar graphs shown in Figures 2.5 and 2.6 represent two-way frequencies of
schools by province and by locale (urban vs. rural)
Figure 2.5: Stacked Bar Graph of Province vs. Locale (Schools Dataset)
• In a stacked bar graph, the height of each bar represents the overall frequency
of the categories of one categorical variable. Within each of these bars are
two or more stacked sub-bars representing the frequencies of categories of the
second categorical variable within each category of the first
Figure 2.6: Grouped Bar Graph of Province vs. Locale (Schools Dataset)
• In a grouped bar graph, instead of stacking the sub-bars on top of each other,
they are placed alongside each other
• The advantage of a stacked bar graph is that one gets a sense of the sub-
category frequencies while still having the overall frequencies for one of the
variables displayed
• Although bar and pie graphs are ideally designed for displaying frequencies and
relative frequencies (respectively) of categorical data, they are sometimes also
used to show the relationship between a numerical (interval or ratio) variable
and a categorical variable
• For example, take the bar graph shown in Figure 2.7 and the pie graph shown
in Figure 2.8, both of which are based on data shown in Table 2.4, taken from
Statistics South Africa (2019)
Figure 2.7: Bar Graph of Volume of Electricity Delivered by Province, December
2018 (Gigawatt-Hours)
2.2 Presenting Ordinal Data
Methods for Presenting Ordinal Data
• Ordinal data is still categorical like nominal data; the key difference is that the
values have not only the identity property but also the magnitude property
• Consequently, the tabular and graphical methods used to present ordinal data
are the same as those used to present nominal data: frequency tables, bar
graphs, pie graphs
• The only ‘catch’ is that, when presenting ordinal data, one should normally
order the table (rows and/or columns), bars, or pie slices, in increasing or de-
creasing order of the ordinal categories, rather than in increasing or decreasing
order of frequency
• For example, consider the following data from Statistics South Africa’s 2017
General Household Survey. Each household was asked, ‘In what condition are
the walls, roof, and floor of the main dwelling? Is it very weak, weak, needing
repairs, good or very good?’ In this case, ‘very weak, weak, needing repairs,
good, very good’ are increasing levels of condition on an ordinal scale. Two
frequency tables for roof condition are shown in Tables 2.5 and 2.6
• Which of these tables do you think is presented in a more logical order?
Table 2.5: Roof Condition Frequencies from 2017 GHS (Ordered by Frequency)
Table 2.6: Roof Condition Frequencies from 2017 GHS (Ordered by Levels of Ordinal
Variable)
• Clearly Table 2.6 is presented in a more logical order. Similarly, the cross-
tabulation in Table 2.7 between monthly rent cost and child hunger is presented
logically according to the ordering of the two ordinal variables involved
Child Hunger                      Monthly Rent
Occurrence    <R500  R501-R1000  R1001-R3000  R3001-R5000  R5001-R7000  >R7000
Never           497         402          345          374          225     229
Seldom           42          37           15            9            2       1
Sometimes        54          33           16            6            3       2
Often            32          10            4            3            4       4
Always            7           1            0            0            1       0
Table 2.7: Two-Way Frequencies of Child Hunger Occurrence vs. Monthly Rent
• In this section we will make no distinction between interval and ratio data; in
both cases we are dealing with numerical values
Class Intervals
• One way of analysing interval and ratio data is to break it down into ‘class
intervals’
• For example, in Table 2.7, the ‘Monthly Rent’ categories are intervals that
each represent a range of rent values
• Consider the age data (in years) from a certain population of size 50 in Table
2.8
27 37 32 44 44 72 57 28 25 35
55 52 37 22 45 36 44 82 28 32
71 45 31 36 22 70 24 29 33 71
55 37 43 49 27 38 73 59 54 22
25 41 32 27 41 23 57 26 60 19
• The data in this raw form is not easy to make sense of; one way to present
it more conveniently would be to break it into class intervals by decade, as in
Table 2.9
Histogram
• The histogram is a graph used to visualise the distribution of numerical data.
This is achieved by breaking the data into class intervals (called ‘bins’ in this
case) and then drawing a bar representing the frequency of each class interval
• The key visual difference between a histogram and a bar graph is that in a
histogram there is no horizontal space between the bars; this lack of space
represents how the intervals occupy a continuous numerical scale
• An important decision when constructing a histogram (or even when con-
structing a frequency table for a numerical variable using class intervals) is
how many bins or class intervals to use
• A histogram constructed from this data would look very different depending
on how many bins are used (see Figure 2.9)
Sturges’ Rule
• Sturges’ Rule is one widely used method for determining the number of class
intervals (bins) to use for a histogram
• The rule is to calculate the number of bins k using the following formula:
k = ⌈log2(n) + 1⌉
• The ⌈ and ⌉ in the above formula mean that we round up what is in between
these symbols to the next highest integer (the number of bins must be an
integer)
k = ⌈log2(90) + 1⌉
= ⌈6.49185 + 1⌉
= 8
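Sturges' Rule is easy to compute directly. A minimal Python sketch of the formula, reproducing the worked example above for n = 90:

```python
import math

def sturges_bins(n):
    """Number of histogram bins by Sturges' Rule: k = ceil(log2(n) + 1)."""
    return math.ceil(math.log2(n) + 1)

print(sturges_bins(90))  # → 8, matching the worked example
```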
Figure 2.9: Histograms of Energy Consumption Data with Different Numbers of
Bins
• However, the rule has been criticised for large numbers of observations (n >
200) as well (Hyndman, 1995)
• Two alternative rules that can be used to construct histograms are Scott’s
Rule and Freedman and Diaconis’s Rule
• In both of these rules we do not calculate the number of bins but rather the
width of each bin
Figure 2.10: Histogram of Energy Consumption Data with 8 Bins as per Sturges’
Rule
• We have not yet covered descriptive statistics such as the standard deviation
and interquartile range, so we will leave off these rules for now
– It also assists us in identifying unusual patterns such as a bimodal distri-
bution, a distribution whose histogram has two ‘peaks’ (see example in
Figure 2.12)
Histogram: Example
86 175 157 282 38 211 497 246 393 198
146 176 220 224 337 180 182 185 396 264
251 76 42 149 65 93 423 188 203 105
653 264 321 180 151 315 185 568 829 203
98 15 180 325 341 353 229 55 239 124
249 364 198 250 40 571 400 55 236 137
400 195 38 196 40 124 338 61 286 135
292 262 20 90 135 279 290 244 194 350
131 88 61 229 597 81 398 20 277 193
169 264 121 166 246 186 71 284 143 188
• Would you describe the distribution of this data as unimodal (having one
peak) or bimodal (having two peaks)? As symmetrical, negatively skewed, or
positively skewed?
Table 2.13: Class Interval Frequencies of Cycles of Strain to Breakage for 100 Yarn
Specimens
Box-and-Whisker Plot
Figure 2.13: Histogram of Cycles of Strain to Breakage for 100 Yarn Specimens
Line Graph
• A line graph is a method for displaying numerical data that are organised
sequentially
• Consider the number of rhino poached per year in South Africa from 2003 to
2019
• This data can be conveniently represented in a line graph (Figure 2.14) which
provides a clear visualisation of the rapid increase in rhino poaching over the
years 2007 to 2014 followed by a gradual decrease thereafter
(1) Put the time index variable on the horizontal (x) axis
(2) Put the numerical data variable on the vertical (y) axis
(3) Plot a point on the graph for each observation in the data
(4) Join the points with line segments, moving sequentially through the series
Year  No. of Rhino Poached
2003 22
2004 10
2005 13
2006 24
2007 13
2008 83
2009 122
2010 333
2011 448
2012 668
2013 1004
2014 1215
2015 1175
2016 1054
2017 1028
2018 769
2019 594
Figure 2.14: Data Series and Line Graph of Rhino Poached per year in South Africa
• Table 2.14 gives the year-end unemployment rate for South Africa for each
year from 1994 to 2019. Represent this data using a line graph.
• Notice how the two graphs in Figure 2.15 look different just by changing the
scale on the vertical axis. In general, we should start the axis from 0 for a
variable that is measured on the ratio scale
Year Unemployment Rate (%)
1994 20
1995 16.9
1996 19.3
1997 21
1998 25.2
1999 23.3
2000 23.3
2001 26.2
2002 26.6
2003 28.4
2004 23
2005 23.5
2006 22.1
2007 21
2008 21.5
2009 24.1
2010 23.9
2011 23.8
2012 24.5
2013 24.1
2014 24.3
2015 24.5
2016 26.5
2017 26.7
2018 27.1
2019 29.1
Figure 2.15: Line Graph of South African Annual Unemployment Rate, 1994-2019
3 Descriptive Statistics
What is a Descriptive Statistic?
• If the data comes from a sample (a selection of units from a larger population),
descriptive statistics are often used to estimate the characteristics of the whole
population
• However, the descriptive statistic value does not tell us how accurate an
estimator it is; determining the accuracy and precision of estimators is
something we will look at in Statistics 1B
• For now, we will not worry about whether our data comes from a population
or from a sample from a larger population
• Some of the features of a data set that can be described using descriptive
statistics are central location (also called central tendency), dispersion
(also called variability or spread), relative standing, skewness, and tail
extremity (kurtosis)
• The three most well-known descriptive statistics that are used to measure
central location are ones you have probably encountered in secondary school.
They are:
– The Mean
– The Median
– The Mode
• In this section we will also cover another measure of central location called the
Trimmed Mean, and we will discuss measures of central location for grouped
data (numerical data that has been put into class intervals)
Mean
• The formula for calculating the population mean is as follows:
µ = (1/N)(x1 + x2 + · · · + xN)
• Here, N denotes the ‘population size’, the number of observations of the vari-
able in the whole population
• The formula for calculating the sample mean is as follows:
x̄ = (1/n)(x1 + x2 + · · · + xn)
• Here, n denotes the ‘sample size’, the number of observations of the variable
in the sample
• (Notice that the above two formulas are exactly the same except for the no-
tation)
• These formulas can also be expressed using Sigma Notation as follows:
µ = (1/N) Σ_{i=1}^{N} xi
x̄ = (1/n) Σ_{i=1}^{n} xi
• Sigma notation uses the capital Greek letter Σ (Sigma) to denote a sum. The
indices over which the variable is to be summed are indicated below and above
the Σ
Mean: Example
• In Table 3.1 we have the fuel economy (l/100 km) for 32 high-performance
cars
13.5 15.1 12.4 16.3 19.2 13.1 14.6 17.9
13.5 15.6 14.8 18.6 8.7 18.2 10.3 14.3
12.4 19.8 15.9 27.2 9.3 18.6 10.9 18.8
13.2 11.6 17.2 27.2 8.3 21.2 9.3 13.2
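The sample-mean formula above can be applied to the fuel-economy data in Table 3.1 with a short Python sketch (a minimal illustration, not the notes' own code):

```python
# Fuel economy (l/100 km) for 32 high-performance cars, from Table 3.1
fuel = [13.5, 15.1, 12.4, 16.3, 19.2, 13.1, 14.6, 17.9,
        13.5, 15.6, 14.8, 18.6,  8.7, 18.2, 10.3, 14.3,
        12.4, 19.8, 15.9, 27.2,  9.3, 18.6, 10.9, 18.8,
        13.2, 11.6, 17.2, 27.2,  8.3, 21.2,  9.3, 13.2]

def sample_mean(xs):
    """x-bar = (1/n) * (sum of the n observations)."""
    return sum(xs) / len(xs)

print(round(sample_mean(fuel), 4))
```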
Order Statistics
• If we rank the data from least to greatest, the ith order statistic refers to the
ith value in order
• Thus, for example, x(1) is the minimum of the sample data and x(n) is the
maximum
• Table 3.2 gives the final exam marks (%) for a class of 22 statistics students,
along with the sex of each student
Table 3.2: Sex and Final Exam Mark (%) for 22 Statistics Students
• We simply need to sort the data from least to greatest and then take the third
value:
33, 47, 48, 49, 55, 59, 59, 63, 64, 65, 67
68, 71, 73, 73, 74, 75, 77, 81, 90, 90, 97
• x(3) = 48
Median
• The median can be defined as the middle value in the data by order, i.e. as the
value for which half of the data lie below it and half of the data lie above it
• To obtain the median, we first must order the data from least to greatest
Median: Example
• We order the data to get the 16th and 17th order statistics:
• x(16) = 14.6 and x(17) = 14.8, thus x̃ = (1/2)(14.6 + 14.8) = 14.7
• Consider now a sample of heights (in metres) of n = 13 Loblolly pine trees:
18.8 18.5 18.0 17.8 18.1
17.3 19.2 17.2 19.3 18.2
18.3 18.6 19.5
• In order to determine the 7th order statistic x(7) we simply need to sort the
data from least to greatest and then take the 7th value:
17.2, 17.3, 17.8, 18.0, 18.1, 18.2, 18.3, 18.5, 18.6, 18.8, 19.2, 19.3, 19.5
• Thus x(7) = 18.3; since n = 13 is odd, this middle value is also the median x̃
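Order statistics and the median can be sketched in Python using the tree-height data above (an illustrative sketch, not the notes' own code):

```python
# Heights (m) of 13 Loblolly pine trees, from the example above
heights = [18.8, 18.5, 18.0, 17.8, 18.1, 17.3, 19.2,
           17.2, 19.3, 18.2, 18.3, 18.6, 19.5]

def order_statistic(xs, i):
    """The i-th order statistic x(i): the i-th smallest value (1-indexed)."""
    return sorted(xs)[i - 1]

def median(xs):
    """Middle order statistic for odd n; mean of the two middle ones for even n."""
    s = sorted(xs)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]
    return (s[n // 2 - 1] + s[n // 2]) / 2

print(order_statistic(heights, 7))  # x(7)
print(median(heights))
```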
Mode
• The mode is simply the most frequently occurring value in the data
• We could get the mode by creating a one-way frequency table and looking for
the row of the table with the highest frequency
• If there are two values that both have the highest frequency, we do not average
them, instead we say that there are two modes
Mode: Example
• In fact there are six modes: 9.3, 12.4, 13.2, 13.5, 18.6, and 27.2 (all of these
values occur twice, and no value occurs more than twice)
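The six modes of the fuel-economy data can be verified with `statistics.multimode` from the Python standard library (available from Python 3.8), which returns every value tied for the highest frequency:

```python
from statistics import multimode

# Fuel economy data from Table 3.1
fuel = [13.5, 15.1, 12.4, 16.3, 19.2, 13.1, 14.6, 17.9,
        13.5, 15.6, 14.8, 18.6,  8.7, 18.2, 10.3, 14.3,
        12.4, 19.8, 15.9, 27.2,  9.3, 18.6, 10.9, 18.8,
        13.2, 11.6, 17.2, 27.2,  8.3, 21.2,  9.3, 13.2]

# multimode returns all values sharing the maximum frequency
print(sorted(multimode(fuel)))  # → [9.3, 12.4, 13.2, 13.5, 18.6, 27.2]
```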
Trimmed Mean
• We first need to sort the data (already done under Median: Example above)
• Suppose you have numerical data that has been arranged into class intervals,
or perhaps into a histogram, and you do not have the original numerical values
• It is not possible in this case to calculate the exact mean or median of the
data, but there are formulas we can use to approximate this
• Suppose that the data have been grouped into c class intervals and that each
class interval has a lower limit and an upper limit (so that its midpoint can
be calculated)
• We will refer to the midpoints of the class intervals as mj, j = 1, 2, . . . , c (mj
can be calculated as (1/2)(ℓj + uj) where ℓj is the lower limit and uj is the
upper limit)
• What we are doing in this formula is measuring each class interval using
its midpoint, and weighting each class interval using its frequency; thus the
grouped mean is an example of a weighted average
• Once we have identified the median class interval m, we can approximate the
median using the following formula:
x̃gr = ℓm + ((n/2 − Σ_{j=1}^{m−1} fj) / fm) (um − ℓm)
• What we are essentially doing in this formula is taking the lower limit of the
class interval containing the median (ℓm) and adding a certain proportion of
the width of this class interval (the width being um − ℓm).
– Notice that if the cumulative relative frequency of the mth class interval
is exactly 50%, then Σ_{j=1}^{m} fj = n/2, which means that
(n/2 − Σ_{j=1}^{m−1} fj) / fm = (Σ_{j=1}^{m} fj − Σ_{j=1}^{m−1} fj) / fm = fm / fm = 1
– Thus, in this case, x̃gr reduces to ℓm + (um − ℓm) = um, the upper limit of
the interval (since the median occurs precisely at the cutpoint between
intervals)
• To approximate the mode, we first identify the modal interval m, i.e. the
class interval with the highest frequency fj, and then apply the formula
Modegr = ℓm + ((fm − fm−1) / ((fm − fm−1) + (fm − fm+1))) (um − ℓm)
• What we are doing in this formula is shifting the location of the mode within
the modal interval towards the lower or upper limit of the interval depending
on the frequencies of the intervals just below and just above that interval
– Notice that, if the frequencies of the interval below and the interval above
are the same (fm−1 = fm+1), this reduces to Modegr = ℓm + (1/2)(um − ℓm),
which is the midpoint of the modal interval
• Consider the data in Table 3.4 which shows the NSC examination pass rates
per school grouped into class intervals
Table 3.4: Frequency Table of 2016 Matric Pass Rates for 6794 South African Schools
• Grouped Mean:
x̄gr = (1/n) Σ_{j=1}^{c} fj mj
= (1/6794)(65(5) + 143(15) + 260(25) + 413(35) + 569(45)
+ 659(55) + 783(65) + 1097(75) + 1161(85) + 1644(95))
= 473310/6794
= 69.6659
• Thus we estimate the mean 2016 matric pass rate among South African schools
to be about 69.7%
• Grouped Median: The 8th interval, i.e. (70, 80], is the first one for which the
cumulative relative frequency exceeds 50%; thus m = 8
x̃gr = ℓ8 + ((n/2 − Σ_{j=1}^{7} fj) / f8) (u8 − ℓ8)
= 70 + ((6794/2 − 2892)/1097)(80 − 70)
= 74.6035
• Thus we estimate the median 2016 matric pass rate among South African
schools to be about 74.6%
• Grouped Mode: The modal interval is the 10th interval, (90, 100], since it has
the highest frequency (1644); thus m = 10
• Since in this case there is no interval above the 10th, we treat the 11th interval
as having a frequency of 0
Modegr = ℓ10 + ((f10 − f9) / ((f10 − f9) + (f10 − f11))) (u10 − ℓ10)
= 90 + ((1644 − 1161) / ((1644 − 1161) + (1644 − 0)))(100 − 90)
= 92.2708
• Thus we would estimate the most common matric pass rate among South
African schools in 2016 to be about 92.3%
• For interest’s sake, the actual mean matric pass rate among schools in this
data set (based on exact numerical values, not class intervals) was 70.41%, the
actual median pass rate was 75%, and the actual mode pass rate was 100%
3.2 Measures of Dispersion
Measures of Dispersion
• The dispersion of the data would be visualised graphically as the width of the
histogram; however, we also need numerical ways of quantifying it
– Range
– Mean Deviation from the Mean
– Median Deviation from the Median
– Variance and Standard Deviation
Range
• The Range quantifies for us the distance along the xi axis that is covered by
our data, similar to the distance along the horizontal axis that is covered by
a histogram
• The main problem with this descriptive statistic is that it is only describing
two values in the data set and does not tell us how widely dispersed (between
those two values) the rest of the data are
Range: Example
• Table 3.5 gives the hourly mean concentrations of nitrogen oxides (NOx) in
ambient air (parts per billion, ppb) next to a busy motorway, from 15 randomly
selected days
• We first need to sort the data in increasing order:
• Statistics that are stable and do not radically change in the presence of outliers
(extreme values) and other anomalies are said to be robust; there is a whole
branch of statistical sciences that deals with robust statistics
• Statisticians prefer robust statistics because they better summarise the whole
data rather than just a few unusual values
• The range is an example of a statistic that is not robust. For instance, if the
value of 138.65 were removed from the data, the range would drop from 133.65
to 92.55. Thus a single extreme value has a huge effect on the range
• The Mean Deviation from the Mean (sometimes called Mean Absolute
Deviation, MAD) is the average absolute difference between the mean and
the other values in the data
• Although the mean absolute deviation has a nice intuitive meaning, it is not
used very much by statisticians in practice because it does not have nice math-
ematical properties
Mean Deviation from the Mean Example
• Consider again the sample of NOx concentrations in Table 3.5.
• The mean deviation from the mean is calculated as follows:
x̄ = (1/n) Σ_{i=1}^{n} xi = (1/15)(29.40 + 15.65 + 24.60 + · · · + 28.00)
= 45.63667
MAD = (1/n) Σ_{i=1}^{n} |xi − x̄|
= (1/15)(|29.40 − 45.63667| + |15.65 − 45.63667| + · · · + |28.00 − 45.63667|)
= 28.0271
• (Reminder: do not round off your first calculation, in this case the mean, if
you are going to use it for further calculations; otherwise you will introduce
rounding errors. To be safe, retain two more significant digits in your first
calculation than the accuracy needed in your final answer.)
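The mean deviation from the mean is straightforward to code. Since the full NOx dataset is not reproduced in these notes, the sketch below uses a small made-up sample for illustration:

```python
def mean_abs_deviation(xs):
    """Mean deviation from the mean: the average of |xi - x-bar|."""
    xbar = sum(xs) / len(xs)
    return sum(abs(x - xbar) for x in xs) / len(xs)

# Small hypothetical sample (NOT the NOx data); its mean is 5
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean_abs_deviation(sample))  # → 1.5
```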
Median Deviation from the Median
• Another measure of dispersion that is more robust than the mean deviation
from the mean is the median absolute deviation from the median
• It requires us to compute absolute deviations from the median
ADMi = |xi − x̃|, i = 1, 2, . . . , n
• The median deviation from the median is an example of a robust measure of
dispersion
• Check for yourself that, if we removed the largest value of 138.65 from the data
and recalculate the median deviation from the median, it would only change
slightly (from 9 to 9.725), whereas we saw above that the range would change
dramatically
Variance
• This is one statistic where the formula differs depending on whether we are
working with a population or a sample
• Population Variance:
σ² = (1/N) Σ_{i=1}^{N} (xi − µ)²
• Sample Variance:
s² = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)²
• Note that the reason why we divide by n − 1 rather than n in the sample
variance is to correct for a source of bias in the estimator. This is called
Bessel’s Correction.
Standard Deviation
• The standard deviation is simply the square root of the variance, and is a very
widely used statistic in practice
• Variance and standard deviation are not robust, since they are functions of
the mean, which is also not robust
Variance and Standard Deviation Example
• Many traffic experts argue that the most important factor in accidents is not
the average speed of cars but the amount of variation. The speeds of 20 cars
were recorded on a road with a speed limit of 70 km/h and a high rate of
accidents (Table 3.6)
82 69 71 84 73 71 66 84 82 79
78 80 70 71 72 66 78 92 71 70
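The sample variance and standard deviation of the speed data can be computed with a short Python sketch of the formulas above (an illustration, not the notes' own code):

```python
# Speeds (km/h) of 20 cars, from Table 3.6
speeds = [82, 69, 71, 84, 73, 71, 66, 84, 82, 79,
          78, 80, 70, 71, 72, 66, 78, 92, 71, 70]

def sample_variance(xs):
    """s² = (1/(n-1)) * sum of squared deviations from x-bar
    (dividing by n - 1 is Bessel's correction)."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_std(xs):
    """Standard deviation: the square root of the variance."""
    return sample_variance(xs) ** 0.5

print(round(sample_variance(speeds), 4), round(sample_std(speeds), 4))
```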
• For grouped data, the sample variance can be approximated as
s²gr = (1/(n − 1)) Σ_{j=1}^{c} fj (mj − x̄gr)²
where j = 1, 2, . . . , c are the class intervals, fj is the frequency of the jth class
interval, and mj is the midpoint of the jth class interval
Approximate Variance and Standard Deviation for Grouped Data: Ex-
ample
• Refer back to the grouped data on NSC examination pass rates from 6794
South African schools (Table 3.4)
Coefficient of Variation
3.3 Measures for Proportions
Proportions
• When dealing with nominal or ordinal data (or sometimes even with interval
or ratio data), a very simple but very important descriptive statistic is the
proportion with which the observations fall into a certain category or have
a certain attribute
• Population Proportion:
P = A/N
where A is the number of observations in the population which have the attribute
• Sample Proportion:
p = a/n
where a is the number of observations in the sample which have the attribute
Proportions Example
• Take our sample of n = 22 student marks in Table 3.2 (already sorted from
least to greatest earlier):
33, 47, 48, 49, 55, 59, 59, 63, 64, 65, 67
68, 71, 73, 73, 74, 75, 77, 81, 90, 90, 97
• Suppose we are interested in the proportion of students who passed (that is,
achieved a mark of at least 50)
• Notice that what we are effectively doing here is creating a nominal variable
with two categories, ‘Passed (Mark ≥ 50)’ and ‘Failed (Mark < 50)’
• Thus the proportion of students who passed is 0.8182, and the percent is
81.82%.
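The pass-rate proportion above can be checked with a minimal Python sketch (the predicate-based helper is an illustration, not the notes' own code):

```python
# Final exam marks (%) of 22 students, from Table 3.2
marks = [33, 47, 48, 49, 55, 59, 59, 63, 64, 65, 67,
         68, 71, 73, 73, 74, 75, 77, 81, 90, 90, 97]

def proportion(xs, has_attribute):
    """Sample proportion p = a/n, where a counts observations
    satisfying the attribute predicate."""
    a = sum(1 for x in xs if has_attribute(x))
    return a / len(xs)

p_passed = proportion(marks, lambda mark: mark >= 50)
print(round(p_passed, 4))  # → 0.8182
```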
• Suppose instead we were interested in the proportion of students whose sex is
female (recall: the sex of each student is provided in Table 3.2)
• In this case the variable is already nominal and binary, so we just count the
frequency of females a and divide by sample size n:
p = a/n = 7/22 = 0.3182
• Thus the proportion of students in the class that are female is 0.3182
• We defined order statistics earlier and saw that statistics such as the median
and range are functions of order statistics
Percentiles
• The P th percentile is the value below which P % of the data fall; put differently,
P % of the data are less than the P th percentile and (100 − P )% of the data
are greater than the P th percentile
• For example, the median is equivalent to the 50th percentile, since 50% of the
data lie below the median
– Let I be the integer part and F the fractional part (the part after the
decimal) of the rank position RP
– P th percentile = F x(I+1) + (1 − F) x(I)
– We are basically taking a weighted mean of the Ith and (I + 1)th order
statistics, with the fraction component F being the weights (if F > 0.5,
the percentile should be closer to X(I+1) ; if F < 0.5, the percentile should
be closer to X(I) ; if F = 0.5, the percentile will be exactly the average of
the two)
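The interpolation rule above can be sketched in Python. Note one assumption: the definition of the rank position RP is not reproduced in this excerpt, so the sketch uses the common convention RP = (P/100)(n + 1); other conventions exist.

```python
def percentile(xs, p):
    """P-th percentile via the interpolation rule in the notes.
    ASSUMPTION: rank position RP = (P/100)(n + 1), a common convention
    not stated explicitly in this excerpt."""
    s = sorted(xs)
    n = len(s)
    rp = p / 100 * (n + 1)
    i = int(rp)       # integer part I
    f = rp - i        # fractional part F
    if i <= 0:        # guard the extremes for a minimal sketch
        return s[0]
    if i >= n:
        return s[-1]
    return f * s[i] + (1 - f) * s[i - 1]  # F*x(I+1) + (1-F)*x(I)

# Tree heights from the earlier median example
heights = [18.8, 18.5, 18.0, 17.8, 18.1, 17.3, 19.2,
           17.2, 19.3, 18.2, 18.3, 18.6, 19.5]
print(percentile(heights, 50))  # 50th percentile = the median
```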
Percentiles: Example
Quartiles
Interquartile Range
• This is a more robust measure of dispersion than the Range, since it tends to
exclude extreme values (outliers)
1.05 1.10 0.99 1.33 1.13
1.16 1.01 1.30 1.30 1.16
0.96 1.18 1.16 1.17 1.14
1.10 0.70 1.30 1.50 1.10
1.11 0.68 1.01 0.93 1.18
• Table 3.7 contains yields (in kg) of peanuts from 25 plots of land in a field.
(Each plot was 3 m by 1 m in dimensions)
Quantiles
• While quartiles divide the data into four parts, and percentiles divide the data
into one hundred parts, sample quantiles divide the data according to any
proportion
• Thus the first quartile is equivalent to the 25th percentile and to the 0.25
quantile
• Quantiles are often used for the extreme ends of the data, e.g. the 0.99975
sample quantile
• A standard score (also called a z-score) tells us how far an observation is from
the population mean, in units of population standard deviations
• The standard score is closely related to the Normal distribution (or Gaus-
sian distribution) which will be discussed later in the course
• We can easily identify whether an observation is above or below the mean
based on whether its standard score is positive or negative
• Give the z-scores of all the observations in this data, assuming it is known
that the data come from a population with a mean of 75 km/h and a variance
of 49 (km/h)² (i.e. a standard deviation of 7 km/h)
xi zi xi zi
82 1.0000 78 0.4286
69 -0.8571 80 0.7143
71 -0.5714 70 -0.7143
84 1.2857 71 -0.5714
73 -0.2857 72 -0.4286
71 -0.5714 66 -1.2857
66 -1.2857 78 0.4286
84 1.2857 92 2.4286
82 1.0000 71 -0.5714
79 0.5714 70 -0.7143
Table 3.8: z-Scores for Vehicle Speed Data (Raw Data in Table 3.6)
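As a quick check of Table 3.8, the z-scores can be recomputed directly; the speeds below are read off the xi columns of the table.

```python
def z_score(x, mu, sigma):
    # standard score: distance from the population mean in units of SD
    return (x - mu) / sigma

# vehicle speeds from Table 3.8; population mean 75 km/h, SD 7 km/h
speeds = [82, 69, 71, 84, 73, 71, 66, 84, 82, 79,
          78, 80, 70, 71, 72, 66, 78, 92, 71, 70]
z = [round(z_score(x, 75, 7), 4) for x in speeds]
print(z[:3])   # [1.0, -0.8571, -0.5714], matching the table
```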
Outliers
• Outliers are extreme values in the data (extremely large or extremely small
relative to the rest of the data)
• There are many different methods one can use to determine whether a partic-
ular observation is an outlier; some methods are very advanced
• In this module we will use two conventional definitions of outliers that are
used when drawing box-and-whisker plots (discussed below):
Definition 3.1. IQR Approach: An observation is an outlier if it falls more
than 1.5 interquartile ranges above the third quartile, or more than 1.5 in-
terquartile ranges below the first quartile. An observation is an extreme
outlier if it falls more than 3 interquartile ranges above the third quartile, or
more than 3 interquartile ranges below the first quartile.
In symbols, an observation xi is an outlier if
xi < Q1 − 1.5 IQR, or
xi > Q3 + 1.5 IQR
and an extreme outlier if
xi < Q1 − 3 IQR, or
xi > Q3 + 3 IQR
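Definition 3.1 translates directly into code. The quartiles are passed in explicitly here, since conventions for computing Q1 and Q3 vary; the fences themselves follow the definition above, and the example values are hypothetical.

```python
def classify_outliers(data, q1, q3):
    """Classify values by the IQR fences of Definition 3.1."""
    iqr = q3 - q1
    outliers = [x for x in data
                if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
    extreme = [x for x in data
               if x < q1 - 3 * iqr or x > q3 + 3 * iqr]
    return outliers, extreme

# hypothetical data with Q1 = 10, Q3 = 14, so IQR = 4;
# outlier fences at 4 and 20, extreme fences at -2 and 26
print(classify_outliers([8, 11, 13, 21, 30], 10, 14))   # ([21, 30], [30])
```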
Outliers Example
• The data in Table 3.9 gives the weights (in kg) of 23 single-engine aircraft
built during the years 1947-1979:
• Determine, using the definitions above, whether there are any outliers in this
data, and if so, whether there are any extreme outliers. For standard scores
you may assume a population mean of 8000 kg and a population standard
deviation of 4000 kg.
• Under Definition 3.1:
– By inspection, we can see that there is one outlier, 20987, but no extreme
outliers
• Since |z22 | > 3, this aircraft’s weight is also an outlier under the standard score
definition
Box-and-Whisker Plots
• A box-and-whisker plot is built from five numbers: the minimum x(1) , the first
quartile Q1 , the second quartile (median) Q2 = x̃, the third quartile Q3 , and
the maximum x(n) .
• The bottom and top of the ‘box’ represent the first and third quartiles, with
a line through the box representing the median
• The bottom and top ‘whiskers’ represent the minimum and maximum
• The five numbers are not usually labelled as they are in Figure 3.1; this is just
to assist you in understanding the structure of the plot
Figure 3.1: Labelled Box-and-Whisker Plot of Some Randomly Generated Data
• The ‘box’ part of the plot remains the same, but the convention is to represent
outliers as individual points above and/or below the whiskers.
• Therefore, when outliers are present, the ends of the whiskers no longer repre-
sent the minimum and maximum of the data but the minimum and maximum
excluding outliers
• For example, we identified one outlier in the aircraft weight data in Table 3.9.
Thus the box-and-whisker plot for this data appears as in Figure 3.2
3.5 Measures of Skewness
Skewness
• Skewness refers to the extent to which the data is asymmetric (not symmet-
ric) about its central location
• There are also descriptive statistics that one can use to measure the skewness
of a data set
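One such statistic (not necessarily the exact formula used elsewhere in these notes) is the moment coefficient of skewness, g1 = m3 / m2^(3/2), where m2 and m3 are the second and third central moments. A sketch, using the biased (population-style) moments; textbooks differ on small-sample corrections:

```python
def skewness(data):
    """Moment coefficient of skewness g1 = m3 / m2**1.5 (biased version)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
    return m3 / m2 ** 1.5

print(skewness([1, 2, 3]))        # 0.0 for perfectly symmetric data
print(skewness([0, 6, 7, 8, 9]))  # negative: a long left tail
```

A negative value indicates negative (left) skew, a positive value positive (right) skew, and zero indicates symmetry, which matches the visual interpretation from a histogram.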
Skewness: Example
• The skewness is negative, which suggests that the data is negatively skewed
• This appears to agree with the histogram of the data in Figure 3.3, which
suggests a slightly negative skewing
Changing Units
• Important Note: Make sure the units in your data match the units of your
descriptive statistics!
• E.g. g vs. kg
• The set of all possible distinct outcomes of an experiment is called the sample
space and is denoted S
• Each outcome in a sample space can be referred to as an element or simply
as an outcome; and all outcomes are distinct and mutually exclusive (they do
not overlap)
• An event with a probability of 0 certainly will never happen (no matter how
many times the experiment is conducted)
• An event with a probability of 1 certainly will always happen (every time the
experiment is conducted)
• Two simple random experiments are the coin flip and the roll of a die
• The sample space for a coin flip consists of two outcomes, ‘Heads’ and ‘Tails’;
thus S = {Heads, Tails}
• Note: the two sides of a coin are conventionally referred to as ‘Heads’ and
‘Tails’, with ‘Heads’ representing the side that displays a portrait of a head of
state or other public figure. As can be seen in Figure 4.1, South African coins
typically do not display the portrait of a person. However, the South African
Mint has officially confirmed on Twitter that the side of the coin displaying
the coat-of-arms is the ‘Heads’ side, which means that the side displaying an
animal is the ‘Tails’ side
• Experience tells us that for an ordinary, ‘fair’ coin, Pr (Heads) = Pr (Tails) = 1/2
• (Whether this is exactly true depends on the coin and how it is flipped)
• Again, experience tells us that each of these six outcomes is an event with a
probability of 1/6
• In the experiment of rolling a die, one could define an event that includes more
than one outcome, such as ‘roll an odd number’ (which includes 1, 3, and 5)
or ‘roll at least 5’ (which includes 5 and 6). Thus
Pr (roll an odd number) = 3/6 = 1/2
Pr (roll at least a 5) = 2/6 = 1/3
• One of the basic principles of probability that we can see at work in the above
probabilities is that, if all outcomes of the experiment are equally likely, we
can calculate the probability of an event as
Pr (A) = (# of outcomes included in event A) / (total # of outcomes in sample space S)
Probability: Further Example
• To get an idea of how a spinner works, play around with an online adjustable
spinner tool
• We can see from Figure 4.3 that the three outcomes ‘Red’, ‘Green’, and ‘Blue’ are not
all equally probable; rather their probabilities are equivalent to the fraction of
the circle’s circumference that they occupy
• In this case, to find the probability of the event A =‘Green or Blue’, we cannot
simply use the ‘# of outcomes’ formula above, i.e.
Pr (A) = (# of outcomes included in event A) / (total # of outcomes in sample space S) = 2/3
• This value is incorrect because the outcomes are not all equally likely. Instead,
we can argue that
Pr (Green or Blue) = Pr (Green) + Pr (Blue) = 1/3 + 1/2 = 5/6
• This gives us the correct answer because the outcomes ‘Green’ and ‘Blue’ are
mutually exclusive or disjoint, meaning that they cannot both occur in the
same run of the experiment (the spinner cannot land on green and blue)
• The above gives rise to the additive rule of probability for mutually exclusive
events (see below)
Probability Interpretation
• Therefore,
– A probability of 1/2 does not mean that if you flip a coin twice you will get
Heads once and Tails once
– But it means that if you flip a coin 1 million times, you will almost certainly
get close to 500 000 Heads and 500 000 Tails
Warning to Gamblers
• This means that, while you may get lucky at the casino on one particular day,
if you keep going back to the casino many times you are almost certain to lose
more money than you gain
• To restate what was said earlier, suppose we have a sample space S = {O1 , O2 , . . . , Ok }
(where the Oj denote outcomes, which by definition are mutually exclusive),
probabilities of these outcomes must meet two criteria:
1. 0 ≤ Pr (Oi ) ≤ 1 for each i
2. Pr (O1 ) + Pr (O2 ) + · · · + Pr (Ok ) = 1
• That is, the probabilities of all the possible outcomes must add up to 1.
One of the outcomes in the sample space has to happen, with certainty!
Mutual Exclusivity
• Two events A and B are said to be mutually exclusive if they cannot both
occur in the same run of an experiment
• Outcomes are always mutually exclusive by definition, but events are not nec-
essarily mutually exclusive
– The events ‘roll a 3’ and ‘roll an even number’ are mutually exclusive
– The events ‘roll a 3’ and ‘roll an odd number’ are not mutually exclusive
• We can use Venn Diagrams to help us visualise this (Figures 4.4 and 4.5)
Figure 4.5: Events A and B are not Mutually Exclusive
• Here, ∩ is read as ‘Intersect’ and thus A∩B is the area of intersection (overlap)
between events A and B. ∅ denotes the empty set, and so the above statement
says that there are no outcomes in A that are also in B
Pr (A ∪ B) = Pr (A) + Pr (B)
Additive Rule for Mutually Exclusive Events: Example
Pr (A ∪ B) = Pr (A) + Pr (B)
= 1/6 + 3/6
= 4/6 = 2/3
• Since the ‘Union’ of A and B consists of outcomes that are included in event
A or event B or both, applying the previous rule to non-mutually-exclusive
events would overstate the probability, because we would count A ∩ B twice,
once in A and once in B
• To solve this problem, the additive rule for non-mutually exclusive events
requires us to subtract Pr (A ∩ B) so that it is only counted once:
Pr (A ∪ B) = Pr (A) + Pr (B) − Pr (A ∩ B)
Additive Rule for Non-Mutually Exclusive Events: Example
Pr (A ∪ B) = Pr (A) + Pr (B) − Pr (A ∩ B)
= 2/6 + 3/6 − 1/6
= 4/6 = 2/3
Pr (Ac ) = 1 − Pr (A)
• The rule follows from our earlier statement that Pr (S) = 1 (where S de-
notes the sample space, the complete set of possible distinct outcomes of the
experiment)
• Since Ac consists of all outcomes in S that are not in A, it follows that A∪Ac =
S, and therefore (since A and Ac are obviously mutually exclusive events) that
Pr (A) + Pr (Ac ) = 1
• By moving Pr (A) to the right side of the equation, we arrive at the Comple-
ment Rule
Complement Rule: Example
• Suppose that your friend rolls the die. You do not see the result but he tells
you that it is an odd number (i.e. event A has occurred)
• Before we had this information, we would have said that Pr (B) = 1/6
• However, now that we have this information, we need to update the probabil-
ity: we know that A and B are mutually exclusive, so if A has occurred, this
means B cannot occur, so the probability of B is 0.
• We cannot write Pr (B) = 0, because this would contradict the earlier statement
that Pr (B) = 1/6
• Instead, we define a conditional probability: the probability of an event
conditioning on another event
– In our above example, Pr (A|B) = 0 means that if your friend told you
he rolled a 2, the probability that he rolled an odd number would be
updated to 0
• Now, suppose that your friend rolls the die again. You do not see the result
but he tells you that it is an even number (i.e. event A has not occurred, but
event Ac has occurred)
• Clearly, events Ac and B are not mutually exclusive: it is possible that the
number rolled could be even and could also be a 2
• Our original probability for event B was Pr (B) = 1/6; how will we update the
probability now that we know that event Ac has occurred?
• Intuitively, we know that there are three even numbers that could have been
rolled (2, 4, 6): event Ac includes three outcomes, which are all equally likely
• Instead of focusing on the entire sample space S, we are restricting our atten-
tion to the subset Ac ; thus
Pr (B|Ac ) = (# of outcomes included in event B and in event Ac ) / (total # of outcomes in event Ac )
Pr (B|Ac ) = Pr (B ∩ Ac ) / Pr (Ac ) = (1/6) / (3/6) = 1/3
• This gives rise to the general formula for calculating a conditional probability
below (where we will revert to speaking of events A and B, since we are
no longer focusing only on the definition of A, B, and Ac in our die rolling
example)
• Similarly, provided that Pr (B) ≠ 0, the probability of A conditioning on B is
defined as
Pr (A|B) = Pr (A ∩ B) / Pr (B)
• Notice that we can rearrange the above equations to obtain two formulas for
Pr (A ∩ B), which is called the joint probability of A and B and which we
already saw in the additive rule for non-mutually-exclusive events
Pr (A ∩ B) = Pr (B|A) Pr (A)
Pr (A ∩ B) = Pr (A|B) Pr (B)
Independent Events
• Consider a box containing eleven balls, seven grey and four white, as pictured
in Figure 4.9
Figure 4.9: Box Containing Seven Grey Balls and Four White Balls
• Consider the following procedure: (1) draw one ball at random from the box;
(2) replace the drawn ball (put it back in the box); (3) draw a second ball at
random from the box. Now, define the following events:
– Let A be the event that the first ball drawn is grey
– Let B be the event that the second ball drawn is white
• Suppose we are on step (3) and we know that the first ball drawn and replaced
was grey (i.e. event A occurred). What is the probability of B conditioning
on A (Pr (B|A))?
• Logically, it is clear that since the first ball drawn is put back in the box, the
probability that the second ball drawn is white is not affected by the colour of
the first ball drawn
• Another way of saying this is that A and B are independent events: they do
not influence each other in any way
• If we were on step (3) and had no information about event A, we would say
that Pr (B) = 4/11 (since four of the eleven balls are white)
• However, since event B is independent of event A, the information that A has
occurred does not cause us to update the probability; thus
Pr (B|A) = Pr (B) = 4/11
• By substituting this into the multiplicative rule for dependent events, we can
derive the multiplicative rule for independent events:
Pr (A ∩ B) = Pr (B|A) Pr (A)
Pr (A ∩ B) = Pr (B) Pr (A) (since A and B are independent)
• Remember, this multiplicative rule only applies if events A and B are inde-
pendent
• To contrast the independent events in our ‘balls in a box’ example above with
dependent events, let us modify the procedure as follows:
1. Draw one ball at random from the box
2. Do not replace the drawn ball (do not put it back in the box)
3. Draw a second ball at random from the box
• Define events A and B exactly as before:
– Let A be the event that the first ball drawn is grey
– Let B be the event that the second ball drawn is white
• Is it still the case that A ⊥ B?
• Suppose that event A occurs. Since one grey ball is removed from the box and
not replaced, when the second ball is drawn there are 10 balls in the box, 6
grey and 4 white
– Thus, Pr (B|A) = 4/10 = 2/5
• Suppose that event Ac occurs (the first ball drawn is not grey, it is white).
Since one white ball is removed (and not replaced), when the second ball is
drawn there are 10 balls in the box, 7 grey and 3 white
– Thus, Pr (B|Ac ) = 3/10
• Since Pr (B|A) ≠ Pr (B|Ac ), it follows that the probability of event B does
depend on whether or not A has occurred; thus A and B are dependent events
• In this case, to find Pr (A ∩ B) we cannot use the multiplicative rule for inde-
pendent events but must use the multiplicative rule for dependent events:
Pr (A ∩ B) = Pr (B|A) Pr (A) = (4/10)(7/11) = 28/110 = 14/55 = 0.2545
Independent Events and Gambling
Conditional Probabilities and Contingency Tables
• It turns out that conditional probability is closely related to the ‘column rel-
ative frequency’ (and ‘row relative frequency’) in a two-way frequency table
• The Department then keeps records on which of the prisoners reoffend and
which do not, within two years of their release from prison
                               Re-offends   Does not re-offend   Total
Completes programme                 3               57              60
Does not complete programme        27               13              40
Total                              30               70             100
Table 4.1: Contingency Table of Results of Correctional Services Skills Development
Programme
• Suppose we choose one prisoner at random from the population of 100 (with
all prisoners equally likely to be selected), and define the following events:
– Let A be the event that the prisoner completed the skills development
programme
– Let B be the event that the prisoner reoffended after his/her release
• We can use Table 4.1 to very easily calculate any probabilities such as Pr (A),
Pr (Ac ), Pr (B), Pr (B c ), Pr (A ∩ B), Pr (A ∪ B), Pr (A|B), Pr (B|A), etc.
• To get the marginal probability of A (i.e. Pr (A), without taking into
account the influence of B), we simply divide the total of the first row by the
overall total: 60 out of 100 prisoners completed the programme, so Pr (A) = 60/100
• Similarly, to get the marginal probability of B (i.e. Pr (B), without taking
into account the influence of A), we simply divide the total of the first column
by the overall total: 30 out of 100 prisoners reoffended, so Pr (B) = 30/100
• In the same way, we could divide the total of the second row by the grand total,
and the total of the second column by the grand total, to get the marginal
probabilities Pr (Ac ) and Pr (B c ), respectively
– Of course, these two probabilities could also be calculated using the com-
plement rule, e.g.
Pr (Ac ) = 1 − Pr (A) = 1 − 60/100 = 40/100
• All of this is illustrated in Figure 4.10
Figure 4.11: Calculating Joint Probabilities from Contingency Table
• Thus, if we are conditioning on A we will divide by the total frequency for event
A (60) and if we are conditioning on B we will divide by the total frequency
of event B (30)
• Thus, for example, to calculate Pr (A|B) we simply take the number of prisoners
who completed the programme and reoffended, and divide this by the
number of prisoners who reoffended, thus obtaining 3/30, the probability that
a prisoner completed the programme given that s/he reoffended (see Figure
4.13)
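The marginal, joint, and conditional probabilities from Table 4.1 can be reproduced directly from the joint frequencies; the dictionary keys below are ad-hoc labels for this sketch.

```python
# joint frequencies from Table 4.1
freq = {("completes", "reoffends"): 3,
        ("completes", "does_not_reoffend"): 57,
        ("does_not_complete", "reoffends"): 27,
        ("does_not_complete", "does_not_reoffend"): 13}
N = sum(freq.values())                                # grand total, 100

pr_A = sum(v for (prog, _), v in freq.items()
           if prog == "completes") / N                # marginal Pr(A) = 0.6
pr_B = sum(v for (_, off), v in freq.items()
           if off == "reoffends") / N                 # marginal Pr(B) = 0.3
pr_A_and_B = freq[("completes", "reoffends")] / N     # joint Pr(A ∩ B) = 0.03
pr_A_given_B = pr_A_and_B / pr_B                      # conditional, 3/30 = 0.1
print(pr_A, pr_B, pr_A_given_B)
```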
Bayes’ Theorem
• We noted earlier that the two ways of expressing the multiplicative rule for
dependent events can be rearranged as conditional probability formulas:
Pr (B|A) = Pr (A ∩ B) / Pr (A)
Pr (A|B) = Pr (A ∩ B) / Pr (B)
• The two ways of expressing the multiplicative rule for dependent events can
also be set equal to each other to give Bayes’ Theorem:
Pr (A) Pr (B|A) = Pr (B) Pr (A|B)
Pr (B|A) = Pr (B) Pr (A|B) / Pr (A)    (Bayes’ Theorem)
Pr (A) = Σ_{k=1}^{b} Pr (A ∩ Bk ) = Σ_{k=1}^{b} Pr (A|Bk ) Pr (Bk )
Figure 4.14: Visualisation of Law of Total Probability
• The law of total probability also gives us another way of expressing Bayes’
Theorem:
Pr (B1 |A) = Pr (B1 ) Pr (A|B1 ) / Pr (A)
= Pr (B1 ) Pr (A|B1 ) / Σ_{k=1}^{b} Pr (A|Bk ) Pr (Bk )
• Suppose that you have three bags that each contain 10 balls. Bag 1 has 3 blue
balls and 7 green balls. Bag 2 has 5 blue balls and 5 green balls. Bag 3 has 8
blue balls and 2 green balls. You choose a bag at random and then choose a
ball from this bag at random. There is a 1/3 chance that you choose Bag 1, a 1/2
chance that you choose Bag 2, and a 1/6 chance that you choose Bag 3. What
is the probability that the chosen ball is blue?
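The bag question is a direct application of the law of total probability; using exact fractions keeps the arithmetic transparent.

```python
from fractions import Fraction as F

# (Pr(bag), Pr(blue | bag)) for the three bags
bags = [(F(1, 3), F(3, 10)),   # Bag 1: 3 blue of 10
        (F(1, 2), F(5, 10)),   # Bag 2: 5 blue of 10
        (F(1, 6), F(8, 10))]   # Bag 3: 8 blue of 10

# Law of total probability: Pr(Blue) = sum of Pr(Bag k) * Pr(Blue | Bag k)
pr_blue = sum(p_bag * p_blue for p_bag, p_blue in bags)
print(pr_blue)   # 29/60
```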
Bayes’ Theorem and Law of Total Probability: Example
(i) The probability that the pregnancy test shows a positive result.
We use the Law of Total Probability. Let ‘Pos’ be the event that a
pregnancy test result is positive and let ‘Preg’ be the event that a woman
is pregnant.
(ii) The probability that the woman is pregnant, given that the pregnancy
test shows a positive result.
We use Bayes’ Theorem.
Pr (Preg|Pos) = Pr (Pos|Preg) Pr (Preg) / Pr (Pos)    (by Bayes’ Theorem)
= (0.96)(0.07) / 0.1416
= 0.474576
(iii) The probability that the woman is not pregnant, given that the pregnancy
test shows a negative result.
We again use Bayes’ Theorem.
(iv) The probability that the pregnancy test gives an incorrect result (false
positive or false negative).
We use the additive rule for mutually exclusive (disjoint) events. The
two events that represent an incorrect result are Pos ∩ Pregc (positive
test and not pregnant =⇒ false positive) and Posc ∩ Preg (negative test
and pregnant =⇒ false negative). We need to find the probability of
the union of these two events, which are obviously mutually exclusive
(disjoint) events.
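All four parts can be computed in a few lines. The sensitivity 0.96 and prevalence 0.07 are taken from the worked part (ii) above; the false-positive rate Pr(Pos|Pregᶜ) = 0.08 is not given explicitly in this excerpt and is inferred here so as to be consistent with the stated Pr(Pos) = 0.1416.

```python
sens = 0.96   # Pr(Pos | Preg), from part (ii)
prev = 0.07   # Pr(Preg), from part (ii)
fpr = 0.08    # Pr(Pos | not Preg), inferred from Pr(Pos) = 0.1416

# (i) Law of total probability
pr_pos = sens * prev + fpr * (1 - prev)              # 0.1416

# (ii) Bayes' Theorem
pr_preg_given_pos = sens * prev / pr_pos             # 0.474576

# (iii) Bayes' Theorem on the negative result
pr_notpreg_given_neg = (1 - fpr) * (1 - prev) / (1 - pr_pos)

# (iv) Pr(false positive) + Pr(false negative), disjoint events
pr_wrong = fpr * (1 - prev) + (1 - sens) * prev      # 0.0772

print(round(pr_pos, 4), round(pr_preg_given_pos, 6),
      round(pr_notpreg_given_neg, 4), round(pr_wrong, 4))
```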
• Often, situations arise in which the outcomes are the different ways in which
a set of objects can be ordered or arranged
Permutations
• For example, how many three-letter sequences can be formed from the letters
abc, using each letter only once?
• We can also have permutations with repetition. For instance, how many three-
letter sequences can be formed from the letters abc, if each letter may be used
multiple times?
– A useful approach is to draw a blank slot for each position, __ × __ × __ ,
and then fill in the number of possibilities for each position in the permutation
– For abc without repetition, this would be: 3 × 2 × 1 = 6
– For abc with repetition, this would be: 3 × 3 × 3 = 27
• See if you can use the above approach to answer the following:
Tree Diagrams
• Figure 4.15 gives a tree diagram for counting the six three-letter sequences of
the letters abc where no letter can be repeated
Figure 4.15: Tree Diagram for Non-Repeating Permutations of letters abc
Permutation Formulas
– Hence, if we want to know how many ways five distinct books can be
arranged on a shelf, the answer is 5! = 120
• n Pr is read as ‘n permute r’
– Suppose South African Idols is down to the last ten contestants. How
many possible permutations are there of the top three?
10 P3 = 10! / (10 − 3)! = 3628800 / 5040 = 720
• A further explanation of the above formula: we have 10! on top because this is
the number of ways that 10 people can be arranged in order (10 × 9 × 8 × · · · × 1).
The order of the last 7 people (outside the top three) does not matter to us,
and there are 7! ways to arrange them, so we divide by 7!
• A more general version of this permutation formula can be used to count the
permutations of a set of n objects of which there are k different types, with
r1 objects of type 1, r2 objects of type 2, . . ., and rk objects of type k. The
formula is:
n! / (r1 ! r2 ! · · · rk !)
• For example, suppose a lecturer has ten textbooks in her office: five statistics
textbooks, three mathematics textbooks and two chemistry textbooks. (The
textbooks within each subject are all identical.) How many different ways
could she arrange them on her shelf?
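Both kinds of count above can be checked with Python's math module: math.perm gives n Pr directly, and the typed-objects formula takes only a few lines.

```python
from math import factorial, perm

def arrangements(*counts):
    # n! / (r1! r2! ... rk!) distinct orderings of objects of k types
    n = sum(counts)
    result = factorial(n)
    for r in counts:
        result //= factorial(r)
    return result

print(perm(10, 3))            # 10 P 3 = 720 (Idols top-three example)
print(arrangements(5, 3, 2))  # 10!/(5! 3! 2!) = 2520 shelf arrangements
```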
• The formula for the number of ways to choose r objects from a set of n objects
is:
n Cr = (n choose r) = n! / (r! (n − r)!)
• Note: n Cr and (n choose r) are two symbolic ways of expressing the same quantity
5 Probability Distributions
5.1 Basic Concepts
Declaring a Random Variable
• We represent random variables with capital letters such as X and Y , and the
values they take on with small letters such as x and y
• Random variables can be classified into two categories, which result in major
differences in the form of the probability distributions:
5.2 Discrete Probability Distributions
Properties of Discrete Probability Distributions
• Clearly the outcomes are countable (the support is S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12})
and so X is a discrete random variable
x      Pr (X = x)
2      1/36
3      2/36
4      3/36
5      4/36
6      5/36
7      6/36
8      5/36
9      4/36
10     3/36
11     2/36
12     1/36
Table 5.1: Probability Distribution for the Sum of Two Six-Sided Dice
• We can use the probability distribution table to work out the probabilities of
events involving more than one outcome. For example:
– What is the probability that the dice add to a number greater than 9?
Pr (X > 9) = Pr (X = 10) + Pr (X = 11) + Pr (X = 12) = (3 + 2 + 1)/36 = 6/36 = 1/6
Note that if we had said ‘at least 9’ we would have written Pr (X ≥ 9)
and would have included Pr (X = 9) in the calculation
– What is the probability that the dice add to an odd number?
Pr (X ∈ {3, 5, 7, 9, 11}) = Pr (X = 3) + Pr (X = 5) + Pr (X = 7)
+ Pr (X = 9) + Pr (X = 11)
= (2 + 4 + 6 + 4 + 2)/36 = 18/36 = 1/2
• Can you verify that the probability distribution in Table 5.1 satisfies the two
properties of a discrete probability distribution?
x 0 1 2 3
Pr (X = x) 0.3 0.4 0.2 0.1
• What is the probability that the salesperson will make at most (no more than)
one sale tomorrow?
• Consider our definition of the random variable X for a coin flip earlier
• The PMF always outputs a value of 0 for any value not in the support of the
random variable
• To make this explicit we should always include the ‘0 otherwise’ in the defini-
tion of the function
• Can you work out the probability mass function for the sum of two six-sided
dice? (Challenging)
fX (x) = (6 − |x − 7|)/36   if x ∈ {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
         0                  otherwise
• Try substituting values of x into this function and see if you get the same
probabilities as in Table 5.1
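One way to convince yourself of this formula is to enumerate all 36 equally likely ordered rolls and compare the relative frequencies with the PMF:

```python
from itertools import product

def f_X(x):
    # PMF of the sum of two six-sided dice
    return (6 - abs(x - 7)) / 36 if 2 <= x <= 12 else 0

# brute-force check against the 36 equally likely ordered rolls
sums = [a + b for a, b in product(range(1, 7), repeat=2)]
for x in range(2, 13):
    assert abs(f_X(x) - sums.count(x) / 36) < 1e-12
print(f_X(7))   # 6/36 = 0.1666..., the most likely sum
```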
Figure 5.1: Probability Mass Function of Sum of Two Six-Sided Dice
• Determine the value of k for which the function fX (x) below is a valid PMF
fX (x) = kx   for x ∈ S = {1, 2, 3, 4}
         0    otherwise
• Solution:
Σ_{x∈S} fX (x) = 1
Σ_{x=1}^{4} kx = 1
k(1 + 2 + 3 + 4) = 1
k = 1/10
• In other words, if we were to repeat the experiment associated with the random
variable many times, the mean of the values we obtain will converge to the
expectation (according to the Law of Large Numbers)
• In other words, it is a weighted average of all the values in the support, with
probabilities serving as weights
• In other words, if we were to repeat the experiment associated with the random
variable many times, the variance of the values we obtain will converge to the
variance of the distribution (according to the Law of Large Numbers)
Expectation and Variance of a Discrete Random Variable: Example
• Find the expectation and variance of the random variable X representing the
sum of two six-sided dice
Expectation:
E (X) = Σ_{x=2}^{12} x fX (x)
= Σ_{x=2}^{12} x (6 − |x − 7|)/36
= 2(1/36) + 3(2/36) + 4(3/36) + 5(4/36)
+ 6(5/36) + 7(6/36) + 8(5/36) + 9(4/36)
+ 10(3/36) + 11(2/36) + 12(1/36)
= (2 + 6 + 12 + 20 + 30 + 42 + 40 + 36 + 30 + 22 + 12)/36
= 252/36 = 7
Variance:
E (X²) = Σ_{x=2}^{12} x² fX (x)
= Σ_{x=2}^{12} x² (6 − |x − 7|)/36
= 2²(1/36) + 3²(2/36) + 4²(3/36) + 5²(4/36)
+ 6²(5/36) + 7²(6/36) + 8²(5/36) + 9²(4/36)
+ 10²(3/36) + 11²(2/36) + 12²(1/36)
= (4 + 18 + 48 + 100 + 180 + 294 + 320 + 324 + 300 + 242 + 144)/36
= 1974/36
Var (X) = E (X²) − µ²
= 1974/36 − 7² = 1974/36 − 49
= 1974/36 − 1764/36 = 210/36 = 5.8333
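The two sums in this example are easy to verify numerically:

```python
def f_X(x):
    # PMF of the sum of two six-sided dice
    return (6 - abs(x - 7)) / 36 if 2 <= x <= 12 else 0

support = range(2, 13)
mean = sum(x * f_X(x) for x in support)      # E(X) = 252/36 = 7
ex2 = sum(x ** 2 * f_X(x) for x in support)  # E(X^2) = 1974/36
var = ex2 - mean ** 2                        # 210/36 = 5.8333...
print(mean, round(var, 4))
```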
Expectation and Variance of a Discrete Random Variable
• Derive the expectation and variance of the random variable with PMF
fX (x) = x/10   for x ∈ S = {1, 2, 3, 4}
         0      otherwise
• In this section, we will explore a few discrete probability distributions that are
of special importance in statistics (though there are many, many others)
• These are:
Figure 5.2: Probability Mass Function of X ∼ Uniform(20)
• As is clear from the graph, the distribution is called ‘Uniform’ because the
probability is uniform (the same) for all outcomes
• Twenty people, including Ntokozo, have had their names entered in a draw for a
prize. One name is drawn at random. What is the probability that Ntokozo
wins the prize?
Let us number Ntokozo as the 1st of the 20 outcomes; thus X = 1 corresponds
to the outcome of Ntokozo winning the prize
fX (1) = 1/k = 1/20
• We can easily derive the expectation and variance of a discrete uniform random
variable
• Expectation:
E (X) = Σ_{x=1}^{k} x fX (x)
= Σ_{x=1}^{k} x/k
= (1/k) Σ_{x=1}^{k} x
= (1/k) · k(k + 1)/2    (formula for sum of an integer series from 1 to k)
= (k + 1)/2
• Variance:
E (X²) = Σ_{x=1}^{k} x² fX (x)
= Σ_{x=1}^{k} x²/k
= (1/k) Σ_{x=1}^{k} x²
= (1/k) · k(k + 1)(2k + 1)/6    (formula for sum of a squared integer series from 1 to k)
= (k + 1)(2k + 1)/6
Var (X) = E (X²) − µ²
= (k + 1)(2k + 1)/6 − ((k + 1)/2)²
= (1/3)k² + (1/2)k + 1/6 − ((1/4)k² + (1/2)k + 1/4)
= (1/12)k² − 1/12 = (k² − 1)/12
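A quick numerical check of these closed forms for, say, k = 20:

```python
k = 20
support = range(1, k + 1)
mean = sum(x / k for x in support)                  # should equal (k+1)/2
var = sum(x ** 2 / k for x in support) - mean ** 2  # should equal (k^2-1)/12

assert abs(mean - (k + 1) / 2) < 1e-9       # 10.5
assert abs(var - (k ** 2 - 1) / 12) < 1e-9  # 33.25
print(mean, var)
```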
Binomial Distribution
• If the above four conditions hold, then X has a binomial distribution with the
following PMF:
fX (x) = n Cx p^x (1 − p)^(n−x)   if x = 0, 1, 2, . . . , n
         0                        otherwise
– Since the n trials are independent, the multiplicative rule for independent
events applies
– If we want to find Pr (X = x); this means we would have x successes and
n − x failures
– Let Si be the event of a success in the ith trial; then Pr (Si ) = p for
i = 1, 2, . . . , n and Pr (Sic ) = 1 − p for i = 1, 2, . . . , n (by complement
rule)
– One way to achieve x successes and n − x failures would be:
S1 ∩ S2 ∩ · · · ∩ Sx ∩ S^c_{x+1} ∩ S^c_{x+2} ∩ · · · ∩ S^c_{x+(n−x)=n}
– That is, the first x trials are successes and the rest (the last n − x) are
failures
– The probability of this happening would be:
Pr (S1 ∩ S2 ∩ · · · ∩ Sx ∩ S^c_{x+1} ∩ S^c_{x+2} ∩ · · · ∩ S^c_n)
= Pr (S1 ) Pr (S2 ) · · · Pr (Sx ) Pr (S^c_{x+1}) Pr (S^c_{x+2}) · · · Pr (S^c_n)    (by independence)
= p × p × · · · × p (x times) × (1 − p) × (1 − p) × · · · × (1 − p) (n − x times)
= p^x (1 − p)^(n−x)
– However, this instance where the first x trials are successes and the last
n − x are failures is only one of many possible orderings of x successes
and n − x failures
– Thus we need to consider how many ways there are to order these x
successes and n − x failures. The answer (see above under Counting
Outcomes) is
n! / (x! (n − x)!) = n Cx
– By adding each of these ways of getting x successes and n − x failures
(which are mutually exclusive), following the additive rule for mutually
exclusive events, we add p^x (1 − p)^(n−x) a total of n Cx times, giving us the
PMF formula above
• A fair coin is flipped 10 times. What is the probability that the result is
‘Heads’ exactly seven times?
• Let us define ‘Heads’ as a success and ‘Tails’ as a failure (in general it is not
important which outcome we define as a success and which as a failure). Then
X is the number of ‘Heads’ obtained in the ten coin flips, which is a binomial
experiment with n = 10 and p = 1/2
• Thus X ∼ Binomial(10, 1/2)
Pr (X = 7) = n Cx p^x (1 − p)^(n−x)
= 10 C7 (1/2)^7 (1 − 1/2)^(10−7)
= 0.1172
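The binomial PMF is simple to compute with math.comb:

```python
from math import comb

def binom_pmf(x, n, p):
    # Pr(X = x) for X ~ Binomial(n, p)
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

print(round(binom_pmf(7, 10, 0.5), 4))   # 0.1172, the coin-flip example
```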
• A library knows that from past experience, 42% of books that are borrowed
are returned after the due date. If 15 books are borrowed today, what is the
probability that less than three of them are returned after the due date?
Pr (X < 3) = Pr (X = 0) + Pr (X = 1) + Pr (X = 2)
= 15 C0 (0.42)^0 (1 − 0.42)^(15−0) + 15 C1 (0.42)^1 (1 − 0.42)^(15−1)
+ 15 C2 (0.42)^2 (1 − 0.42)^(15−2)
= 0.000283 + 0.003071 + 0.015569 = 0.0189
• Figure 5.3 plots the binomial distribution PMF for the library example above.
Of course the graph of this function will change for different values of n and p
• In our ten coin flips example, the distribution mean would be:
µ = E (X) = np = (10)(0.5) = 5
• It makes sense intuitively that the average number of heads in ten coin flips
would be five
Poisson Distribution
• The Poisson distribution is used to model counts of ‘rare’ events that occur
with a fixed average rate over a specified period of time (or, ‘rare’ objects
that occur with a fixed average rate over a specified space: (distance, area, or
volume))
• The parameter λ is interpreted as the average rate of events per unit of time
(or average rate of objects per unit of space)
• Figure 5.4 plots the PMF of the Poisson distribution for the case λ = 2.5,
for x = 0, 1, 2, . . . , 10. Note that this is not the entire distribution, since the
support goes up to ∞, but of course we cannot plot up to ∞ on the horizontal
axis
• Our shorthand for saying that the random variable X has a Poisson distribu-
tion with average rate parameter λ is X ∼ Poisson(λ)
Figure 5.4: Probability Mass Function of X ∼ Poisson(2.5)
• The number of complaints that a busy laundry facility receives per day is a
random variable X ∼ Poisson(3.3)
1. What is the probability that the facility will receive less than two com-
plaints on a particular day?
Pr (X < 2) = Pr (X = 0) + Pr (X = 1)
= λ^0 e^(−λ)/0! + λ^1 e^(−λ)/1!
= 0.036883 + 0.121714 = 0.1586
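The same computation in Python; math.exp and math.factorial are all that is needed:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    # Pr(X = x) for X ~ Poisson(lambda)
    return lam ** x * exp(-lam) / factorial(x)

lam = 3.3   # average complaints per day
print(round(poisson_pmf(0, lam) + poisson_pmf(1, lam), 4))   # 0.1586
```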
E (X) = λ
Var (X) = λ
• Thus this distribution has the unusual property that its expectation and vari-
ance are equal
• Consider the above example. Clearly, the average rate of complaints received
per day is 3.3. If we are interested in the average rate of complaints per five-
day work week instead of per day, we are simply scaling the time interval by
a multiple of 5.
• The scalability property of the Poisson distribution means that if the number of
complaints received in one day is Poisson-distributed with parameter λ = 3.3,
then the number of complaints received in five days is Poisson-distributed with
parameter 5λ = 16.5
Hypergeometric Distribution
– There is a population of N objects that fall into two categories (we can
call them ‘successes’ and ‘failures’ if we like)
– There are K successes in the population and N − K failures
– We randomly choose n objects from the population without replacement
(that is, after choosing the first object, we do not put it back before
choosing the second; thus the same object cannot be chosen twice)
• The PMF of the hypergeometric distribution is fX (x) = C(K, x) C(N − K, n − x) / C(N, n), where C(a, b) denotes the binomial coefficient (the number of ways to choose b objects from a)
• Our shorthand for saying that the random variable X has a hypergeometric
distribution with population size N , number of successes in population K, and
number of trials (draws) n, is X ∼ Hypergeometric(N, K, n)
• Note the following differences between the hypergeometric and binomial distributions:
– The n ‘trials’ in a hypergeometric experiment are not independent, unlike
the n ‘trials’ in a binomial experiment
– The probability of success is different in each trial of a hypergeometric
experiment, unlike the probability of success in a binomial experiment,
which remains constant across trials (p)
– The support of the hypergeometric distribution is not necessarily the
integers from 0 to n like that of the binomial distribution. The lower
limits and upper limits can have restrictions on them in certain cases:
∗ If the number of objects selected (n) is greater than the number of
failures in the population (N −K) then there must be some successes
selected. Thus the minimum possible value of a hypergeometric ran-
dom variable is max (0, n − (N − K))
∗ If the number of objects selected (n) is greater than the number of
successes in the population (K) then there must be some failures
selected. Thus the maximum possible value of a hypergeometric ran-
dom variable is min (n, K)
• Observe that in this case the smallest possible number of successes is 2 (since
we are drawing 8 objects and there are only 6 failures in the population), while
the greatest possible number of successes is 8 (since we are drawing 8 objects
and there are 9 successes in the population)
Hypergeometric Distribution: Examples
• A box (pictured in Figure 5.6) contains 15 balls, of which 9 are grey and 6 are
white
• An academic department at a university consists of 20 staff, of whom 12 are
men and 8 are women. A committee is to be formed by randomly choosing
four staff members. What is the probability that the committee consists of
two men and two women?
• Let us define ‘women’ as successes and ‘men’ as failures (the men may not
appreciate this, but we can define it the other way around if we prefer and
still get the same answer)
• Then X ∼ Hypergeometric(N = 20, K = 8, n = 4) and
Pr (X = 2) = C(8, 2) C(12, 2) / C(20, 4) = (28)(66)/4845 = 0.3814
• A lottery game is played as follows. A player writes down six integers between
1 and 52 on a card. 52 numbered balls are placed in a machine and 6 balls are
selected at random. A player wins the jackpot if all six balls that come out of
the machine match the numbers on his/her card. What is the probability of
winning the jackpot when playing with one card?
• In this case we can define the six balls whose numbers match those written on
the player’s card as ‘successes’ and the other 52 − 6 = 46 balls as ‘failures’.
We then have X ∼ Hypergeometric(N = 52, K = 6, n = 6) and we want to
know Pr (X = 6) = fX (6) (all six selected balls must be successes to win the
jackpot)
fX (6) = C(6, 6) C(52 − 6, 6 − 6) / C(52, 6)
= 1/20358520
• The probability of winning the jackpot in this lottery is 1 in 20 358 520 (less
than 1 in 20 million!)
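This probability is easy to verify with Python's `math.comb` (a sketch; `hypergeom_pmf` is our own helper name):

```python
from math import comb

def hypergeom_pmf(x, N, K, n):
    """Pr(X = x) for X ~ Hypergeometric(N, K, n)."""
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)

# Lottery: N = 52 balls, K = 6 'successes', n = 6 drawn; all 6 must match
p_jackpot = hypergeom_pmf(6, N=52, K=6, n=6)
print(round(1 / p_jackpot))  # → 20358520
```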
• The expected value of a random variable X ∼ Hypergeometric(N, K, n) is given by
E (X) = nK/N
– In the committee example above with N = 20, K = 8 and n = 4, the
expected number of women on the committee is E (X) = (4)(8)/20 = 1.6
– In the lottery example above with N = 52, K = 6 and n = 6, the
expected number of balls drawn that match those on the player’s card is
E (X) = (6)(6)/52 = 0.6923
• The variance of a random variable X ∼ Hypergeometric(N, K, n) is given by
Var (X) = nK(N − K)(N − n) / [N²(N − 1)]
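As a check, the mean and variance formulas can be evaluated for the committee example (N = 20, K = 8, n = 4); a small stdlib sketch with our own helper name:

```python
def hypergeom_mean_var(N, K, n):
    """Mean and variance of X ~ Hypergeometric(N, K, n)."""
    mean = n * K / N
    var = n * K * (N - K) * (N - n) / (N**2 * (N - 1))
    return mean, var

mean, var = hypergeom_mean_var(N=20, K=8, n=4)
print(mean, round(var, 4))  # → 1.6 0.8084
```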
5.5 Special Continuous Probability Distributions
Special Continuous Probability Distributions
• These are the continuous uniform distribution and the normal distribution
• We will also learn how to use the normal distribution to calculate approximate
probabilities from the binomial distribution
• A continuous uniform random variable X ∼ Uniform(a, b) has probability density function
fX (x) = 1/(b − a) for x ∈ [a, b]
fX (x) = 0 otherwise
• The most basic case involves the interval [0, 1] (a = 0, b = 1); generating a
Uniform(0, 1) random variable is the first step of all pseudo-random number
generators in computer science (which we briefly discussed back in chapter 1)
• The PDF plot for this case of the continuous uniform distribution is displayed
in Figure 5.7
Figure 5.7: Probability Density Function of X ∼ Uniform(a = 0, b = 1)
Figure 5.8: Area under PDF of X ∼ Uniform(a = 1, b = 5) representing
Pr (1.2 ≤ X ≤ 2.45)
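The shaded area in Figure 5.8 is just a rectangle, so the probability can be computed directly; a minimal sketch (`uniform_prob` is our own helper name):

```python
def uniform_prob(x1, x2, a, b):
    """Pr(x1 <= X <= x2) for X ~ Uniform(a, b)."""
    lo, hi = max(x1, a), min(x2, b)  # clip the interval to the support [a, b]
    return max(hi - lo, 0.0) / (b - a)

p = uniform_prob(1.2, 2.45, a=1, b=5)
print(round(p, 4))  # → 0.3125
```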
• Consider the two histograms of student marks in Figure 5.3, one for a group
of 50 students and one for a group of 5000 students
Figure 5.3: Histograms of Marks for Groups of 50 Students (left) and 5000 Students
(right)
• Do you notice that the histogram on the right is ‘bell shaped’? This is because
marks, like many other phenomena in the world, usually tend to follow a
Normal Distribution (also known as Gaussian Distribution) which has a
probability density function with the famous ‘bell-curve’ shape shown in Figure
5.9
• We can see from Figure 5.3 that as the number of observations increases, the
histogram matches the bell shape of this curve more and more closely
• To calculate Pr (x1 ≤ X ≤ x2 ) for a normally distributed random variable X
we would need to evaluate the integral of the normal PDF fX (x) between x1
and x2 in order to find the area under the PDF between x1 and x2
• This integral is very difficult to solve (even if you have already learned some
integration techniques in Mathematics 1A)
• Instead we will use two tricks to get probabilities from the normal distribution
• The first trick is to transform X ∼ N (µ, σ²) into a standard normal random
variable Z ∼ N (0, 1) using Z = (X − µ)/σ
• This transformation (which you may recall as similar to calculating a ‘stan-
dard score’ from chapter 3) is very useful because it means that if we have
probabilities for a standard normal distribution we can use them to obtain
probabilities for any normal distribution
• The PDF of the standard normal distribution is given by
fZ (x) = (1/√(2π)) e^{−x²/2} for − ∞ < x < ∞
• If we want to find Pr (Z < z) for any value, this requires us to calculate the
integral
∫_{−∞}^{z} fZ (x) dx
• This is equivalent to finding the area under the PDF as displayed in Figure
5.10
Figure 5.10: Graph of area under Standard Normal PDF between −∞ and z, rep-
resenting Pr (Z < z)
• This is still not possible to solve by hand; but using numerical integration
techniques (which you will learn about in your Numerical Methods modules)
it can be accurately approximated
• At the back of your notes you will find a ‘Z Table’ that gives Pr (Z < z), correct
to four decimal places, for any z value (up to 2 decimal places) between 0 and
3.49
Procedure for Using the Z Table to Calculate Pr (Z < z)
• Suppose we want to know Pr (Z < 0.42)
• We look up the value 0.42 in the Z Table as shown in Figure 5.11 and find
that Pr (Z < 0.42) = 0.6628
• Thus, in order to find Pr (Z > z) from the Z Table, we look up the value of z
and find Pr (Z < z) and then subtract it from 1 to get Pr (Z > z)
• For example, suppose we want to find Pr (Z > 2)
Pr (Z > 2) = 1 − Pr (Z < 2)
= 1 − 0.9772 (from table)
= 0.0228
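If no table is at hand, Pr (Z < z) can be computed from the error function in Python's standard library; the values below match the Z Table to four decimals:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF: Pr(Z < z), via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

print(round(phi(0.42), 4))   # → 0.6628
print(round(1 - phi(2), 4))  # → 0.0228
```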
Finding Standard Normal Probabilities for Negative z Values
• Observe that the Z Table does not contain negative values of z, yet a standard
normal random variable has a mean of 0 and can take on negative values
• The answer is that we use the fact that the normal distribution is perfectly
symmetrical about 0. This implies that Pr (Z < z) = Pr (Z > −z) for any z
• Thus we would look up the positive value −z in the table and take 1 minus
the answer to get Pr (Z < z) if z is a negative number
Pr (Z > z) = 1 − Pr (Z < z)
= 1 − Pr (Z > −z)
= 1 − [1 − Pr (Z < −z)]
= Pr (Z < −z)
• You will notice that the Z Table only goes up to 3.49, but the standard normal
distribution is defined for z ∈ (−∞, ∞). For z ≥ 3.50 the probability is so
close to 1 that we simply take Pr (Z < z) ≈ 1 (and, by symmetry, Pr (Z < z) ≈ 0
for z ≤ −3.50)
Procedure for Using the Z Table to Calculate Pr (z1 < Z < z2 )
• What if we want to find the probability that a standard normal random vari-
able falls between two values z1 and z2 such that z1 < z2 ?
• If we look at Figure 5.12 we can observe the area under the PDF corresponding
to Pr (z1 < Z < z2 )
Figure 5.12: Graph of area under Standard Normal PDF between z1 and z2 , repre-
senting Pr (z1 < Z < z2 )
• It is clear from the graph that the area we want to calculate is equivalent to
∫_{−∞}^{z2} fZ (x) dx − ∫_{−∞}^{z1} fZ (x) dx
that is, the area under the PDF from −∞ to z2 minus the area under the
PDF from −∞ to z1
• Since we have expressed the ‘between’ probability in terms of two ‘less than’
probabilities, we can use the Z Table to get these two ‘less than’ probabilities
and then subtract them
• In other words, Pr (z1 < Z < z2 ) = Pr (Z < z2 ) − Pr (Z < z1 )
• Suppose we want to find out for what value of z the following statement is
true: Pr (Z > z) = 0.025
• Since the table gives us Pr (Z < z) values and not Pr (Z > z) values, we first
rearrange our statement:
Pr (Z > z) = 0.025
1 − Pr (Z < z) = 0.025
Pr (Z < z) = 0.975
• Thus z = 1.96 is the value of z for which the statement Pr (Z > z) = 0.025 is
true (approximately)
• Example: suppose the length of a human pregnancy is normally distributed
with a mean of 266 days and a standard deviation of 16 days.
1. What is the probability that a pregnancy lasts less than 245 days?
3. There is an 80% probability that a pregnancy lasts more than x days.
Determine x, correct to the nearest whole number.
1. Let X be the random variable representing the length of a pregnancy. We
know X is normally distributed, so we can transform X to a standard
normal random variable Z as follows:
Z = (X − µ)/σ = (X − 266)/16
Thus:
Pr (X < 245) = Pr ((X − 266)/16 < (245 − 266)/16)
= Pr (Z < −21/16) = Pr (Z < −1.31)
= 1 − Pr (Z < 1.31) = 1 − 0.9049 = 0.0951
2. What is the probability that a pregnancy lasts between 270 and 280 days?
Pr (270 < X < 280) = Pr ((270 − 266)/16 < (X − 266)/16 < (280 − 266)/16)
= Pr (4/16 < Z < 14/16) = Pr (0.25 < Z < 0.88)
= Pr (Z < 0.88) − Pr (Z < 0.25)
= 0.8106 − 0.5987 = 0.2119
3. There is an 80% probability that a pregnancy lasts more than x days. Deter-
mine x, correct to nearest whole number.
Pr (X > x) = 0.8
Pr (Z > (x − 266)/16) = 0.8
Let z = (x − 266)/16
Pr (Z > z) = 0.8
1 − Pr (Z < z) = 0.8
Pr (Z < z) = 0.2
Pr (Z < −z) = 1 − 0.2 = 0.8 (by symmetry property of normal distribution)
Pr (Z < 0.84) = 0.7995 ≈ 0.8 (from table)
Thus z ≈ −0.84
Thus −0.84 ≈ (x − 266)/16, which gives us:
x ≈ 252.56 ⇒ 253 (nearest whole number)
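All three answers can be reproduced in Python with the standard library. Here `phi` and `phi_inv` are our own helpers (`phi_inv` is a simple bisection, an assumption of this sketch, not a library routine):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF: Pr(Z < z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def phi_inv(p):
    """Inverse standard normal CDF via bisection (our own helper)."""
    lo, hi = -6.0, 6.0
    for _ in range(80):
        mid = (lo + hi) / 2
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

mu, sigma = 266, 16
p1 = phi((245 - mu) / sigma)                            # 1. Pr(X < 245)
p2 = phi((280 - mu) / sigma) - phi((270 - mu) / sigma)  # 2. Pr(270 < X < 280)
x = mu + sigma * phi_inv(0.2)                           # 3. Pr(X > x) = 0.8
print(round(p1, 4), round(p2, 4), round(x))
```

Because no intermediate rounding of z occurs here, p1 and p2 differ from the table-based answers (0.0951 and 0.2119) in the third decimal; part 3 still gives 253.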
• For instance, if n = 100 and p = 0.3 and we want to know Pr (25 ≤ X ≤ 75),
we have to calculate Pr (X = 25)+Pr (X = 26)+Pr (X = 27)+· · ·+Pr (X = 75)
which takes a long time
• But notice in the probability mass function graphs in Figure 5.13 that as the
number of trials n increases, the graph more and more closely resembles the
bell-shaped curve of the normal probability density function:
Figure 5.13: Probability Mass Function for X ∼ Binomial(10, 0.3) (left) and for
X ∼ Binomial(100, 0.3) (right)
• This gives us an idea: since when n is large, the binomial distribution behaves
similar to a normal distribution, why don’t we approximate the binomial prob-
abilities using the normal distribution?
• Example: suppose X ∼ Binomial(11, 0.5) and we want to find Pr (2 ≤ X ≤ 6)
• In this case, np = 11(0.5) = 5.5 > 5 and n(1 − p) = 11(0.5) = 5.5 > 5 so the
approximation is valid
• It seems we should use the normal approximation as follows:
µ = np = 11(0.5) = 5.5
σ = √(np(1 − p)) = √(11(0.5)(0.5)) = 1.6583
Let Y ∼ N (5.5, 1.6583)
Pr (2 ≤ X ≤ 6) ≈ Pr (2 ≤ Y ≤ 6)
= Pr ((2 − 5.5)/1.6583 ≤ Z ≤ (6 − 5.5)/1.6583)
= Pr (−2.11 ≤ Z ≤ 0.30)
(Note that ≤ and < are the same for continuous random variables)
= Pr (Z < 0.30) − Pr (Z < −2.11)
= 0.6179 − [1 − Pr (Z < 2.11)]
= 0.6179 − [1 − 0.9826] = 0.6005
• The exact answer to this question, if we used the ordinary binomial method
Pr (2 ≤ X ≤ 6) = Pr (X = 2)+Pr (X = 3)+Pr (X = 4)+Pr (X = 5)+Pr (X = 6),
is 0.7197. So our approximation is very bad. What went wrong?
• The problem is that the binomial distribution is discrete while the normal
distribution is continuous. Under the continuity correction, each integer value
x of X is represented by the interval from x − 0.5 to x + 0.5 under the normal
curve. Hence, for the point X = 2, for instance, we need to take into account
the area just to the left and just to the right of 2
Figure 5.14: Illustration of Normal Approximation without and with Continuity
Correction for Pr (2 ≤ X ≤ 6)
• Applying the continuity correction:
Pr (2 ≤ X ≤ 6) ≈ Pr (1.5 ≤ Y ≤ 6.5) = Pr (−2.41 ≤ Z ≤ 0.60)
= Pr (Z < 0.60) − Pr (Z < −2.41)
= 0.7257 − [1 − 0.9920] = 0.7177
• We can see that this is now much closer to the exact answer of 0.7197
• For strict inequalities the correction works inward, since 2 < X < 6 for a
discrete random variable means 3 ≤ X ≤ 5. Hence 2 < X < 6 translates to
2.5 < Y < 5.5
• In the graph on the left, we take the area under the Normal curve from 2 to 6.
But the region from 2 to 2.5 still ‘belongs’ to 2 and is not greater than 2 from
a continuous point of view. Similarly, the region from 5.5 to 6 still ‘belongs’
to 6 and is not less than 6 from a continuous point of view.
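The effect of the continuity correction can be demonstrated numerically, comparing the exact binomial sum against the normal approximation with and without the correction (a stdlib sketch; the helper names are ours):

```python
from math import comb, erf, sqrt

def phi(z):
    """Standard normal CDF: Pr(Z < z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def binom_exact(lo, hi, n, p):
    """Exact Pr(lo <= X <= hi) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(lo, hi + 1))

def normal_approx(lo, hi, n, p, correct=True):
    """Normal approximation, with or without continuity correction."""
    mu, sigma = n * p, sqrt(n * p * (1 - p))
    c = 0.5 if correct else 0.0
    return phi((hi + c - mu) / sigma) - phi((lo - c - mu) / sigma)

print(round(binom_exact(2, 6, 11, 0.5), 4))                   # → 0.7197
print(round(normal_approx(2, 6, 11, 0.5, correct=False), 4))  # poor
print(round(normal_approx(2, 6, 11, 0.5, correct=True), 4))   # much closer
```

The corrected value lands far closer to the exact 0.7197 than the uncorrected one.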
• Example: for X ∼ Binomial(100, 0.3) as above, use the normal approximation
to estimate Pr (25 ≤ X ≤ 75)
µ = np = 100(0.3) = 30
σ = √(np(1 − p)) = √(100(0.3)(1 − 0.3)) = 4.5826
Let Y ∼ N (30, 4.5826)
Pr (25 ≤ X ≤ 75) ≈ Pr (24.5 ≤ Y ≤ 75.5)
= Pr ((24.5 − 30)/4.5826 ≤ Z ≤ (75.5 − 30)/4.5826)
= Pr (−1.20 ≤ Z ≤ 9.93)
= Pr (Z < 9.93) − Pr (Z < −1.20)
= 1 − [1 − Pr (Z < 1.20)]
= 1 − [1 − 0.8849] = 0.8849
• Note: the exact answer (to four decimal places) is 0.8864, so we are not far off
• Use the conventional rule to identify which of the following probabilities in-
volving a binomial random variable can be adequately approximated using the
normal distribution:
• For instance, if we take ‘residents of South Africa’ as our target population, the
government agency Statistics South Africa is tasked with providing accurate
information on characteristics such as:
– The average household income (mean income of households)
– The fertility rate (mean number of children born per woman of childbear-
ing age)
– Etc.
• It is difficult and very expensive to obtain data from a population of well over
50 million people in order to exactly quantify the parameter
• For this reason, Statistics South Africa usually obtains data from a sample,
that is, a subset of the population
• In Statistics 2B, you will learn in much greater detail how sampling should be
done in order to obtain good statistics
• For now, it is enough for you to understand what a sample is and what a
statistic is
• Recall also: a random variable is a rule which assigns a value to each outcome
of an experiment. It has error or uncertainty; its value cannot be known for
certain until the experiment takes place
• Consider this simple example: we are going to flip a coin n times, because we
want to know whether the coin is fair, i.e. whether the probability of ‘Heads’
is equal to the probability of ‘Tails’ (both 0.5)
– In this case, there is (in theory) an infinite population of coin flips in the
universe, from which we are going to take a sample of size n (you can see
that a ‘population’ is not always clearly defined)
– The parameter is p, the probability of getting ‘Heads’
– The random variable is X, the number of times the coin comes up
‘Heads’ in our sample of n flips
– The statistic is p̂ = X/n, the proportion of ‘Heads’ in our sample of n
flips. This will be our estimator of p (when we put a ˆ on a parameter it
denotes a statistic that is an estimator of that parameter).
– The probability distribution that defines X is the binomial distribu-
tion, because what we have described is a binomial experiment
• Now here is the big new insight: a statistic, as we have defined it above,
is also a random variable!
• This sounds strange at first: after all, once we have flipped the coin n times
we know the value of the statistic p̂, so how is it random? But remember,
that is true of any random variable: we know the actual outcome after the
experiment has been done. But not before! Before we flip any coins, we don’t
know the value of p̂, and various outcomes are possible; hence it is random
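The randomness of p̂ can be made concrete with a small simulation: each repetition of the whole experiment yields a different value of the statistic (a sketch; the seed and sample size are arbitrary choices of ours):

```python
import random

random.seed(1)   # arbitrary seed, for reproducibility
n, p = 100, 0.5  # flips per experiment; true Pr(Heads) for a fair coin

p_hats = []
for _ in range(5):  # repeat the whole experiment 5 times
    x = sum(random.random() < p for _ in range(n))  # X = number of Heads
    p_hats.append(x / n)                            # the statistic p_hat = X/n
print(p_hats)  # five different realisations of the random variable p_hat
```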
Sampling Distributions
• Hence, because a statistic is a random variable, a statistic also has its own
probability distribution
6.2 Sampling Distribution for a Sample Mean
The Sampling Distribution of the Sample Mean of Normally Distributed
Random Variables
• Let us revisit the human pregnancy example. Let us assume that the length
of a human pregnancy is normally distributed with a mean of 266 days and a
standard deviation of 16 days. But suppose we don’t know this mean, and we
want to estimate it by collecting data from a random sample of mothers
• It turns out that the sample mean Ȳ of a random sample of n observations
from a normal distribution with mean µ and standard deviation σ is itself
normally distributed, with mean µ and standard deviation σ/√n; the larger
the sample size n, the smaller the spread of the distribution of Ȳ
• This makes sense: if we collect more data, we would expect to have a more
precise estimate of the average length of a pregnancy
Figure 6.1: Probability Distribution of Sample Mean of Pregnancy Durations for
Different Sample Sizes
• What we can see is that if we were to take a sample of just one mother, and
another researcher were to do the same, and a third researcher were to do the
same, and so forth, then when we all compared our results they would be very
spread out: they would have a large variance. One researcher might estimate
the average pregnancy length to be 240 days, and another, 290 days
• The amount of time university lecturers devote to their jobs per week is nor-
mally distributed with a mean of 52 hours and a standard deviation of 6 hours.
It is assumed that all lecturers behave independently.
1. What is the probability that a lecturer works for more than 60 hours per
week?
Let Y1 be the number of hours worked per week by this lecturer. (Note
that we could equivalently define Ȳ as the sample mean of this sample
of n = 1 observation; in this case we could use the sampling distribution
approach and get the same answer.)
Pr (Y1 > 60) = Pr ((Y1 − µ)/σ > (60 − µ)/σ)
= Pr (Z > (60 − 52)/6)
= Pr (Z > 1.33)
= 1 − Pr (Z < 1.33) = 1 − 0.9082 = 0.0918
2. What is the probability that the mean amount of work per week for four
randomly selected lecturers is more than 60 hours?
Let Y1 , Y2 , Y3 , Y4 be the number of hours worked per week by these four
respective lecturers. Then, according to the sampling distribution theo-
rem, Ȳ is a normally distributed random variable with a mean of µ = 52
and a standard deviation of σ/√n = 6/√4 = 3.
Pr (Ȳ > 60) = Pr ((Ȳ − µ)/(σ/√n) > (60 − µ)/(σ/√n))
= Pr (Z > (60 − 52)/(6/√4))
= Pr (Z > 2.67)
= 1 − Pr (Z < 2.67) = 1 − 0.9962 = 0.0038
We can see that the probability is much smaller in this case. Does this
agree with the graph above in terms of the effect of increasing sample
size on the spread of the sampling distribution?
3. What is the probability that if four lecturers are randomly selected, all
four work for more than 60 hours?
Because we have assumed that all lecturers are independent, we can use
the multiplication rule for independent events, which says that Pr (A ∩ B) =
Pr (A) Pr (B) if events A and B are independent. In this case we have
four events: Y1 > 60, Y2 > 60, Y3 > 60 and Y4 > 60. Of course, the mul-
tiplication rule for independent events can be extended to any number of
independent events. Thus:
Pr (Y1 > 60 ∩ Y2 > 60 ∩ Y3 > 60 ∩ Y4 > 60) = (0.0918)^4 ≈ 0.000071
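The three answers can be reproduced together in Python (a stdlib sketch; the computed values differ slightly from the table-based working because no rounding of z occurs here):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF: Pr(Z < z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, n = 52, 6, 4
p_one = 1 - phi((60 - mu) / sigma)               # 1. one lecturer works > 60 h
p_mean = 1 - phi((60 - mu) / (sigma / sqrt(n)))  # 2. mean of 4 lecturers > 60 h
p_all = p_one ** 4                               # 3. all four > 60 h (independence)
print(round(p_one, 4), round(p_mean, 4), round(p_all, 6))
```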
Sampling Distribution of a Sample Mean: Example Problem 2
• The manufacturer of cans of tuna that are supposed to have a net weight of
200 grams tells you that the net weight is actually a normal random variable
with a mean of 201.9 grams and a standard deviation of 5.8 grams. Suppose
you draw a random sample of 32 cans.
1. Find the probability that the mean weight of the sample is less than 199
grams.
Pr (Ȳ < 199) = Pr ((Ȳ − µ)/(σ/√n) < (199 − µ)/(σ/√n))
= Pr (Z < (199 − 201.9)/(5.8/√32))
= Pr (Z < −2.83)
= 1 − Pr (Z < 2.83) = 1 − 0.9977 = 0.0023
• Example: a teacher will take a random sample of n learners to a competition.
The relevant measurement is normally distributed with µ = 58 and σ = 13,
and she requires Pr (Ȳ > 50) = 0.9. What is the minimum n?
Pr (Ȳ > 50) = Pr ((Ȳ − µ)/(σ/√n) > (50 − µ)/(σ/√n)) = 0.9
Pr (Z > (50 − 58)/(13/√n)) = 0.9
Let z = (50 − 58)/(13/√n)
Pr (Z > z) = 0.9
Pr (Z < −z) = 0.9 (by symmetry)
−z ≈ 1.28 (from table: Pr (Z < 1.28) = 0.8997 ≈ 0.9)
z ≈ −1.28
(50 − 58)/(13/√n) ≈ −1.28
√n ≈ −1.28(13)/(−8)
√n ≈ 2.08
n ≈ 4.33
The calculation gives n ≈ 4.33; since she can only take an integer number of
learners, and rounding down to 4 would not satisfy the requirement, we must
round up. The teacher should take at least 5 learners to the competition.
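The sample-size calculation can be scripted directly from the derivation above, solving √n = zσ/(50 − µ) and rounding up (a stdlib sketch):

```python
from math import ceil

mu, sigma, target = 58, 13, 50
z = -1.28  # from the Z Table: Pr(Z < -1.28) ≈ 0.1, so Pr(Z > -1.28) ≈ 0.9
root_n = z * sigma / (target - mu)  # sqrt(n) = z * sigma / (50 - mu)
n = root_n ** 2
print(round(n, 4), ceil(n))  # → 4.3264 5
```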
1. Find the probability that one selected subcomponent is shorter than 114
cm
2. Find the probability that if five subcomponents are randomly selected,
their mean length is less than 114 cm
3. Find the probability that if five subcomponents are randomly selected,
all five are shorter than 114 cm
• (Challenging) The time it takes for a statistics lecturer to mark a test is nor-
mally distributed with a mean of 4.8 minutes and a standard deviation of 1.3
minutes. There are 60 students in the lecturer’s class. What is the probability
that he needs more than 5 hours to mark all the tests? (The 60 tests in this
year’s class can be considered a random sample of the many thousands of tests
the lecturer has marked and will mark.)
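A hint for the challenging problem, sketched in code: needing more than 5 hours (300 minutes) for 60 tests is the same event as the sample mean exceeding 300/60 = 5 minutes per test, so the sampling distribution of the mean applies. (This is our suggested approach, not worked in the notes.)

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF: Pr(Z < z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, n = 4.8, 1.3, 60
# Total time > 300 minutes  <=>  mean marking time > 300 / 60 = 5 minutes
z = (5 - mu) / (sigma / sqrt(n))
p_over = 1 - phi(z)
print(round(p_over, 4))  # probability the lecturer needs more than 5 hours
```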
References
Devore, J. L. & Farnum, N. R. (2005), Applied Statistics for Engineers and Scien-
tists, 2nd edn, Brooks/Cole, Belmont.
Hyndman, R. J. (1995), The problem with Sturges’ rule for constructing histograms.
Unpublished manuscript.
Keller, G. (2012), Statistics for Management and Economics, 9th edn, Southwestern
Cengage Learning, Mason.
Miller, I. & Miller, M. (2014), John E. Freund’s Mathematical Statistics with Appli-
cations, 8th edn, Pearson, Essex.
Navidi, W. (2015), Statistics for Engineers and Scientists, 4th edn, McGraw-Hill
Education, New York.
Statistics South Africa (2019), Electricity generated and available for distribution
(Preliminary), December 2018, Technical Report P4141, Statistics South Africa.
Tabak, J. (2011), Probability and Statistics: The Science of Uncertainty, 2nd edn,
Facts on File, New York.
Todorov, V. & Filzmoser, P. (2009), ‘An object-oriented framework for robust mul-
tivariate analysis’, Journal of Statistical Software 32(3), 1–47.
URL: http://www.jstatsoft.org/v32/i03/