Stat130 Module Notes
STAT130
Introduction to Statistics
Chapter 1 – Terminology
1.1 Definitions
Data/Data set – Set of values collected or obtained when gathering information on some
issue of interest.
Examples
4) The yields of a certain crop obtained after applying different types of fertilizer.
Statistics – Collection of methods for planning experiments, obtaining data, and then
organizing, summarizing, presenting, analyzing, interpreting the data and drawing
conclusions from it.
Statistics in the above sense refers to the methodology used in drawing meaningful
information from a data set. This use of the term should not be confused with statistics
(referring to a set of numerical values) or statistics (referring to measures of description
obtained from a data set).
Examples
2) The collection of all cars of a certain type manufactured during a particular month.
1) Study of the entire population carried out by the government every 10 years.
A census is usually very costly and time consuming. It is therefore not carried out very often.
A study of a population is usually confined to a subgroup of the population.
The number of values in the sample (sample size) is denoted by n. The number of values in
the population (population size) is denoted by N.
Discrete variables – Variables that can assume a finite or countable number of possible
values. Such variables are usually obtained by counting.
Examples
3) A person’s response (agree, not agree) to a statement. A one (1) is recorded when the
person agrees with the statement, a zero (0) is recorded when a person does not
agree.
Continuous variables – Variables that can assume any value within an interval, so that the
possible values cannot be counted. Such variables are usually obtained by measurement.
Examples
Examples
Nominal scale – Level of measurement which classifies data into categories in which no order
or ranking can be imposed on the data.
A variable can be treated as nominal when its values represent categories with no intrinsic
ranking. For example, the department of the company in which an employee works. Examples
of nominal variables include region, postal code, or religious affiliation.
Ordinal scale – Level of measurement which classifies data into categories that can be
ordered or ranked. Precise differences between the ranks are not meaningful.
A variable can be treated as ordinal when its values represent categories with some intrinsic
order or ranking.
Examples
Examples
Discrete and continuous variables examples given above.
Interval scale – Level of measurement which classifies data that can be ordered and ranked
and where differences are meaningful. However, there is no meaningful zero and ratios are
meaningless.
Examples
1) The difference between a temperature of 100 degrees and 90 degrees is the same
difference as that between 90 degrees and 80 degrees. Taking ratios in such a case
does not make sense.
Examples
Variables like height, weight, mark (in test) and speed are ratio variables. These variables
have a natural zero and ratios make sense when doing calculations e.g. a weight of 80
kilograms is twice as heavy as one of 40 kilograms.
2.1) Collecting data that compares reckless driving of female and male drivers.
2.2) Collecting data on smoking and lung cancer.
Examples
Examples
Sampling frame (synonyms: "sample frame", "survey frame") – This is the actual set of units
from which a sample is drawn.
Example
Consider a survey aimed at establishing the number of potential customers for a new service
in a certain city. The research team has drawn 1000 numbers at random from a telephone
directory for the city, made 200 calls each day from Monday to Friday from 8am to 5pm and
asked some questions.
In this example, the population of interest is all the inhabitants in the city. The sampling frame
includes only those city dwellers that satisfy all the following conditions:
3) They are likely to be at home from 8am to 5pm from Monday to Friday;
Probability samples – Samples drawn according to the laws of chance. These include simple
random sampling, systematic sampling and stratified random sampling.
Simple random sampling – Sampling in which each sample of a given size that can be drawn
will have the same chance of being drawn. Most of the theory in statistical inference is based
on random sampling being used.
Examples
1) The 6 winning numbers (drawn from 49 numbers) in a Lotto draw. Each potential
sample of 6 winning numbers has the same chance of being drawn.
We can use Excel to create a random sample of data. Many types of random samples can be
created with the RANDBETWEEN function.
Example
Using the functions in Excel, select the 6 winning numbers in a Lotto draw.
We will start with an empty spreadsheet in Excel. To randomly select 6 lotto numbers
numbered from 1 to 49, we click in cell A2 and type =RANDBETWEEN(1;49). Press Enter. Fill
the other 5 numbers by dragging down. Because each cell is generated independently, the
same number can occasionally appear twice; if that happens, recalculate (press F9) until the
6 numbers are distinct.
The results will be displayed as below. Note: Your results will be different from those below.
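For readers working outside Excel, the same draw can be sketched in Python. This is a minimal illustration rather than part of the module's Excel workflow. It uses the standard library function random.sample, which draws without replacement, so the 6 numbers are always distinct. The function name draw_lotto and its parameters are hypothetical names chosen for this example.

```python
import random

def draw_lotto(k=6, highest=49, seed=None):
    """Draw k distinct numbers from 1..highest; every set of k numbers is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw repeatable; omit it for a fresh draw
    return sorted(rng.sample(range(1, highest + 1), k))

numbers = draw_lotto()
print(numbers)  # the output differs from run to run
```

Passing a seed, e.g. draw_lotto(seed=1), reproduces the same sample each time, which can be useful when checking work.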
The advantage of simple random sampling is that it is simple and easy to apply when small
populations are involved. However, because every person or item in a population has to be
listed before the corresponding random numbers can be read, this method is very
cumbersome to use for large populations and cannot be used if no list of the population items
is available. It can also be very time consuming to try and locate every person included in the
sample. There is also a possibility that some of the persons in the sample cannot be contacted
at all.
Systematic sampling – Sampling in which data is obtained by selecting every kth object,
where k is approximately N/n.
Examples
1) A manufacturer might decide to select every 20th item on a production line to test for
defects and quality. This technique requires the first item to be selected at random as
a starting point for testing and, thereafter, every 20th item is chosen.
2) A market researcher might select every 10th person who enters a particular store,
after selecting a person at random as a starting point; or interview occupants of every
5th house in a street, after selecting a house at random as a starting point.
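The procedure in these examples can be sketched in Python. This is an illustrative sketch under stated assumptions: the list of 200 item numbers is hypothetical, the function name systematic_sample is invented for this example, and k is taken as the whole-number part of N/n.

```python
import random

def systematic_sample(items, n, seed=None):
    """Select every k-th item after a random start, with k = N // n."""
    N = len(items)
    k = N // n                    # sampling interval, approximately N/n
    rng = random.Random(seed)
    start = rng.randrange(k)      # random starting position among the first k items
    return items[start::k][:n]    # every k-th item from the start, trimmed to n items

production_line = list(range(1, 201))   # hypothetical item numbers 1..200
sample = systematic_sample(production_line, n=10)
print(sample)
```

With N = 200 and n = 10 the interval is k = 20, so the sample consists of a randomly chosen item among the first 20 and every 20th item after it.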
Stratified random sampling – Sampling in which the population is divided into groups (called
strata) according to some characteristic. Each of these strata is then sampled using random
sampling.
A general problem with random sampling is that you could, by chance, miss out a particular
group in the sample. However, if you subdivide the population into groups, and sample from
each group, you can make sure the sample is representative. Some examples of strata
commonly used are those according to province, age and gender. Other strata may be
according to religion, academic ability or marital status.
Example
In a study investigating the expenditure pattern of consumers, they were divided into low,
medium and high income groups.
When sampling is proportional to size (an income group comprises the same percentage of
the sample as of the population) the sample sizes for the strata should be calculated as
follows.
Convenience Sampling – Sampling in which data that is readily available is used e.g. surveys
done on the internet. These include quota sampling.
A company is marketing a new product and needs to know how potential customers might
react to the product.
Stage 1: It is decided that age (the 3 groups under 20, 20-40, over 40) and gender
(male, female) are the characteristics that will determine the sample.
Stage 2: The 6 categories to be sampled from are (male under 20), (male 20-40), (male
over 40), (female under 20), (female 20-40) and (female over 40).
Stage 3: The numbers (sub-quotas) to be sampled are (male under 20) - 40,
(male 20-40) - 60, (male over 40) - 25, (female under 20) - 35, (female 20-40) - 65 and
(female over 40) - 30. The total quota is the sum of all the sub-quotas, i.e. 255.
Stage 4: Visit a place where individuals to be interviewed are readily available e.g. a
large shopping center and interview people until all the quotas are filled.
Quota sampling is a cheap and convenient way of obtaining a sample in a short space of time.
However, this method of sampling is not based on the laws of chance and cannot guarantee
a sample that is representative of the population from which it is drawn.
When obtaining a quota sample, interviewers often choose who they like (within criteria
specifications) and may therefore select those who are easiest to interview. Therefore,
sampling bias can result. It is also impossible to estimate the accuracy of quota sampling
(because sampling is not random).
The graph above shows how a person's weight varied from the beginning of 1991 to the
beginning of 1995.
Bar charts
A bar chart or bar graph is a chart consisting of rectangular bars with heights proportional to
the values that they represent. Bar charts are used for comparing two or more values that are
taken over time or under different conditions.
In a simple bar chart the figures used to make comparisons are represented by bars. These
are either drawn vertically or horizontally. Only totals are represented. The height or length
of the bar is drawn in proportion to the size of the figure being presented. An example is
shown below.
When you want to draw a bar chart to illustrate your data, it is often the case that the totals
of the figures can be broken down into parts or components.
You start by drawing a simple bar chart with the total figures as shown above. The columns
or bars (depending on whether you draw the chart vertically or horizontally) are then divided
into the component parts.
Pareto chart
A Pareto chart is a special type of bar chart where the values being plotted are
arranged in descending order. The graph is accompanied by a line graph which shows
the cumulative totals of each category, left to right.
The graph below is a Pareto chart that shows the percentage of late arrivals at a place
of work organized according to cause of late arrival (from the most common to the
least common cause). The line shows the accumulated percentages.
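[Figure: Pareto chart of late arrivals. Horizontal axis: reason (traffic, child care, transport, weather, overslept, emergency). Left vertical axis: percent (0 to 100). Right vertical axis: cumulative percent (0% to 100%).]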
Dot Plot
This is a diagram in which a line is drawn according to a scale that is appropriate for the data set
and the values (in the data set) are plotted at their positions on the scale. If the same value occurs
more than once, the multiple values are plotted on top of each other at the same point on
the scale. For small data sets (few values) this plot can provide useful information regarding
data patterns.
Example
Imagine that a medium-sized retailer, thinking of expanding into a new region,
identifies a business that it considers ready for takeover. It finds the following
annual profit figures (in tens of thousands of pounds) for the target retailer's last ten
years trading:
9 9 7 7 7 6 5 4 3 3
To draw a dot plot we can begin by drawing a horizontal line across the page to
represent the range of values of all the numbers; then we can mark an 'x' above the
appropriate value along the line as follows:
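A text version of this dot plot can be sketched in Python. The sketch below stacks one 'x' per observation above a number line. The function name dot_plot is hypothetical, and the simple column alignment assumes single-digit whole-number values, as in this data set.

```python
from collections import Counter

profits = [9, 9, 7, 7, 7, 6, 5, 4, 3, 3]  # annual profits from the example

def dot_plot(values):
    """Return a text dot plot: one 'x' per value, stacked above a number line."""
    counts = Counter(values)
    scale = range(min(values), max(values) + 1)
    rows = []
    for level in range(max(counts.values()), 0, -1):   # build the top row first
        marks = ["x" if counts[v] >= level else " " for v in scale]
        rows.append(" ".join(marks).rstrip())
    rows.append(" ".join(str(v) for v in scale))        # the number line itself
    return "\n".join(rows)

print(dot_plot(profits))
```

The tallest column (three marks above 7) shows at a glance which profit figure occurred most often, and the empty column above 8 shows a value that never occurred.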
Pie Chart
A Pie chart is a diagram that shows the subdivision of some entity/total into subgroups. The
diagram is in the form of a circle which is divided into slices with each slice having an area
according to the proportion that it makes up of the total.
Example
The pie chart below shows the ingredients used to make a sausage and mushroom pizza.
The degrees needed for each slice are found by calculating the appropriate percentage of 360,
e.g. for sausage the degrees are 0.125×360 = 45 and for cheese 0.25×360 = 90, etc.
Stem-and-leaf plot
A stem-and-leaf plot is a device used for summarizing quantitative data in a
table/graphical format to assist in visualizing the shape of a data set.
Examples
1) To construct a stem-and-leaf plot, the values must first be sorted in ascending order.
Here is the sorted set of data values that will be used in the example:
44 46 47 49 63 64 66 68 68 72 72 75 76 81 84 88 106
Next, it must be determined what the stems will represent and what the leaves will
represent. Typically, the leaf contains the last digit of the number and the stem
contains all of the other digits. In the case of very large or very small numbers, the
data values may be rounded to a particular place value (such as the hundredths place)
that will be used for the leaves. The remaining digits to the left of the rounded place
value are used as the stems.
In this example, the leaf represents the “ones” place and the stem the rest of the
number (“tens” place or higher).
The stem-and-leaf plot is drawn with two columns separated by a vertical line. The
stems are listed to the left of the vertical line. It is important that each stem is listed
only once and that no numbers are skipped, even if it means that some stems have no
leaves. The leaves are listed in increasing order in a row to the right of each stem.
 4 | 4 6 7 9
 5 |
 6 | 3 4 6 8 8
 7 | 2 2 5 6
 8 | 1 4 8
 9 |
10 | 6
key: 5|4 = 54
leaf unit: 1.0
stem unit: 10.0
Conclusion: 12 of the 17 values are greater than or equal to 63 and less than or
equal to 88.
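The construction steps above can be sketched in Python. This is a minimal sketch assuming non-negative whole numbers, with the last digit as the leaf and the remaining digits as the stem. The function name stem_and_leaf is a hypothetical name for this example.

```python
from collections import defaultdict

data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106]

def stem_and_leaf(values):
    """Return the plot rows as strings; stem = value // 10, leaf = value % 10."""
    leaves = defaultdict(list)
    for v in sorted(values):                      # sorting keeps leaves in increasing order
        leaves[v // 10].append(v % 10)
    stems = range(min(leaves), max(leaves) + 1)   # list every stem, even those with no leaves
    return [f"{s:>2} | {''.join(str(d) for d in leaves[s])}".rstrip() for s in stems]

for row in stem_and_leaf(data):
    print(row)
```

Stems 5 and 9 are printed with no leaves, in line with the rule that no stem may be skipped.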
As an example, suppose the fat contents (in grams) of English breakfasts and
cold meat sandwiches are to be compared. The fat contents are shown below.
Sandwiches: 6, 7, 12, 13, 17, 18, 20, 21, 21, 24, 26, 28, 30, 34
Breakfasts: 12, 14, 15, 16, 18, 23, 25, 25, 36, 36, 38, 41, 44, 45
Breakfasts       Sandwiches
          | 0 | 6 7
2 4 5 6 8 | 1 | 2 3 7 8
    3 5 5 | 2 | 0 1 1 4 6 8
    6 6 8 | 3 | 0 4
    1 4 5 | 4 |
Conclusion: The fat content in English breakfasts appears to be higher than that in
sandwiches.
Suppose the symbol x is used to denote some variable of interest in a study. In order to
distinguish between values of this variable, subscripts are used.
\[ x_1 + x_2 + \dots + x_n = \sum_{i=1}^{n} x_i . \]
If it is understood that the range of subscript indices over which the summation is taken
involves all the x values, the summation can be written as just
\[ x_1 + x_2 + \dots + x_n = \sum x . \]
Example 1: Suppose x1 = 70, x2 = 74, x3 = 66, x4 = 68, x5 = 71. Then
\[ \sum_{i=1}^{5} x_i = x_1 + x_2 + \dots + x_5 = 70 + 74 + 66 + 68 + 71 = 349. \]
\[ \sum_{i=1}^{n} x_i^2 = x_1^2 + x_2^2 + \dots + x_n^2 , \quad \text{or } \sum x^2 \text{ for short.} \]
\[ \sum_{i=1}^{5} x_i^2 = 70^2 + 74^2 + 66^2 + 68^2 + 71^2 = 24397. \]
Note that
\[ \sum_{i=1}^{n} x_i^2 \neq \Big( \sum_{i=1}^{n} x_i \Big)^2 , \]
e.g. for the abovementioned data
\[ \sum_{i=1}^{5} x_i^2 = 24397 \neq 349^2 = 121801. \]
The summation notation can also be used to write the sum of products of corresponding values
for 2 different sets of values.
\[ \sum_{i=1}^{n} x_i y_i = x_1 y_1 + x_2 y_2 + \dots + x_n y_n \]
i     1   2   3   4   5   6
xi   11  13   7  12  10   8
yi    8   5   7   6   9  11
\[ \sum_{i=1}^{6} x_i y_i = (11 \times 8) + (13 \times 5) + (7 \times 7) + (12 \times 6) + (10 \times 9) + (8 \times 11) = 88 + 65 + 49 + 72 + 90 + 88 = 452. \]
Note that
\[ \sum_{i=1}^{n} x_i y_i \neq \Big( \sum_{i=1}^{n} x_i \Big) \Big( \sum_{i=1}^{n} y_i \Big) , \]
e.g. for the abovementioned data
\[ \sum_{i=1}^{6} x_i = 61 \quad \text{and} \quad \sum_{i=1}^{6} y_i = 46 , \qquad \Big( \sum_{i=1}^{6} x_i \Big) \Big( \sum_{i=1}^{6} y_i \Big) = 2806 \neq \sum_{i=1}^{6} x_i y_i . \]
The summation notation is used extensively in specifying calculations in statistical formulae.
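The summation examples in this section can be checked with a few lines of Python. This is a small sketch using only the built-in sum function; the variable names are chosen for this illustration.

```python
# Example 1: sum, sum of squares, and square of the sum
x = [70, 74, 66, 68, 71]
sum_x = sum(x)                            # sum of the values
sum_x_squared = sum(v ** 2 for v in x)    # sum of the squared values
print(sum_x, sum_x_squared, sum_x ** 2)

# Sum of products of corresponding values from two data sets
xs = [11, 13, 7, 12, 10, 8]
ys = [8, 5, 7, 6, 9, 11]
sum_xy = sum(a * b for a, b in zip(xs, ys))
print(sum_xy, sum(xs) * sum(ys))
```

Comparing the printed pairs confirms that the sum of squares differs from the square of the sum, and that the sum of products differs from the product of the sums.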
A frequency distribution is a table in which data are grouped into classes and the number of
values (frequencies) which fall in each class recorded.
The main purpose of constructing a frequency distribution is to get insight into the
distribution pattern of the frequencies over the classes. Hence, the name frequency
distribution is used to refer to this pattern.
Example 1
In a survey of 40 families in a village, the number of children per family was recorded and the
following data obtained.
1 0 3 2 1 5 6 2
2 1 0 3 4 2 1 6
3 2 1 5 3 3 2 4
2 2 3 0 2 1 4 5
3 3 4 4 1 2 4 5
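For a discrete variable like this one, each distinct value can serve as its own class. The tally can be sketched in Python with collections.Counter; this is an illustrative sketch, and the variable names are chosen for this example.

```python
from collections import Counter

children = [
    1, 0, 3, 2, 1, 5, 6, 2,
    2, 1, 0, 3, 4, 2, 1, 6,
    3, 2, 1, 5, 3, 3, 2, 4,
    2, 2, 3, 0, 2, 1, 4, 5,
    3, 3, 4, 4, 1, 2, 4, 5,
]

frequency = Counter(children)          # number of families for each number of children
for value in sorted(frequency):
    print(value, frequency[value])
print("Total", sum(frequency.values()))
```

The frequencies add up to 40, the number of families surveyed, which is a quick check that no value was missed.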
Example 2
Consider the following data of low temperatures (in degrees Fahrenheit to the nearest
degree) for 50 days. The highest temperature is 64 and the lowest temperature is 39.
1. Find the maximum (=64) and minimum (=39) values and calculate the
range = maximum − minimum = 64 − 39 = 25.
2. Decide on the number of classes. Use Sturges’ rule, which states that
No. of classes = k
= the rounded up value of (1 + 1.44 ln n)
= the rounded up value of 1 + 1.44 × ln(50)
= the rounded up value of 6.63
i.e. k = 7.
3. Calculate the class width such that no. of classes × class width > range.
Here a class width of 4 gives 7 × 4 = 28 > 25.
4. Find the lower value that defines the first class. This is usually a value just below the
minimum value in the data set. Since the minimum value for this data set is 39, the lowest
class can have a minimum value one below this i.e. 38.
5. Find the lower values that define each of the classes that follow by successively adding
the class width to the lower value of class.
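Steps 1 to 5 can be sketched in Python as follows. This is a sketch under stated assumptions: the values are recorded as whole numbers (so the upper limit of a class is one less than the lower limit of the next class), the class width is the smallest whole number satisfying step 3, and the first class starts one below the minimum. The variable names are chosen for this illustration.

```python
import math

n, minimum, maximum = 50, 39, 64
data_range = maximum - minimum             # step 1: range of the data

k = math.ceil(1 + 1.44 * math.log(n))      # step 2: Sturges' rule, rounded up
width = data_range // k + 1                # step 3: smallest whole width with k * width > range
first_lower = minimum - 1                  # step 4: start just below the minimum

# step 5: add the class width repeatedly to get the lower limit of each class
classes = [(first_lower + i * width, first_lower + (i + 1) * width - 1) for i in range(k)]
print(k, width)
print(classes)
```

The resulting class limits run from 38 – 41 up to 62 – 65, so every value between the minimum (39) and the maximum (64) falls in exactly one class.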
The frequency distribution below shows the data values sorted into the classes
The table below shows the classes and their frequencies for the temperatures data set.
class limits f
38 – 41 4
42 – 45 10
46 – 49 8
50 – 53 15
54 – 57 9
58 – 61 3
62 – 65 1
Total 50
The values in the above example that define the classes of the frequency distribution are
called class limits. The classes of the type 38 – 41, 42 – 45,… in which both the upper and
lower limits are included are called “inclusive classes”. For example, the class 38 – 41
includes all the values from 38 to 41.
In spite of the great importance of classification in statistical analysis, no hard and fast rules can
be laid down for it.
The following points must be kept in mind for classification:
1) The classes should be clearly defined and should not lead to any ambiguity.
2) Each of the given values in the data set should be included in one of the classes.
3) The classes should be of equal width, otherwise the different class frequencies
will not be comparable. If the class widths are unequal, then comparable figures
can be obtained by dividing the value of the frequencies by the corresponding
widths of the class intervals. The ratios thus obtained are called ‘frequency
density’.
4) The number of classes should not be too large nor too small.
If we deal with a continuous variable, it is not possible to arrange the data in class
intervals of the above type. Consider the distribution of age in years. If the class intervals are
15 – 19, 20 – 24, then persons with ages between 19 and 20 years are not taken into
consideration. In such a case we form the class intervals as 0 – 5, 5 – 10, 10 – 15,
15 – 20,… . Here all persons, whatever the fractional part of their age, are included in one
group or the other. In these classes the upper limit of each class is excluded from that class
and included in the next class; such classes are known as ‘exclusive classes’.
The upper and lower class limits of the new exclusive type classes are known as class
boundaries.
If d is the gap between the upper limit of any class and the lower limit of the succeeding
class, the class boundaries for any class are then given by:
lower class boundary = lower class limit − d/2
upper class boundary = upper class limit + d/2
For example, with d = 1 the class 38 – 41 has class boundaries 37.5 – 41.5.
Example 3
The monthly expenditures (thousands of rands) of 60 households are shown on the next page.
The values of this data set were accurately recorded (not rounded).
classes f
4.5 – 5.5 5
5.5 – 6.5 7
6.5 – 7.5 13
7.5 – 8.5 13
8.5 – 9.5 9
9.5 – 10.5 10
10.5 – 11.5 3
Total 60
For this distribution lower (upper) class limit = lower (upper) class boundary for each of the
classes.
A value that falls on the boundary of 2 classes is allocated to the higher of the two classes e.g.
5.50000 is allocated to the class 5.5 – 6.5 (not 4.5 to 5.5).
Class midpoints
Examples
1) For the frequency distribution in example 2 (temperature data), the class midpoints
are given on the following page.
2) For the frequency distribution in example 3 (expenditure data), the class midpoints are
given below.
classes midpoints
4.5 – 5.5 5
5.5 – 6.5 6
6.5 – 7.5 7
7.5 – 8.5 8
8.5 – 9.5 9
9.5 – 10.5 10
10.5 – 11.5 11
Cumulative frequencies
The “less than” cumulative frequency of a class is the number of values in the sample that
are less than or equal to the upper class boundary of the class.
Examples
class boundaries    f     cumulative frequency    calculations
37.5 – 41.5         4      4                      4
41.5 – 45.5        10     14                      4+10
45.5 – 49.5         8     22                      4+10+8
49.5 – 53.5        15     37                      4+10+8+15
53.5 – 57.5         9     46                      4+10+8+15+9
57.5 – 61.5         3     49                      4+10+8+15+9+3
61.5 – 65.5         1     50                      4+10+8+15+9+3+1

classes             f     cumulative frequency    calculations
4.5 – 5.5           5      5                      5
5.5 – 6.5           7     12                      5+7
6.5 – 7.5          13     25                      5+7+13
7.5 – 8.5          13     38                      5+7+13+13
8.5 – 9.5           9     47                      5+7+13+13+9
9.5 – 10.5         10     57                      5+7+13+13+9+10
10.5 – 11.5         3     60                      5+7+13+13+9+10+3
Total              60
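The running totals in the "calculations" column can be reproduced in Python with itertools.accumulate. This is a short sketch; the variable names are chosen for this illustration.

```python
from itertools import accumulate

temperature_f = [4, 10, 8, 15, 9, 3, 1]     # frequencies, temperature data
expenditure_f = [5, 7, 13, 13, 9, 10, 3]    # frequencies, expenditure data

# Each cumulative frequency is the sum of all frequencies up to and including that class
temperature_cum = list(accumulate(temperature_f))
expenditure_cum = list(accumulate(expenditure_f))
print(temperature_cum)
print(expenditure_cum)
```

The last cumulative frequency always equals the sample size (50 and 60 here), which is a useful check on the arithmetic.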
1) The relative and percentage frequencies for the frequency distribution in example
2 (temperature data) are shown below.
2) The relative and percentage frequencies for the frequency distribution in example 3
(expenditure data) are shown below.
classes          f     relative frequency    percentage frequency
4.5 – 5.5        5     0.083                  8.3
5.5 – 6.5        7     0.117                 11.7
6.5 – 7.5       13     0.217                 21.7
7.5 – 8.5       13     0.217                 21.7
8.5 – 9.5        9     0.15                  15
9.5 – 10.5      10     0.167                 16.7
10.5 – 11.5      3     0.05                   5
Total           60     1                    100
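The table entries can be checked in Python. The sketch below assumes the usual definitions, which agree with the table values: relative frequency = f / n and percentage frequency = 100 × relative frequency. The variable names are chosen for this illustration.

```python
expenditure_f = [5, 7, 13, 13, 9, 10, 3]    # class frequencies, expenditure data
n = sum(expenditure_f)                      # sample size

relative = [f / n for f in expenditure_f]   # proportion of the sample in each class
percentage = [100 * r for r in relative]    # the same proportions as percentages

for f, r, p in zip(expenditure_f, relative, percentage):
    print(f, round(r, 3), round(p, 1))
```

The relative frequencies add up to 1 and the percentage frequencies to 100, apart from small rounding differences in the printed values.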
Histogram
[Figure: histogram of the temperature data. Horizontal axis: temperature, with classes 37.5-41.5, 41.5-45.5, 45.5-49.5, 49.5-53.5, 53.5-57.5, 57.5-61.5, 61.5-65.5. Vertical axis: frequency (0 to 16).]
Frequency polygon
This is also a graphical representation of a frequency distribution. For each class the
class midpoint is plotted against the frequency and the plotted points joined by means
of straight lines.
Example
midpoint 35.5 39.5 43.5 47.5 51.5 55.5 59.5 63.5 67.5
f 0 4 10 8 15 9 3 1 0
[Figure: frequency polygon of the temperature data. Horizontal axis: midpoint (0 to 80). Vertical axis: frequency (0 to 16).]
Note:
The two plotted values at the lower and upper ends were added to anchor the graph to the
horizontal axis. The lower end value is a plot of 0 versus the midpoint of the class below the
first (lowest) class (35.5). This midpoint is obtained by subtracting the class width (4) from the
midpoint of the lowest class (39.5). The upper end value is a plot of 0 versus the midpoint of
the class above the last class (67.5). This midpoint is obtained by adding the class width (4) to
the midpoint of the last (highest) class (63.5).
The histogram and frequency polygon are equivalent graphical representations of the pattern
of the frequencies shown in the frequency distribution.
The histogram can provide an estimate of the probability (chance) that a value drawn at
random from the data set will lie between two values.
Examples
1) For the frequency distribution in example 2 (temperature data), the estimated chance
that a randomly drawn value will be at least 45.5 but less than 57.5 is
\[ \frac{8 + 15 + 9}{50} = \frac{32}{50} = 0.64. \]
Example
For the “less than” ogive of the frequency distribution in example 2 (temperature data)
class boundary          37.5   41.5   45.5   49.5   53.5   57.5   61.5   65.5
cumulative frequency       0      4     14     22     37     46     49     50
[Figure: “less than” ogive of the temperature data. Horizontal axis: class boundary (0 to 70). Vertical axis: cumulative frequency (0 to 60).]
Note:
The plotted value at the lower end was added to anchor the graph to the horizontal axis. The
lower end value is a plot of 0 versus the upper class boundary of the class below the first
(lowest) class (37.5). This upper class boundary is obtained by subtracting the class width (4)
from the upper class boundary of the lowest class (41.5).
A percentage “less than” ogive can be plotted by just changing the vertical scale. In this
example the frequencies add up to 50. In order to convert these frequencies to percentages,
each frequency is multiplied by 2. To draw the percentage ogive, each cumulative frequency
in the above table will have to be multiplied by 2. The resulting graph is shown on the
following page. The value below which a given percentage of the observations in the data set
lies can be read off from the ogive.
[Figure: percentage “less than” ogive of the temperature data. Horizontal axis: class boundaries (0 to 70). Vertical axis: % cumulative frequency (0 to 120).]
The shape of a distribution
The main purpose of drawing a histogram is to describe the clustering pattern of the values
in the data set. For a large sample size, the histogram (frequency polygon) can be fairly well
approximated by a smooth curve (called a frequency curve) that is fitted to the frequencies.
The following patterns of the shape of the frequency curve appear regularly in data sets.
[Figure: symmetric, bell-shaped frequency curve. Horizontal axis: x (-4 to 4). Vertical axis: frequency (0 to 0.45).]
This shape is for data sets where the majority of values are in the central portion of the scale
with fewer and fewer values the further away from the center (in both directions). Many data
sets have this shape. Examples are
[Figure: flat (uniform) frequency curve. Horizontal axis: x (0 to 6). Vertical axis: frequency (0 to 0.12).]
This shape occurs when all the values in the data set occur approximately the same number
of times.
Examples are
3) Frequencies obtained when tossing an unbiased coin and recording 0 if tails come up and 1
if heads come up.
Bimodal shape
[Figure: frequency curve with two peaks. Horizontal axis: body length (mm), 0 to 120. Vertical axis: frequency (0 to 60).]
This pattern, which shows two distinct peaks (hence the name bimodal), appears
when there are two subgroups with different sets of values in the same data set.
Examples
1) Measuring the body lengths of ants when there are adults and juveniles together in
the same data set. The two peaks in the curve reflect the fact that juvenile ants have
shorter body lengths than adult ants.
2) Heights of a population of males and females. Since the females are shorter than the
males, the frequency curve will have two peaks. One peak will be located where the
most female heights are concentrated and one where the most male heights are
concentrated.
Positive skew shape
[Figure: positively skewed frequency curve. Horizontal axis: x (0 to 14). Vertical axis: frequency (0 to 1.2).]
This shape shows a high clustering of values at the lower end of the scale and less and less
clustering further away from the lower end towards the upper end.
Example
The time it takes to serve a customer at a supermarket. For most customers the service time
is quite short. The longer the service time, the fewer the customers with that service time.
Negative skew shape
[Figure: negatively skewed frequency curve. Horizontal axis: x (0 to 16). Vertical axis: frequency (0 to 0.3).]
This shape shows a high clustering of values at the upper end of the scale and less and less
clustering further away from the upper end towards the lower end.
Example
Marks in a test where most students did well, but a few performed poorly.
In the calculations a distinction will be made between methods used when the data are in
raw form (values as collected) or grouped form (form of a frequency distribution).
\[ \text{mean} = \bar{x} = \frac{1}{n} \sum x . \]
$\bar{x}$ is pronounced “x bar”.
Example
The marks of seven students in a mathematics test with a maximum possible mark of 20 are
given below:
15 13 18 16 14 17 12
\[ \text{mean} = \bar{x} = \frac{\sum x}{n} = \frac{15 + 13 + 18 + 16 + 14 + 17 + 12}{7} = \frac{105}{7} = 15. \]
Median:
The median is the value in the data set which is such that half of the values in the data
set are less than or equal to it and half greater than or equal to it.
For an odd number of values in the data set, the median is the middle value of the
data set when it has been arranged in ascending order. That is, from the smallest
value to the largest value.
If the number of values in the data set is even, then the median is the average of the
two middle values.
Examples
1) The marks of nine students in a geography test that had a maximum possible mark of 50
are given below:
47 35 37 32 38 39 36 34 35
Arrange the data values in order from the lowest value to the highest value:
32 34 35 35 36 37 38 39 47
2) Consider the above data set with the first value (47) omitted.
Arrange the data values in order from the lowest value to the highest value:
32 34 35 35 36 37 38 39
In this case the number of values n = 8, which is an even number. The two middle values in
the data set are in positions n/2 = 8/2 = 4 and n/2 + 1 = 5, i.e. the values 35 and 36.
\[ \text{Median} = \frac{35 + 36}{2} = 35.5. \]
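The two cases (odd and even number of values) can be sketched in Python as follows. This is a minimal sketch of the rule described above; the function name median is chosen for this illustration, and Python's standard library offers statistics.median for the same purpose.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)                  # arrange in ascending order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                # odd n: the single middle value
    return (ordered[middle - 1] + ordered[middle]) / 2   # even n: average of positions n/2 and n/2 + 1

marks = [47, 35, 37, 32, 38, 39, 36, 34, 35]
print(median(marks))        # all nine marks
print(median(marks[1:]))    # the first value (47) omitted
```

For the nine marks the median is the 5th ordered value; for the eight remaining marks it is the average of the 4th and 5th ordered values.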
Mode:
The mode of a set of data values is the value(s) that occurs most often.
Example:
Find the mode of the following data set:
48 44 48 45 42 49 48
The mode is 48 since it occurs most often.
Note
1) It is possible for a set of data values to have more than one mode.
2) If there are two data values that occur most frequently, we say that the set of data
values is bimodal e.g. the data set 2 2 4 5 5 6 has two modes (2 and 5).
3) If no value in the data set occurs more than once, it has no mode e.g. the data set 4
5 7 9 has no mode.
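The three notes can be captured in a short Python sketch. The function name modes is hypothetical. It follows the convention of these notes that a data set in which no value repeats has no mode, which differs from some library functions that would return every value in that case.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list if no value occurs more than once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                                   # no value repeats: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([48, 44, 48, 45, 42, 49, 48]))   # one mode
print(modes([2, 2, 4, 5, 5, 6]))             # bimodal
print(modes([4, 5, 7, 9]))                   # no mode
```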
Comparison of mean, median and mode
1) The mean is used as a measure of central tendency for symmetrical, bell-shaped data that
do not have extreme values (extreme values are called outliers).
2) The median may be more useful than the mean when there are extreme values in the data
set as it is not affected by the extreme values.
3) The mode is useful when the most common item, characteristic or value of a data set is
required.
Examples
1) The amounts (thousands) for which each of 7 properties were sold are shown below.
For this data set mean = $\bar{x}$ = 772.86. This value of the mean is not a central value for
the data set (it is greater than all the values but the largest one). The reason for this is
that the last value (2350) has a considerable influence on the value of the mean.
The median = 555 is a value that is more centrally located than the mean. Unlike the
mean, the median is not influenced by the large last value in the data set.
2) For qualitative (non-numerical) data only the mode can be calculated. For example,
suppose 10 rate payers are asked whether they think the percentage increase in rates
is reasonable. They can either agree (A), disagree (D) or be neutral (N) on the issue.
Their responses are shown below.
A, A, D, N, D, A, D, D, N, N.
For this data set the modal response is D (since D occurs more times than the other
responses). It is not possible to calculate a median or a mean for this data set.
When calculating the mean for raw data, it is usually assumed that all the values in the data
set are equally important. If the values are not all considered equally important, the weighted
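mean ($\bar{x}_w$) is calculated according to the formula below.
\[ \bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \dots + w_r x_r}{w_1 + w_2 + \dots + w_r} = \frac{\sum_{i=1}^{r} w_i x_i}{\sum_{i=1}^{r} w_i} \]
In the formula x1, x2, . . . , xr are the values and w1, w2, . . . , wr their respective weights.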
Example
The final mark (percentage) in a certain course is based on an assignment mark (which counts
for 10% of the final mark), a test mark (which counts for 30% of the final mark) and an exam
mark (which counts for 60% of the final mark). Calculate the final mark of a student
who gets a 65% assignment mark, a 70% test mark and a 55% exam mark.
Solution:
The above formula is applied with
x1 = 65, x2 = 70, x3 = 55,
w1 = 10, w2 = 30, w3 = 60.
\[ \bar{x}_w = \frac{65 \times 10 + 70 \times 30 + 55 \times 60}{10 + 30 + 60} = \frac{6050}{100} = 60.5. \]
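The same calculation can be sketched in Python. The function name weighted_mean is chosen for this illustration; it divides the sum of weight × value by the sum of the weights, as in the solution above.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

marks = [65, 70, 55]       # assignment, test and exam marks
weights = [10, 30, 60]     # their contributions to the final mark
final_mark = weighted_mean(marks, weights)
print(final_mark)
```

Because the exam carries the largest weight, the final mark lies closer to the exam mark (55) than the unweighted average of the three marks would.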
where xmid(i) is the midpoint of the ith class, k the number of classes and n the sample size.
This formula is a special case of the weighted mean formula with $w_i = f_i$ and
$\sum_{i=1}^{k} w_i = n$.
Example
\[ \text{mean} = \frac{2487}{50} = 49.74. \]
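The total of 2487 in this example can be reproduced from the temperature frequency distribution. The sketch below assumes the grouped mean is the frequency-weighted mean of the class midpoints, with each midpoint taken as the average of the two class boundaries; the variable names are chosen for this illustration.

```python
boundaries = [(37.5, 41.5), (41.5, 45.5), (45.5, 49.5), (49.5, 53.5),
              (53.5, 57.5), (57.5, 61.5), (61.5, 65.5)]
frequencies = [4, 10, 8, 15, 9, 3, 1]

midpoints = [(lower + upper) / 2 for lower, upper in boundaries]   # 39.5, 43.5, ...
n = sum(frequencies)                                               # sample size
sum_fx = sum(f * m for f, m in zip(frequencies, midpoints))        # sum of f * midpoint
grouped_mean = sum_fx / n
print(sum_fx, n, grouped_mean)
```

The grouped mean is an approximation to the mean of the raw data, since every value in a class is treated as if it were equal to the class midpoint.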
2.5 Measures of variability (variation, spread, dispersion)
Variability refers to the extent to which the values in a data set vary around (differ from)
the associated measure of central tendency.
Example
The performance of 2 different stocks is monitored over a period of 8 days. Their values are
shown in the table below.
day 1 2 3 4 5 6 7 8
A 103 120 112 108 130 106 120 112
B 112 97 85 123 153 85 146 110
The dot plot that follows shows the performance of each stock.
The mean values for the two stocks are the same (=113.875), but they differ in variability
(extent of spread around the mean). Stock B has a far wider spread around the mean than
stock A.
Example
For stock A the standard deviation is calculated as follows.
x (stock A)    x²
103            10609
120            14400
112            12544
108            11664
130            16900
106            11236
120            14400
112            12544
sum 911        104297
For stock B the standard deviation is 25.682 (check this using STATMODE).
Interpretation: The stock A values differ (on average) from the mean by 8.919, while stock
B values differ (on average) from the mean by almost 3 times this amount.
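Both standard deviations can be checked with a short Python sketch. It assumes the computational formula S² = (Σx² − (Σx)²/n) / (n − 1), which uses the same column totals (Σx and Σx²) as the table for stock A. The function name sample_std is chosen for this illustration.

```python
import math

def sample_std(values):
    """Sample standard deviation from the totals of x and x squared."""
    n = len(values)
    total = sum(values)                        # sum of x
    total_sq = sum(v ** 2 for v in values)     # sum of x squared
    variance = (total_sq - total ** 2 / n) / (n - 1)
    return math.sqrt(variance)

stock_a = [103, 120, 112, 108, 130, 106, 120, 112]
stock_b = [112, 97, 85, 123, 153, 85, 146, 110]
print(round(sample_std(stock_a), 3))
print(round(sample_std(stock_b), 3))
```

The standard library function statistics.stdev gives the same results and can serve as a second check.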
2.5.2 Grouped data
Standard deviation and variance
For grouped data, the raw data formulae for the variance and standard deviation can
be slightly modified.
Example
\[ \text{variance} = S^2 = \frac{125372.5 - 2487^2 / 50}{49} = 34.06367 \]
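The totals in this calculation (125372.5 and 2487) can be reproduced from the temperature frequency distribution. The sketch below assumes the grouped formula S² = (Σf·x² − (Σf·x)²/n) / (n − 1), where x is the class midpoint; this matches the numbers in the example. The variable names are chosen for this illustration.

```python
import math

midpoints = [39.5, 43.5, 47.5, 51.5, 55.5, 59.5, 63.5]   # class midpoints, temperature data
frequencies = [4, 10, 8, 15, 9, 3, 1]

n = sum(frequencies)
sum_fx = sum(f * m for f, m in zip(frequencies, midpoints))         # sum of f * x
sum_fx2 = sum(f * m ** 2 for f, m in zip(frequencies, midpoints))   # sum of f * x squared

variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
std_dev = math.sqrt(variance)     # standard deviation is the square root of the variance
print(sum_fx, sum_fx2, round(variance, 5), round(std_dev, 3))
```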
Example:
The coefficient of variation calculations show that in relative terms the variability of the
expenditure data set is greater than that of the temperature data set.
Example:
Men’s heights have a bell-shaped distribution with a mean of 69.2 inches and a standard
deviation of 2.9 inches.
Approximately 68% of data values are within 69.2 ± 2.9 = (66.3, 72.1).
Approximately 95% of data values are within 69.2 ± 5.8 = (63.4, 75.0).
Approximately 99.7% of data values are within 69.2 ± 8.7 = (60.5, 77.9).
2.8.1 Definitions
The ith percentile, Pi, is the value that has i% of the values in a data set less than or equal
to it (0 < i ≤ 100).
Examples
Steps to be followed in calculating the first and third quartiles for raw data
3) Divide the data set into 2 portions with equal numbers of values – set 1 consists of those
values less than or equal to the median and set 2 consists of those values greater than or
equal to the median. When the data set has an odd number of values, the median is
excluded from the division of the data set into 2 portions.
4) The first quartile (Q1) is the median of set 1 and the third quartile (Q3) is the median
of set 2.
Example
The distances from home to work (kilometers) of 11 employees at a certain company are
shown below. Calculate Q1 and Q3.
1) Ordered data set: 6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49
2) Median = 40. After this step the median is deleted from the data set.
4) Set 2 – 5 values greater than the median i.e. 41, 42, 43, 47, 49.
Suppose the data set consists of the above values and 56 (12 values).
1) Ordered Data Set: 6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 56
2) median = (40 + 41)/2 = 40.5. Unlike what was done in example 1, no values are deleted
from the data set.
3) Set 1 – 6 values less than or equal to the median i.e. 6, 7, 15, 36, 39, 40
Set 2 – 6 values greater than or equal to the median i.e. 41, 42, 43, 47, 49, 56.
4) Q1 = median of set 1 = (15 + 36)/2 = 25.5, Q3 = median of set 2 = (43 + 47)/2 = 45.
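The steps for raw data can be sketched in a few lines. This is only an illustration of the median-of-halves method described above (the median is dropped when n is odd); the function name quartiles is an assumption, and other quartile conventions used by software can give slightly different values.

```python
from statistics import median


def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the data."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For odd n the median itself is excluded from both halves.
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(upper)


distances_11 = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
distances_12 = distances_11 + [56]

print(quartiles(distances_11))
print(quartiles(distances_12))
```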
The quartile deviation Q = (Q3 − Q1)/2 can also be used as a measure of variability.
For the data set in example 1, quartile deviation = Q = (43 – 15)/2 = 14.
The quartile deviation value shows the extent to which the values in the data set deviate from
the median. For a skew data set (heavy clustering at lower or upper end of the scale) the
quartile deviation is a more appropriate measure of variability than the standard deviation
(which is more suitable as a measure of variability for symmetric data sets).
A formula for calculating the ith percentile Pi for grouped data is shown below.
i = 1, 2, … , 100.
n = sample size
c = class width.
Example
class                   cumulative
boundaries       f      frequency
37.5 – 41.5      4       4
41.5 – 45.5     10      14
45.5 – 49.5      8      22
49.5 – 53.5     15      37
53.5 – 57.5      9      46
57.5 – 61.5      3      49
61.5 – 65.5      1      50
Total           50
Median
Step 1: Calculate position of median = (i × n)/100 = (50 × 50)/100 = 25.
Step 2: Median class (class that contains 25th observation) is the class 49.5 – 53.5.
First quartile
Step 2: First quartile class (class that contains 12.5th observation) is the class __________
Q1 =
Third quartile
Step 2: Third quartile class (class that contains 37.5th observation) is the class ___________
Q3 =
Fourth decile
65th Percentile
P65 = 49.5 + (32.5 − 22) × 4 / 15 = 52.3.
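The grouped-data percentile calculation can be sketched as follows. The general formula is not reproduced in this copy of the notes, so the sketch assumes the form implied by the P65 calculation: lower boundary + (position − cumulative frequency below the class) × class width / class frequency. The function name grouped_percentile is illustrative.

```python
def grouped_percentile(i, boundaries, freqs):
    """Approximate the ith percentile from class boundaries and frequencies."""
    n = sum(freqs)
    position = i * n / 100          # position of the percentile
    cumulative = 0                  # cumulative frequency below current class
    for (lower, upper), f in zip(boundaries, freqs):
        if cumulative + f >= position:
            return lower + (position - cumulative) * (upper - lower) / f
        cumulative += f
    raise ValueError("i must lie between 0 and 100")


boundaries = [(37.5, 41.5), (41.5, 45.5), (45.5, 49.5), (49.5, 53.5),
              (53.5, 57.5), (57.5, 61.5), (61.5, 65.5)]
freqs = [4, 10, 8, 15, 9, 3, 1]

p65 = grouped_percentile(65, boundaries, freqs)
print(round(p65, 1))
```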
Example
The cumulative frequency graph on the following page shows the distribution of marks
scored by a class of 40 students in a test.
type                value(s)
central tendency    median
deviation           quartile deviation = Q = (Q3 − Q1)/2
extremes            minimum and maximum
Example
The IQs of 13 people are shown below.
92, 104, 93, 98, 112, 145, 88, 90, 104, 119, 101, 95, 154
minimum = 88
Q1 = 92.5
median = 101
Q3 = 115.5
maximum = 154
Box-and-Whisker plot
Q** = Q3 + 1.5×IQR
=115.5 + (1.5)(23)
=150
The only value in the data set that is larger than this is 154. This value (154) is “too big” and
so is an outlier.
Example continued:
A Box-and-Whisker plot can also be used to assess the skewness (departure from
symmetry) of a variable.
• For positively skewed data most of the values are at the lower end of the scale
(mean > median, “box” section of the plot towards the lower end of the scale).
• For negatively skewed data most of the values are at the upper end of the scale (mean
< median, “Box” section of the plot towards the upper end of the scale).
• In the previous example the data set is positively skewed.
When several data sets are to be compared, several Box-and-Whisker plots can be plotted
side-by-side.
Example
The Box-and-Whisker plot shown below enables one to compare delays in departing flights
(in minutes) for certain days in December (16th to the 26th).
For all the days the data sets are positively skewed (data sets all have the “box” section
closer to the lower end of the scale with a long upper whisker). This means that most
delays in flight departures were short on all the days. The long upper whiskers that are
visible show that there were some quite late departures on 16, 17, 21, 22, 23, 24 and 25
December.
Chapter 3 – Probability
3.1 Terminology
Probability (Chance)
• A probability is the chance that something of interest will happen.
• A probability is expressed as a proportion i.e. it ranges from 0 to 1.
Chance can be expressed as a percentage i.e. it ranges from 0 to 100.
Examples
2) The probability of winning the Lotto is 1/13983816.
Random experiment
This is an experiment that gives different outcomes when repeated under similar conditions.
3) The outcome that will occur when the experiment is performed depends on
chance.
Examples
4) Drawing a card from a deck of cards (possible outcomes: 13 hearts, 13 clubs, 13 spades,
13 diamonds).
Set
A set is a collection of outcomes.
Sample space
The sample space is the set of all possible outcomes of a random experiment. A
sample space is usually denoted by the symbol S and the collection of elements
contained in S enclosed in curly brackets { }.
Sample point
A sample point is an individual outcome (element) in a sample space.
Examples
5) Drawing a card from a deck of cards. The elements in the sample space are listed
below.
S = {2♦ 3♦ 4♦ 5♦ 6♦ 7♦ 8♦ 9♦ 10♦ J♦ Q♦ K♦ A♦
2♥ 3♥ 4♥ 5♥ 6♥ 7♥ 8♥ 9♥ 10♥ J♥ Q♥ K♥ A♥
2♣ 3♣ 4♣ 5♣ 6♣ 7♣ 8♣ 9♣ 10♣ J♣ Q♣ K♣ A♣
2♠ 3♠ 4♠ 5♠ 6♠ 7♠ 8♠ 9♠ 10♠ J♠ Q♠ K♠ A♠ }
Event
An event is a subset of a sample space i.e. a collection of sample points taken from a sample
space.
Impossible event
An impossible event is an event that cannot happen (has probability zero).
Certain event
A certain event is an event that is sure to happen (has probability 1).
Simple events are events that involve only one sample point (outcome) of the sample
space.
Examples
1) Let E denote the event “an odd number is obtained when tossing a single die”.
Then E = {1, 3, 5}.
2) Let H denote the event “at least one head appears when tossing two coins”.
H = {hh, ht, th}.
3) Let B denote the event “obtaining a club and a heart in a single draw from a deck of
cards”. The event B is impossible. The set of outcomes of B is an empty set denoted by
B = { } = φ.
4) Let A denote the event “obtaining a 1, 2, 3, 4, 5 or 6 when tossing a single die”. The
event A is a certain event i.e. one of the outcomes belonging to the set describing the
event must happen. This is denoted by A = S, where S is the sample space.
Venn diagrams
• A Venn diagram is a drawing, in which circular areas represent groups of items
usually sharing common properties.
• The drawing consists of two or more circles, each representing a specific group or
set, contained within a square that represents the sample space. Venn diagrams are
often used as a visual display when referring to sample spaces, events and
operations involving events.
Complementary events
The complementary event Ā (sometimes written A′) of an event A consists of all the outcomes
in S that are not in A.
Examples
1) Consider the experiment of tossing a single die. S = {1, 2, 3, 4, 5, 6}. The complement
of the event A = “obtaining a 3 or less” = {1, 2, 3} is
Ā = “obtaining a 4 or more” = {4, 5, 6}.
2) Consider the experiment of tossing two coins. S = {hh, ht, th, tt}. The complement of
the event H = “at least one head” = {hh, ht, th} is H̄ = “no heads” = {tt}.
• The union of two events A and B, denoted by A ∪ B , is the set of outcomes that are
in A or in B or in both A and B i.e. the event that
“either A or B or both A and B occur”
or “at least one of A or B occurs”.
These definitions involving two events can be extended to ones involving 3 or more events
e.g. for the 3 events A1, A2 and A3 the event A1 ∪ A2 ∪ A3 is the event “at least one of A1, A2
or A3 occurs” and A1 ∩ A2 ∩ A3 the event “A1 and A2 and A3 occur”.
Examples
A ∪ B = {1, 2, 3, 5, 6, 7, 8, 9} , A ∩ B = { 3, 7},
A ∩ B = {2, 5, 9}, A ∩ B = {1, 6, 8}.
2) Let C be the event “drawing a face card from a deck of cards” and A the event “drawing
a king or an ace from a deck of cards”.
Examples
1) Let B be the event “drawing a black card from a deck of cards” and R the event “drawing
a red card from a deck of cards”.
The events B and R have no outcomes in common i.e. B ∩ R = φ (empty set). Hence B
and R are mutually exclusive.
2) Let E be the event “an even number with a single throw of a die” and O the event “an
odd number with a single throw of a die” i.e. E = {2, 4, 6} and O = {1, 3, 5}.
P(A) = N(A)/N(S) = m/n,
where N(A) = m is the number of outcomes favourable to the event A and N(S) = n
the number of outcomes in the sample space S i.e. the total number of outcomes.
Examples
Solution:
2) Two dice are rolled. Find the probability that a sum of 7 will occur.
Solution:
The number of sample points in S is 36 (see example 3 under sample space).
The classical definition of probability requires the assumption that all the outcomes in the
sample space are equally likely. If this assumption is not met, this formula cannot be used.
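The classical formula can be illustrated by listing a sample space and counting favourable outcomes. The sketch below does this for the two-dice example (probability of a sum of 7), assuming fair dice so that all 36 outcomes are equally likely; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space S: all ordered pairs (first die, second die).
sample_space = list(product(range(1, 7), repeat=2))

# Event A: the two numbers add up to 7.
favourable = [outcome for outcome in sample_space if sum(outcome) == 7]

# P(A) = N(A) / N(S)
p_sum_7 = Fraction(len(favourable), len(sample_space))
print(len(favourable), len(sample_space), p_sum_7)
```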
Example
The possible temperatures (degrees Celsius) in Durban on a particular day in December are
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39.
In December Durban is hot so, for example, 15 degrees is less likely than 30 degrees
i.e. P (temperature = 15) = 1 ÷ 25 = 0.04 does not seem reasonable.
P(A) = f/n.
Note: This formula differs from the classical formula in the sense that the classical
formula uses all the outcomes in the sample space as the total number of outcomes,
while the relative frequency formula uses the number of repetitions (n) of the
experiment as total number of outcomes. In the classical formula the number of
outcomes in the sample space is fixed, while the number of repetitions of an
experiment (n) can vary. It can be shown that the empirical probability is a good
approximation of the true probability when n is sufficiently large.
Examples
1) A bent coin is tossed 1000 times with heads coming up 692 times.
An estimate of P(h) is 692/1000 = 0.692.
Mark f
less than 30 6
30 – 39 26
40 – 49 45
50 – 59 64
60 – 69 82
70 – 79 37
80 – 89 22
90 – 99 8
Total 290
From the table (using the empirical formula) the following probabilities can be
estimated.
(a) P(mark less than 40) = (26 + 6)/290 = 0.110.
(b) P(pass) = (64 + 82 + 37 + 22 + 8)/290 = 1 − (6 + 26 + 45)/290 = 213/290 = 0.73.
(c) P(above 80) = (22 + 8)/290 = 0.103.
Example
The preference probabilities according to gender for 2 different brands of a certain product
are summarized in the table on the following page.
The gender marginal probabilities are obtained by summing the joint probabilities over the
brands. The brand marginal probabilities are obtained by summing the joint probabilities
over the genders.
                        Brand           Marginal
                      1       2         probability
Gender    Male        0.2     0.32      0.52
          Female      0.4     0.08      0.48
Marginal probability  0.6     0.4       1
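The marginal probabilities in the table can be reproduced by summing the joint probabilities, as described above. The sketch below is only an illustration; the dictionary layout and variable names are assumptions.

```python
# Joint probabilities P(gender and brand) taken from the table.
joint = {
    ("Male", 1): 0.20, ("Male", 2): 0.32,
    ("Female", 1): 0.40, ("Female", 2): 0.08,
}

gender_marginal = {}
brand_marginal = {}
for (gender, brand), p in joint.items():
    # Sum over brands to get the gender marginal, and over genders for the brand marginal.
    gender_marginal[gender] = gender_marginal.get(gender, 0) + p
    brand_marginal[brand] = brand_marginal.get(brand, 0) + p

print(gender_marginal)
print(brand_marginal)
```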
This rule can be extended to any finite number of experiments. If one experiment can be
done in n1 ways, a second one in n2 ways, . . . , a kth one in nk ways, then one of the k
experiments can be done in n1 + n2 +. . . + nk ways.
Example:
Suppose a man is standing in a room which has 2 doors to his left and 1 door to his
right. In how many ways can he leave the room?
Solution:
Let “leave the room by going to the left” be experiment 1 and “leave the room by
going to the right” be experiment 2. There are n=2 ways to do experiment 1 (he can
leave by door A or door B) and there is m=1 way to do experiment 2 (he can leave by
door C). In total there are n+m = 2+1 = 3 ways to leave the room.
This rule can be extended to any finite number of experiments. If one experiment can be
done in n1 ways, a second one in n2 ways, . . . , a kth one in nk ways, then the k
experiments together can be done in n1×n2×…×nk ways.
Example 1:
A basic meal consists of soup, a sandwich and a beverage. If a person having this
meal has 3 choices of soup, 4 choices of sandwiches and a choice of coffee or tea as
a beverage, how many such meals are possible?
Example 2:
A PIN to be used at an ATM can be formed by selecting 4 digits from the digits
0, 1, 2, . . . , 9 . How many choices of PIN are there if
Factorial notation
In how many ways can n (n – integer) objects be arranged in a row?
Note: 1! = 1, 0! = 1.
Examples
1) In how many ways can 7 people be placed in a queue at a bus stop?
Permutation
• A permutation is an arrangement of a group of items where
order matters.
• The number of permutations of n objects taken r at a time is calculated from
nPr = P(n, r) = n! / (n − r)!.
Combination
• A combination is a selection of a group of items where order
does not matter.
• The number of combinations of a group of n objects taken r at a time is calculated
from
nCr = C(n, r) = n! / ((n − r)! r!).
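Both formulas can be evaluated directly from factorials. The sketch below is an illustration with hypothetical helper names (n_permutations, n_combinations); it also compares the results with Python's built-in math.perm and math.comb, which are available from Python 3.8.

```python
import math


def n_permutations(n, r):
    """Number of ordered arrangements of r items chosen from n: n! / (n - r)!"""
    return math.factorial(n) // math.factorial(n - r)


def n_combinations(n, r):
    """Number of unordered selections of r items from n: n! / ((n - r)! r!)"""
    return math.factorial(n) // (math.factorial(n - r) * math.factorial(r))


# Choosing 2 people out of 4, with and without regard to order.
print(n_permutations(4, 2), n_combinations(4, 2))

# The built-in functions give the same counts.
print(math.perm(4, 2), math.comb(4, 2))
```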
Examples:
1) Four people (A, B, C, D) serve on a board of directors. A chairman and vice-chairman are
to be chosen from these 4 people. In how many ways can this be done?
Chairman Vice-chairman
A B
B A
A C
C A
A D
D A
B C
C B
B D
D B
C D
D C
2) Four people (A, B, C, D) serve on a board of directors. Two people are to be chosen from
them as members of a committee that will investigate fraud allegations. In how many
ways can this be done?
Number of ways = 6.
In both these examples a choice of 2 people from 4 people is made. However, in example 1
the order of choice of the 2 people matters (since the one person chosen is chairman and
the other one vice-chairman). In example 2 the order does not matter. The only interest is in
who serves on the committee.
Application of formulae.
In question 1 the permutations formula applies with n = 4, r = 2.
Number of ways = P(4, 2) = 4! / (4 − 2)! = 12.
In question 2 the combinations formula applies with n = 4, r = 2.
Number of ways = C(4, 2) = 4! / (2! (4 − 2)!) = 6.
3) Find the number of ways to take 4 people and place them in groups of 3 at a time where
order does not matter.
Solution:
Since order does not matter, use the combination formula.
C(4,3) = 4! / (3! (4 − 3)!) = 24/6 = 4.
4) Find the number of ways to arrange 6 items in groups of 4 at a time where order matters.
Solution: P(6,4) = 6! / (6 − 4)! = 720/2! = 360
There are 360 ways to arrange 6 items taken 4 at a time when order matters.
5) Find the number of ways to take 20 objects and arrange them in groups of 5 at a time
where order does not matter.
Solution: C(20,5) = 20! / (5! (20 − 5)!) = (20 × 19 × 18 × 17 × 16) / (1 × 2 × 3 × 4 × 5) = 15504
There are 15 504 ways to select 20 objects taken 5 at a time when order does not
matter.
6) Determine the total number of five-card hands that can be drawn from a deck of 52
cards.
Solution:
When a hand of cards is dealt, the order of the cards does not matter. Thus the
combinations formula is used.
There are 52 cards in a deck and we want to know in how many different ways we can
draw them in groups of five at a time when order does not matter. Using the
combination formula gives
C(52,5) = 2 598 960.
7) There are five women and six men in a group. From this group a committee of 4 is to be
chosen. In how many ways can the committee be formed if the committee is to have at least
3 women in it?
Solution:
8) In how many ways can a phone number consisting of 5 digits be chosen from the digits
1, 2, 3, . . . , 9 if no digits are to be repeated?
Solution:
9) In how many ways can the 6 winning numbers in a Lotto draw be selected?
Solution:
10) In how many ways can a five-card hand consisting of three eights and two sevens be dealt?
Solution:
11) How many different 5-card hands include 4 of a kind and one other card?
Solution:
We have 13 different ways to choose 4 of a kind: 2's, 3's, 4's, … Queens, Kings and
Aces.
Once a set of 4 of a kind has been removed from the deck, 48 cards are left.
The possible situations that will satisfy the above requirement are:
Complementary events
For any event A defined on some sample space,
P(Ā) = 1 – P(A).
These formulae can be extended to probabilities involving more than two events
e.g. for 3 events A, B and C defined on some sample space
This formula can easily be verified with the aid of the Venn diagram shown below.
From the above diagram the following sets can be written down.
De Morgan’s Laws
(1) P(Ā ∩ B̄) = P(complement of A ∪ B)
(2) P(Ā ∪ B̄) = P(complement of A ∩ B)
P(A) = P(A ∩ B) + P(A ∩ B̄)
P(B) = P(A ∩ B) + P(Ā ∩ B)
These formulae can be verified from the Venn diagram shown on the following page.
The formulae can be extended to probabilities involving more than two events.
Examples
1) There are two telephone lines – A and B. Line A is engaged 50% of the time and line B is
engaged 60% of the time. Both lines are engaged 30% of the time. Calculate the
probability that
Let E1 denote the event “line A is engaged” and E2 the event “line B is engaged”.
(a) P(at least one of the lines is engaged) = P(E1 ∪ E2)
= P(E1) + P(E2) – P(E1 ∩ E2)
= 0.5 + 0.6 – 0.3
= 0.8
(b) P(none of the lines is engaged) = 1 – P(at least one of the lines is engaged)
= 1 – 0.8
= 0.2
(d) The event “line A is engaged, but line B is not engaged” can be written in symbols as
E1 ∩ Ē2.
(e) P(only one line is engaged) = P(line A is engaged, but line B is not engaged)
+ P(line B is engaged, but line A is not engaged)
= P(E1 ∩ Ē2) + P(Ē1 ∩ E2)
P(Ē1 ∩ E2) = P(E2) – P(E1 ∩ E2) = 0.6 – 0.3 = 0.3. (Using the total probability
formula)
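The telephone-line probabilities can be checked with exact fractions. The sketch below applies the addition rule, the complement rule and the total probability formula to the given values 0.5, 0.6 and 0.3; the variable names are illustrative.

```python
from fractions import Fraction

p_a = Fraction(50, 100)      # P(E1): line A engaged
p_b = Fraction(60, 100)      # P(E2): line B engaged
p_both = Fraction(30, 100)   # P(E1 and E2): both engaged

p_at_least_one = p_a + p_b - p_both   # addition rule
p_none = 1 - p_at_least_one           # complement rule
p_a_only = p_a - p_both               # total probability formula
p_b_only = p_b - p_both
p_exactly_one = p_a_only + p_b_only   # the two cases are mutually exclusive

print(float(p_at_least_one), float(p_none), float(p_exactly_one))
```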
2) Let O be the event that a certain lecturer will be in his/her office on a particular
afternoon and L the event that he/she will be at a lecture. Suppose P(O) = 0.48 and P(L)
= 0.27.
Solution:
(b) P( O ∩ L ) =
3) A batch of 20 computers contains 3 that are faulty. Four (4) computers are selected at
random without replacement from this batch. Calculate the probability that
Solution:
There are C(20,4) = 4845 [why not P(20,4)?] ways of selecting the 4 computers from the
batch of 20. Since random selection is used, all 4845 selections are equally likely. Let A
denote the event “none of the 4 computers selected is faulty” and B the event “at least
2 of the computers selected are faulty”.
(a) P(A) =
(b) P(B) =
The conditional probability of an event A occurring given that another event B has occurred
is given by
P(A | B) = P(A ∩ B) / P(B), where P(B) > 0.
Also P(B | A) = P(A ∩ B) / P(A), where P(A) > 0.
Example 1
Five hundred (500) TV viewers consisting of 300 males and 200 females were asked whether
they were satisfied with the news coverage on a certain TV channel. Their replies are
summarized in the table below.
                            Answer
                 Satisfied   Not satisfied   Total
Gender  Male        180           120         300
        Female       90           110         200
Total               270           230         500
P(satisfied | male) = 180/300 = 0.6.
P(satisfied | female) = 90/200 = 0.45.
P(satisfied) = 270/500 = 0.54 and P(not satisfied) =
Note
2) The probability of a person being satisfied depends on the gender of the person being
interviewed. In this case females are less satisfied than males with the news coverage.
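The conditional probabilities in Example 1 can be read off the table by restricting attention to one row. The sketch below assumes the counts are stored in a dictionary keyed by (gender, answer); the helper name conditional is illustrative.

```python
# Counts from the table of 500 TV viewers.
counts = {
    ("Male", "Satisfied"): 180, ("Male", "Not satisfied"): 120,
    ("Female", "Satisfied"): 90, ("Female", "Not satisfied"): 110,
}


def conditional(answer, gender):
    """P(answer | gender): count in the cell divided by the row total."""
    row_total = sum(c for (g, _), c in counts.items() if g == gender)
    return counts[(gender, answer)] / row_total


total = sum(counts.values())
p_satisfied = sum(c for (_, a), c in counts.items() if a == "Satisfied") / total

print(conditional("Satisfied", "Male"), conditional("Satisfied", "Female"), p_satisfied)
```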
Example 2
At a certain university the probability of passing accounting is 0.68, the probability of
passing statistics 0.65 and the probability of passing both statistics and accounting is 0.57.
Calculate the probability that a student
(c) passes statistics when it is known that he/she did not pass accounting.
Solution:
(a) P(B|A) = P(A ∩ B) / P(A) = 0.57/0.68 = 0.838.
(b) P(A|B) = P(A ∩ B) / P(B) = 0.57/0.65 = 0.877.
(c) P(B | Ā) =
Examples
1) A box has 12 bulbs, 3 of which are defective. If two bulbs are selected at random
without replacement, then what is the probability that both are defective?
Solution:
Let d1 denote the event “the first bulb is defective” and d2 the event “the second bulb is
defective”.
Then P(d1) = 3/12 and P(d2|d1) = 2/11.
Using the above-mentioned multiplication formula,
P(d2 ∩ d1) = P(d1) P(d2|d1) = (3/12)(2/11) = 0.045.
2) Two cards are drawn at random from a deck of playing cards. What is the
probability that both these cards are aces?
Solution:
Since there are 4 aces in a deck of 52 cards, the probability of drawing one ace is 4/52.
Removing one ace without replacing it reduces the probability of drawing
another ace on the second draw. The 51 cards remaining contain 3 aces and therefore
the probability of drawing an ace on the second draw is 3/51. We can multiply these
probabilities and determine the probability of drawing two aces.
3) Three cards are drawn at random from a deck of playing cards. What is the
probability that all 3 of these cards are aces?
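The multiplication formula for draws without replacement can be sketched as a running product of conditional probabilities. The function name p_all_special and its parameters are illustrative assumptions; the sketch covers the defective-bulb example and the two card questions above.

```python
from fractions import Fraction


def p_all_special(draws, special, total):
    """P(every one of `draws` items drawn without replacement is 'special')."""
    p = Fraction(1)
    for k in range(draws):
        # After k special items are removed, special - k remain out of total - k.
        p *= Fraction(special - k, total - k)
    return p


p_two_defective = p_all_special(2, special=3, total=12)   # bulbs example
p_two_aces = p_all_special(2, special=4, total=52)
p_three_aces = p_all_special(3, special=4, total=52)

print(p_two_defective, p_two_aces, p_three_aces)
```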
Independent events
Two events A and B are said to be independent if P(A|B) = P(A) or P(B|A) = P(B).
This means that the occurrence of B does not affect the probability that A occurs.
Substitution of the above result into the multiplication formula for two probabilities gives
P(A ∩ B) = P(A) P(B) if A and B are independent.
Examples
1) The probability that person A will be alive in 20 years is 0.7 and the probability that
person B will be alive in 20 years is 0.5, while the probability that they will both be alive
in 20 years is 0.45. Are the events E1 “A is alive in 20 years” and E2 “B is alive in 20 years”
independent?
Solution:
Since P(E1) P(E2) = 0.7 × 0.5 = 0.35 ≠ P(E1 ∩ E2), the events E1 and E2 are not
independent.
Since P(1st coin is heads) × P(2nd coin is heads) = ½ × ½ = ¼ = P(both tosses heads),
the events “heads on the first toss” and “heads on the second toss” are independent.
• The multiplication rule for independent events can be extended to involve more
than 2 events. In general, if the events A1, A2, . . . , An are independent then
P(A1 ∩ A2 ∩ . . . ∩ An) = P(A1) P(A2) . . . P(An).
Examples
1) A coin is tossed and a single 6 sided die is rolled. Find the probability of “heads” and
rolling a 3 with the die.
P(head) = ½ and P(3) = 1/6.
Since the results of the coin and the die are independent,
P(heads and 3) = P(heads) P(3) = (1/2) × (1/6) = 1/12
2) A school survey found that 9 out of 10 students like pizza. If three students are
chosen at random with replacement, what is the probability that all three students
like pizza?
Solution
P(student 1 likes pizza) = 9/10 = P(student 2 likes pizza) = P(student 3 likes pizza).
P(student 1 likes pizza and student 2 likes pizza and student 3 likes pizza)
= P(student 1 likes pizza) × P(student 2 likes pizza) × P(student 3 likes pizza)
= (9/10)³ = 0.729.
3) It is known that 8% of all cars of a certain make that are sold encounter engine
overheating problems within 50 000 kilometers of travel. During the past week 4
such cars were sold. Assuming that engine overheating problems for the 4 cars are
encountered independently, what is the probability that
(a) all 4
(b) none
(c) at least one of these cars sold
encounter engine overheating problems within 50 000 kilometers of travel ?
Solution:
Let A denote the event “overheating problems within 50 000 kilometers of travel”.
So
P(none) =
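Under the stated independence assumption, the three probabilities in this example follow from the multiplication rule for independent events and the complement rule. The sketch below uses p = 0.08 and n = 4 from the question; the variable names are illustrative.

```python
p = 0.08   # P(a car has overheating problems within 50 000 km)
n = 4      # number of cars sold

p_all = p ** n                   # all 4 cars overheat
p_none = (1 - p) ** n            # none of the 4 cars overheat
p_at_least_one = 1 - p_none      # complement of "none"

print(p_all, round(p_none, 4), round(p_at_least_one, 4))
```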
Bayes’ theorem
In order to apply the conditional probability formula P(A|B) = P(A ∩ B) / P(B),
values for P(A ∩ B) and P(B) are needed.
Suppose that only the values for P(A), P(B|A) and P(B|Ā) are available.
In this case the probabilities [P(A ∩ B) and P(B)] required for calculating P(A|B) can be
calculated from
P(A ∩ B) = P(A) P(B|A)
and
P(B) = P(A ∩ B) + P(Ā ∩ B) = P(A) P(B|A) + P(Ā) P(B|Ā).
Substituting these probabilities into the first conditional probability formula gives
P(A|B) = P(A) P(B|A) / [P(A) P(B|A) + P(Ā) P(B|Ā)].
This result is known as Bayes’ theorem (named after the person who proposed the
method).
Example 1
When testing a person for a certain disease, the test can show either a positive result (the
person has the disease) or a negative result (the person does not have the disease).
When a person actually has the disease, the test shows positive 99% of the time. When the
person actually does not have the disease the test shows negative 95% of the time. Suppose
it is known that only 0.1% of the people in the population have the disease.
a) If a test turns out to be positive, what is the probability that the person has the
disease?
b) If the test turns out to be negative, what is the probability that the person does not
have the disease?
Solution:
Denominator:
P(B) = P(A ∩ B) + P(Ā ∩ B)
= P(A) P(B|A) + P(Ā) P(B|Ā)
= (0.001 × 0.99) + (0.999 × 0.05)
= 0.00099 + 0.04995
= 0.05094
P(A|B) = P(A ∩ B) / P(B) = 0.00099/0.05094 = 0.0194.
(b) P(Ā | B̄) = P(Ā ∩ B̄) / P(B̄) = P(Ā) P(B̄|Ā) / (1 − P(B)) = (0.999 × 0.95)/0.94906 = 0.9999895.
From the above it can be seen that a negative result of the test is very reliable (it will be
wrong only 105 times in 10 million cases). On the other hand, the chances that a person will
have the disease when the result of the test shows positive is 194 in 10 000.
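Both parts of Example 1 use the same two-event form of Bayes' theorem, so they can be checked with one small function. The sketch below is an illustration; the name bayes_posterior and its argument names are assumptions. For part (b) the roles are swapped: the "event of interest" is not having the disease and the "evidence" is a negative test.

```python
def bayes_posterior(prior, p_evidence_given_event, p_evidence_given_not_event):
    """P(event | evidence) from a prior and the two conditional probabilities."""
    numerator = prior * p_evidence_given_event
    denominator = numerator + (1 - prior) * p_evidence_given_not_event
    return numerator / denominator


# (a) P(disease | positive): prior 0.001, P(+ | disease) = 0.99, P(+ | no disease) = 0.05
p_disease_given_pos = bayes_posterior(0.001, 0.99, 0.05)

# (b) P(no disease | negative): prior 0.999, P(- | no disease) = 0.95, P(- | disease) = 0.01
p_healthy_given_neg = bayes_posterior(0.999, 0.95, 0.01)

print(round(p_disease_given_pos, 4), round(p_healthy_given_neg, 7))
```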
Suppose A1, A2, …, An are mutually exclusive events whose union is the sample space
S and P(Ai) > 0 for each i. Then, for any event B with P(B) > 0, and any k = 1, 2, …, n,
P(Ak|B) = P(Ak) P(B|Ak) / [P(A1) P(B|A1) + P(A2) P(B|A2) + … + P(An) P(B|An)].
Example 2
Suppose that Bob can decide to go to work by one of three modes of transportation – car, bus,
or commuter train. Because of high traffic, if he decides to go by car, there is a 50% chance
he will be late. If he goes by bus, which has special reserved lanes but is sometimes
overcrowded, the probability of being late is only 20%. The commuter train is more expensive
than the other modes of transport but is late only 1% of the time.
a) Suppose that Bob is late one day and his boss wishes to estimate the probability that he
drove to work that day by car. Since he does not know which mode of transportation
Bob usually uses, he assumes that each mode is equally likely to be used. What is the
boss’ estimate of the probability that Bob drove to work by car?
b) Suppose that a co-worker of Bob’s knows that Bob drives to work by car 10% of the
time, he almost always takes the commuter train to work, and he never takes the bus.
Given that Bob is late to work today, the co-worker believes there is a ____% chance
that Bob came to work by train.
Solution
There are two events of interest –being late and choice of transport. There are 3 options for
the choice of transport.
Let
L = is late to work
B = takes bus
C = takes car
T = takes train
Solution (a)
Find P(C|L) = P(C ∩ L) / P(L).
Numerator: P(C ∩ L) = P(C) × P(L|C) = (1/3)(1/2) = 1/6.
Denominator: P(L) = P(C) P(L|C) + P(B) P(L|B) + P(T) P(L|T)
= (1/3)(1/2) + (1/3)(1/5) + (1/3)(1/100)
= 71/300.
So, P(C|L) = P(C ∩ L) / P(L) = (1/6) ÷ (71/300) = 0.7042.
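The calculation in Solution (a) can be sketched for any number of mutually exclusive alternatives. The function name posterior and the dictionary layout are illustrative assumptions; only the equal priors of part (a) are used here.

```python
from fractions import Fraction


def posterior(priors, likelihoods, k):
    """P(k | B) = P(k) P(B | k) divided by the sum of P(i) P(B | i) over all i."""
    total = sum(priors[event] * likelihoods[event] for event in priors)
    return priors[k] * likelihoods[k] / total


# Part (a): the boss assumes each mode of transport is equally likely.
priors = {"car": Fraction(1, 3), "bus": Fraction(1, 3), "train": Fraction(1, 3)}
# P(late | mode of transport)
p_late = {"car": Fraction(1, 2), "bus": Fraction(1, 5), "train": Fraction(1, 100)}

p_car_given_late = posterior(priors, p_late, "car")
print(p_car_given_late, round(float(p_car_given_late), 4))
```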
Solution (b)
Try for yourself
a/b = p/(1 − p).
From the above it can be shown that p = a/(a + b) and 1 – p = b/(a + b).
b/a = (1 − p)/p.
Examples
a) A pair of balanced dice is tossed. What are the odds in favour of the sum of the numbers
showing a 6?
Total number of outcomes = 6 x 6 =36.
Possible ways of getting a sum of 6 : (1, 5), (2, 4), (3, 3), (4, 2), (5,1).
Number of ways of getting a 6 is 5.
p = probability sum equals 6 = 5/36 , 1 – p = 31/36.
Odds in favour of a sum of 6 are 5 to 31, or 1 to 6.2.
c) The table below shows data that were collected from 781 middle aged female patients at a
certain hospital.
no 90 346 436
(i) For smokers the odds in favour of heart problems is 172 to 173 or 1 to 1.0058
From this it can be seen that smokers are much more at risk for heart problems than non-
smokers.
Chapter 4 – Probability distributions of discrete random variables
Examples:
Examples:
1) The variables T and X from the above examples are discrete random variables.
2) The variables H and V from the above examples are continuous random variables.
Examples:
1) As above, let T be the random variable that represents the number of tails obtained
when a coin is flipped three times. Then T has 4 possible values 0, 1, 2, and 3. The
outcomes of the experiment and the values of T are summarized in the next table.
Outcomes T
hhh 0
hht, hth, thh 1
tth, tht, htt 2
ttt 3
Assuming that the outcomes are all equally likely, the probability distribution for T is
given in the following table.
t 0 1 2 3 Total
p(t) 1/8 3/8 3/8 1/8 1
2) Let Y denote the number of tosses of a coin until heads first appears. Then
y 1 2 3 . . . Total
p(y) ½ (½)² (½)³ . . . 1
                    1st die
             1   2   3   4   5   6
         1   2   3   4   5   6   7
         2   3   4   5   6   7   8
2nd die  3   4   5   6   7   8   9
         4   5   6   7   8   9  10
         5   6   7   8   9  10  11
         6   7   8   9  10  11  12
x 2 3 4 5 6 7 8 9 10 11 12
P(X=x) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36
Note:
For any discrete random variable X, the probabilities of the values that it can assume are such that
0 ≤ P(x) ≤ 1 and ∑ P(x) = 1 (sum over all values x).
Examples
1) For the probability mass function in example 1 the cumulative distribution function is
x 0 1 2 3
F(x) 1/8 ½ 7/8 1
2) For the probability mass function in example 3 the cumulative distribution function is
x 2 3 4 5 6 7 8 9 10 11 12
F(x) 1/36 3/36 6/36 10/36 15/36 21/36 26/36 30/36 33/36 35/36 1
3) Consider a discrete random variable with probability mass function given below.
x 1 2 3 4
P(X=x) 0.1 0.3 0.4 0.2
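A cumulative distribution function is obtained by accumulating the probability mass function from the smallest value upwards. The sketch below rebuilds the distribution of the sum of two fair dice by enumeration and then accumulates it, and does the same for the four-value distribution in example 3; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import accumulate, product

# Probability mass function of the sum of two fair dice, by counting outcomes.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
values = sorted(counts)
pmf = [Fraction(counts[x], 36) for x in values]

# F(x) = P(X <= x): running total of the probabilities.
cdf = list(accumulate(pmf))

# The same accumulation for the distribution in example 3.
pmf3 = [Fraction(1, 10), Fraction(3, 10), Fraction(4, 10), Fraction(2, 10)]
cdf3 = list(accumulate(pmf3))

print(dict(zip(values, cdf)))
print([float(f) for f in cdf3])
```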
The graphs on the previous page are plots of the probability mass function (graph on the
right) and cumulative distribution function (graph on the left).
A random variable can only take on one value at a time i.e. the events X = x1 and X = x 2 for
x1 ≠ x2 are mutually exclusive. The probability of the variable taking on any number of
different values can be found by simply adding the appropriate probabilities.
Examples
1) Find the probability of getting 2 or more tails when a coin is flipped 3 times.
2) Find the probability of getting at least one tail when a coin is flipped 3 times.
Or
3) Find the probability of needing at most 3 tosses of a coin to get the first heads.
The mean or expected value of a random variable X is the average value that we would
expect for X when performing the random experiment many times.
E(X) = µ = ∑ xp(x) .
Examples
Thus if 3 coins are flipped a large number of times, we should expect the average
number of tails (per 3 flips) to be about 1.5. Since the number of tails is an integer
value, it will never actually assume the mean value of 1.5. This mean value more
reflects the fact that the extreme values (0 and 3) occur the same proportion of
times (an eighth) and the middle values occur the same proportion of times (three
eighths).
2) The score S obtained in a certain quiz is a random variable with probability distribution
given below.
s 0 1 2 3 4 5
p(s) 0.12 0.04 0.16 0.32 0.24 0.12
s 0 1 2 3 4 5 sum
p(s) 0.12 0.04 0.16 0.32 0.24 0.12 1
s × p(s) 0 0.04 0.32 0.96 0.96 0.60 2.88
µ = E(S) = 2.88
Variance
For a random variable X, the variance, denoted by σ2 , can be calculated by using the
formula
The standard deviation of X, denoted by σ, is just the positive square root of σ2. This is a
measure of the extent to which the values are spread around the mean.
The calculation of the standard deviation for a random variable is similar to that of the
calculation of the standard deviation for grouped data.
Example
t 0 1 2 3 sum
p(t) 1/8 3/8 3/8 1/8 1
t × p(t) 0 3/8 6/8 3/8 1.5
t2 × p(t) 0 3/8 12/8 9/8 3
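The column sums in the table lead directly to the mean and variance of T. The variance formula itself is not reproduced in this copy of the notes, so the sketch below assumes the usual shortcut form σ² = ∑ t² p(t) − µ², which uses exactly the two sums tabulated above; the variable names are illustrative.

```python
from fractions import Fraction

# Probability distribution of T, the number of tails in three flips of a coin.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

mean = sum(t * p for t, p in pmf.items())               # sum of t * p(t)
second_moment = sum(t * t * p for t, p in pmf.items())  # sum of t^2 * p(t)
variance = second_moment - mean ** 2                    # assumed shortcut formula
sd = float(variance) ** 0.5

print(mean, second_moment, variance, round(sd, 3))
```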
Bernoulli trial:
Consider an experiment in which there are two complementary outcomes. One
outcome is labelled “success” (s) and the other is labelled “failure” (f). Such an
experiment is called a Bernoulli trial.
We denote the probability of success as P(s)= p and the probability of failure as
P(f) = 1–p = q
Notation:
A shorthand way of referring to a binomially distributed random variable X, based on
n trials with probability of success p, is X ~ B(n,p) or X ~ Bin(n,p).
Examples:
1) Consider the experiment of flipping a coin 5 times. If we let the event of getting “tails” on
a flip be labeled “success” and “heads” failure, and if the random variable T represents
the number of tails obtained, then T will be binomially distributed with n = 5, p = ½ and
q=½
Tree diagram
The possible outcomes in a binomial experiment can be written down
from a diagram such as the one below. This diagram, called a tree diagram, enables
one to write down all the outcomes when this experiment is performed 3 times.
[Tree diagram: starting from “start”, each of the 3 trials branches into s (success) and f (failure), giving 8 possible outcomes.]
The following outcomes and their respective number of successes (x) can be written down
from the above tree diagram.
Outcomes x
fff 0
ffs, fsf, sff 1
ssf, sfs, fss 2
sss 3
A formula for the binomial probability mass function for the case n = 3 can be written down
from the above table by noting the following.
1) Each outcome is a sequence of s (success) and f (failure) values e.g. fff, ffs, ssf etc.
4) The number of outcomes with x successes and (3 – x) failures can
be counted by using the formula C(3, x) = 3Cx.
By using the above, the binomial formula for n = 3 can be written down as
$P(X = x) = {}^{3}C_{x}\, p^{x} q^{3-x}$, for x = 0, 1, 2, 3.
To write down the general formula, the same reasoning as explained above applies to
sequences with n outcomes consisting of s (x of these) and f (n – x of these) values. In the
formula the number 3 is just replaced by n i.e.
$P(X = x) = {}^{n}C_{x}\, p^{x} q^{n-x}$, for x = 0, 1, ..., n.
Examples
1) As in the previous examples, let T be the random variable representing the number of tails
when a coin is flipped 3 times. Then T ~ Bin(3 , 0.5). Using the formula above with n = 3
and p = 0.5, we can calculate the probability of exactly 2 tails as:
P(T = 2) = 3C2 (0.5)² (0.5)¹ = 0.375.
a) 3 answers correctly?
b) 7 answers correctly?
c) fewer than 3 answers correctly?
d) at least 5 answers correctly?
Solution:
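a) P(X = 3) = f(3) = 10C3 (0.2)³ (0.8)⁷ = 0.2013
c) P(X < 3) = P(X ≤ 2) = f(0) + f(1) + f(2) = 0.1074 + 0.2684 + 0.3020 = 0.6778
d) P(X ≥ 5) = 1 – P(X ≤ 4) = 1 – 0.9672 = 0.0328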
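The answers to this example can be checked with a few lines of Python. The sketch assumes X ~ Bin(10, 0.2), as implied by the calculation in part (a); the helper names binom_pmf and binom_cdf are illustrative and not part of the notes.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_cdf(x, n, p):
    """P(X <= x), obtained by summing the pmf from 0 to x."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

n, p = 10, 0.2  # assumed from part (a): 10 questions, P(correct) = 0.2

p_a = binom_pmf(3, n, p)        # exactly 3 correct
p_b = binom_pmf(7, n, p)        # exactly 7 correct
p_c = binom_cdf(2, n, p)        # fewer than 3, i.e. at most 2
p_d = 1 - binom_cdf(4, n, p)    # at least 5, i.e. not at most 4

for label, value in zip("abcd", (p_a, p_b, p_c, p_d)):
    print(label, round(value, 4))
```

Parts (c) and (d) show the general pattern: a “fewer than” or “at least” question is first rewritten in terms of “less than or equal to”.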
Notice that the calculations needed in parts (c) and (d) of the previous example are time
consuming. Instead of using the pdf f(x) to solve the problems, the CDF F(x) can be used.
Values for the CDF are found in the Cumulative Binomial Distribution tables at the end of
the notes (Table A).
There are several tables – one for each different value of n. The first column gives the value
of n while the second column gives the possible values that the random variable X can take
on. The top row gives common values of p.
Remember: These tables give cumulative probabilities so situations that involve the
“<”, “>” and “≥” signs must be adjusted so that they are in a form that uses the “≤”
sign i.e. a “less than or equal to” situation.
Examples
1) Suppose X ~ Bin(12 , 0.6). Find the probability that X is less than, or equal to, 5.
Part (d): What is the probability that the student chooses at least 5 answers
correctly?
P(X ≥ 5) = 1 – P(X ≤ 4) = 1 – 0.9672 = 0.0328
Example
Note: A Binomial random variable with n=1 is simply a Bernoulli trial and is sometimes
referred to as a Bernoulli distribution.
Consider a bowl with N marbles of which Np are blue and Nq red, where p + q = 1. If
sampling is done with replacement and drawing a blue marble is labeled “success” (a red
marble is labeled “failure”), then P(success) = Np/N = p and P(failure) = Nq/N = q. If
P(x blue marbles in n draws) is required and sampling is with replacement, the binomial
formula will still apply. If sampling is without replacement, P(success) is no longer
constant (assumption 4 of binomial experiment is violated) and the binomial formula
will no longer apply for calculating the abovementioned probability. In such a case
Example
A bowl contains 10 blue and 7 red marbles. Four (4) marbles are drawn at random from the
bowl. Calculate the probability of
(a) two
(b) at least 3
blue marbles drawn when sampling is done
1) with replacement.
2) without replacement.
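1a) $P(X = 2) = {}^{4}C_{2} \left(\frac{10}{17}\right)^{2} \left(\frac{7}{17}\right)^{2} = 0.352$.
1b) $P(X \ge 3) = P(X = 3) + P(X = 4) = 0.335 + 0.120 = 0.455$.
2a) $P(X = 2) = \frac{{}^{10}C_{2} \times {}^{7}C_{2}}{{}^{17}C_{4}} = \frac{45 \times 21}{2380} = 0.397$.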
2b)
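The probabilities in this example can be verified with the Python sketch below. It assumes the binomial formula for sampling with replacement and the counting formula used in 2a (often called the hypergeometric formula) for sampling without replacement; the function names are hypothetical.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(x successes in n independent draws), sampling with replacement."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def hypergeom_pmf(x, n, successes, total):
    """P(x successes in n draws), sampling without replacement."""
    return comb(successes, x) * comb(total - successes, n - x) / comb(total, n)

blue, red, draws = 10, 7, 4
total = blue + red
p = blue / total  # P(blue) on any single draw with replacement

with_2 = binom_pmf(2, draws, p)                                   # 1a
with_3plus = binom_pmf(3, draws, p) + binom_pmf(4, draws, p)      # 1b
without_2 = hypergeom_pmf(2, draws, blue, total)                  # 2a
without_3plus = (hypergeom_pmf(3, draws, blue, total)
                 + hypergeom_pmf(4, draws, blue, total))          # 2b

print(round(with_2, 3), round(with_3plus, 3))
print(round(without_2, 3), round(without_3plus, 3))
```

Comparing the two rows of output shows how much the answer changes when the marbles are not returned to the bowl.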
Examples
1) The number of bad cheques presented for daily payment at a bank.
2) The number of road deaths per month.
3) The number of bacteria in a given culture.
4) The number of defects per square meter on metal sheets being manufactured.
5) The number of mistakes per typewritten page.
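PDF
The probability that x events occur in a given interval of time/space is given by
$f(x) = P(X = x) = \frac{e^{-\mu} \mu^{x}}{x!}$, for x = 0, 1, 2, …, where µ is the mean number of events in that interval.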
Examples
1) A secretary claims an average mistake rate of 1 per page. A sample page is selected
at random and 5 mistakes found. What is the probability of her making 5 or more
mistakes if her claim of 1 mistake per page on average is correct?
Solution:
In this case μ = 1 is claimed and the number of mistakes X is at least 5. If the claim is true,
$P(X \ge 5) = 1 - P(X \le 4)$
$= 1 - \left( e^{-1} + e^{-1} + \frac{e^{-1}}{2!} + \frac{e^{-1}}{3!} + \frac{e^{-1}}{4!} \right)$
$= 1 - 0.9963$
$= 0.0037$.
The above calculation shows that if the claim of 1 mistake per page on average is true,
there is only a 37 in 10 000 chance of getting 5 or more mistakes per page. This remote
chance of 5 or more mistakes when an average of 1 mistake per page is true casts doubt
on whether the claim of 1 mistake per page on average is in fact true.
2) At a particular restaurant 4 plates are broken, on average, each week. What is the
probability that
a) 2 plates are broken next week?
b) at most 4 plates are broken next week?
c) more than 3 plates are broken next week?
Solution:
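b) $P(X \le 4) = \frac{e^{-4} 4^{0}}{0!} + \frac{e^{-4} 4^{1}}{1!} + \frac{e^{-4} 4^{2}}{2!} + \frac{e^{-4} 4^{3}}{3!} + \frac{e^{-4} 4^{4}}{4!} = 0.6288$
c) $P(X > 3) = 1 - \frac{e^{-4} 4^{0}}{0!} - \frac{e^{-4} 4^{1}}{1!} - \frac{e^{-4} 4^{2}}{2!} - \frac{e^{-4} 4^{3}}{3!} = 0.5665$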
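The three parts of this example can be reproduced with a short Python sketch. It assumes the number of broken plates per week is Poisson with µ = 4; poisson_pmf and poisson_cdf are illustrative helper names.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """P(X = x) for a Poisson random variable with mean mu."""
    return exp(-mu) * mu**x / factorial(x)

def poisson_cdf(x, mu):
    """P(X <= x), obtained by summing the pmf from 0 to x."""
    return sum(poisson_pmf(k, mu) for k in range(x + 1))

mu = 4  # average number of plates broken per week

p_a = poisson_pmf(2, mu)        # exactly 2 plates
p_b = poisson_cdf(4, mu)        # at most 4 plates
p_c = 1 - poisson_cdf(3, mu)    # more than 3 plates

print(round(p_a, 4), round(p_b, 4), round(p_c, 4))
```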
Notice that the calculations needed in parts (b) and (c) of the previous example are time
consuming. Instead of using the pdf f(x) to solve the problems, the CDF F(x) can be used.
Values for the CDF are found in the Cumulative Poisson Distribution table at the end of the
notes (Table B).
The top row gives some values for µ and the first column gives some values that the Poisson
random variable X can take on. The cumulative probabilities F(x) = P(X ≤ x) can be found by
lining up the relevant row and column.
Reminder: As with the Cumulative Binomial Distribution tables, these tables give
cumulative probabilities so situations that involve the “<”, “>” and “≥” signs must be
adjusted so that they are in a form that uses the “≤” sign i.e. a “less than or equal to”
situation.
Example 2
Part (b): Step 1 – Find µ=4 in the top row of the table.
Step 2 – Find x=4 in the first column.
Step 3 – Line up the column and row.
At the intersection of the row and column is the value F(4) = P(X ≤ 4) = 0.6288.
The Poisson random variable can also be seen as an approximation to a binomial random
variable with the number of trials (n) large and the probability of success (p) small such that
the mean μ = np is of moderate size. This approximation is good when n ≥ 20 and p ≤ 0.05
or n ≥ 100 and np ≤ 10 .
Example
A life insurance company has found that the probability is 0.000015 that a person aged 40-
50 will die from a certain rare disease. If the company has 100 000 policy holders in this age
group, what is the probability that this company will have to pay out 4 claims or more
because of death from this disease?
Solution:
For the following reasons a binomial distribution with n = 100 000 and p = 0.000015 is
reasonable in this case.
3 The death or not from this disease of one person does not affect that of another
person.
The Poisson distribution with µ = 100 000×(0.000015) = 1.5 can be used to approximate this
probability.
P(X ≥ 4) = 1 – P(X ≤ 3)
= 1 – 0.9344
= 0.0656.
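How close the Poisson approximation is in this example can be seen by computing the exact binomial probability as well. The Python sketch below does both; it is only an illustration of the comparison, with variable names chosen for this example.

```python
from math import comb, exp, factorial

n, p = 100_000, 0.000015
mu = n * p  # mean of the approximating Poisson distribution (1.5)

# P(X <= 3) under the exact binomial model and under the Poisson approximation
binom_le3 = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(4))
poisson_le3 = sum(exp(-mu) * mu**k / factorial(k) for k in range(4))

exact = 1 - binom_le3      # P(X >= 4), binomial
approx = 1 - poisson_le3   # P(X >= 4), Poisson

print(round(exact, 4), round(approx, 4))
```

With n this large and p this small, the two answers agree to several decimal places, which is why the table-based Poisson calculation is acceptable here.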
• The mean and variance of the Poisson distribution are given by E(X) = µ and
var(X) = µ.
• In the case of the Poisson approximation to the binomial distribution
E(X) = var(X) = np
standard deviation = √(np).
Example
Calls arrive at a switchboard at an average rate of 1 every 15 seconds. What is the probability
of not more than 5 calls arriving during a particular minute?
Solution:
A mean rate of 1 every 15 seconds is equivalent to a mean rate of 4 every minute. Since the
question concerns an interval of 1 minute, µ = 4 (not µ = 1).
$P(X \le 5) = e^{-4} + \frac{4^{1} e^{-4}}{1!} + \frac{4^{2} e^{-4}}{2!} + \frac{4^{3} e^{-4}}{3!} + \frac{4^{4} e^{-4}}{4!} + \frac{4^{5} e^{-4}}{5!} = 0.7851$
The clustering pattern of the values of X over the possible values in the interval is described
by a mathematical function f(x) called the probability density function (pdf). A high (low)
clustering of values will result in high (low) values of this function. For a continuous random
variable X, only probabilities associated with ranges of values (e.g. an interval of values from
a to b) will be calculated. The probability that the value of X will fall between the values a
and b is given by the area between a and b under the curve describing the probability
density function f(x). For any probability density function the total area under the graph of
f(x) is 1.
The constants µ and σ can be shown to be the mean and standard deviation respectively
of X. These constants completely specify the density function. A graph of the curve
describing the probability density function (known as the normal curve) for the case µ = 0 and
σ = 1 is shown on the following page.
[Figure: the normal curve for µ = 0 and σ = 1; horizontal axis z (from –4 to 4), vertical axis p(z) (from 0 to 0.45).]
An increase (decrease) in the mean µ results in a shift of the graph to the right (left). An
increase (decrease) in the standard deviation σ results in the graph becoming more (less)
spread out e.g. compare the curves of the distributions with σ² = 0.2, 0.5, 1 and 5 in the
previous diagram.
[Figure: histogram of the marks; horizontal axis mark (class limits 15 to 90, then “More”), vertical axis freq (0 to 1000).]
The histogram of the marks has an appearance that can be described by a normal curve i.e.
it has a symmetric, bell-shaped appearance. The mean of the marks is 51.95 and the
standard deviation 10.
table for each possible mean and standard deviation. This problem is overcome by
transforming X, the normal random variable of interest [X ~ N(µ; σ²)], to a standardized
normal random variable
$Z = \frac{X - \mu}{\sigma}$.
It can be shown that the transformed random variable is normally distributed with µ = 0 and
σ = 1 i.e. Z ~ N(0; 1). The random variable Z can be transformed back to X by using the
formula
X = µ + Zσ .
The normal distribution with mean µ = 0 and standard deviation σ = 1 is called the standard
normal distribution. The symbol Z is reserved for a random variable with this distribution.
The graph of the standard normal distribution appears below.
Various areas under the above normal curve are shown. The standard normal table gives the
area under the curve to the left of the value z. Other types of areas can be found by
combining several of the areas as shown in the next examples.
Note
• For negative values of z less than the minimum value (– 3.79) in the table, the
probabilities are taken as 0 i.e. P(Z ≤ z) = 0 for z < – 3.79.
• For positive values of z greater than the maximum value (3.79) in the table, the
probabilities are taken as 1 i.e. P(Z ≤ z) = 1 for z > 3.79.
Examples
In all the examples that follow, Z ~ N(0; 1).
c) P( – 0.47 < Z < 1.35) = P(Z < 1.35) – P(Z < – 0.47)
= 0.9115 – 0.3192
= 0.5923
In all the above examples an area was found for a given value of z. It is also possible to find a
value of z when an area to its left is given. This can be written as P(Z ≤ zα) = α (α is the Greek
letter for “a” and is pronounced “alpha”). In this case zα has to be found where α is the area
to its left.
Examples
Search the body of the table for the required area (0.0344) and then read off the
value of z corresponding to this area. In this case z0.0344 = – 1.82.
Finding 0.975 in the body of the table and reading off the z value gives z0.975 = 1.96.
When searching the body of the table for 0.95 this value is not found. The z value
corresponding to 0.95 can be estimated from the following information obtained
from the table.
z area to left
1.64 0.9495
? 0.95
1.65 0.9505
Since the required area (0.95) is halfway between the 2 areas obtained from the
table, the required z can be taken as the value halfway between the two z values
that were obtained from the table i.e. $z = \frac{1.64 + 1.65}{2} = 1.645$.
Exercise: Using the same approach as above, verify that the z value corresponding to
an area of 0.05 to its left is –1.645.
When searching the body of the table this area is not found. The following
information can be found.
z area to left
2.14 0.9838
? 0.9841
2.15 0.9842
The area required is not midway between the 2 other areas so the z-value
corresponding to the closer area is used i.e. z = 2.15 is used.
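The table look-ups in these examples can be cross-checked in Python. The sketch below uses NormalDist from the standard library's statistics module (available from Python 3.8); it is offered as a checking aid and is not a replacement for the table method described in the notes.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Areas to the left used in the examples above
areas = (0.0344, 0.975, 0.95, 0.05, 0.9841)

# inv_cdf returns the z value that has the given area to its left
z_values = {area: std_normal.inv_cdf(area) for area in areas}

for area, z in z_values.items():
    print(f"area to left = {area:<6}  z = {z:.3f}")
```

Small differences from the table answers are expected, because the table lists z only to two decimal places.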
At the bottom of the standard normal table selected percentiles zα are given for different
values of α. This means that the area under the normal curve to the left of zα is α.
Examples:
1 α = 0.900, zα = 1.282
means P(Z < 1.282) = 0.900.
2 α = 0.995, zα = 2.576
means P(Z < 2.576) = 0.995.
3 α = 0.005, zα = – 2.576
means P(Z < – 2.576) = 0.005.
The standard normal distribution is symmetric with respect to the mean = 0. From this it
follows that the area under the normal curve to the right of a positive z entry in the
standard normal table is the same as the area to the left of the associated negative entry
(– z) i.e.
P(Z ≥ z) = P(Z ≤ – z) .
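Let X be a N(μ ; σ²) random variable and Z a N(0 ; 1) random variable. Then
$P(X \le x) = P\left(\frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma}\right) = P\left(Z \le \frac{x - \mu}{\sigma}\right)$
$P(a \le X \le b) = P\left(\frac{a - \mu}{\sigma} \le \frac{X - \mu}{\sigma} \le \frac{b - \mu}{\sigma}\right) = P\left(\frac{a - \mu}{\sigma} \le Z \le \frac{b - \mu}{\sigma}\right)$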
Example 1
The height H (in inches) of a population of women is approximately normally distributed
with a mean of µ = 63.5 and a standard deviation of σ = 2.75 inches. To calculate the
probability that a woman is less than 63 inches tall, we first find the z-value that is
associated with h = 63 inches. (This z-value is sometimes referred to as the z-score.)
$z = \frac{63 - 63.5}{2.75} = -0.18$
Example 2
The length X (inches) of sardines is a N(4.62 ; 0.0529) random variable. What proportion of
sardines is
(a) longer than 5 inches? (b) between 4.35 and 4.85 inches?
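(a) $P(X > 5) = P\left(Z > \frac{5 - 4.62}{0.23}\right)$
= P(Z > 1.65)
= 1 – P(Z ≤ 1.65)
= 1 – 0.9505
= 0.0495.
(b) $P(4.35 \le X \le 4.85) = P\left(\frac{4.35 - 4.62}{0.23} \le Z \le \frac{4.85 - 4.62}{0.23}\right)$
= P(– 1.17 ≤ Z ≤ 1)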
= P(Z ≤ 1) – P(Z ≤ –1.17)
= 0.8413 – 0.1210
= 0.7203.
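Example 2 can be checked without a table by working directly with the N(4.62 ; 0.0529) distribution. The Python sketch below uses NormalDist from the standard statistics module; note that it takes the standard deviation (the square root of the variance) as its second argument.

```python
from statistics import NormalDist

# Length of sardines: mean 4.62, variance 0.0529, so sigma = sqrt(0.0529) = 0.23
length = NormalDist(mu=4.62, sigma=0.0529 ** 0.5)

p_longer_than_5 = 1 - length.cdf(5)                   # part (a)
p_between = length.cdf(4.85) - length.cdf(4.35)       # part (b)

print(round(p_longer_than_5, 4), round(p_between, 4))
```

The results differ slightly from the table answers in the third or fourth decimal place because the table method first rounds z to two decimals.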
The standard normal table can be used to find percentiles for random variables which are
normally distributed.
Example
The marks M obtained in a mathematics entrance examination are normally distributed with
µ = 514 and σ = 113 . Find the mark that is the 80th percentile.
From the standard normal table, the z-score which is closest to an entry of 0.80 in
the body of the table is 0.84 (the actual area to its left is 0.7995). The mark which
corresponds to a z-score of 0.84 can be found by solving
$0.84 = \frac{m - 514}{113}$
for m. This yields m = 514 + 0.84 × 113 = 608.92 i.e. a mark of approximately 609 is better than 80% of
all other exam marks.
Chapter 6 – Sampling distributions
6.1 Definitions
• A sampling distribution arises when repeated samples of the same size are drawn
from a particular population (distribution) and a statistic (numerical measure of
description of sample data) is calculated for each sample. The interest is then
focused on the probability distribution (called the sampling distribution) of the
statistic.
Example
Suppose all possible samples of size 2 are drawn with replacement from a population with
sample space S = {2, 4, 6, 8} and the mean calculated for each sample.
The different values that can be obtained and their corresponding means are shown in the
table below.
                 2nd value
               2    4    6    8
1st value  2   2    3    4    5
           4   3    4    5    6
           6   4    5    6    7
           8   5    6    7    8
In the above table the row and column entries indicate the two values in the sample (16
possibilities when combining rows and columns). The mean is located in the cell
corresponding to these entries e.g. 1st value = 4, 2nd value = 6 has a mean entry of (4 + 6)/2 = 5.
Assuming that random sampling is used, all the mean values in the above table are equally
likely. Under this assumption the following distribution can be constructed for these mean
values.
x̄            2     3     4     5     6     7     8     sum
count        1     2     3     4     3     2     1     16
P(X̄ = x̄)    1/16  1/8   3/16  1/4   3/16  1/8   1/16  1
The above distribution is referred to as the sampling distribution of the mean for random
samples of size 2 drawn from this distribution.
The mean and variance of the population from which these samples are drawn are
µ = 5
and
$\sigma^2 = \left[\sum x^2 - \left(\sum x\right)^2 / N\right] \div N = \tfrac{1}{4}\left(2^2 + 4^2 + 6^2 + 8^2 - 20^2/4\right) = 5$.
Consider a population with mean µ and variance σ². It can be shown that the mean and
variance of the sampling distribution of the mean, based on a random sample of size n, are
given by
$\mu_{\bar{X}} = \mu$ and $\sigma^2_{\bar{X}} = \sigma^2 / n$.
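These two results can be verified for the population {2, 4, 6, 8} by listing all 16 samples of size 2. The Python sketch below does this by brute force; exact fractions are used so that no rounding is involved, and the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [2, 4, 6, 8]
n = 2  # sample size

# All ordered samples of size n drawn with replacement (4 x 4 = 16 of them)
samples = list(product(population, repeat=n))
means = [Fraction(sum(s), n) for s in samples]
counts = Counter(means)  # how often each sample mean occurs

# Population mean and variance
mu = Fraction(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

# Mean and variance of the sampling distribution of the mean
mean_of_means = sum(means) / len(means)
var_of_means = sum((m - mean_of_means) ** 2 for m in means) / len(means)

print("counts:", sorted(counts.items()))
print("mu =", mu, " sigma^2 =", sigma2)
print("mean of sample means =", mean_of_means, " variance =", var_of_means)
```

The variance of the sample means comes out as half the population variance, in agreement with σ²/n for n = 2.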
Sampling distributions can involve different statistics (e.g. sample mean, sample proportion,
sample variance) calculated from different sample sizes drawn from different distributions.
Some of the important results from statistical theory concerning sampling distributions are
summarized in the sections that follow.
Note 1:
Since
$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$
$\bar{X} \sim N\left(\mu; \frac{\sigma^2}{n}\right)$ or $Z = \frac{\bar{X} - \mu}{\sqrt{\sigma^2/n}} \sim N(0; 1)$
Note 2:
The value of n for which this theorem is valid depends on the distribution from
which the sample is drawn. If the sample is drawn from a normal population, the
theorem is valid for all n. If the distribution from which the sample is drawn is fairly
close to being normal, a value of n > 30 will suffice for the theorem to be valid. If the
distribution from which the sample is drawn is substantially different from a normal
distribution e.g. positively or negatively skewed, a value of n much larger than 30 will
be needed for the theorem to be valid.
Note 3:
Suppose the underlying distribution is a Bernoulli distribution with probability p of
success and probability q = 1–p of failure. The mean and variance for this distribution
are µ = p and σ2 = pq.
In this case
$\hat{P} = \frac{\sum_{i=1}^{n} x_i}{n}$
where $\hat{P}$ is the proportion of successes in the sample and can be seen as an estimate
of the proportion of successes in the population (the distribution from which the
sample is drawn).
Then, according to the CLT,
$\hat{P} \sim N\left(p; \frac{pq}{n}\right)$ or $Z = \frac{\hat{P} - p}{\sqrt{pq/n}} \sim N(0; 1)$
Example 1:
An electric firm manufactures light bulbs whose lifetime (in hours) follows a normal
distribution with mean 800 and variance 1600. A random sample of 10 light bulbs is drawn
and the lifetime recorded for each light bulb. Calculate the probability that the mean of this
sample is (a) less than 785 hours, (b) more than 820 hours.
(a) $P(\bar{X} < 785) = P\left(Z < \frac{785 - 800}{40/\sqrt{10}}\right)$
$= P(Z < -1.19)$
$= 0.117$
(b) $P(\bar{X} > 820) = P\left(Z > \frac{820 - 800}{40/\sqrt{10}}\right)$
$= P(Z > 1.58)$
$= 1 - 0.9429$
$= 0.0571$
Example 2
Suppose that 40% of all adults in Durban support an increase in the state sales tax from 5%
to 6% provided that the additional revenue goes to education. A survey of 140 adults
randomly selected from Durban was done and the participants were asked if they support the
increase. What is the probability that more than half the sample support the increase?
Example 3
It is known that 73% of secretaries in RSA know how to touch-type. A sample of 1800
secretaries is taken. What is the probability that the sample proportion of secretaries who
can touch-type differs from the national proportion by no more than 3%?
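Examples 2 and 3 can be approached with the normal approximation for a sample proportion given in Note 3. The following Python sketch is one possible way to do the arithmetic; it assumes no continuity correction is applied, and the helper name proportion_z is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()  # Z ~ N(0; 1)

def proportion_z(p_hat, p, n):
    """z-score of a sample proportion p_hat when the population proportion is p."""
    return (p_hat - p) / sqrt(p * (1 - p) / n)

# Example 2: p = 0.40, n = 140, more than half the sample in favour
z2 = proportion_z(0.50, 0.40, 140)
p_example2 = 1 - std_normal.cdf(z2)

# Example 3: p = 0.73, n = 1800, sample proportion within 0.03 of p
z3 = proportion_z(0.76, 0.73, 1800)
p_example3 = std_normal.cdf(z3) - std_normal.cdf(-z3)

print(round(z2, 2), round(p_example2, 4))
print(round(z3, 2), round(p_example3, 4))
```

With the printed z-scores in hand, the same probabilities can also be read from the standard normal table.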
So the t-distribution is similar to the standard normal distribution. For small sample sizes it
shows more variability than the standard normal distribution i.e. its curves are flatter in
appearance with thicker tails. As the sample size increases, the t-distribution approaches
the standard normal distribution and for n>30 the differences are negligible.
The graph below shows how the t-distribution changes for different values of r (the degrees
of freedom).
The t-distribution was first proposed in 1908 in a paper by William Gosset, who wrote under the
pseudonym “Student”; it is therefore also referred to as Student’s t-distribution.
The row entry (ν ) gives the degrees of freedom (df) and the column entry (α) gives the area
under the t-curve to the left of the value that appears in the table at the intersection of the
row and column entry.
Notation: tν ;α denotes the t-value that has an area of α to the left where the df for the
t-distribution are ν.
Examples
1. For df = 2 and α = 0.995 the entry is t2 ; 0.995 = 9.925. This means that for the t-
distribution with 2 degrees of freedom
P(t ≤ 9.925) = 0.995.
2. For df = ∞ and α = 0.95 the entry is t∞ ; 0.95= 1.645. This means that for the t-
distribution with ∞ degrees of freedom
P(t ≤ 1.645) = 0.95.
When a t-value that has an area less than 0.5 to its left is to be looked up, the fact that the
t-distribution is symmetrical around 0 is used i.e.,
$t_{\nu;\,\alpha} = -t_{\nu;\,1-\alpha}$.
Examples
3. For df = ν = 10 and α = 0.10 the value t10 ; 0.10 such that P(t < t10 ; 0.10 ) = 0.10 is found
from
t10 ; 0.1 = – t10 ; 0.9 = – 1.372
Note that the percentile values in the last row of the t-distribution are identical to the
corresponding percentile entries in the standard normal table. Since the t-distribution for
large samples (degrees of freedom) is the same as the standard normal distribution, their
percentiles should be the same.
observed frequency f1 f2 .. fk
expected frequency e1 e2 .. ek
The quantity $\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$
can be shown to follow a chi-square distribution with
The chi-square curve is different for each value of degrees of freedom. The diagram on the
following page shows how the chi-square distribution changes for different values of ν (the
degrees of freedom).
Unlike the normal and t-distributions, the chi-square distribution is only defined for positive
values and is not a symmetrical distribution. As the degrees of freedom increase, the
chi-square distribution becomes more and more symmetrical. For sufficiently large degrees of
freedom the chi-square distribution approaches the normal distribution.
The row entry (ν ) gives the degrees of freedom and the column entry (α) gives the area
under the chi-square curve to the left of the value that appears in the table at the
intersection of the row and column entry.
Notation: $\chi^2_{16;\,0.09}$ denotes the chi-square value where df = 16 and the area to its left is 0.09.
Examples:
2) For df = 30 and α = 0.995 the entry is 53.67 i.e. $\chi^2_{30;\,0.995} = 53.67$. This means that
for the chi-square distribution with 30 degrees of freedom
$P(\chi^2 \le 53.67) = 0.995$.
3) For df = 6 and α = 0.95 the entry is 12.59 i.e. $\chi^2_{6;\,0.95} = 12.59$. This means that for
the chi-square distribution with 6 degrees of freedom
$P(\chi^2 \le 12.59) = 0.95$ or $P(\chi^2 > 12.59) = 0.05$.
Random samples of sizes n1 and n2 (sometimes m is used instead of n2) are drawn from
independent normally distributed populations that are labeled 1 and 2 respectively. Denote
the variances calculated from these samples by $S_1^2$ and $S_2^2$ respectively and their
corresponding population variances by $\sigma_1^2$ and $\sigma_2^2$ respectively. The ratio
$F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}$
is distributed according to an F-distribution (named after the famous statistician R.A. Fisher)
with numerator degrees of freedom df1 = n1 − 1 and denominator degrees of freedom
df2 = n2 − 1. When $\sigma_1^2 = \sigma_2^2$ the F-ratio is $F = \frac{S_1^2}{S_2^2}$.
Notation: F(df1 ,df2) or Fdf1 , df2 are used to denote the F-distribution with df1 numerator
degrees of freedom and df2 denominator degrees of freedom.
The F-distribution is positively skewed, and the F-values can only be positive. For each
combination of df1 and df2 there is a different F-distribution. The diagram below shows plots
for a number of F-distributions (F-curves) with $\sigma_1^2 = \sigma_2^2$.
Three important distributions are closely related to the F-distribution. The square of a standard
normal variable follows an F(1, ∞) distribution, the square of a t-variable with n2 degrees of
freedom follows an F(1, n2) distribution, and a chi-square variable with n1 degrees of freedom,
divided by n1, follows an F(n1, ∞) distribution.
In Table F (at the end of your notes) the 90, 95, 97.5 and 99 percentage points of the
F-distribution are given. The tables are laid out as follows:
The top row gives some values for the numerator df and the first column gives some values
for the denominator df. The value at the intersection of a row and a column is the value that
has 100α% of the area under the curve to its left i.e. P(F < Fdf1 , df2 ; α) = α. There are tables
for α = 0.90, α = 0.95, α = 0.975 and α = 0.99.
Examples
1) For the F(3,26) distribution, 2.98 has an area (under its curve) of α = 0.95 to its left and
1 – α to its right (see graph below), i.e. F3, 26 ; 0.95 = 2.98
P(F < 2.98) = 0.95
3) Using df1 = 4 and df2 = 7, the value of F that has 1% of the area to the right of it is 7.85.
Only upper tail values (those with large areas below and small areas above) can be read off
from the F-tables. Lower tail values can be calculated from the formula:
$F(df_1, df_2; \alpha) = \frac{1}{F(df_2, df_1; 1 - \alpha)}$
Examples
1) Find the value such that 2.5% of the area under the F(7,5) curve is to the left of it.
$F(7, 5; 0.025) = \frac{1}{F(5, 7; 0.975)} = \frac{1}{5.29} = 0.189$
2) Find the value such that 1% of the area under the F(10,9) curve is to the left of it.
$F(10, 9; 0.01) = \frac{1}{F(9, 10; 0.99)} = \frac{1}{4.94} = 0.2024$
Chapter 7 – Statistical Inference: Estimation for one sample case
Examples
1.) The government of a country wants to estimate the proportion of voters (p) in the
country that approve of their economic policies.
2.) A manufacturer of car batteries wishes to estimate the average lifetime (µ) of their
batteries.
3.) A paint company is interested in estimating the variability (as measured by the
variance, σ2) in the drying time of their paints.
The quantities p, µ and σ2 that are to be estimated are called population parameters.
A sample estimate of a population parameter is called a statistic. The table below gives
examples of some commonly used parameters together with their statistics.
Parameter   Statistic
p           p̂
µ           x̄
σ²          S²
Examples
Suppose the mean time it takes to serve customers at a supermarket checkout counter is to
be estimated.
1) The mean service time of 100 customers of (say) x̄ = 2.283 minutes is an example of
a point estimate of the parameter µ.
2) If it is stated that the probability is 0.95 (95% chance) that the mean service time will
be from 1.637 minutes to 4.009 minutes, the interval of values (1.637, 4.009) is an
interval estimate of the parameter μ.
The estimation approaches discussed will focus mainly on the interval estimate approach.
Θ – pronounced “theta”.
1 – α is called the confidence coefficient. It is the probability that the confidence interval
will contain θ, the parameter that is being estimated.
100(1 – α) is called the confidence percentage.
Example
Consider example 2 of the previous section.
θ, the parameter that is being estimated, is the population mean µ .
α=0.05
The confidence coefficient is (1–α ) = 0.95
The confidence percentage is 100(1–α ) = 95.
L = 1.637 U = 4.009
The confidence interval is the interval (1.637, 4.009).
In the sections that follow the determination of L and U when estimating the parameters µ, p
and σ2 will be discussed.
The determination of the confidence limits is based on the central limit theorem. This
theorem states that for sufficiently large samples
$\bar{X} \sim N\left(\mu; \frac{\sigma^2}{n}\right)$ or $Z = \frac{\bar{X} - \mu}{\sqrt{\sigma^2/n}} \sim N(0; 1)$
Formulae for the lower and upper confidence limits can be constructed in the following way
(using a confidence coefficient of 0.95 as an example).
$P\left(-1.96 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95$
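By a few steps of mathematical manipulation (not shown here), the above part in brackets
can be changed to have only the parameter µ between the inequality signs. This will give
$P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$.
Let $L = \bar{X} - 1.96\frac{\sigma}{\sqrt{n}}$ and $U = \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$.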
Then the above formula can be written as P(L ≤ µ ≤ U) = 0.95.
Since both L and U are determined by the sample values (which determine X̄), they
(and the confidence interval) will change for different samples. Since the parameter
µ that is being estimated remains constant, these intervals will either include or
exclude µ. The central limit theorem states that such intervals will include the
parameter µ with probability 0.95 (95 out of 100 times).
$\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}}\,;\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$ or $\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}$
The percentage of confidence associated with the interval is determined by the value (called
the z – multiplier) obtained from the standard normal distribution. In the above formula a z-
multiplier of 1.96 determines a 95% confidence interval.
confidence percentage 99 95 90
z-multiplier 2.576 1.96 1.645
α 0.01 0.05 0.10
Example
The actual content of cool drink in a 500 milliliter bottle is known to vary. The standard
deviation is known to be 5 milliliters. Thirty (30) of these 500 milliliter bottles were selected
at random and their mean content was found to be 498.5 milliliters. Calculate 95% and 99%
confidence intervals for the population mean content of these bottles.
Solution:
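One way to carry out this calculation is sketched below in Python, assuming the interval x̄ ± z-multiplier × σ/√n with the z-multipliers 1.96 and 2.576 from the table above; the function name z_interval is illustrative.

```python
from math import sqrt

def z_interval(x_bar, sigma, n, z_multiplier):
    """Confidence interval for the mean when sigma is known."""
    margin = z_multiplier * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

x_bar, sigma, n = 498.5, 5, 30

ci_95 = z_interval(x_bar, sigma, n, 1.96)    # 95% confidence
ci_99 = z_interval(x_bar, sigma, n, 2.576)   # 99% confidence

print("95%:", tuple(round(v, 2) for v in ci_95))
print("99%:", tuple(round(v, 2) for v in ci_99))
```

The 99% interval is wider than the 95% interval, since a larger multiplier is needed to achieve the higher confidence.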
$t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$ follows a t-distribution with
degrees of freedom = df = n – 1.
The confidence interval formula used in the previous section is modified by replacing the
z-multiplier by the t-multiplier that is looked up from the t-distribution.
Example
The time (in seconds) taken to complete a simple task was recorded for each of 15 randomly
selected employees at a certain company. The values are given below.
38.2 43.9 38.4 26.2 41.3 42.3 37.5 37.2 41.2 42.3 31 50.1 37.3 36.7 31.8
Calculate 95% and 99% confidence intervals for the population mean time it takes to
complete this task.
Solution:
n = 15 (given) , x̄ = 38.36, S = 5.78 (Calculated from the data)
Substituting x̄ = 38.36, n = 15, S = 5.78, t = 2.145 into the above formula gives
Substituting x̄ = 38.36, n = 15, S = 5.78, t = 2.977 into the above formula gives
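The two substitutions can be worked through in Python as sketched below. The 15 times are entered as they appear to read in the listing above (an assumption about the original layout, although they do reproduce the stated mean of 38.36 and S of 5.78); the t-multipliers 2.145 and 2.977 are the values quoted in the text, and t_interval is an illustrative name.

```python
from math import sqrt
from statistics import mean, stdev

# Task completion times in seconds (assumed reading of the data listing)
times = [38.2, 43.9, 38.4, 26.2, 41.3, 42.3, 37.5, 37.2,
         41.2, 42.3, 31, 50.1, 37.3, 36.7, 31.8]

n = len(times)
x_bar = mean(times)   # sample mean
s = stdev(times)      # sample standard deviation (divisor n - 1)

def t_interval(x_bar, s, n, t_multiplier):
    """Confidence interval for the mean when sigma is unknown."""
    margin = t_multiplier * s / sqrt(n)
    return x_bar - margin, x_bar + margin

ci_95 = t_interval(x_bar, s, n, 2.145)   # t-multiplier for 95%, df = 14
ci_99 = t_interval(x_bar, s, n, 2.977)   # t-multiplier for 99%, df = 14

print(n, round(x_bar, 2), round(s, 2))
print("95%:", tuple(round(v, 2) for v in ci_95))
print("99%:", tuple(round(v, 2) for v in ci_99))
```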
The formulae for the confidence interval of the population variance σ² are based on the fact
that $\frac{(n-1)S^2}{\sigma^2}$ follows a chi-square distribution with (n – 1) degrees of freedom. Let
$\chi^2(1 - \frac{\alpha}{2})$ and $\chi^2(\frac{\alpha}{2})$ [also written as $\chi^2_{1-\alpha/2}$ and $\chi^2_{\alpha/2}$ respectively] denote the $100(1 - \frac{\alpha}{2})$
and $\frac{100\alpha}{2}$ percentile points of the chi-square distribution with (n – 1) degrees of freedom.
$P\left(\chi^2_{\nu;\,\alpha/2} \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{\nu;\,1-\alpha/2}\right) = 1 - \alpha$
By a few steps of mathematical manipulation (not shown here), the above part in brackets
can be changed to have only the parameter σ² between the inequality signs. This will give
$P\left(\frac{(n-1)S^2}{\text{upper}} \le \sigma^2 \le \frac{(n-1)S^2}{\text{lower}}\right) = 1 - \alpha$
where
upper = $\chi^2_{\nu;\,1-\alpha/2}$, the larger of the 2 percentile points and
lower = $\chi^2_{\nu;\,\alpha/2}$, the smaller of the 2 percentile points.
The values of α and α/2 are calculated from the confidence percentage = 100(1 – α )
e.g. if confidence percentage = 95, α= 0.05 , α/2 = 0.025.
Step 3 : Confidence interval is $\left[\frac{(n-1)S^2}{\text{upper}}, \frac{(n-1)S^2}{\text{lower}}\right]$
Example
Calculate 90% and 95% confidence intervals for the population variance of the time taken to
complete the simple task (see previous example).
Solution:
The confidence interval is $\left(\frac{467.34}{23.68}, \frac{467.34}{6.57}\right) = (19.74, 71.13)$.
lower = $\chi^2(\frac{\alpha}{2}) = \chi^2(0.025) = 5.63$ for ν = 14.
The confidence interval is $\left(\frac{467.34}{26.12}, \frac{467.34}{5.63}\right) = (17.89, 83.01)$.
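The two intervals can be reproduced with the short Python sketch below. It takes (n – 1)S² = 467.34 and the chi-square percentile points exactly as they appear in the solution above (they are table values, not computed here); variance_interval is an illustrative name.

```python
def variance_interval(sum_sq, chi2_upper, chi2_lower):
    """Interval for sigma^2: divide (n - 1)S^2 by the larger and smaller percentile points."""
    return sum_sq / chi2_upper, sum_sq / chi2_lower

sum_sq = 467.34  # (n - 1) * S^2 with n = 15

# Percentile points for 14 degrees of freedom, as quoted in the solution
ci_90 = variance_interval(sum_sq, chi2_upper=23.68, chi2_lower=6.57)
ci_95 = variance_interval(sum_sq, chi2_upper=26.12, chi2_lower=5.63)

print("90%:", tuple(round(v, 2) for v in ci_90))
print("95%:", tuple(round(v, 2) for v in ci_95))
```

Note that the larger percentile point produces the lower confidence limit, because it appears in the denominator.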
The determination of the confidence limits for the population proportion of items labeled
“success” is based on the central limit theorem for the sample proportion. This theorem
states that for sufficiently large samples
$\hat{P} \sim N\left(p; \frac{pq}{n}\right)$ or $Z = \frac{\hat{P} - p}{\sqrt{pq/n}} \sim N(0; 1)$
Formulae for the lower and upper confidence limits can be constructed in the following
way.
Since Z ~ N(0,1)
$P\left(-1.96 \le \frac{\hat{P} - p}{\sqrt{pq/n}} \le 1.96\right) = 0.95$
By a few steps of mathematical manipulation (not shown here), the above part in brackets
can be changed to have the parameter p (in the numerator) between the inequality signs.
This will give
$P\left(\hat{P} - 1.96\sqrt{\frac{pq}{n}} \le p \le \hat{P} + 1.96\sqrt{\frac{pq}{n}}\right) = 0.95$
Since the confidence interval formula is based on a single sample, the random variable $\hat{P}$ is replaced by its
sample estimate $\hat{p} = x/n$ and the parameters p and q = 1 – p by their respective sample
estimates $\hat{p} = x/n$ and $\hat{q} = 1 - \hat{p}$.
This gives the following 95% confidence interval for p: $\left(\hat{p} - 1.96\sqrt{\hat{p}\hat{q}/n},\ \hat{p} + 1.96\sqrt{\hat{p}\hat{q}/n}\right)$.
If the confidence percentage is changed then the z-multiplier will change. The table below
gives some examples.
confidence percentage 99 95 90
z-multiplier 2.576 1.96 1.645
α 0.01 0.05 0.10
Example
During a marketing campaign for a new product 176 out of the 200 potential users of this
product that were contacted indicated that they would use it. Calculate a 90% confidence
interval for the proportion of potential users who would use this product.
Solution:
x = 176, n = 200 so $\hat{p} = \frac{176}{200} = 0.88$, $\hat{q} = 1 - \hat{p} = 0.12$.
Confidence interval is
$\left(0.88 \pm 1.645\sqrt{0.88 \times 0.12 / 200}\right) = (0.88 \pm 0.0378) = (0.842, 0.918)$.
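The same interval can be obtained with the Python sketch below, which assumes the large-sample interval p̂ ± z-multiplier × √(p̂q̂/n) and the 90% z-multiplier of 1.645; proportion_interval is a name invented for this example.

```python
from math import sqrt

def proportion_interval(x, n, z_multiplier):
    """Large-sample confidence interval for a population proportion."""
    p_hat = x / n
    margin = z_multiplier * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

low, high = proportion_interval(176, 200, 1.645)  # 90% confidence
print(round(low, 3), round(high, 3))
```

Changing the third argument to 1.96 or 2.576 gives the corresponding 95% or 99% interval.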
σ
x ± z-multiplier
n
σ
The quantity z-multiplier is known as the error (denoted by E).
n
The smaller the error, the more accurately the parameter μ is estimated.
Suppose the size of the error is specified in advance and the sample size n is determined to
achieve this accuracy. This can be done by solving for n from the equation
$E = z\text{-multiplier} \times \dfrac{\sigma}{\sqrt{n}}$, which gives
$n = \left(\dfrac{z\text{-multiplier} \times \sigma}{E}\right)^2$.
Example
Consider the example on the interval estimation of the mean content of 500 milliliter cool
drink bottles. The standard deviation σ is known to be 5. Suppose it is desired to estimate
the mean with 95% confidence and an error that is not greater than 0.8. What sample size is
needed to achieve this accuracy?
Solution:
$n = \left(\dfrac{1.96 \times 5}{0.8}\right)^2 = 150.0625$, so a sample of size n = 151 is needed (n is always rounded up).
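A minimal sketch of the same calculation in Python is given below. The function name `sample_size_mean` is hypothetical, and the z-multiplier is computed with `statistics.NormalDist` (about 1.95996 instead of the tabled 1.96), which does not change the rounded-up answer here.

```python
from math import ceil
from statistics import NormalDist


def sample_size_mean(sigma, error, confidence=0.95):
    """Smallest n for which z-multiplier * sigma / sqrt(n) <= error."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # n is always rounded up to the next whole number
    return ceil((z * sigma / error) ** 2)


# Cool drink bottle example: sigma = 5, E = 0.8, 95% confidence
n_required = sample_size_mean(sigma=5, error=0.8, confidence=0.95)
print(n_required)
```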
The equation to be solved for n is: $E = z\text{-multiplier} \times \sqrt{\dfrac{pq}{n}}$, which gives
$n = pq\left(\dfrac{z\text{-multiplier}}{E}\right)^2$.
A practical problem encountered when using this formula is that values for the parameters
p and q=1 – p are needed. Since the purpose of this technique is to estimate p, these values
of p and q are obviously not known.
If no information on p is available, the value of p that will give the maximum value of
p(1 – p) = pq will be taken. It can be shown that p= 0.5 maximizes this expression. This gives
max pq = 0.25 . Substituting this maximum value in the above formula gives
$\max n = \dfrac{1}{4}\left(\dfrac{z\text{-multiplier}}{E}\right)^2$.
If more accurate information on the value of p is known (e.g. some range of values), it
should be used in the above formula.
Example
Consider the problem (discussed earlier) of estimating the proportion of potential users who
would use a new product. Suppose this proportion is to be estimated with 99% confidence
and an error not exceeding 2% (proportion of 0.02) is required. What sample size is needed
to achieve this?
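Solution:
For 99% confidence the z-multiplier is 2.576, and E = 0.02. If no information on p is available,
$\max n = \dfrac{1}{4}\left(\dfrac{2.576}{0.02}\right)^2 = 4147.36$, so a sample of size n = 4148 is needed.
But suppose it is known that the value of p is between 0.8 and 0.9.
In such a case
max p(1 – p) = pq = 0.8 × 0.2 = 0.16
and $n = 0.16\left(\dfrac{2.576}{0.02}\right)^2 = 2654.31$, so a sample of size n = 2655 is needed.
The additional information on possible values for p reduces the sample size by 36%.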
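The effect of prior information on p can be sketched in Python as follows. The function name `sample_size_proportion` is an illustrative choice, and the tabled z-multiplier 2.576 is passed in directly so that the figures follow the table used in these notes.

```python
from math import ceil


def sample_size_proportion(z_multiplier, error, pq=0.25):
    """Sample size n = pq * (z-multiplier / E)^2, rounded up.

    pq defaults to 0.25, the maximum of p(1 - p), reached at p = 0.5.
    """
    return ceil(pq * (z_multiplier / error) ** 2)


# 99% confidence (z-multiplier 2.576), error not exceeding 0.02
n_max = sample_size_proportion(2.576, 0.02)                    # no information on p
n_informed = sample_size_proportion(2.576, 0.02, pq=0.8 * 0.2) # p between 0.8 and 0.9

reduction = 1 - n_informed / n_max
print(n_max, n_informed, round(100 * reduction))
```

On the interval from 0.8 to 0.9 the product p(1 – p) is largest at p = 0.8, which is why 0.8 × 0.2 is used as the worst case.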
Chapter 8 – Statistical Inference: Testing of hypotheses for one sample
Statistical hypothesis
A statistical hypothesis is an assertion (claim) made about a value(s) of a population
parameter.
Purpose
The purpose of testing of hypotheses is to determine whether a claim that is made
could be true. The conclusion about the truth of such a claim is not stated with
absolute certainty, but rather in terms of the language of probability.
1) A supermarket receives complaints that the mean content of “1 kilogram” sugar bags
that are sold by them is less than 1 kilogram.
2) The variability in the drying time of a certain paint (as measured by the variance) has
until recently been 65 minutes. It is suspected that the variability has now increased.
3) A construction company suspects that the proportion of jobs they complete behind
schedule is 0.20 (20%). They want to test whether this is indeed the case.
H0: ϴ = ϴ0 (The statement that the parameter ϴ is equal to the hypothetical value ϴ0)
Examples
1) In the first example (above) the parameter of interest is the population mean µ and
the hypotheses to be tested are
H0: µ = 1 versus H1a: µ < 1.
2) In the second example (above) the parameter of interest is the population variance
σ² and the hypotheses to be tested are
H0: σ² = 65 versus H1b: σ² > 65.
3) In the third example (above) the parameter of interest is the population proportion,
p, of job completions behind schedule and the hypotheses to be tested are
H0: p = 0.20 versus H1c: p ≠ 0.20.
One-sided alternative
This is a hypothesis that specifies the alternative values (to the null hypothesis) in a
direction that is either below or above that specified by the null hypothesis.
Example
The alternative hypothesis H1a (see example 1 above) is the alternative that the value
of the parameter is less than that stated under the null hypothesis and the
alternative H1b (see example 2 above) is the alternative that the value of the
parameter is greater than that stated under the null hypothesis.
Two-sided alternative
This is a hypothesis that specifies the alternative values (to the null hypothesis) in directions
that can be either below or above that specified by the null hypothesis.
Example
The alternative hypothesis H1c (see example 3 above) is the alternative that the value
of the parameter is either greater than that stated under the null hypothesis or less
than that stated under the null hypothesis.
H0 : µ = µ0 versus
H1a: µ < µ0 or H1b: µ > µ0 or H1c: µ≠ µ0.
The data set that is needed to perform the test is x1, x2, . . . , xn ,
a random sample of size n drawn from the population for which the mean is tested. The test
is performed to see whether or not the sample data are consistent with what is stated by
the null hypothesis.
The instrument that is used to perform the test is called a test statistic. A test statistic is a
quantity calculated from the sample data.
When testing for the population mean, the test statistic used is:
$z_0 = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$.
If the difference between $\bar{x}$ and µ0 (and therefore the value of z0) is reasonably small, H0 will
not be rejected. In this case the sample mean is consistent with the value of the
population mean that is being tested. If this difference (and therefore the value of z0) is
sufficiently large, H0 will be rejected. In this case the sample mean is not consistent with the
value of the population mean that is being tested. In order to decide how large this
difference between $\bar{x}$ and μ0 (and therefore the value of z0) should be before H0 is rejected,
the following should be considered.
Type I error
• A type I error is committed when the null hypothesis is rejected when, in fact, it is
true i.e. H0 is wrongly rejected.
• In this example, a type I error is committed when it is decided that the statement
H0: µ = μ0 should be rejected when, in fact, it is true.
Type II error
• A type II error is committed when the null hypothesis is not rejected when, in fact, it
is false i.e. a decision not to reject H0 is wrong.
• In this example, a type II error is committed when it is decided that the statement
H0: µ = μ0 should not be rejected when, in fact, it is false.
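The following table gives a summary of possible conclusions and their correctness when
performing a test of hypotheses.
Decision            H0 is true          H0 is false
Reject H0           Type I error        Correct decision
Do not reject H0    Correct decision    Type II error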
A Type I error is often considered to be more serious, and therefore more important to
avoid, than a Type II error. The hypothesis testing procedure is therefore designed so that
there is a guaranteed small probability of rejecting the null hypothesis wrongly. This
probability is never 0 (why?). Mathematically the probability of a type I error can be stated
as
α = P(reject H0 | H0 is true).
Probabilities of type I and type II errors work in opposite directions. The more reluctant you
are to reject H0, the higher the risk of accepting it when, in fact, it is false. The easier you
make it to reject H0, the lower the risk of accepting it when, in fact, it is false, but the
higher the risk of rejecting it when, in fact, it is true.
For the test of the population mean the critical value is determined in the following way.
Assuming that H0 is true, the test statistic will follow a standard normal distribution i.e.
$Z_0 = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1)$.
(i) When testing H0 versus the alternative hypothesis H1a (µ < µ0), the critical value is the
value Zα which is such that the area under the standard normal curve to the left of Zα is α
i.e. P(Z0 < Zα) = α. This leaves an area of 1 – α to the right of Zα.
(ii) When testing H0 versus the alternative hypothesis H1b (µ > µ0), the critical value is the
value Z1 – α which is such that the area under the standard normal curve to the left of Z1 – α
is 1 – α i.e. P(Z0 < Z1 – α) = 1 – α. This leaves an area of α to the right of Z1 – α.
The graph below illustrates the case α = 0.05. This means 1 – α = 0.95 and thus
P(Z0 < 1.645) = 0.95.
(iii) When testing H0 versus the alternative hypothesis H1c (µ ≠ µ0), the critical values are
the values Z1 – α/2 and Zα/2. The area under the standard normal curve to the left of Z1 – α/2 is
1 – α/2. The area under the standard normal curve to the left of Zα/2 is α/2.
i.e. P(Z0 < Z1 – α/2) = 1 – α/2 and P(Z0 < Zα/2) = α/2.
The area under the normal curve between these two critical values is 1 – α. The graph on
the following page shows the case α = 0.05.
(i) When testing H0 versus the alternative hypothesis H1a , the rejection region is
{ z0 | z0 < Zα }.
(ii) When testing H0 versus the alternative hypothesis H1b , the rejection region is
{ z0 | z0 > Z1 – α }.
(iii) When testing H0 versus the alternative hypothesis H1c , the rejection region is
{ z0 | z0 < Zα/2 or z0 > Z1 – α/2 }.
H0 is rejected when there is a sufficiently large difference between the sample mean $\bar{x}$ and
the mean (μ0) under H0. Such a large difference is called a significant difference (result of
the test is significant). The value of α is called the level of significance. It specifies the level
beyond which this difference (between $\bar{x}$ and μ0) is sufficiently large for H0 to be rejected.
The value of α is specified prior to performing the test and is often taken as either 0.05 (5%
level of significance) or 0.01 (1% level of significance).
When H0 is rejected, it does not necessarily mean that it is not true. It means that according
to the sample evidence available it appears not to be true. Similarly when H0 is not rejected,
it does not necessarily mean that it is true. It means that there is not sufficient sample
evidence to disprove H0.
Critical values for tests based on the standard normal distribution can be found from the
selected percentiles listed at the bottom of the pages of the standard normal table.
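As an alternative to reading the table, the critical values can be computed. The sketch below uses `statistics.NormalDist` from the Python standard library. The function name `critical_values` and the labels "less", "greater" and "two-sided" (standing for H1a, H1b and H1c) are illustrative choices, not notation from these notes.

```python
from statistics import NormalDist


def critical_values(alpha, alternative):
    """Critical value(s) of the standard normal distribution.

    alternative: "less" (H1a), "greater" (H1b) or "two-sided" (H1c).
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "less":
        return (std_normal.inv_cdf(alpha),)           # Z_alpha
    if alternative == "greater":
        return (std_normal.inv_cdf(1 - alpha),)       # Z_(1 - alpha)
    if alternative == "two-sided":
        return (std_normal.inv_cdf(alpha / 2),        # Z_(alpha/2)
                std_normal.inv_cdf(1 - alpha / 2))    # Z_(1 - alpha/2)
    raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")


for alt in ("less", "greater", "two-sided"):
    print(alt, [round(z, 3) for z in critical_values(0.05, alt)])
```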
A summary of the steps to be followed in the testing procedure is shown below.
1 State the null and alternative hypotheses.
2 Calculate the test statistic $z_0 = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$.
3 State the level of significance α and determine the critical value(s) and critical
region.
(i) For alternative H1a the critical region is R = { z0 | z0 < Zα }.
(ii) For alternative H1b the critical region is R = { z0 | z0 > Z1 – α }.
(iii) For alternative H1c the critical region is R = { z0 | z0 > Z1 – α/2 or z0 < Zα/2 }.
4 If z0 lies in the critical region, reject H0, otherwise do not reject H0.
Examples
1) A supermarket receives complaints that the mean content of “1 kilogram” sugar bags
that are sold by them is less than 1 kilogram. A random sample of 40 sugar bags is
selected from the shelves and the mean found to be 0.987 kilograms. From past
experience the standard deviation of the contents of these bags is known to be 0.025
kilograms. Test, at the 5% level of significance, whether this complaint is justified.
Solution:
Test statistic: $z_0 = \dfrac{0.987 - 1}{0.025/\sqrt{40}} = -3.289$.
2) A supermarket manager suspects that the machine filling “500 gram” containers of
coffee is overfilling them i.e. the actual contents of these containers is more than
500 grams. A random sample of 30 of these containers is selected from the shelves
and the mean found to be 501.8 grams. From past experience the variance of the
contents of these containers is known to be 60 grams². Test at the 5% level of significance
whether the manager’s suspicion is justified.
Solution:
Test statistic: $z_0 = \dfrac{501.8 - 500}{\sqrt{60}/\sqrt{30}} = 1.273$.
3) During a quality control exercise the manager of a factory that fills cans of frozen
shrimp wants to check whether the mean weights of the cans conform to
specifications i.e. the mean of these cans should be 600 grams as stated on the label
of the can. He/she wants to guard against either over or under filling the cans. A
random sample of 50 of these cans is selected and the mean found to be 595 grams.
From past experience the standard deviation of the contents of these cans is known to
be 20 grams. Test, at the 5% level of significance, whether the weights conform to
specifications. Repeat the test at the 10% level of significance.
Solution:
Test statistic: $z_0 = \dfrac{595 - 600}{20/\sqrt{50}} = -1.768$.
Conclusion: There is insufficient evidence to show that the weights don’t conform to
specifications.
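Suppose the test is performed at the 10% level of significance. In such a case the critical
region is R = { z0 < Z0.05 = –1.645 or z0 > Z0.95 = 1.645 } and the test statistic lies in this region.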
Conclusion: There is sufficient evidence to show that the weights don’t conform to
specifications.
Thus, being less strict about controlling a type I error (changing α from 0.05 to 0.10)
results in a different conclusion about H0 (reject instead of do not reject).
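The three examples above can be reproduced with the following sketch of the testing steps. The function name `z_test_mean` and the labels "less", "greater" and "two-sided" (for H1a, H1b and H1c) are assumptions made for this illustration. Critical values come from `statistics.NormalDist` instead of the table.

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(x_bar, mu0, sigma, n, alpha, alternative):
    """Return (z0, reject) for a test of H0: mu = mu0 with sigma known."""
    z0 = (x_bar - mu0) / (sigma / sqrt(n))   # step 2: test statistic
    std_normal = NormalDist()
    # steps 3 and 4: compare z0 with the critical value(s)
    if alternative == "less":                # H1a
        reject = z0 < std_normal.inv_cdf(alpha)
    elif alternative == "greater":           # H1b
        reject = z0 > std_normal.inv_cdf(1 - alpha)
    elif alternative == "two-sided":         # H1c
        reject = abs(z0) > std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("unknown alternative")
    return z0, reject


sugar = z_test_mean(0.987, 1.0, 0.025, 40, 0.05, "less")
coffee = z_test_mean(501.8, 500, sqrt(60), 30, 0.05, "greater")  # variance 60
shrimp_5 = z_test_mean(595, 600, 20, 50, 0.05, "two-sided")
shrimp_10 = z_test_mean(595, 600, 20, 50, 0.10, "two-sided")

for name, (z0, reject) in [("sugar", sugar), ("coffee", coffee),
                           ("shrimp 5%", shrimp_5), ("shrimp 10%", shrimp_10)]:
    print(name, round(z0, 3), "reject H0" if reject else "do not reject H0")
```

Note that the coffee example gives the variance, so its square root is passed as the standard deviation.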
Note
1. In example 1 the alternative hypothesis H1a was used, in example 2 the alternative
H1b and in example 3 the alternative H1c.
2. Alternatives H1a and H1b [one-sided (tailed) alternatives] are used when there is a
particular direction attached to the range of mean values that could be true if H0 is
not true.
4. If, in the above examples, the level of significance had been changed to 1%, the
critical values used would have been Z0.01 = – 2.326 (in example 1) ,
Z0.99 = 2.326 (in example 2) and Z0.005 = –2.576 , Z0.995 = 2.576 (in example 3).
When performing the test for the population mean for the case where the population
variance is not known, the following modifications are made to the procedure.
• In the test statistic formula the population standard deviation σ is replaced by the
sample standard deviation S.
• Since the test statistic $t_0 = \dfrac{\bar{x} - \mu_0}{S/\sqrt{n}}$ that is used to perform the test follows a
t-distribution with n–1 degrees of freedom, critical values are looked up in the
t-tables.
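1 State the null and alternative hypotheses.
2 Calculate the test statistic $t_0 = \dfrac{\bar{x} - \mu_0}{S/\sqrt{n}}$.
3 State the level of significance α and determine the critical value(s) and critical
region.
(i) For alternative H1a the critical region is R = { t0 | t0 < tα }.
(ii) For alternative H1b the critical region is R = { t0 | t0 > t1 – α }.
(iii) For alternative H1c the critical region is R = { t0 | t0 > t1 – α/2 or t0 < tα/2 }.
4 If t0 lies in the critical region, reject H0, otherwise do not reject H0.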
Examples
A paint manufacturer claims that the average drying time for a new paint is 2 hours (120
minutes). The drying times for 20 randomly selected cans of paint were obtained. The
results are shown below.
(a) test whether the population mean drying time is greater than 2 hours (120 minutes)
(b) test, at the 5% level of significance, whether the population mean drying time could be 2
hours (120 minutes).
Solution:
Test statistic: $t_0 = \dfrac{124.1 - 120}{9.65674/\sqrt{20}} = 1.899$.
Thus, being more strict about controlling a type I error (changing α from 0.05 to
0.01) results in a different conclusion about H0 (Do not reject instead of reject).
Test statistic: $t_0 = \dfrac{124.1 - 120}{9.65674/\sqrt{20}} = 1.899$ (as calculated in part (a)).
Note:
• Despite the fact that the same data were used in the above examples, the
conclusions were different. In the first test H0 was rejected, but in the next 2 tests H0
was not rejected.
• In the first test the probability of a type I error was set at 5%, while in the second
test this was changed to 1%. To achieve this, the critical value was moved from 1.729 to
2.539, resulting in the test statistic value (1.899) being less than (instead of greater
than) the critical value.
• In the third test (which has a two-sided alternative hypothesis), the upper critical
value was increased to 2.093 (to have an area of 0.025 under the t-curve to its right).
Again this resulted in the test statistic value (1.899) being less than (instead of
greater than) the critical value.
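The three decisions discussed in the note can be checked with a few lines of Python. This is only a sketch: the summary values (mean 124.1, S = 9.65674, n = 20) and the critical values 1.729, 2.539 and 2.093 are the ones quoted in the text, and the function name `t_statistic` is chosen here for illustration.

```python
from math import sqrt


def t_statistic(x_bar, mu0, s, n):
    """Test statistic for H0: mu = mu0 when the variance is unknown."""
    return (x_bar - mu0) / (s / sqrt(n))


# Paint drying example: summary values quoted in the solution
t0 = t_statistic(x_bar=124.1, mu0=120, s=9.65674, n=20)

# Critical values for 19 degrees of freedom, as quoted in the note above
decisions = {
    "H1b, alpha = 0.05": t0 > 1.729,       # one-sided, upper tail
    "H1b, alpha = 0.01": t0 > 2.539,       # one-sided, upper tail
    "H1c, alpha = 0.05": abs(t0) > 2.093,  # two-sided
}

print(round(t0, 3))
for label, reject in decisions.items():
    print(label, "reject H0" if reject else "do not reject H0")
```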
For a one-sided test with alternative hypothesis H1b the rejection region (highlighted area) is
shown in the graph below.
For a two-sided test with alternative hypothesis H1c the rejection region (highlighted area) is
shown in the graph below.
Example 1
Consider the example on the drying time of the paint discussed in the previous section. Until
recently it was believed that the variance in the drying time is 65 minutes. Suppose it is
suspected that this variance has increased. Test this assertion at the 5% level of significance.
Solution:
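Test statistic: $\chi_0^2 = \dfrac{19 \times 9.65674^2}{65} = 27.258$.
α = 0.05, 1 – α = 0.95.
From the chi-square distribution table with ν = 19 degrees of freedom, $\chi^2_{0.95} = 30.14$, so
the critical region is R = { $\chi_0^2 > 30.14$ }.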
Conclusion: There is insufficient evidence to conclude that the variance has increased.
Example 2
A manufacturer of car batteries guarantees that their batteries will last, on average 3 years
with a standard deviation of 1 year. Ten of the batteries have lifetimes (in years) of
1.2 2.5 3 3.5 2.8 4 4.3 1.9 0.7 4.3
Test at the 5% level of significance whether the variability guarantee is still valid.
Solution:
H0 : σ² = 1 (Guarantee is valid)
Test statistic: $\chi_0^2 = \dfrac{9 \times 1.592889}{1} = 14.336$.
Critical region R = { $\chi_0^2 < \chi^2_{0.025} = 2.70$ or $\chi_0^2 > \chi^2_{0.975} = 19.02$ }.
Conclusion: Using a 5% level of significance, there is insufficient evidence to show that the
variance is not 1 i.e. the data suggests that the guarantee is still valid.
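The battery example can be verified from the raw data with the sketch below. The sample variance is computed with `statistics.variance` (divisor n – 1), and the critical values 2.70 and 19.02 are copied from the solution above. The variable names are illustrative.

```python
from statistics import variance

# Battery lifetimes (in years) from Example 2
lifetimes = [1.2, 2.5, 3, 3.5, 2.8, 4, 4.3, 1.9, 0.7, 4.3]
sigma0_sq = 1.0                  # variance stated under H0

n = len(lifetimes)
s_sq = variance(lifetimes)       # sample variance S^2
chi_sq0 = (n - 1) * s_sq / sigma0_sq

# Critical values for 9 degrees of freedom, as quoted in the solution
lower_crit, upper_crit = 2.70, 19.02
reject = chi_sq0 < lower_crit or chi_sq0 > upper_crit

print(round(s_sq, 6), round(chi_sq0, 3),
      "reject H0" if reject else "do not reject H0")
```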
The test for the population proportion (p) is based on the CLT. From this result it follows
that $Z = \dfrac{\hat{P} - p}{\sqrt{pq/n}} \sim N(0, 1)$.
For this reason the critical value(s) and critical region are the same as that for the test for
the population mean (both based on the standard normal distribution).
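1 State the null and alternative hypotheses.
2 Calculate the test statistic $z_0 = \dfrac{\hat{p} - p_0}{\sqrt{p_0 q_0/n}}$, where q0 = 1 – p0.
3 State the level of significance α and determine the critical value(s) and critical
region.
(i) For alternative H1a the critical region is R = { z0 | z0 < Zα }.
(ii) For alternative H1b the critical region is R = { z0 | z0 > Z1-α }.
(iii) For alternative H1c the critical region is R = { z0 | z0 > Z1-α/2 or z0 < Zα/2 }.
4 If z0 lies in the critical region, reject H0, otherwise do not reject H0.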
Examples
1) A construction company suspects that the proportion of jobs they complete behind
schedule is 0.20 (20%). Of their 80 most recent jobs 22 were completed behind
schedule. Test at the 5% level of significance whether this information confirms their
suspicion.
Solution:
n = 80, x = 22 (given), $\hat{p} = 22/80 = 0.275$, p0 = 0.20.
Test statistic: $z_0 = \dfrac{0.275 - 0.20}{\sqrt{0.20 \times 0.80/80}} = 1.677$.
α = 0.05
Critical region: R = { z0 < Z0.025 = – 1.96 or z0 > Z0.975 = 1.96 }.
Conclusion: There is not sufficient evidence to conclude that the proportion is not 0.2
i.e. the data are consistent with the company's suspicion.
2) During a marketing campaign for a new product 176 out of the 200 potential users of
this product that were contacted indicated that they would use it. Is this evidence
that more than 85% of all the potential users will actually use the product? Use α = 0.01.
Solution:
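n = 200, x = 176, p0 = 0.85 (given), $\hat{p} = 176/200 = 0.88$.
Test statistic: $z_0 = \dfrac{0.88 - 0.85}{\sqrt{0.85 \times 0.15/200}} = 1.188$.
α = 0.01
Critical region: R = { z0 > Z0.99 = 2.326 }.
Conclusion: There is insufficient evidence to conclude that more than 85% of all potential
users will use the product.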
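Both proportion examples can be reproduced with the sketch below. The function name `z_test_proportion` and the labels "less", "greater" and "two-sided" (for H1a, H1b and H1c) are illustrative assumptions. Critical values are taken from `statistics.NormalDist` instead of the table.

```python
from math import sqrt
from statistics import NormalDist


def z_test_proportion(x, n, p0, alpha, alternative):
    """Return (z0, reject) for a large-sample test of H0: p = p0."""
    p_hat = x / n
    # The standard error uses p0 and q0 = 1 - p0, the values under H0
    z0 = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    std_normal = NormalDist()
    if alternative == "less":
        reject = z0 < std_normal.inv_cdf(alpha)
    elif alternative == "greater":
        reject = z0 > std_normal.inv_cdf(1 - alpha)
    elif alternative == "two-sided":
        reject = abs(z0) > std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("unknown alternative")
    return z0, reject


jobs = z_test_proportion(22, 80, 0.20, 0.05, "two-sided")
users = z_test_proportion(176, 200, 0.85, 0.01, "greater")
print(round(jobs[0], 3), jobs[1])
print(round(users[0], 3), users[1])
```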
Chapter 9 – Linear Correlation and regression
The first step in the exploration of bivariate data is to plot the variables on a graph. From
such a graph, which is known as a scatter diagram (scatter plot, scatter graph), an idea can
be formed about the nature of the relationship.
Examples
1) The number of copies sold (y) of a new book (measured in thousands of units) is
dependent on the advertising budget (x) the publisher commits in a pre-publication
campaign (measured in thousands of Rands). The values of x and y for 12 recently
published books are shown below.
[Scatter diagram: copies sold (y) versus advertising budget (x)]
2) In a study of the relationship between the amount of daily rainfall (x) and the
quantity of air pollution removed (y), the following data were collected.
[Scatter diagram: quantity removed (y) versus rainfall (x)]
• In both cases the relationship can be fairly well described by means of a straight line
i.e. both these relationships are linear relationships.
• In both the examples changes in the values of y are affected by changes in the values
of x (not the other way round). The variable x is known as the explanatory
(independent) variable and the variable y the response (dependent) variable.
In this section only linear relationships between 2 variables will be explored. The issues to
be explored are
1) Measuring the strength of the linear relationship between the 2 variables (the linear
correlation problem).
2) Finding the equation of the straight line that will best describe the relationship
between the 2 variables (the linear regression problem). Once this line is
determined, it can be used to estimate a value of y for given value of x (linear
estimation).
–1 ≤ r ≤ 1.
If the plotted points are closely clustered around this line, r will lie close to either 1 or –1
(depending on whether the linear relationship is positive or negative). Perfect positive
correlation occurs when all the plotted points lie on a line with a positive gradient. For this
case r = 1. Perfect negative correlation occurs when the plotted points lie on a line with a
negative gradient. For this case r = –1. The further the plotted points are away from the line,
the closer the value of r will be to 0. Consider the scatter diagrams that follow.
[Scatter diagram: no pattern (r close to 0)]
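For a sample of n pairs of values (x1, y1) , (x2, y2), . . . , (xn, yn) , the coefficient of
correlation can be calculated from the formula
$r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$.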
Example
Consider the data on the advertising budget (x) and the number of copies sold (y)
considered earlier. For this data r can be calculated in the following way.
       x       y       xy         x²        y²
       8       12.5    100        64        156.25
       9.5     18.6    176.7      90.25     345.96
       7.2     25.3    182.16     51.84     640.09
       6.5     24.8    161.2      42.25     615.04
       10      35.7    357        100       1274.49
       12      45.4    544.8      144       2061.16
       11.5    44.4    510.6      132.25    1971.36
       14.8    45.8    677.84     219.04    2097.64
       17.3    65.3    1129.69    299.29    4264.09
       27      75.7    2043.9     729       5730.49
       30      72.3    2169       900       5227.29
       25      79.2    1980       625       6272.64
sum    178.8   545     10032.89   3396.92   30656.5
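Substituting
n = 12, ∑x = 178.8, ∑y = 545,
∑xy = 10032.89, ∑x² = 3396.92, ∑y² = 30656.5
gives
$r = \dfrac{12 \times 10032.89 - 178.8 \times 545}{\sqrt{\left[12 \times 3396.92 - 178.8^2\right]\left[12 \times 30656.5 - 545^2\right]}} = 0.9194$.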
Comment: Strong positive correlation i.e. the increase in the number of copies sold
is closely linked with an increase in advertising budget.
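The correlation coefficient for the book data can also be computed directly from the 12 pairs of values. The sketch below builds the same sums as the table; the function name `correlation` and the variable names are chosen here for illustration.

```python
from math import sqrt

# Advertising budget (x) and copies sold (y) for the 12 books
budget = [8, 9.5, 7.2, 6.5, 10, 12, 11.5, 14.8, 17.3, 27, 30, 25]
copies = [12.5, 18.6, 25.3, 24.8, 35.7, 45.4, 44.4, 45.8, 65.3, 75.7, 72.3, 79.2]


def correlation(xs, ys):
    """Coefficient of correlation r from the column sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = correlation(budget, copies)
print(round(r, 4), round(r ** 2, 4))
```

The second printed value is r², which the notes call the coefficient of determination.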
Coefficient of determination
The strength of the correlation between 2 variables is proportional to the square of
the correlation coefficient (r2). This quantity, called the coefficient of determination,
is the proportion of variability in the y variable that is accounted for by its linear
relationship with the x variable.
Example
In the above example on copies sold (y) and advertising budget (x), the
coefficient of determination = r² = 0.9194² = 0.8453.
This means that 84.53% of the variability in copies sold is explained by
its linear relationship with advertising budget.
The scatter diagram is a plot of the DBH (diameter at breast height measured in inches)
versus the age (years) for 12 oak trees. The data are shown in the following table.
Age (x) 97 93 88 81 75 57 52 45 28 15 12 11
DBH (y) 12.5 12.5 8 9.5 16.5 11 10.5 9 6 1.5 1 1
According to the least squares principle, the line that “best” fits the plotted points is the one
that minimizes the sum of the squares of the vertical deviations (see vertical lines in the
graph) between the plotted y and estimated y (values on the line). For this reason the line
fitted according to this principle is called the least squares line.
ŷ = a + bx,
where ŷ is the fitted y value (y value on the line, which is different to the observed y
value), a is the y-intercept and b the slope of the line.
It can be shown that the coefficients that define the least squares line can be
calculated from
$b = \dfrac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$ and $a = \bar{y} - b\bar{x}$.
Example
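For the above data on age (x) and DBH (y) the least squares line can be calculated as shown
below.
Substituting
n = 12, ∑x = 654, ∑y = 99,
∑xy = 6877.5, ∑x² = 47240
gives
$b = \dfrac{12 \times 6877.5 - 654 \times 99}{12 \times 47240 - 654^2} = \dfrac{17784}{139164} = 0.12779$ and $a = \dfrac{99}{12} - 0.12779 \times \dfrac{654}{12} = 1.285$.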
Therefore the equation of the y on x least squares line that can be used to estimate values
of y (DBH) based on x (age) is
ŷ = 1.285 + 0.12779 x.
Suppose the DBH of a tree aged 90 years is to be estimated. This can be done by
substituting the value of x = 90 into the above equation. Then
ŷ = 1.285 + 0.12779 × 90 = 12.786.
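The least squares coefficients and the estimate at 90 years can be reproduced from the oak tree data with the sketch below. The function name `least_squares_line` is an illustrative choice. Because the code keeps all decimals of a and b, the estimate may differ from 12.786 in the last decimal place.

```python
# Age (x, years) and DBH (y, inches) for the 12 oak trees
age = [97, 93, 88, 81, 75, 57, 52, 45, 28, 15, 12, 11]
dbh = [12.5, 12.5, 8, 9.5, 16.5, 11, 10.5, 9, 6, 1.5, 1, 1]


def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    a = sum_y / n - b * sum_x / n                                 # intercept
    return a, b


a, b = least_squares_line(age, dbh)
estimate_90 = a + b * 90   # estimated DBH for a tree aged 90 years
print(round(a, 3), round(b, 5), round(estimate_90, 3))
```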
A word of caution
• The linear relationship between y and x is often only valid for values of x within a
certain range e.g. when estimating the DBH using age as explanatory variable, it
should be taken into account that at some age the tree will stop growing. Assuming a
linear relationship between age and DBH for values beyond the age where the tree
stops growing would be incorrect.
• Only relationships between variables that could be related in a practical sense are
explored e.g. it would be pointless to explore the relationship between the number
of vehicles in New York and the number of divorces in South Africa. Even if data
collected on such variables might suggest a relationship, it cannot be of any practical
value.
• If variables are not linearly related, it does not mean that they are not related. There
are many situations where the relationships between variables are non-linear.
Note:
Calculations will be demonstrated using the Data Analysis Add-Ins ToolPak in Excel.
You are required to know how to use the STAT mode on your calculator.
Example
A plot of the banana consumption (y) versus the price (x) is shown in the graph on the
following page. A straight line will not describe this relationship very well, but the non-linear
curve shown below will describe it well.
[Graph: banana consumption (y) versus price (x), with the fitted non-linear curve]
$y = \alpha + \dfrac{\beta}{x} + u = \alpha + \beta z + u$, where z = 1/x.
This sequence shows how a nonlinear regression model may be fitted. It uses the banana
consumption example in the first sequence.
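One way such a model may be fitted is to transform the explanatory variable and then apply the least squares formulas for a straight line. The sketch below assumes the transformation z = 1/x. The banana data are not reproduced in these notes, so the prices and consumption values are hypothetical, generated from y = 2 + 10/x with no error term, purely to show that the fit recovers the coefficients.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


# Hypothetical data (not from the notes): y = 2 + 10/x exactly
prices = [1, 2, 4, 5, 8, 10]
consumption = [2 + 10 / x for x in prices]

# Transform x to z = 1/x, then fit a straight line of y on z
z = [1 / x for x in prices]
alpha_hat, beta_hat = least_squares_line(z, consumption)
print(round(alpha_hat, 6), round(beta_hat, 6))
```

The fitted curve in the original variable is then ŷ = alpha_hat + beta_hat / x.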