Stat130 Module Notes

this module gives you a full and well grounded understanding and introduction as to what statistics is all about

Uploaded by

Kiaren Pillay
Copyright
© All Rights Reserved
STAT130

Introduction to Statistics

Chapter 1 – Terminology

1.1 Definitions
Data/Data set – Set of values collected or obtained when gathering information on some
issue of interest.

Examples

1) The monthly sales of a certain vehicle collected over a period.

2) The number of passengers using a certain airline on various routes.

3) Rating (on a scale from 1 to 5) of a new product by customers.

4) The yields of a certain crop obtained after applying different types of fertilizer.

Statistics – Collection of methods for planning experiments, obtaining data, and then
organizing, summarizing, presenting, analyzing, interpreting the data and drawing
conclusions from it.

Statistics in the above sense refers to the methodology used in drawing meaningful
information from a data set. This use of the term should not be confused with statistics
(referring to a set of numerical values) or statistics (referring to measures of description
obtained from a data set).

Descriptive Statistics – Collection, organization, summarization and presentation of data.


To be discussed in chapter 2.

Population – All subjects possessing a common characteristic that is being studied.

Examples

1) The population of people inhabiting a certain country.

2) The collection of all cars of a certain type manufactured during a particular month.

3) All patients in a certain area suffering from AIDS.

4) Exam marks obtained by all students studying a certain statistics course.

Census – A study where every member (element) of the population is included.


Examples

1) Study of the entire population carried out by the government every 10 years.

2) Special investigations e.g. tax study commissioned by a government.

3) Any study of all the individuals/elements in a population.

A census is usually very costly and time consuming. It is therefore not carried out very often.
A study of a population is usually confined to a subgroup of the population.

Sample – A subgroup or subset of the population.

The number of values in the sample (sample size) is denoted by n. The number of values in
the population (population size) is denoted by N.

Statistical Inference – Generalizing from samples to populations and expressing the
conclusions in the language of probability (chance). To be discussed in chapters 5 – 9.

Variable – Characteristic or attribute that can assume different values.

Discrete variables – Variables that can assume a finite or countable number of possible
values. Such variables are usually obtained by counting.

Examples

1) The number of cars parked in a parking lot.

2) The number of students attending a statistics lecture.

3) A person’s response (agree, not agree) to a statement. A one (1) is recorded when the
person agrees with the statement, a zero (0) is recorded when a person does not
agree.

Continuous variables – Variables that can assume any value within an interval, and hence an
infinite (uncountable) number of possible values. Such variables are usually obtained by
measurement.

Examples

1) The body temperature of a person.

2) The weight of a person.

3) The height of a tree.

4) The contents of a bottle of cool drink.


Measurement scales

Qualitative variables – Variables that assume non-numerical values.

Examples

1) The course of study at university (B.Com, B.Eng, BA etc.).


2) The grade (A, B, C, D or E) obtained in an examination.

Nominal scale – Level of measurement which classifies data into categories in which no order
or ranking can be imposed on the data.

A variable can be treated as nominal when its values represent categories with no intrinsic
ranking. For example, the department of the company in which an employee works. Examples
of nominal variables include region, postal code, or religious affiliation.

Ordinal scale – Level of measurement which classifies data into categories that can be
ordered or ranked. Differences between the ranks are not meaningful.

A variable can be treated as ordinal when its values represent categories with some intrinsic
order or ranking.

Examples

1) Levels of service satisfaction from very dissatisfied to very satisfied.


2) Attitude scores representing degree of satisfaction or confidence and preference
rating scores (low, medium or high).
3) Likert scale responses to statements (strongly agree, agree, neutral, disagree, strongly
disagree).

Quantitative variables – Variables which assume numerical values.

Examples
Discrete and continuous variables examples given above.

Interval scale – Level of measurement which classifies data that can be ordered and ranked
and where differences are meaningful. However, there is no meaningful zero and ratios are
meaningless.

Examples

1) The difference between a temperature of 100 degrees and 90 degrees is the same
difference as that between 90 degrees and 80 degrees. Taking ratios in such a case
does not make sense.

2) When referring to dates (years) or temperatures measured (degrees Fahrenheit or
Celsius) there is no natural zero point.

Ratio scale – Level of measurement where differences and ratios are meaningful and
there is a natural zero. This is the “highest” level of measurement in terms of possible
operations that can be performed on the data.

Examples

Variables like height, weight, mark (in test) and speed are ratio variables. These variables
have a natural zero and ratios make sense when doing calculations e.g. a weight of 80
kilograms is twice as heavy as one of 40 kilograms.

Summary of 4 measurement scales

Measurement scale   Examples                      Meaningful calculations

Nominal             Types of music                Put into categories
                    University faculties
                    Vehicle makes

Ordinal             Motion picture ratings:       Put into categories
                    G – General audiences         Put into order
                    PG – Parental guidance
                    PG-13 – Parents cautioned
                    R – Restricted
                    NC-17 – No under 17

Interval            Years: 2009, 2010, 2011       Put into categories
                    Months: 1, 2, . . . , 12      Put into order
                                                  Differences between values are meaningful

Ratio               Rainfall                      Put into categories
                    Humidity                      Put into order
                    Income                        Differences between values are meaningful
                                                  Ratios are meaningful

Experiment – The process of observing some phenomenon that occurs.

An experiment can be observational or designed.

1) A designed experiment can be controlled to a certain extent by the experimenter.


Consider a study of 4 fuel additives on the reduction in oxides of nitrogen. You may
have 4 drivers and 4 cars at your disposal. You are not particularly interested in any
effects of particular cars or drivers on the resultant oxide reduction. However, you do
not want the results for the fuel additives to be influenced by the driver or car. An
appropriate design of the experiment (way of performing the experiment) will allow
you to estimate effects of all factors of interest without these outside factors
influencing the results.

2) An observational study is not controlled by the experimenter. The characteristic of
interest is simply observed and the results recorded. For example

2.1) Collecting data that compares reckless driving of female and male drivers.
2.2) Collecting data on smoking and lung cancer.

Parameter – Characteristic or measure of description obtained from a population.

Examples

1) Mean (average) age of all employees working at a certain company.

2) The proportion of registered female voters in a certain country.

Statistic – Characteristic or measure of description obtained from a sample.

Examples

1) The mean (average) monthly salary of 50 selected employees in a certain government
department.

2) The proportion of smokers in a sample of 60 university students.

1.2 Sampling methods


When selecting a sample, the main objective is to ensure that it is as representative as
possible of the population it is drawn from. When a sample fails to achieve this objective, it is
said to be biased.

Sampling frame (synonyms: "sample frame", "survey frame") – This is the actual set of units
from which a sample is drawn.

Example

Consider a survey aimed at establishing the number of potential customers for a new service
in a certain city. The research team has drawn 1000 numbers at random from a telephone
directory for the city, made 200 calls each day from Monday to Friday from 8am to 5pm and
asked some questions.

In this example, the population of interest is all the inhabitants in the city. The sampling frame
includes only those city dwellers that satisfy all the following conditions:

1) They have a telephone.

2) The telephone number is included in the directory.

3) They are likely to be at home from 8am to 5pm from Monday to Friday;

4) They are not people who refuse to answer telephone surveys.


The sampling frame in this case definitely differs from the population. For example, it
under-represents people who have no telephone (e.g. the poorest), who have an unlisted
number, who were not at home at the time of the calls (e.g. employed people), or who do not
like to participate in telephone interviews (e.g. busier and more active people). Such
differences between the sampling frame and the population of interest are a main cause of bias
when drawing conclusions based on the sample.

Probability samples – Samples drawn according to the laws of chance. These include simple
random sampling, systematic sampling and stratified random sampling.

Simple random sampling – Sampling in which each sample of a given size that can be drawn
will have the same chance of being drawn. Most of the theory in statistical inference is based
on random sampling being used.

Examples

1) The 6 winning numbers (drawn from 49 numbers) in a Lotto draw. Each potential
sample of 6 winning numbers has the same chance of being drawn.

2) Each name in a telephone directory could be numbered sequentially. If the sample
size was to include 2 000 people, then 2 000 numbers could be randomly generated
by computer or numbers could be picked out of a hat. These numbers could then be
matched to names in the telephone directory, thereby providing a list of 2 000 people.

We can use Excel to create a random sample of data. Many types of random samples can be
created using the function RANDBETWEEN.

Example

Using the functions in Excel, select the 6 winning numbers in a Lotto draw.

We will start with an empty spreadsheet in Excel. To randomly select 6 lotto numbers
numbered from 1 to 49, we click in cell A2 and type =RANDBETWEEN(1;49). Press Enter. Fill
the other 5 numbers by dragging down. Note that each cell is generated independently, so the
same number can appear more than once; if that happens, recalculate until the 6 numbers are
all different.

The results will be displayed as below. Note: Your results will be different from those below.
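
The same draw can also be sketched outside Excel. The short Python example below is only an illustration and not part of the module's Excel workflow; the function name draw_lotto and the seed argument are assumptions made for this sketch. It uses random.sample, which draws without replacement, so the 6 numbers are always distinct.

```python
import random

def draw_lotto(k=6, low=1, high=49, seed=None):
    """Draw k distinct numbers from low..high (a simple random sample)."""
    rng = random.Random(seed)  # seed is optional and only makes the draw repeatable
    # random.sample draws without replacement, so no number repeats
    return sorted(rng.sample(range(low, high + 1), k))

numbers = draw_lotto(seed=130)  # hypothetical seed, any value will do
print(numbers)
```

As with the Excel version, the numbers obtained will differ from one run to the next unless a seed is fixed.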
The advantage of simple random sampling is that it is simple and easy to apply when small
populations are involved. However, because every person or item in a population has to be
listed before the corresponding random numbers can be read, this method is very
cumbersome to use for large populations and cannot be used if no list of the population items
is available. It can also be very time consuming to try and locate every person included in the
sample. There is also a possibility that some of the persons in the sample cannot be contacted
at all.

Systematic sampling – Sampling in which data is obtained by selecting every kth object,
where k is approximately N/n.

Examples

1) A manufacturer might decide to select every 20th item on a production line to test for
defects and quality. This technique requires the first item to be selected at random as
a starting point for testing and, thereafter, every 20th item is chosen.

2) A market researcher might select every 10th person who enters a particular store,
after selecting a person at random as a starting point; or interview occupants of every
5th house in a street, after selecting a house at random as a starting point.

3) A systematic sample of 500 students is to be selected from a university with an
enrolled population of 10 000. In this case the population size N = 10 000 and the
sample size n = 500. Then every 10 000/500 = 20th student will be included in the sample.
The first student in the sample can be randomly selected from an alphabetical list of
students and thereafter every 20th student can be selected until 500 names have been
obtained.
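
Example 3 can be sketched in Python as follows. This is a minimal illustration, assuming the students are simply numbered 1 to 10 000; the function name systematic_sample is hypothetical.

```python
import random

def systematic_sample(population, n, seed=None):
    """Select every k-th element after a random start, with k = N // n."""
    N = len(population)
    k = N // n                                # sampling interval, here 10000 // 500 = 20
    start = random.Random(seed).randrange(k)  # random starting point among the first k items
    return [population[start + i * k] for i in range(n)]

students = list(range(1, 10001))              # assumed student numbers 1..10000
sample = systematic_sample(students, 500, seed=1)
print(len(sample), sample[:3])
```

Only the starting point is random; once it is chosen, the rest of the sample is fully determined.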

Stratified random sampling – Sampling in which the population is divided into groups (called
strata) according to some characteristic. Each of these strata is then sampled using random
sampling.

A general problem with random sampling is that you could, by chance, miss out a particular
group in the sample. However, if you subdivide the population into groups, and sample from
each group, you can make sure the sample is representative. Some examples of strata
commonly used are those according to province, age and gender. Other strata may be
according to religion, academic ability or marital status.
Example

In a study investigating the expenditure pattern of consumers, they were divided into low,
medium and high income groups.

Income group percentage of population


low 40
medium 45
high 15

A stratified sample of 500 consumers is to be selected for this study.

When sampling is proportional to size (an income group comprises the same percentage of
the sample as of the population) the sample sizes for the strata should be calculated as
follows.

low: (40 × 500)/100 = 200, medium: (45 × 500)/100 = 225, high: (15 × 500)/100 = 75.
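
The proportional allocation above can be checked with a few lines of Python. This is a sketch only; the function name proportional_allocation is an assumption made for the illustration.

```python
def proportional_allocation(percentages, n):
    """Sample size per stratum when each stratum keeps its population percentage."""
    return {stratum: pct * n / 100 for stratum, pct in percentages.items()}

shares = {"low": 40, "medium": 45, "high": 15}   # percentage of population per income group
sizes = proportional_allocation(shares, 500)
print(sizes)  # {'low': 200.0, 'medium': 225.0, 'high': 75.0}
```

When the percentages do not give whole numbers, the stratum sizes have to be rounded so that they still add up to the required sample size.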

Convenience Sampling – Sampling in which data that is readily available is used e.g. surveys
done on the internet. These include quota sampling.

Quota sampling – Quota sampling is performed in 4 stages.

a) Stage 1: Decide which characteristics of the elements/individuals in the population
to be sampled are of importance.
b) Stage 2: Decide on the categories to be sampled from. These categories are
determined by cross-classification according to the characteristics chosen at stage 1.
c) Stage 3: Decide on the overall number (quota) and numbers (sub-quotas) to be
sampled from each of the categories specified in stage 2.
d) Stage 4: Collect the information required until all the numbers (quotas) are obtained.
Example

A company is marketing a new product and needs to know how potential customers might
react to the product.

Stage 1: It is decided that age (the 3 groups under 20, 20-40, over 40) and gender
(male, female) are the characteristics that will determine the sample.

Stage 2: The 6 categories to be sampled from are (male under 20), (male 20-40), (male
over 40), (female under 20), (female 20-40) and (female over 40).

Stage 3: The numbers (sub-quotas) to be sampled are (male under 20) - 40,
(male 20-40) - 60, (male over 40) - 25, (female under 20) - 35, (female 20-40) - 65 and
(female over 40) -30. The total quota is the total of all the sub-quotas i.e. 255.

Stage 4: Visit a place where individuals to be interviewed are readily available e.g. a
large shopping center and interview people until all the quotas are filled.
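
Stages 2 and 3 of this example amount to a cross-classification followed by a sum, which can be sketched in Python as below. The variable names are chosen for this illustration only.

```python
from itertools import product

ages = ["under 20", "20-40", "over 40"]
genders = ["male", "female"]

# Stage 2: cross-classify the two characteristics into 2 x 3 = 6 categories
categories = list(product(genders, ages))

# Stage 3: sub-quotas in the same order as the categories listed in the example
sub_quotas = dict(zip(categories, [40, 60, 25, 35, 65, 30]))
total_quota = sum(sub_quotas.values())

print(len(categories), total_quota)  # 6 255
```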

Quota sampling is a cheap and convenient way of obtaining a sample in a short space of time.
However, this method of sampling is not based on the laws of chance and cannot guarantee
a sample that is representative of the population from which it is drawn.

When obtaining a quota sample, interviewers often choose who they like (within criteria
specifications) and may therefore select those who are easiest to interview. Therefore,
sampling bias can result. It is also impossible to estimate the accuracy of quota sampling
(because sampling is not random).

Chapter 2 – Descriptive Statistics


(Exploratory Data Analysis)
All the data sets used in this chapter will be regarded as samples drawn from some
population. One of the main purposes of studying a sample is to get information about the
population. The main focus here is on summarizing and describing some features of the data.

2.1 Graphs and diagrams


Line graph – A line graph is a graph used to present some characteristic recorded over time.
Example

The graph above shows how a person's weight varied from the beginning of 1991 to the
beginning of 1995.

Bar charts

A bar chart or bar graph is a chart consisting of rectangular bars with heights proportional to
the values that they represent. Bar charts are used for comparing two or more values that are
taken over time or under different conditions.

Simple Bar Chart

In a simple bar chart the figures used to make comparisons are represented by bars. These
are either drawn vertically or horizontally. Only totals are represented. The height or length
of the bar is drawn in proportion to the size of the figure being presented. An example is
shown below.

Component Bar Chart

When you want to draw a bar chart to illustrate your data, it is often the case that the totals
of the figures can be broken down into parts or components.

Year Total Male Female


1959 51 956 000 25 043 000 26 913 000
1969 55 461 000 26 908 000 28 553 000
1979 56 240 000 27 373 000 28 867 000
1989 57 365 000 27 988 000 29 377 000
1999 59 501 000 29 299 000 30 202 000

You start by drawing a simple bar chart with the total figures as shown above. The columns
or bars (depending on whether you draw the chart vertically or horizontally) are then divided
into the component parts.

Multiple (compound) Bar Chart


You may find that your data allows you to make comparisons of the component figures
themselves. If so, you will want to create a multiple (compound) bar chart. This type
of chart enables you to trace the trends of each individual component, as well as
making comparisons between the components.

Pareto chart
A Pareto chart is a special type of bar chart where the values being plotted are
arranged in descending order. The graph is accompanied by a line graph which shows
the cumulative totals of each category, left to right.

The graph below is a Pareto chart that shows the percentage of late arrivals at a place
of work organized according to cause of late arrival (from the most common to the
least common cause). The line shows the accumulated percentages.

[Figure: Pareto chart of late arrivals; x-axis: reason (traffic, child care, transport, weather, overslept, emergency), left y-axis: percent (0 to 100), right y-axis: cumulative percent (0% to 100%).]

Dot Plot

This is a diagram where a line is drawn according to a scale that is appropriate for the data set
and the values (in the data set) plotted at their positions on the scale. If the same value occurs
more than once, the multiple values are plotted on top of each other at the same point on
the scale. For small data sets (few values) this plot can provide useful information regarding
data patterns.

Example
Imagine that a medium-sized retailer, thinking of expanding into a new region
identifies a business that it considers as being ready for takeover. It finds the following
annual profit figures (in tens of thousands of pounds) for the target retailer's last ten
years trading:

9 9 7 7 7 6 5 4 3 3

To draw a dot plot we can begin by drawing a horizontal line across the page to
represent the range of values of all the numbers; then we can mark an 'x' above the
appropriate value along the line as follows:

Pie Chart

A Pie chart is a diagram that shows the subdivision of some entity/total into subgroups. The
diagram is in the form of a circle which is divided into slices with each slice having an area
according to the proportion that it makes up of the total.

Example
The pie chart below shows the ingredients used to make a sausage and mushroom pizza.

The degrees needed for each slice are found by calculating the appropriate percentage of 360
e.g. for tomato sauce the degrees are 0.125×360 = 45 and for cheese 0.25×360 = 90 etc.

The complete calculations are shown in the table below.

Ingredient Percentage Degrees


Sausage 7.5 0.075 × 360 = 27
Cheese 25 0.250 × 360 = 90
Crust 50 0.50 × 360 = 180
Tomato sauce 12.5 0.125 × 360 = 45
Mushrooms 5 0.050 × 360 = 18
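
The degree calculations in the table can be reproduced with a short Python sketch (an illustration only; the variable names are assumptions).

```python
percentages = {
    "Sausage": 7.5,
    "Cheese": 25,
    "Crust": 50,
    "Tomato sauce": 12.5,
    "Mushrooms": 5,
}

# Each slice gets the same share of 360 degrees as its share of the total
degrees = {name: pct * 360 / 100 for name, pct in percentages.items()}

for name, angle in degrees.items():
    print(f"{name:<13}{angle:>6.1f}")
print("Total", sum(degrees.values()))
```

The angles add up to 360 because the percentages add up to 100.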

Stem-and-leaf plot
A stem-and-leaf plot is a device used for summarizing quantitative data in a
table/graphical format to assist in visualizing the shape of a data set.

Examples

1) To construct a stem-and-leaf plot, the values must first be sorted in ascending order.
Here is the sorted set of data values that will be used in the example:

44 46 47 49 63 64 66 68 68 72 72 75 76 81 84 88 106

Next, it must be determined what the stems will represent and what the leaves will
represent. Typically, the leaf contains the last digit of the number and the stem
contains all of the other digits. In the case of very large or very small numbers, the
data values may be rounded to a particular place value (such as the hundredths place)
that will be used for the leaves. The remaining digits to the left of the rounded place
value are used as the stems.

In this example, the leaf represents the “ones” place and the stem the rest of the
number (“tens” place or higher).

The stem-and-leaf plot is drawn with two columns separated by a vertical line. The
stems are listed to the left of the vertical line. It is important that each stem is listed
only once and that no numbers are skipped, even if it means that some stems have no
leaves. The leaves are listed in increasing order in a row to the right of each stem.

4 |4679
5 |
6 |34688
7 |2256
8 |148
9 |
10 | 6

key: 5|4 = 54
leaf unit: 1.0
stem unit: 10.0

Conclusion: 12 of the 17 values are greater than or equal to 63 and less than or
equal to 88.
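
The construction described above can be sketched in Python. This is a minimal illustration, assuming non-negative whole numbers with the ones digit as leaf; the function name stem_and_leaf is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf plot (leaf = ones digit, stem = the rest)."""
    leaves = defaultdict(list)
    for v in sorted(values):              # sorting keeps the leaves in increasing order
        leaves[v // 10].append(v % 10)
    rows = []
    # list every stem once, including stems that have no leaves
    for stem in range(min(leaves), max(leaves) + 1):
        row_leaves = "".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"{stem:>2} |{row_leaves}")
    return rows

data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106]
rows = stem_and_leaf(data)
print("\n".join(rows))
```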

2) Two data sets can be compared by drawing a back-to-back stem-and-leaf plot.

As an example, suppose the fat contents (in grams) of English breakfasts and
cold meat sandwiches are to be compared. The fat contents are shown below.

Sandwiches: 6, 7, 12, 13, 17, 18, 20, 21, 21, 24, 26, 28, 30, 34

Breakfasts: 12, 14, 15, 16, 18, 23, 25, 25, 36, 36, 38, 41, 44, 45

A back-to-back stem-and-leaf plot is shown below.

 Breakfasts         Sandwiches
            | 0 | 6 7
  2 4 5 6 8 | 1 | 2 3 7 8
      3 5 5 | 2 | 0 1 1 4 6 8
      6 6 8 | 3 | 0 4
      1 4 5 | 4 |

key: 2|4 = 24 for sandwiches (stem|leaf) and 2|4 = 42 for breakfasts (leaf|stem)
leaf unit: 1.0
stem unit: 10.0

Conclusion: The fat content in English breakfasts appears to be higher than that in
sandwiches.

2.2 Sigma and subscript notation


The symbol sigma ∑ (Capital S in Greek alphabet) is used to denote “the sum of” values.

Suppose the symbol x is used to denote some variable of interest in a study. In order to
distinguish between values of this variable, subscripts are used.

x1 – first value in the data set which has a subscript 1.


x2 – second value in the data set which has a subscript 2.
.
.
xn – nth value in the data set which has a subscript n.

The sum of these values is written in shorthand notation as

$x_1 + x_2 + \cdots + x_n = \sum_{i=1}^{n} x_i .$

If it is understood that the range of subscript indices over which the summation is taken
involves all the x values, the summation can be written as just

x1 + x2 + . . . + xn = ∑x.
Example 1: Suppose x1 = 70, x2 = 74, x3 = 66, x4 = 68, x5 = 71. Then

$\sum_{i=1}^{5} x_i = x_1 + x_2 + \cdots + x_5 = 70 + 74 + 66 + 68 + 71 = 349.$

The sum of the squares of a set of values is written as

$\sum_{i=1}^{n} x_i^2 = x_1^2 + x_2^2 + \cdots + x_n^2$, or $\sum x^2$ for short.

Example 2: For the data set in example 1,

$\sum_{i=1}^{5} x_i^2 = 70^2 + 74^2 + 66^2 + 68^2 + 71^2 = 24397.$

Note that $\sum_{i=1}^{n} x_i^2 \neq \left( \sum_{i=1}^{n} x_i \right)^2$,

e.g. for the abovementioned data $\sum_{i=1}^{5} x_i^2 = 24397 \neq 349^2 = 121801$.

The summation notation can also be used to write the sum of products of corresponding values
for 2 different sets of values.

$\sum_{i=1}^{n} x_i y_i = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n$

Example: Consider the following values.

i 1 2 3 4 5 6
xi 11 13 7 12 10 8
yi 8 5 7 6 9 11

For this data

$\sum_{i=1}^{6} x_i y_i = (11 \times 8) + (13 \times 5) + (7 \times 7) + (12 \times 6) + (10 \times 9) + (8 \times 11) = 88 + 65 + 49 + 72 + 90 + 88 = 452.$

Note that $\sum_{i=1}^{n} x_i y_i \neq \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)$,

e.g. for the abovementioned data $\sum_{i=1}^{6} x_i = 61$ and $\sum_{i=1}^{6} y_i = 46$, so that
$\left( \sum_{i=1}^{6} x_i \right) \left( \sum_{i=1}^{6} y_i \right) = 2806 \neq \sum_{i=1}^{6} x_i y_i$.

The summation notation is used extensively in specifying calculations in statistical formulae.
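
The sums in this section can be verified with a short Python sketch. The variable names are chosen for this illustration; the data are those of the examples above.

```python
# Examples 1 and 2: sum of values and sum of squares
x = [70, 74, 66, 68, 71]
sum_x = sum(x)                          # 349
sum_x_sq = sum(v ** 2 for v in x)       # 24397
square_of_sum = sum_x ** 2              # 121801, not the same as the sum of squares

# Sum of products for two sets of values
xs = [11, 13, 7, 12, 10, 8]
ys = [8, 5, 7, 6, 9, 11]
sum_xy = sum(a * b for a, b in zip(xs, ys))   # 452
product_of_sums = sum(xs) * sum(ys)           # 61 * 46 = 2806, not the same as sum_xy

print(sum_x, sum_x_sq, square_of_sum, sum_xy, product_of_sums)
```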

2.3 Frequency distributions and related graphs


Frequency distribution

A frequency distribution is a table in which data are grouped into classes and the number of
values (frequencies) which fall in each class recorded.
The main purpose of constructing a frequency distribution is to get insight into the
distribution pattern of the frequencies over the classes. Hence, the name frequency
distribution is used to refer to this pattern.

Example 1
In a survey of 40 families in a village, the number of children per family was recorded and the
following data obtained.
1 0 3 2 1 5 6 2
2 1 0 3 4 2 1 6
3 2 1 5 3 3 2 4
2 2 3 0 2 1 4 5
3 3 4 4 1 2 4 5

number of children   tally          frequency (f)

0                    |||            3
1                    ||||| ||       7
2                    ||||| |||||    10
3                    ||||| |||      8
4                    ||||| |        6
5                    ||||           4
6                    ||             2
Total                               40

Note: The sum of the frequencies = sample size i.e. ∑f = n.
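
The tally for Example 1 can be reproduced in Python with collections.Counter. This is only a sketch; the list below is the data set of Example 1 typed row by row.

```python
from collections import Counter

children = [
    1, 0, 3, 2, 1, 5, 6, 2,
    2, 1, 0, 3, 4, 2, 1, 6,
    3, 2, 1, 5, 3, 3, 2, 4,
    2, 2, 3, 0, 2, 1, 4, 5,
    3, 3, 4, 4, 1, 2, 4, 5,
]

freq = Counter(children)                 # counts how often each value occurs
for value in sorted(freq):
    print(value, "|" * freq[value], freq[value])
print("Total", sum(freq.values()))       # equals the sample size n
```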

Example 2
Consider the following data of low temperatures (in degrees Fahrenheit to the nearest
degree) for 50 days. The highest temperature is 64 and the lowest temperature is 39.

Data Set - Low Temperatures for 50 Days


57 39 52 52 43
50 53 42 58 55
58 50 53 50 49
45 49 51 44 54
49 57 55 64 45
50 45 51 54 58
53 49 52 51 41
52 40 44 49 45
43 47 47 43 51
55 55 46 54 41

Constructing a frequency distribution


The classes into which the above values can be sorted can be found by following the steps
shown below.

1. Find the maximum (=64) and minimum (=39) values and calculate the

range = maximum – minimum = 64 – 39 = 25.

2. Decide on the number of classes. Use Sturges’ rule which states that

No. of classes = k = the rounded up value of (1 + 1.44 ln n).

For n = 50, 1 + 1.44 × ln(50) = 6.63, i.e. k = 7.

3. Calculate the class width such that no. of classes × class width > range

i.e. 7× class width > 25.

This suggests a class width of 4.

4. Find the lower value that defines the first class. This is usually a value just below the
minimum value in the data set. Since the minimum value for this data set is 39, the lowest
class can have a minimum value one below this i.e. 38.

5. Find the lower values that define each of the classes that follow by successively adding
the class width to the lower value of the previous class.

lower value of the second class = 38 + 4 = 42.

lower value of the third class = 42 + 4 = 46 etc.

The frequency distribution below shows the data values sorted into the classes

38 – 41, 42 – 45, 46 – 49, 50 – 53, 54 – 57, 58 – 61, 62 – 65

The table below shows the classes and their frequencies for the temperatures data set.

class limits f
38 – 41 4
42 – 45 10
46 – 49 8
50 – 53 15
54 – 57 9
58 – 61 3
62 – 65 1
Total 50
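
Steps 1 to 5 can be sketched in Python as below. This is an illustration under the choices made in the text (class width 4, first class starting one below the minimum); the variable names are assumptions.

```python
import math

temps = [
    57, 39, 52, 52, 43, 50, 53, 42, 58, 55,
    58, 50, 53, 50, 49, 45, 49, 51, 44, 54,
    49, 57, 55, 64, 45, 50, 45, 51, 54, 58,
    53, 49, 52, 51, 41, 52, 40, 44, 49, 45,
    43, 47, 47, 43, 51, 55, 55, 46, 54, 41,
]

n = len(temps)
data_range = max(temps) - min(temps)        # step 1: 64 - 39 = 25
k = math.ceil(1 + 1.44 * math.log(n))       # step 2: Sturges' rule as stated, gives 7
width = 4                                   # step 3: chosen so that k * width > range
lower = min(temps) - 1                      # step 4: first class starts at 38

# step 5: inclusive classes 38-41, 42-45, ...
classes = [(lower + i * width, lower + (i + 1) * width - 1) for i in range(k)]
freqs = [sum(lo <= t <= hi for t in temps) for lo, hi in classes]

for (lo, hi), f in zip(classes, freqs):
    print(f"{lo} - {hi}  {f}")
```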

The values in the above example that define the classes of the frequency distribution are
called class limits. Classes of the type 38 – 41, 42 – 45, … in which both the upper and
lower limits are included are called “inclusive classes”. For example, the class 38 – 41
includes all the values from 38 to 41.
In spite of the great importance of classification in statistical analysis, no hard and fast rules
can be laid down for it.
The following points must be kept in mind for classification:

1) The classes should be clearly defined and should not lead to any ambiguity.
2) Each of the given values in the data set should be included in one of the classes.
3) The classes should be of equal width, otherwise the different class frequencies
will not be comparable. If the class widths are unequal, then comparable figures
can be obtained by dividing the value of the frequencies by the corresponding
widths of the class intervals. The ratios thus obtained are called ‘frequency
density’.
4) The number of classes should not be too large nor too small.

Continuous Frequency Distribution

If we deal with a continuous variable, it is not possible to arrange the data in class
intervals of the above type. Let us consider the distribution of age in years. If the class
intervals are 15 – 19, 20 – 24 then persons with ages between 19 and 20 years are not taken
into consideration. In such a case we form the class intervals as 0 – 5, 5 – 10, 10 – 15,
15 – 20, … . Here all the persons with any fraction of age are included in one group or the
other. In these classes the upper limit of each class is excluded from that class and
included in the next class; such classes are known as ‘exclusive classes’.
The upper and lower class limits of the new exclusive type classes are known as class
boundaries.

If d is the gap between the upper limit of any class and the lower limit of the succeeding
class, the class boundaries for any class are then given by:

Upper class boundary = upper class limit + (d/2)


Lower class boundary = Lower class limit – (d/2)
Example 2 continued (temperature data)
The frequency distribution below includes the class boundaries.

class limits class boundaries f


38 – 41 37.5 – 41.5 4
42 – 45 41.5 – 45.5 10
46 – 49 45.5 – 49.5 8
50 – 53 49.5 – 53.5 15
54 – 57 53.5 – 57.5 9
58 – 61 57.5 – 61.5 3
62 – 65 61.5 – 65.5 1
Total 50

Example 3
The monthly expenditures (thousands of rands) of 60 households are shown below.
The values of this data set were accurately recorded (not rounded).

7.21741 7.8989 6.85461 10.31167 8.48253 5.17069


5.09063 8.16412 5.67094 7.7394 7.87423 5.41634
9.37265 10.14436 7.15675 10.31107 8.86571 10.1734
5.99276 6.5738 7.06965 8.82439 7.47467 9.50018
4.90014 5.50273 8.12516 5.51933 7.43641 10.95599
5.87188 9.36936 9.83773 10.18893 5.12028 9.60018
8.56534 9.27719 8.37107 7.03318 10.78344 9.08941
6.85749 7.7887 9.68159 6.75009 8.0521 8.19638
10.17312 7.51527 11.31383 8.5765 7.48021 8.39881
7.37565 7.28159 8.81773 5.53182 5.98515 7.71778

The frequency distribution shown below is a summary of this data set.

classes f
4.5 – 5.5 5
5.5 – 6.5 7
6.5 – 7.5 13
7.5 – 8.5 13
8.5 – 9.5 9
9.5 – 10.5 10
10.5 – 11.5 3
Total 60

For this distribution lower (upper) class limit = lower (upper) class boundary for each of the
classes.
A value that falls on the boundary of 2 classes is allocated to the higher of the two classes e.g.
5.50000 is allocated to the class 5.5 – 6.5 (not 4.5 to 5.5).
Class midpoints

The midpoint of a class (xmid) can be calculated from

xmid = (lower class boundary + upper class boundary) / 2.

Examples

1) For the frequency distribution in example 2 (temperature data), the class midpoints
are given below.

class limits class boundaries f midpoints


38 – 41 37.5 – 41.5 4 39.5
42 – 45 41.5 – 45.5 10 43.5
46 – 49 45.5 – 49.5 8 47.5
50 – 53 49.5 – 53.5 15 51.5
54 – 57 53.5 – 57.5 9 55.5
58 – 61 57.5 – 61.5 3 59.5
62 – 65 61.5 – 65.5 1 63.5

2) For the frequency distribution in example 3 (expenditure data), the class midpoints are
given below.

classes midpoints
4.5 – 5.5 5
5.5 – 6.5 6
6.5 – 7.5 7
7.5 – 8.5 8
8.5 – 9.5 9
9.5 – 10.5 10
10.5 – 11.5 11
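
The class boundaries and midpoints of the temperature data can be computed with the Python sketch below. It assumes that the midpoint is taken as the average of the two class boundaries, which reproduces the values tabulated above; the variable names are chosen for this illustration.

```python
limits = [(38, 41), (42, 45), (46, 49), (50, 53), (54, 57), (58, 61), (62, 65)]

# d = gap between the upper limit of a class and the lower limit of the next class
d = limits[1][0] - limits[0][1]          # 42 - 41 = 1

# widen each class by d/2 on both sides to obtain the class boundaries
boundaries = [(lo - d / 2, hi + d / 2) for lo, hi in limits]

# midpoint assumed here to be the average of the two boundaries
midpoints = [(lo + hi) / 2 for lo, hi in boundaries]

for b, m in zip(boundaries, midpoints):
    print(b, m)
```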
Cumulative frequencies

The “less than” cumulative frequency of a class is the number of values in the sample that
are less than or equal to the upper class boundary of the class.

Examples

1) For the frequency distribution in example 2 (temperature data) the cumulative
frequencies are calculated as shown below.

class boundaries   f    cumulative frequency   calculations
37.5 – 41.5        4    4                      4
41.5 – 45.5        10   14                     4+10
45.5 – 49.5        8    22                     4+10+8
49.5 – 53.5        15   37                     4+10+8+15
53.5 – 57.5        9    46                     4+10+8+15+9
57.5 – 61.5        3    49                     4+10+8+15+9+3
61.5 – 65.5        1    50                     4+10+8+15+9+3+1

2) For the frequency distribution in example 3 (expenditure data) the cumulative
frequencies are calculated as shown below.

classes f cumulative frequencies calculations
4.5 – 5.5 5 5 5
5.5 – 6.5 7 12 5+7
6.5 – 7.5 13 25 5+7+13
7.5 – 8.5 13 38 5+7+13+13
8.5 – 9.5 9 47 5+7+13+13+9
9.5 – 10.5 10 57 5+7+13+13+9+10
10.5 – 11.5 3 60 5+7+13+13+9+10+3
Total 60
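
The running totals in the two tables above can be checked with a few lines of Python. This is only an illustrative sketch (Python is not part of the module, and the variable names are chosen here for the example). It accumulates the class frequencies of the temperature data.

```python
from itertools import accumulate

# Class frequencies of the temperature distribution (example 2).
frequencies = [4, 10, 8, 15, 9, 3, 1]

# Running total: each entry is the sum of all frequencies up to and
# including that class, i.e. the "less than" cumulative frequency.
cumulative = list(accumulate(frequencies))
print(cumulative)  # [4, 14, 22, 37, 46, 49, 50]
```

The last cumulative frequency always equals the sample size, which is a quick check on the arithmetic.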

Relative and percentage frequencies


• Relative frequency = frequency/sample size, i.e. $Rf = \frac{f}{n}$.
• The percentage frequency of a class is calculated from relative frequency × 100.
Examples

1) The relative and percentage frequencies for the frequency distribution in example
2 (temperature data) are shown below.

class boundaries f relative frequency percentage frequency


37.5 – 41.5 4 0.08 8
41.5 – 45.5 10 0.2 20
45.5 – 49.5 8 0.16 16
49.5 – 53.5 15 0.3 30
53.5 – 57.5 9 0.18 18
57.5 – 61.5 3 0.06 6
61.5 – 65.5 1 0.02 2

2) The relative and percentage frequencies for the frequency distribution in example 3
(expenditure data) are shown below.

classes f relative frequency percentage frequency
4.5 – 5.5 5 0.083 8.3
5.5 – 6.5 7 0.117 11.7
6.5 – 7.5 13 0.217 21.7
7.5 – 8.5 13 0.217 21.7
8.5 – 9.5 9 0.15 15
9.5 – 10.5 10 0.167 16.7
10.5 – 11.5 3 0.05 5
Total 60 1 100
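
As a rough check of the expenditure table, the relative and percentage frequencies can be recomputed from the class frequencies. The sketch below uses variable names chosen for illustration only and rounds the results as in the table.

```python
# Class frequencies of the expenditure distribution (example 3).
frequencies = [5, 7, 13, 13, 9, 10, 3]
n = sum(frequencies)  # sample size

relative = [f / n for f in frequencies]      # Rf = f / n
percentage = [100 * r for r in relative]     # relative frequency x 100

for f, r, p in zip(frequencies, relative, percentage):
    print(f"{f:>3} {r:.3f} {p:.1f}")
```

The relative frequencies add up to 1 and the percentage frequencies to 100, apart from rounding.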

Histogram

A histogram is the graphical representation of a frequency distribution. The frequency
for each class is represented by a rectangular bar with the class boundaries as base
and the frequency as height.
Example

A histogram of the frequency distribution in example 2 (temperature data) is shown below.

[Figure: histogram of the temperature data. Horizontal axis: temperature (classes 37.5-41.5 to 61.5-65.5); vertical axis: frequency (0 to 16).]

Frequency polygon

This is also a graphical representation of a frequency distribution. For each class the
class midpoint is plotted against the frequency and the plotted points joined by means
of straight lines.

Example

For the temperature data the following values are plotted.

midpoint 35.5 39.5 43.5 47.5 51.5 55.5 59.5 63.5 67.5
f 0 4 10 8 15 9 3 1 0

The plot is shown below.

[Figure: frequency polygon of the temperature data. Horizontal axis: midpoint (0 to 80); vertical axis: frequency (0 to 16).]

Note:
The two plotted values at the lower and upper ends were added to anchor the graph to the
horizontal axis. The lower end value is a plot of 0 versus the midpoint of the class below the
first (lowest) class (35.5). This midpoint is obtained by subtracting the class width (4) from the
midpoint of the lowest class (39.5). The upper end value is a plot of 0 versus the midpoint of
the class above the last class (67.5). This midpoint is obtained by adding the class width (4) to
the midpoint of the last (highest) class (63.5).

The histogram and frequency polygon are equivalent graphical representations of the pattern
of the frequencies shown in the frequency distribution.

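The histogram can provide an estimate of the probability (chance) that a value drawn at
random from the data set will lie between two values: the frequencies of the classes between
the two values are added and the sum is divided by the sample size.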

Examples

1) For the frequency distribution in example 2 (temperature data), the estimated chance
that a randomly drawn value will be at least 45.5 but less than 57.5 is
$$\frac{8 + 15 + 9}{50} = \frac{32}{50} = 0.64.$$

2) For the frequency distribution in example 3 (monthly expenditure), the estimated
chance that a randomly drawn value will be at least 7.5 is
$$\frac{13 + 9 + 10 + 3}{60} = \frac{35}{60} = 0.583.$$
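
The two estimates above can be reproduced with a small helper. This is a sketch only: the function name `estimated_chance` is made up for this illustration, and it assumes that the two limits coincide with class boundaries, so that each class lies either completely inside or completely outside the range.

```python
def estimated_chance(boundaries, frequencies, low, high):
    """Estimate the chance that low <= value < high from a frequency distribution.

    Assumes low and high coincide with class boundaries.
    """
    n = sum(frequencies)
    inside = sum(f for (lo, hi), f in zip(boundaries, frequencies)
                 if lo >= low and hi <= high)
    return inside / n

temp_boundaries = [(37.5, 41.5), (41.5, 45.5), (45.5, 49.5), (49.5, 53.5),
                   (53.5, 57.5), (57.5, 61.5), (61.5, 65.5)]
temp_freq = [4, 10, 8, 15, 9, 3, 1]
p_temp = estimated_chance(temp_boundaries, temp_freq, 45.5, 57.5)

exp_boundaries = [(4.5, 5.5), (5.5, 6.5), (6.5, 7.5), (7.5, 8.5),
                  (8.5, 9.5), (9.5, 10.5), (10.5, 11.5)]
exp_freq = [5, 7, 13, 13, 9, 10, 3]
p_exp = estimated_chance(exp_boundaries, exp_freq, 7.5, 11.5)

print(round(p_temp, 3), round(p_exp, 3))  # 0.64 0.583
```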

“Less than” ogive


This is the graph of the cumulative frequencies versus the upper class boundaries.

Example

For the “less than” ogive of the frequency distribution in example 2 (temperature data)
the following values are plotted.

class boundary 37.5 41.5 45.5 49.5 53.5 57.5 61.5 65.5
cumulative frequency 0 4 14 22 37 46 49 50
[Figure: “less than” ogive of the temperature data. Horizontal axis: class boundary (0 to 70); vertical axis: cumulative frequency (0 to 60).]

Note:
The plotted value at the lower end was added to anchor the graph to the horizontal axis. The
lower end value is a plot of 0 versus the upper class boundary of the class below the first
(lowest) class (37.5). This upper class boundary is obtained by subtracting the class width (4)
from the upper class boundary of the lowest class (41.5).

A percentage “less than” ogive can be plotted by just changing the vertical scale. In this
example the frequencies add up to 50, so a frequency is converted to a percentage by
multiplying it by 100/50 = 2. To draw the percentage ogive, each cumulative frequency
in the above table is therefore multiplied by 2. The resulting graph is shown below.
The value below which a given percentage of the observations in the data set lie can be
read off from the ogive.

[Figure: percentage “less than” ogive of the temperature data. Horizontal axis: boundaries (0 to 70); vertical axis: % cumulative frequency (0 to 120).]
The shape of a distribution

The main purpose of drawing a histogram is to describe the clustering pattern of the values
in the data set. For a large sample size, the histogram (frequency polygon) can be fairly well
approximated by a smooth curve (called a frequency curve) that is fitted to the frequencies.
The following patterns of the shape of the frequency curve appear regularly in data sets.

Symmetric bell shape

[Figure: symmetric bell-shaped frequency curve. Horizontal axis: x (-4 to 4); vertical axis: frequency (0 to 0.45).]

This shape is for data sets where the majority of values are in the central portion of the scale
with fewer and fewer values the further away from the center (in both directions). Many data
sets have this shape. Examples are

1) Marks obtained in an examination.


2) Heights of a large group of adult males.
3) IQ scores in a large population.

Uniform (rectangular) shape

[Figure: uniform (rectangular) frequency curve. Horizontal axis: x (0 to 6); vertical axis: frequency (0 to 0.12).]

This shape occurs when all the values in the data set occur approximately the same number
of times.
Examples are

1) Frequencies of winning numbers in a large number of Lotto draws.

2) Frequencies of winning numbers in a large number of roulette games.

3) Frequencies obtained when tossing an unbiased coin and recording 0 if tails come up and 1
if heads come up.

Bimodal shape

[Figure: bimodal frequency curve. Horizontal axis: body length in mm (0 to 120); vertical axis: frequency (0 to 60).]

This pattern, which shows two distinct peaks (hence the name bimodal), appears
when there are two subgroups with different sets of values in the same data set.

Examples

1) Measuring the body lengths of ants when there are adults and juveniles together in
the same data set. The two peaks in the curve reflect the fact that juvenile ants have
shorter body lengths than adult ants.

2) Heights of a population of males and females. Since the females are shorter than the
males, the frequency curve will have two peaks. One peak will be located where the
most female heights are concentrated and one where the most male heights are
concentrated.
Positive skew shape

[Figure: positively skewed frequency curve. Horizontal axis: x (0 to 14); vertical axis: frequency (0 to 1.2).]

This shape shows a high clustering of values at the lower end of the scale and less and less
clustering further away from the lower end towards the upper end.

Example
The time it takes to serve a customer at a supermarket. For most customers the service time
is quite short. The longer the service time, the fewer the customers.

Negative skew shape

[Figure: negatively skewed frequency curve. Horizontal axis: x (0 to 16); vertical axis: frequency (0 to 0.3).]

This shape shows a high clustering of values at the upper end of the scale and less and less
clustering further away from the upper end towards the lower end.

Example
Marks in a test where most students did well, but a few performed poorly.

2.4 Measures of central tendency (location)


A measure of central tendency is a value that shows the location on the scale where a data
set is centrally located (most values are clustered around it).

In the calculations a distinction will be made between methods used when the data are in
raw form (values as collected) or grouped form (form of a frequency distribution).

2.4.1 Raw data: The mean (average), median and mode
Mean:
The mean (or average) of a set of data values is the sum of all of the data values in
the set divided by n, the number of data values. That is

$$\text{mean} = \bar{x} = \frac{1}{n}\sum x.$$

$\bar{x}$ is pronounced “x bar”.

Example
The marks of seven students in a mathematics test with a maximum possible mark of 20 are
given below:
15 13 18 16 14 17 12

$$\text{mean} = \bar{x} = \frac{\sum x}{n} = \frac{15 + 13 + 18 + 16 + 14 + 17 + 12}{7} = \frac{105}{7} = 15.$$

Median:
The median is the value in the data set which is such that half of the values in the data
set are less than or equal to it and half greater than or equal to it.

For an odd number of values in the data set, the median is the middle value of the
data set when it has been arranged in ascending order. That is, from the smallest
value to the largest value.

If the number of values in the data set is even, then the median is the average of the
two middle values.
Examples

1) The marks of nine students in a geography test that had a maximum possible mark of 50
are given below:

47 35 37 32 38 39 36 34 35

Find the median of this set of data values.

Arrange the data values in order from the lowest value to the highest value:

32 34 35 35 36 37 38 39 47

The number of values n = 9 is odd, so the median is the middle (5th) value.

Median = 36.

2) Consider the above data set with the first value (47) omitted.

Arrange the data values in order from the lowest value to the highest value:

32 34 35 35 36 37 38 39

In this case the number of values n = 8, which is an even number. The two middle values in
the data set are in positions $\frac{n}{2} = \frac{8}{2} = 4$ and $\frac{n}{2} + 1 = 5$, i.e. the values 35 and 36.

$$\text{Median} = \frac{35 + 36}{2} = 35.5.$$

Mode:
The mode of a set of data values is the value(s) that occurs most often.
Example:
Find the mode of the following data set:
48 44 48 45 42 49 48
The mode is 48 since it occurs most often.

Note

1) It is possible for a set of data values to have more than one mode.
2) If there are two data values that occur most frequently, we say that the set of data
values is bimodal e.g. the data set 2 2 4 5 5 6 has two modes (2 and 5).
3) If no value in the data set occurs more than once, it has no mode e.g. the data set 4
5 7 9 has no mode.
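
The raw-data examples above can be checked with Python's standard `statistics` module. This is only an illustrative sketch, with variable names chosen for the example. One difference from the convention in note 3 should be kept in mind: `statistics.multimode` returns every value when no value is repeated, whereas these notes say such a data set has no mode.

```python
import statistics

maths_marks = [15, 13, 18, 16, 14, 17, 12]
mean_mark = statistics.mean(maths_marks)

geography_marks = [47, 35, 37, 32, 38, 39, 36, 34, 35]
median_odd = statistics.median(geography_marks)       # n = 9: middle value
median_even = statistics.median(geography_marks[1:])  # n = 8: average of the two middle values

modes_single = statistics.multimode([48, 44, 48, 45, 42, 49, 48])
modes_bimodal = statistics.multimode([2, 2, 4, 5, 5, 6])

print(mean_mark, median_odd, median_even)  # 15 36 35.5
print(modes_single, modes_bimodal)         # [48] [2, 5]
```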
Comparison of mean, median and mode

1) The mean is used as a measure of central tendency for symmetrical, bell-shaped data that
do not have extreme values (extreme values are called outliers).

2) The median may be more useful than the mean when there are extreme values in the data
set as it is not affected by the extreme values.

3) The mode is useful when the most common item, characteristic or value of a data set is
required.

Examples

1) The amounts (thousands) for which each of 7 properties were sold are shown below.

280, 390, 412, 555, 698, 725, 2 350

For this data set mean = $\bar{x}$ = 772.86. This value of the mean is not a central value for
the data set (it is greater than all the values but the largest one). The reason for this is
that the last value (2 350) has a considerable influence on the value of the mean.

The median = 555 is a value that is more centrally located than the mean. Unlike the
mean, the median is not influenced by the large last value in the data set.

2) For qualitative (non-numerical) data only the mode can be calculated. For example,
suppose 10 rate payers are asked whether they think the percentage increase in rates
is reasonable. They can either agree (A), disagree (D) or be neutral (N) on the issue.
Their responses are shown below.

A, A, D, N, D, A, D, D, N, N.

For this data set the modal response is D (since D occurs more times than the other
responses). It is not possible to calculate a median or a mean for this data set.

The weighted mean

When calculating the mean for raw data, it is usually assumed that all the values in the data
set are equally important. If the values are not all considered equally important, the weighted
mean ($\bar{x}_w$) is calculated according to the formula below.

$$\bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_r x_r}{w_1 + w_2 + \cdots + w_r} = \frac{\sum_{i=1}^{r} w_i x_i}{\sum_{i=1}^{r} w_i}$$

In the formula x1, x2, . . . , xr are the values and w1, w2, . . . , wr their respective weights.

Example

The final mark (percentage) in a certain course is based on an assignment mark (which counts
for 10% of the final mark), a test mark (which counts for 30% of the final mark) and an exam
mark (which counts for 60% of the final mark). Calculate the final mark of a student
who gets a 65% assignment mark, a 70% test mark and a 55% exam mark.

Solution:
The above formula is applied with
x1 = 65, x2 = 70, x3 = 55,
w1 = 10, w2 = 30, w3 = 60.

$$\bar{x}_w = \frac{65 \times 10 + 70 \times 30 + 55 \times 60}{10 + 30 + 60} = \frac{6050}{100} = 60.5.$$
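
The same calculation can be written as a short function. The sketch below is illustrative only; the name `weighted_mean` is chosen here and is not part of the notes.

```python
def weighted_mean(values, weights):
    """Sum of weight x value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Assignment, test and exam marks with weights 10, 30 and 60.
final_mark = weighted_mean([65, 70, 55], [10, 30, 60])
print(final_mark)  # 60.5
```

With equal weights the function reduces to the ordinary mean, which is a convenient check.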

2.4.2 Grouped data


Mean:
For grouped data the mean is calculated from the formula below.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{k} x_{mid(i)} f_i$$

where xmid(i) is the midpoint of the ith class, fi the frequency of the ith class, k the number
of classes and n the sample size.
This formula is a special case of the weighted mean formula with wi = fi and
$\sum_{i=1}^{k} w_i = n$.

Example

For the frequency distribution of temperatures (example 2 of the frequency distributions),
the mean can be calculated as shown below.

Class boundaries xmid(i) fi xmid(i)·fi
37.5 – 41.5 39.5 4 158
41.5 – 45.5 43.5 10 435
45.5 – 49.5 47.5 8 380
49.5 – 53.5 51.5 15 772.5
53.5 – 57.5 55.5 9 499.5
57.5 – 61.5 59.5 3 178.5
61.5 – 65.5 63.5 1 63.5
Total 50 2487

$$\text{mean} = \frac{2487}{50} = 49.74.$$
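
The column totals in the table can be verified as follows. This is a minimal sketch with variable names chosen for illustration. It multiplies each class midpoint by its frequency and divides the total by the sample size.

```python
# Midpoints and frequencies of the temperature distribution (example 2).
midpoints = [39.5, 43.5, 47.5, 51.5, 55.5, 59.5, 63.5]
frequencies = [4, 10, 8, 15, 9, 3, 1]

n = sum(frequencies)                                        # sample size
total = sum(x * f for x, f in zip(midpoints, frequencies))  # sum of xmid(i) * fi
grouped_mean = total / n

print(n, total, round(grouped_mean, 2))  # 50 2487.0 49.74
```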
2.5 Measures of variability (variation, spread, dispersion)
Variability refers to the extent to which the values in a data set vary around (differ from)
the associated measure of central tendency.
Example

The performance of 2 different stocks is monitored over a period of 8 days. Their values are
shown in the table below.

day 1 2 3 4 5 6 7 8
A 103 120 112 108 130 106 120 112
B 112 97 85 123 153 85 146 110

The dot plot that follows shows the performance of each stock.

The mean values for the two stocks are the same (=113.875), but they differ in variability
(extent of spread around the mean). Stock B has a far wider spread around the mean than
stock A.

2.5.1 Raw data


Range: range = (maximum value in data set) – (minimum value in data set)

Example: For the stocks data sets


Range for stock A = 130 – 103 = 27
Range for stock B = 153 – 85 = 68
The larger (wider) spread in the stock B values is reflected in the larger range (more
than twice that of stock A).

Standard deviation and variance

The sample variance (denoted by S²) is a measure of variability based on squared
differences between the values in the data set and the mean. For a sample of n values,

$$S^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{\sum x^2 - \left(\sum x\right)^2 / n}{n - 1}.$$

The variance is expressed in the data units squared.

The standard deviation $S = \sqrt{S^2}$, which is the positive square root of the variance,
is expressed in the same units as the data.

Example
For stock A the standard deviation is calculated as follows.

x = score A x2
103 10609
120 14400
112 12544
108 11664
130 16900
106 11236
120 14400
112 12544
sum 911 104297

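$$S^2 = \frac{104297 - 911^2 / 8}{7} = \frac{556.875}{7} = 79.554, \qquad S = \sqrt{79.554} = 8.919.$$

For stock B the standard deviation is 25.682 (check this using STATMODE).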

Interpretation: The stock A values differ (on average) from the mean by 8.919, while stock
B values differ (on average) from the mean by almost 3 times this amount.
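
Both standard deviations can also be checked in Python. The sketch below is illustrative only: the function `sample_variance` is written here from the sums of x and x², as in the table for stock A, and its result is compared with `statistics.stdev` from the standard library, which also divides by n − 1.

```python
import math
import statistics

def sample_variance(values):
    """Sample variance from the sums of x and x squared (divisor n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

stock_a = [103, 120, 112, 108, 130, 106, 120, 112]
stock_b = [112, 97, 85, 123, 153, 85, 146, 110]

sd_a = math.sqrt(sample_variance(stock_a))
sd_b = math.sqrt(sample_variance(stock_b))
print(round(sd_a, 3), round(sd_b, 3))  # 8.919 25.682
```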
2.5.2 Grouped data
Standard deviation and variance

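For grouped data, the raw data formulae for the variance and standard deviation are
modified slightly: each class midpoint is weighted by its class frequency.

$$S^2 = \frac{\sum_{i=1}^{k} x_{mid(i)}^2 f_i - \left(\sum_{i=1}^{k} x_{mid(i)} f_i\right)^2 / n}{n - 1}$$

As before, standard deviation $S = \sqrt{S^2}$.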

Example

For the frequency distribution of temperatures (example 2 of the frequency distributions),
the variance and standard deviation can be calculated as shown below.

class boundaries xmid(i) fi xmid(i)·fi xmid(i)²·fi
37.5 – 41.5 39.5 4 158 6241
41.5 – 45.5 43.5 10 435 18922.5
45.5 – 49.5 47.5 8 380 18050
49.5 – 53.5 51.5 15 772.5 39783.75
53.5 – 57.5 55.5 9 499.5 27722.25
57.5 – 61.5 59.5 3 178.5 10620.75
61.5 – 65.5 63.5 1 63.5 4032.25
Total 50 2487 125372.5

$$\text{variance} = S^2 = \frac{125372.5 - 2487^2 / 50}{49} = 34.06367$$

$$\text{standard deviation} = S = \sqrt{34.06367} = 5.836.$$
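
A short sketch of the grouped-data calculation, using the midpoints and frequencies from the table. The variable names are chosen for illustration only.

```python
import math

# Midpoints and frequencies of the temperature distribution (example 2).
midpoints = [39.5, 43.5, 47.5, 51.5, 55.5, 59.5, 63.5]
frequencies = [4, 10, 8, 15, 9, 3, 1]

n = sum(frequencies)
sum_xf = sum(x * f for x, f in zip(midpoints, frequencies))       # 2487
sum_x2f = sum(x * x * f for x, f in zip(midpoints, frequencies))  # 125372.5

variance = (sum_x2f - sum_xf ** 2 / n) / (n - 1)
sd = math.sqrt(variance)
print(round(variance, 5), round(sd, 3))  # 34.06367 5.836
```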

2.6 Coefficient of variation


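The standard deviations of 2 data sets that are expressed in different units cannot be
directly compared. Such a comparison can be done by calculating the coefficient of
variation (CV), which expresses the standard deviation as a percentage of the mean:

$$CV = \frac{S}{\bar{x}} \times 100.$$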

Example:

For the temperature data, $\bar{x}$ = 49.74 and S = 5.836.

For the expenditure data (see example 3 of the frequency distributions) $\bar{x}$ = 7.93333 and
S = 1.65567.

Since the two standard deviations that were calculated above are in different units, they
cannot be compared directly.

Coefficient of variation (temperature) = $\frac{5.836}{49.74} \times 100$ = 11.73.
Coefficient of variation (expenditure) = $\frac{1.65567}{7.93333} \times 100$ = 20.87.

The coefficient of variation calculations show that in relative terms the variability for the
expenditure data set is greater than that of the temperature data set.
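
The comparison can be sketched in Python, taking the coefficient of variation as the standard deviation expressed as a percentage of the mean. The function name is chosen for this illustration.

```python
def coefficient_of_variation(mean, sd):
    """Standard deviation as a percentage of the mean."""
    return sd / mean * 100

cv_temperature = coefficient_of_variation(49.74, 5.836)
cv_expenditure = coefficient_of_variation(7.93333, 1.65567)
print(round(cv_temperature, 2), round(cv_expenditure, 2))  # 11.73 20.87
```

Because the units cancel in the ratio, the two results can be compared directly even though the original data sets are in different units.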

2.7 Bell-shaped data


If it is known that the data set of interest has a bell-shaped clustering pattern of the values,
the Empirical rule says that
(i) Approximately 68% of data values are within 1 standard deviation of the mean.
(ii) Approximately 95% of data values are within 2 standard deviations of the mean.
(iii) Approximately 99.7% of data values are within 3 standard deviations of the mean.

Example:
Men’s heights have a bell-shaped distribution with a mean of 69.2 inches and a standard
deviation of 2.9 inches.

Approximately 68% of data values are within 69.2 ± 2.9 = (66.3, 72.1).
Approximately 95% of data values are within 69.2 ± 5.8 = (63.4, 75).
Approximately 99.7% of data values are within 69.2 ± 8.7 = (60.5, 77.9).
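
The three intervals follow directly from the mean and standard deviation. The sketch below uses a helper name made up for this illustration and rounds the end points to one decimal place, as in the example.

```python
def empirical_rule_intervals(mean, sd):
    """Intervals mean - k*sd to mean + k*sd for k = 1, 2, 3 (one decimal place)."""
    return {k: (round(mean - k * sd, 1), round(mean + k * sd, 1))
            for k in (1, 2, 3)}

intervals = empirical_rule_intervals(69.2, 2.9)
for k, share in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    print(f"about {share} of heights lie in {intervals[k]}")
```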

2.8 Measures of position – percentiles

2.8.1 Definitions
The ith percentile, Pi, is the value that has i% of the values in a data set less than or equal to it
(0 < i ≤ 100).

Examples

• Median = me = 50th percentile = P50.

• First quartile = Q1 = 25th percentile = P25.

• Third quartile = Q3 = 75th percentile = P75.


• The 9 deciles D1, D2, . . . , D9 are the values that have 10%, 20%, . . . , 90%
respectively of the values in the data set less than or equal to them.

D1 = P10, D2 = P20, … , D5 = P50 = me, … ,D9 = P90.

2.8.2 Calculation of quartiles and quartile deviation for raw data
For raw data the calculations of the first and third quartiles are based on the same principles
as that of the median.

Steps to be followed in calculating the first and third quartiles for raw data

1) Organize the values in the data set in ascending order in magnitude.

2) Find the median.

3) Divide the data set into 2 portions of equal numbers of values – set 1 consists of those
values less than or equal to the median and set 2 consists of those values greater than or
equal to the median. When the data set has an odd number of values, the median is
excluded from the division of the data set into 2 portions.

4) The first quartile (Q1) is the median of set 1 and the third quartile (Q3) is the median
of set 2.

Example

The distance from home to work (kilometers) of 11 employees at a certain company are
shown below. Calculate Q1 and Q3.

6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36

1) Ordered data set: 6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49

2) Median = 40. After this step the median is deleted from the data set.

3) Set 1 – 5 values less than median i.e. 6, 7, 15, 36, 39.

4) Set 2 – 5 values greater than the median i.e. 41, 42, 43, 47, 49.

5) Q1 = median of set 1 = 15,


Q3 = median of set 2 = 43.
Example

Suppose the data set consists of the above values and 56 (12 values).

6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36, 56

1) Ordered Data Set: 6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 56

2) median = $\frac{40 + 41}{2}$ = 40.5. Unlike what was done in example 1, no values are deleted
from the data set.

3) Set 1 – 6 values less than or equal to the median i.e. 6, 7, 15, 36, 39, 40.

Set 2 – 6 values greater than or equal to the median i.e. 41, 42, 43, 47, 49, 56.

4) Q1 = median of set 1 = $\frac{15 + 36}{2}$ = 25.5, Q3 = median of set 2 = $\frac{43 + 47}{2}$ = 45.

The quartile deviation $Q = \frac{Q_3 - Q_1}{2}$ can also be used as a measure of variability.
For the data set in example 1, quartile deviation = Q = (43 – 15)/2 = 14.

The quartile deviation value shows the extent to which the values in the data set deviate from
the median. For a skew data set (heavy clustering at lower or upper end of the scale) the
quartile deviation is a more appropriate measure of variability than the standard deviation
(which is more suitable as a measure of variability for symmetric data sets).
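
The steps for the quartiles of raw data can be sketched in Python as shown below. The function name `quartiles_by_halves` is made up for this illustration, and the data set is assumed to contain at least two values. Library routines for quartiles may follow other conventions and give slightly different values, so this sketch follows the method of these notes.

```python
import statistics

def quartiles_by_halves(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the ordered data.

    For an odd number of values the overall median is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]                   # set 1
    upper = ordered[len(ordered) - half:]    # set 2
    return statistics.median(lower), statistics.median(upper)

distances = [6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36]
q1, q3 = quartiles_by_halves(distances)
quartile_deviation = (q3 - q1) / 2
print(q1, q3, quartile_deviation)            # 15 43 14.0

print(quartiles_by_halves(distances + [56])) # (25.5, 45.0)
```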

2.8.3 Calculation of median, quartiles and percentiles for grouped data

Percentile class – class that contains the percentile that is calculated.

A formula for calculating the ith percentile Pi for grouped data is shown below.

$$P_i = L_i + \frac{\left(\frac{i \times n}{100} - F_{less}\right) \times c}{f_i}, \qquad i = 1, 2, \ldots, 100.$$

Li = lower class boundary of percentile class.

fi = frequency of percentile class

n = sample size

Fless = Sum of frequencies of classes less than percentile class.

c = class width.

Example

For the frequency distribution of temperatures (example 2 of the frequency distributions –
table given below), the calculations of the median, first quartile, third quartile, 4th decile and
65th percentile are shown below.

class boundaries f cumulative frequency
37.5 – 41.5 4 4
41.5 – 45.5 10 14
45.5 – 49.5 8 22
49.5 – 53.5 15 37
53.5 – 57.5 9 46
57.5 – 61.5 3 49
61.5 – 65.5 1 50
Total 50

Median

The above formula with i = 50, n = 50 applies.

Step 1: Calculate position of median = $\frac{i \times n}{100} = \frac{50 \times 50}{100}$ = 25.

Step 2: Median class (class that contains 25th observation) is the class 49.5 – 53.5.

Step 3: L50 = 49.5, f50 = 15, Fless = 22, c = 4.

Step 4: Substitute into the above formula.

$$\text{Median} = 49.5 + \frac{(25 - 22) \times 4}{15} = 50.3.$$

First quartile

The above formula with i = ___ , n =____ applies.

Step 1: Calculate position of first quartile :

Step 2: First quartile class (class that contains 12.5th observation) is the class __________

Step 3: L25 = ______ , f25 =______ , Fless = ______ , c = ______.

Step 4: Substitute into the above formula.

Q1 =

Third quartile

The above formula with i =_____ , n = _____ applies.

Step 1: Calculate position of third quartile :

Step 2: Third quartile class (class that contains 37.5th observation) is the class ___________

Step 3: L75 =______ , f75 =______ , Fless =______ , c =______

Step 4: Substitute into the above formula.

Q3 =

Fourth decile

The above formula with i = 40, n = 50 applies.

Step 1: Calculate position of 4th decile = $\frac{i \times n}{100} = \frac{40 \times 50}{100}$ = 20.

Step 2: 4th decile class (class that contains 20th observation) is the class 45.5 – 49.5.

Step 3: L40 = 45.5, f40 = 8, Fless = 14, c = 4.

Step 4: Substitute into the above formula.

$$D_4 = 45.5 + \frac{(20 - 14) \times 4}{8} = 48.5.$$

65th Percentile

The above formula with i = 65, n = 50 applies.

Step 1: Calculate position of 65th percentile = $\frac{i \times n}{100} = \frac{65 \times 50}{100}$ = 32.5.

Step 2: 65th percentile class (class that contains 32.5th observation) is the class 49.5 – 53.5.

Step 3: L65 = 49.5, f65 = 15, Fless = 22, c = 4.

Step 4: Substitute into the above formula.

$$P_{65} = 49.5 + \frac{(32.5 - 22) \times 4}{15} = 52.3.$$
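
The four steps used in the worked examples can be collected in one function. This is a hedged sketch rather than a prescribed method: the name `grouped_percentile` is chosen here, the percentile class is taken to be the first class whose cumulative frequency reaches the position i × n / 100, and the class width is taken as the difference between the class boundaries.

```python
def grouped_percentile(i, boundaries, frequencies):
    """ith percentile of grouped data by interpolation within the percentile class."""
    n = sum(frequencies)
    position = i * n / 100          # step 1: position of the percentile
    f_less = 0                      # sum of frequencies of the classes below
    for (lower, upper), f in zip(boundaries, frequencies):
        if f_less + f >= position:  # step 2: percentile class found
            # steps 3 and 4: substitute L, f, Fless and c = upper - lower
            return lower + (position - f_less) * (upper - lower) / f
        f_less += f
    raise ValueError("i must be between 0 and 100")

boundaries = [(37.5, 41.5), (41.5, 45.5), (45.5, 49.5), (49.5, 53.5),
              (53.5, 57.5), (57.5, 61.5), (61.5, 65.5)]
frequencies = [4, 10, 8, 15, 9, 3, 1]

median = grouped_percentile(50, boundaries, frequencies)
d4 = grouped_percentile(40, boundaries, frequencies)
p65 = grouped_percentile(65, boundaries, frequencies)
print(round(median, 1), round(d4, 1), round(p65, 1))  # 50.3 48.5 52.3
```

The same function can be used to check the answers to the first and third quartile exercises once they have been worked out by hand.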

Percentiles can also be read off from a “less than” ogive.

Example

The cumulative frequency graph on the following page shows the distribution of marks
scored by a class of 40 students in a test.

From the graph Q1 = 36, Me = 44, Q3 =52.

2.9 Five number summary and Box-and-Whisker plot

Five number summary


• A five number summary of a data set is a summary using the minimum, 1st quartile,
median, 3rd quartile and maximum as summary measures.
• The five number summary shows the following types of information.

type value(s)
central tendency median
deviation quartile deviation = $Q = \frac{Q_3 - Q_1}{2}$
extremes minimum and maximum
Example
The IQ’s of 13 people are shown below.

92, 104, 93, 98, 112, 145, 88, 90, 104, 119, 101, 95, 154

minimum = 88
Q1 = 92.5
median = 101
Q3 = 115.5
maximum = 154

Interquartile range = IQR = Q3 – Q1

This gives the range of the middle 50% of the data.
Example continued:
IQR = 115.5 – 92.5 = 23
The middle 50% of the data values are spread over 23 units – they range from 92.5 to
115.5.

Quartile deviation = Q = IQR ÷ 2


This is also sometimes called the semi interquartile range. It shows the extent to which
the values in the data set deviate from the median.
Example continued:
Q = 23 ÷2 = 11.5

Box-and-Whisker plot

A box-and-whisker plot is a graphical representation of some important values of a data set.


• The median – a measure of central location – is given in the box-and-whisker plot.
• Q1 and Q3 are shown so the interquartile range and the quartile deviation – both
measures of variation – can be found.
• The minimum and maximum values are also shown so the range – a measure of
variation – can be calculated.
• Outliers are shown. Outliers are values that are unusually small or unusually large
relative to the other values in the data set.

To find outliers “cut-off” values must be found.

Lower cut-off value = Q* = Q1 – 1.5×IQR


Any value in the data set that is smaller than the lower cut-off value is considered
too small and so is an outlier.

Upper cut-off value = Q** = Q3 + 1.5×IQR


Any value in the data set that is larger than the upper cut-off value is considered too
large and so is an outlier.
Example continued:
IQR = Q3 – Q1 = 23
Q* = Q1 – 1.5×IQR
= 92.5 – (1.5)(23)
= 58
None of the values in the data set are smaller than the lower cut-off value so there are no
values that are “too small”.

Q** = Q3 + 1.5×IQR
= 115.5 + (1.5)(23)
= 150

The only value in the data set that is larger than this is 154. This value (154) is “too big” and
so is an outlier.
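
The cut-off calculation for the IQ data can be sketched as follows. The quartiles are taken from the five number summary given earlier rather than recomputed, and the variable names are chosen for illustration.

```python
iq_scores = [92, 104, 93, 98, 112, 145, 88, 90, 104, 119, 101, 95, 154]
q1, q3 = 92.5, 115.5             # from the five number summary

iqr = q3 - q1                    # interquartile range
lower_cutoff = q1 - 1.5 * iqr    # Q*
upper_cutoff = q3 + 1.5 * iqr    # Q**

# Values outside the two cut-off values are flagged as outliers.
outliers = [x for x in iq_scores if x < lower_cutoff or x > upper_cutoff]
print(iqr, lower_cutoff, upper_cutoff, outliers)  # 23.0 58.0 150.0 [154]
```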

Drawing a box-and-whisker plot


• Mark Q1 and Q3 on a suitable section of the number line. Draw a box extending from
Q1 to Q3.
• Mark the position of the median on the number line. Draw a vertical line in the box at
this position.
• Mark the positions of the outliers (if any) with a star or some other bold mark.
• Draw the “whiskers” by drawing a horizontal line connecting Q1 to the minimum value
and Q3 to the maximum value. Special case: When there are outliers, the left whisker
extends to the smallest value that is not an outlier and the right whisker extends to
the largest value that is not an outlier.

Example continued:
A Box-and-Whisker plot can also be used to assess the skewness (departure from
symmetry) of a variable.
• For positively skewed data most of the values are at the lower end of the scale
(mean > median, “box” section of the plot towards the lower end of the scale).
• For negatively skewed data most of the values are at the upper end of the scale (mean
< median, “Box” section of the plot towards the upper end of the scale).
• In the previous example the data set is positively skewed.

When several data sets are to be compared, several Box-and-Whisker plots can be plotted
side-by-side.

Example

The Box-and-Whisker plot shown below enables one to compare delays in departing flights
(in minutes) for certain days in December (16th to the 26th).

For all the days the data sets are positively skewed (data sets all have the “box” section
closer to the lower end of the scale with a long upper whisker). This means that most
delays in flight departures were short on all the days. The long upper whiskers that are visible
show that there were some quite late departures on 16, 17, 21, 22, 23, 24 and 25
December.

Chapter 3 – Probability

3.1 Terminology

Probability (Chance)
• A probability is the chance that something of interest will happen.
• A probability is expressed as a proportion i.e. it ranges from 0 to 1.
Chance can be expressed as a percentage i.e. it ranges from 0 to 100.

Examples

1) The probability of rain tomorrow is 0.40.
There is a 40% chance of rain tomorrow.

2) The probability of winning the Lotto is $\frac{1}{13983816}$.

3) The probability of a certain new product being successful is 0.75.

Random experiment
This is an experiment that gives different outcomes when repeated under similar conditions.

1) The experiment can have more than one possible outcome.

2) All possible outcomes can be listed.

3) The outcome that will occur when the experiment is performed depends on
chance.

Examples

1) Tossing a coin (possible outcomes: heads, tails).

2) Rolling a die (possible outcomes: 1, 2, 3, 4, 5, 6).

3) Asking a person to assign a rating to a product (possible outcomes: A, B, C, D, E).

4) Drawing a card from a deck of cards (possible outcomes: 13 hearts, 13 clubs, 13 spades,
13 diamonds).
Set
A set is a collection of outcomes.

Sample space
The sample space is the set of all possible outcomes of a random experiment. A
sample space is usually denoted by the symbol S and the collection of elements
contained in S enclosed in curly brackets { }.

Sample point
A sample point is an individual outcome (element) in a sample space.

Examples

1) Tossing a single coin. S = {h, t}.

2) Tossing a die. S = {1, 2, 3, 4, 5, 6}.

3) Tossing a pair of dice


S= { (1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
(2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
(3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6),
(4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6) }.

4) Tossing two coins. S = {hh, ht, th, tt}.

5) Drawing a card from a deck of cards. The elements in the sample space are listed
below.

S = {2♦ 3♦ 4♦ 5♦ 6♦ 7♦ 8♦ 9♦ 10♦ J♦ Q♦ K♦ A♦
2♥ 3♥ 4♥ 5♥ 6♥ 7♥ 8♥ 9♥ 10♥ J♥ Q♥ K♥ A♥
2♣ 3♣ 4♣ 5♣ 6♣ 7♣ 8♣ 9♣ 10♣ J♣ Q♣ K♣ A♣
2♠ 3♠ 4♠ 5♠ 6♠ 7♠ 8♠ 9♠ 10♠ J♠ Q♠ K♠ A♠ }

Each outcome listed in the above examples is a sample point.

Event
An event is a subset of a sample space i.e. a collection of sample points taken from a sample
space.

Impossible event
An impossible event is an event that cannot happen (has probability zero).

Certain event
A certain event is an event that is sure to happen (has probability 1).
Simple events are events that involve only one sample point (outcome) of the sample
space.

Examples

1) Let E denote the event “an odd number is obtained when tossing a single die”.
Then E = {1, 3, 5}.

2) Let H denote the event “at least one head appears when tossing two coins”.
H = {hh, ht, th}.

3) Let B denote the event “obtaining a club and a heart in a single draw from a deck of
cards”. The event B is impossible. The set of outcomes of B is an empty set denoted by
B = { } = φ.

4) Let A denote the event “obtaining a 1, 2, 3, 4, 5 or 6 when tossing a single die”. The
event A is a certain event i.e. one of the outcomes belonging to the set describing the
event must happen. This is denoted by A = S, where S is the sample space.

Venn diagrams
• A Venn diagram is a drawing, in which circular areas represent groups of items
usually sharing common properties.
• The drawing consists of two or more circles, each representing a specific group or
set, contained within a square that represents the sample space. Venn diagrams are
often used as a visual display when referring to sample spaces, events and
operations involving events.

3.2 Complements, Unions and Intersections of events
Compound events
These are events that involve more than one event. Such events can be obtained by
performing various operations involving two or more events.
Some of the operations that can be performed are described in the sections that follow.

Complementary events
The complementary event Ā (sometimes written A′) of an event A is all the outcomes in S
that are not in A.

Examples

1) Consider the experiment of tossing a single die. S = {1, 2, 3, 4, 5, 6}. The complement
of the event A = “obtaining a 3 or less” = {1, 2, 3} is
Ā = “obtaining a 4 or more” = {4, 5, 6}.

2) Consider the experiment of tossing two coins. S = {hh, ht, th, tt}. The complement of
the event H = “at least one head” = {hh, ht, th} is H̄ = “no heads” = {tt}.

Union and intersection of events

• The union of two events A and B, denoted by A ∪ B , is the set of outcomes that are
in A or in B or in both A and B i.e. the event that
“either A or B or both A and B occur”
or “at least one of A or B occurs”.

• The intersection of two events A and B, denoted by A ∩ B , is the set of outcomes
that are in both A and B i.e. the event that
“both A and B occur”.

The Venn diagrams below show the sets A ∪ B and A ∩ B .

Ā ∩ B is the event “a sample point is in B but not in A”.

A ∩ B̄ is the event “a sample point is in A but not in B”.

These definitions involving two events can be extended to ones involving 3 or more events
e.g. for the 3 events A1, A2 and A3 the event A1 ∪ A2 ∪ A3 is the event “at least one of A1, A2
or A3 occurs” and A1 ∩ A2 ∩ A3 the event “A1 and A2 and A3 occur”.

Examples

1) Consider the events A = {1, 3, 6, 7, 8} and B = {2, 3, 5, 7, 9} defined on a sample space
S = {1, 2, 3, . . . , 10}.

A ∪ B = {1, 2, 3, 5, 6, 7, 8, 9} , A ∩ B = { 3, 7},
Ā ∩ B = {2, 5, 9}, A ∩ B̄ = {1, 6, 8}.
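The set operations in example 1 can be checked with a short Python sketch. This is only an illustration: Python and the variable names below are not part of these notes, and complements are taken relative to the sample space S.

```python
# Sample space and events from example 1 above.
S = set(range(1, 11))
A = {1, 3, 6, 7, 8}
B = {2, 3, 5, 7, 9}

union = A | B            # outcomes in A or in B or in both
intersection = A & B     # outcomes in both A and B
b_not_a = (S - A) & B    # complement of A intersected with B
a_not_b = A & (S - B)    # A intersected with the complement of B

for name, event in [("A or B", union), ("A and B", intersection),
                    ("B but not A", b_not_a), ("A but not B", a_not_b)]:
    print(name, "=", sorted(event))
```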

2) Let C be the event “drawing a face card from a deck of cards” and A the event “drawing
a king or an ace from a deck of cards”.

C = {J♦, Q♦, K♦, J♥, Q♥, K♥, J♠, Q♠, K♠, J♣, Q♣, K♣}

A = {A♦, A♥, A♠, A♣, K♦, K♥, K♠, K♣}.

C ∪ A = {J♦, Q♦, K♦, J♥, Q♥, K♥, J♠, Q♠, K♠, J♣, Q♣, K♣,
A♦, A♥, A♠, A♣}.

C ∩ A = {K♦, K♥, K♠, K♣}.

Mutually exclusive (disjoint) events


Two events A and B are mutually exclusive (disjoint) if they have no elements
(outcomes) in common. This also means that these events cannot occur together.

Examples

1) Let B be the event “drawing a black card from a deck of cards” and R the event “drawing
a red card from a deck of cards”.

The events B and R have no outcomes in common i.e. B ∩ R = φ (empty set). Hence B
and R are mutually exclusive.

2) Let E be the event “an even number with a single throw of a die” and O the event “an
odd number with a single throw of a die” i.e. E = {2, 4, 6} and O = {1, 3, 5}.

E and O have no outcomes in common i.e. E ∩ O = φ and are therefore mutually
exclusive.

3.3 Definitions of probability

Classical definition of probability


If a sample space consists of n equally likely outcomes, of which m are favourable to
an event A, then the probability of occurrence of the event A, denoted by P(A), is
given by

$$P(A) = \frac{N(A)}{N(S)} = \frac{m}{n},$$

where N(A) = m is the number of outcomes favourable to the event A and N(S) = n
the number of outcomes in the sample space S i.e. the total number of outcomes.

Note: Since N(A) ≥ 0 and N(A) ≤ N(S), 0 ≤ P(A) ≤ 1.

Examples

1) Two coins are tossed. Find the probability of getting


(i) exactly two heads.
(ii) at least one head.

Solution:

Here, S = {hh, ht, th, tt} .


(i) Let A = getting exactly two heads = {hh}
∴ P(A) = ¼.

(ii) Let B = getting at least one head = {hh, ht, th}


∴ P(B) = ¾.

2) Two dice are rolled. Find the probability that a sum of 7 will occur.

Solution:
The number of sample points in S is 36 (see example 3 under sample space).

Let A = “a sum of 7 will occur” = {(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)}.

∴ P(A) = 6/36 = 1/6.
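As a check on the classical formula, the sample space for two dice can be enumerated directly. The sketch below is illustrative only (the names are assumptions, not part of the notes) and relies on all 36 ordered outcomes being equally likely.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes (first die, second die) for two fair dice.
sample_space = list(product(range(1, 7), repeat=2))

# Outcomes favourable to the event "a sum of 7 will occur".
event = [outcome for outcome in sample_space if sum(outcome) == 7]

# Classical definition: P(A) = N(A) / N(S).
p_sum_7 = Fraction(len(event), len(sample_space))
print(len(event), len(sample_space), p_sum_7)
```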

The classical definition of probability requires the assumption that all the outcomes in the
sample space are equally likely. If this assumption is not met, this formula cannot be used.
Example
The possible temperatures (degrees Celsius) in Durban on a particular day in December are

15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39.

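The list contains 25 values, so the classical formula would give
P(temperature = 15) = 1 ÷ 25 = 0.04. However, Durban is hot in December, so 15 degrees is
much less likely than, for example, 30 degrees. The outcomes are not equally likely and this
value does not seem reasonable.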

Relative frequency (empirical) definition of probability


If an experiment is repeated n times and an event A is observed f times, then the
estimated probability of occurrence of the event A is given by

$$P(A) = \frac{f}{n}.$$

Note: This formula differs from the classical formula in the sense that the classical
formula uses all the outcomes in the sample space as the total number of outcomes,
while the relative frequency formula uses the number of repetitions (n) of the
experiment as total number of outcomes. In the classical formula the number of
outcomes in the sample space is fixed, while the number of repetitions of an
experiment (n) can vary. It can be shown that the empirical probability is a good
approximation of the true probability when n is sufficiently large.

Examples

1) A bent coin is tossed 1000 times with heads coming up 692 times.
An estimate of P(h) is 692/1000 = 0.692.

2) A summary of the final marks in a certain statistics course is shown below.

Mark f
less than 30 6
30 – 39 26
40 – 49 45
50 – 59 64
60 – 69 82
70 – 79 37
80 – 89 22
90 – 99 8
Total 290

From the table (using the empirical formula) the following probabilities can be
estimated.
(a) P(mark less than 40) = (6 + 26)/290 = 32/290 = 0.110.

(b) P(pass) = (64 + 82 + 37 + 22 + 8)/290 = 1 − (6 + 26 + 45)/290 = 213/290 = 0.73,
where a pass is a mark of 50 or more.

(c) P(80 or above) = (22 + 8)/290 = 0.103.
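The relative frequency estimates from the marks table can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the class labels and the helper `estimate` are made up for illustration, and a pass is taken to be a mark of 50 or more, as in the calculation of P(pass).

```python
# Frequency table of final marks (class label -> frequency f).
freq = {
    "less than 30": 6, "30-39": 26, "40-49": 45, "50-59": 64,
    "60-69": 82, "70-79": 37, "80-89": 22, "90-99": 8,
}
n = sum(freq.values())  # total number of students

def estimate(classes):
    """Relative frequency f/n of a mark falling in any of the given classes."""
    return sum(freq[c] for c in classes) / n

p_below_40 = estimate(["less than 30", "30-39"])
# Assumption: the pass mark is 50.
p_pass = estimate(["50-59", "60-69", "70-79", "80-89", "90-99"])
p_80_plus = estimate(["80-89", "90-99"])

print(n, round(p_below_40, 3), round(p_pass, 3), round(p_80_plus, 3))
```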

Marginal and joint probabilities


• Probabilities involving the occurrence of single events are called marginal
probabilities.
• Probabilities involving the occurrence of two or more events are called joint
probabilities.

Example

The preference probabilities according to gender for 2 different brands of a certain product
are summarized in the table below.
The gender marginal probabilities are obtained by summing the joint probabilities over the
brands. The brand marginal probabilities are obtained by summing the joint probabilities
over the genders.

                                Brand
Gender                  1       2       Marginal probability
Male                    0.20    0.32    0.52
Female                  0.40    0.08    0.48
Marginal probability    0.60    0.40    1.00

Joint probabilities: P(male ∩ brand 1) = 0.20, P(male ∩ brand 2) = 0.32,
P(female ∩ brand 1) = 0.40, P(female ∩ brand 2) = 0.08.

Marginal probabilities: P(male) = 0.52, P(female) = 0.48,
P(brand 1) = 0.60, P(brand 2) = 0.40.
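Summing joint probabilities over one variable to obtain the marginal probabilities of the other can be sketched as follows. The dictionary layout and names are illustrative assumptions; the numbers are those of the gender and brand table.

```python
from collections import defaultdict

# Joint probabilities P(gender ∩ brand) from the table.
joint = {
    ("male", "brand 1"): 0.20, ("male", "brand 2"): 0.32,
    ("female", "brand 1"): 0.40, ("female", "brand 2"): 0.08,
}

gender_marginal = defaultdict(float)
brand_marginal = defaultdict(float)
for (gender, brand), p in joint.items():
    gender_marginal[gender] += p  # sum over the brands
    brand_marginal[brand] += p    # sum over the genders

print({g: round(p, 2) for g, p in gender_marginal.items()})
print({b: round(p, 2) for b, p in brand_marginal.items()})
```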

3.4 Counting formulae


The computation of probabilities using the classical definition involves counting the number
of outcomes favourable to the event of interest (say event A) and the total number of
possible outcomes in the sample space. The following formulae can be used to count
numbers of outcomes to be used in the classical definition formula.

Addition and Multiplication formulae for counting


Addition formula – If an experiment can be performed in n ways, and another experiment
can be performed in m ways then either of the two experiments can be performed in (n+m)
ways.

This rule can be extended to any finite number of experiments. If one experiment can be
done in n1 ways, a second one in n2 ways, . . . , a kth one in nk ways, then one of the k
experiments can be done in n1 + n2 +. . . + nk ways.

Example:
Suppose a man is standing in a room which has 2 doors to his left and 1 door to his
right. In how many ways can he leave the room?

Solution:
Let “leave the room by going to the left” be experiment 1 and “leave the room by
going to the right” be experiment 2. There are n=2 ways to do experiment 1 (he can
leave by door A or door B) and there is m=1 way to do experiment 2 (he can leave by
door C). In total there are n+m = 2+1 = 3 ways to leave the room.

Multiplication formula – If an experiment can be done in n ways and another experiment
can be done in m ways, then both of the experiments can be done in n × m ways.

This rule can be extended to any finite number of experiments. If one experiment can be
done in n1 ways, a second one in n2 ways, . . . , a kth one in nk ways, then the k
experiments together can be done in n1×n2×…×nk ways.
Example 1:
A basic meal consists of soup, a sandwich and a beverage. If a person having this
meal has 3 choices of soup, 4 choices of sandwiches and a choice of coffee or tea as
a beverage, how many such meals are possible?

Choosing soup (experiment 1) has 3 possibilities.


Choosing a sandwich (experiment 2) has 4 possibilities.
Choosing a beverage (experiment 3) has 2 possibilities.

Number of choices of meals = 3 × 4 × 2 = 24.

Example 2:
A PIN to be used at an ATM can be formed by selecting 4 digits from the digits
0, 1, 2, . . . , 9 . How many choices of PIN are there if

(a) digits may be repeated?


(b) digits may not be repeated?

(a) First digit – 10 choices,
second digit – 10 choices,
third digit – 10 choices,
fourth digit – 10 choices.
Number of choices = 10 × 10 × 10 × 10 = 10^4 = 10 000.

(b) First digit – 10 choices,
second digit – 9 choices,
third digit – 8 choices,
fourth digit – 7 choices.
Number of choices = 10 × 9 × 8 × 7 = 5040.
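Both PIN counts can be confirmed by brute-force enumeration. The sketch below is an illustration only (the variable names are assumptions): `itertools.product` lists 4-digit sequences with repetition allowed, and `itertools.permutations` lists those with no repeated digit.

```python
from itertools import permutations, product

digits = range(10)  # the digits 0, 1, ..., 9

# (a) Digits may be repeated: every 4-tuple of digits is a valid PIN.
with_repeats = sum(1 for _ in product(digits, repeat=4))

# (b) Digits may not be repeated: ordered selections of 4 distinct digits.
without_repeats = sum(1 for _ in permutations(digits, 4))

print(with_repeats, without_repeats)
```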

Factorial notation
In how many ways can n objects (n a positive integer) be arranged in a row?

Let n = 2 : 1st object – 2 choices,
2nd object – 1 choice.
Number of ways = 2 × 1 = 2.

Let n = 3 : 1st object – 3 choices,
2nd object – 2 choices,
3rd object – 1 choice.
Number of ways = 3 × 2 × 1 = 6.

In general: the number of ways is n × (n-1) × (n-2) ×. . . × 2 × 1 = n ! (n factorial)

Using this notation 2 × 1 = 2 ! = 2,
3 × 2 × 1 = 3 ! = 6,
4 × 3 × 2 × 1 = 4 ! = 24 etc.

Note: 1 ! = 1, 0 ! = 1.

The factorial notation is used in counting formulae.

Examples
1) In how many ways can 7 people be placed in a queue at a bus stop?

The 7 people have to be placed in the 7 positions from 1st to 7th.


No. of ways = 7 × 6 × 5 × . . . × 2 × 1 = 7 ! = 5040.

2) In how many ways can 5 books be arranged in a row?

No. of ways = 5 × 4 × 3 × 2 × 1 = 5 ! = 120.

Permutations and combinations

Permutation
• A permutation is an arrangement of a group of items where order matters.
• The number of permutations of n objects taken r at a time is calculated from

$${}_{n}P_{r} = P(n, r) = \frac{n!}{(n - r)!}.$$

Combination
• A combination is a selection of a group of items where order does not matter.
• The number of combinations of a group of n objects taken r at a time is calculated
from

$${}_{n}C_{r} = C(n, r) = \binom{n}{r} = \frac{n!}{(n - r)!\,r!}.$$

Examples:

1) Four people (A, B, C, D) serve on a board of directors. A chairman and vice-chairman are
to be chosen from these 4 people. In how many ways can this be done?

Chairman Vice-chairman
A B
B A
A C
C A
A D
D A
B C
C B
B D
D B
C D
D C

Number of ways = 12.

2) Four people (A, B, C, D) serve on a board of directors. Two people are to be chosen from
them as members of a committee that will investigate fraud allegations. In how many
ways can this be done?

People chosen A and B A and C A and D B and C B and D C and D

Number of ways = 6.
In both these examples a choice of 2 people from 4 people is made. However, in example 1
the order of choice of the 2 people matters (since the one person chosen is chairman and
the other one vice-chairman). In example 2 the order does not matter. The only interest is in
who serves on the committee.

Application of formulae.
In question 1 the permutations formula applies with n = 4, r =2.

Number of ways = P(4, 2) = 4!/(4 − 2)! = 12.

In question 2 the combinations formula applies with n = 4, r =2.

Number of ways = C(4, 2) = 4!/(2!(4 − 2)!) = 6.

3) Find the number of ways to take 4 people and place them in groups of 3 at a time where
order does not matter.

Solution:
Since order does not matter, use the combination formula.

C(4,3) = 4!/(3!(4 − 3)!) = 24/6 = 4.

4) Find the number of ways to arrange 6 items in groups of 4 at a time where order matters.

Solution: P(6,4) = 6!/(6 − 4)! = 720/2! = 360.

There are 360 ways to arrange 6 items taken 4 at a time when order matters.

5) Find the number of ways to take 20 objects and select them in groups of 5 at a time
where order does not matter.

Solution: C(20,5) = 20!/(5!(20 − 5)!) = (20 × 19 × 18 × 17 × 16)/(1 × 2 × 3 × 4 × 5) = 15 504.
There are 15 504 ways to select 20 objects taken 5 at a time when order does not
matter.

6) Determine the total number of five-card hands that can be drawn from a deck of 52
cards.

Solution:
When a hand of cards is dealt, the order of the cards does not matter. Thus the
combinations formula is used.

There are 52 cards in a deck and we want to know in how many different ways we can
draw them in groups of five at a time when order does not matter. Using the
combination formula gives
C(52,5) = 2 598 960.
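The worked values in examples 1 to 6 can be checked with Python's standard library, assuming Python 3.8 or later, where `math.perm` and `math.comb` are available. The sketch also recomputes each formula from factorials; the helper names are illustrative only.

```python
import math

def permutations_count(n, r):
    """P(n, r) = n! / (n - r)!  (order matters)."""
    return math.factorial(n) // math.factorial(n - r)

def combinations_count(n, r):
    """C(n, r) = n! / ((n - r)! r!)  (order does not matter)."""
    return math.factorial(n) // (math.factorial(n - r) * math.factorial(r))

# The hand-written formulae agree with the library functions.
assert permutations_count(4, 2) == math.perm(4, 2)
assert combinations_count(4, 2) == math.comb(4, 2)

print(math.perm(4, 2), math.comb(4, 2), math.comb(4, 3))
print(math.perm(6, 4), math.comb(20, 5), math.comb(52, 5))
```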

7) There are five women and six men in a group. From this group a committee of 4 is to be
chosen. In how many ways can the committee be formed if the committee is to have at least
3 women in it?

Solution:

8) In how many ways can a phone number consisting of 5 digits be chosen from the digits
1, 2, 3, . . . , 9 if no digits are to be repeated?

Solution:

9) In how many ways can the 6 winning numbers in a Lotto draw be selected?

Solution:

10) In how many ways can a five-card hand consisting of three eights and two sevens be dealt?

Solution:

11) How many different 5-card hands include 4 of a kind and one other card?

Solution:

We have 13 different ways to choose 4 of a kind: 2's, 3's, 4's, … Queens, Kings and
Aces.

Once a set of 4 of a kind has been removed from the deck, 48 cards are left.

Remember OR means add.

The possible situations that will satisfy the above requirement are:

4 Aces and one other card C(4,4)×C(48,1) = 48.

or 4 Kings and one other card C(4,4)×C(48,1) = 48.

or 4 Queens and one other card C(4,4)×C(48,1) = 48.


.
.
.
or 4 twos and one other card C(4,4)×C(48,1) = 48.
Total of 48×13 = 624 ways.

3.5 Basic probability formulae

Complementary events
For any event A defined on some sample space,

P(Ā) = 1 – P(A).

Union of two or more events

For any two events A and B defined on some sample space,

P( A ∪ B ) = P(A) + P(B) for mutually exclusive events.

P( A ∪ B ) = P(A) + P(B) – P ( A ∩ B ) for events that are not mutually exclusive.



These formulae can be extended to probabilities involving more than two events
e.g. for 3 events A, B and C defined on some sample space

P( A ∪ B ∪ C ) = P(A) + P(B) + P(C) for mutually exclusive events.

P( A ∪ B ∪ C ) = P(A) + P(B) + P(C) – P( A ∩ B ) – P(A ∩ C ) – P( B ∩ C ) + P( A ∩ B ∩ C )

for events that are not mutually exclusive.

This formula can easily be verified with the aid of the Venn diagram shown below.

From the above diagram the following sets can be written down.

A = {1, 2, 4, 5 }; B = {2, 3, 5, 6} ; C = {4, 5 ,6, 7} ;


A ∩ B = {2, 5} ; A ∩ C = {4, 5 }; B ∩ C = {5, 6};
A ∩ B ∩ C = {5}; A ∪ B ∪ C = {1, 2, 3, 4, 5, 6, 7}.

Exercise: Complete the verification of the result for P( A ∪ B ∪ C ) .

De Morgan’s Laws

(1) $P(\bar{A} \cap \bar{B}) = P(\overline{A \cup B})$

(2) $P(\bar{A} \cup \bar{B}) = P(\overline{A \cap B})$

Venn diagram verification of second result

[Venn diagrams for the right hand side and the left hand side are not reproduced here.]

Exercise: Verify the first result by using Venn diagrams.

Total probability formulae

P(A) = P(A ∩ B) + P(A ∩ B̄)

P(B) = P(A ∩ B) + P(Ā ∩ B)

These formulae can be verified from a Venn diagram of the events A and B.

The formulae can be extended to probabilities involving more than two events.

Examples

1) There are two telephone lines – A and B. Line A is engaged 50% of the time and line B is
engaged 60% of the time. Both lines are engaged 30% of the time. Calculate the
probability that

(a) at least one of the lines is engaged.
(b) neither of the lines is engaged.
(c) line B is not engaged.
(d) line A is engaged, but line B is not engaged.
(e) only one line is engaged.
Solution:

Let E1 denote the event “line A is engaged” and E2 the event “line B is engaged”.

Given: P(E1) = 0.5, P(E2) = 0.6, P(E1 ∩ E2) = 0.3.

(a) P(at least one of the lines is engaged) = P(E1 ∪ E2)
= P(E1) + P(E2) – P(E1 ∩ E2)
= 0.5 + 0.6 – 0.3
= 0.8

(b) P(neither of the lines is engaged) = 1 – P(at least one of the lines is engaged)
= 1 – 0.8
= 0.2

(c) P(B not engaged) = 1 – P(B engaged) = 1 – P(E2) = 1 – 0.6 = 0.4.

(d) The event “line A is engaged, but line B is not engaged” can be written in symbols as
E1 ∩ Ē2.

P(E1 ∩ Ē2) = P(E1) – P(E1 ∩ E2)
= 0.5 – 0.3
= 0.2.
(Using the total probability formula)

(e) P(only one line is engaged) = P(line A is engaged, but line B is not engaged)
+ P(line B is engaged, but line A is not engaged)
= P(E1 ∩ Ē2) + P(Ē1 ∩ E2)

P(Ē1 ∩ E2) = P(E2) – P(E1 ∩ E2) = 0.6 – 0.3 = 0.3. (Using the total probability
formula)

P(only one line is engaged) = 0.2 + 0.3 = 0.5
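The five answers for the telephone line example follow mechanically from the complement, union and total probability formulae. A minimal sketch, using exact fractions to avoid rounding noise (the variable names are illustrative assumptions):

```python
from fractions import Fraction

# Given: P(E1) = 0.5, P(E2) = 0.6, P(E1 ∩ E2) = 0.3.
p_e1, p_e2, p_both = Fraction(1, 2), Fraction(3, 5), Fraction(3, 10)

p_at_least_one = p_e1 + p_e2 - p_both   # (a) union formula
p_neither = 1 - p_at_least_one          # (b) complement of the union
p_b_free = 1 - p_e2                     # (c) complement of E2
p_only_a = p_e1 - p_both                # (d) total probability formula
p_only_b = p_e2 - p_both                # used in (e)
p_exactly_one = p_only_a + p_only_b     # (e) two mutually exclusive cases

print(float(p_at_least_one), float(p_neither), float(p_b_free),
      float(p_only_a), float(p_exactly_one))
```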

2) Let O be the event that a certain lecturer will be in his/her office on a particular
afternoon and L the event that he/she will be at a lecture. Suppose P(O) = 0.48 and P(L)
= 0.27.

(a) State in words the event Ō ∩ L̄.
(b) Calculate P(Ō ∩ L̄).

Solution:

(a) Ō is the event that

L̄ is the event that

Ō ∩ L̄ is the event that

(b) P(Ō ∩ L̄) =

3) A batch of 20 computers contains 3 that are faulty. Four (4) computers are selected at
random without replacement from this batch. Calculate the probability that

(a) none of the 4 computers selected is faulty.
(b) at least 2 of the computers selected are faulty.

Solution:

There are C(20,4) = 4845 [why not P(20,4) ?] ways of selecting the 4 computers from the
batch of 20. Since random selection is used, all 4845 selections are equally likely. Let A
denote the event “none of the 4 computers selected is faulty” and B the event “at least
2 of the computers selected are faulty”.

Using the classical probability result,

(a) P(A) =

(b) P(B) =

3.6 Conditional probability

The conditional probability of an event A occurring given that another event B has occurred
is given by

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad \text{where } P(B) > 0.$$

Also

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}, \quad \text{where } P(A) > 0.$$
Example 1
Five hundred (500) TV viewers consisting of 300 males and 200 females were asked whether
they were satisfied with the news coverage on a certain TV channel. Their replies are
summarized in the table below.

                    Answer
Gender      Satisfied   Not satisfied   Total
Male        180         120             300
Female      90          110             200
Total       270         230             500

P(satisfied | male) = 180/300 = 0.6.

P(satisfied | female) = 90/200 = 0.45.

P(not satisfied | male) =

P(not satisfied | female) =

P(satisfied) = 270/500 = 0.54 and P(not satisfied) =

Note

1) When calculating a conditional probability the sample space is restricted to that
associated with the event that is known to occur.

2) The probability of a person being satisfied depends on the gender of the person being
interviewed. In this case females are less satisfied than males with the news coverage.

Example 2
At a certain university the probability of passing accounting is 0.68, the probability of
passing statistics 0.65 and the probability of passing both statistics and accounting is 0.57.
Calculate the probability that a student

(a) passes statistics when it is known that he/she passed accounting.

(b) passes accounting when it is known that he/she passed statistics.

(c) passes statistics when it is known that he/she did not pass accounting.

Solution:
Let A denote the event “a student passes accounting” and
B the event “a student passes statistics”.

Then Ā is the event “a student did not pass accounting”,
A ∩ B the event “a student passes both statistics and accounting” and
Ā ∩ B the event “a student passes statistics, but not accounting”.

Given: P(A) = 0.68, P(B) = 0.65, P(A ∩ B) = 0.57.

(a) P(B|A) = P(A ∩ B)/P(A) = 0.57/0.68 = 0.838.

(b) P(A|B) = P(A ∩ B)/P(B) = 0.57/0.65 = 0.877.

(c) P(B | Ā) =

Multiplication rule of probabilities

Suppose the joint probability P(A ∩ B) is to be calculated when one of the conditional
probabilities [ P(A|B) or P(B|A) ] and the corresponding unconditional probability [ P(B) or
P(A) ] are known. Then the conditional probability formulae can be rearranged to obtain
the joint probability i.e.
P(A ∩ B) = P(B) P(A|B) = P(A) P(B|A).

These formulae are known as the multiplication formulae of probabilities.

Examples

1) A box has 12 bulbs, 3 of which are defective. If two bulbs are selected at random
without replacement, then what is the probability that both are defective?

Solution:

Let d1 denote the event “the first bulb is defective” and d2 the event “the second bulb is
defective”.

Then P(d1) = 3/12 and P(d2|d1) = 2/11.

Using the above mentioned multiplication formula,

P(d2 ∩ d1) = P(d1) P(d2|d1) = (3/12) × (2/11) = 0.045.

2) Two cards are drawn at random from a deck of playing cards. What is the
probability that both these cards are aces?

Solution:

Since there are 4 aces in a deck of 52 cards, the probability of drawing one ace is 4/52.
Removing one ace and not replacing it reduces the probability of drawing
another ace on the second draw. The 51 cards remaining contain 3 aces and therefore
the probability of drawing an ace on the second draw is 3/51. We can multiply these
probabilities to determine the probability of drawing two aces.

P(drawing 2 aces) = (4/52) × (3/51) = 1/221.

• The multiplication rule can be extended to involve more than 2 events
e.g. for 3 events A1, A2 and A3 defined on the same sample space,

P( A1 ∩ A2 ∩ A3 ) = P(A1) P(A2|A1) P(A3|A2∩A1).

3) Three cards are drawn at random from a deck of playing cards. What is the
probability that all 3 of these cards are aces?

Solution: P(drawing 3 aces) = (4/52) . (3/51) . (2/50) = 1/5525.
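The bulb and ace examples all apply the multiplication rule to successive draws without replacement. A small sketch of that pattern is given below; the function name `prob_all_special` and its parameters are hypothetical, introduced only for illustration.

```python
from fractions import Fraction

def prob_all_special(total, special, draws):
    """Probability that every one of `draws` items drawn without replacement
    is 'special', via P(A1) P(A2|A1) P(A3|A2 ∩ A1) ..."""
    p = Fraction(1)
    for i in range(draws):
        # After i special items are removed, special - i remain among total - i.
        p *= Fraction(special - i, total - i)
    return p

p_two_defective = prob_all_special(12, 3, 2)  # bulbs example
p_two_aces = prob_all_special(52, 4, 2)       # two aces
p_three_aces = prob_all_special(52, 4, 3)     # three aces

print(p_two_defective, p_two_aces, p_three_aces)
```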

Independent events

Two events A and B are said to be independent if P(A|B) = P(A) or P(B|A) = P(B).
This means that the occurrence of B does not affect the probability that A occurs.

Substitution of the above result into the multiplication formula for two probabilities gives
P(A ∩ B) = P(A) P(B) if A and B are independent.

This formula is known as the product formula for independent events.

Examples

1) The probability that person A will be alive in 20 years is 0.7 and the probability that
person B will be alive in 20 years is 0.5, while the probability that they will both be alive
in 20 years is 0.45. Are the events E1 “A is alive in 20 years” and E2 “B is alive in 20 years”
independent?

Solution:

P(E1) = 0.7, P(E2) = 0.5, P(E1 ∩ E2) = 0.45

Since P(E1) P(E2) = 0.7 × 0.5 = 0.35 ≠ P(E1 ∩ E2), the events E1 and E2 are not
independent.
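The independence check used in this example, comparing P(E1) P(E2) with P(E1 ∩ E2), can be sketched as a small helper. The function name and the tolerance are assumptions made for illustration; a tolerance is used because decimal probabilities are stored as floating point numbers.

```python
import math

def are_independent(p_a, p_b, p_a_and_b, tol=1e-9):
    """True when P(A ∩ B) equals P(A) P(B) up to a small numerical tolerance."""
    return math.isclose(p_a * p_b, p_a_and_b, abs_tol=tol)

# Survival example: 0.7 × 0.5 = 0.35, which differs from 0.45.
survival_independent = are_independent(0.7, 0.5, 0.45)

# Two fair coins: 0.5 × 0.5 = 0.25 = P(both heads).
coins_independent = are_independent(0.5, 0.5, 0.25)

print(survival_independent, coins_independent)
```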

2) Two coins are tossed. Using the classical definition of probability,
P(both tosses heads) = ¼.
Assuming that both coins are unbiased, P(1st coin is heads) = P(2nd coin is heads) = ½.

Since P(1st coin is heads) × P(2nd coin is heads) = ½ × ½ = ¼ = P(both tosses heads),
the events “heads on the first toss” and “heads on the second toss” are independent.

• The multiplication rule for independent events can be extended to involve more
than 2 events. In general, if the events A1, A2, . . . , An are independent then

P( A1 ∩ A2 ∩ . . . ∩ An ) = P(A1) P(A2) . . . P(An).

Examples

1) A coin is tossed and a single 6 sided die is rolled. Find the probability of “heads” and
rolling a 3 with the die.
P(head) = ½ and P(3) = 1/6.

Since the results of the coin and the die are independent,
P(heads and 3) = P(heads) P(3) = (1/2) × (1/6) = 1/12

2) A school survey found that 9 out of 10 students like pizza. If three students are
chosen at random with replacement, what is the probability that all three students
like pizza?

Solution
P(student 1 likes pizza) = 9/10 = P(student 2 likes pizza) = P(student 3 likes pizza).

P(student 1 likes pizza and student 2 likes pizza and student 3 likes pizza)
= P(student 1 likes pizza) × P(student 2 likes pizza) × P(student 3 likes pizza)
= (9/10)^3 = 0.729.

3) It is known that 8% of all cars of a certain make that are sold encounter engine
overheating problems within 50 000 kilometers of travel. During the past week 4
such cars were sold. Assuming that engine overheating problems for the 4 cars are
encountered independently, what is the probability that
(a) all 4
(b) none
(c) at least one of these cars sold
encounter engine overheating problems within 50 000 kilometers of travel ?
Solution:
Let A denote the event “overheating problems within 50 000 kilometers of travel”.

(a) P(A) = 0.08.

P(all 4 have overheating problems) = [P(A)]^4 = 0.08^4 = 0.00004096.

(b) P(not overheating problems) =

So
P(none) =

(c) P(at least 1) =

Bayes’ theorem

In order to apply the conditional probability formula

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)},$$

values for P(A ∩ B) and P(B) are needed.

Suppose that only the values for P(A), P(B|A) and P(B|Ā) are available.
In this case the probabilities [ P(A ∩ B) and P(B) ] required for calculating P(A|B) can be
calculated from

P(A ∩ B) = P(A) P(B|A)
(using the conditional probability multiplication formula)

and

P(B) = P(A ∩ B) + P(Ā ∩ B) = P(A) P(B|A) + P(Ā) P(B|Ā)
(using the total probability formula and the conditional probability multiplication formula).

Substituting these probabilities into the first conditional probability formula gives

$$P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(A)\,P(B \mid A) + P(\bar{A})\,P(B \mid \bar{A})}.$$

This result is known as Bayes’ theorem (named after the person who proposed the
method).
Example 1

When testing a person for a certain disease, the test can show either a positive result (the
person has the disease) or a negative result (the person does not have the disease).

When a person actually has the disease, the test shows positive 99% of the time. When the
person actually does not have the disease the test shows negative 95% of the time. Suppose
it is known that only 0.1% of the people in the population have the disease.

a) If a test turns out to be positive, what is the probability that the person has the
disease?

b) If the test turns out to be negative, what is the probability that the person does not
have the disease?

Solution:

Let A = the person has the disease,
B = the test returns a positive result.
Then
Ā is the event “the person does not have the disease”,
B|A is the event “the test is positive given the person has the disease”,
B|Ā is the event “the test is positive given the person does not have the disease” and
B̄|Ā is the event “the test is negative given the person does not have the disease”.

(a) P(A) = 0.001 (given),
P(Ā) = 1 – P(A) = 0.999,
P(B|A) = 0.99 (given),
P(B̄|Ā) = 0.95 (given),
P(B|Ā) = 1 – P(B̄|Ā) = 0.05.

Substitution into the above formulae gives

Numerator: P(A ∩ B) = P(A) P(B|A) = 0.001 × 0.99 = 0.00099

Denominator:
P(B) = P(A ∩ B) + P(Ā ∩ B)
= P(A) P(B|A) + P(Ā) P(B|Ā)
= (0.001 × 0.99) + (0.999 × 0.05)
= 0.00099 + 0.04995
= 0.05094

P(A|B) = P(A ∩ B)/P(B) = 0.00099/0.05094 = 0.0194.

(b) P(Ā|B̄) = P(Ā ∩ B̄)/P(B̄) = P(Ā) P(B̄|Ā)/[1 − P(B)] = (0.999 × 0.95)/0.94906 = 0.9999895.

From the above it can be seen that a negative result of the test is very reliable (it will be
wrong only about 105 times in 10 million cases). On the other hand, the chance that a person
has the disease when the result of the test shows positive is only 194 in 10 000.
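Both parts of this example use the same two-event form of Bayes’ theorem, so they can be computed with one helper. The sketch below is illustrative only: the function name `bayes` and its argument names are assumptions. For part (b) the roles are swapped, with “no disease” as the conditioning hypothesis and “negative test” as the observed event.

```python
def bayes(prior, p_obs_given_h, p_obs_given_not_h):
    """P(H | obs) = P(H) P(obs|H) / [P(H) P(obs|H) + P(not H) P(obs|not H)]."""
    numerator = prior * p_obs_given_h
    denominator = numerator + (1 - prior) * p_obs_given_not_h
    return numerator / denominator

# (a) H = has the disease, obs = positive test.
p_disease_given_positive = bayes(0.001, 0.99, 0.05)

# (b) H = does not have the disease, obs = negative test.
#     P(negative | disease) = 1 - 0.99 = 0.01.
p_healthy_given_negative = bayes(0.999, 0.95, 0.01)

print(round(p_disease_given_positive, 4), round(p_healthy_given_negative, 7))
```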

Bayes’ Theorem can be extended:

Suppose A1, A2, …, An are mutually exclusive events whose union is the sample space
S and P(Ai) > 0 for each i. Then, for any event B with P(B) > 0, and any k = 1, 2, …, n,

$$P(A_k \mid B) = \frac{P(A_k)\,P(B \mid A_k)}{\sum_{i=1}^{n} P(A_i)\,P(B \mid A_i)}.$$

Example 2

Suppose that Bob can decide to go to work by one of three modes of transportation – car, bus,
or commuter train. Because of high traffic, if he decides to go by car, there is a 50% chance
he will be late. If he goes by bus, which has special reserved lanes but is sometimes
overcrowded, the probability of being late is only 20%. The commuter train is more expensive
than the other modes of transport but is late only 1% of the time.

a) Suppose that Bob is late one day and his boss wishes to estimate the probability that he
drove to work that day by car. Since he does not know which mode of transportation
Bob usually uses, he assumes that each mode is equally likely to be used. What is the
boss’ estimate of the probability that Bob drove to work by car?

b) Suppose that a co-worker of Bob’s knows that Bob drives to work by car 10% of the
time, he almost always takes the commuter train to work, and he never takes the bus.
Given that Bob is late to work today, the co-worker believes there is a ____% chance
that Bob came to work by train.

Solution

There are two events of interest –being late and choice of transport. There are 3 options for
the choice of transport.
Let
L = is late to work
B = takes bus
C = takes car
T = takes train
Each mode of transport is equally likely so P(B) = P(C) = P(T) = 1/3.

P(L | C) = 1/2
P(L | B) = 1/5
P(L | T) = 1/100

Solution (a)
Find P(C|L) = P(C ∩ L)/P(L).

Numerator: P(C ∩ L) = P(C) × P(L|C) = (1/3) × (1/2) = 1/6

Denominator: P(L) = P(L ∩ C) + P(L ∩ B) + P(L ∩ T)
= P(C) P(L|C) + P(B) P(L|B) + P(T) P(L|T)
= (1/3)(1/2) + (1/3)(1/5) + (1/3)(1/100)
= 71/300

So, P(C|L) = P(C ∩ L)/P(L) = (1/6) ÷ (71/300) = 0.7042
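The extended form of Bayes’ theorem amounts to normalising the products P(Ak) P(B|Ak) by their sum. A minimal sketch for solution (a), using exact fractions, is shown below; the function `posterior` and the dictionary keys are illustrative assumptions, and only the equal priors of part (a) are used.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return P(A_k | B) for every k, given P(A_k) and P(B | A_k)."""
    joint = {k: priors[k] * likelihoods[k] for k in priors}  # P(A_k ∩ B)
    total = sum(joint.values())                              # P(B)
    return {k: p / total for k, p in joint.items()}

# Part (a): the boss assumes each mode of transport is equally likely.
priors = {"car": Fraction(1, 3), "bus": Fraction(1, 3), "train": Fraction(1, 3)}
p_late = {"car": Fraction(1, 2), "bus": Fraction(1, 5), "train": Fraction(1, 100)}

post = posterior(priors, p_late)
print(post["car"], round(float(post["car"]), 4))
```

Changing the `priors` dictionary is all that is needed to explore other assumptions about how often each mode is used.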

Solution (b)
Try for yourself

3.7 Probabilities and odds


Let a to b be the odds in favour of some event A. Suppose P(A) = p. Then P(Ā) = 1 – p.
The odds in favour of A are then defined by

$$\frac{a}{b} = \frac{p}{1 - p}.$$

From the above it can be shown that

$$p = \frac{a}{a + b} \quad \text{and} \quad 1 - p = \frac{b}{a + b}.$$

The odds against A are b to a. From the above

$$\frac{b}{a} = \frac{1 - p}{p}.$$
Examples

a) A pair of balanced dice is tossed. What are the odds in favour of the sum of the numbers
showing being 6?
Total number of outcomes = 6 × 6 = 36.
Possible ways of getting a sum of 6: (1, 5), (2, 4), (3, 3), (4, 2), (5, 1).
Number of ways of getting a sum of 6 is 5.
p = probability sum equals 6 = 5/36, 1 – p = 31/36.
Odds in favour of a sum of 6 are 5 to 31 or 1 to 6.2.

b) What are the odds in favour of a red number coming up?
p = probability (red number) = 18/37 and 1 – p = 19/37.
So the odds in favour of a red are 18 to 19, or 1 to 1.056.

c) The table below shows data that were collected from 781 middle aged female patients at a
certain hospital.

Smoker\heart problems yes no total

yes 172 173 345

no 90 346 436

total 262 519 781

From the table it can be seen that

(i) For smokers the odds in favour of heart problems are 172 to 173 or 1 to 1.0058.

(ii) For non-smokers the odds in favour of heart problems are 90 to 346 or 1 to 3.8444.

From this it can be seen that smokers are much more at risk for heart problems than non-
smokers.
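Converting between probabilities and odds is a one-line calculation in each direction. The sketch below is illustrative (the function names are assumptions) and reproduces the dice and smoking figures above using exact fractions.

```python
from fractions import Fraction

def odds_in_favour(p):
    """Odds a/b in favour of an event with probability p: a/b = p / (1 - p)."""
    return p / (1 - p)

def probability_from_odds(a, b):
    """Probability of an event whose odds in favour are a to b: p = a / (a + b)."""
    return Fraction(a, a + b)

odds_sum_6 = odds_in_favour(Fraction(5, 36))         # dice example
odds_smokers = odds_in_favour(Fraction(172, 345))    # smokers with heart problems
odds_non_smokers = odds_in_favour(Fraction(90, 436))
p_red = probability_from_odds(18, 19)                # odds of 18 to 19

print(odds_sum_6, odds_smokers, odds_non_smokers, p_red)
```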

Chapter 4 – Probability distributions of discrete random variables

4.1 Discrete random variables


A random variable is a variable whose value depends on the outcome of a random
experiment. A random variable is denoted by a capital letter and a particular value of a
random variable by a lower case (small) letter.

Examples:

1) T = the number of tails (t) when a coin is flipped 3 times.
2) X = the sum of the values (x) showing when two dice are rolled.
3) H = the height (h) of a woman chosen at random from a group.
4) V = the liquid volume (v) of soda in a can marked 12 oz.

There are two types of random variables:

Discrete Random Variables


• Variables that have a finite or countable number of possible values.
• These variables usually occur in counting experiments.

Continuous Random Variables


• Variables that can take on any value in some interval i.e. they can take an infinite
number of possible values.
• These variables usually occur in experiments where measurements are taken.

Examples:

1) The variables T and X from the above examples are discrete random variables.

2) The variables H and V from the above examples are continuous random variables.

4.2 Discrete probability distributions and their graphical representations

A discrete probability distribution is a list of the possible distinct values of the random
variable together with their corresponding probabilities. The probability of the random
variable X assuming a particular value x is denoted by P(X=x) = p(x). Sometimes the notation
P(X=x) = f(x) or P(X=x) = P(x) is used. This probability, which is a function of x, is referred to
as the probability mass function.

Examples:

1) As above, let T be the random variable that represents the number of tails obtained
when a coin is flipped three times. Then T has 4 possible values 0, 1, 2, and 3. The
outcomes of the experiment and the values of T are summarized in the next table.

Outcomes T
hhh 0
hht, hth, thh 1
tth, tht, htt 2
ttt 3

Assuming that the outcomes are all equally likely, the probability distribution for T is
given in the following table.

t 0 1 2 3 Total
p(t) 1/8 3/8 3/8 1/8 1

2) Let Y denote the number of tosses of a coin until the first head appears. Then

S = {h, th, tth, ttth, . . . } and Y = 1, 2, 3, 4, …

y       1    2        3        . . .   Total
p(y)    ½    (½)^2    (½)^3    . . .   1

Why is ½ + (½)^2 + (½)^3 + . . . = 1 ?
3) A pair of dice is tossed. Let X denote the sum of the digits. The probability
distribution of X can be found from the following table. The entry in any particular
cell is the sum of the row and column values.

                    1st die
            1    2    3    4    5    6
        1   2    3    4    5    6    7
        2   3    4    5    6    7    8
2nd die 3   4    5    6    7    8    9
        4   5    6    7    8    9    10
        5   6    7    8    9    10   11
        6   7    8    9    10   11   12

x        2     3     4     5     6     7     8     9     10    11    12
P(X=x)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Note:
For any discrete random variable X, the probabilities of the values x that it can assume are
such that

$$0 \le P(x) \le 1 \quad \text{and} \quad \sum_{x} P(x) = 1.$$

The cumulative distribution function


The cumulative distribution function is defined as

F(x) = P(X ≤ x) = ∑ P(r), where the sum is taken over all values r ≤ x.

Examples

1) For the probability mass function in example 1 the cumulative distribution function is

x 0 1 2 3
F(x) 1/8 ½ 7/8 1

2) For the probability mass function in example 3 the cumulative distribution function is

x 2 3 4 5 6 7 8 9 10 11 12
F(x) 1/36 3/36 6/36 10/36 15/36 21/36 26/36 30/36 33/36 35/36 1

3) Consider a discrete random variable with probability mass function given below.

x 1 2 3 4
P(X=x) 0.1 0.3 0.4 0.2

[Figure: (a) CDF (left), (b) PMF (right)]

The graphs above are plots of the cumulative distribution function (graph on the left) and
the probability mass function (graph on the right) of this random variable.
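
The cumulative distribution function of example 3 is obtained by taking running totals of the probability mass function. A minimal sketch in Python follows; the variable names are illustrative, and the rounding is there only to suppress floating-point noise.

```python
from itertools import accumulate

values = [1, 2, 3, 4]
pmf = [0.1, 0.3, 0.4, 0.2]

# Running totals of the pmf give F(x) = P(X <= x) at each possible value.
cdf = [round(total, 10) for total in accumulate(pmf)]

for x, F in zip(values, cdf):
    print(x, F)
```

The last cumulative value is 1, as it must be for any probability distribution.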

A random variable can only take on one value at a time i.e. the events X = x1 and X = x2 for
x1 ≠ x2 are mutually exclusive. The probability of the variable taking on any number of
different values can be found by simply adding the appropriate probabilities.

Examples

1) Find the probability of getting 2 or more tails when a coin is flipped 3 times.

P(T ≥ 2) = 3/8 + 1/8 = ½.

2) Find the probability of getting at least one tail when a coin is flipped 3 times.

P(at least 1) = p(1) + p(2) + p(3)
= 3/8 + 3/8 + 1/8
= 7/8

Or

P(at least 1) = 1 – p(0) = 1 – 1/8 = 7/8.

3) Find the probability of needing at most 3 tosses of a coin to get the first heads.

P(at most 3) = p(1) + p(2) + p(3)
= ½ + (½)² + (½)³
= 7/8

4) Find the probability of getting a sum of
(a) 7 when tossing a pair of dice.
(b) at least 4 when tossing a pair of dice.

(a) P(7) = P(1st is 6, 2nd is 1) + P(1st is 5, 2nd is 2) + P(1st is 4, 2nd is 3)
+ P(1st is 3, 2nd is 4) + P(1st is 2, 2nd is 5) + P(1st is 1, 2nd is 6)
= 6/36
= 1/6.

(b) P(at least 4) = p(4) + p(5) + . . . + p(12)
= 1 – [p(2) + p(3)]
= 1 – 3/36
= 33/36
= 11/12.

4.3 Mean, variance and standard deviation of a discrete random variable

The mean or expected value of a random variable X is the average value that we would
expect for X when performing the random experiment many times.

Notation: The mean or expected value of a random variable X will be represented by µ or E(X).

We can calculate the mean by using the formula

E(X) = µ = ∑ xp(x) .
Examples

1) The expected value of the random variable T from above is:

E(T) = 0(1/8) + 1(3/8) + 2(3/8) + 3(1/8) = 12/8 = 1.5.

Thus if 3 coins are flipped a large number of times, we should expect the average
number of tails (per 3 flips) to be about 1.5. Since the number of tails is an integer
value, it will never actually assume the mean value of 1.5. This mean value more
reflects the fact that the extreme values (0 and 3) occur the same proportion of
times (an eighth) and the middle values occur the same proportion of times (three
eighths).

2) The score S obtained in a certain quiz is a random variable with probability distribution
given below.

s 0 1 2 3 4 5
p(s) 0.12 0.04 0.16 0.32 0.24 0.12

The mean of the random variable S can be calculated as shown below.

s 0 1 2 3 4 5 sum
p(s) 0.12 0.04 0.16 0.32 0.24 0.12 1
s × p(s) 0 0.04 0.32 0.96 0.96 0.60 2.88

µ = E(S) = 2.88

Variance
For a random variable X, the variance, denoted by σ², can be calculated by using the
formula

σ² = ∑ (x – µ)² p(x) = ∑ x² p(x) – µ².

The standard deviation of X, denoted by σ, is just the positive square root of σ². This is a
measure of the extent to which the values are spread around the mean.

The calculation of the standard deviation of a random variable is similar to the
calculation of the standard deviation for grouped data.

Example

Calculate the standard deviation of the random variable T from above.

t 0 1 2 3 sum
p(t) 1/8 3/8 3/8 1/8 1
t × p(t) 0 3/8 6/8 3/8 1.5
t² × p(t) 0 3/8 12/8 9/8 3

σ² = ∑ t² p(t) – µ² = 3 – (1.5)² = 0.75, so σ = √0.75 ≈ 0.866.
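
The two table-based calculations in this section can be checked with a short function. This is a sketch only: the function name mean_and_sd is hypothetical, and it assumes the computational form of the variance, σ² = ∑ x² p(x) – µ², which is what the t² × p(t) row of the table supports.

```python
from math import sqrt

def mean_and_sd(values, probs):
    """Return (mean, standard deviation) of a discrete random variable."""
    mu = sum(x * p for x, p in zip(values, probs))
    second_moment = sum(x * x * p for x, p in zip(values, probs))
    variance = second_moment - mu ** 2  # sigma^2 = sum of x^2 p(x), minus mu^2
    return mu, sqrt(variance)

# Number of tails T in three coin flips.
print(mean_and_sd([0, 1, 2, 3], [1/8, 3/8, 3/8, 1/8]))

# Quiz score S from example 2.
print(mean_and_sd([0, 1, 2, 3, 4, 5], [0.12, 0.04, 0.16, 0.32, 0.24, 0.12]))
```

For T the mean is 1.5 and the standard deviation is about 0.866; for S the mean agrees with the value 2.88 found in the table.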

4.4 Binomial, hypergeometric and Poisson distributions

Bernoulli trial:
Consider an experiment in which there are two complementary outcomes. One
outcome is labelled “success” (s) and the other is labelled “failure” (f). Such an
experiment is called a Bernoulli trial.
We denote the probability of success as P(s)= p and the probability of failure as
P(f) = 1–p = q

Binomial random variable:
A random variable X is said to have a Binomial distribution if it counts the number of
successes when a fixed number n of identical, independent Bernoulli trials are
performed.

So, to identify a Binomial random variable 5 questions must be asked.


1) Are there a fixed number of trials (n)?
2) Is each trial a Bernoulli trial i.e. does each trial have 2 complementary outcomes?
3) Are the Bernoulli trials identical i.e. is the probability of success p the same for all
the trials?
4) Are the trials independent i.e. does the outcome of one trial not affect the
outcome of another trial?
5) Does X count the number of successes?
If the answer is “yes” to all 5 questions then X is a Binomial random variable.

Notation:
A short hand way of referring to a binomially distributed random variable X, based on
n trials with probability of success p, is X ~ B(n,p) or X ~ Bin(n,p).

Examples:

1) Consider the experiment of flipping a coin 5 times. If we let the event of getting “tails” on
a flip be labeled “success” and “heads” failure, and if the random variable T represents
the number of tails obtained, then T will be binomially distributed with n = 5, p = ½ and
q=½

2) A student answers 10 questions in a multiple-choice test by guessing each answer. For
each question, there are 5 possible answers, only one of which is correct. If we consider
a “success” as getting a question right and consider the 10 questions as 10 independent
Bernoulli trials, then the random variable X representing the number of correct answers
will be binomially distributed with n = 10, p = 0.2 and q = 0.8.
3) Fourteen percent of flights from a certain airport are delayed. If 20 flights are chosen at
random, then we can consider each flight to be an independent Bernoulli trial. If we define
a successful trial to be one where a flight takes off on time, then the random variable Z
representing the number of on-time flights will be binomially distributed with n = 20,
p = 0.86 and q = 0.14.

Tree diagram
The number of possible outcomes in a binomial experiment can be written down
from a diagram such as the one below. This diagram, called a tree diagram, enables
one to write down all the outcomes when a Bernoulli trial is performed 3 times.

[Tree diagram: starting from “start”, each of the 1st, 2nd and 3rd trials branches into
s and f, giving the 8 paths sss, ssf, sfs, sff, fss, fsf, ffs and fff.]

The following outcomes and their respective number of successes (x) can be written down
from the above tree diagram.

Outcomes x
fff 0
sff, fsf, ffs 1
ssf, sfs, fss 2
sss 3

Formula for the calculation of binomial probabilities

A formula for the binomial probability mass function for the case n = 3 can be written down
from the above table by noting the following.

1) Each outcome is a sequence of s (success) and f (failure) values e.g. fff, ffs, ssf etc.

2) In a particular sequence s occurs x times and f occurs (3 – x) times for x = 0, 1, 2, 3.


3) Since the trials are independent, the probability of a particular sequence of s’s and
f’s is given by a product of p (the probability of success) and q (the probability of
failure) values, where p’s occur x times and q’s (3 – x) times e.g. P(fff) = q³,
P(ffs) = pq², P(ssf) = p²q etc.

4) The number of outcomes where there are x success and (3 – x) failure outcomes can
be counted by using the formula C(3, x) = 3Cx.

By using the above, the binomial formula for n = 3 can be written down as

P(X = x) = 3Cx p^x q^(3–x) for x = 0, 1, 2, 3.

To write down the general formula, the same reasoning as explained above applies to
sequences with n outcomes consisting of s (x of these) and f (n – x of these) values. In the
formula the number 3 is just replaced by n i.e.

P(X = x) = f(x) = nCx p^x q^(n–x) for x = 0, 1, 2, …, n.

Examples

1) As in the previous examples, let T be the random variable representing the number of tails
when a coin is flipped 3 times. Then T ~ Bin(3 , 0.5). Using the formula above with n=3
and p = 0.5 , we can calculate the probability of exactly 2 tails as:

f(2) = 3C2 (0.5)² (0.5)¹ = 0.375

2) A student answers 10 questions in a multiple-choice test by guessing each answer. For
each question, there are 5 possible answers, only one of which is correct. If we consider
a “success” as getting a question right and consider the 10 questions as 10 independent
Bernoulli trials, then X ~ Bin(10 , 0.2) where X is the random variable representing the
number of correct answers. What is the probability that the student chooses

a) 3 answers correctly?
b) 7 answers correctly?
c) fewer than 3 answers correctly?
d) at least 5 answers correctly?

Solution:
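a) P(X=3) = f(3) = 10C3 (0.2)³ (0.8)⁷ = 0.2013

b) P(X=7) = f(7) = 10C7 (0.2)⁷ (0.8)³ = 0.000786.

c) P(X < 3) = f(0) + f(1) + f(2) = 0.1074 + 0.2684 + 0.3020 = 0.6778

d) P(X ≥ 5) = f(5) + f(6) + . . . + f(10) = 1 – [f(0) + f(1) + . . . + f(4)] = 1 – 0.9672 = 0.0328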
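
The four parts of this example can be verified numerically from the binomial formula. The sketch below uses only the Python standard library; the helper names binom_pmf and binom_cdf are illustrative and do not refer to any particular statistics package.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p), using nCx p^x q^(n-x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Bin(n, p), by summing the pmf."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

n, p = 10, 0.2
print(round(binom_pmf(3, n, p), 4))      # a) P(X = 3)
print(round(binom_pmf(7, n, p), 6))      # b) P(X = 7)
print(round(binom_cdf(2, n, p), 4))      # c) P(X < 3) = P(X <= 2)
print(round(1 - binom_cdf(4, n, p), 4))  # d) P(X >= 5) = 1 - P(X <= 4)
```

Parts (c) and (d) are written in terms of P(X ≤ x), which is the form in which cumulative probabilities are tabulated.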

Cumulative Binomial Distribution Tables

Notice that the calculations needed in parts (c) and (d) of the previous example are time
consuming. Instead of using the probability mass function f(x) to solve the problems, the
CDF F(x) can be used. Values for the CDF are found in the Cumulative Binomial Distribution
tables at the end of the notes (Table A).
There are several tables – one for each different value of n. The first column gives the value
of n while the second column gives the possible values that the random variable X can take
on. The top row gives common values of p.

Remember: These tables give cumulative probabilities so situations that involve the
“<”, “>” and “≥” signs must be adjusted so that they are in a form that uses the “≤”
sign i.e. a “less than or equal to” situation.

Examples
1) Suppose X ~ Bin(12 , 0.6). Find the probability that X is less than, or equal to, 5.

Here we want to find F(5) = P(X ≤ 5).


Step 1: Find the table that has n = 12 in the first column.
Step 2: Choose the value x=5 in the second column.
Step 3: Find p = 0.6 in the top row.
The value given at the intersection of the “5 row” and the “0.6 column” is
F(5) = P(X ≤ 5) = 0.1582

2) In the multiple choice test example X ~ Bin(10 , 0.2).


Part c: What is the probability that the student chooses fewer than 3 answers
correctly?
P(X < 3) = P(X ≤ 2) = 0.6778
To find this go to the table with n = 10 in the first column. Choose x=2 in the second
column and choose p = 0.2 in the top row. Line up the column and row and the value
is F(2) = P(X ≤ 2) = 0.6778

Part (d): What is the probability that the student chooses at least 5 answers
correctly?
P(X ≥ 5) = 1 – P(X ≤ 4) = 1 – 0.9672 = 0.0328

Mean and standard deviation of a binomial random variable

If X is a binomial random variable with n trials, probability of success p and probability of
failure q, then the mean, variance and standard deviation of X can be calculated by using the
following formulae.
mean = E(X) = µ = np
var(X) = σ² = npq
standard deviation (X) = σ = √(npq).

Example

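For T = the number of tails when a coin is flipped 3 times, n = 3, p = q = ½, so
µ = np = 3(½) = 1.5, σ² = npq = 3(½)(½) = 0.75 and σ = √0.75 ≈ 0.866.

Note: A Binomial random variable with n = 1 counts the successes in a single Bernoulli trial;
its distribution is sometimes referred to as the Bernoulli distribution.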

Shape of the binomial distribution

A binomial distribution is symmetric if p = q, positively skewed if p < q and negatively
skewed if p > q. These shapes are illustrated in the graphs for n = 20.

Hypergeometric distribution: Bernoulli trials where sampling is without replacement
The following experimental model is sometimes associated with the binomial distribution.

Consider a bowl with N marbles of which Np are blue and Nq red, where p + q = 1. If
sampling is done with replacement and drawing a blue marble is labeled “success” (a red
marble “failure”), then P(success) = Np/N = p and P(failure) = Nq/N = q. If P(x
blue marbles in n draws) is required and sampling is with replacement, the binomial
formula will still apply. If sampling is without replacement, P(success) changes from
draw to draw depending on the earlier draws, so the trials are no longer independent
(assumption 4 of a binomial experiment is violated) and the binomial formula
will no longer apply for calculating this probability. In such a case

P(X = x) = C(Np, x) × C(Nq, n – x) / C(N, n) for each possible value x.

The abovementioned distribution is known as the hypergeometric distribution.

Example

A bowl contains 10 blue and 7 red marbles. Four (4) marbles are drawn at random from the
bowl. Calculate the probability of
(a) two
(b) at least 3
blue marbles drawn when sampling is done
1) with replacement.
2) without replacement.

N = 17, Np = 10, Nq = 7 and n = 4

1a) P(X = 2) = 4C2 (10/17)² (7/17)² = 0.352.

1b) P(X ≥ 3) = P(X = 3) + P(X = 4)
= 0.335 + 0.120
= 0.455.

2a) P(X = 2) = (10C2 × 7C2)/17C4 = (45 × 21)/2380 = 0.397.

2b) P(X ≥ 3) = P(X = 3) + P(X = 4) = (10C3 × 7C1 + 10C4 × 7C0)/17C4 = (840 + 210)/2380 = 0.441.
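
The sampling-without-replacement probabilities in this example can be checked with a few lines of Python. This is a sketch under the marble model described above; the function name hypergeom_pmf and the argument K (the number of blue marbles, Np in the notation of the notes) are illustrative choices.

```python
from math import comb

def hypergeom_pmf(x, N, K, n):
    """P(x successes in n draws without replacement, K successes among N items)."""
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)

N, K, n = 17, 10, 4  # 17 marbles, 10 of them blue, 4 drawn

# 2a) exactly two blue marbles
print(round(hypergeom_pmf(2, N, K, n), 3))

# 2b) at least three blue marbles
print(round(hypergeom_pmf(3, N, K, n) + hypergeom_pmf(4, N, K, n), 3))
```

Comparing these with the with-replacement answers (0.352 and 0.455) shows how much the sampling scheme matters when the bowl is small.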

4.5 Poisson distribution


Poisson random variable:
If random variable X counts the number of events that occur at random in an interval
of time or space, then X is a Poisson random variable. The average number of events
that occur in the time/space interval is denoted by μ.
A short hand way of referring to a Poisson distributed random variable X with average
(mean) rate of occurrence µ, is X ~ Po(µ).

Examples
1) The number of bad cheques presented for daily payment at a bank.
2) The number of road deaths per month.
3) The number of bacteria in a given culture.
4) The number of defects per square meter on metal sheets being manufactured.
5) The number of mistakes per typewritten page.

PMF
The probability that x events occur in the time/space interval is given by

P(X = x) = f(x) = e^(–µ) µ^x / x!

for x = 0, 1, 2, 3, … where µ > 0.

Examples

1) A secretary claims an average mistake rate of 1 per page. A sample page is selected
at random and 5 mistakes found. What is the probability of her making 5 or more
mistakes if her claim of 1 mistake per page on average is correct?

Solution:

In this case μ = 1 is claimed and X, the number of mistakes, is ≥ 5. If the claim is true,
P(X ≥ 5) = 1 – P(X ≤ 4)
= 1 – (e^(–1) + e^(–1) + e^(–1)/2! + e^(–1)/3! + e^(–1)/4!)
= 1 – 0.9963
= 0.0037.

The above calculation shows that if the claim of 1 mistake per page on average is true,
there is only a 37 in 10 000 chance of getting 5 or more mistakes per page. This remote
chance of 5 or more mistakes when an average of 1 mistake per page is true casts doubt
on whether the claim of 1 mistake per page on average is in fact true.

2) At a particular restaurant 4 plates are broken, on average, each week. What is the
probability that
a) 2 plates are broken next week?
b) at most 4 plates are broken next week?
c) more than 3 plates are broken next week?

Solution:

a) Let X = number of plates broken in a week.
Then X ~ Po(4) and
P(X = 2) = e^(–4) 4^2 / 2! = 0.1465
b) P(X ≤ 4) = f(0) + f(1) + f(2) + f(3) + f(4)
= e^(–4) 4^0/0! + e^(–4) 4^1/1! + e^(–4) 4^2/2! + e^(–4) 4^3/3! + e^(–4) 4^4/4!
= 0.6288

c) P(X > 3) = 1 – P(X ≤ 3)
= 1 – f(0) – f(1) – f(2) – f(3)
= 1 – e^(–4) 4^0/0! – e^(–4) 4^1/1! – e^(–4) 4^2/2! – e^(–4) 4^3/3!
= 0.5665
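
The Poisson probabilities for the broken-plates example can be reproduced as follows. This is a minimal sketch using only the standard library; poisson_pmf and poisson_cdf are illustrative helper names.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """P(X = x) for X ~ Po(mu)."""
    return exp(-mu) * mu**x / factorial(x)

def poisson_cdf(x, mu):
    """P(X <= x) for X ~ Po(mu), by summing the pmf."""
    return sum(poisson_pmf(k, mu) for k in range(x + 1))

mu = 4  # average number of plates broken per week
print(round(poisson_pmf(2, mu), 4))      # a) P(X = 2)
print(round(poisson_cdf(4, mu), 4))      # b) P(X <= 4)
print(round(1 - poisson_cdf(3, mu), 4))  # c) P(X > 3) = 1 - P(X <= 3)
```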

Cumulative Poisson Distribution Tables

Notice that the calculations needed in parts (b) and (c) of the previous example are time
consuming. Instead of using the probability mass function f(x) to solve the problems, the
CDF F(x) can be used. Values for the CDF are found in the Cumulative Poisson Distribution
table at the end of the notes (Table B).
The top row gives some values for µ and the first column gives some values that the Poisson
random variable X can take on. The cumulative probabilities F(x) = P(X ≤ x) can be found by
lining up the relevant row and column.

Reminder: As with the Cumulative Binomial Distribution tables, these tables give
cumulative probabilities so situations that involve the “<”, “>” and “≥” signs must be
adjusted so that they are in a form that uses the “≤” sign i.e. a “less than or equal to”
situation.

Example 2 (continued)
Part (b): Step 1 – Find µ=4 in the top row of the table.
Step 2 – Find x=4 in the first column.
Step 3 – Line up the column and row.
At the intersection of the row and column is the value F(4) = P(X ≤ 4) = 0.6288

Part (c): First find P(X ≤ 3)
Step 1 – Find µ=4 in the top row.
Step 2 – Find x=3 in the first column.
Step 3 – Line up the column and the row.
At the intersection of the row and column is the value F(3) = P(X ≤ 3) = 0.4335
So, P(X > 3) = 1 – P(X ≤ 3) = 1 – 0.4335 = 0.5665

Poisson approximation of binomial distribution

The Poisson random variable can also be seen as an approximation to a binomial random
variable with the number of trials (n) large and the probability of success (p) small such that
the mean μ = np is of moderate size. This approximation is good when n ≥ 20 and p ≤ 0.05
or n ≥ 100 and np ≤ 10 .

Example

A life insurance company has found that the probability is 0.000015 that a person aged 40-
50 will die from a certain rare disease. If the company has 100 000 policy holders in this age
group, what is the probability that this company will have to pay out 4 claims or more
because of death from this disease?

Solution:

For the following reasons a binomial distribution with n = 100 000 and p = 0.000015 is
reasonable in this case.

1 A person either dies or not from this disease (two outcomes).

2 The probability of dying from the disease is constant.

3 The death or not from this disease of one person does not affect that of another
person.

The Poisson distribution with µ = 100 000×(0.000015) = 1.5 can be used to approximate this
probability.
P(X ≥ 4) = 1 – P(X ≤ 3)

= 1 – 0.9344
= 0.0656.
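
To see how close the approximation is in this example, the exact binomial probability can be computed alongside the Poisson one. The sketch below is illustrative only (the variable names are not from the notes) and relies on Python's exact integer arithmetic for the binomial coefficients.

```python
from math import comb, exp, factorial

n, p = 100_000, 0.000015
mu = n * p  # Poisson mean used in the approximation (1.5)

# P(X <= 3) under the exact binomial model and under the Poisson approximation.
binom_le3 = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(4))
poisson_le3 = sum(exp(-mu) * mu**k / factorial(k) for k in range(4))

# P(X >= 4) = 1 - P(X <= 3) under each model.
print(1 - binom_le3, 1 - poisson_le3)
```

With n this large and p this small the two answers differ only far beyond the fourth decimal place.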

Mean and standard deviation of a Poisson random variable

• The mean and variance of the Poisson distribution are given by E(X) = µ and
var(X) = µ, so the standard deviation is √µ.
• In the case of the Poisson approximation to the binomial distribution

E(X) = var(X) = np
standard deviation = √(np).

If the average rate of occurrence µ is given for a particular time/space interval
length/size, probability calculations can also be carried out for an interval length/size which
is different to the one given.

Example

Calls arrive at a switchboard at an average rate of 1 every 15 seconds. What is the probability
of not more than 5 calls arriving during a particular minute?

Solution:

A mean rate of 1 every 15 seconds is equivalent to a mean rate of 4 every minute. Since the
question concerns an interval of 1 minute, µ = 4 (not µ = 1).

P(X ≤ 5) = e^(–4) + 4^1 e^(–4)/1! + 4^2 e^(–4)/2! + 4^3 e^(–4)/3! + 4^4 e^(–4)/4! + 4^5 e^(–4)/5! = 0.7851

Chapter 5 – The normal distribution

5.1 Probability distributions of continuous random variables
A random variable X is called continuous if it can assume any of the possible values in some
interval i.e. the number of possible values is infinite. In this case the definition of a discrete
probability distribution (a list of possible values with their corresponding probabilities)
cannot be used, since with infinitely many possible values in an interval it is not possible to
draw up such a list. For this reason probabilities associated with individual values of a
continuous random variable X are taken as 0.

The clustering pattern of the values of X over the possible values in the interval is described
by a mathematical function f(x) called the probability density function (pdf). A high (low)
clustering of values will result in high (low) values of this function. For a continuous random
variable X, only probabilities associated with ranges of values (e.g. an interval of values from
a to b) will be calculated. The probability that the value of X will fall between the values a
and b is given by the area between a and b under the curve describing the probability
density function f(x). For any probability density function the total area under the graph of
f(x) is 1.

5.2 Normal distribution


A continuous random variable X is normally distributed (follows a normal distribution) if the
probability density function of X is given by

f(x) = [1/(σ√(2π))] e^(–(x – µ)²/(2σ²)) for –∞ < x < ∞.

The constants µ and σ can be shown to be the mean and standard deviation respectively
of X. These constants completely specify the density function. The curve describing the
probability density function (known as the normal curve) for the case µ = 0 and σ = 1 is
shown in the graph that follows.

[Figure: Graph of the standard normal distribution. Horizontal axis: z from –4 to 4; vertical axis: p(z) from 0 to 0.45.]

5.2.1 Properties of the normal distribution


The graph of the function defined above has a symmetric, bell-shaped appearance. The
mean µ is located on the horizontal axis where the graph reaches its maximum value. At the
two ends of the scale the curve describing the function gets closer and closer to the
horizontal axis without actually touching it. Many quantities measured in everyday life have
a distribution which closely matches that of a normal random variable e.g. marks in an
exam, weights of products, heights of a male population. The parameter µ shows where the
distribution is centrally located and σ the spread of the values around µ. A short hand way
of referring to a random variable X which follows a normal distribution with mean µ and
variance σ² is by writing X ~ N(µ ; σ²). The next diagram shows graphs of normal
distributions for various values of μ and σ².

An increase (decrease) in the mean µ results in a shift of the graph to the right (left). An
increase (decrease) in the standard deviation σ results in the graph becoming more (less)
spread out e.g. compare the curves of the distributions with σ² = 0.2, 0.5, 1 and 5 in the
previous diagram.

5.2.2 Empirical example – The normal distribution and a histogram
Consider the scores obtained by 4 500 candidates in a matric mathematics examination.

[Figure: Histogram of the examination marks. Horizontal axis: mark (classes 15, 25, 35, 45, 55, 65, 75, 90, More); vertical axis: freq from 0 to 1000.]

The histogram of the marks has an appearance that can be described by a normal curve i.e.
it has a symmetric, bell-shaped appearance. The mean of the marks is 51.95 and the
standard deviation 10.

5.3 The Standard Normal Distribution


To find probabilities for a normally distributed random variable, we need to be able to
calculate the areas under the graph of the normal distribution. Such areas are obtained
from a table showing the cumulative distribution of the normal distribution (see appendix).
Since the normal distribution is specified by the mean (µ) and standard deviation (σ), there
are many possible normal distributions that can occur. It would be impossible to construct a
table for each possible mean and standard deviation. This problem is overcome by
transforming X, the normal random variable of interest [X ~ N(µ; σ²)], to a standardized
normal random variable

Z = (X – µ)/σ.

It can be shown that the transformed random variable is normally distributed with µ = 0 and
σ = 1 i.e. Z ~ N(0; 1). The random variable Z can be transformed back to X by using the
formula

X = µ + Zσ .

The normal distribution with mean µ = 0 and standard deviation σ = 1 is called the standard
normal distribution. The symbol Z is reserved for a random variable with this distribution.
The graph of the standard normal distribution appears below.

Various areas under the above normal curve are shown. The standard normal table gives the
area under the curve to the left of the value z. Other types of areas can be found by
combining several of the areas as shown in the next examples.

5.4 Calculating probabilities using the standard normal table
The standard normal table is found at the back of your notes.
The areas shown in the table are those under the standard normal curve to the left of the
value of z looked up i.e. P(Z < z).

Note
• For negative values of z less than the minimum value (– 3.79) in the table, the
probabilities are taken as 0 i.e. P(Z ≤ z) = 0 for z < – 3.79.
• For positive values of z greater than the maximum value (3.79) in the table, the
probabilities are taken as 1 i.e. P(Z ≤ z) = 1 for z > 3.79.

Examples
In all the examples that follow, Z ~ N(0; 1).

a) P(Z < 1.35) = 0.9115

b) P(Z > – 0.47) = 1 – P(Z ≤ – 0.47)
= 1 – 0.3192
= 0.6808

c) P( – 0.47 < Z < 1.35) = P(Z < 1.35) – P(Z < – 0.47)
= 0.9115 – 0.3192
= 0.5923

d) P(Z > 0.76) = 1 – P(Z < 0.76)
= 1 – 0.7764
= 0.2236

e) P(0.95 ≤ Z ≤ 1.36) = P(Z ≤ 1.36) – P(Z ≤ 0.95)
= 0.9131 – 0.8289
= 0.0842

f) P( – 1.96 ≤ Z ≤ 1.96) = P(Z ≤ 1.96) – P(Z ≤ – 1.96)
= 0.9750 – 0.0250
= 0.95
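
Table look-ups such as these can be checked in software. The sketch below assumes Python 3.8 or later, where the standard library provides statistics.NormalDist; its cdf method returns the area to the left of z, exactly as the standard normal table does.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

print(round(Z.cdf(1.35), 4))                 # a) P(Z < 1.35)
print(round(1 - Z.cdf(-0.47), 4))            # b) P(Z > -0.47)
print(round(Z.cdf(1.35) - Z.cdf(-0.47), 4))  # c) P(-0.47 < Z < 1.35)
print(round(Z.cdf(1.96) - Z.cdf(-1.96), 4))  # f) P(-1.96 <= Z <= 1.96)
```

Small differences in the last decimal place are possible because the table entries are themselves rounded to four decimals before being subtracted.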

In all the above examples an area was found for a given value of z. It is also possible to find a
value of z when an area to its left is given. This can be written as P(Z ≤ zα) = α (α is the Greek
letter for “a” and is pronounced “alpha”). In this case zα has to be found where α is the area
to its left.

Examples

1) Find the value of z that has an area of 0.0344 to its left.

Search the body of the table for the required area (0.0344) and then read off the
value of z corresponding to this area. In this case z0.0344 = – 1.82.

2) Find the value of z that has an area of 0.975 to its left.

Finding 0.975 in the body of the table and reading off the z value gives z0.975 = 1.96.

3) Find the value of z that has an area of 0.95 to its left.

When searching the body of the table for 0.95 this value is not found. The z value
corresponding to 0.95 can be estimated from the following information obtained
from the table.

z area to left
1.64 0.9495
? 0.95
1.65 0.9505

Since the required area (0.95) is halfway between the 2 areas obtained from the
table, the required z can be taken as the value halfway between the two z values
that were obtained from the table i.e. z = (1.64 + 1.65)/2 = 1.645.

Exercise: Using the same approach as above, verify that the z value corresponding to
an area of 0.05 to its left is –1.645.

4) Find the value of z that has an area of 0.9841 to its left.

When searching the body of the table this area is not found. The following
information can be found.
z area to left
2.14 0.9838
? 0.9841
2.15 0.9842

The area required is not midway between the 2 other areas so the z-value
corresponding to the closer area is used i.e. z = 2.15 is used.

At the bottom of the standard normal table selected percentiles zα are given for different
values of α. This means that the area under the normal curve to the left of zα is α.

Examples:
1 α = 0.900, zα = 1.282
means P(Z < 1.282) = 0.900.

2 α = 0.995, zα = 2.576
means P(Z < 2.576) = 0.995.

3 α = 0.005, zα = – 2.576
means P(Z < – 2.576) = 0.005.

The standard normal distribution is symmetric with respect to the mean = 0. From this it
follows that the area under the normal curve to the right of a positive z entry in the
standard normal table is the same as the area to the left of the associated negative entry
(– z) i.e.

P(Z ≥ z) = P(Z ≤ – z) .

For example, P(Z ≥ 1.96) = 1 – 0.975 = 0.025 = P(Z ≤ – 1.96).
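
Searching the body of the table for an area corresponds to evaluating the inverse of the cumulative distribution function. As a sketch (again assuming Python 3.8 or later), statistics.NormalDist provides this through its inv_cdf method:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

# z-values with the given area alpha to their left.
for alpha in (0.0344, 0.975, 0.95, 0.9841, 0.05):
    print(alpha, round(Z.inv_cdf(alpha), 3))

# Symmetry: the area to the right of z equals the area to the left of -z.
print(abs((1 - Z.cdf(1.96)) - Z.cdf(-1.96)) < 1e-12)
```

The software values can differ slightly from the table-based answers, since the table method takes the nearest tabulated entry or a midpoint rather than solving for z exactly.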

5.5 Calculating probabilities for any normal random variable

Let X be a N(μ ; σ²) random variable and Z a N(0 ; 1) random variable. Then

P(X ≤ x) = P((X – µ)/σ ≤ (x – µ)/σ) = P(Z ≤ (x – µ)/σ)

P(a ≤ X ≤ b) = P((a – µ)/σ ≤ (X – µ)/σ ≤ (b – µ)/σ) = P((a – µ)/σ ≤ Z ≤ (b – µ)/σ)

Example 1
The height H (in inches) of a population of women is approximately normally distributed
with a mean of µ = 63.5 and a standard deviation of σ = 2.75 inches. To calculate the
probability that a woman is less than 63 inches tall, we first find the z-value that is
associated with h = 63 inches. (This z-value is sometimes referred to as the z-score.)

z = (63 – 63.5)/2.75 = –0.18

Then use P(H ≤ 63) = P(Z ≤ – 0.18) = 0.4286.


This means that 42.86% (a proportion of 0.4286) of women are less than 63 inches tall.

Example 2
The length X (inches) of sardines is a N(4.62 ; 0.0529) random variable, so σ = √0.0529 = 0.23.
What proportion of sardines is
(a) longer than 5 inches? (b) between 4.35 and 4.85 inches?

(a) P(X > 5) = P(Z > (5 – 4.62)/0.23)
= P(Z > 1.65)
= 1 – P(Z ≤ 1.65)
= 1 – 0.9505
= 0.0495.

(b) P(4.35 ≤ X ≤ 4.85) = P((4.35 – 4.62)/0.23 ≤ Z ≤ (4.85 – 4.62)/0.23)
= P(– 1.17 ≤ Z ≤ 1)
= P(Z ≤ 1) – P(Z ≤ –1.17)
= 0.8413 – 0.1210
= 0.7203.
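
The two examples above can be checked without standardizing by hand. In the sketch below (Python 3.8 or later assumed), statistics.NormalDist takes the mean and the standard deviation, not the variance, so the sardine variance 0.0529 is converted first.

```python
from math import sqrt
from statistics import NormalDist

# Example 1: heights with mean 63.5 and standard deviation 2.75.
H = NormalDist(mu=63.5, sigma=2.75)
print(round(H.cdf(63), 4))  # P(H <= 63)

# Example 2: sardine lengths, N(4.62 ; 0.0529) so sigma = sqrt(0.0529).
X = NormalDist(mu=4.62, sigma=sqrt(0.0529))
print(round(1 - X.cdf(5), 4))             # (a) P(X > 5)
print(round(X.cdf(4.85) - X.cdf(4.35), 4))  # (b) P(4.35 <= X <= 4.85)
```

The results agree with the table-based answers to about two or three decimal places; the remaining difference comes from rounding the z-score to two decimals before using the table.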

5.6 Finding percentiles by using the standard normal table

The standard normal table can be used to find percentiles for random variables which are
normally distributed.

Example

The marks M obtained in a mathematics entrance examination are normally distributed with
µ = 514 and σ = 113 . Find the mark that is the 80th percentile.

From the standard normal table, the z-score which is closest to an entry of 0.80 in
the body of the table is 0.84 (the actual area to its left is 0.7995). The mark which
corresponds to a z-score of 0.84 can be found by solving

0.84 = (m – 514)/113

for m. This yields m = 514 + 0.84 × 113 = 608.92 i.e. a mark of approximately 609 is better
than 80% of all other exam marks.

Exercises: All these exercises refer to the normal distribution above.

(1) Find P35 (the 35th percentile).
(2) If a person scores in the top 5% of test marks, what is the minimum mark they could
have received?
(3) If a person scores in the bottom 10% of test marks, what is the maximum mark they
could have received?

Chapter 6 – Sampling distributions

6.1 Definitions
• A sampling distribution arises when repeated samples of the same size are drawn
from a particular population (distribution) and a statistic (numerical measure of
description of sample data) is calculated for each sample. The interest is then
focused on the probability distribution (called the sampling distribution) of the
statistic.

• Sampling distributions arise in the context of statistical inference i.e. when
statements are made about a population on the basis of random samples drawn
from it.

Example

Suppose all possible samples of size 2 are drawn with replacement from a population with
sample space S = {2, 4, 6, 8} and the mean calculated for each sample.

The different values that can be obtained and their corresponding means are shown in the
table below.

              2nd value
1st value   2   4   6   8
    2       2   3   4   5
    4       3   4   5   6
    6       4   5   6   7
    8       5   6   7   8

In the above table the row and column entries indicate the two values in the sample (16
possibilities when combining rows and columns). The mean is located in the cell
corresponding to these entries e.g. 1st value = 4, 2nd value = 6 has a mean entry of
(4 + 6)/2 = 5.

Assuming that random sampling is used, all the mean values in the above table are equally
likely. Under this assumption the following distribution can be constructed for these mean
values.

x̄ 2 3 4 5 6 7 8 sum
count 1 2 3 4 3 2 1 16
P(X̄ = x̄) 1/16 1/8 3/16 1/4 3/16 1/8 1/16 1

The above distribution is referred to as the sampling distribution of the mean for random
samples of size 2 drawn from this distribution.

The mean and variance of the population from which these samples are drawn are
µ = 5
and
σ² = [∑x² – (∑x)²/N] ÷ N = ¼ (2² + 4² + 6² + 8² – 20²/4) = 5.

The sampling distribution of the mean has mean and variance
µ_X̄ = ∑ x̄ P(X̄ = x̄) = 5
and
σ²_X̄ = ∑ x̄² P(X̄ = x̄) – (µ_X̄)² = 27.5 – 25 = 2.5.

Note that µ_X̄ = 5 = µ and that σ²_X̄ = 2.5 = 5/2 = σ²/2.

Consider a population with mean µ and variance σ². It can be shown that the mean and
variance of the sampling distribution of the mean, based on a random sample of size n, are
given by
µ_X̄ = µ and σ²_X̄ = σ²/n.

σ_X̄ = σ/√n is known as the standard error. In the preceding example n = 2.
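
The sampling distribution in this example is small enough to enumerate completely. The following sketch (variable names are illustrative) lists all 16 ordered samples of size 2 drawn with replacement, tabulates the sample means, and checks the two results stated above using exact fractions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [2, 4, 6, 8]
n = 2

# All ordered samples of size n drawn with replacement (16 of them).
samples = list(product(population, repeat=n))

# Sampling distribution of the mean: P(sample mean = m) for each possible m.
counts = Counter(Fraction(sum(s), n) for s in samples)
probs = {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}

mean_xbar = sum(m * p for m, p in probs.items())
var_xbar = sum(m * m * p for m, p in probs.items()) - mean_xbar ** 2

print(mean_xbar, var_xbar)  # compare with mu = 5 and sigma^2 / n = 5/2
```

Changing n to 3 gives 64 samples and a variance of 5/3, in line with σ²/n.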

Sampling distributions can involve different statistics (e.g. sample mean, sample proportion,
sample variance) calculated from different sample sizes drawn from different distributions.
Some of the important results from statistical theory concerning sampling distributions are
summarized in the sections that follow.

6.2 The Central Limit Theorem


The following result is known as the Central Limit Theorem (CLT).

Let X1, X2, . . . , Xn be a random sample of size n drawn from a
distribution with mean µ and variance σ² (σ² should be finite). Then for
sufficiently large n, the sample mean
(X1 + X2 + . . . + Xn)/n
is approximately normally distributed with
mean = µ and variance = σ²/n.

Note 1:
Since
X̄ = (X1 + X2 + . . . + Xn)/n,
this result can be written as
X̄ ~ N(µ; σ²/n) or Z = (X̄ – µ)/√(σ²/n) ~ N(0; 1).

Note 2:
The value of n for which this theorem is valid depends on the distribution from
which the sample is drawn. If the sample is drawn from a normal population, the
theorem is valid for all n. If the distribution from which the sample is drawn is fairly
close to being normal, a value of n > 30 will suffice for the theorem to be valid. If the
distribution from which the sample is drawn is substantially different from a normal
distribution e.g. positively or negatively skewed, a value of n much larger than 30 will
be needed for the theorem to be valid.

Note 3:
Suppose the underlying distribution is a Bernoulli distribution with probability p of
success and probability q = 1–p of failure. The mean and variance for this distribution
are µ = p and σ2 = pq.
In this case
$\hat{P} = \frac{\sum_{i=1}^{n} x_i}{n}$

where $\hat{P}$ is the proportion of successes in the sample and can be seen as an estimate
of the proportion of successes in the population (the distribution from which the
sample is drawn).
Then, according to the CLT,
$\hat{P} \sim N\left(p;\ \frac{pq}{n}\right)$ or $Z = \frac{\hat{P} - p}{\sqrt{pq/n}} \sim N(0;\ 1)$

Example 1:

An electric firm manufactures light bulbs whose lifetime (in hours) follows a normal
distribution with mean 800 and variance 1600. A random sample of 10 light bulbs is drawn
and the lifetime recorded for each light bulb. Calculate the probability that the mean of this
sample

(a) is less than 785 hours.


(b) is greater than 820 hours.
(c) differs from the actual mean lifetime of 800 by not more than 16 hours.
(d) differs from the actual mean lifetime of 800 by more than 16 hours.

Solution: σ = √1600 = 40 and n = 10, so the standard error is σ/√n = 40/√10.

(a) $P(\bar{X} < 785) = P\left(Z < \frac{785 - 800}{40/\sqrt{10}}\right)$
$= P(Z < -1.19)$
$= 0.117$

(b) $P(\bar{X} > 820) = P\left(Z > \frac{820 - 800}{40/\sqrt{10}}\right)$
$= P(Z > 1.58)$
$= 1 - 0.9429$
$= 0.0571$

(c) $P(-16 < \bar{X} - \mu < 16) = P\left(\frac{-16}{40/\sqrt{10}} < \frac{\bar{X} - 800}{\sigma/\sqrt{n}} < \frac{16}{40/\sqrt{10}}\right)$
$= P(-1.26 < Z < 1.26)$
$= P(Z < 1.26) - P(Z < -1.26)$
$= 0.8962 - 0.1038$
$= 0.7924$

(d) $P(\bar{X}$ differs from 800 by more than 16$)$
$= 1 - P(\bar{X}$ differs from 800 by not more than 16$)$
$= 1 - 0.7924$
$= 0.2076$
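The four probabilities in Example 1 can be reproduced without tables. The sketch below uses `NormalDist` from Python's standard `statistics` module; the variable names are illustrative, and the results differ slightly from the hand calculation because z is not rounded to two decimals.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma2, n = 800, 1600, 10

# Distribution of the sample mean: N(mu; sigma^2 / n)
xbar_dist = NormalDist(mu, sqrt(sigma2 / n))

p_a = xbar_dist.cdf(785)                        # (a) P(Xbar < 785)
p_b = 1 - xbar_dist.cdf(820)                    # (b) P(Xbar > 820)
p_c = xbar_dist.cdf(816) - xbar_dist.cdf(784)   # (c) within 16 of 800
p_d = 1 - p_c                                   # (d) more than 16 away from 800

print(round(p_a, 4), round(p_b, 4), round(p_c, 4), round(p_d, 4))
```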

Example 2
Suppose that 40% of all adults in Durban support an increase in the state sales tax from 5%
to 6% provided that the additional revenue goes to education. A survey of 140 adults
randomly selected from Durban was done and the participants were asked if they support
the increase. What is the probability that more than half the sample support the increase?

Solution: p = 0.4, q = 1 – 0.4 = 0.6, n = 140

$P(\hat{P} > 0.5) = P\left(\frac{\hat{P} - p}{\sqrt{pq/n}} > \frac{0.5 - 0.4}{\sqrt{(0.4)(0.6)/140}}\right)$
$= P(Z > 2.42)$
$= 1 - P(Z < 2.42)$
$= 1 - 0.9922$
$= 0.0078$

Example 3
It is known that 73% of secretaries in RSA know how to touch-type. A sample of 1800
secretaries is taken. What is the probability that the sample proportion of secretaries who
can touch-type differs from the national proportion by no more than 3%?

Solution: p = 0.73, q = 0.27, n = 1800

$P(-0.03 < \hat{P} - p < 0.03) = P\left(\frac{-0.03}{\sqrt{(0.73)(0.27)/1800}} < \frac{\hat{P} - p}{\sqrt{pq/n}} < \frac{0.03}{\sqrt{(0.73)(0.27)/1800}}\right)$
$= P(-2.87 < Z < 2.87)$
$= P(Z < 2.87) - P(Z < -2.87)$
$= 0.9979 - 0.0021$
$= 0.9958$
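Examples 2 and 3 follow the same pattern, which can be sketched in a few lines. The helper name `phat_dist` is hypothetical; it simply builds the approximating normal distribution N(p; pq/n) for the sample proportion.

```python
from math import sqrt
from statistics import NormalDist

def phat_dist(p, n):
    """Approximate normal distribution of the sample proportion (CLT)."""
    return NormalDist(p, sqrt(p * (1 - p) / n))

# Example 2: P(sample proportion > 0.5) with p = 0.4, n = 140
ex2 = 1 - phat_dist(0.40, 140).cdf(0.5)

# Example 3: P(sample proportion within 0.03 of p) with p = 0.73, n = 1800
d = phat_dist(0.73, 1800)
ex3 = d.cdf(0.73 + 0.03) - d.cdf(0.73 - 0.03)

print(round(ex2, 4), round(ex3, 4))
```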

6.3 The t-distribution


The central limit theorem states that, for sufficiently large n, the statistic
$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$
follows (approximately) a standard normal distribution. If σ is not known, it would be logical to replace σ (in
the formula for Z) by its sample estimate S. For small values of the sample size n, the
statistic
$t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$
does not follow a normal distribution. If it is assumed that sampling is done from a
population that is approximately a normal population, the distribution of the statistic t
follows a t-distribution with n–1 degrees of freedom (df). For each value of df a different
distribution is defined.

The t-distribution has the following properties:

• The mean is zero (like the standard normal distribution).


• The distribution is bell-shaped and is symmetrical about the mean (like the standard
normal distribution).
• The variance is greater than one, but approaches one from above as the sample size
increases (σ2 = 1 for the standard normal distribution).

So the t-distribution is similar to the standard normal distribution. For small sample sizes it
shows more variability than the standard normal distribution i.e. its curves are flatter in
appearance with thicker tails. As the sample size increases, the t-distribution approaches
the standard normal distribution and for n>30 the differences are negligible.
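The thicker tails for small samples can be seen in a simulation. The sketch below draws repeated samples from a standard normal population (an assumption made for illustration), computes the t statistic for each, and records how often |t| exceeds 1.96. For a standard normal variable this would happen about 5% of the time. The seed and number of repetitions are arbitrary.

```python
import random
from statistics import mean, stdev

random.seed(6)  # fixed seed so the sketch is repeatable

def t_statistic(n, mu=0.0, sigma=1.0):
    """t = (xbar - mu) / (S / sqrt(n)) for one normal sample of size n."""
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    return (mean(sample) - mu) / (stdev(sample) / n ** 0.5)

def tail_fraction(n, reps=10000, cutoff=1.96):
    """Share of simulated t statistics with |t| > cutoff."""
    return sum(abs(t_statistic(n)) > cutoff for _ in range(reps)) / reps

small = tail_fraction(5)    # df = 4: noticeably more than 5% in the tails
large = tail_fraction(50)   # df = 49: close to 5%
print(small, large)
```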

The graph below shows how the t-distribution changes for different values of r (the degrees
of freedom).

The t-distribution was first proposed in a paper by William Gosset in 1908 who wrote the
paper under the pseudonym “Student” and so is also referred to as Student’s t-distribution.

Tables for the t-distribution


The layout of the t-tables is as follows.

ν      α=0.900   α=0.95   ..   α=0.995
1      3.078     6.314    ..   63.66
2      1.886     2.920    ..   9.925
.      .         .             .
.      .         .             .
∞      1.282     1.645    ..   2.576

The row entry (ν ) gives the degrees of freedom (df) and the column entry (α) gives the area
under the t-curve to the left of the value that appears in the table at the intersection of the
row and column entry.
Notation: tν ;α denotes the t-value that has an area of α to the left where the df for the
t-distribution are ν.

Examples

1. For df = 2 and α = 0.995 the entry is t2 ; 0.995 = 9.925. This means that for the t-
distribution with 2 degrees of freedom
P(t ≤ 9.925) = 0.995.

2. For df = ∞ and α = 0.95 the entry is t∞ ; 0.95= 1.645. This means that for the t-
distribution with ∞ degrees of freedom
P(t ≤ 1.645) = 0.95.

When a t-value that has an area less than 0.5 to its left is to be looked up, the fact that the
t-distribution is symmetrical around 0 is used i.e.,

P(t < tν ; α) = P(t > tν ; 1 – α) = P(t < – tν ; 1 – α) for α ≤ 0.5

This means that tν ; α = – tν ; 1 – α.



Examples

3. For df = ν = 10 and α = 0.10 the value of t10 ; 0.10 such that P(t < t10 ; 0.10 ) = 0.10 is found
from
t10 ; 0.10 = – t10 ; 0.90 = – 1.372

Note that the percentile values in the last row of the t-distribution are identical to the
corresponding percentile entries in the standard normal table. Since the t-distribution for
large samples (degrees of freedom) is the same as the standard normal distribution, their
percentiles should be the same.

6.4 The chi-square (χ2) distribution


The chi-square distribution arises in a number of sampling situations. These include the ones
described below.

1) Drawing repeated samples of size n from an approximately normal distribution with
variance σ2 and calculating the variance (S2) for each sample. It can be shown that
the quantity
$\chi^2 = \frac{(n-1)S^2}{\sigma^2}$
follows a chi-square distribution with degrees of freedom = n – 1. This will be used in
chapters 7 to 9.

2) When comparing sequences of observed and expected frequencies as shown in the
table below. The observed frequencies (referring to the number of times values of
some variable of interest occur) are obtained from an experiment, while the
expected ones arise from some pattern believed to be true.

observed frequency   f1   f2   ..   fk
expected frequency   e1   e2   ..   ek

The quantity $\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$ can be shown to follow a chi-square distribution with
k – 1 degrees of freedom. The purpose of calculating this χ2 is to make an assessment
as to how well the observed and expected frequencies correspond. (This will not be
covered in STAT130).

The chi-square curve is different for each value of degrees of freedom. The diagram on the
following page shows how the chi-square distribution changes for different values of ν (the
degrees of freedom).

Unlike the normal and t-distributions, the chi-square distribution is only defined for positive
values and is not a symmetrical distribution. As the degrees of freedom increase, the
chi-square distribution becomes more and more symmetrical. For sufficiently large degrees of
freedom the chi-square distribution approaches the normal distribution.

Tables for the chi-square distribution

The layout of the chi-square tables is as follows.

ν      α=0.005    α=0.01     ..   α=0.99   α=0.995
1      0.000039   0.000157   ..   6.63     7.88
2      0.010025   0.020101   ..   9.21     10.60
.      .          .               .        .
30     13.79      14.95      ..   50.89    53.67

The row entry (ν ) gives the degrees of freedom and the column entry (α) gives the area
under the chi-square curve to the left of the value that appears in the table at the
intersection of the row and column entry.
Notation: χ216 ; 0.09 denotes the chi-square value where df=16 and the area to its left is 0.09.

Examples:

1) For df = 30 and α = 0.01 the entry is 14.95 i.e. $\chi^2_{30;\,0.01} = 14.95$. This means that
for the chi-square distribution with 30 degrees of freedom
$P(\chi^2 \le 14.95) = 0.01.$

2) For df = 30 and α = 0.995 the entry is 53.67 i.e. $\chi^2_{30;\,0.995} = 53.67$. This means that
for the chi-square distribution with 30 degrees of freedom
$P(\chi^2 \le 53.67) = 0.995.$

3) For df = 6 and α = 0.95 the entry is 12.59 i.e. $\chi^2_{6;\,0.95} = 12.59$. This means that for
the chi-square distribution with 6 degrees of freedom
$P(\chi^2 \le 12.59) = 0.95$ or $P(\chi^2 > 12.59) = 0.05.$

This probability statement is illustrated in the next graph.



6.5 The F-distribution

Random samples of sizes n1 and n2 (sometimes m is used instead of n2) are drawn from
independent normally distributed populations that are labeled 1 and 2 respectively. Denote
the variances calculated from these samples by $S_1^2$ and $S_2^2$ respectively and their
corresponding population variances by $\sigma_1^2$ and $\sigma_2^2$ respectively. The ratio
$F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}$
is distributed according to an F-distribution (named after the famous statistician R.A. Fisher)
with numerator degrees of freedom $df_1 = n_1 - 1$ and denominator degrees of freedom
$df_2 = n_2 - 1$. When $\sigma_1^2 = \sigma_2^2$ the F-ratio is $F = S_1^2/S_2^2$.

Notation: F(df1 ,df2) or Fdf1 , df2 are used to denote the F-distribution with df1 numerator
degrees of freedom and df2 denominator degrees of freedom.

The F-distribution is positively skewed, and the F-values can only be positive. For each
combination of df1 and df2 there is a different F-distribution. The diagram below shows plots
for a number of F-distributions (F-curves) with $\sigma_1^2 = \sigma_2^2$.

Three important distributions are related to the F-distribution as special or limiting cases. The
square of a standard normal variable follows an F(1, ∞) distribution, the square of a t-variable
with n2 degrees of freedom follows an F(1, n2) distribution, and a chi-square variable with n1
degrees of freedom, divided by n1, follows an F(n1, ∞) distribution.

Tables for the F-distribution

In Table F (at the end of your notes) the 90, 95, 97.5 and 99 percentage points of the F-
distribution are given. The tables are laid out as follows:

The top row gives some values for the numerator df and the first column gives some values
for the denominator df. The value at the intersection of a row and a column is the value that
has α(100)% of the area under the curve to its left i.e. P(F < Fdf1 , df2 ; α) = α. There are tables
for α = 0.90, α = 0.95, α = 0.975 and α = 0.99.

Examples

1) For the F(3,26) distribution, 2.98 has an area (under its curve) of α = 0.95 to its left and
1 – α to its right (see graph below), i.e. F3, 26 ; 0.95 = 2.98
P(F < 2.98) = 0.95

2) P( F < F4, 32 ; 0.95 ) = 0.95


So F4, 32 ; 0.95 = 2.67

3) Using df1 = 4 and df2 = 7, the value of F that has 1% of the area to the right of it is 7.85.

Lower tail values from the F-distribution

Only upper tail values (those with large areas below and small areas above) can be read off
from the F-tables. Lower tail values can be calculated from the formula:

$F(df_1, df_2;\ \alpha) = \frac{1}{F(df_2, df_1;\ 1 - \alpha)}$

Examples

1) Find the value such that 2.5% of the area under the F(7,5) curve is to the left of it.

In the above formula df1 = 7, df2 = 5 and α = 0.025. Then

$F(7, 5;\ 0.025) = \frac{1}{F(5, 7;\ 0.975)} = \frac{1}{5.29} = 0.189$

2) Find the value such that 1% of the area under the F(10,9) curve is to the left of it.

In the above formula df1 = 10, df2 = 9 and α = 0.01. Then

$F(10, 9;\ 0.01) = \frac{1}{F(9, 10;\ 0.99)} = \frac{1}{4.94} = 0.2024$

Chapter 7 – Statistical Inference: Estimation for one sample case

7.1 Statistical inference


Statistical inference (inferential statistics) refers to the methodology used to draw conclusions
(expressed in the language of probability) about population parameters on the basis of
samples drawn from the population.

Examples

1.) The government of a country wants to estimate the proportion of voters (p) in the
country that approve of their economic policies.

2.) A manufacturer of car batteries wishes to estimate the average lifetime (µ) of their
batteries.

3.) A paint company is interested in estimating the variability (as measured by the
variance, σ2) in the drying time of their paints.

The quantities p, µ and σ2 that are to be estimated are called population parameters.
A sample estimate of a population parameter is called a statistic. The table below gives
examples of some commonly used parameters together with their statistics.

Parameter   Statistic
p           p̂
µ           x̄
σ2          S2

7.2 Point and interval estimation

• A point estimate of a parameter is a single value (point) that estimates a parameter.

• An interval estimate of a parameter is a range of values from L (lower value) to U


(upper value) that estimate a parameter. Associated with this range of values is a
probability or percentage chance that this range of values will contain the parameter
that is being estimated.

Examples

Suppose the mean time it takes to serve customers at a supermarket checkout counter is to
be estimated.

1) The mean service time of 100 customers of (say) $\bar{x}$ = 2.283 minutes is an example of
a point estimate of the parameter µ.

2) If it is stated that the probability is 0.95 (95% chance) that the mean service time will
be from 1.637 minutes to 4.009 minutes, the interval of values (1.637, 4.009) is an
interval estimate of the parameter μ.

The estimation approaches discussed will focus mainly on the interval estimate approach.

7.3 Confidence intervals terminology


A confidence interval is a range of values from L (lower value) to U (upper value) that
estimate a population parameter θ with 100(1- α )% confidence.

Θ – pronounced “theta”.

L is the lower confidence limit.


U is the upper confidence limit.

The interval (L, U) is called the confidence interval.

1 – α is called the confidence coefficient. It is the probability that the confidence interval
will contain θ the parameter that is being estimated.
100(1 – α) is called the confidence percentage.

Example
Consider example 2 of the previous section.
θ, the parameter that is being estimated, is the population mean µ .

α=0.05
The confidence coefficient is (1–α ) = 0.95
The confidence percentage is 100(1–α ) = 95.

L = 1.637 U = 4.009
The confidence interval is the interval (1.637, 4.009).

In the sections that follow the determination of L and U when estimating the parameters µ, p
and σ2 will be discussed.

7.4 Confidence interval for the population mean (population variance known)

The determination of the confidence limits is based on the central limit theorem. This
theorem states that for sufficiently large samples

$\bar{X} \sim N\left(\mu;\ \frac{\sigma^2}{n}\right)$ or $Z = \frac{\bar{X} - \mu}{\sqrt{\sigma^2/n}} \sim N(0;\ 1)$

Formulae for the lower and upper confidence limits can be constructed in the following way
(using a confidence coefficient of 0.95 as an example).

Since Z ~ N(0;1), it follows from the graph that

P(– 1.96 ≤ Z ≤ 1.96) = 0.95


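and so

$P\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$

By a few steps of mathematical manipulation (not shown here), the above part in brackets
can be changed to have only the parameter µ between the inequality signs. This will give

$P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$

Let $L = \bar{X} - 1.96\frac{\sigma}{\sqrt{n}}$ and $U = \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$.
Then the above formula can be written as P(L ≤ µ ≤ U) = 0.95.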

This formula is interpreted in the following way:

Since both L and U are determined by the sample values (which determine $\bar{X}$), they
(and the confidence interval) will change for different samples. Since the parameter
µ that is being estimated remains constant, these intervals will either include or
exclude µ. The central limit theorem states that such intervals will include the
parameter µ with probability 0.95 (95 out of 100 times).

In a practical situation the confidence interval will not be determined by many
samples, but by only one sample. Therefore the confidence interval that is calculated
in a practical situation will involve replacing the random variable $\bar{X}$ by the sample
value $\bar{x}$. Then the above formula for a 95% confidence interval for the population
mean µ becomes

$\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}}\ ;\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$ or $\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}$

The percentage of confidence associated with the interval is determined by the value (called
the z – multiplier) obtained from the standard normal distribution. In the above formula a z-
multiplier of 1.96 determines a 95% confidence interval.

If a different percentage of confidence is required, the z – multiplier needs to be changed.


The table below gives examples of z-multipliers and their corresponding confidence
percentages.

confidence percentage 99 95 90
z-multiplier 2.576 1.96 1.645
α 0.01 0.05 0.10

Calculation of confidence interval for µ (σ2 known)


Step 1 : Calculate $\bar{x}$. Values of n, σ2 and confidence percentage are given.
Step 2 : Look up the z-multiplier for the given confidence percentage.
Step 3 : Confidence interval is $\bar{x} \pm \text{z-multiplier} \times \frac{\sigma}{\sqrt{n}}$

Example

The actual content of cool drink in a 500 milliliter bottle is known to vary. The standard
deviation is known to be 5 milliliters. Thirty (30) of these 500 milliliter bottles were selected
at random and their mean content was found to be 498.5 milliliters. Calculate 95% and 99%
confidence intervals for the population mean content of these bottles.

Solution:

95% confidence interval


Substituting $\bar{x}$ = 498.5, n = 30, σ = 5, z = 1.96 into the above formula gives
$498.5 \pm 1.96 \times \frac{5}{\sqrt{30}} = (496.71,\ 500.29).$

99% confidence interval


Substituting $\bar{x}$ = 498.5, n = 30, σ = 5, z = 2.576 into the above formula gives
$498.5 \pm 2.576 \times \frac{5}{\sqrt{30}} = (496.15,\ 500.85).$
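The three calculation steps can be sketched as a small function. The name `z_interval` is hypothetical; the z-multiplier is computed with `NormalDist().inv_cdf` from the standard library instead of being looked up in the table, so it carries a few more decimals than 1.96 or 2.576.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence):
    """Confidence interval for mu when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z-multiplier, about 1.96 for 95%
    error = z * sigma / sqrt(n)
    return xbar - error, xbar + error

ci95 = z_interval(498.5, 5, 30, 0.95)
ci99 = z_interval(498.5, 5, 30, 0.99)
print([round(v, 2) for v in ci95], [round(v, 2) for v in ci99])
```

As expected, the 99% interval is wider than the 95% interval computed from the same sample.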

7.5 Confidence interval for the population mean (population variance not known)

When the population variance (σ2) is not known, it is replaced by the sample variance (S2) in
the formula for Z mentioned in the previous section. In such a case, when n is small, the
quantity

$t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$

follows a t-distribution with degrees of freedom = df = n – 1.

The confidence interval formula used in the previous section is modified by replacing the
z-multiplier by the t-multiplier that is looked up from the t-distribution.

Calculation of confidence interval for µ (σ2 not known), small n


Step 1 : Calculate $\bar{x}$ and S. Values of n and confidence percentage are given.
Step 2 : Look up the t-multiplier for the given confidence percentage and degrees of
freedom = n–1.
Step 3 : Confidence interval is $\bar{x} \pm \text{t-multiplier} \times \frac{S}{\sqrt{n}}$

Example
The time (in seconds) taken to complete a simple task was recorded for each of 15 randomly
selected employees at a certain company. The values are given below.

38.2  43.9  38.4  26.2  41.3  42.3  37.5  37.2  41.2  42.3  31  50.1  37.3  36.7  31.8

Calculate 95% and 99% confidence intervals for the population mean time it takes to
complete this task.

Solution:
n = 15 (given) , x = 38.36, S = 5.78 (Calculated from the data)

95% confidence interval:

α = 0.05 therefore α/2 = 0.025 and 1 – α/2 = 0.975


degrees of freedom = df = ν = 15 – 1 = 14
t-multiplier = t14; 0.975 = 2.145 (from t-table)

Substituting $\bar{x}$ = 38.36, n = 15, S = 5.78, t = 2.145 into the above formula gives
$38.36 \pm 2.145 \times \frac{5.78}{\sqrt{15}} = 38.36 \pm 3.20 = (35.16,\ 41.56).$

99% confidence interval:

α = 0.01 therefore α/2 = 0.005 and 1 – α/2 = 0.995


degrees of freedom = df = ν = 15 – 1 = 14
t-multiplier = t14; 0.995 = 2.977 (from t-table)

Substituting $\bar{x}$ = 38.36, n = 15, S = 5.78, t = 2.977 into the above formula gives
$38.36 \pm 2.977 \times \frac{5.78}{\sqrt{15}} = 38.36 \pm 4.44 = (33.92,\ 42.80).$
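The whole calculation for this example can be sketched as follows. The 15 task times are read here as 38.2, 43.9, 38.4, 26.2, 41.3, 42.3, 37.5, 37.2, 41.2, 42.3, 31, 50.1, 37.3, 36.7, 31.8, which is an assumption about how the printed data should be read; it reproduces the stated mean of 38.36 and S of 5.78. The t-multipliers are taken from the t-table as given in the solution, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Assumed reading of the 15 task times (seconds); matches x-bar = 38.36, S = 5.78
times = [38.2, 43.9, 38.4, 26.2, 41.3, 42.3, 37.5, 37.2,
         41.2, 42.3, 31.0, 50.1, 37.3, 36.7, 31.8]

def t_interval(data, t_multiplier):
    """Confidence interval for mu when sigma is unknown (t-multiplier from tables)."""
    n = len(data)
    xbar, s = mean(data), stdev(data)   # stdev divides by n - 1
    error = t_multiplier * s / sqrt(n)
    return xbar - error, xbar + error

ci95 = t_interval(times, 2.145)   # t(14; 0.975)
ci99 = t_interval(times, 2.977)   # t(14; 0.995)
print([round(v, 2) for v in ci95], [round(v, 2) for v in ci99])
```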

7.6 Confidence interval for population variance

The formulae for the confidence interval of the population variance σ2 are based on the fact
that $\frac{(n-1)S^2}{\sigma^2}$ follows a chi-square distribution with (n – 1) degrees of freedom. Let
$\chi^2(1 - \frac{\alpha}{2})$ and $\chi^2(\frac{\alpha}{2})$ [also written as χ21–α/2 and χ2α/2 respectively] denote the $100(1 - \frac{\alpha}{2})$
and $\frac{100\alpha}{2}$ percentile points of the chi-square distribution with (n – 1) degrees of freedom.

For this distribution, it follows from the graph that

$P\left(\chi^2_{\nu;\,\alpha/2} \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{\nu;\,1-\alpha/2}\right) = 1 - \alpha$

By a few steps of mathematical manipulation (not shown here), the above part in brackets
can be changed to have only the parameter σ2 between the inequality signs. This will give

$P\left(\frac{(n-1)S^2}{\text{upper}} \le \sigma^2 \le \frac{(n-1)S^2}{\text{lower}}\right) = 1 - \alpha$

where
upper = χ2ν ; 1–α/2 , the larger of the 2 percentile points and
lower = χ2ν ; α/2 , the smaller of the 2 percentile points.

The values of α and α/2 are calculated from the confidence percentage = 100(1 – α )
e.g. if confidence percentage = 95, α= 0.05 , α/2 = 0.025.

Calculation of confidence interval for σ2

Step 1 : Calculate S2. Values of n and confidence percentage are given.
Step 2 : Look up upper and lower chi-square values for the given confidence
percentage and degrees of freedom = df = n – 1.
Step 3 : Confidence interval is $\left[\frac{(n-1)S^2}{\text{upper}},\ \frac{(n-1)S^2}{\text{lower}}\right]$

Example

Calculate 90% and 95% confidence intervals for the population variance of the time taken to
complete the simple task (see previous example).

Solution:

n =15 , S2 = 33.3811 (Calculated from the data)

90% confidence interval:


Look up upper and lower chi-square values by using df = ν = 14 and α =0.10.

upper = χ21 – α/2 = χ20.95 = 23.68 for ν = 14.

lower = χ2α/2 = χ20.05 = 6.57 for ν = 14.

(n – 1)S2 = 14 × 33.3811 = 467.34

The confidence interval is $\left(\frac{467.34}{23.68},\ \frac{467.34}{6.57}\right) = (19.74,\ 71.13).$

95% confidence interval


Look up upper and lower chi-square values by using df = ν = 14 and α =0.05.

upper = χ21–α/2 = χ20.975 = 26.12 for ν = 14.

lower = χ2α/2 = χ20.025 = 5.63 for ν = 14.

(n – 1)S2 = 14 × 33.3811 = 467.34

The confidence interval is $\left(\frac{467.34}{26.12},\ \frac{467.34}{5.63}\right) = (17.89,\ 83.01).$
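The variance interval is a direct substitution once the two chi-square percentiles are known. In the sketch below the percentiles are passed in as table values (the standard library has no chi-square quantile function), and the function name is hypothetical.

```python
def variance_interval(s2, n, chi2_lower, chi2_upper):
    """Confidence interval for sigma^2.

    chi2_lower and chi2_upper are the alpha/2 and 1 - alpha/2 percentile
    points of the chi-square distribution with n - 1 degrees of freedom,
    read from the tables.
    """
    ss = (n - 1) * s2
    return ss / chi2_upper, ss / chi2_lower

ci90 = variance_interval(33.3811, 15, chi2_lower=6.57, chi2_upper=23.68)
ci95 = variance_interval(33.3811, 15, chi2_lower=5.63, chi2_upper=26.12)
print([round(v, 2) for v in ci90], [round(v, 2) for v in ci95])
```

Note that the larger percentile point produces the lower limit and the smaller percentile point the upper limit.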

7.7 Confidence interval for population proportion

In some experiments the interest is in whether or not items possess a certain characteristic
of interest e.g. whether a patient improves or not after treatment, whether an item
manufactured is acceptable or not, whether an answer to a question is correct or incorrect.
The population proportion of items labeled “success” in such an experiment (e.g. patient
improves, item is acceptable, answer is correct) is estimated by calculating the sample
proportion of “success” items.

The determination of the confidence limits for the population proportion of items labeled
“success” is based on the central limit theorem for the sample proportion. This theorem
states that for sufficiently large samples

$\hat{P} \sim N\left(p;\ \frac{pq}{n}\right)$ or $Z = \frac{\hat{P} - p}{\sqrt{pq/n}} \sim N(0;\ 1)$

Formulae for the lower and upper confidence limits can be constructed in the following
way.
Since Z ~ N(0;1)

P(– 1.96 ≤ Z ≤ 1.96) = 0.95

$P\left(-1.96 \le \frac{\hat{P} - p}{\sqrt{pq/n}} \le 1.96\right) = 0.95$

By a few steps of mathematical manipulation (not shown here), the above part in brackets
can be changed to have the parameter p (in the numerator) between the inequality signs.
This will give

$P\left(\hat{P} - 1.96\sqrt{\frac{pq}{n}} \le p \le \hat{P} + 1.96\sqrt{\frac{pq}{n}}\right) = 0.95$

Since the confidence interval formula is based on a single sample, the random variable $\hat{P}$ is
replaced by its sample value $\hat{p} = x/n$ and the parameters p and q = 1 – p (under the square
root) by their respective sample estimates $\hat{p} = x/n$ and $\hat{q} = 1 - \hat{p}$.
This gives the following 95% confidence interval for p: $\left(\hat{p} - 1.96\sqrt{\hat{p}\hat{q}/n},\ \hat{p} + 1.96\sqrt{\hat{p}\hat{q}/n}\right)$.
If the confidence percentage is changed then the z-multiplier will change. The table below
gives some examples.

confidence percentage 99 95 90
z-multiplier 2.576 1.96 1.645
α 0.01 0.05 0.10

In general, the (1–α)100% confidence interval for the population proportion is

$\left(\hat{p} - z_{1-\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}\ ;\ \hat{p} + z_{1-\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}\right)$

Calculation of confidence interval for p

Step 1 : Calculate $\hat{p} = x/n$ and $\hat{q} = 1 - \hat{p}$. Values of x, n and confidence percentage are given.
Step 2 : Look up the z-multiplier for the given confidence percentage.
Step 3 : Confidence interval is $\hat{p} \pm \text{z-multiplier} \times \sqrt{\hat{p}\hat{q}/n}$

Example

During a marketing campaign for a new product 176 out of the 200 potential users of this
product that were contacted indicated that they would use it. Calculate a 90% confidence
interval for the proportion of potential users who would use this product.

Solution:
x = 176, n = 200 so $\hat{p} = \frac{176}{200} = 0.88$, $\hat{q} = 1 - \hat{p} = 0.12$.

confidence percentage = 90 (given) so
z0.95 = 1.645 (From above table)

Confidence interval is
$0.88 \pm 1.645\sqrt{0.88 \times 0.12/200} = 0.88 \pm 0.0378 = (0.842,\ 0.918).$
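The same steps can be sketched in code. The function name `proportion_interval` is hypothetical, and the z-multiplier is computed from the standard normal distribution instead of being read from the table.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(x, n, confidence):
    """Large-sample confidence interval for a population proportion."""
    p_hat = x / n
    q_hat = 1 - p_hat
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.645 for 90%
    error = z * sqrt(p_hat * q_hat / n)
    return p_hat - error, p_hat + error

low, high = proportion_interval(176, 200, 0.90)
print(round(low, 3), round(high, 3))
```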

7.8 Sample size when estimating the population mean

Consider the formula for the confidence interval of the mean (µ) when σ2 is known.

$\bar{x} \pm \text{z-multiplier} \times \frac{\sigma}{\sqrt{n}}$

The quantity $\text{z-multiplier} \times \frac{\sigma}{\sqrt{n}}$ is known as the error (denoted by E).
The smaller the error, the more accurately the parameter μ is estimated.
Suppose the size of the error is specified in advance and the sample size n is determined to
achieve this accuracy. This can be done by solving for n from the equation

$E = \text{z-multiplier} \times \frac{\sigma}{\sqrt{n}}$, which gives

$n = \left(\frac{\text{z-multiplier} \times \sigma}{E}\right)^2.$

The z-multiplier is determined by the percentage confidence required in the estimation.

Example

Consider the example on the interval estimation of the mean content of 500 milliliter cool
drink bottles. The standard deviation σ is known to be 5. Suppose it is desired to estimate
the mean with 95% confidence and an error that is not greater than 0.8. What sample size is
needed to achieve this accuracy?

Solution:

σ = 5, E = 0.8 (given), z-multiplier = z0.975 = 1.96

$n = \left(\frac{1.96 \times 5}{0.8}\right)^2 = 150.0625$, which is rounded up to n = 151 (n is always rounded up).
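A minimal sketch of this calculation, with a hypothetical function name, uses `math.ceil` to apply the rule that n is always rounded up.

```python
from math import ceil

def sample_size_mean(sigma, error, z_multiplier):
    """Smallest n for which z * sigma / sqrt(n) does not exceed the error E."""
    return ceil((z_multiplier * sigma / error) ** 2)

n_needed = sample_size_mean(sigma=5, error=0.8, z_multiplier=1.96)
print(n_needed)
```

Halving the allowed error roughly quadruples the required sample size, because E enters the formula squared.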

7.9 Sample size for estimation of population proportion

The approach used in determining the sample size for the estimation of the population
proportion is much the same as that used when estimating the population mean.

The equation to be solved for n is: $E = \text{z-multiplier} \times \sqrt{\frac{pq}{n}}$.

When solving for n the formula becomes

$n = pq\left(\frac{\text{z-multiplier}}{E}\right)^2.$

A practical problem encountered when using this formula is that values for the parameters
p and q=1 – p are needed. Since the purpose of this technique is to estimate p, these values
of p and q are obviously not known.

If no information on p is available, the value of p that will give the maximum value of
p(1 – p) = pq will be taken. It can be shown that p = 0.5 maximizes this expression. This gives
max pq = 0.25. Substituting this maximum value in the above formula gives

$\max n = \frac{1}{4}\left(\frac{\text{z-multiplier}}{E}\right)^2.$

If more accurate information on the value of p is known (e.g. some range of values), it
should be used in the above formula.

As explained before, the z-multiplier is determined by the percentage confidence required
in the estimation.

Example

Consider the problem (discussed earlier) of estimating the proportion of potential users who
would use a new product. Suppose this proportion is to be estimated with 99% confidence
and an error not exceeding 2% (proportion of 0.02) is required. What sample size is needed
to achieve this?

Solution:

E = 0.02 (given), z-multiplier = z0.995 = 2.576


If no information is known about p then
$n = \frac{1}{4}\left(\frac{2.576}{0.02}\right)^2 = 4147.36$, which is rounded up to n = 4148.

But suppose it is known that the value of p is between 0.8 and 0.9.
In such a case
max p(1 – p) = pq = 0.8 × 0.2 = 0.16

By using this information, the value of n can be calculated as

$n = 0.16\left(\frac{2.576}{0.02}\right)^2 = 2654.31$, which is rounded up to n = 2655.

The additional information on possible values for p reduces the sample size by 36%.
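Both cases can be handled by one function if the known range for p is passed in. This is a sketch with hypothetical names; it relies on the fact that p(1 – p) is largest at the point of the range closest to 0.5.

```python
from math import ceil

def sample_size_proportion(error, z_multiplier, p_low=0.0, p_high=1.0):
    """Sample size using the largest p(1 - p) over the range [p_low, p_high]."""
    # The value in the range closest to 0.5 maximizes p(1 - p)
    p = min(max(0.5, p_low), p_high)
    return ceil(p * (1 - p) * (z_multiplier / error) ** 2)

n_no_info = sample_size_proportion(0.02, 2.576)                            # p unknown
n_with_range = sample_size_proportion(0.02, 2.576, p_low=0.8, p_high=0.9)  # 0.8 <= p <= 0.9
reduction = 1 - n_with_range / n_no_info

print(n_no_info, n_with_range, round(reduction, 2))
```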

Chapter 8 – Statistical Inference: Testing of hypotheses for one sample

8.1 Formulation of hypotheses and related terminology

Statistical hypothesis
A statistical hypothesis is an assertion (claim) made about a value(s) of a population
parameter.

Purpose
The purpose of testing of hypotheses is to determine whether a claim that is made
could be true. The conclusion about the truth of such a claim is not stated with
absolute certainty, but rather in terms of the language of probability.

Examples of claims to be tested

1) A supermarket receives complaints that the mean content of “1 kilogram” sugar bags
that are sold by them is less than 1 kilogram.

2) The variability in the drying time of a certain paint (as measured by the variance) has
until recently been 65 (in squared minutes). It is suspected that the variability has now increased.

3) A construction company suspects that the proportion of jobs they complete behind
schedule is 0.20 (20%). They want to test whether this is indeed the case.

Null and alternative hypotheses

Null hypothesis (H0)


This is a statement concerning the value of the parameter of interest (ϴ) in a claim that is
made. This is formulated as

H0: ϴ = ϴ0 (The statement that the parameter ϴ is equal to the hypothetical value ϴ0)

Alternative hypothesis (H1)


This is a statement about the possible values of the parameter ϴ that are believed to be
true if H0 is not true. One of the alternative hypotheses shown below will apply.

H1a: θ < θ 0 or H1b: θ > θ 0 or H1c: θ ≠ θ 0 .

Examples

1) In the first example (above) the parameter of interest is the population mean µ and
the hypotheses to be tested are

H0: µ = 1 (Population mean is 1 kilogram)


H1a: µ < 1 (Population mean is less than 1 kilogram)

In terms of the general notation stated above θ =µ and θ 0 = 1 .

2) In the second example (above) the parameter of interest is the population variance
σ2 and the hypotheses to be tested are

H0: σ2 = 65 (Population variance is 65)


H1b: σ2 > 65 (Population variance is greater than 65)

In terms of the general notation stated above θ = σ2 and θ 0 = 65.

3) In the third example (above) the parameter of interest is the population proportion,
p, of job completions behind schedule and the hypotheses to be tested are

H0: p = 0.20 (Population proportion is 0.20)


H1c: p ≠ 0.20 (Population proportion is not equal to 0.20)

In terms of the general notation stated above θ = p and θ 0 = 0.20 .

One and two-sided alternatives

One-sided alternative
This is a hypothesis that specifies the alternative values (to the null hypothesis) in a
direction that is either below or above that specified by the null hypothesis.

Example

The alternative hypothesis H1a (see example 1 above) is the alternative that the value
of the parameter is less than that stated under the null hypothesis and the
alternative H1b (see example 2 above) is the alternative that the value of the
parameter is greater than that stated under the null hypothesis.

Two-sided alternative
This is a hypothesis that specifies the alternative values (to the null hypothesis) in directions
that can be either below or above that specified by the null hypothesis.

Example

The alternative hypothesis H1c (see example 3 above) is the alternative that the value
of the parameter is either greater than that stated under the null hypothesis or less
than that stated under the null hypothesis.

8.2 Testing of hypotheses for one sample: Terminology and summary of procedure

The testing procedure and terminology will be explained for the test for the population
mean μ with population variance σ2 known.

The hypotheses to be tested are

H0 : µ = µ0 versus
H1a: µ < µ0 or H1b: µ > µ0 or H1c: µ≠ µ0.

The data set that is needed to perform the test is x1, x2, . . . , xn ,
a random sample of size n drawn from the population for which the mean is tested. The test
is performed to see whether or not the sample data are consistent with what is stated by
the null hypothesis.
The instrument that is used to perform the test is called a test statistic. A test statistic is a
quantity calculated from the sample data.

When testing for the population mean, the test statistic used is:
$z_0 = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}.$
If the difference between $\bar{x}$ and µ0 (and therefore the value of z0) is reasonably small, H0 will
not be rejected. In this case the sample mean is consistent with the value of the
population mean that is being tested. If this difference (and therefore the value of z0) is
sufficiently large, H0 will be rejected. In this case the sample mean is not consistent with the
value of the population mean that is being tested. In order to decide how large this
difference between $\bar{x}$ and μ0 (and therefore the value of z0) should be before H0 is rejected,
the following should be considered.
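As a small illustration of the test statistic, consider the sugar-bag claim with H0: µ = 1. The sample figures below (n = 36 bags, sample mean 0.99 kg, σ = 0.03 kg) are hypothetical values made up for this sketch and do not come from the notes.

```python
from math import sqrt

def z_statistic(xbar, mu0, sigma, n):
    """Test statistic z0 = (xbar - mu0) / (sigma / sqrt(n))."""
    return (xbar - mu0) / (sigma / sqrt(n))

# Hypothetical sample: 36 bags with mean content 0.99 kg, sigma known to be 0.03 kg
z0 = z_statistic(xbar=0.99, mu0=1.0, sigma=0.03, n=36)
print(round(z0, 2))
```

Here the sample mean lies two standard errors below the hypothesized mean; whether that is far enough to reject H0 depends on the critical value, which is determined by the probability of a type I error.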

Type I error
• A type I error is committed when the null hypothesis is rejected when, in fact it is
true i.e. H0 is wrongly rejected.
• In this example, a type I error is committed when it is decided that the statement
H0: µ = μ0 should be rejected when, in fact, it is true.

Type II error
• A type II error is committed when the null hypothesis is not rejected when, in fact, it
is false i.e. a decision not to reject H0 is wrong.
• In this example, a type II error is committed when it is decided that the statement
H0: µ = μ0 should not be rejected when, in fact, it is false.

The following table gives a summary of possible conclusions and their correctness when
performing a test of hypotheses.

Actual situation    Conclusion: Reject H0    Conclusion: Do not reject H0
H0 is true          Type I error             Correct conclusion
H0 is false         Correct conclusion       Type II error

A Type I error is often considered to be more serious, and therefore more important to
avoid, than a Type II error. The hypothesis testing procedure is therefore designed so that
there is a guaranteed small probability of rejecting the null hypothesis wrongly. This
probability is never 0 (why?). Mathematically the probability of a type I error can be stated
as

P(type I error) = P(Reject H0 | H0 is true) = α.

When testing for the population mean

P(type I error) = P(reject μ = μ0 | μ = μ0 is true) = α

P(type II error) = P(do not reject µ = µ0 | µ = µ0 is false) = β.

For a fixed sample size, the probabilities of type I and type II errors work in opposite
directions. The more reluctant you are to reject H0 (smaller α), the higher the risk of not
rejecting it when, in fact, it is false (larger β). The easier you make it to reject H0
(larger α), the lower the risk of not rejecting it when, in fact, it is false (smaller β).
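
The opposite behaviour of the two error probabilities can be illustrated numerically. The sketch below is not part of the original notes. It assumes a left-tailed test of H0: µ = µ0 with σ known, and the values µ0 = 100, true mean 98, σ = 5 and n = 25 are hypothetical, chosen only for illustration. The function name beta_left_tailed is likewise an assumption.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution N(0, 1)


def beta_left_tailed(mu0, mu_true, sigma, n, alpha):
    """P(type II error) for H0: mu = mu0 versus H1: mu < mu0 (sigma known).

    H0 is not rejected when z0 >= Z_alpha. If the true mean is mu_true,
    z0 is normally distributed with mean (mu_true - mu0) / (sigma / sqrt(n))
    and variance 1.
    """
    z_alpha = Z.inv_cdf(alpha)                   # critical value Z_alpha
    shift = (mu_true - mu0) / (sigma / sqrt(n))  # mean of z0 when H0 is false
    return 1 - Z.cdf(z_alpha - shift)            # P(z0 >= Z_alpha | mu = mu_true)


# Hypothetical illustration values (not taken from the notes).
for alpha in (0.10, 0.05, 0.01):
    beta = beta_left_tailed(mu0=100, mu_true=98, sigma=5, n=25, alpha=alpha)
    print(f"alpha = {alpha:.2f}  beta = {beta:.3f}")
```

Under these assumptions the printed β increases as α decreases, which is the trade-off described above.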

Critical value(s) and critical region

Critical (cut-off) value(s)


• The critical value(s) for tests of hypotheses is(are) a value(s) to which the test
statistic is compared in order to determine whether or not the null hypothesis
should be rejected.
• The critical value is determined according to the specified value of α, the probability
of a type I error.

For the test of the population mean the critical value is determined in the following way.
Assuming that H0 is true, the test statistic will follow a standard normal distribution i.e.

Z0 = (X̄ − µ0) / (σ/√n) ~ N(0, 1).

(i) When testing H0 versus the alternative hypothesis H1a (µ < µ0), the critical value is the
value Zα which is such that the area under the standard normal curve to the left of Zα is α
i.e. P(Z0 < Zα) = α. This leaves an area of 1 – α to the right of Zα.

The following graph illustrates the case α = 0.05


i.e. P(Z0 < –1.645) = 0.05.

(ii) When testing H0 versus the alternative hypothesis H1b (µ > µ0) , the critical value is the
value Z1 – α which is such that the area under the standard normal curve to the left of Z1 – α
is 1 – α i.e. P(Z0 < Z1 – α) = 1 – α. This leaves an area of α to the right of Z1 – α.

The graph below illustrates the case α = 0.05. This means 1 – α = 0.95 and thus
P(Z0 < 1.645) = 0.95.

(iii) When testing H0 versus the alternative hypothesis H1c (µ ≠ µ0), the critical values are
the values Z1 – α/2 and Zα/2. The area under the standard normal curve to the left of Z1 – α/2 is
1 – α/2. The area under the standard normal curve to the left of Zα/2 is α/2.
i.e. P(Z0 < Z1 – α/2) = 1 – α/2 and P(Z0 < Zα/2) = α/2.

The area under the normal curve between these two critical values is 1 – α. The graph on
the following page shows the case α = 0.05

i.e. P(Z0 < – 1.96 or Z0 > 1.96) = 0.05.

Critical region (CR)


• The critical region, or rejection region R, is the set of values of the test statistic for
which the null hypothesis is rejected.

(i) When testing H0 versus the alternative hypothesis H1a , the rejection region is

{ z0 | z0 < Zα }.

(ii) When testing H0 versus the alternative hypothesis H1b , the rejection region is

{ z0 | z0 > Z1 – α }.

(iii) When testing H0 versus the alternative hypothesis H1c , the rejection region is

{ z0 | z0 > Z1 – α/2 or z0 < Zα/2 }.

H0 is rejected when there is a sufficiently large difference between the sample mean x̄ and
the mean (μ0 ) under H0 . Such a large difference is called a significant difference (result of
the test is significant). The value of α is called the level of significance. It specifies the level
beyond which this difference (between x̄ and μ0) is sufficiently large for H0 to be rejected.
The value of α is specified prior to performing the test and is often taken as either 0.05 (5%
level of significance) or 0.01 (1% level of significance).

When H0 is rejected, it does not necessarily mean that it is not true. It means that according
to the sample evidence available it appears not to be true. Similarly when H0 is not rejected,
it does not necessarily mean that it is true. It means that there is not sufficient sample
evidence to disprove H0.

Critical values for tests based on the standard normal distribution can be found from the
selected percentiles listed at the bottom of the pages of the standard normal table.
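
As an alternative to the printed table, the same percentiles can be computed directly. The following is a minimal sketch using Python's standard library. The helper name z_percentile is an assumption made for illustration and is not notation from these notes.

```python
from statistics import NormalDist


def z_percentile(p):
    """Return Z_p, the value with area p to its left under the N(0, 1) curve."""
    return NormalDist().inv_cdf(p)


for alpha in (0.05, 0.01):
    print(f"alpha = {alpha}")
    # H1a (left-tailed): reject when z0 < Z_alpha
    print(f"  Z_alpha       = {z_percentile(alpha):.3f}")
    # H1b (right-tailed): reject when z0 > Z_(1 - alpha)
    print(f"  Z_(1-alpha)   = {z_percentile(1 - alpha):.3f}")
    # H1c (two-tailed): reject when z0 < Z_(alpha/2) or z0 > Z_(1 - alpha/2)
    print(f"  Z_(alpha/2)   = {z_percentile(alpha / 2):.3f}")
    print(f"  Z_(1-alpha/2) = {z_percentile(1 - alpha / 2):.3f}")
```

The printed values can be compared with the percentiles quoted in the text, for example –1.645 and 1.96 for α = 0.05.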

8.3 Test for the population mean (population variance known)

A summary of the steps to be followed in the testing procedure is shown below (continuing
onto the following page).

Test for µ when σ2 is known

1 State null and alternative hypotheses.

H0: µ = µ0 versus H1a: µ < µ0 or H1b: µ > µ0 or H1c: µ ≠ µ0

2 Calculate the test statistic z0 = (x̄ − µ0) / (σ/√n).

3 State the level of significance α and determine the critical value(s) and critical
region.

(i) For alternative H1a the critical region is R = { z0 | z0 < Zα }.

(ii) For alternative H1b the critical region is R = { z0 | z0 > Z1 – α }.

(iii) For alternative H1c the critical region is R = { z0 | z0 > Z1 – α/2 or z0 < Zα/2 }.

4 If z0 lies in the critical region, reject H0, otherwise do not reject H0.

5 State conclusion in terms of the original problem.
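
Steps 2 to 4 above can be sketched in code as follows. This is an illustration rather than part of the original notes: the function name z_test, its argument names and the labels "less", "greater" and "two-sided" (standing for H1a, H1b and H1c) are assumptions, and the numbers in the usage line are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test(xbar, mu0, sigma, n, alpha, alternative):
    """Test H0: mu = mu0 when the population standard deviation sigma is known.

    alternative is "less" (H1a), "greater" (H1b) or "two-sided" (H1c).
    Returns the test statistic z0 and True if H0 is rejected.
    """
    z0 = (xbar - mu0) / (sigma / sqrt(n))  # step 2: test statistic
    Z = NormalDist()                       # standard normal distribution
    # Steps 3 and 4: compare z0 with the critical value(s) for the chosen alpha.
    if alternative == "less":
        reject = z0 < Z.inv_cdf(alpha)
    elif alternative == "greater":
        reject = z0 > Z.inv_cdf(1 - alpha)
    elif alternative == "two-sided":
        reject = z0 < Z.inv_cdf(alpha / 2) or z0 > Z.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")
    return z0, reject


# Hypothetical values: sample mean 49.2 from n = 36 observations, sigma = 2.
z0, reject = z_test(xbar=49.2, mu0=50, sigma=2, n=36, alpha=0.05, alternative="less")
print(f"z0 = {z0:.3f}, reject H0: {reject}")
```

Step 5, stating the conclusion in terms of the original problem, still has to be done in words.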

Examples

1) A supermarket receives complaints that the mean content of “1 kilogram” sugar bags
that are sold by them is less than 1 kilogram. A random sample of 40 sugar bags is
selected from the shelves and the mean found to be 0.987 kilograms. From past
experience the standard deviation of the contents of these bags is known to be 0.025
kilograms. Test, at the 5% level of significance, whether this complaint is justified.

Solution:

H0 : μ = 1 (The complaint is not justified)

H1 : μ < 1 (The complaint is justified)

n = 40, x̄ = 0.987, σ = 0.025, μ0 = 1 (given)

Test statistic: z0 = (0.987 − 1) / (0.025/√40) = –3.289.

α = 0.05 so Critical region R = { z0 < Z0.05 = –1.645 }.

Since z0 = –3.289 < –1.645, H0 is rejected.

Conclusion: There is sufficient evidence to conclude that the complaint is justified.

2) A supermarket manager suspects that the machine filling “500 gram” containers of
coffee is overfilling them i.e. the actual contents of these containers is more than
500 grams. A random sample of 30 of these containers is selected from the shelves
and the mean found to be 501.8 grams. From past experience the variance of the
contents of these containers is known to be 60 grams². Test at the 5% level of significance
whether the manager’s suspicion is justified.

Solution:

H0 : μ = 500 (Suspicion is not justified)

H1 : μ > 500 (Suspicion is justified)

n = 30, x̄ = 501.8, σ2 = 60, μ0 = 500 (given)

Test statistic: z0 = (501.8 − 500) / (√60/√30) = 1.273.

α = 0.05 so Critical region R = { z0 > Z0.95 = 1.645 }.

Since z0 = 1.273 < 1.645, H0 is not rejected.

Conclusion: There is insufficient evidence to conclude that the manager’s suspicion is justified.

3) During a quality control exercise the manager of a factory that fills cans of frozen
shrimp wants to check whether the mean weights of the cans conform to
specifications i.e. the mean of these cans should be 600 grams as stated on the label
of the can. He/she wants to guard against either over or under filling the cans. A
random sample of 50 of these cans is selected and the mean found to be 595 grams.
From past experience the standard deviation of the contents of these cans is known to
be 20 grams. Test, at the 5% level of significance, whether the weights conform to
specifications. Repeat the test at the 10% level of significance.

Solution:

H0 : μ = 600 (Weights conform to specifications)

H1 : μ ≠ 600 (Weights do not conform to specifications)

n = 50, x̄ = 595, σ = 20, μ0 = 600 (given)

Test statistic: z0 = (595 − 600) / (20/√50) = –1.768.

α = 0.05 so Critical region R = { z0 < Z0.025 = – 1.96 or z0 > Z0.975 = 1.96 }.

Since –1.96 < z0 = –1.768 < 1.96, H0 is not rejected.

Conclusion: There is insufficient evidence to show that the weights don’t conform to
specifications.

Suppose the test is performed at the 10% level of significance. In such a case

α = 0.10 so Critical region R = { z0 < Z0.05 = –1.645 or z0 > Z0.95 = 1.645 }.

Since z0 = –1.768 < –1.645, H0 is rejected.

Conclusion: There is sufficient evidence to show that the weights don’t conform to
specifications.

Thus, being less strict about controlling a type I error (changing α from 0.05 to 0.10)
results in a different conclusion about H0 (reject instead of do not reject).

Note

1. In example 1 the alternative hypothesis H1a was used, in example 2 the alternative
H1b and in example 3 the alternative H1c.

2. Alternatives H1a and H1b [one-sided (tailed) alternatives] are used when there is a
particular direction attached to the range of mean values that could be true if H0 is
not true.

3. Alternative H1c [two-sided (tailed) alternative] is used when there is no particular
direction attached to the range of mean values that could be true if H0 is not true.

4. If, in the above examples, the level of significance had been changed to 1%, the
critical values used would have been Z0.01 = – 2.326 (in example 1) ,
Z0.99 = 2.326 (in example 2) and Z0.005 = –2.576 , Z0.995 = 2.576 (in example 3).

8.4 Test for the population mean (population variance not known): t-test

When performing the test for the population mean for the case where the population
variance is not known, the following modifications are made to the procedure.

• In the test statistic formula the population standard deviation σ is replaced by the
sample standard deviation S.
• Since the test statistic t0 = (x̄ − µ0) / (S/√n) that is used to perform the test follows a
t-distribution with n–1 degrees of freedom, critical values are looked up in the
t-tables.

Test for µ when σ2 is not known (t-test)

1 State null and alternative hypotheses.

H0: µ = µ0 versus H1a: µ < µ0 or H1b: µ > µ0 or H1c: µ ≠ µ0 .

2 Calculate the test statistic t0 = (x̄ − µ0) / (S/√n).

3 State the level of significance α and determine the critical value(s) and critical
region.

Degrees of freedom = ν = n–1.

(i) For alternative H1a the critical region is R = { t0 | t0 < tα }.

(ii) For alternative H1b the critical region is R = { t0 | t0 > t1 – α }.

(iii) For alternative H1c the critical region is R = { t0 | t0 > t1 – α/2 or t0 < tα/2 }.

4 If t0 lies in the critical region, reject H0 , otherwise do not reject H0.

5 State conclusion in terms of the original problem.



Examples

A paint manufacturer claims that the average drying time for a new paint is 2 hours (120
minutes). The drying times for 20 randomly selected cans of paint were obtained. The
results are shown below.

123  106  139  135
127  128  119  130
131  133  121  136
122  115  116  133
109  120  130  109

Assuming that the sample was drawn from a normal distribution,

(a) test whether the population mean drying time is greater than 2 hours (120 minutes)

(i) at the 5% level of significance.


(ii) at the 1% level of significance.

(b) test, at the 5% level of significance, whether the population mean drying time could be 2
hours (120 minutes).

Solution:

(a) H0 : μ = 120 (mean is 2 hours)

H1 : μ > 120 (mean is greater than 2 hours)

n = 20, μ0 = 120 (given), x̄ = 124.1, S = 9.65674 (calculated from the data).

Test statistic t0 = (124.1 − 120) / (9.65674/√20) = 1.899.

(i) If α = 0.05, 1 – α = 0.95. From the t-distribution table with

degrees of freedom =ν = n–1 =19, t0.95 = 1.729.

Critical region R = { t0 > t0.95 = 1.729 }.

Since 1.899 > 1.729 , H0 is rejected.

Conclusion: There is sufficient evidence to conclude that the mean is greater
than 2 hours.

(ii) If α = 0.01, 1–α = 0.99. From the t-distribution table with

degrees of freedom =ν = n–1 =19, t0.99 = 2.539.

Critical region R = { t0 > t0.99 = 2.539 }.

Since 1.899 < 2.539 , H0 is not rejected.

Conclusion: There is insufficient evidence to conclude that the mean is greater than 2 hours.

Thus, being more strict about controlling a type I error (changing α from 0.05 to
0.01) results in a different conclusion about H0 (Do not reject instead of reject).

(b) H0 : μ = 120 (mean is 2 hours)

H1 : μ ≠ 120 (mean is not equal to 2 hours)

n = 20, μ0 = 120 (given), x̄ = 124.1, S = 9.65674 (calculated from the data).

Test statistic: t0 = (124.1 − 120) / (9.65674/√20) = 1.899 (as calculated in part (a)).

If α = 0.05, α/2 = 0.025, 1 – α/2 = 0.975.


From the t-distribution table with
degrees of freedom = ν = n–1 = 19, t0.025 = –2.093, t0.975 = 2.093.

Critical region R = { t0 < – 2.093 or t0 > t0.975 = 2.093 }.

Since –2.093 <1.899 < 2.093, H0 is not rejected.

Conclusion: Using a 5% level of significance, there is insufficient evidence to conclude
that the mean drying time is not 2 hours.
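
The calculations in this example can be checked with a short script. The following is a sketch using only Python's standard library. Because the standard library has no t-distribution, the critical values are typed in from the t-table values quoted above (19 degrees of freedom). The variable names are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Drying times (minutes) of the 20 cans of paint listed in the example.
drying_times = [123, 106, 139, 135, 127, 128, 119, 130, 131, 133,
                121, 136, 122, 115, 116, 133, 109, 120, 130, 109]

mu0 = 120                           # mean under H0
n = len(drying_times)
xbar = mean(drying_times)           # sample mean
s = stdev(drying_times)             # sample standard deviation (divisor n - 1)
t0 = (xbar - mu0) / (s / sqrt(n))   # test statistic

# Critical values for 19 degrees of freedom, copied from the t-table.
t_095, t_099, t_0975 = 1.729, 2.539, 2.093

reject_a_i = t0 > t_095                    # (a)(i): H1: mu > 120, alpha = 0.05
reject_a_ii = t0 > t_099                   # (a)(ii): H1: mu > 120, alpha = 0.01
reject_b = t0 < -t_0975 or t0 > t_0975     # (b): H1: mu != 120, alpha = 0.05

print(f"xbar = {xbar:.1f}, S = {s:.5f}, t0 = {t0:.3f}")
print(f"(a)(i) reject H0: {reject_a_i}")
print(f"(a)(ii) reject H0: {reject_a_ii}")
print(f"(b) reject H0: {reject_b}")
```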

Note:
• Despite the fact that the same data were used in the above examples, the
conclusions were different. In the first test H0 was rejected, but in the next 2 tests H0
was not rejected.

• In the first test the probability of a type I error was set at 5%, while in the second
test this was changed to 1%. To achieve this, the critical value was moved from 1.729 to
2.539, resulting in the test statistic value (1.899) being less than (instead of greater
than) the critical value.

• In the third test (which has a two-sided alternative hypothesis), the upper critical
value was increased to 2.093 (to have an area of 0.025 under the t-curve to its right).
Again this resulted in the test statistic value (1.899) being less than (instead of
greater than) the critical value.

8.5 Test for population variance


The test for the population variance is based on χ² = (n − 1)S² / σ² following a chi-square
distribution with n – 1 degrees of freedom. The critical values are therefore obtained from
the chi-square tables.

Test for the population variance σ2


1 State the null and alternative hypotheses.
H0: σ² = σ0² versus H1a: σ² < σ0² or H1b: σ² > σ0² or H1c: σ² ≠ σ0²

2 Calculate the test statistic χ0² = (n − 1)S² / σ0².

3 State the level of significance α and determine the critical value(s) and critical
region.
Degrees of freedom = ν = n–1.

(i) For alternative H1a the critical region is R = { χ0² | χ0² < χ²α }.

(ii) For alternative H1b the critical region is R = { χ0² | χ0² > χ²1–α }.

(iii) For alternative H1c the critical region is R = { χ0² | χ0² > χ²1–α/2 or χ0² < χ²α/2 }.

4 If χ0² lies in the critical region, reject H0 , otherwise do not reject H0.

5 State conclusion in terms of the original problem.

For a one-sided test with alternative hypothesis H1b the rejection region (highlighted area) is
shown in the graph below.

For a two-sided test with alternative hypothesis H1c the rejection region (highlighted area) is
shown in the graph below.

Example 1

Consider the example on the drying time of the paint discussed in the previous section. Until
recently it was believed that the variance in the drying time is 65 minutes². Suppose it is
suspected that this variance has increased. Test this assertion at the 5% level of significance.

Solution:

H0 : σ2 = 65 (Variance has not increased)

H1 : σ2 > 65 (Variance has increased)

n = 20, σ0² = 65 (given), S = 9.65674 (calculated from the data).

Test statistic: χ0² = (19 × 9.65674²) / 65 = 27.258.

α = 0.05, 1 – α = 0.95.
From the chi-square distribution table with

degrees of freedom = ν = n – 1 = 19, χ²0.95 = 30.14.

Critical region R = { χ0² > χ²0.95 = 30.14 }.

Since 27.258 < 30.14, H0 is not rejected.

Conclusion: There is insufficient evidence to conclude that the variance has increased.

Example 2

A manufacturer of car batteries guarantees that their batteries will last, on average, 3 years
with a standard deviation of 1 year. Ten of the batteries have lifetimes (in years) of
1.2  2.5  3  3.5  2.8  4  4.3  1.9  0.7  4.3
Test at the 5% level of significance whether the variability guarantee is still valid.

Solution:

H0 : σ2 = 1 (Guarantee is valid)

H1 : σ2 ≠ 1 (Guarantee is not valid)

n = 10, σ0² = 1 (given), S = 1.26209702, S² = 1.592889 (calculated from the data).

Test statistic: χ0² = (9 × 1.592889) / 1 = 14.336.

α = 0.05, α/2 = 0.025, 1 – α/2 = 0.975.

From the chi-square distribution table with

degrees of freedom = ν = n – 1 = 9, χ²0.025 = 2.70 , χ²0.975 = 19.02.

Critical region R = { χ0² < χ²0.025 = 2.70 or χ0² > χ²0.975 = 19.02 }.

Since 2.70 < 14.336 < 19.02, H0 is not rejected.

Conclusion: Using a 5% level of significance, there is insufficient evidence to show that the
variance is not 1 i.e. the data give no reason to doubt that the guarantee is still valid.
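
The arithmetic in Example 2 can be reproduced as follows. This is a sketch using only Python's standard library. The chi-square critical values are typed in from the table values quoted above (9 degrees of freedom), and the variable names are assumptions made for illustration.

```python
from statistics import variance

# Battery lifetimes (years) from Example 2.
lifetimes = [1.2, 2.5, 3, 3.5, 2.8, 4, 4.3, 1.9, 0.7, 4.3]

sigma0_sq = 1                          # variance under H0
n = len(lifetimes)
s_sq = variance(lifetimes)             # sample variance S^2 (divisor n - 1)
chi0_sq = (n - 1) * s_sq / sigma0_sq   # test statistic

# Critical values for 9 degrees of freedom, copied from the chi-square table.
lower, upper = 2.70, 19.02

# Two-sided alternative: reject when the statistic falls in either tail.
reject = chi0_sq < lower or chi0_sq > upper

print(f"S^2 = {s_sq:.6f}, chi-square statistic = {chi0_sq:.3f}")
print(f"reject H0: {reject}")
```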

8.6 Test for population proportion

The test for the population proportion (p) is based on the CLT. From this result it follows
that Z = (P̂ − p) / √(pq/n) ~ N(0, 1), where q = 1 – p.
For this reason the critical value(s) and critical region are the same as those for the test for
the population mean (both based on the standard normal distribution).

Test for the population proportion p


1 State the null and alternative hypotheses.
H0: p = p 0 versus H1a: p < p 0 or H1b: p > p0 or H1c: p ≠ p 0

2 Calculate the test statistic z0 = (p̂ − p0) / √(p0 q0 / n), where q0 = 1 – p0.

3 State the level of significance α and determine the critical value(s) and critical
region.

(i) For alternative H1a the critical region is R = { z0 | z0 < Zα }.

(ii) For alternative H1b the critical region is R = { z0 | z0 > Z1-α }.

(iii) For alternative H1c the critical region is R = { z0 | z0 > Z1-α/2 or z0 < Zα/2 }.

4 If z0 lies in the critical region, reject H0, otherwise do not reject H0.

5 State conclusion in terms of the original problem.

Examples

1) A construction company suspects that the proportion of jobs they complete behind
schedule is 0.20 (20%). Of their 80 most recent jobs 22 were completed behind
schedule. Test at the 5% level of significance whether this information confirms their
suspicion.

Solution:

H0 : p = 0.20 (Suspicion is confirmed)

H1 : p ≠ 0.20 (Suspicion is not confirmed)


n = 80, x = 22 (given), p̂ = 22/80 = 0.275, p0 = 0.20.

Test statistic: z0 = (0.275 − 0.20) / √(0.20 × 0.80 / 80) = 1.677.

α = 0.05
Critical region: R = { z0 < Z0.025 = – 1.96 or z0 > Z0.975 = 1.96 }.

Since –1.96 < z0 = 1.677 < 1.96, H0 is not rejected.

Conclusion: There is not sufficient evidence to conclude that the proportion is not 0.2
i.e. the data are consistent with the company’s suspicion.

2) During a marketing campaign for a new product 176 out of the 200 potential users of
this product that were contacted indicated that they would use it. Is this evidence
that more than 85% of all the potential users will actually use the product? Use α = 0.01.

Solution:

H0 : p = 0.85 (85% of all potential users will use the product)


H1 : p > 0.85 (More than 85% of all potential users will use the product)

n = 200, x = 176, p0 = 0.85 (given), p̂ = 176/200 = 0.88.

Test statistic z0 = (0.88 − 0.85) / √(0.85 × 0.15 / 200) = 1.188.

α = 0.01 so Critical region R = { z0 > Z0.99 = 2.326 }.

Since z0 = 1.188 < 2.326, H0 is not rejected.

Conclusion: There is insufficient evidence to conclude that more than 85% of all
potential users will use the product.
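
Both proportion examples can be checked with the sketch below, which uses only Python's standard library. The function name proportion_z_test and the labels "less", "greater" and "two-sided" (standing for H1a, H1b and H1c) are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(x, n, p0, alpha, alternative):
    """Test H0: p = p0 from x successes in n trials (large-sample z-test).

    alternative is "less" (H1a), "greater" (H1b) or "two-sided" (H1c).
    Returns the test statistic z0 and True if H0 is rejected.
    """
    p_hat = x / n                                 # sample proportion
    z0 = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # standard error uses p0 and q0
    Z = NormalDist()
    if alternative == "less":
        reject = z0 < Z.inv_cdf(alpha)
    elif alternative == "greater":
        reject = z0 > Z.inv_cdf(1 - alpha)
    elif alternative == "two-sided":
        reject = z0 < Z.inv_cdf(alpha / 2) or z0 > Z.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")
    return z0, reject


# Example 1: 22 of 80 jobs behind schedule, H1: p != 0.20, alpha = 0.05.
z1, reject1 = proportion_z_test(22, 80, 0.20, 0.05, "two-sided")
# Example 2: 176 of 200 potential users, H1: p > 0.85, alpha = 0.01.
z2, reject2 = proportion_z_test(176, 200, 0.85, 0.01, "greater")

print(f"Example 1: z0 = {z1:.3f}, reject H0: {reject1}")
print(f"Example 2: z0 = {z2:.3f}, reject H0: {reject2}")
```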

Chapter 9 – Linear Correlation and regression

9.1 Bivariate data and scatter diagrams


Often two variables are measured simultaneously and relationships between these variables
explored. Data sets involving two variables are known as bivariate data sets.

The first step in the exploration of bivariate data is to plot the variables on a graph. From
such a graph, which is known as a scatter diagram (scatter plot, scatter graph), an idea can
be formed about the nature of the relationship.

Examples

1) The number of copies sold (y) of a new book (measured in thousands of units) is
dependent on the advertising budget (x) the publisher commits in a pre-publication
campaign (measured in thousands of Rands). The values of x and y for 12 recently
published books are shown below.

x   8     9.5   7.2   6.5   10    12    11.5  14.8  17.3  27    30    25
y   12.5  18.6  25.3  24.8  35.7  45.4  44.4  45.8  65.3  75.7  72.3  79.2

[Scatter diagram: Advertising budget and copies sold. Horizontal axis: advertising budget; vertical axis: copies sold.]

2) In a study of the relationship between the amount of daily rainfall (x) and the
quantity of air pollution removed (y), the following data were collected.

Rainfall (centimeters)   Quantity removed (micrograms per cubic meter)
4.3                      126
4.5                      121
5.9                      116
5.6                      118
6.1                      114
5.2                      118
3.8                      132
2.1                      141
7.5                      108

[Scatter diagram: Rainfall and quantity removed. Horizontal axis: rainfall; vertical axis: quantity removed.]

• In both cases the relationship can be fairly well described by means of a straight line
i.e. both these relationships are linear relationships.

• In the first example y tends to increase as x increases (positive linear relationship).
In the second example y tends to decrease as x increases (negative linear
relationship).

• In both the examples changes in the values of y are affected by changes in the values
of x (not the other way round). The variable x is known as the explanatory
(independent) variable and the variable y the response (dependent) variable.

In this section only linear relationships between 2 variables will be explored. The issues to
be explored are

1) Measuring the strength of the linear relationship between the 2 variables (the linear
correlation problem).

2) Finding the equation of the straight line that will best describe the relationship
between the 2 variables (the linear regression problem). Once this line is
determined, it can be used to estimate a value of y for given value of x (linear
estimation).

9.2 Linear Correlation


The calculation of the coefficient of correlation (r) is based on the closeness of the plotted
points (in the scatter diagram) to the line fitted through them. It can be shown that

–1 ≤ r ≤ 1.

If the plotted points are closely clustered around this line, r will lie close to either 1 or –1
(depending on whether the linear relationship is positive or negative). Perfect positive
correlation occurs when all the plotted points lie on a line with a positive gradient. For this
case r = 1. Perfect negative correlation occurs when the plotted points lie on a line with a
negative gradient. For this case r = –1. The further the plotted points are away from the line,
the closer the value of r will be to 0. Consider the scatter diagrams that follow.

[Scatter diagram: Strong positive correlation (r close to 1)]

[Scatter diagram: Strong negative correlation (r close to –1)]

[Scatter diagram: No pattern (r close to 0)]

For a sample of n pairs of values (x1, y1) , (x2, y2), . . . , (xn, yn) , the coefficient of
correlation can be calculated from the formula

r = [n∑xy − ∑x∑y] / √{ [n∑x² − (∑x)²] [n∑y² − (∑y)²] }.
Example

Consider the data on the advertising budget (x) and the number of copies sold (y)
considered earlier. For this data r can be calculated in the following way.

       x       y      xy         x²        y²
       8       12.5   100        64        156.25
       9.5     18.6   176.7      90.25     345.96
       7.2     25.3   182.16     51.84     640.09
       6.5     24.8   161.2      42.25     615.04
       10      35.7   357        100       1274.49
       12      45.4   544.8      144       2061.16
       11.5    44.4   510.6      132.25    1971.36
       14.8    45.8   677.84     219.04    2097.64
       17.3    65.3   1129.69    299.29    4264.09
       27      75.7   2043.9     729       5730.49
       30      72.3   2169       900       5227.29
       25      79.2   1980       625       6272.64
sum    178.8   545    10032.89   3396.92   30656.5

Substituting
n = 12, ∑x = 178.8, ∑y = 545,
∑xy = 10032.89, ∑x² = 3396.92, ∑y² = 30656.5

into the equation for r gives r = 0.9194.

Comment: Strong positive correlation i.e. the increase in the number of copies sold
is closely linked with an increase in advertising budget.

Coefficient of determination
The strength of the linear relationship between 2 variables can be summarized by the
square of the correlation coefficient (r²). This quantity, called the coefficient of
determination, is the proportion of variability in the y variable that is accounted for by its
linear relationship with the x variable.

Example
In the above example on copies sold (y) and advertising budget (x), the
coefficient of determination = r² = 0.9194² = 0.8453.
This means that 84.53% of the variability in copies sold is explained by
its linear relationship with advertising budget.
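
The values of r and r² for the advertising data can be verified with the sketch below. It assumes the usual sums-of-products form of the correlation formula, r = [n∑xy − ∑x∑y] / √{[n∑x² − (∑x)²][n∑y² − (∑y)²]}, which matches the columns of the table above. The variable names are illustrative.

```python
from math import sqrt

# Advertising budget (x) and copies sold (y) for the 12 books.
x = [8, 9.5, 7.2, 6.5, 10, 12, 11.5, 14.8, 17.3, 27, 30, 25]
y = [12.5, 18.6, 25.3, 24.8, 35.7, 45.4, 44.4, 45.8, 65.3, 75.7, 72.3, 79.2]

n = len(x)
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # column xy
sum_x2 = sum(xi ** 2 for xi in x)               # column x squared
sum_y2 = sum(yi ** 2 for yi in y)               # column y squared

# Coefficient of correlation from the column sums.
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
r_squared = r ** 2   # coefficient of determination

print(f"r = {r:.4f}, r squared = {r_squared:.4f}")
```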

9.3 Linear Regression


Finding the equation of the line that best fits the (x, y) points is based on the least squares
principle. This principle can best be explained by considering the scatter diagram below.

The scatter diagram is a plot of the DBH (diameter at breast height measured in inches)
versus the age (years) for 12 oak trees. The data are shown in the following table.

Age (x) 97 93 88 81 75 57 52 45 28 15 12 11
DBH (y) 12.5 12.5 8 9.5 16.5 11 10.5 9 6 1.5 1 1

According to the least squares principle, the line that “best” fits the plotted points is the one
that minimizes the sum of the squares of the vertical deviations (see vertical lines in the
graph) between the plotted y and estimated y (values on the line). For this reason the line
fitted according to this principle is called the least squares line.

Calculation of least squares linear regression line

The equation for the line to be fitted to the (x, y) points is

ŷ = a + bx,

where ŷ is the fitted y value (the y value on the line, which generally differs from the
observed y value), a is the y-intercept and b the slope of the line.
It can be shown that the coefficients that define the least squares line can be
calculated from

b = [n∑xy − ∑x∑y] / [n∑x² − (∑x)²] and a = ȳ − b x̄.

Example

For the above data on age (x) and DBH (y) the least squares line can be calculated as shown
below.

       x      y      xy        x²
       97     12.5   1212.5    9409
       93     12.5   1162.5    8649
       88     8      704       7744
       81     9.5    769.5     6561
       75     16.5   1237.5    5625
       57     11     627       3249
       52     10.5   546       2704
       45     9      405       2025
       28     6      168       784
       15     1.5    22.5      225
       12     1      12        144
       11     1      11        121
sum    654    99     6877.5    47240

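Substituting

n = 12, ∑x = 654, ∑y = 99,

∑xy = 6877.5, ∑x² = 47240

into the above equations gives

b = (12 × 6877.5 − 654 × 99) / (12 × 47240 − 654²) = 17784 / 139164 = 0.12779 and
a = 99/12 − 0.12779 × 654/12 = 8.25 − 6.965 = 1.285.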

Therefore the equation of the y on x least squares line that can be used to estimate values
of y (DBH) based on x (age) is
ŷ = 1.285 + 0.12779 x.

Suppose the DBH of a tree aged 90 years is to be estimated. This can be done by
substituting the value of x = 90 into the above equation. Then
ŷ = 1.285 + 0.12779 × 90 = 12.786.
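
The least squares coefficients and the estimate for a 90-year-old tree can be reproduced with the following sketch. It applies the formulas for b and a given earlier in this section to the oak tree data. The function name predict_dbh and the variable names are assumptions made for illustration.

```python
# Age (x, years) and DBH (y, inches) for the 12 oak trees.
age = [97, 93, 88, 81, 75, 57, 52, 45, 28, 15, 12, 11]
dbh = [12.5, 12.5, 8, 9.5, 16.5, 11, 10.5, 9, 6, 1.5, 1, 1]

n = len(age)
sum_x = sum(age)
sum_y = sum(dbh)
sum_xy = sum(x * y for x, y in zip(age, dbh))
sum_x2 = sum(x ** 2 for x in age)

# Slope and intercept of the least squares line y-hat = a + b x.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n   # a = y-bar - b * x-bar


def predict_dbh(x):
    """Estimate DBH from age using the fitted least squares line."""
    return a + b * x


print(f"a = {a:.3f}, b = {b:.5f}")
print(f"Estimated DBH at age 90: {predict_dbh(90):.3f}")
```

As the caution below explains, such an estimate is only meaningful for ages within the range covered by the data.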

A word of caution

• The linear relationship between y and x is often only valid for values of x within a
certain range e.g. when estimating the DBH using age as explanatory variable, it
should be taken into account that at some age the tree will stop growing. Assuming a
linear relationship between age and DBH for values beyond the age where the tree
stops growing would be incorrect.

• Only relationships between variables that could be related in a practical sense are
explored e.g. it would be pointless to explore the relationship between the number
of vehicles in New York and the number of divorces in South Africa. Even if data
collected on such variables might suggest a relationship, it cannot be of any practical
value.

• If variables are not linearly related, it does not mean that they are not related. There
are many situations where the relationships between variables are non-linear.

Note:
Calculations will be demonstrated using the Data Analysis Add-Ins ToolPak in Excel.
You are required to know how to use the STAT mode on your calculator.

Example

A plot of the banana consumption (y) versus the price (x) is shown in the graph on the
following page. A straight line will not describe this relationship very well, but the non-linear
curve shown below will describe it well.

[Figure: Nonlinear regression example. Banana consumption (y) is plotted against price (x), together with the fitted curve y = α + β/x + u = α + βz + u, where z = 1/x.]

The figure illustrates how a nonlinear regression model may be fitted to the banana
consumption data.
