
Intro Stat 2010

This document introduces the textbook IntroSTAT, which aims to teach students applied statistics. It is designed for students in business, commerce, and management who have a basic understanding of calculus. The textbook is organized as a lecture book, with examples labeled A to be used in lectures, worked examples labeled B for private study, and problem statements labeled C. It covers exploring and summarizing data through various graphical and numerical methods, as well as probability theory, random variables, probability distributions, inference, and regression. The goal is to equip students with statistical skills for interpreting data and making informed decisions under uncertainty.


INTROSTAT

Les Underhill and Dave Bradfield


Department of Statistical Sciences
University of Cape Town

December 2010


Introduction
IntroSTAT apart, there seem to be two kinds of introductory Statistics textbooks. There
are those that assume no mathematics at all, and get themselves tied up in all kinds of
knots trying to explain the intricacies of Statistics to students who know no calculus.
There are those that assume lots of mathematics, and get themselves tied up in the
knots of mathematical statistics.
IntroSTAT assumes that students have a basic understanding of differentiation and
integration. The book was designed to meet the needs of students, primarily those in
business, commerce and management, for a course in applied statistics.
IntroSTAT is designed as a lecture-book. One of our aims is to maximize the time
spent in explaining concepts and doing examples. It is for this reason that three types
of examples are included in the chapters. Those labeled A are used to motivate concepts,
and often contain explanations of methods within them. They are for use in lectures.
The B examples are worked examples; they shouldn't be used in lectures, because there
is nothing more deadly dull than lecturing through worked examples. Students should
use the B examples for private study. The C examples contain problem statements.
A selection of these can be tackled in lectures without the need to waste time by the
lecturer writing up descriptions of examples, and by the students copying them down.
There are probably more exercises at the end of most chapters than necessary. A
selection has been marked with asterisks (*); these should be seen as a minimum set
to give experience with all the types of exercises.
Acknowledgements . . .

We are grateful to our colleagues who used editions 1 to 4 of IntroSTAT and made
suggestions for changes and improvements. We have also appreciated comments from
students. We will continue to welcome their ideas and hope that they will continue to
point out the deficiencies. Mrs Tib Cousins undertook the enormous task of turning
edition 4 into TeX files, which were the basis upon which this revision was undertaken.
Mrs Margaret Zaborowski helped us proofread the text, but any remaining errors are
our responsibility.
This volume is essentially the 1996 edition of IntroSTAT with some minor corrections
of errors, and was reset in LaTeX from the original plain TeX version.


Contents

Introduction
1 EXPLORING DATA
2 SET THEORY
3 PROBABILITY THEORY
4 RANDOM VARIABLES
5 PROBABILITY DISTRIBUTIONS I
6 MORE ABOUT RANDOM VARIABLES
7 PROBABILITY DISTRIBUTIONS II
8 MORE ABOUT MEANS
9 THE t- AND F-DISTRIBUTIONS
10 THE CHI-SQUARED DISTRIBUTION
11 PROPORTIONS AND SAMPLE SURVEYS
12 REGRESSION AND CORRELATION
SUMMARY OF THE PROBABILITY DISTRIBUTIONS
TABLES

Chapter 1
EXPLORING DATA

KEYWORDS: Data summary and display, qualitative and quantitative data, pie charts, bar graphs, histograms, symmetric and skew
distributions, stem-and-leaf plots, median, quartiles, extremes, five-number summary, box-and-whisker plots, outliers and strays, measures
of spread and location, sample mean, sample variance and standard deviation, summary statistics, exploratory data analysis.

Facing up to uncertainty . . .
We live in an uncertain world. But we still have to take decisions. Making good
decisions depends on how well informed we are. Of course, being well informed means
that we have useful information to assist us. So, having useful information is one of the
keys to good decision making.
Almost instinctively, most people gather information and process it to help them
take decisions. For example, if you have several applicants for a vacant post, you would
not draw a number out of a hat to decide which one to employ. Almost without thinking
about it, you would attempt to gather as much relevant information as you can about
them to help you compare the applicants. You might make a short-list of applicants to
interview, and prepare appropriate questions to put to each of them. Finally you come
to an informed decision.
Sometimes the available information is such that we feel it is easy to make a good
decision. But at other times, so much confusion and uncertainty cloud the situation
that we are inclined to go by gut-feeling or even by guessing. But we can do better
than this. This book aims to equip you with some of the necessary skills to outguess
the competition. Or, putting it less brashly, to help you to make consistently sound
decisions.
As the world becomes more technologically advanced, people realize more and more
that information is valuable. Obtaining the information they need might just require a
phone call, or maybe a quick visit to the library. Sometimes, they might need to expend
more energy and extract some information out of a database. Or worse, they might have
to design an experiment and gather some data of their own. On other occasions, the
information might be hidden in historical records.
Usually, data contains information that is not self-evident. The message cannot be
extracted by simply eye-balling the data. Ironically, the more valuable the information,
the more deeply it usually lies buried within the data. In these instances, statistical
tools are needed to extract the information from the data. Herein lies the focus of this
book.
For example, consider the record of share prices on the Johannesburg Stock Exchange.
Hidden in this data lies a wealth of information: whether or not a share is risky, or if it
is over- or underpriced. This data even contains traces of our own emotions (whether
our sentiments are mawkish or positive, risk prone or risk averse) and our preferences
for higher dividends, for smaller companies, for blue-chip shares. Little wonder that
there is a multitude of financial analysts out there trying to analyse share price data,
hoping that they might unearth valuable information that will deliver the promise of
better profits.
Just as the financial analysts have an insatiable appetite for information on which
to base better investment decisions, so in every field of human endeavour, people are
analysing information with the objective of improving the decisions they take.
One of the essential skills needed to extract information from data, to interpret this
information, and to take decisions based on it, is Statistics. Not everyone is willing,
or has the foresight, to master a course in the science of Statistics. We are fortunate
that this is true; otherwise statisticians would not hold the monopoly on superior
decision-making!
You have already made at least one good decision: the decision to do a course in
Statistics.

What Statistics is (and is not!) . . .

Most people seem to think that what statisticians do all day is to count and to add.
The two kinds of statisticians that most frequently impinge on the general public are
really parodies of statisticians: the sports statistician and the official statistician.
Statistics is not what you see at the bottom of the television screen during the French
Open Tennis Championships: statistique, followed by a count of the number of double
faults and aces the players have produced in the match so far! Nor is statistics about
adding up dreary columns of figures, and coming to the conclusion, for example, that
there were 30 777 000 sheep in South Africa in 1975. That sort of count is enough to
put anyone to sleep!
If statisticians don't count in the 1-2-3 sense, in what sense do they count? What
is statistics? We define statistics as the science of decision making in the face of
uncertainty. The emphasis is not on the collection of data (although the statistician
has an important role in advising on the data collection process), but on taking things
one step further: interpreting the data. Statistics may be thought of as data-based
decision making. Perhaps it is a pity that our discipline is called Statistics. A far better
name would have been Decision Science. Statistics really comes into its own when the
decisions to be made are not clear-cut and obvious, and there is uncertainty (even after
the data has been gathered) about which of several alternative decisions is the best one
to choose.
For example, the decision about which card to play in a game of bridge to maximize
your chance of winning, or the decision about where to locate a factory so as to maximize
the likelihood that your company's share of the market will reach a target value, are not
simple decisions. In both situations, you can gather as much data as you can (the cards
in your hand, and those already played in the first case; proximity to raw materials and
to markets in the second), and take a best possible decision on the basis of this data,
but there is still no guarantee of success. In both cases, your opposition may react in
unexpected ways, and you risk defeat.
In the above sentences, the words "uncertainty", "chance", "likelihood" and "risk"
have appeared. All these are qualitative terms. Before the statistician can get down to
his or her real job (of taking decisions in the face of uncertainty), this nebulous concept
of uncertainty has to be put onto a firm footing. Probability Theory is the branch of
mathematics that achieves this quantification of uncertainty.
Therefore, before you can become a statistician, you have to learn a hefty chunk of
Probability Theory. This is contained in chapters 2 to 7. Chapters 8 to 12 deal with the
science of data-based decision making.
However, in the remainder of this chapter, we aim to give you insight into what is to
come in the later chapters, to give you a feeling for data, and to do data-based decision
making using intuitive concepts.

Display, summarize and interpret . . .


Before getting deeply involved with tackling any situation or problem in daily life, it
is wise to take a step back and take a glimpse at the big picture, and so it is with
Statistics. As a starting point, statisticians make a quick and dirty summary of the
data they are about to analyse in order to get a feel for what they are dealing with.
The initial overview usually involves: constructing a visual display of the data;
summarizing the data with a few pertinent key numbers; and gaining insight into the
potential of the data.

What do we mean by data? . . .

Data is information. There are data drips and data floods, and statisticians have
to learn to deal with both. Usually, there is either too little or too much data! When
data comes in floods, the problem is to extract the salient features. When data comes
in drips, the problem is to know what are valid interpretations.
Besides the amount of data, there are different types of data. For the moment,
we need to distinguish between qualitative and quantitative data. Qualitative data
is usually non-numerical, and arises when we classify objects using labels or names as
categories: for example, make of car, colour of eyes, gender, nationality, profession,
cause of death, etc. Sometimes the categories are semi-numerical: for example, size of
companies categorized as small, medium or large.
Quantitative data, on the other hand, is always numerical, and data points can
be ranked or ordered. Quantitative data usually arises from measuring or counting:
for example, flying time between airports, number of rooms in a house, salary of an
accountant, cost of building a school, volume of water in a dam, number of new car sales
in a month, the size of the AIDS epidemic, etc.

Visual displays of qualitative data . . .


Two neat ways of displaying qualitative data are the pie chart and the bar chart.

Example 1A: Table 1.1 contains data on a class of 81 Master of Business Administration
(MBA) students. The table shows each student's faculty for their first degree, in either
Arts, Commerce, Engineering, Medicine, Science or other. Also given are their test
scores for an entrance examination known as the GMAT, a test commonly used by
business schools worldwide as part of the information to assist in the selection process.
Our brief is to construct a visual summary of the distribution of students having the
various first-degree categories in the table.
Firstly, we decide that first-degree category, the data that we are being asked to
display, is qualitative data. Appropriate display techniques are the pie chart and the
bar graph.
Secondly, we find the frequency distribution of the qualitative data by counting
the number of students falling into each category. At the same time, we calculate
relative frequencies by dividing the frequency in each category by the total number
of observations:
First degree    Frequency    Relative frequency

Engineering     28           0.35
Science         16           0.20
Arts            16           0.20
Commerce        10           0.12
Medicine         5           0.06
Other            6           0.07
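
The relative frequencies in this table can be reproduced with a few lines of Python. This is only a minimal sketch: the counts are copied from the frequency column, and the names `frequencies` and `relative_frequencies` are our own choices for illustration.

```python
# Frequencies of the first-degree categories, copied from the table in Example 1A.
frequencies = {
    "Engineering": 28,
    "Science": 16,
    "Arts": 16,
    "Commerce": 10,
    "Medicine": 5,
    "Other": 6,
}


def relative_frequencies(counts):
    """Divide each category frequency by the total number of observations."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


rel = relative_frequencies(frequencies)

# Print the table with relative frequencies rounded to two decimal places.
for category, value in rel.items():
    print(f"{category:<12}{frequencies[category]:>4}{value:>8.2f}")
```

Because every relative frequency is a count divided by the same total, the unrounded values always add up to one, which is a useful check on the arithmetic.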

Thirdly, we plot the pie chart and the bar graph.


Pie chart: the actual construction of a pie chart is straightforward! We have arranged
the segments in anti-clockwise order, starting at three o'clock, but there is no
hard-and-fast rule about this. The pie chart communicates most effectively if the relative
frequencies are arranged in decreasing order of size.
[Figure: Pie chart showing proportions of M.B.A. students, with segments labelled Engineering, Science, Arts, Commerce, Medicine and Other.]

Table 1.1: MBA Student Data

      First         GMAT   |       First         GMAT   |       First         GMAT
      degree        score  |       degree        score  |       degree        score

  1.  Engineering   610    |  28.  Engineering   710    |  55.  Arts          500
  2.  Engineering   510    |  29.  Science       600    |  56.  Arts          620
  3.  Engineering   610    |  30.  Science       550    |  57.  Arts          550
  4.  Engineering   580    |  31.  Science       540    |  58.  Arts          600
  5.  Engineering   720    |  32.  Science       620    |  59.  Arts          520
  6.  Engineering   620    |  33.  Science       650    |  60.  Arts          520
  7.  Engineering   540    |  34.  Science       500    |  61.  Commerce      550
  8.  Engineering   500    |  35.  Science       590    |  62.  Commerce      520
  9.  Engineering   750    |  36.  Science       630    |  63.  Commerce      560
 10.  Engineering   640    |  37.  Science       660    |  64.  Commerce      560
 11.  Engineering   550    |  38.  Science       570    |  65.  Commerce      600
 12.  Engineering   650    |  39.  Science       600    |  66.  Commerce      540
 13.  Engineering   600    |  40.  Science       630    |  67.  Commerce      550
 14.  Engineering   600    |  41.  Science       500    |  68.  Commerce      650
 15.  Engineering   510    |  42.  Science       580    |  69.  Commerce      510
 16.  Engineering   570    |  43.  Science       560    |  70.  Commerce      560
 17.  Engineering   620    |  44.  Science       550    |  71.  Medicine      590
 18.  Engineering   590    |  45.  Arts          560    |  72.  Medicine      700
 19.  Engineering   660    |  46.  Arts          550    |  73.  Medicine      640
 20.  Engineering   550    |  47.  Arts          500    |  74.  Medicine      680
 21.  Engineering   560    |  48.  Arts          510    |  75.  Medicine      580
 22.  Engineering   630    |  49.  Arts          570    |  76.  Other         550
 23.  Engineering   540    |  50.  Arts          510    |  77.  Other         680
 24.  Engineering   560    |  51.  Arts          660    |  78.  Other         540
 25.  Engineering   650    |  52.  Arts          500    |  79.  Other         640
 26.  Engineering   540    |  53.  Arts          710    |  80.  Other         620
 27.  Engineering   680    |  54.  Arts          510    |  81.  Other         450


Bar graph: Notice that there is no quantitative scale along the vertical axis of the
bar graph, that the bars are not connected, and that the widths of the bars have no
particular relevance. Because there is no quantitative ordering of the categories, we are
free to arrange them as we please. As for the pie chart, it is generally best to arrange
the bars in decreasing order of relative frequency; this makes comparison easier, and also
tends to highlight the important features of the data. Relative frequencies could also
have been used in the construction of the bar graph.
[Figure: Bar graph of the number of MBA students by first degree, with the bars in decreasing order: Engineering (28), Science (16), Arts (16), Commerce (10), Other, Medicine. The quantitative axis, "Number of MBA students", is marked at 10, 20 and 30.]
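
As a rough illustration of the advice to arrange the bars in decreasing order of frequency, the sketch below prints a plain-text bar graph of the first-degree frequencies from Example 1A. The function name `text_bar_graph` and the use of `#` to draw the bars are our own assumptions, chosen only for this illustration.

```python
# Frequencies of the first-degree categories, copied from Example 1A.
frequencies = {
    "Engineering": 28,
    "Science": 16,
    "Arts": 16,
    "Commerce": 10,
    "Medicine": 5,
    "Other": 6,
}


def text_bar_graph(counts, symbol="#"):
    """Return one text line per category, largest frequency first."""
    # sorted() is stable, so tied categories keep their original order.
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    width = max(len(name) for name in counts)
    return [f"{name:<{width}} {symbol * count} {count}" for name, count in ordered]


lines = text_bar_graph(frequencies)
print("\n".join(lines))
```

Sorting before plotting is a one-line step, and it puts the dominant category (here Engineering) at the top, where the eye lands first.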


The visually striking impact of both the pie chart and the bar graph is that engineers
form the largest proportion of this class of MBA students. Next, we might ask for a
reason for this. A plausible explanation is that engineers are not exposed to much
management and administrative training during their undergraduate years, and that
they make up for this by doing an MBA. A second explanation is that the data was
extracted during recessionary times, when engineers were not in demand; perhaps
they were investing in themselves by doing an MBA while projects were scarce. How
would you set about investigating whether this latter explanation is correct?
Our plots have provided some insight into this data. Sometimes, all we achieve is to
demonstrate the obvious. At other times, our plots will reveal completely unexpected
phenomena. Careful interpretation is then needed, frequently with the help of the experts from the discipline that the data comes from.
Example 2C: Table 1.2 gives the composition of the All-Share Index of the Johannesburg Stock Exchange as at 2 January 1990. The breakdown of the All-Share Index
reflects that it is composed of seven major sector indices, namely Coal, Diamonds,
All-Gold, Metals & Minerals, Mining Financial, Financial, and Industrial Indices.
(a) Construct a bar chart showing the number of shares in each of the seven major
sector indices that contribute to the All-Share Index.
(b) Now construct a bar chart showing the relative weightings of each of the seven major sectors as a percentage of the All-Share Index, and comment on any differences
you find from (a) above.


Table 1.2: Composition of the All Share Index on the JSE

                                  No. of shares    Percentage
                                  in subsidiary    weighting in       Major
Subsidiary sector index           index            All-Share Index    sector index

Coal                               2                0.82              Coal
Diamonds                                            8.27              Diamonds
Gold - Rand and others             5                1.39              All-Gold
Gold - Evander                     2                0.89              All-Gold
Gold - Klerksdorp                  3                5.45              All-Gold
Gold - OFS                         4                3.99              All-Gold
Gold - West Wits                   3                7.27              All-Gold
Copper                             1                0.55              Metals & minerals
Manganese                          1                0.91              Metals & minerals
Platinum                           2                5.00              Metals & minerals
Tin                                1                0.01              Metals & minerals
Other metals & minerals            3                0.26              Metals & minerals
Mining houses                      3               16.79              Mining financial
Mining holding                     3                7.13              Mining financial
Banks & financial services         5                2.80              Financial
Insurance                          4                2.15              Financial
Investment trusts                  3                1.02              Financial
Property                          12                0.47              Financial
Property trusts                   11                0.97              Financial
Industrial holdings                6                9.94              Industrial
Beverages & hotels                 2                3.43              Industrial
Building & construction            6                0.92              Industrial
Chemicals                          2                3.46              Industrial
Clothing, footwear & textiles     10                0.62              Industrial
Electronics, electr. & battery     7                1.39              Industrial
Engineering                        7                0.95              Industrial
Fishing                            1                0.08              Industrial
Food                               4                2.14              Industrial
Furniture & household goods        5                0.30              Industrial
Motors                             6                0.43              Industrial
Paper & packaging                  3                2.26              Industrial
Pharmaceutical & medical           3                0.48              Industrial
Printing & publishing              3                0.18              Industrial
Steel & allied                     2                1.99              Industrial
Stores                            10                2.16              Industrial
Sugar                              1                0.43              Industrial
Tobacco & match                    1                2.48              Industrial
Transportation                     3                0.22              Industrial

All-Share


Visual displays of quantitative data I : histograms . . .


Histograms are a time-honoured and familiar way of displaying quantitative data.
We demonstrate the construction procedure by means of an example.
Example 3A: Referring back to the data on GMAT scores in Example 1A, draw a
histogram for the distribution of the GMAT scores for the students.
We recommend a four-step histogram procedure:
1. Determine the size of the sample (see footnote 1), i.e. the number of observations. We have
$n = 81$ students, and throughout this book we reserve use of the symbol $n$ for
the concept of sample size, the number of numbers we are dealing with! Find the
smallest and largest numbers in the sample. Call these $x_{\min}$ and $x_{\max}$, respectively.
The smallest GMAT score was from student 81, who scored 450, and the largest
was the 750 achieved by student 9:
$x_{\min} = 450$
$x_{\max} = 750$
2. Choose class intervals that cover the range from $x_{\min}$ to $x_{\max}$. Here are two
guidelines that help determine the length $L$ of the class intervals: the first is due
to Mr Sturge, the second is used by the computer package GENSTAT. If the class
intervals are made too narrow, the histogram looks spiky, and if too wide, the
histogram is blurred. Sturge says: use class intervals of length
\[ L = \frac{x_{\max} - x_{\min}}{1 + \log_2 n} = \frac{x_{\max} - x_{\min}}{1 + 1.44 \log_e n} \]
GENSTAT says: use class intervals of length
\[ L = \frac{x_{\max} - x_{\min}}{\sqrt{n}}. \]
For our data, Sturge says
\[ L = \frac{x_{\max} - x_{\min}}{1 + 1.44 \log_e n} = \frac{750 - 450}{1 + 1.44 \log_e 81} = 40.94, \]
while GENSTAT says
\[ L = \frac{x_{\max} - x_{\min}}{\sqrt{n}} = \frac{750 - 450}{\sqrt{81}} = 33.33. \]

As a general rule, avoid choosing class intervals which are of awkward lengths.
Multiples of 2, 5 and 10 are most frequently used. Feel free to choose intervals
between half and double those suggested by the guidelines. All the class intervals
should be the same width. Resist the temptation to make the class intervals wider
over that part of the range where the data is sparser; this has the effect of
destroying the visual message of the histogram. For this example, $L = 50$ is a
sensible choice for the width of the class interval. It is convenient to start our
class intervals at 450, and carry on in steps of 50 as far as is necessary, so that
the boundaries of the class intervals are at 450, 500, 550, 600, 650, 700, 750, and
800. We also need to agree that scores that fall on the boundaries will be allocated
to the higher of the two class intervals, so strictly the class intervals are 450–499,
500–549, etc., as shown in the frequency distribution table below.

Footnote 1: We consider a sample to be a small number of observations taken from the population of interest.
We hope that the sample is representative of the population as a whole, so that conclusions drawn
from the sample will be valid for the population. We consider methods of obtaining a representative
sample in Chapter 11.

3. Count the number of GMAT scores falling into each class interval. The most
convenient way to do this is to set up a tick sheet, and to make one pass through
the data allocating each score to its class interval. This sets up a frequency
distribution:

Class interval    Frequency

450–499            1
500–549           20
550–599           24
600–649           21
650–699           10
700–749            4
750–799            1

Total             81

4. Plot the histogram, choosing suitable scales for each axis:

[Figure: Histogram of the GMAT scores. Horizontal axis: GMAT scores, from 400 to 800; vertical axis: number of MBA students, from 0 to 25.]
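
The two class-width guidelines from step 2 of the histogram procedure are easy to check numerically. The sketch below is a minimal illustration, not a prescribed method: the function names `sturge_width`, `sqrt_width` and `class_boundaries` are our own, and the Sturge guideline is coded with the rounded constant 1.44 used in the text (using the exact base-2 logarithm would give a slightly smaller value, about 40.87 for the GMAT data).

```python
import math


def sturge_width(x_min, x_max, n):
    """Class-interval length from the Sturge guideline, written as in the text."""
    return (x_max - x_min) / (1 + 1.44 * math.log(n))


def sqrt_width(x_min, x_max, n):
    """Class-interval length from the guideline the text attributes to GENSTAT."""
    return (x_max - x_min) / math.sqrt(n)


def class_boundaries(start, width, x_max):
    """Boundaries start, start + width, ... until the last one exceeds x_max."""
    boundaries = [start]
    while boundaries[-1] <= x_max:
        boundaries.append(boundaries[-1] + width)
    return boundaries


# GMAT data from Example 3A: n = 81, smallest score 450, largest score 750.
print(round(sturge_width(450, 750, 81), 2))
print(round(sqrt_width(450, 750, 81), 2))

# The convenient width chosen in the example is 50, starting at 450.
print(class_boundaries(450, 50, 750))
```

Both guidelines only suggest a starting point. The final width of 50 is a judgment call, picked because it is a round number lying between half and double the suggested lengths.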
The striking feature of the histogram is that it is not symmetric but is skewed to
the right, which means that it has a long tail stretching off to the right. The terms in bold
are technical, jargon terms, but their meanings are obvious.
A seasoned statistician would expect a distribution of test scores (or examination
results) to have a tail at both ends of the frequency distribution. In the above display,
there has been a truncation of the distribution at 500 (apart from a single score of
450). We would infer that the acceptance criterion on the MBA programme is a GMAT
score of 500 or more. In reality there is a tail on the left, but it is suppressed by the
fact that applicants who achieved these scores were not accepted. In the light of this, a
statistician would also query the score of 450. Is it an error in the data? Maybe it should
be 540, and there has been a transcription error. But a more plausible explanation is
that the student was outstanding in some other aspect of the selection process; maybe
the personal interview was very impressive!
Example 4B: The risks taken by investors when they invest in the stock exchange are
of considerable interest to financial analysts. Investors associate the risks of investments
with how volatile (or variable) the price changes are. Analysts measure volatility of price
changes using the standard deviation, a statistical measure of variability that we
will learn about later in this chapter. The table below contains the standard deviations
(or risks) of a sample of 75 shares listed on the Johannesburg Stock Exchange. The units
are per cent per month. Construct a suitable histogram of the data.
23  22  17  18  21  25  23  25  12  23  27  14  28   9  23
19  23  11  16  11  15  15  12  12  12  21  13  11  13  13
27  20  17   8  13  28  14   9  13  11  23  23  10  12  12
26  25  11  12  20  22  21   9  13  19  19  13  14  15  17
17  10  25  26  11  12  25  22  12  11  22  20  14  10  23



1. The sample size is $n = 75$, and the extreme values are $x_{\min} = 8$ and $x_{\max} = 28$.
2. Sturge says:
$$L = (28 - 8)/(1 + 1.44 \log_e 75) = 2.8,$$
while GENSTAT says:
$$L = (28 - 8)/\sqrt{75} = 2.3.$$

So a sensible length for the class interval is 2, and we use class interval boundaries
at 8, 10, 12, . . . , 28. Effectively, the class intervals are 8–9, 10–11, . . . , 28–29.
3. Count the number of shares falling into each class:

Class interval   Frequency
 8–9                 4
10–11               10
12–13               16
14–15                7
16–17                5
18–19                4
20–21                6
22–23               12
24–25                5
26–27                4
28–29                2
Total               75
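The three steps above can be checked with a short script. This is only a sketch: the function name `class_frequencies` and the text-mode bar display are illustrative choices, not part of the original example, and the data are typed in from the table of Example 4B.

```python
import math

# Risk data (standard deviations, % per month) for the 75 shares of Example 4B.
risks = [
    23, 22, 17, 18, 21, 25, 23, 25, 12, 23, 27, 14, 28, 9, 23,
    19, 23, 11, 16, 11, 15, 15, 12, 12, 12, 21, 13, 11, 13, 13,
    27, 20, 17, 8, 13, 28, 14, 9, 13, 11, 23, 23, 10, 12, 12,
    26, 25, 11, 12, 20, 22, 21, 9, 13, 19, 19, 13, 14, 15, 17,
    17, 10, 25, 26, 11, 12, 25, 22, 12, 11, 22, 20, 14, 10, 23,
]

n = len(risks)
data_range = max(risks) - min(risks)

# The two rules of thumb quoted in step 2 (math.log is the natural logarithm).
sturge_width = data_range / (1 + 1.44 * math.log(n))
genstat_width = data_range / math.sqrt(n)


def class_frequencies(values, start, width, n_classes):
    """Count how many integer values fall in each class of the given width."""
    counts = [0] * n_classes
    for v in values:
        counts[(v - start) // width] += 1
    return counts


frequencies = class_frequencies(risks, start=8, width=2, n_classes=11)

# A rough text histogram: one asterisk per share.
for k, f in enumerate(frequencies):
    lower = 8 + 2 * k
    print(f"{lower:2d}-{lower + 1:2d}  {f:2d}  {'*' * f}")
```

Printed on its side like this, the two peaks (at 12–13 and at 22–23) are already visible.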

Finally, we plot the histogram:


[Figure: histogram of the risk data. Vertical axis: Number of shares (5 to 20). Horizontal axis: Risk (10 to 30).]


The striking feature of this histogram is that it has two clear peaks. In statistical
jargon, it is said to be bimodal. The visual display of this information has thus revealed
information which was not at all obvious from even a careful search through the 75 values
in the table of data. The financial analyst now needs an explanation for the bimodality.
Further investigation revealed that gold shares were predominantly responsible for the
peak on the right while industrial shares were found to be responsible for the peak on
the left. The histogram reveals that gold shares generally have a substantially higher risk
than industrial shares. In layman's terms, we conclude that gold shares are generally
more volatile than industrial shares.

Example 5C: Plot a histogram to display the examination marks of 25 students and
comment on the shape of the histogram:

68  72  39  50  69  52  51  50  41  52  65  37  45
78  48  55  53  61  71  42  57  34  57  66  87

Example 6C: A company that produces timber is interested in the distribution of the
heights of their pine trees. Construct a histogram to display the heights, in metres, of
the following sample of 30 trees:

18.3  19.1  17.3  19.4  17.6  20.1  19.9  20.0  19.5  19.3
17.7  19.1  17.4  19.3  18.7  18.2  20.0  17.7  20.0  17.5
18.5  17.8  20.1  19.4  20.5  16.8  18.8  19.7  18.4  20.4

Visual displays of quantitative data II: stem-and-leaf plots . . .

Stem-and-leaf plots are a relatively new display technique. The visual effect is very
similar to that of the histogram; however, they have the advantage that additional
information is represented: the original data values can be extracted from the display.
Thus stem-and-leaf plots can be used as a means of data storage.
We learn how to construct a stem-and-leaf plot by means of an example, using
somewhat historic data. The procedure is simple.

Example 7A: At the end of the 1983/84 English football season, the points scored by each
club were as follows:

Arsenal              63        Nottingham Forest      71
Aston Villa          60        Notts County           41
Birmingham City      48        Queens Park Rangers    73
Coventry City        50        Southampton            71
Everton              59        Stoke City             50
Ipswich Town         53        Sunderland             52
Leicester City       51        Tottenham Hotspur      61
Liverpool            79        Watford                57
Luton Town           51        West Bromwich          51
Manchester United    74        West Ham United        60
Norwich City         50        Wolverhampton          29

To produce the stem-and-leaf plot for the points scored by each team, we split each
number into a stem and a leaf. In this example, the natural split is to use the tens
as stems, and the units as leaves. Because the numbers range from the 20s to the 70s,
our stems run from 2 to 7. We write them in a column:

stems leaves
2
3
4
5
6
7

We now make one pass through the data. We split each number into its stem
and its leaf, and write the leaf on the appropriate stem. The first number, the
63 points scored by Arsenal, has stem 6 and leaf 3. We write a 3 as a leaf on
stem 6:
stems leaves
2
3
4
5
6 3
7
Aston Villa's 60 points become leaf 0 on stem 6. Birmingham City's 48 points are
entered as leaf 8 on stem 4. After the first six scores have been entered, we have:

stems   leaves
2
3
4       8
5       093
6       30
7

Continue until all 22 numbers have been entered:


stems   leaves        count
2       9               1
3                       0
4       81              2
5       0931100271     10
6       3010            4
7       94131           5
                       22

In the last column, we enter the count of the number of leaves on each stem, add
them up, and check that we have entered the right number of leaves!
The final step is to sort the leaves on each stem from smallest to largest, and to add
a cumulative count column:
stems   sorted leaves   count   cum. count
2       9                 1         1
3                         0         1
4       18                2         3
5       0001112379       10        13
6       0013              4        17
7       11349             5        22

What have we created? Essentially, we have a histogram on its side, with class
intervals of length 10. But in addition, we have retained all the original information.
In a histogram, we would only have known that five teams scored between 70 and 79
points; now we know that there were scores of 71 (two teams), 73, 74, and Liverpool's
league-winning 79!
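The construction just described (split, enter, count, sort, accumulate) can be sketched in a few lines of Python. The function name `stem_and_leaf` and the row layout are illustrative assumptions; the data are the 22 points totals of Example 7A, and the split into tens and units is the one chosen in the text.

```python
from collections import defaultdict

# Points scored by the 22 clubs of Example 7A, in the order listed.
points = [63, 60, 48, 50, 59, 53, 51, 79, 51, 74, 50,
          71, 41, 73, 71, 50, 52, 61, 57, 51, 60, 29]


def stem_and_leaf(values):
    """Return rows of (stem, sorted leaves, count, cumulative count).

    Stems are the tens and leaves are the units, as in Example 7A.
    Empty stems between the smallest and largest stem are kept.
    """
    leaves = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)   # one pass: split each number
        leaves[stem].append(leaf)

    rows, cumulative = [], 0
    for stem in range(min(leaves), max(leaves) + 1):
        sorted_leaves = sorted(leaves.get(stem, []))
        cumulative += len(sorted_leaves)
        leaf_text = "".join(str(leaf) for leaf in sorted_leaves)
        rows.append((stem, leaf_text, len(sorted_leaves), cumulative))
    return rows


rows = stem_and_leaf(points)
for stem, leaf_text, count, cum in rows:
    print(f"{stem} | {leaf_text:<12} {count:3d} {cum:4d}")
```

The final cumulative count should equal the sample size, which is the same check the text recommends doing by hand.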
Example 8A: For the data of example 1A, produce and compare stem-and-leaf plots
of the GMAT scores of students with Engineering and Arts backgrounds.
All the GMAT scores ended in a zero, so this contains no useful information; therefore
we use the hundreds as stems and the tens as leaves. For both categories of students,
we would then have only three stems, 5, 6 and 7. Looking back at the histogram
display of GMAT scores (example 3A), we note that we used class intervals of width 50
units. We can create this width class interval in the stem-and-leaf plot as demonstrated
below.



Engineering:
stems   leaves      count
5       140144        6
5?      8579566       7
6       11240023      8
6?      5658          4
7       21            2
7?      5             1
                     28

Arts:
stems   leaves      count
5       01101022      8
5?      6575          4
6       20            2
6?      6             1
7       1             1
7?                    0
                     16

In this approach, we split each 100 into two stems: the first encompasses the leaves
from 0 to 4, and the second, labelled ?, includes the leaves from 5 to 9.
The final step is to sort the leaves for each stem.
ENGINEERING
stems   sorted leaves   cum. count
5       011444               6
5?      5566789             13
6       00112234            21
6?      5568                25
7       12                  27
7?      5                   28

ARTS
stems   sorted leaves   cum. count
5       00011122             8
5?      5567                12
6       02                  14
6?      6                   15
7       1                   16
7?                          16

Again, a skewness to the right is evident in both displays.


Striking, too, is the observation that students with an engineering background tend
to have GMAT scores in the upper 500s and lower 600s, whereas the majority of arts
background students have scores in the 500s. Although the sample sizes are small, this
pattern seems marked enough to suggest that engineers perform better, on average, than
the arts students.
If splitting stems into two parts seems inadequate for the data set on hand, here
is a system for splitting them into five!
Example 9B: Produce a stem-and-leaf plot for the risk data of Example 4B.
As for the histogram, it would be sensible to use stems of width 2. Each stem is
therefore split into five: leaves 0 and 1 carry no letter, 2 and 3 are denoted t, 4 and 5
are denoted f, 6 and 7 are denoted s, and 8 and 9 are denoted ?. Notice the
convenient mnemonics: English is a marvellous language!

stems   sorted leaves       count   cum. count
0?      8999                  4         4
1       0001111111           10        14
1t      2222222223333333     16        30
1f      4444555               7        37
1s      67777                 5        42
1?      8999                  4        46
2       000111                6        52
2t      222233333333         12        64
2f      55555                 5        69
2s      6677                  4        73
2?      88                    2        75
Note that we have presented the stem-and-leaf plot with the leaves already sorted.
Example 10C: Produce a stem-and-leaf plot for the examination marks of another
group of 25 students.
50  79  53  85  50  53  65  58  43  45  48  51  54
72  71  61  51  72  53  39  67  27  43  69  53

Example 11C: The maximum temperatures (°C) at 20 towns in southern Africa one
summer's day are given in the following table. Produce a stem-and-leaf plot.

Pietersburg      30        Windhoek            32
Pretoria         20        Cape Town           22
Johannesburg     28        George              21
Nelspruit        30        Port Elizabeth      18
Mmabatho         33        East London         17
Bethlehem        30        Beaufort West       23
Bloemfontein     31        Queenstown          22
Kimberley        31        Durban              26
Upington         28        Pietermaritzburg    27
Keetmanshoop     28        Ladysmith           30

Example 12C: In order to assess the prices of the television repair industry, a faulty
television set was taken to 37 TV repair shops for a quote. The data below represents
the quoted prices in rands. Construct a stem-and-leaf plot and comment on its features.
 60   55  158   38   48  120   85  245   90   60   49   38   98
185  200  150  140   75  125  125  125  145  200  145   94  165
105   75   75  120   36  150  120  176   60   78   28

Five-number data summaries: median, lower and upper quartiles, extremes . . .
At the beginning of this chapter, we said that statisticians gained a feel for the data
they were about to analyse in two ways. We have now dealt with the first way, that of
constructing a visual display. Now we move on to the second way, by computing a few
key numbers which summarize the data.
Our aim now is to reduce a large batch of data to just a few numbers which we can
grasp simultaneously, and thus help us to understand the important features of the data
set as a whole.
It is useful now to introduce the concept of rank. In a sample of size $n$ sorted from
smallest to largest, the smallest number is said to have rank 1, the second smallest
rank 2, . . . , the largest rank $n$. We call the smallest number $x_{(1)}$ (so $x_{(1)} = x_{\min}$), the
second smallest $x_{(2)}$, . . . , and the largest is $x_{(n)}$ (so $x_{(n)} = x_{\max}$). We use $x_{(r)}$ for the
number with rank $r$. The cumulative count column of a stem-and-leaf plot makes it
easy to find the observation with any given rank.
We use $x_{(r+\frac{1}{2})}$ to denote the number half-way between the numbers with rank $r$ and
rank $r + 1$:
$$x_{(r+\frac{1}{2})} = \frac{x_{(r)} + x_{(r+1)}}{2}.$$
We say that $x_{(r+\frac{1}{2})}$ is the number with rank $r + \frac{1}{2}$. Such numbers are called half-ranks.
We define the median of a batch of $n$ numbers as the number which has rank
$(n + 1)/2$. We use $x_{(m)}$ to denote the median. If $n$ is an odd number, then the rank of
the median will be a whole number, and the median will be the middle number in the
data set. But if $n$ is even, then the rank will be a half-rank, and the median will be the
average of the two middle numbers in the data set.
The lower quartile is defined to be the number with rank $l = ([m] + 1)/2$, where
$m$ is the rank of the median. The notation $[m]$ means that if $m$ is something-and-a-half,
we drop the half! The alternative to doing this is having to define quarter-ranks! The
upper quartile has rank $u = n - l + 1$. The lower and upper quartiles are denoted $x_{(l)}$
and $x_{(u)}$, respectively. The extremes, the smallest and largest values in a data set, have
ranks 1 and $n$, and we agreed earlier to call them $x_{(1)}$ and $x_{(n)}$, respectively.
These five numbers provide a useful summary of the batch of data, called, with complete lack of imagination, the five-number summary. We write them from smallest
to largest:
$$(x_{(1)}, x_{(l)}, x_{(m)}, x_{(u)}, x_{(n)}).$$
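The rank definitions above translate directly into code. The following is a minimal sketch, assuming the data fit in a list; the helper names `value_at_rank` and `five_number_summary` are illustrative, and the sample used is the set of football points from Example 7A.

```python
import math


def value_at_rank(sorted_values, rank):
    """Value with the given 1-based rank; a half-rank averages its two neighbours."""
    lower = math.floor(rank)
    if rank == lower:
        return sorted_values[lower - 1]
    return (sorted_values[lower - 1] + sorted_values[lower]) / 2


def five_number_summary(values):
    """Five-number summary using the rank rules defined in the text."""
    xs = sorted(values)
    n = len(xs)
    m = (n + 1) / 2                # rank of the median
    l = (math.floor(m) + 1) / 2    # rank of the lower quartile; [m] drops the half
    u = n - l + 1                  # rank of the upper quartile
    return (xs[0],
            value_at_rank(xs, l),
            value_at_rank(xs, m),
            value_at_rank(xs, u),
            xs[-1])


# Football points of Example 7A.
points = [63, 60, 48, 50, 59, 53, 51, 79, 51, 74, 50,
          71, 41, 73, 71, 50, 52, 61, 57, 51, 60, 29]
print(five_number_summary(points))
```

Note that these quartile ranks follow this book's convention; statistical packages often use slightly different interpolation rules, so their quartiles may differ a little from the ones computed here.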


Example 13A: Find the five-number summary for the end-of-season football points of
Example 7A.
An easy way to find the five-number summary is to use the stem-and-leaf plot.
stems   sorted leaves   count   cum. count
2       9                 1         1
3                         0         1
4       18                2         3
5       0001112379       10        13
6       0013              4        17
7       11349             5        22

Because $n = 22$, the median has rank $m = (n + 1)/2 = (22 + 1)/2 = 11\frac{1}{2}$. We need
to average the numbers with ranks 11 and 12. From the cumulative count, we see that
the last leaf on stem 4 has rank 3, and the last leaf on stem 5 has rank 13. Counting
along stem 5, we find that 53 is the number with rank 11 and 57 has rank 12. Thus the
median is the average of these two numbers, $(53 + 57)/2 = 55$; we write $x_{(m)} = 55$. Half
the teams scored below 55 points, half scored above 55 points.
The lower quartile has rank $l = ([11\frac{1}{2}] + 1)/2 = (11 + 1)/2 = 6$. The observation with
rank 6 is 50, thus $x_{(l)} = 50$. The upper quartile has rank $u = n - l + 1 = 22 - 6 + 1 = 17$.
The observation with rank 17 is 63, thus $x_{(u)} = 63$. The extremes are $x_{(1)} = 29$ and
$x_{(n)} = 79$. The five-number summary is:
(29, 50, 55, 63, 79).
Why is this a big deal? Because it tells us that . . .
1. Half the teams scored below 55 points, half scored above 55 points, because 55 is
the median.
2. Half the teams scored between 50 and 63 points, because these two numbers are
the lower and upper quartiles.
3. A quarter of the scores lay between 29 and 50, a quarter between 50 and 55, a
quarter between 55 and 63, and a quarter between 63 and 79.
4. All the scores lay between 29 and 79.
Example 14B: Find the five-number summaries for GMAT scores of both engineering
and arts students. Use the stem-and-leaf plot of example 8A.

ENGINEERING
stems   sorted leaves   count   cum. count
5       011444            6         6
5?      5566789           7        13
6       00112234          8        21
6?      5568              4        25
7       12                2        27
7?      5                 1        28

ARTS
stems   sorted leaves   count   cum. count
5       00011122          8         8
5?      5567              4        12
6       02                2        14
6?      6                 1        15
7       1                 1        16
7?                        0        16

For the engineers, the median has rank $m = (28 + 1)/2 = 14\frac{1}{2}$. Thus $x_{(m)} = (600 +
600)/2 = 600$. The lower quartile has rank $l = ([m] + 1)/2 = ([14\frac{1}{2}] + 1)/2 = 7\frac{1}{2}$, and
the upper quartile rank $n - l + 1 = 28 - 7\frac{1}{2} + 1 = 21\frac{1}{2}$. So $x_{(l)} = (550 + 550)/2 = 550$,
and $x_{(u)} = (640 + 650)/2 = 645$. The five-number summary is
(500, 550, 600, 645, 750).
For the arts students, the median has rank $m = (16 + 1)/2 = 8\frac{1}{2}$. Thus $x_{(m)} = (520 +
550)/2 = 535$. The lower quartile has rank $l = ([m] + 1)/2 = ([8\frac{1}{2}] + 1)/2 = 4\frac{1}{2}$, and the
upper quartile rank $n - l + 1 = 16 - 4\frac{1}{2} + 1 = 12\frac{1}{2}$. So $x_{(l)} = (510 + 510)/2 = 510$, and
$x_{(u)} = (570 + 600)/2 = 585$. The five-number summary is
(500, 510, 535, 585, 710).
For the engineers, the median GMAT score was 600; by contrast, for arts students,
it was only 535. The central 50% of engineers obtained scores in the interval from 550 to
645, while the central 50% of arts students were in a downwards-shifted interval, 510 to
585. This reinforces our earlier interpretation that engineers tend to have higher GMAT
scores than arts students.
Example 15C: Find the five-number summaries for the data of (a) Example 10C, (b)
Example 11C, and (c) Example 12C.

Visual displays of quantitative data III: box-and-whisker plots . . .
Five-number summaries can be displayed graphically by means of box-and-whisker
plots. (This ridiculous name was invented by the American statistician who invented
the method, John Tukey, who also invented both the name stem-and-leaf plot and
the plot itself! John Tukey was not only an inventor of crazy names; he also made an
enormous impact on the theory and practice of the discipline Statistics.) Once again,
we will use an example to describe how to produce a box-and-whisker plot.
Example 16A: Produce a box-and-whisker plot for the football team points of Example
7A, using the five-number summary (29, 50, 55, 63, 79) computed in Example 13A. The
procedure is simple:
1. Draw a vertical axis which covers at least the range of the data.
2. Draw a box from the lower to the upper quartile.
3. Draw a line across the box at the median.
4. Draw whiskers from the box out to the extremes.
Applied to the five-number summary (29, 50, 55, 63, 79), this procedure yields the
box-and-whisker plot in Figure 1.1.
Box-and-whisker plots are especially useful when we wish to compare two or more
sets of data. To achieve this, we construct the plots side-by-side. It is essential to use
the same vertical scale for all the plots.

[Figure 1.1: box-and-whisker plot of the football points. Vertical axis: Points (20 to 100). Labelled values: lower extreme (29), lower quartile (50), median (55), upper quartile (63), upper extreme (79).]

Example 17B: Draw a series of box-and-whisker plots to compare the GMAT scores
of each category of MBA students.
We computed the five-number summaries of the GMAT scores for engineering and
arts students in Example 14B. The five-number summaries for all the categories are:

Engineering   (500, 550, 600, 645, 750)
Science       (500, 550, 585, 625, 660)
Arts          (500, 510, 535, 585, 710)
Commerce      (510, 540, 555, 600, 700)
Medicine      (580, 585, 640, 690, 700)
Other         (460, 540, 585, 640, 680)

The box-and-whisker plots, shown side-by-side, reveal the differences between the
various categories of students.

[Figure: side-by-side box-and-whisker plots of GMAT scores for the categories ENG., SCI., ARTS, COM., MED. and OTH. Vertical axis: GMAT scores (400 to 800).]
We see from a comparison of the box-and-whisker plots that the students in this class
with a medical background had the highest median GMAT score, followed by engineers,
with arts students having the lowest median. The skewness to the right (now shown as a
long whisker pointing upwards!) which we commented on earlier for the class as a whole,
is also evident for engineering, science, arts and commerce students, the categories for
which the sample sizes were large.

Outliers and strays. . .


In many data sets, there are one or more values that appear to be very different to
the bulk of the observations. Intuitively, we recognize these values because they are a
long way from the median of the data set as a whole. We can make our box-and-whisker
plots more informative by plotting and labelling some of the outlying values in such
a way that they are highlighted and our attention immediately drawn to them. These
outlying values might well represent errors that have crept into the data, either when the
observation was made, or when the numbers were transcribed from one sheet of paper
to another, or when they were entered into a computer, or even when they were being
transferred from one computer to another. On the other hand, outlying values might
represent genuine observations, and be of special interest and importance. In any event,
these observations need to be checked, and either confirmed or rejected. It is useful to
have rules that will aid us to identify such observations.
Outliers are those observations which are greater than
$$x_{(m)} + 6(x_{(u)} - x_{(m)})$$
or less than
$$x_{(m)} - 6(x_{(m)} - x_{(l)})$$
and we label them boldly on the box-and-whisker plot.
Less outlying values, called strays, are those observations which are not outliers but
are greater than
$$x_{(m)} + 3(x_{(u)} - x_{(m)})$$
or less than
$$x_{(m)} - 3(x_{(m)} - x_{(l)})$$
and we label them less boldly on the plot.
The largest and smallest observations which are not strays are called the fences
(more Tukeyisms!). When outliers and strays are being portrayed in a box-and-whisker
plot, the convention is to take the whiskers out as far as the fences, not the extremes.
This helps to isolate and highlight the outlying values.
In any event, it is sometimes helpful to identify a few values of special interest or
importance in a box-and-whisker plot.

Example 18A: The university computing service provides data on the amount of
computer usage (hours) by each of 30 students in a course:

Student no.   Usage      Student no.   Usage      Student no.   Usage
AD483           53       AM044            2       AS677           36
CI144            7       CS572           25       EK817           20
FV246           38       GM337           36       GR803           33
HN050           48       JK314           84       JR894          154
JV670           31       KM232           35       LJ419           44
LW032           48       MA276           69       MJ076           95
PH544            4       PS279           60       RR676           18
SA831           51       SC186           47       SS154           37
TB864           11       VO822           41       WG794           34
WB909           73       YG007           38       ZP559          125

Is the lecturer justified in claiming that certain students appear to be making excessive
use of the computer (playing games?) while the usage of others is so low that she is
suspicious that they are not doing the work themselves?

The stem-and-leaf plot is

stems   sorted leaves   cum. count
 0      247                  3
 1      18                   5
 2      05                   7
 3      134566788           16
 4      14788               21
 5      13                  23
 6      09                  25
 7      3                   26
 8      4                   27
 9      5                   28
10                          28
11                          28
12      5                   29
13                          29
14                          29
15      4                   30

The five-number summary is (2, 31, 38, 53, 154).


The outliers were those observations greater than
$$x_{(m)} + 6(x_{(u)} - x_{(m)}) = 38 + 6(53 - 38) = 128$$
or less than
$$x_{(m)} - 6(x_{(m)} - x_{(l)}) = 38 - 6(38 - 31) = -4.$$
There was only one outlier, the usage of 154 hours by student JR894.
The strays were those observations which were not identified as outliers but were
greater than
$$x_{(m)} + 3(x_{(u)} - x_{(m)}) = 38 + 3(53 - 38) = 83$$
or less than
$$x_{(m)} - 3(x_{(m)} - x_{(l)}) = 38 - 3(38 - 31) = 17.$$
There are seven strays: four students (AM044 (2 hours), PH544 (4 hours), CI144
(7 hours), TB864 (11 hours)) are at the low usage end, and three (JK314 (84 hours),
MJ076 (95 hours) and ZP559 (125 hours)) are at the high usage end. The fences are the
outermost observations that were not strays, and were the 18 hours and 73 hours. The
box-and-whisker plot, with the outlier labelled boldly, and strays merely labelled, looks
like this:

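[Figure: box-and-whisker plot of computer usage. Vertical axis: Hours (50 to 150). Labelled values: upper quartile (53), median (38), lower quartile (31). Outlier labelled boldly: JR894. Strays labelled: ZP559, MJ076, JK314 above the box; TB864, CI144, PH544, AM044 below it.]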

The lecturer now has a list of students whose computer utilization appears to be
suspicious.
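The outlier and stray rules used in this example can be sketched as a small function. The name `classify` and its return format are illustrative assumptions; the quartiles and median are passed in from the five-number summary (2, 31, 38, 53, 154) found above, and the usage figures are those of Example 18A.

```python
def classify(values, lower_q, median, upper_q):
    """Split observations into outliers and strays and find the fences.

    Outliers lie beyond 6 quartile-to-median distances from the median,
    strays beyond 3 such distances (but are not outliers), and the fences
    are the most extreme observations that are neither.
    """
    hi_outlier = median + 6 * (upper_q - median)
    lo_outlier = median - 6 * (median - lower_q)
    hi_stray = median + 3 * (upper_q - median)
    lo_stray = median - 3 * (median - lower_q)

    outliers = sorted(v for v in values if v > hi_outlier or v < lo_outlier)
    strays = sorted(v for v in values
                    if v not in outliers and (v > hi_stray or v < lo_stray))
    inside = [v for v in values if lo_stray <= v <= hi_stray]
    fences = (min(inside), max(inside))
    return outliers, strays, fences


# Computer usage (hours) of the 30 students in Example 18A.
usage = [53, 7, 38, 48, 31, 48, 4, 51, 11, 73,
         2, 25, 36, 84, 35, 69, 60, 47, 41, 38,
         36, 20, 33, 154, 44, 95, 18, 37, 34, 125]

outliers, strays, fences = classify(usage, lower_q=31, median=38, upper_q=53)
print("outliers:", outliers)
print("strays:  ", strays)
print("fences:  ", fences)
```

The function works on the usage values only; matching each flagged value back to a student number is left to a lookup in the table.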
Example 19C: A company that produces breakfast cereals is interested in the protein
content of wheat, its basic raw material. The protein content of 29 samples of wheat
(percentages) was recorded as follows:

9.2 8.0 10.9 11.6 10.4 9.5 8.5 7.7 8.0 11.3 10.0 12.8 8.2 10.5 10.2
11.9 8.1 12.6 8.4 9.6 11.3 9.7 10.8 83 10.8 11.5 21.5 9.4 9.7

Confirm the statistician's conclusion that the values 83 and 21.5 are outliers.
The statistician asked that these values should be investigated. Checking back to
the original data, it was discovered that 83 should have been 8.3, and 21.5 should have
been 12.5. Transposed digits and misplaced decimal points are two of the most frequent
types of error that occur when data is entered into a computer.
Example 20C: A winery is concerned about the possible impact of global warming
on the grape crop. It was able to obtain some interesting historical rainfall data going
back to 1884 from a wine-producing region. The rainfall (mm) in successive Januaries
at Paarl for the 22-year period 1884–1905 were recorded as follows:

Year    Rain       Year    Rain       Year    Rain
1884     2.6       1892    37.8       1900     3.0
1885     4.9       1893     0.0       1901   145.1
1886    16.3       1894     0.0       1902    39.7
1887    21.6       1895    52.3       1903   105.9
1888     6.1       1896     4.1       1904    17.8
1889     0.0       1897     6.4       1905    10.6
1890     0.0       1898    15.8
1891     1.1       1899    27.7

(a) Produce a stem-and-leaf plot.


(b) Find the five-number summary.
(c) Draw the box-and-whisker plot, showing outliers and strays, if any.

Statistics in Statistics . . .
Within the discipline Statistics, we give a precise technical definition to the concept,
a statistic. A statistic is any quantity determined from a sample. Thus the median is
a statistic, and so are the other four numbers that make up a five-number summary.
These are examples of summary statistics, because they endeavour to summarize
certain aspects of the information contained in the sample. We now learn about a
further bunch of statistics.

Measures of location and spread . . .


We use the term measure of location to describe any statistic that purports to
locate the middle, in some sense, of the data set. For example, confronted by a
collection of data on house prices, we would use a measure of location to answer the
question: What is the typical price of a house? The next questions might be: How much
variability is there in house prices? What is the difference between the price of a cheap
house and that of an expensive house? Measures of spread are designed to provide
answers to these two questions. In the next few sections we consider a few of the most
important measures of location, and then some measures of spread.

The sample median


The median, which we denoted x(m) , locates the middle of the data in the sense
that half the observations are smaller than the median and half are larger than the
median. To find the median it is necessary to sort or rank the data from the smallest
value to the largest. Remember that if the sample size n is an odd number, the median
is the middle observation, but if n is even, it is the average of the two middle
observations.


The sample mean. . .


The sample mean is, with good justification, the most important measure of location. It is found by adding together all the values in the sample, and dividing this total
by $n$, the sample size. We introduce a subscript notation to describe a sample of size $n$.
Call the first observation we make $x_1$, the second $x_2$, . . . , the $n$th $x_n$. Then the sample
mean, almost universally denoted $\bar{x}$ (pronounced "x bar"), is defined to be
$$\bar{x} = (x_1 + x_2 + \cdots + x_n)/n = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

The sample mean locates the middle of the batch of data values in a special way.
It is equivalent to hanging a 1 kg mass at points $x_1, x_2, \ldots, x_n$ along a ruler (of zero
mass), and then $\bar{x}$ is the point at which the ruler balances. (The masses needn't be 1 kg,
but they must all be equal!)
The mean is much easier to calculate than the median. The mean requires a single
pass through the data, adding up the values. In contrast, the data needs to be sorted
before the median can be computed, an operation which requires several passes through
the data.
Example 21A: Find the sample mean of the dividend yields of 15 shares in the paper and packaging sector of the Johannesburg Stock Exchange. Also find the median.
Compare the mean and the median. The yields are expressed as percentages.
Copi        3.3        E. Haddon    7.6        Pr. Paper    6.7
Caricar     8.4        Kohler       7.1        Prs. Sup     2.9
Coates     10.7        Metal Box    6.6        Sappi        7.5
Consol      6.0        Metaclo      8.6        Trio Rand    8.2
DRG         9.6        Nampak       5.8        Xactics      3.0

We sum the 15 dividend yields and divide by 15:
$$\bar{x} = (3.3 + 8.4 + 10.7 + \cdots + 8.2 + 3.0)/15 = 6.80$$

The stem-and-leaf plot for these data is shown below:

stems   sorted leaves   count   cum. count
 2      9                 1         1
 3      03                2         3
 4                        0         3
 5      8                 1         4
 6      067               3         7
 7      156               3        10
 8      246               3        13
 9      6                 1        14
10      7                 1        15

The median has rank m = (15 + 1)/2 = 8, and thus x(m) = 7.1. In this example,
there is little difference between the two measures of location. But this is not always the
case . . .
Example 22A: Find the mean and the median of the weekly volume of the same
15 shares as in Example 21A. The weekly volume is the number of shares traded in a
week.
Copi         2 300        E. Haddon          0        Pr. Paper       700
Caricar      2 100        Kohler           100        Prs. Sup          0
Coates       3 100        Metal Box    111 400        Sappi        40 600
Consol       1 200        Metaclo          700        Trio Rand    84 100
DRG         31 800        Nampak           100        Xactics      45 900

The sample mean is $\bar{x} = (2\,300 + 2\,100 + \cdots + 84\,100 + 45\,900)/15 = 21\,607$.
Sort the data, locate the middle (8th) value, and find that the median is x(m) = 2 100.
The mean is just over 10 times larger than the median. What has gone wrong?
Nothing, it's just that the mean and median locate the middle of the data according
to a different set of rules! In this example, the mean has been dragged upwards by a few
large values, so that only five of the fifteen numbers are larger than the mean. But even
if a million Metal Box shares had been traded during the week, the median would have
remained the same! The median is an example of a measure of location which is said,
in statistical jargon, to be robust. The mean is not robust, being sensitive to outlying
values in the data set. Because the mean is not robust, it is important to be aware of
possible outliers in any sample of data for which the mean is being computed.
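The contrast between the two measures is easy to reproduce. The sketch below uses the weekly volumes of Example 22A and Python's standard `statistics` module; the variable names are illustrative, and the "million Metal Box shares" scenario is the hypothetical one mentioned in the text.

```python
import statistics

# Weekly volumes of the 15 shares in Example 22A.
volumes = [2300, 2100, 3100, 1200, 31800,
           0, 100, 111400, 700, 100,
           700, 0, 40600, 84100, 45900]

mean = statistics.mean(volumes)
median = statistics.median(volumes)
above_mean = sum(v > mean for v in volumes)

# Hypothetical change: a million Metal Box shares traded instead of 111 400.
inflated = [1_000_000 if v == 111400 else v for v in volumes]

print("mean:", round(mean), " median:", median, " values above mean:", above_mean)
print("after the change -> mean:", round(statistics.mean(inflated)),
      " median:", statistics.median(inflated))
```

Changing one large value moves the mean a long way but leaves the median where it was, which is what robustness means here.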
The mean and the median tend to be close when the distribution of the values is
symmetric and there are no outlying observations. The mean and median differ increasingly as the distribution of the data becomes more and more skew. The observations
in the long tail of a skew distribution drag the mean in the direction of the tail. The
sample mean of a very skew distribution might give a totally misleading impression of
the middle of the data set.
There are no hard-and-fast rules which state when to use the sample mean and when
to use the median as a measure of location. In general terms, the median is good for
most sets of data. The sample mean is most useful when the data has a symmetric
distribution. Data with a long tail to the right can be made more symmetric by taking
logarithms or taking square roots of all the data values. Such manipulations to the
original data values are called transformations.
The sample mean has mathematical advantages over the median. It is a FAR easier
statistic for the mathematical statisticians to do algebraic manipulations with than the
median. A vast amount of statistical theory has been developed for the sample mean, and
for this reason it is the predominant measure of location used in sophisticated statistical
methods.


Measures of spread . . .
Measures of spread give insight into the variability of a set of data. Two measures
of spread can be defined in an obvious way from the five-number summary. They are:
the range $R$, defined as
$$R = x_{(n)} - x_{(1)},$$
and the interquartile range $I$, defined as
$$I = x_{(u)} - x_{(l)}.$$
The range is unreliable as a measure of spread because it depends only on the smallest
and largest values in the sample, and is thus as sensitive as it can possibly be to outlying
values in the sample. It is the ultimate example of a non-robust statistic! On the other
hand, the interquartile range is the length of the interval covering the central half of the
data values in the sample, and it is not sensitive to outliers in the data. The interquartile
range is a robust measure of spread.
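As a rough sketch, both measures can be computed from the sorted data. The waiting times are those of Exercise 1.3. The function names are hypothetical, and the rule used for the ranks of the quartiles (median rank m = (n + 1)/2, lower quartile rank l = ([m] + 1)/2, upper quartile rank u = n - l + 1) is an assumption about the five-number summary defined earlier in the chapter.

```python
import math

def order_stat(sorted_x, rank):
    """Value at a 1-based rank; a rank ending in .5 averages its two neighbours."""
    lower = math.floor(rank)
    upper = math.ceil(rank)
    return (sorted_x[lower - 1] + sorted_x[upper - 1]) / 2

def range_and_iqr(data):
    """Return (R, I) for a list of numbers, using the assumed quartile ranks."""
    x = sorted(data)
    n = len(x)
    m = (n + 1) / 2               # rank of the median
    l = (math.floor(m) + 1) / 2   # assumed rank of the lower quartile
    u = n - l + 1                 # assumed rank of the upper quartile
    R = x[-1] - x[0]
    I = order_stat(x, u) - order_stat(x, l)
    return R, I

# Waiting times (minutes) from Exercise 1.3.
waiting = [34, 24, 43, 56, 74, 45, 23, 43, 56, 67, 30, 19, 36,
           32, 65, 36, 24, 54, 39, 43, 67, 54, 32, 18, 97]

R, I = range_and_iqr(waiting)
print("range =", R, " interquartile range =", I)
```

The range here is determined entirely by the two extreme waiting times, whereas the interquartile range ignores them.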
The sample variance and its square root, the sample standard deviation, have
the same advantage, easier algebraic manipulation, over the range and interquartile
range that the mean had over the median. Therefore the sample variance is frequently
the only measure of spread calculated for a set of data.
The sample variance, denoted by $s^2$, is defined by the formula
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$

In words, it is the sum of the squared differences between each data value and the sample
mean, with this sum being divided by one less than the number of terms in the sum.
The sample standard deviation, denoted by s, is the square root of the sample variance.
It is a nuisance to have these two measures of spread, $s$ and $s^2$, one of which is simply
the square root of the other. Why have both? The standard deviation is the easier of
the two measures of spread to get an intuitive feeling for, largely because it is measured
in the same units as the original data. The variance is measured in squared units,
an awkward quantity to visualize. For example, if data consists of prices measured in
rands, the sample variance has units "squared rands" (whatever that means!), but the
standard deviation is in rands. Even worse, if the data consists of percentages, the
variance has units %², whereas the standard deviation has the intelligible units %.
But mathematical statisticians prefer to work with the variance: not having to deal
with a square root in the algebra makes their lives simpler and neater. So the two
equivalent measures of spread co-exist side by side, and we just have to come to terms
with both of them.
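The definition translates directly into a few lines of code. The following is a minimal sketch, not a recommended implementation: the function name `sample_variance` is hypothetical, the data are the waiting times of Exercise 1.3, and the result is compared with `statistics.variance` from the Python standard library, which also divides by n - 1.

```python
import math
import statistics

def sample_variance(data):
    """Sum of squared differences from the sample mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

# Waiting times (minutes) from Exercise 1.3.
waiting = [34, 24, 43, 56, 74, 45, 23, 43, 56, 67, 30, 19, 36,
           32, 65, 36, 24, 54, 39, 43, 67, 54, 32, 18, 97]

s2 = sample_variance(waiting)   # units: minutes squared
s = math.sqrt(s2)               # units: minutes, the same as the data

print(f"s^2 = {s2:.1f}, s = {s:.1f}")
print("library value of s^2:", round(statistics.variance(waiting), 1))
```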
Example 23A: Compute the sample variance $s^2$ and the standard deviation $s$ for the
dividend yields of the 15 shares of Example 21A.

We have computed $\bar{x} = 6.8$. So
$$
\begin{aligned}
s^2 &= \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \\
    &= \frac{1}{15-1}\left[(3.3-6.8)^2 + (8.4-6.8)^2 + (10.7-6.8)^2 + \cdots + (8.2-6.8)^2 + (3.0-6.8)^2\right] \\
    &= \frac{1}{14}\left[(-3.5)^2 + (1.6)^2 + (3.9)^2 + \cdots + (1.4)^2 + (-3.8)^2\right] \\
    &= \frac{1}{14}\left[12.25 + 2.56 + 15.21 + \cdots + 1.96 + 14.44\right] \\
    &= \frac{1}{14} \times 75.62 \\
    &= 5.40
\end{aligned}
$$
The standard deviation is $s = \sqrt{5.40} = 2.32$.

The variance and the standard deviation are never negative. This is guaranteed,
because all the terms in the sum are squared, which makes them positive or zero, even though
some of the individual differences are negative. They are zero only when all the data values are equal.
The variance can be calculated more efficiently by a short-cut formula. The short
cut involves reducing the number of subtractions needed to calculate the variance from
n to 1. Examine the following steps carefully:
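$$
\begin{aligned}
(n-1)s^2 &= \sum_{i=1}^{n}(x_i - \bar{x})^2 \\
         &= \sum_{i=1}^{n}(x_i^2 - 2\bar{x}x_i + \bar{x}^2) \\
         &= \sum_{i=1}^{n}x_i^2 - \sum_{i=1}^{n}2\bar{x}x_i + \sum_{i=1}^{n}\bar{x}^2
\end{aligned}
$$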

The third term involves adding $\bar{x}^2$ to itself $n$ times. So it is equal to $n\bar{x}^2$. But $\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$, so
$$n\bar{x}^2 = \frac{1}{n}\Big(\sum_{i=1}^{n}x_i\Big)^2.$$

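The second term in the sum above can also be rewritten:
$$
\begin{aligned}
\sum_{i=1}^{n}2\bar{x}x_i &= 2\bar{x}\sum_{i=1}^{n}x_i \\
                          &= \frac{2}{n}\sum_{i=1}^{n}x_i\sum_{i=1}^{n}x_i \\
                          &= \frac{2}{n}\Big(\sum_{i=1}^{n}x_i\Big)^2
\end{aligned}
$$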


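Substituting these expressions for the second and third terms yields
$$
\begin{aligned}
(n-1)s^2 &= \sum_{i=1}^{n}x_i^2 - \frac{2}{n}\Big(\sum_{i=1}^{n}x_i\Big)^2 + \frac{1}{n}\Big(\sum_{i=1}^{n}x_i\Big)^2 \\
         &= \sum_{i=1}^{n}x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n}x_i\Big)^2.
\end{aligned}
$$
Thus the short-cut formula for $s^2$ is
$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n}x_i\Big)^2\right].$$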

Look carefully at this formula. There is now only one subtraction, whereas the
original formula involved n subtractions.
Example 24A: Calculate the sample variance of the dividend yields again, this time
using the short-cut formula.
We need $\sum_{i=1}^{n}x_i$, the sum of the data values, given by
$$\sum_{i=1}^{n}x_i = 3.3 + 8.4 + \cdots + 3.0 = 102.0;$$
and $\sum_{i=1}^{n}x_i^2$, the sum of squares of the data values, i.e. square them first, then add
them, like this:
$$\sum_{i=1}^{n}x_i^2 = 3.3^2 + 8.4^2 + \cdots + 3.0^2 = 769.22$$
Then
$$
\begin{aligned}
s^2 &= \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n}x_i\Big)^2\right] \\
    &= \frac{1}{14}\left[769.22 - \frac{1}{15}(102.0)^2\right] = 5.40
\end{aligned}
$$
as before.
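The short-cut formula needs only three numbers: n, the sum of the values and the sum of their squares. A small sketch follows (the function name `shortcut_variance` is hypothetical), checked against the totals quoted in Example 24A. One caution that goes beyond the text: on a computer, subtracting two large and nearly equal totals can lose accuracy, so the short cut is best suited to hand calculation or to data whose mean is not large relative to its spread.

```python
def shortcut_variance(n, sum_x, sum_x2):
    """Sample variance from n, the sum of the values and the sum of squares."""
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Totals quoted in Example 24A for the 15 dividend yields.
s2_dividends = shortcut_variance(15, 102.0, 769.22)
print(f"s^2 = {s2_dividends:.2f}")
```

Only one subtraction is performed, however many data values there are.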

If the data has a symmetric distribution with no outliers, then the standard deviation has the following approximate interpretation. The interval from one standard
deviation below the sample mean to one standard deviation above it, $(\bar{x} - s, \bar{x} + s)$,
should contain about two-thirds of the observations. Thus the sample mean and the
sample standard deviation together provide a two-number summary of the data set.
Many data sets are summarized by these two statistics: the sample mean provides a
measure of location and the sample standard deviation a measure of spread.
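The two-thirds rule of thumb is easy to check on a particular data set. This sketch counts how many of the waiting times of Exercise 1.3 fall inside the interval from one standard deviation below the mean to one standard deviation above it; the variable names are illustrative only.

```python
import statistics

# Waiting times (minutes) from Exercise 1.3.
waiting = [34, 24, 43, 56, 74, 45, 23, 43, 56, 67, 30, 19, 36,
           32, 65, 36, 24, 54, 39, 43, 67, 54, 32, 18, 97]

mean = statistics.mean(waiting)
s = statistics.stdev(waiting)        # sample standard deviation (divisor n - 1)

lower, upper = mean - s, mean + s
inside = [x for x in waiting if lower < x < upper]
proportion = len(inside) / len(waiting)

print(f"interval: ({lower:.1f}, {upper:.1f})")
print(f"{len(inside)} of {len(waiting)} values inside ({proportion:.0%})")
```

For these data the proportion comes out somewhat below two-thirds, which is not surprising: the rule is only approximate, and this sample contains one unusually long waiting time.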
However, the sample variance and the sample standard deviation have the disadvantage that, like the mean, they are sensitive to outliers. They are sensitive in two ways.
First of all, the outlier distorts the mean, so all the differences $(x_i - \bar{x})$ are misleading.
Secondly, if $x_j$, the $j$th data value, is an outlier, then the term $(x_j - \bar{x})$ will be large
relative to the other differences, and, once it is squared, it can make a disproportionately
large contribution to the sum of squared differences.
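A small numerical experiment makes this sensitivity concrete. The sketch below uses the twelve values of Exercise 1.5, first with the misplaced decimal point (131) and then with the value corrected to 13.1, and prints the median, mean, standard deviation and range for both versions. The helper name `summarise` is hypothetical.

```python
import statistics

def summarise(data):
    """Return (median, mean, standard deviation, range) of a list of numbers."""
    return (statistics.median(data),
            statistics.mean(data),
            statistics.stdev(data),
            max(data) - min(data))

# The twelve values of Exercise 1.5: the last one has a misplaced decimal point.
with_outlier = [10.8, 9.7, 14.1, 12.3, 10.9, 8.9, 11.7, 12.6, 11.2, 10.5, 8.3, 131]
corrected = with_outlier[:-1] + [13.1]

for label, data in (("with outlier", with_outlier), ("corrected", corrected)):
    median, mean, sd, rng = summarise(data)
    print(f"{label:13s} median = {median:.2f}  mean = {mean:.2f}  "
          f"s = {sd:.2f}  R = {rng:.1f}")
```

A single wild value leaves the median untouched but changes the mean, and changes the standard deviation and the range even more.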
Note that the intervals $(x_{(1)}, x_{(n)})$, $(\bar{x} - s, \bar{x} + s)$, and $(x_{(l)}, x_{(u)})$ cover 100%, approximately two-thirds (about 68%),
and exactly 50% of the observations, respectively. But it is not possible to
make direct comparisons between the range, the standard deviation and the interquartile range.


Example 25B: Calculate the sample means, sample standard deviations, medians,
interquartile ranges and ranges of the GMAT scores for each faculty category of Example 1A. Comment on the results.
For the sample mean and variance, we need the quantities $\sum_{i=1}^{n}x_i$ and $\sum_{i=1}^{n}x_i^2$. For
the category Engineering, we have
$$\sum_{i=1}^{n}x_i = 619 + 510 + \cdots + 710 = 16\,850$$
$$\sum_{i=1}^{n}x_i^2 = 610^2 + 510^2 + \cdots + 710^2 = 10\,254\,100$$
Then
$$\bar{x} = 16\,850/28 = 601.8$$
and
$$s^2 = \frac{1}{27}\left(10\,254\,100 - (16\,850)^2/28\right) = 4\,222.6.$$
The standard deviation is $s = \sqrt{4\,222.6} = 65.0$.

The remaining measures of spread and location can readily be obtained from the five-number summaries of Example 17B. For example, for Engineering, the lower and upper
quartiles were $x_{(l)} = 550$ and $x_{(u)} = 645$, and the interquartile range is $I = 645 - 550 = 95$.
The extremes were $x_{(1)} = 500$ and $x_{(n)} = 750$, so the range $R = 750 - 500 = 250$.
Calculation of these summary statistics for the remaining five categories yields the
table:

                    Location              Spread
First degree     mean    median      s        I       R
Engineering      601.8    600       65.0      95     250
Science          583.1    585       48.5      75     160
Arts             555.6    535       62.8      75     210
Commerce         567.0    555       45.7      60     140
Medicine         638.0    640       53.1     105     120
Other            581.7    585       80.1     100     220

(Here "mean" is the sample mean $\bar{x}$ and "median" is $x_{(m)}$.)
In commenting on this table, we look first at the measures of location. The sample
means show that students with first degrees in medicine had the highest mean GMAT
score (638.0), followed by engineering students (601.8), and then commerce (567.0).
The lowest mean was recorded for arts students (555.6). The medians follow the same
pattern, and apart from arts, the sample means and medians are relatively close. In the
box-and-whisker plots in Example 17B, we saw that the distribution of GMAT scores
for arts appeared to be strongly skewed to the right. Hence the difference between the
sample mean (555.6) and the median (535) for this category of students is consistent
with the earlier evidence of skewness.


For the measures of spread, it is evident that the category Other has the largest
standard deviation (80.1), followed by engineering (65.0). The smallest standard deviation was for commerce (45.7). The interquartile ranges (I) and ranges (R) follow
a broadly similar pattern. A plausible explanation as to why the category Other
should have the largest standard deviation (and the second largest interquartile range
and range) is that it encompasses a wide diversity of students, not falling into any of the
single faculty categories.
The conclusions reached here provide a partial description of this MBA class of
81 students. If they were representative of all MBA students at all universities, we
might be able to generalize the statements. Another worry that we would have before
we could generalize the results relates to issues of sample size. Could the differences
in the measures of location and spread we observed here occur just because we got an
unusually bright group of, for example, medical students in this MBA class? We will
defer further consideration of these statistical issues until chapter 8! In order to prepare
ourselves for taking that kind of decision we have to learn some probability theory.
Example 26C: Calculate the sample mean and standard deviation, the median, the
range and interquartile range of Paarl rainfall data (Example 20C).
Example 27C: (a) Suppose that the sample mean and standard deviation of the $n$
numbers $x_1, x_2, \ldots, x_n$ are $\bar{x}$ and $s$. An additional observation $x_{n+1}$ becomes available.
Show that the updated mean $\bar{x}^*$ is
$$\bar{x}^* = \frac{n\bar{x} + x_{n+1}}{n+1}$$
and the updated standard deviation $s^*$ is
$$s^* = \sqrt{\frac{1}{n}\left[(n-1)s^2 + n(\bar{x} - \bar{x}^*)^2 + (x_{n+1} - \bar{x}^*)^2\right]}.$$
(b) The sample mean of nine numbers is 4.8 with standard deviation 3.0. A 10th
observation is made. It is 6.8. Update the mean and the standard deviation.
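The updating formulas in part (a) can be checked numerically before attempting the algebra. In the sketch below the function name `update_mean_sd` is hypothetical; it applies the two formulas to the first 24 waiting times of Exercise 1.3 and then adds the 25th, so that the result can be compared with a direct calculation on all 25 values. It is deliberately not applied to the numbers in part (b).

```python
import math
import statistics

def update_mean_sd(n, mean, sd, new_value):
    """Update the mean and standard deviation of n values when one more arrives."""
    new_mean = (n * mean + new_value) / (n + 1)
    new_var = ((n - 1) * sd ** 2
               + n * (mean - new_mean) ** 2
               + (new_value - new_mean) ** 2) / n
    return new_mean, math.sqrt(new_var)

# Waiting times (minutes) from Exercise 1.3; the last value is treated
# as the additional observation.
waiting = [34, 24, 43, 56, 74, 45, 23, 43, 56, 67, 30, 19, 36,
           32, 65, 36, 24, 54, 39, 43, 67, 54, 32, 18, 97]
first, extra = waiting[:-1], waiting[-1]

upd_mean, upd_sd = update_mean_sd(len(first), statistics.mean(first),
                                  statistics.stdev(first), extra)

print(f"updated:  mean = {upd_mean:.2f}, s = {upd_sd:.2f}")
print(f"directly: mean = {statistics.mean(waiting):.2f}, "
      f"s = {statistics.stdev(waiting):.2f}")
```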

Exploratory data analysis . . .


The techniques we have learnt in this chapter have largely been aiming at getting a
feel for a sample of data, a process somewhat grandly called exploratory data analysis.
In the age of instant arithmetic and the personal computer, the temptation is to use the
statistical methods of chapters 8 to 12 and beyond, and to accept the answers uncritically.
We have seen in this chapter that the presence of one or more outliers in a sample can
have a pretty devastating effect on the sample mean and the sample standard deviation,
the most frequently used summary statistics of all. Likewise, we have seen how skewness
affects these statistics. Most statistical methods make a variety of assumptions; many
of these can be checked out, visually at least, by the exploratory data analysis techniques
described in this chapter. Many of these techniques have become part of the data analysis
software of statistical packages. You are strongly encouraged to use them before you
do more complex statistical analyses.


Solutions to examples . . .
2C The frequencies and relative frequencies, from which the bar graphs are constructed, are given in the table.
Major sector          (a) Frequency    (b) Percentage of
                                           All-Share Index
Coal                         2                  0.82%
Diamonds                     1                  8.27%
All-Gold                    17                 18.99%
Metals & minerals            8                  6.73%
Mining financial             6                 23.92%
Financial                   35                  7.41%
Industrial                  82                 33.86%
Total                      151                100%

[Bar graphs, with the sectors in the same order in both plots (Industrial, Financial, All-Gold, Metals & minerals, Mining financial, Coal, Diamonds): (a) Number of shares; (b) Percentage of All-Share Index.]
Keeping the ordering of the shares the same in both plots highlights the observation
that the All-Share Index does not give equal weighting to each share. Especially
striking is the large weighting that the Mining financial sector has in the All-Share
Index in relation to the small number of shares. In fact, the single share Anglo American had a weighting of 8.95% in the All-Share Index!

5C We chose a class interval of 10.


[Histogram: Number of students (vertical axis) against Examination mark (horizontal axis, 0 to 100).]
6C We used a class interval of 0.5.
[Histogram: Number of trees (vertical axis) against Height (m) (horizontal axis, 15 to 22).]
10C
stems   sorted leaves   count   cum. count
  2     7                  1         1
  3     9                  1         2
  4     3358               4         6
  5     0011333348        10        16
  6     1579               4        20
  7     1229               4        24
  8     5                  1        25

11C
stems   sorted leaves   count   cum. count
  1     78                 2         2
  2     01223              5         7
  2     67888              5        12
  3     00001123           8        20



12C
stems   sorted leaves                        count   cum. count
  0     28,36,38,38,48,49                       6         6
  0     55,60,60,60,75,75,78,85,90,94,98       12        18
  1     05,20,20,20,25,25,25,40,45,45          10        28
  1     50,50,58,65,76,85                       6        34
  2     00,00,45                                3        37

15C (a) (27, 50, 53, 67, 85)


(b) (17, 22.5, 28, 30, 33)
(c) (28, 60, 105, 145, 245)

20C
stems   sorted leaves                                   cum. count
  0     0.0,0.0,0.0,0.0,1.1,2.6,3.0,4.1,4.9,6.1,6.4         11
  1     0.6,5.8,6.3,7.8                                     15
  2     1.6,7.7                                             17
  3     7.8,9.7                                             19
  4                                                         19
  5     2.3                                                 20
  6                                                         20
  7                                                         20
  8                                                         20
  9                                                         20
 10     5.9                                                 21
 11                                                         21
 12                                                         21
 13                                                         21
 14     5.1                                                 22

(b) (0.0, 2.6, 8.5, 27.7, 145.1)

[Box-and-whisker plot of Rainfall (mm), 0 to 150: lower quartile 2.6 mm, median 8.5 mm, upper quartile 27.7 mm; the values for 1901 (145.1 mm) and 1903 (105.9 mm) are marked individually.]

26C $\bar{x} = 235.8$, $s = 365.7$, $x_m = 85$, $R = 1451$, $I = 251$.

Exercises . . .

1.1 As a cartoon strip matures, it is likely to change in subtle ways. In this exercise,
we want to look at the pattern of word usage in Schulz's Peanuts, comparing
the period 1959/60 with 1975/76. The tables below show the number of words per
cartoon strip in the two periods. Produce stem-and-leaf plots and find five-number
summaries for both periods. Draw side-by-side box-and-whisker plots, and discuss
how the number of words per cartoon strip has changed.
Number of words in 66 Peanuts cartoon strips from 1959/60, reprinted in You're
a winner, Charlie Brown.

51  35  35  30  41  52  55  38  41  32  49
44  55  42  40   5  15  27   0  16  28  14
23  26  45  63  58  28  37  59  24  35  43
46  43  53  35  34  51  47  43  22  45  23
30  28  36  29  76  40  32  59  32  49  29
39  26  34  35  60  41  47  40  46  45   4

Number of words in 66 Peanuts cartoon strips from 1975/76, reprinted in Let's
hear it for dinner, Snoopy.

37  40  44  44  49  35  34  29  35  37  33
39  49  39  35  35  52  47  45  44  52  28
22  34  36  52  50  39  32  21  20  17  22
32  46  63  39  60  40  30  29  17  30  42
30  45  38  43  30  47  28  39  45  45  32
35  45  41  18  20  33  47  17  40  37  19

1.2 If you are a salesperson, it is very easy to get pessimistic because on many days no
sales are made. The days when everything goes well keep you going. The daily sales
of a second-hand car salesperson are tabled below. Display them as a histogram.
Calculate also the sample mean, standard deviation and median. Is the median a
helpful measure of central tendency for this data?

0  0  1  0  0  0  1  0  0  2  0  1  3  0  1  2  0  0
0  4  0  0  1  0  0  2  0  0  3  0  0  1  0  1  1  1
0  0  0  1  1  0  1  0  1  2  0  0  0  4  0  1  0  2
0  0  1  1  0  3  0  5  0  1  0  0  1  2  0  1  0  0
0  0  1  1  0  0  0  0  1  5  2  0  2  0  0  2  0  1
0  3  1  1  6  0  0  0  0  2  1  3  3  0  0  0  1  0

1.3 Service is an important factor in making a business profitable. For a sample of
25 tables of customers, the manager of a restaurant kept track of how long they
waited from arrival to receipt of their main course. The following waiting times
(minutes) were recorded:
34 24 43 56 74 45 23 43 56 67 30 19 36
32 65 36 24 54 39 43 67 54 32 18 97

(a) Display the data as a stem-and-leaf plot, compute the five-number summary
and draw the box-and-whisker plot.
(b) Find the sample mean and standard deviation. What proportion of the data
values is within one standard deviation of the mean?
(c) Comment on the shape of the distribution of the data, and attempt to interpret it.
1.4 Water is a crucial resource in a generally arid country like South Africa. The
mean annual runoff (millions of m³) of 63 rivers in the old Cape Province of South
Africa is given in the table below. Find the five-number summary, and attempt
to draw the box-and-whisker plot. Take logarithms of each data value, and repeat
the exercise. Discuss the effect of the logarithmic transformation.


Mean annual runoff (millions of m³):

River         Runoff    River         Runoff    River              Runoff
Kei             1001    Kasuka             5    Groot Brak             29
Quko              41    Kariega           15    Klein Brak             45
Kwenxura          25    Bushmans          38    Hartenbos               5
Kwelera           32    Boknes            13    Gouritz               744
Gqunube           35    Sundays           29    Kafferkuils           141
Buffalo           82    Coega             13    Duiwenhoks            131
Goba               6    Swartkops         84    Bree                 1893
Gxulu              6    Yellowwoods       45    Heuningnes             78
Ncera              6    Gamtoos          485    Uilskraal              65
Tyolumnqa         25    Kabeljou          27    Kleinriviersvlei       96
Kiwane             2    Seekoei           27    Botriviersvlei        116
Keiskamma        133    Krom             105    Palmiet               310
Cqutywa            2    Klipdrif          35    Wildevoelvlei          38
Bira               6    Storms            69    Sout                   38
Mgwalana           5    Elandsbos         67    Diep                   43
Mtati              4    Keurbooms        160    Berg                  235
Mpekweni           2    Knysna           110    Verlorevlei           102
Fish             479    Goukamma          44    Wadrifsoutpan          19
Kleinemond         6    Swartvlei         73    Jakkals                10
Riet               4    Touw              30    Olifants             1217
Kowie             23    Kaaimans          59    Orange               9344

(Data from Noble and Hemens, S.A. National Scientific Programmes, Report No
34, 1976.)
1.5 This is an exercise in robustness! Calculate the median, interquartile range, range,
sample mean and sample standard deviation for the following 12 data values:
10.8 9.7 14.1 12.3 10.9 8.9 11.7 12.6 11.2 10.5 8.3 131
Which value looks suspiciously like an outlier? Put its decimal point back in
the right place, and recompute the summary statistics. Which of these statistics
change, and which remain the same? For which statistic is the percentage change
the largest? (The percentage change is the difference between the correct and
the biased values, divided by the correct value, multiplied by 100.)
1.6 Heathrow Airport in London is one of the world's busiest airports. The time of
touchdown for planes arriving between 17h30 and 19h30 on 17 October 1991 is
recorded (to the second) in the table below. Compute the inter-arrival times in
seconds and present appropriate summary statistics. What do you think the target
inter-arrival time is? Was there any apparent difference between the first and the
second hour of observation? How frequently did "glitches" (= irregularities) occur?
(The times are listed in order down the columns.)

17h30:07    17h59:36    18h29:51    19h01:41
   32:46    18h01:24       31:51       03:33
   34:14       03:10       34:04       06:10
   37:13       04:26       36:40       07:29
   38:56       05:47       37:52       09:25
   40:27       08:49       40:41       10:11
   41:37       10:27       42:23       12:29
   43:21       11:24       43:59       13:41
   44:24       12:51       45:20       15:33
   45:50       15:34       46:42       17:26
   47:10       16:52       48:38       19:14
   48:58       18:10       50:44       21:03
   50:03       19:24*      52:02       22:24
   51:49       21:48       53:44       24:15
   52:50       24:20       55:05       25:38
   54:32       25:41       56:44       26:59
   55:42       26:51       58:11    19h29:26
17h57:21    18h28:22    19h00:06

* This was the arrival of the supersonic jet Concorde.

1.7 You have a sample of data $x_1, x_2, \ldots, x_n$, with sample mean $\bar{x}$ and sample variance $s^2$.
(a) Show that $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$.
(b) Add some constant $b$ (i.e. add any number) to each of the $x_i$. How does this
change the sample mean and variance?
(c) Now multiply each of the $x_i$ by a constant $c$. How does this change the sample
mean and variance?
(d) Transform the $x_i$ to $y_i = c\,x_i + b$. What are the sample mean and variance
of $y_1, y_2, \ldots, y_n$?
1.8 During a year, a company had the following advertising expenditures: Magazine,
R11 000; Newspaper, R45 000; Pamphlets, R8 000; Radio, R34 000; Television,
R110 000; Miscellaneous, R5 000. Construct a bar graph that provides an effective
visual display of the breakdown of advertising expenditure.
1.9 Take any sets of data which interest you, and apply the exploratory data analysis
techniques of this chapter to them.


Solutions to exercises

1.1
1959/60
stems   sorted leaves            count   cum. count
  0     045                         3         3
  1     1456                        4         7
  2     233466788899               12        19
  3     0022244555556789           16        35
  4     00011123334555667799       20        55
  5     12355899                    8        63
  6     03                          2        65
  7     6                           1        66

1975/76
stems   sorted leaves                   count   cum. count
  0                                        0         0
  1     77789                              5         5
  2     01228899                           8        13
  3     000002223344555556777899999       27        40
  4     00012344455555677799              20        60
  5     0222                               4        64
  6     03                                 2        66
  7                                        0        66

The five-number summaries are (0, 28, 37.5, 46, 76) for the early period and (17,
30, 37, 45, 63) for the late period.
The box-and-whisker plots, plotted side-by-side, are:
[Side-by-side box-and-whisker plots of Number of words (vertical axis, 0 to 80). 1959/60: lower quartile 28, median 37.5, upper quartile 46. 1975/76: lower quartile 30, median 37, upper quartile 45.]


The medians are similar during both periods, but there appears to be less variability during 1975/76 than during 1959/60: both the range and the interquartile
range are shorter.
1.2
[Histogram: Number of days (vertical axis, 0 to 60) against Number of cars sold (horizontal axis).]

$\bar{x} = 0.82$, $s = 1.24$, $x_m = 0$.
No, the median is not much use here: more than half the data values are zero!
Even though the data is very skew, the mean is more interesting than the median.
1.3 (a) The stem-and-leaf plot is

stems   sorted leaves   count   cum. count
  1     89                 2         2
  2     344                3         5
  3     0224669            7        12
  4     3335               4        16
  5     4466               4        20
  6     577                3        23
  7     4                  1        24
  8                        0        24
  9     7                  1        25

Five-number summary (18, 32, 43, 56, 97). The value of 97 is a stray.

[Box-and-whisker plot of Time (mins), 0 to 100: lower quartile 32, median 43, upper quartile 56; the stray at 97 is annotated "97 minutes to wait for a meal!"]
(b) $\bar{x} = 44.4$, $s = 19.3$. 15 out of 25 observations (60%) lie within one standard
deviation of the mean, i.e. in the interval (25.1, 63.7).
(c) Apart from the stray (which should be investigated), the data show relatively
little skewness to the right. In the context, the interquartile range (24 minutes) is probably wider than desirable, and efforts should be made to make
service times more consistent.
1.4 Five-number summary (2, 13, 38, 105, 9344), strays greater than 239, outliers
greater than 440. On taking logarithms (base 10), the five-number summary is
(0.30, 1.11, 1.58, 2.02, 3.97), strays exceed 2.9, and there are no outliers.
[Two box-and-whisker plots of Runoff (m³ × 10⁶). Left: original scale, 0 to 800, with Orange (9344), Bree (1893), Olifants (1217) and Kei (1001) shown off the scale, and Gouritz (744), Gamtoos (485), Fish (479) and Palmiet (310) marked individually. Right: logarithmic scale, 5 to 10000, with Orange (9344), Bree (1893), Olifants (1217) and Kei (1001) marked individually. Both plots mark the lower quartile, the median (38) and the upper quartile (105).]


In the plot on the left, the runoffs from four rivers cannot be plotted to scale.
The inner scale on the plot on the right shows logarithms to base 10. Notice how
dramatic the effect is: always look at scales on plots to see if data has been
transformed.
1.5 With the outlier (131), $x_m = 11.05$, $I = 2.35$, $R = 122.7$, $\bar{x} = 21.00$, and $s = 34.68$.
With the outlier corrected (13.1), $x_m = 11.05$, $I = 2.35$, $R = 5.8$, $\bar{x} = 11.18$, and
$s = 1.71$. The median and the interquartile range are unchanged. The outlier affects the range, standard deviation and sample mean (in
that order).
1.8 The bar graph is most effective if the expenditures are arranged in order:
[Horizontal bar graph: Percentage of Advertising Expenditure, with an axis from R25 000 to R100 000 and the bars in decreasing order.]

Television       R110 000   (52%)
Newspaper        R45 000    (21%)
Radio            R34 000    (16%)
Magazines        R11 000    (5%)
Pamphlets        R8 000     (4%)
Miscellaneous    R5 000     (2%)

Chapter 2
SET THEORY

KEYWORDS:
Set, subset, intersection, union, complement,
empty and universal sets, mutually exclusive sets; pairwise mutually
exclusive and exhaustive sets.

Why do we have to do set theory? . . .

Simply because one of Murphy's Laws states that "before you can do anything, you
have to do something else." Before we can do statistics we have to do probability
theory, and for that we need some set theory. So here we go.

Definition of sets . . .
We define a set A to be a collection of distinguishable objects or entities. The set
A is determined when we can either (a) list the objects that belong to A or (b) give a
rule by which we can decide whether or not a given object belongs to A.
Example 1A: (a) If we say, "The letters e, f, g belong to the set A", then we write
$$A = \{e, f, g\}$$
(b) If we say, "The set B consists of real numbers between 1 and 10 inclusive", then
we write
$$B = \{x \mid 1 \le x \le 10\}.$$
We read this by saying: "The set B consists of all real numbers x such that x is larger
than or equal to 1 but is less than or equal to 10."
Because the object e belongs to the set A we write
$$e \in A$$
and we say: "e is an element of A". Because e does not belong to B, we write
$$e \notin B$$
and we say: "e is not an element of B".


Note, firstly, that if C = {1, 3, 5, a} and D = {a, 1, 5, 3} then C = D. The order
in which we list the elements of a set is irrelevant. Secondly, if E = {a, b, c, a} and
F = {a, b, c} then E = F . The set E contains only the distinguishable elements a, b
and c.
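As a small illustration, assuming Python's built-in set type is an acceptable stand-in for the finite sets above, the two notes (order is irrelevant, repeated elements count once) can be checked directly. The helper name in_B is hypothetical and simply encodes the rule that defines B.

```python
# Finite sets from the text; Python sets ignore order and repetition.
C = {1, 3, 5, "a"}
D = {"a", 1, 5, 3}
E = {"a", "b", "c", "a"}
F = {"a", "b", "c"}

print(C == D)    # the order of listing is irrelevant
print(E == F)    # the repeated "a" counts only once
print(len(E))    # number of distinguishable elements in E

# A set given by a list, and a set given by a rule.
A = {"e", "f", "g"}

def in_B(x):
    """Rule for B = {x | 1 <= x <= 10}; x is assumed to be a real number."""
    return 1 <= x <= 10

print("e" in A)            # e is an element of A
print(in_B(7.5), in_B(11))
```

A set defined by a rule over the real numbers cannot be listed, so it is represented here by a membership test rather than by a Python set.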

Example 2B:
(a) Express in set theory notation: the set U of numbers which have square roots
between 1 and 4.
(b) Write out in full all the elements of the set $Z = \{(x, y) \mid x \in \{1, 2, 3, 4\},\ y = x^2\}$.
(a) Because the numbers whose square roots lie between 1 and 4 are the numbers between 1 and 16, we write
$U = \{x \mid 1 \le x \le 16\}$.
(b) Z = {(1, 1), (2, 4), (3, 9), (4, 16)}.
Example 3C: Which of the following statements are correct and which are wrong?
(a) {3, 3, 3, 3} = {3}
(b) $6 \in \{5, 6, 7\}$
(c) C = {1, 0, 1}
(d) F = {x | 4 < f < 5}
(e) {1, 2, 7} = {7, 2, 1, 7}
(f) If H = {2, 4, 6, 8}, J = {1, 2, 3, 4} and $K = \{2x \mid x \in H\}$, then K = J
(g) $\{1\} \in \{1, 2, 3\}$.

Subsets . . .
Suppose we have two sets, G and H, and that every element of G also belongs to
H. Then we say that G is a subset of H and we write $G \subset H$. We can also write
$H \supset G$ and say "H contains G". If at least one element of G does not belong to H, we
write $G \not\subset H$ and say "G is not a subset of H".
Example 4A: Let G = {1, 3, 5}, H = {1, 3, 5, 9} and J = {1, 2, 3, 4, 5}. Then
clearly $G \subset H$, $H \not\subset J$, $J \supset G$.
Note that the notation $\subset$, $\supset$ for sets is analogous to the notation $\le$, $\ge$ for ordinary
numbers (rather than the notation <, >). The round end of the subset notation tells
you which of the sets is "smaller" (in the same way as the pointed end shows which
of two numbers is smaller).
Our definition of subset has a curious (at first sight) but logical consequence. Because
every element in G belongs to G, we can write $G \subset G$. For numbers, we can write $2 \le 2$.
If $H \subset G$ and $G \subset H$, then, obviously, H = G. For numbers, $x \le 2$ and $x \ge 2$
together imply that x = 2.
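The subset relations of Example 4A can be checked mechanically. This is only a sketch using Python's built-in sets, where the operator <= plays the role of the subset symbol (it also allows equality, exactly as described above).

```python
G = {1, 3, 5}
H = {1, 3, 5, 9}
J = {1, 2, 3, 4, 5}

print(G <= H)   # G is a subset of H
print(H <= J)   # False: 9 belongs to H but not to J
print(J >= G)   # J contains G
print(G <= G)   # every set is a subset of itself

# Mutual inclusion is the same thing as equality.
print((G <= H and H <= G) == (G == H))
```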
Example 5C: Let V = {v | 0 < v < 5}, W = {0, 5}, X = {1, 2, 3, 4}, Y = {2, 4},
$Z = \{x \mid 1 \le x \le 4\}$. Which of the following statements are true, and which are false:
(a) V = W
(b) $Y \subset X$
(c) $W \subset V$
(d) $Z \supset X$
(e) X = Z
(f) $Z \not\subset V$
(g) $Y \subset W$
(h) $Y \supset Z$


Intersections . . .
Suppose that L = {a, b, c} and M = {b, c, d}. Then $L \not\subset M$ and $M \not\subset L$. But if
we consider the set N = {b, c}, then we see that $N \subset L$ and $N \subset M$, and that no other
set of which N is a subset has this property. This leads us to the idea of intersection.
The intersection of any two sets is the set that contains precisely those elements
which belong to both sets. For the sets L, M and N above we write $N = L \cap M$ and
read this "N equals L intersection M". In other words, an element belongs to $L \cap M$
only if it belongs both to L and to M.
Example 6A: If $P = \{x \mid 0 \le x \le 10\}$ and Q = {x | 5 < x < 20}, find $P \cap Q$. Is
$5 \in P \cap Q$? Is $10 \in P \cap Q$?
Paying careful attention to the endpoints,
$P \cap Q = \{x \mid 5 < x \le 10\}$.
No, $5 \notin P \cap Q$, but, yes, $10 \in P \cap Q$.
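A short sketch of both cases in Python: finite sets use the built-in intersection operator, while the intervals of Example 6A are represented by membership rules. The function names in_P, in_Q and in_P_and_Q are hypothetical helpers introduced only for this illustration.

```python
# Finite sets: intersection keeps the elements common to both.
L = {"a", "b", "c"}
M = {"b", "c", "d"}
N = L & M
print(N == {"b", "c"})

# Intervals of real numbers: test membership by rule.
def in_P(x):
    return 0 <= x <= 10      # P includes both endpoints

def in_Q(x):
    return 5 < x < 20        # Q excludes both endpoints

def in_P_and_Q(x):
    return in_P(x) and in_Q(x)

print(in_P_and_Q(5))     # False: 5 is excluded from Q
print(in_P_and_Q(10))    # True: 10 belongs to both P and Q
```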

The empty set, mutually exclusive sets . . .


What happens if L = {a, b, c} and R = {d, e, f}? If we want $L \cap R$ to be a set,
then we must introduce a new concept, the empty set, the set that has no members.
This is a sensible concept: consider the set of English-speaking fish, or consider the set
of real numbers whose square is negative. We reserve the symbol $\emptyset$ to denote the empty
set. We use this symbol for no other purpose. We write $L \cap R = \emptyset$ and read this as "the
intersection of sets L and R is the empty set".
Pairs of sets whose intersection is the empty set are said to be mutually exclusive
sets (or disjoint sets). Thus L and R are mutually exclusive.

The universal set, the sample space . . .


Another reserved symbol is the letter S. It is used for the set containing all objects
under consideration. Thus if, in a particular problem, the only objects of interest are
the colours of a traffic light, then S = {red, amber, green}. The set S is known to
mathematicians as the universal set. In statistical jargon the set S is called the sample
space.

Unions . . .
The concept "union" contrasts with the concept "intersection". The union of two sets
A and B is the set that contains the elements that belong to A or to B. Here we use
the word "or" in an inclusive sense: we do not exclude from the union those elements
that belong to both A and B.
If A = {1, 2, 3} and B = {2, 3, 4, 5} then the union of A and B is the set
C = {1, 2, 3, 4, 5}. We write
$C = A \cup B$
and say "C equals A union B".
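The same union can be formed with Python's built-in sets. This is an illustrative sketch only; it also confirms the inclusive meaning of "or", since the elements common to A and B are part of the union.

```python
A = {1, 2, 3}
B = {2, 3, 4, 5}

C = A | B                 # union: elements in A or in B (or in both)
print(sorted(C))          # [1, 2, 3, 4, 5]

# Inclusive "or": the common elements 2 and 3 are not excluded.
print((A & B) <= C)       # True
```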


Example 7A: If $P = \{x \mid 0 \le x \le 10\}$ and Q = {x | 5 < x < 20}, find $P \cup Q$.
The union includes every element that belongs to P or to Q. The endpoint 0 belongs to P, but 20 belongs to neither set:
$P \cup Q = \{x \mid 0 \le x < 20\}$.

Complements . . .
Our final concept from set theory is that of the complement of a set. Given the
sample space S, we define the complement of a set A to be the set of elements of S
which are not in A. The complement of A is written $\overline{A}$, and is always relative to the
sample space S.
If S = {1, 2, 3, 4, 5, 6}, A = {1, 3, 5} and B = {2, 4, 6} then $\overline{A}$ = {2, 4, 6}.
We write
$\overline{A} = B$
and say "the complement of A equals B" or, more briefly, "A complement equals B".
Example 8A: If $S = \{x \mid 0 \le x \le 1\}$ and D = {x | 0 < x < 1}, find $\overline{D}$.
Because the set D excludes the endpoints of the interval from zero to one, $\overline{D}$ = {0, 1}.
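For finite sets the complement is simply a set difference taken relative to S. The sketch below uses a hypothetical helper named complement; it is an illustration, not notation used in the text.

```python
S = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5}
B = {2, 4, 6}

def complement(X, sample_space):
    """Elements of the sample space that are not in X."""
    return sample_space - X

print(complement(A, S) == B)                  # A complement equals B
print(complement(complement(A, S), S) == A)   # complementing twice returns A
print(complement(S, S) == set())              # the complement of S is empty
```

Passing the sample space explicitly reflects the point made above: a complement only has meaning relative to S.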
Example 9C: If the sample space S contains the letters of the alphabet, i.e. S =
{a, b, c, . . . , x, y, z}, the set A contains the vowels, the set B contains the consonants,
the set C contains the first 10 letters of the alphabet C = {a, b, c, . . . , h, i, j} pick out
the true and false statements in the following list:
(a) A B = S
(g) S B = B
(b) A B =
(h) A A = S
(c) S S
(i) C A = {o, u}
(d) A C = {a, e, i}
(j) (A C) = A C
(e) A B
(k) A C C
(l) S =
(f) A = B

Venn diagrams . . .
A pictorial representation of sets that helps us solve many problems in set theory is
known as the Venn diagram. In the diagrams below think of all the points in the
rectangle as being the sample space S, and all the points inside the circles for A and
B as the sets A and B respectively. The shaded area in the diagram on the left then
represents $A \cap B$, the set of points belonging to A and B. Similarly the diagram on the
right is a visual representation of $A \cup B$, the set of points belonging to A or B. Recall
once again the special, inclusive meaning we give to "or". When drawing Venn diagrams
it is helpful to associate intersection with "and" and union with "or".



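[Figure: two Venn diagrams, each showing circles A and B inside the rectangle S. Left: $A \cap B$ shaded. Right: $A \cup B$ shaded.]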

The diagram on the left below shows how to depict two mutually exclusive sets in a
Venn diagram.
Venn diagrams are usually only useful for up to three sets: the area shaded in the
diagram on the right is A B C.
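[Figure: two Venn diagrams in the sample space S. Left: two mutually exclusive sets A and B, labelled $A \cap B = \emptyset$. Right: three overlapping sets A, B and C, with one region shaded.]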

Pairwise mutually exclusive, exhaustive sets . . .


If a family of sets $A_1, A_2, \ldots, A_n$ is such that any pair of them is mutually
exclusive, i.e. $A_i \cap A_j = \emptyset$ if $i \ne j$, and if $A_1 \cup A_2 \cup \ldots \cup A_n = S$, i.e. the union of the
sets exhausts the sample space, then the family of sets $A_1, A_2, \ldots, A_n$ is said to be
pairwise mutually exclusive and exhaustive. If we represent such a family of sets
on a Venn diagram, the sets must cover the sample space, and they must be disjoint.
Here are two examples:
[Figure: two Venn diagrams of pairwise mutually exclusive and exhaustive families. Left: the sample space divided into adjacent regions labelled $A_1$, $A_2$, $A_3$, $A_4$, ..., $A_{n-1}$, $A_n$. Right: the sample space divided into nine irregular regions numbered 1 to 9.]
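The two conditions in the definition can be tested directly for finite sets. The following is a minimal sketch; the function name is_partition and the example families are assumptions made for illustration.

```python
from itertools import combinations

def is_partition(family, sample_space):
    """True if the sets are pairwise mutually exclusive and exhaustive."""
    # Every pair of distinct sets must have an empty intersection.
    pairwise_disjoint = all(not (X & Y) for X, Y in combinations(family, 2))
    # The union of all the sets must equal the sample space.
    exhaustive = set().union(*family) == sample_space
    return pairwise_disjoint and exhaustive

S = {1, 2, 3, 4, 5, 6}
print(is_partition([{1, 2}, {3, 4}, {5, 6}], S))     # True
print(is_partition([{1, 2, 3}, {3, 4}, {5, 6}], S))  # False: 3 is shared
print(is_partition([{1, 2}, {3, 4}], S))             # False: 5 and 6 are missed
```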


Using Venn diagrams . . .


Example 10B: Draw Venn diagrams to show that $(A \cup B) \cap C = (A \cap C) \cup (B \cap C)$.
In the left-hand Venn diagram, the diagonally shaded area is $A \cup B$, and the
vertically shaded area is C. Their intersection $(A \cup B) \cap C$ is shaded in both directions.
In the right-hand Venn diagram, the two shaded areas are $A \cap C$ and $B \cap C$. Their union
is the same as the area shaded in both directions in the left-hand diagram.
[Figure: two Venn diagrams of sets A, B and C. Left: $(A \cup B) \cap C$. Right: $(A \cap C) \cup (B \cap C)$.]
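The identity can also be checked by brute force on a small sample space. This sketch tries every choice of A, B and C among the subsets of a three-element set (a hypothetical example); such a check illustrates the law but is not a substitute for the Venn diagram argument.

```python
from itertools import combinations

def subsets(sample_space):
    """List every subset of a finite set."""
    items = sorted(sample_space)
    return [set(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

S = {1, 2, 3}
all_sets = subsets(S)

# Test (A u B) n C == (A n C) u (B n C) for every triple of subsets.
law_holds = all((A | B) & C == (A & C) | (B & C)
                for A in all_sets for B in all_sets for C in all_sets)

print(len(all_sets), law_holds)   # 8 True
```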

Example 11C: Draw Venn diagrams to show that the following are true:
(a) A B = A (B A)
(b) (A C) (B A) = (A C) (A B)
(c) The sets A B, A C, B A and (A C) form a family of pairwise mutually
exclusive and exhaustive sets.
Example 12C: Draw Venn diagrams to determine which of the following statements
are true.
(a)
(b)
(c)
(d)
(e)
(f)

(A B) = A B
(A B) (A B) A B
(A B) C = (A C) (B C)
(C A) (C B) = (C (A B))
[(A B) C] [(A C) B] = [(A B C) (A B) C] (A B)
If the sets A1 , A2 , A3 , and A4 are pairwise mutually exclusive and exhaustive, and
B is an arbitrary set, then
B = (A1 B) (A2 B) (A3 B) (A4 B).

Solutions to examples . . .
3C (a), (b), (c) and (e) are correct; (d) should read either F = {x | 4 < x < 5} or
F = {f | 4 < f < 5}. For (f), check that the following statement is correct: if H
and J are as given, and if $K = \{2x \mid x \in J\}$ then K = H. For (g), note that we
never use the $\in$-notation with a set on the left hand side.
5C Only (b) and (d) are true.
9C All are true.



11C All are true.
12C (b) (c) (e) and (f) are true. For (a), check that (A B) = A B is true.

Easy exercises . . .
2.1 Let S be {1, 2, 3, 4, 5, 6}, the set of all possible outcomes when a die is thrown
and the number of dots on the uppermost face recorded. Describe in words the
following sets:
(a) {6}
(b) {1, 2, 3, 4}
(c) {1, 3, 5}
(d) {2, 4, 6}
(e) {5, 6}
(f) $\overline{\{6\}}$
2.2

If S = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, and A = {0, 1, 2}, B = {3, 4, 5, 6, 7},


C = {7, 8}, and D = {2, 4, 6, 8}, which of the following statements are true?
(a)
(b)
(c)
(d)
(e)
(f)

A and B are mutually exclusive


B = {0, 1, 2, 8, 9}
A B C D = {0, 1, 2, 3, 4, 5, 6, 7, 8}
D (B C)
A B C D = {9}
A (B D) = (A B) (A D).

2.3 Let S denote the set of all companies listed on the Johannesburg Stock Exchange.
Let A = {x | x is in the gold mining sector},

let B = {x | x has annual turnover exceeding R10 million},


let C = {x | x has financial year ending in June},

let D = {x | the share price of x is higher now than six months ago}.

Describe in words the following sets:
(a) $A \cup B$
(b) $A \cap D$
(c) $A \cap C \cap D$
(d) $B \cap (C \cup A)$
(e) $\overline{A}$
(f) $\overline{C} \cup D$
(g) $\overline{B} \cup \overline{C}$
(h) $\overline{(B \cap A)} \cup (C \cap D)$.

2.4 If A, B and C are subsets of a universal set S, draw Venn diagrams to determine
which of the following statements are true.
(a)
(b)
(c)
(d)
(e)
(f)
(g)
(h)
(i)

AA=S
AA=
AB =AB
AB =AB
A (B C) = (A B) (A C)
A (B C) = (A B) (A C)
ABC =S
ABC =
AB AB

(j) A (B C) A (B C)

2.5 If S = {1, 2, 3}, list all the subsets of S.


2.6

Draw a series of Venn diagrams representing three sets, and shade in the following
areas.
(a)
(b)
(c)
(d)

ABC
(B A) (A C)
ABC
(A B C) (B C).

More difficult exercises . . .


2.7 Let $B_1, B_2, \ldots, B_n$ be n disjoint subsets of S such that
$\bigcup_{i=1}^{n} B_i = S$ and $B_i \cap B_j = \emptyset$ for $i \ne j$.
Let A be any other subset of S. Use a Venn diagram to show that
$A = \bigcup_{i=1}^{n} (A \cap B_i)$.
[Notation: $\bigcup_{i=1}^{n} B_i$ means $B_1 \cup B_2 \cup \ldots \cup B_n$.]
2.8 Show that if the set S has n elements, then S has $2^n$ subsets. [Hint: Use the
binomial theorem.]
2.9 Let A and B be two events defined on a sample space S. Depict the following
events in Venn diagrams:
(a) C = (A B) (A B)
(b) D = (A B) (A B)
(c) What can you say about events C and D?

Solutions to exercises . . .
2.1 (a) The number six is obtained.
(b) A number less than or equal to four is obtained.
(c) An odd number is obtained.
(d) An even number is obtained.
(e) A number greater than or equal to 5 is obtained.
(f) A number other than 6 is obtained.

2.2 All are true except (d) and (f).


2.3 (a) Set of companies either in the gold mining sector or with turnovers exceeding
R10 million.


(b) Set of gold mining companies whose share price is higher now than six months
ago.
(c) Set of gold mining companies with a financial year ending in June whose share
price is higher now than six months ago.
(d) Set of all companies which have an annual turnover exceeding R10 million
and which are either gold mining companies or companies with financial years
ending in June (or both).
(e) Set of companies not in the gold mining sector.
(f) Set of companies which either do not have a financial year ending in June or
have a share price which is higher now than six months ago.
(g) Set of companies which either do not have an annual turnover exceeding R10
million or do not have financial year ending in June.
(h) Set of companies which either do not have an annual turnover exceeding R10
million or are not in the gold mining sector or both have a financial year
ending in June and have a share price which is higher now than six months
ago.
(Notice how difficult it is to express unambiguously in words the meaning of a few
mathematical symbols.)
2.4 All are true, except (g) and (h).
2.5 $\emptyset$, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}.
2.9 (c) C and D are mutually exclusive.


Chapter 3

PROBABILITY THEORY
KEYWORDS:
Random experiments, sample space, events,
elementary events, certain and impossible events, mutually exclusive events, probability, relative frequency, Kolmogorov's
axioms, permutations, combinations, conditional probability,
Bayes' theorem, independent events.

New wine in old wineskins . . .


In the mathematical sciences, in contrast to most other disciplines, we prefer not
to coin new words for new concepts. We rather prefer to give new meanings to old
words. In this chapter we ask you to put aside your intuitive ideas of what constitutes
an experiment or an event and replace them with the new meanings statisticians
have given them.

Random experiments, sample spaces, trials . . .


To statisticians, a random experiment is a procedure whose outcome in a particular performance or trial cannot be predetermined. Although we cannot foretell what
the outcome of any single repetition of the experiment will be, we must be able to list
the set of all possible outcomes of the experiment. In general, random experiments must
be capable, in theory at least, of indefinite repetition. It must also be possible to observe
the outcome of each repetition of the experiment. The set of all possible outcomes of a
random experiment is called the sample space of the random experiment. We usually
use the letter S to denote the sample space. Each repetition of the procedure for the
random experiment is called a trial, and gives rise to one and only one of the possible
outcomes.
Example 1A: The following are examples of random experiments and their sample
spaces.
(a) We toss a coin. We can list the set of possible outcomes: S = {heads, tails}. We
can repeat the experiment endlessly, and we can observe the result of every trial.

(b) A phone number is chosen at random. The number is dialled, and the person who
answers is asked whether he/she is currently watching television. If the telephone is
unanswered after 45 seconds, the outcome "no reply" is recorded. The set of possible outcomes, the sample space, is S = {yes, no, won't say, number engaged, no reply}.
(c) A light bulb is allowed to burn until it burns out. The lifetime of the bulb is
recorded. The possible outcomes are the set of non-negative real numbers (i.e. the
set of positive numbers plus zero: the bulb might not burn at all). The sample
space is thus $S = \{t \mid t \ge 0\}$.
(d) A die is placed in a shaker, which is agitated violently, and thrown out onto
the table. The dots on the upturned face are counted. The sample space is
S = {1, 2, 3, 4, 5, 6}.
(e) In a survey of traffic passing a particular point on Boulevard East, a time period
of one minute is chosen at random, and the number of vehicles that pass the point
in the minute is counted. The possible outcomes are the non-negative integers:
S = {0, 1, 2, 3, . . .}.
(f) A geologist takes rock samples in a mine in order to determine the quality of the
iron-ore to be mined. The analytical laboratory reports the proportion of iron in
the ore. The sample space is $S = \{p \mid 0 \le p \le 1\}$.

Events . . .
An event is defined to be any subset of the sample space S. Thus if S =
{1, 2, 3, 4, 5, 6} then the sets A = {1, 3, 5} and C = {3, 4, 5, 6} are events. A is
clearly the event of getting an odd number. C is the event of getting a number greater
than or equal to 3.
The empty set is the set containing no elements. It is often denoted by {} or by $\emptyset$.
By the definition of subsets, $\emptyset$ and S are both subsets of S and are thus events.
They have special names. The event $\emptyset$ is called the impossible event, and S is called
the certain event.
An elementary event is an event with exactly one member. The events D = {3}
and E = {5} are elementary events. A and C are not elementary events, nor are $\emptyset$ and
S.
Given two events A and B we say that A and B are mutually exclusive if $A \cap B = \emptyset$,
that is if they have no elementary events in common, e.g. if A = {1, 3, 5} and
B = {2, 4, 6} then $A \cap B = \emptyset$. This makes sense, because A is the event of getting an
odd number, B is the event of getting an even number and obviously you cannot get
an odd number and an even number on the same throw of a die.

When does an event occur? . . .
We give a special meaning to the concept "the occurrence of an event". We say
that the event A = {1, 3, 5} has occurred if the outcome of a trial is any member of
A. So if we toss a die and get a 3, then we can say that the event A has occurred: we
have obtained an odd number. Simultaneously, the events C and D, as defined above,
have also occurred.
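In code, "the event has occurred" is just a membership test on the outcome of the trial. The sketch below uses the die events A, C, D and E defined earlier in this chapter; the dictionary of named events is an assumption made for the illustration.

```python
# Events on the sample space S = {1, 2, 3, 4, 5, 6}.
events = {
    "A": {1, 3, 5},       # an odd number
    "C": {3, 4, 5, 6},    # a number greater than or equal to 3
    "D": {3},             # elementary event
    "E": {5},             # elementary event
}

outcome = 3   # the result of one trial

# An event occurs if the outcome is one of its members.
occurred = [name for name, event in events.items() if outcome in event]
print(occurred)   # ['A', 'C', 'D']
```

A single outcome makes several events occur at once, but only one elementary event.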


Example 2B: What is the sample space for each of the following random experiments?
(a) A game of squash is played and the score at the end of the first set is noted.
(b) We record the way in which a batsman in cricket ends his innings.
(c) An investor owns two shares which she monitors for a month. At the end of the
month she records whether they went up, down or remained unchanged.
(a) Squash is played to 9 points, with deuce at 8-all, in which case the player who
reached 8 first decides whether to play to 9 points or to 10 points. Thus the sample
space is S = {9-0, 9-1, 9-2, 9-3, 9-4, 9-5, 9-6, 9-7, 9-8, 10-8, 10-9}.
(b) The set of ways in which a batsman's innings can end is given by S = {bowled,
caught, leg before wicket, run out, stumped, hit wicket, not out, retired, retired
hurt, obstruction, timed out}.
(c) It is convenient to let U = up, D = down and N = no change. Then S = {UU,
UD, UN, DU, DD, DN, NU, ND, NN}, where, for example, DU means "first share
down, second share up".
Notice how we construct the most detailed possible sample space. The set of
outcomes {both up, one up & one down, one up & one unchanged, one down & one
unchanged, both unchanged, both down} is not acceptable because several of these outcomes arise from more than one elementary event. For example, both elementary
events UD and DU give rise to "one up & one down".
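The detailed sample space for the two shares is a Cartesian product, which can be generated rather than listed by hand. This sketch assumes the letters U, D and N as above, and groups the elementary events by their unordered content to show why the coarser list loses information.

```python
from itertools import product

# Every ordered pair (first share, second share).
S = ["".join(pair) for pair in product("UDN", repeat=2)]
print(S)   # ['UU', 'UD', 'UN', 'DU', 'DD', 'DN', 'NU', 'ND', 'NN']

# Group the elementary events that the coarse description cannot tell apart.
coarse = {}
for s in S:
    key = "".join(sorted(s))        # ignores which share moved
    coarse.setdefault(key, []).append(s)

print(len(S), len(coarse))   # 9 6
print(coarse["DU"])          # ['UD', 'DU']
```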
Example 3C: A random experiment consists of tossing 3 coins of values R1, R2 and
R5 and observing heads and tails. Which of the following is the correct sample space?
(a) S = {3 heads,2 heads,1 head,0 heads}.
(b) S = {3 heads,2 heads 1 tail,1 head 2 tails,3 tails}.
(c) S = {HHH,HHT,HTH,THH,HTT,THT,TTH,TTT} where, for example, HTH
means heads on R1, tails on R2 and heads on R5.
Example 4B: Refer to the random experiments (a) to (c) of Example 2B, and give the
subsets of S that correspond to the following events.
(a) (i) The squash game is won by 5 or more points.
(ii) The game goes to deuce.

(b) When the batsman ended his innings the bails were dislodged from the wickets.
(c) None of the shares decline.
In each case we simply list the elementary events that favour the occurrence of the
event in which we are interested. The answers are:
(a) (i) {9-0, 9-1, 9-2, 9-3, 9-4}
(ii) {9-8, 10-8, 10-9}
(b) {bowled, run out, stumped, hit wicket}
(c) {UU, UN, NU, NN}.
Example 5C: A salesperson, after calling on a client, records the outcome: sale made
(S), or no sale made (N ). List the sample space of outcomes in one afternoon if
(a) two clients are visited
(b) three clients are visited.


(c) Suppose now that three outcomes are recorded: sale made (S), sales potential good
(P ), no sale ever likely to be made (N ). List the sample space if two clients are
visited.
Example 6C. A party of five hikers, three males and two females, walk along a mountain trail in single file.
(a) What is the sample space S?
(b) Find the subsets of S that correspond to the events:
U : a female is in the lead

V : a male is bringing up the rear

W : females are in the second and fourth positions.

(c) Find the subsets of S that correspond to U , U W , V W , and U V .

Kolmogorov, father of probability . . .


Andrey Nikolaevich Kolmogorov was a Russian mathematician who, in 1933, published the axioms of probability, and established the theoretical foundation for the rigorous mathematical study of probability theory.
KOLMOGOROV'S AXIOMS OF PROBABILITY
Suppose that S is the sample space for a random experiment. For all
events $A \subset S$, we define the probability of A, denoted Pr(A), to be a
real number with the following properties:
1. $0 \le \Pr(A) \le 1$ for all $A \subset S$
2. $\Pr(S) = 1$ and $\Pr(\emptyset) = 0$
3. If $A \cap B = \emptyset$ (i.e. if A and B are mutually exclusive events) then
$\Pr(A \cup B) = \Pr(A) + \Pr(B)$.
The function Pr(A) provides a means of attaching probabilities to events in S. The
first two axioms tell us that probabilities lie between zero and one, and that these
extreme probabilities occur for the impossible and certain events, respectively. The
probabilities of all other events are graded between these two extremes: unlikely events
have probabilities close to zero, and events which are nearly certain have probabilities
close to one. If an event is as likely to occur as it is not to occur, then it has probability
0.5. Thus for an unbiased coin, for which heads and tails are equally likely, the
probability of the event "heads" is equal to the probability of the event "tails" is equal
to 0.5!
This function Pr(A) is almost certainly a new kind of function to you. The functions
you have seen before, e.g. $y = 3(x^2 + 5)$, take one real number, x, and map it onto
another real number, y. If it helps you, you can think of the function y = f(x) as a
kind of mincing machine: you put a number (x) in, you get another number (y)
out. Now you must think of the function Pr(A) as a new kind of mincing machine:
you put a set (A) in, and out pops a number between zero and one (inclusive of these
end limits)!


Relative frequencies . . .
To try to get some insight into the concept of probability, consider a random experiment on some sample space S repeated infinitely many times. Let's start by doing
n trials of the random experiment and counting the number of times r that some event
$A \subset S$ occurs during the n trials. Then we define r/n to be the relative frequency of
the event A. Obviously, $0 \le r/n \le 1$. Thus relative frequencies and probabilities both
lie between zero and one.
We can think of the probability of the event A as the relative frequency of A as n,
the number of trials of the random experiment, gets very large. In symbols
$\Pr(A) = \lim_{n \to \infty} \frac{r}{n}$.

If you toss a fair coin, then the probability of heads is equal to the probability of
tails, i.e. Pr(H) = Pr(T) = 0.5. If you tossed the coin 10 times you might observe
6 heads, a relative frequency of 6/10 = 0.6. But if you tossed it 100 times you might
observe 53 heads, relative frequency 53/100 = 0.53. If you kept going for a few hours
more, and tossed it 1000 times you might observe 512 heads, giving a relative frequency
of 512/1000 = 0.512. As the number of trials increases, the relative frequency
tends to get closer and closer to the true probability.
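The coin-tossing experiment is easy to imitate with a pseudo-random number generator. The sketch below is a simulation under stated assumptions: a fair coin is modelled by a uniform random number falling below 0.5, and the function name relative_frequency and the fixed seed are choices made only so that the run is repeatable. The exact values printed depend on the generator, so none are quoted here.

```python
import random

def relative_frequency(n, seed=0):
    """Simulate n tosses of a fair coin and return r/n for the event 'heads'."""
    rng = random.Random(seed)                         # fixed seed: repeatable run
    r = sum(rng.random() < 0.5 for _ in range(n))     # count the heads
    return r / n

for n in (10, 100, 1000, 100000):
    print(n, relative_frequency(n))
```

With larger n the printed relative frequency should typically lie closer to 0.5, although any single run can fluctuate.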

A class experiment: birthdays in April, May and June . . .

Almost exactly a quarter of the days of the year fall into April, May or June
(91/365.25 = 0.249, allowing for leap years every fourth year). Thus we expect the
probability that an individual's birthday falls into one of these three months to be
pretty close to 0.249. Let's do an experiment within the class, and fill in this table.

                     Number of       Number with birthdays        Relative
                     students (n)    in April, May, June (r)      frequency (r/n)
Front row
Front three rows
Whole class

Do the relative frequencies get closer to the true probability as n gets larger?

Some useful theorems . . .


We consider several theorems that follow immediately from the axioms of probability.
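Theorem 1. Let $A \subset S$. Then $\Pr(\overline{A}) = 1 - \Pr(A)$.
Proof: We write S as the union of two mutually exclusive events:
$A \cup \overline{A} = S$.
Because A and $\overline{A}$ are mutually exclusive, i.e. $A \cap \overline{A} = \emptyset$, we can use axiom 3 to state
$\Pr(A \cup \overline{A}) = \Pr(A) + \Pr(\overline{A})$.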


But $A \cup \overline{A} = S$, and $\Pr(S) = 1$, by Kolmogorov's 2nd axiom, so $\Pr(A \cup \overline{A}) = 1$. Therefore
$\Pr(A) + \Pr(\overline{A}) = 1$
and
$\Pr(\overline{A}) = 1 - \Pr(A)$,
as required.
Theorem 2. If $A \subset S$ and $B \subset S$ then $\Pr(A) = \Pr(A \cap B) + \Pr(A \cap \overline{B})$.
Proof: Write A as the union of the two mutually exclusive sets:
$A = (A \cap B) \cup (A \cap \overline{B})$.
Clearly,
$(A \cap B) \cap (A \cap \overline{B}) = \emptyset$.
Therefore, using axiom 3,
$\Pr(A) = \Pr(A \cap B) + \Pr(A \cap \overline{B})$.
Notice that theorem 2 may also be expressed as
$\Pr(A \cap \overline{B}) = \Pr(A) - \Pr(A \cap B)$.
Theorem 3. The Addition Rule. For any arbitrary events A and B,
$\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$.
Proof: Write $A \cup B$ as the union of two mutually exclusive sets:
$A \cup B = B \cup (A \cap \overline{B})$.
Because B and $A \cap \overline{B}$ are mutually exclusive, we can again apply axiom 3 and say
$\Pr(A \cup B) = \Pr(B) + \Pr(A \cap \overline{B})$.
But, by theorem 2, $\Pr(A \cap \overline{B}) = \Pr(A) - \Pr(A \cap B)$. The result follows.
Theorem 4. If $B \subset A$, then $\Pr(B) \le \Pr(A)$.
Proof: If $B \subset A$ then we can write A as the union of two mutually exclusive sets,
$A = B \cup (A \cap \overline{B})$
and
$\Pr(A) = \Pr(B) + \Pr(A \cap \overline{B}) \ge \Pr(B)$
because $\Pr(A \cap \overline{B}) \ge 0$, as all probabilities are non-negative.


Theorem 5. If $A_1, A_2, \ldots, A_n$ are pairwise mutually exclusive, i.e. $A_i \cap A_j = \emptyset$ for $i \ne j$,
then
$\Pr(A_1 \cup A_2 \cup \ldots \cup A_n) = \Pr(A_1) + \Pr(A_2) + \ldots + \Pr(A_n)$,
or, in a more concise notation,
$\Pr\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} \Pr(A_i)$.
Proof: The proof is by repeated use of axiom 3. The events $(A_1 \cup A_2 \cup \ldots \cup A_{n-1})$ and
$A_n$ are mutually exclusive. Thus
$\Pr\left(\bigcup_{i=1}^{n} A_i\right) = \Pr\left(\bigcup_{i=1}^{n-1} A_i\right) + \Pr(A_n)$.
Next, the events $(A_1 \cup A_2 \cup \ldots \cup A_{n-2})$ and $A_{n-1}$ are mutually exclusive. Thus
$\Pr\left(\bigcup_{i=1}^{n-1} A_i\right) = \Pr\left(\bigcup_{i=1}^{n-2} A_i\right) + \Pr(A_{n-1})$,
so that
$\Pr\left(\bigcup_{i=1}^{n} A_i\right) = \Pr\left(\bigcup_{i=1}^{n-2} A_i\right) + \Pr(A_{n-1}) + \Pr(A_n)$.
Continue the process, and the result follows.


Example 7A: If $\Pr(A) = 0.5$, $\Pr(B) = 0.6$ and $\Pr(A \cap B) = 0.3$, find $\Pr(\overline{B})$, $\Pr(A \cap \overline{B})$
and $\Pr(A \cup B)$.
By theorem 1, $\Pr(\overline{B}) = 1 - \Pr(B) = 1 - 0.6 = 0.4$.
By theorem 2, $\Pr(A \cap \overline{B}) = \Pr(A) - \Pr(A \cap B) = 0.5 - 0.3 = 0.2$.
By theorem 3, $\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B) = 0.5 + 0.6 - 0.3 = 0.8$.
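The three calculations of Example 7A translate into one line each. The sketch below uses exact fractions to avoid floating-point rounding; the variable names are assumptions chosen for readability.

```python
from fractions import Fraction

pr_A = Fraction(5, 10)
pr_B = Fraction(6, 10)
pr_A_and_B = Fraction(3, 10)

pr_not_B = 1 - pr_B                    # theorem 1
pr_A_and_not_B = pr_A - pr_A_and_B     # theorem 2, rearranged
pr_A_or_B = pr_A + pr_B - pr_A_and_B   # theorem 3, the addition rule

print(float(pr_not_B), float(pr_A_and_not_B), float(pr_A_or_B))   # 0.4 0.2 0.8
```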
Example 8C: Is it possible for events in the same sample space to have probabilities
Pr(A) = 0.8, Pr(B) = 0.6 and Pr(A B) = 0.7?
Example 9C: In the sample space S, let Pr(A) = 0.7, Pr(B) = 0.5, Pr(C) = 0.1,
Pr(A B) = 0.9, and Pr((A B) C) = 0. Depict the events A, B and C on a Venn
diagram and find the probability of the events A B, A C, A B C, B C, and
(A B).
Example 10C: If we know that Pr(A B) = 0.6 and Pr(A B) = 0.2, can we find
Pr(A) and Pr(B)?


Equally probable elementary events . . .


All probability problems, in theory at least, can be solved by making use of theorem 5. The elementary events that make up the sample space S are always mutually
exclusive because if one elementary event occurs, no other elementary event occurs.
Therefore, if we knew the probabilities of all the elementary events, we would also be
able to compute the probabilities of any event in S. By theorem 5, the probability of
any event is simply the sum of the probabilities of the elementary events that make up
the event.
There is a wide class of problems for which we do know the probabilities of all the
elementary events in a sample space. These are the problems for which it is reasonable to
assume that all the elementary events are equally likely. If there are N elementary
events in S, and each one has the same probability of occurring, then the probability of
each and every elementary event must be 1/N .
Equally probable elementary events occur in many games of chance: coins and
dice are assumed to be unbiased; when a card is drawn from a pack of 52 playing cards,
the probability of any particular card is assumed to be 1/52. Let A be some event in
this scenario. Then A must consist of the union of elementary events, each of probability
1/N. If we could determine the number of elementary events in A, then we could
write
$\Pr(A) = \frac{\text{number of elementary events in } A}{\text{number of elementary events in } S} = \frac{n(A)}{n(S)} = \frac{n(A)}{N},$
where we define the function n(A) to mean the count of the number of elementary events
in A. Clearly, n(S) = N.

Example 11A: Consider tossing a fair die. Then S = {1, 2, 3, 4, 5, 6} and N = 6. Let
A = {1, 3, 5}, the event of getting an odd number. Find Pr(A).
The number of elementary events in A is n(A) = 3. So
$\Pr(A) = \frac{n(A)}{N} = \frac{3}{6} = \frac{1}{2},$
which your intuition should tell you is correct!
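When the elementary events are equally probable, the probability function reduces to counting. The following is a minimal sketch; the function name pr is hypothetical, and the equal-probability assumption is built into it, so it should not be used for a biased die.

```python
from fractions import Fraction

def pr(event, sample_space):
    """Pr(A) = n(A)/N, assuming all elementary events are equally probable."""
    if not event <= sample_space:
        raise ValueError("an event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))

S = {1, 2, 3, 4, 5, 6}
print(pr({1, 3, 5}, S))      # 1/2  (an odd number)
print(pr({3, 4, 5, 6}, S))   # 2/3  (a number greater than or equal to 3)
print(pr(set(), S), pr(S, S))   # 0 1  (impossible and certain events)
```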


COMPUTING PROBABILITIES WHEN THE ELEMENTARY EVENTS ARE ALL EQUIPROBABLE
If all the N elementary events in a sample space have probability 1/N,
the following three steps enable the probability of any event A in the
sample space to be found:
1. Determine the sample space S of elementary events. Determine
the number, N = n(S), of elementary events in S. You might have
to list them and count, or you might be able to use one of the
counting rules given below. If the N elementary events are
equally probable, then each one occurs with probability 1/N.
2. Determine A, the subset of S, the event whose probability is to be
found. Count the number of elementary events in A: suppose
n(A) elementary events make up the event A.
3. Then Pr(A) = n(A)/N.
Example 12A: 100 people bought tickets in a charity raffle. 60 of them bought the
tickets because they supported the charity. 75 bought tickets because they liked the
prize. No one who neither supported the charity nor liked the prize bought a ticket.
(a) What is the probability that the prize-winning ticket was bought by someone who
liked the prize?
(b) What is the probability that the prize was won by someone who did not support
the charity?
(c) What is the probability that the prize was won by someone who both supported
the charity and liked the prize?
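(a) Let A and B be the events "liked the prize" and "supported the charity" respectively.
To find Pr(A), we apply the three steps as follows:
1. N = 100
2. n(A) = 75
3. Therefore Pr(A) = n(A)/N = 75/100 = 0.75.
(b) The event "did not support the charity" is the complement $\overline{B}$. To find $\Pr(\overline{B})$, we only have to go through steps 2 and 3.
2. $n(\overline{B}) = 100 - 60 = 40$
3. Thus $\Pr(\overline{B}) = n(\overline{B})/N = 40/100 = 0.4$.
(c) We now need $\Pr(A \cap B)$. Because every ticket buyer either supported the charity or liked the prize, $n(A \cup B) = 100$, so:
2. $n(A \cap B) = n(A) + n(B) - n(A \cup B) = 75 + 60 - 100 = 35$
3. $\Pr(A \cap B) = n(A \cap B)/N = 35/100 = 0.35$.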
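The three-step recipe applied to Example 12A can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are our own, and exact fractions are used so that the results can be compared directly with the hand calculation.

```python
from fractions import Fraction

N = 100          # tickets sold: the number of elementary events in S
n_prize = 75     # n(A): buyers who liked the prize
n_charity = 60   # n(B): buyers who supported the charity

# Every buyer is in A or B, so n(A ∪ B) = N and
# n(A ∩ B) = n(A) + n(B) - n(A ∪ B).
n_both = n_prize + n_charity - N

pr_liked_prize = Fraction(n_prize, N)            # part (a)
pr_not_charity = Fraction(N - n_charity, N)      # part (b): complement of B
pr_both = Fraction(n_both, N)                    # part (c)

print(float(pr_liked_prize), float(pr_not_charity), float(pr_both))
```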


Example 13C: A pack of playing cards contains 52 cards, 13 belonging to each of
the four suits: Spades, Hearts, Diamonds and Clubs. Within each suit the
13 cards are labelled: Ace, 2, . . . , 10, Jack, Queen, King. Let D be the event that a
randomly selected card is a diamond, and K be the event that the card is a king, and
B be the event that the card has one of the numbers from 2 to 10. Find Pr(D), Pr(K),
Pr(B), Pr(D K), Pr(B D), Pr(B K) and Pr(D K B).
Example 14C: The seats of a jet airliner are arranged in 55 rows (numbered 1 to 55)
of 10 seats (lettered A to K, leaving out I). In each row, seats C, D, G and H are on
aisles, and A and K are window seats. Smoking is permitted in rows 45 to 55 inclusive.
If a passenger is assigned a seat at random, what is the probability of being allocated
(a) an aisle seat?
(b) a seat in the smoking section?
(c) a window seat in the non-smoking section?
(d) a window seat in row 1?

Permutations and combinations . . .


Even in problems in which all the elementary events are equally probable, it is
usually impractical to list and to count all the elementary events in the sample space or
in the event of interest. The theory of combinations and permutations frequently comes
to the rescue, and enables the number of elementary events in sample spaces and events
to be determined quite easily. This theory is summarized in a series of eight counting
rules given later.

Permutations of n objects . . .
Recall that a set is just a group of objects, and that the order in which the objects
are listed is irrelevant. We now consider the number of different ways all the objects in
a set may be arranged in order. A set containing n distinguishable objects has
\[
n(n-1) \times \cdots \times 3 \times 2 \times 1 = n! \quad (n \text{ factorial})
\]
different orderings of the objects belonging to the set. We can see this by thinking
in terms of having n slots to fill with the n objects in the set. Each slot can hold one
object. We can choose an object for the first slot in n ways; there are then $n - 1$ objects
available for the second slot, so we can select an object for the second slot in $n - 1$ ways,
leaving $n - 2$ objects available for the third slot, . . . , until the last remaining object has
to be placed in the final slot. We say that there are n! distinct arrangements (technically, we
call each arrangement or ordering a permutation) of the n objects in the set.
Example 15A: If the set A = {1, 2, 3}, list all the possible permutations.
There are $3! = 3 \times 2 \times 1 = 6$ distinct arrangements of the objects in A. They are:
123, 132, 213, 231, 312, 321.
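For small sets, the list of permutations can be generated rather than written out by hand. The sketch below uses Python's standard `itertools.permutations`; the variable names are our own.

```python
from itertools import permutations
from math import factorial

A = [1, 2, 3]

# Each permutation is one ordering of all the objects in A.
orderings = list(permutations(A))
for ordering in orderings:
    print(ordering)

# The number of orderings agrees with n! for n = 3.
print(len(orderings), factorial(len(A)))
```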


Permutations of n objects taken r at a time . . .


Suppose now that we have a set containing n objects, and that we have r ($0 < r \le n$)
slots to fill. In how many ways can we do this, assuming that each object is used up
once it is allocated to a slot? We number the slots from 1 to r and fill each in turn.
We can choose any of the n objects to fill the first slot. Having filled the first slot there
are $n - 1$ objects available, any of which may be chosen for the second slot. Therefore,
the first two slots can be filled in $n(n-1)$ ways. The first three slots can be filled in
$n(n-1)(n-2)$ ways.
By the time we have filled the $(r-1)$st slot and are ready for the $r$th slot, we have
used $r - 1$ members of our set and therefore have $n - (r-1) = n - r + 1$ members left
to choose from. Hence, the r slots can be filled in
\[
n(n-1)\cdots(n-r+1) = \frac{n(n-1)\cdots(n-r+1)\,(n-r)(n-r-1)\cdots 3 \cdot 2 \cdot 1}{(n-r)(n-r-1)\cdots 3 \cdot 2 \cdot 1} = \frac{n!}{(n-r)!} \text{ ways.}
\]

Thus there are $n!/(n-r)!$ ways of ordering r elements taken from a set containing n
elements using each element at most once. Note that we are (a) choosing r objects and
(b) arranging them. We are here involved in two processes, choosing and arranging. The
number of ways of choosing and arranging r objects out of n distinguishable objects is
called the number of permutations of n objects taken r at a time and is denoted
by $(n)_r$ ("n permutation r"):
\[
(n)_r = \frac{n!}{(n-r)!}
\]
This formula is also valid for r = n if we adopt the convention that 0! = 1.
Example 16A: The focusing mechanism on Ron's camera is bust, so that he can only
take pictures of people at a distance of 2 metres, so he only takes pictures of 3 people at
a time. How many different pictures (a rearrangement of the same people is considered
a different picture) are possible if 10 people are present?
This is the same as asking for the number of permutations of 10 objects taken 3 at
a time, given by
\[
(10)_3 = \frac{10!}{(10-3)!} = 10 \times 9 \times 8 = 720.
\]
Example 17A: Suppose (as happened in South Africa in 1994) that 19 political parties
contested an election. One party wanted the ballot papers to have the parties listed in
random order. Another said it was impractical. How many different orderings would
have been possible?
This is equivalent to asking: How many permutations of 19 objects taken 19 at a
time are there? The answer is:
\[
(19)_{19} = \frac{19!}{(19-19)!} = \frac{19!}{0!} = \frac{19!}{1} = 121\,645\,100\,408\,832\,000 \approx 1.216451 \times 10^{17},
\]
roughly 25 million different ballot papers per man, woman and child on planet earth!
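The permutation counts in Examples 16A and 17A can be reproduced with `math.perm` from the Python standard library (available from Python 3.8). The sketch below is illustrative only; the variable names are our own.

```python
import math

# Example 16A: ordered pictures of 3 people chosen from 10, i.e. (10)_3.
pictures = math.perm(10, 3)

# Example 17A: orderings of all 19 parties, i.e. (19)_19 = 19!.
ballots = math.perm(19, 19)

print(pictures)
print(ballots, f"{ballots:.6e}")
```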


Combinations of n objects taken r at a time . . .


Now suppose we want merely to count the number of ways of choosing r elements out
of the n elements in our set without regard to the arrangement of the chosen elements.
In other words, we want to determine the number of r-element subsets that we can
form. We call this the number of combinations of n objects taken r at a time,
and denote it by the symbol $\binom{n}{r}$ ("n combination r").
To find the value of $\binom{n}{r}$, we reason as follows. When we found the number of
permutations of n objects taken r at a time, we divided the process into two operations:
choosing r objects and then arranging them. We are now only interested in
the first operation. We recall that a subset having r objects can be arranged in r!
permutations. A little reflection will convince you that there are therefore r! times more
permutations than combinations; that is
\[
\binom{n}{r}\, r! = (n)_r = \frac{n!}{(n-r)!}
\]
Therefore the formula for $\binom{n}{r}$ is given by
\[
\binom{n}{r} = \frac{n!}{r!\,(n-r)!}
\]
We have discovered that the number of r-element subsets that can be formed from a set
containing n elements is $\binom{n}{r}$.

Example 18A: In how many ways can a 9 man work team be formed from 15 men?
The problem asks only for the number of ways of choosing 9 men out of 15:
\[
\binom{15}{9} = \frac{15!}{9!\,6!} = 5005.
\]

Example 19A: How many different bridge hands can be dealt from a pack of 52 playing
cards?
A bridge hand contains 13 cards; what matters is only the group of cards (even
though you might arrange them in a convenient order). Therefore, bridge hands consist
of combinations of 52 objects taken 13 at a time:
\[
\binom{52}{13} = 635\,013\,559\,600.
\]
At 15 minutes per bridge game, there are enough different bridge hands to keep you
going for about 20 million years continuously.
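The binomial coefficients in Examples 18A and 19A can be evaluated with `math.comb` (Python 3.8 or later). The sketch below also checks, for one case, the relation that there are r! times more permutations than combinations. The variable names are our own.

```python
import math

work_teams = math.comb(15, 9)      # Example 18A: 9 men chosen from 15
bridge_hands = math.comb(52, 13)   # Example 19A: 13 cards chosen from 52

# Each 9-man team can be arranged in 9! orders, so
# (number of combinations) * r! = (number of permutations).
same_count = work_teams * math.factorial(9) == math.perm(15, 9)

print(work_teams, bridge_hands, same_count)
```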
Example 20B: From 8 accountants and 5 computer programmers, in how many ways
can one select a committee of
(a) 3 accountants and 2 computer programmers?
(b) 5 people, subject to the condition that the committee contain at least 2 computer
programmers and at least 2 accountants?



(a) We can choose 3 accountants from 8 in $\binom{8}{3}$ ways. We can choose 2 computer
programmers from 5 in $\binom{5}{2}$ ways. We multiply the results, because for every group
of 3 accountants that we choose, we can choose one of $\binom{5}{2}$ different groups of
computer programmers. Thus we can choose a committee of 3 accountants and 2
computer programmers in
\[
\binom{8}{3}\binom{5}{2} = 56 \times 10 = 560 \text{ ways.}
\]
(b) A committee of 5 with at least 2 of each profession has either 3 accountants and
2 computer programmers, or 2 accountants and 3 computer programmers. The total
number of ways of forming the committee is the sum of the numbers of ways of forming
each type:
\[
\binom{8}{3}\binom{5}{2} + \binom{8}{2}\binom{5}{3} = 560 + 280 = 840 \text{ ways.}
\]
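The committee counts of Example 20B can be confirmed both by formula and by brute-force enumeration of all 5-person committees. In this sketch the people are given made-up labels such as ("acc", 0); only the counts matter.

```python
from itertools import combinations
from math import comb

# Formula: multiply within a committee type, add across types.
part_a = comb(8, 3) * comb(5, 2)
part_b = part_a + comb(8, 2) * comb(5, 3)

# Brute force: list every 5-person committee and count the qualifying ones.
people = [("acc", i) for i in range(8)] + [("prog", i) for i in range(5)]
brute_a = 0
brute_b = 0
for committee in combinations(people, 5):
    n_acc = sum(1 for job, _ in committee if job == "acc")
    n_prog = 5 - n_acc
    if n_acc == 3 and n_prog == 2:
        brute_a += 1
    if n_acc >= 2 and n_prog >= 2:
        brute_b += 1

print(part_a, brute_a, part_b, brute_b)
```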

Permutations, with repetitions . . .


We now suppose that we have n types of objects and r slots, and that we have at
least r objects of each type available. We can thus fill the first slot with any of the
n types of objects, there are still n types of objects available for the second slot, . . .
Because there are at least r objects of each type, there are still objects of each of the n
types available for the final, rth slot.
Thus the number of permutations of n types of objects taken r at a time,
allowing repetitions is
\[
n \times n \times \cdots \times n = n^r.
\]
Example 21A: How many four digit numbers can be made from the 10 digits from 0
to 9, if repetitions are permitted?
We have four slots to fill. But because all of the 10 digits remain available to fill
every slot, this can be done in $10^4 = 10\,000$ ways. This makes sense, because there are
10 000 numbers from 0 (actually 0 000) to 9 999.
Example 22B:
(a) How many four letter words can be made with a 26-letter alphabet, including
all nonsense words?
(b) It is proposed to adopt a system of motor car number plates which uses three
letters of the alphabet (excluding I and O) followed by three digits. How many
number plates are possible?

(a) Because words like BEER, with repeated letters, are permissible, the potential
number of 4 letter words is $26^4 = 456\,976$.
(b) Clearly, because number plates like BBB444 with repetitions are permissible, the
number of possible number plates is $24^3 \times 10^3 = 13\,824\,000$, or nearly 14 million.
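Counts of permutations with repetition are simple powers, and for a small case they can also be verified by enumeration with `itertools.product`. A brief sketch, with variable names of our own choosing:

```python
from itertools import product

# Example 21A: enumerate every four-digit string and count them.
digits = "0123456789"
four_digit_strings = sum(1 for _ in product(digits, repeat=4))

# Example 22B: the same rule, evaluated directly as powers.
four_letter_words = 26 ** 4
number_plates = 24 ** 3 * 10 ** 3   # 24 letters (no I or O), then 3 digits

print(four_digit_strings, four_letter_words, number_plates)
```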


Combinations, with repetitions . . .


Once again, we have n types of objects, with at least r available of each type. The
number of selections of r objects, allowing repetitions, is given by
\[
\binom{n+r-1}{r}.
\]
The proof of this result is included as an exercise at the end of the chapter.
Example 23A: A company needs to purchase four new vehicles. How many selections
of makes are possible if they select any of seven different makes (Volkswagen, Toyota,
Nissan, Honda, Mazda, Ford, Uno)?
Clearly, a repetition of any of the makes is possible. However, the order of the makes
is unimportant. The number of combinations of 7 objects taken 4 at a time, allowing
repetitions, is given by
\[
\binom{n+r-1}{r} = \binom{7+4-1}{4} = \binom{10}{4} = 210.
\]
Example 24B: A supermarket sells 10 types of jam. You buy three tins. How many
combinations are possible?
Assuming the supermarket has at least 3 tins of every type of jam, the number of
combinations of 10 jams taken 3 at a time, allowing repetitions (i.e. you could buy more
than one tin of one type of jam), is
\[
\binom{n+r-1}{r} = \binom{10+3-1}{3} = \binom{12}{3} = 220.
\]
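The formula for combinations with repetition can be checked against a direct enumeration using `itertools.combinations_with_replacement`. The sketch below does this for the vehicle makes of Example 23A and evaluates the jam count of Example 24B by formula; variable names are our own.

```python
from itertools import combinations_with_replacement
from math import comb

makes = ["Volkswagen", "Toyota", "Nissan", "Honda", "Mazda", "Ford", "Uno"]

# Enumerate every unordered selection of 4 vehicles, repeats allowed.
vehicle_selections = list(combinations_with_replacement(makes, 4))

# Formula: C(n + r - 1, r) with n types and r objects selected.
vehicle_formula = comb(len(makes) + 4 - 1, 4)
jam_formula = comb(10 + 3 - 1, 3)

print(len(vehicle_selections), vehicle_formula, jam_formula)
```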

Counting rules . . .
The discussion above can be summarized into several counting rules:
Counting Rule 1: The number of distinguishable arrangements of n distinct objects,
not allowing repetitions, is n!.
Counting Rule 2: The number of ways of ordering r objects chosen from n distinct
objects, not allowing repetitions, is
\[
(n)_r = \frac{n!}{(n-r)!}.
\]
Counting Rule 3: The number of ways of choosing a set of r objects from n distinct
objects, not allowing repetitions, is
\[
\binom{n}{r} = \frac{n!}{r!\,(n-r)!}.
\]


Counting Rule 4: The number of permutations of r objects, chosen from n distinct
objects, allowing repetitions, is
\[
n^r.
\]
Counting Rule 5: The number of combinations of r objects, chosen from n distinct
objects, allowing repetitions, is
\[
\binom{n+r-1}{r}.
\]
Counting rules 1 to 5 relate to scenarios in which there are a total of n distinguishable
objects. They can be compressed into a two-way table:

                Without repetition                              With repetition

Permutations    rules 1 and 2: $(n)_r = \frac{n!}{(n-r)!}$       rule 4: $n^r$

Combinations    rule 3: $\binom{n}{r} = \frac{n!}{(n-r)!\,r!}$   rule 5: $\binom{n+r-1}{r}$

Counting rule 1 is the special case of counting rule 2, with r = n.
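The two-way classification of counting rules 1 to 5 translates naturally into a small function with two yes/no switches. The function name `count_selections` and its parameters are hypothetical, chosen only for this sketch; it relies on `math.perm` and `math.comb` (Python 3.8 or later).

```python
import math

def count_selections(n, r, ordered, repetition):
    """Count selections of r objects from n distinct objects.

    ordered=True counts permutations, ordered=False counts combinations;
    repetition says whether an object may be used more than once.
    """
    if ordered and not repetition:
        return math.perm(n, r)          # rules 1 and 2: n!/(n - r)!
    if ordered and repetition:
        return n ** r                   # rule 4
    if not repetition:
        return math.comb(n, r)          # rule 3
    return math.comb(n + r - 1, r)      # rule 5

print(count_selections(10, 3, ordered=True, repetition=False))   # Example 16A
print(count_selections(15, 9, ordered=False, repetition=False))  # Example 18A
print(count_selections(10, 4, ordered=True, repetition=True))    # Example 21A
print(count_selections(7, 4, ordered=False, repetition=True))    # Example 23A
```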

We add three further useful counting rules which we will state, and leave the proofs
as exercises.
Counting Rule 6: The number of distinguishable arrangements of n items, of which
$n_1$ are of one kind and $n_2 = n - n_1$ are of another kind, is
\[
\binom{n}{n_1} = \binom{n}{n_2} = \frac{n!}{n_1!\,n_2!}.
\]
Here it is assumed that the $n_1$ items of the first kind are indistinguishable from each
other, and the $n_2$ items of the second kind are indistinguishable from each other.
Before we move onto the final two counting rules, we define a generalization of the
binomial coefficient, known as the multinomial coefficient. We let
\[
\binom{n}{n_1\; n_2\; \ldots\; n_k} = \frac{n!}{n_1!\, n_2! \cdots n_k!},
\]
where $\sum_{i=1}^{k} n_i = n$, and call this the multinomial coefficient. The sum of the numbers
in the bottom row of a multinomial coefficient must be equal to the number at the
top! Notice that in multinomial coefficient notation, the binomial coefficient would
therefore have to be written with two numbers in the bottom row:
\[
\binom{n}{r} = \binom{n}{r\;\;(n-r)}.
\]


Counting Rule 7: The number of ways of choosing k combinations of sizes $n_1, n_2, \ldots, n_k$
from a set of n items ($\sum_{i=1}^{k} n_i = n$) is given by the multinomial coefficient
\[
\binom{n}{n_1\; n_2\; \ldots\; n_k}.
\]

Counting Rule 8: The number of distinguishable arrangements of n items of k types,
of which $n_1$ are of the first type, $n_2$ are of the second type, . . . , $n_k$ are of the kth type, is
given by
\[
\binom{n}{n_1\; n_2\; \ldots\; n_k}.
\]
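Python's standard library has no built-in multinomial coefficient, so a small helper is sketched here. The function name `multinomial` is our own; the group sizes are passed as arguments and n is taken to be their sum, which enforces the requirement that the bottom row adds up to the top number.

```python
from math import comb, factorial

def multinomial(*sizes):
    """Return n!/(n1! n2! ... nk!) where n is the sum of the sizes."""
    result = factorial(sum(sizes))
    for size in sizes:
        result //= factorial(size)   # each division is exact
    return result

# With two groups the multinomial coefficient reduces to a binomial one
# (counting rule 6): six items, four of one kind and two of another.
print(multinomial(4, 2), comb(6, 4), comb(6, 2))

# Three groups (counting rules 7 and 8), e.g. sizes 3, 4 and 5 from 12 items.
print(multinomial(3, 4, 5))
```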

Using the counting rules . . .


Example 25A: A furniture store is displaying a large couch in the window. The
window dresser has six cushions of which four are yellow and two are red. In how many
ways can the cushions be arranged on the couch (in a row, of course)?
We require arrangements of 6 items, where 4 are of one kind and 2 are of a second
kind. Counting rule 6 tells us that there are
\[
\binom{6}{4} = \binom{6}{2} = 15 \text{ possible arrangements.}
\]
Example 26A: A chain of nine hardware stores wishes to test-market a new product
in four of the nine stores. How many selections of four stores are possible?
The stores are distinguishable, we cannot have a repetition of the same store, and
the order of the stores is irrelevant! This is a classic application of counting rule 3. There
are
\[
\binom{9}{4} = \frac{9!}{4!\,5!} = 126
\]
different selections.
Example 27A: A committee of 12 is to be split into three subcommittees, having
three, four and five members, respectively. In how many ways can the subcommittees
be formed?
By counting rule 7, the number of combinations of sizes 3, 4 and 5 chosen from a
set of 12 items is given by
\[
\binom{12}{3\; 4\; 5} = \frac{12!}{3!\,4!\,5!} = 27\,720.
\]


Example 28B: A clothing store has designed a series of seven different advertisements,
labelled A to G. A local newspaper offers a special rate if advertisements are placed on the
first, second and third pages of the next weekend edition.
(a) How many different arrangements of the advertisements are possible, assuming
that the same advertisement is not repeated?
(b) If the marketing manager decides to allocate the advertisements randomly, and
decides not to use the same advertisement more than once, what is the probability
that advertisements A and B appear on the first and second pages respectively?
The advertisements are distinguishable, repetitions are not allowed, and arrangements
are important, so the solution requires application of counting rule 2.
(a) The number of arrangements is $(7)_3 = 7!/4! = 210$.
(b) With A on the first page and B on the second, one of C, D, E, F or G must be selected
for the third page, so 5 of the 210 arrangements are favourable. Hence Pr(A on
first and B on second page) = 5/210 = 0.024.
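Because the sample space in Example 28B has only 210 elementary events, the three-step method can be carried out literally: list the arrangements, count the favourable ones, and divide. A sketch, with variable names of our own:

```python
from fractions import Fraction
from itertools import permutations

ads = "ABCDEFG"

# Step 1: the sample space, every ordered choice of 3 different advertisements.
layouts = list(permutations(ads, 3))

# Step 2: the event "A on the first page and B on the second page".
favourable = [p for p in layouts if p[0] == "A" and p[1] == "B"]

# Step 3: Pr(event) = n(event) / N.
probability = Fraction(len(favourable), len(layouts))

print(len(layouts), len(favourable), round(float(probability), 3))
```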
Example 29B: A wealthy investor decides to give four of her 12 investments to her
daughter. Five of her investments are in gold-mining companies, the remaining seven in
various industrial companies. Her daughter is given the opportunity to select the four
companies at random.
(a) How many different sets of companies could the daughter be given?
(b) What is the probability that the daughter gets a poorly diversified portfolio of
investments, with either four gold-mining companies, or four industrial companies?
(a) The 12 companies are distinguishable, repetitions are not possible, and arrangements are irrelevant, so counting rule 3 is appropriate. The number of combinations
of 12 companies taken four at a time is $\binom{12}{4} = 495$.
(b) The number of ways of selecting four gold-mining companies is $\binom{5}{4} = 5$, and
therefore Pr(4 gold-mining companies) = 5/495. The number of ways of selecting
four industrial companies is $\binom{7}{4} = 35$, with probability 35/495. The two events are
mutually exclusive, so the probability of an undiversified portfolio is the sum of these
two probabilities: (5 + 35)/495 = 0.081, about one chance in 12.
Example 30B: At a party, there are substantial stocks of five brands of beer: Castle,
Lion, Ohlsson's, Black Label and Amstel. One of the party-goers grabs two cans without
looking. How many different combinations of 2 beers are possible?
The brands are distinguishable, repetitions are permitted, but the ordering is unimportant, so counting rule 5 is used. The number of ways of selecting two cans from five
brands allowing repetitions is
\[
\binom{5+2-1}{2} = \binom{6}{2} = 15.
\]
Note that not all of these 15 outcomes are equally probable!


Example 31B: A group of 20 people is to travel in three light aircraft seating 4, 6 and
10 people respectively. What is the probability that three friends travel on the same
plane?
The total number of ways of choosing combinations of sizes 4, 6 and 10 from a group
of size 20 is, by counting rule 7, given by
\[
\binom{20}{4\; 6\; 10} = \frac{20!}{4!\,6!\,10!} = 38\,798\,760.
\]
If the three friends travel in the four-seater aircraft, the remaining 17 must be split into
groups of sizes 1, 6 and 10. This can be done in $\binom{17}{1\; 6\; 10}$ ways. Similarly, if they travel
in the six-seater, the other 17 must be split into groups of 4, 3 and 10, and if they travel
in the 10-seater, the others must be in groups of sizes 4, 6 and 7. Thus the total number
of ways the three friends can be together is
\[
\binom{17}{1\; 6\; 10} + \binom{17}{4\; 3\; 10} + \binom{17}{4\; 6\; 7} = 4\,900\,896 \text{ ways.}
\]
Thus, Pr(3 friends together) = (4 900 896)/(38 798 760) = 0.1263.
Example 32B: A bridge hand consists of 13 cards dealt from a pack of 52 playing
cards. What is the probability of being dealt a hand containing exactly 5 spades?
The cards are distinguishable, repetitions are not possible, and the arrangement of
the cards is irrelevant (the order in which you are actually dealt the cards does not make
any difference to the hand). So counting rule 3 is the one to use to determine that the
total number of possible hands is $\binom{52}{13}$.
Applying counting rule 3 twice more, the number of ways of drawing 5 spades from
the 13 in the pack and the remaining 8 cards for the hand from the 39 clubs, hearts and
diamonds is $\binom{13}{5}\binom{39}{8}$. Hence
\[
\Pr(5 \text{ spades}) = \frac{\binom{13}{5}\binom{39}{8}}{\binom{52}{13}} = 0.1247.
\]
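The probability in Example 32B involves numbers far too large for comfortable hand calculation, but `math.comb` handles them exactly. In this sketch the function name `pr_spades` is hypothetical; it generalizes the example to any number k of spades in the hand.

```python
from fractions import Fraction
from math import comb

def pr_spades(k):
    """Probability that a 13-card bridge hand contains exactly k spades."""
    favourable = comb(13, k) * comb(39, 13 - k)   # k spades, 13 - k others
    total = comb(52, 13)                          # all possible hands
    return Fraction(favourable, total)

print(round(float(pr_spades(5)), 4))

# The probabilities for k = 0, 1, ..., 13 spades must add up to 1.
print(sum(pr_spades(k) for k in range(14)))
```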
Example 33B: If there are r people together, what is the probability that they all
have different birthdays (assuming that leap years don't exist)?
To determine the total number of ways they can have birthdays we use counting
rule 4: the dates are distinguishable, repetitions are possible, and we think in terms of
going through the r people in some order. The total number of ways is $365^r$.
The number of ways they can have different birthdays is given by counting rule 2,
which doesn't allow repetitions: $(365)_r$. So
\[
\Pr(\text{all different birthdays}) = \frac{(365)_r}{365^r}.
\]
If r = 23 this probability is 0.493, which is just less than one-half. The probability of
the complementary event, that there is at least one pair of shared birthdays, is therefore
0.507, marginally over 0.5. This means that, on average, in every second group of
23 people there will be shared birthdays.
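The birthday probability is easy to tabulate for different group sizes. The sketch below evaluates the ratio of counting rule 2 to counting rule 4 directly; the function name `pr_all_different` is our own, and leap years are ignored as in the example.

```python
import math

def pr_all_different(r, days=365):
    """Probability that r people all have different birthdays."""
    # (365)_r ordered choices without repetition, out of 365^r in total.
    return math.perm(days, r) / days ** r

for r in (10, 22, 23, 30, 50):
    shared = 1 - pr_all_different(r)
    print(r, round(pr_all_different(r), 3), round(shared, 3))
```

Running the loop shows that 23 is the smallest group size for which a shared birthday is more likely than not.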


Example 34B: Participants in a market research survey are given a set of 16 cards,
each having a picture of a well known car model. The participants are asked to sort the
cards into three piles:
Pile 1: the 3 models rated best of all.
Pile 2: the 5 models rated above average.
Pile 3: the 8 models rated below average.
In how many ways can the three piles be formed?
The 16 cards are distinguishable, but once they are in their pile their ordering is
irrelevant, and repetitions are impossible. So use counting rule 7: the sorting task can
be performed in
\[
\binom{n}{r_1\; r_2\; r_3} = \binom{16}{3\; 5\; 8} = \frac{16!}{3!\,5!\,8!} = 720\,720 \text{ ways.}
\]
Example 35B: To open a certain bicycle combination lock you have to get all five
digits (between 0 and 9) correct. What is the probability that a thief is successful on
his first combination?
A combination lock is not a combination lock at all; it should be called a permutation lock! You don't only have to hit the right digits, you have to get them in exactly
the right order! By counting rule 4 there are $10^5$ possible settings, so the probability is $1/10^5 = 0.00001$.
Example 36C: There are 33 candidates for an election to a committee of three. What
is the probability that Jones, Smith and Brown are elected?
Example 37C: A group of eight students fill the front row at Statistics lectures daily.
They decide to keep attending lectures until they have exhausted every possible arrangement in the front row. For how many days will they attend lectures?
Example 38C: A young investor is considering the purchase of a portfolio of three
shares from the Building and construction sector of the stock exchange. He chooses
three shares at random from the 25 shares currently listed in this sector.
(a) How many ways can shares be selected for the portfolio?
(b) What is the probability that Everite, Grinaker and L.T.A. (three shares in this
sector) are selected?
(c) What is the probability that Grinaker is one of the selected shares?
Example 39C: A firm of speculative builders has bought three adjoining plots. The
company builds houses in seven styles. It is concerned about the visual appearance
of the houses from the street. So they ask their drafting section to sketch all possible
selections of street views.
(a) How many sketches are required if (i) no repetitions of styles are allowed, and if
(ii) they allow repetitions of styles?
(b) If they choose one sketch at random from those in part (a)(ii), what is the probability that all the houses will be of different styles?
(c) In order to determine the materials required, the quantity surveying department is
concerned only with the three styles which might be built (and not with the plots
on which they are built). How many combinations of styles must they be prepared for if
(i) no repetitions of styles are allowed, and if (ii) repetitions are allowed?


Example 40C: Two new computer codes are being developed to prevent unauthorized
access to classified information. The first consists of six digits (each chosen from 0 to 9);
the second consists of three digits (from 0 to 9) followed by two letters (A to Z, excluding
I and O).
(a) Which code is better at preventing unauthorized access (defined as breaking the
code in one attempt)?
(b) If both codes are implemented, the first followed by the second, what is the probability
of gaining access in a single attempt?
Example 41C: A housewife is asked to rank five brands of washing powder (A, B,
C, D, E) in order of preference. Suppose that she actually has no preference, and her
ordering is arbitrary. What is the probability that
(a) brand A is ranked first?
(b) brand C is ranked first and brand D is ranked second?

Conditional probability . . .
Conditional probabilities provide a method for updating or revising probabilities
in the light of new information. On Monday, the weather forecaster might say the probability of rain on Thursday is 50% (weather forecasters have not heard of Kolmogorov's
axioms, and insist on giving probabilities as percentages!), on Tuesday he might revise
this probability in the light of additional information to 70%, and on Wednesday, with
even more reliable information, he might say 60%. In statistical jargon, we would say
that each forecast was conditional on the information available up to that point in
time.
Example 42A: We draw one card from a pack of 52 cards. The probability that it is
the King of Clubs is 1/52. Suppose now that someone draws the card for you, and tells
you that the card is a club. Now what is the probability that it is the King of Clubs?
Obviously 1/13. We have reduced our sample space from the set of 52 cards to the set
of 13 clubs.
CONDITIONAL PROBABILITY
Let A and B be two events in a sample space S. Then the conditional
probability of the event B given that the event A has occurred, denoted
by Pr(B | A), is
\[
\Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)}
\]
provided that $\Pr(A) \neq 0$. Pr(B | A) is read "the probability of B given A".
The conditional probability Pr(B | A) may be thought of as a reassessment of the
probability of B given the information that some other event A has occurred.



Example 42A continued: We reconsider our example using this definition.
\[
\Pr(\text{King of clubs} \mid \text{clubs}) = \frac{\Pr(\text{King of clubs} \cap \text{clubs})}{\Pr(\text{clubs})} = \frac{\Pr(\text{King of clubs})}{\Pr(\text{clubs})}
\]
The event "King of clubs" is a subset of the event "clubs", so the intersection of
these two events is the event "King of clubs". Pr(clubs) is the probability of drawing a
club; there are 13 ways of doing this. Hence Pr(clubs) = 13/52. Therefore
\[
\Pr(\text{King of clubs} \mid \text{clubs}) = \frac{1/52}{13/52} = 1/13,
\]
as before.
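The definition of conditional probability can be applied mechanically to a pack of cards by counting elementary events. In the sketch below, the helper names `pr` and `pr_given` are hypothetical; events are represented as functions that return True for the cards belonging to the event.

```python
from fractions import Fraction
from itertools import product

ranks = ["Ace"] + [str(k) for k in range(2, 11)] + ["Jack", "Queen", "King"]
suits = ["Spades", "Hearts", "Diamonds", "Clubs"]
deck = list(product(ranks, suits))   # 52 equally probable (rank, suit) pairs

def pr(event):
    """Pr(event) = n(event)/N for equiprobable elementary events."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def pr_given(event_b, event_a):
    """Pr(B | A) = Pr(A and B)/Pr(A), assuming Pr(A) is not zero."""
    return pr(lambda card: event_a(card) and event_b(card)) / pr(event_a)

def is_club(card):
    return card[1] == "Clubs"

def is_king_of_clubs(card):
    return card == ("King", "Clubs")

print(pr(is_king_of_clubs), pr_given(is_king_of_clubs, is_club))
```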

Example 43C: You, a woman with a medical background, are one of 198 applicants
for an M.B.A. programme of whom 81 will be selected. You hear, along the grapevine,
on good authority that there were 70 women applicants, of whom 38 were selected.
Assess your probabilities of being accepted before and after you receive the grapevine
information. Use the definition of conditional probability.
Example 44B: Suppose A and B are two events in a sample space, and that Pr(A) =
0.6, Pr(B) = 0.2 and Pr(A | B) = 0.5.
Find
(a) $\Pr(B \mid A)$
(b) $\Pr(A \cup B)$
(c) $\Pr(\overline{A} \cap B)$
(d) $\Pr(B \mid \overline{A})$.
In this type of problem a useful first step always is to simplify as many conditional
probabilities into absolute probabilities as possible.
From the given information we note that
\[
\Pr(A \mid B) = \Pr(A \cap B)/\Pr(B), \qquad \text{so} \qquad 0.5 = \Pr(A \cap B)/0.2.
\]
Thus
\[
\Pr(A \cap B) = 0.1.
\]
We are now in a position to tackle the questions asked.
(a) $\Pr(B \mid A) = \Pr(B \cap A)/\Pr(A) = 0.1/0.6 = 0.17$
(b) $\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B) = 0.6 + 0.2 - 0.1 = 0.7$
(c) $\Pr(B) = \Pr(A \cap B) + \Pr(\overline{A} \cap B)$ by theorem 2. Therefore
$\Pr(\overline{A} \cap B) = \Pr(B) - \Pr(A \cap B) = 0.2 - 0.1 = 0.1$.
(d) And finally,
\[
\Pr(B \mid \overline{A}) = \frac{\Pr(B \cap \overline{A})}{\Pr(\overline{A})} = \frac{\Pr(B \cap \overline{A})}{1 - \Pr(A)} = \frac{0.1}{1 - 0.6} = 0.25.
\]
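The chain of steps in Example 44B can be retraced in code, which makes it clear that everything follows from first converting the conditional probability into the probability of an intersection. Exact fractions are used to avoid rounding; the variable names are our own.

```python
from fractions import Fraction

pr_a = Fraction(6, 10)
pr_b = Fraction(2, 10)
pr_a_given_b = Fraction(5, 10)

# First step: turn the conditional probability into Pr(A and B).
pr_a_and_b = pr_a_given_b * pr_b

pr_b_given_a = pr_a_and_b / pr_a                 # (a)
pr_a_or_b = pr_a + pr_b - pr_a_and_b             # (b)
pr_not_a_and_b = pr_b - pr_a_and_b               # (c), by theorem 2
pr_b_given_not_a = pr_not_a_and_b / (1 - pr_a)   # (d)

for value in (pr_b_given_a, pr_a_or_b, pr_not_a_and_b, pr_b_given_not_a):
    print(value, round(float(value), 2))
```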


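Example 45B: Show that $\Pr(B \mid A) + \Pr(\overline{B} \mid A) = 1$.
\[
\begin{aligned}
\Pr(B \mid A) + \Pr(\overline{B} \mid A) &= \Pr(B \cap A)/\Pr(A) + \Pr(\overline{B} \cap A)/\Pr(A) \\
&= \frac{\Pr(B \cap A) + \Pr(\overline{B} \cap A)}{\Pr(A)} \\
&= \Pr(A)/\Pr(A) \qquad \text{by Theorem 2} \\
&= 1
\end{aligned}
\]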

Example 46C: Is it possible for events A and B in a sample space to have the following
probabilities?
Pr(A) = 0.5

Pr(B) = 0.8

Pr(A | B) = 0.2.

Example 47C: The probability that a first year student passes Economics if he passes
Statistics is 0.5, the probability that he passes Statistics if he passes Economics is 0.8,
and the (unconditional) probability that he passes Statistics is 0.7. The Statistics results come out first, and the student finds he has failed. What is now the conditional
probability of passing Economics?
Example 48C: Show that for any three events A, B and C in a sample space S
\[
\Pr(A \cap B \cap C) = \Pr(A \mid B \cap C)\,\Pr(B \mid C)\,\Pr(C).
\]

Bayes' Theorem . . .
For any two events A and B there are two conditional probabilities that can be
considered:
\[
\Pr(B \mid A) = \Pr(A \cap B)/\Pr(A) \qquad \text{and} \qquad \Pr(A \mid B) = \Pr(A \cap B)/\Pr(B).
\]
A very useful tool for finding conditional probabilities is Bayes' theorem, which connects
Pr(B | A) with Pr(A | B). It is named in honour of Rev. Thomas Bayes, who did pioneering
work in probability theory in the 1700s.
Bayes' Theorem. If A and B are two events, then
\[
\Pr(A \mid B) = \frac{\Pr(B \mid A)\Pr(A)}{\Pr(B \mid A)\Pr(A) + \Pr(B \mid \overline{A})\Pr(\overline{A})}
\]


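Proof: Recall the definition of conditional probability
\[
\Pr(A \mid B) = \Pr(A \cap B)/\Pr(B)
\]
and theorem 2
\[
\Pr(B) = \Pr(A \cap B) + \Pr(\overline{A} \cap B).
\]
Substituting, we have
\[
\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(A \cap B) + \Pr(\overline{A} \cap B)}.
\]
We note that
\[
\Pr(A \cap B) = \Pr(B \mid A)\Pr(A) \qquad \text{and} \qquad \Pr(\overline{A} \cap B) = \Pr(B \mid \overline{A})\Pr(\overline{A}).
\]
Therefore
\[
\Pr(A \mid B) = \frac{\Pr(B \mid A)\Pr(A)}{\Pr(B \mid A)\Pr(A) + \Pr(B \mid \overline{A})\Pr(\overline{A})}.
\]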

Example 49B: A television manufacturer cannot produce the full quota of tubes it
requires, so it purchases 20% of its needs from an outside supplier. The quality manager
has determined that 6% of the tubes produced in house are defective, and that 8% of
the purchased tubes are defective. He finds the tube of a randomly selected television
to be defective. What is the probability that the tube was produced by the company?
Let D be the event "tube defective",
C be the event "produced by the company".
We are given $\Pr(D \mid C) = 0.06$, $\Pr(D \mid \overline{C}) = 0.08$ and $\Pr(C) = 0.8$. We want to
find $\Pr(C \mid D)$. By Bayes' theorem
\[
\Pr(C \mid D) = \frac{\Pr(D \mid C)\Pr(C)}{\Pr(D \mid C)\Pr(C) + \Pr(D \mid \overline{C})\Pr(\overline{C})} = \frac{0.06 \times 0.8}{0.06 \times 0.8 + 0.08 \times 0.2} = 0.75.
\]
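Bayes' theorem for two events needs only three inputs, so it fits in a short function. The name `bayes` and its parameter names are hypothetical, chosen for this sketch; the call reproduces the television tube calculation of Example 49B.

```python
def bayes(pr_b_given_a, pr_a, pr_b_given_not_a):
    """Return Pr(A | B) from Pr(B | A), Pr(A) and Pr(B | not A)."""
    numerator = pr_b_given_a * pr_a
    # The denominator is Pr(B), split over A and its complement (theorem 2).
    denominator = numerator + pr_b_given_not_a * (1 - pr_a)
    return numerator / denominator

# Example 49B: A = "produced by the company", B = "tube defective".
pr_company_given_defective = bayes(pr_b_given_a=0.06, pr_a=0.8,
                                   pr_b_given_not_a=0.08)
print(round(pr_company_given_defective, 2))
```

Although 80% of tubes are made in house, only 75% of defective tubes are, because purchased tubes have the higher defect rate.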

Example 50B: You feel ill at night and stumble into the bathroom, grab one of three
bottles in the dark and take a pill. An hour later you feel really ghastly, and you
remember that one of the bottles contains poison and the other two aspirin.
Your handy medical text says that 80% of people who take the poison will show
the same symptoms as you are showing, and that 5% of people taking aspirin will have
them.
Let B be the event "having the symptoms",
A be the event "taking the poison".
Then $\overline{A}$ is the event "taking aspirin".


What is the probability that you took the poison given that you have got the symptoms, i.e. what is Pr(A | B)?
\[
\Pr(A \mid B) = \frac{\Pr(B \mid A)\Pr(A)}{\Pr(B \mid A)\Pr(A) + \Pr(B \mid \overline{A})\Pr(\overline{A})}.
\]
From the information supplied by the handy medical text
\[
\Pr(B \mid A) = 0.8 \qquad \text{and} \qquad \Pr(B \mid \overline{A}) = 0.05.
\]
From our groping round in the dark, we conclude that
\[
\Pr(A) = 1/3 \qquad \text{and} \qquad \Pr(\overline{A}) = 2/3.
\]
Thus
\[
\Pr(A \mid B) = \frac{0.8 \times 1/3}{0.8 \times 1/3 + 0.05 \times 2/3} = 0.89.
\]
We recommend that you call the doctor!
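The same calculation can be organized as a table of hypotheses, each with a prior probability and a probability of producing the observed evidence. The sketch below is written for any number of mutually exclusive and exhaustive hypotheses, which is an extension beyond the two-event formula used above; the function name `posterior` is our own. It is checked here on the two hypotheses of Example 50B.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return Pr(hypothesis | evidence) for each hypothesis.

    priors[i] is Pr(hypothesis i); likelihoods[i] is
    Pr(evidence | hypothesis i). The priors must sum to 1.
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)   # this is Pr(evidence)
    return [j / total for j in joint]

# Example 50B: hypotheses "took poison" and "took aspirin".
priors = [Fraction(1, 3), Fraction(2, 3)]
likelihoods = [Fraction(80, 100), Fraction(5, 100)]   # Pr(symptoms | each)

pr_poison, pr_aspirin = posterior(priors, likelihoods)
print(pr_poison, round(float(pr_poison), 2))
```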


Example 51C: A trick coin with two tails is put in a jar with three normal coins. One
coin is selected at random and tossed. If the result is a tail, what is the probability that
the trick coin was selected?
Example 52C: Let A and B be two events with positive probability. Which of the
following statements are always true, and which are not, in general, true?
(a) Pr(A | B) + Pr(A | B) = 1.
(b) Pr(A | B) + Pr(A | B) = 1.
(c) Pr(A | B) + Pr(A | B) = 1.
Example 53C: Let A and B be two mutually exclusive events and let C be any other
event. Show that
(a) $\Pr(A \cup B \mid C) = \Pr(A \mid C) + \Pr(B \mid C)$.
(b) $\Pr(C \mid A \cup B) = \dfrac{\Pr(A \cap C) + \Pr(B \cap C)}{\Pr(A) + \Pr(B)}$.
Example 54C: The miners are out on strike, with a list of demands. Negotiators
reckon that if management meets one of the demands, the probability that the strike
will end is 0.85. But if this demand is not met, the probability that the strike will
continue is 0.92. You assess the probability that management will agree to meet the
demand as 0.3. Later you hear that the strike has ended. What is the probability that
the demand was met?
Example 55C: A well is drilled as part of an oil exploration programme. The probability of the well passing through shale is 0.4. If the well passes through shale, the
probability of striking oil is 0.3. If it does not pass through shale, the probability drops
to 0.1.
(a) Given that oil was found, what is the probability that it did not pass through
shale?
(b) Given that oil was not found, what is the probability that it passed through shale?


Example 56C: We have been using the "kiddie" version of Bayes' theorem. Prove the
"adult" version. Let $A_1, A_2, \ldots, A_n$ be a set of mutually exclusive and exhaustive events
in S. Let B be any other event. Then
\[
\Pr(A_i \mid B) = \frac{\Pr(B \mid A_i)\Pr(A_i)}{\Pr(B \mid A_1)\Pr(A_1) + \Pr(B \mid A_2)\Pr(A_2) + \cdots + \Pr(B \mid A_n)\Pr(A_n)}.
\]

Example 57C: Jim has applied for a bursary for next year. His estimates of the
probabilities of getting each grade of result, and his probabilities of getting the bursary
given each grade are given in the table.
Grade                          1st     Upper 2nd   Lower 2nd   3rd     Fail
Pr(getting grade)              0.20    0.15        0.50        0.10    0.05
Pr(getting bursary | grade)    0.90    0.75        0.40        0.15    0.00

You subsequently hear that he was awarded the bursary. What is the probability
(a) that he got a first class pass?
(b) that he failed?
(c) that he got either an upper second or a lower second?
Example 58C: A family has two dogs (Rex and Rover) and a cat called Garfield. None
of them is fond of the postman. If they are outside, the probabilities that Rex, Rover
and Garfield will attack the postman are 30%, 40% and 15%, respectively. Only one is
outside at a time, with probabilities 10%, 20% and 70%, respectively. If the postman is
attacked, what is the probability that Garfield was the culprit?

Independent events . . .
The intuitive feeling is that independent events have no effect upon each other. But
how do we decide whether two events A and B are independent? If the occurrence of
event A has nothing to do with the occurrence of event B, then we expect the conditional
probability of B given A to be the same as the unconditional probability of B:
Pr(B | A) = Pr(B).
The information that event A has occurred does not change the probability of B occurring. If $\Pr(B \mid A) = \Pr(B)$, then, using the definition of conditional probability,
$$\frac{\Pr(B \cap A)}{\Pr(A)} = \Pr(B),$$
or,
$$\Pr(A \cap B) = \Pr(A)\,\Pr(B).$$
This leads us to the definition of independent events.
Events A and B are independent if
$$\Pr(A \cap B) = \Pr(A)\,\Pr(B).$$


INTROSTAT

In words, the probability of the intersection of independent events is the product of their
individual probabilities.
The definition can be extended to the independence of a series of events: if events
A1 , A2 , . . . , An are independent, then
$$\Pr(A_1 \cap A_2 \cap \cdots \cap A_n) = \Pr(A_1)\,\Pr(A_2) \cdots \Pr(A_n).$$
Many students initially have a conceptual difficulty separating the concepts of events
that are mutually exclusive and events that are independent. It helps (a little) to
realize that mutually exclusive is a concept from set theory (chapter 2) and can be
represented on a Venn diagram. But independence is a concept in probability theory
(chapter 3), and cannot be represented in a Venn diagram.
Note that independent events with non-zero probabilities are never mutually exclusive. For example, the events
"the gold price goes up today" and "it rains in Cape Town this afternoon" are conceptually independent: they have nothing to do with each other. However, the gold price
might go up today and it might rain in Cape Town this afternoon, so the intersection of
these events is not empty, and they are therefore not mutually exclusive. On the other
hand, if you toss a coin, the events "heads" and "tails" are mutually exclusive. Someone
tells you he tossed a coin and got heads. In the light of this information, your assessment of the probability of getting tails is instantly adjusted downwards from 0.5 to
zero! Mutually exclusive events (each with non-zero probability) are certainly not independent events!
Try making up your own examples to clear up the difference between the two concepts.
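One way to make up such examples is to enumerate a small sample space and test both properties directly. The sketch below is ours, not the book's: it assumes two tosses of a fair coin with equally likely outcomes, and the helper names (`pr`, `mutually_exclusive`, `independent`) are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space: two tosses of a fair coin, all four outcomes equally likely.
S = set(product("HT", repeat=2))


def pr(event):
    """Probability of an event (a subset of S) under equally likely outcomes."""
    return Fraction(len(event), len(S))


def mutually_exclusive(a, b):
    """Set-theoretic property: the intersection is empty."""
    return len(a & b) == 0


def independent(a, b):
    """Probabilistic property: Pr(A and B) = Pr(A) Pr(B)."""
    return pr(a & b) == pr(a) * pr(b)


first_heads = {s for s in S if s[0] == "H"}
first_tails = {s for s in S if s[0] == "T"}
second_heads = {s for s in S if s[1] == "H"}

# Heads and tails on the same toss: mutually exclusive, not independent.
print(mutually_exclusive(first_heads, first_tails),
      independent(first_heads, first_tails))
# Heads on different tosses: independent, not mutually exclusive.
print(mutually_exclusive(first_heads, second_heads),
      independent(first_heads, second_heads))
```

Exact fractions are used so that the product rule can be tested with `==` without rounding problems.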
Example 59A: Let A be the event that a microchip is manufactured perfectly. Let B
be the event that the chip is installed correctly. If Pr(A) = 0.98 and Pr(B) = 0.93 what
is the probability that the installed chip functions perfectly?
We require $\Pr(A \cap B)$. Because manufacture and installation may be considered
independent, we have:
$$\Pr(A \cap B) = \Pr(A)\,\Pr(B) = 0.98 \times 0.93 = 0.9114.$$
Example 60B: A four-engined plane can land safely even if three engines fail. Each
engine fails, independently of the others, with probability 0.08 during a flight. What is
the probability of making a safe landing?
Let $A_i$ be the event that engine i fails. Then the event "safe landing" can be written
as $\overline{A_1 \cap A_2 \cap A_3 \cap A_4}$, the complement of the event "all engines fail".
$$\Pr(\overline{A_1 \cap A_2 \cap A_3 \cap A_4}) = 1 - \Pr(A_1 \cap A_2 \cap A_3 \cap A_4)$$
$$= 1 - \Pr(A_1)\,\Pr(A_2)\,\Pr(A_3)\,\Pr(A_4)$$
$$= 1 - 0.08^4 = 0.999\,959\,040.$$

Quite safe! On average, about one flight in 24 414 will crash!
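The arithmetic of Example 60B can be checked in a few lines of Python. This is a sketch only; the variable names are our own.

```python
p_fail = 0.08      # probability that a single engine fails during a flight
n_engines = 4

# The plane crashes only if every engine fails; independence lets us multiply.
p_crash = p_fail ** n_engines
p_safe = 1 - p_crash

print(f"Pr(safe landing) = {p_safe:.9f}")
print(f"about one crash per {1 / p_crash:,.0f} flights")
```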

Example 61C: An orbiting satellite has three panels of solar cells, which function
independently of each other. Each fails during the mission with probability 0.02. What
is the probability that there will be adequate power output during the entire mission if
(a) all three panels must be active?
(b) at least two panels must be active?


Example 62C: The probability that a certain type of air-to-air missile will hit its
target is 0.4. How many missiles must be fired simultaneously if it is desired that the
probability of at least one hit will exceed 0.95?
Example 63C: A test pilot will have to use his ejector seat with probability 0.08. The
probability that the ejector seat works is 0.97. The probability that his parachute opens
is 0.99. Assuming these events to be independent, calculate the probability that the test
pilot survives the flight.
Example 64C: Suppose that a fashion shirt comes in three sizes and five colours. The
three sizes (and the percentage of the population who purchase each size) are: small
(30%), medium (50%), and large (20%). Market research indicates the following colour
preferences: white (6%), blue (26%), green (36%), orange (18%), and red (14%). The
management of a store expects to sell 1000 of these shirts. How many shirts of each size
and colour should they order? Assume independence.
Example 65C: Some financial academics argue that the day-to-day movements of share
prices are statistically independent. Assume, hypothetically, that the share De Beers has
a probability of 0.55 of rising on any given trading day. What is the probability that it
rises on three successive trading days?
Example 66C: The probability that the rand will weaken against the dollar tomorrow
is 0.53. The probability that you will wake up late tomorrow is 0.42.
(a) What is the probability that, tomorrow, the rand weakens against the dollar and
you wake up late?
(b) What is the probability that, tomorrow, the rand weakens against the dollar or
you wake up late?
Example 67C: The probability that you will be able to answer the question in the
examination on Chapter 3 of IntroSTAT is 0.65. The probability that you enter the
numbers into your calculator correctly is 0.94. The probability that your calculator
operates correctly is 0.99. The probability that you copy the answer correctly from your
calculator to your answer book is 0.97. What is the probability that you get the question
correct?

Solutions to examples . . .
3C Alternative (c) is correct, it lists all the elementary events.
5C (a) S = {SS, SN, N S, N N }.

(b) S = {SSS, SSN, SN N, N N N, N N S, N SN, N SS, SN S}


(c) S = {SS, SN, SP, N S, N P, N N, P P, P S, P N }

6C (a) S = {MMMFF, MMFMF, MFMMF, FMMMF, FMMFM, FMFMM,
FFMMM, MFMFM, MFFMM, MMFFM}

(b) U = {FMMMF, FMMFM, FMFMM, FFMMM}
V = {FMMFM, FMFMM, FFMMM, MFMFM, MFFMM, MMFFM}
W = {MFMFM}

(c) $\overline{U}$ = {MMMFF, MMFMF, MFMMF, MFMFM, MFFMM, MMFFM}
$U \cap W = \emptyset$
$V \cap W$ = {MFMFM}
$U \cap \overline{V}$ = {FMMMF}

8C $\Pr(A \cup B) = 1.1$, therefore impossible.


9C C is disjoint from A B, but A B and C exhaust S.
Pr(AB) = 0.3, Pr(AC) = 0, Pr(AB C) = 1.0, Pr(B C) = 0.1 Pr(A B) =
0.7.
10C No, all we can say is Pr(A) + Pr(B) = 0.8.
13C Pr(D) = 0.25, Pr(K) = 0.077, Pr(B) = 0.692, Pr(D K) = 0.019,
Pr(B D) = 0.481, Pr(B K) = 0.077 and Pr(D K B) = 0.827.
14C (a) 0.40 (b) 0.20 (c) 0.16 (d) 0.004

36C $1/\binom{33}{3} = 0.000183$

37C 8! = 40 320 days = 110.5 years!



38C (a) $\binom{25}{3} = 2\,300$ (b) $\frac{1}{2300} = 0.0004$ (c) $\binom{24}{2}/2300 = 0.12$

39C (a) (i) $(7)_3 = 210$ (ii) $7^3 = 343$
(b) $(7)_3/7^3 = 0.612$
(c) (i) $\binom{7}{3} = 35$ (ii) $\binom{7+3-1}{3} = 84$

40C (a) $1/10^6 = 0.000\,001\,000$ and $1/(10^3 \times 24^2) = 1/576\,000 = 0.000\,001\,736$. Six
digits are better than three digits followed by two letters.
(b) $1/(10^6 \times 10^3 \times 24^2) = 0.1736 \times 10^{-11}$
41C (a) $4!/5! = \frac{1}{5}$ (b) $3!/5! = \frac{1}{20}$

43C Before, 0.409; after 0.543.


46C $\Pr(A \cap B) = 0.16$ and thus $\Pr(A \cup B) = 1.14$, which is impossible.
47C Pr(passing economics | failed statistics) = 0.0875/0.3 = 0.2917.

51C By Bayes' theorem, $\Pr(\text{trick coin} \mid \text{tails}) = \dfrac{\frac{1}{4}}{\frac{1}{4} + \frac{1}{2} \times \frac{3}{4}} = 0.4$
52C Only (b) is correct.

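54C Pr(demands met | strike ended) = $\dfrac{0.85 \times 0.3}{0.85 \times 0.3 + 0.08 \times 0.7} = 0.820$

55C (a) $\dfrac{0.1 \times 0.6}{0.1 \times 0.6 + 0.3 \times 0.4} = 0.333$
(b) $\dfrac{0.7 \times 0.4}{0.7 \times 0.4 + 0.9 \times 0.6} = 0.341$

57C (a) 0.3547 (b) 0.0 (c) 0.6158

58C $\dfrac{0.15 \times 0.7}{0.3 \times 0.1 + 0.4 \times 0.2 + 0.15 \times 0.7} = 0.488$

61C (a) $0.98^3 = 0.9412$
(b) $0.9412 + 3 \times 0.02 \times 0.98^2 = 0.9988$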


62C $0.6^n < 0.05$ for $n \ge 6$

63C Pr(survives) = $0.92 + 0.08 \times 0.97 \times 0.99 = 0.9968$
On average, the pilot will be killed in one test flight in 315!

64C
          white   blue   green   orange   red
Small       18     78     108      54      42
Medium      30    130     180      90      70
Large       12     52      72      36      28

65C $0.55^3 = 0.166$
66C (a) $0.53 \times 0.42 = 0.223$
(b) $0.53 + 0.42 - 0.223 = 0.727$

67C $0.65 \times 0.94 \times 0.99 \times 0.97 = 0.587$.

Exercises . . .
3.1

A breakfast cereal manufacturer packs one of five pictures (a, b, c, d, e) in each


box of cereal. If you buy two boxes, what is the sample space for the random
experiment whose outcome is the two pictures in the boxes?

3.2

A small town has three grocery stores (1, 2 and 3). Four ladies living in this town
each randomly and independently pick a store in which to shop. Give the sample
space of the experiment which consists of the selection of the stores by the ladies.
Then define the events:
A: all the ladies choose Store 1
B: half the ladies choose Store 1 and half choose Store 2
C: all the stores are chosen (by at least one lady).

3.3 Let A, B and C be three arbitrary events. Find expressions for the events
(a) only A occurs
(b) both A and B but not C occur
(c) all three events occur
(d) at least one occurs
(e) at least two occur
(f) exactly one occurs
(g) exactly two occur
(h) no more than two occur
(i) none occur.

3.4 Let A and B be two events defined on a sample space S. Write down an expression
for each of the following events, express their probabilities in terms of $\Pr(A)$, $\Pr(B)$
and $\Pr(A \cap B)$, and evaluate their probabilities if $\Pr(A) = 0.3$, $\Pr(B) = 0.4$ and
$\Pr(A \cap B) = 0.2$:
(a) either A or B occurs
(b) both A and B occur
(c) A does not occur
(d) A occurs but B does not occur
(e) neither A nor B occurs
(f) exactly one of A and B occurs.

3.5 Show that Pr(A B) = 1 Pr(B) Pr(A B).


3.6 Prove that $\Pr(A \cap B) \le \Pr(A) \le \Pr(A \cup B)$ for any events A and B.
3.7 If A, B and C are events in a sample space S, which of the following assignments of
probabilities are impossible?
(a) $\Pr(A) = 0.7$, $\Pr(B) = 0.9$, $\Pr(C) = 0.3$, $\Pr(A \cap B) = 0.4$.
(b) $\Pr(A) = 0.2$, $\Pr(B) = 0.5$, $\Pr(C) = 0.3$, $\Pr(A \cap B) = 0.25$.
(c) $\Pr(A) = 0.3$, $\Pr(B) = 0.8$, $\Pr(C) = -0.1$, $\Pr(A \cap B) = 0.2$.
(d) $\Pr(A) = 0.3$, $\Pr(B) = 0.7$, $\Pr(C) = 0.8$, $\Pr(A \cap C) = 0.1$.
(e) $\Pr(A) = 0.8$, $\Pr(B) = 0.4$, $\Pr(C) = 0.5$, $\Pr(A \cup C) = 0.7$.

3.8

What is the probability that a six-digit telephone number has no repeated digits?
Do not allow the number to start with a zero.

3.9

A motor car manufacturer produces four different models, each with three levels of
luxury, and with five colour options. One example of each combination of model,
luxury level and colour is on display in a parking lot.
(a) How many cars are on display?
(b) An interested client parks his vehicle in the parking lot. A rock from an
explosion at a nearby construction site lands on one of the cars. What is the
probability that it lands on the client's car?
(c) What assumptions did you make in order to do part (b)?

3.10

A pack of cards like the one described in Example 13C is being used by four players
for a game of bridge, so each is dealt a hand of 13 cards. The king, queen and
jack are referred to as picture cards. Find the probability that a bridge hand
(13 cards) contains
(a) 3 spades, 4 diamonds, 1 heart and 5 clubs
(b) 3 aces and 4 picture cards.

3.11

If $\Pr(A) = 0.6$, $\Pr(B) = 0.15$, and $\Pr(B \mid \overline{A}) = 0.25$, find the following probabilities
(a) $\Pr(B \mid A)$
(b) $\Pr(A \mid B)$
(c) $\Pr(A \cup B)$
(d) $\Pr((A \cap \overline{B}) \cup (\overline{A} \cap B))$.

3.12 If A and B are mutually exclusive, what is Pr(A | B)?


3.13 If $A \cap B = \emptyset$, show that $\Pr(A \mid A \cup B) = \Pr(A)/(\Pr(A) + \Pr(B))$.
3.14 The probability that a student passes Statistics is 0.8 if he studies for the exam and
0.3 if he does not study. If 60% of the class studied for the exam, and a student
chosen at random from the class passes, what is the probability that he studied?



3.15

The probability that a cancer test will detect the disease in a person who has
cancer is 0.98. The probability that a person who does not have cancer will give a
positive reading on the test is 0.1 (i.e. the test says he has the disease even though
he has not). If 0.1 per cent of the population has cancer, what is the probability
that a person selected at random will in fact have cancer, given that he shows a
positive reading on the cancer test? Comment on your answer.

3.16

The probability that twins are identical is 0.7. Identical twins are always of the
same sex, while non-identical twins are of the same sex with probability 0.5. What
is the probability that twin boys are identical twins?

3.17 The sample space for the response of a voter's attitude towards a political issue has
three elementary events: $A_1$ = {in favour}, $A_2$ = {opposed}, $A_3$ = {undecided}.
Let B be the event that a voter is under 25 years of age. Given the following table
of probabilities, compute the probability that a voter is opposed to the issue, given
that he is under 25.

Event                   A1    A2    A3    B | A1   B | A2   B | A3
Probability of event    0.4   0.5   0.1   0.8      0.2      0.5


3.18 A and B are events such that $\Pr(A) = 0.6$, $\Pr(B \mid A) = 0.3$, and $\Pr(A \cup B) = 0.72$.
Are A and B independent, mutually exclusive, or both?
3.19

If the probability is 0.001 that a 20-watt bulb will fail a 10-hour test, what is the
probability that a sign constructed of 1000 bulbs will burn for 10 hours
(a) with no bulb failure?
(b) with one bulb failure?
(c) with k bulb failures?

3.20 Show that if events A and B are independent, then the following pairs of events
are also independent.
(a) $A$ and $\overline{B}$
(b) $\overline{A}$ and $\overline{B}$.
3.21 The events A, B and C are such that A and B are independent and B and C
are mutually exclusive. Their probabilities are $\Pr(A) = 0.3$, $\Pr(B) = 0.4$, and
$\Pr(C) = 0.2$. Calculate the probabilities of the following events.
(a) Both B and C occur.
(b) At least one of A and B occurs.
(c) B does not occur.
(d) All three events occur.
(e) $(A \cap B) \cup C$.

3.22 A satellite is to have a number of solar panels, which will function independently
of each other, and each will fail during the mission with probability 0.05. For
success, at least one must function until the end of the mission. How many solar
panels are necessary, if the probability of success must exceed 0.999?
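Questions of the "how many are needed" type (Exercise 3.22 and Example 62C are of this kind) can be checked by brute force: keep adding independent units until the probability that at least one succeeds exceeds the target. The following Python sketch is ours, with a hypothetical function name `units_needed`; it assumes 0 ≤ p_fail < 1 and target < 1, otherwise the loop would not terminate.

```python
def units_needed(p_fail, target):
    """Smallest n with 1 - p_fail**n > target, assuming independent failures.

    Requires 0 <= p_fail < 1 and target < 1 so that the loop terminates.
    """
    n = 1
    # Pr(at least one unit works) = 1 - Pr(all n units fail) = 1 - p_fail**n.
    while 1 - p_fail ** n <= target:
        n += 1
    return n


panels = units_needed(0.05, 0.999)   # solar panels of Exercise 3.22
missiles = units_needed(0.6, 0.95)   # Example 62C: each missile misses with probability 0.6
print(panels, missiles)
```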


Further exercises . . .
3.23

(a) In how many ways can the batting order for a cricket team (11 players) be
arranged?
(b) In how many ways can the team be arranged, given that three specific players
have definite positions in the batting order?
(c) In how many ways can the team be divided up into 2 teams of 5 players each,
and one player left out?
(d) What is the probability that the player left out is the captain of the team?

3.24 If seven diplomats were asked to line up for a group picture with the senior diplomat
in the centre, how many distinguishable arrangements are possible?
3.25 The telephone numbers for the University of Cape Town's Rondebosch campus all
start with 650 followed by a four digit number.
(a) How many different telephone numbers can be accommodated on this campus?
(b) What is the probability that a randomly selected number has its last three
digits (i) 000 (three zeros) (ii) all the same?
3.26 Suppose A and B are events in a sample space.
(a) If $A \cap B = B$, what is the numerical value of $\Pr(A \mid B)$?
(b) If $A \cap B = \emptyset$, what is $\Pr(A \mid B)$?
(c) If $A \cup B = A$, what is $\Pr(A \mid B)$?
3.27 Let A and B be two events defined on a sample space S.
(a) Write down an expression for each of the following events in terms of unions,
intersections and complements, and express their probabilities in terms of
$\Pr(A)$, $\Pr(B)$ and $\Pr(A \cap B)$.
(i) Both A and B occur.
(ii) At least one of A and B occurs.
(iii) Either A occurs, or B occurs, but not both.
(iv) A occurs but B does not occur.
(v) A occurs, or B does not occur.
(b) Now suppose that $\Pr(A) = \frac{1}{4}$, $\Pr(B) = \frac{1}{3}$ and that A and B are independent
events. Evaluate the probabilities in part (a).
3.28

(a) What is the probability of drawing exactly 1 spade in a bridge hand (as
defined in Exercise 3.10)?
(b) What is the probability of drawing at least 1 spade?
(c) What is the probability that a bridge hand contains 3 spades, 7 diamonds, 2
hearts and 1 club?
(d) What is the probability that a bridge hand contains both the ace and the
king of spades?

3.29 If A and B are events in S, and if $\Pr(A) = \frac{1}{3}$, $\Pr(B) = \frac{3}{4}$, and $\Pr(A \cup B) = \frac{11}{12}$,
find
(a) $\Pr(A \cap B)$
(b) $\Pr(A \mid B)$
(c) $\Pr(B \mid A)$.
3.30 Let A and B be two events in a sample space. Suppose $\Pr(A) = 0.4$ and
$\Pr(A \cup B) = 0.7$. Let $\Pr(B) = p$.
(a) For what value of p are A and B mutually exclusive?
(b) For what value of p are A and B independent?

3.31 There are n people in a room.


(a) What is the probability that at least two are born in the same sign of the
Zodiac? (Assume 12 signs, each of equal duration.)
(b) What is this probability if n = 13?
3.32

In how many ways can a student answer an eight-question, true-false examination


if
(a) he marks half the questions true and half the questions false?
(b) he marks no two consecutive answers the same?

3.33 A card is drawn from an ordinary pack of 52, looked at and replaced, and the pack
shuffled. How many times should this be done in order to have a 90% chance of
seeing the ace of spades at least once?
3.34

A class of 75 students is to be divided into three tutorial groups of sizes 24, 31


and 20 respectively.
(a) In how many ways can this be done?
(b) There are two brothers in the class. What is the probability that they are in
the same tutorial class?

3.35 (a) (i) How many arrangements of the letters "s t a t i s t i c s" are
possible if repeated letters such as "s" are indistinguishable from one
another?
(ii) If the letters are randomly arranged, what is the probability that the first
and last letters are both "i"?
(b) (i) How many different words (including nonsense words) can be made from
the letters "a s t r o n o m e r s"?
(ii) If the letters are randomly arranged, what is the probability that both
the letters "m o o n" and the letters "s t a r" appear in sequence
in the word?

3.36

For safety reasons, each of 1000 parts in a spacecraft is duplicated. The spacecraft
will fail in its mission if any component and its safety duplicate both fail. Each
component fails (independently of any other component) with probability 0.01.
What is the probability that the mission fails?

3.37 The probability that a B.Sc. student takes neither Statistics nor Chemistry is 0.3
and the probability that he takes Statistics but not Chemistry is 0.2. If B.Sc. students take Statistics and Chemistry independently of each other, what is
(a) the probability that a B.Sc. student takes Statistics?
(b) the probability that a B.Sc. student takes Chemistry but not Statistics?


3.38 Is it possible for events A and B to be defined on a sample space with the following
probabilities?
(a) Pr(A) = 0.5, Pr(B) = 0.8 and Pr(A | B) = 0.2
(b) $\Pr(A) = 0.5$, $\Pr(A \mid B) = 0.7$, $\Pr(A \cap B) = 0.3$ and $\Pr(A \cup B) = 0.6$.
3.39 (a) If A, B and D are three events in a sample space where $A \cap B = \emptyset$ and
$A \cup B = S$, show that
(i) $\Pr(D) = \Pr(D \mid A)\,\Pr(A) + \Pr(D \mid B)\,\Pr(B)$
(ii) $\Pr(A \mid D) = \dfrac{\Pr(A)\,\Pr(D \mid A)}{\Pr(D)}$.
(b) Two machines are producing the same item. Last week, Machine A produced
40% of the total output, and Machine B the remainder. On average, 10%
of the items produced by Machine A were defective, and 4% of the items
produced by Machine B were defective.
(i) What proportion of last week's entire production was defective?
(ii) If an item selected at random from the combined output produced last
week is found to be defective, what is the probability it came from Machine A?
3.40 The probability of passing Statistics is 0.1 if these exercises are not done, and 0.8 if
they are done. If 60% of students do these exercises, what is the probability that
a student has not done the exercises, given that he passes Statistics?
3.41

Which of the following pairs of events would you expect to be independent, which
mutually exclusive and which neither?
(a) studying Economics and being left-handed,
(b) owning a dog and paying vet's bills,
(c) the prices of shares in Anglovaal and Gold Fields (both in the mining house
sector of the Johannesburg Stock Exchange) both rising today,
(d) being a member of the Canoe Club and studying for a B.A.,
(e) buying sugar-free cooldrink and buying a cream doughnut for yourself.

3.42

An X-ray test is used to detect a disease that occurs, initially without any obvious
symptoms, in 3% of the population. The test has the following error rates: 7%
of people who are disease free have a positive reaction and 2% of the people who
have the disease have a negative reaction. A large number of people are screened at
random using the test, and those with a positive reaction are examined further.
(a) What proportion of people who have the disease are correctly diagnosed?
(b) What proportion of people with a positive reaction actually have the disease?
(c) What proportion of people with a negative reaction actually have the disease?
(d) What proportion of the tests conducted give the incorrect diagnosis?

Proofs of the counting rules . . .


3.43 Prove Counting rule 5; i.e. prove that the number of combinations (selections)
of r objects, selected from n objects, allowing repetitions, is $\binom{n+r-1}{r}$. [Hint:
consider r objects and $n - 1$ "marker" objects, and apply Counting rule 6. Let the
markers divide the objects up into the n types of objects.]

3.44 Prove Counting rule 6; i.e. prove that the number of distinguishable arrangements
of n objects, of which $n_1$ are of type 1 and $n_2$ of type 2, is given by $\binom{n}{n_1}$.

3.45 Prove Counting rule 7; i.e. prove that the number of combinations
of sizes $n_1, n_2, \ldots, n_k$ chosen from a set of n items is given by $\binom{n}{n_1\; n_2\; \ldots\; n_k}$.

3.46 Prove Counting rule 8; i.e. prove that the number of distinguishable arrangements
of n objects, of which $n_1$ are of type 1, $n_2$ of type 2, . . . , $n_k$ of type k is given by
$\binom{n}{n_1\; n_2\; \ldots\; n_k}$.

Solutions to exercises . . .
3.1 S = {aa, ab, . . . , de, ee}, 25 elementary events. 15 if one does not distinguish
between, for example, ab and ba
3.2 S = {1111, 1112, 1121, . . . , 3333}, 81 elementary events.
A = {1111}. B = {1122, 1212, 1221, 2211, 2121, 2112}
C = {1231, 1232, 1233, 1321, . . . , 3321}, 36 elementary events.
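3.3 (a) $A \cap \overline{(B \cup C)} = A \cap \overline{B} \cap \overline{C}$
(b) $(A \cap B) \cap \overline{C}$.
(c) $A \cap B \cap C$.
(d) $A \cup B \cup C$.
(e) $(A \cap B) \cup (A \cap C) \cup (B \cap C)$.
(f) $(A \cap \overline{(B \cup C)}) \cup (B \cap \overline{(A \cup C)}) \cup (C \cap \overline{(A \cup B)})$.
(g) $(A \cap B \cap \overline{C}) \cup (A \cap C \cap \overline{B}) \cup (B \cap C \cap \overline{A})$.
(h) $\overline{(A \cap B \cap C)}$
(i) $\overline{(A \cup B \cup C)}$.
3.4 (a) 0.5 (b) 0.2 (c) 0.7 (d) 0.1 (e) 0.5 (f) 0.3.

3.7 (a) impossible, $\Pr(A \cup B) > 1$
(b) impossible, $\Pr(A) < \Pr(A \cap B)$.
(c) impossible, $\Pr(C) < 0$
(d) possible
(e) impossible, $\Pr(A) > \Pr(A \cup C)$.
3.8 $9 \times (9)_5 / (9 \times 10^5) = 0.1512$.
3.9 (a) 60 (b) 1/61 (c) An equal probability of hitting any of the 61 cars!

3.10 (a) $\binom{13}{3}\binom{13}{4}\binom{13}{1}\binom{13}{5} \Big/ \binom{52}{13}$
(b) $\binom{4}{3}\binom{12}{4}\binom{36}{6} \Big/ \binom{52}{13}$
3.11 (a) 0.083 (b) 0.33 (c) 0.7 (d) 0.65.

3.12 0.
3.14 0.8.
3.15 0.5213.
3.16 0.8235.
3.17 0.2128.
3.18 Events A and B are independent.
3.19 (a) 0.3677 (b) 0.3681 (c) $\binom{1000}{k}\, 0.001^k\, 0.999^{1000-k}$.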


3.21 (a) 0 (b) 0.58 (c) 0.6 (d) 0 (e) 0.32.

3.22 3
3.23 (a) 11! (b) 8! (c) $\binom{11}{5\;5\;1}$ (d) $\binom{10}{5\;5} \Big/ \binom{11}{5\;5\;1} = 0.0909$.

3.24 6!

3.25 (a) $10^4 = 10\,000$ (b) (i) 10/10 000 = 0.001 (ii) (10 × 10)/10 000 = 0.01

3.26 (a) 1.0 (b) 0.0 (c) 1.0.

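3.27 (a) (i) $A \cap B$, $\Pr(A \cap B)$
(ii) $A \cup B$, $\Pr(A) + \Pr(B) - \Pr(A \cap B)$
(iii) $(A \cap \overline{B}) \cup (\overline{A} \cap B)$, $\Pr(A) + \Pr(B) - 2\Pr(A \cap B)$
(iv) $A \cap \overline{B}$, $\Pr(A) - \Pr(A \cap B)$
(v) $A \cup \overline{B}$, $1 + \Pr(A \cap B) - \Pr(B)$.
(b) (i) 1/12 (ii) 1/2 (iii) 5/12 (iv) 1/6 (v) 3/4.

3.28 (a) $\binom{13}{1}\binom{39}{12} \Big/ \binom{52}{13}$
(b) $1 - \binom{39}{13} \Big/ \binom{52}{13}$
(c) $\binom{13}{3}\binom{13}{7}\binom{13}{2}\binom{13}{1} \Big/ \binom{52}{13}$
(d) $\binom{2}{2}\binom{50}{11} \Big/ \binom{52}{13}$.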

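3.29 (a) 1/6 (b) 2/9 (c) 1/2.

3.30 (a) p = 0.3 (b) p = 0.5.

3.31 (a) $1 - (12)_n/12^n$ (b) 1.0 (certain event).

3.32 (a) $\binom{8}{4}$ (b) 2.

3.33 $(51/52)^n < 0.1$ for $n \ge 119$.

3.34 (a) $\binom{75}{24\;31\;20}$
(b) $\left[\binom{73}{22\;31\;20} + \binom{73}{24\;29\;20} + \binom{73}{24\;31\;18}\right] \Big/ \binom{75}{24\;31\;20}$

3.35 (a) (i) $\binom{10}{3\;3\;2} = 50\,400$ (ii) $\binom{8}{3\;3} \Big/ 50\,400 = 0.022$
(b) (i) $\binom{11}{2\;2\;2} = 4\,989\,600$ (ii) $\binom{5}{2\;2} \Big/ 4\,989\,600$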
3.36 0.095

3.37 (a) 0.4 (b) 0.3

3.38 (a) No, $\Pr[A \cup B] = 1.14 > 1$ (!)
(b) No. The value of $\Pr[A \cup B]$ from the first three statements is inconsistent with the fourth.
3.39 (b) (i) 6.4% (ii) 0.625

3.40 0.0769
3.41 (a) and (d) are independent, (b) and (c) are neither, and (e) is mutually exclusive
if you argue that a diet-conscious person won't buy a cream doughnut!


3.42 Let D be the event "having disease", and let N be the event "testing negative".
(a) $\Pr(\overline{N} \mid D) = 1 - \Pr(N \mid D) = 1 - 0.02 = 0.98$ (98% of those with the disease
have a positive reaction)
(b) $\Pr(D \mid \overline{N}) = 0.3022$ (30.22%)
(c) $\Pr(D \mid N) = 0.0007$ (0.07%)
(d) "Misdiagnosed" is the event $(N \cap D) \cup (\overline{N} \cap \overline{D})$, the union of two mutually exclusive events.
$$\Pr((N \cap D) \cup (\overline{N} \cap \overline{D})) = \Pr(N \mid D)\,\Pr(D) + \Pr(\overline{N} \mid \overline{D})\,\Pr(\overline{D})$$
$$= 0.02 \times 0.03 + 0.07 \times 0.97 = 0.0685 \quad (6.85\%)$$
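The figures quoted for Exercise 3.42 can be reproduced numerically. The short Python sketch below is an illustration only; the variable names are our own, and the three input rates are those given in the statement of the exercise.

```python
prevalence = 0.03        # Pr(D): proportion of the population with the disease
false_positive = 0.07    # Pr(positive | no disease)
false_negative = 0.02    # Pr(negative | disease)

# Total probability of a positive (and hence of a negative) reaction.
pr_pos = (1 - false_negative) * prevalence + false_positive * (1 - prevalence)
pr_neg = 1 - pr_pos

# Bayes' theorem for parts (b) and (c).
pr_d_given_pos = (1 - false_negative) * prevalence / pr_pos
pr_d_given_neg = false_negative * prevalence / pr_neg

# Part (d): the two kinds of misdiagnosis are mutually exclusive, so add them.
pr_misdiagnosed = false_negative * prevalence + false_positive * (1 - prevalence)

print(f"(b) {pr_d_given_pos:.4f}  (c) {pr_d_given_neg:.4f}  (d) {pr_misdiagnosed:.4f}")
```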


Chapter 4

RANDOM VARIABLES
KEYWORDS: Random variables, discrete and continuous random
variables, probability mass functions and probability density functions.

Words or numbers. . .
In Chapter 3, we defined a sample space as the set of all the elementary events that
are possible outcomes of a random experiment. Sometimes, we expressed these elementary events quantitatively (the length of time for which a light bulb lasts, the number of
items purchased by a customer, the proportion of voters who support a particular proposal), and sometimes we used verbal, qualitative descriptions of the elementary events
(for random experiments such as the state of the economy, the sex of an applicant for a
job, the colour of a vehicle ordered by a purchaser).
In order to manipulate the events defined on a sample space mathematically, it
is necessary to attach a numerical value to each elementary event. Frequently, the
elementary events are quantitative, and there is a natural and obvious way to assign
numbers to them: the survival time (in hours) of the light bulb, the count of items
purchased, the number of girls in families of four children.
However, if the elementary events are expressed qualitatively, we have to assign a
number to each elementary event. For example, the economy might be classified as being
in "recession", "stable" or "booming"; we could assign a 1 to the event "recession",
2 to the event "stable" and 3 to the event "booming". An applicant for an advertised
post could be male or female, and we could assign 0 to the event "male applicant" and
1 to the event "female applicant". To repeat the motivation for assigning numbers to
elementary events: it clears the way for us to develop a general mathematical theory
for handling the probabilities of events in a sample space.
Once all the elementary events in a sample space have numerical values assigned
to them, we follow the classic algebraic tradition and let X stand for the numerical
values of the elementary events. We then call X a random variable. X is a variable
because it can take on (or assume) different values. X is a random variable because
the particular value it takes on depends on the outcome of a random experiment.
By convention, statisticians use the capital letters near the end of the alphabet to
denote random variables. Their favourite choice is the letter X.


Definition of a random variable. . .


A random variable X is a numerical variable whose value is determined by the
outcome of a random experiment. Expressed somewhat differently, a random variable is
a function whose domain is the sample space, and whose range is the real line.
Once we are dealing with a random variable, events (which we defined in Chapter 3
as subsets of the sample space) become subsets of the real line, usually a set of points or
an interval. Thus $X = 1$, $X < 4$ and $5 \le X \le 10$ are events. And because they are
events, they have probabilities: we write $\Pr[X = 1]$, $\Pr[X < 4]$ and $\Pr[5 \le X \le 10]$,
and read them as "the probability that the random variable X takes on the value 1",
"the probability that X is less than 4", and "the probability that X lies in the interval
from 5 to 10, inclusive of the end points".
Statisticians have also adopted the convention that small letters (e.g. x, a, b) are
used to denote particular values that a random variable may assume. Thus Pr[X = x]
means the probability that the random variable X takes on some particular value x.
If there is only one random variable under consideration, and hence no ambiguity, we
abbreviate Pr[X = x] to Pr(x).
Example 1A: Consider tossing a die. We attach numerical values to the elementary
events in the sample space in the obvious way:

If the die is unbiased, then Pr[X = 1] = 1/6, i.e. Pr(1) = 1/6. We can write
Pr[X = x] = 1/6, or Pr(x) = 1/6, for x = 1, 2, 3, 4, 5 and 6.
It is important to realize that the definition of a random variable does not imply that
a different numerical value needs to be assigned to each elementary event. In fact, we
often want random variables in which the same value is assigned to different elementary
events. The following four examples illustrate this.
Example 2A: Consider the random experiment that consists of tossing an unbiased
coin 3 times (see also Example 3C of Chapter 3). If the random variable of interest is the
number of heads that occur, then we attach numerical values to the elementary events
as follows:
S = {  HHH   HHT   HTH   THH   HTT   THT   TTH   TTT  }
X =     3     2     2     2     1     1     1     0
The event X = 1 corresponds to the subset {HTT, THT, TTH} of S, and thus
Pr[X = 1] = 3/8. Also, clearly, Pr(0) = 1/8, Pr(2) = 3/8 and Pr(3) = 1/8.
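The table in Example 2A can be generated by enumeration: list the elementary events, apply the random variable (here a function that counts heads), and collect equal values. The Python sketch below is ours and assumes, as in the example, that all eight outcomes are equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# The sample space: all sequences of three tosses, e.g. "HHT".
outcomes = ["".join(t) for t in product("HT", repeat=3)]

# The random variable X maps each elementary event to its number of heads.
X = {s: s.count("H") for s in outcomes}

# Several elementary events share the same value of X; count them.
counts = Counter(X.values())
pmf = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(f"Pr[X = {x}] = {p}")
```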
Example 3B: Suppose you have 5 coins in your pocket (two 5c coins, two 10c coins
and a 50c coin) and you pull out two coins at random for a tip. Let the random variable
X be the amount of the tip. What are the possible values for X, and the probabilities
that X takes on these values?
We denote the coins $5_1$, $5_2$, $10_1$, $10_2$ and 50. The sample space S, and the numerical
values assigned to each elementary event, can be represented as:

S = {  5_1 5_2   5_1 10_1   5_1 10_2   5_2 10_1   5_2 10_2   10_1 10_2   5_1 50   5_2 50   10_1 50   10_2 50  }
X =      10         15         15         15         15          20        55       55        60        60

If we assume that each of the ten pairs of coins is equally likely, then Pr[X = 10] = 0.1,
Pr[X = 15] = 0.4, Pr[X = 20] = 0.1, Pr[X = 55] = 0.2, and Pr[X = 60] = 0.2.
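The same enumeration idea reproduces the probabilities of Example 3B. In this Python sketch (our own, with hypothetical variable names) the coins are paired by position in the list, so the two 5c coins and the two 10c coins remain distinct elementary objects even though they have equal value.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

coins = [5, 5, 10, 10, 50]   # values in cents of the five coins

# Choose two distinct coins; pairing by index keeps identical coins distinct.
pairs = list(combinations(range(len(coins)), 2))

# The random variable X is the total value of the pair drawn.
tips = Counter(coins[i] + coins[j] for i, j in pairs)
pmf = {x: Fraction(c, len(pairs)) for x, c in sorted(tips.items())}

for x, p in pmf.items():
    print(f"Pr[X = {x}] = {float(p)}")
```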
Example 4B: A shocking snooker player hits the balls around at random until he gets
one into a pocket. There are 15 red balls (valued at 1 point) and one each of the colours
yellow, green, brown, blue, pink and black (valued from 2 to 7 respectively). What is the
sample space for this random experiment? Let the random variable X be the score.
What values can X take on and with what probabilities?
Denoting the red balls red 1, red 2,. . ., red 15, the elementary events in the sample
space, and the X values assigned to them, are
S = {  red 1, . . . , red 15,  yellow,  green,  brown,  blue,  pink,  black  }
X =      1    . . .     1        2        3       4       5      6      7
and, assuming that each ball is equally likely to be pocketed, Pr[X = 1] = 15/21 and
Pr[X = 2] = Pr[X = 3] = . . . = Pr[X = 7] = 1/21.
Example 5C: A car salesperson is scheduled to see two clients today. She sells only
two models of cars, an executive (E) and a basic (B) model. Each executive model
sold earns the salesperson a commission of R2000, while each basic model sold earns
her only R1000. If the sale is lost (L), no commission is earned. Suppose Pr(E) = 0.2,
Pr(B) = 0.3, and Pr(L) = 0.5, and that sales are independent of each other. Let the
random variable X be the total commission earned by the salesperson today. What
values can X take on, and with what probabilities?

Discrete and continuous random variables. . .


Random variables fall into two categories: discrete and continuous. The mathematical treatment of these two types of random variables is very different, as you will
learn from the remainder of this chapter.
Discrete random variables take on isolated values along the real line, usually
(but by no means always) integer values. Examples of integer-valued discrete random
variables are:
the number of customers entering a store between 09h00 and 10h00
the number of occupied tables at a restaurant
the number of clients visited by a salesperson during a day
the number of applicants who respond to a job advertisement
Discrete random variables with values that are not integers do also exist! This
happens when the random variable consists of the ratio of two counts: for example, we
might measure the effectiveness of a television advertisement for a luxury car as the
number of cars sold divided by the number of times the advertisement was shown on
TV. Both the numerator and the denominator are then integers, so the random variable
must be a rational number. Detailed consideration of this type of random variable is
beyond the scope of this book, but there are a couple of simple examples!


In contrast to discrete random variables, a continuous random variable can
(conceptually, at least) be measured to any degree of accuracy; i.e. between every two
possible values x1 and x2 that the random variable can assume, there is another possible
value x3, between x1 and x2. The set of all possible values of a continuous random
variable is usually an interval of the real line. Examples of continuous random variables
are:

the distance a car travels on one litre of petrol

the proportion of gold in a sample of ore

the volume of milk that actually goes into a nominally one litre carton

the time that a customer waits in the queue at a fast food outlet

the direction of the wind at midday.

Example 6C: Which of the following are random variables? Which of the random
variables are continuous and which are discrete? Write down the set of values that each
random variable can take on.
(a) The number of customers arriving at a supermarket during the morning.
(b) The number of letters in the Greek alphabet.
(c) The opening price of gold in New York on Monday next week.
(d) The number of seats that will be sold for a performance of a play in a theatre with
a capacity of 328.
(e) The length of time you have to wait at an autobank.
(f) The ratio between the circumference and the diameter of a circle.
(g) The last digit of a randomly selected telephone number.

The distinction between discrete and continuous random variables is critical because
we develop different mathematical approaches for the two types of random variable.
(Interestingly, though, in advanced treatments of random variables, the mathematical approach for both types is again unified!) We describe discrete random variables
mathematically using probability mass functions. Continuous random variables are
described by probability density functions. We adopt the convention of using p(x)
to denote a probability mass function and f (x) for a probability density function.


CHAPTER 4. RANDOM VARIABLES

Probability mass functions. . .


A function p(x) is called a probability mass function (frequently abbreviated to
p.m.f.) if it satisfies the conditions PMF1, PMF2 and PMF3.
PMF1: p(x) is defined for all values of x, but $p(x) \neq 0$ only at a finite or countably
infinite set of points.
PMF2: all values of p(x) lie in the unit interval [0, 1], that is $0 \le p(x) \le 1$.
PMF3: $\sum p(x) = 1$, where the sum is taken over all values of x for which $p(x) \neq 0$.
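The conditions can also be checked mechanically. The following is a minimal sketch, assuming the probability mass function is supplied as a Python dictionary listing only the finitely many points where it is non-zero (so PMF1 holds by construction); the helper name `is_pmf` is hypothetical. Exact fractions are used so that the test for PMF3 is not disturbed by rounding.

```python
from fractions import Fraction


def is_pmf(p):
    """Check PMF2 and PMF3 for a pmf given as {x: p(x)}.

    Assumption: the dictionary lists the (finitely many) points where
    p(x) is non-zero, and p(x) = 0 everywhere else, so PMF1 holds.
    """
    in_unit_interval = all(0 <= prob <= 1 for prob in p.values())  # PMF2
    sums_to_one = sum(p.values()) == 1                             # PMF3
    return in_unit_interval and sums_to_one


# A candidate that satisfies the conditions: six equally likely values.
fair = {x: Fraction(1, 6) for x in range(1, 7)}
print(is_pmf(fair))   # True

# A candidate that fails PMF3: seven values of 1/6 sum to 7/6.
bad = {x: Fraction(1, 6) for x in range(1, 8)}
print(is_pmf(bad))    # False
```

A countably infinite support cannot be enumerated this way; there the sum in PMF3 has to be evaluated analytically.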
We now consider several examples of probability mass functions.
Example 7A: An unbiased die is rolled and the random variable X consists of the
number of dots appearing on the upturned face. Find the probability mass function for
this random variable.
Simply letting Pr[X = x] = p(x) gives us the required probability mass function.
Because the die is unbiased, Pr[X = 1] = p(1) = 1/6, and similarly p(2) = p(3) = p(4) =
p(5) = p(6) = 1/6. All other values of X represent impossible events; for example,
Pr[X = -1] = p(-1) = 0 (you cannot get -1 when you toss a die!), p(0) = 0, p(113) =
0, p(187) = 0, and so on. Defined in this way, p(x) is non-zero at only six isolated points,
and zero for all other values of x, thus satisfying condition PMF1. PMF2 is satisfied
because p(x) is either 0 or 1/6 (both of which lie in the closed interval [0, 1]), and for PMF3
we note that
\[ \sum_{x=1}^{6} p(x) = 1/6 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6 = 1. \]

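The probability mass function that describes tossing a single die can therefore be written
as
\[ p(x) = \begin{cases} 1/6 & x = 1, 2, 3, 4, 5, 6 \\ 0 & \text{all other values of } x. \end{cases} \]
[Figure: bar graph of p(x) against x; the vertical axis runs from 0 to 1/6.]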

Example 8B: "Heavily-backed favourite Enforce came through along the inside, but was
overwhelmed by Susan's Dream, quoted at 9-1." Express the anticipated performance of
the filly Susan's Dream in this horse race as a probability mass function.
Let X = 0 describe the event that Susan's Dream loses the race and X = 1 the event
that she wins. The quoted odds of 9-1 mean that the probability of losing is estimated
by the bookmaker as 9 times the probability of winning. Thus Pr[X = 0] = 9/10 and
Pr[X = 1] = 1/10, so that
\[ p(x) = \begin{cases} 9/10 & x = 0 \\ 1/10 & x = 1 \\ 0 & \text{all other values of } x. \end{cases} \]


PMF1 is satisfied, because p(x) is non-zero at only two points. Both values of p(x) lie
in the unit interval, so PMF2 is satisfied. The two values of p(x) add to one, so PMF3
is satisfied.
Example 9C: The Minister of Environment Affairs has to decide on a fishing quota for
the forthcoming season. Currently, the biomass of fish is estimated to be 20 m tonnes.
The fish may have a good breeding season (with probability 0.3) and produce 10 m tonnes
of young, or have a bad breeding season and produce only 1 m tonnes. A so-called warm-water event may occur with probability 0.1, and kill 15 m tonnes of fish; otherwise
1 m tonnes of fish will die. Find the probability mass function for X, the biomass of
fish before setting the quota (assuming all events are independent). If the minister bases
his decision on a policy that the biomass must remain 10 m tonnes or more with
probability 0.8, what should his decision be?
Example 10C: The hostile merger bid by Minorco on Consgold in 1989 was, at one
point, considered highly likely to fail by the financial media. They quoted a 12-1 chance
of failure. Express the anticipated outcome of the merger as a probability mass function.
For a discrete random variable X, the probability of the event $a \le X \le b$ is found
by summing the relevant values of the probability mass function:
\[ \Pr[a \le X \le b] = \sum_{x=a}^{b} p(x). \]
Be careful in your handling of $\le$ and $<$, and $\ge$ and $>$: if X assumes only
integer values, then
\[ \Pr[a < X < b] = \sum_{x=a+1}^{b-1} p(x). \]
Also, if you have to find $\Pr[X \ge b]$, the lower limit of the summation is b, but the upper
limit is the largest value of x for which p(x) is non-zero. You need this information for
the following examples.
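The effect of the inequality type can be seen in a short sketch. It assumes an integer-valued random variable and, purely for illustration, uses the function p(x) = x/15 on x = 1, ..., 5 (the one examined in Example 11A); the helper `prob_between` and its `strict` flag are hypothetical names.

```python
from fractions import Fraction


def p(x):
    """Illustrative pmf: p(x) = x/15 for x = 1, ..., 5, and 0 otherwise."""
    return Fraction(x, 15) if x in (1, 2, 3, 4, 5) else Fraction(0)


def prob_between(a, b, strict=False):
    """Pr[a <= X <= b], or Pr[a < X < b] when strict=True (integer-valued X)."""
    lo, hi = (a + 1, b - 1) if strict else (a, b)
    return sum((p(x) for x in range(lo, hi + 1)), Fraction(0))


print(prob_between(2, 4))               # 3/5  = p(2) + p(3) + p(4)
print(prob_between(2, 4, strict=True))  # 1/5  = p(3) only
print(prob_between(3, 5))               # 4/5  = Pr[X >= 3], since 5 is the largest x with p(x) > 0
```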
Example 11A:
(a) Check that the function
\[ p(x) = \begin{cases} x/15 & x = 1, 2, 3, 4, 5 \\ 0 & \text{otherwise} \end{cases} \]
satisfies the conditions for being the probability mass function of some random
variable X. Sketch p(x).
(b) Find $\Pr[2 \le X \le 4]$.
(c) Find $\Pr[X \ge 3]$.



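[Figure: bar graph of p(x) against x; the vertical axis runs from 0.0 to 0.3.]

(a) PMF1: $p(x) \neq 0$ for only five values of x, and p(x) is defined for all values of x.
PMF2: all values of p(x) lie in the interval [0, 1].
PMF3: $\sum_{x=1}^{5} p(x) = \sum_{x=1}^{5} \frac{x}{15} = \frac{1+2+3+4+5}{15} = 1$.
(b) $\Pr[2 \le X \le 4] = \sum_{x=2}^{4} p(x) = \sum_{x=2}^{4} \frac{x}{15} = \frac{2+3+4}{15} = \frac{3}{5}$.
(c) $\Pr[X \ge 3] = \sum_{x=3}^{5} p(x) = \sum_{x=3}^{5} \frac{x}{15} = \frac{3+4+5}{15} = \frac{4}{5}$.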

Example 12B: Show that the function
\[ p(x) = \begin{cases} \left(\tfrac{1}{2}\right)^{x} & x = 1, 2, 3, \ldots \\ 0 & \text{otherwise} \end{cases} \]
is a probability mass function.
PMF1: p(x) is defined for all x, and is non-zero on the set of positive integers
{1, 2, 3, ...}, a countably infinite set (see footnote 1).
PMF2: p(x) takes on the values 0, 1/2, 1/4, 1/8, ..., all of which lie in the unit interval.
PMF3:
\[ \sum_{x=1}^{\infty} p(x) = \sum_{x=1}^{\infty} \left(\tfrac{1}{2}\right)^{x} = \tfrac{1}{2} + \tfrac{1}{4} + \tfrac{1}{8} + \cdots = \frac{1/2}{1 - 1/2} = 1. \]
(Recall that the sum to infinity of a geometric progression is given by $a/(1 - r)$. Here
$a = \tfrac{1}{2}$ and $r = \tfrac{1}{2}$.)
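The convergence of the sum in PMF3 can be illustrated numerically. This sketch (the function name `partial_sum` is an assumption made for illustration) computes exact partial sums of p(x) = (1/2)^x and shows that the shortfall from 1 halves with every extra term.

```python
from fractions import Fraction


def partial_sum(n):
    """Sum of p(x) = (1/2)**x for x = 1, ..., n, computed exactly."""
    return sum((Fraction(1, 2) ** x for x in range(1, n + 1)), Fraction(0))


for n in (1, 2, 5, 10):
    s = partial_sum(n)
    print(n, s, 1 - s)   # the remainder 1 - s equals (1/2)**n
```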

Example 13C: Find the sample space for the random experiment which consists of
rolling a pair of dice. Find the probability mass function for the random variables X,
defined to be the sum of the values on the dice, and Y, defined to be the product of the values.
Find $\Pr[X \ge 10]$ and $\Pr[Y \ge 13]$.

Footnote 1: A set is said to be countably infinite if there is an orderly way of setting about counting its members.
The set of integers is a countably infinite set. However, the set $\{x \mid 0 \le x \le 1\}$, the unit interval, is non-countable: no matter what system you use to count the numbers, you always leave out infinitely
many!


Example 14C:
(a) Show that the function
\[ p(x) = \begin{cases} \binom{3}{x}\, 0.5^{3} & x = 0, 1, 2, 3 \\ 0 & \text{otherwise} \end{cases} \]
satisfies the conditions for being a probability mass function.
(b) Show that it is the probability mass function for the random variable X, the
number of heads obtained when three coins are flipped.
(c) Find $\Pr[X \le 2]$.
Example 15C: Which of the following functions satisfy the conditions for being a
probability mass function?
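(a)
\[ p(x) = \begin{cases} 0.25\,(0.75)^{x} & x = 0, 1, 2, \ldots \\ 0 & \text{elsewhere} \end{cases} \]
(b)
\[ p(x) = \begin{cases} \tfrac{1}{5}(2x - 3) & x = 1, 2, 3, 4, 5 \\ 0 & \text{elsewhere} \end{cases} \]
(c)
\[ p(x) = \begin{cases} 0.3 & x = 1 \\ 0.4 & x = 2 \\ 0.2 & x = 3 \\ 0.1 & x = 4 \\ 0.0 & \text{otherwise} \end{cases} \]
(d)
\[ p(x) = \begin{cases} \binom{4}{x}\binom{3}{2-x} \Big/ \binom{7}{2} & x = 0, 1, 2 \\ 0 & \text{otherwise} \end{cases} \]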

Example 16C:
(a) For what value of k will
\[ p(x) = \begin{cases} \dfrac{k}{x!} & x = 0, 1, 2, 3, 4 \\ 0 & \text{otherwise} \end{cases} \]
be a probability mass function?
(b) Find $\Pr[X < 2]$.
Example 17C: 10% of the customers entering a supermarket purchase a particular
brand of margarine. A market researcher wishes to interview a sample of these customers.
As people exit the supermarket she asks them if they have purchased this margarine.
Let the random variable X be the number of people she approaches before she finds
her first customer who has purchased the margarine. Find the probability mass function
for X.


Bar graphs. . .
A probability mass function is conveniently plotted by means of a bar graph. This
gives an easily interpretable visual impression of the shape of the distribution of probabilities associated with the random variable. The example demonstrates the method.
Example 18A: Plot a bar graph for the probability mass function
\[ p(x) = \begin{cases} \binom{4}{x}\, 0.6^{x}\, 0.4^{4-x} & x = 0, 1, 2, 3, 4 \\ 0 & \text{otherwise} \end{cases} \]
of the random variable X.
We compute the following probabilities:

x      0      1      2      3      4
p(x)   0.026  0.154  0.346  0.346  0.130

and plot them as a bar graph. The heights of the lines are equal to the probabilities of
the events X = 0, X = 1, X = 2, X = 3 and X = 4. Naturally, the sum of the heights
of the bars must be equal to one.
[Figure: bar graph of p(x) against x for x = 0, 1, ..., 4; the vertical axis runs from 0.0 to 0.4.]
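The table of probabilities and a rough bar graph can be produced with a few lines of code. This is only a sketch: the function name `pmf_18a`, the text-based bars and the scale of 50 characters per unit of probability are all illustrative choices, not part of the method described above.

```python
from math import comb


def pmf_18a(x, n=4, p=0.6):
    """p(x) = C(n, x) * p**x * (1 - p)**(n - x) for x = 0, ..., n; 0 otherwise."""
    if x not in range(n + 1):
        return 0.0
    return comb(n, x) * p**x * (1 - p) ** (n - x)


table = {x: pmf_18a(x) for x in range(5)}

# One row per value of x: the probability and a bar whose length is proportional to it.
for x, prob in table.items():
    print(f"{x}  {prob:.3f}  {'#' * round(prob * 50)}")

print(round(sum(table.values()), 10))   # the bar heights sum to one
```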

Example 19C: A random variable X has probability mass function
\[ p(x) = \begin{cases} \dfrac{12}{25x} & x = 1, 2, 3, 4 \\ 0 & \text{otherwise} \end{cases} \]
Plot the bar graph.

Probability density functions. . .


Continuous random variables are represented by probability density functions.
The mathematical treatment of probability density functions is very different to that of
probability mass functions: having separate notations for them reminds us to keep the
mathematical differences in view. We use p(x) for probability mass functions and f (x)
for probability density functions.
A function f (x) is called a probability density function (sometimes abbreviated
to p.d.f.) if it satisfies the conditions PDF1, PDF2 and PDF3.
PDF1: f (x) is defined for all values of x.

PDF2: all values of f(x) lie in the interval $[0, \infty)$; that is, $0 \le f(x) < \infty$.
PDF3: $\int_{-\infty}^{\infty} f(x)\,dx = 1$, i.e. the area under the curve of a probability density
function is one.

Frequently, the function f(x) is non-zero only on some interval, say (a, b) (this
interval may also be closed, or one of the limits may be infinity). It is then only necessary to check PDF3 on this interval: $\int_a^b f(x)\,dx = 1$. This is obvious, because then
\[ \int_{-\infty}^{\infty} f(x)\,dx = \int_{-\infty}^{a} 0\,dx + \int_{a}^{b} f(x)\,dx + \int_{b}^{\infty} 0\,dx = \int_{a}^{b} f(x)\,dx \]
because f(x) = 0 outside the interval (a, b).
We have seen that probabilities for discrete random variables are found by calculating the values of the probability mass function p(x) at the points of interest and summing
them. However, for continuous random variables, the probability density function f(x)
is constructed in such a way that probabilities of events are found by integration: the
area under the graph between the numbers c and d represents the probability of the
event "the random variable X lies between c and d", i.e. $\Pr(c \le X \le d) = \int_c^d f(x)\,dx$.
This is illustrated below:
[Figure: graph of a probability density function f(x) against x, with the area under the curve between c and d shaded; the vertical axis runs from 0.0 to 0.4.]
This, in fact, motivates condition PDF3 of our definition of a probability density
function. The random variable X must lie between $-\infty$ and $\infty$. Thus the event
$-\infty < X < \infty$ is a certain event, and therefore its probability must be equal to
1, i.e.
\[ \Pr(-\infty < X < \infty) = \int_{-\infty}^{\infty} f(x)\,dx = 1. \]
We now consider several examples of probability density functions.


Example 20A: Show that
\[ f(x) = \begin{cases} 4 & 0 \le x \le 0.25 \\ 0 & \text{otherwise} \end{cases} \]
is a probability density function. Sketch the probability density function.
PDF1: f(x) is defined for all x.
PDF2: f(x) takes on only the values 0 and 4, both of which are non-negative.
PDF3: $\int_0^{0.25} 4\,dx = [4x]_0^{0.25} = 1$.



[Figure: graph of f(x) against x; the vertical axis runs from 0 to 4 and the horizontal axis is marked at intervals of 0.25 up to 1.00.]

Example 21B: In a certain risky sector of the share market, the proportion of companies that survive (i.e. are not delisted) a year is a continuous random variable lying in
the interval from zero to one. A statistician examines the data collected over past years
and suggests that the function
\[ f(x) = \begin{cases} 20x^{3}(1 - x) & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases} \]
might be useful in modelling X, the annual proportion of companies that survive.
(a) Check that f (x) is a probability density function.
(b) What is the probability that between 30% and 50% of the companies survive a
year?
(c) What is the probability that less than 10% of the companies survive a year?
(a) PDF1 (f(x) is defined for all x), and PDF2 ($f(x) \ge 0$) are satisfied. To check
PDF3,
\[ \int_0^1 20x^{3}(1 - x)\,dx = \int_0^1 (20x^{3} - 20x^{4})\,dx = \left[5x^{4} - 4x^{5}\right]_0^1 = 1, \]
as required.
(b) The probability that between 30% and 50% survive is
\[ \Pr[0.3 < X < 0.5] = \int_{0.3}^{0.5} 20x^{3}(1 - x)\,dx = \left[5x^{4} - 4x^{5}\right]_{0.3}^{0.5} = 0.157 \]
(c) The probability that less than 10% survive is
\[ \Pr[X < 0.1] = \int_{0}^{0.1} 20x^{3}(1 - x)\,dx = \left[5x^{4} - 4x^{5}\right]_{0}^{0.1} = 0.00046 \]
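When an antiderivative is awkward to find, the same probabilities can be approximated numerically. The sketch below is not part of the worked solution: it swaps in composite Simpson's rule (with an arbitrarily chosen 1000 subintervals) for the exact integration, and the names `f` and `simpson` are illustrative.

```python
def f(x):
    """Density of Example 21B: 20 x^3 (1 - x) on [0, 1], and 0 elsewhere."""
    return 20 * x**3 * (1 - x) if 0 <= x <= 1 else 0.0


def simpson(g, a, b, n=1000):
    """Approximate the integral of g over [a, b] by composite Simpson's rule (n even)."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3


print(round(simpson(f, 0, 1), 6))       # PDF3: total area under the curve
print(round(simpson(f, 0.3, 0.5), 3))   # (b) Pr[0.3 < X < 0.5]
print(round(simpson(f, 0, 0.1), 5))     # (c) Pr[X < 0.1]
```

The three values agree with the exact answers 1, 0.157 and 0.00046 obtained above by integration.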

Example 22B: A remote country service station is supplied with petrol once a week.
The weekly demand for petrol (measured in 1000s of litres) is a random variable with
probability density function
\[ f(x) = \begin{cases} \dfrac{1}{2500}(10 - x)^{3} & 0 \le x \le 10 \\ 0 & \text{otherwise} \end{cases} \]
(a) Show that f(x) is a probability density function.


(b) What is the probability that between 3 and 5 thousand litres of petrol are sold in
a week?
(c) If less than 2 thousand litres are sold in a week, the petrol company does not
bother to deliver a supply. What is the probability of this event?
(d) If the service station has a 7 thousand litre tank, what is the probability that it
runs out of petrol in a week, assuming that it started the week full?
(e) What size tank is required in order to be 98% certain that weekly demand can be
met?
(f) What is the probability of selling exactly 5 thousand litres in a week?
(a) Checking the three conditions:
PDF1: f(x) is defined for all x.
PDF2: $10 - x$ is non-negative for x in the interval [0, 10], hence $f(x) \ge 0$.
PDF3:
\[ \frac{1}{2500}\int_0^{10} (10 - x)^{3}\,dx = \frac{1}{2500}\left[-\frac{1}{4}(10 - x)^{4}\right]_0^{10} = \frac{1}{2500}\cdot\frac{1}{4}\cdot 10^{4} = 1 \]
(b) The probability that sales lie between 3000 and 5000 litres is
\[ \Pr[3 \le X \le 5] = \frac{1}{2500}\int_3^{5} (10 - x)^{3}\,dx = \frac{1}{2500}\left[-\frac{1}{4}(10 - x)^{4}\right]_3^{5} = \frac{1}{2500}\left(-\frac{1}{4}\right)(5^{4} - 7^{4}) = 0.1776 \]
(c) The probability that sales are less than 2000 litres is
\[ \Pr[0 \le X \le 2] = \frac{1}{2500}\left[-\frac{1}{4}(10 - x)^{4}\right]_0^{2} = \frac{1}{2500}\left(-\frac{1}{4}\right)(8^{4} - 10^{4}) = 0.5904 \]
(d) The probability that sales exceed 7000 litres is
\[ \Pr[X \ge 7] = \frac{1}{2500}\int_7^{10} (10 - x)^{3}\,dx = \frac{1}{2500}\cdot\frac{1}{4}\cdot 3^{4} = 0.0081 \]
Demand exceeds capacity with probability 0.0081; on average, the tank is emptied
once in every 123 weeks.
(e) We want $\Pr[0 \le X \le c] = 0.98$, or equivalently $\Pr[c \le X \le 10] = 0.02$: that is
\[ \frac{1}{2500}\int_c^{10} (10 - x)^{3}\,dx = \frac{1}{2500}\cdot\frac{1}{4}(10 - c)^{4} = 0.02 \]
Thus $c = 10 - (0.02 \times 10\,000)^{1/4} = 10 - 3.76 = 6.24$. We need a 6240 litre tank.
(f) The probability of selling exactly 5000 litres may be expressed as
\[ \Pr[5 \le X \le 5] = \int_5^{5} f(x)\,dx = 0. \]
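All six parts of Example 22B use the same antiderivative, so they can be checked from one function. The sketch below assumes the cumulative form F(x) = Pr[X ≤ x] = 1 − (10 − x)^4 / 10 000 on [0, 10], which follows from the integral in part (a); the names `cdf` and `tank` are illustrative.

```python
def cdf(x):
    """Pr[X <= x] for Example 22B: 1 - (10 - x)^4 / 10^4 on [0, 10]."""
    if x <= 0:
        return 0.0
    if x >= 10:
        return 1.0
    return 1 - (10 - x) ** 4 / 10_000


print(round(cdf(5) - cdf(3), 4))   # (b) 0.1776
print(round(cdf(2), 4))            # (c) 0.5904
print(round(1 - cdf(7), 4))        # (d) 0.0081

# (e) Solve (10 - c)^4 / 10^4 = 0.02 for the tank size c (in thousands of litres).
tank = 10 - (0.02 * 10_000) ** 0.25
print(round(tank, 2))              # 6.24

print(cdf(5) - cdf(5))             # (f) 0.0: a single point carries no probability
```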


This is a general principle for continuous variables: the probability of the
random variable taking on a particular value exactly is zero. This seems counterintuitive, but is due to the fact that our ability to measure is always discrete (for
example, digital petrol pumps measure to the nearest tenth of a litre). Continuous
random variables are essentially unobservable: they are an abstract mathematical concept that is
useful only because it is convenient.
As a corollary of the above, note that for continuous random variables
\[ \Pr[c \le X \le d] = \Pr[c < X \le d] = \Pr[c \le X < d] = \Pr[c < X < d] = \int_c^{d} f(x)\,dx. \]
(This is not true of discrete random variables, where one has to be alert to the type of
inequality.)
Examples 21B and 22B showed how a random variable and a probability density
function could be used to model a practical problem. The particular probability
density functions that we used were chosen to make the integration trivial, and would
certainly be poor representations of reality in both situations. In the next chapter we
will be considering various probability mass functions and probability density functions
which have proved themselves useful in practice as models of real-world phenomena.
Example 23B:
(a) Could the following function serve as a probability density function for some random variable X?
\[ f(x) = \begin{cases} 6x(1 - x) & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases} \]
[Figure: graph of f(x) = 6x(1 - x) against x for x between 0 and 1; the vertical axis runs from 0 to 1.5.]

(b) What is the probability that
(i) $0 \le X \le 0.2$?
(ii) $0.4 \le X \le 0.6$?


(a) We must check that the three conditions of our definition are satisfied.
PDF1: f(x) is defined for all x.
PDF2: All values f(x) lie in the set $\{y \mid 0 \le y \le 1.5\}$, which contains no
negative numbers.

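PDF3: We need to check that the area under the curve between 0 and 1 is
equal to 1:
\[ \int_{-\infty}^{\infty} f(x)\,dx = \int_0^1 6x(1 - x)\,dx = 6\int_0^1 (x - x^{2})\,dx = 6\left[\frac{1}{2}x^{2} - \frac{1}{3}x^{3}\right]_0^1 = 6\left(\frac{1}{2} - \frac{1}{3}\right) = 1. \]
The conditions are satisfied.
(b) (i) $\Pr[0 \le X \le 0.2] = 6\int_0^{0.2} (x - x^{2})\,dx = 0.104$
(ii) $\Pr[0.4 \le X \le 0.6] = 6\int_{0.4}^{0.6} (x - x^{2})\,dx = 0.296$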

Note that there is no requirement for the graph of a probability density function f (x)
to be a smooth curve such as this one:
[Figure: graph of a smooth probability density function f(x) against x on the interval 0.0 to 1.0; the vertical axis runs from 0 to 2.]

In fact, a great variety of shapes are possible. The only restrictions are that f(x) must be
non-negative, and that the area under the curve must be equal to one. It is important to grasp
that the actual values of f(x) (the height of the curve at value x) cannot be interpreted
as being the probability that the random variable X is equal to x. This interpretation was
possible with graphs of the probability mass functions of discrete random variables. For
continuous random variables, probabilities are computed by integrating the probability
density function.
[Figures: several further probability density functions f(x) of differing shapes, each plotted against x on the interval 0.0 to 1.0.]


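Example 24B: Let X be a random variable with probability density function
\[ f(x) = \begin{cases} k e^{-\frac{1}{2}x} & 0 \le x < \infty \\ 0 & \text{otherwise} \end{cases} \]
What value must k assume?
To make f(x) a density function, k must be chosen so that
\[ \int_{-\infty}^{\infty} f(x)\,dx = 1, \]
i.e.
\[ \int_{-\infty}^{\infty} f(x)\,dx = \int_0^{\infty} k e^{-\frac{1}{2}x}\,dx = \left[-2k e^{-\frac{1}{2}x}\right]_0^{\infty} = 2k = 1. \]
Thus $k = \tfrac{1}{2}$.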

A selection of examples. . .
Example 25C:
(a) Find the value of k, so that the function
\[ f(x) = \begin{cases} k(x^{2} - 1) & 1 \le x \le 3 \\ 0 & \text{otherwise} \end{cases} \]
may serve as a probability density function.
(b) Find the probability that X lies between 2 and 3.


Example 26C: Verify that each of the following functions satisfies the conditions for
being either a probability mass function or probability density function.
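(a)
\[ p(x) = \begin{cases} x/6 & x = 1, 2, 3 \\ 0 & \text{otherwise} \end{cases} \]
(b)
\[ p(x) = \begin{cases} \binom{4}{x}\left(\tfrac{1}{2}\right)^{4} & x = 0, 1, 2, 3, 4 \\ 0 & \text{otherwise} \end{cases} \]
(c)
\[ f(x) = \begin{cases} 1 & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases} \]
(d)
\[ f(x) = \begin{cases} |x| & -1 \le x \le 1 \\ 0 & \text{otherwise} \end{cases} \]
(e)
\[ f(x) = \begin{cases} -\log x & 0 < x \le 1 \\ 0 & \text{otherwise} \end{cases} \]
(f)
\[ p(x) = \begin{cases} e^{-1}/x! & x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases} \]
(g)
\[ p(x) = \begin{cases} 1/n & x = 1, 2, 3, \ldots, n \\ 0 & \text{otherwise} \end{cases} \]
(h)
\[ f(x) = \begin{cases} \tfrac{1}{2}\sin x & 0 \le x \le \pi \\ 0 & \text{otherwise} \end{cases} \]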
Example 27C: The probability density function of a random variable X is given by
\[ f(x) = \begin{cases} kx(1 - x^{2}) & 0 \le x \le 1 \\ 0 & \text{elsewhere} \end{cases} \]
(a) Show that the value of k must be 4.
(b) Calculate $\Pr[0 < X < \tfrac{1}{2}]$.
(c) Find the value of A so that $\Pr[0 < X < A] = \tfrac{1}{2}$.
Example 28C: For what values of A can p(x) be a probability mass function?
\[ p(x) = \begin{cases} (1 - A)/4 & x = 0 \\ (1 + A)/2 & x = 1 \\ (1 - A)/4 & x = 2 \\ 0 & \text{otherwise} \end{cases} \]


Example 29C: A small pool building company is equally likely to be able to complete
2 or 3 pool contracts each month. The company receives between 1 and 4 contracts
to build pools each month, with probabilities Pr(1) = 0.1, Pr(2) = 0.2, Pr(3) = 0.5,
Pr(4) = 0.2. At the beginning of this month the company has two contracts carried
forward from the previous month. The random variable X of interest is the number of
contracts to be carried forward to next month. Find the probability mass function of X.
In particular, what is the probability that no contracts will be carried forward to next
month? Assume that the number of contracts is independent of the number of pools
completed. Also, to simplify the problem, assume that the contracts for a month are
made at the beginning of the month.

Solutions to examples. . .
5C The sample space, numerical values for the elementary events and their associated
probabilities are:
S  = {  EE    EB    EL    BB    BE    BL    LL    LE    LB  }
X  =    4000  3000  2000  2000  3000  1000  0     2000  1000
Pr =    0.04  0.06  0.10  0.09  0.06  0.15  0.25  0.10  0.15

The probability mass function is therefore given by
\[ p(x) = \begin{cases} 0.25 & x = 0 \\ 0.30 & x = 1000 \\ 0.29 & x = 2000 \\ 0.12 & x = 3000 \\ 0.04 & x = 4000 \\ 0 & \text{otherwise} \end{cases} \]
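The table for 5C can be reproduced by enumerating the nine elementary events. The following sketch assumes, as the example states, that the two sales are independent, so the probability of each pair is the product of the individual probabilities; the variable names are illustrative.

```python
from collections import defaultdict
from itertools import product

outcome_prob = {"E": 0.2, "B": 0.3, "L": 0.5}   # Pr(E), Pr(B), Pr(L)
commission = {"E": 2000, "B": 1000, "L": 0}     # rand earned per outcome

# Accumulate Pr[X = x] over all ordered pairs of outcomes (first client, second client).
pmf = defaultdict(float)
for first, second in product(outcome_prob, repeat=2):
    x = commission[first] + commission[second]
    pmf[x] += outcome_prob[first] * outcome_prob[second]

for x in sorted(pmf):
    print(x, round(pmf[x], 2))
```

Events with the same total commission (for example EL, BB and LE, which all give 2000) are merged automatically, which is exactly the step from the table of elementary events to p(x).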

6C (b) & (f) are not random variables, (a), (d) & (g) are discrete, (c) & (e) are
continuous.
9C

\[ p(x) = \begin{cases} 0.07 & x = 6 \\ 0.03 & x = 15 \\ 0.63 & x = 20 \\ 0.27 & x = 29 \\ 0 & \text{otherwise} \end{cases} \]
Set the quota at 10 m tonnes.


10C Let X = 0 be fail, X = 1 be succeed.
p(x) = 12/13 x = 0
= 1/13 x = 1
=0
otherwise
13C The sample space, numerical values for the elementary events and their associated
probabilities are:
X            2     3     4     5     6     7     8     9     10    11    12
Probability  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Y       1     2     3     4     5     6     8     9     10
Pr(Y)   1/36  2/36  2/36  3/36  2/36  4/36  2/36  1/36  2/36

Y       12    15    16    18    20    24    25    30    36
Pr(Y)   4/36  2/36  1/36  2/36  2/36  2/36  1/36  2/36  1/36

$\Pr[X \ge 10] = 0.1667$, $\Pr[Y \ge 13] = \tfrac{13}{36}$.
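The two distributions in 13C can be checked by listing all 36 equally likely rolls. This is a sketch under that equal-likelihood assumption (fair dice); the names `sum_counts` and `prod_counts` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))      # the 36 elementary events
sum_counts = Counter(a + b for a, b in rolls)     # how many rolls give each value of X
prod_counts = Counter(a * b for a, b in rolls)    # how many rolls give each value of Y

pr_x_ge_10 = Fraction(sum(c for v, c in sum_counts.items() if v >= 10), 36)
pr_y_ge_13 = Fraction(sum(c for v, c in prod_counts.items() if v >= 13), 36)

print(pr_x_ge_10)   # 1/6, i.e. 0.1667
print(pr_y_ge_13)   # 13/36
```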


14C (c) 0.875

15C (a), (c) and (d) are probability mass functions, but (b) is not, because
$p(1) = -0.2 < 0$.
16C (a) k = 24/65

(b) 48/65

17C The probability mass function is


\[ p(x) = \begin{cases} 0.1 \times 0.9^{x} & x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases} \]
25C (a) k = 3/20

(b) Pr[2 < X < 3] = 4/5

26C (a), (b), (f) and (g) are probability mass functions, (c), (d), (e) and (h) are probability density functions.
27C (b) 7/16

(c) A = 0.5412

28C $-1 \le A \le 1$ (otherwise probabilities are either negative or greater than one).


29C

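\[ p(x) = \begin{cases} 0.1 \times 0.5 = 0.05 & x = 0 \\ 0.2 \times 0.5 + 0.1 \times 0.5 = 0.15 & x = 1 \\ 0.2 \times 0.5 + 0.5 \times 0.5 = 0.35 & x = 2 \\ 0.5 \times 0.5 + 0.2 \times 0.5 = 0.35 & x = 3 \\ 0.2 \times 0.5 = 0.1 & x = 4 \\ 0 & \text{otherwise} \end{cases} \]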

Exercises. . .
4.1 Which of the following random variables are discrete, and which are continuous?
(a) the time required to answer this question
(b) the number of words in a book chosen at random from the library
(c) the number of heads in 6 flips of a coin
(d) the number of goals scored in a soccer match
(e) the maximum temperature recorded at D.F. Malan airport today
(f) the volume of air breathed in by an individual when asked to take a deep
breath
(g) the annual income to the nearest cent of a randomly chosen wage-earner
(h) the population of a randomly chosen town in the Orange Free State
(i) the length of time you have to wait for a bus
(j) the amount of rain that falls in a day.

4.2 Check which of the following functions can serve as probability mass functions or
probability density functions.



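(a)
\[ p(x) = \begin{cases} x/6 & x = 1, 2, 3 \\ 0 & \text{otherwise} \end{cases} \]
(b)
\[ f(x) = \begin{cases} \tfrac{3}{10}(2 - x^{2}) & -1 \le x \le 1 \\ 0 & \text{otherwise} \end{cases} \]
(c)
\[ p(x) = x \qquad x = \tfrac{1}{16}, \tfrac{3}{16}, \tfrac{1}{4}, \tfrac{1}{2} \]
(d)
\[ f(x) = \begin{cases} 2x/3 & -1 < x < 2 \\ 0 & \text{otherwise} \end{cases} \]
(e)
\[ f(x) = \begin{cases} \tfrac{1}{4} & 3 < x < 7 \\ 0 & \text{otherwise} \end{cases} \]
(f)
\[ p(x) = \begin{cases} \dfrac{x}{\tfrac{1}{2}n(n+1)} & x = 1, 2, \ldots, n \\ 0 & \text{otherwise} \end{cases} \]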

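4.3 Show that the following functions are probability mass functions.
(a)
\[ p(x) = \begin{cases} e^{-2}\,2^{x}/x! & x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases} \]
(b)
\[ p(x) = \begin{cases} \binom{5}{x}\left(\tfrac{1}{4}\right)^{x}\left(\tfrac{3}{4}\right)^{5-x} & x = 0, 1, 2, 3, 4, 5 \\ 0 & \text{otherwise} \end{cases} \]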

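4.4 Show that the following functions are probability density functions.
(a)
\[ f(x) = \begin{cases} (2\sqrt{x})^{-1} & 0 < x < 1 \\ 0 & \text{otherwise} \end{cases} \]
(b)
\[ f(x) = \begin{cases} \tfrac{1}{4}\,x e^{-\frac{1}{2}x} & 0 \le x < \infty \\ 0 & \text{otherwise} \end{cases} \]
4.5 What must the value of k be so that the following functions are probability density
functions?
(a)
\[ f(x) = \begin{cases} kx^{2}(1 - x) & 0 < x < 1 \\ 0 & \text{otherwise} \end{cases} \]
(b)
\[ f(x) = \begin{cases} k e^{-4x} & 0 \le x < \infty \\ 0 & \text{otherwise} \end{cases} \]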


4.6 A random variable X has probability density function given by
\[ f(x) = \begin{cases} Ax^{3} & 0 \le x \le 10 \\ 0 & \text{otherwise} \end{cases} \]
Find A. What is the probability that X lies between 2 and 5, and what is the
probability that X is less than 3? Sketch the density function.

4.7 A random variable X has probability density function
\[ f(x) = \begin{cases} e^{-x} & 0 \le x < \infty \\ 0 & \text{otherwise} \end{cases} \]
Find the number t such that $\Pr[X < t] = \tfrac{1}{2}$.

4.8 For the probability density function
\[ f(x) = \begin{cases} 2x & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases} \]
find the number a such that the probability that $X < a$ is three times the probability that $X \ge a$.

4.9 If $f(x) = 3x^{2}$ for $0 < x < 1$, and zero elsewhere, find the number b such that X
is equally likely to be greater than, or less than, b.

4.10 The probability density function of the life in hours X of a certain kind of radio
tube is found to be
\[ f(x) = \begin{cases} 100/x^{2} & x > 100 \\ 0 & \text{otherwise} \end{cases} \]
Three such tubes are bought for a radio set. What is the probability that none
will have to be replaced during the first 150 hours of operation?

4.11 A batch of small-calibre ammunition is accepted as satisfactory if none of a sample
of five shots falls more than 8 cm from the target at a given range. If X, the distance
from the centre of the target to an impact point, has probability density function
\[ f(x) = \begin{cases} x e^{-x} & 0 \le x < \infty \\ 0 & \text{otherwise} \end{cases} \]
find the probability that a batch is accepted.

Further exercises. . .
4.12 A continuous random variable X has probability density function
\[ f(x) = \begin{cases} k & a \le x \le b \\ 0 & \text{otherwise} \end{cases} \]
for arbitrary constants a and b. Find the value of k.
4.13 Find values for c so that the following functions may serve as probability density
functions:



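(a)
\[ f(x) = \begin{cases} c + e^{-x} & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases} \]
(b)
\[ f(x) = \begin{cases} x + \tfrac{1}{2} & 0 \le x \le c \\ 0 & \text{otherwise} \end{cases} \]

4.14 The density function for a random variable X is given by
\[ f(x) = \begin{cases} \tfrac{3}{4}(kx - x^{2}) & 0 \le x \le 2 \\ 0 & \text{otherwise} \end{cases} \]
(a) Determine the value of k.
(b) Calculate $\Pr[0 < X < 1]$.

4.15 (a) Check whether
\[ p(x) = \begin{cases} \binom{5}{x}\binom{4}{2-x} \Big/ 36 & x = 0, 1, 2 \\ 0 & \text{elsewhere} \end{cases} \]
is a probability mass function.
(b) Calculate Pr[X = 0], Pr[X = 1] and Pr[X = 2].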

Solutions to exercises. . .
4.1 (b) (c) (d) (g) and (h) are discrete
(a) (e) (f) and (i) are continuous.
(j) is an unusual example of a mixed continuous and discrete random variable:
although the random variable is, at face value, continuous, it cannot be modelled
by a conventional probability density function because the probability of no rain
in a day is not zero but positive. The probability function for X needs to be
something like
\[ p(x) = \begin{cases} p & x = 0 \\ f(x) & x > 0 \\ 0 & \text{otherwise} \end{cases} \]
with the probability density function f(x) integrating to $1 - p$.
4.2 (a) (b) (e) and (f) satisfy conditions.
For (c), p(x) is not defined for all x.
For (d), $f(x) < 0$ for $-1 < x < 0$.
4.5 (a) k = 12

(b) k = 4

4.6 A = 1/2500, Pr[2 < X < 5] = 0.0609, Pr[X < 3] = 0.0081.


4.7 0.6931
4.8 0.8660
4.9 0.7937
4.10 8/27


4.11 0.9850
4.12 $k = 1/(b - a)$
4.13 (a) $c = e^{-1}$   (b) $c = 1$
4.14 (a) $k = 2$   (b) $\tfrac{1}{2}$
4.15 Pr[X = 0] = 6/36, Pr[X = 1] = 20/36, Pr[X = 2] = 10/36

Chapter

PROBABILITY DISTRIBUTIONS I:
THE BINOMIAL, POISSON,
EXPONENTIAL AND NORMAL
DISTRIBUTIONS
KEYWORDS: Binomial, Poisson, exponential and normal distributions.
A number of probability mass and density functions have proved themselves useful as
models for a large variety of practical problems in business and elsewhere. We consider
four of the most frequently encountered probability distributions in this chapter: the
binomial, Poisson, exponential and normal distributions.

The binomial distribution . . .


The binomial distribution may be used as a probability model in situations in
which the following conditions are satisfied:
1. We have a random experiment which has a sample space with exactly two outcomes, one of which we can label "success", and the other "failure": i.e. S =
{success, failure}.
e.g. A door-to-door salesperson calls on a prospective client; the client either
purchases the product (success) or does not purchase (failure).

2. The random experiment is repeated n times, $n \ge 1$. The outcome on any one
repetition is not influenced by the outcome on any other repetition. We say we
have n independent trials of the random experiment.
e.g. The salesperson calls on n = 6 prospective clients; the clients make their
purchasing decisions independently (there is no communication between them!).
3. The probability of success remains constant from trial to trial. We assume that
each client is equally likely to purchase the product. Let Pr(success) = p; thus
Pr(failure) = $1 - p$. It is sometimes convenient to let $q = 1 - p$, so that Pr(failure) =
q.

Our random variable X is the number of successes we observe in n trials. If the


conditions above are satisfied, then we say that we have a binomial process, and that
the random variable X has a binomial distribution. In the above example, X is the
number of calls that resulted in sales. Because 6 calls were made, X must assume one
of the values 0, 1, . . . , 6, and X is therefore an example of a discrete random variable.
Binomial processes occur in many contexts. From an industrial or commercial perspective, one of the most important binomial processes occurs in the field of quality. The
quality of a product or service, whether it is a tomato, a nail, a personal computer, a
car, an insurance policy or the punctuality of a train, can be classified as satisfactory
or defective. In particular, the binomial probability distribution provides the basis for
deciding whether or not a consignment of goods meets the desired specifications.
BINOMIAL DISTRIBUTION
In a binomial process, we have n independent trials, each trial has two outcomes, success or failure, and Pr[success] = p for all trials. Let the random variable X be the number of successes in n trials.
Then X has the binomial distribution, and Pr[X = x] is given by the probability mass function
$$
p(x) = \begin{cases} \binom{n}{x}\, p^x (1-p)^{n-x} & x = 0, 1, \ldots, n \\ 0 & \text{otherwise} \end{cases}
$$
Once we give values to n and p (n ≥ 1, 0 < p < 1), a particular binomial distribution is specified. n and p are examples of what we call the parameters of the distribution. Once the parameters of a distribution have values, a particular distribution is specified.
We have a neat abbreviated notation which saves us writing "the random variable X is distributed binomially with parameters n and p". We compress all this information into the symbols X ∼ B(n, p).
Example 1A: A door-to-door salesperson calls on 6 clients per session. Each client
makes their purchasing decision independently of the others, with probability 0.2 of
purchasing the product. What are the probabilities that 0, 1, 2, 3, 4, 5 or 6 clients
purchase the product?
Clearly, the three conditions for the binomial process are satisfied, and X, the number of clients who purchase the product, has a binomial distribution with n = 6 and p = 0.2: thus X ∼ B(6, 0.2).
Instead of simply using the formula given in the box, let us compute from first
principles the probability of, say, 2 clients purchasing the product, i.e. Pr[X = 2]:
Firstly, 2 clients out of 6 can purchase the product in many different permutations.
Let A1 be the event that the first 2 clients purchase (these are the successes that we
count) and that clients 3 to 6 refuse to purchase (i.e. are failures). Then, using our
usual conventions, we can write
A1 = S S F F F F.

CHAPTER 5. PROBABILITY DISTRIBUTIONS I


Let the events A2, A3 represent other permutations of 2 successes and 4 failures, e.g.
A2 = F S S F F F
A3 = F F S S F F
How many permutations of 2 successes and 4 failures are there? Counting rule 6 tells us there are $\binom{6}{2} = 15$ such permutations, so we could write down events from A1 to A15.
Secondly, we compute Pr(A1 ). Because the clients act independently of each other,
$$
\begin{aligned}
\Pr(A_1) &= \Pr(S\,S\,F\,F\,F\,F) \\
&= \Pr(S)\Pr(S)\Pr(F)\Pr(F)\Pr(F)\Pr(F) \\
&= p^2 (1-p)^4 = 0.2^2 \times 0.8^4 .
\end{aligned}
$$
Recall that the probability of the intersection of independent events is the product of the individual probabilities, so that
$$
\Pr(A_1) = \Pr(A_2) = \cdots = \Pr(A_{15}) = 0.2^2 \times 0.8^4 .
$$
Thirdly, the events A1, A2, . . . , A15 are mutually exclusive: no client can simultaneously both purchase and refuse to purchase! Thus
$$
\begin{aligned}
\Pr[X = 2] &= \Pr[A_1 \cup A_2 \cup \cdots \cup A_{15}] \\
&= \Pr(A_1) + \Pr(A_2) + \cdots + \Pr(A_{15}) \\
&= \binom{6}{2}\, 0.2^2 \times 0.8^4 = 15 \times 0.2^2 \times 0.8^4 = 0.2458 .
\end{aligned}
$$
Stop a while and convince yourself that the answer $\binom{6}{2}\, 0.2^2 \times 0.8^4$ obtained from first principles is the same as that obtained by substituting n = 6, p = 0.2 and x = 2 into the formula for the binomial probability mass function.
Try computing the remaining probabilities from first principles, and compare them with the results obtained from the formula. The probabilities are given in the table below:


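x    p(x) = Pr[X = x]
0    $\binom{6}{0}\, 0.8^6 = 0.2621$
1    $\binom{6}{1}\, 0.2^1 \times 0.8^5 = 0.3932$
2    $\binom{6}{2}\, 0.2^2 \times 0.8^4 = 0.2458$
3    $\binom{6}{3}\, 0.2^3 \times 0.8^3 = 0.0819$
4    $\binom{6}{4}\, 0.2^4 \times 0.8^2 = 0.0154$
5    $\binom{6}{5}\, 0.2^5 \times 0.8^1 = 0.0015$
6    $\binom{6}{6}\, 0.2^6 = 0.0001$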

The probability that all six clients purchase the product is very small (0.0001) but will occasionally occur (we expect it roughly once in every 10 000 times that a session of 6 calls is made!). The probability that none of the 6 clients purchases is 0.2621, so that in approximately a quarter of sessions of 6 calls no purchases are made. The probability that two or more purchases are made is Pr[X ≥ 2] = 0.2458 + 0.0819 + 0.0154 + 0.0015 + 0.0001 = 0.3447, so that in approximately one-third of sessions of 6 calls the salesperson achieves two or more sales.
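The calculations in Example 1A can be checked by computer. The following is a minimal sketch in Python, not part of the original text; the function name binomial_pmf is our own choice, and only the standard library is used.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Pr[X = x] for X ~ B(n, p); zero outside x = 0, 1, ..., n."""
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Example 1A: n = 6 calls, probability of a sale p = 0.2
n, p = 6, 0.2
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
for x, prob in enumerate(pmf):
    print(f"Pr[X = {x}] = {prob:.4f}")

# Two or more sales: add up p(2) to p(6)
at_least_two = sum(pmf[2:])
print(f"Pr[X >= 2] = {at_least_two:.4f}")
```

Because the program adds unrounded probabilities, its last digit for Pr[X ≥ 2] may differ slightly from a sum of the four-decimal table entries.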
Example 2B: What is the probability of a contractor being awarded only one out of
five contracts? Assume that the probability of being awarded a contract is 0.5.
Let "success" = "awarded a contract". Pr(success) = p = 1/2. So q = 1 − p = 1/2. We have n = 5 trials. Let X be the number of successes in 5 trials. Then X ∼ B(5, 1/2).
$$
\Pr[X = x] = p(x) = \binom{5}{x} \left(\tfrac{1}{2}\right)^{x} \left(\tfrac{1}{2}\right)^{5-x} \qquad x = 0, 1, \ldots, 5
$$
So
$$
\Pr[X = 1] = p(1) = \binom{5}{1} \left(\tfrac{1}{2}\right)^{5} = 5/32 .
$$

Example 3B: Check that the binomial distribution
$$
p(x) = \begin{cases} \binom{n}{x}\, p^x (1-p)^{n-x} & x = 0, 1, 2, \ldots, n \\ 0 & \text{otherwise} \end{cases}
$$
is a probability mass function.
(i) It is defined everywhere and p(x) ≠ 0 on the finite set {0, 1, . . . , n}.
(ii) p(x) has no negative terms.
(iii) For convenience, let q = 1 − p.
$$
\begin{aligned}
\sum_{x=0}^{n} \binom{n}{x} p^x (1-p)^{n-x} &= \sum_{x=0}^{n} \binom{n}{x} p^x q^{n-x} \\
&= \binom{n}{0} p^0 q^n + \binom{n}{1} p^1 q^{n-1} + \cdots + \binom{n}{x} p^x q^{n-x} + \cdots + \binom{n}{n} p^n q^0 \\
&= (p + q)^n \quad \text{(from the binomial theorem, hence the name "binomial distribution")} \\
&= 1^n \quad \text{(because } q = 1 - p\text{)} \\
&= 1 .
\end{aligned}
$$

Bar graphs of the binomial distribution . . .


To gain some feeling for the shape of the binomial distribution, consider the following
three bar graphs, for which n is fixed at 15, and p is varied.
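If you would like to reproduce the shapes yourself, the sketch below prints rough text bar graphs for the same three distributions. It is an illustration only: the helper names binomial_pmf, text_bar_graph and mode are our own, and the bar width is an arbitrary choice.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Pr[X = x] for X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def text_bar_graph(n, p, width=100):
    """One line per value of x, with a bar of '#' proportional to p(x)."""
    lines = []
    for x in range(n + 1):
        prob = binomial_pmf(x, n, p)
        lines.append(f"{x:2d} {prob:.3f} " + "#" * round(prob * width))
    return lines

def mode(n, p):
    """Smallest x at which p(x) is largest (the tallest bar)."""
    probs = [binomial_pmf(x, n, p) for x in range(n + 1)]
    return probs.index(max(probs))

for p in (0.5, 0.3, 0.8):
    print(f"X ~ B(15, {p}), tallest bar at x = {mode(15, p)}")
    print("\n".join(text_bar_graph(15, p)))
```

The tallest bar sits close to np in each case, and the graph is symmetric only when p = 0.5.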
[Figure: bar graphs of p(x) against x for X ∼ B(15, 0.5), X ∼ B(15, 0.3) and X ∼ B(15, 0.8).]


Further examples on the binomial distribution. . .


Example 4B: A certain type of pill is packed in bottles of 12 pills each. 10% of the
pills are chipped in the manufacturing process.
(a) Explain why the binomial distribution can provide a reasonable model for the
random variable X, the number of chipped pills found in a bottle of 12 pills. What
are the appropriate parameters?
(b) What is the probability that a bottle of pills contains x chipped pills, i.e. what is
Pr[X = x]?
(c) What are the probabilities of
(i) 2 chipped pills?
(ii) no chipped pills?
(iii) at least 2 chipped pills?
(a) We check that the three conditions are satisfied.
1. The random experiment consists of examining a pill, and deciding whether it
is chipped or unchipped. Thus there are two possible outcomes, as required.
Because chipped pills are the things we are looking for and counting, we
will let chipped pill = success and unchipped pill = failure.
2. There are 12 pills in the bottle. We repeat the experiment 12 times, examining each pill. It seems reasonable to assume that the pills are chipped
independently of each other.
3. It also seems reasonable that the probability of a pill being chipped is the
same for each pill.
Thus the binomial distribution with parameters n = 12 and p = 0.10 may be used
to model the phenomenon of the number of chipped pills in a bottle of pills.
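(b) Because X ∼ B(12, 0.10),
$$
p(x) = \begin{cases} \binom{12}{x}\, 0.10^x \times 0.90^{12-x} & x = 0, 1, \ldots, 12 \\ 0 & \text{otherwise} \end{cases}
$$
(c)
(i) $\Pr[X = 2] = p(2) = \binom{12}{2}\, 0.10^2 \times 0.90^{10} = 0.2301$
(ii) $\Pr[X = 0] = p(0) = \binom{12}{0}\, 0.10^0 \times 0.90^{12} = 0.2824$
(iii) $\Pr[X \ge 2] = 1 - \Pr[X = 0] - \Pr[X = 1] = 1 - 0.2824 - 0.3766 = 0.3410$.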


Example 5C: A TV manufacturer is supplied with a certain component by a specialist producer. Each incoming consignment of components is subjected to the following
quality control procedure. A random sample of 10 components is individually tested.
If there are one or more defective components among the 10 tested, the entire consignment is rejected. If there are no defective components in the sample, the consignment
is accepted.
(a) What are the probabilities of a consignment being rejected if the true proportions
of defective components are
(i) 1%
(ii) 10%
(iii) 30%


(b) If a sample of 20 components (instead of 10) were tested, and the consignment
rejected if two or more proved defective, calculate the probabilities of rejecting a
consignment for the same proportions of defective components.
(c) Which quality control procedure do you think is the better?

The Poisson distribution. . .


Many phenomena in physics obey the Poisson probability law, named in honour of the French mathematician Siméon D. Poisson (1781–1840). The classic example is the decay of radioactive nuclei. In management science, the number of demands for service in a given period of time (e.g. on tellers in a bank, the stock pile of a factory, the runways of an airport, the lines of a telephone exchange) often obeys (either exactly or approximately) a Poisson distribution. This applies also to the occurrence of accidents, errors, breakdowns and other calamities: the number that occur within a specified time period has a Poisson distribution under certain circumstances.
In broad terms, the condition for a Poisson process is that the events occur in time
at random. Loosely, this means that an event is equally likely to occur at any instant
in time. If a phenomenon obeys the Poisson process, then the Poisson distribution may
be used to model the number of occurrences of the event during a fixed time
period. We can also use the Poisson distribution when we count the occurrences of
an event in a fixed amount of space. For example, the number of faults in 100 m of
computer cable, the number of misprints on an A4 page, and the number of diamonds in
a cubic metre of ore are all Poisson processes (if the events occur at random in space)
and can be modelled using the Poisson distribution.
The Poisson distribution has only one parameter, namely the average rate at which
events are occurring per time period. Because the number of events that occur in the
interval must be an integer, the Poisson distribution is discrete. The probability mass
function is given in the box:
POISSON DISTRIBUTION
We are given a period of time during which events occur at random. The average rate at which events occur is λ events per time period. It is critical that the time period referred to in the rate must be the same as the time period during which the events are counted. Let the random variable X be the number of events occurring during the time period.
Then X has the Poisson distribution with parameter λ, i.e. X ∼ P(λ), and has probability mass function
$$
p(x) = \begin{cases} \dfrac{e^{-\lambda} \lambda^x}{x!} & x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases}
$$
The bar graphs below show the shape of the Poisson distribution for two values of λ.

[Figure: bar graphs of p(x) against x for X ∼ P(3) and X ∼ P(8).]

Example 6A: We have a large fleet of delivery trucks. On average we have 12 breakdowns per 5-day working week. Each day we keep two trucks on standby. What is the
probability that on any day
(a) no standby trucks are needed?
(b) the number of standby trucks is inadequate?
Let the random variable X be the number of trucks that break down in a given day.
Because we are dealing with breakdowns, it is reasonable to assume that they occur at
random and that the Poisson distribution is a realistic model.
Because we are interested in breakdowns per day, we need to convert the given weekly rate into a daily rate. 12 breakdowns per 5 days is equivalent to 12/5 = 2.4 breakdowns per day. Thus we assume that X has the Poisson distribution with parameter λ = 2.4, i.e. X ∼ P(2.4). Hence
$$
\Pr[X = x] = p(x) = \frac{e^{-2.4}\, 2.4^x}{x!}
$$
(a)
$$
\Pr(\text{no breakdowns}) = \Pr[X = 0] = p(0) = \frac{e^{-2.4}\, 2.4^0}{0!} = 0.0907
$$
(b)
$$
\begin{aligned}
\Pr(\text{inadequate standby trucks}) &= \Pr[X > 2] \\
&= 1 - \Pr[X \le 2] \\
&= 1 - (p(0) + p(1) + p(2)) \\
&= 1 - (0.0907 + 0.2177 + 0.2613) \\
&= 0.4303 .
\end{aligned}
$$

This means that 9% of days we will not use our standby trucks at all, but that
on 43% of days we will run out of standby trucks. We should investigate the financial
implications of putting more trucks on standby.
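The standby-truck calculation can be sketched in Python as follows. This is an illustrative check rather than part of the original example; the names poisson_pmf, weekly_rate and working_days are our own. Note how the weekly rate is converted to a daily rate before the formula is applied.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Pr[X = x] for X ~ P(lam), with x = 0, 1, 2, ..."""
    return exp(-lam) * lam**x / factorial(x)

# Example 6A: 12 breakdowns per 5-day working week
weekly_rate = 12
working_days = 5
lam = weekly_rate / working_days  # rate per day, matching the counting period

# (a) no standby trucks needed: no breakdowns on the day
p_none = poisson_pmf(0, lam)

# (b) two standby trucks are not enough: more than 2 breakdowns
p_inadequate = 1 - sum(poisson_pmf(x, lam) for x in range(3))

print(f"lambda per day         = {lam}")
print(f"Pr[X = 0]              = {p_none:.4f}")
print(f"Pr[X > 2]              = {p_inadequate:.4f}")
```

Changing range(3) to range(k + 1) gives the probability that k standby trucks are inadequate, which is a convenient starting point for the financial comparison suggested above.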


Example 7B: Bank tellers make errors in entering figures in their ledgers at the rate
of 0.75 errors per page of entries. What is the probability that in a random sample of 4
pages there will be 2 or more errors?
Because we are dealing with errors, we assume a Poisson distribution. If errors occur at 0.75 errors per page, then the error rate per 4 pages is 3. So we choose λ = 3. Hence
$$
\Pr[X = x] = \frac{e^{-3}\, 3^x}{x!}
$$
Then:
$$
\begin{aligned}
\Pr[X \ge 2] &= 1 - \Pr[X < 2] \\
&= 1 - \Pr[X = 0] - \Pr[X = 1] \\
&= 1 - \frac{e^{-3}\, 3^0}{0!} - \frac{e^{-3}\, 3^1}{1!} \\
&= 1 - 0.0498 - 0.1494 = 0.8008 .
\end{aligned}
$$

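Example 8C: Show that the function
$$
p(x) = \begin{cases} \dfrac{e^{-\lambda} \lambda^x}{x!} & x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases}
$$
is in fact a probability mass function.
[You need the mathematical result:
$$
e^{\lambda} = 1 + \lambda + \frac{\lambda^2}{2!} + \frac{\lambda^3}{3!} + \cdots = \sum_{x=0}^{\infty} \frac{\lambda^x}{x!}
$$
]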

Example 9C: Beercans are randomly tossed alongside the national road, with an
average frequency 3.2 per km.
(a) What is the probability of seeing no beercans over a 5 km stretch?
(b) What is the probability of seeing at least one beercan in 200 m?
(c) Determine the values of x and y in the following statement: 40% of 1 km sections
have x or fewer beercans, while 5% have more than y.
For (c), the following information is useful:
x                                      0       1       2       3       4       5       6       7       8
$p(x) = e^{-3.2}\, 3.2^x / x!$         0.0408  0.1304  0.2087  0.2226  0.1781  0.1140  0.0608  0.0278  0.0111
$\Pr[X \le x] = \sum_{t=0}^{x} p(t)$   0.0408  0.1712  0.3799  0.6025  0.7806  0.8946  0.9554  0.9832  0.9943


A more formal derivation of the Poisson distribution . . .


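We remind ourselves of two mathematical results:
1. $\lim_{n \to \infty} (1 - \lambda/n)^n = e^{-\lambda}$
2. $\lim_{n \to \infty} a/n = 0$ for any finite value a.
Suppose events are occurring at random with rate λ per time period. Divide the time period into n extremely short intervals of time, each of length 1/n. These time intervals are regarded as being so short that it is assumed impossible for two or more events to occur in the same interval. With this assumption it is true to say that the probability that an event occurs in the short interval is λ/n, and thus the probability that an event does not occur is 1 − λ/n.
We now stand back and look at the problem from a new angle. We think of each interval as a "trial". There are exactly two outcomes of each trial: either an event occurs ("success") or does not occur ("failure"). The probability of success, p, is λ/n. There are n intervals, thus we have n trials. Let X be the number of events that occur in the time period. Clearly, X is a random variable satisfying the conditions for a binomial distribution, X ∼ B(n, λ/n), and thus
$$
\Pr[x \text{ events in the time period}] = \Pr[x \text{ successes in } n \text{ trials}] = \binom{n}{x} \left(\frac{\lambda}{n}\right)^{x} \left(1 - \frac{\lambda}{n}\right)^{n-x}
$$
We now let n get so large that the assumption of two or more events in one interval being impossible becomes realistic. Ultimately, we use the two mathematical results above to see what happens when we let n tend to infinity:
$$
\begin{aligned}
p(x) &= \lim_{n \to \infty} \binom{n}{x} \left(\frac{\lambda}{n}\right)^{x} \left(1 - \frac{\lambda}{n}\right)^{n-x} \\
&= \lim_{n \to \infty} \frac{n!}{x!\,(n-x)!}\, \frac{\lambda^x}{n^x} \left(1 - \frac{\lambda}{n}\right)^{n} \left(1 - \frac{\lambda}{n}\right)^{-x} \\
&= \frac{\lambda^x}{x!}\, \lim_{n \to \infty} \frac{n!}{(n-x)!\, n^x}\; \lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^{n}\; \lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^{-x} \\
&= \frac{\lambda^x}{x!}\, \lim_{n \to \infty} \left[ \frac{n}{n} \cdot \frac{n-1}{n} \cdot \frac{n-2}{n} \cdots \frac{n-x+1}{n} \right] e^{-\lambda} \times 1,
\end{aligned}
$$
using the first of the mathematical results above. A simple re-expression of each term within brackets yields
$$
\begin{aligned}
p(x) &= \frac{\lambda^x}{x!}\, \lim_{n \to \infty} \left[ 1 \cdot \left(1 - \frac{1}{n}\right) \left(1 - \frac{2}{n}\right) \cdots \left(1 - \frac{x-1}{n}\right) \right] e^{-\lambda} \\
&= \frac{\lambda^x}{x!} \times 1 \times 1 \times 1 \times \cdots \times 1 \times e^{-\lambda}
\end{aligned}
$$
using the second of the mathematical results x times. Therefore, we have the result we require,
$$
p(x) = \frac{e^{-\lambda} \lambda^x}{x!},
$$
the probability mass function for the Poisson distribution.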
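The limiting argument can also be watched numerically. The sketch below is our own illustration, not part of the derivation: it evaluates the B(n, λ/n) probability for increasing n and compares it with the Poisson value. The choices λ = 2.4 and x = 2, and the function names, are arbitrary assumptions.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """Pr[X = x] for X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    """Pr[X = x] for X ~ P(lam)."""
    return exp(-lam) * lam**x / factorial(x)

lam, x = 2.4, 2  # illustrative values only
target = poisson_pmf(x, lam)

errors = {}
for n in (10, 100, 1000, 10000):
    # n short intervals, each a trial with success probability lam / n
    approx = binomial_pmf(x, n, lam / n)
    errors[n] = abs(approx - target)
    print(f"n = {n:5d}  binomial = {approx:.6f}  Poisson = {target:.6f}")
```

As n grows, the binomial probability settles on the Poisson value, which is exactly what the two limits in the derivation assert.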


The exponential distribution . . .


The exponential distribution arises out of the same underlying scenario as the Poisson distribution, the Poisson process in which events occur at random in time or space.
For the Poisson distribution, we counted the number of events that occurred in a fixed
period of time. For the exponential distribution we consider the interval of time between events. We let the random variable X be the time between events. Obviously, X is a continuous random variable (it can take on any value in the sample space S = {x | x ≥ 0}), and must therefore have a probability density function. We motivate the formula for the density function in Example 13C.
EXPONENTIAL DISTRIBUTION
If events are occurring at random with average rate λ per unit of time, then the probability density function for the random variable X, the length of time between events, is given by
$$
f(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & \text{otherwise} \end{cases}
$$
X is said to have the exponential distribution with parameter λ, and we write X ∼ E(λ).
The exponential distribution can also be used to model the distance between
events in space, so long as the space is one-dimensional! For example, the exponential
distribution can be used for the distance between flaws in cable, but not for the distance
between flaws on an A4 page, because the page is two-dimensional!
Example 10A: A computer that operates continuously breaks down at random on
average 1.5 times per week.
This tells us λ = 1.5 per week, and that the random variable X, the time between breakdowns, has density function
$$
f(x) = \begin{cases} 1.5\, e^{-1.5x} & x \ge 0 \\ 0 & \text{otherwise} \end{cases}
$$
What is the probability of no breakdowns for 2 weeks?
This implies that X must be greater than 2 (think through this statement carefully) and that we want Pr[X > 2]. Because the exponential distribution is continuous, we evaluate this probability by integration:
$$
\begin{aligned}
\Pr[X > 2] &= \int_{2}^{\infty} 1.5\, e^{-1.5x}\, dx = \left[ -e^{-1.5x} \right]_{2}^{\infty} \\
&= -e^{-\infty} + e^{-3} = 0 + e^{-3} \\
&= 0.0498 .
\end{aligned}
$$

What is the probability of a breakdown within 3 days?


We first make our units of time compatible: 3 days = 3/7 week. We want the
probability of a breakdown before 3/7 week:
$$
\begin{aligned}
\Pr[0 < X < 3/7] &= \int_{0}^{3/7} 1.5\, e^{-1.5x}\, dx = \left[ -e^{-1.5x} \right]_{0}^{3/7} \\
&= -e^{-0.6429} + e^{0} = -0.5258 + 1 \\
&= 0.4742
\end{aligned}
$$
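Both integrals in Example 10A have the same antiderivative, so they can be checked with a few lines of Python. This is a sketch under our own naming (exponential_cdf is not a term used in the text); it simply evaluates Pr[X ≤ x] = 1 − e^{−λx}.

```python
from math import exp

def exponential_cdf(x, lam):
    """Pr[X <= x] for X ~ E(lam): the integral of lam*exp(-lam*t) from 0 to x."""
    if x <= 0:
        return 0.0
    return 1 - exp(-lam * x)

lam = 1.5  # breakdowns per week

# No breakdown for 2 weeks: the time to the next breakdown exceeds 2
p_no_breakdown_2_weeks = 1 - exponential_cdf(2, lam)

# Breakdown within 3 days: convert days to weeks first
p_breakdown_within_3_days = exponential_cdf(3 / 7, lam)

print(f"Pr[X > 2]       = {p_no_breakdown_2_weeks:.4f}")
print(f"Pr[0 < X < 3/7] = {p_breakdown_within_3_days:.4f}")
```

As with the Poisson distribution, the units of x must match the units of the rate λ.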

These probabilities are depicted in the figure below, which also shows the general shape
of the exponential distribution.
[Figure: the density f(x) = 1.5e^{−1.5x} of X ∼ E(1.5) plotted against x, with the areas Pr[0 < X < 3/7] and Pr[X > 2] marked.]

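Example 11B: Show that the exponential distribution
$$
f(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & \text{otherwise} \end{cases}
$$
is a probability density function.
We check that the three conditions for a probability density function are satisfied.
(i) f(x) is defined everywhere. It is non-zero on the interval [0, ∞), i.e. the set {x | 0 ≤ x < ∞}.
(ii) f(x) is never negative. Because λ is a rate, it must be positive, and e^{−λx} is positive.
(iii)
$$
\begin{aligned}
\int_{-\infty}^{\infty} f(x)\, dx &= \int_{0}^{\infty} \lambda e^{-\lambda x}\, dx = \left[ -e^{-\lambda x} \right]_{0}^{\infty} \\
&= -e^{-\infty} + e^{0} = 0 + 1 \\
&= 1,
\end{aligned}
$$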

as required for the area under the curve of a probability density function.


Example 12C: Let the random variable X be the time in hours for which a light bulb
burns from the time it is put into service. The probability density function of X is given
by
$$
f(x) = \begin{cases} \dfrac{1}{1000}\, e^{-x/1000} & x \ge 0 \\ 0 & \text{otherwise} \end{cases}
$$

(a) What is the probability that the bulb burns for between 100 and 1000 hours?
(b) What is the probability that the bulb burns for more than 1000 hours?
(c) What is the probability that the bulb burns for a further 1000 hours, given that
it has already burned for 500? (Use conditional probabilities!)
Example 13C: Events occur according to a Poisson process with intensity λ (i.e. at rate λ per unit of time).
(a) Use the Poisson distribution to determine the probability of no events in t units of
time.
(b) Now use the exponential distribution to determine the probability that the time
between events is greater than t.
(c) Compare the answers to (a) and (b) and explain these results.
Example 14C: Flaws occur in telephone cable at the average rate of 4.4 flaws per
km of cable. Calculate the following probabilities. (Make use of binomial, Poisson and
exponential distributions.)
(a) What is the probability of 1 flaw in 100 m of cable?
(b) What is the probability of more than 3 flaws in 250 m of cable?
(c) What is the probability that the distance between flaws exceeds 500 m?
(d) In ten 200 m lengths of cable, what is the probability that 8 or more are free of flaws?

The normal distribution. . .


The normal distribution is often referred to as the Gaussian distribution, in honour of Carl Friedrich Gauss (1777–1855), a famous German mathematician, who, for more than a century, was credited with its discovery. The same result was published at about the same time by the equally famous French mathematician the Marquis de Laplace (1749–1827). But the normal distribution had actually been discovered nearly a century earlier by Abraham de Moivre (1667–1754). In 1733 he published a mathematical pamphlet that was not widely circulated and was quickly forgotten. A copy of de Moivre's pamphlet was found in 1924, and the English statistician Karl Pearson found that it contained the formula for the normal distribution. De Moivre's precedence in discovering the normal distribution is documented in a paper published in 1924 by Pearson ("Historical note on the origin of the normal curve of errors") in the journal Biometrika, volume 16, pages 402–404. Biometrika is an important statistical journal which is still publishing major discoveries in statistics.
The normal distribution is the most important distribution in statistics. Part of the reason for this is a result called the central limit theorem, which states that if a random variable X is the sum of a large number of random increments, then X has (approximately) the normal distribution.
The daily turnover of a large store is the sum of the purchases of all the individual
customers. The height of a 50-year-old pine tree can be thought of as the sum of each year's growth, which itself is a variable affected by sunshine, temperature, rainfall, etc. So one expects the heights of 50-year-old pine trees to obey a normal distribution.
Similarly, an examination mark is the sum of the scores in a large number of questions.
Thus, by the central limit theorem, one expects daily turnover, the heights of trees and
examination marks (approximately, at least) to be normally distributed.
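The normal distribution is continuous, and has probability density function
$$
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2} \qquad -\infty < x < \infty
$$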

There are two parameters, μ (mu, the Greek letter m, for "Mean") and σ (sigma, the Greek letter s, for "Standard deviation").
The constant μ tells us where the graph is located (it can take on any real value); the constant σ (which is always positive) tells how spread out the distribution is. The graphs, depicting f(x) for a few values of μ and σ, make this clear. The most striking feature of the normal distribution is that it is bell-shaped. Notice also that the centre of the bell is located at the value μ, and that the distribution gets flatter as σ gets larger. The plot also illustrates the fact that the area under the curve for a probability density function is one; to accommodate this, notice that as the curve gets flatter, its maximum value has to become smaller.
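The trade-off between spread and height can be seen by evaluating the density at its centre, where f(μ) = 1/(σ√(2π)). The sketch below is an illustration only; the function name normal_pdf and the particular (μ, σ) pairs are our own choices.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at the point x (sigma is the standard deviation)."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))

# Peak height f(mu) for a few illustrative parameter choices
for mu, sigma in ((0, 0.5), (0, 1), (6, 1), (0, 2)):
    peak = normal_pdf(mu, mu, sigma)
    print(f"mu = {mu}, sigma = {sigma}: maximum of f(x) is {peak:.4f} at x = {mu}")
```

Doubling σ halves the maximum height, while changing μ only shifts the bell along the x-axis.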
[Figure: normal density curves f(x) plotted against x for N(0, 1), N(0, 0.5²), N(2, 4²) and N(6, 1).]


If X has the normal distribution with parameters μ and σ, we abbreviate this to X ∼ N(μ, σ²), reading this as "the random variable X has the normal distribution with parameters μ and σ²". When we use this notation, our convention is to write σ² for the second parameter, not plain σ. The parameter σ² is known as the variance of the distribution. As in Chapter 1, the variance is the square of the standard deviation.
Unfortunately the normal probability density function has no indefinite integral that can be written down in closed form, so probabilities cannot be determined by the usual method of integration. However (and this makes life very easy), the integration can be done numerically by computer, and we are supplied with a table of probabilities for the normal distribution.
It should come as a surprise to you that a single table is all we need. After all, there are infinitely many combinations of μ and σ, and it seems that we ought to have a massive book of normal tables. We are luckier than we deserve to be, and there is a connecting link between all normal distributions which makes it possible to get away with a single table! We will learn how to use this amazing table by means of an example.

Example 15A: If the amount of margarine, X, in a 250 g tub is normally distributed with μ = 251 and σ = 3, what is the probability that the tub will contain
(a) between 251 g and 253 g of margarine?
(b) less than 250 g of margarine?
For part (a) we want
$$
\Pr[251 \le X \le 253] = \int_{251}^{253} \frac{1}{\sqrt{2\pi \times 9}}\; e^{-\frac{1}{2} \left( \frac{x - 251}{3} \right)^2} dx
$$

It is impossible to evaluate this integral by finding an indefinite integral and then substituting. So we resort to our tables. We are given tables for only one set of values of the two parameters, and this is all we need: when μ = 0 and σ = 1 we have the standard normal distribution, which has density function
$$
f(x) = \frac{1}{\sqrt{2\pi}}\; e^{-\frac{1}{2} x^2} \qquad -\infty < x < \infty
$$

How do we make do with tables for only the standard normal distribution? Because we have an easily proved result: the proportion of the density function that lies between the mean and a specified number of standard deviations away from the mean is always constant, regardless of the numerical values of the mean and standard deviation.
Translated into mathematical symbols, this important result can be written as
$$
\int_{\mu}^{\mu + z\sigma} \frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2} dx = \int_{0}^{z} \frac{1}{\sqrt{2\pi}}\; e^{-\frac{1}{2} t^2}\, dt
$$
Use integration by substitution to prove this by putting t = (x − μ)/σ.
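Before proving the result, you may like to see it hold numerically. The sketch below approximates both integrals with the midpoint rule for μ = 251, σ = 3 and one standard deviation. It is our own illustration: the helper names normal_pdf and midpoint_integral and the number of steps are assumptions, not part of the text.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at the point x."""
    u = (x - mu) / sigma
    return exp(-0.5 * u * u) / (sigma * sqrt(2 * pi))

def midpoint_integral(f, a, b, steps=10_000):
    """Approximate the integral of f from a to b using the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

mu, sigma, z = 251, 3, 1  # one standard deviation above the mean

# Left-hand side: area under N(251, 3^2) between 251 and 254
left = midpoint_integral(lambda x: normal_pdf(x, mu, sigma), mu, mu + z * sigma)

# Right-hand side: area under N(0, 1) between 0 and 1
right = midpoint_integral(lambda t: normal_pdf(t, 0, 1), 0, z)

print(f"area under N(251, 9) from 251 to 254: {left:.6f}")
print(f"area under N(0, 1) from 0 to 1:       {right:.6f}")
```

The two areas agree, and trying other values of mu, sigma and z gives the same agreement, which is what the substitution argument establishes in general.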

As an example of this, the areas depicted below are equal. The shading, in both
cases, shows the area under the curve between the mean and one standard deviation
above the mean. Both plots have the same scale on both axes so you can count the
dots for a numerical proof!

[Figure: the densities of X ∼ N(251, 3²) and Z ∼ N(0, 1) plotted side by side, with the area between the mean and one standard deviation above the mean (σ = 3 and σ = 1 respectively) shaded in each.]


254 is one standard deviation (i.e. 3 units) above 251, the mean. Thus the area between 251 and 254 in N(251, 3²) is the same as that between 0 and 1 in N(0, 1).
Returning to part (a) of our margarine example, we need the area between 251 and 253 of N(251, 3²). 253 is two-thirds of a standard deviation above the mean of 251, because (253 − 251)/3 = 2/3. Thus Pr[251 < X < 253] = Pr[0 < Z < 2/3], as depicted below:
[Figure: the area between 251 and 253 under X ∼ N(251, 3²) and the equal area between 0 and 2/3 under Z ∼ N(0, 1).]


Some numerical results from the normal tables help to give a feel for the normal
distribution. The area from one standard deviation below the mean to one standard
deviation above the mean is 0.683 (close to 2/3rds); i.e.
Pr[μ − σ < X < μ + σ] = 0.683.
The corresponding probabilities for two, three and four standard deviations are:
Pr[μ − 2σ < X < μ + 2σ] = 0.954
Pr[μ − 3σ < X < μ + 3σ] = 0.997
Pr[μ − 4σ < X < μ + 4σ] = 0.99994
These results are true for all combinations of μ and σ! In general terms, two-thirds (68%) of a normal distribution is within one standard deviation of its mean, 95% is within two standard deviations, and virtually all of it is within three standard deviations.
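These benchmark figures can be reproduced without tables. For a standard normal Z, Pr[−k < Z < k] equals erf(k/√2), where erf is the error function available in Python's standard library. The sketch below is an illustrative check; the function name prob_within is our own.

```python
from math import erf, sqrt

def prob_within(k):
    """Pr[mu - k*sigma < X < mu + k*sigma] for any normal X, via the error function."""
    return erf(k / sqrt(2))

for k in (1, 2, 3, 4):
    print(f"within {k} standard deviation(s): {prob_within(k):.6f}")
```

Because the function takes only k as its argument, it makes the point of the paragraph above concrete: the answer does not depend on μ or σ at all.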
The general result is that the area between μ and some point x for N(μ, σ²) is the same as the area between 0 and z = (x − μ)/σ for N(0, 1). The formula z = (x − μ)/σ tells us how many standard deviations the point x is away from the mean μ. Once again, you can count the dots in the plot below of the normal distribution with arbitrary parameters μ and σ² and in the standard normal distribution N(0, 1):
[Figure: the area between μ and x under X ∼ N(μ, σ²) and the equal area between 0 and z = (x − μ)/σ under Z ∼ N(0, 1).]


In our margarine example, we use $z = (x - \mu)/\sigma$ for $x = 251$ and $x = 253$ to get
\[ \Pr[251 < X < 253] = \Pr\left[\frac{251 - 251}{3} < Z < \frac{253 - 251}{3}\right] = \Pr[0 < Z < 0.67]. \]
From the table for the standard normal distribution (Table 1) we read off this probability
as 0.2486. Thus
\[ \Pr[251 < X < 253] = 0.2486, \]
so almost a quarter of margarine tubs contain between 251 g and 253 g of margarine.
Part (b) of our question asked for the probability that a tub of margarine was
underweight, i.e. the probability that $X < 250$. The area between $-\infty$ and 250 in
$N(251, 3^2)$ is the same as the area between $-\infty$ and $(250 - 251)/3 = -1/3$ in $N(0, 1)$:
\[ \Pr[X < 250] = \Pr\left[Z < \frac{250 - 251}{3}\right] = \Pr[Z < -1/3]. \]


[Figure: density $f(z)$ of $Z \sim N(0, 1)$ and density $f(x)$ of $X \sim N(251, 3^2)$, with the x-axis marked at 245, 250 and 255.]

Because our tables give us areas between 0 and a point $z$, we have to go through the
steps depicted in the diagrams below to find this probability. We make use of the facts
that the normal distribution is symmetric, and that the area from 0 to $\infty$ is 0.5.
Alternatively, we can write:
\[ \Pr[X < 250] = \Pr[Z < (250 - 251)/3] = \Pr[Z < -1/3] = \Pr[Z > 1/3] \]
\[ = 0.5 - \Pr[0 < Z < 1/3] = 0.5 - 0.1293 = 0.3707. \]

The value 0.1293 is looked up in Table 1. Thus 37% of the tubs will contain less margarine
than stated. Notice that because the normal distribution is symmetric we only need
tables for half of the distribution.
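The two margarine probabilities can also be computed directly, without tables. This is a minimal sketch using Python's standard-library `statistics.NormalDist`; the variable names are illustrative. The results differ slightly from the table answers (roughly in the third decimal place) because the table lookup rounds $z$ to two decimals.

```python
from statistics import NormalDist

tub = NormalDist(mu=251, sigma=3)   # mass of margarine in a tub, in grams

# (a) Pr[251 < X < 253]
p_between = tub.cdf(253) - tub.cdf(251)

# (b) Pr[X < 250], the proportion of underweight tubs
p_under = tub.cdf(250)

print(round(p_between, 4), round(p_under, 4))
```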

Example 16B: If $\mu = 4$ and $\sigma = 8$, what is the probability that a normally distributed
random variable $X$ lies between 2 and 18?

Letting $z = (x - \mu)/\sigma$,
\[ \Pr[2 < X < 18] = \Pr\left[\frac{2 - 4}{8} < \frac{X - \mu}{\sigma} < \frac{18 - 4}{8}\right] = \Pr[-0.25 < Z < 1.75] = 0.0987 + 0.4599 = 0.5586. \]

Example 17B: A t-shirt manufacturer knows that the chest measurements of his
customers are normally distributed with mean 92 cm and standard deviation 5 cm. He
makes his t-shirts in four sizes: S (to fit chest sizes 80-87 cm), M (to fit 87-94 cm),
L (to fit 94-101 cm) and XL (to fit 101-108 cm). What proportion of customers fit into
each size t-shirt?

[Figure: density of $X \sim N(92, 5^2)$, with the size boundaries 80, 87, 94, 101 and 108 cm marked on the x-axis.]

We need to find the z-values for each of the boundary points, by using the formula
$z = (x - \mu)/\sigma$.
Then, from our normal tables, we find the area between each of these points and the
mean. This gives

    x      z = (x - 92)/5    Area between x and mean
    80         -2.4               0.4918
    87         -1.0               0.3413
    94          0.4               0.1554
    101         1.8               0.4641
    108         3.2               0.4993

The proportions for each size are then found by subtraction (or addition in the case
of size M), as follows:

    Size    Proportion
    S       0.4918 - 0.3413 = 0.1505   (15.05%)
    M       0.3413 + 0.1554 = 0.4967   (49.67%)
    L       0.4641 - 0.1554 = 0.3087   (30.87%)
    XL      0.4993 - 0.4641 = 0.0352   ( 3.52%)

Check for yourself that 0.89% of customers don't fit into any size t-shirt.
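The size proportions, and the 0.89% check, can be reproduced with a short script. This is a sketch only, using Python's standard-library `statistics.NormalDist`; the dictionary `bounds` simply restates the size ranges from the example.

```python
from statistics import NormalDist

chest = NormalDist(mu=92, sigma=5)          # chest measurement in cm
bounds = {"S": (80, 87), "M": (87, 94), "L": (94, 101), "XL": (101, 108)}

# Proportion in each size = area under the density between its two boundaries.
proportions = {size: chest.cdf(hi) - chest.cdf(lo) for size, (lo, hi) in bounds.items()}

# Whatever is left over lies below 80 cm or above 108 cm.
no_fit = 1 - sum(proportions.values())

for size, p in proportions.items():
    print(f"{size:>2}: {p:.4f}")
print(f"no size fits: {no_fit:.4f}")
```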
Example 18C: The mean inside diameter of washers produced by a machine is 0.403
cm and the standard deviation is 0.005 cm.
Washers with an internal diameter less than 0.397 cm or greater than 0.406 cm are
considered defective. What percentage of the washers produced are defective, assuming
the diameters are normally distributed?


Example 19C: In a large group of men 4% are under 160 cm tall and 52% are between
160 cm and 175 cm tall. Assuming that heights of men are normally distributed, what
are the mean and standard deviation of the distribution?
Example 20C: A soft-drink vending machine is set to discharge an average of 215 ml
of cooldrink per cup. The amount discharged is normally distributed with standard
deviation 10 ml.
(a) If 225 ml cups are used, what proportion of cups overflow?
(b) What is the probability that a cup contains at least 200 ml of cooldrink?
(c) What size cups ought to be used if it is desirable that only 2% of cups overflow?

Sums and differences of independent normal random variables . . .


Suppose we have a number of tasks that have to be completed in sequence, e.g.
when a building is constructed. Suppose the time taken for each task obeys a normal
distribution, each having a given mean and variance, and is independent of the time taken
for the other tasks. Obviously the total time taken will also be a random variable.
What will its distribution be, and what will its mean and variance be? Without proof, we
state that the total time taken will be normally distributed with mean total time equal
to the sum of the means for each task, and variance equal to the sum of the variances
(not the standard deviations). Mathematically, we write this as follows. If the time $X_i$
taken for the $i$th task is such that
\[ X_i \sim N(\mu_i, \sigma_i^2) \]
and if it is independent of the time taken for other tasks, then the distribution of the
random variable $Y = \sum_{i=1}^{n} X_i$ is
\[ Y \sim N(\mu, \sigma^2) \]
where $\mu = \sum_{i=1}^{n} \mu_i$ and $\sigma^2 = \sum_{i=1}^{n} \sigma_i^2$.
Sometimes we need to consider the difference of two independent normally distributed random variables. Suppose
\[ X_1 \sim N(\mu_1, \sigma_1^2) \quad \text{and} \quad X_2 \sim N(\mu_2, \sigma_2^2); \]
then, letting $Z = X_1 - X_2$, we state, without proof, that
\[ Z \sim N(\mu_1 - \mu_2, \sigma_1^2 + \sigma_2^2). \]
The mean of the random variable $Z$ is found by subtraction, but the variance is still
found by addition.


Example 21B: You have 4 chores to perform before getting to Statistics lectures by
08h10. The time (in minutes) to perform each chore is normally distributed with mean
and standard deviation as given below:

                              mean     std. dev.
    1. Shower                   5        0.5
    2. Get dressed              4        1.0
    3. Eat breakfast           10        3.5
    4. Drive to university     15        5.0

(a) If you get up at 07h30, what is the probability of being late?
(b) (i) At what time should you get up so as to be 99% sure you will not be more
than 3 minutes late?
(ii) What is the probability, getting up at this time, that you will be there after
08h10?

(a) The total time taken to get to university is a normally distributed random variable
$X$ with mean
\[ \mu = 5 + 4 + 10 + 15 = 34 \text{ minutes} \]
and variance
\[ \sigma^2 = 0.5^2 + 1.0^2 + 3.5^2 + 5.0^2 = 38.5 \]
and therefore standard deviation $\sigma = 6.205$.
The probability that you take more than the allowed 40 minutes is
\[ \Pr[X > 40] = \Pr\left[Z > \frac{40 - 34}{6.205}\right] = \Pr[Z > 0.97] = 0.1660. \]
On average, you will be late one day in six, because $1/0.1660 \approx 6$.
(b) (i) We must choose $x$ so that $\Pr[X < x] = 0.99$. From tables $\Pr[Z < z] = 0.99$
implies $z = 2.33$. Thus, using the formula $z = (x - \mu)/\sigma$,
\[ 2.33 = \frac{x - 34}{6.205}, \]
which has solution $x = 48.5$ minutes.
You may arrive as late as 08h13, and 48.5 minutes before 08h13 is 07h24.5.
(ii) Getting up at 07h24.5 leaves 45.5 minutes before 08h10. The probability of
taking a total of more than 45.5 minutes is
\[ \Pr[X > 45.5] = \Pr\left[Z > \frac{45.5 - 34}{6.205}\right] = \Pr[Z > 1.85] = 0.0322. \]
You'll be late about one day in 31, on average, but (by part (b)(i)) more than
three minutes late only once in every 100 days.
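The arithmetic of Example 21B can be sketched in a few lines of Python using the standard-library `statistics.NormalDist`. The chore means 5 and 4 for showering and dressing are taken from the sum in the worked solution; the variable names are illustrative. Because no table rounding is involved, the results agree with the text only to about two decimal places.

```python
from math import sqrt
from statistics import NormalDist

# (mean, standard deviation) in minutes for each chore
chores = [(5, 0.5), (4, 1.0), (10, 3.5), (15, 5.0)]

mu = sum(m for m, _ in chores)              # means add
var = sum(s ** 2 for _, s in chores)        # variances add, not standard deviations
total = NormalDist(mu, sqrt(var))

p_late = 1 - total.cdf(40)                  # (a) up at 07h30, lecture at 08h10
x99 = total.inv_cdf(0.99)                   # (b)(i) time that suffices on 99% of days
p_after_0810 = 1 - total.cdf(x99 - 3)       # (b)(ii) only x99 - 3 minutes remain before 08h10

print(mu, var, round(p_late, 4), round(x99, 1), round(p_after_0810, 4))
```

Note that the variance is built from squared standard deviations; summing the standard deviations themselves would overstate the spread of the total.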


Example 22C: Plastic caps seal the ends of the tube into which your degree certificate
is placed when you graduate. Suppose the tubes have a mean diameter of 24.0 mm and
a standard deviation of 0.15 mm, and that the plastic caps have a mean diameter of
23.8 mm and a standard deviation of 0.11 mm. If the diameter of the cap is 0.10 mm
or more larger than that of the tube, the cap cannot be squashed into the tube, and if
the diameter of the cap is 0.45 mm or more smaller than that of the tube, it will not
seal the tube, but will just keep falling out. If a tube and a plastic cap are selected at
random, what are the probabilities of (a) the cap being too large for the tube, and (b)
the cap falling out of the tube?

Multiplying a normal random variable by a constant


Suppose that an American textbook says that the heights of students have a normal
distribution with mean 67 inches and standard deviation 4 inches. How do we convert
this information to a normal distribution with heights in centimetres?
A result, which we do not prove, comes to our aid. It says that if the random variable
$X \sim N(\mu, \sigma^2)$, and if $a$ is a constant, then the random variable $Y = aX$ also has a
normal distribution, with
\[ Y \sim N(a\mu, a^2\sigma^2). \]
To solve the inches and centimetres problem, we note that the conversion factor
from inches to centimetres is 2.54 (one inch = 2.54 cm). So if $X \sim N(67, 16)$, where $X$
is measured in inches, then $Y = 2.54X$ will be in centimetres, and
\[ Y \sim N(2.54 \times 67, \; 2.54^2 \times 16) = N(170.2, 103.2). \]
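As a quick check of the inches-to-centimetres conversion, here is a minimal sketch using Python's standard-library `statistics.NormalDist`, which supports multiplying a normal distribution by a constant. Note that `NormalDist` is parameterised by the standard deviation (4 inches), not the variance (16).

```python
from statistics import NormalDist

height_in = NormalDist(mu=67, sigma=4)      # heights in inches, variance 16
height_cm = height_in * 2.54                # Y = aX with a = 2.54

# The mean scales by a, the variance by a squared.
print(round(height_cm.mean, 1), round(height_cm.variance, 1))
```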

Example 23C: Another textbook says that the mass of an ostrich is normally distributed with mean 68 745 g and variance 13 201 000 g$^2$.
(a) Convert this information to a random variable with mass measured, more sensibly,
in kilograms.
(b) What is the probability that three ostriches weigh more than 225 kg?

Percentage points of the standard normal distribution


Often, instead of wanting to find $\Pr[Z > z]$ for some given value of $z$, we are given
a probability $p$ and need to find the value of $z$ that makes the equation $\Pr[Z > z] = p$
true. It is convenient to use the notation $z^{(p)}$ to denote the value of $z$ which provides
the solution to this equation, and to describe it as the 100p% point of the distribution.
Example 24A: Find $z^{(0.10)}$, the 10% point of the standard normal distribution.
In other words, we are asked to find the point along the standard normal distribution
such that 10% of the distribution lies to the right of it:

[Figure: density $f(z)$ of $Z \sim N(0, 1)$; the area between 0 and $z^{(0.10)}$ is 0.40 and the area to the right of $z^{(0.10)}$ is 0.10.]

Remember that Table 1 is constructed to give probabilities between 0 and $z$. Therefore, to find $z^{(0.10)}$, we search in the body of Table 1 until we find the closest value we
can to 0.40. We find 0.3997 when $z = 1.28$. Thus $z^{(0.10)} = 1.28$; we say that 1.28 is
the 10% point of the standard normal distribution. Sometimes, we need to be
more precise, and say that 1.28 is the upper 10% point of the standard normal distribution. Clearly, because of the symmetry of the standard normal distribution, $-1.28$ is
the lower 10% point of the standard normal distribution. The lower 10% point is also,
in a perverse way, the upper 90% point, so that we can even write $z^{(0.90)} = -1.28$!
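Percentage points can also be found by inverting the cumulative distribution function instead of searching the table. The sketch below uses Python's standard-library `statistics.NormalDist`; the function name `upper_point` is an illustrative choice. Since `inv_cdf` works with the area to the left of a point, the upper 100p% point is the value with area $1 - p$ to its left.

```python
from statistics import NormalDist

def upper_point(p):
    """Return z such that Pr[Z > z] = p for Z ~ N(0, 1)."""
    return NormalDist().inv_cdf(1 - p)

print(round(upper_point(0.10), 2))   # upper 10% point
print(round(upper_point(0.90), 2))   # upper 90% point, i.e. the lower 10% point
```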
Example 25C: Find (a) $z^{(0.05)}$, (b) $z^{(0.025)}$, (c) $z^{(0.01)}$, (d) $z^{(0.005)}$, (e) $z^{(0.25)}$, (f)
$z^{(0.5)}$, (g) $z^{(0.95)}$ and (h) $z^{(0.99)}$.

Solutions to examples
5C (a) $1 - \binom{10}{0} q^{10}$: (i) 0.0956 (ii) 0.6513 (iii) 0.9718
(b) $1 - \binom{20}{0} q^{20} - \binom{20}{1} p q^{19}$: (i) 0.0169 (ii) 0.6083 (iii) 0.9924
(c) The second procedure is less likely to reject relatively satisfactory consignments, and is more likely to reject very poor consignments. However, it costs
twice as much to do the checking, so there is a trade-off.
9C (a) 0.1125 106

(b) 0.4727

(c) x = 2 y = 6


12C (a) 0.5369

(b) 0.3679

13C (a) and (b) et

(c) 0.3679

(c) The events are the same.

14C (a) 0.2834 (Poisson) (b) 0.0257 (Poisson) (c) 0.1108 (Exponential) (d) 0.0158
(Binomial)
18C 38.94%
19C $\mu = 173.83$, $\sigma = 7.895$
20C (a) 0.1587 (b) 0.9332 (c) 235.5 ml (The machine is pretty useless!)
22C (a) 0.0537 (b) 0.0901
23C (a) The conversion factor is to divide by 1000. $Y \sim N(68.745, 13.201)$, where
$Y$ is measured in kilograms. (b) If one ostrich weighs $Y$ kg, then three weigh
$V = Y_1 + Y_2 + Y_3$ kg, and $V \sim N(3 \times 68.745, 3 \times 13.201) = N(206.235, 39.603)$.
$\Pr[V > 225] = \Pr[Z > 2.98] = 0.00144$.
25C (a) $z^{(0.05)} = 1.64$, (b) $z^{(0.025)} = 1.96$, (c) $z^{(0.01)} = 2.33$, (d) $z^{(0.005)} = 2.58$, (e)
$z^{(0.25)} = 0.67$, (f) $z^{(0.5)} = 0$, (g) $z^{(0.95)} = -1.64$ and (h) $z^{(0.99)} = -2.33$.

Exercises on the binomial distribution . . .

5.1 Suppose that 25% of the people entering a supermarket are aged between 18 and
30 years, classified as young adults. A market researcher has to fulfil a quota of
10 interviews. What is the probability her quota of interviews contains
(a) exactly x young adults?
(b) no young adults?
(c) between four and six (inclusive) young adults?

5.2

(a) A true-false test is given with five questions. To pass you need at least four
right. You guess each answer. What is the probability that you pass?
(b) The true-false test is replaced with a multiple-choice test with four alternative
answers, only one of which is correct. If you guess, what is the probability
that you pass?

5.3

A shopper has a choice between a supermarket and a hypermarket. She chooses


the supermarket 60% of the time, and the hypermarket 40% of the time. What
are the probabilities that, on her next seven shopping trips,
(a) she shops at the hypermarket three times?
(b) she shops at least twice at the supermarket?
(c) she shops at only one of the stores?

5.4

An anti-aircraft battery in England during World War II had on the average 3 out
of 10 successes in shooting down flying bombs that came within range. What was
the chance that, if eight bombs came within range, two or more were shot?


Poisson distribution . . .
5.5

What is the probability of finding 12 errors in a 200-page book if the printers have
an error rate of 0.075 errors per page? Assume that a Poisson distribution may
be used to model the occurrence of errors.

5.6

A pump fails, on the average, once in every 500 hours of operation.


(a) Find the probability that the pump has more than one failure during a 500-hour period.
(b) What is the probability of exactly 3 failures in 2000 hours of operation?

5.7

The average demand on a factory store for a certain electric motor is 8 per week.
When the storeman places an order for these motors, delivery takes one week. If
the demand for motors has a Poisson distribution, how low can the storeman allow
his stock to fall before ordering a new supply if he wants to be at least 95% sure
of meeting all requirements while waiting for his new supply to arrive?

5.8 The average number of accidental drownings per year is 3.5 per 100 000 population.
(a) Find the probability that in a city with a population of 200 000 there will be
between 4 and 8 (inclusive) accidental drownings per year.
(b) What are the probabilities that in towns of 15 000, 20 000 and 50 000 there
will be no drownings in a year?

Exponential distribution . . .
5.9 If the average drowning rate is 3.5 per 100 000 population per year, what is the
probability that the time interval between drownings in a city of 200 000 will be
less than one month?

5.10 Customers arrive at a restaurant at the rate of 90 per hour during lunch time.
(a) If a customer has just arrived, what is the probability that it will be at least
another minute before the next customer arrives?
(b) If a minute has already passed by since the last customer arrived, what is
the probability that it is at least another minute before the next customer
arrives?
(Hint: use conditional probabilities.)

5.11

The life of an electronic device is known to have the exponential distribution with
parameter $\lambda = 1/1000$.
(a) What is the probability that the device lasts less than 1000 hours?
(b) What is the probability it will last more than 1200 hours?
(c) If three such devices are taken at random, what is the probability that one
will last less than 800 hours, another between 500 and 1200 hours, and the
third between 1200 and 2000 hours?

5.12 The duration (in minutes) of showers on a tropical island is approximately exponentially distributed with $\lambda = 1/5$.
(a) Out of 3 showers, what is the probability that not more than 2 will last for
10 minutes or more?
(b) What is the probability that a shower will last at least 2 minutes more, given
that it has already lasted 5 minutes?

Normal distribution . . .
5.13 If the random variable $Z$ has the standard normal distribution (i.e. $Z \sim N(0; 1)$),
find the following probabilities:
(a) $P[0 < Z < 1]$
(b) $P[0 < Z < 1.96]$
(c) $P[-1.64 < Z < 1.64]$
(d) $P[Z \geq 2]$
(e) $P[Z < -1.38]$
(f) $P[Z < 2.1]$
(g) $P[-2.3 < Z < 1.6]$
(h) $P[1 < Z < 2]$
(i) $P[-1.74 < Z < -0.86]$
(j) $P[-3 \leq Z \leq 3]$

5.14 Given that $Z$ has standard normal distribution, what values must $z$ have in order
to make each of the following statements true?
(a) $P[0 < Z < z] = 0.475$
(b) $P[-z < Z < z] = 0.95$
(c) $P[Z < z] = 0.05$
(d) $P[Z > z] = 0.005$
(e) $P[Z < z] = 0.99$

5.15 If $X$ is distributed as a normal variable with mean 3 and variance 4, find


(a) P [X < 4]
(b) P [|X| < 6]
(c) P [3.5 < X < 6.5]

5.16

If $X \sim N(0; \tfrac{1}{4})$ find

(a) P [X > 2]
(b) P [0 < X < 1]

5.17 If $X \sim N(1; 4)$ find numbers $x_0$ and $x_1$ such that
(a) $P[X > x_0] = 0.10$
(b) $P[X > x_1] = 0.80$

5.18

If $X$ has a normal distribution, and if $P[X < 10] = 0.8413$, what is the value of
the mean if the distribution is known to have variance $\sigma^2 = 16$?

5.19

Sports shirts are frequently classified as S, M, L and XL for small, medium, large
and extra large neck sizes. S fits a neck circumference of less than 37 cm, M fits
between 37 and 40.5 cm and L fits between 40.5 and 44 cm while XL fits necks
over 44 cm in circumference. The neck circumference of adult males has a normal
distribution with $\mu = 40$ and $\sigma = 2$.
(a) What proportion of shirts should be manufactured in each category?


(b) If you wanted to define categories S, M, L, XL so that each category contained
25% of the total population of adult males, what neck sizes must you assign
to each of these categories?
5.20

Suppose that the profit (or loss) per day of a shopkeeper dealing in a perishable
item is approximately normally distributed with mean R10 and standard deviation
R5. What is the probability that
(a) he makes a loss on any one day?
(b) his profit exceeds R14?
(c) he makes exactly R10 profit?

5.21 Consider an I.Q. test for which the scores of adult Americans are known to have a
normal distribution with expected value 100 and variance 324, and a second I.Q.
test for which the scores of adult Americans are known to have a normal distribution with expected value 50 and variance 100. Under the assumption that both
tests measure the same phenomenon (intelligence), what score on the second
test is comparable to a score of 127 on the first test? Explain your answer.

Further exercises, with the distributions mixed up! . . .


5.22 Which of the four probability distributions we have considered might serve as
models for
(a)
(b)
(c)
(d)

the intervals of time between breakdowns of a computer?


the number of times a total of 6 occurs when 2 dice are thrown 5 times?
the precise masses of packets of 36 biscuits?
the number of telephone calls received each day by a telephone counselling
service?

5.23

The average number of oil tankers arriving each day at a Persian Gulf port is
known to be 7. The facilities at the port can handle at most 10 tankers per day.
If tankers arrive at random, what is the probability that on a given day tankers
have to be turned away?

5.24

Airplane engines operate independently in flight and fail with probability 1/10.
A plane makes a successful flight if at most half of its engines fail. Determine the
probability of a successful flight for two-engined and four-engined planes.

5.25

An ice-cream vendor's sales follow a Poisson distribution with an average rate of
10 per hour.
(a) What is the probability that he sells at least one ice-cream in his first hour
of operation?
(b) How much should his stock of ice-cream be at any point in time if he wants
to be at least 95% sure that he does not run out of ice-cream in the following
hour?
(c) Given that he has made no sale during the past 15 minutes, what is the
probability that he makes a sale within the next 20 minutes?

5.26 A die is thrown 10 times. What is the probability of obtaining at least three even
numbers?

5.27 The annual income of residents of Bishopscourt is normally distributed with mean
R25 000 and standard deviation R5000. What is the highest income of the lowest
20% of income earners in Bishopscourt?

5.28 A liquid culture medium contains on the average m bacteria per ml. A large
number of samples is taken, each of 1 ml, and bacteria are found to be present in
90% of the samples. Estimate m.
5.29 The strength of a plastic produced by a certain process is known to be normally
distributed. If 10% of the plastic has a strength of at least 4000 kg, and 70% has
a strength exceeding 3000 kg, what are the mean and standard deviation of the
distribution?
5.30

A bank has 175 000 credit card holders. During one month the average amount
spent by each card holder totalled R192,50 with a standard deviation of R60,20.
Assuming a normal distribution, determine the number of card holders who spent
more than R250.

5.31

The maximum (stated) load of a passenger lift is 8 passengers or 600 kg. If the
masses of people using the lift can be considered to be normally distributed with
mean 70 kg and standard deviation 15 kg, how often will the combined mass of 8
passengers exceed the 600 kg limit?

5.32 A road is constructed so that the right-turn lane at an intersection has a capacity
of 3 cars. Suppose that 30% of cars approaching the intersection want to turn
right. If a string of 15 cars approaches the intersection, what is the probability
that the lane will be insufficiently large to hold all the cars wanting to turn right?
5.33

The weekly demand for sulphuric acid from the store of a chemical factory is
normally distributed with mean 246 litres and standard deviation 50 litres. After
placing an order with the sulphuric acid manufacturers, delivery to the store takes
one week.
(a) How low can the stock of sulphuric acid be allowed to fall before ordering a
new supply in order to be 95% sure of meeting all requirements while waiting
for the new supply to arrive?
(b) What volume of sulphuric acid should then be ordered so that, with 95%
certainty, it will not be necessary to reorder within 6 weeks?

The keen may like to try these exercises . . .


5.34 Suppose the number of eggs a bird lays in its nest has a Poisson distribution with
parameter $\lambda$. Suppose each egg hatches with probability $p$. Show that the number
of eggs that hatch in a nest has a Poisson distribution with parameter $\lambda p$.
5.35 (a) Suppose we have $n$ trials of a random experiment with three possible outcomes which have probabilities $p_1$, $p_2$ and $p_3$ ($p_1 + p_2 + p_3 = 1$). Show that
the probability that $x_1$ of the $n$ trials have the first outcome, $x_2$ the second
outcome, $x_3$ the third outcome ($x_1 + x_2 + x_3 = n$) is given by the trinomial
distribution
\[ p(x_1, x_2, x_3) = \binom{n}{x_1 \; x_2 \; x_3} p_1^{x_1} p_2^{x_2} p_3^{x_3}. \]

(b) Extend this result to the multinomial distribution: there are now $m$ possible
outcomes with probabilities $p_1, p_2, \ldots, p_m$ ($\sum_{i=1}^{m} p_i = 1$), and the
probability that $x_1$ trials have the first outcome, $x_2$ the second, $\ldots$, $x_m$ the
$m$th outcome ($\sum_{i=1}^{m} x_i = n$) is given by
\[ \binom{n}{x_1 \; x_2 \; \ldots \; x_m} p_1^{x_1} p_2^{x_2} \cdots p_m^{x_m}. \]

Solutions to exercises . . .
5.1 (a) $\binom{10}{x} 0.25^x \, 0.75^{10-x}$ (b) 0.0563 (c) 0.2206
5.2 (a) 0.1875 (b) 0.0156
5.3 (a) 0.2903 (b) 0.9812 (c) 0.0296
5.4 0.7447
5.5 0.0829
5.6 (a) 0.2642 (b) 0.1954
5.7 13
5.8 (a) 0.6473 (b) 0.5916, 0.4966 and 0.1738
5.9 0.442
5.10 (a) 0.2231 (b) 0.2231
5.11 (a) 0.632 (b) 0.301 (c) $0.551 \times 0.305 \times 0.166 = 0.0279$
5.12 (a) Pr(a shower lasts longer than 10 mins) = 0.135, $\Pr(X \leq 2) = 0.9975$ (b) 0.6703
5.13 (a) 0.3413 (b) 0.4750 (c) 0.8990 (d) 0.0228 (e) 0.0838 (f) 0.9821 (g) 0.9345 (h) 0.1359 (i) 0.1540 (j) 0.9973
5.14 (a) 1.96 (b) 1.96 (c) $-1.64$ (d) 2.58 (e) 2.33
5.15 (a) 0.6915 (b) 0.9332 (c) 0.3612
5.16 (a) 0.0000 (b) 0.4773
5.17 (a) 3.56 (b) 2.68
5.18 $\mu = 6$
5.19 (a) 7% S, 53% M, 38% L, 2% XL (b) S < 38.65 < M < 40 < L < 41.35 < XL
5.20 (a) 0.0228 (b) 0.2119 (c) 0
5.21 65


5.22 (a) Exponential (b) Binomial (c) Normal (d) Poisson


5.23 0.0985
5.24 0.9900 and 0.9963 for 2- and 4-engined planes, respectively
5.25 (a) 0.999995

(b) 15

(c) 0.964

5.26 0.9453
5.27 R20 800
5.28 m = 2.3026
5.29 $\mu = 3288.89$, $\sigma = 555.56$
5.30 29 488
5.31 0.1736
5.32 0.7031
5.33 (a) 328.0 litres

(b) 1676.9 litres

Chapter 6

MORE ABOUT RANDOM VARIABLES

KEYWORDS: Mean, variance, standard deviation, and coefficient of
variation of a random variable; the distribution function and the median of a random variable; the normal approximations to the binomial
and Poisson distributions.

Mean and variance of a random variable . . .


In chapter 1 we learnt about measures of location and spread for samples of data. We
now develop equivalent concepts for random variables. The most important measures of
location and spread for random variables are given the same names, the mean and the
variance, as were used for samples of data. However, the formulae defining the mean of
a random variable and the variance of a random variable are completely different
from the formulae which defined the mean of a sample, which we denoted $\bar{x}$, and the
variance of a sample, denoted $s^2$. The justification for using the names mean and
variance in both contexts is fairly subtle and will be made clear in chapter 8.
As you might expect, discrete and continuous random variables are handled separately, but the notation is the same in both cases. The mean of a random variable $X$
is denoted by $\mu$ or $E[X]$ (the expected value of $X$) and the variance of a random
variable $X$ is denoted by $\sigma^2$ or $\mathrm{Var}[X]$ (the variance of $X$).
A discrete random variable $X$ has a probability mass function $p(x)$; the values of
$p(x)$ are non-zero for a countable set of x-values. For a discrete random variable $X$,
the mean is defined to be
\[ \mu = E[X] = \sum_x x \, p(x). \]
In words, this says that the mean of a discrete random variable is equal to the sum of a
set of terms, each term being one of the values that the random variable can take on ($x$),
multiplied by the probability of taking that value ($p(x)$). The variance makes immediate
use of $\mu$ as defined above. The variance of a discrete random variable is defined to be
\[ \sigma^2 = \mathrm{Var}[X] = \sum_x (x - \mu)^2 \, p(x). \]
The sum is taken over the set of values for which the probability mass function $p(x)$ is
positive.

A continuous random variable $X$ has a probability density function $f(x)$; usually,
the values of $f(x)$ are non-zero on some interval $(a, b)$. For a continuous random variable
$X$, the mean is defined to be
\[ \mu = E[X] = \int_a^b x \, f(x) \, dx \]
and the variance is
\[ \sigma^2 = \mathrm{Var}[X] = \int_a^b (x - \mu)^2 f(x) \, dx. \]
The limits of integration are taken over the interval for which the probability density
function $f(x)$ is non-zero.
As was the case with the variance of a sample in chapter 1, there is an alternative
formula for the variance of a random variable, which also provides a short-cut for
many problems. If $X$ is discrete, use
\[ \sigma^2 = \mathrm{Var}[X] = \sum_x x^2 \, p(x) - \mu^2. \]
For $X$ continuous, use
\[ \sigma^2 = \mathrm{Var}[X] = \int_a^b x^2 f(x) \, dx - \mu^2. \]
You should easily be able to prove that the pairs of formulae for the variance of discrete
and continuous random variables are equivalent.
For both discrete and continuous random variables, the standard deviation of
the random variable $X$ is defined to be the square root of the variance:
\[ \sigma = \sqrt{\mathrm{Var}(X)}. \]

The coefficient of variation of the random variable $X$ is defined to be the
ratio of the standard deviation and the mean:
\[ CV = \frac{\sqrt{\mathrm{Var}(X)}}{E(X)} = \sigma / \mu. \]

The coefficient of variation has uses in sampling theory. It is frequently multiplied by
100 and then expressed as a percentage. The coefficient of variation is only sensibly
defined if the lower limit of the random variable $X$ is zero.
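Example 1A: Suppose the random variable $X$ has probability density function
\[ f(x) = \begin{cases} 6x(1 - x) & 0 \leq x \leq 1 \\ 0 & \text{otherwise} \end{cases} \]
Find (a) the mean, (b) the variance and (c) the coefficient of variation of the random
variable $X$.
(a) To find the mean, we use the definition for a continuous random variable:
\[ E[X] = \int_0^1 x \, f(x) \, dx = \int_0^1 x \cdot 6x(1 - x) \, dx = 6 \int_0^1 (x^2 - x^3) \, dx \]
\[ = 6 \left[ \tfrac{1}{3} x^3 - \tfrac{1}{4} x^4 \right]_0^1 = 6 \left( \tfrac{1}{3} - \tfrac{1}{4} \right) = \tfrac{1}{2}. \]

(b) We use the alternative formula for finding the variance:
\[ \mathrm{Var}[X] = \int_0^1 x^2 f(x) \, dx - \mu^2 = \int_0^1 x^2 \cdot 6x(1 - x) \, dx - \left( \tfrac{1}{2} \right)^2 = 6 \int_0^1 (x^3 - x^4) \, dx - \tfrac{1}{4} \]
\[ = 6 \left[ \tfrac{1}{4} x^4 - \tfrac{1}{5} x^5 \right]_0^1 - \tfrac{1}{4} = 6 \left( \tfrac{1}{4} - \tfrac{1}{5} \right) - \tfrac{1}{4} = \tfrac{1}{20}. \]

(c) The coefficient of variation is given by
\[ CV = \frac{\sqrt{\mathrm{Var}[X]}}{E[X]} = \frac{\sqrt{1/20}}{1/2} = 0.4472. \]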
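The results of Example 1A can be checked by numerical integration. The sketch below uses a simple midpoint rule written out by hand so that it needs nothing beyond plain Python; the helper `integrate` and the number of sub-intervals are illustrative choices, and the output is an approximation rather than an exact value.

```python
def integrate(g, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

def f(x):
    """Density from Example 1A: f(x) = 6x(1 - x) on [0, 1]."""
    return 6 * x * (1 - x)

mean = integrate(lambda x: x * f(x), 0, 1)
var = integrate(lambda x: x ** 2 * f(x), 0, 1) - mean ** 2   # short-cut formula
cv = var ** 0.5 / mean

print(round(mean, 4), round(var, 4), round(cv, 4))
```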
Example 2A: The discrete random variable $X$ has probability mass function given by
\[ p(x) = \begin{cases} 1/6 & x = 1, 2, 3, 4, 5, 6 \\ 0 & \text{otherwise} \end{cases} \]
Find (a) the mean, (b) the variance and (c) the coefficient of variation of the random
variable.
(a) We use the definition of the mean of a discrete random variable:
\[ E(X) = \sum_x x \, p(x) = \sum_{x=1}^{6} x \times \tfrac{1}{6} = 1/6 + 2/6 + 3/6 + 4/6 + 5/6 + 6/6 = 3.5. \]
(b) We use the alternative formula for finding the variance of a discrete random variable:
\[ \mathrm{Var}(X) = \sum_x x^2 \, p(x) - \mu^2 = 1/6 + 4/6 + 9/6 + 16/6 + 25/6 + 36/6 - 3.5^2 = 2.917. \]

(c) The coefficient of variation is
\[ CV = \frac{\sqrt{\mathrm{Var}(X)}}{E(X)} = \frac{\sqrt{2.917}}{3.5} = 0.488. \]
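For a discrete random variable the sums in Example 2A translate almost word for word into code. This is a minimal sketch using exact fractions from Python's standard library; the dictionary name `pmf` is an illustrative choice.

```python
from fractions import Fraction
from math import sqrt

pmf = {x: Fraction(1, 6) for x in range(1, 7)}   # p(x) = 1/6 for x = 1, ..., 6

mean = sum(x * p for x, p in pmf.items())                    # sum of x p(x)
var = sum(x ** 2 * p for x, p in pmf.items()) - mean ** 2    # short-cut formula
cv = sqrt(var) / float(mean)

print(mean, var, round(cv, 3))
```

Working in fractions shows that the variance is exactly 35/12, which rounds to the 2.917 quoted in the example.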

In the next two examples, it will help you to be reminded that the mean is the sum
of the values that the random variable takes on multiplied by the probabilities of taking
on these values.


Example 3C: You are contemplating whether it is worth your while driving out to the
Northern Suburbs to call on a client. You estimate that there is a 40% chance that the
client will purchase your product. Commission on the sale is R50, but the petrol will
cost you R15.
(a) Should you call on the client?
(b) What probability of purchase would lead to an expected net gain of zero for the
call?
Example 4C: You are considering investing in one of two shares listed on the stock
exchange. You estimate that the probability is 0.3 that share A will decline by 15% and
a probability of 0.7 that it will rise by 30%. Correspondingly, for share B, you estimate
that the probability is 0.4 that it will decline by 15% and that the probability that it will
rise by 30% is 0.6. The return on a share is defined as the percentage price change.
(a) Calculate the expected return for each share.
(b) Calculate the standard deviations of the returns for each share.
(c) Which of the two shares would you say is the more risky? Why?
(d) Calculate the coefficient of variation for each share.
(e) Which of the two shares would you buy? Why?

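Example 5B: Suppose the random variable X has the Poisson distribution with parameter $\lambda$. Find E[X].
The probability mass function for the Poisson distribution is
$p(x) = \dfrac{e^{-\lambda}\lambda^x}{x!}$    $x = 0, 1, 2, \ldots$
$= 0$    otherwise

Using the definition of the mean of a discrete random variable, we have
$E[X] = \sum_{x=0}^{\infty} x\,p(x)$
$= \sum_{x=0}^{\infty} x\,\dfrac{e^{-\lambda}\lambda^x}{x!}$
$= \sum_{x=1}^{\infty} x\,\dfrac{e^{-\lambda}\lambda^x}{x\,(x-1)!}$    because $x! = x\,(x-1)!$
$= \lambda e^{-\lambda} \sum_{x=1}^{\infty} \dfrac{\lambda^{x-1}}{(x-1)!}$
$= \lambda e^{-\lambda} \left(1 + \dfrac{\lambda}{1!} + \dfrac{\lambda^2}{2!} + \cdots\right)$
$= \lambda e^{-\lambda} e^{\lambda}$
$= \lambda$

The mean of a Poisson distribution is its parameter, $\lambda$.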
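
As a numerical sanity check on this result, the infinite sum can be truncated and evaluated directly. The sketch below is illustrative only: the function names and the cut-off of 100 terms are assumptions, chosen so that the neglected tail is negligible for moderate values of the parameter.

```python
import math

def poisson_pmf(x, lam):
    """p(x) = e^(-lam) lam^x / x! for x = 0, 1, 2, ..."""
    return math.exp(-lam) * lam**x / math.factorial(x)

def poisson_mean(lam, terms=100):
    """Truncated version of E[X] = sum over x of x p(x)."""
    return sum(x * poisson_pmf(x, lam) for x in range(terms))

print(poisson_mean(2.5))  # should be very close to the parameter 2.5
```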

CHAPTER 6. MORE ABOUT RANDOM VARIABLES


Example 6B: Suppose $X \sim B(n, p)$. Find the expected value of X.

The probability mass function for the binomial distribution is given by
$p(x) = \binom{n}{x} p^x q^{n-x}$    $x = 0, 1, \ldots, n$
$= 0$    otherwise
Once again, we use the definition of the mean of a discrete random variable:
$E[X] = \sum_{x=0}^{n} x \binom{n}{x} p^x q^{n-x}$
$= \sum_{x=1}^{n} x \binom{n}{x} p^x q^{n-x}$
$= \sum_{x=1}^{n} \dfrac{n!}{(x-1)!\,(n-x)!}\, p^x q^{n-x}$
$= np \sum_{x=1}^{n} \dfrac{(n-1)!}{(x-1)!\,(n-x)!}\, p^{x-1} q^{n-x}.$

Now substitute $y = x - 1$, and substitute $m = n - 1$. With this substitution, note
that when $x = 1$, $y = 0$, and that when $x = n$, $y = n - 1$, which in terms of the
second substitution is $y = m$. Note also that $n - x = m - y$. We use these to adjust the
lower and upper limits of the summation.
$E[X] = np \sum_{y=0}^{m} \dfrac{m!}{y!\,(m-y)!}\, p^y q^{m-y}$
$= np \sum_{y=0}^{m} \binom{m}{y} p^y q^{m-y}$
$= np \times 1,$

because
$\sum_{y=0}^{m} \binom{m}{y} p^y q^{m-y} = 1;$

it is the sum of all possible values of the probability mass function of the binomial
distribution B(m, p). Therefore, as required
$E[X] = np,$
which says that the mean of the binomial distribution B(n, p) is the product of the two
parameters of the distribution, n and p.
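
The identity E[X] = np can also be confirmed numerically by summing x p(x) over the whole range. The following is a minimal sketch; the function names and the particular values n = 20 and p = 0.3 are illustrative assumptions, not part of the example.

```python
import math

def binomial_pmf(x, n, p):
    """p(x) = C(n, x) p^x q^(n-x) with q = 1 - p."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def binomial_mean(n, p):
    """E[X] computed directly from the definition of the mean."""
    return sum(x * binomial_pmf(x, n, p) for x in range(n + 1))

n, p = 20, 0.3
print(binomial_mean(n, p), n * p)  # the two values should agree
```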

A geometrical interpretation of the mean and the variance . . .


The formulae for finding the mean are mathematically equivalent to performing the
following operations:
$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx$ is equivalent to cutting the shape of the graph of f(x) out
of a piece of tin or cardboard of uniform thickness and finding the point along the x-axis
on which it balances.
$E[X] = \sum_x x\,p(x)$ is equivalent to hanging masses corresponding to p(x) at the
point x along a ruler, and finding the point at which the ruler balances.

[Figure: two plots of a density f(x) against x. Left panel: a peaked density, labelled "Variance small". Right panel: a flat density, labelled "Variance large".]

The variance may be thought of as a measure of the average squared distance of the random
variable X from its mean. If the p.d.f. or p.m.f. is very flat, the variance will be
large. If the probability function is very peaked, the variance will be small. This is
illustrated above. In the plot on the left, f(x) is peaked, and the terms $(x - \mu)^2$ are on
average small, leading to a small variance. On the other hand, if as in the plot on the
right, f(x) is flat, then the terms $(x - \mu)^2$ tend to be large, leading to a large variance.
Applied Mathematics students will see the relationship between means and centres of gravity and between variances and moments of inertia.

Skewness . . .
Just as a histogram can be skew, so can the distribution of a random variable.
We use the same terminology here as in chapter 1. A random variable is said to be
positively skewed if it has a long tail on the right-hand side. Similarly, a random
variable has negative skewness if it has a long tail on the left-hand side. Symmetric
distributions are just that: symmetric, so that the tail on the left is a mirror image of
the tail on the right.
Statisticians sometimes need to describe the shape of the tails of a probability distribution. Even if two distributions may have the same mean and variance, the shapes
of the tails may be quite different. Statisticians distinguish between heavy-tailed distributions, in which the probability of observations far from the mean is relatively large,
and light-tailed distributions, in which observations far from the mean are unlikely.

[Figure: three plots of a density f(x) against x, with panels titled "Negatively skewed", "Symmetric" and "Positively skewed".]

The distribution function F(x) . . .

Let X be a random variable with probability density function f(x) or probability
mass function p(x). The function that gives the probability that X takes on a value
less than or equal to x is called the distribution function and is denoted by F(x):
$F(x) = \Pr[X \le x].$
If X is continuous,
$F(x) = \int_{-\infty}^{x} f(t)\,dt$
and if X is discrete
$F(x) = \sum_{t \le x} p(t).$

Note that for X continuous, $F'(x) = f(x)$, i.e. the derivative of the distribution function
is the probability density function.
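Example 7A: Find the distribution function F(x) for the exponential distribution.
The density function is $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$, so for $x \ge 0$
$F(x) = \Pr[X \le x] = \int_0^x f(t)\,dt = \int_0^x \lambda e^{-\lambda t}\,dt$
$= \left[-e^{-\lambda t}\right]_0^x = -e^{-\lambda x} + e^{0}$
$= 1 - e^{-\lambda x}$

The distribution function should be defined for the domain $(-\infty, \infty)$: thus we write
$F(x) = 0$    $x < 0$
$= 1 - e^{-\lambda x}$    $x \ge 0$
as the distribution function for the exponential distribution. Differentiating F(x) yields
$\dfrac{dF(x)}{dx} = f(x) = 0$    $x < 0$
$= \lambda e^{-\lambda x}$    $x \ge 0$
the probability density function of the exponential distribution.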


Example 8A: Find the distribution function for the random variable X with the
binomial distribution $X \sim B(3, 0.2)$.
$\Pr[X = x] = \binom{3}{x}\, 0.2^x\, 0.8^{3-x}$    $x = 0, 1, 2, 3$
which yields p(0) = 0.512, p(1) = 0.384, p(2) = 0.096, p(3) = 0.008. Thus
$\Pr[X \le 0] = 0.512$, $\Pr[X \le 1] = 0.512 + 0.384 = 0.896$, etc, so that
$F(x) = 0$    $x < 0$
$= 0.512$    $0 \le x < 1$
$= 0.896$    $1 \le x < 2$
$= 0.992$    $2 \le x < 3$
$= 1$    $x \ge 3$
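
The step-function values in Example 8A can be reproduced by accumulating the binomial probabilities. This is a sketch only; the function name binomial_cdf is an illustrative choice, and the function is written to accept any real x so that it matches the definition of F(x) on the whole real line.

```python
import math

def binomial_cdf(x, n, p):
    """F(x) = Pr[X <= x] for X ~ B(n, p), defined for every real x."""
    if x < 0:
        return 0.0
    k = min(math.floor(x), n)  # largest attainable value not exceeding x
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

for x in (-1, 0, 0.5, 1, 2, 3, 10):
    print(x, round(binomial_cdf(x, 3, 0.2), 3))
```

Evaluating at a non-integer such as 0.5 shows the flat stretch between the jumps.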
The graphs of the distribution function F(x) for Examples 7A and 8A are shown
below. The distribution function F(x) is always a non-decreasing function, taking values
between 0 and 1. If X is a discrete random variable, then F(x) will be a step function.

[Figure: two plots of F(x) against x. Left panel: $F(x) = 1 - e^{-\lambda x}$ (with $\lambda = 1$), Example 7A. Right panel: step function of Example 8A.]

The distribution function gives us an expression which can be used directly to


compute probabilities. For a continuous random variable, the distribution function can
frequently be expressed as a formula, and this does away with the need to integrate to
compute probabilities. For a discrete random variable, the distribution function is always
a step function (why?), and is therefore less useful for computing probabilities than for
a continuous random variable. Notice that the domain of the distribution function F (x)
is always the entire real line for both discrete and continuous random variables.
Example 9C: Find the distribution function for the random variable X with density
function
$f(x) = \frac{1}{2}/\sqrt{x}$    $0 < x \le 1$
$= 0$    otherwise


Example 10B: Suppose events are occurring at random with average rate $\lambda$ per unit
of time. What is the probability density function of the random variable X, the waiting
time to the second event?
We first find F(x).
$F(x) = \Pr[X \le x]$ = Pr[waiting less than x units of time for the 2nd event]
= 1 − Pr[waiting more than x units of time for the 2nd event]
= 1 − Pr[less than 2 events take place in x units of time]
= 1 − Pr[0 events in x units] − Pr[1 event in x units]
$= 1 - e^{-\lambda x} - \lambda x e^{-\lambda x},$

using the Poisson distribution (the number of events in x units of time has a Poisson
distribution with parameter $\lambda x$). We now differentiate F(x) to find the density function
f(x).
$f(x) = \dfrac{dF(x)}{dx} = \lambda e^{-\lambda x} - \lambda e^{-\lambda x} + \lambda^2 x e^{-\lambda x}$
$= \lambda^2 x e^{-\lambda x}.$

Thus the density function of the random variable X, the waiting time to the second
event, is given by
$f(x) = \lambda^2 x e^{-\lambda x}$    $x \ge 0$
$= 0$    otherwise
This is an extension of the exponential distribution which is the density function of the
waiting time to the first event. See Exercises 6.23 and 6.24.
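
The differentiation step in Example 10B can be checked numerically by comparing a finite-difference slope of F(x) with the stated density. The sketch below is illustrative: the function names, the step size h and the test values of the rate and of x are all assumptions made for the check.

```python
import math

def waiting_cdf(x, lam):
    """F(x) = 1 - e^(-lam x) - lam x e^(-lam x) for x >= 0, else 0."""
    if x < 0:
        return 0.0
    return 1 - math.exp(-lam * x) - lam * x * math.exp(-lam * x)

def waiting_pdf(x, lam):
    """f(x) = lam^2 x e^(-lam x) for x >= 0, else 0."""
    if x < 0:
        return 0.0
    return lam**2 * x * math.exp(-lam * x)

def numerical_derivative(func, x, h=1e-6):
    """Central-difference approximation to the derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)

lam, x = 2.0, 0.7
slope = numerical_derivative(lambda t: waiting_cdf(t, lam), x)
print(slope, waiting_pdf(x, lam))  # the two values should agree closely
```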
Example 11C: A company which sells expensive woodworking machinery has established that the time between sales can be modelled by an exponential distribution, with
parameter $\lambda = 2$ per five-day working week. The company has had no sales over the last
week, and the manager fears that if no sales are made in the next few days, the company
will be in serious financial difficulty. Throw some light on the situation by computing
an expression for the probability that the number of days between two sales will be x
or fewer days. Plot this function. Can you allay the manager's fears?
Example 12C: An estate agent has five houses to sell. She believes her situation can
be modelled by a binomial process, and that her probability of selling each house within
a month is 0.4. Compute and plot the distribution function of the number of sales in
the month. How would this plot help the estate agent?
Example 13C: A student believes that she is equally likely to obtain a mark between
45% and 60% for an examination.
(a) Find the distribution function for the random variable giving the student's examination mark. Assume that the random variable is continuous!
(b) Use the distribution function to determine the probability that the student obtains
less than 50% for her examination.


The median of a random variable . . .

The distribution function gives us a way of defining the median of a random
variable (which is not the same as the median of a sample, considered in chapter 1). If
X is a random variable with distribution function F(x), then the median of X, denoted
$x_m$, satisfies the equation
$F(x_m) = \frac{1}{2}.$
If X is continuous, the median is clearly the value $x_m$ such that
$\int_{-\infty}^{x_m} f(x)\,dx = \frac{1}{2}.$

Half of the area under the density function lies below the median, and half lies above it:
the picture makes this clear. When the density function is symmetric (and the mean exists),
the mean and the median coincide; for a skewed density they generally differ.
[Figure: a positively skewed density f(x) plotted against x, with the median $x_m$ and the mean marked on the x-axis; an area of 0.5 lies on each side of the median.]

If X is discrete, F(x) is a step function, and the median is taken to be the lowest
value of x for which $F(x) \ge \frac{1}{2}$.
Example 14A: Find the median of the exponential distribution.
In Example 7A we showed that the distribution function of the exponential distribution is
$F(x) = 1 - e^{-\lambda x}.$
The median $x_m$ is therefore the solution to the equation
$\frac{1}{2} = 1 - e^{-\lambda x_m}$
Rearranging, this yields
$e^{-\lambda x_m} = \frac{1}{2}.$
Now take natural logarithms to obtain
$-\lambda x_m = \log_e \frac{1}{2},$
so that, finally, the median is given by
$x_m = 0.6931/\lambda.$
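
A short check of Example 14A: the constant 0.6931 is the natural logarithm of 2, and substituting the median back into the distribution function should return one half. The function names and the value of the rate below are illustrative assumptions.

```python
import math

def exponential_cdf(x, lam):
    """F(x) = 1 - e^(-lam x) for x >= 0, else 0."""
    return 1 - math.exp(-lam * x) if x >= 0 else 0.0

def exponential_median(lam):
    """Solution of F(x_m) = 1/2, i.e. x_m = log(2) / lam."""
    return math.log(2) / lam

lam = 2.0
x_m = exponential_median(lam)
print(x_m, exponential_cdf(x_m, lam))
```

For comparison, the mean of the exponential distribution is the reciprocal of the rate (Exercise 6.6), so the median lies below the mean, as expected for a positively skewed distribution.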


Example 15C: Find the distribution functions and medians of the random variables
having the following probability density/mass functions. Compare the medians with the
means.
(a)
$p(x) = \binom{4}{x}\, 0.5^4$    $x = 0, 1, 2, 3, 4$
$= 0$    otherwise
(b)
$p(x) = \binom{4}{x}\, 0.4^x\, 0.6^{4-x}$    $x = 0, 1, 2, 3, 4$
$= 0$    otherwise
(c)
$f(x) = 1/5$    $3 < x < 8$
$= 0$    otherwise
(d)
$f(x) = 1/x$    $1 \le x \le e$
$= 0$    otherwise

Using the normal distribution to approximate the binomial and Poisson distributions . . .

As surprising as it might appear, it is possible (under certain conditions) to compute
approximate probabilities for both the binomial and Poisson distributions, which are
both discrete, using our tables for the normal distribution, which is continuous.
So far we have shown (Examples 6B and 5B, respectively) that if $X \sim B(n, p)$
then $E[X] = np$ and if $X \sim P(\lambda)$, then $E[X] = \lambda$. We now need Var[X] for both
distributions. Finding these variances is a fairly stiff exercise which is left for you to do
(Exercise 6.22). We simply state that for the binomial distribution, $\mathrm{Var}[X] = npq =
np(1 - p)$ and for the Poisson distribution, $\mathrm{Var}[X] = \lambda$, the same as the mean.
NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION
If $X \sim B(n, p)$ and both $np$ and $n(1 - p)$ are greater than 5 and
$0.1 < p < 0.9$ then
$X \mathrel{\dot\sim} N(np,\; np(1 - p)),$
where $\dot\sim$ means "is approximately distributed".

The normal distribution used to approximate the binomial distribution is the one
with the same mean and variance as the binomial distribution being approximated.
The same principle applies for the normal approximation to the Poisson distribution;
the approximating normal distribution has the same mean and variance as the Poisson
distribution being approximated.

NORMAL APPROXIMATION TO THE POISSON DISTRIBUTION
If $X \sim P(\lambda)$ and $\lambda > 10$ then
$X \mathrel{\dot\sim} N(\lambda, \lambda).$

The examples show how to use the approximation to compute binomial and Poisson
probabilities.
Example 16A: If you toss an unbiased coin 20 times what is the probability of exactly
15 heads?
X, the number of heads, has a binomial distribution $B(20, \frac{1}{2})$ with mean $np = 10$
and variance $np(1 - p) = 5$. Also $n(1 - p) = 10$. Because $np > 5$ and $n(1 - p) > 5$
and $0.1 < p < 0.9$, we can approximate the distribution of X by means of a normally
distributed random variable which we will denote Y. The appropriate normal distribution
to use to approximate the binomial distribution of X is the one with the same mean and
variance as X. Thus we take $\mu = 10$ and $\sigma^2 = 5$ so that $Y \sim N(10, 5)$. We write
$X \mathrel{\dot\sim} N(10, 5)$
and say "X is distributed approximately normally with mean 10 and variance 5".
We now have to get around the problem of using a continuous distribution for a
discrete random variable. The probability that the normal distribution takes on the
value 15 is zero. If we are going to obtain a positive probability, we need an interval
over which to evaluate the area under the curve. How do we choose this interval? The
approximation has been designed in such a way that to obtain the probability that
X = 15 for the binomial distribution, we calculate $\Pr[14.5 < Y < 15.5]$ for the normal
distribution N(10, 5).
We learnt in Chapter 5 how to find probabilities for any normal distribution; we
transform it to the standard normal distribution, using the formula $z = (x - \mu)/\sigma$ with
$\mu = 10$ and $\sigma = \sqrt{5}$. Thus
$\Pr(14.5 < Y < 15.5) = \Pr\left(\dfrac{14.5 - 10}{\sqrt{5}} < Z < \dfrac{15.5 - 10}{\sqrt{5}}\right)$
$= \Pr(2.01 < Z < 2.46)$
$= 0.49305 - 0.4778 = 0.01525$

from the table of the standard normal distribution. Thus $\Pr(14.5 < Y < 15.5) = 0.0153$.
By way of comparison, the exact answer obtained from the binomial probability
distribution is computed as
$\Pr(X = 15) = \binom{20}{15} \left(\frac{1}{2}\right)^{20} = 0.0148.$
Our approximate answer is within $3\frac{1}{2}$% of the true value.
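
The comparison in Example 16A can be repeated without tables. The sketch below is an illustration only: it builds the normal distribution function from the error function in Python's standard library (the helper normal_cdf is an assumed name, not something defined in this book), so the result differs very slightly from the table-based figure because z is not rounded to two decimals.

```python
import math

def normal_cdf(x, mean, var):
    """Pr[Y <= x] for Y ~ N(mean, var), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

n, p = 20, 0.5
mean, var = n * p, n * p * (1 - p)   # 10 and 5

# Normal approximation with the interval (14.5, 15.5) standing in for X = 15.
approx = normal_cdf(15.5, mean, var) - normal_cdf(14.5, mean, var)

# Exact binomial probability Pr[X = 15].
exact = math.comb(n, 15) * p**15 * (1 - p)**5

print(round(approx, 4), round(exact, 4))
```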
Example 17A: If sales of tractors by a dealer occur in accord with a Poisson distribution with rate $\lambda = 25$ sales per month, what is the probability of 30 or more sales in a
month?
Because $\lambda$ exceeds 10, X, the number of tractors sold, can be closely approximated
by a normal distribution Y with $\mu = 25$ and $\sigma^2 = 25$. To find the probability that
the discrete distribution is 30 or more, we obtain the probability that the continuous
distribution exceeds 29.5.
$\Pr[X \ge 30] \approx \Pr[Y > 29.5] = \Pr[Z > (29.5 - 25)/5]$
$= \Pr[Z > 0.9]$
$= 0.1841$
The method we have demonstrated to compute the approximate probabilities is
summarized in the block below:
PROCEDURE FOR USING THE APPROXIMATIONS
If the random variable X has a binomial or Poisson distribution and
satisfies the conditions for being approximated by a normally distributed random variable Y, then
$\Pr[a \le X \le b] \approx \Pr[a - \frac{1}{2} < Y < b + \frac{1}{2}]$

Example 18B: An actuarial lifetable states that the probability that a 40-year old
man will die before age 60 years is 0.17. An insurance company insures 300 men aged
40. What is the probability that the number of insured men who will die before age 60
lies between 50 and 60 (inclusive)?
Let the random variable X be the number of insured men who will die before age 60.
Clearly, $X \sim B(300, 0.17)$.
We calculate $np$, $n(1 - p)$, $np(1 - p)$:
$np = 300 \times 0.17 = 51$,    $n(1 - p) = 249$,    $np(1 - p) = 42.3$.

The conditions for using the normal approximation are therefore satisfied. Thus $X \sim
B(300, 0.17)$ can be approximated by $Y \sim N(51, 42.3)$.
$\Pr[50 \le X \le 60] \approx \Pr[49.5 < Y < 60.5]$
$= \Pr[(49.5 - 51)/\sqrt{42.3} < Z < (60.5 - 51)/\sqrt{42.3}]$
$= \Pr[-0.23 < Z < 1.46]$
$= 0.0910 + 0.4279$
$= 0.5189$
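
The half-unit adjustment of the limits can be wrapped in a small helper and compared with the exact binomial sum for Example 18B. This is a sketch under stated assumptions: normal_cdf and approx_prob are illustrative names, and the normal distribution function is built from the standard library's error function rather than read from tables.

```python
import math

def normal_cdf(x, mean, var):
    """Pr[Y <= x] for Y ~ N(mean, var), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def approx_prob(a, b, mean, var):
    """Pr[a <= X <= b] approximated by Pr[a - 1/2 < Y < b + 1/2]."""
    return normal_cdf(b + 0.5, mean, var) - normal_cdf(a - 0.5, mean, var)

n, p = 300, 0.17
approx = approx_prob(50, 60, n * p, n * p * (1 - p))

# Exact binomial probability for comparison.
exact = sum(math.comb(n, x) * p**x * (1 - p)**(n - x) for x in range(50, 61))

print(round(approx, 4), round(exact, 4))
```

The same helper applies to a Poisson random variable by passing the Poisson parameter as both mean and variance.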
Example 19C: The average number of newspapers sold at a busy intersection during
rush hour is 5 per minute. What is the probability that in a 15-minute period during
rush hour more than 85 newspapers are sold at the intersection?
Example 20C: A car ferry can accommodate 298 cars. Because bookings are not
always taken up, the operators accept 335 bookings for each ferry crossing, hoping that
no more than 298 cars arrive. If individual bookings are taken up independently with
probability 0.85, what is the probability, when 335 bookings have been made, that more
cars will arrive than can be accommodated for a particular crossing?


Solutions to examples . . .
3C (a) E[X] = 5 > 0, so you should make the call.
(b) E[X] = 0 implies $35x + (-15)(1 - x) = 0$, so that x = 0.3.
4C (a) $E[X_A] = 16.5\%$, $E[X_B] = 12.0\%$.
(b) $\mathrm{SD}[X_A] = 20.62\%$, $\mathrm{SD}[X_B] = 22.05\%$.
(c) Share B is more risky, because it has the larger standard deviation.
(d) Coefficients of variation. A: 1.25, B: 1.84.
(e) Buy share A. It has both a higher expected return and a lower variance.

9C The distribution function is
$F(x) = 0$    $x < 0$
$= \sqrt{x}$    $0 \le x \le 1$
$= 1$    $x > 1$

11C $X \sim E(\frac{2}{5})$. Compute the distribution function $F(x) = 1 - e^{-\frac{2}{5}x}$ for $x \ge 0$
(and F(x) = 0 for x < 0). Some values of F(x) are: $\Pr[X \le 5] = 0.865$ is the
probability of making a sale within 5 days; $\Pr[X \le 4] = 0.798$, $\Pr[X \le 3] = 0.699$,
$\Pr[X \le 2] = 0.55$.
12C The distribution function is
$F(x) = 0$    $x < 0$
$= 0.0777$    $0 \le x < 1$
$= 0.3369$    $1 \le x < 2$
$= 0.6825$    $2 \le x < 3$
$= 0.9129$    $3 \le x < 4$
$= 0.9897$    $4 \le x < 5$
$= 1$    $x \ge 5$

13C (a) $f(x) = 1/15$ for $45 < x < 60$ (and zero otherwise).
$F(x) = 0$    $x < 45$
$= (x - 45)/15$    $45 \le x \le 60$
$= 1$    $x > 60$
(b) 0.333
15C (a) The distribution function is
$F(x) = 0$    $x < 0$
$= 0.0625$    $0 \le x < 1$
$= 0.3125$    $1 \le x < 2$
$= 0.6875$    $2 \le x < 3$
$= 0.9375$    $3 \le x < 4$
$= 1$    $x \ge 4$
$x_m = 2$,    $\mu = 2$

(b) The distribution function is
$F(x) = 0$    $x < 0$
$= 0.1296$    $0 \le x < 1$
$= 0.4752$    $1 \le x < 2$
$= 0.8208$    $2 \le x < 3$
$= 0.9744$    $3 \le x < 4$
$= 1$    $x \ge 4$
$x_m = 2$,    $\mu = 1.6$

(c) The distribution function is
$F(x) = 0$    $x \le 3$
$= (x - 3)/5$    $3 < x < 8$
$= 1$    $x \ge 8$
$x_m = 5.5$,    $\mu = 5.5$

(d) The distribution function is
$F(x) = 0$    $x < 1$
$= \log x$    $1 \le x \le e$
$= 1$    $x > e$
$x_m = 1.6487$,    $\mu = 1.7183$


For the symmetric distributions, (a) and (c), the mean and median coincide.
19C 0.1131
20C 0.0179

Exercises . . .
6.1 Find the mean, the variance, the distribution function and the median of the
random variables having the following probability density/mass functions.
(a)
$f(x) = x/50$    $0 < x < 10$
$= 0$    otherwise
(b)
$p(x) = (x - 1)/15$    $x = 1, 2, \ldots, 6$
$= 0$    otherwise
(c)
$p(x) = 0.1$    $x = 10$ or $x = 40$
$= 0.3$    $x = 20$
$= 0.5$    $x = 30$
$= 0$    otherwise
(d)
$f(x) = 3x^2/125$    $0 < x < 5$
$= 0$    otherwise
(e)
$f(x) = 10x^9$    $0 < x < 1$
$= 0$    otherwise

6.2 Let X be a random variable with probability function
$f(x) = kx(\frac{1}{2} - x)$    $0 \le x \le \frac{1}{2}$
$= 0$    otherwise
(a) Determine k so that f(x) is a density function.
(b) Find the mean and variance of X.
(c) Find the distribution function F(x) and the median.

6.3 A random variable X has probability density function
$f(x) = Ax + B$    $0 \le x \le 1$
$= 0$    otherwise
If the mean of X is 0.5, find A and B.

6.4 For the opportunity to roll a die you pay n cents. If you score a six, you get
a reward of R2.00, and get your n cents returned. How much should you pay
in order to make this a fair game? (Note: a game is defined to be fair if the
expected gain is zero.)

6.5 A charitable institution wishes to raise funds by holding either a braaivleis, or a
dinner in a hotel. If they choose a braaivleis they will lose R1200 if it rains, or
make R3000 if it does not rain. If they hold a dinner they will make R2400 if
it rains and lose R300 if it does not rain. Which should they choose (a) if the
probability of rain is 1/3, (b) if the probability of rain is 1/2?

6.6 Find the mean and the variance of the exponential distribution.
6.7 Find the mean of the normal distribution.
6.8 What is the probability of obtaining more than 300 sixes in 1620 tosses of a fair
die?

6.9 Experience has shown that 15% of travellers reserving flights with Wildebeest
Airlines do not take up their seats. If each plane has 50 seats and 58 bookings are
accepted beforehand, what is the probability that everyone wishing to make the
flight can be accommodated?

6.10 The university lost-property office receives, on average, 64 articles per weekday.
What is the probability that on one day
(a) between 70 and 80 articles (inclusive) are received?
(b) exactly 64 articles are received?
(c) less than 50 articles are received?

6.11 Each kilogram of grass seeds contains an average of 200 weed seeds. What is the
probability that a kilogram of seeds contains more than 225 weed seeds? (Assume
that the number of weed seeds has a Poisson distribution.)

6.12 A holiday farm has 110 bungalows. During the winter holidays, the farm has an
average occupancy of 60 bungalows per night. Assuming that the random variable
giving the number of bungalows let per night has a binomial distribution, compute
the probabilities that
(a) fewer than 70 bungalows are let for a night, and
(b) between 65 and 75 bungalows (inclusive) are let for a night.

6.13 Prove that the mean and the median of a continuous random variable having a
symmetric probability distribution function are equal.
6.14 A random variable X has probability density function
$f(x) = k(1 - x^2)$    $-1 < x < 1$
$= 0$    otherwise

(a) Determine the value of k .


(b) Find the mean and variance of the random variable X .
(c) Determine the probability that the random variable X lies within an interval
of one standard deviation on either side of the mean.
(d) Find the distribution function and the median.
6.15 The probability density function of a random variable X is given by
$f(x) = a$    $0 \le x \le 1$
$= b$    $1 < x \le 3$
$= 0$    otherwise
where a and b are constants.
(a) Given that the mean of the random variable X is 5/4, prove that $a = \frac{1}{2}$ and
$b = \frac{1}{4}$.
(b) Find the variance of X .
(c) Find F (x), the distribution function, and xm the median.
(d) What is the probability that 3 out of 5 observations on the variable X are
less than the median?
6.16 An automatic teller machine (ATM) is installed in a busy shopping area. The
number of customers using the ATM per hour can be modelled by a Poisson random
variable with a mean of 70. Find, using an approximation, the probability that
between 15 and 35 customers (inclusive) use the ATM in any half-hour period.

6.17 A fair coin is tossed 250 times. Find the probability that the number of heads will
not differ from 125 by
(a) more than 10
(b) less than 7.
6.18 On average, a newspaper is delivered late twice per month. Calculate the approximate probability that the paper will be late more than 30 times in a year.


6.19 A random variable X has probability density function
$f(x) = Ax + B$    $0 \le x \le 3$
We are told that the mean of X is E(X) = 1.
(a) Find the values of the constants A and B.
(b) Find the variance of X.
(c) Find the distribution function of X.
(d) Find the median of X.

6.20 Consider the following game. You pay an amount x, and then toss a coin until
a head appears. If a head is obtained on the first or second throw, you lose. If a
head is obtained on the third or fourth throws, you win R1. If a head appears on
the fifth or subsequent throw, you win R5.
(a) If x = 75c, what are your expected winnings (or losses) per game?
(b) What should x be to make the game a fair game?

6.21 The random variable X has the probability mass function (known as the truncated Poisson distribution)
$p(x) = \dfrac{e^{-\lambda}\lambda^x}{x!\,(1 - e^{-\lambda})}$    $x = 1, 2, \ldots$
$= 0$    otherwise

(a) Check that the conditions for a probability mass function are satisfied.
(b) Find the mean of the random variable X.

Some more difficult exercises . . .

6.22 (a) Show that the variance of the random variable X can be expressed as
$\mathrm{Var}[X] = E[X(X - 1)] + E[X] - (E[X])^2,$
where $E[X(X - 1)]$ means $\sum_x x(x - 1)\,p(x)$.
(b) Use this result to find the variance of the binomial and Poisson distributions:
(i) if $X \sim B(n, p)$, then $\mathrm{Var}(X) = np(1 - p)$
(ii) if $X \sim P(\lambda)$, then $\mathrm{Var}(X) = \lambda$.
6.23 (a) Show that the probability density function derived in Example 10B
$f(x) = \lambda^2 x e^{-\lambda x}$    $x \ge 0$
$= 0$    otherwise
satisfies the conditions for being a probability density function.
(b) Find the mean and variance of the random variable X, the waiting time to
the second event.
6.24 Generalize the results of Example 10B to determine the probability density function
of the waiting time to the n-th event.


Solutions to exercises . . .
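6.1 (a) $\mu = 20/3$, $\sigma^2 = 50/9$, and $x_m = 7.07$
(b) $\mu = 14/3$, $\sigma^2 = 14/9$, and $x_m = 5$
(c) $\mu = 26$, $\sigma^2 = 64$, and $x_m = 30$
(d) $\mu = 15/4$, $\sigma^2 = 15/16$, and $x_m = 3.97$
(e) $\mu = 10/11$, $\sigma^2 = 5/726$, and $x_m = 0.9330$ (the solution of $x_m^{10} = \frac{1}{2}$)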

6.2 (a) k = 48
(b) $\mu = \frac{1}{4}$, and $\sigma^2 = 1/80$
(c) The distribution function is
$F(x) = 0$    $x < 0$
$= 12x^2 - 16x^3$    $0 \le x \le \frac{1}{2}$
$= 1$    $x > \frac{1}{2}$
The median $x_m = \frac{1}{4}$ because f(x) is symmetric.
6.3 A = 0, B = 1
6.4 40 cents
6.5 (a) Choose a braai, then E(gain) = 1600
(b) Choose a dinner, then E(gain) = 1050
6.6 $\mu = 1/\lambda$, and $\sigma^2 = 1/\lambda^2$
6.7 $E[X] = \mu$
6.8 0.0212
6.9 0.6700
6.10 (a) 0.2254 (b) 0.0478 (c) 0.0351
6.11 0.0359
6.12 (a) 0.966 (b) 0.193

6.14 (a) k = 3/4
(b) $\mu = 0$, $\sigma^2 = 1/5$
(c) $\Pr[-0.4472 < X < 0.4472] = 0.6261$
(d) The distribution function is
$F(x) = 0$    $x \le -1$
$= \frac{1}{4}(3x - x^3 + 2)$    $-1 \le x \le 1$
$= 1$    $x > 1$
The median $x_m = 0$ because f(x) is symmetric.

6.15 (b) $\sigma^2 = 37/48$
(c) The distribution function is
$F(x) = 0$    $x < 0$
$= \frac{1}{2}x$    $0 \le x \le 1$
$= \frac{1}{2} + \frac{1}{4}(x - 1)$    $1 < x \le 3$
$= 1$    $x > 3$
Median is $x_m = 1$
(d) 0.3125
6.16 0.5316
6.17 (a) 0.8164 (b) 0.4122

6.18 0.0918
6.19 (a) $A = -2/9$, $B = 2/3$
(b) $\sigma^2 = 0.5$
(c) The distribution function is given by
$F(x) = 0$    $x < 0$
$= -x^2/9 + 2x/3$    $0 \le x < 3$
$= 1$    $x \ge 3$
(d) $x_m = 0.8787$
6.20 (a) Expected loss of 25c per game.
(b) 50c
6.21 (b) $E(X) = \lambda/(1 - e^{-\lambda})$
6.23 (b) $E(X) = 2/\lambda$, $\mathrm{Var}(X) = 2/\lambda^2$
6.24 $F_n(x) = 1 - \sum_{k=0}^{n-1} e^{-\lambda x} (\lambda x)^k / k!$, where $F_n(x)$ is the distribution function of the
waiting time to the nth event. The density function is found by differentiation:
$f_n(x) = \lambda^n x^{n-1} e^{-\lambda x}/(n - 1)!$    $x \ge 0$
$= 0$    otherwise

Chapter 7

PROBABILITY DISTRIBUTIONS II:
THE NEGATIVE BINOMIAL,
GEOMETRIC, HYPERGEOMETRIC
AND UNIFORM DISTRIBUTIONS
KEYWORDS: Negative binomial, geometric, hypergeometric and uniform distributions.

Recap . . .
To date we have learnt about four important probability distributions, two of which
were discrete, the binomial and Poisson distributions, and two of which were continuous,
the exponential and normal distributions. There are very many other useful probability
distributions and in this chapter we extend our repertoire.

The negative binomial distribution . . .

The underlying scenario for the negative binomial distribution is identical to that of
the binomial distribution. But there is a twist in the tail! The binomial distribution is
applicable if we have a random experiment with two outcomes, "success" and "failure",
and if the probability of success, p, does not vary between trials of the experiment.
These two conditions must also be satisfied for the negative binomial distribution to be
applicable. For the binomial distribution, we fixed n, the number of trials, and we
counted the number of successes we observed on these trials, letting the random variable
X be the observed number of successes. For the negative binomial distribution,
however, we fix r, the number of successes, and count the number of failures until
we have the rth success. Thus we let the random variable X be the number of failures
before we obtain r successes. We count failures, not trials! The probability mass
function for the negative binomial distribution is easily derived from first principles.
If the random variable X takes on the value x, so that there are x failures before
r successes, there must have been a total of $x + r$ trials and the rth success must have
occurred on the very last trial, the $(x + r)$th trial.[1] This in turn implies that there were
$r - 1$ successes and x failures in the first $x + r - 1$ trials. Thus
$\Pr[X = x]$ = Pr[x failures in $x + r - 1$ trials] $\times$ Pr[success in $(x + r)$th trial]
$= \binom{x + r - 1}{x} p^{r-1} q^x \times p$
$= \binom{x + r - 1}{x} p^r q^x$

[1] Have you ever wondered why, when you are looking for something, you always find it in the last
place that you look for it? Answer! Because once you have found it, you (hopefully!) stop looking for
it. Similarly for a negative binomial process.

We summarize the above discussion in the box:

NEGATIVE BINOMIAL DISTRIBUTION
We have a series of independent trials, each of which has two outcomes,
success or failure. Pr[success] = p for each trial. Let $q = 1 - p$.
Let r be a fixed number of successes, and let the random variable X
be the number of failures obtained before we have r successes.
Then X has the negative binomial distribution with parameters
r and p, i.e. $X \sim NB(r, p)$, and has probability mass function
$p(x) = \binom{x + r - 1}{x} p^r q^x$    $x = 0, 1, 2, \ldots$
$= 0$    otherwise
The name "negative binomial distribution" is not a good one! The mathematicians
have a result called the negative binomial theorem which states that, under certain
conditions,
$(1 - q)^{-r} = \sum_{x=0}^{\infty} \binom{r + x - 1}{x} q^x.$

This result, with its negative exponent, is used to prove the condition PMF3, that
$\sum_{x=0}^{\infty} p(x) = 1$. In the same way that the binomial theorem gave its name to the
binomial distribution, this mathematical result has transferred its name to the negative
binomial distribution. There is nothing else about the negative binomial distribution
that is negative!
Example 1A: A market research company requires each of its fieldworkers to conduct
10 interviews per day. Not everybody approached by a fieldworker agrees to participate
in an interview. In fact, only 60% of approaches lead to an interview. What is the
probability that the 10th interview is obtained from the 15th person approached?
Let the random variable X be the number of failures before 10 successes. Here
r = 10, p = 0.6, so that $X \sim NB(10, 0.6)$. We want to find Pr[X = 5]:
$\Pr[X = 5] = \binom{5 + 10 - 1}{5}\, 0.6^{10}\, 0.4^5 = 0.1240.$
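
The negative binomial probability in Example 1A can be evaluated directly from the mass function. The sketch below is illustrative only: the function name negative_binomial_pmf is an assumption, and the final line sums a long stretch of the mass function as an informal check that the probabilities add up to 1.

```python
import math

def negative_binomial_pmf(x, r, p):
    """Pr[X = x]: x failures before the r-th success, success probability p."""
    q = 1 - p
    return math.comb(x + r - 1, x) * p**r * q**x

# Example 1A: 10th interview obtained from the 15th person approached,
# i.e. 5 failures before the 10th success with p = 0.6.
prob = negative_binomial_pmf(5, 10, 0.6)
print(round(prob, 4))

# Informal check that the mass function sums to 1 (truncated at 500 terms).
total = sum(negative_binomial_pmf(x, 10, 0.6) for x in range(500))
print(total)
```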


The special case of the negative binomial distribution when r = 1 is called the geometric distribution. Check for yourself that the mass function simplifies to the function
given below:
GEOMETRIC DISTRIBUTION
Under the same conditions as for the negative binomial distribution,
let the random variable X be the number of failures before the first
success. Then X has the geometric distribution with parameter p,
$X \sim G(p)$, and has probability mass function
$p(x) = p\,q^x$    $x = 0, 1, 2, \ldots$
$= 0$    otherwise
Example 2A: Suppose that you interview job applicants in succession until you find a
person that satisfies the job description. Suppose that, at each interview, the probability
of finding the right person is 0.3.
(a) What is the probability that you appoint the third person you interview?
(b) What is the probability that you will need to do five or more interviews?

(a) Let X be the number of unsuccessful interviews before you succeed. Clearly $X \sim G(0.3)$. We want
Pr[X = 2]:
$$\Pr[X = 2] = 0.3 \times 0.7^{2} = 0.147$$
(b) Needing at least five interviews means that the number of
unsuccessful interviews must be four or more.
$$\begin{aligned}
\Pr[X \ge 4] &= 1 - \Pr[X \le 3] \\
&= 1 - \Pr[X = 0] - \Pr[X = 1] - \Pr[X = 2] - \Pr[X = 3] \\
&= 1 - 0.3 - 0.3 \times 0.7 - 0.3 \times 0.7^{2} - 0.3 \times 0.7^{3} \\
&= 0.2401
\end{aligned}$$
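Both parts of Example 2A can be reproduced in a few lines. This is a sketch only; the helper names `geometric_pmf` and `geometric_tail` are assumptions made for illustration. The tail function uses the standard closed form Pr[X ≥ k] = q^k, which follows from summing the geometric progression.

```python
def geometric_pmf(x: int, p: float) -> float:
    """Pr[X = x]: x failures before the first success."""
    if x < 0:
        return 0.0
    return p * (1.0 - p) ** x


def geometric_tail(k: int, p: float) -> float:
    """Pr[X >= k]: the first k trials are all failures."""
    return (1.0 - p) ** max(k, 0)


# (a) appoint the third person interviewed: two failures, then a success
print(f"{geometric_pmf(2, 0.3):.3f}")    # 0.147

# (b) five or more interviews needed: at least four failures
print(f"{geometric_tail(4, 0.3):.4f}")   # 0.2401

# The closed form agrees with the term-by-term subtraction used in the example.
by_subtraction = 1.0 - sum(geometric_pmf(x, 0.3) for x in range(4))
print(abs(by_subtraction - geometric_tail(4, 0.3)) < 1e-12)  # True
```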

Example 3B: Some notebook computers make use of an active colour display on
their screens. One reason why these computers are expensive is that the manufacturing
process is so delicate that many of the screens produced are defective, and have to be
discarded when tested during the assembly process. Only 58% of all screens produced
are free from defects and can be used. If an order is placed for five notebook computers,
what is the probability that
(a) the fifth non-defective screen is the eighth screen tested?
(b) no more than nine screens are required?
Let the random variable X be the number of defective screens tested before the fifth
non-defective screen is found. Thus X has the negative binomial distribution:
$X \sim NB(5, 0.58)$.
(a) We want
$$\Pr[X = 3] = \binom{3 + 5 - 1}{3}\, 0.58^{5}\, 0.42^{3} = 0.170.$$


(b) Here we need
$$\begin{aligned}
\Pr[X \le 4] &= \sum_{x=0}^{4} \binom{x + 4}{x}\, 0.58^{5}\, 0.42^{x} \\
&= 0.066 + 0.138 + 0.174 + 0.170 + 0.143 \\
&= 0.691
\end{aligned}$$
Example 4C: Show that the geometric distribution satisfies the conditions for a probability mass function. (The sum you have to evaluate is a geometric progression). Show,
assuming the result for the negative binomial theorem given earlier, that the negative
binomial distribution satisfies the conditions for a probability mass function.
Example 5C: Suppose that the probability of passing the board examination is 0.45,
that this probability does not vary with time, and that each attempt is independent of
previous attempts. What is the probability that you pass the examination on your fifth
attempt?
Example 6C: A breakfast cereal manufacturer places 1 of 16 picture cards of famous
soccer players in each packet of cereal. Each picture is equally likely to be contained in
a packet. You have collected 15 of the 16 cards. What is the probability that you will
have to buy x more packets to complete the set?
Example 7C: A computer salesman has established that if he takes an interested
customer out to lunch, the probability of making a sale increases from 0.4 to 0.65. The
salesman needs to make only 7 more sales to reach his target, and decides to continue
taking all interested customers out to lunch until he reaches his target. What is the
probability that he will need to take customers out to lunch
(a) on 10 occasions?
(b) on more than 10 occasions?
(c) What are these probabilities if he does not take the customers out to lunch?

The hypergeometric distribution . . .


The binomial and negative binomial distributions require that the probability of
success remains the same from trial to trial. In many practical situations this is unrealistic; in particular, it is unrealistic in sampling problems when the sampling is done
without replacement.
Consider, for example, a population of N articles, M of which are defective. We
draw a sample of size n. Let the random variable X be the number of defective articles
in n articles sampled without replacement.
For the event X = x to occur, we must draw x articles from the M defective
articles, and n − x from the N − M non-defective articles. Counting rule 3 of chapter 3
tells us that we can choose x articles from M in $\binom{M}{x}$ ways, and that we can choose
n − x articles from N − M in $\binom{N-M}{n-x}$ ways. Thus the total number of ways in which
the event X = x can occur is
$$\binom{M}{x}\binom{N-M}{n-x}.$$
The total number of ways in which a sample of size n can be drawn from N articles is
$\binom{N}{n}$. Thus, using the rule for computing probabilities when the elementary events are
equiprobable, we have
$$\Pr[X = x] = \binom{M}{x}\binom{N-M}{n-x} \bigg/ \binom{N}{n}.$$
HYPERGEOMETRIC DISTRIBUTION
Given a population of size N, of which M are defective, a sample
of size n ($n \le N$) is drawn. Let the random variable X be the
number of defectives in the sample. Then X has the hypergeometric
distribution with parameters N, M and n, $X \sim H(N, M, n)$, and X
has probability mass function
$$p(x) = \begin{cases} \dbinom{M}{x}\dbinom{N-M}{n-x} \bigg/ \dbinom{N}{n} & x = 0, 1, \ldots, n \\ 0 & \text{otherwise} \end{cases}$$
Example 8A: A fisherman caught 10 lobsters, 3 of which were undersized. An inspector
of the Sea Fisheries Branch measured a random sample of 4 lobsters. What is the
probability that the sample contains no undersized lobsters?
Here N = 10, M = 3 and n = 4. If X is the number of undersized lobsters in the
sample of 4, then
$$\Pr[X = 0] = \binom{3}{0}\binom{10-3}{4-0} \bigg/ \binom{10}{4} = 0.1667$$
Conversely, the probability that the inspector finds at least one undersized lobster is
1 − 0.1667 = 0.8333.
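The hypergeometric mass function is easy to evaluate with exact integer binomial coefficients. The sketch below (the function name `hypergeometric_pmf` is ours, chosen for illustration) reproduces the lobster calculation of Example 8A.

```python
from math import comb


def hypergeometric_pmf(x: int, N: int, M: int, n: int) -> float:
    """Pr[X = x]: x defectives in a sample of n drawn without replacement
    from a population of N items of which M are defective."""
    if x < 0 or x > n:
        return 0.0
    # comb(a, b) returns 0 when b > a, so impossible counts get probability 0.
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)


# Example 8A: 10 lobsters, 3 undersized, sample of 4.
p_none = hypergeometric_pmf(0, N=10, M=3, n=4)
print(f"{p_none:.4f}")        # 0.1667
print(f"{1 - p_none:.4f}")    # 0.8333
```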
Example 9B: A team of 15 people is chosen from a class of 65 MBA students to play
a social rugby match. The class contains 25 engineers. What is the probability that the
team contains
(a) four engineers?
(b) at least four engineers?
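(a) Let X be the number of engineers in the sample. Then N = 25 + 40 = 65, M = 25
and n = 15, so that $X \sim H(65, 25, 15)$. Thus
$$\Pr[X = 4] = \binom{25}{4}\binom{40}{11} \bigg/ \binom{65}{15} = 0.1410.$$
(b) The probability that X is 4 or more is
$$\begin{aligned}
\Pr[X \ge 4] &= 1 - \Pr[X \le 3] = 1 - \{p(0) + p(1) + p(2) + p(3)\} \\
&= 1 - \left[ \binom{25}{0}\binom{40}{15} + \binom{25}{1}\binom{40}{14} + \binom{25}{2}\binom{40}{13} + \binom{25}{3}\binom{40}{12} \right] \bigg/ \binom{65}{15} \\
&= 0.9176
\end{aligned}$$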


Example 10C: There are addition errors in 3 out of a total of 32 invoices. An auditor
checks a random sample of 10 invoices. What are the probabilities of finding (a) 0, (b)
1, (c) 2 and (d) all 3 errors in the sample?
If N is much larger than n, then the difference between using the binomial distribution (which assumes that sampling is done with replacement) and using the
hypergeometric distribution is small. If n/N < 0.1, so that the size of the sample is less
than 10% of the total population size, then the binomial distribution may be used to
give satisfactory approximations to hypergeometric probabilities:
BINOMIAL APPROXIMATION TO THE HYPERGEOMETRIC
DISTRIBUTION
If $X \sim H(N, M, n)$ and if n/N < 0.1, then X is approximately
binomially distributed, $X \mathrel{\dot\sim} B(n, p)$, with
p = M/N

Example 11B: A company is established in two cities, Johannesburg and Cape Town.
The total staff complement is 240, of which only 32 are based in Cape Town. If 10
of the total of 240 staff are randomly selected to attend a course on VAT, what is
the probability that two of the 10 are from Cape Town? Calculate both exact and
approximate probabilities.
Let X be the number of staff members selected from Cape Town. Then
$X \sim H(240, 32, 10)$.
The exact probability is given by
$$\Pr[X = 2] = \binom{32}{2}\binom{208}{8} \bigg/ \binom{240}{10} = 0.2604$$
Using the binomial approximation to the hypergeometric distribution, we put
p = 32/240 = 0.1333 so that $X \mathrel{\dot\sim} B(10, 0.1333)$, and
$$\Pr[X = 2] \approx \binom{10}{2}\, 0.1333^{2}\, 0.8667^{8} = 0.2546.$$
The relative error in the approximation is
$$\frac{0.2604 - 0.2546}{0.2604} = 0.022 \text{ or } 2.2\%.$$
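The comparison in Example 11B between the exact hypergeometric probability and its binomial approximation can be sketched as follows. The function names are illustrative assumptions; the rule of thumb n/N < 0.1 is the one stated in the text.

```python
from math import comb


def hypergeometric_pmf(x: int, N: int, M: int, n: int) -> float:
    """Exact probability of x 'marked' items when sampling without replacement."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of x successes in n independent trials."""
    return comb(n, x) * p**x * (1.0 - p) ** (n - x)


N, M, n, x = 240, 32, 10, 2      # Example 11B: staff in Cape Town
assert n / N < 0.1               # rule of thumb for using the approximation

exact = hypergeometric_pmf(x, N, M, n)
approx = binomial_pmf(x, n, M / N)
rel_error = (exact - approx) / exact

print(f"exact  = {exact:.4f}")
print(f"approx = {approx:.4f}")
print(f"relative error = {rel_error:.1%}")
```

The printed values should agree with the 0.2604, 0.2546 and roughly 2.2% quoted in the example, up to rounding of p.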


The uniform distribution . . .


This is the simplest possible continuous distribution. It is used to model the situation
where all values in some interval (a, b) are equally likely to occur. At first glance the
uniform distribution looks uninspiringly simple; but it is this very simplicity that gives
it its importance. It is possible, but not trivial, to programme a computer to produce a
series of numbers that look like a random sample from the uniform distribution.
UNIFORM DISTRIBUTION
If the continuous random variable X is equally likely to take on any
value in the interval (a, b), then X has the uniform distribution,
$X \sim U(a, b)$, with probability density function
$$f(x) = \begin{cases} \dfrac{1}{b - a} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$

Example 12A: Suppose the mass of a nominally 500 g tub of margarine is equally
likely to take on any value in the interval (495, 510). What is the probability that a
randomly chosen tub will have a mass less than 500 g?
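Let the random variable X be the mass of a tub of margarine. Because
$X \sim U(495, 510)$,
$$f(x) = \begin{cases} \dfrac{1}{510 - 495} = 1/15 & 495 < x < 510 \\ 0 & \text{otherwise} \end{cases}$$
$$\Pr(X < 500) = \int_{495}^{500} \frac{1}{15}\, dx = \left[ \frac{x}{15} \right]_{495}^{500} = (500 - 495)/15 = 1/3$$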
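For the uniform distribution, probabilities are simply proportions of the interval length, so the integral in Example 12A reduces to evaluating the distribution function. A minimal sketch follows; the name `uniform_cdf` is our own.

```python
def uniform_cdf(x: float, a: float, b: float) -> float:
    """F(x) = Pr[X <= x] for X ~ U(a, b), with a < b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


# Example 12A: mass of a margarine tub, X ~ U(495, 510)
print(f"{uniform_cdf(500, 495, 510):.4f}")   # 0.3333

# Probability of an interval is a difference of two CDF values.
print(f"{uniform_cdf(505, 495, 510) - uniform_cdf(500, 495, 510):.4f}")  # 0.3333
```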

Example 13C: If $X \sim U(a, b)$, investigate the following properties of X:

(a) Show that the function given for the uniform distribution satisfies the conditions
for a probability density function.
(b) Show that $E[X] = \tfrac{1}{2}(b + a)$ and that $\mathrm{Var}[X] = (b - a)^{2}/12$.
(c) Show that the distribution function is given by
$$F(x) = \begin{cases} 0 & x < a \\ (x - a)/(b - a) & a \le x \le b \\ 1 & x > b \end{cases}$$

Example 14C: An investor knows that his share portfolio is equally likely to yield an
annual return anywhere in the interval between 5% and 35%. The fixed deposit rate is
13.5%. What is the probability that he would be better off investing his funds in a fixed
deposit account (rather than in his share portfolio) over the forthcoming year?


Example 15C: The final mark (Y) for a particular statistics course comprises a
30% weighting for the class record and a 70% weighting for the examination. A student
believes that she is equally likely to obtain a mark anywhere between 45% and 65% for
her examination (X).
(a) Find the probability density function for her final mark (Y ) if she has a class
record of 50%.
(b) Find the distribution function of her final mark.
(c) Find the probability she gets a third-class pass (between 50% and 60%) as a final
mark.
(d) Find the probability that she gets a lower second (between 60% and 70%).
(e) What is the probability that she fails (below 50%)?

Solutions to examples . . .
5C Pr[passing on 5th attempt] = Pr[X = 4] = $0.55^{4} \times 0.45 = 0.0412$

6C Pr[x packets] = Pr[x − 1 failures] = $\frac{1}{16}\left(\frac{15}{16}\right)^{x-1}$

7C $X \sim NB(7, 0.65)$
(a) Pr[X = 3] = 0.1766
(b) Pr[X > 3] = 1 − Pr[X ≤ 3] = 0.4862
(c) 0.0297 and 0.9452

10C Pr[x errors] = $p(x) = \binom{3}{x}\binom{29}{10-x} \big/ \binom{32}{10}$
(a) $p(0) = \binom{3}{0}\binom{29}{10} \big/ \binom{32}{10} = 0.3105$
(b) $p(1) = \binom{3}{1}\binom{29}{9} \big/ \binom{32}{10} = 0.4657$
(c) p(2) = 0.1996
(d) p(3) = 0.0242
(Note that these probabilities sum to 1.)
14C Pr(5 < X < 13.5) = 0.283
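15C (a) The random variable $Y \sim U(46.5, 60.5)$.
$$f(y) = \begin{cases} \dfrac{1}{60.5 - 46.5} = \dfrac{1}{14} & 46.5 < y < 60.5 \\ 0 & \text{otherwise} \end{cases}$$
(b) The distribution function is
$$F(y) = \begin{cases} 0 & y < 46.5 \\ (y - 46.5)/14 & 46.5 \le y < 60.5 \\ 1 & y \ge 60.5 \end{cases}$$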
(c) 0.714
(d) 0.036
(e) 0.250, although if the examiners have compassion, and pass her on a mark
between 49 and 50, the probability of failing is 0.179!


Exercises . . .
7.1

What is the probability that a fair die is tossed z times before a 6 appears
(a) for the first time
(b) for the second time
(c) for the r th time?

7.2

A parliamentary candidate needs to collect 300 signatures before he can be nominated. If the probability that a voter approached at random will give a signature
is 0.15, what is the probability that 1300 voters need to be approached before the
300th signature is collected?

7.3 By assuming that the negative binomial distribution is a probability mass function
show that
$$\sum_{x=0}^{\infty} \binom{x + r - 1}{x} q^{x} = (1 - q)^{-r}.$$

7.4 Show that if $X \sim NB(r, p)$, then E[X] = rq/p and (more difficult) that
Var[X] = $rq/p^{2}$.
(Hint: the same procedure as was used for finding the mean and variance of the
binomial and Poisson distributions is applied here.)

7.5 In exercise 7.2, what number of voters can be expected to refuse to sign before 300
signatures have been collected?
7.6 The Blood Transfusion Service knows that 6.3% of the population belong to the
A-negative blood group.
(a) If people donate blood at random, what is the probability that x people will
not belong to the A-negative group before
(i) the first A-negative donor
(ii) the fourth A-negative donor?
(b) What is the expected number of donors not having A-negative blood before
(i) the first A-negative donor
(ii) the fourth A-negative donor?
7.7

A company that specializes in the breeding of fish needs to estimate the number
of fish in one of the dams on their fish farm. In order to estimate the number of
fish N in the dam, M are caught, marked and released. After sufficient time has
elapsed for the marked fish to mix thoroughly, fish are caught one by one until r
marked fish have been caught. The fish are released as soon as they have been
examined for marks.
(a) What is the probability that x unmarked fish are examined before r marked
fish are caught?
(b) What is the expected number of unmarked fish that need to be examined
before r marked fish are caught?
(c) What, therefore, would you suggest as an estimate of N , the total population?

(d) If 300 fish are marked, and 189 unmarked fish are caught before 50 marked
fish are caught, what is your estimate of the total population?

7.8 (a) Plot the bar graph and the distribution function of the negative binomial
distribution with parameters
(i) r = 4
p = 0.8
(ii) r = 4
p = 0.5.
(b) Thus determine the medians of these two negative binomial distributions.
7.9 A small shop has 10 cartons of milk left, of which three are sour. Unaware of this,
you ask for four cartons of milk.
(a) If the four cartons are selected at random, what is the probability that you
get x cartons of sour milk?
(b) Evaluate these probabilities, and plot them as a bar graph.
7.10 If X H(N, M, 2) write down the probability mass function, and show that it
sums to one.
7.11 Show that the mean of the random variable X having the hypergeometric distribution H(N, M, n) is nM/N. It is much more difficult to show that the variance
is given by
$$n \, \frac{M}{N} \left( 1 - \frac{M}{N} \right) \left( \frac{N - n}{N - 1} \right).$$

7.12 An engineer has 60 fuses of which 7 are defective. He selects 5 fuses (without
replacement) for a particular job.
(a) What is the exact probability of getting 1 defective fuse in the 5 fuses selected?
(b) Use the binomial approximation to the hypergeometric distribution to estimate this probability.
(c) What is the percentage error of the approximation?

7.13 A company employs five male and three female computer programmers. Four of
the eight programmers are selected at random to serve on a committee. One of
the four is chosen from the committee to report to the manager. If this person is
female, find the conditional probability that the committee consists of two males
and two females.
(Hint: use Bayes theorem and the hypergeometric distribution.)
7.14 A manufacturer of light bulbs reports that among a consignment of 10 000 sent to
a supermarket, 2500 were faulty. A shopper selects 10 of these bulbs at random.
What is the approximate probability that more than 2 are faulty?
7.15 A child plays with a pair of scissors and a piece of string 10 cm long. He cuts the
string into two at a randomly chosen place.
(a) What is the probability that the piece of string to the left of the pair of scissors
is less than 4 cm long?
(b) What is the probability that the shorter piece of string is less than 2.5 cm
long?
7.16 A taxi travels between two cities A and B which are 100 km apart. There are
service stations at A and B and at the midpoint of the route. If the taxi breaks
down, it does so at random at any point along the route between the cities. If a
tow truck is dispatched from the nearest service station, what is the probability
that it has to travel more than 15 km to reach the taxi?
7.17 A radio station announces the time every 15 minutes between midnight and 06h00.
If you wake up at random in the early hours of the morning and switch on the
radio, what is the probability that you have to wait less than 5 minutes to find out
the time?
7.18

Between 08h00 and 09h00 buses leave the residence for the university at the
following numbers of minutes past 08h00:
00 03 05 07 10 12 15 30 37 55 60
(a) Calculate the probability of having to wait less than 2 minutes for a bus, if you
arrive at Mowbray station at a time uniformly distributed over the interval
(i) 08h00 to 09h00
(ii) 08h00 to 08h20
(iii) 08h02 to 08h30
(b) Calculate the probability of having to wait less than 5 minutes for a bus if
you arrive between 08h00 and 08h35.

Solutions to exercises . . .
7.1 (a) $\left(\frac{1}{6}\right)\left(\frac{5}{6}\right)^{z-1}$
(b) $\binom{z-1}{z-2}\left(\frac{1}{6}\right)^{2}\left(\frac{5}{6}\right)^{z-2}$
(c) $\binom{z-1}{z-r}\left(\frac{1}{6}\right)^{r}\left(\frac{5}{6}\right)^{z-r}$. Remember that the negative binomial distribution counts the
number of failures before r successes. Here z = x + r trials.

7.2 The number of refusals $X \sim NB(300, 0.15)$. Therefore
$p(1000) = \binom{1299}{1000}\, 0.15^{300}\, 0.85^{1000}$.
7.3 Note that this result is analogous to the binomial theorem expansion of $(x + y)^{n}$
for negative values of n.
7.5 If $X \sim NB(300, 0.15)$, then $E(X) = 300 \times 0.85/0.15 = 1700$.
7.6 (a) (i) $0.063 \times 0.937^{x}$
(ii) $\binom{x+3}{x}\, 0.063^{4}\, 0.937^{x}$.
(b) (i) q/p = 14.9
(ii) qr/p = 59.5.

7.7 (a) The number of unmarked fish $X \sim NB(r, M/N)$.
Then $p(x) = \binom{x + r - 1}{x} (M/N)^{r} (1 - M/N)^{x}$.
(b) E(X) = r(N − M)/M
(c) N = M(E(X) + r)/r
(d) We have to hope that our observed value of X, namely 189, is close to E(X).
We then estimate N = 300(189 + 50)/50 = 1434.
7.8 (b) (i) $x_m = 1$
(ii) $x_m = 3$.


7.9 (a) The number of sour milk cartons $X \sim H(10, 3, 4)$.
Thus $p(x) = \binom{3}{x}\binom{7}{4-x} \big/ \binom{10}{4}$.
(b) p(0) = 0.167, p(1) = 0.500, p(2) = 0.300, p(3) = 0.033.

7.12 (a) The number of defective fuses has a hypergeometric distribution: $X \sim H(60, 7, 5)$.
Thus $p(1) = \binom{7}{1}\binom{53}{4} \big/ \binom{60}{5} = 0.3753$.
(b) Because n/N < 0.1, $X \mathrel{\dot\sim} B(5, 7/60)$. Thus
$$p(1) \approx \binom{5}{1} \left( \frac{7}{60} \right)^{1} \left( \frac{53}{60} \right)^{4} = 0.3552.$$
(c) Percentage error = $\dfrac{0.3753 - 0.3552}{0.3753} \times 100 = 5.4\%$

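7.13 Let $A_i$ be the event "the committee contains i females". Let B be the event "the
computer programmer chosen to report is female". Then
$$\Pr(B|A_0) = 0, \quad \Pr(B|A_1) = \frac{1}{4}, \quad \Pr(B|A_2) = \frac{1}{2}, \quad \Pr(B|A_3) = \frac{3}{4}.$$
Also:
$$\Pr(A_0) = \binom{3}{0}\binom{5}{4} \bigg/ \binom{8}{4} = 5/70$$
$$\Pr(A_1) = \binom{3}{1}\binom{5}{3} \bigg/ \binom{8}{4} = 30/70, \text{ etc.}$$
By Bayes' theorem,
$$\begin{aligned}
\Pr(A_2|B) &= \Pr(B|A_2)\Pr(A_2) \bigg/ \sum_{i=0}^{3} \Pr(B|A_i)\Pr(A_i) \\
&= \left( \frac{1}{2} \times \frac{30}{70} \right) \bigg/ \left( 0 \times \frac{5}{70} + \frac{1}{4} \times \frac{30}{70} + \frac{1}{2} \times \frac{30}{70} + \frac{3}{4} \times \frac{5}{70} \right) \\
&= 0.5714
\end{aligned}$$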

7.14 Let X be the number of faulty bulbs in 10.
$X \sim H(10\,000, 2500, 10)$, so that $X \mathrel{\dot\sim} B(10, \tfrac{1}{4})$.
Pr[X > 2] = 1 − 0.5256 = 0.4744.

7.15 Let X be the length of string to the left of the pair of scissors. $X \sim U(0, 10)$.
(a) Pr[X < 4] = 0.4
(b) 0.5
7.16 0.4
7.17 0.33
7.18 (a) (i) 0.333
(ii) 0.600
(iii) 0.464
(b) 0.657.

Chapter 8

MORE ABOUT MEANS

KEYWORDS: Population mean and variance, statistic, sampling distribution, central limit theorem, the sample mean is a random variable,
confidence interval, tests of hypotheses, null and alternative hypotheses, significance level, rejection region, test statistic, one-sided and
two-sided alternatives, Type I and Type II errors

The unobtainable carrot . . .


We now introduce a new concept, the mean and variance of a population, and
then immediately make it virtually unobtainable. Consider this problem. We want to
make global statements about the travelling times to work of all people employed in a
country. To be precise, we wish to determine the mean and variance of these travelling
times. Now let your imagination run riot considering the logistics of such an operation
(observers, stop-watches, data collection and processing, . . .). There is no doubt that
the mean and variance of the travelling times of all employed people are numbers that
exist; it is just too expensive and too time consuming to obtain them. So what do we
do? We take the travelling times of a sample of employed people and use the sample
mean and variance to estimate the mean and variance of the travelling times of the
population of employed people. The operation of taking a sample from a population
is not a trivial one, and we shall discuss it further in Chapter 11.
We now have three concepts, each called a mean: the mean of a sample (chapter 1),
the mean of a probability distribution (chapter 6) and now the mean of a population. The
sample mean is used to estimate the population mean. When a probability distribution
is chosen as a statistical model for a population, one of the criteria for determining the
parameters of the probability distribution is that the mean of the probability distribution
should be equal to the population mean. This paragraph so far is also true when we
replace the word "mean" with the word "variance". It is a universal convention to use the
symbols $\mu$ and $\sigma^{2}$ for the population mean and population variance respectively, and
the fact that these symbols are also used for the mean and the variance of a probability
distribution causes no confusion.
Because these are important notions, we risk saying them again. The population
mean and variance are quantities that belong to the population as a whole. If you
could examine the entire population of interest then you could determine the one true
value for the population mean and the one true value for the population variance.
Usually, it is impracticable to do a census of every member of a population to determine
the population mean. The standard procedure is to take a random sample from the
population of interest and estimate $\mu$, the population mean, by means of $\bar{x}$, the sample
mean.

Statistics . . .
We remind you again of the special definition, within the subject Statistics, of a
statistic. A statistic is defined as any value computed from the elements of a random
sample. Thus $\bar{x}$ and $s^{2}$ are examples of statistics.

The random variable, $\bar{X}$ . . .

We argued above that the population parameter $\mu$ has a fixed value. But the sample
mean $\bar{x}$, the statistic which estimates $\mu$, depends on the particular sample drawn, and
therefore varies from sample to sample. Thus the sample mean is a random variable:
it takes on different values for different random samples. In accordance with our custom
of using capital letters for random variables and small letters for particular values of
random variables, we will now start referring to the sample mean as the random variable
$\bar{X}$.
Because the statistic $\bar{X}$ is a random variable it must have a probability distribution.
We have a special name for the probability distributions of statistics. They are called
sampling distributions. This name is motivated by the fact that statistics depend on
samples.
In order to find the sampling distribution of $\bar{X}$, consider the following. Suppose
that we take a sample of size n from a population which has a normal distribution with
known mean $\mu$ and known variance $\sigma^{2}$. Apart from dividing by n, which is a fixed
number, we can then think of $\bar{X}$ as the sum of n values, let us call them $X_1, X_2, X_3,
\ldots, X_n$, each of which has a normal distribution, with mean $\mu$ and variance $\sigma^{2}$; i.e.
$X_i \sim N(\mu, \sigma^{2})$. In Chapter 5, we stated that the sum of normal distributions also has a
normal distribution; the mean of the sum is the sum of the means, and the variance of
the sum is the sum of the variances. Thus
$$\sum_{i=1}^{n} X_i \sim N(n\mu, n\sigma^{2})$$
because the means and variances are all equal. But we do not want the distribution of
$\sum_{i=1}^{n} X_i$; we want the distribution of
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i.$$
We also saw in Chapter 5 that if the random variable $X \sim N(\mu, \sigma^{2})$, then
$aX \sim N(a\mu, a^{2}\sigma^{2})$. Applying this result with a = 1/n, we have
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \sim N\left(\mu, \frac{\sigma^{2}}{n}\right).$$


This is true for all values of n. The bottom line is that if a sample of any size is taken
from a population having a normal distribution with mean $\mu$ and variance $\sigma^{2}$, then the
mean of that sample will also have a normal distribution, with the same mean $\mu$, but
variance $\sigma^{2}/n$.
But what happens if we take a sample from a population that does not have a normal
distribution? Suppose that we do however know that this distribution has mean $\mu$ and
variance $\sigma^{2}$. Suppose we take a sample of size n, and compute the sample mean $\bar{X}$. As
mentioned above, apart from division by the sample size, the sample mean $\bar{X}$ consists
of $\sum_{i=1}^{n} X_i$, the sum of n random variables. In Chapter 5, we mentioned the central
limit theorem, which states that the sum of a large number of independent random
variables has approximately a normal distribution. Thus, by the central limit theorem,
the sample mean has approximately a normal distribution if the sample size is large
enough. A sample size of 30 or more is usually taken to be large enough for the
distribution of the sample mean to be treated as normal; the approximation to a normal
distribution is often good even for much smaller samples. It can be shown that if we
draw a sample of n observations from a population which has population mean $\mu$ and
variance $\sigma^{2}$ then $\bar{X}$ can be modelled by a normal distribution with mean $\mu$ and
variance $\sigma^{2}/n$, i.e.
$$\bar{X} \mathrel{\dot\sim} N(\mu, \sigma^{2}/n)$$
(the sample mean is approximately distributed normally, with mean $\mu$ and variance
$\sigma^{2}/n$). The approximation is generally very good for $n \ge 30$. But if the population
which is being sampled has a normal distribution, then
$$\bar{X} \sim N(\mu, \sigma^{2}/n)$$
for all values of n, including small values.
This is a powerful result. Firstly, it tells us about the sampling distribution of $\bar{X}$,
regardless of the distribution of the population from which we sample. The approximation becomes very good as the sample size increases. If the population which we sampled
had a distribution which was similar to a normal distribution, the approximation is good
even for small samples. If the sampled population had exactly a normal distribution,
then $\bar{X}$ too has exactly a normal distribution for all sample sizes. But even when the
population from which the samples were taken looks nothing like a normal distribution,
the approximation gets better and better as the sample size increases.
Secondly, (and remember that we are thinking of $\bar{X}$ as a random variable and that
therefore it has a mean) it tells us that the mean of the sample mean $\bar{X}$ is the same
as the mean of the population from which we sampled. The sample mean is therefore
likely to be close to the true population mean. In simple terms, the sample mean is a
good statistic to use to estimate the population mean.
Thirdly, the inverse relation between the sample size and variance of the sample
mean has an important practical consequence. With a large sample size, the sample
mean is likely, on average, to be closer to the true population mean than with a small
sample size. In crude terms, sample means based on large samples are better than sample
means based on small samples.
The rest of this chapter, and all of the next, builds on our understanding of the
distribution of $\bar{X}$. This is a fairly deep concept, and many students take a while before
they really understand it. The above section should be revisited from time to time. Each
time, you should peel off a new layer of the onion of understanding.
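One way to peel a layer off the onion is to simulate the sampling distribution of the sample mean. The sketch below is an illustration under our own assumptions, not something taken from the text: the population is exponential with mean 1 and variance 1 (clearly not normal), the sample size is n = 30, and the function name and seed are arbitrary. If the theory above holds, the simulated sample means should average close to 1 and have variance close to 1/30.

```python
import random
from statistics import fmean, variance


def simulate_sample_means(n: int, reps: int, seed: int = 0) -> list[float]:
    """Draw `reps` samples of size n from an exponential population
    (mean 1, variance 1) and return the mean of each sample."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        means.append(sum(sample) / n)
    return means


n = 30
means = simulate_sample_means(n, reps=5000)

print(f"mean of sample means     = {fmean(means):.3f}  (theory: 1.000)")
print(f"variance of sample means = {variance(means):.4f} (theory: {1 / n:.4f})")
```

Repeating the experiment with a larger n shrinks the variance of the sample means in proportion to 1/n, while their average stays near the population mean.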


Let us now consider various problems associated with the estimation of the mean.
For the remainder of this chapter we will make the (usually unrealistic) assumption that
even though the population mean $\mu$ is unknown, we do in fact know the value of the
population variance $\sigma^{2}$. In the next chapter, we will learn how to deal with the more
realistic situation in which both the population mean and variance are unknown.

Estimating an unknown population mean when the population variance is assumed known . . .
We motivate some theory by means of an example.
Example 1A: We wish to estimate the population mean $\mu$ of travelling times between
home and university. For the duration of this chapter we have to assume the population
variance $\sigma^{2}$ known, so let us suppose $\sigma = 1.4$ minutes. On a sample of 40 days, we use
a stopwatch to measure our travelling time, with the following results (time in minutes):

17.2  18.3  16.8  15.9  17.4  16.8  16.3  18.4  19.1  29.3
18.1  16.9  17.2  17.6  15.8  18.4  17.9  16.5  17.9  16.8
16.4  19.3  17.5  18.1  17.2  17.7  17.1  16.0  18.1  22.3
19.1  17.0  20.5  16.1  18.7  19.0  17.3  17.6  18.2  16.5

Adding these numbers and dividing by 40 gives the sample mean $\bar{X} = 17.76$.
Fine, $\bar{X}$ estimates $\mu$, so we have what we wanted. But how good is this estimate?
How much confidence can we place in it? To answer this question we need to make use
of the sampling distribution of $\bar{X}$. We shall see that we can use this distribution to form
an interval of numbers, called a confidence interval, so that we can make a statement
about the probability that the confidence interval contains the population mean.

Confidence intervals . . .
In many situations, in business and elsewhere, a sample mean by itself is not very
useful. Such a single value is called a point estimate. For example, suppose that the
breakeven point for a potential project is R20 million in revenues, and that we are given
a point estimate for revenues of R22 million. Our dilemma is that this point estimate
is subject to variation; the true value may well be below R20 million. Clearly, it would be far more
helpful in taking a decision if we could be given an interval of values and a probability
statement about the likelihood that the interval contains the true value. If we could
be told that the true value for revenue is likely to lie in the interval R21 million to
R23 million, we would decide to go ahead with the project. But if we were told it was
likely that the true value lay in the interval R12 million to R32 million, we would most
certainly want to get better information before investing in the project. Notice that, for
both of these intervals, the point estimate of revenues, R22 million, lies at the midpoint
of the interval. Our hesitation to invest, given the second scenario, is not due to the
position of the midpoint, but to the width of the interval.
The most commonly used probability associated with confidence intervals is 0.95,
and we then talk of a 95% confidence interval. This means that the probability is 0.95
that the confidence interval will contain the true population value. Conversely, the
probability that the confidence interval does not contain the population mean is 0.05.
Put another way, the confidence interval will not contain the population mean 1 time in
20, on average. Let us develop the theory for setting up such a confidence interval for
the mean (assuming, as usual for this chapter, that $\sigma^{2}$ is known).
We make use of the fact that (for large samples)
$$\bar{X} \sim N(\mu, \sigma^{2}/n).$$
We make the usual transformation to obtain a standard normal distribution:
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
From our normal tables we know that
$$\Pr[-1.96 < Z < 1.96] = 0.95$$
i.e. the probability that a standard normal random variable lies between −1.96 and +1.96
is 0.95.

[Figure: density curve of $Z \sim N(0, 1)$, with the points −1.96 and 1.96 marked on the horizontal axis.]

We can substitute $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ in place of Z in the square brackets because it has
a standard normal distribution:
$$\Pr\left[ -1.96 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.96 \right] = 0.95.$$
Rearranging, we obtain:
$$\Pr\left[ \bar{X} - 1.96 \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96 \frac{\sigma}{\sqrt{n}} \right] = 0.95.$$
Look at this carefully: in words, it says, "the probability is 0.95 that the interval
$(\bar{X} - 1.96\,\sigma/\sqrt{n}\ ,\ \bar{X} + 1.96\,\sigma/\sqrt{n})$ contains $\mu$, the population mean". This is the 95%
confidence interval for $\mu$. Let's apply this result to our introductory example.
Example 1A, continued: Look back and check that $\bar{X} = 17.76$, $\sigma = 1.4$ and n = 40.
Substitute these values into the formula for the 95% confidence interval:
$$(17.76 - 1.96 \times 1.4/\sqrt{40}\ ,\ 17.76 + 1.96 \times 1.4/\sqrt{40})$$
which reduces to (17.33 , 18.19). Thus, finally, we can state that we are 95% certain
that the interval (17.33 , 18.19) covers the population mean of the travelling time from
home to university. We now not only have the point estimate given by $\bar{X}$, we also
have some insight into the accuracy of our estimate.
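The confidence interval for a mean with known standard deviation is a one-line formula, and can be sketched in code as follows. The function name `z_confidence_interval` is our own; the z-value is taken from Python's standard-library `statistics.NormalDist` rather than from printed tables, so it is 1.95996 rather than the rounded 1.96.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(xbar: float, sigma: float, n: int,
                          level: float = 0.95) -> tuple[float, float]:
    """Confidence interval for the population mean when sigma is known."""
    # Two-sided interval: put (1 - level)/2 in each tail.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width


# Example 1A: sample mean 17.76, sigma = 1.4, n = 40
low, high = z_confidence_interval(17.76, 1.4, 40)
print(f"({low:.2f}, {high:.2f})")  # (17.33, 18.19)
```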


To obtain confidence intervals with different probability levels, all that needs to
be changed is the z-value obtained from the normal tables. The box below gives the
appropriate z-values for the most frequently used confidence intervals.
CONFIDENCE INTERVAL FOR $\mu$, $\sigma^{2}$ KNOWN
If we have a random sample of size n with sample mean $\bar{X}$, then A%
confidence intervals for $\mu$ are given by
$$\left( \bar{X} - z^{*} \frac{\sigma}{\sqrt{n}}\ ,\ \bar{X} + z^{*} \frac{\sigma}{\sqrt{n}} \right)$$
where the appropriate values of $z^{*}$ are given by:

A%      z*
90%     1.64
95%     1.96
98%     2.33
99%     2.58
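The z-values in the box need not be read from tables. As a rough check, assuming only Python's standard library, the inverse of the standard normal distribution function gives the same numbers to two decimal places; the helper name `z_value` is ours.

```python
from statistics import NormalDist


def z_value(level: float) -> float:
    """z such that Pr[-z < Z < z] = level for Z ~ N(0, 1)."""
    return NormalDist().inv_cdf(0.5 + level / 2)


for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}: z = {z_value(level):.2f}")
# 90%: z = 1.64
# 95%: z = 1.96
# 98%: z = 2.33
# 99%: z = 2.58
```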

Example 2B: An estimate of the mean fuel consumption (litres/100 km) of a car is
required. A sample of 47 drivers each drive the car under a variety of conditions for
100 km, and the fuel consumed is measured. The sample mean turns out to be 6.73
litres/100 km. The value of $\sigma$ is known to be 1.7 litres/100 km. Determine 95% and 99%
confidence intervals for $\mu$.
We have $\bar{X} = 6.73$, $\sigma = 1.7$ and $n = 47$. Thus the 95% confidence interval for $\mu$ is
given by

$$\left(6.73 - 1.96 \times 1.7/\sqrt{47}\ ,\ 6.73 + 1.96 \times 1.7/\sqrt{47}\right)$$

which is (6.24 , 7.22). We are 95% sure that the true fuel consumption falls within this
interval. The 99% confidence interval is found by replacing 1.96 by 2.58:

$$\left(6.73 - 2.58 \times 1.7/\sqrt{47}\ ,\ 6.73 + 2.58 \times 1.7/\sqrt{47}\right)$$

which is (6.09 , 7.37). We are 99% sure that the true fuel consumption falls within this
interval. Note the penalty we have to pay for increasing the level of confidence: the
interval is considerably wider.
Suppose now that there was a block in the communication between the experimenter
and the statistician and that the actual sample size was not 47, but 147. What now are
the 95% and 99% confidence intervals?
They are given by

$$\left(6.73 - 1.96 \times 1.7/\sqrt{147}\ ,\ 6.73 + 1.96 \times 1.7/\sqrt{147}\right) = (6.46\ ,\ 7.00)$$

and

$$\left(6.73 - 2.58 \times 1.7/\sqrt{147}\ ,\ 6.73 + 2.58 \times 1.7/\sqrt{147}\right) = (6.37\ ,\ 7.09)$$

respectively. Note that if everything else remains unchanged, the confidence interval
becomes shorter with an increase in the sample size. Resist the temptation to conclude


that by increasing the sample size it is more likely that the confidence interval covers
the true mean. The increase in sample size results in narrower confidence intervals, but
the level of confidence remains the same.
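
The interval calculations in Example 2B can be checked with a short Python sketch. The function name `z_confidence_interval` is our own, and the critical value is computed from the standard normal distribution rather than read from tables, so it uses 1.95996... and 2.57583... where the text uses 1.96 and 2.58. To two decimal places the intervals agree with those above.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(mean, sigma, n, level=0.95):
    """Confidence interval for a population mean when sigma is known."""
    # Two-sided critical value, about 1.96 when level = 0.95.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / sqrt(n)
    return mean - half_width, mean + half_width


# Example 2B with n = 47, and again with the corrected sample size n = 147.
for n in (47, 147):
    for level in (0.95, 0.99):
        low, high = z_confidence_interval(6.73, 1.7, n, level)
        print(f"n = {n}, {level:.0%} interval: ({low:.2f}, {high:.2f})")
```

The half-width is proportional to $1/\sqrt{n}$, which is why the intervals for n = 147 are narrower while the confidence level is unchanged.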
Example 3C: A chain of stores is interested in the expenditure on sporting equipment
by high school pupils during a winter season. A random sample of 58 high school pupils
yielded a mean winter expenditure on sporting equipment of R168.15. Assuming that the
population standard deviation is known to be R37.60, find the 95% confidence interval
for the true mean winter expenditure on sporting equipment.

Determining the sample size to achieve a desired accuracy . . .
Example 4A: A rugby coach of an under-19 team is interested in the average mass of
18-year-old males. The population standard deviation of mass for this age class is known
to be 6.58 kg. What size sample is needed in order to get an estimate of the average
mass of 18-year-old males which we are 95% sure lies within 2 kg of the true value?
Clearly, we want to find a confidence interval of the form $(\bar{X} - 2\ ,\ \bar{X} + 2)$. Because
the 95% confidence interval is given by $(\bar{X} - 1.96\,\sigma/\sqrt{n}\ ,\ \bar{X} + 1.96\,\sigma/\sqrt{n})$ it follows
that, with $\sigma = 6.58$,

$$2 = 1.96 \times 6.58/\sqrt{n},$$

and thus $n = (1.96 \times 6.58/2)^2 = 41.6$, which gives $n = 42$ after rounding upwards to the next integer.
The general method for determining sample sizes is given in the box:
CALCULATING THE SAMPLE SIZE TO ACHIEVE DESIRED ACCURACY, $\sigma^2$ KNOWN

To obtain an estimate of the mean which is within $L$ units of the
population mean with confidence level A%, the required sample size is

$$n = (z\,\sigma/L)^2$$

where the appropriate values of $z$ are

A%     90%     95%     98%     99%
z      1.64    1.96    2.33    2.58
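
As a minimal sketch, the sample size formula in the box can be written as a Python function. The name `required_sample_size` is our own; the z-values are those in the table above, and the result is always rounded up, as in Example 4A.

```python
from math import ceil, sqrt


def required_sample_size(sigma, L, z=1.96):
    """Smallest n for which z * sigma / sqrt(n) is at most L."""
    return ceil((z * sigma / L) ** 2)


# Example 4A: sigma = 6.58 kg, within 2 kg, 95% confidence.
print(required_sample_size(6.58, 2))

# Halving L quadruples the required sample size (variance 115, 99% confidence).
print(required_sample_size(sqrt(115), 1, z=2.58))
print(required_sample_size(sqrt(115), 0.5, z=2.58))
```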

Example 5C: The population variance of the amount of cooldrink supplied by a vending
machine is known to be $\sigma^2 = 115$ ml$^2$.
(a) The machine was activated 61 times, and the mean amount of cooldrink supplied
on each occasion was 185 ml. Find a 99% confidence interval for the mean.
(b) What size samples are required if the estimate is required to be within (i) 1 ml
(ii) 0.5 ml of the true value with probability 0.99?


Testing whether the mean is a specified value when the population variance is assumed known . . .
We again use an example to motivate our problem.
Example 6A: A manufacturer claims that, on average, his batteries last for 100 hours
before they go flat. The population standard deviation of battery life is known to be
12 hours. We take a sample of 50 batteries and find the sample mean $\bar{X}$ is 95.5 hours.
Does this disprove the manufacturer's claim?
As a first approach to this problem, let us assume that the manufacturer's claim is
true, and that the true mean $\mu$ is in fact 100 hours, and ask: "What is the probability
that $\bar{X} < 95.5$, given that $\mu = 100$?" We know that

$$\bar{X} \sim N(\mu, \sigma^2/n)$$

and that

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$

In this example

$$Z = \frac{\bar{X} - 100}{12/\sqrt{50}} \sim N(0, 1).$$

Here $\bar{X} = 95.5$, and the corresponding z-value is therefore

$$z = \frac{95.5 - 100}{12/\sqrt{50}} = -2.65.$$

Thus

$$\Pr[\bar{X} < 95.5] = \Pr[Z < -2.65] = 0.0040, \text{ from tables.}$$

This says, if the manufacturer's claim is true (i.e. $\mu = 100$) then the probability of
getting a sample mean of 95.5 or less is 0.004, a very small probability.
Now we have to take a decision. Either
(a) the manufacturer is correct, and a very unlikely event has occurred, one that will
occur on average 4 times in every 1000 samples, or
(b) the manufacturer's claim is not true, and the true population mean is less than
100, making a sample mean of 95.5 or less a more likely event.
The statistician here would go for alternative (b). He would reason that alternative
(a) is so unlikely that he can safely reject it, and he would conclude that the manufacturer's claim is exaggerated.

Tests of hypotheses . . .
The problem posed in example 6A introduces us to the concept of statistical inference: how we infer or draw conclusions from data. Tests of hypotheses, also
called significance tests, are the foundation of statistical inference.
Whenever a claim or assumption needs to be examined by means of a significance
test, we have a step-by-step procedure, as outlined below. We will use the above problem
to illustrate the steps. A modified procedure to do hypothesis tests will be discussed
later.


1. Set up a null hypothesis. This is almost always a statement about the value
of a population parameter. Here the null hypothesis is that the true mean $\mu$ is
equal to 100. For our null hypothesis we usually take any claim that is made (and
usually we are hoping to be able to reject it!).
We abbreviate the above null hypothesis to
$$H_0 : \mu = 100.$$
2. An alternative hypothesis $H_1$ is defined. $H_1$ is accepted if the test enables us
to reject $H_0$. Here the alternative hypothesis is $H_1 : \mu < 100$. We shall see that
this is a one-sided alternative and gives rise to a one-tailed significance
test.
3. A significance level is chosen. The significance level expresses the probability of
rejecting the null hypothesis when it is in fact true. We usually work with a 5%
or 0.05 significance level. We will then make the mistake of rejecting a true null
hypothesis 5% of the time, i.e. one time in twenty. If the consequences of wrongly
rejecting the null hypothesis are serious, a 1% or 0.01 level is used. Here a 5%
significance level would suffice.
4. We determine, from tables, the set of values that will lead to the rejection of the
null hypothesis. We call this the rejection region. Our test statistic will have a
standard normal distribution; thus we use normal tables. Because $H_1$ is one-sided and contains a "<" sign, the rejection region is in the lower (left hand) end
of the standard normal distribution. We reject $H_0$ if the sample mean $\bar{X}$ is too
far below the hypothesized population mean $\mu$, that is, if $\bar{X} - \mu$ is too negative.
Because the significance level is 5% we must therefore find the lower 5% point of
the standard normal distribution; this is $-1.64$.
[Figure: standard normal density, $Z \sim N(0, 1)$, with the rejection region in the lower tail, to the left of $-1.64$.]

Thus the rejection region ties up with the distribution of the test statistic, the form of the alternative hypothesis, and the size of the significance level. The value we look up in the table is frequently called the critical
value of the test statistic.
5. We calculate the test statistic. We know that $\bar{X} \sim N(\mu, \sigma^2/n)$. If $H_0$ is true
then $\bar{X} \sim N(100, 12^2/50)$ and

$$Z = \frac{\bar{X} - 100}{12/\sqrt{50}} \sim N(0, 1).$$

In our example, $\bar{X} = 95.5$ and the observed value of the test statistic $z$ is

$$z = (95.5 - 100)/(12/\sqrt{50}) = -2.65.$$


6. We state our conclusions. We determine whether the value we observed falls into
the rejection region. If it does, then we reject the null hypothesis $H_0$ and accept
the alternative $H_1$. The result is then said to be statistically significant.
The feeling you should have by now is that if $\bar{X}$, the mean from our sample, is too
far from 100, then we are going to reject $H_0$. But how far is too far?
We will reject $H_0$ if the calculated value of $Z = (\bar{X} - 100)/(12/\sqrt{50})$ is less than
$-1.64$. Because $-2.65$, the observed value of $Z$, is less than $-1.64$ we reject $H_0$ at the
5% significance level.
In essence what we are doing is to say that it is an unlikely event (with probability
less than 0.05) to get a sample mean of 95.5, or smaller, when the population mean is
100. Because the true mean is unlikely to be 100 we conclude that it must be less than
100. Another angle on this is to remember that we are testing the position of $\mu$, which
is a measure of location. We are opting for the conclusion that the entire distribution is
located (or centred) on a point to the left of the hypothesized location; this conclusion
is consistent with the low sample mean we obtained.
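
The six steps for the battery example can be traced in a few lines of Python. This is only a sketch: the variable names are ours, and the critical value is computed from the standard normal distribution (about $-1.645$) instead of being read from tables as $-1.64$.

```python
from math import sqrt
from statistics import NormalDist

# Steps 1 and 2: H0: mu = 100 against the one-sided alternative H1: mu < 100.
mu0 = 100
# Step 3: significance level.
alpha = 0.05
# Step 4: rejection region is z < critical, the lower 5% point of N(0, 1).
critical = NormalDist().inv_cdf(alpha)
# Step 5: test statistic from the sample (sigma = 12, n = 50, sample mean 95.5).
sigma, n, xbar = 12, 50, 95.5
z = (xbar - mu0) / (sigma / sqrt(n))
# Step 6: conclusion.
reject = z < critical

print(f"critical value = {critical:.2f}, z = {z:.2f}, reject H0: {reject}")
```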
Example 7B: The mean output of a typist in a typing pool is known to be 30 letters
per day with a standard deviation of 10 letters. A new typist in the pool types 795
letters in her first 30 days, i.e. 795/30 = 26.5 letters per day. Is her performance so far
below average that we should fire her? Use a 1% significance level.
1. We set up the null hypothesis by giving the typist the benefit of the doubt: $H_0 : \mu = 30$.
2. The alternative hypothesis is $H_1 : \mu < 30$.
3. We use a 1% significance level; the probability of rejecting a true null hypothesis
is then 0.01. Because the decision to fire a person is an important one, we do not
want to take this decision incorrectly. Using a 1% significance level means that we
will make this incorrect decision only one time in a hundred.
4. An observed z-value of less than $-2.33$ will lead us to reject the null hypothesis:
[Figure: standard normal density, $Z \sim N(0, 1)$, with the rejection region in the lower tail, to the left of $-2.33$.]

5. The test statistic is

$$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{26.5 - 30}{10/\sqrt{30}} = -1.92.$$

6. Because $-1.92 > -2.33$ we do not reject $H_0$. We thus decide to keep our new
typist on. Note that if we had used a 5% significance level, the critical value
would have been $-1.64$, and we would have decided that our new typist was below
standard.


Example 8C: From past records it is known that the checkout times at supermarket
tills have a standard deviation of 1.3 minutes. Past records also reveal that the average
checkout time at a certain type of till is 4.1 minutes. A new type of till is monitored and
64 randomly sampled customers had an average checkout time of 3.8 minutes. Does the
new till result in a significant reduction in checkout times? Use a 1% level of significance.

One-sided and two-sided alternative hypotheses . . .

The following example illustrates the use of a two-sided alternative hypothesis.
Guidelines on the choice of one-sided and two-sided alternative hypotheses are given after
the example.
Example 9B: A farmer who has used a certain fertilizer for many years knows that
his average yield of tomatoes is 2.5 tons/ha with a standard deviation of 0.53 tons/ha.
The fertilizer is discontinued and he has to use a new brand. He suspects that the new
fertilizer might alter the yield, but he has no idea whether the change will be an increase
or a decrease. At the end of the season he finds that the average yield in 35 plots has
been 2.65 tons/ha. At the 5% level of significance, test whether this is a significant
change.
1. $H_0 : \mu = 2.5$
2. $H_1 : \mu \neq 2.5$. The farmer does not know in advance in which direction the change
in yield might go. He wants to be able to reject H0 if the yield either increases or
decreases significantly.
3. Significance level : 5%.
4. Rejection region. Because of the form of the alternative hypothesis, we need to
have two rejection regions, so that we can reject the null hypothesis either if the
change of fertilizer decreases the yield or if it increases the yield. The two rejection
regions are made the same size, so that their total probability is 0.05. We reject
the null hypothesis if |z| > 1.96. The diagram illustrates this.
[Figure: standard normal density, $Z \sim N(0, 1)$, with rejection regions in both tails, to the left of $-1.96$ and to the right of $1.96$.]

5. We calculate the test statistic:

$$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{2.65 - 2.5}{0.53/\sqrt{35}} = 1.67.$$

6. The observed z-value, 1.67, does not lie within either part of the rejection region;
thus we cannot reject $H_0$. On the available evidence the farmer concludes that
the new fertilizer is not significantly different from the old.
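
The one-sided and two-sided tests differ only in the rejection region, which a small Python function can make explicit. This is a sketch under our own naming: `z_test` and its `alternative` argument are illustrative, and the critical values are computed from the standard normal distribution rather than taken from tables.

```python
from math import sqrt
from statistics import NormalDist


def z_test(xbar, mu0, sigma, n, alpha, alternative):
    """Return (z, reject) for a test of H0: mu = mu0 with sigma known."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    std = NormalDist()
    if alternative == "less":          # H1: mu < mu0, lower tail
        reject = z < std.inv_cdf(alpha)
    elif alternative == "greater":     # H1: mu > mu0, upper tail
        reject = z > std.inv_cdf(1 - alpha)
    elif alternative == "two-sided":   # H1: mu != mu0, alpha/2 in each tail
        reject = abs(z) > std.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")
    return z, reject


# Example 9B: two-sided test at the 5% level.
print(z_test(2.65, 2.5, 0.53, 35, 0.05, "two-sided"))
# Example 7B: one-sided test at the 1% level, then at the 5% level.
print(z_test(26.5, 30, 10, 30, 0.01, "less"))
print(z_test(26.5, 30, 10, 30, 0.05, "less"))
```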


Some guidelines . . .
The null hypothesis and the alternative hypothesis should both be determined before
the data are gathered. The guideline for the choice of alternative hypotheses is: always
use a two-sided alternative unless there are good theoretical reasons for using a one-sided alternative. The use of a one-sided alternative is never justified by claiming that
"the data point that way". In the hypothesis testing procedure we have considered above,
the significance level ought also to be predetermined.

Type I and type II errors . . .


By now you are probably saying: "it is better to use a 1% significance level than a
5% significance level, because with a 1% level we make the mistake of rejecting a true
null hypothesis only 1 time in 100, whereas we make this mistake 1 time in 20 with a
5% significance level". If you have thought of that, then you have done well, but you
have overlooked something. There is another type of error you can make: the error of
accepting the null hypothesis when it is false. The more stringent the significance level
(1% rather than 5%), the more difficult it is to reject the null hypothesis, and the more
likely we are to accept the null hypothesis when it is false. Statisticians call the two
classes of errors (rejecting a true null hypothesis, and accepting a false one) type I
and type II errors, respectively.
                        True situation
Our decision        H0 true              H0 false
accept H0           correct decision     type II error
reject H0           type I error         correct decision

The probability of committing a type I error is the significance level of the test,
sometimes also referred to as the size of the test. The probability of committing a
type II error varies depending on how close H0 is to the true situation, and is difficult to
control. The tradition of using 5% and 1% significance levels is based on the experience
that, at these levels, the frequency of type II errors is acceptable.
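
To see how the type II error probability depends on the true situation, the sketch below returns to the battery test of Example 6A ($H_0 : \mu = 100$, $H_1 : \mu < 100$, $\sigma = 12$, $n = 50$). The function `type_ii_error` is our own construction, not a formula from the text: it finds the sample mean below which $H_0$ is rejected, then computes the probability that the sample mean stays above that cutoff when the true mean is `mu_true`.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error(mu_true, mu0=100, sigma=12, n=50, alpha=0.05):
    """Probability of accepting H0: mu = mu0 when the true mean is mu_true."""
    se = sigma / sqrt(n)
    # H0 is rejected when the sample mean falls below this cutoff.
    cutoff = mu0 + NormalDist().inv_cdf(alpha) * se
    # Probability that the sample mean lands at or above the cutoff.
    return 1 - NormalDist(mu_true, se).cdf(cutoff)


for mu_true in (99, 98, 95.5, 90):
    print(mu_true,
          round(type_ii_error(mu_true, alpha=0.05), 3),
          round(type_ii_error(mu_true, alpha=0.01), 3))
```

The printed values fall as the true mean moves away from 100, and for any given true mean they are larger at the 1% level than at the 5% level.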

Comparing two sample means (with known population variances) . . .
We recall a result we first used in chapter 5. If $X_1$ and $X_2$ are random variables
with normal distributions,

$$X_1 \sim N(\mu_1, \sigma_1^2), \qquad X_2 \sim N(\mu_2, \sigma_2^2)$$

and $X_1$ and $X_2$ are independent, then the distribution of the random variable $Y = X_1 - X_2$ is given by

$$Y = X_1 - X_2 \sim N(\mu_1 - \mu_2\ ,\ \sigma_1^2 + \sigma_2^2).$$


Example 10A: A cross-country athlete runs an 8 km time trial nearly every Wednesday
as part of his weekly training programme. Last year, he ran on 49 occasions, and the
mean of his times was 30 minutes 25.4 seconds (30.42 minutes). So far this year his
mean time has been 30 minutes 15.7 seconds (30.26 minutes) over 35 runs. Assuming
that the standard deviation last year was 0.78 minutes, and this year is 0.65 minutes,
do these data establish whether, at the 5% significance level, there has been a reduction
in the athlete's time over 8 km?
We work our way through our six-point plan:
1. Let the population means last year and this year be $\mu_1$ and $\mu_2$ respectively. Our
null hypothesis specifies that there is no change in the athlete's times between last
year and this year:

$$H_0 : \mu_1 = \mu_2, \quad \text{or equivalently,} \quad H_0 : \mu_1 - \mu_2 = 0.$$

Notice once again how the null hypothesis expresses the concept we are hoping to
disprove. The null hypothesis can helpfully be thought of as the hypothesis of
"no change", or of "no difference".
2. The alternative hypothesis contains the statement the athlete hopes is true:

$$H_1 : \mu_1 > \mu_2, \quad \text{or equivalently,} \quad H_1 : \mu_1 - \mu_2 > 0.$$

The alternative hypothesis does not specify the amount of the change; it simply
states that there has been a decrease in average time.
3. Significance level : we specified that we would perform the test at the 5% level.
4. Rejection region. Because the test statistic has a standard normal distribution (we
will show this in the next paragraph), because the significance level is 5%, and
because we have a one-sided "greater than" alternative hypothesis, we will reject
$H_0$ if the observed value of the test statistic is greater than 1.64.
[Figure: standard normal density, $Z \sim N(0, 1)$, with the rejection region in the upper tail, to the right of $1.64$.]

5. Test statistic. In general, let us suppose that we have a sample of size $n_1$ from
one population and a sample of size $n_2$ from a second population. Suppose the
sample means are $\bar{X}_1$ and $\bar{X}_2$, respectively, and that the population means and
variances are $\mu_1$, $\mu_2$, $\sigma_1^2$ and $\sigma_2^2$. Then

$$\bar{X}_1 \sim N(\mu_1, \sigma_1^2/n_1) \quad \text{and} \quad \bar{X}_2 \sim N(\mu_2, \sigma_2^2/n_2).$$

Now, using the reminder at the beginning of this section,

$$\bar{X}_1 - \bar{X}_2 \sim N\left(\mu_1 - \mu_2\ ,\ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right).$$

We now transform to the standard normal distribution by subtracting the mean,
and dividing by the standard deviation:

$$Z = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0, 1).$$

This is our test statistic, and we can substitute for each variable: from the problem
description we have

$$\bar{X}_1 = 30.42, \quad n_1 = 49, \quad \sigma_1 = 0.78$$
$$\bar{X}_2 = 30.26, \quad n_2 = 35, \quad \sigma_2 = 0.65,$$

and from the null hypothesis we have

$$\mu_1 - \mu_2 = 0.$$

Thus, substituting, we compute our observed z-value as

$$z = \frac{30.42 - 30.26 - (0.0)}{\sqrt{\dfrac{0.78^2}{49} + \dfrac{0.65^2}{35}}} = 1.02.$$

6. Conclusion. The observed value of the test statistic does not lie in the rejection
region. We have to disappoint our athlete and tell him to keep trying.
Example 11B: A retail shop has two drivers that transport goods between the shop and
a warehouse 15 km away. They argue continuously about the choice of route between the
shop and warehouse, each claiming that his route is the quicker. To settle the argument,
you wish to decide (5% significance level) which driver's route is the quicker. Over a
period of months you time the two drivers, and the data collected are summarized below.

                                Driver 1                 Driver 2
Number of observations          $n_1 = 38$               $n_2 = 43$
Average time (minutes)          $\bar{X}_1 = 20.3$       $\bar{X}_2 = 22.5$
Standard deviation (minutes)    $\sigma_1 = 3.7$         $\sigma_2 = 4.1$

1. $H_0 : \mu_1 - \mu_2 = 0$. Both drivers take equally long.
2. $H_1 : \mu_1 - \mu_2 \neq 0$. We do not know in advance which driver is quicker.
3. Significance level : 5%.
4. We have a two-sided alternative at the 5% level. Thus we reject $H_0$ if the test
statistic exceeds $1.96$ or is less than $-1.96$; i.e. we reject $H_0$ if $|z| > 1.96$.



[Figure: standard normal density, $Z \sim N(0, 1)$, with rejection regions in both tails, to the left of $-1.96$ and to the right of $1.96$.]

5. The test statistic is

$$z = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{20.3 - 22.5 - 0}{\sqrt{\dfrac{3.7^2}{38} + \dfrac{4.1^2}{43}}} = -2.54$$

6. Because $|-2.54| > 1.96$ we reject $H_0$ and conclude that the driving times are not
equal. By inspection, it is clear that the route used by driver 1 is the quicker.
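
The two-sample test statistic used in Examples 10A and 11B is easy to compute directly. The sketch below uses a function name of our own, `two_sample_z`, with an optional `delta0` argument for the value of $\mu_1 - \mu_2$ specified by the null hypothesis (zero in both examples).

```python
from math import sqrt


def two_sample_z(xbar1, xbar2, sigma1, sigma2, n1, n2, delta0=0.0):
    """z statistic for H0: mu1 - mu2 = delta0, population variances known."""
    standard_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    return (xbar1 - xbar2 - delta0) / standard_error


# Example 10A: the athlete's times last year and this year.
print(round(two_sample_z(30.42, 30.26, 0.78, 0.65, 49, 35), 2))
# Example 11B: the two drivers.
print(round(two_sample_z(20.3, 22.5, 3.7, 4.1, 38, 43), 2))
```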
Example 12C: Two speedreading courses are available. Students enrol independently
for these courses. After completing their respective courses, the group of 27 students
who took course A had an average reading speed of 620 words/minute, while the group
of 38 students who took course B had an average speed of 684 words/minute. If it is
known that reading speed has a standard deviation of 25 words/minute, test (at the 5%
significance level) whether there is any difference between the two courses.
Example 13C: The scientists of the Fuel Improvement Centre think they have found a
new petrol additive which they hope will reduce a car's fuel consumption by 0.5 ℓ/100 km.
Two series of trials are conducted, one without and the other with the additive. 40 trials
without the additive show an average consumption of 9.8 ℓ/100 km. 50 trials with the
additive show an average consumption of 9.1 ℓ/100 km. Do these data establish that
the additive reduces fuel consumption by 0.5 ℓ/100 km? Use a 5% significance level,
and assume that the population standard deviations without and with the addition of
the additive are 0.8 and 0.7 ℓ/100 km respectively.
Example 14C: Show that a 95% confidence interval for $\mu_1 - \mu_2$, the difference between
two independent means, with variances assumed known, is given by

$$\left(\bar{X}_1 - \bar{X}_2 - 1.96\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\ ,\ \bar{X}_1 - \bar{X}_2 + 1.96\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right).$$

Find a 95% confidence interval for the difference between mean travelling times in Example 11B.


A modified hypothesis testing procedure . . .


The six-point plan for hypothesis testing, with steps 1 to 3 completed before the
data are considered, represents the classical approach to hypothesis testing:
1. Null hypothesis H0
2. Alternative hypothesis H1
3. Significance level
4. Rejection region
5. Test statistic
6. Conclusion.
However, in practice, and particularly in the presentation of statistical conclusions
in the journal literature, an alternative approach has been widely adopted:
1. Null hypothesis H0
2. Alternative hypothesis H1
3. Test statistic
4. Observed significance level
5. Conclusion.
In this approach, instead of fixing the significance level beforehand, we
report the highest level at which the test statistic is observed to be significant.
This level of significance is conventionally reported as a probability (rather than as a
percentage), and should be chosen from the levels given in the table below. We will adopt
the convention that if the significance level is not stated, then the modified hypothesis
testing procedure is to be followed.

Significance     Reported        z-value,       z-value,       Verbal
level            probability     one-sided      two-sided      description
                                 alternative    alternative
more than 20%    P > 0.20        z < 0.84       z < 1.28       not significant
20%              P < 0.20        z > 0.84       z > 1.28       possibly significant
10%              P < 0.10        z > 1.28       z > 1.64       nearly significant
5%               P < 0.05        z > 1.64       z > 1.96       significant
1%               P < 0.01        z > 2.33       z > 2.58       very significant
0.5%             P < 0.005       z > 2.58       z > 2.81       highly significant
0.1%             P < 0.001       z > 3.09       z > 3.29       very highly significant
0.05%            P < 0.0005      z > 3.29       z > 3.48       very highly significant
0.01%            P < 0.0001      z > 3.72       z > 3.89       very highly significant


A sensible strategy to determine the observed significance level is to start at the
5% level. If the test statistic is significant at the 5% level, move up to the 1% level,
and carry on upwards to higher levels of significance until either you find a level at
which the test statistic is no longer significant, or you reach the end of the table! Report
the highest level at which the test statistic was significant. If the test statistic was
not significant at the 5% level, move down to the lower levels of significance. If the
test statistic is significant at the 10% or even 20% significance level, you should report
this, but state the conclusion in somewhat cautious terms and add a comment that, if
possible, further investigations are needed.
The level of significance can be reported as a percentage, but there is a strong
tradition of reporting statistical significance in a shorthand notation as a P -value. So
significance at the 5% level is abbreviated to P < 0.05, and significance at the 1%
level to P < 0.01, etc. This shorthand method is illustrated in the examples.
Example 15A: The weekly wage of semi-skilled workers in an industry is known to
have population mean R286.50 and standard deviation R29.47. If a company in this
industry pays a sample of 40 of its workers an average of R271.78 per week, can the
company be accused of paying inferior wages?
1. $H_0 : \mu = 286.50$.
2. $H_1 : \mu < 286.50$.
3. The test statistic is

$$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{271.78 - 286.50}{29.47/\sqrt{40}} = -3.16.$$

4. Examining the column in the table for a one-sided alternative, we see that $z = -3.16$
is significant at the 5% level (because $z < -1.64$), the 1% level ($z < -2.33$),
the 0.5% level ($z < -2.58$), the 0.1% level ($z < -3.09$), but not at the 0.05% level
(because $z > -3.29$). The highest level at which the observed value of the test
statistic $z = -3.16$ is significant is the 0.1% level.
5. We illustrate the conventional form of writing the conclusion.
We have tested the sample mean against the population mean, and have
found a very highly significant difference ($z = -3.16$, $P < 0.001$). We
conclude that the company is paying inferior wages.
Notice carefully the shorthand method for presenting the results. The value of the
test statistic is given, and the observed level of significance of this value.
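
The table lookup in step 4 can be sketched as a short Python function. The critical values are copied from the table of significance levels above; the function name `observed_significance` is our own, and for a one-sided test it assumes that the observed z-value lies in the direction specified by $H_1$ (so only its size is compared with the table).

```python
# Rows of the table, from the most extreme level down to the 20% level:
# (reported probability, one-sided z, two-sided z, verbal description)
LEVELS = [
    ("P < 0.0001", 3.72, 3.89, "very highly significant"),
    ("P < 0.0005", 3.29, 3.48, "very highly significant"),
    ("P < 0.001", 3.09, 3.29, "very highly significant"),
    ("P < 0.005", 2.58, 2.81, "highly significant"),
    ("P < 0.01", 2.33, 2.58, "very significant"),
    ("P < 0.05", 1.64, 1.96, "significant"),
    ("P < 0.10", 1.28, 1.64, "nearly significant"),
    ("P < 0.20", 0.84, 1.28, "possibly significant"),
]


def observed_significance(z, two_sided=False):
    """Highest tabulated level at which z is significant."""
    size = abs(z)
    for reported, one_sided_z, two_sided_z, words in LEVELS:
        if size > (two_sided_z if two_sided else one_sided_z):
            return reported, words
    return "P > 0.20", "not significant"


print(observed_significance(-3.16))   # Example 15A, one-sided alternative
```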
Example 16B: The specifications for extra-large eggs are that they have a mean mass
of 125 g and standard deviation 6 g. A sample of 24 reputedly extra-large eggs had an
average mass of 123.2 g. Are the eggs smaller than the specifications permit?
1. $H_0 : \mu = 125$ g.
2. $H_1 : \mu < 125$ g.


3. Test statistic $z = \dfrac{123.2 - 125}{6/\sqrt{24}} = -1.47$.
4. From the table, $z = -1.47$ is significant at the 10% level (but not at 5%).
5. The test shows that the mass of the sample of 24 eggs was close to being significantly lower than permitted by the specifications ($z = -1.47$, $P < 0.10$). Further
investigation is recommended to clarify the situation.
Example 17C: The standard deviation of salaries in the private sector is known to be
R4500, while in the public sector it is R3000. A random sample of 80 people employed
in the private sector has a mean income of R8210, and a sample of 65 public sector
employees has a mean income of R7460. Test the hypothesis that incomes in the private
sector are significantly higher than those in the public sector.
Example 18C: A tellurometer is an electronic distance-measuring device used by
surveyors. A new model has been developed, and its calibration is tested by repeatedly
and independently measuring a distance of exactly 500 m. After 100 such measurements,
the mean is 499.993 m with a standard deviation of 0.023 m. Does the new tellurometer
have a systematic bias? Assume that the sample is sufficiently large that $\sigma \approx s$.
Example 19C: In the course of a year a statistician performed 60 hypothesis tests on
different sets of data, each at the 5% significance level. Suppose that in every case the
null hypothesis was true. What is the probability that he made no incorrect decisions?
What is the expected number of incorrect decisions?

Hypothesis testing using the normal approximation to the binomial and Poisson distributions . . .
The normal distribution can also be used, under certain conditions, to test whether
the parameters $p$ and $\lambda$ of the binomial and Poisson distributions have particular values.
The examples illustrate the procedure. Note that these are not tests on a mean of a
random variable, but on the parameters of binomial and Poisson distributions.
Example 20A: A new coin is produced. It is necessary to test to see if it is unbiased.
A random sample of 1000 coins are tossed independently and 478 heads appear. Is the
coin unbiased? Use a 1% level of significance.
Let p be the probability of getting a head. The random variable of interest X , the
number of heads in 1000 trials, has a binomial distribution with parameters n and p.
1. $H_0 : p = \frac{1}{2}$. This states that the coin is unbiased.
2. $H_1 : p \neq \frac{1}{2}$. If $p$ differs from $\frac{1}{2}$ then the bias can go either way; remember that
the null and alternative hypotheses ought to be formulated before you conduct the
experiment. This is an example of a two-sided alternative and gives rise to a
two-tailed test.
3. Significance level : 1%.
4. Because we will be using the normal distribution for our test statistic, we will
reject $H_0$ if the observed z-value is greater than $2.58$ or less than $-2.58$.



[Figure: standard normal density, $Z \sim N(0, 1)$, with rejection regions in both tails, to the left of $-2.58$ and to the right of $2.58$.]

5. The random variable $X$, the number of heads in 1000 trials, has a binomial distribution with mean $np$ and variance $npq$. If $H_0$ is true, then $p = \frac{1}{2}$, so that $np = 500$
and $npq = 250$. Thus $X$ satisfies the conditions for it to be approximated by
the normal distribution (see Chapter 6 for the normal approximation to the binomial
distribution) with mean 500 and variance 250 (i.e. standard deviation $\sqrt{250}$), i.e.
$X \sim N(500, 250)$ approximately, and the usual transformation to the standard normal distribution yields

$$z = \frac{478 - 500}{\sqrt{250}} = -1.39.$$

6. This value does not lie in the rejection region: we therefore cannot reject $H_0$, and
we have no evidence that the coin is biased.
In general, the procedure for testing the null hypothesis that the parameter $p$ of the
binomial distribution is some particular value is summarized as follows. If $X \sim B(n, p)$,
if $np$ and $n(1 - p)$ are both greater than 5 and if $0.1 < p < 0.9$, then the distribution of
the binomial random variable can be approximated by the normal distribution $N(np, npq)$,
where $q = 1 - p$, so that the formula for calculating the test statistic $z$ is

$$z = \frac{X - np}{\sqrt{npq}}.$$

The value $X$ is the observed number of successes in $n$ trials, and the value for $p$ is taken
from the null hypothesis.
Likewise, the procedure for testing the null hypothesis that the parameter $\lambda$ of the
Poisson distribution is some particular value is summarized as follows. The Poisson
distribution $P(\lambda)$ can be approximated by the normal distribution $N(\lambda, \lambda)$ provided
$\lambda > 10$. The formula for calculating the test statistic is thus

$$z = \frac{X - \lambda}{\sqrt{\lambda}},$$

where $X$ is the observed value of the Poisson random variable and the value for $\lambda$ is
taken from the null hypothesis.
Example 21B: The number of aircraft landing at an airport has a Poisson distribution.
Last year the parameter $\lambda$ was taken to be 120 per week. During a week in March this
year an aviation department official recorded 143 landings. Does this datum suggest
that the parameter has increased? Use a 5% significance level.
1. $H_0 : \lambda = 120$. The null hypothesis states that the rate is unchanged.
2. $H_1 : \lambda > 120$.


3. Significance level : 5%.
4. Reject $H_0$ if the observed z-value exceeds 1.64.
5. Using the normal approximation to the Poisson distribution, we compute the statistic

$$z = \frac{X - \lambda}{\sqrt{\lambda}} = \frac{143 - 120}{\sqrt{120}} = 2.10.$$

6. Because 2.10 > 1.64, we reject $H_0$, and conclude that the rate of landings has
increased.
In other words, it is an unlikely event to observe 143 landings in a week, if the true
rate is 120.
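
The two test statistics based on the normal approximation can be sketched in Python as follows. The function names `binomial_z` and `poisson_z` are ours; each function first checks the conditions quoted in the text for the approximation to be reasonable and raises an error if they fail.

```python
from math import sqrt


def binomial_z(x, n, p0):
    """z statistic for H0: p = p0, given x successes in n trials."""
    q0 = 1 - p0
    # Conditions from the text: np > 5, n(1 - p) > 5 and 0.1 < p < 0.9.
    if not (n * p0 > 5 and n * q0 > 5 and 0.1 < p0 < 0.9):
        raise ValueError("normal approximation to the binomial not justified")
    return (x - n * p0) / sqrt(n * p0 * q0)


def poisson_z(x, lam0):
    """z statistic for H0: lambda = lam0, given an observed count x."""
    # Condition from the text: lambda > 10.
    if lam0 <= 10:
        raise ValueError("normal approximation to the Poisson not justified")
    return (x - lam0) / sqrt(lam0)


print(round(binomial_z(478, 1000, 0.5), 2))   # Example 20A
print(round(poisson_z(143, 120), 2))          # Example 21B
```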
Example 22C: A large motor car manufacturer enjoys a 21% share of the South
African market. Last month, out of 2517 new vehicles sold, 473 were produced by this
manufacturer. Does management have cause for alarm? Quote the observed significance
level.
Example 23C: A die is rolled 300 times, and 34 sixes are observed. Is the die biased?
Example 24C: The complaints department of a large department store deals with, on
average, 20 complaints per day. On Monday, 9 May 1994, there were 33 complaints.
(a) How frequently should there be 33 or more complaints?
(b) At the 1% significance level, test if management should decide that it would be
worth investigating this day's complaints further.
(c) What would be acceptable numbers of complaints per day at the 5% and 1%
significance levels?
Example 25C: In a large company with a gender initiative, the proportion of female
staff in middle-management positions was 0.23 in 1990. Currently, in one of the regional
offices of the company, there are 28 women and 63 men in middle-management positions.
If this regional office can be assumed to be representative of the company as a whole,
has there been a significant increase in the proportion of women in middle management?
Do the test at the 5% significance level.

Solutions to examples . . .
3C (158.47, 177.83).
5C (a) (181.46, 188.54).
(b) (i) 766
(ii) 3062.
Note: to improve the precision by a factor of $\frac{1}{2}$ required a sample 4 times as large.

8C Reject $H_0$ if $z < -2.33$. Observed z = 1.85. Cannot reject $H_0$.


12C z = 10.17. Reject H0 .
13C $H_0: \mu_1 - \mu_2 = 0.5$. z = 1.25. Cannot reject $H_0$.

14C 95% confidence interval : (3.90, 0.50).


17C Using modified procedure, z = 1.20, P < 0.20, possibly significant.
18C z = 3.043, P < 0.005, highly significant.
19C Pr(all correct decisions) = $0.95^{60} = 0.046$. E(incorrect decisions) = $60 \times 0.05 = 3$.
22C $z = -2.72$, $P < 0.005$, highly significant.
23C $H_0: p = 1/6$, $z = -2.47$, $P < 0.05$, significant.
24C (a) Use normal approximation to Poisson: $\Pr[X \ge 33] = 0.0026$, or once in
385 days.
(b) $H_0: \lambda = 20$, $H_1: \lambda > 20$. Reject $H_0$ if $z > 2.33$, observed $z = 2.91$. Reject
$H_0$.
(c) Largest acceptable number of complaints is 27 at the 5% level and 30 at the
1% level.
25C H0 : p = 0.23. Observed z = 1.606. Cannot reject H0 .

Exercises using the central limit theorem . . .


8.1 A manufacturer of light bulbs prints "Average life 2100 hours" on the package of
its bulbs. If the true distribution of lifetimes has a mean of 2130 and standard
deviation 200, what is the probability that the average lifetime of a random sample
of 50 bulbs will exceed 2100 hours? (Hint: use the result $\bar{X} \sim N(\mu, \sigma^2/n)$.)

8.2 Certain tubes produced by a company have a mean lifetime of 1000 hours and a
standard deviation of 160 hours. The tubes are packed in lots of 100. What is the
probability that the mean lifetime for a randomly selected pack will exceed 1020
hours?

Exercises on confidence intervals and sample sizes . . .


8.3 (a) A sample survey was conducted in a city suburb to determine the mean family
income for the area. A random sample of 200 households yielded a mean of
R6578. The standard deviation of incomes in the area is known to be R1000.
Construct a 95% confidence interval for $\mu$.
(b) Suppose now that the investigator wants to be within R50 of the true value
with 99% confidence. What size sample is required?
(c) If the investigator wants to be within R50 of the true value with 90% confidence, what is the required sample size?

8.4 We want to test the strength of lift cables. We know that the standard deviation
of the maximum load a cable can carry is 0.73 tons. We test 60 cables and find
the mean of the maximum loads these cables can support to be 11.09 tons. Find
intervals that will include the true mean of the maximum load with probabilities
(a) 0.95 and (b) 0.99.
8.5 A random variable has standard deviation $\sigma = 4$. The 99% confidence interval
for the mean of the random variable was (19.274, 20.726). What was the sample
mean and what was the size of the sample?

8.6

During a student survey, a random sample of 250 first year students were asked to
record the amount of time per day spent studying. The sample yielded a mean of
85 minutes, with a standard deviation of 30 minutes. Construct a 90% confidence
interval for the population mean.

Exercises on hypothesis testing (use the six-point plan) . . .


8.7

The mean height of 10-year-old males is known to be 150 cm and the standard
deviation is 10 cm. An investigator has selected a sample of 80 males of this age
who are known to have been raised on a protein-deficient diet. The sample mean
is 147 cm. At the 5% level of significance, decide whether diet has an effect on
height.

8.8 A machine is supposed to produce nuts with a mean diameter of 20 mm and a
standard deviation of 0.2 mm. A sample of 40 nuts had a mean of 20.05 mm. Make
a decision whether or not the machine needs adjustment, using a 5% significance
level.
8.9 Suppose that the time taken by a suburban train to get from Cape Town to Wynberg
is a random variable with a $N(\mu, 4)$ distribution. 25 journeys take an average
of 27.7 minutes. Test the hypothesis $H_0: \mu = 27$ against the alternatives (a)
$H_1: \mu \neq 27$ and (b) $H_1: \mu > 27$ at the 5% level of significance.
8.10 The lifetime of electric bulbs is assumed to be exponentially distributed with a
mean of 40 days. A sample of 100 bulbs lasted on average 37.5 days. Using a
1% level of significance, decide whether the true mean is in fact less than 40 days.
(Hint: remember that the mean and variance of the exponential distribution are
$1/\lambda$ and $1/\lambda^2$ respectively.)

8.11 A mechanized production line operation is supposed to fill tins of coffee with a
mean of 500.5 g of coffee with a standard deviation of 0.6 g. A quality control
specialist is concerned that wear and tear has resulted in a reduction in the mean.
A sample of 42 tins had a mean content of 500.1 g. Use a 1% significance level to
perform the appropriate test.
8.12

(a) A coin is tossed 10 000 times, and it turns up heads 5095 times. Is it
reasonable to think that the coin is unbiased? Use 5% significance level.
(b) A coin is tossed 400 times, and it turns up heads 220 times. Is the coin
unbiased? Use 5% level.

8.13 The beta coefficient is a measure of risk widely used by financial analysts. Larger
values for beta represent higher risk. A particular analyst would like to determine
whether gold shares are more risky than industrial shares. From past records,
it is known that the standard deviation of betas for gold shares is 0.313 and for
industrial shares is 0.507. A sample of 40 gold shares had a mean beta coefficient
of 1.24 while a sample of 30 industrial shares had a mean of 0.72. Using a 1% level
of significance, conduct the appropriate statistical test for the financial analyst.
8.14

The standard deviation of scores in an I.Q. test is known to be 12. The I.Q.s of
random samples of 20 girls and 20 boys yield averages of 110 and 105 respectively.
Use a 1% significance level to test the hypothesis that the I.Q.s of boys and girls
are different.

Exercises on hypothesis testing (use the modified procedure) . . .


8.15 In an assembly process, it is known from past records that it takes an average
of 4.32 hours with a standard deviation of 0.6 hours for a computer part to be
assembled. A new procedure is adopted. A random sample of 100 items using
the new procedure took, on average, 4.13 hours. Assuming that the standard
deviation has remained unaltered, test whether the new procedure is effective in reducing
assembly time.
8.16 It is accepted that the standard deviation of the setting time of an adhesive is 1.25
hours under all conditions. It is, however, well known that the adhesive sets more
quickly in warm conditions. A new additive is developed that reputedly accelerates
setting under freezing conditions. A sample of 50 joints glued with the adhesive,
plus additive, had a mean setting time of 4.78 hours, whereas a sample of 30 joints
glued without the additive in the adhesive took, on average, 5.10 hours to set. The
entire experiment was conducted at 0 °C. Is the additive effective in accelerating
setting time?

8.17 A reporter claims that at least 60% of voters are concerned about conservation
issues. Doubting this claim, a politician samples 480 voters and of them 275
expressed concern about conservation. Test the reporter's claim.

8.18 If the average rate of computer breakdowns is 0.05 per hour, or less, the computer
is deemed to be operating satisfactorily. However, in the past 240 hours the computer
has broken down 18 times. Test the null hypothesis that $\lambda = 0.05$ against
the alternative that $\lambda$ exceeds 0.05.

8.19 The mean grade of ore at a gold mine is known to be 4.40 g with a standard deviation of 0.60 g per ton of ore milled. A geologist has randomly selected 30 samples
of ore in a new section of the mine, and determined their mean grade to be 4.12 g
per ton of ore. Test whether the ore in the new section of the mine is of inferior
quality.

Solutions to exercises . . .
8.1 0.8554.
8.2 0.1056.
8.3 (a) (6439.4, 6716.6) (b) 2663 (c) 1076.
8.4 (a) (10.91, 11.27) (b) (10.85, 11.33).

8.5 Mean was 20.0, n = 202.


8.6 (81.9, 88.1).
8.7 Observed value of test statistic $z = -2.68 <$ critical value $= -1.64$, reject $H_0$.
8.8 $z = 1.58 < 1.96$, cannot reject $H_0$.
8.9 (a) $z = 1.75 < 1.96$, cannot reject $H_0$. (b) $z = 1.75 > 1.64$, reject $H_0$.
8.10 $z = -0.63 > -2.33$, cannot reject $H_0$.
8.11 $z = -4.32 < -2.33$, reject $H_0$.
8.12 (a) Yes, $z = 1.90 < 1.96$, cannot reject $H_0$: coin is unbiased.
(b) No, $z = 2.00 > 1.96$, reject $H_0$.
8.13 $z = 4.95 > 2.33$, reject $H_0$.
8.14 $z = 1.32 < 2.58$, cannot reject $H_0$.
8.15 Conclusion: the new procedure is statistically highly significant in reducing assembly time ($z = -3.17$, $P < 0.001$).
8.16 Conclusion: the adhesive with the additive is possibly significant ($z = -1.11$, $P < 0.20$). Further investigation recommended.
8.17 Use normal approximation to $B(480, 0.6)$. Conclusion: accept the reporter's claim
($z = -1.21$, $P > 0.20$) that the proportion does not differ significantly from 60%.
8.18 Use normal approximation to $P(12)$. Conclusion: the computer is having breakdowns at a rate significantly higher than 0.05/hr ($z = 1.73$, $P < 0.05$).
8.19 Conclusion: The new section of the mine does contain inferior grade ore ($z = -2.56$, $P < 0.01$).

Chapter 9

THE t- AND F-DISTRIBUTIONS


KEYWORDS: t-distribution, degrees of freedom, pooled sample variance, F -distribution

Population variance unknown . . .


So far, for confidence intervals for the mean, and for tests of hypotheses about means,
we have always had to make the very restrictive assumption that the population variance
(or standard deviation) was known. Until the early 1900s, whenever the population
variance $\sigma^2$ was unknown, the sample variance $s^2$ was substituted in its place. This
is reasonably satisfactory when the samples are large enough, and consequently the
sample variance lies reliably close to the population variance. We shall see in this
chapter that when the sample size exceeds about 30 then it is reasonable to make this
substitution. Problems arise with small samples, because the sample variance $s^2$ is then
very variable and can be far removed from the population variance $\sigma^2$, which we think
of as being the one true value.
At the turn of the twentieth century, W.S. Gosset, who worked for an Irish brewery, needed to
do statistical tests using small samples. This motivated him to tackle the mathematical
statistical problem of how to use $s^2$, the estimate of $\sigma^2$, in these tests. He solved the
problem. Because of his professional connections, he published the theory he developed
under the pen-name "Student". In 1908 Gosset published the actual distribution of
$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}}.$$
We know, from Chapter 8, that
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
has the standard normal distribution. But
when $\sigma$, the single true value for the standard deviation in the population, is replaced
by $s$, the estimate of $\sigma$, this is no longer true, although, as the sample size $n$ increases,
it rapidly becomes an excellent approximation. But for small samples it is far from the
truth. This is because the sample variance $s^2$ (and also $s$) is itself a random variable:
it varies from sample to sample. We will discuss the sampling distribution of $s^2$ in
the next chapter.
We know that the size of the sample influences the accuracy of our estimates. The
larger the sample the closer the estimate is likely to be to the true value. Student's
t-distribution takes account of the size of the sample from which $s$ is calculated.


The shape of the t-distribution is similar to that of the normal distribution. However,
the shape of the distribution varies with the sample size. It is longer- or heavier-tailed
than the normal distribution when the sample size is small. As the
sample size increases, the t-distribution and normal distribution become progressively
closer, and, ultimately, they are identical. The standard normal distribution and two
t-distributions are plotted.
[Figure: probability density functions of the standard normal distribution $N(0, 1)$ and of the t-distributions $t_{15}$ and $t_3$.]
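The heavier tails of the t-distribution can also be seen numerically. The sketch below evaluates the standard formula for the t density alongside the standard normal density; the function names `normal_pdf` and `t_pdf` are our own, and the choice of evaluation points is arbitrary.

```python
from math import exp, lgamma, log, log1p, pi, sqrt

def normal_pdf(x):
    """Density of the standard normal distribution N(0, 1)."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    # Work on the log scale so that large df do not overflow the gamma function.
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log1p(x * x / df))

# Compare the three curves at the centre and out in the tail.
for x in (0.0, 1.0, 2.0, 3.0):
    print(f"x = {x:.1f}: N(0,1) {normal_pdf(x):.4f}, "
          f"t_15 {t_pdf(x, 15):.4f}, t_3 {t_pdf(x, 3):.4f}")
```

At $x = 3$ the $t_3$ density is the largest of the three and the normal density the smallest, while at $x = 0$ the ordering is reversed: the t-distributions move probability from the centre into the tails.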

Degrees of freedom . . .
In order to gain some insight into the notion of degrees of freedom, consider again
the definition of the sample variance:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$
The terms $x_i - \bar{x}$ are the deviations of each of the $x_i$ from the sample mean. To
achieve a given sample mean for, say, six numbers, five of these can be chosen at will,
but the last is then fully determined. Suppose we are given the information that the
mean of six numbers is 5 and that the first five of the six numbers are 4, 9, 5, 7 and 3.
The sixth number must be 2, otherwise the sample mean would not be 5. It is fixed, it
has no freedom. In general if we are told that the mean of $n$ numbers is $\bar{x}$, and that
the first $n - 1$ numbers are $x_1, x_2, \ldots, x_{n-1}$, then it is easy to see that $x_n$ must be
given by
$$x_n = n\bar{x} - x_1 - x_2 - \cdots - x_{n-1}.$$
In other words, once we are given $\bar{x}$ and the first $n - 1$ of the $x_i$ we have enough
information to compute the sample variance. Thus, only $n - 1$ of the deviation terms
$x_i - \bar{x}$ in $s^2$ contain real information; the last term is just a formality, but it must be
included! We say that $s^2$, based on a sample of size $n$, has $n - 1$ degrees of freedom.
(This is part of the reason why the formula for the sample variance $s^2$ calls for division
by $n - 1$, and not $n$.)
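The six-number illustration above can be reproduced in a few lines. This is a sketch only; the variable names are our own. It shows that the sixth value is forced by the mean, that the deviations sum to zero, and that dividing by $n - 1$ agrees with the standard library's `statistics.variance`.

```python
from statistics import mean, variance

first_five = [4, 9, 5, 7, 3]
n, target_mean = 6, 5

# The last value has no freedom: x_n = n * xbar - (x_1 + ... + x_{n-1})
sixth = n * target_mean - sum(first_five)
sample = first_five + [sixth]

# The n deviations from the mean always sum to zero,
# so only n - 1 of them carry independent information.
deviations = [x - mean(sample) for x in sample]
s2 = sum(d * d for d in deviations) / (n - 1)

print("sixth value:", sixth)
print("sum of deviations:", sum(deviations))
print("sample variance:", s2, "| statistics.variance:", variance(sample))
```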
We will be encountering the concept of degrees of freedom regularly. We have a
simple rule which helps in making decisions about degrees of freedom.


THE DEGREES OF FREEDOM RULE
For each parameter we need to estimate prior to evaluating the current
parameter of interest, we lose one degree of freedom.

For example, when we use $s^2$ to estimate $\sigma^2$, we first need to use $\bar{x}$ to estimate $\mu$.
So we lose one degree of freedom. There will be further examples later on in which we
will lose two or more degrees of freedom!
Because the shape of the t-distribution varies with the sample size, there is a whole
family of t-distributions. It is therefore necessary to have some means of indicating
which t-distribution is being used in a specific situation. We do this by using a subscript.
Intuition suggests that the subscript should be the sample size, but it turns out that the
sensible subscript is the degrees of freedom of the standard deviation, which is one less
than the sample size. So we use the notation
$$t_{n-1} = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$
and say that the expression on the right hand side has the t-distribution with $n - 1$
degrees of freedom, or simply the $t_{n-1}$-distribution.
As for the normal distribution, we need tables for looking up values for the
t-distribution. The shapes of the t-distributions are dependent on the degrees of freedom;
thus we cannot get away with a single table for all t-distributions as we did with the
normal distribution. We really do appear to need a separate table for each number of
degrees of freedom. But we take a short cut, and we only present a selection of key
values from each t-distribution in a single table (Table 2). If you think about it, you
can now begin to understand why we can do this; even for the normal distribution, we
repeatedly only use a handful of values; $z^{(0.05)} = 1.64$, $z^{(0.025)} = 1.96$, $z^{(0.01)} = 2.33$
and $z^{(0.005)} = 2.58$ are by far the most frequently used percentage points of the standard
normal distribution. In Table 2, there is one line for each t-distribution; on that line,
we present 11 percentage points.
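For readers who would like to see where a line of Table 2 comes from, the following sketch computes percentage points of a t-distribution numerically. It is not how the printed table was produced; it simply integrates the t density with Simpson's rule and searches for the point by bisection. The function names, the number of integration steps and the search bound `upper` are all our own assumptions.

```python
from math import exp, lgamma, log, log1p, pi

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log1p(x * x / df))

def t_upper_tail(t, df, steps=2000):
    """Pr(T > t) for t >= 0: one half minus the Simpson's-rule area on [0, t]."""
    h = t / steps
    area = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - area * h / 3

def t_point(alpha, df, upper=100.0):
    """Find t with Pr(T > t) = alpha by bisection (0 < alpha < 0.5)."""
    lo, hi = 0.0, upper
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > alpha:
            lo = mid   # tail area still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2

for df, alpha in ((9, 0.025), (24, 0.005)):
    print(f"df = {df}, alpha = {alpha}: t = {t_point(alpha, df):.3f}")
```

The printed values can be checked against the corresponding entries of Table 2.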

Confidence intervals (using the sample variance) . . .


Example 1A: Ten direct flights to Johannesburg from Cape Town took, on average, 103
minutes. The sample standard deviation was 5 minutes. Determine a 95% confidence
interval for the true mean flight duration.
We have $\bar{x} = 103$, $s = 5$ and $n = 10$. We know that the statistic $(\bar{X} - \mu)/(s/\sqrt{n})$ has
the t-distribution with 9 degrees of freedom, denoted by $t_9$.
In Table 2, we see that, for $t_9$, the points between which 95% of the distribution
lies are $-2.262$ and $+2.262$.
[Figure: density of the $t_9$-distribution, with the points $-2.262$ and $2.262$ marked.]
Because 2½% (or 0.025) of the $t_9$ distribution lies to the right of 2.262 we write
$$t_9^{(0.025)} = 2.262$$
and speak of the 2½% point of the $t_9$ distribution.
We can write
$$\Pr(-2.262 < t_9 < 2.262) = 0.95.$$
But
$$\frac{\bar{X} - \mu}{s/\sqrt{10}} \sim t_9.$$
Therefore
$$\Pr\left(-2.262 < \frac{\bar{X} - \mu}{s/\sqrt{10}} < 2.262\right) = 0.95.$$
Manipulation of the inequalities, as done in the same context in Chapter 8, yields
$$\Pr\left(\bar{X} - 2.262\,\frac{s}{\sqrt{10}} < \mu < \bar{X} + 2.262\,\frac{s}{\sqrt{10}}\right) = 0.95.$$
Substituting $\bar{X} = 103$ and $s = 5$ yields
$$\bar{X} - 2.262\,\frac{s}{\sqrt{10}} = 103 - 2.262\,\frac{5}{\sqrt{10}} = 99.4$$
for the lower limit of the confidence interval and
$$\bar{X} + 2.262\,\frac{s}{\sqrt{10}} = 103 + 2.262\,\frac{5}{\sqrt{10}} = 106.6$$
for the upper limit. Thus a 95% confidence interval for the mean flying time is given by
(99.4, 106.6).
CONFIDENCE INTERVAL FOR $\mu$, WHEN $\sigma^2$ IS
ESTIMATED BY $s^2$
If we have a random sample of size $n$ from a population with a normal
distribution, and the sample mean is $\bar{x}$ and the sample variance is $s^2$,
then confidence intervals for $\mu$ are given by
$$\left(\bar{X} - t_{n-1}\,\frac{s}{\sqrt{n}},\ \ \bar{X} + t_{n-1}\,\frac{s}{\sqrt{n}}\right)$$
where the $t$ values are obtained from the t-tables. For 95% confidence intervals,
use the column in the tables headed 0.025. For 99%
confidence intervals, use the column headed 0.005.
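The boxed formula translates directly into code. In this sketch the function name `t_confidence_interval` is our own, and the t-value is supplied by the user after being looked up in the t-tables; nothing here replaces the table look-up.

```python
from math import sqrt

def t_confidence_interval(xbar, s, n, t_value):
    """Confidence interval xbar -/+ t_value * s / sqrt(n).

    t_value is the percentage point of t with n - 1 degrees of freedom,
    read from the t-tables (for example the 0.025 column for 95% confidence).
    """
    half_width = t_value * s / sqrt(n)
    return xbar - half_width, xbar + half_width

# Example 1A: n = 10 flights, mean 103 minutes, s = 5 minutes, t_9^(0.025) = 2.262
lower, upper = t_confidence_interval(103, 5, 10, 2.262)
print(f"95% confidence interval: ({lower:.1f}, {upper:.1f})")
```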
Example 2B: A random sample of 25 loaves of bread had a mean mass of 796 g and
a standard deviation of 7 g. Calculate the 99% confidence interval for the mean.
We have $\bar{x} = 796$, and $s = 7$ has 24 degrees of freedom. Because we want 99%
confidence intervals, we must look up $t_{24}^{(0.005)}$ in the t-tables:
$$t_{24}^{(0.005)} = 2.797.$$


[Figure: density of the $t_{24}$-distribution, with the points $-2.797$ and $2.797$ marked.]

Thus the 99% confidence interval for the mean is
$$\left(796 - 2.797 \times 7/\sqrt{25}\ ,\ \ 796 + 2.797 \times 7/\sqrt{25}\right)$$
which reduces to (792.08, 799.92).
Example 3C: An estimate of the mean fuel consumption (litres/100 km) of a new
car is required. A sample of 18 drivers each drive the car under a variety of conditions
for 100 km, and the fuel consumption is measured for each of the drivers. The sample
mean was calculated to be $\bar{x} = 6.73$ litres/100 km. The sample standard deviation was
$s = 0.35$ litres/100 km. Find the 99% confidence interval for the population mean.

Testing whether the mean is a specified value (population variance estimated from the sample) . . .
Example 4A: A poultry farmer is investigating ways of improving the profitability of
his operation. Using a standard diet, turkeys grow to a mean mass of 4.5 kg at age 4
months. A sample of 20 turkeys, which were given a special enriched diet, had an average
mass of 4.8 kg after 4 months. The sample standard deviation was 0.5 kg. Using the
5% significance level test whether the new diet is effectively increasing the mass of the
turkeys.
We follow the standard hypothesis testing procedure.
1. $H_0: \mu = 4.5$. As usual, for our null hypothesis we assume no change has taken
place.
2. $H_1: \mu > 4.5$. A one-sided alternative is appropriate, because an enriched diet
should not cause a loss of mass.
3. Significance level: 5%
4. The rejection region is found by reasoning as follows. Because the population
variance is unknown, we need to use the t-distribution. The degrees of freedom for
t will be 19, because $s$ is based on a sample of 20 observations. We will thus reject
$H_0$ if the observed t-value exceeds $t_{19}^{(0.05)}$. From the t-tables, $t_{19}^{(0.05)} = 1.729$.
[Figure: density of the $t_{19}$-distribution, with the point 1.729 marked.]


5. The formula for the test statistic is
$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}}.$$
We substitute the appropriate values, and compute
$$t = \frac{4.8 - 4.5}{0.5/\sqrt{20}} = 2.68.$$

6. Because 2.68 > 1.729 we reject H0 , and conclude that, at the 5% significance
level, we have established that the new enriched diet is effective.
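Steps 5 and 6 of Example 4A can be mirrored in code. This is a sketch with a function name of our own choosing (`one_sample_t`); the critical value 1.729 is the table value quoted in step 4 and is simply typed in.

```python
from math import sqrt

def one_sample_t(xbar, mu0, s, n):
    """Observed t statistic (xbar - mu0) / (s / sqrt(n)), with n - 1 degrees of freedom."""
    return (xbar - mu0) / (s / sqrt(n))

# Example 4A: 20 turkeys, sample mean 4.8 kg, s = 0.5 kg, H0: mu = 4.5
t_observed = one_sample_t(4.8, 4.5, 0.5, 20)
critical = 1.729   # t_19^(0.05), from the t-tables

print(f"t = {t_observed:.2f}")
print("reject H0" if t_observed > critical else "cannot reject H0")
```

For a lower-tailed alternative the comparison is reversed: the null hypothesis is rejected when the observed value is less than the negative of the table value.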
Example 5B: The average life of 6 car batteries is 30 months with a standard deviation
of 4 months. The manufacturer claims an average life of 3 years for his batteries. We
suspect that he is exaggerating. Test his claim at the 5% significance level.
1. $H_0: \mu = 36$ months.
2. $H_1: \mu < 36$ months.
3. Significance level : 5%.
4. Degrees of freedom is $6 - 1 = 5$. So we use the $t_5$-distribution. From the form
of the alternative hypothesis and the significance level, we will reject $H_0$ if the
observed t-value is less than $-t_5^{(0.05)}$. From our tables $t_5^{(0.05)} = 2.015$.
[Figure: density of the $t_5$-distribution, with the point $-2.015$ marked.]

5. Substituting into the test statistic $t_{n-1} = \dfrac{\bar{X} - \mu}{s/\sqrt{n}}$ yields
$$t_5 = \frac{30 - 36}{4/\sqrt{6}} = -3.67.$$

6. This lies in the rejection region: we conclude that the true mean is significantly
less than 36 months.
Example 6C: A purchaser of bricks believes that their crushing strength is deteriorating. The mean crushing strength had previously been 400 kg, but a recent sample of 81
bricks yielded a mean crushing strength of 390 kg, with a standard deviation of 20 kg.
Test the purchaser's belief at the 1% significance level.

Example 7C: The specifications for a certain type of ball bearings stipulate a mean
diameter of 4.38 mm. The diameters of a sample of 12 ball bearings are measured and
the following summarized data computed:
$$\sum x_i = 53.18 \qquad \sum x_i^2 = 235.7403.$$

Is the sample consistent with the specifications?

Comparing two sample means (variance estimated from the sample) . . .
When we have small samples from two populations and want to compare their means,
the procedure is a little more complex than you might have expected. In Chapter 8, the
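test statistic for comparing two means was
$$Z = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}.$$
You would anticipate that the test statistic now might be
$$\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}.$$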
But, unfortunately, mathematical statisticians can show that this quantity does not have
the t-distribution. In order to find a test statistic which does have the t-distribution,
an additional assumption needs to be made. This assumption is that the population
variances in the two populations from which the samples were drawn are equal. The
examples below highlight this difference between comparing the means of two populations
when the variances are known and when the variances are estimated from the samples.
Example 8A: Two varieties of wheat are being tested in a developing country. Twelve
test plots are given identical preparatory treatment. Six plots are sown with Variety 1
and the other six plots with Variety 2 in an experiment in which the crop scientists
hope to determine whether there is a significant difference between yields, using a 5%
significance level.
The results were:

Variety 1 :   1.5   1.9   1.2   1.4   2.3   1.3   tons per plot
Variety 2 :   1.6   1.8   2.0   1.8   2.3         tons per plot

One of the plots planted with Variety 2 was accidentally given an extra dose of fertilizer,
so the result was discarded. The means and standard deviations are calculated. They
are
$$\bar{x}_1 = 1.60 \qquad s_1 = 0.42 \qquad n_1 = 6$$
$$\bar{x}_2 = 1.90 \qquad s_2 = 0.27 \qquad n_2 = 5.$$
We follow the standard hypothesis testing procedure.

1. $H_0: \mu_1 - \mu_2 = 0$.
2. $H_1: \mu_1 - \mu_2 \neq 0$ (a two-tailed test).
3. Significance level : 5%.
4. & 5. Before we can find out the rejection region we need to know the degrees of freedom.
The procedure is rather different from that in the test when the population
variances were known. Instead of working with the individual variances $\sigma_1^2$ and
$\sigma_2^2$ we assume that both populations have the same variance $\sigma^2$ and we pool
the two individual sample variances $s_1^2$ and $s_2^2$ to form a joint estimate of the
variance, $s^2$. This assumption of equal variances is required by the mathematical
theory underlying the t-distribution, into which we will not delve. Of course, we
do need to check this assumption of equal variances, which we shall discuss later
in the chapter (the F-distribution).
The general formula for the pooled variance is
$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
where $s_1^2$ is based on a sample of size $n_1$ and $s_2^2$ is based on a sample of size $n_2$.
In the above example, $n_1 = 6$ and $n_2 = 5$. Therefore
$$s^2 = \frac{5 \times (0.42)^2 + 4 \times (0.27)^2}{6 + 5 - 2} = 0.13.$$
Therefore, the pooled standard deviation
$$s = \sqrt{0.13} = 0.361.$$
How many degrees of freedom does $s$ have? $s_1^2$ has 5 and $s_2^2$ has 4. Therefore $s^2$
has $5 + 4 = 9$ degrees of freedom. In general, $s^2$ has $(n_1 - 1) + (n_2 - 1) = n_1 + n_2 - 2$
degrees of freedom. We lose two degrees of freedom because we estimated the two
parameters $\mu_1$ (by $\bar{X}_1$) and $\mu_2$ (by $\bar{X}_2$) before estimating $s^2$.
Thus we use the t-distribution with 9 degrees of freedom, and because we have
a two-sided alternative and a 5% significance level we need the value of $t_9^{(0.025)}$,
which from the tables is 2.262. So we will reject $H_0$ if the observed $t_9$-value is
less than $-2.262$ or greater than 2.262.
[Figure: density of the $t_9$-distribution, with the points $-2.262$ and $2.262$ marked.]

The formula for calculating the test statistic in this hypothesis-testing situation,
the so-called two-sample t-test, is
$$t_{n_1+n_2-2} = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$
where $\bar{X}_1$ and $\bar{X}_2$ are the sample means, the value of $\mu_1 - \mu_2$ is determined by
the null hypothesis, $s$ is the pooled sample standard deviation, and $n_1$ and $n_2$ are
the sample sizes.
Substituting our data into this formula:
$$t_{6+5-2} = \frac{1.60 - 1.90 - 0}{0.361\sqrt{\dfrac{1}{6} + \dfrac{1}{5}}}.$$
Thus
$$t_9 = -1.372.$$
6. Because $-1.372$ does not lie in the rejection region we conclude that the difference
between the varieties is not significant.
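The pooled-variance calculation of Example 8A can be sketched as follows. The function name `pooled_t` and its argument names are our own; the inputs are the rounded summary statistics quoted in the example, so small rounding differences from a calculation on the raw yields are to be expected.

```python
from math import sqrt

def pooled_t(x1, s1, n1, x2, s2, n2, delta0=0.0):
    """Two-sample t statistic assuming equal population variances.

    Returns (t, degrees of freedom, pooled variance); delta0 is the value
    of mu1 - mu2 specified by the null hypothesis.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    std_error = sqrt(pooled_var) * sqrt(1 / n1 + 1 / n2)
    return (x1 - x2 - delta0) / std_error, df, pooled_var

# Example 8A: Variety 1 versus Variety 2
t_observed, df, pooled_var = pooled_t(1.60, 0.42, 6, 1.90, 0.27, 5)
critical = 2.262   # t_9^(0.025), from the t-tables

print(f"pooled variance = {pooled_var:.4f}, df = {df}, t = {t_observed:.3f}")
print("reject H0" if abs(t_observed) > critical else "cannot reject H0")
```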

Example 9B: A marketing specialist considers two promotions in order to increase
sales of do-it-yourself hardware in a supermarket. During a trial period the promotions
are run on alternate days. In the first promotion, a free set of drill bits is given if
the customer purchases an electric drill. In the second, a substantial discount is given
on the drill. The marketing specialist is particularly interested in the average amount
spent on do-it-yourself hardware by customers who took advantage of the promotion. On
the basis of randomly selected samples, the following data were obtained of the amount
spent on items of do-it-yourself hardware.
Promotion     $n$    $\bar{x}$    $s$
Free gift     8      R490         R104
Discount      9      R420         R92

Test, at the 5% level of significance, whether there is any difference in the effectiveness of the two promotions.
1. $H_0: \mu_1 = \mu_2$.
2. $H_1: \mu_1 \neq \mu_2$.
3. Significance level : 5%.
4. Degrees of freedom : $n_1 + n_2 - 2 = 15$. From t-tables, if the observed $t_{15}$-value
is less than $-2.131$ or greater than 2.131, we reject $H_0$.
[Figure: density of the $t_{15}$-distribution, with the points $-2.131$ and $2.131$ marked.]

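5. We need first to calculate $s^2$, the pooled variance:
$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{7 \times 104^2 + 8 \times 92^2}{15} = 9561.60$$
and so the pooled standard deviation is
$$s = 97.78.$$
We now calculate the observed test statistic
$$t_{n_1+n_2-2} = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$
$$t_{15} = \frac{490 - 420 - 0}{97.78\sqrt{\dfrac{1}{8} + \dfrac{1}{9}}} = 1.47.$$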

6. Because 1.47 < 2.131, we cannot reject H0 , and we conclude that we cannot detect
a difference between the effectiveness of the two promotions.
Example 10C: Two methods of assembling a new television component are under consideration by management. Because of more expensive machinery requirements, method
B will only be adopted if it is significantly shorter than method A by more than a minute.
In order to determine which method to adopt a skilled worker becomes proficient in both
methods, and is then timed with a stopwatch while assembling the component by both
methods. The following data were obtained:
Method A    $\bar{x}_1 = 7.72$ minutes    $s_1 = 0.67$ minutes    $n_1 = 17$
Method B    $\bar{x}_2 = 6.21$ minutes    $s_2 = 0.51$ minutes    $n_2 = 25$

What decision should be taken?

Testing variances for equality . . .


The t-tests for comparing means of two populations, as introduced in the previous
section, required us to assume that the two populations had equal variances.
This assumption can be tested easily. Using the statistical jargon we have developed,
what we need is a test of $H_0: \sigma_1^2 = \sigma_2^2$ against $H_1: \sigma_1^2 \neq \sigma_2^2$. (Situations do arise when
we need one-sided alternatives and the test we will develop can handle both one-sided
and two-sided alternatives.)
Because $s_1^2$ and $s_2^2$ are our estimates of $\sigma_1^2$ and $\sigma_2^2$, our intuitive feeling is that
we would like to reject our null hypothesis when $s_1^2$ and $s_2^2$ are too far apart. This
might suggest the quantity $s_1^2 - s_2^2$ as our test statistic. However, we cannot easily
find the sampling distribution of $s_1^2 - s_2^2$. It turns out that the test statistic which
is mathematically convenient to use is the ratio $F = s_1^2/s_2^2$. The sampling distribution
of the statistic $F$ is known as the F-distribution, or Fisher distribution. Sir Ronald
Fisher was a British statistician who was one of the founding fathers of the discipline of
Statistics.
Because variances are by definition positive, the statistic $F$ is always positive. When
$H_0: \sigma_1^2 = \sigma_2^2$ is true, we expect the sample variances to be nearly equal, so that $F$ will
be close to one. When $H_0$ is false, and the population variances are unequal, then $F$
will tend to be either large or small, where in this context "small" means close to zero.
Thus, we accept $H_0$ for F-values close to one, and we reject $H_0$ when the F-value is
too large or too small. The rejection region is obtained from F-tables, but the shape of
the probability density function for a typical F-distribution is shown here.
[Figure: probability density function $f(x)$ of a typical F-distribution, plotted against $x$.]

The most striking feature of the probability density function of the F -distribution
is that it is not symmetric. It is positively skewed, having a long tail to the right.
The mode (the x-value associated with the maximum value of the probability density
function) is less than one, but the mean is greater than one, the long tail pulling the
mean to the right. The lack of symmetry makes it seem that we will need separate tables
for the upper and lower percentage points. However, by means of a simple trick (to be
explained later), we can get away without having tables for the lower percentage points
of the F -distribution.
Because the F -statistic is the ratio of two sample variances, it should come as no
surprise to you that there are two degrees of freedom numbers attached to F : the
degrees of freedom for the variance in the numerator, and the degrees of freedom for the
denominator variance. It would therefore appear that we need an encyclopaedia of tables
for the F -distribution! To avoid this, it is usual to only present the four most important
values for each F -distribution; the 5%, 2.5%, 1% and 0.5% points. The conventional
way of presenting F -tables is to have one table for each of these percentage points; in
this book Table 4.1 gives the 5% points, Table 4.2 the 2.5% points, Table 4.3 the 1%
points and Table 4.4 the 0.5% points. Within each table, the rows and columns are used
for the degrees of freedom in the denominator and numerator, respectively.
Example 11A: In example 8A, the sample standard deviations were $s_1 = 0.42$ and
$s_2 = 0.27$. Let us test at the 5% level to see if the assumption of equal variances was
reasonable.
1. $H_0 : \sigma_1^2 = \sigma_2^2$.
2. $H_1 : \sigma_1^2 \neq \sigma_2^2$.

INTROSTAT

3. Significance level : 5%.


4. The rejection region. If $s_1^2$, based on a sample of size $n_1$ (and therefore having
$n_1 - 1$ degrees of freedom), is the numerator, and $s_2^2$ (sample size $n_2$, degrees
of freedom $n_2 - 1$) is the denominator, then we say that $F = s_1^2/s_2^2$ has the
$F$-distribution with $n_1 - 1$ and $n_2 - 1$ degrees of freedom. We write
$$\frac{s_1^2}{s_2^2} \sim F_{n_1-1,\,n_2-1}.$$
In example 8A, the sample sizes for $s_1^2$ and $s_2^2$ were 6 and 5 respectively. Thus we
use the $F$-distribution with $6 - 1$ and $5 - 1$ degrees of freedom, i.e. $F_{5,4}$.

Because we have a two-sided test at the 5% level, we need the upper and lower 2½%
points of $F_{5,4}$. This means that we must use Table 4.2, and go to the intersection
of column 5 and row 4, where we find that the upper 2½% point of $F_{5,4}$ is 9.36.
We write $F_{5,4}^{(0.025)} = 9.36$. (Notice that, in $F$-tables, the usual matrix convention
of putting rows first, then columns, is not adopted.)
[Figure: probability density function $f(x)$ of the $F_{5,4}$-distribution, with the upper 2½% point $F_{5,4}^{(0.025)} = 9.36$ marked on the $x$-axis.]
We will reject H0 if our observed F value exceeds 9.36. The tables do not enable
us to find the lower rejection region, but for the reasons explained below, we do
not in fact need it.
5. The observed F -value is
$F = s_1^2/s_2^2 = 0.42^2/0.27^2 = 2.42$.
6. Because 2.42 < 9.36, we do not reject H0 . We conclude that the assumption of
equal variances is tenable, and that therefore it was justified to pool the variances
for the two-sample t-test in example 8A.
The trick that enables us never to need lower percentage points of the $F$-distribution is to adopt the convention of always putting the numerically larger variance into
the numerator, so that the calculated $F$-statistic is always larger than one, and
adjusting the degrees of freedom accordingly. Let $s_1^2$ and $s_2^2$ have $n_1 - 1$ and $n_2 - 1$ degrees of
freedom respectively. Then, if $s_1^2 > s_2^2$, consider the ratio $F = s_1^2/s_2^2$, which has the
$F_{n_1-1,\,n_2-1}$-distribution. If $s_2^2 > s_1^2$, use $F = s_2^2/s_1^2$ with the $F_{n_2-1,\,n_1-1}$-distribution.
This trick depends on the mathematical result that $1/F_{n_1-1,\,n_2-1} = F_{n_2-1,\,n_1-1}$.
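The convention of putting the larger variance in the numerator can be sketched in a few lines of Python. This is only an illustration: the function name `variance_ratio` is our own, and the critical value still has to be read from the F-tables. The inputs are the figures quoted in Examples 11A and 12B.

```python
def variance_ratio(s1_sq, n1, s2_sq, n2):
    """Return (F, numerator df, denominator df), larger variance on top.

    s1_sq and s2_sq are sample variances; n1 and n2 are the sample sizes.
    The degrees of freedom are swapped together with the variances.
    """
    if s1_sq >= s2_sq:
        return s1_sq / s2_sq, n1 - 1, n2 - 1
    return s2_sq / s1_sq, n2 - 1, n1 - 1


# Example 11A: s1 = 0.42 (n1 = 6) and s2 = 0.27 (n2 = 5)
f_11a, num_df_11a, den_df_11a = variance_ratio(0.42**2, 6, 0.27**2, 5)
print(round(f_11a, 2), num_df_11a, den_df_11a)  # 2.42 5 4

# Example 12B: sample variances 2.1 (n1 = 15) and 5.9 (n2 = 25)
f_12b, num_df_12b, den_df_12b = variance_ratio(2.1, 15, 5.9, 25)
print(round(f_12b, 2), num_df_12b, den_df_12b)  # 2.81 24 14
```

In both cases the returned degrees of freedom identify which entry of the F-tables to consult.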

Example 12B: We have two machines that fill milk bottles. We accept that both
machines are putting, on average, one litre of milk into each bottle. We suspect, however,
that the second machine is considerably less consistent than the first, and that the volume
of milk that it delivers is more variable. We take a random sample of 15 bottles from
the first machine and 25 from the second and compute sample variances of 2.1 ml² and
5.9 ml² respectively. Are our suspicions correct? Test at the 5% significance level.
1. $H_0 : \sigma_1^2 = \sigma_2^2$.
2. $H_1 : \sigma_1^2 < \sigma_2^2$.
3. Significance level : 5% (and note that we now have a one-sided test).
4. & 5. Because $s_2^2 > s_1^2$, we compute
$F = s_2^2/s_1^2 = 5.9/2.1 = 2.81$.
This has the $F_{n_2-1,\,n_1-1} = F_{24,14}$-distribution. From our tables, $F_{24,14}^{(0.05)} = 2.35$.
[Figure: probability density function $f(x)$ of the $F_{24,14}$-distribution, with the upper 5% point $F_{24,14}^{(0.05)} = 2.35$ marked on the $x$-axis.]

6. The observed F -value of 2.81 > 2.35 and therefore lies in the rejection region.
We reject H0 and conclude that the second bottle filler has a significantly larger
variance (variability) than the first.
Example 13C: Packing proteas for export is time consuming. A florist timed how long
it took each of 12 labourers to pack 20 boxes of proteas under normal conditions, and
then timed 10 labourers while they each packed 20 boxes of proteas with background
music. The average time to pack 20 boxes of proteas under normal conditions was 170
minutes with a standard deviation of 20 minutes, while the average time with background
music was 157 minutes, with a standard deviation of 25 minutes. At the 5% significance
level, test whether background music is effective in reducing packing time. Test also the
assumption of equal variances.

An approximate t-test when the variances cannot be assumed to be equal . . .


The F -test described above ought always to be applied before starting the t-test to
compare the means. It is conventional to use a 5% significance level for this test. If we
do not reject the null hypothesis of equal variances, we feel justified in computing the
pooled sample variance which the two-sample t-test requires. What happens, however,
if the F -test forces us to reject the assumption that the variances are equal? We resort
to an approximate t-test, which does not pool the variances, but which makes an
adjustment to the degrees of freedom.
The test statistic is
$$t^* = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}.$$
This statistic has approximately the t-distribution. By experimentation, researchers
have determined that the degrees of freedom for $t^*$ can be approximated by
$$n^* = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 + 1} + \dfrac{(s_2^2/n_2)^2}{n_2 + 1}} - 2.$$
This messy formula inevitably gives a value for $n^*$ which is not an integer. It is
feasible to interpolate in the t-tables, but we will simply take $n^*$ to be the nearest
integer value to that given by the formula above.
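The two formulas above are easy to mistype on a calculator, so a small Python sketch may help. The function name `approximate_t` and its argument names are our own; the numbers fed in are the summary statistics quoted in Example 14B, and the hypothesised difference defaults to zero.

```python
import math


def approximate_t(mean1, s1, n1, mean2, s2, n2, hypothesised_diff=0.0):
    """Approximate t statistic and its degrees of freedom (unpooled variances)."""
    v1 = s1**2 / n1  # s1^2 / n1
    v2 = s2**2 / n2  # s2^2 / n2
    t_star = ((mean1 - mean2) - hypothesised_diff) / math.sqrt(v1 + v2)
    # Degrees of freedom formula from the text, with n + 1 in the
    # denominators and 2 subtracted at the end.
    n_star = (v1 + v2) ** 2 / (v1**2 / (n1 + 1) + v2**2 / (n2 + 1)) - 2
    return t_star, n_star


# Summary statistics of Example 14B
t_star, n_star = approximate_t(47.80, 6.86, 10, 45.00, 3.20, 10)
print(round(t_star, 2), round(n_star, 2))  # 1.17 13.57
```

Rounding the second value to the nearest integer gives the degrees of freedom to look up in the t-tables.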
Example 14B: Personnel consultants are interested in establishing whether there is
any difference in the mean age of the senior managers of two large corporations. The
following data give the ages, to the nearest year, of random samples of 10 senior
managers, one sample from each corporation:
Corporation 1    52  50  53  42  57  43  52  44  51  34
Corporation 2    44  45  39  49  43  49  45  47  42  46

Conduct the necessary test.


We compute
$\bar{x}_1 = 47.80$    $s_1 = 6.86$
$\bar{x}_2 = 45.00$    $s_2 = 3.20$

We first test for equality of variances:


1. $H_0 : \sigma_1^2 = \sigma_2^2$.
2. $H_1 : \sigma_1^2 \neq \sigma_2^2$.
3. $F = s_1^2/s_2^2 = 6.86^2/3.20^2 = 4.60$.
4. Using the $F_{9,9}$-distribution, and remembering that the test is two-sided, we see
that this $F$-value is significant at the 5% level ($F_{9,9}^{(0.025)} = 4.03$), although not
significant at the 1% level ($F_{9,9}^{(0.005)} = 6.54$).

5. The conclusion is that the population variances are not equal ($F_{9,9} = 4.60$, $P < 0.05$).
Thus pooling the sample variances is not justified, and we have to use the approximate t-test:
1. $H_0 : \mu_1 = \mu_2$.
2. $H_1 : \mu_1 \neq \mu_2$.
3. Substituting into the formula for the approximate test statistic yields
$$t^* = \frac{47.8 - 45.0}{\sqrt{\dfrac{6.86^2}{10} + \dfrac{3.20^2}{10}}} = 1.17.$$
Substituting into the degrees of freedom formula yields
$$n^* = \frac{\left(\dfrac{6.86^2}{10} + \dfrac{3.20^2}{10}\right)^2}{\dfrac{(6.86^2/10)^2}{11} + \dfrac{(3.20^2/10)^2}{11}} - 2 = 13.57 \approx 14.$$
4. Because $t_{14}^{(0.100)} = 1.345$ exceeds the observed value of 1.17, we accept the null hypothesis even at the 20% significance
level.
5. We conclude that there is no difference in the mean age of senior employees between
the two corporations ($t_{14} = 1.17$, $P > 0.20$).
Example 15C: A particular business school requires a satisfactory GMAT examination
score as its entrance requirement. The admissions officer believes that, on average, engineers have higher GMAT scores than applicants with an arts background. The following
GMAT scores were extracted from a random sample of applicants with engineering and
arts backgrounds.
Engineering    600  650  640  720  700  620  740  650
Arts           550  450  700  420  750  500  520

Investigate the admissions officer's belief.


Example 16C: The dividend yield of a share is the dividend paid by the share during
a year divided by the price of the share. A financial analyst wants to compare the
dividend yield of gold shares with that of industrial shares listed on the Johannesburg
Stock Exchange. She takes a sample of gold shares and a sample of industrial shares
and computes the dividend yields. What conclusions did she come to?

Gold shares          3.6   4.0   3.9   5.0   2.7   3.7   4.6   3.5   4.5   3.9
                     4.6   4.0   3.2  15.5   5.7   3.6   4.1   6.2   3.5   3.7

Industrial shares    3.2   2.5   8.4   8.7   2.7   3.1   5.3   4.3   5.6   4.0   5.1   6.8
                     2.5   8.4   5.2   4.7   3.1   5.5   6.5   3.6   4.5   1.6   3.1

The data are summarized as:

             Gold     Industrial
n            20       23
$\bar{x}$    4.675    4.710
s            2.677    2.004

An important footnote to t-tests and F -tests . . .


Whenever the t-test or F -test is applied, it is assumed that the population from
which the sample was taken has a normal distribution. Can we test this assumption?
And how should we proceed if the test shows that the distribution of the data is not a
normal distribution?
The first of these questions can be answered using the methods of chapter 10. The
answer to the second question is that methods for doing tests for non-normal data do
exist, but are beyond the scope of this book. For the record, they are called non-parametric tests; the tests using the normal distribution and the t-distribution are
known as parametric tests.
Another question. What should we do when there are three (or more) populations
which we want to compare simultaneously? In example 15C, we might have wanted to
do a comparison between students with engineering, arts and science backgrounds, and
to test the null hypothesis that there is no difference between the mean GMAT scores
for these three groups of students. The method to use then is called the analysis of
variance, usually abbreviated to ANOVA.

A summary of hypothesis testing on means . . .


The decision tree displayed in Figure 9.1 and the comments below aim to give
clear guidelines to help you decide which test to apply, and present the formulae for all
the test statistics of Chapters 8 and 9.
If the sample is large, greater than 30 say, then the central limit theorem applies
and $\bar{X}$ has approximately a normal distribution. If the sample is smaller than 30, then $\bar{X}$ will have
a normal distribution if the population from which the sample is drawn has a normal
distribution. If the sample is small, and the underlying population does not have a
normal distribution, these techniques are not valid, and non-parametric statistical
tests must be applied. These tests fall beyond the scope of our course.
If the population variance is estimated from a large sample, then it is common practice
to assume that $\sigma^2 = s^2$ and to use the normal distribution (i.e. the methods of Chapter
8, where the population variance was assumed known). In any case, the t-distribution
is nearly identical to the normal distribution for large degrees of freedom, and the
percentage points are almost equal. We will adopt the convention that for degrees of
freedom 30 or fewer the t-distribution is used, between 30 and 100 use of either the t- or
the normal distribution is acceptable, and over 100 the normal distribution is mostly
used. This note only applies if the variance is estimated.


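START: Is the population variance known or estimated?

Left branch: population variance known. Is there one sample or two?

    Left branch: one sample. Test statistic:
    $$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$

    Right branch: two samples. Test statistic:
    $$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0, 1)$$

Right branch: population variance unknown. Is there one sample or two?
All t-tests assume that the samples come from normal distributions.

    Left branch: one sample. Test statistic:
    $$t = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}$$

    Right branch: two samples. First test the assumption that variances are equal.
    Test statistic:
    $$F = \frac{s_1^2}{s_2^2} \sim F_{n_1-1,\,n_2-1} \quad\text{or}\quad F = \frac{s_2^2}{s_1^2} \sim F_{n_2-1,\,n_1-1}$$
    Use 5% significance level, $F > 1$. Is assumption accepted or rejected?

        Left branch: assumption accepted. Test statistic:
        $$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \sim t_{n_1+n_2-2}$$
        where
        $$s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

        Right branch: assumption rejected. Test statistic:
        $$t^* = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \sim t_{n^*}$$
        where
        $$n^* = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 + 1} + \dfrac{(s_2^2/n_2)^2}{n_2 + 1}} - 2$$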


Figure 9.1: Decision Tree for Hypothesis Testing on Means
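
The two-sample, variance-unknown branch of the decision tree can be sketched as a single Python function. This is a rough illustration under stated assumptions: the name `two_sample_t` is our own, the critical F-value must be looked up in the tables and passed in by the caller, and the hypothesised difference between the population means is taken to be zero. The inputs are the summary figures of Examples 13C and 14B.

```python
import math


def two_sample_t(mean1, s1, n1, mean2, s2, n2, f_critical):
    """Choose between the pooled and the approximate two-sample t-test.

    f_critical is the upper 2.5% point of the relevant F-distribution,
    read from tables by the caller (this sketch does not compute it).
    Returns (test used, F ratio, t value, degrees of freedom).
    """
    var1, var2 = s1**2, s2**2
    f_ratio = max(var1, var2) / min(var1, var2)  # larger variance on top
    diff = mean1 - mean2
    if f_ratio < f_critical:
        # Equal variances not rejected: pool the sample variances.
        pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
        t_value = diff / math.sqrt(pooled * (1 / n1 + 1 / n2))
        return "pooled", f_ratio, t_value, n1 + n2 - 2
    # Equal variances rejected: approximate t-test with adjusted df.
    v1, v2 = var1 / n1, var2 / n2
    t_value = diff / math.sqrt(v1 + v2)
    n_star = (v1 + v2) ** 2 / (v1**2 / (n1 + 1) + v2**2 / (n2 + 1)) - 2
    return "approximate", f_ratio, t_value, round(n_star)


# Example 13C (critical value 3.59) and Example 14B (critical value 4.03)
result_13c = two_sample_t(170, 20, 12, 157, 25, 10, f_critical=3.59)
result_14b = two_sample_t(47.80, 6.86, 10, 45.00, 3.20, 10, f_critical=4.03)
print(result_13c)
print(result_14b)
```

The function only produces the test statistic and its degrees of freedom; comparing the result with the t-tables, and deciding between a one-sided and a two-sided alternative, is still left to the reader.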

Example 17B: Use the decision tree to decide which test to apply. An experiment
compared the abrasive wear of two different laminated materials. Twelve pieces of
material 1 and 10 pieces of material 2 were tested and in each case the depth of wear
was measured. The results were as follows:
Material 1    $\bar{x}_1 = 8.5$ mm    $s_1 = 4$ mm    $n_1 = 12$
Material 2    $\bar{x}_2 = 8.1$ mm    $s_2 = 5$ mm    $n_2 = 10$
Test the hypothesis that the two types of material exhibit the same mean abrasive
wear at the 1% significance level.

Begin at START.
The population variance is estimated. Go right. There are samples from two
populations. Go right again. The assumption that the variances are equal is
accepted. Check this for yourself. Go left.
Pool the variances and use the test statistic with $n_1 + n_2 - 2$ degrees of freedom.
Do this example as an exercise.
Example 18C: In an assembly process, it is known from past records that it takes
an average of 3.7 hours with a standard deviation of 0.3 hours to assemble a certain
computer component. A new procedure is adopted. The first 100 items assembled using
the new procedure took, on average, 3.5 hours each. Assuming that the new procedure
did not alter the standard deviation, test whether the new procedure is effective in
reducing assembly time.

Solutions to examples . . .

3C $(6.73 - 2.898 \times 0.35/\sqrt{18},\ 6.73 + 0.24)$, which is (6.49, 6.97)
6C $t_{80} = -4.50 < -2.374$, reject $H_0$.
7C $t_{11} = 2.34$, $P < 0.05$, significant difference.
10C $t_{40} = 2.80$, $P < 0.005$, highly significant. Adopt new method.
13C $F_{9,11} = 1.56 < 3.59$, cannot reject $H_0$ and pool variances. $t_{20} = 1.36 < 1.725$,
cannot reject $H_0$.
15C $F = 6.28 > F_{6,7}^{(0.025)} = 5.12$. Therefore cannot pool variances. $t^* = 2.18$, $n^* = 8.2$,
so use degrees of freedom 8. Using one-sided test, there is a significant difference
($P < 0.05$).
16C $F_{19,22} = 1.784$, so variances can be pooled. Two-sample t-test yields $t_{41} =
0.0244$, no significant difference. Dividend yield on gold shares not significantly
different from that on industrial shares ($t_{41} = 0.0244$, $P > 0.20$).
18C Path through flow chart: start, standard deviation known, go left, one sample, go
left again, and use test statistic $z = (\bar{X} - \mu)/(\sigma/\sqrt{n}) \sim N(0, 1)$.

Exercises on confidence intervals . . .

9.1 Find the 95% confidence interval for the mean salary of teachers if a random sample
of 16 teachers had a mean salary of R12 125 with a standard deviation of R1005.
9.2 We want to estimate the mean number of items of advertising matter received
by medical practitioners through the post per week. For a random sample of 25
doctors, the sample mean is 28.1 and the sample standard deviation is 8. Find
the 95% confidence limits for the mean.
9.3 Over the past 12 months the average demand for sulphuric acid from the stores
of a large chemical factory has been 206 ℓ; the sample standard deviation has
been 50 ℓ. Find a 99% confidence interval for the true mean monthly demand for
sulphuric acid.
9.4 A sample of 10 measurements of the diameter of a brass sphere gave mean $\bar{x} = 4.38$
cm and standard deviation $s = 0.06$ cm. Calculate (a) 95% and (b) 99% confidence
intervals for the actual diameter of the sphere. Why is the second confidence
interval longer than the first?

Exercises on hypothesis testing (one sample) . . .

9.5 In a textile manufacturing process, the average time taken is 6.4 hours. An innovation which, it is hoped, will streamline the process and reduce the time, is
introduced. A series of 8 trials used the modified process and produced the following results:
6.1  5.9  6.3  6.5  6.2  6.0  6.4  6.2
Using a 5% significance level, decide whether the innovation has succeeded in reducing average process time.
9.6 In 1989, the Johannesburg Stock Exchange (JSE) boomed, and the Allshare Index
showed an annual return of 55.5%. A sample of industrial shares yielded the
following returns:
54.5  47.8  42.3  59.8  33.1  49.7  16.0  50.3
52.8  56.1  23.2  52.5  65.7  47.5  32.5  46.7
Test, at the 5% significance level, whether the performance of industrial shares
lagged behind the market as a whole.

9.7 The mean score on a standardized psychology test is supposed to be 50. Believing
that a group of psychologists will score higher (because they can see through the
questions), we test a random sample of 11 psychologists. Their mean score is 55
and the standard deviation is 3. What conclusions can be made?
9.8 The specification quoted by ABC Alloys for a particular metal alloy was a melting
point of 1660 °C. Fifteen samples of the alloy, selected at random, had a mean melting point of 1648 °C with a standard deviation of 45 °C. Is this evidence consistent
with the quoted specification?

Exercises using F -test for equality of variances . . .


9.9 If independent random samples of size 10 from two normal populations have sample
variances $s_1^2 = 12.8$ and $s_2^2 = 3.2$, what can you conclude about a claim that the
two populations have the same variance? Use a 5% significance level.
9.10 From a sample of size 13 the estimate of the standard deviation of a population
was calculated as 4.47 and from a sample of size 16 from another population the
standard deviation was 8.32. Can these populations be considered as having equal
variances? Use a 5% level of significance.

Exercises on hypothesis testing (two samples) . . .


9.11 The national electricity supplier claims that switching off the hot water cylinder
at night does not result in a saving of electricity. In order to test this claim
a newspaper reporter obtains the co-operation of 16 house owners with similar
houses and salaries. Eight of the selected owners switch their cylinders off at
night. The consumption of electricity in each house over a period of 30 days is
measured; the units are kWh (kilowatt-hours). The following data are collected:
OFF GROUP            ON GROUP
$n_1 = 8$            $n_2 = 8$
$\bar{x}_1 = 680$    $\bar{x}_2 = 700$
$s_1^2 = 450$        $s_2^2 = 300$

Test, at the 5% level, whether there is a significant saving in electricity if the
cylinder is switched off at night. Test the assumption that the variances are equal,
also at the 5% level.
9.12 A comparison is made between two brands of toothpaste to compare their effectiveness at preventing cavities. 25 children use Hole-in-None and 30 children use
Fantoothtic in an impartial test. The results are as follows:

                                  Hole-in-None    Fantoothtic
Sample size                       25              30
Average number of new cavities    1.6             2.7
Standard deviation                0.7             0.9

At the 1% level of significance, investigate whether one brand is better than the
other.
9.13 A company claims that its light bulbs are superior to those of a competitor on
the basis of a study which showed that a sample of 40 of its bulbs had an average
lifetime of 522 hours with a standard deviation of 28 hours, while a sample of
30 bulbs made by the competitor had an average lifetime of 513 hours with a
standard deviation of 24 hours. Test the null hypothesis $\mu_1 - \mu_2 = 0$ against a
suitable one-sided alternative to see if the claim is justified.
9.14 The densities of sulphuric acid in two containers were measured, four determinations being made on one and six on the other. The results were:
(1) 1.842  1.846  1.843  1.843
(2) 1.848  1.843  1.846  1.847  1.847  1.845
Do the densities differ at the 5% significance level,


(a) if there is no reason beforehand to believe that there is any difference in
density between the containers?
(b) if we have good reason to suspect that, if there is any difference, the first
container will be less dense than the second?
9.15 A teacher used different teaching methods in two similar statistics classes of 35
students each. Each class then wrote the same examination. In one class, the
mean was $\bar{x}_1 = 82\%$ with $s_1 = 3\%$. In the other class, the results were $\bar{x}_2 = 77\%$
and $s_2 = 7\%$. Test to see if this provides the teacher with evidence that one
teaching method is superior to the other. Use a 5% significance level.
9.16 Two drivers, A and B, do fuel consumption tests on a single car. The car is refuelled every 100 km, and the number of litres required to refill the tank is measured.
Drivers A and B drive 2100 km and 2800 km, respectively, and so are refuelled
21 and 28 times. Driver B has recently read a pamphlet entitled "Fuel Economy
Tips", and has been putting these ideas into practice. The results are summarized
in the table below:

Driver    Sample size    Sample mean (ℓ/100 km)    Sample standard deviation (ℓ/100 km)
A         21             9.03                      1.73
B         28             8.57                      0.89

Is Driver B more economical than Driver A?

Exercises using normal, t- and F -distributions . . .


9.17 The following statistics were calculated from random samples of daily sales figures
for two departments of a large store:
Department                    Hardware    Crockery
Sample size                   15          15
Mean daily sales (rands)      1400        1250
Standard deviation (rands)    180         120

The sales manager feels that mean sales in hardware is significantly higher than in
crockery. Test this idea statistically using a 1% significance level.
9.18 Travel times by road between two towns are normally distributed; a random
sample of 16 observations had a mean of 30 minutes and a standard deviation of
5 minutes.
(a) Find a 99% confidence interval for the mean travel time.
(b) Estimate how large a sample would be needed to be 95% sure that the sample
mean was within half a minute of the population mean. You need to modify
the formula for estimating sample sizes in Chapter 8 to take account of the
fact that you are given a sample standard deviation based on a small sample
rather than the population standard deviation.

9.19 An insurance company has found that the number of claims made per year on a
certain type of policy obeys a Poisson distribution. Until five years ago, the rate
of claims averaged 13.1 per year. New restrictions on the acceptance of this type
of insurance were introduced five years ago, and since then 51 claims have been
made.
(a) Test whether the restrictions have been effective in reducing the number of
claims.
(b) Find an approximate 95% confidence interval for the average claim rate under
the new restrictions.
9.20 A new type of battery is claimed to have two hours more life than the standard
type. Random samples of new and standard batteries are tested, with the following
summarized results:

            Sample size    Sample mean (hours)    Sample standard deviation (hours)
New         94             39.3                   7.2
Standard    42             36.8                   6.4

Test the claim at the 5% significance level.


9.21 The mean commission of floor-wax salesmen has been R6000 per month in the
past. New brands of wax are now providing stiffer competition for the salesmen,
but inflation has pushed up the commission per sale. Management wishes to test
whether the figure of R6000 per month still prevails, and examines a sample of 120
recent monthly commission figures. They are found to have a mean of R5850 and
a standard deviation of R800. Test at a 5% significance level whether the mean
commission rate has changed.

9.22 Two bus drivers, M and N, travel the same route. Over a number of journeys the
times taken by each driver to travel from bus stop 5 to bus stop 19 were noted.
The summarized results are presented below:

Driver    Trips    Mean (minutes)    Standard deviation (minutes)
M         12       18.1              1.9
N         21       21.3              3.9

(a) Test whether driver M is more consistent in journey times than driver N.
(b) Test whether there is a significant difference in the times taken by each driver.
9.23 It is necessary to compare the precision of two brands of detectors for measuring
mercury concentration in the air. The brand B detector is thought to be more
accurate than the brand A detector. Seven measurements are made with a brand
A instrument, and six with a brand B instrument one lunch hour. The results
(micrograms per cubic metre) are summarized as follows:
$\bar{x}_A = 0.87$    $s_A = 0.019$
$\bar{x}_B = 0.91$    $s_B = 0.008$

At the 5% significance level, do the data provide evidence that brand B measures
more precisely than brand A?
9.24 Show that if we test at the 5% significance level (using either the t- or normal
distributions) the null hypothesis $H_0 : \mu = \mu_0$ against $H_1 : \mu \neq \mu_0$, we will reject
$H_0$ if and only if $\mu_0$ lies outside the 95% confidence interval for $\mu$.
9.25 The health department wishes to determine if the mean bacteria count per ml
of water at Zeekoeivlei exceeds the safety level of 200 per ml. Ten 1 ml water
samples are collected. The bacteria counts are:
225  210  185  202  216  193  190  207  204  220

Do these data give cause for concern?


9.26 A normally distributed random variable has standard deviation $\sigma = 5$. A sample
was drawn, and the 95% confidence interval for the mean was calculated to be
(70.467, 73.733). The experimenter subsequently lost the original data. Tell him
(a) what the sample mean of the original data was, and
(b) what size sample he drew.

9.27 The following are observations made on a normal distribution with mean $\mu$ and
variance $\sigma^2$:
50  53  47  51  49
(a) Find 95% and 99% confidence intervals for $\mu$
(i) assuming $\sigma^2 = 5$
(ii) assuming $\sigma^2$ is unknown, and has to be estimated by $s^2$.
(b) Comment on the relative lengths of the confidence intervals found in (a).

Solutions to exercises . . .

9.1 (11 590, 12 660)
9.2 (24.80, 31.40)
9.3 (161, 251)
9.4 (a) (4.335, 4.425)
(b) (4.315, 4.445)
The higher the confidence level, the wider the interval.
9.5 $t_7 = -2.83 < -1.895$, reject $H_0$.
9.6 $t_{15} = -2.97 < -1.753$, reject $H_0$, industrial shares performed worse than the
All-Share Index.
9.7 $t_{10} = 5.53$, $P < 0.0005$, very highly significant.
9.8 $t_{14} = -1.03$, $P > 0.10$, at this stage cannot reject $H_0$, the evidence is consistent
with the specifications, but further investigation is warranted.
9.9 $F_{9,9} = 4.0 < 4.03$, cannot reject $H_0$.
9.10 $F_{15,12} = 3.46 > 3.18$, reject $H_0$.
9.11 $F_{7,7} = 1.50 < 4.99$, cannot reject $H_0$ and thus pooling of variances justified.
$t_{14} = -2.066 < -1.761$, reject $H_0$.
9.12 $F_{29,24} = 1.65 < 2.22$, cannot reject $H_0$, pool variances.
$t_{53} = -4.98 < -2.672$, reject $H_0$.
9.13 $F_{39,29} = 1.36 < 1.81$, cannot reject $H_0$ and pool variances.
$t_{68} = 1.41$, $P < 0.10$, nearly significant.
9.14 $F_{3,5} = 1.07 < 7.76$, cannot reject $H_0$ and pool variances.
(a) $t_8 = 2.19 < 2.306$, cannot reject $H_0$.
(b) $t_8 = 2.19 > 1.860$, reject $H_0$.


9.15 $F_{34,34} = 5.44 > 2.57$, reject $H_0$ at 1% level. Variances cannot be pooled. Degrees
of freedom $n^* = 46.8 \approx 47$. $t^* = 3.88 > 2.01$, reject $H_0$.
9.16 $F_{20,27} = 3.78$, $P < 0.01$, very significant, so variances cannot be pooled. Degrees
of freedom $n^* = 28.7 \approx 29$. $t^* = 1.11$, $P < 0.20$, possibly significant.
9.17 $F_{14,14} = 2.25 < 2.48$, cannot reject $H_0$ and hence pool the variances. $t_{28} = 2.69 >
2.467$, reject $H_0$.
9.18 (a) (26.32, 33.68)
(b) $n = (t^*_{m-1}\, s/L)^2 = (2.131 \times 5 / \tfrac{1}{2})^2 = 455$, where $m$ is the size of the sample
used for estimating $s$ (in this case 16), so that the degrees of freedom for $t$
is 15.
9.19 (a) $z = -1.79$, $P < 0.05$, significant reduction.
(b) (7.4, 13.0).
9.20 $F_{93,41} = 1.27 < 1.74$, cannot reject $H_0$ and pool variances ($s = 6.965$).
$t_{134} = 0.386 < 1.656$, cannot reject $H_0$.
9.21 $t_{119} = -2.05 < -1.98$, reject $H_0$.
9.22 (a) $F_{20,11} = 4.214$, $P < 0.05$, significant differences in variances. Pooling not
justifiable.
(b) $n^* = 32$, $t^* = 3.16$, $P < 0.005$, highly significant difference in means.
9.23 $F_{6,5} = 5.64 > 4.95$, reject $H_0$.
9.25 $t_9 = 1.25$, $P < 0.20$, possibly significant.
9.26 (a) $\bar{x} = 72.1$
(b) $n = 36$.
9.27 (a) The confidence intervals are:

        95%               99%
(i)     (48.04, 51.96)    (47.42, 52.58)
(ii)    (47.22, 52.78)    (45.40, 54.60)

(b) 99% confidence intervals are wider than 95% confidence intervals. If $\sigma^2$ is
estimated by $s^2$, the confidence interval is wider.

Chapter 10

THE CHI-SQUARED DISTRIBUTION


KEYWORDS: $\chi^2$-distribution, goodness of fit tests, observed and expected frequencies, contingency tables, tests of association, tests and
confidence intervals for the variance.

Many seemingly unrelated applications . . .


There is a surprisingly large number of tests of hypotheses for which the sampling
distribution of the test statistic has a $\chi^2$-distribution, either exactly or approximately.
In terms of numbers of applications in Statistics, the chi-squared distribution is probably
in second place, after the normal distribution.
We consider here only three of these applications of the chi-squared distribution:
goodness of fit tests, tests of association in contingency tables, and tests and
confidence intervals for the population variance. Although the rationale behind the first
two applications is completely different, the calculation of the test statistics is almost
identical.

Goodness of fit tests . . .


In chapter 9, to do the two-sample t-test, we found that we had to make the assumption that the variances were equal, and then immediately provided a method for testing
this assumption, the F -test. At the very end of the chapter, we said that another assumption underlying all t-tests was that the samples were drawn from populations which
had normal distributions. There is therefore a need to be able to test this assumption.
The test described in this section enables us to do just this. Not only can we test if the
process that generated the data has a normal distribution, we can test whether data fits
any specified distribution. In our first example, we test whether the process generating
misprints in the pages of newspapers produces data with a Poisson distribution.
Example 1A: In the printing industry, it is generally thought that misprints occur at
random, i.e. they occur independently of each other. Thus the number of misprints
per page can be expected to obey a Poisson distribution with some parameter $\lambda$. To test
this hypothesis, 200 pages from newspapers were examined, and the number of misprints
on each page noted. The data are presented below. A 5% significance level is to be used.

| Number of misprints | Observed number of pages |
|---|---|
| 0 | 43 |
| 1 | 69 |
| 2 | 53 |
| 3 | 21 |
| 4 | 8 |
| 5 | 6 |

The table tells us that 43 of the 200 pages examined were free of misprints, 69 of
the pages had one misprint, 53 pages had two misprints each (a total of 106 misprints
on these 53 pages), . . . , 6 pages had five misprints on each page (30 misprints on these
6 pages).
In order to test whether the Poisson distribution fits this data, the first problem is to decide what value to use for the parameter $\lambda$ of the Poisson distribution. This can either be specified by the null hypothesis, or the data can be used to estimate $\lambda$. We treat these two situations separately.
Let us first consider the test of the null hypothesis that the data can be thought of as a sample from a Poisson distribution with parameter $\lambda = 1.2$ misprints per page against the alternative that the data come from some other distribution.
Thus we have
1. $H_0$: Data come from a Poisson distribution with $\lambda = 1.2$
2. $H_1$: The distribution is not Poisson with $\lambda = 1.2$
3. Significance level: 5%
4. & 5. We need firstly to compute the expected (theoretical) frequencies, assuming that the null hypothesis is true. If misprints are occurring in accordance with a Poisson distribution with rate $\lambda = 1.2$ then the probability that a page contains $x$ misprints is
$$p(x) = e^{-1.2}\, 1.2^x / x!$$
Thus the probability of no misprints is
$$p(0) = e^{-1.2} = 0.3012$$
Thus 30.12% of pages are expected to have no misprints. A sample of 200 pages can therefore be expected to have 60.24 pages with no misprints (30.12% of 200). This theoretical frequency is to be compared with an observed frequency of 43. Similarly $p(1) = e^{-1.2} \times 1.2 = 0.3614$. Thus the expected frequency of one error is $200 \times 0.3614 = 72.28$. We compare this with an observed 69 pages with one error. Continuing in this way we build up a table of observed and expected frequencies.


| Number of misprints ($x$) | Observed number of pages ($O_i$) | Theoretical probability $p(x)$ | Expected frequency ($E_i$) |
|---|---|---|---|
| 0 | 43 | $1.2^0 e^{-1.2}/0! = 0.3012$ | 60.24 |
| 1 | 69 | $1.2^1 e^{-1.2}/1! = 0.3614$ | 72.28 |
| 2 | 53 | $1.2^2 e^{-1.2}/2! = 0.2169$ | 43.38 |
| 3 | 21 | $1.2^3 e^{-1.2}/3! = 0.0867$ | 17.34 |
| 4 | 8 | $1.2^4 e^{-1.2}/4! = 0.0260$ | 5.20 |
| 5 or more | 6 | $1 - \sum_{x=0}^{4} p(x) = 0.0078$ | 1.56 |

Even if $H_0$ is true, we anticipate that the observed and expected frequencies will not be exactly equal; this is because we expect some sampling fluctuation in our random sample of 200 pages. We would clearly like to reject $H_0$, however, if the difference between the observed frequencies and the expected frequencies is too large.
We need to find a test statistic which is a function of the differences between observed and expected frequencies and which has a known sampling distribution. The sum of these differences is of no use because they sum to zero. So we square the differences and they all become positive. A difference of 3 is negligible if the observed and expected frequencies are 121 and 124, but it is important if the frequencies are 6 and 9. To take account of this we divide the squared differences by their expected frequency. The right statistic to use is
$$D^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
where the sum is taken over all the cells in the table.
The statistic $D^2$ has approximately a chi-squared ($\chi^2$) distribution. Like the t- and F-distributions, the $\chi^2$ distribution has a degrees of freedom number attached to it. In tests like the one above, the correct degrees of freedom is given by $k - 1$, where $k$ is the number of cells into which the data are categorized. Here $k = 6$, and therefore the degrees of freedom for $\chi^2$ is 5. Thus we will reject $H_0$ if $D^2$ exceeds the 5% point of the $\chi^2_5$ distribution. From our tables $\chi^{2(0.05)}_{5} = 11.071$.

[Figure: density $f(x)$ of the $\chi^2_5$ distribution plotted against $x$, with the upper 5% point $\chi^{2(0.05)}_{5} = 11.071$ marked.]
If $D^2$ exceeds 11.071 then the observed and expected frequencies are too far apart for their differences to be explained by chance sampling fluctuations alone. Let us compute $D^2$ for the above data.
$$D^2 = \sum \frac{(O_i - E_i)^2}{E_i} = \frac{(43 - 60.24)^2}{60.24} + \frac{(69 - 72.28)^2}{72.28} + \cdots + \frac{(6 - 1.56)^2}{1.56} = 22.17$$

6. This value lies in the rejection region. We thus reject the null hypothesis that the data are sampled from a Poisson distribution with $\lambda = 1.2$.
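The calculation in steps 4 to 6 can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the original text: the variable names are our own, and the critical value 11.071 is simply copied from the tables. Because the code uses unrounded expected frequencies, its value of $D^2$ may differ slightly from the hand calculation, which uses the rounded table entries.

```python
import math

def poisson_pmf(x, lam):
    """Poisson probability p(x) = e^(-lam) * lam^x / x!"""
    return math.exp(-lam) * lam ** x / math.factorial(x)

# Observed pages with 0, 1, 2, 3, 4 and "5 or more" misprints (Example 1A)
observed = [43, 69, 53, 21, 8, 6]
n = sum(observed)          # 200 pages
lam = 1.2                  # value specified by the null hypothesis

# Probabilities for x = 0..4; the last cell gets the remaining probability
probs = [poisson_pmf(x, lam) for x in range(5)]
probs.append(1.0 - sum(probs))

expected = [n * p for p in probs]
d2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

critical_value = 11.071    # 5% point of chi-squared with 5 df, from tables
reject_h0 = d2 > critical_value
print([round(e, 2) for e in expected], round(d2, 2), reject_h0)
```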
We said earlier that the statistic $D^2$ has approximately a $\chi^2$ distribution. This approximation is good provided all the expected frequencies exceed about 5. Looking back at the table of expected frequencies, you will see that this condition has been violated: the expected number of pages with 5 (or more) misprints is 1.56, which is much less than 5. To get around this we amalgamate adjoining classes. This reduces the number of cells, and also the degrees of freedom. Here it is necessary to amalgamate the last two cells:

| Number of misprints | Observed number of pages ($O_i$) | Theoretical probability | Expected frequency ($E_i$) |
|---|---|---|---|
| 0 | 43 | 0.3012 | 60.24 |
| 1 | 69 | 0.3614 | 72.28 |
| 2 | 53 | 0.2169 | 43.38 |
| 3 | 21 | 0.0867 | 17.34 |
| 4 or more | 8 + 6 = 14 | 0.0338 | 6.76 |

All the expected values exceed 5, and we now correctly compute $D^2$ as
$$D^2 = \frac{(43 - 60.24)^2}{60.24} + \cdots + \frac{(14 - 6.76)^2}{6.76} = 15.78$$


We now have only 5 cells, so the degrees of freedom for $\chi^2$ is 4. The 5% point of $\chi^2_4$ is 9.488. Because 15.78 > 9.488, we reject $H_0$ and come to the same conclusion as before (but using the right method this time!).
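The amalgamation rule can be sketched as follows. This is a simplified illustration under our own assumptions: the helper name `amalgamate_tail` is hypothetical, and it only merges cells from the upper tail, which is all that this example needs. The expected frequencies are the rounded values from the table, so the result is close to, but not exactly, the hand-calculated value.

```python
def amalgamate_tail(observed, expected, minimum=5.0):
    """Merge the last cell into its neighbour until the last expected
    frequency is at least `minimum` (upper tail only)."""
    obs, exp = list(observed), list(expected)
    while len(exp) > 1 and exp[-1] < minimum:
        last_o, last_e = obs.pop(), exp.pop()
        obs[-1] += last_o
        exp[-1] += last_e
    return obs, exp

observed = [43, 69, 53, 21, 8, 6]
expected = [60.24, 72.28, 43.38, 17.34, 5.20, 1.56]   # lambda = 1.2

obs, exp = amalgamate_tail(observed, expected)
d2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
df = len(obs) - 1               # k - 1, no parameters estimated

critical_value = 9.488          # 5% point of chi-squared with 4 df, from tables
print(obs, [round(e, 2) for e in exp], round(d2, 2), df, d2 > critical_value)
```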
[Figure: density $f(x)$ of the $\chi^2_4$ distribution plotted against $x$, with the upper 5% point $\chi^{2(0.05)}_{4} = 9.488$ marked.]

Let us now use the data to estimate $\lambda$, and see what difference this makes to the test. Our null and alternative hypotheses are now
1. $H_0$: the data fit some Poisson distribution, and
2. $H_1$: the data fit a distribution other than the Poisson distribution.
3. Significance level: 5%
4. & 5. To find $\lambda$ we need to estimate the rate at which misprints occurred in our sample data. The total number of misprints that occurred was $0 \times 43 + 1 \times 69 + 2 \times 53 + \cdots + 5 \times 6 = 300$ misprints. 300 misprints in 200 pages implies that the mean rate at which misprints occur is 1.5 misprints per page. We therefore try to fit a Poisson distribution with $\lambda = 1.5$.
Using the same procedure as before, we find a table of expected frequencies.

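| Number of misprints | Observed number of pages ($O_i$) | Theoretical probability | Expected frequency ($E_i$) |
|---|---|---|---|
| 0 | 43 | 0.2231 | 44.62 |
| 1 | 69 | 0.3347 | 66.94 |
| 2 | 53 | 0.2510 | 50.20 |
| 3 | 21 | 0.1255 | 25.10 |
| 4 | 8 | 0.0471 | 9.42 |
| 5 or more | 6 | 0.0186 | 3.72 |
| 4 or more (last two cells combined) | 8 + 6 = 14 | | 9.42 + 3.72 = 13.14 |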

We amalgamate the last two cells, so that all expected values exceed 5. We now
have five cells.

The rule for finding the degrees of freedom in this case is
Degrees of freedom = $k - d - 1$
where $k$ is the number of cells, and
$d$ is the number of parameters estimated from the data.

Here $k = 5$ and $d = 1$, because we estimated just one parameter, $\lambda$, from the data. Thus we must use $\chi^2$ with $5 - 1 - 1 = 3$ degrees of freedom. The 5% point of $\chi^2_3$ is 7.815. We reject $H_0$ if $D^2$ exceeds 7.815. Notice that goodness of fit tests are intrinsically one-sided: we reject $H_0$ if $D^2$ is too large. If $D^2$ is small, it means that the distribution specified by $H_0$ fits the data very well.

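[Figure: density $f(x)$ of the $\chi^2_3$ distribution plotted against $x$, with the upper 5% point $\chi^{2(0.05)}_{3} = 7.815$ marked.]

Using the formula $D^2 = \sum \frac{(O_i - E_i)^2}{E_i}$ we calculate
$$D^2 = \frac{(43 - 44.62)^2}{44.62} + \cdots + \frac{(14 - 13.14)^2}{13.14} = 1.01$$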

6. Because 1.01 lies in the acceptance region, we cannot reject $H_0$. It is reasonable to conclude that the Poisson distribution with $\lambda = 1.5$ misprints per page fits the data well.
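The second version of the test, in which the parameter is estimated from the data, can be sketched like this. It is an illustrative calculation only: the variable names are ours, the "5 or more" cell is counted as exactly 5 misprints when estimating the rate (as in the text), and the critical value 7.815 is taken from the tables.

```python
import math

counts = [0, 1, 2, 3, 4, 5]          # misprints per page ("5 or more" taken as 5)
observed = [43, 69, 53, 21, 8, 6]
n = sum(observed)

# Estimate the Poisson rate as total misprints / total pages
lam_hat = sum(x * o for x, o in zip(counts, observed)) / n

# Cells 0, 1, 2, 3 and "4 or more" (last two cells amalgamated)
probs = [math.exp(-lam_hat) * lam_hat ** x / math.factorial(x) for x in range(4)]
probs.append(1.0 - sum(probs))
obs = observed[:4] + [observed[4] + observed[5]]
expected = [n * p for p in probs]

d2 = sum((o - e) ** 2 / e for o, e in zip(obs, expected))

k = len(obs)                         # number of cells after amalgamation
d = 1                                # one parameter estimated from the data
df = k - d - 1

critical_value = 7.815               # 5% point of chi-squared with 3 df, from tables
print(lam_hat, round(d2, 2), df, d2 > critical_value)
```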


A short-cut formula . . .
As in the case with calculating the sample variance there is a short-cut formula for calculating $D^2$.
$$D^2 = \sum \frac{(O_i - E_i)^2}{E_i} = \sum \frac{O_i^2 - 2 O_i E_i + E_i^2}{E_i} = \sum \frac{O_i^2}{E_i} - 2 \sum O_i + \sum E_i$$
But $\sum O_i = \sum E_i = n$, the sample size. (Each summation is over the $k$ cells.) Therefore
$$D^2 = \sum \frac{O_i^2}{E_i} - n.$$
This is the most convenient formula for calculating $D^2$ by hand.
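As a quick numerical check (a sketch using the amalgamated observed and expected frequencies from Example 1A with the rate estimated as 1.5 misprints per page; the variable names are our own), the definition and the short-cut formula give the same value whenever the expected frequencies sum to $n$:

```python
observed = [43, 69, 53, 21, 14]
expected = [44.62, 66.94, 50.20, 25.10, 13.14]   # these sum to 200
n = sum(observed)

# Definition: sum of squared differences, each divided by its expected frequency
direct = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Short-cut: sum of O^2 / E, minus the sample size
shortcut = sum(o * o / e for o, e in zip(observed, expected)) - n

print(round(direct, 4), round(shortcut, 4))
```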

Example 2B: A random sample of 230 students took an I.Q. test. The scores they
obtained have been summarized in the table below:

| Score | Observed frequency |
|---|---|
| < 90 | 18 |
| 90–110 | 87 |
| 110–130 | 104 |
| > 130 | 21 |

Test at the 5% level whether these data come from a normal distribution.
1. $H_0$: The data follow a normal distribution (parameters to be estimated from the data).
2. $H_1$: The distribution is not normal.
3. Significance level: 5%.
4. We use the $\chi^2$ distribution with $k - d - 1$ degrees of freedom. Here there are $k = 4$ cells and $d = 2$ parameters ($\mu$ and $\sigma$) are estimated. Thus we use $\chi^2_1$, and reject $H_0$ if $D^2$ exceeds $\chi^{2(0.05)}_{1} = 3.841$.

[Figure: density $f(x)$ of the $\chi^2_1$ distribution plotted against $x$, with the upper 5% point $\chi^{2(0.05)}_{1} = 3.841$ marked.]
5. We need to estimate $\mu$ and $\sigma$. Let us suppose that before the data were summarized into the table above, the sample mean $\bar{x}$ and the standard deviation $s$ were computed to be 111 and 16.3 respectively.
We now use $N(111, 16.3^2)$ to compute the theoretical probabilities that randomly selected I.Q.s fall into the 4 cells in our table.
If $X \sim N(111, 16.3^2)$, then
$$\Pr(X < 90) = \Pr\left(Z < \frac{90 - 111}{16.3}\right) = \Pr(Z < -1.29) = 0.0985$$
Thus the probability that a randomly selected single individual has an I.Q. less than 90 is 0.0985. In a sample of size 230 we therefore expect $230 \times 0.0985 = 22.7$ students to have I.Q.s below 90.
We do a similar calculation for the remaining cells, to obtain the table:
We do a similar calculation for the remaining cells, to obtain the table:

| Score | Observed frequency | Theoretical probability | Expected frequency |
|---|---|---|---|
| < 90 | 18 | $\Pr[Z < -1.29] = 0.0985$ | 22.7 |
| 90–110 | 87 | $\Pr[-1.29 \le Z < -0.06] = 0.3776$ | 86.8 |
| 110–130 | 104 | $\Pr[-0.06 \le Z < 1.17] = 0.4029$ | 92.7 |
| > 130 | 21 | $\Pr[Z \ge 1.17] = 0.1210$ | 27.8 |

Thus
$$D^2 = \sum \frac{O_i^2}{E_i} - n = \frac{18^2}{22.7} + \frac{87^2}{86.8} + \frac{104^2}{92.7} + \frac{21^2}{27.8} - 230 = 234.01 - 230 = 4.01$$


6. Because 4.01 > 3.841, we reject H0 and conclude that the normal distribution is
not a good fit to this data.
This example illustrates the method to use to test whether data fit a normal
distribution. In practice, however, the data would be divided into many more
than the 4 cells we used in this example.
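The same test can be sketched in Python using only the standard library. This is an illustration, not the book's method of working from tables: the helper `normal_cdf` is our own, built on `math.erf`, and because the z-values are not rounded to two decimals the statistic comes out slightly different from the hand-calculated 4.01.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

boundaries = [90, 110, 130]           # cell boundaries for the I.Q. scores
observed = [18, 87, 104, 21]
mu, sigma = 111, 16.3                 # estimates from the sample
n = sum(observed)

# Cumulative probabilities at the boundaries, padded with 0 and 1
cdf = [0.0] + [normal_cdf(b, mu, sigma) for b in boundaries] + [1.0]
probs = [hi - lo for lo, hi in zip(cdf, cdf[1:])]
expected = [n * p for p in probs]

# Short-cut formula for the test statistic
d2 = sum(o * o / e for o, e in zip(observed, expected)) - n

df = len(observed) - 2 - 1            # k - d - 1 with d = 2 estimated parameters
critical_value = 3.841                # 5% point of chi-squared with 1 df, from tables
print([round(e, 1) for e in expected], round(d2, 2), df, d2 > critical_value)
```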
Look back at the plot of the $\chi^2_1$-distribution in example 2B. The shape is quite unlike that of the typical chi-squared distribution with more than one degree of freedom, as depicted in example 1A. It is closely related to the exponential distribution of chapter 5. We saw that $\chi^{2(0.05)}_{1} = 3.841$, i.e. the 5% point of the chi-squared distribution with one degree of freedom, is 3.841. And it is not an accident that the square root of 3.841 is equal to 1.96, the 2.5% point of the standard normal distribution; there is a mathematical relationship between the normal and chi-squared distributions.
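The relationship can be made explicit in one line. The sketch below relies on the standard result (not derived in this chapter) that the square of a standard normal variable $Z$ has the $\chi^2_1$ distribution:

```latex
\Pr[\chi^2_1 > 3.841] = \Pr[Z^2 > 3.841] = \Pr[|Z| > 1.96] = 2 \times 0.025 = 0.05
```

So the upper 5% point of $\chi^2_1$ corresponds to the two-sided 5% point of the standard normal distribution.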
Example 3C: A T-shirt manufacturer makes a certain line of T-shirt in three colours:
white, red and blue. 50% of the T-shirts made are white, 25% are red and 25% blue.
One outlet reported sales of 47 white T-shirts, 32 red and 21 blue. Test whether sales
are consistent with the manufactured proportions.
Example 4C: A popular clothing store is interested in establishing the distribution of
customers arriving at the cashiers. During a 100-minute period, the number of customers
arriving per minute was counted, with the following results:

Number of
customers
per minute
observed
frequency

5 or more

25

26

21

15

Can a Poisson distribution be used to model the arrival times?

Tests of association in contingency tables . . .


Another use for the chi-squared distribution is in contingency tables. A contingency table is simply a table (or matrix) of counts. Each entry in the table is called a
cell. Each member of a sample is classified according to two variables, e.g. eye colour and
hair colour, and each cell in the table represents the count of the number of members of
the sample who have a particular combination of the two variables, e.g. blue eyes and
blond hair. The chi-squared distribution is the sampling distribution of the test statistic
which tests whether there is an association (or relationship) between the two variables:
e.g. Is there a tendency for eye colour to be related to hair colour? Or are eye colour
and hair colour independent?


Example 5A: A financial analyst is interested in determining whether the size of a company (as measured by its market capitalization) has any association with its performance (as categorized by the annual percentage price change of a share in the company on the stock market). The table below gives the number of companies falling into each category in a sample of 420 firms.
Performance (% price change)

| Company size | < 0% | 0–20% | 20–40% | > 40% | Total |
|---|---|---|---|---|---|
| Small | 66 | 90 | 28 | 39 | 223 |
| Medium | 24 | 33 | 17 | 11 | 85 |
| Large | 55 | 37 | 10 | 10 | 112 |
| Total | 145 | 160 | 55 | 60 | 420 |

There were thus 66 companies whose performances were negative (column < 0%) and which were small, etc. This rectangular collection of frequencies of occurrence is a typical example of a contingency table. This is a 3 × 4 contingency table. It is customary to give the number of rows first, then the number of columns. The designer of the South African decimal currency had the right idea: rands and cents, rows and columns.
In general we talk of an $r \times c$ contingency table with $r$ rows and $c$ columns.
The financial analyst wants to know if there is a significant relationship (at the 5%
level) between the performance and the size of the companies. We use our standard
approach to hypothesis testing.
1. We start by assuming that there is no relationship between the two variables;
thus we have H0 : the performance and the size of a company are statistically
independent.
You will be wise at this point to reread the last few pages of Chapter
3, where the concept of independence was developed!
2. The alternative hypothesis, which the financial analyst is trying to establish, is
H1 : the performance and the size of a company are associated.
3. Significance level : 5%.
4. & 5. Under the assumption of independence made in the null hypothesis, we calculate
theoretical expected frequencies for each cell. If the theoretical frequencies (which
assume independence) and the observed frequencies are too different, we reject
the null hypothesis of independence, and conclude that there is a dependence or
a relationship or association between the two variables. A careful examination of
the table then helps us to determine the nature of the association.
From the table above, we see that 145 out of the 420 companies had negative performance figures. Thus, using the relative frequencies in the sample, we estimate the
probability that a randomly selected company will have negative performance figures as
Pr[negative performance] = 145/420 = 0.345.
Similarly, 223 out of the 420 companies were classified as small. We estimate that the
probability that a randomly selected company will be small as Pr[small] = 223/420 =
0.531.


If, as $H_0$ tells us, the performance of a company is independent of its size, then the probability that a randomly selected company will have both negative performance figures and be small will be the product of the individual probabilities, i.e.
$$\Pr[\text{negative performance and small}] = \Pr[\text{negative performance}] \times \Pr[\text{small}] = \frac{145}{420} \times \frac{223}{420} = 0.345 \times 0.531 = 0.1833$$
Thus, if $H_0$ is true, the proportion of companies that has negative performance and is small will be 0.1833, or, expressed as a percentage, 18.33%. We have 420 companies in our sample, so we expect $0.1833 \times 420 = 77.0$ to have negative performance and be small. This expected frequency can be computed more directly as
$$\frac{145}{420} \times \frac{223}{420} \times 420 = \frac{145 \times 223}{420} = 77.0$$
Thus the calculation of the expected frequencies for this (and every other) cell of the table reduces to a very simple formula:
$$\text{expected frequency} = \frac{\text{column total} \times \text{row total}}{\text{grand total}}$$
The full set of expected frequencies is given in brackets under the observed values
in the table below:

Performance (% price change)

| Firm size | < 0% | 0–20% | 20–40% | > 40% | Total |
|---|---|---|---|---|---|
| Small | 66 (77.0) | 90 (85.0) | 28 (29.2) | 39 (31.9) | 223 |
| Medium | 24 (29.3) | 33 (32.4) | 17 (11.1) | 11 (12.1) | 85 |
| Large | 55 (38.7) | 37 (42.7) | 10 (14.7) | 10 (16.0) | 112 |
| Total | 145 | 160 | 55 | 60 | 420 |

In all the above arithmetic, don't lose sight of the fact that the expected frequencies have been computed assuming that the two variables are independent. Thus, the larger the differences between the observed frequencies in the table and the expected frequencies, the less likely it is that the two variables are independent. The comparison of observed and expected frequencies makes use of the same statistic as was used for goodness of fit tests:
$$D^2 = \sum \frac{O_i^2}{E_i} - n,$$


where we sum over all the cells in the contingency table, and n is the total number of
observations.
Once again we have a degrees of freedom problem. It can be shown that the following
rule gives us the degrees of freedom we require:
DEGREES OF FREEDOM FOR TESTS OF ASSOCIATION
For an $r \times c$ contingency table, the appropriate chi-squared distribution has degrees of freedom = $(r - 1)(c - 1)$
Here we have a 3 × 4 contingency table, hence the degrees of freedom for chi-squared is $(3 - 1)(4 - 1) = 6$.
Using the 5% significance level, we determine from tables that the 5% point of $\chi^2_6$ is 12.592. We will reject $H_0$ if $D^2 > 12.592$. Notice that tests of association for contingency tables are almost invariably one-sided.
We compute $D^2$ using the short-cut formula:
$$D^2 = \frac{66^2}{77.0} + \frac{90^2}{85.0} + \cdots + \frac{10^2}{14.7} + \frac{10^2}{16.0} - 420 = 19.09.$$

[Figure: density $f(x)$ of the $\chi^2_6$ distribution plotted against $x$, with the upper 5% point $\chi^{2(0.05)}_{6} = 12.592$ marked.]

6. Because 19.09 > 12.592 we reject H0 . Thus our financial analyst has demonstrated
that there is a significant relationship between the performance of a firm and its
size. Examination of the table of observed and expected frequencies shows that
expected values exceed observed values in the top left and bottom right corners
of the contingency table, and vice versa in the top right and bottom left corners.
This indicates an inverse relationship between these two variables, that is, as the
size of a firm decreases, the performance improves.
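The whole calculation for an $r \times c$ table follows a fixed recipe, which the following sketch illustrates for the data of Example 5A. The variable names are our own, and the critical value 12.592 is copied from the tables rather than computed.

```python
# Observed counts: rows = Small, Medium, Large; columns = the four performance bands
observed = [
    [66, 90, 28, 39],
    [24, 33, 17, 11],
    [55, 37, 10, 10],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency under independence: row total * column total / grand total
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Short-cut formula summed over all cells
d2 = sum(
    o * o / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
) - n

df = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 12.592        # 5% point of chi-squared with 6 df, from tables
print(round(expected[0][0], 1), round(d2, 2), df, d2 > critical_value)
```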


Example 6B: A photographic company wishes to compare its present automatic film
processing machines with two other machines that have recently come onto the market.
It processes 194 films on the present machine (called A), and hires the other machines
(B and C) for short periods and processes smaller numbers of films on them. Using very
stringent criteria, the manager classifies each film as being satisfactorily processed or
not. The following contingency table is obtained:

Processing machine

| | A | B | C |
|---|---|---|---|
| Satisfactory | 93 | 24 | 8 |
| Unsatisfactory | 101 | 6 | 18 |

Test whether the classification of the film as satisfactory or not is independent of the
machine the film was processed on.
No significance level is given, and we use the modified hypothesis testing procedure.
1. H0 : The classification is independent of the machine
2. H1 : The classification is dependent on the machine
3. We compute the expected values, given in brackets in the table below:

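Processing machine

| | A | B | C | Total |
|---|---|---|---|---|
| Satisfactory | 93 (97) | 24 (15) | 8 (13) | 125 |
| Unsatisfactory | 101 (97) | 6 (15) | 18 (13) | 125 |
| Total | 194 | 30 | 26 | 250 |

For example, the expected value for satisfactory films on machine A is $(194 \times 125)/250 = 97$.

Next we compute $D^2$:
$$D^2 = \frac{93^2}{97} + \frac{24^2}{15} + \cdots + \frac{18^2}{13} - 250 = 14.98.$$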

4. The table has 2 rows and 3 columns, so the degrees of freedom are $(2 - 1) \times (3 - 1) = 2$. Examining the row in the tables for $\chi^2_2$, we see that the highest level at which the test statistic, 14.98, is significant is the 0.1% level ($P < 0.001$).


5. Our conclusion is that the classification of the film is highly significantly dependent on the machine used ($\chi^2_2 = 14.98$, $P < 0.001$). (Note that we report the value of the test statistic as a $\chi^2$ value.) Examination of the contingency table shows that machine B tends to produce a large proportion of photographs classified as satisfactory, while machine C tends to produce a large proportion of unsatisfactory photographs. Processing machine B is the recommended choice.
Example 7C: An analysis of the tries, penalty goals and dropped goals scored by South
Africa in rugby tests against the British Isles, New Zealand, Australia and France gave
the following contingency table:

| | Tries | Penalties | Drops |
|---|---|---|---|
| British Isles | 70 | 31 | 6 |
| New Zealand | 44 | 38 | 11 |
| Australia | 79 | 31 | 7 |
| France | 43 | 32 | 3 |

The number of penalties scored against New Zealand and France seems unexpectedly high, and this leads you to want to test the hypothesis that the mode of scoring is dependent on the opponents.
Example 8C: An investment broking company is keen to establish whether there is an association between their clients' perception of their attitude towards risk and the type of investments they prefer. They obtained 340 responses to a questionnaire designed to capture this information. A summary of the number of clients falling into the various categories is shown below:

Risk category

| Type of investment | Risk averse | Risk neutral | Risk lover |
|---|---|---|---|
| Fixed deposits | 79 | 58 | 49 |
| Bonds | 10 | 8 | 9 |
| Unit trusts | 12 | 10 | 19 |
| Options and futures | 10 | 34 | 42 |

Test the hypothesis that the type of investment is dependent on the risk perception.


Confidence intervals and tests for the variance . . .


It can be shown that if a sample is drawn from a population having a normal distribution, then a simple transformation of the sample variance $s^2$ has exactly the $\chi^2$-distribution:
$$(n - 1)s^2/\sigma^2 \sim \chi^2_{n-1}.$$
This states that the sample variance $s^2$, multiplied by its degrees of freedom $(n - 1)$, and divided by the population variance $\sigma^2$, has the $\chi^2$ distribution with $n - 1$ degrees of freedom. We can use this result to set up confidence intervals for the population variance, and to do hypothesis tests on the variance.

Example 9A: A dairy is concerned about the variability in the amount of milk per
bottle. A sample of 25 bottles was examined, and the sample standard deviation of the
contents was computed to be 3.71 ml.
(a) Find a 95% confidence interval for $\sigma$, the population standard deviation.
(b) At the 1% significance level, test whether the observed sample standard deviation is
consistent with the industrial specification that the population standard deviation
must not exceed 2.5 ml.
(a) Confidence Interval: The degrees of freedom of the appropriate $\chi^2$-distribution is one less than the sample size. Here, $n - 1 = 24$ degrees of freedom. To obtain a 95% confidence interval, the upper and lower 2½% points of the $\chi^2$-distribution must be obtained from tables. Because the $\chi^2$-distribution is not symmetric, the lower 2½% point must be obtained separately. The upper 2½% point which we require is $\chi^{2(0.025)}_{24} = 39.364$, and the lower 2½% point, which we look up, is $\chi^{2(0.975)}_{24} = 12.401$.

[Figure: density $f(x)$ of the $\chi^2_{24}$ distribution plotted against $x$, with the lower 2½% point $\chi^{2(0.975)}_{24} = 12.401$ and the upper 2½% point $\chi^{2(0.025)}_{24} = 39.364$ marked.]

Thus we can write:
$$\Pr[12.401 < \chi^2_{24} < 39.364] = 0.95$$
Because $(n - 1)s^2/\sigma^2 \sim \chi^2_{n-1}$, and because $n - 1 = 24$ and $s^2 = 3.71^2$ we have
$$\Pr[12.401 < 24 \times 3.71^2/\sigma^2 < 39.364] = 0.95$$
Rearranging, so that inequalities produce a confidence interval for $\sigma^2$, we have
$$\Pr[24 \times 3.71^2/39.364 < \sigma^2 < 24 \times 3.71^2/12.401] = 0.95.$$


This reduces to
$$\Pr[8.39 < \sigma^2 < 26.64] = 0.95.$$
The 95% confidence interval for $\sigma^2$, the population variance, is given by (8.39, 26.64), and the 95% confidence interval for $\sigma$, the population standard deviation, obtained by taking square roots, is (2.90, 5.16). Unlike the confidence intervals for the mean, the point estimate of the variance does not lie at the midpoint of the confidence interval (and the same is true for the confidence interval for the standard deviation).
CONFIDENCE INTERVAL FOR $\sigma^2$
If we have a random sample of size $n$ from a population with a normal distribution, and the sample variance is $s^2$, then A% confidence intervals for $\sigma^2$ are given by
$$\left( (n - 1)s^2/\chi^{2(\text{upper})}_{n-1},\; (n - 1)s^2/\chi^{2(\text{lower})}_{n-1} \right)$$
where the value $\chi^{2(\text{upper})}_{n-1}$ is the appropriate upper percentage point and $\chi^{2(\text{lower})}_{n-1}$ is the appropriate lower percentage point, determined from tables. The confidence interval for $\sigma$ is found by taking square roots.

(b) Hypothesis test.
1. Our null hypothesis specifies the maximum acceptable population standard deviation. $H_0$ is thus taken to be $H_0: \sigma = 2.5$.
2. The alternative hypothesis states the region of unacceptable standard deviations: $H_1: \sigma > 2.5$.
3. Significance level: 1%.
4. We will reject $H_0$ if our test statistic exceeds the upper 1% point of the $\chi^2$-distribution with $n - 1$ degrees of freedom: i.e. if $\chi^{2(0.01)}_{24} = 42.980$ is exceeded.
5. Our test statistic is
$$\chi^2 = (n - 1)s^2/\sigma^2 = 24 \times 3.71^2/2.5^2 = 52.85$$
6. Because 52.85 > 42.980, we reject the null hypothesis, and conclude that the specification is not met, and that therefore the dairy has a problem.
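Both parts of Example 9A can be reproduced with a short calculation. The sketch below is illustrative only: the variable names are ours, and the three percentage points of $\chi^2_{24}$ are typed in from the tables rather than computed.

```python
import math

n = 25
s = 3.71                       # sample standard deviation (ml)

# Percentage points of chi-squared with n - 1 = 24 df, copied from tables
chi2_upper_2_5 = 39.364        # upper 2.5% point
chi2_lower_2_5 = 12.401        # lower 2.5% point
chi2_upper_1 = 42.980          # upper 1% point

sum_sq = (n - 1) * s ** 2      # (n - 1) s^2

# (a) 95% confidence intervals for the variance and the standard deviation
var_ci = (sum_sq / chi2_upper_2_5, sum_sq / chi2_lower_2_5)
sd_ci = tuple(math.sqrt(v) for v in var_ci)

# (b) test of H0: sigma = 2.5 against H1: sigma > 2.5 at the 1% level
sigma0 = 2.5
test_statistic = sum_sq / sigma0 ** 2
reject_h0 = test_statistic > chi2_upper_1

print([round(v, 2) for v in var_ci], [round(v, 2) for v in sd_ci])
print(round(test_statistic, 2), reject_h0)
```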


Example 10B: Variability in the return of a traded security is often thought of as a measure of total risk of the security. A certain portfolio manager will only invest in a security if its population standard deviation of return does not exceed 10% per month. A sample of 18 monthly returns on a particular security yielded a sample standard deviation of 14.2% per month.
(a) Find a 90% confidence interval for the population standard deviation.
(b) Test the null hypothesis that the standard deviation does not exceed the upper limit required by the portfolio manager.

(a) From tables, $\chi^{2(0.95)}_{17} = 8.672$ and $\chi^{2(0.05)}_{17} = 27.587$. Thus the 90% confidence interval for the population variance $\sigma^2$ is given by
$$(17 \times 14.2^2/27.587,\; 17 \times 14.2^2/8.672) = (124.3, 395.3)$$
and the 90% confidence interval for the standard deviation is
$$(\sqrt{124.3}, \sqrt{395.3}) = (11.1, 19.9)$$

(b)
1. $H_0: \sigma = 10$.
2. $H_1: \sigma > 10$.
3. The test statistic is
$$\chi^2 = (n - 1)s^2/\sigma^2 = 17 \times 14.2^2/10^2 = 34.28.$$
4. From the row in the tables for $\chi^2_{17}$, we see that 34.28 is significant at the 1% level.
5. We conclude that the security being investigated does not meet the variability requirements of the portfolio manager ($\chi^2_{17} = 34.28$, $P < 0.01$), and should be regarded by him as too risky.
Example 11C: A certain moulded concrete product is manufactured in Johannesburg
and Cape Town. Since the aggregate used in the production might differ between the
two regions, management is interested in comparing the mass and variability of the
product between the regions. A sample of 43 items produced in Johannesburg had a
mean mass of 53.2 kg with a standard deviation of 4.2 kg, while a sample of 23 items
produced in Cape Town had a mean mass of 54.4 kg with a standard deviation of
3.3 kg. Set up 95% confidence intervals for the population means and standard deviations
for products manufactured in both Johannesburg and Cape Town. Test if there is a
significant difference between the means and variances of the two populations. If you find
there is no significant difference, find appropriate pooled estimates of the overall mean
and standard deviation, and find 95% confidence intervals for these overall estimates.


Solutions to examples . . .
3C $D^2$ (or $\chi^2_2$) = 2.78, $P > 0.20$ ($\chi^{2(0.20)}_{2} = 3.219$). Not significant.
4C $\lambda = 2.25$. $D^2$ (or $\chi^2_4$) = 3.00, $P > 0.20$ ($\chi^{2(0.20)}_{4} = 5.989$). Not significant, Poisson distribution fits.
7C $D^2$ (or $\chi^2_6$) = 14.4198, $P < 0.05$ ($\chi^{2(0.05)}_{6} = 12.592$). Significant, mode of scoring dependent on opponents.
8C $D^2$ (or $\chi^2_6$) = 29.97, $P < 0.001$ ($\chi^{2(0.001)}_{6} = 22.45$). Significant, type of investment dependent on risk.
11C 95% confidence intervals tabulated as follows:

| | Mean | Standard deviation |
|---|---|---|
| Johannesburg | (51.9, 54.5) | (3.5, 5.3) |
| Cape Town | (53.0, 55.8) | (2.6, 4.7) |
| Pooled | (52.6, 54.6) | (3.3, 4.7) |

Test of variances: $F_{42,22} = 1.62$, $P > 0.10$, pooling of variances justified.
Pooled estimate of standard deviation: $s = 3.9$ with 64 degrees of freedom.
Test of means: $t = 1.19$, $P > 0.20$, no significant difference.
Pooled estimate of mean: $\bar{x} = 53.6$, $n = 66$.

Exercises on goodness of fit tests . . .


10.1 On a true-false test with 100 questions, a student gets 61 correct. The lecturer
claims that the student was merely guessing, and was just lucky to get such a high
mark. Test this claim at the 5% significance level.
10.2 In 20 soccer cup finals in South Africa between 1959 and 1978, the 40 teams involved scored the following numbers of goals per match:

| Goals per match | 0 | 1 | 2 | 3 | 4 | 5 | 6 or more |
|---|---|---|---|---|---|---|---|
| Number of teams | 8 | 6 | 13 | 8 | 3 | 2 | 0 |

Is it true that the number of goals scored per team per match fits a Poisson distribution?


10.3 A large furniture store also sells television sets. Their planning department is
interested in the distribution of their daily sales of television sets. A sample of
210 days shows the following sales volumes:

| Number of sales | Observed frequency (days) |
|---|---|
| 0 | 60 |
| 1 | 71 |
| 2 | 53 |
| 3 | 20 |
| 4 or more | 6 |

Test at the 5% significance level whether the daily sales volumes occur in accordance with a Poisson distribution.
10.4 In order to check whether a die is balanced it was rolled 240 times and the following
results were obtained:

| Face | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| No. of occurrences | 33 | 51 | 49 | 36 | 32 | 39 |

Test the hypothesis that the die is balanced.


10.5 There is an annoying traffic light at an intersection near your home. Each time you leave home, you note the colour as you pass the last lamp-post before the traffic light. During a month, your records show that there was a red light 95 times, amber 15 times, and green 30 times. Thinking this unreasonable, you contact the Traffic Department, who claim that the light is set to be red for 50%, amber for 10% and green for 40% of the time for traffic arriving at the intersection from the direction of your home. Test the Traffic Department's claim at the 1% significance level.

10.6 A vegetable processing company has frozen peas as one of its major lines. They have employed a consultant to research ways of improving their product. The consultant experiments with hybrids between two varieties of peas. He knows that, according to the Mendelian inheritance theory, when the two varieties of peas are hybridized, 4 types of seeds (yellow smooth, green smooth, yellow wrinkled and green wrinkled) are produced in the ratio 9 : 3 : 3 : 1. In the experiment, the consultant observes 102 yellow smooth, 30 green smooth, 42 yellow wrinkled and 15 green wrinkled seeds. Are the results consistent with the theory at the 5% level of significance?
10.7 It is hypothesised that the number of bricks laid per day by a team follows a normal distribution with mean 2000 and standard deviation 300. A record of bricks laid over a typical 100 day period produces the following data:

| Bricks laid per day | Number of days |
|---|---|
| under 1500 | 5 |
| 1500–1750 | 14 |
| 1750–2000 | 30 |
| 2000–2250 | 28 |
| 2250–2500 | 17 |
| over 2500 | 6 |

Test the above hypothesis.


10.8 A survey of 320 families with 5 children revealed the distribution shown in the table below. Is this data consistent with the hypothesis that male and female births are equally probable?

| No. of males | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|
| No. of families | 18 | 56 | 110 | 88 | 40 | 8 |

Exercises on contingency tables . . .


10.9 A hair colour company wants to know if blondes have more boy friends. They
obtained the following results from a random sample.

              No boy friends   One boy friend   More than one boy friend
blonde               4               17                   30
non-blonde          19               43                   62

What should the company decide?


10.10

Records of 500 car accidents were examined to determine the degree of injury
to the driver and whether or not he was wearing a safety belt at the time of the
accident. The data are summarized below:

                  Minor injury   Major injury   Death
Safety belt            128             49         11
No safety belt         168            104         40


Test the hypothesis that the severity of the injury to the driver was independent
of whether or not the driver wears a safety belt. Use a 1% significance level. By
comparing observed and expected values, interpret your result.
10.11 An insecticide company is concerned about the effectiveness of their product across
a range of insect species. 200 specimens of each of four species of insects are placed
in a container and a prescribed amount of an insecticide is added. After an hour
the number of survivors for each species are noted:

            Species (one column per species)
Survived     27     42     68     31
Killed      173    158    132    169

Does the insecticide differ in its effectiveness according to species at the 1% level
of significance?
10.12 Does the following sample indicate in the population sampled that preference for
cars of certain makes is independent of sex?

           Make of car
            A      B      C
Men        60     80    110
Women      80     70    100

10.13

A nationwide survey is conducted to determine the public's attitude towards the
abolition of capital punishment, and to compare it with that of police officers.
A sample of 300 members of the public produced the results:

In favour   Indifferent   Against
   90           60          150

A sample of 100 police officers showed:

In favour   Indifferent   Against
   20           10           70

Do the attitude patterns between police and public differ significantly? Use a 1%
significance level.


Exercises on the sample variance . . .


10.14

The amount of drug put into a tablet must remain very constant. For a certain
drug the maximum acceptable standard deviation is 1 mg. Analysis of 10 tablets
produced the following data (amount of drug in milligrams):
26 28 26 25 32 27 29 25 30 28
(a) Construct a 95% confidence interval for the true standard deviation.
(b) Using a 1% significance level, decide whether the fluctuation in drug content
exceeds the acceptable level of 1 mg.

10.15

The exact contents of seven similar containers of motor oil were (in millilitres)
499 501 500 498 501 501 499

(a) Calculate the mean and standard deviation.


(b) What assumption is it necessary to make in order to construct confidence
intervals for the mean and the standard deviation?
(c) Calculate 95% confidence intervals for both the mean and the standard deviation.
10.16 An agricultural economist needs to estimate crop losses due to insect damage. To
do this, he needs to estimate the mean number of insects per square metre. He
wishes to be 95% sure of being within two insects of the true mean number/m$^2$.
He has no idea of the population variance and thus runs a pilot survey, collecting
data from 61 quadrats of 1 m$^2$. The sample mean is 85.1 and the sample variance
$s^2 = 101.4$.
(a) Find 95% confidence intervals for $\mu$ and for $\sigma^2$.
(b) What size sample is suggested in order to be 95% sure of getting within 2 of
the true mean?
(c) The researcher, however, is a cautious fellow, and thinks that his point estimate $s^2$ might be underestimating $\sigma^2$. Show that he can be 95% sure
that $\sigma^2 < 140.9$. Using this conservatively large estimate of $\sigma^2$, show that a
sample of size 141 is required.
(d) He therefore samples a further 80 quadrats, and computes $\bar{x} = 87.3$ and
$s^2 = 121.4$ from the combined sample. Find 95% confidence intervals for $\mu$
and $\sigma^2$.
(e) If the distribution of the number of insects per square metre is very skew,
comment on the reliability of the confidence intervals obtained in (d).


Solutions to exercises . . .
10.1 $D^2 = 4.84 > 3.841 = \chi^{2\,(0.05)}_{1}$, reject H0.

10.2 $D^2 = 3.84 < 4.642 = \chi^{2\,(0.20)}_{3}$.
Cannot reject H0, Poisson distribution provides satisfactory fit (P > 0.20).

10.3 $D^2 = 1.55 < 7.815 = \chi^{2\,(0.05)}_{3}$.
Cannot reject H0, Poisson distribution provides good fit.

10.4 $D^2 = 8.30 > 7.289 = \chi^{2\,(0.20)}_{5}$.
Possibility that die is not balanced (P < 0.20).

10.5 $D^2 = 21.07 > 9.210 = \chi^{2\,(0.01)}_{2}$, reject H0.

10.6 $D^2 = 3.08 < 7.815 = \chi^{2\,(0.05)}_{3}$, cannot reject H0.

10.7 $D^2 = 0.72 < 7.289 = \chi^{2\,(0.20)}_{5}$, cannot reject H0.

10.8 $D^2 = 11.96 > 11.071 = \chi^{2\,(0.05)}_{5}$.
Conclude that male and female births are not equally probable (P < 0.05).

10.9 $D^2 = 2.10 < 3.219 = \chi^{2\,(0.20)}_{2}$.
No evidence to show that blondes have more boyfriends (P > 0.20).

10.10 $D^2 = 11.63 > 9.210 = \chi^{2\,(0.01)}_{2}$, reject H0.

10.11 $D^2 = 30.81 > 11.345 = \chi^{2\,(0.01)}_{3}$, reject H0.

10.12 $D^2 = 4.00 > 3.219 = \chi^{2\,(0.20)}_{2}$.
Slight evidence (P < 0.20) for difference between sexes.

10.13 $D^2 = 12.47 > 9.210 = \chi^{2\,(0.01)}_{2}$, reject H0.

10.14 (a) (1.56 , 4.15).
(b) $D^2 = 46.4 > 21.666 = \chi^{2\,(0.01)}_{9}$, reject H0.

10.15 (b) We need to assume that the exact contents of the containers of motor oil are
normally distributed.
(c) For $\mu$: (498.7 , 501.0).
For $\sigma^2$: (0.613 , 7.160).

10.16 (a) For $\mu$: (82.52 , 87.69).
For $\sigma^2$: (73.03 , 150.32).
(b) $n = \left( t^{(0.025)}_{60}\, s / L \right)^2 = 102$.
(c) Because $\chi^{2\,(0.95)}_{60} = 43.185$,
$\Pr\left[ 60 \times 101.4 / \sigma^2 > 43.185 \right] = \Pr[\sigma^2 < 140.9] = 0.95$,
and n = 141.
(d) For $\mu$: (85.48 , 89.12).
For $\sigma^2$: (94.33 , 155.73).
(e) By the central limit theorem, the confidence interval for the mean is likely
to be reliable (unless the distribution is so very skew that even a sample of
size 141 is not large enough for the sample mean to be approximately normally
distributed). On the other hand, the confidence interval for $\sigma^2$ depends on
the distribution being normal, and hence is unreliable.

Chapter 11

PROPORTIONS AND SAMPLES SURVEYS
KEYWORDS: Population and sample proportions, confidence intervals and hypothesis tests for proportions, finite populations, sample
surveys, sampling methods, random numbers, random number tables.

Sample surveys . . .
Statisticians are frequently required to estimate the proportion of a population
having some characteristic. We are all familiar with the opinion polls that take place
around election time and which purport to inform us what proportion of the electorate
will support each party or each candidate. Market researchers run surveys to determine
the proportion of the population who saw and absorbed the television advertisement
for their client's product.
Ideally, if it were convenient, quick and affordable, one would choose to obtain data
from each element of the population. Then we could estimate the population proportion
exactly, using the information from the complete data set.
However, each datum we obtain takes some time to observe and record, and also
generates costs that must be covered. So a complete collection of all data may have
large time and cost impacts. In reality there are very often severe constraints on time
and money available for data collection and capture.
Estimating proportions from sample data, rather than from the complete population
data, is the usual challenge that confronts us. How could such a strategy of making
conclusions about the entire population, on the basis of only an incomplete subset of
the population, ever make sense?
In general the strategy can only make sense if we have reason to believe that the
two aggregated parts of the population, comprising the sampled and observed elements
on the one hand, and the unsampled and unobserved elements on the other, are
essentially similar or equivalent.
If that belief in equivalence is correct, then the sample can be thought of as being
representative of the unsampled group, as well as being obviously representative of
itself.
When the belief is correct, the sample will be representative not only of itself and of
the unsampled group, but therefore also representative of the population in its entirety.

In these circumstances, we believe the sample, the non-sample and the entire population
are essentially similar, and in that sense, representative of each other.
On the other hand, if we have no reason to believe in the equivalence of the sample
and non-sample subsets, or worse still, if we have reason to believe they are actually
different from one another, we have a severe problem. The strategy of using the sample
data to estimate population features will be incorrect and misleading.
The situation we have described thus far implies that we would still have to find a
way of ensuring the reasonableness or validity of a belief that a particular sample of
size n is representative of the population (of size N > n) from which it is drawn.
We cannot check the belief in the absence of information about the non-sampled
part of the population. Generally there is no such information available. Thus the belief
is not verifiable. If the information needed for verification were available, then there would be
little additional information or purpose that a sample could actually serve.
Instead of ensuring that a sample is representative, the statistician relies upon a less
strict criterion, called randomness, as a device by which to eliminate any conscious or
unconscious bias in the selection of sample elements from the population. In its simplest
form, randomness implies that, at every stage in the sampling, each and every element
in the population has the same chance of being selected into the sample as each and
every other population element.
It is possible to achieve this type of random selection in practice, using various simple
techniques such as listing and numerically labelling each element of a population (e.g.,
from 1 to N ), putting labels 1 to N into a container and then drawing n numbered
labels from the thoroughly shaken container.
When a sample consisting of n elements has been chosen through an appropriate
randomisation method, we indicate its near-representative quality, by calling it a random sample. A random sample is highly likely to be either representative or close to
representative of the population.
While it is possible that a random sample might turn out to be unrepresentative
of its population, that outcome is very rare if the sample size is moderate. Moreover,
the probability of such a technically possible outcome decreases to almost
zero as the random sample size n increases.
In practice, there are some golden rules for studying a population using statistical
methods:
1. Use random selection to ensure we have random samples.
2. Moderate sample size n can help keep costs and time requirements within limits.
3. Random sample size n should be large enough for all the objectives of the study.
In this chapter we focus upon the study objective of using random sample data, and
sample proportions to estimate corresponding population proportions.
The two most frequently asked questions (which are in fact interrelated) are:
1. What size sample is needed to estimate a proportion?
2. What is the reliability or margin of error of the estimate?
We will answer both questions.


Confidence intervals for proportions . . .


We will use $\pi$, the Greek letter p, to denote the true population proportion
($0 \le \pi \le 1$) and P to denote a sample proportion. P is used to estimate $\pi$, in the
same way as the sample mean $\bar{X}$ is used to estimate the population mean $\mu$, and the
sample variance $s^2$ is used to estimate the population variance $\sigma^2$.
Note that the meaning of $\pi$ here is different from its familiar use as the constant for
circular dimensions, $\pi \approx 22/7$ or $\pi \approx 355/113$. Here the use of $\pi$ parallels the use of $\mu$: both are Greek
letters for unknown constants, namely proportion and mean respectively.
Suppose we sample 1000 voters and 500 say that they will vote for a particular
candidate, then P = 0.50. We cannot claim this sample estimate to be the exact
population proportion $\pi$. Another sample may well yield P = 0.47 or P = 0.54.
Although the population proportion is constant (at least for a short period of time!),
the sample proportions will vary from sample to sample. P is yet another example of a
statistic; it is a random variable and therefore has a sampling distribution.
When the sample size n is large, and P is not close to either zero or one, the
sampling distribution of P is closely approximated by the normal distribution
\[ P \;\dot{\sim}\; N\!\left( \pi ,\ \frac{\pi(1-\pi)}{n} \right). \]
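The claim that sample proportions vary from sample to sample, with a spread close to $\sqrt{\pi(1-\pi)/n}$, can be checked by simulation. The following is a minimal sketch, not part of the original text: the function name, the number of replicates and the seed are illustrative assumptions, and the true proportion 0.5 is simply chosen to mirror the voter example.

```python
import random
import statistics

def simulate_sample_proportions(pi, n, replicates, seed=1):
    """Draw `replicates` random samples of size n from a population with
    true proportion pi, and return the sample proportion P of each one.
    (Hypothetical helper, for illustration only.)"""
    rng = random.Random(seed)
    proportions = []
    for _ in range(replicates):
        # Each sampled element has the characteristic with probability pi.
        successes = sum(1 for _ in range(n) if rng.random() < pi)
        proportions.append(successes / n)
    return proportions

props = simulate_sample_proportions(pi=0.5, n=1000, replicates=500)
simulated_sd = statistics.stdev(props)            # spread of the simulated P values
theoretical_sd = (0.5 * (1 - 0.5) / 1000) ** 0.5  # sqrt(pi(1 - pi)/n)
print(round(statistics.mean(props), 3), round(simulated_sd, 4), round(theoretical_sd, 4))
```

The mean of the simulated proportions should lie close to 0.5, and their standard deviation close to the theoretical value of about 0.0158.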

But $\pi$ is unknown (it is the very number we are trying to estimate), and therefore we
do not know the variance of this approximate normal distribution. In this case, however, we
can generally get away with substituting P for $\pi$ in the expression for the variance. This
substitution works well because, in sample surveys for proportions, we usually have large
samples, with n frequently equal to 1000 or more. Making this substitution,
we have that P has, to a good approximation, the normal distribution
\[ P \;\dot{\sim}\; N\!\left( \pi ,\ \frac{P(1-P)}{n} \right). \]
We now construct confidence intervals for $\pi$ in much the same way that we constructed confidence intervals for $\mu$ when the standard deviation $\sigma$ was assumed to be
known:
\[ (P - \pi) \Big/ \sqrt{\frac{P(1-P)}{n}} \;\dot{\sim}\; N(0, 1). \]
Thus
\[ \Pr\left[ -1.96 < (P - \pi) \Big/ \sqrt{\frac{P(1-P)}{n}} < 1.96 \right] = 0.95, \]
and, by rearrangement, 95% confidence intervals for $\pi$ are given by
\[ \Pr\left[ P - 1.96\sqrt{\frac{P(1-P)}{n}} < \pi < P + 1.96\sqrt{\frac{P(1-P)}{n}} \right] = 0.95. \]

The 99% confidence interval is found, as usual, by replacing 1.96 by 2.58.


It is important to understand that in rearranging the terms, we are changing the
meaning in our approach. We began with a probability statement about (random)
sample proportions P from a particular population with proportion $\pi$.
Instead, we now adopt a method of using a random sample to define an interval. This
method will have a chosen probability (e.g. 95%, or 99%) of providing an interval
covering the unknown true value of the population proportion $\pi$, and a corresponding
probability (e.g. 5%, or 1%) of failure to cover that value, regardless of the specific
population to which it is applied.
The purpose of the method is to handle the uncertainty that is always our predicament when we can only access information from part of a population, rather than the
entire population. The validity of the method depends upon the sample being random.
The subtlety of the change in meaning is that we have moved our attention from
simply and only the single random sample of size n that constitutes our data, to a
method of handling any random sample of the same size n from any population! We
interpret the re-arranged probability expression to imply we have 95% confidence that
the method will yield an interval that includes the unknown parameter value.
Example 1A: Suppose 500 voters in a sample of 1000 say they will vote for a candidate.
Find the 95% confidence interval for $\pi$.
We have P = 0.50 and n = 1000. So the 95% confidence interval for $\pi$ is given by
\[ \left( 0.5 - 1.96\sqrt{\frac{0.5(1-0.5)}{1000}} \;,\; 0.5 + 1.96\sqrt{\frac{0.5(1-0.5)}{1000}} \right)
= (0.5 - 0.031 \,,\, 0.5 + 0.031) = (0.469 \,,\, 0.531). \]

Equivalently, we can say that we are 95% sure that the interval (46.9% , 53.1%)
contains the true population proportion. Note that, in all our formulae and calculations,
proportions lie between 0 and 1. However, it is often convenient to communicate our
results as percentages. The quantity that we add to, and subtract from, the point
estimate of proportion to form the confidence intervals, is called the reliability margin
of the estimate at the given confidence level, and is conventionally denoted by L,
and expressed as a percentage. Thus
\[ L = 100 \times 1.96\sqrt{\frac{P(1-P)}{n}} \ \% \]
at the 95% confidence level.
In this example, the plus/minus term is $100 \times 0.031 = 3.1\%$ so we say that, at
the 95% confidence level, our estimate has a reliability margin of 3.1%. Note that the
confidence interval for the percentage is of the form $(100P \pm L)\%$, as in $50.0\% \pm 3.1\%$.
Note that the use of the term reliability here is different from (and inverse to) the way
it is used in common speech. Here reliability margin is a margin of error or variability
that naturally arises in random samples of a given size n. Our preference is always
for smaller reliability margin values, because the corresponding confidence intervals are
narrower. We say that such estimates are more precise.
The only way to achieve this goal of greater precision is to reduce L. By inspection,
we can see that L is small when $\sqrt{n}$ is large, and hence when n is large. In principle we prefer a sample
size n as large as any time and cost constraints will permit.
We now have an answer to the second of the two questions we asked at the beginning
of the chapter.
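The calculation of a confidence interval and its reliability margin is easily automated. The sketch below is only an illustration under the large-sample normal approximation described above; the function name `proportion_ci` and its arguments are assumptions introduced here, not notation from the text.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Approximate confidence interval for a population proportion.

    Returns (lower, upper, L), where L is the reliability margin in percent.
    Assumes a large random sample from a very large population.
    z = 1.96 gives the 95% level; use z = 2.58 for the 99% level.
    """
    p = successes / n                            # sample proportion P
    half_width = z * math.sqrt(p * (1 - p) / n)  # z * sqrt(P(1 - P)/n)
    return p - half_width, p + half_width, 100 * half_width

# Example 1A: 500 of 1000 sampled voters support the candidate.
lower, upper, margin = proportion_ci(500, 1000)
print(f"({lower:.3f}, {upper:.3f}), L = {margin:.1f}%")  # (0.469, 0.531), L = 3.1%
```

Calling `proportion_ci(500, 1000, z=2.58)` would give the corresponding 99% interval, which is wider.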


Example 2B: A market research company establishes that 28 out of 323 randomly
sampled households have more than one television set. Compute a 95% confidence
interval for the percentage of households having more than one television set. What is
the reliability margin, at the 95% level, of the estimate?
We calculate P = 28/323 = 0.0867, and n = 323. Thus the 95% confidence interval
is
\[ \left( 0.0867 - 1.96\sqrt{\frac{0.0867 \times 0.9133}{323}} \;,\; 0.0867 + 0.0307 \right) = (0.0560 \,,\, 0.1174). \]
Expressing proportion in percentage terms, the percentage of households with more
than one television set is within the confidence interval (5.60%, 11.74%). The reliability
margin of our estimate at the 95% confidence level is 3.07%.
Example 3C: In a questionnaire, 146 motorists out of a (random) sample of 252 stated
that when they next replaced tyres on their cars, they would insist on radial ply tyres.
Find a 95% confidence interval for the proportion of motorists in the population who
will get radials. What is the reliability margin at this confidence level?

Finite populations . . .
The method used above presupposes that the population being sampled is infinite,
or that the sampling is done with replacement. If neither of these assumptions is true,
we are sampling without replacement from a finite population. Then the random error
in the method will be reduced.
Intuitively we might expect this result because, by sampling without replacement,
no element can be selected twice, and every resulting sample of size n has a greater
chance of being representative of the population. Effectively we have eliminated all the
random samples which have any duplicated elements.
If the size of the population being sampled is N , and the sampling of an element
is done randomly n times without replacement, then the sampling distribution of P is
once again approximately normal, but with a reduced variance:


\[ P \;\dot{\sim}\; N\!\left( \pi ,\ \frac{P(1-P)(N-n)}{n(N-1)} \right). \]
Confidence intervals are constructed using the same procedure as before, with the
necessary modification for the diminished variance. Thus, the 95% confidence interval
is given by
\[ \left( P - 1.96\sqrt{\frac{P(1-P)(N-n)}{n(N-1)}} \;,\; P + 1.96\sqrt{\frac{P(1-P)(N-n)}{n(N-1)}} \right) \]
with a reliability margin of our estimator P at the 95% confidence level given by
\[ L = 100 \times 1.96\sqrt{\frac{P(1-P)(N-n)}{n(N-1)}} \ \%. \]


Example 4B: A lecturer, anxious to estimate the overall pass rate in a class of 823 students, takes a random sample of 100 scripts and finds 27 failures. Find a 95% confidence
interval for the failure rate.
We have P = 0.27, n = 100, and N = 823. Because the sample size is more than
10% of the population size we use the modified confidence interval:
\[ \left( 0.27 - 1.96\sqrt{\frac{0.27(1-0.27)(823-100)}{100(823-1)}} \;,\; 0.27 + 0.082 \right) = (0.188 \,,\, 0.352). \]
At this stage, all the lecturer can say is that he has used a method for which there
is a 95% probability that the interval (18.8% , 35.2%) includes the true class failure
rate. The reliability margin at this confidence level is 8.2%, which is a comparatively
large figure, and the interval so wide it does not provide useful information! To get a
narrower confidence interval (at the 95% level), a larger sample size would be needed.
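The finite population calculation of Example 4B can be sketched in the same way. The function name `finite_population_ci` is an assumption made for illustration; the formula is the reduced-variance expression given in this section.

```python
import math

def finite_population_ci(successes, n, N, z=1.96):
    """Approximate confidence interval for a proportion when n elements are
    sampled without replacement from a finite population of size N.

    Returns (lower, upper, L), where L is the reliability margin in percent.
    """
    p = successes / n
    # Variance reduced by the factor (N - n)/(N - 1).
    variance = p * (1 - p) * (N - n) / (n * (N - 1))
    half_width = z * math.sqrt(variance)
    return p - half_width, p + half_width, 100 * half_width

# Example 4B: 27 failures among 100 scripts sampled from a class of 823.
lower, upper, margin = finite_population_ci(27, 100, 823)
print(f"({lower:.3f}, {upper:.3f}), L = {margin:.1f}%")  # (0.188, 0.352), L = 8.2%
```

As N grows with n fixed, the factor (N - n)/(N - 1) approaches 1 and the interval approaches the one for an infinite population.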
Example 5C: In a constituency of 8000 voters, 748 voters out of a sample of 1341
voters state they will support the Materialistic Party, 510 voters state they will vote
for the Ecological Party and the remaining 83 voters are undecided. Find 95% confidence intervals for the population proportions which will support each party, and the
population proportion of undecided voters. What are the associated reliabilities?

Sample sizes . . .
When you approach a statistician with our first question, "What size sample is
needed?", he will reply by asking four further questions:
1. What is N , the size of the population?
2. Do you want 95% or 99% (or some other level) confidence intervals?
3. What margin of error or reliability margin (L%) can you accept?
4. Do you have a rough estimate, P, of $\pi$?
The two formulae for the reliability margin L given earlier connect these four quantities with the sample size. If the population was infinite, we had
\[ L = 100 \times 1.96\sqrt{\frac{P(1-P)}{n}}. \]
Making n the subject of the formula yields
\[ n = (100/L)^2 \times 1.96^2 \times P(1-P). \]
This formula is appropriate if the answers to the four questions are:
1. The population is very large.
2. 95% confidence level. (For 99% confidence level use 2.58 in place of 1.96. For
other levels use the appropriate value from the normal tables.)
3. Reliability margin of L% is given a value.
4. The rough estimate of $\pi$ is substituted for P. If no estimate of $\pi$ is available,
let P = 0.5. To see the logic behind this choice, let us consider the function
$y = P(1-P)$ further. In the region of interest, for values of P between 0 and 1,
the graph looks like this:

[Figure: graph of $P(1-P)$ (vertical axis, from 0.00 to 0.25) against P (horizontal axis, from 0.0 to 1.0).]
It is easy to show that the maximum of $y = P(1-P)$ occurs at P = 0.5. Thus
taking P = 0.5 in the sample size formula gives the largest possible sample size
that might be required to achieve a margin of error L%. Because it is an expensive
exercise to take a sample, we like our samples to be as small as possible. Thus if
an estimate of $\pi$ is available, we should always use it to determine sample size,
because it will yield a smaller n. If you wish to err on the side of caution, then, in
the sample size formula, use a value for P which is a little closer to 0.5 than your
estimate of $\pi$.
If the answer to question 1 is that the population is finite and of size N, then
we determine n from the reliability margin expression incorporating the reduced
variance:
\[ n = \frac{N}{1 + \dfrac{N (L/100)^2}{1.96^2 \, P(1-P)}}. \]
In finite populations randomly sampled without replacement, the required minimal
sample size n given by this formula will always be smaller than the size required for random
sampling with replacement.
Example 6A: What size sample is needed in each of the following situations? The
numbers 1 to 4 refer to the four relevant questions for determining the sample size.
       1. N        2. Level   3. L     4. P
(a)    infinite    95%        3%       0.5
(b)    infinite    95%        3%       0.25
(c)    10 000      95%        2%       0.35
(d)    20 000      99%        1%       0.10
(e)    15 000      99%        1.5%     0.85

(a) $n = (100/3)^2 \times 1.96^2 \times 0.5 \times 0.5 = 1068$.
(b) $n = (100/3)^2 \times 1.96^2 \times 0.25 \times 0.75 = 801$.
(c) \[ n = \frac{10\,000}{1 + \dfrac{10\,000\,(2/100)^2}{1.96^2 \times 0.35 \times 0.65}} = 1794. \]
(d) \[ n = \frac{20\,000}{1 + \dfrac{20\,000\,(1/100)^2}{2.58^2 \times 0.1 \times 0.9}} = 4610. \]
(e) \[ n = \frac{15\,000}{1 + \dfrac{15\,000\,(1.5/100)^2}{2.58^2 \times 0.85 \times 0.15}} = 3015. \]

Example 7B: The Student Health Service at a university with 12 000 students wishes
to conduct a Health Awareness Survey by interviewing a sample of students. They desire
a reliability margin of not more than 5% at a 95% confidence level in all questions that
seek to estimate proportions. What size sample is required?
We are given N = 12 000, we are told to find a 95% confidence interval (so z = 1.96)
and to use reliability margin of L = 5%. In the absence of any guidance about the likely
value of $\pi$, we use P = 0.5 in the sample size formula because it gives the maximum
possible sample size. Thus
\[ n = \frac{12\,000}{1 + \dfrac{12\,000\,(5/100)^2}{1.96^2 \times 0.5 \times 0.5}} = 373. \]
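The two sample size formulae can be collected into a short sketch. The function names are assumptions made for illustration. The result is rounded up to the next whole number, which appears to be the convention used in Examples 6A and 7B.

```python
import math

def sample_size_infinite(L, P=0.5, z=1.96):
    """Sample size for reliability margin L (in percent), very large population."""
    return math.ceil((100 / L) ** 2 * z ** 2 * P * (1 - P))

def sample_size_finite(N, L, P=0.5, z=1.96):
    """Sample size for reliability margin L (in percent), finite population of
    size N, using the finite population formula given in this section."""
    return math.ceil(N / (1 + N * (L / 100) ** 2 / (z ** 2 * P * (1 - P))))

# Example 6A (a) and (b): very large populations.
print(sample_size_infinite(3, P=0.5))              # 1068
print(sample_size_infinite(3, P=0.25))             # 801
# Example 6A (c) to (e): finite populations; 99% level uses z = 2.58.
print(sample_size_finite(10_000, 2, P=0.35))       # 1794
print(sample_size_finite(20_000, 1, P=0.10, z=2.58))   # 4610
print(sample_size_finite(15_000, 1.5, P=0.85, z=2.58)) # 3015
# Example 7B: no prior estimate of the proportion, so P = 0.5.
print(sample_size_finite(12_000, 5))               # 373
```

Leaving P at its default of 0.5 gives the most cautious (largest) sample size, as explained above.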

Example 8C: In a town of 45 000 households, a market research organisation is


conducting a survey into the use of three brands of soap powder. They want to estimate
the proportion of households using each product, and at the 95% confidence level, they
want a reliability margin of 2%. They provisionally estimate the proportion of users of
Brand A as 15%, of Brand B as 22% and of Brand C as 17%. What size sample do they
need to take?

Testing that the proportion is a specific value . . .


Example 9A: It is suspected that, in a lower-middle-class suburb, residents replace
their cars less frequently than the national average. For example, it is known that, nationally, the proportion of cars under three years old is 27.1%. A researcher investigates
and finds that 37 out of 155 cars belonging to residents of the suburb were less than 3
years old. At the 5% significance level, test whether the proportion of cars under three
years old in the suburb is less than the national average.
1. The null hypothesis states that the true proportion of cars under three years old in the lower-middle-class suburb is the same as the national proportion:
H0 : $\pi = 0.271$.
2. The alternative hypothesis expresses the suspicion:
H1 : $\pi < 0.271$.
3. Significance level : 5%.
4. The test statistic, we will see, has an approximate normal distribution. Thus, as
in chapter 8, we will reject H0 if the observed value of $z < -1.64$.


5. Test statistic. We asserted earlier that the random variable P, the estimate of $\pi$,
has a normal distribution. If H0 is true, then if P is based on a sample of size n,
its distribution is given by
\[ P \;\dot{\sim}\; N\!\left( \pi ,\ \frac{\pi(1-\pi)}{n} \right). \]
Thus the test statistic is
\[ Z = \frac{P - \pi}{\sqrt{\dfrac{\pi(1-\pi)}{n}}} \;\dot{\sim}\; N(0, 1). \]
Note that the null hypothesis specifies the value for $\pi$. For this problem, P =
37/155 = 0.239 and n = 155, from the sample, and $\pi = 0.271$ is specified by the
null hypothesis. Thus
\[ z = (0.239 - 0.271) \Big/ \sqrt{\frac{0.271(1-0.271)}{155}} = -0.896. \]

6. Because $-1.64 < -0.896$, we are not able to reject H0. Thus our data does not
indicate that residents of this suburb buy new cars significantly (i.e. discernibly)
less frequently than the national average.
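The test of Example 9A can be sketched in a few lines. The function name is an assumption, and `statistics.NormalDist` from the Python standard library is used in place of the normal tables. Because the code keeps P = 37/155 unrounded, it gives z of about -0.90 rather than the -0.896 obtained above with P rounded to 0.239; the conclusion is unchanged.

```python
import math
from statistics import NormalDist

def one_proportion_z(successes, n, pi0):
    """z statistic for testing H0: pi = pi0, using the variance pi0(1 - pi0)/n
    that applies when H0 is true (large sample, very large population)."""
    p = successes / n
    return (p - pi0) / math.sqrt(pi0 * (1 - pi0) / n)

# Example 9A: 37 of 155 cars were under three years old; H0: pi = 0.271.
z = one_proportion_z(37, 155, 0.271)
p_value = NormalDist().cdf(z)   # lower-tail area, because H1 is pi < 0.271
reject_h0 = z < -1.64           # one-sided critical value at the 5% level
print(round(z, 3), round(p_value, 3), reject_h0)
```

A lower-tail area well above 0.05 corresponds to the decision not to reject H0.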
Example 10B: A town of 3000 households is subjected to an intensive advertising
campaign for Easyspread yellow margarine, including free samples. Beforehand, the
proportion of Easyspread users in the town was the same as the national average, 0.132.
One week later, a sample of 350 households was asked if they had bought Easyspread
within the last seven days, and 64 made a positive response. Has the campaign been
effective?
1. H0 : $\pi = 0.132$, the proportion of Easyspread users has not changed.
2. H1 : $\pi > 0.132$ (if the campaign is successful).
3. Test statistic. Because of the finite population size we use the adjusted variance:
\[ P \;\dot{\sim}\; N\!\left( \pi ,\ \frac{\pi(1-\pi)(N-n)}{n(N-1)} \right). \]
Our test statistic is thus
\[ Z = (P - \pi) \Big/ \sqrt{\frac{\pi(1-\pi)(N-n)}{n(N-1)}} \;\dot{\sim}\; N(0, 1). \]
For this example, P = 64/350 = 0.183, n = 350, N = 3000 and $\pi = 0.132$:
\[ z = (0.183 - 0.132) \Big/ \sqrt{\frac{0.132 \times 0.868 \times 2650}{350 \times 2999}} = 3.00. \]
4. From the tables of the normal distribution at z = 3.00 we obtain
Pr = 0.49865, so that the upper tail area is 0.5 - 0.49865 = 0.00135, and this z-value is significant at better than the 0.5% level.
5. We conclude that the campaign has been highly effective in increasing Easyspread's
share of the market (z = 3.00, P < 0.005).


Example 11C: A manufacturer claims that his market share is 60%. However, a
random sample of 500 customers reveals that only 275 are users of his product. Test at
the 5% significance level whether the population market share is less than that claimed
by the manufacturer.

Testing for a difference between proportions . . .


Example 12A: At election time surveys are conducted in two suburbs which fall into
the same constituency. In Broomhill, 175 voters out of a sample of 318 were in favour
of a given candidate. In Crosspool, 143 voters out of a sample of 307 were in favour of
this candidate. At the 5% level, is there a difference between the proportions of voters
supporting the candidate in each suburb?
1. Let $\pi_1$ and $\pi_2$ be the population proportions supporting the candidate in Broomhill
and Crosspool, respectively. The null hypothesis says that the proportions in each
suburb are the same:
H0 : $\pi_1 = \pi_2$.
2. The alternative hypothesis states the population proportions are unequal:
H1 : $\pi_1 \neq \pi_2$.
3. Significance level : 5%.
4. Rejection region. Assuming that the test statistic will have the standard normal
distribution, and because we have a two-tailed test at the 5% significance level, we
will reject H0 if |z| > 1.96.
5. Test statistic. We know that $P_1$ and $P_2$, the sample estimates of $\pi_1$ and $\pi_2$ based
on samples of size $n_1$ and $n_2$ respectively, have normal distributions
\[ P_1 \;\dot{\sim}\; N\!\left( \pi_1 ,\ \frac{\pi_1(1-\pi_1)}{n_1} \right), \qquad
P_2 \;\dot{\sim}\; N\!\left( \pi_2 ,\ \frac{\pi_2(1-\pi_2)}{n_2} \right). \]
In chapter 5, we saw that the difference between two independent random variables, each having
a normal distribution, also has a normal distribution. Thus we can write
\[ P_1 - P_2 \;\dot{\sim}\; N\!\left( \pi_1 - \pi_2 ,\ \frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2} \right). \]
If H0 is true, then $\pi_1 = \pi_2 = \pi$ (say) and
\[ P_1 - P_2 \;\dot{\sim}\; N\!\left( 0 ,\ \pi(1-\pi)\left( \frac{1}{n_1} + \frac{1}{n_2} \right) \right). \]
The value of $\pi$ needs to be estimated from the observed values $Y_1 = 175$ and
$Y_2 = 143$, from random samples of size $n_1$ and $n_2$. If H0 is true, then $P_1$ and $P_2$ are
both estimates of $\pi$, and we combine them to get a pooled estimate of $\pi$. The
pooled estimate is called P, and is computed as a weighted average of $P_1$ and $P_2$:
\[ P = (n_1 P_1 + n_2 P_2)/(n_1 + n_2) = (Y_1 + Y_2)/(n_1 + n_2). \]



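We can say, approximately, that
\[ P_1 - P_2 \;\dot{\sim}\; N\!\left( 0 ,\ P(1-P)\left( \frac{1}{n_1} + \frac{1}{n_2} \right) \right) \]
so that the test statistic is
\[ Z = \frac{P_1 - P_2}{\sqrt{P(1-P)\left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}} \;\dot{\sim}\; N(0, 1). \]
For our example, $P_1 = 175/318 = 0.550$, $P_2 = 143/307 = 0.466$, $n_1 = 318$, $n_2 = 307$.
The pooled estimate P is given by
\[ P = (318 \times 0.550 + 307 \times 0.466)/(318 + 307) = (175 + 143)/625 = 0.509. \]
We compute the test statistic:
\[ z = \frac{0.550 - 0.466}{\sqrt{0.509 \times 0.491 \left( \dfrac{1}{318} + \dfrac{1}{307} \right)}} = 2.10. \]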

6. We reject H0 because the test statistic z = 2.10 satisfies |z| > 1.96, and conclude
that the difference between the proportions is, at the 5% level, significant.
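The pooled two-sample test of Example 12A can be sketched as follows. The function name `two_proportion_z` is an assumption made for illustration. Because the code works with the unrounded proportions, it gives z of about 2.11 rather than the 2.10 obtained above from rounded values; the decision is the same.

```python
import math

def two_proportion_z(y1, n1, y2, n2):
    """z statistic for H0: pi1 = pi2, using the pooled estimate
    P = (Y1 + Y2)/(n1 + n2) in the variance."""
    p1, p2 = y1 / n1, y2 / n2
    pooled = (y1 + y2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error

# Example 12A: Broomhill 175 of 318, Crosspool 143 of 307.
z = two_proportion_z(175, 318, 143, 307)
reject_h0 = abs(z) > 1.96   # two-tailed test at the 5% level
print(round(z, 2), reject_h0)
```

For a one-sided alternative, as in Example 13B, the same statistic would be compared with a one-tailed critical value instead.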
Example 13B: Miss Jones, the senior typist, made errors on 15 out of 125 pages of
typing, whilst Miss Smith made errors on 44 pages out of 255 pages of typing. Is Miss
Jones's error rate significantly lower than Miss Smith's?
1. H0 : $\pi_1 = \pi_2$.
2. H1 : $\pi_1 < \pi_2$.
3. The test statistic is
\[ Z = (P_1 - P_2) \Big/ \sqrt{P(1-P)\left( \frac{1}{n_1} + \frac{1}{n_2} \right)}. \]
We have $P_1 = 15/125 = 0.120$, $P_2 = 44/255 = 0.173$, $n_1 = 125$, $n_2 = 255$. We
compute the pooled estimate P:
\[ P = (125 \times 0.120 + 255 \times 0.173)/(125 + 255) = (15 + 44)/380 = 0.155. \]
Thus
\[ z = (0.120 - 0.173) \Big/ \sqrt{0.155 \times 0.845 \left( \frac{1}{125} + \frac{1}{255} \right)} = -1.34. \]
The test statistic is significant only at the 10% level. We conclude that the difference between the error rates is nearly significant (z = -1.34, P < 0.10).


Example 14C: In a random sample of 350 students, 47 were overweight, while in a
random sample of 176 businessmen, 36 were overweight. Do these data support the
hypothesis that a larger proportion of businessmen are overweight than students?
You can also test the hypothesis of Example 14C using the test of association in a
2 × 2 contingency table, which we learnt in Chapter 10. Check this statement. The two
tests can be shown to be mathematically equivalent; both make the same assumptions,
and will always lead to the same decision about accepting or rejecting the null hypothesis.
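The equivalence can be illustrated numerically. The sketch below uses the data of Example 12A (not Example 14C, which is left as an exercise) and compares the square of the z statistic with the chi-squared statistic $D^2$ for the corresponding 2 × 2 table. Both function names are assumptions, the chi-squared statistic is computed without any continuity correction, and the comparison applies to a two-sided alternative.

```python
import math

def two_proportion_z(y1, n1, y2, n2):
    """Pooled z statistic for H0: pi1 = pi2."""
    pooled = (y1 + y2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (y1 / n1 - y2 / n2) / standard_error

def chi_squared_statistic(table):
    """D^2 = sum of (observed - expected)^2 / expected over all cells,
    with expected = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    d2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            d2 += (observed - expected) ** 2 / expected
    return d2

# Example 12A as a 2 x 2 table: rows are in favour / not in favour,
# columns are Broomhill / Crosspool.
table = [[175, 143],
         [318 - 175, 307 - 143]]
z = two_proportion_z(175, 318, 143, 307)
d2 = chi_squared_statistic(table)
print(round(z ** 2, 3), round(d2, 3))   # the two printed values agree
```

Since the 5% critical value for chi-squared with 1 degree of freedom, 3.841, is the square of 1.96, the two tests reach the same decision.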

Random Sampling methods . . .


We have stated in many examples that we have a "sample" or a "random sample".
It is now time to consider how random samples are obtained. An important branch of
statistics is called sampling theory, and we will introduce a few of the concepts.
Two general types of sampling exist: probability and non-probability. Probability sampling includes simple random sampling, stratified, cluster, area, double (two-phase), multi-stage and sequential random sampling. Non-probability sampling includes
judgement, systematic and quota sampling and is characterized by the fact that no check
can be kept on sampling errors for estimates. Various combinations of these methods
may, or sometimes must, be used to fit individual circumstances. Randomness should
be sought wherever possible, but problems of non-response and organization sometimes
militate against this ideal.
A standard method of drawing a simple random sample is the use of random
number tables. Table 6 is an example of such a table. Tables of up to one million
random digits have been prepared for this purpose, and there are standard computer
package programs that will easily generate them.
Assume we wish to draw a simple random sample of 320 from a population of 8724
dwelling units, each of which has been allocated a number from 1 to 8724. We start by
arbitrarily choosing a starting point somewhere in the table, say the 6th row and the
1st column in Table 6. This starting point gives us the numbers 1164 4842 2873. We use
these four-digit numbers to select the sample. As long as we work through the table
systematically (here along the rows) we will obtain random digits.
Thus we would get the numbers 1164, 4842, 2873, 6089, 9329, 7601, 5677, 7791, 5219,
7374 etc. If a number occurs which is greater than 8724 we ignore and delete it and
select another until we have selected 320 distinct numbers between 1 and 8724. Thus we
would delete 9329. If the population consists of data stored on a computerized accounting
system or municipal records file, the procedure is simple. Units can also be chosen on an
area or spatial basis, where each random number defines the coordinates of a block or
quadrat. The important thing to remember is that each unit being potentially sampled
should have an equal chance of being chosen. Consequently, samples of families chosen
on an area basis should take into account varying population densities to ensure equal
probabilities of selection for all families. All the statistical methods we have devised in
this course are applicable to simple random samples. We mention briefly some other
sampling strategies.
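On a computer, the dwelling-unit sample described above can be drawn directly. The sketch below uses Python's standard `random` module; the seed value and the helper name `draw_by_rejection` are arbitrary illustrative choices. The first draw uses the library's built-in sampling without replacement; the helper mimics the table procedure of reading four-digit numbers and discarding those that are out of range or already chosen.

```python
import random

POPULATION_SIZE = 8724  # dwelling units numbered 1 to 8724
SAMPLE_SIZE = 320

rng = random.Random(2010)  # fixed seed (arbitrary) so that the draw can be repeated

# Direct method: sample without replacement from the unit numbers.
sample = rng.sample(range(1, POPULATION_SIZE + 1), SAMPLE_SIZE)


def draw_by_rejection(rng, population_size, sample_size):
    """Mimic the random number table: read four-digit numbers, skip unusable ones."""
    chosen = []
    seen = set()
    while len(chosen) < sample_size:
        number = rng.randint(0, 9999)  # one four-digit group from the 'table'
        if 1 <= number <= population_size and number not in seen:
            seen.add(number)
            chosen.append(number)
    return chosen


table_style_sample = draw_by_rejection(rng, POPULATION_SIZE, SAMPLE_SIZE)
print(sorted(sample)[:10])
print(table_style_sample[:10])
```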
Stratification is the separation of an entire population into several groups (called
strata), on the basis of some known information. Strata are constructed so that the
elements within any stratum are very similar to one another, but also tend to be at least
somewhat different from elements in the other strata. Stratifying the population enables
us to eliminate the possibility of obtaining particular kinds of unrepresentative samples.


By randomly sampling within each stratum we obtain a composite random sample that
is suitably balanced across the strata.
The population and hence the sample may be stratified with regard to one or more
classification variables, the number depending on practical limitations (e.g. gender,
education level, type of dwelling being known for each person in a region of interest,
before the sampling begins).
In cases where the proportions in the sample from each subpopulation are the same
as the proportions in the population we say we have proportional stratified random
sampling. When the proportions do not correspond, some form of weighting must
usually be applied to each subsample in order to draw conclusions about the whole
population.
Often the stratification allows us to obtain very much better estimates and much
narrower confidence intervals, than corresponding amounts spent on a simple random
sample. Alternatively, we may save on sampling costs because we can achieve the same
reliability margins as a simple random sample, but with a smaller total count of sample
units.
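Proportional stratified random sampling can be sketched as follows. The three strata and their sizes are hypothetical, invented purely for illustration, and the rounding rule (largest remainders) is one reasonable choice among several for making the stratum sample sizes add up to the required total.

```python
import random


def proportional_allocation(strata_sizes, total_sample):
    """Split total_sample across strata in proportion to stratum size."""
    population = sum(strata_sizes.values())
    exact = {name: total_sample * size / population for name, size in strata_sizes.items()}
    allocation = {name: int(share) for name, share in exact.items()}  # round down first
    shortfall = total_sample - sum(allocation.values())
    # Hand the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - allocation[name], reverse=True)
    for name in by_remainder[:shortfall]:
        allocation[name] += 1
    return allocation


def stratified_sample(strata, total_sample, rng):
    """Draw a simple random sample within each stratum."""
    sizes = {name: len(units) for name, units in strata.items()}
    allocation = proportional_allocation(sizes, total_sample)
    return {name: rng.sample(units, allocation[name]) for name, units in strata.items()}


# Hypothetical strata: 8724 dwelling units classified by type of dwelling.
strata = {
    "houses": range(1, 5001),
    "flats": range(5001, 8001),
    "hostels": range(8001, 8725),
}
rng = random.Random(11)  # arbitrary seed
allocation = proportional_allocation({k: len(v) for k, v in strata.items()}, 320)
samples = stratified_sample(strata, 320, rng)
print(allocation)
```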
Cluster random sampling focuses upon subgroups of a population that are conveniently identified and sampled. It may be easier to sample Cape Town addresses than
to sample the Cape Town population. Clustering reduces the cost of taking a random
sample by ensuring that the units within sampled clusters are geographically close to each
other, and travel costs to obtain data are less than when individual persons have to be
located and interviewed.
Whereas strata are sets of units constructed on the basis of additional knowledge
about similarities and contrasts between units available ahead of the sampling, clusters
are sets of units constructed purely upon a basis of convenient access.
If the householder population of a country is to be sampled with regard to housing
questions, one might first choose a random sample from the list of towns and cities
(Stage 1). Stage 2 might involve random selection from
the list of suburbs in each town chosen, Stage 3 a random sample from a list of blocks
within each suburb selected, and at Stage 4 a random selection from a list of houses
from each block.
An overall margin of error statement can be made by combining the variations or
sampling errors at each of the four stages. This strategy is also known as multi-stage
cluster random sampling. One can easily see that this method will be more convenient
than a random sample from the list of all householders in the country. The convenience
comes with some cost: the confidence intervals for cluster random sampling are wider
than for simple random sampling.
Sometimes two or more random samples are taken from the population at distinct
points in time. The first random sample is often used to give an indication of the sample
size required for the second and main random sample, and to aid in its stratification. The
questions of importance appear in the second sample. This strategy is called double
or two-phase sampling. Sequential random sampling also involves a series of random
samples, usually to keep a continuous check on some feature which may change with time,
or to provide further information on a particular aspect. Depending on circumstances
and the information required, each sample may or may not include the same units.
Systematic random sampling involves selecting a random starting point and
then every k -th unit in a list or in a moving system thereafter, e.g., the twenty-third
person leaving a train station, and then every hundredth person. This strategy will often
give good results, but is subject to immeasurable and perhaps large errors if the interval
between units sampled coincides with a cycle inherent in the population sampled. An
obvious example of a difficulty is sampling every 14th (or 21st) unit of a population of
daily petrol sales figures, when a regular pattern occurs over a seven day sales week.
Other cycles may, unfortunately, not be as obvious.
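The petrol sales difficulty can be made concrete with a small sketch. It assumes a year of 364 daily figures, indexed from 0, with the day of the week given by the index modulo 7; the function name and seed are illustrative only. An interval of 14 always lands on the same weekday, while an interval of 10 works its way through all seven.

```python
import random


def systematic_sample(units, interval, rng):
    """Random start within the first interval, then every interval-th unit."""
    start = rng.randrange(interval)
    return units[start::interval]


days = list(range(364))  # 52 weeks of daily sales figures; weekday = day % 7
rng = random.Random(7)   # arbitrary seed

every_14th = systematic_sample(days, 14, rng)
every_10th = systematic_sample(days, 10, rng)

weekdays_14 = {day % 7 for day in every_14th}
weekdays_10 = {day % 7 for day in every_10th}

print("weekdays covered with interval 14:", sorted(weekdays_14))
print("weekdays covered with interval 10:", sorted(weekdays_10))
```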
In all forms of probability sampling, the problem of non-response is an important
issue. Suppose we choose a random sample of voters using the electoral register and random numbers. The interviewers may have to make several calls to find people at home;
they may have moved, be on holiday or in hospital; they may refuse to be interviewed.
If we have a sample of 1000 and 900 respond, can we simply regard this as a random
sample of 900? It would be dangerous to do so, because we would then be assuming
that the non-responders hold similar views to the responders. The very fact that they
are non-responders makes them different to the others. The difference probably extends
to their opinions on the subject of the questionnaire.
If we did proceed to assume that respondents and non-respondents were similar
enough to ignore the 100, and treat the 900 as a random sample, we have an ethical
obligation to report that assumption as a fundamental basis of our analysis. This strategy
can be made explicit, but we will be unlikely ever to know whether or not it was valid.
In short, despite our best efforts to use random selection to increase the chance
of near-representative samples, non-response within a well-chosen random sample may
cause important parts of the information to be hidden from our analysis.

Non-random Sampling methods . . .


Judgement sampling selects units according to the views of the sampler. However
experienced and objective an expert may be in choosing what he feels is a representative
or balanced sample, the possibility of unintentional bias in the selections can never be
ruled out. Thus the findings of his study will only have credibility with those people
who believe the sample was balanced, and may have no credibility beyond that group.
Quota sampling attempts to remove the problem of non-response. The interviewer
is given a description of the types of people to be interviewed and the number of each
type required. The selection of actual individuals is then left to the interviewer. This
subjective selection means that, like systematic and judgement sampling, randomness is
not present, and there is no real check on the size of the error in sampling. The British
Market Research Society, asked to explain the poor performance of almost all opinion
polls in predicting results of the 1970 British general election, cited the use of quota
sampling as a major reason. Although unintentional interviewer bias had in the past
often tended to even out, the only way in which this bias could consistently be avoided
was through some form of random or probability sampling.

Designing sample surveys . . .


(a) Before embarking on a major survey it is best to arrange a pilot survey. This
small-scale study will test the effectiveness and shortcomings of the questionnaire
and give first-hand information on problems facing the interviewer. It is well worth
the extra effort and time involved, as it enables one to revise ambiguous questions,
insert reply categories which might have been overlooked, and even change to a

more efficient survey procedure. If a major error is only discovered during or after
the main survey, economic reasons usually prevent one from starting afresh, and
the entire exercise can be largely a waste of time and money.
(b) Questionnaires should be carefully drawn up, taking into account the information that is really desired, the range of possible answers, and the types of people
likely to be interviewed. Wherever possible, people concerned with the survey subject, the study population or study area should be consulted. Questions should be
uncomplicated and kept to a minimum.
(c) Data will often be entered directly into the computer from a questionnaire; the
layout should be checked in advance for its suitability at the data capture stage.
(d) The questionnaire should also be discussed beforehand with the person responsible for summarizing and analysing the information, particularly if a computer is
involved. The question of whether or not to process the information by computer
should, of course, be considered at a very early stage. This decision will depend
largely on the size of the sample, the number of questions in the questionnaire, the
complexity of the information required and the relative costs.
(e) Confidence levels are usually selected to be 95% or 99%, although in circumstances with high sampling costs, 90% may be the maximum achievable in practice.
A reliability margin of about 5% is usually quite satisfactory, but anything larger
than 10% is not of much practical use. A smaller margin demands a much greater
sample. There is nowhere near a linear relationship between size of population
and size of sample necessary for a fixed reliability margin L%. There is no basis
for the often held view that 5% (or 10%) of the population will always give a
satisfactory sample. It should be borne in mind that the reliability margin is an
absolute value and that a figure of 14% ± 3% is relatively more reliable than one
of 45% ± 5%, because the margin is smaller.
(f) We have so far restricted ourselves to a question with either a "yes" or "no" answer.
The same theory holds for questions with any number of categories: if $P_i$ is
the percentage of replies to category i, we use the same theory with $P_i$ instead of
P.
(g) If any conclusions are to be drawn specifically for a subgroup of the population, e.g. all bachelors, then the reliability margin of any of these conclusions is
dependent on the numbers of the specific subgroup in the sample and in the
population. Although 400 may give an accurate balanced view of a population of
40 000, it is simply naive and completely incorrect to expect 10 to represent an
accurate and balanced view of a sub-population of 1000.
These sample size figures all relate to a randomly drawn sample, without obvious
bias. Even in the balanced case, different levels of response (acceptance and satisfactory completion of the questionnaire) from different types in a heterogeneous
population may cause bias. Here suitable stratification prior to random selection
might help in maintaining the correct balance. Questionnaires can be postal, by
personal interview or postal with a personal follow-up. The postal method saves
much time and costs, but suffers much more from problems of non-response.
(h) Finally, the above points all attempt to deal with and minimise possible inaccuracies caused by drawing a sample which is not statistically representative of the
population. These selection errors may be minimal compared with those errors
resulting from poor interviewing, bad questions, incorrect transcription,
faulty processing and flawed interpretation. A moderate-sized random
sample, well controlled and carefully processed, may well give better results than
a complete census, the sheer size of which can cause very rushed interviewing and
processing and a greater proportion of errors.
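The remark in the list above that sample size is nowhere near proportional to population size can be illustrated numerically. The sketch below assumes the usual large-sample formula for a proportion, $n_0 = z^2 p(1-p)/L^2$, with the finite population adjustment $n = n_0/(1 + n_0/N)$; this formula is an assumption stated here for the illustration rather than quoted from this section, and the function name is hypothetical.

```python
from math import ceil


def sample_size(p, margin, z=1.96, population=None):
    """Sample size needed to estimate a proportion to within +/- margin.

    p is a rough estimate of the proportion, z the normal critical value
    (1.96 for 95% confidence), population the size N (None means infinite).
    """
    n0 = z ** 2 * p * (1 - p) / margin ** 2  # infinite-population sample size
    if population is None:
        return ceil(n0)
    return ceil(n0 / (1 + n0 / population))  # finite population adjustment


# Fixed requirement: 95% confidence, 5% reliability margin, p roughly 0.5.
for population in (1_000, 10_000, 100_000, 1_000_000):
    n = sample_size(0.5, 0.05, population=population)
    print(f"N = {population:>9,}: n = {n:>4} ({100 * n / population:.2f}% of the population)")
```

Under these assumptions the required sample grows only slightly while the population grows a thousandfold, so a fixed percentage of the population is a poor guide to sample size.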

The classic sampling fiasco is that of the Literary Digest poll, which sampled over
2 million opinions on the winner of the 1936 American Presidential Election. The L.D.
predicted only 40.9% of the votes for Roosevelt and a landslide win for Landon. Actual
returns gave Roosevelt an overwhelming win with 60.7% of the vote. Sample size 2 million; final error 20%! This huge error was due to plain stupidity, in picking the sample
from telephone directories and car registration files. The L.D. survey specialists may
have effectively sampled the opinions of the upper-class and middle-class people, but left
out the ill-clad, ill-fed and ill-housed lower-income classes who voted overwhelmingly
in favour of Roosevelt's New Deal policies.
Example 15C: For each of the following situations describe an appropriate sampling
technique.
(a) The Electricity Supply Commission is interested in evaluating the effect of an
energy conservation advertising campaign. It wishes to estimate the proportion of
households which are acting to reduce their electricity consumption.
(b) The City Council wishes to know the proportion of residents who favour the construction of a new city by-pass road.
(c) A chain store with outlets in nine regions wishes to estimate the number of bad
debtors nationwide.
(d) The university library wants to estimate the proportion of books that have not been
used for a year. A book is defined to have been used if it has been date-stamped
within the past year.
(e) A manager of a commercial forest plantation wishes to determine the proportion
of trees in a plantation that have been infested with an insect pest.

Solutions to examples . . .
3C (0.518, 0.640) or (51.8%, 64.0%), L = 6.1%.
5C Materialist Party: (53.4%, 58.2%), L = 2.4%.
   Ecological Party: (35.6%, 40.4%), L = 2.4%.
   Undecided: (5.0%, 7.4%), L = 1.2%.
8C 1590 (Use value of P closest to 0.5, i.e. the 0.22 of Brand B.)
11C z = −2.282 < −1.64, reject H0.
14C z = 2.086, P < 0.05, significant.

Exercises on confidence intervals and sample sizes . . .


11.1 In a sample of 500 garages it was found that 170 sold tyres at prices below those
recommended by the manufacturer.

(a) Estimate the percentage of all garages selling tyres below the recommended
price.
(b) Calculate the 95% confidence limits for this estimate.
(c) What is the reliability of the estimate?
(d) What size sample would have to be taken in order to estimate the percentage
to within 2% at a 95% confidence level? Use the value obtained in (a) as a
rough estimate of the true proportion.
11.2 What size sample is needed in each of the following situations? Sampling will be
done without replacement.

       Population   Confidence   Reliability   Rough estimate
       size N       level        L%            of proportion
(a)    infinite     95%          5%            50%
(b)    infinite     95%          1%            50%
(c)    infinite     95%          5%            25%
(d)    infinite     99%          5%            25%
(e)    infinite     95%          5%            75%
(f)    infinite     95%          1%            10%
(g)    3000         95%          3%            40%
(h)    10 000       95%          3%            40%
(i)    10 000       95%          1%            40%
(j)    10 000       99%          3%            20%
(k)    10 000       99%          2%            12%
(l)    40 000       95%          1%            8%
(m)    infinite     95%          2%            unknown
(n)    7000         95%          0.5%          97%

11.3 A simple random sample from a community of 6000 is to be asked a question to
determine the proportion in favour of some proposal.
(a) What size sample is needed if the result is required to have a reliability of 3%
at a 95% confidence level, and it is known in advance that the percentage in
favour will exceed 70%?
(b) If 840 out of 1000 members of the community were in favour, find a 99%
confidence interval for the community viewpoint on the proposal. What is
the reliability?
11.4 The proportion of people owning citizen band radios in a town of 7500 people is
to be estimated by means of a simple random sample, without replacement.
(a) What size sample is needed if it is required that the sample result be within
4% of the true result at a 99% confidence level?
(b) If there were 235 citizen band radio owners in a sample of 850 people, find a
95% confidence interval for the proportion in the town as a whole. What is
the reliability of the estimate?
11.5 A survey is to be made of student participation in sport within the three faculties
of a university. The faculties have 1000, 4000 and 5000 students. The proportion
of students participating in sport is known to be well under 30%.
(a) What size sample is needed within each faculty if a reliability of 3% is required
at the 95% confidence level?

(b) What size sample is required if it is only necessary to estimate the overall proportion within the university with the same reliability and confidence
level?
(c) Explain the difference between the overall sample sizes obtained in (a) and
(b).

11.6 You read in a newspaper report that 53% of businessmen think that the economic
climate is improving. Upon checking you learn that the sample size used in the
research was 100. Find the 95% confidence interval and comment.
11.7 A questionnaire has four questions, and the rough estimates of the proportions of
interest are 10%, 70%, 80% and 5%. What size sample is required to achieve a
reliability of 2% at the 90% confidence level in all four questions? The population
may be assumed to be infinite.

11.8 In a random sample of 120 pages typed by Miss Brown there were 23 pages with
typing errors.
(a) Find a 90% confidence interval for the proportion of Miss Brown's pages that
have errors.
(b) How large a sample would we need if we wanted our confidence interval to
have overall length 4%?

Exercises on hypothesis testing . . .


11.9 A manufacturer guarantees that only 10% of vehicles have brake defects in the
first 10 000 km. Feeling that this is an underestimate, a consumer organization
undertook an investigation which showed that 18 out of 110 vehicles examined
had brake defects within 10 000 km. At the 1% significance level, do these data
disprove the manufacturer's claim?
11.10 A certain type of aircraft develops minor trouble in 4% of flights. Another
type of aircraft develops similar trouble in 19 out of 150 flights. Investigate the
performance of the two types of aircraft and comment on any significant difference.

11.11 The national average for the ownership of motorbikes by teenagers is 15%. In
an affluent suburb which has 2000 teenagers, 45 out of a sample of 250 teenagers
owned motorbikes. Test if this proportion is significantly more than the national
average.
11.12 After corrosion tests, 42 of 536 metal components treated with Primer A and
91 of 759 components treated with Primer B showed signs of rusting. Test the
hypothesis that Primer A is superior to Primer B as a rust inhibitor at the 1%
significance level.

11.13 In a sample of 540 wives of professional and salaried workers, 42% had visited
their doctor at least once during the preceding 3 months. During the same period,
of a sample of 270 wives of labourers and unskilled workers, 36% had visited a
doctor. By the use of an appropriate statistical test, consider the validity of the
assertion that middle-class wives are more likely to visit their doctors than the
wives of working-class husbands.


11.14 In a sample of 569 wives of professional and salaried workers 45% attended weekly
the local welfare centre with their infants. For a sample of 245 wives of agricultural
workers, the corresponding proportion was 35%. Test the hypothesis that there is
no difference between the two groups in respect of their attendance at such centres.
11.15 Two groups, A and B, each consist of 100 people who have a disease. A serum
is given to Group A but not to Group B (which is called the control group);
otherwise, the two groups are treated identically. It is found that in Groups A and
B, 75 and 65 people, respectively, recover from the disease. Test the hypothesis
that the serum helps to cure the disease.

11.16 A sample poll of 300 voters from district A and 200 voters from district B showed
that 168 and 96 respectively were in favour of a given candidate. At a level of
significance of 5%, test the hypothesis that
(a) there is a difference between the districts
(b) the candidate is preferred in district A.
11.17 Two separate groups of sailors were randomly selected. One group of 350 sailors
was given seasickness pills of Brand A and another group of 220 sailors was given
pills of Brand B. The number of sailors in each group that became seasick during
a very heavy storm were 57 and 28 respectively. Can one conclude, at a 5%
significance level, that there is no real difference in the effectiveness of the pills?
11.18 There are 3000 students in the Arts Faculty and 2500 in the Science Faculty of
a university. In a sample of 350 arts students there were 186 non-smokers, while
of 400 science students, 273 were non-smokers. Is there a difference between the
proportions of non-smokers in the two faculties? (Develop a test that adjusts the
variances for the finite population sizes.)

11.19 In a study to estimate the proportion of residents in a certain city and its suburbs
who were opposed to the construction of a nuclear power plant, it was found that
48 out of 100 urban residents were opposed to the construction of the power plant,
while 91 out of 125 suburban residents were opposed. Test whether the level
of opposition to the nuclear power plant varies significantly between urban and
suburban residents.
11.20 In an election, one ballot box at a polling station was found to have a broken
seal, and there was concern that votes for a particular party, called ABC, had
been removed and destroyed. None of the other boxes had broken seals. Of the
remaining votes cast at that polling station, 32% were in favour of party ABC. The
ballot box with the broken seal contained 543 votes, of which 134 were for party
ABC. Does this information provide support for the allegation that the ballot box
had been tampered with? What assumptions are required to conduct the test?

Exercises on sample surveys . . .


11.21 What is the aim of choosing a random sample? Mention two modifications that
can be made to a simple random sampling plan, and outline the benefits of these
modifications.

11.22 A statistician wishes to assess the opinion that residents of a suburb have about
their local bus service. Describe how he might go about obtaining a representative
sample of their opinions.

11.23 What objections would you raise if you were told to use, as a random sample of
the households in the Cape Town area, the first three addresses on the top of each
page of the Cape Peninsula telephone directory?

Solutions to exercises . . .
11.1 (a) 0.34 or 34%
(b) (0.298 , 0.382) or (29.8% , 38.2%)
(c) L = 4.2%
(d) 2156.
11.2 (a) 385    (b) 9604   (c) 289    (d) 500
     (e) 289    (f) 3458   (g) 764    (h) 930
     (i) 4798   (j) 1059   (k) 1495   (l) 2640
     (m) 2401   (n) 2728
11.3 (a) 780   (b) (0.813 , 0.867) or (81.3% , 86.7%), L = 2.7%.
11.4 (a) approximately 911   (b) (0.248 , 0.305), L = 2.83%.


11.5 (a) 473 out of 1000, 732 out of 4000 and 760 out of 5000, a total of 1965.
(b) 823.
11.6 (43.2% , 62.8%). The confidence interval is so long that it is of little use.
11.7 n = 1412. (Use the estimate closest to 0.50, i.e. 0.70).
11.8 (a) (13.3% , 25.1%)   (b) L = 2%, 1042.

11.9 z = 2.23 < 2.33, cannot reject H0 .


11.10 z = 5.47, P < 0.0001, very highly significant.
11.11 z = 1.42, P < 0.10, nearly significant.
11.12 z = −2.43 < −2.33, reject H0.
11.13 z = 1.643, P < 0.10, nearly significant.
11.14 z = 2.65, P < 0.01, very significant.

11.15 z = 1.54, P < 0.10, nearly significant (but in medical statistics, nearly significant is not sufficient!).
11.16 (a) z = 1.76 < 1.96, cannot reject H0 .
(b) z = 1.76 > 1.64, reject H0 .
11.17 z = 1.16 < 1.96, cannot reject H0 .


11.18 The test statistic is
$$z = (P_1 - P_2) \Big/ \left[ \hat{P}(1 - \hat{P}) \left( \frac{N_1 - n_1}{n_1 (N_1 - 1)} + \frac{N_2 - n_2}{n_2 (N_2 - 1)} \right) \right]^{1/2}$$
z = 4.56, P < 0.0001, very highly significant.
11.19 There is a very highly significant difference between the urban and suburban
residents (z = 3.804, P < 0.0005).
11.20 There is evidence of removal of votes (z = 3.66, P < 0.0005). (One-sided
alternative used, as suggested by the question.) The chief assumption is that votes
were randomly allocated to the boxes. If the boxes were not used simultaneously
but sequentially, the proportion voting for party ABC is assumed to have remained
constant at the polling station across both sets of boxes (with broken and
unbroken seals).

Chapter 12

REGRESSION AND CORRELATION


KEYWORDS: Regression, scatter plot, dependent and independent
variables, method of least squares, the normal equations, regression coefficients, correlation coefficient, residual standard deviation,
confidence intervals for predictions, nonlinear regression, exponential
growth.

Advertising and sales, wages and productivity, class record and final mark . . .
We often encounter problems in which we would like to describe the relationship
between two or more variables. A person in marketing will want to determine the number
of sales associated with advertising expenditure, an economist will want to know how
wages affect the productivity of an industry, and you, as a student of statistics, should
be able to define a more precise problem, and ask: "Given my class record, find a 95%
confidence interval for my final mark."
In this chapter we will make a start at answering two questions:
1. Is there really a relationship between the variables? (the correlation
problem). If the answer to this question is yes, we go on to ask:
2. How do we predict values for one variable, given particular values for
the other variable(s)? (the regression problem).
We consider the second question first.

Regression . . .
Example 1A: Suppose, as suggested above, we would like to predict a student's mark
in the final examination from his class record. We gather the following data, for 12
students. (In practice we would use far more data; we are now just illustrating the
method.)

Class record   Final mark
     61            83
     39            62
     70            76
     63            77
     83            89
     75            74
     48            48
     72            78
     54            76
     22            51
     67            63
     60            79

We depict the data graphically in what is known as a scatter plot, scattergram
or scatter diagram:

[Figure: scatter plot of final mark (vertical axis, 0 to 100) against class record (horizontal axis, 0 to 100) for the 12 students.]

A haphazard scattering of points in the scatter plot would show that no relationship
exists. Here we have a distinct trend: as x increases, we see that y tends to increase
too.
We are looking for an equation which describes the relationship between class
record and final mark so that, for a given class record x, we can predict the
final mark y. The equation-finding technique is called regression analysis. We call the
variable to be predicted, y, the dependent variable, and x is called the explanatory
variable.
The variable x is often called the independent variable, but this is a very poor
name, because statisticians use the concept of independence in an entirely different
context. Here, x and y are not (statistically) independent. If they were, it would be
stupid to try to find a relationship between them!

Linear regression . . .

We confine ourselves to the fitting of equations which are straight lines; thus we will
consider linear regression. This is not as restrictive as it appears: many non-linear
equations can be transformed into straight lines by simple mathematical techniques; and
many relationships can be approximated by straight lines in the range in which we are
interested.
The general formula for the straight line is
y = a + bx
The a value gives the y -intercept, and the b value gives the slope. When a and b are
given numerical values the line is uniquely specified. The problem in linear regression
is to find values for the regression coefficients a and b in such a way that we obtain
the best fitting line that passes through the observations as closely as possible.
We must decide the criteria this best line should satisfy. Mathematically, the
simplest condition is to stipulate that the sum of the squares of the vertical differences
between the observed values and the fitted line should be a minimum. This is called the
method of least squares, and is the criterion used almost universally in regression
analysis.
Pictorially, we must choose the straight line in such a way that the sum of the
squares of the ei on the graph below is a minimum, i.e. we wish to minimize
$$S = \sum_{i=1}^{n} e_i^2$$

[Figure: the fitted straight line plotted with the dependent variable y on the vertical axis and the explanatory variable x on the horizontal axis. The vertical differences $e_1, e_2, \ldots, e_i, \ldots, e_{n-1}, e_n$ are marked; $e_i$ runs from the observed value $y_i$ at the point $(x_i, y_i)$ to the predicted value $\hat{y}_i$ on the line above $x_i$.]

In general, we have n pairs of observations $(x_i, y_i)$. Usually these points do not lie
on a straight line. Note that $e_i$ is the vertical difference between the observed value of
$y_i$ and the associated value on the straight line. Statisticians use the notation $\hat{y}_i$ (and
say "y hat") for the point that lies on the straight line. Because it lies on the line,
$\hat{y}_i = a + b x_i$. The difference is expressed as
$$e_i = y_i - \hat{y}_i = y_i - (a + b x_i).$$
Thus
$$S = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2.$$

We want to find values for a and b which minimize $S$. The mathematical procedure
for doing this is to differentiate $S$ firstly with respect to a and secondly to differentiate
$S$ with respect to b. We then set each of the derivatives equal to zero. This gives us
two equations in two unknowns, a and b, and we ought to be able to solve them.
Fortunately, the two equations turn out to be a pair of linear equations for a and b; this
is a type of problem we learnt to do in our first years at high school! Technically, the
derivatives are partial derivatives, and we use the standard mathematical notation
for partial derivatives, $\partial S / \partial a$, instead of the more familiar notation $dS/da$.
The partial derivatives are:
$$\frac{\partial S}{\partial a} = -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0$$
$$\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0$$


Setting these partial derivatives equal to zero gives us the so-called normal equations:
\[ \sum_{i=1}^{n} y_i = n a + b \sum_{i=1}^{n} x_i \]
\[ \sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 \]

We can calculate $\sum_{i=1}^{n} x_i$, $\sum_{i=1}^{n} y_i$, $\sum_{i=1}^{n} x_i^2$ and $\sum_{i=1}^{n} x_i y_i$ from our data, and n is the sample size. This gives us the numerical coefficients for the normal equations. The only
unknowns are a and b.
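
Because the normal equations are simply two linear equations in a and b, they can be solved numerically as soon as the sums are known. The following is a minimal sketch, not the text's own method: it applies Cramer's rule (a choice made here for brevity) to the column sums from the table of Example 1A. The function name `solve_normal_equations` is hypothetical.

```python
def solve_normal_equations(n, sum_x, sum_y, sum_xx, sum_xy):
    """Solve the two normal equations for the intercept a and slope b.

        sum_y  = n * a     + b * sum_x
        sum_xy = a * sum_x + b * sum_xx
    """
    # Determinant of the 2x2 coefficient matrix [[n, sum_x], [sum_x, sum_xx]].
    det = n * sum_xx - sum_x ** 2
    if det == 0:
        raise ValueError("all x-values are equal, so the line is not determined")
    a = (sum_y * sum_xx - sum_x * sum_xy) / det
    b = (n * sum_xy - sum_x * sum_y) / det
    return a, b


# Column sums taken from the table in Example 1A (n = 12 students).
a, b = solve_normal_equations(12, 714, 856, 45602, 52696)
print(round(a, 2), round(b, 3))
```

The slope agrees with the value b = 0.566 found in Example 1A. The intercept differs from the text's 37.65 in the second decimal place only because the text rounds b before computing a.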
By manipulating the normal equations algebraically, we can solve them for a and b
to obtain
\[ b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}} \]
\[ a = \bar{y} - b\bar{x} = \frac{\sum y - b \sum x}{n}, \]
abbreviating $\sum_{i=1}^{n} x_i$ to $\sum x$, etc. It is convenient to define quantities $SS_x$, $SS_{xy}$ and
$SS_y$ as below:

CHAPTER 12. REGRESSION AND CORRELATION


SUMS OF SQUARES

\[ SS_x = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n} \]
\[ SS_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{\sum x \sum y}{n} \]
\[ SS_y = \sum (y - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n} \]

The letters SS stand for "Sum of Squares", and the subscript(s) indicate(s) the variable(s) in the sum. In this notation, the regression
coefficients a and b can be written as
\[ b = SS_{xy} / SS_x \]
\[ a = \bar{y} - b\bar{x} = \frac{\sum y - b \sum x}{n}, \]
and the straight line for predicting the values of the dependent variable
y from the explanatory variable x is
\[ \hat{y} = a + bx. \]

Although SSy was not needed for the calculation of a and b, it will be needed for a
soon-to-be-developed formula, so it is efficient to define it now and to calculate it along
with SSx and SSxy .
The formulae in the box are the most useful for finding the least squares linear
regression equation $\hat{y} = a + bx$.
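
The formulae in the box translate directly into a few lines of code. The sketch below is illustrative only: the function name `least_squares_line` and the four data points are made up for the purpose of the example and do not come from the text.

```python
def least_squares_line(xs, ys):
    """Return SSx, SSxy, SSy and the coefficients a, b of y-hat = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    # Computational forms of the sums of squares from the box.
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    ss_y = sum(y * y for y in ys) - sum_y ** 2 / n
    b = ss_xy / ss_x                     # slope
    a = sum_y / n - b * sum_x / n        # intercept: y-bar minus b times x-bar
    return {"SSx": ss_x, "SSxy": ss_xy, "SSy": ss_y, "a": a, "b": b}


# Hypothetical data, chosen so that the arithmetic is easy to check by hand.
fit = least_squares_line([1, 2, 3, 4], [2, 4, 5, 8])
print(fit)
```

For these four points SSx = 5, SSxy = 9.5 and SSy = 18.75, so the slope is 9.5/5 = 1.9 and the intercept is 4.75 - 1.9 × 2.5 = 0 (up to floating-point rounding).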

A computational scheme . . .

We set out the procedure for calculating the regression coefficients in full. It is very
arithmetic-intensive, and is best done using a computer. But it is important to be able
to appreciate what the computer is doing for you!

Example 1A, continued: The manual procedure for calculating the regression coefficients is a four-point plan.

1. We set out our data as in the following table, and sum the columns:

          x      y          $x^2$          $y^2$             $xy$
         61     83    61^2 = 3721    83^2 = 6889   61 × 83 = 5063
         39     62           1521           3844             2418
         70     76           4900           5776             5320
         63     77           3969           5929             4851
         83     89           6889           7921             7387
         75     74           5625           5476             5550
         48     48           2304           2304             2304
         72     78           5184           6084             5616
         54     76           2916           5776             4104
         22     51            484           2601             1122
         67     63           4489           3969             4221
         60     79           3600           6241             4740
   Σ    714    856         45 602         62 810           52 696

Thus $\sum x = 714$, $\sum y = 856$, $\sum x^2 = 45\,602$, $\sum y^2 = 62\,810$, and $\sum xy = 52\,696$.

2. Calculate the sample means:
\[ \bar{x} = \sum x / n = 714/12 = 59.5 \quad \text{and} \quad \bar{y} = \sum y / n = 856/12 = 71.33 \]
3. Calculate $SS_x$, $SS_{xy}$ and $SS_y$:
\[ SS_x = \sum x^2 - \frac{(\sum x)^2}{n} = 45\,602 - \frac{714^2}{12} = 3119 \]
\[ SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 52\,696 - \frac{714 \times 856}{12} = 1764 \]
\[ SS_y = \sum y^2 - \frac{(\sum y)^2}{n} = 62\,810 - \frac{856^2}{12} = 1748.67 \]
4. Calculate the regression coefficients a and b:
Thus
\[ b = \frac{SS_{xy}}{SS_x} = \frac{1764}{3119} = 0.566 \]
and
\[ a = \bar{y} - b\bar{x} = 71.33 - 0.566 \times 59.5 = 37.65 \]
Therefore the regression equation for making year-end predictions (y) from midyear mark (x) is
\[ \hat{y} = 37.65 + 0.566\,x \]
The "hat" notation is a device to remind you that this is an equation which is to be
used for making predictions of the dependent variable y. This notation is almost
universally used by statisticians. So if you obtain x = 50% at mid-year, you can
predict a mark of
\[ \hat{y} = 37.65 + 0.566 \times 50 = 66.0\% \]
for yourself at the end of the year.

We will defer the problem of placing a confidence interval on this prediction until
later. In the meantime, $\hat{y}$ is a point estimate of the predicted value of the dependent
variable. The quantity $SS_y$, which we have calculated above, will be used in forming
confidence intervals (and also in correlation analysis).
Example 2B: A personnel manager wishes to investigate the relationship between income and education. He conducts a survey in which a random sample of 20 individuals
born in the same year disclose their monthly income and their number of years of formal
education. The data is presented in the first two columns of the table below.
Find the regression line for predicting monthly income (y) from years of formal
education (x).

Here x is the number of years of formal education and y is the monthly income (1000s of rands).

   Person     x     y    $x^2$   $y^2$   $xy$
      1      12     4     144      16      48
      2      10     5     100      25      50
      3      15     8     225      64     120
      4      12    10     144     100     120
      5      16     9     256      81     144
      6      15     7     225      49     105
      7      12     5     144      25      60
      8      16    10     256     100     160
      9      14     7     196      49      98
     10      14     6     196      36      84
     11      16     8     256      64     128
     12      12     6     144      36      72
     13      15     9     225      81     135
     14      10     4     100      16      40
     15      18     7     324      49     126
     16      11     8     121      64      88
     17      17     9     289      81     153
     18      15    11     225     121     165
     19      12     4     144      16      48
     20      13     5     169      25      65
      Σ     275   142    3883    1098    2009

We complete the table by computing the terms $x^2$, $y^2$ and $xy$, and adding the
columns.
Next, $\bar{x} = 275/20 = 13.75$ and $\bar{y} = 142/20 = 7.10$.


We now calculate $SS_x$, $SS_{xy}$ and $SS_y$:
\[ SS_x = \sum x^2 - (\sum x)^2 / n = 3883 - 275^2/20 = 101.75 \]
\[ SS_{xy} = \sum xy - (\sum x)(\sum y)/n = 2009 - 275 \times 142/20 = 56.50 \]
\[ SS_y = \sum y^2 - (\sum y)^2 / n = 1098 - 142^2/20 = 89.80. \]

Thus, the regression coefficients are
\[ b = SS_{xy}/SS_x = 56.50/101.75 = 0.555 \]
and
\[ a = \bar{y} - b\bar{x} = 7.10 - 0.555 \times 13.75 = -0.535 \]
The required regression line for predicting monthly income from years of education is
thus
\[ \hat{y} = -0.535 + 0.555x. \]
We can use this equation to predict that the (average) monthly income of people
with 12 years of education is
\[ \hat{y} = -0.535 + 0.555 \times 12 = 6.125, \]
or R6125.
Example 3C: Suppose that for some reason the personnel manager wants to predict
years of education from monthly income.
(a) Calculate this regression line. (Hint: interchange the roles of x and y.)
(b) Is this regression line the same as that obtained by making x the subject of the
formula in the regression line $\hat{y} = -0.535 + 0.555x$ obtained in Example 2B? Explain
your results.

Correlation . . .
To justify a regression analysis we need first to determine whether in fact there is
a significant relationship between the two variables. This can be done by correlation
analysis.
If all the data points lie close to the regression line, then the line is a very good fit
and quite accurate predictions may be expected. But if the data is widely scattered,
then the fit is poor and the predictions are inaccurate.
The goodness of fit depends on the degree of association or correlation between
the two variables. This is measured by the correlation coefficient. We use r for the
correlation coefficient and define it by the formula
\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} \]
For computational purposes, the most useful formula for the correlation coefficient
is expressed in terms of $SS_x$, $SS_{xy}$ and $SS_y$:
\[ r = \frac{SS_{xy}}{\sqrt{SS_x\, SS_y}} . \]

The correlation coefficient r always lies between −1 and +1. If r is positive, then
the regression line has positive slope b, and as x increases so does y. If r is negative
then the regression line has negative slope and as x increases, y decreases. In other
words, in any one example, r, b and $SS_{xy}$ must have the same sign:
\[ r = \frac{SS_{xy}}{\sqrt{SS_x\, SS_y}} = b \sqrt{\frac{SS_x}{SS_y}} . \]

In the scatter plot on the left, both the correlation coefficient r and the slope coefficient
b of the regression line will be positive. In contrast, both r and b will be negative in
the plot on the right:
[Figure: two scatter plots. Left panel: yield (y) against rainfall (x), with r > 0. Right panel: yield (y) against number of insect pests (x), with r < 0.]

We now look at the extreme values of r. It can readily be shown that r cannot
be larger than plus one or smaller than minus one. These values can be interpreted as
representing perfect correlation. If r = +1 then the observed data points must lie
exactly on a straight line with positive slope, and if r = −1, the points lie on a line with
negative slope:

[Figure: two scatter plots of points lying exactly on a straight line. Left panel: r = −1. Right panel: r = +1.]

Half way between +1, perfect positive correlation, and −1, perfect negative correlation, is 0. How do we get zero correlation? Zero correlation arises if there is no
relationship between the variables x and y, as in the example below:

[Figure: scatter plot of I.Q. (y) against shoe size (x), with $r \approx 0$.]

Thus, if the data cannot be used to predict y from x, r will be close to zero. If it
can be used to make predictions r will be close to +1 or −1. How close to +1 or to
−1 must r be to show statistically significant correlation? We have tables which tell us
this.
We express the above more formally by stating that r, which is the sample correlation coefficient, estimates $\rho$, the population correlation coefficient. ($\rho$, "rho", is
the small Greek "r"; we are conforming to our convention of using the Greek letter
to signify the population parameter and the Roman letter for the estimate of the parameter.) What we need is a test of the null hypothesis that the population correlation
coefficient is zero (i.e. no relationship) against the alternative that there is correlation;
i.e. we need to test the null hypothesis $H_0: \rho = 0$ against the alternative $H_1: \rho \neq 0$.
Mathematical statisticians have derived the probability density function for r when the
null hypothesis is true, and tabulated the appropriate critical values. Table 5 is such a
table. The probability density function for r depends on the size of the sample, so that,
as for the t-distribution, "degrees of freedom" is an issue.
The tables tell us, for various degrees of freedom, how large (or how small) r has
to be in order to reject $H_0$. The alternative hypothesis above was two-sided, thus there
must be negative as well as positive values of r that will lead to the rejection of the
null hypothesis. Because the sampling distribution of r is symmetric, our tables only
give us the "upper" percentage points. Unlike the earlier tests of hypotheses that we
developed, there is no extra calculation to be done to compute the test statistic. In
this case, r itself is the test statistic.
When the sample correlation coefficient r is computed from n pairs of points, the
degrees of freedom for r are n − 2. In terms of our "degrees of freedom rule", we lose two
degrees of freedom, because we needed to calculate $\bar{x}$ and $\bar{y}$ before we could calculate r.
We estimated two parameters, the mean of x and the mean of y, so we lose two degrees
of freedom.
Example 4A: Suppose that we have a sample of 12 pairs of observations, i.e. 12 x-values and 12 y-values, and that we calculate the sample correlation coefficient r to be
0.451. Is this significant at the 5% level, using a two-sided alternative?
1. $H_0: \rho = 0$
2. $H_1: \rho \neq 0$
3. Significance level: 5%
4. Because we have a two-sided alternative we consult the 0.025 column of the table.
We have 12 pairs of observations, thus there are 12 − 2 = 10 degrees of freedom. The
critical value from Table 5 is 0.576. We would reject $H_0$ if the sample correlation
coefficient lay either in the interval (0.576, 1) or in (−1, −0.576):
[Figure: sampling distribution of r when $H_0$ is true, with the rejection regions below −0.576 and above +0.576 marked on an axis running to +1.]

5. Our calculated value of r is 0.451.


6. Because the sample correlation coefficient r does not fall into the rejection region
we are unable to say that there is significant correlation. Thus we would not be
justified in performing a regression analysis.
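
The critical value 0.576 can be cross-checked without Table 5. The sketch below relies on a standard result that is not stated in this section: under $H_0$, the quantity $t = r\sqrt{n-2}/\sqrt{1-r^2}$ has a t-distribution with n − 2 degrees of freedom, so a critical t-value converts to a critical r-value via $r = t/\sqrt{t^2 + (n-2)}$. The function names are hypothetical, and 2.228 is the upper 2.5% point of $t_{10}$ from the t-tables.

```python
import math


def critical_r(t_crit, df):
    """Critical value of |r| that corresponds to a critical t-value on df = n - 2."""
    return t_crit / math.sqrt(t_crit ** 2 + df)


def t_statistic(r, n):
    """t-statistic equivalent to a sample correlation r from n pairs."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r, n = 0.451, 12
r_crit = critical_r(2.228, n - 2)   # two-sided test at the 5% level
reject_h0 = abs(r) > r_crit

print(round(r_crit, 3))             # agrees with the tabulated 0.576
print(reject_h0)                    # False: r = 0.451 is not significant
```

Comparing r with the critical r, or comparing the equivalent t-statistic with 2.228, leads to the same decision.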
Example 5C: In the following situations, decide whether the sample correlation coefficients r represent significant correlation. Use the modified hypothesis testing procedure,
and give "observed" P-values.
(a) n = 12, r = 0.66, two-sided alternative hypothesis
(b) n = 36, r = 0.25, two-sided alternative hypothesis
(c) n = 52, r = 0.52, one-sided alternative hypothesis $H_1: \rho > 0$
(d) n = 87, r = 0.23, one-sided alternative hypothesis $H_1: \rho > 0$
(e) n = 45, r = 0.19, one-sided alternative hypothesis $H_1: \rho < 0$.

Example 1A, continued: For the data given in Example 1A, test if there is a significant correlation between class record and final mark. Do the test at the 1% significance
level.
We can justify using a one-sided alternative hypothesis: if this world is at all fair, then
if there is any correlation between class record and final mark it must be positive! Thus
we have:
1. $H_0: \rho = 0$
2. $H_1: \rho > 0$
3. The sample correlation coefficient is
\[ r = \frac{SS_{xy}}{\sqrt{SS_x\, SS_y}} = \frac{1764}{\sqrt{3119 \times 1748.67}} = 0.755 \]
4. The sample size was n = 12, so we have 10 degrees of freedom. The correlation
is significant at the 5% level ($r > r_{10}^{(0.05)} = 0.4973$), the 1% level ($r > r_{10}^{(0.01)} = 0.6581$),
the 0.5% level ($r > r_{10}^{(0.005)} = 0.7079$), but not at the 0.1% level ($r < r_{10}^{(0.001)} = 0.7950$).
5. Our conclusion is that there is a strong positive relationship between the class
record and the final mark ($r_{10} = 0.755$, P < 0.005).
Example 2B, continued: Is there significant correlation between monthly income and
years of education? Do the test at the 1% significance level.
1. $H_0: \rho = 0$
2. $H_1: \rho \neq 0$
3. Significance level: 1%
4. The sample size was n = 20, so the appropriate degrees of freedom is 20 − 2 = 18.
We will reject $H_0$ if the sample correlation coefficient is greater than $r_{18}^{(0.005)} = 0.5614$,
or less than −0.5614; remember that the alternative hypothesis here is
two-sided.
5. We calculate r to be
\[ r = SS_{xy} / \sqrt{SS_x\, SS_y} = 56.5 / \sqrt{101.75 \times 89.8} = 0.591 \]
6. The sample correlation coefficient lies in the rejection region. We have established
a significant correlation between years of education and monthly income. We
may use the regression line to make predictions.

Caution : cause and effect relationships . . .


A significant correlation coefficient does not in itself prove a cause-effect relationship. It only measures how the variables vary in relation to each other, e.g. the
high correlation between smoking and lung cancer does not in itself prove that smoking
causes cancer. The existence of a correlation between dental plaque and tooth decay
does not prove that plaque causes tooth decay. Historically, over the past few decades,
the price of gold has trended upwards through time but time does not cause the price
of gold to increase.
There might well be a cause-effect relationship between these variables: the statistician points out that a relationship exists, and the research worker has to explain the mechanism. Thus an economist might note a high correlation between the price of milk and
the price of petrol, but would not say that an increase in the price of petrol actually
causes an increase in the price of milk. The relationship between petrol and milk prices
is explained in terms of a third variable, the rate of inflation.
It is mainly in laboratory situations, where all other possible explanatory variables
can be controlled by the researcher, that the strength of cause-effect relations between
variables is most easily measured. Japanese quality engineers, under the leadership
and inspiration of Genichi Taguchi, have been extremely successful in improving the
quality of products by isolating the variables that explain defectiveness. This has
been done, largely, by holding all variables in a manufacturing process constant, except
for one experimental variable. If a correlation is found between the variable that is
being experimented with and product quality, then this variable is a likely cause of
defectiveness, and needs to be monitored carefully. On the other hand, if no correlation
is found between a variable and product quality, then this variable is unlikely to be
critical to the process, and the cheapest possible setting can be used.

[Figure 12.1: scatter plot in which the points lie close to a parabola.]

Caution : linear relationships . . .


The correlation coefficient, r , is designed to measure the degree of straight line
(linear) trend. If the data lies along some curved line, the correlation coefficient may
be small, even though there is a very strong relationship; e.g. when the relationship is
quadratic and the points lie close to a parabola, as in Figure 12.1.
The real relationship between x and y is part of a parabola, not a straight line. It
is always a good idea to plot a scatter plot, and to determine whether the assumption
of linearity is reasonable. We can deal with certain non-linear situations quite easily
by making transformations. We return to this later.

Some sums of squares . . .


The algebra of regression is rich in relationships between the various quantities. We
only cover those relationships that are essential to what follows!
Consider the quantity
\[ \sum (y_i - \hat{y}_i)^2 = \sum (y_i - a - b x_i)^2 . \]
This measures the sum of squares of the differences between the observed values $y_i$
and the predicted values $\hat{y}_i$. This quantity will be small if the observed values $y_i$ fall
close to the regression line $\hat{y} = a + bx$, and will be large if they do not. The term $y_i - \hat{y}_i$
is therefore called the residual or error for the ith observation. Thus we define
\[ SSE = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - a - b x_i)^2 \]
to be the sum of squares due to error.


We now do some algebra. Substitute $a = \bar{y} - b\bar{x}$ into $\sum (y_i - a - b x_i)^2$ to obtain:
\[ SSE = \sum (y_i - a - b x_i)^2 \]
\[ = \sum (y_i - \bar{y} + b\bar{x} - b x_i)^2 \]
\[ = \sum ((y_i - \bar{y}) - b(x_i - \bar{x}))^2 \]
\[ = \sum (y_i - \bar{y})^2 - 2b \sum (y_i - \bar{y})(x_i - \bar{x}) + b^2 \sum (x_i - \bar{x})^2 \]
\[ = SS_y - 2b\, SS_{xy} + b^2\, SS_x \]

But $b = SS_{xy}/SS_x$, so $b^2\, SS_x = b\, SS_{xy}$ and, in a few more easy steps, we have
\[ SSE = SS_y - b\, SS_{xy} . \]
The first term on the right-hand side measures the total variability of y; in fact,
$SS_y/(n-1)$ is the sample variance of y. We call this term the total sum of squares
and denote it SST, so that
\[ SST = SS_y . \]
The second term on the right-hand side measures how much the total variability is
reduced by the regression line. Thus the term $b\, SS_{xy}$ is known as the sum of squares
due to regression, and is denoted SSR. So we can write
\[ SSE = SST - SSR \quad \text{or} \quad SST = SSR + SSE. \]

This result is an important one, because it shows that we can decompose the total
sum of squares, SST, into the part that is explained by the regression, SSR, and the
remainder that is unexplained, SSE, the error sum of squares (or residual sum of
squares).
Finally, we find two alternative expressions for $SSR = b\, SS_{xy}$ to be useful. First the
short one, useful for calculations. Because $b = SS_{xy}/SS_x$, another expression for SSR
is
\[ SSR = \frac{SS_{xy}^2}{SS_x} \]
Secondly,
\[ SSR = b\, SS_{xy} = b \sum (y_i - \bar{y})(x_i - \bar{x}) \]
\[ = b \sum (y_i - \hat{y}_i + \hat{y}_i - \bar{y})(x_i - \bar{x}) \]
\[ = b \sum (y_i - \hat{y}_i)(x_i - \bar{x}) + \sum (\hat{y}_i - \bar{y})(b x_i - b\bar{x}) \]
We now consider the first and second terms separately. We will show that the first term
is equal to zero. We note that $\hat{y}_i = a + b x_i = \bar{y} - b\bar{x} + b x_i$ and substitute this in place
of $\hat{y}_i$ in the first term:
\[ b \sum (y_i - \bar{y} + b\bar{x} - b x_i)(x_i - \bar{x}) = b \left[ \sum ((y_i - \bar{y}) - b(x_i - \bar{x}))(x_i - \bar{x}) \right] \]
\[ = b \left[ \sum (y_i - \bar{y})(x_i - \bar{x}) - b \sum (x_i - \bar{x})^2 \right] \]
\[ = b\, SS_{xy} - b^2\, SS_x \]
\[ = \frac{SS_{xy}}{SS_x}\, SS_{xy} - \left( \frac{SS_{xy}}{SS_x} \right)^2 SS_x \]
\[ = 0, \]
which is a lot of stress to prove that something is nothing! For the second term we note
that $b x_i = \hat{y}_i - a$ and $b\bar{x} = \bar{y} - a$, so $(b x_i - b\bar{x}) = (\hat{y}_i - \bar{y})$ and we have that
\[ SSR = \sum (\hat{y}_i - \bar{y})^2 . \]

Thus, the partitioning of the total sum of squares
\[ SST = SSR + SSE \]
can be written as:
\[ \sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2 \]
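
The decomposition can be checked numerically. The sketch below is illustrative only: the function name `decompose` and the four data points are made up, and the three sums of squares are computed directly from their definitions rather than from the shortcut formulae.

```python
def decompose(xs, ys):
    """Return (SST, SSR, SSE, b*SSxy) for the least squares line through the data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = ss_xy / ss_x
    a = y_bar - b * x_bar
    fitted = [a + b * x for x in xs]                          # the y-hat values
    sst = sum((y - y_bar) ** 2 for y in ys)                   # total
    ssr = sum((f - y_bar) ** 2 for f in fitted)               # due to regression
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))       # due to error
    return sst, ssr, sse, b * ss_xy


# Hypothetical data, small enough to verify by hand.
sst, ssr, sse, b_ss_xy = decompose([1, 2, 3, 4], [2, 4, 5, 8])
print(sst, ssr, sse)
```

For these data SST = 18.75, SSR = 18.05 and SSE = 0.70, so SST = SSR + SSE, and SSR coincides with b SSxy = 1.9 × 9.5, as the algebra above requires.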

Prediction intervals . . .
In the same way as the standard deviation played a key role in confidence intervals
for means, an equivalent quantity called the residual standard deviation is used in
calculating confidence intervals for our predictions; we call these prediction intervals. The residual standard deviation, also denoted s, measures the amount of variation
left in the observed y-values after allowing for the effect of the explanatory variable x.
The residual standard deviation is defined to be
\[ s = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{\sum (y_i - a - b x_i)^2}{n-2}} . \]
In the previous section, we showed that
\[ SSE = SS_y - b\, SS_{xy} . \]
So a better formula for computational purposes is
\[ s = \sqrt{\frac{SS_y - b\, SS_{xy}}{n-2}} . \]
Because the residual standard deviation is estimated, the t-distribution is used to
form the prediction intervals. The degrees of freedom for the residual standard deviation
are n − 2 (we lose two degrees of freedom because two parameters are estimated by $\bar{x}$
and $\bar{y}$ before we can calculate the residual standard deviation). The prediction interval
for a predicted value of y for a given value of x is
\[ \left( \text{predicted value} - t_{n-2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{SS_x}} \; , \;\; \text{predicted value} + t_{n-2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{SS_x}} \right) . \]

The term $(x - \bar{x})^2$ in the prediction interval has the effect of widening the prediction
interval when x is far from $\bar{x}$. Thus our predictions are most reliable when they are
made for x-values close to $\bar{x}$, the average value of the explanatory variable. Predictions
are less reliable when $(x - \bar{x})^2$ is large. In particular, it is unwise to extrapolate, that
is, to make predictions that go beyond the range of the x-values in the sample, although
in practice we are often required to do this.
Example 1A, continued: Find a 95% prediction interval for the predicted value of
the final mark given a class record of 50%.
We first need to compute the residual standard deviation:
\[ s = \sqrt{\frac{SS_y - b\, SS_{xy}}{n-2}} = \sqrt{\frac{1748.67 - 0.566 \times 1764}{12 - 2}} = 8.66 \]

Earlier we saw that the point estimate of the predicted value for y was 66% when
x = 50%.
The required value from the t-tables, $t_{10}^{(0.025)}$, is 2.228. Note also $\bar{x} = 59.5$, $SS_x = 3119$.
Thus the 95% prediction interval is given by
\[ \left( 66.0 - 2.228 \times 8.66 \sqrt{1 + \frac{1}{12} + \frac{(50 - 59.5)^2}{3119}} \; ; \;\; 66.0 + 20.35 \right) = (45.65, 86.35). \]
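
The same interval can be reproduced in code. This is a sketch under stated assumptions rather than a definitive implementation: the function name `prediction_interval` is hypothetical, the inputs are the sums of squares and means found for Example 1A, and the critical value 2.228 is read from the t-tables rather than computed.

```python
import math


def prediction_interval(x_new, a, b, s, n, x_bar, ss_x, t_crit):
    """Prediction interval for y at x_new, using the formula of this section."""
    y_hat = a + b * x_new
    half_width = t_crit * s * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / ss_x)
    return y_hat - half_width, y_hat + half_width


# Quantities from Example 1A.
n, x_bar, y_bar = 12, 59.5, 856 / 12
ss_x, ss_xy, ss_y = 3119, 1764, 1748.67

b = ss_xy / ss_x
a = y_bar - b * x_bar
s = math.sqrt((ss_y - b * ss_xy) / (n - 2))     # residual standard deviation

low, high = prediction_interval(50, a, b, s, n, x_bar, ss_x, t_crit=2.228)
print(round(s, 2), round(low, 1), round(high, 1))
```

Because no intermediate rounding is done, the limits differ from (45.65, 86.35) in the second decimal place, but they round to the same first decimal.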

The coefficient of determination . . .


The earlier examples demonstrated how the sample correlation coefficient, r, can
be used to test whether a significant linear relationship between x and y exists. A
further useful statistic which is frequently reported in practice is the coefficient of
determination. It is defined as
\[ \text{coefficient of determination} = \frac{SSR}{SST} . \]
The coefficient of determination has a useful practical interpretation. It measures the
proportion of the total variability of y which is "explained" by the regression line.
There is a simpler formula for the coefficient of determination. Above, we showed
that $SSR = b\, SS_{xy}$ and $SST = SS_y$. Substituting, we have
\[ \text{coefficient of determination} = \frac{b\, SS_{xy}}{SS_y} = \frac{SS_{xy}^2}{SS_x\, SS_y} = r^2 . \]
So we have the astonishing result that the coefficient of determination is simply the
square of the correlation coefficient! For this reason the coefficient of determination in
simple linear regression is denoted $r^2$.

Example 1A, continued: Compute and interpret the coefficient of determination,
$r^2$, for the regression of the final examination mark on the class record.
The sample correlation coefficient was 0.755. So the coefficient of determination is
\[ r^2 = 0.755^2 = 0.5700. \]
We interpret this by stating that 57% of the variation in marks in the final examination
can be explained by variation in the marks of the class test.
Example 2B, continued: Compute and interpret the $r^2$ for the relationship between
years of formal education and monthly income.
The coefficient of determination is
\[ r^2 = 0.5911^2 = 0.3494. \]
Hence approximately 35% of the variation in monthly incomes can be explained by the
variation in the years of formal education.

Summary of computational scheme for simple linear regression . . .


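The key quantities in simple linear regression are summarized below.

Column sums from the data table: $\sum x$, $\sum y$, $\sum x^2$, $\sum y^2$, $\sum xy$

\[ \bar{x} = \frac{1}{n} \sum x \qquad \bar{y} = \frac{1}{n} \sum y \]
\[ SS_x = \sum x^2 - (\sum x)^2 / n \]
\[ SS_{xy} = \sum xy - (\sum x)(\sum y) / n \]
\[ SS_y = \sum y^2 - (\sum y)^2 / n \]
\[ b = SS_{xy} / SS_x \]
\[ a = \bar{y} - b\bar{x} \]
\[ r = SS_{xy} / \sqrt{SS_x\, SS_y} \]
\[ s = \sqrt{(SS_y - b\, SS_{xy}) / (n-2)} \]

Predicted values: $\hat{y} = a + bx$
95% prediction intervals:
\[ \text{predicted value} \pm t_{n-2}^{(0.025)}\, s \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{SS_x}} \]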

Example 6B: In an investigation to determine whether the rewards on financial investments are related to the risks taken, data on a sample of 17 sector indices on the
Johannesburg Stock Exchange was collected (Source: Financial Risk Service, D.J. Bradfield & D. Bowie). The data gives a beta value for each sector (x), a quantity widely
used by financial analysts as a proxy for risk, and the reward (y ) or return for each
sector, calculated as the percentage price change over a 12 month period.
(a) Draw a scatter plot and discuss the visual appearance of the relationship.
(b) Calculate the correlation coefficient, and decide whether the relationship between risk
and return is significant.
(c) Find the regression line. What is the coefficient of determination of the regression?
(d) Find 95% prediction intervals for the predicted returns for investments with risks
given by betas of 0.5, 0.8 and 1.5.

                              Beta   Return (%)
   Sector                       x        y        $x^2$     $y^2$      $xy$
    1. Diamonds               1.31      46       1.716     2116      60.26
    2. West Wits              1.15      67       1.323     4489      77.05
    3. Metals & Mining        1.29      46       1.664     2116      59.34
    4. Platinum               1.44      40       2.074     1600      57.60
    5. Mining Houses          1.46      78       2.132     6084     113.88
    6. Evander                1.00      40       1.000     1600      40.00
    7. Motor                  0.74      24       0.548      576      17.76
    8. Investment Trusts      0.67      20       0.449      400      13.40
    9. Banks                  0.69      17       0.476      289      11.73
   10. Insurance              0.79      19       0.624      361      15.01
   11. Property               0.27       8       0.073       64       2.16
   12. Paper                  0.75      25       0.563      625      18.75
   13. Industrial Holdings    0.83      27       0.689      729      22.41
   14. Beverages & Hotels     0.81      34       0.656     1156      27.54
   15. Electronics            0.11       7       0.012       49       0.77
   16. Food                   0.11      19       0.012      361       2.09
   17. Printing               0.51      33       0.260     1089      16.83
   Totals                    13.93     550      14.271    23704     556.58

(a) The scatter plot shows that the percentage return increases with increasing values
of beta. Furthermore, the plot makes it clear that it is appropriate, over the
observed range of values for beta, to fit a straight line to this data.

[Figure: scatter plot of percentage return (y), axis from 0 to 100, against beta (x), axis from 0.0 to 2.0.]

Having computed the basic sums at the foot of the table, we compute $\bar{x} = 0.8194$
and $\bar{y} = 32.35$. Then
\[ SS_x = 14.271 - (13.93)^2 / 17 = 2.857 \]
\[ SS_{xy} = 556.58 - 13.93 \times 550 / 17 = 105.904 \]
\[ SS_y = 23704 - (550)^2 / 17 = 5909.882 \]

(b) Using the modified hypothesis testing procedure:
1. $H_0: \rho = 0$
2. $H_1: \rho \neq 0$
3. $r = SS_{xy} / \sqrt{SS_x\, SS_y} = 105.904 / (2.857 \times 5909.882)^{1/2} = 0.815$
4. Using the correlation coefficient table with 17 − 2 = 15 degrees of freedom,
we see that 0.815 is significant at the 0.1% level, because $0.815 > r_{15}^{(0.0005)} = 0.7247$.
5. We have established a highly significant relationship between risk and return
($r = 0.815$, P < 0.001).
(c) The coefficients for the regression line are given by:
\[ b = \frac{SS_{xy}}{SS_x} = \frac{105.904}{2.857} = 37.068 \]
\[ a = \bar{y} - b\bar{x} = 32.35 - 37.068 \times 0.8194 = 1.976 \]
The regression line is thus
\[ \hat{y} = 1.976 + 37.068x. \]
The coefficient of determination is equal to $r^2 = 0.815^2 = 0.664$. Thus the regression line accounts for about two-thirds of the variability in return.

(d) Predicted return for an investment which had a beta of 0.5 is
\[ \hat{y} = 1.976 + 37.068 \times 0.5 = 20.51\% \]
Similarly, the predicted returns when x = 0.8 and x = 1.5 are 31.63% and 57.58%,
respectively.
To find the prediction intervals we need to calculate the residual standard deviation:
\[ s = \sqrt{(SS_y - b\, SS_{xy}) / (n-2)} = \sqrt{(5909.882 - 37.068 \times 105.904) / 15} = 11.501. \]
We need $t_{15}^{(0.025)}$. From the t-tables (Table 2), this is 2.131. We note that $\bar{x} = 0.8194$.
The 95% prediction interval for x = 0.5 is thus
\[ \left( 20.510 - 2.131 \times 11.501 \sqrt{1 + \frac{1}{17} + \frac{(0.5 - 0.8194)^2}{2.857}} \; , \;\; 20.510 + 25.641 \right) \]
or (−5.13%, 46.15%).
For x = 0.8 and x = 1.5, the only terms in the prediction interval formula that
change are the first term (the predicted value) and the final term under the square
root sign. When x = 0.8, the 95% prediction interval is
\[ \left( 31.630 - 2.131 \times 11.501 \sqrt{1 + \frac{1}{17} + \frac{(0.8 - 0.8194)^2}{2.857}} \; , \;\; 31.630 + 25.221 \right) \]
or (6.41%, 56.85%).
For x = 1.5, we have
\[ \left( 57.578 - 2.131 \times 11.501 \sqrt{1 + \frac{1}{17} + \frac{(1.5 - 0.8194)^2}{2.857}} \; , \;\; 57.578 + 27.081 \right) \]
or (30.50%, 84.65%).
Of the three values for which we have computed prediction intervals, x = 0.8 lies
closest to the mean of the observed range of x-values, and therefore the associated
prediction interval for x = 0.8 is the shortest of the three.
Example 7C: To check on the strength of certain large steel castings, a small test piece
is produced at the same time as each casting, and its strength is taken as a measure of
the strength of the large casting. To examine whether this procedure is satisfactory, i.e.,
the test piece is giving a reliable indication of the strength of the castings, 11 castings
were chosen at random, and both they, and their associated test pieces were broken.
The following were the breaking stresses:

   Test piece (x):   45   67   61    77   71   51   45   58   48   62   36
   Casting (y):      39   86   97   102   74   53   62   69   80   53   48

(a) Calculate the correlation coefficient, and test for significance.
(b) Calculate the regression line for predicting y from x.
(c) Compute and interpret the coefficient of determination.
(d) Find 90% prediction limits for the strength of a casting when x = 60.

Nonlinear curve fitting . . .


Not all relationships between variables are linear. An obvious example is the relationship between time (t) and population (p), a relationship which is frequently said to
be exponential, i.e. of the form $p = a e^{bt}$, where a and b are "regression coefficients".
Sometimes there are theoretical reasons for using a particular mathematical function
to describe the relationship between variables, and sometimes the relationship that fits
variables reasonably well has to be found by trial and error. The more complex the
mathematical function, the more data points are required to determine the regression
coefficients reliably. The simplest mathematical function useful in regression analysis is
the straight line, giving rise to the linear regression we have been considering. Because
of the simplicity of linear regression, we often make use of it as an approximation (over
a short range) to a more complex mathematical function.
Sometimes, however, we can transform a nonlinear relationship into a linear one,
and then apply the methods we have already learnt. It is these rather special situations
which we consider next.

Exponential growth . . .
Example 8A: Fit an exponential growth curve of the form $y = a e^{bx}$ to the population
of South Africa, 1904-1976:

   Year (x):                    1904   1911   1921   1936   1946   1951   1960   1970   1976
   Population (millions) (y):    5.2    6.0    6.9    9.6   11.4   12.7   16.0   21.8   26.1

From a scatter plot it is clear that linear regression is inappropriate:

[Figure: scatter plot of population in millions (y), axis from 0 to 30, against year (x), axis from 1900 to 1980.]

However, there is a straightforward way to transform this into a linear relation. Take
natural logarithms on both sides of the equation $y = a e^{bx}$. This yields (remembering
$\log_e e = 1$)
\[ \log y = \log a + bx. \]
We now put $Y = \log y$ and $A = \log a$, and rewrite it as
\[ Y = A + bx. \]
This is the straight line with which we are familiar! The scatter plot of the logarithm of
population against year is plotted below; a straight line through the points now seems
to be a satisfactory model.

[Figure: scatter plot of the logarithm of population ($\log_e$ millions), Y, axis from 1 to 3, against year (x), axis from 1900 to 1980.]

By using our computational scheme for linear regression on the pairs of data values
x and Y (= log y), we compute the regression coefficients A and b. Having done this, we
can now transform back to the exponential growth curve we wanted. Because $Y = \log y$,
we can write
\[ y = e^{Y} = e^{A + bx} = e^{A} e^{bx} = a e^{bx}, \]
where $a = e^{A}$. The table below demonstrates the computational procedure for fitting
exponential growth. (It is convenient here to take x to be years since 1900.)
          x       y     Y = log y     $x^2$      $Y^2$      $xY$
          4     5.2       1.65          16       2.723       6.60
         11     6.0       1.79         121       3.204      19.69
         21     6.9       1.93         441       3.725      40.53
         36     9.6       2.26        1296       5.108      81.36
         46    11.4       2.43        2116       5.905     111.78
         51    12.7       2.54        2601       6.452     129.54
         60    16.0       2.77        3600       7.673     166.20
         70    21.8       3.08        4900       9.486     215.60
         76    26.1       3.26        5776      10.628     247.76
   Σ    375              21.72      20 867      54.904    1019.06

From these basic sums we first compute $\bar{x} = 41.67$ and $\bar{Y} = 2.41$ and then $SS_x = 5242$
and $SS_{xY} = 114.06$. Thus
\[ b = SS_{xY} / SS_x = 0.0218 \]
\[ A = \bar{Y} - b\bar{x} = 1.50. \]
The regression line for predicting Y from x is
\[ \hat{Y} = 1.50 + 0.0218\,x \]
Transforming back to the exponential growth curve yields
\[ \hat{y} = e^{1.50}\, e^{0.0218x} = 4.48\, e^{0.0218x} . \]

In full knowledge that it is dangerous to use regression analysis to extrapolate beyond
the range of x-values contained in our data, we risk an estimate of the population of
South Africa in the year 2000, i.e. when x = 100:
\[ \hat{y} = 4.48\, e^{0.0218 \times 100} = 39.6 \text{ million}. \]
(In 1980 the population demographers were in fact predicting a population of 39.5 million
for the year 2000.)
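
The log-transform procedure of Example 8A can be sketched in a few lines. The function name `fit_exponential` is hypothetical. Unlike the hand calculation, the code keeps the logarithms unrounded, so its coefficients are close to, but not identical with, b = 0.0218 and a = 4.48, and its forecast for 2000 comes out somewhat above the text's 39.6 million.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by least squares on the pairs (x, log y)."""
    logs = [math.log(y) for y in ys]            # Y = log y
    n = len(xs)
    x_bar, log_bar = sum(xs) / n, sum(logs) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (v - log_bar) for x, v in zip(xs, logs))
    b = ss_xy / ss_x
    big_a = log_bar - b * x_bar                 # intercept A on the log scale
    return math.exp(big_a), b                   # transform back: a = e^A


# Data of Example 8A, with x measured in years since 1900.
years_since_1900 = [4, 11, 21, 36, 46, 51, 60, 70, 76]
population = [5.2, 6.0, 6.9, 9.6, 11.4, 12.7, 16.0, 21.8, 26.1]

a, b = fit_exponential(years_since_1900, population)
forecast_2000 = a * math.exp(b * 100)
print(round(a, 2), round(b, 4), round(forecast_2000, 1))
```

The gap between the two forecasts shows how sensitive an extrapolation of an exponential curve is to rounding in the fitted coefficients.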

Other relationships that can be transformed to be linear . . .


Example 9A: Transform the following functions into linear relationships.
(a) $y = ab^{x}$
(b) $y = ax^{b}$
(c) $ay = b^{x}$
In each case we take natural logarithms:
(a) $\log y = \log a + x \log b$.
Putting $Y = \log y$, $A = \log a$ and $B = \log b$, this becomes
$$Y = A + xB,$$
which is linear.
(b) $\log y = \log a + b \log x$.
Put $Y = \log y$, $A = \log a$ and $X = \log x$ to obtain
$$Y = A + bX.$$
(c) $\log a + \log y = x \log b$.
Put $A = \log a$, $Y = \log y$ and $B = \log b$ to obtain
$$A + Y = xB \quad\text{or}\quad Y = -A + xB.$$


INTROSTAT

Example 10B: To investigate the relationship between the curing time of concrete and
the tensile strength, the following results were obtained:

Curing time (days) (x):           1 2/3    2       2 1/2    3 1/3    5       10
Tensile strength (kg/cm²) (y):    22.4     24.5    26.3     30.2     33.9    35.5

(a) Draw a scatter plot.
(b) Assuming that the theoretical relationship between tensile strength (y) and curing
time (x) is given by
$$y = a e^{b/x},$$
find the regression coefficients a and b.
(c) Predict the tensile strength after a curing time of three days.

(a) [Figure: scatter plot of tensile strength (kg/cm²) against curing time in days, x.]

(b) Taking logarithms we obtain
$$\log y = \log a + b \cdot \frac{1}{x}.$$
Put $Y = \log y$, $A = \log a$ and $X = 1/x$ to obtain the linear equation $Y = A + bX$.
The computational scheme is:



      x        y     X = 1/x   Y = log y     X²       Y²       XY
    1 2/3    22.4      0.6       3.11       0.36     9.67     1.87
    2        24.5      0.5       3.20       0.25    10.24     1.60
    2 1/2    26.3      0.4       3.27       0.16    10.69     1.31
    3 1/3    30.2      0.3       3.41       0.09    11.63     1.02
    5        33.9      0.2       3.52       0.04    12.39     0.70
   10        35.5      0.1       3.57       0.01    12.74     0.36
    Σ                  2.1      20.08       0.91    67.36     6.86

From these sums we compute the quantities
$$\bar{X} = 0.35 \qquad \bar{Y} = 3.347 \qquad SS_X = 0.175 \qquad SS_{XY} = -0.168$$
and then the slope coefficient and the intercept
$$b = SS_{XY}/SS_X = -0.960 \quad\text{and}\quad A = \bar{Y} - b\bar{X} = 3.683.$$
Thus the equation for predicting Y, the logarithm of the tensile strength y, from
X, the reciprocal of curing time x, is
$$Y = 3.683 - 0.960\,X.$$
To express this in terms of the original variables x and y, we write
$$y = e^{Y} = e^{A+bX} = e^{A} e^{bX} = e^{3.683} e^{-0.960X} = 39.766\, e^{-0.960/x}.$$
(c) After 3 days curing, i.e. x = 3,
$$y = 39.766\, e^{-0.960/3} = 28.9 \text{ kg/cm}^2.$$
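
As a check on the arithmetic of Example 10B, the following Python sketch repeats the calculation with unrounded logarithms. The names `fit_line` and `predict` are invented for this illustration; the small differences from the hand calculation arise because the table rounds Y to two decimals.

```python
import math

# Curing times (days) and tensile strengths (kg/cm^2) from Example 10B.
x = [5 / 3, 2, 5 / 2, 10 / 3, 5, 10]
y = [22.4, 24.5, 26.3, 30.2, 33.9, 35.5]


def fit_line(xs, ys):
    """Return the least squares intercept and slope for ys = A + b * xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in xs)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    slope = ss_xy / ss_x
    return y_bar - slope * x_bar, slope


# Transform both variables: X = 1/x and Y = log y.
X = [1 / xi for xi in x]
Y = [math.log(yi) for yi in y]

A, b = fit_line(X, Y)
a = math.exp(A)


def predict(days):
    """Predicted tensile strength after the given curing time."""
    return a * math.exp(b / days)


print(f"b = {b:.3f}, A = {A:.3f}, a = {a:.2f}")
print(f"predicted strength after 3 days: {predict(3):.1f} kg/cm^2")
```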
Example 11C: A company that manufactures gas cylinders is interested in assessing the
relationship between pressure and volume of a gas. The table below gives experimental
values of the pressure P of a given mass of gas corresponding to various values of the
volume V. According to thermodynamic principles, the relationship should be of the
form $PV^{\gamma} = C$, where $\gamma$ and C are constants.
(a) Find the values of $\gamma$ and C that best fit the data.
(b) Estimate P when V = 100.0.

Volume V:      54.3    61.8    72.4    88.7    118.6    194.0
Pressure P:    61.2    49.5    37.6    28.4     19.2     10.1


Multiple regression . . .
In many practical situations, it is more realistic to believe that more than one explanatory variable is related to the dependent variable. Thus the quality of the grape
harvest depends not only on the amount of rain that falls during spring, but also probably on the hours of sunshine during summer, the amount of irrigation during summer,
whether the irrigation was by sprinklers, furrows or a drip system, the amounts and types
of fertilizers applied, the amounts and types of pesticides used, and even whether or not
the farmer used scarecrows to frighten the birds away! Regression models that include
more than one explanatory variable are called multiple regression models. Multiple
regression should be seen as a straightforward extension of simple linear regression.
The general form of the multiple regression model is
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + e$$
where y is the dependent variable, $x_1, x_2, \ldots, x_k$ are the explanatory variables and
$\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are the true regression coefficients, the population regression parameters.
The term e at the end of the regression model is usually called the "error", but this is
a bad name. It does not mean "mistake"; it is intended to absorb the variability in
the dependent variable y which is not accounted for by the explanatory variables which
we have measured and have included in the regression model.
The regression coefficients $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are unknown parameters (note the
use of Greek letters once again for parameters) and need to be estimated. The data from
which we are to estimate the regression coefficients consist of n sets of k + 1 numbers
of the form: the observed value of the dependent variable y, and the associated values
of the k explanatory variables $x_i$. For example, we would measure the quality of the
grape harvest, together with all the values of the explanatory variables that led to that
quality.

The least squares estimation approach . . .


As with simple linear regression, the regression coefficients need to be estimated.
We are searching for a model of the form
$$y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k,$$
where $b_i$ is the sample estimate of the regression parameter $\beta_i$.
The method for estimating the regression coefficients in multiple regression is identical to that used for simple regression. As before, we use the method of least squares
and minimize the sum of squares of the difference between the observed y -values and
the predicted y -values. Because there are now k + 1 regression coefficients to estimate (and not only two as was the case with simple regression), the algebra gets messy,
and the arithmetic gets extremely tedious. So we resort to getting computers to do the
calculations on our behalf.
The primary reason for the computational difficulty in multiple regression as compared with simple regression, is that, instead of having only two partial derivatives and
two equations to solve, we have to find k + 1 partial derivatives and the regression coefficients are obtained as the solutions to a set of k + 1 linear equations. But the key point


is that the underlying philosophy remains the same: we minimize the sum of squared
residuals.
There are no simple explicit formulae for the regression coefficients bi for multiple
regression. They are most conveniently expressed in terms of matrix algebra. But they
are readily computed by a multitude of statistical software packages. We assume that
you have access to a computer that will do the calculations, and we will take the approach
of helping you to interpret the results.

Example 12A: To demonstrate how a regression equation can be estimated for two or
more explanatory variables, we consider again Example 2B where the personnel manager
was concerned with the analysis and prediction of monthly incomes.
Recall that in Example 2B the relationship between monthly income (y ) and years
of education (which we will now denote x1 ) was estimated using the simple regression
model:

$$y = -0.535 + 0.555x_1.$$

Because r = 0.5911 was highly significant, we had established that a relationship existed
between the two variables. But the coefficient of determination $r^2 = 0.3494$, so that
approximately 35% of the variability in incomes could be explained by this single variable,
years of formal education.
But what about the remaining 65%? The personnel manager is curious to establish whether any other variable can be found that will help to reduce this unexplained
variability and which will improve the goodness of fit of the model.
After some consideration, the personnel manager decides that years of relevant
experience is another explanatory variable that could affect employees'
incomes. He gathers the extra data, and calls it variable $x_2$.


Person   Monthly income   Years of formal   Years of
         (R1000s)         education         experience
         y                x1                x2
   1        4               12                 2
   2        5               10                 6
   3        8               15                16
   4       10               12                23
   5        9               16                12
   6        7               15                11
   7        5               12                 1
   8       10               16                18
   9        7               14                12
  10        6               14                 5
  11        8               16                17
  12        6               12                 4
  13        9               15                20
  14        4               10                 6
  15        7               18                 7
  16        8               11                13
  17        9               17                12
  18       11               15                 3
  19        4               12                 2
  20        5               13                 6

Now the regression equation that includes both the effect of years of education ($x_1$)
and the years of relevant experience ($x_2$) is
$$y = b_0 + b_1 x_1 + b_2 x_2.$$
The computer generated solution is
$$y = 0.132 + 0.379x_1 + 0.180x_2$$
where $x_1$ is years of formal education and $x_2$ is years of relevant experience.
Notice that the coefficient of x1 and the intercept have changed from the solution
when only x1 was in the model. Why? Because some of the variation previously explained by x1 is now explained by x2 . (There is a mathematical result that says that
the earlier regression coefficients will remain unchanged only if there is no correlation
between x1 and x2 .)
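
For readers curious about what the computer does, here is a minimal Python sketch that solves the k + 1 normal equations for the data of Example 12A. It is one possible way of doing the calculation, not necessarily how any particular statistical package works; the function name `least_squares` and the variable names are invented for this illustration.

```python
# Data of Example 12A (20 persons).
income = [4, 5, 8, 10, 9, 7, 5, 10, 7, 6, 8, 6, 9, 4, 7, 8, 9, 11, 4, 5]
education = [12, 10, 15, 12, 16, 15, 12, 16, 14, 14,
             16, 12, 15, 10, 18, 11, 17, 15, 12, 13]
experience = [2, 6, 16, 23, 12, 11, 1, 18, 12, 5,
              17, 4, 20, 6, 7, 13, 12, 3, 2, 6]


def least_squares(columns, y):
    """Solve the normal equations (X'X) b = X'y; an intercept is included."""
    rows = [[1.0] + [float(col[i]) for col in columns] for i in range(len(y))]
    p = len(rows[0])  # number of coefficients, k + 1
    # Augmented matrix [X'X | X'y].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, p):
            factor = m[r][col] / m[col][col]
            for c in range(col, p + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    b = [0.0] * p
    for i in range(p - 1, -1, -1):
        tail = sum(m[i][j] * b[j] for j in range(i + 1, p))
        b[i] = (m[i][p] - tail) / m[i][i]
    return b


b0, b1, b2 = least_squares([education, experience], income)
print(f"y = {b0:.3f} + {b1:.3f} x1 + {b2:.3f} x2")
```

Calling `least_squares([education], income)` with a single explanatory variable gives the simple regression coefficients of Example 2B.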
In multiple regression we can interpret the magnitude of the regression coefficient
bi as the change induced in y by a change of 1 unit in xi , holding all the other variables
constant. In the above example, b1 = 0.379 can be interpreted as measuring the change
in income (y ) corresponding to a one-year increase in years of formal education (while
holding years of experience constant). And b2 = 0.180 is the change in y induced by
a one-year increase in experience (while holding years of formal education constant).
Because y is measured in 1000s of rands per month, the regression model is telling us


that each year of education is worth R379 per month, and that each year of experience
is worth R180 per month!
One question that the personnel manager would ask is: To what extent has the
inclusion of the additional variable contributed towards explaining the variation of incomes? For simple regression, the coefficient of determination, $r^2$, answered this question. We now have a multiple regression equivalent, called the multiple coefficient
of determination, denoted $R^2$, which measures the proportion of variation in the dependent variable y which is explained jointly by all the explanatory variables in the
regression model. The computation of $R^2$ is not as straightforward as in simple regression, but is invariably part of the computer output. In our example, the computer
printout that gave us the model
$$y = 0.132 + 0.379x_1 + 0.180x_2$$
also told us that $R^2 = 0.607$. This means that 60.7% of the variability of incomes is
explained jointly by $x_1$ (years of formal education) and $x_2$ (years of relevant experience).
This represents a substantial improvement in the explanation of y (monthly incomes)
from 35% to 61%. However, 39% of the variation in y remains unexplained, and is
absorbed into the "error" term discussed earlier. This leads us into taking a closer look
at the sources of variation in the dependent variable y.

Understanding the sources of variation in y . . .

In simple regression, we partitioned the variation in the dependent variable y into


two components, the variation explained by the regression and the unexplained variation,
the variation due to errors. The last paragraph of the previous section suggested that
we can do the same in multiple regression.
The same partitioning of the total sum of squares that we developed for simple
regression is applicable in the multiple regression situation (although the arithmetic
involves a lot more number crunching!):
$$SST = SSR + SSE$$
or
$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2.$$

The multiple coefficient of determination is defined in the same way as in the simple
case. It is the ratio of the sum of squares explained by the regression and the total sum
of squares,
$$R^2 = \frac{SSR}{SST}.$$
But in multiple regression, we do need to have a different notation for the multiple coefficient of determination ($R^2$ in place of $r^2$) because we no longer have a straightforward
squaring of a sample correlation coefficient.
The partitioning of the total sum of squares also helps to provide insight into the
structure of the multiple regression model. In the section that follows, we will see that
the partitioning enables us to decide whether the equation generated by the multiple
regression is significant. To do this in simple regression, we merely calculated the
correlation coefficient r and checked in Table 5 whether or not this value of r was
significant. The approach is now quite different.


Testing for significant relationships . . .

In multiple regression, the appropriate null and alternative hypotheses for testing
whether or not there is a significant relationship between y and the $x_i$ are:
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$$
$H_1$: one or more of the coefficients are non-zero.
If we reject the null hypothesis we can conclude that there is a significant relationship
between the dependent variable y and at least one of the explanatory variables xi .
The test statistic for this hypothesis is couched in terms of the ratio between the
variance due to regression and that due to error. It is therefore not surprising that the
F -distribution provides the critical values for the test statistic (recall chapter 9). The
test statistic is calculated as one of the outputs in most regression software packages,
and is usually presented as part of an analysis of variance or ANOVA table.
The ANOVA table summarizes the sources of variation, and usually has the following
structure:

ANOVA table
Source        Sum of Squares   Degrees of Freedom   Mean Square                 F
              (SS)             (DF)                 (MS)
Regression    SSR              k                    MSR = SSR/k                 F = MSR/MSE
Error         SSE              n - k - 1            MSE = SSE/(n - k - 1)
Total         SST              n - 1

The test statistic is the variance explained by the regression divided by the variance
due to error. The distribution of the test statistic for the above null hypothesis is
$$F = \frac{MSR}{MSE} \sim F_{k,\,n-k-1},$$
where n is the number of observations and k is the number of explanatory variables. We
reject the null hypothesis if the observed F-value is larger than the critical value in the
F-tables.
Example 12A, continued: Test the significance of the regression model with dependent variable monthly income and explanatory variables years of formal education
and years of relevant experience. Perform the test at the 5% significance level.


The hypotheses for testing the significance of the regression can thus be formulated
as:
1. $H_0: \beta_1 = \beta_2 = 0$
2. $H_1$: one or more of the coefficients are non-zero.
3. Significance level: 5%
4. Rejection region. Because n = 20 and k = 2, the test statistic has the $F_{k,\,n-k-1} = F_{2,17}$-distribution. We reject $H_0$ if the observed F-value is greater than the upper
5% point of $F_{2,17}$, i.e. if it is greater than $F^{(0.05)}_{2,17} = 3.59$.

5. Test statistic. The ANOVA table obtained from the computer printout looks like
this:

ANOVA table
Source        Sum of Squares (SS)   Degrees of Freedom (DF)   Mean Square (MS)   F
Regression    54.5391                2                        27.2696            13.15
Error         35.2609               17                         2.0742
Total         89.8000               19

The observed value of the test statistic is given by F = 27.2696/2.0742 = 13.15.


6. Conclusion. Because F = 13.15 > 3.59, we reject $H_0$ at the 5% significance level
and conclude that a significant relationship exists between monthly income, the
dependent variable, and the explanatory variables, years of formal education and
years of relevant experience. Thus the regression model can be used for predicting
monthly income. As in the simple regression case, it is advisable not to extrapolate,
and the values of the explanatory variables for which predictions are made ought
to be within their ranges in the observed data.
For example, the predicted monthly income for a person with x1 = 13 years of
education and x2 = 8 years of experience is obtained by making the appropriate
substitutions in the regression model:
$$y = 0.132 + 0.379x_1 + 0.180x_2 = 0.132 + 0.379 \times 13 + 0.180 \times 8 = 6.5.$$

The predicted monthly income for this person is R6 500.
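
The arithmetic behind the ANOVA table is simple enough to reproduce directly. The sketch below starts from the two sums of squares in the printout and rebuilds the remaining entries; the function name `anova_summary` is invented for this illustration, and the critical value 3.59 is taken from the F-tables as in the text, not computed.

```python
def anova_summary(ssr, sse, n, k):
    """Rebuild the ANOVA quantities from SSR, SSE, n observations, k variables."""
    sst = ssr + sse                # total sum of squares
    msr = ssr / k                  # mean square due to regression
    mse = sse / (n - k - 1)        # mean square due to error
    return {
        "SST": sst,
        "MSR": msr,
        "MSE": mse,
        "F": msr / mse,            # test statistic
        "R2": ssr / sst,           # multiple coefficient of determination
    }


# Sums of squares from the printout for Example 12A.
result = anova_summary(ssr=54.5391, sse=35.2609, n=20, k=2)

F_CRITICAL = 3.59  # upper 5% point of F(2, 17), read from the F-tables
reject_h0 = result["F"] > F_CRITICAL

for name, value in result.items():
    print(f"{name} = {value:.4f}")
print("reject H0" if reject_h0 else "do not reject H0")
```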


A test for the significance of individual variables . . .


Frequently, we need to establish whether a single regression coefficient in a multiple
regression model is significant in the regression. In the jargon of hypothesis testing, the
appropriate null and alternative hypotheses are:
$$H_0: \beta_i = 0$$
$$H_1: \beta_i \neq 0.$$
The test statistic for this null hypothesis is
$$\frac{b_i}{s_{b_i}} \sim t_{n-k-1},$$
where $b_i$ is the estimated regression coefficient, and $s_{b_i}$ is the estimated standard deviation of
$b_i$. Both these values are calculated by the computer software and can be
found in the printout.
Example 12A, continued: Test whether each of the explanatory variables x1 and x2
in the regression model for monthly income are significant. Do the tests at the 5% level.
To test if x1 , years of formal education, is significant the procedure is as follows:
1. $H_0: \beta_1 = 0$
2. $H_1: \beta_1 \neq 0$
3. Significance level: 5%
4. Rejection region. The appropriate degrees of freedom are $n - k - 1 = 20 - 2 - 1 = 17$.
The test is two-sided, because we want to reject the null hypothesis either if $\beta_1$
is significantly positive or significantly negative. So the critical value of the test
statistic is $t^{(0.025)}_{17} = 2.110$, and we will reject $H_0$ if $|t| > 2.110$.
5. Test statistic. The values $b_1 = 0.379$ and $s_{b_1} = 0.152$ are computed by the
regression programme. Hence t = 0.379/0.152 = 2.49.
6. Conclusion. Because the observed t-value of 2.49 lies in the rejection region, we
reject $H_0: \beta_1 = 0$ at the 5% level, and conclude that the variable $x_1$, number of
years of formal education, is significant in the regression model.
A similar procedure is used to decide if x2 , years of relevant experience is significant.
1. $H_0: \beta_2 = 0$
2. $H_1: \beta_2 \neq 0$
3. Significance level: 5%
4. Rejection region. The critical value of the test statistic is once again $t^{(0.025)}_{17} = 2.110$, and we will reject $H_0$ if $|t| > 2.110$.
5. Test statistic. The values $b_2 = 0.180$ and $s_{b_2} = 0.054$ are computed by the
regression programme, and t = 0.180/0.054 = 3.33.


6. Conclusion. Because the observed t-value lies in the rejection region, we reject
$H_0: \beta_2 = 0$ at the 5% level, and conclude that the variable $x_2$, number of years
of relevant experience, is also significant in the regression model.
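
The standard deviations $s_{b_1}$ and $s_{b_2}$ were simply read off the printout above. For the special case of two explanatory variables they can be obtained from sums of squares and products of deviations, as the following Python sketch shows. The helper name `centred_ss` is invented for this illustration, and the t-values may differ in the last digit from those in the text, which divides rounded figures.

```python
import math

# Data of Example 12A (20 persons).
income = [4, 5, 8, 10, 9, 7, 5, 10, 7, 6, 8, 6, 9, 4, 7, 8, 9, 11, 4, 5]
education = [12, 10, 15, 12, 16, 15, 12, 16, 14, 14,
             16, 12, 15, 10, 18, 11, 17, 15, 12, 13]
experience = [2, 6, 16, 23, 12, 11, 1, 18, 12, 5,
              17, 4, 20, 6, 7, 13, 12, 3, 2, 6]


def centred_ss(u, v):
    """Sum of products of deviations of u and v from their means."""
    return sum(a * b for a, b in zip(u, v)) - sum(u) * sum(v) / len(u)


n, k = len(income), 2
s11 = centred_ss(education, education)
s22 = centred_ss(experience, experience)
s12 = centred_ss(education, experience)
s1y = centred_ss(education, income)
s2y = centred_ss(experience, income)
sst = centred_ss(income, income)

# Slopes from the two normal equations written in deviation form.
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

# Error mean square, then the standard deviations of b1 and b2.
sse = sst - (b1 * s1y + b2 * s2y)
mse = sse / (n - k - 1)
se_b1 = math.sqrt(mse * s22 / det)
se_b2 = math.sqrt(mse * s11 / det)

t1, t2 = b1 / se_b1, b2 / se_b2
T_CRITICAL = 2.110  # upper 2.5% point of t with 17 df, from the t-tables

print(f"b1 = {b1:.3f}, s_b1 = {se_b1:.3f}, t = {t1:.2f}")
print(f"b2 = {b2:.3f}, s_b2 = {se_b2:.3f}, t = {t2:.2f}")
```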
Example 13B: A market research company is interested in investigating the relationship between monthly sales income and advertising expenditure on radio and television
for a specific area. The following data is gathered.
Sample    Monthly sales   Radio advertising   Television advertising
number    (R1000s)        (R1000s)            (R1000s)
   1        105             0.5                 6.0
   2         99             1.0                 3.0
   3        104             0.5                 5.0
   4        101             1.5                 3.5
   5        106             2.3                 4.0
   6        103             1.3                 4.5
   7        103             3.2                 3.5
   8        104             1.5                 4.0
   9        100             0.8                 3.0
  10         98             0.6                 2.8

(a) Find the regression equation relating monthly sales revenue to radio and television
advertising expenditure.
(b) Estimate and interpret R2 .
(c) Test the regression equation for significance at the 5% significance level.
(d) Test, for each variable individually, whether radio and television advertising expenditure are significant in the model at the 5% level.
(e) Comment on the effectiveness of radio vs television advertising for this industry.
You will need the following information, extracted from a computer printout, to
answer the questions.
Table of estimated coefficients
Variable       Estimated coefficient   Estimated standard deviation
Radio            1.6105                  0.4687
Television       2.3414                  0.4041
Intercept       90.9725

ANOVA table
Source        Sum of Squares (SS)   Degrees of Freedom (DF)   Mean Square (MS)   F
Regression    54.194                 2                        27.097             19.15
Error          9.906                 7                         1.415
Total         64.100                 9


(a) From the table of estimated coefficients, the regression model is
$$y = 90.97 + 1.61x_1 + 2.34x_2$$
where y is monthly sales revenue, $x_1$ is the radio advertising expenditure and $x_2$
is the television advertising expenditure.
(b) The coefficient of determination $R^2 = SSR/SST = 54.19/64.10 = 0.845$. Hence
84.5% of the variation in sales revenue can be explained by variation in radio and
television advertising expenditures.
(c) To test the regression for significance:
1. $H_0: \beta_1 = \beta_2 = 0$
2. $H_1$: At least one of the $\beta_i$ is non-zero, where i = 1, 2.
3. Significance level: 5%.
4. Rejection region. Reject $H_0$ if observed $F > F^{(0.05)}_{2,7} = 4.74$.
5. Test statistic. F = MSR/MSE = 27.097/1.415 = 19.15.
6. Because the observed F exceeds 4.74 we reject $H_0$ at the 5% level and conclude that a significant relationship exists between sales revenue and advertising expenditure.
(d) To test the individual coefficients for significance. Firstly, for advertising expenditure on radio.
1. $H_0: \beta_1 = 0$
2. $H_1: \beta_1 \neq 0$
3. Significance level: 5%.
4. Reject $H_0$ if observed $|t| > t^{(0.025)}_{7} = 2.365$.
5. Test statistic: $t = b_1/s_{b_1} = 1.611/0.469 = 3.43$.
6. Reject $H_0$, and conclude that sales revenue is related to expenditure on radio
advertising.
Secondly, for advertising expenditure on television.
1. $H_0: \beta_2 = 0$
2. $H_1: \beta_2 \neq 0$
3. Significance level: 5%.
4. Again, reject $H_0$ if observed $|t| > t^{(0.025)}_{7} = 2.365$.
5. Test statistic: $t = b_2/s_{b_2} = 2.341/0.404 = 5.80$.
6. Reject $H_0$, and conclude that sales revenue is related to expenditure on television advertising.
(e) Recall that the regression coefficients can be interpreted as the magnitude of the
impact that a unit change in $x_i$ has on y (holding the other variables constant). In
this example, both explanatory variables are measured in the same units. Because
$b_1 < b_2$, we have a suggestion that expenditure on television advertising is more
effective than expenditure on radio advertising. But, beware, we have not tested
this difference statistically: we need to test $H_0: \beta_1 = \beta_2$ against the alternative
$H_1: \beta_1 < \beta_2$.


Example 14C: A chain of department stores wants to examine the effectiveness of
its part-time employees. Data on sales as well as number of hours worked during one
weekend and months of experience were collected for 10 randomly selected part-time
salespersons.

Person   Number of   Number of      Months of
         sales       hours worked   experience
   1        4           5              1
   2        2           4              2
   3       15          12              6
   4        9          10              6
   5       11           9              8
   6        8           8             10
   7       14          13             12
   8       17          14             15
   9       16          12             14
  10        2           4              3

(a) Write down the regression equation relating number of sales (y ) to number of
hours worked (x1 ) and months of experience (x2 ).
(b) Compute and interpret R2 .
(c) Test the regression equation for significance at the 5% significance level.
(d) Test, for x1 and x2 individually, that they are significant in the model at the 5%
level.
(e) Do you think that the experience of part-time employees makes a difference to
their number of sales? Explain.
You will require the following information extracted from the relevant computer
printout.
Table of estimated coefficients
Variable        Estimated coefficient   Estimated standard deviation
Hours worked      1.3790                  0.2270
Experience        0.0998                  0.1716
Intercept        -3.5176

ANOVA table
Source        Sum of Squares (SS)   Degrees of Freedom (DF)   Mean Square (MS)
Regression    282.708                2
Error          12.892                7
Total         295.599                9


Person   Monthly income   Years of formal   Years of      Gender
         (R1000s)         education         experience
         y                x1                x2            x3
   1        4               12                 2            1
   2        5               10                 6            0
   3        8               15                16            1
   4       10               12                23            0
   5        9               16                12            1
   6        7               15                11            0
   7        5               12                 1            1
   8       10               16                18            1
   9        7               14                12            0
  10        6               14                 5            0
  11        8               16                17            1
  12        6               12                 4            1
  13        9               15                20            0
  14        4               10                 6            0
  15        7               18                 7            1
  16        8               11                13            1
  17        9               17                12            0
  18       11               15                 3            1
  19        4               12                 2            0
  20        5               13                 6            0

The use of dummy variables for qualitative variables . . .

In the discussion thus far on regression, all the explanatory variables have been
quantitative. Occasions frequently arise, however, when we want to include a categorical
or qualitative variable as an explanatory variable in a multiple regression model. We
demonstrate in this section how categorical variables can be included in the model.
Example 12A, continued: Assume now that the personnel manager who was trying
to find explanatory variables to predict monthly income believes that, in addition
to years of formal education and years of relevant experience, gender may have a bearing
on an individual's monthly income. Test whether the gender variable is significant in
the model (at the 5% level).
The personnel manager first gathers the gender information on each person in the
sample. The next step is to convert the qualitative variable gender into a dummy
variable. A dummy variable consists only of 0s and 1s. In this example, we will code
gender as a dummy variable $x_3$. We put $x_3 = 0$ if the gender is female and $x_3 = 1$ if the
gender is male. Results of this coding are shown in the final column of the table. The
variable $x_3$ can be thought of as a "switch" which is turned "on" or "off" to indicate
the two genders.
We now proceed as if the dummy variable $x_3$ were just an ordinary explanatory variable in the regression model, and estimate the regression coefficients using the standard
least squares method to obtain the estimated model:
$$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3.$$


Now if gender is significant in the model, b3 will be significantly different from zero.
The interpretation of b3 is the estimated difference in monthly incomes between males
and females (holding the other explanatory variables constant). In the way that we have
coded x3 , then a positive value for b3 would indicate that males are estimated to earn
more than females, and vice versa.
The computer printout contains the table of estimated regression coefficients.
Table of estimated coefficients
Variable               Estimated     Estimated standard   Computed
                       coefficient   deviation            t-value
Years of education       0.321         0.156                2.06
Years of experience      0.192         0.054                3.56
Gender                   0.838         0.664                1.26
Intercept                0.381

The multiple regression model is therefore
$$y = 0.381 + 0.321x_1 + 0.192x_2 + 0.838x_3.$$
At face value, the regression model suggests that males earn R838 more than females.
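
To illustrate that the dummy variable really is treated like any other column of numbers, the following Python sketch refits the model with gender included by solving the normal equations. It is a sketch of one possible computation, not the method of any particular package; the function name `least_squares` and the variable names are invented for this illustration.

```python
# Data of Example 12A, including the gender dummy (1 = male, 0 = female).
income = [4, 5, 8, 10, 9, 7, 5, 10, 7, 6, 8, 6, 9, 4, 7, 8, 9, 11, 4, 5]
education = [12, 10, 15, 12, 16, 15, 12, 16, 14, 14,
             16, 12, 15, 10, 18, 11, 17, 15, 12, 13]
experience = [2, 6, 16, 23, 12, 11, 1, 18, 12, 5,
              17, 4, 20, 6, 7, 13, 12, 3, 2, 6]
gender = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0]


def least_squares(columns, y):
    """Solve the normal equations (X'X) b = X'y; an intercept is included."""
    rows = [[1.0] + [float(col[i]) for col in columns] for i in range(len(y))]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, p):
            factor = m[r][col] / m[col][col]
            for c in range(col, p + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    b = [0.0] * p
    for i in range(p - 1, -1, -1):
        tail = sum(m[i][j] * b[j] for j in range(i + 1, p))
        b[i] = (m[i][p] - tail) / m[i][i]
    return b


b0, b1, b2, b3 = least_squares([education, experience, gender], income)
print(f"y = {b0:.3f} + {b1:.3f} x1 + {b2:.3f} x2 + {b3:.3f} x3")
```

The coefficient `b3` is the estimated difference (in R1000s per month) between males and females with the same education and experience.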
The ANOVA table is as follows:
ANOVA table
Source        Sum of Squares (SS)   Degrees of Freedom (DF)   Mean Square (MS)   F
Regression    57.7390                3                        19.246             9.60
Error         32.0610               16                         2.004
Total         89.8000               19

The observed F-value of 9.60 needs to be compared with the five per cent point of $F_{3,16}$.
From the F-tables, this is 3.24, so the multiple regression model is significant.
The multiple coefficient of determination is computed as $R^2 = SSR/SST = 57.739/89.8 = 0.643$. As a result of the inclusion of the additional variable, gender, the multiple coefficient of determination has increased from 60.7% (when we had two explanatory variables)
to 64.3%. This seems a relatively small increase, especially when compared with the increase in the coefficient of determination in going from one explanatory variable (34.9%)
to two (60.7%). This leads us to ask whether the addition of the third explanatory
variable $x_3$ was worthwhile. We can do this using the methods of the previous section:
we test if the coefficient associated with gender is significantly different from zero.


Following our standard layout for doing a test of a statistical hypothesis (for the last
time!), the appropriate null and alternative hypotheses are:
1. $H_0: \beta_3 = 0$
2. $H_1: \beta_3 \neq 0$
3. Significance level: 5%
4. The degrees of freedom are $n - k - 1 = 20 - 3 - 1 = 16$. We reject $H_0$ if
$|t| > t^{(0.025)}_{16} = 2.120$.
5. The test statistic is
$$t = \frac{b_3}{s_{b_3}} = \frac{0.838}{0.664} = 1.26,$$
which lies in the acceptance region.
6. We cannot reject $H_0$: the coefficient $\beta_3$ is not significantly
different from zero, so we have found no significant difference in monthly income
between the sexes.
In this example, the qualitative variable had only two categories, male and female.
If there are more than two categories, it is necessary to include further dummy variables. The number of dummy variables needed is always one less than the number of
categories. For example, if we decided that we also wanted to include marital status
as an explanatory variable with the three categories single, married and divorced,
we could use two dummy variables x4 and x5 to code each individual as follows:
x4 = 0    x5 = 0    for married
x4 = 1    x5 = 0    for single
x4 = 0    x5 = 1    for divorced
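
The coding rule (one dummy fewer than the number of categories, with one category serving as the baseline) can be sketched in Python as follows. The function name `dummy_code` and the list `MARITAL` are invented for this illustration; the baseline category is the one listed first.

```python
def dummy_code(value, categories):
    """Return one 0/1 dummy per category after the first (the baseline)."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return tuple(int(value == category) for category in categories[1:])


# Baseline first: "married" is coded (0, 0), as in the scheme above.
MARITAL = ["married", "single", "divorced"]

for status in MARITAL:
    x4, x5 = dummy_code(status, MARITAL)
    print(f"{status:9s} x4 = {x4}, x5 = {x5}")
```

With only two categories, for example `dummy_code("male", ["female", "male"])`, the function returns a single dummy, which matches the coding of gender used earlier.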

Example 15C: In example 6B, we investigated the relationship between rewards and
risks on the Johannesburg Stock Exchange (JSE). It is well known that the JSE can be
divided into three major categories of shares: Mining (M), Finance (F) and Industrial
(I). The shares of example 6B fall into the following categories:
(a) Design a system of dummy variables which accommodates the three share categories in the regression model. Show how each sector is coded.
(b) Use a multiple regression computer package to compute the regression analysis
with return (y ) as the dependent variable using, as explanatory variables, the
beta (x1 ) and the share category effects. Interpret the regression coefficients.
(c) Test whether the share category effects are significant.



Sector                       Return (%)   Beta    Industry
                             y            x1
 1. Diamonds                   46         1.31    M
 2. West Wits                  67         1.15    M
 3. Metals & Mining            46         1.29    M
 4. Platinum                   40         1.44    M
 5. Mining Houses              78         1.46    F
 6. Evander                    40         1.00    M
 7. Motor                      24         0.74    I
 8. Investment Trusts          20         0.67    F
 9. Banks                      17         0.69    F
10. Insurance                  19         0.79    F
11. Property                    8         0.27    F
12. Paper                      25         0.75    I
13. Industrial Holdings        27         0.83    I
14. Beverages & Hotels         34         0.81    I
15. Electronics                 7         0.11    I
16. Food                       19         0.11    I
17. Printing                   33         0.51    I

Solutions to examples . . .
3C (a) x = 9.284 + 0.629y
(b) Making x the subject of the formula yields x = 0.964 + 1.802y.
Note that the method of least squares chooses the coefficients a and b to minimize vertical distances, i.e. parallel to the y axis . Thus it is not symmetric
in its treatment of x and y . Interchanging the roles of x and y gives rise to a new
arithmetical problem, and hence a different solution.
5C (a) P < 0.05 (P < 0.02 is also correct) (b) P < 0.20, but this would not be
considered significant.
(c) P < 0.0005 (d) P < 0.025 (e) P < 0.20, but
not significant!
7C (b) r = 0.704, P < 0.005, a significant relationship exists.
(c) $r^2 = 0.4958$. Almost 50% of the variation in the breaking stress of the casting
can be explained by the breaking stress of the test piece.
(d) y = 4.50 + 1.15 x
(e) (44, 103).
11C (a) $\gamma = 1.404$, $\log C = 9.679$, $C = 15978.51$; thus $PV^{1.404} = 15978.51$.
(b) 24.86

14C (a) $y = -3.518 + 1.379x_1 + 0.100x_2$.
(b) $R^2 = 0.956$. The percentage of variation in sales explained by the two explanatory variables is 95.6%.
(c) F = 76.75 > 4.74, hence a significant relationship exists.

(d) For x1 : t = 6.074 > 2.365, significant in the model.
For x2 : t = 0.581 < 2.365, not significant in the model.
(e) The result suggests that experience is not an important factor for the performance of part-time salespersons.

15C (a) The final two columns show the two dummy variables needed to code the
share categories.
Sector                       Return (%)   Beta    Dummy   Dummy
                             y            x1      x2      x3
 1. Diamonds                   46         1.31     0       0
 2. West Wits                  67         1.15     0       0
 3. Metals & Mining            46         1.29     0       0
 4. Platinum                   40         1.44     0       0
 5. Mining Houses              78         1.46     1       0
 6. Evander                    40         1.00     0       0
 7. Motor                      24         0.74     0       1
 8. Investment Trusts          20         0.67     1       0
 9. Banks                      17         0.69     1       0
10. Insurance                  19         0.79     1       0
11. Property                    8         0.27     1       0
12. Paper                      25         0.75     0       1
13. Industrial Holdings        27         0.83     0       1
14. Beverages & Hotels         34         0.81     0       1
15. Electronics                 7         0.11     0       1
16. Food                       19         0.11     0       1
17. Printing                   33         0.51     0       1

(b) $y = -0.578 + 39.078x_1 - 1.346x_2 + 3.172x_3$
(c) In each case, compare with $t^{(0.025)}_{13} = 2.160$, and reject if $|t| > 2.160$. For $x_1$,
t = 3.887, significant. For $x_2$, $|t| = 0.150$, not significant. For $x_3$, t = 0.320,
not significant. Hence the share categories do not have an influence on return.
In other words, if the beta (risk) is held constant, no additional reward can
be expected from being in a different category. This finding is consistent with
the idea in finance that you reduce risk by diversifying. Therefore you cannot
expect any additional rewards for only being invested in one area.

Exercises . . .
For each example it is helpful to plot a scatter plot.
12.1 The marks of 10 students in two class tests are given below.
(a) Calculate the correlation coefficient and test it for significance at the 5% level.
(b) Find the regression line for predicting y from x.


(c) What mark in the second test do you predict for a student who got 70% for
the first test?
1st class test (x):    65   78   52   82   92   89   73   98   56   76
2nd class test (y):    39   43   21   64   57   47   27   75   34   52

12.2 The table below shows the mass (y) of potassium bromide that will dissolve in
100 ml of water at various temperatures (x).
Temperature x (°C):     0    20    40    60    80
Mass y (g):            54    65    75    85    96

(a) Find the regression line for predicting y from x.
(b) Find the correlation coefficient, and test it for significance.
(c) Find 95% prediction limits for predicting y when x = 50°C.
12.3 Fit a linear regression to the trading revenue of licensed hotels in South Africa,
1970–1979. Test for significant correlation. Forecast the trading revenue for 1982,
and find a 90% prediction interval for your forecast.

Year (x)                                   1970  1971  1972  1973  1974  1975  1976  1977  1978  1979
Trading revenue (millions of rands) (y)     146   170   186   209   245   293   310   337   360   420

12.4 Investigate the relationship between the depth of a lake (y) and the distance from
the lake shore (x). Use the following data, which gives depths at 5 metre intervals
along a line at right angles to the shore.
Distance from
shore (m) (x)
Depth (m) (y)

10

15

20

25

30

35

40

45

13

22

37

57

94

Fit a nonlinear regression of the form $y = a e^{bx}$.


12.5 Fit a growth curve of the form $y = ab^x$ to the per capita consumption of electricity
in South Africa, 1950–1978. Use the following data:

Year (x)                               1950  1955  1960  1965  1969  1973  1975  1977  1978
Per capita consumption (kWh/yr) (y)     733   960  1125  1483  1810  2295  2533  3005  3272

12.6 Fit a linear regression to the number of telephones in use in South Africa,
1968–1979. Test for significant correlation. Forecast the number of telephones for 1984,
and find a 90% prediction interval for this forecast.

Year (x)                              1968  1969  1970  1971  1972  1973  1974  1975  1976  1977  1978  1979
Number of telephones (millions) (y)   1.24  1.31  1.51  1.59  1.66  1.75  1.86  1.98  2.11  2.24  2.36  2.59

12.7 The number of defective items produced per unit of time, y, by a certain machine
is thought to vary directly with the speed of the machine, x, measured in 1000s
of revolutions per minute. Observations for 12 time periods yielded results which
are summarized below:

$\sum x = 60 \qquad \sum x^2 = 504 \qquad \sum xy = 1400$
$\sum y = 200 \qquad \sum y^2 = 4400$
(a) Fit a linear regression line to the data.
(b) Calculate the correlation coefficient, and test it for significance at the 5%
level.
(c) Calculate the residual standard deviation.
(d) Find a 95% prediction interval for the number of defective items when the
machine is running at firstly 2000, and secondly 6000 revolutions per minute.

12.8 20 students were given a mathematical problem to solve and a short poem to
learn. Student i received a mark $x_i$ for his progress with the problem and a mark
$y_i$ for his ability to memorize the poem.
The data is given, in summary form, below:

$\sum x_i = 996 \qquad \sum y_i = 1\,101$
$\sum x_i^2 = 60\,972 \qquad \sum y_i^2 = 73\,681$
$\sum x_i y_i = 64\,996$
(a) Compute the correlation coefficient, and test it for significance at the 1% level
of significance.
(b) Compute the regression line for predicting the ability to memorize poetry
from the ability to solve mathematical problems.
(c) Compute the regression line for predicting problem solving ability from ability
to memorize poetry.
(d) Why is the line obtained in (c) not the same as the line found by making x
the subject of the formula in the line obtained in (b)?

12.9 The following data is obtained to reveal the relationship between the annual production of pig iron (x) and annual numbers of pigs slaughtered (y).

Year                                                    A    B    C    D    E    F    G    H    I    J
x: Production of pig iron (millions of metric tons)     8    7    5    6    9    8   11    9   10   12
y: Number of pigs slaughtered (10 000s)                16   12    8   10   20   14   18   17   21   20

The following information is also provided:

$\sum x = 85 \qquad \sum x^2 = 765 \qquad \sum y = 156 \qquad \sum y^2 = 2614 \qquad \sum xy = 1405$

(a) Construct a scatter plot.


(b) Compute the least squares regression line, making production of pig iron the
independent variable.
(c) Determine the correlation coefficient, and test for significance at the 5% significance level.
(d) Interpret your results, considering whether your findings make sense in the
light of the nature of the two variables.

12.10 Daily turnover (y) and price (x) of a product were measured at each of 10 retail
outlets. The data was combined and summarized and the least squares regression
line was found to be $y = 8.7754 - 0.211x$. The original data was lost and only
the following totals remain:

$\sum x = 55 \qquad \sum x^2 = 385 \qquad \sum y^2 = 853.27$
(a) Find the correlation coefficient r between x and y and test for significance
at a 1% significance level.
(b) Find the residual standard deviation.
12.11 An agricultural company is testing the theory that, for a limited range of values,
plant growth (y) and fertilizer (x) are related by an equation of the form $y = ax^b$.
Plant growth and fertilizer values were recorded for a sample of 10 plants and after
some calculation the following results were obtained:

$n = 10 \qquad \sum x = 54 \qquad \sum y = 1427 \qquad \sum xy = 10\,677$
$\sum (\log x) = 6.5 \qquad \sum (\log y) = 20.0 \qquad \sum (\log x \log y) = 14.2$
$\sum (\log x)^2 = 5.2 \qquad \sum (\log y)^2 = 41.7 \qquad \sum x^2 = 366$
$\sum y^2 = 301\,903$
(a) Use these results to fit the equation to this data.
(b) Use your equation to predict plant growth when x = 10.


12.12 A famous economist has a theory that the quantity of a commodity produced is
related to the price of the commodity by the function
$Q = a b^P$
where Q is the quantity produced, and P is the price.
The economist studies the market for 8 months, and notes the following results:

P     1.35    1.45    1.50    1.55    1.70    1.90    2.05    2.10
Q    10.35   11.25   11.95   12.55   14.10   17.25   19.85   20.30

(a) Calculate the values of the constants a and b that best fit the data. You will
need some of the following:

$\sum P = 13.60 \qquad \sum Q = 117.60 \qquad \sum P^2 = 23.68$
$\sum Q^2 = 1836.48 \qquad \sum PQ = 207.73 \qquad \sum \log Q = 9.24$
$\sum (\log Q)^2 = 10.77 \qquad \sum P \log Q = 15.94$

(The logarithms have been taken to base 10.)


(b) Estimate Q when P = 2.

12.13 Given a set of data points $(x_i, y_i)$, and that the slopes of the regression lines
for predicting y from x and x from y are $b_1$ and $b_2$ respectively, show that the
correlation coefficient is given by
$r = \sqrt{b_1 b_2}$

An exercise on transformations to linearity . . .


12.14 Find transformations that linearize the following.
(a) The logistic growth curve
$y = \dfrac{1}{1 + a e^{bx}}.$

(Hint: consider 1 y , then take logarithms. Note also that 0 < y < 1,
and that y is interpreted as the proportion of the asymptote [the final value]
grown at time x. The curve is shaped like an S squashed to the right!)



(b) The Beverton-Holt stock-recruit relationship used in fisheries models
y = ax/(b + x)
where y is the recruitment value, and x the size of the parent stock.
(c) The Ricker stock-recruit model
$y = a x e^{bx}$
where x and y have the same meanings as in (b).
(d) The Gompertz growth curve
$y = (e^{a})^{b^{x}}$
where b > 0, and y is the proportion of the asymptote.
(e) The von Bertalanffy growth curve
y = (1 awbx )3

where a > 0, and y is the proportion of the asymptote. (Hint: consider


1
(1 y 3 )1 .)
The growth curves (d) and (e) are similar to the logistic growth curve (a).
They differ in that growth in the initial period is faster, then slower in the
middle, and the approach to the asymptote is more gradual.

Exercises on multiple regression . . .

12.15 Shown below is a partial computer output from a regression analysis of the form
$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3$.

Table of estimated coefficients

Variable      Estimated coefficient    Estimated standard deviation
x1                   0.044                       0.011
x2                   1.271                       0.418
x3                   0.619                       0.212
Intercept            0.646

ANOVA table

Source        Sum of Squares (SS)    Degrees of Freedom (DF)    Mean Square (MS)
Regression         22.012                     3                       C
Error                 A                       B                     0.331
Total              24.000

(a) Find values for A, B, and C in the ANOVA table.
(b) Compute and interpret $R^2$, the multiple coefficient of determination.
(c) Test the regression equation for significance at the 5% level.
(d) Test each of the explanatory variables for significance (5% level).

12.16 A large corporation conducted an investigation of the job satisfaction (y) of its
employees, as a function of their length of service ($x_1$) and their salary ($x_2$). Job
satisfaction (y) was rated on a scale from 0 to 10, $x_1$ was measured in years and $x_2$
was measured in R1000s per month. The following regression model was estimated
from a sample of employees.
$y = 4.412 - 0.615x_1 + 0.353x_2$.
Part of the computer output was as follows:

Table of estimated coefficients

Variable                    Estimated coefficient    Estimated standard deviation
Length of service (x1)            −0.615                      0.187
Salary (x2)                        0.353                      0.061
Intercept                          4.412

ANOVA table

Source        Sum of Squares (SS)    Degrees of Freedom (DF)    Mean Square (MS)
Regression          7.646                     C                       D
Error                 B                       7                       F
Total                 A                       E

$R^2 = 0.8364$
(a) Interpret the implications of the signs of $b_1$ and $b_2$ for job satisfaction.
(b) Complete the entries A to F in the ANOVA table. What was the sample
size?
(c) Test the overall regression for significance at the 1% level.
(d) Compute the appropriate t-ratios, and test the explanatory variables individually for significance.
12.17 In an analysis to predict sales for a chain of stores, the planning department
gathers data for a sample of stores. They take account of the number of competitors
in a 2 km radius ($x_1$), the number of parking bays within a 100 m radius ($x_2$),
and whether or not there is an automatic cash-dispensing machine on the premises
($x_3$), where $x_3 = 1$ if a cash-dispensing machine is present, and $x_3 = 0$ if it is not.
They estimate the multiple regression model
$y = 11.1 - 3.2x_1 + 0.4x_2 + 8.5x_3$,
where y is daily sales in R1000s.
(a) What does the model suggest is the effect of having an automatic cash-dispensing machine on the premises?
(b) Estimate sales for a store which will be opened in an area with
(i) 3 competitors within 2 km, 56 parking bays within 100 m and a cash-dispensing machine;
(ii) 1 competitor, 35 parking bays and no cash-dispensing machine.

Solutions to exercises . . .
12.1 (a) r = 0.84 > 0.6319, reject H0.
(b) y = −24.58 + 0.93x.
(c) When x = 70, y = 40.25.
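
The figures for 12.1 can be checked with a short least squares sketch in plain Python. The data are the ten pairs of test marks from the exercise; the function name `linear_regression` is a hypothetical helper, not something defined in the text.

```python
from math import sqrt

# Marks of the 10 students in Exercise 12.1.
x = [65, 78, 52, 82, 92, 89, 73, 98, 56, 76]   # 1st class test
y = [39, 43, 21, 64, 57, 47, 27, 75, 34, 52]   # 2nd class test


def linear_regression(x, y):
    """Return intercept a, slope b and correlation r for the line y = a + b x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    r = sxy / sqrt(sxx * syy)
    return a, b, r


a, b, r = linear_regression(x, y)
prediction = a + b * 70   # predicted 2nd test mark for a first test mark of 70%
print(round(a, 2), round(b, 2), round(r, 2), round(prediction, 2))
```

Note that the intercept comes out negative, so the fitted line predicts a second test mark well below the first test mark across the observed range.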


12.2 (a) y = 54.2 + 0.52x
(b) r = 0.9998, P < 0.001, very highly significant.
(c) s = 0.365, prediction interval: (78.9, 81.5).
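
As a rough check of 12.2(c), the sketch below computes the residual standard deviation and the 95% prediction limits at x = 50. It assumes the usual prediction interval formula for simple linear regression and reads the critical value 3.182 (3 degrees of freedom, P = 0.025) from the t-table rather than computing it; the variable names are illustrative only.

```python
from math import sqrt

temperature = [0, 20, 40, 60, 80]   # x, degrees Celsius
mass = [54, 65, 75, 85, 96]         # y, grams dissolved
T_CRITICAL = 3.182                  # tabulated t value for n - 2 = 3 df, P = 0.025

n = len(temperature)
x_bar = sum(temperature) / n
y_bar = sum(mass) / n
sxx = sum((x - x_bar) ** 2 for x in temperature)
syy = sum((y - y_bar) ** 2 for y in mass)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temperature, mass))

b = sxy / sxx
a = y_bar - b * x_bar
s = sqrt((syy - b * sxy) / (n - 2))   # residual standard deviation

x0 = 50
y0 = a + b * x0
half_width = T_CRITICAL * s * sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
lower, upper = y0 - half_width, y0 + half_width
print(round(s, 3), round(lower, 1), round(upper, 1))
```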


12.3 r = 0.9922, P < 0.001, very highly significant. y = 133.91 + 29.71 x, taking x
as years since 1970. For 1982, x = 12, y = 490.42, and the 90% prediction interval
is (461.03, 519.81).
12.4 $y = 1.395\,e^{0.0930x}$

12.5 $y = 705.71 \times 1.054^{x}$, taking x as years since 1950.
r = 0.997, P < 0.001, very highly significant.

12.6 y = 1.214+0.116 x, taking x as years since 1968. In 1984, x = 16, y = 3.06 million
telephones. The 90% prediction interval is (2.94, 3.19). r = 0.994, P < 0.001,
very highly significant.


12.7 (a) y = 6.86 + 1.96x
(b) r = 0.857, P < 0.001, very highly significant.
(c) s = 5.31.
(d) 10.78 ± 12.57, 18.63 ± 12.35.
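
When only the sums are available, as in 12.7, the same quantities follow from the corrected sums of squares. The sketch below is one way to organize the arithmetic; the critical value 2.228 (10 degrees of freedom, P = 0.025) is taken from the t-table, and `prediction_interval` is a hypothetical helper name.

```python
from math import sqrt

# Summary statistics given in Exercise 12.7.
n = 12
sum_x, sum_y = 60, 200
sum_x2, sum_y2, sum_xy = 504, 4400, 1400
T_CRITICAL = 2.228   # tabulated t value for n - 2 = 10 df, P = 0.025

x_bar = sum_x / n
y_bar = sum_y / n
sxx = sum_x2 - sum_x ** 2 / n
syy = sum_y2 - sum_y ** 2 / n
sxy = sum_xy - sum_x * sum_y / n

b = sxy / sxx
a = y_bar - b * x_bar
r = sxy / sqrt(sxx * syy)
s = sqrt((syy - b * sxy) / (n - 2))   # residual standard deviation


def prediction_interval(x0):
    """Return (centre, half-width) of the 95% prediction interval at x0."""
    centre = a + b * x0
    half_width = T_CRITICAL * s * sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
    return centre, half_width


# Speed is measured in 1000s of revolutions per minute, so 2000 rpm is x = 2.
centre_2, half_2 = prediction_interval(2)
centre_6, half_6 = prediction_interval(6)
print(round(centre_2, 2), round(half_2, 2), round(centre_6, 2), round(half_6, 2))
```

The interval at x = 6 is slightly narrower than at x = 2 because 6 lies closer to the mean speed of 5.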
12.8 (a) r = 0.8339, P < 0.001, very highly significant.
(b) y = 10.329 + 0.894 x.
(c) x = 6.971 + 0.778 y .
(d) The method of least squares is not symmetric in its treatment of the dependent
and independent variables.
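
The asymmetry in 12.8(d), and the identity asked for in 12.13, can be illustrated numerically from the 12.8 summary data. This is a sketch only; the names `b1` and `b2` follow the notation of Exercise 12.13.

```python
from math import sqrt

# Summary statistics given in Exercise 12.8.
n = 20
sum_x, sum_y = 996, 1101
sum_x2, sum_y2, sum_xy = 60972, 73681, 64996

sxx = sum_x2 - sum_x ** 2 / n
syy = sum_y2 - sum_y ** 2 / n
sxy = sum_xy - sum_x * sum_y / n

b1 = sxy / sxx              # slope for predicting y from x
b2 = sxy / syy              # slope for predicting x from y
r = sxy / sqrt(sxx * syy)   # correlation coefficient

# Inverting the first line would give slope 1 / b1, which differs from b2
# unless the correlation is perfect.
print(round(b1, 3), round(b2, 3), round(1 / b1, 3))
print(round(b1 * b2, 4), round(r ** 2, 4))
```

The product of the two slopes equals r squared, so the two lines coincide only when |r| = 1.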
12.9 (b) y = −0.200 + 1.859x.
(c) r = 0.902 > 0.6319, reject H0.
(d) There is not a cause and effect relationship between the variables. The relationship between the variables is caused by a third variable, possibly either population or standard of living.
12.10 (a) r = −0.1159 > −0.7646, cannot reject H0.
(b) 5.81.

12.11 (a) $y = 3.32\,x^{1.231}$.
(b) When x = 10, y = 56.5.
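
The power curve in 12.11 is fitted by regressing log y on log x. The sketch below reproduces the arithmetic from the summary sums, assuming the logarithms in the exercise are natural logarithms (an assumption, but one that matches the stated answer a = 3.32).

```python
from math import exp

# Summary statistics from Exercise 12.11 (logs assumed to be natural logs).
n = 10
sum_lx, sum_ly = 6.5, 20.0      # sums of log x and log y
sum_lx2, sum_lxly = 5.2, 14.2   # sums of (log x)^2 and (log x)(log y)

# Linear regression of log y on log x:  log y = log a + b log x.
b = (sum_lxly - sum_lx * sum_ly / n) / (sum_lx2 - sum_lx ** 2 / n)
log_a = sum_ly / n - b * sum_lx / n
a = exp(log_a)

growth_at_10 = a * 10 ** b   # predicted plant growth when x = 10
print(round(a, 2), round(b, 3), round(growth_at_10, 1))
```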

12.12 (a) $\log_{10} Q = 0.4507 + 0.4143P$,
or $Q = 10^{0.4507 + 0.4143P} = 2.823 \times 2.596^{P}$
(b) Q = 19.02.

12.15 (a) A = 1.988, B = 6, C = 7.337
(b) $R^2 = 0.9172$, or 91.7% of the variation is accounted for by the explanatory
variables.
(c) $F = 22.17 > F^{(0.05)}_{3,6} = 4.76$, hence a significant relationship exists.
(d) In each case, the explanatory variable is significant if |t| > 2.447. For $x_1$, t = 4.00,
for $x_2$, t = 3.04, and for $x_3$, t = 2.92. Hence all three explanatory variables are
significant in the regression model.
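
The missing ANOVA entries in 12.15 follow from the identities SS(total) = SS(regression) + SS(error) and MS = SS / DF. A minimal sketch of that bookkeeping, with illustrative variable names:

```python
# Entries given in the partial ANOVA table of Exercise 12.15.
ss_regression = 22.012
ss_total = 24.000
df_regression = 3
ms_error = 0.331

ss_error = ss_total - ss_regression             # A
df_error = round(ss_error / ms_error)           # B, since MS = SS / DF
ms_regression = ss_regression / df_regression   # C

r_squared = ss_regression / ss_total
f_ratio = ms_regression / ms_error

# t-ratios: estimated coefficient divided by its estimated standard deviation.
coefficients = {"x1": (0.044, 0.011), "x2": (1.271, 0.418), "x3": (0.619, 0.212)}
t_ratios = {name: est / sd for name, (est, sd) in coefficients.items()}

print(round(ss_error, 3), df_error, round(ms_regression, 3))
print(round(r_squared, 4), round(f_ratio, 2))
print({name: round(t, 2) for name, t in t_ratios.items()})
```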
12.16 (a) The longer the service the less the job satisfaction. Larger salaries are associated with greater job satisfaction.
(b) A = 9.142, B = 1.496, C = 2, D = 3.823, E = 9, F = 0.214, and n = 10.
(c) $F = 17.86 > F^{(0.01)}_{2,7} = 9.55$, hence a significant relationship exists.
(d) In each case the explanatory variable is significant if |t| > 2.365. For $x_1$,
t = −3.289, for $x_2$, t = 5.787. Hence both explanatory variables are significant in
the regression model.
12.17 (a) R8500 per day increase in sales

(b) (i) R32 400

(ii) R21 900
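
The 12.17 answers are direct substitutions into the fitted model, with competitors entering negatively. A small sketch (the function name `predicted_sales` is hypothetical):

```python
def predicted_sales(competitors, parking_bays, has_cash_machine):
    """Daily sales in R1000s from the fitted model of Exercise 12.17."""
    x3 = 1 if has_cash_machine else 0   # dummy variable for the cash machine
    return 11.1 - 3.2 * competitors + 0.4 * parking_bays + 8.5 * x3


store_i = predicted_sales(3, 56, True)     # case (i)
store_ii = predicted_sales(1, 35, False)   # case (ii)

# Effect of the dummy variable: same store with and without a cash machine.
cash_machine_effect = predicted_sales(3, 56, True) - predicted_sales(3, 56, False)
print(round(store_i, 1), round(store_ii, 1), round(cash_machine_effect, 1))
```

Because the dummy variable only shifts the intercept, the difference between otherwise identical stores is the coefficient of the dummy, whatever the other values are.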

SUMMARY OF THE PROBABILITY DISTRIBUTIONS

All in one Place! . . .

In this section, we summarize the probability distributions introduced in this book.
The probability mass functions and probability density functions are given, as well as
their means and variances.
BINOMIAL DISTRIBUTION

Discrete

We have n independent trials, each trial has two outcomes, success or failure,
and Pr[success] = p for all trials. The random variable X is the number of
successes in n trials; $n \geq 1$ must be an integer, and $0 \leq p \leq 1$. Then X has
the binomial distribution, i.e. $X \sim B(n, p)$, with probability mass function

$$p(x) = \begin{cases} \binom{n}{x} p^x (1-p)^{n-x} & x = 0, 1, \ldots, n \\ 0 & \text{otherwise} \end{cases}$$

$$E[X] = np \qquad \mathrm{Var}[X] = np(1-p)$$
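
As an illustration, the binomial probability mass function can be evaluated directly and its mean and variance checked against np and np(1 − p). The values n = 10 and p = 0.3 below are arbitrary choices for the example, not taken from the text.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Pr[X = x] for X ~ B(n, p); zero outside x = 0, 1, ..., n."""
    if not 0 <= x <= n:
        return 0.0
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 10, 0.3   # illustrative parameter values
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

mean = sum(x * px for x, px in enumerate(probs))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(probs))
print(round(sum(probs), 6), round(mean, 6), round(variance, 6))
```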

POISSON DISTRIBUTION

Discrete

Events occur at random in time, with an average rate of $\lambda$ events per time
period (or space). The random variable X is a count of the number of events
occurring during a fixed interval of time (or space). The time period (or amount
of space) referred to in the rate must be the same as the time period (or space) in
which events are counted. Then X has the Poisson distribution with parameter
$\lambda > 0$, i.e. $X \sim P(\lambda)$, and has probability mass function

$$p(x) = \begin{cases} \dfrac{e^{-\lambda} \lambda^x}{x!} & x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases}$$

$$E[X] = \lambda \qquad \mathrm{Var}[X] = \lambda$$
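
A short numerical sketch of the Poisson probability mass function follows. The rate 2.5 is an arbitrary illustrative value, and the infinite sums for the mean and variance are truncated at x = 59, where the remaining tail is negligible for this rate.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """Pr[X = x] for a Poisson random variable with rate lam."""
    if x < 0:
        return 0.0
    return exp(-lam) * lam ** x / factorial(x)


lam = 2.5   # illustrative rate (events per period)
probs = [poisson_pmf(x, lam) for x in range(60)]   # tail beyond 59 is negligible

mean = sum(x * px for x, px in enumerate(probs))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(probs))
print(round(mean, 6), round(variance, 6))
```

Both the mean and the variance come out equal to the rate, which is the defining feature of the Poisson distribution summarized above.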

EXPONENTIAL DISTRIBUTION

Continuous

As for the Poisson distribution, events occur at random with average rate $\lambda$ per
unit of time (or space). Let the continuous random variable X be the interval
between two events. X has the exponential distribution with parameter $\lambda$,
i.e. $X \sim E(\lambda)$, with probability density function

$$f(x) = \begin{cases} \lambda e^{-\lambda x} & x \geq 0 \\ 0 & \text{otherwise} \end{cases}$$

$$E[X] = \frac{1}{\lambda} \qquad \mathrm{Var}[X] = \frac{1}{\lambda^2}$$

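NORMAL DISTRIBUTION

Continuous

The normal distribution has parameters $\mu$ ($-\infty < \mu < \infty$) and $\sigma^2$ ($\sigma > 0$).
A continuous random variable X with this distribution is denoted $X \sim N(\mu, \sigma^2)$, and has probability
density function

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right] \qquad -\infty < x < \infty$$

$$E[X] = \mu \qquad \mathrm{Var}[X] = \sigma^2$$

The standard normal distribution is that of a continuous random variable Z with
$\mu = 0$, $\sigma = 1$. The probability density function is

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2} \qquad -\infty < z < \infty$$

$$E[Z] = 0 \qquad \mathrm{Var}[Z] = 1$$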

NEGATIVE BINOMIAL DISTRIBUTION

Discrete

As for the binomial distribution, we have independent trials, each with two
outcomes, success or failure; Pr[success] = p for each trial. Fix the number of
successes r, and let the random variable X be the number of failures obtained
before the r th success. Then X has the negative binomial distribution with
parameters r and p, i.e. $X \sim NB(r, p)$, with probability mass function

$$p(x) = \begin{cases} \binom{x+r-1}{x} p^r q^x & x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases}$$

where $q = 1 - p$.

$$E[X] = \frac{r(1-p)}{p} \qquad \mathrm{Var}[X] = \frac{r(1-p)}{p^2}$$



GEOMETRIC DISTRIBUTION

Discrete

Under the same conditions as for the negative binomial distribution, let the
random variable X be the number of trials before the first success. Then X has
the geometric distribution with parameter p, $X \sim G(p)$, and has probability
mass function

$$p(x) = \begin{cases} p\,q^x & x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases}$$

where $q = 1 - p$.

$$E[X] = \frac{1-p}{p} \qquad \mathrm{Var}[X] = \frac{1-p}{p^2}$$

HYPERGEOMETRIC DISTRIBUTION

Discrete

Given N objects, of which M are of type 1 and $N - M$ of type 2, a sample
of size n ($n \leq N$) is drawn. Let the random variable X be the number of
objects of type 1 in the sample. Then X has the hypergeometric distribution,
$X \sim H(N, M, n)$, with probability mass function

$$p(x) = \begin{cases} \dbinom{M}{x}\dbinom{N-M}{n-x} \Big/ \dbinom{N}{n} & x = 0, 1, \ldots, n \\ 0 & \text{otherwise} \end{cases}$$

$$E[X] = \frac{nM}{N} \qquad \mathrm{Var}[X] = n\,\frac{M}{N}\left(1 - \frac{M}{N}\right)\left(\frac{N-n}{N-1}\right)$$

UNIFORM DISTRIBUTION

Continuous

If the continuous random variable X is equally likely to take on any value in
the interval (a, b), then X has the uniform distribution, $X \sim U(a, b)$, with
probability density function

$$f(x) = \begin{cases} \dfrac{1}{b-a} & a \leq x \leq b \\ 0 & \text{otherwise} \end{cases}$$

$$E[X] = \frac{a+b}{2} \qquad \mathrm{Var}[X] = \frac{(b-a)^2}{12}$$

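THE t, F and $\chi^2$ DISTRIBUTIONS

Continuous

For completeness, we state the probability density functions of these three
distributions. The t-distribution with parameter n, the degrees of freedom, is
that of a continuous random variable with probability density function

$$f(x) = \frac{1}{\sqrt{n\pi}}\,\frac{\Gamma[(n+1)/2]}{\Gamma(n/2)} \left(1 + \frac{x^2}{n}\right)^{-(n+1)/2} \qquad -\infty < x < \infty$$

$$E[X] = 0 \qquad \mathrm{Var}[X] = \frac{n}{n-2}$$

The F-distribution with parameters $n_1$ and $n_2$, the degrees of freedom for the
numerator and denominator respectively, is that of a continuous random variable with
probability density function

$$f(x) = \begin{cases} \dfrac{\Gamma[(n_1+n_2)/2]}{\Gamma(n_1/2)\,\Gamma(n_2/2)} \left(\dfrac{n_1}{n_2}\right)^{n_1/2} \dfrac{x^{(n_1/2)-1}}{\left(1 + \frac{n_1}{n_2}x\right)^{(n_1+n_2)/2}} & 0 < x < \infty \\ 0 & \text{otherwise} \end{cases}$$

$$E[X] = \frac{n_2}{n_2-2} \qquad \mathrm{Var}[X] = \frac{2n_2^2(n_1+n_2-2)}{n_1(n_2-2)^2(n_2-4)}$$

The $\chi^2$-distribution with parameter n, the degrees of freedom, is that of a continuous
random variable with probability density function

$$f(x) = \begin{cases} \dfrac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{n/2-1} e^{-x/2} & 0 < x < \infty \\ 0 & \text{otherwise} \end{cases}$$

$$E[X] = n \qquad \mathrm{Var}[X] = 2n$$

$\Gamma(n)$ is equal to $(n-1)!$ if n is a positive integer. For an argument of the form $n + \frac{1}{2}$, where n is a positive integer,

$$\Gamma\left(n + \tfrac{1}{2}\right) = \frac{1 \cdot 3 \cdot 5 \cdots (2n-1)}{2^n}\,\sqrt{\pi}$$

so that

$$\Gamma(4) = 3! = 6 \qquad \text{and} \qquad \Gamma(2.5) = \frac{1 \cdot 3}{2^2}\,\sqrt{\pi} = 1.329$$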
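
The two gamma function examples can be checked against Python's standard library. In the sketch below, `gamma_half_integer` is a hypothetical helper implementing the product formula for an argument n + 1/2 with n a positive integer; `math.gamma` is the library function used for comparison.

```python
from math import factorial, gamma, pi, sqrt


def gamma_half_integer(n):
    """Gamma(n + 1/2) for a positive integer n, via 1*3*5*...*(2n-1) sqrt(pi) / 2^n."""
    odd_product = 1
    for k in range(1, 2 * n, 2):   # odd numbers 1, 3, ..., 2n - 1
        odd_product *= k
    return odd_product * sqrt(pi) / 2 ** n


gamma_4 = gamma(4)                  # should equal 3!
gamma_2_5 = gamma_half_integer(2)   # Gamma(2.5)
print(gamma_4, factorial(3), round(gamma_2_5, 3), round(gamma(2.5), 3))
```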

TABLES

Table 1      Standard normal distribution
Table 2      t-distribution
Table 3      Chi-squared distribution
Table 4.1    F-distribution (5% points)
Table 4.2    F-distribution (2.5% points)
Table 4.3    F-distribution (1% points)
Table 4.4    F-distribution (0.5% points)
             Correlation coefficient

TABLE 1. STANDARD NORMAL DISTRIBUTION: Areas under the standard normal
curve between 0 and z, i.e. Pr[0 < Z < z]

 z      0.00     0.01     0.02     0.03     0.04     0.05     0.06     0.07     0.08     0.09
0.0   0.0000   0.0040   0.0080   0.0120   0.0160   0.0199   0.0239   0.0279   0.0319   0.0359
0.1   0.0398   0.0438   0.0478   0.0517   0.0557   0.0596   0.0636   0.0675   0.0714   0.0753
0.2   0.0793   0.0832   0.0871   0.0910   0.0948   0.0987   0.1026   0.1064   0.1103   0.1141
0.3   0.1179   0.1217   0.1255   0.1293   0.1331   0.1368   0.1406   0.1443   0.1480   0.1517
0.4   0.1554   0.1591   0.1628   0.1664   0.1700   0.1736   0.1772   0.1808   0.1844   0.1879
0.5   0.1915   0.1950   0.1985   0.2019   0.2054   0.2088   0.2123   0.2157   0.2190   0.2224
0.6   0.2257   0.2291   0.2324   0.2357   0.2389   0.2422   0.2454   0.2486   0.2517   0.2549
0.7   0.2580   0.2611   0.2642   0.2673   0.2704   0.2734   0.2764   0.2794   0.2823   0.2852
0.8   0.2881   0.2910   0.2939   0.2967   0.2995   0.3023   0.3051   0.3078   0.3106   0.3133
0.9   0.3159   0.3186   0.3212   0.3238   0.3264   0.3289   0.3315   0.3340   0.3365   0.3389
1.0   0.3413   0.3438   0.3461   0.3485   0.3508   0.3531   0.3554   0.3577   0.3599   0.3621
1.1   0.3643   0.3665   0.3686   0.3708   0.3729   0.3749   0.3770   0.3790   0.3810   0.3830
1.2   0.3849   0.3869   0.3888   0.3907   0.3925   0.3944   0.3962   0.3980   0.3997   0.4015
1.3   0.4032   0.4049   0.4066   0.4082   0.4099   0.4115   0.4131   0.4147   0.4162   0.4177
1.4   0.4192   0.4207   0.4222   0.4236   0.4251   0.4265   0.4279   0.4292   0.4306   0.4319
1.5   0.4332   0.4345   0.4357   0.4370   0.4382   0.4394   0.4406   0.4418   0.4429   0.4441
1.6   0.4452   0.4463   0.4474   0.4484   0.4495   0.4505   0.4515   0.4525   0.4535   0.4545
1.7   0.4554   0.4564   0.4573   0.4582   0.4591   0.4599   0.4608   0.4616   0.4625   0.4633
1.8   0.4641   0.4649   0.4656   0.4664   0.4671   0.4678   0.4686   0.4693   0.4699   0.4706
1.9   0.4713   0.4719   0.4726   0.4732   0.4738   0.4744   0.4750   0.4756   0.4761   0.4767
2.0   0.4772   0.4778   0.4783   0.4788   0.4793   0.4798   0.4803   0.4808   0.4812   0.4817
2.1   0.4821   0.4826   0.4830   0.4834   0.4838   0.4842   0.4846   0.4850   0.4854   0.4857
2.2   0.4861   0.4864   0.4868   0.4871   0.4875   0.4878   0.4881   0.4884   0.4887   0.4890
2.3   0.4893   0.4896   0.4898   0.4901   0.4904   0.4906   0.4909   0.4911   0.4913   0.4916
2.4   0.49180  0.49202  0.49224  0.49245  0.49266  0.49286  0.49305  0.49324  0.49343  0.49361
2.5   0.49379  0.49396  0.49413  0.49430  0.49446  0.49461  0.49477  0.49492  0.49506  0.49520
2.6   0.49534  0.49547  0.49560  0.49573  0.49585  0.49598  0.49609  0.49621  0.49632  0.49643
2.7   0.49653  0.49664  0.49674  0.49683  0.49693  0.49702  0.49711  0.49720  0.49728  0.49736
2.8   0.49744  0.49752  0.49760  0.49767  0.49774  0.49781  0.49788  0.49795  0.49801  0.49807
2.9   0.49813  0.49819  0.49825  0.49831  0.49836  0.49841  0.49846  0.49851  0.49856  0.49861
3.0   0.49865  0.49869  0.49874  0.49878  0.49882  0.49886  0.49889  0.49893  0.49896  0.49900
3.1   0.49903  0.49906  0.49910  0.49913  0.49916  0.49918  0.49921  0.49924  0.49926  0.49929
3.2   0.49931  0.49934  0.49936  0.49938  0.49940  0.49942  0.49944  0.49946  0.49948  0.49950
3.3   0.49952  0.49953  0.49955  0.49957  0.49958  0.49960  0.49961  0.49962  0.49964  0.49965
3.4   0.49966  0.49968  0.49969  0.49970  0.49971  0.49972  0.49973  0.49974  0.49975  0.49976
3.5   0.49977  0.49978  0.49978  0.49979  0.49980  0.49981  0.49981  0.49982  0.49983  0.49983
3.6   0.49984  0.49985  0.49985  0.49986  0.49986  0.49987  0.49987  0.49988  0.49988  0.49989
3.7   0.49989  0.49990  0.49990  0.49990  0.49991  0.49991  0.49992  0.49992  0.49992  0.49992
3.8   0.49993  0.49993  0.49993  0.49994  0.49994  0.49994  0.49994  0.49995  0.49995  0.49995
3.9   0.49995  0.49995  0.49996  0.49996  0.49996  0.49996  0.49996  0.49996  0.49997  0.49997
4.0   0.49997  0.49997  0.49997  0.49997  0.49997  0.49997  0.49998  0.49998  0.49998  0.49998
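
Entries of Table 1 can be reproduced with Python's standard library, since the tabulated area Pr[0 < Z < z] is the standard normal cumulative probability at z minus one half. The function name `table1_area` is a hypothetical helper introduced for this sketch.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def table1_area(z):
    """Area under the standard normal curve between 0 and z, for z >= 0."""
    return STANDARD_NORMAL.cdf(z) - 0.5


# Row 1.9, column 0.06 of the table corresponds to z = 1.96.
area_196 = table1_area(1.96)
area_100 = table1_area(1.00)
print(round(area_196, 4), round(area_100, 4))
```

For negative z the symmetry of the normal curve applies, so the area between z and 0 equals the tabulated area for |z|.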


TABLE 2. t-DISTRIBUTION: One sided critical values, i.e. the value of $t_n^{(P)}$ such that
$P = \Pr[t_n > t_n^{(P)}]$, where n is the degrees of freedom
n
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
45
50
60
70
80
90
100
110
120
140
160
180
200
z

0.4
0.325
0.289
0.277
0.271
0.267
0.265
0.263
0.262
0.261
0.260
0.260
0.259
0.259
0.258
0.258
0.258
0.257
0.257
0.257
0.257
0.257
0.256
0.256
0.256
0.256
0.256
0.256
0.256
0.256
0.256
0.256
0.255
0.255
0.255
0.255
0.255
0.255
0.255
0.255
0.255
0.255
0.255
0.254
0.254
0.254
0.254
0.254
0.254
0.254
0.254
0.254
0.254
0.254
0.253

0.3
0.727
0.617
0.584
0.569
0.559
0.553
0.549
0.546
0.543
0.542
0.540
0.539
0.538
0.537
0.536
0.535
0.534
0.534
0.533
0.533
0.532
0.532
0.532
0.531
0.531
0.531
0.531
0.530
0.530
0.530
0.530
0.530
0.530
0.529
0.529
0.529
0.529
0.529
0.529
0.529
0.528
0.528
0.527
0.527
0.526
0.526
0.526
0.526
0.526
0.526
0.525
0.525
0.525
0.524

0.2
1.376
1.061
0.978
0.941
0.920
0.906
0.896
0.889
0.883
0.879
0.876
0.873
0.870
0.868
0.866
0.865
0.863
0.862
0.861
0.860
0.859
0.858
0.858
0.857
0.856
0.856
0.855
0.855
0.854
0.854
0.853
0.853
0.853
0.852
0.852
0.852
0.851
0.851
0.851
0.851
0.850
0.849
0.848
0.847
0.846
0.846
0.845
0.845
0.845
0.844
0.844
0.844
0.843
0.842

0.1
3.078
1.886
1.638
1.533
1.476
1.440
1.415
1.397
1.383
1.372
1.363
1.356
1.350
1.345
1.341
1.337
1.333
1.330
1.328
1.325
1.323
1.321
1.319
1.318
1.316
1.315
1.314
1.313
1.311
1.310
1.309
1.309
1.308
1.307
1.306
1.306
1.305
1.304
1.304
1.303
1.301
1.299
1.296
1.294
1.292
1.291
1.290
1.289
1.289
1.288
1.287
1.286
1.286
1.282

Probability Level P
0.05 0.025
0.01
6.314 12.71 31.82
2.920 4.303 6.965
2.353 3.182 4.541
2.132 2.776 3.747
2.015 2.571 3.365
1.943 2.447 3.143
1.895 2.365 2.998
1.860 2.306 2.896
1.833 2.262 2.821
1.812 2.228 2.764
1.796 2.201 2.718
1.782 2.179 2.681
1.771 2.160 2.650
1.761 2.145 2.624
1.753 2.131 2.602
1.746 2.120 2.583
1.740 2.110 2.567
1.734 2.101 2.552
1.729 2.093 2.539
1.725 2.086 2.528
1.721 2.080 2.518
1.717 2.074 2.508
1.714 2.069 2.500
1.711 2.064 2.492
1.708 2.060 2.485
1.706 2.056 2.479
1.703 2.052 2.473
1.701 2.048 2.467
1.699 2.045 2.462
1.697 2.042 2.457
1.696 2.040 2.453
1.694 2.037 2.449
1.692 2.035 2.445
1.691 2.032 2.441
1.690 2.030 2.438
1.688 2.028 2.434
1.687 2.026 2.431
1.686 2.024 2.429
1.685 2.023 2.426
1.684 2.021 2.423
1.679 2.014 2.412
1.676 2.009 2.403
1.671 2.000 2.390
1.667 1.994 2.381
1.664 1.990 2.374
1.662 1.987 2.368
1.660 1.984 2.364
1.659 1.982 2.361
1.658 1.980 2.358
1.656 1.977 2.353
1.654 1.975 2.350
1.653 1.973 2.347
1.653 1.972 2.345
1.645 1.960 2.326

0.005
63.66
9.925
5.841
4.604
4.032
3.707
3.499
3.355
3.250
3.169
3.106
3.055
3.012
2.977
2.947
2.921
2.898
2.878
2.861
2.845
2.831
2.819
2.807
2.797
2.787
2.779
2.771
2.763
2.756
2.750
2.744
2.738
2.733
2.728
2.724
2.719
2.715
2.712
2.708
2.704
2.690
2.678
2.660
2.648
2.639
2.632
2.626
2.621
2.617
2.611
2.607
2.603
2.601
2.576

0.0025
127.3
14.09
7.453
5.598
4.773
4.317
4.029
3.833
3.690
3.581
3.497
3.428
3.372
3.326
3.286
3.252
3.222
3.197
3.174
3.153
3.135
3.119
3.104
3.091
3.078
3.067
3.057
3.047
3.038
3.030
3.022
3.015
3.008
3.002
2.996
2.990
2.985
2.980
2.976
2.971
2.952
2.937
2.915
2.899
2.887
2.878
2.871
2.865
2.860
2.852
2.847
2.842
2.838
2.807

0.001
318.3
22.33
10.21
7.173
5.894
5.208
4.785
4.501
4.297
4.144
4.025
3.930
3.852
3.787
3.733
3.686
3.646
3.610
3.579
3.552
3.527
3.505
3.485
3.467
3.450
3.435
3.421
3.408
3.396
3.385
3.375
3.365
3.356
3.348
3.340
3.333
3.326
3.319
3.313
3.307
3.281
3.261
3.232
3.211
3.195
3.183
3.174
3.166
3.160
3.149
3.142
3.136
3.131
3.090

0.0005
636.6
31.60
12.92
8.610
6.869
5.959
5.408
5.041
4.781
4.587
4.437
4.318
4.221
4.140
4.073
4.015
3.965
3.922
3.883
3.850
3.819
3.792
3.768
3.745
3.725
3.707
3.689
3.674
3.660
3.646
3.633
3.622
3.611
3.601
3.591
3.582
3.574
3.566
3.558
3.551
3.520
3.496
3.460
3.435
3.416
3.402
3.390
3.381
3.373
3.361
3.352
3.345
3.340
3.291


TABLE 3. CHI-SQUARED DISTRIBUTION: One sided critical values, i.e. the value
of $\chi_n^{2(P)}$ such that $P = \Pr[\chi_n^2 > \chi_n^{2(P)}]$, where n is the degrees of freedom, for P > 0.5

n
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
45
50
60
70
80
90
100
110
120
140
160
180
200

0.9995
0.000
0.001
0.015
0.064
0.158
0.299
0.485
0.710
0.972
1.265
1.587
1.935
2.305
2.697
3.107
3.536
3.980
4.439
4.913
5.398
5.895
6.404
6.924
7.453
7.991
8.537
9.093
9.656
10.227
10.804
11.388
11.980
12.576
13.180
13.788
14.401
15.021
15.644
16.272
16.906
20.136
23.461
30.339
37.467
44.792
52.277
59.895
67.631
75.465
91.389
107.599
124.032
140.659

0.999
0.000
0.002
0.024
0.091
0.210
0.381
0.599
0.857
1.152
1.479
1.834
2.214
2.617
3.041
3.483
3.942
4.416
4.905
5.407
5.921
6.447
6.983
7.529
8.085
8.649
9.222
9.803
10.391
10.986
11.588
12.196
12.810
13.431
14.057
14.688
15.324
15.965
16.611
17.261
17.917
21.251
24.674
31.738
39.036
46.520
54.156
61.918
69.790
77.756
93.925
110.359
127.011
143.842

0.9975
0.000
0.005
0.045
0.145
0.307
0.527
0.794
1.104
1.450
1.827
2.232
2.661
3.112
3.582
4.070
4.573
5.092
5.623
6.167
6.723
7.289
7.865
8.450
9.044
9.646
10.256
10.873
11.497
12.128
12.765
13.407
14.055
14.709
15.368
16.032
16.700
17.373
18.050
18.732
19.417
22.899
26.464
33.791
41.332
49.043
56.892
64.857
72.922
81.073
97.591
114.350
131.305
148.426

0.995
0.000
0.010
0.072
0.207
0.412
0.676
0.989
1.344
1.735
2.156
2.603
3.074
3.565
4.075
4.601
5.142
5.697
6.265
6.844
7.434
8.034
8.643
9.260
9.886
10.520
11.160
11.808
12.461
13.121
13.787
14.458
15.134
15.815
16.501
17.192
17.887
18.586
19.289
19.996
20.707
24.311
27.991
35.534
43.275
51.172
59.196
67.328
75.550
83.852
100.655
117.679
134.884
152.241

Probability Level P
0.99
0.975
0.000
0.001
0.020
0.051
0.115
0.216
0.297
0.484
0.554
0.831
0.872
1.237
1.239
1.690
1.647
2.180
2.088
2.700
2.558
3.247
3.053
3.816
3.571
4.404
4.107
5.009
4.660
5.629
5.229
6.262
5.812
6.908
6.408
7.564
7.015
8.231
7.633
8.907
8.260
9.591
8.897
10.283
9.542
10.982
10.196
11.689
10.856
12.401
11.524
13.120
12.198
13.844
12.878
14.573
13.565
15.308
14.256
16.047
14.953
16.791
15.655
17.539
16.362
18.291
17.073
19.047
17.789
19.806
18.509
20.569
19.233
21.336
19.960
22.106
20.691
22.878
21.426
23.654
22.164
24.433
25.901
28.366
29.707
32.357
37.485
40.482
45.442
48.758
53.540
57.153
61.754
65.647
70.065
74.222
78.458
82.867
86.923
91.573
104.034 109.137
121.346 126.870
138.821 144.741
156.432 162.728

0.95
0.004
0.103
0.352
0.711
1.145
1.635
2.167
2.733
3.325
3.940
4.575
5.226
5.892
6.571
7.261
7.962
8.672
9.390
10.117
10.851
11.591
12.338
13.091
13.848
14.611
15.379
16.151
16.928
17.708
18.493
19.281
20.072
20.867
21.664
22.465
23.269
24.075
24.884
25.695
26.509
30.612
34.764
43.188
51.739
60.391
69.126
77.929
86.792
95.705
113.659
131.756
149.969
168.279

0.9
0.016
0.211
0.584
1.064
1.610
2.204
2.833
3.490
4.168
4.865
5.578
6.304
7.041
7.790
8.547
9.312
10.085
10.865
11.651
12.443
13.240
14.041
14.848
15.659
16.473
17.292
18.114
18.939
19.768
20.599
21.434
22.271
23.110
23.952
24.797
25.643
26.492
27.343
28.196
29.051
33.350
37.689
46.459
55.329
64.278
73.291
82.358
91.471
100.624
119.029
137.546
156.153
174.835

0.8
0.064
0.446
1.005
1.649
2.343
3.070
3.822
4.594
5.380
6.179
6.989
7.807
8.634
9.467
10.307
11.152
12.002
12.857
13.716
14.578
15.445
16.314
17.187
18.062
18.940
19.820
20.703
21.588
22.475
23.364
24.255
25.148
26.042
26.938
27.836
28.735
29.635
30.537
31.441
32.345
36.884
41.449
50.641
59.898
69.207
78.558
87.945
97.362
106.806
125.758
144.783
163.868
183.003

0.6
0.275
1.022
1.869
2.753
3.656
4.570
5.493
6.423
7.357
8.295
9.237
10.182
11.129
12.078
13.030
13.983
14.937
15.893
16.850
17.809
18.768
19.729
20.690
21.652
22.616
23.579
24.544
25.509
26.475
27.442
28.409
29.376
30.344
31.313
32.282
33.252
34.222
35.192
36.163
37.134
41.995
46.864
56.620
66.396
76.188
85.993
95.808
105.632
115.465
135.149
154.856
174.580
194.319


TABLE 3, continued. CHI-SQUARED DISTRIBUTION: One sided critical values, i.e.
the value of $\chi_n^{2(P)}$ such that $P = \Pr[\chi_n^2 > \chi_n^{2(P)}]$, where n is the degrees of freedom, for
P < 0.5

n
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
45
50
60
70
80
90
100
110
120
140
160
180
200

0.4
0.708
1.833
2.946
4.045
5.132
6.211
7.283
8.351
9.414
10.473
11.530
12.584
13.636
14.685
15.733
16.780
17.824
18.868
19.910
20.951
21.992
23.031
24.069
25.106
26.143
27.179
28.214
29.249
30.283
31.316
32.349
33.381
34.413
35.444
36.475
37.505
38.535
39.564
40.593
41.622
46.761
51.892
62.135
72.358
82.566
92.761
102.946
113.121
123.289
143.604
163.898
184.173
204.434

0.2
1.642
3.219
4.642
5.989
7.289
8.558
9.803
11.030
12.242
13.442
14.631
15.812
16.985
18.151
19.311
20.465
21.615
22.760
23.900
25.038
26.171
27.301
28.429
29.553
30.675
31.795
32.912
34.027
35.139
36.250
37.359
38.466
39.572
40.676
41.778
42.879
43.978
45.076
46.173
47.269
52.729
58.164
68.972
79.715
90.405
101.054
111.667
122.250
132.806
153.854
174.828
195.743
216.609

0.1
2.706
4.605
6.251
7.779
9.236
10.645
12.017
13.362
14.684
15.987
17.275
18.549
19.812
21.064
22.307
23.542
24.769
25.989
27.204
28.412
29.615
30.813
32.007
33.196
34.382
35.563
36.741
37.916
39.087
40.256
41.422
42.585
43.745
44.903
46.059
47.212
48.363
49.513
50.660
51.805
57.505
63.167
74.397
85.527
96.578
107.565
118.498
129.385
140.233
161.827
183.311
204.704
226.021

0.05
3.841
5.991
7.815
9.488
11.070
12.592
14.067
15.507
16.919
18.307
19.675
21.026
22.362
23.685
24.996
26.296
27.587
28.869
30.144
31.410
32.671
33.924
35.172
36.415
37.652
38.885
40.113
41.337
42.557
43.773
44.985
46.194
47.400
48.602
49.802
50.998
52.192
53.384
54.572
55.758
61.656
67.505
79.082
90.531
101.879
113.145
124.342
135.480
146.567
168.613
190.516
212.304
233.994

Probability Level P
0.025
0.01
5.024
6.635
7.378
9.210
9.348
11.345
11.143
13.277
12.832
15.086
14.449
16.812
16.013
18.475
17.535
20.090
19.023
21.666
20.483
23.209
21.920
24.725
23.337
26.217
24.736
27.688
26.119
29.141
27.488
30.578
28.845
32.000
30.191
33.409
31.526
34.805
32.852
36.191
34.170
37.566
35.479
38.932
36.781
40.289
38.076
41.638
39.364
42.980
40.646
44.314
41.923
45.642
43.195
46.963
44.461
48.278
45.722
49.588
46.979
50.892
48.232
52.191
49.480
53.486
50.725
54.775
51.966
56.061
53.203
57.342
54.437
58.619
55.668
59.893
56.895
61.162
58.120
62.428
59.342
63.691
65.410
69.957
71.420
76.154
83.298
88.379
95.023 100.425
106.629 112.329
118.136 124.116
129.561 135.807
140.916 147.414
152.211 158.950
174.648 181.841
196.915 204.530
219.044 227.056
241.058 249.445

0.005
7.879
10.597
12.838
14.860
16.750
18.548
20.278
21.955
23.589
25.188
26.757
28.300
29.819
31.319
32.801
34.267
35.718
37.156
38.582
39.997
41.401
42.796
44.181
45.558
46.928
48.290
49.645
50.994
52.335
53.672
55.002
56.328
57.648
58.964
60.275
61.581
62.883
64.181
65.475
66.766
73.166
79.490
91.952
104.215
116.321
128.299
140.170
151.948
163.648
186.847
209.824
232.620
255.264

0.0025
9.140
11.983
14.320
16.424
18.385
20.249
22.040
23.774
25.463
27.112
28.729
30.318
31.883
33.426
34.949
36.456
37.946
39.422
40.885
42.336
43.775
45.204
46.623
48.034
49.435
50.829
52.215
53.594
54.966
56.332
57.692
59.046
60.395
61.738
63.076
64.410
65.738
67.063
68.383
69.699
76.223
82.664
95.344
107.808
120.102
132.255
144.292
156.230
168.081
191.565
214.808
237.855
260.735

0.001
10.827
13.815
16.266
18.466
20.515
22.457
24.321
26.124
27.877
29.588
31.264
32.909
34.527
36.124
37.698
39.252
40.791
42.312
43.819
45.314
46.796
48.268
49.728
51.179
52.619
54.051
55.475
56.892
58.301
59.702
61.098
62.487
63.869
65.247
66.619
67.985
69.348
70.704
72.055
73.403
80.078
86.660
99.608
112.317
124.839
137.208
149.449
161.582
173.618
197.450
221.020
244.372
267.539

0.0005
12.115
15.201
17.731
19.998
22.106
24.102
26.018
27.867
29.667
31.419
33.138
34.821
36.477
38.109
39.717
41.308
42.881
44.434
45.974
47.498
49.010
50.510
51.999
53.478
54.948
56.407
57.856
59.299
60.734
62.160
63.581
64.993
66.401
67.804
69.197
70.588
71.971
73.350
74.724
76.096
82.873
89.560
102.697
115.577
128.264
140.780
153.164
165.436
177.601
201.680
225.477
249.049
272.422
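
The columns above appear to be upper-tail critical values of the chi-squared distribution: 5.024 and 6.635 head the 0.025 and 0.01 columns, which matches 1 degree of freedom. Under that assumption, the sketch below shows one way to reproduce an entry numerically, using the series for the regularized lower incomplete gamma function and a bisection search. The function names `chi2_cdf` and `chi2_critical` are illustrative and do not come from the text.

```python
import math


def chi2_cdf(x, df):
    """P(X <= x) for a chi-squared variable with df degrees of freedom.

    Uses the power series for the regularized lower incomplete gamma
    function P(a, z) with a = df/2 and z = x/2.
    """
    if x <= 0.0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0   # n = 0 term of the series
    total = 1.0
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= z / (a + n)
        total += term
    # Prefactor z^a * exp(-z) / Gamma(a + 1), computed on the log scale.
    return total * math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))


def chi2_critical(p, df):
    """Value c such that P(X > c) = p, found by bisection."""
    target = 1.0 - p
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < target:   # expand until the root is bracketed
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Compare with the first entries of the 0.025, 0.01 and 0.005 columns.
    for p in (0.025, 0.01, 0.005):
        print(p, round(chi2_critical(p, 1), 3))
```

Bisection is slower than Newton's method, but it needs only the distribution function and cannot overshoot, which makes it a convenient check on table entries.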

INTROSTAT

TABLE 4.1. 5% critical values for the F-DISTRIBUTION, i.e. the value of $F^{(0.05)}_{\mathrm{NUM},\mathrm{DEN}}$, where NUM and DEN are the numerator and denominator degrees of freedom respectively
DEN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
45
50
60
70
80
90
100
110
120
140
160
180
200

1
161.4
18.51
10.13
7.71
6.61
5.99
5.59
5.32
5.12
4.96
4.84
4.75
4.67
4.60
4.54
4.49
4.45
4.41
4.38
4.35
4.32
4.30
4.28
4.26
4.24
4.23
4.21
4.20
4.18
4.17
4.16
4.15
4.14
4.13
4.12
4.11
4.11
4.10
4.09
4.08
4.06
4.03
4.00
3.98
3.96
3.95
3.94
3.93
3.92
3.91
3.90
3.89
3.89

2
199.5
19.00
9.55
6.94
5.79
5.14
4.74
4.46
4.26
4.10
3.98
3.89
3.81
3.74
3.68
3.63
3.59
3.55
3.52
3.49
3.47
3.44
3.42
3.40
3.39
3.37
3.35
3.34
3.33
3.32
3.30
3.29
3.28
3.28
3.27
3.26
3.25
3.24
3.24
3.23
3.20
3.18
3.15
3.13
3.11
3.10
3.09
3.08
3.07
3.06
3.05
3.05
3.04

3
215.7
19.16
9.28
6.59
5.41
4.76
4.35
4.07
3.86
3.71
3.59
3.49
3.41
3.34
3.29
3.24
3.20
3.16
3.13
3.10
3.07
3.05
3.03
3.01
2.99
2.98
2.96
2.95
2.93
2.92
2.91
2.90
2.89
2.88
2.87
2.87
2.86
2.85
2.85
2.84
2.81
2.79
2.76
2.74
2.72
2.71
2.70
2.69
2.68
2.67
2.66
2.65
2.65

4
224.6
19.25
9.12
6.39
5.19
4.53
4.12
3.84
3.63
3.48
3.36
3.26
3.18
3.11
3.06
3.01
2.96
2.93
2.90
2.87
2.84
2.82
2.80
2.78
2.76
2.74
2.73
2.71
2.70
2.69
2.68
2.67
2.66
2.65
2.64
2.63
2.63
2.62
2.61
2.61
2.58
2.56
2.53
2.50
2.49
2.47
2.46
2.45
2.45
2.44
2.43
2.42
2.42

NUM (Numerator Degrees of Freedom)


5
6
7
8
9
10
230.2 234.0 236.8 238.9 240.5 241.9
19.30 19.33 19.35 19.37 19.38 19.40
9.01
8.94
8.89
8.85
8.81
8.79
6.26
6.16
6.09
6.04
6.00
5.96
5.05
4.95
4.88
4.82
4.77
4.74
4.39
4.28
4.21
4.15
4.10
4.06
3.97
3.87
3.79
3.73
3.68
3.64
3.69
3.58
3.50
3.44
3.39
3.35
3.48
3.37
3.29
3.23
3.18
3.14
3.33
3.22
3.14
3.07
3.02
2.98
3.20
3.09
3.01
2.95
2.90
2.85
3.11
3.00
2.91
2.85
2.80
2.75
3.03
2.92
2.83
2.77
2.71
2.67
2.96
2.85
2.76
2.70
2.65
2.60
2.90
2.79
2.71
2.64
2.59
2.54
2.85
2.74
2.66
2.59
2.54
2.49
2.81
2.70
2.61
2.55
2.49
2.45
2.77
2.66
2.58
2.51
2.46
2.41
2.74
2.63
2.54
2.48
2.42
2.38
2.71
2.60
2.51
2.45
2.39
2.35
2.68
2.57
2.49
2.42
2.37
2.32
2.66
2.55
2.46
2.40
2.34
2.30
2.64
2.53
2.44
2.37
2.32
2.27
2.62
2.51
2.42
2.36
2.30
2.25
2.60
2.49
2.40
2.34
2.28
2.24
2.59
2.47
2.39
2.32
2.27
2.22
2.57
2.46
2.37
2.31
2.25
2.20
2.56
2.45
2.36
2.29
2.24
2.19
2.55
2.43
2.35
2.28
2.22
2.18
2.53
2.42
2.33
2.27
2.21
2.16
2.52
2.41
2.32
2.25
2.20
2.15
2.51
2.40
2.31
2.24
2.19
2.14
2.50
2.39
2.30
2.23
2.18
2.13
2.49
2.38
2.29
2.23
2.17
2.12
2.49
2.37
2.29
2.22
2.16
2.11
2.48
2.36
2.28
2.21
2.15
2.11
2.47
2.36
2.27
2.20
2.14
2.10
2.46
2.35
2.26
2.19
2.14
2.09
2.46
2.34
2.26
2.19
2.13
2.08
2.45
2.34
2.25
2.18
2.12
2.08
2.42
2.31
2.22
2.15
2.10
2.05
2.40
2.29
2.20
2.13
2.07
2.03
2.37
2.25
2.17
2.10
2.04
1.99
2.35
2.23
2.14
2.07
2.02
1.97
2.33
2.21
2.13
2.06
2.00
1.95
2.32
2.20
2.11
2.04
1.99
1.94
2.31
2.19
2.10
2.03
1.97
1.93
2.30
2.18
2.09
2.02
1.97
1.92
2.29
2.18
2.09
2.02
1.96
1.91
2.28
2.16
2.08
2.01
1.95
1.90
2.27
2.16
2.07
2.00
1.94
1.89
2.26
2.15
2.06
1.99
1.93
1.88
2.26
2.14
2.06
1.98
1.93
1.88

11
243.0
19.40
8.76
5.94
4.70
4.03
3.60
3.31
3.10
2.94
2.82
2.72
2.63
2.57
2.51
2.46
2.41
2.37
2.34
2.31
2.28
2.26
2.24
2.22
2.20
2.18
2.17
2.15
2.14
2.13
2.11
2.10
2.09
2.08
2.07
2.07
2.06
2.05
2.04
2.04
2.01
1.99
1.95
1.93
1.91
1.90
1.89
1.88
1.87
1.86
1.85
1.84
1.84

12
243.9
19.41
8.74
5.91
4.68
4.00
3.57
3.28
3.07
2.91
2.79
2.69
2.60
2.53
2.48
2.42
2.38
2.34
2.31
2.28
2.25
2.23
2.20
2.18
2.16
2.15
2.13
2.12
2.10
2.09
2.08
2.07
2.06
2.05
2.04
2.03
2.02
2.02
2.01
2.00
1.97
1.95
1.92
1.89
1.88
1.86
1.85
1.84
1.83
1.82
1.81
1.81
1.80

13
244.7
19.42
8.73
5.89
4.66
3.98
3.55
3.26
3.05
2.89
2.76
2.66
2.58
2.51
2.45
2.40
2.35
2.31
2.28
2.25
2.22
2.20
2.18
2.15
2.14
2.12
2.10
2.09
2.08
2.06
2.05
2.04
2.03
2.02
2.01
2.00
2.00
1.99
1.98
1.97
1.94
1.92
1.89
1.86
1.84
1.83
1.82
1.81
1.80
1.79
1.78
1.77
1.77

14
245.4
19.42
8.71
5.87
4.64
3.96
3.53
3.24
3.03
2.86
2.74
2.64
2.55
2.48
2.42
2.37
2.33
2.29
2.26
2.22
2.20
2.17
2.15
2.13
2.11
2.09
2.08
2.06
2.05
2.04
2.03
2.01
2.00
1.99
1.99
1.98
1.97
1.96
1.95
1.95
1.92
1.89
1.86
1.84
1.82
1.80
1.79
1.78
1.78
1.76
1.75
1.75
1.74

CHAPTER 12. TABLES


TABLE 4.1, continued. 5% critical values for the F-DISTRIBUTION
DEN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
45
50
60
70
80
90
100
110
120
140
160
180
200

15
245.9
19.43
8.70
5.86
4.62
3.94
3.51
3.22
3.01
2.85
2.72
2.62
2.53
2.46
2.40
2.35
2.31
2.27
2.23
2.20
2.18
2.15
2.13
2.11
2.09
2.07
2.06
2.04
2.03
2.01
2.00
1.99
1.98
1.97
1.96
1.95
1.95
1.94
1.93
1.92
1.89
1.87
1.84
1.81
1.79
1.78
1.77
1.76
1.75
1.74
1.73
1.72
1.72

16
246.5
19.43
8.69
5.84
4.60
3.92
3.49
3.20
2.99
2.83
2.70
2.60
2.51
2.44
2.38
2.33
2.29
2.25
2.21
2.18
2.16
2.13
2.11
2.09
2.07
2.05
2.04
2.02
2.01
1.99
1.98
1.97
1.96
1.95
1.94
1.93
1.93
1.92
1.91
1.90
1.87
1.85
1.82
1.79
1.77
1.76
1.75
1.74
1.73
1.72
1.71
1.70
1.69

17
246.9
19.44
8.68
5.83
4.59
3.91
3.48
3.19
2.97
2.81
2.69
2.58
2.50
2.43
2.37
2.32
2.27
2.23
2.20
2.17
2.14
2.11
2.09
2.07
2.05
2.03
2.02
2.00
1.99
1.98
1.96
1.95
1.94
1.93
1.92
1.92
1.91
1.90
1.89
1.89
1.86
1.83
1.80
1.77
1.75
1.74
1.73
1.72
1.71
1.70
1.69
1.68
1.67

18
247.3
19.44
8.67
5.82
4.58
3.90
3.47
3.17
2.96
2.80
2.67
2.57
2.48
2.41
2.35
2.30
2.26
2.22
2.18
2.15
2.12
2.10
2.08
2.05
2.04
2.02
2.00
1.99
1.97
1.96
1.95
1.94
1.93
1.92
1.91
1.90
1.89
1.88
1.88
1.87
1.84
1.81
1.78
1.75
1.73
1.72
1.71
1.70
1.69
1.68
1.67
1.66
1.66

NUM (Numerator Degrees of Freedom)


19
20
22
24
27
30
247.7 248.0 248.6 249.1 249.6 250.1
19.44 19.45 19.45 19.45 19.46 19.46
8.67
8.66
8.65
8.64
8.63
8.62
5.81
5.80
5.79
5.77
5.76
5.75
4.57
4.56
4.54
4.53
4.51
4.50
3.88
3.87
3.86
3.84
3.82
3.81
3.46
3.44
3.43
3.41
3.39
3.38
3.16
3.15
3.13
3.12
3.10
3.08
2.95
2.94
2.92
2.90
2.88
2.86
2.79
2.77
2.75
2.74
2.72
2.70
2.66
2.65
2.63
2.61
2.59
2.57
2.56
2.54
2.52
2.51
2.48
2.47
2.47
2.46
2.44
2.42
2.40
2.38
2.40
2.39
2.37
2.35
2.33
2.31
2.34
2.33
2.31
2.29
2.27
2.25
2.29
2.28
2.25
2.24
2.21
2.19
2.24
2.23
2.21
2.19
2.17
2.15
2.20
2.19
2.17
2.15
2.13
2.11
2.17
2.16
2.13
2.11
2.09
2.07
2.14
2.12
2.10
2.08
2.06
2.04
2.11
2.10
2.07
2.05
2.03
2.01
2.08
2.07
2.05
2.03
2.00
1.98
2.06
2.05
2.02
2.01
1.98
1.96
2.04
2.03
2.00
1.98
1.96
1.94
2.02
2.01
1.98
1.96
1.94
1.92
2.00
1.99
1.97
1.95
1.92
1.90
1.99
1.97
1.95
1.93
1.90
1.88
1.97
1.96
1.93
1.91
1.89
1.87
1.96
1.94
1.92
1.90
1.88
1.85
1.95
1.93
1.91
1.89
1.86
1.84
1.93
1.92
1.90
1.88
1.85
1.83
1.92
1.91
1.88
1.86
1.84
1.82
1.91
1.90
1.87
1.85
1.83
1.81
1.90
1.89
1.86
1.84
1.82
1.80
1.89
1.88
1.85
1.83
1.81
1.79
1.88
1.87
1.85
1.82
1.80
1.78
1.88
1.86
1.84
1.82
1.79
1.77
1.87
1.85
1.83
1.81
1.78
1.76
1.86
1.85
1.82
1.80
1.77
1.75
1.85
1.84
1.81
1.79
1.77
1.74
1.82
1.81
1.78
1.76
1.73
1.71
1.80
1.78
1.76
1.74
1.71
1.69
1.76
1.75
1.72
1.70
1.67
1.65
1.74
1.72
1.70
1.67
1.65
1.62
1.72
1.70
1.68
1.65
1.63
1.60
1.70
1.69
1.66
1.64
1.61
1.59
1.69
1.68
1.65
1.63
1.60
1.57
1.68
1.67
1.64
1.62
1.59
1.56
1.67
1.66
1.63
1.61
1.58
1.55
1.66
1.65
1.62
1.60
1.57
1.54
1.65
1.64
1.61
1.59
1.56
1.53
1.64
1.63
1.60
1.58
1.55
1.52
1.64
1.62
1.60
1.57
1.54
1.52

40
251.1
19.47
8.59
5.72
4.46
3.77
3.34
3.04
2.83
2.66
2.53
2.43
2.34
2.27
2.20
2.15
2.10
2.06
2.03
1.99
1.96
1.94
1.91
1.89
1.87
1.85
1.84
1.82
1.81
1.79
1.78
1.77
1.76
1.75
1.74
1.73
1.72
1.71
1.70
1.69
1.66
1.63
1.59
1.57
1.54
1.53
1.52
1.50
1.50
1.48
1.47
1.46
1.46

60
252.2
19.48
8.57
5.69
4.43
3.74
3.30
3.01
2.79
2.62
2.49
2.38
2.30
2.22
2.16
2.11
2.06
2.02
1.98
1.95
1.92
1.89
1.86
1.84
1.82
1.80
1.79
1.77
1.75
1.74
1.73
1.71
1.70
1.69
1.68
1.67
1.66
1.65
1.65
1.64
1.60
1.58
1.53
1.50
1.48
1.46
1.45
1.44
1.43
1.41
1.40
1.39
1.39

100
253.0
19.49
8.55
5.66
4.41
3.71
3.27
2.97
2.76
2.59
2.46
2.35
2.26
2.19
2.12
2.07
2.02
1.98
1.94
1.91
1.88
1.85
1.82
1.80
1.78
1.76
1.74
1.73
1.71
1.70
1.68
1.67
1.66
1.65
1.63
1.62
1.62
1.61
1.60
1.59
1.55
1.52
1.48
1.45
1.43
1.41
1.39
1.38
1.37
1.35
1.34
1.33
1.32

200
253.7
19.49
8.54
5.65
4.39
3.69
3.25
2.95
2.73
2.56
2.43
2.32
2.23
2.16
2.10
2.04
1.99
1.95
1.91
1.88
1.84
1.82
1.79
1.77
1.75
1.73
1.71
1.69
1.67
1.66
1.65
1.63
1.62
1.61
1.60
1.59
1.58
1.57
1.56
1.55
1.51
1.48
1.44
1.40
1.38
1.36
1.34
1.33
1.32
1.30
1.28
1.27
1.26

TABLE 4.2. 2.5% critical values for the F-DISTRIBUTION, i.e. the value of $F^{(0.025)}_{\mathrm{NUM},\mathrm{DEN}}$, where NUM and DEN are the numerator and denominator degrees of freedom respectively
DEN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
45
50
60
70
80
90
100
110
120
140
160
180
200

1
648
38.51
17.44
12.22
10.01
8.81
8.07
7.57
7.21
6.94
6.72
6.55
6.41
6.30
6.20
6.12
6.04
5.98
5.92
5.87
5.83
5.79
5.75
5.72
5.69
5.66
5.63
5.61
5.59
5.57
5.55
5.53
5.51
5.50
5.48
5.47
5.46
5.45
5.43
5.42
5.38
5.34
5.29
5.25
5.22
5.20
5.18
5.16
5.15
5.13
5.12
5.11
5.10

2
799
39.00
16.04
10.65
8.43
7.26
6.54
6.06
5.71
5.46
5.26
5.10
4.97
4.86
4.77
4.69
4.62
4.56
4.51
4.46
4.42
4.38
4.35
4.32
4.29
4.27
4.24
4.22
4.20
4.18
4.16
4.15
4.13
4.12
4.11
4.09
4.08
4.07
4.06
4.05
4.01
3.97
3.93
3.89
3.86
3.84
3.83
3.82
3.80
3.79
3.78
3.77
3.76

3
864
39.17
15.44
9.98
7.76
6.60
5.89
5.42
5.08
4.83
4.63
4.47
4.35
4.24
4.15
4.08
4.01
3.95
3.90
3.86
3.82
3.78
3.75
3.72
3.69
3.67
3.65
3.63
3.61
3.59
3.57
3.56
3.54
3.53
3.52
3.50
3.49
3.48
3.47
3.46
3.42
3.39
3.34
3.31
3.28
3.26
3.25
3.24
3.23
3.21
3.20
3.19
3.18

4
900
39.25
15.10
9.60
7.39
6.23
5.52
5.05
4.72
4.47
4.28
4.12
4.00
3.89
3.80
3.73
3.66
3.61
3.56
3.51
3.48
3.44
3.41
3.38
3.35
3.33
3.31
3.29
3.27
3.25
3.23
3.22
3.20
3.19
3.18
3.17
3.16
3.15
3.14
3.13
3.09
3.05
3.01
2.97
2.95
2.93
2.92
2.90
2.89
2.88
2.87
2.86
2.85

NUM (Numerator Degrees of Freedom)


5
6
7
8
9
10
922
937
948
957
963
969
39.30 39.33 39.36 39.37 39.39 39.40
14.88 14.73 14.62 14.54 14.47 14.42
9.36
9.20
9.07
8.98
8.90
8.84
7.15
6.98
6.85
6.76
6.68
6.62
5.99
5.82
5.70
5.60
5.52
5.46
5.29
5.12
4.99
4.90
4.82
4.76
4.82
4.65
4.53
4.43
4.36
4.30
4.48
4.32
4.20
4.10
4.03
3.96
4.24
4.07
3.95
3.85
3.78
3.72
4.04
3.88
3.76
3.66
3.59
3.53
3.89
3.73
3.61
3.51
3.44
3.37
3.77
3.60
3.48
3.39
3.31
3.25
3.66
3.50
3.38
3.29
3.21
3.15
3.58
3.41
3.29
3.20
3.12
3.06
3.50
3.34
3.22
3.12
3.05
2.99
3.44
3.28
3.16
3.06
2.98
2.92
3.38
3.22
3.10
3.01
2.93
2.87
3.33
3.17
3.05
2.96
2.88
2.82
3.29
3.13
3.01
2.91
2.84
2.77
3.25
3.09
2.97
2.87
2.80
2.73
3.22
3.05
2.93
2.84
2.76
2.70
3.18
3.02
2.90
2.81
2.73
2.67
3.15
2.99
2.87
2.78
2.70
2.64
3.13
2.97
2.85
2.75
2.68
2.61
3.10
2.94
2.82
2.73
2.65
2.59
3.08
2.92
2.80
2.71
2.63
2.57
3.06
2.90
2.78
2.69
2.61
2.55
3.04
2.88
2.76
2.67
2.59
2.53
3.03
2.87
2.75
2.65
2.57
2.51
3.01
2.85
2.73
2.64
2.56
2.50
3.00
2.84
2.71
2.62
2.54
2.48
2.98
2.82
2.70
2.61
2.53
2.47
2.97
2.81
2.69
2.59
2.52
2.45
2.96
2.80
2.68
2.58
2.50
2.44
2.94
2.78
2.66
2.57
2.49
2.43
2.93
2.77
2.65
2.56
2.48
2.42
2.92
2.76
2.64
2.55
2.47
2.41
2.91
2.75
2.63
2.54
2.46
2.40
2.90
2.74
2.62
2.53
2.45
2.39
2.86
2.70
2.58
2.49
2.41
2.35
2.83
2.67
2.55
2.46
2.38
2.32
2.79
2.63
2.51
2.41
2.33
2.27
2.75
2.59
2.47
2.38
2.30
2.24
2.73
2.57
2.45
2.35
2.28
2.21
2.71
2.55
2.43
2.34
2.26
2.19
2.70
2.54
2.42
2.32
2.24
2.18
2.68
2.53
2.40
2.31
2.23
2.17
2.67
2.52
2.39
2.30
2.22
2.16
2.66
2.50
2.38
2.28
2.21
2.14
2.65
2.49
2.37
2.27
2.19
2.13
2.64
2.48
2.36
2.26
2.19
2.12
2.63
2.47
2.35
2.26
2.18
2.11

11
973
39.41
14.37
8.79
6.57
5.41
4.71
4.24
3.91
3.66
3.47
3.32
3.20
3.09
3.01
2.93
2.87
2.81
2.76
2.72
2.68
2.65
2.62
2.59
2.56
2.54
2.51
2.49
2.48
2.46
2.44
2.43
2.41
2.40
2.39
2.37
2.36
2.35
2.34
2.33
2.29
2.26
2.22
2.18
2.16
2.14
2.12
2.11
2.10
2.09
2.07
2.07
2.06

12
977
39.41
14.34
8.75
6.52
5.37
4.67
4.20
3.87
3.62
3.43
3.28
3.15
3.05
2.96
2.89
2.82
2.77
2.72
2.68
2.64
2.60
2.57
2.54
2.51
2.49
2.47
2.45
2.43
2.41
2.40
2.38
2.37
2.35
2.34
2.33
2.32
2.31
2.30
2.29
2.25
2.22
2.17
2.14
2.11
2.09
2.08
2.07
2.05
2.04
2.03
2.02
2.01

13
980
39.42
14.30
8.72
6.49
5.33
4.63
4.16
3.83
3.58
3.39
3.24
3.12
3.01
2.92
2.85
2.79
2.73
2.68
2.64
2.60
2.56
2.53
2.50
2.48
2.45
2.43
2.41
2.39
2.37
2.36
2.34
2.33
2.31
2.30
2.29
2.28
2.27
2.26
2.25
2.21
2.18
2.13
2.10
2.07
2.05
2.04
2.02
2.01
2.00
1.99
1.98
1.97

14
983
39.43
14.28
8.68
6.46
5.30
4.60
4.13
3.80
3.55
3.36
3.21
3.08
2.98
2.89
2.82
2.75
2.70
2.65
2.60
2.56
2.53
2.50
2.47
2.44
2.42
2.39
2.37
2.36
2.34
2.32
2.31
2.29
2.28
2.27
2.25
2.24
2.23
2.22
2.21
2.17
2.14
2.09
2.06
2.03
2.02
2.00
1.99
1.98
1.96
1.95
1.94
1.93



TABLE 4.2, continued. 2.5% critical values for the F-DISTRIBUTION
DEN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
45
50
60
70
80
90
100
110
120
140
160
180
200

15
985
39.43
14.25
8.66
6.43
5.27
4.57
4.10
3.77
3.52
3.33
3.18
3.05
2.95
2.86
2.79
2.72
2.67
2.62
2.57
2.53
2.50
2.47
2.44
2.41
2.39
2.36
2.34
2.32
2.31
2.29
2.28
2.26
2.25
2.23
2.22
2.21
2.20
2.19
2.18
2.14
2.11
2.06
2.03
2.00
1.98
1.97
1.96
1.94
1.93
1.92
1.91
1.90

16
987
39.44
14.23
8.63
6.40
5.24
4.54
4.08
3.74
3.50
3.30
3.15
3.03
2.92
2.84
2.76
2.70
2.64
2.59
2.55
2.51
2.47
2.44
2.41
2.38
2.36
2.34
2.32
2.30
2.28
2.26
2.25
2.23
2.22
2.21
2.20
2.18
2.17
2.16
2.15
2.11
2.08
2.03
2.00
1.97
1.95
1.94
1.93
1.92
1.90
1.89
1.88
1.87

17
989
39.44
14.21
8.61
6.38
5.22
4.52
4.05
3.72
3.47
3.28
3.13
3.00
2.90
2.81
2.74
2.67
2.62
2.57
2.52
2.48
2.45
2.42
2.39
2.36
2.34
2.31
2.29
2.27
2.26
2.24
2.22
2.21
2.20
2.18
2.17
2.16
2.15
2.14
2.13
2.09
2.06
2.01
1.97
1.95
1.93
1.91
1.90
1.89
1.87
1.86
1.85
1.84

18
990
39.44
14.20
8.59
6.36
5.20
4.50
4.03
3.70
3.45
3.26
3.11
2.98
2.88
2.79
2.72
2.65
2.60
2.55
2.50
2.46
2.43
2.39
2.36
2.34
2.31
2.29
2.27
2.25
2.23
2.22
2.20
2.19
2.17
2.16
2.15
2.14
2.13
2.12
2.11
2.07
2.03
1.98
1.95
1.92
1.91
1.89
1.88
1.87
1.85
1.84
1.83
1.82

NUM (Numerator Degrees of Freedom)


19
20
22
24
27
30
992
993
995
997 1000 1001
39.45 39.45 39.45 39.46 39.46 39.46
14.18 14.17 14.14 14.12 14.10 14.08
8.58
8.56
8.53
8.51
8.48
8.46
6.34
6.33
6.30
6.28
6.25
6.23
5.18
5.17
5.14
5.12
5.09
5.07
4.48
4.47
4.44
4.41
4.39
4.36
4.02
4.00
3.97
3.95
3.92
3.89
3.68
3.67
3.64
3.61
3.58
3.56
3.44
3.42
3.39
3.37
3.34
3.31
3.24
3.23
3.20
3.17
3.14
3.12
3.09
3.07
3.04
3.02
2.99
2.96
2.96
2.95
2.92
2.89
2.86
2.84
2.86
2.84
2.81
2.79
2.76
2.73
2.77
2.76
2.73
2.70
2.67
2.64
2.70
2.68
2.65
2.63
2.59
2.57
2.63
2.62
2.59
2.56
2.53
2.50
2.58
2.56
2.53
2.50
2.47
2.44
2.53
2.51
2.48
2.45
2.42
2.39
2.48
2.46
2.43
2.41
2.38
2.35
2.44
2.42
2.39
2.37
2.33
2.31
2.41
2.39
2.36
2.33
2.30
2.27
2.37
2.36
2.33
2.30
2.27
2.24
2.35
2.33
2.30
2.27
2.24
2.21
2.32
2.30
2.27
2.24
2.21
2.18
2.29
2.28
2.24
2.22
2.18
2.16
2.27
2.25
2.22
2.19
2.16
2.13
2.25
2.23
2.20
2.17
2.14
2.11
2.23
2.21
2.18
2.15
2.12
2.09
2.21
2.20
2.16
2.14
2.10
2.07
2.20
2.18
2.15
2.12
2.08
2.06
2.18
2.16
2.13
2.10
2.07
2.04
2.17
2.15
2.12
2.09
2.05
2.03
2.15
2.13
2.10
2.07
2.04
2.01
2.14
2.12
2.09
2.06
2.03
2.00
2.13
2.11
2.08
2.05
2.01
1.99
2.12
2.10
2.07
2.04
2.00
1.97
2.11
2.09
2.05
2.03
1.99
1.96
2.10
2.08
2.04
2.02
1.98
1.95
2.09
2.07
2.03
2.01
1.97
1.94
2.04
2.03
1.99
1.96
1.93
1.90
2.01
1.99
1.96
1.93
1.90
1.87
1.96
1.94
1.91
1.88
1.85
1.82
1.93
1.91
1.88
1.85
1.81
1.78
1.90
1.88
1.85
1.82
1.78
1.75
1.88
1.86
1.83
1.80
1.76
1.73
1.87
1.85
1.81
1.78
1.75
1.71
1.86
1.84
1.80
1.77
1.73
1.70
1.84
1.82
1.79
1.76
1.72
1.69
1.83
1.81
1.77
1.74
1.70
1.67
1.82
1.80
1.76
1.73
1.69
1.66
1.81
1.79
1.75
1.72
1.68
1.65
1.80
1.78
1.74
1.71
1.67
1.64

40
1006
39.47
14.04
8.41
6.18
5.01
4.31
3.84
3.51
3.26
3.06
2.91
2.78
2.67
2.59
2.51
2.44
2.38
2.33
2.29
2.25
2.21
2.18
2.15
2.12
2.09
2.07
2.05
2.03
2.01
1.99
1.98
1.96
1.95
1.93
1.92
1.91
1.90
1.89
1.88
1.83
1.80
1.74
1.71
1.68
1.66
1.64
1.63
1.61
1.60
1.58
1.57
1.56

60
1010
39.48
13.99
8.36
6.12
4.96
4.25
3.78
3.45
3.20
3.00
2.85
2.72
2.61
2.52
2.45
2.38
2.32
2.27
2.22
2.18
2.14
2.11
2.08
2.05
2.03
2.00
1.98
1.96
1.94
1.92
1.91
1.89
1.88
1.86
1.85
1.84
1.82
1.81
1.80
1.76
1.72
1.67
1.63
1.60
1.58
1.56
1.54
1.53
1.51
1.50
1.48
1.47

100
1013
39.49
13.96
8.32
6.08
4.92
4.21
3.74
3.40
3.15
2.96
2.80
2.67
2.56
2.47
2.40
2.33
2.27
2.22
2.17
2.13
2.09
2.06
2.02
2.00
1.97
1.94
1.92
1.90
1.88
1.86
1.85
1.83
1.82
1.80
1.79
1.77
1.76
1.75
1.74
1.69
1.66
1.60
1.56
1.53
1.50
1.48
1.47
1.45
1.43
1.42
1.40
1.39

200
1016
39.49
13.93
8.29
6.05
4.88
4.18
3.70
3.37
3.12
2.92
2.76
2.63
2.53
2.44
2.36
2.29
2.23
2.18
2.13
2.09
2.05
2.01
1.98
1.95
1.92
1.90
1.88
1.86
1.84
1.82
1.80
1.78
1.77
1.75
1.74
1.73
1.71
1.70
1.69
1.64
1.60
1.54
1.50
1.47
1.44
1.42
1.40
1.39
1.36
1.35
1.33
1.32

TABLE 4.3. 1% critical values for the F-DISTRIBUTION, i.e. the value of $F^{(0.01)}_{\mathrm{NUM},\mathrm{DEN}}$, where NUM and DEN are the numerator and denominator degrees of freedom respectively
DEN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
45
50
60
70
80
90
100
110
120
140
160
180
200

1
4052
98.50
34.12
21.20
16.26
13.75
12.25
11.26
10.56
10.04
9.65
9.33
9.07
8.86
8.68
8.53
8.40
8.29
8.18
8.10
8.02
7.95
7.88
7.82
7.77
7.72
7.68
7.64
7.60
7.56
7.53
7.50
7.47
7.44
7.42
7.40
7.37
7.35
7.33
7.31
7.23
7.17
7.08
7.01
6.96
6.93
6.90
6.87
6.85
6.82
6.80
6.78
6.76

2
4999
99.00
30.82
18.00
13.27
10.92
9.55
8.65
8.02
7.56
7.21
6.93
6.70
6.51
6.36
6.23
6.11
6.01
5.93
5.85
5.78
5.72
5.66
5.61
5.57
5.53
5.49
5.45
5.42
5.39
5.36
5.34
5.31
5.29
5.27
5.25
5.23
5.21
5.19
5.18
5.11
5.06
4.98
4.92
4.88
4.85
4.82
4.80
4.79
4.76
4.74
4.73
4.71

3
5404
99.16
29.46
16.69
12.06
9.78
8.45
7.59
6.99
6.55
6.22
5.95
5.74
5.56
5.42
5.29
5.19
5.09
5.01
4.94
4.87
4.82
4.76
4.72
4.68
4.64
4.60
4.57
4.54
4.51
4.48
4.46
4.44
4.42
4.40
4.38
4.36
4.34
4.33
4.31
4.25
4.20
4.13
4.07
4.04
4.01
3.98
3.96
3.95
3.92
3.91
3.89
3.88

4
5624
99.25
28.71
15.98
11.39
9.15
7.85
7.01
6.42
5.99
5.67
5.41
5.21
5.04
4.89
4.77
4.67
4.58
4.50
4.43
4.37
4.31
4.26
4.22
4.18
4.14
4.11
4.07
4.04
4.02
3.99
3.97
3.95
3.93
3.91
3.89
3.87
3.86
3.84
3.83
3.77
3.72
3.65
3.60
3.56
3.53
3.51
3.49
3.48
3.46
3.44
3.43
3.41

NUM (Numerator Degrees of Freedom)


5
6
7
8
5764 5859 5928 5981
99.30 99.33 99.36 99.38
28.24 27.91 27.67 27.49
15.52 15.21 14.98 14.80
10.97 10.67 10.46 10.29
8.75
8.47
8.26
8.10
7.46
7.19
6.99
6.84
6.63
6.37
6.18
6.03
6.06
5.80
5.61
5.47
5.64
5.39
5.20
5.06
5.32
5.07
4.89
4.74
5.06
4.82
4.64
4.50
4.86
4.62
4.44
4.30
4.69
4.46
4.28
4.14
4.56
4.32
4.14
4.00
4.44
4.20
4.03
3.89
4.34
4.10
3.93
3.79
4.25
4.01
3.84
3.71
4.17
3.94
3.77
3.63
4.10
3.87
3.70
3.56
4.04
3.81
3.64
3.51
3.99
3.76
3.59
3.45
3.94
3.71
3.54
3.41
3.90
3.67
3.50
3.36
3.85
3.63
3.46
3.32
3.82
3.59
3.42
3.29
3.78
3.56
3.39
3.26
3.75
3.53
3.36
3.23
3.73
3.50
3.33
3.20
3.70
3.47
3.30
3.17
3.67
3.45
3.28
3.15
3.65
3.43
3.26
3.13
3.63
3.41
3.24
3.11
3.61
3.39
3.22
3.09
3.59
3.37
3.20
3.07
3.57
3.35
3.18
3.05
3.56
3.33
3.17
3.04
3.54
3.32
3.15
3.02
3.53
3.30
3.14
3.01
3.51
3.29
3.12
2.99
3.45
3.23
3.07
2.94
3.41
3.19
3.02
2.89
3.34
3.12
2.95
2.82
3.29
3.07
2.91
2.78
3.26
3.04
2.87
2.74
3.23
3.01
2.84
2.72
3.21
2.99
2.82
2.69
3.19
2.97
2.81
2.68
3.17
2.96
2.79
2.66
3.15
2.93
2.77
2.64
3.13
2.92
2.75
2.62
3.12
2.90
2.74
2.61
3.11
2.89
2.73
2.60

NUM (Numerator Degrees of Freedom)
9
10
6022 6056
99.39 99.40
27.34 27.23
14.66 14.55
10.16 10.05
7.98
7.87
6.72
6.62
5.91
5.81
5.35
5.26
4.94
4.85
4.63
4.54
4.39
4.30
4.19
4.10
4.03
3.94
3.89
3.80
3.78
3.69
3.68
3.59
3.60
3.51
3.52
3.43
3.46
3.37
3.40
3.31
3.35
3.26
3.30
3.21
3.26
3.17
3.22
3.13
3.18
3.09
3.15
3.06
3.12
3.03
3.09
3.00
3.07
2.98
3.04
2.96
3.02
2.93
3.00
2.91
2.98
2.89
2.96
2.88
2.95
2.86
2.93
2.84
2.92
2.83
2.90
2.81
2.89
2.80
2.83
2.74
2.78
2.70
2.72
2.63
2.67
2.59
2.64
2.55
2.61
2.52
2.59
2.50
2.57
2.49
2.56
2.47
2.54
2.45
2.52
2.43
2.51
2.42
2.50
2.41

11
6083
99.41
27.13
14.45
9.96
7.79
6.54
5.73
5.18
4.77
4.46
4.22
4.02
3.86
3.73
3.62
3.52
3.43
3.36
3.29
3.24
3.18
3.14
3.09
3.06
3.02
2.99
2.96
2.93
2.91
2.88
2.86
2.84
2.82
2.80
2.79
2.77
2.75
2.74
2.73
2.67
2.63
2.56
2.51
2.48
2.45
2.43
2.41
2.40
2.38
2.36
2.35
2.34

12
6107
99.42
27.05
14.37
9.89
7.72
6.47
5.67
5.11
4.71
4.40
4.16
3.96
3.80
3.67
3.55
3.46
3.37
3.30
3.23
3.17
3.12
3.07
3.03
2.99
2.96
2.93
2.90
2.87
2.84
2.82
2.80
2.78
2.76
2.74
2.72
2.71
2.69
2.68
2.66
2.61
2.56
2.50
2.45
2.42
2.39
2.37
2.35
2.34
2.31
2.30
2.28
2.27

13
6126
99.42
26.98
14.31
9.82
7.66
6.41
5.61
5.05
4.65
4.34
4.10
3.91
3.75
3.61
3.50
3.40
3.32
3.24
3.18
3.12
3.07
3.02
2.98
2.94
2.90
2.87
2.84
2.81
2.79
2.77
2.74
2.72
2.70
2.69
2.67
2.65
2.64
2.62
2.61
2.55
2.51
2.44
2.40
2.36
2.33
2.31
2.30
2.28
2.26
2.24
2.23
2.22

14
6143
99.43
26.92
14.25
9.77
7.60
6.36
5.56
5.01
4.60
4.29
4.05
3.86
3.70
3.56
3.45
3.35
3.27
3.19
3.13
3.07
3.02
2.97
2.93
2.89
2.86
2.82
2.79
2.77
2.74
2.72
2.70
2.68
2.66
2.64
2.62
2.61
2.59
2.58
2.56
2.51
2.46
2.39
2.35
2.31
2.29
2.27
2.25
2.23
2.21
2.20
2.18
2.17



TABLE 4.3, continued. 1% critical values for the F-DISTRIBUTION
DEN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
45
50
60
70
80
90
100
110
120
140
160
180
200

15
6157
99.43
26.87
14.20
9.72
7.56
6.31
5.52
4.96
4.56
4.25
4.01
3.82
3.66
3.52
3.41
3.31
3.23
3.15
3.09
3.03
2.98
2.93
2.89
2.85
2.81
2.78
2.75
2.73
2.70
2.68
2.65
2.63
2.61
2.60
2.58
2.56
2.55
2.54
2.52
2.46
2.42
2.35
2.31
2.27
2.24
2.22
2.21
2.19
2.17
2.15
2.14
2.13

16
6170
99.44
26.83
14.15
9.68
7.52
6.28
5.48
4.92
4.52
4.21
3.97
3.78
3.62
3.49
3.37
3.27
3.19
3.12
3.05
2.99
2.94
2.89
2.85
2.81
2.78
2.75
2.72
2.69
2.66
2.64
2.62
2.60
2.58
2.56
2.54
2.53
2.51
2.50
2.48
2.43
2.38
2.31
2.27
2.23
2.21
2.19
2.17
2.15
2.13
2.11
2.10
2.09

17
6181
99.44
26.79
14.11
9.64
7.48
6.24
5.44
4.89
4.49
4.18
3.94
3.75
3.59
3.45
3.34
3.24
3.16
3.08
3.02
2.96
2.91
2.86
2.82
2.78
2.75
2.71
2.68
2.66
2.63
2.61
2.58
2.56
2.54
2.53
2.51
2.49
2.48
2.46
2.45
2.39
2.35
2.28
2.23
2.20
2.17
2.15
2.13
2.12
2.10
2.08
2.07
2.06

18
6191
99.44
26.75
14.08
9.61
7.45
6.21
5.41
4.86
4.46
4.15
3.91
3.72
3.56
3.42
3.31
3.21
3.13
3.05
2.99
2.93
2.88
2.83
2.79
2.75
2.72
2.68
2.65
2.63
2.60
2.58
2.55
2.53
2.51
2.50
2.48
2.46
2.45
2.43
2.42
2.36
2.32
2.25
2.20
2.17
2.14
2.12
2.10
2.09
2.07
2.05
2.04
2.03

NUM (Numerator Degrees of Freedom)


19
20
22
24
6201 6209 6223 6234
99.45 99.45 99.46 99.46
26.72 26.69 26.64 26.60
14.05 14.02 13.97 13.93
9.58
9.55
9.51
9.47
7.42
7.40
7.35
7.31
6.18
6.16
6.11
6.07
5.38
5.36
5.32
5.28
4.83
4.81
4.77
4.73
4.43
4.41
4.36
4.33
4.12
4.10
4.06
4.02
3.88
3.86
3.82
3.78
3.69
3.66
3.62
3.59
3.53
3.51
3.46
3.43
3.40
3.37
3.33
3.29
3.28
3.26
3.22
3.18
3.19
3.16
3.12
3.08
3.10
3.08
3.03
3.00
3.03
3.00
2.96
2.92
2.96
2.94
2.90
2.86
2.90
2.88
2.84
2.80
2.85
2.83
2.78
2.75
2.80
2.78
2.74
2.70
2.76
2.74
2.70
2.66
2.72
2.70
2.66
2.62
2.69
2.66
2.62
2.58
2.66
2.63
2.59
2.55
2.63
2.60
2.56
2.52
2.60
2.57
2.53
2.49
2.57
2.55
2.51
2.47
2.55
2.52
2.48
2.45
2.53
2.50
2.46
2.42
2.51
2.48
2.44
2.40
2.49
2.46
2.42
2.38
2.47
2.44
2.40
2.36
2.45
2.43
2.38
2.35
2.44
2.41
2.37
2.33
2.42
2.40
2.35
2.32
2.41
2.38
2.34
2.30
2.39
2.37
2.33
2.29
2.34
2.31
2.27
2.23
2.29
2.27
2.22
2.18
2.22
2.20
2.15
2.12
2.18
2.15
2.11
2.07
2.14
2.12
2.07
2.03
2.11
2.09
2.04
2.00
2.09
2.07
2.02
1.98
2.07
2.05
2.00
1.96
2.06
2.03
1.99
1.95
2.04
2.01
1.97
1.93
2.02
1.99
1.95
1.91
2.01
1.98
1.94
1.90
2.00
1.97
1.93
1.89

NUM (Numerator Degrees of Freedom)
27
30
6249 6260
99.46 99.47
26.55 26.50
13.88 13.84
9.42
9.38
7.27
7.23
6.03
5.99
5.23
5.20
4.68
4.65
4.28
4.25
3.98
3.94
3.74
3.70
3.54
3.51
3.38
3.35
3.25
3.21
3.14
3.10
3.04
3.00
2.95
2.92
2.88
2.84
2.81
2.78
2.76
2.72
2.70
2.67
2.66
2.62
2.61
2.58
2.58
2.54
2.54
2.50
2.51
2.47
2.48
2.44
2.45
2.41
2.42
2.39
2.40
2.36
2.38
2.34
2.36
2.32
2.34
2.30
2.32
2.28
2.30
2.26
2.28
2.25
2.27
2.23
2.26
2.22
2.24
2.20
2.18
2.14
2.14
2.10
2.07
2.03
2.02
1.98
1.98
1.94
1.96
1.92
1.93
1.89
1.92
1.88
1.90
1.86
1.88
1.84
1.86
1.82
1.85
1.81
1.84
1.79

40
6286
99.48
26.41
13.75
9.29
7.14
5.91
5.12
4.57
4.17
3.86
3.62
3.43
3.27
3.13
3.02
2.92
2.84
2.76
2.69
2.64
2.58
2.54
2.49
2.45
2.42
2.38
2.35
2.33
2.30
2.27
2.25
2.23
2.21
2.19
2.18
2.16
2.14
2.13
2.11
2.05
2.01
1.94
1.89
1.85
1.82
1.80
1.78
1.76
1.74
1.72
1.71
1.69

60
6313
99.48
26.32
13.65
9.20
7.06
5.82
5.03
4.48
4.08
3.78
3.54
3.34
3.18
3.05
2.93
2.83
2.75
2.67
2.61
2.55
2.50
2.45
2.40
2.36
2.33
2.29
2.26
2.23
2.21
2.18
2.16
2.14
2.12
2.10
2.08
2.06
2.05
2.03
2.02
1.96
1.91
1.84
1.78
1.75
1.72
1.69
1.67
1.66
1.63
1.61
1.60
1.58

100
6334
99.49
26.24
13.58
9.13
6.99
5.75
4.96
4.41
4.01
3.71
3.47
3.27
3.11
2.98
2.86
2.76
2.68
2.60
2.54
2.48
2.42
2.37
2.33
2.29
2.25
2.22
2.19
2.16
2.13
2.11
2.08
2.06
2.04
2.02
2.00
1.98
1.97
1.95
1.94
1.88
1.82
1.75
1.70
1.65
1.62
1.60
1.58
1.56
1.53
1.51
1.49
1.48

200
6350
99.49
26.18
13.52
9.08
6.93
5.70
4.91
4.36
3.96
3.66
3.41
3.22
3.06
2.92
2.81
2.71
2.62
2.55
2.48
2.42
2.36
2.32
2.27
2.23
2.19
2.16
2.13
2.10
2.07
2.04
2.02
2.00
1.98
1.96
1.94
1.92
1.90
1.89
1.87
1.81
1.76
1.68
1.62
1.58
1.55
1.52
1.50
1.48
1.45
1.42
1.41
1.39

TABLE 4.4. 0.5% critical values for the F-DISTRIBUTION, i.e. the value of $F^{(0.005)}_{\mathrm{NUM},\mathrm{DEN}}$, where NUM and DEN are the numerator and denominator degrees of freedom respectively
DEN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
45
50
60
70
80
90
100
110
120
140
160
180
200

1
16212
198.5
55.55
31.33
22.78
18.63
16.24
14.69
13.61
12.83
12.23
11.75
11.37
11.06
10.80
10.58
10.38
10.22
10.07
9.94
9.83
9.73
9.63
9.55
9.48
9.41
9.34
9.28
9.23
9.18
9.13
9.09
9.05
9.01
8.98
8.94
8.91
8.88
8.85
8.83
8.71
8.63
8.49
8.40
8.33
8.28
8.24
8.21
8.18
8.13
8.10
8.08
8.06

2
19997
199.0
49.80
26.28
18.31
14.54
12.40
11.04
10.11
9.43
8.91
8.51
8.19
7.92
7.70
7.51
7.35
7.21
7.09
6.99
6.89
6.81
6.73
6.66
6.60
6.54
6.49
6.44
6.40
6.35
6.32
6.28
6.25
6.22
6.19
6.16
6.13
6.11
6.09
6.07
5.97
5.90
5.79
5.72
5.67
5.62
5.59
5.56
5.54
5.50
5.48
5.46
5.44

3
21614
199.2
47.47
24.26
16.53
12.92
10.88
9.60
8.72
8.08
7.60
7.23
6.93
6.68
6.48
6.30
6.16
6.03
5.92
5.82
5.73
5.65
5.58
5.52
5.46
5.41
5.36
5.32
5.28
5.24
5.20
5.17
5.14
5.11
5.09
5.06
5.04
5.02
5.00
4.98
4.89
4.83
4.73
4.66
4.61
4.57
4.54
4.52
4.50
4.47
4.44
4.42
4.41

4
22501
199.2
46.20
23.15
15.56
12.03
10.05
8.81
7.96
7.34
6.88
6.52
6.23
6.00
5.80
5.64
5.50
5.37
5.27
5.17
5.09
5.02
4.95
4.89
4.84
4.79
4.74
4.70
4.66
4.62
4.59
4.56
4.53
4.50
4.48
4.46
4.43
4.41
4.39
4.37
4.29
4.23
4.14
4.08
4.03
3.99
3.96
3.94
3.92
3.89
3.87
3.85
3.84

NUM (Numerator Degrees of Freedom)


5
6
7
8
9
10
23056 23440 23715 23924 24091 24222
199.3 199.3 199.4 199.4 199.4 199.4
45.39 44.84 44.43 44.13 43.88 43.68
22.46 21.98 21.62 21.35 21.14 20.97
14.94 14.51 14.20 13.96 13.77 13.62
11.46 11.07 10.79 10.57 10.39 10.25
9.52
9.16
8.89
8.68
8.51
8.38
8.30
7.95
7.69
7.50
7.34
7.21
7.47
7.13
6.88
6.69
6.54
6.42
6.87
6.54
6.30
6.12
5.97
5.85
6.42
6.10
5.86
5.68
5.54
5.42
6.07
5.76
5.52
5.35
5.20
5.09
5.79
5.48
5.25
5.08
4.94
4.82
5.56
5.26
5.03
4.86
4.72
4.60
5.37
5.07
4.85
4.67
4.54
4.42
5.21
4.91
4.69
4.52
4.38
4.27
5.07
4.78
4.56
4.39
4.25
4.14
4.96
4.66
4.44
4.28
4.14
4.03
4.85
4.56
4.34
4.18
4.04
3.93
4.76
4.47
4.26
4.09
3.96
3.85
4.68
4.39
4.18
4.01
3.88
3.77
4.61
4.32
4.11
3.94
3.81
3.70
4.54
4.26
4.05
3.88
3.75
3.64
4.49
4.20
3.99
3.83
3.69
3.59
4.43
4.15
3.94
3.78
3.64
3.54
4.38
4.10
3.89
3.73
3.60
3.49
4.34
4.06
3.85
3.69
3.56
3.45
4.30
4.02
3.81
3.65
3.52
3.41
4.26
3.98
3.77
3.61
3.48
3.38
4.23
3.95
3.74
3.58
3.45
3.34
4.20
3.92
3.71
3.55
3.42
3.31
4.17
3.89
3.68
3.52
3.39
3.29
4.14
3.86
3.66
3.49
3.37
3.26
4.11
3.84
3.63
3.47
3.34
3.24
4.09
3.81
3.61
3.45
3.32
3.21
4.06
3.79
3.58
3.42
3.30
3.19
4.04
3.77
3.56
3.40
3.28
3.17
4.02
3.75
3.54
3.39
3.26
3.15
4.00
3.73
3.53
3.37
3.24
3.13
3.99
3.71
3.51
3.35
3.22
3.12
3.91
3.64
3.43
3.28
3.15
3.04
3.85
3.58
3.38
3.22
3.09
2.99
3.76
3.49
3.29
3.13
3.01
2.90
3.70
3.43
3.23
3.08
2.95
2.85
3.65
3.39
3.19
3.03
2.91
2.80
3.62
3.35
3.15
3.00
2.87
2.77
3.59
3.33
3.13
2.97
2.85
2.74
3.57
3.30
3.11
2.95
2.83
2.72
3.55
3.28
3.09
2.93
2.81
2.71
3.52
3.26
3.06
2.91
2.78
2.68
3.50
3.24
3.04
2.88
2.76
2.66
3.48
3.22
3.02
2.87
2.74
2.64
3.47
3.21
3.01
2.86
2.73
2.63

11
24334
199.4
43.52
20.82
13.49
10.13
8.27
7.10
6.31
5.75
5.32
4.99
4.72
4.51
4.33
4.18
4.05
3.94
3.84
3.76
3.68
3.61
3.55
3.50
3.45
3.40
3.36
3.32
3.29
3.25
3.22
3.20
3.17
3.15
3.12
3.10
3.08
3.06
3.05
3.03
2.96
2.90
2.82
2.76
2.72
2.68
2.66
2.64
2.62
2.59
2.57
2.56
2.54

12
24427
199.4
43.39
20.70
13.38
10.03
8.18
7.01
6.23
5.66
5.24
4.91
4.64
4.43
4.25
4.10
3.97
3.86
3.76
3.68
3.60
3.54
3.47
3.42
3.37
3.33
3.28
3.25
3.21
3.18
3.15
3.12
3.09
3.07
3.05
3.03
3.01
2.99
2.97
2.95
2.88
2.82
2.74
2.68
2.64
2.61
2.58
2.56
2.54
2.52
2.50
2.48
2.47

13
24505
199.4
43.27
20.60
13.29
9.95
8.10
6.94
6.15
5.59
5.16
4.84
4.57
4.36
4.18
4.03
3.90
3.79
3.70
3.61
3.54
3.47
3.41
3.35
3.30
3.26
3.22
3.18
3.15
3.11
3.08
3.06
3.03
3.01
2.98
2.96
2.94
2.92
2.90
2.89
2.82
2.76
2.68
2.62
2.58
2.54
2.52
2.50
2.48
2.45
2.43
2.42
2.40

14
24572
199.4
43.17
20.51
13.21
9.88
8.03
6.87
6.09
5.53
5.10
4.77
4.51
4.30
4.12
3.97
3.84
3.73
3.64
3.55
3.48
3.41
3.35
3.30
3.25
3.20
3.16
3.12
3.09
3.06
3.03
3.00
2.97
2.95
2.93
2.90
2.88
2.87
2.85
2.83
2.76
2.70
2.62
2.56
2.52
2.49
2.46
2.44
2.42
2.40
2.38
2.36
2.35


TABLE 4.4, continued. 0.5% critical values for the F-DISTRIBUTION
DEN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
45
50
60
70
80
90
100
110
120
140
160
180
200

15
24632
199.4
43.08
20.44
13.15
9.81
7.97
6.81
6.03
5.47
5.05
4.72
4.46
4.25
4.07
3.92
3.79
3.68
3.59
3.50
3.43
3.36
3.30
3.25
3.20
3.15
3.11
3.07
3.04
3.01
2.98
2.95
2.92
2.90
2.88
2.85
2.83
2.82
2.80
2.78
2.71
2.65
2.57
2.51
2.47
2.44
2.41
2.39
2.37
2.35
2.33
2.31
2.30

16
24684
199.4
43.01
20.37
13.09
9.76
7.91
6.76
5.98
5.42
5.00
4.67
4.41
4.20
4.02
3.87
3.75
3.64
3.54
3.46
3.38
3.31
3.25
3.20
3.15
3.11
3.07
3.03
2.99
2.96
2.93
2.90
2.88
2.85
2.83
2.81
2.79
2.77
2.75
2.74
2.66
2.61
2.53
2.47
2.43
2.39
2.37
2.35
2.33
2.30
2.28
2.26
2.25

17
24728
199.4
42.94
20.31
13.03
9.71
7.87
6.72
5.94
5.38
4.96
4.63
4.37
4.16
3.98
3.83
3.71
3.60
3.50
3.42
3.34
3.27
3.21
3.16
3.11
3.07
3.03
2.99
2.95
2.92
2.89
2.86
2.84
2.81
2.79
2.77
2.75
2.73
2.71
2.70
2.62
2.57
2.49
2.43
2.39
2.35
2.33
2.31
2.29
2.26
2.24
2.22
2.21

18
24766
199.4
42.88
20.26
12.98
9.66
7.83
6.68
5.90
5.34
4.92
4.59
4.33
4.12
3.95
3.80
3.67
3.56
3.46
3.38
3.31
3.24
3.18
3.12
3.08
3.03
2.99
2.95
2.92
2.89
2.86
2.83
2.80
2.78
2.76
2.73
2.71
2.70
2.68
2.66
2.59
2.53
2.45
2.39
2.35
2.32
2.29
2.27
2.25
2.22
2.20
2.19
2.18

NUM (Numerator Degrees of Freedom)


19    20    22    24    27    30
24803 24837 24892 24937 24997 25041
199.4 199.4 199.4 199.4 199.5 199.5
42.83 42.78 42.69 42.62 42.54 42.47
20.21 20.17 20.09 20.03 19.95 19.89
12.94 12.90 12.84 12.78 12.71 12.66
9.62 9.59 9.53 9.47 9.41 9.36
7.79 7.75 7.69 7.64 7.58 7.53
6.64 6.61 6.55 6.50 6.44 6.40
5.86 5.83 5.78 5.73 5.67 5.62
5.31 5.27 5.22 5.17 5.12 5.07
4.89 4.86 4.80 4.76 4.70 4.65
4.56 4.53 4.48 4.43 4.38 4.33
4.30 4.27 4.22 4.17 4.12 4.07
4.09 4.06 4.01 3.96 3.91 3.86
3.91 3.88 3.83 3.79 3.73 3.69
3.76 3.73 3.68 3.64 3.58 3.54
3.64 3.61 3.56 3.51 3.46 3.41
3.53 3.50 3.45 3.40 3.35 3.30
3.43 3.40 3.35 3.31 3.25 3.21
3.35 3.32 3.27 3.22 3.17 3.12
3.27 3.24 3.19 3.15 3.09 3.05
3.21 3.18 3.12 3.08 3.03 2.98
3.15 3.12 3.06 3.02 2.97 2.92
3.09 3.06 3.01 2.97 2.91 2.87
3.04 3.01 2.96 2.92 2.86 2.82
3.00 2.97 2.92 2.87 2.82 2.77
2.96 2.93 2.88 2.83 2.78 2.73
2.92 2.89 2.84 2.79 2.74 2.69
2.88 2.86 2.80 2.76 2.70 2.66
2.85 2.82 2.77 2.73 2.67 2.63
2.82 2.79 2.74 2.70 2.64 2.60
2.80 2.77 2.71 2.67 2.61 2.57
2.77 2.74 2.69 2.64 2.59 2.54
2.75 2.72 2.66 2.62 2.56 2.52
2.72 2.69 2.64 2.60 2.54 2.50
2.70 2.67 2.62 2.58 2.52 2.48
2.68 2.65 2.60 2.56 2.50 2.46
2.66 2.63 2.58 2.54 2.48 2.44
2.64 2.62 2.56 2.52 2.46 2.42
2.63 2.60 2.55 2.50 2.45 2.40
2.56 2.53 2.47 2.43 2.37 2.33
2.50 2.47 2.42 2.37 2.32 2.27
2.42 2.39 2.33 2.29 2.23 2.19
2.36 2.33 2.28 2.23 2.17 2.13
2.32 2.29 2.23 2.19 2.13 2.08
2.28 2.25 2.20 2.15 2.10 2.05
2.26 2.23 2.17 2.13 2.07 2.02
2.24 2.21 2.15 2.11 2.05 2.00
2.22 2.19 2.13 2.09 2.03 1.98
2.19 2.16 2.11 2.06 2.00 1.96
2.17 2.14 2.09 2.04 1.98 1.93
2.15 2.12 2.07 2.02 1.97 1.92
2.14 2.11 2.06 2.01 1.95 1.91

40
25146
199.5
42.31
19.75
12.53
9.24
7.42
6.29
5.52
4.97
4.55
4.23
3.97
3.76
3.59
3.44
3.31
3.20
3.11
3.02
2.95
2.88
2.82
2.77
2.72
2.67
2.63
2.59
2.56
2.52
2.49
2.47
2.44
2.42
2.39
2.37
2.35
2.33
2.31
2.30
2.22
2.16
2.08
2.02
1.97
1.94
1.91
1.89
1.87
1.84
1.82
1.80
1.79

60
25254
199.5
42.15
19.61
12.40
9.12
7.31
6.18
5.41
4.86
4.45
4.12
3.87
3.66
3.48
3.33
3.21
3.10
3.00
2.92
2.84
2.77
2.71
2.66
2.61
2.56
2.52
2.48
2.45
2.42
2.38
2.36
2.33
2.30
2.28
2.26
2.24
2.22
2.20
2.18
2.11
2.05
1.96
1.90
1.85
1.82
1.79
1.77
1.75
1.72
1.69
1.68
1.66

100
25339
199.5
42.02
19.50
12.30
9.03
7.22
6.09
5.32
4.77
4.36
4.04
3.78
3.57
3.39
3.25
3.12
3.01
2.91
2.83
2.75
2.69
2.62
2.57
2.52
2.47
2.43
2.39
2.36
2.32
2.29
2.26
2.24
2.21
2.19
2.17
2.14
2.12
2.11
2.09
2.01
1.95
1.86
1.80
1.75
1.71
1.68
1.66
1.64
1.60
1.58
1.56
1.54

200
25399
199.5
41.92
19.41
12.22
8.95
7.15
6.02
5.26
4.71
4.29
3.97
3.71
3.50
3.33
3.18
3.05
2.94
2.85
2.76
2.68
2.62
2.56
2.50
2.45
2.40
2.36
2.32
2.29
2.25
2.22
2.19
2.16
2.14
2.11
2.09
2.07
2.05
2.03
2.01
1.93
1.87
1.78
1.71
1.66
1.62
1.59
1.56
1.54
1.51
1.48
1.46
1.44


TABLE 5. CORRELATION COEFFICIENT. Critical values of the correlation coefficient for one-sided tests of the null hypothesis H0 : ρ = 0 (where degrees of freedom = sample size − 2)
Deg. of  Probability Level (P)
Freedom  0.4     0.3     0.2     0.1     0.05    0.025   0.01    0.005   0.0025  0.001   0.0005
1        0.3090  0.5878  0.8090  0.9511  0.9877  0.9969  0.9995  0.9999  1.0000  1.0000  1.0000
2        0.2000  0.4000  0.6000  0.8000  0.9000  0.9500  0.9800  0.9900  0.9950  0.9980  0.9990
3        0.1577  0.3197  0.4919  0.6870  0.8054  0.8783  0.9343  0.9587  0.9740  0.9859  0.9911
4        0.1341  0.2735  0.4257  0.6084  0.7293  0.8114  0.8822  0.9172  0.9417  0.9633  0.9741
5        0.1186  0.2427  0.3803  0.5509  0.6694  0.7545  0.8329  0.8745  0.9056  0.9350  0.9509
6        0.1075  0.2204  0.3468  0.5067  0.6215  0.7067  0.7887  0.8343  0.8697  0.9049  0.9249
7        0.0990  0.2032  0.3208  0.4716  0.5822  0.6664  0.7498  0.7977  0.8359  0.8751  0.8983
8        0.0922  0.1895  0.2998  0.4428  0.5494  0.6319  0.7155  0.7646  0.8046  0.8467  0.8721
9        0.0867  0.1783  0.2825  0.4187  0.5214  0.6021  0.6851  0.7348  0.7759  0.8199  0.8470
10       0.0820  0.1688  0.2678  0.3981  0.4973  0.5760  0.6581  0.7079  0.7496  0.7950  0.8233
11       0.0780  0.1607  0.2552  0.3802  0.4762  0.5529  0.6339  0.6835  0.7255  0.7717  0.8010
12       0.0746  0.1536  0.2443  0.3646  0.4575  0.5324  0.6120  0.6614  0.7034  0.7501  0.7800
13       0.0715  0.1474  0.2346  0.3507  0.4409  0.5140  0.5923  0.6411  0.6831  0.7301  0.7604
14       0.0688  0.1419  0.2260  0.3383  0.4259  0.4973  0.5742  0.6226  0.6643  0.7114  0.7419
15       0.0664  0.1370  0.2183  0.3271  0.4124  0.4821  0.5577  0.6055  0.6470  0.6940  0.7247
16       0.0643  0.1326  0.2113  0.3170  0.4000  0.4683  0.5425  0.5897  0.6308  0.6777  0.7084
17       0.0623  0.1285  0.2049  0.3077  0.3887  0.4555  0.5285  0.5751  0.6158  0.6624  0.6932
18       0.0605  0.1248  0.1991  0.2992  0.3783  0.4438  0.5155  0.5614  0.6018  0.6481  0.6788
19       0.0588  0.1214  0.1938  0.2914  0.3687  0.4329  0.5034  0.5487  0.5886  0.6346  0.6652
20       0.0573  0.1183  0.1888  0.2841  0.3598  0.4227  0.4921  0.5368  0.5763  0.6219  0.6524
21       0.0559  0.1154  0.1843  0.2774  0.3515  0.4132  0.4815  0.5256  0.5647  0.6099  0.6402
22       0.0546  0.1127  0.1800  0.2711  0.3438  0.4044  0.4716  0.5151  0.5537  0.5986  0.6287
23       0.0534  0.1102  0.1760  0.2653  0.3365  0.3961  0.4622  0.5052  0.5434  0.5879  0.6178
24       0.0522  0.1078  0.1723  0.2598  0.3297  0.3882  0.4534  0.4958  0.5336  0.5776  0.6074
25       0.0511  0.1056  0.1688  0.2546  0.3233  0.3809  0.4451  0.4869  0.5243  0.5679  0.5974
26       0.0501  0.1036  0.1655  0.2497  0.3172  0.3739  0.4372  0.4785  0.5154  0.5587  0.5880
27       0.0492  0.1016  0.1624  0.2451  0.3115  0.3673  0.4297  0.4705  0.5070  0.5499  0.5789
28       0.0483  0.0997  0.1594  0.2407  0.3061  0.3610  0.4226  0.4629  0.4990  0.5415  0.5703
29       0.0474  0.0980  0.1567  0.2366  0.3009  0.3550  0.4158  0.4556  0.4914  0.5334  0.5621
30       0.0466  0.0963  0.1540  0.2327  0.2960  0.3494  0.4093  0.4487  0.4840  0.5257  0.5541
31       0.0458  0.0947  0.1515  0.2289  0.2913  0.3440  0.4032  0.4421  0.4770  0.5184  0.5465
32       0.0451  0.0932  0.1491  0.2254  0.2869  0.3388  0.3972  0.4357  0.4703  0.5113  0.5392
33       0.0444  0.0918  0.1468  0.2220  0.2826  0.3338  0.3916  0.4296  0.4639  0.5045  0.5322
34       0.0437  0.0904  0.1446  0.2187  0.2785  0.3291  0.3862  0.4238  0.4577  0.4979  0.5254
35       0.0431  0.0891  0.1425  0.2156  0.2746  0.3246  0.3810  0.4182  0.4518  0.4916  0.5189
36       0.0425  0.0878  0.1405  0.2126  0.2709  0.3202  0.3760  0.4128  0.4461  0.4856  0.5126
37       0.0419  0.0866  0.1386  0.2097  0.2673  0.3160  0.3712  0.4076  0.4406  0.4797  0.5066
38       0.0414  0.0855  0.1368  0.2070  0.2638  0.3120  0.3665  0.4026  0.4353  0.4741  0.5007
39       0.0408  0.0844  0.1350  0.2043  0.2605  0.3081  0.3621  0.3978  0.4301  0.4686  0.4950
40       0.0403  0.0833  0.1333  0.2018  0.2573  0.3044  0.3578  0.3932  0.4252  0.4634  0.4896
45       0.0380  0.0785  0.1257  0.1903  0.2429  0.2876  0.3384  0.3721  0.4028  0.4394  0.4647
50       0.0360  0.0744  0.1192  0.1806  0.2306  0.2732  0.3218  0.3542  0.3836  0.4188  0.4432
60       0.0328  0.0679  0.1088  0.1650  0.2108  0.2500  0.2948  0.3248  0.3522  0.3850  0.4079
70       0.0304  0.0628  0.1007  0.1528  0.1954  0.2319  0.2737  0.3017  0.3274  0.3583  0.3798
80       0.0284  0.0588  0.0942  0.1430  0.1829  0.2172  0.2565  0.2830  0.3072  0.3364  0.3568
90       0.0268  0.0554  0.0888  0.1348  0.1726  0.2050  0.2422  0.2673  0.2903  0.3181  0.3375
100      0.0254  0.0525  0.0842  0.1279  0.1638  0.1946  0.2301  0.2540  0.2759  0.3025  0.3211
110      0.0242  0.0501  0.0803  0.1220  0.1562  0.1857  0.2196  0.2425  0.2635  0.2890  0.3068
120      0.0232  0.0479  0.0769  0.1168  0.1496  0.1779  0.2104  0.2324  0.2526  0.2771  0.2943
140      0.0214  0.0444  0.0712  0.1082  0.1386  0.1648  0.1951  0.2155  0.2343  0.2572  0.2733
160      0.0201  0.0415  0.0666  0.1012  0.1297  0.1543  0.1826  0.2019  0.2195  0.2411  0.2562
180      0.0189  0.0391  0.0628  0.0954  0.1223  0.1455  0.1723  0.1905  0.2072  0.2276  0.2419
200      0.0179  0.0371  0.0595  0.0905  0.1161  0.1381  0.1636  0.1809  0.1968  0.2162  0.2298
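
The entries of Table 5 appear consistent with the usual link between the sample correlation coefficient and the t distribution: if t is the upper-P point of the t distribution with (sample size - 2) degrees of freedom, the matching critical correlation is r = t / sqrt(df + t^2). The Python sketch below illustrates that relationship under this assumption; it is not claimed to be how the table was constructed. The function names critical_r and is_significant are hypothetical, and the t value 2.920 (upper 5% point with 2 degrees of freedom) is entered by hand from a t table.

```python
import math


def critical_r(t_crit, df):
    """Convert a one-sided critical t value into the matching critical r.

    Assumes the standard relationship t = r * sqrt(df) / sqrt(1 - r**2),
    solved for r, with df = sample size - 2.
    """
    return t_crit / math.sqrt(df + t_crit ** 2)


def is_significant(r, n, t_crit):
    """One-sided test of H0: rho = 0 against rho > 0.

    r      : sample correlation coefficient
    n      : sample size (so df = n - 2)
    t_crit : upper-P point of t with n - 2 degrees of freedom,
             looked up by the user in a t table
    """
    df = n - 2
    return r > critical_r(t_crit, df)


if __name__ == "__main__":
    # Sample size 4 gives df = 2; at P = 0.05 the t table gives about 2.920.
    # The result should be close to the Table 5 entry 0.9000.
    print(round(critical_r(2.920, 2), 4))

    # A sample correlation of 0.95 from 4 observations exceeds that value.
    print(is_significant(0.95, 4, 2.920))
```

In practice the table is read directly: find the row for sample size - 2, the column for the chosen probability level, and reject H0 in a one-sided test when the sample correlation exceeds the tabulated value.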
