Intro Stat 2010
December 2010
Introduction
IntroSTAT apart, there seem to be two kinds of introductory Statistics textbooks. There
are those that assume no mathematics at all, and get themselves tied up in all kinds of
knots trying to explain the intricacies of Statistics to students who know no calculus.
There are those that assume lots of mathematics, and get themselves tied up in the
knots of mathematical statistics.
IntroSTAT assumes that students have a basic understanding of differentiation and
integration. The book was designed to meet the needs of students, primarily those in
business, commerce and management, for a course in applied statistics.
IntroSTAT is designed as a lecture-book. One of our aims is to maximize the time
spent in explaining concepts and doing examples. It is for this reason that three types
of examples are included in the chapters. Those labeled A are used to motivate concepts,
and often contain explanations of methods within them. They are for use in lectures.
The B examples are worked examples; they shouldn't be used in lectures, because there
is nothing more deadly dull than lecturing through worked examples. Students should
use the B examples for private study. The C examples contain problem statements.
A selection of these can be tackled in lectures without the need to waste time by the
lecturer writing up descriptions of examples, and by the students copying them down.
There are probably more exercises at the end of most chapters than necessary. A
selection has been marked with asterisks (*); these should be seen as a minimum set
to give experience with all the types of exercises.
Acknowledgements . . .
We are grateful to our colleagues who used editions 1 to 4 of IntroSTAT and made
suggestions for changes and improvements. We have also appreciated comments from
students. We will continue to welcome their ideas and hope that they will continue to
point out the deficiencies. Mrs Tib Cousins undertook the enormous task of turning
edition 4 into TeX files, which were the basis upon which this revision was undertaken.
Mrs Margaret Zaborowski helped us proofread the text but any remaining errors are
our responsibility.
This volume is essentially the 1996 edition of IntroSTAT with some minor corrections
of errors, and was reset in LaTeX from the original plain TeX version.
INTROSTAT
Contents
Introduction
1 EXPLORING DATA
2 SET THEORY
3 PROBABILITY THEORY
4 RANDOM VARIABLES
5 PROBABILITY DISTRIBUTIONS I
7 PROBABILITY DISTRIBUTIONS II
TABLES
Chapter 1
EXPLORING DATA
KEYWORDS: Data summary and display, qualitative and quantitative data, pie charts, bar graphs, histograms, symmetric and skew
distributions, stem-and-leaf plots, median, quartiles, extremes, five-number summary, box-and-whisker plots, outliers and strays, measures
of spread and location, sample mean, sample variance and standard deviation, summary statistics, exploratory data analysis.
Facing up to uncertainty . . .
We live in an uncertain world. But we still have to take decisions. Making good
decisions depends on how well informed we are. Of course, being well informed means
that we have useful information to assist us. So, having useful information is one of the
keys to good decision making.
Almost instinctively, most people gather information and process it to help them
take decisions. For example, if you have several applicants for a vacant post, you would
not draw a number out of a hat to decide which one to employ. Almost without thinking
about it, you would attempt to gather as much relevant information as you can about
them to help you compare the applicants. You might make a short-list of applicants to
interview, and prepare appropriate questions to put to each of them. Finally you come
to an informed decision.
Sometimes the available information is such that we feel it is easy to make a good
decision. But at other times, so much confusion and uncertainty cloud the situation
that we are inclined to go by gut-feeling or even by guessing. But we can do better
than this. This book aims to equip you with some of the necessary skills to outguess
the competition. Or, putting it less brashly, to help you to make consistently sound
decisions.
As the world becomes more technologically advanced, people realize more and more
that information is valuable. Obtaining the information they need might just require a
phone call, or maybe a quick visit to the library. Sometimes, they might need to expend
more energy and extract some information out of a database. Or worse, they might have
to design an experiment and gather some data of their own. On other occasions, the
information might be hidden in historical records.
Usually, data contains information that is not self-evident. The message cannot be
extracted by simply eye-balling the data. Ironically, the more valuable the information,
the more deeply it usually lies buried within the data. In these instances, statistical
tools are needed to extract the information from the data. Herein lies the focus of this
book.
For example, consider the record of share prices on the Johannesburg Stock Exchange.
Hidden in this data lies a wealth of information: whether or not a share is risky, or if it
is over- or underpriced. This data even contains traces of our own emotions (whether
our sentiments are mawkish or positive, risk prone or risk averse) and our preferences
for higher dividends, for smaller companies, for blue-chip shares. Little wonder that
there is a multitude of financial analysts out there trying to analyse share price data
hoping that they might unearth valuable information that will deliver the promise of
better profits.
Just as the financial analysts have an insatiable appetite for information on which
to base better investment decisions, so in every field of human endeavour, people are
analysing information with the objective of improving the decisions they take.
One of the essential skills needed to extract information from data, to interpret this
information, and to take decisions based on it, is Statistics. Not everyone is willing,
or has the foresight, to master a course in the science of Statistics. We are fortunate
that this is true; otherwise statisticians would not hold the monopoly on superior
decision-making!
You have already made at least one good decision: the decision to do a course in
Statistics.
...
Most people seem to think that what statisticians do all day is to count and to add.
The two kinds of statisticians that most frequently impinge on the general public are
really parodies of statisticians: the sports statistician and the official statistician.
Statistics is not what you see at the bottom of the television screen during the French
Open Tennis Championships: statistique, followed by a count of the number of double
faults and aces the players have produced in the match so far! Nor is statistics about
adding up dreary columns of figures, and coming to the conclusion, for example, that
there were 30 777 000 sheep in South Africa in 1975. That sort of count is enough to
put anyone to sleep!
If statisticians don't count in the 1-2-3 sense, in what sense do they count? What
is statistics? We define statistics as the science of decision making in the face of
uncertainty. The emphasis is not on the collection of data (although the statistician
has an important role in advising on the data collection process), but on taking things
one step further: interpreting the data. Statistics may be thought of as data-based
decision making. Perhaps it is a pity that our discipline is called Statistics. A far better
name would have been Decision Science. Statistics really comes into its own when the
decisions to be made are not clear-cut and obvious, and there is uncertainty (even after
the data has been gathered) about which of several alternative decisions is the best one
to choose.
For example, the decision about which card to play in a game of bridge to maximize
your chance of winning, or the decision about where to locate a factory so as to maximize
the likelihood that your company's share of the market will reach a target value are not
simple decisions. In both situations, you can gather as much data as you can (the cards
in your hand, and those already played in the first case, proximity to raw materials and
to markets in the second), and take a best possible decision on the basis of this data,
but there is still no guarantee of success. In both cases, your opposition may react in
unexpected ways, and you risk defeat.
In the above sentences, the words "uncertainty", "chance", "likelihood" and "risk"
have appeared. All these are qualitative terms. Before the statistician can get down to
his or her real job (of taking decisions in the face of uncertainty), this nebulous concept
of uncertainty has to be put onto a firm footing. Probability Theory is the branch of
mathematics that achieves this quantification of uncertainty.
Therefore, before you can become a statistician, you have to learn a hefty chunk of
Probability Theory. This is contained in chapters 2 to 7. Chapters 8 to 12 deal with the
science of data-based decision making.
However, in the remainder of this chapter, we aim to give you insight into what is to
come in the later chapters, to give you a feeling for data, and to do data-based decision
making using intuitive concepts.
...
Data is information. There are data drips and data floods, and statisticians have
to learn to deal with both. Usually, there is either too little or too much data! When
data comes in floods, the problem is to extract the salient features. When data comes
in drips, the problem is to know what are valid interpretations.
Besides the amount of data, there are different types of data. For the moment,
we need to distinguish between qualitative and quantitative data. Qualitative data
is usually non-numerical, and arises when we classify objects using labels or names as
categories: for example, make of car, colour of eyes, gender, nationality, profession,
cause of death, etc. Sometimes the categories are semi-numerical: for example, size of
companies categorized as small, medium or large.
Quantitative data, on the other hand, is always numerical, and data points can
be ranked or ordered. Quantitative data usually arises from measuring or counting:
for example, flying time between airports, number of rooms in a house, salary of an
accountant, cost of building a school, volume of water in a dam, number of new car sales
in a month, the size of the AIDS epidemic, etc.
Example 1A: Table 1.1 contains data on a class of 81 Master of Business Administration
(MBA) students. The table shows each student's faculty for their first degree, in either
Arts, Commerce, Engineering, Medicine, Science or other. Also given are their test
scores for an entrance examination known as the GMAT, a test commonly used by
business schools worldwide as part of the information to assist in the selection process.
Our brief is to construct a visual summary of the distribution of students having the
various first-degree categories in the table.
Firstly, we decide that first-degree category, the data that we are being asked to
display, is qualitative data. Appropriate display techniques are the pie chart and the
bar graph.
Secondly, we find the frequency distribution of the qualitative data by counting
the number of students falling into each category. At the same time, we calculate
relative frequencies by dividing the frequency in each category by the total number
of observations:
First degree    Frequency    Relative frequency
Engineering        28              0.35
Science            16              0.20
Arts               16              0.20
Commerce           10              0.12
Medicine            5              0.06
Other               6              0.07
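The counting step for Example 1A can be sketched in a few lines of Python. This is only an illustration: the list of labels is rebuilt from the category counts given in the example, and the function name `frequency_table` is an assumption, not something from the text.

```python
from collections import Counter

# First-degree labels for the 81 students, rebuilt from the category
# counts of Example 1A (an assumption made to keep the sketch short).
counts_in_example = {"Engineering": 28, "Science": 16, "Arts": 16,
                     "Commerce": 10, "Medicine": 5, "Other": 6}
degrees = [label for label, k in counts_in_example.items() for _ in range(k)]

def frequency_table(labels):
    """Return (category, frequency, relative frequency) rows, largest first."""
    freq = Counter(labels)
    n = len(labels)
    return [(cat, f, round(f / n, 2)) for cat, f in freq.most_common()]

table = frequency_table(degrees)
for cat, f, rel in table:
    print(f"{cat:<12}{f:>4}{rel:>7.2f}")
```

Sorting the rows by decreasing frequency is a design choice that also suits the pie chart and bar graph, where the largest categories are usually shown first.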
[Figure: pie chart showing the share of students in each first-degree category (Engineering, Science, Arts, Commerce, Medicine, Other).]
Table 1.1: First degree and GMAT score of the 81 MBA students

      First degree  GMAT score        First degree  GMAT score        First degree  GMAT score
  1.  Engineering      610       28.  Engineering      710       55.  Arts             500
  2.  Engineering      510       29.  Science          600       56.  Arts             620
  3.  Engineering      610       30.  Science          550       57.  Arts             550
  4.  Engineering      580       31.  Science          540       58.  Arts             600
  5.  Engineering      720       32.  Science          620       59.  Arts             520
  6.  Engineering      620       33.  Science          650       60.  Arts             520
  7.  Engineering      540       34.  Science          500       61.  Commerce         550
  8.  Engineering      500       35.  Science          590       62.  Commerce         520
  9.  Engineering      750       36.  Science          630       63.  Commerce         560
 10.  Engineering      640       37.  Science          660       64.  Commerce         560
 11.  Engineering      550       38.  Science          570       65.  Commerce         600
 12.  Engineering      650       39.  Science          600       66.  Commerce         540
 13.  Engineering      600       40.  Science          630       67.  Commerce         550
 14.  Engineering      600       41.  Science          500       68.  Commerce         650
 15.  Engineering      510       42.  Science          580       69.  Commerce         510
 16.  Engineering      570       43.  Science          560       70.  Commerce         560
 17.  Engineering      620       44.  Science          550       71.  Medicine         590
 18.  Engineering      590       45.  Arts             560       72.  Medicine         700
 19.  Engineering      660       46.  Arts             550       73.  Medicine         640
 20.  Engineering      550       47.  Arts             500       74.  Medicine         680
 21.  Engineering      560       48.  Arts             510       75.  Medicine         580
 22.  Engineering      630       49.  Arts             570       76.  Other            550
 23.  Engineering      540       50.  Arts             510       77.  Other            680
 24.  Engineering      560       51.  Arts             660       78.  Other            540
 25.  Engineering      650       52.  Arts             500       79.  Other            640
 26.  Engineering      540       53.  Arts             710       80.  Other            620
 27.  Engineering      680       54.  Arts             510       81.  Other            450
Bar graph: Notice that there is no quantitative scale along the vertical axis of the
bar graph, that the bars are not connected, and that the widths of the bars have no
particular relevance. Because there is no quantitative ordering of the categories, we are
free to arrange them as we please. As for the pie chart, it is generally best to arrange
the bars in decreasing order of relative frequency; this makes comparison easier, and also
tends to highlight the important features of the data. Relative frequencies could also
have been used in the construction of the bar graph.
[Figure: bar graph of the number of students by first degree, with bars for Engineering (28), Science (16), Arts (16), Commerce (10), Other and Medicine; the frequency axis is marked at 10, 20 and 30.]
8.27
Diamonds
5
2
3
4
3
1.39
0.89
5.45
3.99
7.27
All-Gold
Copper
Manganese
Platinum
Tin
Other metals & minerals
1
1
2
1
3
0.55
0.91
5.00
0.01
0.26
Metal
&
minerals
Mining houses
Mining holding
3
3
16.79
7.13
Mining
financial
5
4
3
12
11
2.80
2.15
1.02
0.47
0.97
Financial
Industrial holdings
Beverages & hotels
Building & construction
Chemicals
Clothing, footwear & textiles
Electronics, electr. & battery
Engineering
Fishing
Food
Furniture & household goods
Motors
Paper & packaging
Pharmaceutical & medical
Printing & publishing
Steel & allied
Stores
Sugar
Tobacco & match
Transportation
6
2
6
2
10
7
7
1
4
5
6
3
3
3
2
10
1
1
3
9.94
3.43
0.92
3.46
0.62
1.39
0.95
0.08
2.14
0.30
0.43
2.26
0.48
0.18
1.99
2.16
0.43
2.48
0.22
Industrial
AllShare
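\[
\frac{x_{\max} - x_{\min}}{1 + \log_2 n} = \frac{x_{\max} - x_{\min}}{1 + 1.44 \log_e n}
\]
\[
\frac{x_{\max} - x_{\min}}{\sqrt{n}}.
\]
\[
\frac{x_{\max} - x_{\min}}{\sqrt{n}} = \frac{750 - 450}{\sqrt{81}} = 33.33.
\]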
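As a rough sketch of the two guidelines for the class interval length (the range divided by 1 + log2 n, and the range divided by the square root of n), the following Python fragment evaluates both for the GMAT data, where n = 81 and the scores run from 450 to 750. The function name `class_width_guidelines` is an illustrative assumption.

```python
import math

def class_width_guidelines(x_min, x_max, n):
    """Two rule-of-thumb class interval lengths for a histogram of n values."""
    data_range = x_max - x_min
    by_log = data_range / (1 + math.log2(n))   # range / (1 + log2 n)
    by_root = data_range / math.sqrt(n)        # range / sqrt(n)
    return by_log, by_root

by_log, by_root = class_width_guidelines(450, 750, 81)
print(round(by_log, 2), round(by_root, 2))
```

Both suggestions are awkward lengths, which is why a rounder width nearby is the practical choice for the final class intervals.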
As a general rule, avoid choosing class intervals which are of awkward lengths.
Multiples of 2, 5 and 10 are most frequently used. Feel free to choose intervals
between half and double those suggested by the guidelines. All the class intervals
should be the same width. Resist the temptation to make the class intervals wider
over that part of the range where the data is sparser; this has the effect of
Footnote 1: We consider a sample to be a small number of observations taken from the population of interest.
We hope that the sample is representative of the population as a whole, so that conclusions drawn
from the sample will be valid for the population. We consider methods of obtaining a representative
sample in Chapter 11.
3. Count the number of GMAT scores falling into each class interval. The most
convenient way to do this is to set up a tick sheet, and to make one pass through
the data allocating each score to its class interval. This sets up a frequency
distribution:
Class interval    Frequency
450-499               1
500-549              20
550-599              24
600-649              21
650-699              10
700-749               4
750-799               1
Total                81
[Figure: histogram of the GMAT scores; vertical axis: number of MBA students (0 to 25); horizontal axis: GMAT scores (400 to 800).]
The striking feature of the histogram is that it is not symmetric but is skewed to
the right, which means that it has a long tail stretching off to the right. The terms in bold
are technical, jargon terms, but their meanings are obvious.
A seasoned statistician would expect a distribution of test scores (or examination
results) to have a tail at both ends of the frequency distribution. In the above display,
there has been a truncation of the distribution at 500 (apart from a single score of
450). We would infer that the acceptance criterion on the MBA programme is a GMAT
score of 500 or more. In reality there is a tail on the left, but it is suppressed by the
fact that applicants who achieved these scores were not accepted. In the light of this, a
statistician would also query the score of 450. Is it an error in the data? Maybe it should
be 540, and there has been a transcription error. But a more plausible explanation is
that the student was outstanding in some other aspect of the selection process; maybe
the personal interview was very impressive!
Example 4B: The risks taken by investors when they invest in the stock exchange are
of considerable interest to financial analysts. Investors associate the risks of investments
with how volatile (or variable) the price changes are. Analysts measure volatility of price
changes using the standard deviation, a statistical measure of variability that we
will learn about later in this chapter. The table below contains the standard deviations
(or risks) of a sample of 75 shares listed on the Johannesburg Stock Exchange. The units
are per cent per month. Construct a suitable histogram of the data.
23 19 27 26 17 22 23 20 25 10 17 11 17 11 25
18 16  8 12 26 21 11 13 20 11 25 15 28 22 12
23 25 12 23 27 14 15 12 12 12 21 13 14  9 13
11 23 23 21  9 13 19 19 13 25 22 12 11 22 20
28 11 10 14 14  9 13 12 15 10 23 13 12 17 23
So a sensible length for the class interval is 2, and we use class interval boundaries
at 8, 10, 12, . . . , 28. Effectively, the class intervals are 8-9, 10-11, . . . , 28-29.
3. Count the number of shares falling into each class:
Class interval    Frequency
 8-9                  4
10-11                10
12-13                16
14-15                 7
16-17                 5
18-19                 4
20-21                 6
22-23                12
24-25                 5
26-27                 4
28-29                 2
Total                75
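The tick-sheet pass through the data can be mimicked in Python. The sketch below assumes the 75 risk values as listed in Example 4B and class intervals of length 2 starting at 8; the function name `tally` is hypothetical.

```python
from collections import Counter

# The 75 standard deviations (per cent per month) listed in Example 4B.
risks = [23, 19, 27, 26, 17, 22, 23, 20, 25, 10, 17, 11, 17, 11, 25,
         18, 16, 8, 12, 26, 21, 11, 13, 20, 11, 25, 15, 28, 22, 12,
         23, 25, 12, 23, 27, 14, 15, 12, 12, 12, 21, 13, 14, 9, 13,
         11, 23, 23, 21, 9, 13, 19, 19, 13, 25, 22, 12, 11, 22, 20,
         28, 11, 10, 14, 14, 9, 13, 12, 15, 10, 23, 13, 12, 17, 23]

def tally(values, start, width):
    """One pass through the data: count the values in each class interval."""
    ticks = Counter((v - start) // width for v in values)
    return {start + k * width: ticks[k] for k in range(max(ticks) + 1)}

freq = tally(risks, start=8, width=2)
for lower, f in freq.items():
    print(f"{lower}-{lower + 1}: {f}")
```

The two largest counts fall in the intervals 12-13 and 22-23, which is the two-peaked shape the histogram makes visible.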
[Figure: histogram of the risks of the 75 shares; vertical axis: number of shares (5, 10, 15); horizontal axis: Risk (10, 20, 30).]
The striking feature of this histogram is that it has two clear peaks. In statistical
jargon, it is said to be bimodal. The visual display of this information has thus revealed
information which was not at all obvious from even a careful search through the 75 values
in the table of data. The financial analyst now needs an explanation for the bimodality.
Further investigation revealed that gold shares were predominantly responsible for the
peak on the right while industrial shares were found to be responsible for the peak on
the left. The histogram reveals that gold shares generally have a substantially higher risk
than industrial shares. In layman's terms, we conclude that gold shares are generally
more volatile than industrial shares.
Example 5C: Plot a histogram to display the examination marks of 25 students and
comment on the shape of the histogram:
68 78 72 48 39
55 50 53 69 52
61 71 51 42 50
57 41 34 52 57
65 37 66 87 45
Example 6C: A company that produces timber is interested in the distribution of the
heights of their pine trees. Construct a histogram to display the heights, in metres, of
the following sample of 30 trees:
18.3 17.7 18.5 19.1 19.1 17.8
17.3 17.4 20.1 19.4 19.3 19.4
17.6 18.7 20.5 20.1 18.2 16.8
19.9 20.0 18.8 20.0 17.7 19.7
19.5 20.0 18.4 19.3 17.5 20.4
Example 7A: At the end of the 1983/84 English football season, the points scored by each
club were as follows:
Arsenal              63        Nottingham Forest      71
Aston Villa          60        Notts County           41
Birmingham City      48        Queens Park Rangers    73
Coventry City        50        Southampton            71
Everton              59        Stoke City             50
Ipswich Town         53        Sunderland             52
Leicester City       51        Tottenham Hotspur      61
Liverpool            79        Watford                57
Luton Town           51        West Bromwich          51
Manchester United    74        West Ham United        60
Norwich City         50        Wolverhampton          29
To produce the stem-and-leaf plot for the points scored by each team, we split each
number into a stem and a leaf. In this example, the natural split is to use the tens
as stems, and the units as leaves. Because the numbers range from the 20s to the 70s,
our stems run from 2 to 7. We write them in a column:
stems leaves
2
3
4
5
6
7
We now make one pass through the data. We split each number into its stem
and its leaf, and write the leaf on the appropriate stem. The first number, the
63 points scored by Arsenal, has stem 6 and leaf 3. We write a 3 as a leaf on
stem 6:
stems leaves
2
3
4
5
6 3
7
Aston Villa's 60 points become leaf 0 on stem 6. Birmingham City's 48 points are
entered as leaf 8 on stem 4. After the first six scores have been entered, we have:
stems  leaves
2
3
4      8
5      093
6      30
7

Once all 22 scores have been entered, we have:

stems  leaves       count
2      9               1
3                      0
4      81              2
5      0931100271     10
6      3010            4
7      94131           5
                      22
In the last column, we enter the count of the number of leaves on each stem, add
them up, and check that we have entered the right number of leaves!
The final step is to sort the leaves on each stem from smallest to largest, and to add
a cumulative count column:
stems  sorted leaves  count  cum. count
2      9                 1        1
3                        0        1
4      18                2        3
5      0001112379       10       13
6      0013              4       17
7      11349             5       22
What have we created? Essentially, we have a histogram on its side, with class
intervals of length 10. But in addition, we have retained all the original information.
In a histogram, we would only have known that five teams scored between 70 and 79
points; now we know that there were scores of 71 (two teams), 73, 74, and Liverpool's
league-winning 79!
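The whole procedure (split each number into stem and leaf, enter the leaves, sort them, then count) can be sketched in Python for the football points of Example 7A. The function name `stem_and_leaf` is an illustrative assumption.

```python
from collections import defaultdict

# Points of the 22 clubs in Example 7A, in the order listed.
points = [63, 60, 48, 50, 59, 53, 51, 79, 51, 74, 50,
          71, 41, 73, 71, 50, 52, 61, 57, 51, 60, 29]

def stem_and_leaf(values):
    """Map each tens digit (stem) to its sorted units digits (leaves)."""
    leaves = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    # Keep empty stems too, so the plot retains its histogram-like shape.
    return {s: sorted(leaves[s]) for s in range(min(leaves), max(leaves) + 1)}

plot = stem_and_leaf(points)
cumulative = 0
for stem, lv in plot.items():
    cumulative += len(lv)
    print(f"{stem} | {''.join(map(str, lv)):<10} {len(lv):>2} {cumulative:>2}")
```

Printing the count and cumulative count alongside each stem gives the same check as in the text: the final cumulative count must equal the number of observations.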
Example 8A: For the data of example 1A, produce and compare stem-and-leaf plots
of the GMAT scores of students with Engineering and Arts backgrounds.
All the GMAT scores ended in a zero, so this contains no useful information; therefore
we use the hundreds as stems and the tens as leaves. For both categories of students,
we would then have only three stems, 5, 6 and 7. Looking back at the histogram
display of GMAT scores (example 3A), we note that we used class intervals of width 50
units. We can create this width class interval in the stem-and-leaf plot as demonstrated
below.
Engineering:
stems  leaves     count
5      140144        6
5?     8579566       7
6      11240023      8
6?     5658          4
7      21            2
7?     5             1
                    28

Arts:
stems  leaves     count
5      01101022      8
5?     6575          4
6      20            2
6?     6             1
7      1             1
7?                   0
                    16
In this approach, we split each 100 into two stems; the first keeps the plain digit as its
label and encompasses the leaves from 0 to 4, the second is labelled with the digit
followed by ? and includes the leaves from 5 to 9.
The final step is to sort the leaves for each stem.
ENGINEERING
stems  sorted leaves  cum. count
5      011444             6
5?     5566789           13
6      00112234          21
6?     5568              25
7      12                27
7?     5                 28

ARTS
stems  sorted leaves  cum. count
5      00011122           8
5?     5567              12
6      02                14
6?     6                 15
7      1                 16
7?                       16
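Splitting each hundred into two stems can be sketched as follows for the engineering GMAT scores of Table 1.1. The function name `split_stem_plot` and the (hundreds, half) key are assumptions made for illustration; half 0 holds the leaves 0 to 4 and half 1 holds the leaves 5 to 9.

```python
# GMAT scores of the 28 engineering students, in the order listed in Table 1.1.
engineering = [610, 510, 610, 580, 720, 620, 540, 500, 750, 640, 550, 650, 600, 600,
               510, 570, 620, 590, 660, 550, 560, 630, 540, 560, 650, 540, 680, 710]

def split_stem_plot(scores):
    """Stems are (hundreds, half); leaves are the tens digits, already sorted."""
    plot = {}
    for s in sorted(scores):
        hundreds, tens = s // 100, (s % 100) // 10
        plot.setdefault((hundreds, tens // 5), []).append(tens)
    return plot

eng_plot = split_stem_plot(engineering)
for (hundreds, half), leaves in eng_plot.items():
    label = f"{hundreds} ({'5-9' if half else '0-4'})"
    print(label, "".join(map(str, leaves)))
```

Because the scores are sorted before they are entered, the leaves arrive in order and no separate sorting pass is needed. Stems with no leaves are simply absent from the dictionary in this sketch.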
pattern seems marked enough to suggest that engineers perform better, on average, than
the arts students.
If splitting stems into two parts seems inadequate for the data set on hand, here
is a system for splitting them into five!
Example 9B: Produce a stem-and-leaf plot for the risk data of Example 4B.
As for the histogram, it would be sensible to use stems of width 2. Each stem is
therefore split into five: leaves 0 and 1 go on the plain stem, 2 and 3 are denoted t, 4 and 5
are denoted f, 6 and 7 are denoted s, and 8 and 9 are denoted ?. Notice the
convenient mnemonics (two and three, four and five, six and seven); English is a marvellous language!
stems  sorted leaves      count  cum. count
0?     8999                  4        4
1      0001111111           10       14
1t     2222222223333333     16       30
1f     4444555               7       37
1s     67777                 5       42
1?     8999                  4       46
2      000111                6       52
2t     222233333333         12       64
2f     55555                 5       69
2s     6677                  4       73
2?     88                    2       75
Note that we have presented the stem-and-leaf plot with the leaves already sorted.
Example 10C: Produce a stem-and-leaf plot for the examination marks of another
group of 25 students.
50 72 79 71 53
61 85 51 50 53
72 53 65 39 58
67 43 27 45 43
48 51 69 53 54
30
20
28
30
33
30
31
31
28
28
Windhoek
Cape Town
George
Port Elizabeth
East London
Beaufort West
Queenstown
Durban
Pietermaritzburg
Ladysmith
32
22
21
18
17
23
22
26
27
30
Example 12C: In order to assess the prices of the television repair industry, a faulty
television set was taken to 37 TV repair shops for a quote. The data below represents
the quoted prices in rands. Construct a stem-and-leaf plot and comment on its features.
 60 185 105  55 200  75 158  38 150 140
 75 120  48  75  36 120 125 150  85 125
120 245 125 176  90 145  60  60 200  78
 49 145  28  38  94  98 165
Example 13A: Find the five-number summary for the end-of-season football points of
Example 7A.
An easy way to find the five-number summary is to use the stem-and-leaf plot.
stems  sorted leaves  count  cum. count
2      9                 1        1
3                        0        1
4      18                2        3
5      0001112379       10       13
6      0013              4       17
7      11349             5       22
Because $n = 22$, the median has rank $m = (n + 1)/2 = (22 + 1)/2 = 11\tfrac{1}{2}$. We need
to average the numbers with ranks 11 and 12. From the cumulative count, we see that
the last leaf on stem 4 has rank 3, and the last leaf on stem 5 has rank 13. Counting
along stem 5, we find that 53 is the number with rank 11 and 57 has rank 12. Thus the
median is the average of these two numbers, $(53 + 57)/2 = 55$; we write $x_{(m)} = 55$. Half
the teams scored below 55 points, half scored above 55 points.
The lower quartile has rank $l = ([m] + 1)/2 = ([11\tfrac{1}{2}] + 1)/2 = (11 + 1)/2 = 6$, where $[m]$
is the integer part of $m$. The observation with rank 6 is 50, thus $x_{(l)} = 50$. The upper
quartile has rank $u = n - l + 1 = 22 - 6 + 1 = 17$. The observation with rank 17 is 63, thus
$x_{(u)} = 63$. The extremes are $x_{(1)} = 29$ and $x_{(n)} = 79$. The five-number summary is:
(29, 50, 55, 63, 79).
Why is this a big deal? Because it tells us that . . .
1. Half the teams scored below 55 points, half scored above 55 points, because 55 is
the median.
2. Half the teams scored between 50 and 63 points, because these two numbers are
the lower and upper quartiles.
3. A quarter of the scores lay between 29 and 50, a quarter between 50 and 55, a
quarter between 55 and 63, and a quarter between 63 and 79.
4. All the scores lay between 29 and 79.
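The rank rules used in Example 13A can be collected into a short Python sketch: the median has rank m = (n + 1)/2, the lower quartile has rank l = ([m] + 1)/2, the upper quartile has rank u = n - l + 1, and a rank ending in one half means averaging the two neighbouring observations. The function names are illustrative assumptions.

```python
import math

def value_at_rank(sorted_values, rank):
    """Observation with a 1-based rank; a rank ending in .5 averages its two neighbours."""
    lower = sorted_values[math.floor(rank) - 1]
    upper = sorted_values[math.ceil(rank) - 1]
    return (lower + upper) / 2

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    xs = sorted(values)
    n = len(xs)
    m = (n + 1) / 2                # rank of the median
    l = (math.floor(m) + 1) / 2    # rank of the lower quartile
    u = n - l + 1                  # rank of the upper quartile
    return tuple(value_at_rank(xs, r) for r in (1, l, m, u, n))

# Points of the 22 clubs in Example 7A.
points = [63, 60, 48, 50, 59, 53, 51, 79, 51, 74, 50,
          71, 41, 73, 71, 50, 52, 61, 57, 51, 60, 29]
summary = five_number_summary(points)
print(summary)
```

Note that statistical software often uses slightly different quartile conventions, so library routines may not reproduce these values exactly for every data set.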
Example 14B: Find the five-number summaries for GMAT scores of both engineering
and arts students. Use the stem-and-leaf plot of example 8A.
ENGINEERING
stems  sorted leaves  count  cum. count
5      011444            6        6
5?     5566789           7       13
6      00112234          8       21
6?     5568              4       25
7      12                2       27
7?     5                 1       28

ARTS
stems  sorted leaves  count  cum. count
5      00011122          8        8
5?     5567              4       12
6      02                2       14
6?     6                 1       15
7      1                 1       16
7?                       0       16
For the engineers, the median has rank $m = (28 + 1)/2 = 14\tfrac{1}{2}$. Thus $x_{(m)} = (600 +
600)/2 = 600$. The lower quartile has rank $l = ([m] + 1)/2 = ([14\tfrac{1}{2}] + 1)/2 = 7\tfrac{1}{2}$, and
the upper quartile has rank $u = n - l + 1 = 28 - 7\tfrac{1}{2} + 1 = 21\tfrac{1}{2}$. So $x_{(l)} = (550 + 550)/2 = 550$,
and $x_{(u)} = (640 + 650)/2 = 645$. The five-number summary is
(500, 550, 600, 645, 750).
For the arts students, the median has rank m = (16+1)/2 = 8 12 . Thus x(m) = (520+
550)/3 = 535. The lower quartile has rank l = ([m] + 1)/2 = ([8 12 ] + 1)/2 = 4 21 , and the
upper quartile rank n l + 1 = 16 4 21 + 1 = 12 12 . So x(l) = (510 + 510)/2 = 510, and
x(u) = (570 + 600)/2 = 585. The five-number summary is
(500, 510, 535, 585, 710).
For the engineers, the median GMAT score was 600; by contrast, for arts students,
it was only 535. The central 50% of engineers obtained scores in the interval from 550 to
645, while the central 50% of arts students were in a downwards-shifted interval, 510 to
585. This reinforces our earlier interpretation that engineers tend to have higher GMAT
scores than arts students.
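The rank rules used in the worked examples above can be checked mechanically. The following Python sketch is an illustration only and is not part of the original text: the function names are our own, and the scores are read off the stem-and-leaf plots of Example 14B on the assumption that each leaf is a tens digit (so stem 5 with leaf 1 stands for 510).

```python
import math

def value_at_rank(sorted_values, rank):
    """Observation at a whole or half-integer rank (ranks start at 1)."""
    lower = sorted_values[math.floor(rank) - 1]
    upper = sorted_values[math.ceil(rank) - 1]
    # For a whole rank both neighbours coincide, so the average is the value itself.
    return (lower + upper) / 2

def five_number_summary(values):
    xs = sorted(values)
    n = len(xs)
    m = (n + 1) / 2                # rank of the median
    l = (math.floor(m) + 1) / 2    # rank of the lower quartile
    u = n - l + 1                  # rank of the upper quartile
    return (xs[0], value_at_rank(xs, l), value_at_rank(xs, m),
            value_at_rank(xs, u), xs[-1])

# Scores reconstructed from the stems and leaves (assumption: leaf unit = 10).
engineering = [500, 510, 510, 540, 540, 540,
               550, 550, 560, 560, 570, 580, 590,
               600, 600, 610, 610, 620, 620, 630, 640,
               650, 650, 660, 680,
               710, 720, 750]
arts = [500, 500, 500, 510, 510, 510, 520, 520,
        550, 550, 560, 570,
        600, 620, 660, 710]

print(five_number_summary(engineering))
print(five_number_summary(arts))
```

Under these assumptions the two calls should reproduce the summaries (500, 550, 600, 645, 750) and (500, 510, 535, 585, 710) found by hand.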
Example 15C: Find the five-number summaries for the data of (a) Example 10C, (b)
Example 11C, and (c) Example 12C.
[Figure 1.1: box-and-whisker plot; vertical axis Points (40 to 100), with the median (55) and lower quartile (50) labelled.]
Example 17B: Draw a series of box-and-whisker plots to compare the GMAT scores
of each category of MBA students.
We computed the five-number summaries of the GMAT scores for engineering and
arts students in Example 14B. The five-number summaries for all the categories are:
Engineering   (500, 550, 600, 645, 750)
Science       (500, 550, 585, 625, 660)
Arts          (500, 510, 535, 585, 710)
Commerce      (510, 540, 555, 600, 700)
Medicine      (580, 585, 640, 690, 700)
Other         (460, 540, 585, 640, 680)
The box-and-whisker plots, shown side-by-side, reveal the differences between the
various categories of students.
[Figure: side-by-side box-and-whisker plots of GMAT scores (vertical axis, 400 to 700) for the categories ENG., SCI., ARTS, COM., MED. and OTH.]
We see from a comparison of the box-and-whisker plots that the students in this class
with a medical background had the highest median GMAT score, followed by engineers,
with arts students having the lowest median. The skewness to the right (now shown as a
long whisker pointing upwards!) which we commented on earlier for the class as a whole,
is also evident for engineering, science, arts and commerce students, the categories for
which the sample sizes were large.
Example 18A: The university computing service provides data on the amount of
computer usage (hours) by each of 30 students in a course:
Student  Hours    Student  Hours    Student  Hours
AD483       53    AM044        2    AS677       36
CI144        7    CS572       25    EK817       20
FV246       38    GM337       36    GR803       33
HN050       48    JK314       84    JR894      154
JV670       31    KM232       35    LJ419       44
LW032       48    MA276       69    MJ076       95
PH544        4    PS279       60    RR676       18
SA831       51    SC186       47    SS154       37
TB864       11    VO822       41    WG794       34
WB909       73    YG007       38    ZP559      125
Is the lecturer justified in claiming that certain students appear to be making excessive
use of the computer (playing games?) while the usage of others is so low that she is
suspicious that they are not doing the work themselves?
stems   sorted leaves   cum. count
0       247             3
1       18              5
2       05              7
3       134566788       16
4       14788           21
5       13              23
6       09              25
7       3               26
8       4               27
9       5               28
10                      28
11                      28
12      5               29
13                      29
14                      29
15      4               30
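[Figure: box-and-whisker plot of computer usage; vertical axis Hours (50 to 150), with the median (38) and lower quartile (31) marked. Students labelled at the upper end: JR894, ZP559, MJ076, JK314. Students labelled at the lower end: CI144, AM044, TB864, PH544.]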
The lecturer now has a list of students whose computer utilization appears to be
suspicious.
Example 19C: A company that produces breakfast cereals is interested in the protein
content of wheat, its basic raw material. The protein content of 29 samples of wheat
(percentages) was recorded as follows:
9.2 8.0 10.9 11.6 10.4 9.5 8.5 7.7 8.0 11.3 10.0 12.8 8.2 10.5 10.2
11.9 8.1 12.6 8.4 9.6 11.3 9.7 10.8 83 10.8 11.5 21.5 9.4 9.7
Confirm the statistician's conclusion that the values 83 and 21.5 are outliers.
The statistician asked that these values should be investigated. Checking back to
the original data, it was discovered that 83 should have been 8.3, and 21.5 should have
been 12.5. Transposed digits and misplaced decimal points are two of the most frequent
types of error that occur when data is entered into a computer.
Example 20C: A winery is concerned about the possible impact of global warming
on the grape crop. It was able to obtain some interesting historical rainfall data going
back to 1884 from a wine-producing region. The rainfall (mm) in successive Januaries
at Paarl for the 22-year period 1884–1905 was recorded as follows:
Year   Rain    Year   Rain    Year    Rain
1884    2.6    1892   37.8    1900     3.0
1885    4.9    1893    0.0    1901   145.1
1886   16.3    1894    0.0    1902    39.7
1887   21.6    1895   52.3    1903   105.9
1888    6.1    1896    4.1    1904    17.8
1889    0.0    1897    6.4    1905    10.6
1890    0.0    1898   15.8
1891    1.1    1899   27.7
Statistics in Statistics . . .
Within the discipline Statistics, we give a precise technical definition to the concept,
a statistic. A statistic is any quantity determined from a sample. Thus the median is
a statistic, and so are the other four numbers that make up a five-number summary.
These are examples of summary statistics, because they endeavour to summarize
certain aspects of the information contained in the sample. We now learn about a
further bunch of statistics.
The sample mean locates the middle of the batch of data values in a special way.
It is equivalent to hanging a 1 kg mass at points x1, x2, . . . , xn along a ruler (of zero
mass), and then x̄ is the point at which the ruler balances. (The masses needn't be 1 kg,
but they must all be equal!)
The mean is much easier to calculate than the median. The mean requires a single
pass through the data, adding up the values. In contrast, the data needs to be sorted
before the median can be computed, an operation which requires several passes through
the data.
Example 21A: Find the sample mean of the dividend yields of 15 shares in the paper and packaging sector of the Johannesburg Stock Exchange. Also find the median.
Compare the mean and the median. The yields are expressed as percentages.
Copi         3.3    E. Haddon    7.6    Pr. Paper    6.7
Caricar      8.4    Kohler       7.1    Prs. Sup     2.9
Coates      10.7    Metal Box    6.6    Sappi        7.5
Consol       6.0    Metaclo      8.6    Trio Rand    8.2
DRG          9.6    Nampak       5.8    Xactics      3.0

stems   sorted leaves   count   cum. count
2       9               1       1
3       03              2       3
4                       0       3
5       8               1       4
6       067             3       7
7       156             3       10
8       246             3       13
9       6               1       14
10      7               1       15
The median has rank m = (15 + 1)/2 = 8, and thus x(m) = 7.1. In this example,
there is little difference between the two measures of location. But this is not always the
case . . .
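As a quick check of Example 21A, the two measures of location can be computed directly. This is a sketch of our own in Python, not the book's method; the variable names are illustrative, and the last two lines use a deliberately mistyped value (107 in place of 10.7), which is a hypothetical error introduced only to show how differently the two measures react.

```python
import statistics

# Dividend yields (%) of the 15 shares in Example 21A.
yields = [3.3, 8.4, 10.7, 6.0, 9.6,
          7.6, 7.1, 6.6, 8.6, 5.8,
          6.7, 2.9, 7.5, 8.2, 3.0]

mean = sum(yields) / len(yields)          # one pass through the data
median = statistics.median(yields)        # sorts the data internally
print(f"mean = {mean:.2f}, median = {median}")

# Hypothetical data-entry error: 10.7 typed as 107.
spoiled = [107.0 if x == 10.7 else x for x in yields]
spoiled_mean = sum(spoiled) / len(spoiled)
spoiled_median = statistics.median(spoiled)
print(f"mean = {spoiled_mean:.2f}, median = {spoiled_median}")
```

With the correct data the mean (about 6.8) and the median (7.1) are close; with the mistyped value the mean roughly doubles while the median does not move.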
Example 22A: Find the mean and the median of the weekly volume of the same
15 shares as in Example 21A. The weekly volume is the number of shares traded in a
week.
Copi          2 300    E. Haddon         0    Pr. Paper       700
Caricar       2 100    Kohler          100    Prs. Sup          0
Coates        3 100    Metal Box   111 400    Sappi        40 600
Consol        1 200    Metaclo         700    Trio Rand    84 100
DRG          31 800    Nampak          100    Xactics      45 900
Measures of spread . . .
Measures of spread give insight into the variability of a set of data. Two measures
of spread can be defined in an obvious way from the five-number summary. They are:
the range R, defined as
R = x(n) − x(1),
and the interquartile range I, defined as
I = x(u) − x(l).
The range is unreliable as a measure of spread because it depends only on the smallest
and largest values in the sample, and is thus as sensitive as it can possibly be to outlying
values in the sample. It is the ultimate example of a non-robust statistic! On the other
hand, the interquartile range is the length of the interval covering the central half of the
data values in the sample, and it is not sensitive to outliers in the data. The interquartile
range is a robust measure of spread.
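The contrast between the two measures can be seen on a small data set. The sketch below is ours, written in Python with made-up numbers (both lists are hypothetical), and it assumes the same rank rules for the quartiles as in the five-number summary.

```python
import math

def range_and_iqr(values):
    """Return (R, I) using the rank rules of the five-number summary."""
    xs = sorted(values)
    n = len(xs)
    m = (n + 1) / 2                # rank of the median
    l = (math.floor(m) + 1) / 2    # rank of the lower quartile
    u = n - l + 1                  # rank of the upper quartile

    def at(rank):
        # Average the neighbouring observations for half-integer ranks.
        return (xs[math.floor(rank) - 1] + xs[math.ceil(rank) - 1]) / 2

    return xs[-1] - xs[0], at(u) - at(l)

clean = [12, 15, 17, 18, 20, 21, 23, 25, 30]      # hypothetical data
spoiled = [12, 15, 17, 18, 20, 21, 23, 25, 300]   # same data, last value mistyped

print(range_and_iqr(clean))
print(range_and_iqr(spoiled))
```

For these numbers the range jumps from 18 to 288 when the last value is mistyped, whereas the interquartile range stays at 6.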
The sample variance and its square root, the sample standard deviation, have
the same advantage, easier algebraic manipulation, over the range and interquartile
range that the mean had over the median. Therefore the sample variance is frequently
the only measure of spread calculated for a set of data.
The sample variance, denoted by s², is defined by the formula
$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 . $$
In words, it is the sum of the squared differences between each data value and the sample
mean, with this sum being divided by one less than the number of terms in the sum.
The sample standard deviation, denoted by s, is the square root of the sample variance.
It is a nuisance to have these two measures of spread, s and s², one of which is simply
the square root of the other. Why have both? The standard deviation is the easier of
the two measures of spread to get an intuitive feeling for, largely because it is measured
in the same units as the original data. The variance is measured in squared units,
an awkward quantity to visualize. For example, if data consists of prices measured in
rands, the sample variance has units "squared rands" (whatever that means!), but the
standard deviation is in rands. Even worse, if the data consists of percentages, the
variance has units %², whereas the standard deviation has the intelligible units %.
But mathematical statisticians prefer to work with the variance: not having to deal
with a square root in the algebra makes their lives simpler and neater. So the two
equivalent measures of spread co-exist side by side, and we just have to come to terms
with both of them.
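Example 23A: Compute the sample variance s² and the standard deviation s for the
dividend yields of the 15 shares of Example 21A.
$$
\begin{aligned}
s^2 &= \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
    &= \frac{1}{15-1} \left[ (3.3 - 6.8)^2 + (8.4 - 6.8)^2 + (10.7 - 6.8)^2 + \cdots + (8.2 - 6.8)^2 + (3.0 - 6.8)^2 \right] \\
    &= \frac{1}{14} \left[ (-3.5)^2 + (1.6)^2 + (3.9)^2 + \cdots + (1.4)^2 + (-3.8)^2 \right] \\
    &= \frac{1}{14} \left[ 12.25 + 2.56 + 15.21 + \cdots + 1.96 + 14.44 \right] \\
    &= \frac{1}{14} \times 75.62 \\
    &= 5.40
\end{aligned}
$$
$$ s = \sqrt{5.40} = 2.32. $$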
The variance and the standard deviation can never be negative. This is guaranteed,
because all the terms in the sum are squared, which makes them positive or zero, even though
some of the individual differences are negative.
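The calculation of Example 23A follows the definition term by term, and it can be mirrored in a few lines of Python. This is a minimal sketch of our own (the variable names are illustrative), using the dividend yields of Example 21A.

```python
import math

# Dividend yields (%) of the 15 shares in Example 21A.
yields = [3.3, 8.4, 10.7, 6.0, 9.6,
          7.6, 7.1, 6.6, 8.6, 5.8,
          6.7, 2.9, 7.5, 8.2, 3.0]

n = len(yields)
mean = sum(yields) / n

# One subtraction per data value, then square each difference.
squared_differences = [(x - mean) ** 2 for x in yields]

s2 = sum(squared_differences) / (n - 1)   # divide by one less than n
s = math.sqrt(s2)

print(f"s2 = {s2:.2f}, s = {s:.2f}")
```

Every entry of `squared_differences` is zero or positive, which is why the result cannot be negative.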
The variance can be calculated more efficiently by a short-cut formula. The short
cut involves reducing the number of subtractions needed to calculate the variance from
n to 1. Examine the following steps carefully:
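$$
\begin{aligned}
(n-1)s^2 &= \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
         &= \sum_{i=1}^{n} (x_i^2 - 2\bar{x}x_i + \bar{x}^2) \\
         &= \sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} 2\bar{x}x_i + \sum_{i=1}^{n} \bar{x}^2
\end{aligned}
$$
The third term involves adding $\bar{x}^2$ to itself n times. So it is equal to $n\bar{x}^2$. But
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, so
$$ n\bar{x}^2 = \frac{1}{n} \Bigl( \sum_{i=1}^{n} x_i \Bigr)^2 . $$
In the second term, $\bar{x}$ is the same in every term of the sum, so
$$
\sum_{i=1}^{n} 2\bar{x}x_i = 2\bar{x} \sum_{i=1}^{n} x_i
  = \frac{2}{n} \sum_{i=1}^{n} x_i \sum_{i=1}^{n} x_i
  = \frac{2}{n} \Bigl( \sum_{i=1}^{n} x_i \Bigr)^2 .
$$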
Substituting these expressions for the second and third terms yields
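$$
\begin{aligned}
(n-1)s^2 &= \sum_{i=1}^{n} x_i^2 - \frac{2}{n} \Bigl( \sum_{i=1}^{n} x_i \Bigr)^2 + \frac{1}{n} \Bigl( \sum_{i=1}^{n} x_i \Bigr)^2 \\
         &= \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \Bigl( \sum_{i=1}^{n} x_i \Bigr)^2 ,
\end{aligned}
$$
so that $s^2$ is
$$ s^2 = \frac{1}{n-1} \Bigl[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \Bigl( \sum_{i=1}^{n} x_i \Bigr)^2 \Bigr] . $$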
Look carefully at this formula. There is now only one subtraction, whereas the
original formula involved n subtractions.
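Example 24A: Calculate the sample variance of the dividend yields again, this time
using the short-cut formula.
We need $\sum_{i=1}^{n} x_i$, the sum of the data values, given by
$$ \sum_{i=1}^{n} x_i = 3.3 + 8.4 + \cdots + 3.0 = 102.0 $$
and $\sum_{i=1}^{n} x_i^2$, the sum of squares of the data values, i.e. square them first, then add
them, like this:
$$ \sum_{i=1}^{n} x_i^2 = 3.3^2 + 8.4^2 + \cdots + 3.0^2 = 769.22 $$
Then
$$
\begin{aligned}
s^2 &= \frac{1}{n-1} \Bigl[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \Bigl( \sum_{i=1}^{n} x_i \Bigr)^2 \Bigr] \\
    &= \frac{1}{14} \Bigl[ 769.22 - \frac{1}{15} (102.0)^2 \Bigr] = 5.40
\end{aligned}
$$
as before.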
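The short-cut formula is easy to verify numerically. The Python sketch below is our own illustration: `shortcut_variance` is a hypothetical helper name, and the standard library function `statistics.variance` (which computes the sample variance with divisor n − 1) is used only as an independent check.

```python
import statistics

def shortcut_variance(values):
    """Sample variance from the sum and the sum of squares (short-cut formula)."""
    n = len(values)
    total = sum(values)                        # sum of the data values
    total_of_squares = sum(x * x for x in values)
    return (total_of_squares - total ** 2 / n) / (n - 1)

# Dividend yields (%) of the 15 shares in Example 21A.
yields = [3.3, 8.4, 10.7, 6.0, 9.6,
          7.6, 7.1, 6.6, 8.6, 5.8,
          6.7, 2.9, 7.5, 8.2, 3.0]

print(f"short-cut:  {shortcut_variance(yields):.2f}")
print(f"definition: {statistics.variance(yields):.2f}")
```

One caution when using the short-cut on a computer: if the mean is very large compared with the spread, the formula subtracts two nearly equal large numbers, and floating-point rounding can then cost accuracy. For hand calculation on data like these, that is not a concern.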
If the data has a symmetric distribution with no outliers, then the standard deviation has the following approximate interpretation. The interval from one standard
deviation below the sample mean to one standard deviation above it, (x̄ − s, x̄ + s),
should contain about two-thirds of the observations. Thus the sample mean and the
sample standard deviation together provide a two-number summary of the data set.
Many data sets are summarized by these two statistics: the sample mean provides a
measure of location and the sample standard deviation a measure of spread.
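The two-thirds rule of thumb can be tried on the dividend yields of Example 21A. This is a small sketch of our own in Python; whether the rule holds as well for other data depends on how symmetric and outlier-free they are.

```python
import math

# Dividend yields (%) of the 15 shares in Example 21A.
yields = [3.3, 8.4, 10.7, 6.0, 9.6,
          7.6, 7.1, 6.6, 8.6, 5.8,
          6.7, 2.9, 7.5, 8.2, 3.0]

n = len(yields)
mean = sum(yields) / n
s = math.sqrt(sum((x - mean) ** 2 for x in yields) / (n - 1))

# Observations strictly inside (mean - s, mean + s).
inside = [x for x in yields if mean - s < x < mean + s]

print(f"interval: ({mean - s:.2f}, {mean + s:.2f})")
print(f"{len(inside)} of {n} observations lie inside")
```

Here 10 of the 15 yields fall inside the interval, which happens to be exactly two-thirds.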
However, the sample variance and the sample standard deviation have the disadvantage that, like the mean, they are sensitive to outliers. They are sensitive in two ways.
First of all, the outlier distorts the mean, so all the differences (x_i − x̄) are misleading.
Secondly, if x_j, the jth data value, is an outlier, then the term (x_j − x̄) will be large
relative to the other differences, and, once it is squared, it can make a disproportionately
large contribution to the sum of squared differences.
Note that the intervals (x(1), x(n)), (x̄ − s, x̄ + s), and (x(l), x(u)) cover 100%, approximately
two-thirds (about 68%), and about 50% of the observations, respectively. But it is not possible to
make direct comparisons between the range, the standard deviation and the interquartile range.
Example 25B: Calculate the sample means, sample standard deviations, medians,
interquartile ranges and ranges of the GMAT scores for each faculty category of Example 1A. Comment on the results.
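For the sample mean and variance, we need the quantities $\sum_{i=1}^{n} x_i$ and $\sum_{i=1}^{n} x_i^2$. For
the category Engineering, we have
$$ \sum_{i=1}^{n} x_i = 16\,850 \qquad \text{and} \qquad \sum_{i=1}^{n} x_i^2 = 10\,254\,100 . $$
Then
$$ \bar{x} = 16\,850/28 = 601.8 $$
and
$$ s^2 = \frac{1}{27} \bigl( 10\,254\,100 - (16\,850)^2/28 \bigr) = 4\,222.6 , $$
so that $s = \sqrt{4\,222.6} = 65.0$.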
First degree       Location             Spread
                   x̄        x(m)        s        I       R
Engineering        601.8    600         65.0      95     250
Science            583.1    585         48.5      75     160
Arts               555.6    535         62.8      75     210
Commerce           567.0    555         45.7      60     140
Medicine           638.0    640         53.1     105     120
Other              581.7    585         80.1     100     220
In commenting on this table, we look first at the measures of location. The sample
means show that students with first degrees in medicine had the highest mean GMAT
score (638.0), followed by engineering students (601.8), and then commerce (567.0).
The lowest mean was recorded for arts students (555.6). The medians follow the same
pattern, and apart from arts, the sample means and medians are relatively close. In the
box-and-whisker plots in Example 17B, we saw that the distribution of GMAT scores
for arts appeared to be strongly skewed to the right. Hence the difference between the
sample mean (555.6) and the median (535) for this category of students is consistent
with the earlier evidence of skewness.
For the measures of spread, it is evident that the category Other has the largest
standard deviation (80.1), followed by engineering (65.0). The smallest standard deviation was for commerce (45.7). The interquartile ranges (I) and ranges (R) follow
a broadly similar pattern. A plausible explanation as to why the category Other
should have the largest standard deviation (and the second largest interquartile range
and range) is that it encompasses a wide diversity of students, not falling into any of the
single faculty categories.
The conclusions reached here provide a partial description of this MBA class of
81 students. If they were representative of all MBA students at all universities, we
might be able to generalize the statements. Another worry that we would have before
we could generalize the results relates to issues of sample size. Could the differences
in the measures of location and spread we observed here occur just because we got an
unusually bright group of, for example, medical students in this MBA class? We will
defer further consideration of these statistical issues until chapter 8! In order to prepare
ourselves for taking that kind of decision we have to learn some probability theory.
Example 26C: Calculate the sample mean and standard deviation, the median, the
range and interquartile range of Paarl rainfall data (Example 20C).
Example 27C: (a) Suppose that the sample mean and standard deviation of the n
numbers x1, x2, . . . , xn are x̄ and s. An additional observation $x_{n+1}$ becomes available.
Show that the updated mean x̄* is
$$ \bar{x}^* = \frac{n\bar{x} + x_{n+1}}{n+1} $$
(b) The sample mean of nine numbers is 4.8 with standard deviation 3.0. A 10th
observation is made. It is 6.8. Update the mean and the standard deviation.
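The updating formula for the mean in part (a) can be tried out numerically before it is proved. The following Python sketch is ours and uses made-up numbers rather than those of part (b); `updated_mean` is a hypothetical helper name. It does not address the update of the standard deviation.

```python
def updated_mean(n, mean, new_value):
    """Mean of n + 1 values, given the mean of the first n and the new value."""
    return (n * mean + new_value) / (n + 1)

old_values = [2.0, 4.0, 6.0]   # hypothetical data, not from the text
new_value = 8.0

old_mean = sum(old_values) / len(old_values)

# Recompute from scratch to compare with the update formula.
direct = sum(old_values + [new_value]) / (len(old_values) + 1)

print(updated_mean(len(old_values), old_mean, new_value), direct)
```

The point of the formula is that the old data values are not needed: only n, the old mean and the new observation enter the update.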
Solutions to examples . . .
2C The frequencies and relative frequencies, from which the bar graphs are constructed, are given in the table.
Major sector         (a) Frequency    (b) Percentage of All-Share Index
Coal                        2                  0.82%
Diamonds                    1                  8.27%
All-Gold                   17                 18.99%
Metals & minerals           8                  6.73%
Mining financial            6                 23.92%
Financial                  35                  7.41%
Industrial                 82                 33.86%
Total                     151                100%
[Figure: bar graphs of (a) the number of shares in each major sector (horizontal scale 0 to 90) and (b) each sector's percentage of the All-Share Index.]
Index in relation to the small number of shares. In fact, the single share AngloAmerican had a weighting of 8.95% in the All-Share Index!
[Figure: histogram; vertical axis Number of students, horizontal axis Examination mark (0 to 100).]
6C We used a class interval of 0.5.
[Figure: histogram; vertical axis Number of trees, horizontal axis Height (m).]
10C
stems   sorted leaves   count   cum. count
2       7               1       1
3       9               1       2
4       3358            4       6
5       0011333348      10      16
6       1579            4       20
7       1229            4       24
8       5               1       25
11C
stems   sorted leaves   count   cum. count
1*      78              2       2
2       01223           5       7
2*      67888           5       12
3       00001123        8       20
stems   sorted leaves                       count   cum. count
0       28,36,38,38,48,49                   6       6
0*      55,60,60,60,75,75,78,85,90,94,98    12      18
1       05,20,20,20,25,25,25,40,45,45       10      28
1*      50,50,58,65,76,85                   6       34
2       00,00,45                            3       37
20C
stems   sorted leaves                                   cum. count
0       0.0,0.0,0.0,0.0,1.1,2.6,3.0,4.1,4.9,6.1,6.4     11
1       0.6,5.8,6.3,7.8                                 15
2       1.6,7.7                                         17
3       7.8,9.7                                         19
4                                                       19
5       2.3                                             20
6                                                       20
7                                                       20
8                                                       20
9                                                       20
10      5.9                                             21
11                                                      21
12                                                      21
13                                                      21
14      5.1                                             22
[Figure: box-and-whisker plot of the Paarl rainfall; vertical axis Rainfall (mm), with the lower quartile (2.6 mm), median (8.5 mm) and upper quartile (27.7 mm) marked.]
26C x̄ = 23.58, s = 36.57, x(m) = 8.5, R = 145.1, I = 25.1.
Exercises . . .
1.1 As a cartoon strip matures, it is likely to change in subtle ways. In this exercise,
we want to look at the pattern of word usage in Schulz's "Peanuts", comparing
the period 1959/60 with 1975/76. The tables below show the number of words per
cartoon strip in the two periods. Produce stem-and-leaf plots and find five-number
summaries for both periods. Draw side-by-side box-and-whisker plots, and discuss
how the number of words per cartoon strip has changed.
Number of words in 66 Peanuts cartoon strips from 1959/60, reprinted in "You're
a winner, Charlie Brown".
51
44
23
46
30
39
35
55
26
43
28
26
35
42
45
53
36
34
30
40
63
35
29
35
41
5
58
34
76
60
52
15
28
51
40
41
55
27
37
47
32
47
38
0
59
43
59
40
41
16
24
22
32
46
32
28
35
45
49
45
49
14
43
23
29
4
37
40
49
34
46
45
45
44
39
36
63
38
41
44
35
52
39
43
18
49
35
50
60
30
20
35
52
39
40
47
33
34
47
32
30
28
47
29
45
21
29
39
17
35
44
20
17
45
40
37
52
17
30
45
37
33
28
22
42
32
19
1.2 If you are a salesperson, it is very easy to get pessimistic because on many days no
sales are made. The days when everything goes well keep you going. The daily sales
of a second-hand car salesperson are tabled below. Display them as a histogram.
Calculate also the sample mean, standard deviation and median. Is the median a
helpful measure of central tendency for this data?
0
0
0
0
0
0
0
4
0
0
0
3
1
0
0
1
1
1
0
0
1
1
1
1
0
1
1
0
0
6
0
0
0
3
0
0
1
0
1
0
0
0
0
2
0
5
0
0
0
0
1
0
1
0
2
0
2
1
5
2
0
3
0
0
2
1
1
0
0
0
0
3
3
0
0
1
2
3
0
1
4
2
0
0
1
0
0
0
0
0
2
1
1
1
2
0
0
1
0
0
0
1
0
1
2
0
1
0
(a) Display the data as a stem-and-leaf plot, compute the five-number summary
and draw the box-and-whisker plot.
(b) Find the sample mean and standard deviation. What proportion of the data
values is within one standard deviation of the mean?
(c) Comment on the shape of the distribution of the data, and attempt to interpret it.
1.4 Water is a crucial resource in a generally arid country like South Africa. The
mean annual runoff (millions of m³) of 63 rivers in the old Cape Province of South
Africa is given in the table below. Find the five-number summary, and attempt
to draw the box-and-whisker plot. Take logarithms of each data value, and repeat
the exercise. Discuss the effect of the logarithmic transformation.
River          Mean annual    River          Mean annual    River               Mean annual
               runoff                        runoff                             runoff
Kei               1001        Kasuka              5         Groot Brak              29
Quko                41        Kariega            15         Klein Brak              45
Kwenxura            25        Bushmans           38         Hartenbos                5
Kwelera             32        Boknes             13         Gouritz                744
Gqunube             35        Sundays            29         Kafferkuils            141
Buffalo             82        Coega              13         Duiwenhoks             131
Goba                 6        Swartkops          84         Bree                  1893
Gxulu                6        Yellowwoods        45         Heuningnes              78
Ncera                6        Gamtoos           485         Uilskraal               65
Tyolumnqa           25        Kabeljou           27         Kleinriviersvlei        96
Kiwane               2        Seekoei            27         Botriviersvlei         116
Keiskamma          133        Krom              105         Palmiet                310
Cqutywa              2        Klipdrif           35         Wildevoelvlei           38
Bira                 6        Storms             69         Sout                    38
Mgwalana             5        Elandsbos          67         Diep                    43
Mtati                4        Keurbooms         160         Berg                   235
Mpekweni             2        Knysna            110         Verlorevlei            102
Fish               479        Goukamma           44         Wadrifsoutpan           19
Kleinemond           6        Swartvlei          73         Jakkals                 10
Riet                 4        Touw               30         Olifants              1217
Kowie               23        Kaaimans           59         Orange                9344
(Data from Noble and Hemens, S.A. National Scientific Programmes, Report No
34, 1976.)
1.5 This is an exercise in robustness! Calculate the median, interquartile range, range,
sample mean and sample standard deviation for the following 12 data values:
10.8 9.7 14.1 12.3 10.9 8.9 11.7 12.6 11.2 10.5 8.3 131
Which value looks suspiciously like an outlier? Put its decimal point back in
the right place, and recompute the summary statistics. Which of these statistics
change, and which remain the same? For which statistic is the percentage change
the largest? (The percentage change is the difference between the correct and
the biased values, divided by the correct value, multiplied by 100.)
1.6 Heathrow Airport in London is one of the world's busiest airports. The time of
touchdown for planes arriving between 17h30 and 19h30 on 17 October 1991 is
recorded (to the second) in the table below. Compute the inter-arrival times in
seconds and present appropriate summary statistics. What do you think the target
inter-arrival time is? Was there any apparent difference between the first and the
second hour of observation? How frequently did "glitches" (= irregularities) occur?
17h30:07
32:46
34:14
37:13
38:56
40:27
41:37
43:21
44:24
45:50
47:10
48:58
50:03
51:49
52:50
54:32
55:42
17h57:21
?
17h59:36
18h01:24
03:10
04:26
05:47
08:49
10:27
11:24
12:51
15:34
16:52
18:10
19:24?
21:48
24:20
25:41
26:51
18h28:22
18h29:51
31:51
34:04
36:40
37:52
40:41
42:23
43:59
45:20
46:42
48:38
50:44
52:02
53:44
55:05
56:44
58:11
19h00:06
19h01:41
03:33
06:10
07:29
09:25
10:11
12:29
13:41
15:33
17:26
19:14
21:03
22:24
24:15
25:38
26:59
19h29:26
Solutions to exercises
1.1
1959/60
stems   sorted leaves                  count   cum. count
0       045                            3       3
1       1456                           4       7
2       233466788899                   12      19
3       0022244555556789               16      35
4       00011123334555667799           20      55
5       12355899                       8       63
6       03                             2       65
7       6                              1       66

1975/76
stems   sorted leaves                  count   cum. count
0                                      0       0
1       77789                          5       5
2       01228899                       8       13
3       000002223344555556777899999    27      40
4       00012344455555677799           20      60
5       0222                           4       64
6       03                             2       66
7                                      0       66
The five-number summaries are (0, 28, 37.5, 46, 76) for the early period and (17,
30, 37, 45, 63) for the late period.
The box-and-whisker plots, plotted side-by-side are:
[Figure: side-by-side box-and-whisker plots for the two periods; vertical axis Number of words, 0 to 80.]
The medians are similar during both periods, but there appears to be less variability during 1975/76 than during 1959/60: both the range and the interquartile
range are shorter.
1.2
[Figure: histogram of the daily sales; vertical axis Number of days, 0 to 60.]
x̄ = 0.82, s = 1.24, x(m) = 0
No, the median is not much use here: more than half the data values are zero!
Even though the data is very skew, the mean is more interesting than the median.
1.3 (a) The stem-and-leaf plot is
stems   sorted leaves   count   cum. count
1       89              2       2
2       344             3       5
3       0224669         7       12
4       3335            4       16
5       4466            4       20
6       577             3       23
7       4               1       24
8                       0       24
9       7               1       25
Five-number summary (18, 32, 43, 56, 97). The value of 97 is a stray.
[Figure: box-and-whisker plot; vertical axis Time (mins), 0 to 100, with the median (43) marked.]
(b) x̄ = 44.4, s = 19.3. 15 out of 25 observations (60%) lie within one standard
deviation of the mean, i.e. in the interval (25.1, 63.7).
(c) Apart from the stray (which should be investigated), the data show relatively
little skewness to the right. In the context, the interquartile range (24 minutes) is probably wider than desirable, and efforts should be made to make
service times more consistent.
1.4 Five-number summary (2, 13, 38, 105, 9344), strays greater than 239, outliers
greater than 440. On taking logarithms (base 10), the five-number summary is
(0.30, 1.11, 1.58, 2.02, 3.97), strays exceed 2.9, and there are no outliers.
[Figure: two box-and-whisker plots of runoff (m³ × 10⁶), each with the lower quartile, median (38) and upper quartile (105) marked. Left: original scale (0 to 800), with Palmiet (310), Fish (479), Gamtoos (485) and Gouritz (744) plotted individually, and Kei (1001), Olifants (1217), Bree (1893) and Orange (9344) shown beyond the scale. Right: logarithmic scale (5 to 10000), with Kei, Olifants, Bree and Orange plotted individually.]
In the plot on the left, the runoffs from four rivers cannot be plotted to scale.
The inner scale on the plot on the right shows logarithms to base 10. Notice how
dramatic the effect is. Always look at scales on plots to see if data has been
transformed.
1.5 With the outlier (131), x(m) = 11.05, I = 2.35, R = 122.7, x̄ = 21.00, and s = 34.68.
With the outlier corrected (13.1), x(m) = 11.05, I = 2.35, R = 5.8, x̄ = 11.18, and
s = 1.71. The outlier affects the range, standard deviation and sample mean (in
that order).
1.8 The bar graph is most effective if the expenditures are arranged in order:
[Figure: bar graph of the expenditures, arranged in order, for the categories Newspaper, Radio, Magazines, Pamphlets and Miscellaneous; horizontal scale R25 000 to R100 000. Two of the bars are labelled R8 000 (4%) and R5 000 (2%).]
Chapter 2
SET THEORY
KEYWORDS: Set, subset, intersection, union, complement,
empty and universal sets, mutually exclusive sets; pairwise mutually
exclusive and exhaustive sets.
...
Simply because one of Murphy's Laws states that before you can do anything, you
have to do something else. Before we can do statistics we have to do probability
theory, and for that we need some set theory. So here we go.
Definition of sets . . .
We define a set A to be a collection of distinguishable objects or entities. The set
A is determined when we can either (a) list the objects that belong to A or (b) give a
rule by which we can decide whether or not a given object belongs to A.
Example 1A: (a) If we say, "The letters e, f, g belong to the set A", then we write
A = {e, f, g}
(b) If we say, "The set B consists of real numbers between 1 and 10 inclusive", then
we write
B = {x | 1 ≤ x ≤ 10}.
We read this by saying: "The set B consists of all real numbers x such that x is larger
than or equal to 1 but is less than or equal to 10".
Because the object e belongs to the set A we write
e ∈ A
and we say: "e is an element of A". Because e does not belong to B, we write
e ∉ B
Example 2B:
(a) Express in set theory notation: the set U of numbers which have square roots
between 1 and 4.
(b) Write out in full all the elements of the set Z = {(x, y) | x ∈ {1, 2, 3, 4}, y = x²}.
(a) Because the numbers whose square roots lie between 1 and 4 are exactly the numbers between 1 and 16, we write
U = {x | 1 ≤ x ≤ 16}.
(b) Z = {(1, 1), (2, 4), (3, 9), (4, 16)}.
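The two ways of determining a set, by listing its elements or by giving a rule, have close counterparts in Python. The sketch below is our own illustration and not part of the text: a finite set is listed with braces, a set defined by a formula can be built with a set comprehension, and an infinite set such as B can only be represented by its membership rule (here the hypothetical function `in_B`).

```python
# (a) A set given by listing its elements.
A = {"e", "f", "g"}

# (b) A set given by a rule: B = {x | 1 <= x <= 10}.
# B has infinitely many elements, so we store the rule, not the elements.
def in_B(x):
    return 1 <= x <= 10

# The set Z of Example 2B, built from its defining rule.
Z = {(x, x ** 2) for x in {1, 2, 3, 4}}

print("e" in A)              # membership test for a listed set
print(in_B(10), in_B(10.5))  # membership test using the rule
print(sorted(Z))
```

Repeated elements and the order of listing make no difference to a Python set, which matches the mathematical notion of a collection of distinguishable objects.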
Example 3C: Which of the following statements are correct and which are wrong?
(a) {3, 3, 3, 3} = {3}
(b) 6 {5, 6, 7}
(c) C = {1, 0, 1}
(d) F = {x | 4 < f < 5}
(e) {1, 2, 7} = {7, 2, 1, 7}
(f) If H = {2, 4, 6, 8}, J = {1, 2, 3, 4} and K = {2x | x ∈ H}, then K = J
(g) {1} {1, 2, 3}.
Subsets . . .
Suppose we have two sets, G and H, and that every element of G also belongs to
H. Then we say that G is a subset of H and we write $G \subset H$. We can also write
$H \supset G$ and say "H contains G". If not every element of G belongs to H, we
write $G \not\subset H$ and say "G is not a subset of H".
Example 4A: Let G = {1, 3, 5}, H = {1, 3, 5, 9} and J = {1, 2, 3, 4, 5}. Then
clearly $G \subset H$, $H \not\subset J$, $J \supset G$.
Note that the notation $\subset$, $\supset$ for sets is analogous to the notation $\le$, $\ge$ for ordinary
numbers (rather than the notation <, >). The round end of the subset notation tells
you which of the sets is "smaller" (in the same way as the pointed end shows which
of two numbers is smaller).
Our definition of subset has a curious (at first sight) but logical consequence. Because
every element in G belongs to G, we can write $G \subset G$. For numbers, we can write $2 \le 2$.
If $H \subset G$ and $G \subset H$, then, obviously, H = G. For numbers, $x \le 2$ and $x \ge 2$
together imply that x = 2.
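The analogy between the subset notation and the "less than or equal to" notation can be tried out on a computer. The following is a minimal sketch in Python (the choice of language and the variable names are ours, not part of the text), using the sets of Example 4A. Python happens to write the subset test as `<=` and the "contains" test as `>=`.

```python
# Sets from Example 4A.
G = {1, 3, 5}
H = {1, 3, 5, 9}
J = {1, 2, 3, 4, 5}

# "<=" between sets tests "is a subset of"; ">=" tests "contains".
g_in_h = G <= H      # G is a subset of H
h_in_j = H <= J      # False, because 9 belongs to H but not to J
j_has_g = J >= G     # J contains G
g_in_g = G <= G      # every set is a subset of itself

# Two sets that are subsets of each other are equal.
K = {5, 3, 1}
equal = (G <= K) and (K <= G) and (G == K)

print(g_in_h, h_in_j, j_has_g, g_in_g, equal)
```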
Example 5C: Let V = {v | 0 < v < 5}, W = {0, 5}, X = {1, 2, 3, 4}, Y = {2, 4},
$Z = \{x \mid 1 \le x \le 4\}$. Which of the following statements are true, and which are false:
(a) V = W
(b) $Y \subset X$
(c) W V
(d) $Z \supset X$
(e) X = Z
(f) $Z \not\subset V$
(g) Y W
(h) Y Z
Intersections . . .
Suppose that L = {a, b, c} and M = {b, c, d}. Then $L \not\subset M$ and $M \not\subset L$. But if
we consider the set N = {b, c}, then we see that $N \subset L$ and $N \subset M$, and that no other
set of which N is a subset has this property. This leads us to the idea of intersection.
The intersection of any two sets is the set that contains precisely those elements
which belong to both sets. For the sets L, M and N above we write $N = L \cap M$ and
read this "N equals L intersection M". The intersection of two sets M and N can be
thought of as the set containing those elements which belong to both M and to N.
Example 6A: If $P = \{x \mid 0 \le x \le 10\}$ and $Q = \{x \mid 5 < x < 20\}$, find $P \cap Q$. Is
$5 \in P \cap Q$? Is $10 \in P \cap Q$?
Paying careful attention to the endpoints,
$P \cap Q = \{x \mid 5 < x \le 10\}$.
No, $5 \notin P \cap Q$, but, yes, $10 \in P \cap Q$.
Unions . . .
The concept "union" contrasts with the concept "intersection". The union of two sets
A and B is the set that contains the elements that belong to A or to B. Here we use
the word "or" in an inclusive sense: we do not exclude from the union those elements
that belong to both A and B.
If A = {1, 2, 3} and B = {2, 3, 4, 5} then the union of A and B is the set
C = {1, 2, 3, 4, 5}. We write
$C = A \cup B$
and say "C equals A union B".
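Intersections and unions of small finite sets are easy to check by machine. The sketch below is an illustration in Python (our choice of language, not the book's), using the sets L, M, A and B from the two sections above; `&` computes an intersection and `|` a union.

```python
# Sets from the sections on intersections and unions.
L = {"a", "b", "c"}
M = {"b", "c", "d"}
A = {1, 2, 3}
B = {2, 3, 4, 5}

N = L & M   # intersection: elements that belong to both L and M
C = A | B   # union: elements that belong to A or to B (or to both)

print(sorted(N))  # ['b', 'c']
print(sorted(C))  # [1, 2, 3, 4, 5]
```

Note that the element 2, which belongs to both A and B, appears only once in the union: the inclusive "or" does not count it twice.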
Complements . . .
Our final concept from set theory is that of the complement of a set. Given the
sample space S, we define the complement of a set A to be the set of elements of S
which are not in A. The complement of A is written $\overline{A}$, and is always relative to the
sample space S.
If S = {1, 2, 3, 4, 5, 6}, A = {1, 3, 5} and B = {2, 4, 6} then $\overline{A} = \{2, 4, 6\}$.
We write
$\overline{A} = B$
and say "the complement of A equals B" or, more briefly, "A complement equals B".
Example 8A: If $S = \{x \mid 0 \le x \le 1\}$ and $D = \{x \mid 0 < x < 1\}$, find $\overline{D}$.
Because the set D excludes the endpoints of the interval from zero to one, $\overline{D} = \{0, 1\}$.
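Because a complement is always relative to the sample space, a computation has to be told what S is. A minimal Python sketch for the finite example above (the helper name `complement` is our own, introduced only for illustration):

```python
S = {1, 2, 3, 4, 5, 6}   # sample space
A = {1, 3, 5}
B = {2, 4, 6}

def complement(event, sample_space):
    """Return the elements of sample_space that are not in event."""
    return sample_space - event

A_bar = complement(A, S)
print(A_bar == B)                  # True
print(complement(A_bar, S) == A)   # True: the complement of the complement is A
```

The infinite sets of Example 8A cannot be listed in this way; the sketch applies only to sample spaces with finitely many elements.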
Example 9C: If the sample space S contains the letters of the alphabet, i.e. S =
{a, b, c, . . . , x, y, z}, the set A contains the vowels, the set B contains the consonants,
and the set C contains the first 10 letters of the alphabet, C = {a, b, c, . . . , h, i, j}, pick out
the true and false statements in the following list:
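(a) $A \cup B = S$
(b) $A \cap B = \emptyset$
(c) $S \subset S$
(d) $A \cap C = \{a, e, i\}$
(e) A B
(f) $\overline{A} = B$
(g) $S \cap B = B$
(h) $A \cup \overline{A} = S$
(i) $\overline{C} \cap A = \{o, u\}$
(j) (A C) = A C
(k) $A \cap C \subset C$
(l) $\overline{S} = \emptyset$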
Venn diagrams . . .
A pictorial representation of sets that helps us solve many problems in set theory is
known as the Venn diagram. In the diagrams below think of all the points in the
rectangle as being the sample space S, and all the points inside the circles for A and
B as the sets A and B respectively. The shaded area in the diagram on the left then
represents $A \cap B$, the set of points belonging to A and B. Similarly the diagram on the
right is a visual representation of $A \cup B$, the set of points belonging to A or B. Recall
once again the special, inclusive meaning we give to "or". When drawing Venn diagrams
it is helpful to associate "intersection" with "and" and "union" with "or".
[Figure: two Venn diagrams in the sample space S. Left: $A \cap B$ shaded. Right: $A \cup B$ shaded.]
The diagram on the left below shows how to depict two mutually exclusive sets in a
Venn diagram.
Venn diagrams are usually only useful for up to three sets: the area shaded in the
diagram on the right is $A \cap B \cap C$.
[Figure: Venn diagrams in the sample space S. Left: two mutually exclusive sets, $A \cap B = \emptyset$. Right: three sets A, B and C with $A \cap B \cap C$ shaded. Further diagrams show the sets $A_1$, $A_2$ and $A_3$ with regions numbered 1 to 9, and two panels whose labels read (A B) C and (A C) (B C).]
Example 11C: Draw Venn diagrams to show that the following are true:
(a) A B = A (B A)
(b) (A C) (B A) = (A C) (A B)
(c) The sets A B, A C, B A and (A C) form a family of pairwise mutually
exclusive and exhaustive sets.
Example 12C: Draw Venn diagrams to determine which of the following statements
are true.
(a) (A B) = A B
(b) (A B) (A B) A B
(c) (A B) C = (A C) (B C)
(d) (C A) (C B) = (C (A B))
(e) [(A B) C] [(A C) B] = [(A B C) (A B) C] (A B)
(f) If the sets $A_1$, $A_2$, $A_3$ and $A_4$ are pairwise mutually exclusive and exhaustive, and
B is an arbitrary set, then
$B = (A_1 \cap B) \cup (A_2 \cap B) \cup (A_3 \cap B) \cup (A_4 \cap B)$.
Solutions to examples . . .
3C (a), (b), (c) and (e) are correct; (d) should read either F = {x | 4 < x < 5} or
F = {f | 4 < f < 5}. For (f), check that the following statement is correct: if H
and J are as given, and if $K = \{2x \mid x \in J\}$ then K = H. For (g), note that we
never use the $\in$-notation with a set on the left hand side.
5C Only (b) and (d) are true.
9C All are true.
Easy exercises . . .
2.1 Let S be {1, 2, 3, 4, 5, 6}, the set of all possible outcomes when a die is thrown
and the number of dots on the uppermost face recorded. Describe in words the
following sets:
(a) {6}
(b) {1, 2, 3, 4}
(c) {1, 3, 5}
(d) {2, 4, 6}
(e) {5, 6}
(f) {6}
2.2
2.3 Let S denote the set of all companies listed on the Johannesburg Stock Exchange.
Let A = {x | x is in the gold mining sector},
let D = {x | the share price of x is higher now than six months ago}.
(a) A B,
(b) $A \cap D$,
(c) $A \cap C \cap D$,
(d) $B \cap (C \cup A)$,
(e) $\overline{A}$,
(f) $\overline{C} \cup D$,
(g) B C
(h) (B A) (C D).
AA=S
AA=
AB =AB
AB =AB
A (B C) = (A B) (A C)
A (B C) = (A B) (A C)
ABC =S
ABC =
AB AB
(j) A (B C) A (B C)
Draw a series of Venn diagrams representing three sets, and shade in the following
areas.
(a)
(b)
(c)
(d)
ABC
(B A) (A C)
ABC
(A B C) (B C).
and $B_i \cap B_j = \emptyset$ for $i \ne j$.
Solutions to exercises . . .
2.1 (a)
(b)
(c)
(d)
(e)
(f)
(b) Set of gold mining companies whose share price is higher now than six months
ago.
(c) Set of gold mining companies with a financial year ending in June whose share
price is higher now than six months ago.
(d) Set of all companies which have an annual turnover exceeding R10 million
and which are either gold mining companies or companies with financial years
ending in June (or both).
(e) Set of companies not in the gold mining sector.
(f) Set of companies which either do not have a financial year ending in June or
have a share price which is higher now than six months ago.
(g) Set of companies which either do not have an annual turnover exceeding R10
million or do not have a financial year ending in June.
(h) Set of companies which either do not have an annual turnover exceeding R10
million or are not in the gold mining sector or both have a financial year
ending in June and have a share price which is higher now than six months
ago.
(Notice how difficult it is to express unambiguously in words the meaning of a few
mathematical symbols.)
2.4 All are true, except (g) and (h).
2.5 $\emptyset$, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}.
2.9 (c) C and D are mutually exclusive.
Chapter
PROBABILITY THEORY
KEYWORDS:
Random experiments, sample space, events,
elementary events, certain and impossible events, mutually exclusive events, probability, relative frequency, Kolmogorov's
axioms, permutations, combinations, conditional probability,
Bayes' theorem, independent events.
(b) A phone number is chosen at random. The number is dialled, and the person who
answers is asked whether he/she is currently watching television. If the telephone is
unanswered after 45 seconds, the outcome, "no reply", is recorded. The set of possible
outcomes, the sample space, is S = {yes, no, won't say, number engaged, no reply}.
(c) A light bulb is allowed to burn until it burns out. The lifetime of the bulb is
recorded. The possible outcomes are the set of non-negative real numbers (i.e. the
set of positive numbers plus zero, because the bulb might not burn at all). The sample
space is thus $S = \{t \mid t \ge 0\}$.
(d) A die is placed in a shaker, which is agitated violently, and thrown out onto
the table. The dots on the upturned face are counted. The sample space is
S = {1, 2, 3, 4, 5, 6}.
(e) In a survey of traffic passing a particular point on Boulevard East, a time period
of one minute is chosen at random, and the number of vehicles that pass the point
in the minute is counted. The possible outcomes are the non-negative integers:
{0, 1, 2, 3, . . .}.
(f) A geologist takes rock samples in a mine in order to determine the quality of the
iron-ore to be mined. The analytical laboratory reports the proportion of iron in
the ore. The sample space is $S = \{p \mid 0 \le p \le 1\}$.
Events . . .
An event is defined to be any subset of the sample space S. Thus if S =
{1, 2, 3, 4, 5, 6} then the sets A = {1, 3, 5} and C = {3, 4, 5, 6} are events. A is
clearly the event of getting an odd number. C is the event of getting a number greater
than or equal to 3.
The empty set is the set containing no elements. It is often denoted by {} or by $\emptyset$.
By the definition of subsets, $\emptyset$ and S are both subsets of S and are thus events.
They have special names. The event $\emptyset$ is called the impossible event, and S is called
the certain event.
An elementary event is an event with exactly one member. The events D = {3}
and E = {5} are elementary events. A and C are not elementary events, nor are $\emptyset$ and
S.
Given two events A and B we say that A and B are mutually exclusive if $A \cap B = \emptyset$,
that is if they have no elementary events in common, e.g. if A = {1, 3, 5} and
B = {2, 4, 6} then $A \cap B = \emptyset$. This makes sense, because A is the event of getting an
odd number, B is the event of getting an even number and obviously you cannot get
an odd number and an even number on the same throw of a die.
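The definitions in this section translate directly into tests on finite sets. The following Python sketch is an illustration only; the function names `is_event`, `is_elementary` and `mutually_exclusive` are our own labels for the definitions above.

```python
S = {1, 2, 3, 4, 5, 6}   # sample space for one throw of a die
A = {1, 3, 5}            # odd number
B = {2, 4, 6}            # even number
C = {3, 4, 5, 6}         # number greater than or equal to 3
D = {3}
EMPTY = set()            # the impossible event

def is_event(subset, sample_space):
    """An event is any subset of the sample space."""
    return subset <= sample_space

def is_elementary(event):
    """An elementary event has exactly one member."""
    return len(event) == 1

def mutually_exclusive(e1, e2):
    """Two events are mutually exclusive if their intersection is empty."""
    return len(e1 & e2) == 0

print(is_event(EMPTY, S), is_event(S, S))                  # True True
print(is_elementary(D), is_elementary(A))                  # True False
print(mutually_exclusive(A, B), mutually_exclusive(A, C))  # True False
```

A and C are not mutually exclusive because the outcomes 3 and 5 belong to both.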
...
Example 2B: What is the sample space for each of the following random experiments?
(a) A game of squash is played and the score at the end of the first set is noted.
(b) We record the way in which a batsman in cricket ends his innings.
(c) An investor owns two shares which she monitors for a month. At the end of the
month she records whether they went up, down or remained unchanged.
(a) Squash is played to 9 points, with deuce at 8-all, in which case the player who
reached 8 first decides whether to play to 9 points or to 10 points. Thus the sample
space is S = {9-0, 9-1, 9-2, 9-3, 9-4, 9-5, 9-6, 9-7, 9-8, 10-8, 10-9}.
(b) The set of ways in which a batsman's innings can end is given by S = {bowled,
caught, leg before wicket, run out, stumped, hit wicket, not out, retired, retired
hurt, obstruction, timed out}.
(c) It is convenient to let U = "up", D = "down" and N = "no change". Then S = {UU,
UD, UN, DU, DD, DN, NU, ND, NN}, where, for example, DU means "first share
down, second share up".
Notice how we construct the most detailed possible sample space: the set of
outcomes {both up, one up & one down, one up & one unchanged, one down & one
unchanged, both unchanged, both down} is not acceptable because more than one
elementary event gives rise to several of these outcomes. For example, both elementary
events UD and DU give rise to "one up & one down".
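The most detailed sample space for the two shares can be generated mechanically. The sketch below is a Python illustration (not part of the original text); it lists the nine outcomes and then counts how many of them collapse onto each outcome of the coarser, unacceptable description, which ignores which share is which.

```python
from collections import Counter
from itertools import product

moves = ["U", "D", "N"]   # up, down, no change

# Every ordered pair (first share, second share).
S = ["".join(pair) for pair in product(moves, repeat=2)]
print(S)   # ['UU', 'UD', 'UN', 'DU', 'DD', 'DN', 'NU', 'ND', 'NN']

# Coarser description: forget the order of the two shares.
coarse = Counter("".join(sorted(outcome)) for outcome in S)
print(len(coarse))    # 6 coarse outcomes
print(coarse["DU"])   # 2: both UD and DU give "one up & one down"
```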
Example 3C: A random experiment consists of tossing 3 coins of values R1, R2 and
R5 and observing heads and tails. Which of the following is the correct sample space?
(a) S = {3 heads,2 heads,1 head,0 heads}.
(b) S = {3 heads,2 heads 1 tail,1 head 2 tails,3 tails}.
(c) S = {HHH,HHT,HTH,THH,HTT,THT,TTH,TTT} where, for example, HTH
means heads on R1, tails on R2 and heads on R5.
Example 4B: Refer to the random experiments (a) to (c) of Example 2B, and give the
subsets of S that correspond to the following events.
(a)
(c) Suppose now that three outcomes are recorded: sale made (S), sales potential good
(P ), no sale ever likely to be made (N ). List the sample space if two clients are
visited.
Example 6C. A party of five hikers, three males and two females, walk along a mountain trail in single file.
(a) What is the sample space S?
(b) Find the subsets of S that correspond to the events:
U : a female is in the lead
Relative frequencies . . .
To try to get some insight into the concept of probability, consider a random experiment on some sample space S repeated infinitely many times. Let's start by doing
n trials of the random experiment and counting the number of times r that some event
$A \subset S$ occurs during the n trials. Then we define r/n to be the relative frequency of
the event A. Obviously, $0 \le r/n \le 1$. Thus relative frequencies and probabilities both
lie between zero and one.
We can think of the probability of the event A as the relative frequency of A as n,
the number of trials of the random experiment, gets very large. In symbols
$\Pr(A) = \lim_{n \to \infty} \frac{r}{n}$.
If you toss a fair coin, then the probability of heads is equal to the probability of
tails, i.e. Pr(H) = Pr(T ) = 0.5. If you tossed the coin 10 times you might observe
6 heads, a relative frequency of 6/10 = 0.6. But if you tossed it 100 times you might
observe 53 heads, relative frequency 53/100 = 0.53. If you kept going for a few hours
more, and tossed it 1000 times you might observe 512 heads, giving a relative frequency
of 512/1000 = 0.512. As the number of trials increases, the relative frequency
tends to get closer and closer to the true probability.
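Tossing a coin thousands of times is tedious, but a computer can imitate the experiment. The sketch below is an illustration in Python, assuming that the pseudo-random number generator is a reasonable stand-in for a fair coin; the function name `relative_frequency` and the seed value are our own choices. The counts it produces will differ from the 6, 53 and 512 heads quoted above.

```python
import random

def relative_frequency(n, seed=2010):
    """Toss a simulated fair coin n times and return r/n, the relative frequency of heads."""
    rng = random.Random(seed)   # fixed seed so that the run can be repeated
    r = sum(rng.random() < 0.5 for _ in range(n))
    return r / n

for n in (10, 100, 1000, 100000):
    print(n, relative_frequency(n))
```

For small n the relative frequency can be well away from 0.5; for large n it settles close to 0.5, although it need not move closer at every single step.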
Relative
frequency (r/n)
Front row
Front three rows
Whole class
Do the relative frequencies get closer to the true probability as n gets larger?
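$\Pr\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} \Pr(A_i)$.
Proof: The proof is by repeated use of axiom 3. The events $(A_1 \cup A_2 \cup \ldots \cup A_{n-1})$ and
$A_n$ are mutually exclusive. Thus
$\Pr\left(\bigcup_{i=1}^{n} A_i\right) = \Pr\left(\bigcup_{i=1}^{n-1} A_i\right) + \Pr(A_n)$
so that
$\Pr\left(\bigcup_{i=1}^{n} A_i\right) = \Pr\left(\bigcup_{i=1}^{n-2} A_i\right) + \Pr(A_{n-1}) + \Pr(A_n)$.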
$\Pr(A) = \frac{n(A)}{N}$,
where we define the function n(A) to mean the count of the number of elementary events
in A. Clearly, n(S) = N .
Example 11A: Consider tossing a fair die. Then S = {1, 2, 3, 4, 5, 6} and N = 6. Let
A = {1, 3, 5}, the event of getting an odd number. Find Pr(A).
The number of elementary events in A is n(A) = 3. So
$\Pr(A) = \frac{n(A)}{N} = \frac{3}{6} = \frac{1}{2}$.
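The counting definition can be written as a short function. This is a sketch in Python, valid only under the assumption used in Example 11A that all elementary events are equally likely; the function name `pr` is ours, and exact fractions are used so that 3/6 is reported as 1/2.

```python
from fractions import Fraction

def pr(event, sample_space):
    """Pr(A) = n(A) / N, assuming equally likely elementary events."""
    if not event <= sample_space:
        raise ValueError("an event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))

S = {1, 2, 3, 4, 5, 6}   # fair die
A = {1, 3, 5}            # odd number

print(pr(A, S))       # 1/2
print(pr(S, S))       # 1, the certain event
print(pr(set(), S))   # 0, the impossible event
```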
an aisle seat?
a seat in the smoking section?
a window seat in the non-smoking section?
a window seat in row 1?
Permutations of n objects . . .
Recall that a set is just a group of objects, and that the order in which the objects
are listed is irrelevant. We now consider the number of different ways all the objects in
a set may be arranged in order. A set containing n distinguishable objects has
$n(n-1) \cdots 3 \cdot 2 \cdot 1 = n!$ (n factorial)
different orderings of the objects belonging to the set. We can see this by thinking
in terms of having n slots to fill with the n objects in the set. Each slot can hold one
object. We can choose an object for the first slot in n ways; there are then $n - 1$ objects
available for the second slot, so we can select an object for the second slot in $n - 1$ ways,
leaving $n - 2$ objects available for the third slot, . . . , until the last remaining object has
to be placed in the final slot. We say that there are n! distinct arrangements (technically, we
call each arrangement or ordering a permutation) of the n objects in the set.
Example 15A: If the set A = {1, 2, 3}, list all the possible permutations.
There are $3! = 3 \times 2 \times 1 = 6$ distinct arrangements of the objects in A. They are:
123, 132, 213, 231, 312, 321.
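For a small set the permutations can be listed by machine and counted, which gives a check on the n! rule. A minimal Python sketch (an illustration, not part of the original text) for Example 15A:

```python
from itertools import permutations
from math import factorial

A = [1, 2, 3]

# Each permutation is one way of filling the three slots.
orderings = ["".join(str(x) for x in p) for p in permutations(A)]

print(orderings)                              # ['123', '132', '213', '231', '312', '321']
print(len(orderings) == factorial(len(A)))    # True: there are 3! = 6 of them
```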
Thus there are $n!/(n-r)!$ ways of ordering r elements taken from a set containing n
elements using each element at most once. Note that we are (a) choosing r objects and
(b) arranging them. We are here involved in two processes, choosing and arranging. The
number of ways of choosing and arranging r objects out of n distinguishable objects is
called the number of permutations of n objects taken r at a time and is denoted
by $(n)_r$ ("n permutation r"):
$(n)_r = \frac{n!}{(n-r)!}$.
This formula is also valid for r = n if we adopt the convention that 0! = 1.
Example 16A: The focusing mechanism on Ron's camera is bust, so that he can only
take pictures of people at a distance of 2 metres, so he only takes pictures of 3 people at
a time. How many different pictures (a rearrangement of the same people is considered
a different picture) are possible if 10 people are present?
This is the same as asking for the number of permutations of 10 objects taken 3 at
a time, given by
$(10)_3 = \frac{10!}{(10-3)!} = 10 \times 9 \times 8 = 720$.
Example 17A: Suppose (as happened in South Africa in 1994) that 19 political parties
contested an election. One party wanted the ballot papers to have the parties listed in
random order. Another said it was impractical. How many different orderings would
have been possible?
This is equivalent to asking: How many permutations of 19 objects taken 19 at a
time are there? The answer is:
$(19)_{19} = \frac{19!}{(19-19)!} = \frac{19!}{0!} = \frac{19!}{1} \approx 121\,645\,100\,000\,000\,000 = 1.216451 \times 10^{17}$,
roughly 25 million different ballot papers per man, woman and child on planet earth!
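The two permutation counts above can be reproduced exactly. The sketch below is a Python illustration; `n_perm_r` is a hypothetical helper name that follows the formula for $(n)_r$, and `math.perm` is the standard-library function (Python 3.8 or later) that computes the same quantity.

```python
from math import factorial, perm

def n_perm_r(n, r):
    """Number of permutations of n objects taken r at a time: n! / (n - r)!."""
    return factorial(n) // factorial(n - r)

print(n_perm_r(10, 3))    # 720, as in Example 16A
print(perm(10, 3))        # 720 again, from the standard library
print(n_perm_r(19, 19))   # 121645100408832000, the exact value of 19!
```

The exact value shows that the figure quoted in Example 17A is 19! rounded to seven significant digits.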
Example 18A: In how many ways can a 9-man work team be formed from 15 men?
The problem asks only for the number of ways of choosing 9 men out of 15:
$\binom{15}{9} = \frac{15!}{9!\,6!} = 5005$.
Example 19A: How many different bridge hands can be dealt from a pack of 52 playing
cards?
A bridge hand contains 13 cards; what matters is only the group of cards (even
though you might arrange them in a convenient order). Therefore, bridge hands consist
of combinations of 52 objects taken 13 at a time:
$\binom{52}{13} = 635\,013\,559\,600$.
At 15 minutes per bridge game, there are enough different bridge hands to keep you
going for about 20 million years continuously.
Example 20B: From 8 accountants and 5 computer programmers, in how many ways
can one select a committee of
(a) 3 accountants and 2 computer programmers?
(b) 5 people, subject to the condition that the committee contain at least 2 computer
programmers and at least two accountants.
(a) We can choose 3 accountants from 8 in $\binom{8}{3}$ ways. We can choose 2 computer
programmers from 5 in $\binom{5}{2}$ ways. We multiply the results, because for every group
of 3 accountants that we choose, we can choose one of $\binom{5}{2}$ different groups of
computer programmers. Thus we can choose a committee of 3 accountants and 2
computer programmers in
$\binom{8}{3}\binom{5}{2} = 56 \times 10 = 560$ ways.
(b) The total number of ways of forming the committee is the number of ways of
forming a committee of 3 accountants and 2 computer programmers plus the number
of ways of forming a committee of 2 accountants and 3 computer programmers:
$\binom{8}{3}\binom{5}{2} + \binom{8}{2}\binom{5}{3} = 560 + 280 = 840$ ways.
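The arithmetic in Examples 18A to 20B can be checked with the binomial coefficient function in the Python standard library, `math.comb` (Python 3.8 or later). The sketch is an illustration; the variable names are our own.

```python
from math import comb

work_teams = comb(15, 9)      # Example 18A
bridge_hands = comb(52, 13)   # Example 19A

# Example 20B(a): multiply, because every group of accountants can be
# combined with every group of programmers.
part_a = comb(8, 3) * comb(5, 2)

# Example 20B(b): add, because the two committee compositions cannot both occur.
part_b = comb(8, 3) * comb(5, 2) + comb(8, 2) * comb(5, 3)

print(work_teams, bridge_hands)   # 5005 635013559600
print(part_a, part_b)             # 560 840
```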
(a) Because words like BEER, with repeated letters, are permissible, the potential
number of 4 letter words is $26^4 = 456\,976$.
(b) Clearly, because number plates like BBB444 with repetitions are permissible, the
number of possible number plates is $24^3 \times 10^3 = 13\,824\,000$, or nearly 14 million.
Counting rules . . .
The discussion above can be summarized into several counting rules:
Counting Rule 1: The number of distinguishable arrangements of n distinct objects,
not allowing repetitions, is n!.
Counting Rule 2: The number of ways of ordering r objects chosen from n distinct
objects, not allowing repetitions, is
$(n)_r = \frac{n!}{(n-r)!}$.
Counting Rule 3: The number of ways of choosing a set of r objects from n distinct
objects, not allowing repetitions, is
$\binom{n}{r} = \frac{n!}{r!\,(n-r)!}$.
                Without repetition                                       With repetition
Permutations    rules 1 and 2: $n!$ and $(n)_r = \frac{n!}{(n-r)!}$      rule 4: $n^r$
Combinations    rule 3: $\binom{n}{r} = \frac{n!}{(n-r)!\,r!}$           rule 5: $\binom{n+r-1}{r}$
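The four cells of the table can be checked by brute force for small n and r: list every selection and count. The sketch below is a Python illustration using the standard `itertools` module; the values n = 5 and r = 2 are arbitrary choices made only to keep the lists short.

```python
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)
from math import comb, perm

n, r = 5, 2
items = range(n)

counts = {
    "rule 2": len(list(permutations(items, r))),                   # ordered, no repetition
    "rule 4": len(list(product(items, repeat=r))),                 # ordered, repetition allowed
    "rule 3": len(list(combinations(items, r))),                   # unordered, no repetition
    "rule 5": len(list(combinations_with_replacement(items, r))),  # unordered, repetition allowed
}
formulas = {
    "rule 2": perm(n, r),
    "rule 4": n ** r,
    "rule 3": comb(n, r),
    "rule 5": comb(n + r - 1, r),
}

print(counts)               # {'rule 2': 20, 'rule 4': 25, 'rule 3': 10, 'rule 5': 15}
print(counts == formulas)   # True
```

Listing selections is only practical for small n and r; the formulas are what make the large examples in this chapter tractable.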
We add three further useful counting rules which we will state, and leave the proofs
as exercises.
Counting Rule 6: The number of distinguishable arrangements of n items, of which
$n_1$ are of one kind and $n_2 = n - n_1$ are of another kind is
$\frac{n!}{n_1!\,n_2!} = \binom{n}{n_1} = \binom{n}{n_2}$.
Here it is assumed that the $n_1$ items of the first kind are indistinguishable from each
other, and the $n_2$ items of the second kind are indistinguishable from each other.
Before we move onto the final two counting rules, we define a generalization of the
binomial coefficient, known as the multinomial coefficient. We let
$\binom{n}{n_1\ n_2\ \ldots\ n_k} = \frac{n!}{n_1!\,n_2! \cdots n_k!}$,
where $\sum_{i=1}^{k} n_i = n$, and call this the multinomial coefficient. The sum of the numbers
in the bottom row of a multinomial coefficient must be equal to the number at the
top! Notice that in multinomial coefficient notation, the binomial coefficient would
therefore have to be written with two numbers in the bottom row:
$\binom{n}{r} = \binom{n}{r\ \ (n-r)}$.
$\binom{n}{n_1\ n_2\ \ldots\ n_k}$.
$\binom{12}{3\ 4\ 5} = \frac{12!}{3!\,4!\,5!} = 27\,720$.
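The Python standard library has no ready-made multinomial coefficient, so the sketch below defines one from the formula $n!/(n_1!\,n_2! \cdots n_k!)$. The function name `multinomial` is our own, and the top number n is computed as the sum of the group sizes, which enforces the requirement that the bottom row adds up to the number at the top.

```python
from math import comb, factorial

def multinomial(*group_sizes):
    """n! / (n1! n2! ... nk!) where n is the sum of the group sizes."""
    n = sum(group_sizes)
    result = factorial(n)
    for size in group_sizes:
        result //= factorial(size)   # each division is exact
    return result

print(multinomial(3, 4, 5))               # 27720
print(multinomial(9, 6) == comb(15, 9))   # True: two groups give a binomial coefficient
```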
Example 28B: A clothing store has designed a series of seven different advertisements,
labelled A to G. A local newspaper offers a special rate if advertisements are placed on the
first, second and third pages of the next weekend edition.
(a) How many different arrangements of the advertisements are possible, assuming
that the same advertisement is not repeated?
(b) If the marketing manager decides to allocate the advertisements randomly, and
decides not to use the same advertisement more than once, what is the probability
that advertisements A and B appear on the first and second pages respectively?
The advertisements are distinguishable, repetitions are not allowed, arrangements
are important, so the solution requires application of counting rule 2.
(a) The number of arrangements is $(7)_3 = 7!/4! = 210$.
(b) For the third page, one of C, D, E, F or G must be selected. Hence Pr(A on
first and B on second page) = 5/210 = 0.024.
Example 29B: A wealthy investor decides to give four of her 12 investments to her
daughter. Five of her investments are in gold-mining companies, the remaining seven in
various industrial companies. Her daughter is given the opportunity to select the four
companies at random.
(a) How many different sets of companies could the daughter be given?
(b) What is the probability that the daughter gets a poorly diversified portfolio of
investments, with either four gold-mining companies, or four industrial companies?
(a) The 12 companies are distinguishable, repetitions are not possible, and arrangements are irrelevant, so counting rule 3 is appropriate. The number of combinations
of 12 companies taken four at a time is $\binom{12}{4} = 495$.
(b) The number of ways of selecting four gold-mining companies is $\binom{5}{4} = 5$, and
therefore Pr(4 gold-mining companies) = 5/495. The number of ways of selecting
four industrial companies is $\binom{7}{4} = 35$, with probability 35/495. The probability
of an undiversified portfolio is the sum of these two probabilities: (5 + 35)/495 =
0.081, about one chance in 12.
Example 30B: At a party, there are substantial stocks of five brands of beer: Castle,
Lion, Ohlsson's, Black Label and Amstel. One of the party-goers grabs two cans without
looking. How many different combinations of 2 beers are possible?
The brands are distinguishable, repetitions are permitted, but the ordering is unimportant, so counting rule 5 is used. The number of ways of selecting two cans from five
brands allowing repetitions is
$\binom{5+2-1}{2} = \binom{6}{2} = 15$.
Example 31B: A group of 20 people is to travel in three light aircraft seating 4, 6 and
10 people respectively. What is the probability that three friends travel on the same
plane?
The total number of ways of choosing combinations of sizes 4, 6 and 10 from a group
of size 20 is, by counting rule 7, given by
$\binom{20}{4\ 6\ 10} = \frac{20!}{4!\,6!\,10!} = 38\,798\,760$.
If the three friends travel in the four-seater aircraft, the remaining 17 must be split into
groups of sizes 1, 6 and 10. This can be done in $\binom{17}{1\ 6\ 10}$ ways. Similarly, if they travel
in the six-seater, the other 17 must be split into groups of 4, 3 and 10, and if they travel
in the 10-seater, the others must be in groups of sizes 4, 6 and 7. Thus the total number
of ways the three friends can be together is
$\binom{17}{1\ 6\ 10} + \binom{17}{4\ 3\ 10} + \binom{17}{4\ 6\ 7} = 4\,900\,896$ ways.
Thus, Pr(3 friends together) = (4 900 896)/(38 798 760) = 0.1263.
Example 32B: A bridge hand consists of 13 cards dealt from a pack of 52 playing
cards. What is the probability of being dealt a hand containing exactly 5 spades?
The cards are distinguishable, repetitions are not possible, and the arrangement of
the cards is irrelevant (the order in which you are actually dealt the cards does not make
any difference to the hand). So counting rule 3 is the one to use to determine that the
total number of possible hands is $\binom{52}{13}$.
Applying counting rule 3 twice more, the number of ways of drawing 5 spades from
the 13 in the pack and the remaining 8 cards for the hand from the 39 clubs, hearts and
diamonds is $\binom{13}{5}\binom{39}{8}$. Hence
$\Pr(\text{5 spades}) = \frac{\binom{13}{5}\binom{39}{8}}{\binom{52}{13}} = 0.1247$.
Example 33B: If there are r people together, what is the probability that they all
have different birthdays (assuming that leap years don't exist)?
To determine the total number of ways they can have birthdays we use counting
rule 4: the dates are distinguishable, repetitions are possible, and we think in terms of
going through the r people in some order. The total number of ways is $365^r$.
The number of ways they can have different birthdays is given by counting rule 2,
which doesn't allow repetitions: $(365)_r$. So
$\Pr(\text{all different birthdays}) = \frac{(365)_r}{365^r}$.
If r = 23 this probability is 0.493, which is just less than one-half. The probability of
the complementary event, that there is at least one pair of shared birthdays, is therefore
0.507, marginally over 0.5. This means that, on average, in every second group of
23 people there will be shared birthdays.
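The birthday probability is awkward to evaluate by hand but easy by machine. A minimal Python sketch of the formula $(365)_r / 365^r$ follows; the function name `pr_all_different` is our own, and the 365-day year is the same simplifying assumption as in the example.

```python
from math import perm

def pr_all_different(r, days=365):
    """Probability that r people all have different birthdays: (days)_r / days**r."""
    return perm(days, r) / days ** r

p = pr_all_different(23)
print(round(p, 3))       # 0.493
print(round(1 - p, 3))   # 0.507, at least one shared birthday

# Smallest group size for which a shared birthday is more likely than not.
r = 1
while pr_all_different(r) > 0.5:
    r += 1
print(r)                 # 23
```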
Example 34B: Participants in a market research survey are given a set of 16 cards,
each having a picture of a well known car model. The participants are asked to sort the
cards into three piles:
Pile 1: the 3 models rated best of all.
Pile 2: the 5 models rated above average.
Pile 3: the 8 models rated below average.
In how many ways can the three piles be formed?
The 16 cards are distinguishable, but once they are in their pile their ordering is
irrelevant, and repetitions are impossible. So use counting rule 7: the sorting task can
be performed in
$\binom{n}{r_1\ r_2\ r_3} = \binom{16}{3\ 5\ 8} = 720\,720$ ways.
Example 35B: To open a certain bicycle combination lock you have to get all five
digits (between 0 and 9) correct. What is the probability that a thief is successful on
his first combination?
A combination lock is not a combination lock at all; it should be called a permutation lock! You don't only have to hit the right digits, you have to get them in exactly
the right order! Clearly, the probability is $1/10^5 = 0.00001$.
Example 36C: There are 33 candidates for an election to a committee of three. What
is the probability that Jones, Smith and Brown are elected?
Example 37C: A group of eight students fill the front row at Statistics lectures daily.
They decide to keep attending lectures until they have exhausted every possible arrangement in the front row. For how many days will they attend lectures?
Example 38C: A young investor is considering the purchase of a portfolio of three
shares from the "Building and construction" sector of the stock exchange. He chooses
three shares at random from the 25 shares currently listed in this sector.
(a) How many ways can shares be selected for the portfolio?
(b) What is the probability that Everite, Grinaker and L.T.A. (three shares in this
sector) are selected?
(c) What is the probability that Grinaker is one of the selected shares?
Example 39C: A firm of speculative builders has bought three adjoining plots. The
company builds houses in seven styles. It is concerned about the visual appearance
of the houses from the street. So they ask their drafting section to sketch all possible
selections of street views.
(a) How many sketches are required if (i) no repetitions of styles are allowed, and if
(ii) they allow repetitions of styles?
(b) If they choose one sketch at random from those in part (a)(ii), what is the probability that all the houses will be of different styles?
(c) In order to determine the materials required, the quantity surveying department is
concerned only with the three styles which might be built (and not with the plots
on which they are built). How many combinations of styles must they be prepared for if
(i) no repetitions of styles are allowed, and if (ii) repetitions are allowed?
Example 40C: Two new computer codes are being developed to prevent unauthorized
access to classified information. The first consists of six digits (each chosen from 0 to 9);
the second consists of three digits (from 0 to 9) followed by two letters (A to Z, excluding
I and O).
(a) Which code is better at preventing unauthorized access (defined as breaking the
code in one attempt)?
(b) If both codes are implemented, the first followed by the second, what is the probability
of gaining access in a single attempt?
Example 41C: A housewife is asked to rank five brands of washing powder (A, B,
C, D, E) in order of preference. Suppose that she actually has no preference, and her
ordering is arbitrary. What is the probability that
(a) brand A is ranked first?
(b) brand C is ranked first and brand D is ranked second?
Conditional probability . . .
Conditional probabilities provide a method for updating or revising probabilities
in the light of new information. On Monday, the weather forecaster might say the probability of rain on Thursday is 50% (weather forecasters have not heard of Kolmogorov's
axioms, and insist on giving probabilities as percentages!), on Tuesday he might revise
this probability in the light of additional information to 70%, and on Wednesday, with
even more reliable information, he might say 60%. In statistical jargon, we would say
that each forecast was conditional on the information available up to that point in
time.
Example 42A: We draw one card from a pack of 52 cards. The probability that it is
the King of Clubs is 1/52. Suppose now that someone draws the card for you, and tells
you that the card is a club. Now what is the probability that it is the King of Clubs?
Obviously 1/13. We have reduced our sample space from the set of 52 cards to the set
of 13 clubs.
CONDITIONAL PROBABILITY
Let A and B be two events in a sample space S. Then the conditional
probability of the event B given that the event A has occurred, denoted
by Pr(B | A), is
$\Pr(B \mid A) = \Pr(A \cap B)/\Pr(A)$
provided that $\Pr(A) \ne 0$. Pr(B | A) is read "the probability of B given A".
The conditional probability Pr(B | A) may be thought of as a reassessment of the
probability of B given the information that some other event A has occurred.
The event "King of clubs" is a subset of the event "clubs" so the intersection of
these two events is the event "King of clubs". Pr(clubs) is the probability of drawing a
club; there are 13 ways of doing this. Hence Pr(clubs) = 13/52. Therefore
$\Pr(\text{King of clubs} \mid \text{clubs}) = \frac{1/52}{13/52} = 1/13$.
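The card example can be verified by building the pack and applying the definition of conditional probability directly. The sketch below is a Python illustration under the assumption that each of the 52 cards is equally likely to be drawn; the names `pr` and `pr_given` are our own.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = set(product(ranks, suits))   # the sample space of 52 cards

def pr(event):
    """Pr(event) for one card drawn at random from the pack."""
    return Fraction(len(event), len(deck))

def pr_given(b, a):
    """Pr(B | A) = Pr(A and B) / Pr(A), provided Pr(A) is not zero."""
    if pr(a) == 0:
        raise ValueError("cannot condition on an event of probability zero")
    return pr(a & b) / pr(a)

clubs = {card for card in deck if card[1] == "clubs"}
king_of_clubs = {("K", "clubs")}

print(pr(king_of_clubs))                # 1/52
print(pr_given(king_of_clubs, clubs))   # 1/13
```

Conditioning on "clubs" has the same effect as shrinking the sample space from 52 cards to the 13 clubs.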
Example 43C: You, a woman with a medical background, are one of 198 applicants
for an M.B.A. programme of whom 81 will be selected. You hear, along the grapevine,
on good authority that there were 70 woman applicants, of whom 38 were selected.
Assess your probabilities of being accepted before and after you receive the grapevine
information. Use the definition of conditional probability.
Example 44B: Suppose A and B are two events in a sample space, and that Pr(A) =
0.6, Pr(B) = 0.2 and Pr(A | B) = 0.5.
Find
(a) Pr(B | A) (c) Pr(A B)
(b) Pr(A B) (d) Pr(B | A).
In this type of problem a useful first step is always to convert as many conditional
probabilities into absolute probabilities as possible.
From the given information we note that
$$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}, \qquad \text{so} \qquad 0.5 = \frac{\Pr(A \cap B)}{0.2}.$$
Thus
$$\Pr(A \cap B) = 0.1.$$
$$\frac{\Pr(B \cap A) + \Pr(\bar{B} \cap A)}{\Pr(A)} = \frac{\Pr(A)}{\Pr(A)} \quad \text{(by Theorem 2)} \quad = 1$$
Example 46C: Is it possible for events A and B in a sample space to have the following
probabilities?
Pr(A) = 0.5
Pr(B) = 0.8
Pr(A | B) = 0.2.
Example 47C: The probability that a first year student passes Economics if he passes
Statistics is 0.5, the probability that he passes Statistics if he passes Economics is 0.8,
and the (unconditional) probability that he passes Statistics is 0.7. The Statistics results come out first, and the student finds he has failed. What is now the conditional
probability of passing Economics?
Example 48C: Show that for any three events A, B and C in a sample space S
$$\Pr(A \cap B \cap C) = \Pr(A \mid B \cap C)\,\Pr(B \mid C)\,\Pr(C).$$
Bayes' theorem . . .
For any two events A and B there are two conditional probabilities that can be
considered:
$$\Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)} \qquad \text{and} \qquad \Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}.$$
A very useful tool for finding conditional probabilities is Bayes' theorem, which connects
Pr(B | A) with Pr(A | B). It is named in honour of Rev. Thomas Bayes, who did pioneering
work in probability theory in the 1700s.
Bayes' Theorem. If A and B are two events, then
$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\Pr(A)}{\Pr(B \mid A)\Pr(A) + \Pr(B \mid \bar{A})\Pr(\bar{A})}.$$
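Because B is the union of the mutually exclusive events $A \cap B$ and $\bar{A} \cap B$, the definition of conditional probability gives
$$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)} = \frac{\Pr(A \cap B)}{\Pr(A \cap B) + \Pr(\bar{A} \cap B)}.$$
We note that
$$\Pr(A \cap B) = \Pr(B \mid A)\Pr(A)$$
and
$$\Pr(\bar{A} \cap B) = \Pr(B \mid \bar{A})\Pr(\bar{A}).$$
Therefore
$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\Pr(A)}{\Pr(B \mid A)\Pr(A) + \Pr(B \mid \bar{A})\Pr(\bar{A})}.$$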
Example 49B: A television manufacturer cannot produce the full quota of tubes it
requires, so it purchases 20% of its needs from an outside supplier. The quality manager
has determined that 6% of the tubes produced in house are defective, and that 8% of
the purchased tubes are defective. He finds the tube of a randomly selected television
to be defective. What is the probability that the tube was produced by the company?
Let D be the event "tube defective" and
C be the event "produced by the company".
We are given $\Pr(D \mid C) = 0.06$, $\Pr(D \mid \bar{C}) = 0.08$ and $\Pr(C) = 0.8$. We want to
find Pr(C | D). By Bayes' theorem
$$\Pr(C \mid D) = \frac{\Pr(D \mid C)\Pr(C)}{\Pr(D \mid C)\Pr(C) + \Pr(D \mid \bar{C})\Pr(\bar{C})} = \frac{0.06 \times 0.8}{0.06 \times 0.8 + 0.08 \times 0.2} = 0.75.$$
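The two-event form of Bayes' theorem is short enough to wrap in a function. This is a minimal sketch; the function name `bayes_posterior` and its argument names are illustrative choices, not notation from the text. The numbers are those of Example 49B.

```python
def bayes_posterior(prior_a, pr_b_given_a, pr_b_given_not_a):
    """Return Pr(A | B) from Pr(A), Pr(B | A) and Pr(B | not A)."""
    numerator = pr_b_given_a * prior_a
    denominator = numerator + pr_b_given_not_a * (1 - prior_a)
    if denominator == 0:
        raise ValueError("Pr(B) is zero, so Pr(A | B) is undefined")
    return numerator / denominator


# Example 49B: A = "produced by the company", B = "tube defective".
p_company_given_defective = bayes_posterior(
    prior_a=0.8, pr_b_given_a=0.06, pr_b_given_not_a=0.08
)
print(round(p_company_given_defective, 4))  # 0.75
```

Although the company's tubes are less likely to be defective, it makes so many more of them that three out of four defective tubes are still its own.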
Example 50B: You feel ill at night and stumble into the bathroom, grab one of three
bottles in the dark and take a pill. An hour later you feel really ghastly, and you
remember that one of the bottles contains poison and the other two aspirin.
Your handy medical text says that 80% of people who take the poison will show
the same symptoms as you are showing, and that 5% of people taking aspirin will have
them.
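Let B be the event "having the symptoms" and
A be the event "taking the poison".
Then $\bar{A}$ is the event "taking aspirin".
What is the probability that you took the poison given that you have got the symptoms, i.e. what is Pr(A | B)?
$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\Pr(A)}{\Pr(B \mid A)\Pr(A) + \Pr(B \mid \bar{A})\Pr(\bar{A})}.$$
From the medical text,
$$\Pr(B \mid A) = 0.8 \qquad \text{and} \qquad \Pr(B \mid \bar{A}) = 0.05.$$
One of the three bottles contains poison, so
$$\Pr(A) = 1/3 \qquad \text{and} \qquad \Pr(\bar{A}) = 2/3.$$
Thus
$$\Pr(A \mid B) = \frac{0.8 \times 1/3}{0.8 \times 1/3 + 0.05 \times 2/3} = 0.89.$$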
Example 56C: We have been using the "kiddie" version of Bayes' theorem. Prove the
"adult" version. Let $A_1, A_2, \ldots, A_n$ be a set of mutually exclusive and exhaustive events
in S. Let B be any other event. Then
$$\Pr(A_i \mid B) = \frac{\Pr(B \mid A_i)\Pr(A_i)}{\Pr(B \mid A_1)\Pr(A_1) + \Pr(B \mid A_2)\Pr(A_2) + \cdots + \Pr(B \mid A_n)\Pr(A_n)}.$$
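The general formula translates directly into a few lines of code. The sketch below is illustrative only: the function name `posteriors` is an assumption, and the three-machine figures (output shares 0.5, 0.3, 0.2 and defect rates 0.01, 0.02, 0.05) are invented for the demonstration and do not come from the text.

```python
from fractions import Fraction


def posteriors(priors, likelihoods):
    """Return [Pr(A_i | B)] given [Pr(A_i)] and [Pr(B | A_i)].

    The events A_1, ..., A_n must be mutually exclusive and exhaustive,
    so the priors have to add up to one.
    """
    if sum(priors) != 1:
        raise ValueError("priors must sum to one")
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    pr_b = sum(joint)  # the denominator: Pr(B)
    if pr_b == 0:
        raise ValueError("Pr(B) is zero")
    return [j / pr_b for j in joint]


# Hypothetical data: three machines and the event B = "item is defective".
priors = [Fraction("0.5"), Fraction("0.3"), Fraction("0.2")]
likelihoods = [Fraction("0.01"), Fraction("0.02"), Fraction("0.05")]

result = posteriors(priors, likelihoods)
print(result)  # [Fraction(5, 21), Fraction(2, 7), Fraction(10, 21)]
```

Exact fractions are used so that the posterior probabilities visibly add up to one.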
Example 57C: Jim has applied for a bursary for next year. His estimates of the
probabilities of getting each grade of result, and his probabilities of getting the bursary
given each grade are given in the table.
Grade        Pr(getting grade)   Pr(getting bursary | grade)
1st          0.20                0.90
Upper 2nd    0.15                0.75
Lower 2nd    0.50                0.40
3rd          0.10                0.15
Fail         0.05                0.00
You subsequently hear that he was awarded the bursary. What is the probability
(a) that he got a first class pass?
(b) that he failed?
(c) that he got either an upper second or a lower second?
Example 58C: A family has two dogs (Rex and Rover) and a cat called Garfield. None
of them is fond of the postman. If they are outside, the probabilities that Rex, Rover
and Garfield will attack the postman are 30%, 40% and 15%, respectively. Only one is
outside at a time, with probabilities 10%, 20% and 70%, respectively. If the postman is
attacked, what is the probability that Garfield was the culprit?
Independent events . . .
The intuitive feeling is that independent events have no effect upon each other. But
how do we decide whether two events A and B are independent? If the occurrence of
event A has nothing to do with the occurrence of event B, then we expect the conditional
probability of B given A to be the same as the unconditional probability of B:
Pr(B | A) = Pr(B).
The information that event A has occurred does not change the probability of B occurring. If Pr(B | A) = Pr(B), then, using the definition of conditional probability,
$$\frac{\Pr(B \cap A)}{\Pr(A)} = \Pr(B),$$
or,
$$\Pr(A \cap B) = \Pr(A)\Pr(B).$$
This leads us to the definition of independent events.
Events A and B are independent if
$$\Pr(A \cap B) = \Pr(A)\Pr(B).$$
In words, the probability of the intersection of independent events is the product of their
individual probabilities.
The definition can be extended to the independence of a series of events: if events
$A_1, A_2, \ldots, A_n$ are independent, then
$$\Pr(A_1 \cap A_2 \cap \cdots \cap A_n) = \Pr(A_1)\Pr(A_2)\cdots\Pr(A_n).$$
Many students initially have difficulty separating the concepts of events
that are mutually exclusive and events that are independent. It helps (a little) to
realize that "mutually exclusive" is a concept from set theory (chapter 2) and can be
represented on a Venn diagram. But "independence" is a concept in probability theory
(chapter 3), and cannot be represented in a Venn diagram.
Note that independent events with non-zero probabilities are never mutually exclusive. For example, the events
"the gold price goes up today" and "it rains in Cape Town this afternoon" are conceptually independent: they have nothing to do with each other. However, the gold price
might go up today and it might rain in Cape Town this afternoon; the intersection of
these events is not empty, and they are therefore not mutually exclusive. On the other
hand, if you toss a coin, the events "heads" and "tails" are mutually exclusive. Someone
tells you he tossed a coin and got heads. In the light of this information, your assessment of the probability of getting tails is instantly adjusted downwards from 0.5 to
zero! Mutually exclusive events are certainly not independent events!
Try making up your own examples to clear up the difference between the two concepts.
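One way to experiment with the two concepts is to test both definitions on a small sample space. The sketch below assumes two tosses of a fair coin with four equally likely outcomes; the helper names are illustrative and not from the text.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses of a fair coin: HH, HT, TH, TT.
SPACE = {"".join(tosses) for tosses in product("HT", repeat=2)}


def pr(event):
    """Probability of an event (a subset of SPACE) with equally likely outcomes."""
    return Fraction(len(event & SPACE), len(SPACE))


def mutually_exclusive(a, b):
    """Set-theory test: the intersection is empty."""
    return len(a & b) == 0


def independent(a, b):
    """Probability test: Pr(A and B) equals Pr(A) times Pr(B)."""
    return pr(a & b) == pr(a) * pr(b)


first_heads = {o for o in SPACE if o[0] == "H"}
second_heads = {o for o in SPACE if o[1] == "H"}
first_tails = {o for o in SPACE if o[0] == "T"}

# Different tosses: independent, but they can happen together.
print(independent(first_heads, second_heads),
      mutually_exclusive(first_heads, second_heads))  # True False

# Heads and tails on the same toss: mutually exclusive, hence not independent.
print(independent(first_heads, first_tails),
      mutually_exclusive(first_heads, first_tails))   # False True
```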
Example 59A: Let A be the event that a microchip is manufactured perfectly. Let B
be the event that the chip is installed correctly. If Pr(A) = 0.98 and Pr(B) = 0.93 what
is the probability that the installed chip functions perfectly?
We require $\Pr(A \cap B)$. Because manufacture and installation may be considered
independent, we have:
$$\Pr(A \cap B) = \Pr(A)\Pr(B) = 0.98 \times 0.93 = 0.9114.$$
Example 60B: A four-engined plane can land safely even if three engines fail. Each
engine fails, independently of the others, with probability 0.08 during a flight. What is
the probability of making a safe landing?
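Let $A_i$ be the event that engine i fails. Then the event "safe landing" can be written
as $\overline{A_1 \cap A_2 \cap A_3 \cap A_4}$, the complement of the event "all engines fail". Because the engines fail independently,
$$\Pr(\overline{A_1 \cap A_2 \cap A_3 \cap A_4}) = 1 - \Pr(A_1 \cap A_2 \cap A_3 \cap A_4) = 1 - 0.08^4 = 0.99996.$$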
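The complement trick in Example 60B can be confirmed by listing all 16 combinations of engine failures. This is a minimal sketch, assuming independent engines that each fail with probability 0.08 as stated in the example; the variable names are illustrative.

```python
from itertools import product

P_FAIL = 0.08   # probability that any one engine fails
N_ENGINES = 4

# Shortcut: 1 - Pr(all engines fail), using independence for the product.
p_safe = 1 - P_FAIL ** N_ENGINES

# Long way: add up the probabilities of every outcome with a working engine.
p_safe_enumerated = 0.0
for outcome in product([True, False], repeat=N_ENGINES):  # True means "engine fails"
    if all(outcome):
        continue  # all four engines fail: not a safe landing
    prob = 1.0
    for failed in outcome:
        prob *= P_FAIL if failed else 1 - P_FAIL
    p_safe_enumerated += prob

print(round(p_safe, 6))             # 0.999959
print(round(p_safe_enumerated, 6))  # 0.999959
```

The shortcut needs one multiplication chain; the enumeration needs fifteen, which is why the complement is the method of choice for "at least one" problems.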
Example 61C: An orbiting satellite has three panels of solar cells, which function
independently of each other. Each fails during the mission with probability 0.02. What
is the probability that there will be adequate power output during the entire mission if
(a) all three panels must be active?
(b) at least two panels must be active?
Example 62C: The probability that a certain type of air-to-air missile will hit its
target is 0.4. How many missiles must be fired simultaneously if it is desired that the
probability of at least one hit will exceed 0.95?
Example 63C: A test pilot will have to use his ejector seat with probability 0.08. The
probability that the ejector seat works is 0.97. The probability that his parachute opens
is 0.99. Assuming these events to be independent, calculate the probability that the test
pilot survives the flight.
Example 64C: Suppose that a fashion shirt comes in three sizes and five colours. The
three sizes (and the percentage of the population who purchase each size) are: small
(30%), medium (50%), and large (20%). Market research indicates the following colour
preferences: white (6%), blue (26%), green (36%), orange (18%), and red (14%). The
management of a store expects to sell 1000 of these shirts. How many shirts of each size
and colour should they order? Assume independence.
Example 65C: Some financial academics argue that the day-to-day movements of share
prices are statistically independent. Assume, hypothetically, that the share De Beers has
a probability of 0.55 of rising on any given trading day. What is the probability that it
rises on three successive trading days?
Example 66C: The probability that the rand will weaken against the dollar tomorrow
is 0.53. The probability that you will wake up late tomorrow is 0.42.
(a) What is the probability that, tomorrow, the rand weakens against the dollar and
you wake up late?
(b) What is the probability that, tomorrow, the rand weakens against the dollar or
you wake up late?
Example 67C: The probability that you will be able to answer the question in the
examination on Chapter 3 of IntroSTAT is 0.65. The probability that you enter the
numbers into your calculator correctly is 0.94. The probability that your calculator
operates correctly is 0.99. The probability that you copy the answer correctly from your
calculator to your answer book is 0.97. What is the probability that you get the question
correct?
Solutions to examples . . .
3C Alternative (c) is correct, it lists all the elementary events.
5C (a) S = {SS, SN, NS, NN}.
6C (a) S = {MMMFF, MMFMF, MFMMF, FMMMF, FMMFM, FMFMM,
FFMMM, MFMFM, MFFMM, MMFFM}
(b) U = {FMMMF, FMMFM, FMFMM, FFMMM}
V = {FMMFM, FMFMM, FFMMM, MFMFM, MFFMM, MMFFM}
W = {MFMFM}
(c) $\bar{U}$ = {MMMFF, MMFMF, MFMMF, MFMFM, MFFMM, MMFFM}
$U \cap W = \emptyset$
$V \cap W$ = {MFMFM}
$U \cap \bar{V}$ = {FMMMF}
(c) 0.16
(d) 0.004
(c) $\binom{24}{2}/2300 = 0.12$
1
7
(c) (i) 3 = 35 (ii)
= 84
3
40C (a) $1/10^6 = 0.000\,001\,000$ and $1/(10^3 \times 24^2) = 1/576\,000 = 0.000\,001\,736$. Six
digits are better than three digits followed by two letters.
(b) $1/(10^6 \times 10^3 \times 24^2) = 0.1736 \times 10^{-11}$
41C (a) 4!/5! = 1/5
(b) 3!/5! = 1/20
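$\frac{0.1 \times 0.6}{0.1 \times 0.6 + 0.3 \times 0.4} = 0.333$
$\frac{0.7 \times 0.4}{0.7 \times 0.4 + 0.9 \times 0.6} = 0.341$
$\frac{0.85 \times 0.3}{0.85 \times 0.3 + 0.08 \times 0.7} = 0.820$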
57C (b) 0.0
(c) 0.6158
58C $\frac{0.15 \times 0.7}{0.3 \times 0.1 + 0.4 \times 0.2 + 0.15 \times 0.7} = 0.488$
61C (b) $0.9412 + 3 \times 0.02 \times 0.98^2 = 0.9988$
64C
65C $0.55^3 = 0.166$
66C (a) $0.53 \times 0.42 = 0.223$
Exercises . . .
3.1
3.2
A small town has three grocery stores (1, 2 and 3). Four ladies living in this town
each randomly and independently pick a store in which to shop. Give the sample
space of the experiment which consists of the selection of the stores by the ladies.
Then define the events:
A: all the ladies choose Store 1
B: half the ladies choose Store 1 and half choose Store 2
C: all the stores are chosen (by at least one lady).
3.3
Let A, B and C be three arbitrary events. Find expressions for the events
(a) only A occurs
(b) both A and B but not C occur
(c) all three events occur
(d) at least one occurs
(e) at least two occur
(f) exactly one occurs
(g) exactly two occur
(h) no more than two occur
(i) none occur.
3.4
Let A and B be two events defined on a sample space S. Write down an expression
for each of the following events, express their probabilities in terms of Pr(A), Pr(B)
and $\Pr(A \cap B)$, and evaluate their probabilities if Pr(A) = 0.3, Pr(B) = 0.4 and
$\Pr(A \cap B) = 0.2$:
(a) either A or B occurs
(b) both A and B occur
(c) A does not occur
(d) A occurs but B does not occur
(e) neither A nor B occurs
(f) exactly one of A and B occurs.
Pr(A) = 0.7   Pr(B) = 0.9   Pr(C) = 0.3   Pr(A B) = 0.4.
Pr(A) = 0.2   Pr(B) = 0.5   Pr(C) = 0.3   Pr(A B) = 0.25.
Pr(A) = 0.3   Pr(B) = 0.8   Pr(C) = 0.1   Pr(A B) = 0.2.
Pr(A) = 0.3   Pr(B) = 0.7   Pr(C) = 0.8   Pr(A C) = 0.1.
Pr(A) = 0.8   Pr(B) = 0.4   Pr(C) = 0.5   Pr(A C) = 0.7.
3.8
What is the probability that a six-digit telephone number has no repeated digits?
Do not allow the number to start with a zero.
3.9
A motor car manufacturer produces four different models, each with three levels of
luxury, and with five colour options. One example of each combination of model,
luxury level and colour is on display in a parking lot.
(a) How many cars are on display?
(b) An interested client parks his vehicle in the parking lot. A rock from an
explosion at a nearby construction site lands on one of the cars. What is the
probability that it lands on the client's car?
(c) What assumptions did you make in order to do part (b)?
3.10
A pack of cards like the one described in Example 13C is being used by four players
for a game of bridge, so each is dealt a hand of 13 cards. The king, queen and
jack are referred to as "picture cards". Find the probability that a bridge hand
(13 cards) contains
(a) 3 spades, 4 diamonds, 1 heart and 5 clubs
(b) 3 aces and 4 picture cards.
3.11
If Pr(A) = 0.6, Pr(B) = 0.15, and Pr(B | A) = 0.25 find the following probabilities
(a) Pr(B | A) (c) Pr(A B)
(b) Pr(A | B) (d) Pr((A B) (A B)).
The probability that a cancer test will detect the disease in a person who has
cancer is 0.98. The probability that a person who does not have cancer will give a
positive reading on the test is 0.1 (i.e. the test says he has the disease even though
he has not). If 0.1 per cent of the population has cancer, what is the probability
that a person selected at random will in fact have cancer, given that he shows a
positive reading on the cancer test? Comment on your answer.
3.16
The probability that twins are identical is 0.7. Identical twins are always of the
same sex, while non-identical twins are of the same sex with probability 0.5. What
is the probability that twin boys are identical twins?
3.17 The sample space for the response of a voter's attitude towards a political issue has
three elementary events: A1 = {in favour}, A2 = {opposed}, A3 = {undecided}.
Let B be the event that a voter is under 25 years of age. Given the following table
of probabilities, compute the probability that a voter is opposed to the issue, given
that he is under 25.
Event      Probability of Event
A1         0.4
A2         0.5
A3         0.1
B | A1     0.8
B | A2     0.2
B | A3     0.5
3.18 A and B are events such that Pr(A) = 0.6, Pr(B | A) = 0.3, and $\Pr(A \cup B) = 0.72$.
Are A and B independent, mutually exclusive, or both?
3.19
If the probability is 0.001 that a 20-watt bulb will fail a 10-hour test, what is the
probability that a sign constructed of 1000 bulbs will burn for 10 hours
(a) with no bulb failure?
(b) with one bulb failure?
(c) with k bulb failures?
3.20 Show that if events A and B are independent, then the following pairs of events
are also independent.
(a) A and B
(b) A and B.
3.21 The events A, B and C are such that A and B are independent and B and C
are mutually exclusive. Their probabilities are Pr(A) = 0.3, Pr(B) = 0.4, and
Pr(C) = 0.2. Calculate the probabilities of the following events.
(a)
(b)
(c)
(d)
(e)
3.22
Further exercises . . .
3.23
(a) In how many ways can the batting order for a cricket team (11 players) be
arranged?
(b) In how many ways can the team be arranged, given that three specific players
have definite positions in the batting order?
(c) In how many ways can the team be divided up into 2 teams of 5 players each,
and one player left out?
(d) What is the probability that the player left out is the captain of the team?
3.24 If seven diplomats were asked to line up for a group picture with the senior diplomat
in the centre, how many distinguishable arrangements are possible?
3.25 The telephone numbers for the University of Cape Town's Rondebosch campus all
start with 650 followed by a four digit number.
(a) How many different telephone numbers can be accommodated on this campus?
(b) What is the probability that a randomly selected number has its last three
digits (i) 000 (three zeros) (ii) all the same?
3.26 Suppose A and B are events in a sample space.
(a) If $A \cap B = B$, what is the numerical value of Pr(A | B)?
(b) If $A \cap B = \emptyset$, what is Pr(A | B)?
(c) If $A \cup B = A$, what is Pr(A | B)?
3.27 Let A and B be two events defined on a sample space S.
(a) Write down an expression for each of the following events in terms of unions,
intersections and complements, and express their probabilities in terms of
Pr(A), Pr(B) and $\Pr(A \cap B)$.
(i) Both A and B occur.
(ii) At least one of A and B occurs.
(iii) Either A occurs, or B occurs, but not both.
(iv) A occurs but B does not occur.
(v) A occurs, or B does not occur.
(b) Now suppose that Pr(A) = 1/4, Pr(B) = 1/3 and that A and B are independent
events. Evaluate the probabilities in part (a).
3.28
(a) What is the probability of drawing exactly 1 spade in a bridge hand (as
defined in Exercise 3.10)?
(b) What is the probability of drawing at least 1 spade?
(c) What is the probability that a bridge hand contains 3 spades, 7 diamonds, 2
hearts and 1 club?
(d) What is the probability that a bridge hand contains both the ace and the
king of spades?
11
12
(b) Pr(A | B)
(c) Pr(B | A).
3.30
Let A and B be two events in a sample space. Suppose Pr(A) = 0.4 and
$\Pr(A \cup B) = 0.7$. Let Pr(B) = p.
(a) For what value of p are A and B mutually exclusive?
(b) For what value of p are A and B independent?
3.33 A card is drawn from an ordinary pack of 52, looked at and replaced, and the pack
shuffled. How many times should this be done in order to have a 90% chance of
seeing the ace of spades at least once?
3.34
3.35
3.36
For safety reasons, each of 1000 parts in a spacecraft is duplicated. The spacecraft
will fail in its mission if any component and its safety duplicate both fail. Each
component fails (independently of any other component) with probability 0.01.
What is the probability that the mission fails?
3.37 The probability that a B.Sc. student takes neither Statistics nor Chemistry is 0.3
and the probability that he takes Statistics but not Chemistry is 0.2. If B.Sc. students take Statistics and Chemistry independently of each other, what is
(a) the probability that a B.Sc. student takes Statistics?
(b) the probability that a B.Sc. student takes Chemistry but not Statistics?
3.38 Is it possible for events A and B to be defined on a sample space with the following
probabilities?
(a) Pr(A) = 0.5, Pr(B) = 0.8 and Pr(A | B) = 0.2
(b) Pr(A) = 0.5, Pr(A | B) = 0.7, Pr(A B) = 0.3 and Pr(A B) = 0.6.
3.39 (a) If A, B and D are three events in a sample space where $A \cap B = \emptyset$ and
$A \cup B = S$, show that
(i) $\Pr(D) = \Pr(D \mid A)\Pr(A) + \Pr(D \mid B)\Pr(B)$
(ii) $\Pr(A \mid D) = \dfrac{\Pr(A)\Pr(D \mid A)}{\Pr(D)}$.
(b) Two machines are producing the same item. Last week, Machine A produced
40% of the total output, and Machine B the remainder. On average, 10%
of the items produced by Machine A were defective, and 4% of the items
produced by Machine B were defective.
(i) What proportion of last week's entire production was defective?
(ii) If an item selected at random from the combined output produced last
week is found to be defective, what is the probability it came from Machine A?
3.40 The probability of passing Statistics without doing these exercises is 0.1 and 0.8 if
they are done. If 60% of students do these exercises, what is the probability that
a student has not done the exercises if he passes Statistics?
3.41
Which of the following pairs of events would you expect to be independent, which
mutually exclusive and which neither?
(a) studying Economics and being left-handed,
(b) owning a dog and paying vet's bills,
(c) the prices of shares in Anglovaal and Gold Fields (both in the mining house
sector of the Johannesburg Stock Exchange) both rising today,
(d) being a member of the Canoe Club and studying for a B.A.,
(e) buying sugar-free cooldrink and buying a cream doughnut for yourself.
3.42
An X-ray test is used to detect a disease that occurs, initially without any obvious
symptoms, in 3% of the population. The test has the following error rates: 7%
of people who are disease free have a positive reaction and 2% of the people who
have the disease have a negative reaction. A large number of people are screened at
random using the test, and those with a positive reaction are examined further.
(a) What proportion of people who have the disease are correctly diagnosed?
(b) What proportion of people with a positive reaction actually have the disease?
(c) What proportion of people with a negative reaction actually have the disease?
(d) What proportion of the tests conducted give the incorrect diagnosis?
3.44 Prove Counting rule 6; i.e. prove that the number of distinguishable arrangements
of n objects, of which $n_1$ are of type 1 and $n_2$ of type 2, is given by $\binom{n}{n_1}$.
3.45 Prove Counting rule 7; i.e. prove that the number of combinations
of sizes $n_1, n_2, \ldots, n_k$
chosen from a set of n items is given by $\binom{n}{n_1\; n_2\; \ldots\; n_k}$.
3.46 Prove Counting rule 8; i.e. prove that the number of distinguishable arrangements
of n objects, of
which $n_1$ are of type 1, $n_2$ of type 2, . . . , $n_k$ of type k is given by
$\binom{n}{n_1\; n_2\; \ldots\; n_k}$.
Solutions to exercises . . .
3.1 S = {aa, ab, . . . , de, ee}, 25 elementary events. 15 if one does not distinguish
between, for example, ab and ba
3.2 S = {1111, 1112, 1121, . . . , 3333}, 81 elementary events.
A = {1111}. B = {1122, 1212, 1221, 2211, 2121, 2112}
C = {1231, 1232, 1233, 1321, . . . , 3321}, 36 elementary events.
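3.3 (a) $A \cap \overline{(B \cup C)} = A \cap \bar{B} \cap \bar{C}$
(b) $(A \cap B) \cap \bar{C}$.
(c) $A \cap B \cap C$.
(d) $A \cup B \cup C$.
(e) $(A \cap B) \cup (A \cap C) \cup (B \cap C)$.
(f) $(A \cap \overline{(B \cup C)}) \cup (B \cap \overline{(A \cup C)}) \cup (C \cap \overline{(A \cup B)})$.
(g) $(A \cap B \cap \bar{C}) \cup (A \cap C \cap \bar{B}) \cup (B \cap C \cap \bar{A})$.
(h) $\overline{(A \cap B \cap C)}$
(i) $\overline{(A \cup B \cup C)}$.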
3.4 (a) 0.5
(b) 0.2
(c) 0.7
(d) 0.1
(e) 0.5
(f) 0.3.
(b) 1/61
(b) 0.33
(c) 0.7
(d) 0.65.
3.12 0.
3.14 0.8.
3.15 0.5213.
3.16 0.8235.
3.17 0.2128.
3.18 Events A and B are independent.
3.19 (a) 0.3677
(b) 0.3681
(c) $\binom{1000}{k}\, 0.001^k\, 0.999^{1000-k}$.
3.21 (a) 0
(b) 0.58
(c) 0.6
(d) 0
(e) 0.32.
3.22 3
3.23 (a) 11!
(b) 8!
(c) $\binom{11}{5\;5\;1}$
(d) $\binom{10}{5\;5} \big/ \binom{11}{5\;5\;1} = 0.0909$.
3.24 6!
3.25 (b) (ii) $(10 \times 10)/10\,000 = 0.01$
3.26 (b) 0.0
(c) 1.0.
3.27 (b) (ii) 1/2
(iii) 5/12
(iv) 1/6
(v) 3/4.
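3.28 (a) $\binom{13}{1}\binom{39}{12} \big/ \binom{52}{13}$,
(b) $1 - \binom{39}{13} \big/ \binom{52}{13}$
(c) $\binom{13}{3}\binom{13}{7}\binom{13}{2}\binom{13}{1} \big/ \binom{52}{13}$
(d) $\binom{2}{2}\binom{50}{11} \big/ \binom{52}{13}$.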
(b) 2/9
(b) p = 0.5.
(c) 1/2.
(b) 1.0
(certain event).
(b) 2.
for
n > 119.
3.34 (a) 24 75
31 20
73 +
73 /
75
(b) 22 73
+
31 20
24 29 20
24 31 18
24 31 20
8
3.35 (a) (i) 3 10
= 0.022
3 2 = 50400 (ii) 3 3 /50400
11
5
(b) (i) 2 2 2 = 4989600 (ii) 2 2 /4989600
3.36 0.095
(b) 0.3
(ii) 0.625
3.40 0.0769
3.41 (a) and (d) are independent, (b) and (c) are neither, and (e) is mutually exclusive
if you argue that a diet-conscious person won't buy a cream doughnut!
3.42 Let D be the event "having disease", let N be the event "testing negative".
(a) $\Pr(\bar{N} \mid D) = 1 - \Pr(N \mid D) = 1 - 0.02 = 0.98$ (98% of those with the disease
have a positive reaction)
(b) $\Pr(D \mid \bar{N}) = 0.3022$ (30.22%)
(c) $\Pr(D \mid N) = 0.0007$ (0.07%)
(d) "Misdiagnosed" is the event $(N \cap D) \cup (\bar{N} \cap \bar{D})$, the union of two mutually exclusive events.
$\Pr((N \cap D) \cup (\bar{N} \cap \bar{D})) = \Pr(N \mid D)\Pr(D) + \Pr(\bar{N} \mid \bar{D})\Pr(\bar{D})$
$= 0.02 \times 0.03 + 0.07 \times 0.97 = 0.0685$
(6.85%)
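The four answers to Exercise 3.42 can be reproduced from the three numbers given in the question. This is a sketch under the exercise's own assumptions (3% prevalence, 7% false positives, 2% false negatives); the variable names are illustrative.

```python
p_disease = 0.03           # Pr(D)
p_neg_given_disease = 0.02  # Pr(N | D), the false-negative rate
p_pos_given_healthy = 0.07  # Pr(not N | not D), the false-positive rate

# (a) Proportion of diseased people who are correctly diagnosed.
p_pos_given_disease = 1 - p_neg_given_disease

# Total probability of a positive and of a negative reaction.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
p_neg = 1 - p_pos

# (b) and (c) Bayes' theorem for each test outcome.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
p_disease_given_neg = p_neg_given_disease * p_disease / p_neg

# (d) Misdiagnosis: diseased but negative, or healthy but positive.
p_wrong = p_neg_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

print(round(p_pos_given_disease, 4))  # 0.98
print(round(p_disease_given_pos, 4))  # 0.3022
print(round(p_disease_given_neg, 4))  # 0.0007
print(round(p_wrong, 4))              # 0.0685
```

Fewer than one in three positive reactions comes from a person with the disease, because the healthy group is so much larger than the diseased group.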
Chapter
RANDOM VARIABLES
KEYWORDS: Random variables, discrete and continuous random
variables, probability mass functions and probability density functions.
Words or numbers. . .
In Chapter 3, we defined a sample space as the set of all the elementary events that
are possible outcomes of a random experiment. Sometimes, we expressed these elementary events quantitatively (the length of time for which a light bulb lasts, the number of
items purchased by a customer, the proportion of voters who support a particular proposal), and sometimes we used verbal, qualitative descriptions of the elementary events
(for random experiments such as the state of the economy, the sex of an applicant for a
job, the colour of a vehicle ordered by a purchaser).
In order to manipulate the events defined on a sample space mathematically, it
is necessary to attach a numerical value to each elementary event. Frequently, the
elementary events are quantitative, and there is a natural and obvious way to assign
numbers to them: the survival time (in hours) of the light bulb, the count of items
purchased, the number of girls in families of four children.
However, if the elementary events are expressed qualitatively, we have to assign a
number to each elementary event. For example, the economy might be classified as being
in "recession", "stable" or "booming"; we could assign a 1 to the event "recession",
2 to the event "stable" and 3 to the event "booming". An applicant for an advertised
post could be male or female, and we could assign 0 to the event "male applicant" and
1 to the event "female applicant". To repeat the motivation for assigning numbers to
elementary events: it clears the way for us to develop a general mathematical theory
for handling the probabilities of events in a sample space.
Once all the elementary events in a sample space have numerical values assigned
to them, we follow the classic algebraic tradition and let X stand for the numerical
values of the elementary events. We then call X a random variable. X is a variable
because it can take on (or assume) different values. X is a random variable because
the particular value it takes on depends on the outcome of a random experiment.
By convention, statisticians use the capital letters near the end of the alphabet to
denote random variables. Their favourite choice is the letter X.
If the die is unbiased, then Pr[X = 1] = 1/6, i.e. Pr(1) = 1/6. We can write
Pr[X = x] = 1/6, or Pr(x) = 1/6, for x = 1, 2, 3, 4, 5 and 6.
It is important to realize that the definition of a random variable does not imply that
a different numerical value needs to be assigned to each elementary event. In fact, we
often want random variables in which the same value is assigned to different elementary
events. The following four examples illustrate this.
Example 2A: Consider the random experiment that consists of tossing an unbiased
coin 3 times (see also Example 3C of Chapter 3). If the random variable of interest is the
number of heads that occur, then we attach numerical values to the elementary events
as follows:
S = { HHH  HHT  HTH  THH  HTT  THT  TTH  TTT }
X =     3    2    2    2    1    1    1    0
The event X = 1 corresponds to the subset {HTT, THT, TTH} of S, and thus
Pr[X = 1] = 3/8. Also, clearly, Pr(0) = 1/8, Pr(2) = 3/8 and Pr(3) = 1/8.
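The mapping from elementary events to values of X in Example 2A can be generated mechanically. This is a minimal sketch, assuming an unbiased coin so that all eight outcomes are equally likely; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# The eight elementary events: HHH, HHT, ..., TTT.
outcomes = ["".join(tosses) for tosses in product("HT", repeat=3)]

# X assigns to each elementary event its number of heads.
heads_count = Counter(outcome.count("H") for outcome in outcomes)

# Pr[X = x] is the share of elementary events that are mapped to x.
pmf = {x: Fraction(n, len(outcomes)) for x, n in sorted(heads_count.items())}

for x, p in pmf.items():
    print(x, p)
# 0 1/8
# 1 3/8
# 2 3/8
# 3 1/8
```

Several elementary events share the same value of X, which is why the eight outcomes collapse into only four probabilities.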
Example 3B: Suppose you have 5 coins in your pocket (two 5c coins, two 10c coins
and a 50c coin) and you pull out two coins at random for a tip. Let the random variable
X be the amount of the tip. What are the possible values for X, and the probabilities
that X takes on these values?
We denote the coins $5_1$, $5_2$, $10_1$, $10_2$ and 50. The sample space S, and the numerical
values assigned to each elementary event, can be represented as:
the volume of milk that actually goes into a nominally one litre carton
the time that a customer waits in the queue at a fast food outlet
Example 6C: Which of the following are random variables? Which of the random
variables are continuous and which are discrete? Write down the set of values that each
random variable can take on.
(a)
(b)
(c)
(d)
The distinction between discrete and continuous random variables is critical because
we develop different mathematical approaches for the two types of random variable.
(Interestingly, though, in advanced treatments of random variables, the mathematical approach for both types is again unified!) We describe discrete random variables
mathematically using probability mass functions. Continuous random variables are
described by probability density functions. We adopt the convention of using p(x)
to denote a probability mass function and f (x) for a probability density function.
The probability mass function that describes tossing a single die can therefore be written
as
$$p(x) = \begin{cases} 1/6 & x = 1, 2, 3, 4, 5, 6 \\ 0 & \text{all other values of } x. \end{cases}$$
[Figure: plot of p(x) against x]
Example 8B: "Heavily-backed favourite Enforce came through along the inside, but was
overwhelmed by Susan's Dream, quoted at 9-1." Express the anticipated performance of
the filly Susan's Dream in this horse race as a probability mass function.
Let X = 0 describe the event that Susan's Dream loses the race and X = 1 the event
that she wins. The quoted odds of 9-1 mean that the probability of losing is estimated
by the bookmaker as 9 times the probability of winning. Thus Pr[X = 0] = 9/10 and
Pr[X = 1] = 1/10, so that
$$p(x) = \begin{cases} 9/10 & x = 0 \\ 1/10 & x = 1 \\ 0 & \text{all other values of } x. \end{cases}$$
PMF1 is satisfied, because p(x) is non-zero at only two points. Both values of p(x) lie
in the unit interval, so PMF2 is satisfied. The two values of p(x) add to one, so PMF3
is satisfied.
Example 9C: The Minister of Environment Affairs has to decide on a fishing quota for
the forthcoming season. Currently, the biomass of fish is estimated to be 20 m tonnes.
The fish may have a good breeding season (with probability 0.3) and produce 10 m tonnes
of young, or have a bad breeding season and produce only 1 m tonnes. A so-called warm-water event may occur with probability 0.1, and kill 15 m tonnes of fish, otherwise
1 m tonnes of fish will die. Find the probability mass function for X, the biomass of
fish before setting the quota (assuming all events are independent). If the minister bases
his decision using a policy that the biomass must remain 10 m tonnes or more with
probability 0.8, what should his decision be?
Example 10C: The hostile merger bid by Minorco on Consgold in 1989 was, at one
point, considered highly likely to fail by the financial media. They quoted a 12-1 chance
of failure. Express the anticipated outcome of the merger as a probability mass function.
For a discrete random variable X, the probability of the event $a \le X \le b$ is found
by summing the relevant values of the probability mass function:
$$\Pr[a \le X \le b] = \sum_{x=a}^{b} p(x).$$
Be careful in your handling of $\le$ and $<$, and $\ge$ and $>$: if X assumes only
integer values, then
$$\Pr[a < X < b] = \sum_{x=a+1}^{b-1} p(x).$$
Also, if you have to find $\Pr[X \ge b]$, the lower limit of the summation is b, but the upper
limit is the largest value of x for which p(x) is non-zero. You need this information for
the following examples.
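The difference between strict and non-strict inequalities is easy to see when the sums are written out in code. The sketch below uses the probability mass function of a single fair die, p(x) = 1/6; the three function names are illustrative and not from the text.

```python
from fractions import Fraction

# Probability mass function of one roll of an unbiased die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}


def pr_closed(pmf, a, b):
    """Pr[a <= X <= b]: both end points are included in the sum."""
    return sum(p for x, p in pmf.items() if a <= x <= b)


def pr_open(pmf, a, b):
    """Pr[a < X < b]: both end points are left out of the sum."""
    return sum(p for x, p in pmf.items() if a < x < b)


def pr_at_least(pmf, b):
    """Pr[X >= b]: sum from b up to the largest value with non-zero probability."""
    return sum(p for x, p in pmf.items() if x >= b)


print(pr_closed(pmf, 2, 4))  # 1/2  (x = 2, 3, 4)
print(pr_open(pmf, 2, 4))    # 1/6  (x = 3 only)
print(pr_at_least(pmf, 5))   # 1/3  (x = 5, 6)
```

Changing two inequality signs changes the answer from 1/2 to 1/6, so the end points deserve a second look in every problem of this kind.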
Example 11A:
(a) Check that the function
$$p(x) = \begin{cases} x/15 & x = 1, 2, 3, 4, 5 \\ 0 & \text{otherwise} \end{cases}$$
satisfies the conditions for being the probability mass function of some random
variable X. Sketch p(x).
(b) Find Pr[2 X 4].
(c) Find Pr[X 3].
[Figure: bar graph of p(x) against x]
(a)
PMF1: $p(x) \neq 0$ for only five values of x, and p(x) is defined for all values of
x.
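$$p(x) = \begin{cases} \left(\tfrac{1}{2}\right)^x & x = 1, 2, 3, \ldots \\ 0 & \text{otherwise} \end{cases}$$
is a probability mass function.
PMF1: p(x) is defined for all x, and is non-zero on the set of positive integers
{1, 2, 3, . . .}, a countably infinite set¹.
PMF2: p(x) takes on the values $0, \tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \ldots$, all of which lie in the unit interval.
PMF3:
$$\sum_{x=1}^{\infty} p(x) = \sum_{x=1}^{\infty} \left(\tfrac{1}{2}\right)^x = \tfrac{1}{2} + \tfrac{1}{4} + \tfrac{1}{8} + \cdots = \tfrac{1}{2} \Big/ \left(1 - \tfrac{1}{2}\right) = 1.$$
(Recall that the sum to infinity of a geometric progression is given by $a/(1 - r)$. Here
$a = \tfrac{1}{2}$ and $r = \tfrac{1}{2}$.)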
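The convergence of the infinite sum can be watched numerically. This is a small sketch using exact fractions; the function name `partial_sum` is an illustrative choice.

```python
from fractions import Fraction


def p(x):
    """p(x) = (1/2)^x for x = 1, 2, 3, ... and zero otherwise."""
    if isinstance(x, int) and x >= 1:
        return Fraction(1, 2) ** x
    return Fraction(0)


def partial_sum(n):
    """Sum of p(x) for x = 1, ..., n."""
    return sum(p(x) for x in range(1, n + 1))


for n in (1, 2, 3, 10, 20):
    print(n, partial_sum(n), float(1 - partial_sum(n)))
```

After n terms the sum falls short of one by exactly (1/2)^n, so the shortfall halves with every extra term and the total tends to one, as PMF3 requires.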
Example 13C: Find the sample space for the random experiment which consists of
rolling a pair of dice. Find the probability mass function for the random variables X
defined to be sum of the values on the dice and Y defined to be the product of the values.
Find Pr[X 10] and Pr[Y 13].
1
A set is said to be countably infinite if there is an orderly way of setting about counting its members.
The set of integers is a countably infinite set. However, the set $\{x \mid 0 \le x \le 1\}$, the unit interval, is non-countable: no matter what system you use to count the numbers, you always leave out infinitely
many!
Example 14C:
(a) Show that the function
\[ p(x) = \begin{cases} \binom{3}{x}\, 0.5^3 & x = 0, 1, 2, 3 \\ 0 & \text{otherwise} \end{cases} \]
x=1
x=2
x=3
x=4
otherwise
(d)
\[ p(x) = \begin{cases} \dfrac{\binom{4}{x}\binom{3}{2-x}}{\binom{7}{2}} & x = 0, 1, 2 \\ 0 & \text{otherwise} \end{cases} \]
Example 16C:
(a) For what value of k will
\[ p(x) = \begin{cases} \dfrac{k}{x!} & x = 0, 1, 2, 3, 4 \\ 0 & \text{otherwise} \end{cases} \]
Bar graphs. . .
A probability mass function is conveniently plotted by means of a bar graph. This
gives an easily interpretable visual impression of the shape of the distribution of probabilities associated with the random variable. The example demonstrates the method.
Example 18A: Plot a bar graph for the probability mass function
\[ p(x) = \begin{cases} \binom{4}{x}\, 0.6^x\, 0.4^{4-x} & x = 0, 1, 2, 3, 4 \\ 0 & \text{otherwise} \end{cases} \]
of the random variable X.
We compute the following probabilities:
x       0      1      2      3      4
p(x)  0.026  0.154  0.346  0.346  0.130
and plot them as a bar graph. The heights of the lines are equal to the probabilities of
the events X = 0, X = 1, X = 2, X = 3 and X = 4. Naturally, the sum of the heights
of the bars must be equal to one.
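The table of probabilities can be reproduced, and a rough text version of the bar graph drawn, with a few lines of Python. This is only an illustrative sketch: the function name `p` mirrors the notation of the example, and the scale of one `#` per 0.01 of probability is an arbitrary choice made here.

```python
from math import comb

def p(x):
    """p(x) = C(4, x) * 0.6^x * 0.4^(4 - x) for x = 0, ..., 4 and 0 otherwise."""
    if x in range(5):
        return comb(4, x) * 0.6 ** x * 0.4 ** (4 - x)
    return 0.0

probs = {x: p(x) for x in range(5)}

# Print each probability with a bar whose length is proportional to p(x).
for x, px in probs.items():
    bar = "#" * round(px * 100)   # one '#' per 0.01 of probability
    print(f"{x}  {px:.3f}  {bar}")

# The heights of the bars must add up to one.
total = sum(probs.values())
print("sum of heights:", round(total, 10))
```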
[Figure: bar graph of p(x) against x.]
p(x) =
x = 1, 2, 3, 4
otherwise
PDF2: all values of f(x) lie in the interval [0, ∞); that is, 0 ≤ f(x) < ∞.
PDF3: \( \int_{-\infty}^{\infty} f(x)\,dx = 1 \), i.e. the area under the curve of a probability density
function is one.
Frequently, the function f(x) is non-zero only on some interval, say (a, b) (this
interval may also be closed, or one of the limits may be infinity). It is then only
necessary to check PDF3 on this interval: \( \int_a^b f(x)\,dx = 1 \). This is obvious, because then
\[ \int_{-\infty}^{\infty} f(x)\,dx = \int_{-\infty}^{a} 0\,dx + \int_a^b f(x)\,dx + \int_b^{\infty} 0\,dx = \int_a^b f(x)\,dx \]
because f(x) = 0 outside the interval (a, b).
We have seen that probabilities for discrete random variables are found by calculating the values of the probability mass function p(x) at the points of interest and summing
them. However, for continuous random variables, the probability density function f (x)
is constructed in such a way that probabilities of events are found by integration: the
area under the graph between the numbers c and d represents the probability of the
event "the random variable X lies between c and d", i.e.
\[ \Pr(c \le X \le d) = \int_c^d f(x)\,dx. \]
This is illustrated below:
[Figure: graph of f(x) against x, showing the area under the curve between c and d.]
Example 21B: In a certain risky sector of the share market, the proportion of companies that survive (i.e. are not delisted) a year is a continuous random variable lying in
the interval from zero to one. A statistician examines the data collected over past years
and suggests that the function
\[ f(x) = \begin{cases} 20x^3(1 - x) & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases} \]
might be useful in modelling X, the annual proportion of companies that survive.
(a) Check that f (x) is a probability density function.
(b) What is the probability that between 30% and 50% of the companies survive a
year?
(c) What is the probability that less than 10% of the companies survive a year?
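(a) PDF1 (f(x) is defined for all x) and PDF2 (f(x) ≥ 0) are satisfied. To check
PDF3,
\[ \int_0^1 20x^3(1 - x)\,dx = \int_0^1 \left(20x^3 - 20x^4\right)dx = \left[5x^4 - 4x^5\right]_0^1 = 1, \]
as required.
(b) The probability that between 30% and 50% survive is
\[ \Pr[0.3 < X < 0.5] = \int_{0.3}^{0.5} 20x^3(1 - x)\,dx. \]
(c) The probability that less than 10% survive is
\[ \Pr[X < 0.1] = \int_{0}^{0.1} 20x^3(1 - x)\,dx. \]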
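The integrals in Example 21B can be evaluated with a short Python sketch. It is offered as an illustration only: the helper names (`antiderivative`, `prob`, `midpoint_rule`) are assumptions introduced here, and the midpoint rule is included merely as an independent numerical check on the antiderivative 5x^4 − 4x^5 found in part (a).

```python
def f(x):
    """Density from Example 21B: f(x) = 20x^3(1 - x) on [0, 1], 0 elsewhere."""
    return 20 * x ** 3 * (1 - x) if 0 <= x <= 1 else 0.0

def antiderivative(x):
    """5x^4 - 4x^5, the antiderivative used in part (a)."""
    return 5 * x ** 4 - 4 * x ** 5

def prob(c, d):
    """Pr[c <= X <= d] for 0 <= c <= d <= 1, from the antiderivative."""
    return antiderivative(d) - antiderivative(c)

def midpoint_rule(g, a, b, n=10_000):
    """Approximate the integral of g over [a, b] with n midpoint rectangles."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

total_area = prob(0, 1)          # PDF3: area under the whole curve
p_30_to_50 = prob(0.3, 0.5)      # part (b)
p_below_10 = prob(0, 0.1)        # part (c)
numeric_check = midpoint_rule(f, 0.3, 0.5)

print(total_area, round(p_30_to_50, 5), round(p_below_10, 5), round(numeric_check, 5))
```

Under this model, part (b) evaluates to about 0.157 and part (c) to about 0.0005.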
Example 22B: A remote country service station is supplied with petrol once a week.
The weekly demand for petrol (measured in 1000s of litres) is a random variable with
probability density function
\[ f(x) = \begin{cases} \dfrac{1}{2500}(10 - x)^3 & 0 \le x \le 10 \\ 0 & \text{otherwise} \end{cases} \]
(b) What is the probability that between 3 and 5 thousand litres of petrol are sold in
a week?
(c) If less than 2 thousand litres are sold in a week, the petrol company does not
bother to deliver a supply. What is the probability of this event?
(d) If the service station has a 7 thousand litre tank, what is the probability that it
runs out of petrol in a week, assuming that it started the week full?
(e) What size tank is required in order to be 98% certain that weekly demand can be
met?
(f) What is the probability of selling exactly 5 thousand litres in a week?
(a) Checking the three conditions:
PDF1: f(x) is defined for all x.
PDF2: 10 − x is positive for x in the interval [0, 10], hence f(x) ≥ 0.
PDF3:
\[ \frac{1}{2500}\int_0^{10} (10 - x)^3\,dx = \frac{1}{2500}\left[-\frac{1}{4}(10 - x)^4\right]_0^{10} = \frac{1}{2500}\cdot\frac{1}{4}\cdot 10^4 = 1 \]
(b) The probability that sales lie between 3000 and 5000 litres is
\[ \Pr[3 \le X \le 5] = \frac{1}{2500}\int_3^5 (10 - x)^3\,dx = \frac{1}{2500}\left[-\frac{1}{4}(10 - x)^4\right]_3^5 = -\frac{1}{2500}\cdot\frac{1}{4}\left(5^4 - 7^4\right) = 0.1776 \]
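(c) The probability that sales are less than 2000 litres is
\[ \Pr[0 \le X \le 2] = \frac{1}{2500}\left[-\frac{1}{4}(10 - x)^4\right]_0^2 = -\frac{1}{2500}\cdot\frac{1}{4}\left(8^4 - 10^4\right) = 0.5904 \]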
(This is not true of discrete random variables, where one has to be alert to the type of
inequality.)
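The parts of Example 22B can all be answered from the cumulative probability Pr[X ≤ x] = 1 − (10 − x)^4 / 10^4, which follows from the antiderivative used in part (a). The Python sketch below is an illustration under that assumption; the function name `cdf` and the variable names are introduced here and are not from the text, and the values for parts (d) and (e) are computed by the sketch rather than quoted from the original solutions.

```python
def cdf(x):
    """Pr[X <= x] for weekly demand X (in 1000s of litres) in Example 22B."""
    if x <= 0:
        return 0.0
    if x >= 10:
        return 1.0
    return 1.0 - (10.0 - x) ** 4 / 10_000.0

p_b = cdf(5) - cdf(3)        # (b) between 3 and 5 thousand litres
p_c = cdf(2)                 # (c) less than 2 thousand litres
p_d = 1.0 - cdf(7)           # (d) demand exceeds a 7 thousand litre tank
p_f = cdf(5) - cdf(5)        # (f) exactly 5 thousand litres: zero for a continuous X

# (e) Tank size c with Pr[X <= c] = 0.98, i.e. (10 - c)^4 / 10^4 = 0.02.
tank = 10.0 - (10_000.0 * 0.02) ** 0.25

print(round(p_b, 4), round(p_c, 4), round(p_d, 4), p_f, round(tank, 3))
```

Because X is continuous, Pr[3 ≤ X ≤ 5] and Pr[3 < X < 5] are the same number, which is why the sketch does not distinguish between them.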
Examples 21B and 22B showed how a random variable and a probability density
function could be used to model a practical problem. The particular probability
density functions that we used were chosen to make the integration trivial, and would
certainly be poor representations of reality in both situations. In the next chapter we
will be considering various probability mass functions and probability density functions
which have proved themselves useful in practice as models of real-world phenomena.
Example 23B:
(a) Could the following function serve as a probability density function for some random variable X?
\[ f(x) = \begin{cases} 6x(1 - x) & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases} \]
[Figure: graph of f(x) = 6x(1 − x) against x.]
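PDF3: We need to check that the area under the curve between 0 and 1 is
equal to 1:
\[ \int_{-\infty}^{\infty} f(x)\,dx = \int_0^1 6x(1 - x)\,dx = 6\int_0^1 \left(x - x^2\right)dx = 6\left[\frac{1}{2}x^2 - \frac{1}{3}x^3\right]_0^1 = 6\left(\frac{1}{2} - \frac{1}{3}\right) = 1. \]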
Note that there is no requirement for the graph of a probability density function f (x)
to be a smooth curve such as this one:
[Figure: graph of a smooth probability density function f(x) for x between 0 and 1.]
In fact, a great variety of shapes are possible. The only restrictions are that f(x) must be
non-negative, and that the area under the curve must be equal to one. It is important to grasp
that the actual values of f (x) (the height of the curve at value x) cannot be interpreted
as being the probability that the random variable X is equal to x. This interpretation was
possible with graphs of the probability mass functions of discrete random variables. For
continuous random variables, probabilities are computed by integrating the probability
density function.
[Figures: graphs of several further probability density functions f(x) for x between 0 and 1, illustrating the variety of possible shapes.]
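\[ f(x) = \begin{cases} k e^{-\frac{1}{2}x} & 0 \le x < \infty \\ 0 & \text{otherwise} \end{cases} \]
What value must k assume?
To make f(x) a density function, k must be chosen so that
\[ \int_{-\infty}^{\infty} f(x)\,dx = 1 \]
i.e.
\[ \int_{-\infty}^{\infty} f(x)\,dx = \int_0^{\infty} k e^{-\frac{1}{2}x}\,dx = \left[-2k e^{-\frac{1}{2}x}\right]_0^{\infty} = 2k = 1 \]
Thus k = 1/2.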
A selection of examples. . .
Example 25C:
(a) Find the value of k, so that the function
\[ f(x) = \begin{cases} k(x^2 - 1) & 1 \le x \le 3 \\ 0 & \text{otherwise} \end{cases} \]
may serve as a probability density function.
(b) Find the probability that X lies between 2 and 3.
Example 26C: Verify that each of the following functions satisfies the conditions for
being either a probability mass function or probability density function.
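(a)
\[ p(x) = \begin{cases} x/6 & x = 1, 2, 3 \\ 0 & \text{otherwise} \end{cases} \]
(b)
\[ p(x) = \begin{cases} \binom{4}{x}\left(\tfrac{1}{2}\right)^4 & x = 0, 1, 2, 3, 4 \\ 0 & \text{otherwise} \end{cases} \]
(c)
\[ f(x) = \begin{cases} 1 & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases} \]
(d)
\[ f(x) = \begin{cases} |x| & -1 \le x \le 1 \\ 0 & \text{otherwise} \end{cases} \]
(e)
\[ f(x) = \begin{cases} -\log x & 0 < x \le 1 \\ 0 & \text{otherwise} \end{cases} \]
(f)
\[ p(x) = \begin{cases} e^{-1}/x! & x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases} \]
(g)
\[ p(x) = \begin{cases} 1/n & x = 1, 2, 3, \ldots, n \\ 0 & \text{otherwise} \end{cases} \]
(h)
\[ f(x) = \begin{cases} \tfrac{1}{2}\sin x & 0 \le x \le \pi \\ 0 & \text{otherwise} \end{cases} \]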
Example 27C: The probability density function of a random variable X is given by
\[ f(x) = \begin{cases} kx(1 - x^2) & 0 \le x \le 1 \\ 0 & \text{elsewhere} \end{cases} \]
(a) Show that the value of k must be 4.
(b) Calculate Pr[0 < X < 1/2].
(c) Find the value of A so that Pr[0 < X < A] = 1/2.
Example 28C: For what values of A can p(x) be a probability mass function?
\[ p(x) = \begin{cases} (1 - A)/4 & x = 0 \\ (1 + A)/2 & x = 1 \\ (1 - A)/4 & x = 2 \\ 0 & \text{otherwise} \end{cases} \]
Example 29C: A small pool building company is equally likely to be able to complete
2 or 3 pool contracts each month. The company receives between 1 and 4 contracts
to build pools each month, with probabilities Pr(1) = 0.1, Pr(2) = 0.2, Pr(3) = 0.5,
Pr(4) = 0.2. At the beginning of this month the company has two contracts carried
forward from the previous month. The random variable X of interest is the number of
contracts to be carried forward to next month. Find the probability mass function of X.
In particular, what is the probability that no contracts will be carried forward to next
month? Assume that the number of contracts is independent of the number of pools
completed. Also, to simplify the problem, assume that the contracts for a month are
made at the beginning of the month.
Solutions to examples. . .
5C The sample space, numerical values for the elementary events and their associated
probabilities are:
S  = {  EE    EB    EL    BB    BE    BL    LL    LE    LB  }
X  =    4000  3000  2000  2000  3000  1000  0     2000  1000
Pr =    0.04  0.06  0.10  0.09  0.06  0.15  0.25  0.10  0.15
The probability mass function is therefore given by
\[ p(x) = \begin{cases} 0.25 & x = 0 \\ 0.30 & x = 1000 \\ 0.29 & x = 2000 \\ 0.12 & x = 3000 \\ 0.04 & x = 4000 \\ 0 & \text{otherwise} \end{cases} \]
6C (b) & (f) are not random variables, (a), (d) & (g) are discrete, (c) & (e) are
continuous.
9C
\[ p(x) = \begin{cases} 0.07 & x = 6 \\ 0.03 & x = 15 \\ 0.63 & x = 20 \\ 0.27 & x = 29 \\ 0 & \text{otherwise} \end{cases} \]
Y     =   1     2     3     4     5     6     8     9
Pr(Y) = 1/36  2/36  2/36  3/36  2/36  4/36  2/36  1/36
Y     =  10    12    15    16    18    20    24    25    30    36
Pr(Y) = 2/36  4/36  2/36  1/36  2/36  2/36  2/36  1/36  2/36  1/36
Pr[Y ≥ 13] = 13/36.
15C (a), (c) and (d) are probability mass functions, but (b) is not, because
p(1) = −0.2 < 0.
16C (a) k = 24/65
(b) 48/65
26C (a), (b), (f) and (g) are probability mass functions, (c), (d), (e) and (h) are probability density functions.
27C (b) 7/16
(c) A = 0.5412
x=0
x=1
x=2
x=3
x=4
otherwise
Exercises. . .
4.1
Which of the following random variables are discrete, and which are continuous?
(a)
(b)
(c)
(d)
(e)
(f)
(g)
(h)
(i)
(j)
4.2
Check which of the following functions can serve as probability mass functions or
probability density functions.
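(c)
\[ p(x) = x \qquad x = \tfrac{1}{16},\ \tfrac{3}{16},\ \tfrac{1}{4},\ \tfrac{1}{2} \]
(d)
\[ f(x) = \begin{cases} 2x/3 & -1 < x < 2 \\ 0 & \text{otherwise} \end{cases} \]
(e)
\[ f(x) = \begin{cases} \tfrac{1}{4} & 3 < x < 7 \\ 0 & \text{otherwise} \end{cases} \]
(f)
\[ p(x) = \begin{cases} \dfrac{x}{\tfrac{1}{2}n(n+1)} & x = 1, 2, \ldots, n \\ 0 & \text{otherwise} \end{cases} \]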
4.3 Show that the following functions are probability mass functions.
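(a)
\[ p(x) = \begin{cases} \dfrac{e^{-2}\, 2^x}{x!} & x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases} \]
(b)
\[ p(x) = \begin{cases} \binom{5}{x}\left(\tfrac{1}{4}\right)^x\left(\tfrac{3}{4}\right)^{5-x} & x = 0, 1, 2, 3, 4, 5 \\ 0 & \text{otherwise} \end{cases} \]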
4.4 Show that the following functions are probability density functions.
(a)
(b)
\[ f(x) = \begin{cases} \tfrac{1}{4}\, x e^{-\frac{1}{2}x} & 0 \le x < \infty \\ 0 & \text{otherwise} \end{cases} \]
4.5
What must the value of k be so that the following functions are probability density
functions?
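(a)
\[ f(x) = \begin{cases} kx^2(1 - x) & 0 < x < 1 \\ 0 & \text{otherwise} \end{cases} \]
(b)
\[ f(x) = \begin{cases} k e^{-4x} & 0 \le x < \infty \\ 0 & \text{otherwise} \end{cases} \]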
The probability density function of the life in hours X of a certain kind of radio
tube is found to be
\[ f(x) = \begin{cases} 100/x^2 & x > 100 \\ 0 & \text{otherwise} \end{cases} \]
Three such tubes are bought for a radio set. What is the probability that none
will have to be replaced during the first 150 hours of operation?
Further exercises. . .
4.12 A continuous random variable X has probability density function
\[ f(x) = \begin{cases} k & a \le x \le b \\ 0 & \text{otherwise} \end{cases} \]
for arbitrary constants a and b. Find the value of k.
4.13 Find values for c so that the following functions may serve as probability density
functions:
1
2
0xc
otherwise
4.15
4
2x /36
x = 0, 1, 2
elsewhere
Solutions to exercises. . .
4.1 (b) (c) (d) (g) and (h) are discrete
(a) (e) (f) and (i) are continuous.
(j) is an unusual example of a mixed continuous and discrete random variable:
although the random variable is, at face value, continuous, it cannot be modelled
by a conventional probability density function because the probability of no rain
in a day is not zero but positive. The probability function for X needs to be
something like
\[ p(x) = \begin{cases} p & x = 0 \\ f(x) & x > 0 \\ 0 & \text{otherwise} \end{cases} \]
with the probability density function f(x) integrating to 1 − p.
4.2 (a) (b) (e) and (f) satisfy conditions.
For (c), p(x) is not defined for all x.
For (d), f(x) < 0 for −1 < x < 0.
4.5 (a) k = 12
(b) k = 4
4.11 0.9850
4.12 k = 1/(b − a)
4.13 (a) c = e1
4.14 (a) k = 2
(b) c = 1
(b)
1
2
Chapter
PROBABILITY DISTRIBUTIONS I:
THE BINOMIAL, POISSON,
EXPONENTIAL AND NORMAL
DISTRIBUTIONS
KEYWORDS: Binomial, Poisson, exponential and normal distributions.
A number of probability mass and density functions have proved themselves useful as
models for a large variety of practical problems in business and elsewhere. We consider
four of the most frequently encountered probability distributions in this chapter: the
Binomial, Poisson, Exponential and Normal Distributions.
Let the events A2 , A3 represent other permutations of 2 successes and 4 failures, e.g.
A2 = F S S F F F
A3 = F F S S F F
Recall that the probability of the intersection of independent events is the product of
the individual probabilities, so that
\[ \Pr(A_1) = \Pr(A_2) = \cdots = \Pr(A_{15}) = 0.8^4 \times 0.2^2 \]
Thirdly, the events A1, A2, . . . , A15 are mutually exclusive: no client can simultaneously both purchase and refuse to purchase! Thus
\[ \Pr[X = 2] = \Pr[A_1 \cup A_2 \cup \cdots \cup A_{15}] = \Pr(A_1) + \Pr(A_2) + \cdots + \Pr(A_{15}) = 15 \times 0.2^2 \times 0.8^4 = \binom{6}{2}\, 0.2^2\, 0.8^4 \]
Stop a while and convince yourself that the answer \( \binom{6}{2}\, 0.2^2\, 0.8^4 \) obtained from first
principles is the same as that obtained by substituting n = 6, p = 0.2 and x = 2 into
the formula for the binomial probability mass function.
Try computing the remaining probabilities from first principles, and compare them
with the results obtained from the formula. The probabilities are given in the table
below:
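\[ \begin{array}{cl}
x & p(x) = \Pr[X = x] \\
0 & \binom{6}{0}\, 0.8^6 = 0.2621 \\
1 & \binom{6}{1}\, 0.2^1\, 0.8^5 = 0.3932 \\
2 & \binom{6}{2}\, 0.2^2\, 0.8^4 = 0.2458 \\
3 & \binom{6}{3}\, 0.2^3\, 0.8^3 = 0.0819 \\
4 & \binom{6}{4}\, 0.2^4\, 0.8^2 = 0.0154 \\
5 & \binom{6}{5}\, 0.2^5\, 0.8^1 = 0.0015 \\
6 & \binom{6}{6}\, 0.2^6 = 0.0001
\end{array} \]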
The probability that all six clients purchase the product is very small (0.0001) but
will occasionally occur (we expect it roughly once in every 10 000 times that a session of
6 calls is made!). The probability that none of the 6 clients purchases is 0.2621, so that
in approximately a quarter of sessions of 6 calls no purchases are made. The probability
that two or more purchases are made is Pr[X ≥ 2] = 0.2458 + 0.0819 + 0.0154 + 0.0015 +
0.0001 = 0.3447, so that in approximately one-third of sessions of 6 calls the salesperson
achieves two or more sales.
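The binomial probabilities for n = 6 and p = 0.2 can be recomputed with a short Python sketch. The function name `binomial_pmf` is an assumption made for this illustration; `math.comb` supplies the binomial coefficient.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Pr[X = x] for X ~ B(n, p): C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 6, 0.2
table = [binomial_pmf(x, n, p) for x in range(n + 1)]

for x, px in enumerate(table):
    print(x, f"{px:.4f}")

# Probability of two or more sales in a session of six calls.
p_two_or_more = sum(table[2:])
print("Pr[X >= 2] =", round(p_two_or_more, 4))
```

The same total is obtained more quickly from the complement, 1 − p(0) − p(1).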
Example 2B: What is the probability of a contractor being awarded only one out of
five contracts? Assume that the probability of being awarded a contract is 0.5.
Let "success" = "awarded a contract". Pr(success) = p = 1/2. So q = 1 − p = 1/2. We
have n = 5 trials. Let X be the number of successes in 5 trials. Then X ∼ B(5, 1/2).
\[ \Pr[X = x] = p(x) = \binom{5}{x}\left(\tfrac{1}{2}\right)^x\left(\tfrac{1}{2}\right)^{5-x} \qquad x = 0, 1, \ldots, 5 \]
So
\[ \Pr[X = 1] = p(1) = \binom{5}{1}\left(\tfrac{1}{2}\right)^5 = 5/32. \]
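\[ \begin{aligned}
\sum_{x=0}^{n} \binom{n}{x} p^x (1 - p)^{n-x} &= \sum_{x=0}^{n} \binom{n}{x} p^x q^{n-x} \qquad (\text{because } q = 1 - p) \\
&= \binom{n}{0} p^0 q^n + \binom{n}{1} p^1 q^{n-1} + \cdots + \binom{n}{x} p^x q^{n-x} + \cdots + \binom{n}{n} p^n q^0 \\
&= (p + q)^n
\end{aligned} \]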
[Figure: bar graphs of binomial probability mass functions p(x) against x, including X ∼ B(15, 0.3) and X ∼ B(15, 0.8).]
\[ \binom{12}{2}\, 0.10^2\, 0.90^{10} = 0.2301 \]
\[ \binom{12}{0}\, 0.10^0\, 0.90^{12} = 0.2824 \]
(b) If a sample of 20 components (instead of 10) were tested, and the consignment
rejected if two or more proved defective, calculate the probabilities of rejecting a
consignment for the same proportions of defective components.
(c) Which quality control procedure do you think is the better?
\[ p(x) = \begin{cases} \dfrac{e^{-\lambda}\lambda^x}{x!} & x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases} \]
The bar graphs below show the shape of the Poisson distribution for two values of λ.
[Figure: bar graphs of p(x) against x for the Poisson distributions X ∼ P(3) and X ∼ P(8).]
Example 6A: We have a large fleet of delivery trucks. On average we have 12 breakdowns per 5-day working week. Each day we keep two trucks on standby. What is the
probability that on any day
(a) no standby trucks are needed?
(b) the number of standby trucks is inadequate?
Let the random variable X be the number of trucks that break down in a given day.
Because we are dealing with breakdowns, it is reasonable to assume that they occur at
random and that the Poisson distribution is a realistic model.
Because we are interested in breakdowns per day, we need to convert the given weekly
rate into a daily rate. 12 breakdowns per 5 days is equivalent to 12/5 = 2.4 breakdowns
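per day. Thus we assume that X has the Poisson distribution with parameter λ = 2.4,
i.e. X ∼ P(2.4). Hence
\[ \Pr[X = x] = p(x) = \frac{e^{-2.4}\, 2.4^x}{x!} \]
(a) \( \Pr(\text{no breakdowns}) = \Pr[X = 0] = p(0) = \dfrac{e^{-2.4}\, 2.4^0}{0!} = 0.0907 \)
(b) \( \Pr(\text{more than two breakdowns}) = \Pr[X > 2] = 1 - p(0) - p(1) - p(2) = 0.4303. \)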
This means that 9% of days we will not use our standby trucks at all, but that
on 43% of days we will run out of standby trucks. We should investigate the financial
implications of putting more trucks on standby.
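The two probabilities in Example 6A can be checked with a short Python sketch. The function name `poisson_pmf` is an assumption made for this illustration; the rate of 2.4 breakdowns per day and the two standby trucks come from the example.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Pr[X = x] for X ~ P(lam): e^(-lam) * lam^x / x!."""
    return exp(-lam) * lam ** x / factorial(x)

lam = 12 / 5            # 12 breakdowns per 5-day week = 2.4 per day
standby = 2             # trucks kept on standby each day

# (a) No standby trucks needed: no breakdowns at all.
p_none = poisson_pmf(0, lam)

# (b) Standby trucks inadequate: more breakdowns than standby trucks.
p_inadequate = 1 - sum(poisson_pmf(x, lam) for x in range(standby + 1))

print(round(p_none, 4), round(p_inadequate, 4))
```

Changing `standby` to 3 or 4 shows how quickly the probability of running short falls, which is the starting point for the financial comparison suggested in the example.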
Example 7B: Bank tellers make errors in entering figures in their ledgers at the rate
of 0.75 errors per page of entries. What is the probability that in a random sample of 4
pages there will be 2 or more errors?
Because we are dealing with errors, we assume a Poisson distribution. If errors occur
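at 0.75 errors per page, then the error rate per 4 pages is 3. So we choose λ = 3.
Hence
\[ \Pr[X = x] = \frac{e^{-3}\, 3^x}{x!} \]
Then:
\[ \begin{aligned}
\Pr[X \ge 2] &= 1 - \Pr[X < 2] \\
&= 1 - \Pr[X = 0] - \Pr[X = 1] \\
&= 1 - \frac{e^{-3}\, 3^0}{0!} - \frac{e^{-3}\, 3^1}{1!} \\
&= 1 - 0.0498 - 0.1494 = 0.8008.
\end{aligned} \]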
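\[ p(x) = \begin{cases} \dfrac{e^{-\lambda}\lambda^x}{x!} & x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases} \]
\[ e^{\lambda} = 1 + \lambda + \frac{\lambda^2}{2!} + \frac{\lambda^3}{3!} + \cdots = \sum_{x=0}^{\infty} \frac{\lambda^x}{x!} \]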
Example 9C: Beercans are randomly tossed alongside the national road, with an
average frequency 3.2 per km.
(a) What is the probability of seeing no beercans over a 5 km stretch?
(b) What is the probability of seeing at least one beercan in 200 m?
(c) Determine the values of x and y in the following statement: 40% of 1 km sections
have x or fewer beercans, while 5% have more than y.
For (c), the following information is useful, where \( p(x) = e^{-3.2}\, 3.2^x / x! \):
x       0       1       2       3       4       5       6       7       8
p(x)  0.0408  0.1304  0.2087  0.2226  0.1781  0.1140  0.0608  0.0278  0.0111
\[ \Pr[X \le x] = \sum_{t=0}^{x} p(t) \]
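The cumulative probabilities Pr[X ≤ x] needed for part (c) of Example 9C are running totals of p(x). The Python sketch below computes them for λ = 3.2; it is an illustration only, and the names `poisson_pmf`, `pmf` and `cdf` are assumptions introduced here. The values it prints are computed, not copied from the original table.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Pr[X = x] for X ~ P(lam): e^(-lam) * lam^x / x!."""
    return exp(-lam) * lam ** x / factorial(x)

lam = 3.2                                   # beercans per km
pmf = [poisson_pmf(x, lam) for x in range(9)]

# Running totals give Pr[X <= x] = p(0) + p(1) + ... + p(x).
cdf = []
running = 0.0
for px in pmf:
    running += px
    cdf.append(running)

for x in range(9):
    print(x, f"{pmf[x]:.4f}", f"{cdf[x]:.4f}")
```

Reading down the last column shows where the cumulative probability first passes a chosen level, which is how statements such as "40% of sections have x or fewer" are resolved.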
1
=
n
n
x
We now let n get so large that the assumption of two or more events in one interval
being impossible becomes realistic. Ultimately, we use the two mathematical results
above to see what happens when we let n tend to infinity:
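\[ \begin{aligned}
p(x) &= \lim_{n\to\infty} \binom{n}{x} \left(\frac{\lambda}{n}\right)^x \left(1 - \frac{\lambda}{n}\right)^{n-x} \\
&= \lim_{n\to\infty} \frac{n!}{x!\,(n-x)!}\, \frac{\lambda^x}{n^x} \left(1 - \frac{\lambda}{n}\right)^{n} \left(1 - \frac{\lambda}{n}\right)^{-x} \\
&= \frac{\lambda^x}{x!}\, \lim_{n\to\infty} \frac{n!}{(n-x)!\, n^x}\; \lim_{n\to\infty} \left(1 - \frac{\lambda}{n}\right)^{n}\; \lim_{n\to\infty} \left(1 - \frac{\lambda}{n}\right)^{-x} \\
&= \frac{\lambda^x}{x!}\, \lim_{n\to\infty} \left[\frac{n}{n}\cdot\frac{n-1}{n}\cdot\frac{n-2}{n} \cdots \frac{n-x+1}{n}\right] e^{-\lambda} \times 1,
\end{aligned} \]
using the first of the mathematical results above. A simple re-expression of each term
within brackets yields
\[ \begin{aligned}
p(x) &= \frac{\lambda^x}{x!}\, \lim_{n\to\infty} \left[1 \cdot \left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right) \cdots \left(1 - \frac{x-1}{n}\right)\right] e^{-\lambda} \\
&= \frac{\lambda^x}{x!} \times 1 \times 1 \times \cdots \times 1 \times e^{-\lambda}
\end{aligned} \]
using the second of the mathematical results x times. Therefore, we have the result we
require,
\[ p(x) = \frac{e^{-\lambda}\lambda^x}{x!}, \]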
the probability mass function for the Poisson distribution.
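The limiting argument can be watched numerically: for a fixed λ and x, the binomial probability with n trials and success probability λ/n approaches the Poisson probability as n grows. The Python sketch below is an illustration only; the function names and the choice λ = 2.4, x = 2 are assumptions made here.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """Pr[X = x] for X ~ B(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """Pr[X = x] for X ~ P(lam)."""
    return exp(-lam) * lam ** x / factorial(x)

lam, x = 2.4, 2
target = poisson_pmf(x, lam)

# Split the interval into n sub-intervals, each with success probability lam/n.
for n in (10, 100, 1_000, 10_000):
    approx = binomial_pmf(x, n, lam / n)
    print(n, f"{approx:.6f}", f"difference {abs(approx - target):.6f}")

print("Poisson:", f"{target:.6f}")

gap_small_n = abs(binomial_pmf(x, 10, lam / 10) - target)
gap_large_n = abs(binomial_pmf(x, 10_000, lam / 10_000) - target)
```

The differences shrink steadily as n increases, in line with the derivation.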
\[ \Pr[X > 2] = \int_2^{\infty} 1.5e^{-1.5x}\,dx = \left[-e^{-1.5x}\right]_2^{\infty} = -e^{-\infty} + e^{-3} = 0 + e^{-3} = 0.0498. \]
We first make our units of time compatible: 3 days = 3/7 week. We want the
probability of a breakdown before 3/7 week:
\[ \Pr[0 < X < 3/7] = \int_0^{3/7} 1.5e^{-1.5x}\,dx = \left[-e^{-1.5x}\right]_0^{3/7} = -e^{-0.6429} + e^{0} = -0.5258 + 1 = 0.4742 \]
These probabilities are depicted in the figure below, which also shows the general shape
of the exponential distribution.
[Figure: graph of the exponential density f(x) = 1.5e^{−1.5x} for X ∼ E(1.5), with the areas Pr[0 < X < 3/7] and Pr[X > 2] marked.]
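Both exponential probabilities follow from the cumulative probability Pr[X ≤ x] = 1 − e^{−λx}. The Python sketch below evaluates them for the rate 1.5 per week used above; the function name `exponential_cdf` is an assumption made for this illustration.

```python
from math import exp

def exponential_cdf(x, lam):
    """Pr[X <= x] for X ~ E(lam): 1 - e^(-lam * x) for x > 0, else 0."""
    return 1.0 - exp(-lam * x) if x > 0 else 0.0

lam = 1.5                                   # rate per week

# Probability that the waiting time exceeds 2 weeks.
p_more_than_2 = 1.0 - exponential_cdf(2, lam)

# Probability of a breakdown within 3 days, i.e. within 3/7 of a week.
p_within_3_days = exponential_cdf(3 / 7, lam)

print(round(p_more_than_2, 4), round(p_within_3_days, 4))
```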
Example 11B: Show that the exponential distribution
\[ f(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & \text{otherwise} \end{cases} \]
is a probability density function.
We check that the three conditions for a probability density function are satisfied.
(i) f(x) is defined everywhere. It is non-zero on the interval [0, ∞), i.e. the set
{x | 0 ≤ x < ∞}.
(ii) f(x) is never negative. Because λ is a rate, it must be positive, and e^{−λx} is positive.
(iii)
\[ \int_{-\infty}^{\infty} f(x)\,dx = \int_0^{\infty} \lambda e^{-\lambda x}\,dx = \left[-e^{-\lambda x}\right]_0^{\infty} = -e^{-\infty} + e^{0} = 0 + 1 = 1, \]
as required for the area under the curve of a probability density function.
Example 12C: Let the random variable X be the time in hours for which a light bulb
burns from the time it is put into service. The probability density function of X is given
by
\[ f(x) = \begin{cases} \dfrac{1}{1000}\, e^{-\frac{1}{1000}x} & x \ge 0 \\ 0 & \text{otherwise} \end{cases} \]
(a) What is the probability that the bulb burns for between 100 and 1000 hours?
(b) What is the probability that the bulb burns for more than 1000 hours?
(c) What is the probability that the bulb burns for a further 1000 hours, given that
it has already burned for 500? (Use conditional probabilities!)
Example 13C: Events occur according to a Poisson process with intensity λ (i.e. at
rate λ per unit of time).
(a) Use the Poisson distribution to determine the probability of no events in t units of
time.
(b) Now use the exponential distribution to determine the probability that the time
between events is greater than t.
(c) Compare the answers to (a) and (b) and explain these results.
Example 14C: Flaws occur in telephone cable at the average rate of 4.4 flaws per
km of cable. Calculate the following probabilities. (Make use of binomial, Poisson and
exponential distributions.)
(a)
(b)
(c)
(d)
random variable X is the sum of a large number of random increments, then X has the
normal distribution.
The daily turnover of a large store is the sum of the purchases of all the individual
customers. The height of a 50-year old pinetree can be thought of as the sum of each
year's growth, which itself is a variable affected by sunshine, temperature, rainfall,
etc. So one expects the heights of 50-year old pinetrees to obey a normal distribution.
Similarly, an examination mark is the sum of the scores in a large number of questions.
Thus, by the central limit theorem, one expects daily turnover, the heights of trees and
examination marks (approximately, at least) to be normally distributed.
The normal distribution is continuous, and has probability density function
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \qquad -\infty < x < \infty \]
There are two parameters, μ (mu, the Greek letter m for Mean) and σ (sigma,
the Greek letter s for Standard deviation).
The constant μ tells us where the graph is located (it can take on any real value);
the constant σ (which is always positive) tells how spread out the distribution is. The
graphs, depicting f(x) for a few values of μ and σ, make this clear. The most striking
feature of the normal distribution is that it is bell-shaped. Notice also that the centre
of the bell is located at the value μ, and that the distribution gets flatter as σ gets
larger. The plot also illustrates the fact that the area under the curve for a probability
density function is one; to accommodate this, notice that as the curve gets flatter, its
maximum value has to become smaller.
[Figure: graphs of the normal probability density function f(x) for four combinations of μ and σ, including N(0, 1) and N(6, 1).]
can be done by computer, and we are supplied with a table of probabilities for the normal
distribution.
It should come as a surprise to you that a single table is all we need. After all,
there are infinitely many combinations of and , and it seems that we ought to have
a massive book of normal tables. We are luckier than we deserve to be, and there is
a connecting link between all normal distributions which makes it possible to get away
with a single table! We will learn how to use this amazing table by means of an example.
\[ \int_{251}^{253} \frac{1}{\sqrt{2\pi \cdot 9}}\, e^{-\frac{1}{2}\left(\frac{x-251}{3}\right)^2} dx \]
<x<
How do we make do with tables for only the standard normal distribution? Because
we have an easily proved result that the proportion of the density function that
lies between the mean and a specified number of standard deviations away
from the mean is always constant regardless of the numerical values of the mean
and standard deviation.
Translated into mathematical symbols, this important result can be written as
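\[ \int_{\mu}^{\mu + z\sigma} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx = \int_{0}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}\, dz \]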
As an example of this, the areas depicted below are equal. The shading, in both
cases, shows the area under the curve between the mean and one standard deviation
above the mean. Both plots have the same scale on both axes so you can count the
dots for a numerical proof!
[Figure: density curves of X ∼ N(251, 3²), with σ = 3, and Z ∼ N(0, 1), with σ = 1; in each, the area between the mean and one standard deviation above the mean is shaded.]
254 is one standard deviation (i.e. 3 units) above 251, the mean. Thus the area between
251 and 254 in N(251, 3²) is the same as that between 0 and 1 in N(0, 1).
Returning to part (a) of our margarine example, we need the area between 251 and
253 of N(251, 3²). 253 is two-thirds of a standard deviation above the mean of 251,
because (253 − 251)/3 = 2/3. Thus Pr[251 < X < 253] = Pr[0 < Z < 2/3], as depicted
below:
[Figure: density curves of X ∼ N(251, 3²) and Z ∼ N(0, 1), with the areas corresponding to Pr[251 < X < 253] and Pr[0 < Z < 2/3] shaded.]
Some numerical results from the normal tables help to give a feel for the normal
distribution. The area from one standard deviation below the mean to one standard
deviation above the mean is 0.683 (close to 2/3rds); i.e.
Pr[μ − σ < X < μ + σ] = 0.683.
The corresponding probabilities for two, three and four standard deviations are:
Pr[μ − 2σ < X < μ + 2σ] = 0.954
These results are true for all combinations of μ and σ! In general terms, two-thirds (68%)
of a normal distribution is within one standard deviation of its mean, 95% is within two
standard deviations, and virtually all of it is within three standard deviations.
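These proportions can be reproduced without tables, because the probability that a normal random variable lies within k standard deviations of its mean equals erf(k/√2), where erf is the error function available in Python's `math` module. The sketch below is an illustration only; the function name `within_k_sd` is an assumption introduced here.

```python
from math import erf, sqrt

def within_k_sd(k):
    """Pr[mu - k*sigma < X < mu + k*sigma] for any normal X; free of mu and sigma."""
    return erf(k / sqrt(2.0))

for k in (1, 2, 3, 4):
    print(k, f"{within_k_sd(k):.5f}")

one_sd = within_k_sd(1)
two_sd = within_k_sd(2)
three_sd = within_k_sd(3)
```

The function takes no μ or σ as arguments, which is the point of the result: the answer depends only on the number of standard deviations.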
The general result is that the area between μ and some point x for N(μ, σ²) is the
same as the area between 0 and z = (x − μ)/σ for N(0, 1). The formula z = (x − μ)/σ tells us how
many standard deviations the point x is away from the mean μ. Once again, you can
count the dots in the plot below of the normal distribution with arbitrary parameters
μ and σ and in the standard normal distribution N(0, 1):
[Figure: density curves of X ∼ N(μ, σ²) and Z ∼ N(0, 1), with the area between μ and x and the area between 0 and z = (x − μ)/σ shaded.]
From the table for the standard normal distribution (Table 1) we read off this probability
as 0.2486. Thus
$\Pr[251 < X < 253] = 0.2486$,
almost a quarter of margarine tubs contain between 251 g and 253 g of margarine.
Part (b) of our question asked for the probability that a tub of margarine was
underweight, i.e. the probability that $X < 250$. The area between $-\infty$ and 250 in
$N(251, 3^2)$ is the same as the area between $-\infty$ and $(250 - 251)/3 = -1/3$ in $N(0, 1)$:
$\Pr[X < 250] = \Pr\left[Z < \frac{250 - 251}{3}\right] = \Pr[Z < -1/3].$
[Figure: densities $f(x)$ of $X \sim N(251, 3^2)$ and $f(z)$ of $Z \sim N(0, 1)$, illustrating $\Pr[X < 250] = \Pr[Z < -1/3]$.]
Because our tables give us areas between 0 and a point z, we have to go through the
steps depicted in the diagrams below to find this probability. We make use of the facts
that the normal distribution is symmetric, and that the area from 0 to $\infty$ is 0.5.
Alternatively, we can write:
$\Pr[X < 250] = \Pr[Z < (250 - 251)/3] = \Pr[Z < -1/3] = \Pr[Z > 1/3]$
$= 0.5 - \Pr[0 < Z < 1/3] = 0.5 - 0.1293 = 0.3707.$
The value 0.1293 is looked up in Table 1. Thus 37% of the tubs will contain less margarine
than stated. Notice that because the normal distribution is symmetric we only need
tables for half of the distribution.
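The table look-ups for the margarine example can be cross-checked numerically. This is only a sketch, assuming Python 3.8 or later with the standard-library `statistics.NormalDist`; the function name `prob_below` is our own illustration. The results differ slightly in the third decimal from the table values (0.3707 and 0.2486) because Table 1 rounds $z$ to two decimal places.

```python
from statistics import NormalDist

def prob_below(x, mu, sigma):
    """Pr[X < x] for X ~ N(mu, sigma^2), found by standardising to Z ~ N(0, 1)."""
    z = (x - mu) / sigma           # number of standard deviations from the mean
    return NormalDist().cdf(z)     # area to the left of z under N(0, 1)

# Margarine tubs: X ~ N(251, 3^2); a tub is underweight when X < 250.
p_under = prob_below(250, 251, 3)
# Pr[251 < X < 253] as a difference of two left-hand areas.
p_between = prob_below(253, 251, 3) - prob_below(251, 251, 3)
print(round(p_under, 4), round(p_between, 4))
```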
$\Pr[2 < X < 18] = \Pr\left[\frac{2 - 4}{8} < \frac{X - \mu}{\sigma} < \frac{18 - 4}{8}\right]$ (letting $z = (x - \mu)/\sigma$)
$= \Pr[-0.25 < Z < 1.75]$
Example 17B: A t-shirt manufacturer knows that the chest measurements of his
customers are normally distributed with mean 92 cm and standard deviation 5 cm. He
makes his t-shirts in four sizes: S (to fit size range 80-87 cm), M (to fit 87-94), L (to
fit 94-101) and XL (to fit 101-108). What proportion of customers fit into each size
t-shirt?
[Figure: density of $X \sim N(92, 5^2)$ with the size boundaries 80, 87, 94, 101 and 108 cm marked, separating sizes S, M, L and XL.]
We need to find the z-values for each of the boundary points, by using the formula
$z = (x - \mu)/\sigma$.
Then, from our normal tables, we find the area between each of these points and the
mean. This gives

    x      z = (x - 92)/5    area between mean and x
    80         -2.4                0.4918
    87         -1.0                0.3413
    94          0.4                0.1554
   101          1.8                0.4641
   108          3.2                0.4993

The proportions for each size are then found by subtraction (or addition in the case
of size M), as follows:

   Size    Proportion
   S       0.4918 - 0.3413 = 0.1505    (15.05%)
   M       0.3413 + 0.1554 = 0.4967    (49.67%)
   L       0.4641 - 0.1554 = 0.3087    (30.87%)
   XL      0.4993 - 0.4641 = 0.0352    ( 3.52%)

Check for yourself that 0.89% of customers don't fit into any size t-shirt.
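The same proportions can be reproduced directly from the distribution function, which avoids the separate "add or subtract" step needed with half-tables. A minimal sketch, assuming Python 3.8 or later and `statistics.NormalDist`; the variable names are ours.

```python
from statistics import NormalDist

chest = NormalDist(mu=92, sigma=5)      # chest measurements in cm
bounds = {"S": (80, 87), "M": (87, 94), "L": (94, 101), "XL": (101, 108)}

# Proportion in each size = area under the curve between its two boundaries.
proportions = {size: chest.cdf(hi) - chest.cdf(lo)
               for size, (lo, hi) in bounds.items()}
no_fit = 1 - sum(proportions.values())  # customers below 80 cm or above 108 cm

for size, p in proportions.items():
    print(f"{size:>2}: {p:.4f}")
print(f"no size fits: {no_fit:.4f}")
```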
Example 18C: The mean inside diameter of washers produced by a machine is 0.403
cm and the standard deviation is 0.005 cm.
Washers with an internal diameter less than 0.397 cm or greater than 0.406 cm are
considered defective. What percentage of the washers produced are defective, assuming
the diameters are normally distributed?
Example 19C: In a large group of men 4% are under 160 cm tall and 52% are between
160 cm and 175 cm tall. Assuming that heights of men are normally distributed, what
are the mean and standard deviation of the distribution?
Example 20C: A soft-drink vending machine is set to discharge an average of 215 ml
of cooldrink per cup. The amount discharged is normally distributed with standard
deviation 10 ml.
(a) If 225 ml cups are used, what proportion of cups overflow?
(b) What is the probability that a cup contains at least 200 ml of cooldrink?
(c) What size cups ought to be used if it is desirable that only 2% of cups overflow?
where $\mu = \sum_{i=1}^{n} \mu_i$, and $\sigma^2 = \sum_{i=1}^{n} \sigma_i^2$.
Sometimes we need to consider the difference of two independent normally distributed
random variables. Suppose
$X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$
then, letting $Z = X_1 - X_2$, we state, without proof, that
$Z \sim N(\mu_1 - \mu_2, \sigma_1^2 + \sigma_2^2).$
The mean of the random variable Z is found by subtraction, but the variance is still
found by addition.
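The rule "means subtract, variances add" can be written as a small helper. This is a sketch only, assuming Python 3.8 or later; the function name `difference` and the numbers used are hypothetical illustrations, not taken from IntroSTAT.

```python
from math import sqrt
from statistics import NormalDist

def difference(x1, x2):
    """Distribution of X1 - X2 for independent normal X1 and X2."""
    mean = x1.mean - x2.mean                  # means subtract
    variance = x1.variance + x2.variance      # variances still add
    return NormalDist(mean, sqrt(variance))

# Hypothetical illustration: X1 ~ N(10, 3^2) and X2 ~ N(4, 4^2).
d = difference(NormalDist(10, 3), NormalDist(4, 4))
print(d.mean, d.stdev)       # mean 10 - 4, standard deviation sqrt(9 + 16)
print(1 - d.cdf(0))          # Pr[X1 - X2 > 0], i.e. Pr[X1 > X2]
```

Note that the standard deviation of the difference is larger than that of either variable: subtracting one uncertain quantity from another never reduces the uncertainty.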
Example 21B: You have 4 chores to perform before getting to Statistics lectures by
08h10. The time (in minutes) to perform each chore is normally distributed with mean
and standard deviation as given below:
   Chore                      mean ($\mu$)    std. dev. ($\sigma$)
   1. Shower                                      0.5
   2. Get dressed                                 1.0
   3. Eat breakfast             10                3.5
   4. Drive to university       15                5.0

$z = \frac{x - 34}{6.205}$,
$\Pr[X > 45.5] = \Pr\left[Z > \frac{45.5 - 34}{6.205}\right] = \Pr[Z > 1.85] = 0.0322.$
You'll be late about one day in 31, on average, but (by part (b)) more than
three minutes late only once in every 100 days.
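The standard deviation 6.205 used above comes from adding the four variances, not the four standard deviations. A minimal sketch of that calculation, assuming Python 3.8 or later; the total mean of 34 minutes is taken from the worked solution, and the variable names are ours.

```python
from math import sqrt
from statistics import NormalDist

total_mean = 34                        # minutes, as used in the example
std_devs = [0.5, 1.0, 3.5, 5.0]        # shower, dressing, breakfast, driving

# For a sum of independent normals the variances add.
total_sd = sqrt(sum(s ** 2 for s in std_devs))
total_time = NormalDist(total_mean, total_sd)

p_late = 1 - total_time.cdf(45.5)      # Pr[X > 45.5]
print(round(total_sd, 3), round(p_late, 4))
```

The probability differs from 0.0322 only in the fourth decimal, because the table value uses $z$ rounded to 1.85.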
Example 22C: Plastic caps seal the ends of the tube into which your degree certificate
is placed when you graduate. Suppose the tubes have a mean diameter of 24.0mm and
a standard deviation of 0.15mm, and that the plastic caps have a mean diameter of
23.8 mm and a standard deviation of 0.11mm. If the diameter of the cap is 0.10 mm
or more larger than that of the tube, the cap cannot be squashed into the tube, and if
the diameter of the cap is 0.45 mm or more smaller than that of the tube, it will not
seal the tube, but will just keep falling out. If a tube and a plastic cap are selected at
random, what are the probabilities of (a) the cap being too large for the tube, and (b)
the cap falling out of the tube?
Example 23C: Another textbook says that the mass of an ostrich is normally distributed
with mean 68 745 g and variance 13 201 000 g$^2$.
(a) Convert this information to a random variable with mass measured, more sensibly,
in kilograms.
(b) What is the probability that three ostriches weigh more than 225 kg?
[Figure: density $f(z)$ of $Z \sim N(0, 1)$; the area between 0 and $z^{(0.10)}$ is 0.40 and the area to the right of $z^{(0.10)}$ is 0.10.]
Remember that Table 1 is constructed to give probabilities between 0 and $z$. Therefore,
to find $z^{(0.10)}$, we search in the body of Table 1 until we find the closest value we
can to 0.40. We find 0.3997 when $z = 1.28$. Thus $z^{(0.10)} = 1.28$; we say that 1.28 is
the 10% point of the standard normal distribution. Sometimes, we need to be
more precise, and say that 1.28 is the upper 10% point of the standard normal
distribution. Clearly, because of the symmetry of the standard normal distribution, $-1.28$ is
the lower 10% point of the standard normal distribution. The lower 10% point is also,
in a perverse way, the upper 90% point, so that we can even write $z^{(0.90)} = -1.28$!
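Searching the body of the table is the same as inverting the distribution function. The sketch below assumes Python 3.8 or later, whose `statistics.NormalDist` provides `inv_cdf`; the helper name `upper_point` is our own.

```python
from statistics import NormalDist

def upper_point(p):
    """Return z such that Pr[Z > z] = p for Z ~ N(0, 1)."""
    return NormalDist().inv_cdf(1 - p)

z10 = upper_point(0.10)
print(round(z10, 2))                          # upper 10% point
print(round(upper_point(0.90), 2))            # upper 90% point = lower 10% point
print(round(NormalDist().cdf(z10) - 0.5, 4))  # Table 1 view: area between 0 and z10
```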
Example 25C: Find (a) $z^{(0.05)}$, (b) $z^{(0.025)}$, (c) $z^{(0.01)}$, (d) $z^{(0.005)}$, (e) $z^{(0.25)}$, (f)
$z^{(0.5)}$, (g) $z^{(0.95)}$ and (h) $z^{(0.99)}$.
Solutions to examples
5C (a) $1 - \binom{10}{0} q^{10}$
(i) 0.0956 (ii) 0.6513 (iii) 0.9718
(b) $1 - \binom{20}{0} q^{20} - \binom{20}{1} p q^{19}$
(i) 0.0169 (ii) 0.6083 (iii) 0.9924
(c) The second procedure is less likely to reject relatively satisfactory consignments, and is more likely to reject very poor consignments. However, it costs
twice as much to do the checking, so there is a trade-off.
9C (a) 0.1125 106
(b) 0.4727
(c) x = 2 y = 6
(b) 0.3679
(c) 0.3679
14C (a) 0.2834 (Poisson) (b) 0.0257 (Poisson) (c) 0.1108 (Exponential) (d) 0.0158
(Binomial)
18C 38.94%
19C $\mu = 173.83$, $\sigma = 7.895$
20C (a) 0.1587 (b) 0.9332 (c) 235.5 ml (The machine is pretty useless!)
22C (a) 0.0537 (b) 0.0901
23C (a) The conversion factor is to divide by 1000. $Y \sim N(68.745, 13.201)$, where
$Y$ is measured in kilograms. (b) If one ostrich weighs $Y$ kg, then three weigh
$V = Y_1 + Y_2 + Y_3$ kg, and $V \sim N(3 \times 68.745, 3 \times 13.201) = N(206.235, 39.603)$.
$\Pr[V > 225] = \Pr[Z > 2.98] = 0.00144.$
25C (a) $z^{(0.05)} = 1.64$, (b) $z^{(0.025)} = 1.96$, (c) $z^{(0.01)} = 2.33$, (d) $z^{(0.005)} = 2.58$, (e)
$z^{(0.25)} = 0.67$, (f) $z^{(0.5)} = 0$, (g) $z^{(0.95)} = -1.64$ and (h) $z^{(0.99)} = -2.33$.
...
Suppose that 25% of the people entering a supermarket are aged between 18 and
30 years, classified as young adults. A market researcher has to fulfil a quota of
10 interviews. What is the probability her quota of interviews contains
(a) exactly x young adults?
(b) no young adults?
(c) between four and six (inclusive) young adults?
5.2
(a) A true-false test is given with five questions. To pass you need at least four
right. You guess each answer. What is the probability that you pass?
(b) The true-false test is replaced with a multiple-choice test with four alternative
answers, only one of which is correct. If you guess, what is the probability
that you pass?
5.3
5.4
An anti-aircraft battery in England during World War II had on the average 3 out
of 10 successes in shooting down flying bombs that came within range. What was
the chance that, if eight bombs came within range, two or more were shot?
Poisson distribution . . .
5.5
What is the probability of finding 12 errors in a 200-page book if the printers have
an error rate of 0.075 errors per page? Assume that a Poisson distribution may
be used to model the occurrence of errors.
5.6
5.7
The average demand on a factory store for a certain electric motor is 8 per week.
When the storeman places an order for these motors, delivery takes one week. If
the demand for motors has a Poisson distribution, how low can the storeman allow
his stock to fall before ordering a new supply if he wants to be at least 95% sure
of meeting all requirements while waiting for his new supply to arrive?
5.8 The average number of accidental drownings per year is 3.5 per 100 000 population.
(a) Find the probability that in a city with a population of 200 000 there will be
between 4 and 8 (inclusive) accidental drownings per year.
(b) What are the probabilities that in towns of 15 000, 20 000 and 50 000 there
will be no drownings in a year?
Exponential distribution . . .
5.9
If the average drowning rate is 3.5 per 100 000 population per year what is the
probability that the time interval between drownings in a city of 200 000 will be
less than one month?
5.10
Customers arrive at a restaurant at the rate of 90 per hour during lunch time.
(a) If a customer has just arrived, what is the probability that it will be at least
another minute before the next customer arrives?
(b) If a minute has already passed by since the last customer arrived, what is
the probability that it is at least another minute before the next customer
arrives?
(Hint: use conditional probabilities.)
5.11
The life of an electronic device is known to have the exponential distribution with
parameter $\lambda = 1/1000$.
(a) What is the probability that the device lasts less than 1000 hours?
(b) What is the probability it will last more than 1200 hours?
(c) If three such devices are taken at random, what is the probability that one
will last less than 800 hours, another between 500 and 1200 hours, and the
third between 1200 and 2000 hours?
5.12
The duration (in minutes) of showers on a tropical island is approximately
exponentially distributed with $\lambda = 1/5$.
(a) Out of 3 showers, what is the probability that not more than 2 will last for
10 minutes or more?
(b) What is the probability that a shower will last at least 2 minutes more, given
that it has already lasted 5 minutes?
Normal distribution . . .
5.13 If the random variable $Z$ has the standard normal distribution (i.e. $Z \sim N(0; 1)$),
find the following probabilities:
(a) $P[0 < Z < 1]$                (b)
(c) $P[-1.64 < Z < 1.64]$         (d)
(e) $P[Z < -1.38]$                (f)
(g) $P[-2.3 < Z < 1.6]$           (h)
(i) $P[-1.74 < Z < -0.86]$        (j)
5.14 Given that Z has standard normal distribution, what values must z have in order
to make each of the following statements true?
(a)
(b)
(c)
(d)
(e)
5.15
5.16
If $X \sim N(0; \frac{1}{4})$ find
(a) P [X > 2]
(b) P [0 < X < 1]
5.18
If X has a normal distribution, and if P [X < 10] = 0.8413, what is the value of
the mean if the distribution is known to have variance $\sigma^2 = 16$?
5.19
Sports shirts are frequently classified as S, M, L and XL for small, medium, large
and extra large neck sizes. S fits a neck circumference of less than 37 cm, M fits
between 37 and 40.5 cm and L fits between 40.5 and 44 cm while XL fits necks
over 44 cm in circumference. The neck circumference of adult males has a normal
distribution with $\mu = 40$ and $\sigma = 2$.
(a) What proportion of shirts should be manufactured in each category?
Suppose that the profit (or loss) per day of a shopkeeper dealing in a perishable
item is approximately normally distributed with mean R10 and standard deviation
R5. What is the probability that
(a) he makes a loss on any one day?
(b) his profit exceeds R14?
(c) he makes exactly R10 profit?
5.21 Consider an I.Q. test for which the scores of adult Americans are known to have a
normal distribution with expected value 100 and variance 324, and a second I.Q.
test for which the scores of adult Americans are known to have a normal distribution with expected value 50 and variance 100. Under the assumption that both
tests measure the same phenomenon (intelligence), what score on the second
test is comparable to a score of 127 on the first test? Explain your answer.
5.23
The average number of oil tankers arriving each day at a Persian Gulf port is
known to be 7. The facilities at the port can handle at most 10 tankers per day.
If tankers arrive at random, what is the probability that on a given day tankers
have to be turned away?
5.24
Airplane engines operate independently in flight and fail with probability 1/10.
A plane makes a successful flight if at most half of its engines fail. Determine the
probability of a successful flight for two-engined and four-engined planes.
5.25
5.26 A die is thrown 10 times. What is the probability of obtaining at least three even
numbers?
5.27
The annual income of residents of Bishopscourt is normally distributed with mean
R25 000 and standard deviation R5000. What is the highest income of the lowest
20% of income earners in Bishopscourt?
5.28 A liquid culture medium contains on the average m bacteria per ml. A large
number of samples is taken, each of 1 ml, and bacteria are found to be present in
90% of the samples. Estimate m.
5.29 The strength of a plastic produced by a certain process is known to be normally
distributed. If 10% of the plastic has a strength of at least 4000 kg, and 70% has
a strength exceeding 3000 kg, what are the mean and standard deviation of the
distribution?
5.30
A bank has 175 000 credit card holders. During one month the average amount
spent by each card holder totalled R192,50 with a standard deviation of R60,20.
Assuming a normal distribution, determine the number of card holders who spent
more than R250.
5.31
The maximum (stated) load of a passenger lift is 8 passengers or 600 kg. If the
masses of people using the lift can be considered to be normally distributed with
mean 70 kg and standard deviation 15 kg, how often will the combined mass of 8
passengers exceed the 600 kg limit?
5.32 A road is constructed so that the right-turn lane at an intersection has a capacity
of 3 cars. Suppose that 30% of cars approaching the intersection want to turn
right. If a string of 15 cars approaches the intersection, what is the probability
that the lane will be insufficiently large to hold all the cars wanting to turn right?
5.33
The weekly demand for sulphuric acid from the store of a chemical factory is
normally distributed with mean 246 litres and standard deviation 50 litres. After
placing an order with the sulphuric acid manufacturers, delivery to the store takes
one week.
(a) How low can the stock of sulphuric acid be allowed to fall before ordering a
new supply in order to be 95% sure of meeting all requirements while waiting
for the new supply to arrive?
(b) What volume of sulphuric acid should then be ordered so that, with 95%
certainty, it will not be necessary to reorder within 6 weeks?
Solutions to exercises . . .
5.1 (a) $\binom{10}{x}\, 0.25^x\, 0.75^{10-x}$ (b) 0.0563 (c) 0.2206
5.2 (b) 0.0156
(b) 0.9812 (c) 0.0296
5.4 0.7447
5.5 0.0829
5.6 (a) 0.2642 (b) 0.1954
5.7 13
5.8 (a) 0.6473 (b) 0.5916, 0.4966 and 0.1738
5.9 0.442
5.10 (a) 0.2231 (b) 0.2231
5.11 (a) 0.632 (b) 0.301 (c) $0.551 \times 0.305 \times 0.166 = 0.0279$
5.12 (a) Pr(a shower lasts longer than 10 mins) = 0.135, $\Pr(X \le 2) = 0.9975$
(b) 0.6703
5.13 (a) 0.3414 (b) 0.4750 (c) 0.8990 (d) 0.0228
(e) 0.0838 (f) 0.09821 (g) 0.9345 (h) 0.1359
(i) 0.1540 (j) 0.99730
5.14 (a) 1.964 (b) 1.96 (c) 1.64 (d) 2.58 (e) 2.33
(b) 0.9332
(b) 0.4773
(c) 0.3612
(b) 2.68
5.18 $\mu = 6$
5.19 (a) 7% S, 53% M, 38% L, 2% XL
5.20 (a) 0.0228 (b) 0.2119 (c) 0
5.21 65
(b) 15
(c) 0.964
5.26 0.9453
5.27 R20 800
5.28 $m = 2.3026$
5.29 $\mu = 3288.89$, $\sigma = 555.56$
5.30 29 488
5.31 0.1736
5.32 0.7031
5.33 (a) 328.0 litres
Chapter
In words, this says that the mean of a discrete random variable is equal to the sum of a
set of terms, each term being one of the values that the random variable can take on (x),
multiplied by the probability of taking that value (p(x)). The variance makes immediate
use of $\mu$ as defined above. The variance of a discrete random variable is defined to be
$\sigma^2 = \mathrm{Var}[X] = \sum_x (x - \mu)^2\, p(x).$
The sum is taken over the set of values for which the probability mass function $p(x)$ is
positive.
If $X$ is continuous, the corresponding definition is
$\sigma^2 = \mathrm{Var}[X] = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\, dx.$
The limits of integration are taken over the interval for which the probability density
function f (x) is non-zero.
As was the case with the variance of a sample in chapter 1, there is an alternative
formula for the variance of a random variable, which also provides a short-cut for
many problems. If $X$ is discrete, use
$\sigma^2 = \mathrm{Var}[X] = \sum_x x^2\, p(x) - \mu^2.$
If $X$ is continuous, use
$\sigma^2 = \mathrm{Var}[X] = \int_a^b x^2 f(x)\, dx - \mu^2.$
You should easily be able to prove that the pairs of formulae for the variance of discrete
and continuous random variables are equivalent.
For both discrete and continuous random variables, the standard deviation of
the random variable X is defined to be the square root of the variance:
$\sigma = \sqrt{\mathrm{Var}(X)}.$
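$\mu = \int_0^1 x \cdot 6x(1 - x)\, dx = 6 \int_0^1 (x^2 - x^3)\, dx$
$= 6 \left[ \tfrac{1}{3}x^3 - \tfrac{1}{4}x^4 \right]_0^1 = 6 \left( \tfrac{1}{3} - \tfrac{1}{4} \right) = \tfrac{1}{2}$
$\sigma^2 = \int_0^1 x^2 \cdot 6x(1 - x)\, dx - \left( \tfrac{1}{2} \right)^2 = 6 \int_0^1 (x^3 - x^4)\, dx - \tfrac{1}{4}$
$= 6 \left[ \tfrac{1}{4}x^4 - \tfrac{1}{5}x^5 \right]_0^1 - \tfrac{1}{4} = 6 \left( \tfrac{1}{4} - \tfrac{1}{5} \right) - \tfrac{1}{4} = \tfrac{1}{20}$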
$\mathrm{Var}(X) = \sum_{x=1}^{6} x^2 \times 1/6 - 3.5^2$
$= 1/6 + 4/6 + 9/6 + 16/6 + 25/6 + 36/6 - 3.5^2$
$= 2.917$
$\frac{\sqrt{\mathrm{Var}(X)}}{E(X)} = \frac{\sqrt{2.917}}{3.5} = 0.488$
3.5
In the next two examples, it will help you to be reminded that the mean is the sum
of the values that the random variable takes on multiplied by the probabilities of taking
on these values.
Example 3C: You are contemplating whether it is worth your while driving out to the
Northern Suburbs to call on a client. You estimate that there is a 40% chance that the
client will purchase your product. Commission on the sale is R50, but the petrol will
cost you R15.
(a) Should you call on the client?
(b) What probability of purchase would lead to an expected net gain of zero for the
call?
Example 4C: You are considering investing in one of two shares listed on the stock
exchange. You estimate that the probability is 0.3 that share A will decline by 15% and
a probability of 0.7 that it will rise by 30%. Correspondingly, for share B, you estimate
that the probability is 0.4 that it will decline by 15% and that the probability that it will
rise by 30% is 0.6. The return on a share is defined as the percentage price change.
(a)
(b)
(c)
(d)
(e)
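Example 5B: Suppose the random variable $X$ has the Poisson distribution with
parameter $\lambda$. Find $E[X]$.
The probability mass function for the Poisson distribution is
$p(x) = \frac{e^{-\lambda} \lambda^x}{x!}$, $x = 0, 1, 2, \ldots$
$= 0$ otherwise
$E[X] = \sum_{x=0}^{\infty} x\, p(x) = \sum_{x=0}^{\infty} x\, \frac{e^{-\lambda} \lambda^x}{x!}$
$= \sum_{x=1}^{\infty} x\, \frac{e^{-\lambda} \lambda^x}{x\,(x-1)!}$ because $x! = x\,(x-1)!$
$= \lambda e^{-\lambda} \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x-1)!}$
$= \lambda e^{-\lambda} \left( 1 + \frac{\lambda}{1!} + \frac{\lambda^2}{2!} + \cdots \right)$
$= \lambda e^{-\lambda} e^{\lambda} = \lambda$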
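$\sum_{x=1}^{n} \frac{n!}{(x-1)!\,(n-x)!}\, p^x q^{n-x} = np \sum_{x=1}^{n} \frac{(n-1)!}{(x-1)!\,(n-x)!}\, p^{x-1} q^{n-x}.$
Putting $y = x - 1$ and $m = n - 1$, the sum on the right-hand side is equal to one,
because
$\sum_{y=0}^{m} \binom{m}{y} p^y q^{m-y} = 1;$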
it is the sum of all possible values of the probability mass function of the binomial
distribution B(m, p). Therefore, as required
E[X] = np,
which says that the mean of the binomial distribution B(n, p) is the product of the two
parameters of the distribution, n and p.
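The two results $E[X] = \lambda$ for the Poisson distribution and $E[X] = np$ for the binomial distribution can be checked by summing $x\,p(x)$ directly. This is a rough numerical sketch using only the Python standard library (Python 3.8 or later for `math.comb`); the function names, the truncation at 100 terms and the parameter values are our own assumptions.

```python
from math import comb, exp, factorial

def poisson_mean(lam, terms=100):
    """Approximate E[X] for X ~ P(lam) by truncating the infinite sum of x * p(x)."""
    return sum(x * exp(-lam) * lam ** x / factorial(x) for x in range(terms))

def binomial_mean(n, p):
    """E[X] for X ~ B(n, p), summing x * p(x) over x = 0, 1, ..., n."""
    q = 1 - p
    return sum(x * comb(n, x) * p ** x * q ** (n - x) for x in range(n + 1))

print(poisson_mean(3.5))         # should be close to lambda = 3.5
print(binomial_mean(10, 0.25))   # should be close to n * p = 2.5
```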
[Figure: two density functions $f(x)$ plotted against $x$. Left panel: a peaked density, labelled "Variance small". Right panel: a flat density, labelled "Variance large".]
The variance may be thought of as a measure of the average distance of the random
variable X from its mean. If the p.d.f. or p.m.f. is very flat, the variance will be
large. If the probability function is very peaked, the variance will be small. This is
illustrated above. In the plot on the left, $f(x)$ is peaked, and the terms $(x - \mu)^2$ are on
average small, leading to a small variance. On the other hand, if as in the plot on the
right, $f(x)$ is flat, then the terms $(x - \mu)^2$ tend to be large, leading to a large variance.
Applied Mathematics students will see the relationship between means and centres of gravity and between variances and moments of inertia.
Skewness . . .
Just as a histogram can be skew, so can the distribution of a random variable.
We use the same terminology here as in chapter 1. A random variable is said to be
positively skewed if it has a long tail on the right-hand side. Similarly, a random
variable has negative skewness if it has a long tail on the left-hand side. Symmetric
distributions are just that: symmetric, so that the tail on the left is a mirror image of
the tail on the right.
Statisticians sometimes need to describe the shape of the tails of a probability distribution. Even if two distributions may have the same mean and variance, the shapes
of the tails may be quite different. Statisticians distinguish between heavy-tailed distributions, in which the probability of observations far from the mean is relatively large,
and light-tailed distributions, in which observations far from the mean are unlikely.
[Figure: three density functions $f(x)$ plotted against $x$, labelled "Negatively skewed", "Symmetric" and "Positively skewed".]
F (x) . . .
$F(x) = \int_{-\infty}^{x} f(t)\, dt$ for $X$ continuous, and
$F(x) = \sum_{t \le x} p(t)$ for $X$ discrete.
Note that for $X$ continuous, $F'(x) = f(x)$, i.e. the derivative of the distribution function
is the probability density function.
Example 7A: Find the distribution function F (x) for the exponential distribution.
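$F(x) = \Pr[X \le x] = \int_{-\infty}^{x} f(t)\, dt = \int_0^x \lambda e^{-\lambda t}\, dt$
$= \left[ -e^{-\lambda t} \right]_0^x = -e^{-\lambda x} + e^0$
$= 1 - e^{-\lambda x}$
The distribution function should be defined for the domain $(-\infty, \infty)$: thus we write
$F(x) = 0$, $x < 0$
$= 1 - e^{-\lambda x}$, $x \ge 0$
as the distribution function for the exponential distribution. Differentiating $F(x)$ yields
$\frac{dF(x)}{dx} = f(x) = 0$, $x < 0$
$= \lambda e^{-\lambda x}$, $x \ge 0$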
the probability density function of the exponential distribution.
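The relationship $F'(x) = f(x)$ for the exponential distribution can be checked numerically with a central difference. A minimal sketch using the Python standard library; the function names and the values of `lam`, `x` and `h` are illustrative choices of ours.

```python
from math import exp

def exp_cdf(x, lam):
    """F(x) = 1 - exp(-lam * x) for x >= 0, and 0 for x < 0."""
    return 1 - exp(-lam * x) if x >= 0 else 0.0

def exp_pdf(x, lam):
    """f(x) = lam * exp(-lam * x) for x >= 0, and 0 for x < 0."""
    return lam * exp(-lam * x) if x >= 0 else 0.0

lam, x, h = 2.0, 0.7, 1e-6     # illustrative values
slope = (exp_cdf(x + h, lam) - exp_cdf(x - h, lam)) / (2 * h)   # numerical F'(x)
print(slope, exp_pdf(x, lam))  # the two values should agree closely
```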
Example 8A: Find the distribution function for the random variable X with the
binomial distribution $X \sim B(3, 0.2)$.
$\Pr[X = x] = \binom{3}{x}\, 0.2^x\, 0.8^{3-x}$, $x = 0, 1, 2, 3$
which yields $p(0) = 0.512$, $p(1) = 0.384$, $p(2) = 0.096$, $p(3) = 0.008$. Thus
$\Pr[X \le 0] = 0.512$, $\Pr[X \le 1] = 0.512 + 0.384 = 0.896$, etc, so that

   F(x) = 0        x < 0
        = 0.512    0 <= x < 1
        = 0.896    1 <= x < 2
        = 0.992    2 <= x < 3
        = 1        x >= 3

The graphs of the distribution function $F(x)$ for Examples 7A and 8A are shown
below. The graph of $F(x)$ is always a non-decreasing function, between 0 and 1. If $X$ is a
discrete random variable, then F (x) will be a step function.
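The step-function behaviour is easy to see by evaluating $F(x)$ at non-integer points as well as at the jumps. This is a sketch under our own naming (`binomial_cdf`), assuming Python 3.8 or later for `math.comb`.

```python
from math import comb, floor

def binomial_cdf(x, n, p):
    """F(x) = Pr[X <= x] for X ~ B(n, p); a step function of real-valued x."""
    if x < 0:
        return 0.0
    k = min(floor(x), n)     # largest attainable value of X not exceeding x
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

for x in (-0.5, 0, 0.9, 1, 2, 2.5, 3, 10):
    print(x, round(binomial_cdf(x, 3, 0.2), 3))
```

Between consecutive integers the value of $F(x)$ does not change; it jumps by $p(x)$ at each attainable value $x$.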
[Figure: two plots of the distribution function $F(x)$ against $x$. Left panel: $F(x) = 1 - e^{-\lambda x}$ (with $\lambda = 1$), Example 7A. Right panel: step function of Example 8A.]
$f(x) = \frac{1}{2\sqrt{x}}$, $0 < x \le 1$
$= 0$ otherwise
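Example 10B: Suppose events are occurring at random with average rate $\lambda$ per unit
of time. What is the probability density function of the random variable $X$, the waiting
time to the second event?
We first find $F(x)$.
$F(x) = \Pr[X \le x] = \Pr[\text{waiting less than } x \text{ units of time for the 2nd event}]$
$= 1 - \Pr[\text{waiting more than } x \text{ units of time for the 2nd event}]$
$= 1 - e^{-\lambda x} - \lambda x e^{-\lambda x}$,
using the Poisson distribution with parameter $\lambda x$: we wait more than $x$ units of time
for the second event only if no events or exactly one event occur in that time. We now
differentiate $F(x)$ to find the density function $f(x)$.
$f(x) = \frac{dF(x)}{dx} = \lambda e^{-\lambda x} - \lambda e^{-\lambda x} + \lambda^2 x e^{-\lambda x}$
$= \lambda^2 x e^{-\lambda x}$.
Thus the density function of the random variable $X$, the waiting time to the second
event, is given by
$f(x) = \lambda^2 x e^{-\lambda x}$, $x \ge 0$
$= 0$ otherwise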
This is an extension of the exponential distribution which is the density function of the
waiting time to the first event. See Exercises 6.23 and 6.24.
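The differentiation in Example 10B can be checked numerically in the same spirit. This is only a sketch using the Python standard library; the function names and the values of `lam`, `x` and `h` are our own illustrative choices.

```python
from math import exp

def cdf_second_event(x, lam):
    """F(x) = 1 - Pr[0 events in x] - Pr[1 event in x], counts ~ Poisson(lam * x)."""
    if x < 0:
        return 0.0
    return 1 - exp(-lam * x) - lam * x * exp(-lam * x)

def pdf_second_event(x, lam):
    """f(x) = lam^2 * x * exp(-lam * x) for x >= 0, and 0 otherwise."""
    return lam ** 2 * x * exp(-lam * x) if x >= 0 else 0.0

lam, x, h = 1.5, 2.0, 1e-6     # illustrative values
slope = (cdf_second_event(x + h, lam) - cdf_second_event(x - h, lam)) / (2 * h)
print(slope, pdf_second_event(x, lam))   # numerical F'(x) versus f(x)
```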
Example 11C: A company which sells expensive woodworking machinery has established that the time between sales can be modelled by an exponential distribution, with
parameter $\lambda = 2$ per five-day working week. The company has had no sales over the last
week, and the manager fears that if no sales are made in the next few days, the company
will be in serious financial difficulty. Throw some light on the situation by computing
an expression for the probability that the number of days between two sales will be $x$
or fewer days. Plot this function. Can you allay the manager's fears?
Example 12C: An estate agent has five houses to sell. She believes her situation can
be modelled by a binomial process, and that her probability of selling each house within
a month is 0.4. Compute and plot the distribution function of the number of sales in
the month. How would this plot help the estate agent?
Example 13C: A student believes that she is equally likely to obtain a mark between
45% and 60% for an examination.
(a) Find the distribution function for the random variable giving the student's
examination mark. Assume that the random variable is continuous!
(b) Use the distribution function to determine the probability that the student obtains
less than 50% for her examination.
Half the density function lies below the median, and half lies above it: the picture makes
this clear. The mean and the median of a random variable only coincide when the density
function is symmetric.
[Figure: a skewed density function $f(x)$ plotted against $x$, with the median $x_m$ and the mean marked; the area on each side of the median is 0.5.]
If $X$ is discrete, $F(x)$ is a step function, and the median is taken to be the lowest
value of $x$ for which $F(x) \ge \frac{1}{2}$.
Example 14A: Find the median of the exponential distribution.
In Example 7A we showed that the distribution function of the exponential
distribution is
$F(x) = 1 - e^{-\lambda x}$.
The median $x_m$ is therefore the solution to the equation
$\frac{1}{2} = 1 - e^{-\lambda x_m}$
Rearranging, this yields
$e^{-\lambda x_m} = \frac{1}{2}$.
Now take natural logarithms to obtain
$-\lambda x_m = \log_e \frac{1}{2}$,
so that, finally, the median is given by
$x_m = 0.6931/\lambda$.
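The constant 0.6931 is $\log_e 2$, so the median can be computed directly and substituted back into $F(x)$ as a check. A minimal sketch using the Python standard library; the function name `exp_median` and the rate used are illustrative assumptions, and the comparison with the mean relies on the standard result that the exponential distribution has mean $1/\lambda$.

```python
from math import exp, log

def exp_median(lam):
    """Median of the exponential distribution: solves 1 - exp(-lam * x) = 1/2."""
    return log(2) / lam

lam = 0.25                       # illustrative rate
x_m = exp_median(lam)
print(x_m, 1 / lam)              # median versus mean 1/lam
print(1 - exp(-lam * x_m))       # F(x_m), should be 0.5
```

The median is smaller than the mean, as expected for a positively skewed distribution.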
Example 15C: Find the distribution functions and medians of the random variables
having the following probability density/mass functions. Compare the medians with the
means.
(a) $p(x) = \binom{4}{x}\, 0.5^4$, $x = 0, 1, 2, 3, 4$
$= 0$ otherwise
(b) $p(x) = \binom{4}{x}\, 0.4^x\, 0.6^{4-x}$, $x = 0, 1, 2, 3, 4$
$= 0$ otherwise
(c) $f(x) = 1/5$, $3 < x < 8$
$= 0$ otherwise
(d) $f(x) = 1/x$, $1 \le x \le e$
$= 0$ otherwise
The normal distribution used to approximate the binomial distribution is the one
with the same mean and variance as the binomial distribution being approximated.
The same principle applies for the normal approximation to the Poisson distribution;
the approximating normal distribution has the same mean and variance as the Poisson
distribution being approximated.
NORMAL APPROXIMATION TO THE POISSON
DISTRIBUTION
If $X \sim P(\lambda)$ and $\lambda > 10$ then
$X \mathrel{\dot\sim} N(\lambda, \lambda)$.
The examples show how to use the approximation to compute binomial and Poisson
probabilities.
Example 16A: If you toss an unbiased coin 20 times what is the probability of exactly
15 heads?
$X$, the number of heads, has a binomial distribution $B(20, \frac{1}{2})$ with mean $np = 10$
and variance $np(1 - p) = 5$. Also $n(1 - p) = 10$. Because $np > 5$ and $n(1 - p) > 5$
and $0.1 < p < 0.9$, we can approximate the distribution of $X$ by means of a normal
distribution which we will denote $Y$. The appropriate normal distribution $Y$ to use to
approximate the binomial distribution $X$ is the one with the same mean and variance
as $X$. Thus we take $\mu = 10$ and $\sigma^2 = 5$ so that $Y \sim N(10, 5)$. We write
$X \mathrel{\dot\sim} N(10, 5)$
and say $X$ is distributed approximately normally with mean 10 and variance 5.
We now have to get around the problem of using a continuous distribution for a
discrete random variable. The probability that the normal distribution takes on the
value 15 is zero. If we are going to obtain a positive probability, we need an interval
over which to evaluate the area under the curve. How do we choose this interval? The
approximation has been designed in such a way that to obtain the probability that
$X = 15$ for the binomial distribution, we calculate $\Pr[14.5 < Y < 15.5]$ for the normal
distribution $N(10, 5)$.
We learnt in Chapter 5 how to find probabilities for any normal distribution; we
transform it to the standard normal distribution, using the formula $z = (x - \mu)/\sigma$ with
$\mu = 10$ and $\sigma = \sqrt{5}$. Thus
$\Pr[14.5 < Y < 15.5] = \Pr\left[\frac{14.5 - 10}{\sqrt{5}} < Z < \frac{15.5 - 10}{\sqrt{5}}\right]$
$= \Pr(2.01 < Z < 2.46)$
the discrete distribution is 30 or more, we obtain the probability that the continuous
distribution exceeds 29.5.
$\Pr[X \ge 30] = \Pr[Y > 29.5] = \Pr[Z > (29.5 - 25)/5]$
= Pr[Z > 0.9]
= 0.1841
The method we have demonstrated to compute the approximate probabilities is
summarized in the block below:
PROCEDURE FOR USING THE APPROXIMATIONS
If the random variable $X$ has a binomial or Poisson distribution and
satisfies the conditions for being approximated by a normally distributed random variable $Y$, then
\[ \Pr[a \le X \le b] = \Pr[a - \tfrac{1}{2} < Y < b + \tfrac{1}{2}] \]
Example 18B: An actuarial life table states that the probability that a 40-year-old
man will die before age 60 years is 0.17. An insurance company insures 300 men aged
40. What is the probability that the number of insured men who will die before age 60
lies between 50 and 60 (inclusive)?
Let the random variable $X$ be the number of the 300 insured men who will die before age 60.
Clearly, $X \sim B(300, 0.17)$.
We calculate $np$, $n(1 - p)$ and $np(1 - p)$:
$np = 300 \times 0.17 = 51$,
$n(1 - p) = 249$,
$np(1 - p) = 42.3$.
The conditions for using the normal approximation are therefore satisfied. Thus
$X \sim B(300, 0.17)$ can be approximated by $Y \sim N(51, 42.3)$.
$\Pr[50 \le X \le 60] = \Pr[49.5 < Y < 60.5]$
$= \Pr[(49.5 - 51)/\sqrt{42.3} < Z < (60.5 - 51)/\sqrt{42.3}]$
$= \Pr[-0.23 < Z < 1.46]$
$= 0.0910 + 0.4279$
$= 0.5189$
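The procedure in the block above can be sketched as a small Python function and applied to Example 18B. The function names are hypothetical, and the normal distribution function is evaluated with the error function instead of tables, so the last decimal place may differ from the hand calculation.

```python
from math import erf, sqrt

def normal_cdf(x, mean, variance):
    """Distribution function of N(mean, variance)."""
    return 0.5 * (1.0 + erf((x - mean) / sqrt(2.0 * variance)))

def approx_prob_between(a, b, mean, variance):
    """Approximate Pr[a <= X <= b] for an integer-valued X by
    Pr[a - 1/2 < Y < b + 1/2], where Y ~ N(mean, variance)."""
    return normal_cdf(b + 0.5, mean, variance) - normal_cdf(a - 0.5, mean, variance)

# Example 18B: X ~ B(300, 0.17), approximated by N(np, np(1 - p))
n, p = 300, 0.17
prob = approx_prob_between(50, 60, n * p, n * p * (1 - p))
print(f"Pr[50 <= X <= 60] is approximately {prob:.4f}")
```

The same function can be used for a Poisson variable by passing the Poisson parameter as both the mean and the variance.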
Example 19C: The average number of newspapers sold at a busy intersection during
rush hour is 5 per minute. What is the probability that in a 15-minute period during
rush hour more than 85 newspapers are sold at the intersection?
Example 20C: A car ferry can accommodate 298 cars. Because bookings are not
always taken up, the operators accept 335 bookings for each ferry crossing, hoping that
no more than 298 cars arrive. If individual bookings are taken up independently with
probability 0.85, what is the probability, when 335 bookings have been made, that more
cars will arrive than can be accommodated for a particular crossing?
Solutions to examples . . .
3C (a) E[X] = 5 > 0, so you should make the call.
(b) $E[X] = 0$ implies $35x + (-15)(1 - x) = 0$, so that $x = 0.3$.
4C (a)
(b)
(c)
(d)
(e)
= x 0x1
=1
x>1
2
x<0
0x<1
1x<2
2x<3
3x<4
4x<5
x5
13C (a) $f(x) = 1/15$ for $45 < x < 60$ (and zero otherwise).
$F(x) = 0$, $x < 45$
$= (x - 45)/15$, $45 \le x \le 60$
$= 1$, $x > 60$
(b) 0.333
15C (a) The distribution function is
$F(x) = 0$, $x < 0$
$= 0.0625$, $0 \le x < 1$
$= 0.3125$, $1 \le x < 2$
$= 0.6875$, $2 \le x < 3$
$= 0.9375$, $3 \le x < 4$
$= 1$, $x \ge 4$
$x_m = 2$
=2
x<0
0x<1
1x<2
2x<3
3x<4
x4
= 1.6
x3
3<x<8
x8
= 5.5
x<1
1xe
xe
= 1.7183
For the symmetric distributions, (a) and (c), the mean and median coincide.
19C 0.1131
20C 0.0179
Exercises . . .
6.1 Find the mean, the variance, the distribution function and the median of the
random variables having the following probability density/mass functions.
(a) $f(x) = x/50$, $0 < x < 10$
$= 0$, otherwise
(b) $p(x) = (x - 1)/15$, $x = 1, 2, \ldots, 6$
$= 0$, otherwise
(c) $p(x) = 0.1$, $x = 10$ or $x = 40$
$= 0.3$, $x = 20$
$= 0.5$, $x = 30$
$= 0$, otherwise
(d) $f(x) = 3x^2/125$, $0 < x < 5$
$= 0$, otherwise
(e) $f(x) = 10x^9$, $0 < x < 1$
$= 0$, otherwise
6.2
6.3
6.4
For the opportunity to roll a die you pay n cents. If you score a six, you get
a reward of R2.00, and get your n cents returned. How much should you pay
in order to make this a fair game? (Note: a game is defined to be fair if the
expected gain is zero.)
6.5
6.6 Find the mean and the variance of the exponential distribution.
6.7 Find the mean of the normal distribution.
6.8
What is the probability of obtaining more than 300 sixes in 1620 tosses of a fair
die?
6.9
Experience has shown that 15% of travellers reserving flights with Wildebeest
Airlines do not take up their seats. If each plane has 50 seats and 58 bookings are
accepted beforehand, what is the probability that everyone wishing to make the
flight can be accommodated?
6.10
6.11 Each kilogram of grass seeds contains an average of 200 weed seeds. What is the
probability that a kilogram of seeds contains more than 225 weed seeds? (Assume
that the number of weed seeds has a Poisson distribution.)
6.12 A holiday farm has 110 bungalows. During the winter holidays, the farm has an
average occupancy of 60 bungalows per night. Assuming that the random variable
giving the number of bungalows let per night has a binomial distribution, compute
the probabilities that
(a) fewer than 70 bungalows are let for a night, and
(b) between 65 and 75 bungalows (inclusive) are let for a night.
6.13 Prove that the mean and the median of a continuous random variable having a
symmetric probability distribution function are equal.
6.14
6.17 A fair coin is tossed 250 times. Find the probability that the number of heads will
not differ from 125 by
(a) more than 10
(b) less than 7.
6.18 On average, a newspaper is delivered late twice per month. Calculate the approximate probability that the paper will be late more than 30 times in a year.
f (x) = Ax + B
Find
Find
Find
Find
the
the
the
the
6.20 Consider the following game. You pay an amount $x$, and then toss a coin until
a head appears. If a head is obtained on the first or second throw, you lose. If a
head is obtained on the third or fourth throw, you win R1. If a head appears on
the fifth or a subsequent throw, you win R5.
(a) If x = 75c, what are your expected winnings (or losses) per game?
(b) What should x be to make the game a fair game?
6.21 The random variable $X$ has the probability mass function (known as the truncated Poisson distribution)
$p(x) = \dfrac{e^{-\lambda} \lambda^x}{x!\,(1 - e^{-\lambda})}$, $x = 1, 2, \ldots$
$= 0$, otherwise
(a) Check that the conditions for a probability mass function are satisfied.
(b) Find the mean of the random variable $X$.
...
6.22 (a) Show that the variance of the random variable $X$ can be expressed as
$\mathrm{Var}[X] = E[X(X - 1)] + E[X] - (E[X])^2$,
where $E[X(X - 1)]$ means $\sum_x x(x - 1)p(x)$.
(b) Use this result to find the variance of the binomial and Poisson distributions:
(i) if $X \sim B(n, p)$, then $\mathrm{Var}(X) = np(1 - p)$
(ii) if $X \sim P(\lambda)$, then $\mathrm{Var}(X) = \lambda$.
6.23 (a) Show that the probability density function derived in Example 10B
$f(x) = \lambda^2 x e^{-\lambda x}$, $x \ge 0$
$= 0$, otherwise
satisfies the conditions for being a probability density function.
(b) Find the mean and variance of the random variable $X$, the waiting time to
the second event.
6.24 Generalize the results of Example 10B to determine the probability density function
of the waiting time to the n-th event.
Solutions to exercises . . .
6.1 (a)
(b)
(c)
(d)
(e)
6.2 (a) $k = 48$
(b) $\mu = \frac{1}{4}$ and $\sigma^2 = 1/80$
(c) The distribution function is
$F(x) = 0$, $x < 0$
$= 12x^2 - 16x^3$, $0 \le x \le \frac{1}{2}$
$= 1$, $x > \frac{1}{2}$
The median $x_m = \frac{1}{4}$
6.3 $A = 0$, $B = 1$
6.4 40 cents
6.5
(b) 0.0478
(c) 0.0351
6.11 0.0359
6.12 (a) 0.966
(b) 0.193
0.4122
6.18 0.0918
6.19 (a) $A = -2/9$, $B = 2/3$
(b) $\sigma^2 = 0.5$
(c) The distribution function is given by
$F(x) = 0$, $x < 0$
$= -x^2/9 + 2x/3$, $0 \le x < 3$
$= 1$, $x \ge 3$
(d) $x_m = 0.8787$
6.20 (a) Expected loss of 25c per game.
(b) 50c
6.21 (b) $E(X) = \lambda/(1 - e^{-\lambda})$
6.23 (b) $E(X) = 2/\lambda$, $\mathrm{Var}(X) = 2/\lambda^2$
6.24 $F_n(x) = 1 - \sum_{k=0}^{n-1} e^{-\lambda x} (\lambda x)^k / k!$, where $F_n(x)$ is the distribution function of the
waiting time to the $n$th event. The density function is found by differentiation:
$f_n(x) = \lambda^n x^{n-1} e^{-\lambda x}/(n - 1)!$, $x \ge 0$
$= 0$, otherwise
Chapter
Recap . . .
To date we have learnt about four important probability distributions, two of which
were discrete, the binomial and Poisson distributions, and two of which were continuous,
the exponential and normal distributions. There are very many other useful probability
distributions and in this chapter we extend our repertoire.
Have you ever wondered why, when you are looking for something, you always find it in the last
This result, with its negative exponent, is used to prove the condition PMF3, that
$\sum_{x=0}^{\infty} p(x) = 1$. In the same way that the binomial theorem gave its name to the
binomial distribution, this mathematical result has transferred its name to the negative
binomial distribution. There is nothing else about the negative binomial distribution
that is negative!
Example 1A: A market research company requires each of its fieldworkers to conduct
10 interviews per day. Not everybody approached by a fieldworker agrees to participate
in an interview. In fact, only 60% of approaches lead to an interview. What is the
probability that the 10th interview is obtained from the 15th person approached?
Let the random variable $X$ be the number of failures before 10 successes. Here
$r = 10$, $p = 0.6$, so that $X \sim NB(10, 0.6)$. We want to find $\Pr[X = 5]$:
\[ \Pr[X = 5] = \binom{5 + 10 - 1}{5}\, 0.6^{10}\, 0.4^{5} = 0.1240. \]
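The calculation in Example 1A can be sketched in Python as follows. The function name `negative_binomial_pmf` is an illustrative choice; it simply evaluates the mass function used in the example, with $x$ failures before the $r$th success.

```python
from math import comb

def negative_binomial_pmf(x, r, p):
    """Pr[X = x], where X counts the failures before the r-th success
    and each trial succeeds independently with probability p."""
    return comb(x + r - 1, x) * p**r * (1 - p) ** x

# Example 1A: the 10th interview is obtained from the 15th person,
# i.e. 5 refusals (failures) occur before the 10th success.
prob = negative_binomial_pmf(5, r=10, p=0.6)
print(f"Pr[X = 5] = {prob:.4f}")
```

Summing the function over a long range of $x$ values gives a quick numerical check that the probabilities add up to one.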
place that you look for it? Answer! Because once you have found it, you (hopefully!) stop looking for
it. Similarly for a negative binomial process.
The special case of the negative binomial distribution when r = 1 is called the geometric distribution. Check for yourself that the mass function simplifies to the function
given below:
GEOMETRIC DISTRIBUTION
Under the same conditions as for the negative binomial distribution,
let the random variable $X$ be the number of trials before the first
success. Then $X$ has the geometric distribution with parameter $p$,
$X \sim G(p)$, and has probability mass function
$p(x) = p q^x$, $x = 0, 1, 2, \ldots$
$= 0$, otherwise
Example 2A: Suppose that you interview job applicants in succession until you find a
person that satisfies the job description. Suppose that, at each interview, the probability
of finding the right person is 0.3.
(a) What is the probability that you appoint the third person you interview?
(b) What is the probability that you will need to do five or more interviews?
(a) Let $X$ be the number of trials before you succeed. Clearly $X \sim G(0.3)$. We want
$\Pr[X = 2]$:
$\Pr[X = 2] = 0.3 \times 0.7^2 = 0.147$
(b) Needing at least five interviews means that the number of
unsuccessful interviews must be four or more.
$\Pr[X \ge 4] = 1 - \Pr[X \le 3]$
$= 0.2401$
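Both parts of Example 2A can be checked with a short Python sketch. The function names are hypothetical; `geometric_tail` uses the observation that $X \ge k$ happens exactly when the first $k$ trials are all failures, which is an equivalent route to the complement used in the example.

```python
def geometric_pmf(x, p):
    """Pr[X = x]: x failures followed by the first success."""
    return p * (1 - p) ** x

def geometric_tail(k, p):
    """Pr[X >= k]: the first k trials are all failures."""
    return (1 - p) ** k

p = 0.3
part_a = geometric_pmf(2, p)     # third person interviewed is appointed
part_b = geometric_tail(4, p)    # five or more interviews are needed
# The same tail probability via the complement, as in the worked example
part_b_complement = 1 - sum(geometric_pmf(x, p) for x in range(4))

print(f"(a) {part_a:.4f}")
print(f"(b) {part_b:.4f}  (via complement: {part_b_complement:.4f})")
```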
Example 3B: Some notebook computers make use of an active colour display on
their screens. One reason why these computers are expensive is that the manufacturing
process is so delicate that many of the screens produced are defective, and have to be
discarded when tested during the assembly process. Only 58% of all screens produced
are free from defects and can be used. If an order is placed for five notebook computers,
what is the probability that
(a) the fifth non-defective screen is the eighth screen tested?
(b) no more than nine screens are required?
Let the random variable $X$ be the number of defective screens tested before the fifth
non-defective screen is found. Thus $X$ has the negative binomial distribution:
$X \sim NB(5, 0.58)$.
(a) We want $\Pr[X = 3] = \binom{3 + 5 - 1}{3}\, 0.58^5\, 0.42^3$
\[ \sum_{x=0}^{4} \binom{x + 4}{x}\, 0.58^5\, 0.42^x \]
M
$n - x$ articles from $N - M$ in $\binom{N - M}{n - x}$ ways. Thus the total number of ways in which
the event $X = x$ can occur is
\[ \binom{M}{x} \binom{N - M}{n - x} \]
The total number of ways in which a sample of size $n$ can be drawn from $N$ articles is
$\binom{N}{n}$. Thus using the rule for computing probabilities when the elementary events are
equiprobable, we have
\[ \Pr[X = x] = \binom{M}{x} \binom{N - M}{n - x} \Big/ \binom{N}{n}. \]
HYPERGEOMETRIC DISTRIBUTION
Given a population of size $N$, of which $M$ are defective, a sample
of size $n$ ($n \le N$) is drawn. Let the random variable $X$ be the
number of defectives in the sample. Then $X$ has the hypergeometric
distribution with parameters $N$, $M$ and $n$, $X \sim H(N, M, n)$, and $X$
has probability mass function
$p(x) = \binom{M}{x} \binom{N - M}{n - x} \Big/ \binom{N}{n}$, $x = 0, 1, \ldots, n$
$= 0$, otherwise
Example 8A: A fisherman caught 10 lobsters, 3 of which were undersized. An inspector
of the Sea Fisheries Branch measured a random sample of 4 lobsters. What is the
probability that the sample contains no undersized lobsters?
Here $N = 10$, $M = 3$ and $n = 4$. If $X$ is the number of undersized lobsters in the
sample of 4, then
\[ \Pr[X = 0] = \binom{3}{0} \binom{10 - 3}{4 - 0} \Big/ \binom{10}{4} = 0.1667 \]
Conversely, the probability that the inspector finds at least one undersized lobster is
$1 - 0.1667 = 0.8333$.
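The hypergeometric mass function translates directly into a few lines of Python. The sketch below is illustrative only (the function name is an assumption, not IntroSTAT notation) and reproduces the lobster calculation of Example 8A.

```python
from math import comb

def hypergeometric_pmf(x, N, M, n):
    """Pr[X = x] for X ~ H(N, M, n): x defectives in a sample of size n
    drawn without replacement from N items, M of which are defective."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

# Example 8A: 10 lobsters, 3 undersized, sample of 4
p_none = hypergeometric_pmf(0, N=10, M=3, n=4)
p_at_least_one = 1 - p_none

print(f"Pr[X = 0]  = {p_none:.4f}")
print(f"Pr[X >= 1] = {p_at_least_one:.4f}")
```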
Example 9B: A team of 15 people is chosen from a class of 65 MBA students to play
a social rugby match. The class contains 25 engineers. What is the probability that the
team contains
(a) four engineers?
(b) at least four engineers?
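(a) Let $X$ be the number of engineers in the sample. Then $N = 25 + 40 = 65$, $M = 25$
and $n = 15$, so that $X \sim H(65, 25, 15)$. Thus
\[ \Pr[X = 4] = \binom{25}{4} \binom{40}{11} \Big/ \binom{65}{15} = 0.1410. \]
(b) The probability that $X$ is 4 or more is
\[ \Pr[X \ge 4] = 1 - \Pr[X \le 3] = 1 - \{p(0) + p(1) + p(2) + p(3)\} \]
\[ = 1 - \left\{ \binom{25}{0}\binom{40}{15} + \binom{25}{1}\binom{40}{14} + \binom{25}{2}\binom{40}{13} + \binom{25}{3}\binom{40}{12} \right\} \Big/ \binom{65}{15} \]
\[ = 0.9176 \]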
Example 10C: There are addition errors in 3 out of a total of 32 invoices. An auditor
checks a random sample of 10 invoices. What are the probabilities of finding (a) 0, (b)
1, (c) 2 and (d) all 3 errors in the sample?
If N is much larger than n, then the difference between using the binomial distribution (which assumes that sampling is done with replacement) and using the
hypergeometric distribution is small. If n/N < 0.1, so that the size of the sample is less
than 10% of the total population size, then the binomial distribution may be used to
give satisfactory approximations to hypergeometric probabilities:
BINOMIAL APPROXIMATION TO THE HYPERGEOMETRIC
DISTRIBUTION
If $X \sim H(N, M, n)$ and if $n/N < 0.1$, then $X \mathrel{\dot\sim} B(n, p)$ with
$p = M/N$
Example 11B: A company is established in two cities, Johannesburg and Cape Town.
The total staff complement is 240, of which only 32 are based in Cape Town. If 10
of the total of 240 staff are randomly selected to attend a course on VAT, what is
the probability that two of the 10 are from Cape Town? Calculate both exact and
approximate probabilities.
Let $X$ be the number of staff members selected from Cape Town.
$X \sim H(240, 32, 10)$
The exact probability is given by
\[ \Pr[X = 2] = \binom{32}{2} \binom{208}{8} \Big/ \binom{240}{10} = 0.2604 \]
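To see how close the binomial approximation comes in Example 11B, the following Python sketch evaluates both the exact hypergeometric probability and the binomial probability with $p = M/N = 32/240$. The function names are illustrative assumptions.

```python
from math import comb

def hypergeometric_pmf(x, N, M, n):
    """Exact Pr[X = x] for X ~ H(N, M, n)."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

def binomial_pmf(x, n, p):
    """Pr[X = x] for X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

N, M, n = 240, 32, 10          # staff, Cape Town staff, sample size
assert n / N < 0.1             # condition for using the approximation

exact = hypergeometric_pmf(2, N, M, n)
approx = binomial_pmf(2, n, M / N)

print(f"exact (hypergeometric)  = {exact:.4f}")
print(f"approximate (binomial)  = {approx:.4f}")
```

Because the sample is only about 4% of the population, the two answers should differ only in the second or third decimal place.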
$f(x) = 1/(b - a)$, $a \le x \le b$
$= 0$, otherwise
Example 12A: Suppose the mass of a nominally 500 g tub of margarine is equally
likely to take on any value in the interval (495, 510). What is the probability that a
randomly chosen tub will have a mass less than 500 g?
Let the random variable $X$ be the mass of a tub of margarine. Because $X \sim U(495, 510)$,
$f(x) = \frac{1}{510 - 495} = 1/15$, $495 < x < 510$
$= 0$, otherwise
\[ \Pr[X < 500] = \int_{495}^{500} \frac{1}{15}\, dx = \left[ \frac{x}{15} \right]_{495}^{500} = (500 - 495)/15 = 1/3 \]
(a) Show that the function given for the uniform distribution satisfies the conditions
for a probability density function.
(b) Show that $E[X] = \frac{1}{2}(b + a)$ and that $\mathrm{Var}[X] = (b - a)^2/12$.
(c) Show that the distribution function is given by
$F(x) = 0$, $x < a$
$= (x - a)/(b - a)$, $a \le x \le b$
$= 1$, $x > b$
Example 14C: An investor knows that his share portfolio is equally likely to yield an
annual return anywhere in the interval between 5% and 35%. The fixed deposit rate is
13.5%. What is the probability that he would be better off investing his funds in a fixed
deposit account (rather than in his share portfolio) over the forthcoming year?
Example 15C: The final mark ($Y$) for a particular statistics course comprises a
30% weighting for the class record and a 70% weighting for the examination. A student
believes that she is equally likely to obtain a mark anywhere between 45% and 65% for
her examination ($X$).
(a) Find the probability density function for her final mark (Y ) if she has a class
record of 50%.
(b) Find the distribution function of her final mark.
(c) Find the probability she gets a third-class pass (between 50% and 60%) as a final
mark.
(d) Find the probability that she gets a lower second (between 60% and 70%).
(e) What is the probability that she fails (below 50%)?
Solutions to examples . . .
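5C $\Pr[\text{passing on 5th attempt}] = \Pr[X = 4] = 0.55^4 \times 0.45 = 0.0412$
6C $\Pr[x \text{ packets}] = \Pr[x - 1 \text{ failures}] = \frac{1}{16} \left( \frac{15}{16} \right)^{x-1}$
7C $X \sim NB(7, 0.65)$
(a) $\Pr[X = 3] = 0.1766$
(b) $\Pr[X > 3] = 1 - \Pr[X \le 3] = 0.4862$
(c) 0.0297 and 0.9452
10C $\Pr[x \text{ errors}] = p(x) = \binom{3}{x} \binom{29}{10 - x} \Big/ \binom{32}{10}$
(a) $p(0) = \binom{3}{0} \binom{29}{10} \Big/ \binom{32}{10} = 0.3105$
(b) $p(1) = \binom{3}{1} \binom{29}{9} \Big/ \binom{32}{10} = 0.4657$
(c) $p(2) = 0.1996$
(d) $p(3) = 0.0242$
(Note that these probabilities sum to 1.)
14C $\Pr(5 < X < 13.5) = 0.283$
15C (a) The random variable $Y \sim U(46.5, 60.5)$.
$f(y) = \frac{1}{60.5 - 46.5} = \frac{1}{14}$, $46.5 < y < 60.5$
$= 0$, otherwise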
Exercises . . .
7.1
What is the probability that a fair die is tossed z times before a 6 appears
(a) for the first time
(b) for the second time
(c) for the r th time?
7.2
A parliamentary candidate needs to collect 300 signatures before he can be nominated. If the probability that a voter approached at random will give a signature
is 0.15, what is the probability that 1300 voters need to be approached before the
300th signature is collected?
7.3 By assuming that the negative binomial distribution is a probability mass function
show that
\[ \sum_{x=0}^{\infty} \binom{x + r - 1}{x} q^x = (1 - q)^{-r}. \]
7.4 Show that if $X \sim NB(r, p)$, then $E[X] = rq/p$ and (more difficult) that $\mathrm{Var}[X] = rq/p^2$.
(Hint: the same procedure as was used for finding the mean and variance of the
binomial and Poisson distributions is applied here.)
7.5 In exercise 7.2, what number of voters can be expected to refuse to sign before 300
signatures have been collected?
7.6 The Blood Transfusion Service knows that 6.3% of the population belong to the
A-negative blood group.
(a) If people donate blood at random, what is the probability that x people will
not belong to the A-negative group before
(i) the first A-negative donor
(ii) the fourth A-negative donor?
(b) What is the expected number of donors not having A-negative blood before
(i) the first A-negative donor
(ii) the fourth A-negative donor?
7.7
A company that specializes in the breeding of fish needs to estimate the number
of fish in one of the dams on their fish farm. In order to estimate the number of
fish N in the dam, M are caught, marked and released. After sufficient time has
elapsed for the marked fish to mix thoroughly, fish are caught one by one until r
marked fish have been caught. The fish are released as soon as they have been
examined for marks.
(a) What is the probability that x unmarked fish are examined before r marked
fish are caught?
(b) What is the expected number of unmarked fish that need to be examined
before r marked fish are caught?
(c) What, therefore, would you suggest as an estimate of N , the total population?
(d) If 300 fish are marked, and 189 unmarked fish are caught before 50 marked
fish are caught, what is your estimate of the total population?
7.8 (a) Plot the bar graph and the distribution function of the negative binomial
distribution with parameters
(i) r = 4
p = 0.8
(ii) r = 4
p = 0.5.
(b) Thus determine the medians of these two negative binomial distributions.
7.9 A small shop has 10 cartons of milk left, of which three are sour. Unaware of this,
you ask for four cartons of milk.
(a) If the four cartons are selected at random, what is the probability that you
get x cartons of sour milk?
(b) Evaluate these probabilities, and plot them as a bar graph.
7.10 If $X \sim H(N, M, 2)$ write down the probability mass function, and show that it
sums to one.
7.11 Show that the mean of the random variable $X$ having the hypergeometric distribution $H(N, M, n)$ is $nM/N$. It is much more difficult to show that the variance
is given by $n \frac{M}{N} \left( 1 - \frac{M}{N} \right) \frac{N - n}{N - 1}$.
7.12
7.13 A company employs five male and three female computer programmers. Four of
the eight programmers are selected at random to serve on a committee. One of
the four is chosen from the committee to report to the manager. If this person is
female, find the conditional probability that the committee consists of two males
and two females.
(Hint: use Bayes' theorem and the hypergeometric distribution.)
7.14 A manufacturer of light bulbs reports that among a consignment of 10 000 sent to
a supermarket, 2500 were faulty. A shopper selects 10 of these bulbs at random.
What is the approximate probability that more than 2 are faulty?
7.15 A child plays with a pair of scissors and a piece of string 10 cm long. He cuts the
string into two at a randomly chosen place.
(a) What is the probability that the piece of string to the left of the pair of scissors
is less than 4 cm long?
(b) What is the probability that the shorter piece of string is less than 2.5 cm
long?
7.16
A taxi travels between two cities A and B which are 100 km apart. There are
service stations at A and B and at the midpoint of the route. If the taxi breaks
down, it does so at random at any point along the route between the cities. If a
tow truck is dispatched from the nearest service station, what is the probability
that it has to travel more than 15 km to reach the taxi?
7.17 A radio station announces the time every 15 minutes between midnight and 06h00.
If you wake up at random in the early hours of the morning and switch on the
radio, what is the probability that you have to wait less than 5 minutes to find out
the time?
7.18
Between 08h00 and 09h00 buses leave the residence for the university at the
following numbers of minutes past 08h00:
00 03 05 07 10 12 15 30 37 55 60
(a) Calculate the probability of having to wait less than 2 minutes for a bus, if you
arrive at Mowbray station at a time uniformly distributed over the interval
(i) 08h00 to 09h00
(ii) 08h00 to 08h20
(iii) 08h02 to 08h30
(b) Calculate the probability of having to wait less than 5 minutes for a bus if
you arrive between 08h00 and 08h35.
Solutions to exercises . . .
7.1 (a) $\left( \frac{1}{6} \right) \left( \frac{5}{6} \right)^{z-1}$
(b) $\binom{z - 1}{z - 2} \left( \frac{1}{6} \right)^2 \left( \frac{5}{6} \right)^{z-2}$
(c) $\binom{z - 1}{z - r} \left( \frac{1}{6} \right)^r \left( \frac{5}{6} \right)^{z-r}$. Remember that the negative binomial distribution counts the
number of failures before $r$ successes. Here $z = x + r$ trials.
7.2 The number of refusals $X \sim NB(300, 0.15)$. Therefore $p(1000) = \binom{1299}{1000}\, 0.15^{300}\, 0.85^{1000}$.
7.3 Note that this result is analogous to the binomial theorem expansion of $(x + y)^n$
for negative values of $n$.
7.5 If $X \sim NB(300, 0.15)$, then $E(X) = 300 \times 0.85/0.15 = 1700$.
7.6 (a) (i) $0.063 \times 0.937^x$
(ii) $\binom{x + 3}{x}\, 0.063^4\, 0.937^x$.
(b) (i) $q/p = 14.9$
(ii) $qr/p = 59.5$.
(ii) xm = 3.
p(1) = 0.500
p(2) = 0.300
p(3) = 0.033.
$p(1) \approx \binom{5}{1} \left( \frac{7}{60} \right)^1 \left( \frac{53}{60} \right)^4 = 0.3552$.
(c) Percentage error $= \frac{0.3753 - 0.3552}{0.3752} \times 100 = 5.4\%$
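7.13 Let $A_i$ be the event "the committee contains $i$ females". Let $B$ be the event "the
computer programmer is female". Then
$\Pr(B|A_0) = 0$, $\Pr(B|A_1) = \frac{1}{4}$, $\Pr(B|A_2) = \frac{1}{2}$, $\Pr(B|A_3) = \frac{3}{4}$.
Also:
$\Pr(A_0) = \binom{3}{0} \binom{5}{4} \Big/ \binom{8}{4} = 5/70$
$\Pr(A_1) = \binom{3}{1} \binom{5}{3} \Big/ \binom{8}{4} = 30/70$, etc.
By Bayes' theorem,
\[ \Pr(A_2|B) = \Pr(B|A_2) \Pr(A_2) \Big/ \sum_{i=0}^{3} \Pr(B|A_i) \Pr(A_i) \]
\[ = \left( \frac{1}{2} \times \frac{30}{70} \right) \Big/ \left( 0 \times \frac{5}{70} + \frac{1}{4} \times \frac{30}{70} + \frac{1}{2} \times \frac{30}{70} + \frac{3}{4} \times \frac{5}{70} \right) \]
\[ = 0.5714 \]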
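The Bayes calculation for exercise 7.13 can be reproduced with exact fractions. This is only a sketch under the assumptions of the solution above: the committee composition follows a hypergeometric distribution, and the reporter is equally likely to be any of the four committee members. The variable names are illustrative.

```python
from fractions import Fraction
from math import comb

males, females, committee = 5, 3, 4

# Prior Pr(A_i): i females on the committee (hypergeometric probabilities)
prior = {
    i: Fraction(comb(females, i) * comb(males, committee - i),
                comb(males + females, committee))
    for i in range(females + 1)
}

# Likelihood Pr(B | A_i): the chosen reporter is one of the i females
likelihood = {i: Fraction(i, committee) for i in prior}

# Denominator of Bayes' theorem: Pr(B) = sum of Pr(B | A_i) Pr(A_i)
evidence = sum(likelihood[i] * prior[i] for i in prior)

posterior_two_females = likelihood[2] * prior[2] / evidence
print(posterior_two_females, float(posterior_two_females))
```

Using `Fraction` keeps every intermediate value exact, so rounding only happens when the final answer is converted to a decimal.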
7.15 Let $X$ be the length of string to the left of the pair of scissors. $X \sim U(0, 10)$.
(a) Pr[X < 4] = 0.4
(b) 0.5
7.16 0.4
7.17 0.33
7.18 (a) (i) 0.333
(ii) 0.600
(iii) 0.464
(b) 0.657.
Chapter
value for the population mean and the one true value for the population variance.
Usually, it is impracticable to do a census of every member of a population to determine
the population mean. The standard procedure is to take a random sample from the
population of interest and estimate $\mu$, the population mean, by means of $\bar{x}$, the sample
mean.
Statistics . . .
We remind you again of the special definition, within the subject Statistics, of a
statistic. A statistic is defined as any value computed from the elements of a random
sample. Thus $\bar{x}$ and $s^2$ are examples of statistics.
$\bar{X}$ . . .
We argued above that the population parameter $\mu$ has a fixed value. But the sample
mean $\bar{x}$, the statistic which estimates $\mu$, depends on the particular sample drawn, and
therefore varies from sample to sample. Thus the sample mean $\bar{x}$ is a random variable:
it takes on different values for different random samples. In accordance with our custom
of using capital letters for random variables and small letters for particular values of
random variables, we will now start referring to the sample mean as the random variable
$\bar{X}$.
Because the statistic $\bar{X}$ is a random variable it must have a probability distribution.
We have a special name for the probability distributions of statistics. They are called
sampling distributions. This name is motivated by the fact that statistics depend on
samples.
In order to find the sampling distribution of $\bar{X}$, consider the following. Suppose
that we take a sample of size $n$ from a population which has a normal distribution with
known mean $\mu$ and known variance $\sigma^2$. Apart from dividing by $n$, which is a fixed
number, we can then think of $\bar{X}$ as the sum of $n$ values, let us call them $X_1, X_2, X_3,
\ldots, X_n$, each of which has a normal distribution, with mean $\mu$ and variance $\sigma^2$; i.e.
$X_i \sim N(\mu, \sigma^2)$. In Chapter 5, we stated that the sum of independent normally distributed
random variables also has a normal distribution; the mean of the sum is the sum of the means, and the variance of
the sum is the sum of the variances. Thus
\[ \sum_{i=1}^{n} X_i \sim N(n\mu, n\sigma^2) \]
because the means and variances are all equal. But we do not want the distribution of
$\sum_{i=1}^{n} X_i$; we want the distribution of
\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i. \]
We also saw in Chapter 5 that if the random variable $X \sim N(\mu, \sigma^2)$, then
$aX \sim N(a\mu, a^2\sigma^2)$. Applying this result with $a = 1/n$, we have
\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \sim N\left( \mu, \frac{\sigma^2}{n} \right). \]
This is true for all values of $n$. The bottom line is that if a sample of any size is taken
from a population having a normal distribution with mean $\mu$ and variance $\sigma^2$, then the
mean of that sample will also have a normal distribution, with the same mean $\mu$, but
variance $\sigma^2/n$.
But what happens if we take a sample from a population that does not have a normal
distribution? Suppose that we do however know that this distribution has mean $\mu$ and
variance $\sigma^2$. Suppose we take a sample of size $n$, and compute the sample mean $\bar{X}$. As
mentioned above, apart from division by the sample size, the sample mean $\bar{X}$ consists
of $\sum_{i=1}^{n} X_i$, the sum of $n$ random variables. In Chapter 5, we mentioned the central
limit theorem, which states that the sum of a large number of independent random variables
has approximately a normal distribution. Thus, by the central limit theorem, the sample
mean has approximately a normal distribution if the sample size is large enough. A sample size of
30 or more is usually large enough for the distribution of the sample mean to be treated
as normal; the approximation to a normal distribution is often
good even for much smaller samples. It can be shown that if we draw a sample of $n$
observations from a population which has population mean $\mu$ and variance $\sigma^2$ then $\bar{X}$
can be modelled by a normal distribution with mean $\mu$ and variance $\sigma^2/n$, i.e.
\[ \bar{X} \mathrel{\dot\sim} N(\mu, \sigma^2/n) \]
(the sample mean is approximately distributed normally, with mean $\mu$ and variance
$\sigma^2/n$). The approximation is usually very good for $n \ge 30$. But if the population
which is being sampled has a normal distribution, then
\[ \bar{X} \sim N(\mu, \sigma^2/n) \]
for all values of $n$, including small values.
This is a powerful result. Firstly, it tells us about the sampling distribution of $\bar{X}$,
regardless of the distribution of the population from which we sample. The approximation becomes very good as the sample size increases. If the population which we sampled
had a distribution which was similar to a normal distribution, the approximation is good
even for small samples. If the sampled population had exactly a normal distribution,
then $\bar{X}$ too has exactly a normal distribution for all sample sizes. But even when the
population from which the samples were taken looks nothing like a normal distribution,
the approximation gets better and better as the sample size increases.
Secondly, (and remember that we are thinking of $\bar{X}$ as a random variable and that
therefore it has a mean) it tells us that the mean of the sample mean $\bar{X}$ is the same
as the mean of the population from which we sampled. The sample mean is therefore
likely to be close to the true population mean. In simple terms, the sample mean is a
good statistic to use to estimate the population mean.
Thirdly, the inverse relation between the sample size and variance of the sample
mean has an important practical consequence. With a large sample size, the sample
mean is likely, on average, to be closer to the true population mean than with a small
sample size. In crude terms, sample means based on large samples are better than sample
means based on small samples.
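The three points above can be explored with a small simulation. The sketch below is illustrative only: it assumes an exponential population with mean 1 and variance 1 (a distribution that looks nothing like a normal distribution), and the function name, seed and number of repetitions are arbitrary choices. It repeatedly draws samples of size $n$ and examines the mean and variance of the resulting sample means.

```python
import random
from statistics import fmean, pvariance

def simulate_sample_means(n, repetitions=5000, seed=2010):
    """Draw `repetitions` samples of size n from an exponential population
    (mean 1, variance 1) and return the list of their sample means."""
    rng = random.Random(seed)
    return [
        fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(repetitions)
    ]

for n in (5, 30):
    means = simulate_sample_means(n)
    print(
        f"n = {n:2d}: mean of sample means = {fmean(means):.3f}, "
        f"variance of sample means = {pvariance(means):.4f}, "
        f"sigma^2 / n = {1 / n:.4f}"
    )
```

The mean of the simulated sample means should sit close to the population mean of 1 for both sample sizes, while their variance should be close to $\sigma^2/n$ and therefore much smaller for $n = 30$ than for $n = 5$.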
The rest of this chapter, and all of the next, builds on our understanding of the
distribution of $\bar{X}$. This is a fairly deep concept, and many students take a while before
they really understand it. The above section should be revisited from time to time. Each
time, you should peel off a new layer of the onion of understanding.
Let us now consider various problems associated with the estimation of the mean.
For the remainder of this chapter we will make the (usually unrealistic) assumption that
even though the population mean is unknown, we do in fact know the value of the
population variance $\sigma^2$. In the next chapter, we will learn how to deal with the more
realistic situation in which both the population mean and variance are unknown.
Estimating an unknown population mean when the population variance is assumed known . . .
We motivate some theory by means of an example.
Example 1A: We wish to estimate the population mean of travelling times between
home and university. For the duration of this chapter we have to assume the population
variance $\sigma^2$ known, so let us suppose $\sigma = 1.4$ minutes. On a sample of 40 days, we use
a stopwatch to measure our travelling time, with the following results (time in minutes):
17.2  18.1  16.4  19.1  18.3  16.9  19.3  17.0
16.8  17.2  17.5  20.5  15.9  17.6  18.1  16.1
17.4  15.8  17.2  18.7  16.8  18.4  17.7  19.0
16.3  17.9  17.1  17.3  18.4  16.5  16.0  17.6
19.1  17.9  18.1  18.2  29.3  16.8  22.3  16.5
Adding these numbers and dividing by 40 gives the sample mean $\bar{X} = 17.76$.
Fine, $\bar{X}$ estimates $\mu$, so we have what we wanted. But how good is this estimate?
How much confidence can we place in it? To answer this question we need to make use
of the sampling distribution of $\bar{X}$. We shall see that we can use this distribution to form
an interval of numbers, called a confidence interval, so that we can make a statement
about the probability that the confidence interval contains the population mean.
Confidence intervals . . .
In many situations, in business and elsewhere, a sample mean by itself is not very
useful. Such a single value is called a point estimate. For example, suppose that the
breakeven point for a potential project is R20 million in revenues, and that we are given
a point estimate for revenues of R22 million. Our dilemma is that this point estimate
is subject to variation: the true value may well be below R20 million. Clearly, it would be far more
helpful in taking a decision if we could be given an interval of values and a probability
statement about the likelihood that the interval contains the true value. If we could
be told that the true value for revenue is likely to lie in the interval R21 million to
R23 million, we would decide to go ahead with the project. But if we were told it was
likely that the true value lay in the interval R12 million to R32 million, we would most
certainly want to get better information before investing in the project. Notice that, for
both of these intervals, the point estimate of revenues, R22 million, lies at the midpoint
of the interval. Our hesitation to invest, given the second scenario, is not due to the
position of the midpoint, but to the width of the interval.
The most commonly used probability associated with confidence intervals is 0.95,
and we then talk of a 95% confidence interval. This means that the probability is 0.95
that the confidence interval will contain the true population value. Conversely, the
probability that the confidence interval does not contain the population mean is 0.05.
Put another way, the confidence interval will not contain the population mean 1 time in
20, on average. Let us develop the theory for setting up such a confidence interval for
the mean (assuming, as usual for this chapter, that $\sigma^2$ is known).
We make use of the fact that (for large samples)
\[ \bar{X} \sim N(\mu, \sigma^2/n). \]
We make the usual transformation to obtain a standard normal distribution:
\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \]
From our normal tables we know that
\[ \Pr[-1.96 < Z < 1.96] = 0.95 \]
i.e. the probability that the standard normal distribution lies between $-1.96$ and $+1.96$
is 0.95.
[Figure: density curve of $Z \sim N(0, 1)$, with the points $-1.96$ and $1.96$ marked on the horizontal axis.]
To obtain confidence intervals with different probability levels, all that needs to
be changed is the z-value obtained from the normal tables. The box below gives the
appropriate z-values for the most frequently used confidence intervals.
CONFIDENCE INTERVAL FOR $\mu$, $\sigma^2$ KNOWN
If we have a random sample of size $n$ with sample mean $\bar{X}$, then A%
confidence intervals for $\mu$ are given by
$$\left( \bar{X} - z\,\frac{\sigma}{\sqrt{n}},\;\; \bar{X} + z\,\frac{\sigma}{\sqrt{n}} \right)$$
where the appropriate values of $z$ are given by:
A%     90%     95%     98%     99%
z      1.64    1.96    2.33    2.58
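The boxed interval can be sketched in a few lines of Python. This is only an illustration, not part of the original text: the function name `z_confidence_interval` is made up here, and the z-value is computed with the standard library's `statistics.NormalDist` instead of being read from the normal tables, so it carries more decimals than the tabulated 1.64, 1.96, 2.33 and 2.58. The data are those of Example 3C.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(xbar, sigma, n, level=0.95):
    """Return (lower, upper) limits for the mean when sigma is known.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Two-sided interval: leave (1 - level) / 2 in each tail.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width


# Example 3C: n = 58 pupils, sample mean R168.15, sigma = R37.60.
lower, upper = z_confidence_interval(168.15, 37.60, 58, level=0.95)
print(round(lower, 2), round(upper, 2))
```

Rounded to two decimals this reproduces the interval quoted for 3C in the solutions to examples.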
Example 2B: An estimate of the mean fuel consumption (litres/100 km) of a car is
required. A sample of 47 drivers each drive the car under a variety of conditions for
100 km, and the fuel consumed is measured. The sample mean turns out to be 6.73
litres/100 km. The value of $\sigma$ is known to be 1.7 ℓ/100 km. Determine 95% and 99%
confidence intervals for $\mu$.
We have $\bar{X} = 6.73$, $\sigma = 1.7$ and $n = 47$. Thus the 95% and 99% confidence
intervals for $\mu$ are given by
$$\left( 6.73 - 1.96\,\frac{1.7}{\sqrt{47}},\;\; 6.73 + 1.96\,\frac{1.7}{\sqrt{47}} \right) = (6.24,\; 7.22)$$
and
$$\left( 6.73 - 2.58\,\frac{1.7}{\sqrt{47}},\;\; 6.73 + 2.58\,\frac{1.7}{\sqrt{47}} \right) = (6.09,\; 7.37)$$
respectively. Note that if everything else remains unchanged, the confidence interval
becomes shorter with an increase in the sample size. Resist the temptation to conclude
that by increasing the sample size it is more likely that the confidence interval covers
the true mean. The increase in sample size results in narrower confidence intervals, but
the level of confidence remains the same.
Example 3C: A chain of stores is interested in the expenditure on sporting equipment
by high school pupils during a winter season. A random sample of 58 high school pupils
yielded a mean winter expenditure on sporting equipment of R168.15. Assuming that the
population standard deviation is known to be R37.60, find the 95% confidence interval
for the true mean winter expenditure on sporting equipment.
$$2 = 1.96 \times 6.58/\sqrt{n},$$
and thus $n = (1.96 \times 6.58/2)^2 = 42$, rounding upwards to the next integer.
The general method for determining sample sizes is given in the box:
CALCULATING THE SAMPLE SIZE TO ACHIEVE DESIRED ACCURACY, $\sigma^2$ KNOWN
z      1.64    1.96    2.33    2.58
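The sample-size calculation can be sketched as follows. The formula used, $n = (z\sigma/L)^2$ rounded up, is inferred from the worked line $n = (1.96 \times 6.58/2)^2 = 42$ above; the symbol $L$ for the desired accuracy and the function name `sample_size` are assumptions made for this illustration.

```python
from math import ceil, sqrt


def sample_size(z, sigma, accuracy):
    """Smallest n for which z * sigma / sqrt(n) is at most `accuracy`.

    Hypothetical helper; z is the tabulated value for the chosen
    confidence level (1.64, 1.96, 2.33 or 2.58).
    """
    return ceil((z * sigma / accuracy) ** 2)


# The worked line above: accuracy 2, sigma 6.58, 95% confidence.
print(sample_size(1.96, 6.58, 2))          # 42

# Example 5C(b): sigma^2 = 115, probability 0.99.
print(sample_size(2.58, sqrt(115), 1))     # 766
print(sample_size(2.58, sqrt(115), 0.5))   # 3062
```

Halving the permitted error multiplies the required sample size by roughly four, because the accuracy enters the formula squared.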
Example 5C: The population variance of the amount of cooldrink supplied by a vending
machine is known to be $\sigma^2 = 115$ ml².
(a) The machine was activated 61 times, and the mean amount of cooldrink supplied
on each occasion was 185 ml. Find a 99% confidence interval for the mean.
(b) What size samples are required if the estimate is required to be within (i) 1 ml
(ii) 0.5 ml of the true value with probability 0.99?
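$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$
In this example
$$Z = \frac{\bar{X} - 100}{12/\sqrt{50}} \sim N(0, 1).$$
Here $\bar{X} = 95.5$, and the corresponding z-value is therefore
$$z = \frac{95.5 - 100}{12/\sqrt{50}} = -2.65.$$
Thus
$$\Pr[\bar{X} < 95.5] = \Pr[Z < -2.65] = 0.0040, \text{ from tables.}$$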
This says, if the manufacturer's claim is true (i.e. $\mu = 100$) then the probability of
getting a sample mean of 95.5 or less is 0.004, a very small probability.
Now we have to take a decision. Either
(a) the manufacturer is correct, and a very unlikely event has occurred, one that will
occur on average 4 times in every 1000 samples, or
(b) the manufacturer's claim is not true, and the true population mean is less than
100, making a sample mean of 95.5 or less a more likely event.
The statistician here would go for alternative (b). He would reason that alternative
(a) is so unlikely that he can safely reject it, and he would conclude that the manufacturer's claim is exaggerated.
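The tail probability in this example can be checked numerically. The sketch below is an illustration only: it uses the standard library's `statistics.NormalDist` in place of the normal tables, and the function name `lower_tail_probability` is made up for the purpose.

```python
from math import sqrt
from statistics import NormalDist


def lower_tail_probability(xbar, mu0, sigma, n):
    """Return (z, Pr[Z < z]) for a sample mean xbar when the true mean is mu0."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    return z, NormalDist().cdf(z)


# Claimed mean 100, sigma = 12, n = 50, observed sample mean 95.5.
z, p = lower_tail_probability(95.5, 100, 12, 50)
print(round(z, 2), round(p, 4))
```

The computed probability agrees with the tabulated 0.0040 to the accuracy of the tables.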
Tests of hypotheses . . .
The problem posed in example 6A introduces us to the concept of statistical inference: how we infer or draw conclusions from data. Tests of hypotheses, also
called significance tests, are the foundation of statistical inference.
Whenever a claim or assumption needs to be examined by means of a significance
test, we have a step-by-step procedure, as outlined below. We will use the above problem
to illustrate the steps. A modified procedure to do hypothesis tests will be discussed
later.
1. Set up a null hypothesis. This is almost always a statement about the value
of a population parameter. Here the null hypothesis is that the true mean $\mu$ is
equal to 100. For our null hypothesis we usually take any claim that is made (and
usually we are hoping to be able to reject it!).
We abbreviate the above null hypothesis to
$$H_0 : \mu = 100.$$
2. An alternative hypothesis $H_1$ is defined. $H_1$ is accepted if the test enables us
to reject $H_0$. Here the alternative hypothesis is $H_1 : \mu < 100$. We shall see that
this is a one-sided alternative and gives rise to a one-tailed significance
test.
3. A significance level is chosen. The significance level expresses the probability of
rejecting the null hypothesis when it is in fact true. We usually work with a 5%
or 0.05 significance level. We will then make the mistake of rejecting a true null
hypothesis 5% of the time, i.e. one time in twenty. If the consequences of wrongly
rejecting the null hypothesis are serious, a 1% or 0.01 level is used. Here a 5%
significance level would suffice.
4. We determine, from tables, the set of values that will lead to the rejection of the
null hypothesis. We call this the rejection region. Our test statistic will have a
standard normal distribution; thus we use normal tables. Because $H_1$ is one-sided
and contains a "<" sign, the rejection region is in the lower (left hand) end
of the standard normal distribution. We reject $H_0$ if the sample mean $\bar{X}$ is too
far below the hypothesized population mean $\mu$, that is, if $\bar{X} - \mu$ is too negative.
Because the significance level is 5% we must therefore find the lower 5% point of
the standard normal distribution; this is $-1.64$.
[Figure: density of $Z \sim N(0, 1)$, with the rejection region in the lower tail, below $-1.64$.]
Thus the rejection region ties up with the distribution of the test statistic, the form of the alternative hypothesis, and the size of the significance level. The value we look up in the table is frequently called the critical
value of the test statistic.
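5. We calculate the test statistic. We know that $\bar{X} \sim N(\mu, \sigma^2/n)$. If $H_0$ is true
then $\bar{X} \sim N(100, 12^2/50)$ and
$$Z = \frac{\bar{X} - 100}{12/\sqrt{50}} \sim N(0, 1).$$
In our example, $\bar{X} = 95.5$ and the observed value of the test statistic $z$ is
$$z = \frac{95.5 - 100}{12/\sqrt{50}} = -2.65.$$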
6. We state our conclusions. We determine whether the value we observed falls into
the rejection region. If it does, then we reject the null hypothesis H0 and accept
the alternative H1 . The result is then said to be statistically significant.
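The six steps can be collected into a short function. This is a minimal sketch, assuming a known $\sigma$ and a test on a single mean; the name `z_test` and the `alternative` labels are invented for the illustration, and critical values come from `statistics.NormalDist` rather than from the tables.

```python
from math import sqrt
from statistics import NormalDist


def z_test(xbar, mu0, sigma, n, alpha=0.05, alternative="less"):
    """Return (z, reject) for H0: mu = mu0 with sigma known.

    alternative is "less", "greater" or "two-sided" (labels assumed here).
    """
    z = (xbar - mu0) / (sigma / sqrt(n))      # step 5: test statistic
    std_normal = NormalDist()
    if alternative == "less":                 # step 4: rejection region
        reject = z < std_normal.inv_cdf(alpha)
    elif alternative == "greater":
        reject = z > std_normal.inv_cdf(1 - alpha)
    elif alternative == "two-sided":
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")
    return z, reject                          # step 6: conclusion


# The problem used to illustrate the steps: H0: mu = 100 against H1: mu < 100.
z, reject = z_test(95.5, 100, 12, 50, alpha=0.05, alternative="less")
print(round(z, 2), reject)
```

Passing the significance level as an argument makes it easy to see how the decision can change between the 5% and 1% levels for the same data.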
The feeling you should have by now is that if $\bar{X}$, the mean from our sample, is too
far from 100, then we are going to reject $H_0$. But how far is too far?
[Figure: density of $Z \sim N(0, 1)$, with the rejection region in the lower tail, below $-2.33$.]
$$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{26.5 - 30}{10/\sqrt{30}} = -1.92.$$
6. Because $-1.92 > -2.33$ we do not reject $H_0$. We thus decide to keep our new
typist on. Note that if we had used a 5% significance level, the critical value
would have been $-1.64$, and we would have decided that our new typist was below
standard.
Example 8C: From past records it is known that the checkout times at supermarket
tills have a standard deviation of 1.3 minutes. Past records also reveal that the average
checkout time at a certain type of till is 4.1 minutes. A new type of till is monitored and
64 randomly sampled customers had an average checkout time of 3.8 minutes. Does the
new till result in a significant reduction in checkout times? Use a 1% level of significance.
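[Figure: density of $Z \sim N(0, 1)$, with rejection regions of 2½% in each tail, below $-1.96$ and above $1.96$.]
$$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{2.65 - 2.5}{0.53/\sqrt{35}} = 1.67.$$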
6. The observed z -value, 1.67, does not lie within either part of the rejection region;
thus we cannot reject H0 . On the available evidence the farmer concludes that
the new fertilizer is not significantly different to the old.
Some guidelines . . .
The null hypothesis and the alternative hypothesis should both be determined before
the data are gathered. The guideline for the choice of alternative hypotheses is: always
use a two-sided alternative unless there are good theoretical reasons for using a
one-sided alternative. The use of a one-sided alternative is never justified by claiming
"the data point that way". In the hypothesis testing procedure we have considered above,
the significance level ought also to be predetermined.
                         H0 true             H0 false
Our        accept H0     correct decision    type II error
decision   reject H0     type I error        correct decision
The probability of committing a type I error is the significance level of the test,
sometimes also referred to as the size of the test. The probability of committing a
type II error varies depending on how close H0 is to the true situation, and is difficult to
control. The tradition of using 5% and 1% significance levels is based on the experience
that, at these levels, the frequency of type II errors is acceptable.
$X_2 \sim N(\mu_2, \sigma_2^2)$
and $X_1$ and $X_2$ are independent, then the distribution of the random variable
$Y = X_1 - X_2$ is given by
$$Y = X_1 - X_2 \sim N(\mu_1 - \mu_2,\; \sigma_1^2 + \sigma_2^2).$$
Example 10A: A cross-country athlete runs an 8 km time trial nearly every Wednesday
as part of his weekly training programme. Last year, he ran on 49 occasions, and the
mean of his times was 30 minutes 25.4 seconds (30.42 minutes). So far this year his
mean time has been 30 minutes 15.7 seconds (30.26 minutes) over 35 runs. Assuming
that the standard deviation last year was 0.78 minutes, and this year is 0.65 minutes,
do these data establish whether, at the 5% significance level, there has been a reduction
in the athlete's time over 8 km?
We work our way through our six-point plan:
1. Let the population means last year and this year be $\mu_1$ and $\mu_2$ respectively. Our
null hypothesis specifies that there is no change in the athlete's times between last
year and this year:
$$H_0 : \mu_1 = \mu_2,$$
or equivalently,
$$H_0 : \mu_1 - \mu_2 = 0.$$
Notice once again how the null hypothesis expresses the concept we are hoping to
disprove. The null hypothesis can helpfully be thought of as the hypothesis of
no change, or of no difference.
2. The alternative hypothesis contains the statement the athlete hopes is true:
$$H_1 : \mu_1 > \mu_2,$$
or equivalently,
$$H_1 : \mu_1 - \mu_2 > 0.$$
The alternative hypothesis does not specify the amount of the change; it simply
states that there has been a decrease in average time.
3. Significance level : we specified that we would perform the test at the 5% level.
4. Rejection region. Because the test statistic has a standard normal distribution (we
will show this in the next paragraph), because the significance level is 5%, and
because we have a one-sided greater than alternative hypothesis, we will reject
H0 if the observed value of the test statistic is greater than 1.64.
[Figure: density of $Z \sim N(0, 1)$, with the rejection region in the upper tail, above $1.64$.]
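5. Test statistic. In general, let us suppose that we have a sample of size $n_1$ from
one population and a sample of size $n_2$ from a second population. Suppose the
sample means are $\bar{X}_1$ and $\bar{X}_2$, respectively, and that the population means and
variances are $\mu_1$, $\mu_2$, $\sigma_1^2$ and $\sigma_2^2$. Then
$$\bar{X}_1 \sim N(\mu_1, \sigma_1^2/n_1) \quad\text{and}\quad \bar{X}_2 \sim N(\mu_2, \sigma_2^2/n_2).$$
We now transform to the standard normal distribution by subtracting the mean,
and dividing by the standard deviation:
$$Z = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0, 1).$$
This is our test statistic, and we can substitute for each variable: from the problem
description we have
$$\bar{X}_1 = 30.42 \qquad n_1 = 49 \qquad \sigma_1 = 0.78$$
$$\bar{X}_2 = 30.26 \qquad n_2 = 35 \qquad \sigma_2 = 0.65,$$
and under $H_0$ we have $\mu_1 - \mu_2 = 0$. Substituting these values gives
$$z = \frac{30.42 - 30.26 - 0}{\sqrt{\dfrac{0.78^2}{49} + \dfrac{0.65^2}{35}}} = 1.02.$$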
6. Conclusion. The observed value of the test statistic does not lie in the rejection
region. We have to disappoint our athlete and tell him to keep trying.
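The two-sample test statistic can be sketched in Python as follows. The function name `two_sample_z` and the parameter `diff0` (the hypothesized value of $\mu_1 - \mu_2$) are assumptions made for this illustration; the data are the athlete's times from Example 10A.

```python
from math import sqrt


def two_sample_z(xbar1, xbar2, sigma1, sigma2, n1, n2, diff0=0.0):
    """z statistic for H0: mu1 - mu2 = diff0, both variances known."""
    standard_error = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    return (xbar1 - xbar2 - diff0) / standard_error


# Example 10A: last year (sample 1) against this year (sample 2).
z = two_sample_z(30.42, 30.26, 0.78, 0.65, 49, 35)
print(round(z, 2))   # 1.02

# One-sided "greater than" alternative at the 5% level: reject if z > 1.64.
print(z > 1.64)      # False
```

The observed value falls short of the critical value, which is why the null hypothesis of no change is not rejected.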
Example 11B: A retail shop has two drivers that transport goods between the shop and
a warehouse 15 km away. They argue continuously about the choice of route between the
shop and warehouse, each claiming that his route is the quicker. To settle the argument,
you wish to decide (5% significance level) which driver's route is the quicker. Over a
period of months you time the two drivers, and the data collected are summarized below.
                                Driver 1              Driver 2
Number of observations          $n_1 = 38$            $n_2 = 43$
Average time (minutes)          $\bar{X}_1 = 20.3$    $\bar{X}_2 = 22.5$
Standard deviation (minutes)    $\sigma_1 = 3.7$      $\sigma_2 = 4.1$
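[Figure: density of $Z \sim N(0, 1)$, with rejection regions of 2½% in each tail, below $-1.96$ and above $1.96$.]
$$z = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{20.3 - 22.5 - 0}{\sqrt{\dfrac{3.7^2}{38} + \dfrac{4.1^2}{43}}} = -2.54$$
6. Because $|-2.54| > 1.96$ we reject $H_0$ and conclude that the driving times are not
equal. By inspection, it is clear that the route used by driver 1 is the quicker.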
Example 12C: Two speedreading courses are available. Students enrol independently
for these courses. After completing their respective courses, the group of 27 students
who took course A had an average reading speed of 620 words/minute, while the group
of 38 students who took course B had an average speed of 684 words/minute. If it is
known that reading speed has a standard deviation of 25 words/minute, test (at the 5%
significance level) whether there is any difference between the two courses.
Example 13C: The scientists of the Fuel Improvement Centre think they have found a
new petrol additive which they hope will reduce a car's fuel consumption by 0.5 ℓ/100 km.
Two series of trials are conducted, one without and the other with the additive. 40 trials
without the additive show an average consumption of 9.8 ℓ/100 km. 50 trials with the
additive show an average consumption of 9.1 ℓ/100 km. Do these data establish that
the additive reduces fuel consumption by 0.5 ℓ/100 km? Use a 5% significance level,
and assume that the population standard deviations without and with the addition of
the additive are 0.8 and 0.7 ℓ/100 km respectively.
Example 14C: Show that a 95% confidence interval for $\mu_1 - \mu_2$, the difference between
two independent means, with variances assumed known, is given by
$$\left( \bar{X}_1 - \bar{X}_2 - 1.96\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}},\;\; \bar{X}_1 - \bar{X}_2 + 1.96\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \right).$$
Find a 95% confidence interval for the difference between mean travelling times in example 15B.
Significance     Reported       z-value                     verbal
level            probability    one-sided     two-sided     description
                                alternative   alternative
more than 20%    P > 0.20       z < 0.84      z < 1.28      not significant
20%              P < 0.20       z > 0.84      z > 1.28      possibly significant
10%              P < 0.10       z > 1.28      z > 1.64      nearly significant
5%               P < 0.05       z > 1.64      z > 1.96      significant
1%               P < 0.01       z > 2.33      z > 2.58      very significant
0.5%             P < 0.005      z > 2.58      z > 2.81      highly significant
0.1%             P < 0.001      z > 3.09      z > 3.29      very highly significant
0.05%            P < 0.0005     z > 3.29      z > 3.48      very highly significant
0.01%            P < 0.0001     z > 3.72      z > 3.89      very highly significant
$$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{\bar{X} - \mu}{29.47/\sqrt{40}} = -3.16.$$
4. Examining the column in the table for a one-sided alternative, we see that $z = -3.16$
is significant at the 5% level (because $z < -1.64$), the 1% level ($z < -2.33$),
the 0.5% level ($z < -2.58$), the 0.1% level ($z < -3.09$), but not at the 0.05% level
(because $z > -3.29$). The highest level at which the observed value of the test
statistic $z = -3.16$ is significant is the 0.1% level.
5. We illustrate the conventional form of writing the conclusion.
We have tested the sample mean against the population mean, and have
found a very highly significant difference ($z = -3.16$, $P < 0.001$). We
conclude that the company is paying inferior wages.
Notice carefully the shorthand method for presenting the results. The value of the
test statistic is given, and the observed level of significance of this value.
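The observed level of significance can also be computed directly instead of being bracketed from the table. The sketch below is illustrative only: the function name `observed_significance` is made up, the probability comes from `statistics.NormalDist`, and for a one-sided alternative it assumes that the observed z lies on the side indicated by the alternative hypothesis.

```python
from statistics import NormalDist


def observed_significance(z, two_sided=False):
    """Tail probability beyond the observed z under the standard normal.

    For a one-sided test this assumes z falls in the direction of H1.
    """
    tail = NormalDist().cdf(-abs(z))
    return 2 * tail if two_sided else tail


p = observed_significance(-3.16)
print(p < 0.001, p < 0.0005)   # True False
```

The result lies between 0.0005 and 0.001, consistent with reporting $P < 0.001$ for $z = -3.16$.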
Example 16B: The specifications for extra-large eggs are that they have a mean mass
of 125 g and standard deviation 6 g. A sample of 24 reputedly extra-large eggs had an
average mass of 123.2 g. Are the eggs smaller than the specifications permit?
1. $H_0 : \mu = 125$ g.
2. $H_1 : \mu < 125$ g.
[Figure: density of $Z \sim N(0, 1)$, with rejection regions in both tails, below $-2.58$ and above $2.58$.]
5. The random variable $X$, the number of heads in 1000 trials, has a binomial distribution
with mean $np$ and variance $npq$. If $H_0$ is true, then $p = \frac{1}{2}$ and $np = 500$
and $npq = 250$. Thus $X$ satisfies the conditions for it to be approximated by
the normal distribution (see Chapter 6 for the normal approximation to the binomial
distribution) with mean 500 and variance 250 (i.e. standard deviation $\sqrt{250}$), i.e.
$X \sim N(500, 250)$, and the usual transformation to the standard normal distribution yields
$$z = \frac{478 - 500}{\sqrt{250}} = -1.39.$$
6. This value does not lie in the rejection region: we therefore cannot reject $H_0$ and
conclude that the coin is unbiased.
In general, the procedure for testing the null hypothesis that the parameter $p$ of the
binomial distribution is some particular value, is summarized as follows. If $X \sim B(n, p)$,
if $np$ and $n(1 - p)$ are both greater than 5 and if $0.1 < p < 0.9$, then the distribution of
the binomial random variable can be approximated by the normal distribution $N(np, npq)$,
so that the formula for calculating the test statistic $z$ is
$$z = \frac{X - np}{\sqrt{npq}}.$$
The value $X$ is the observed number of successes in $n$ trials, and the value for $p$ is taken
from the null hypothesis.
Likewise, the procedure for testing the null hypothesis that the parameter $\lambda$ of the
Poisson distribution is some particular value, is summarized as follows. The Poisson
distribution $P(\lambda)$ can be approximated by the normal distribution $N(\lambda, \lambda)$ provided
$\lambda > 10$. The formula for calculating the test statistic is thus
$$z = \frac{X - \lambda}{\sqrt{\lambda}},$$
where $X$ is the observed value of the Poisson random variable and the value for $\lambda$ is
taken from the null hypothesis.
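Both approximations can be sketched as small Python functions. The names `binomial_z` and `poisson_z` are invented for this illustration, and the checks simply encode the conditions stated above ($np$ and $n(1-p)$ greater than 5 with $0.1 < p < 0.9$, and $\lambda > 10$).

```python
from math import sqrt


def binomial_z(x, n, p):
    """z statistic for H0: binomial parameter equals p (normal approximation)."""
    q = 1 - p
    if n * p <= 5 or n * q <= 5 or not (0.1 < p < 0.9):
        raise ValueError("conditions for the normal approximation are not met")
    return (x - n * p) / sqrt(n * p * q)


def poisson_z(x, lam):
    """z statistic for H0: Poisson parameter equals lam (normal approximation)."""
    if lam <= 10:
        raise ValueError("the approximation requires lambda > 10")
    return (x - lam) / sqrt(lam)


# 478 heads in 1000 tosses, H0: p = 1/2.
print(round(binomial_z(478, 1000, 0.5), 2))   # -1.39

# 143 landings in a week, H0: lambda = 120 (Example 21B).
print(round(poisson_z(143, 120), 2))          # 2.1
```

Raising an error when the conditions fail is a design choice of this sketch; it prevents the normal approximation from being applied silently where the text says it should not be used.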
Example 21B: The number of aircraft landing at an airport has a Poisson distribution.
Last year the parameter $\lambda$ was taken to be 120 per week. During a week in March this
year an aviation department official recorded 143 landings. Does this datum suggest
that the parameter has increased? Use a 5% significance level.
1. $H_0 : \lambda = 120$. The null hypothesis states that the rate is unchanged.
2. $H_1 : \lambda > 120$.
6. Because 2.10 > 1.64, we reject H0 , and conclude that the rate of landings has
increased.
In other words, it is an unlikely event to observe 143 landings in a week, if the true
rate is 120.
Example 22C: A large motor car manufacturer enjoys a 21% share of the South
African market. Last month, out of 2517 new vehicles sold, 473 were produced by this
manufacturer. Does management have cause for alarm? Quote the observed significance
level.
Example 23C: A die is rolled 300 times, and 34 sixes are observed. Is the die biased?
Example 24C: The complaints department of a large department store deals with, on
average, 20 complaints per day. On Monday, 9 May 1994, there were 33 complaints.
(a) How frequently should there be 33 or more complaints?
(b) At the 1% significance level, test if management should decide that it would be
worth investigating this day's complaints further.
(c) What would be acceptable numbers of complaints per day at the 5% and 1%
significance levels?
Example 25C: In a large company with a gender initiative, the proportion of female
staff in middle-management positions was 0.23 in 1990. Currently, in one of the regional
offices of the company, there are 28 women and 63 men in middle-management positions.
If this regional office can be assumed to be representative of the company as a whole,
has there been a significant increase in the proportion of women in middle management?
Do the test at the 5% significance level.
Solutions to examples . . .
3C (158.47, 177.83).
5C (a) (181.46, 188.54).
(b) (i) 766
(ii) 3062.
Note: to improve the precision by a factor
1
2
8.1 A manufacturer of light bulbs prints "Average life 2100 hours" on the package of
its bulbs. If the true distribution of lifetimes has a mean of 2130 and standard
deviation 200, what is the probability that the average lifetime of a random sample
of 50 bulbs will exceed 2100 hours? (Hint: use the result $\bar{X} \sim N(\mu, \sigma^2/n)$.)
8.2 Certain tubes produced by a company have a mean lifetime of 1000 hours and a
standard deviation of 160 hours. The tubes are packed in lots of 100. What is the
probability that the mean lifetime for a randomly selected pack will exceed 1020
hours?
8.3 (a) A sample survey was conducted in a city suburb to determine the mean family
income for the area. A random sample of 200 households yielded a mean of
R6578. The standard deviation of incomes in the area is known to be R1000.
Construct a 95% confidence interval for $\mu$.
(b) Suppose now that the investigator wants to be within R50 of the true value
with 99% confidence. What size sample is required?
(c) If the investigator wants to be within R50 of the true value with 90% confidence, what is the required sample size?
8.4 We want to test the strength of lift cables. We know that the standard deviation
of the maximum load a cable can carry is 0.73 tons. We test 60 cables and find
the mean of the maximum loads these cables can support to be 11.09 tons. Find
intervals that will include the true mean of the maximum load with probabilities
(a) 0.95 and (b) 0.99.
8.5 During a student survey, a random sample of 250 first year students were asked to
record the amount of time per day spent studying. The sample yielded a mean of
85 minutes, with a standard deviation of 30 minutes. Construct a 90% confidence
interval for the population mean.
8.6 The mean height of 10-year-old males is known to be 150 cm and the standard
deviation is 10 cm. An investigator has selected a sample of 80 males of this age
who are known to have been raised on a protein-deficient diet. The sample mean
is 147 cm. At the 5% level of significance, decide whether diet has an effect on
height.
8.11 A mechanized production line operation is supposed to fill tins of coffee with a
mean of 500.5 g of coffee with a standard deviation of 0.6 g. A quality control
specialist is concerned that wear and tear has resulted in a reduction in the mean.
A sample of 42 tins had a mean content of 500.1 g. Use a 1% significance level to
perform the appropriate test.
8.12
(a) A coin is tossed 10 000 times, and it turns up heads 5095 times. Is it
reasonable to think that the coin is unbiased? Use 5% significance level.
(b) A coin is tossed 400 times, and it turns up heads 220 times. Is the coin
unbiased? Use 5% level.
8.13 The beta coefficient is a measure of risk widely used by financial analysts. Larger
values for beta represent higher risk. A particular analyst would like to determine
whether gold shares are more risky than industrial shares. From past records,
it is known that the standard deviation of betas for gold shares is 0.313 and for
industrial shares is 0.507. A sample of 40 gold shares had a mean beta coefficient
of 1.24 while a sample of 30 industrial shares had a mean of 0.72. Using a 1% level
of significance, conduct the appropriate statistical test for the financial analyst.
8.14 The standard deviation of scores in an I.Q. test is known to be 12. The I.Q.s of
random samples of 20 girls and 20 boys yield averages of 110 and 105 respectively.
Use a 1% significance level to test the hypothesis that the I.Q.s of boys and girls
are different.
8.17 A reporter claims that at least 60% of voters are concerned about conservation
issues. Doubting this claim, a politician samples 480 voters and of them 275
expressed concern about conservation. Test the reporter's claim.
8.18 If the average rate of computer breakdowns is 0.05 per hour, or less, the computer
is deemed to be operating satisfactorily. However, in the past 240 hours the computer has broken down 18 times. Test the null hypothesis that $\lambda = 0.05$ against
the alternative that $\lambda$ exceeds 0.05.
8.19 The mean grade of ore at a gold mine is known to be 4.40 g with a standard deviation of 0.60 g per ton of ore milled. A geologist has randomly selected 30 samples
of ore in a new section of the mine, and determined their mean grade to be 4.12 g
per ton of ore. Test whether the ore in the new section of the mine is of inferior
quality.
Solutions to exercises . . .
8.1 0.8554.
8.2 0.1056.
8.3 (a) (6439.4, 6716.6)
(b) 2663
(c) 1076.
8.4 (a) (10.91, 11.27)
Chapter
We know, from chapter 8, that $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ has the standard normal distribution. But
when $\sigma$, the single true value for the standard deviation in the population, is replaced
by $s$, the estimate of $\sigma$, this is no longer true, although, as the sample size $n$ increases,
it rapidly becomes an excellent approximation. But for small samples it is far from the
truth. This is because the sample variance $s^2$ (and also $s$) is itself a random variable:
it varies from sample to sample. We will discuss the sampling distribution of $s^2$ in
the next chapter.
We know that the size of the sample influences the accuracy of our estimates. The
larger the sample the closer the estimate is likely to be to the true value. Student's
t-distribution takes account of the size of the sample from which $s$ is calculated.
The shape of the t-distribution is similar to that of the normal distribution. However,
the shape of the distribution varies with the sample size. It is longer- or heavier-tailed
than the normal distribution when the sample size is small. As the
sample size increases, the t-distribution and normal distribution become progressively
closer, and, ultimately, they are identical. The standard normal distribution, and two
t-distributions are plotted.
[Figure: densities of the standard normal distribution $N(0, 1)$ and of the t-distributions $t_{15}$ and $t_3$.]
Degrees of freedom . . .
In order to gain some insight into the notion of degrees of freedom, consider again
the definition of the sample variance:
$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$
The terms $x_i - \bar{x}$ are the deviations of each of the $x_i$ from the sample mean. To
achieve a given sample mean for, say, six numbers, five of these can be chosen at will,
but the last is then fully determined. Suppose we are given the information that the
mean of six numbers is 5 and that the first five of the six numbers are 4, 9, 5, 7 and 3.
The sixth number must be 2, otherwise the sample mean would not be 5. It is fixed, it
has no freedom. In general if we are told that the mean of $n$ numbers is $\bar{x}$, and that
the first $n - 1$ numbers are $x_1, x_2, \ldots, x_{n-1}$, then it is easy to see that $x_n$ must be
given by
$$x_n = n\bar{x} - x_1 - x_2 - \cdots - x_{n-1}.$$
In other words, once we are given $\bar{x}$ and the first $n - 1$ of the $x_i$ we have enough
information to compute the sample variance. Thus, only $n - 1$ of the deviation terms
$x_i - \bar{x}$ in $s^2$ contain real information; the last term is just a formality, but it must be
included! We say that $s^2$, based on a sample of size $n$, has $n - 1$ degrees of freedom.
(This is part of the reason why the formula for the sample variance $s^2$ calls for division
by $n - 1$, and not $n$.)
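The six-number illustration can be reproduced in a few lines. This is a sketch for checking the arithmetic only; the function names `last_value` and `sample_variance` are made up for the purpose.

```python
def last_value(mean, known_values):
    """The one remaining value once the mean and the other n - 1 values are fixed."""
    n = len(known_values) + 1
    return n * mean - sum(known_values)


def sample_variance(values):
    """Sample variance with division by n - 1, as in the definition above."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


first_five = [4, 9, 5, 7, 3]
sixth = last_value(5, first_five)
print(sixth)                                   # 2
print(sample_variance(first_five + [sixth]))   # 6.8
```

The variance is obtained from the mean and the first five values alone, since the sixth value is computed from them; this is the sense in which only $n - 1$ deviations carry information.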
We will be encountering the concept of degrees of freedom regularly. We have a
simple rule which helps in making decisions about degrees of freedom.
$$t_{n-1} = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$
and say that the expression on the right hand side has the t-distribution with $n - 1$
degrees of freedom, or simply the $t_{n-1}$-distribution.
As for the normal distribution, we need tables for looking up values for the
t-distribution. The shapes of the t-distributions are dependent on the degrees of freedom;
thus we cannot get away with a single table for all t-distributions as we did with the
normal distribution. We really do appear to need a separate table for each number of
degrees of freedom. But we take a short cut, and we only present a selection of key
values from each t-distribution in a single table (Table 2). If you think about it, you
can now begin to understand why we can do this; even for the normal distribution, we
repeatedly only use a handful of values; $z^{(0.05)} = 1.64$, $z^{(0.025)} = 1.96$, $z^{(0.01)} = 2.33$
and $z^{(0.005)} = 2.58$ are by far the most frequently used percentage points of the standard
normal distribution. In Table 2, there is one line for each t-distribution; on that line,
we present 11 percentage points.
[Figure: density of the $t_9$-distribution, with 2½% of the area in each tail, below $-2.262$ and above $2.262$.]
Because 2½% (or 0.025) of the $t_9$-distribution lies to the right of 2.262 we write
$$t_9^{(0.025)} = 2.262.$$
$$\frac{\bar{X} - \mu}{s/\sqrt{10}} \sim t_9.$$
Therefore
$$\Pr\left[ -2.262 < \frac{\bar{X} - \mu}{s/\sqrt{10}} < 2.262 \right] = 0.95.$$
Manipulation of the inequalities, as done in the same context in Chapter 8, yields
$$\Pr\left[ \bar{X} - 2.262\,\frac{s}{\sqrt{10}} < \mu < \bar{X} + 2.262\,\frac{s}{\sqrt{10}} \right] = 0.95.$$
$$\left( \bar{X} - t_{n-1}\,\frac{s}{\sqrt{n}},\;\; \bar{X} + t_{n-1}\,\frac{s}{\sqrt{n}} \right)$$
where the t values are obtained from the t-tables. For 95% confidence intervals, use
the column in the tables headed 0.025. For 99%
confidence intervals, use the column headed 0.005.
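A small sketch of this interval in Python follows. Python's standard library has no t-distribution, so the percentage point is passed in as an argument after being looked up in the t-tables; the function name `t_confidence_interval` is hypothetical. The numbers are those of the bread example (mean 796 g, standard deviation 7 g, 25 loaves, $t_{24}^{(0.005)} = 2.797$).

```python
from math import sqrt


def t_confidence_interval(xbar, s, n, t_value):
    """Interval xbar -/+ t_value * s / sqrt(n).

    t_value is the tabulated percentage point of the t-distribution
    with n - 1 degrees of freedom (looked up by the caller).
    """
    half_width = t_value * s / sqrt(n)
    return xbar - half_width, xbar + half_width


# 99% interval: use the column headed 0.005 with 24 degrees of freedom.
lower, upper = t_confidence_interval(796, 7, 25, 2.797)
print(round(lower, 2), round(upper, 2))   # 792.08 799.92
```

Compared with the known-variance case, the only changes are that $s$ replaces $\sigma$ and a t-value replaces the z-value.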
Example 2B: A random sample of 25 loaves of bread had a mean mass of 796 g and
a standard deviation of 7 g. Calculate the 99% confidence interval for the mean.
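We have $\bar{x} = 796$, and $s = 7$ has 24 degrees of freedom. Because we want 99%
confidence intervals, we must look up $t_{24}^{(0.005)}$ in the t-tables:
$$t_{24}^{(0.005)} = 2.797.$$
[Figure: density of the $t_{24}$-distribution, with tails below $-2.797$ and above $2.797$.]
[Figure: density of a t-distribution, with the rejection region in the upper tail, above $1.729$.]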
$$\frac{\bar{X} - \mu}{s/\sqrt{n}}.$$
$$\frac{4.8 - 4.5}{0.5/\sqrt{20}} = 2.68.$$
6. Because 2.68 > 1.729 we reject H0 , and conclude that, at the 5% significance
level, we have established that the new enriched diet is effective.
Example 5B: The average life of 6 car batteries is 30 months with a standard deviation
of 4 months. The manufacturer claims an average life of 3 years for his batteries. We
suspect that he is exaggerating. Test his claim at the 5% significance level.
1. $H_0 : \mu = 36$ months.
2. $H_1 : \mu < 36$ months.
3. Significance level: 5%.
4. Degrees of freedom is $6 - 1 = 5$. So we use the $t_5$-distribution. From the form
of the alternative hypothesis and the significance level, we will reject $H_0$ if the
observed t-value is less than $-t_5^{(0.05)}$. From our tables $t_5^{(0.05)} = 2.015$.
[Figure: density of the $t_5$-distribution, with the rejection region in the lower tail, below $-2.015$.]
yields
$$\frac{\bar{X} - \mu}{s/\sqrt{n}} = \frac{30 - 36}{4/\sqrt{6}} = -3.67.$$
6. This lies in the rejection region: we conclude that the true mean is significantly
less than 36 months.
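The calculation in Example 5B can be sketched as follows. The function name `t_statistic` is invented for the illustration, and the critical value 2.015 is the tabulated $t_5^{(0.05)}$ quoted in the example, since the standard library offers no t-distribution.

```python
from math import sqrt


def t_statistic(xbar, mu0, s, n):
    """One-sample t statistic with n - 1 degrees of freedom."""
    return (xbar - mu0) / (s / sqrt(n))


# Example 5B: six batteries, mean 30 months, s = 4 months, H0: mu = 36.
t = t_statistic(30, 36, 4, 6)
print(round(t, 2))    # -3.67

# One-sided "less than" alternative at the 5% level, 5 degrees of freedom.
print(t < -2.015)     # True, so H0 is rejected
```

The same function reproduces the value 2.68 obtained earlier from a mean of 4.8, a hypothesized mean of 4.5, $s = 0.5$ and $n = 20$.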
Example 6C: A purchaser of bricks believes that their crushing strength is deteriorating. The mean crushing strength had previously been 400 kg, but a recent sample of 81
bricks yielded a mean crushing strength of 390 kg, with a standard deviation of 20 kg.
Test the purchaser's belief at the 1% significance level.
Example 7C: The specifications for a certain type of ball bearings stipulate a mean
diameter of 4.38 mm. The diameters of a sample of 12 ball bearings are measured and
the following summarized data computed:
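$$\sum x_i = 53.18 \qquad \sum x_i^2 = 235.7403.$$
$$Z = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}.$$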
Variety 1    1.5    1.9    1.2    1.4    2.3    1.3
Variety 2    1.6    1.8    2.0    1.8    2.3
One of the plots planted with Variety 2 was accidentally given an extra dose of fertilizer,
so the result was discarded. The means and standard deviations are calculated. They
are
$$\bar{x}_1 = 1.60 \qquad s_1 = 0.42 \qquad n_1 = 6$$
$$\bar{x}_2 = 1.90 \qquad s_2 = 0.27 \qquad n_2 = 5.$$
1. $H_0 : \mu_1 - \mu_2 = 0$.
2. $H_1 : \mu_1 - \mu_2 \neq 0$ (a two-tailed test).
where $s_1^2$ is based on a sample of size $n_1$ and $s_2^2$ is based on a sample of size $n_2$.
In the above example, $n_1 = 6$ and $n_2 = 5$. Therefore
$$s^2 = \frac{5 \times (0.42)^2 + 4 \times (0.27)^2}{6 + 5 - 2} = 0.13$$
$$s = \sqrt{0.13} = 0.361.$$
How many degrees of freedom does $s$ have? $s_1^2$ has 5 and $s_2^2$ has 4. Therefore $s^2$
has $5 + 4 = 9$ degrees of freedom. In general, $s^2$ has $(n_1 - 1) + (n_2 - 1) = n_1 + n_2 - 2$
degrees of freedom. We lose two degrees of freedom because we estimated the two
parameters $\mu_1$ (by $\bar{X}_1$) and $\mu_2$ (by $\bar{X}_2$) before estimating $s^2$.
Thus we use the t-distribution with 9 degrees of freedom, and because we have
a two-sided alternative and a 5% significance level we need the value of $t_9^{(0.025)}$,
which from the tables is 2.262. So we will reject $H_0$ if the observed $t_9$-value is
less than $-2.262$ or greater than $2.262$.
[Figure: density of the $t_9$ distribution, with rejection regions below $-2.262$ and above $2.262$.]
The formula for calculating the test statistic in this hypothesis-testing situation,
the so-called two-sample t-test, is
$$t_{n_1+n_2-2} = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$
$$\frac{1.60 - 1.90 - 0}{0.361\sqrt{\dfrac{1}{6} + \dfrac{1}{5}}}.$$
Thus
$$t_9 = -1.372.$$
6. Because $-1.372$ does not lie in the rejection region we conclude that the difference
between the varieties is not significant.
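As a rough check on the pooled calculation above, the steps can be sketched in Python. The function name `pooled_t` and its argument names are illustrative only; the critical value 2.262 is copied from the tables, not computed.

```python
import math

def pooled_t(x1, s1, n1, x2, s2, n2, delta0=0.0):
    """Two-sample t statistic with pooled variance.

    Returns (t, degrees of freedom, pooled standard deviation).
    delta0 is the hypothesised value of mu1 - mu2.
    """
    df = n1 + n2 - 2
    s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    t = (x1 - x2 - delta0) / (s_pooled * math.sqrt(1 / n1 + 1 / n2))
    return t, df, s_pooled

# Summary statistics for the two varieties in the example.
t, df, s = pooled_t(1.60, 0.42, 6, 1.90, 0.27, 5)

# Two-sided 5% test: compare |t| with the 2.5% point of t9 from the tables.
reject = abs(t) > 2.262

print(f"s = {s:.3f}, t_{df} = {t:.3f}, reject H0: {reject}")
```

Working from summary statistics rather than raw data mirrors how the examples in this chapter are stated.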
Promotion    Sample size    Mean    Standard deviation
Free gift         8         R490          R104
Discount          9         R420          R92
Test, at the 5% level of significance, whether there is any difference in the effectiveness of the two promotions.
1. $H_0 : \mu_1 = \mu_2$.
2. $H_1 : \mu_1 \neq \mu_2$.
3. Significance level : 5%.
4. Degrees of freedom : $n_1 + n_2 - 2 = 15$. From t-tables, if the absolute value of the
observed $t_{15}$ value exceeds 2.131, we reject $H_0$.
[Figure: density of the $t_{15}$ distribution, with rejection regions below $-2.131$ and above $2.131$.]
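5. The pooled variance is
$$s^2 = \frac{7 \times 104^2 + 8 \times 92^2}{8 + 9 - 2} = 9561.6,$$
so that $s = 97.78$, and the test statistic is
$$t_{15} = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{490 - 420 - 0}{97.78\sqrt{\dfrac{1}{8} + \dfrac{1}{9}}} = 1.47.$$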
6. Because 1.47 < 2.131, we cannot reject H0 , and we conclude that we cannot detect
a difference between the effectiveness of the two promotions.
Example 10C: Two methods of assembling a new television component are under consideration by management. Because of more expensive machinery requirements, method
B will only be adopted if it is significantly shorter than method A by more than a minute.
In order to determine which method to adopt a skilled worker becomes proficient in both
methods, and is then timed with a stopwatch while assembling the component by both
methods. The following data were obtained:
Method A                        Method B
$\bar{x}_1 = 7.72$ minutes      $\bar{x}_2 = 6.21$ minutes
$s_1 = 0.67$ minutes            $s_2 = 0.51$ minutes
$n_1 = 17$                      $n_2 = 25$
Fisher was a British statistician who was one of the founding fathers of the discipline of
Statistics.
Because variances are by definition positive, the statistic F is always positive. When
$H_0 : \sigma_1^2 = \sigma_2^2$ is true, we expect the sample variances to be nearly equal, so that F will
be close to one. When $H_0$ is false, and the population variances are unequal, then F
will tend to be either large or small, where in this context small means close to zero.
Thus, we accept $H_0$ for F-values close to one, and we reject $H_0$ when the F-value is
too large or too small. The rejection region is obtained from F-tables, but the shape of
the probability density function for a typical F-distribution is shown here.
[Figure: probability density function $f(x)$ of a typical F-distribution.]
The most striking feature of the probability density function of the F -distribution
is that it is not symmetric. It is positively skewed, having a long tail to the right.
The mode (the x-value associated with the maximum value of the probability density
function) is less than one, but the mean is greater than one, the long tail pulling the
mean to the right. The lack of symmetry makes it seem that we will need separate tables
for the upper and lower percentage points. However, by means of a simple trick (to be
explained later), we can get away without having tables for the lower percentage points
of the F -distribution.
Because the F -statistic is the ratio of two sample variances, it should come as no
surprise to you that there are two degrees of freedom numbers attached to F: the
degrees of freedom for the variance in the numerator, and the degrees of freedom for the
denominator variance. It would therefore appear that we need an encyclopaedia of tables
for the F -distribution! To avoid this, it is usual to only present the four most important
values for each F -distribution; the 5%, 2.5%, 1% and 0.5% points. The conventional
way of presenting F -tables is to have one table for each of these percentage points; in
this book Table 4.1 gives the 5% points, Table 4.2 the 2.5% points, Table 4.3 the 1%
points and Table 4.4 the 0.5% points. Within each table, the rows and columns are used
for the degrees of freedom in the denominator and numerator, respectively.
Example 11A: In example 8A, the sample standard deviations were s1 = 0.42 and
s2 = 0.27. Let us test at the 5% level to see if the assumption of equal variances was
reasonable.
1. $H_0 : \sigma_1^2 = \sigma_2^2$.
2. $H_1 : \sigma_1^2 \neq \sigma_2^2$.
$n_2 - 1$.
In example 8A, the sample sizes for $s_1^2$ and $s_2^2$ were 6 and 5 respectively. Thus we
use the F-distribution with $6 - 1$ and $5 - 1$ degrees of freedom, i.e. $F_{5,4}$.
Because we have a two-sided test at the 5% level, we need the upper and lower 2½%
points of $F_{5,4}$. This means that we must use Table 4.2, and go to the intersection
of column 5 and row 4, where we find that the upper 2½% point of $F_{5,4}$ is 9.36.
We write $F_{5,4}^{(0.025)} = 9.36$. (Notice that, in F-tables, the usual matrix convention
of putting rows first, then columns, is not adopted.)
[Figure: density $f(x)$ of the $F_{5,4}$ distribution, with the upper 2½% point $F_{5,4}^{(0.025)} = 9.36$ marked.]
We will reject H0 if our observed F value exceeds 9.36. The tables do not enable
us to find the lower rejection region, but for the reasons explained below, we do
not in fact need it.
5. The observed F-value is
$$F = s_1^2/s_2^2 = 0.42^2/0.27^2 = 2.42.$$
6. Because 2.42 < 9.36, we do not reject H0 . We conclude that the assumption of
equal variances is tenable, and that therefore it was justified to pool the variances
for the two-sample t-test in example 8A.
The trick that enables us never to need lower percentage points of the F-distribution
is to adopt the convention of always putting the numerically larger variance into
the numerator, so that the calculated F-statistic is always larger than one, and
adjusting the degrees of freedom. Let $s_1^2$ and $s_2^2$ have $n_1 - 1$ and $n_2 - 1$ degrees of
freedom respectively. Then, if $s_1^2 > s_2^2$, consider the ratio $F = s_1^2/s_2^2$ which has the
$F_{n_1-1,n_2-1}$-distribution. If $s_2^2 > s_1^2$, use $F = s_2^2/s_1^2$ with the $F_{n_2-1,n_1-1}$-distribution.
This trick depends on the mathematical result that $1/F_{n_1-1,n_2-1} = F_{n_2-1,n_1-1}$.
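The convention of putting the larger variance in the numerator is easy to mechanise. The helper below is a sketch only (the name `variance_ratio` is hypothetical); it returns the F-ratio together with the degrees of freedom in the order needed to look up the tables.

```python
def variance_ratio(var1, n1, var2, n2):
    """Return (F, numerator df, denominator df), larger variance on top.

    var1 and var2 are sample variances based on n1 and n2 observations.
    """
    if var1 >= var2:
        return var1 / var2, n1 - 1, n2 - 1
    return var2 / var1, n2 - 1, n1 - 1

# Standard deviations 0.42 (n = 6) and 0.27 (n = 5) from example 8A.
F, df_num, df_den = variance_ratio(0.42**2, 6, 0.27**2, 5)
print(f"F = {F:.2f} on ({df_num}, {df_den}) degrees of freedom")

# If the samples are supplied in the opposite order, the result is unchanged.
F_swapped, df_num_swapped, df_den_swapped = variance_ratio(0.27**2, 5, 0.42**2, 6)
```

Because the function swaps the samples when necessary, the returned ratio is never below one, so only upper percentage points are needed.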
Example 12B: We have two machines that fill milk bottles. We accept that both
machines are putting, on average, one litre of milk into each bottle. We suspect, however,
that the second machine is considerably less consistent than the first, and that the volume
of milk that it delivers is more variable. We take a random sample of 15 bottles from
the first machine and 25 from the second and compute sample variances of 2.1 ml² and
5.9 ml² respectively. Are our suspicions correct? Test at the 5% significance level.
1. $H_0 : \sigma_1^2 = \sigma_2^2$.
2. $H_1 : \sigma_1^2 < \sigma_2^2$.
3. Significance level : 5% (and note that we now have a one-sided test).
4. & 5. Because $s_2^2 > s_1^2$, we compute
$$F = s_2^2/s_1^2 = 5.9/2.1 = 2.81.$$
This has the $F_{n_2-1,n_1-1} = F_{24,14}$-distribution. From our tables, $F_{24,14}^{(0.05)} = 2.35$.
[Figure: density $f(x)$ of the $F_{24,14}$ distribution, with the upper 5% point $F_{24,14}^{(0.05)} = 2.35$ marked.]
6. The observed F -value of 2.81 > 2.35 and therefore lies in the rejection region.
We reject H0 and conclude that the second bottle filler has a significantly larger
variance (variability) than the first.
Example 13C: Packing proteas for export is time consuming. A florist timed how long
it took each of 12 labourers to pack 20 boxes of proteas under normal conditions, and
then timed 10 labourers while they each packed 20 boxes of proteas with background
music. The average time to pack 20 boxes of proteas under normal conditions was 170
minutes with a standard deviation of 20 minutes, while the average time with background
music was 157 minutes, with a standard deviation of 25 minutes. At the 5% significance
level, test whether background music is effective in reducing packing time. Test also the
assumption of equal variances.
$$n^* = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 + 1} + \dfrac{(s_2^2/n_2)^2}{n_2 + 1}} - 2.$$
This messy formula inevitably gives a value for $n^*$ which is not an integer. It is
feasible to interpolate in the t-tables, but we will simply take $n^*$ to be the nearest
integer value to that given by the formula above.
Example 14B: Personnel consultants are interested in establishing whether there is
any difference in the mean age of the senior managers of two large corporations. The
following data gives the ages, to the nearest year, of a random sample of 10 senior
managers, sampled from each corporation:

Corporation 1:   52   50   53   42   57   43   52   44   51   34
Corporation 2:   44   45   39   49   43   49   45   47   42   46

$\bar{x}_1 = 47.8$    $s_1 = 6.86$
$\bar{x}_2 = 45.00$   $s_2 = 3.20$
= 6.54).
5. The conclusion is that the population variances are not equal ($F_{9,9} = 4.60$, $P < 0.05$).
Thus pooling the sample variances is not justified, and we have to use the approximate t-test:
1. $H_0 : \mu_1 = \mu_2$.
2. $H_1 : \mu_1 \neq \mu_2$.
3. Substituting into the formula for the approximate test statistic yields
$$t^* = \frac{47.8 - 45.0}{\sqrt{\dfrac{6.86^2}{10} + \dfrac{3.20^2}{10}}} = 1.17.$$
Substituting into the degrees of freedom formula yields
$$n^* = \frac{\left(\dfrac{6.86^2}{10} + \dfrac{3.20^2}{10}\right)^2}{\dfrac{(6.86^2/10)^2}{11} + \dfrac{(3.20^2/10)^2}{11}} - 2 = 13.57 \approx 14.$$
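4. Because $t_{14}^{(0.100)} = 1.345$ exceeds the observed value 1.17, we cannot reject $H_0$
even at the 20% significance level.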
5. We conclude that there is no difference in the mean age of senior employees between
the two corporations (t14 = 1.17, P > 0.20).
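The approximate test and its degrees of freedom formula are tedious by hand, so a small sketch may help. The function name `approx_t` is illustrative; the degrees of freedom follow the formula given in this chapter (with $n_1 + 1$ and $n_2 + 1$ in the denominators and 2 subtracted), which is not necessarily the variant a statistics package would use.

```python
import math

def approx_t(x1, s1, n1, x2, s2, n2, delta0=0.0):
    """Approximate two-sample t statistic for unequal variances.

    Returns (t_star, n_star), where n_star is the (non-integer)
    degrees of freedom from the chapter's formula.
    """
    v1 = s1**2 / n1
    v2 = s2**2 / n2
    t_star = (x1 - x2 - delta0) / math.sqrt(v1 + v2)
    n_star = (v1 + v2) ** 2 / (v1**2 / (n1 + 1) + v2**2 / (n2 + 1)) - 2
    return t_star, n_star

# Summary statistics for the two corporations.
t_star, n_star = approx_t(47.8, 6.86, 10, 45.0, 3.20, 10)
df = round(n_star)  # take the nearest integer, as suggested in the text

print(f"t* = {t_star:.2f}, n* = {n_star:.2f}, use {df} degrees of freedom")
```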
Example 15C: A particular business school requires a satisfactory GMAT examination
score as its entrance requirement. The admissions officer believes that, on average, engineers have higher GMAT scores than applicants with an arts background. The following
GMAT scores were extracted from a random sample of applicants with engineering and
arts backgrounds.
Engineering
Art
600
550
650
450
640
700
720
420
700
750
620
500
740
520
650
Gold
shares
Industrial
shares
3.6
4.6
3.2
2.5
4.0
4.0
2.5
8.4
3.9
3.2
8.4
5.2
5.0
15.5
8.7
4.7
2.7
5.7
2.7
3.1
3.7
3.6
3.1
5.5
4.6
4.1
5.3
6.5
3.5
6.2
4.3
3.6
4.5
3.5
3.9
3.7
5.6
4.5
4.0
1.6
5.1
3.1
6.8
               Gold     Industrial
$n$             20         23
$\bar{x}$      4.675      4.710
$s$            2.677      2.004
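START: Is the population variance known or estimated?

  Population variance known. Is there one sample or two?

    One sample. Test statistic:
      $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$

    Two samples. Test statistic:
      $Z = \dfrac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0, 1)$

  Population variance unknown. Is there one sample or two?

    One sample. Test statistic:
      $t = \dfrac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}$

    Two samples. First test the assumption that variances are equal. Test statistic:
      $F = \dfrac{s_1^2}{s_2^2}$ or $F = \dfrac{s_2^2}{s_1^2}$

      Assumption accepted. Test statistic:
        $t = \dfrac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \sim t_{n_1+n_2-2}$
        where $s = \sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$

      Assumption rejected. Test statistic:
        $t^* = \dfrac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \sim t_{n^*}$
        where $n^* = \dfrac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 + 1} + \dfrac{(s_2^2/n_2)^2}{n_2 + 1}} - 2$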
Example 17B: Use the decision tree to decide which test to apply. An experiment
compared the abrasive wear of two different laminated materials. Twelve pieces of
material 1 and 10 pieces of material 2 were tested and in each case the depth of wear
was measured. The results were as follows:
Material 1 x1 = 8.5 mm s1 = 4 mm n1 = 12
Material 2 x2 = 8.1 mm s2 = 5 mm n2 = 10
Test the hypothesis that the two types of material exhibit the same mean abrasive
wear at the 1% significance level.
Begin at START
The population variance is estimated. Go right. There are samples from two
populations. Go right again. The assumption that the variances are equal is
accepted. Check this for yourself. Go left.
Pool the variances and use the test statistic with $n_1 + n_2 - 2$ degrees of freedom.
Do this example as an exercise.
Example 18C: In an assembly process, it is known from past records that it takes
an average of 3.7 hours with a standard deviation of 0.3 hours to assemble a certain
computer component. A new procedure is adopted. The first 100 items assembled using
the new procedure took, on average, 3.5 hours each. Assuming that the new procedure
did not alter the standard deviation, test whether the new procedure is effective in
reducing assembly time.
Solutions to examples
...
219
...
9.1 Find the 95% confidence interval for the mean salary of teachers if a random sample
of 16 teachers had a mean salary of R12 125 with a standard deviation of R1005.
9.2
9.3 Over the past 12 months the average demand for sulphuric acid from the stores
of a large chemical factory has been 206 ℓ; the sample standard deviation has
been 50 ℓ. Find a 99% confidence interval for the true mean monthly demand for
sulphuric acid.
9.4
9.5 In a textile manufacturing process, the average time taken is 6.4 hours. An innovation which, it is hoped, will streamline the process and reduce the time, is
introduced. A series of 8 trials used the modified process and produced the following results:
6.1   5.9   6.3   6.5   6.2   6.0   6.4   6.2.
Using a 5% significance level, decide whether the innovation has succeeded in reducing average process time.
9.6 In 1989, the Johannesburg Stock Exchange (JSE) boomed, and the Allshare Index
showed an annual return of 55.5%. A sample of industrial shares yielded the
following returns:
54.5   52.8   47.8   56.1   42.3   23.2   59.8   52.5
33.1   65.7   49.7   47.5   16.0   32.5   50.3   46.7
9.7 The mean score on a standardized psychology test is supposed to be 50. Believing
that a group of psychologists will score higher (because they can see through the
questions), we test a random sample of 11 psychologists. Their mean score is 55
and the standard deviation is 3. What conclusions can be made?
9.8 The specification quoted by ABC Alloys for a particular metal alloy was a melting
point of 1660 °C. Fifteen samples of the alloy, selected at random, had a mean melting
point of 1648 °C with a standard deviation of 45 °C. Is this evidence consistent
with the quoted specification?
ON GROUP
n2 = 8
x2 = 700
s22 = 300
A comparison is made between two brands of toothpaste to compare their effectiveness at preventing cavities. 25 children use Hole-in-None and 30 children use
Fantoothtic in an impartial test. The results are as follows:
                                   Hole-in-None    Fantoothtic
Sample size                             24              30
Average number of new cavities          1.6             2.7
Standard deviation                      0.7             0.9
At the 1% level of significance, investigate whether one brand is better than the
other.
9.13 A company claims that its light bulbs are superior to those of a competitor on
the basis of a study which showed that a sample of 40 of its bulbs had an average
lifetime of 522 hours with a standard deviation of 28 hours, while a sample of
30 bulbs made by the competitor had an average lifetime of 513 hours with a
standard deviation of 24 hours. Test the null hypothesis $\mu_1 - \mu_2 = 0$ against a
suitable one-sided alternative to see if the claim is justified.
9.14 The densities of sulphuric acid in two containers were measured, four determinations being made on one and six on the other. The results were:
(1)   1.842   1.846   1.843   1.843
(2)   1.848   1.843   1.846   1.847   1.847   1.845
                                          A        B
Sample size                               21       28
Sample mean (ℓ/100 km)                   9.03     8.57
Sample standard deviation (ℓ/100 km)     1.73     0.89
                                 Hardware    Crockery
Sample size                         15          15
Mean daily sales (rands)           1400        1250
Standard deviation (rands)          180         120
The sales manager feels that mean sales in hardware is significantly higher than in
crockery. Test this idea statistically using a 1% significance level.
9.18
Travel times by road between two towns are normally distributed; a random
sample of 16 observations had a mean of 30 minutes and a standard deviation of
5 minutes.
(a) Find a 99% confidence interval for the mean travel time.
(b) Estimate how large a sample would be needed to be 95% sure that the sample
mean was within half a minute of the population mean. You need to modify
the formula for estimating sample sizes in Chapter 8 to take account of the
fact that you are given a sample standard deviation based on a small sample
rather than the population standard deviation.
9.19 An insurance company has found that the number of claims made per year on a
certain type of policy obeys a Poisson distribution. Until five years ago, the rate
of claims averaged 13.1 per year. New restrictions on the acceptance of this type
of insurance were introduced five years ago, and since then 51 claims have been
made.
(a) Test whether the restrictions have been effective in reducing the number of
claims.
(b) Find an approximate 95% confidence interval for the average claim rate under
the new restrictions.
9.20 A new type of battery is claimed to have two hours more life than the standard
type. Random samples of new and standard batteries are tested, with the following
summarized results:
            Sample size    Sample mean (hours)    Sample standard deviation (hours)
New              94              39.3                        7.2
Standard         42              36.8                        6.4
The mean commission of floor-wax salesmen has been R6000 per month in the
past. New brands of wax are now providing stiffer competition for the salesmen,
but inflation has pushed up the commission per sale. Management wishes to test
whether the figure of R6000 per month still prevails, and examines a sample of 120
recent monthly commission figures. They are found to have a mean of R5850 and
a standard deviation of R800. Test at a 5% significance level whether the mean
commission rate has changed.
9.22 Two bus drivers, M and N, travel the same route. Over a number of journeys the
times taken by each driver to travel from bus stop 5 to bus stop 19 were noted.
The summarized results are presented below:
Driver    Trips    Mean (minutes)    Standard deviation (minutes)
M          12          18.1                   1.9
N          21          21.3                   3.9
(a) Test whether driver M is more consistent in journey times than driver N.
(b) Test whether there is a significant difference in the times taken by each driver.
9.23 It is necessary to compare the precision of two brands of detectors for measuring
mercury concentration in the air. The brand B detector is thought to be more
accurate than the brand A detector. Seven measurements are made with a brand
A instrument, and six with a brand B instrument one lunch hour. The results
(micrograms per cubic metre) are summarized as follows:
$\bar{x}_A = 0.87$   $s_A = 0.019$
$\bar{x}_B = 0.91$   $s_B = 0.008$
At the 5% significance level, do the data provide evidence that brand B measures
more precisely than brand A?
9.24 Show that if we test at the 5% significance level (using either the t- or normal
distributions) the null hypothesis $H_0 : \mu = \mu_0$ against $H_1 : \mu \neq \mu_0$, we will reject
$H_0$ if and only if $\mu_0$ lies outside the 95% confidence interval for $\mu$.
9.25 The health department wishes to determine if the mean bacteria count per ml
of water at Zeekoeivlei exceeds the safety level of 200 per ml. Ten 1 mℓ water
samples are collected. The bacteria counts are:
225   210   185   202   216   193   190   207   204   220
(a) what the sample mean of the original data was, and
(b) what size sample he drew.
9.27 The following are observations made on a normal distribution with mean $\mu$ and
variance $\sigma^2$:
50   53   47   51   49
Solutions to exercises
...
225
9.15 $F_{34,34} = 5.44 > 2.57$, reject $H_0$ at 1% level. Variances cannot be pooled. Degrees
of freedom $n^* = 46.8 \approx 47$. $t^* = 3.88 > 2.01$, reject $H_0$.
9.16 $F_{20,27} = 3.78$, $P < 0.01$, very significant, so variances cannot be pooled. Degrees
of freedom $n^* = 28.7 \approx 29$. $t^* = 1.11$, $P < 0.20$, possibly significant.
9.17 $F_{14,14} = 2.25 < 2.48$, cannot reject $H_0$ and hence pool the variances. $t_{28} = 2.69 > 2.467$, reject $H_0$.
9.18 (a) (26.32 , 33.68)
(b) $n = (t^*_{m-1}\, s/L)^2 = (2.131 \times 5/\tfrac{1}{2})^2 = 455$, where $m$ is the size of the sample
used for estimating $s$ (in this case 16), so that the degrees of freedom for $t$
is 15.
9.19 (a) $z = -1.79$, $P < 0.05$, significant reduction.
(b) (7.4 , 13.0).
9.20 $F_{93,41} = 1.27 < 1.74$, cannot reject $H_0$ and pool variances ($s = 6.965$).
$t_{134} = 0.386 < 1.656$, cannot reject $H_0$.
9.21 $t_{119} = -2.05 < -1.98$, reject $H_0$.
9.22 (a) $F_{20,11} = 4.214$, $P < 0.05$, significant differences in variances. Pooling not
justifiable.
(b) $n^* = 32$, $t^* = 3.16$, $P < 0.005$, highly significant difference in means.
9.23 $F_{6,5} = 5.64 > 4.95$, reject $H_0$.
9.25 $t_9 = 1.25$, $P < 0.20$, possibly significant.
9.26 (a) $\bar{x} = 72.1$
(b) $n = 36$.
Chapter
10
Number of misprints     0     1     2     3     4     5
Number of pages        43    69    53    21     8     6
The table tells us that 43 of the 200 pages examined were free of misprints, 69 of
the pages had one misprint, 53 pages had two misprints each (a total of 106 misprints
on these 53 pages), . . . , 6 pages had five misprints on each page (30 misprints on these
6 pages).
In order to test whether the Poisson distribution fits this data, the first problem is
to decide what value to use for the parameter $\lambda$ of the Poisson distribution. This can
either be specified by the null hypothesis, or the data can be used to estimate $\lambda$. We
treat these two situations separately.
Let us first consider the test of the null hypothesis that the data can be thought
of as a sample from a Poisson distribution with parameter $\lambda = 1.2$ misprints per page
against the alternative that the data come from some other distribution.
Thus we have
1. $H_0$ : Data come from a Poisson distribution with $\lambda = 1.2$
2. $H_1$ : The distribution is not Poisson with $\lambda = 1.2$
3. Significance level : 5%
4. & 5. We need firstly to compute the expected (theoretical) frequencies, assuming that
the null hypothesis is true. If misprints are occurring in accordance with a Poisson
distribution with rate $\lambda = 1.2$ then the probability that a page contains $x$ misprints
is
$$p(x) = e^{-1.2}\, 1.2^x / x!$$
Thus the probability of no misprints is
$$p(0) = e^{-1.2} = 0.3012$$
Thus 30.12% of pages are expected to have no misprints. A sample of 200 pages
can therefore be expected to have 60.24 pages with no misprints (30.12% of 200).
This theoretical frequency is to be compared with an observed frequency of 43.
Similarly $p(1) = e^{-1.2} \times 1.2 = 0.3614$. Thus the expected frequency of one error is
$200 \times 0.3614 = 72.28$. We compare this with an observed 69 pages with one error.
Continuing in this way we build up a table of observed and expected frequencies.
number of        observed number     theoretical          expected
misprints (x)    of pages ($O_i$)    probability p(x)     frequency ($E_i$)
0                      43               0.3012               60.24
1                      69               0.3614               72.28
2                      53               0.2169               43.38
3                      21               0.0867               17.34
4                       8               0.0260                5.20
5 or more               6               0.0078                1.56
Even if H0 is true, we anticipate that the observed and expected frequencies will
not be exactly equal; this is because we expect some sampling fluctuation in our
random sample of 200 pages. We would clearly like to reject H0 , however, if the
difference between the observed frequencies and the expected frequencies is too
large.
We need to find a test statistic which is a function of the differences between
observed and expected frequencies and which has a known sampling distribution.
The sum of these differences is of no use because they sum to zero. So we square
the differences and they all become positive. A difference of 3 is negligible if
the observed and expected frequencies are 121 and 124, but it is important if the
frequencies are 6 and 9. To take account of this we divide the squared differences
by their expected frequency. The right statistic to use is
$$D^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
where the sum is taken over all the cells in the table.
The statistic $D^2$ has approximately a chi-squared ($\chi^2$) distribution. Like the t-
and F-distributions, the $\chi^2$ distribution has a degrees of freedom number attached
to it. In tests like the one above, the correct degrees of freedom is given by $k - 1$,
where $k$ is the number of cells into which the data are categorized. Here $k = 6$,
and therefore the degrees of freedom for $\chi^2$ is 5. Thus we will reject $H_0$ if $D^2$
exceeds the 5% point of the $\chi^2_5$ distribution. From our tables $\chi_5^{2(0.05)} = 11.071$.
[Figure: density $f(x)$ of the $\chi^2_5$ distribution, with the upper 5% point $\chi_5^{2(0.05)} = 11.071$ marked.]
If $D^2$ exceeds 11.071 then the observed and expected frequencies are too far
apart for their differences to be explained by chance sampling fluctuations alone.
Let us compute $D^2$ for the above data.
$$D^2 = \sum \frac{(O_i - E_i)^2}{E_i} = \frac{(43 - 60.24)^2}{60.24} + \frac{(69 - 72.28)^2}{72.28} + \cdots + \frac{(6 - 1.56)^2}{1.56} = 22.17$$
6. This value lies in the rejection region. We thus reject the null hypothesis that the
data are sampled from a Poisson distribution with $\lambda = 1.2$.
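The expected frequencies and the statistic $D^2$ for the specified rate of 1.2 can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function names are invented for illustration, and because the expected frequencies are not rounded to two decimals here, the value of $D^2$ comes out slightly different from the hand calculation (but still well above 11.071).

```python
import math

def poisson_expected(lam, n_obs, max_count):
    """Expected frequencies for counts 0 .. max_count - 1,
    plus a final cell for 'max_count or more'."""
    probs = [math.exp(-lam) * lam**x / math.factorial(x) for x in range(max_count)]
    probs.append(1.0 - sum(probs))  # the open-ended last cell
    return [n_obs * p for p in probs]

def chi_squared_statistic(observed, expected):
    """D^2 = sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [43, 69, 53, 21, 8, 6]          # pages with 0, 1, ..., 5 or more misprints
expected = poisson_expected(1.2, sum(observed), 5)
d2 = chi_squared_statistic(observed, expected)

critical = 11.071                           # 5% point of chi-squared with 5 df (tables)
print([round(e, 2) for e in expected])
print(f"D^2 = {d2:.2f}, reject H0: {d2 > critical}")
```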
We said earlier that the statistic $D^2$ has approximately a $\chi^2$ distribution. This approximation is good provided all the expected frequencies exceed about 5. Looking
back at the table of expected frequencies, you will see that this condition has been violated: the expected number of pages with 5 (or more) misprints is 1.56, which is
much less than 5. To get around this we amalgamate adjoining classes. This reduces
the number of cells, and also the degrees of freedom. Here it is necessary to amalgamate
the last two cells:
number of      observed number     theoretical     expected
misprints      of pages ($O_i$)    probability     frequency ($E_i$)
0                    43              0.3012           60.24
1                    69              0.3614           72.28
2                    53              0.2169           43.38
3                    21              0.0867           17.34
4 or more        8 + 6 = 14          0.0338            6.76

$$D^2 = \sum \frac{(O_i - E_i)^2}{E_i} = 15.78$$
We now have only 5 cells, so the degrees of freedom for $\chi^2$ is 4. The 5% point of $\chi^2_4$ is
9.488. Because 15.78 > 9.488, we reject $H_0$ and come to the same conclusion as before
(but using the right method this time!).
[Figure: density $f(x)$ of the $\chi^2_4$ distribution, with the upper 5% point $\chi_4^{2(0.05)} = 9.488$ marked.]
Let us now use the data to estimate $\lambda$, and see what difference this makes to the
test. Our null and alternative hypotheses are now
1. $H_0$ : the data fit some Poisson distribution, and
2. $H_1$ : the data fit a distribution other than the Poisson distribution.
3. Significance level : 5%
4. & 5. To find $\lambda$ we need to estimate the rate at which misprints occurred in our sample
data. The total number of misprints that occurred was $0 \times 43 + 1 \times 69 + 2 \times 53 +
\cdots + 5 \times 6 = 300$ misprints. 300 misprints in 200 pages implies that the mean
rate at which misprints occur is 1.5 misprints per page. We therefore try to fit a
Poisson distribution with $\lambda = 1.5$.
Using the same procedure as before, we find a table of expected frequencies.
number of      observed number     theoretical     expected
misprints      of pages ($O_i$)    probability     frequencies ($E_i$)
0                    43              0.2231           44.62
1                    69              0.3347           66.94
2                    53              0.2510           50.20
3                    21              0.1255           25.10
4                     8 } 14         0.0471            9.42 } 13.14
5 or more             6 }            0.0186            3.72 }
We amalgamate the last two cells, so that all expected values exceed 5. We now
have five cells.
The rule for finding the degrees of freedom in this case is
Degrees of freedom = $k - d - 1$
where $k$ is the number of cells, and
$d$ is the number of parameters estimated from the data.
Here $k = 5$ and $d = 1$, because we estimated just one parameter, $\lambda$, from the data.
Thus we must use $\chi^2$ with $5 - 1 - 1 = 3$ degrees of freedom. The 5% point of $\chi^2_3$ is 7.815.
We reject $H_0$ if $D^2$ exceeds 7.815. Notice that goodness of fit tests are intrinsically
one-sided: we reject $H_0$ if $D^2$ is too large. If $D^2$ is small, it means that the distribution
specified by $H_0$ fits the data very well.
[Figure: density $f(x)$ of the $\chi^2_3$ distribution, with the upper 5% point $\chi_3^{2(0.05)} = 7.815$ marked.]
Using $\sum \dfrac{(O_i - E_i)^2}{E_i}$ we calculate
$$D^2 = \frac{(43 - 44.62)^2}{44.62} + \cdots + \frac{(14 - 13.14)^2}{13.14} = 1.01$$
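The whole procedure with an estimated rate (estimate the parameter, compute expected frequencies, amalgamate small cells, count degrees of freedom) can be sketched as follows. All function names are hypothetical, the "5 or more" cell is treated as exactly 5 when estimating the rate (as in the text), and the critical value 7.815 is copied from the tables.

```python
import math

def poisson_expected(lam, n_obs, max_count):
    """Expected frequencies for counts 0 .. max_count - 1 and 'max_count or more'."""
    probs = [math.exp(-lam) * lam**x / math.factorial(x) for x in range(max_count)]
    probs.append(1.0 - sum(probs))
    return [n_obs * p for p in probs]

def amalgamate_tail(observed, expected, minimum=5.0):
    """Merge the last cell into its neighbour while its expected frequency is below minimum."""
    obs, exp = list(observed), list(expected)
    while len(exp) > 1 and exp[-1] < minimum:
        last_o, last_e = obs.pop(), exp.pop()
        obs[-1] += last_o
        exp[-1] += last_e
    return obs, exp

observed = [43, 69, 53, 21, 8, 6]
n_pages = sum(observed)

# Estimate the rate from the data: total misprints divided by pages.
lam = sum(x * o for x, o in enumerate(observed)) / n_pages

expected = poisson_expected(lam, n_pages, 5)
obs, exp = amalgamate_tail(observed, expected)

d2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
df = len(obs) - 1 - 1            # k - d - 1 with d = 1 estimated parameter

print(f"lambda = {lam}, cells = {len(obs)}, df = {df}, D^2 = {d2:.2f}")
print("reject H0:", d2 > 7.815)  # 5% point of chi-squared with 3 df
```

Since $D^2$ is far below 7.815, the rule stated above does not lead to rejection of $H_0$.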
A short-cut formula . . .
As in the case with calculating the sample variance there is a short-cut formula for
calculating $D^2$.
$$D^2 = \sum \frac{(O_i - E_i)^2}{E_i}
      = \sum \frac{O_i^2 - 2 O_i E_i + E_i^2}{E_i}
      = \sum \frac{O_i^2}{E_i} - 2 \sum O_i + \sum E_i$$
But $\sum O_i = \sum E_i = n$, the sample size. (Each summation is over the $k$ cells.)
Therefore
$$D^2 = \sum \frac{O_i^2}{E_i} - n.$$
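A quick numerical check of the short-cut formula, using the amalgamated frequencies from the misprint example (a sketch; variable names are illustrative). The two forms agree only because the expected frequencies sum to the sample size.

```python
observed = [43, 69, 53, 21, 14]
expected = [44.62, 66.94, 50.20, 25.10, 13.14]   # these sum to 200

n = sum(observed)

# Definition: sum of squared differences divided by expected frequencies.
long_form = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Short-cut: sum of O^2 / E, minus the sample size.
short_form = sum(o**2 / e for o, e in zip(observed, expected)) - n

print(f"long form  = {long_form:.4f}")
print(f"short form = {short_form:.4f}")
```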
Example 2B: A random sample of 230 students took an I.Q. test. The scores they
obtained have been summarized in the table below:
Score                   < 90    90 - 110    110 - 130    > 130
Observed frequency       18        87          104         21
Test at the 5% level whether these data come from a normal distribution.
1. H0 : The data follow a normal distribution (parameters to be estimated from the
data).
2. H1 : The distribution is not normal.
3. Significance level : 5%.
4. We use the $\chi^2$ distribution with $k - d - 1$ degrees of freedom. Here there are $k = 4$
cells and $d = 2$ parameters ($\mu$ and $\sigma$) are estimated. Thus there is $4 - 2 - 1 = 1$
degree of freedom, and $\chi_1^{2(0.05)} = 3.841$.
[Figure: density $f(x)$ of the $\chi^2_1$ distribution, with the upper 5% point $\chi_1^{2(0.05)} = 3.841$ marked.]
5. We need to estimate $\mu$ and $\sigma$. Let us suppose that before the data was summarized
into the table above, the sample mean $\bar{x}$ and the standard deviation $s$ were
computed to be 111 and 16.3 respectively.
We now use $N(111, 16.3^2)$ to compute the theoretical probabilities that randomly
selected I.Q.s fall into the 4 cells in our table.
If $X \sim N(111, 16.3^2)$, then
$$\Pr(X < 90) = \Pr\left(Z < \frac{90 - 111}{16.3}\right) = \Pr(Z < -1.29) = 0.0985$$
Thus the probability that a randomly selected single individual has an I.Q. less
than 90 is 0.0985. In a sample of size 230 we therefore expect $230 \times 0.0985 = 22.7$
students to have I.Q.s below 90.
We do a similar calculation for the remaining cells, to obtain the table:
score          observed      theoretical     expected
               frequency     probability     frequency
< 90              18           0.0985          22.7
90 - 110          87                           86.8
110 - 130        104                           92.7
> 130             21                           27.8
Thus
$$D^2 = \sum \frac{O_i^2}{E_i} - n = \frac{18^2}{22.7} + \frac{87^2}{86.8} + \frac{104^2}{92.7} + \frac{21^2}{27.8} - 230 = 234.01 - 230 = 4.01$$
6. Because 4.01 > 3.841, we reject H0 and conclude that the normal distribution is
not a good fit to this data.
This example illustrates the method to use to test whether data fit a normal
distribution. In practice, however, the data would be divided into many more
than the 4 cells we used in this example.
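The normal goodness of fit calculation can be sketched with Python's standard library. The helper name `normal_cell_probabilities` is made up for this illustration. Because the probabilities here are computed without rounding z to two decimals, $D^2$ comes out a little larger than the hand-calculated 4.01, but the conclusion is the same.

```python
from statistics import NormalDist

def normal_cell_probabilities(mean, sd, boundaries):
    """Probabilities of the cells (-inf, b1), (b1, b2), ..., (bk, +inf)."""
    dist = NormalDist(mean, sd)
    cdf = [0.0] + [dist.cdf(b) for b in boundaries] + [1.0]
    return [upper - lower for lower, upper in zip(cdf, cdf[1:])]

observed = [18, 87, 104, 21]                 # < 90, 90-110, 110-130, > 130
n = sum(observed)

probs = normal_cell_probabilities(111, 16.3, [90, 110, 130])
expected = [n * p for p in probs]

# Short-cut formula for the test statistic.
d2 = sum(o**2 / e for o, e in zip(observed, expected)) - n

df = len(observed) - 2 - 1                   # k - d - 1 with mean and sd estimated
print([round(e, 1) for e in expected])
print(f"D^2 = {d2:.2f} on {df} df, reject H0: {d2 > 3.841}")
```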
Look back at the plot of the $\chi^2_1$-distribution in example 2B. The shape is quite
unlike that of the typical chi-squared distribution with more than one degree of
freedom, as depicted in example 1A. It is closely related to the exponential distribution
of chapter 5. We saw that $\chi_1^{2(0.05)} = 3.841$, i.e. the 5% point of the
chi-squared distribution with one degree of freedom, is 3.841. And it is not an accident
that the square root of 3.841 is equal to 1.96, the 2.5% point of the standard
normal distribution: there is a mathematical relationship between the normal
and chi-squared distributions.
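The relationship is that the square of a standard normal variable has a chi-squared distribution with one degree of freedom, so the event $Z^2 > c$ is the same as $|Z| > \sqrt{c}$. A minimal numerical check, using only the standard library:

```python
from statistics import NormalDist

# Upper 2.5% point of the standard normal distribution.
z = NormalDist().inv_cdf(0.975)

# Squaring it gives the upper 5% point of chi-squared with 1 degree of freedom,
# because Pr(Z^2 > z^2) = Pr(|Z| > z) = 0.05.
chi2_1_upper5 = z**2

print(f"z = {z:.3f}, z^2 = {chi2_1_upper5:.3f}")
```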
Example 3C: A T-shirt manufacturer makes a certain line of T-shirt in three colours:
white, red and blue. 50% of the T-shirts made are white, 25% are red and 25% blue.
One outlet reported sales of 47 white T-shirts, 32 red and 21 blue. Test whether sales
are consistent with the manufactured proportions.
Example 4C: A popular clothing store is interested in establishing the distribution of
customers arriving at the cashiers. During a 100-minute period, the number of customers
arriving per minute was counted, with the following results:
Number of
customers
per minute
observed
frequency
5 or more
25
26
21
15
Firm size    < 0%                          Total
Small          66      90      28      39    223
Medium         24      33      17      11     85
Large          55      37      10      10    112
Total         145     160      55      60    420
There were thus 66 companies whose performances were negative (column < 0%)
and which were small, etc. This rectangular collection of frequencies of occurrence is a
typical example of a contingency table. This is a 3 × 4 contingency table. It is customary
to give the number of rows first, then the number of columns. The designer of the South
African decimal currency had the right idea: rands and cents, rows and columns.
In general we talk of an r × c contingency table with r rows and c columns.
The financial analyst wants to know if there is a significant relationship (at the 5%
level) between the performance and the size of the companies. We use our standard
approach to hypothesis testing.
1. We start by assuming that there is no relationship between the two variables;
thus we have H0 : the performance and the size of a company are statistically
independent.
You will be wise at this point to reread the last few pages of Chapter
3, where the concept of independence was developed!
2. The alternative hypothesis, which the financial analyst is trying to establish, is
H1 : the performance and the size of a company are associated.
3. Significance level : 5%.
4. & 5. Under the assumption of independence made in the null hypothesis, we calculate
theoretical expected frequencies for each cell. If the theoretical frequencies (which
assume independence) and the observed frequencies are too different, we reject
the null hypothesis of independence, and conclude that there is a dependence or
a relationship or association between the two variables. A careful examination of
the table then helps us to determine the nature of the association.
From the table above, we see that 145 out of the 420 companies had negative performance figures. Thus, using the relative frequencies in the sample, we estimate the
probability that a randomly selected company will have negative performance figures as
Pr[negative performance] = 145/420 = 0.345.
Similarly, 223 out of the 420 companies were classified as small. We estimate the
probability that a randomly selected company will be small as Pr[small] = 223/420 =
0.531.
If, as H0 tells us, the performance of a company is independent of its size, then
the probability that a randomly selected company will have both negative performance
figures and be small will be the product of the individual probabilities, i.e.
$$\Pr[\text{negative performance and small}] = \Pr[\text{negative performance}] \times \Pr[\text{small}] = \frac{145}{420} \times \frac{223}{420} = 0.345 \times 0.531 = 0.1833$$
Thus, if H0 is true, the proportion of companies that have negative performance and
are small will be 0.1833, or, expressed as a percentage, 18.33%. We have 420 companies
in our sample, so we expect 0.1833 × 420 = 77.0 to have negative performance and be
small. This expected frequency can be computed more directly as
$$\frac{145}{420} \times \frac{223}{420} \times 420 = \frac{145 \times 223}{420} = 77.0$$
Thus the calculation of the expected frequencies for this (and every other) cell of
the table reduces to a very simple formula:
$$\text{expected frequency} = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$$
The full set of expected frequencies is given in brackets under the observed values
in the table below:
Firm size    < 0%                                   Total
Small          66        90        28        39      223
             (77.0)    (85.0)    (29.2)    (31.9)
Medium         24        33        17        11       85
             (29.3)    (32.4)    (11.1)    (12.1)
Large          55        37        10        10      112
             (38.7)    (42.7)    (14.7)    (16.0)
Total         145       160        55        60      420
In all the above arithmetic, don't lose sight of the fact that the expected frequencies
have been computed assuming that the two variables are independent. Thus, the
larger the differences between the observed frequencies in the table and the expected
frequencies, the less likely it is that the two variables are independent. The comparison
of observed and expected frequencies makes use of the same statistic as was used for
goodness of fit tests:
$$D^2 = \sum \frac{O_i^2}{E_i} - n,$$
where we sum over all the cells in the contingency table, and n is the total number of
observations.
Once again we have a degrees of freedom problem. It can be shown that the following
rule gives us the degrees of freedom we require:
DEGREES OF FREEDOM FOR TESTS OF ASSOCIATION
For an r × c contingency table, the appropriate chi-squared distribution
has degrees of freedom = (r − 1)(c − 1)
Here we have a 3 × 4 contingency table, hence the degrees of freedom for chi-squared is
(3 − 1)(4 − 1) = 6.
Using the 5% significance level, we determine from tables that the 5% point of
$\chi^2_6$ is 12.592. We will reject H0 if $D^2 > 12.592$. Notice that tests of association for
contingency tables are almost invariably one-sided.
We compute $D^2$ using the short-cut formula:
$$D^2 = \frac{66^2}{77.0} + \frac{90^2}{85.0} + \cdots + \frac{10^2}{14.7} + \frac{10^2}{16.0} - 420 = 19.09.$$
[Figure: density f(x) of the $\chi^2_6$-distribution, with the 5% point $\chi^{2\,(0.05)}_6 = 12.592$ marked.]
6. Because 19.09 > 12.592 we reject H0 . Thus our financial analyst has demonstrated
that there is a significant relationship between the performance of a firm and its
size. Examination of the table of observed and expected frequencies shows that
expected values exceed observed values in the top left and bottom right corners
of the contingency table, and vice versa in the top right and bottom left corners.
This indicates an inverse relationship between these two variables, that is, as the
size of a firm decreases, the performance improves.
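The whole test of association can be traced in a few lines of Python. This is a minimal sketch using the observed frequencies of the example; the variable names are ours, and the critical value 12.592 is the one read from tables in the text. Expected frequencies are kept unrounded, which is how the value 19.09 arises.

```python
# Observed frequencies: rows are firm sizes, columns are performance categories.
observed = [
    [66, 90, 28, 39],  # small
    [24, 33, 17, 11],  # medium
    [55, 37, 10, 10],  # large
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency under independence: row total x column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Short-cut formula: D^2 = sum(O^2 / E) - n, summed over all cells.
d2 = sum(
    o * o / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
) - n

df = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 12.592  # 5% point of chi-squared with 6 df, from tables
reject_h0 = d2 > critical_value

print([round(e, 1) for e in expected[0]], round(d2, 2), df, reject_h0)
```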
Example 6B: A photographic company wishes to compare its present automatic film
processing machines with two other machines that have recently come onto the market.
It processes 194 films on the present machine (called A), and hires the other machines
(B and C) for short periods and processes smaller numbers of films on them. Using very
stringent criteria, the manager classifies each film as being satisfactorily processed or
not. The following contingency table is obtained:
                     Processing machine
                     A       B       C
Satisfactory         93      24       8
Unsatisfactory      101       6      18
Test whether the classification of the film as satisfactory or not is independent of the
machine the film was processed on.
No significance level is given, and we use the modified hypothesis testing procedure.
1. H0 : The classification is independent of the machine
2. H1 : The classification is dependent on the machine
3. We compute the expected values, given in brackets in the table below:
                     Processing machine
                     A        B        C       Total
Satisfactory         93       24        8       125
                    (97)     (15)     (13)
Unsatisfactory      101        6       18       125
                    (97)     (15)     (13)
Total               194       30       26       250
For example, the expected value for the first cell is (194 × 125)/250 = 97.
Next we compute $D^2$:
$$D^2 = \frac{93^2}{97} + \frac{24^2}{15} + \cdots + \frac{18^2}{13} - 250 = 14.98.$$
5. Our conclusion is that the classification of the film is highly significantly dependent
on the machine used ($\chi^2_2 = 14.98$, P < 0.001). (Note that we report the value
of the test statistic as a $\chi^2$ value.) Examination of the contingency table shows
that machine B tends to produce a large proportion of photographs classified as
satisfactory, while machine C tends to produce a large proportion of unsatisfactory
photographs. Processing machine B is the recommended choice.
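With two degrees of freedom the P-value need not be bracketed from tables, because the upper tail of the $\chi^2_2$-distribution has the closed form $\Pr(\chi^2_2 > x) = e^{-x/2}$. The sketch below recomputes $D^2$ for the film data and evaluates this tail; the variable names are ours and the closed form applies only when the degrees of freedom equal 2.

```python
import math

# Rows: satisfactory, unsatisfactory. Columns: machines A, B, C.
observed = [
    [93, 24, 8],
    [101, 6, 18],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# D^2 = sum(O^2 / E) - n, with E = row total x column total / n.
d2 = sum(
    o * o / (r * c / n)
    for row, r in zip(observed, row_totals)
    for o, c in zip(row, col_totals)
) - n

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Exact upper-tail probability, valid only for 2 degrees of freedom.
p_value = math.exp(-d2 / 2)

print(round(d2, 2), df, p_value < 0.001)
```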
Example 7C: An analysis of the tries, penalty goals and dropped goals scored by South
Africa in rugby tests against the British Isles, New Zealand, Australia and France gave
the following contingency table:
                 Tries   Penalties   Drops
British Isles      70        31         6
New Zealand        44        38        11
Australia          79        31         7
France             43        32         3
The number of penalties scored against New Zealand and France seems unexpectedly
high, and this leads you to want to test the hypothesis that the mode of scoring is
dependent on the opponents.
Example 8C: An investment broking company is keen to establish whether there is an
association between their clients' perception of their attitude towards risk and the type
of investments they prefer. They obtained 340 responses to a questionnaire designed to
capture this information. A summary of the number of clients falling into the various
categories is shown below:
                                    Risk category
Type of investment        Risk averse   Risk neutral   Risk lover
Fixed deposits                 79            58            49
Bonds                          10             8             9
Unit trusts                    12            10            19
Options and futures            10            34            42
Test the hypothesis that the type of investment is dependent on the risk perception.
Example 9A: A dairy is concerned about the variability in the amount of milk per
bottle. A sample of 25 bottles was examined, and the sample standard deviation of the
contents was computed to be 3.71 ml.
(a) Find a 95% confidence interval for $\sigma$, the population standard deviation.
(b) At the 1% significance level, test whether the observed sample standard deviation is
consistent with the industrial specification that the population standard deviation
must not exceed 2.5 ml.
(a) Confidence Interval: The degrees of freedom of the appropriate $\chi^2$-distribution is
one less than the sample size. Here, $n - 1 = 24$ degrees of freedom. To obtain a
95% confidence interval, the upper and lower 2½% points of the $\chi^2$-distribution must
be obtained from tables. Because the $\chi^2$-distribution is not symmetric, the lower 2½%
point must be looked up separately: $\chi^{2\,(0.975)}_{24} = 12.401$ and $\chi^{2\,(0.025)}_{24} = 39.364$.
[Figure: density f(x) of the $\chi^2_{24}$-distribution, with 2½% of the area below 12.401 and 2½% above 39.364.]
Thus we can write:
$$\Pr[12.401 < \chi^2_{24} < 39.364] = 0.95$$
Because $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$, and because $n - 1 = 24$ and $s^2 = 3.71^2$, we have
$$\Pr[12.401 < 24 \times 3.71^2/\sigma^2 < 39.364] = 0.95$$
Rearranging, so that the inequalities produce a confidence interval for $\sigma^2$, we have
$$\Pr[24 \times 3.71^2/39.364 < \sigma^2 < 24 \times 3.71^2/12.401] = 0.95.$$
This reduces to
$$\Pr[8.39 < \sigma^2 < 26.64] = 0.95.$$
The 95% confidence interval for $\sigma^2$, the population variance, is given by (8.39, 26.64),
and the 95% confidence interval for $\sigma$, the population standard deviation, obtained by
taking square roots, is (2.90, 5.16). Unlike the confidence intervals for the mean, the
point estimate of the variance does not lie at the midpoint of the confidence interval
(and the same is true for the confidence interval for the standard deviation).
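The rearrangement can be checked numerically. The following sketch takes the two percentage points 12.401 and 39.364 from the tables, as in the text, rather than computing them; the variable names are our own.

```python
import math

n = 25      # sample size
s = 3.71    # sample standard deviation (ml)
df = n - 1

# Percentage points of chi-squared with 24 df, as read from tables in the text.
chi2_lower = 12.401  # lower 2.5% point
chi2_upper = 39.364  # upper 2.5% point

# Dividing by the larger percentage point gives the lower limit, and vice versa.
var_low = df * s ** 2 / chi2_upper
var_high = df * s ** 2 / chi2_lower

# Interval for the standard deviation: take square roots of the variance limits.
sd_low, sd_high = math.sqrt(var_low), math.sqrt(var_high)

print(round(var_low, 2), round(var_high, 2), round(sd_low, 2), round(sd_high, 2))
```

Note that the point estimate $s^2 = 13.76$ lies well below the midpoint of the variance interval, as the text remarks.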
CONFIDENCE INTERVAL FOR $\sigma^2$
If we have a random sample of size n from a population with a normal distribution,
and the sample variance is $s^2$, then A% confidence intervals for $\sigma^2$ are given by
$$\left(\frac{(n-1)s^2}{\chi^{2\,(\text{upper})}_{n-1}},\ \frac{(n-1)s^2}{\chi^{2\,(\text{lower})}_{n-1}}\right)$$
where the value $\chi^{2\,(\text{upper})}_{n-1}$ is the appropriate upper percentage point and
$\chi^{2\,(\text{lower})}_{n-1}$ is the appropriate lower percentage point, determined from tables.
The confidence interval for $\sigma$ is found by taking square roots.
6. Because 52.85 > 42.980, we reject the null hypothesis, and conclude that the
specification is not met, and that therefore the dairy has a problem.
(a) From tables, $\chi^{2\,(0.95)}_{17} = 8.672$ and $\chi^{2\,(0.05)}_{17} = 27.587$. Thus the 90% confidence
interval for the population variance $\sigma^2$ is given by
$$(17 \times 14.2^2/27.587,\ 17 \times 14.2^2/8.672) = (124.3,\ 395.3)$$
(b)
1. $H_0: \sigma = 10$.
2. $H_1: \sigma > 10$.
3. The test statistic is
$$\chi^2 = (n-1)s^2/\sigma^2 = 17 \times 14.2^2/10^2 = 34.28.$$
4. From the row in the tables for $\chi^2_{17}$, we see that 34.28 is significant at the 1%
level.
5. We conclude that the security being investigated does not meet the variability
requirements of the portfolio manager ($\chi^2_{17} = 34.28$, P < 0.01), and should
be regarded by him as too risky.
Example 11C: A certain moulded concrete product is manufactured in Johannesburg
and Cape Town. Since the aggregate used in the production might differ between the
two regions, management is interested in comparing the mass and variability of the
product between the regions. A sample of 43 items produced in Johannesburg had a
mean mass of 53.2 kg with a standard deviation of 4.2 kg, while a sample of 23 items
produced in Cape Town had a mean mass of 54.4 kg with a standard deviation of
3.3 kg. Set up 95% confidence intervals for the population means and standard deviations
for products manufactured in both Johannesburg and Cape Town. Test if there is a
significant difference between the means and variances of the two populations. If you find
there is no significant difference, find appropriate pooled estimates of the overall mean
and standard deviation, and find 95% confidence intervals for these overall estimates.
Solutions to examples . . .
3C $D^2$ (or $\chi^2_2$) = 2.78, P > 0.20 ($\chi^{2\,(0.20)}_2 = 3.219$). Not significant.
4C = 2.25
2(0.20)
2(0.05)
2(0.001)
                 95% interval for mean   95% interval for standard deviation
Johannesburg        (51.9 , 54.5)              (3.5 , 5.3)
Cape Town           (53.0 , 55.8)              (2.6 , 4.7)
pooled              (52.6 , 54.6)              (3.3 , 4.7)
In 20 soccer cup finals in South Africa between 1959 and 1978, the 40 teams
involved scored the following numbers of goals per match:
Goals per match      0    1    2    3    4    5    6 or more
Number of teams      8    6   13    8    3    2    0
Is it true that the number of goals scored per team per match fits a Poisson
distribution?
10.3 A large furniture store also sells television sets. Their planning department is
interested in the distribution of their daily sales of television sets. A sample of
210 days shows the following sales volumes:
Number of sales     0    1    2    3    4 or more
Number of days     60   71   53   20    6
Test at the 5% significance level whether the daily sales volumes occur in accordance with a Poisson distribution.
10.4 In order to check whether a die is balanced it was rolled 240 times and the following
results were obtained:
Face                   1    2    3    4    5    6
No. of occurrences    33   51   49   36   32   39
There is an annoying traffic light at an intersection near your home. Each time you
leave home, you note the colour as you pass the last lamp-post before the traffic
light. During a month, your records show that there was a red light 95 times,
amber 15 times, and green 30 times. Thinking this unreasonable, you contact the
Traffic Department, who claim that the light is set to be red for 50%, amber for
10% and green for 40% of the time for traffic arriving at the intersection from the
direction of your home. Test the Traffic Department's claim at the 1% significance
level.
10.6 A vegetable processing company has frozen peas as one of its major lines. They
have employed a consultant to research ways of improving their product. The
consultant experiments with hybrids between two varieties of peas. He knows
that, according to Mendelian inheritance theory, when the two varieties of peas
are hybridized, 4 types of seeds (yellow smooth, green smooth, yellow wrinkled
and green wrinkled) are produced in the ratio 9 : 3 : 3 : 1. In the experiment,
the consultant observes 102 yellow smooth, 30 green smooth, 42 yellow wrinkled
and 15 green wrinkled seeds. Are the results consistent with the theory at the 5%
level of significance?
10.7
It is hypothesised that the number of bricks laid per day by a team followed a
normal distribution with mean 2000 and standard deviation 300. A record of bricks
laid over a typical 100 day period produces the following data:
bricks laid per day    number of days
under 1500                   5
1500-1750                   14
1750-2000                   30
2000-2250                   28
2250-2500                   17
over 2500                    6
A survey of 320 families with 5 children revealed the distribution shown in the
table below. Is this data consistent with the hypothesis that male and female
births are equally probable?
No. of males        5    4    3    2    1    0
No. of families    18   56  110   88   40    8
No boy friends
4
19
17
43
30
62
blonde
non-blonde
Records of 500 car accidents were examined to determine the degree of injury
to the driver and whether or not he was wearing a safety belt at the time of the
accident. The data are summarized below:
                   Minor injury   Major injury   Death
Safety belt            128             49          11
No safety belt         168            104          40
Test the hypothesis that the severity of the injury to the driver was independent
of whether or not the driver wears a safety belt. Use a 1% significance level. By
comparing observed and expected values, interpret your result.
10.11 An insecticide company is concerned about the effectiveness of their product across
a range of insect species. 200 specimens of each of four species of insects are placed
in a container and a prescribed amount of an insecticide is added. After an hour
the number of survivors for each species are noted:
Species
Survived     27    42    68    31
Killed      173   158   132   169
Does the insecticide differ in its effectiveness according to species at the 1% level
of significance?
10.12 Does the following sample indicate in the population sampled that preference for
cars of certain makes is independent of sex?
Men
Women
Make of car
B
60
80
80
70
110
100
10.13
Indifferent
60
Against
150
Indifferent
10
Against
70
Do the attitude patterns between police and public differ significantly? Use a 1%
significance level.
The amount of drug put into a tablet must remain very constant. For a certain
drug the maximum acceptable standard deviation is 1 mg. Analysis of 10 tablets
produced the following data (amount of drug in milligrams):
26 28 26 25 32 27 29 25 30 28
(a) Construct a 95% confidence interval for the true standard deviation.
(b) Using a 1% significance level, decide whether the fluctuation in drug content
exceeds the acceptable level of 1 mg.
10.15
The exact contents of seven similar containers of motor oil were (in millilitres)
499 501 500 498 501 501 499
Solutions to exercises . . .
2(0.05)
reject H0 .
2(0.20)
reject H0 .
2(0.05)
cannot reject H0 .
2(0.20)
cannot reject H0 .
2(0.05)
2(0.01)
reject H0 .
,
reject H0 .
2(0.20)
reject H0 .
reject H0 .
10.15 (a) We need to assume that the exact contents of the containers of motor oil are
normally distributed.
(b) For $\mu$: (498.7, 501.0).
For $\sigma^2$: (0.613, 7.160).
10.16 (a) For $\mu$: (82.52 , 87.69).
For $\sigma^2$: (73.03 , 150.32).
(0.025)
t
s
= 102.
(b) n = 60
L
2(0.95)
(c) Because
60 = 43.185,
2 60101.4
(d) > 43.185 = Pr[ 2 > 140.9] = 0.95
n = 141.
(e) For $\mu$: (85.48 , 89.12).
For $\sigma^2$: (94.33 , 155.73).
(f) By the central limit theorem, the confidence interval for the mean is likely
to be reliable (unless the distribution is so very skew that even a sample of
size 141 is not large enough for the sample mean to be approximately normally
distributed). On the other hand, the confidence interval for $\sigma^2$ depends on
the distribution being normal, and hence is unreliable.
Chapter
11
Sample surveys . . .
Statisticians are frequently required to estimate the proportion of a population
having some characteristic. We are all familiar with the opinion polls that take place
around election time and which purport to inform us what proportion of the electorate
will support each party or each candidate. Market researchers run surveys to determine
the proportion of the population who saw and absorbed the television advertisement
for their client's product.
Ideally, if it were convenient, quick and affordable, one would choose to obtain data
from each element of the population. Then we could estimate the population proportion
exactly, using the information from the complete data set.
However each datum we obtain takes some time to observe and record, and also
generates costs that must be covered. So a complete collection of all data may have
large time and cost impacts. In reality there are very often severe constraints on time
and money available for data collection and capture.
Estimating proportions from sample data, rather than from the complete population
data, is the usual challenge that confronts us. How could such a strategy of making
conclusions about the entire population, on the basis of only an incomplete subset of
the population, ever make sense?
In general the strategy can only make sense if we have reason to believe that the
two aggregated parts of the population, comprising the sampled and observed elements
on the one hand, and the unsampled and unobserved elements on the other, are
essentially similar or equivalent.
If that belief in equivalence is correct, then the sample can be thought of as being
representative of the unsampled group, as well as being obviously representative of
itself.
When the belief is correct, the sample will be representative not only of itself and of
the unsampled group, but therefore also representative of the population in its entirety.
In these circumstances, we believe the sample, the non-sample and the entire population
are essentially similar, and in that sense, representative of each other.
On the other hand, if we have no reason to believe in the equivalence of the sample
and non-sample subsets, or worse still, if we have reason to believe they are actually
different from one another, we have a severe problem. The strategy of using the sample
data to estimate population features will be incorrect and misleading.
The situation we have described thus far implies that we would still have to find a
way of ensuring the reasonableness or validity of a belief that a particular sample of
size n is representative of the population (of size N > n) from which it is drawn.
We cannot check the belief in the absence of information about the non-sampled
part of the population. Generally there is no such information available. Thus the belief
is not verifiable. If the information for verifiability were available, then there may be no
additional information value or purpose a sample could actually serve.
Instead of ensuring that a sample is representative, the statistician relies upon a less
strict criterion, called randomness, as a device by which to eliminate any conscious or
unconscious bias in the selection of sample elements from the population. In its simplest
form, randomness implies that, at every stage in the sampling, each and every element
in the population has the same chance of being selected into the sample as each and
every other population element.
It is possible to achieve this type of random selection in practice, using various simple
techniques such as listing and numerically labelling each element of a population (e.g.,
from 1 to N ), putting labels 1 to N into a container and then drawing n numbered
labels from the thoroughly shaken container.
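The container of numbered labels has a direct software analogue. The sketch below assumes a hypothetical population of N = 500 elements and a sample of n = 20, values chosen purely for illustration, and uses the standard library function `random.sample`, which draws without replacement so that no label can appear twice.

```python
import random

N = 500  # hypothetical population size, labels 1 to N
n = 20   # hypothetical sample size

# A seeded generator makes the draw reproducible; omit the seed in real use.
rng = random.Random(2010)

# Draw n distinct labels, each subset of size n being equally likely.
labels = range(1, N + 1)
sample = rng.sample(labels, n)

print(sorted(sample))
```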
When a sample consisting of n elements has been chosen through an appropriate
randomisation method, we indicate its near-representative quality, by calling it a random sample. A random sample is highly likely to be either representative or close to
representative of the population.
While it is possible that a random sample might turn out to be unrepresentative
of its population, that outcome is very rare if the sample size is moderate. Moreover,
the probability of such an outcome decreases to almost zero as the random sample
size n increases.
In practice, there are some golden rules for studying a population using statistical
methods:
1. Use random selection to ensure we have random samples.
2. Moderate sample size n can help keep costs and time requirements within limits.
3. Random sample size n should be large enough for all the objectives of the study.
In this chapter we focus upon the study objective of using random sample data, and
sample proportions to estimate corresponding population proportions.
The two most frequently asked questions (which are in fact interrelated) are:
1. What size sample is needed to estimate a proportion?
2. What is the reliability or margin of error of the estimate?
We will answer both questions.
$$\frac{\pi(1-\pi)}{n},$$
$$\frac{P(1-P)}{n},$$
We now construct confidence intervals for $\pi$ in much the same way that we
constructed confidence intervals for $\mu$ when the standard deviation was assumed to be
known:
$$(P - \pi) \Big/ \sqrt{\frac{P(1-P)}{n}} \mathrel{\dot\sim} N(0, 1).$$
Thus
$$\Pr\left[-1.96 < (P - \pi) \Big/ \sqrt{\frac{P(1-P)}{n}} < 1.96\right] = 0.95,$$
$$\Pr\left[P - 1.96\sqrt{\frac{P(1-P)}{n}} < \pi < P + 1.96\sqrt{\frac{P(1-P)}{n}}\right] = 0.95.$$
Equivalently, we can say that we are 95% sure that the interval (46.9% , 53.1%)
contains the true population proportion. Note that, in all our formulae and calculations,
proportions lie between 0 and 1. However, it is often convenient to communicate our
results as percentages. The quantity that we add to, and subtract from, the point
estimate of proportion to form the confidence intervals, is called the reliability margin
of the estimate at the given confidence level, and is conventionally denoted by L,
and expressed as a percentage. Thus
$$L = 100 \times 1.96\sqrt{\frac{P(1-P)}{n}}\ \%$$
at the 95% confidence level.
In this example, the plus/minus term is 100 × 0.031 = 3.1% so we say that, at
the 95% confidence level, our estimate has a reliability margin of 3.1%. Note that the
confidence interval for the percentage is of the form (100P ± L)%, as in 50.0% ± 3.1%.
Note that the use of the term reliability here is different from (and inverse to) the way
it is used in common speech. Here reliability margin is a margin of error or variability
that naturally arises in random samples of a given size n. Our preference is always
for smaller reliability margin values, because the corresponding confidence intervals are
narrower. We then say that our estimates are more precise.
The only way to achieve this goal of greater precision is to reduce L. By inspection,
we can see that L is small when $\sqrt{n}$ is large, and hence when n is large. In principle
we prefer sample size n as large as any time and cost constraints will permit.
We now have an answer to the second of the two questions we asked at the beginning
of the chapter.
Example 2B: A market research company establishes that 28 out of 323 randomly
sampled households have more than one television set. Compute a 95% confidence
interval for the percentage of households having more than one television set. What is
the reliability margin, at the 95% level, of the estimate?
We calculate P = 28/323 = 0.0867, and n = 323. Thus the 95% confidence interval
is
$$\left(0.0867 - 1.96\sqrt{\frac{0.0867 \times 0.9133}{323}},\ 0.0867 + 0.0307\right) = (0.0560,\ 0.1174).$$
Expressing proportion in percentage terms, the percentage of households with more
than one television set is within the confidence interval (5.60%, 11.74%). The reliability
margin of our estimate at the 95% confidence level is 3.07%.
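The calculation in this example can be wrapped in a small function. This is a sketch under the same assumptions as the text (a large population and the normal approximation); the function name `proportion_ci` and its arguments are our own, not part of any library.

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Return (lower, upper, reliability margin in %) for a sample proportion."""
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin, 100 * margin


# Example 2B: 28 of 323 households have more than one television set.
lower, upper, reliability_margin = proportion_ci(28, 323)

print(round(lower, 4), round(upper, 4), round(reliability_margin, 2))
```

For a 99% interval, call the function with `z=2.58`, the value the text uses at that confidence level.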
Example 3C: In a questionnaire, 146 motorists out of a (random) sample of 252 stated
that when they next replaced tyres on their cars, they would insist on radial ply tyres.
Find a 95% confidence interval for the proportion of motorists in the population who
will get radials. What is the reliability margin at this confidence level?
Finite populations . . .
The method used above presupposes that the population being sampled is infinite,
or that the sampling is done with replacement. If neither of these assumptions is true,
we are sampling without replacement from a finite population. Then the random error
in the method will be reduced.
Intuitively we might expect this result because, by sampling without replacement,
no element can be selected twice, and every resulting sample of size n has a greater
chance of being representative of the population. Effectively we have eliminated all the
random samples which have any duplicated elements.
If the size of the population being sampled is N , and the sampling of an element
is done randomly n times without replacement, then the sampling distribution of P is
once again approximately normal, but with a reduced variance:
$$P \mathrel{\dot\sim} N\left(\pi,\ \frac{P(1-P)(N-n)}{n(N-1)}\right).$$
Confidence intervals are constructed using the same procedure as before, with the
necessary modification for the diminished variance. Thus, the 95% confidence interval
is given by
$$\left(P - 1.96\sqrt{\frac{P(1-P)(N-n)}{n(N-1)}},\ P + 1.96\sqrt{\frac{P(1-P)(N-n)}{n(N-1)}}\right)$$
with a reliability margin of our estimator P at the 95% confidence level given by
$$L = 100 \times 1.96\sqrt{\frac{P(1-P)(N-n)}{n(N-1)}}\ \%.$$
Example 4B: A lecturer, anxious to estimate the overall pass rate in a class of 823 students, takes a random sample of 100 scripts and finds 27 failures. Find a 95% confidence
interval for the failure rate.
We have P = 0.27, n = 100, and N = 823. Because the sample size is more than
10% of the population size we use the modified confidence interval:
$$\left(0.27 - 1.96\sqrt{\frac{0.27(1-0.27)(823-100)}{100(823-1)}},\ 0.27 + 0.082\right) = (0.188,\ 0.352).$$
At this stage, all the lecturer can say is that he has used a method for which there
is a 95% probability that the interval (18.8% , 35.2%) includes the true class failure
rate. The reliability margin at this confidence level is 8.2%, which is a comparatively
large figure, and the interval so wide it does not provide useful information! To get a
narrower confidence interval (at the 95% level), a larger sample size would be needed.
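To see how much the finite population adjustment matters here, the sketch below computes the reliability margin for the lecturer's sample both with and without the factor $(N-n)/(N-1)$. The function name and arguments are our own illustration of the two formulae given earlier.

```python
import math


def reliability_margin(p, n, N=None, z=1.96):
    """Margin of error as a proportion; N=None means an infinite population."""
    variance = p * (1 - p) / n
    if N is not None:
        variance *= (N - n) / (N - 1)  # finite population adjustment
    return z * math.sqrt(variance)


p, n, N = 0.27, 100, 823
margin_finite = reliability_margin(p, n, N)
margin_infinite = reliability_margin(p, n)

lower, upper = p - margin_finite, p + margin_finite
print(round(lower, 3), round(upper, 3), round(100 * margin_finite, 1))
```

The adjusted margin is smaller than the unadjusted one, in line with the remark that sampling without replacement from a finite population reduces the random error.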
Example 5C: In a constituency of 8000 voters, 748 voters out of a sample of 1341
voters state they will support the Materialistic Party, 510 voters state they will vote
for the Ecological Party and the remaining 83 voters are undecided. Find 95% confidence intervals for the population proportions which will support each party, and the
population proportion of undecided voters. What are the associated reliabilities?
Sample sizes . . .
When you approach a statistician with our first question, "What size sample is
needed?", he will reply by asking four further questions:
1. What is N , the size of the population?
2. Do you want 95% or 99% (or some other level) confidence intervals?
3. What margin of error or reliability margin (L%) can you accept?
4. Do you have a rough estimate, P, of $\pi$?
The two formulae for the reliability margin L given earlier connect these four quantities
with the sample size. If the population was infinite, we had
$$L = 100 \times 1.96\sqrt{\frac{P(1-P)}{n}}.$$
Making n the subject of the formula yields
$$n = (100/L)^2 \times 1.96^2 \times P(1-P).$$
This formula is appropriate if the answers to the four questions are:
1. The population is very large.
2. 95% confidence level. (For 99% confidence level use 2.58 in place of 1.96. For
other levels use the appropriate value from the normal tables.)
3. Reliability margin of L% is given a value.
4. The rough estimate of $\pi$ is substituted for P. If no estimate of $\pi$ is available,
let P = 0.5. To see the logic behind this choice, let us consider the function
y = P(1 − P) further. In the region of interest, for values of P between 0 and 1,
the graph looks like this:
[Figure: graph of y = P(1 − P) against P for P between 0.0 and 1.0.]
It is easy to show that the maximum of y = P(1 − P) occurs at P = 0.5. Thus
taking P = 0.5 in the sample size formula gives the largest possible sample size
that might be required to achieve a margin of error L%. Because it is an expensive
exercise to take a sample, we like our samples to be as small as possible. Thus if
an estimate of $\pi$ is available, we should always use it to determine sample size,
because it will yield a smaller n. If you wish to err on the side of caution, then, in
the sample size formula, use a value for P which is a little closer to 0.5 than your
estimate of $\pi$.
If the answer to question 1 is that the population size is finite and of size N , then
we determine n from the reliability margin expression incorporating the reduced
variance:
$$n = \frac{N}{1 + \dfrac{N(L/100)^2}{1.96^2\,P(1-P)}}.$$
In finite populations randomly sampled without replacement, the required minimum
sample size n will always be smaller than the size required for random
sampling with replacement.
Example 6A: What size sample is needed in each of the following situations? The
numbers 1 to 4 refer to the four relevant questions for determining the sample size.
(a) 1. N = ∞         2. 95% level   3. L = 3%     4. P = 0.5
(b) 1. N = ∞         2. 95%         3. L = 3%     4. P = 0.25
(c) 1. N = 10 000    2. 95%         3. L = 2%     4. P = 0.35
(d) 1. N = 20 000    2. 99%         3. L = 1%     4. P = 0.10
(e) 1. N = 15 000    2. 99%         3. L = 1.5%   4. P = 0.85.
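(a) $n = (100/3)^2 \times 1.96^2 \times 0.5 \times 0.5 = 1068$.
(b) $n = (100/3)^2 \times 1.96^2 \times 0.25 \times 0.75 = 801$.
(c) $$n = \frac{10\,000}{1 + \dfrac{10\,000(2/100)^2}{1.96^2 \times 0.35 \times 0.65}} = 1794.$$
(d) $$n = \frac{20\,000}{1 + \dfrac{20\,000(1/100)^2}{2.58^2 \times 0.1 \times 0.9}} = 4610.$$
(e) $$n = \frac{15\,000}{1 + \dfrac{15\,000(1.5/100)^2}{2.58^2 \times 0.85 \times 0.15}} = 3015.$$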
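The five answers can be reproduced with one function covering both formulae. This is a sketch: the function name `sample_size` is ours, and rounding up to the next whole number is our reading of the answers above, which are all rounded up rather than to the nearest integer.

```python
import math


def sample_size(L, p, z=1.96, N=None):
    """Smallest n giving reliability margin L% (N=None: infinite population)."""
    if N is None:
        n = (100 / L) ** 2 * z ** 2 * p * (1 - p)
    else:
        n = N / (1 + N * (L / 100) ** 2 / (z ** 2 * p * (1 - p)))
    return math.ceil(n)  # round up so the margin is not exceeded


answers = [
    sample_size(3, 0.5),                      # (a)
    sample_size(3, 0.25),                     # (b)
    sample_size(2, 0.35, N=10_000),           # (c)
    sample_size(1, 0.10, z=2.58, N=20_000),   # (d)
    sample_size(1.5, 0.85, z=2.58, N=15_000), # (e)
]
print(answers)  # [1068, 801, 1794, 4610, 3015]
```

Comparing (a) with (b) shows the saving from having a rough estimate of the proportion: the same margin of 3% needs 267 fewer observations when P = 0.25 is used in place of 0.5.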
Example 7B: The Student Health Service at a university with 12 000 students wishes
to conduct a Health Awareness Survey by interviewing a sample of students. They desire
a reliability margin of not more than 5% at a 95% confidence level in all questions that
seek to estimate proportions. What size sample is required?
We are given N = 12 000, we are told to find a 95% confidence interval (so z = 1.96)
and to use a reliability margin of L = 5%. In the absence of any guidance about the likely
value of π, we use P = 0.5 in the sample size formula because it gives the maximum
possible sample size. Thus
$$ n = \frac{12\,000}{1 + \dfrac{12\,000\,(5/100)^2}{1.96^2 \times 0.5 \times 0.5}} = 373. $$
5. Test statistic. We asserted earlier that the random variable P, the estimate of π,
has a normal distribution. If H0 is true, then if P is based on a sample of size n,
its distribution is given by
$$ P \sim N\!\left(\pi,\ \frac{\pi(1-\pi)}{n}\right). $$
Thus the test statistic is
$$ Z = \frac{P - \pi}{\sqrt{\pi(1-\pi)/n}} \sim N(0, 1). $$
Note that the null hypothesis specifies the value for π. For this problem, P =
37/155 = 0.239 and n = 155, from the sample, and π = 0.271 is specified by the
null hypothesis. Thus
$$ z = \frac{0.239 - 0.271}{\sqrt{0.271(1 - 0.271)/155}} = -0.896. $$
6. Because −1.64 < −0.896, we are not able to reject H0. Thus our data does not
indicate that residents of this suburb buy new cars significantly (i.e. discernibly)
less frequently than the national average.
Example 10B: A town of 3000 households is subjected to an intensive advertising
campaign for Easyspread yellow margarine, including free samples. Beforehand, the
proportion of Easyspread users in the town was the same as the national average, 0.132.
One week later, a sample of 350 households was asked if they had bought Easyspread
within the last seven days, and 64 made a positive response. Has the campaign been
effective?
1. H0 : π = 0.132, the proportion of Easyspread users has not changed.
2. H1 : π > 0.132 (if the campaign is successful).
3. Test statistic. Because of the finite population size we use the adjusted variance:
$$ P \sim N\!\left(\pi,\ \frac{\pi(1-\pi)(N-n)}{n(N-1)}\right). $$
Our test statistic is thus
$$ Z = \frac{P - \pi}{\sqrt{\dfrac{\pi(1-\pi)(N-n)}{n(N-1)}}} \sim N(0, 1). $$
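Both versions of the one-sample test statistic, with and without the finite-population adjustment, can be computed by one small function. The sketch below is illustrative only: `proportion_z` and its argument names are hypothetical, and the optional argument `N` is our own device for switching the adjustment on.

```python
import math

def proportion_z(p_hat, pi0, n, N=None):
    """z statistic for H0: pi = pi0, given a sample proportion p_hat from n units.

    If a finite population size N is supplied, the variance is multiplied
    by (N - n) / (N - 1), as in the adjusted-variance formula.
    """
    variance = pi0 * (1 - pi0) / n
    if N is not None:
        variance *= (N - n) / (N - 1)
    return (p_hat - pi0) / math.sqrt(variance)

# New-car example: P = 0.239 (rounded as in the text), pi0 = 0.271, n = 155
z_cars = proportion_z(0.239, 0.271, 155)

# Example 10B: 64 of 350 households, pi0 = 0.132, town of N = 3000 households
z_margarine = proportion_z(64 / 350, 0.132, 350, N=3000)

print(round(z_cars, 3), round(z_margarine, 2))
```

Each value is then compared with the critical value appropriate to the alternative hypothesis and the chosen significance level.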
Example 11C: A manufacturer claims that his market share is 60%. However, a
random sample of 500 customers reveals that only 275 are users of his product. Test at
the 5% significance level whether the population market share is less than that claimed
by the manufacturer.
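$$ P_1 - P_2 \sim N\!\left(0,\ P(1-P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\right) $$
$$ Z = \frac{P_1 - P_2}{\sqrt{P(1-P)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \sim N(0, 1) $$
$$ z = \frac{0.550 - 0.466}{\sqrt{0.509 \times 0.491 \left(\dfrac{1}{318} + \dfrac{1}{307}\right)}} = 2.10. $$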
6. We reject H0 because the test-statistic z = 2.10 satisfies z > 1.96, and conclude
that the difference between the proportions is, at the 5% level, significant.
Example 13B: Miss Jones, the senior typist, made errors on 15 out of 125 pages of
typing, whilst Miss Smith made errors on 44 pages out of 255 pages of typing. Is Miss
Jones's error rate significantly lower than Miss Smith's?
1. H0 : π1 = π2.
2. H1 : π1 < π2.
3. The test statistic is
$$ Z = \frac{P_1 - P_2}{\sqrt{P(1-P)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}. $$
With P1 = 15/125 = 0.120, P2 = 44/255 = 0.173 and pooled proportion P = 59/380 = 0.155,
$$ z = \frac{0.120 - 0.173}{\sqrt{0.155 \times 0.845 \left(\dfrac{1}{125} + \dfrac{1}{255}\right)}} = -1.34. $$
The test statistic is significant only at the 10% level. We conclude that the difference between the error rates is nearly significant (z = −1.34, P < 0.10).
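The two-sample calculation is easy to mechanise. The following is a minimal sketch, with the hypothetical function name `two_proportion_z`; it works from the raw counts, so it carries more decimal places than the hand calculation.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: pi1 = pi2, using the pooled proportion P."""
    p1 = x1 / n1
    p2 = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Example 13B: Miss Jones 15 errors in 125 pages, Miss Smith 44 in 255
z = two_proportion_z(15, 125, 44, 255)
print(round(z, 2))
```

Because no intermediate rounding takes place, the result (about −1.33) differs slightly in the second decimal from the −1.34 obtained by hand with proportions rounded to three places; the conclusion is unaffected.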
By randomly sampling within each stratum we obtain a composite random sample that
is suitably balanced across the strata.
The population and hence the sample may be stratified with regard to one or more
classification variables, the number depending on practical limitations (e.g. gender,
education level, type of dwelling being known for each person in a region of interest,
before the sampling begins).
In cases where the proportions in the sample from each subpopulation are the same
as the proportions in the population we say we have proportional stratified random
sampling. When the proportions do not correspond, some form of weighting must
usually be applied to each subsample in order to draw conclusions about the whole
population.
Often the stratification allows us to obtain very much better estimates and much
narrower confidence intervals, than corresponding amounts spent on a simple random
sample. Alternatively, we may save on sampling costs because we can achieve the same
reliability margins as a simple random sample, but with a smaller total count of sample
units.
Cluster random sampling focuses upon subgroups of a population that are conveniently identified and sampled. It may be easier to sample Cape Town addresses than
to sample the Cape Town population. Clustering reduces the cost of taking a random
sample by ensuring that the units within sampled clusters are geographically close to each
other, and travel costs to obtain data are less than when individual persons have to be
located and interviewed.
Whereas strata are sets of units constructed on the basis of additional knowledge
about similarities and contrasts between units available ahead of the sampling, clusters
are sets of units constructed purely upon a basis of convenient access.
If the householder population of a country is to be sampled with regard to housing
questions, one might first choose a random sample from the list of towns and cities (as
Stage 1). Stage 2 might involve random selection from
the list of suburbs in each town chosen, Stage 3 a random sample from a list of blocks
within each suburb selected, and at Stage 4 a random selection from a list of houses
from each block.
An overall margin of error statement can be made by combining the variations or
sampling errors at each of the four stages. This strategy is also known as multi-stage
cluster random sampling. One can easily see that this method will be more convenient
than a random sample from the list of all householders in the country. The convenience
comes with some cost: the confidence intervals for cluster random sampling are wider
than for simple random sampling.
Sometimes two or more random samples are taken from the population at distinct
points in time. The first random sample is often used to give an indication of the sample
size required for the second and main random sample, and to aid in its stratification. The
questions of importance appear in the second sample. This strategy is called double
or two-phase sampling. Sequential random sampling also involves a series of random
samples, usually to keep a continuous check on some feature which may change with time,
or to provide further information on a particular aspect. Depending on circumstances
and the information required, each sample may or may not include the same units.
Systematic random sampling involves selecting a random starting point and
then every k -th unit in a list or in a moving system thereafter, e.g., the twenty-third
person leaving a train station, and then every hundredth person. This strategy will often
give good results, but is subject to immeasurable and perhaps large errors if the interval
between units sampled coincides with a cycle inherent in the population sampled. An
obvious example of a difficulty is sampling every 14th (or 21st) unit of a population of
daily petrol sales figures, when a regular pattern occurs over a seven day sales week.
Other cycles may, unfortunately, not be as obvious.
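The danger of a hidden cycle is easily demonstrated. In the sketch below the function `systematic_sample` is hypothetical, and we assume (one common convention, not the only one) that the random starting point is chosen among the first k units.

```python
import random

def systematic_sample(units, k, rng=None):
    """Pick a random start among the first k units, then take every k-th unit."""
    rng = rng or random.Random()
    start = rng.randrange(k)
    return units[start::k]

# Ten weeks of daily sales figures, labelled by day number 0..69,
# so that day % 7 identifies the day of the week.
days = list(range(70))
sample = systematic_sample(days, 14, random.Random(1))

# Every sampled day falls on the same weekday: the 14-day interval
# coincides with the seven-day sales cycle.
weekdays = {d % 7 for d in sample}
print(sample, weekdays)
```

Whatever the starting point, an interval of 14 days can only ever visit one day of the week, so the sample tells us nothing about the other six.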
In all forms of probability sampling, the problem of non-response is an important
issue. Suppose we choose a random sample of voters using the electoral register and random numbers. The interviewers may have to make several calls to find people at home;
they may have moved, be on holiday or in hospital; they may refuse to be interviewed.
If we have a sample of 1000 and 900 respond, can we simply regard this as a random
sample of 900? It would be dangerous to do so, because we would then be assuming
that the non-responders hold similar views to the responders. The very fact that they
are non-responders makes them different to the others. The difference probably extends
to their opinions on the subject of the questionnaire.
If we did proceed to assume that respondents and non-respondents were similar
enough to ignore the 100, and treat the 900 as a random sample, we have an ethical
obligation to report that assumption as a fundamental basis of our analysis. This strategy
can be made explicit, but we will be unlikely ever to know whether or not it was valid.
In short, despite our best efforts to use random selection to increase the chance
of near-representative samples, non-response within a well-chosen random sample may
cause important parts of the information to be hidden from our analysis.
more efficient survey procedure. If a major error is only discovered during or after
the main survey, economic reasons usually prevent one from starting afresh, and
the entire exercise can be largely a waste of time and money.
(b) Questionnaires should be carefully drawn up, taking into account the information that is really desired, the range of possible answers, and the types of people
likely to be interviewed. Wherever possible, people concerned with the survey subject, the study population or study area should be consulted. Questions should be
uncomplicated and kept to a minimum.
(c) Data will often be entered directly into the computer from a questionnaire; the
layout should be checked in advance for its suitability at the data capture stage.
The questionnaire should also be discussed beforehand with the person responsible for summarizing and analysing the information, particularly if a computer is
involved. The question of whether or not to process the information by computer
should, of course, be considered at a very early stage. This decision will depend
largely on the size of the sample, the number of questions in the questionnaire, the
complexity of the information required and the relative costs.
(d) Confidence levels are usually selected to be 95% or 99%, although in circumstances with high sampling costs, 90% may be the maximum achievable in practice.
A reliability margin of about 5% is usually quite satisfactory, but anything larger
than 10% is not of much practical use. A smaller margin demands a much greater
sample. There is nowhere near a linear relationship between size of population
and size of sample necessary for a fixed reliability margin L%. There is no basis
for the often held view that 5% (or 10%) of the population will always give a
satisfactory sample. It should be borne in mind that the reliability margin is an
absolute value and that a figure of 14% ± 3% is relatively more reliable than one
of 45% ± 5%, because the margin is smaller.
(e) We have so far restricted ourselves to a question with either a yes or no answer.
The same theory holds for questions with any number of categories: if Pi is
the percentage of replies to category i, we use the same theory with Pi instead of
P.
(f) If any conclusions are to be drawn specifically for a subgroup of the population, e.g. all bachelors, then the reliability margin of any of these conclusions is
dependent on the numbers of the specific subgroup in the sample and in the
population. Although 400 may give an accurate balanced view of a population of
40 000, it is simply naive and completely incorrect to expect 10 to represent an
accurate and balanced view of a sub-population of 1000.
(g) These sample size figures all relate to a randomly drawn sample, without obvious
bias. Even in the balanced case, different levels of response (acceptance and satisfactory completion of the questionnaire) from different types in a heterogeneous
population may cause bias. Here suitable stratification prior to random selection
might help in maintaining the correct balance. Questionnaires can be postal, by
personal interview or postal with a personal follow-up. The postal method saves
much time and costs, but suffers much more from problems of non-response.
(h) Finally, the above points all attempt to deal with and minimise possible inaccuracies caused by drawing a sample which is not statistically representative of the
population. These selection errors may be minimal compared with those errors
resulting from poor interviewing, bad questions, incorrect transcription,
faulty processing and flawed interpretation. A moderate-sized random
sample, well controlled and carefully processed, may well give better results than
a complete census, the sheer size of which can cause very rushed interviewing and
processing and a greater proportion of errors.
The classic sampling fiasco is that of the Literary Digest poll, which sampled over
2 million opinions on the winner of the 1936 American Presidential Election. The L.D.
predicted only 40.9% of the votes for Roosevelt and a landslide win for Landon. Actual
returns gave Roosevelt an overwhelming win with 60.7% of the vote. Sample size 2 million; final error 20%! This huge error was due to plain stupidity, in picking the sample
from telephone directories and car registration files. The L.D. survey specialists may
have effectively sampled the opinions of the upper-class and middle-class people, but left
out the ill-clad, ill-fed and ill-housed lower-income classes who voted overwhelmingly
in favour of Roosevelt's New Deal policies.
Example 15C: For each of the following situations describe an appropriate sampling
technique.
(a) The Electricity Supply Commission is interested in evaluating the effect of an
energy conservation advertising campaign. It wishes to estimate the proportion of
households which are acting to reduce their electricity consumption.
(b) The City Council wishes to know the proportion of residents who favour the construction of a new city by-pass road.
(c) A chain store with outlets in nine regions wishes to estimate the number of bad
debtors nationwide.
(d) The university library wants to estimate the proportion of books that have not been
used for a year. A book is defined to have been used if it has been date-stamped
within the past year.
(e) A manager of a commercial forest plantation wishes to determine the proportion
of trees in a plantation that have been infested with an insect pest.
Solutions to examples . . .
3C (0.518, 0.640)
or (51.8%, 64.0%)
L = 6.1%.
11.1 In a sample of 500 garages it was found that 170 sold tyres at prices below those
recommended by the manufacturer.
(a) Estimate the percentage of all garages selling tyres below the recommended
price.
(b) Calculate the 95% confidence limits for this estimate.
(c) What is the reliability of the estimate?
(d) What size sample would have to be taken in order to estimate the percentage
to within 2% at a 95% confidence level? Use the value obtained in (a) as a
rough estimate of the true proportion.
11.2 What size sample is needed in each of the following situations? Sampling will be
done without replacement.

       Population    Confidence    Reliability    Rough estimate
       size N        level         L%             of π
(a)    infinite      95%           5%             50%
(b)    infinite      95%           1%             50%
(c)    infinite      95%           5%             25%
(d)    infinite      99%           5%             25%
(e)    infinite      95%           5%             75%
(f)    infinite      95%           1%             10%
(g)    3000          95%           3%             40%
(h)    10 000        95%           3%             40%
(i)    10 000        95%           1%             40%
(j)    10 000        99%           3%             20%
(k)    10 000        99%           2%             12%
(l)    40 000        95%           1%             8%
(m)    infinite      95%           2%             unknown
(n)    7000          95%           0.5%           97%
(b) What size sample is required if it is only necessary to estimate the overall proportion within the university with the same reliability and confidence
level?
(c) Explain the difference between the overall sample sizes obtained in (a) and
(b).
11.6 You read in a newspaper report that 53% of businessmen think that the economic
climate is improving. Upon checking you learn that the sample size used in the
research was 100. Find the 95% confidence interval and comment.
11.7
A questionnaire has four questions, and the rough estimate of the proportions of
interest are 10%, 70%, 80% and 5%. What size sample is required to achieve a
reliability of 2% at the 90% confidence level in all four questions? The population
may be assumed to be infinite.
11.8 In a random sample of 120 pages typed by Miss Brown there were 23 pages with
typing errors.
(a) Find a 90% confidence interval for the proportion of Miss Brown's pages that
have errors.
(b) How large a sample would we need if we wanted our confidence interval to
have overall length 4%?
11.11 The national average for the ownership of motorbikes by teenagers is 15%. In
an affluent suburb which has 2000 teenagers, 45 out of a sample of 250 teenagers
owned motorbikes. Test if this proportion is significantly more than the national
average.
11.12
After corrosion tests, 42 of 536 metal components treated with Primer A and
91 of 759 components treated with Primer B showed signs of rusting. Test the
hypothesis that Primer A is superior to Primer B as a rust inhibitor at the 1%
significance level.
11.13 In a sample of 540 wives of professional and salaried workers, 42% had visited
their doctor at least once during the preceding 3 months. During the same period,
of a sample of 270 wives of labourers and unskilled workers, 36% had visited a
doctor. By the use of an appropriate statistical test, consider the validity of the
assertion that middle-class wives are more likely to visit their doctors than the
wives of working-class husbands.
11.14 In a sample of 569 wives of professional and salaried workers 45% attended weekly
the local welfare centre with their infants. For a sample of 245 wives of agricultural
workers, the corresponding proportion was 35%. Test the hypothesis that there is
no difference between the two groups in respect of their attendance at such centres.
11.15
Two groups, A and B, each consist of 100 people who have a disease. A serum
is given to Group A but not to Group B (which is called the control group);
otherwise, the two groups are treated identically. It is found that in Groups A and
B, 75 and 65 people, respectively, recover from the disease. Test the hypothesis
that the serum helps to cure the disease.
11.16 A sample poll of 300 voters from district A and 200 voters from district B showed
that 168 and 96 respectively were in favour of a given candidate. At a level of
significance of 5%, test the hypothesis that
(a) there is a difference between the districts
(b) the candidate is preferred in district A.
11.17 Two separate groups of sailors were randomly selected. One group of 350 sailors
was given seasickness pills of Brand A and another group of 220 sailors was given
pills of Brand B. The number of sailors in each group that became seasick during
a very heavy storm were 57 and 28 respectively. Can one conclude, at a 5%
significance level, that there is no real difference in the effectiveness of the pills?
11.18
There are 3000 students in the Arts Faculty and 2500 in the Science Faculty of
a university. In a sample of 350 arts students there were 186 non-smokers, while
of 400 science students, 273 were non-smokers. Is there a difference between the
proportions of non-smokers in the two faculties? (Develop a test that adjusts the
variances for the finite population sizes.)
11.19 In a study to estimate the proportion of residents in a certain city and its suburbs
who were opposed to the construction of a nuclear power plant, it was found that
48 out of 100 urban residents were opposed to the construction of the power plant,
while 91 out of 125 suburban residents were opposed. Test whether the level
of opposition to the nuclear power plant varies significantly between urban and
suburban residents.
11.20 In an election, one ballot box at a polling station was found to have a broken
seal, and there was concern that votes for a particular party, called ABC, had
been removed and destroyed. None of the other boxes had broken seals. Of the
remaining votes cast at that polling station, 32% were in favour of party ABC. The
ballot box with the broken seal contained 543 votes, of which 134 were for party
ABC. Does this information provide support for the allegation that the ballot box
had been tampered with? What assumptions are required to conduct the test?
11.21 What is the aim of choosing a random sample? Mention two modifications that
can be made to a simple random sampling plan, and outline the benefits of these
modifications.
11.22
A statistician wishes to assess the opinion that residents of a suburb have about
their local bus service. Describe how he might go about obtaining a representative
sample of their opinions.
11.23
What objections would you raise if you were told to use, as a random sample of
the households in the Cape Town area, the first three addresses on the top of each
page of the Cape Peninsula telephone directory?
Solutions to exercises . . .
11.1 (a) 0.34 or 34%
(b) (0.298, 0.382) or (29.8%, 38.2%)
(c) L = 4.2%
(d) 2156.
11.2 (a) 385     (b) 9604    (c) 289     (d) 500
     (e) 289     (f) 3458    (g) 764     (h) 930
     (i) 4798    (j) 1059    (k) 1495    (l) 2640
     (m) 2401    (n) 2728
L = 2.7%.
L = 2.83%.
11.5 (a) 473 out of 1000, 732 out of 4000 and 760 out of 5000, a total of 1965.
(b) 823.
11.6 (43.2% , 62.8%). The confidence interval is so long that it is of little use.
11.7 n = 1412. (Use the estimate closest to 0.50, i.e. 0.70).
11.8 (a) (13.3% , 25.1%)
(b) L = 2%,
1042.
11.11 z = 1.42,
11.15 z = 1.54, P < 0.10, nearly significant (but in medical statistics, nearly significant is not sufficient!).
11.16 (a) z = 1.76 < 1.96, cannot reject H0 .
(b) z = 1.76 > 1.64, reject H0 .
11.17 z = 1.16 < 1.96, cannot reject H0 .
Chapter
12
Regression . . .
Example 1A: Suppose, as suggested above, we would like to predict a student's mark
in the final examination from his class record. We gather the following data, for 12
students. (In practice we would use far more data; we are now just illustrating the
method.)
Class record    61   39   70   63   83   75   48   72   54   22   67   60
Final mark      83   62   76   77   89   74   48   78   76   51   63   79

[Figure: scatter plot of final mark (vertical axis) against class record (horizontal axis).]
A haphazard scattering of points in the scatter plot would show that no relationship
exists. Here we have a distinct trend: as x increases, we see that y tends to increase
too.
We are looking for an equation which describes the relationship between mid-year
mark and final mark so that for a given mark at the mid-year, x, we can predict the
final mark y . The equation finding technique is called regression analysis. We call the
variable to be predicted, y , the dependent variable, and x is called the explanatory
variable.
The variable x is often called the independent variable, but this is a very poor
name, because statisticians use the concept of independence in an entirely different
context. Here, x and y are not (statistically) independent. If they were, it would be
stupid to try to find a relationship between them!
Linear regression . . .
We confine ourselves to the fitting of equations which are straight lines; thus we will
consider linear regression. This is not as restrictive as it appears: many non-linear
equations can be transformed into straight lines by simple mathematical techniques; and
many relationships can be approximated by straight lines in the range in which we are
interested.
The general formula for the straight line is
y = a + bx
The a value gives the y -intercept, and the b value gives the slope. When a and b are
given numerical values the line is uniquely specified. The problem in linear regression
is to find values for the regression coefficients a and b in such a way that we obtain
the best fitting line that passes through the observations as closely as possible.
We must decide the criteria this best line should satisfy. Mathematically, the
simplest condition is to stipulate that the sum of the squares of the vertical differences
between the observed values and the fitted line should be a minimum. This is called the
method of least squares, and is the criterion used almost universally in regression
analysis.
Pictorially, we must choose the straight line in such a way that the sum of the
squares of the $e_i$ on the graph below is a minimum, i.e. we wish to minimize
$$ S = \sum_{i=1}^{n} e_i^2 . $$
[Figure: the points $(x_i, y_i)$ scattered about the fitted straight line, with the vertical differences $e_1, e_2, \ldots, e_n$ between each observed value $y_i$ and the corresponding predicted value $\hat{y}_i$ on the line. Horizontal axis: explanatory variable x; vertical axis: dependent variable y.]
In general, we have n pairs of observations $(x_i, y_i)$. Usually these points do not lie
on a straight line. Note that $e_i$ is the vertical difference between the observed value of
$y_i$ and the associated value on the straight line. Statisticians use the notation $\hat{y}_i$ (and
say "y hat") for the point that lies on the straight line. Because it lies on the line,
$\hat{y}_i = a + b x_i$. The difference is expressed as
$$ e_i = y_i - \hat{y}_i = y_i - (a + b x_i). $$
Thus
$$ S = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2 . $$
We want to find values for a and b which minimize the sum of squared differences, $S = \sum e_i^2$. The mathematical procedure
for doing this is to differentiate S firstly with respect to a and secondly
with respect to b. We then set each of the derivatives equal to zero. This gives us
two equations in two unknowns, a and b, and we ought to be able to solve them.
Fortunately, the two equations turn out to be a pair of linear equations for a and b; this
is a type of problem we learnt to do in our first years at high school! Technically, the
derivatives are partial derivatives, and we use the standard mathematical notation
for partial derivatives, $\partial/\partial a$, instead of the more familiar notation $d/da$.
The partial derivatives are:
$$ \frac{\partial S}{\partial a} = -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0 $$
$$ \frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0 $$
Setting these partial derivatives equal to zero gives us the so-called normal equations:
$$ \sum_{i=1}^{n} y_i = na + b \sum_{i=1}^{n} x_i $$
$$ \sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 $$
We can calculate $\sum x_i$, $\sum y_i$, $\sum x_i^2$ and $\sum x_i y_i$ from our data, and n is the sample size. This gives us the numerical coefficients for the normal equations. The only
unknowns are a and b.
By manipulating the normal equations algebraically, we can solve them for a and b
to obtain
$$ b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}} $$
$$ a = \bar{y} - b\bar{x} = \frac{\sum y - b \sum x}{n}, $$
abbreviating $\sum_{i=1}^{n} x_i$ to $\sum x$, etc. It is convenient to define quantities SSx, SSxy and
SSy as below:
SUMS OF SQUARES
$$ SS_x = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n} $$
$$ SS_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{\sum x \sum y}{n} $$
$$ SS_y = \sum (y - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n} $$
The letters SS stand for "Sum of Squares", and the subscript(s) indicate(s) the variable(s) in the sum. In this notation, the regression
coefficients a and b can be written as
$$ b = SS_{xy} / SS_x $$
$$ a = \bar{y} - b\bar{x} = \frac{\sum y - b \sum x}{n}, $$
and the straight line for predicting the values of the dependent variable
y from the explanatory variable x is
$$ \hat{y} = a + bx. $$
Although SSy was not needed for the calculation of a and b, it will be needed for a
soon-to-be-developed formula, so it is efficient to define it now and to calculate it along
with SSx and SSxy .
The formulae in the box are the most useful for finding the least squares linear
regression equation $\hat{y} = a + bx$.
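The algebra that leads from the normal equations to these formulae is short, and can be sketched as follows, writing $\bar{x} = \sum x / n$ and $\bar{y} = \sum y / n$. Dividing the first normal equation by n gives
$$ \bar{y} = a + b\bar{x}, \quad\text{so that}\quad a = \bar{y} - b\bar{x}. $$
Substituting this expression for a into the second normal equation gives
$$ \sum xy = (\bar{y} - b\bar{x}) \sum x + b \sum x^2 , $$
and collecting the terms in b,
$$ \sum xy - \frac{\sum x \sum y}{n} = b \left( \sum x^2 - \frac{(\sum x)^2}{n} \right), $$
which is $SS_{xy} = b \, SS_x$, or $b = SS_{xy}/SS_x$.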
A computational scheme . . .
We set out the procedure for calculating the regression coefficients in full. It is very
arithmetic intensive, and is best done using a computer. But it is important to be able
to appreciate what the computer is doing for you!
Example 1A, continued: The manual procedure for calculating the regression coefficients is a four-point plan.
1. We set out our data as in the following table, and sum the columns:

          x       y       x²               y²               xy
         61      83      61² = 3721       83² = 6889       61 × 83 = 5063
         39      62      1521             3844             2418
         70      76      4900             5776             5320
         63      77      3969             5929             4851
         83      89      6889             7921             7387
         75      74      5625             5476             5550
         48      48      2304             2304             2304
         72      78      5184             6084             5616
         54      76      2916             5776             4104
         22      51      484              2601             1122
         67      63      4489             3969             4221
         60      79      3600             6241             4740
   Σ    714     856      45 602           62 810           52 696

Thus Σx = 714, Σy = 856, Σx² = 45 602, Σy² = 62 810, and Σxy = 52 696.
$$ b = \frac{SS_{xy}}{SS_x} = \frac{1764}{3119} = 0.566 $$
and
$$ a = \bar{y} - b\bar{x} = 71.33 - 0.566 \times 59.5 = 37.65 $$
Therefore the regression equation for making year-end predictions (y) from mid-year mark (x) is
$$ \hat{y} = 37.65 + 0.566\,x $$
The "hat" notation is a device to remind you that this is an equation which is to be
used for making predictions of the dependent variable y. This notation is almost
universally used by statisticians. So if you obtain x = 50% at mid-year, you can
predict a mark of
$$ \hat{y} = 37.65 + 0.566 \times 50 = 66.0\% $$
We will defer the problem of placing a confidence interval on this prediction until
later. In the meantime, $\hat{y}$ is a point estimate of the predicted value of the dependent
variable. The quantity SSy , which we have calculated above, will be used in forming
confidence intervals (and also in correlation analysis).
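To appreciate what the computer does for you, the whole computational scheme for Example 1A can be written in a few lines of Python. This is a sketch only; the function name `least_squares` and the variable names are our own, and the code simply applies the formulae in the SUMS OF SQUARES box.

```python
def least_squares(x, y):
    """Return (a, b, SSx, SSxy, SSy) for the line y-hat = a + b x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_x = sum(v * v for v in x) - sum_x ** 2 / n
    ss_y = sum(v * v for v in y) - sum_y ** 2 / n
    ss_xy = sum(u * v for u, v in zip(x, y)) - sum_x * sum_y / n
    b = ss_xy / ss_x
    a = sum_y / n - b * sum_x / n
    return a, b, ss_x, ss_xy, ss_y

# Data of Example 1A
class_record = [61, 39, 70, 63, 83, 75, 48, 72, 54, 22, 67, 60]
final_mark = [83, 62, 76, 77, 89, 74, 48, 78, 76, 51, 63, 79]

a, b, ss_x, ss_xy, ss_y = least_squares(class_record, final_mark)
prediction = a + b * 50  # predicted final mark for a class record of 50%
print(round(a, 2), round(b, 3), round(prediction, 1))
```

The computer carries b to full precision, so its value of a differs slightly in the second decimal place from the hand calculation, which used b rounded to 0.566. The prediction for x = 50 still rounds to 66.0%.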
Example 2B: A personnel manager wishes to investigate the relationship between income and education. He conducts a survey in which a random sample of 20 individuals
born in the same year disclose their monthly income and their number of years of formal
education. The data is presented in the first two columns of the table below.
Find the regression line for predicting monthly income (y) from years of formal
education (x).
            Years of formal    Annual income
            education          (1000s of rands)
Person      x                  y                   x²      y²      xy
 1          12                  4                  144      16      48
 2          10                  5                  100      25      50
 3          15                  8                  225      64     120
 4          12                 10                  144     100     120
 5          16                  9                  256      81     144
 6          15                  7                  225      49     105
 7          12                  5                  144      25      60
 8          16                 10                  256     100     160
 9          14                  7                  196      49      98
10          14                  6                  196      36      84
11          16                  8                  256      64     128
12          12                  6                  144      36      72
13          15                  9                  225      81     135
14          10                  4                  100      16      40
15          18                  7                  324      49     126
16          11                  8                  121      64      88
17          17                  9                  289      81     153
18          15                 11                  225     121     165
19          12                  4                  144      16      48
20          13                  5                  169      25      65
 Σ         275                142                 3883    1098    2009

We complete the table by computing the terms x², y² and xy, and adding the
columns.
Next, $\bar{x}$ = 275/20 = 13.75 and $\bar{y}$ = 142/20 = 7.10.
Correlation . . .
To justify a regression analysis we need first to determine whether in fact there is
a significant relationship between the two variables. This can be done by correlation
analysis.
If all the data points lie close to the regression line, then the line is a very good fit
and quite accurate predictions may be expected. But if the data is widely scattered,
then the fit is poor and the predictions are inaccurate.
The goodness of fit depends on the degree of association or correlation between
the two variables. This is measured by the correlation coefficient. We use r for the
correlation coefficient and define it by the formula
$$ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}} $$
For computational purposes, the most useful formula for the correlation coefficient
is expressed in terms of SSx, SSxy and SSy:
$$ r = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}}. $$
The correlation coefficient r always lies between −1 and +1. If r is positive, then
the regression line has positive slope b, and as x increases so does y. If r is negative
then the regression line has negative slope and as x increases, y decreases. In other
words, in any one example, r, b and SSxy must have the same sign:
$$ r = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}} = b \sqrt{\frac{SS_x}{SS_y}}. $$
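As an illustration, the computational formula for r can be applied to the class record and final mark data of Example 1A. The sketch below uses a hypothetical function `correlation`; it is not a substitute for the significance test, only a check on the arithmetic.

```python
import math

def correlation(x, y):
    """Sample correlation coefficient r = SSxy / sqrt(SSx * SSy)."""
    n = len(x)
    ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_y = sum(v * v for v in y) - sum(y) ** 2 / n
    ss_xy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n
    return ss_xy / math.sqrt(ss_x * ss_y)

# Data of Example 1A
class_record = [61, 39, 70, 63, 83, 75, 48, 72, 54, 22, 67, 60]
final_mark = [83, 62, 76, 77, 89, 74, 48, 78, 76, 51, 63, 79]

r = correlation(class_record, final_mark)
print(round(r, 3))  # approximately 0.755
```

The value is positive, in agreement with the positive slope b found for the same data.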
In the scatter plot on the left, both the correlation coefficient r and the slope coefficient
b of the regression line will be positive. In contrast, both r and b will be negative in
the plot on the right:
[Figure: two scatter plots of yield y against rainfall x; in the left panel r > 0, in the right panel r < 0.]
We now look at the extreme values of r . It can readily be shown that r cannot
be larger than plus one or smaller than minus one. These values can be interpreted as
representing perfect correlation. If r = +1 then the observed data points must lie
exactly on a straight line with positive slope, and if r = −1, the points lie on a line with
negative slope:
[Figure: two scatter plots of perfectly correlated points, one with r = −1 and one with r = +1.]
Half way between +1, perfect positive correlation, and −1, perfect negative correlation, is 0. How do we get zero correlation? Zero correlation arises if there is no
relationship between the variables x and y, as in the example below:
[Figure: scatter plot of I.Q. y against shoe size x, with r ≈ 0.]
Thus, if the data cannot be used to predict y from x, r will be close to zero. If it
can be used to make predictions r will be close to +1 or −1. How close to +1 or to
−1 must r be to show statistically significant correlation? We have tables which tell us
this.
We express the above more formally by stating that r, which is the sample correlation coefficient, estimates ρ, the population correlation coefficient. (ρ, "rho", is
the small Greek "r"; we are conforming to our convention of using the Greek letter
to signify the population parameter and the Roman letter for the estimate of the parameter.) What we need is a test of the null hypothesis that the population correlation
coefficient is zero (i.e. no relationship) against the alternative that there is correlation;
i.e. we need to test the null hypothesis H0 : ρ = 0 against the alternative H1 : ρ ≠ 0.
Mathematical statisticians have derived the probability density function for r when the
null hypothesis is true, and tabulated the appropriate critical values. Table 5 is such a
table. The probability density function for r depends on the size of the sample, so that,
as for the t-distribution, degrees of freedom is an issue.
The tables tell us, for various degrees of freedom, how large (or how small) r has
to be in order to reject H0 . The alternative hypothesis above was two-sided, thus there
must be negative as well as positive values of r that will lead to the rejection of the
null hypothesis. Because the sampling distribution of r is symmetric, our tables only
give us the upper percentage points. Unlike the earlier tests of hypotheses that we
developed, there is no extra calculation to be done to compute the test statistic. In
this case, r itself is the test statistic.
When the sample correlation coefficient r is computed from n pairs of points, the
degrees of freedom for r are n − 2. In terms of our degrees of freedom rule, we lose two
degrees of freedom, because we needed to calculate x̄ and ȳ before we could calculate r.
We estimated two parameters, the mean of x and the mean of y, so we lose two degrees
of freedom.
Example 4A: Suppose that we have a sample of 12 pairs of observations, i.e. 12 x-values and 12 y-values, and that we calculate the sample correlation coefficient r to be
0.451. Is this significant at the 5% level, using a two-sided alternative?
1. H0 : ρ = 0
4. Because we have a two-sided alternative we consult the 0.025 column of the table.
We have 12 pairs of observations, thus there are 12 − 2 = 10 degrees of freedom. The
critical value from Table 5 is 0.576. We would reject H0 if the sample correlation
coefficient lay either in the interval (0.576, 1) or in (−1, −0.576):
[Figure: sampling distribution of r under H0 on the axis from −1 to +1, with the rejection regions below −0.576 and above +0.576.]
n = 12,
n = 36,
n = 52,
n = 87,
n = 45,
Example 1A, continued: For the data given in Example 1A, test if there is a significant correlation between class record and final mark. Do the test at the 1% significance
level.
We can justify using a one-sided alternative hypothesis: if this world is at all fair,
then if there is any correlation between class record and final mark it must be positive! Thus
we have:
1. H0 : ρ = 0
2. H1 : ρ > 0
3. The sample correlation coefficient is
\[ r = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}} = \frac{1764}{\sqrt{3119 \times 1748.67}} = 0.755 \]
4. The sample size was n = 12, so we have 10 degrees of freedom. The correlation
is significant at the 5% level (r > r_{10}^{(0.05)} = 0.4973), the 1% level (r > r_{10}^{(0.01)} = 0.6581)
and the 0.5% level (r > r_{10}^{(0.005)} = 0.7079), but not at the 0.1% level
(r < r_{10}^{(0.001)} = 0.7950).
5. Our conclusion is that there is a strong positive relationship between the class
record and the final mark (r10 = 0.755, P < 0.005).
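The test above can be checked with a few lines of Python. This is a minimal sketch: the function name `correlation` and the dictionary `CRITICAL_R_10DF` are illustrative names, and the critical values are simply the Table 5 entries quoted in step 4, not computed from first principles.

```python
import math

def correlation(ss_xy, ss_x, ss_y):
    """Sample correlation coefficient r = SSxy / sqrt(SSx * SSy)."""
    return ss_xy / math.sqrt(ss_x * ss_y)

# Upper percentage points of r for 10 degrees of freedom, as quoted in step 4.
CRITICAL_R_10DF = {0.05: 0.4973, 0.01: 0.6581, 0.005: 0.7079, 0.001: 0.7950}

# Summary sums for Example 1A (SSxy, SSx, SSy).
r = correlation(1764, 3119, 1748.67)

# One-sided test: reject H0 at level alpha if r exceeds the critical value.
significant = {alpha: r > crit for alpha, crit in CRITICAL_R_10DF.items()}

print(f"r = {r:.3f}")
for alpha, is_sig in significant.items():
    print(f"alpha = {alpha}: {'significant' if is_sig else 'not significant'}")
```

Because r itself is the test statistic, the only arithmetic is the computation of r; the rest is a table lookup.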
Example 2B, continued: Is there significant correlation between monthly income and
years of education? Do the test at the 1% significance level.
1. H0 : ρ = 0
2. H1 : ρ ≠ 0
3. Significance level : 1%
4. The sample size was n = 20, so the appropriate degrees of freedom is 20 − 2 = 18.
We will reject H0 if the sample correlation coefficient is greater than r_{18}^{(0.005)} =
0.5614, or less than −0.5614; remember that the alternative hypothesis here is
two-sided.
5. We calculate r = SSxy /√(SSx SSy) to be 0.5911.
6. The sample correlation coefficient lies in the rejection region. We have established
a significant correlation between years of education and monthly income. We
may use the regression line to make predictions.
Figure 12.1:
been done, largely, by holding all variables in a manufacturing process constant, except
for one experimental variable. If a correlation is found between the variable that is
being experimented with and product quality, then this variable is a likely cause of
defectiveness, and needs to be monitored carefully. On the other hand, if no correlation
is found between a variable and product quality, then this variable is unlikely to be
critical to the process, and the cheapest possible setting can be used.
This measures the sum of squares of the differences between the observed values yi
and the predicted values ŷi. This quantity will be small if the observed values yi fall
close to the regression line ŷ = a + bx, and will be large if they do not. The term yi − ŷi
is therefore called the residual or error for the ith observation. Thus we define
\[ SSE = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - a - b x_i)^2 \]
This result is an important one, because it shows that we can decompose the total
sum of squares, SST, into the part that is explained by the regression, SSR, and the
remainder that is unexplained, SSE, the error sum of squares (or residual sum of
squares).
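Finally, we find two alternative expressions for SSR = b SSxy to be useful. First the
short one, useful for calculations. Because b = SSxy /SSx, another expression for SSR
is
\[ SSR = \frac{SS_{xy}^2}{SS_x}. \]
Secondly,
\begin{align*}
SSR = b \, SS_{xy} &= b \sum (y_i - \bar{y})(x_i - \bar{x}) \\
&= b \sum (y_i - \hat{y}_i + \hat{y}_i - \bar{y})(x_i - \bar{x}) \\
&= b \sum (y_i - \hat{y}_i)(x_i - \bar{x}) + \sum (\hat{y}_i - \bar{y})(b x_i - b \bar{x})
\end{align*}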
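We now consider the first and second terms separately. We will show that the first term
is equal to zero. We note that ŷi = a + bxi = ȳ − b x̄ + bxi and substitute this in place
of ŷi in the first term:
\begin{align*}
b \sum (y_i - \bar{y} + b \bar{x} - b x_i)(x_i - \bar{x})
&= b \sum \big( (y_i - \bar{y}) - b (x_i - \bar{x}) \big)(x_i - \bar{x}) \\
&= b \Big[ \sum (y_i - \bar{y})(x_i - \bar{x}) - b \sum (x_i - \bar{x})^2 \Big] \\
&= b \, SS_{xy} - b^2 \, SS_x \\
&= \frac{SS_{xy}}{SS_x} \, SS_{xy} - \Big( \frac{SS_{xy}}{SS_x} \Big)^2 SS_x \\
&= 0,
\end{align*}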
which is a lot of stress to prove that something is nothing! For the second term we note
that b xi = ŷi − a and b x̄ = ȳ − a, so (bxi − b x̄) = (ŷi − ȳ) and we have that
\[ SSR = \sum (\hat{y}_i - \bar{y})^2. \]
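The identities derived in this section are easy to confirm numerically. The sketch below uses a small made-up data set (not taken from the text) and a hypothetical helper `regression_sums`; it fits the least squares line and then computes SST, SSR and SSE directly from their definitions.

```python
def regression_sums(xs, ys):
    """Fit y = a + b*x by least squares and return the sums of squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = ss_xy / ss_x
    a = y_bar - b * x_bar
    fitted = [a + b * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)               # total
    ssr = sum((f - y_bar) ** 2 for f in fitted)           # explained by regression
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # residual
    return {"b": b, "SSx": ss_x, "SSxy": ss_xy, "SST": sst, "SSR": ssr, "SSE": sse}

# Made-up illustrative data, chosen only to exercise the formulas.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]
sums = regression_sums(xs, ys)

print("SST       =", round(sums["SST"], 6))
print("SSR + SSE =", round(sums["SSR"] + sums["SSE"], 6))
print("b * SSxy  =", round(sums["b"] * sums["SSxy"], 6))
```

Up to floating-point rounding, SST equals SSR + SSE, and SSR agrees with both b SSxy and SSxy²/SSx.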
Prediction intervals . . .
In the same way as the standard deviation played a key role in confidence intervals
for means, an equivalent quantity called the residual standard deviation is used in
calculating confidence intervals for our predictions; we call these prediction intervals. The residual standard deviation, also denoted s, measures the amount of variation
left in the observed y-values after allowing for the effect of the explanatory variable x.
The residual standard deviation is defined to be
\[ s = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{\sum (y_i - a - b x_i)^2}{n-2}}. \]
In the previous section, we showed that
SSE = SSy − b SSxy.
So a better formula for computational purposes is
\[ s = \sqrt{\frac{SS_y - b \, SS_{xy}}{n-2}}. \]
Because the residual standard deviation is estimated, the t-distribution is used to
form the prediction intervals. The degrees of freedom for the residual standard deviation
are n − 2 (we lose two degrees of freedom because two parameters are estimated by x̄
and ȳ before we can calculate the residual standard deviation). The prediction interval
for a predicted value of y for a given value of x is
\[ \Big( \text{predicted value} - t_{n-2} \, s \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{SS_x}} \, , \;\; \text{predicted value} + t_{n-2} \, s \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{SS_x}} \Big). \]
The term (x − x̄)² in the prediction interval has the effect of widening the prediction
interval when x is far from x̄. Thus our predictions are most reliable when they are
made for x-values close to x̄, the average value of the explanatory variable. Predictions
are less reliable when (x − x̄)² is large. In particular, it is unwise to extrapolate, that
is, to make predictions that go beyond the range of the x-values in the sample, although
in practice we are often required to do this.
Example 1A, continued: Find a 95% prediction interval for the predicted value of
the final mark given a class record of 50%.
We first need to compute the residual standard deviation:
\[ s = \sqrt{\frac{SS_y - b \, SS_{xy}}{n-2}} = \sqrt{\frac{1748.67 - 0.566 \times 1764}{12 - 2}} = 8.66 \]
Earlier we saw that the point estimate of the predicted value for y was 66% when
x = 50%.
The required value from the t-tables, t_{10}^{(0.025)}, is 2.228. Note also x̄ = 59.5, SSx =
3119. Thus the 95% prediction interval is given by
\[ \Big( 66.0 - 2.228 \times 8.66 \times \sqrt{1 + \frac{1}{12} + \frac{(50 - 59.5)^2}{3119}} \; ; \;\; 66.0 + 20.35 \Big) = (45.65, \, 86.35). \]
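The arithmetic of this example can be reproduced from the summary quantities alone. The following is a sketch under the assumption that the t-value 2.228 is read from the tables, as in the text; the function name `prediction_interval` is illustrative.

```python
import math

def prediction_interval(predicted, t_crit, s, n, x, x_bar, ss_x):
    """Return (lower, upper) limits: predicted -/+ t * s * sqrt(1 + 1/n + (x - x_bar)^2 / SSx)."""
    half_width = t_crit * s * math.sqrt(1 + 1 / n + (x - x_bar) ** 2 / ss_x)
    return predicted - half_width, predicted + half_width

# Residual standard deviation from SSy, b and SSxy (Example 1A).
s = math.sqrt((1748.67 - 0.566 * 1764) / (12 - 2))

# 95% prediction interval for a class record of 50%.
low, high = prediction_interval(
    predicted=66.0, t_crit=2.228, s=s, n=12, x=50, x_bar=59.5, ss_x=3119
)
print(f"s = {s:.2f}, interval = ({low:.2f}, {high:.2f})")
```

Moving x further from x̄ = 59.5 increases the term under the square root, so the interval widens, which is the point made in the discussion of extrapolation above.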
\[ \frac{SSR}{SST}. \]
\[ \frac{b \, SS_{xy}}{SS_y} = \frac{SS_{xy}^2}{SS_x \, SS_y} = r^2. \]
So we have the astonishing result that the coefficient of determination is simply the
square of the correlation coefficient! For this reason the coefficient of determination in
simple linear regression is denoted r².
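\[ \bar{x} = \frac{1}{n} \sum x \qquad \bar{y} = \frac{1}{n} \sum y \]
\[ SS_x = \sum x^2 - \Big( \sum x \Big)^2 / n \]
\[ SS_{xy} = \sum xy - \Big( \sum y \Big) \Big( \sum x \Big) / n \]
\[ SS_y = \sum y^2 - \Big( \sum y \Big)^2 / n \]
\[ b = SS_{xy} / SS_x \qquad a = \bar{y} - b \bar{x} \]
\[ r = SS_{xy} / \sqrt{SS_x \, SS_y} \]
\[ s = \sqrt{(SS_y - b \, SS_{xy}) / (n - 2)} \]
Predicted values : ŷ = a + bx
95% prediction intervals:
\[ \hat{y} \pm t_{n-2}^{(0.025)} \, s \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{SS_x}} \]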
Example 6B: In an investigation to determine whether the rewards on financial investments are related to the risks taken, data on a sample of 17 sector indices on the
Johannesburg Stock Exchange was collected (Source: Financial Risk Service, D.J. Bradfield & D. Bowie). The data gives a beta value for each sector (x), a quantity widely
used by financial analysts as a proxy for risk, and the reward (y ) or return for each
sector, calculated as the percentage price change over a 12 month period.
(a) Draw a scatter plot and discuss the visual appearance of the relationship.
(b) Calculate the correlation coefficient, and decide if the relationship between risk
and return is significant.
(c) Find the regression line. What is the coefficient of determination of the regression?
(d) Find 95% prediction intervals for the predicted returns for investments with risks
given by betas of 0.5, 0.8 and 1.5.
Sector                      Beta x   Return (%) y      x²       y²        xy
 1. Diamonds                 1.31        46           1.716    2116     60.26
 2. West Wits                1.15        67           1.323    4489     77.05
 3. Metals & Mining          1.29        46           1.664    2116     59.34
 4. Platinum                 1.44        40           2.074    1600     57.60
 5. Mining Houses            1.46        78           2.132    6084    113.88
 6. Evander                  1.00        40           1.000    1600     40.00
 7. Motor                    0.74        24           0.548     576     17.76
 8. Investment Trusts        0.67        20           0.449     400     13.40
 9. Banks                    0.69        17           0.476     289     11.73
10. Insurance                0.79        19           0.624     361     15.01
11. Property                 0.27         8           0.073      64      2.16
12. Paper                    0.75        25           0.563     625     18.75
13. Industrial Holdings      0.83        27           0.689     729     22.41
14. Beverages & Hotels       0.81        34           0.656    1156     27.54
15. Electronics              0.11         7           0.012      49      0.77
16. Food                     0.11        19           0.012     361      2.09
17. Printing                 0.51        33           0.260    1089     16.83
Totals                      13.93       550          14.271   23704    556.58
(a) The scatter plot shows that the percentage return increases with increasing values
of beta. Furthermore, the plot makes it clear that it is appropriate, over the
observed range of values for beta, to fit a straight line to this data.
[Figure: scatter plot of percentage return (y, 0 to 100) against beta (x, 0.0 to 2.0).]
Having computed the basic sums at the foot of the table, we compute x̄ = 0.8194
and ȳ = 32.35. Then
SSx = 14.271 − (13.93)²/17 = 2.857
3. r = SSxy /√(SSx SSy) = 105.904/(2.857 × 5909.882)^{1/2} = 0.815
s = √((SSy − b SSxy)/(n − 2)) = 11.501.
We need t_{15}^{(0.025)}. From the t-tables (Table 2), this is 2.131. We note that x̄ =
0.8194.
The 95% prediction interval for x = 0.5 is thus
\[ \Big( 20.510 - 2.131 \times 11.501 \times \sqrt{1 + \frac{1}{17} + \frac{(0.5 - 0.8194)^2}{2.857}} \, , \;\; 20.510 + 25.641 \Big) \]
or (−5.13%, 46.15%).
For x = 0.8 and x = 1.5, the only terms in the prediction interval formula that
change are the first term (the predicted value) and the final term under the square
root sign. When x = 0.8, the 95% prediction interval is
\[ \Big( 31.630 - 2.131 \times 11.501 \times \sqrt{1 + \frac{1}{17} + \frac{(0.8 - 0.8194)^2}{2.857}} \, , \;\; 31.630 + 25.221 \Big) \]
or (6.41%, 56.85%).
For x = 1.5, we have
\[ \Big( 57.578 - 2.131 \times 11.501 \times \sqrt{1 + \frac{1}{17} + \frac{(1.5 - 0.8194)^2}{2.857}} \, , \;\; 57.578 + 27.081 \Big) \]
or (30.50%, 84.66%).
Of the three values for which we have computed prediction intervals, x = 0.8 lies
closest to the mean of the observed range of x-values, and therefore the associated
prediction interval for x = 0.8 is the shortest of the three.
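For readers who prefer to let a computer do the sums, the whole of Example 6B can be sketched in Python from the raw data. The helper names `simple_regression` and `prediction_interval` are illustrative, and the t-value 2.131 is taken from the tables as in the text. Because the code carries full precision rather than the rounded table entries, its results agree with the hand calculation only to about two significant decimals.

```python
import math

beta = [1.31, 1.15, 1.29, 1.44, 1.46, 1.00, 0.74, 0.67, 0.69,
        0.79, 0.27, 0.75, 0.83, 0.81, 0.11, 0.11, 0.51]
ret = [46, 67, 46, 40, 78, 40, 24, 20, 17, 19, 8, 25, 27, 34, 7, 19, 33]

def simple_regression(xs, ys):
    """Computational scheme of this chapter: SSx, SSxy, SSy, then b, a, r and s."""
    n = len(xs)
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    b = ss_xy / ss_x
    a = sum(ys) / n - b * sum(xs) / n
    r = ss_xy / math.sqrt(ss_x * ss_y)
    s = math.sqrt((ss_y - b * ss_xy) / (n - 2))
    return {"n": n, "x_bar": sum(xs) / n, "ss_x": ss_x, "a": a, "b": b, "r": r, "s": s}

def prediction_interval(fit, x, t_crit):
    """Prediction interval for y at a given x, using the chapter's formula."""
    predicted = fit["a"] + fit["b"] * x
    half = t_crit * fit["s"] * math.sqrt(
        1 + 1 / fit["n"] + (x - fit["x_bar"]) ** 2 / fit["ss_x"]
    )
    return predicted - half, predicted + half

fit = simple_regression(beta, ret)
T15 = 2.131  # upper 2.5% point of t with 15 degrees of freedom, from the tables
intervals = {x: prediction_interval(fit, x, T15) for x in (0.5, 0.8, 1.5)}

print(f"r = {fit['r']:.3f}, r^2 = {fit['r'] ** 2:.3f}, s = {fit['s']:.2f}")
for x, (low, high) in intervals.items():
    print(f"beta = {x}: ({low:.2f}%, {high:.2f}%)")
```

Comparing the three interval widths shows the same pattern as the hand calculation: the interval at x = 0.8, nearest to x̄, is the narrowest.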
Example 7C: To check on the strength of certain large steel castings, a small test piece
is produced at the same time as each casting, and its strength is taken as a measure of
the strength of the large casting. To examine whether this procedure is satisfactory, i.e.,
the test piece is giving a reliable indication of the strength of the castings, 11 castings
were chosen at random, and both they, and their associated test pieces were broken.
The following were the breaking stresses:
Test piece (x) :   45   67   61   77   71   51   45   58   48   62   36
Casting (y) :      39   86   97  102   74   53   62   69   80   53   48
Exponential growth . . .
Example 8A: Fit an exponential growth curve of the form y = a e^{bx} to the population
of South Africa, 1904-1976:
Year (x) :                     1904   1911   1921   1936   1946   1951   1960   1970   1976
Population (millions) (y) :     5.2    6.0    6.9    9.6   11.4   12.7   16.0   21.8   26.1
[Figure: scatter plot of population in millions (y, 0 to 30) against year (x, 1900 to 1980).]
However, there is a straightforward way to transform this into a linear relation. Take
natural logarithms on both sides of the equation y = a e^{bx}. This yields (remembering
log_e e = 1)
log y = log a + bx.
If we now put Y = log y and A = log a, we can rewrite it as
Y = A + bx.
This is the straight line with which we are familiar! The scatter plot of the logarithm of
population against year is plotted below; a straight line through the points now seems
to be a satisfactory model.
[Figure: scatter plot of the logarithm of population (Y, log_e millions) against year (x, 1900 to 1980).]
By using our computational scheme for linear regression on the pairs of data values
x and Y (= log y), we compute the regression coefficients A and b. Having done this, we
can now transform back to the exponential growth curve we wanted. Because Y = log y ,
we can write
\[ y = e^{Y} = e^{A + bx} = e^{A} e^{bx} = a e^{bx}, \]
where a = eA . The table below demonstrates the computational procedure for fitting
exponential growth. (It is convenient here to take x to be years since 1900.)
   x       y     Y = log y      x²        Y²        xY
   4     5.2       1.65          16      2.723      6.60
  11     6.0       1.79         121      3.204     19.69
  21     6.9       1.93         441      3.725     40.53
  36     9.6       2.26        1296      5.108     81.36
  46    11.4       2.43        2116      5.905    111.78
  51    12.7       2.54        2601      6.452    129.54
  60    16.0       2.77        3600      7.673    166.20
  70    21.8       3.08        4900      9.486    215.60
  76    26.1       3.26        5776     10.628    247.76
 375              21.72      20 867     54.904   1019.06
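The computational procedure in the table can be sketched in Python as follows. The names `fit_line` and `predicted_population` are illustrative, and the code works with unrounded logarithms, so its sums differ slightly from the two-decimal values in the table.

```python
import math

years_since_1900 = [4, 11, 21, 36, 46, 51, 60, 70, 76]
population = [5.2, 6.0, 6.9, 9.6, 11.4, 12.7, 16.0, 21.8, 26.1]  # millions

def fit_line(xs, ys):
    """Least squares intercept and slope via SSx and SSxy."""
    n = len(xs)
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    slope = ss_xy / ss_x
    intercept = sum(ys) / n - slope * sum(xs) / n
    return intercept, slope

# Step 1: transform, Y = log y (natural logarithm).
log_pop = [math.log(p) for p in population]

# Step 2: ordinary linear regression of Y on x gives A and b.
A, b = fit_line(years_since_1900, log_pop)

# Step 3: transform back, a = e^A, so that y = a * e^(b x).
a = math.exp(A)

def predicted_population(year):
    """Fitted population in millions for a calendar year."""
    return a * math.exp(b * (year - 1900))

print(f"A = {A:.4f}, b = {b:.5f}, a = {a:.3f}")
print(f"fitted value for 1976: {predicted_population(1976):.1f} million")
```

The slope b has a direct interpretation: the fitted population grows by a factor of e^b per year, i.e. at roughly 100b per cent per year when b is small.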
and
B = log b to obtain
A + Y = xB
or
Y = A xB.
Example 10B: To investigate the relationship between the curing time of concrete and
the tensile strength the following results were obtained:
Curing time in days (x) :          1 2/3    2    2 1/2   3 1/3    5     10
Tensile strength in kg/cm² (y) :   22.4   24.5   26.3    30.2   33.9   35.5
y = a e^{b/x},
find the regression coefficients a and b.
(c) Predict the tensile strength after curing time of three days.
(a) [Figure: scatter plot of tensile strength (kg/cm², 10 to 40) against curing time in days (x).]
(b) Taking logarithms we obtain
log y = log a + b/x.
Put Y = log y, A = log a and X = 1/x.
The computational scheme is:
    y      X = 1/x    Y = log y      X²        Y²        XY
  22.4       0.6        3.11        0.36      9.67      1.87
  24.5       0.5        3.20        0.25     10.24      1.60
  26.3       0.4        3.27        0.16     10.69      1.31
  30.2       0.3        3.41        0.09     11.63      1.02
  33.9       0.2        3.52        0.04     12.39      0.70
  35.5       0.1        3.57        0.01     12.74      0.36
             2.1       20.08        0.91     67.36      6.86
Ȳ = 3.347
SSX = 0.175
SSXY = −0.168
b = SSXY /SSX = −0.960
and
A = Ȳ − b X̄ = 3.683.
Thus the equation for predicting Y, the logarithm of the tensile strength y, from
X, the reciprocal of curing time x, is
Y = 3.683 − 0.960 X.
To express this in terms of the original variables x and y, we write
\[ y = e^{Y} = e^{A + bX} = e^{A} e^{bX} = e^{3.683} e^{-0.960 X} = 39.766 \, e^{-0.960/x}. \]
(c) After 3 days curing, i.e. x = 3,
y = 39.766 e^{−0.960/3} = 28.9 kg/cm².
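The same two-step idea (transform, then fit a straight line) can be sketched in Python for this example. The variable names are illustrative. The code uses unrounded logarithms, so its coefficients differ a little from the hand calculation, which rounds Y to two decimals before summing.

```python
import math

curing_days = [5 / 3, 2, 5 / 2, 10 / 3, 5, 10]
strength = [22.4, 24.5, 26.3, 30.2, 33.9, 35.5]  # kg/cm^2

# Transform both variables: X = 1/x and Y = log y.
X = [1 / x for x in curing_days]
Y = [math.log(y) for y in strength]

# Ordinary simple linear regression of Y on X.
n = len(X)
ss_x = sum(v * v for v in X) - sum(X) ** 2 / n
ss_xy = sum(u * v for u, v in zip(X, Y)) - sum(X) * sum(Y) / n
b = ss_xy / ss_x
A = sum(Y) / n - b * sum(X) / n
a = math.exp(A)  # back-transform the intercept

def predicted_strength(days):
    """Fitted tensile strength y = a * exp(b / x) after the given curing time."""
    return a * math.exp(b / days)

print(f"b = {b:.3f}, A = {A:.3f}, a = {a:.2f}")
print(f"predicted strength after 3 days: {predicted_strength(3):.1f} kg/cm^2")
```

Because b is negative, the factor e^{b/x} increases towards 1 as x grows, so the fitted strength rises with curing time and levels off at a.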
Example 11C: A company that manufactures gas cylinders is interested in assessing the
relationships between pressure and volume of a gas. The table below gives experimental
values of the pressure P of a given mass of gas corresponding to various values of the
volume V. According to thermodynamic principles, the relationship should be of the
form P V^γ = C, where γ and C are constants.
(a) Find the values of γ and C that best fit the data.
(b) Estimate P when V = 100.0.
Volume V :      54.3    61.8    72.4    88.7   118.6   194.0
Pressure P :    61.2    49.5    37.6    28.4    19.2    10.1
Multiple regression . . .
In many practical situations, it is more realistic to believe that more than one explanatory variable is related to the dependent variable. Thus the quality of the grape
harvest depends not only on the amount of rain that falls during spring, but also probably on the hours of sunshine during summer, the amount of irrigation during summer,
whether the irrigation was by sprinklers, furrows or a drip system, the amounts and types
of fertilizers applied, the amounts and types of pesticides used, and even whether or not
the farmer used scarecrows to frighten the birds away! Regression models that include
more than one explanatory variable are called multiple regression models. Multiple
regression should be seen as a straightforward extension of simple linear regression.
The general form of the multiple regression model is
y = β0 + β1 x1 + β2 x2 + · · · + βk xk + e
where y is the dependent variable, x1, x2, . . . , xk are the explanatory variables and β0,
β1, β2, . . . , βk are the true regression coefficients, the population regression parameters.
The term e at the end of the regression model is usually called the "error", but this is
a bad name. It does not mean "mistake", but it is intended to absorb the variability in
the dependent variable y which is not accounted for by the explanatory variables which
we have measured and have included in the regression model.
The regression coefficients β0, β1, β2, . . . , βk are unknown parameters (note the
use of Greek letters once again for parameters) and need to be estimated. The data from
which we are to estimate the regression coefficients consists of n sets of k + 1 numbers
of the form: the observed values of the dependent variable y, and the associated values
of the k explanatory variables xi. For example, we would measure the quality of the
grape harvest, together with all the values of the explanatory variables that led to that
quality.
is that the underlying philosophy remains the same: we minimize the sum of squared
residuals.
There are no simple explicit formulae for the regression coefficients bi for multiple
regression. They are most conveniently expressed in terms of matrix algebra. But they
are readily computed by a multitude of statistical software packages. We assume that
you have access to a computer that will do the calculations, and we will take the approach
of helping you to interpret the results.
Example 12A: To demonstrate how a regression equation can be estimated for two or
more explanatory variables, we consider again Example 2B where the personnel manager
was concerned with the analysis and prediction of monthly incomes.
Recall that in Example 2B the relationship between monthly income (y) and years
of education (which we will now denote x1) was estimated using the simple regression
model:
ŷ = −0.535 + 0.555x1.
Because r = 0.5911 was highly significant, we had established that a relationship existed
between the two variables. But the coefficient of determination r² = 0.3494, so that
approximately 35% of the variability in incomes could be explained by this single variable,
years of formal education.
But what about the remaining 65%? The personnel manager is curious to establish whether any other variable can be found that will help to reduce this unexplained
variability and which will improve the goodness of fit of the model.
After some consideration, the personnel manager considers that years of relevant
experience is another relevant explanatory variable that could also impact on their
incomes. He gathers the extra data, and calls it variable x2 .
Person   Monthly income   Years of formal   Years of
         (R1000s)         education         experience
         y                x1                x2
   1          4               12                 2
   2          5               10                 6
   3          8               15                16
   4         10               12                23
   5          9               16                12
   6          7               15                11
   7          5               12                 1
   8         10               16                18
   9          7               14                12
  10          6               14                 5
  11          8               16                17
  12          6               12                 4
  13          9               15                20
  14          4               10                 6
  15          7               18                 7
  16          8               11                13
  17          9               17                12
  18         11               15                 3
  19          4               12                 2
  20          5               13                 6
Now the regression equation that includes both the effect of years of education (x1)
and the years of relevant experience (x2) is
ŷ = b0 + b1 x1 + b2 x2.
The computer generated solution is
ŷ = 0.132 + 0.379x1 + 0.180x2
where x1 is years of formal education and x2 is years of relevant experience.
Notice that the coefficients of x1 and the intercept have changed from the solution
when only x1 was in the model. Why? Because some of the variation previously explained by x1 is now explained by x2 . (There is a mathematical result that says that
the earlier regression coefficients will remain unchanged only if there is no correlation
between x1 and x2 .)
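To show what the "computer generated solution" involves, here is a minimal pure-Python sketch that minimizes the sum of squared residuals by solving the normal equations (X'X)b = X'y. It is an illustration under stated assumptions, not the method of any particular statistical package: the function names `solve_linear_system` and `least_squares` are hypothetical, and real software typically uses numerically more robust matrix decompositions.

```python
def solve_linear_system(a, rhs):
    """Solve a small square system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [v] for row, v in zip(a, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def least_squares(columns, y):
    """Fit y = b0 + b1*x1 + ... + bk*xk; return (coefficients, R squared)."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    coeffs = solve_linear_system(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coeffs, row)) for row in design]
    y_bar = sum(y) / n
    sst = sum((yi - y_bar) ** 2 for yi in y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return coeffs, 1 - sse / sst

# Data of Example 12A: income (R1000s), years of education, years of experience.
income = [4, 5, 8, 10, 9, 7, 5, 10, 7, 6, 8, 6, 9, 4, 7, 8, 9, 11, 4, 5]
education = [12, 10, 15, 12, 16, 15, 12, 16, 14, 14, 16, 12, 15, 10, 18, 11, 17, 15, 12, 13]
experience = [2, 6, 16, 23, 12, 11, 1, 18, 12, 5, 17, 4, 20, 6, 7, 13, 12, 3, 2, 6]

(b0, b1, b2), r_squared = least_squares([education, experience], income)
(c0, c1), r_squared_simple = least_squares([education], income)

print(f"two variables: y = {b0:.3f} + {b1:.3f} x1 + {b2:.3f} x2, R^2 = {r_squared:.3f}")
print(f"education only: y = {c0:.3f} + {c1:.3f} x1, R^2 = {r_squared_simple:.3f}")
```

Fitting the model with and without x2 makes the change in the coefficient of x1 visible directly, which is the point of the paragraph above.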
In multiple regression we can interpret the magnitude of the regression coefficient
bi as the change induced in y by a change of 1 unit in xi, holding all the other variables
constant. In the above example, b1 = 0.379 can be interpreted as measuring the change
in income (y) corresponding to a one-year increase in years of formal education (while
holding years of experience constant). And b2 = 0.180 is the change in y induced by
a one-year increase in experience (while holding years of formal education constant).
Because y is measured in 1000s of rands per month, the regression model is telling us
that each year of education is worth R379 per month, and that each year of experience
is worth R180 per month!
One question that the personnel manager would ask is: "To what extent has the
inclusion of the additional variable contributed towards explaining the variation of incomes?" For simple regression, the coefficient of determination, r², answered this question. We now have a multiple regression equivalent, called the multiple coefficient
of determination, denoted R², which measures the proportion of variation in the dependent variable y which is explained jointly by all the explanatory variables in the
regression model. The computation of R² is not as straightforward as in simple regression, but is invariably part of the computer output. In our example, the computer
printout that gave us the model
ŷ = 0.132 + 0.379x1 + 0.180x2
also told us that R² = 0.607. This means that 60.7% of the variability of incomes is
explained jointly by x1 (years of formal education) and x2 (years of relevant experience).
This represents a substantial improvement in the explanation of y (monthly incomes)
from 35% to 61%. However, 39% of the variation in y remains unexplained, and is
absorbed into the "error" term discussed earlier. This leads us into taking a closer look
at the sources of variation in the dependent variable y.
...
\[ \sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2. \]
The multiple coefficient of determination is defined in the same way as in the simple
case. It is the ratio of the sum of squares explained by the regression and the total sum
of squares,
\[ R^2 = \frac{SSR}{SST}. \]
But in multiple regression, we do need to have a different notation for the multiple coefficient of determination (R² in place of r²) because we no longer have a straightforward
squaring of a sample correlation coefficient.
The partitioning of the total sum of squares also helps to provide insight into the
structure of the multiple regression model. In the section that follows, we will see that
the partitioning enables us to decide whether the equation generated by the multiple
regression is significant. To do this in simple regression, we merely calculated the
correlation coefficient r and checked in Table 5 whether or not this value of r was
significant. The approach is now quite different.
...
In multiple regression, the appropriate null and alternative hypotheses for testing
whether or not there is a significant relationship between y and the xi's are:
H0 : β1 = β2 = · · · = βk = 0
H1 : one or more of the coefficients are non-zero.
If we reject the null hypothesis we can conclude that there is a significant relationship
between the dependent variable y and at least one of the explanatory variables xi .
The test statistic for this hypothesis is couched in terms of the ratio between the
variance due to regression and that due to error. It is therefore not surprising that the
F -distribution provides the critical values for the test statistic (recall chapter 9). The
test statistic is calculated as one of the outputs in most regression software packages,
and is usually presented as part of an analysis of variance or ANOVA table.
The ANOVA table summarizes the sources of variation, and usually has the following
structure:
ANOVA table
Source        Sum of         Degrees of      Mean square               F
              Squares (SS)   Freedom (DF)    (MS)
Regression    SSR            k               MSR = SSR/k               F = MSR/MSE
Error         SSE            n − k − 1       MSE = SSE/(n − k − 1)
Total         SST            n − 1
The test statistic is the variance explained by the regression divided by the variance
due to error. The distribution of the test statistic for the above null hypothesis is
\[ F = \frac{MSR}{MSE} \sim F_{k, \, n-k-1}, \]
The hypotheses for testing the significance of the regression can thus be formulated
as:
1. H0 : β1 = β2 = 0
2. H1 : one or more of the coefficients are non-zero.
3. Significance level: 5%
4. Rejection region. Because n = 20 and k = 2, the test statistic has the F_{k, n−k−1} =
F_{2,17}-distribution. We reject H0 if the observed F-value is greater than the upper
5% point of F_{2,17}, i.e. if it is greater than F_{2,17}^{(0.05)} = 3.59.
5. Test statistic. The ANOVA table obtained from the computer printout looks like
this:
ANOVA table
Source        Sum of         Degrees of      Mean Square      F
              Squares (SS)   Freedom (DF)    (MS)
Regression    54.5391         2              27.2696          13.15
Error         35.2609        17               2.0742
Total         89.8000        19
= 6.5.
4. Rejection region. The critical value of the test statistic is once again t_{17}^{(0.025)} =
2.110, and we will reject H0 if |t| > 2.110.
5. Test statistic. The values b2 = 0.180 and s_{b2} = 0.054 are computed by the
regression programme, and t = 0.180/0.054 = 3.33.
6. Conclusion. Because the observed t-value lies in the rejection region, we reject
H0 : β2 = 0 at the 5% level, and conclude that the variable x2, number of years
of relevant experience, is also significant in the regression model.
Example 13B: A market research company is interested in investigating the relationship between monthly sales income and advertising expenditure on radio and television
for a specific area. The following data is gathered.
Sample    Monthly     Radio          Television
number    sales       advertising    advertising
          (R1000s)    (R1000s)       (R1000s)
   1        105          0.5            6.0
   2         99          1.0            3.0
   3        104          0.5            5.0
   4        101          1.5            3.5
   5        106          2.3            4.0
   6        103          1.3            4.5
   7        103          3.2            3.5
   8        104          1.5            4.0
   9        100          0.8            3.0
  10         98          0.6            2.8
(a) Find the regression equation relating monthly sales revenue to radio and television
advertising expenditure.
(b) Estimate and interpret R2 .
(c) Test the regression equation for significance at the 5% significance level.
(d) Test, for each variable individually, whether radio and television advertising expenditure are significant in the model at the 5% level.
(e) Comment on the effectiveness of radio vs television advertising for this industry.
You will need the following information, extracted from a computer printout, to
answer the questions.
Table of estimated coefficients
Variable      Estimated      Estimated
              coefficient    standard deviation
Radio           1.6105         0.4687
Television      2.3414         0.4041
Intercept      90.9725

ANOVA table
Source        Sum of         Degrees of      Mean Square      F
              Squares (SS)   Freedom (DF)    (MS)
Regression    54.194          2              27.097           19.15
Error          9.906          7               1.415
Total         64.100          9
= 4.74
= 2.365.
= 2.365.
Person    Number      Number of       Months of
          of sales    hours worked    experience
   1         4            5               1
   2         2            4               2
   3        15           12               6
   4         9           10               6
   5        11            9               8
   6         8            8              10
   7        14           13              12
   8        17           14              15
   9        16           12              14
  10         2            4               3
(a) Write down the regression equation relating number of sales (y ) to number of
hours worked (x1 ) and months of experience (x2 ).
(b) Compute and interpret R2 .
(c) Test the regression equation for significance at the 5% significance level.
(d) Test, for x1 and x2 individually, that they are significant in the model at the 5%
level.
(e) Do you think that the experience of part-time employees makes a difference to
their number of sales? Explain.
You will require the following information extracted from the relevant computer
printout.
Table of estimated coefficients
Variable        Estimated      Estimated
                coefficient    standard deviation
Hours worked      1.3790         0.2270
Experience        0.0998         0.1716
Intercept        -3.5176

ANOVA table
Source        Sum of         Degrees of      Mean Square
              Squares (SS)   Freedom (DF)    (MS)
Regression    282.708         2
Error          12.892         7
Total         295.599         9
Person   Monthly income   Years of formal   Years of      Gender
         (R1000s)         education         experience
         y                x1                x2            x3
   1          4               12                 2           1
   2          5               10                 6           0
   3          8               15                16           1
   4         10               12                23           0
   5          9               16                12           1
   6          7               15                11           0
   7          5               12                 1           1
   8         10               16                18           1
   9          7               14                12           0
  10          6               14                 5           0
  11          8               16                17           1
  12          6               12                 4           1
  13          9               15                20           0
  14          4               10                 6           0
  15          7               18                 7           1
  16          8               11                13           1
  17          9               17                12           0
  18         11               15                 3           1
  19          4               12                 2           0
  20          5               13                 6           0
...
In the discussion thus far on regression, all the explanatory variables have been
quantitative. Occasions frequently arise, however, when we want to include a categorical
or qualitative variable as an explanatory variable in a multiple regression model. We
demonstrate in this section how categorical variables can be included in the model.
Example 12A, continued: Assume now that the personnel manager who was trying
to find explanatory variables to predict monthly income now believes that, in addition
to years of formal education and years of relevant experience, gender may have a bearing
on an individual's monthly income. Test whether the gender variable is significant in
the model (at the 5% level).
The personnel manager first gathers the gender information on each person in the
sample. The next step is to convert the qualitative variable "gender" into a dummy
variable. A dummy variable consists only of 0's and 1's. In this example, we will code
gender as a dummy variable x3. We put x3 = 0 if the gender is female and x3 = 1 if the
gender is male. Results of this coding are shown in the final column of the table. The
variable x3 can be thought of as a "switch" which is turned "on" or "off" to indicate
the two genders.
We now proceed as if the dummy variable x3 were just an ordinary explanatory variable in the regression model, and estimate the regression coefficients using the standard
least squares method to obtain the estimated model:
ŷ = b0 + b1 x1 + b2 x2 + b3 x3.
Now if gender is significant in the model, b3 will be significantly different from zero.
The interpretation of b3 is the estimated difference in monthly incomes between males
and females (holding the other explanatory variables constant). In the way that we have
coded x3 , then a positive value for b3 would indicate that males are estimated to earn
more than females, and vice versa.
The computer printout contains the table of estimated regression coefficients.
Table of estimated coefficients
Variable               Estimated      Estimated             Computed
                       coefficient    standard deviation    t-value
Years of education       0.321          0.156                 2.06
Years of experience      0.192          0.054                 3.56
Gender                   0.838          0.664                 1.26
Intercept                0.381

ANOVA table
Source        Sum of         Degrees of      Mean Square      F
              Squares (SS)   Freedom (DF)    (MS)
Regression    57.7390         3              19.246           9.60
Error         32.0610        16               2.004
Total         89.8000        19
The observed F-value of 9.60 needs to be compared with the five per cent point of F_{3,16}.
From the F-tables, this is 3.24, so the multiple regression model is significant.
The multiple coefficient of determination is computed as R² = SSR/SST = 57.739/89.8 =
0.643. As a result of the inclusion of the additional variable, gender, the multiple coefficient of determination has increased from 60.7% (when we had two explanatory variables)
to 64.3%. This seems a relatively small increase, especially when compared with the increase in the coefficient of determination in going from one explanatory variable (34.9%)
to two (60.7%). This leads us to ask whether the addition of the third explanatory
variable x3 was worthwhile. We can do this using the methods of the previous section:
we test if the coefficient associated with gender is significantly different from zero.
Following our standard layout for doing a test of a statistical hypothesis (for the last
time!), the appropriate null and alternative hypotheses are:
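1. $H_0 : \beta_3 = 0$
2. $H_1 : \beta_3 \neq 0$
3. Significance level: 5%
4. The degrees of freedom are $n - k - 1 = 20 - 3 - 1 = 16$. We reject $H_0$ if
$|t| > t_{16}^{0.025} = 2.120$.
5. The test statistic is
$$t = \frac{b_3}{s_{b_3}} = \frac{0.838}{0.664} = 1.26,$$
which is less than 2.120, so we cannot reject $H_0$: gender is not significant in the model.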
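The same ratio can be formed for every explanatory variable in the printout. This sketch assumes the pairing of coefficients and standard deviations implied by the printed t-values (education 0.321 and 0.156, experience 0.192 and 0.054, gender 0.838 and 0.664); the dictionary and variable names are illustrative only.

```python
# (estimated coefficient, estimated standard deviation) for each explanatory variable.
# The pairing is an assumption, chosen because it reproduces the printed t-values.
coefficients = {
    "years of education": (0.321, 0.156),
    "years of experience": (0.192, 0.054),
    "gender": (0.838, 0.664),
}

# Two-sided 5% critical value of t with 16 degrees of freedom, as used in the text.
t_critical = 2.120

# Each t-ratio is the coefficient divided by its estimated standard deviation.
t_ratios = {name: b / s for name, (b, s) in coefficients.items()}

for name, t in t_ratios.items():
    verdict = "reject H0" if abs(t) > t_critical else "cannot reject H0"
    print(f"{name}: t = {t:.2f} -> {verdict}")
```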
x5 = 0 for married
x5 = 0 for single
x5 = 1 for divorced
Example 15C: In example 6B, we investigated the relationship between rewards and
risks on the Johannesburg Stock Exchange (JSE). It is well known that the JSE can be
divided into three major categories of shares: Mining (M), Finance (F) and Industrial
(I). The shares of example 6B fall into the following categories:

Return (%)   Beta   Industry
y            x1
46           1.31   M
67           1.15   M
46           1.29   M
40           1.44   M
78           1.46   F
40           1.00   M
24           0.74   I
20           0.67   F
17           0.69   F
19           0.79   F
8            0.27   F
25           0.75   I
27           0.83   I
34           0.81   I
7            0.11   I
19           0.11   I
33           0.51   I

(a) Design a system of dummy variables which accommodates the three share categories
in the regression model. Show how each sector is coded.
(b) Use a multiple regression computer package to compute the regression analysis
with return (y) as the dependent variable using, as explanatory variables, the
beta (x1) and the share category effects. Interpret the regression coefficients.
(c) Test whether the share category effects are significant.
Solutions to examples . . .
3C (a) x = 9.284 + 0.629y
(b) Making x the subject of the formula yields x = 0.964 + 1.802y.
Note that the method of least squares chooses the coefficients a and b to minimize
vertical distances, i.e. distances parallel to the y-axis. Thus it is not symmetric
in its treatment of x and y. Interchanging the roles of x and y gives rise to a new
arithmetical problem, and hence a different solution.
5C (a) P < 0.05 (P < 0.02 is also correct) (b) P < 0.20, but this would not be
considered significant.
(c) P < 0.0005 (d) P < 0.025 (e) P < 0.20, but
not significant!
7C (b) r = 0.704, P < 0.005, a significant relationship exists.
(c) r 2 = 0.4958. Almost 50% of the variation in the breaking stress of the casting
can be explained by the breaking stress of the test piece.
(d) y = 4.50 + 1.15 x
(e) (44, 103).
11C (a) The exponent is 1.404, C = 15978.51, log(C) = 9.679, thus $PV^{1.404} = 15978.51$
(b) 24.86
(d) For x1 : t = 6.074 > 2.365, significant in the model.
For x2 : t = 0.581 < 2.365, not significant in the model.
(e) The result suggests that experience is not an important factor for the performance of part-time salespersons.
15C (a) The final two columns show the two dummy variables needed to code the
share categories.

Sector                    Return (%)   Beta   Dummy   Dummy
                          y            x1     x2      x3
1. Diamonds               46           1.31   0       0
2. West Wits              67           1.15   0       0
3. Metals & Mining        46           1.29   0       0
4. Platinum               40           1.44   0       0
5. Mining Houses          78           1.46   1       0
6. Evander                40           1.00   0       0
7. Motor                  24           0.74   0       1
8. Investment Trusts      20           0.67   1       0
9. Banks                  17           0.69   1       0
10. Insurance             19           0.79   1       0
11. Property              8            0.27   1       0
12. Paper                 25           0.75   0       1
13. Industrial Holdings   27           0.83   0       1
14. Beverages & Hotels    34           0.81   0       1
15. Electronics           7            0.11   0       1
16. Food                  19           0.11   0       1
17. Printing              33           0.51   0       1
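
The coding in solution 15C (a) can be generated mechanically from the category labels. The sketch below assumes Mining (M) as the baseline category, with x2 flagging Finance (F) and x3 flagging Industrial (I), which is the pattern the solution table shows; the function name `dummy_code` is hypothetical.

```python
# Share categories of the 17 sectors of example 15C, in table order.
categories = ["M", "M", "M", "M", "F", "M", "I", "F", "F",
              "F", "F", "I", "I", "I", "I", "I", "I"]


def dummy_code(category, baseline="M", levels=("F", "I")):
    """Return one 0/1 indicator per non-baseline level (k categories need k - 1 dummies)."""
    if category != baseline and category not in levels:
        raise ValueError(f"unknown category: {category!r}")
    return tuple(int(category == level) for level in levels)


dummies = [dummy_code(c) for c in categories]
x2 = [d[0] for d in dummies]  # 1 for Finance, 0 otherwise
x3 = [d[1] for d in dummies]  # 1 for Industrial, 0 otherwise

for number, (category, d) in enumerate(zip(categories, dummies), start=1):
    print(number, category, d)
```

Three categories need only two dummy variables: the baseline category is identified by both dummies being zero.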
Exercises . . .
For each example it is helpful to plot a scatter plot.
12.1 The marks of 10 students in two class tests are given below.
(a) Calculate the correlation coefficient and test it for significance at the 5% level.
(b) Find the regression line for predicting y from x.
(c) What mark in the second test do you predict for a student who got 70% for
the first test?

1st class test (x)   65   78   52   82   92   89   73   98   56   76
2nd class test (y)   39   43   21   64   57   47   27   75   34   52

12.2 The table below shows the mass (y) of potassium bromide that will dissolve in
100 ml of water at various temperatures (x).

Temperature x (°C)   0    20   40   60   80
Mass y (g)           54   65   75   85   96
1972
186
1973
209
1974
245
1975
293
1976
310
1977
337
1978
360
1979
420
12.4 Investigate the relationship between the depth of a lake (y) and the distance from
the lake shore (x). Use the following data, which gives depths at 5 metre intervals
along a line at right angles to the shore.
Distance from
shore (m) (x)
Depth (m) (y)
10
15
20
25
30
35
40
45
13
22
37
57
94
1950
733
1955
960
1960
1125
1965
1483
1969
1810
1973
2295
1975
2533
1977
3005
1978
3272
12.6 Fit a linear regression to the number of telephones in use in South Africa, 1968
to 1979. Test for significant correlation. Forecast the number of telephones for 1984,
and find a 90% prediction interval for this forecast.

Year (x)                              1968   1969   1970   1971   1972   1973   1974
Number of telephones (millions) (y)   1.24   1.31   1.51   1.59   1.66   1.75   1.86
12.7 The number of defective items produced per unit of time, y, by a certain machine
is thought to vary directly with the speed of the machine, x, measured in 1000s
of revolutions per minute. Observations for 12 time periods yielded results which
are summarized below:
$$\sum x = 60 \qquad \sum x^2 = 504 \qquad \sum xy = 1400$$
$$\sum y = 200 \qquad \sum y^2 = 4400$$
(a) Fit a linear regression line to the data.
(b) Calculate the correlation coefficient, and test it for significance at the 5%
level.
(c) Calculate the residual standard deviation.
(d) Find a 95% prediction interval for the number of defective items when the
machine is running at firstly 2000, and secondly 6000 revolutions per minute.
12.8
12.9 The following data is obtained to reveal the relationship between the annual production of pig iron (x) and annual numbers of pigs slaughtered (y)
Year
x : Production of pig iron
(millions of metric tons)
y : Number of pigs
slaughtered (10 000s)
A
8
B
7
C
5
D
6
E
9
F
8
G
11
H
9
I
10
J
12
16
12
10
20
14
18
17
21
20
12.10 Daily turnover (y) and price (x) of a product were measured at each of 10 retail
outlets. The data was combined and summarized and the least squares regression
line was found to be $y = 8.7754 - 0.211x$. The original data was lost and only
the following totals remain:
$$\sum x = 55 \qquad \sum x^2 = 385 \qquad \sum y^2 = 853.27$$
(a) Find the correlation coefficient r between x and y and test for significance
at a 1% significance level.
(b) Find the residual standard deviation.
12.11 An agricultural company is testing the theory that, for a limited range of values,
plant growth (y) and fertilizer (x) are related by an equation of the form $y = ax^b$.
Plant growth and fertilizer values were recorded for a sample of 10 plants and after
some calculation the following results were obtained:
$$n = 10 \qquad \sum x = 54 \qquad \sum y = 1427 \qquad \sum xy = 10\,677$$
$$\sum (\log x) = 6.5 \qquad \sum (\log y) = 20.0 \qquad \sum (\log x \log y) = 14.2$$
$$\sum (\log x)^2 = 5.2 \qquad \sum (\log y)^2 = 41.7 \qquad \sum x^2 = 366$$
$$\sum y^2 = 301\,903$$
(a) Use these results to fit the equation to this data.
(b) Use your equation to predict plant growth when x = 10.
12.12

P   1.35    1.45    1.50    1.55    1.70    1.90    2.05    2.10
Q   10.35   11.25   11.95   12.55   14.10   17.25   19.85   20.30

(a) Calculate the values of the constants a and b that best fit the data. You will
need some of the following:
$$\sum P = 13.60 \qquad \sum Q = 117.60 \qquad \sum P^2 = 23.68$$
$$\sum Q^2 = 1836.48 \qquad \sum PQ = 207.73 \qquad \sum \log Q = 9.24$$
$$\sum (\log Q)^2 = 10.77 \qquad \sum P \log Q = 15.94$$
12.13 Given a set of data points (xi , yi ), and that the slopes of the regression lines
for predicting y from x and x from y are b1 and b2 respectively, show that the
correlation coefficient is given by
$$r = \sqrt{b_1 b_2}$$
1
.
1 + aebx
(Hint: consider 1 y , then take logarithms. Note also that 0 < y < 1,
and that y is interpreted as the proportion of the asymptote [the final value]
grown at time x. The curve is shaped like an S squashed to the right!)
12.15 Shown below is a partial computer output from a regression analysis of the form
y = b0 + b1 x1 + b2 x2 + b3 x3 .

Variable     Estimated     Estimated standard
             coefficient   deviation
x1           0.044         0.011
x2           1.271         0.418
x3           0.619         0.212
Intercept    0.646
ANOVA table
(a)
(b)
(c)
(d)

Source       Sum of         Degrees of     Mean Square
             Squares (SS)   Freedom (DF)   (MS)
Regression   22.012         3              C
Error        A              B              0.331
Total        24.000
Variable
Estimated
standard
deviation
0.187
0.061
ANOVA table
Source
Sum of
Squares (SS)
Degrees of
Freedom (DF)
Mean Square
(MS)
Regression
Error
7.646
B
C
7
D
F
Total
$R^2 = 0.8364$
(a) Interpret the implications of the signs of b1 and b2 for job satisfaction.
(b) Complete the entries A to F in the ANOVA table. What was the sample
size?
(c) Test the overall regression for significance at the 1% level.
(d) Compute the appropriate t-ratios, and test the explanatory variables individually
for significance.
12.17 In an analysis to predict sales for a chain of stores, the planning department
gathers data for a sample of stores. They take account of the number of competitors
in a 2 km radius (x1), the number of parking bays within a 100 m radius (x2),
and whether or not there is an automatic cash-dispensing machine on the premises
(x3), where x3 = 1 if a cash-dispensing machine is present, and x3 = 0 if it is not.
They estimate the multiple regression model
$$y = 11.1 - 3.2x_1 + 0.4x_2 + 8.5x_3,$$
where y is daily sales in R1000s.
(a) What does the model suggest is the effect of having an automatic cash-dispensing
machine on the premises?
(b) Estimate sales for a store which will be opened in an area with
(i) 3 competitors within 2 km, 56 parking bays within 100 m and a cash-dispensing
machine;
(ii) 1 competitor, 35 parking bays and no cash-dispensing machine.
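
A fitted model with a dummy variable is evaluated like any other regression equation, with the dummy set to 0 or 1. The sketch below uses the model of exercise 12.17 and assumes that the coefficient of x1 is negative (more competitors, lower sales); the function name `predicted_sales` is hypothetical.

```python
def predicted_sales(competitors, parking_bays, has_cash_machine):
    """Daily sales in R1000s from y = 11.1 - 3.2*x1 + 0.4*x2 + 8.5*x3.

    The minus sign on the competitors term is an assumption about the model.
    """
    x3 = 1 if has_cash_machine else 0  # dummy variable for the cash-dispensing machine
    return 11.1 - 3.2 * competitors + 0.4 * parking_bays + 8.5 * x3


store_i = predicted_sales(3, 56, True)
store_ii = predicted_sales(1, 35, False)

# The coefficient of the dummy variable is the estimated effect of the machine,
# holding the other explanatory variables constant.
machine_effect = predicted_sales(3, 56, True) - predicted_sales(3, 56, False)

print(f"(i)  {store_i:.1f}")
print(f"(ii) {store_ii:.1f}")
print(f"effect of machine: {machine_effect:.1f}")
```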
Solutions to exercises . . .
12.1 (a) r = 0.84 > 0.6319, reject $H_0$.
(b) $y = -24.58 + 0.93x$.
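
The figures for exercise 12.1 can be checked numerically. This is a sketch only: it assumes that the first ten marks listed in the exercise are the x values and the next ten are the y values, and it uses the usual sums of squares and products about the means.

```python
from math import sqrt

# Marks from exercise 12.1 (assumed pairing: i-th x with i-th y).
x = [65, 78, 52, 82, 92, 89, 73, 98, 56, 76]  # first class test
y = [39, 43, 21, 64, 57, 47, 27, 75, 34, 52]  # second class test

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and products about the means.
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sxy / sxx                # slope of the regression of y on x
a = y_bar - b * x_bar        # intercept
r = sxy / sqrt(sxx * syy)    # correlation coefficient

print(f"r = {r:.2f}, y = {a:.2f} + {b:.2f} x")
print(f"predicted second mark for x = 70: {a + b * 70:.1f}")
```

Under this pairing the intercept comes out negative, which is why the regression line is written with a minus sign.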
12.6 y = 1.214+0.116 x, taking x as years since 1968. In 1984, x = 16, y = 3.06 million
telephones. The 90% prediction interval is (2.94, 3.19). r = 0.994, P < 0.001,
very highly significant.
(d) There is not a cause and effect relationship between the variables. The relationship between the variables is caused by a third variable, possibly either population or standard of living.
12.10 (a) $r = -0.1159 > -0.7646$, cannot reject $H_0$.
(b) 5.81.
y = 56.5.
0.4143P
= 2.823 2.596P
BINOMIAL DISTRIBUTION
Discrete
We have n independent trials, each trial has two outcomes, success or failure,
and Pr[success] = p for all trials. The random variable X is the number of
successes in n trials; $n \geq 1$ must be an integer, and $0 \leq p \leq 1$. Then X has
the binomial distribution, i.e. $X \sim B(n, p)$, with probability mass function
$$p(x) = \begin{cases} \dbinom{n}{x} p^x (1-p)^{n-x} & x = 0, 1, \ldots, n \\ 0 & \text{otherwise} \end{cases}$$
$$E[X] = np \qquad \mathrm{Var}[X] = np(1-p)$$
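
The binomial summary can be checked numerically by evaluating the probability mass function over its whole range. This is a sketch; the values n = 10 and p = 0.3 are illustrative and are not taken from the text.

```python
from math import comb


def binomial_pmf(x, n, p):
    """p(x) = C(n, x) p^x (1 - p)^(n - x) for x = 0, 1, ..., n, and 0 otherwise."""
    if not (isinstance(x, int) and 0 <= x <= n):
        return 0.0
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 10, 0.3  # illustrative parameter values
probabilities = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(probabilities)
mean = sum(x * px for x, px in enumerate(probabilities))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(probabilities))

print(f"sum of probabilities = {total:.6f}")
print(f"mean = {mean:.6f}      (np = {n * p})")
print(f"variance = {variance:.6f}  (np(1 - p) = {n * p * (1 - p)})")
```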
POISSON DISTRIBUTION
Discrete
Events occur at random in time, with an average rate of $\lambda$ events per time
period (or space). The random variable X is a count of the number of events
occurring during a fixed interval of time (or space). The time period (or amount
of space) referred to in the rate must be the same as the time period (or space) in
which events are counted. Then X has the Poisson distribution with parameter
$\lambda > 0$, i.e. $X \sim P(\lambda)$, and has probability mass function
$$p(x) = \begin{cases} \dfrac{e^{-\lambda} \lambda^x}{x!} & x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases}$$
$$E[X] = \lambda \qquad \mathrm{Var}[X] = \lambda$$
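
The Poisson mean and variance can be checked in the same numerical way. In this sketch the rate 2.5 is an illustrative value, and the infinite sum is truncated at 60 terms, beyond which the probabilities are negligible for this rate.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """p(x) = exp(-lam) * lam^x / x! for x = 0, 1, 2, ..., and 0 otherwise."""
    if not (isinstance(x, int) and x >= 0):
        return 0.0
    return exp(-lam) * lam**x / factorial(x)


lam = 2.5  # illustrative rate of events per time period
probabilities = [poisson_pmf(x, lam) for x in range(60)]  # truncated sum

total = sum(probabilities)
mean = sum(x * px for x, px in enumerate(probabilities))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(probabilities))

print(f"sum = {total:.6f}, mean = {mean:.6f}, variance = {variance:.6f}")
```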
EXPONENTIAL DISTRIBUTION
Continuous
As for the Poisson distribution, events occur at random with average rate $\lambda$ per
unit of time (or space). Let the continuous random variable X be the interval
between two events. X has the exponential distribution with parameter $\lambda$,
i.e. $X \sim E(\lambda)$, with probability density function
$$f(x) = \begin{cases} \lambda e^{-\lambda x} & x \geq 0 \\ 0 & \text{otherwise} \end{cases}$$
$$E[X] = \frac{1}{\lambda} \qquad \mathrm{Var}[X] = \frac{1}{\lambda^2}$$
NORMAL DISTRIBUTION
Continuous
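The normal distribution with parameters $\mu$ and $\sigma^2$, i.e. $X \sim N(\mu, \sigma^2)$, has
probability density function
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \qquad -\infty < x < \infty$$
$$E[X] = \mu \qquad \mathrm{Var}[X] = \sigma^2$$
The standard normal distribution, with $\mu = 0$ and $\sigma^2 = 1$, has probability density function
$$f(x) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2}x^2} \qquad -\infty < x < \infty$$
$$E[X] = 0 \qquad \mathrm{Var}[X] = 1$$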
NEGATIVE BINOMIAL DISTRIBUTION
Discrete
As for the binomial distribution, we have independent trials, each with two
outcomes, success or failure; Pr[success] = p for each trial. Fix the number of
successes r, and let the random variable X be the number of failures obtained
before the r th success. Then X has the negative binomial distribution with
parameters r and p, i.e. $X \sim NB(r, p)$, with probability mass function
$$p(x) = \begin{cases} \dbinom{x+r-1}{x} p^r q^x & x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases}$$
where $q = 1 - p$.
$$E[X] = \frac{r(1-p)}{p} \qquad \mathrm{Var}[X] = \frac{r(1-p)}{p^2}$$
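
The negative binomial formulae can be checked numerically, and setting r = 1 recovers the geometric probabilities $pq^x$. This is a sketch with illustrative parameter values (r = 3, p = 0.4); the infinite sum is truncated at 400 terms.

```python
from math import comb


def negative_binomial_pmf(x, r, p):
    """p(x) = C(x + r - 1, x) p^r q^x with q = 1 - p, for x = 0, 1, 2, ..."""
    if not (isinstance(x, int) and x >= 0):
        return 0.0
    q = 1 - p
    return comb(x + r - 1, x) * p**r * q**x


r, p = 3, 0.4  # illustrative parameter values
probabilities = [negative_binomial_pmf(x, r, p) for x in range(400)]  # truncated sum

total = sum(probabilities)
mean = sum(x * px for x, px in enumerate(probabilities))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(probabilities))

print(f"sum = {total:.6f}")
print(f"mean = {mean:.4f}      (r(1 - p)/p = {r * (1 - p) / p:.4f})")
print(f"variance = {variance:.4f}  (r(1 - p)/p^2 = {r * (1 - p) / p**2:.4f})")

# With r = 1 the probabilities reduce to the geometric form p * q^x.
geometric_check = [negative_binomial_pmf(x, 1, p) - p * (1 - p) ** x for x in range(10)]
```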
GEOMETRIC DISTRIBUTION
Discrete
Under the same conditions as for the negative binomial distribution, let the
random variable X be the number of trials before the first success. Then X has
the geometric distribution with parameter p, $X \sim G(p)$, and has probability
mass function
$$p(x) = \begin{cases} p q^x & x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases}$$
$$E[X] = \frac{1-p}{p} \qquad \mathrm{Var}[X] = \frac{1-p}{p^2}$$
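HYPERGEOMETRIC DISTRIBUTION
Discrete
$$\mathrm{Var}[X] = n \, \frac{M}{N} \left(1 - \frac{M}{N}\right) \frac{N-n}{N-1}$$
UNIFORM DISTRIBUTION
Continuous
$$f(x) = \begin{cases} \dfrac{1}{b-a} & a \leq x \leq b \\ 0 & \text{otherwise} \end{cases}$$
$$E[X] = \frac{a+b}{2} \qquad \mathrm{Var}[X] = \frac{(b-a)^2}{12}$$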
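THE t, F and $\chi^2$ DISTRIBUTIONS
Continuous
For completeness' sake, we state the probability density functions of these three
distributions. The t-distribution with parameter n, the degrees of freedom, is
a continuous random variable with probability density function
$$f(x) = \frac{1}{\sqrt{n\pi}} \, \frac{\Gamma[(n+1)/2]}{\Gamma(n/2)} \left(1 + \frac{x^2}{n}\right)^{-(n+1)/2} \qquad -\infty < x < \infty$$
$$E[X] = 0 \qquad \mathrm{Var}[X] = \frac{n}{n-2}$$
The F-distribution with parameters $n_1$ and $n_2$, the degrees of freedom for the
numerator and denominator respectively, is a continuous random variable with
probability density function
$$f(x) = \begin{cases} \dfrac{\Gamma[(n_1+n_2)/2]}{\Gamma(n_1/2)\,\Gamma(n_2/2)} \left(\dfrac{n_1}{n_2}\right)^{n_1/2} \dfrac{x^{(n_1/2)-1}}{\left(1 + \frac{n_1}{n_2}x\right)^{(n_1+n_2)/2}} & 0 < x < \infty \\ 0 & \text{otherwise} \end{cases}$$
$$E[X] = \frac{n_2}{n_2-2} \qquad \mathrm{Var}[X] = \frac{2n_2^2(n_1+n_2-2)}{n_1(n_2-2)^2(n_2-4)}$$
The $\chi^2$-distribution with parameter n, the degrees of freedom, is a continuous random
variable with probability density function
$$f(x) = \begin{cases} \dfrac{1}{2^{n/2}\,\Gamma(n/2)} \, x^{(n/2)-1} e^{-x/2} & 0 < x < \infty \\ 0 & \text{otherwise} \end{cases}$$
$$E[X] = n \qquad \mathrm{Var}[X] = 2n$$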
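The gamma function $\Gamma$ satisfies $\Gamma(n) = (n-1)!$ when n is a positive integer, and, if n
is an integer, then
$$\Gamma\left(n + \frac{1}{2}\right) = \frac{1 \cdot 3 \cdot 5 \cdots (2n-1)}{2^n} \sqrt{\pi}$$
so that
$$\Gamma(4) = 3! = 6$$
and
$$\Gamma(2.5) = \frac{1 \cdot 3}{2^2} \sqrt{\pi} = 1.329$$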
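
The two gamma function values can be confirmed with the standard library. This sketch compares the half-integer formula with Python's `math.gamma`; the helper name `gamma_half_integer` is hypothetical.

```python
from math import factorial, gamma, pi, sqrt


def gamma_half_integer(n):
    """Gamma(n + 1/2) = 1 * 3 * 5 * ... * (2n - 1) * sqrt(pi) / 2^n for n = 0, 1, 2, ..."""
    odd_product = 1
    for k in range(1, 2 * n, 2):  # the odd numbers 1, 3, ..., 2n - 1
        odd_product *= k
    return odd_product * sqrt(pi) / 2**n


# Integer argument: Gamma(4) = 3!
print(gamma(4), factorial(3))

# Half-integer argument: Gamma(2.5) corresponds to n = 2
print(round(gamma_half_integer(2), 3), round(gamma(2.5), 3))
```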
TABLES
Table 1
t-distribution
Chi-squared distribution
4.1
4.2
4.3
4.4
Correlation coefficient
0.00
0.0000
0.0398
0.0793
0.1179
0.1554
0.1915
0.2257
0.2580
0.2881
0.3159
0.3413
0.3643
0.3849
0.4032
0.4192
0.4332
0.4452
0.4554
0.4641
0.4713
0.4772
0.4821
0.4861
0.4893
0.49180
0.49379
0.49534
0.49653
0.49744
0.49813
0.49865
0.49903
0.49931
0.49952
0.49966
0.49977
0.49984
0.49989
0.49993
0.49995
0.49997
0.01
0.0040
0.0438
0.0832
0.1217
0.1591
0.1950
0.2291
0.2611
0.2910
0.3186
0.3438
0.3665
0.3869
0.4049
0.4207
0.4345
0.4463
0.4564
0.4649
0.4719
0.4778
0.4826
0.4864
0.4896
0.49202
0.49396
0.49547
0.49664
0.49752
0.49819
0.49869
0.49906
0.49934
0.49953
0.49968
0.49978
0.49985
0.49990
0.49993
0.49995
0.49997
0.02
0.0080
0.0478
0.0871
0.1255
0.1628
0.1985
0.2324
0.2642
0.2939
0.3212
0.3461
0.3686
0.3888
0.4066
0.4222
0.4357
0.4474
0.4573
0.4656
0.4726
0.4783
0.4830
0.4868
0.4898
0.49224
0.49413
0.49560
0.49674
0.49760
0.49825
0.49874
0.49910
0.49936
0.49955
0.49969
0.49978
0.49985
0.49990
0.49993
0.49996
0.49997
0.03
0.0120
0.0517
0.0910
0.1293
0.1664
0.2019
0.2357
0.2673
0.2967
0.3238
0.3485
0.3708
0.3907
0.4082
0.4236
0.4370
0.4484
0.4582
0.4664
0.4732
0.4788
0.4834
0.4871
0.4901
0.49245
0.49430
0.49573
0.49683
0.49767
0.49831
0.49878
0.49913
0.49938
0.49957
0.49970
0.49979
0.49986
0.49990
0.49994
0.49996
0.49997
0.04
0.0160
0.0557
0.0948
0.1331
0.1700
0.2054
0.2389
0.2704
0.2995
0.3264
0.3508
0.3729
0.3925
0.4099
0.4251
0.4382
0.4495
0.4591
0.4671
0.4738
0.4793
0.4838
0.4875
0.4904
0.49266
0.49446
0.49585
0.49693
0.49774
0.49836
0.49882
0.49916
0.49940
0.49958
0.49971
0.49980
0.49986
0.49991
0.49994
0.49996
0.49997
0.05
0.0199
0.0596
0.0987
0.1368
0.1736
0.2088
0.2422
0.2734
0.3023
0.3289
0.3531
0.3749
0.3944
0.4115
0.4265
0.4394
0.4505
0.4599
0.4678
0.4744
0.4798
0.4842
0.4878
0.4906
0.49286
0.49461
0.49598
0.49702
0.49781
0.49841
0.49886
0.49918
0.49942
0.49960
0.49972
0.49981
0.49987
0.49991
0.49994
0.49996
0.49997
0.06
0.0239
0.0636
0.1026
0.1406
0.1772
0.2123
0.2454
0.2764
0.3051
0.3315
0.3554
0.3770
0.3962
0.4131
0.4279
0.4406
0.4515
0.4608
0.4686
0.4750
0.4803
0.4846
0.4881
0.4909
0.49305
0.49477
0.49609
0.49711
0.49788
0.49846
0.49889
0.49921
0.49944
0.49961
0.49973
0.49981
0.49987
0.49992
0.49994
0.49996
0.49998
0.07
0.0279
0.0675
0.1064
0.1443
0.1808
0.2157
0.2486
0.2794
0.3078
0.3340
0.3577
0.3790
0.3980
0.4147
0.4292
0.4418
0.4525
0.4616
0.4693
0.4756
0.4808
0.4850
0.4884
0.4911
0.49324
0.49492
0.49621
0.49720
0.49795
0.49851
0.49893
0.49924
0.49946
0.49962
0.49974
0.49982
0.49988
0.49992
0.49995
0.49996
0.49998
0.08
0.0319
0.0714
0.1103
0.1480
0.1844
0.2190
0.2517
0.2823
0.3106
0.3365
0.3599
0.3810
0.3997
0.4162
0.4306
0.4429
0.4535
0.4625
0.4699
0.4761
0.4812
0.4854
0.4887
0.4913
0.49343
0.49506
0.49632
0.49728
0.49801
0.49856
0.49896
0.49926
0.49948
0.49964
0.49975
0.49983
0.49988
0.49992
0.49995
0.49997
0.49998
0.09
0.0359
0.0753
0.1141
0.1517
0.1879
0.2224
0.2549
0.2852
0.3133
0.3389
0.3621
0.3830
0.4015
0.4177
0.4319
0.4441
0.4545
0.4633
0.4706
0.4767
0.4817
0.4857
0.4890
0.4916
0.49361
0.49520
0.49643
0.49736
0.49807
0.49861
0.49900
0.49929
0.49950
0.49965
0.49976
0.49983
0.49989
0.49992
0.49995
0.49997
0.49998
TABLE 2. t-DISTRIBUTION: One sided critical values, i.e. the value of $t_n^{(P)}$ such that
$P = \Pr[t_n > t_n^{(P)}]$, where n is the degrees of freedom
n
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
45
50
60
70
80
90
100
110
120
140
160
180
200
z
0.4
0.325
0.289
0.277
0.271
0.267
0.265
0.263
0.262
0.261
0.260
0.260
0.259
0.259
0.258
0.258
0.258
0.257
0.257
0.257
0.257
0.257
0.256
0.256
0.256
0.256
0.256
0.256
0.256
0.256
0.256
0.256
0.255
0.255
0.255
0.255
0.255
0.255
0.255
0.255
0.255
0.255
0.255
0.254
0.254
0.254
0.254
0.254
0.254
0.254
0.254
0.254
0.254
0.254
0.253
0.3
0.727
0.617
0.584
0.569
0.559
0.553
0.549
0.546
0.543
0.542
0.540
0.539
0.538
0.537
0.536
0.535
0.534
0.534
0.533
0.533
0.532
0.532
0.532
0.531
0.531
0.531
0.531
0.530
0.530
0.530
0.530
0.530
0.530
0.529
0.529
0.529
0.529
0.529
0.529
0.529
0.528
0.528
0.527
0.527
0.526
0.526
0.526
0.526
0.526
0.526
0.525
0.525
0.525
0.524
0.2
1.376
1.061
0.978
0.941
0.920
0.906
0.896
0.889
0.883
0.879
0.876
0.873
0.870
0.868
0.866
0.865
0.863
0.862
0.861
0.860
0.859
0.858
0.858
0.857
0.856
0.856
0.855
0.855
0.854
0.854
0.853
0.853
0.853
0.852
0.852
0.852
0.851
0.851
0.851
0.851
0.850
0.849
0.848
0.847
0.846
0.846
0.845
0.845
0.845
0.844
0.844
0.844
0.843
0.842
0.1
3.078
1.886
1.638
1.533
1.476
1.440
1.415
1.397
1.383
1.372
1.363
1.356
1.350
1.345
1.341
1.337
1.333
1.330
1.328
1.325
1.323
1.321
1.319
1.318
1.316
1.315
1.314
1.313
1.311
1.310
1.309
1.309
1.308
1.307
1.306
1.306
1.305
1.304
1.304
1.303
1.301
1.299
1.296
1.294
1.292
1.291
1.290
1.289
1.289
1.288
1.287
1.286
1.286
1.282
Probability Level P
0.05 0.025
0.01
6.314 12.71 31.82
2.920 4.303 6.965
2.353 3.182 4.541
2.132 2.776 3.747
2.015 2.571 3.365
1.943 2.447 3.143
1.895 2.365 2.998
1.860 2.306 2.896
1.833 2.262 2.821
1.812 2.228 2.764
1.796 2.201 2.718
1.782 2.179 2.681
1.771 2.160 2.650
1.761 2.145 2.624
1.753 2.131 2.602
1.746 2.120 2.583
1.740 2.110 2.567
1.734 2.101 2.552
1.729 2.093 2.539
1.725 2.086 2.528
1.721 2.080 2.518
1.717 2.074 2.508
1.714 2.069 2.500
1.711 2.064 2.492
1.708 2.060 2.485
1.706 2.056 2.479
1.703 2.052 2.473
1.701 2.048 2.467
1.699 2.045 2.462
1.697 2.042 2.457
1.696 2.040 2.453
1.694 2.037 2.449
1.692 2.035 2.445
1.691 2.032 2.441
1.690 2.030 2.438
1.688 2.028 2.434
1.687 2.026 2.431
1.686 2.024 2.429
1.685 2.023 2.426
1.684 2.021 2.423
1.679 2.014 2.412
1.676 2.009 2.403
1.671 2.000 2.390
1.667 1.994 2.381
1.664 1.990 2.374
1.662 1.987 2.368
1.660 1.984 2.364
1.659 1.982 2.361
1.658 1.980 2.358
1.656 1.977 2.353
1.654 1.975 2.350
1.653 1.973 2.347
1.653 1.972 2.345
1.645 1.960 2.326
0.005
63.66
9.925
5.841
4.604
4.032
3.707
3.499
3.355
3.250
3.169
3.106
3.055
3.012
2.977
2.947
2.921
2.898
2.878
2.861
2.845
2.831
2.819
2.807
2.797
2.787
2.779
2.771
2.763
2.756
2.750
2.744
2.738
2.733
2.728
2.724
2.719
2.715
2.712
2.708
2.704
2.690
2.678
2.660
2.648
2.639
2.632
2.626
2.621
2.617
2.611
2.607
2.603
2.601
2.576
0.0025
127.3
14.09
7.453
5.598
4.773
4.317
4.029
3.833
3.690
3.581
3.497
3.428
3.372
3.326
3.286
3.252
3.222
3.197
3.174
3.153
3.135
3.119
3.104
3.091
3.078
3.067
3.057
3.047
3.038
3.030
3.022
3.015
3.008
3.002
2.996
2.990
2.985
2.980
2.976
2.971
2.952
2.937
2.915
2.899
2.887
2.878
2.871
2.865
2.860
2.852
2.847
2.842
2.838
2.807
0.001
318.3
22.33
10.21
7.173
5.894
5.208
4.785
4.501
4.297
4.144
4.025
3.930
3.852
3.787
3.733
3.686
3.646
3.610
3.579
3.552
3.527
3.505
3.485
3.467
3.450
3.435
3.421
3.408
3.396
3.385
3.375
3.365
3.356
3.348
3.340
3.333
3.326
3.319
3.313
3.307
3.281
3.261
3.232
3.211
3.195
3.183
3.174
3.166
3.160
3.149
3.142
3.136
3.131
3.090
0.0005
636.6
31.60
12.92
8.610
6.869
5.959
5.408
5.041
4.781
4.587
4.437
4.318
4.221
4.140
4.073
4.015
3.965
3.922
3.883
3.850
3.819
3.792
3.768
3.745
3.725
3.707
3.689
3.674
3.660
3.646
3.633
3.622
3.611
3.601
3.591
3.582
3.574
3.566
3.558
3.551
3.520
3.496
3.460
3.435
3.416
3.402
3.390
3.381
3.373
3.361
3.352
3.345
3.340
3.291
TABLE 3. CHI-SQUARED DISTRIBUTION: One sided critical values, i.e. the value
of $\chi_n^{2(P)}$ such that $P = \Pr[\chi_n^2 > \chi_n^{2(P)}]$, where n is the degrees of freedom, for P > 0.5
n
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
45
50
60
70
80
90
100
110
120
140
160
180
200
0.9995
0.000
0.001
0.015
0.064
0.158
0.299
0.485
0.710
0.972
1.265
1.587
1.935
2.305
2.697
3.107
3.536
3.980
4.439
4.913
5.398
5.895
6.404
6.924
7.453
7.991
8.537
9.093
9.656
10.227
10.804
11.388
11.980
12.576
13.180
13.788
14.401
15.021
15.644
16.272
16.906
20.136
23.461
30.339
37.467
44.792
52.277
59.895
67.631
75.465
91.389
107.599
124.032
140.659
0.999
0.000
0.002
0.024
0.091
0.210
0.381
0.599
0.857
1.152
1.479
1.834
2.214
2.617
3.041
3.483
3.942
4.416
4.905
5.407
5.921
6.447
6.983
7.529
8.085
8.649
9.222
9.803
10.391
10.986
11.588
12.196
12.810
13.431
14.057
14.688
15.324
15.965
16.611
17.261
17.917
21.251
24.674
31.738
39.036
46.520
54.156
61.918
69.790
77.756
93.925
110.359
127.011
143.842
0.9975
0.000
0.005
0.045
0.145
0.307
0.527
0.794
1.104
1.450
1.827
2.232
2.661
3.112
3.582
4.070
4.573
5.092
5.623
6.167
6.723
7.289
7.865
8.450
9.044
9.646
10.256
10.873
11.497
12.128
12.765
13.407
14.055
14.709
15.368
16.032
16.700
17.373
18.050
18.732
19.417
22.899
26.464
33.791
41.332
49.043
56.892
64.857
72.922
81.073
97.591
114.350
131.305
148.426
0.995
0.000
0.010
0.072
0.207
0.412
0.676
0.989
1.344
1.735
2.156
2.603
3.074
3.565
4.075
4.601
5.142
5.697
6.265
6.844
7.434
8.034
8.643
9.260
9.886
10.520
11.160
11.808
12.461
13.121
13.787
14.458
15.134
15.815
16.501
17.192
17.887
18.586
19.289
19.996
20.707
24.311
27.991
35.534
43.275
51.172
59.196
67.328
75.550
83.852
100.655
117.679
134.884
152.241
Probability Level P
0.99
0.975
0.000
0.001
0.020
0.051
0.115
0.216
0.297
0.484
0.554
0.831
0.872
1.237
1.239
1.690
1.647
2.180
2.088
2.700
2.558
3.247
3.053
3.816
3.571
4.404
4.107
5.009
4.660
5.629
5.229
6.262
5.812
6.908
6.408
7.564
7.015
8.231
7.633
8.907
8.260
9.591
8.897
10.283
9.542
10.982
10.196
11.689
10.856
12.401
11.524
13.120
12.198
13.844
12.878
14.573
13.565
15.308
14.256
16.047
14.953
16.791
15.655
17.539
16.362
18.291
17.073
19.047
17.789
19.806
18.509
20.569
19.233
21.336
19.960
22.106
20.691
22.878
21.426
23.654
22.164
24.433
25.901
28.366
29.707
32.357
37.485
40.482
45.442
48.758
53.540
57.153
61.754
65.647
70.065
74.222
78.458
82.867
86.923
91.573
104.034 109.137
121.346 126.870
138.821 144.741
156.432 162.728
0.95
0.004
0.103
0.352
0.711
1.145
1.635
2.167
2.733
3.325
3.940
4.575
5.226
5.892
6.571
7.261
7.962
8.672
9.390
10.117
10.851
11.591
12.338
13.091
13.848
14.611
15.379
16.151
16.928
17.708
18.493
19.281
20.072
20.867
21.664
22.465
23.269
24.075
24.884
25.695
26.509
30.612
34.764
43.188
51.739
60.391
69.126
77.929
86.792
95.705
113.659
131.756
149.969
168.279
0.9
0.016
0.211
0.584
1.064
1.610
2.204
2.833
3.490
4.168
4.865
5.578
6.304
7.041
7.790
8.547
9.312
10.085
10.865
11.651
12.443
13.240
14.041
14.848
15.659
16.473
17.292
18.114
18.939
19.768
20.599
21.434
22.271
23.110
23.952
24.797
25.643
26.492
27.343
28.196
29.051
33.350
37.689
46.459
55.329
64.278
73.291
82.358
91.471
100.624
119.029
137.546
156.153
174.835
0.8
0.064
0.446
1.005
1.649
2.343
3.070
3.822
4.594
5.380
6.179
6.989
7.807
8.634
9.467
10.307
11.152
12.002
12.857
13.716
14.578
15.445
16.314
17.187
18.062
18.940
19.820
20.703
21.588
22.475
23.364
24.255
25.148
26.042
26.938
27.836
28.735
29.635
30.537
31.441
32.345
36.884
41.449
50.641
59.898
69.207
78.558
87.945
97.362
106.806
125.758
144.783
163.868
183.003
0.6
0.275
1.022
1.869
2.753
3.656
4.570
5.493
6.423
7.357
8.295
9.237
10.182
11.129
12.078
13.030
13.983
14.937
15.893
16.850
17.809
18.768
19.729
20.690
21.652
22.616
23.579
24.544
25.509
26.475
27.442
28.409
29.376
30.344
31.313
32.282
33.252
34.222
35.192
36.163
37.134
41.995
46.864
56.620
66.396
76.188
85.993
95.808
105.632
115.465
135.149
154.856
174.580
194.319
n
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
45
50
60
70
80
90
100
110
120
140
160
180
200
0.4
0.708
1.833
2.946
4.045
5.132
6.211
7.283
8.351
9.414
10.473
11.530
12.584
13.636
14.685
15.733
16.780
17.824
18.868
19.910
20.951
21.992
23.031
24.069
25.106
26.143
27.179
28.214
29.249
30.283
31.316
32.349
33.381
34.413
35.444
36.475
37.505
38.535
39.564
40.593
41.622
46.761
51.892
62.135
72.358
82.566
92.761
102.946
113.121
123.289
143.604
163.898
184.173
204.434
0.2
1.642
3.219
4.642
5.989
7.289
8.558
9.803
11.030
12.242
13.442
14.631
15.812
16.985
18.151
19.311
20.465
21.615
22.760
23.900
25.038
26.171
27.301
28.429
29.553
30.675
31.795
32.912
34.027
35.139
36.250
37.359
38.466
39.572
40.676
41.778
42.879
43.978
45.076
46.173
47.269
52.729
58.164
68.972
79.715
90.405
101.054
111.667
122.250
132.806
153.854
174.828
195.743
216.609
0.1
2.706
4.605
6.251
7.779
9.236
10.645
12.017
13.362
14.684
15.987
17.275
18.549
19.812
21.064
22.307
23.542
24.769
25.989
27.204
28.412
29.615
30.813
32.007
33.196
34.382
35.563
36.741
37.916
39.087
40.256
41.422
42.585
43.745
44.903
46.059
47.212
48.363
49.513
50.660
51.805
57.505
63.167
74.397
85.527
96.578
107.565
118.498
129.385
140.233
161.827
183.311
204.704
226.021
0.05
3.841
5.991
7.815
9.488
11.070
12.592
14.067
15.507
16.919
18.307
19.675
21.026
22.362
23.685
24.996
26.296
27.587
28.869
30.144
31.410
32.671
33.924
35.172
36.415
37.652
38.885
40.113
41.337
42.557
43.773
44.985
46.194
47.400
48.602
49.802
50.998
52.192
53.384
54.572
55.758
61.656
67.505
79.082
90.531
101.879
113.145
124.342
135.480
146.567
168.613
190.516
212.304
233.994
Probability Level P
0.025
0.01
5.024
6.635
7.378
9.210
9.348
11.345
11.143
13.277
12.832
15.086
14.449
16.812
16.013
18.475
17.535
20.090
19.023
21.666
20.483
23.209
21.920
24.725
23.337
26.217
24.736
27.688
26.119
29.141
27.488
30.578
28.845
32.000
30.191
33.409
31.526
34.805
32.852
36.191
34.170
37.566
35.479
38.932
36.781
40.289
38.076
41.638
39.364
42.980
40.646
44.314
41.923
45.642
43.195
46.963
44.461
48.278
45.722
49.588
46.979
50.892
48.232
52.191
49.480
53.486
50.725
54.775
51.966
56.061
53.203
57.342
54.437
58.619
55.668
59.893
56.895
61.162
58.120
62.428
59.342
63.691
65.410
69.957
71.420
76.154
83.298
88.379
95.023 100.425
106.629 112.329
118.136 124.116
129.561 135.807
140.916 147.414
152.211 158.950
174.648 181.841
196.915 204.530
219.044 227.056
241.058 249.445
0.005
7.879
10.597
12.838
14.860
16.750
18.548
20.278
21.955
23.589
25.188
26.757
28.300
29.819
31.319
32.801
34.267
35.718
37.156
38.582
39.997
41.401
42.796
44.181
45.558
46.928
48.290
49.645
50.994
52.335
53.672
55.002
56.328
57.648
58.964
60.275
61.581
62.883
64.181
65.475
66.766
73.166
79.490
91.952
104.215
116.321
128.299
140.170
151.948
163.648
186.847
209.824
232.620
255.264
0.0025
9.140
11.983
14.320
16.424
18.385
20.249
22.040
23.774
25.463
27.112
28.729
30.318
31.883
33.426
34.949
36.456
37.946
39.422
40.885
42.336
43.775
45.204
46.623
48.034
49.435
50.829
52.215
53.594
54.966
56.332
57.692
59.046
60.395
61.738
63.076
64.410
65.738
67.063
68.383
69.699
76.223
82.664
95.344
107.808
120.102
132.255
144.292
156.230
168.081
191.565
214.808
237.855
260.735
0.001
10.827
13.815
16.266
18.466
20.515
22.457
24.321
26.124
27.877
29.588
31.264
32.909
34.527
36.124
37.698
39.252
40.791
42.312
43.819
45.314
46.796
48.268
49.728
51.179
52.619
54.051
55.475
56.892
58.301
59.702
61.098
62.487
63.869
65.247
66.619
67.985
69.348
70.704
72.055
73.403
80.078
86.660
99.608
112.317
124.839
137.208
149.449
161.582
173.618
197.450
221.020
244.372
267.539
0.0005
12.115
15.201
17.731
19.998
22.106
24.102
26.018
27.867
29.667
31.419
33.138
34.821
36.477
38.109
39.717
41.308
42.881
44.434
45.974
47.498
49.010
50.510
51.999
53.478
54.948
56.407
57.856
59.299
60.734
62.160
63.581
64.993
66.401
67.804
69.197
70.588
71.971
73.350
74.724
76.096
82.873
89.560
102.697
115.577
128.264
140.780
153.164
165.436
177.601
201.680
225.477
249.049
272.422
TABLE 4.1. 5% critical values for the F-DISTRIBUTION, i.e. the value of $F_{NUM,DEN}^{(0.05)}$,
where NUM and DEN are the numerator and denominator degrees of freedom respectively
DEN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
45
50
60
70
80
90
100
110
120
140
160
180
200
1
161.4
18.51
10.13
7.71
6.61
5.99
5.59
5.32
5.12
4.96
4.84
4.75
4.67
4.60
4.54
4.49
4.45
4.41
4.38
4.35
4.32
4.30
4.28
4.26
4.24
4.23
4.21
4.20
4.18
4.17
4.16
4.15
4.14
4.13
4.12
4.11
4.11
4.10
4.09
4.08
4.06
4.03
4.00
3.98
3.96
3.95
3.94
3.93
3.92
3.91
3.90
3.89
3.89
2
199.5
19.00
9.55
6.94
5.79
5.14
4.74
4.46
4.26
4.10
3.98
3.89
3.81
3.74
3.68
3.63
3.59
3.55
3.52
3.49
3.47
3.44
3.42
3.40
3.39
3.37
3.35
3.34
3.33
3.32
3.30
3.29
3.28
3.28
3.27
3.26
3.25
3.24
3.24
3.23
3.20
3.18
3.15
3.13
3.11
3.10
3.09
3.08
3.07
3.06
3.05
3.05
3.04
3
215.7
19.16
9.28
6.59
5.41
4.76
4.35
4.07
3.86
3.71
3.59
3.49
3.41
3.34
3.29
3.24
3.20
3.16
3.13
3.10
3.07
3.05
3.03
3.01
2.99
2.98
2.96
2.95
2.93
2.92
2.91
2.90
2.89
2.88
2.87
2.87
2.86
2.85
2.85
2.84
2.81
2.79
2.76
2.74
2.72
2.71
2.70
2.69
2.68
2.67
2.66
2.65
2.65
4
224.6
19.25
9.12
6.39
5.19
4.53
4.12
3.84
3.63
3.48
3.36
3.26
3.18
3.11
3.06
3.01
2.96
2.93
2.90
2.87
2.84
2.82
2.80
2.78
2.76
2.74
2.73
2.71
2.70
2.69
2.68
2.67
2.66
2.65
2.64
2.63
2.63
2.62
2.61
2.61
2.58
2.56
2.53
2.50
2.49
2.47
2.46
2.45
2.45
2.44
2.43
2.42
2.42
11
243.0
19.40
8.76
5.94
4.70
4.03
3.60
3.31
3.10
2.94
2.82
2.72
2.63
2.57
2.51
2.46
2.41
2.37
2.34
2.31
2.28
2.26
2.24
2.22
2.20
2.18
2.17
2.15
2.14
2.13
2.11
2.10
2.09
2.08
2.07
2.07
2.06
2.05
2.04
2.04
2.01
1.99
1.95
1.93
1.91
1.90
1.89
1.88
1.87
1.86
1.85
1.84
1.84
12
243.9
19.41
8.74
5.91
4.68
4.00
3.57
3.28
3.07
2.91
2.79
2.69
2.60
2.53
2.48
2.42
2.38
2.34
2.31
2.28
2.25
2.23
2.20
2.18
2.16
2.15
2.13
2.12
2.10
2.09
2.08
2.07
2.06
2.05
2.04
2.03
2.02
2.02
2.01
2.00
1.97
1.95
1.92
1.89
1.88
1.86
1.85
1.84
1.83
1.82
1.81
1.81
1.80
13
244.7
19.42
8.73
5.89
4.66
3.98
3.55
3.26
3.05
2.89
2.76
2.66
2.58
2.51
2.45
2.40
2.35
2.31
2.28
2.25
2.22
2.20
2.18
2.15
2.14
2.12
2.10
2.09
2.08
2.06
2.05
2.04
2.03
2.02
2.01
2.00
2.00
1.99
1.98
1.97
1.94
1.92
1.89
1.86
1.84
1.83
1.82
1.81
1.80
1.79
1.78
1.77
1.77
14
245.4
19.42
8.71
5.87
4.64
3.96
3.53
3.24
3.03
2.86
2.74
2.64
2.55
2.48
2.42
2.37
2.33
2.29
2.26
2.22
2.20
2.17
2.15
2.13
2.11
2.09
2.08
2.06
2.05
2.04
2.03
2.01
2.00
1.99
1.99
1.98
1.97
1.96
1.95
1.95
1.92
1.89
1.86
1.84
1.82
1.80
1.79
1.78
1.78
1.76
1.75
1.75
1.74
15
245.9
19.43
8.70
5.86
4.62
3.94
3.51
3.22
3.01
2.85
2.72
2.62
2.53
2.46
2.40
2.35
2.31
2.27
2.23
2.20
2.18
2.15
2.13
2.11
2.09
2.07
2.06
2.04
2.03
2.01
2.00
1.99
1.98
1.97
1.96
1.95
1.95
1.94
1.93
1.92
1.89
1.87
1.84
1.81
1.79
1.78
1.77
1.76
1.75
1.74
1.73
1.72
1.72
16
246.5
19.43
8.69
5.84
4.60
3.92
3.49
3.20
2.99
2.83
2.70
2.60
2.51
2.44
2.38
2.33
2.29
2.25
2.21
2.18
2.16
2.13
2.11
2.09
2.07
2.05
2.04
2.02
2.01
1.99
1.98
1.97
1.96
1.95
1.94
1.93
1.93
1.92
1.91
1.90
1.87
1.85
1.82
1.79
1.77
1.76
1.75
1.74
1.73
1.72
1.71
1.70
1.69
17
246.9
19.44
8.68
5.83
4.59
3.91
3.48
3.19
2.97
2.81
2.69
2.58
2.50
2.43
2.37
2.32
2.27
2.23
2.20
2.17
2.14
2.11
2.09
2.07
2.05
2.03
2.02
2.00
1.99
1.98
1.96
1.95
1.94
1.93
1.92
1.92
1.91
1.90
1.89
1.89
1.86
1.83
1.80
1.77
1.75
1.74
1.73
1.72
1.71
1.70
1.69
1.68
1.67
18
247.3
19.44
8.67
5.82
4.58
3.90
3.47
3.17
2.96
2.80
2.67
2.57
2.48
2.41
2.35
2.30
2.26
2.22
2.18
2.15
2.12
2.10
2.08
2.05
2.04
2.02
2.00
1.99
1.97
1.96
1.95
1.94
1.93
1.92
1.91
1.90
1.89
1.88
1.88
1.87
1.84
1.81
1.78
1.75
1.73
1.72
1.71
1.70
1.69
1.68
1.67
1.66
1.66
40
251.1
19.47
8.59
5.72
4.46
3.77
3.34
3.04
2.83
2.66
2.53
2.43
2.34
2.27
2.20
2.15
2.10
2.06
2.03
1.99
1.96
1.94
1.91
1.89
1.87
1.85
1.84
1.82
1.81
1.79
1.78
1.77
1.76
1.75
1.74
1.73
1.72
1.71
1.70
1.69
1.66
1.63
1.59
1.57
1.54
1.53
1.52
1.50
1.50
1.48
1.47
1.46
1.46
60
252.2
19.48
8.57
5.69
4.43
3.74
3.30
3.01
2.79
2.62
2.49
2.38
2.30
2.22
2.16
2.11
2.06
2.02
1.98
1.95
1.92
1.89
1.86
1.84
1.82
1.80
1.79
1.77
1.75
1.74
1.73
1.71
1.70
1.69
1.68
1.67
1.66
1.65
1.65
1.64
1.60
1.58
1.53
1.50
1.48
1.46
1.45
1.44
1.43
1.41
1.40
1.39
1.39
100
253.0
19.49
8.55
5.66
4.41
3.71
3.27
2.97
2.76
2.59
2.46
2.35
2.26
2.19
2.12
2.07
2.02
1.98
1.94
1.91
1.88
1.85
1.82
1.80
1.78
1.76
1.74
1.73
1.71
1.70
1.68
1.67
1.66
1.65
1.63
1.62
1.62
1.61
1.60
1.59
1.55
1.52
1.48
1.45
1.43
1.41
1.39
1.38
1.37
1.35
1.34
1.33
1.32
200
253.7
19.49
8.54
5.65
4.39
3.69
3.25
2.95
2.73
2.56
2.43
2.32
2.23
2.16
2.10
2.04
1.99
1.95
1.91
1.88
1.84
1.82
1.79
1.77
1.75
1.73
1.71
1.69
1.67
1.66
1.65
1.63
1.62
1.61
1.60
1.59
1.58
1.57
1.56
1.55
1.51
1.48
1.44
1.40
1.38
1.36
1.34
1.33
1.32
1.30
1.28
1.27
1.26
TABLE 4.2. 2.5% critical values for the F-DISTRIBUTION, i.e. the value of $F^{(0.025)}_{\mathrm{NUM},\mathrm{DEN}}$, where NUM and DEN are the numerator and denominator degrees of freedom respectively
DEN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
45
50
60
70
80
90
100
110
120
140
160
180
200
1
648
38.51
17.44
12.22
10.01
8.81
8.07
7.57
7.21
6.94
6.72
6.55
6.41
6.30
6.20
6.12
6.04
5.98
5.92
5.87
5.83
5.79
5.75
5.72
5.69
5.66
5.63
5.61
5.59
5.57
5.55
5.53
5.51
5.50
5.48
5.47
5.46
5.45
5.43
5.42
5.38
5.34
5.29
5.25
5.22
5.20
5.18
5.16
5.15
5.13
5.12
5.11
5.10
2
799
39.00
16.04
10.65
8.43
7.26
6.54
6.06
5.71
5.46
5.26
5.10
4.97
4.86
4.77
4.69
4.62
4.56
4.51
4.46
4.42
4.38
4.35
4.32
4.29
4.27
4.24
4.22
4.20
4.18
4.16
4.15
4.13
4.12
4.11
4.09
4.08
4.07
4.06
4.05
4.01
3.97
3.93
3.89
3.86
3.84
3.83
3.82
3.80
3.79
3.78
3.77
3.76
3
864
39.17
15.44
9.98
7.76
6.60
5.89
5.42
5.08
4.83
4.63
4.47
4.35
4.24
4.15
4.08
4.01
3.95
3.90
3.86
3.82
3.78
3.75
3.72
3.69
3.67
3.65
3.63
3.61
3.59
3.57
3.56
3.54
3.53
3.52
3.50
3.49
3.48
3.47
3.46
3.42
3.39
3.34
3.31
3.28
3.26
3.25
3.24
3.23
3.21
3.20
3.19
3.18
4
900
39.25
15.10
9.60
7.39
6.23
5.52
5.05
4.72
4.47
4.28
4.12
4.00
3.89
3.80
3.73
3.66
3.61
3.56
3.51
3.48
3.44
3.41
3.38
3.35
3.33
3.31
3.29
3.27
3.25
3.23
3.22
3.20
3.19
3.18
3.17
3.16
3.15
3.14
3.13
3.09
3.05
3.01
2.97
2.95
2.93
2.92
2.90
2.89
2.88
2.87
2.86
2.85
11
973
39.41
14.37
8.79
6.57
5.41
4.71
4.24
3.91
3.66
3.47
3.32
3.20
3.09
3.01
2.93
2.87
2.81
2.76
2.72
2.68
2.65
2.62
2.59
2.56
2.54
2.51
2.49
2.48
2.46
2.44
2.43
2.41
2.40
2.39
2.37
2.36
2.35
2.34
2.33
2.29
2.26
2.22
2.18
2.16
2.14
2.12
2.11
2.10
2.09
2.07
2.07
2.06
12
977
39.41
14.34
8.75
6.52
5.37
4.67
4.20
3.87
3.62
3.43
3.28
3.15
3.05
2.96
2.89
2.82
2.77
2.72
2.68
2.64
2.60
2.57
2.54
2.51
2.49
2.47
2.45
2.43
2.41
2.40
2.38
2.37
2.35
2.34
2.33
2.32
2.31
2.30
2.29
2.25
2.22
2.17
2.14
2.11
2.09
2.08
2.07
2.05
2.04
2.03
2.02
2.01
13
980
39.42
14.30
8.72
6.49
5.33
4.63
4.16
3.83
3.58
3.39
3.24
3.12
3.01
2.92
2.85
2.79
2.73
2.68
2.64
2.60
2.56
2.53
2.50
2.48
2.45
2.43
2.41
2.39
2.37
2.36
2.34
2.33
2.31
2.30
2.29
2.28
2.27
2.26
2.25
2.21
2.18
2.13
2.10
2.07
2.05
2.04
2.02
2.01
2.00
1.99
1.98
1.97
14
983
39.43
14.28
8.68
6.46
5.30
4.60
4.13
3.80
3.55
3.36
3.21
3.08
2.98
2.89
2.82
2.75
2.70
2.65
2.60
2.56
2.53
2.50
2.47
2.44
2.42
2.39
2.37
2.36
2.34
2.32
2.31
2.29
2.28
2.27
2.25
2.24
2.23
2.22
2.21
2.17
2.14
2.09
2.06
2.03
2.02
2.00
1.99
1.98
1.96
1.95
1.94
1.93
15
985
39.43
14.25
8.66
6.43
5.27
4.57
4.10
3.77
3.52
3.33
3.18
3.05
2.95
2.86
2.79
2.72
2.67
2.62
2.57
2.53
2.50
2.47
2.44
2.41
2.39
2.36
2.34
2.32
2.31
2.29
2.28
2.26
2.25
2.23
2.22
2.21
2.20
2.19
2.18
2.14
2.11
2.06
2.03
2.00
1.98
1.97
1.96
1.94
1.93
1.92
1.91
1.90
16
987
39.44
14.23
8.63
6.40
5.24
4.54
4.08
3.74
3.50
3.30
3.15
3.03
2.92
2.84
2.76
2.70
2.64
2.59
2.55
2.51
2.47
2.44
2.41
2.38
2.36
2.34
2.32
2.30
2.28
2.26
2.25
2.23
2.22
2.21
2.20
2.18
2.17
2.16
2.15
2.11
2.08
2.03
2.00
1.97
1.95
1.94
1.93
1.92
1.90
1.89
1.88
1.87
17
989
39.44
14.21
8.61
6.38
5.22
4.52
4.05
3.72
3.47
3.28
3.13
3.00
2.90
2.81
2.74
2.67
2.62
2.57
2.52
2.48
2.45
2.42
2.39
2.36
2.34
2.31
2.29
2.27
2.26
2.24
2.22
2.21
2.20
2.18
2.17
2.16
2.15
2.14
2.13
2.09
2.06
2.01
1.97
1.95
1.93
1.91
1.90
1.89
1.87
1.86
1.85
1.84
18
990
39.44
14.20
8.59
6.36
5.20
4.50
4.03
3.70
3.45
3.26
3.11
2.98
2.88
2.79
2.72
2.65
2.60
2.55
2.50
2.46
2.43
2.39
2.36
2.34
2.31
2.29
2.27
2.25
2.23
2.22
2.20
2.19
2.17
2.16
2.15
2.14
2.13
2.12
2.11
2.07
2.03
1.98
1.95
1.92
1.91
1.89
1.88
1.87
1.85
1.84
1.83
1.82
40
1006
39.47
14.04
8.41
6.18
5.01
4.31
3.84
3.51
3.26
3.06
2.91
2.78
2.67
2.59
2.51
2.44
2.38
2.33
2.29
2.25
2.21
2.18
2.15
2.12
2.09
2.07
2.05
2.03
2.01
1.99
1.98
1.96
1.95
1.93
1.92
1.91
1.90
1.89
1.88
1.83
1.80
1.74
1.71
1.68
1.66
1.64
1.63
1.61
1.60
1.58
1.57
1.56
60
1010
39.48
13.99
8.36
6.12
4.96
4.25
3.78
3.45
3.20
3.00
2.85
2.72
2.61
2.52
2.45
2.38
2.32
2.27
2.22
2.18
2.14
2.11
2.08
2.05
2.03
2.00
1.98
1.96
1.94
1.92
1.91
1.89
1.88
1.86
1.85
1.84
1.82
1.81
1.80
1.76
1.72
1.67
1.63
1.60
1.58
1.56
1.54
1.53
1.51
1.50
1.48
1.47
100
1013
39.49
13.96
8.32
6.08
4.92
4.21
3.74
3.40
3.15
2.96
2.80
2.67
2.56
2.47
2.40
2.33
2.27
2.22
2.17
2.13
2.09
2.06
2.02
2.00
1.97
1.94
1.92
1.90
1.88
1.86
1.85
1.83
1.82
1.80
1.79
1.77
1.76
1.75
1.74
1.69
1.66
1.60
1.56
1.53
1.50
1.48
1.47
1.45
1.43
1.42
1.40
1.39
200
1016
39.49
13.93
8.29
6.05
4.88
4.18
3.70
3.37
3.12
2.92
2.76
2.63
2.53
2.44
2.36
2.29
2.23
2.18
2.13
2.09
2.05
2.01
1.98
1.95
1.92
1.90
1.88
1.86
1.84
1.82
1.80
1.78
1.77
1.75
1.74
1.73
1.71
1.70
1.69
1.64
1.60
1.54
1.50
1.47
1.44
1.42
1.40
1.39
1.36
1.35
1.33
1.32
TABLE 4.3. 1% critical values for the F-DISTRIBUTION, i.e. the value of $F^{(0.01)}_{\mathrm{NUM},\mathrm{DEN}}$, where NUM and DEN are the numerator and denominator degrees of freedom respectively
DEN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
45
50
60
70
80
90
100
110
120
140
160
180
200
1
4052
98.50
34.12
21.20
16.26
13.75
12.25
11.26
10.56
10.04
9.65
9.33
9.07
8.86
8.68
8.53
8.40
8.29
8.18
8.10
8.02
7.95
7.88
7.82
7.77
7.72
7.68
7.64
7.60
7.56
7.53
7.50
7.47
7.44
7.42
7.40
7.37
7.35
7.33
7.31
7.23
7.17
7.08
7.01
6.96
6.93
6.90
6.87
6.85
6.82
6.80
6.78
6.76
2
4999
99.00
30.82
18.00
13.27
10.92
9.55
8.65
8.02
7.56
7.21
6.93
6.70
6.51
6.36
6.23
6.11
6.01
5.93
5.85
5.78
5.72
5.66
5.61
5.57
5.53
5.49
5.45
5.42
5.39
5.36
5.34
5.31
5.29
5.27
5.25
5.23
5.21
5.19
5.18
5.11
5.06
4.98
4.92
4.88
4.85
4.82
4.80
4.79
4.76
4.74
4.73
4.71
3
5404
99.16
29.46
16.69
12.06
9.78
8.45
7.59
6.99
6.55
6.22
5.95
5.74
5.56
5.42
5.29
5.19
5.09
5.01
4.94
4.87
4.82
4.76
4.72
4.68
4.64
4.60
4.57
4.54
4.51
4.48
4.46
4.44
4.42
4.40
4.38
4.36
4.34
4.33
4.31
4.25
4.20
4.13
4.07
4.04
4.01
3.98
3.96
3.95
3.92
3.91
3.89
3.88
4
5624
99.25
28.71
15.98
11.39
9.15
7.85
7.01
6.42
5.99
5.67
5.41
5.21
5.04
4.89
4.77
4.67
4.58
4.50
4.43
4.37
4.31
4.26
4.22
4.18
4.14
4.11
4.07
4.04
4.02
3.99
3.97
3.95
3.93
3.91
3.89
3.87
3.86
3.84
3.83
3.77
3.72
3.65
3.60
3.56
3.53
3.51
3.49
3.48
3.46
3.44
3.43
3.41
9
6022
99.39
27.34
14.66
10.16
7.98
6.72
5.91
5.35
4.94
4.63
4.39
4.19
4.03
3.89
3.78
3.68
3.60
3.52
3.46
3.40
3.35
3.30
3.26
3.22
3.18
3.15
3.12
3.09
3.07
3.04
3.02
3.00
2.98
2.96
2.95
2.93
2.92
2.90
2.89
2.83
2.78
2.72
2.67
2.64
2.61
2.59
2.57
2.56
2.54
2.52
2.51
2.50
10
6056
99.40
27.23
14.55
10.05
7.87
6.62
5.81
5.26
4.85
4.54
4.30
4.10
3.94
3.80
3.69
3.59
3.51
3.43
3.37
3.31
3.26
3.21
3.17
3.13
3.09
3.06
3.03
3.00
2.98
2.96
2.93
2.91
2.89
2.88
2.86
2.84
2.83
2.81
2.80
2.74
2.70
2.63
2.59
2.55
2.52
2.50
2.49
2.47
2.45
2.43
2.42
2.41
11
6083
99.41
27.13
14.45
9.96
7.79
6.54
5.73
5.18
4.77
4.46
4.22
4.02
3.86
3.73
3.62
3.52
3.43
3.36
3.29
3.24
3.18
3.14
3.09
3.06
3.02
2.99
2.96
2.93
2.91
2.88
2.86
2.84
2.82
2.80
2.79
2.77
2.75
2.74
2.73
2.67
2.63
2.56
2.51
2.48
2.45
2.43
2.41
2.40
2.38
2.36
2.35
2.34
12
6107
99.42
27.05
14.37
9.89
7.72
6.47
5.67
5.11
4.71
4.40
4.16
3.96
3.80
3.67
3.55
3.46
3.37
3.30
3.23
3.17
3.12
3.07
3.03
2.99
2.96
2.93
2.90
2.87
2.84
2.82
2.80
2.78
2.76
2.74
2.72
2.71
2.69
2.68
2.66
2.61
2.56
2.50
2.45
2.42
2.39
2.37
2.35
2.34
2.31
2.30
2.28
2.27
13
6126
99.42
26.98
14.31
9.82
7.66
6.41
5.61
5.05
4.65
4.34
4.10
3.91
3.75
3.61
3.50
3.40
3.32
3.24
3.18
3.12
3.07
3.02
2.98
2.94
2.90
2.87
2.84
2.81
2.79
2.77
2.74
2.72
2.70
2.69
2.67
2.65
2.64
2.62
2.61
2.55
2.51
2.44
2.40
2.36
2.33
2.31
2.30
2.28
2.26
2.24
2.23
2.22
14
6143
99.43
26.92
14.25
9.77
7.60
6.36
5.56
5.01
4.60
4.29
4.05
3.86
3.70
3.56
3.45
3.35
3.27
3.19
3.13
3.07
3.02
2.97
2.93
2.89
2.86
2.82
2.79
2.77
2.74
2.72
2.70
2.68
2.66
2.64
2.62
2.61
2.59
2.58
2.56
2.51
2.46
2.39
2.35
2.31
2.29
2.27
2.25
2.23
2.21
2.20
2.18
2.17
15
6157
99.43
26.87
14.20
9.72
7.56
6.31
5.52
4.96
4.56
4.25
4.01
3.82
3.66
3.52
3.41
3.31
3.23
3.15
3.09
3.03
2.98
2.93
2.89
2.85
2.81
2.78
2.75
2.73
2.70
2.68
2.65
2.63
2.61
2.60
2.58
2.56
2.55
2.54
2.52
2.46
2.42
2.35
2.31
2.27
2.24
2.22
2.21
2.19
2.17
2.15
2.14
2.13
16
6170
99.44
26.83
14.15
9.68
7.52
6.28
5.48
4.92
4.52
4.21
3.97
3.78
3.62
3.49
3.37
3.27
3.19
3.12
3.05
2.99
2.94
2.89
2.85
2.81
2.78
2.75
2.72
2.69
2.66
2.64
2.62
2.60
2.58
2.56
2.54
2.53
2.51
2.50
2.48
2.43
2.38
2.31
2.27
2.23
2.21
2.19
2.17
2.15
2.13
2.11
2.10
2.09
17
6181
99.44
26.79
14.11
9.64
7.48
6.24
5.44
4.89
4.49
4.18
3.94
3.75
3.59
3.45
3.34
3.24
3.16
3.08
3.02
2.96
2.91
2.86
2.82
2.78
2.75
2.71
2.68
2.66
2.63
2.61
2.58
2.56
2.54
2.53
2.51
2.49
2.48
2.46
2.45
2.39
2.35
2.28
2.23
2.20
2.17
2.15
2.13
2.12
2.10
2.08
2.07
2.06
18
6191
99.44
26.75
14.08
9.61
7.45
6.21
5.41
4.86
4.46
4.15
3.91
3.72
3.56
3.42
3.31
3.21
3.13
3.05
2.99
2.93
2.88
2.83
2.79
2.75
2.72
2.68
2.65
2.63
2.60
2.58
2.55
2.53
2.51
2.50
2.48
2.46
2.45
2.43
2.42
2.36
2.32
2.25
2.20
2.17
2.14
2.12
2.10
2.09
2.07
2.05
2.04
2.03
27
6249
99.46
26.55
13.88
9.42
7.27
6.03
5.23
4.68
4.28
3.98
3.74
3.54
3.38
3.25
3.14
3.04
2.95
2.88
2.81
2.76
2.70
2.66
2.61
2.58
2.54
2.51
2.48
2.45
2.42
2.40
2.38
2.36
2.34
2.32
2.30
2.28
2.27
2.26
2.24
2.18
2.14
2.07
2.02
1.98
1.96
1.93
1.92
1.90
1.88
1.86
1.85
1.84
30
6260
99.47
26.50
13.84
9.38
7.23
5.99
5.20
4.65
4.25
3.94
3.70
3.51
3.35
3.21
3.10
3.00
2.92
2.84
2.78
2.72
2.67
2.62
2.58
2.54
2.50
2.47
2.44
2.41
2.39
2.36
2.34
2.32
2.30
2.28
2.26
2.25
2.23
2.22
2.20
2.14
2.10
2.03
1.98
1.94
1.92
1.89
1.88
1.86
1.84
1.82
1.81
1.79
40
6286
99.48
26.41
13.75
9.29
7.14
5.91
5.12
4.57
4.17
3.86
3.62
3.43
3.27
3.13
3.02
2.92
2.84
2.76
2.69
2.64
2.58
2.54
2.49
2.45
2.42
2.38
2.35
2.33
2.30
2.27
2.25
2.23
2.21
2.19
2.18
2.16
2.14
2.13
2.11
2.05
2.01
1.94
1.89
1.85
1.82
1.80
1.78
1.76
1.74
1.72
1.71
1.69
60
6313
99.48
26.32
13.65
9.20
7.06
5.82
5.03
4.48
4.08
3.78
3.54
3.34
3.18
3.05
2.93
2.83
2.75
2.67
2.61
2.55
2.50
2.45
2.40
2.36
2.33
2.29
2.26
2.23
2.21
2.18
2.16
2.14
2.12
2.10
2.08
2.06
2.05
2.03
2.02
1.96
1.91
1.84
1.78
1.75
1.72
1.69
1.67
1.66
1.63
1.61
1.60
1.58
100
6334
99.49
26.24
13.58
9.13
6.99
5.75
4.96
4.41
4.01
3.71
3.47
3.27
3.11
2.98
2.86
2.76
2.68
2.60
2.54
2.48
2.42
2.37
2.33
2.29
2.25
2.22
2.19
2.16
2.13
2.11
2.08
2.06
2.04
2.02
2.00
1.98
1.97
1.95
1.94
1.88
1.82
1.75
1.70
1.65
1.62
1.60
1.58
1.56
1.53
1.51
1.49
1.48
200
6350
99.49
26.18
13.52
9.08
6.93
5.70
4.91
4.36
3.96
3.66
3.41
3.22
3.06
2.92
2.81
2.71
2.62
2.55
2.48
2.42
2.36
2.32
2.27
2.23
2.19
2.16
2.13
2.10
2.07
2.04
2.02
2.00
1.98
1.96
1.94
1.92
1.90
1.89
1.87
1.81
1.76
1.68
1.62
1.58
1.55
1.52
1.50
1.48
1.45
1.42
1.41
1.39
TABLE 4.4. 0.5% critical values for the F-DISTRIBUTION, i.e. the value of $F^{(0.005)}_{\mathrm{NUM},\mathrm{DEN}}$, where NUM and DEN are the numerator and denominator degrees of freedom respectively
DEN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
45
50
60
70
80
90
100
110
120
140
160
180
200
1
16212
198.5
55.55
31.33
22.78
18.63
16.24
14.69
13.61
12.83
12.23
11.75
11.37
11.06
10.80
10.58
10.38
10.22
10.07
9.94
9.83
9.73
9.63
9.55
9.48
9.41
9.34
9.28
9.23
9.18
9.13
9.09
9.05
9.01
8.98
8.94
8.91
8.88
8.85
8.83
8.71
8.63
8.49
8.40
8.33
8.28
8.24
8.21
8.18
8.13
8.10
8.08
8.06
2
19997
199.0
49.80
26.28
18.31
14.54
12.40
11.04
10.11
9.43
8.91
8.51
8.19
7.92
7.70
7.51
7.35
7.21
7.09
6.99
6.89
6.81
6.73
6.66
6.60
6.54
6.49
6.44
6.40
6.35
6.32
6.28
6.25
6.22
6.19
6.16
6.13
6.11
6.09
6.07
5.97
5.90
5.79
5.72
5.67
5.62
5.59
5.56
5.54
5.50
5.48
5.46
5.44
3
21614
199.2
47.47
24.26
16.53
12.92
10.88
9.60
8.72
8.08
7.60
7.23
6.93
6.68
6.48
6.30
6.16
6.03
5.92
5.82
5.73
5.65
5.58
5.52
5.46
5.41
5.36
5.32
5.28
5.24
5.20
5.17
5.14
5.11
5.09
5.06
5.04
5.02
5.00
4.98
4.89
4.83
4.73
4.66
4.61
4.57
4.54
4.52
4.50
4.47
4.44
4.42
4.41
4
22501
199.2
46.20
23.15
15.56
12.03
10.05
8.81
7.96
7.34
6.88
6.52
6.23
6.00
5.80
5.64
5.50
5.37
5.27
5.17
5.09
5.02
4.95
4.89
4.84
4.79
4.74
4.70
4.66
4.62
4.59
4.56
4.53
4.50
4.48
4.46
4.43
4.41
4.39
4.37
4.29
4.23
4.14
4.08
4.03
3.99
3.96
3.94
3.92
3.89
3.87
3.85
3.84
11
24334
199.4
43.52
20.82
13.49
10.13
8.27
7.10
6.31
5.75
5.32
4.99
4.72
4.51
4.33
4.18
4.05
3.94
3.84
3.76
3.68
3.61
3.55
3.50
3.45
3.40
3.36
3.32
3.29
3.25
3.22
3.20
3.17
3.15
3.12
3.10
3.08
3.06
3.05
3.03
2.96
2.90
2.82
2.76
2.72
2.68
2.66
2.64
2.62
2.59
2.57
2.56
2.54
12
24427
199.4
43.39
20.70
13.38
10.03
8.18
7.01
6.23
5.66
5.24
4.91
4.64
4.43
4.25
4.10
3.97
3.86
3.76
3.68
3.60
3.54
3.47
3.42
3.37
3.33
3.28
3.25
3.21
3.18
3.15
3.12
3.09
3.07
3.05
3.03
3.01
2.99
2.97
2.95
2.88
2.82
2.74
2.68
2.64
2.61
2.58
2.56
2.54
2.52
2.50
2.48
2.47
13
24505
199.4
43.27
20.60
13.29
9.95
8.10
6.94
6.15
5.59
5.16
4.84
4.57
4.36
4.18
4.03
3.90
3.79
3.70
3.61
3.54
3.47
3.41
3.35
3.30
3.26
3.22
3.18
3.15
3.11
3.08
3.06
3.03
3.01
2.98
2.96
2.94
2.92
2.90
2.89
2.82
2.76
2.68
2.62
2.58
2.54
2.52
2.50
2.48
2.45
2.43
2.42
2.40
14
24572
199.4
43.17
20.51
13.21
9.88
8.03
6.87
6.09
5.53
5.10
4.77
4.51
4.30
4.12
3.97
3.84
3.73
3.64
3.55
3.48
3.41
3.35
3.30
3.25
3.20
3.16
3.12
3.09
3.06
3.03
3.00
2.97
2.95
2.93
2.90
2.88
2.87
2.85
2.83
2.76
2.70
2.62
2.56
2.52
2.49
2.46
2.44
2.42
2.40
2.38
2.36
2.35
TABLE 4.4 (continued). 0.5% critical values for the F-DISTRIBUTION
DEN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
45
50
60
70
80
90
100
110
120
140
160
180
200
15
24632
199.4
43.08
20.44
13.15
9.81
7.97
6.81
6.03
5.47
5.05
4.72
4.46
4.25
4.07
3.92
3.79
3.68
3.59
3.50
3.43
3.36
3.30
3.25
3.20
3.15
3.11
3.07
3.04
3.01
2.98
2.95
2.92
2.90
2.88
2.85
2.83
2.82
2.80
2.78
2.71
2.65
2.57
2.51
2.47
2.44
2.41
2.39
2.37
2.35
2.33
2.31
2.30
16
24684
199.4
43.01
20.37
13.09
9.76
7.91
6.76
5.98
5.42
5.00
4.67
4.41
4.20
4.02
3.87
3.75
3.64
3.54
3.46
3.38
3.31
3.25
3.20
3.15
3.11
3.07
3.03
2.99
2.96
2.93
2.90
2.88
2.85
2.83
2.81
2.79
2.77
2.75
2.74
2.66
2.61
2.53
2.47
2.43
2.39
2.37
2.35
2.33
2.30
2.28
2.26
2.25
17
24728
199.4
42.94
20.31
13.03
9.71
7.87
6.72
5.94
5.38
4.96
4.63
4.37
4.16
3.98
3.83
3.71
3.60
3.50
3.42
3.34
3.27
3.21
3.16
3.11
3.07
3.03
2.99
2.95
2.92
2.89
2.86
2.84
2.81
2.79
2.77
2.75
2.73
2.71
2.70
2.62
2.57
2.49
2.43
2.39
2.35
2.33
2.31
2.29
2.26
2.24
2.22
2.21
18
24766
199.4
42.88
20.26
12.98
9.66
7.83
6.68
5.90
5.34
4.92
4.59
4.33
4.12
3.95
3.80
3.67
3.56
3.46
3.38
3.31
3.24
3.18
3.12
3.08
3.03
2.99
2.95
2.92
2.89
2.86
2.83
2.80
2.78
2.76
2.73
2.71
2.70
2.68
2.66
2.59
2.53
2.45
2.39
2.35
2.32
2.29
2.27
2.25
2.22
2.20
2.19
2.18
40
25146
199.5
42.31
19.75
12.53
9.24
7.42
6.29
5.52
4.97
4.55
4.23
3.97
3.76
3.59
3.44
3.31
3.20
3.11
3.02
2.95
2.88
2.82
2.77
2.72
2.67
2.63
2.59
2.56
2.52
2.49
2.47
2.44
2.42
2.39
2.37
2.35
2.33
2.31
2.30
2.22
2.16
2.08
2.02
1.97
1.94
1.91
1.89
1.87
1.84
1.82
1.80
1.79
60
25254
199.5
42.15
19.61
12.40
9.12
7.31
6.18
5.41
4.86
4.45
4.12
3.87
3.66
3.48
3.33
3.21
3.10
3.00
2.92
2.84
2.77
2.71
2.66
2.61
2.56
2.52
2.48
2.45
2.42
2.38
2.36
2.33
2.30
2.28
2.26
2.24
2.22
2.20
2.18
2.11
2.05
1.96
1.90
1.85
1.82
1.79
1.77
1.75
1.72
1.69
1.68
1.66
100
25339
199.5
42.02
19.50
12.30
9.03
7.22
6.09
5.32
4.77
4.36
4.04
3.78
3.57
3.39
3.25
3.12
3.01
2.91
2.83
2.75
2.69
2.62
2.57
2.52
2.47
2.43
2.39
2.36
2.32
2.29
2.26
2.24
2.21
2.19
2.17
2.14
2.12
2.11
2.09
2.01
1.95
1.86
1.80
1.75
1.71
1.68
1.66
1.64
1.60
1.58
1.56
1.54
200
25399
199.5
41.92
19.41
12.22
8.95
7.15
6.02
5.26
4.71
4.29
3.97
3.71
3.50
3.33
3.18
3.05
2.94
2.85
2.76
2.68
2.62
2.56
2.50
2.45
2.40
2.36
2.32
2.29
2.25
2.22
2.19
2.16
2.14
2.11
2.09
2.07
2.05
2.03
2.01
1.93
1.87
1.78
1.71
1.66
1.62
1.59
1.56
1.54
1.51
1.48
1.46
1.44
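The F critical values tabulated in Tables 4.2 to 4.4 can be spot-checked numerically. The sketch below is not how the tables were produced (the source does not say); it is a minimal illustration that assumes only the standard relation between the F-distribution and the incomplete beta integral. The function names `simpson` and `f_critical` are hypothetical, chosen for this example. The substitution x = sin^2(phi) is used so that the integrand stays smooth even when NUM or DEN equals 1.

```python
import math

def simpson(g, a, b, steps=2000):
    """Composite Simpson's rule for the integral of g on [a, b]; steps is even."""
    h = (b - a) / steps
    total = g(a) + g(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3

def f_critical(alpha, num, den):
    """Return c such that P(F > c) = alpha for F with (num, den) degrees of freedom.

    With x = num*F / (num*F + den) = sin(phi)**2, the cumulative probability is
    the integral of sin(phi)**(num-1) * cos(phi)**(den-1) from 0 to phi, divided
    by the same integral taken from 0 to pi/2.
    """
    g = lambda phi: math.sin(phi) ** (num - 1) * math.cos(phi) ** (den - 1)
    whole = simpson(g, 0.0, math.pi / 2)
    lo, hi = 0.0, math.pi / 2
    for _ in range(60):                      # bisection on phi
        mid = (lo + hi) / 2
        upper_tail = 1.0 - simpson(g, 0.0, mid) / whole
        if upper_tail > alpha:
            lo = mid                         # tail still too large: move right
        else:
            hi = mid
    phi = (lo + hi) / 2
    return den / num * math.tan(phi) ** 2    # convert phi back to an F value

# Compare with the tabulated entries for NUM = 4, DEN = 10 (Table 4.3)
# and NUM = 3, DEN = 20 (Table 4.2).
print(round(f_critical(0.01, 4, 10), 2))
print(round(f_critical(0.025, 3, 20), 2))
```

For NUM = DEN = 2 the upper tail probability reduces to 1/(1 + c), so the critical value is 1/alpha - 1, which agrees with the tabulated 39.00, 99.00 and 199.0 for the 2.5%, 1% and 0.5% tables.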
TABLE 5. CORRELATION COEFFICIENT. Critical values of the correlation coefficient for one-sided tests of the null hypothesis $H_0: \rho = 0$ (where degrees of freedom = sample size - 2)
Deg. of Freedom
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
45
50
60
70
80
90
100
110
120
140
160
180
200
0.4
0.3090
0.2000
0.1577
0.1341
0.1186
0.1075
0.0990
0.0922
0.0867
0.0820
0.0780
0.0746
0.0715
0.0688
0.0664
0.0643
0.0623
0.0605
0.0588
0.0573
0.0559
0.0546
0.0534
0.0522
0.0511
0.0501
0.0492
0.0483
0.0474
0.0466
0.0458
0.0451
0.0444
0.0437
0.0431
0.0425
0.0419
0.0414
0.0408
0.0403
0.0380
0.0360
0.0328
0.0304
0.0284
0.0268
0.0254
0.0242
0.0232
0.0214
0.0201
0.0189
0.0179
0.3
0.5878
0.4000
0.3197
0.2735
0.2427
0.2204
0.2032
0.1895
0.1783
0.1688
0.1607
0.1536
0.1474
0.1419
0.1370
0.1326
0.1285
0.1248
0.1214
0.1183
0.1154
0.1127
0.1102
0.1078
0.1056
0.1036
0.1016
0.0997
0.0980
0.0963
0.0947
0.0932
0.0918
0.0904
0.0891
0.0878
0.0866
0.0855
0.0844
0.0833
0.0785
0.0744
0.0679
0.0628
0.0588
0.0554
0.0525
0.0501
0.0479
0.0444
0.0415
0.0391
0.0371
0.2
0.8090
0.6000
0.4919
0.4257
0.3803
0.3468
0.3208
0.2998
0.2825
0.2678
0.2552
0.2443
0.2346
0.2260
0.2183
0.2113
0.2049
0.1991
0.1938
0.1888
0.1843
0.1800
0.1760
0.1723
0.1688
0.1655
0.1624
0.1594
0.1567
0.1540
0.1515
0.1491
0.1468
0.1446
0.1425
0.1405
0.1386
0.1368
0.1350
0.1333
0.1257
0.1192
0.1088
0.1007
0.0942
0.0888
0.0842
0.0803
0.0769
0.0712
0.0666
0.0628
0.0595
0.1
0.9511
0.8000
0.6870
0.6084
0.5509
0.5067
0.4716
0.4428
0.4187
0.3981
0.3802
0.3646
0.3507
0.3383
0.3271
0.3170
0.3077
0.2992
0.2914
0.2841
0.2774
0.2711
0.2653
0.2598
0.2546
0.2497
0.2451
0.2407
0.2366
0.2327
0.2289
0.2254
0.2220
0.2187
0.2156
0.2126
0.2097
0.2070
0.2043
0.2018
0.1903
0.1806
0.1650
0.1528
0.1430
0.1348
0.1279
0.1220
0.1168
0.1082
0.1012
0.0954
0.0905
0.005
0.9999
0.9900
0.9587
0.9172
0.8745
0.8343
0.7977
0.7646
0.7348
0.7079
0.6835
0.6614
0.6411
0.6226
0.6055
0.5897
0.5751
0.5614
0.5487
0.5368
0.5256
0.5151
0.5052
0.4958
0.4869
0.4785
0.4705
0.4629
0.4556
0.4487
0.4421
0.4357
0.4296
0.4238
0.4182
0.4128
0.4076
0.4026
0.3978
0.3932
0.3721
0.3542
0.3248
0.3017
0.2830
0.2673
0.2540
0.2425
0.2324
0.2155
0.2019
0.1905
0.1809
0.0025
1.0000
0.9950
0.9740
0.9417
0.9056
0.8697
0.8359
0.8046
0.7759
0.7496
0.7255
0.7034
0.6831
0.6643
0.6470
0.6308
0.6158
0.6018
0.5886
0.5763
0.5647
0.5537
0.5434
0.5336
0.5243
0.5154
0.5070
0.4990
0.4914
0.4840
0.4770
0.4703
0.4639
0.4577
0.4518
0.4461
0.4406
0.4353
0.4301
0.4252
0.4028
0.3836
0.3522
0.3274
0.3072
0.2903
0.2759
0.2635
0.2526
0.2343
0.2195
0.2072
0.1968
0.001
1.0000
0.9980
0.9859
0.9633
0.9350
0.9049
0.8751
0.8467
0.8199
0.7950
0.7717
0.7501
0.7301
0.7114
0.6940
0.6777
0.6624
0.6481
0.6346
0.6219
0.6099
0.5986
0.5879
0.5776
0.5679
0.5587
0.5499
0.5415
0.5334
0.5257
0.5184
0.5113
0.5045
0.4979
0.4916
0.4856
0.4797
0.4741
0.4686
0.4634
0.4394
0.4188
0.3850
0.3583
0.3364
0.3181
0.3025
0.2890
0.2771
0.2572
0.2411
0.2276
0.2162
0.0005
1.0000
0.9990
0.9911
0.9741
0.9509
0.9249
0.8983
0.8721
0.8470
0.8233
0.8010
0.7800
0.7604
0.7419
0.7247
0.7084
0.6932
0.6788
0.6652
0.6524
0.6402
0.6287
0.6178
0.6074
0.5974
0.5880
0.5789
0.5703
0.5621
0.5541
0.5465
0.5392
0.5322
0.5254
0.5189
0.5126
0.5066
0.5007
0.4950
0.4896
0.4647
0.4432
0.4079
0.3798
0.3568
0.3375
0.3211
0.3068
0.2943
0.2733
0.2562
0.2419
0.2298
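The entries of Table 5 can be checked with a short calculation. The sketch below is an illustration, not the method used to build the table; it assumes the usual normal-theory result that, under H0: rho = 0, the sample correlation r has density proportional to (1 - r^2)^((df - 2)/2), where df = sample size - 2. Writing r = cos(theta) turns the upper tail probability into a ratio of integrals of sin(theta)^(df - 1). The name `r_critical` is hypothetical, and the function is intended for significance levels below 0.5.

```python
import math

def r_critical(alpha, df, steps=2000):
    """One-sided critical value of r at significance level alpha (alpha < 0.5)."""
    g = lambda theta: math.sin(theta) ** (df - 1)

    def integral(upper):
        # Composite Simpson's rule on [0, upper]; steps is even.
        h = upper / steps
        total = g(0.0) + g(upper)
        for i in range(1, steps):
            total += (4 if i % 2 else 2) * g(i * h)
        return total * h / 3

    whole = integral(math.pi)
    lo, hi = 0.0, math.pi / 2        # theta = arccos(r), with r between 0 and 1
    for _ in range(60):              # bisection: P(R > cos(theta)) grows with theta
        mid = (lo + hi) / 2
        if integral(mid) / whole < alpha:
            lo = mid
        else:
            hi = mid
    return math.cos((lo + hi) / 2)

# Compare with the tabulated entries for 10 degrees of freedom at level 0.1
# and for 20 degrees of freedom at level 0.005.
print(round(r_critical(0.1, 10), 4))
print(round(r_critical(0.005, 20), 4))
```

Two rows have closed forms that make a quick sanity check: with 1 degree of freedom the critical value is cos(pi * alpha), and with 2 degrees of freedom it is 1 - 2 * alpha, matching the first two rows of every column in the table.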