Probability, Statistics and Markov Processes
Patrick D. Bangert
School of Engineering and Science
International University Bremen
[email protected]
Acknowledgements
Preface

I Probability

1 The Notion of Probability
1.1 Introduction
1.1.1 The A Priori Definition
1.1.2 The A Posteriori Definition
1.1.3 Exercises
1.2 Reading: The Meaning of Probability by Ernest Nagel
1.2.1 Introduction to the Reading
1.2.2 The Paper

2 Compound Probabilities
2.1 Probability of Compound Events
2.1.1 Exercises
2.2 Reading: Concerning Probability by Pierre Simon de Laplace
2.2.1 Introduction to the Reading
2.2.2 The Paper

5 Expectation
5.1 Expectation Value
5.1.1 Exercises
5.2 Reading: The Theory of Economic Behavior by Leonid Hurwicz
5.2.1 Introduction to the Reading
5.2.2 The Paper

12 Limit Theorems
12.1 The Law of Large Numbers
12.2 Central Limit Theorem
12.3 Reading: Mathematics of War and Foreign Politics by Lewis Fry Richardson
12.3.1 Introduction to the Reading
12.3.2 The Paper

II Statistics

13 The Importance of Sampling
13.1 Introduction
13.2 The Field of Statistics
13.3 Sampling Methods
13.3.1 Unbiased Samples
13.3.2 Uniformly Random Sampling
13.3.3 Stratified Random Sampling
13.3.4 Cluster Random Sampling
13.3.5 Systematic Sampling
13.4 The Question of Error
13.5 Example of Bad Sampling
13.6 Reading: Sampling and Standard Error by L. C. Tippett
13.6.1 Introduction to the Reading
13.6.2 The Paper
IV Assignments

28 Homework
28.1 Grading Policy
28.1.1 Grades
28.1.2 Extra Credit
28.1.3 Attendance
28.2 Homework
28.2.1 General Policy
28.2.2 Homework 1 (19.02.2004): Basic Probability and Random Variables (75 Points)

29 Projects
Acknowledgements
Most of the notes were written by me but some parts were borrowed. I thank the following people and sources for kindly lending material included in these notes: John Lammin; S. R. S. Varadhan; James W. Pitman; Michael Casey; Numerical Recipes in C; Nanochron, LLC, 4201 Wilson Blvd., #110-615, Arlington, VA 22203, USA.
Utts, Jessica M. (1999): Seeing Through Statistics. Duxbury Press, Pacific Grove.
Haigh, John (2002): Probability Models. Springer Verlag.
Preface
Probability and Statistics are topics of mathematics and of practical science. The basic concepts are very difficult, as can be seen from the fact that there is still deep controversy about a few of them and how to use them. As a result, much of the statistics that is reported is simply wrong; some of it intentionally so.
As the subject is not completely agreed upon, this course will try to present all views on the topic and let the reader decide. However, a certain bias cannot be overcome. The main goal of this course is the understanding of the basic terms and concepts of probability and statistics. The concepts at issue are: Probability, Sample, Population, Representative, Significant, Correlation and Test. We will learn how to evaluate probabilities and combine them, what a population and a sample are, and what makes the sample representative. Then we will learn how to test whether certain properties are correlated and hypotheses true or false, and how significant the result of such a test is. Finally we shall learn how these concepts can be combined to draw useful conclusions about reality and also to predict (by use of Markov Processes) future events.
There are 28 lectures, each having a chapter in this compilation. The preface is the chapter corresponding to the first lecture. Most lectures have a reading associated with them that must be read before that lecture takes place. The reading is a selection from published works that illustrates the topic at hand. Sometimes the selections are rather old and thus show more clearly what was done. Certain prejudices are apparent in these works, and their writing style makes them amusing to read. They are included here because of their scientific and historical value and quality of writing, as the purpose of the course is the understanding of a few very difficult concepts. The opinions stated in these writings are not to be understood to correlate at all with those of the author of this compilation.
The grading policy, homework and projects are explained in the last two chapters.
Part I
Probability
Chapter 1
The Notion of Probability
1.1 Introduction
We are now going to study the mathematics of random events. If we throw a six-faced die we do not know in advance which of the numbers 1, 2, 3, 4, 5, 6 will show. If we spin a coin we do not know in advance whether it will show a head or a tail. In ordinary speech we say that it is all a matter of chance. It is possible to argue that no humanly constructible die or coin can be perfect; each must show some sort of bias. This might only reveal itself after the results of thousands of trials had been subjected to statistical analysis, but nevertheless it would be there. These problems are real; even attempts to lay down a precise definition of what we mean by a random sequence of numbers run into some very subtle difficulties. The so-called random number generators used in computer programs are in fact anything but random; they produce repeating sequences. The sequences admittedly have very long cycles and do have some of the properties that we would like a truly random sequence to have. You can use them safely for most problems.
But we cannot spend time on these considerations, interesting though they may be. As usual we will construct a mathematical model, assuming that all is for the best in the best of all possible worlds. And we will then find that it does serve to solve many real-world problems in a satisfactory way.
But first we need some definitions.
Definition 2 The set of all possible outcomes is called the sample space for the event and is usually denoted by the letter S. Sometimes the sample space is referred to as the possibility space or outcome space. The term sample space is widely used in statistics. I avoid outcome space as we have already used it when we were talking about permutations and combinations. Although the sample spaces of probability theory nearly always correspond to an outcome space as previously defined, I prefer to keep the nomenclature distinct.
Thus for throwing the usual six-faced die, S = {1, 2, 3, 4, 5, 6}. The events of the sample space are often called elementary events and the word event itself is reserved for any non-empty subset of S. Suppose the event A = {2, 4, 6}. Then we say that event A has happened if we get a 2, a 4 or a 6 on throwing the die.
This mathematical definition of the notion of an event is important in what follows. Try to get away from the everyday-life meaning of the word, which is not precise enough for us. I cannot emphasize strongly enough that an event A = {a, b, c, ...} is a non-empty subset of the sample space S, and we say that the event A has happened if we get any one of the elementary events of A as the result of a particular trial.
We will use n(A) as before to denote the number of elements in the set A. It is not the number of times the event happens in any experiment. One final note before we get down to actual problems: the sample spaces and events that we study in this chapter will all be finite sets. We will lift this restriction in a later chapter.
Consider the result of throwing a die 120 times. The results might come out like
this:
Result:             1       2       3       4       5       6
Frequency:          18      21      20      17      24      20
Relative frequency: 18/120  21/120  20/120  17/120  24/120  20/120
The terms result and frequency used in this example are self-explanatory. The term relative frequency is used by statisticians for the ratio of the number of times a particular event happens to the total number of trials.
We notice that all the relative frequencies are close to 1/6. If some enthusiast were to throw the die 1200 or even 12000 times we would expect the approximation of the relative frequencies to 1/6 to get better and better. The sum of the relative frequencies would always be 1, of course.
From considerations like these, two definitions of probability can be extracted. They are called the a priori and a posteriori definitions: "before" and "after", in plain English.
In either case there is the additional restriction that we expect our assignments of values to help us solve probabilistic problems.
In actual problems, then, our next step is to find a method of assigning the p's in a way which both satisfies the conditions above and also makes sense in our problem.
If we decide that the elementary events are equiprobable as with the six results
obtainable by throwing an unbiased die it is reasonable to take
p_1 = p_2 = p_3 = p_4 = p_5 = p_6 = 1/6.   (1.1)
This corresponds to the naive notion that a given result happens on average once in
every 6 times.
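A quick simulation (an addition, not part of the original notes) shows the relative frequencies settling toward 1/6 as the number of throws grows, exactly as described above:

    import random
    from collections import Counter

    random.seed(42)  # fixed seed so the run is reproducible
    for n in (120, 1200, 12000):
        freq = Counter(random.randint(1, 6) for _ in range(n))
        print(n, [round(freq[face] / n, 3) for face in range(1, 7)])
        # each relative frequency drifts toward 1/6 = 0.167 as n grows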
If the events are not equiprobable we can still make a try at defining the probabilities in many cases. Given a fair die with one face marked 1, two faces marked 2 and three faces marked 6, S = {1, 2, 6} and we would assign 1/6, 2/6, 3/6 as the corresponding probabilities.
In a difficult situation we can attempt to verify a probability assignment as follows. Given S = {e_1, e_2, ..., e_n}, suppose we assign probabilities p_1, p_2, ..., p_n which are positive fractions with Σ_{i=1}^{n} p_i = 1. We now do experiments involving a large number of trials, N say. Suppose e_1 occurs n_1 times, e_2 occurs n_2 times, ..., and e_n occurs n_n times. If our chosen p's are to make sense, the relative frequencies n_1/N, n_2/N, ..., n_n/N, which certainly add up to 1, should also become very close to the p's that we have defined. It is neat and very tempting to say that we require lim_{N→∞} n_i/N = p_i. The trouble with this is that we can never carry out an experiment in which N → ∞. So this is not on the same footing as, say, lim_{n→∞} n/(n+1) = 1 of pure mathematics, where we have techniques for calculating limits without having to carry out an (impossible) infinite number of operations.
You will now appreciate why this entire scheme is called the a priori definition of probabilities. You assign the probabilities first and verify them later (if at all). This was the method used by the early workers in this field, who were developing the theory for the study of gambling games. Great use is made of this method today, but we must also consider the a posteriori definition.
This is called the a posteriori definition because you do the experiment first and then assign the probabilities according to the results of the experiment. The devout
experimental physicists among you will be lauding this eminently proper way of
going about things. But it too has its problems. For example, faced with drawing
up an advertising campaign to sell the product on the strength of these results the
director screams that the dogs were not fair or unbiased dogs. Coming from an
animal research center they probably wanted to die. He may well have a point there
and certainly raises the question of how you choose your sample if you are going to
establish probabilities by experimentation. In the 1960s a drug called Thalidomide
was put on the market as a tranquilizer. It had of course been extensively tested
on humans before it was allowed to be sold. Unfortunately the sampling had not
included any pregnant women. As a result thousands of terribly deformed babies
were born before the reason was found and the drug withdrawn.
The validity of sampling techniques is a very important topic in statistics. For our present task of developing the first steps of probability theory it does not matter how the probabilities have been found. Let us summarize what we really need: given a sample space S = {e_1, e_2, ..., e_n} with probability distribution p_1, p_2, ..., p_n, and a non-empty event E ⊆ S, the probability of E given S is P(E|S), the sum of the p_i over those elementary events e_i which belong to E.
This definition looks complicated but it is very precise. Study it carefully. It will become crystal clear as we look at examples. Remember that an event E happens when a random trial produces a result which is in the set E.
Example 1 Given a fair die what is the probability of throwing an even number?
Solution.
S = {1, 2, 3, 4, 5, 6}. p_1 = p_2 = p_3 = p_4 = p_5 = p_6 = 1/6. E = {2, 4, 6}.

P(E|S) = p_2 + p_4 + p_6 = 3/6 = 1/2.   (1.5)
So far as this example is concerned, then, our definition seems eminently reasonable.
Example 2 If a card is drawn from a well-shuffled pack, what is the probability of its being an ace?
Solution.
S = { any one of the 52 cards }. p_1 = p_2 = ... = p_52 = 1/52.
E = { any one of the four aces }.

P(E|S) = 1/52 + 1/52 + 1/52 + 1/52 = 4/52 = 1/13.   (1.6)
Example 3 What is the probability of securing a hand of 13 cards all of the same
suit in a game of bridge?
Solution.
S = { all possible different bridge hands }.
Now S has C(52,13) elements. Each one of these hands is equally likely, so the probability distribution is p_i = 1/C(52,13) for each hand.
E = { the four hands containing just one suit }.

P(E|S) = 4/C(52,13) ≈ 6.3 × 10^{-12}.   (1.7)
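As a quick check (an addition to the notes), the count of hands and the probability can be computed exactly:

    import math

    hands = math.comb(52, 13)   # number of distinct bridge hands
    print(hands)                # 635013559600
    print(4 / hands)            # ≈ 6.3e-12, the chance of a one-suit hand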
Example 4 The letters of the word moon are arranged at random. What is the probability that the O's will be separated?
Solution.
There are 4!/2! = (4 × 3 × 2 × 1)/2 = 12 arrangements altogether. If the O's are kept together there are 6 arrangements. Hence there are also 6 arrangements in which the O's are separated.
S = { the 12 possible arrangements }. p_1 = p_2 = ... = p_12 = 1/12.
E = { the 6 arrangements with separated O's }.

P(E|S) = 6 × 1/12 = 1/2.   (1.8)
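The arrangement count can be verified by brute force (an added sketch, not from the original):

    import itertools

    # Distinct arrangements of "moon"; set() merges duplicates caused by
    # the two identical O's.
    arrangements = set(itertools.permutations("moon"))

    def os_separated(word):
        pos = [i for i, ch in enumerate(word) if ch == "o"]
        return pos[1] - pos[0] > 1   # the two O's are not adjacent

    favourable = sum(os_separated(a) for a in arrangements)
    print(len(arrangements), favourable, favourable / len(arrangements))
    # 12 6 0.5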
Example 5 Two dice are thrown and their scores added. List the sample space and find the probability that the total score is (a) 7, (b) 9 or more.
Solution.
S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}. But these are not equiprobable!
Each die can fall in 6 ways, so there are 36 different possible outcomes. Counting the ordered pairs which give each total:

Total 2:  (1,1)                                  1 way
Total 3:  (1,2) (2,1)                            2 ways
Total 4:  (1,3) (3,1) (2,2)                      3 ways
Total 5:  (1,4) (4,1) (2,3) (3,2)                4 ways
Total 6:  (1,5) (5,1) (2,4) (4,2) (3,3)          5 ways
Total 7:  (1,6) (6,1) (2,5) (5,2) (3,4) (4,3)    6 ways
Total 8:  (2,6) (6,2) (3,5) (5,3) (4,4)          5 ways
Total 9:  (3,6) (6,3) (4,5) (5,4)                4 ways
Total 10: (4,6) (6,4) (5,5)                      3 ways
Total 11: (5,6) (6,5)                            2 ways
Total 12: (6,6)                                  1 way
Total: 36 ways.

P(2) = 1/36, P(3) = 2/36, P(4) = 3/36, P(5) = 4/36, P(6) = 5/36, P(7) = 6/36,
P(8) = 5/36, P(9) = 4/36, P(10) = 3/36, P(11) = 2/36, P(12) = 1/36.   (1.9)

Hence for (a), P(7) = 6/36 = 1/6, and for (b),

P(E|S) = (4 + 3 + 2 + 1)/36 = 10/36 = 5/18.   (1.10)
Note.
The notation P(2) = 1/36 etc. that we have used here is a useful alternative to suffixes in practical calculations. We can put a detailed description of the event inside the brackets if necessary.
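Enumerating the 36 outcomes confirms these values (an added check, not in the original):

    from itertools import product
    from collections import Counter

    totals = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    print(totals[7] / 36)                              # 6/36 = 1/6
    print(sum(totals[t] for t in range(9, 13)) / 36)   # 10/36 = 5/18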
Example 6 360 oysters are examined and 12 contain pearls. What is the probability
that the next oyster examined contains a pearl?
Solution.
This is an a posteriori problem. The sample space has only two elements:
S = { oyster contains a pearl, oyster does not contain a pearl }
On the evidence given we assign:
P(oyster contains a pearl) = 12/360
P(oyster does not contain a pearl) = 348/360.
So we claim that the probability that the next oyster examined contains a pearl is 12/360 = 1/30. This is the best estimate we can make with the evidence we have.
Example 7 A dart is thrown at random at a square board of side 20 cm on which nine red squares, each of side 2 cm, have been painted. What is the probability that the dart hits a red square?
Solution.
S = { dart hits a red square, dart hits a white area }.
Since the total area is 400 cm² and the red area is 9 × 4 = 36 cm², it is reasonable to assign the probabilities:
P(dart hits a red square) = 36/400
P(dart hits a white area) = 364/400.
Thus the probability that it hits a red square is simply 9/100.
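A Monte Carlo version of this example (an addition; the grid layout of the red squares is an assumption, since only the total red area matters):

    import random

    random.seed(0)
    # Nine 2 cm red squares on a 20 cm board, placed on an assumed grid.
    reds = [(x, y) for x in (2, 9, 16) for y in (2, 9, 16)]  # lower-left corners

    def hits_red(px, py):
        return any(x <= px <= x + 2 and y <= py <= y + 2 for x, y in reds)

    n = 100_000
    hits = sum(hits_red(random.uniform(0, 20), random.uniform(0, 20))
               for _ in range(n))
    print(hits / n)   # ≈ 36/400 = 0.09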
1.1.3 Exercises
1. Make a list of all possible combinations of Boy (B) and Girl (G) for four-child families. Assuming that P(B) = P(G), find the probability that in a four-child family chosen at random there is (a) at least one girl, (b) at least two girls, (c) the oldest and youngest are of the same sex, (d) no girl in the family is older than a boy.
2. Four fair dice are thrown and the total score calculated. What is the probability that the total score is 23 or more?
3. What is the probability that an arrangement of the letters of the word MARROW has the R's together?
4. Three rows of cars wait to board a hovercraft. The 10th car in each row is labelled A, B, C respectively. Assuming that cars are drawn at random from the front of each row until all cars are loaded, find the chance that (a) A is loaded before B, (b) both A and B are loaded before C.
5. A dart board is divided into 20 equal sectors and marked with scores from 1
to 20. If a dart hits the scoring part of the board at random, calculate the
probability (a) that the score is a prime number, (b) the score will not divide
by 3.
6. An urn contains equal large numbers of red and green balls. If three balls are
drawn out, what is the probability that both colors are represented?
7. A biased die was thrown 100 times and gave the following results:

Score:  1  2  3  4  5  6
Number: 17 21 15 10 21 16   (Total 100)

Make the best possible estimate of the probability that the sum of the scores of the next two throws will be at least 4.
8. Twelve people have their names written on pieces of paper, each of which is
folded and placed in a hat. Among the twelve people there are three brothers.
What is the probability that if 5 names are drawn out not all the brothers will
be included?
9. Four players A, B, C and D play a game of chance three times. Find the probability that (a) A wins all the games, (b) A wins the second game only.
10. In a box there are 12 bulbs of which 3 are defective. If 2 are taken at random,
what is the chance that both are defective? What is the chance that neither
is defective?
11. Two dice are thrown. What is the probability that the scores differ by 2 or less?
12. Trains at the same platform at Baker Street underground station go alternately on the Circle line and the Metropolitan line. A traveller who always wishes to take a Metropolitan line train, and who can be assumed to arrive on the platform at random, finds that 403 times out of 500 the next train due is a Circle line train. How can this be reconciled with the fact that the trains alternate?
13. A prism has ends which are equilateral triangles and sides which are rectangular. An experiment with 1000 throws gave 882 occasions when it landed on its rectangular faces and 118 when it landed on a triangular end. The triangular ends are painted red and green and the rectangular faces red, green, and blue. Find the best estimate of the probability that for a single throw on to a flat surface (a) only red and green can be seen, (b) both red faces can be seen.
14. What is the probability that if the letters of the word MANAGEMENT are arranged in random order, the vowels will be separated?
15. An integer is chosen at random. What is the probability that it will be divisible
by 2, but not by 4 or 6?
16. A white square of side 30 cm has a black circle of radius 10 cm drawn at its center. Find the probability that a point chosen at random within the square also lies within the circle.
17.
Chapter 2
Compound Probabilities

P(E_1 ∪ E_2 | S) = 6/36 + 6/36 − 2/36 = 10/36 = 5/18   (2.4)
But don't forget the full notation. We shall need it again soon.
This is much the same as the use we made of this term when working with permutations, i.e. if two events are mutually exclusive, an elementary event of E_1 cannot also be an elementary event of E_2 and vice versa. But we are now working in a more precisely defined framework.
If E_1 ∩ E_2 = ∅, clearly P(E_1 ∩ E_2) = 0 as there are no elementary events to sum over. Thus for mutually exclusive events Theorem 1 becomes P(E_1 ∪ E_2 | S) = P(E_1 | S) + P(E_2 | S).
Recall next that if we are given an event set E, the complement of the event set, denoted by Ē, is the set made up of all those elements of S which are not in E; i.e. if E = {a, d, f} and S = {a, b, c, d, e, f, g} then Ē = {b, c, e, g}. We always have E ∩ Ē = ∅ and E ∪ Ē = S. Note that some books use E^c as their notation for the complement of E.
Proof 2

P(E ∪ Ē) = P(E) + P(Ē) − P(E ∩ Ē)   (2.7)
P(S) = P(E) + P(Ē)   (2.8)

But P(S) involves summing over all the elementary events of the sample space and hence, by the definition of a probability distribution, P(S) = 1. Thus 1 = P(E) + P(Ē) and the theorem follows at once.
Example 9 Two dice are thrown. What is the probability of not getting a double?
Solution.
Let E be the event set E = { all doubles }, so P(E) = 6/36 = 1/6. Then Ē is the set of all throws which are not doubles.

P(Ē) = 1 − 1/6 = 5/6   (2.10)
You will recognize a device which we often used in permutation and combination
problems: the reverse problem is often much simpler.
We now consider a further aspect of probability theory. If you study the weather
statistics for several years in Manila you will be able to calculate the (a posteriori)
probability that it will rain on a day chosen at random. But if you also happen to
know that the day chosen falls in the middle of the rainy season then your probability
will be much higher. Probability pure and simple involves only random trials, but in
applying probability theory to the real world we will often have partial knowledge
either of the event that we are working with or of an event which can influence this.
Having said this we will once again take refuge in symbols and return to the real
world later.
We have the usual sample space S and a probability distribution defined on it. Suppose that E and F are two event subsets (and that neither E nor F is empty, to avoid triviality). We want to know the probability of E given that we do know that F actually occurs.
A little thought will show you that F has in fact become our sample space. If the element of E whose probability we are looking for is not in F, there is now absolutely no chance of its happening and it can be considered deleted from the sample space. Making a trial for an elementary event of E is doomed to failure if that event is not in F. So in our full notation what we must find is P(E|F).
Before we rush off to the canteen to celebrate this discovery there is a remaining problem. The original probabilities defined on S will not give a probability distribution on F! Since some of the original events are missing, the remaining probabilities cannot add up to 1, and that, as we all know, is an essential prerequisite for any probability distribution.
Let us consider a situation. Given: S = {e_1, e_2, e_3, e_4, e_5, e_6, e_7, e_8, e_9, e_10} with the probability distribution p_1, p_2, ..., p_10 where p_1 + p_2 + ... + p_10 = 1. Let F = {e_2, e_4, e_9} so that we know that P(F|S) = p_2 + p_4 + p_9.
We make the hypothesis that the probability distribution for F as the new sample space is:

e_2: p_2/(p_2 + p_4 + p_9),   e_4: p_4/(p_2 + p_4 + p_9),   e_9: p_9/(p_2 + p_4 + p_9).   (2.11)
These probabilities clearly add up to 1 and theoretically satisfy the requirements for
a probability distribution. They are also plausible since they multiply each of the old
probabilities for e2 , e4 , e9 by the same factor, which is greater than 1, suggesting that
the probabilities have all been increased in the same ratio by our extra knowledge
that one of the events of F must occur. However this remains a hypothesis. In
situations where it can be experimentally tested it yields results in agreement with
the experimental results. Hence it is adopted for use all the time. [The computer experts among you might like to try to design a computer simulation which would test this hypothesis; one possible sketch follows the definition below.] We summarize all this in the next definition.
(since they only apply on condition that a previous event has already occurred) and problems involving them are called problems in conditional probability.

P(E|F) = P(E ∩ F | S) / P(F|S)   (2.12)

(i.e. we add the S-probabilities of those elements of E which are also in F and divide the result by P(F|S)).
Do not forget the underlying meaning of P (E|F ). We are only considering F
as the new sample space because we know that the events of F are the only ones in
question.
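Taking up the bracketed invitation above, here is one possible simulation sketch (an addition to the notes; the ten probabilities are arbitrary assumed values): draw from S, discard trials falling outside F, and compare the surviving relative frequencies with the renormalized probabilities of equation 2.11.

    import random

    random.seed(1)
    # An assumed distribution on S = {e1, ..., e10}; the values are arbitrary.
    p = [0.05, 0.15, 0.10, 0.20, 0.05, 0.10, 0.05, 0.10, 0.12, 0.08]
    F = {1, 3, 8}   # 0-based indices of e2, e4, e9

    counts = {i: 0 for i in F}
    kept = 0
    for _ in range(200_000):
        outcome = random.choices(range(10), weights=p)[0]
        if outcome in F:            # keep only trials in which F occurred
            counts[outcome] += 1
            kept += 1

    pF = sum(p[i] for i in F)
    for i in sorted(F):
        print(counts[i] / kept, p[i] / pF)   # empirical vs. p_i / P(F|S)

The two columns agree closely, in line with the claim that the hypothesis survives experimental testing.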
Equation 2.12 is of vital importance in much of our more advanced work in probability and statistics. In the derivation of equation 2.12 we had to assign probabilities to the new sample space F by using the multiplying factor 1/P(F|S). This is of theoretical importance, but in most practical applications of equation 2.12 it will not be necessary to actually descend to this level of detail. Also, in practical problems we shall very often be using equation 2.12 in the form

P(E ∩ F | S) = P(E|F) P(F|S).   (2.13)

Study our next example carefully and you will realize that this is much simpler in practice than it may seem from the general discussion.
Example 10 A bag contains 20 balls, 10 red, 8 white and 2 blue. The balls are
indistinguishable apart from the color. Two balls are drawn in succession without
replacement. What is the probability that they will both be red?
Solution.
F is a ball is drawn from the original 20 and is red. E is a ball is drawn
from the remaining 19 and is red. E F is both balls drawn are red. Clearly the
problem wants us to nd P (E F |S). We have P (F |S) = 10 20
= 12 . P (E|F ) is the
probability of getting a red ball from 9 red balls remaining in a bag of 19 balls i.e.
9
P (E|F ) = 19 . So using equation 2.13 P (E F |S) = 12 19
9 9
= 38 .
The argument used in these last two examples can be illustrated by what is called a tree diagram. In the diagram below we have added R_1 and R_2 to the notation already in use. The meanings need no further clarification.
The probability of getting to any leaf on the tree starting from O is obtained by multiplying the probabilities of the paths followed, e.g. the probability of getting two blue balls is (2/20) × (1/19) = 1/190, etc.
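A simulation of the two draws (an added check, not in the original) reproduces both probabilities:

    import random

    random.seed(2)
    bag = ["R"] * 10 + ["W"] * 8 + ["B"] * 2
    n = 200_000
    both_red = both_blue = 0
    for _ in range(n):
        a, b = random.sample(bag, 2)    # two draws without replacement
        both_red += (a == b == "R")
        both_blue += (a == b == "B")
    print(both_red / n, 9 / 38)         # ≈ 0.237
    print(both_blue / n, 1 / 190)       # ≈ 0.0053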
Finally for this section we discuss the related problem of independent events. Two events are independent if neither has any effect on the probability of the other. The ball drawings in the previous examples are clearly not independent events. To make this precise we have:
Definition 7 Two events A and B from a given sample space and probability distribution are said to be independent if

P(A ∩ B | S) = P(A|S) P(B|S).   (2.16)

Comparing this with equation 2.13, which, being obtained much more generally, is always true, we can conclude that two events A and B are independent if P(A|S) = P(A|B). And if we recall our earlier discussion, this means that the occurrence of B does not affect the probability of A. And that is about as good a notion of independence as we shall get.
This definition is used both ways. If common sense tells us that A and B are independent, we will use the simpler equation 2.16 to calculate probabilities. On the other hand, in a more complex situation it may not be obvious whether two events are independent or not. We can find out by calculating P(A ∩ B|S) both using equation 2.12 and using equation 2.16. If the answers agree we conclude that the events are independent.
Example 12 A coin is tossed and a die thrown. What is the probability of getting
a head and a six?
Solution.
The sample space is S = {(H,1), (T,1), (H,2), (T,2), ..., (T,6)}.
We require P((H,6)|S). The coin tossing and die throwing are independent events, so P((H,6)|S) = P(H|S) P(6|S). Here P(H|S) means the probability of getting a head, the die being irrelevant, and P(6|S) means the probability of getting a 6, the coin being irrelevant; i.e. H is the event H = {(H,1), (H,2), (H,3), (H,4), (H,5), (H,6)}.

P((H,6)|S) = (1/2) × (1/6) = 1/12.   (2.18)
Example 13 A bag contains 2 white balls and 6 red balls. A second bag contains 4 white balls and 4 red balls. What is the probability of drawing 2 white balls if:
(i) one ball is drawn from each bag;
(ii) 2 balls are drawn from the first bag;
(iii) 2 balls are drawn from the second bag?
Solution.
(i) These are independent events, so P(W_1 ∩ W_2) = (2/8) × (4/8) = 1/8.
(ii) These are not independent events, so

P(W_2 ∩ W_1 | S) = P(W_2|W_1) P(W_1|S) = (1/7) × (2/8) = 1/28.   (2.19)

(iii) Again P(W_2 ∩ W_1 | S) = P(W_2|W_1) P(W_1|S) = (3/7) × (4/8) = 3/14.
Our final example shows how Venn diagrams can sometimes be useful.
Example 14 A cancelled export order led to a furniture sale at which tables, sideboards and bookcases were sold. There were 200 tables, 160 sideboards and 160 bookcases, and customers were limited to a maximum of one of each item. 285 people attended the sale and only 25 left without buying anything. 84 people bought all three items, 32 bought a sideboard and a table only, and 40 bought a table only. All the items were sold.
What is the probability that a person chosen at random from the 285 sale-goers bought a sideboard only?
Solution.
Since all 200 tables were sold, 200 − 84 − 32 − 40 = 44 people bought a table and a bookcase only. Of the 285 − 25 = 260 buyers, 260 − 200 = 60 bought no table; the 160 bookcases leave 160 − 84 − 44 = 32 bookcase buyers without tables, so 60 − 32 = 28 people bought a sideboard only.
Hence P(bought a sideboard only) = 28/285 = 0.098.
2.1.1 Exercises
1. Four balls are drawn at random simultaneously from an urn containing 1 purple, 2 yellow, 3 red and 6 blue balls. Find the chance that the balls are (a) all blue, (b) one of each color.
2. In a certain school, the probability that a person studying German also studies
Physics is 1/4, whereas the probability that someone studying Physics also
studies German is 1/5. The probability that a person chosen at random studies
neither is 1/3. Calculate the probability that a person chosen at random
studies both Physics and German.
Could it be said that Physics and German were independent choices at that
school?
3. A man has 3 black counters in his left pocket and 3 white counters in his right
pocket. He takes one counter at random from the left pocket and places it in
the right. He then takes one at random from the right pocket and places it in
the left. What is the chance that there are again 3 black counters in the left
pocket?
4. Two bags each contain 3 red balls and 1 black ball. A ball is transferred at
random from the rst bag to the second. Then one is transferred at random
from the second bag to the rst. What is the probability that there are again
3 red balls and 1 black ball in each bag?
5. Show that if A_1 and A_2 are independent events, then (a) A_1 and Ā_2, (b) Ā_1 and Ā_2 are also independent events.
6. Show that if A_1 and A_2 are possible events which are mutually exclusive then they cannot be independent. Construct definitions for A_1 and A_2 if each represents the drawing of a specific card from a well-shuffled pack.
7. On a stretch of main road there are 4 independent sets of traffic lights, each phased for 120 seconds red, 60 seconds green. What is the probability that a motorist arriving at random will have to stop at least once?
8. A poker player is dealt 5 cards. What is the chance that he receives a royal flush (Ace, King, Queen, Jack, 10 of the same suit)?
9. An urn contains 3 black and 7 white balls. Balls are drawn at random one at a time from the urn and are not replaced. Find the probability that the first black ball to be drawn is drawn on the fourth attempt.
10. Two dice are thrown. Find the probability that the product of their scores is
even. If n dice are thrown, what is the probability that their product is even?
11. Three people are chosen at random. Assuming that births are equally likely throughout the week, what is the probability that they were all born on different days of the week? What is the probability that 7 people chosen at random were all born on different days of the week?
12. What is the probability of drawing first a seven and then an eight from a pack of cards?
13. In a table of random numbers, consecutive repeated digits are called doubles,
triples and so on. Work out the probability that a digit chosen at random
from the table will be part of (a) a double (b) a triple.
14. A car driver has four keys, only one of which will open the door. Given that the keys are otherwise indistinguishable, find the probability (before he starts trying them) that the door will open on the first, second, third and fourth attempts.
(a) Consider two cases where (i) he discards each key which fails to open the door, (ii) he returns each key to the collection before choosing the next one at random.
(b) Then consider the cumulative probabilities with each strategy, i.e. the probability that he will have succeeded by the first, second, third and fourth attempts.
15. Smith and Wesson fight a duel with guns. The probability that Smith kills Wesson on any shot is 1/4 and that Wesson kills Smith is 1/3. Find the probability that just one of them is killed after one shot each if (a) both fire simultaneously, (b) Smith fires first.
If both fire simultaneously, what is the probability that both are still alive after the second round?
What is Smith's probability of survival if they duel to the death, each time firing simultaneously?
16. Two letters of the words SEEN SENSE are chosen at random without replace-
ment. Set up a tree diagram and show that the probability that the second
letter chosen is an E is the same as the probability that the rst letter chosen
is an E. Show that this property is also true for the letters N and S and explain
why.
Explain also why the results are the same as they would be if the selection
took place with replacement.
17. One wine rack contains 3 white, 7 red and 2 rosé bottles, and a second contains 4 white, 4 red and 1 rosé. A rack is chosen at random and then a bottle is chosen at random from that rack. Draw a probability tree and find the probability that the bottle so chosen is red.
Which color wine has an improved chance of selection if chosen by this method rather than by true random selection?
18. Accident casualties in a hospital are in the proportion 2/3 male, 1/3 female. The hospital is large and has one large and two small mixed casualty wards. Victims of accidents are assigned alternately to large and small wards; the pattern is L S_1 L S_2 L S_1 ... etc. Ward L has 7 male and 14 female nurses. Ward S_1 has 3 male and 5 female nurses. Ward S_2 has 4 male and 8 female nurses. The rota system means that the nurses on duty are effectively chosen at random. Find the probability that a victim chosen at random is male and is received into the ward by a male nurse.
20. A batch of fifty articles contains three which are defective. The articles are drawn in succession (without replacement) from the batch and tested. Show that the chance that the first defective met will be the r-th article drawn is (50 − r)(49 − r)/39200.
21. With a rather gloomy weather forecast for an agricultural show, the organizers decided to hold a draw for the lucky program holder. Free programs, eligible for the draw, were given to the first 300 visitors entering the ground.
A count was made of whether each visitor was equipped with an umbrella (U), a raincoat (R) and wellingtons (W). The results were:
Umbrella 75 Umbrella and raincoat 40 None of these 60
Raincoat 140 Umbrella and wellingtons 25
Wellingtons 105 Raincoat and wellingtons 30
Find the probability that the winner of the lucky program draw has an
umbrella but no raincoat or wellingtons.
Chapter 3

3.1 Inverse Probability and Bayes' Theorem
Hence

P(I|W) = P(I ∩ W)/P(W|S) = P(W|I) P(I)/P(W|S)   (3.3)

(we use the well-known set result A ∩ B = B ∩ A)

= (2/6 × 1/5)/(1/5) = 1/3.
The crux of this solution is the use of the set-theoretic result A ∩ B = B ∩ A. From the easily calculated P(W|I) we have calculated the far from obvious P(I|W). For obvious reasons this is called an inverse probability calculation.
If any of you have studied probability before at a more elementary level, you will recall that it is possible to muddle one's way through problems such as Examples 11, 12 and 13 without explicit use of equation 2.12. However, our use of equation 2.12 is essential in Example 15. In general, the method used here of basing all our calculations on the single equation 2.12 would be the preferred one at a more advanced level. Remember that equation 2.12 is not proved: it is derived from plausible arguments, admittedly, but the ultimate test is "does it work?". If anything went wrong and we had to revise our theory, it would then only be equation 2.12 that would need examination. But if we had been solving all our problems by ad hoc methods, each of these would require examination.
Let us try another example in inverse probability.
Example 16 Three holiday caravans X, Y, Z are let to 2 adults and 4 children, 4 adults and 2 children, and 3 adults and 1 child respectively. A special holiday prize is awarded by first choosing a caravan at random and then choosing one of its occupants at random. Given the information that the winner was a child, find the probability that it came from caravan X.
Solution.
With obvious notation we require P(X|C). Since P(X ∩ C) = P(C|X) P(X),

P(X|C) = P(X ∩ C)/P(C)
       = P(C|X) P(X) / [P(C|X) P(X) + P(C|Y) P(Y) + P(C|Z) P(Z)]   (3.4)
       = (4/6 × 1/3) / (4/6 × 1/3 + 2/6 × 1/3 + 1/4 × 1/3)
       = (8/36)/(15/36) = 8/15.
A similar calculation for the engineer's three computers gives

P(A|L) = 40/73,   (3.6)

and likewise P(B|L) = 24/73 and P(C|L) = 9/73.
We conclude that the engineer should first go to visit computer A.
S = ∪_{i=1}^{N} A_i   (3.10)

and thus

P(A) = P(A ∩ ∪_{i=1}^{N} A_i) = Σ_{i=1}^{N} P(A ∩ A_i)   (3.12)

and by conditional probability, we have

P(A) = Σ_{i=1}^{N} P(A ∩ A_i) = Σ_{i=1}^{N} P(A_i) P(A|A_i).   (3.13)
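A small numeric sketch of equation 3.13 (an addition; the partition probabilities and conditionals are made-up values):

    # A hypothetical partition A_1, A_2, A_3 of S.
    priors = [0.5, 0.3, 0.2]           # P(A_i); they must sum to 1
    likelihoods = [0.10, 0.20, 0.30]   # P(A|A_i)

    # Law of total probability: P(A) = sum_i P(A_i) P(A|A_i)
    pA = sum(p * l for p, l in zip(priors, likelihoods))
    print(pA)   # 0.05 + 0.06 + 0.06 = 0.17

    # Inverse (Bayes) probabilities P(A_i|A), as in the caravan example:
    print([p * l / pA for p, l in zip(priors, likelihoods)])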
3.1.2 Exercises
1. One of four dice is known to be biased, showing a 6, on average, three times
as often as each of the other scores. A die is chosen at random and thrown
three times and three 6s appear. What is the probability that it is one of the
unbiased dice?
2. A factory manufactures three different qualities of light bulb, A, B and C, in the ratio 1:2:3. The bulbs are indistinguishable in external appearance but extensive tests indicate that, on the average, 1% of type A, 4% of type B and 5% of type C are below the advertised standard. A batch of 6, all of the same type, is sent in error without a distinguishing label. If none of these turns out to be defective, estimate the probability that they were of type A.
3. Of a large sample of items, 40% were produced by machine A and 30% by
each of machines B and C. The machinery was very unreliable, machine A
producing 10% defective items, machine B 20% defective items and machine
C 30%. If an item, selected at random from the sample, proves to be defective,
what is the probability that it came from machine A?
4. A card is missing from a pack of 52 cards. If this is the only information you have, what is the probability that the missing card is a spade? The pack is well shuffled and the first card is removed and proves to be a spade. What would your assessment of the probability that the missing card is a spade be now? The card removed is now replaced and the pack shuffled. The top card again proves to be a spade. What is your assessment of the probability now?
5. Four bags I, II, III, IV, contain white and black balls as shown in the following
table.
Bag I II III IV
Number of white balls 1 2 3 4
Number of black balls 9 8 7 6
A die is thrown; if a one appears bag I is chosen, if a 2 or 3 bag II, if a 4 or
5 bag III, and if a 6 bag IV. A ball is then drawn at random from the bag
selected. If you are told that a black ball has been drawn, what should your
estimate be of the probability that it came from bag I?
6. A certain rare disease from which one in ten thousand of the population suffers is diagnosed by a test which reveals the presence of the disease in 95% of the cases of those tested who actually have the disease. However, it also incorrectly yields a positive reaction in 1% of the cases of those who are not suffering from the disease. If a person selected at random from the population shows a positive reaction, what is the probability that he is actually suffering from the disease?
7. One of four pennies is known to be double headed, the other three being
normal. One of the pennies is selected at random and tossed three times. If
the result is three heads, what is the probability that the coin tossed is the
double headed one?
9. In a hotel 70% of the guests are male and 30% female. Of the male guests the
proportions who are travelling on business and on holiday are 80% and 20%.
For female guests the proportions are 10% and 90%. The hotel gives a choice
of continental or English breakfasts. The proportion of male and female guests
on holiday who take the English breakfast are 80% and 50% respectively. The
proportion of guests on business taking English breakfast is the same for men
and women at 70%. Calculate:
(a) the proportion of guests in the hotel who are on holiday.
(b) the proportion of continental breakfasts served.
(c) the probability that a guest who has a continental breakfast is on business.
(d) the probability that a guest who has an English breakfast is a woman on
holiday.
10. One urn contained 3 white and 3 black balls. A second urn contained 4 white and 3 black balls. A ball is transferred unseen from the first to the second urn. Then a ball is drawn at random from the second urn. Calculate:
(a) the probability that the ball which is drawn is white;
(b) the probability that if the ball drawn is white, then the transferred ball was also white.
Chapter 4
The Notion of Random Variables
4.1 Variables
From basic high school algebra we know variables as placeholders for numbers that we do not yet know or that we want to hide for some reason. We then learn to solve equations of the type x + 1 = 2 for the variable x. The crucial understanding is that x, by and of itself, can take on any value whatsoever; it need not be 1. If the equation above is in addition supposed to be true, then the variable x is forced to take on the value 1.
We note that x in the above equation is not really variable in the sense that its value may change. It is really an unknown constant that we have labelled x for lack of a better name. We have, after some work, found a much better name for it, namely 1. Thus, x is not a variable in our sense of the word. It is a constant!
This is the understanding that allows us to then learn about straight lines, i.e.
the equation y = mx+b. The colloquial agreement is that x is called the independent
variable and y the dependent variable. This may sound silly but the reason is very
deep. Before we explain it, we note that both m and b are constants in the above
sense, i.e. they are not meant to change for a particular line and take on numerical
values.
Both x and y are variables in the sense that they may change. For a fixed m and b (and they are, of course, always fixed for any particular line), any particular value for x will allow precisely one value for y. Thus these two variables are dependent on each other. Fixing one fixes the other. It does not matter which one we vary and which one we solve for by our high school algebra from above. It is however important that we decide this in a systematic fashion. This is why the equation is usually written y = mx + b. You are meant to pick values for x at your leisure and then work out the value for y by simple arithmetic. This procedure makes x dependent only on your whim, i.e. independent of the problem at hand, and y dependent on x. Hence the names.
We then agree further that x shall be graphed on the horizontal axis and y on the vertical axis. The result is then a picture of several points which we have worked out so far. We may join them by a line and we obtain a picture of a straight line segment in between the smallest and largest values of x we have happened to choose. We can never, of course, draw the whole line as it is infinitely long.
Based on these two fundamental choices, we shall agree on something quite important: the horizontal axis will nearly always hold a piece of information that has been selected by a human being for some reason. These reasons may or may not be logical, fair, or unbiased. In fact, in the popular statistics that one may read in the newspaper, these choices are usually very carefully made so that the final picture looks most like what the author needs for the particular policy he or she wishes to propound. The vertical axis will consequently show the datum that has been arrived at either by calculation or experimentation. Again, frequently the experimentation is biased or faulty, but this is a topic for later.
To give an example, let us consider that we have carried out an experiment to ascertain the heights of a large number of men. We wish to display this information. Generally this is done by drawing a histogram. The basic choice here is that of bins. We choose to start with 100 cm and go to 250 cm in steps of 10 cm and count how many men are in each interval. Presumably, all the men asked will fall into one of these intervals and we may draw the result. Note that what is variable here is our choice of bin size only. The largest and smallest height and the number of men of any height is fixed by experimentation. Thus the only choice we have is the choice of bins. Having all bins of equal size is considered good practice, and it is the experience of many that the distribution looks like a bell curve. This will soon be called a normal distribution, as we will learn later on.
What does one do if one wants to establish the thesis that there are a large number of short men? An easy remedy for this is to introduce a single bin from 100 cm to 150 cm and then count in 10 cm steps. This will increase the count in the first bin tremendously. This is known as a biased plot. All you then have to do is to print the legend in a small font and in an inconvenient place using strange words, and this fact will completely go by most readers of the newspaper you work for. Unfortunately this little joke and its many variants are practised widely. So watch out for it!
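A sketch of this binning trick (an addition; the heights are simulated from an assumed bell curve, not real data):

    import random

    random.seed(3)
    heights = [random.gauss(175, 10) for _ in range(1000)]  # assumed data, cm

    def counts(bins):
        return [sum(lo <= h < hi for h in heights) for lo, hi in bins]

    fair = [(b, b + 10) for b in range(100, 250, 10)]
    biased = [(100, 150)] + [(b, b + 10) for b in range(150, 250, 10)]
    print(counts(fair)[:6])    # the 10 cm bins below 160 cm are all small
    print(counts(biased)[:1])  # one pooled 100-150 cm bin looks much larger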
Please be careful to note all the qualifying words in that definition! For a gambler, the roll of a die is a random event even though we have enough technology available to us to measure the strength and direction of the throw, the properties of the environment, etc. to be able to predict the numerical outcome of the throw. In fact, a simple gravitational model of the game of roulette, combined with educated guesses as to the angle of the table etc., made a group of mathematicians millions of dollars a few decades ago. Thus the degree of randomness of an event is something relative to the experiment. Fundamentally random events are those that we cannot, even theoretically, predict with any amount of presently available technology. Examples include the exact moment of decay of a radioactive atomic nucleus or the exact path taken by a tornado.
For the gambler, the variable x defined as the outcome of a single throw of a single die is a random variable. He can say that x will definitely be one of 1, 2, 3, 4, 5 or 6 and that, to the best of his knowledge, they are all equally likely to occur, but he cannot say any more than that. In fact, the occurrences of the six possibilities in practice are never truly equally likely, because of small imperfections in the physical makeup of the system under investigation. For a die, for example, a small chip on one side causes an imbalance that will show up as a non-uniformity in the throwing frequency over a long time. This will not, of course, matter for a game that will have a hundred throws or so unless the die is badly damaged, and as such is of no interest to the gambler. However, this should be borne in mind.
We are concerned with obtaining a sequence of random numbers for many practical applications. We simply note here that there exist algorithms for the computer that are perfectly deterministic in nature if examined line by line but that produce a sequence of numbers without apparent regularity. Philosophically speaking, the regularity in this sequence of numbers is so complex that the likelihood of arriving at it given the sequence is virtually zero. Thus, by the above definition, the numbers are random events. But we do know, in fact, the method by which they are obtained, and so we call them pseudo-random numbers. It is a complicated matter to study and compare these methods profitably and we shall not need to do so. It suffices to say that they have achieved a quality such that they can be considered (for all practical purposes) random. Furthermore, these methods are included in all standard data processing computer packages.
Chapter 5
Expectation
Probability theory and statistics are very closely linked. In this chapter our aim is to develop the ideas upon which this linkage is based. This is done through the concepts of random variables, mean, variance, standard deviation and expectation value.
To begin with, let us return to the simple probability problem of throwing an unbiased six-faced die. The outcome space is S = {1, 2, 3, 4, 5, 6}, and as we know, each of the elementary events has probability 1/6. For this example the set of integers {1, 2, 3, 4, 5, 6} is the set of values of the random variable that this experiment determines. It is called random since before the die is thrown we have no idea which one of the 6 values we will get. It is called a variable since from trial to trial we will get different values. In this case the experiment determines the random variable in a natural manner.
But now consider a different problem in probability. We have a bag containing 2 red balls, 1 blue ball and 3 green balls. We make many experiments of drawing 1 ball, with the ball drawn being replaced before each new trial. We know that:
where the number x_i is associated with the event e_i, is called a random variable defined on the outcome space. x_i is called the value of the random variable for (or at) the event e_i. We can write X(e_i) = x_i as in function theory.
Review the rules for assigning probabilities, recalling that:
1. p_i > 0 for 1 ≤ i ≤ n
2. Σ_{i=1}^{n} p_i = 1

(p_1 x_1 N + p_2 x_2 N + ... + p_n x_n N)/N = Σ_{i=1}^{n} p_i x_i   (5.4)

The theoretical value E(X) = Σ_{i=1}^{n} p_i x_i is called the expectation value of the variable X.
Example 21 An unbiased die is thrown. What is the expected score? (i.e. what is
the expectation value of the score?)
Solution.
Here E(X) = Σ_{i=1}^{6} p_i x_i = Σ_{i=1}^{6} i/6 = (1/6)(1 + 2 + 3 + 4 + 5 + 6) = 3.5.
Example 22 The inveterate gamblers A and B play a game with three coins. A pays B a stake of 10 peso and tosses the three coins in turn. If he gets 3 heads, B returns his stake plus a prize of 30 peso. For 2 consecutive heads, B returns the stake plus a prize of 10 peso. In all other cases B keeps the stake and there is no prize. Is this a fair game?
Solution.
[In game theory a game is said to be fair if the expectation value of gain is zero for both players.]
Here S = {(HHH), (HHT), (THH), (others)}, with P(HHH) = 1/8, P(HHT) = P(THH) = 1/8, and P(others) = 1 − 1/8 − 1/4 = 5/8. For the random variable we take player A's net gain: X(HHH) = 30, X(HHT) = X(THH) = 10, X(others) = −10. So A's expectation of net gain is 30/8 + 10/8 + 10/8 − 5 × 10/8 = 0. In a two-person game one player's gain is the other's loss, so we need not calculate separately for player B. The game is fair. Over a long period of time neither player should win or lose any money.
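Simulating the game (an added check, not in the original) confirms that the expected net gain is zero:

    import random

    random.seed(4)

    def net_gain():
        # A pays a 10 peso stake and tosses three coins in turn.
        toss = "".join(random.choice("HT") for _ in range(3))
        if toss == "HHH":
            return 30
        if toss in ("HHT", "THH"):   # two consecutive heads
            return 10
        return -10

    n = 200_000
    print(sum(net_gain() for _ in range(n)) / n)   # ≈ 0: the game is fair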
Example 23 A game can result in three possible outcomes whose probabilities are 1/4, 1/2, 1/4. The player's net gain is respectively x², −x, 1, where x > 0. (a) Find the player's expectation of net gain. (b) Is it possible to choose x so that the game is fair?
Solution.
(a) E = x²/4 − x/2 + 1/4 = (x² − 2x + 1)/4 = (x − 1)²/4.
(b) Yes. Simply set x = 1.
5.1.1 Exercises
1. The values of a random variable, together with their associated probabilities
for four dierent experiments, are given in the tables below. Calculate E(x)
in the four cases.
x 0 1 2 3 4 5
(i) i 1 2 1 1 2 2 (5.5)
pi 15 15 3 5 15 15
xi 2 1 0 1 2
(ii) 1 2 3 1 1 (5.6)
pi 10 5 10 10 10
xi 1 2 3 4 5 6
(iii) 1 3 1 2 1 1 (5.7)
pi 20 20 5 5 10 10
xi 1 2 3 4 5 6 7 8 9 10
(iv) (5.8)
pi 0 18 0 38 0 14 0 18 0 18
3. A player pays a certain sum of money to spin two coins. For two heads he
receives back 10 peso, for two tails he receives 2 peso, for a head and a tail he
receives nothing. In all four cases he forfeits his stake money. What should
the stake money be for the game to be fair?
4. Two dice are thrown. Find the expectation of the higher score showing (or
the score of one of them if they fall alike).
5. If the probability that a man aged 60 will survive another year is 0.9, what
premium should he be charged for a life insurance policy of $1000? (If he
survives the year he receives no money back.)
6. X1 and X2 are two random variables, each with values 0,1,2,3,...,9, and each
possessing a uniform probability distribution. Evaluate:
A player stakes a certain sum of money for the privilege of drawing two discs,
one from each bag. For two discs of the same color his stake is returned, and in
addition he is awarded a prize of 10 peso for two reds, 20 peso for two greens,
and 25 peso for two blues. For two discs of different colors he loses his stake.
Show that if the stake money is 8 peso he can anticipate gaining in the long
run, but that with the stake at 9 peso he should expect to lose.
8. The game of Question 7 is repeated, but the player now tosses a coin to decide
which bag he must choose from: if he tosses a head, he chooses bag I, if a
tail, bag II; he then draws a disc at random from the chosen bag, notes its
color and replaces the disc before the bag is well rattled and he is allowed to
draw a second disc. He is then awarded prizes based on the colors of the discs
according to the scheme of Question 7. Determine the minimum stake (to the
nearest peso) required to ensure that the player will show a loss in the long
run.
9. The path in the figure below represents a simple maze along which a rat is made to run. It starts at S and has to finish at F. If it makes a mistake at A by turning along AA' it will return to A and be forced, by the construction of the maze, to turn towards F, and similarly at each of the other junctions. The probability of taking either of the two paths available at each junction is 1/2. Find the expected number of mistakes the rat will make in running from A to F.
10. Two dice are thrown in one turn, each turn costing 5 peso. If a prize of
40 peso is given for a double 6, and a prize of 20 peso for any other double
(together in both cases with the stake money), determine the expected loss to
a player who plays the game 100 times.
11. A man put three $5 notes into one envelope and three $1 notes into a similar
envelope. Each year at Christmas, he chooses one envelope at random and
gives his nephew a note from it. As soon as either envelope is emptied by his
taking the last note from it the process ends.
a. State the different totals that the nephew may have received when the process ends.
b. For each of these totals calculate the chance of its occurrence.
c. Show that the nephew's expectation of gain is $12.375.
= E(X) − Σ_{i=1}^{n} p_i x_i = E(X) − E(X) = 0   (6.6)
]
In the actual examples it seems to be the cancelling out of the positive and negative deviations which causes the trouble. To avoid this, let us try the effect of squaring the deviations, i.e. we have for die 1:
This makes more sense, the larger value obtained for the second die corresponding to the fact that the values obtained with this die are on average further away from the mean than those obtained with the normal die.
The expectation value of the squared deviations of the random variable X is called the variance of X, and it is denoted by σ_x², i.e.

σ_x² = E[(X − E(X))²] = Σ_{i=1}^{n} p_i [x_i − E(X)]² = Σ_{i=1}^{n} p_i [x_i − μ_x]²   (6.8)
Example 24 Find the mean and variance of the random variable X whose probability distribution is:

X: 0    1    2    3    4
p: 1/8  1/4  1/4  1/4  1/8     (6.9)

Solution.

E(X) = μ_x = Σ_{i} p_i x_i = 0 × 1/8 + 1 × 1/4 + 2 × 1/4 + 3 × 1/4 + 4 × 1/8 = 2   (6.10)

σ_x² = Σ_{i} p_i (x_i − μ_x)² = 4/8 + 1/4 + 0 + 1/4 + 4/8 = 3/2   (6.11)
Example 25 Find the variance for the number of heads showing if three unbiased
coins are tossed simultaneously.
Solution.
The sample space, the probabilities and the random variable (1 for a head, 0 for a tail) are as shown below.

Outcome   Probability   X
3H        1/8           3
2H 1T     3/8           2
1H 2T     3/8           1
3T        1/8           0

E(X) = 3·(1/8) + 2·(3/8) + 1·(3/8) + 0·(1/8) = 3/2   (6.12)
The squared deviations Y are:

Y(3H) = (3 − 3/2)² = 9/4,  Y(2H) = (2 − 3/2)² = 1/4,  Y(1H) = (1 − 3/2)² = 1/4,  Y(0H) = (0 − 3/2)² = 9/4   (6.13)

σ_x² = Σ_{i=1}^4 p_i y_i = (1/8)(9/4) + (3/8)(1/4) + (3/8)(1/4) + (1/8)(9/4) = 3/4   (6.14)
Thus the mean or expectation value is 3/2 and the variance is 3/4.
σ_c² = Σ_{i=1}^n p_i (x_i + c − μ_c)² = Σ_{i=1}^n p_i (x_i + c − μ − c)² = Σ_{i=1}^n p_i (x_i − μ)² = σ²   (6.16)
Thus, as we might have expected, the mean is simply shifted by the same amount c
and the spread of the distribution is unaltered.
The results of this example give us a clue to the simplication of some numerical
work. We note the properties of what we can call a translated random variable:
μ_c = μ + c,   σ_c = σ   (6.17)

Here c = 10, so y = x + c and μ_y = μ_x + 10 = 1/2 + 10 = 10.5   (6.18)
σ_x² = σ_y² = Σ_{i=1}^{20} p_i (y_i − μ_y)² = Σ_{i=1}^{20} p_i y_i² − 2μ_y Σ_{i=1}^{20} p_i y_i + μ_y² Σ_{i=1}^{20} p_i   (6.19)

 = Σ_{i=1}^{20} p_i y_i² − 2μ_y² + μ_y² = Σ_{i=1}^{20} p_i y_i² − μ_y²   (6.20)

Hence σ_x² = (1/20)[2(1² + 2² + ⋯ + 9²) + 0² + 10²] − 1/4 = 33.5 − 0.25 = 33.25
The results of these examples can be generalized into a useful theorem as follows:
2. E[(X − a)²] = σ² + (μ − a)²

3. σ² = E[X²] − μ²
Proof 3 1. This is exactly the same as the first part of Example 37.9 with c replaced by −a.

2. E[(X − a)²] = Σ_{i=1}^n p_i (x_i − a)²
 = Σ_{i=1}^n p_i [(x_i − μ) + (μ − a)]²
 = Σ_{i=1}^n p_i (x_i − μ)² + 2(μ − a) Σ_{i=1}^n p_i (x_i − μ) + (μ − a)² Σ_{i=1}^n p_i

So noting that Σ_{i=1}^n p_i = 1 and Σ_{i=1}^n p_i (x_i − μ) = Σ_{i=1}^n p_i x_i − μ Σ_{i=1}^n p_i = μ − μ = 0, we have E[(X − a)²] = σ² + (μ − a)².
3. This follows at once from part (ii) on setting a = 0. This last result is really
only a corollary to the theorem but I want to stress its importance. It is the
preferred method for calculating the variance.
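To see that the definition and the shortcut agree numerically, here is a short sketch in Python applying both the definition (6.8) and σ² = E[X²] − μ² to the distribution of Example 24; the variable names are ours, not the text's.

    xs = [0, 1, 2, 3, 4]
    ps = [1/8, 1/4, 1/4, 1/4, 1/8]

    mu = sum(p * x for p, x in zip(ps, xs))                        # E(X) = 2
    var_def = sum(p * (x - mu) ** 2 for p, x in zip(ps, xs))       # definition (6.8)
    var_short = sum(p * x * x for p, x in zip(ps, xs)) - mu ** 2   # E(X^2) - mu^2

    print(mu, var_def, var_short)   # 2.0 1.5 1.5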
For the scaled and translated random variable Y = λX + a (with λ > 0) we have:
1. μ_y = λμ + a
2. σ_y = λσ
(Note that the results of Example 9 follow at once from this theorem if we set λ = 1.)
Proof 4 1. E(λX + a) = Σ_{i=1}^n p_i (λx_i + a) = λ Σ_{i=1}^n p_i x_i + a Σ_{i=1}^n p_i = λE(X) + a = λμ + a = μ_y   (6.21)

2. E[(Y − μ_y)²] = Σ_{i=1}^n p_i [λx_i + a − λμ − a]² = λ² Σ_{i=1}^n p_i [x_i − μ]² = λ²σ²   (6.22)

σ_y = λσ   (6.23)
As an illustration, consider the frequency distribution (70 observations in all):

x   10   20   30   40   50   60
f   15   13   12   10    9   11          (6.24)

With the coded variable y = (x − 30)/10 this becomes

y   −2      −1      0       1       2      3
p   15/70   13/70   12/70   10/70   9/70   11/70          (6.25)

μ_y = (1/70)[(−2)(15) + (−1)(13) + 0(12) + 1(10) + 2(9) + 3(11)]   (6.26)
    = (1/70)[−43 + 61] = 18/70 = 9/35   (6.27)

But x = 10y + 30, so μ_x = 10μ_y + 30 = 10(9/35) + 30 = 1140/35 = 32.57.

σ_y² = E(Y²) − μ_y² = (1/70)[15(4) + 13(1) + 12(0) + 10(1) + 9(4) + 11(9)] − (9/35)²   (6.28)
    = 218/70 − 0.066 = 3.048   (6.29)

Thus σ_y = 1.746 and σ_x = 10σ_y = 17.46.
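The arithmetic of the coded variable can be checked directly; the sketch below (our illustration, not from the text) computes the mean and standard deviation both from the raw x values and via y = (x − 30)/10.

    xs = [10, 20, 30, 40, 50, 60]
    fs = [15, 13, 12, 10, 9, 11]
    n = sum(fs)                                      # 70 observations

    # direct calculation on x
    mu = sum(f * x for f, x in zip(fs, xs)) / n
    var = sum(f * (x - mu) ** 2 for f, x in zip(fs, xs)) / n

    # via the coded variable y = (x - 30)/10, so x = 10y + 30
    ys = [(x - 30) / 10 for x in xs]
    mu_y = sum(f * y for f, y in zip(fs, ys)) / n
    var_y = sum(f * y * y for f, y in zip(fs, ys)) / n - mu_y ** 2

    print(mu, 10 * mu_y + 30)               # both 32.57...
    print(var ** 0.5, 10 * var_y ** 0.5)    # both 17.46...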
Example 29 Two unbiased dice are thrown. Find the expectation value of the total score and its standard deviation.
Solution.
The possible scores with their probabilities are listed below.

score   2      3      4      5      6      7      8      9      10     11     12
prob    1/36   2/36   3/36   4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36          (6.30)

Let Y = X − 7. This leads to a new table:

y       −5     −4     −3     −2     −1     0      1      2      3      4      5
prob    1/36   2/36   3/36   4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36          (6.31)

μ_y = (1/36)[−5 − 8 − 9 − 8 − 5 + 5 + 8 + 9 + 8 + 5] = 0   (6.32)

μ_x = μ_y + 7 = 7   (6.33)

σ_y² = E(Y²) − μ_y²   (6.34)
     = (1/36)[25 + 32 + 27 + 16 + 5 + 5 + 16 + 27 + 32 + 25] = 210/36   (6.35)

But σ_x = σ_y = √(210/36) = √(35/6) = 2.42
Example 31 A and B alternately throw a die. The game ends in a win for A if he throws a 1 or a 6, or in a win for B if he throws a 2, 3, 4, 5. Find the probability that A wins and, given that he wins, find the average number of throws that he takes. It is assumed that A throws first.
Solution.
This game could theoretically go on for ever.
Clearly P(A wins on a particular throw) = 1/3.
P(A wins 1st time) = 1/3
P(A wins 2nd time) = (2/3)(1/3)(1/3)
P(A wins 3rd time) = (2/3)(1/3)(2/3)(1/3)(1/3)
P(A wins on nth attempt) = (1/3)(2/9)^(n−1)

P(A wins) = (1/3)[1 + 2/9 + (2/9)² + ⋯] = (1/3) · 1/(1 − 2/9) = 3/7
We now need the expected number of throws of A given that A wins. Since it is now given that he wins, the probabilities are modified, e.g.

P(X = r | A wins) = P(X = r and A wins) / P(A wins).   (6.39)

This is the probability that A wins on the r-th try given that we know that he did win at some point. Thus all the previous probabilities are multiplied by 1/P(A wins) = 1/(3/7) = 7/3.
E(no. of throws given that A wins) = 1·(1/3)(7/3) + 2·(2/9)(1/3)(7/3) + 3·(2/9)²(1/3)(7/3) + ⋯
 = (7/9)[1 + 2(2/9) + 3(2/9)² + ⋯]   (6.40)

But from Example 37.13,

(1 − x)^(−2) = 1 + 2x + 3x² + ⋯   (6.41)

so 1 + 2(2/9) + 3(2/9)² + ⋯ = (1 − 2/9)^(−2) = 81/49 and

E(no. of throws) = (7/9) · (81/49) = 9/7   (6.42)
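Both answers, P(A wins) = 3/7 and the conditional mean of 9/7 throws, can be confirmed by simulation. A minimal sketch of our own, in Python:

    import random

    def play():
        # returns ('A' or 'B', number of throws A made) for one game
        throws_a = 0
        while True:
            throws_a += 1
            if random.randint(1, 6) in (1, 6):        # A throws a 1 or a 6: A wins
                return 'A', throws_a
            if random.randint(1, 6) in (2, 3, 4, 5):  # B throws 2,3,4,5: B wins
                return 'B', throws_a

    N = 200_000
    results = [play() for _ in range(N)]
    a_throws = [t for w, t in results if w == 'A']
    print(len(a_throws) / N)               # ~ 3/7 = 0.4286
    print(sum(a_throws) / len(a_throws))   # ~ 9/7 = 1.2857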
6.1.1 Exercises
1. The following table gives the number of children in each of 360 families.
No. of children   0    1    2     3    4    5   6   7   8
No. of families   38   91   108   76   39   5   2   0   1          (6.43)

Calculate the mean and standard deviation of the number of children per family.

2. X is a random variable with mean μ and standard deviation σ. Find the mean and standard deviation of each of the following random variables: a. −X; b. X + 1; c. 3X − 1; d. (X − μ)/σ

3. Calculate the mean and variance for each of the following distributions:

(a)  X   0     1     2      3      4     5
     p   1/4   1/8   1/16   1/16   1/8   3/8

(b)  X   1      2      3      4      5      6
     p   1/10   1/10   2/10   3/10   2/10   1/10

(c)  X   −3     −2     −1    0      1      2      3      4
     p   1/20   3/20   1/4   3/20   3/20   1/10   1/10   1/20
5. If the random variable X can take the values 1, 2, 3, ..., n, all values being equally likely, calculate the mean and variance of X.

7. A throws a pair of unbiased dice, B a pair of dice of which one is unbiased and the other is such that the probability of a six is p. If they throw in turn and the winner is the first to throw a double 6, find p, given that when A has the first throw the game is fair.

8. Two dice are thrown together and the scores added. What is the chance that the total score exceeds 8? Find the mean and standard deviation of the total score. What is the standard deviation of the score for a single die?

9. A card is drawn at random from a standard pack and scores the face value of the card (with ace as 1 and picture cards as 10 each). Find the mean and variance of the score. If the card is replaced, the pack well shuffled and a second card drawn, find the probability that the total score for both draws is 12.
10. Two bags contain red and white discs as shown in the table below:

          Red   White
Bag I      5     15          (6.44)
Bag II    10     10

One of the bags is selected at random and a disc drawn from it proves to be red. If the red discs are now valued at $1 each and the white discs are valueless, what would be a fair price to pay for the remaining discs in the selected bag?
11. (This problem is known as the St. Petersburg Paradox.) A coin is spun. If a head is obtained first time you are paid $1; if you get a tail followed by a head you receive $2; for two tails followed by a head $4, the next prize being $8 and so on. Show that, however much you are prepared to pay to play the game, your expected profit will be positive.
Criticize any assumptions you have made and indicate what further knowledge you would require before offering a more realistic fair price for the game. If the banker against whom you are playing starts with a capital of $100, what would be a fair price for you to offer him before playing the game?
6.2 Moments

When a set of values has a sufficiently strong central tendency, that is, a tendency to cluster around some particular value, then it may be useful to characterize the set by a few numbers that are related to its moments, the sums of integer powers of the values.
Best known is the mean of the values x₁, x₂, …, x_N,

x̄ = (1/N) Σ_{j=1}^N x_j   (6.45)

which estimates the value around which central clustering occurs. Note the use of an overbar to denote the mean; angle brackets ⟨x⟩ are an equally common notation. You should be aware that the mean is not the only available estimator of this quantity, nor is it necessarily the best one. For values drawn from a probability distribution with very broad tails, the mean may converge poorly, or not at all, as the number of sampled points is increased. Alternative estimators, the median and the mode, are mentioned at the end of this section.
Having characterized a distribution's central value, one conventionally next characterizes its width or variability around that value. Here again, more than one measure is available. Most common is the variance,

Var(x₁, x₂, …, x_N) = (1/(N − 1)) Σ_{j=1}^N (x_j − x̄)²   (6.46)

Equation 6.46 estimates the mean squared deviation of x from its mean value. There is a long story about why the denominator of 6.46 is N − 1 instead of N. If you have never heard that story, you may consult any good statistics text. Here we will be content to note that the N − 1 should be changed to N if you are ever in the situation of measuring the variance of a distribution whose mean x̄ is known a priori rather than being estimated from the data. (We might also comment that if the difference between N and N − 1 ever matters to you, then you are probably up to no good anyway - e.g., trying to substantiate a questionable hypothesis with marginal data.)
As the mean depends on the first moment of the data, so do the variance and standard deviation depend on the second moment. It is not uncommon, in real life, to be dealing with a distribution whose second moment does not exist (i.e., is infinite). In this case, the variance or standard deviation is useless as a measure of the data's width around its central value: the values obtained from equations 6.46 or 6.47 will not converge with increased numbers of points, nor show any consistency from data set to data set drawn from the same distribution. This can occur even when the width of the peak looks, by eye, perfectly finite. A more robust estimator of the width is the average deviation or mean absolute deviation, defined by

ADev(x₁, x₂, …, x_N) = (1/N) Σ_{j=1}^N |x_j − x̄|   (6.48)

One often substitutes the sample median x_med for x̄ in equation 6.48. For any fixed sample, the median in fact minimizes the mean absolute deviation. Statisticians have historically sniffed at the use of 6.48 instead of 6.46, since the absolute value brackets in 6.48 are nonanalytic and make theorem-proving difficult. In recent years, however, the fashion has changed, and the subject of robust estimation (meaning, estimation for broad distributions with significant numbers of outlier points) has become a popular and important one. Higher moments, or statistics involving higher powers of the input data, are almost always less robust than lower moments or statistics that involve only linear sums or (the lowest moment of all) counting.
Figure 6.1: Distributions whose third and fourth moments are significantly different from a normal (Gaussian) distribution. (a) Skewness or third moment. (b) Kurtosis or fourth moment.
That being the case, the skewness or third moment, and the kurtosis or fourth moment should be used with caution or, better yet, not at all.
The skewness characterizes the degree of asymmetry of a distribution around its mean. While the mean, standard deviation, and average deviation are dimensional quantities, that is, have the same units as the measured quantities x_j, the skewness is conventionally defined in such a way as to make it nondimensional. It is a pure number that characterizes only the shape of the distribution. The usual definition is

Skew(x₁, x₂, …, x_N) = (1/N) Σ_{j=1}^N [(x_j − x̄)/σ]³   (6.49)

where σ is the standard deviation, equation 6.47. Any measured skewness is only meaningful relative to its own standard deviation as an estimator, which
depends on the shape of the underlying distribution, and rather critically on its tails! For the idealized case of a normal (Gaussian) distribution, the standard deviation of 6.49 is approximately √(15/N) when x̄ is the true mean and √(6/N) when it is estimated by the sample mean, 6.45. In real life it is good practice to believe in skewnesses only when they are several or many times as large as this.
The kurtosis is also a nondimensional quantity. It measures the relative peakedness or flatness of a distribution. Relative to what? A normal distribution, what else! A distribution with positive kurtosis is termed leptokurtic; the outline of the Matterhorn is an example. A distribution with negative kurtosis is termed platykurtic; the outline of a loaf of bread is an example. (See Figure 6.1.) And, as you no doubt expect, an in-between distribution is termed mesokurtic.
The conventional definition of the kurtosis is

Kurt(x₁, x₂, …, x_N) = { (1/N) Σ_{j=1}^N [(x_j − x̄)/σ]⁴ } − 3   (6.50)

where the −3 term makes the value zero for a normal distribution.
The standard deviation of 6.50 as an estimator of the kurtosis of an underlying normal distribution is √(96/N) when σ is the true standard deviation, and √(24/N) when it is the sample estimate 6.47. However, the kurtosis depends on such a high moment that there are many real-life distributions for which the standard deviation of 6.50 as an estimator is effectively infinite.
Calculation of the quantities defined in this section is perfectly straightforward. Many textbooks use the binomial theorem to expand out the definitions into sums of various powers of the data, e.g., the familiar

Var(x₁, x₂, …, x_N) = (1/(N − 1)) [ Σ_{j=1}^N x_j² − N x̄² ]   (6.51)

but this can magnify the roundoff error by a large factor and is generally unjustifiable in terms of computing speed. A clever way to minimize roundoff error, especially for large samples, is to use the corrected two-pass algorithm: first calculate x̄, then calculate Var(x₁, x₂, …, x_N) by

Var(x₁, x₂, …, x_N) = (1/(N − 1)) { Σ_{j=1}^N (x_j − x̄)² − (1/N) [ Σ_{j=1}^N (x_j − x̄) ]² }   (6.52)

The second sum would be zero if x̄ were exact, but otherwise it does a good job of correcting the roundoff error in the first term.
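A direct transcription of equations 6.51 and 6.52 shows the point. The sketch below (ours, in plain Python) applies both formulas to data with a large common offset, where the textbook expansion 6.51 loses essentially all of its accuracy while the corrected two-pass formula 6.52 does not.

    def corrected_two_pass_var(data):
        # equation 6.52: second pass over the deviations from the sample mean
        n = len(data)
        mean = sum(data) / n                       # first pass
        s = sum((x - mean) ** 2 for x in data)     # sum of squared deviations
        e = sum(x - mean for x in data)            # zero in exact arithmetic
        return (s - e * e / n) / (n - 1)

    data = [1e9 + x for x in (1.0, 2.0, 3.0)]      # true variance is 1.0
    n = len(data)
    mean = sum(data) / n
    naive = (sum(x * x for x in data) - n * mean ** 2) / (n - 1)   # equation 6.51

    print(naive)                          # badly wrong (roundoff in the huge squares)
    print(corrected_two_pass_var(data))   # 1.0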
6.2.1 Semi-Invariants
The mean and variance of independent random variables are additive: if x and y are drawn independently from two, possibly different, probability distributions, then the mean and variance of x + y are the sums of the separate means and variances. If the central moments of a distribution are denoted M_k, so that, e.g., M₂ = Var(x), then the first few semi-invariants, denoted I_k, are given by

I₂ = M₂,  I₃ = M₃,  I₄ = M₄ − 3M₂²,  I₅ = M₅ − 10M₂M₃,  I₆ = M₆ − 15M₂M₄ − 10M₃² + 30M₂³   (6.55)

Notice that the skewness and kurtosis, equations 6.49 and 6.50, are simple powers of the semi-invariants,

Skew(x) = I₃/I₂^(3/2),  Kurt(x) = I₄/I₂²   (6.56)

A Gaussian distribution has all its semi-invariants higher than I₂ equal to zero. A Poisson distribution has all of its semi-invariants equal to its mean.
If a distribution has a strong central tendency, so that most of its area is under a single peak, then the median is an estimator of the central value. It is a more robust estimator than the mean is: the median fails as an estimator only if the area in the tails is large, while the mean fails if the first moment of the tails is large; it is easy to construct examples where the first moment of the tails is large even though their area is negligible.
The mode of a probability distribution function p(x) is the value of x where it
takes on a maximum value. The mode is useful primarily when there is a single, sharp
maximum, in which case it estimates the central value. Occasionally, a distribution
will be bimodal, with two relative maxima; then one may wish to know the two
modes individually. Note that, in such cases, the mean and median are not very
useful, since they will give only a compromise value between the two peaks.
6.2.3 Summing Up
The mean and variance are two examples of moments of distributions.
Definition 12 The nth raw moment (the moment about zero) of a distribution P(x) is defined by

μ′_n = ⟨xⁿ⟩   (6.59)

where

⟨f(x)⟩ = Σ f(x)P(x)   (discrete distribution)
⟨f(x)⟩ = ∫ f(x)P(x) dx   (continuous distribution)   (6.60)

We note that a distribution is not uniquely specified by its moments. Most commonly the moments are taken about the mean.

Definition 13 The central moments are the moments taken about the mean,

μ_n = ⟨(x − μ)ⁿ⟩   (6.62)
    = ∫ (x − μ)ⁿ P(x) dx   (6.63)

Clearly the first central moment is zero. The second central moment is the variance,

μ₂ = σ²   (6.64)
We now look at some special probability distributions to which we will apply our
general theory.
σ² = (1/n) Σ_{i=1}^n i² − ((n + 1)/2)² = (n + 1)(2n + 1)/6 − (n + 1)²/4 = (n² − 1)/12   (7.1)

number of trials is carried out. Before giving a formal definition of the binomial distribution we will look at a numerical example.
Consider the families in a large community which have exactly four children. We consider the birth of each child as a single trial resulting in two possible outcomes, a boy or a girl. The birth of all 4 children to make up a family forms a set of four repetitions of the same trial. We will ignore cases of twins, triplets and quads. Suppose that for this population we have P(boy born) = 0.51; P(girl born) = 0.49.
Now 4 child families can have 0, 1, 2, 3, 4 boys. Consider a family with 3 boys and a girl. This can happen in 4 ways with respect to the order in which the children are born, e.g. GBBB, BGBB, BBGB, BBBG. Thus P(3 boy family) = 4(0.51)³(0.49). If we now use the random variable X for the number of boys in a family we can tabulate the probabilities as follows.
x          0         1                2                  3                4
P(X = x)   (0.49)⁴   4(0.49)³(0.51)   6(0.49)²(0.51)²    4(0.49)(0.51)³   (0.51)⁴          (7.2)
We now note that if we used the binomial theorem to evaluate (0.49 + 0.51)⁴ we would get

(0.49 + 0.51)⁴ = (0.49)⁴ + 4(0.49)³(0.51) + 6(0.49)²(0.51)² + 4(0.49)(0.51)³ + (0.51)⁴   (7.3)

This is where the name binomial distribution really comes from. Also note that (0.49 + 0.51)⁴ = 1⁴ = 1, confirming without detailed calculation that all the probabilities for the various possible numbers of boys in the family add up to 1 as they should. A first obvious advantage of this is that we do not have to actually list the probabilities any more. Using our knowledge of binomial expansions we can write

P(X = x) = C(4, x) (0.51)^x (0.49)^(4−x),   x = 0, 1, 2, 3, 4   (7.4)
We will allow the results of this example to serve as a model for our general definition of a binomial distribution.
It is implicit in this definition that we are dealing with trials which can only have one of two possible outcomes. The terms 'success' and 'failure' are just convenient to use in a general discussion. The experiment consists of n repetitions of the Bernoulli trial and P(X = x) is the probability of getting x successes. Once the values of n and of p are specified so is the binomial distribution. Thus the notation B(n, p) is often used for a particular binomial distribution.
Example 32 A fair coin is tossed 5 times. Find the probability of getting three or
more heads.
Solution.
P(H) = P(T) = 1/2. We are working with B(5, 1/2). We require

P(3) + P(4) + P(5) = C(5, 3)(1/2)³(1/2)² + C(5, 4)(1/2)⁴(1/2) + C(5, 5)(1/2)⁵   (7.6)
 = (10 + 5 + 1)/32 = 1/2.   (7.7)
Of course we have solved problems like this before but we now have some machinery
for handling all problems of this type systematically.
Example 34 A fair coin is tossed n times. Given that n is even, what is the probability that the coin will fall heads exactly half the number of times? Do the numerical calculation for n = 10, 20, 50, 100.
Solution.
We are dealing with B(n, 1/2).

P(n/2 heads) = C(n, n/2) (1/2)^(n/2) (1/2)^(n/2) = C(n, n/2) (1/2)^n

For n = 10:   P(5) = C(10, 5)(1/2)^10 = 0.246
For n = 20:   P(10) = C(20, 10)(1/2)^20 = 0.176
For n = 50:   P(25) = C(50, 25)(1/2)^50 = 0.112
For n = 100:  P(50) = C(100, 50)(1/2)^100 = 0.08
This example shows the need for precision. As n increases the proportion of
heads gets closer and closer to 0.5 but the probability that exactly half of the tosses
will result in a head gets smaller and smaller.
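Both of the last two examples reduce to evaluating C(n, x) p^x (1 − p)^(n−x); here is a brief Python check (math.comb is standard library) reproducing the numbers above. The helper name is ours.

    from math import comb

    def binom_pmf(n, p, x):
        # P(X = x) for X ~ B(n, p)
        return comb(n, x) * p ** x * (1 - p) ** (n - x)

    # Example 32: P(3 or more heads in 5 tosses of a fair coin)
    print(sum(binom_pmf(5, 0.5, x) for x in (3, 4, 5)))   # 0.5

    # Example 34: P(exactly n/2 heads) for increasing n
    for n in (10, 20, 50, 100):
        print(n, binom_pmf(n, 0.5, n // 2))   # 0.246, 0.176, 0.112, 0.080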
7.2.1 Exercises
1. A coin is tossed 6 times. What is the probability of obtaining fewer than three
heads?
2. In a family of 5 children, what is the probability that there are (a) 2 boys and
3 girls, (b) at least 2 boys. Assume P(boy) = 0.51.
3. 5 seeds with probability of germination 0.3 are planted in each of 60 pots. How
many pots can be expected to have exactly three seedlings?
4. Calculate the probability of getting two sixes if four dice are thrown.
5. What is the probability of obtaining two clubs if a card is taken from each of 6 well shuffled packs? Compare this with the probability of two clubs if all six cards are taken from the same pack.

6. What is the probability of obtaining fewer than three ones if twelve dice are tossed?

7. The probability of a shot at goal actually going in is 1/5. What is the probability of two goals being scored in a match in which there are 14 shots at goal?

8. Racing cars are notoriously prone to break down. The probability of a car of a certain team finishing is 0.6. What is the probability that three out of four cars in this one team will not finish?

9. A gun has probability 0.3 of hitting a target and it takes at least three direct hits to destroy it. What is the probability that the target is destroyed if 5 shots are fired?
10. A tetrahedral die has its faces colored red, green, yellow and blue. If a group
of 8 such dice are thrown, calculate the probabilities of 0, 1, 2, 3, 4 red faces
being seen. How many times would you expect to see only 2 red faces if the
experiment of throwing the 8 dice was repeated 200 times?
Theorem 6 For any binomial distribution B(n, p) we can calculate the expectation value from the formula μ = E(X) = np.
Proof 5

E(X) = 0·P(0) + 1·P(1) + 2·P(2) + ⋯ + n·P(n)   (7.9)
 = Σ_{i=0}^n i C(n, i) p^i (1 − p)^(n−i) = Σ_{i=0}^n [i n!/(i!(n − i)!)] p^i (1 − p)^(n−i)   (7.10)
 = Σ_{i=1}^n [n!/((i − 1)!(n − i)!)] p^i q^(n−i)   where q = 1 − p for convenience   (7.11)
 = np Σ_{i=1}^n [(n − 1)!/((i − 1)!(n − i)!)] p^(i−1) q^(n−i)   (7.12)
 = np [q^(n−1) + C(n−1, 1) q^(n−2) p + C(n−1, 2) q^(n−3) p² + ⋯ + p^(n−1)]   (7.13)
 = np (q + p)^(n−1) = np   (7.14)

μ = E(X) = np   (7.15)
E[X(X − 1)] = 0(−1)P(0) + 1·0·P(1) + 2·1·P(2) + ⋯ + n(n − 1)P(n)   (7.18)
 = Σ_{i=0}^n i(i − 1) C(n, i) p^i q^(n−i) = Σ_{i=2}^n [i(i − 1) n!/(i!(n − i)!)] p^i q^(n−i)   (7.19)
 = n(n − 1)p² Σ_{i=2}^n [(n − 2)!/((i − 2)!(n − i)!)] p^(i−2) q^((n−2)−(i−2))   (7.20)
 = n(n − 1)p² (p + q)^(n−2) = n(n − 1)p²   (7.21)

σ² = E[X(X − 1)] + E(X) − μ² = n(n − 1)p² + np − n²p² = np(1 − p)   (7.22)

From this we have at once that the standard deviation of the binomial distribution B(n, p) is √(np(1 − p)).
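As a numerical check of Theorem 6 and the variance formula, the sketch below sums a full B(n, p) distribution (using the B(24, 0.02) of the example that follows) and compares against np and np(1 − p); the code is our illustration, not part of the text.

    from math import comb

    n, p = 24, 0.02
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]

    mean = sum(i * q for i, q in enumerate(pmf))
    var = sum(i * i * q for i, q in enumerate(pmf)) - mean ** 2

    print(mean, n * p)            # both 0.48
    print(var, n * p * (1 - p))   # both 0.4704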
Example 35 A machine produces resistors and 2% of them are defective. They are
packed into boxes of 24. Find the mean and standard deviation of the number of
defectives per box.
Solution.
[You may be interested to note that in statistics this situation would be handled by calling a defective resistor a 'success'.]
For us P(defective) = 0.02 = p and P(perfect) = 0.98 = 1 − p on trying any resistor at random. Since there are only the two possible outcomes, perfect and defective, we have a binomial distribution for the multiple trials (i.e. 24 per box); i.e. we have a case of B(24, 0.02).
Thus the mean number of defectives in a box of 24 is np = 24 × 0.02 = 0.48.
The standard deviation σ = √(np(1 − p)) = √(24(0.02)(0.98)) = 0.686.
μ = np = 500(0.04) = 20   (7.23)
σ = √(np(1 − p)) = √19.2 = 4.4

It is cheaper to pack an extra 30 items per box and sell on the basis that each box will contain at least 500 perfects rather than to set up elaborate testing procedures to find and replace the defectives in the output.
Example 37 A very large number of balls are in a bag, one eighth of them being black and the rest white. Twelve balls are drawn at random. Find (1) the probability of drawing 3 black and 9 white balls, and (2) the most likely number of black balls drawn.
Solution.
We neglect the change in the probabilities as the balls are drawn since there are a very large number of balls in the bag. We consider a black ball a success. P(drawing a black ball) = 1/8 and we consider B(12, 1/8).

1. P(3B and 9W) = C(12, 3) (1/8)³ (7/8)⁹ ≈ 0.129
2. P(r + 1 black)/P(r black) = [12!/((r + 1)!(11 − r)!)] (1/8)^(r+1) (7/8)^(11−r) ÷ [12!/(r!(12 − r)!)] (1/8)^r (7/8)^(12−r)   (7.28)
 = (12 − r)/(r + 1) × (1/8)/(7/8) = (12 − r)/(7(r + 1))

This ratio exceeds 1 when 12 − r > 7r + 7, i.e. when r < 5/8.   (7.29)

i.e. P(1) > P(0) but after that the probabilities decrease. Thus the most likely number of black balls is 1.
7.3.1 Exercises
1. What is the mean and standard deviation of the number of ones if 3 dice are
thrown repeatedly?
2. What is the mean and standard deviation of the number of heads if 5 coins
are tossed repeatedly?
3. Find the mean and standard deviation of a binomial distribution in which (a)
n = 20, p = 0.4 (b) n = 50, p = 0.7
6. In the manufacture of silicon chips P(success) = 0.4. If the chips are made in
batches of 80, what is the expected number of good chips per batch? What is
the lowest number of defectives per batch likely in thousands of batches?
12. In a quality control laboratory, samples of size 60 were examined and the num-
ber of defectives counted. Over thousands of trials, the number of defectives
never exceeded 14 nor was less than 2. Assuming unchanged probability over
the testing period, what was the approximate percentage defective?
13. A coin is tossed 4 times. Find the probability that heads appear
14. From a packet containing a large number of seeds, 40% of which are advertised to give red flowers and the others white, 10 plants are produced. What is the probability

15. (a) In a trial eight coins are tossed together. In 100 such trials how many times should one expect to get three heads and five tails?
(b) If 8% of articles in a large consignment are defective, what is the chance that a sample of 30 articles will contain fewer than three defectives?

16. Nine unbiased dice are thrown. Find P(r), the probability that r sixes appear, and hence determine the value of P(r + 1)/P(r). Find

17. Playing a certain 'one armed bandit' which is advertised to 'increase your money tenfold' costs 5 pesos a turn. The player is returned 50 pesos if more than eight balls out of a total of ten drop in a specified slot. The chance of any one ball dropping in this slot is p. Determine the chance of winning in a given turn and for p = 0.65 calculate the mean profit made by the machine on 500 turns. Evaluate the proportion of losing turns in which the player comes within one or two balls of winning for the case of p = 0.65.
Chapter 8
Probability Distributions: The Poisson Distribution

A random variable R follows a Poisson distribution if P(r) = e^(−a) a^r / r! for r = 0, 1, 2, …, where a is a positive constant.
The usual kinds of example used to introduce and justify the Poisson distribution run as follows. Consider a long distance telephone operator in a small town. It is found by observation over a three month period that the average number of long distance calls that she has to put through in the hour from 2 am to 3 am is 3 calls. We will assume that the time needed to make a connection is negligible. If we divide the hour up into a large number of equal short time periods, let us say 720, each of duration 5 seconds for definiteness, and also assume that the requests for such calls arrive randomly, then there is a probability of 3/720 = 1/240 of a connection being made in any one of the time intervals. [What further assumption(s) is implicit in this modelling?] We now pose the problem: on a given morning what is the probability of the operator handling 0 calls, 1 call, 2 calls etc. in the hour from 2 am to 3 am?
[The number of calls is artificially small, you may think. This helps us visualize what is going on in the argument that follows. Also larger numbers introduce some problems of computation which have nothing to do with the theory underpinning the Poisson distribution and which are best discussed after we know what the distribution is all about.]
We can get an approximate solution (and a very good one) by applying the binomial distribution as follows. This will give us some confidence when we try to use the Poisson distribution.
Since we have divided the one hour interval into 720 subintervals of 5 sec each, we can consider the problem as making 720 successive Bernoulli trials of the form 'a call arrives or does not arrive in the 5 sec period'. The probability of a call arriving in such a period is 1/240 and so we are studying B(720, 1/240).
Using this model:

P(0) = C(720, 0) (1/240)⁰ (1 − 1/240)^720 = 0.04948   (8.2)
P(1) = C(720, 1) (1/240)¹ (1 − 1/240)^719 = 0.1484   (8.3)
P(2) = C(720, 2) (1/240)² (1 − 1/240)^718 = 0.2242   (8.4)
P(3) = C(720, 3) (1/240)³ (1 − 1/240)^717 = 0.2245   (8.5)
P(4) = C(720, 4) (1/240)⁴ (1 − 1/240)^716 = 0.1684   (8.6)
(Note that before the age of electronic calculators these calculations would have
been somewhat tiresome.)
Now suppose we set a = 3 (which happens to be the mean number of calls received during this hour!!) in the Poisson distribution and calculate the corresponding probabilities.

P(r) = e^(−3) 3^r / r!   (8.7)

P(0) = e^(−3) = 0.04979   (8.8)
P(1) = e^(−3) · 3 = 0.1494   (8.9)
P(2) = e^(−3) · 3²/2! = 0.2240   (8.10)
P(3) = e^(−3) · 3³/3! = 0.2240   (8.11)
P(4) = e^(−3) · 3⁴/4! = 0.1680   (8.12)

Note how the results obtained compare very closely with the ones obtained using B(720, 1/240) (and with much simpler calculation).
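The side-by-side comparison is easy to reproduce. A small sketch of our own tabulating B(720, 1/240) against the Poisson probabilities with a = 3:

    from math import comb, exp, factorial

    n, p, a = 720, 1/240, 3.0
    for r in range(5):
        binom = comb(n, r) * p**r * (1 - p)**(n - r)      # equations 8.2-8.6
        poisson = exp(-a) * a**r / factorial(r)           # equations 8.8-8.12
        print(r, round(binom, 5), round(poisson, 5))
    # the two columns agree to about three significant figures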
It can be proved theoretically that as n → ∞ and p → 0 in such a way that np = a (i.e. the a = 3 of our example), the correspondence between the binomial distribution B(n, p) and the Poisson distribution e^(−a) a^r/r! gets better and better, the discrepancies tending to zero. [This is a result of pure math concerning the limits of the algebraic expressions. It does not give evidence for or against either distribution being a good model for a particular real situation. That must be decided by other means.]
However, in our example the lengths of the time subintervals into which our hour is divided soon become so small that we feel justified in saying B(10000, 3/10000) will be an almost perfect model for our problem. Moreover the results obtained from B(10000, 3/10000) will be for all practical purposes indistinguishable from those obtained by using the much easier to calculate e^(−a) a^r/r!. For problems of this nature we will from now on use the Poisson distribution without further ado.
Example 38 Accidents occur at an average rate of 4 per week. Find the probability that in a given week there will be (i) no accidents, (ii) at most 3 accidents, (iii) more than 4 accidents.
Solution.
Since we could clearly split the week into a very large number of subintervals with a very low probability of an accident happening in any one subinterval, a Poisson distribution with a = 4 seems reasonable.

P(0) = e^(−4) 4⁰/0! = 0.018   (8.13)

P(r ≤ 3) = e^(−4) [1 + 4 + 4²/2! + 4³/3!] = 0.433   (8.14)

P(r > 4) = 1 − e^(−4) [1 + 4 + 4²/2! + 4³/3! + 4⁴/4!] = 0.371   (8.15)
Theorem 8 The mean and variance of the Poisson distribution e^(−a) a^r/r! are both a.

Proof 7

μ = E(X) = Σ_{r=0}^∞ r e^(−a) a^r/r! = a e^(−a) Σ_{r=1}^∞ a^(r−1)/(r − 1)!   (8.16)
 = a e^(−a) [1 + a + a²/2! + ⋯] = a e^(−a) e^a = a   (8.17)

σ² = E(X²) − μ² = Σ_{r=0}^∞ r² e^(−a) a^r/r! − a² = e^(−a) a Σ_{r=1}^∞ r a^(r−1)/(r − 1)! − a²   (8.18)
 = e^(−a) a [1 + 2a + 3a²/2! + 4a³/3! + ⋯] − a²   (8.19)
 = e^(−a) a d/da [a + a² + a³/2! + a⁴/3! + ⋯] − a²   (8.20)
 = e^(−a) a d/da [a (1 + a + a²/2! + a³/3! + ⋯)] − a²   (8.21)
 = e^(−a) a d/da [a e^a] − a² = e^(−a) a [e^a + a e^a] − a² = a + a² − a² = a   (8.22)

Thus μ = σ² = a.
Example 39 Cars on a highway pass under a bridge at an average rate of 280 per
hour. What number is expected to pass under the bridge in 1.5 minutes? What is
the probability that this expected number does in fact pass under the bridge in a 1.5
minute period?
Solution.
We would assume from the given quantities that on average 280 × 1.5/60 = 7 cars pass under the bridge in a 1.5 minute period. Hence the number of cars actually passing under the bridge in a 1.5 minute period should follow a Poisson distribution with mean 7, i.e. P(r) = e^(−7) 7^r/r!, and P(7 cars pass) = e^(−7) 7⁷/7! ≈ 0.149.
Example 40 The proportion of car headlight bulbs which are faulty is 0.5%. If they
are packed in boxes of 250 what is the probability that the box will contain at least
one defective bulb?
Solution.
This is actually a binomial distribution with mean or expected value of 1.25 defective bulbs. However, by analogy with our first example, splitting the box into 250 bulb 'intervals' tells us that P(r) = e^(−1.25) (1.25)^r/r! ought to be a good approximation to the actual distribution. [Good because 250 is already a relatively large number and p = 1/200 is already small. But it is only an approximation since we cannot go to the limit. We can divide time intervals as finely as we like but not light bulbs.] Using this approximation P(0) = e^(−1.25) and P(at least one defective) = 1 − e^(−1.25) = 0.713.
This technique is often used on binomial distributions with large numbers of trials and small probabilities.
Example 41 Calls come into a telephone exchange at an average rate of 2.5 per
minute. Only 4 calls per minute can be handled. Find the probability that a particular
caller will have to be put on hold.
Solution.
The number of calls arriving per minute forms a Poisson distribution with mean 2.5. A caller cannot be serviced whenever 5 or more calls arrive in a minute. So we require

P(5) + P(6) + P(7) + ⋯ = 1 − P(0) − P(1) − P(2) − P(3) − P(4)   (8.23)

Using P(r) = e^(−2.5) (2.5)^r/r! we find P(caller put on hold) = 0.109.
Example 42 Quetelet and von Bortkiewicz collected data on the number of men in the Prussian army who were killed by being kicked by their horses whilst they were grooming them. These two devoted savants conducted this 'experiment' over a 20 year period in the late 19th C. They studied 10 separate cavalry corps (each containing the same number of men and horses) over this period. They tabulated deaths from horse kicks per corps per year. With 10 corps for 20 years this gave 200 observations. They tabulated these as follows:
Deaths in a corps in a year   0     1    2    3   4
Frequency                     109   65   22   3   1

The mean number of deaths per corps per year is (0×109 + 1×65 + 2×22 + 3×3 + 4×1)/200 = 122/200 = 0.61, so we fit a Poisson distribution with a = 0.61.

P(0) = e^(−0.61) = 0.5434,   200 × 0.5434 = 108.7   (109 in the table)
P(1) = e^(−0.61) (0.61)/1! = 0.3314,   200 × 0.3314 = 66.3   (8.25)
P(2) = e^(−0.61) (0.61)²/2! = 0.1011,   200 × 0.1011 = 20.2   (8.26)
P(3) = e^(−0.61) (0.61)³/3! = 0.02056,   200 × 0.02056 = 4.1   (8.27)
P(4) = e^(−0.61) (0.61)⁴/4! = 0.003135,   200 × 0.003135 = 0.6   (1 in the table)

The agreement is quite remarkable. Ill-disposed people have been known to assert that Q and von B bribed the horses with sugar lumps.
There is probably a Ph.D. in statistics awaiting the person who carries out a
similar piece of research on the probability of a lion tamer being eaten by his client.
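Fitting the Poisson model to the horse-kick data takes only a few lines; the sketch below (our own) recomputes a = 0.61 and the expected frequencies quoted above.

    from math import exp, factorial

    freq = {0: 109, 1: 65, 2: 22, 3: 3, 4: 1}     # observed deaths per corps per year
    n = sum(freq.values())                        # 200 corps-years
    a = sum(k * f for k, f in freq.items()) / n   # sample mean = 0.61

    for k, observed in freq.items():
        expected = n * exp(-a) * a**k / factorial(k)
        print(k, observed, round(expected, 1))    # 108.7, 66.3, 20.2, 4.1, 0.6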
It is certainly common in experimental statistics to try to use a Poisson distribution whenever the sample mean and variance are roughly equal, i.e. whenever μ ≈ σ².
Example 43 In a large unnamed capital city there are on average 1000 traffic accidents per week. What is the probability:
(i) that there are 800 accidents in a given week;
(ii) that there are 1500 accidents in a given week?
Solution.
We can argue that the Poisson distribution is still valid for the time interval and hence the p in B(n, p) can be made as small as we please.

So P(800) = e^(−1000) (1000)^800/800!, which is just fine until we try to calculate it!!

It can be done as follows:

ln P(800) = ln e^(−1000) + ln 1000^800 − ln 800! = −1000 + 800 ln 1000 − ln 800!   (8.28)
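In practice one evaluates ln 800! with a log-gamma function rather than by summing logarithms; Python's math.lgamma(r + 1) returns ln r! directly. A sketch (ours) of the calculation in equation 8.28:

    from math import exp, log, lgamma

    a, r = 1000, 800
    log_p = -a + r * log(a) - lgamma(r + 1)   # ln P(800) per equation 8.28
    print(log_p)                              # about -25.75
    print(exp(log_p))                         # P(800) is roughly 6.6e-12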
8.1.1 Exercises
1. An electrical component is packed in boxes of 100. If the proportion of de-
fectives is 2%, calculate the proportion of boxes with 2 or less defectives in
them.
5. A small car hire rm has 6 cars which it hires out by the day. The demand
for these cars follows a Poisson distribution with mean 3.5. Calculate the
expected number of days per year when (a) no cars are required, (b) all 6 are
in use but no customer is turned away, (c) some customers are turned away.
6. The average number of cars crossing a narrow bridge is 130 per hour. If more than 5 cars try to cross within 1.5 minutes a queue forms. What is the probability that a queue forms?
7. A car manufacturer gives a guarantee of 1 year and finds that on average 1 car in 14 needs attention during this period. What is the probability that a dealer for these cars will have to give attention to exactly 4 out of the 75 cars that he has sold? What is the probability that more than 4 will need attention?
8. A quantity of radioactive material emits 400 particles per hour. Calculate the
expected number of 10 second intervals in a period of 10 hours when 0,1,2,3,4
particles are emitted.
9. In a warehouse, 6 spare parts of a particular type are issued each week on
average. Find the number of spare parts which need to be kept in stock so
that there is a greater than 95% chance that all demands in a week can be
met.
10. A time traveller watching the World Cup soccer tournament in 2020 counted the total goals scored in each match, i.e. if England lost 10-1 to Andorra he counted 11. His total data is collected in the table below. Fit a theoretical Poisson distribution to this data.

Total goals         0    1    2    3    4   5
Number of matches   24   26   11   11   1   1

Number of hits      0     1     2    3    4   5
Number of squares   229   211   93   35   7   1
16. Find the possible values for the mean and standard deviation of a Poisson
distribution in which P(2) = 0.25.
17. The road accidents in a certain area occur at an average rate of one per two
days. Calculate the probability of 0,1,2,3,4,5,6 accidents per week in this area.
What is the most likely number of accidents per week? How many days in a
week are expected to be free of accidents?
18. Explain briefly what is meant by a Poisson distribution and show that for such a distribution the mean is equal to the variance. In a bakery 3600 cherries are added to a mixture which is later divided up to make 1200 small cakes.
19. The average proportion of bad eggs in an egg packing station is 1 in 2000. The
eggs are packed in boxes containing 6 eggs each.
(a) Evaluate the probability that a box contains exactly one bad egg.
(b) A housewife complains if she obtains two or more boxes with one bad egg
each per 100 boxes bought. What is the probability that she complains?
Chapter 9
of integration will give excellent approximations to the real world answers. Let us
now put these ideas into a more precise mathematical form.
To do this we will continue with our analysis of choosing a point P at random on
the line AB which is 10cm long and using this distance AP as the random variable
x.
It is reasonable to assume that all points on the line are equiprobable in this case. Suppose we assign the value 0 to the probability of a particular precise point (say the midpoint of AB for definiteness) being chosen. We know that the probability of some point on the line being chosen is 1. But if we attempt to calculate this probability from the above assumptions we arrive at the result 1 = 0 × (the number of points on the line)!! Since this is obvious nonsense we are forced to conclude that the attempt to assign probabilities in the manner that we have used so far is futile.
For the rest of this discussion refer to Figure 39.1.
We have divided the line into n equal subintervals, each of length Δx = 10/n. We have shown the point P as lying somewhere in the i-th subinterval. We can denote the probability of it falling in the i-th subinterval as p_i = f(x_{i−1})Δx = Δx/10, where we have introduced a function f(x) which in this example is constant over AB but which we identify by its value at the beginning of each of its intervals. This is not done to make life complicated but so as to be able to extend the method we are developing to cases which do not have equiprobable distributions. In this example f(x) is just the constant function 1/10.
If now we wish to assign a probability to the point P being somewhere in the interval [c,d], where for the moment we assume that c and d fall respectively at the beginning and the end of their particular subdivisions, it is only necessary to add up the probabilities for the subdivisions involved, i.e. P(c ≤ x ≤ d) = Σ_{i=m}^n f(x_{i−1})Δx, where m and n are the numbers of the first and last subintervals involved.
The next step is to allow the number of subdivisions to increase without limit. Then Δx tends to 0 and it becomes immaterial whether or not c and d are actually at the beginning or ends of their respective intervals. Even if f(x) were not constant it would also become immaterial whether or not we chose values at the endpoints or from within the intervals so long as f was at least a continuous function. You will recognize here the outlines of the process which leads to the Riemann integral of a continuous function on an interval [c,d].
Without more ado we will define the probability that the point falls somewhere in the interval [c,d] as P(c ≤ x ≤ d) = ∫_c^d f(x) dx, where in this example f(x) = 1/10. Evaluating this for our example we have P(c ≤ x ≤ d) = ∫_c^d (1/10) dx = (d − c)/10, which coincides with our common sense expectation of what that probability should be. Moreover, since according to this P(0 ≤ x ≤ 10) = ∫_0^10 (1/10) dx = 1, we have also ensured that the probability of the point being somewhere on the line is 1. As we know from our earlier work this is a vital prerequisite for a probability distribution.
Before we give a formal definition note that the function we have defined here could be redefined for −∞ < x < ∞ in the following way:

f : x → f(x),   f(x) = 0 for −∞ < x < 0;   f(x) = 1/10 for 0 ≤ x ≤ 10;   f(x) = 0 for 10 < x < ∞   (9.2)

2. f(x) ≥ 0 for all x ∈ R.
3. ∫_{−∞}^∞ f(x) dx = 1.
Now given a subinterval E = [x₁, x₂] of the real line R we define the probability of finding the value of the random variable X in E to be:

P(x ∈ E) = ∫_{x₁}^{x₂} f(x) dx.   (9.3)

Finally we define the cumulative distribution function F (cdf) corresponding to f by F(x) = ∫_{−∞}^x f(t) dt. Note that F(x) is the probability of finding X somewhere between the left end of the distribution (whether or not this is actually −∞) and the value x.
This follows at once of course from our earlier work on the relationship between integrals and areas, e.g. for our numerical example we have the situation of
The cdf has the following important properties:

1. lim_{x→−∞} F(x) = 0
2. lim_{x→+∞} F(x) = 1
3. dF/dx = f(x)

Properties 1 and 2 are obvious from the definitions. Property 3 follows from the definition of F(x) as an integral and the fundamental theorem of calculus.
The cdf is also useful in finding probabilities on intervals, e.g.

P(x₁ ≤ x ≤ x₂) = ∫_{x₁}^{x₂} f(x) dx = ∫_{−∞}^{x₂} f(x) dx − ∫_{−∞}^{x₁} f(x) dx = F(x₂) − F(x₁)   (9.4)
Let us stay for the moment with the problem of choosing a point at random on the line segment AB of length 10 cm, which we take to stretch from x = 0 to x = 10 on the real axis. However, we will now suppose that the choice is not equiprobable but that the person making the choice has a subconscious preference for points near the center of the segment. We might postulate a pdf for the situation now as follows:

f(x) = 0 for −∞ < x < 0;   kx for 0 ≤ x < 5;   k(10 − x) for 5 ≤ x ≤ 10;   0 for 10 < x < ∞

where k is a positive constant.
The area under the whole graph must equal 1; the graph is a triangle of base 10 and height 5k, so (1/2)(10)(5k) = 25k = 1 and k = 1/25. We used simple geometry to evaluate the area here, but to illustrate the integral we have:

P(3 ≤ x ≤ 6) = ∫_3^5 (x/25) dx + ∫_5^6 ((10 − x)/25) dx = [x²/50]_3^5 + [−(10 − x)²/50]_5^6
 = (25 − 9)/50 + (25 − 16)/50 = 16/50 + 9/50 = 1/2.   (9.6)
Another variation of the same problem would assign a pdf as follows:

f(x) = 0 for −∞ < x < 0;   kx(10 − x) for 0 ≤ x ≤ 10;   0 for 10 < x < ∞

where k is a positive constant. Normalizing,

∫_0^10 kx(10 − x) dx = k[5x² − x³/3]_0^10 = k(500 − 1000/3) = 500k/3 = 1  ⟹  k = 3/500   (9.7)

We also have P(3 ≤ x ≤ 6) = ∫_3^6 (3/500)(10x − x²) dx = (3/500)[5x² − x³/3]_3^6 = (3/500)(108 − 36) = 54/125.
Note that this is less than the previous result yet the probability is more centrally concentrated. Explain?
Definition 19 The lower and upper quartile values L and Q respectively of a random variable X with pdf f(x) are defined by the equations:

∫_{−∞}^L f(x) dx = 1/4,   ∫_{−∞}^Q f(x) dx = 3/4.   (9.8)
Example 44 Given a pdf f(x) for a random variable X on the interval [0,1] defined as f(x) = 0 for −∞ < x < 0; kx(1 − x) for 0 ≤ x ≤ 1; 0 for 1 < x < ∞, where k is a positive constant. Find

1. The value of k.
3. The pdf g and cdf G of the random variable W where w = x² on [0, 1]. Sketch graphs of f and F.

Solution.

1. We need ∫_{−∞}^∞ f(x) dx = 1 ⟹ ∫_0^1 kx(1 − x) dx = k/6 = 1 ⟹ k = 6.

The graphs of f(x) and F(x) are shown above. The vertical scale of the graph of f(x) is magnified.
where f(x) is the pdf for the distribution. As in the discrete case E(x) is often referred to as the mean μ. Now

∫_{−∞}^∞ (x − μ)² f(x) dx = ∫ x² f(x) dx − 2μ ∫ x f(x) dx + μ² ∫ f(x) dx
 = ∫ x² f(x) dx − 2μ² + μ² = E(x²) − μ²   (9.15)

and we note that this result is exactly the same as that obtained in the discrete case.
9.1.1 Exercise
1. f(x) = k/x² if x ≥ 1, f(x) = 0 if x < 1; find the value of k and then find P(x < 2)
4. Find the (cumulative) distribution function for each of the density functions
in Nos. 1-3.
5. Find the mean, median, mode and variance for the random variable with the pdf

f(x) = 0 for x < 0;   (3/4) x (2 − x) for 0 ≤ x ≤ 2;   0 for x > 2   (9.16)
6. Find the mean, median, mode and variance for the random variable with pdf

f(x) = 0 for x < −π/2;   (1/2) cos x for −π/2 ≤ x ≤ π/2;   0 for x > π/2   (9.17)
Find the cumulative distribution function F, and sketch the graphs of f and
F. Find the density and distribution functions, g and G, of the new random
variable Y where y = x2 and sketch the graphs of g and G.
8. A probability density function of a random variable X is defined as follows:

f(x) = λ x (x − 1) (x − 2) for 0 ≤ x < 1;   λ for 1 ≤ x ≤ 3;   0 otherwise,

where λ is a suitable constant. Calculate the expectation value of x. What is the probability that x ≥ μ?
9. The probability density function p(t) of the length of life, t hours, of a certain electronic component is given by p(t) = k e^(−kt) (0 ≤ t < ∞), where k is a positive constant. Show that the mean and standard deviation of this distribution are both equal to 1/k. Find the probability that the life of a component will be at least t₀ hours. Given that a particular component is already t₁ hours old and has not failed, show that the probability that it will last at least a further t₀ hours is e^(−kt₀).
A computer contains three components of this type and the failure of any one may be assumed to be independent of the failure of the others. Find the probability that:
(a) none of the three will have failed at t₀ hours.
(b) exactly one will fail in the first t₀ hours, another in the next t₀ hours and a third after more than 2t₀ hours.
σ² = (b² + ab + a²)/3 − (b² + 2ab + a²)/4 = (b − a)²/12   (9.19)

We summarize these results as

Theorem 9 The mean μ and the variance σ² of the uniform distribution of the above definition are given by μ = (a + b)/2 and σ² = (b − a)²/12.
For our problem this means 0 ≤ x < 1 or 4 < x ≤ 10, since f(x) = 0 outside 0 ≤ x ≤ 10. So

P(X : x² − 5x + 4 > 0) = ∫_0^1 (1/10) dx + ∫_4^10 (1/10) dx = 7/10

If AP = x the area of the circle is πx², and πx² > 10 ⟺ x > √(10/π), so

P(required) = ∫_{√(10/π)}^{10} (1/10) dx = 1 − (1/10)√(10/π).   (9.21)
9.2.1 Exercises
1. The line AB has length 10cm. An interval of length 2cm is marked at random
on the line, the positions of the interval being uniformly distributed. What is
the probability that the interval will contain the midpoint of AB?
6. A point P is marked on the side AB of the square ABCD, the points within
AB being uniformly distributed. Find the mean and variance of the area of
the triangle APD.
This is the famous bell shaped curve here centered on a mean of 2. Let us now
discuss without going into too much detail some of the experiments which can lead
to normal distributions.
Experiment 1.
If we measure the IQ of a very large section of the population it is found that the mean is about 100, and that there will be very few incidences below 60 or above 140. If the probabilities are worked out as relative frequencies i.e.
It is found then that many situations arising in the analysis of numerical data of various kinds produce these bell shaped curves. If we are to work with this kind of distribution mathematically we need a more powerful tool than a well drawn graph, however pretty this may be. We need a mathematical model for what we will call a normal probability distribution. [For the moment this is defined unsatisfactorily as the kind of distribution that the above experiments and others like them lead to.]
Consider the function y = A e^(−λ²x²). We have y > 0 for all x, and y → 0 very rapidly for |x| > 1. Also dy/dx = −2λ²Ax e^(−λ²x²), and dy/dx = 0 ⟺ x = 0.

d²y/dx² = −2λ²A e^(−λ²x²) + 4λ⁴Ax² e^(−λ²x²),  so  (d²y/dx²)|_{x=0} = −2λ²A < 0   (10.2)

d²y/dx² = 0 ⟺ −2λ² + 4λ⁴x² = 0 ⟺ x = ±1/(λ√2)   (10.3)

and the curve has points of inflexion when x = ±1/(λ√2). Analytically this looks to be a good candidate for a BSC, and we confirm this by the plot of y = e^(−x²/2) as in Figure 39.4.
Gauss then found that by suitable choices of A and λ, and possibly by translations so that the mean need not be at 0, he was able to get an extremely good fit to observed distributions. He then chose λ = 1/√2 and A = 1/√(2π) for special examination.
Note that the small φ is the usual notation for the standard normal distribution and is used in all the literature rather than f. For this to be a genuine pdf we must have (1/√(2π)) ∫_{−∞}^∞ e^(−x²/2) dx = 1. To check this we need to be able to evaluate ∫_{−∞}^∞ e^(−x²/2) dx. This is quite a tricky integral and for our purposes we are allowed to assume that ∫_{−∞}^∞ e^(−x²/2) dx = √(2π), which gives us the normalization required.
[For the curious, here is a sketch of one way to find this integral.

Let I = ∫_{−∞}^∞ e^(−x²/2) dx.   (10.4)

Consider I₁ = ∫_{−∞}^∞ e^(−x²/2) dx · ∫_{−∞}^∞ e^(−y²/2) dy, where the x and y are the usual x and y of the cartesian plane. We have I₁ = I². Assuming the integrals can be combined (and this would have to be proved) we have I² = ∫∫ e^(−(x²+y²)/2) dx dy. In polar coordinates,

I² = ∫_0^{2π} ∫_0^∞ e^(−r²/2) r dr dθ = 2π ∫_0^∞ r e^(−r²/2) dr = 2π [−e^(−r²/2)]_0^∞ = 2π   (10.5)

and I = √(2π) as required.]
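Since the improper integral is beyond our toolkit, a crude numerical check may be reassuring. The sketch below (ours) applies the midpoint rule on [−20, 20], outside of which the integrand is negligible, and recovers √(2π) ≈ 2.5066.

    from math import exp, pi, sqrt

    N, lo, hi = 400_000, -20.0, 20.0
    h = (hi - lo) / N
    total = h * sum(exp(-((lo + (i + 0.5) * h) ** 2) / 2) for i in range(N))
    print(total, sqrt(2 * pi))   # both about 2.5066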
It is perhaps worth noting that there is a well developed theory of integrals of this kind. They are called Gaussian Integrals and arise in many applications such as the kinetic theory of gases and quantum theory.
The qualifying word standard in the denition is important. This is obviously a
special case but by using the word standard we are implying that we hope to be able
to obtain a model for any normal distribution by suitably modifying this standard
distribution. This hope will be realized shortly.
Proof 9
μ = E(X) = (1/√(2π)) ∫_{−∞}^∞ x e^(−x²/2) dx = (1/√(2π)) [−e^(−x²/2)]_{−∞}^∞ = 0.   (10.6)

{Strictly, these integrals involving infinite limits have not so far been defined in our course. They are certainly not Riemann integrals. The following method of handling them is plausible:

(1/√(2π)) ∫_{−∞}^∞ x e^(−x²/2) dx = lim_{M→∞} (1/√(2π)) ∫_{−M}^M x e^(−x²/2) dx   (10.7)
 = lim_{M→∞} (1/√(2π)) [−e^(−M²/2) + e^(−M²/2)] = 0}

σ² = E(X²) = (1/√(2π)) ∫_{−∞}^∞ x² e^(−x²/2) dx   (10.8)

Let x = u and x e^(−x²/2) dx = dv, so du = dx and v = −e^(−x²/2).   (10.9)

Then √(2π) σ² = [−x e^(−x²/2)]_{−∞}^∞ + ∫_{−∞}^∞ e^(−x²/2) dx = 0 + √(2π)   (10.10)

so σ² = 1.
You will notice that in the interest of saving trees the negative values of x are not printed in your tables. However this is no big problem since the standard normal distribution is symmetric about x = 0. We have P(x ≤ −0.87) = P(x ≥ 0.87) = 1 − P(x ≤ 0.87) = 1 − 0.8078 = 0.1922.
We have P(−1.6 ≤ x ≤ 2.4) = P(x ≤ 2.4) − P(x ≤ −1.6) = P(x ≤ 2.4) − [1 − P(x ≤ 1.6)] = 0.9918 − [1 − 0.9452] = 0.937.
We now take note of the fact that it is extremely unlikely that an example arising from a real situation will exactly fit the standard normal distribution. We must then examine how we can modify the standard normal distribution so that we can use it to handle any normal distribution no matter what its mean and variance.
Suppose then that we have a distribution of a random variable x that we want to model as normal and which has an arbitrary mean μ and variance σ². We define a new random variable y in terms of the old one x by the equations

y = σx + μ   and   x = (y − μ)/σ   (10.12)
We now have dy = σ dx and

1 = (1/√(2π)) ∫_{−∞}^∞ e^(−x²/2) dx = (1/√(2π)) ∫_{−∞}^∞ e^(−(y−μ)²/(2σ²)) dy/σ = (1/(σ√(2π))) ∫_{−∞}^∞ e^(−(y−μ)²/(2σ²)) dy.

If we now define a new pdf by φ(y) = (1/(σ√(2π))) e^(−(y−μ)²/(2σ²)), the above line gives us the necessary normalization. Moreover the mean and variance of the new variable y will be μ and σ².

[e.g. E(y) = (1/(σ√(2π))) ∫_{−∞}^∞ y e^(−(y−μ)²/(2σ²)) dy. Let t = (y − μ)/σ, so y = σt + μ and

E(y) = (1/√(2π)) ∫_{−∞}^∞ (σt + μ) e^(−t²/2) dt = (σ/√(2π)) ∫_{−∞}^∞ t e^(−t²/2) dt + (μ/√(2π)) ∫_{−∞}^∞ e^(−t²/2) dt   (10.13)
 = 0 + μ = μ.   (10.14)

A similar reverse substitution and appeal to already calculated integrals will show that the variance of y is σ², as we hoped.]
We will not find tables calculated for the cumulative distribution function arising from this pdf as we have for the standard normal distribution. Since an arbitrary normal distribution N(μ, σ) can have any value of μ and σ it would be quite impracticable to construct such tables. However we can proceed as follows.
First, if we have a normal distribution N(μ, σ) and retaining the variable name x instead of the y above, the cumulative distribution function is denoted by I and we have

I(a) = (1/(σ√(2π))) ∫_{−∞}^a e^(−(x−μ)²/(2σ²)) dx   (10.15)
Let t = (x − μ)/σ, so dt = dx/σ and

I(a) = (1/√(2π)) ∫_{−∞}^{(a−μ)/σ} e^(−t²/2) dt = Φ((a − μ)/σ).

From this we deduce the final, and the most powerful, result of our theory: for a normal distribution N(μ, σ) we have P(x ≤ a) = Φ((a − μ)/σ), so that every probability for a normal distribution can be read off from the standard normal tables.
Example 54 Given that P(x ≥ 1.7) = 0.6331 for a normal distribution of mean 2, find its variance.
Solution.

P(x ≤ 1.7) = 1 − 0.6331 = 0.3669,  so  Φ((1.7 − 2)/σ) = 0.3669   (10.19)

But 0.3669 is not in the table!
Since Φ(0) = 0.5 and the negative values of x are not in the table this will often happen. If x = −a (with a ≥ 0) is the value we need, clearly

0.3669 = P(x ≤ −a) = P(x ≥ a) = 1 − P(x ≤ a)  ⟹  P(x ≤ a) = 0.6331   (10.20)

[We are back where we started! In practice the above digression is not needed then, but it may help you to understand what is going on.]
From the table a = 0.34. So the value that we require is −0.34 and

(1.7 − 2)/σ = −0.34  ⟹  σ = 0.3/0.34 = 0.882   (10.21)

Thus the variance is σ² = 0.779.
Example 55 Given P(x ≤ 5.4) = 0.8710 for a normal distribution with variance 4, find the mean.
Solution.

Φ((5.4 − μ)/2) = 0.8710  ⟹  (5.4 − μ)/2 = 1.13  ⟹  μ = 5.4 − 2.26 = 3.14   (10.22)
Let us now look at a more realistic problem involving a normal distribution.
Example 56 The heights of a large number of young children are measured correct to the nearest centimeter and the mean and standard deviation of the resulting frequency distribution are calculated and found to have values 122 cm and 5.2 cm respectively. Hence a statistical model is assumed of a distribution N(122, 5.2). Calculate the probability of a random measurement of a child of the group yielding a result in each of the class intervals: x ≤ 105, 105 < x ≤ 110, 125 < x ≤ 130.
Solution.
In view of the accuracy limitation (to the nearest centimeter) and the fact that we are assuming a continuous distribution, the first probability interval will be better modelled by P(x ≤ 105.5), the second by P(105.5 < x ≤ 110.5) and so on.
[The analysis of the errors introduced by treating a discrete distribution as a continuous one and the effect of experimental error forms part of advanced statistics. However the fact remains that a simple common sense approach such as we are adopting here will give results that are as good as can be expected given the imperfections in the data.]
The transformed variable is t = (x − 122)/5.2.

P(x ≤ 105.5) = Φ((105.5 − 122)/5.2) = Φ(−3.173) = 0.0008   (10.23)

P(105.5 < x ≤ 110.5) = Φ((110.5 − 122)/5.2) − Φ((105.5 − 122)/5.2)   (10.24)
 = Φ(−2.212) − 0.0008 = (1 − 0.9864) − 0.0008 = 0.013   (10.25)

P(125.5 < x ≤ 130.5) = Φ((130.5 − 122)/5.2) − Φ((125.5 − 122)/5.2)   (10.26)
 = Φ(1.635) − Φ(0.673) = 0.949 − 0.749 = 0.20   (10.27)
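Instead of tables, Φ can be computed from the error function, which Python exposes as math.erf; Φ(z) = (1 + erf(z/√2))/2 is a standard identity, not something specific to this text. Recomputing Example 56:

    from math import erf, sqrt

    def Phi(z):
        # standard normal cdf via the error function
        return 0.5 * (1 + erf(z / sqrt(2)))

    mu, sigma = 122, 5.2
    print(Phi((105.5 - mu) / sigma))                              # ~0.0008
    print(Phi((110.5 - mu) / sigma) - Phi((105.5 - mu) / sigma))  # ~0.013
    print(Phi((130.5 - mu) / sigma) - Phi((125.5 - mu) / sigma))  # ~0.199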
10.1.1 Exercises
1. For N(0, 1) find a. P(x ≤ 0.79) b. P(x ≤ −0.48)
11. Find ∫_{1.4}^{2} (1/√(2π)) e^(−z²/2) dz

12. Find ∫_{−∞}^{5} (1/(3√(2π))) e^(−(x−4)²/18) dx
13. The weights of army recruits form a normal distribution with mean 69.8 kg.
standard deviation 6.2 kg. Out of 976 recruits how many can be expected to
weigh more than 80 kg?
14. A machine making connecting pins produces them with lengths normally distributed about a mean of 1.8 cm. The variance is 0.0001 cm². If 3% are rejected for oversize and 4.5% for undersize, find the tolerance limits for acceptance.

15. A light bulb test shows that 10% last longer than 2140 hours and that the mean is 1630 hours. Assuming that the lifetime of the bulbs is normally distributed, find the standard deviation and the percentage which would fail to meet the guarantee of 1000 hours.

16. In a normal distribution, 12.5% of the distribution is less than 14 and 20% is less than 17. Find its mean and standard deviation.

17. Packets of sugar, nominally 1 kg, have weights normally distributed with mean 1.05 kg and standard deviation 0.017 kg. Find the percentage of packets which weigh less than 1 kg. On average, how many packets labelled 1 kg can be produced on this machine from 1 tonne of sugar? The setting of the machine is now altered. The standard deviation remains the same and the mean is reduced until 4.9% of the packets contain less than 1 kg. Find the new mean and the number of packets which can now be obtained from 1 tonne of sugar.

18. The IQs of 500 students are assumed to be normally distributed with mean 105 and standard deviation 12. How many students may be expected:
15. Rods are manufactured with a mean length of 20.2 cm. and standard deviation
0.09 cm. the distribution being normal. If rods of length less than 20.1 cm
are rejected what is the probability that a rod which is retained has a length
in excess of 20.3 cm?
The approximating normal distribution is N(12.5, 2.5). Using the continuity correction we require P(14.5 ≤ X ≤ 15.5). Standardizing,

z₁ = (14.5 − 12.5)/2.5 = 0.8,   z₂ = (15.5 − 12.5)/2.5 = 1.2
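The binomial behind this N(12.5, 2.5) is not shown in this copy, but a mean of 12.5 with standard deviation 2.5 forces B(25, 1/2) (np = 12.5, np(1 − p) = 6.25), and the quantity being approximated is evidently P(X = 15). Under that assumption, a quick comparison of the exact value with the continuity-corrected approximation:

    from math import comb, erf, sqrt

    def Phi(z):
        return 0.5 * (1 + erf(z / sqrt(2)))

    n, p = 25, 0.5                 # inferred: N(12.5, 2.5) approximates B(25, 1/2)
    exact = comb(n, 15) * p ** n   # P(X = 15) exactly
    approx = Phi(1.2) - Phi(0.8)   # P(14.5 <= X <= 15.5) under N(12.5, 2.5)
    print(exact, approx)           # 0.0974 vs 0.0968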
10.2.1 Exercises
1. A die is thrown 1200 times. Use the normal approximation to nd the proba-
bility that the number of sixes lies in the range 191 - 209.
2. A fair coin is thrown 100 times. What is the probability that there will be
fewer than 40 heads?
3. Patients suffering from a certain disease have a 60% chance of recovering without treatment. A new drug is being tested but in fact has no effect. What is the probability that 15 or more patients in a group of 20 taking the new drug will recover?
4. To assess public opinion, 80 inquisitors each ask 50 people to give their opinion either for or against the immolation of math teachers as heretics. If on average 60% of the population are for the measure, what is the probability that an individual inquisitor will report a majority against? What is the probability that more than 10% of the inquisitors will report a majority against?
5. Using the data of Example 39.14 plot on the same axes the probabilities obtained by using (a) the exact binomial distribution and (b) the probabilities obtained by using the normal approximation. Plot for, say, 8 to 18 tails, as the probabilities become insignificant outside this range. Plot the binomial distribution as a step function, letting the value calculated for 12 be constant on the interval $11.5 \le x \le 12.5$ etc. This will give you an empirical justification for the technique. It also shows that in a situation like that of Example 39.15 the errors cancel out to some extent.
score      1    2    3    4    5    6
frequency  83  105   95  104  111  102        (11.1)

         1    2    3    4    5    6   Totals
O       83  105   95  104  111  102   600
E      100  100  100  100  100  100   600     (11.2)
O − E  −17    5   −5    4   11    2     0
4. If ν = 1, forget it.
Number of girls                  0     1     2     3     4     5   Total
O (actual no. of girls)         12    58   113   111    61    13   368
E (theoretical no., 1 d.p.)   11.1  56.4 114.2 115.7  58.6  11.8  367.8     (11.5)
O − E                          0.9   1.6  −1.2  −4.7   2.4   1.2    0.2
[A note on the rounding: obviously the observed values are integers. The statistician's rule is to calculate theoretical expectation values to one more decimal place of precision than the observed values; so here, 1 decimal place.]
Using the above values,
$$\chi^2 = \frac{0.9^2}{11.1} + \frac{1.6^2}{56.4} + \frac{1.2^2}{114.2} + \frac{4.7^2}{115.7} + \frac{2.4^2}{58.6} + \frac{1.2^2}{11.8} = 0.542 \quad (11.6)$$
There are 6 classes.
There are 2 constraints in this problem. Clearly E is calculated from the data to give one constraint as before. Also the mean of the theoretical distribution is calculated from the data. p of course need not be counted since it follows at once from the mean and E.
So ν = 4 and our test statistic is $\chi^2_4 = 0.542$. From the tables we see that approximately 96% of the values are greater than this value of $\chi^2_4$. This is a very good fit indeed and might lead one to conclude that the data had been fiddled. [In fact the example comes from a book where the authors did in fact fiddle the data to avoid problems in an earlier section on the binomial distribution. They admit it later.]
Example 61 Random samples of 10 grains of wheat were weighed in grams and the
results divided into class intervals as in the table below.
wt. of 10 grains (g)   2.1-2.2  2.2-2.3  2.3-2.4  2.4-2.5  2.5-2.6  2.6-2.7  2.7-2.8  Total
No. of samples               5       46      183      342      289      115       20   1000     (11.7)
Find the mean and standard deviation of these sample weights. Test at the 5% significance level the goodness of fit of the data with a normal distribution of the same mean and standard deviation.
Solution.
We use the midpoints of the class intervals as random variable, i.e. 2.15, 2.25, 2.35, 2.45, 2.55, 2.65, 2.75.
By using a guessed mean of 2.45 we have the table

x − 2.45   −0.3  −0.2  −0.1    0   0.1   0.2   0.3
f             5    46   183  342   289   115    20        (11.8)

Thus $\sum f(x - 2.45) = 28.9$ and $\bar{x} - 2.45 = 28.9/1000 = 0.0289$, so $\bar{x} = 2.4789$ g.  (11.9)
Example 62 In 1979 petrol tanker (gasoline truck) drivers voted on whether or not to accept a pay offer. The results were tabulated by company as follows.
This is called a contingency table. The essential feature of what makes it a contingency table appears to be that it makes sense to sum both vertically and horizontally. In our previous examples in this section vertical summation would have produced meaningless figures.
The question arising out of this particular contingency table is: Is there evidence of any difference of response between drivers of different companies?
The null hypothesis is that there is no difference. Now 3563 out of 5781 drivers voted for and 2218 out of 5781 voted against. Since there are 2277 Shell drivers, if the null hypothesis is true we would expect $\frac{3563}{5781} \times 2277 = 1403.4$ to vote for. In this way we complete the following table of expected values on the null hypothesis.
[A little thought will convince you that the results must come out in this symmetric manner. Doing the complete calculation serves as a check on the arithmetic to this point.] The value of $\chi^2$ is then computed from these differences.
11.1.1 Exercises
1. How many degrees of freedom are required for a $\chi^2$ test, for testing data against the following distributions? Where parameters are known independently of the data they are given.
2. Use tables to look up the following values. State whether the results are significant at the 5% or 1% levels, or whether they seem too good to be true.
(a) $\chi^2_4 = 9.60$
(b) $\chi^2_{11} = 2.51$
(c) $\chi^2_{20} = 26.52$
(d) $\chi^2_{12} = 36.04$
(a) $\chi^2_{12}$ at 5%
(b) $\chi^2_9$ at 5%
(c) $\chi^2_{100}$ at 1%
(d) $\chi^2_6$ at 99%
5. An experiment which was inconclusive was repeated three times and each time the value of $\chi^2$ was computed. First experiment: $\chi^2 = 9.32$, ν = 6. Second experiment: $\chi^2 = 16.51$, ν = 11. Third experiment: $\chi^2 = 13.82$, ν = 9. What conclusion can be drawn?
6. A repeated experiment gave $\chi^2_1 = 3.29$ and then $\chi^2_3 = 6.64$. Is this result significant at the 5% level?
7. Five coins are tossed 320 times. The number of heads is as given:

Number of heads   0   1    2    3   4  5  Total
Frequency        11  55  107  102  38  7   320        (11.17)
Score        1   2   3   4   5   6  Total
Frequency   21  17  24  22  15  21    120        (11.18)

Is this significant evidence at the 5% level that the die is not fair?
14. The different grades awarded by three examiners in English O level papers were as follows:

Examiner          R    S    T   Total
Grade A          21   14   38    73
Grade B          49   40   92   181
Grade C          55   43  103   201                (11.24)
Grade D or less  61   48  122   231
Total           186  145  355   686
For a 2×2 contingency table with entries

             Total
  a     b    a + b
  c     d    c + d
a + c  b + d  a + b + c + d

we have
$$\chi^2 = \frac{(ad - bc)^2\,(a + b + c + d)}{(a + c)(a + b)(b + d)(c + d)} \quad (11.25)$$
Chapter 12
Limit Theorems
12.1 The Law of Large Numbers
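In a standard formulation (consistent with the discussion that follows), the weak law of large numbers states that for independent, identically distributed random variables $X_1, X_2, \ldots, X_N$ with mean $\mu$,
$$\bar{X}_N \xrightarrow{P} \mu, \qquad \text{i.e.} \qquad \lim_{N \to \infty} P\left(\left|\bar{X}_N - \mu\right| > \epsilon\right) = 0 \quad \text{for every } \epsilon > 0.$$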
Note that the overline denotes the mean over the random variables. The P means that the left-hand side converges in probability to the right-hand side; that is, the probability that the difference between the two sides is larger than some ε (for any ε as small as desired) tends to zero.
In words the law says: the mean of a random sample of a population converges to the mean of the distribution of the population. Even simpler: more trials give a better average. We can use the central limit theorem (see below) to show (which we won't) that the expected deviation of $\bar{X}$ from μ is of the order of magnitude of $1/\sqrt{N}$ where N is the number of trials made. This answers the question about an error term from some lectures back.
If we wish to be pedantic, the previous theorem did not say that $\bar{X}$ will get close and stay close to μ; it merely said that this becomes more and more likely. The next theorem actually says this outright, but please note the slightly changed assumptions!
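In the same notation, a standard statement of the strong law of large numbers is
$$\bar{X}_N \xrightarrow{\text{a.s.}} \mu, \qquad \text{i.e.} \qquad P\left(\lim_{N \to \infty} \bar{X}_N = \mu\right) = 1.$$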
Note that a.s. (almost surely) means that the actual difference between the left and right-hand sides tends to zero, and not only the probability of that. This is the crucial difference, but for this we must have a finite fourth moment and not merely a finite variance (second moment).
12.2 Central Limit Theorem
We will not prove this theorem, simply because of the length of the proof and the fact that it is an exercise in Fourier integration which serves no purpose in this course. This central limit theorem holds under some conditions. There are two conditions that are both equivalent and which we now state (note the difference in wording: one uses the word every and the other some). We do this for completeness. There is no need to fully understand where these conditions come from. The discussion below will illustrate the theorem in more detail.
and the variances are all finite. So far, we have checked that the conditions of the theorem apply. This is very important because there are some circumstances when one or more of these conditions are false and the theorem does not apply.
Thus the random variable calculated from our four random variables and given
by the equation 12.1 has a distribution function that, as the sample size gets large,
is normally distributed. In this example, this knowledge does not get us very far
but let us give another example to illustrate it further.
We have established that, as far as we can tell, the distribution of male heights is normal. We want to know why this is so. For this purpose we list a number of factors that influence a particular man's height:
Clearly some influence it more than others, some do not apply all the time, some can not easily be measured, others can not even be quantified in an established way. Basically the reason why a man is of a particular height is a mystery as far as predicting it goes. What we can say is that each of the factors can be viewed as a random variable with a certain distribution function. The central limit theorem now basically means that the heights of men are going to be normally distributed if only you measure enough of them.
As this can be done for virtually anything, the theorem essentially says: anything complicated is almost always distributed normally if only the sample size is large enough and none of the contributing factors have infinite variance. This is not entirely and exactly true, as the variate which is distributed normally is a very particular one (see equation 12.1). It is important to note, however, that it is in this sense that the theorem is almost always interpreted, the reason being that, in practice, the distribution functions and means etc. are rarely known for all the contributing factors. That this interpretation is not correct, and that it leads to wrong conclusions in some circumstances, is vital to understand.
The conditions of the theorem must be checked, all the factors together with their means and variances known, and then a particular variate is normally distributed if the sample is large enough. Note that the theorem makes no statement about another variate, and it also does not tell you how large the sample has to be in order for the approximation to be good enough for your practical purposes. In short, this statement has to be taken with a lot of salt and is worth considerably less than is usually claimed and advertised.
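To see the theorem at work numerically, here is a minimal simulation sketch (the choice of exponential summands, the sample sizes, and the seed are mine, for illustration only):

```python
# Sums of skewed (exponential) variables, once standardized, approach the
# standard normal distribution as the number of summands grows.
import numpy as np

rng = np.random.default_rng(0)
n_summands, n_samples = 50, 100_000

# Each row is one sample: the sum of 50 independent Exp(1) variables.
sums = rng.exponential(scale=1.0, size=(n_samples, n_summands)).sum(axis=1)

# Standardize: Exp(1) has mean 1 and variance 1.
z = (sums - n_summands) / np.sqrt(n_summands)

# Compare a few empirical quantiles with the normal values (-1.645, 0, 1.645).
print(np.quantile(z, [0.05, 0.5, 0.95]))
```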
Part II
Statistics
Chapter 13
The Importance of Sampling
13.1 Introduction
Data consist of numbers, of course. But these numbers are fed into the computer, not produced by it. These are numbers to be treated with considerable respect, neither to be tampered with, nor subjected to a numerical process whose character you do not completely understand.
The analysis of data inevitably involves some trafficking with the field of statistics, that gray area which is not quite a branch of mathematics, and just as surely not quite a branch of science. In the following sections, you will repeatedly encounter the following paradigm:
3. if it falls in a very unlikely spot, way out on a tail of the distribution, conclude that the null hypothesis is false for your data set.
If a statistic falls in a reasonable part of the distribution, you must not make the mistake of concluding that the null hypothesis is verified or proved. That is the curse of statistics; it can never prove things, only disprove them! At best, you can substantiate a hypothesis by ruling out, statistically, a whole long list of competing hypotheses, every one that has ever been proposed. After a while your adversaries and competitors will give up trying to think of alternative hypotheses, or else they will grow old and die, and then your hypothesis will become accepted. Sounds crazy, we know, but that's how science works!
about a finite number of things that can all be checked without error, (3) measurements in the real world usually come with errors and uncertainties and (4) the hypothesis must be phrased in a numerical way so that it can be checked against calculations. As statistics involves a lot of data and complicated computations, only the final results of which are ever reported, there is a large scope for bias and cheating.
Usually statistics begins when the data is available for treatment and considers only the calculation of some figures of merit to be considered for the validation of the hypothesis. It is however quite important to consider how the data is obtained. If the hypothesis refers to a small number of objects, they can all be investigated and the hypothesis established as absolutely true or false. This is not a part of statistics. The question whether a hypothesis is true or false becomes a statistical one if we cannot measure all members of the population (the collection of objects of interest). We may establish that the hypothesis is false or that it is true with a to-be-determined degree of confidence.
Thus we must select from the population a few members that we will investigate and base our conclusions upon. This selection is called sampling. It is fraught with difficulties that we will discuss in a moment. It is clear that sampling is necessary.
As we cannot, in general, calculate the property of interest for the whole population (this being the point behind sampling in the first place) we must agree on methods that are going to produce an unbiased sample.
It is clear that a biased sample may or may not give rise to a different conclusion with regard to a particular hypothesis than an unbiased sample. As we are interested in the hypothesis as regards the population (and not the sample), a biased sample is a wrong sample.
Below we give some basic sampling methods. In real life, usually a combination of these is used, for practical reasons and limitations of resources.
2. For a survey, you phone people during the day. This is not uniform as some people do not have phones and many are not at home when you try to phone them.
3. For a medical study, you pay people a significant sum of money. This is not uniform as people with time but no money are more likely to join; thus the study is biased towards the unemployed and poorer sections of the public, which may or may not correlate with poorer health in general.
If the method of random sampling is not uniform, then the method is clearly
biased towards that group that has a higher chance of being selected. The complexity
is best illustrated by an unanswered exercise.
Example 63 You are to do an election poll by asking 4000 people about their opinions regarding the candidates running for President in a country of more than 50 million people. The country is broken up into dozens of electoral districts, each of which is counted separately. You want a representative sample that takes account of several factors: geographic location (i.e. electoral district), age, financial resources and broad work category. How do you select these 4000 people?
The answer will come in the next few sections.
This method suffers from the same caveats as stratified sampling and may be chosen principally for financial reasons. It is simply easier to first limit oneself geographically to a few regions and then exhaustively question all individuals in that region.
13.4 The Question of Error
If N = 1600, the error is thus 1/40 = 0.025 = 2.5%. If this sample had an average of 100 for some quantity, then the average of the population is going to lie in the interval 100 − 2.5% to 100 + 2.5%, i.e. 97.5 to 102.5. This answer is surprising for the simple reason that it does not depend on the size of the population. Thus a sample of 100 individuals from a population of 1000 or a population of one million has the same error!
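A minimal simulation sketch of this point (the population sizes, values, and seed are invented for illustration):

```python
# The spread of the sample mean depends on the sample size, not on the
# population size: both populations below give roughly the same error.
import numpy as np

rng = np.random.default_rng(1)
sample_size = 100

for pop_size in (1_000, 1_000_000):
    population = rng.normal(loc=100, scale=10, size=pop_size)
    means = [rng.choice(population, size=sample_size, replace=False).mean()
             for _ in range(2_000)]
    # Standard deviation of the sample means: roughly 10/sqrt(100) = 1.
    print(pop_size, np.std(means))
```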
One could now speculate on the general sociological relevance of psychology
experiments that are done with 15 psychology students that all need the money
being paid for it...
Example 64 According to a poll taken among scientists and reported in the prestigious journal Science, scientists do not have much faith in either the public or the media. The article reported that, based on the results of a recent survey of 1400 professionals in science and in journalism, 82% of scientists strongly or somewhat agree with the statement The U.S. public is gullible and believes in miracle cures and easy solutions, and 80% agreed that the public doesn't understand the importance of federal funding for research. About the same percentage (82%) also trashed the media, agreeing with the statement that The media do not understand statistics well enough to explain new findings.
It is not until the end of the article that we learn who responded: the study reported a 34% response rate among scientists, and the typical respondent was a white, male physical scientist over the age of 50 doing basic research. Remember that those who feel strongly about the issues in a survey are the most likely to respond. Clearly this is unrepresentative, and the numbers reported are essentially meaningless as far as concluding anything about scientists in general.
Example 65 On February 18, 1993, shortly after Bill Clinton became President of the United States, a television station in Sacramento, California, asked viewers to respond to the question: Do you support the president's economic plan? The next day, the results of a properly conducted study asking the same question were published in the newspaper. Here are the results:
The people dissatisfied with the plan were more likely to call the TV station, and no one called in to say that they were not sure. In short, such polls are merely a count of received answers and not statistics at all. No conclusions as regards the general population can be made from them. It is irresponsible behavior to publish such numbers without the appropriate caveats.
Example 66 Some years ago, the student newspaper at a California university announced as a front page headline: Students ignorant, survey says. The article explained that a random survey indicated that American students were less aware of current events than international students were. However, the article quoted the undergraduate researchers, who were international students themselves, as saying that the students were randomly sampled on the quad. The quad is an open-air, grassy area where students relax, eat lunch, and so on.
Such samples are called haphazard samples and cannot be expected to be representative of anything. The researchers are likely to approach friendly looking people who will support what the researchers want to establish. This is the same disaster as psychology studies being done on psychology students at university and leads to a waste of time.
Example 67 A much older example that initiated much of the study of sampling
(in the USA) is as follows. Before the Presidential election of 1936 in the USA,
a contest between Democratic incumbent Franklin Delano Roosevelt and Republican
Alf Landon, the magazine Literary Digest had been extremely successful in predicting
the results in U.S. presidential elections. But 1936 turned out to be the year of their
downfall, when they predicted a 3-to-2 victory for Landon. To add insult to injury,
young pollster George Gallup, who had just founded the American Institute of Public
Opinion in 1935, not only correctly predicted Roosevelt as the winner of the election,
he also predicted that the Literary Digest would get it wrong. He did this before they
even conducted their poll. And Gallup surveyed only 50,000 people, whereas the
Literary Digest sent questionnaires to 10 million people.
The Literary Digest made two classic mistakes. First, the lists of people to whom they mailed the 10 million questionnaires were taken from magazine subscribers, car owners, telephone directories, and, in just a few cases, lists of registered voters. In 1936, those who owned telephones or cars, or subscribed to magazines, were more likely to be wealthy individuals who were not happy with the Democratic incumbent.
Despite what many accounts of this famous story conclude, the bias produced by the more affluent list was not likely to have been as severe as the second problem. The
main problem was volunteer response. The magazine received 2.3 million responses,
a response rate of only 23%. Those who felt strongly about the outcome of the election
were most likely to respond. And that included a majority of those who wanted a
change, the Landon supporters. Those who were happy with the incumbent were less
likely to bother to respond.
Gallup, however, knew the value of random sampling. He was able not only
to predict the election, but to predict the results of the Literary Digest poll within
1%. How did he do this? He just chose 3000 people at random from the same lists
the Digest was going to use, and mailed them all a postcard asking them how they
planned to vote.
This example illustrates the beauty of random sampling and the idiocy of trying to base conclusions on non-random and biased samples. The Literary Digest went bankrupt the following year, and so never had a chance to revise its methods. The organization founded by George Gallup has flourished, although not without making a few sampling blunders of its own.
Chapter 14
Hypothesis Testing: An Introduction
2. Two populations have the same mean. (Males in the USA and Germany have the same average height.)
3. Two populations have the same variance. (Ninety percent of males in the USA and Germany are between 170 cm and 190 cm tall.)
4. Two distributions are the same. (Male heights in both the USA and Germany are normally distributed.)
Each type of null hypothesis has one or more associated tests. One computes a number and on this basis one decides to accept or reject the hypothesis. One generally has to specify a significance level α for a test.
Generally α is either 0.05 or 0.01, but this choice is purely conventional and there is no theoretical reason to choose either one of them over anything else. If we choose the 0.05 significance level, we accept a 5% probability that the test rejects the null hypothesis even though it is in fact true.
that we need to solve for p. This can be done by looking at standard tables or using a computer program like Mathematica. Note that this integral, as many like it in statistics, cannot be done in closed form. The answer is p = 1.96 but needs to be recomputed for a different significance level, of course, as the right-hand side will differ. If the test is at the 0.05 significance level but is one-tailed instead of two-tailed, we must solve
$$\int_{-p}^{p} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right) dz = 0.9 \quad (14.3)$$
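For readers with SciPy at hand, the same critical values can be read off from the inverse normal CDF (a sketch, not part of the text):

```python
# Critical values of the standard normal distribution.
from scipy.stats import norm

print(norm.ppf(0.975))  # two-tailed test at the 0.05 level: p = 1.96
print(norm.ppf(0.95))   # one-tailed test at the 0.05 level: p = 1.645
```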
Chapter 15
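For the t test for significantly different means with equal variances, the standard error of the difference of the means is estimated by the usual pooled formula (a standard formulation, reproduced here and labeled (15.1) to match the numbering that follows):
$$s_D = \sqrt{\frac{\sum_A (x_i - \bar{x}_A)^2 + \sum_B (x_i - \bar{x}_B)^2}{N_A + N_B - 2} \left(\frac{1}{N_A} + \frac{1}{N_B}\right)} \quad (15.1)$$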
where each sum is over the points in one sample, the first or second, each mean likewise refers to one sample or the other, and $N_A$ and $N_B$ are the numbers of points in the first and second samples, respectively. Second, compute t by
$$t = \frac{\bar{x}_A - \bar{x}_B}{s_D} \quad (15.2)$$
Third, evaluate the significance of this value of t for Student's distribution with $N_A + N_B - 2$ degrees of freedom, by equations (6.4.7) and (6.4.9), and by the routine betai (incomplete beta function) of section 6.4.
The significance is a number between zero and one, and is the probability that |t| could be this large or larger just by chance, for distributions with equal means. Therefore, a small numerical value of the significance (0.05 or 0.01) means that the observed difference is very significant. The function A(t|ν) in equation (6.4.7) is one minus the significance.
The next case to consider is where the two distributions have significantly different variances, but we nevertheless want to know if their means are the same or different. (A treatment for baldness has caused some patients to lose all their hair and turned others into werewolves, but we want to know if it helps cure baldness on the average!) Be suspicious of the unequal-variance t-test: if two distributions have very different variances, then they may also be substantially different in shape; in that case, the difference of the means may not be a particularly useful thing to know. To find out whether the two data sets have variances that are significantly different, you use the F-test, described later on.
The relevant statistic for the unequal variance t-test is
$$t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\mathrm{Var}(x_A)/N_A + \mathrm{Var}(x_B)/N_B}} \quad (15.3)$$
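As a usage sketch, both variants are available in SciPy's ttest_ind routine (the data below are invented for illustration):

```python
# Equal-variance and unequal-variance (Welch) t-tests on two samples.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
a = rng.normal(0.0, 1.0, size=40)
b = rng.normal(0.5, 3.0, size=35)

print(ttest_ind(a, b))                   # classical t-test (equal variances assumed)
print(ttest_ind(a, b, equal_var=False))  # unequal-variance t-test
```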
Our final example of a Student's t test is the case of paired samples. Here we imagine that much of the variance in both samples is due to effects that are point-by-point identical in the two samples. For example, we might have two job candidates who have each been rated by the same ten members of a hiring committee. We want to know if the means of the ten scores differ significantly. We first try the t test above, and obtain a value of the probability that is not especially significant (e.g., > 0.05). But perhaps the significance is being washed out by the tendency of some committee members always to give high scores, others always to give low scores, which increases the apparent variance and thus decreases the apparent significance of any difference in the means.
$$\mathrm{Cov}(x_A, x_B) = \frac{1}{N - 1} \sum_{i=1}^{N} (x_{Ai} - \bar{x}_A)(x_{Bi} - \bar{x}_B) \quad (15.5)$$
$$s_D = \left[\frac{\mathrm{Var}(x_A) + \mathrm{Var}(x_B) - 2\,\mathrm{Cov}(x_A, x_B)}{N}\right]^{1/2} \quad (15.6)$$
$$t = \frac{\bar{x}_A - \bar{x}_B}{s_D} \quad (15.7)$$
where N is the number in each sample (number of pairs). Notice that it is important that a particular value of i label the corresponding points in each sample, that is, the ones that are paired. The significance of the t statistic in 15.7 is evaluated for N − 1 degrees of freedom.
since (e.g.) no practical amount of data can distinguish between two distributions which differ only by one part in $10^{10}$.
Proving that two distributions are different, or showing that they are consistent, is a task that comes up all the time in many areas of research: Are the visible stars distributed uniformly in the sky? (That is, is the distribution of stars as a function of declination, position in the sky, the same as the distribution of sky area as a function of declination?) Are educational patterns the same in Brooklyn as in the Bronx? (That is, are the distributions of people as a function of last-grade-attended the same?) Do two brands of fluorescent lights have the same distribution of burnout times? Is the incidence of chicken pox the same for first-born, second-born, third-born children, etc.?
These four examples illustrate the four combinations arising from two different dichotomies: (1) the data are either continuous or binned; (2) either we wish to compare one data set to a known distribution, or we wish to compare two equally unknown data sets. The data sets on fluorescent lights and on stars are continuous, since we can be given lists of individual burnout times or of stellar positions. The data sets on chicken pox and educational level are binned, since we are given tables of numbers of events in discrete categories: first-born, second-born, etc.; or 6th Grade, 7th Grade, etc. Stars and chicken pox, on the other hand, share the property that the null hypothesis is a known distribution (distribution of area in the sky, or incidence of chicken pox in the general population). Fluorescent lights and educational level involve the comparison of two equally unknown data sets (the two brands, or Brooklyn and the Bronx).
One can always turn continuous data into binned data, by grouping the events into specified ranges of the continuous variable(s): declinations between 0 and 10 degrees, 10 and 20, 20 and 30, etc. Binning involves a loss of information, however. Also, there is often considerable arbitrariness as to how the bins should be chosen. Along with many other investigators, we prefer to avoid unnecessary binning of data.
The accepted test for differences between binned distributions is the chi-square test. For continuous data as a function of a single variable, the most generally accepted test is the Kolmogorov-Smirnov test. We consider each in turn.
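The chi-square statistic, in its standard form (supplied here with the equation label the following text refers to), compares observed bin counts $N_i$ with expected counts $n_i$:
$$\chi^2 = \sum_i \frac{(N_i - n_i)^2}{n_i} \quad (15.8)$$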
where the sum is over all bins. A large value of $\chi^2$ indicates that the null hypothesis (that the $N_i$'s are drawn from the population represented by the $n_i$'s) is rather unlikely.
Any term j in 15.8 with $n_j = N_j = 0$ should be omitted from the sum. A term with $n_j = 0$, $N_j \ne 0$ gives an infinite $\chi^2$, as it should, since in this case the $N_i$'s cannot possibly be drawn from the $n_i$'s!
$$\chi^2 = \sum_i \frac{(R_i - S_i)^2}{R_i + S_i} \quad (15.9)$$
Comparing 15.9 to 15.8, you should note that the denominator of 15.9 is not just the average of $R_i$ and $S_i$ (which would be an estimator of $n_i$ in 15.8). Rather, it is twice the average, the sum. The reason is that each term in a chi-square sum is supposed to approximate the square of a normally distributed quantity with unit variance. The variance of the difference of two normal quantities is the sum of their individual variances, not the average.
If the data were collected in such a way that the sum of the $R_i$'s is necessarily equal to the sum of the $S_i$'s, then the number of degrees of freedom is equal to one less than the number of bins, $N_B - 1$, the usual case. If this requirement were absent, then the number of degrees of freedom would be $N_B$. Example: A bird-watcher wants to know whether the distribution of sighted birds as a function of species is the same this year as last. Each bin corresponds to one species. If the bird-watcher takes his data to be the first 1000 birds that he saw in each year, then the number of degrees of freedom is $N_B - 1$. If he takes his data to be all the birds he saw on a random sample of days, the same days in each year, then the number of degrees of freedom is $N_B$. In this latter case, note that he is also testing whether the birds were more numerous overall in one year or the other: that is the extra degree of freedom. Of course, any additional constraints on the data set lower the number of degrees of freedom in accordance with their number.
Equation 15.9 applies to the case where the total number of data points is the same in the two binned sets. For unequal numbers of data points, the formula analogous to 15.9 is
$$\chi^2 = \sum_i \frac{\left(\sqrt{S/R}\, R_i - \sqrt{R/S}\, S_i\right)^2}{R_i + S_i} \quad (15.10)$$
where
$$R \equiv \sum_i R_i, \qquad S \equiv \sum_i S_i \quad (15.11)$$
while for comparing two different cumulative distribution functions $S_{N_1}(x)$ and $S_{N_2}(x)$, the K-S statistic is
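$$D = \max_{x} \left| S_{N_1}(x) - S_{N_2}(x) \right| \quad (15.13)$$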
What makes the K-S statistic useful is that its distribution in the case of the null hypothesis (data sets drawn from the same distribution) can be calculated, at least to useful approximation, thus giving the significance of any observed nonzero value of D. A central feature of the K-S test is that it is invariant under reparametrization of x; in other words, you can locally slide or stretch the x axis in Figure 15.1, and
the maximum distance D remains unchanged. For example, you will get the same significance using x as using log x.
The function that enters into the calculation of the significance can be written as the following sum:
$$Q_{KS}(\lambda) = 2 \sum_{j=1}^{\infty} (-1)^{j-1} e^{-2 j^2 \lambda^2} \quad (15.14)$$
where $N_e$ is the effective number of data points, $N_e = N$ for the case of one distribution, and
$$N_e = \frac{N_1 N_2}{N_1 + N_2} \quad (15.17)$$
for the case 15.13 of two distributions, where $N_1$ is the number of data points in the first distribution, $N_2$ the number in the second. The nature of the approximation involved in 15.16 is that it becomes asymptotically accurate as $N_e$ becomes large, but is already quite good for $N_e \ge 4$, as small a number as one might ever actually use.
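A usage sketch with SciPy's two-sample K-S routine (the data are invented for illustration):

```python
# Two-sample Kolmogorov-Smirnov test: D statistic and its significance.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
x1 = rng.normal(0.0, 1.0, size=200)
x2 = rng.normal(0.3, 1.0, size=150)

stat, p = ks_2samp(x1, x2)  # D and the probability of a D this large by chance
print(stat, p)
```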
social behavior from numerical data, in this case fairly crude figures of births and deaths in London from 1604 to 1661. Graunt's tract appeared in 1662. Thirty years later, the Royal Society published in its Philosophical Transactions a paper on mortality rates written by the eminent astronomer Edmund Halley. This famous article was entitled An Estimate of the Degrees of the Mortality of Mankind, drawn from curious Tables of the Births and Funerals at the City of Breslaw; with an Attempt to ascertain the Prices of Annuities upon Lives. It was followed by Some further Considerations on the Breslaw Bills of Mortality. Together, the papers are the foundation for all later work on life expectancy, indispensable of course to the solvency of life-insurance companies.2
John Graunt was born in 1620 in Birchin Lane, London, at the Sign of the Seven Stars, where his father kept a shop and home. He was early apprenticed to a merchant in small wares, buttons, needles and the like, and prospered in the trade. Success gave him the leisure to indulge interests somewhat broader than those of the notions counter. Aubrey describes him as a very ingenious and studious person... [who] rose early in the morning to his Study before shop-time.3 He became a friend of Sir William Petty, later the author of a well-known book on the new study of political arithmetic, and probably discussed with him the ideas to be expressed in the Observations. The Bills of Mortality which attracted Graunt's attention were issued weekly by the company of parish clerks and listed the number of deaths in each parish, the causes, and also an Accompt of all the Burials and Christnings, hapning that Week. They are described fully in the material selected from Graunt's book.
Charles II was so favorably impressed by the Observations that he specially proposed Graunt as an original member of the newly incorporated Royal Society. To forestall any possible objections on the ground that Graunt was a shopkeeper, His Majesty gave this particular charge to his Society, that if they found any more such Tradesmen, they should be sure to admit them all, without any more ado.4 He was elected F.R.S. in 1662.
The merit of the Observations was immediately recognized and encouraged the gathering and study of vital statistics on the Continent, particularly in France, as well as in England. The book went through several editions, the fifth of which, published after Graunt's death, was enlarged by Petty. Historians have long been vexed to decide how much Petty contributed to the original work. Aubrey, who delighted in retailing small malices, says only that Graunt had his Hint from Petty, but he implies much more. There seems no doubt that the book was a joint production. Graunt wrote by far the greater part, including the most valuable scientific portions; Petty, it may be supposed, added what Thomas Browne would have called elegancy and thereby increased the popularity of the book.
2. He not only gave a sound analysis of this problem (the calculation of annuity prices), but he put his results in such a convenient form that this first table of mortality has remained the pattern for all subsequent tables, as to its fundamental form of expression. (Lowell J. Reed in the introduction to Degrees of Mortality of Mankind by Edmund Halley, a reprint of the papers noted, issued by the Johns Hopkins Press, Baltimore, 1942; p. iv. The selection by Halley is based on this reprint.)
3. Aubrey's Brief Lives, edited by Oliver Lawson Dick; London, 1950, p. 114.
4. Tho. Sprat, The History of the Royal Society of London, for the improving of Natural Knowledge; 3rd Edition, London, 1722, p. 67.
Sir William was a bumptious and somewhat inflated man, unable to decide whether to patronize Graunt or to claim credit for his work. There is no evidence that he even understood the importance and originality of what his friend had done.5 The last sentence of the preface is unmistakably Graunt's: For herein I have, like a silly Scholeboy, coming to say my Lesson to the World (that Peevish, and Tetchie Master) brought a bundle of Rods wherewith to be whipt, for every mistake I have committed.
Graunt served as a member of the city common council and in other offices, but on turning Catholic (he was raised a Puritan) he layd down trade and all other publique Employment. Aubrey tells us that he was a man generally beloved, a faythfull friend, prudent and just. He had an excellent working head, and was very facetious and fluent in his conversation. He was accused of having had some hand in the great fire of London, and the fact that he was a Catholic gave impetus to the charge. It was said that, as an officer of a water company, he had given orders stopping the water supply just before the fire started. A diligent eighteenth-century historian proved this false by showing that Graunt had had no connection with the company until a month after the fire. Graunt died of jaundice on Easter-eve 1674, and was buried under the piewes in St. Dunstan's church. What pitty 'tis, wrote Aubrey, so great an Ornament of the Citty should be buryed so obscurely!
Unlike poor Graunt, whom my edition of the Britannica does not deign even to notice, Edmund Halley has been amply celebrated. I shall dispose of him as briefly as possible. He was born in London in 1658, the son of a wealthy Soape-boyler, and he enjoyed every advantage, including an excellent education, that rich and indulgent parents could confer. His passion for mathematics and astronomy showed itself in his youth: when he arrived at Queen's College, Oxford, he brought with him a large assortment of astronomical Instruments, including a 24-foot telescope, whose use he had already mastered. His reputation as a theoretician and observer was established by the time he was twenty. He left the college before finishing his course, to make southern hemisphere observations at St. Helena. On his return, and by the King's command, he was awarded a Master of Arts degree; a few days later he was elected a Fellow of the Royal Society. He was then twenty-two. The next few years were spent on various astronomical labors which required him to travel widely on the Continent. Becoming deeply interested in the problem of gravity, he visited Newton at Cambridge in August 1684. It was a momentous meeting, for it resulted in the Principia, a work which might never have appeared except for Halley's extraordinary exertions. He suggested the project in the first place; he averted suppression of the third book; he bore all the expenses of printing and binding, corrected the proofs, and laid his own work entirely aside to see Newton's masterpiece through the press. The expense was assumed at a time when Halley could ill afford it. His father had suffered serious reverses before he died and had left an encumbered and almost worthless estate.
Halley's long life was crowded with literary and scientific activity. He was a classical scholar, hydrographer, mathematician, physicist, and astronomer. His writings include, besides a vast output in his specialty, such diverse items as An Account
5. For a meticulous sifting of the evidence as to Graunt vs. Petty see the introduction to a reprint of the Observations (Baltimore, The Johns Hopkins Press, 1939), by Walter F. Willcox. As to Petty, no inconsiderable person even if he was inflated and bumptious, see E. Strauss, Sir William Petty, Portrait of a Genius, Glencoe (Ill.), 1954.
of the Circulation of the Watery Vapours of the Sea, and of the Cause of Springs; Discourse tending to prove at what Time and Place Julius Caesar made his first Descent upon Britain; New and General Method of finding the Roots of Equations; a translation from the Arabic, which language he learned for this purpose, of Apollonius' treatise De sectione rationis and a brilliant restoration of his two lost books De sectione spatii; an admirable edition of the same author's Conics; and more than eighty miscellaneous papers published by the Royal Society, which he served as secretary. In 1698 he commanded the war-sloop Paramour Pink in an expedition to the South Atlantic to study variations of the compass and to find new lands, if possible. On this journey he fell in with great islands of ice, of so incredible a height and magnitude that I scarce dare write my thoughts of it. He was made Savilian professor of geometry at Oxford in 1703 and astronomer royal in 1721. One of his greatest achievements was a study of the orbits of comets, of which he described no less than twenty-four. Three of these were so much alike that he was convinced that the comets of 1531, 1607, and 1682 were one body. Assuming its period to be seventy-six years, he predicted its return in 1758. On Christmas Day of that year his conjecture was verified, and Halley's comet has since appeared in 1835 and 1910.
Halley died at the age of eighty-six. He was a generous, easygoing person, free from rancor or jealousy, who spoke and acted with an uncommon degree of sprightliness and vivacity. He enjoyed his work, had excellent health and owned a large circle of friends, among them Peter the Great of Russia, to whose table he always had access. Bishop Berkeley thought Halley an infidel, and it is true that in 1691 he was refused the Savilian professorship of astronomy at Oxford because of his alleged materialistic views. The evidence is that he was a sensible man who spoke his mind freely, a dangerous practice in any age.
Halley's concern with the curious tables of Breslaw was one of his diversions. This Silesian city had, for more than a century before Halley's entry into the problem, kept regular records of its births and deaths. Dr. Caspar Neumann, a scientist and clergyman of Breslaw, had analyzed these data, disproving certain current superstitions with regard to the phases of the moon and the so-called climacteric years of health.6 His results were submitted to Leibniz, who sent them to the Royal Society. It was at about this time that the Society resumed publication of the Transactions after a lapse of several years. Halley had undertaken to furnish five sheets in twenty of the forthcoming issues. He was never hard up for ideas, nor for the energy and ingenuity to express them. His Breslaw papers may therefore be regarded as a kind of filler for the Transactions, to keep his word until something better came along. Nevertheless, the analysis reflects the exceptional power of his mind.
6. Lowell J. Reed.
Chapter 16
These tests are particularly useful if the population is highly skewed or if the data is non-numerical, for example if people are asked to rate something in order of preference.
As we want to test if the machines are the same, we have to use a two-tailed test (we would use a one-tailed one if we predicted the first machine to be better than the second). Remembering that N is given, we now add the probabilities until the total exceeds one-half of the chosen significance level.
Suppose we choose the 0.05 significance level and N = 12; then we get
$$P_{12}(0, 1, 2) = 0.01928 \quad (16.2)$$
$$P_{12}(0, 1, 2, 3) = 0.07299 \quad (16.3)$$
and so at the 0.05 significance level we have to accept the hypothesis that the two machines are equally good if we have at least 3 minuses and 3 pluses in 12 days of testing (as the probability for 0, 1, 2 is less than 0.025 and the probability for 0, 1, 2, 3 is larger).
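These cumulative binomial probabilities are easy to reproduce (a sketch, assuming SciPy is available):

```python
# Cumulative binomial probabilities for the sign test with N = 12, p = 1/2.
from scipy.stats import binom

print(binom.cdf(2, 12, 0.5))  # P12(0,1,2)   = 0.01928...
print(binom.cdf(3, 12, 0.5))  # P12(0,1,2,3) = 0.07299...
```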
We remark that if the difference between the two defective totals of the machines is zero one day, this day must be ignored as we cannot ascribe a sign to this difference. Furthermore, you may use a normal approximation to the binomial distribution if you wish.
$$H = \frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{R_j^2}{N_j} - 3(N + 1). \quad (16.8)$$
If the data is
$$aaaabbbbaaaabbbb \quad (16.10)$$
then we have four runs, with the first and third run being aaaa and the second and fourth run being bbbb. In this case, we easily observe the pattern and thus the test for randomness should fail. We will essentially count runs and see if there are enough but not too many. For example the series
$$ababababababab \quad (16.11)$$
has runs of length one but is also not random. Thus a random sequence can be said to have a moderate number of runs.
Suppose that there are $N_1$ a's, $N_2$ b's and V runs in the sequence. It can be shown that V has a distribution with mean and variance as follows.
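In the standard formulation of the runs test (the explicit formulas are supplied here):
$$\mu_V = \frac{2 N_1 N_2}{N_1 + N_2} + 1, \qquad \sigma_V^2 = \frac{2 N_1 N_2 \,(2 N_1 N_2 - N_1 - N_2)}{(N_1 + N_2)^2 (N_1 + N_2 - 1)}.$$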
1. There is always a chance that the test gives rise to the wrong decision.
2. Never put complete trust in confirmatory statistics, i.e. if your long-held pet theory turned into a null hypothesis is confirmed by a test, do not believe that it is true now. It has merely been confirmed by the data collected. If, on the other hand, the test is negative, you may assume that it is wrong or at least does not hold with the assumed generality.
6. Think of all parts of the experiment where you might be leading the test person
or introducing some incentive to go one way, to keep back information, to
embellish information or even to lie and counteract them as much as possible.
8. Investigate carefully the truth of the assumptions of the test (if any). You
may have to use other tests to ascertain this.
9. Every test gives an answer at some significance level or with some probability of error.
10. In any write-up, carefully identify the manner in which the data was obtained and the testing done. If the data was not collected by you, be sure to investigate how it was collected so that you can be sure of the reliability and representativity of the data.
Chapter 17
A Matrix Review
Definition 29 Two matrices A and B are equal if and only if $A_{ij} = B_{ij}$ for all i and j.
Definition 30 The zero matrix Z and the unit matrix I are given by $Z_{ij} = 0$ and $I_{ij} = \delta_{ij}$ for all i and j.
Definition 31 Two matrices A and B are added to produce their sum A + B and their difference A − B by
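$$(A + B)_{ij} = A_{ij} + B_{ij}, \qquad (A - B)_{ij} = A_{ij} - B_{ij}.$$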
We note that it is necessary for two matrices to have the same size for the sum and difference to be defined.
$$C_{ij} = \sum_{l=1}^{k} A_{il} B_{lj} \quad (17.8)$$
We note that in general $AB \ne BA$, and usually one of the two is even undefined. We also note that multiplication with the unit matrix $\delta_{ij}$ leaves the argument unchanged; this multiplication is also commutative.
Furthermore, if $C = AB$ then $C^T = B^T A^T$.
Definition 36 The trace of a matrix is the sum of the diagonal elements, $T(A) = \sum_{i=1}^{k} A_{ii}$.
17.1.3 Determinants
Every square n × n matrix A has an associated determinant det A. For a diagonal matrix the determinant is the product of the diagonal entries,
$$D(A) = \begin{vmatrix} a_{11} & 0 & \cdots & 0 \\ 0 & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{nn} \end{vmatrix} = a_{11} a_{22} \cdots a_{nn}, \quad (17.9)$$
and in general
$$D(A) = \sum_{i_1, i_2, \ldots, i_n} \epsilon_{i_1 i_2 \cdots i_n}\, A_{1 i_1} A_{2 i_2} \cdots A_{n i_n} \quad (17.10)$$
where
$$\epsilon_{i_1 i_2 \cdots i_n} = \begin{cases} 1 & \text{even permutation} \\ -1 & \text{odd permutation} \\ 0 & \text{any two indices equal} \end{cases} \quad (17.11)$$
and we note that 123, 231 and 312 are examples of even permutations whereas 132, 213 and 321 are examples of odd permutations.
Definition 38 The i, j co-factor is $\Delta_{ij} = (-1)^{i+j} M_{ij}$, where $M_{ij}$ is the corresponding minor.
Thus we have
$$D = \sum_{k=1}^{n} A_{ik}\, \Delta_{ik} \quad (17.12)$$
for any i.
The properties of the determinant include
2. D = 0 when either (1) all elements in a single row or column are zero, or (2) any row or column is equal to another row or column, or (3) any row or column can be expressed as a linear combination of other rows or columns.
17.1.4 Division
Definition 39 The adjoint matrix $A^{adj}$ is formed from a square n × n matrix A by
$$A^{adj}_{ij} = \Delta_{ji} \quad (17.13)$$
$$\mathrm{Null}(A) = \{X \in \mathbb{R}^n : A X = 0\} \quad (17.15)$$
$$a_1 x_1 + a_2 x_2 = a_3 \quad (17.20)$$
$$b_1 x_1 + b_2 x_2 = b_3 \quad (17.21)$$
for the variables $x_1$ and $x_2$ given all the others. We reformulate in terms of the above language
$$\begin{pmatrix} a_1 & a_2 \\ b_1 & b_2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} a_3 \\ b_3 \end{pmatrix} \quad (17.22)$$
and represent the coefficients by matrices such that
$$A = \begin{pmatrix} a_1 & a_2 \\ b_1 & b_2 \end{pmatrix}; \qquad x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}; \qquad B = \begin{pmatrix} a_3 \\ b_3 \end{pmatrix} \quad (17.23)$$
and so we have
$$A x = B \quad (17.24)$$
$$A^{-1} A x = A^{-1} B \quad (17.25)$$
$$x = A^{-1} B \quad (17.26)$$
if and only if $\det A \ne 0$. If $\det A = 0$, then these equations have no solution. This method of course applies no matter what is in the matrices A and B, and in fact linear algebra is largely concerned with this type of equation.
We call the equations homogeneous when B = 0. In this case, we always have x = 0 as a solution. This solution is the only one (a unique solution) if and only if $\det A \ne 0$ for the above reasons. If $\det A = 0$, then x = 0 is a solution but not the only one. In the inhomogeneous case, we have a solution if and only if $\det A \ne 0$.
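As a numerical sketch of this procedure (the coefficients below are invented):

```python
# Solving a 2x2 linear system A x = B, checking det A != 0 first.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
B = np.array([5.0, 10.0])

if abs(np.linalg.det(A)) > 1e-12:  # det A != 0: a unique solution exists
    x = np.linalg.solve(A, B)      # preferable to forming A^{-1} explicitly
    print(x)                       # [1., 3.]
```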
We have
$$A x = \lambda x \quad (17.27)$$
and so
$$(A - \lambda I) x = 0 \quad (17.28)$$
which is homogeneous and to be solved for both λ and x. To have non-trivial (i.e. $x \ne 0$) solutions we must have $\det(A - \lambda I) = 0$. We determine λ from this,
$$\det(A - \lambda I) = \begin{vmatrix} a_{11} - \lambda & a_{12} & a_{13} & \cdots & a_{1n} \\ a_{21} & a_{22} - \lambda & a_{23} & \cdots & a_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & a_{n3} & \cdots & a_{nn} - \lambda \end{vmatrix} = 0. \quad (17.29)$$
The previous equation will yield a polynomial in the variable λ which will have n roots $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ that may or may not be all different. These are the eigenvalues. Having obtained a particular $\lambda_i$, we may obtain its associated eigenvector $x_i$ by
$$(A - \lambda_i I)\, x_i = 0 \quad (17.30)$$
and non-trivial solutions must exist as we have ensured that the matrix on the left has vanishing determinant.
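A numerical sketch of the eigenvalue procedure with NumPy (the matrix is invented for illustration):

```python
# Eigenvalues and eigenvectors, with a check of (A - lambda I) x = 0.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)         # [3., 1.] for this symmetric example
print(eigenvectors[:, 0])  # column i is the eigenvector for eigenvalues[i]

# Verify (A - lambda I) x = 0 for the first eigenpair:
lam, x = eigenvalues[0], eigenvectors[:, 0]
print(A @ x - lam * x)     # ~[0., 0.]
```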
After all, the basic natural principles that govern man's responses to the incentives of prospective rewards, or that govern the proportion of executive leaders to followers (in corporation, labor union, army, political party, or social club), or that govern the rise and decline of fashions, or that govern the distribution of relative power and control in any dominance system, from the family unit to the nation, or that govern the circulation of persons, goods, services, ideas, and information over the earth's surface in the exchange of the products of our labor, just to enumerate a few of the principles that we shall treat, are not likely to remain idle curiosities, so long as each man must daily cooperate and compete with others in order to live. Nor are these principles particularly out of place at the present time, when we seem to be faced with an impending planned economy in which a few persons will tell many others how they should behave, often perhaps without regard to how people do behave.
On the other hand, just because we treat objectively and dispassionately of
the mainsprings of human behavior, without any particular reference to how people
should behave, that does not necessarily mean that I for one feel personally obliged
to deprecate the almost universal belief that all mankind should co-operate and get
along together like one happy team that is bent upon social progress. Nevertheless
I do note that, in spite of this universal belief, there is virtually no agreement as to
the particular ways and means whereby the worthwhile objective of universal human
cooperation is to be achieved.
It is obvious that some persons and groups have personal and group preconceptions as to precisely how the world should co-operate. These preconceptions are sometimes so deeply rooted that the individuals in question can barely talk with others whose similarly profound preconceptions happen not to agree with their own. In so doing they seem to block communication, and thereby to impede the better world understanding and co-operation they so fervently desire. It is further obvious that many of these persons and groups are so rigid and inflexible in their preconceptions that they are not to be budged from them either by incentive rewards of any amount or by threats of direst harm.
Neither the natural scientist nor the practical social engineer can afford to ignore the power of these preconceptions, to which even the best intended incentives are often subordinate and from which, only too often, the gravest individual and group misery can result.
Nevertheless, to the natural scientist, man's preconceptions do not belong to some other world, but instead are merely further natural phenomena. As such they are a part of the total natural phenomenon of human behavior and merit an investigation into their mainsprings quite as much as the rest of human behavior. Indeed, in many situations, the preconceptions involved are largely determinative of the rest of the behavior.
Our emphasis upon the effect of man's preconceptions is by no means new to present-day thinking, even though, in actual practice, a given person's attitude towards a particular vocabulary of individual or group preconceptions that confront him may depend upon what his particular problem is.
Thus, for example, the personnel man in a sizable factory, store, labor union, or
armed group, who is obliged to deal daily with a large number of persons of diverse
cultural and socio-economic backgrounds, has the continual task of understanding
and of reconciling his group's conflicting preconceptions, so that the group can effectively communicate in reference to its common tasks, incentives, and risks. The personnel man does not need to be told that it is efficient for his group to have a common language, in the broad sense of having like responses and evaluations to like stimuli within the functioning of the group. His task is essentially that of understanding the existing diverse preconceptions, so that he can restructure them into a more harmonious whole.
The case is somewhat different, however, with the politician who wants votes, or with the marketer who invents styles or has something to sell. For here the game may be simply the most effective manipulation of the existing preconceptions, without any thought of altering them. A politician, though meritorious, who casually ignores his constituents' preconceptions, or else tries to superimpose his own logic upon them, is only too likely to fall before another and even far less capable politician who knows the preconceptions of his constituents, and who says, in substance, My friends, I understand your feelings perfectly, and am heartily in accord with them.
Yet just because one man's preconceptions often flaunt another man's logic in what seems to him to be a highly capricious manner, we may by no means suppose that man's preconceptions are random and haphazard, and without a certain logic of their own. On the contrary, in our study of the dynamics of language and the structure of the personality, we shall find that a vocabulary of preconceptions is quite orderly and is governed by quite definite underlying principles. Nor are we in any way disposed to argue that the deliberate use of these underlying principles by the personnel man, politician, and marketer will not help him to alter or to manipulate more effectively the particular vocabulary of preconceptions with which he happens to be confronted.
It is perhaps well at this point to elucidate our more general terms language and
a vocabulary of preconceptions, lest they be confused with the more familiar terms,
words and speech, with which, incidentally, they are intimately related.
To this end, since we have just spoken of the marketer's problem, let us by way of illustration begin with a brand name of goods (for example, G.E., Frigidaire, Chesterfield). A given brand name may be so favorably known that many persons will prefer, and even pay more for, the brand goods than for unnamed goods, although even a connoisseur could not tell the difference. In short, a specific brand name tends to evoke a specific response in reference to human wants, and in so doing may be said to represent a sort of preconception.
Let us note, however, that a brand name is also a word, and nothing but a word. Whatever the principles may be that govern the behavior of words in their favorable and unfavorable connotations, and in their fashionableness and obsolescence, will also govern to a considerable extent the ups and downs and competition of brand names. (Hence our later study of words and speech is germane to a study of fashions and preconceptions.)
But let us go further. Instead of a brand name, let us consider a trademark which identifies a particular product or service quite as effectively as a brand name, but which contains not a single word. This trademark is a sign or a signal which, like a brand name, evokes a more or less stereotyped response. Although a trademark is not a word, and therefore not an element of speech, we shall later see that it is in fact an element of what we may call the group's language. (And we shall attempt to demonstrate that things like trademarks will behave in a manner quite similar to that of words.)
But we do not need to stop with the trademark. There are many stereotyped things, such as kinds and qualities of clothes, ways of doing one's hair, manners of gesticulating and of talking, places and times where one is seen or not seen, which convey information about the person in question. Although these things are neither words, brand names, nor trademarks, they tend to evoke more or less stereotyped responses and, in so doing, they belong to the language of the group in question, quite as much as the group's words, and phrases, and sentences.
To illustrate the import of this broader concept of language, let us paint a picture. It is evening; a costly automobile with liveried chauffeur drives up before the opera house; and out steps milady, elegantly gowned and jeweled. She fumbles in her purse and conspicuously gives the beggar a coin, and then skips up the stairs. That is the picture.
All the parts of this picture relate to the problem of the production and distribution of goods and services and of rewards. Yet, as we shall note in detail, all parts of the picture, the car, chauffeur, opera, woman, gown, jewels, coin, and beggar, are also problems in language (and in preconceptions).
And so if at times in the opening chapters of our demonstration, we seem to be almost pedantically concerned with the phonetic and semantic minutiae of human speech which are apparently of little importance in the robustness of everyday life, may the reader of imagination reflect that we may thereby be gaining insight into the total language of the group, whose minutiae may at times be extremely important in everyday life, in politics, in marketing, or in just plain getting along together.
In thus placing a study of the principles of language before that of the economy of geography, or of the distribution of economic power and social status, or of the waxing and waning of prestige symbols and cultural vogues, we perhaps confess to a certain personal preconception as to what is likely to be most important in the difficulties of actual everyday practical human problems, from which confusion, heartache, and misery arise.
1
Because the preparation of our manuscript was essentially complete at the time, we were unable to include a discussion of the recently appeared Kinsey Report in connection with our own discussion of an individual's homosexual-heterosexual balance, in which we have arrived at conclusions - on the basis of entirely different kinds of data - that undeniably support the Kinsey findings. Hence reference is here made: A. C. Kinsey, W. B. Pomeroy, and C. E. Martin, Sexual Behavior in the Human Male, Philadelphia: Saunders, 1948.
quickest course might be over the tops of the mountains at a great cost of labor and at great risk. His easiest path, however, might be a tortuous winding back and forth through the mountain range over a very considerable distance and during a quite long interval of time.
These three paths are obviously not the same. The foot-traveler between the
two cities cannot, therefore, simultaneously minimize distance, time, and work in a
single path between the two cities as the problem now stands. Which path, therefore,
will he take? Or, since the above case is fairly typical of life's daily problems, in
which impediments of various sorts obstruct our way, which path do we actually
take? Clearly our selection of a path will be determined by the particular dynamic
minimum in operation.
person in question were suddenly to minimize time. Since sudden alterations of such proportions are unknown, we are perhaps not overbold in suspecting a fortiori that an individual's entire activity from birth to death is governed throughout by the same one single superlative which, in our opinion, is least effort.
But that is not all. If we remember the extent to which offspring inherit the forms and functions of their parents, we may suspect that this inheritance is possible only if the offspring also inherit the parental dynamic drive that governs the parental forms and functions that are inherited.
Furthermore, if we view the present-day variety of living process as the result of
slow evolutionary changes from an initial similarity of form and function, then we can
understand a fortiori how the one initial single common dynamic superlative might
well remain unchanged from generation to generation, regardless of how enormous
the changes in forms and functions might become; and that, in turn, will mean
that all individuals, regardless of their differences in form and function, will still be
governed by the same single superlative.
But though we may argue at length as to the plausibility of one single superlative
for all living process, yet, even if this were the case, we should still need to disclose
what, in fact, the particular superlative in question is.
An actual disclosure of the single hypothetical superlative in question may be difficult for quite obvious reasons. If we take our previous example of the two cities with an intervening mountain chain, in which the paths of least distance, least time, and least work are three different paths, we are obliged in all candor to admit that sometimes one of these paths is taken and sometimes another. For that matter, a tunnel may be dug through the base of the mountain to save distance, while airplanes are flown over the same mountain to save time, while pack horses continue to take the easier and more leisurely winding route. Or, to take another case, sometimes the reader will dart through traffic at considerable risk in order to save time in crossing a street; and sometimes he will take the longer and safer path to the corner, where he will wait for the traffic light. Even if we assume that we are all governed by the same one single dynamic superlative, which superlative is it?
But although the superlatives in the foregoing examples seem to be different, are they nevertheless irreconcilable? Before answering this question, let us remember the physicists' claim that according to their law of falling bodies, all free-standing bodies will fall (by least action) to the earth. Yet, despite this claim, we have all observed how leaves sometimes rise in the air, or how birds take off from the ground and fly out of sight, much as if they were exceptions to the law of falling bodies. Of course we know from a more careful inspection of the problem that these leaves and birds are by no means exceptions to the law of falling bodies; on the contrary, if all the factors in the problem are taken into consideration, they are behaving quite in accordance with the physical law in question.
May not the same be true of the three different paths to the other side of the mountain? Even though each of these paths may be taken simultaneously by someone, and even though a given person may now take one path and now another, there remains the possibility that the adoption of one or another by an individual under varying sets of circumstances is governed by the operation of some further single dynamic minimum that forever remains invariant. In any event, we shall argue that such is the case.
More specifically, we shall argue that if we view the above types of situations against the broad background of the individual's present and future problems, we shall find that an extraordinary expenditure of work at one moment, or an extraordinary haste in one situation, may simply be temporary devices for reducing the probable rate of the individual's work expenditure over subsequent periods of his life.
In short, we shall argue that the invariable minimum that governs all varying conduct of an individual is least effort.
subject to the minimizing of effort. Or, differently stated, every individual's entire behavior is governed by the Principle of Least Effort.
Now that we have described what the Principle of Least Effort is, let us briefly illustrate it.
At the risk of being tedious, let the first example be our previous case of the two towns, A and B, that are separated by an intervening mountain range. Here we can see the enormous amount of work that could be saved in travel and trade if the two towns were connected by a tunnel of least distance through the base of the mountain; we can also see the enormous amount of work that it would take to construct such a tunnel. We are simply arguing that when the probable cost in work of digging the tunnel is estimated to be less than the probable work of not having the tunnel, then, if the necessary work for construction is available, the tunnel will be dug. The problem relates, therefore, to the probable amounts of work involved, as estimated by one or more persons. Naturally, these persons can have been mistaken in their estimates, with the result that the tunnel can either succeed beyond their wildest hopes, or dismally fail. For we do not deny that a person's hindsight is generally better than his foresight. We merely claim that a person acts on the basis of his foresight - with all that that will later be found to imply - according to the Principle of Least Effort.
The above type of argument will also apply to a path of least time over the mountain. Thus the enormous cost of flying munitions over the mountain to save time in supplying an army in combat on the other side may be more than justified by the future probable work that is thereby saved.
These cases of the different paths to the other side of the mountain represent instances of collective action and of collective economies, since, for example, a tunnel through a mountain is obviously not constructed by a single person but by the collective effort of a great many persons.
And yet we are not restricted to examples of collective effort in illustrating our Principle of Least Effort, which we contend also applies to an individual's own behavior. We might take the case of a student whose particular path of least effort out of his classroom would seem offhand to be the path that leads from his seat to the nearest aisle, and thence out of the door, through the hall, to the nearest stairway. On the other hand, in the event of a fire, the student might conceivably prefer to run with least time to the nearest window and adopt a path that is simultaneously a path of least work and of least time and of least distance to the ground. This path will also be a path of least effort, as estimated by himself, even at the risk of months in the hospital with a broken back. Other students may prefer to take paths through the smoke-filled corridors. These paths are also paths of least effort, as estimated by the students in question. Afterwards, when, as, and if all the students foregather, they can decide which of them, in the light of subsequent events, actually were the shrewdest gamblers in the sense of having both most correctly comprehended the nature and estimated the probabilities of the problem in their lives that was caused by the unexpected fire.
From this second example we note that the operation of the Principle of Least Effort is contingent upon the mentation of the individual, which in turn includes the operations of comprehending the relevant elements of a problem, of assessing their probabilities, and of solving the problem in terms of least effort. We mention this vital consideration of mentation right here and now, so that we may prepare ourselves for the task of defining mentation, and of showing that the structure and operation of mentation are also governed throughout by the Principle of Least Effort, since an individual's mentation is clearly a part of his total behavior, and hence subject to our Principle of Least Effort.
The foregoing examples suffice to illustrate what the Principle of Least Effort is, and what its implications may be for everyday problems. By and large, our explanations of the above commonplace examples are pretty much in line with the way the reader himself would have explained them. We mention this consideration in order to suggest that our chief task may not be that of persuading the reader to adopt a totally new way of thinking, but rather of formally describing and of scientifically establishing the basic principle of our habitual way of thinking.
reference that will serve to define both tools and jobs in conditions where neither is fixed.
For the purpose of a preliminary orientation, however, we can illustrate superficially some of the more obvious implications of the reciprocal economy of matching tools and jobs under the assumption that either the tools or the jobs are fixed. We shall begin with the example of an automobile manufacturer, and then turn to the case of an imaginary person called John.
If the owner of a manufacturing plant has the job of making pleasure automo-
biles, then, theoretically, he will seek to use those tools that will manufacture the
automobiles with a maximum economy, as is observably the case with all automobile
manufacturers. The same urge of economy presumably activates the manufacturers
of other kinds of goods. In short, the kinds of jobs (or objectives) that a person has
to perform will determine the kinds of tools (or means) that he employs for their
performance.
Nevertheless, the above situation might well be immediately changed with the sudden outbreak of war that introduces a whole new set of national jobs - or objectives - while suppressing a great many of the erstwhile peacetime ones. During the war the automobile manufacturer may no longer be permitted to manufacture his peacetime pleasure cars; and the same will doubtless be true of many other kinds of manufactures. That will not mean, however, that the manufacturing plants in question will remain idle for the duration of the war. On the contrary the plants will be converted to war work. That is, they will perform the new kinds of war jobs.
What, more precisely, takes place under this conversion to war work? Theoretically, each plant will seek to perform that particular kind of war job most nearly adapted to its present peacetime toolage; that is, it will seek to perform the particular job for which it can re-tool with least work (effort). Thus the automobile factory may make tanks, or jeeps, or gun carriages of some sort. Generalizing upon this case, we may say that tools-seek-jobs-and-jobs-seek-tools throughout the entire nation at war.5
After the war is over, the manufacturers again face the need of conversion. This does not mean that the manufacturers will necessarily revert to production of their prewar lines of goods, although they may. They may find it easier to convert their wartime toolage to the production of some entirely different kind of peacetime goods. In this conversion to peacetime activity, we may again say that tools-seek-jobs-and-jobs-seek-tools.
The foregoing example of the manufacturing plant is instructive for two reasons. First, it shows what might be called the complete relativism of the problem, with little that is permanently stable over any extended period of time; for by just such successive steps of adaptive evolution a plant that begins with the manufacture of microscopes may find itself manufacturing perfumes a century later, without a single one of the original kinds of tools and processes still employed. Secondly, the example of conversion of tools to war and then to peace indicates that our problem
5
The case of the automobile manufacturer is oversimplified since we have deliberately ignored problems of labor, management, and raw materials, which will be treated in detail in Chapters 9 and 11. Theoretically, the total supply of war jobs will be distributed to the total supply of manufacturers in such a way that the total work of re-tooling and of manufacture of the desired items will be least (cf. Chap. 5).
of economy is twofold, since it involves not only the more familiar important problem of the selection of economical means (tools) but also the somewhat less familiar but no less important problem of the selection of economical objectives (jobs), in the reciprocal economy of matching tools to jobs and jobs to tools.
The above example is tendered only to illustrate the general relativism of the
fundamental problem of tools-and-jobs as well as the twofold economy of their re-
ciprocal matching. Since these two considerations are obviously important in our
theory, it is not unreasonable for the reader to ask even now just how we may hope
to study them quantitatively.
Curiously enough, we shall find in the forms and functions of the entities of human speech an almost perfect example of the general relativism of tools and jobs and of the twofold economy of selection. For, as we shall later see more specifically, the forms and meanings of words represent merely a special case of tools that perform jobs. We shall find that the forms and functions of words are quite capable of being studied quantitatively by means of the objective methods of empiric science, with results that will be applicable to the general economy of all tools and jobs.
Indeed, it will be the precise information that we gain from a study of the case of speech that will suggest how every individual may be viewed in his entirety as a single set of integrated tools-and-jobs; and that the total behavior of the individual can be viewed as a case in which tools-seek-jobs-and-jobs-seek-tools with a maximum economy of effort. This view of an individual as a set of tools-jobs does no violence to our commonsense feelings about the matter, as we shall now see as we turn from the case of the automobile manufacturer to that of an imaginary person called John who, we shall suppose, is in love with Mary.
John, after work, tidies himself before going to see Mary, who is similarly tidying herself to see John. In these two persons we have, theoretically, a case where jobs-seek-tools-and-tools-seek-jobs. Each may be viewed as both a set of tools and as a set of jobs for the other person. Together they face a period of reciprocal adjustment, during which each side alters its tools and jobs to effect a more economical match. In this respect John (or Mary) is quite similar to our previously discussed automobile manufacturer, who also had to alter his tools and jobs to match them more economically to the jobs and tools of others.
Yet in saying that John (or Mary) is a set of tools-and-jobs that is seeking a complementary set of jobs-and-tools, we are obviously dealing with two different economic problems in each person. Thus, in the case of John, there is the first problem of organizing John's own individual set of tools in such a way that they will operate with maximum economy in performing their jobs of self-support, self-defense, and procreation. Then there is the second problem of economically moving John as a unit system of tools over the earth's surface in the quest of jobs for his tools and of tools for his jobs (e.g., John seeks Mary). Clearly, these two economies to which John is continually subject are not the same. Yet they have one salient feature in common: in either case there is always the problem of moving matter-energy over paths of least effort, whether the matter-energy thus moved represents John's individual tools in operation, or whether it represents John as a total whole.
In other words, we may say two things about John in reference to our theoretical paths of least effort. First, we may say that John is a set of paths over which matter-energy proceeds into his system of toolage, through his system of toolage, and out
of his system of toolage. Secondly, John, as a unit, takes paths. According to our Principle, all these paths, of whatever sort, will be paths of least effort, even though John's intrasystematic set of paths may seem to be rigidly stereotyped and determinate, whereas John's extra-systematic unit action may seem to be comparatively optional and indeterminate.
Paths of least effort are only probable paths, regardless of the comparative degrees of probabilities that the paths in question will be taken. If we now inspect more closely the manner in which John selects his extra-systematic path to his rendezvous with Mary, we shall gain further insight into the degree of precision to which any path of least effort is calculated, and also into the general economy of a perseveration, or a repetition, of activity that accounts for the apparently stereotyped rigidity of John's intrasystematic paths.
To begin, let us ask whether John, in selecting a path of least effort to Mary, will perchance take a slide rule and surveyor's transit, and calculate his path to Mary with the precision of a civil engineer who is surveying a roadbed for a railway that is to wind through a mountain range where every inch counts. Obviously not, and for a very good economic reason: the work of calculating a path of least effort must be included in the total work of taking the path of least effort. Nothing is gained by calculating a particular path of least effort to a greater degree of precision, when the added work of so calculating it is not more than offset by the work that is saved by using the more precisely calculated path. John, therefore, in selecting an easiest probable path to his objective, will minimize the total effort of calculating and of taking the path in question. The same will theoretically be the case of every other path of least effort.
This consideration leads to a second one. If John expects to take the same path repeatedly, then he can afford to calculate it more precisely, since the additional work of calculation can be distributed over the repeated trips. From this we see that there is an inherent economy of effort in repeatedly taking the same paths, since one saves thereby the work of repeated calculations. In short, there is an economy in the repetitiveness of one's acts of behavior. Thence the growth of habits.
We mention this consideration in order to suggest that the intrasystemic paths over which John's individual tools behave within his own system will still be paths of least effort, even though they may seem to be stereotyped to the point of complete rigidity because of the relatively high frequency of recurrence of the action in question. The sheer fact that our physiological behavior is highly predictable does not preclude the possibility that it takes place over paths of least effort; on the contrary, as we shall later argue in detail, intrasystematic paths are stereotyped because they are frequent, and the reverse.
This view of John as simultaneously both taking paths and being paths leads
ultimately to a profound question in dynamics that will occupy our attention in the
course of our demonstration.
If we restrict our attention to John as a set of paths in reference to which matter-energy moves into John's system, through John's system, and out of John's system, we note that there is nothing in this transient matter-energy that can be called permanently John. And since the matter-energy is not John - or, if one will, since the actual physical tools of his system are not John - what then is John?
All that is left, according to our theory, is the system of paths itself, over which the matter-energy moves while acting in the capacity of being John's tools in operation. Yet does even this system of paths represent what is permanently John? Obviously not, since these paths are only probable paths and by no means fixed. Indeed, we know that the particular system of paths of an aged man is far from being the same as his system of paths when he was an embryo. Even as matter-energy is continually moving over paths, so too paths are continually changing.
And yet if all the matter-energy in John's system of tools in operation is transient
and ephemeral to the system, and if the paths, or processes, are also variable, what,
then, is left over in the total phenomenon to represent that apparently enduring
entity called John?
This is clearly a primary question that is inherent in our concept of tools-and-jobs and one which will confront every other investigator into the field of biosocial dynamics. The question is simple: What is John?
At present we merely point to the existence of this question, which we shall
candidly face in the course of our demonstration, when we shall attempt to answer
it. If we mention it now, it is only in order to suggest what is attached to the view
that an individual is a set of tools-and-jobs which in turn seeks jobs-and-tools.
We shall later define more precisely what we mean by tools and by jobs and by the reciprocal economy of matching the two. Our present discussion of the topic is intended merely to indicate the scope of the Principle of Least Effort, and to suggest a possibly fruitful manner of studying the problem of the economy of a person's total behavior.
This total behavior, as we have seen, consists of two economies. John is not only subject to the economy of organizing his own individual self; he is also subject to the economy of a collective organization with what lies outside his system. For the convenience of exposition, we shall divide our demonstration into two parts, the first devoted primarily to a discussion of the economy of the organization of the individual, and the second to the economy of the organization of the collective group of individuals in reference to each other and to the rest of nature.
We can see from the very nature of the case that the one economy of the individual will influence the other economy of the collective group, and the reverse.
Collective Economy
It is questionable whether our analysis of the particular case of the economy of
digging a tunnel through the mountain added one single consideration that would
not have been completely covered by the older schools of classical or orthodox
economics. These schools have never been in any doubt that the factor that is
minimized in such collective enterprises is labor, or work. Nor do they view their
topics so narrowly that they fail to take into account the concept of risk which, in
turn, is intimately connected with the concept of probable work. A great deal of
economic theory, therefore, belongs to the history of the topic we are considering.
Closely related to the field of economics are the fields of sociology and of cultural anthropology, to whose thinking we are also indebted. Earlier quantitative observations in these fields, as well as in the field of general biological ecology, will often be referred to in the course of our demonstration.
Individual Economy
Less widely known than the above studies, yet surely no less important, are the studies of the factor of work in the motivation of human behavior as conducted with painstaking care by experimental psychologists. The results of these studies have led to the theoretical formulations of Drs. N. E. Miller and J. Dollard, and to those of Dr. C. L. Hull (see below), who is himself an experimentalist. Although the actual experimentation in the field of motivation has been too extensive to be discussed in detail here, we shall at least give a short account of some of the main steps of the theoretical development of the concept of economy.
Among the very earliest experimentalists to be concerned with the topic of work were three persons: (1) Dr. J. A. Gengerelli in his The Principle of Maxima and Minima in Learning, (2) Dr. L. S. Tsai in his The Laws of Minimum Effort and Maximum Satisfaction in Animal Behavior, and (3) Dr. R. H. Waters in his The Principle of Least Effort in Learning. By the term effort in these titles, the authors without exception mean work. In addition, there is also the principle of least action that was borrowed wholesale from the physicists by Dr. R. H. Wheeler, who advanced it as a primary psychological principle without any supporting proof;6 we mention Dr. Wheeler's principle of least action in psychology lest we otherwise seem to be unaware of it; and we discard it for the reason pointed out by Dr. Waters: Dr. Wheeler has in fact done nothing scientifically except, at best, to enunciate a postulate which he saw fit to make, yet for the validity of which there is not the slightest bit of evidence.
As to the other three persons, Dr. Gengerelli states his Principle of Maxima and Minima as follows:
The behavior of an organism elicited by a given stimulating situation which affords relief to an internal need of that organism tends, with repetition, to approach, in terms of time, space, and effort involved, the minimal limit compatible with the relief of that need; the nature of the limit being defined by the structure of the organism and of the external situation.
Dr. Tsai in turn states his Law of Minimum Effort as follows:
Among several alternatives of behavior leading to equivalent satisfaction of some potent organic need, the animal, within the limits of its discriminative ability, tends
6
In the original simple terms of Maupertuis, the principle of least action states that when a mass, M, moves from a given point at a given moment of time to another point at another moment of time, it will proceed along a course in which the sum of all products of all masses when multiplied by their respective distances moved and by their respective velocities will be a minimum.
the same and of which the consequences are the same, then theoretically the course of least work will be adopted, since this course is also the course of Least Effort. We mention this consideration because many of the phenomena of speech, if viewed in isolation over a short period of time, will be found to conform to Hull's corollary of less work. The same would seem to apply to many other restricted experimental situations.
In addition to the above approach by experimental psychologists, of which the foregoing account is only a barest outline, there was the general theoretical study of vectorial analysis by Kurt Lewin in his Principles of Topological Psychology. This study, though stimulating to read, suffered both from a paucity of supporting data and from the lack of a satisfactory theory to account vectorially for the progression of a person from one situation to the other. On the other hand, the gifted experimentalists, Drs. J. F. Brown and A. C. Voth, have observed instances of topological vectors in the field of vision. Nor should we fail to mention, if only in passing, the extremely interesting observations of the social behavior of ants as made by the topological psychologist and biologist, Dr. T. C. Schneirla.8
The writer's publications in the field of the dynamics of behavior began in 1929, in which cases of least work were found in the speech process.
The foregoing outline is by no means complete. In the following chapters we
shall discuss in detail the observations of other investigators, including those of
sociologists, which for that reason are not included above. The above outline serves
to suggest both the increasing interest on the part of objective scientists in the topic
of biosocial dynamics, and a growing feeling of the urgent need for a single unifying
biosocial principle.
Only by venturing to set up general unifying principles for others to criticize and to prune and to restructure, even as those persons we have just mentioned have ventured and are venturing to do - and as we shall also attempt to do - may we hope, I believe, ultimately to disclose the desired principle.
8
See, for example, O. H. Mowrer and H. M. Jones, Extinction and Behavior Variability as Function of Effortfulness of Task; J. E. De Camp, Relative distance as a factor in the white rat's selection of a path, etc.
should provide (i) parameters, (ii) error estimates on the parameters, and (iii) a statistical measure of goodness-of-fit. When the third item suggests that the model is an unlikely match to the data, then items (i) and (ii) are probably worthless. Unfortunately, many practitioners of parameter estimation never proceed beyond item (i). They deem a fit acceptable if a graph of data and model looks good. This approach is known as chi-by-eye. Luckily, its practitioners get what they deserve. But where does this come from? What general principles is it based on? The answer to these questions takes us into the subject of maximum likelihood estimators.
Given a particular data set of $x_i$'s and $y_i$'s, we have the intuitive feeling that some parameter sets $a_1, a_2, \ldots, a_M$ are very unlikely - those for which the model function $y(x)$ looks nothing like the data - while others may be very likely - those that closely resemble the data. How can we quantify this intuitive feeling? How can we select fitted parameters that are most likely to be correct? It is not meaningful to ask the question, What is the probability that a particular set of fitted parameters $a_1, a_2, \ldots, a_M$ is correct? The reason is that there is no statistical universe of models from which the parameters are drawn. There is just one model, the correct one, and a statistical universe of data sets that are drawn from it!
That being the case, we can, however, turn the question around, and ask, Given a particular set of parameters, what is the probability that this data set could have occurred? If the $y_i$'s take on continuous values, the probability will always be zero unless we add the phrase, ...plus or minus some fixed $\Delta y$ on each data point. So let's always take this phrase as understood. If the probability of obtaining the data set is infinitesimally small, then we can conclude that the parameters under consideration are unlikely to be right. Conversely, our intuition tells us that the data set should not be too improbable for the correct choice of parameters.
In other words, we identify the probability of the data given the parameters
(which is a mathematically computable number), as the likelihood of the parameters
given the data. This identification is entirely based on intuition. It has no formal
mathematical basis in and of itself; as we already remarked, statistics is not a branch
of mathematics!
Once we make this intuitive identification, however, it is only a small further step to decide to fit for the parameters $a_1, a_2, \ldots, a_M$ precisely by finding those values that maximize the likelihood defined in the above way. This form of parameter estimation is maximum likelihood estimation.
We are now ready to make the connection to 18.2. Suppose that each data point $y_i$ has a measurement error that is independently random and distributed as a normal (Gaussian) distribution around the true model $y(x)$. And suppose that the standard deviations $\sigma$ of these normal distributions are the same for all points. Then the probability of the data set is the product of the probabilities of each point,

$$ P \propto \prod_{i=1}^{N} \left\{ \exp\left[ -\frac{1}{2} \left( \frac{y_i - y(x_i)}{\sigma} \right)^2 \right] \Delta y \right\} \qquad (18.3) $$
Notice that there is a factor $\Delta y$ in each term in the product. Maximizing 18.3 is equivalent to maximizing its logarithm, or minimizing the negative of its logarithm, namely,

$$ \sum_{i=1}^{N} \frac{\left[ y_i - y(x_i) \right]^2}{2\sigma^2} - N \log \Delta y \qquad (18.4) $$

Since $N$, $\sigma$, and $\Delta y$ are all constants, minimizing this expression is equivalent to minimizing 18.2.
What we see is that least-squares fitting is a maximum likelihood estimation of the fitted parameters if the measurement errors are independent and normally distributed with constant standard deviation. Notice that we made no assumption about the linearity or nonlinearity of the model $y(x; a_1, a_2, \ldots, a_M)$ in its parameters $a_1, a_2, \ldots, a_M$. Just below, we will relax our assumption of constant standard deviations and obtain the very similar formulas for what is called chi-square fitting or weighted least-squares fitting. First, however, let us discuss further our very stringent assumption of a normal distribution.
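The equivalence just described is easy to verify numerically. The following is a minimal sketch, with invented synthetic data; the straight-line model, the true parameter values, and the noise level are assumptions for illustration only. Fitting by minimizing the sum of squared residuals and fitting by minimizing the negative log-likelihood of 18.4 yield the same parameters.

```python
# Minimal check that least-squares fitting coincides with maximum
# likelihood estimation for Gaussian errors of constant sigma.
# All data below are synthetic and purely illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
sigma = 0.3
y = 1.5 + 0.7 * x + rng.normal(0.0, sigma, x.size)  # assumed true model

def sum_of_squares(p):
    return np.sum((y - (p[0] + p[1] * x)) ** 2)

def neg_log_likelihood(p):
    # Equation 18.4 without the constant term -N log(Delta y)
    return np.sum((y - (p[0] + p[1] * x)) ** 2) / (2.0 * sigma ** 2)

fit_ls = minimize(sum_of_squares, x0=[0.0, 0.0])
fit_ml = minimize(neg_log_likelihood, x0=[0.0, 0.0])
print(fit_ls.x, fit_ml.x)  # the two fits agree to numerical precision
```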
For a hundred years or so, mathematical statisticians have been in love with the fact that the probability distribution of the sum of a very large number of very small random deviations almost always converges to a normal distribution (the central limit theorem). This infatuation tended to focus interest away from the fact that, for real data, the normal distribution is often rather poorly realized, if it is realized at all. We are often taught, rather casually, that, on average, measurements will fall within $\pm\sigma$ of the true value 68 percent of the time, within $\pm 2\sigma$ 95 percent of the time, and within $\pm 3\sigma$ 99.7 percent of the time. Extending this, one would expect a measurement to be off by $\pm 20\sigma$ only one time out of $2 \times 10^{88}$. We all know that glitches are much more likely than that!
In some instances, the deviations from a normal distribution are easy to understand and quantify. For example, in measurements obtained by counting events, the measurement errors are usually distributed as a Poisson distribution, whose cumulative probability function was already discussed in 6.2. When the number of counts going into one data point is large, the Poisson distribution converges towards a Gaussian. However, the convergence is not uniform when measured in fractional accuracy. The more standard deviations out on the tail of the distribution, the larger the number of counts must be before a value close to the Gaussian is realized. The sign of the effect is always the same: The Gaussian predicts that tail events are much less likely than they actually (by Poisson) are. This causes such events, when they occur, to skew a least-squares fit much more than they ought.
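The size of this effect is easy to inspect directly. The sketch below (the mean counts chosen are arbitrary illustrations, not data from the text) compares the probability of a fluctuation more than five standard deviations above the mean under a Poisson model with the corresponding Gaussian prediction; the ratio is large for small counts and shrinks only slowly as the counts grow.

```python
# Compare Poisson and Gaussian tail probabilities five standard
# deviations above the mean; the means mu are arbitrary examples.
import numpy as np
from scipy.stats import norm, poisson

for mu in (10, 100, 1000):
    k = mu + 5.0 * np.sqrt(mu)          # 5-sigma point for mean mu
    tail_poisson = poisson.sf(k, mu)    # P(X > k) under the Poisson model
    tail_gauss = norm.sf(5.0)           # P(Z > 5) under a Gaussian
    print(mu, tail_poisson / tail_gauss)  # ratio exceeds 1, largest for small mu
```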
Other times, the deviations from a normal distribution are not so easy to understand in detail. Experimental points are occasionally just way off. Perhaps the power flickered during a point's measurement, or someone kicked the apparatus, or someone wrote down a wrong number. Points like this are called outliers. They can easily turn a least-squares fit on otherwise adequate data into nonsense. Their probability of occurrence in the assumed Gaussian model is so small that the maximum likelihood estimator is willing to distort the whole curve to try to bring them, mistakenly, into line.
The subject of robust statistics deals with cases where the normal or Gaussian model is a bad approximation, or cases where outliers are important. We will discuss robust methods briefly later. All the sections between this one and that one assume, one way or the other, a Gaussian model for the measurement errors in the data. It is quite important that you keep the limitations of that model in mind, even as you use the very useful methods that follow from assuming it.
Finally, note that our discussion of measurement errors has been limited to statistical errors, the kind that will average away if we only take enough data. Measurements are also susceptible to systematic errors that will not go away with any amount of averaging. For example, the calibration of a metal meter stick might depend on its temperature. If we take all our measurements at the same wrong temperature, then no amount of averaging or numerical processing will correct for this unrecognized systematic error.
$$ \chi^2 \equiv \sum_{i=1}^{N} \left( \frac{y_i - y(x_i; a_1, a_2, \ldots, a_M)}{\sigma_i} \right)^2 \qquad (18.5) $$
$$ 0 = \sum_{i=1}^{N} \left( \frac{y_i - y(x_i)}{\sigma_i^2} \right) \left( \frac{\partial y(x_i; a_1, \ldots, a_M)}{\partial a_k} \right), \qquad k = 1, 2, \ldots, M \qquad (18.7) $$

which is, in general, a set of $M$ nonlinear equations for the $M$ unknown $a_k$. Several of the procedures described subsequently derive from this equation and its specializations.
If the measurement errors are normally distributed, then this merit function will
give maximum likelihood parameter estimations of a and b; if the errors are not
normally distributed, then the estimations are not maximum likelihood, but may
still be useful in a practical sense.
Equation 19.2 is minimized to determine $a$ and $b$. At its minimum, derivatives of $\chi^2(a, b)$ with respect to $a$, $b$ vanish:

$$ 0 = \frac{\partial \chi^2}{\partial a} = -2 \sum_{i=1}^{N} \frac{y_i - a - b x_i}{\sigma_i^2}, \qquad 0 = \frac{\partial \chi^2}{\partial b} = -2 \sum_{i=1}^{N} \frac{x_i (y_i - a - b x_i)}{\sigma_i^2} \qquad (19.3) $$

These conditions can be rewritten in a convenient form by defining the following sums:

$$ S \equiv \sum_{i=1}^{N} \frac{1}{\sigma_i^2}; \quad S_x \equiv \sum_{i=1}^{N} \frac{x_i}{\sigma_i^2}; \quad S_y \equiv \sum_{i=1}^{N} \frac{y_i}{\sigma_i^2}; \quad S_{xx} \equiv \sum_{i=1}^{N} \frac{x_i^2}{\sigma_i^2}; \quad S_{xy} \equiv \sum_{i=1}^{N} \frac{x_i y_i}{\sigma_i^2} \qquad (19.4) $$
For the straight line, the derivatives of $a$ and $b$ with respect to $y_i$ can be directly evaluated from the solution:

$$ \frac{\partial a}{\partial y_i} = \frac{S_{xx} - S_x x_i}{\sigma_i^2 \Delta}; \qquad \frac{\partial b}{\partial y_i} = \frac{S x_i - S_x}{\sigma_i^2 \Delta} \qquad (19.8) $$

where $\Delta \equiv S S_{xx} - S_x^2$. Summing over the points as in 19.7, we get

$$ \sigma_a^2 = \frac{S_{xx}}{\Delta}; \qquad \sigma_b^2 = \frac{S}{\Delta} \qquad (19.9) $$

which are the variances in the estimates of $a$ and $b$, respectively. We will see later that an additional number is also needed to characterize properly the probable uncertainty of the parameter estimation. That number is the covariance of $a$ and $b$, and (as we will see below) is given by

$$ \mathrm{Cov}(a, b) = \frac{-S_x}{\Delta} \qquad (19.10) $$

The coefficient of correlation between the uncertainty in $a$ and the uncertainty in $b$, which is a number between -1 and 1, follows from 19.10:

$$ r_{ab} = \frac{-S_x}{\sqrt{S S_{xx}}} \qquad (19.11) $$

A positive value of $r_{ab}$ indicates that the errors in $a$ and $b$ are likely to have the same sign, while a negative value indicates the errors are anti-correlated, likely to have opposite signs.
We are still not done. We must estimate the goodness-of-fit of the data to the model. Absent this estimate, we have not the slightest indication that the parameters $a$ and $b$ in the model have any meaning at all! The probability $Q$ that a value of chi-square as poor as the value 19.2 should occur by chance is

$$ Q = Q\!\left( \frac{N-2}{2}, \frac{\chi^2}{2} \right) \qquad (19.12) $$

where $Q(a, x)$ is the (upper) incomplete gamma function.
For data with varying weights $\sigma_i$, the above equations remain valid if the sums in equation (14.5.1) are weighted by $1/\sigma_i^2$.
When the weights are known in advance, the calculations exactly correspond to the formulas above. However, when weights are unavailable, the routine assumes equal values of $\sigma$ for each point and assumes a good fit.
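A compact implementation of this straight-line fit is sketched below, under the assumption of known, independent measurement errors $\sigma_i$; the function name and its interface are placeholders. It accumulates the sums of 19.4 and returns the parameters together with their standard errors (19.9), covariance (19.10), and goodness-of-fit probability (19.12).

```python
# Straight-line chi-square fit using the sums of equation 19.4.
import numpy as np
from scipy.special import gammaincc  # regularized upper incomplete gamma Q(a, x)

def fit_line(x, y, sig):
    w = 1.0 / sig**2
    S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
    Sxx, Sxy = (w * x**2).sum(), (w * x * y).sum()
    delta = S * Sxx - Sx**2
    a = (Sxx * Sy - Sx * Sxy) / delta        # intercept
    b = (S * Sxy - Sx * Sy) / delta          # slope
    sig_a = np.sqrt(Sxx / delta)             # equation 19.9
    sig_b = np.sqrt(S / delta)
    cov_ab = -Sx / delta                     # equation 19.10
    chi2 = (((y - a - b * x) / sig)**2).sum()
    q = gammaincc(0.5 * (len(x) - 2), 0.5 * chi2)  # equation 19.12
    return a, b, sig_a, sig_b, cov_ab, chi2, q
```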
$$ y(x) = a + bx \qquad (19.15) $$

$$ \chi^2(a, b) = \sum_{i=1}^{N} \frac{(y_i - a - b x_i)^2}{\sigma_{y_i}^2 + b^2 \sigma_{x_i}^2} \qquad (19.16) $$

where $\sigma_{x_i}$ and $\sigma_{y_i}$ are, respectively, the $x$ and $y$ standard deviations for the $i$th point. The weighted sum of variances in the denominator of equation 19.16 can be understood both as the variance in the direction of the smallest $\chi^2$ between each data point and the line with slope $b$, and also as the variance of the linear combination $y_i - a - b x_i$ of the two random variables $x_i$ and $y_i$,

$$ \frac{1}{w_i} \equiv \mathrm{Var}(y_i - a - b x_i) = \mathrm{Var}(y_i) + b^2 \mathrm{Var}(x_i) = \sigma_{y_i}^2 + b^2 \sigma_{x_i}^2 \qquad (19.17) $$
The sum of the square of $N$ random variables, each normalized by its variance, is thus $\chi^2$-distributed.
We want to minimize equation 19.16 with respect to $a$ and $b$. Unfortunately, the occurrence of $b$ in the denominator of equation 19.16 makes the resulting equation for the slope $\partial\chi^2/\partial b = 0$ nonlinear. However, the corresponding condition for the intercept, $\partial\chi^2/\partial a = 0$, is still linear and

$$ a = \frac{\sum_i w_i (y_i - b x_i)}{\sum_i w_i} \qquad (19.18) $$
Figure 19.1: Standard errors for the parameters $a$ and $b$. The point $B$ can be found by varying the slope $b$ while simultaneously minimizing the intercept $a$. This gives the standard error $\sigma_b$, and also the value $s$. The standard error $\sigma_a$ can then be found by the geometric relation $\sigma_a^2 = s^2 + r^2$.
where $X_1(x), X_2(x), \ldots, X_M(x)$ are arbitrary fixed functions of $x$, called the basis functions.
Note that the functions $X_k(x)$ can be wildly nonlinear functions of $x$. In this discussion linear refers only to the model's dependence on its parameters $a_k$.
For these linear models we generalize the discussion of the previous section by defining a merit function

$$ \chi^2 = \sum_{i=1}^{N} \left[ \frac{y_i - \sum_{k=1}^{M} a_k X_k(x_i)}{\sigma_i} \right]^2 \qquad (19.24) $$
As before, $\sigma_i$ is the measurement error (standard deviation) of the $i$th data point, presumed to be known. If the measurement errors are not known, they may all be set to the constant value $\sigma = 1$.
Once again, we will pick as best parameters those that minimize $\chi^2$. There are several different techniques available for finding this minimum. Two are particularly useful, and we will discuss both in this section. To introduce them and elucidate their relationship, we need some notation.
Let $A$ be a matrix whose $N \times M$ components are constructed from the $M$ basis functions evaluated at the $N$ abscissas $x_i$, and from the $N$ measurement errors $\sigma_i$, by the prescription

$$ A_{ij} = \frac{X_j(x_i)}{\sigma_i} \qquad (19.25) $$

The matrix $A$ is called the design matrix of the fitting problem. Notice that in general $A$ has more rows than columns, $N \ge M$, since there must be more data points than model parameters to be solved for. (You can fit a straight line to two points, but not a very meaningful quintic!) The design matrix is shown schematically in Figure 19.2.
Also define a vector $b$ of length $N$ by

$$ b_i = \frac{y_i}{\sigma_i} \qquad (19.26) $$
and denote the $M$-vector whose components are the parameters to be fitted, $a_1, a_2, \ldots, a_M$, by $a$. Setting the derivative of 19.24 with respect to each $a_k$ to zero yields the $M$ equations

$$ 0 = \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \left[ y_i - \sum_{j=1}^{M} a_j X_j(x_i) \right] X_k(x_i), \qquad k = 1, 2, \ldots, M \qquad (19.27) $$

Interchanging the order of summations, we can write 19.27 as the matrix equation

$$ \sum_{j=1}^{M} \alpha_{kj} a_j = \beta_k \qquad (19.28) $$

where

$$ \alpha_{kj} = \sum_{i=1}^{N} \frac{X_j(x_i) X_k(x_i)}{\sigma_i^2} \quad \text{or equivalently} \quad [\alpha] = A^T \cdot A \qquad (19.29) $$

an $M \times M$ matrix, and

$$ \beta_k = \sum_{i=1}^{N} \frac{y_i X_k(x_i)}{\sigma_i^2} \quad \text{or equivalently} \quad [\beta] = A^T \cdot b \qquad (19.30) $$

a vector of length $M$.
The equations 19.27 or 19.28 are called the normal equations of the least-squares problem. They can be solved for the vector of parameters $a$ by standard methods, notably LU decomposition and backsubstitution, Cholesky decomposition, or Gauss-Jordan elimination. In matrix form, the normal equations can be written as either

$$ [\alpha] \cdot a = [\beta] \quad \text{or} \quad (A^T \cdot A) \cdot a = A^T \cdot b \qquad (19.31) $$
The inverse matrix $C_{jk} \equiv [\alpha]^{-1}_{jk}$ is closely related to the probable (or, more precisely, standard) uncertainties of the estimated parameters $a$. To estimate these uncertainties, consider that

$$ a_j = \sum_{k=1}^{M} [\alpha]^{-1}_{jk} \beta_k = \sum_{k=1}^{M} C_{jk} \left[ \sum_{i=1}^{N} \frac{y_i X_k(x_i)}{\sigma_i^2} \right] \qquad (19.32) $$

and that the variance associated with the estimate $a_j$ can be found as in 19.7 from

$$ \sigma^2(a_j) = \sum_{i=1}^{N} \sigma_i^2 \left( \frac{\partial a_j}{\partial y_i} \right)^2 \qquad (19.33) $$

$$ \frac{\partial a_j}{\partial y_i} = \sum_{k=1}^{M} \frac{C_{jk} X_k(x_i)}{\sigma_i^2} \qquad (19.34) $$
Consequently, we find that

$$ \sigma^2(a_j) = \sum_{k=1}^{M} \sum_{l=1}^{M} C_{jk} C_{jl} \left[ \sum_{i=1}^{N} \frac{X_k(x_i) X_l(x_i)}{\sigma_i^2} \right] \qquad (19.35) $$

The final term in brackets is just the matrix $[\alpha]$. Since this is the matrix inverse of $[C]$, 19.35 reduces immediately to $\sigma^2(a_j) = C_{jj}$.
In other words, the diagonal elements of $[C]$ are the variances (squared uncertainties) of the fitted parameters $a$. It should not surprise you to learn that the off-diagonal elements $C_{jk}$ are the covariances between $a_j$ and $a_k$; but we shall defer discussion of these to the next lecture.
Since we wish to compute not only the solution vector $a$ but also the covariance matrix $[C]$, it is most convenient to use Gauss-Jordan elimination to perform the linear algebra. In theory, since $A^T \cdot A$ is positive definite, Cholesky decomposition is the most efficient way to solve the normal equations. However, in practice most of the computing time is spent in looping over the data to form the equations, and Gauss-Jordan is quite adequate.
We need to warn you that the solution of a least-squares problem directly from the normal equations is rather susceptible to roundoff error. An alternative, and preferred, technique involves QR decomposition of the design matrix $A$. This is essentially what we did for fitting data to a straight line, but without invoking all the machinery of QR to derive the necessary formulas.
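For concreteness, here is a minimal sketch of the normal-equations route of 19.25-19.31, bearing in mind the roundoff warning just given; the polynomial basis and the use of a plain matrix inverse are simplifying assumptions for brevity, not recommendations.

```python
# General linear least squares via the normal equations 19.29-19.31.
import numpy as np

def lls_fit(x, y, sig, basis_funcs):
    # Design matrix, equation 19.25: A[i, j] = X_j(x_i) / sigma_i
    A = np.column_stack([f(x) / sig for f in basis_funcs])
    b = y / sig                      # equation 19.26
    alpha = A.T @ A                  # equation 19.29
    beta = A.T @ b                   # equation 19.30
    C = np.linalg.inv(alpha)         # covariance matrix of the fit
    a = C @ beta                     # solves [alpha] . a = [beta]
    return a, C                      # diag(C) holds the variances of a

# Example basis for y = a1 + a2*x + a3*x^2 (an assumed illustration):
basis = [lambda t: np.ones_like(t), lambda t: t, lambda t: t**2]
```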
Frequently it is a matter of art to decide which parameters $a_k$ in a model should be fit from the data set, and which should be held constant at fixed values, for example values predicted by a theory or measured in a previous experiment. One wants, therefore, to have a convenient means for freezing and unfreezing the parameters $a_k$.
knew this distribution, we would know everything that there is to know about the quantitative uncertainties in our experimental measurement $a_{(0)}$.
So the name of the game is to find some way of estimating or approximating the probability distribution of $a_{(i)} - a_{\rm true}$ without knowing $a_{\rm true}$ and without having available to us an infinite universe of hypothetical data sets.
Figure 20.1: A statistical universe of data sets from an underlying model. True parameters $a_{\rm true}$ are realized in a data set, from which fitted (observed) parameters $a_{(0)}$ are obtained. If the experiment were repeated many times, new data sets and new values of the fitted parameters would be obtained.
Now, often, the distribution of $a_{(i)} - a_{(0)}$ in the fictitious world is within our power to calculate (see Figure 20.2). If we know something about the process that generated our data, given an assumed set of parameters $a_{(0)}$, then we can usually figure out how to simulate our own sets of synthetic realizations of these parameters as synthetic data sets. The procedure is to draw random numbers from appropriate distributions so as to mimic our best understanding of the underlying process and measurement errors in our apparatus. With such random draws, we construct data sets with exactly the same numbers of measured points, and precisely the same values of all control (independent) variables, as our actual data set $D_{(0)}$. Let us call these simulated data sets $D^S_{(1)}, D^S_{(2)}, \ldots$. By construction these are supposed to have exactly the same statistical relationship to $a_{(0)}$ as the $D_{(i)}$'s have to $a_{\rm true}$. (For the case where you don't know enough about what you are measuring to do a credible job of simulating it, see below.)
Next, for each $D^S_{(j)}$, perform exactly the same procedure for estimation of parameters, e.g., $\chi^2$ minimization, as was performed on the actual data to get the parameters $a_{(0)}$, giving simulated measured parameters $a^S_{(1)}, a^S_{(2)}, \ldots$. Each simulated measured parameter set yields a point $a^S_{(i)} - a_{(0)}$. Simulate enough data sets and enough derived simulated measured parameters, and you map out the desired probability distribution in $M$ dimensions.
Figure 20.2: Monte Carlo simulation of an experiment. The fitted parameters from an actual experiment are used as surrogates for the true parameters. Computer-generated random numbers are used to simulate many synthetic data sets. Each of these is analyzed to obtain its fitted parameters. The distribution of these fitted parameters around the (known) surrogate true parameters is thus studied.
In fact, the ability to do Monte Carlo simulations in this fashion has revolutionized many fields of modern experimental science. Not only is one able to characterize the errors of parameter estimation in a very precise way; one can also try out on the computer different methods of parameter estimation, or different data reduction techniques, and seek to minimize the uncertainty of the result according to any desired criteria. Offered the choice between mastery of a five-foot shelf of analytical statistics books and middling ability at performing statistical Monte Carlo simulations, we would surely choose to have the latter skill.
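In code, the procedure of Figure 20.2 takes only a few lines. The sketch below assumes the straight-line model and reuses the fit_line routine sketched earlier; the number of synthetic data sets and the random seed are arbitrary choices.

```python
# Monte Carlo simulation of fitted-parameter errors (Figure 20.2):
# treat the fitted parameters (a0, b0) as surrogate truth, generate
# synthetic data sets, refit each one, and study the spread.
import numpy as np

def monte_carlo_errors(x, sig, a0, b0, n_sets=1000, seed=1):
    rng = np.random.default_rng(seed)
    deltas = np.empty((n_sets, 2))
    for j in range(n_sets):
        y_syn = a0 + b0 * x + rng.normal(0.0, sig)  # one synthetic data set
        a, b = fit_line(x, y_syn, sig)[:2]          # same estimator as before
        deltas[j] = (a - a0, b - b0)                # a point a_S(j) - a_(0)
    return deltas  # maps out the desired probability distribution
```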
The procedure is simply to draw $N$ data points at a time with replacement from the set $D_{(0)}$. Because of the replacement, you do not simply get back your original data set each time. You get sets in which a random fraction of the original points,
typically $1/e \approx 37\%$, are replaced by duplicated original points. Now, exactly as in the previous discussion, you subject these data sets to the same estimation procedure as was performed on the actual data, giving a set of simulated measured parameters $a^S_{(1)}, a^S_{(2)}, \ldots$. These will be distributed around $a_{(0)}$ in close to the same way that $a_{(0)}$ is distributed around $a_{\rm true}$.
Sounds like getting something for nothing, doesn't it? In fact, it has taken more
than a decade for the bootstrap method to become accepted by statisticians. By
now, however, enough theorems have been proved to render the bootstrap reputable.
The basic idea behind the bootstrap is that the actual data set, viewed as a prob-
ability distribution consisting of delta functions at the measured values, is in most
cases the best or only available estimator of the underlying probability distri-
bution. It takes courage, but one can often simply use that distribution as the basis
for Monte Carlo simulations.
Watch out for cases where the bootstrap's iid assumption is violated. For ex-
ample, if you have made measurements at evenly spaced intervals of some control
variable, then you can usually get away with pretending that these are iid, uni-
formly distributed over the measured range. However, some estimators of a (e.g.,
ones involving Fourier methods) might be particularly sensitive to all the points on
a grid being present. In that case, the bootstrap is going to give a wrong distribu-
tion. Also watch out for estimators that look at anything like small-scale clumpiness
within the N data points, or estimators that sort the data and look at sequential
differences. Obviously the bootstrap will fail on these, too. (The theorems justifying
the method are still true, but some of their technical assumptions are violated by
these examples.)
For a large class of problems, however, the bootstrap does yield easy, very quick,
Monte Carlo estimates of the errors in an estimated parameter set.
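Under the iid assumption just discussed, a bootstrap sketch might look like the following, where estimator stands for whatever fitting procedure was applied to the actual data; the function and its interface are assumed placeholders.

```python
# Bootstrap resampling: draw N points with replacement from the actual
# data set and re-run the same estimator on each resampled set.
import numpy as np

def bootstrap(x, y, sig, estimator, n_boot=1000, seed=2):
    rng = np.random.default_rng(seed)
    n = len(x)
    params = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # N draws with replacement
        params.append(estimator(x[idx], y[idx], sig[idx]))
    # The spread of these estimates about a_(0) approximates the
    # spread of a_(0) about a_true.
    return np.array(params)
```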
You might suspect, correctly, that the numbers 68.3 percent, 95.4 percent, and
99.73 percent, and the use of ellipsoids, have some connection with a normal dis-
tribution. That is true historically, but not always relevant nowadays. In general,
the probability distribution of the parameters will not be normal, and the above
numbers, used as levels of confidence, are purely matters of convention.
Figure 20.3 sketches a possible probability distribution for the case $M = 2$. Shown are three different confidence regions that might usefully be given, all at the same confidence level. The two vertical lines enclose a band (horizontal interval) which represents the 68 percent confidence interval for the variable $a_1$ without regard to the value of $a_2$. Similarly the horizontal lines enclose a 68 percent confidence interval for $a_2$. The ellipse shows a 68 percent confidence interval for $a_1$ and $a_2$ jointly. Notice that to enclose the same probability as the two bands, the ellipse must necessarily extend outside of both of them (a point we will return to below).
4. to give the covariance matrix $C_{ij}$ as the formal covariance matrix of the fit.
You are not allowed to use the formulas that we now give for the case of normal errors, which establish quantitative relationships among $\Delta\chi^2$, $C_{ij}$, and the confidence level.
Here are the key theorems that hold when (i) the measurement errors are normally distributed, and either (ii) the model is linear in its parameters or (iii) the sample size is large enough that the uncertainties in the fitted parameters $a$ do not extend outside a region in which the model could be replaced by a suitable linearized model.
Theorem A. $\chi^2_{\rm min}$ is distributed as a chi-square distribution with $N - M$ degrees of freedom, where $N$ is the number of data points and $M$ is the number of fitted parameters. This is the basic theorem that lets you evaluate the goodness-of-fit of the model. We list it first to remind you that unless the goodness-of-fit is credible, the whole estimation of parameters is suspect.
Theorem B. If $a^S_{(j)}$ is drawn from the universe of simulated data sets with actual parameters $a_{(0)}$, then the probability distribution of $\delta a = a^S_{(j)} - a_{(0)}$ is the multivariate normal distribution

$$ P(\delta a)\, da_1 \cdots da_M = \mathrm{const} \times \exp\left( -\tfrac{1}{2}\, \delta a \cdot [\alpha] \cdot \delta a \right) da_1 \cdots da_M \qquad (20.1) $$
$\Delta\chi^2$ as a function of confidence level $p$ and number of parameters of interest $\nu$:

p        ν = 1   ν = 2   ν = 3   ν = 4   ν = 5   ν = 6
68.3%    1.00    2.30    3.53    4.72    5.89    7.04
90%      2.71    4.61    6.25    7.78    9.24    10.6
95.4%    4.00    6.17    8.02    9.70    11.3    12.8
99%      6.63    9.21    11.3    13.3    15.1    16.8
99.73%   9.00    11.8    14.2    16.3    18.2    20.1
99.99%   15.1    18.4    21.1    23.5    25.7    27.8
to your desired confidence level. (Additional values are given in the accompanying
table.)
Let $\delta a$ be a change in the parameters whose first component is arbitrary, $\delta a_1$, but the rest of whose components are chosen to minimize the $\Delta\chi^2$. Then Theorem D applies. The value of $\Delta\chi^2$ is given in general by

$$ \Delta\chi^2 = \delta a \cdot [\alpha] \cdot \delta a \qquad (20.2) $$

Since $\delta a$ by hypothesis minimizes $\chi^2$ in all but its first component, the second through $M$th components of the normal equations continue to hold. Therefore, the solution is

$$ \delta a = [\alpha]^{-1} \cdot (c, 0, \ldots, 0)^T = [C] \cdot (c, 0, \ldots, 0)^T \qquad (20.3) $$

where $c$ is one arbitrary constant that we get to adjust to make 20.2 give the desired left-hand value. Plugging 20.3 into 20.2 and using the fact that $[C]$ and $[\alpha]$ are inverse matrices of one another, we get

$$ c = \frac{\delta a_1}{C_{11}} \quad \text{and} \quad \Delta\chi^2 = \frac{(\delta a_1)^2}{C_{11}} \qquad (20.4) $$

or

$$ \delta a_1 = \pm\sqrt{\Delta\chi^2}\,\sqrt{C_{11}} \qquad (20.5) $$

At last! A relation between the confidence interval $\pm\delta a_1$ and the formal standard error $\sigma_1 \equiv \sqrt{C_{11}}$. Not unreasonably, we find that the 68 percent confidence interval is $\pm\sigma_1$, the 95 percent confidence interval is $\pm 2\sigma_1$, etc.
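Under the normal-error theorems above, these $\Delta\chi^2$ thresholds are simply quantiles of the chi-square distribution with $\nu$ degrees of freedom, so the table can be reproduced in a few lines; this is a hedged check using the confidence levels listed in the table.

```python
# Reproduce the Delta-chi-square table: the boundary value at confidence
# level p with nu parameters of interest is the chi-square quantile.
from scipy.stats import chi2

for p in (0.683, 0.90, 0.954, 0.99, 0.9973, 0.9999):
    row = [round(chi2.ppf(p, nu), 2) for nu in range(1, 7)]
    print(p, row)  # first row: [1.0, 2.3, 3.53, 4.72, 5.89, 7.04]
```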
These considerations hold not just for the individual parameters $a_i$, but also for any linear combination of them: If

$$ b \equiv \sum_{i=1}^{M} c_i a_i = c \cdot a \qquad (20.6) $$
1. Let $\nu$ be the number of fitted parameters whose joint confidence region you wish to display, $\nu \le M$. Call these parameters the parameters of interest.
5. Invert the matrix $[C_{\rm proj}]$. (In the one-dimensional case this was just taking the reciprocal of the element $C_{11}$.)
6. The equation for the elliptical boundary of your desired confidence region in the $\nu$-dimensional subspace of interest is

$$ \Delta\chi^2 = \delta a' \cdot [C_{\rm proj}]^{-1} \cdot \delta a' \qquad (20.8) $$
If you are confused at this point, you may find it helpful to compare Figure 20.4 and the accompanying table, considering the case $M = 2$ with $\nu = 1$ and $\nu = 2$. You should be able to verify the following statements: (i) The horizontal band between $C$ and $C'$ contains 99 percent of the probability distribution, so it is a confidence
limit on $a_2$ alone at this level of confidence. (ii) Ditto the band between $B$ and $B'$ at the 90 percent confidence level. (iii) The dashed ellipse, labeled by $\Delta\chi^2 = 2.30$, contains 68.3 percent of the probability distribution, so it is a confidence region for $a_1$ and $a_2$ jointly, at this level of confidence.
Figure 21.1: Example of a contingency table for two nominal variables, here sex and color. The row and column marginals (totals) are shown. The variables are nominal, i.e., the order in which their values are listed is arbitrary and does not affect the result of the contingency table analysis. If the ordering of values has some intrinsic meaning, then the variables are ordinal or continuous, and correlation techniques can be utilized.
A continuous variable can always be made into an ordinal one by binning it into
ranges. If we choose to ignore the ordering of the bins, then we can turn it into a
nominal variable. Nominal variables constitute the lowest type of the hierarchy, and
therefore the most general. For example, a set of several continuous or ordinal
variables can be turned, if crudely, into a single nominal variable, by coarsely binning
each variable and then taking each distinct combination of bin assignments as a
single nominal value. When multidimensional data are sparse, this is often the only
sensible way to proceed.
The remainder of this section will deal with measures of association between
nominal variables. For any pair of nominal variables, the data can be displayed as
a contingency table, a table whose rows are labelled by the values of one nominal
variable, whose columns are labelled by the values of the other nominal variable,
and whose entries are nonnegative integers giving the number of observed events for
each combination of row and column (see Figure 21.1). The analysis of association
between nominal variables is thus called contingency table analysis or crosstabulation
analysis. We will introduce two different approaches. The first approach, based
on the chi-square statistic, does a good job of characterizing the significance of
association, but is only so-so as a measure of the strength (principally because
its numerical values have no very direct interpretations). The second approach,
based on the information-theoretic concept of entropy, says nothing at all about the
significance of association (use chi-square for that!), but is capable of very elegantly
characterizing the strength of an association already known to be significant.
$N_{\cdot j}$ and $N_{i\cdot}$ are sometimes called the row and column totals or marginals, but we
will use these terms cautiously since we can never keep straight which are the rows
and which are the columns!
The null hypothesis is that the two variables x and y have no association. In this
case, the probability of a particular value of x given a particular value of y should
be the same as the probability of that value of x regardless of y. Therefore, in the
null hypothesis, the expected number for any $N_{ij}$, which we will denote $n_{ij}$, can be
calculated from only the row and column totals,
$$\frac{n_{ij}}{N_{\cdot j}} = \frac{N_{i\cdot}}{N} \quad\text{which implies}\quad n_{ij} = \frac{N_{i\cdot}\, N_{\cdot j}}{N} \qquad (21.2)$$
Notice that if a column or row total is zero, then the expected number for all the
entries in that column or row is also zero; in that case, the never-occurring bin of x
or y should simply be removed from the analysis.
The chi-square statistic is now given by equation 15.8, which, in the present case,
is summed over all entries in the table,
$$\chi^2 = \sum_{i,j} \frac{(N_{ij} - n_{ij})^2}{n_{ij}} \qquad (21.3)$$
The number of degrees of freedom is equal to the number of entries in the table
(product of its row size and column size) minus the number of constraints that have
arisen from our use of the data themselves to determine the $n_{ij}$. Each row total and
column total is a constraint, except that this overcounts by one, since the total of
the column totals and the total of the row totals both equal N, the total number
of data points. Therefore, if the table is of size I by J, the number of degrees of
freedom is $IJ - I - J + 1$. Equation 21.3, along with the chi-square probability
function, now gives the significance of an association between the variables x and y.
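The procedure is mechanical enough to sketch in a few lines of Python; the 2x3 table of counts below is hypothetical, used only to illustrate equations 21.2 and 21.3:

```python
# Chi-square test of association for a contingency table (eqs. 21.2, 21.3).
import numpy as np
from scipy.stats import chi2

N_table = np.array([[20., 30., 25.],
                    [30., 20., 25.]])          # hypothetical counts N_ij
N = N_table.sum()
n = np.outer(N_table.sum(axis=1), N_table.sum(axis=0)) / N  # n_ij = N_i. N_.j / N
chi_sq = ((N_table - n) ** 2 / n).sum()                     # equation 21.3
I, J = N_table.shape
dof = I * J - I - J + 1                                     # = (I - 1)(J - 1)
print(chi_sq, dof, chi2.sf(chi_sq, dof))                    # significance
```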
Suppose there is a significant association. How do we quantify its strength, so
that (e.g.) we can compare the strength of one association with another? The idea
here is to find some reparametrization of $\chi^2$ which maps it into some convenient
interval, like 0 to 1, where the result is not dependent on the quantity of data that
we happen to sample, but rather depends only on the underlying population from
which the data were drawn. There are several different ways of doing this. Two of
the more common are called Cramer's V and the contingency coefficient C.
The formula for Cramer's V is
$$V = \sqrt{\frac{\chi^2}{N \min(I-1, J-1)}} \qquad (21.4)$$
where I and J are again the numbers of rows and columns, and N is the total
number of events. Cramer's V has the pleasant property that it lies between zero
and one inclusive, equals zero when there is no association, and equals one only
when the association is perfect: All the events in any row lie in one unique column,
and vice versa. (In chess parlance, no two rooks, placed on a nonzero table entry,
can capture each other.)
In the case of I = J = 2, Cramer's V is also referred to as the phi statistic.
The contingency coefficient C is defined as
$$C = \sqrt{\frac{\chi^2}{\chi^2 + N}} \qquad (21.5)$$
It also lies between zero and one, but (as is apparent from the formula) it can never
achieve the upper limit. While it can be used to compare the strength of association
of two tables with the same I and J, its upper limit depends on I and J. Therefore
it can never be used to compare tables of different sizes.
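Both measures follow directly from the chi-square value; a minimal sketch (the function names are ours, not a standard library API):

```python
# Strength-of-association measures derived from chi-square.
from math import sqrt

def cramers_v(chi_sq, N, I, J):          # equation 21.4
    return sqrt(chi_sq / (N * min(I - 1, J - 1)))

def contingency_c(chi_sq, N):            # equation 21.5
    return sqrt(chi_sq / (chi_sq + N))
```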
The trouble with both Cramer's V and the contingency coefficient C is that, when
they take on values in between their extremes, there is no very direct interpretation
of what that value means. For example, you are in Las Vegas, and a friend tells you
that there is a small, but significant, association between the color of a croupier's
eyes and the occurrence of red and black on his roulette wheel. Cramer's V is about
0.028, your friend tells you. You know what the usual odds against you are (because
of the green zero and double zero on the wheel). Is this association sufficient for you
to make money? Don't ask us!
$$H = -\sum_{i=1}^{I} p_i \ln p_i \qquad (21.6)$$
The value H lies between 0 and $\ln I$. It is zero only when one of the $p_i$'s is one, all the
others zero: In this case, the question is valueless, since its answer is preordained. H
takes on its maximum value when all the $p_i$'s are equal, in which case the question
is sure to eliminate all but a fraction 1/I of the remaining possibilities.
The value H is conventionally termed the entropy of the distribution given by
the $p_i$'s, a terminology borrowed from statistical physics.
So far we have said nothing about the association of two variables; but suppose
we are deciding what question to ask next in the game and have to choose between
two candidates, or possibly want to ask both in one order or another. Suppose that
one question, x, has I possible answers, labeled by i, and that the other question,
y, has J possible answers, labeled by j. Then the possible outcomes of asking both
questions form a contingency table whose entries $N_{ij}$, when normalized by dividing
by the total number of remaining possibilities N, give all the information about the
p's. In particular, we can make contact with the notation 21.1 by identifying
$$p_{ij} = \frac{N_{ij}}{N} \qquad (21.8)$$
$$p_{i\cdot} = \frac{N_{i\cdot}}{N} \quad\text{(outcomes of question $x$ alone)} \qquad (21.9)$$
$$p_{\cdot j} = \frac{N_{\cdot j}}{N} \quad\text{(outcomes of question $y$ alone)} \qquad (21.10)$$
The entropies of the questions x and y are, respectively,
$$H(x) = -\sum_i p_{i\cdot} \ln p_{i\cdot}, \qquad H(y) = -\sum_j p_{\cdot j} \ln p_{\cdot j} \qquad (21.11)$$
Now what is the entropy of the question y given x (that is, if x is asked first)? It
is the expectation value over the answers to x of the entropy of the restricted y
distribution that lies in a single column of the contingency table (corresponding to
the x answer):
$$H(y|x) = -\sum_i p_{i\cdot} \sum_j \frac{p_{ij}}{p_{i\cdot}} \ln \frac{p_{ij}}{p_{i\cdot}} = -\sum_{i,j} p_{ij} \ln \frac{p_{ij}}{p_{i\cdot}} \qquad (21.13)$$
We can readily prove that the entropy of y given x is never more than the entropy
of y alone, i.e., that asking x first can only reduce the usefulness of asking y (in which
case the two variables are associated!):
$$H(y|x) - H(y) = -\sum_{i,j} p_{ij} \ln \frac{p_{ij}/p_{i\cdot}}{p_{\cdot j}} \qquad (21.15)$$
$$= \sum_{i,j} p_{ij} \ln \frac{p_{\cdot j}\, p_{i\cdot}}{p_{ij}} \qquad (21.16)$$
$$\leq \sum_{i,j} p_{ij} \left( \frac{p_{\cdot j}\, p_{i\cdot}}{p_{ij}} - 1 \right) \qquad (21.17)$$
$$= \sum_{i,j} p_{i\cdot}\, p_{\cdot j} - \sum_{i,j} p_{ij} \qquad (21.18)$$
$$= 0 \qquad (21.19)$$
$$\ln w \leq w - 1 \qquad (21.20)$$
to show that
$$U(x, y) = \frac{H(x)\,U(x|y) + H(y)\,U(y|x)}{H(x) + H(y)} \qquad (21.25)$$
i.e., that the symmetrical measure is just a weighted average of the two asymmetrical
measures 21.21 and 21.22, weighted by the entropy of each variable separately.
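Equations 21.21 and 21.22 are not reproduced above; the sketch below assumes they have the usual form $U(y|x) = [H(y) - H(y|x)]/H(y)$, and uses the standard identity $H(y|x) = H(x,y) - H(x)$, which is consistent with 21.13:

```python
# Entropy-based association for a contingency table. The uncertainty
# coefficient U(y|x) below is an assumption about the form of eq. 21.22.
import numpy as np

def uncertainty_y_given_x(N_table):
    p = N_table / N_table.sum()
    H = lambda q: -np.sum(q[q > 0] * np.log(q[q > 0]))
    Hx = H(p.sum(axis=1))        # entropy of x from the row marginals p_i.
    Hy = H(p.sum(axis=0))        # entropy of y from the column marginals p_.j
    Hy_given_x = H(p) - Hx       # identity H(y|x) = H(x,y) - H(x), cf. 21.13
    return (Hy - Hy_given_x) / Hy
```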
where, as usual, $\bar{x}$ is the mean of the $x_i$'s and $\bar{y}$ is the mean of the $y_i$'s. The value
of r lies between $-1$ and 1, inclusive. It takes on a value of 1, termed complete
positive correlation, when the data points lie on a perfect straight line with positive
slope, with x and y increasing together. The value 1 holds independent of the
magnitude of the slope. If the data points lie on a perfect straight line with negative
slope, y decreasing as x increases, then r has the value $-1$; this is called complete
negative correlation. A value of r near zero indicates that the variables x and y are
uncorrelated.
When a correlation is known to be significant, r is one conventional way of
summarizing its strength. In fact, the value of r can be translated into a statement
about what residuals (root mean square deviations) are to be expected if the data
are fitted to a straight line by the least-squares method. Unfortunately, r is a rather
poor statistic for deciding whether an observed correlation is statistically significant,
and/or whether one observed correlation is significantly stronger than another. The
reason is that r is ignorant of the individual distributions of x and y, so there is no
universal way to compute its distribution in the case of the null hypothesis.
About the only general statement that can be made is this: If the null hypothesis
is that x and y are uncorrelated, and if the distributions for x and y each have enough
convergent moments (tails die off sufficiently rapidly), and if N is large (typically
> 500), then r is distributed approximately normally, with a mean of zero and a
standard deviation of $1/\sqrt{N}$. In that case, the (double-sided) significance of the
correlation, that is, the probability that |r| should be larger than its observed value
in the null hypothesis, is
$$\operatorname{erfc}\left(\frac{|r|\sqrt{N}}{\sqrt{2}}\right) \qquad (21.27)$$
where erfc(x) is the complementary error function. A small value of 21.27 indicates
that the two distributions are significantly correlated. (See below for a more accurate
test.)
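Expression 21.27 is a one-liner; a sketch, with an arbitrary illustrative N (the roulette anecdote above does not state one):

```python
# Significance of a correlation coefficient in the large-N limit (eq. 21.27).
from math import erfc, sqrt

def r_significance(r, N):
    return erfc(abs(r) * sqrt(N) / sqrt(2.0))

print(r_significance(0.028, 10_000))   # illustrative N; small value = significant
```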
Most statistics books try to go beyond 21.27 and give additional statistical tests
that can be made using r. In almost all cases, however, these tests are valid only for
a very special class of hypotheses, namely that the distributions of x and y jointly
form a binormal or two-dimensional Gaussian distribution around their mean values,
with joint probability density
$$p(x, y)\,dx\,dy = \text{const} \times \exp\left[-\tfrac{1}{2}\left(a_{11} x^2 - 2 a_{12} x y + a_{22} y^2\right)\right] dx\,dy \qquad (21.28)$$
where $a_{11}$, $a_{12}$ and $a_{22}$ are arbitrary constants. For this distribution r has the value
$$r = \frac{a_{12}}{\sqrt{a_{11}\, a_{22}}} \qquad (21.29)$$
There are occasions when 21.28 may be known to be a good model of the data.
There may be other occasions when we are willing to take 21.28 as at least a rough
and ready guess, since many two-dimensional distributions do resemble a binormal
distribution, at least not too far out on their tails. In either situation, we can use
21.28 to go beyond 21.27 in any of several directions:
First, we can allow for the possibility that the number N of data points is not
large. Here, it turns out that the statistic
$$t = r\,\sqrt{\frac{N-2}{1-r^2}} \qquad (21.30)$$
is distributed in the null case (of no correlation) like Student's t-distribution with
$\nu = N - 2$ degrees of freedom, whose two-sided significance level is given by $1 - A(t|\nu)$.
As N becomes large, this significance and 21.27 become asymptotically the same,
so that one never does worse by using 21.30, even if the binormal assumption is not
well substantiated.
Second, when N is only moderately large ($\geq 10$), we can compare whether
the difference of two significantly nonzero r's, e.g., from different experiments, is
itself significant. In other words, we can quantify whether a change in some control
variable significantly alters an existing correlation between two other variables. This
is done by using Fisher's z-transformation to associate each measured r with a
corresponding z,
$$z = \frac{1}{2} \ln\left(\frac{1+r}{1-r}\right) \qquad (21.31)$$
Then, each z is approximately normally distributed with a mean value
$$\bar{z} = \frac{1}{2}\left[\ln\left(\frac{1+r_{\text{true}}}{1-r_{\text{true}}}\right) + \frac{r_{\text{true}}}{N-1}\right] \qquad (21.32)$$
where $r_{\text{true}}$ is the actual or population value of the correlation coefficient, and with
a standard deviation
$$\sigma(z) \approx \frac{1}{\sqrt{N-3}} \qquad (21.33)$$
Equations 21.32 and 21.33, when they are valid, give several useful statistical
tests. For example, the significance level at which a measured value of r differs from
some hypothesized value $r_{\text{true}}$ is given by
$$\operatorname{erfc}\left(\frac{|z - \bar{z}|\sqrt{N-3}}{\sqrt{2}}\right) \qquad (21.34)$$
where z and $\bar{z}$ are given by 21.31 and 21.32, with small values of 21.34 indicating
a significant difference. (Setting $\bar{z} = 0$ makes expression 21.34 a more accurate
replacement for expression 21.27 above.) Similarly, the significance of a difference
between two measured correlation coefficients $r_1$ and $r_2$ is
$$\operatorname{erfc}\left(\frac{|z_1 - z_2|}{\sqrt{2}\,\sqrt{\dfrac{1}{N_1 - 3} + \dfrac{1}{N_2 - 3}}}\right) \qquad (21.35)$$
where $z_1$ and $z_2$ are obtained from $r_1$ and $r_2$ using 21.31, and where $N_1$ and $N_2$ are,
respectively, the number of data points in the measurement of $r_1$ and $r_2$.
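A sketch of the two Fisher-z steps, following 21.31 and 21.35 directly:

```python
# Fisher z-transformation tests for correlation coefficients.
from math import log, erfc, sqrt

def fisher_z(r):
    return 0.5 * log((1 + r) / (1 - r))                    # equation 21.31

def compare_correlations(r1, N1, r2, N2):                  # equation 21.35
    z1, z2 = fisher_z(r1), fisher_z(r2)
    return erfc(abs(z1 - z2) / sqrt(2.0 * (1.0 / (N1 - 3) + 1.0 / (N2 - 3))))
```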
All of the significances above are two-sided. If you wish to disprove the null
hypothesis in favor of a one-sided hypothesis, such as that $r_1 > r_2$ (where the sense
of the inequality was decided a priori), then (i) if your measured $r_1$ and $r_2$ have the
wrong sense, you have failed to demonstrate your one-sided hypothesis, but (ii) if
they have the right ordering, you can multiply the significances given above by 0.5,
which makes them more significant.
But keep in mind: These interpretations of the r statistic can be completely
meaningless if the joint probability distribution of your variables x and y is too
different from a binormal distribution.
holds exactly. Notice that if all the $f_k$'s and all the $g_m$'s are equal to one, meaning
that there are no ties, then equation 21.40 reduces to equation 21.39.
In 21.37 we gave a t-statistic that tests the significance of a nonzero $r_s$. It is also
possible to test the significance of D directly. The expectation value of D in the
null hypothesis of uncorrelated data sets is
$$\bar{D} = \frac{1}{6}\left(N^3 - N\right) - \frac{1}{12}\sum_k \left(f_k^3 - f_k\right) - \frac{1}{12}\sum_m \left(g_m^3 - g_m\right) \qquad (21.41)$$
and its variance is
$$\operatorname{Var}(D) = \frac{(N-1)\,N^2\,(N+1)^2}{36} \left[1 - \frac{\sum_k (f_k^3 - f_k)}{N^3 - N}\right] \left[1 - \frac{\sum_m (g_m^3 - g_m)}{N^3 - N}\right] \qquad (21.42)$$
Markov Processes
Chapter 22
Examples of Discrete Processes
Suppose further that initially 50,000 people live both inside and outside the state,
i.e. $x_0 = y_0 = 50{,}000$. We can thus calculate the population in successive years
using the above equations; the results are shown in table 22.1.
It is important also to note that the steady state response, if it exists, is independent
of the initial state of the system, as is illustrated in table 22.2 where the initial
population distribution is 10,000 versus 90,000 but we reach the same distribution.
From the two initial conditions we observe that the steady state is approached
from opposite sides. This can be used to compute upper and lower bounds on the
Table 22.1: The population of Bremen as a function of time with an even initial
distribution.
Table 22.2: The population of Bremen as a function of time with a highly uneven
initial distribution.
steady state, and by iteration the steady state may be found to any accuracy desired.
This requires many iterations, however. It is generally possible to solve for the steady
state response without this calculational aid, even though the iteration is more
illustrative.
We can verify by actual calculation that this state is achieved after 27 iterations of
our Bremen example (with equal initial population inside and outside), after which
33,334 people live inside the state. We can use the rate of convergence to predict
the number of iterations needed to achieve the steady state. This is simply done here:
we multiply the total population by the change in the i-th year and
require that this be less than or equal to the accuracy we want (less than or
equal to one person difference). This is trivially solved by logarithms and gives
$i \geq 32$. Thus we should converge in 32 iterations. In our example we converged
in 27 iterations because we chose an equally distributed initial population. This
estimate applies for the worst case possible. In our situation the worst case scenario
(in terms of the number of iterations to the steady state) is the one in which all
the people live either inside or outside the state. For an actual case, the number
of iterations will be less than or equal to the one calculated by the above method.
Thus to compute the steady state response, we solve this problem and then take the
appropriate matrix power of the transition matrix and multiply this by the initial
state.
We must note that the steady state response does not depend on the initial state
except in two cases. First, if the total population is zero, there is no convergence at
all as the initial distribution can not change. Second, if the results are rounded (as
they were above, to the nearest whole person) we may get results differing by one
unit depending on the initial distribution. If we start with the uneven distribution
of 10,000 people living inside Bremen, we converge to 33,333 people living inside,
whereas if we start with 50,000 living inside we converge to 33,334 people living
inside. This is a feature of rounding. The true steady state response is in general a
real number. If a higher accuracy is required, the above method of calculating the
worst case number of iterations will yield a larger number.
We would have obtained the same result from the other equation. Also note that
since $x_0 + y_0 = 100{,}000$, we must have $x_\infty + y_\infty = 100{,}000$ and so $x_\infty \approx 33{,}333$. This
procedure is very simple indeed, but only for two variables. If the process becomes
more complex, this is not so helpful.
We now form a matrix S from these two matrices by taking one row from each and
making them columns of S. It does not matter which rows we take as they are
related by multiplication of a constant. Thus we choose,
$$S = \begin{pmatrix} 0.1 & 0.2 \\ 0.2 & -0.2 \end{pmatrix} \qquad (22.12)$$
$$A = S \Lambda S^{-1} \qquad (22.15)$$
where $\Lambda$ is the diagonal matrix of eigenvalues.
Note that the constant $(x_0 + y_0)$ is just a scaling constant equal to the total
population. Thus the steady state response may be obtained simply by multiplying the
eigenvector corresponding to the eigenvalue of unity by the total population.
As the determination of eigenvectors is easy, all we have to do is find this eigenvector
and multiply by the population. As such, this method is the best of the ones
illustrated.
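As a sketch of this recipe in Python: the actual Bremen transition matrix is defined earlier in the chapter and not repeated here, so the matrix below is a hypothetical stand-in chosen to have the same steady state of one third of the population inside:

```python
# Steady state from the eigenvector with eigenvalue one. The matrix A is a
# hypothetical stand-in whose steady state is 1/3 inside, 2/3 outside.
import numpy as np

A = np.array([[0.8, 0.1],
              [0.2, 0.9]])
total_population = 100_000

eigvals, eigvecs = np.linalg.eig(A)
k = np.argmin(np.abs(eigvals - 1.0))      # locate the unity eigenvalue
v = np.real(eigvecs[:, k])
v = v / v.sum()                           # scale entries to sum to one
print(v * total_population)               # about [33333, 66667]
```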
$$\frac{dx(t)}{dt} = \lambda x(t) \qquad (22.30)$$
If we assume that $x(0) = x_0 > 0$, then the solution is
$$x(t) = x_0 e^{\lambda t} \qquad (22.31)$$
In this simple model we have not made any assumption about the removal of
bacteria from the population; hence it is clear on biological grounds, as well as
from the solution, that as $t \to \infty$ the population size will go from $x_0$ to $\infty$. The
distinguishing feature of the deterministic solution is that it tells us that, whenever
the initial value $x_0$ is the same, the population size will always be the same for a
given time $t > 0$.
We now consider the stochastic analogue of the above model. Let the integer-valued
random variable X(t) represent the number of bacteria in a population at
time t, and let us assume that $X(0) = x_0 > 0$. In the stochastic approach we do
not derive a functional equation for X(t); instead, we attempt to find an expression
for the probability that at time t the population size is equal to x. Hence, we seek
$P_x(t) = P(X(t) = x)$.
To formulate the stochastic model, we assume (1) that, if at time t there are
$x > 0$ bacteria in the population, the probability that in the interval $(t, t + \Delta t)$ one
bacterium will be added to the population is equal to $\lambda x \Delta t + o(\Delta t)$, $\lambda > 0$, and
(2) that the probability of two or more bacteria being added to the population in
$(t, t + \Delta t)$ is $o(\Delta t)$. These assumptions lead to the relation
$$\frac{dP_x(t)}{dt} = -\lambda x P_x(t) + \lambda (x-1) P_{x-1}(t), \qquad x = x_0, x_0 + 1, \ldots \qquad (22.33)$$
Since we have assumed that $X(0) = x_0$, the above equation has to be solved with
the initial conditions $P_{x_0}(0) = 1$ and $P_x(0) = 0$ for $x \neq x_0$.
To compare the two models, we first observe that in the deterministic approach
the population size was represented by a real-valued and continuous function of time,
while in the stochastic approach we start by assuming that the random variable
denoting the population size is integer-valued. An examination of the deterministic
solution 22.31 shows that, for $\lambda$ and $x_0$ fixed, we have associated with every value
of t a real number x(t). From 22.36 we see that, for $\lambda$ and $x_0$ fixed, and for every
pair $(x, t)$, $x \geq x_0$, $t \geq 0$, there exists a number $P_x(t)$, $0 \leq P_x(t) \leq 1$, which is
the probability that the random variable will assume the value x at time t. It is of
interest to note that the deterministic model is a special case of a stochastic model,
in the sense that it yields results which hold with probability one.
Consider the mean or expected population size. Let $m(t) = E(X(t))$, the
expectation value. By definition,
$$m(t) = \sum_{x=x_0}^{\infty} x P_x(t) = x_0 e^{\lambda t} \qquad (22.37)$$
Hence, we see that the expression for the mean population size 22.37 is the same
as that for the population size 22.31 obtained from the deterministic model. In
view of this correspondence, we can state that equation 22.30 describes the mean
population size, while equation 22.33 takes into consideration random fluctuations.
It is of interest to point out that this correspondence between the two models here
considered does not hold in general; the deterministic solution is not always the
same as the stochastic mean.
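This correspondence is easy to probe by simulation; a sketch of the stochastic model (each of the x current bacteria independently splits at rate λ, so the time to the next birth is exponential with rate λx):

```python
# Simulation of the stochastic birth model; the sample mean of X(t) should
# approach the deterministic solution x0 * exp(lambda * t) of 22.31.
import random
from math import exp

def simulate(x0, lam, t_end):
    x, t = x0, 0.0
    while True:
        t += random.expovariate(lam * x)   # next birth after Exp(lambda * x) time
        if t > t_end:
            return x
        x += 1

x0, lam, t_end, runs = 5, 0.5, 2.0, 20_000
mean = sum(simulate(x0, lam, t_end) for _ in range(runs)) / runs
print(mean, x0 * exp(lam * t_end))         # both close to 13.6
```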
$$5000(0.7) + 10000(0.2) = \text{the number of people who ride the bus next year} = b_1 \qquad (23.1)$$
$$5000(0.3) + 10000(0.8) = \text{the number of people who do not ride the bus next year} = b_2 \qquad (23.2)$$
This system of equations is equivalent to the matrix equation $Mx = b$ where
$$M = \begin{pmatrix} 0.7 & 0.2 \\ 0.3 & 0.8 \end{pmatrix}, \quad x = \begin{pmatrix} 5000 \\ 10000 \end{pmatrix} \quad\text{and}\quad b = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} \qquad (23.3)$$
Note
$$b = \begin{pmatrix} 5500 \\ 9500 \end{pmatrix}. \qquad (23.4)$$
For computing the result after 2 years, we just use the same matrix M; however, we
use b in place of x. Thus the distribution after 2 years is $Mb = M^2 x$. In fact, after
n years, the distribution is given by $M^n x$.
23.1.2 Definitions
The foregoing example is an example of a Markov process. Now for some formal
definitions:
Definition 44 A stochastic process is a sequence of events in which the outcome
at any stage depends on some probability.
Definition 45 A Markov process is a stochastic process with the following properties:
1. The number of possible outcomes or states is finite.
2. The outcome at any stage depends only on the outcome of the previous stage.
3. The probabilities of any outcome become constant over time.
If $x_0$ is a vector which represents the initial state of a system, then there is a
matrix M such that the state of the system after one iteration is given by the vector
$Mx_0$. Thus we get a chain of state vectors: $x_0, Mx_0, M^2 x_0, \ldots$, where the state
of the system after n iterations is given by $M^n x_0$. Such a chain is called a Markov
chain and the matrix M is called a transition matrix.
The state vectors can be of one of two types: an absolute vector or a probability
vector. An absolute vector is a vector where the entries give the actual number of
objects in a given state, as in the first example. A probability vector is a vector where
the entries give the percentage (or probability) of objects in a given state. We will
take all of our state vectors to be probability vectors from now on. Note that the
entries of a probability vector add up to 1. The main theorem on Markov processes
concerns the third property above, namely the notion that the probabilities become
constant over time.
Theorem 18 Let M be the transition matrix of a Markov process. Then there exists
a vector $x_s$ such that $M x_s = x_s$. Moreover, if $M^k$ has only positive entries for some
k, then $x_s$ is unique.
The vector $x_s$ is called a steady-state vector. The transition matrix of an n-state
Markov process is an $n \times n$ matrix M where the i, j entry of M represents the
probability that an object in state j transitions into state i, that is if $M = (m_{ij})$
and the states are $S_1, S_2, \ldots, S_n$ then $m_{ij}$ is the probability that an object in state
$S_j$ transitions to state $S_i$. What remains is to determine the steady-state vector.
Notice that we have the chain of equivalences:
$$M x_s = x_s \;\Leftrightarrow\; M x_s - x_s = 0 \;\Leftrightarrow\; M x_s - I x_s = 0 \;\Leftrightarrow\; (M - I) x_s = 0 \;\Leftrightarrow\; x_s \in N(M - I) \qquad (23.5)$$
Thus $x_s$ is a vector in the nullspace of $M - I$. If $M^k$ has all positive entries for some
k, Theorem 18 guarantees that the steady-state vector obtained by normalizing this
nullspace vector is unique.
Find the transition matrix M and steady-state vector $x_s$ for this Markov process.
Recall that $M = (m_{ij})$ where $m_{ij}$ is the probability of configuration $C_j$ transitioning
to $C_i$. Therefore
$$M = \begin{pmatrix} 0.3 & 0.3 & 0.4 \\ 0.2 & 0.5 & 0.2 \\ 0.5 & 0.2 & 0.4 \end{pmatrix} \qquad (23.10)$$
and
$$M - I = \begin{pmatrix} -0.7 & 0.3 & 0.4 \\ 0.2 & -0.5 & 0.2 \\ 0.5 & 0.2 & -0.6 \end{pmatrix} \qquad (23.11)$$
Now we compute a basis for $N(M - I)$ by putting $M - I$ into reduced echelon form:
$$U = \begin{pmatrix} 1 & 0 & -0.8966 \\ 0 & 1 & -0.7586 \\ 0 & 0 & 0 \end{pmatrix} \qquad (23.12)$$
and we see that
$$x = \begin{pmatrix} 0.8966 \\ 0.7586 \\ 1 \end{pmatrix} \qquad (23.13)$$
is the basis vector for $N(M - I)$. Consequently, $c = 2.6552$ and
$$x_s = \begin{pmatrix} 0.3377 \\ 0.2850 \\ 0.3766 \end{pmatrix} \qquad (23.14)$$
is the steady-state vector of this process.
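A quick numerical check of this example, using the eigenvector route rather than row reduction (both identify the same one-dimensional nullspace of M − I):

```python
# Numerical check of the worked example above.
import numpy as np

M = np.array([[0.3, 0.3, 0.4],
              [0.2, 0.5, 0.2],
              [0.5, 0.2, 0.4]])
eigvals, eigvecs = np.linalg.eig(M)
xs = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
xs = xs / xs.sum()                    # normalize to a probability vector
print(xs)                             # about [0.3377, 0.2850, 0.3766]
```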
for all m, n, r, a and b. We will say that the random walk has returned to the origin
if $S_n$ takes the value zero, its initial value, at some $n \neq 0$.
The simple random walk returns to the origin with certainty if and only if $p = 1/2$.
In all other circumstances, the simple random walk may or may not return. In the
case that $p = 1/2$, the mean time taken to return is, however, infinite. Thus this
result should be taken in the asymptotic convergence sense.
This can be extended to yield the result that the probability of visiting the point
r is given by the following formulae
Example 70 Jack and Jill play a game together. Jack initially has $c - a$ dollars
and Jill has a dollars at the start. We assume that $0 \leq a \leq c$. At each play of
the game, Jill wins or loses one dollar from or to Jack with probability p and q
respectively. The game ends when either player's fortune is zero. We are to take the
role of a conservative friend of Jill's who wants to dissuade her from playing. Thus
we wish to calculate the probability of her losing her fortune.
This is a random walk with $S_0 = a$ as we are focusing on Jill, and we seek the
probability that $S_n = 0$ before $S_n = c$ as this corresponds to Jill losing. Denote this
desired probability by $p_a$, i.e.
Of course, $p_0 = 1$ and $p_c = 0$ and so the only interesting case is when $0 < a < c$.
$$p_a = P(\text{Loss} \cap X_1 = +1 \mid S_0 = a) + P(\text{Loss} \cap X_1 = -1 \mid S_0 = a) \qquad (23.18)$$
$$= p\,P(\text{Loss} \mid X_1 = +1, S_0 = a) + q\,P(\text{Loss} \mid X_1 = -1, S_0 = a) \qquad (23.19)$$
$$= p\,P(\text{Loss} \mid S_0 = a + 1) + q\,P(\text{Loss} \mid S_0 = a - 1) \qquad (23.20)$$
$$= p\,p_{a+1} + q\,p_{a-1} \qquad (23.21)$$
having used time and space homogeneity of the random walk. The recurrence relation
must be solved and this is typically done by a substitution $p_a = x^a$. We obtain
This has solutions $x = 1, q/p$. When $p \neq q$, these solutions are different and the
general solution of the recurrence relation is thus
$$p_a = A + B \left(\frac{q}{p}\right)^a \qquad (23.23)$$
whereas when $p = q$ the general solution is
$$p_a = C + Da \qquad (23.24)$$
In both cases, we have boundary conditions that thus allow the constants A, B, C
and D to be found and thus,
$$p_a = 1 - \frac{a}{c}; \qquad p_a = \frac{x^c - x^a}{x^c - 1} \qquad (23.25)$$
for the cases of $p = q = 1/2$ and $p \neq q$ respectively, where $x = q/p$. If we interchange
the roles of the players we arrive at the conclusion that the probability that either
person loses is one, i.e. the game must end at some time.
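The ruin probability 23.25 is easy to verify by Monte Carlo; a sketch with arbitrary illustrative values of a, c and p:

```python
# Monte Carlo check of the ruin probability formula 23.25 (p != q case).
import random

def ruin_probability(a, c, p, trials=100_000):
    losses = 0
    for _ in range(trials):
        s = a
        while 0 < s < c:
            s += 1 if random.random() < p else -1   # Jill wins with prob. p
        losses += (s == 0)
    return losses / trials

a, c, p = 3, 10, 0.45
x = (1 - p) / p
print(ruin_probability(a, c, p), (x**c - x**a) / (x**c - 1))   # both about 0.87
```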
As the game is finite, denote by $T_a$ the time that the game lasts. Naturally
$T_0 = T_c = 0$ and so we again take $0 < a < c$. Identically to above, we find
Let's record the random walk by considering the time as well as the location, i.e.
we are going to record $(n, S_n)$ at every step. This allows a graph to be produced
with n on the horizontal axis as time. If a and b are integers and $n > 0$, we define
$N(a, b, n)$ to be the number of paths from $(0, a)$ to $(n, b)$ and $N_0(a, b, n)$ to be the
number of such paths that cross the time axis, i.e. that contain a $(k, 0)$ for some
$k > 0$. The important reflection principle states that if $a > 0$ and $b > 0$, we have
$N_0(a, b, n) = N(-a, b, n)$. This principle is used to prove some interesting results.
Theorem 19 (Ballot Theorem) There are two candidates for a political office,
A gets x votes and B gets y votes with $x > y$. Assuming the votes are randomly
mixed before being counted out, the probability that A is always ahead of B during
the counting is $(x - y)/(x + y)$.
$$P(S_{2m} = 0) = P(S_1 \neq 0, S_2 \neq 0, \ldots, S_{2m} \neq 0) \qquad (23.28)$$
$$= P(S_1 \geq 0, S_2 \geq 0, \ldots, S_{2m} \geq 0) \qquad (23.29)$$
$$\qquad (23.30)$$
Chapter 24
Markov Chains
24.1 Recurrence
We first define a new concept.
Definition 47 If for some $n \geq 0$, we have $P^n_{ij} > 0$ where P is the transition matrix,
then we say that the state j is accessible from the state i. This is denoted by $i \to j$.
If we have both $i \to j$ and $j \to i$, then the two states are said to communicate.
Definition 48 We denote by $f^{(n)}_{ij}$ the probability that, given the first state i, we
visit state j for the first time at time n; for example,
$$f^{(1)}_{ij} = P(X_1 = j \mid X_0 = i) \qquad (24.1)$$
Thus
$$f_{ij} = \sum_n f^{(n)}_{ij} \qquad (24.3)$$
is the probability that we ever visit j starting at i. If $f_{ii} < 1$, we say that i is
transient and otherwise recurrent.
The message here is simple: If the likelihood that we never return to state i
is non-zero, the state is a transient state that is, in some sense, unreliable. If it is
certain that we will get back to i at some point, then the state is obviously recurrent.
As the future is independent of the past path, the probability of reaching i exactly
N times is $f_{ii}^N$. For a recurrent state this probability is always unity and for
a transient state this tends to zero as N gets large. This is encapsulated by the
following theorem.
Theorem 21 The state i is transient if and only if $\sum_n P^n_{ii}$ converges and the state
is recurrent if and only if the same sum diverges. The properties of transience and
recurrence belong to the whole equivalence class of states.
We remember that return to the origin is only possible after an even number of
steps. Hence the sequence $\{P^n_{ii}\}$ is alternately zero and non-zero. If this occurs,
we may or may not find that the non-zero terms converge. As the whole sequence
does not converge, we need to deal with these cases.
Thus simple random walk states have period two. One can show that if i and
j belong to the same equivalence class, then $d_i = d_j$. There are two sufficient
conditions for aperiodicity: (1) some diagonal entry $P_{ii}$ is non-zero and (2) there
exist paths of 2 and 3 steps from state i to itself.
The periodic case is an annoyance and we want to get rid of it for further analysis.
Let P be the transition matrix of an irreducible but periodic chain with period $d > 1$.
For any state i, we put
$$C(i) = \{j : p^{(dn)}_{ij} > 0 \text{ for some } n > 0\} \qquad (24.4)$$
It can be shown that the C(i) are either identical or disjoint, i.e. they do not partially
overlap. Consider the transition matrix $Q = P^d$. The chain with transition matrix
Q is reducible with the d sets C(i) as its equivalence classes. The original periodic
chain is just like d aperiodic chains operating in parallel with a new timescale. Thus
we can now focus on irreducible and aperiodic chains. The major result follows.
Theorem 22 For all states i and j in an irreducible, aperiodic Markov chain, the
following holds
1. If the chain is transient, $p^{(n)}_{ij} \to 0$.
2. If the chain is recurrent, $p^{(n)}_{ij} \to \pi_j$ where either
3. If the chain is recurrent, let $T_i$ represent the time to return to state i. Then
$\mu_i = E(T_i) = 1/\pi_i$ with the understanding that $\mu_i = \infty$ if $\pi_i = 0$.
$$x_n = P^n x_0 \qquad (24.5)$$
which allows us simply to iterate the whole thing. Now, in a transient or null
recurrent chain $x_n \to 0$, i.e. the state tends to the zero vector. In other words,
the probability that the n-th state is state j tends to zero for any particular state j.
For a positive recurrent chain $x_n \to \pi$ where $\pi$ is the probability vector from the
theorem. Thus the probability that the n-th state is state j tends to $\pi_j > 0$ for any
state j. The difference between null and transient chains is the frequency of visiting
i. In a transient chain we may never make a visit and we cannot return infinitely
often. In a null chain we are certain to make infinitely many visits although the
mean time between visits is infinite. We may decide between the three alternatives
using a criterion.
1. The chain is transient if and only if the system $Qy = y$ has a bounded non-zero
solution.
2. The chain is positive recurrent if and only if the system $P\pi = \pi$ has a solution
that is a probability vector.
we have
$$E(Y(j)) = \sum_{n=0}^{T} P(I_j^n = 1) = \sum_{n=0}^{T} \sum_i P(X_0 = i)\, P^{(n)}_{ij} = \sum_i P(X_0 = i) \sum_{n=0}^{T} P^{(n)}_{ij}. \qquad (24.8)$$
For example, if we definitely start in state i, so that $P(X_0 = i) = 1$ and $P(X_0 = k) = 0$
for $k \neq i$, the mean number of visits to j up to time T is
$$\sum_{n=0}^{T} P^{(n)}_{ij} = \sum_{n=0}^{T} (P^n)_{ij}. \qquad (24.9)$$
Recall the gambler's example with Jack and Jill from the last lecture. As a Markov
chain, the states are $\{0, 1, \ldots, c\}$, and the non-zero transition probabilities are
$P_{00} = P_{cc} = 1$, with $P_{i,i-1} = q$ and $P_{i,i+1} = p$ for $1 \leq i \leq c - 1$. There are three
classes: $\{0\}$ and $\{c\}$, both recurrent, and $\{1, 2, \ldots, c-1\}$, clearly transient.
It is convenient to write the transition matrix P with the states in a non-standard
order: listing the transient states $1, 2, \ldots, c-1$ first and the absorbing states 0 and
c last, P takes the block form
$$P = \begin{pmatrix} Q & A \\ 0 & I \end{pmatrix} \qquad (24.10)$$
where Q is a $(c-1) \times (c-1)$ matrix, with q on the subdiagonal and p on the
superdiagonal, and A collects the transitions into the absorbing states.
Because of the structure of P, the powers $\{P^n\}$ have a similar structure, with
$$P^n = \begin{pmatrix} Q^n & A_n \\ 0 & I \end{pmatrix} \qquad (24.11)$$
Now
$$(I - Q)\, S_T = \sum_{n=0}^{T} Q^n - \sum_{n=0}^{T} Q^{n+1} = I - Q^{T+1}. \qquad (24.14)$$
But $Q^n \to 0$ as $n \to \infty$, since the states $\{1, 2, \ldots, c-1\}$ are transient, so
$(I - Q) S_T \to I$, and hence $S_T \to (I - Q)^{-1}$. This shows that in the gambler's ruin
problem, the mean total number of visits to state j, starting from state i, is
$((I - Q)^{-1})_{ij}$, if $1 \leq i, j \leq c - 1$.
For example, if $c = 4$ then
$$Q = \begin{pmatrix} 0 & p & 0 \\ q & 0 & p \\ 0 & q & 0 \end{pmatrix}, \qquad (24.15)$$
and so
$$(I - Q)^{-1} = \frac{1}{p^2 + q^2} \begin{pmatrix} 1 - pq & p & p^2 \\ q & 1 & p \\ q^2 & q & 1 - pq \end{pmatrix}. \qquad (24.16)$$
Starting with unit amount, the mean number of times over the whole game that we
possess exactly three units is the (1, 3) entry $p^2/(p^2 + q^2)$.
In this fashion, we can find the mean total number of visits to any transient state
of a general Markov chain. For a recurrent state, this mean number is either zero
(if we cannot reach it from our starting point) or infinite.
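A sketch verifying equation 24.16 numerically for the c = 4 gambler's chain (p chosen arbitrarily):

```python
# Mean visits to transient states via the fundamental matrix (I - Q)^(-1).
import numpy as np

p = 0.45
q = 1 - p
Q = np.array([[0, p, 0],
              [q, 0, p],
              [0, q, 0]])
F = np.linalg.inv(np.eye(3) - Q)
# the (1, 3) entry (index [0, 2]) should equal p^2 / (p^2 + q^2), cf. 24.16
print(F[0, 2], p**2 / (p**2 + q**2))
```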
A systematic way to assess the long-term behavior of a Markov chain with transition
matrix P, i.e. the fate of $X_n$ conditional on $X_0 = i$, might proceed as follows.
1. Find the classes and establish which, if any, are closed. Find the period of
each closed class. If the chain has no closed classes, all states are transient
and $P(X_n = j) \to 0$ for all j.
2. For each closed aperiodic class C, determine whether it is transient, null or
positive. In the first two cases, $P(X_n = j) \to 0$ for all $j \in C$; otherwise
$P(X_n = j \mid C \text{ ever entered}) \to \pi_j > 0$ for all $j \in C$.
3. For each closed class C of period $d > 1$, let $P_0$ be that part of P that describes
transitions among the states of C alone, and let $Q = P_0^d$. For the transition
matrix Q, C splits into d aperiodic subclasses, each to be treated as in the
above case. Ascertain the order in which these subclasses are visited.
4. Denote the closed classes by $C_1, C_2, \ldots$, and write R = the rest of the states.
Write $x_{ij} = P(\text{Eventually enter } C_j \mid X_0 = i)$. Considering one step,
$$x_{ij} = \sum_{k \in R} P_{ik}\, x_{kj} + \sum_{k \in C_j} P_{ik}, \qquad i \in R,\; j = 1, 2, \ldots \qquad (24.17)$$
from which the $\{x_{ij}\}$ are to be found, and then use the above two steps.
Chapter 25
Queuing Theory
Please see sections 8.1 and 8.2 in the attached photocopies for this chapter.
Chapter 26
Brownian Motion
Please see sections 8.3 and 8.4 in the attached photocopies for this chapter.
Chapter 27
Markov Random Fields
for some n and m, so that S contains nm points. These points are often called
sites. We must now define what the neighborhood of a point is going to be. This
definition is up to us and depends on the use to which we wish to put the model.
We will adopt some restrictions, however: (1) a site must not neighbor itself and (2)
the neighborhood property is symmetric. We will write $s \sim t$ if the two sites $s, t \in S$
are neighbors. Some common neighborhood structures are shown in figure 27.1. If
s is a site, we define the neighborhood $N_s$ of s as the set of all neighbors,
$$N_s = \{t \in S : t \sim s\}. \qquad (27.2)$$
And so figure 27.1 shows the neighborhood of the middle site for two different
structures. We note that in these structures, special care must be taken at the
edge of the lattice S, since sites located there have smaller neighborhoods. One may
get around these difficulties by identifying the borders with each other, i.e. if we
leave on one side, we enter on the opposing side. This technique is called periodic
boundary conditions and is very important in applied mathematics.
We consider a Markov random field $\{X(s)\}_{s \in S}$ defined on S, i.e. a collection X(s)
of random variables indexed by sites in S. These random variables are assumed to
take their values in a finite set $\mathcal{X}$, the state space. Examples of $\mathcal{X}$ that we will use
are $\mathcal{X} = \{-1, +1\}$ and $\mathcal{X} = \{1, 2, \ldots, r\}$. The set $\mathcal{X}^S$ is the set of elements of the
form $x = \{x(s)\}_{s \in S}$ with $x(s) \in \mathcal{X}$ for each s. An element of $\mathcal{X}^S$ will often be
referred to as a configuration of the Markov random field. In addition, we will often
write
Figure 27.1: Two different neighborhood structures and their corresponding cliques.
Left: Four closest points are neighbors (top) and the cliques (bottom). Right: Eight
closest points are neighbors (top) and the cliques (bottom). The concept of cliques
will be defined below.
X for $\{X(s)\}_{s \in S}$ and think of X as a random variable with values in $\mathcal{X}^S$, the set of
configurations. Letting |S| denote the number of elements of S and similarly for $\mathcal{X}$,
the number of elements of the configuration space $\mathcal{X}^S$ is $|\mathcal{X}|^{|S|}$ and it is hence often
extremely large. For example, if $\mathcal{X} = \{-1, +1\}$ and S is a lattice of size $128 \times 128$,
its size is $2^{128^2}$. If A is a subset of S, we write X(A) for $\{X(s)\}_{s \in A}$, i.e. the collection
of random variables at the sites of A. The defining Markov property is that
$$P(X(s) = x(s) \mid X(S \setminus s) = x(S \setminus s)) = P(X(s) = x(s) \mid X(N_s) = x(N_s)) \qquad (27.3)$$
for all sites $s \in S$ and all configurations $x \in \mathcal{X}^S$. That is, the distribution of X(s),
given all other sites, depends on the realized values in its neighborhood only. These
conditional distributions are often called the local specification of the MRF.
for some real number $\beta$. Note that the denominator does not depend on x(s) and
is only a normalizing factor to make the right hand side a proper distribution, i.e.
summing to unity. This model is called the Ising model after the German physicist
Ising who invented it to explain ferromagnetism.
The sum in the exponent is positive if x(s) has the same sign as the majority
of its neighbors. Hence, if $\beta > 0$ the sites interact such that configurations x with
many neighbors of the same sign will have large probabilities. On the contrary,
if $\beta < 0$ configurations with many neighbors having opposite signs will have large
probabilities.
for some real number $\beta$. As before, the denominator does not depend on x(s) and
is only a normalizing factor. This model is called the Potts model. We note that
$\#\{t \in N_s \mid x(t) \neq x(s)\}$ is the number of neighbors of s that have values different
from x(s). Hence, if $\beta > 0$ this model gives large probabilities to configurations x
in which there are many neighbors with different values. If $\beta < 0$, the model works
the opposite way.
is called a Gibbs distribution. Here the sum runs over all cliques C. The normalizing
constant Z is given by
$$Z = \sum_{x \in \mathcal{X}^S} \exp\left(\sum_C V_C(x)\right) \qquad (27.8)$$
and is generally infeasible to compute as the outer sum runs over a very large set.
The importance of Gibbs distributions is made clear from the following facts: (1)
any random field with a distribution that is a Gibbs distribution is a Markov
random field with respect to the neighborhood system governing the cliques and
(2) any random field that is Markov with respect to a given neighborhood system
has a distribution that is a Gibbs distribution generated by the corresponding
cliques. Hence we may say that Markov random fields and Gibbs distributions are
equivalent. It is not too difficult to verify the first fact, while the second is much
more difficult and is known as the Hammersley-Clifford theorem (it does require one
mild condition in addition to the field being an MRF). Let us consider some specific
examples of Gibbs distributions.
Note that as we sum over cliques, only one of the pairs (s, t) and (t, s) is accounted
for in the sum. It is not difficult to check that this distribution yields the local
specification 27.4, with $\alpha = 0$. In this model, $\alpha$ controls the fractions of $-1$'s and
$+1$'s in the configurations (with $\alpha > 0$ large probabilities are given to configurations
with many $+1$'s) while $\beta$ controls the interaction between neighbors as previously
described. Note that the normalizing constant $Z(\alpha, \beta)$ depends on the parameters
$\alpha$ and $\beta$.
where $I(x(s) \neq x(t))$ is an indicator function equal to one if $x(s) \neq x(t)$ and
zero otherwise. Hence, the sum counts pairs of neighbors that have values which do
not agree. Again, only one of the pairs (s, t) and (t, s) is accounted for in the sum.
This Gibbs distribution yields the local specification 27.5.
$$\pi(x)\, P(X(s) = x'(s) \mid X(S \setminus s) = x(S \setminus s)) = \pi(x')\, P(X(s) = x(s) \mid X(S \setminus s) = x(S \setminus s)) \qquad (27.13)$$
Expanding the conditional probability we find that the above equation is equivalent
to
$$\pi(x)\, \frac{\pi(x')}{P(X(S \setminus s) = x(S \setminus s))} = \pi(x')\, \frac{\pi(x)}{P(X(S \setminus s) = x(S \setminus s))} \qquad (27.14)$$
which is trivially true.
We can now envision how we can simulate the MRF. Imagine a Markov chain
$\{X_n\}_{n=0}^{\infty}$ of configurations of the MRF with transition probability matrix $P_s$. In
other words, the state space of this Markov chain is the configuration space $\mathcal{X}^S$,
and it moves from one configuration to another by updating X(s) according to
its conditional distribution given the rest of the field. This Markov chain has a
stationary distribution $\pi$, but its distribution does not converge to $\pi$ because it is
not irreducible. Indeed, the chain can only modify the MRF at the single site s.
In other words, the action of P is that we visit all sites s of S (in some order),
and when we visit s we modify its value by simulating X(s) from its conditional
distribution given the rest of the MRF. Since $P_s \pi = \pi$ for each s, it is plain that
$P\pi = \pi$; that is, the distribution $\pi$ of the MRF is a stationary distribution of P.
Moreover, it is not difficult to check that P is irreducible (since it can modify the
current configuration at all sites) and aperiodic (since it can also choose not to
modify the current configuration at all). Hence, a Markov chain $\{X_n\}$ with state
space $\mathcal{X}^S$ and with transition probability matrix P will converge in distribution to
$\pi$. Thus we may take the initial configuration $X_0$ arbitrary, run the chain for a
long time and then obtain a simulated configuration whose distribution is roughly
$\pi$. For how long we need to run the chain in order to make this approximation good
depends on the model, and is a difficult question to address in general. A full turn
of the simulation scheme, in which each site is visited once, is often called a sweep
of the algorithm.
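A minimal sketch of one such Gibbs sampler for the Ising local specification (with α = 0) on a lattice with the four-nearest-neighbor structure and periodic boundary conditions; the conditional probability of +1 follows from normalizing exp(β x(s) Σ_{t∼s} x(t)) over x(s) = ±1:

```python
# One sweep of a Gibbs sampler for the Ising model (alpha = 0), with the
# four-nearest-neighbor structure and periodic boundary conditions.
import numpy as np

def sweep(x, beta, rng):
    n, m = x.shape
    for i in range(n):
        for j in range(m):
            s = (x[(i - 1) % n, j] + x[(i + 1) % n, j] +
                 x[i, (j - 1) % m] + x[i, (j + 1) % m])     # neighbor sum
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * s))  # P(x(s) = +1 | rest)
            x[i, j] = 1 if rng.random() < p_plus else -1
    return x

rng = np.random.default_rng(0)
x = rng.choice([-1, 1], size=(100, 100))   # arbitrary initial configuration
for _ in range(250):                       # 250 sweeps, as in figures 27.2/27.3
    x = sweep(x, 0.4, rng)
```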
3. Set $p_- = e^{-w}$ and $p_+ = e^{w}$.
that $\beta = 0.6$ yields a larger fraction of cliques with equal sign, as expected, but we
also see that it takes more iterations for the Gibbs sampler to reach a steady state
in this respect. When $\beta = 0.4$ it appears that this curve flattens out after about 100
iterations, but when $\beta = 0.6$ it is not obvious that convergence has occurred even
after 250 iterations. This indicates that the Markov chain $\{X_n\}$ that constitutes the
Gibbs sampler has a slower rate of convergence when $\beta = 0.6$. This is indeed true,
and is caused by the larger degree of spatial dependence in this case.
Figure 27.2: Simulation of the Ising model with $\beta = 0.4$ on a $100 \times 100$ lattice.
Top left to middle right: replications after 0, 3, 10, 25, 100 and 250 iterations of the
Gibbs sampler. Bottom: fraction of cliques with equal signs (left) and fraction of
sites with positive sign (right) as a function of the number of iterations.
Figure 27.3: Simulation of the Ising model with $\beta = 0.6$ on a $100 \times 100$ lattice.
Top left to middle right: replications after 0, 3, 10, 25, 100 and 250 iterations of the
Gibbs sampler. Bottom: fraction of cliques with equal signs (left) and fraction of
sites with positive sign (right) as a function of the number of iterations.
We can construct variants of the Gibbs sampler, for example by not visiting the
sites cyclically in a prescribed order, but rather selecting a site at random, updating
it, then selecting another one at random, updating it, and so on. This scheme works
equally well and corresponds to the transition probability matrix
$$P = \frac{1}{|S|} \sum_{s \in S} P_s \qquad (27.16)$$
The Gibbs sampler is a particular case of what is called Markov chain Monte Carlo
simulation methods. The general idea of these methods is to construct a Markov
chain that has a prescribed stationary distribution. The Gibbs sampler is a special
The maximizer of this function is called the maximum pseudo-likelihood estimator
(MPLE). Since the local specification does not involve the normalizing constant
$Z(\beta)$, the pseudo-likelihood function is free of this constant as well. This is the main
advantage of this approach.
Figure 27.4 shows the log-pseudo-likelihood function for the Ising model replications
displayed in figures 27.2 and 27.3, obtained after 250 iterations of the Gibbs
sampler. We see that there are maxima around 0.4 and 0.6, the respective true values
of $\beta$, but we also see that the maximum is much flatter when $\beta = 0.6$. In statistical
terms this implies that the variance of the MPLE is large, and this is in line
with the rule of thumb that the MPLE is inefficient when there is a large degree of
spatial dependence.
Assignments
Chapter 28
Homework
Anyone who scores over 100% will obtain a 1.00 as there is no better grade
available. As we simply add up points, the extra credit points remedy any loss of
points regardless of where.
extra credit points if you were the first person to suggest it. The following table
gives the points available.
Suggestion Points
Grammatical error 0.5
Mathematical error in formula 1.0
Argumentative error 5.0
Improvement of explanation 5.0
Missing information 3.0
28.1.3 Attendance
Attendance at lectures is required. This will be checked by asking you to complete
the reading assigned for each lecture in advance and to write down on a piece of
paper at least two points that you observed, liked, found questionable or otherwise
noteworthy. These papers are to be submitted by yourself at the start of the relevant
lecture. Only one paper will be accepted per person and there will be no exceptions
whatsoever regarding this or the time of submission. The last six lectures do not
have reading associated with them and here the notes are to be written on the
lecture script.
Note that the purpose is manifold. I want to make sure that you do the reading,
engage with it and come to class. As such this appears to be the most economical
way to enforce this. I am certainly not a fan of rules but some guidelines have to be
set. As there are 27 such submissions possible and, of course, the usual problems of
illness and family emergency prevent submission of some of them, I want at least
20 of these papers submitted by each student. If you submit fewer than 20, your total
point score will be multiplied by a factor. The following table gives the factors.
Submissions Factor
20 - 27 1.00
15 - 19 0.80
10 - 14 0.70
0-9 0.60
Please also note that these notes can be very brief. Two or three sentences per
point are completely sufficient. Do not write an essay but merely jot down some
thoughts.
28.2 Homework
28.2.1 General Policy
Homeworks are due each fortnight at the start of the class starting with the third
week (the dates are written in the titles of the homeworks below) and are to be
written on paper and not submitted by email. The points available for each problem
are written into the following problem sheets.
8. A certain rare disease from which one in ten thousand of the population suffers
is diagnosed by a test which reveals the presence of the disease in 95% of the
cases of those tested who actually have the disease. However, it also incorrectly
yields a positive reaction in 1% of the cases of those who are not suffering
from the disease. If a person selected at random from the population shows a
positive reaction, what is the probability that he is actually suffering from the
disease? [15] Solution: 95/10094.
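A sketch of the Bayes computation behind this answer:

```python
# Bayes' theorem check for problem 8.
p_disease = 1 / 10_000
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.01

p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
print(p_pos_given_disease * p_disease / p_pos, 95 / 10094)   # both 0.00941...
```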
2. If the probability that a man aged 60 will survive another year is 0.9, what
premium should he be charged for a life insurance policy of $1000? (If he
survives the year he receives no money back.) [5] Solution: $100.
3. Two dice are thrown in one turn, each turn costing 5 peso. If a prize of
40 peso is given for a double 6, and a prize of 20 peso for any other double
(together in both cases with the stake money), determine the expected loss to
a player who plays the game 100 times. [5] Solution: 27.5 peso.
4. The following table gives the number of children in each of 360 families.

No. of children   0   1    2   3   4  5  6  7  8
No. of families  38  91  108  76  39  5  2  0  1     (28.1)

Calculate the mean and standard deviation of the number of children per
family. [10] Solution: 2.02; 1.28.
5. Two dice are thrown together and the scores added. What is the chance that
the total score exceeds 8? Find the mean and standard deviation of the total
score. What is the standard deviation of the score for a single die? [10] Solution:
5/18; 7; 2.415; 1.71.
7. A tetrahedral die has its faces colored red, green, yellow and blue. If a group
of 8 such dice are thrown, calculate the probabilities of 0, 1, 2, 3, 4 red faces
being seen. How many times would you expect to see only 2 red faces if the
experiment of throwing the 8 dice was repeated 200 times? [5]
9. In a quality control laboratory, samples of size 60 were examined and the
number of defectives counted. Over thousands of trials, the number of defectives
never exceeded 14 nor was less than 2. Assuming unchanged probability over
the testing period, what was the approximate percentage defective? [5]
10. If telephone calls come into an exchange at an average rate of 70 per hour,
find the probability of there being 0, 1, 2 calls in a period of 2 minutes. What
is the probability that there are no calls in a period of 7 minutes? [5] Solution:
0.097; 0.226; 0.264; 0.0003.
11. After World War II, an area of South London was divided into 576 squares
each of 0.25 square mile area. The number of flying bomb hits in each square
was recorded. The results are given in the table below. Compare this with a
Poisson model with the same mean. [5]

Number of hits       0    1   2   3  4  5
Number of squares  229  211  93  35  7  1

Solution: 228, 211, 98, 30, 7, 1. Good correspondence.
12. A book has errors on 403 out of 480 pages. Assuming that the errors follow a
Poisson distribution, find the expected number of pages with just one error. [5]
Solution: 141.
3. The line AB has length 10 cm. An interval of length 2 cm is marked at random
on the line, the positions of the interval being uniformly distributed. What is
the probability that the interval will contain the midpoint of AB? [5] Solution:
1/4.
4. For N(1.6, 2) find a. $P(|x| \leq 1.4)$ b. P(x is negative) c. $P(x \geq 3.2)$. [5] Solution:
0.3934; 0.2119; 0.2119.
6. A normal distribution $N(2, \sigma)$ is such that $P(X \leq 2.8) = 0.7123$. Find $\sigma$. [5]
Solution: 1.43.
8. The IQs of 500 students are assumed to be normally distributed with mean
105 and standard deviation 12. How many students may be expected:
9. How many degrees of freedom are required for a $\chi^2$ test, for testing data against
the following distributions? Where parameters are known independently of the
data they are given.
10. Use tables to look up the following values. State whether the results are
significant at the 5% or 1% levels, or whether they seem too good to be true.
(a) $\chi^2_4 = 9.60$
(b) $\chi^2_{11} = 2.51$
(a) Supposing that the individual correctly identifies 32 cards (of the 50),
test the hypothesis at the 0.05 and 0.01 levels. [10]
(b) The p-value corresponding to a statistical test is defined to be the smallest
significance level at which the null hypothesis is rejected. Find the p-value
for this case. [7]
Solution: We choose a one-tailed test since we are not concerned with the
individual reaching low numbers but only high ones. We test relative to the
standard that no powers of ESP means the person chooses at random and
thus the likelihood of choosing any one color is 0.5. The z-scores are 1.645
and 2.33 for the two significance levels and the z-score of the measurement is
1.98, and so we accept at the 0.05 level and reject at the 0.01 level the hypothesis
that the person has powers of ESP. The p-value is the probability on the standard
normal distribution that z is larger than or equal to 1.98, which is 2.39%. Thus
we conclude that the claim that this individual has ESP has a 2.39% chance of
being wrong; in short, it is likely that this individual does have ESP.
3. Using the chi-squared distribution and test, test the following hypotheses.
(a) A coin is tested for fairness and thrown 200 times; there were 115 heads
and 85 tails. Test the hypothesis of fairness at the 0.05 and 0.01
significance levels. [7]
(b) A pot contains a large number of marbles of red, orange, yellow and
green colors. We draw out 12 of these marbles at random and get 2 reds,
5 oranges, 4 yellows and 1 green. The hypothesis is that the pot contains
an equal proportion of marbles of all four colors. [11]
(c) 320 families with 5 children are surveyed and the sex of the children asked
for. The results are shown in table 28.1 and the hypothesis is that men
and women have equal birth probability. [11]
Solution: The coin has $\chi^2 = 4.5$ and the problem has one degree of freedom.
The critical values for $\chi^2$ at the 0.05 and 0.01 significance levels for a single
degree of freedom are 3.84 and 6.63, and thus we reject the hypothesis at the
0.05 level and accept it at the 0.01 level. We would have gotten the same
answer using z-score methods as the experimental z-score is 2.12 and so larger
than 1.96 but less than 2.58.
With the urn, we would expect to see three of each kind in a sample of 12.
However, as the expected numbers are less than five, the chi-squared
distribution test is not applicable. Thus we must combine the categories. We
shall thus test whether red-green is equally proportioned to orange-yellow. Thus
Table 28.1: The child distribution over 320 families of 5 children. (b = boys, g =
girls)
$\chi^2 = 3$ and there is one degree of freedom. From the same critical values, we
must accept the hypothesis at the 0.05 level.
We use the binomial distribution to compute the probabilities of each kind of
family if the birth probability is 0.5 for each sex. The result is 1, 5, 10, 10, 5,
1 out of 32 in order of the table. Thus $\chi^2 = 12.0$ and there are five degrees
of freedom and we reject the hypothesis at the 0.05 level but not at the 0.01
level. We therefore conclude that the birth probabilities are in all likelihood
not equal.
2. Using linear least squares methods, fit a straight line to the data given in table
28.3. [11] Solution: NEED DATA WITH ERRORS IN BOTH COORDINATES.
3. Using three data sets, you are asked to fit lines and make certain predictions.
(a) Farm real estate values in the USA are given in table 28.4. Find the least
squares straight line for this data. Estimate the value of farm real estate
in 1988 and 1996 and compare your values to the true values of 626.8 and
859.7 billion US Dollars. [11]
(b) The purchasing power of the US Dollar is given in table 28.5. Find the
least squares straight line. Predict the power for 1998 assuming that the
trend continues. [10]

x    1    2    3    4    5    6    7    8    9    10
σx   0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1
y    1.1  1.5  2.1  2.4  3.0  3.7  3.8  4.4  5.2  5.6
σy   0.1  0.2  0.2  0.1  0.1  0.2  0.1  0.2  0.2  0.1

Table 28.3: Data for straight line fit with errors in both coordinates.
(c) The population of the USA is given in table 28.6. Find the least squares
straight line as well as the least squares parabola for this data and
comment on which fits best. Use both models to predict the population in
the year 2000. [10]
4. The number of students in a class of 100 who got certain grades in a
mathematics and a physics exam are given in table 28.7. How correlated are these
grades? Find a numerical measure of correlation and discuss. [10] Solution:
The coefficient of linear correlation $r = 0.7686$ and we conclude that given
one grade we can predict the other grade with a good accuracy. If we are
interested in the rough placement of a student's abilities, one of the two tests
would be enough as they measure 77% of the identical ability, i.e. doing the
second test only supplies 23% additional information and this is not worth the
effort.
Table 28.4: Data giving the total value of USA farm real estate in billions of US
Dollars. U.S. Department of Agriculture, Economic Research Service.
Table 28.5: Data giving the purchasing power of the US Dollar as measured by
consumer prices according to the U.S. Bureau of Labor Statistics, Survey of Current
Business.
Year        1950  1955  1960  1965  1970  1975  1980  1985  1990  1995
Population  152   166   181   194   205   216   228   238   250   263

Table 28.6: Data giving the population of the USA in millions. U.S. Bureau of
Census.
Physics \ Math   40-49  50-59  60-69  70-79  80-89  90-99  Total
90-99                                 2      4      4      10
80-89                          1      4      6      5      16
70-79                          5      10     8      1      24
60-69            1      4      9      5      2             21
50-59            3      6      6      2                    17
40-49            3      5      4                           12
Total            7      15     25     23     20     10     100

Table 28.7: Number of people of a class of 100 who receive a certain grade on a
physics and a mathematics exam.
is
$$P = \begin{pmatrix} 0.81 & 0.15 & 0.04 \\ 0.10 & 0.80 & 0.10 \\ 0.05 & 0.20 & 0.75 \end{pmatrix} \qquad (28.5)$$
We calculate the eigenvector corresponding to the unit eigenvalue and multiply
it by the total population of 30,000 to obtain the steady state response of (, ).
Simple iteration shows that this is reached after .... years.
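The steady state can also be found numerically; a minimal sketch, using the fact
that the stationary distribution is the left eigenvector of P for eigenvalue 1
(equivalently, an eigenvector of the transpose of P):

    # Stationary distribution of the transition matrix (28.5), scaled to
    # the total population of 30,000.
    import numpy as np

    P = np.array([[0.81, 0.15, 0.04],
                  [0.10, 0.80, 0.10],
                  [0.05, 0.20, 0.75]])

    vals, vecs = np.linalg.eig(P.T)            # left eigenvectors of P
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    pi /= pi.sum()                             # normalize to probabilities
    print(pi * 30000)                          # approximately 9137, 13858, 7005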
3. A gas station has four pumps. Cars arrive at random at an average rate of
three per minute and service times are exponential with a mean of two minutes.
Give the parameters for the queue to which this corresponds; what happens
in the long run?
Suppose the forecourt has room for six cars only, in addition to any at the
pumps. Find the equilibrium distribution and estimate how many cars per
hour drive on, finding the forecourt full.
Each car that drives on represents lost profit. Consider the relative merits
of (a) extending the forecourt to take two more cars; (b) installing one more
pump, which reduces the forecourt capacity to five waiting cars. [33]
Solution: Per hour, the queue is M(180)/M(30)/4. Here $k\mu = 4 \cdot 30 = 120 < 180 = \lambda$,
and so the queue would grow without bound. With the given limit, it is M/M/4/10,
so equation 8.10 (in the photocopies) implies $p_n = \lambda^n p_0/(n!\,\mu^n) = 6^n p_0/n!$
when $n \le 4$, and $p_n = p_4 (3/2)^{n-4}$ for $4 \le n \le 10$. This yields $p_0 = 1/1798.28$,
hence $p_1, \ldots, p_{10}$. $P(\text{forecourt full}) = p_{10} = P(\text{arrival drives on})$, so the mean
number lost per hour is $180\, p_{10} \approx 61.6$.
(a) leads to M/M/4/12, from which we find $p_{12} = 0.3371$, and (b) is M/M/5/10,
where now $p_n = 6^n p_0/n!$ for $n \le 5$ and $p_n = p_5 (6/5)^{n-5}$ for $5 \le n \le 10$. Then
$p_{10} = 0.2126$, so the respective losses are about 60.7 and 38.3 cars per hour. Thus
an extra pump is much better than extending the forecourt.
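The numbers above can be checked with a short routine; a minimal sketch of the
truncated-queue calculation behind equation 8.10:

    # Loss rate for an M/M/k/N queue: p_n = (lam/mu)^n p_0 / n! for
    # n <= k, and p_n = p_k (lam/(k mu))^(n-k) for k <= n <= N.
    from math import factorial

    def lost_per_hour(lam, mu, k, N):
        rho = lam / mu
        q = [rho ** n / factorial(n) for n in range(k + 1)]
        q += [q[k] * (lam / (k * mu)) ** (n - k) for n in range(k + 1, N + 1)]
        p0 = 1 / sum(q)
        return lam * q[N] * p0             # arrivals that find the queue full

    print(lost_per_hour(180, 30, 4, 10))   # about 61.6
    print(lost_per_hour(180, 30, 4, 12))   # about 60.7
    print(lost_per_hour(180, 30, 5, 10))   # about 38.3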
Projects
In addition to the homework, you are asked to complete a project. The average of
all homework grades will count for fifty percent of the final grade; the other
fifty percent will come from the project. The project takes the form of a written
report about a topic on the list below. The report must be typed and submitted
by email. The LaTeX format is much preferred, but MS Word will also be accepted.
There is an absolute upper limit on the length of the report of 5000 words (data
and equations not counted), but please do not feel that you have to write almost
that many words. If you can make your point in significantly fewer words, that is
very good.
The report must contain but is not limited to:
1. An introduction in which it is made clear what the task (hypothesis) is, why
it is interesting/useful and how you have chosen to go about it,
2. A body in which you give the data used for the report and any calculations
made,
3. A discussion in which the data and the results are discussed with respect to
the task given in the introduction,
Note that the law is effectively a social scientific law and thus not so precise.
It says that humans will do whatever they have to do with the least effort
possible. Thus (and this step requires quite some arguing) if we have a list
of items that occur with a given frequency (the number of occurrences of that
item divided by the total number of occurrences), sorted in order of decreasing
frequency, then the rank $r$ and frequency $f$ of this list should be related by
$f = br^{-a}$, where $a$ and $b$ are some constants to be found. Note that, in
spirit, this law is identical to the action principle at the basis of physics.
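As an illustration of how the constants $a$ and $b$ might be estimated, a minimal
sketch that fits the law by linear least squares in log-log coordinates; the word
counts at the bottom are invented purely for illustration.

    # Fit f = b * r^(-a) by regressing log f on log r.
    import numpy as np

    def zipf_fit(counts):
        f = np.sort(np.asarray(counts, dtype=float))[::-1]
        r = np.arange(1, len(f) + 1)
        slope, intercept = np.polyfit(np.log(r), np.log(f), 1)
        return -slope, np.exp(intercept)   # exponent a and prefactor b

    # Invented frequency counts, for illustration only:
    a, b = zipf_fit([500, 240, 160, 120, 95, 80, 70, 60])
    print(a, b)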
2. Argue for or against the following hypotheses (note that some are deliberately
provocative and do not necessarily reflect my own opinion).
Note that most of these hypotheses are vague, and in your report you must
define the words so that they become precise statements! Also note that
while these topics are of everyday relevance, the report should discuss them
from the standpoint of the available data, using probabilistic and statistical
methods.
3. Choose a game of chance (Blackjack, Poker, Dice, Roulette, etc.) and prove
that the house will win in the long run by deriving the probability that a player
will lose at any stage of said game. Demonstrate a strategy of cheating
by which the player may increase his chances of winning so substantially that
playing the game becomes a realistic source of profit for him (for example
by counting cards in Blackjack or Poker).
In this report you must first explicitly define the game, find the probability
that a player will lose, and argue that in the long run the house profits.
Then you are to give complete details of a method of winning, find the
new probability of losing, and argue that thus the player will profit. A sketch
of the kind of calculation involved follows below.
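As a taste of the expected-value argument, a minimal sketch for a straight-up bet
in European roulette (one zero, payout 35:1), which is just one of the games on
the list above.

    # Monte Carlo estimate of the player's mean profit per spin for a
    # straight-up bet: win 35 with probability 1/37, lose 1 otherwise.
    # Exact expectation: 35/37 - 36/37 = -1/37, about -2.7% of the stake
    # per spin, which is the house edge.
    import random

    def mean_profit(n_spins):
        profit = 0
        for _ in range(n_spins):
            profit += 35 if random.randrange(37) == 0 else -1
        return profit / n_spins

    print(mean_profit(1_000_000))          # close to -1/37 = -0.027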
4. Apply the theory of Markov processes to the study of financial indices. Choose
an index (Dow Jones, DAX, etc., or also a consumer index) and define it and
the method of its calculation in detail. Derive a Markov process model for
this index by making economic assumptions that must be argued for. Using
data for this index up to a certain arbitrary date, available on the Internet,
calculate the necessary model parameters. Use the data after the chosen date
to check the prediction of the Markov process and show for what period of
time your model is accurate. As a benchmark, a model would be considered
good (for the purposes of this report) if it is accurate to within 5% for at least
one week (for a market index) in which no dramatic political event disturbed
the evolution of the index.
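A minimal sketch of one possible calibration step, assuming the index is
discretized into down/flat/up states by daily return; the threshold and the
price series are illustrative assumptions, not part of the project specification.

    # Estimate a 3-state transition matrix from a price series by
    # counting consecutive state pairs; states: 0 = down, 1 = flat,
    # 2 = up, using a +/-0.2% band for "flat".
    import numpy as np

    def transition_matrix(prices, band=0.002):
        returns = np.diff(prices) / prices[:-1]
        states = np.digitize(returns, [-band, band])
        counts = np.zeros((3, 3))
        for s, t in zip(states[:-1], states[1:]):
            counts[s, t] += 1
        rows = counts.sum(axis=1, keepdims=True)
        return counts / np.where(rows == 0, 1, rows)

    # Invented prices, for illustration only:
    prices = np.array([100.0, 100.5, 100.4, 101.2, 101.1, 101.3, 100.9, 101.5])
    print(transition_matrix(prices))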