STAT 230 Course Notes Fall 2019

STAT 220/230 COURSE NOTES
By Chris Springer Revised by Jerry Lawless, Don McLeish and Cyntha Struthers
Fall 2019 Edition

Contents
1. INTRODUCTION TO PROBABILITY
1.1 Definitions of Probability
1.2 Chapter 1 Problems
2. MATHEMATICAL PROBABILITY MODELS
2.1 Sample Spaces and Probability
3. PROBABILITY AND COUNTING TECHNIQUES
3.1 Addition and Multiplication Rules
3.2 Counting Arrangements or Permutations
3.3 Counting Subsets or Combinations
3.4 Number of Arrangements When Symbols Are Repeated
3.5 Examples
4. PROBABILITY RULES AND CONDITIONAL PROBABILITY
4.1 General Methods
4.2 Rules for Unions of Events
4.3 Intersections of Events and Independence
4.4 Conditional Probability
4.5 Product Rules, Law of Total Probability and Bayes’ Theorem
4.6 Useful Series and Sums
5. DISCRETE RANDOM VARIABLES
5.1 Random Variables and Probability Functions
5.2 Discrete Uniform Distribution
5.3 Hypergeometric Distribution
5.4 Binomial Distribution
5.5 Negative Binomial Distribution
5.6 Geometric Distribution
5.7 Poisson Distribution from Binomial
5.8 Poisson Distribution from Poisson Process
5.9 Combining Other Models with the Poisson Process
5.10 Summary of Probability Functions for Discrete Random Variables
6. COMPUTATIONAL METHODS AND THE STATISTICAL SOFTWARE R
6.1 Preliminaries
6.2 Vectors
6.3 Arithmetic Operations
6.4 Some Basic Functions
6.5 R Objects
6.6 Graphs
6.7 Distributions
7. EXPECTED VALUE AND VARIANCE
7.1 Summarizing Data on Random Variables
7.2 Expectation of a Random Variable
7.3 Some Applications of Expectation
7.4 Means and Variances of Distributions
8. CONTINUOUS RANDOM VARIABLES
8.1 General Terminology and Notation
8.2 Continuous Uniform Distribution
8.3 Exponential Distribution
8.4 A Method for Computer Generation of Random Variables
8.5 Normal Distribution
9. MULTIVARIATE DISTRIBUTIONS
9.1 Basic Terminology and Techniques
9.2 Multinomial Distribution
9.3 Markov Chains
9.4 Expectation for Multivariate Distributions: Covariance and Correlation
9.5 Mean and Variance of a Linear Combination of Random Variables
9.6 Linear Combinations of Independent Normal Random Variables
9.7 Indicator Random Variables
10. C.L.T., NORMAL APPROXIMATIONS and M.G.F.’s
10.1 Central Limit Theorem (C.L.T.) and Normal Approximations
10.2 Moment Generating Functions
10.3 Multivariate Moment Generating Functions
10.4 Chapter 10 Problems
11. SOLUTIONS TO SECTION PROBLEMS
12. SOLUTIONS TO END OF CHAPTER PROBLEMS
13. SAMPLE TESTS
14. SOLUTIONS TO SAMPLE TESTS
15. SUMMARY OF DISTRIBUTIONS AND N(0, 1) TABLES

Sections 5.3, 5.5, 5.8, 5.9, Chapter 6, Sections 8.4, 9.3, 10.2 and 10.3 are usually optional for STAT 220.
Chapter 6, Sections 8.4, 9.3 and 10.3 are usually optional for STAT 230.
1. INTRODUCTION TO PROBABILITY
1.1 Definitions of Probability
You are the product of a random universe. From the Big Bang to your own conception and birth,
random events have determined who we are as a species, who you are as a person, and much of your
experience to date. Ironic therefore that we are not well-tuned to understanding the randomness around
us, perhaps because millions of years of evolution have cultivated our ability to see regularity, certainty
and deterministic cause-and-effect in the events and environment about us. We are good at finding
patterns in numbers and symbols, or relating the eating of certain plants with illness and others with a
healthy meal. In many areas, such as mathematics or logic, we assume we know the results of certain
processes with certainty (e.g., 2 + 3 = 5), though even these are often subject to assumed axioms.
Most of the real world, however, from the biological sciences to quantum physics [2], involves variability
and uncertainty. For example, it is uncertain whether it will rain tomorrow; the price of a given stock
a week from today is uncertain; the number of claims that a car insurance policy holder will make
over a one-year period is uncertain; the number of requests to a web server is uncertain. Uncertainty
or “randomness” (that is, variability of results) is usually due to some mixture of at least two factors
including: (1) variability in populations consisting of animate or inanimate objects (e.g., people vary
in height, weight, hair colour, blood type, etc.), and (2) variability in processes or phenomena (e.g.,
the random selection of six numbers from forty-nine numbers in a lottery draw can lead to a very large
number of different outcomes). Which of these would you use to describe the fluctuations in stock
prices or currency exchange rates?
Variability and uncertainty in a system make it more difficult to plan or to make decisions without
suitable tools. We cannot eliminate uncertainty but it is usually possible to describe, quantify and deal
with variability and uncertainty using the theory of probability. This course develops both the
mathematical theory and some of the applications of probability. The applications of this methodology are
far-reaching, from finance to the life-sciences, from the analysis of computer algorithms to simulation
of queues and networks or the spread of epidemics. Of course we do not have the time in this course to
develop these applications in detail, but some of the problems at the end of the chapter will give a hint
of the extraordinary range of application of the mathematical theory of probability and statistics.
[2] “As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality” Albert Einstein, 1921.
It seems logical to begin by defining probability. People have attempted to do this by giving
definitions that reflect the uncertainty about whether some specified outcome or “event” will occur in a given
setting. The setting is often termed an “experiment” or “process” for the sake of discussion. We often
consider simple examples: it is uncertain whether two pips or dots will be on the upturned face when
a six-sided die is rolled. It is similarly uncertain whether the Canadian dollar will be higher tomorrow,
relative to the U.S. dollar, than it is today. One step in defining probability requires envisioning
a random experiment with a number of possible outcomes. We refer to the set of all possible distinct
outcomes of a random experiment as the sample space (usually denoted by S). Groups or sets of
outcomes of possible interest, that is, subsets of the sample space, we will call events. Then we might define
probability in three different ways:
1. The classical definition: The probability of some event is
$$\frac{\text{number of ways the event can occur}}{\text{number of outcomes in } S}$$
provided all points in the sample space S are equally likely. For example, when a die is rolled
the probability of two pips on the upturned face is $\frac{1}{6}$ because only one of the six faces has two
pips.
2. The relative frequency definition: The probability of an event is the (limiting) proportion (or
fraction) of times the event occurs in a very long series of repetitions of an experiment or process.
For example, this definition could be used to argue that the probability of getting two pips on the
upturned face when a die is rolled is $\frac{1}{6}$.
3. The subjective probability definition: The probability of an event is a measure of how sure the
person making the statement is that the event will happen. For example, after considering all
available data, a weather forecaster might say that the probability of rain today is 30% or 0.3.
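The first two definitions can be compared numerically for the die example. The following is a minimal sketch in Python, not part of the original notes; the seed, the number of simulated rolls and the variable names are arbitrary choices made for illustration, and the simulation itself relies on a pseudo-random generator that already assumes a fair die.

```python
import random
from fractions import Fraction

# Sample space for one roll of a six-sided die.
S = [1, 2, 3, 4, 5, 6]
event = {2}  # "two pips on the upturned face"

# Classical definition: (ways the event can occur) / (outcomes in S),
# meaningful only if all outcomes are assumed equally likely.
classical = Fraction(len([s for s in S if s in event]), len(S))

# Relative frequency definition: proportion of occurrences in a long
# (but necessarily finite) series of simulated rolls.
rng = random.Random(2019)  # arbitrary seed, fixed only for reproducibility
n_rolls = 100_000          # arbitrary illustrative number of repetitions
hits = sum(1 for _ in range(n_rolls) if rng.choice(S) in event)
relative_frequency = hits / n_rolls

print("classical:", classical)
print("relative frequency:", relative_frequency)
```

The simulated proportion should land near 1/6 but will not equal it exactly, since only finitely many repetitions are possible.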
Unfortunately, all three of these definitions have serious limitations.
Classical Definition: What does “equally likely” mean? This appears to use the concept of probability
while trying to define it! We could remove the phrase “provided all outcomes are equally likely”, but
then the definition would clearly be unusable in many settings where the outcomes in S did not tend to
occur equally often.
Relative Frequency Definition: Since we can never repeat an experiment or process indefinitely,
we can never know the probability of any event from the relative frequency definition. In many cases
we can’t even obtain a long series of repetitions due to time, cost, or other limitations. For example,
the probability of rain today cannot really be obtained by the relative frequency definition since today
cannot be repeated under identical conditions. Intuitively, however, if a probability is correct, we
expect it to be close to the relative frequency when the experiment is repeated many times.
Subjective Probability: This definition gives no rational basis for people to agree on a right answer,
and thus would disqualify probability as an objective science. Are everyone’s opinions equally valid
or should we only consult “experts”? There is some controversy about when, if ever, to use subjective
probability except for personal decision-making, but it does play a part in a branch of Statistics that is
often called “Bayesian Statistics”. This type of Statistics will not be discussed in this course, but it is a
common and useful method for updating subjective probabilities with objective experimental results.
The difficulties in producing a satisfactory definition can be overcome by treating probability as a
mathematical system defined by a set of axioms. We do not worry about the numerical values of
abilities until we consider a specific application. This is consistent with the way that other branches of
mathematics are defined and then used in specific applications (e.g., the way calculus and real-valued
functions are used to model and describe the physics of gravity and motion).
The mathematical approach that we will develop and use in the remaining chapters is based on the
following description of a probability model:
• a sample space of all possible outcomes of a random experiment is defined
• a set of events, subsets of the sample space to which we can assign probabilities, is defined
• a mechanism for assigning probabilities (numbers between 0 and 1) to events is specified.
Of course in a given run of the random experiment, a particular event may or may not occur.
In order to understand the material in these notes, you may need to review your understanding of
basic counting arguments, elementary set theory as well as some of the important series that you have
encountered in Calculus that provide a basis for some of the distributions discussed in these notes. In
Chapter 2, we begin a more mathematical description of probability theory.
1.2 Chapter 1 Problems

1. Try to think of examples of probabilities you have encountered which might have been obtained
by each of the three “definitions”.
2. Which definitions do you think could be used for obtaining the following probabilities?
(a) A person’s birthday is in April

(b) A driver makes a claim on their car insurance in the next year
(c) There is a meltdown at a nuclear power plant during the next 5 years
(d) The disk in a personal computer crashes
3. Give examples of how probability applies to each of the following areas.
(a) Lottery draws

(b) Public opinion polls
(c) Sending data over a network
(d) Auditing of expense items in a financial statement
(e) Disease transmission (e.g. measles, tuberculosis, STD’s)
4. Which of the following can be accurately described by a “deterministic” model, that is, a model
which does not require any concept of probability?
(a) The position of a small particle in space

(b) The velocity of an object dropped from the leaning tower of Pisa
(c) The lifetime of a heavy smoker
(d) The value of a stock which was purchased for $20 one month ago
(e) The number of servers at a large data center which crash on a given day
2. MATHEMATICAL PROBABILITY MODELS
2.1 Sample Spaces and Probability

Consider some phenomenon or process which is repeatable, at least in theory, and suppose that certain
events or outcomes $A_1, A_2, A_3, \ldots$ are defined. We will often term the phenomenon or process an
“experiment” and refer to a single repetition of the experiment as a “trial”. The probability of an
event $A$, denoted $P(A)$, is a number between 0 and 1. For probability to be a useful mathematical
concept, it should possess some other properties. For example, if our “experiment” consists of tossing
a coin with two sides, Head and Tail, then we might wish to consider the two events $A_1$ = “Head turns
up” and $A_2$ = “Tail turns up”. It does not make much sense to allow $P(A_1) = 0.6$ and $P(A_2) = 0.6$,
so that $P(A_1) + P(A_2) > 1$. (Why is this so? Is there a fundamental reason or have we simply adopted
1 as a convenient scale?) To avoid this sort of thing we begin with the following definition.
Definition 1 A sample space S is a set of distinct outcomes for an experiment or process, with the
property that in a single trial, one and only one of these outcomes occurs.
The outcomes that make up the sample space may sometimes be called “sample points” or just
“points” on occasion. A sample space is defined as part of the probability model in a given setting but
it is not necessarily uniquely defined, as the following example shows.
Example: Roll a six-sided die, and define the events
$a_i$ = there are $i$ pips on the top face, for $i = 1, 2, \ldots, 6$
Then we could take the sample space as $S = \{a_1, a_2, \ldots, a_6\}$. (Note we use the curly brackets “$\{\ldots\}$”
to indicate the elements of a set.) Instead of using this definition of the sample space we could
define the events
$E$: the event that there are an even number of pips on the top face
$O$: the event that there are an odd number of pips on the top face
and take $S = \{E, O\}$. Both sample spaces satisfy the definition. Which one we use depends on what
we want to use the probability model for. If we expect never to have to consider events like “there
are fewer than three pips on the top face” then the space $S = \{E, O\}$ will suffice, but in most cases, if
possible, we choose sample points that are the smallest possible or “indivisible”. Thus the first sample
space is likely preferred in this example.
Sample spaces may be either discrete or non-discrete; S is discrete if it consists of a finite or
countably infinite set of simple events. Recall that a countably infinite set is one that can be
put into a one-to-one correspondence with the positive integers, so for example $\{\frac{1}{2}, \frac{1}{3}, \frac{1}{4}, \frac{1}{5}, \ldots\}$ is
countably infinite as is the set of all rational numbers. The two sample spaces in the preceding example
are discrete. A sample space $S = \{1, 2, 3, \ldots\}$ consisting of all the positive integers is discrete, but a
sample space $S = \{x : x > 0\}$ consisting of all positive real numbers is not. For the next few chapters
we consider only discrete sample spaces. For discrete sample spaces it is much easier to specify the
class of events to which we may wish to assign probabilities; we will allow all possible subsets of the
sample space. For example if $S = \{a_1, a_2, a_3, a_4, a_5, a_6\}$ is the sample space then $A = \{a_1, a_2, a_3, a_4\}$
and $B = \{a_6\}$ and S itself are all examples of events.
Definition 2 An event in a discrete sample space is a subset $A \subseteq S$. If the event is indivisible so it
contains only one point, e.g. $A_1 = \{a_1\}$, we call it a simple event. An event $A$ made up of two or more
simple events such as $A = \{a_1, a_2\}$ is called a compound event.
Note that the notation $A \subseteq B$ means $a \in A$ implies $a \in B$.
Our notation will often not distinguish between the point $a_i$ and the simple event $A_i = \{a_i\}$ which
has this point as its only element, although they differ as mathematical objects. When we mean the
probability of the event $A_1 = \{a_1\}$, we should write $P(A_1)$ or $P(\{a_1\})$ but the latter is often shortened
to $P(a_1)$. In the case of a discrete sample space it is easy to specify probabilities of events since they
are determined by the probabilities of simple events.
Definition 3 Let $S = \{a_1, a_2, a_3, \ldots\}$ be a discrete sample space. Assign numbers (probabilities)
$P(a_i)$, $i = 1, 2, 3, \ldots$ to the $a_i$’s such that the following two conditions hold:
(1) $0 \le P(a_i) \le 1$
(2) $\sum_{\text{all } i} P(a_i) = 1$
The set of probabilities $\{P(a_i);\ i = 1, 2, \ldots\}$ is called a probability distribution on S.
Note that $P(\cdot)$ is a function whose domain is the sample space S. The condition $\sum_{\text{all } i} P(a_i) = 1$
above reflects the idea that when the process or experiment happens, one or other of the simple events
$\{a_i\}$ in S must occur (recall that the sample space includes all possible outcomes). The probability of
a more general event A (not necessarily a simple event) is then defined as follows:
Definition 4 The probability $P(A)$ of an event $A$ is the sum of the probabilities for all the simple events
that make up $A$, or $P(A) = \sum_{a \in A} P(a)$.
For example, the probability of the compound event $A = \{a_1, a_2, a_3\}$ is $P(a_1) + P(a_2) + P(a_3)$.
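Definitions 3 and 4 translate almost directly into code. The sketch below is an illustration only, written in Python: the function names are hypothetical, and the loaded-die probabilities are invented numbers chosen so that the two conditions of Definition 3 hold.

```python
from fractions import Fraction

def is_probability_distribution(p):
    """Check conditions (1) and (2) of Definition 3 for a dict that
    maps each sample point to its assigned probability."""
    return all(0 <= v <= 1 for v in p.values()) and sum(p.values()) == 1

def prob(event, p):
    """Definition 4: P(A) is the sum of P(a) over the points a in A."""
    return sum(p[a] for a in event)

# Hypothetical loaded die (invented values, for illustration only).
p = {1: Fraction(1, 4), 2: Fraction(3, 20), 3: Fraction(3, 20),
     4: Fraction(3, 20), 5: Fraction(3, 20), 6: Fraction(3, 20)}

print(is_probability_distribution(p))  # both conditions hold
print(prob({1, 2, 3}, p))              # P(a1) + P(a2) + P(a3)
```

Exact fractions are used instead of floating-point numbers so that the check that the probabilities sum to 1 is not affected by rounding.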
Probability theory does not say what numbers to assign to the simple events for a given application,
only those properties guaranteeing mathematical consistency. In an actual application of a probability
model, we try to specify numerical values of the probabilities that are more or less consistent with the
frequencies of events when the experiment is repeated. In other words we try to specify probabilities
that are consistent with the real world. There is nothing mathematically wrong with a probability model
for a toss of a coin that specifies that the probability of heads is zero, except that it likely won’t agree
with the frequencies we obtain when the experiment is repeated.
Example: Suppose a six-sided die is rolled, and let the sample space be $S = \{1, 2, \ldots, 6\}$, where $i$
represents the simple event that there are $i$ pips on the top face, $i = 1, 2, \ldots, 6$. If the die is an ordinary
one (a fair die), we would likely define probabilities as
$$P(i) = \frac{1}{6} \quad \text{for } i = 1, 2, \ldots, 6 \tag{2.1}$$
because if the die were tossed repeatedly by a fair roller (as in some games or gambling situations) then
each number would occur close to $\frac{1}{6}$ of the time. However, if the die were weighted in some way, or if
the roller were able to manipulate the die so that outcome 1 is more likely, these numerical values would
not be so useful. To have a useful mathematical model, some degree of compromise or approximation is
usually required. Is it likely that the die or the roller are perfectly “fair”? Given (2.1), if we wish to
consider some compound event, the probability is easily obtained. For example, if $A$ = “there are an even
number of pips on the top face” then because $A = \{2, 4, 6\}$ we get $P(A) = P(2) + P(4) + P(6) = \frac{1}{2}$.
We now consider some additional examples, starting with some simple problems involving cards,
coins and dice. Once again, to calculate probability for discrete sample spaces, we usually approach a
given problem using three steps:
(1) Specify a sample space S.
(2) Assign a probability distribution to the simple events in S.
(3) For any compound event A, find P (A) by adding the probabilities of all the simple events that
make up A.
Later we will discover that having a detailed specification or list of the elements of the sample
space may be difficult. Indeed in many cases the sample space is so large that at best we can describe
it in words. For the present we will solve problems that are stated as “Find the probability that . . . ” by
carrying out step (2) above, assigning probabilities that we expect should reflect the long run relative
frequencies of the simple events in repeated trials, and then summing these probabilities to obtain
P (A).
When S has only a few points, one of the easiest methods for finding the probability of an event is
to list all outcomes. In many problems a sample space S with equally probable simple events can be
used, and the first few examples are of this type.
Example: Draw one card from a standard well-shuffled deck (13 cards of each of 4 suits: spades,
hearts, diamonds, clubs). Find the probability that the card is a club.
Solution 1: Let $S = \{\text{spade}, \text{heart}, \text{diamond}, \text{club}\}$. Then S has 4 points, with 1 of them being “club”,
so $P(\text{club}) = \frac{1}{4}$.
Solution 2: Let $S = \{2\spadesuit, 3\spadesuit, 4\spadesuit, \ldots, A\spadesuit, 2\heartsuit, \ldots, A\clubsuit\}$. Then each of the 52 cards in S has
probability $\frac{1}{52}$. The event A of interest is
$$A = \{2\clubsuit, 3\clubsuit, \ldots, A\clubsuit\}$$
and this event has 13 simple outcomes in it, all with the same probability $\frac{1}{52}$. Therefore
$$P(A) = \frac{1}{52} + \frac{1}{52} + \cdots + \frac{1}{52} = \frac{13}{52} = \frac{1}{4}$$
Note 1: A sample space is not necessarily unique, as mentioned earlier. The two solutions illustrate
this. Note that in the first solution the event A = “the card is a club” is a simple event because of the
way the sample space was defined, but in the second it is a compound event.
Note 2: In solving the problem we have assumed that each simple event in S is equally probable. For
example in Solution 1 each simple event has probability $\frac{1}{4}$. This seems to be the only sensible choice
of numerical value in this setting, but you will encounter problems later on where it is not obvious
whether outcomes are all equiprobable.
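As a check on the two solutions of the card example, the following Python sketch builds both sample spaces and confirms that they give the same probability of a club. The representation of cards as (rank, suit) pairs and all variable names are illustrative assumptions, not notation from these notes.

```python
from fractions import Fraction

suits = ["spades", "hearts", "diamonds", "clubs"]
ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]

# Solution 1: the sample points are the four suits, each with probability 1/4.
S1 = suits
p1 = {s: Fraction(1, len(S1)) for s in S1}
p_club_1 = p1["clubs"]  # "club" is a simple event here

# Solution 2: the sample points are the 52 cards, each with probability 1/52.
S2 = [(r, s) for s in suits for r in ranks]
p2 = {card: Fraction(1, len(S2)) for card in S2}
A = [card for card in S2 if card[1] == "clubs"]  # compound event, 13 points
p_club_2 = sum(p2[card] for card in A)

print(p_club_1, p_club_2)
```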
The term “odds” is sometimes used in describing probabilities. In this card example the odds in favour
of clubs are 1 : 3; we could also say the odds against clubs are 3 : 1. In general,
Definition 5 The odds in favour of an event A is the probability the event occurs divided by the
probability it does not occur, or $\frac{P(A)}{1 - P(A)}$. The odds against the event is the reciprocal of this, $\frac{1 - P(A)}{P(A)}$.
If the odds against a given horse winning a race are 20 to 1 (or 20 : 1), what is the corresponding
probability that the horse will win the race? According to the definition above $\frac{1 - P(A)}{P(A)} = 20$, which
gives $P(A) = \frac{1}{21}$. Note that these odds are derived from bettors’ collective opinion and are therefore
subjective.
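The conversion between odds and probabilities in Definition 5 can be sketched in a few lines of Python. The function names below are hypothetical and introduced only for this illustration; the last function simply solves $(1 - p)/p = \text{odds}$ for $p$.

```python
from fractions import Fraction

def odds_in_favour(p):
    """Odds in favour of an event with probability p (Definition 5)."""
    return p / (1 - p)

def odds_against(p):
    """Odds against the event: the reciprocal of the odds in favour."""
    return (1 - p) / p

def prob_from_odds_against(odds):
    """Solve (1 - p) / p = odds for p, giving p = 1 / (1 + odds)."""
    return 1 / (1 + odds)

# Card example: P(club) = 1/4, so the odds in favour are 1 : 3.
print(odds_in_favour(Fraction(1, 4)))
# Horse example: odds against of 20 : 1 correspond to P(A) = 1/21.
print(prob_from_odds_against(Fraction(20)))
```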
Example: Toss a coin twice. Find the probability of getting one head. (In this course, “one head” is
taken to mean exactly one head. If we meant “at least one head” we would say so.)
Solution 1: Let $S = \{HH, HT, TH, TT\}$ and assume the simple events each have probability $\frac{1}{4}$.
(Here, the notation $HT$ means head on the 1st toss and tails on the 2nd.) Since one head occurs for
simple events $HT$ and $TH$, the event of interest is $A = \{HT, TH\}$ and we get $P(A) = \frac{1}{4} + \frac{1}{4} = \frac{1}{2}$.
Solution 2: Let $S = \{0 \text{ heads}, 1 \text{ head}, 2 \text{ heads}\}$ and assume the simple events each have probability
$\frac{1}{3}$. Then $P(1 \text{ head}) = \frac{1}{3}$.
Which solution is right? Both are mathematically “correct” in the sense that they are both con-
sequences of probability models. However, we want a solution that reflects the relative frequency of
occurrence in repeated trials in the real world, not just one that agrees with some mathematical model.
In that respect, the points in solution 2 are not equally likely. The event $\{1 \text{ head}\}$ occurs more often
than either $\{0 \text{ heads}\}$ or $\{2 \text{ heads}\}$ in actual repeated trials.
Figure 2.1: Ten tosses of two coins.
You can experiment to verify this (for example, of the 10 replications of the experiment in Figure
2.1, 2 heads occurred 2 of the 10 times and 1 head occurred 7 of the 10 times). For more certainty you
should replicate this experiment many times. So we say solution 2 is incorrect for ordinary fair coins
because it is based on an incorrect model. If we were determined to use the sample space in solution
2, we could do it by assigning appropriate probabilities to each of the three simple events, but then 0
heads would need to have a probability of $\frac{1}{4}$, 1 head a probability of $\frac{1}{2}$ and 2 heads $\frac{1}{4}$. We do not usually
do this because there seems little point in using a sample space whose points are not equally probable
when one with equally probable points is readily available.
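The probabilities $\frac{1}{4}$, $\frac{1}{2}$, $\frac{1}{4}$ for the sample space of solution 2 can be derived from the equally probable points of solution 1. The Python sketch below is an illustration under that equal-probability assumption; the variable names are arbitrary.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Solution 1: four ordered outcomes, assumed equally probable.
S = list(product("HT", repeat=2))  # ('H','H'), ('H','T'), ('T','H'), ('T','T')
p = Fraction(1, len(S))

# Collapse each ordered outcome to its number of heads and add up
# the probabilities of the outcomes that land on the same count.
heads_dist = Counter()
for outcome in S:
    heads_dist[outcome.count("H")] += p

for k in sorted(heads_dist):
    print(k, "head(s):", heads_dist[k])
```

The resulting distribution on {0 heads, 1 head, 2 heads} is not uniform, which is why assigning $\frac{1}{3}$ to each point in solution 2 disagrees with repeated trials.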
Example: Roll a red die and a green die. Find the probability of the event $A$ = “the total number of
pips showing on the top faces is 5”.
Solution: Let $(x, y)$ represent getting $x$ on the red die and $y$ on the green die.
Then, with these as simple events, the sample space is
$$
\begin{array}{rlllll}
S = \{ & (1,1) & (1,2) & (1,3) & \cdots & (1,6) \\
 & (2,1) & (2,2) & (2,3) & \cdots & (2,6) \\
 & (3,1) & (3,2) & (3,3) & \cdots & (3,6) \\
 & \vdots & \vdots & \vdots & & \vdots \\
 & (6,1) & (6,2) & (6,3) & \cdots & (6,6) \ \}
\end{array}
$$
Each simple event, for example $\{(1, 1)\}$, is assigned probability $\frac{1}{36}$. For the event of interest,
$A = \{(1, 4), (2, 3), (3, 2), (4, 1)\}$ and therefore $P(A) = \frac{4}{36}$.
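The listing of the 36 points and the event $A$ can be reproduced by enumeration. This is a minimal Python sketch assuming, as in the solution above, that all 36 ordered pairs are equally probable; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Ordered pairs (x, y): x on the red die, y on the green die.
S = list(product(range(1, 7), repeat=2))

# Event A: the total number of pips showing is 5.
A = [(x, y) for (x, y) in S if x + y == 5]

# Equally probable points, so P(A) = (points in A) / (points in S).
P_A = Fraction(len(A), len(S))

print(A)
print(P_A)  # Fraction reduces 4/36 to lowest terms
```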
Example: Suppose the 2 dice were identical in colour. Find the probability of the event A.
Solution 1: Since we can no longer distinguish between $(x, y)$ and $(y, x)$, the only distinguishable
points in S are:
$$
\begin{array}{rlllll}
S = \{ & (1,1) & (1,2) & (1,3) & \cdots & (1,6) \\
 & & (2,2) & (2,3) & \cdots & (2,6) \\
 & & & (3,3) & \cdots & (3,6) \\
 & & & & \ddots & \vdots \\
 & & & & & (6,6) \ \}
\end{array}
$$
Using this sample space, we have $A = \{(1, 4), (2, 3)\}$. If we assign equal probability $\frac{1}{21}$ to each point
(simple event) then we get $P(A) = \frac{2}{21}$.
At this point you should be suspicious since $\frac{2}{21} \ne \frac{4}{36}$. The colour of the dice should not have
any effect on what total we get. The universe does not change the frequency of real physical events
depending on whether the dice are identical or not, so one answer must be wrong! The problem is
that the 21 points in S here are not equally likely. There was nothing theoretically wrong with the
probability model except that if this experiment is repeated in the real world, the point $(1, 2)$ occurs
about twice as often in the long run as the point $(1, 1)$. So the only sensible way to use this sample space
so that it is consistent with the real world is to assign probabilities $\frac{1}{36}$ to the points of the form $(x, x)$ and
$\frac{2}{36}$ to the points $(x, y)$ for $x \ne y$. We can compare these probabilities with experimental evidence. On
the website http://www.math.duke.edu/education/postcalc/probability/dice/index.html you may throw
virtual dice up to 10,000 times and record the results. For example on 1000 throws of two dice (see
Figure 2.2), there were 121 occasions when the total on the dice was 5, indicating the probability of the
event A is close to $\frac{121}{1000}$ or 0.121. This compares with the probability $P(A) = \frac{4}{36} \approx 0.111$.
Figure 2.2: Results of 1000 throws of 2 dice
Solution 2: For a more straightforward solution to the above problem, pretend the dice can be
distinguished. (Imagine, for example, that we put a tiny mark on one die, or label one of them differently.) We
then get the same 36 sample points as in the example with the red die and the green die. The fact that
one die has a tiny mark cannot change the probabilities, so $P(A) = \frac{4}{36}$. The laws determining the
probabilities associated with these two dice do not, of course, know whether your eyesight is so keen
that you can or cannot distinguish the dice. These probabilities must be the same in either case. In
many problems when objects are indistinguishable and we are interested in calculating a probability,
you will discover that the calculation is made easier by pretending the objects can be distinguished.
This illustrates a common pitfall. When treating objects in an experiment as distinguishable leads
to a different answer from treating them as identical, the points in the sample space for identical objects
are usually not “equally likely” in terms of their long run relative frequencies. It is generally safer to
pretend objects can be distinguished even when they can’t be, in order to get equally likely sample
points.
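The pitfall can be made concrete by collapsing the 36 equally likely ordered pairs into the 21 unordered points. The Python sketch below is an illustration, with variable names chosen for this example; it compares the probability of a total of 5 under the induced (correct) weights with the value obtained by wrongly treating the 21 points as equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# 36 ordered pairs for two distinguishable dice, each with probability 1/36.
ordered = list(product(range(1, 7), repeat=2))

# Collapse (x, y) and (y, x) into a single unordered point (min, max).
p_unordered = Counter()
for x, y in ordered:
    p_unordered[(min(x, y), max(x, y))] += Fraction(1, 36)

# Event A: the total is 5, expressed with unordered points.
A = [pt for pt in p_unordered if sum(pt) == 5]

P_A_correct = sum(p_unordered[pt] for pt in A)    # uses weights 1/36 and 2/36
P_A_naive = Fraction(len(A), len(p_unordered))    # wrongly assumes 1/21 each

print(len(p_unordered), "unordered points")
print("induced weights:", P_A_correct, " naive equal weights:", P_A_naive)
```

Points of the form $(x, x)$ receive weight $\frac{1}{36}$ and points with $x \ne y$ receive $\frac{2}{36}$, matching the assignment described for the identical dice.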
While the method of finding probability by listing all the points in S can be useful, it is not practical
when there are a lot of points to write out (e.g., if three dice were tossed there would be 216 points in
S). We need to have more efficient ways of determining the number of outcomes in S or in a compound
event without having to list them all. Chapter 3 considers ways to do this, and then Chapter 4 develops
other ways to manipulate and calculate probabilities.
Although we often consider simple problems involving things such as coins, dice and simple games,
probability is used to deal with a huge variety of practical problems from finance to clinical trials.
In some settings such as in Problems 6 and 7 below, we need to rely on previous repetitions of an
experiment, or on related scientific data, to assign numerical probabilities to events.
2.2 Chapter 2 Problems

1. Students in a particular program have the same four math professors. Two students in the program
each independently ask one of their math professors [3] for a letter of reference. Assume each is
equally likely to ask any of the math professors.
(a) List a suitable sample space for this “experiment”.
(b) Use this sample space to find the probability both students ask the same professor.
2. A fair coin is tossed three times.
(a) List a sample space for this experiment.

(b) Find the probability of two heads.
(c) Find the probability of exactly two consecutive tails.
3. Two numbers are chosen at random without replacement from the set $\{1, 2, 3, 4, 5\}$.
(a) List a sample space for this experiment.
(b) Find the probability both numbers are odd.
(c) Find the probability the numbers chosen differ by one, that is, the two numbers are
consecutive.
4. Four letters addressed to individuals W , X, Y and Z are randomly placed in four addressed
envelopes, one letter in each envelope.
(a) List the 24 equally probable outcomes for this experiment. Be sure to explain your notation.
(b) List the sample points belonging to each of the following events:
A: “W ’s letter goes into the correct envelope”;
B: “no letters go into the correct envelopes”;
C: “exactly two letters go into the correct envelopes”;
D: “exactly three letters go into the correct envelopes”.
(c) Find the probability of each event in (b).
[3] “America believes in education: the average professor earns more money in a year than a professional athlete earns in a
whole week.” Evan Esar (1899 - 1995)
5. Three balls are placed at random in three boxes, with no restriction on the number of balls per
box.
(a) List the 27 equally probable outcomes of this experiment. Be sure to explain your notation.
(b) Find the probability of each of the following events:
A: “the first box is empty”;
B: “the first two boxes are empty”;
C: “no box contains more than one ball”.
(c) Find the probabilities of events A, B and C when three balls are placed at random in n
boxes ($n \ge 3$).
(d) Find the probabilities of events A, B and C when k balls are placed in n boxes ($n \ge k$).
6. Diagnostic Tests: Suppose that in a large population some persons have a specific disease at
a given point in time. A person can be tested for the disease, but inexpensive tests are often
imperfect, and may give either a “false positive” result (the person does not have the disease but
the test says they do) or a “false negative” result (the person has the disease but the test says they
do not).
In a random sample of 1000 people, individuals with the disease were identified according to a
completely accurate but expensive test, and also according to a less accurate but inexpensive test.
The results for the less accurate test were:
920 persons without the disease tested negative

60 persons without the disease tested positive
18 persons with the disease tested positive
2 persons with the disease tested negative.
(a) Estimate the fraction of the population that has the disease and tests positive using the
inexpensive test.
(b) Estimate the fraction of the population that has the disease.
(c) Suppose that someone randomly selected from the same population as those tested above
was administered the inexpensive test and it indicated positive. Based on the above
information, how would you estimate the probability that they actually have the disease?
7. Machine Recognition of Handwritten Digits: Suppose that you have an optical scanner and
associated software for determining which of the digits $0, 1, \ldots, 9$ an individual has written in a
square box. The system may of course be wrong sometimes, depending on the legibility of the
handwritten number.
(a) Describe a sample space S that includes points $(x, y)$, where x stands for the number actually
written, and y stands for the number that the machine identifies.
(b) Suppose that the machine is asked to identify very large numbers of digits, of which
$0, 1, \ldots, 9$ occur equally often, and suppose that the following probabilities apply to the
points in your sample space:
$p(0, 6) = p(6, 0) = 0.004$; $p(0, 0) = p(6, 6) = 0.096$
$p(5, 9) = p(9, 5) = 0.005$; $p(5, 5) = p(9, 9) = 0.095$
$p(4, 7) = p(7, 4) = 0.002$; $p(4, 4) = p(7, 7) = 0.098$
$p(y, y) = 0.100$ for $y = 1, 2, 3, 8$
Give a table with probabilities for each point $(x, y)$ in S. What fraction of numbers is
correctly identified?
8. In Problems 4-7, what can you say about how appropriate you think the probability model is for
the experiment being modelled?
9. Challenge Problem: Professor X has an integer m ($1 \le m \le 9$) in mind and asks two students,
Allan and Beth, to pick numbers between 1 and 9. Whichever is closer to m gets 90% and the
other 80% in STAT 230. If they are equally close, they both get 85%. If the professor’s number
and that of Allan are chosen purely at random and Allan announces his number out loud, describe
a sample space and a strategy which leads Beth to the highest possible mark.
3. PROBABILITY AND COUNTING TECHNIQUES
Some probability problems can be solved by specifying a sample space $S = \{a_1, a_2, \ldots, a_n\}$ in which
each simple event has probability $1/n$, that is, each simple event is “equally likely”. This is referred to as a
uniform distribution over the set $\{a_1, a_2, \ldots, a_n\}$. In a uniform probability model, we can calculate the
probability of any event A by counting the number of outcomes in the event A,
$$P(A) = \frac{\text{number of outcomes in } A}{\text{number of outcomes in } S}$$
In other words, we need to be able to count the number of outcomes in S which are in A. We now look at
techniques for counting outcomes from “experiments”.
3.1 Addition and Multiplication Rules

There are two helpful rules for counting, phrased in terms of “jobs” which are to be done.
1. The Addition Rule: Suppose we can do job 1 in p ways and job 2 in q ways. Then we can do
either job 1 OR job 2 (but not both), in p + q ways.
For example, suppose a class has 30 men and 25 women. There are 30 + 25 = 55 ways the
instructor can pick one student to answer a question. If there are 5 vowels and 20 consonants on a list
and I must pick one letter, this can be done in 5+20 ways.
2. The Multiplication Rule: Suppose we can do job 1 in p ways and, for each of these ways, we
can do job 2 in q ways. Then we can do both job 1 AND job 2 in $p \times q$ ways.
For example, if there are 5 vowels and 20 consonants and I must choose one consonant followed
by one vowel for a two-letter word, this can be done in $20 \times 5$ ways (there are 100 such words). To ride
a bike, you must have the chain on both a front sprocket and a rear sprocket. For a 21 speed bike there
are 3 ways to select the front sprocket and 7 ways to select the rear sprocket, which gives $3 \times 7 = 21$
such combinations.
This interpretation of “OR” as addition and “AND” as multiplication evident in the addition and
multiplication rules above will occur throughout probability, so it is helpful to make this association in
your mind. Of course questions do not always have an AND or an OR in them and you may have to
play around with re-wording the question to discover implied AND’s or OR’s.
Example: Suppose we pick 2 numbers from digits 1, 2, 3, 4, 5 with replacement. (Note: “with
replacement” means that after the first number is picked it is “replaced” in the set of numbers, so it
could be picked again as the second number.) Assume a uniform distribution on the sample space, that
is, assume that every pair of numbers has the same probability. Let us find the probability that exactly one
number is even. This can be reworded as: “The first number is even AND the second is odd (this can
be done in $2 \times 3$ ways) OR the first is odd AND the second is even (done in $3 \times 2$ ways).” Since these
are connected with the word OR, we combine them using the addition rule to calculate that there are
$(2 \times 3) + (3 \times 2) = 12$ ways for this event to occur. Since the first number can be chosen in 5 ways
AND the second in 5 ways, S contains $5 \times 5 = 25$ points and since each point has the same probability,
they all have probability $1/25$. Therefore
$$P(\text{exactly one number is even}) = \frac{12}{25}$$
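The count in this example can be checked by listing the sample space directly. The following is a minimal Python sketch, not part of the original notes; the variable names are illustrative assumptions, and the event is read as "exactly one of the two numbers is even", as in the rewording above.

```python
from fractions import Fraction
from itertools import product

digits = [1, 2, 3, 4, 5]

# Sample space: ordered pairs drawn with replacement (5 x 5 = 25 points).
sample_space = list(product(digits, repeat=2))

# Event: exactly one of the two numbers is even
# (first even AND second odd) OR (first odd AND second even).
event = [(a, b) for (a, b) in sample_space if (a % 2 == 0) != (b % 2 == 0)]

prob_one_even = Fraction(len(event), len(sample_space))
print(len(event), len(sample_space), prob_one_even)  # 12 25 12/25
```

Listing points like this is only practical for small sample spaces, which is why the counting rules are needed in general.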
When objects are selected and replaced after each draw, the addition and multiplication rules are gen-
erally sufficient to find probabilities. When objects are drawn without being replaced, some special
rules may simplify the solution.
Note: The phrases at random, or uniformly are often used to mean that all of the points in the sample
space are equally likely so that in the above problem, every possible pair of numbers chosen from this
set has the same probability $1/25$.
Problems
3.1.1 (a) A course has 4 sections with no limit on how many can enrol in each section. Three students
each pick a section at random.
(i) Specify the sample space S.

(ii) Find the probability that all three students end up in the same section.
(iii) Find the probability that all three students end up in different sections.
(iv) Find the probability nobody picks section 1.
(b) Repeat (a) in the case when there are n sections and s students ($n \ge s$).
3.1.2 Canadian postal codes consist of 3 letters (of 26 possible letters) alternated with 3 digits (of the 10
possible), starting with a letter (e.g. N2L 3G1). Assume no other restrictions on the construction
of postal codes. For a postal code chosen at random, what is the probability:
(a) all 3 letters are the same?

(b) the digits are all even or all odd? Treat 0 as being neither even nor odd.
3.1.3 Suppose a password has to contain between six and eight characters, with each character either a letter or
a number from 1 to 9. The password must contain at least one number.
(a) What is the total number of possible passwords?

(b) If you started to try passwords in random order, what is the probability you would find the
correct password for a given situation within the first 1,000 passwords you tried?
3.2 Counting Arrangements or Permutations

In many problems, the sample space is a set of arrangements or sequences. These are classically called
permutations. A key step in the argument is to be sure to understand what it is you are counting. It is
helpful to invent a notation for the outcomes in the sample space and the events of interest (these are
the objects you are counting).
Example: Suppose the letters a, b, c, d, e, f are arranged at random to form a six-letter word (an
arrangement); we must use each letter once only. The sample space
$$S = \{\text{abcdef}, \text{abcdfe}, \ldots, \text{fedcba}\}$$
has a large number of outcomes and, because we formed the word “at random”, we assign the same
probability to each. To count the number of words in S, count the number of ways that we can construct
such a word; each way corresponds to a unique word. Consider filling six boxes
corresponding to the six positions in the arrangement. We can fill the first box in 6 ways with any one
of the letters. For each of these choices, we can fill the second box in 5 ways with any one of the
remaining letters. Thus there are $6 \times 5 = 30$ ways to fill the first two boxes. (If you are not convinced
by this argument, list all the possible ways that the first two boxes can be filled.) For each of these
30 choices, we can fill the third box in 4 ways using any one of the remaining letters so there are
$6 \times 5 \times 4 = 120$ ways to fill the first three boxes. Applying the same reasoning, we see that there are
$6 \times 5 \times 4 \times 3 \times 2 \times 1 = 720$ ways to fill the 6 boxes and hence 720 equally probable words in S.
Now consider the event A: the second letter is e or f so
$$A = \{\text{afbcde}, \text{aebcdf}, \ldots, \text{efdcba}\}$$
We can count the number of outcomes in A using a similar argument if we start with the second box.
We can fill the second box in 2 ways, that is, with an e or f. For each of these choices, we can then fill
the first box in 5 ways, so now we can fill the first two boxes in $2 \times 5 = 10$ ways. For each of these
choices, we can fill the remaining four boxes in $4 \times 3 \times 2 \times 1 = 24$ ways so the number of outcomes
in A is $10 \times 24 = 240$. Since we have a uniform probability model
$$P(A) = \frac{\text{number of outcomes in } A}{\text{number of outcomes in } S} = \frac{240}{720} = \frac{1}{3}$$
In determining the number of outcomes in A, it is important that we start with the second box.
Suppose, instead, we start by saying there are 6 ways to fill the first box. Now the number of ways of
filling the second box depends on what happened in the first. If we used e or f in the first box, there is
only one way to fill the second. If we used a, b, c or d for the first box, there are 2 ways of filling the
second. We avoid this complication by starting with the second box.
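For a sample space this small, the box-filling argument can be confirmed by brute force. The sketch below is an illustration only (Python and the names used are assumptions, not part of the notes): it generates every arrangement of the six letters and counts those whose second letter is e or f.

```python
from fractions import Fraction
from itertools import permutations

letters = "abcdef"

# Sample space: every arrangement using each letter exactly once.
sample_space = ["".join(p) for p in permutations(letters)]

# Event A: the second letter (index 1) is e or f.
event_a = [word for word in sample_space if word[1] in "ef"]

prob_a = Fraction(len(event_a), len(sample_space))
print(len(sample_space), len(event_a), prob_a)  # 720 240 1/3
```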
We can generalize the above problem in several ways. In each case we count the number of arrange-
ments by counting the number of ways we can fill the positions in the arrangement. Suppose we start
with n symbols. Then we can make:
$n \times (n-1) \times \cdots \times 1$ arrangements of length n using each symbol once and only once. This
product is denoted by $n!$ (“n factorial”). Note that $n! = n \times (n-1)!$.
$n \times (n-1) \times \cdots \times (n-k+1)$ arrangements of length k using each symbol at most once. This
product is denoted by $n^{(k)}$ (“n to k factors”). Note that $n^{(k)} = \frac{n!}{(n-k)!}$.
$n \times n \times \cdots \times n = n^k$ arrangements of length k using each symbol as often as we wish.
In Table 3.1 we see how quickly n! increases as n increases.
n     n!         $(n/e)^n \sqrt{2\pi n}$
1     1          0.9
2     2          1.9
3     6          5.8
4     24         23.5
5     120        118.0
6     720        710.1
7     5040       4980.4
8     40320      39902.4
9     362880     359536.9
10    3628800    3598695.6
Table 3.1
Stirling’s Approximation: For large n there is an approximation to $n!$ called Stirling’s approximation.
Note that the sequence $\{a_n\}$ is asymptotically equivalent to the sequence $\{b_n\}$ if $\lim_{n \to \infty} a_n / b_n = 1$.
Stirling’s approximation says that $n!$ is asymptotically equivalent to $(n/e)^n \sqrt{2\pi n}$. The relative error
in the approximation $n! \approx (n/e)^n \sqrt{2\pi n}$ decreases quickly as n increases, as can be seen in Table 3.1,
and is less than 0.01 if $n \ge 9$ (for n = 8, Table 3.1 gives a relative error of about 0.0104).
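The entries of Table 3.1 and the relative error of the approximation can be reproduced with a few lines of code. This is a hedged sketch using only Python's standard `math` module; the function names `stirling` and `relative_error` are illustrative assumptions.

```python
import math

def stirling(n):
    """Stirling's approximation (n/e)^n * sqrt(2*pi*n) to n!."""
    return (n / math.e) ** n * math.sqrt(2 * math.pi * n)

def relative_error(n):
    """Relative error (n! - approximation) / n!."""
    exact = math.factorial(n)
    return (exact - stirling(n)) / exact

# Print n, n!, the approximation and its relative error for n = 1, ..., 10.
for n in range(1, 11):
    print(n, math.factorial(n), round(stirling(n), 1), round(relative_error(n), 4))
```

The approximation always falls slightly below $n!$, so the relative error as defined here is positive.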
For many problems involving sampling from a deck of cards or a reasonably large population,
counting the number of cases by simple conventional means is virtually impossible, and we need the
counting arguments dealt with here. The extraordinarily large size of populations, in part due to the
large size of quantities like $n^n$ and $n!$, is part of the reason that statistics, sampling, counting methods
and probability calculations play such an important part in modern science and business.
Example: A pin number of length four is formed by randomly selecting four digits from the set
$\{0, 1, 2, \ldots, 9\}$ with replacement. Find the probability of the events:
A: the pin number is even
B: the pin number has only even digits
C: all of the digits are unique
D: the pin number contains at least one 1.
Solution: Since we pick the digits with replacement, the outcomes in the sample space can have
repeated digits. The sample space is
$$S = \{0000, 0001, \ldots, 9999\}$$
with $10^4$ equally probable outcomes.

For the event $A = \{0000, 0002, \ldots, 9998\}$, we can select the last digit to be any one of 0, 2, 4, 6, 8
in 5 ways. Then for each of these choices, we can select the first digit in 10 ways and so on. There are
$5 \times 10^3$ outcomes in A and
$$P(A) = \frac{5 \times 10^3}{10^4} = \frac{1}{2}$$
For the event $B = \{0000, 0002, \ldots, 8888\}$, we can select the first digit in 5 ways, and for each of
these choices, the second in 5 ways, and so on. There are $5^4$ outcomes in B and
$$P(B) = \frac{5^4}{10^4} = \frac{1}{16}$$
For the event $C = \{0123, 0124, \ldots, 9876\}$, we can select the first digit in 10 ways and for each of
these choices, the second in 9 ways and so on. There are $10 \times 9 \times 8 \times 7$ outcomes in C and so
$$P(C) = \frac{10 \times 9 \times 8 \times 7}{10^4} = \frac{10^{(4)}}{10^4} = \frac{63}{125}$$
For the event $D = \{0001, 0011, 0111, 1111, \ldots\}$, it is easier to count the number of outcomes
in the complement of D, that is, the set of all outcomes in S but not in D. We denote this event
$\bar{D} = \{0000, 0002, \ldots, 9999\}$. There are $9^4$ outcomes in $\bar{D}$ and so there are $10^4 - 9^4$ outcomes in D
and
$$P(D) = \frac{10^4 - 9^4}{10^4} = \frac{3439}{10000}$$
For a general event A, the complement of A, denoted $\bar{A}$, is the set of all outcomes in S which are
not in A. It is sometimes easier to count outcomes in the complement rather than in the event itself.
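Since S has only $10^4$ points, all four answers can be verified by enumeration. The sketch below is an illustration under the uniform model of the example; the helper `prob` and the variable names are assumptions introduced here, not notation from the notes.

```python
from fractions import Fraction
from itertools import product

digits = "0123456789"

# Sample space: all 10^4 strings of four digits chosen with replacement.
pins = ["".join(p) for p in product(digits, repeat=4)]

def prob(event):
    """Probability of an event (a true/false test on a PIN) under the uniform model."""
    return Fraction(sum(1 for pin in pins if event(pin)), len(pins))

p_a = prob(lambda pin: pin[-1] in "02468")               # A: the PIN is even
p_b = prob(lambda pin: all(d in "02468" for d in pin))   # B: only even digits
p_c = prob(lambda pin: len(set(pin)) == 4)               # C: all digits different
p_d = prob(lambda pin: "1" in pin)                       # D: at least one 1

print(p_a, p_b, p_c, p_d)  # 1/2 1/16 63/125 3439/10000
```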
Example: A pin number of length four is formed by randomly selecting four digits from the set
$\{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$ without replacement. Find the probability of the events:
A: the pin number is even
B: the pin number has only even digits
C: the pin number begins or ends with a 1
D: the pin number contains 1.
Solution: The sample space is
$$S = \{0123, 0132, \ldots, 6789\}$$
with $10^{(4)}$ equally probable outcomes. For the event $A = \{1230, 0134, \ldots, 9876\}$, we can select the
last digit to be any one of 0, 2, 4, 6, 8 in 5 ways. Then for each of these choices, we can select the first
digit in 9 ways, the second in 8 ways and so on. There are $5 \times 9 \times 8 \times 7$ outcomes in A and
$$P(A) = \frac{5 \times 9 \times 8 \times 7}{10^{(4)}} = \frac{1}{2}$$
The event $B = \{0246, 0248, \ldots, 8642\}$. The pin numbers in B are all $5^{(4)}$ arrangements of length
4 using only the even digits $\{0, 2, 4, 6, 8\}$ and so
$$P(B) = \frac{5^{(4)}}{10^{(4)}} = \frac{5 \times 4 \times 3 \times 2}{10 \times 9 \times 8 \times 7} = \frac{1}{42}$$
The event $C = \{1023, 0231, \ldots, 9871\}$. There are 2 positions for the 1. For each of these choices,
we can fill the remaining three positions in $9^{(3)}$ ways and so
$$P(C) = \frac{2 \times 9^{(3)}}{10^{(4)}} = \frac{1}{5}$$
The event $D = \{1234, 2134, \ldots, 9871\}$. We can use the complement and count the number of pin
numbers that do not contain a 1. There are $9^{(4)}$ pin numbers that do not contain 1 and so there are
$10^{(4)} - 9^{(4)}$ that do contain a 1. Therefore
$$P(D) = 1 - P(\bar{D}) = 1 - \frac{9^{(4)}}{10^{(4)}} = \frac{2}{5}$$
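The same kind of check works when digits are drawn without replacement; the only change is that the sample space consists of arrangements of distinct digits. This is a sketch with illustrative names, not part of the original notes.

```python
from fractions import Fraction
from itertools import permutations

digits = "0123456789"

# Sample space: the 10^(4) = 10*9*8*7 arrangements of four distinct digits.
pins = ["".join(p) for p in permutations(digits, 4)]

def prob(event):
    """Probability of an event (a true/false test on a PIN) under the uniform model."""
    return Fraction(sum(1 for pin in pins if event(pin)), len(pins))

p_a = prob(lambda pin: pin[-1] in "02468")               # A: the PIN is even
p_b = prob(lambda pin: all(d in "02468" for d in pin))   # B: only even digits
p_c = prob(lambda pin: pin[0] == "1" or pin[-1] == "1")  # C: begins or ends with 1
p_d = prob(lambda pin: "1" in pin)                       # D: contains 1

print(len(pins), p_a, p_b, p_c, p_d)  # 5040 1/2 1/42 1/5 2/5
```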
3.3 Counting Subsets or Combinations

In some problems, the outcomes in the sample space are subsets of a fixed size. Here we look at
counting such subsets. Again, it is useful to write a short list of the subsets you are counting.
Example: Suppose we randomly select a subset of three digits from the set $\{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$
so that the sample space is
$$S = \{\{1, 2, 3\}, \{0, 1, 3\}, \{0, 1, 4\}, \ldots, \{7, 8, 9\}\}$$
All the digits in each outcome are unique, that is, we do not consider $\{1, 1, 2\}$ to be an outcome in S.
Also, the order of the elements in a subset is not relevant. This is true in general for sets; the subsets
$\{1, 2, 3\}$ and $\{3, 1, 2\}$ are the same. To count the number of outcomes in S, we use what we have
learned about counting arrangements. Suppose there are m such subsets. Using the elements of any
subset of size 3, we can form $3!$ arrangements of length 3. For example, the subset $\{1, 2, 3\}$ generates
the $3! = 6$ arrangements 123, 132, 213, 231, 312, 321 and any other subset generates a different $3!$
arrangements so that the total number of arrangements of 3 digits taken without replacement from the
set $\{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$ is $3! \times m$. But we know the total number of arrangements is $10^{(3)}$ so
$3! \times m = 10^{(3)}$. Solving we get
$$m = \frac{10^{(3)}}{3!} = 120$$
Number of subsets of size k: We use the combinatorial symbol $\binom{n}{k}$ (“n choose k”) to denote the
number of subsets of size k that can be selected from a set of n objects. By an argument similar
to that above, if m denotes the number of subsets of size k that can be selected from n things, then
$m \times k! = n^{(k)}$ and so we have
$$m = \binom{n}{k} = \frac{n^{(k)}}{k!}$$
In the example above we selected the subset at random so each of the $\binom{10}{3} = 120$ subsets has the
same probability $1/120$. We now find the probability of the following events:
A: the digit 1 is included in the selected subset

B: all the digits in the selected subset are even
C: at least one of the digits in the selected subset is less than or equal to 5
To count the outcomes in event A, we must have 1 in the subset and we can select the other two
elements from the remaining 9 digits in $\binom{9}{2}$ ways. And so
$$P(A) = \frac{\binom{9}{2}}{\binom{10}{3}} = \frac{9^{(2)}/2!}{10^{(3)}/3!} = \frac{3}{10}$$
The event $B = \{\{0, 2, 4\}, \{0, 2, 6\}, \ldots\}$. We can form the outcomes in B by selecting 3 digits
from the 5 even digits $\{0, 2, 4, 6, 8\}$ in $\binom{5}{3}$ ways. And so
$$P(B) = \frac{\binom{5}{3}}{\binom{10}{3}}$$
The event $C = \{\{0, 1, 2\}, \{0, 1, 6\}, \{0, 6, 7\}, \ldots\}$. Here it is convenient to consider the
complement $\bar{C}$ in which the outcomes are $\{\{6, 7, 8\}, \{6, 7, 9\}, \ldots\}$, that is, the subsets with all elements
greater than 5. We can form the subsets in $\bar{C}$ by selecting a subset of size 3 from the set $\{6, 7, 8, 9\}$ in
$\binom{4}{3}$ ways. Therefore
$$P(C) = 1 - P(\bar{C}) = 1 - \frac{\binom{4}{3}}{\binom{10}{3}}$$
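The three probabilities can be evaluated by listing all 120 subsets. The following sketch is illustrative only (the names are assumptions); it uses `itertools.combinations`, which generates each subset once, without regard to order.

```python
from fractions import Fraction
from itertools import combinations

# Sample space: all subsets of size 3 of {0, 1, ..., 9}.
subsets = [set(c) for c in combinations(range(10), 3)]

def prob(event):
    """Probability of an event (a true/false test on a subset) under the uniform model."""
    return Fraction(sum(1 for s in subsets if event(s)), len(subsets))

p_a = prob(lambda s: 1 in s)                       # A: the digit 1 is included
p_b = prob(lambda s: all(d % 2 == 0 for d in s))   # B: all digits are even
p_c = prob(lambda s: any(d <= 5 for d in s))       # C: at least one digit <= 5

print(len(subsets), p_a, p_b, p_c)  # 120 3/10 1/12 29/30
```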
Example: Suppose a box contains 10 balls of which 3 are red, 4 are white and 3 are green. A sample
of 4 balls is selected at random without replacement. Find the probability of the events:
E: the sample contains 2 red balls
F: the sample contains 2 red, 1 white and 1 green ball
G: the sample contains 2 or more red balls
Solution: Imagine that we label the balls from 1 to 10 with labels 1, 2, 3 being red, 4, 5, 6, 7 being
white and 8, 9, 10 being green. Construct a uniform probability model in which all subsets of size 4 are
equally probable. The sample space is
$$S = \{\{1, 2, 3, 4\}, \{1, 2, 3, 5\}, \ldots, \{7, 8, 9, 10\}\}$$
and each outcome has probability $1/\binom{10}{4}$.
The event E: To count the number of outcomes in E, we can construct a subset with two red balls
by first choosing the two red balls from the three in $\binom{3}{2}$ ways. For each of these choices we can select
the other two balls from the seven non-red balls in $\binom{7}{2}$ ways so there are $\binom{3}{2}\binom{7}{2}$ outcomes in E and
$$P(E) = \frac{\binom{3}{2}\binom{7}{2}}{\binom{10}{4}} = \frac{3}{10}$$
The event $F = \{\{1, 2, 4, 8\}, \{1, 2, 4, 9\}, \ldots\}$. To count the number of outcomes in F, we can
select the two red balls in $\binom{3}{2}$ ways, then the white ball in $\binom{4}{1}$ ways and the green ball in $\binom{3}{1}$ ways. So
we have
$$P(F) = \frac{\binom{3}{2}\binom{4}{1}\binom{3}{1}}{\binom{10}{4}} = \frac{6}{35}$$
The event $G = \{\{1, 2, 3, 4\}, \{1, 2, 4, 5\}, \ldots\}$ has outcomes with both 2 and 3 red balls. We need
to count these separately (see below). There are $\binom{3}{2}\binom{7}{2}$ outcomes with exactly two red balls and $\binom{3}{3}\binom{7}{1}$
outcomes with three red balls. Hence we have
$$P(G) = \frac{\binom{3}{2}\binom{7}{2} + \binom{3}{3}\binom{7}{1}}{\binom{10}{4}} = \frac{1}{3}$$
A common mistake is to count the outcomes in G as follows. There are $\binom{3}{2}$ ways to select two red
balls and then for each of these choices we can select the remaining two balls from the remaining eight
in $\binom{8}{2}$ ways. So the number of outcomes in G is $\binom{3}{2}\binom{8}{2}$. You can easily check that this is greater
than $\binom{3}{2}\binom{7}{2} + \binom{3}{3}\binom{7}{1}$. The reason for the error is that some of the outcomes in G have been counted
more than once. For example, you might pick red balls 1, 2 and then other balls 3, 4 to get the subset
$\{1, 2, 3, 4\}$. Or you may pick red balls 1, 3 and then other balls 2, 4 to get the subset $\{1, 3, 2, 4\}$. These
are counted as two separate outcomes but they are in fact the same subset. To avoid this counting error,
whenever you are asked about events defined in terms such as “at most ...”, “more than ...”, “fewer
than ...” etc., break the events into pieces where each piece has outcomes with specific values e.g. two
red balls, three red balls.
Properties of $\binom{n}{k}$: You should be able to prove the following for n and k non-negative integers:
1. $n^{(k)} = \frac{n!}{(n-k)!} = n(n-1)^{(k-1)}$ for $k \ge 1$
2. $\binom{n}{k} = \frac{n!}{k!(n-k)!} = \frac{n^{(k)}}{k!}$
3. $\binom{n}{k} = \binom{n}{n-k}$ for all $k = 0, 1, \ldots, n$
4. If we define $0! = 1$, then the formulas hold with $\binom{n}{0} = \binom{n}{n} = 1$.
5. $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$
6. Binomial Theorem: $(1 + x)^n = \binom{n}{0} + \binom{n}{1}x + \binom{n}{2}x^2 + \cdots + \binom{n}{n}x^n$
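A numerical spot check is not a proof, but it can catch a misremembered formula. The sketch below tests the properties for small n and k using `math.comb` and `math.perm` from the Python standard library (Python 3.8 or later), where `perm(n, k)` equals $n^{(k)}$. The function name `check_properties` and the test value x = 3 are arbitrary assumptions.

```python
from math import comb, factorial, perm

def check_properties(max_n, x=3):
    """Check properties 1-6 for all 1 <= n < max_n and 0 <= k <= n."""
    for n in range(1, max_n):
        for k in range(0, n + 1):
            # Property 1: n^(k) = n!/(n-k)! = n (n-1)^(k-1) for k >= 1
            assert perm(n, k) == factorial(n) // factorial(n - k)
            if k >= 1:
                assert perm(n, k) == n * perm(n - 1, k - 1)
            # Property 2: C(n,k) = n!/(k!(n-k)!) = n^(k)/k!
            assert comb(n, k) == factorial(n) // (factorial(k) * factorial(n - k))
            assert comb(n, k) == perm(n, k) // factorial(k)
            # Property 3: symmetry
            assert comb(n, k) == comb(n, n - k)
            # Property 5: Pascal's rule (comb(n-1, n) is 0 when k = n)
            if k >= 1:
                assert comb(n, k) == comb(n - 1, k - 1) + comb(n - 1, k)
        # Property 4: the end cases
        assert comb(n, 0) == comb(n, n) == 1
        # Property 6: Binomial Theorem evaluated at an integer x
        assert (1 + x) ** n == sum(comb(n, k) * x ** k for k in range(n + 1))
    return True

print(check_properties(12))  # True
```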
In many problems, we can combine counting arguments for arrangements and subsets as in the
following example.
Example: A binary sequence is an arrangement of zeros and ones. Suppose we have a uniform
probability model on the sample space of all binary sequences of length 10. What is the probability
that the sequence has exactly 5 zeros?
$$S = \{0000000000, 0000000001, \ldots, 1111111111\}$$
We can fill each of the 10 positions in the sequence in 2 ways and hence S has $2^{10}$ equally possible
outcomes. The event E with exactly 5 zeros and 5 ones is
$$E = \{0000011111, 1000001111, \ldots, 1111100000\}$$
To count the outcomes in E, think of constructing the sequence by filling ten boxes, one for each position.
We can choose the 5 boxes for the zeros in $\binom{10}{5}$ ways and then the ones go in the remaining boxes in 1
way. Hence we have
$$P(E) = \frac{\binom{10}{5}}{2^{10}}$$
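As a check, the 1024 sequences can be generated and those with exactly five zeros counted, which should agree with $\binom{10}{5}$. This is a minimal sketch with illustrative names, not part of the notes.

```python
from fractions import Fraction
from itertools import product
from math import comb

# Sample space: all 2^10 binary sequences of length 10.
sequences = list(product("01", repeat=10))

# Event E: exactly 5 zeros (and hence 5 ones).
event_e = [s for s in sequences if s.count("0") == 5]

prob_e = Fraction(len(event_e), len(sequences))
print(len(event_e), comb(10, 5), prob_e)  # 252 252 63/256
```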
3.4 Number of Arrangements When Symbols Are Repeated

Example: Suppose the letters of the word STATISTICS are arranged at random. Find the probability
of the event G that the arrangement begins and ends with S.
$$S = \{\text{SSSTTTIIAC}, \text{SSSTTTIICA}, \ldots\}$$
Here we need to count arrangements when some of the elements are the same. We construct the
arrangements by filling ten boxes corresponding to the positions in the arrangement.
We can choose the three positions for the three S’s in $\binom{10}{3}$ ways. For each of these choices, we can
choose the positions for the three T’s in $\binom{7}{3}$ ways. Then we can place the two I’s in $\binom{4}{2}$ ways, then the
C in $\binom{2}{1}$ ways and finally the A in $\binom{1}{1}$ ways. The number of equally probable outcomes in the sample
space S is
$$\binom{10}{3}\binom{7}{3}\binom{4}{2}\binom{2}{1}\binom{1}{1} = \frac{10!}{3!\,7!} \cdot \frac{7!}{3!\,4!} \cdot \frac{4!}{2!\,2!} \cdot \frac{2!}{1!\,1!} \cdot \frac{1!}{1!\,0!} = \frac{10!}{3!\,3!\,2!\,1!\,1!}$$
The event G = {SSTTTIIACS, SSTTTIICAS, ...}. To count the outcomes in G we must have S in
the first and last position
S _ _ _ _ _ _ _ _ S
Now we can use the same technique to arrange the remaining eight letters. Having placed two of the
S’s, there remain eight free boxes, in which we are to place three T’s in $\binom{8}{3}$ ways, two I’s in $\binom{5}{2}$ ways,
one C in $\binom{3}{1}$ ways, one A in $\binom{2}{1}$ ways and finally the remaining S in the last empty box in $\binom{1}{1}$ way.
There are
$$\binom{8}{3}\binom{5}{2}\binom{3}{1}\binom{2}{1}\binom{1}{1} = \frac{8!}{3!\,2!\,1!\,1!\,1!} = 3360$$
elements in G and
$$P(G) = \frac{8!/(3!\,2!\,1!\,1!\,1!)}{10!/(3!\,3!\,2!\,1!\,1!)} = \frac{3360}{50400} = \frac{1}{15}$$
In general, if we have $n_i$ symbols of type i, $i = 1, 2, \ldots, k$ with $n_1 + n_2 + \cdots + n_k = n$, then the
number of arrangements using all of the symbols is
$$\binom{n}{n_1}\binom{n - n_1}{n_2}\binom{n - n_1 - n_2}{n_3} \cdots \binom{n_k}{n_k} = \frac{n!}{n_1!\,n_2! \cdots n_k!}$$
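The general formula is easy to turn into a small function. The sketch below is an illustration (the function name `arrangements` is an assumption): it counts the repeats of each symbol and divides $n!$ by the factorial of each count, then reproduces the numbers from the STATISTICS example.

```python
from collections import Counter
from fractions import Fraction
from math import factorial

def arrangements(symbols):
    """Number of distinct arrangements of a multiset: n!/(n1! n2! ... nk!)."""
    total = factorial(len(symbols))
    for count in Counter(symbols).values():
        total //= factorial(count)  # exact at every step
    return total

n_total = arrangements("STATISTICS")  # all arrangements of the ten letters
# With an S fixed at each end, the middle eight letters are T, A, T, I, S, T, I, C.
n_g = arrangements("TATISTIC")
prob_g = Fraction(n_g, n_total)
print(n_total, n_g, prob_g)  # 50400 3360 1/15
```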
Example: Suppose we make a random arrangement of length 3 using letters from the set
$\{a, b, c, d, e, f, g, h, i, j\}$. What is the probability of the event B = “letters are in alphabetic order” if
(a) letters are selected without replacement?
(b) letters are selected with replacement?
Solution: For (a), the sample space is
$$\{abc, acb, bac, bca, cab, cba, \ldots, hij\}$$
with $10^{(3)}$ equally probable outcomes. The event $B = \{abc, abd, \ldots, hij\}$. To count the outcomes in
B, we first select the three (different) letters to form the arrangement in $\binom{10}{3}$ ways. There is then one
way to make an arrangement with the selected letters in alphabetic order. So we have
$$P(B) = \frac{\binom{10}{3}}{10^{(3)}} = \frac{1}{6}$$
For (b), the sample space is
$$\{aaa, aab, baa, aba, abc, acb, bac, bca, cab, cba, \ldots, hij\}$$
with $10^3$ equally probable outcomes. To count the elements in B, consider the following cases:
Case 1: all three letters are the same. There are ten such arrangements {aaa, bbb, ccc, ...}, all in
alphabetic order.
Case 2: there are two different letters e.g. {aab, aba, baa, abb, bab, bba}. We can choose the two
letters in $\binom{10}{2}$ ways. For each of these choices, we can then make two arrangements with the letters in
alphabetic order e.g. {aab, abb}. There are $2\binom{10}{2}$ arrangements in this case.
Case 3: all three letters are different. We can select the three letters in $\binom{10}{3}$ ways and then make one
arrangement that is in alphabetic order (as in part (a)).
Combining the three cases, we have
$$P(B) = \frac{10 + 2\binom{10}{2} + \binom{10}{3}}{10^3} = \frac{11}{50}$$
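Both parts can be checked by enumeration. In the sketch below (illustrative names, not from the notes), an arrangement with repeated letters such as aab is treated as being in alphabetic order, which matches the way Cases 1 and 2 are counted above.

```python
from fractions import Fraction
from itertools import permutations, product

letters = "abcdefghij"

def in_order(word):
    """True if the letters never decrease (repeats allowed, as in Cases 1 and 2)."""
    return list(word) == sorted(word)

without_repl = list(permutations(letters, 3))   # 10^(3) = 720 arrangements
with_repl = list(product(letters, repeat=3))    # 10^3 = 1000 arrangements

p_without = Fraction(sum(map(in_order, without_repl)), len(without_repl))
p_with = Fraction(sum(map(in_order, with_repl)), len(with_repl))
print(p_without, p_with)  # 1/6 11/50
```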
Example: Suppose we make a four digit number by randomly selecting and arranging four digits from
$1, 2, \ldots, 7$ without replacement. Find the probability that the number formed is
(a) even
(b) over 3000
(c) an even number over 3000.
Solution: Since we are forming a four digit number, the order in which the numbers are selected is
important. We choose the sample space S to be the set of all possible arrangements of four numbers
selected without replacement from the numbers $1, 2, \ldots, 7$. The sample space is
$$S = \{1234, 1243, 1324, 1342, \ldots, 4567, 4576, \ldots, 7654\}$$
with $7^{(4)}$ equally probable outcomes.
(a) For a number to be even, the last digit must be even. We can fill the last position in three ways
with a 2, 4, or 6. The first three positions can be filled by choosing and arranging three of the six
digits not used in the final position in $6^{(3)}$ ways. Then there are $3 \times 6^{(3)}$ ways to fill the final position
AND the first three positions to produce an even number. Therefore the probability the number is even
is $\frac{3 \times 6^{(3)}}{7^{(4)}} = \frac{3}{7}$. Alternatively, the four digit number is even if and only if the last digit is even. The
last digit is equally likely to be any one of the numbers $1, 2, \ldots, 7$ so the probability it is even is the
probability it is either 2, 4, or 6, which is $\frac{3}{7}$.
(b) To get a number over 3000, we require the first digit to be 3, 4, 5, 6, or 7, that is, the first
position can be filled in five ways. The remaining three positions can be filled in $6^{(3)}$ ways. Therefore
the probability the number is greater than 3000 is $\frac{5 \times 6^{(3)}}{7^{(4)}} = \frac{5}{7}$. Alternatively, note that the four digit
number is over 3000 if and only if the first digit is one of 3, 4, 5, 6 or 7. Since each of $1, 2, \ldots, 7$ is
equally likely to be the first digit, we get the probability the number is greater than 3000 is $\frac{5}{7}$.
In both (a) and (b) we dealt with positions which had restrictions first, before considering positions
with no restrictions. This is generally the best approach to follow in applying counting techniques.
(c) This part has restrictions on both the first and last positions. To illustrate the complication this
introduces, suppose we decide to fill positions in the order 1 then 4 then the middle two. We can fill
position 1 in 5 ways. How many ways can we then fill position 4? The answer is either 2 or 3 ways,
depending on whether the first position was filled with an even or odd digit. Whenever we encounter a
situation such as this, we have to break the solution into separate cases. One case is where the first digit
is even. The positions can be filled in 2 ways for the first (that is, with a 4 or 6), 2 ways for the last, and
then $5^{(2)}$ ways to arrange 2 of the remaining 5 digits in the middle positions. This first case then occurs
in $2 \times 2 \times 5^{(2)}$ ways. The second case has an odd digit in position one. There are 3 ways to fill position
one (3, 5, or 7), 3 ways to fill position four (2, 4, or 6), and $5^{(2)}$ ways to fill the remaining positions.
Case 2 then occurs in $3 \times 3 \times 5^{(2)}$ ways. We need case 1 OR case 2. Therefore the probability we
obtain an even number greater than 3000 is
$$\frac{2 \times 2 \times 5^{(2)} + 3 \times 3 \times 5^{(2)}}{7^{(4)}} = \frac{13 \times 5^{(2)}}{7 \times 6 \times 5^{(2)}} = \frac{13}{42}$$
Another way to do this is to realize that we need only to consider the first and last digit, and to find
$P(\text{first digit is} \ge 3 \text{ and last digit is even})$. There are $7 \times 6 = 42$ different choices for (first digit, last
digit) and it is easy to see there are 13 choices for which first digit $\ge 3$, last digit is even ($5 \times 3$ minus
the impossible outcomes (4, 4) and (6, 6)). Thus the desired probability is $\frac{13}{42}$.
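Both routes to the answer in (c) can be confirmed by brute force. The sketch below is illustrative only (the names are assumptions): it lists all $7^{(4)} = 840$ numbers and, separately, the 42 possible (first digit, last digit) pairs.

```python
from fractions import Fraction
from itertools import permutations

# Sample space: four digit numbers with distinct digits from 1, ..., 7.
numbers = [1000 * a + 100 * b + 10 * c + d
           for a, b, c, d in permutations(range(1, 8), 4)]

def prob(event):
    return Fraction(sum(1 for m in numbers if event(m)), len(numbers))

p_even = prob(lambda m: m % 2 == 0)                # part (a)
p_over = prob(lambda m: m > 3000)                  # part (b)
p_both = prob(lambda m: m % 2 == 0 and m > 3000)   # part (c)

# Shortcut: only the (first digit, last digit) pair matters.
pairs = list(permutations(range(1, 8), 2))
good_pairs = [(f, l) for f, l in pairs if f >= 3 and l % 2 == 0]

print(len(numbers), p_even, p_over, p_both)  # 840 3/7 5/7 13/42
print(len(good_pairs), len(pairs))           # 13 42
```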
Exercise: Try to solve (c) by filling positions in the order 4, 1, middle. You should get the same
answer.
Exercise: Can you spot the flaw in the following argument? There are $3 \times 6^{(3)}$ ways to get an even
number (part (a)). There are $5 \times 6^{(3)}$ ways to get a number $> 3000$ (part (b)). Therefore by the
multiplication rule there are $[3 \times 6^{(3)}] \times [5 \times 6^{(3)}]$ ways to get a number which is even and $> 3000$.
Example: Five red balls and three white balls are arranged at random in a row. Find the probability
that:
(a) the same colour is at each end

(b) the three white balls are together.
Solution: There are eight objects, five of one type and three of another, that is, five R’s and three W’s,
so our sample space has $\frac{8!}{5!\,3!} = 56$ equally possible outcomes.
(a) To get the same colour at each end we need either
R _ _ _ _ _ _ R OR W _ _ _ _ _ _ W
The number of distinct arrangements with R at each end is $\frac{6!}{3!\,3!} = 20$, since we are arranging three R’s
and three W’s in the middle six positions. The number with W at each end is $\frac{6!}{5!\,1!} = 6$. Thus
$$P(\text{same colour at each end}) = \frac{20 + 6}{56} = \frac{13}{28}$$
(b) Treating WWW as a single unit, we are arranging six objects, five R’s and one object we might
call “WWW”. There are $\frac{6!}{5!\,1!} = 6$ arrangements. Thus,
$$P(\text{three white balls are together}) = \frac{6}{56} = \frac{3}{28}$$
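The 56 distinct rows can be generated by choosing which three of the eight positions hold the white balls. The following sketch is an illustration with assumed names, not part of the notes; it checks both answers.

```python
from fractions import Fraction
from itertools import combinations

# Each distinct row is determined by the 3 positions (0..7) holding white balls.
rows = []
for white_positions in combinations(range(8), 3):
    row = ["R"] * 8
    for i in white_positions:
        row[i] = "W"
    rows.append("".join(row))

p_same_ends = Fraction(sum(1 for r in rows if r[0] == r[-1]), len(rows))
p_together = Fraction(sum(1 for r in rows if "WWW" in r), len(rows))
print(len(rows), p_same_ends, p_together)  # 56 13/28 3/28
```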
Problems
3.4.1 Digits 1; 2; 3; : : : ; 7 are arranged at random to form a 7 digit number. Find the probability that
(a) the even digits occur together, in any order

(b) the digits at the 2 ends are both even or both odd.
3.4.2 The letters of the word EXCELLENT are arranged in a random order. Find the probability that
(a) the word begins and ends with the same letter.
(b) X,C, and N occur together, in any order.
(c) the letters occur in alphabetical order.
3.5 Examples
Example: In the Lotto 6/49 lottery, six numbers are drawn at random, without replacement, from the
numbers 1 to 49. Find the probability that
(a) the numbers $\{1, 2, 3, 4, 5, 6\}$ are drawn in any order.
(b) no even number is drawn.
Solution:
(a) Let the sample space S consist of all subsets of six numbers from $1, 2, \ldots, 49$; there are $\binom{49}{6}$ of
them. Since $\{1, 2, 3, 4, 5, 6\}$ is one of these subsets, the probability of this particular set is
$$\frac{1}{\binom{49}{6}}$$
which is about 1 in 13.9 million.

3.5. EXAMPLES 29
(b) There are 25 odd and 24 even numbers, so there are 256 choices in which all the numbers are
odd. Therefore the probability no even number is drawn is the probability they are all odd, or
25
6
49 = 0:0127
6
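The two Lotto 6/49 answers can be evaluated numerically. This is a small sketch using Python's standard `math.comb` for $\binom{n}{k}$; the variable names are our own.

```python
from math import comb

total = comb(49, 6)              # number of 6-number subsets of 1..49

p_specific = 1 / total           # part (a): one particular subset
p_all_odd = comb(25, 6) / total  # part (b): all six numbers odd

print(f"number of possible draws: {total}")
print(f"P(a particular set)  = {p_specific:.3e}")
print(f"P(no even number)    = {p_all_odd:.4f}")
```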
Example: Find the probability a bridge hand (13 cards picked at random from a standard deck$^{4}$ without
replacement) has:
(a) 3 aces
(b) at least 1 ace
(c) 6 spades, 4 hearts, 2 diamonds, 1 club
(d) a 6-4-2-1 split between the 4 suits
(e) a 5-4-2-2 split.
Solution: Since order of selection does not matter, we take S to have $\binom{52}{13}$ outcomes, each with the
same probability.
(a) We can choose 3 aces in $\binom{4}{3}$ ways. We also have to choose 10 other cards from the 48 non-aces.
This can be done in $\binom{48}{10}$ ways. Hence the probability of exactly three aces is
$$\frac{\binom{4}{3}\binom{48}{10}}{\binom{52}{13}}$$
(b) Solution 1: At least 1 ace means 1 ace or 2 aces or 3 aces or 4 aces. Calculate each part as in (a)
and use the addition rule to get that the probability of at least one ace is
$$\frac{\binom{4}{1}\binom{48}{12} + \binom{4}{2}\binom{48}{11} + \binom{4}{3}\binom{48}{10} + \binom{4}{4}\binom{48}{9}}{\binom{52}{13}}$$
Solution 2: If we subtract all cases with 0 aces from the $\binom{52}{13}$ points in S we are left with all points
having at least 1 ace. There are $\binom{4}{0}\binom{48}{13} = \binom{48}{13}$ possible hands with 0 aces since all cards must be drawn
from the non-aces. (The term $\binom{4}{0}$ can be omitted since $\binom{4}{0} = 1$, but was included here to show that we
were choosing 0 of the 4 aces.) This gives that the probability of at least one ace is
$$\frac{\binom{52}{13} - \binom{48}{13}}{\binom{52}{13}} = 1 - \frac{\binom{48}{13}}{\binom{52}{13}}$$
$^{4}$ A standard deck has 13 cards in each of four suits, hearts, diamonds, clubs and spades for a total of 52 cards. There are
four aces in the deck (one of each suit).
Incorrect Solution: Choose 1 of the 4 aces then any 12 of the remaining 51 cards. This guarantees
we have at least 1 ace, so the probability of at least one ace is $\binom{4}{1}\binom{51}{12} / \binom{52}{13}$. This is a common incorrect
solution. The flaw in this solution is that it counts some points more than once by partially keeping
track of order. For example, we could get the ace of spades on the first choice and happen to get the ace
of clubs in the last 12 draws. We also could get the ace of clubs on the first draw and then get the ace of
spades in the last 12 draws. Though in both cases we have the same outcome, they would be counted
as 2 different outcomes. The strategies in Solutions 1 and 2 above are safer. We often need to inspect a
solution carefully to avoid double or multiple counting.
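A short numerical comparison makes the double counting visible. The sketch below (variable names are ours) evaluates Solution 1, Solution 2 and the incorrect count. Because $\binom{4}{1}\binom{51}{12} = \binom{52}{13}$, the incorrect argument would give a "probability" of exactly 1, which cannot be right since hands with no aces exist.

```python
from math import comb

hands = comb(52, 13)  # size of the sample space

# Solution 1: add up the cases of exactly k aces, k = 1, 2, 3, 4.
sol1 = sum(comb(4, k) * comb(48, 13 - k) for k in range(1, 5))

# Solution 2: remove the hands with no aces.
sol2 = hands - comb(48, 13)

# Incorrect count: pick one ace, then any 12 of the other 51 cards.
wrong = comb(4, 1) * comb(51, 12)

# The incorrect count tallies a hand with k aces k times.
weighted = sum(k * comb(4, k) * comb(48, 13 - k) for k in range(1, 5))

p_at_least_one = sol2 / hands
print(sol1 == sol2, wrong == weighted, wrong == hands)
print(f"P(at least one ace) = {p_at_least_one:.4f}")
```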
(c) Choose the 6 spades in $\binom{13}{6}$ ways and the hearts in $\binom{13}{4}$ ways and the diamonds in $\binom{13}{2}$ ways and
the clubs in $\binom{13}{1}$ ways. Therefore the probability of 6 spades, 4 hearts, 2 diamonds and 1 club is
$$\frac{\binom{13}{6}\binom{13}{4}\binom{13}{2}\binom{13}{1}}{\binom{52}{13}} = 0.00196$$
(d) The split in (c) is only 1 of several possible 6-4-2-1 splits. In fact, filling in the numbers 6, 4, 2 and
1 in the spaces below
Spades Hearts Diamonds Clubs
defines a 6-4-2-1 split. There are 4! ways to do this, and having done this, there are $\binom{13}{6}\binom{13}{4}\binom{13}{2}\binom{13}{1}$
ways to pick the cards from these suits. Therefore the probability of a 6-4-2-1 split between the 4
suits is
$$\frac{4!\,\binom{13}{6}\binom{13}{4}\binom{13}{2}\binom{13}{1}}{\binom{52}{13}} = 0.047$$
(e) This is the same question as (d) except the numbers 5-4-2-2 are not all different. There are $\frac{4!}{2!}$
different arrangements of 5-4-2-2 in the spaces below.
Spades Hearts Diamonds Clubs
Therefore, the probability of a 5-4-2-2 split is
$$\frac{\frac{4!}{2!}\,\binom{13}{5}\binom{13}{4}\binom{13}{2}\binom{13}{2}}{\binom{52}{13}} = 0.1058$$
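Parts (c) to (e) follow one pattern: count the ways to assign the split to the four suits, then choose the cards within each suit. The following sketch wraps that pattern in a hypothetical helper `split_probability` (our own name, not part of the notes) and reproduces the three numerical answers.

```python
from collections import Counter
from math import comb, factorial

HANDS = comb(52, 13)

def ways_for_fixed_suits(counts):
    """Number of hands with counts[0] spades, counts[1] hearts, and so on."""
    ways = 1
    for c in counts:
        ways *= comb(13, c)
    return ways

def split_probability(split):
    """Probability of the given split among the 4 suits, in any suit order."""
    if len(split) != 4 or sum(split) != 13:
        raise ValueError("a split must be 4 counts summing to 13")
    # Distinct ways to assign the counts to suits: 4! divided by the
    # factorial of each repeated count (e.g. 4!/2! for 5-4-2-2).
    arrangements = factorial(4)
    for multiplicity in Counter(split).values():
        arrangements //= factorial(multiplicity)
    return arrangements * ways_for_fixed_suits(split) / HANDS

p_c = ways_for_fixed_suits((6, 4, 2, 1)) / HANDS  # part (c)
p_d = split_probability((6, 4, 2, 1))             # part (d)
p_e = split_probability((5, 4, 2, 2))             # part (e)
print(f"{p_c:.5f} {p_d:.3f} {p_e:.4f}")
```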
Notes:
While $n^{(k)}$ only has a physical interpretation when n and k are non-negative integers with $n \geq k$,
$n^{(k)}$ can still be defined when n is a real number and k is a non-negative integer. In general we can
define $n^{(k)} = n(n-1) \cdots (n-k+1)$. For example,
$$(-2)^{(3)} = (-2)(-2-1)(-2-2) = (-2)(-3)(-4) = -24$$
and
$$(1.3)^{(2)} = (1.3)(1.3 - 1) = 0.39$$
Note that in order for $\binom{n}{0} = \binom{n}{n} = 1$ we must define
$$n^{(0)} = \frac{n!}{(n-0)!} = 1 \quad \text{and} \quad 0! = 1$$
When n is not a non-negative integer $\geq k$, $\binom{n}{k}$ loses its physical meaning. If n is a real number and
k is a non-negative integer then we use
$$\binom{n}{k} = \frac{n^{(k)}}{k!} = \frac{n(n-1) \cdots (n-k+1)}{k!}$$
For example,
$$\binom{1/2}{3} = \frac{\left(\frac{1}{2}\right)^{(3)}}{3!} = \frac{\left(\frac{1}{2}\right)\left(-\frac{1}{2}\right)\left(-\frac{3}{2}\right)}{3!} = \frac{1}{16}$$
Note also that when n and k are non-negative integers and k > n then
$$\binom{n}{k} = \frac{n^{(k)}}{k!} = \frac{n(n-1) \cdots (1)(0) \cdots (n-k+1)}{k!} = 0$$
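These extended definitions translate directly into code. The sketch below is one possible implementation, assuming exact arithmetic with `fractions.Fraction`; the function names `falling` and `gen_binom` are our own choices and not notation from the notes.

```python
from fractions import Fraction
from math import factorial

def falling(n, k):
    """n^(k) = n(n-1)...(n-k+1) for real n and integer k >= 0 (empty product = 1)."""
    result = Fraction(1)
    for i in range(k):
        result *= Fraction(n) - i
    return result

def gen_binom(n, k):
    """Generalized binomial coefficient n^(k) / k!."""
    return falling(n, k) / factorial(k)

print(falling(-2, 3))                 # -24
print(falling(Fraction(13, 10), 2))   # 39/100
print(gen_binom(Fraction(1, 2), 3))   # 1/16
print(gen_binom(3, 5))                # 0, since k > n puts a factor of 0 in the product
```

Using `Fraction(13, 10)` rather than the float `1.3` keeps the arithmetic exact.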
Problems
3.5.1 A factory parking lot has 160 cars in it, of which 35 have faulty emission controls. An air quality
inspector does spot checks on 8 cars on the lot.
(a) Give an expression for the probability that at least 3 of these 8 cars will have faulty emission
controls.
(b) What assumption does your answer to (a) require? How likely is it that this assumption
holds if the inspector hopes to catch as many cars with faulty controls as possible?
3.5.2 In a race, the 15 runners are randomly assigned the numbers 1, 2, ..., 15. Find the probability
that
(a) 4 of the first 6 finishers have single digit numbers.

(b) the fifth runner to finish is the 3rd finisher with a single digit number.
(c) number 13 is the highest number among the first 7 finishers.

1. Six digits from 2, 3, 4, ..., 8 are chosen and arranged in a row without replacement. Find the
probability that
(a) the number is divisible by 2

(b) the digits 2 and 3 appear consecutively in the proper order (that is, in the order 23)
(c) digits 2 and 3 appear in the proper order but not consecutively.
2. There are 6 stops on a subway line and 4 passengers on a subway car. Assume the passengers
are each equally likely to get off at any stop. Find the probability that
(a) the passengers all get off at different stops

(b) 2 passengers get off at stop two and 2 passengers get off at stop five
(c) 2 passengers get off at one stop and the other 2 passengers get off together at a different stop
3. Five tourists plan to attend Octoberfest. Each tourist attends a location selected at random
from the choices: Alpine Club, Bingemans, Concordia Club, Kitchener Memorial Auditorium,
Queensmount, Schwaben Club, Transylvania Club (7 locations). Find the probability that
(a) the tourists all attend different locations

(b) the tourists all attend the same location
(c) two tourists attend one location and the other three tourists all attend a different location
(d) at least one of the tourists attends Queensmount
(e) both Queensmount and Concordia are unattended by the tourists
4. Suppose k people get on an elevator at the basement floor. There are n floors above the basement
floor which are numbered 1, 2, 3, ..., n where people may get off.
(a) Find the probability

(i) nobody gets off at floor 1
(ii) the people all get off at different floors ($n \geq k$).
(b) What assumption(s) underlies your answer to (a)? Comment briefly on how likely it is that
the assumption(s) is(are) valid.
5. Give an expression for the probability a bridge hand of 13 cards contains 2 aces, 4 face cards
(Jack, Queen or King) and 7 others.
6. The letters of the word STATISTICS are arranged in a random order to form a “word”. Find the
probability that
(a) the word is STATISTICS

(b) the word begins and ends with an I
(c) the word begins and ends with the same letter
(d) the T’s occur together in the word
(e) the word begins with an S and the T’s occur together
7. Three digits are chosen in order from 0, 1, 2, ..., 9. Find the probability the digits are drawn in
increasing order (first digit < second digit < third digit) if
(a) draws are made without replacement

(b) draws are made with replacement.
8. The Birthday Problem$^{5}$: Suppose there are n people in a room. Ignore birthdays that fall on
February 29 and assume that every person is equally likely to have been born on any of the other
365 days in a year. (Is this a reasonable assumption? There are several images available on the
internet illustrating the frequency of birthdays throughout the year.)
(a) Find the probability that every person in the room has a different birthday.
(b) Let p(n) be the probability that at least two people in a room containing n people have the
same birthday. Plot p(n) for n = 1, 2, ..., 80.
(c) For what value of n does p(n) exceed 0.5? This surprising result is called the Birthday
Paradox.
9. You have n identical looking keys on a chain, and one opens your office door. Suppose you try
the keys in random order.
(a) What is the probability the k’th key opens the door?
(b) What is the probability one of the first two keys opens the door (assume $n \geq 3$)?
(c) Determine numerical values for the answer in part (b) for n = 3, 5, 7.
$^{5}$ “My birthday was a natural disaster, a shower of paper full of flattery under which one almost drowned” Albert Einstein,
1954 on his seventy-fifth birthday.
10. (a) Suppose a set of nine tickets are numbered 1, 2, ..., 9. Three tickets are selected at random
without replacement. Find the probability that the numbers of the tickets form an arithmetic
progression. The order in which the tickets are selected does not matter.
(b) Suppose a set of 2n + 1 tickets are numbered 1, 2, ..., 2n + 1. Three tickets are selected at
random without replacement. Find the probability that the numbers of the tickets form an
arithmetic progression.
11. The 10,000 tickets for a lottery are numbered 0000 to 9999. A four-digit winning number is
drawn and a prize is paid on each ticket whose four-digit number is any arrangement of the number
drawn. For instance, if winning number 0011 is drawn, prizes are paid on tickets numbered
0011, 0101, 0110, 1001, 1010, and 1100. A ticket costs $1 and each prize is $500.
(a) What is the probability of winning a prize

(i) with ticket number 7337?
(ii) with ticket number 7235?
(b) Based on your calculations in (a), what advice would you give to someone buying a ticket
for this lottery?
(c) Assuming that all tickets are sold, what is the probability that the operator will lose money
on the lottery?
12. Capture/Recapture:
(a) There are 25 deer in a certain forested area, and 6 have been caught temporarily and tagged.
Some time later, 5 deer are caught. Find the probability that 2 of them are tagged. (What
assumption did you make to do this?)
(b) Suppose that the total number of deer in the area was unknown to you. Describe how you
could estimate the number of deer based on the information that 6 deer were tagged earlier,
and later when 5 deer are caught, 2 are found to be tagged. What estimate do you get?
13. Lotto 6/49: In Lotto 6/49 you purchase a lottery ticket with 6 different numbers, selected from
the set $\{1, 2, \ldots, 49\}$. In the draw, six (different) numbers are randomly selected. Find the
probability that
(a) your ticket matches exactly 3 of the 6 numbers drawn.

(b) your ticket matches none of the numbers drawn.
(c) your ticket matches exactly x of the 6 numbers drawn, x = 0, 1, ..., 6.
14. The Enigma machine was used in World War II to send encrypted messages. There were five
rotors to choose from. Three rotors were chosen and placed in order in the machine, and each was
set to one of 26 letters (A-Z) to create the starting position. (There were also other complications
such as a plugboard which swapped ten pairs of letters, but we will ignore this here.) Suppose
you are a cryptanalyst trying to break Enigma.
(a) How many possible starting positions are there?

(b) Now suppose you know which three rotors are being used (and the order), but you don’t
know the letters on each one that form the setting. Find the probability that:
(i) the setting is DKS
(ii) the setting contains 2 vowels
(iii) the setting begins and ends with a consonant
(iv) the setting contains at least one A
(c) Repeat the calculations in (b) if no two rotors can have the same letter.
15. Hash Tables: In computer science, a dictionary is a collection of key-value pairs (k, v), such
that each pair consists of a unique key k and some data v. For example, for a collection of
student records the key k would be the student ID number and v would be the data associated
with that student. Suppose a data structure is to be designed to maintain such a collection. Let
U = the set of possible keys, N = the number of elements in U, and n = number of keys
used in the dictionary. If $n \ll N$ then a data structure of size N is very wasteful in terms of
space. In such a situation, a hash table of size M < N can be used together with a hash function
$h : U \to \{0, 1, \ldots, M-1\}$. The pair (k, v) is stored in slot h(k) of the table. Ideally, a hash
function should have the property that each key is equally likely to be mapped to any of the M
slots in the table independently of any other key.
(a) Consider the hash function which chooses the slot for key k by randomly selecting a number
with replacement from the set $\{0, 1, \ldots, M-1\}$. When a key-value pair is mapped to a
slot which has already been assigned to another key-value pair a collision is said to have
occurred. Show that, for n keys and a hash table of size M, the probability of at least one
collision is equal to
$$1 - \left(1 - \frac{1}{M}\right)\left(1 - \frac{2}{M}\right) \cdots \left(1 - \frac{n-1}{M}\right)$$
Hint: See Problem 8.
(b) Show that the probability in (a) is approximately equal to
$$1 - e^{-\frac{1}{2} n(n-1)/M}$$
Hint: $e^{-x} \approx 1 - x$ for x close to zero.
(c) Use the approximation in (b) to show that if you want the probability of at least one collision
to be at most 0.5 then n should be less than $\sqrt{2M \log 2}$.
(d) Use the result in (c) to show that if there are $n = 2^{L/2}$ distinct keys, a hash table of size at
least $M = 2^{L}$ should be used to ensure that the probability of a collision is less than 0.5.
16. Texas Hold-em: Texas Hold-em is a poker game in which players are each dealt two cards face
down (called your hole or pocket cards) from a standard deck of 52 cards. This is followed by a round
of betting, and then five cards are dealt face up on the table, with various breaks to permit players
to bet the farm. These are communal cards that anyone can use in combination with their two
pocket cards to form a poker hand: players can use any five of the seven cards (the five face-up
cards and their two pocket cards) to form a five card poker hand. Probability calculations for this
game are required not only at the end but also at intermediate steps, and are quite complicated, so
usually simulation is used to determine the odds that you will win given your current information.
Consider a simple example. Suppose I was dealt two Jacks in the first round.
(a) What is the probability that the next three cards (face up) include at least one Jack?
(b) Given that there was no Jack among these next three cards, what is the probability that there
is at least one among the last two cards dealt face-up?
(c) What is the probability that the five face-up cards show two Jacks, given that I have two in
my pocket cards?
17. I have a quarter which turns up heads with probability 0.6, and a fair dime. The quarter is flipped
until a head occurs. Independently the dime is flipped until a head occurs. Find the probability
that the number of flips is the same for both coins.
18. Players A and B decide to play chess until one of them wins. Assume games are independent
with P(A wins) = 0.3, P(B wins) = 0.25 and P(draw) = 0.45 on each game. If the game ends
in a draw another game will be played. Find the probability A wins before B.
4. PROBABILITY RULES AND
CONDITIONAL PROBABILITY
4.1 General Methods
Recall that a probability model consists of a sample space S; a set of events or subsets of the sample
space to which we can assign probabilities and a mechanism for assigning these probabilities. The
probability of an arbitrary event A can be determined by summing the probabilities of simple events in
A and so we have the following rules:
Rule 1: $P(S) = 1$.
Proof: $P(S) = \sum_{a \in S} P(a) = \sum_{\text{all } a} P(a) = 1$.
Rule 2: For any event A, $0 \leq P(A) \leq 1$.
Proof: $P(A) = \sum_{a \in A} P(a) \leq \sum_{a \in S} P(a) = 1$ and since each $P(a) \geq 0$, we have $0 \leq P(A) \leq 1$.
Rule 3: If A and B are two events with $A \subseteq B$ (that is, all of the points in A are also in B), then
$P(A) \leq P(B)$.
Proof: $P(A) = \sum_{a \in A} P(a) \leq \sum_{a \in B} P(a) = P(B)$ so $P(A) \leq P(B)$.
Before continuing with the set-theoretic description of a probability model, let us review some of
the basic ideas in set theory. First what do sets have to do with the occurrence of events? Suppose
a random experiment having sample space S is conducted (for example a die is thrown with
$S = \{1, 2, 3, 4, 5, 6\}$). When would we say an event $A \subseteq S$, or in the case of the die, the event $A = \{2, 4, 6\}$
occurs? In the latter case, the event A means that the number showing is even, that is, in general that
one of the simple outcomes in A occurred.
Figure 4.1: Event A in sample space S
Figure 4.2: The union of two events $A \cup B$
We often illustrate the relationship among sets using Venn diagrams. In the drawings below, think
of S consisting of all of the points in a rectangle of area one$^{6}$. To illustrate the event A we can draw a
region within the rectangle with area roughly proportional to the probability of the event A. We might
think of the random experiment as throwing a dart at the rectangle in Figure 4.1, and we say the event
A occurs if the dart lands within the region A.
What if we combine two events A, B by including all of the points in either A or B or both? This is
the union of the two events or $A \cup B$ illustrated in Figure 4.2. The union of the events occurs if one of
the outcomes in either A or B or both occurs. In language we refer to this as the event “A or B” with
the understanding that in this course we will use the word “or” inclusively to also permit both. Another
way of expressing a union is $A \cup B$ occurs if at least one of A, B occurs. Similarly if we have three
events A, B, C, the event $A \cup B \cup C$ means “at least one of A, B, C”.
$^{6}$ As you may know, however, the number of points in a rectangle is NOT countable, so this is not a discrete sample space.
Nevertheless this definition of S is used to illustrate various combinations of sets.
Figure 4.3: The intersection of two events $A \cap B$
What about the intersection of two events ($A \cap B$), or the set of all points in S that are in both A and
B? This is illustrated in Figure 4.3. The event $A \cap B$ occurs if and only if a point in the intersection
occurs which means both A and B occur.
Note: The sets $A \cap B$ and $A \cap B \cap C$ are often written more simply as AB and ABC respectively.
Finally the complement of the event A, denoted by $\bar{A}$, is the set of all points which are in S but not
in A as in Figure 4.4.
Figure 4.4: $\bar{A}$ = the complement of the event A

Figure 4.5: Illustration of De Morgan’s Law using a Venn Diagram
There are two special events in a probability model that we will use. One is the whole sample space
S. Because P(S) = 1, this event is certain to occur. Another is the empty event, or the null set $\emptyset$. This
is a set with no elements at all and so it must have probability 0. Notice that $\emptyset = \bar{S}$.
The illustrations above showing the relationship among sets are examples of Venn diagrams. Since
probability theory is built from the relationships among sets, it is often helpful to use Venn diagrams
in solving problems. For example there are rules governing taking the complements of unions and
intersections that can easily be verified using Venn diagrams.
De Morgan’s Laws:
(a) $\overline{A \cup B} = \bar{A} \cap \bar{B}$
(b) $\overline{A \cap B} = \bar{A} \cup \bar{B}$
Proof of (a): One can argue such set theoretic rules using the definitions of the sets. For example when
is a point a in the set $\overline{A \cup B}$? This means $a \in S$ but a is not in $A \cup B$, which in turn implies a is not in
A and it is not in B; or $a \in \bar{A}$ and $a \in \bar{B}$, equivalently $a \in \bar{A} \cap \bar{B}$. As an alternative demonstration,
we can use a Venn diagram (Figure 4.5) in which $\bar{A}$ is indicated with vertical lines, $\bar{B}$ with horizontal
lines and so $\bar{A} \cap \bar{B}$ is the region with both vertical and horizontal lines. This agrees with the shaded
region $\overline{A \cup B}$.
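For a small sample space, De Morgan's Laws can also be checked exhaustively. The following is an illustrative sketch, not a proof for general sets: it takes S to be the six outcomes of a die and tests both laws over every pair of events. The helper names `complement` and `all_events` are our own.

```python
from itertools import combinations

S = frozenset(range(1, 7))  # sample space for one roll of a die

def complement(event):
    """Points of S that are not in the event."""
    return S - event

def all_events(space):
    """Every subset of the sample space, including the empty event and S."""
    items = sorted(space)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

events = all_events(S)

# (a) complement of a union = intersection of the complements
law_a = all(complement(A | B) == complement(A) & complement(B)
            for A in events for B in events)
# (b) complement of an intersection = union of the complements
law_b = all(complement(A & B) == complement(A) | complement(B)
            for A in events for B in events)

print(len(events), law_a, law_b)  # prints: 64 True True
```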
The following example demonstrates solving a problem using a Venn diagram.
Example: Suppose for students finishing second year Math that 22% have a math average greater than
80%, 24% have a STAT 230 mark greater than 80%, 20% have an overall average greater than 80%,
14% have both a math average and STAT 230 greater than 80%, 13% have both an overall average and
STAT 230 greater than 80%, 10% have all 3 of these averages greater than 80%, and 67% have none of
these 3 averages greater than 80%. Find the probability a randomly chosen math student finishing 2A
has math and overall averages both greater than 80% and STAT 230 less than or equal to 80%.
Solution: When using rules of probability it is generally helpful to begin by labeling the events of
interest. Imagine a student is chosen at random from all students finishing second year Math. For this
student, let
A be the event “math average greater than 80%”

B be the event “overall average greater than 80%”
C be the event “STAT 230 mark greater than 80%”
In terms of these symbols, we are given:
$P(A) = 0.22$, $P(B) = 0.20$, $P(C) = 0.24$
$P(A \cap C) = 0.14$, $P(B \cap C) = 0.13$
$P(A \cap B \cap C) = 0.1$
$P(\bar{A} \cap \bar{B} \cap \bar{C}) = 0.67$
Let us interpret some of these expressions; for example $\bar{A} \cap \bar{B} \cap \bar{C}$, or (not A) and (not B) and (not
C), means that none of the marks or averages are greater than 80% for the randomly chosen student.
We are asked to find $P(A \cap B \cap \bar{C})$, the region labelled with ‘(5) x’ in Figure 4.6. We have filled in
the following information on the Venn diagram, in the order indicated by (1), (2), (3), etc.
(1) $P(A \cap B \cap C) = 0.1$ is given
(2) $P(A \cap C) - P(A \cap B \cap C) = 0.14 - 0.1 = 0.04$
(3) $P(B \cap C) - P(A \cap B \cap C) = 0.13 - 0.1 = 0.03$
(4) $P(C) - P(A \cap C) - 0.03 = 0.24 - 0.14 - 0.03 = 0.07$
(5) $P(A \cap B \cap \bar{C})$ is unknown, so let $P(A \cap B \cap \bar{C}) = x$
(6) $P(A) - P(A \cap C) - P(A \cap B \cap \bar{C}) = 0.22 - 0.14 - x = 0.08 - x$
(7) $P(B) - P(B \cap C) - P(A \cap B \cap \bar{C}) = 0.20 - 0.13 - x = 0.07 - x$
(8) $P(\overline{A \cup B \cup C}) = 0.67$ is given.
[Venn diagram of A, B and C in S with regions labelled (1) 0.1, (2) 0.04, (3) 0.03, (4) 0.07, (5) x, (6) 0.08 - x, (7) 0.07 - x, (8) 0.67]
Figure 4.6: Venn Diagram for Math Averages Example
Adding all probabilities from (1) to (8) we obtain, since P(S) = 1,
$$0.1 + 0.04 + 0.03 + 0.07 + x + (0.08 - x) + (0.07 - x) + 0.67 = 1$$
giving $1.06 - x = 1$ and solving for x, $P(A \cap B \cap \bar{C}) = x = 0.06$.
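The bookkeeping in steps (1) to (8) can be replayed in exact arithmetic. This is a minimal sketch of that calculation; the variable names (`only_AC`, `constant`, and so on) are our own labels for the regions of the Venn diagram.

```python
from fractions import Fraction as F

# Given probabilities from the example.
p_A, p_B, p_C = F(22, 100), F(20, 100), F(24, 100)
p_AC, p_BC, p_ABC = F(14, 100), F(13, 100), F(10, 100)
p_none = F(67, 100)

only_AC = p_AC - p_ABC          # step (2): in A and C but not B
only_BC = p_BC - p_ABC          # step (3): in B and C but not A
only_C = p_C - p_AC - only_BC   # step (4): in C only

# Steps (6) and (7) are (p_A - p_AC) - x and (p_B - p_BC) - x, so the eight
# regions add to constant + x - x - x = constant - x, which must equal 1.
constant = (p_ABC + only_AC + only_BC + only_C
            + (p_A - p_AC) + (p_B - p_BC) + p_none)
x = constant - 1

only_A = p_A - p_AC - x         # step (6)
only_B = p_B - p_BC - x         # step (7)
print(constant, x, only_A, only_B)  # prints: 53/50 3/50 1/50 1/100
```

All eight regions come out non-negative and sum to 1, so the given percentages are consistent.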
Problems
4.1.1 In a class of 100 students who speak English, 50% also speak French, 25% also speak Spanish
and 40% also speak Mandarin or Cantonese. 10% speak both French and Spanish, 12% speak
both French and Mandarin or Cantonese and 2% speak French, Spanish and Mandarin or Cantonese.
8% of the students only speak English. Find the probability that a randomly chosen
student from this class speaks Spanish and Mandarin or Cantonese but not French.
4.1.2 According to a survey of people on the last Ontario voters list, 55% are female, 55% are polit-
ically to the right, and 15% are male and politically to the left. What percent are female and
politically to the right? Assume voter attitudes are classified simply as left or right.
4.2 Rules for Unions of Events

In addition to the rules governing probabilities listed in Section 4.1, we have the following:
Rule 4 a: Addition Law of Probability or the Sum Rule
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
Proof: Suppose we denote set differences by $A/B = A \cap \bar{B}$ = the set of points which are in A but
not in B. Then
$$\begin{aligned}
P(A) + P(B) &= \sum_{a \in A} P(a) + \sum_{a \in B} P(a) \\
&= \left( \sum_{a \in A/B} P(a) + \sum_{a \in AB} P(a) \right) + \left( \sum_{a \in B/A} P(a) + \sum_{a \in AB} P(a) \right) \\
&= \left( \sum_{a \in A/B} P(a) + \sum_{a \in AB} P(a) + \sum_{a \in B/A} P(a) \right) + \sum_{a \in AB} P(a) \\
&= \sum_{a \in A \cup B} P(a) + \sum_{a \in AB} P(a) \\
&= P(A \cup B) + P(A \cap B)
\end{aligned}$$
Rearranging $P(A) + P(B) = P(A \cup B) + P(A \cap B)$ we obtain $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
as required. This can also be justified by using a Venn diagram. Each point in $A \cup B$ must be counted
once. In the expression $P(A) + P(B)$, however, points in $A \cap B$ have their probability counted twice
(once in P(A) and once in P(B)) so they need to be subtracted once.
Rule 4 b: Probability of the Union of Three Events
$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(AB) - P(AC) - P(BC) + P(ABC) \qquad (4.1)$$
Proof: See Figure 4.7. In the sum $P(A) + P(B) + P(C)$ those points in the regions labelled D, H, J
in Figure 4.7 lie in only one of the events and their probabilities are added only once. However points
in the regions labelled G, E, I, for example, lie in two of the events. We can compensate for this
double counting by subtracting these probabilities once, e.g. using $P(A) + P(B) + P(C) - [P(AB) + P(AC) + P(BC)]$.
However, now those points in all three sets, that is, those points in F = ABC, have
their probabilities added in three times and then subtracted three times so they are not included at all.
We must therefore add P(ABC) back in once, which gives (4.1).
Figure 4.7: The union of three events $A \cup B \cup C$
Rule 4 c: Probability of the Union of n Events:
There is an obvious generalization of the above formula to n events $A_1, A_2, \ldots, A_n$. This is often
referred to as the inclusion-exclusion principle because of the process discussed above for constructing
it:
$$P(A_1 \cup A_2 \cup A_3 \cup \cdots \cup A_n) = \sum_{i} P(A_i) - \sum_{i<j} P(A_i A_j) + \sum_{i<j<k} P(A_i A_j A_k) - \sum_{i<j<k<l} P(A_i A_j A_k A_l) + \cdots \qquad (4.2)$$
(where the subscripts are all distinct, for example i < j < k < l).
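Proof: This can be proved using Rule 4a and induction. Let $B_n = A_1 \cup A_2 \cup A_3 \cup \cdots \cup A_n$ for
n = 1, 2, .... Then Rule 4a shows that (4.2) holds for n = 2. Suppose the rule is true for n. Then
$$\begin{aligned}
P(A_1 \cup A_2 \cup \cdots \cup A_n \cup A_{n+1}) &= P(B_n \cup A_{n+1}) \\
&= P(B_n) + P(A_{n+1}) - P(B_n A_{n+1}) \\
&= \sum_{i \leq n} P(A_i) - \sum_{i<j \leq n} P(A_i A_j) + \sum_{i<j<k \leq n} P(A_i A_j A_k) - \cdots + P(A_{n+1}) \\
&\quad - \sum_{i \leq n} P(A_i A_{n+1}) + \sum_{i<j \leq n} P(A_i A_j A_{n+1}) - \sum_{i<j<k \leq n} P(A_i A_j A_k A_{n+1}) + \cdots
\end{aligned}$$
Here $P(B_n A_{n+1})$ has been expanded by applying the rule for n events to the union of the events
$A_1 A_{n+1}, \ldots, A_n A_{n+1}$. Collecting terms according to the number of events intersected gives (4.2) for n + 1 events.
We will use (4.2) rarely in this course$^{7}$.
$^{7}$ i.e. do not memorize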
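Although (4.2) is rarely needed by hand, it is straightforward to evaluate by computer. The sketch below is one way to do so, assuming equally likely outcomes on the two-dice sample space; the four events and the function name `union_by_inclusion_exclusion` are our own illustrative choices. The result is compared with the probability obtained by forming the union directly.

```python
from fractions import Fraction
from itertools import combinations, product

S = list(product(range(1, 7), repeat=2))  # two rolls of a die, 36 points

def prob(event):
    """Probability of an event (a set of points) under equally likely outcomes."""
    return Fraction(len(event), len(S))

def union_by_inclusion_exclusion(events):
    """Evaluate the right-hand side of (4.2) for a list of events."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = 1 if r % 2 == 1 else -1   # + for odd r, - for even r
        for group in combinations(events, r):
            total += sign * prob(set.intersection(*group))
    return total

events = [
    {s for s in S if s[0] == 6},                        # 6 on the first roll
    {s for s in S if s[1] == 6},                        # 6 on the second roll
    {s for s in S if sum(s) == 7},                      # total is 7
    {s for s in S if s[0] % 2 == 0 and s[1] % 2 == 0},  # both rolls even
]

by_formula = union_by_inclusion_exclusion(events)
direct = prob(set.union(*events))
print(by_formula == direct)  # prints: True
```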
Definition 6 Events A and B are mutually exclusive if $A \cap B = \emptyset$ (the empty event).
Since mutually exclusive events A and B have no common points, $P(A \cap B) = P(\emptyset) = 0$.
In general, events $A_1, A_2, \ldots, A_n$ are mutually exclusive if $A_i \cap A_j = \emptyset$ for all $i \neq j$. This means
that there is no chance of two or more of these events occurring together; either exactly one of
the events occurs, or none. For example, if a die is rolled twice, the events A = “2 occurs on the 1st roll”
and B = “total is 10” are mutually exclusive events. Similarly the events $A_2, A_3, \ldots, A_{12}$ where $A_j$ is
the event that the total on the two dice is j are all mutually exclusive events. In the case of mutually
exclusive events, Rule 4 above simplifies to Rule 5 below.
Rule 5 a: Probability of the Union of Two Mutually Exclusive Events
Let A and B be mutually exclusive events. Then
$$P(A \cup B) = P(A) + P(B)$$
This is a consequence of Rule 4a and the fact that $P(A \cap B) = P(\emptyset) = 0$.
Rule 5 b: Probability of the Union of n Mutually Exclusive Events
In general, let $A_1, A_2, \ldots, A_n$ be mutually exclusive events. Then
$$P(A_1 \cup A_2 \cup \cdots \cup A_n) = \sum_{i=1}^{n} P(A_i)$$
This is easily proven from Rule 5a above using induction or as an immediate consequence of Rule 4c.
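The dice examples used to illustrate mutually exclusive events can be checked by enumeration. In this sketch (names are ours) the events $A_2, \ldots, A_{12}$ are built as sets of outcomes, tested for pairwise disjointness, and Rule 5b is confirmed by comparing the sum of their probabilities with the probability of their union.

```python
from fractions import Fraction
from itertools import combinations, product

S = list(product(range(1, 7), repeat=2))  # a die rolled twice

# totals[j] is the event "the total on the two rolls is j".
totals = {j: {s for s in S if sum(s) == j} for j in range(2, 13)}

pairwise_disjoint = all(not (totals[i] & totals[j])
                        for i, j in combinations(totals, 2))

sum_of_probs = sum(Fraction(len(e), len(S)) for e in totals.values())
p_union = Fraction(len(set.union(*totals.values())), len(S))

# The other example: "2 on the 1st roll" and "total is 10" share no points.
two_first = {s for s in S if s[0] == 2}
exclusive_pair = not (two_first & totals[10])

print(pairwise_disjoint, sum_of_probs == p_union, exclusive_pair)
```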
Rule 6: Probability of the Complement of an Event
$$P(A) = 1 - P(\bar{A})$$
Proof: A and $\bar{A}$ are mutually exclusive and $A \cup \bar{A} = S$; so by Rule 5a,
$$P(A \cup \bar{A}) = P(A) + P(\bar{A})$$
But since $P(A \cup \bar{A}) = P(S) = 1$,
$$1 = P(A) + P(\bar{A}) \quad \text{or} \quad P(A) = 1 - P(\bar{A})$$
This result is useful whenever $P(\bar{A})$ is easier to obtain than $P(A)$.

Example: Two ordinary dice are rolled. Find the probability that at least one of them turns up a six.
Solution 1: The sample space is $S = \{(1,1), (1,2), (1,3), \ldots, (6,6)\}$. Let A be the event that we
obtain 6 on the first die, B be the event that we obtain 6 on the second die and note (by Rule 4a) that
$$\begin{aligned}
P(\text{at least one die shows 6}) &= P(A \cup B) \\
&= P(A) + P(B) - P(A \cap B) \\
&= \frac{1}{6} + \frac{1}{6} - \frac{1}{36} = \frac{11}{36}
\end{aligned}$$
Solution 2: This is an example where it is perhaps somewhat easier to obtain the complement of the
event $A \cup B$ since the complement is the event that there is no six showing on either die, and there are
exactly 25 such points, $\{(1,1), \ldots, (1,5), (2,1), \ldots, (2,5), \ldots, (5,5)\}$. Therefore
$$P(\text{at least one die shows 6}) = 1 - P(\text{no 6 on either die}) = 1 - \frac{25}{36} = \frac{11}{36}$$
Example: Roll a die 3 times. Find the probability of getting at least one 6.
Solution 1: Let A be the event “at least one die shows 6”. Then $\bar{A}$ is the event that no die shows a 6.
Using counting arguments, there are 6 outcomes on each roll, so $S = \{(1,1,1), (1,1,2), \ldots, (6,6,6)\}$
has $6 \times 6 \times 6 = 216$ points. For $\bar{A}$ to occur we can’t have a 6 on any roll. Then $\bar{A}$ can occur in
$5 \times 5 \times 5 = 125$ ways. Therefore
$$P(\bar{A}) = \frac{125}{216}$$
and
$$P(A) = 1 - \frac{125}{216} = \frac{91}{216}$$
Solution 2: Can you spot the flaw in the following argument? Let
A be the event that 6 occurs on the first roll
B be the event that 6 occurs on the second roll
C be the event that 6 occurs on the third roll
Then
$$\begin{aligned}
P(\text{one or more six}) &= P(A \cup B \cup C) \\
&= P(A) + P(B) + P(C) \\
&= \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}
\end{aligned}$$
You should have noticed that A, B, and C are not mutually exclusive events, so we should have used
$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(AB) - P(AC) - P(BC) + P(ABC)$$
Each of AB, AC, and BC occurs 6 times in the sample space of 216 points and so
$$P(AB) = \frac{1}{36} = P(BC) = P(AC)$$
Also
$$P(ABC) = \frac{1}{216}$$
Therefore
$$P(A \cup B \cup C) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} - \frac{1}{36} - \frac{1}{36} - \frac{1}{36} + \frac{1}{216} = \frac{91}{216}$$
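Both solutions, and the flawed sum, can be compared against a direct enumeration of the 216 outcomes. The following is a small sketch with variable names of our own choosing.

```python
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=3))  # three rolls of a die

# Direct count of outcomes containing at least one 6.
p_direct = Fraction(sum(6 in s for s in S), len(S))

# Solution 1: complement rule.
p_complement = 1 - Fraction(5, 6) ** 3

# Corrected Solution 2: Rule 4b with P(AB) = 1/36 and P(ABC) = 1/216.
p_rule_4b = 3 * Fraction(1, 6) - 3 * Fraction(1, 36) + Fraction(1, 216)

# Flawed argument: treats A, B, C as if they were mutually exclusive.
p_flawed = 3 * Fraction(1, 6)

print(p_direct, p_complement, p_rule_4b, p_flawed)  # prints: 91/216 91/216 91/216 1/2
```

The flawed value 1/2 overstates the answer by 17/216, which is exactly the probability mass counted more than once.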
Note: These rules link the concepts of addition of probabilities with unions of events, and complements.
The next segment will consider intersection, multiplication of probabilities, and a concept
known as independence. Making these linkages will make problem solving and the construction of
probability models easier.
Problems
4.2.1 Let A, B, and C be events for which P(A) = 0.2, P(B) = 0.5, P(C) = 0.3, and
$P(A \cap B) = 0.1$.
(a) Find the largest possible value for $P(A \cup B \cup C)$.
(b) For this largest value to occur, are the events A and C mutually exclusive, not mutually
exclusive, or can this not be determined?
4.2.2 Prove that $P(A \cup B) = 1 - P(\bar{A} \cap \bar{B})$ for arbitrary events A and B in S.

4.3 Intersections of Events and Independence

Dependent and Independent Events:
Consider the events A : airplane engine fails in flight and B : airplane reaches its destination safely.
Do we normally consider these events as related or dependent in some way? Certainly if a Canada
Goose is sucked into one jet engine, then this event would affect the probability that the airplane safely
reaches its destination, that is, it affects the probability that should be assigned to the event B.
Suppose we toss a fair coin twice. Consider the events A : head on first toss and B : head on both
tosses. Again there appears to be some dependence. On the other hand, if we define the event B as B :
head on second toss, we do not think that the occurrence of A affects the chances that B will occur.
If we need to reassess the probability of an event B if we are given that the event A has occurred
then we call such a pair of events dependent, and otherwise we call them independent. We formalize
this concept in the following mathematical definition.
Definition 7 Events A and B are independent events if and only if $P(A \cap B) = P(A)P(B)$. If the
events are not independent, we call the events dependent.
When we use Venn diagrams, we imagine that the probabilities of events are roughly proportional
to their areas. This is justified in part because area and probability are two examples of “measures” in
mathematics and share much the same properties. Let us continue this tradition, so that if two events
are independent, then the “size” of their intersection, as represented by the area in a Venn diagram,
must equal the product of the individual probabilities. This means, of course, that when both events
have positive probability the intersection must be non-empty, and therefore the events are not mutually exclusive.
For example in the Venn diagram depicted in Figure 4.8, the area of region A is equal to
(0.4)(0.5) = 0.2, the area of region B is equal to (0.6)(0.5) = 0.3 and the area of region $A \cap B$ is
equal to (0.3)(0.2) = 0.06. Since $P(A)P(B) = (0.2)(0.3) = 0.06 = P(A \cap B)$, the events A and
B are independent events. If you were to hold the rectangle A in place and move the rectangle B down
and to the right, the probability of the intersection as represented by the area would decrease and the
events would become dependent.
Exercise: Suppose the events A and B are mutually exclusive events with P (A) > 0 and P (B) > 0.
Can A and B be independent events?
[Figure 4.8: A Venn diagram illustrating independent events. Rectangles A and B overlap in the region A ∩ B inside the unit square S.]
Example: Suppose we toss a fair coin twice. Let A = “head on 1st toss” and B = “head on 2nd toss”. Clearly A and B are independent events since the outcome on each toss is unrelated to other tosses, so $P(A) = \frac{1}{2}$, $P(B) = \frac{1}{2}$, and $P(A \cap B) = \frac{1}{4} = P(A)P(B)$.
Example: Suppose we roll a die once and let A = “the number is even” and B = “number > 3”. The events will be dependent since
$$P(A) = \frac{1}{2}, \quad P(B) = \frac{1}{2} \quad \text{and} \quad P(A \cap B) = P(\text{4 or 6 occurs}) = \frac{2}{6} \neq P(A)P(B)$$
(Rationale: B only happens half the time. If A occurs we know the number is 2, 4, or 6. So B occurs $\frac{2}{3}$ of the time when A occurs. The occurrence of A does affect the chances of B occurring so A and B are not independent.)
When there are more than two events, the above definition generalizes to⁸:
Definition 8 The events $A_1, A_2, \ldots, A_n$ are mutually independent if and only if
$$P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_k}) = P(A_{i_1})P(A_{i_2}) \cdots P(A_{i_k})$$
for all sets $(i_1, i_2, \ldots, i_k)$ of distinct subscripts chosen from $(1, 2, \ldots, n)$.
For example, for n = 3, we need
$$P(A_1 \cap A_2) = P(A_1)P(A_2)$$
$$P(A_1 \cap A_3) = P(A_1)P(A_3)$$
$$P(A_2 \cap A_3) = P(A_2)P(A_3)$$
and
$$P(A_1 \cap A_2 \cap A_3) = P(A_1)P(A_2)P(A_3)$$
We will shorten “mutually independent” to “independent” to reduce confusion with “mutually exclusive.”
⁸ We need all subsets so that events are independent of combinations of other events. For example if $A_1$ is independent of $A_2$ and $A_4$ is to be independent of $A_1 A_2$ then $P(A_1 A_2 A_4) = P(A_1 A_2)P(A_4) = P(A_1)P(A_2)P(A_4)$.
Remark: The definition of independence works two ways. If we can find $P(A)$, $P(B)$, and $P(A \cap B)$ then we can determine whether A and B are independent. Conversely, if we know (or assume) that A and B are independent, then we can use the definition as a rule of probability to calculate $P(A \cap B)$. Examples of each follow.
Example: Toss a die twice. Let A be the event that the first toss is a 3 and B the event that the total is 7. Are A and B independent? (What do you think?) We use the definition to check. Now $P(A) = \frac{1}{6}$, $P(B) = \frac{6}{36}$ since $B = \{(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)\}$ and $P(A \cap B) = \frac{1}{36}$ since $A \cap B = \{(3,4)\}$. Therefore,
$$P(A \cap B) = \frac{1}{36} = P(A)P(B) = \left(\frac{1}{6}\right)\left(\frac{6}{36}\right)$$
and so A and B are independent events.
Now suppose we define C to be the event that the total is 8. This is a minor change from the
definition of B. Then
$$P(A) = \frac{1}{6}, \quad P(C) = \frac{5}{36} \quad \text{and} \quad P(A \cap C) = \frac{1}{36}$$
Therefore $P(A \cap C) \neq P(A)P(C)$ and A and C are dependent events.
This example often puzzles students. Why are the events independent when B is a total of 7 but dependent when C is a total of 8? The key is that regardless of the first toss, there is always exactly one number on the 2nd toss which makes the total 7. Since the probability of getting a total of 7 started off being $\frac{6}{36} = \frac{1}{6}$, the outcome of the 1st toss doesn’t affect the chances. However, for any total other than 7, the outcome of the 1st toss does affect the chances of getting that total (e.g., a first toss of 1 guarantees the total cannot be 8)⁹.
⁹ This argument is in terms of “conditional probability”, which is closely related to independence and is treated in the next section.
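The two-dice comparison above can be checked by listing all 36 outcomes and applying the definition of independence directly. The following is a minimal Python sketch, not part of the original notes; the helper names `prob` and `are_independent` are assumptions chosen for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (first toss, second toss) of two fair dice.
OUTCOMES = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event given as a predicate on one outcome."""
    favourable = sum(1 for outcome in OUTCOMES if event(outcome))
    return Fraction(favourable, len(OUTCOMES))

def are_independent(event_a, event_b):
    """Check the definition P(A and B) = P(A) P(B)."""
    p_both = prob(lambda o: event_a(o) and event_b(o))
    return p_both == prob(event_a) * prob(event_b)

first_is_3 = lambda o: o[0] == 3
total_is_7 = lambda o: sum(o) == 7
total_is_8 = lambda o: sum(o) == 8

print(are_independent(first_is_3, total_is_7))  # True
print(are_independent(first_is_3, total_is_8))  # False
```

Replacing `total_is_8` with any other total shows that 7 is the only total for which the check succeeds.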
Example: Show that if A and B are independent events, then $\bar{A}$ and B are independent events.
Solution: Since $A \cap B$ and $\bar{A} \cap B$ are mutually exclusive events and $B = (A \cap B) \cup (\bar{A} \cap B)$,
$$P(B) = P(A \cap B) + P(\bar{A} \cap B)$$
Therefore
$$\begin{aligned}
P(\bar{A} \cap B) &= P(B) - P(A \cap B) \\
&= P(B) - P(A)P(B) \quad \text{since A and B are independent events} \\
&= [1 - P(A)]P(B) \\
&= P(\bar{A})P(B)
\end{aligned}$$
Example: A pseudo random number generator on a computer can give a sequence of independent random digits chosen from $S = \{0, 1, \ldots, 9\}$. This means that (i) each digit has probability $\frac{1}{10}$ of being any of $0, 1, \ldots, 9$, and (ii) events determined by the different trials are independent of one another. We call this an “experiment with independent trials”. Determine the probability that
(a) in a sequence of 5 trials, all the digits generated are odd
(b) the number 9 occurs for the first time on trial 10.
Solution:
(a) Define the events $A_i$: digit from trial i is odd, $i = 1, 2, \ldots, 5$. Then
$$P(\text{all digits are odd}) = P(A_1 \cap A_2 \cap A_3 \cap A_4 \cap A_5) = \prod_{i=1}^{5} P(A_i)$$
since the $A_i$’s are mutually independent. Since $P(A_i) = \frac{1}{2}$, we get $P(\text{all digits are odd}) = \frac{1}{2^5}$.
(b) Define events $A_i$: 9 occurs on trial i, for $i = 1, 2, \ldots$. Then we want
$$P(\bar{A}_1 \cap \bar{A}_2 \cap \cdots \cap \bar{A}_9 \cap A_{10}) = P(\bar{A}_1)P(\bar{A}_2) \cdots P(\bar{A}_9)P(A_{10}) = (0.9)^9 (0.1)$$
because the $A_i$’s are independent, and $P(A_i) = 1 - P(\bar{A}_i) = 0.1$.
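Both answers follow from multiplying probabilities of independent events, and part (a) is small enough to confirm by brute force. The sketch below is illustrative only (the variable names are assumptions, not from the notes); it computes both answers exactly and cross-checks (a) by enumerating all $10^5$ equally likely digit sequences.

```python
from fractions import Fraction
from itertools import product

DIGITS = range(10)

# (a) Multiplication rule for independent events: P(odd) ** 5.
p_odd = Fraction(5, 10)
p_all_odd = p_odd ** 5

# Cross-check (a) by counting sequences of 5 digits that are all odd.
count = sum(1 for seq in product(DIGITS, repeat=5)
            if all(d % 2 == 1 for d in seq))
p_all_odd_enum = Fraction(count, 10 ** 5)

# (b) No 9 on trials 1 to 9, then a 9 on trial 10.
p_nine = Fraction(1, 10)
p_first_nine_on_10 = (1 - p_nine) ** 9 * p_nine

print(p_all_odd, p_all_odd_enum)   # 1/32 1/32
print(float(p_first_nine_on_10))   # approximately 0.0387
```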
Note: We implicitly assumed independence of events in some of our earlier probability calculations. For example, suppose a coin is tossed 3 times, and we consider the sample space
$$S = \{HHH, HHT, HTH, THH, HTT, THT, TTH, TTT\}$$
Assuming that the outcomes on the three tosses are independent, and $P(H) = P(T) = \frac{1}{2}$ on each toss, we obtain
$$P(HHH) = P(H)P(H)P(H) = \left(\frac{1}{2}\right)^3 = \frac{1}{8}$$
Similarly, all the other simple events have probability $\frac{1}{8}$. In earlier calculations we implicitly assumed this was true by assigning the same probability $\frac{1}{8}$ to all possible outcomes without thinking directly about independence. However, it is clear that if somehow the 3 tosses were not independent then it might be a bad idea to assume each outcome had probability $\frac{1}{8}$. (For example, instead of heads and tails, suppose H stands for “rain” and T stands for “no rain” on a given day; now consider 3 consecutive days. Would you want to assign a probability of $\frac{1}{8}$ to each of the 8 simple events even if this were in a season when the probability of rain on a day was $\frac{1}{2}$?)
Note: The definition of independent events can be used either to check for independence or, if events
are known to be independent, to calculate P (A \ B). Many problems are not obvious, and scientific
study is needed to determine if two events are independent. For example, are the events A and B
independent if, for a random child living in a country, the events are defined as A: the child lives within
5 kilometers of a nuclear power plant and B: the child has leukemia? Determining whether such events
are dependent and if so the extent of the dependence are problems of substantial importance, and can
be handled by methods discussed in later statistics courses.
Problems
4.3.1 A weighted die is such that $P(1) = P(2) = P(3) = 0.1$, $P(4) = P(5) = 0.2$, and $P(6) = 0.3$. Assume that events determined by different throws of the die are independent.
(a) If the die is thrown twice what is the probability the total is 9?
(b) If a die is thrown twice, and this process repeated 4 times, what is the probability the total will be 9 on exactly 1 of the 4 repetitions?
4.3.2 Suppose among UWaterloo students that 15% speak French and 45% are women. Suppose also that 20% of the women speak French. A committee of 10 students is formed by randomly selecting from UWaterloo students. What is the probability there will be at least 1 woman and at least 1 French speaking student on the committee¹⁰?
4.3.3 Prove that A and B are independent events if and only if $\bar{A}$ and $\bar{B}$ are independent.
¹⁰ Although the sampling is conducted without replacement, because the population is very large, whether we replace or not will make little difference. Therefore assume in your calculations that sampling is with replacement so the 10 draws are independent.
4.4 Conditional Probability

In many situations we may want to determine the probability of some event A, while knowing that some other event B has already occurred. For example, what is the probability a randomly selected person is over 6 feet tall, given that she is female? Let the symbol $P(A|B)$ represent the probability that event A occurs, when we know that B occurs. We call this the conditional probability of A given B. While we will give a definition of $P(A|B)$, let’s first consider an example we looked at earlier, to get some sense of why $P(A|B)$ is defined as it is.
Example: Suppose we roll a die once so that the sample space is $S = \{1, 2, 3, 4, 5, 6\}$. Let A be the event that the number is even and B the event that the number is greater than 3. If we know that B occurs, this tells us that we have a 4, 5, or 6. Of the times when B occurs, we have an even number $\frac{2}{3}$ of the time. So $P(A|B) = \frac{2}{3}$. More formally, we could obtain this result by calculating $\frac{P(A \cap B)}{P(B)}$, since $P(A \cap B) = P(\text{4 or 6}) = \frac{2}{6}$ and $P(B) = \frac{3}{6}$.
Definition 9 The conditional probability of event A, given event B, is
$$P(A|B) = \frac{P(A \cap B)}{P(B)} \quad \text{provided } P(B) > 0$$
Note: If A and B are independent then
$$P(A \cap B) = P(A)P(B)$$
so
$$P(A|B) = \frac{P(A)P(B)}{P(B)} = P(A) \quad \text{provided } P(B) > 0$$
This result leads us to the following theorem:
Theorem 10 Suppose A and B are two events defined on a sample space S such that $P(A) > 0$ and $P(B) > 0$. Then A and B are independent events if and only if either of the following statements is true
$$P(A|B) = P(A) \quad \text{or} \quad P(B|A) = P(B)$$
Note: We could have taken the definition of independent events to be: A and B are independent events if $P(A|B) = P(A)$. In some sense, this definition is more intuitive than the original definition. However this definition does not hold in the case that $P(B) = 0$ whereas the original definition does hold.
Example: If a fair coin is tossed three times, find the probability that if at least one Head occurs, then exactly one Head occurs.
Solution: The sample space is $S = \{HHH, HHT, HTH, \ldots, TTT\}$. Define the events A = “one Head” and B = “at least one Head”. We need to find
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
Now
$$P(B) = 1 - P(\bar{B}) = 1 - P(\text{no heads}) = \frac{7}{8}$$
and
$$\begin{aligned}
P(A \cap B) &= P(\text{we obtain one head AND we obtain at least one head}) \\
&= P(\text{we obtain one head}) \\
&= P(\{HTT, THT, TTH\}) \\
&= \frac{3}{8}
\end{aligned}$$
using either the sample space with equally probable points, or the fact that the 3 tosses are independent. Thus,
$$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{3/8}{7/8} = \frac{3}{7}$$
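The same answer can be obtained mechanically from Definition 9 by listing the eight equally likely outcomes. This is a small sketch under that equally-likely assumption; the function names `prob` and `cond_prob` are hypothetical helpers, not notation from the notes.

```python
from fractions import Fraction
from itertools import product

# The 8 equally likely outcomes of three tosses, e.g. "HHT".
SAMPLE_SPACE = ["".join(tosses) for tosses in product("HT", repeat=3)]

def prob(event):
    """Exact probability of an event given as a predicate on an outcome."""
    favourable = sum(1 for s in SAMPLE_SPACE if event(s))
    return Fraction(favourable, len(SAMPLE_SPACE))

def cond_prob(event_a, event_b):
    """P(A | B) = P(A and B) / P(B), defined only when P(B) > 0."""
    p_b = prob(event_b)
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return prob(lambda s: event_a(s) and event_b(s)) / p_b

exactly_one_head = lambda s: s.count("H") == 1
at_least_one_head = lambda s: s.count("H") >= 1

print(cond_prob(exactly_one_head, at_least_one_head))  # 3/7
```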
Example: The probability a randomly selected male is colour-blind is 0.05, whereas the probability a female is colour-blind is only 0.0025. If the population is 50% male, what is the fraction that is colour-blind?
Solution: Let C be the event that the person selected is colour-blind, M the event that the person selected is male and $F = \bar{M}$ the event that the person selected is female. We are asked to find $P(C)$. We are given that $P(C|M) = 0.05$, $P(C|F) = 0.0025$, and $P(M) = 0.5 = P(F)$. From the definition of conditional probability
$$P(C|M)P(M) = \frac{P(C \cap M)}{P(M)} P(M) = P(C \cap M)$$
and similarly $P(C|F)P(F) = P(C \cap F)$. To obtain $P(C)$ we can therefore use the fact that $C = (C \cap M) \cup (C \cap \bar{M})$ and the events $C \cap M$ and $C \cap \bar{M}$ are mutually exclusive so
$$\begin{aligned}
P(C) &= P(C \cap M) + P(C \cap F) \\
&= P(C|M)P(M) + P(C|F)P(F) \\
&= (0.05)(0.5) + (0.0025)(0.5) \\
&= 0.02625
\end{aligned}$$
4.5 Product Rules, Law of Total Probability and Bayes’ Theorem

The preceding example suggests two more useful probability rules. They are based on the idea of
breaking down the event of interest into mutually exclusive pieces.
Rule 7: Product Rules Let $A, B, C, D, \ldots$ be arbitrary events in a sample space. Assume that $P(A) > 0$, $P(A \cap B) > 0$, and $P(A \cap B \cap C) > 0$. Then
$$P(AB) = P(A)P(B|A)$$
$$P(ABC) = P(A)P(B|A)P(C|AB)$$
$$P(ABCD) = P(A)P(B|A)P(C|AB)P(D|ABC)$$
and so on.
Proof: The first rule comes directly from the definition of $P(B|A)$ since
$$P(A)P(B|A) = P(A)\frac{P(A \cap B)}{P(A)} = P(A \cap B)$$
assuming $P(A) > 0$. The right hand side of the second rule equals (assuming $P(AB) > 0$ and $P(A) > 0$)
$$\begin{aligned}
P(A)P(B|A)P(C|AB) &= P(A)\frac{P(AB)}{P(A)}P(C|AB) \\
&= P(AB)P(C|AB) \\
&= P(AB)\frac{P(CAB)}{P(AB)} \\
&= P(ABC)
\end{aligned}$$
and so on.
In order to remember these rules you can imagine that the events unfold in some chronological
order, even if they do not. For example,
$$P(ABCD) = P(A)P(B|A)P(C|AB)P(D|ABC)$$
could be interpreted as the probability that “A occurs” (first) and then “given A occurs, that B occurs”
(next), etc.
Rule 8: Law of Total Probability Let $A_1, A_2, \ldots, A_k$ be a partition of the sample space S into disjoint (mutually exclusive) events, that is
$$A_1 \cup A_2 \cup \cdots \cup A_k = S \quad \text{and} \quad A_i \cap A_j = \emptyset \text{ if } i \neq j$$
Let B be an arbitrary event in S. Then
$$P(B) = P(BA_1) + P(BA_2) + \cdots + P(BA_k) = \sum_{i=1}^{k} P(B|A_i)P(A_i)$$
Proof: Note that the events $BA_1, BA_2, \ldots, BA_k$ are all mutually exclusive and their union is B, that is $B = (BA_1) \cup \cdots \cup (BA_k)$. Therefore by Rule 5b
$$P(B) = P(BA_1) + P(BA_2) + \cdots + P(BA_k).$$
By the product rule, $P(BA_i) = P(B|A_i)P(A_i)$ so this becomes
$$P(B) = P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + \cdots + P(B|A_k)P(A_k)$$
Example: In an insurance portfolio 10% of the policy holders are in Class $A_1$ (high risk), 40% are in Class $A_2$ (medium risk), and 50% are in Class $A_3$ (low risk). The probability there is a claim on a Class $A_1$ policy in a given year is 0.10; similar probabilities for Classes $A_2$ and $A_3$ are 0.05 and 0.02. Find the probability that if a claim is made, it is made on a Class $A_1$ policy.
Solution: For a randomly selected policy, let B = “policy has a claim” and $A_i$ = “policy is of Class $A_i$”, $i = 1, 2, 3$. We are asked to find $P(A_1|B)$. Note that
$$P(A_1|B) = \frac{P(A_1 \cap B)}{P(B)}$$
and that
$$P(B) = P(A_1 \cap B) + P(A_2 \cap B) + P(A_3 \cap B)$$
We are given that
$$P(A_1) = 0.10, \quad P(A_2) = 0.40, \quad P(A_3) = 0.50$$
and
$$P(B|A_1) = 0.10, \quad P(B|A_2) = 0.05, \quad P(B|A_3) = 0.02$$
Therefore
$$P(A_1 \cap B) = P(A_1)P(B|A_1) = 0.01$$
$$P(A_2 \cap B) = P(A_2)P(B|A_2) = 0.02$$
$$P(A_3 \cap B) = P(A_3)P(B|A_3) = 0.01$$
This gives $P(B) = 0.01 + 0.02 + 0.01 = 0.04$ and
$$P(A_1|B) = \frac{P(A_1 \cap B)}{P(B)} = \frac{0.01}{0.04} = 0.25$$
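The calculation in this example combines the Law of Total Probability with the definition of conditional probability, and the same two steps work for any partition. A minimal sketch follows; the function names `total_probability` and `posteriors` are assumptions made for illustration, and ordinary floating point is used, so results are rounded for display.

```python
def total_probability(priors, likelihoods):
    """P(B) = sum of P(B | A_i) P(A_i) over a partition A_1, ..., A_k."""
    return sum(p * l for p, l in zip(priors, likelihoods))

def posteriors(priors, likelihoods):
    """P(A_i | B) = P(B | A_i) P(A_i) / P(B) for each class i."""
    p_b = total_probability(priors, likelihoods)
    return [p * l / p_b for p, l in zip(priors, likelihoods)]

priors = [0.10, 0.40, 0.50]        # P(A_1), P(A_2), P(A_3)
claim_rates = [0.10, 0.05, 0.02]   # P(B | A_1), P(B | A_2), P(B | A_3)

p_claim = total_probability(priors, claim_rates)
post = posteriors(priors, claim_rates)

print(round(p_claim, 4))             # 0.04
print([round(p, 4) for p in post])   # [0.25, 0.5, 0.25]
```

The three posterior probabilities sum to 1, which is a quick sanity check on any calculation of this kind.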
Tree Diagrams
Figure 4.9: Tree diagram for insurance example
Tree diagrams can be a useful device for keeping track of conditional probabilities when using
multiplication and partition rules. The idea is to draw a tree where each path represents a sequence of
events. On any given branch of the tree we write the conditional probability of that event given all the
events on branches leading to it. The probability at any node of the tree is obtained by multiplying the
probabilities on the branches leading to the node, and equals the probability of the intersection of the
events leading to it. For example, the immediately preceding example could be represented by the tree
in Figure 4.9. Note that the probabilities on the terminal nodes must add up to 1.
Here is another example involving diagnostic tests for disease. See if you can represent the problem
using a tree diagram.
Example: Testing for HIV
Tests used to diagnose medical conditions are often imperfect, and give false positive or false negative results, as described in Problem 2.6 of Chapter 2. A fairly cheap blood test for the Human Immunodeficiency Virus (HIV) that causes AIDS (Acquired Immune Deficiency Syndrome) has the following characteristics: the false negative rate is 2% and the false positive rate is 0.5%. It is assumed that around 0.04% of Canadian males are infected with HIV. Find the probability that if a male tests positive for HIV, he actually has HIV.
Solution: Suppose a male is randomly selected from the population. Define the events A = “selected male has HIV” and B = “blood test is positive”. We are asked to find $P(A|B)$.
From the information given we know that
$$P(A) = 0.0004, \quad P(\bar{A}) = 0.9996$$
$$P(B|A) = 0.98, \quad P(B|\bar{A}) = 0.005$$
Therefore we can find
$$P(AB) = P(A)P(B|A) = 0.000392$$
$$P(\bar{A}B) = P(\bar{A})P(B|\bar{A}) = 0.004998.$$
Thus
$$P(B) = P(AB) + P(\bar{A}B) = 0.00539$$
and
$$P(A|B) = \frac{P(AB)}{P(B)} = 0.0727$$
Thus, if a randomly selected male tests positive, there is still only a small probability (0.0727) that they actually have HIV!
Exercise: Try to explain in ordinary words why this is the case.

Bayes’ Theorem: Suppose A and B are events defined on a sample space S. Suppose also that $P(B) > 0$. Then
$$P(A|B) = \frac{P(B|A)P(A)}{P(B)} = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\bar{A})P(\bar{A})}$$
Proof:
$$\begin{aligned}
\frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\bar{A})P(\bar{A})} &= \frac{P(AB)}{P(AB) + P(\bar{A}B)} \quad \text{by the Product Rule} \\
&= \frac{P(AB)}{P(B)} \quad \text{by the Law of Total Probability} \\
&= P(A|B)
\end{aligned}$$
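The two-event form of Bayes’ Theorem translates directly into a short function. The sketch below is illustrative only: the name `bayes` and its parameter names are assumptions, and the inputs are the figures from the HIV testing example (prevalence 0.0004, false negative rate 2%, false positive rate 0.5%).

```python
def bayes(prior, p_b_given_a, p_b_given_not_a):
    """Return P(A | B) from P(A), P(B | A) and P(B | not A)."""
    numerator = p_b_given_a * prior
    p_b = numerator + p_b_given_not_a * (1 - prior)  # Law of Total Probability
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return numerator / p_b

# A = "selected male has HIV", B = "blood test is positive".
p_hiv_given_positive = bayes(prior=0.0004,
                             p_b_given_a=0.98,        # 1 - false negative rate
                             p_b_given_not_a=0.005)   # false positive rate

print(round(p_hiv_given_positive, 4))  # 0.0727
```

Calling `bayes` with a larger `prior` while holding the two test characteristics fixed shows how strongly the answer depends on how common the condition is.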
Remark: Bayes’ Theorem allows us to write conditional probabilities in terms of similar conditional
probabilities but with the order of conditioning reversed. It is a simple theorem, but it has inspired
approaches to problems in statistics and other areas such as machine learning, classification and pattern
recognition. In these areas the term “Bayesian methods” is often used. The result is named after a mathematician¹¹ who proved it in the 1700s.
Problems
4.5.1 If you take a bus to work in the morning there is a 20% chance you’ll arrive late. When you go
by bicycle there is a 10% chance you’ll be late. 70% of the time you go by bike, and 30% by
bus. Given that you arrive late, what is the probability you took the bus?
4.5.2 A box contains 4 coins – 3 fair coins and 1 biased coin for which $P(\text{heads}) = 0.8$. A coin is picked at random and tossed 6 times. It shows 5 heads. Find the probability this coin is fair.
4.5.3 At a police spot check, 10% of cars stopped have defective headlights and a faulty muffler. 15%
have defective headlights and a muffler which is satisfactory. If a car which is stopped has
defective headlights, what is the probability that the muffler is also faulty?
¹¹ (Rev) Thomas Bayes (1702-1761) was an English Nonconformist minister, turned Presbyterian. He may have been tutored by De Moivre. His famous paper introducing this rule was published after his death. “Bayesians” are statisticians who opt for a purely probabilistic view of inference. All unknowns obtain from some distribution and ultimately, the distribution says it all.
4.6 Useful Series and Sums

In remaining chapters the following series and sums will be useful. You will have seen some of these
results in previous courses.
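1. Geometric Series:
$$\sum_{i=0}^{n-1} t^i = 1 + t + t^2 + \cdots + t^{n-1} = \frac{1 - t^n}{1 - t} \quad \text{for } t \neq 1$$
If $|t| < 1$, then
$$\sum_{x=0}^{\infty} t^x = 1 + t + t^2 + \cdots = \frac{1}{1 - t}$$
Note that other identities can be obtained from this one by differentiation. For example
$$\frac{d}{dt} \sum_{x=0}^{\infty} t^x = \frac{d}{dt}\left(\frac{1}{1 - t}\right)$$
or
$$\sum_{x=1}^{\infty} x t^{x-1} = \frac{1}{(1 - t)^2} \quad \text{for } |t| < 1$$
You should be able to determine other identities by taking second and higher derivatives.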
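These identities are easy to check numerically for a particular value of t. The following sketch is not part of the notes; it assumes $t = 0.5$ and truncates the infinite sums at 200 terms, an arbitrary cut-off that is more than enough for this value of t.

```python
import math

def geometric_partial_sum(t, n):
    """Return 1 + t + t^2 + ... + t^(n-1) by direct summation."""
    return sum(t ** i for i in range(n))

def geometric_closed_form(t, n):
    """Closed form (1 - t^n) / (1 - t), valid for t != 1."""
    return (1 - t ** n) / (1 - t)

t, n = 0.5, 10
print(math.isclose(geometric_partial_sum(t, n), geometric_closed_form(t, n)))  # True

# For |t| < 1 the truncated infinite sums approach 1/(1-t) and 1/(1-t)^2.
TERMS = 200  # truncation point, an arbitrary choice for this sketch
series = sum(t ** x for x in range(TERMS))
derivative_series = sum(x * t ** (x - 1) for x in range(1, TERMS))
print(math.isclose(series, 1 / (1 - t)))                  # True
print(math.isclose(derivative_series, 1 / (1 - t) ** 2))  # True
```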
2. Binomial Theorem: There are various forms of this theorem. We will use the form
$$(1 + t)^n = 1 + \binom{n}{1} t + \binom{n}{2} t^2 + \cdots + \binom{n}{n} t^n = \sum_{x=0}^{n} \binom{n}{x} t^x$$
where n is a positive integer and t is any real number.
Justification: One way of verifying this formula uses the counting arguments of this chapter. Imagine a product of the individual terms:
$$(1 + t)(1 + t)(1 + t) \cdots (1 + t)$$
To evaluate this product we must add together all of the possibilities obtained by taking one of the two possible terms from the first bracketed expression, that is, one of $\{1, t\}$, multiplying by one of $\{1, t\}$ taken from the second bracketed expression, etc. In how many ways do we obtain the term $t^x$ where $x = 0, 1, 2, \ldots, n$? We might choose t from each of the first x terms above and then 1 from the remaining $(n - x)$ terms, or indeed we could choose t from any x of the n terms in $\binom{n}{x}$ ways and then 1 from the remaining $(n - x)$ terms.
3. Binomial Theorem: There is a more general version of the Binomial Theorem that results in an infinite series and that holds when n is not a positive integer:
$$(1 + t)^n = \sum_{x=0}^{\infty} \binom{n}{x} t^x \quad \text{if } |t| < 1$$
Proof: Recall from Calculus the Maclaurin series, which says that a sufficiently smooth function $f(t)$ can be written as an infinite series using an expansion around $t = 0$,
$$f(t) = f(0) + \frac{f'(0)}{1!} t + \frac{f''(0)}{2!} t^2 + \cdots$$
provided that this series is convergent. If $f(t) = (1 + t)^n$, then $f(0) = 1$ and $f^{(k)}(0) = n^{(k)}$ for $k = 1, 2, \ldots$ and we obtain
$$(1 + t)^n = 1 + \frac{n}{1} t + \frac{n(n - 1)}{2!} t^2 + \cdots + \frac{n^{(k)}}{k!} t^k + \cdots = \sum_{x=0}^{\infty} \binom{n}{x} t^x$$
It is not difficult to show that this converges whenever $|t| < 1$ using the Ratio Test.
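The generalized coefficient $\binom{n}{x} = n^{(x)}/x!$ makes sense for any real n, and the series can be summed term by term to see the convergence. This is a sketch under stated assumptions: the names `gen_binom` and `binomial_series` are hypothetical, the values $n = -2.5$ and $t = 0.3$ are arbitrary, and the series is truncated at 200 terms.

```python
import math

def gen_binom(n, x):
    """Generalized binomial coefficient n^(x) / x! for real n, integer x >= 0."""
    coeff = 1.0
    for j in range(x):
        coeff *= (n - j) / (j + 1)  # builds n(n-1)...(n-x+1) / x! factor by factor
    return coeff

def binomial_series(n, t, terms=200):
    """Truncated sum of C(n, x) t^x, intended for |t| < 1."""
    total, term = 0.0, 1.0  # term starts at C(n, 0) t^0 = 1
    for x in range(terms):
        total += term
        term *= (n - x) / (x + 1) * t  # C(n, x+1) t^(x+1) from C(n, x) t^x
    return total

n, t = -2.5, 0.3
print(math.isclose(binomial_series(n, t), (1 + t) ** n))  # True
print(gen_binom(5, 2))   # 10.0, agrees with the ordinary coefficient
print(gen_binom(4, 5))   # 0.0, the coefficient vanishes once x exceeds an integer n
```

Updating each term from the previous one avoids forming large factorials, which would overflow floating point long before 200 terms.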
4. Multinomial Theorem: A generalization of the Binomial Theorem is
$$(t_1 + t_2 + \cdots + t_k)^n = \sum \frac{n!}{x_1! x_2! \cdots x_k!} t_1^{x_1} t_2^{x_2} \cdots t_k^{x_k}$$
where the summation is over all non-negative integers $x_1, x_2, \ldots, x_k$ such that $\sum_{i=1}^{k} x_i = n$ where n is a positive integer.
Justification: Again we could verify this formula using a counting argument. Consider the product:
$$(t_1 + t_2 + \cdots + t_k)(t_1 + t_2 + \cdots + t_k) \cdots (t_1 + t_2 + \cdots + t_k)$$
To evaluate this product we must add together all of the possibilities obtained by taking one of the terms from the first bracketed expression, that is, one of $\{t_1, t_2, \ldots, t_k\}$, multiplying by one of $\{t_1, t_2, \ldots, t_k\}$ taken from the second bracketed expression, etc. In how many ways do we obtain the term $t_1^{x_1} t_2^{x_2} \cdots t_k^{x_k}$ where $\sum_{i=1}^{k} x_i = n$? We can choose $t_1$ a total of $x_1$ times from any of the n terms in $\binom{n}{x_1}$ ways, and then $t_2$ from any of the remaining $n - x_1$ terms in $\binom{n - x_1}{x_2}$ ways, and so on so there are
$$\binom{n}{x_1} \binom{n - x_1}{x_2} \binom{n - x_1 - x_2}{x_3} \cdots \binom{x_k}{x_k} = \frac{n!}{x_1! x_2! \cdots x_k!}$$
ways of obtaining this term in the product. The case k = 2 gives the Binomial Theorem in the form
$$(t_1 + t_2)^n = \sum_{x_1=0}^{n} \binom{n}{x_1} t_1^{x_1} t_2^{n - x_1}$$
5. Hypergeometric Identity:
$$\sum_{x=0}^{\infty} \binom{a}{x} \binom{b}{n - x} = \binom{a + b}{n}$$
There will not be an infinite number of terms if a and b are positive integers since the terms become 0 eventually. For example
$$\binom{4}{5} = \frac{4^{(5)}}{5!} = \frac{(4)(3)(2)(1)(0)}{5!} = 0$$
Proof: We prove this in the case that a and b are non-negative integers. Obviously
$$(1 + y)^{a+b} = (1 + y)^a (1 + y)^b.$$
If we expand each term using the Binomial Theorem we obtain
$$\sum_{k=0}^{a+b} \binom{a + b}{k} y^k = \sum_{i=0}^{a} \binom{a}{i} y^i \sum_{j=0}^{b} \binom{b}{j} y^j$$
Note that the coefficient of $y^k$ on the right side is $\sum_{i=0}^{a} \binom{a}{i} \binom{b}{k - i}$ and so this must equal $\binom{a + b}{k}$, the coefficient of $y^k$ on the left side.
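For non-negative integers the identity can be verified directly with the standard library function `math.comb`, which returns 0 when the lower index exceeds the upper one. In this sketch the name `hypergeometric_sum` is an assumption for illustration; the sum stops at $x = n$ because larger x would make $n - x$ negative, where the term is taken to be 0.

```python
from math import comb

def hypergeometric_sum(a, b, n):
    """Sum of C(a, x) C(b, n - x) over x = 0, ..., n for non-negative integers."""
    # comb(a, x) is 0 for x > a and comb(b, n - x) is 0 for n - x > b,
    # so only finitely many terms are non-zero.
    return sum(comb(a, x) * comb(b, n - x) for x in range(n + 1))

# Compare both sides of the identity over a small grid of values.
all_match = all(
    hypergeometric_sum(a, b, n) == comb(a + b, n)
    for a in range(7)
    for b in range(7)
    for n in range(a + b + 2)
)
print(all_match)                                  # True
print(hypergeometric_sum(4, 6, 3), comb(10, 3))   # 120 120
```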
6. Exponential Series: This is another example of a Maclaurin series expansion, if we let $f(x) = e^x$, then $f^{(k)}(0) = 1$ for $k = 1, 2, \ldots$ and so
$$e^t = \frac{t^0}{0!} + \frac{t^1}{1!} + \frac{t^2}{2!} + \frac{t^3}{3!} + \cdots = \sum_{n=0}^{\infty} \frac{t^n}{n!} \quad \text{for all } t \in \Re$$
We will also use the limit definition of the exponential function:
$$e^t = \lim_{n \to \infty} \left(1 + \frac{t}{n}\right)^n \quad \text{for all } t \in \Re$$
7. Special series involving integers:
$$1 + 2 + 3 + \cdots + n = \frac{n(n + 1)}{2}$$
$$1^2 + 2^2 + 3^2 + \cdots + n^2 = \frac{n(n + 1)(2n + 1)}{6}$$
$$1^3 + 2^3 + 3^3 + \cdots + n^3 = \left[\frac{n(n + 1)}{2}\right]^2$$
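Example: Find
$$\sum_{x=0}^{\infty} x(x - 1) \binom{a}{x} \binom{b}{n - x}$$
Solution: For x = 0 or 1 the term becomes 0, so we can start summing at x = 2. For $x \geq 2$, we can expand $x!$ as $x(x - 1)(x - 2)!$
$$\sum_{x=0}^{\infty} x(x - 1) \binom{a}{x} \binom{b}{n - x} = \sum_{x=2}^{\infty} x(x - 1) \frac{a!}{x(x - 1)(x - 2)!(a - x)!} \binom{b}{n - x}$$
Cancel the $x(x - 1)$ terms and try to re-group the factorial terms as “something choose something”.
$$\frac{a!}{(x - 2)!(a - x)!} = \frac{a(a - 1)(a - 2)!}{(x - 2)![(a - 2) - (x - 2)]!} = a(a - 1) \binom{a - 2}{x - 2}$$
Then
$$\sum_{x=0}^{\infty} x(x - 1) \binom{a}{x} \binom{b}{n - x} = \sum_{x=2}^{\infty} a(a - 1) \binom{a - 2}{x - 2} \binom{b}{n - x}$$
Factor out $a(a - 1)$ and let $y = x - 2$ to get
$$a(a - 1) \sum_{y=0}^{\infty} \binom{a - 2}{y} \binom{b}{n - (y + 2)} = a(a - 1) \binom{a + b - 2}{n - 2}$$
by the Hypergeometric Identity.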
Problems
4.6.1 Use the Binomial Theorem to show that
$$\sum_{x=0}^{n} x \binom{n}{x} p^x (1 - p)^{n - x} = np$$
4.6.2 Show that
$$\sum_{x=2}^{\infty} x(x - 1) t^{x - 2} = \frac{2}{(1 - t)^3} \quad \text{for } |t| < 1$$
4.6.3 For k a non-negative real number and $0 < p < 1$ show that
$$\sum_{x=0}^{\infty} \binom{-k}{x} p^k (p - 1)^x = 1$$

1. Suppose A and B are mutually exclusive events with $P(A) = 0.25$ and $P(B) = 0.4$. Determine the probabilities of the following events:
A; B; A [ B; A \ B; A [ B; A \ B; A\B
2. Three digits are chosen at random with replacement from $0, 1, \ldots, 9$. Define the following events:
A: “all three digits are the same”
B: “all three digits are different”
C: “the digits are all nonzero”
D: “the digits all exceed 4”
E: “digits all have the same parity (all odd or all even)”
Determine probabilities of the events A, B, C, D, E and the events
$$B \cap E, \quad B \cup D, \quad B \cup D \cup E, \quad (A \cup B) \cap D, \quad A \cup (B \cap D)$$
Show the events $(A \cup B) \cap D$ and $A \cup (B \cap D)$ in a Venn diagram.
3. Let A and B be events defined on the same sample space, with $P(A) = 0.3$, $P(B) = 0.4$, and $P(A|B) = 0.5$. Given that event B does not occur, what is the probability of event A?
4. A die is loaded to give the probabilities:
number        1     2     3     4     5     6
probability   0.3   0.1   0.15  0.15  0.15  0.15
The die is rolled 8 times. Rolls of the die are assumed to be independent. Find the probability that
(a) the number 1 does not occur in the 8 rolls
(b) the number 2 does not occur in the 8 rolls
(c) the number 1 and the number 2 do not occur in the 8 rolls
(d) the numbers 1 and 2 both occur at least once in the 8 rolls.
5. Events A and B are independent with $P(A) = 0.3$ and $P(B) = 0.2$. Find $P(A \cup B)$.
6. Let E and F be independent events with $E = A \cup B$ and $F = A \cap B$. Prove that either $P(A \cap B) = 0$ or $P(\bar{A} \cap \bar{B}) = 0$.
7. A population consists of F females and M males; the population includes f female smokers and
m male smokers. An individual is chosen at random from the population. If A is the event that
this individual is female and B is the event he or she is a smoker, find necessary and sufficient
conditions on f , m, F and M so that A and B are independent events.
8. Suppose A, B, C, and D are events defined on a sample space such that A and C are mutually
exclusive events, A and B are independent events, B and D are not independent events and
A D. Suppose also that P (A) = 0:15, P B = 0:3, P (C) = 0:1, and P DjB = 0:8.
Determine the probabilities of the following events:
A [ B; B \ DjA; B [ D; CjA [ B
9. Consider a system of independent components shown in the figure below. The system functions properly if all components along at least one path from point A to point B are working. The probabilities that the components $C_1, C_2, C_3, C_4$ are working are 0.9, 0.8, 0.7, 0.6 respectively. What is the probability that the system is functioning properly?
[Figure: components C1 and C2 lie along one path from point A to point B; components C3 and C4 lie along a second path from A to B.]
10. Customers at a store independently decide whether to pay by debit card or with cash. Suppose
the probability is 70% that a customer pays by debit card. Find the probability
(a) 3 out of 5 customers pay by debit card

(b) the 5th customer is the first one to pay by cash
(c) the 5th customer is the 3rd one to pay by debit card.
11. Students A, B and C each independently answer a question on a test. The probability of getting the correct answer is 0.9 for A, 0.7 for B and 0.4 for C.
(a) What is the probability that all three students get the correct answer?
(b) Find the probability that exactly two students get the correct answer.
(c) If exactly two students get the correct answer, what is the probability student C got the
wrong answer?
12. Suppose you are playing a game where you flip a coin to determine who plays first. You know
that when you play first, you win the game 60% of the time and that when you play second, you
lose 52% of the time.
(a) What is the probability that you win the game given that you played second?
(b) What is the probability that you win the game?
(c) If you won the game, what is the probability that you played first?
13. In a large population, people are one of three genetic types A, B and C: 30% are type A, 60% type B and 10% type C. The probability a person carries another gene making them susceptible for a disease is 0.05 for A, 0.04 for B and 0.02 for C.
(a) What is the probability a randomly selected person is susceptible for the disease?
(b) If ten unrelated persons are selected at random, what is the probability at least one is sus-
ceptible for the disease?
14. Two baseball teams play a best-of-seven series, in which the series ends as soon as one team wins
four games. The first two games are to be played on A’s field, the next three games on B’s field,
and the last two on A’s field. The probability that A wins a game is 0.7 at home and 0.5 away.
Assume that the results of the games are independent. Find the probability that:
(a) A wins the series in 4 games; in 5 games;

(b) the series does not go to 6 games.
15. An experiment has three possible outcomes A, B and C with respective probabilities p, q and r,
where p + q + r = 1. The experiment is repeated until either outcome A or outcome B occurs.
Show that A occurs before B with probability $p/(p + q)$. A tree diagram is useful for solving
this problem.
16. In the game of craps, a player rolls two dice. The player wins at once if the total is 7 or 11, and
loses at once if the total is 2, 3, or 12. Otherwise, the player continues rolling the dice until they
either win by throwing their initial total again, or lose by rolling 7. Show that the probability the
player wins is 0.493. Hint: Use the result of Problem 15.
17. Slot machines: Standard slot machines have three wheels, each marked with some number
of symbols at equally spaced positions around the wheel. For this problem suppose there are
ten positions on each wheel, with three different types of symbols being used: flower, dog,
and house. The three wheels spin independently and each has probability 0.1 of landing at any
position. Each of the symbols (flower, dog, house) is used in a total of ten positions across the
three wheels. A payout occurs whenever all three symbols showing are the same.
(a) If wheels 1, 2, 3 have 2, 6, 2 flowers, respectively, what is the probability all three positions
show a flower?
(b) In order to minimize the probability of all three positions showing a flower, what number
of flowers should go on wheels 1, 2, and 3? Assume that each wheel must have at least one
flower.
18. The following table of probabilities is based on data from the 2011 Canadian census. The probabilities are for Canadians aged 25–34.
Highest level of education attained          Employed   Unemployed
No certificate, diploma or degree            0.066      0.010
High school diploma or equivalent            0.185      0.016
Postsecondary certificate, diploma or degree 0.683      0.040
(a) What proportion of Canadians aged 25–34 are unemployed?
(b) What proportion of Canadians aged 25–34 have no certificate, diploma or degree?
(c) What proportion of Canadians aged 25–34 have at least a high school diploma or equivalent?
(d) What proportion of Canadians aged 25–34 who are employed have at least a high school diploma or equivalent?
(e) Are the events, “unemployed” and “no certificate, diploma or degree”, independent events? Why?
19. A researcher wishes to estimate the proportion p of university students who have cheated on an
examination. The researcher prepares a box containing 100 cards, 20 of which contain Question
A and 80 Question B.
Question A: Were you born in July or August?
Question B: Have you ever cheated on an examination?
Each student who is interviewed draws a card at random with replacement from the box and
answers the question it contains. Since only the student knows which question he or she is answering, confidentiality is assured and so the researcher hopes that the answers will be truthful¹².
It is known that one-sixth of birthdays fall in July or August.
(a) What is the probability that a student answers “yes”?

(b) If x of n students answer “yes”, estimate p.
(c) What proportion of the students who answer “yes” are responding to Question B?
20. Diagnostic tests: See Chapter 2, Problem 6. For a randomly selected person let D = “person has
the disease” and R = “the test result is positive”. Give estimates of the following probabilities:
P (RjD), P (RjD), P (R).
21. Spam detection 1: Many methods of spam detection are based on words or features that appear
much more frequently in spam than in regular email. Conditional probability methods are then
used to decide whether an email is spam or not. For example, suppose we define the following
events associated with a random email message.
Spam = “Message is spam”
Not Spam = “Message is not spam (“regular”)”
A = “Message contains the word Viagra”
From a study of email messages coming into a certain system it is estimated that $P(\text{Spam}) = 0.5$, $P(A|\text{Spam}) = 0.2$, and $P(A|\text{Not Spam}) = 0.001$.
(a) Find $P(A) = P(\text{email message contains the word Viagra})$.
(b) Find $P(\text{Spam}|A)$ and $P(\text{Not Spam}|A)$.
(c) If you declare any message containing the word Viagra as spam, what fraction of spam
emails would you detect?
¹² “A foolish faith in authority is the worst enemy of truth.” Albert Einstein, 1901.
22. Spam detection 2: To increase the probability of detecting spam, we can use a larger set of
email “features”. These could be words or other features of a message which tend to occur with
much different probabilities in spam and in regular email. (From your experience, what might be
some useful features?) Suppose we identify three binary features, and we define events
$A_i$ = “feature $i$ appears in a message”, $i = 1, 2, 3$.
Assume that $A_1, A_2, A_3$ are independent events, given that a message is spam, and that they are
also independent events, given that a message is not spam.
From a study of email messages coming into a certain system it is estimated that $P(\text{Spam}) = 0.5$
and
$P(A_1 \mid \text{Spam}) = 0.2$, $P(A_1 \mid \text{Not Spam}) = 0.005$
(a) Find P(message has all three features) = $P(A_1 A_2 A_3)$.
(b) Find $P(\text{Spam} \mid A_1 A_2 A_3)$.
(c) Suppose a message has features 1 and 2 present, but feature 3 is not present. Determine
$P(\text{Spam} \mid A_1 A_2 \bar{A}_3)$.
(d) If you declare any message with one or more of features 1, 2 or 3 as spam, what fraction of
spam emails would you detect? Compare this with Problem 21(c).
(e) Given that a message is declared as spam (according to the rule in (d)), what is the proba-
bility that the message is actually spam?
(f) Given that a message is declared as spam (according to the rule in (d)), what is the proba-
bility that feature 1 is present?
23. Online fraud detection: Methods like those in Problems 21 and 22 are also used in monitoring
events such as credit card transactions for potential fraud. Unlike the case of spam email, how-
ever, the fraction of transactions that are fraudulent is usually very small. What we hope to do
in this case is to “flag” certain transactions so that they can be checked for potential fraud, and
perhaps to block (deny) certain transactions. This is done by identifying features of a transaction
so that if F = “transaction is fraudulent”, then
$$r = \frac{P(\text{feature present} \mid F)}{P(\text{feature present} \mid \bar{F})}$$
is large.
(a) Suppose $P(F) = 0.0005$ and that $P(\text{feature present} \mid F) = 0.02$. Determine
$P(F \mid \text{feature present})$ as a function of r, and give the values when r = 10, 30 and 50.
(b) Suppose r = 50 and you decide to flag transactions if the feature is present. What percent-
age of transactions would be flagged? Does this seem like a good idea?
24. Challenge problem: n music lovers have reserved seats in a theatre containing a total of n + k
seats (k seats are unassigned). The first person who enters the theatre, however, lost his seat
assignment and chooses a seat at random. Subsequently, people enter the theatre one at a time
and sit in their assigned seat unless it is already occupied. If it is, they choose a seat at random
from the remaining empty seats. What is the probability that person n, the last person to enter
the theatre, finds their seat already occupied?
25. Challenge problem (Monty Hall): You have been chosen as a finalist on a television show. For
your prize, the host shows you three doors. Behind one door is a sports car, and behind the
other two are goats. After you choose one door, the host, who knows what is behind each of the
three doors, opens one (never the one you chose or the one with the car) and then says: “You are
allowed to switch the door you chose if you find that advantageous”. Should you switch?
5. DISCRETE RANDOM VARIABLES
5.1 Random Variables and Probability Functions

Probability models are used to describe outcomes associated with random processes. So far we have
used sets A, B, C, ... in sample spaces to describe such outcomes. In this chapter we introduce
numerical-valued variables X, Y, ... to describe outcomes. This allows probability models to be
manipulated easily using ideas from algebra, calculus, or geometry.
A random variable (r.v.) is a numerical-valued variable that represents outcomes in an experiment
or random process. For example, suppose an experiment consists of tossing a coin 3 times. Then
X = number of heads that occur
would be a random variable. Associated with any random variable is a range A, which is the set of
possible values for the variable. For example, the random variable X = number of heads that occur
has range $A = \{0, 1, 2, 3\}$.
Random variables are denoted by capital letters like X, Y, ... and their possible values are denoted
by x, y, ... . This gives a nice shorthand notation for outcomes. For example, “X = 2” in the
experiment above stands for “2 heads occurred”.
Random variables are always defined for every outcome of the random experiment, that is, for every
outcome $a \in S$. For each possible value x of the random variable X, there is a corresponding set of
outcomes a in the sample space S which results in this value of x (that is, so that “X = x” occurs). In
rigorous mathematical treatments of probability, a random variable is defined as a function on a sample
space, as follows:
Definition 11 A random variable is a function that assigns a real number to each point in a sample
space S.
To understand this definition, consider the experiment in which a coin is tossed 3 times, and suppose
that we use the sample space
$S = \{HHH, THH, HTH, HHT, HTT, THT, TTH, TTT\}$
and define a random variable as X = number of heads that occur. The range of the random variable X
is $A = \{0, 1, 2, 3\}$. For points in the sample space, for example a = THH, the value of the function
X(a) is obtained by counting the number of heads, X(a) = 2 in this case. Each of the outcomes
“X = x” represents an event (either simple or compound). These events are as follows:
Table 4.1
Events    Definition of this event
X = 0     {TTT}
X = 1     {HTT, THT, TTH}
X = 2     {HHT, HTH, THH}
X = 3     {HHH}
Since some value of X in the range A must occur, the events of the form “X = x” for $x \in A$ form a
partition of the sample space S. For example the events in the second column of Table 4.1 are mutually
exclusive (for example $\{TTT\} \cap \{HTT, THT, TTH\} = \emptyset$) and their union is the whole sample
space: $\{TTT\} \cup \{HTT, THT, TTH\} \cup \{HHT, HTH, THH\} \cup \{HHH\} = S$.
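The idea of a random variable as a function on the sample space can be sketched in a few lines of code. The snippet below is only an illustration (written in Python rather than R; the names S, X and events are chosen here and do not come from the notes). It enumerates the eight outcomes, applies X to each, and groups outcomes by their value of X to recover the events of Table 4.1.

```python
from itertools import product

# Sample space for three coin tosses: all strings of H/T of length 3.
S = ["".join(t) for t in product("HT", repeat=3)]

def X(a):
    """The random variable: number of heads in outcome a."""
    return a.count("H")

# Group outcomes by the value of X; each group is the event "X = x".
events = {}
for a in S:
    events.setdefault(X(a), set()).add(a)

for x in sorted(events):
    print(x, sorted(events[x]))

# The events are mutually exclusive and their union is the whole sample space.
assert sum(len(e) for e in events.values()) == len(S)
assert set().union(*events.values()) == set(S)
```

Because every outcome is mapped to exactly one value of X, the groups cannot overlap, which is why they form a partition.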
As you may recall, a function is a mapping of each point in a domain into a unique point. For
example, the function $f(x) = x^3$ maps the point x = 2 in the domain into the point f(2) = 8 in the
range. We are familiar with this rule for mapping being defined by a mathematical formula. However,
the rule for mapping a point in the sample space (domain) into the real number in the range of a random
variable is often given in words rather than by a formula. As mentioned above, we generally denote
random variables, in the abstract, by capital letters (X, Y, etc.) and denote the actual numbers taken
by random variables by small letters (x, y, etc.). You should know that there is a difference between a
function (f(x) or X(a)) and the value of a function (for example f(2) or X(a) = 2).
Since “X = x” represents an event of some kind, we will be interested in its probability, which we
write as P(X = x). In the above example in which a fair coin is tossed three times, we might wish the
probability that X is equal to 2, or P(X = 2). This is $P(\{HHT, HTH, THH\}) = \frac{3}{8}$ in the example.
We classify random variables into two types, according to how big their range of values is:
Discrete random variables take integer values or, more generally, values in a countable set. Recall
that a set is countable if its elements can be placed in a one-to-one correspondence with a subset of the
positive integers.
Continuous random variables take values in some interval of real numbers like $(0, 1)$ or $(0, \infty)$ or
$(-\infty, \infty)$. You should be aware that the set of real numbers in an interval is NOT countable.
Examples of each might be (where we assume the values in the second column are the actual values,
not rounded in any way),
Discrete                             Continuous
number of people in a car            total weight of people in a car
number of cars in a parking lot      distance between cars in a parking lot
number of phone calls to 911         time between calls to 911
In theory there could also be mixed random variables which are discrete-valued over part of their
range and continuous-valued over some other portion of their range. We will ignore this possibility
here and concentrate first on discrete random variables. Continuous random variables are considered
in Chapter 8.
Our aim is to set up general models which describe how the probability is distributed among the
possible values in the range of a random variable X. To do this we define for any discrete random
variable X the probability function.
Definition 12 Let X be a discrete random variable with range(X) = A. The probability function
(p.f.) of X is the function
$$f(x) = P(X = x), \quad \text{defined for all } x \in A$$
The set of pairs $\{(x, f(x)) : x \in A\}$ is called the probability distribution of X.
All probability functions must have two properties:
1. $f(x) \ge 0$ for all $x \in A$
2. $\sum_{\text{all } x \in A} f(x) = 1$
By implication, these properties ensure that $f(x) \le 1$ for all x.
We consider a few simple examples before dealing with more complicated problems.
Example: Let X be the number obtained when a die is thrown. We would normally use the probability
function $f(x) = 1/6$ for $x = 1, 2, 3, \dots, 6$. In fact there is probably no absolutely perfect die in
existence. For most dice, however, the 6 sides will be close enough to being equally likely that the
model $f(x) = 1/6$ is a satisfactory one for the distribution of probability among the possible outcomes.
Example: Suppose a “fair” coin is tossed 3 times, with the results on the three tosses independent, and
let X be the total number of heads occurring. Refer to Table 4.1 and compute the probabilities of the
four events listed there; you obtain
Table 4.2
Events    Definition of this event    P(X = x)
X = 0     {TTT}                       1/8
X = 1     {HTT, THT, TTH}             3/8
X = 2     {HHT, HTH, THH}             3/8
X = 3     {HHH}                       1/8
Thus the probability function has values $f(0) = \frac{1}{8}$, $f(1) = \frac{3}{8}$, $f(2) = \frac{3}{8}$, $f(3) = \frac{1}{8}$. In this case it
is easy to see that the number of points in each of the four events of the form “X = x” is $\binom{3}{x}$ using
the counting arguments of Chapter 3, so we can give a simple algebraic expression for the probability
function,
$$f(x) = \frac{\binom{3}{x}}{8} \quad \text{for } x = 0, 1, 2, 3$$
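As a quick check of this formula, the following sketch (Python, with illustrative names that are not part of the notes) computes f(x) by counting equally likely outcomes and compares the result with $\binom{3}{x}/8$. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product
from math import comb

# All 8 equally likely outcomes of three tosses of a fair coin.
outcomes = list(product("HT", repeat=3))

def f(x):
    """P(X = x), found by counting outcomes with exactly x heads."""
    favourable = sum(1 for a in outcomes if a.count("H") == x)
    return Fraction(favourable, len(outcomes))

for x in range(4):
    # The counting answer agrees with the algebraic expression C(3, x) / 8.
    assert f(x) == Fraction(comb(3, x), 8)
    print(x, f(x))
```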
Example: Find the value of k which makes f (x) below a probability function.
x       0     1     2     3
f(x)    k     2k    0.3   4k
Since the probability of all possible outcomes must add to one, $\sum_{x=0}^{3} f(x) = 1$, giving $7k + 0.3 = 1$.
Hence $k = 0.1$.
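The two defining properties of a probability function are easy to test mechanically. The sketch below is an illustration only; the helper name is_probability_function is made up for this example. It solves $7k + 0.3 = 1$ exactly and then confirms that the resulting table is a valid probability function.

```python
from fractions import Fraction

def is_probability_function(f):
    """f is a dict mapping each x in the range A to f(x).

    Checks the two properties: f(x) >= 0 for all x, and the values sum to 1.
    """
    return all(p >= 0 for p in f.values()) and sum(f.values()) == 1

# Solve k + 2k + 0.3 + 4k = 1, that is 7k = 1 - 0.3, using exact arithmetic.
k = (1 - Fraction(3, 10)) / 7

f = {0: k, 1: 2 * k, 2: Fraction(3, 10), 3: 4 * k}
print(k, is_probability_function(f))
```

Using Fraction rather than floating point means the test "sum equals 1" is exact and does not need a tolerance.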
While the probability function is the most common way of describing a probability model, there are
other possibilities. One of them is by using the cumulative distribution function.
Definition 13 The cumulative distribution function (c.d.f.) of X is the function usually denoted by
F(x)
$$F(x) = P(X \le x), \quad \text{defined for all } x \in \mathbb{R}$$
In the last example, with k = 0.1, the range of values for the random variable is $A = \{0, 1, 2, 3\}$. For
$x \in A$ we have
x    f(x) = P(X = x)    $F(x) = P(X \le x)$
0    0.1                0.1
1    0.2                0.3
2    0.3                0.6
3    0.4                1
Figure 5.1: A simple cumulative distribution function
Note that the values in the third column are partial sums of the values of the probability function in the
second column. For example,
$$F(1) = P(X \le 1) = P(X = 0) + P(X = 1) = f(0) + f(1) = 0.3$$
$$F(2) = P(X \le 2) = f(0) + f(1) + f(2) = 0.6$$
F(x) is also defined for real numbers $x \notin A$, that is, numbers not in the range of the random variable; for example
F(2.5) = F(2) = 0.6 and F(3.8) = 1
The cumulative distribution function for this example is plotted in Figure 5.1.
In general, F(x) can be obtained from f(x) using
$$F(x) = P(X \le x) = \sum_{u \le x} f(u)$$
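The summation above translates directly into code. This is a minimal sketch (Python, names chosen for illustration) using the probability function of the previous example with k = 0.1. Note that F accepts any real x, not only values in the range A.

```python
from fractions import Fraction

# Probability function from the example with k = 0.1.
f = {0: Fraction(1, 10), 1: Fraction(2, 10), 2: Fraction(3, 10), 3: Fraction(4, 10)}

def F(x):
    """Cumulative distribution function: sum of f(u) over all u <= x."""
    return sum((p for u, p in f.items() if u <= x), Fraction(0))

for x in (-1, 0, 1, 2, 2.5, 3, 3.8):
    print(x, F(x))
```

The function is constant between consecutive points of A and jumps at each point of A, which is the step shape described for Figure 5.1.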
A cumulative distribution function F (x) has certain properties, just as a probability function f (x)
does. Obviously, since it represents a probability, F (x) must be between 0 and 1. In addition it must be
a non-decreasing function (e.g. $P(X \le 8)$ cannot be less than $P(X \le 7)$). Thus we note the following
properties of a cumulative distribution function F(x):
1. F(x) is a non-decreasing function of x for all $x \in \mathbb{R}$.
2. $0 \le F(x) \le 1$ for all $x \in \mathbb{R}$.
3. $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$.
We have noted above that F (x) can be obtained from f (x). The opposite is also true; for example
the following result holds:
If X takes on integer values then for values x such that $x \in A$ and $x - 1 \in A$,
$$f(x) = F(x) - F(x - 1)$$
This says that f(x) is the size of the jump in F(x) at the point x. To prove this, just note that
$$F(x) - F(x - 1) = P(X \le x) - P(X \le x - 1) = P(X = x)$$
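Going in the other direction, the jump sizes of F give back f. The sketch below (Python, illustrative only) starts from the tabulated values of F for the example with range {0, 1, 2, 3} and recovers the probability function as F(x) - F(x - 1). The table F_table and the helper names are assumptions made for this example.

```python
import math
from fractions import Fraction

# Tabulated values of F at the points of the range A = {0, 1, 2, 3}.
F_table = {0: Fraction(1, 10), 1: Fraction(3, 10), 2: Fraction(6, 10), 3: Fraction(1)}

def F(x):
    """Step-function c.d.f. defined for every real x."""
    n = math.floor(x)
    if n < 0:
        return Fraction(0)
    if n >= 3:
        return Fraction(1)
    return F_table[n]

def f(x):
    """Size of the jump in F at the integer x."""
    return F(x) - F(x - 1)

print([f(x) for x in range(4)])
```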
When a random variable has been defined it is sometimes simpler to find its probability function
f (x) first, and sometimes it is simpler to find F (x) first. The following example gives two approaches
for the same problem.
Example: Suppose that N balls labelled $1, 2, \dots, N$ are placed in a box, and n balls ($n \le N$) are
randomly selected without replacement. Define the random variable
X = largest number selected
Find the probability function for X.
Solution 1: If X = x then we must select the number x plus $n - 1$ numbers from the set
$\{1, 2, \dots, x - 1\}$. (Note that this means we need $x \ge n$.) This gives
$$f(x) = P(X = x) = \frac{\binom{1}{1}\binom{x-1}{n-1}}{\binom{N}{n}} = \frac{\binom{x-1}{n-1}}{\binom{N}{n}} \quad \text{for } x = n, n+1, \dots, N$$
Solution 2: First find $F(x) = P(X \le x)$. Noting that $X \le x$ if and only if all n balls selected are
from the set $\{1, 2, \dots, x\}$, we get
$$F(x) = \frac{\binom{x}{n}}{\binom{N}{n}} \quad \text{for } x = n, n+1, \dots, N$$
We can now find
$$f(x) = F(x) - F(x - 1) = \frac{\binom{x}{n} - \binom{x-1}{n}}{\binom{N}{n}} = \frac{\binom{x-1}{n-1}}{\binom{N}{n}}$$
as before.
Figure 5.2: Probability histogram for $f(x) = \frac{x+1}{10}$, x = 0, 1, 2, 3
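The two solutions to the “largest number selected” example can be checked against a brute-force enumeration. The sketch below is illustrative only: the values N = 6 and n = 3 are chosen here for the check and do not appear in the notes. It lists every possible sample, tabulates the largest label, and compares with both formulas.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

N, n = 6, 3  # illustrative values, not from the notes

def f_formula(x):
    """Solution 1: choose x itself and n - 1 smaller labels."""
    return Fraction(comb(x - 1, n - 1), comb(N, n))

def F_formula(x):
    """Solution 2: all n labels come from {1, ..., x}."""
    return Fraction(comb(x, n), comb(N, n))

# Every unordered sample of n balls is equally likely.
samples = list(combinations(range(1, N + 1), n))

def f_enum(x):
    return Fraction(sum(1 for s in samples if max(s) == x), len(samples))

for x in range(n, N + 1):
    assert f_enum(x) == f_formula(x) == F_formula(x) - F_formula(x - 1)
    print(x, f_formula(x))
```

math.comb returns 0 when the lower index exceeds the upper one, so F_formula(n - 1) is 0 and the difference formula also works at x = n.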
Remark: When you write down a probability function, remember to state its domain (that is, the
possible values of the random variable, or the values x for which f (x) is defined). This is an essential
part of the function’s definition.
We frequently graph the probability function f (x) using a probability histogram. For now, we’ll
define this only for random variables whose range is some set of consecutive integers $\{0, 1, 2, \dots\}$.
A histogram of f(x) is then a graph consisting of adjacent bars or rectangles. At each x we place a
rectangle with base on $(x - 0.5, x + 0.5)$ and with height f(x). In the above example, a histogram of
f(x) looks like that in Figure 5.2.
Notice that the areas of these rectangles correspond to the probabilities, so for example P(X = 1)
is the area of the bar above and centered around the value 1 and $P(1 \le X \le 3)$ is the sum of the areas
of the three rectangles above the points 1, 2, and 3 (actually the area of the region above between the
points x = 0.5 and x = 3.5). In general in a probability histogram, probabilities are depicted by areas.
Model Distributions:
Many processes or problems have the same structure. In the remainder of this course we will
identify common types of problems and develop probability distributions that represent them. In doing
this it is important to be able to strip away the particular wording of a problem and look for its essential
features. For example, the following three problems are all essentially the same.
(a) A fair coin is tossed 10 times and the “number of heads obtained” (X) is recorded.
(b) Twenty seeds are planted in separate pots and the “number of seeds germinating” (X) is recorded.
(c) Twelve items are picked at random from a factory’s production line and examined for defects.
The number of items having no defects (X) is recorded.
What are the common features? In each case the process consists of “trials” which are repeated a
stated number of times: 10, 20, and 12. In each repetition there are two types of outcomes: heads/tails,
germinate/don’t germinate, and no defects/defects. These repetitions are independent (as far as we can
determine), with the probability of each type of outcome remaining constant for each repetition. The
random variable we record is the number of times one of these two types of outcome occurred.
Six model distributions for discrete random variables will be developed in the rest of this chapter.
Students often have trouble deciding which one (if any) to use in a given setting, so be sure you under-
stand the physical setup which leads to each one. Also, as illustrated above you will need to learn to
focus on the essential features of the situation as well as the particular content of the problem.
Statistical Computing
A number of major software systems have been developed for Probability and Statistics. We will use
a system called R, which has a wide variety of features and which has Unix and Windows versions.
Chapter 6 gives a brief introduction to R, and how to access it. For this course, R can compute prob-
abilities for all the distributions we consider, can graph functions or data, and can simulate random
processes. In the sections below we will indicate how R can be used for some of these tasks.
Problems
5.1.1 Find c if X is a random variable with probability function
x       0        1     2
f(x)    $9c^2$   9c    $c^2$
5.1.2 Suppose that 5 people, including you and a friend, line up at random. Let X be the number of
people standing between you and your friend. Tabulate the probability function and the cumula-
tive distribution function for X.
5.2 Discrete Uniform Distribution

We define each model in terms of an abstract “physical setup”, or setting, and then consider specific
examples of the setup.
Physical Setup: Suppose X takes values $a, a + 1, a + 2, \dots, b$ with all values being equally likely.
Then X has a discrete Uniform distribution on the set $\{a, a + 1, a + 2, \dots, b\}$.
Illustrations:
1. If X is the number obtained when a die is rolled, then X has a discrete Uniform distribution with
a = 1 and b = 6.
2. Computer random number generators give Uniform[1, N] variables, for a specified positive integer
N. These are used for many purposes, e.g. generating lottery numbers or providing automated
random sampling from a set of N items.
Probability Function: There are $b - a + 1$ values X can take so the probability at each of these values
must be $\frac{1}{b - a + 1}$ in order that $\sum_{x=a}^{b} f(x) = 1$. Therefore
$$f(x) = P(X = x) = \begin{cases} \frac{1}{b - a + 1} & \text{for } x = a, a + 1, \dots, b \\ 0 & \text{otherwise} \end{cases}$$
Example: Suppose a fair die is thrown once and let X be the number on the face. Find the
cumulative distribution function, F(x), of X.
Solution: This is an example of a discrete Uniform distribution on the set $\{1, 2, 3, 4, 5, 6\}$ having
a = 1, b = 6 and probability function
$$f(x) = P(X = x) = \begin{cases} \frac{1}{6} & \text{for } x = 1, 2, \dots, 6 \\ 0 & \text{otherwise} \end{cases}$$
The cumulative distribution function is
$$F(x) = P(X \le x) = \begin{cases} 0 & \text{for } x < 1 \\ \frac{[x]}{6} & \text{for } 1 \le x < 6 \\ 1 & \text{for } x \ge 6 \end{cases}$$
where by [x] we mean the integer part of the real number x or the largest whole number less than or
equal to x.
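The piecewise formula for F(x) can be written as a short function. The sketch below (Python, illustrative) generalizes slightly to a discrete Uniform distribution on {a, ..., b}, for which the middle branch becomes ([x] - a + 1)/(b - a + 1). That generalization is not stated in the notes, but it reduces to [x]/6 for the die.

```python
import math
from fractions import Fraction

def uniform_cdf(x, a=1, b=6):
    """F(x) = P(X <= x) for a discrete Uniform distribution on {a, ..., b}."""
    if x < a:
        return Fraction(0)
    if x >= b:
        return Fraction(1)
    # math.floor(x) plays the role of [x], the largest integer <= x.
    return Fraction(math.floor(x) - a + 1, b - a + 1)

for x in (0.9, 1, 2.5, 5.99, 6, 7.2):
    print(x, uniform_cdf(x))
```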
Many distributions are constructed using discrete Uniform random variables. For example we
might throw two dice and sum the values on their faces.
Example: Suppose two fair dice (suppose for simplicity one is red and the other is green) are thrown.
Let X be the sum of the values on their faces. Find the cumulative distribution function, F(x), of X.
Solution: In this case we can consider the sample space to be
$S = \{(1, 1), (1, 2), (1, 3), \dots, (5, 6), (6, 6)\}$
where for example the outcome (i, j) means we obtained i on the red die and j on the green. There
are 36 outcomes in this sample space, all with the same probability $\frac{1}{36}$. The probability function of
X is easily found. For example f(5) is the probability of the event X = 5 or the probability of
$\{(1, 4), (2, 3), (3, 2), (4, 1)\}$ so $f(5) = \frac{4}{36}$. The probability function and the cumulative distribution
function are as listed below:
x                      2     3     4     5      6      7      8      9      10     11     12
f(x) = P(X = x)        1/36  2/36  3/36  4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36
$F(x) = P(X \le x)$    1/36  3/36  6/36  10/36  15/36  21/36  26/36  30/36  33/36  35/36  1
Although it is a bit more difficult to give a formula for the cumulative distribution function for
general argument x in this case, it is clear for example that $F(x) = F([x])$ and $F(x) = 0$ for $x < 2$,
$F(x) = 1$ for $x \ge 12$.
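The table above can be reproduced by enumerating the 36 equally likely outcomes. This is a small illustrative sketch in Python. The dictionaries f and F are names chosen here.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 outcomes (i, j) give each possible sum.
counts = Counter(i + j for i, j in product(range(1, 7), repeat=2))

# Probability function: each outcome has probability 1/36.
f = {x: Fraction(c, 36) for x, c in counts.items()}

# Cumulative distribution function at the integers 2, ..., 12 (running sums of f).
F = {}
running = Fraction(0)
for x in range(2, 13):
    running += f[x]
    F[x] = running

for x in range(2, 13):
    print(x, f[x], F[x])
```

Fractions are printed in lowest terms, so for example 4/36 appears as 1/9.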
Example: Let X be the largest number when a die is rolled 3 times. First find the cumulative
distribution function, F(x), and then find the probability function, f(x), of X.
Solution: This is another example of a distribution constructed from the discrete Uniform. In this case
the sample space
$S = \{(1, 1, 1), (1, 1, 2), \dots, (6, 6, 6)\}$
consists of all $6^3$ possible outcomes of the 3 dice, with each outcome having probability $\frac{1}{216}$. Suppose
that x is an integer between 1 and 6. What is the probability that the largest of these three numbers is
less than or equal to x? This requires that all three of the dice show numbers less than or equal to x,
and there are exactly $x^3$ points in S which satisfy this requirement. Therefore the probability that the
largest number is less than or equal to x is
$$F(x) = \frac{x^3}{6^3}$$
for x = 1, 2, 3, 4, 5, 6. Here is the cumulative distribution function for all real values of x:
$$F(x) = P(X \le x) = \begin{cases} \frac{[x]^3}{216} & \text{for } 1 \le x < 6 \\ 0 & \text{for } x < 1 \\ 1 & \text{for } x \ge 6 \end{cases}$$
To find the probability function we may use the fact that for x in the domain of the probability function
(in this case for $x \in \{1, 2, 3, 4, 5, 6\}$) we have $P(X = x) = P(X \le x) - P(X < x)$ so that
$$f(x) = F(x) - F(x - 1) = \frac{x^3 - (x - 1)^3}{216} = \frac{[x - (x - 1)][x^2 + x(x - 1) + (x - 1)^2]}{216} = \frac{3x^2 - 3x + 1}{216}, \quad x = 1, 2, 3, 4, 5, 6$$
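The closed form for f(x) can be confirmed by listing all 216 outcomes. The following is an illustrative check in Python, with function names chosen for this sketch.

```python
from fractions import Fraction
from itertools import product

# All 6^3 = 216 equally likely outcomes of three rolls.
rolls = list(product(range(1, 7), repeat=3))

def f_enum(x):
    """P(largest number = x), by direct counting."""
    return Fraction(sum(1 for r in rolls if max(r) == x), len(rolls))

def F_formula(x):
    """P(largest number <= x) = x^3 / 6^3 for x = 0, 1, ..., 6."""
    return Fraction(x**3, 6**3)

def f_formula(x):
    return Fraction(3 * x * x - 3 * x + 1, 216)

for x in range(1, 7):
    assert f_enum(x) == f_formula(x) == F_formula(x) - F_formula(x - 1)
    print(x, f_formula(x))
```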
5.3 Hypergeometric Distribution

Physical Setup [13]: We have a collection of N objects which can be classified into two distinct types.
Call one type “success” [14] (S) and the other type “failure” (F). There are r successes and $N - r$
failures. Pick n objects at random without replacement. Let X be the number of successes obtained.
Then X has a Hypergeometric distribution.
Illustrations:
1. The number of aces X in a bridge hand has a Hypergeometric distribution with N = 52, r = 4,
and n = 13.
2. In a fleet of 200 trucks there are 12 which have defective brakes. In a safety check 10 trucks
are picked at random for inspection. The number of trucks X with defective brakes chosen for
inspection has a Hypergeometric distribution with N = 200, r = 12, n = 10.
Probability Function: Using counting techniques we note there are $\binom{N}{n}$ points in the sample space S
if we don’t consider order of selection. There are $\binom{r}{x}$ ways to choose the x success objects from the r
available and $\binom{N-r}{n-x}$ ways to choose the remaining $(n - x)$ objects from the $(N - r)$ failures. Hence
$$f(x) = P(X = x) = \frac{\binom{r}{x}\binom{N-r}{n-x}}{\binom{N}{n}}$$
The range of values for x is somewhat complicated. Of course, $x \ge 0$. However if the number, n,
picked exceeds the number, $N - r$, of failures, the difference, $n - (N - r)$, must be successes. So
$x \ge \max(0, n - N + r)$. Also, $x \le r$ since we can’t get more successes than the number available. But
$x \le n$, since we can’t get more successes than the number of objects chosen. Therefore $x \le \min(r, n)$.
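A minimal sketch of the Hypergeometric probability function, including its range, is given below. It is written in Python for illustration; the function names and the small example with N = 10, r = 7, n = 5 are assumptions made here and are not part of the notes.

```python
from fractions import Fraction
from math import comb

def hypergeometric_range(N, r, n):
    """Smallest and largest possible number of successes."""
    return max(0, n - N + r), min(r, n)

def hypergeometric_pf(x, N, r, n):
    """f(x) = C(r, x) C(N - r, n - x) / C(N, n), and 0 outside the range."""
    lo, hi = hypergeometric_range(N, r, n)
    if not lo <= x <= hi:
        return Fraction(0)
    return Fraction(comb(r, x) * comb(N - r, n - x), comb(N, n))

# Illustration 1: number of aces in a bridge hand (N = 52, r = 4, n = 13).
for x in range(0, 5):
    print(x, float(hypergeometric_pf(x, 52, 4, 13)))

# A case where the lower limit is not 0: only 3 failures exist but 5 objects are
# drawn, so at least 2 of them must be successes.
print(hypergeometric_range(10, 7, 5))
```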
[13] This section is optional for STAT 220.
[14] “If A is a success in life, then A equals x plus y plus z. Work is x; y is play; and z is keeping your mouth shut.” Albert
Einstein, 1950
Example: In Lotto 6/49 a player selects a set of six numbers (with no repeats) from the set $\{1, 2, \dots, 49\}$.
In the lottery draw six numbers are selected at random. Find the probability function for X, the number
of numbers from your set which are drawn.
Solution: Think of your numbers as the S objects and the remainder as the F objects. Then X has a
Hypergeometric distribution with N = 49, r = 6 and n = 6, so
$$f(x) = P(X = x) = \frac{\binom{6}{x}\binom{43}{6-x}}{\binom{49}{6}} \quad \text{for } x = 0, 1, \dots, 6$$
For example, you win the jackpot prize if X = 6; the probability of this is $\binom{6}{6} / \binom{49}{6}$, or about 1 in 13.9
million.
Remark: When parameter values are large, Hypergeometric probabilities may be tedious to compute
using a basic calculator. The R functions dhyper and phyper can be used to evaluate f(x) and the
c.d.f. F(x). In particular, dhyper(x, r, N − r, n) gives f(x) and phyper(x, r, N − r, n) gives F(x).
Using this we find for the Lotto 6/49 problem here, for example, that f(6) is calculated by typing
dhyper(6, 6, 43, 6) in R, which returns the answer $7.151124 \times 10^{-8}$ or 1/13,983,816.
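For readers without R at hand, the same Lotto 6/49 calculation can be sketched with exact integer arithmetic. This Python snippet is illustrative only. lotto_pf is a name made up for this example, and it plays the role of dhyper(x, 6, 43, 6).

```python
from fractions import Fraction
from math import comb

def lotto_pf(x):
    """P(X = x) for Lotto 6/49: Hypergeometric with N = 49, r = 6, n = 6."""
    return Fraction(comb(6, x) * comb(43, 6 - x), comb(49, 6))

jackpot = lotto_pf(6)
print(jackpot, float(jackpot))

# Full probability function, as decimals.
for x in range(7):
    print(x, float(lotto_pf(x)))
```

Working with Fraction keeps the jackpot probability as an exact ratio, so the denominator can be read off directly.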
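For all of our model distributions we can also confirm that $\sum_{\text{all } x} f(x) = 1$. To do this here we use a
summation result from Chapter 4 called the Hypergeometric Identity. Letting $a = r$, $b = N - r$ in that
identity we get
$$\sum_{\text{all } x} f(x) = \sum_{\text{all } x} \frac{\binom{r}{x}\binom{N-r}{n-x}}{\binom{N}{n}} = \frac{1}{\binom{N}{n}} \sum_{\text{all } x} \binom{r}{x}\binom{N-r}{n-x} = \frac{\binom{r+N-r}{n}}{\binom{N}{n}} = 1$$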
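The identity used in this step, $\sum_x \binom{a}{x}\binom{b}{n-x} = \binom{a+b}{n}$, can be spot-checked numerically. The sketch below (Python, illustrative) tests it for the truck illustration and for a few other small values chosen here. A numerical check of particular cases is of course not a proof.

```python
from math import comb

def identity_holds(a, b, n):
    """Check sum over x of C(a, x) C(b, n - x) == C(a + b, n) for one (a, b, n)."""
    total = sum(comb(a, x) * comb(b, n - x) for x in range(n + 1))
    return total == comb(a + b, n)

# Truck illustration: a = r = 12, b = N - r = 188, n = 10.
print(identity_holds(12, 188, 10))

# A few further small cases, chosen only for illustration.
print(all(identity_holds(a, b, n)
          for a in range(6) for b in range(6) for n in range(a + b + 1)))
```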
Problems
5.3.1 A box of 12 tins of tuna contains d which are tainted. Suppose 7 tins are opened for inspection
and none of these 7 is tainted.
(a) Calculate the probability that none of the 7 is tainted for d = 0, 1, 2, 3.
(b) Do you think it is likely that the box contains as many as 3 tainted tins?
5.3.2 Suppose our sample space distinguishes points with different orders of selection. For example
suppose that $S = \{SSSSFFF\ldots, \ldots\}$ consists of all words of length n where letters are drawn
without replacement from a total of r S’s and $N - r$ F’s. Derive a formula for the probability
that the word contains exactly x S’s. In other words, determine the Hypergeometric probability
function using a sample space in which order of selection is considered.
5.4 Binomial Distribution

Physical Setup:
Suppose an “experiment” has two types of distinct outcomes. Call these types “success” (S) and
“failure” (F), and let their probabilities be p (for S) and $1 - p$ (for F). Repeat the experiment n
independent times. Let X be the number of successes obtained. Then X has what is called a Binomial
distribution. We write $X \sim \text{Binomial}(n, p)$ as a shorthand for “X is distributed according to a
Binomial distribution with n repetitions and probability p of success”. The n individual experiments
in the process just described are often called “trials” or “Bernoulli trials” and the process is called a
Bernoulli [15] process or a Binomial process.
Illustrations:
1. Toss a fair die 10 times and let X be the number of sixes that occur. Then $X \sim \text{Binomial}(10, 1/6)$.
2. In a microcircuit manufacturing process, 90% of the chips produced work (10% are defective).
Suppose we select 25 chips, independently [16], and let X be the number that work. Then
$X \sim \text{Binomial}(25, 0.9)$.
Comment: We must think carefully whether the physical process we are considering is closely approx-
imated by a Binomial process, for which the key assumptions are that (i) the probability p of success is
constant over the n trials, and (ii) the outcome (S or F ) on any trial is independent of the outcome on
the other trials. For Illustration 1 these assumptions seem appropriate. For Illustration 2 we would need
to think about the manufacturing process. Microcircuit chips are produced on “wafers” containing a
large number of chips and it is common for defective chips to cluster on wafers. This could mean that
if we selected 25 chips from the same wafer, or from only 2 or 3 wafers, the “trials” (chips) might
not be independent, or perhaps that the probability of defectives changes.
[15] After James (Jakob) Bernoulli (1654 – 1705), a Swiss member of a family of eight mathematicians. His father, Nicolaus
Bernoulli, was an important citizen of Basel, being a member of the town council and a magistrate. Jakob Bernoulli’s mother
also came from an important Basel family of bankers and local councillors. Jakob Bernoulli was the brother of Johann
Bernoulli and the uncle of Daniel Bernoulli. He was compelled to study philosophy and theology by his parents and graduated
from the University of Basel with a master’s degree in philosophy and a licentiate in theology but, against his parents’ wishes,
studied mathematics and astronomy. He was offered an appointment in the Church but turned it down and instead taught
mechanics at the University in Basel from 1683, giving lectures on the mechanics of solids and liquids. Jakob Bernoulli is
responsible for many of the combinatorial results dealing with independent random variables which take values 0 or 1 in these
notes. He was also a fierce rival of his younger brother Johann Bernoulli, also a mathematician, who would have liked the
chair of mathematics at Basel which Jakob held.
[16] For example, we select at random with replacement, or without replacement from a very large number of chips.
Figure 5.3: Probability histogram for a Binomial(20, 0.3) random variable
Probability Function: There are $\frac{n!}{x!(n-x)!} = \binom{n}{x}$ different arrangements of x S’s and $(n - x)$ F’s
over the n trials. The probability for each of these arrangements has p multiplied together x times and
$(1 - p)$ multiplied $(n - x)$ times, in some order, since the trials are independent. So each arrangement
has probability $p^x (1 - p)^{n-x}$. Therefore
$$f(x) = P(X = x) = \binom{n}{x} p^x (1 - p)^{n-x} \quad \text{for } x = 0, 1, 2, \dots, n$$
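The counting argument behind this formula can be illustrated by brute force: list every arrangement of S’s and F’s, compute its probability as a product over independent trials, and add up those with exactly x S’s. The values n = 4 and p = 3/10 below are chosen only for illustration and are not from the notes.

```python
from fractions import Fraction
from itertools import product
from math import comb

n, p = 4, Fraction(3, 10)  # illustrative values, not from the notes

def prob_of_arrangement(seq):
    """Probability of one specific sequence of S's and F's (independent trials)."""
    result = Fraction(1)
    for outcome in seq:
        result *= p if outcome == "S" else 1 - p
    return result

def f_enum(x):
    """Add the probabilities of every arrangement containing exactly x S's."""
    return sum((prob_of_arrangement(s) for s in product("SF", repeat=n)
                if s.count("S") == x), Fraction(0))

def f_formula(x):
    return comb(n, x) * p**x * (1 - p)**(n - x)

for x in range(n + 1):
    assert f_enum(x) == f_formula(x)
    print(x, f_formula(x))
```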
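Checking that $\sum f(x) = 1$:
$$\sum_{x=0}^{n} f(x) = \sum_{x=0}^{n} \binom{n}{x} p^x (1 - p)^{n-x}$$
$$= (1 - p)^n \sum_{x=0}^{n} \binom{n}{x} \left(\frac{p}{1 - p}\right)^x$$
$$= (1 - p)^n \left(1 + \frac{p}{1 - p}\right)^n \quad \text{by the Binomial Theorem}$$
$$= (1 - p)^n \left(\frac{1 - p + p}{1 - p}\right)^n$$
$$= 1^n = 1$$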
Figure 5.3 shows the probability histogram for the Binomial distribution with parameters n = 20 and
p = 0:3. Although the formula for f (x) may seem complicated the shape of the histogram is simple
since it increases to a maximum value near np and then decreases thereafter.
Computation: Many software packages and some calculators give Binomial probabilities. In R we
use the function dbinom(x, n, p) to compute f(x) and pbinom(x, n, p) to compute the corresponding
cumulative distribution function $F(x) = P(X \le x)$.
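For comparison, here is a minimal Python sketch of the same two computations. The functions binomial_pf and binomial_cdf are illustrative stand-ins written for this example, not R functions, and they use ordinary floating-point arithmetic.

```python
import math
from math import comb

def binomial_pf(x, n, p):
    """f(x) = C(n, x) p^x (1 - p)^(n - x) for x = 0, ..., n, and 0 otherwise."""
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binomial_cdf(x, n, p):
    """F(x) = P(X <= x): sum of f(k) for integers k from 0 up to floor(x)."""
    return sum(binomial_pf(k, n, p) for k in range(0, min(math.floor(x), n) + 1))

# Binomial(20, 0.3), the distribution shown in Figure 5.3.
values = [binomial_pf(x, 20, 0.3) for x in range(21)]
print(sum(values))                 # should be 1 up to rounding error
print(values.index(max(values)))   # location of the peak of the histogram
print(binomial_cdf(6, 20, 0.3))
```

The peak of the histogram falls near np = 6, in line with the description of Figure 5.3.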
Example: Suppose that in a weekly lottery you have probability 0:02 of winning a prize with a single
ticket. If you buy 1 ticket per week for 52 weeks, what is the probability that (a) you win no prizes, and
(b) that you win 3 or more prizes?
Solution: Let X be the number of weeks that you win; then $X \sim \text{Binomial}(52, 0.02)$. We find
(a)
$$P(X = 0) = f(0) = \binom{52}{0} (0.02)^0 (0.98)^{52} = 0.350$$
(b)
$$P(X \ge 3) = 1 - P(X \le 2) = 1 - f(0) - f(1) - f(2) = 0.0859$$
(Note that $P(X \le 2)$ is given by the R command pbinom(2, 52, 0.02).)
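The two answers in this example can be reproduced with a few lines of code. The sketch below is in Python, for illustration, and the variable names are chosen here. It should agree with the values 0.350 and 0.0859 quoted above up to rounding.

```python
from math import comb

n, p = 52, 0.02  # 52 weekly tickets, each winning with probability 0.02

def f(x):
    """Binomial(52, 0.02) probability function."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# (a) probability of winning no prizes
p_none = f(0)

# (b) probability of winning 3 or more prizes, via the complement
p_three_or_more = 1 - (f(0) + f(1) + f(2))

print(round(p_none, 3), round(p_three_or_more, 4))
```

Using the complement in (b) means only three terms are needed instead of fifty.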
Comparison of Binomial and Hypergeometric Distributions: These distributions are similar in

that an experiment with two types of outcome (S and F ) is repeated n times and X is the number
of successes. The key difference is that the Binomial requires independent repetitions with the same
probability of S, whereas the draws in the Hypergeometric are made from a fixed collection of objects
without replacement. The trials (draws) are therefore not independent. For example, if there are r = 10
S objects and $N - r = 10$ F objects, then the probability of getting an S on draw two depends on what
was obtained in draw one. If these draws had been made with replacement, however, they would be
independent and we would use the Binomial rather than the Hypergeometric model.
If N is large and the number, n, being drawn is relatively small in the Hypergeometric setup then
we are unlikely to get the same object more than once even if we do replace it. So it makes little
practical difference whether we draw with or without replacement. This suggests that when we are
drawing a fairly small proportion of a large collection of objects the Binomial and the Hypergeometric
models should produce similar probabilities. As the Binomial is easier to calculate, it is often used as
an approximation to the Hypergeometric in such cases.
Example: Suppose we have 15 cans of soup with no labels, but 6 are tomato and 9 are pea soup. We
randomly pick 8 cans and open them. Find the probability that three of the cans picked are tomato.
Solution: The correct solution uses the Hypergeometric distribution, and is (with X = number of
tomato soup cans picked)
$$P(X = 3) = \frac{\binom{6}{3}\binom{9}{5}}{\binom{15}{8}} = 0.3916$$
If we incorrectly used the Binomial distribution, we would obtain
$$\binom{8}{3} \left(\frac{6}{15}\right)^3 \left(\frac{9}{15}\right)^5 = 0.2787$$
As expected, this is a poor approximation since we are picking over half of a fairly small collection of
cans.
However, if we had 1500 cans: 600 tomato and 900 pea, we are not likely to get the same can
again even if we did replace each of the 8 cans after opening it. (Put another way, the probability
we get a tomato soup on each pick is very close to 0:4, regardless of what the other picks give.) The
Hypergeometric probability is
$$\frac{\binom{600}{3}\binom{900}{5}}{\binom{1500}{8}} = 0.2794$$
The Binomial probability is
$$\binom{8}{3} \left(\frac{600}{1500}\right)^3 \left(\frac{900}{1500}\right)^5 = 0.2787$$
which is a very good approximation.
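The comparison between the two models is easy to reproduce. The sketch below (Python, illustrative; the function names are chosen here) evaluates the Hypergeometric and Binomial probabilities for both the 15-can and the 1500-can versions of the example.

```python
from math import comb

def hypergeometric_pf(x, N, r, n):
    """Exact probability when drawing n objects without replacement."""
    return comb(r, x) * comb(N - r, n - x) / comb(N, n)

def binomial_pf(x, n, p):
    """Probability if the n draws were independent with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

small_exact = hypergeometric_pf(3, 15, 6, 8)        # 15 cans, 6 tomato
large_exact = hypergeometric_pf(3, 1500, 600, 8)    # 1500 cans, 600 tomato
approx = binomial_pf(3, 8, 0.4)                     # p = 6/15 = 600/1500 = 0.4

print(round(small_exact, 4), round(large_exact, 4), round(approx, 4))
print(abs(small_exact - approx), abs(large_exact - approx))
```

The Binomial value is the same in both cases because it depends only on the proportion of tomato cans, while the Hypergeometric value moves toward it as the collection grows.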
Problems
5.4.1 Megan audits 130 clients during a year and finds irregularities for 26 of them.
(a) Give an expression for the probability that 2 clients will have irregularities when 6 of her
clients are picked at random.
(b) Evaluate your answer to (a) using a suitable approximation.
5.4.2 The flash mechanism on camera A fails on 10% of shots, while that of camera B fails on 5% of
shots. The two cameras being identical in appearance, a photographer selects one at random and
takes 10 indoor shots using the flash.
(a) Give the probability that the flash mechanism fails exactly twice. What assumption(s) are
you making?
(b) Given that the flash mechanism failed exactly twice, what is the probability camera A was
selected?
5.5 Negative Binomial Distribution [17]

Physical Setup:
The setup for this distribution is almost the same as for Binomial; that is, an experiment (trial) has
two distinct types of outcome (S and F ) and is repeated independently with the same probability, p,
of success each time. Continue doing the experiment until a specified number, $k$, of successes have been
obtained. Let $X$ be the number of failures obtained before the $k$'th success. Then $X$ has a Negative
Binomial distribution. We often write $X \sim \text{Negative Binomial}(k, p)$ to denote this.
Illustrations:
(1) If a fair coin is tossed until we get the fifth head, the number of tails we obtain has a Negative
Binomial distribution with $k = 5$ and $p = \frac{1}{2}$.
(2) As a rough approximation, the number of half credit failures a student collects before successfully
completing 40 half credits for an honours degree has a Negative Binomial distribution. (Assume
all course attempts are independent, with the same probability of being successful, and ignore
the fact that getting more than 6 half credit failures prevents a student from continuing toward an
honours degree.)
Probability Function: In all there will be $x + k$ trials ($x$ F's and $k$ S's) and the last trial must be a
success. In the first $x + k - 1$ trials we therefore need $x$ failures and $(k - 1)$ successes, in any order.
There are $\frac{(x+k-1)!}{x!\,(k-1)!} = \binom{x+k-1}{x}$ different orders. Each order will have probability
$p^k (1-p)^x$ since there must be $x$ trials which are failures and $k$ which are successes. Hence
$$f(x) = P(X = x) = \binom{x+k-1}{x} p^k (1-p)^x \quad \text{for } x = 0, 1, 2, \ldots$$
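As a quick illustration of this probability function, here is a minimal Python sketch (standard library only). The name `neg_binom_pmf` is an illustrative choice; the parameter values are those of the fair coin illustration above ($k = 5$, $p = 1/2$).

```python
from math import comb

def neg_binom_pmf(x, k, p):
    """P(X = x), where X counts the failures obtained before the k'th success."""
    return comb(x + k - 1, x) * p**k * (1 - p)**x

# Fair coin tossed until the fifth head: probability of exactly 3 tails.
# There are comb(7, 3) = 35 orders, each with probability (1/2)**8.
p_three_tails = neg_binom_pmf(3, k=5, p=0.5)

# Partial sum over x = 0, ..., 199; the remaining tail is negligible for these values.
partial_sum = sum(neg_binom_pmf(x, 5, 0.5) for x in range(200))

print(p_three_tails, partial_sum)
```

The partial sum being numerically indistinguishable from 1 is only a sanity check; it does not replace the algebraic argument for $\sum f(x) = 1$.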
Note: An alternate version of the Negative Binomial distribution defines X to be the total number of
trials needed to get the k’th success. This is equivalent to our version. For example, asking for the
probability of getting 3 tails before the fifth head is exactly the same as asking for a total of 8 tosses in
order to get the fifth head. You need to be careful to read how X is defined in a problem rather than
mechanically “plugging in” numbers in the above formula for f (x).
Checking that $\sum f(x) = 1$ requires somewhat more work for the Negative Binomial distribution. We
first re-arrange the $\binom{x+k-1}{x}$ term,
$$\binom{x+k-1}{x} = \frac{(x+k-1)^{(x)}}{x!} = \frac{(x+k-1)(x+k-2)\cdots(k+1)(k)}{x!}$$
[17] This section is optional for STAT 220.
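Factor a $(-1)$ out of each of the $x$ terms in the numerator, and re-write these terms in reverse order,
$$\begin{aligned}
\binom{x+k-1}{x} &= (-1)^x \frac{(-k)(-k-1)\cdots(-k-x+2)(-k-x+1)}{x!} \\
&= (-1)^x \frac{(-k)^{(x)}}{x!} = (-1)^x \binom{-k}{x}
\end{aligned}$$
Then (using the Binomial Theorem)
$$\begin{aligned}
\sum_{x=0}^{\infty} f(x) &= \sum_{x=0}^{\infty} \binom{-k}{x} (-1)^x p^k (1-p)^x \\
&= p^k \sum_{x=0}^{\infty} \binom{-k}{x} [(-1)(1-p)]^x = p^k [1 + (-1)(1-p)]^{-k} \\
&= p^k p^{-k} \\
&= 1
\end{aligned}$$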
Comparison of Binomial and Negative Binomial Distributions

These should be easily distinguished because they reverse what is specified or known in advance
and what is variable.
Binomial: We know the number n of trials in advance but we do not know the number of
successes we will obtain until after the experiment.
Negative Binomial: We know the number k of successes in advance but do not know the number
of trials that will be needed to obtain this number of successes until after the experiment.
Example: The fraction of a large population that has a specific blood type T is 0.08 (8%). For blood
donation purposes it is necessary to find 5 people with type T blood. If randomly selected individuals
from the population are tested one after another, then (a) What is the probability y persons have to be
tested to get 5 type T persons, and (b) What is the probability that over 80 people have to be tested?
Solution: Think of a type T person as a success (S) and a non-type T as an F. Let $Y$ = number of
persons who have to be tested and let $X$ = number of non-type T persons in order to get 5 S's. Then
$X$ has a Negative Binomial distribution with $k = 5$ and $p = 0.08$ and
$$P(X = x) = f(x) = \binom{x+4}{x} (0.08)^5 (0.92)^x \quad \text{for } x = 0, 1, 2, \ldots$$
We are actually asked here about $Y = X + 5$. Thus
$$\begin{aligned}
P(Y = y) &= P(X = y - 5) \\
&= f(y - 5) \\
&= \binom{y-1}{y-5} (0.08)^5 (0.92)^{y-5} \quad \text{for } y = 5, 6, 7, \ldots
\end{aligned}$$
Thus we have the answer to (a) as given above, and for (b)
$$\begin{aligned}
P(Y > 80) = P(X > 75) &= 1 - P(X \le 75) \\
&= 1 - \sum_{x=0}^{75} f(x) \\
&= 0.2235
\end{aligned}$$
Note: Calculating such probabilities is easy with R. To get $f(x)$ we use dnbinom(x, k, p) and to get
$F(x) = P(X \le x)$ we use pnbinom(x, k, p).
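For readers without R at hand, part (b) can also be checked with a short Python sketch (standard library only). The names below are illustrative. The cross-check rests on an observation not stated in the notes: more than 80 people must be tested exactly when the first 80 people tested include at most 4 type T persons, which is a Binomial event.

```python
from math import comb

def neg_binom_pmf(x, k, p):
    """P(X = x), where X counts the failures obtained before the k'th success."""
    return comb(x + k - 1, x) * p**k * (1 - p)**x

k, p = 5, 0.08

# P(Y > 80) = P(X > 75) = 1 - P(X <= 75)
cdf_75 = sum(neg_binom_pmf(x, k, p) for x in range(76))
prob_over_80 = 1 - cdf_75

# Cross-check: at most 4 successes among the first 80 independent trials.
binom_check = sum(comb(80, s) * p**s * (1 - p)**(80 - s) for s in range(5))

print(round(prob_over_80, 4), round(binom_check, 4))
```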
Problems
5.5.1 You can get a group rate on tickets to a play if you can find 25 people to go. Assume each person
you ask responds independently and has a 20% chance of agreeing to buy a ticket. Let X be the
total number of people you have to ask in order to find 25 who agree to buy a ticket. Find the
probability function of X.
5.5.2 A shipment of 2500 car headlights contains 200 which are defective. You choose from this
shipment without replacement until you have 18 which are not defective. Let X be the number
of defective headlights you obtain.
(a) Give the probability function, f (x).

(b) Using a suitable approximation, find f (2).
5.6 Geometric Distribution

Physical Setup: Consider the Negative Binomial distribution with $k = 1$. In this case we repeat
independent Bernoulli trials with two types of outcome (S and F) each time, and the same probability,
$p$, of success each time until we obtain the first success. Let $X$ be the number of failures obtained
before the first success. We write $X \sim \text{Geometric}(p)$.
Illustrations:
(1) The probability you win a lottery prize in any given week is a constant p. The number of weeks
before you win a prize for the first time has a Geometric distribution.
(2) If you take STAT 230 until you pass it and attempts are independent with the same probability of
a pass each time [18], then the number of failures would have a Geometric distribution. (Thankfully
these assumptions are unlikely to be true for most persons! Why is this?)
Probability Function: There is only the one arrangement with $x$ failures followed by 1 success. This
arrangement has probability
$$f(x) = P(X = x) = (1-p)^x p \quad \text{for } x = 0, 1, 2, \ldots$$
Alternatively if we substitute $k = 1$ in the probability function for the Negative Binomial, we obtain
$$\begin{aligned}
f(x) &= \binom{x+1-1}{x} p^1 (1-p)^x \quad \text{for } x = 0, 1, 2, \ldots \\
&= p(1-p)^x \quad \text{for } x = 0, 1, 2, \ldots
\end{aligned}$$
which is the same. To check that $\sum f(x) = 1$, we will need to evaluate a Geometric series,
$$\begin{aligned}
\sum_{x=0}^{\infty} f(x) &= \sum_{x=0}^{\infty} (1-p)^x p = p + (1-p)p + (1-p)^2 p + \cdots \\
&= \frac{p}{1-(1-p)} = \frac{p}{p} = 1
\end{aligned}$$
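The Geometric series argument can be illustrated numerically. In the Python sketch below (standard library only), the value $p = 0.3$ and the cutoff of 50 terms are hypothetical choices for illustration; the finite-sum formula $1 - (1-p)^n$ for the first $n$ terms is the standard partial sum of a Geometric series.

```python
def geometric_pmf(x, p):
    """P(X = x), where X counts the failures before the first success."""
    return p * (1 - p)**x

p = 0.3          # hypothetical success probability
n_terms = 50     # hypothetical cutoff for the partial sum

# Sum of f(0) + f(1) + ... + f(n_terms - 1)
partial = sum(geometric_pmf(x, p) for x in range(n_terms))

# Partial sum of the Geometric series in closed form.
closed_form = 1 - (1 - p)**n_terms

print(partial, closed_form)
```

The gap between the partial sum and 1 is $(1-p)^{n}$, which shrinks geometrically as more terms are included.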
Note: The names of the models so far derive from the summation results which show f (x) sums to
one. The Geometric distribution involved a Geometric series; the Hypergeometric distribution used the
Hypergeometric Identity; both the Binomial and Negative Binomial distributions used the Binomial
Theorem.
Bernoulli Trials: Once again remember that the Binomial, Negative Binomial and Geometric models
all involve trials (experiments) which:
(1) are independent
(2) have 2 distinct types of outcome (S and F )
(3) have the same probability p of “success” (S) each time.
Such trials are known as Bernoulli trials.
[18] you burn all notes and purge your memory of the course after each failure
Problem
5.6.1 Suppose there is a 30% chance of a car from a certain production line having a leaky windshield.
The probability an inspector will have to check at least n cars to find the first one with a leaky
windshield is 0.05. Find n.
5.7 Poisson Distribution from Binomial
The Poisson [19] distribution has probability function of the form
$$f(x) = P(X = x) = \frac{e^{-\mu} \mu^x}{x!} \quad \text{for } x = 0, 1, 2, \ldots$$
where $\mu > 0$ is a parameter whose value depends on the setting for the model. Mathematically, we can
see that $f(x)$ has the properties of a probability function, since $f(x) \ge 0$ for $x = 0, 1, 2, \ldots$ and
$$\sum_{x=0}^{\infty} f(x) = \sum_{x=0}^{\infty} \frac{e^{-\mu} \mu^x}{x!} = e^{-\mu}\left(e^{\mu}\right) = 1$$
by the Exponential series.
The Poisson distribution arises in physical settings where the random variable X represents the
number of events of some type. In this section we show how it arises from a Binomial process, and
in the following section we consider another derivation of the model. We write $X \sim \text{Poisson}(\mu)$ to
denote that $X$ has the probability function above.
Physical Setup: One way the Poisson distribution arises is as a limiting case of the Binomial distribution
as $n \to \infty$ and $p \to 0$. In particular, we keep the product $np$ fixed at some constant value, $\mu$, while
letting $n \to \infty$. This automatically makes $p \to 0$. Let us see what the limit of the Binomial probability
function $f(x)$ is in this case.
[19] After Siméon Denis Poisson (1781-1840), a French mathematician who was supposed to become a surgeon but,
fortunately for his patients, failed medical school for lack of coordination. He was forced to do theoretical
research, being too clumsy for anything in the lab. He wrote a major work on probability and the law,
Recherches sur la probabilité des jugements en matière criminelle et matière civile (1837), discovered the
Poisson distribution (called law of large numbers) and to him is ascribed one of the more depressing quotes
in our discipline: “Life is good for only two things: to study mathematics and to teach it.”
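Probability Function: Since $np = \mu$, $p = \frac{\mu}{n}$ and for $x$ fixed,
$$\begin{aligned}
f(x) &= \binom{n}{x} p^x (1-p)^{n-x} = \frac{n^{(x)}}{x!} \left(\frac{\mu}{n}\right)^{x} \left(1 - \frac{\mu}{n}\right)^{n-x} \\
&= \frac{\mu^x}{x!} \, \overbrace{\frac{n(n-1)(n-2)\cdots(n-x+1)}{(n)(n)(n)\cdots(n)}}^{x \text{ terms}} \left(1 - \frac{\mu}{n}\right)^{n-x} \\
&= \frac{\mu^x}{x!} \left(\frac{n}{n}\right)\left(\frac{n-1}{n}\right)\left(\frac{n-2}{n}\right)\cdots\left(\frac{n-x+1}{n}\right) \left(1 - \frac{\mu}{n}\right)^{n} \left(1 - \frac{\mu}{n}\right)^{-x} \\
&= \frac{\mu^x}{x!} (1)\left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right)\cdots\left(1 - \frac{x-1}{n}\right) \left(1 - \frac{\mu}{n}\right)^{n} \left(1 - \frac{\mu}{n}\right)^{-x}
\end{aligned}$$
$$\begin{aligned}
\lim_{n \to \infty} f(x) &= \frac{\mu^x}{x!} \underbrace{(1)(1)(1)\cdots(1)}_{x \text{ terms}} \, e^{-\mu} (1)^{-x} \quad \text{since } e^{k} = \lim_{n \to \infty} \left(1 + \frac{k}{n}\right)^{n} \\
&= \frac{\mu^x e^{-\mu}}{x!}, \quad \text{for } x = 0, 1, 2, \ldots
\end{aligned}$$
(For the Binomial the upper limit on $x$ is $n$, but we are letting $n \to \infty$.)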
This result allows us to use the Poisson distribution with $\mu = np$ as a close approximation to the
Binomial distribution in processes for which $n$ is large and $p$ is small.
Example: There are 200 people at a party. What is the probability that 2 of them were born on January
1?
Solution: Assuming all days of the year are equally likely for a birthday (and ignoring February 29)
and that the birthdays are independent (e.g. no twins!) we can use the Binomial distribution with
$n = 200$ and $p = 1/365$ for $X$ = number born on January 1, giving
$$f(2) = \binom{200}{2} \left(\frac{1}{365}\right)^{2} \left(1 - \frac{1}{365}\right)^{198} = 0.086767$$
Since $n$ is large and $p$ is close to 0, we can use the Poisson distribution to approximate this Binomial
probability, with $\mu = np = \frac{200}{365}$, giving
$$f(2) = \frac{\left(\frac{200}{365}\right)^{2} e^{-200/365}}{2!} = 0.086791$$
As might be expected, this is a very good approximation.
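The two values in this example can be reproduced with a short Python sketch (standard library only); the variable names are illustrative.

```python
from math import comb, exp, factorial

n, p = 200, 1 / 365

# Exact Binomial probability of exactly 2 people born on January 1.
binom_exact = comb(n, 2) * p**2 * (1 - p)**(n - 2)

# Poisson approximation with mu = n * p.
mu = n * p
poisson_approx = exp(-mu) * mu**2 / factorial(2)

print(f"{binom_exact:.6f} {poisson_approx:.6f}")
```

The two results differ only in the fifth decimal place, consistent with $n$ being large and $p$ small.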
Notes:
(1) If $p$ is close to 1 we can also use the Poisson distribution to approximate the Binomial. By
interchanging the labels “success” and “failure”, we can get the probability of “success” (formerly
labelled “failure”) close to 0.
(2) The Poisson distribution used to be very useful for approximating Binomial probabilities with $n$
large and $p$ near 0 since the calculations are easier. (This assumes values of $e^x$ to be available.)
With the advent of computers, it is just as easy to calculate the exact Binomial probabilities
as the Poisson probabilities. However, the Poisson approximation is useful when employing a
calculator without a built in Binomial function.
(3) The R functions dpois(x, $\mu$) and ppois(x, $\mu$) give $f(x)$ and $F(x)$ respectively.
Problem
5.7.1 An airline knows that 97% of the passengers who buy tickets for a certain flight will show up on
time. The plane has 120 seats.
(a) The airline sells 122 tickets. Find the probability that more people will show up than can be
carried on the flight. Compare this answer with the answer given by the Poisson approximation.
(b) What assumptions does your answer depend on? How well would you expect these assumptions
to be met?
5.8 Poisson Distribution from Poisson Process

We now derive the Poisson distribution as a model for the number of a certain kind of event or
occurrence (e.g. births, insurance claims, web site hits) that occur at points in time or in space. To this
end, we use the “order” notation $g(\Delta t) = o(\Delta t)$ as $\Delta t \to 0$ to mean that the function $g$ approaches 0
faster than $\Delta t$ as $\Delta t$ approaches zero, or that
$$\frac{g(\Delta t)}{\Delta t} \to 0 \quad \text{as } \Delta t \to 0$$
For example $g(\Delta t) = (\Delta t)^2 = o(\Delta t)$ but $(\Delta t)^{1/2}$ is not $o(\Delta t)$.
Physical Setup: Consider a situation in which a certain type of event occurs at random points in time
(or space) according to the following conditions:
1. Independence: the number of occurrences in non-overlapping intervals are independent.
2. Individuality: for sufficiently short time periods of length $\Delta t$, the probability of 2 or more
events occurring in the interval is close to zero, that is, events occur singly not in clusters. More
precisely, as $\Delta t \to 0$, the probability of two or more events in the interval of length $\Delta t$ must go
to zero faster than $\Delta t \to 0$ or
$$P(\text{2 or more events in } (t, t + \Delta t)) = o(\Delta t) \quad \text{as } \Delta t \to 0$$
3. Homogeneity or Uniformity: events occur at a uniform or homogeneous rate $\lambda$ over time so
that the probability of one occurrence in an interval $(t, t + \Delta t)$ is approximately $\lambda \Delta t$ for small
$\Delta t$ for any value of $t$. More precisely,
$$P(\text{one event in } (t, t + \Delta t)) = \lambda \Delta t + o(\Delta t)$$
These three conditions together define a Poisson Process.
Let $X$ be the number of event occurrences in a time period of length $t$. Then it can be shown (see
below) that $X$ has a Poisson distribution with $\mu = \lambda t$.
Illustrations:
(1) The emission of radioactive particles from a substance follows a Poisson process. (This is used
in medical imaging and other areas.)
(2) Hits on a web site during a given time period often follow a Poisson process.
(3) Occurrences of certain non-communicable diseases sometimes follow a Poisson process.
Probability Function: We can derive the probability function $f(x) = P(X = x)$ from the conditions
above. We are interested in time intervals of arbitrary length $t$, so as a temporary notation, let $f_t(x)$ be
the probability of $x$ occurrences in a time interval of length $t$. We now relate $f_t(x)$ and $f_{t+\Delta t}(x)$. From
that we can determine what $f_t(x)$ is. To find $f_{t+\Delta t}(x)$ we note that for $\Delta t$ small there are only 2 ways
to get a total of $x$ event occurrences by time $t + \Delta t$. Either there are $x$ events by time $t$ and no more
from $t$ to $t + \Delta t$ or there are $x - 1$ by time $t$ and 1 more from $t$ to $t + \Delta t$. (Since $P$(2 or more events
in $(t, t + \Delta t)$) $= o(\Delta t)$, other possibilities are negligible if $\Delta t$ is small.) This and condition 1 above
(independence) imply that
$$f_{t+\Delta t}(x) \approx f_t(x)(1 - \lambda \Delta t) + f_t(x-1)(\lambda \Delta t) + o(\Delta t)^2$$
Re-arranging gives
$$\frac{f_{t+\Delta t}(x) - f_t(x)}{\Delta t} \approx \lambda [f_t(x-1) - f_t(x)] + o(\Delta t)$$
Taking the limit as $\Delta t \to 0$ we get
$$\frac{d}{dt} f_t(x) = \lambda [f_t(x-1) - f_t(x)] \qquad (5.1)$$
This provides a “differential-difference” equation that needs to be solved for the functions $f_t(x)$ as
functions of $t$ for each fixed integer value of $x$. We know that in an interval of length 0, zero events will
occur, so that $f_0(0) = 1$ and $f_0(x) = 0$ for $x = 1, 2, 3, \ldots$. At the moment we may not know how to
solve such a system but let's approach the problem using the Binomial approximation of the last section.
Suppose that the interval $(0, t)$ is divided into $n = \frac{t}{\Delta t}$ small subintervals of length $\Delta t$. The probability
that an event falls in any subinterval (record this as a success) is approximately $p = \lambda \Delta t$ provided
the interval length is small. The probability of two or more events falling in any one subinterval is
less than $n P[\text{2 or more events in } (t, t + \Delta t)] = n \, o(\Delta t)$ which goes to 0 as $\Delta t \to 0$ so we can
ignore the possibility that one of the subintervals has 2 or more events in it. Also the “successes” are
independent on the $n$ different subintervals or “trials”, and so the total number of successes recorded,
$X$, is approximately Binomial$(n, p)$. Therefore
$$P(X = x) \approx \binom{n}{x} p^x (1-p)^{n-x} = \frac{n^{(x)} p^x}{x!} (1-p)^{n} \left(\frac{1}{1-p}\right)^{x}$$
Notice that for fixed $t$, $x$, as $\Delta t \to 0$, $p = \lambda \Delta t \to 0$ and $n = \frac{t}{\Delta t} \to \infty$, and $(1-p)^n \to e^{-\lambda t}$. Also,
for fixed $x$, $n^{(x)} p^x \to (\lambda t)^x$. This yields the approximation
$$P(X = x) \approx \frac{(\lambda t)^x e^{-\lambda t}}{x!}$$
You can confirm that
$$f_t(x) = f(x) = \frac{(\lambda t)^x e^{-\lambda t}}{x!} \quad \text{for } x = 0, 1, 2, \ldots$$
provides a solution to the system (5.1) with the required initial conditions. If we let $\mu = \lambda t$, we can
re-write $f(x)$ as $f(x) = \frac{\mu^x e^{-\mu}}{x!}$, which is the Poisson distribution from Section 5.7. That is:
In a Poisson process with rate of occurrence $\lambda$, the number of event occurrences $X$
in a time interval of length $t$ has a Poisson distribution with $\mu = \lambda t$.
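The Binomial approximation used in this derivation can be watched converging numerically. The Python sketch below (standard library only) splits an interval of length $t$ into $n$ subintervals and compares the Binomial$(n, \lambda t / n)$ probability with the Poisson$(\lambda t)$ probability. The values $\lambda = 3$, $t = 2$ and $x = 6$ are hypothetical, chosen only for illustration, as are the function names.

```python
from math import comb, exp, factorial

def binom_pmf(x, n, p):
    """P(X = x) for n independent trials with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, mu):
    """P(X = x) for a Poisson distribution with parameter mu."""
    return exp(-mu) * mu**x / factorial(x)

lam, t, x = 3.0, 2.0, 6        # hypothetical rate, interval length and count
mu = lam * t

errors = []
for n in (10, 100, 1000, 10000):
    p = lam * t / n            # success probability for one subinterval of length t / n
    errors.append(abs(binom_pmf(x, n, p) - poisson_pmf(x, mu)))

print(errors)
```

For these values the absolute error shrinks roughly in proportion to $1/n$ as the subintervals get shorter.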
Interpretation of $\mu$ and $\lambda$: $\lambda$ is referred to as the intensity or rate of occurrence parameter for the
events. It represents the average rate of occurrence of events per unit of time (or area or volume, as
discussed below). Then $\lambda t = \mu$ represents the average number of occurrences in $t$ units of time. It
is important to note that the value of $\lambda$ depends on the units used to measure time. For example, if
phone calls arrive at a store at an average rate of 20 per hour, then $\lambda = 20$ when time is in hours and the
average in 3 hours will be $3 \times 20$ or 60. However, if time is measured in minutes then $\lambda = 20/60 = 1/3$;
the average in 180 minutes (3 hours) is still $(1/3)(180) = 60$.
Example: Suppose earthquakes recorded in Ontario each year follow a Poisson process with an
average of 6 per year. What is the probability that 7 will be recorded in a 2-year period?
Solution: In this case $t = 2$ (years) and the intensity of earthquakes is $\lambda = 6$. Therefore $X$, the number
of earthquakes in the two-year period, follows a Poisson distribution with parameter $\mu = \lambda t = 12$. The
probability that 7 earthquakes will be recorded in a 2-year period is $f(7) = \frac{12^7 e^{-12}}{7!} = 0.0437$.
Example: At a nuclear power station an average of 8 leaks of heavy water are reported per year. Find
the probability of 2 or more leaks in 1 month, if leaks follow a Poisson process.
Solution: Assume leaks satisfy the conditions for a Poisson process and that a month is 1=12 of a year.
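Let $X$ be the number of leaks in one month. Then $X$ has the Poisson distribution with $\lambda = 8$ and
$t = 1/12$, so $\mu = \lambda t = 8/12$. Thus
$$\begin{aligned}
P(X \ge 2) &= 1 - P(X < 2) \\
&= 1 - [f(0) + f(1)] \\
&= 1 - \left[\frac{(8/12)^0 e^{-8/12}}{0!} + \frac{(8/12)^1 e^{-8/12}}{1!}\right] \\
&= 0.1443
\end{aligned}$$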
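Both examples above follow the same recipe: convert the rate and the interval length to $\mu = \lambda t$, then evaluate the Poisson probability function. A minimal Python sketch (standard library only, illustrative names) is:

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """P(X = x) for a Poisson distribution with parameter mu."""
    return exp(-mu) * mu**x / factorial(x)

# Earthquakes: rate 6 per year over t = 2 years, so mu = 12.
p_seven_quakes = poisson_pmf(7, 6 * 2)

# Leaks: rate 8 per year over t = 1/12 year, so mu = 8/12.
mu_leaks = 8 * (1 / 12)
p_two_or_more_leaks = 1 - (poisson_pmf(0, mu_leaks) + poisson_pmf(1, mu_leaks))

print(round(p_seven_quakes, 4), round(p_two_or_more_leaks, 4))
```

Note that the rate and the interval length must be expressed in the same time unit (years here) before they are multiplied.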
Random Occurrence of Events in Space: The Poisson process also applies when “events” occur
randomly in space (either 2 or 3 dimensions). For example, the “events” might be bacteria in a volume
of water or blemishes in the finish of a paint job on a metal surface. If X is the number of events in a
volume or area in space of size $v$ and if $\lambda$ is the average number of events per unit volume (or area),
then $X$ has a Poisson distribution with $\mu = \lambda v$. For this model to be valid, it is assumed that the
Poisson process conditions given previously apply here, with “time” replaced by “volume” or “area”.
Once again, note that the value of $\lambda$ depends on the units used to measure volume or area.
Example: Coliform bacteria occur in river water with an average intensity of 1 bacteria per 10 cubic
centimeters of water. Find (a) the probability there are no bacteria in a 20 cubic centimeter sample
of water which is tested, and (b) the probability there are 5 or more bacteria in a 50 cubic centimeter
sample. (To do this assume that a Poisson process describes the location of bacteria in the water at any
given time.)
Solution: Let $X$ = number of bacteria in a $v$ cubic centimeter sample of water. Since $\lambda = 0.1$
bacteria per 1 cubic centimeter (1 per 10 cubic centimeters) the probability function of $X$ is Poisson
with $\mu = 0.1v$,
$$f(x) = \frac{(0.1v)^x e^{-0.1v}}{x!} \quad \text{for } x = 0, 1, 2, \ldots$$
Thus we find
(a) With $v = 20$, $\mu = 2$ so $P(X = 0) = f(0) = e^{-2} = 0.135$
(b) With $v = 50$, $\mu = 5$ so $f(x) = e^{-5} 5^x / x!$ and $P(X \ge 5) = 1 - P(X \le 4) = 1 - 0.440 = 0.560$
(Note: we can use the R command ppois(4, 5) to get $P(X \le 4)$.)
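As an alternative to the R command, the same two answers can be obtained with a small Python sketch (standard library only). The helper `poisson_cdf` is an illustrative stand-in that sums the probability function; it is not a library routine.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """P(X = x) for a Poisson distribution with parameter mu."""
    return exp(-mu) * mu**x / factorial(x)

def poisson_cdf(x, mu):
    """P(X <= x), obtained by summing the probability function from 0 to x."""
    return sum(poisson_pmf(i, mu) for i in range(x + 1))

rate_per_cc = 0.1                                            # bacteria per cubic centimeter

p_none_in_20 = poisson_pmf(0, rate_per_cc * 20)              # part (a), mu = 2
p_five_plus_in_50 = 1 - poisson_cdf(4, rate_per_cc * 50)     # part (b), mu = 5

print(round(p_none_in_20, 3), round(p_five_plus_in_50, 3))
```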
Exercise: In each of the above examples, how well is each of the conditions for a Poisson process
likely to be satisfied?
Distinguishing Poisson from Binomial and Other Distributions

Students often have trouble knowing when to use the Poisson distribution and when not to use it. To be
certain, the three conditions for a Poisson process need to be checked. However, a quick decision can
often be made by asking yourself the following questions:
1. Can we specify in advance the maximum value which X can take?

If we can, then the distribution is not Poisson. If there is no fixed upper limit, the distribution
might be Poisson, but is certainly not Binomial or Hypergeometric, e.g. the number of seeds
which germinate out of a package of 25 does not have a Poisson distribution since we know in
advance that $X \le 25$. The number of cardinals sighted at a bird feeding station in a week might
be Poisson since we can’t specify a fixed upper limit on X. At any rate, this number would not
have a Binomial or Hypergeometric distribution. Of course if it is Binomial with a very large
value of n and a small value of p we may still use the Poisson distribution, but in this case it is
being used to approximate a Binomial.
2. Does it make sense to ask how often the event did not occur?
If it does make sense, the distribution is not Poisson. If it does not make sense, the distribution
might be Poisson. For example, it does not make sense to ask how often a person did not hiccup
during an hour. So the number of hiccups in an hour might have a Poisson distribution. It would
certainly not be Binomial, Negative Binomial, or Hypergeometric. If a coin were tossed until the
3rd head occurs it does make sense to ask how often heads did not come up. So the distribution
would not be Poisson. (In fact, we’d use Negative Binomial for the number of non-heads or tails.)
Problems
5.8.1 Suppose that emergency calls to 911 follow a Poisson process with an average of 3 calls per
minute. Find the probability there will be
(a) 6 calls in a period of 2.5 minutes.
(b) 2 calls in the first minute of a 2.5 minute period, given that 6 calls occur in the entire period.
5.8.2 Misprints are distributed randomly and uniformly in a book, at a rate of 2 per 100 lines.
(a) What is the probability a line is free of misprints?

(b) Two pages are selected at random. One page has 80 lines and the other 90 lines. What is
the probability that there are exactly 2 misprints on each of the two pages?
5.9 Combining Other Models with the Poisson Process

While we’ve considered the model distributions in this chapter one at a time, we will sometimes need
to use two or more distributions to answer a question. To handle this type of problem you’ll need to
be very clear about the characteristics of each model. Here is a somewhat artificial illustration. Lots of
other examples are given in the problems at the end of the chapter.
Example: A very large (essentially infinite) number of ladybugs is released in a large orchard. They
scatter randomly so that on average a tree has 6 ladybugs on it. Trees are all the same size.
(a) Find the probability a tree has > 3 ladybugs on it.
(b) When 10 trees are picked at random, what is the probability 8 of these trees have > 3 ladybugs
on them?
(c) Trees are checked until 5 with > 3 ladybugs are found. Let X be the total number of trees
checked. Find the probability function, f (x).
(d) Find the probability a tree with > 3 ladybugs on it has exactly 6.
(e) On 2 trees there are a total of t ladybugs. Find the probability that x of these are on the first of
these 2 trees.
Solution:
(a) If the ladybugs are randomly scattered the most suitable model is the Poisson distribution with
$\lambda = 6$ and $v = 1$ (that is, any tree has a “volume” of one unit), so $\mu = 6$ and
$$\begin{aligned}
P(X > 3) &= 1 - P(X \le 3) = 1 - [f(0) + f(1) + f(2) + f(3)] \\
&= 1 - \left[\frac{6^0 e^{-6}}{0!} + \frac{6^1 e^{-6}}{1!} + \frac{6^2 e^{-6}}{2!} + \frac{6^3 e^{-6}}{3!}\right] = 0.8488
\end{aligned}$$
(b) Using the Binomial distribution where “success” means > 3 ladybugs on a tree, we have n = 10,
$p = 0.8488$ and
$$f(8) = \binom{10}{8} (0.8488)^8 (1 - 0.8488)^2 = 0.2772$$
(c) Using the Negative Binomial distribution, we need the number of successes, k, to be 5, and the
number of failures to be (x 5). Then
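$$\begin{aligned}
f(x) &= \binom{x-5+5-1}{x-5} (0.8488)^5 (1 - 0.8488)^{x-5} \\
&= \binom{x-1}{x-5} (0.8488)^5 (1 - 0.8488)^{x-5} \\
&= \binom{x-1}{4} (0.8488)^5 (0.1512)^{x-5} \quad \text{for } x = 5, 6, 7, \ldots
\end{aligned}$$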
(d) This is conditional probability. Let $A = \{\text{6 ladybugs}\}$ and
$B = \{\text{more than 3 ladybugs}\}$. Then
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(\text{6 ladybugs})}{P(\text{more than 3 ladybugs})} = \frac{\frac{6^6 e^{-6}}{6!}}{0.8488} = 0.1892$$
(e) Again we need to use conditional probability.
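$$\begin{aligned}
P(x \text{ on 1st tree} \mid \text{total of } t) &= \frac{P(x \text{ on 1st tree and total of } t)}{P(\text{total of } t)} \\
&= \frac{P(x \text{ on 1st tree and } t - x \text{ on 2nd tree})}{P(\text{total of } t)} \\
&= \frac{P(x \text{ on 1st tree}) \, P(t - x \text{ on 2nd tree})}{P(\text{total of } t)}
\end{aligned}$$
Use the Poisson distribution to calculate each, with $\mu = 6 \times 2 = 12$ in the denominator since
there are 2 trees.
$$\begin{aligned}
P(x \text{ on 1st tree} \mid \text{total of } t) &= \frac{\frac{6^x e^{-6}}{x!} \cdot \frac{6^{t-x} e^{-6}}{(t-x)!}}{\frac{12^t e^{-12}}{t!}} \\
&= \frac{t!}{x!\,(t-x)!} \left(\frac{6}{12}\right)^{x} \left(\frac{6}{12}\right)^{t-x} \\
&= \binom{t}{x} \left(\frac{1}{2}\right)^{x} \left(1 - \frac{1}{2}\right)^{t-x} \quad \text{for } x = 0, 1, \ldots, t
\end{aligned}$$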
Caution: Don’t forget to state the range of X. If the total is t, there couldn’t be more than t ladybugs
on the 1st tree.
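The five parts of the ladybug example chain several models together, which makes them a convenient numerical check. The following Python sketch (standard library only) mirrors parts (a) to (e); all function and variable names are illustrative, and the values $t = 7$, $x = 3$ used for part (e) are hypothetical. Unlike the worked solution, the sketch carries the unrounded value of $p$ forward, so later digits may differ very slightly from those obtained with $p = 0.8488$.

```python
from math import comb, exp, factorial

def poisson_pmf(x, mu):
    """P(X = x) for a Poisson distribution with parameter mu."""
    return exp(-mu) * mu**x / factorial(x)

mu = 6

# (a) Poisson: probability a tree has more than 3 ladybugs.
p = 1 - sum(poisson_pmf(x, mu) for x in range(4))

# (b) Binomial: 8 of 10 trees have more than 3 ladybugs.
p_eight_of_ten = comb(10, 8) * p**8 * (1 - p)**2

# (c) Negative Binomial in terms of total trees checked, x = 5, 6, 7, ...
def trees_checked_pmf(x, p, k=5):
    return comb(x - 1, k - 1) * p**k * (1 - p)**(x - k)

# (d) Conditional probability of exactly 6 given more than 3.
p_six_given_more_than_3 = poisson_pmf(6, mu) / p

# (e) Conditional probability of x on the first tree given a total of t on two trees.
def first_tree_given_total(x, t):
    return poisson_pmf(x, 6) * poisson_pmf(t - x, 6) / poisson_pmf(t, 12)

t, x = 7, 3   # hypothetical values for illustrating part (e)
print(round(p, 4), round(p_eight_of_ten, 4), round(p_six_given_more_than_3, 4))
print(first_tree_given_total(x, t), comb(t, x) * 0.5**t)
```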
Exercise: The answer to (e) is a Binomial probability function. Can you reach this answer by general
reasoning rather than using conditional probability to derive it?
Problems
5.9.1 In a Poisson process the average number of occurrences is $\lambda$ per minute. Independent 1 minute
intervals are observed until the first minute with no occurrences is found. Let X be the number
of 1 minute intervals required, including the last one. Find the probability function, f (x).
5.9.2 Calls arrive at a telephone distress centre during the evening according to the conditions for a
Poisson process. On average there are 1.25 calls per hour.
(a) Find the probability there are no calls during a 3 hour shift.
(b) Give an expression for the probability a person who starts working at this centre will have
the first shift with no calls on the fifteenth shift.
(c) A person works one hundred 3 hour evening shifts during the year. Give an expression for
the probability there are no calls on at least 4 of these 100 shifts. Calculate a numerical
answer using a Poisson approximation.
5.10 Summary of Probability Functions for Discrete Random Variables

Name                 Probability Function
Discrete Uniform     $f(x) = \frac{1}{b-a+1}$;  $x = a, a+1, a+2, \ldots, b$
Hypergeometric       $f(x) = \frac{\binom{r}{x}\binom{N-r}{n-x}}{\binom{N}{n}}$;  $x = \max(0, n-(N-r)), \ldots, \min(n, r)$
Binomial             $f(x) = \binom{n}{x} p^x (1-p)^{n-x}$;  $x = 0, 1, 2, \ldots, n$
Negative Binomial    $f(x) = \binom{x+k-1}{x} p^k (1-p)^x$;  $x = 0, 1, 2, \ldots$
Geometric            $f(x) = p(1-p)^x$;  $x = 0, 1, 2, \ldots$
Poisson              $f(x) = \frac{e^{-\mu} \mu^x}{x!}$;  $x = 0, 1, 2, \ldots$
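One way to become familiar with the table is to code each probability function with its range and confirm that it sums to 1. The Python sketch below (standard library only) does this for hypothetical parameter values chosen purely for illustration; the infinite ranges are truncated at a point where the remaining tail is negligible for those values.

```python
from math import comb, exp, factorial

# Hypothetical parameter values, chosen only for illustration.
a, b = 2, 7
N, r, n = 20, 8, 5
p, k, mu = 0.3, 4, 2.5

# Each entry: (probability function, range of x over which it is summed).
pmfs = {
    "Discrete Uniform": (lambda x: 1 / (b - a + 1), range(a, b + 1)),
    "Hypergeometric": (lambda x: comb(r, x) * comb(N - r, n - x) / comb(N, n),
                       range(max(0, n - (N - r)), min(n, r) + 1)),
    "Binomial": (lambda x: comb(n, x) * p**x * (1 - p)**(n - x), range(n + 1)),
    "Negative Binomial": (lambda x: comb(x + k - 1, x) * p**k * (1 - p)**x, range(400)),
    "Geometric": (lambda x: p * (1 - p)**x, range(400)),
    "Poisson": (lambda x: exp(-mu) * mu**x / factorial(x), range(100)),
}

totals = {name: sum(f(x) for x in support) for name, (f, support) in pmfs.items()}

for name, total in totals.items():
    print(f"{name:18s} {total:.10f}")
```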

1. The random variable X has probability function given by
x       0      1      2      3      4
f(x)    0.1c   0.2c   0.5c   c      0.2c
(a) Find $c$ and $P(X > 2)$.
(b) Find $F(x) = P(X \le x)$, the cumulative distribution function for $X$.
2. The range of the random variable $X$ is $A = \{1, 2, 3, 4, 5\}$. For $x \in A$ the cumulative distribution
function for $X$ is given by
x       1      2      3      4      5
F(x)    0.1k   0.2    0.5k   k      4k^2
(a) Find $k$ and $P(2 < X \le 4)$.

(b) Find f (x) = P (X = x), the probability function for X. Draw a probability histogram for
f (x).
3. The range of the random variable $X$ is $A = \{0, 1, 2, \ldots\}$. For $x = 0, 1, 2, \ldots$, the cumulative
distribution function of $X$ is given by
$$F(x) = P(X \le x) = 1 - 2^{-x}$$
(a) Find P (X = 5) and P (X 5).

(b) Find f (x) = P (X = x), the probability function of X.
4. Two balls are drawn at random from a box containing ten balls numbered $0, 1, \ldots, 9$. Let the
random variable X be the maximum of the two numbers drawn and let the random variable Y be
the total of the two numbers drawn.
(a) If sampling is done without replacement, determine

(i) the probability function of X
(ii) the probability function of Y .
(b) If sampling is done with replacement, determine
(i) the probability function of X
(ii) the probability function of Y .
5. Suppose X is a discrete random variable with probability function
$$f(x) = P(X = x) = p(1-p)^x \quad \text{for } x = 0, 1, 2, \ldots$$
(a) Verify that $\sum_{x=0}^{\infty} f(x) = 1$.
(b) Find $P(X < x)$ for $x = 0, 1, \ldots$.
(c) Find the probability that X is an odd number.
(d) Find the probability that X is divisible by 3.
(e) Find the probability function of the random variable R, where R is the remainder when X
is divided by 4.
6. In a box of 1000 computer chips, 5% are defective. Twenty computer chips are drawn at random
without replacement and tested for defects. Let the random variable X be the number of defective
computer chips found.
(a) Give the probability function for X.

(b) Give an expression for the probability that at least two chips are defective.
(c) Approximate the probability in (b) using a suitable approximation. Justify the approximation.
7. Jury selection: During jury selection a large number of people are asked to be present, then
persons are selected one by one in a random order until the required number of jurors has been
chosen. Because the prosecution and defense teams can each reject a certain number of persons,
and because some individuals may be exempted by the judge, the total number of persons selected
before a full jury is found can be quite large.
(a) Suppose that you are one of 150 persons asked to be present for the selection of a jury. If
it is necessary to select 40 persons in order to form the jury, what is the probability you are
chosen?
(b) In a recent trial there were 74 men and 76 women present for jury selection. Twelve people
are chosen at random without replacement for a jury of 12 people. Let Y be the number of
men chosen. Give an expression for P (Y = y).
(c) For the trial in part (b), the number of men selected turned out to be two. Find $P(Y \le 2)$.
What might you conclude from this?
8. A string of zeros and ones of length $10^4$ is sent over a network. Suppose each bit has probability
$10^{-5}$ of being corrupted, independently for each bit.
(a) Give an expression for the probability that no bits are corrupted.
(b) Give an expression for the probability that at most one bit is corrupted.
(c) Approximate the probabilities in (a) and (b) using a suitable approximation. Justify the
approximation.
9. An oil company runs a contest in which there are 500,000 tickets; a motorist receives one ticket
with each fill-up of gasoline, and 500 of the tickets are winners.
(a) If a motorist has 10 fill-ups during the contest, give an expression for the probability that he
or she wins at least one prize. Approximate this probability using a suitable approximation.
Justify the approximation.
(b) If a particular gas bar distributes 2000 tickets during the contest, give an expression for
the probability that there is at least one winner among the gas bar’s customers. Use two
different approximations to approximate this probability. Justify the approximation in each
case.
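10. Let $f(x) = \frac{\binom{r}{x}\binom{N-r}{n-x}}{\binom{N}{n}}$. We want to determine $\lim_{N \to \infty} f(x)$ such that $p = \frac{r}{N}$ is held fixed.
(a) Use $\binom{a}{x} = \frac{a^{(x)}}{x!}$ and $r = pN$ to show that
$$\frac{\binom{r}{x}\binom{N-r}{n-x}}{\binom{N}{n}} = \binom{n}{x} \frac{(pN)^{(x)} \, [(1-p)N]^{(n-x)}}{N^{(x)} (N-x)^{(n-x)}}$$
(b) Show that
$$\lim_{N \to \infty} \binom{n}{x} \frac{(pN)^{(x)} \, [(1-p)N]^{(n-x)}}{N^{(x)} (N-x)^{(n-x)}} = \binom{n}{x} p^x (1-p)^{n-x}$$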
(c) What is the importance of the result in (b)?
11. A bin at a hardware store contains 35 forty watt lightbulbs and 70 sixty watt bulbs. A customer
wants to buy 8 sixty watt bulbs, and withdraws bulbs without replacement until these 8 bulbs
have been found. Let the random variable X be the number of 40 watt bulbs drawn from the bin.
Find the probability function of X.
12. Todd buys a lottery ticket every week. Suppose 1% of the tickets win some prize.
(a) Todd buys one ticket every week for the next 20 weeks. What is the probability he bought
no winning tickets? at least one winning ticket?
(b) What is the probability he does not get a winning ticket in the first 30 weeks?
(c) Todd realizes he is spending too much money on lottery tickets so he decides he will continue
buying tickets only until he has bought 4 winning tickets. Find an expression for the
probability he will have to buy a ticket every week for at least the next 100 weeks in order
to achieve his goal.
13. A coffee chain claims that you have a 1 in 9 chance of winning a prize on their “roll up the edge”
promotion, where you roll up the edge of your paper cup to see if you win.
(a) What is the probability that you have no winners in a one week period in which you bought
fifteen cups of coffee?
(b) What is the probability that you get your first win when you buy your twenty-fifth cup of
coffee?
(c) Over the last week of a month long promotion you and your friends bought 60 cups of
coffee, but there was only one winner. Find the probability that there would be one or
fewer winners. What would you conclude?
14. Suppose \(X \sim \text{Geometric}(p)\).
(a) Find an expression for \(P(X \ge x)\), and show that \(P(X \ge s + t \mid X \ge s) = P(X \ge t)\) for
all non-negative integers s, t. Explain why this property of the Geometric distribution is
called the “memoryless” property.
(b) What is the most probable value of X?
15. Requests to a web server are assumed to follow a Poisson process. On average there are two
requests per second.
(a) Discuss briefly whether or not you think the three assumptions for a Poisson process would
hold reasonably well in this situation.
(b) Find the probability of three or more requests in a one second interval.
(c) Give an expression for the probability of more than 125 requests in a one minute interval.
16. A waste disposal company averages 6.5 spills of toxic waste per month. Assume spills occur
randomly at a uniform rate, and independently of each other, with a negligible chance of two
or more occurring at the same time. Find the probability there are four or more spills in a two
month period.
17. Coliform bacteria are distributed randomly and uniformly throughout river water at the average
concentration of one per 20 cubic centimeters of water.
(a) What is the probability of finding exactly 2 coliform bacteria in a 10 cubic centimeters
sample of the river water?
(b) What is the probability of finding at least 1 coliform bacterium in a 1 cubic centimeter
sample of the river water?
(c) In testing for the concentration (average number per unit volume) of bacteria it is possible to
determine cheaply whether a sample has any bacteria present or not. Suppose the average
concentration of bacteria in a body of water is \(\lambda\) per cubic centimeter. If 10 independent
water samples of 10 cubic centimeters each are tested, let the random variable Y be the
number of samples with no bacteria. Find P (Y = y).
(d) Suppose that in 10 independent samples, there were exactly 3 samples with no bacteria.
Give an estimate for the value of \(\lambda\).
18. In a group of policy holders for house insurance, the average number of claims per 100 policies
per year is \(\lambda = 8.0\). The number of claims for an individual policy holder is assumed to follow a
Poisson distribution.
(a) In a given year, what is the probability an individual policy holder has at least 1 claim?
(b) In a group of 20 policy holders, what is the probability there are no claims in a given year?
What is the probability there are 2 or more claims?
19. Assume power failures occur independently of each other at a uniform rate through the months of
the year, with little chance of 2 or more occurring simultaneously. Suppose that 80% of months
have no power failures.
(a) Seven months are picked at random. What is the probability that 5 of these months have no
power failures?
(b) Months are picked at random until 5 months without power failures have been found. What
is the probability that 7 months will have to be picked?
(c) What is the probability a month has more than one power failure?
20. Spruce budworms are distributed through a forest according to a Poisson process so that the
average is \(\lambda\) per hectare.
(a) Give an expression for the probability that at least 1 of n one hectare plots contains at least
k spruce budworms.
(b) Discuss briefly which assumption(s) for a Poisson process may not be well satisfied in this
situation.
21. A person working in telephone sales has a 20% chance of making a sale on each call, with
calls being independent. Assume calls are made at a uniform rate, with the numbers made in
non-overlapping periods being independent. On average there are 20 calls made per hour.
(a) Find the probability there are 2 sales in 5 calls.

(b) Find the probability exactly 8 calls are needed to make 2 sales.
(c) If 8 calls were needed to make 2 sales, what is the probability there was 1 sale in the first 3
of these calls?
(d) Find the probability of 3 calls being made in a 15 minute period.
22. During rush hour the number of cars passing through a particular intersection22 is assumed to
follow a Poisson process. On average there are 540 cars per hour.
(b) Find the probability that 11 cars passed through the intersection in a thirty second interval.
(c) Find the probability that 11 or more cars passed through the intersection in a thirty second
interval.
(d) Find the probability that when 20 disjoint thirty second intervals are studied, exactly 2 of
them had 11 cars pass through the intersection.
(e) We want to find 12 disjoint thirty second intervals in which 11 cars passed through the
intersection.
(i) Give an exact expression for the probability that 1000 disjoint 30 second intervals have
to be observed to find the 12 having the desired traffic flow.
(ii) Use an appropriate approximation to evaluate this probability and justify why this ap-
proximation is suitable.
22
“Traffic signals in New York are just rough guidelines.” David Letterman (1947 - )
23. Bubbles are distributed in sheets of glass, as a Poisson process, at an intensity of \(\lambda = 1.2\) bubbles
per square metre. Sheets of glass, each of area \(0.8\ \text{m}^2\), are manufactured.
(a) What is the probability a sheet of glass has no bubbles?

(b) What is the probability a sheet of glass has more than one bubble?
(c) Let X be the number of sheets of glass, in a shipment of n sheets, which have no bubbles.
What is the probability function of the random variable X?
(d) In a shipment of 100 sheets, what is the probability more than ten sheets have more than
one bubble?
(e) If the glass manufacturer wants to have at least 50% of the sheets of glass with no bubbles,
how small should the intensity be to achieve this?
(f) If the glass manufacturer wants to ensure that 95% of all sheets manufactured have fewer
than two bubbles, how small should the intensity be to achieve this?
24. Polls and Surveys: Polls or surveys in which people are selected and their opinions or other
characteristics are determined are very widely used. For example, in a survey on cigarette use
among teenage girls, we might select a random sample of n girls from the population in question,
and determine the number X who are regular smokers. If p is the fraction of girls who smoke,
then \(X \sim \text{Binomial}(n, p)\). Since p is unknown (that is why we do the survey) we then estimate
it as \(\hat{p} = X/n\). (In statistics a “hat” is used to denote an estimate of a model parameter based
on data.) The Binomial distribution can be used to study how “good” such estimates are, as
follows:
(a) Suppose p = 0.3 and n = 100. Find the probability \(P(0.27 \le X/n \le 0.33)\). Many surveys
try to get an estimate X/n which is within 3% (0.03) of p with high probability. What
would you conclude here?
(b) Repeat the calculation in (a) if n = 400 and n = 1000. What do you conclude?
(c) If p = 0.5 instead of 0.3, find \(P(0.47 \le X/n \le 0.53)\) when n = 400 and 1000.
(d) Your employer asks you to design a survey to estimate the fraction p of persons age 25-34
who download music via the internet. The objective is to get an estimate accurate to within
3%, with probability close to 0.95. What size of sample n would you recommend?
25. Telephone surveys: In some “random digit dialing” surveys, a computer phones randomly se-
lected telephone numbers. However, not all numbers are “active” (belong to a telephone account)
and some numbers belong to businesses.
Suppose that for a given large set of telephone numbers, 57% are active residential or individual
numbers. We will call these “personal” numbers.
Suppose that we wish to interview (over the phone) 1000 persons in a survey.
(a) Suppose that the probability a call to a personal number is answered is 0.8, and that the
probability the person answering agrees to be interviewed is 0.7. Give the probability
distribution for X, the number of calls needed to obtain 1000 interviews.
(b) Use the statistical software R (see Chapter 6) to find \(P(X \le x)\) for x = 2900, 3000, 3100,
3200.
(c) Suppose instead that 3200 randomly selected numbers were dialed. Give the probability
function for Y, the number of interviews obtained, and find \(P(Y \ge 1000)\).
26. Hash tables continued: See Chapter 3, Problem 15. When a hash function is used to create
a data structure for a dictionary a collision can occur when a key-value pair is mapped to a slot
which has already been assigned to another key-value pair. One strategy for handling collisions is
to create a linked list of all the key-value pairs which are mapped to the same slot. This is called
collision resolution by separate chaining. Suppose separate chaining is used with a hash table of
size M and that the slot for key k is chosen by randomly selecting a number with replacement
from the set \(\{0, 1, \ldots, M-1\}\).
(a) For n keys show that the probability that a given list contains exactly x keys is equal to
\[ \binom{n}{x} \left(\frac{1}{M}\right)^{x} \left(1 - \frac{1}{M}\right)^{n-x} \quad \text{for } x = 0, 1, \ldots, n \]
(b) Let \(\lambda = n/M\) (called the load factor). Under what conditions can the probability in (a) be
approximated by the Poisson probability
\[ \frac{\lambda^{x} e^{-\lambda}}{x!} \]
(c) If \(X \sim \text{Poisson}(\lambda)\) then the Chernoff bound for tail probabilities gives
\[ P(X \ge x) \le e^{-\lambda} \left(\frac{e\lambda}{x}\right)^{x} \]
if \(x > \lambda\). If \(\lambda = 10\), then use this inequality to bound the probability that a list has
(i) 15 or more keys (ii) 20 or more keys
What are the implications of these results?
27. The ALOHA protocol for sending messages over wireless connections works as follows:
Messages of length t are sent by multiple users, without checking if the frequency is busy.
If two messages are sent during overlapping time intervals, both messages fail.
If a message fails, the user waits a random amount of time and then tries again.
Suppose messages are sent according to a Poisson process with a constant rate of \(\lambda\) messages per
t units of time.
(b) Find the probability that 3 or more users send a message within t units of time, if \(\lambda = 0.75\).
(c) Find the probability a message sent at time x succeeds if \(\lambda = 0.75\). Hint: For this event to
happen, there must be no other messages sent between \(x - t\) and \(x + t\).
(d) If \(\lambda = 0.75\) and a message has just been sent, find the probability of waiting at least 3t units
of time until the next message is sent.
(e) Slotted ALOHA is an updated protocol for sending messages where discrete timeslots of
length t are set up, and messages can only be sent at the beginning. If \(\lambda = 0.75\), find the
probability a timeslot has a message successfully sent in it. Find the probability that of 10
slots, exactly 7 have successful messages sent.
28. Error Correcting Codes 1: A message consisting of a string of zeros and ones is sent over a
network. Due to interference in the network a bit can randomly “flip” from 0 to 1 or vice versa
during transmission. An error correcting code (ECC) is a process of adding redundant data to
the message by the transmitter which allows error detection or correction by the receiver. The
Triple Repetition Code (TRC) is an ECC for which each bit is sent three times. For example the
message 0110 is sent as 000111111000. The receiver looks at the received message in groups of
three and decodes each group to the bit that occurs most often in the group (called the majority
rule). For example, 000, 001, 010, 100 are decoded as 0 while 111, 110, 101, 011 are decoded as
1. TRC allows the correction of at most one error in each group of three bits. Suppose each bit
has probability p of flipping, independently for each bit.
(a) What is the probability a group of three repeated bits will be decoded correctly?
(b) What is the probability an original message of length four (e.g. 0110) is decoded correctly
if no ECC is used?
(c) What is the probability an original message of length four (e.g. 0110) is decoded correctly
if TRC is used?
(d) Compare the probabilities in (b) and (c) for p = 0.2, 0.1, 0.05 and 0.01.
(e) Suppose each bit is repeated five times and the majority rule is used. What is the probability
a group of five repeated bits will be decoded correctly?
(f) Suppose each bit is repeated k times, where k is an odd number, and the majority rule is
used. Let P(k, p) be the probability a group of k repeated bits will be decoded correctly.
Find an expression for P(k, p). Calculate P(k, p) for p = 0.2, 0.1, and 0.05 and
k = 3, 5, 7, 9. What do you notice?
29. Error Correcting Codes 2: Hamming(7,4) is another type of ECC in which three parity bits are
added to a four bit string so a total of seven bits are transmitted. The three parity bits can be used
to correct at most one error in the string of seven bits received. Suppose each bit has probability
p of flipping, independently for each bit.
(a) What is the probability a correctable message is received using Hamming(7,4) (at most one
bit is flipped)?
(b) Calculate the probability in (a) for p = 0.2, 0.1, 0.05 and 0.01.
(c) Compare the probabilities in (b) with the probabilities obtained in Problem 28(d). What do
you notice?
30. Challenge problem: Suppose that n independent tosses of a coin with P(Head) = p are made.
Show that the probability of an even number of heads is given by \(\frac{1}{2}\left[1 + (q - p)^n\right]\) where \(q = 1 - p\).
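Before attempting a proof, the identity in Problem 30 can be checked numerically. The following R sketch (R is introduced in Chapter 6) sums Binomial probabilities over the even counts and compares the result with the closed form. The function names and the values of n and p are chosen here only for illustration, and a numerical check is of course not a proof.

```r
# Probability of an even number of heads in n tosses, by summing the
# Binomial(n, p) probability function over x = 0, 2, 4, ...
p_even_direct <- function(n, p) {
  sum(dbinom(seq(0, n, by = 2), n, p))
}

# Closed form stated in Problem 30
p_even_formula <- function(n, p) {
  q <- 1 - p
  0.5 * (1 + (q - p)^n)
}

n <- 7; p <- 0.3                      # illustrative values only
direct  <- p_even_direct(n, p)
formula <- p_even_formula(n, p)
print(c(direct = direct, formula = formula))
```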
6. COMPUTATIONAL METHODS AND
THE STATISTICAL SOFTWARE R
One of the giant steps towards democracy in the last century was the increased democratization of
knowledge23 , facilitated by the personal computer, Wikipedia and the advent of free open-source (GNU)
software such as Linux. The statistical software package R implements a dialect of the S language that
was developed at AT&T Bell Laboratories by Rick Becker, John Chambers and Allan Wilks. Versions
of R are available, at no cost, for 32-bit versions of Microsoft Windows, for Linux, for Unix and for
Macintosh systems. It is available through the Comprehensive R Archive Network (CRAN) (downloadable
for unix, windows or MAC platforms at http://cran.r-project.org/ ). This means that a community
of interested statisticians voluntarily maintains and updates the software. Like the licensed software
Matlab and Splus, R permits easy matrix and numerical calculations, as well as a programming en-
vironment for high-level computations. The R software also provides a powerful tool for handling
probability distributions, generating random variables, and graphical display. Because it is freely avail-
able and used by statisticians world-wide, high level programs in R are often available on the web.
These notes provide a glimpse of a few of the features of R. Web resources have much more infor-
mation and more links can be found on the Stat 230 web page. We will provide a brief description of
commands on a windows machine here, but the MAC and UNIX commands will generally be similar
once R is started.
6.1 Preliminaries
Begin by installing R on your personal computer. To start R, type R on the Math Unix machines, or
click on the R icon on a Windows machine. For these notes, we will simply describe typing
commands into the R command window following the R prompt “>” in interactive mode.
Objects include variables, functions, vectors, arrays, lists and other items. To see online documentation
about something, we use the “help” function. For example, to see documentation on the function
mean(), type
help(mean)
In some cases help.search() is helpful. For example
help.search("matrix")
lists all functions whose help pages have a title or alias in which the text string “matrix” appears.
23 “Knowledge is the most democratic source of power.” Alvin Toffler
The <- is a left angle bracket (<) followed by a minus sign (-). It means “is assigned to”, for
example,
x <- 15
assigns the value 15 to variable x. To quit an R session, type
q()
You need the parentheses () because you wish to run the function “q”. Typing q on its own, without
the parentheses, displays the text of the function on the screen. Try it! Alternatively to quit R, you can
click on the “File” menu and then on Exit or on the x in the top right corner of the R window. You are
asked whether you want to save the workspace image. Clicking “Yes” (safer) will save all the objects
that remain in the workspace, both those at the start of the session and those added.
6.2 Vectors
Vectors can consist of either numbers or other symbols like characters; we will consider only numbers
here. Vectors are defined using c(): for example,
x<-c(1,3,5,7,9)
defines a vector of length 5 with the elements given. Vectors and other classes of objects possess certain
attributes. For example, typing
length(x)
will give the length of the vector x. Vectors are a convenient way to store values of a function (e.g. a
probability function or a c.d.f) or values of a random variable that have been recorded in some experi-
ment or process. We can also read a table of values from a text file that we created earlier called say
“mydata.txt” on a disk in drive c:
> mydata <- read.table("c:/mydata.txt", header=TRUE)

Use of “header=TRUE” causes R to use the first line of the text file to get header information for the
columns. If column headings are not included in the file, the argument can be omitted and we obtain a
table with just the data. The R object “mydata” is a special form known as a “data frame”. Data frames
that consist entirely of numeric data have a structure that is similar to that of numeric matrices. The
names of the columns can be displayed with the command
> names(mydata)
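The effect of “header=TRUE” can be tried without creating a file by hand. The sketch below writes three lines of the snow cover data from Problem 1 at the end of this chapter to a temporary file, which stands in for the hypothetical “c:/mydata.txt”, and reads it back with and without the header. The column names year and snow are made up for this example.

```r
# Write a small table with a header line to a temporary file
tmp <- tempfile(fileext = ".txt")     # stand-in for c:/mydata.txt
writeLines(c("year snow",
             "1970 6.5",
             "1971 12.0",
             "1972 14.9"), tmp)

mydata <- read.table(tmp, header = TRUE)
names(mydata)       # column names taken from the first line of the file
nrow(mydata)        # number of data rows; the header line is not counted

# If the header line is skipped and header=TRUE is omitted,
# R supplies the default column names V1, V2, ...
nohdr <- read.table(tmp, skip = 1)
names(nohdr)

unlink(tmp)         # remove the temporary file
```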
6.3 Arithmetic Operations

The following R commands and responses should explain the most basic arithmetic operations.
> 7+3
[1] 10
> 7*3
[1] 21
> 7/3
[1] 2.333333
> 2^3
[1] 8
In the last example the result is 8. The [1] says basically “first requested element follows” but here
there is just one element. The “>” indicates that R is ready for another command.
6.4 Some Basic Functions

Functions of many types exist in R. Many operate on vectors in a transparent way, as do arithmetic
operations. For example, if x and y are vectors then x+y adds the vectors element-wise; x and y
should be the same length, since otherwise R recycles the elements of the shorter vector. Some
examples, with comments, follow. Note that anything that follows a #
on the command line is taken as comment and ignored by R.
> x<- c(1,3,5,7,9) # Defines a vector x

> x # displays x
[1] 1 3 5 7 9
> y<- seq(1,2,.25) # defines vector whose elements are an
# arithmetic progression
> y
[1] 1.00 1.25 1.50 1.75 2.00

> y[2] #displays the second element of vector y
[1] 1.25
> y[c(2,3)] #displays 2nd and 3rd elements of vector y
[1] 1.25 1.50
> mean(x) #computes mean of the elements of vector x
[1] 5
> summary(x) #function which summarizes features of a vector x
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 3 5 5 7 9
> var(x) # Computes the (sample) variance of the elements of x
[1] 10
> exp(1) # The exponential function
[1] 2.718282
> exp(y)
[1] 2.718282 3.490343 4.481689 5.754603 7.389056
# round(y,n) rounds elements of vector y to n decimals
> round(exp(y),2)
[1] 2.72 3.49 4.48 5.75 7.39
> x+2*y
[1] 3.0 5.5 8.0 10.5 13.0
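A few more lines may help to show what “element-wise” means in practice, and what R does when the two vectors do not have the same length. This is only a sketch using the vectors x and y defined above; the vector z is introduced here for illustration.

```r
x <- c(1, 3, 5, 7, 9)
y <- seq(1, 2, 0.25)

x * y           # element-wise product, not a dot product
sum(x * y)      # the dot product of x and y

# A single number is a vector of length 1 and is recycled against every element
x + 10

# Vectors of unequal length: the shorter one is recycled. R gives a warning
# only when the longer length is not a multiple of the shorter length.
z <- x[1:4] + c(100, 200)
z
```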
6.5 R Objects
Type “ls()” to see a list of names of all objects, including functions and data structures, in your
workspace.
If you type the name of an object, vector, matrix or function, you are returned its contents. (Try
typing “q” or “mean”).
Before you quit, you may remove objects that you no longer require with “rm()” and then save the
workspace image. The workspace image is automatically loaded when you restart R in that directory.
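As a small illustration of these commands, the sketch below creates two objects, checks that one is listed in the workspace, and then removes both. The object names tmp_scores and tmp_label are hypothetical and exist only for this example.

```r
tmp_scores <- c(71, 85, 90)    # hypothetical objects created for this example
tmp_label  <- "midterm"

"tmp_scores" %in% ls()         # is the object listed in the workspace?
rm(tmp_scores)                 # remove one object by name
exists("tmp_scores")           # FALSE: the object is gone
rm(list = "tmp_label")         # rm() also accepts a character vector of names
```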
6.6 Graphs
To open a graphics window in Unix, type x11(). Note that in R, a graphics window opens automatically
when a graphical function is used.
There are various plotting and graphical functions. Two useful ones are
plot(x,y)   # Gives a scatterplot of x versus y; thus x and y must be
            # vectors of the same length.
hist(x)     # Creates a frequency histogram based on the values in the
            # vector x. To get a relative frequency histogram (areas of
            # rectangles sum to one) use
hist(x,prob=T)
Graphs can be tailored with respect to axis labels, titles, numbers of plots to a page etc. Type help(plot),
help(hist) or help(par) for some information. Try
x<-(0:20)*pi/10
plot(x, sin(x))
Is it obvious that these points lie on a sine curve? One can make it more obvious by changing
the shape of the graph. Place the cursor over the lower border of the graph sheet, until it becomes a
double-sided arrow, and then drag the border in towards the top border, to make the graph sheet short and
wide.
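Another way to make the curve easier to see is to join the points with line segments and label the axes. The following is one possible sketch; the title and axis labels are chosen here for illustration.

```r
x <- (0:20) * pi / 10

# type = "b" draws both the points and the line segments joining them
plot(x, sin(x), type = "b",
     xlab = "x", ylab = "sin(x)", main = "Sine curve on [0, 2*pi]")
abline(h = 0, lty = 2)     # dashed horizontal reference line at zero
```

Type help(plot) to see the other values that the argument type can take.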
To save/print a graph in R using UNIX, you generate the graph you would like to save/print in R using
a graphing function like plot() and type:
dev.print(device,file="filename")
where device is the device you would like to save the graph to and filename is the name of the file that
you would like the graph saved to. To look at a list of the different graphics devices you can save to,
type help(Devices).
To save/print a graph in R using Windows, you can do one of two things.

a) You can go to the File menu when the graph window is active and save the graph using one of
several formats (that is, postscript, jpeg, etc.) or print it. You may also copy the graph to the clipboard
using one of the formats and then paste to an editor, such as MS Word.
b) You can right click on the graph. This gives you a choice of copying the graph and then pasting
to an editor, such as MS Word, or saving the graph as a metafile or bitmap or print directly to a printer.
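An approach that should work the same way on Unix, Windows and Mac is to open a file device directly, draw the graph, and then close the device. The sketch below uses the pdf() device; the file name sine.pdf and the page dimensions are assumptions made for this example.

```r
out <- file.path(tempdir(), "sine.pdf")   # hypothetical output file
pdf(out, width = 8, height = 3)           # a short, wide page (in inches)
x <- (0:20) * pi / 10
plot(x, sin(x), type = "b")
invisible(dev.off())                      # close the device to finish writing the file
file.exists(out)                          # check that the file was created
```

Nothing appears on the screen in this case, since the graph is drawn straight into the file.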
6.7 Distributions
There are functions which compute values of probability or probability density functions, cumulative
distribution functions, and quantiles for various distributions. It is also possible to generate (pseudo)
random samples from these distributions. Some examples follow for Binomial and Poisson distribu-
tions. For other distribution information, type
help(rhyper),
help(rnbinom)
and so on. Note that R does not have any function specifically designed to generate random samples
from a discrete Uniform distribution (although there is one for a continuous Uniform distribution). To
generate n random samples from a discrete Uniform(a, b), use
sample(a:b,n,replace=T).
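For instance, rolls of a fair six-sided die can be simulated as a discrete Uniform(1, 6) sample. This is a sketch only: the seed and the number of rolls are arbitrary choices, and the seed is set just so that the run can be repeated.

```r
set.seed(230)                                 # arbitrary seed, for reproducibility only
rolls <- sample(1:6, 1000, replace = TRUE)    # 1000 simulated rolls of a fair die
table(rolls) / length(rolls)                  # relative frequencies, each near 1/6
```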
> y<- rbinom(10,100,0.25) # Generate 10 random values from the

# Binomial(100,0.25) distribution
# The values are stored in the vector y.
> y # Display the values
[1] 24 24 26 18 29 29 33 28 28 28
> pbinom(3,10,0.5) # Compute P(Y<=3) for a Binomial(10,0.5) r.v.
[1] 0.171875
> qbinom(.95,10,0.5) # Find the 0.95 quantile (95th percentile)
# for Binomial(10,0.5) distribution
[1] 8
> z<- rpois(10,10) # Generate 10 random values from the
# Poisson(10) distribution.
# The values are stored in the vector z.
> z # Display the values
[1] 6 5 12 10 9 7 9 12 5 9
> ppois(3,10) # Compute P(Y<=3) for a Poisson(10) r.v.
[1] 0.01033605
> qpois(.95,10) # Find the 0.95 quantile (95th percentile)
# for Poisson(10) distribution
[1] 15
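The prefixes d, p, q and r select the probability function, the cumulative distribution function, quantiles and random values respectively. The sketch below shows how the first two are related, and how upper tail and interval probabilities can be obtained from the c.d.f.; the particular values are chosen only for illustration.

```r
dbinom(3, 10, 0.5)                  # P(Y = 3) for a Binomial(10,0.5) r.v.
sum(dbinom(0:3, 10, 0.5))           # P(Y <= 3), the same as pbinom(3,10,0.5)

1 - ppois(3, 10)                    # P(Y >= 4) for a Poisson(10) r.v.
ppois(3, 10, lower.tail = FALSE)    # the same upper tail, computed directly
ppois(8, 10) - ppois(3, 10)         # P(4 <= Y <= 8)
```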
To illustrate how to plot the probability function for a random variable, a Binomial(10,0.5) random
variable is used.
# Assign all possible values of the random variable, X ~ Binomial(10,0.5)

x <- seq(0,10,by=1)
# Determine the value of the p.f. for possible values of X

x.pf <- dbinom(x,10,0.5)
# Plot the probability function

barplot(x.pf,xlab="X",ylab="Probability Function",
names.arg=c("0","1","2","3","4","5","6","7","8","9","10"))
Loops in R are easy to construct but long loops can be slow and should be avoided where possible. For
example
x=0
for (i in 1:10) x<- c(x,i)
can be replaced by
x=c(0:10)
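The same idea applies to sums. As an illustration (the example itself is not from these notes), the sum of the squares of the first n integers can be accumulated in a loop or computed in one step from the whole vector.

```r
n <- 1000

# Loop version: add one term at a time
total <- 0
for (i in 1:n) total <- total + i^2

# Vectorized version: square the whole vector, then sum
total_vec <- sum((1:n)^2)

total == total_vec       # both equal n(n+1)(2n+1)/6
```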
Commonly used functions.
print() # Prints a single R object

cat() # Prints multiple objects, one after the other
length() # Number of elements in a vector or of a list
mean() # mean of a vector of data
median() # median of a vector of data
range() # Range of values of a vector of data
unique() # Gives the vector of distinct values
diff() # the vector of first differences so diff(x) has
# one less element than x
sort() # Sort elements into order, omitting NAs
order() # x[order(x)] orders elements of x, with NAs last
cumsum() # vector of partial or cumulative sums
cumprod() # vector of partial or cumulative products
rev() # reverse the order of vector elements
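To see what some of these functions return, they can be applied to a short vector. The vector v below is made up for this illustration.

```r
v <- c(4, 1, 7, 1, 9)

sort(v)        # 1 1 4 7 9
order(v)       # 2 4 1 3 5, the positions that put v in order
unique(v)      # 4 1 7 9
diff(v)        # -3 6 -6 8
cumsum(v)      # 4 5 12 13 22
cumprod(v)     # 4 4 28 28 252
rev(v)         # 9 1 7 1 4
range(v)       # 1 9
median(v)      # 4
```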
1. The following ten observations, taken during the years 1970-79, are on October snow cover for
Eurasia. (Snow cover is in millions of square kilometers).
Year Snow.cover
1970 6.5
1971 12
1972 14.9
1973 10
1974 10.7
1975 7.9
1976 21.9
1977 12.5
1978 14.5
1979 9.2
(a) Enter the data into R. To save keystrokes, enter the successive years as 1970:1979
(b) Plot snow.cover versus year.
(c) Use “hist()” to plot a histogram of the snow cover values.
(d) Repeat b and c after taking logarithms of snow cover.
2. Input the following data, on damage that had occurred in space shuttle launches prior to the
Challenger space shuttle launch of Jan 28 1986.
Date Temperature (Fahrenheit) Number of Damage Incidents
4/12/81 66 0
11/12/81 70 1
3/22/82 69 0
6/27/82 80 NA
1/11/82 68 0
4/4/83 67 0
6/18/83 72 0
8/30/83 73 0
11/28/83 70 0
2/3/84 57 1
4/6/84 63 1
8/30/84 70 1
10/5/84 78 0
11/8/84 67 0
1/24/85 53 3
4/12/85 67 0
4/29/85 75 0
6/17/85 70 0
7/29/85 81 0
8/27/85 76 0
10/3/85 79 0
10/30/85 75 2
11/26/85 76 0
1/12/86 58 1
This was then followed by the disastrous Challenger incident on 1/28/86.
(a) Enter the temperature data into a data frame, with (for example) column names: date, tempera-
ture, damage.
(b) Plot total incidents against temperature. Do you see any relationship? On the date of the
Challenger incident the temperature at launch was 31 degrees F. What would you expect for the number
of damage incidents?
7. EXPECTED VALUE AND VARIANCE
7.1 Summarizing Data on Random Variables

When we return midterm tests, someone almost always asks what the average was. While we could
list out all marks to give a picture of how students performed, this would be tedious. It would also
give more detail than could be immediately digested. If we summarize the results by telling a class
the average mark, students immediately get a sense of how well the class performed. For this reason,
“summary statistics” are often more helpful than giving full details of every outcome.
To illustrate some of the ideas involved, suppose we were to observe cars crossing a toll bridge, and
record the number, X, of people in each car. Suppose in a small study24 data on 25 cars were collected.
We could list out all 25 numbers observed, but a more helpful way of presenting the data would be in
terms of the frequency distribution below, which gives the number of times (the “frequency”) each
value of X occurred.
X Frequency Count Frequency
1 |||| | 6
2 |||| ||| 8
3 |||| 5
4 ||| 3
5 || 2
6 | 1
We could also draw a frequency histogram of these frequencies (see Figure 7.1).
Figure 7.1: Frequency histogram for number of people in a car at a toll bridge
24 “Study without desire spoils the memory, and it retains nothing that it takes in.” Leonardo da Vinci
Frequency distributions or histograms are good summaries of data because they show the variability
in the observed outcomes very clearly. Sometimes, however, we might prefer a single-number summary.
The most common such summary is the average, or arithmetic mean of the outcomes. The mean
of n outcomes \(x_1, \ldots, x_n\) for a random variable X is \(\sum_{i=1}^{n} x_i / n\), and is denoted by \(\bar{x}\). The arithmetic
mean for the example above can be calculated as
\[ \frac{(6 \times 1) + (8 \times 2) + (5 \times 3) + (3 \times 4) + (2 \times 5) + (1 \times 6)}{25} = \frac{65}{25} = 2.60 \]
That is, there was an average of 2.6 persons per car. A set of observed outcomes \(x_1, \ldots, x_n\) for a
random variable X is termed a sample in probability and statistics. To reflect the fact that this is the
average for a particular sample, we refer to it as the sample mean. Unless somebody deliberately
“cooked” the study, we would not expect to get precisely the same sample mean if we repeated it
another time. Note also that \(\bar{x}\) is not in general an integer, even though X is.
Two other common summary statistics are the median and mode.
Definition 14 The median of a sample is a value such that half the results are below it and half above
it, when the results are arranged in numerical order.
If these 25 results were written in order, the thirteenth outcome would be a 2. So the median is 2. By
convention, we go half way between the middle two values if there are an even number of observations.
Definition 15 The mode of the sample is the value which occurs most often. In this case the mode is 2.
There is no guarantee there will be only a single mode.
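These three summaries can be computed in R (see Chapter 6) from the frequency table for the toll bridge sample. The sketch below rebuilds the 25 observations from the frequencies. The variable names are chosen here for illustration. Note that the base R function mode() does not compute the mode of a sample, so the mode is found directly from the frequencies.

```r
x    <- 1:6                     # number of people in a car
freq <- c(6, 8, 5, 3, 2, 1)     # frequencies from the table in this section
obs  <- rep(x, times = freq)    # rebuild the 25 individual observations

mean(obs)                       # sample mean, 65/25
median(obs)                     # the 13th value when the 25 results are in order
x[freq == max(freq)]            # the value(s) which occur most often
```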
Exercise: Give a data set with a total of 11 values for which the median<mode<mean.
7.2 Expectation of a Random Variable

The statistics in the preceding section summarize features of a sample of observed X-values. The
same idea can be used to summarize the probability distribution of a random variable X. To illustrate,
consider the previous example, where X is the number of persons in a randomly selected car crossing
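a toll bridge. Note that we can re-arrange the expression used to calculate \(\bar{x}\) for the sample, as
\[ \begin{aligned}
& \frac{(6 \times 1) + (8 \times 2) + (5 \times 3) + (3 \times 4) + (2 \times 5) + (1 \times 6)}{25} \\
&= (1)\frac{6}{25} + (2)\frac{8}{25} + (3)\frac{5}{25} + (4)\frac{3}{25} + (5)\frac{2}{25} + (6)\frac{1}{25} \\
&= \sum_{x=1}^{6} x \times (\text{fraction of times } x \text{ occurs})
\end{aligned} \]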
Now suppose we know that the probability function of X is given by
x      1     2     3     4     5     6
f(x)   0.30  0.25  0.20  0.15  0.09  0.01
Using the relative frequency “definition” of probability, if we observed a very large number of cars, the
fraction (or relative frequency) of times X = 1 would be 0.30, for X = 2, this proportion would be
0.25, etc. So, in theory, (according to the probability model) we would expect the mean to be
(1)(0.30) + (2)(0.25) + (3)(0.20) + (4)(0.15) + (5)(0.09) + (6)(0.01) = 2.51
if we observed an infinite number of cars. This “theoretical” mean is usually denoted by \(\mu\) or E(X), and
requires us to know the distribution of X. With this background we make the following mathematical
definition.
Definition 16 Let X be a discrete random variable with range(X) = A and probability function f (x).
The expected value (also called the mean or the expectation) of X is given by
\[ E(X) = \sum_{x \in A} x f(x) \]
The expected value of X is also often denoted by the Greek letter \(\mu\). The expected value25 of X
can be thought of physically as the average of the X-values that would occur in an infinite series of
repetitions of the process where X is defined. This value not only describes one aspect of a probability
distribution, but is also very important in certain types of applications. For example, if you are playing
a casino game in which X represents the amount you win in a single play, then E(X) represents your
average winnings (or losses!) per play.
25 Oft expectation fails, and most oft where most it promises; and oft it hits where hope is coldest; and despair most sits. William Shakespeare (1564 - 1616)
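The “infinite series of repetitions” interpretation can be illustrated by simulation in R. The sketch below computes E(X) from Definition 16 for the toll bridge probability function, and then compares it with the sample mean of a large number of simulated cars. The seed and the number of simulated cars are arbitrary choices made for this example.

```r
x <- 1:6
f <- c(0.30, 0.25, 0.20, 0.15, 0.09, 0.01)    # probability function from this section
stopifnot(isTRUE(all.equal(sum(f), 1)))       # the probabilities must sum to one

EX <- sum(x * f)       # E(X) as in Definition 16
EX

set.seed(230)          # arbitrary seed, so that the run can be repeated
cars <- sample(x, size = 100000, replace = TRUE, prob = f)
mean(cars)             # sample mean of the simulated values, close to E(X)
```

The sample mean changes from run to run if the seed is changed, whereas E(X) does not.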
Sometimes we may not be interested in the average value of X itself, but in some function of X.
Consider the toll bridge example once again, and suppose there is a toll which depends on the number
of car occupants. For example, a toll of $1 per car plus 25 cents per occupant would produce an average
toll for the 25 cars in the study of Section 7.1 equal to
\[ (1.25)\frac{6}{25} + (1.50)\frac{8}{25} + (1.75)\frac{5}{25} + (2.00)\frac{3}{25} + (2.25)\frac{2}{25} + (2.50)\frac{1}{25} = \$1.65 \]
If X has the theoretical probability function f(x) given above, then the average value of this toll of
(0.25X + 1) dollars would be defined in the same way, as
(1.25)(0.30) + (1.50)(0.25) + (1.75)(0.20) + (2.00)(0.15) + (2.25)(0.09) + (2.50)(0.01) = $1.6275
We call this the expected value of (0.25X + 1) and write E(0.25X + 1) = 1.6275.
As a further illustration, suppose a toll designed to encourage car pooling charged $12/x² if there
were x people in the car. This scheme would yield an average toll, in theory, of
\[ \frac{12}{1}(0.30) + \frac{12}{4}(0.25) + \frac{12}{9}(0.20) + \frac{12}{16}(0.15) + \frac{12}{25}(0.09) + \frac{12}{36}(0.01) = \$4.7757 \]
In other words,
\[ E\left(\frac{12}{X^2}\right) = 4.7757 \]
is the “expected value” of 12/X².
With this as background, we can now state the general result.
Theorem 17 Let X be a discrete random variable with range(X) = A and probability function f(x).
The expected value of some function g(X) of X is given by
\[ E[g(X)] = \sum_{x \in A} g(x) f(x) \]
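Proof: To use Definition 16, we need to determine the expected value of the random variable
Y = g(X) by first finding the probability function of Y, say f_Y(y) = P(Y = y), and then computing
\[ E[g(X)] = E(Y) = \sum_{y \in B} y f_Y(y) \tag{7.1} \]
where range(Y) = B. Let D_y = {x : g(x) = y} be the set of x values with a given value y for g(x);
then
\[ f_Y(y) = P[g(X) = y] = \sum_{x \in D_y} f(x) \]
Substituting this in (7.1) we obtain
\[
\begin{aligned}
E[g(X)] &= \sum_{y \in B} y f_Y(y) \\
&= \sum_{y \in B} y \sum_{x \in D_y} f(x) \\
&= \sum_{y \in B} \sum_{x \in D_y} g(x) f(x) \\
&= \sum_{x \in A} g(x) f(x)
\end{aligned}
\]
The third line uses g(x) = y for every x in D_y, and the last line holds because the sets D_y, y ∈ B,
partition A.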
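As a quick numerical check of Theorem 17, the following Python sketch recomputes the three toll bridge expectations directly from the sum of g(x)f(x). The helper name `expected_value` and the dictionary `f` are our own illustrative choices, not notation from these notes; exact fractions are used only to avoid rounding noise.

```python
from fractions import Fraction

# Probability function f(x) for the number of occupants X (toll bridge example).
f = {1: Fraction(30, 100), 2: Fraction(25, 100), 3: Fraction(20, 100),
     4: Fraction(15, 100), 5: Fraction(9, 100), 6: Fraction(1, 100)}

def expected_value(g, pf):
    """Return E[g(X)], the sum of g(x) * f(x) over the range of X.

    `expected_value` is a hypothetical helper name used for illustration.
    """
    return sum(g(x) * p for x, p in pf.items())

mean_occupants = expected_value(lambda x: x, f)                          # E(X)
mean_linear_toll = expected_value(lambda x: Fraction(1, 4) * x + 1, f)   # E(0.25X + 1)
mean_pooling_toll = expected_value(lambda x: Fraction(12, x * x), f)     # E(12/X^2)

print(float(mean_occupants))       # 2.51
print(float(mean_linear_toll))     # 1.6275
print(float(mean_pooling_toll))    # 4.7757
```

Note that the function g is applied to each value x before weighting by f(x); the probability function of Y = g(X) is never needed, which is the practical content of the theorem.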
Notes:
(1) You can interpret E[g(X)] as the average value of g(X) in an infinite series of repetitions of the
process where X is defined.
(2) E [g(X)] is also known as the “expected value” of g(X). This name is somewhat misleading
since the average value of g(X) may be a value which g(X) never takes - hence unexpected!
(3) The case where g(x) = x reduces to our earlier definition of E(X).
(4) Confusion sometimes arises because we have two notations for the mean of a probability
distribution: μ and E(X) mean the same thing. There is a small advantage to using the (lower case)
letter μ: it makes it visually clearer that the expected value is NOT a random variable like X but
a non-random constant.
(5) When calculating expectations, look at your answer to be sure it makes sense. Suppose for
example that X takes values from 1 to 10. Then since
\[ 1 = \sum_{x=1}^{10} (1) P(X = x) \le \sum_{x=1}^{10} x P(X = x) = E(X) \le \sum_{x=1}^{10} (10) P(X = x) = 10(1) = 10 \]
you should know you’ve made an error if you get E(X) > 10 or E(X) < 1. In physical terms,
E(X) is the balance point for the probability histogram of f(x).
Let us note a couple of mathematical properties of expected value that can help to simplify calculations.
Linearity Properties of Expectation: If your linear algebra is good, it may help if you think of E as
being a linear operator, and this may save memorizing these properties.
1. For constants a and b,
\[ E[ag(X) + b] = aE[g(X)] + b \]
Proof:
\[
\begin{aligned}
E[ag(X) + b] &= \sum_{\text{all } x} [ag(x) + b] f(x) \\
&= \sum_{\text{all } x} [ag(x)f(x) + bf(x)] \\
&= a \sum_{\text{all } x} g(x)f(x) + b \sum_{\text{all } x} f(x) \\
&= aE[g(X)] + b \quad \text{since } \sum_{\text{all } x} f(x) = 1
\end{aligned}
\]
2. Similarly for constants a and b and two functions g_1 and g_2, it is also easy to show
\[ E[ag_1(X) + bg_2(X)] = aE[g_1(X)] + bE[g_2(X)] \]
Don’t let expected value intimidate you. Much of it is common sense. For example, using property
1, if we let a = 0 and b = 13 we obtain E(13) = 13. The expected value of a constant b is,
of course, equal to b. The property also implies E(2X) = 2E(X) if we use a = 2, b = 0, and
g(X) = X. This is obvious also. Note, however, that for g(x) a nonlinear function, it is NOT
generally true that E[g(X)] = g[E(X)]; this is a common mistake. (Check this for the example above
when g(X) = 12/X².)
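The suggested check can be carried out numerically. The sketch below, which assumes the toll bridge probability function from Section 7.2 and uses a helper name `expect` of our own, confirms that linearity holds for the toll 0.25X + 1 but that E(12/X²) is far from 12/[E(X)]².

```python
# Toll bridge probability function f(x); values taken from the example in Section 7.2.
f = {1: 0.30, 2: 0.25, 3: 0.20, 4: 0.15, 5: 0.09, 6: 0.01}

def expect(g):
    """E[g(X)] for the toll bridge distribution (illustrative helper, not from the notes)."""
    return sum(g(x) * p for x, p in f.items())

mu = expect(lambda x: x)                      # E(X) = 2.51

# Linear case: E(0.25X + 1) agrees with 0.25 E(X) + 1.
lhs_linear = expect(lambda x: 0.25 * x + 1)
rhs_linear = 0.25 * mu + 1

# Nonlinear case: E(12 / X^2) does NOT agree with 12 / [E(X)]^2.
e_of_g = expect(lambda x: 12 / x**2)
g_of_e = 12 / mu**2

print(round(lhs_linear, 4), round(rhs_linear, 4))   # 1.6275 1.6275
print(round(e_of_g, 4), round(g_of_e, 4))           # 4.7757 1.9047
```

Here the two nonlinear quantities differ by more than a factor of two, so substituting E(X) into g would badly understate the average toll.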
7.3 Some Applications of Expectation

Because expected value is an average value, it is frequently used in problems where costs or profits
are connected with the outcomes of a random variable X. It is also used as a summary statistic; for
example, one often hears about the expected life (expectation of lifetime) for a person or the expected
return on an investment. Be cautious, however: the expected value does NOT tell the whole story
about a distribution. One investment could have a higher expected value than another but also a much
larger probability of large losses.
The following are examples.
Example: Expected Winnings in a Lottery A small lottery[26] sells 1000 tickets numbered
000, 001, ..., 999; the tickets cost $10 each. When all the tickets have been sold the draw takes place:
this consists of a single ticket from 000 to 999 being chosen at random. For ticket holders the prize
structure is as follows:
Your ticket is drawn - win $5000.
Your ticket has the same first two numbers as the winning ticket, but the third is different - win $100.
Your ticket has the same first number as the winning ticket, but the second number is different - win $10.
All other cases - win nothing.
Let the random variable X represent the winnings from a given ticket. Find E(X).
Solution: The possible values for X are 0, 10, 100, 5000 (dollars). First, we need to find the probability
function for X. We find (make sure you can do this) that f(x) = P(X = x) has values
\[ f(0) = 0.9, \quad f(10) = 0.09, \quad f(100) = 0.009, \quad f(5000) = 0.001 \]
The expected winnings are thus the expected value of X, or
\[ E(X) = \sum_{\text{all } x} x f(x) = \$6.80 \]
Thus, the gross expected winnings per ticket are $6.80. However, since a ticket costs $10 your expected
net winnings are negative, −$3.20 (that is, an expected loss of $3.20).
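One way to confirm the probability function is to enumerate all 1000 equally likely winning tickets for a fixed ticket. The following Python sketch does this; the function name `winnings` and the choice of ticket "123" are arbitrary illustrations of ours, and by symmetry any other ticket gives the same counts.

```python
from collections import Counter
from fractions import Fraction

def winnings(ticket, winner):
    """Prize in dollars for a 3-digit ticket string, given the winning ticket string."""
    if ticket == winner:
        return 5000
    if ticket[:2] == winner[:2]:
        return 100      # first two digits match, third differs
    if ticket[0] == winner[0]:
        return 10       # first digit matches, second differs
    return 0

ticket = "123"  # arbitrary choice of the ticket we hold
counts = Counter(winnings(ticket, f"{w:03d}") for w in range(1000))

# Each winning ticket is equally likely, so f(x) = (number of cases) / 1000.
f = {x: Fraction(c, 1000) for x, c in counts.items()}
expected_gross = sum(x * p for x, p in f.items())
expected_net = expected_gross - 10   # subtract the $10 ticket price

print(sorted(counts.items()))   # [(0, 900), (10, 90), (100, 9), (5000, 1)]
print(float(expected_gross))    # 6.8
print(float(expected_net))      # -3.2
```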
Remark: For any lottery or game of chance the expected net winnings per play is a key value. A fair
game is one for which this value is 0. Needless to say, casino games and lotteries are never fair: the
expected net winnings for a player are always negative.
Remark: The random variable associated with a given problem may be defined in different ways but
the expected winnings will remain the same. For example, instead of defining X as the amount won
we could have defined X = 0, 1, 2, 3 as follows:
X = 3    all 3 digits of number match winning ticket
X = 2    1st 2 digits (only) match
X = 1    1st digit (but not the 2nd) matches
X = 0    1st digit does not match
Now, we would define the function g(x) as the winnings when the outcome X = x occurs. Thus,
\[ g(0) = 0, \quad g(1) = 10, \quad g(2) = 100, \quad g(3) = 5000 \]
The expected winnings are then
\[ E[g(X)] = \sum_{x=0}^{3} g(x) f(x) = \$6.80 \]
the same as before.
[26] “Here’s something to think about: How come you never see a headline like ‘Psychic Wins Lottery’?” Jay Leno (1950 - )
Example: Diagnostic Medical Tests Often there are cheaper, less accurate tests for diagnosing the
presence of some conditions in a person, along with more expensive, accurate tests. Suppose we have
two cheap tests and one expensive test, with the following characteristics. All three tests are positive if
a person has the condition (there are no “false negatives”), but the cheap tests give “false positives”.
Let a person be chosen at random, and let D = {person has the condition}. For the three tests the
probability of a false positive and cost are:
Test    P(positive test | D̄)    Cost (in dollars)
1       0.05                     5
2       0.03                     8
3       0                        40
We want to check a large number of people for the condition, and have to choose among three testing
strategies:
(i) Use Test 1, followed by Test 3 if Test 1 is positive[27].
(ii) Use Test 2, followed by Test 3 if Test 2 is positive.
(iii) Use Test 3.
Determine the expected cost per person under each of strategies (i), (ii) and (iii). We will then choose
the strategy with the lowest expected cost. It is known that about 0.001 of the population have the
condition (P(D) = 0.001, P(D̄) = 0.999).
[27] Assume that given D or D̄, tests are independent of one another.
Solution: For a person chosen at random and tested, define the random variable X as follows:
X = 1    if the initial test is negative
X = 2    if the initial test is positive
Let g(x) be the total cost of testing the person. The expected cost per person is then
\[ E[g(X)] = \sum_{x=1}^{2} g(x) f(x) \]
The probability function f(x) for X and function g(x) differ for strategies (i), (ii) and (iii). Consider
for example strategy (i). Then
\[
\begin{aligned}
P(X = 2) &= P(\text{initial test positive}) \\
&= P(D) + P(\text{positive} \mid \bar{D}) P(\bar{D}) \\
&= 0.001 + (0.05)(0.999) \\
&\approx 0.0510
\end{aligned}
\]
The rest of the probabilities, associated values of g(X) and E[g(X)] are obtained below.
(i) f(2) = 0.0510 (obtained above)
    f(1) = P(X = 1) = 1 − f(2) = 1 − 0.0510 = 0.949
    g(1) = 5, g(2) = 45
    E[g(X)] = 5(0.949) + 45(0.0510) = $7.04
(ii) f(2) = 0.001 + (0.03)(0.999) = 0.03097
    f(1) = 1 − f(2) = 0.96903
    g(1) = 8, g(2) = 48
    E[g(X)] = 8(0.96903) + 48(0.03097) = $9.2388
(iii) f(2) = 0.001, f(1) = 0.999
    g(2) = g(1) = 40
    E[g(X)] = $40.00
Therefore the cheapest strategy is strategy (i).
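The three expected costs can be reproduced with a few lines of Python. This is only a sketch of the calculation above: the dictionary layout and the function name `expected_cost` are our own assumptions. Because it does not round f(2) to 0.0510, strategy (i) comes out as $7.038 rather than the rounded $7.04.

```python
p_d = 0.001  # P(D): proportion of the population with the condition

# False positive probability and cost for each test, from the table in the example.
tests = {
    1: {"false_pos": 0.05, "cost": 5},
    2: {"false_pos": 0.03, "cost": 8},
    3: {"false_pos": 0.0, "cost": 40},
}

def expected_cost(first_test):
    """Expected cost per person when `first_test` is given first and Test 3
    follows a positive result (hypothetical helper for illustration)."""
    t = tests[first_test]
    if first_test == 3:
        return float(t["cost"])          # strategy (iii): Test 3 only
    # No false negatives: positive if D occurs, or if not-D and a false positive.
    p_positive = p_d + t["false_pos"] * (1 - p_d)
    cost_negative = t["cost"]
    cost_positive = t["cost"] + tests[3]["cost"]
    return cost_negative * (1 - p_positive) + cost_positive * p_positive

costs = {strategy: expected_cost(strategy) for strategy in (1, 2, 3)}
cheapest = min(costs, key=costs.get)

for strategy, cost in costs.items():
    print(strategy, round(cost, 4))   # 1 7.038, then 2 9.2388, then 3 40.0
print(cheapest)                       # 1
```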

Problem
7.3.1 A lottery[28] has tickets numbered 000 to 999 which are sold for $1 each. One ticket is selected
at random and a prize of $200 is given to any person whose ticket number is a permutation of
the selected ticket number. All 1000 tickets are sold. What is the expected profit or loss to the
organization running the lottery?
7.4 Means and Variances of Distributions

It is useful to know the means, μ = E(X), of the probability models derived in Chapter 5.
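Example: Expected value of a Binomial random variable Let X ~ Binomial(n, p). Find E(X).
Solution:
\[
\begin{aligned}
\mu = E(X) &= \sum_{x=0}^{n} x \binom{n}{x} p^x (1-p)^{n-x} \\
&= \sum_{x=0}^{n} x \frac{n!}{x!(n-x)!} p^x (1-p)^{n-x}
\end{aligned}
\]
When x = 0 the value of the expression is 0. We can therefore begin our sum at x = 1. Provided
x ≠ 0, we can expand x! as x(x − 1)! (so it is important to eliminate the term when x = 0). Therefore
\[
\begin{aligned}
\mu &= \sum_{x=1}^{n} \frac{n(n-1)!}{(x-1)!\,[(n-1)-(x-1)]!}\, p\, p^{x-1} (1-p)^{(n-1)-(x-1)} \\
&= np(1-p)^{n-1} \sum_{x=1}^{n} \binom{n-1}{x-1} \left(\frac{p}{1-p}\right)^{x-1}
\end{aligned}
\]
Let y = x − 1 in the sum, to get
\[
\begin{aligned}
\mu &= np(1-p)^{n-1} \sum_{y=0}^{n-1} \binom{n-1}{y} \left(\frac{p}{1-p}\right)^{y} \\
&= np(1-p)^{n-1} \left(1 + \frac{p}{1-p}\right)^{n-1} \quad \text{by the Binomial Theorem} \\
&= np(1-p)^{n-1} \frac{(1-p+p)^{n-1}}{(1-p)^{n-1}} \\
&= np
\end{aligned}
\]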
[28] “I’ve done the calculation and your chances of winning the lottery are identical whether you play or not.” Fran Lebowitz (1950 - )
Exercise: Does this result make sense? If you try something 100 times and there is a 20% chance of
success each time, how many successes do you expect to get, on average?
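For readers who want to check their intuition against the formula, the short Python sketch below sums x f(x) directly for n = 100 and p = 0.2 and compares the result with np. The function name `binomial_pf` is our own; `math.comb` requires Python 3.8 or later.

```python
from math import comb

def binomial_pf(n, p):
    """List of Binomial(n, p) probabilities f(0), f(1), ..., f(n) (illustrative helper)."""
    return [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

n, p = 100, 0.2
f = binomial_pf(n, p)

total = sum(f)                                   # should be 1 (up to rounding)
mean = sum(x * fx for x, fx in enumerate(f))     # E(X) computed from Definition 16

print(round(total, 10))   # 1.0
print(round(mean, 10))    # 20.0
print(n * p)              # 20.0
```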
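Example: Expected value of the Poisson random variable Let X have a Poisson distribution where
λ is the average rate of occurrence and the time interval is of length t. Find μ = E(X).
Solution: Since the probability function of X is
\[ f(x) = \frac{(\lambda t)^x e^{-\lambda t}}{x!} \quad \text{for } x = 0, 1, \ldots \]
then
\[ \mu = E(X) = \sum_{x=0}^{\infty} x \frac{(\lambda t)^x e^{-\lambda t}}{x!} \]
As in the Binomial example, we can eliminate the term when x = 0 and expand x! as x(x − 1)! for
x = 1, 2, ... to obtain
\[
\begin{aligned}
\mu &= \sum_{x=1}^{\infty} x \frac{(\lambda t)^x e^{-\lambda t}}{x!} = \sum_{x=1}^{\infty} x \frac{(\lambda t)^x e^{-\lambda t}}{x(x-1)!} \\
&= \sum_{x=1}^{\infty} (\lambda t) e^{-\lambda t} \frac{(\lambda t)^{x-1}}{(x-1)!} \\
&= (\lambda t) e^{-\lambda t} \sum_{x=1}^{\infty} \frac{(\lambda t)^{x-1}}{(x-1)!} \\
&= (\lambda t) e^{-\lambda t} \sum_{y=0}^{\infty} \frac{(\lambda t)^{y}}{y!} \quad \text{letting } y = x - 1 \text{ in the sum} \\
&= (\lambda t) e^{-\lambda t} e^{\lambda t} \quad \text{since } e^{x} = \sum_{y=0}^{\infty} \frac{x^{y}}{y!} \\
&= \lambda t
\end{aligned}
\]
Note that we used the symbol μ = λt earlier in connection with the Poisson model; this was because
we knew (but couldn’t show until now) that E(X) = μ.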
Exercise: These techniques can also be used to work out the mean for the Hypergeometric or Negative
Binomial distributions. Looking back at how we proved that Σ f(x) = 1 shows the same method of
summation used to find μ. However, in Chapter 9 we will give a simpler method of finding the means
of these distributions, which are E(X) = nr/N (Hypergeometric) and E(X) = k(1 − p)/p (Negative
Binomial).
Variability: While an average or expected value is a useful summary of a set of observations, or a
probability distribution, it omits another important piece of information, namely the amount of
variability. For example, it would be possible for car doors to be the right width, on average, and still have
no doors fit properly. In the case of fitting car doors, we would also want the door widths to all be close
to this correct average. We give a way of measuring the amount of variability next. You might think
we could use the average difference between X and μ to indicate the amount of variation. In terms of
expectation, this would be E(X − μ). However,
\[
\begin{aligned}
E(X - \mu) &= E(X) - \mu \quad \text{since } \mu \text{ is a constant} \\
&= 0
\end{aligned}
\]
We soon realize that for a measure of variability, we can use the expected value of a function that has
the same sign for X > μ and for X < μ. One might try the expected value of the distance between
X and its mean, e.g. E(|X − μ|). An alternative, more mathematically tractable version, which squares
the distance (much as Euclidean distance in ℝ^n involves a sum of squared distances), is the variance.
Definition 18 The variance of a random variable X, denoted by Var(X) or by σ², is
\[ \sigma^2 = Var(X) = E\left[(X - \mu)^2\right] \]
In words, the variance is the average square of the distance from the mean. This turns out to be a very
useful measure of the variability of X.
The basic definition of variance is often awkward to use for mathematical calculation of Var(X),
whereas the following two results are often useful:
(1) Var(X) = E(X²) − [E(X)]² = E(X²) − μ²
(2) Var(X) = E[X(X − 1)] + E(X) − [E(X)]² = E[X(X − 1)] + μ − μ²
Proof:
(1) Using properties of expected value,
\[
\begin{aligned}
\sigma^2 = Var(X) &= E\left[(X - \mu)^2\right] \\
&= E\left(X^2 - 2\mu X + \mu^2\right) \\
&= E\left(X^2\right) - 2\mu E(X) + \mu^2 \quad \text{since } \mu \text{ is a constant} \\
&= E\left(X^2\right) - 2\mu^2 + \mu^2 \quad \text{since } E(X) = \mu \\
&= E\left(X^2\right) - \mu^2
\end{aligned}
\]
(2) Since X² = X(X − 1) + X,
\[
\begin{aligned}
E\left(X^2\right) - \mu^2 &= E[X(X - 1) + X] - \mu^2 \\
&= E[X(X - 1)] + E(X) - \mu^2 \\
&= E[X(X - 1)] + \mu - \mu^2
\end{aligned}
\]
Formula (2) is most often used when there is an x! term in the denominator of f(x). Otherwise, formula
(1) is generally easier to use.
Suppose the random variable X is the number of dollars that a person wins if they play a certain
game. We notice that the units of measurement for E(X) will also be dollars but the units of
measurement for Var(X) will be (dollars)². We can regain the original units by taking the square root of
Var(X). This is called the standard deviation of X, and is denoted by σ, or by sd(X).
Definition 19 The standard deviation of a random variable X is
\[ \sigma = sd(X) = \sqrt{Var(X)} = \sqrt{E\left[(X - \mu)^2\right]} \]
Both variance and standard deviation are commonly used to measure variability.
Example: Suppose X is a random variable with probability function given by
x       1     2     3     4     5     6     7     8     9     Total
f(x)    0.07  0.10  0.12  0.13  0.16  0.13  0.12  0.10  0.07  1
The probability histogram for X is given in Figure 7.2.
Find E(X) and Var(X).
Solution:
\[
\begin{aligned}
\mu = E(X) &= 1(0.07) + 2(0.1) + 3(0.12) + 4(0.13) + 5(0.16) \\
&\quad + 6(0.13) + 7(0.12) + 8(0.1) + 9(0.07) \\
&= 5
\end{aligned}
\]
E(X) = 5 should be obvious by looking at the histogram. If a probability histogram is symmetric
about the line x = μ then E(X) = μ without any calculation.
[Figure 7.2: Probability Histogram for X; horizontal axis x = 1, ..., 9, vertical axis f(x)]
Without doing any calculations we also know that Var(X) = σ² ≤ 16. This is because the
possible values of X are {1, 2, ..., 9} and so the maximum possible value for (X − μ)² is (9 − 5)² or
(1 − 5)² = 16. Therefore
\[
\begin{aligned}
Var(X) = E\left[(X - 5)^2\right] &= \sum_{x=1}^{9} (x - 5)^2 P(X = x) \\
&\le \sum_{x=1}^{9} (9 - 5)^2 P(X = x) = 16 \sum_{x=1}^{9} P(X = x) = 16(1) = 16
\end{aligned}
\]
An expected value of a function, say E[g(X)], is always somewhere between the minimum and the
maximum value of the function g(x), so in this case 0 ≤ Var(X) ≤ 16. Since
\[
\begin{aligned}
E\left(X^2\right) &= (1)^2(0.07) + (2)^2(0.1) + (3)^2(0.12) + (4)^2(0.13) + (5)^2(0.16) \\
&\quad + (6)^2(0.13) + (7)^2(0.12) + (8)^2(0.1) + (9)^2(0.07) \\
&= 30.26
\end{aligned}
\]
we obtain
\[
\begin{aligned}
\sigma^2 = Var(X) &= E\left(X^2\right) - \mu^2 \\
&= 30.26 - (5)^2 \\
&= 5.26
\end{aligned}
\]
and
\[ \sigma = \sqrt{Var(X)} = \sqrt{5.26} = 2.2935 \]
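The same numbers can be obtained three ways in Python: from Definition 18 directly, from formula (1), and from formula (2). This is a sketch for checking hand calculations; the helper name `expect` is ours and not part of the notes.

```python
from math import sqrt

# Probability function from the example: x = 1, ..., 9.
probs = [0.07, 0.10, 0.12, 0.13, 0.16, 0.13, 0.12, 0.10, 0.07]
f = dict(zip(range(1, 10), probs))

def expect(g):
    """E[g(X)] for the distribution f (illustrative helper)."""
    return sum(g(x) * p for x, p in f.items())

mu = expect(lambda x: x)

var_definition = expect(lambda x: (x - mu) ** 2)             # E[(X - mu)^2]
var_formula_1 = expect(lambda x: x ** 2) - mu ** 2           # E(X^2) - mu^2
var_formula_2 = expect(lambda x: x * (x - 1)) + mu - mu ** 2  # E[X(X-1)] + mu - mu^2
sd = sqrt(var_definition)

print(round(mu, 4))                 # 5.0
print(round(var_definition, 4))     # 5.26
print(round(var_formula_1, 4))      # 5.26
print(round(var_formula_2, 4))      # 5.26
print(round(sd, 4))                 # 2.2935
```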
To see how σ² = Var(X) or σ = sd(X) reflects the shape of a probability histogram see Figure
7.3. In each case the range of the random variable X is {1, 2, ..., 9} and the mean is μ = E(X) = 5.
[Figure 7.3: How Var(X) or sd(X) reflects the spread of a probability histogram. Four panels, each with μ = 5:
σ² = 5.26, σ = 2.29; σ² = 3.18, σ = 1.78; σ² = 1.66, σ = 1.29; σ² = 0.88, σ = 0.94]
Each histogram is labeled with its corresponding values for σ² = Var(X) and σ = sd(X). We can
see from the histograms that a small value of σ² (σ) suggests there is a greater probability of getting
an observed value of the random variable near the mean μ = E(X). A large value of σ² (σ) suggests
there is a greater probability of getting an observed value of the random variable that is not close to the
mean μ = E(X).
Example: Variance of Binomial random variable Let X ~ Binomial(n, p). Find Var(X).
Solution: The probability function for X is
\[ f(x) = \frac{n!}{x!(n-x)!} p^x (1-p)^{n-x} \quad \text{for } x = 0, 1, \ldots, n \]
so we’ll use formula (2) above,
\[ E[X(X-1)] = \sum_{x=0}^{n} x(x-1) \frac{n!}{x!(n-x)!} p^x (1-p)^{n-x} \]
If x = 0 or x = 1 the value of the term is 0, so we can begin summing at x = 2. For x ≠ 0 or 1, we
can expand the x! as x(x − 1)(x − 2)!. Therefore
\[ E[X(X-1)] = \sum_{x=2}^{n} \frac{n!}{(x-2)!(n-x)!} p^x (1-p)^{n-x} \]
Now re-group to fit the Binomial Theorem, since that was the summation technique used to show
Σ f(x) = 1 and to derive μ = np.
\[
\begin{aligned}
E[X(X-1)] &= \sum_{x=2}^{n} \frac{n(n-1)(n-2)!}{(x-2)!\,[(n-2)-(x-2)]!}\, p^2 p^{x-2} (1-p)^{(n-2)-(x-2)} \\
&= n(n-1)p^2 (1-p)^{n-2} \sum_{x=2}^{n} \binom{n-2}{x-2} \left(\frac{p}{1-p}\right)^{x-2}
\end{aligned}
\]
Let y = x − 2 in the sum, giving
\[
\begin{aligned}
E[X(X-1)] &= n(n-1)p^2 (1-p)^{n-2} \sum_{y=0}^{n-2} \binom{n-2}{y} \left(\frac{p}{1-p}\right)^{y} \\
&= n(n-1)p^2 (1-p)^{n-2} \left(1 + \frac{p}{1-p}\right)^{n-2} \\
&= n(n-1)p^2 (1-p)^{n-2} \frac{(1-p+p)^{n-2}}{(1-p)^{n-2}} \\
&= n(n-1)p^2
\end{aligned}
\]
Then
\[
\begin{aligned}
\sigma^2 &= E[X(X-1)] + \mu - \mu^2 \\
&= n(n-1)p^2 + np - (np)^2 \\
&= n^2p^2 - np^2 + np - n^2p^2 \\
&= np(1-p)
\end{aligned}
\]
Remember that the variance of a Binomial distribution is np(1 − p), since we’ll be using it later in the
course.
In Figure 7.4 there are four Binomial probability histograms for various values of n and p along
with their means, variances and standard deviations. From the top two panels we can see that a
Binomial(10, 0.1) random variable and a Binomial(10, 0.9) random variable have the same variance
and standard deviation and that the probability histograms are mirror images of each other. This is what
you might expect given that the Binomial distribution arises as the number of successes in n Bernoulli
trials and the only difference between a Binomial(10, 0.1) random variable and a Binomial(10, 0.9)
random variable is which outcome is labeled a Success (S) and which is labeled a Failure (F).
[Figure 7.4: Probability histograms, means and variances for various Binomial(n, p) random variables. Four panels:
Binomial(10, 0.1): μ = 1, σ² = 0.9, σ = 0.95; Binomial(10, 0.9): μ = 9, σ² = 0.9, σ = 0.95;
Binomial(10, 0.5): μ = 5, σ² = 2.5, σ = 1.58; Binomial(50, 0.1): μ = 5, σ² = 4.5, σ = 2.12]
The lower left hand panel contains the probability histogram for a Binomial(10, 0.5) random
variable which we notice is symmetric about its mean μ = 10(0.5) = 5. In fact the probability
histogram for a Binomial(n, 0.5) random variable will always be symmetric about its mean μ =
np. Note also that for fixed n, a Binomial(n, 0.5) random variable has the largest variance since
Var(X) = np(1 − p) is maximized for p = 0.5.
The lower right hand panel contains the probability histogram for a Binomial(50, 0.1) random
variable which we notice is fairly symmetric about its mean μ = 50(0.1) = 5 even though p = 0.1
is not close to 0.5. This observation leads to an approximation to the Binomial distribution which is
discussed in Section 10.1.
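The means and variances quoted for the four panels of Figure 7.4 can be reproduced numerically. The sketch below applies formula (2) to the Binomial probability function and compares the result with np(1 − p); the function name `binomial_moments` is a label of our own choosing.

```python
from math import comb, sqrt

def binomial_moments(n, p):
    """Return (mean, variance) of Binomial(n, p), computed by direct summation
    using Var(X) = E[X(X-1)] + mu - mu^2 (illustrative helper)."""
    f = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]
    mu = sum(x * fx for x, fx in enumerate(f))
    factorial_moment = sum(x * (x - 1) * fx for x, fx in enumerate(f))
    return mu, factorial_moment + mu - mu**2

panels = [(10, 0.1), (10, 0.9), (10, 0.5), (50, 0.1)]
results = {(n, p): binomial_moments(n, p) for n, p in panels}

for (n, p), (mu, var) in results.items():
    # Columns: n, p, mean, variance by summation, np(1-p), standard deviation
    print(n, p, round(mu, 4), round(var, 4), round(n * p * (1 - p), 4), round(sqrt(var), 2))
```

The summed variance and the closed form np(1 − p) agree to rounding error in every case, and the p = 0.1 and p = 0.9 rows share the same variance, as the mirror-image histograms suggest.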
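Example: Variance of Poisson random variable Suppose X has a Poisson(μ) distribution. Find
Var(X).
Solution: The probability function for X is
\[ f(x) = \frac{\mu^x e^{-\mu}}{x!} \quad \text{for } x = 0, 1, 2, \ldots \]
from which we obtain
\[
\begin{aligned}
E[X(X-1)] &= \sum_{x=0}^{\infty} x(x-1) \frac{\mu^x e^{-\mu}}{x!} \\
&= \sum_{x=2}^{\infty} x(x-1) \frac{\mu^x e^{-\mu}}{x(x-1)(x-2)!} \quad \text{setting the lower limit to 2 and expanding } x! \\
&= \mu^2 e^{-\mu} \sum_{x=2}^{\infty} \frac{\mu^{x-2}}{(x-2)!}
\end{aligned}
\]
Let y = x − 2 in the sum, giving
\[ E[X(X-1)] = \mu^2 e^{-\mu} \sum_{y=0}^{\infty} \frac{\mu^{y}}{y!} = \mu^2 e^{-\mu} e^{\mu} = \mu^2 \]
so
\[
\begin{aligned}
\sigma^2 &= E[X(X-1)] + \mu - \mu^2 \\
&= \mu^2 + \mu - \mu^2 = \mu
\end{aligned}
\]
(For the Poisson distribution, the variance equals the mean.)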
In Figure 7.5 there are four Poisson probability histograms for various values of μ along with their
means, variances and standard deviations. In the top left histogram μ = 0.5 and we see that much of
the probability is concentrated on the values x = 0 and x = 1. As the value of μ approaches 0, more
and more of the probability will be concentrated on the value x = 0 since
\[ \lim_{\mu \to 0} P(X = 0) = \lim_{\mu \to 0} \frac{\mu^0 e^{-\mu}}{0!} = \lim_{\mu \to 0} e^{-\mu} = 1 \]
This is consistent with Var(X) = μ approaching 0 as well.
As μ increases Var(X) = μ increases and the spread of the probability histogram increases as
illustrated in the top right histogram where μ = 1.5 and the bottom two histograms where μ = 5 and
μ = 9.5. Note that for μ = 5 the histogram is quite symmetric about the mean E(X) = μ = 5
and even more symmetric for μ = 9.5. This observation leads to an approximation to the Poisson
distribution which is discussed in Section 10.1.
[Figure 7.5: Probability histograms, means and variances for various Poisson(μ) random variables. Four panels:
Poisson(0.5): μ = 0.5, σ² = 0.5, σ = 0.71; Poisson(1.5): μ = 1.5, σ² = 1.5, σ = 1.22;
Poisson(5): μ = 5, σ² = 5, σ = 2.24; Poisson(9.5): μ = 9.5, σ² = 9.5, σ = 3.08]
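A numerical check of the Poisson results is sketched below in Python. Since the range of X is infinite, the sums are truncated after a fixed number of terms; the cutoff of 200 terms and the function name `poisson_moments` are our own assumptions, chosen so that the neglected tail is negligible for the values of μ used in Figure 7.5.

```python
from math import exp, sqrt

def poisson_moments(mu, terms=200):
    """Approximate (mean, variance) of Poisson(mu) by summing the first `terms`
    values of x; the truncation point is an assumption, not part of the notes."""
    fx = exp(-mu)              # f(0)
    mean = 0.0
    factorial_moment = 0.0     # accumulates E[X(X-1)]
    for x in range(terms):
        mean += x * fx
        factorial_moment += x * (x - 1) * fx
        fx *= mu / (x + 1)     # f(x+1) = f(x) * mu / (x+1), avoiding large factorials
    variance = factorial_moment + mean - mean**2   # formula (2)
    return mean, variance

results = {mu: poisson_moments(mu) for mu in (0.5, 1.5, 5, 9.5)}

for mu, (mean, variance) in results.items():
    # Columns: mu, mean by summation, variance by summation, standard deviation
    print(mu, round(mean, 6), round(variance, 6), round(sqrt(variance), 2))
```

For each μ the computed mean and variance both agree with μ, matching the values shown in the figure.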
Properties of Mean and Variance:
If a and b are constants and Y = aX + b, then
\[ \mu_Y = E(Y) = aE(X) + b = a\mu_X + b \]
and
\[ \sigma_Y^2 = Var(Y) = a^2 Var(X) = a^2 \sigma_X^2 \]
where μ_X = E(X), σ_X² = Var(X), μ_Y = E(Y) and σ_Y² = Var(Y).
Proof:
We already showed that E(Y) = E(aX + b) = aμ_X + b = μ_Y. Then
\[
\begin{aligned}
\sigma_Y^2 &= E\left[(Y - \mu_Y)^2\right] = E\left\{[(aX + b) - (a\mu_X + b)]^2\right\} \\
&= E\left[(aX - a\mu_X)^2\right] = E\left[a^2 (X - \mu_X)^2\right] \\
&= a^2 E\left[(X - \mu_X)^2\right] = a^2 \sigma_X^2
\end{aligned}
\]
This result is to be expected. Adding a constant, b, to all values of X has no effect on the amount of
variability. So it makes sense that Var(aX + b) doesn’t depend on the value of b. Also since variance
is in squared units, multiplication by a constant results in multiplying the variance by the constant
squared. A simple way to relate to this result is to consider a random variable X which represents a
temperature in degrees Celsius (even though this is a continuous random variable which we don’t study
until Chapter 9). Now let Y be the corresponding temperature in degrees Fahrenheit. We know that
\[ Y = \frac{9}{5} X + 32 \]
and it is clear that μ_Y = (9/5)μ_X + 32 and that σ_Y² = (9/5)² σ_X².
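To see the temperature example in numbers, the sketch below uses a small made-up discrete distribution for a Celsius temperature X (the three values and their probabilities are purely hypothetical, chosen only for illustration) and verifies both properties exactly with fractions. The helper name `mean_and_variance` is ours.

```python
from fractions import Fraction

# Hypothetical probability function for a temperature X in degrees Celsius.
# These values are invented for illustration and do not come from the notes.
f_x = {10: Fraction(1, 5), 15: Fraction(3, 10), 20: Fraction(1, 2)}

def mean_and_variance(pf):
    """Return (E(X), Var(X)) for a probability function given as {value: probability}."""
    mu = sum(x * p for x, p in pf.items())
    var = sum((x - mu) ** 2 * p for x, p in pf.items())
    return mu, var

a, b = Fraction(9, 5), 32                      # Y = (9/5) X + 32
f_y = {a * x + b: p for x, p in f_x.items()}   # probability function of Y

mu_x, var_x = mean_and_variance(f_x)
mu_y, var_y = mean_and_variance(f_y)

print(float(mu_x), float(var_x))   # 16.5 15.25
print(float(mu_y), float(var_y))   # 61.7 49.41
print(mu_y == a * mu_x + b)        # True
print(var_y == a**2 * var_x)       # True
```

The shift b = 32 changes the mean but not the variance, while the scale factor 9/5 enters the variance squared.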
Problems
7.4.1 An airline knows that there is a 97% chance a passenger for a certain flight will show up, and
assumes passengers arrive independently of each other. Tickets cost $100, but if a passenger
shows up and can’t be carried on the flight the airline has to refund the $100 and pay a penalty
of $400 to each such passenger. How many tickets should they sell for a plane with 120 seats to
maximize their expected ticket revenues after paying any penalty charges? Assume ticket holders
who don’t show up get a full refund for their unused ticket.
7.4.2 A typist typing at a constant speed of 60 words per minute makes a mistake in any particular
word with probability 0.04, independently from word to word. Each incorrect word must be
corrected, a task which takes 15 seconds per word.
(a) Find the mean and variance of the time (in seconds) taken to finish a 450 word passage.
(b) Would it be less time consuming, on average, to type at 45 words per minute if this reduces
the probability of an error to 0.02?
1. For Chapter 5, Problems 1 and 2 find E(X), E(X²), and Var(X).
2. Let X have probability function
\[
f(x) = \begin{cases} \dfrac{1}{2x} & \text{for } x = 2, 3, 4, 5, \text{ or } 6 \\[1ex] \dfrac{11}{40} & \text{for } x = 1 \end{cases}
\]
Find E(X), E(X²), and Var(X).
3. A person plays a game in which a fair coin is tossed until the first tail occurs. The person wins
$2^x if x tosses are needed for x = 1, 2, 3, 4, 5 but loses $256 if x > 5.
(a) Determine the expected winnings.

(b) Determine the variance of the winnings.
4. Yasmin and Zack are undergraduate mathematics students currently taking the same five courses.
Let X be the number of assignments they have in one week. The probability function of X is:
x       0     1     2     3     4     5
f(x)    0.09  0.10  0.25  0.40  0.15  0.01
The number of cups of coffee Yasmin and Zack drink in one week both depend on the number of
assignments they have. Yasmin drinks about 2X² cups per week and Zack drinks about |2X − 1|
cups per week.
(a) Find the expected number of cups of coffee that Yasmin will drink in a week. Find the
expected number of cups of coffee that Zack will drink in a week.
(b) Find the variance of the number of cups of coffee that Yasmin will drink in a week. Find
the variance of the number of cups of coffee that Zack will drink in a week.
(a) Find the mean and variance of X. Hint: See Problem 4.6.2.
(b) Use your result in (a) to show that if p is the probability of “success” (S) in a sequence of
Bernoulli trials, then the expected number of trials until the first S occurs is 1=p. Explain
why this is “obvious”.
6. Diagnostic medical tests I: Consider diagnostic tests like those discussed in the example of
Section 7.3. Assume that for a randomly selected person, P(D) = 0.02, P(R|D) = 1,
P(R|D̄) = 0.05, so that the inexpensive test only gives false positive, and not false negative,
results. Suppose that this inexpensive test costs $10. If a person tests positive then they are also
given a more expensive test, costing $100, which correctly identifies all persons with the disease.
What is the expected cost per person if a population is tested for the disease using the inexpensive
test followed, if necessary, by the expensive test?
7. Diagnostic medical tests II: Two percent of the population has a certain condition for which
there are two diagnostic tests. Test A, which costs $1 per person, gives positive results for 80%
of persons with the condition and for 5% of persons without the condition. Test B, which costs
$100 per person, gives positive results for all persons with the condition and negative results for
all persons without it.
(a) Suppose that test B is given to 150 persons, at a cost of $15,000. How many cases of the
condition would one expect to detect?
(b) Suppose that 2000 persons are given test A, and then only those who test positive are
given test B. Show that the expected cost is $15,000 but that the expected number of cases
detected is much larger than in part (a).
8. Diagnostic medical tests III: Suppose that n people take a blood test for a disease, where each
person has probability p of having the disease, independent of other persons. To save time and
money, blood samples from k people are pooled and analyzed together. If none of the k persons
has the disease then the test will be negative, but otherwise it will be positive. If the pooled test
is positive then each of the k persons is tested separately (so k + 1 tests are done in that case).
(a) Let X be the number of tests required for a group of k people. Show that
\[ E(X) = k + 1 - k(1 - p)^k \]
(b) What is the expected number of tests required for n/k groups of k people each? If p = 0.01,
evaluate this for the cases k = 1, 5, 10.
(c) Show that if p is small, the expected number of tests in part (b) is approximately
n(kp + k^{−1}), and is minimized for k ≈ p^{−1/2}. Hint: Use the linear approximation
(1 − p)^k ≈ 1 − kp for p close to 0.
9. The probability that a roulette wheel stops on a red number is 18/37. Suppose you bet x dollars
on “red”. If the wheel stops on a red number then you are paid 2x dollars and your net winnings
are x dollars. If the wheel does not stop on a red number then you lose your bet and your net
winnings are −x dollars.
(a) If you bet $1 on each of 10 consecutive plays, what is the expected value of your net
winnings? What is the expected value of your net winnings if you bet $10 on a single play?
(b) For each of the two cases in part (a), calculate the probability that you made a profit (that
is, your winnings are positive, not negative).
10. Consider the slot machine discussed in Chapter 4, Problem 17. Suppose that the number of each
type of symbol on wheels 1, 2 and 3 is as given below:
                      Wheel
                   1     2     3
          Flower   2     6     2
Symbols   Dog      4     3     3
          House    4     1     5
If all three wheels stop on a flower, you win $20 for a $1 bet. If all three wheels stop on a dog,
you win $10, and if all three wheels stop on a house, you win $5. Otherwise you win nothing.
Find the expected value of your winnings per dollar spent.
11. Suppose a slot machine has n + 1 possible outcomes A_1, A_2, ..., A_{n+1} for a single play. A
single play costs $1. If outcome A_i occurs, a player wins $a_i, for i = 1, 2, ..., n. If outcome
A_{n+1} occurs, the player wins nothing. In other words, if outcome A_i, i = 1, 2, ..., n occurs the
player’s net winnings are $(a_i − 1) and if A_{n+1} occurs the player’s net winnings are $(−1).
(a) Give a formula for the expected value of your net winnings from a single play, if the
probabilities of the n + 1 outcomes are p_i = P(A_i), i = 1, 2, ..., n + 1.
(b) The slot machine owner wants the expected value of the player’s net winnings to be
negative. Suppose n = 4, with p_1 = 0.1, p_2 = p_3 = p_4 = 0.04 and p_5 = 0.78. If the slot
machine is set to pay $3 when outcome A_1 occurs, and $5 when outcomes A_2, A_3 or A_4
occur, determine the expected value of the player’s net winnings from a single play.
(c) The slot machine owner wants the player’s net winnings (equivalently the owner’s net
payout) to be $d b_i when outcome A_i occurs, where b_i = 1/p_i and d is a number between 0 and
1. The owner also wants the expected value of the player’s net winnings to be $(−0.05) per
play. Find d as a function of n and p_{n+1}. What is the value of d if n = 10 and p_{n+1} = 0.7?
12. A contestant on a game show has two questions, one from category A and one from category B.
She may choose which category to attempt first but she must answer the first question correctly
to be able to attempt the remaining question. If she answers A correctly she receives $100 and if
she answers B correctly she receives $200. She knows the answer to A with probability 0.8 and
the answer to B with probability 0.6. (Assume independence in knowing the answers to the two
questions.)
(a) Which question should she attempt first to maximize her winnings?
(b) Suppose that she must now pay a $50 penalty if she gets the first question wrong. What
question should she attempt first?
13. A manufacturer of car radios ships them to retailers in cartons of n radios. The profit per radio
is $59.50, less shipping cost of $25 per carton, so the profit is $(59.5n − 25) per carton. To
promote sales by assuring high quality, the manufacturer promises to pay the retailer $200X² if
X radios in the carton are defective. (The retailer is then responsible for repairing any defective
radios.) Suppose radios are produced independently and that 5% of radios are defective. How
many radios should be packed per carton to maximize expected net profit per carton?
14. On Halloween trick-or-treaters arrive at a house from 5:30pm until 9pm according to a Poisson
Process with an average of 12 trick-or-treaters per hour.
(a) What is the probability that between 5 and 7 trick-or-treaters (inclusive) arrive in the first
half hour?
(b) How many trick-or-treaters would be expected to arrive over the whole evening?
(c) What number of trick-or-treaters is most likely to arrive?
(d) What is the variance of the number of trick-or-treaters that arrive over the whole evening?
15. Assume that each week a stock either increases in value by $1 with probability 1/2 or decreases by
$1, these moves independent of the past. The current price of the stock is $50. I wish to purchase
a call option which allows me (if I wish to do so) the option of buying the stock 13 weeks from
now at a “strike price” of $55. Of course if the stock price at that time is $55 or less there is no
benefit to the option and it is not exercised. Assume that the return from the option is
\[ R = \max(S - 55, 0) \]
where S is the price of the stock after 13 weeks. What is the fair price of the option today
assuming no transaction costs and 0% interest, that is, what is E(R)?
Hint: Let X = the number of times the stock increases in value by $1 during the 13 weeks and
determine S in terms of X.
16. Web cache: Web browsers often use a cache (locally-stored information which can be accessed
very quickly) to improve performance. When a user requests information, the cache is searched
first. If the information is found in the cache (known as a “cache hit”), it takes 10 ms (millisec-
onds) for the information to be displayed to the user. If the information is not found in the cache
(a “cache miss”), a request is made to a web server (50 ms), a data base is searched (70 ms), and
the information is returned and displayed to the user (50 ms).
(a) If no cache is used, what is the expected time for the information to be displayed to the
user?
(b) If a cache is used and there is a 20% chance of a cache hit, what is the expected time for
the information to be displayed to the user?
(c) How small would the probability of a cache hit need to be to have these expected times be
equal?
17. Analysis of algorithms - Quicksort: Suppose we have a set S of distinct numbers and we wish
to sort them from smallest to largest. The Quicksort algorithm works as follows: When n = 2
it just compares the numbers and puts the smallest one first. For n > 2 it starts by choosing a
random “pivot” number from the n numbers. It then compares each of the other n - 1 numbers
with the pivot and divides them into groups $S_1$ (numbers smaller than the pivot) and $\bar{S}_1$ (numbers
bigger than the pivot). It then does the same thing with $S_1$ and $\bar{S}_1$ as it did with S, and repeats this
recursively until the numbers are all sorted. (Try this out with, say, n = 10 numbers to see how
it works.) In computer science it is common to analyze such algorithms by finding the expected
number of comparisons (or other operations) needed to sort a list. Thus, let
$C_n$ = expected number of comparisons for lists of length n
(a) Show that if X is the number of comparisons needed,
$$C_n = \sum_{i=1}^{n} \frac{1}{n} E(X \mid \text{initial pivot is } i\text{th smallest number})$$
(b) Show that
$$E(X \mid \text{initial pivot is } i\text{th smallest number}) = n - 1 + C_{i-1} + C_{n-i}$$
and thus that $C_n$ satisfies the recursion (note $C_0 = C_1 = 0$)
$$C_n = n - 1 + \frac{2}{n} \sum_{k=1}^{n-1} C_k \quad \text{for } n = 2, 3, \ldots$$
(c) Show that
$$(n + 1)C_{n+1} = 2n + (n + 2)C_n \quad \text{for } n = 1, 2, \ldots$$
(d) (Harder) Use the result of part (c) to show that for large n,
$$\frac{C_{n+1}}{n+1} \sim 2 \log(n + 1)$$
(Note: $a_n \sim b_n$ means $a_n / b_n \to 1$ as $n \to \infty$.) This proves a result from computer science
which says that for Quicksort, $C_n = O(n \log n)$.
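As a numerical sanity check (not a proof) of parts (b) to (d), the following Python sketch computes $C_n$ from the recursion in (b), checks the identity in (c) up to floating-point error, and prints the ratio from (d). The function name `expected_comparisons` and the cut-off of 10,000 are illustrative choices, not part of the problem.

```python
import math

def expected_comparisons(n_max):
    """Return [C_0, ..., C_{n_max}] using C_n = n - 1 + (2/n) * (C_1 + ... + C_{n-1})."""
    c = [0.0] * (n_max + 1)      # C_0 = C_1 = 0
    running_sum = 0.0            # holds C_1 + ... + C_{n-1}
    for n in range(2, n_max + 1):
        running_sum += c[n - 1]
        c[n] = n - 1 + 2.0 * running_sum / n
    return c

C = expected_comparisons(10_000)

# Part (c): (n + 1) C_{n+1} = 2n + (n + 2) C_n, so this gap should be tiny.
max_gap = max(abs((n + 1) * C[n + 1] - (2 * n + (n + 2) * C[n]))
              for n in range(1, 10_000))

# Part (d): ratio of C_{n+1}/(n+1) to 2 log(n+1) at n + 1 = 10,000.
n = 9_999
ratio = (C[n + 1] / (n + 1)) / (2 * math.log(n + 1))
print(C[2], C[3], max_gap, ratio)
```

The ratio is still noticeably below 1 at this size, which is consistent with (d): the convergence is only logarithmically fast.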
18. Challenge problem: Let $X_n$ be the number of ascents in a random permutation of the integers
$\{1, 2, \ldots, n\}$. For example, the number of ascents in the permutation 213546 is three, since
2, 135, 46 form ascending sequences.
(a) Show that the following recursion holds for the probabilities $p_n(k) = P(X_n = k)$:
$$p_n(k) = \frac{k+1}{n}\, p_{n-1}(k) + \frac{n-k}{n}\, p_{n-1}(k-1)$$
(b) Cards numbered $1, 2, \ldots, n$ are shuffled, drawn and put into a pile as long as the card drawn
has a number lower than its predecessor. A new pile is started whenever a higher card is
drawn. Show that the distribution of the number of piles that we end with is that of $1 + X_n$
and that the expected number of piles is $\frac{n+1}{2}$.
8. CONTINUOUS RANDOM VARIABLES
8.1 General Terminology and Notation
For continuous random variables the range (set of possible values) is an interval (or a collection of
intervals) on the real number line. Continuous random variables must be treated a little differently than
discrete random variables because $P(X = x)$ is zero for each $x$. To illustrate a random variable with
a continuous distribution, consider the simple spinning pointer in Figure 8.1 operating in a frictionless
environment. Suppose we assume that the pointer is equally likely to stop at any point in the interval
$(0, 4]$. If we assume this probability is $p > 0$, then for $A = \{x : 0 < x \le 4\}$,
$P(A) = \sum_{x \in (0,4]} p = \infty$
since the set $A$ is uncountably infinite. This implies that the probability that the pointer stops precisely at
any given number $x$ must be zero. Note however that it seems reasonable to assign a probability of
$\frac{1}{16}$ to the event that the spinner stops at some value $x$ in the interval $(0, \frac{1}{4}]$ or $(1\frac{3}{4}, 2]$. For continuous
random variables we specify the probability of intervals, rather than individual points.
[Figure 8.1: Spinner: a device for generating a continuous random variable]
Consider another example produced by choosing a “random point” in a region. Suppose we plot a
graph of a function $f(x)$ as in Figure 8.2 (assume the function is positive and has finite integral) and
then generate a point at random by closing our eyes and firing a dart from a distance until at least one
lands in the shaded region under the graph. We assume such a point, here denoted “*”, is “uniformly”
distributed under the graph. This means that the point is equally likely to fall in any one of many
possible regions of a given area located in the shaded region so we only need to know the area of a
region to determine the probability that a point falls in it. Consider the x-coordinate $X$ of the point
“*” as our random variable. (In Figure 8.2 it appears to be around 6.) Notice that the probability that
$X$ falls in a particular interval $(a, b)$ is measured by the area of the region above this interval, that is,
$\int_a^b f(x)dx$, and so the probability of any particular point $P(X = a)$ is the area of the region immediately
above this single point, $\int_a^a f(x)dx = 0$. This is another example of a random variable $X$ which has a
continuous distribution.
[Figure 8.2: Graph of $f(x)$]
For a continuous random variable X, there are two commonly used functions which describe its
distribution. The first is the cumulative distribution function, used before for discrete distributions, and
the second is the probability density function, the derivative of the cumulative distribution function.
Cumulative Distribution Function:

For discrete random variables we defined the cumulative distribution function, $F(x) = P(X \le x)$. For
continuous random variables we can also define the cumulative distribution function. For the spinner
example, the probability the pointer stops between 0 and 1 is 1/4 if all values x are equally “likely”;
between 0 and 2 the probability is 1/2, between 0 and 3 it is 3/4; and so on. In general, $F(x) = x/4$
for $0 < x \le 4$. Also, $F(x) = 0$ for $x \le 0$ since there is no chance of the pointer stopping at a number
$\le 0$, and $F(x) = 1$ for $x > 4$ since the pointer is certain to stop at a number below $x$ if $x > 4$.
[Figure 8.3: Area of shaded region equals $F(x) = P(X \le x)$]
In the second example, in which we generated a point at random under the graph of a
function $f(x)$, suppose that the total area under the graph is one. Then the cumulative distribution
function $F(x)$ is the area under the graph to the left of the point $x$, as in Figure 8.3.
Most properties of a cumulative distribution function are the same for continuous variables as for
discrete variables. These are:
1. F (x) is defined for all real x
2. F (x) is a non-decreasing function of x for all real x
3. $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$
4. $P(a < X \le b) = F(b) - F(a)$.
Note that, as indicated before, for a continuous random variable, we have
$$0 = P(X = a) = \lim_{\varepsilon \to 0} P(a - \varepsilon < X \le a) = \lim_{\varepsilon \to 0} \left[ F(a) - F(a - \varepsilon) \right]$$
This means that $\lim_{\varepsilon \to 0} F(a - \varepsilon) = F(a)$ or that the distribution function $F$ is a continuous function (in
the sense of continuity in calculus). Also, since the probability is 0 at each point:
$$P(a < X < b) = P(a \le X \le b) = P(a \le X < b) = P(a < X \le b) = F(b) - F(a)$$
(For a discrete random variable, each of these 4 probabilities could be different.) For the continuous
distributions in this chapter, we do not worry about whether intervals are open, closed, or half-open
since the probability of these intervals is the same.
Probability Density Function: While the cumulative distribution function can be used to find
probabilities, it does not give an intuitive picture of which values of x are more likely, and which are less
likely. To develop such a picture suppose that we take a short interval of X-values, $[x, x + \Delta x]$. The
probability X lies in the interval is
$$P(x \le X \le x + \Delta x) = F(x + \Delta x) - F(x)$$
It is easy to compare the probabilities for two intervals, each of length $\Delta x$. Now suppose we consider
what happens as $\Delta x$ becomes small, and we divide the probability by $\Delta x$. This leads to the following
definition.
Definition 20 The probability density function (p.d.f.) $f(x)$ for a continuous random variable X is
the derivative
$$f(x) = \frac{dF(x)}{dx}$$
where $F(x)$ is the cumulative distribution function for X.
If the derivative of $F$ does not exist at $x = a$ we usually define $f(a) = 0$ for convenience. Note that
if the function $f(x)$ graphed in Figure 8.3 has total integral one, the cumulative distribution function or
the area to the left of a point $x$ is given by $F(x) = \int_{-\infty}^{x} f(u)du$ and so the derivative of the cumulative
distribution function is $F'(x) = f(x)$. It is clear from the way in which X was generated that $f(x)$
represents the relative likelihood of (small intervals around) different x-values. To see this more precisely, we first note
some properties of a probability density function. It is assumed that $f(x)$ is a continuous function of x
at all points for which $0 < F(x) < 1$.
Properties of a probability density function:
1. $P(a \le X \le b) = F(b) - F(a) = \int_a^b f(x)dx$. (This follows from the definition of $f(x)$)
2. $f(x) \ge 0$. (Since $F(x)$ is non-decreasing, its derivative is non-negative)
3. $\int_{-\infty}^{\infty} f(x)dx = \int_{\text{all } x} f(x)dx = 1$. (This is because $P(-\infty \le X \le \infty) = 1$)
4. $F(x) = \int_{-\infty}^{x} f(u)du$. (This is just property 1 with $a = -\infty$)
To see that $f(x)$ represents the relative likelihood of different outcomes, we note that for $\Delta x$ small,
$$P\left(x - \frac{\Delta x}{2} \le X \le x + \frac{\Delta x}{2}\right) = F\left(x + \frac{\Delta x}{2}\right) - F\left(x - \frac{\Delta x}{2}\right) \approx f(x)\Delta x$$
Thus, $f(x) \ne P(X = x)$ but $f(x)\Delta x$ is the approximate probability that X is inside an interval of
length $\Delta x$ centered about the value $x$ when $\Delta x$ is small. A plot of the function $f(x)$ shows such values
clearly and for this reason it is very common to plot the probability density functions of continuous
random variables.
[Figure 8.4: Probability density function for spinner example]
Example: Consider the spinner example, where
$$F(x) = \begin{cases} 0 & x \le 0 \\ \frac{x}{4} & 0 < x \le 4 \\ 1 & x > 4 \end{cases}$$
Thus, the probability density function is $f(x) = F'(x)$, or
$$f(x) = \frac{1}{4} \quad \text{if } 0 < x < 4$$
and outside this interval the probability density function is defined to be 0. Figure 8.4 shows the
probability density function $f(x)$; for obvious reasons this is called a “Uniform” distribution.
Remark: Continuous probability distributions are, like discrete distributions, mathematical[29] models.
Thus, the Uniform distribution assumed for the spinner above is a model, and it seems likely it would
be a good model for many real spinners.
[29] “How can it be that mathematics, being after all a product of human thought which is independent of experience, is so
admirably appropriate to the objects of reality? Is human reason, then, without experience, merely by taking thought, able to
fathom the properties of real things?” Albert Einstein.
Remark: It may seem paradoxical that $P(X = x) = 0$ for a continuous random variable and yet
we record the outcomes $X = x$ in real “experiments” with continuous variables. The catch is that all
measurements have finite precision; they are in effect discrete. For example, the height $60 + \pi$ inches
is within the range of the height X of people in a population but we could never observe the outcome
$X = 60 + \pi$ if we selected a person at random and measured their height.
To summarize, in measurements we are actually observing something like
$$P(x - 0.5\Delta \le X \le x + 0.5\Delta)$$
where $\Delta$ may be very small, but not zero. The probability of this outcome is not zero: it is
(approximately) $f(x)\Delta$.
We now consider a more complicated mathematical example of a continuous random variable.

Remember that it is always a good idea to sketch or plot the probability density function f (x) for a
random variable.
Example:
Let X be a continuous random variable with probability density function
$$f(x) = \begin{cases} kx^2 & 0 < x \le 1 \\ k(2 - x) & 1 < x < 2 \\ 0 & \text{otherwise} \end{cases}$$
Find: (a) the constant $k$
(b) the cumulative distribution function $F(x) = P(X \le x)$
(c) $P(0.5 < X < 1.5)$
[Figure 8.5: Area of shaded region equals one]

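Solution:
(a) When finding the area of a region bounded by different functions we split the integral into pieces.
$$1 = \int_{-\infty}^{\infty} f(x)dx = \int_{-\infty}^{0} 0\,dx + \int_0^1 kx^2 dx + \int_1^2 k(2 - x)dx + \int_2^{\infty} 0\,dx$$
$$= 0 + k\int_0^1 x^2 dx + k\int_1^2 (2 - x)dx + 0$$
$$= k\left.\frac{x^3}{3}\right|_0^1 + k\left.\left(2x - \frac{x^2}{2}\right)\right|_1^2$$
$$= \frac{5k}{6} \quad \text{and therefore } k = \frac{6}{5}$$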
(b) Let us start with the easy pieces (which are unfortunately often left out) first:
$F(x) = P(X \le x) = 0$ if $x \le 0$
$F(x) = P(X \le x) = 1$ if $x \ge 2$ since the probability density function equals 0 for all $x \ge 2$
By looking at Figure 8.6 we have
[Figure 8.6: Area of shaded region equals $F(x) = P(X \le x)$]
$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(z)dz = 0 + \int_0^x \frac{6}{5}z^2 dz = \frac{6}{5}\left.\frac{z^3}{3}\right|_0^x = \frac{2x^3}{5} \quad \text{if } 0 < x < 1$$
$$F(x) = P(X \le x) = 0 + \int_0^1 \frac{6}{5}z^2 dz + \int_1^x \frac{6}{5}(2 - z)dz = \frac{6}{5}\left.\frac{z^3}{3}\right|_0^1 + \frac{6}{5}\left.\left(2z - \frac{z^2}{2}\right)\right|_1^x$$
$$= \frac{12x - 3x^2 - 7}{5} \quad \text{if } 1 < x < 2$$
Therefore
$$F(x) = P(X \le x) = \begin{cases} 0 & x \le 0 \\ \frac{2x^3}{5} & 0 < x \le 1 \\ \frac{12x - 3x^2 - 7}{5} & 1 < x < 2 \\ 1 & x \ge 2 \end{cases}$$
As a rough check, since for a continuous distribution there is no probability at any point, $F(x)$
should have the same value as we approach each boundary point from above and from below.
For example,
as $x \to 0^+$, $\frac{2x^3}{5} \to 0$
as $x \to 1^-$, $\frac{2x^3}{5} \to \frac{2}{5}$
as $x \to 1^+$, $\frac{12x - 3x^2 - 7}{5} \to \frac{2}{5}$
as $x \to 2^-$, $\frac{12x - 3x^2 - 7}{5} \to 1$
This quick check won’t prove your answer is right, but will detect many careless errors.
(c)
$$P(0.5 < X < 1.5) = \int_{0.5}^{1.5} f(x)dx = F(1.5) - F(0.5)$$
$$= \frac{12(1.5) - 3(1.5)^2 - 7}{5} - \frac{2(0.5)^3}{5} = 0.8$$
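The three answers above can be checked numerically. The following Python sketch is one way to do so, assuming a simple midpoint rule is accurate enough here; the helper names `pdf`, `integrate` and `cdf` are illustrative and do not come from the notes.

```python
def pdf(x, k=6 / 5):
    """Density from the example: k*x^2 on (0, 1], k*(2 - x) on (1, 2), 0 elsewhere."""
    if 0 < x <= 1:
        return k * x**2
    if 1 < x < 2:
        return k * (2 - x)
    return 0.0

def integrate(g, a, b, steps=20_000):
    """Composite midpoint rule; a simple stand-in for a library integrator."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

def cdf(x):
    """Closed-form F(x) derived in part (b)."""
    if x <= 0:
        return 0.0
    if x <= 1:
        return 2 * x**3 / 5
    if x < 2:
        return (12 * x - 3 * x**2 - 7) / 5
    return 1.0

# Part (a): integrate each piece separately with k = 1, then rescale.
area_with_k_equal_1 = (integrate(lambda x: pdf(x, k=1), 0, 1)
                       + integrate(lambda x: pdf(x, k=1), 1, 2))
k = 1 / area_with_k_equal_1

# Part (c): numerical integral of f versus F(1.5) - F(0.5).
prob_numeric = integrate(pdf, 0.5, 1) + integrate(pdf, 1, 1.5)
prob_closed = cdf(1.5) - cdf(0.5)
print(k, prob_numeric, prob_closed)
```

Splitting each integral at x = 1 mirrors the worked solution: the density has a different formula on each side of that point.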
Definition 21 Quantiles and Percentiles: Suppose X is a continuous random variable with cumulative
distribution function $F(x)$. The pth quantile of X (or the pth quantile of the distribution) is the
value $q(p)$ such that $P[X \le q(p)] = p$. The value $q(p)$ is also called the 100pth percentile of the
distribution. If $p = 0.5$ then $m = q(0.5)$ is called the median of X or the median of the distribution.
Example: For the example above find:
(a) the 0:4 quantile (40th percentile) of the distribution
(b) the median of the distribution

Solution:
(a) Since $F(1) = 0.4$, the 0.4 quantile is equal to 1.
(b) The median is the solution to $F(x) = \frac{12x - 3x^2 - 7}{5} = 0.5$ or $24x - 6x^2 - 19 = 0$, which has two
solutions. Since $F(1) = 0.4$ we know that the median must lie between 1 and 2 and we choose
the solution $x \approx 1.087$. The median is approximately equal to 1.087.
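When $F(x) = p$ cannot be solved by hand, a quantile can be located numerically. The following Python sketch uses bisection on the cumulative distribution function from the example, relying only on the fact that $F$ is non-decreasing; the function names and the tolerance are illustrative assumptions.

```python
def cdf(x):
    """F(x) for the example with k = 6/5."""
    if x <= 0:
        return 0.0
    if x <= 1:
        return 2 * x**3 / 5
    if x < 2:
        return (12 * x - 3 * x**2 - 7) / 5
    return 1.0

def quantile(p, lo=0.0, hi=2.0, tol=1e-10):
    """Approximate the smallest x in [lo, hi] with F(x) >= p, for 0 < p < 1."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) >= p:
            hi = mid        # the quantile is at mid or to its left
        else:
            lo = mid        # the quantile is to the right of mid
    return hi

q40 = quantile(0.4)
median = quantile(0.5)
print(q40, median)
```

Bisection halves the search interval at each step, so roughly 35 evaluations of `cdf` are enough for the tolerance used here.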
Defined Variables or Change of Variable:

When we know the probability density function or cumulative distribution function for a continuous
random variable X we sometimes want to find the probability density function or cumulative
distribution function for some other random variable Y which is a function of X. The procedure for doing
this is summarized below. It is based on the fact that the cumulative distribution function $F_Y(y)$ for Y
equals $P(Y \le y)$, and this can be rewritten in terms of X since Y is a function of X. Thus:
1) Write the cumulative distribution function of Y as a function of X.
2) Use $F_X(x)$ to find $F_Y(y)$. Then if you want the probability density function $f_Y(y)$, you can
differentiate the expression for $F_Y(y)$.
3) Find the range of values of y.
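Example: In the earlier spinner example,
$$f(x) = \begin{cases} \frac{1}{4} & 0 < x \le 4 \\ 0 & \text{otherwise} \end{cases}$$
and
$$F(x) = \begin{cases} 0 & x \le 0 \\ \frac{x}{4} & 0 < x < 4 \\ 1 & x \ge 4 \end{cases}$$
Find the probability density function of $Y = X^{-1}$.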
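Solution:
Step 1 above becomes:
$$F_Y(y) = P(Y \le y) = P\left(X^{-1} \le y\right) = P\left(X \ge y^{-1}\right) = 1 - P\left(X < y^{-1}\right) = 1 - F_X\left(y^{-1}\right)$$
For step (2), we can substitute $y^{-1}$ in place of $x$ in $F_X(x)$ giving:
$$F_Y(y) = 1 - \frac{y^{-1}}{4} = 1 - \frac{1}{4y}$$
and then differentiate to obtain the probability density function
$$f_Y(y) = \frac{d}{dy}F_Y(y) = \frac{1}{4y^2} \quad \text{for } y \ge \frac{1}{4}$$
(Note that as $x$ goes from 0 to 4, $y = \frac{1}{x}$ goes between $\infty$ and $\frac{1}{4}$.)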
Alternatively, and a little more generally, we can use the chain rule:
$$f_Y(y) = \frac{d}{dy}F_Y(y) = \frac{d}{dy}\left[1 - F_X\left(y^{-1}\right)\right]$$
$$= -f_X\left(y^{-1}\right)\frac{d}{dy}\left(y^{-1}\right) \quad \text{since } \frac{d}{dx}F_X(x) = f_X(x)$$
$$= -f_X\left(y^{-1}\right)\left(-y^{-2}\right) = \frac{1}{4}y^{-2}$$
$$= \frac{1}{4y^2} \quad \text{for } y \ge \frac{1}{4}$$
Generally if $F_X(x)$ is known in some easy form, it is easier to substitute first, then differentiate. If
$F_X(x)$ is more complicated, for example an integral that can’t be easily found, it is usually easier to
differentiate first, then substitute $f_X(x)$.
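The result $F_Y(y) = 1 - \frac{1}{4y}$ can be checked by simulation. The following Python sketch generates spinner values, transforms them to $Y = 1/X$, and compares the empirical cumulative distribution function with the formula at a few points. The seed, sample size and evaluation points are arbitrary illustrative choices.

```python
import random

rng = random.Random(230)          # fixed seed so the run is reproducible
n = 200_000

# X is uniform on (0, 4]: 1 - random() lies in (0, 1], so 1/X is always defined.
ys = [1.0 / (4.0 * (1.0 - rng.random())) for _ in range(n)]

def cdf_y(y):
    """F_Y(y) = 1 - 1/(4y) for y >= 1/4, and 0 below 1/4."""
    return 1.0 - 1.0 / (4.0 * y) if y >= 0.25 else 0.0

for y in (0.3, 0.5, 1.0, 2.0):
    empirical = sum(v <= y for v in ys) / n
    print(y, round(empirical, 4), round(cdf_y(y), 4))
```

The two printed columns should agree to about two decimal places; the remaining difference is sampling error.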
Expectation, Mean, and Variance for Continuous Random Variables
Definition 22 When X is a continuous random variable we define
$$E[g(X)] = \int_{-\infty}^{\infty} g(x)f(x)dx$$
Note that this is analogous to the definition for discrete random variables, $E[g(X)] = \sum_{\text{all } x} g(x)f(x)$.
With this definition, all of the earlier properties of expected value and variance still hold. For example
with $\mu = E(X)$,
$$\sigma^2 = Var(X) = E[(X - \mu)^2] = E(X^2) - \mu^2$$
(This definition can be justified by writing $\int_{-\infty}^{\infty} g(x)f(x)dx$ as a limit of a Riemann sum and recognizing
the Riemann sum as being in the form of an expected value for discrete random variables.)
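Example: For the earlier spinner example,
$$f(x) = \begin{cases} \frac{1}{4} & 0 < x \le 4 \\ 0 & \text{otherwise} \end{cases}$$
Therefore
$$\mu = E(X) = \int_{-\infty}^{\infty} xf(x)dx = 0 + \int_0^4 x\,\frac{1}{4}\,dx + 0 = \frac{1}{4}\left.\frac{x^2}{2}\right|_0^4 = 2$$
$$E\left(X^2\right) = \int_{-\infty}^{\infty} x^2 f(x)dx = 0 + \int_0^4 x^2\,\frac{1}{4}\,dx + 0 = \frac{1}{4}\left.\frac{x^3}{3}\right|_0^4 = \frac{16}{3}$$
$$\sigma^2 = Var(X) = E\left(X^2\right) - \mu^2 = \frac{16}{3} - (2)^2 = \frac{4}{3}$$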
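Example: Let X have probability density function
$$f(x) = \begin{cases} \frac{6x^2}{5} & 0 < x \le 1 \\ \frac{6}{5}(2 - x) & 1 < x < 2 \\ 0 & \text{otherwise} \end{cases}$$
Then
$$\mu = E(X) = \int_{-\infty}^{\infty} xf(x)dx = 0 + \int_0^1 x\,\frac{6}{5}x^2 dx + \int_1^2 x\,\frac{6}{5}(2 - x)dx + 0$$
$$= \frac{6}{5}\left[\left.\frac{x^4}{4}\right|_0^1 + \left.\left(x^2 - \frac{x^3}{3}\right)\right|_1^2\right] = \frac{11}{10} = 1.1$$
$$E(X^2) = \int_{-\infty}^{\infty} x^2 f(x)dx = 0 + \int_0^1 x^2\,\frac{6}{5}x^2 dx + \int_1^2 x^2\,\frac{6}{5}(2 - x)dx + 0$$
$$= \frac{6}{5}\left[\left.\frac{x^5}{5}\right|_0^1 + 2\left.\frac{x^3}{3}\right|_1^2 - \left.\frac{x^4}{4}\right|_1^2\right] = \frac{67}{50}$$
$$\sigma^2 = Var(X) = E\left(X^2\right) - \mu^2 = \frac{67}{50} - \left(\frac{11}{10}\right)^2 = \frac{13}{100} = 0.13$$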
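The moments in this example can be confirmed numerically from the definition $E[g(X)] = \int g(x)f(x)dx$. The following Python sketch does this with a midpoint rule; the helper names `pdf`, `integrate` and `expect` are illustrative assumptions, and the integration is split at x = 1 because the density changes formula there.

```python
def pdf(x):
    """Density from the example: (6/5)x^2 on (0, 1], (6/5)(2 - x) on (1, 2)."""
    if 0 < x <= 1:
        return 6 * x**2 / 5
    if 1 < x < 2:
        return 6 * (2 - x) / 5
    return 0.0

def integrate(g, a, b, steps=20_000):
    """Composite midpoint rule; a simple stand-in for a library integrator."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

def expect(g):
    """E[g(X)], integrating over the two pieces of the support separately."""
    return (integrate(lambda x: g(x) * pdf(x), 0, 1)
            + integrate(lambda x: g(x) * pdf(x), 1, 2))

mean = expect(lambda x: x)
second_moment = expect(lambda x: x**2)
variance = second_moment - mean**2
print(mean, second_moment, variance)
```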
Problems
8.1.1 Let X be a continuous random variable with probability density function
$$f(x) = \begin{cases} kx^2 & -1 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$
Find
(a) $k$
(b) the cumulative distribution function, $F(x)$
(c) $P(-0.1 < X \le 0.2)$
(d) the mean and variance of X.
(e) the probability density function of $Y = X^2$.
8.1.2 Let X be a continuous random variable with cumulative distribution function
$$F(x) = \begin{cases} \frac{kx^n}{1 + x^n} & x > 0;\ n > 0 \\ 0 & x \le 0 \end{cases}$$
(a) Find $k$.
(b) Find the probability density function, f (x).
(c) Find the median.
8.2 Continuous Uniform Distribution

Just as we did for discrete random variables, we now consider some special types of continuous proba-
bility distributions. These distributions arise in certain settings, described below. This section considers
what we call Uniform distributions.
Physical Setup:
Suppose X takes values in some interval $[a, b]$ (it doesn’t actually matter whether the interval is open or
closed) with all subintervals of a fixed length being equally likely. Then X has a continuous Uniform
distribution. We write $X \sim U(a, b)$.
Illustrations:
(1) In the spinner example $X \sim U(0, 4)$.
(2) Computers can generate a random number X which appears as though it is drawn from the
distribution $U(0, 1)$. This is the starting point for many computer simulations of random processes; an
example is given below.
The probability density function and the cumulative distribution function:

Since all points are equally likely (more precisely, intervals contained in $[a, b]$ of a given length, say
0.01, all have the same probability), the probability density function must be a constant $f(x) = k$ for
all $a \le x \le b$ for some constant $k$. To make $\int_a^b f(x)dx = 1$, we require $k = \frac{1}{b - a}$.
Therefore the probability density function is
$$f(x) = \begin{cases} \frac{1}{b - a} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$
which is pictured in Figure 8.7. It is easy to verify that $\int_{-\infty}^{\infty} f(x)dx = 1$ since the shaded
region is a rectangle of area $(b - a)\frac{1}{b - a} = 1$.
[Figure 8.7: Probability density function for a $U(a, b)$ random variable]
The cumulative distribution function is
$$F(x) = \begin{cases} 0 & x < a \\ \int_a^x \frac{1}{b - a}du = \frac{x - a}{b - a} & a \le x \le b \\ 1 & x > b \end{cases}$$
which is pictured in Figure 8.8.
[Figure 8.8: The cumulative distribution function for a $U(a, b)$ random variable]
Mean and Variance:

The mean of a $U(a, b)$ random variable can easily be determined by noting that the graph of the
probability density function is symmetric about the line $x = (a + b)/2$. Since the integral $\int_a^b x\,dx$ exists
(why?), $E(X)$ exists and by symmetry $E(X) = (a + b)/2$.
To determine $Var(X)$ we note that
$$E(X^2) = \int_{-\infty}^{\infty} x^2 f(x)dx = \int_a^b x^2\,\frac{1}{b - a}\,dx = \frac{1}{(b - a)}\left.\frac{x^3}{3}\right|_a^b = \frac{b^3 - a^3}{3(b - a)}$$
$$= \frac{(b - a)\left(b^2 + ab + a^2\right)}{3(b - a)} = \frac{b^2 + ab + a^2}{3}$$
and therefore
$$\sigma^2 = Var(X) = E\left(X^2\right) - \mu^2 = \frac{b^2 + ab + a^2}{3} - \left(\frac{b + a}{2}\right)^2$$
$$= \frac{4b^2 + 4ab + 4a^2 - 3b^2 - 6ab - 3a^2}{12} = \frac{b^2 - 2ab + a^2}{12} = \frac{(b - a)^2}{12}$$
In summary:
If $X \sim U(a, b)$ then $E(X) = \frac{a + b}{2}$ and $\sigma^2 = Var(X) = \frac{(b - a)^2}{12}$
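The summary formulas can be compared with a simulated sample. The following Python sketch draws from $U(a, b)$ using the standard library function `random.uniform`; the endpoints $a = 2$, $b = 10$, the seed and the sample size are arbitrary illustrative choices.

```python
import random

rng = random.Random(8)            # fixed seed so the run is reproducible
a, b = 2.0, 10.0
n = 200_000

xs = [rng.uniform(a, b) for _ in range(n)]
sample_mean = sum(xs) / n
sample_var = sum((x - sample_mean) ** 2 for x in xs) / n

theory_mean = (a + b) / 2          # (a + b)/2
theory_var = (b - a) ** 2 / 12     # (b - a)^2 / 12
print(sample_mean, theory_mean, sample_var, theory_var)
```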
Example: Suppose X has the continuous probability density function
$$f(x) = 0.1e^{-0.1x} \quad \text{for } x > 0$$
and zero otherwise. (This is called an Exponential distribution and is discussed in the next section. It
is used in areas such as queueing theory and reliability.) We will show that the new random variable
$$Y = e^{-0.1X}$$
has a $U(0, 1)$ distribution. To see this, we follow the steps in Section 8.1:
$$F_Y(y) = P(Y \le y) = P\left(e^{-0.1X} \le y\right) = P(X \ge -10 \ln y) = 1 - P(X < -10 \ln y) = 1 - F_X(-10 \ln y)$$
Since for $x > 0$
$$F_X(x) = \int_0^x 0.1e^{-0.1u}du = 1 - e^{-0.1x}$$
we have
$$F_Y(y) = 1 - \left[1 - e^{-0.1(-10 \ln y)}\right] = y \quad \text{for } 0 < y < 1$$
The range of Y is $(0, 1)$ since $X > 0$. Thus
$$f_Y(y) = \begin{cases} \frac{d}{dy}F_Y(y) = 1 & 0 < y < 1 \\ 0 & \text{otherwise} \end{cases}$$
which implies $Y \sim U(0, 1)$.
Many computer software systems have “random number generator” functions that will simulate
observations Y from a $U(0, 1)$ distribution. These are more properly called pseudo-random number
generators because they are based on deterministic algorithms. In addition they give observations Y
that have finite precision so they cannot be exactly like continuous $U(0, 1)$ random variables. However,
good generators give Y’s that appear indistinguishable in most ways from $U(0, 1)$ random variables.
Given such a generator, we can also simulate random variables X with the Exponential distribution
above by the following algorithm:
1. Generate $Y \sim U(0, 1)$ using the computer random number generator.
2. Compute $X = -10 \ln Y$.
Then X has the desired distribution. This is a particular case of a method described in Section 8.4
for generating random variables from a general distribution. In R software the command runif(n)
produces a vector consisting of n independent $U(0, 1)$ values.
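The two-step algorithm can be sketched in Python as follows, using the standard library generator in place of R's runif. The check compares the empirical cumulative distribution function of the simulated X values with $F_X(x) = 1 - e^{-0.1x}$. The seed, sample size and evaluation points are arbitrary illustrative choices.

```python
import math
import random

rng = random.Random(2019)         # fixed seed so the run is reproducible
n = 200_000

# Step 1: Y ~ U(0, 1). Using 1 - random() keeps Y in (0, 1], which avoids log(0).
# Step 2: X = -10 ln Y.
xs = [-10.0 * math.log(1.0 - rng.random()) for _ in range(n)]

def cdf_x(x):
    """F_X(x) = 1 - exp(-0.1 x) for x > 0."""
    return 1.0 - math.exp(-0.1 * x) if x > 0 else 0.0

for x in (5.0, 10.0, 20.0):
    empirical = sum(v <= x for v in xs) / n
    print(x, round(empirical, 4), round(cdf_x(x), 4))
```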
Problem
8.2.1 If X has cumulative distribution function $F(x)$, then show $Y = F(X)$ has a $U(0, 1)$ distribution.
Suppose you want to simulate observations from a distribution with probability density function
$f(x) = 1.5x^2$ for $-1 < x < 1$ and zero otherwise, by using the random number generator on a
computer to generate $U(0, 1)$ numbers. What value would X take when you generate the random
number $y = 0.27125$?
8.3 Exponential Distribution

The continuous random variable X is said to have an Exponential distribution if its probability density
function is of the form
$$f(x) = \begin{cases} \lambda e^{-\lambda x} & x > 0 \\ 0 & \text{otherwise} \end{cases}$$
where $\lambda > 0$ is a real parameter value. This distribution arises in various problems involving the time
until some event occurs. The following gives one such setting.
Physical Setup:
In a Poisson process for events in time let X be the length of time we wait for the first event occurrence.
We’ll show that X has an Exponential distribution. (Recall that the number of occurrences in a fixed
time has a Poisson distribution. The difference between the Poisson and Exponential distributions lies
in what is being measured.)
Illustrations:
(1) The length of time X we wait with a Geiger counter until the emission of a radioactive particle is
recorded follows an Exponential distribution.
(2) The length of time between phone calls to a fire station (assuming calls follow a Poisson process)
follows an Exponential distribution.
Derivation of the probability density function and the cumulative distribution function
For $x > 0$
$$F(x) = P(X \le x) = P(\text{time to 1st occurrence} \le x)$$
$$= 1 - P(\text{time to 1st occurrence} > x)$$
$$= 1 - P(\text{no occurrence in the interval } (0, x))$$
Check that you understand this last step. If the time to the first occurrence is greater than $x$, then there
must be no occurrences in $(0, x)$, and vice versa.
We have now expressed $F(x)$ in terms of the number of occurrences in a Poisson process by time
$x$. But the number of occurrences has a Poisson distribution with mean $\mu = \lambda x$, where $\lambda$ is the average
rate of occurrence. Therefore
$$F(x) = \begin{cases} 1 - \frac{(\lambda x)^0 e^{-\lambda x}}{0!} = 1 - e^{-\lambda x} & x > 0 \\ 0 & x \le 0 \end{cases}$$
Also since $\frac{d}{dx}\left(1 - e^{-\lambda x}\right) = \lambda e^{-\lambda x}$ we have
$$f(x) = \begin{cases} \lambda e^{-\lambda x} & x > 0 \\ 0 & x \le 0 \end{cases}$$
which is the formula we gave above.
Alternate Form: It is common to use the parameter $\theta = 1/\lambda$ in the Exponential distribution. (We’ll
see below that $\theta = E(X)$.) This gives
$$F(x) = \begin{cases} 1 - e^{-x/\theta} & x > 0 \\ 0 & x \le 0 \end{cases}$$
and
$$f(x) = \begin{cases} \frac{1}{\theta}e^{-x/\theta} & x > 0 \\ 0 & x \le 0 \end{cases}$$
We write $X \sim Exponential(\theta)$.
A graph of the probability density function $f(x)$ is given in Figure 8.9. It is obvious from the graph
why this distribution is called the Exponential distribution. The distribution is said to be positively
skewed (skewed to the right) or have a long right tail.
A graph of the cumulative distribution function is given in Figure 8.10.
[Figure 8.9: Graph of the probability density function of an Exponential($\theta$) random variable]
[Figure 8.10: Cumulative distribution function for an Exponential($\theta$) random variable]
Exercise:
Suppose trees in a forest are distributed according to a Poisson process. Let X be the distance from an
arbitrary starting point to the nearest tree. The average number of trees per square metre is $\lambda$. Derive
$f(x)$ the same way we derived the Exponential probability density function. You are now using the
Poisson distribution in two dimensions (area) rather than one dimension (time).
Mean and Variance:

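Finding $\mu$ and $\sigma^2$ directly involves integration by parts. An easier solution uses properties of gamma
functions, which extend the notion of factorials beyond the integers to the positive real numbers.
Definition 23 The Gamma Function:
$$\Gamma(\alpha) = \int_0^{\infty} y^{\alpha - 1}e^{-y}dy$$
is called the gamma function of $\alpha$, where $\alpha > 0$.
Note that $\alpha$ is 1 more than the power of $y$ in the integrand. For example,
$$\Gamma(5) = \int_0^{\infty} y^4 e^{-y}dy.$$
There are three properties of gamma functions which we will use.
1. $\Gamma(\alpha) = (\alpha - 1)\Gamma(\alpha - 1)$ for $\alpha > 1$
Proof: Using integration by parts,
$$\int_0^{\infty} y^{\alpha - 1}e^{-y}dy = -\lim_{y \to \infty} y^{\alpha - 1}e^{-y} + (\alpha - 1)\int_0^{\infty} y^{\alpha - 2}e^{-y}dy$$
(the boundary term at $y = 0$ is zero when $\alpha > 1$) and provided that $\alpha > 1$, $\lim_{y \to \infty} y^{\alpha - 1}e^{-y} = 0$. Therefore
$$\int_0^{\infty} y^{\alpha - 1}e^{-y}dy = (\alpha - 1)\int_0^{\infty} y^{\alpha - 2}e^{-y}dy = (\alpha - 1)\Gamma(\alpha - 1)$$
2. $\Gamma(\alpha) = (\alpha - 1)!$ if $\alpha$ is a positive integer.
Proof: It is easy to show that $\Gamma(1) = 1$. Using property 1 repeatedly, we obtain
$\Gamma(2) = 1\,\Gamma(1) = 1$
$\Gamma(3) = 2\,\Gamma(2) = 2!$
$\Gamma(4) = 3\,\Gamma(3) = 3!$ etc.
In general, $\Gamma(n + 1) = n!$ for $n = 0, 1, \ldots$
3. $\Gamma\left(\frac{1}{2}\right) = \sqrt{\pi}$
(This can be proved using double integration.)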
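The three properties can be spot-checked numerically with the gamma function in Python's standard library (`math.gamma`). This is a sketch, not a proof; the test value $\alpha = 3.7$ and the range of $n$ are arbitrary illustrative choices.

```python
import math

# Property 1: Gamma(alpha) = (alpha - 1) * Gamma(alpha - 1), at a non-integer alpha > 1.
alpha = 3.7
prop1_gap = abs(math.gamma(alpha) - (alpha - 1) * math.gamma(alpha - 1))

# Property 2: Gamma(n + 1) = n! for n = 0, 1, ..., 9.
prop2_ok = all(math.isclose(math.gamma(n + 1), math.factorial(n))
               for n in range(10))

# Property 3: Gamma(1/2) = sqrt(pi).
prop3_gap = abs(math.gamma(0.5) - math.sqrt(math.pi))

print(prop1_gap, prop2_ok, prop3_gap)
```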

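Returning to the Exponential distribution we have:
$$\mu = \int_{-\infty}^{\infty} xf(x)dx = \int_0^{\infty} x\,\frac{1}{\theta}e^{-x/\theta}dx \quad \text{let } y = \frac{x}{\theta} \text{ with } dx = \theta\,dy$$
$$= \int_0^{\infty} \theta y e^{-y}dy = \theta\int_0^{\infty} y^1 e^{-y}dy = \theta\,\Gamma(2) = \theta\,(1!) = \theta$$
Note: Read questions carefully. If you are given the average rate of occurrence in a Poisson process,
then this is the parameter $\lambda$. If you are given the average waiting time for an occurrence, then this is
the parameter $\theta$.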
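To get $\sigma^2 = Var(X)$, we first find
$$E\left(X^2\right) = \int_{-\infty}^{\infty} x^2 f(x)dx = \int_0^{\infty} x^2\,\frac{1}{\theta}e^{-x/\theta}dx \quad \text{let } y = x/\theta$$
$$= \int_0^{\infty} \theta^2 y^2\,\frac{1}{\theta}e^{-y}\,\theta\,dy = \theta^2\int_0^{\infty} y^2 e^{-y}dy = \theta^2\,\Gamma(3) = \theta^2\,(2!) = 2\theta^2$$
Then
$$\sigma^2 = Var(X) = E\left(X^2\right) - \mu^2 = 2\theta^2 - \theta^2 = \theta^2$$
In summary:
If $X \sim Exponential(\theta)$ then $E(X) = \theta$ and $Var(X) = \theta^2$
In Figure 8.11, the probability density functions for different values of $\theta$ are pictured to see the
effect of changing $\theta$.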
[Figure 8.11: Exponential probability density functions for different values of $\theta$ ($\theta = 0.75$, $\theta = 1.0$, $\theta = 2$)]
Example:
Suppose buses arrive at a bus stop according to a Poisson process with an average of 5 buses per hour.
($\lambda = 5$/hour so $\theta = 1/5$ hour or 12 minutes).
Find the probability:
(a) you have to wait longer than 15 minutes for a bus
(b) you have to wait more than 15 minutes longer, having already waited for 6 minutes.
Solution:
(a)
$$P(X > 15) = 1 - P(X \le 15) = 1 - F(15) = 1 - \left(1 - e^{-15/12}\right) = e^{-1.25} = 0.2865$$
(b) If X is the total waiting time, the question asks for the probability
$$P(X > 21 \mid X > 6) = \frac{P(X > 21 \text{ and } X > 6)}{P(X > 6)} = \frac{P(X > 21)}{P(X > 6)}$$
$$= \frac{1 - \left(1 - e^{-21/12}\right)}{1 - \left(1 - e^{-6/12}\right)} = \frac{e^{-21/12}}{e^{-6/12}} = e^{-1.25} = 0.2865$$
Does this surprise you? The fact that you’ve already waited 6 minutes doesn’t seem to matter.
Memoryless Property of the Exponential Distribution

The example above illustrates the “memoryless property” of the Exponential distribution:
$$P(X > c + b \mid X > b) = P(X > c)$$
In other words, for a Poisson process, given that you have waited $b$ units of time for the next event, the
probability you wait an additional $c$ units of time does not depend on $b$ but only depends on $c$.
Fortunately, buses don’t follow a Poisson process so the example need not cause you to stop using
the bus.
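The memoryless property can be verified numerically for the bus example. The following Python sketch evaluates both sides of the identity with $\theta = 12$ minutes, $b = 6$ and $c = 15$; the function name `survival` is an illustrative choice for $P(X > x) = e^{-x/\theta}$.

```python
import math

def survival(x, theta):
    """P(X > x) = exp(-x/theta) for X ~ Exponential(theta), x >= 0."""
    return math.exp(-x / theta)

theta = 12.0            # mean waiting time in minutes (5 buses per hour)
b, c = 6.0, 15.0

conditional = survival(b + c, theta) / survival(b, theta)   # P(X > b + c | X > b)
unconditional = survival(c, theta)                          # P(X > c)
print(conditional, unconditional)
```

Changing `b` to any other non-negative value leaves `conditional` unchanged up to rounding, which is exactly what the property asserts.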
Problems
8.3.1 In a bank with on-line terminals, the time the system runs between disruptions has an Exponential
distribution with mean $\theta$ hours. One quarter of the time the system shuts down within 8 hours of
the previous disruption. Find $\theta$.
8.3.2 Flaws in painted sheets of metal occur over the surface according to the conditions for a Poisson
process, at an intensity of $\lambda$ per m$^2$. Let X be the distance from an arbitrary starting point to the
second closest flaw. (Assume sheets are of infinite size!)
(a) Find the probability density function, f (x).

(b) What is the average distance to the second closest flaw?
8.4 A Method for Computer Generation of Random Variables

Most computer software has a built-in “pseudo-random number[31] generator” that will simulate
observations U from a $U(0, 1)$ distribution, or at least a reasonable approximation to this Uniform
distribution. If we wish a random variable with a non-Uniform distribution, the standard approach is
to take a suitable function of U. By far the simplest and most common method for generating
non-Uniform variates is based on the inverse cumulative distribution function. For arbitrary cumulative
distribution function $F(x)$, define $F^{-1}(y) = \min\{x;\ F(x) \ge y\}$. This is a real inverse (that is,
$F(F^{-1}(y)) = F^{-1}(F(y)) = y$) in the case that the cumulative distribution function is continuous
and strictly increasing. However, in the more general case of a possibly discontinuous non-decreasing
cumulative distribution function (such as the cumulative distribution function of a discrete distribution)
the function continues to enjoy at least some of the properties of an inverse. $F^{-1}$ is useful for
generating random variables having cumulative distribution function $F(x)$ from U, a Uniform random
variable on the interval $[0, 1]$.
[31] “The generation of random numbers is too important to be left to chance.” Robert R. Coveyou, Oak Ridge National
Laboratory
Theorem 24 If $F$ is an arbitrary cumulative distribution function and U is Uniform on $[0, 1]$ then the
random variable defined by $X = F^{-1}(U)$ has cumulative distribution function $F(x)$.
Proof:
The proof is a consequence of the fact that
$$[U < F(x)] \subset [X \le x] \subset [U \le F(x)] \quad \text{for all } x$$
You can check this graphically by checking, for example, that if $[U < F(x)]$ then $[F^{-1}(U) \le x]$
(this confirms the left hand “$\subset$”). Taking probabilities on all sides of this, and using the fact that
$P[U \le F(x)] = P[U < F(x)] = F(x)$, we discover that $P(X \le x) = F(x)$.
[Figure 8.12: Inverting a Cumulative Distribution Function; a horizontal line at height U meets the graph of $F(x)$ at $x = X = F^{-1}(U)$]
The relation $X = F^{-1}(U)$ implies that $F(X) \ge U$ and for any point $z < X$, $F(z) < U$. For example,
for the rather unusual looking piecewise linear cumulative distribution function in Figure 8.12, we find
the solution $X = F^{-1}(U)$ by drawing a horizontal line at U until it strikes the graph of the cumulative
distribution function (or where the graph would have been if we had joined the ends at the jumps) and
then X is the x-coordinate of this point. This is true in general: X is the x-coordinate of the point
where a horizontal line first strikes the graph of the cumulative distribution function. We provide one
simple example of generating random variables by this method, for the Geometric distribution.
Example: A Geometric random number generator
For the Geometric distribution, the cumulative distribution function is given by $F(x) = 1 - (1-p)^{x+1}$
for $x = 0, 1, 2, \ldots$. Then if $U$ is a Uniform random number in the interval $[0, 1]$, we seek an integer $X$
such that $F(X-1) < U \le F(X)$. (You should confirm that this is the value of $X$ at which the above
horizontal line strikes the graph of the c.d.f.) Solving these inequalities gives
$$\begin{aligned}
1 - (1-p)^{X} &< U \le 1 - (1-p)^{X+1} \\
(1-p)^{X} &> 1 - U \ge (1-p)^{X+1} \\
X \ln(1-p) &> \ln(1-U) \ge (X+1)\ln(1-p) \\
X &< \frac{\ln(1-U)}{\ln(1-p)} \le X + 1
\end{aligned}$$
where the last step reverses the inequalities because $\ln(1-p) < 0$,
so we compute the value of $\ln(1-U)/\ln(1-p)$ and round down to the next lower integer.
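The steps of this example can be sketched in Python as follows. This is only an illustrative sketch, not part of the original notes: the function names `geometric_from_uniform` and `geometric_cdf` are hypothetical, and the seed and the value $p = 0.25$ are arbitrary choices.

```python
import math
import random


def geometric_cdf(x: int, p: float) -> float:
    """F(x) = 1 - (1 - p)**(x + 1) for x = 0, 1, 2, ...; zero for x < 0."""
    return 1.0 - (1.0 - p) ** (x + 1) if x >= 0 else 0.0


def geometric_from_uniform(u: float, p: float) -> int:
    """Return the integer X satisfying F(X - 1) < u <= F(X)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if not 0.0 <= u < 1.0:
        raise ValueError("u must lie in [0, 1)")
    ratio = math.log(1.0 - u) / math.log(1.0 - p)
    # X < ratio <= X + 1, so X is the next integer strictly below ratio.
    return max(math.ceil(ratio) - 1, 0)


if __name__ == "__main__":
    rng = random.Random(230)  # arbitrary seed, for repeatability only
    p = 0.25
    sample = [geometric_from_uniform(rng.random(), p) for _ in range(10_000)]
    # The sample mean should be close to (1 - p) / p.
    print(sum(sample) / len(sample))
```

Using `math.ceil(ratio) - 1` rather than `math.floor(ratio)` keeps the result consistent with the strict inequality $X < \ln(1-U)/\ln(1-p)$ in the rare case where the ratio is exactly an integer.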
Exercise: An Exponential random number generator.
For the Exponential($\theta$) distribution show that the inverse transform method above results in
$$X = -\theta \ln(1-U)$$
8.5 Normal Distribution

Physical Setup:
A random variable $X$ has a Normal[32] distribution if it has probability density function of the form
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \quad x \in \Re$$
where $\mu \in \Re$ and $\sigma > 0$ are parameters. It turns out (and is shown below) that $E(X) = \mu$ and
$Var(X) = \sigma^2$ for this distribution; that is why its probability density function is written using the
symbols $\mu$ and $\sigma^2$.
[32] “The only normal people are the ones you don’t know very well.” Joe Ancis
We write $X \sim N(\mu, \sigma^2)$ to denote that $X$ has a Normal distribution with mean $\mu$ and variance $\sigma^2$
(standard deviation $\sigma$).
The Normal distribution is the most widely used distribution in probability and statistics. Physical
processes leading to the Normal distribution exist but are a little complicated to describe. (For example,
it arises in physics via statistical mechanics and maximum entropy arguments.) It is used for many
processes where X represents a physical dimension of some kind, but also in many other settings.
Illustrations:
(1) Heights or weights of males (or of females) in large populations tend to follow a Normal distribution.
(2) The logarithms of stock prices are often assumed to have a Normal distribution.
The graph of the probability density function $f(x)$, shown in Figure 8.13, is symmetric about the
line $x = \mu$. The shape of the probability density function is often termed a “bell shape” or “bell curve”.
You should be able to verify the shape of the function using the first and second derivatives of f (x).
[Figure 8.13: The $N(\mu, \sigma^2)$ probability density function. Plot of $f(x)$ against $x$ for $x$ from $\mu - 3\sigma$ to $\mu + 3\sigma$.]

We can show that f (x) integrates to 1:

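$$\begin{aligned}
&\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx \qquad \text{let } z = (x-\mu)/\sigma \\
&= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz
 = 2\int_{0}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz \qquad \text{let } y = \frac{z^2}{2} \text{ and } dz = \frac{dy}{\sqrt{2}\, y^{1/2}} \\
&= 2\int_{0}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-y}\, \frac{dy}{\sqrt{2}\, y^{1/2}} \\
&= \frac{1}{\sqrt{\pi}} \int_{0}^{\infty} y^{-1/2} e^{-y}\, dy \\
&= \frac{1}{\sqrt{\pi}}\, \Gamma\!\left(\tfrac{1}{2}\right) \qquad \text{where } \Gamma \text{ is the gamma function} \\
&= 1 \qquad \text{since } \Gamma\!\left(\tfrac{1}{2}\right) = \sqrt{\pi}
\end{aligned}$$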
The cumulative distribution function:

The cumulative distribution function of the Normal distribution $N(\mu, \sigma^2)$ is
$$F(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^2} dy \quad \text{for } x \in \Re$$
as shown in Figure 8.14. This integral cannot be given a simple mathematical expression so numerical
methods are used to compute its value for given values of $x$, $\mu$ and $\sigma$. This function is included in many
software packages and some calculators.
In the statistical package R we get $F(x)$ above using the function pnorm($x$, $\mu$, $\sigma$). Before computers,
people needed to produce tables of probabilities $F(x)$ by numerical integration, using mechanical
calculators. Fortunately it is necessary to do this only for a single Normal distribution: the one with
$\mu = 0$ and $\sigma = 1$. This is called the “standard” Normal distribution and is denoted $N(0, 1)$.
It is easy to see that if $X \sim N(\mu, \sigma^2)$ then the “new” random variable $Z = (X - \mu)/\sigma$ is distributed
as $Z \sim N(0, 1)$. (Just use the change of variables methods in Section 8.1.) We’ll use this result to
compute probabilities for $X$, and to show that $E(X) = \mu$ and $Var(X) = \sigma^2$.
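For readers working in Python rather than R, the standard library class `statistics.NormalDist` plays a role similar to pnorm. The sketch below is an illustration only; the helper name `normal_cdf_via_z` is hypothetical. It computes $F(x)$ by standardizing to $Z = (X - \mu)/\sigma$ and then using the $N(0, 1)$ cumulative distribution function.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def normal_cdf_via_z(x: float, mu: float, sigma: float) -> float:
    """F(x) for N(mu, sigma^2), computed from the standard Normal c.d.f."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return STANDARD_NORMAL.cdf(z)


if __name__ == "__main__":
    # Direct evaluation and evaluation via standardization should agree.
    direct = NormalDist(mu=3.0, sigma=5.0).cdf(2.0)
    via_z = normal_cdf_via_z(2.0, mu=3.0, sigma=5.0)
    print(direct, via_z)
```

Note that `NormalDist` is parameterized by the standard deviation $\sigma$, not the variance $\sigma^2$, just as pnorm is.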
[Figure 8.14: The Normal cumulative distribution function. Plot of $F(x)$ against $x$ for $x$ from $\mu - 3\sigma$ to $\mu + 3\sigma$.]
Mean:
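Recall that an odd function, $f(x)$, has the property that $f(-x) = -f(x)$. If $f(x)$ is an odd function
then $\int_{-\infty}^{\infty} f(x)\,dx = 0$, provided the integral exists.
Consider
$$E(X - \mu) = \int_{-\infty}^{\infty} (x - \mu)\, \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx.$$
Let $y = x - \mu$. Then
$$E(X - \mu) = \int_{-\infty}^{\infty} y\, \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{y}{\sigma}\right)^2} dy,$$
where the integrand is an odd function so that $E(X - \mu) = 0$. But since $E(X - \mu) = E(X) - \mu$,
this implies $E(X) = \mu$ and so $\mu$ is the mean of the $N(\mu, \sigma^2)$ distribution.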
Variance:
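To obtain the variance we have
$$\begin{aligned}
Var(X) &= E\left[(X - \mu)^2\right] \\
&= \int_{-\infty}^{\infty} (x - \mu)^2\, \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx \\
&= 2\int_{\mu}^{\infty} (x - \mu)^2\, \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx \quad \text{(since the function is symmetric about } \mu\text{)}.
\end{aligned}$$
We can obtain a gamma function by letting $y = \frac{(x-\mu)^2}{2\sigma^2}$ and noting that
$$(x - \mu)^2 = 2\sigma^2 y \quad \text{or} \quad (x - \mu) = \sigma\sqrt{2y} \quad \text{since } x > \mu$$
$$dx = \frac{\sigma\sqrt{2}\,dy}{2\sqrt{y}} = \frac{\sigma}{\sqrt{2y}}\,dy$$
Then
$$\begin{aligned}
Var(X) &= 2\int_{0}^{\infty} 2\sigma^2 y\, \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-y}\, \frac{\sigma}{\sqrt{2y}}\, dy \\
&= \frac{2\sigma^2}{\sqrt{\pi}} \int_{0}^{\infty} y^{1/2} e^{-y}\, dy \\
&= \frac{2\sigma^2}{\sqrt{\pi}}\, \Gamma\!\left(\tfrac{3}{2}\right) = \frac{2\sigma^2}{\sqrt{\pi}} \cdot \tfrac{1}{2}\, \Gamma\!\left(\tfrac{1}{2}\right) \\
&= \frac{2\sigma^2}{\sqrt{\pi}} \cdot \frac{\sqrt{\pi}}{2} \quad \text{since } \Gamma\!\left(\tfrac{1}{2}\right) = \sqrt{\pi} \\
&= \sigma^2
\end{aligned}$$
and so $\sigma^2$ is the variance of the $N(\mu, \sigma^2)$ distribution.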
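The three results derived for the Normal density (total area 1, mean $\mu$, variance $\sigma^2$) can be checked numerically. The sketch below is an illustration under stated assumptions: it uses a simple midpoint rule on the interval $\mu \pm 10\sigma$ instead of the gamma function argument, and the names `normal_pdf`, `midpoint_integral` and `normal_moments` are hypothetical.

```python
import math


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


def midpoint_integral(g, a: float, b: float, n: int = 20_000) -> float:
    """Approximate the integral of g over [a, b] with n midpoint rectangles."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))


def normal_moments(mu: float, sigma: float, width: float = 10.0):
    """Return numerical approximations of (total area, E(X), Var(X))."""
    a, b = mu - width * sigma, mu + width * sigma  # tails beyond this are negligible
    total = midpoint_integral(lambda x: normal_pdf(x, mu, sigma), a, b)
    mean = midpoint_integral(lambda x: x * normal_pdf(x, mu, sigma), a, b)
    var = midpoint_integral(lambda x: (x - mu) ** 2 * normal_pdf(x, mu, sigma), a, b)
    return total, mean, var


if __name__ == "__main__":
    # For mu = 3 and sigma = 5 the three values should be close to 1, 3 and 25.
    print(normal_moments(3.0, 5.0))
```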
Finding Normal Probabilities Using $N(0, 1)$ Tables: As noted above, $F(x)$ does not have an explicit
closed form so numerical computation is needed. The following result shows that if we can compute
the cumulative distribution function for the standard Normal distribution $N(0, 1)$, then we can compute
it for any other Normal distribution $N(\mu, \sigma^2)$ as well.
Theorem 25 Let $X \sim N(\mu, \sigma^2)$ and define $Z = (X - \mu)/\sigma$. Then $Z \sim N(0, 1)$ and
$$P(X \le x) = P\left(Z \le \frac{x - \mu}{\sigma}\right)$$
Proof: The fact that $Z \sim N(0, 1)$, with probability density function
$$\varphi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} z^2}, \quad z \in \Re,$$
follows immediately by change of variables.
Alternatively, we can just note that
$$\begin{aligned}
P(X \le x) &= \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^2} dy \qquad \text{let } z = \frac{y - \mu}{\sigma} \\
&= \int_{-\infty}^{(x-\mu)/\sigma} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} z^2}\, dz \\
&= P\left(Z \le \frac{x - \mu}{\sigma}\right)
\end{aligned}$$
A table of probabilities $P(Z \le z)$ is given at the end of these Course Notes. A space-saving feature is
that only the values for $z \ge 0$ are shown; for negative values we use the fact that the $N(0, 1)$ probability
density function is symmetric about 0.
The following examples illustrate how to get probabilities for Z using the tables.
Examples: Find the following probabilities, where $Z \sim N(0, 1)$.
(a) $P(Z \le 2.11)$
(b) $P(Z < 3.40)$
(c) $P(Z > 1.06)$
(d) $P(Z \le -1.06)$
(e) $P(-1.06 < Z \le 2.11)$

Solution:
(a) Look up 2.11 in the table by going down the left column to 2.1 then across to the heading 0.01.
We find the number 0.98257. Then $P(Z \le 2.11) = 0.98257$. See Figure 8.15.
[Figure 8.15: Area of shaded region equals $P(Z \le 2.11) = 0.9826$. Plot of $\varphi(z)$ against $z$ with the point $(2.11, \varphi(2.11))$ marked.]
(b) $P(Z < 3.40) = P(Z \le 3.40) = 0.99966$
(c) $P(Z > 1.06) = 1 - P(Z \le 1.06) = 1 - 0.85543 = 0.14457$
(d) Now we have to use symmetry:
$$P(Z \le -1.06) = P(Z > 1.06) = 1 - P(Z \le 1.06) = 1 - 0.85543 = 0.14457$$
See Figure 8.16.
(e)
$$\begin{aligned}
P(-1.06 < Z \le 2.11) &= P(Z \le 2.11) - P(Z \le -1.06) \\
&= P(Z \le 2.11) - P(Z > 1.06) \\
&= P(Z \le 2.11) - [1 - P(Z \le 1.06)] \\
&= 0.98257 - (1 - 0.85543) = 0.83800
\end{aligned}$$
[Figure 8.16: Calculation of $P(Z < -1.06)$ and $P(Z > 1.06)$. Two plots of $\varphi(z)$ against $z$, one marking the point $(-1.06, \varphi(-1.06))$ and one marking the point $(1.06, \varphi(1.06))$.]
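The table lookups in examples (a) to (e) can be cross-checked in software. The following sketch is an illustration, not part of the original notes; it uses Python's standard library `statistics.NormalDist` in place of the printed table, and the dictionary name `answers` is arbitrary.

```python
from statistics import NormalDist

Z = NormalDist()  # the standard Normal distribution N(0, 1)

answers = {
    "a": Z.cdf(2.11),                 # P(Z <= 2.11)
    "b": Z.cdf(3.40),                 # P(Z < 3.40), same as P(Z <= 3.40)
    "c": 1 - Z.cdf(1.06),             # P(Z > 1.06)
    "d": Z.cdf(-1.06),                # P(Z <= -1.06); no symmetry step needed in software
    "e": Z.cdf(2.11) - Z.cdf(-1.06),  # P(-1.06 < Z <= 2.11)
}

if __name__ == "__main__":
    for label, value in answers.items():
        print(label, round(value, 5))
```

The printed values should agree with the table-based answers above up to rounding in the last decimal place.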
In addition to using the tables to find the probabilities for given numbers, we sometimes are given the
probabilities and asked to find the number. With R, the function qnorm($p$, $\mu$, $\sigma$) gives the $100p$-th
percentile (where $0 < p < 1$). We can also use tables to find desired values.
Examples:
(a) Find a number $c$ such that $P(Z \le c) = 0.85$.
(b) Find a number $d$ such that $P(Z > d) = 0.90$.
(c) Find a number $b$ such that $P(|Z| \le b) = 0.95$.
Solutions:
(a) We can look in the body of the table to get an entry close to 0.85. This occurs for $z$ between 1.03
and 1.04; $z = 1.04$ gives the closest value to 0.85. For greater accuracy, the table at the bottom
of the last page is designed for finding numbers, given the probability. Looking beside the entry
0.85 we find $z = 1.0364$.
(b) See Figure 8.17. Since $P(Z > d) = 0.90$ we have $P(Z \le d) = 1 - P(Z > d) = 0.10$. There is
no entry for which $P(Z \le z) = 0.10$ so we again have to use symmetry, since $d$ will be negative.
From the table we have $P(Z \le 1.2816) = 0.90$. By symmetry $P(Z > -1.2816) = 0.90$ and
therefore $d = -1.2816$.
[Figure 8.17: Picture to determine $P(Z > d) = 0.90$. Plot of $\varphi(z)$ against $z$ with tail areas of 0.1 below $d$ and above $|d|$.]
The key to this solution lies in recognizing that $d$ will be negative. If you can picture the situation
it will probably be easier to handle the question than if you rely on algebraic manipulations.
Exercise: Will $a$ be positive or negative if $P(Z > a) = 0.05$? What if $P(Z \le a) = 0.99$?
(c) We first note that $P(|Z| \le b) = P(-b < Z < b) = 0.95$. By symmetry, the probability outside
the interval $(-b, b)$ must be 0.05, and this is evenly split between the area above $b$ and the area
below $-b$. Therefore (see Figure 8.18)
$$P(Z \le -b) = P(Z > b) = 0.025$$
and
$$P(Z \le b) = 0.975$$
Looking within the body of the top table, we can see $P(Z \le 1.96) = 0.975$ so $b = 1.96$.
Exercise: Find $b$ such that $P(|Z| \le b) = 0.9$ and $P(|Z| \le b) = 0.99$.

[Figure 8.18: Picture to determine $P(-b < Z < b) = 0.95$. Plot of $\varphi(z)$ against $z$ with central area 0.95 and an area of 0.025 in each tail.]
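The three "find the number" examples above can also be reproduced with an inverse c.d.f. routine. The sketch below is illustrative only; it uses `statistics.NormalDist.inv_cdf` from the Python standard library as a stand-in for qnorm, and the variable names simply mirror the letters used in the examples.

```python
from statistics import NormalDist

Z = NormalDist()  # the standard Normal distribution N(0, 1)

# (a) P(Z <= c) = 0.85
c = Z.inv_cdf(0.85)

# (b) P(Z > d) = 0.90 is the same statement as P(Z <= d) = 0.10
d = Z.inv_cdf(1 - 0.90)

# (c) P(|Z| <= b) = 0.95 leaves 0.025 in each tail, so P(Z <= b) = 0.975
b = Z.inv_cdf(1 - 0.05 / 2)

if __name__ == "__main__":
    print(round(c, 4), round(d, 4), round(b, 4))
```

In software there is no need to use symmetry for (b), since the inverse c.d.f. accepts probabilities below 0.5 directly; the picture-based reasoning is still the best check on the sign of the answer.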
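Finding $N(\mu, \sigma^2)$ probabilities: To find $N(\mu, \sigma^2)$ probabilities in general, we use the theorem given
earlier, which implies that if $X \sim N(\mu, \sigma^2)$ then
$$\begin{aligned}
P(a \le X \le b) &= P\left(\frac{a - \mu}{\sigma} \le Z \le \frac{b - \mu}{\sigma}\right) \\
&= P\left(Z \le \frac{b - \mu}{\sigma}\right) - P\left(Z \le \frac{a - \mu}{\sigma}\right)
\end{aligned}$$
where $Z \sim N(0, 1)$.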
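Example: Let $X \sim N(3, 25)$.
(a) Find $P(X < 2)$
(b) Find a number $c$ such that $P(X > c) = 0.95$.
Solution:
(a)
$$\begin{aligned}
P(X < 2) &= P\left(\frac{X - 3}{5} < \frac{2 - 3}{5}\right) \\
&= P(Z < -0.20) \quad \text{where } Z \sim N(0, 1) \\
&= 1 - P(Z < 0.20) = 1 - 0.57926 = 0.42074
\end{aligned}$$
(b)
$$\begin{aligned}
P(X > c) &= P\left(\frac{X - 3}{5} > \frac{c - 3}{5}\right) \\
&= P\left(Z > \frac{c - 3}{5}\right) \quad \text{where } Z \sim N(0, 1) \\
&= 0.95
\end{aligned}$$
Therefore (see Figure 8.19), $(c - 3)/5 = -1.6449$ or $c = -5.2245$.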
[Figure 8.19: Picture for determining $P\left(Z > \frac{c-3}{5}\right) = 0.95$. Plot of $\varphi(z)$ against $z$ with the point $((c-3)/5, \varphi((c-3)/5))$ marked and area 0.95 to its right.]
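The standardization used in the $N(3, 25)$ example can be written out step by step in code. This is a hedged sketch, not the notes' own method of computation: the helper name `prob_between` is hypothetical, and the standard Normal table is replaced by `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

Z = NormalDist()       # standard Normal, N(0, 1)
mu, sigma = 3.0, 5.0   # X ~ N(3, 25): the variance is 25, so sigma = 5


def prob_between(a: float, b: float, mu: float, sigma: float) -> float:
    """P(a <= X <= b) for X ~ N(mu, sigma^2), via Z = (X - mu) / sigma."""
    return Z.cdf((b - mu) / sigma) - Z.cdf((a - mu) / sigma)


# (a) P(X < 2) = P(Z < (2 - 3)/5) = P(Z < -0.20)
p_a = Z.cdf((2.0 - mu) / sigma)

# (b) P(X > c) = 0.95 means (c - mu)/sigma is the 5th percentile of Z
c = mu + sigma * Z.inv_cdf(1 - 0.95)

if __name__ == "__main__":
    print(round(p_a, 5), round(c, 4))
    print(prob_between(mu - sigma, mu + sigma, mu, sigma))
```

The value of `c` computed this way differs slightly in the fourth decimal place from the table-based answer, because the table rounds the percentile to 1.6449.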
Gaussian Distribution: The Normal distribution is also known as the Gaussian[33] distribution. The
notation $X \sim G(\mu, \sigma)$ means that $X$ has Gaussian (Normal) distribution with mean $\mu$ and standard
deviation $\sigma$. So, for example, if $X \sim N(1, 4)$ then we could also write $X \sim G(1, 2)$.
[33] After Johann Carl Friedrich Gauss (1777-1855), a German mathematician, physicist and astronomer, who independently
discovered Bode’s Law and the Binomial Theorem and showed how to construct a regular 17-gon. He discovered the prime
number theorem while an 18 year-old student and used least-squares (what is called statistical regression in most statistics
courses) to predict the position of Ceres.
Example: The distribution of heights of adult males in Canada is well approximated by a Gaussian
distribution with mean $\mu = 69.0$ inches and standard deviation $\sigma = 2.4$ inches. Find the 10th and 90th
percentiles of the height distribution.
Solution: We are being told that if $X$ is the height of a randomly selected Canadian adult male, then
$X \sim G(69.0, 2.4)$, or equivalently $X \sim N(69.0, 5.76)$.
To find the 90th percentile $c$, we use
$$\begin{aligned}
P(X \le c) &= P\left(\frac{X - 69.0}{2.4} \le \frac{c - 69.0}{2.4}\right) \\
&= P\left(Z \le \frac{c - 69.0}{2.4}\right) \quad \text{where } Z \sim G(0, 1) \\
&= 0.90
\end{aligned}$$
From the table we see $P(Z \le 1.2816) = 0.90$ so we need
$$\frac{c - 69.0}{2.4} = 1.2816$$
which gives $c = 72.08$ inches as the 90th percentile.
Similarly, to find $c$ such that $P(X \le c) = 0.10$ we find that
$P(Z \le -1.2816) = 0.10$, so we need
$$\frac{c - 69.0}{2.4} = -1.2816$$
or $c = 65.92$ inches, as the 10th percentile.
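As a cross-check on the height percentiles, note that Python's `statistics.NormalDist` is parameterized by the standard deviation, which matches the $G(\mu, \sigma)$ notation rather than the $N(\mu, \sigma^2)$ notation. The sketch below is illustrative only; the variable names are arbitrary.

```python
from statistics import NormalDist

# X ~ G(69.0, 2.4): the second argument is the standard deviation, not the variance.
heights = NormalDist(mu=69.0, sigma=2.4)

p10 = heights.inv_cdf(0.10)  # 10th percentile, in inches
p90 = heights.inv_cdf(0.90)  # 90th percentile, in inches

if __name__ == "__main__":
    print(round(p10, 2), round(p90, 2))
    # The variance, as used in the N(69.0, 5.76) notation:
    print(heights.variance)
```

When a problem is stated in the $N(\mu, \sigma^2)$ form, the square root of the second parameter must be taken before it is passed as `sigma`.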
Problem
8.5.1 Let X have a Normal distribution. What percent of the time does X lie within one standard
deviation of the mean? Two standard deviations? Three standard deviations?

1. A continuous random variable $X$ has probability density function
$$f(x) = \begin{cases} k(1 - x^2) & -1 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$
(a) Find $k$ and the cumulative distribution function of $X$. Graph $f(x)$ and the cumulative
distribution function.
(b) Find the value of $c$ such that $P(-c \le X \le c) = 0.95$.
(c) Find $\mu = E(X)$ and $\sigma = sd(X)$.
(d) Find the probability density function of $Y = X^2$.
2. When people are asked to make up a random number between 0 and 1, it has been found that the
distribution of the numbers, $X$, has probability density function close to
$$f(x) = \begin{cases} 4x & 0 < x \le 1/2 \\ 4(1 - x) & 1/2 < x < 1 \\ 0 & \text{otherwise} \end{cases}$$
(rather than the $U(0, 1)$ distribution which would be expected).
(a) Graph the probability density function and show that $\int_{-\infty}^{\infty} f(x)\,dx = 1$ without evaluating
an integral.
(b) Find $P(0.25 \le X \le 0.8)$.
(c) Find the median and 10th percentile of the distribution.
(d) Find the mean and variance of $X$.
(e) Find the probability density function of $Y = 2(X - 1/2)$.
(f) Find the probability density function of $Z = X^3$.
3. Let $X$ have probability density function
$$f(x) = \begin{cases} \frac{1}{20} & -10 \le x \le 10 \\ 0 & \text{otherwise} \end{cases}$$
Show that $Y = (X + 10)/20 \sim U(0, 1)$.
4. Suppose $X$ is a continuous random variable with finite mean $\mu$ and standard deviation $\sigma$. Suppose
also that its probability density function is symmetric about the line $x = \mu$. Show that
$P(|X - \mu| \le 2\sigma) = 2P(X \le \mu + 2\sigma) - 1$.
5. Suppose that X is the lifetime in years of a Canadian born in 1995. When dealing with lifetimes,
a function of interest is the survivor function defined as
$$S(x) = P(X > x) = 1 - P(X \le x) = 1 - F(x)$$
Values of $S(x)$, based on data collected by the Canadian government, are given in the table
below for $x = 30, 40, \ldots, 100$.
x                 30      40      50      60      70      80      90      100
Females: S(x)     0.996   0.987   0.971   0.939   0.863   0.704   0.396   0.075
Males: S(x)       0.989   0.975   0.949   0.903   0.801   0.603   0.273   0.034
(a) If a female born in 1995 lives to at least age 30, what is the probability she lives to at least
age 80? To at least age 90? What are the corresponding probabilities for males?
(b) If 51% of persons born in 1995 were male, find the fraction of the total population (males
and females) that will live at least to age 90.

(
( + 1) x 0 < x < 1
f (x) =
0 otherwise
where is a real-valued parameter of the distribution.
(a) For what values of is this a probability density function? Explain.

(b) Find P (X 0:5).
(c) Find E X k for k = 0; 1; : : : and use this to find E (X) and V ar (X).
(d) Find the probability density function of Y = 1=X.

( 2
kxe x = x > 0
f (x) =
0 otherwise
where > 0 is a real-valued parameter of the distribution.
(a) Find k and the cumulative distribution function of X.

(b) Find the mean and variance of X. Hint: Use the method of substitution and the Gamma
function.
(c) Show that Y = X 2 = v Exponential(1).
8. The diameters in centimeters of spherical particles produced by a machine are randomly distributed
according to a $U(0.6, 1.0)$ distribution. Find the probability density function for the volume
of a particle.
9. The magnitudes of earthquakes in a region of North America can be modelled by an Exponential
distribution with mean 2.5 measured on the Richter scale.
(a) What is the probability an earthquake has a magnitude greater than 5 on the Richter scale?
(b) Suppose 3 earthquakes occur in a given month. What is the probability that none of the
earthquakes have a magnitude greater than 5 on the Richter scale?
(c) If the magnitude of an earthquake exceeds 4, what is the probability it also exceeds 5?
10. The lifetime of a certain type of light bulb follows an Exponential distribution with mean 1000
hours.
(a) What are the mean and standard deviation of the lifetime of this type of light bulb?
(b) What are the mean and standard deviation of the lifetime of this type of light bulb in days?
(c) Find the median lifetime in hours.
11. Traffic accidents at the intersection of University Avenue and Westmount Road occur according
to a Poisson process with an average rate of 0.5 accidents per day. Suppose an accident has just
occurred.
(a) What is the expected waiting time until the next accident?
(b) What is the probability that the waiting time until the next accident is less than 12 hours?
(c) If there have been no accidents before noon on a particular day what is the probability that
there are no accidents before midnight on the same day?
12. Jamie figures that the total number of thousands of kilometers that an auto can be driven before it
needs to be junked is an Exponential random variable with mean 20 thousand kilometers. Smith
has a used car that he claims has been driven 10 thousand kilometers.
(a) If Jamie purchases the car, what is the probability that Jamie would get at least 20 thousand
additional kilometers out of the car?
(b) Repeat the calculation in (a) under the assumption that the lifetime kilometer-age of the car
(in thousands of kilometers) is a $U(0, 40)$ random variable.
13. Server crashes at a giant data center are assumed to follow a Poisson process. On average there
are three server crashes per day (24 hours).
(a) What is the probability that the waiting time between two consecutive crashes is greater
than 8 hours?
(b) Suppose there have been no server crashes for the last 8 hours. What is the probability that
the time until the next crash exceeds one hour?
(c) What is the probability that there are fewer than three crashes in a day?
14. Gamma distribution: A continuous random variable X is said to have the Gamma distribution
with parameters > 0 and > 0 if
(
1
( )
x 1 e x= x 0
f (x) =
0 otherwise
(a) Show that f is a legitimate probability density function.

(b) Use the properties of the Gamma function to obtain E(X) and V ar(X).
(c) Verify that setting = 1 results in the Exponential distribution with parameter = 1= .
15. The examination scores obtained by a large group of students can be modelled by a Normal
distribution with a mean of 65% and a standard deviation of 10%. Find the proportion of students
who obtain each of the following letter grades:
A ($\ge 80\%$), B ($70$ to $80\%$), C ($60$ to $70\%$), D ($50$ to $60\%$), F ($< 50\%$)
16. Suppose $X \sim N(10, 16)$. Find the 20th, 40th, 60th, and 80th percentiles of the distribution.
17. The number of liters $X$ that a filling machine in a water bottling plant deposits in a nominal two
liter bottle follows a Normal distribution $N(\mu, \sigma^2)$, where $\sigma = 0.01$ liters and $\mu$ is the setting (in
liters) on the machine.
(a) If $\mu = 2$, what is the probability a bottle has less than 2 liters of water in it?
(b) Find $c$ such that $P(|X - \mu| \le c) = 0.9$.
(c) What should $\mu$ be set at to make the probability a bottle has less than 2 liters be less than
0.01?
18. A manufacturer produces bolts that are specified to be between 1.19 and 1.21 centimeters in
diameter. If the production process results in a bolt’s diameter being Normally distributed with
mean 1.20 centimeters and standard deviation 0.005 centimeters, what percentage of bolts will
not meet specifications?
19. Suppose that the diameters in millimeters of the eggs laid by a large flock of hens can be modelled
by a Normal distribution with a mean of 40 millimeters and a variance of 4 (millimeters)$^2$. The
wholesale selling price is 5 cents for an egg less than 37 millimeters in diameter, 6 cents for eggs
between 37 and 42 millimeters, and 7 cents for eggs over 42 millimeters. What is the average
wholesale price per egg?
20. The manufacturer of computer chips advertises that the lifetimes of the computer chips that
it produces are Normally distributed with mean $\mu = 5 \times 10^6$ hours and standard deviation
$\sigma = 5 \times 10^5$ hours.
(a) What proportion of computer chips last less than $6 \times 10^6$ hours?
(b) What proportion of computer chips last longer than $4 \times 10^6$ hours?
(c) The manufacturer is working on improvements to the computer chips to increase the average
lifetime. The manufacturer wishes to ensure that at least 95 percent of the computer
chips last longer than $4.5 \times 10^6$ hours. What should the value of $\mu$ be to achieve this?
(Assume the value of $\sigma$ is unchanged by the improvements.)
21. The temperature of a CPU (central processing unit) while operating, $X$, has a Normal distribution
with mean $\mu = 60$ degrees Celsius and standard deviation $\sigma = 5$ degrees Celsius. If the
temperature reaches over 75 degrees, it will throttle (slow down) to avoid damage to the CPU. If
it reaches 95 degrees, it will shut down.
(a) What is the probability the CPU will slow down?

(b) Find c such that P (X c) = 0:9.
(c) You can overclock your CPU (make it run faster than it’s supposed to) but that increases
its average operating temperature $\mu$. What should $\mu$ be set at to make the probability the
CPU shuts down be 0.01? (Assume $\sigma$ does not change and that there is no slow down at 75
degrees.)
22. Binary classification: Many situations require that we “classify” a unit of some type as being
one of two types, which for convenience we will call positive and negative. For example, a
diagnostic test for a disease might be positive or negative; an email message may be spam or not
spam; a credit card transaction may be fraudulent or not. The problem is that in many cases we
cannot tell for certain whether a unit is positive or negative, so when we have to classify the unit,
we may make errors. The following framework helps us to deal with these problems.
For a randomly selected unit from the population being considered, define the random variable
Y such that Y = 1 if the unit is positive and Y = 0 if the unit is negative. Suppose that we are
unable to determine if Y = 0 or Y = 1 for a given unit, but that we can obtain a measurement
$X$ with the property that
if $Y = 1$ then $X \sim N(\mu_1, \sigma_1^2)$
if $Y = 0$ then $X \sim N(\mu_0, \sigma_0^2)$
where $\mu_1 > \mu_0$. We then classify the unit based on the measurement $X$ according to the following
rule:
if $X \ge d$ then classify the unit as positive
if $X < d$ then classify the unit as negative
where $d$ is a value chosen to be between $\mu_0$ and $\mu_1$. Such a rule can obviously result in errors.
If a unit is actually positive and we wrongly classified it as negative then we call this a “false
negative”. If a unit is actually negative and we wrongly classified it as positive then we call this
a “false positive”.
(a) Find the probability of a false negative and the probability of a false positive if $\mu_0 = 0$,
$\mu_1 = 10$, $\sigma_0 = 4$, $\sigma_1 = 6$ and $d = 5$.
(b) Find the probability of a false negative and the probability of a false positive if $\mu_0 = 0$,
$\mu_1 = 10$, $\sigma_0 = 3$, $\sigma_1 = 3$. Explain in words why the false negative and false positive
misclassification probabilities are smaller than in (a).
23. Binary classification and spam detection: Chapter 4, Problems 21 and 22 discussed methods
of spam detection using binary features. Suppose that for a given email message we compute a
measure X, designed so that X tends to be high for spam messages and low for regular (non-
spam) messages. (For example X can be a composite measure based on the presence or absence
of certain words in a message, as well as other features.) In this problem we assume that X is a
continuous random variable.
Suppose that for spam messages, the distribution of $X$ is approximately $N(\mu_1, \sigma_1^2)$, and that for
regular messages, it is approximately $N(\mu_0, \sigma_0^2)$, where $\mu_1 > \mu_0$. This is the setup described in
Problem 22. We will filter spam by picking a value $d$, and then filtering any message for which
$X \ge d$. The trick here is to decide what value of $d$ to use.
(a) Suppose that $\mu_0 = 0$, $\mu_1 = 10$, $\sigma_0 = 3$, $\sigma_1 = 3$. What is the probability of a false positive
(filtering a message that is not spam) and a false negative (not filtering a message that is
spam) under each of the three choices (i) $d = 5$ (ii) $d = 4$ (iii) $d = 6$?
(b) What factors would determine which of the three choices of d would be best to use?
24. Let $T_1, T_2, \ldots, T_n$ denote the first $n$ interarrival times of a Poisson process $X_t$ (recall that $X_t$ is
the number of hits in $[0, t)$) with intensity $\lambda$.
(a) What is the interpretation of $S_n = \sum_{i=1}^{n} T_i$?
(b) Argue that the two events $\{S_n \le t\}$ and $\{X_t \ge n\}$ are identical.
(c) Use (b) to show that
$$P(S_n \le t) = 1 - \sum_{j=0}^{n-1} \frac{e^{-\lambda t}(\lambda t)^j}{j!}$$
(d) By differentiating the cumulative distribution function of Sn given in (c), show that Sn is a
Gamma random variable (see Problem 8:12) with parameters = n and = .
25. Cauchy distribution: A random variable X is said to have a Cauchy distribution with parameter
> 0 if
f (x) = x2<
( 2 + x2 )
(a) Show that E(X) does not exist. Explain why this also implies V ar(X) does not exist
(b) Let Y = 1=X. Show that Y has a Cauchy distribution with parameter 1.
(c) Find the cumulative distribution function F and the inverse cumulative distribution function
F 1 for the random variable X.
(d) Suppose U U ( 1; 0). Find a function g such that g(U ) is a Cauchy random variable
with parameter .
26. Challenge Problem: Given a circle, find the probability that a chord chosen at random is longer
than the side of an inscribed equilateral triangle. For example in Figure 8.20, the line joining A
and B satisfies the condition, the other lines do not. This is called Bertrand’s paradox (see
http://www.cut-the-knot.org/bertrand.shtml) and there are various possible solutions, depending on
exactly how you interpret the phrase “a chord chosen at random”. For example, since the only
important thing is the position of the second point relative to the first one, we can fix the point A
and consider only the chords that emanate from this point. Then it becomes clear that 1/3 of the
outcomes (those with angle with the tangent at that point between 60 and 120 degrees) will result
in a chord longer than the side of an equilateral triangle. But a chord is fully determined by its
midpoint. Chords whose length exceeds the side of an equilateral triangle have their midpoints
inside a smaller circle with radius equal to 1/2 that of the given one. If we choose the midpoint
of the chord at random and uniformly from the points within the circle, what is the probability
that the corresponding chord has length greater than the side of the triangle? Can you think of any
other interpretations which lead to different answers?
[Figure 8.20: Bertrand’s Paradox]
9. MULTIVARIATE DISTRIBUTIONS
9.1 Basic Terminology and Techniques

Many problems involve more than a single random variable. When there are multiple random variables
associated with an experiment or process we usually denote them as $X, Y, \ldots$ or as $X_1, X_2, \ldots$. For
example, your final mark in a course might involve $X_1$ = your assignment mark, $X_2$ = your midterm
test mark, and $X_3$ = your exam mark. We need to extend the ideas introduced for single variables
to deal with multivariate problems. In Sections 9.1 to 9.5 we consider discrete multivariate problems.
Continuous multivariate variables are also common in daily life (e.g. consider a person’s height $X$ and
weight $Y$, or $X_1$ = the return from Stock 1, $X_2$ = the return from Stock 2). In Section 9.6 we will only
consider the special case of a linear combination of independent Normal random variables.
To introduce the ideas in a simple setting, we will first consider an example in which there are only
a few possible values of the variables. Later we will apply these concepts to more complex examples.
In particular we will look at the Multinomial distribution which is a generalization of the Binomial
distribution that we saw in Chapter 5. The ideas themselves are simple even though some applications
can involve fairly messy algebra.
Joint Probability Functions:
First, suppose there are two discrete random variables $X$ and $Y$, and define the function
$$f(x, y) = P(X = x \text{ and } Y = y) = P(X = x, Y = y).$$
We call $f(x, y)$ the joint probability function of $(X, Y)$. The properties of a joint probability function
are similar to those for a single variable; for two random variables we have $f(x, y) \ge 0$ for all $(x, y)$
and
$$\sum_{\text{all } (x, y)} f(x, y) = 1.$$
In general,
$$f(x_1, x_2, \ldots, x_n) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$$
if there are $n$ random variables $X_1, \ldots, X_n$.

Example: Consider the following numerical example, where we show $f(x, y)$ in a table.
f(x, y)    x = 0    x = 1    x = 2
y = 1      0.1      0.2      0.3
y = 2      0.2      0.1      0.1
For example $f(0, 2) = P(X = 0, Y = 2) = 0.2$. We can check that $f(x, y)$ is a proper joint probability
function since $f(x, y) \ge 0$ for all 6 combinations of $(x, y)$ and the sum of these 6 probabilities is 1.
When there are only a few values for $X$ and $Y$ it is often easier to tabulate $f(x, y)$ than to find a formula
for it. We’ll use this example below to illustrate other definitions for multivariate distributions, but first
we give a short example where we need to find $f(x, y)$.
Example: Suppose a fair coin is tossed 3 times. Define the random variables $X$ = number of Heads
and $Y$ = 1 (0) if Heads (Tails) occurs on the first toss. Find the joint probability function for $(X, Y)$.
Solution: First we should note the range for $(X, Y)$, which is the set of possible values $(x, y)$ which
can occur. Clearly $X$ can be 0, 1, 2, or 3 and $Y$ can be 0 or 1, but we’ll see that not all 8 combinations
$(x, y)$ are possible. We can find $f(x, y) = P(X = x, Y = y)$ by just writing down the sample space
$$S = \{HHH, HHT, HTH, THH, HTT, THT, TTH, TTT\}$$
that we have used before for this process. Then simple counting gives $f(x, y)$ as shown in the following
table:
f(x, y)    x = 0    x = 1    x = 2    x = 3
y = 0      1/8      2/8      1/8      0
y = 1      0        1/8      2/8      1/8
For example, $(X, Y) = (0, 0)$ if and only if the outcome is $TTT$; $(X, Y) = (1, 0)$ if and only if the
outcome is either $THT$ or $TTH$.
Note that the joint probability function for $(X, Y)$ is a little awkward to write down in a formula,
so we just use a table.
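The "simple counting" over the sample space in the coin example can be sketched in Python as follows. This is an illustration only; the function name `coin_joint_pf` is hypothetical, and exact fractions are used so that the entries match the table.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def coin_joint_pf(n_tosses: int = 3) -> dict:
    """Joint probability function of (X, Y) for n fair coin tosses.

    X = number of Heads, Y = 1 if the first toss is Heads and 0 otherwise.
    Only pairs (x, y) with positive probability appear as keys.
    """
    outcomes = list(product("HT", repeat=n_tosses))  # equally likely outcomes
    counts = Counter(
        (outcome.count("H"), 1 if outcome[0] == "H" else 0) for outcome in outcomes
    )
    return {xy: Fraction(k, len(outcomes)) for xy, k in counts.items()}


if __name__ == "__main__":
    for (x, y), prob in sorted(coin_joint_pf().items()):
        print(f"f({x}, {y}) = {prob}")
```

Pairs such as $(3, 0)$ and $(0, 1)$ never occur in the sample space, so they are simply absent from the dictionary; this corresponds to the zero entries in the table.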
Marginal Distributions:
We may be given a joint probability function involving more variables than we’re interested in using.
How can we eliminate any which are not of interest? Look at the first example above. If we’re only
interested in $X$, and don’t care what value $Y$ takes, we can see that
$$\begin{aligned}
P(X = 0) &= P(X = 0, Y = 1) + P(X = 0, Y = 2) \\
&= f(0, 1) + f(0, 2) \\
&= 0.3.
\end{aligned}$$
Similarly
$$P(X = 1) = f(1, 1) + f(1, 2) = 0.3 \quad \text{and} \quad P(X = 2) = f(2, 1) + f(2, 2) = 0.4$$
The distribution of $X$ obtained in this way from the joint probability function is called the marginal
probability function of $X$:
x                     0      1      2      Total
f1(x) = P(X = x)      0.3    0.3    0.4    1
In the same way, if we were only interested in $Y$, we obtain
$$P(Y = 1) = f(0, 1) + f(1, 1) + f(2, 1) = 0.6$$
since $X$ can be 0, 1, or 2 when $Y = 1$. The marginal probability function of $Y$ would be:
y                     1      2      Total
f2(y) = P(Y = y)      0.6    0.4    1
Note that we use the notation $f_1(x) = P(X = x)$ and $f_2(y) = P(Y = y)$ to avoid confusion with
$f(x, y) = P(X = x, Y = y)$. An alternative notation that you may see is $f_X(x)$ and $f_Y(y)$.
In general, to find $f_1(x)$ we add over all values of $y$ where $X = x$, and to find $f_2(y)$ we add over
all values of $x$ with $Y = y$. Then
$$f_1(x) = \sum_{\text{all } y} f(x, y) \quad \text{and} \quad f_2(y) = \sum_{\text{all } x} f(x, y)$$
This reasoning can be extended beyond two variables. For example, with three variables $(X_1, X_2, X_3)$,
$$f_1(x_1) = \sum_{\text{all } (x_2, x_3)} f(x_1, x_2, x_3)$$
$$\text{and} \quad f_{1,3}(x_1, x_3) = \sum_{\text{all } x_2} f(x_1, x_2, x_3) = P(X_1 = x_1, X_3 = x_3)$$
where $f_{1,3}(x_1, x_3)$ is the marginal joint probability function of $(X_1, X_3)$.
Note that if the joint probability function is given in a table then the marginal probability functions
are obtained by simply summing over the rows and columns as shown in the table below for the coin
example above:
f(x, y)    x = 0    x = 1    x = 2    x = 3    f2(y)
y = 0      1/8      2/8      1/8      0        4/8
y = 1      0        1/8      2/8      1/8      4/8
f1(x)      1/8      3/8      3/8      1/8      1
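Summing a joint table over its rows and columns is easy to mechanize. The following sketch is illustrative; the function name `marginals` and the dictionary representation of the table (keys are $(x, y)$ pairs) are assumptions made for the example. It is applied to the first numerical example of this section, with probabilities stored as exact fractions.

```python
from fractions import Fraction


def marginals(joint: dict) -> tuple:
    """Return (f1, f2): the marginal probability functions of X and of Y.

    joint maps (x, y) pairs to probabilities f(x, y).
    """
    f1, f2 = {}, {}
    for (x, y), prob in joint.items():
        f1[x] = f1.get(x, 0) + prob  # add over all y for this x
        f2[y] = f2.get(y, 0) + prob  # add over all x for this y
    return f1, f2


tenth = Fraction(1, 10)
# f(x, y) from the first numerical example: x in {0, 1, 2}, y in {1, 2}
table = {
    (0, 1): 1 * tenth, (1, 1): 2 * tenth, (2, 1): 3 * tenth,
    (0, 2): 2 * tenth, (1, 2): 1 * tenth, (2, 2): 1 * tenth,
}

if __name__ == "__main__":
    f1, f2 = marginals(table)
    print({x: str(p) for x, p in f1.items()})
    print({y: str(p) for y, p in f2.items()})
```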
Independent Random Variables:

For events $A$ and $B$, we have defined $A$ and $B$ to be independent if and only if $P(AB) = P(A)P(B)$.
This definition can be extended to random variables $(X, Y)$. Two random variables are independent if
their joint probability function is the product of the marginal probability functions.
Definition 26 $X$ and $Y$ are independent random variables if $f(x, y) = f_1(x) f_2(y)$ for all values
$(x, y)$.
Definition 27 In general, $X_1, X_2, \ldots, X_n$ are independent random variables if and only if
$$f(x_1, x_2, \ldots, x_n) = f_1(x_1) f_2(x_2) \cdots f_n(x_n) \quad \text{for all } x_1, x_2, \ldots, x_n$$
In the first example $X$ and $Y$ are not independent since $f_1(x) f_2(y) \ne f(x, y)$ for every one of the 6
combinations of $(x, y)$ values; e.g., $f(1, 1) = 0.2 \ne f_1(1) f_2(1) = (0.3)(0.6)$. Be careful applying this
definition. You can only conclude that $X$ and $Y$ are independent after checking all $(x, y)$ combinations.
Even a single case where $f_1(x) f_2(y) \ne f(x, y)$ makes $X$ and $Y$ dependent random variables.
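The check in Definition 26 must cover every $(x, y)$ combination, including those where $f(x, y) = 0$. A minimal sketch of such a check is given below; the function names `marginals` and `is_independent` are hypothetical, and exact fractions are used because testing equality of floating point products is unreliable.

```python
from fractions import Fraction
from itertools import product


def marginals(joint: dict) -> tuple:
    """Marginal probability functions (f1, f2) of a joint table keyed by (x, y)."""
    f1, f2 = {}, {}
    for (x, y), prob in joint.items():
        f1[x] = f1.get(x, 0) + prob
        f2[y] = f2.get(y, 0) + prob
    return f1, f2


def is_independent(joint: dict) -> bool:
    """True only if f(x, y) = f1(x) * f2(y) for every combination (x, y)."""
    f1, f2 = marginals(joint)
    # Combinations missing from the table have f(x, y) = 0.
    return all(joint.get((x, y), 0) == f1[x] * f2[y] for x, y in product(f1, f2))


tenth = Fraction(1, 10)
table = {
    (0, 1): 1 * tenth, (1, 1): 2 * tenth, (2, 1): 3 * tenth,
    (0, 2): 2 * tenth, (1, 2): 1 * tenth, (2, 2): 1 * tenth,
}
f1, f2 = marginals(table)
# A table built as the product of the marginals is independent by construction.
product_table = {(x, y): f1[x] * f2[y] for x, y in product(f1, f2)}

if __name__ == "__main__":
    print(is_independent(table), is_independent(product_table))
```

For the first example the function reports dependence, while the product table built from the same marginals passes the check.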
Conditional Probability Functions:

Again we can extend a definition from events to random variables. For events $A$ and $B$, recall that
$$P(A|B) = \frac{P(AB)}{P(B)} \quad \text{provided } P(B) > 0$$
Since
$$P(X = x | Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} \quad \text{provided } P(Y = y) > 0$$
we make the following definition.
Definition 28 The conditional probability function of $X$ given $Y = y$ is
$$f_1(x|y) = \frac{f(x, y)}{f_2(y)} \quad \text{provided } f_2(y) > 0$$
Similarly, the conditional probability function of $Y$ given $X = x$ is
$$f_2(y|x) = \frac{f(x, y)}{f_1(x)} \quad \text{provided } f_1(x) > 0$$
Example: Suppose X and Y have joint probability function:
                  x
f(x, y)     0     1     2    f2(y)
y = 1      0.1   0.2   0.3    0.6
y = 2      0.2   0.1   0.1    0.4
f1(x)      0.3   0.3   0.4     1
Find $f_1(x \mid Y = 1) = P(X = x \mid Y = 1)$.
Solution: Since
\[ f_1(x \mid Y = 1) = \frac{f(x, 1)}{f_2(1)} \]
we obtain
x                      0               1               2          Total
f1(x | Y = 1)    0.1/0.6 = 1/6   0.2/0.6 = 1/3   0.3/0.6 = 1/2      1
As you would expect, marginal and conditional probability functions are probability functions in
that they are always $\ge 0$ and their sum is 1.
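As a quick numerical companion to Definition 28, the following sketch computes the conditional probability function of X given Y = 1 for the table in this example. The helper name `conditional_x_given_y` is hypothetical and chosen only for illustration.

```python
from fractions import Fraction as F

# Joint probability function f(x, y) from the example; keys are (x, y).
joint = {
    (0, 1): F(1, 10), (1, 1): F(2, 10), (2, 1): F(3, 10),
    (0, 2): F(2, 10), (1, 2): F(1, 10), (2, 2): F(1, 10),
}

def conditional_x_given_y(joint, y):
    """f1(x | y) = f(x, y) / f2(y), defined only when f2(y) > 0."""
    f2_y = sum(p for (_, yy), p in joint.items() if yy == y)
    if f2_y == 0:
        raise ValueError("f2(y) must be positive")
    return {x: p / f2_y for (x, yy), p in joint.items() if yy == y}

cond = conditional_x_given_y(joint, 1)
print(cond)                 # {0: Fraction(1, 6), 1: Fraction(1, 3), 2: Fraction(1, 2)}
print(sum(cond.values()))   # 1
```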
Functions of Random Variables:
Your final mark in a course might be a function of the three variables $X_1, X_2, X_3$: assignment,
midterm, and exam marks.[34] We often encounter problems where we need to find the probability
distribution of a function of two or more random variables. The most general method for finding the
probability function for some function of random variables X and Y involves looking at every
combination $(x, y)$ to see what value the function takes.
Example: Suppose X and Y have joint probability function
                  x
f(x, y)     0     1     2
y = 1      0.1   0.2   0.3
y = 2      0.2   0.1   0.1
and we want to find the probability function of $U = 2(Y - X)$. The possible values of U are seen by
looking at the value of $u = 2(y - x)$ for each $(x, y)$ in the range of $(X, Y)$.
                  x
u           0     1     2
y = 1       2     0    -2
y = 2       4     2     0
Since
\[ \begin{aligned}
P(U = -2) &= P(X = 2, Y = 1) = f(2, 1) = 0.3 \\
P(U = 0) &= P(X = 1, Y = 1) + P(X = 2, Y = 2) = f(1, 1) + f(2, 2) = 0.3 \\
P(U = 2) &= f(0, 1) + f(1, 2) = 0.2 \\
P(U = 4) &= f(0, 2) = 0.2
\end{aligned} \]
the probability function of U is
u            -2     0     2     4    Total
P(U = u)    0.3   0.3   0.2   0.2      1
[34] “Don’t worry about your marks. Just make sure that you keep up with the work and that you don’t have to repeat a year.
It’s not necessary to have good marks in everything.” Albert Einstein in letter to his son, 1916.
For some functions it is possible to approach the problem more systematically. One of the most
common functions of this type is the total. Let $T = X + Y$. This gives:
                  x
t           0     1     2
y = 1       1     2     3
y = 2       2     3     4
Then $P(T = 3) = f(1, 2) + f(2, 1) = 0.4$, for example. Continuing in this way, we obtain
t            1     2     3     4    Total
P(T = t)    0.1   0.4   0.4   0.1     1
In fact, to find $P(T = t)$ we are simply adding the probabilities for all $(x, y)$ combinations with
$x + y = t$. This could be written as:
\[ P(T = t) = \sum_{\substack{\text{all } (x, y) \\ \text{with } x + y = t}} f(x, y) \]
However, if $x + y = t$, then $y = t - x$. To systematically pick out the right combinations of $(x, y)$, all
we really need to do is sum over values of x and then substitute $t - x$ for y. Then,
\[ P(T = t) = \sum_{\text{all } x} f(x, t - x) = \sum_{\text{all } x} P(X = x, Y = t - x) \]
So $P(T = 3)$ would be
\[ P(T = 3) = \sum_{\text{all } x} f(x, 3 - x) = f(0, 3) + f(1, 2) + f(2, 1) = 0.4 \]
(note $f(0, 3) = 0$ since Y can't be 3.)
We can summarize the method of finding the probability function for a function $U = g(X, Y)$ of two
random variables X and Y as follows:
Let $f(x, y) = P(X = x, Y = y)$ be the probability function for $(X, Y)$. Then the probability function
for U is
\[ P(U = u) = \sum_{\substack{\text{all } (x, y): \\ g(x, y) = u}} f(x, y) \]
This can also be extended to functions of three or more random variables $U = g(X_1, X_2, \ldots, X_n)$:
\[ P(U = u) = \sum_{\substack{(x_1, \ldots, x_n): \\ g(x_1, \ldots, x_n) = u}} f(x_1, \ldots, x_n) \]
Note: Do not get confused between the functions f and g in the above: $f(x, y)$ is the joint probability
function of the random variables X, Y whereas $U = g(X, Y)$ defines the “new” random variable that
is a function of X and Y, and whose distribution we want to find.
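The summary above translates almost directly into code: visit every (x, y), evaluate g, and accumulate f(x, y) under the resulting value u. The sketch below is a minimal illustration using the joint table of the preceding examples; `pf_of_function` is a name introduced here for illustration only.

```python
from collections import defaultdict
from fractions import Fraction as F

# Joint probability function f(x, y) from the examples; keys are (x, y).
joint = {
    (0, 1): F(1, 10), (1, 1): F(2, 10), (2, 1): F(3, 10),
    (0, 2): F(2, 10), (1, 2): F(1, 10), (2, 2): F(1, 10),
}

def pf_of_function(joint, g):
    """P(U = u) = sum of f(x, y) over all (x, y) with g(x, y) = u."""
    pf = defaultdict(F)              # missing keys start at Fraction(0)
    for (x, y), p in joint.items():
        pf[g(x, y)] += p
    return dict(pf)

pf_u = pf_of_function(joint, lambda x, y: 2 * (y - x))   # U = 2(Y - X)
pf_t = pf_of_function(joint, lambda x, y: x + y)          # T = X + Y

for u in sorted(pf_u):
    print(u, pf_u[u])   # -2 3/10, 0 3/10, 2 1/5, 4 1/5
for t in sorted(pf_t):
    print(t, pf_t[t])   # 1 1/10, 2 2/5, 3 2/5, 4 1/10
```

Here f is the data (`joint`) and g is the function passed in, which keeps the two roles separate in the same way the note above asks the reader to.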
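Theorem 29 If $X \sim Poisson(\mu_1)$ and $Y \sim Poisson(\mu_2)$ independently then
$T = X + Y \sim Poisson(\mu_1 + \mu_2)$.
Proof. Since $X \sim Poisson(\mu_1)$ and independently $Y \sim Poisson(\mu_2)$ their joint probability
function is given by
\[ f(x, y) = \frac{\mu_1^x e^{-\mu_1}}{x!} \cdot \frac{\mu_2^y e^{-\mu_2}}{y!} \quad \text{for } x = 0, 1, \ldots \text{ and } y = 0, 1, \ldots \]
The probability function of T is
\[ \begin{aligned}
P(T = t) &= P(X + Y = t) \\
&= \sum_{\text{all } x} P(X = x, Y = t - x) \\
&= \sum_{x=0}^{t} \frac{\mu_1^x e^{-\mu_1}}{x!} \cdot \frac{\mu_2^{t-x} e^{-\mu_2}}{(t - x)!} \\
&= \mu_2^t e^{-(\mu_1 + \mu_2)} \sum_{x=0}^{t} \frac{1}{x!\,(t - x)!} \left( \frac{\mu_1}{\mu_2} \right)^x \\
&= \frac{\mu_2^t e^{-(\mu_1 + \mu_2)}}{t!} \sum_{x=0}^{t} \binom{t}{x} \left( \frac{\mu_1}{\mu_2} \right)^x \\
&= \frac{\mu_2^t e^{-(\mu_1 + \mu_2)}}{t!} \left( 1 + \frac{\mu_1}{\mu_2} \right)^t \quad \text{by the Binomial Theorem} \\
&= \frac{\mu_2^t e^{-(\mu_1 + \mu_2)}}{t!} \cdot \frac{(\mu_1 + \mu_2)^t}{\mu_2^t} \\
&= \frac{(\mu_1 + \mu_2)^t}{t!} e^{-(\mu_1 + \mu_2)} \quad \text{for } t = 0, 1, 2, \ldots
\end{aligned} \]
which we recognize as the probability function of a $Poisson(\mu_1 + \mu_2)$ random variable, and we have proved the desired
result.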
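The theorem can be spot-checked numerically by carrying out the sum over x directly and comparing it with the Poisson probability function with the combined mean. This is a sanity check under assumed parameter values (1.5 and 2.5 are arbitrary), not a substitute for the proof.

```python
from math import exp, factorial, isclose

def poisson_pf(x, mu):
    """Poisson probability function: mu^x e^(-mu) / x!."""
    return mu ** x * exp(-mu) / factorial(x)

def pf_of_sum(t, mu1, mu2):
    """P(T = t) as the sum over x = 0..t of P(X = x) P(Y = t - x)."""
    return sum(poisson_pf(x, mu1) * poisson_pf(t - x, mu2) for x in range(t + 1))

mu1, mu2 = 1.5, 2.5   # arbitrary illustrative means
checks = [isclose(pf_of_sum(t, mu1, mu2), poisson_pf(t, mu1 + mu2)) for t in range(15)]
print(all(checks))    # True
```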
Exercise: Prove the following theorem.
Theorem 30 If $X \sim Binomial(n, p)$ and $Y \sim Binomial(m, p)$ independently then
$T = X + Y \sim Binomial(n + m, p)$.
Problems
9.1.1 The joint probability function of $(X, Y)$ is:
                  x
f(x, y)     0      1      2
y = 0      0.09   0.06   0.15
y = 1      0.15   0.05   0.20
y = 2      0.06   0.09   0.15
(a) Are X and Y independent random variables? Why?
(b) Tabulate the conditional probability function of Y given X = 0.
(c) Tabulate the probability function of $D = X - Y$.
9.1.2 Suppose X and Y are independent random variables with $f_1(x) = \binom{x + k - 1}{x} p^k (1 - p)^x$ and
$f_2(y) = \binom{y + \ell - 1}{y} p^{\ell} (1 - p)^y$. Let $T = X + Y$. Find the probability function of T. Hint: use the
result $\binom{a + b - 1}{a} = (-1)^a \binom{-b}{a}$.
9.2 Multinomial Distribution

There is only one multivariate model distribution introduced in this course, though other multivariate
distributions do exist. The Multinomial distribution defined below is very important. It is a
generalization of the Binomial model to the case where each trial has k possible outcomes. Before defining the
Multinomial distribution we consider the following example:
Example: Three sprinters, A, B and C, compete against each other in 10 independent 100 m races.
The probabilities of winning any single race are 0.5 for A, 0.4 for B, and 0.1 for C. Let $X_1$, $X_2$ and
$X_3$ be the number of races A, B and C win respectively.
(a) Find the joint probability function, $f(x_1, x_2, x_3)$.
(b) Find the marginal probability function, $f_1(x_1)$.
(c) Find the conditional probability function, $f(x_2 \mid x_1)$.
(d) Are $X_1$ and $X_2$ independent? Why?
(e) Find the probability function of $T = X_1 + X_2$.

Solution: Before starting, note that $x_1 + x_2 + x_3 = 10$ since there are 10 races in all. We really only
have two variables since $x_3 = 10 - x_1 - x_2$. However it is convenient to use $x_3$ to save writing and
preserve symmetry.
(a) The reasoning will be similar to the way we found the Binomial distribution in Chapter 5 except
that there are now 3 types of outcome. There are $\frac{10!}{x_1!\, x_2!\, x_3!}$ different outcomes (that is, results for
races 1 to 10) in which there are $x_1$ wins by A, $x_2$ by B, and $x_3$ by C. Each of these arrangements
has a probability of (0.5) multiplied $x_1$ times, (0.4) multiplied $x_2$ times, and (0.1) multiplied $x_3$
times in some order, that is, $(0.5)^{x_1} (0.4)^{x_2} (0.1)^{x_3}$.
Therefore
\[ f(x_1, x_2, x_3) = \frac{10!}{x_1!\, x_2!\, x_3!} (0.5)^{x_1} (0.4)^{x_2} (0.1)^{x_3} \]
The domain of f is the set $\{(x_1, x_2, x_3) : x_i = 0, 1, \ldots, 10,\ i = 1, 2, 3 \text{ and } x_1 + x_2 + x_3 = 10\}$.
(b) It would also be acceptable to drop $x_3$ as a variable and write down the probability function for
$X_1, X_2$ only; this is
\[ f(x_1, x_2) = \frac{10!}{x_1!\, x_2!\, (10 - x_1 - x_2)!} (0.5)^{x_1} (0.4)^{x_2} (0.1)^{10 - x_1 - x_2} \]
because of the fact that $X_3$ must equal $10 - X_1 - X_2$. For this probability function $x_1 = 0, 1, \ldots, 10$;
$x_2 = 0, 1, \ldots, 10$ and $x_1 + x_2 \le 10$. This simplifies finding $f_1(x_1)$ a little. We
now have $f_1(x_1) = \sum_{x_2} f(x_1, x_2)$. The limits of summation need care: $x_2$ could be as small as 0,
but since $x_1 + x_2 \le 10$, we also require $x_2 \le 10 - x_1$. (For example if $x_1 = 7$ then B can win
0, 1, 2, or 3 races.) Thus,
\[ \begin{aligned}
f_1(x_1) &= \sum_{x_2 = 0}^{10 - x_1} \frac{10!}{x_1!\, x_2!\, (10 - x_1 - x_2)!} (0.5)^{x_1} (0.4)^{x_2} (0.1)^{10 - x_1 - x_2} \\
&= \frac{10!}{x_1!} (0.5)^{x_1} (0.1)^{10 - x_1} \sum_{x_2 = 0}^{10 - x_1} \frac{1}{x_2!\, (10 - x_1 - x_2)!} \left( \frac{0.4}{0.1} \right)^{x_2}
\end{aligned} \]
(Hint: In $\binom{n}{k} = \frac{n!}{k!\,(n - k)!}$ the two terms in the denominator add to the term in the numerator, if
we ignore the ! sign.) Multiply top and bottom by $[x_2 + (10 - x_1 - x_2)]! = (10 - x_1)!$. This
gives
\[ \begin{aligned}
f_1(x_1) &= \frac{10!}{x_1!\, (10 - x_1)!} (0.5)^{x_1} (0.1)^{10 - x_1} \sum_{x_2 = 0}^{10 - x_1} \binom{10 - x_1}{x_2} \left( \frac{0.4}{0.1} \right)^{x_2} \\
&= \binom{10}{x_1} (0.5)^{x_1} (0.1)^{10 - x_1} \left( 1 + \frac{0.4}{0.1} \right)^{10 - x_1} \quad \text{by the Binomial Theorem} \\
&= \binom{10}{x_1} (0.5)^{x_1} (0.1)^{10 - x_1} \frac{(0.1 + 0.4)^{10 - x_1}}{(0.1)^{10 - x_1}} \\
&= \binom{10}{x_1} (0.5)^{x_1} (0.5)^{10 - x_1} \quad \text{for } x_1 = 0, 1, 2, \ldots, 10
\end{aligned} \]
Note: While this derivation is included as an example of how to find marginal distributions by
summing a joint probability function, there is a much simpler method for this problem. Note that
each race is either won by A (“success”) or it is not won by A (“failure”). Since the races are
independent and $X_1$ is now just the number of “success” outcomes, $X_1$ must have a Binomial
distribution, with n = 10 and p = 0.5. Hence
\[ f_1(x_1) = \binom{10}{x_1} (0.5)^{x_1} (0.5)^{10 - x_1} \quad \text{for } x_1 = 0, 1, \ldots, 10 \text{ as above.} \]
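Both routes to the marginal distribution can be compared numerically. The sketch below sums the joint probability function over $x_2$ and checks the result against the Binomial(10, 0.5) probability function; the function names are illustrative choices, not notation from the notes.

```python
from math import comb, factorial, isclose

def joint_pf(x1, x2, n=10, p=(0.5, 0.4, 0.1)):
    """f(x1, x2) for the sprinter example, with x3 = n - x1 - x2 implied."""
    x3 = n - x1 - x2
    coef = factorial(n) // (factorial(x1) * factorial(x2) * factorial(x3))
    return coef * p[0] ** x1 * p[1] ** x2 * p[2] ** x3

def marginal_x1(x1, n=10):
    """f1(x1): sum f(x1, x2) over x2 = 0, ..., n - x1."""
    return sum(joint_pf(x1, x2) for x2 in range(n - x1 + 1))

def binomial_pf(x, n, p):
    """Binomial(n, p) probability function."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

ok = all(isclose(marginal_x1(x1), binomial_pf(x1, 10, 0.5)) for x1 in range(11))
print(ok)   # True
```

Note how the upper limit `n - x1` in `marginal_x1` encodes the care needed with the limits of summation.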
x1
(c) Remember that $f(x_2 \mid x_1) = P(X_2 = x_2 \mid X_1 = x_1)$, so that
\[ \begin{aligned}
f(x_2 \mid x_1) = \frac{f(x_1, x_2)}{f_1(x_1)} &= \frac{\dfrac{10!}{x_1!\, x_2!\, (10 - x_1 - x_2)!} (0.5)^{x_1} (0.4)^{x_2} (0.1)^{10 - x_1 - x_2}}{\dfrac{10!}{x_1!\, (10 - x_1)!} (0.5)^{x_1} (0.5)^{10 - x_1}} \\
&= \frac{(10 - x_1)!}{x_2!\, (10 - x_1 - x_2)!} \cdot \frac{(0.4)^{x_2} (0.1)^{10 - x_1 - x_2}}{(0.5)^{x_2} (0.5)^{10 - x_1 - x_2}} \\
&= \binom{10 - x_1}{x_2} \left( \frac{4}{5} \right)^{x_2} \left( \frac{1}{5} \right)^{10 - x_1 - x_2} \quad \text{for } x_2 = 0, 1, \ldots, (10 - x_1)
\end{aligned} \]
The range of $X_2$ depends on the value $x_1$, which makes sense: if A wins $x_1$ races then the most
B can win is $10 - x_1$.
Note: As in (b), this result can be obtained more simply by general reasoning. Once we are given that
A wins $x_1$ races, the remaining $(10 - x_1)$ races (“trials”) are all won by either B or C. Since
P(B wins) = 0.4 and P(C wins) = 0.1 then, for the races won by either B or C, the probability that
B wins (“Success”) is
\[ P(\text{B wins} \mid \text{B or C wins}) = \frac{P(\text{B wins})}{P(\text{B or C wins})} = \frac{0.4}{0.4 + 0.1} = 0.8 \]
So the probability function of the number of wins by B (“Successes”) in $(10 - x_1)$ races (“trials”) is
\[ f(x_2 \mid x_1) = \binom{10 - x_1}{x_2} (0.8)^{x_2} (0.2)^{10 - x_1 - x_2} \quad \text{for } x_2 = 0, 1, \ldots, 10 - x_1 \]
(d) $X_1$ and $X_2$ are clearly not independent random variables since the more races A wins, the fewer
races there are for B to win. More formally,
\[ f_1(x_1) f_2(x_2) = \binom{10}{x_1} (0.5)^{x_1} (0.5)^{10 - x_1} \binom{10}{x_2} (0.4)^{x_2} (0.6)^{10 - x_2} \ne f(x_1, x_2) \]
(In general, if the range for $X_1$ depends on the value of $X_2$, then $X_1$ and $X_2$ cannot be
independent random variables.)
(e) If $T = X_1 + X_2$ then
\[ \begin{aligned}
f_T(t) = P(T = t) &= \sum_{x_1 = 0}^{t} f(x_1, t - x_1) \\
&= \sum_{x_1 = 0}^{t} \frac{10!}{x_1!\, (t - x_1)!\, \underbrace{(10 - x_1 - (t - x_1))!}_{(10 - t)!}} (0.5)^{x_1} (0.4)^{t - x_1} (0.1)^{10 - t}
\end{aligned} \]
The upper limit on $x_1$ is t because, for example, if t = 7 then A could not have won more than 7
races. Then
\[ f_T(t) = P(T = t) = \frac{10!}{(10 - t)!} (0.4)^t (0.1)^{10 - t} \sum_{x_1 = 0}^{t} \frac{1}{x_1!\, (t - x_1)!} \left( \frac{0.5}{0.4} \right)^{x_1} \]
What do we need to multiply by on the top and bottom? Can you spot it before looking below?
\[ \begin{aligned}
f_T(t) = P(T = t) &= \frac{10!}{t!\, (10 - t)!} (0.4)^t (0.1)^{10 - t} \sum_{x_1 = 0}^{t} \frac{t!}{x_1!\, (t - x_1)!} \left( \frac{0.5}{0.4} \right)^{x_1} \\
&= \binom{10}{t} (0.4)^t (0.1)^{10 - t} \left( 1 + \frac{0.5}{0.4} \right)^t \\
&= \binom{10}{t} (0.4)^t (0.1)^{10 - t} \frac{(0.4 + 0.5)^t}{(0.4)^t} \\
&= \binom{10}{t} (0.9)^t (0.1)^{10 - t} \quad \text{for } t = 0, 1, \ldots, 10
\end{aligned} \]
Exercise: Explain to yourself how the answer in (e) can be obtained from the Binomial distribution,
as we did for parts (b) and (c).
We now generalize this example to the case in which there are k types of outcome rather than three.
Physical Setup for the Multinomial distribution: Suppose an experiment is repeated
independently n times with k distinct types of outcome each time. Let the probabilities of these k types
be $p_1, p_2, \ldots, p_k$ each time. Let $X_1$ be the number of times the 1st type occurs, $X_2$ the number of
times the 2nd occurs, ..., $X_k$ the number of times the k'th type occurs. Then $(X_1, X_2, \ldots, X_k)$ has a
Multinomial distribution.
Notes:
(1) $p_1 + p_2 + \cdots + p_k = 1$
(2) $X_1 + X_2 + \cdots + X_k = n$.
If we wish we can drop one of the variables (say the last), and just note that
$X_k = n - X_1 - X_2 - \cdots - X_{k-1}$.
Illustrations:
(1) If k = 2 and there are two possible outcomes (Success and Failure) then we simply have a
Binomial distribution.
(2) In the example above with sprinters A, B, and C running 10 races we had a Multinomial
distribution with n = 10 and k = 3. Since there were k = 3 possible outcomes this distribution is also
called the Trinomial distribution.
(3) Suppose student marks are given in letter grades as A, B, C, D, or F. In a class of 80 students the
number getting A, B, ..., F might have a Multinomial distribution with n = 80 and k = 5.
Joint Probability Function: The joint probability function of $X_1, X_2, \ldots, X_k$ is given by extending
the argument in the sprinters example from k = 3 to general k. There are $\frac{n!}{x_1!\, x_2! \cdots x_k!}$ different outcomes
of the n trials in which $x_1$ are of the 1st type, $x_2$ are of the 2nd type, etc. Each of these arrangements
has probability $p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$ since $p_1$ is multiplied $x_1$ times in some order, etc. Therefore
\[ f(x_1, x_2, \ldots, x_k) = \frac{n!}{x_1!\, x_2! \cdots x_k!}\, p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k} \]
The restrictions on the $x_i$'s are $x_i = 0, 1, \ldots, n$ and $\sum_{i=1}^{k} x_i = n$.
As a check that $\sum f(x_1, x_2, \ldots, x_k) = 1$ we use the Multinomial Theorem to get
\[ \sum \frac{n!}{x_1!\, x_2! \cdots x_k!}\, p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k} = (p_1 + p_2 + \cdots + p_k)^n = 1 \]
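The joint probability function and the check that it sums to 1 can be illustrated with a short sketch. The function names below are ours; the enumeration is a brute-force check suitable only for small n and k, and the test values are the sprinter probabilities from the example above.

```python
from itertools import product
from math import factorial, isclose

def multinomial_pf(xs, ps):
    """f(x1,...,xk) = n!/(x1!...xk!) * p1^x1 * ... * pk^xk, where n = sum(xs)."""
    coef = factorial(sum(xs))
    for x in xs:
        coef //= factorial(x)      # each partial quotient is an exact integer
    prob = float(coef)
    for x, p in zip(xs, ps):
        prob *= p ** x
    return prob

def total_probability(n, ps):
    """Sum f over all (x1,...,xk) with non-negative entries adding to n."""
    k = len(ps)
    total = 0.0
    for head in product(range(n + 1), repeat=k - 1):
        if sum(head) <= n:         # last count is determined: xk = n - sum(head)
            total += multinomial_pf(head + (n - sum(head),), ps)
    return total

print(isclose(total_probability(10, (0.5, 0.4, 0.1)), 1.0))   # True
```

Dropping the last variable in the enumeration mirrors the remark that $X_k$ is determined by the other counts.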
Example: Every person is one of four blood types: A, B, AB and O. (This is important in determining,
for example, who may give a blood transfusion to a person.) In a large population let the fraction that
has type A, B, AB and O, respectively, be $p_1, p_2, p_3, p_4$. Then, if n persons are randomly selected from
the population, the numbers $X_1, X_2, X_3, X_4$ of types A, B, AB, O have a Multinomial distribution with
k = 4. (In Caucasian people the values of the $p_i$'s are approximately $p_1 = 0.45$, $p_2 = 0.08$, $p_3 = 0.03$,
$p_4 = 0.44$.)
Remark: We sometimes use the notation $(X_1, \ldots, X_k) \sim Multinomial(n, p_1, p_2, \ldots, p_k)$ to
indicate that $(X_1, X_2, \ldots, X_k)$ have a Multinomial distribution.
Remark: For some types of problems it's helpful to write formulas in terms of $x_1, x_2, \ldots, x_{k-1}$ and
$p_1, p_2, \ldots, p_{k-1}$ using the fact that
\[ x_k = n - x_1 - x_2 - \cdots - x_{k-1} \quad \text{and} \quad p_k = 1 - p_1 - p_2 - \cdots - p_{k-1} \]
In this case we can write the joint probability function as $f(x_1, x_2, \ldots, x_{k-1})$ but we must remember
then that $x_1, x_2, \ldots, x_{k-1}$ satisfy the condition $0 \le x_1 + x_2 + \cdots + x_{k-1} \le n$.
The Multinomial distribution can also arise in combination with other models, and students often have
trouble recognizing it then.
Example: A potter is producing teapots one at a time. Assume that they are produced independently
of each other and with probability p the pot produced will be “satisfactory”; the rest are sold at a
lower price. The number, X, of rejects before producing a satisfactory teapot is recorded. When 12
satisfactory teapots are produced, what is the probability the 12 values of X will consist of six 0's,
three 1's, two 2's and one value which is $\ge 3$?
Solution: Each time a “satisfactory” pot is produced the value of X falls in one of the four categories
$X = 0$, $X = 1$, $X = 2$, $X \ge 3$. Under the assumptions given in the question, X has a Geometric
distribution with
\[ P(X = x) = f(x) = p(1 - p)^x \quad \text{for } x = 0, 1, 2, \ldots \]
so we can find the probability for each of these categories.
We have
\[ \begin{aligned}
P(X = 0) &= f(0) = p \\
P(X = 1) &= f(1) = p(1 - p) \\
P(X = 2) &= f(2) = p(1 - p)^2
\end{aligned} \]
and
\[ \begin{aligned}
P(X \ge 3) &= f(3) + f(4) + f(5) + \cdots \\
&= p(1 - p)^3 + p(1 - p)^4 + p(1 - p)^5 + \cdots \\
&= \frac{p(1 - p)^3}{1 - (1 - p)} \\
&= (1 - p)^3 \quad \text{by the Geometric series.}
\end{aligned} \]
Therefore
\[ \begin{aligned}
P(\text{six 0's, three 1's, two 2's and one value} \ge 3)
&= \frac{12!}{6!\,3!\,2!\,1!}\, [p]^6\, [p(1 - p)]^3\, [p(1 - p)^2]^2\, [(1 - p)^3]^1 \\
&= \frac{12!}{6!\,3!\,2!\,1!}\, p^{6 + 3 + 2} (1 - p)^{3 + 4 + 3} \\
&= \frac{12!}{6!\,3!\,2!\,1!}\, p^{11} (1 - p)^{10}
\end{aligned} \]
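To see the two-stage structure (Geometric category probabilities feeding a Multinomial count), the sketch below evaluates the answer both from the four category probabilities and from the closed form. The value p = 0.7 is a hypothetical choice made only so that numbers can be compared; the question itself leaves p general.

```python
from math import factorial, isclose

COUNTS = [6, 3, 2, 1]   # number of observations in X = 0, X = 1, X = 2, X >= 3

def category_probabilities(p):
    """Geometric probabilities of the four categories X = 0, 1, 2 and X >= 3."""
    return [p, p * (1 - p), p * (1 - p) ** 2, (1 - p) ** 3]

def teapot_probability(p):
    """Multinomial probability of the observed category counts."""
    coef = factorial(sum(COUNTS))
    for c in COUNTS:
        coef //= factorial(c)
    prob = float(coef)
    for c, q in zip(COUNTS, category_probabilities(p)):
        prob *= q ** c
    return prob

p = 0.7   # hypothetical value, not given in the example
closed_form = 55440 * p ** 11 * (1 - p) ** 10   # 12!/(6! 3! 2! 1!) = 55440
print(isclose(teapot_probability(p), closed_form))   # True
```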
Problems
9.2.1 An insurance company classifies policy holders as class A, B, C, or D. The probabilities of a
randomly selected policy holder being in these categories are 0.1, 0.4, 0.3 and 0.2, respectively.
Give expressions for the probability that 25 randomly chosen policy holders will include
(a) 3A’s, 11B’s, 7C’s, and 4D’s.

(b) 3A’s and 11B’s.
(c) 3A’s and 11B’s, given that there are 4D’s.
9.2.2 Chocolate chip cookies are made from batter containing an average of 0.6 chips per c.c. Chips
are distributed according to the conditions for a Poisson process. Each cookie uses 12 c.c. of
batter. Give expressions for the probabilities that in a dozen cookies:
(a) 3 have fewer than 5 chips.

(b) 3 have fewer than 5 chips and 7 have more than 9.
(c) 3 have fewer than 5 chips, given that 7 have more than 9.
9.3 Markov Chains

Consider a sequence of (discrete) random variables $X_1, X_2, \ldots$ each of which takes integer values
$1, 2, \ldots, N$ (called states).[35] We assume that for a certain matrix P (called the transition probability
matrix), the conditional probabilities are given by corresponding elements of the matrix; that is,
\[ P(X_{n+1} = j \mid X_n = i) = P_{ij}, \quad i = 1, \ldots, N, \; j = 1, \ldots, N \]
and furthermore that the chain only uses the last state occupied in determining its future; that is,
\[ P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_1, X_{n-2} = i_2, \ldots, X_{n-l} = i_l) = P(X_{n+1} = j \mid X_n = i) = P_{ij} \]
for all $j, i, i_1, i_2, \ldots, i_l$ and $l = 2, 3, \ldots$. Then the sequence of random variables $X_n$ is called a
Markov[36] Chain. Markov Chain models are the most common simple models for dependent variables,
and are used to predict weather as well as movements of security prices. They allow the future of the
process to depend on the present state of the process, but the past behaviour can influence the future
only through the present state.
Example. Rain-No rain
Suppose that the probability that tomorrow is rainy given that today is not raining is $\alpha$ (and it does not
otherwise depend on whether it rained in the past) and the probability that tomorrow is dry given that
today is rainy is $\beta$. If tomorrow's weather depends on the past only through whether today is wet or
dry, we can define random variables
\[ X_n = \begin{cases} 1 & \text{if Day } n \text{ is wet} \\ 0 & \text{if Day } n \text{ is dry} \end{cases} \]
(beginning at some arbitrary time origin, day n = 0). Then the random variables $X_n$, $n = 0, 1, 2, \ldots$
form a Markov chain with N = 2 possible states (labelled 0 and 1) and having probability transition matrix
\[ P = \begin{bmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{bmatrix} \]
where the rows and columns are ordered as state 0 (dry), state 1 (wet).
[35] This section is optional for STAT 220 and STAT 230.
[36] After Andrei Andreyevich Markov (1856-1922), a Russian mathematician, Professor at Saint Petersburg University.
Markov studied sequences of mutually dependent variables, hoping to establish the limiting laws of probability in their most
general form; he discovered Markov chains and so launched the theory of stochastic processes. As well, Markov applied the method
of continued fractions, pioneered by his teacher Pafnuty Chebyshev, to probability theory, and completed Chebyshev's proof of
the central limit theorem (see Chapter 10) for independent non-identically distributed random variables. For entertainment,
Markov was also interested in poetry and studied poetic style.
Properties of the Transition Matrix P

Note that $P_{ij} \ge 0$ for all i, j and $\sum_{j=1}^{N} P_{ij} = 1$ for all i. This last property holds because given that
$X_n = i$, $X_{n+1}$ must occupy one of the states $j = 1, 2, \ldots, N$.
The distribution of Xn
Suppose that the chain is started by randomly choosing a state for $X_0$ with distribution $P[X_0 = i] = q_i$,
$i = 1, 2, \ldots, N$. Then the distribution of $X_1$ is given by
\[ \begin{aligned}
P(X_1 = j) &= \sum_{i=1}^{N} P(X_1 = j, X_0 = i) \\
&= \sum_{i=1}^{N} P(X_1 = j \mid X_0 = i) P(X_0 = i) \\
&= \sum_{i=1}^{N} P_{ij}\, q_i
\end{aligned} \]
and this is the j'th element of the vector $q^T P$ where q is the column vector of values $q_i$. To obtain the
distribution at time n = 1, premultiply the transition matrix P by a vector representing the distribution
at time n = 0. Similarly the distribution of $X_2$ is the vector $q^T P^2$ where $P^2$ is the product of the matrix
P with itself and the distribution of $X_n$ is $q^T P^n$. Under very general conditions, it can be shown that
these probabilities converge because the matrix $P^n$ converges pointwise to a limiting matrix as $n \to \infty$.
In fact, in many such cases, the limit does not depend on the initial distribution q because the limiting
matrix has all of its rows identical and equal to some vector of probabilities $\pi$. Identifying this vector
when convergence holds is reasonably easy.
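The recipe "premultiply P by the current distribution" is easy to iterate. The following sketch uses a hypothetical two-state transition matrix (the entries 0.9, 0.1, 0.5, 0.5 are made up for illustration) and shows that two different starting distributions lead to nearly the same distribution after 50 steps. States are indexed from 0 in the code.

```python
def step(dist, P):
    """One time step: new_dist[j] = sum_i dist[i] * P[i][j] (row vector times P)."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def distribution_at(q, P, n):
    """Distribution of X_n given initial distribution q, i.e. q^T P^n."""
    dist = list(q)
    for _ in range(n):
        dist = step(dist, P)
    return dist

# Hypothetical transition matrix; each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

for q in ([1.0, 0.0], [0.0, 1.0]):
    print([round(v, 6) for v in distribution_at(q, P, 50)])
# Both starting points give approximately [0.833333, 0.166667].
```

Applying `step` repeatedly avoids forming $P^n$ explicitly, which is enough when only the distribution of $X_n$ is wanted.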
Definition 31 A limiting distribution of a Markov chain is a vector ($\pi$ say) of long run probabilities of
the individual states such that
\[ \pi_i = \lim_{t \to \infty} P[X_t = i] \]
Now let us suppose that convergence to this distribution holds for a particular initial distribution q
so we assume that
\[ q^T P^n \to \pi^T \quad \text{as } n \to \infty \]
Then notice that
\[ (q^T P^n) P \to \pi^T P \]
but also
\[ (q^T P^n) P = q^T P^{n+1} \to \pi^T \quad \text{as } n \to \infty \]
so $\pi^T$ must have the property that
\[ \pi^T P = \pi^T \]
Any limiting distribution must have this property and this makes it easy in many examples to identify
the limiting behaviour of the chain.
Definition 32 A stationary distribution of a Markov chain is the column vector ($\pi$ say) of probabilities
of the individual states such that $\pi^T P = \pi^T$.
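Example: (weather continued)
Let us return to the weather example in which the transition probabilities are given by the matrix
\[ P = \begin{bmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{bmatrix} \]
What is the long-run proportion of rainy days? To determine this we need to solve the equations
\[ \pi^T P = \pi^T \]
\[ \begin{bmatrix} \pi_0 & \pi_1 \end{bmatrix} \begin{bmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{bmatrix} = \begin{bmatrix} \pi_0 & \pi_1 \end{bmatrix} \]
subject to the conditions that the values $\pi_0, \pi_1$ are both probabilities (non-negative) and add to one. It
is easy to see that the solution is
\[ \pi_0 = \frac{\beta}{\alpha + \beta}, \qquad \pi_1 = \frac{\alpha}{\alpha + \beta} \]
which is intuitively reasonable in that it says that the long-run probability of the two states is
proportional to the probability of a switch to that state from the other. So the long-run probability of a dry day
is the limit
\[ \pi_0 = \lim_{n \to \infty} P(X_n = 0) = \frac{\beta}{\alpha + \beta} \]
and the long-run proportion of rainy days is $\pi_1 = \alpha / (\alpha + \beta)$.
You might try verifying this by computing the powers $P^n$ of the matrix for $n = 1, 2, \ldots$ and showing
that $P^n$ approaches the matrix
\[ \begin{bmatrix} \dfrac{\beta}{\alpha + \beta} & \dfrac{\alpha}{\alpha + \beta} \\ \dfrac{\beta}{\alpha + \beta} & \dfrac{\alpha}{\alpha + \beta} \end{bmatrix} \]
as $n \to \infty$. There are various mathematical conditions under which the limiting distribution of a
Markov chain is unique and independent of the initial state of the chain but roughly they assert that the
chain is such that it forgets the more and more distant past.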
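The suggested verification can be sketched as follows. Here we write α for the probability of rain tomorrow given a dry day today and β for the probability of a dry day tomorrow given rain today, with state 0 = dry and state 1 = wet; the numerical values 0.3 and 0.6 are hypothetical and chosen only for illustration.

```python
alpha, beta = 0.3, 0.6   # hypothetical switching probabilities

# Rows/columns ordered as state 0 (dry), state 1 (wet).
P = [[1 - alpha, alpha],
     [beta, 1 - beta]]

# Candidate stationary distribution (pi_0, pi_1).
pi = [beta / (alpha + beta), alpha / (alpha + beta)]

def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matpow(M, power):
    """M raised to a non-negative integer power by repeated multiplication."""
    n = len(M)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = matmul(result, M)
    return result

pi_P = [sum(pi[i] * P[i][j] for i in range(2)) for j in range(2)]
P_50 = matpow(P, 50)

print(pi, pi_P)   # the two vectors agree up to rounding error
print(P_50)       # both rows are close to pi
```

With these values both rows of `P_50` are close to (2/3, 1/3), so the limit does not depend on the starting state.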
Independent Random Variables
Consider a Markov chain with transition probability matrix
\[ P = \begin{bmatrix} 1 - \alpha & \alpha \\ 1 - \alpha & \alpha \end{bmatrix} \]
Notice that both rows of this matrix are identical so $P(X_{n+1} = 1 \mid X_n = 0) = \alpha = P(X_{n+1} = 1 \mid X_n = 1)$.
For this chain the conditional distribution of $X_{n+1}$ given $X_n = i$ evidently does not depend on the
value of i. This demonstrates independence. Indeed if X and Y are two discrete random variables and
if the conditional probability function $f_{y|x}(y \mid x)$ of Y given X is identical for all possible values of x
then it must be equal to the unconditional (marginal) probability function $f_y(y)$. If $f_{y|x}(y \mid x) = f_y(y)$
for all values of x and y then X and Y are independent random variables. Therefore if a Markov
Chain has transition probability matrix with all rows identical, it corresponds to independent random
variables $X_1, X_2, \ldots$. This is the most forgetful of all Markov chains. It pays no attention whatever
to the current state in determining the next state.
Is the stationary distribution unique? One might wonder whether it is possible for a Markov chain
to have more than one stationary distribution and consequently possibly more than one limiting
distribution. We have seen that the $2 \times 2$ Markov chain with transition probability matrix
\[ P = \begin{bmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{bmatrix} \]
has a solution of $\pi^T P = \pi^T$ and $\pi_0 + \pi_1 = 1$ given by $\pi_0 = \frac{\beta}{\alpha + \beta}$, $\pi_1 = \frac{\alpha}{\alpha + \beta}$. Is there any
other solution possible? Rewriting the equation $v^T P = v^T$ in the form $v^T (P - I) = 0$, note that the
dimension of the subspace of solutions $v^T$ is one provided that the rank of the matrix $P - I$ is one
(that is, the solutions $v^T$ are all scalar multiples of the vector $\pi^T$), and the dimension is 2 provided
that the rank of the matrix $P - I$ is 0. Only if $\text{rank}(P - I) = 0$ will there be two linearly independent
solutions and hence two possible candidates for equilibrium distributions. But if $P - I$ has rank 0, then
$P = I$, the transition probability matrix of a very stubborn Markov chain which always stays in the
state currently occupied. For two-dimensional Markov Chains, only in the case $P = I$ is there more
than one stationary distribution and any probability vector $\pi^T$ satisfies $\pi^T P = \pi^T$ and is a stationary
distribution. This is at the opposite end of the spectrum from the independent case above which pays
no attention to the current state in determining the next state. The chain with $P = I$ never leaves the
current state.
Example (Gene Model) A simple form of inheritance of traits occurs when a trait is governed by
a pair of genes A and a. An individual may have an AA or an Aa combination (in which case they
are indistinguishable in appearance, or "A dominates a"). Let us call an AA individual dominant, aa
recessive and Aa hybrid. When two individuals mate, the offspring inherits one gene of the pair from
each parent, and we assume that these genes are selected at random. Now let us suppose that two
individuals of opposite sex selected at random mate, and then two of their offspring mate, etc. Here the
state is determined by a pair of individuals, so the states of our process can be considered to be objects
like (AA, Aa) indicating that one of the pair is AA and the other is Aa (we do not distinguish the order
of the pair, or male and female, assuming these genes do not depend on the sex of the individual).
Number   State
1        (AA, AA)
2        (AA, Aa)
3        (AA, aa)
4        (Aa, Aa)
5        (Aa, aa)
6        (aa, aa)
For example, consider the calculation of $P(X_{t+1} = j \mid X_t = 2)$. In this case each offspring has
probability 1/2 of being a dominant AA, and probability of 1/2 of being a hybrid (Aa). If two offspring
are selected independently from this distribution the possible pairs are (AA, AA), (AA, Aa), (Aa, Aa)
with probabilities 1/4, 1/2, 1/4 respectively. So the transitions have probabilities below:
            (AA, AA)  (AA, Aa)  (AA, aa)  (Aa, Aa)  (Aa, aa)  (aa, aa)
(AA, AA)    1         0         0         0         0         0
(AA, Aa)    0.25      0.5       0         0.25      0         0
(AA, aa)    0         0         0         1         0         0
(Aa, Aa)    0.0625    0.25      0.125     0.25      0.25      0.0625
(Aa, aa)    0         0         0         0.25      0.5       0.25
(aa, aa)    0         0         0         0         0         1
and transition probability matrix
\[ P = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0.25 & 0.5 & 0 & 0.25 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0.0625 & 0.25 & 0.125 & 0.25 & 0.25 & 0.0625 \\
0 & 0 & 0 & 0.25 & 0.5 & 0.25 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix} \]
What is the long-run behaviour in such a system? For example, the two-generation transition
probabilities are given by
\[ P^2 = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0.3906 & 0.3125 & 0.0313 & 0.1875 & 0.0625 & 0.0156 \\
0.0625 & 0.25 & 0.125 & 0.25 & 0.25 & 0.0625 \\
0.1406 & 0.1875 & 0.0313 & 0.3125 & 0.1875 & 0.1406 \\
0.0156 & 0.0625 & 0.0313 & 0.1875 & 0.3125 & 0.3906 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix} \]
which seems to indicate a drift to one or other of the extreme states 1 or 6. To confirm the long-run
behaviour calculate:
\[ P^{100} = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0.75 & 0 & 0 & 0 & 0 & 0.25 \\
0.5 & 0 & 0 & 0 & 0 & 0.5 \\
0.5 & 0 & 0 & 0 & 0 & 0.5 \\
0.25 & 0 & 0 & 0 & 0 & 0.75 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix} \]
which shows that eventually the chain is absorbed in either state 1 or state 6, with the probability of
absorption depending on the initial state. This chain, unlike the ones studied before, has more than one
possible stationary distribution, for example, $\pi^T = (1, 0, 0, 0, 0, 0)$ and $\pi^T = (0, 0, 0, 0, 0, 1)$, and in
these circumstances the chain does not have the same limiting distribution for all initial states.
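The matrix powers quoted above can be reproduced with a short sketch. Exact fractions are used so that entries such as 1/64 = 0.015625 are not blurred by rounding; the helper names `matmul` and `matpow` are ours.

```python
from fractions import Fraction as F

# Transition matrix for states 1..6 in the order listed in the table above.
P = [
    [1, 0, 0, 0, 0, 0],
    [F(1, 4), F(1, 2), 0, F(1, 4), 0, 0],
    [0, 0, 0, 1, 0, 0],
    [F(1, 16), F(1, 4), F(1, 8), F(1, 4), F(1, 4), F(1, 16)],
    [0, 0, 0, F(1, 4), F(1, 2), F(1, 4)],
    [0, 0, 0, 0, 0, 1],
]

def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matpow(M, power):
    """M raised to a non-negative integer power by repeated multiplication."""
    n = len(M)
    result = [[F(int(i == j)) for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = matmul(result, M)
    return result

P2 = matpow(P, 2)
P100 = matpow(P, 100)

for row in P2:
    print([round(float(v), 4) for v in row])
for row in P100:
    print([round(float(v), 4) for v in row])
```

In `P100` the middle four columns are numerically negligible, and the first column gives the probability of eventual absorption in state 1 from each starting state.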
9.4 Expectation for Multivariate Distributions: Covariance and Correlation

Recall that for a discrete random variable X with probability function $f(x) = P(X = x)$ we defined
\[ E[g(X)] = \sum_{\text{all } x} g(x) f(x) \]
It is easy to extend the definition of expected value to multiple discrete random variables.
Definition 33
\[ E[g(X, Y)] = \sum_{\text{all } (x, y)} g(x, y) f(x, y) \]
and
\[ E[g(X_1, X_2, \ldots, X_n)] = \sum_{\text{all } (x_1, x_2, \ldots, x_n)} g(x_1, x_2, \ldots, x_n) f(x_1, \ldots, x_n) \]
As before, these represent the average value of $g(X, Y)$ and $g(X_1, X_2, \ldots, X_n)$. $E[g(X, Y)]$ could
also be determined by finding the probability function $f_Z(z)$ of $Z = g(X, Y)$ and then using the
definition of expected value $E(Z) = \sum_{\text{all } z} z f_Z(z)$.
Example: Let the joint probability function, $f(x, y)$, be given by
                  x
f(x, y)     0     1     2    f2(y)
y = 1      0.1   0.2   0.3    0.6
y = 2      0.2   0.1   0.1    0.4
f1(x)      0.3   0.3   0.4     1
Find $E(XY)$ and $E(X)$.
Solution:
\[ \begin{aligned}
E(XY) &= \sum_{\text{all } (x, y)} x y f(x, y) \\
&= (0 \times 1)(0.1) + (1 \times 1)(0.2) + (2 \times 1)(0.3) + (0 \times 2)(0.2) + (1 \times 2)(0.1) + (2 \times 2)(0.1) \\
&= 1.4
\end{aligned} \]
To find $E(X)$ we have a choice of methods. First, taking $g(x, y) = x$ we get
\[ \begin{aligned}
E(X) &= \sum_{\text{all } (x, y)} x f(x, y) \\
&= (0 \times 0.1) + (1 \times 0.2) + (2 \times 0.3) + (0 \times 0.2) + (1 \times 0.1) + (2 \times 0.1) \\
&= 1.1
\end{aligned} \]
Alternatively, since $E(X)$ only involves X, we could find $f_1(x)$ and use
\[ E(X) = \sum_{x=0}^{2} x f_1(x) = (0 \times 0.3) + (1 \times 0.3) + (2 \times 0.4) = 1.1 \]
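Definition 33 amounts to a single weighted sum, as the sketch below illustrates for the table in this example. The helper `expectation` is a name introduced here; exact fractions keep the results free of rounding error.

```python
from fractions import Fraction as F

# Joint probability function f(x, y) from the example; keys are (x, y).
joint = {
    (0, 1): F(1, 10), (1, 1): F(2, 10), (2, 1): F(3, 10),
    (0, 2): F(2, 10), (1, 2): F(1, 10), (2, 2): F(1, 10),
}

def expectation(joint, g):
    """E[g(X, Y)] = sum of g(x, y) * f(x, y) over all (x, y)."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

e_xy = expectation(joint, lambda x, y: x * y)
e_x = expectation(joint, lambda x, y: x)

print(e_xy, e_x)   # 7/5 11/10
```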
Example: In the example of Section 9.2 with sprinters A, B, and C we had (using only $X_1$ and $X_2$ in
our formulas)
\[ f(x_1, x_2) = \frac{10!}{x_1!\, x_2!\, (10 - x_1 - x_2)!} (0.5)^{x_1} (0.4)^{x_2} (0.1)^{10 - x_1 - x_2} \]
where A wins $x_1$ times and B wins $x_2$ times in 10 races. Find $E(X_1 X_2)$.
Solution: This will be similar to the way we derived the mean of the Binomial distribution but, since
this is a Multinomial distribution, we'll be using the Multinomial Theorem to evaluate the sum.
\[ \begin{aligned}
E(X_1 X_2) &= \sum x_1 x_2 f(x_1, x_2) \\
&= \sum_{\substack{x_1 \ne 0 \\ x_2 \ne 0}} x_1 x_2 \frac{10!}{x_1 (x_1 - 1)!\, x_2 (x_2 - 1)!\, (10 - x_1 - x_2)!} (0.5)^{x_1} (0.4)^{x_2} (0.1)^{10 - x_1 - x_2} \\
&= \sum_{\substack{x_1 \ne 0 \\ x_2 \ne 0}} \frac{(10)(9)(8!)}{(x_1 - 1)!\, (x_2 - 1)!\, [(10 - 2) - (x_1 - 1) - (x_2 - 1)]!} (0.5)(0.5)^{x_1 - 1} (0.4)(0.4)^{x_2 - 1} (0.1)^{(10 - 2) - (x_1 - 1) - (x_2 - 1)} \\
&= 90 (0.5)(0.4) \sum_{\substack{x_1 \ne 0 \\ x_2 \ne 0}} \frac{8!}{(x_1 - 1)!\, (x_2 - 1)!\, [8 - (x_1 - 1) - (x_2 - 1)]!} (0.5)^{x_1 - 1} (0.4)^{x_2 - 1} (0.1)^{8 - (x_1 - 1) - (x_2 - 1)}
\end{aligned} \]
Let $y_1 = x_1 - 1$ and $y_2 = x_2 - 1$ in the sum and we obtain
\[ \begin{aligned}
E(X_1 X_2) &= 18 \sum_{(y_1, y_2)} \frac{8!}{y_1!\, y_2!\, (8 - y_1 - y_2)!} (0.5)^{y_1} (0.4)^{y_2} (0.1)^{8 - y_1 - y_2} \\
&= 18 (0.5 + 0.4 + 0.1)^8 = 18 \quad \text{by the Multinomial Theorem}
\end{aligned} \]
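As a numerical cross-check of this algebra, the sum can be evaluated directly over all admissible pairs. This is a brute-force sketch for this specific example; `joint_pf` is an illustrative name.

```python
from math import factorial, isclose

def joint_pf(x1, x2, n=10, p=(0.5, 0.4, 0.1)):
    """f(x1, x2) for the sprinter example, with x3 = n - x1 - x2 implied."""
    x3 = n - x1 - x2
    coef = factorial(n) // (factorial(x1) * factorial(x2) * factorial(x3))
    return coef * p[0] ** x1 * p[1] ** x2 * p[2] ** x3

# Sum x1 * x2 * f(x1, x2) over all pairs with x1 + x2 <= 10.
e_x1x2 = sum(x1 * x2 * joint_pf(x1, x2)
             for x1 in range(11) for x2 in range(11 - x1))

print(isclose(e_x1x2, 18.0))   # True
```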

Property of Multivariate Expectation: It is easily proved (make sure you can do this) that
\[ E[a g_1(X, Y) + b g_2(X, Y)] = a E[g_1(X, Y)] + b E[g_2(X, Y)] \]
This can be extended beyond 2 functions $g_1$ and $g_2$, and beyond 2 variables X and Y.
Relationships between Variables
Independence is a “yes/no” way of defining a relationship between variables. We all know that there
can be different types of relationships between variables which are dependent. For example, if X
is your height in inches and Y your height in centimeters the relationship is one-to-one and linear.
More generally, two random variables may be related (non-independent) in a probabilistic sense. For
example, a person’s weight Y is not an exact linear function of their height X, but Y and X are
nevertheless related. We’ll look at two ways of measuring the strength of the relationship between two
random variables. The first is called covariance.
Definition 34 The covariance of X and Y, denoted $\mathrm{Cov}(X, Y)$ or $\sigma_{XY}$, is
\[ \mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] \]
Note that
\[ \begin{aligned}
\mathrm{Cov}(X, Y) &= E[(X - \mu_X)(Y - \mu_Y)] \\
&= E(XY - \mu_X Y - X \mu_Y + \mu_X \mu_Y) \\
&= E(XY) - \mu_X E(Y) - \mu_Y E(X) + \mu_X \mu_Y \\
&= E(XY) - E(X)E(Y) - E(Y)E(X) + E(X)E(Y) \\
&= E(XY) - E(X)E(Y)
\end{aligned} \]
and $\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y)$ is the formula we usually use for calculation purposes.
Example: Find $\mathrm{Cov}(X, Y)$ if X and Y have joint probability function:
                  x
f(x, y)     0     1     2    f2(y)
y = 1      0.1   0.2   0.3    0.6
y = 2      0.2   0.1   0.1    0.4
f1(x)      0.3   0.3   0.4     1
Solution: We previously calculated $E(XY) = 1.4$ and $E(X) = 1.1$. Similarly, $E(Y) = (1 \times 0.6) +
(2 \times 0.4) = 1.4$. Therefore
\[ \mathrm{Cov}(X, Y) = 1.4 - (1.1)(1.4) = -0.14 \]
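The covariance in this example can be computed both from the definition and from the shortcut formula; the sketch below confirms the two agree and that the sign is negative. Function names are illustrative only.

```python
from fractions import Fraction as F

# Joint probability function f(x, y) from the example; keys are (x, y).
joint = {
    (0, 1): F(1, 10), (1, 1): F(2, 10), (2, 1): F(3, 10),
    (0, 2): F(2, 10), (1, 2): F(1, 10), (2, 2): F(1, 10),
}

def expectation(joint, g):
    """E[g(X, Y)] = sum of g(x, y) * f(x, y) over all (x, y)."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

mu_x = expectation(joint, lambda x, y: x)
mu_y = expectation(joint, lambda x, y: y)

# Definition: E[(X - mu_X)(Y - mu_Y)].
cov_definition = expectation(joint, lambda x, y: (x - mu_x) * (y - mu_y))
# Shortcut: E(XY) - E(X)E(Y).
cov_shortcut = expectation(joint, lambda x, y: x * y) - mu_x * mu_y

print(cov_definition, cov_shortcut)   # -7/50 -7/50
```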
Exercise: Calculate the covariance of $X_1$ and $X_2$ for the sprinter example. We have already found
that $E(X_1 X_2) = 18$. The marginal distributions of $X_1$ and of $X_2$ are models for which we've already
derived the mean. If your solution takes more than a few lines you're missing an easier solution.
Interpretation of Covariance:
(1) Suppose large values of X tend to occur with large values of Y and small values of X with
small values of Y. Then $(X - \mu_X)$ and $(Y - \mu_Y)$ will tend to be of the same sign, whether
positive or negative. Thus $(X - \mu_X)(Y - \mu_Y)$ will be positive. Hence $\mathrm{Cov}(X, Y) > 0$.
For example in Figure 9.1 we see several hundred points plotted. Notice that the majority
of the points are in the two quadrants (lower left and upper right) labelled with “+” so that
for these $(X - \mu_X)(Y - \mu_Y) > 0$. A minority of points are in the other two quadrants
labelled “-” and for these $(X - \mu_X)(Y - \mu_Y) < 0$. Moreover the points in the latter two
quadrants appear closer to the mean $(\mu_X, \mu_Y)$ indicating that on average, over all points generated,
average$((X - \mu_X)(Y - \mu_Y)) > 0$. Presumably this implies that over the joint distribution of
$(X, Y)$, $E[(X - \mu_X)(Y - \mu_Y)] > 0$ or $\mathrm{Cov}(X, Y) > 0$.
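[Figure 9.1: Random points (X, Y) with covariance 0.5, variances 1. Scatter plot of y against x with the four quadrants about the mean $(\mu_X, \mu_Y)$ labelled “+” (lower left, upper right) and “-” (upper left, lower right).]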
For example if X = person's height and Y = person's weight, then these two random variables
will have a positive covariance.
(2) Suppose large values of X tend to occur with small values of Y and small values of X with
large values of Y. Then $(X - \mu_X)$ and $(Y - \mu_Y)$ will tend to be of opposite signs. Thus
$(X - \mu_X)(Y - \mu_Y)$ tends to be negative. Hence $\mathrm{Cov}(X, Y) < 0$. For example see Figure 9.2.
[Figure 9.2: Covariance = -0.5, variances = 1]
For example if X = thickness of attic insulation in a house and Y = heating cost for the house,
then $\mathrm{Cov}(X, Y) < 0$.
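Theorem 35 If $X$ and $Y$ are independent then $\operatorname{Cov}(X, Y) = 0$.
Proof: Recall $E(X - \mu_X) = E(X) - \mu_X = 0$. Let $X$ and $Y$ be independent. Then $f(x, y) = f_1(x) f_2(y)$.
$$\begin{aligned}
\operatorname{Cov}(X, Y) &= E[(X - \mu_X)(Y - \mu_Y)] = \sum_{\text{all } y}\left[\sum_{\text{all } x}(x - \mu_X)(y - \mu_Y) f_1(x) f_2(y)\right] \\
&= \sum_{\text{all } y}\left[(y - \mu_Y) f_2(y) \sum_{\text{all } x}(x - \mu_X) f_1(x)\right] \\
&= \sum_{\text{all } y}\left[(y - \mu_Y) f_2(y)\, E(X - \mu_X)\right] \\
&= \sum_{\text{all } y} 0 = 0
\end{aligned}$$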
The following theorem gives another way to prove the above theorem, and is useful in many other situations.
Theorem 36 Suppose $X$ and $Y$ are independent random variables. Then, if $g_1(X)$ and $g_2(Y)$ are any two functions,
$$E[g_1(X) g_2(Y)] = E[g_1(X)]\,E[g_2(Y)]$$
Proof: Since $X$ and $Y$ are independent, $f(x, y) = f_1(x) f_2(y)$. Thus
$$\begin{aligned}
E[g_1(X) g_2(Y)] &= \sum_{\text{all } (x, y)} g_1(x) g_2(y) f(x, y) \\
&= \sum_{\text{all } x}\sum_{\text{all } y} g_1(x) f_1(x) g_2(y) f_2(y) \\
&= \left[\sum_{\text{all } x} g_1(x) f_1(x)\right]\left[\sum_{\text{all } y} g_2(y) f_2(y)\right] \\
&= E[g_1(X)]\,E[g_2(Y)]
\end{aligned}$$
To prove Theorem 35, we just note that if $X$ and $Y$ are independent then by Theorem 36
$$\begin{aligned}
\operatorname{Cov}(X, Y) &= E[(X - \mu_X)(Y - \mu_Y)] \\
&= E(X - \mu_X)\,E(Y - \mu_Y) \\
&= 0 \times 0 = 0
\end{aligned}$$
Caution: This result is not reversible. If $\operatorname{Cov}(X, Y) = 0$ we cannot conclude that $X$ and $Y$ are independent random variables. For example suppose that the random variable $Z$ has a discrete Uniform distribution on the values $\{-1, -0.9, \ldots, 0.9, 1\}$ and define $X = \sin(2\pi Z)$ and $Y = \cos(2\pi Z)$. It is easy to see that $\operatorname{Cov}(X, Y) = 0$ but the two random variables $X, Y$ are clearly related because the points $(X, Y)$ are always on a circle.
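The claim in the caution above can be checked numerically. The following is a minimal sketch, assuming the 21 equally likely values of $Z$ are $-1, -0.9, \ldots, 1$; the variable names are illustrative and not part of the notes.

```python
import math

# Z is discrete Uniform on {-1, -0.9, ..., 0.9, 1}: 21 equally likely values.
z_values = [k / 10 for k in range(-10, 11)]
prob = 1 / len(z_values)

x_values = [math.sin(2 * math.pi * z) for z in z_values]
y_values = [math.cos(2 * math.pi * z) for z in z_values]

# Cov(X, Y) = E(XY) - E(X)E(Y), computed directly from the probability function.
e_x = sum(prob * x for x in x_values)
e_y = sum(prob * y for y in y_values)
e_xy = sum(prob * x * y for x, y in zip(x_values, y_values))
cov_xy = e_xy - e_x * e_y

# Every point satisfies x^2 + y^2 = 1, so X and Y are clearly related.
on_circle = all(abs(x * x + y * y - 1) < 1e-12 for x, y in zip(x_values, y_values))

print(f"Cov(X, Y) = {cov_xy:.2e}, all points on the unit circle: {on_circle}")
```

The covariance comes out as zero up to floating-point rounding, even though knowing $X$ pins $Y$ down to at most two values.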
Example: Let $(X, Y)$ have the joint probability function $f(0, 0) = 0.2$, $f(1, 1) = 0.6$, $f(2, 0) = 0.2$; that is, $(X, Y)$ only takes three values. Then

x          0     1     2     Total
$f_1(x)$   0.2   0.6   0.2   1

and

y          0     1     Total
$f_2(y)$   0.4   0.6   1

are the marginal probability functions. Since $f_1(x) f_2(y) \neq f(x, y)$ (for instance $f_1(0) f_2(0) = 0.08$ but $f(0, 0) = 0.2$), $X$ and $Y$ are not independent. However,
$$E(XY) = (0 \times 0 \times 0.2) + (1 \times 1 \times 0.6) + (2 \times 0 \times 0.2) = 0.6$$
$$E(X) = (0 \times 0.2) + (1 \times 0.6) + (2 \times 0.2) = 1$$
and
$$E(Y) = (0 \times 0.4) + (1 \times 0.6) = 0.6.$$
Therefore $\operatorname{Cov}(X, Y) = E(XY) - E(X)E(Y) = 0.6 - (1)(0.6) = 0$. So $X$ and $Y$ have covariance 0 but are not independent. If $\operatorname{Cov}(X, Y) = 0$ we say that $X$ and $Y$ are uncorrelated, because of the definition of correlation$^{37}$ given below.
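The calculation in this example can be reproduced from the joint probability function alone. The sketch below uses exact fractions; the helper name `marginal` is hypothetical and introduced only for illustration.

```python
from fractions import Fraction

# Joint probability function from the example: f(0,0)=0.2, f(1,1)=0.6, f(2,0)=0.2.
joint = {(0, 0): Fraction(2, 10), (1, 1): Fraction(6, 10), (2, 0): Fraction(2, 10)}

def marginal(joint_pf, axis):
    """Sum the joint probability function over the other variable (hypothetical helper)."""
    out = {}
    for point, p in joint_pf.items():
        out[point[axis]] = out.get(point[axis], 0) + p
    return out

f1 = marginal(joint, 0)  # marginal probability function of X
f2 = marginal(joint, 1)  # marginal probability function of Y

e_x = sum(x * p for x, p in f1.items())
e_y = sum(y * p for y, p in f2.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
cov_xy = e_xy - e_x * e_y

# Independence requires f(x, y) = f1(x) f2(y) for every pair (x, y).
independent = all(joint.get((x, y), 0) == f1[x] * f2[y] for x in f1 for y in f2)

print("Cov(X, Y) =", cov_xy, "| independent:", independent)
```

Working with `Fraction` avoids rounding, so the covariance is exactly zero while the independence check still fails.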
Exercise:
(a) Look back at the example in which $f(x, y)$ was tabulated and $\operatorname{Cov}(X, Y) = -0.14$. Considering how covariance is interpreted, does it make sense that $\operatorname{Cov}(X, Y)$ would be negative?
(b) Without looking at the actual covariance for the sprinter exercise, would you expect $\operatorname{Cov}(X_1, X_2)$ to be positive or negative? (If A wins more of the 10 races, will B win more races or fewer races?)
The actual numerical value of Cov(X; Y ) has no interpretation, so covariance is of limited use in
measuring relationships. We now consider a second, related way to measure the strength of relationship
between X and Y .
Definition 37 The correlation coefficient of $X$ and $Y$ is
$$\rho = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
The correlation coefficient measures the strength of the linear relationship between $X$ and $Y$ and is simply a rescaled version of the covariance, scaled to lie in the interval $[-1, 1]$. You can attempt to guess the correlation between two variables based on a scatter diagram of values of these variables at http://www.istics.net/Correlations/. For example in Figure 9.3 you can see four correct guesses.
$^{37}$ “The finest things in life include having a clear grasp of correlations.” Albert Einstein, 1919.
Figure 9.3: Guessing the correlation based on a scatter diagram of points
Properties of $\rho$:
(1) Since $\sigma_X$ and $\sigma_Y$, the standard deviations of $X$ and $Y$, are both positive, $\rho$ will have the same sign as $\operatorname{Cov}(X, Y)$. Hence the interpretation of the sign of $\rho$ is the same as for $\operatorname{Cov}(X, Y)$, and $\rho = 0$ if $X$ and $Y$ are independent. When $\rho = 0$ we say that $X$ and $Y$ are uncorrelated.
(2) $-1 \le \rho \le 1$ and as $\rho \to \pm 1$ the relation between $X$ and $Y$ becomes one-to-one and linear.
Proof of (2): Define a new random variable $S = X + tY$, where $t$ is some real number. We’ll show that the fact that $\operatorname{Var}(S) \ge 0$ gives us the desired result. We have
$$\begin{aligned}
\operatorname{Var}(S) &= E\left[(S - \mu_S)^2\right] \\
&= E\left\{[(X + tY) - (\mu_X + t\mu_Y)]^2\right\} \\
&= E\left\{[(X - \mu_X) + t(Y - \mu_Y)]^2\right\} \\
&= E\left[(X - \mu_X)^2 + 2t(X - \mu_X)(Y - \mu_Y) + t^2 (Y - \mu_Y)^2\right] \\
&= \sigma_X^2 + 2t\operatorname{Cov}(X, Y) + t^2 \sigma_Y^2
\end{aligned}$$
Since $\operatorname{Var}(S) \ge 0$ for any real number $t$, this quadratic in $t$ must have at most one real root (value of $t$ for which it is zero). Therefore
$$[2\operatorname{Cov}(X, Y)]^2 - 4\sigma_X^2 \sigma_Y^2 \le 0$$
leading to the inequality
$$\left|\frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}\right| \le 1$$
To see that $\rho = \pm 1$ corresponds to a one-to-one linear relationship between $X$ and $Y$, note that $\rho = \pm 1$ corresponds to a zero discriminant in the quadratic equation. This means that there exists one real number $t^*$ for which
$$\operatorname{Var}(S) = \operatorname{Var}(X + t^* Y) = 0$$
But for $\operatorname{Var}(X + t^* Y)$ to be zero, $X + t^* Y$ must equal a constant $c$. Thus $X$ and $Y$ satisfy a linear relationship.
Exercise: Calculate $\rho$ for the sprinter example. Does your answer make sense? (You should already have found $\operatorname{Cov}(X_1, X_2)$ in a previous exercise, so little additional work is needed.)
Problems
9.4.1 The joint probability function of $(X, Y)$ is:

                    x
f(x, y)       0      1      2
y    0        0.06   0.15   0.09
     1        0.14   0.35   0.21

Calculate the correlation coefficient, $\rho$. What does it indicate about the relationship between $X$ and $Y$?
9.4.2 Suppose that $X$ and $Y$ are random variables with joint probability function:

                    x
f(x, y)       2      4      6
y   -1        1/8    1/4    p
     1        1/4    1/8    1/4 - p

(a) For what value of $p$ are $X$ and $Y$ uncorrelated?
(b) Show that there is no value of $p$ for which $X$ and $Y$ are independent.
9.5 Mean and Variance of a Linear Combination of Random Variables

Many problems require us to consider linear combinations of random variables; examples will be given
below and in Chapter 10. Although writing down the formulas is somewhat tedious, we give here some
important results about their means and variances.
Results for Means:
1. $E(aX + bY) = aE(X) + bE(Y) = a\mu_X + b\mu_Y$, when $a$ and $b$ are constants. (This follows from the definition of expected value.) In particular, $E(X + Y) = \mu_X + \mu_Y$ and $E(X - Y) = \mu_X - \mu_Y$.
2. Let $a_i$ be constants (real numbers) and $E(X_i) = \mu_i$, $i = 1, 2, \ldots, n$. Then $E\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n} a_i \mu_i$. In particular, $E\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} E(X_i)$.
3. Let $X_1, X_2, \ldots, X_n$ be random variables which have mean $\mu$. (You can imagine these being some sample results from an experiment such as recording the number of occupants in cars travelling over a toll bridge.) The sample mean is $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then $E(\bar{X}) = \mu$.
Proof of (3): From (2), $E\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} E(X_i) = \sum_{i=1}^{n} \mu = n\mu$. Thus
$$E\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n} E\left(\sum_{i=1}^{n} X_i\right) = \frac{1}{n}\, n\mu = \mu$$
Results for Covariance:

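1. $\operatorname{Cov}(X, X) = E[(X - \mu_X)(X - \mu_X)] = E\left[(X - \mu_X)^2\right] = \operatorname{Var}(X)$
2. $\operatorname{Cov}(aX + bY, cU + dV) = ac\operatorname{Cov}(X, U) + ad\operatorname{Cov}(X, V) + bc\operatorname{Cov}(Y, U) + bd\operatorname{Cov}(Y, V)$ where $a$, $b$, $c$, and $d$ are constants.
Proof:
$$\begin{aligned}
\operatorname{Cov}(aX + bY, cU + dV) &= E[(aX + bY - a\mu_X - b\mu_Y)(cU + dV - c\mu_U - d\mu_V)] \\
&= E\{[a(X - \mu_X) + b(Y - \mu_Y)][c(U - \mu_U) + d(V - \mu_V)]\} \\
&= acE[(X - \mu_X)(U - \mu_U)] + adE[(X - \mu_X)(V - \mu_V)] \\
&\quad + bcE[(Y - \mu_Y)(U - \mu_U)] + bdE[(Y - \mu_Y)(V - \mu_V)] \\
&= ac\operatorname{Cov}(X, U) + ad\operatorname{Cov}(X, V) + bc\operatorname{Cov}(Y, U) + bd\operatorname{Cov}(Y, V)
\end{aligned}$$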

This type of result can be generalized, but the results become messy to write out.
Results for Variance:
1. Variance of a linear combination:
$$\operatorname{Var}(aX + bY) = a^2\operatorname{Var}(X) + b^2\operatorname{Var}(Y) + 2ab\operatorname{Cov}(X, Y)$$
Proof:
$$\begin{aligned}
\operatorname{Var}(aX + bY) &= E\left[(aX + bY - a\mu_X - b\mu_Y)^2\right] \\
&= E\left\{[a(X - \mu_X) + b(Y - \mu_Y)]^2\right\} \\
&= E\left[a^2 (X - \mu_X)^2 + b^2 (Y - \mu_Y)^2 + 2ab(X - \mu_X)(Y - \mu_Y)\right] \\
&= a^2 E\left[(X - \mu_X)^2\right] + b^2 E\left[(Y - \mu_Y)^2\right] + 2abE[(X - \mu_X)(Y - \mu_Y)] \\
&= a^2 \sigma_X^2 + b^2 \sigma_Y^2 + 2ab\operatorname{Cov}(X, Y)
\end{aligned}$$
Exercise: Try to prove this result by writing $\operatorname{Var}(aX + bY)$ as $\operatorname{Cov}(aX + bY, aX + bY)$ and using properties of covariance.
2. Variance of a sum of independent random variables: Let $X$ and $Y$ be independent. Since $\operatorname{Cov}(X, Y) = 0$, result 1. gives
$$\operatorname{Var}(X + Y) = \sigma_X^2 + \sigma_Y^2$$
that is, for independent variables, the variance of a sum is the sum of the variances. Also note
$$\operatorname{Var}(X - Y) = \sigma_X^2 + (-1)^2 \sigma_Y^2 = \sigma_X^2 + \sigma_Y^2$$
that is, for independent variables, the variance of a difference is the sum of the variances.
3. Variance of a general linear combination of random variables: Let $a_i$ be constants and $\operatorname{Var}(X_i) = \sigma_i^2$. Then
$$\operatorname{Var}\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n} a_i^2 \sigma_i^2 + 2\sum_{i=1}^{n}\sum_{j=i+1}^{n} a_i a_j \operatorname{Cov}(X_i, X_j)$$
This is a generalization of result 1. and can be proved using either of the methods used for 1.
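4. Variance of a linear combination of independent random variables: Special cases of result 3. are:
a) If $X_1, X_2, \ldots, X_n$ are independent then $\operatorname{Cov}(X_i, X_j) = 0$, so that
$$\operatorname{Var}\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n} a_i^2 \sigma_i^2$$
b) If $X_1, X_2, \ldots, X_n$ are independent and all have the same variance $\sigma^2$, then
$$\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n}$$
Proof of 4 (b): $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. From 4(a), $\operatorname{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \operatorname{Var}(X_i) = n\sigma^2$. Using $\operatorname{Var}(aX + b) = a^2\operatorname{Var}(X)$, we get:
$$\operatorname{Var}(\bar{X}) = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\operatorname{Var}\left(\sum_{i=1}^{n} X_i\right) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$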
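Result 1 for variances (the variance of $aX + bY$) can be checked on a small case. The sketch below assumes a made-up joint probability function on four points and made-up constants $a = 3$, $b = -2$; none of these values come from the notes. It computes the variance once from the distribution of $aX + bY$ directly and once from the formula.

```python
from fractions import Fraction

# Hypothetical joint probability function f(x, y), chosen only for illustration.
joint = {
    (0, 0): Fraction(1, 10), (0, 1): Fraction(2, 10),
    (1, 0): Fraction(3, 10), (1, 1): Fraction(4, 10),
}
a, b = 3, -2  # hypothetical constants

def expect(g):
    """E[g(X, Y)] summed over the joint probability function."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

mu_x, mu_y = expect(lambda x, y: x), expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mu_x) ** 2)
var_y = expect(lambda x, y: (y - mu_y) ** 2)
cov_xy = expect(lambda x, y: (x - mu_x) * (y - mu_y))

# Direct calculation: treat W = aX + bY as a random variable in its own right.
mu_w = expect(lambda x, y: a * x + b * y)
var_direct = expect(lambda x, y: (a * x + b * y - mu_w) ** 2)

# Formula from result 1.
var_formula = a**2 * var_x + b**2 * var_y + 2 * a * b * cov_xy

print(var_direct, var_formula)
```

Because $a$ and $b$ have opposite signs and the covariance here is negative, the cross term $2ab\operatorname{Cov}(X, Y)$ adds to the variance rather than reducing it.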
Remark: This result is a very important one in probability and statistics. To recap, it says that if $X_1, \ldots, X_n$ are independent random variables with the same mean $\mu$ and same variance $\sigma^2$, then the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ has
$$E(\bar{X}) = \mu \quad \text{and} \quad \operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n}$$
This shows that the average $\bar{X}$ of $n$ random variables with the same distribution is less variable than any single observation $X_i$, and that the larger $n$ is the less variability there is. This explains mathematically why, for example, if we want to estimate the unknown mean height $\mu$ in a population of people, we are better to take the average height for a random sample of $n = 10$ persons than to just take the height of one randomly selected person. A sample of $n = 20$ persons would be better still. There is an applet at http://users.ece.gatech.edu/users/gtz/java/samplemean/notes.html which allows one to sample and explore the rate at which the sample mean approaches the expected value. In Section 9.7 we will see how to decide how large a sample we should take for a certain degree of precision. Also note that as $n \to \infty$, $\operatorname{Var}(\bar{X}) \to 0$, which means that $\bar{X}$ becomes arbitrarily close to $\mu$. This is sometimes called the “law of averages$^{38}$”. There is a formal theorem which supports the claim that for large sample sizes, sample means approach the expected value, called the “law of large numbers”.
38
“I feel like a fugitive from the law of averages.”
William H. Mauldin (1921 - 2003)
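The shrinking variance of the sample mean can also be seen by simulation. The sketch below is an illustration only: it assumes the $X_i$ are rolls of a fair six-sided die (so $\sigma^2 = 35/12$), and the function name, the number of repetitions, and the seed are arbitrary choices.

```python
import random
import statistics

def variance_of_sample_mean(n, reps=20000, seed=230):
    """Estimate Var(X-bar) from `reps` simulated samples of n fair-die rolls."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    means = [
        statistics.fmean([rng.randint(1, 6) for _ in range(n)])
        for _ in range(reps)
    ]
    return statistics.pvariance(means)

SIGMA2 = 35 / 12  # variance of a single fair-die roll

results = {n: variance_of_sample_mean(n) for n in (1, 10, 20)}
for n, simulated in results.items():
    print(f"n = {n:2d}: simulated {simulated:.4f}, sigma^2 / n = {SIGMA2 / n:.4f}")
```

The simulated values should sit close to $\sigma^2 / n$, with the agreement limited only by the number of repetitions.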
Problems
9.5.1 The joint probability function of $(X, Y)$ is given by:

                    x
f(x, y)       0      1      2
y    0        0.15   0.1    0.05
     1        0.35   0.2    0.15

Calculate $E(X)$, $\operatorname{Var}(X)$, $\operatorname{Cov}(X, Y)$ and $\operatorname{Var}(3X - 2Y)$.
You may use the fact that $E(Y) = 0.7$ and $\operatorname{Var}(Y) = 0.21$ without verifying these figures.
9.5.2 Suppose V ar(X) = 1:69, V ar(Y ) = 4, = 0:5. Find the standard deviation of U = 2X Y.
9.5.3 Let $Y_0, Y_1, \ldots, Y_n$ be uncorrelated random variables with $E(Y_i) = 0$ and $\operatorname{Var}(Y_i) = \sigma^2$, $i = 0, 1, \ldots, n$. Let $X_1 = Y_0 + Y_1$, $X_2 = Y_1 + Y_2$, $\ldots$, $X_n = Y_{n-1} + Y_n$.
Find $\operatorname{Cov}(X_{i-1}, X_i)$ for $i = 2, 3, \ldots, n$ and $\operatorname{Var}\left(\sum_{i=1}^{n} X_i\right)$.
9.6 Linear Combinations of Independent Normal Random Variables

For continuous multivariate distributions we focus on linear combinations of Normal random variables
which have many important applications. The following theorem gives us the results that we need for
these applications.
Theorem 38 Linear Combinations of Independent Normal Random Variables
(1) Let $X \sim N(\mu, \sigma^2)$ and $Y = aX + b$, where $a$ and $b$ are constant real numbers. Then $Y \sim N(a\mu + b, a^2\sigma^2)$.
(2) Let $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$ independently, and let $a$ and $b$ be constants. Then $aX + bY \sim N(a\mu_1 + b\mu_2, a^2\sigma_1^2 + b^2\sigma_2^2)$. In general if $X_i \sim N(\mu_i, \sigma_i^2)$, $i = 1, 2, \ldots, n$ independently and $a_1, a_2, \ldots, a_n$ are constants, then $\sum_{i=1}^{n} a_i X_i \sim N\left(\sum_{i=1}^{n} a_i \mu_i, \sum_{i=1}^{n} a_i^2 \sigma_i^2\right)$.
(3) Let $X_1, X_2, \ldots, X_n$ be independent $N(\mu, \sigma^2)$ random variables. Then $\sum_{i=1}^{n} X_i \sim N(n\mu, n\sigma^2)$ and $\bar{X} \sim N(\mu, \sigma^2/n)$.
Result (1) follows easily from the change of variable method discussed in Section 8.1. Result (2) is proved in Section 10.2 using moment generating functions. Result (3) is a special case of
(2). Note that the means and variances of these linear combinations of random variables can be obtained using the results of Section 9.5.
Example: Suppose $X \sim N(3, 5)$ and $Y \sim N(6, 14)$ independently. Find $P(X > Y)$.
Solution: Whenever we have variables on both sides of the inequality we should collect them on one side, leaving us with a linear combination. For example $P(X > Y) = P(X - Y > 0)$. Since $X - Y \sim N(3 - 6, 5 + 14) = N(-3, 19)$,
$$\begin{aligned}
P(X - Y > 0) &= P\left(Z > \frac{0 - (-3)}{\sqrt{19}}\right) \quad \text{where } Z \sim N(0, 1) \\
&= P(Z > 0.69) \\
&= 1 - P(Z \le 0.69) \\
&= 1 - 0.75490 \\
&= 0.2451
\end{aligned}$$
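The same probability can be computed without tables. This is a sketch using Python's standard-library `statistics.NormalDist`; note that `NormalDist` takes a standard deviation, whereas the notation $N(\mu, \sigma^2)$ in these notes gives a variance, so a square root is needed.

```python
from math import sqrt
from statistics import NormalDist

# X ~ N(3, 5) and Y ~ N(6, 14) independently; the second parameter is a variance.
mean_diff = 3 - 6   # E(X - Y)
var_diff = 5 + 14   # Var(X - Y) = Var(X) + (-1)^2 Var(Y)

# NormalDist is parameterized by the standard deviation, not the variance.
diff = NormalDist(mu=mean_diff, sigma=sqrt(var_diff))
p_x_greater_y = 1 - diff.cdf(0)

print(round(p_x_greater_y, 4))
```

The result differs from 0.2451 in the third or fourth decimal place because the table lookup rounds $z$ to 0.69 first.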
Example: Three cylindrical parts are joined end to end to make up a shaft in a machine; 2 type A parts and 1 type B. The lengths of the parts vary a little, and have the distributions: $A \sim N(6, 0.4)$ and $B \sim N(35.2, 0.6)$. The overall length of the assembled shaft must lie between 46.8 and 47.5 or else the shaft has to be scrapped. Assume the lengths of different parts are independent. What percent of assembled shafts have to be scrapped?
Exercise: Why would it be wrong to represent the length of the shaft as $2A + B$? How would this length differ from the solution given below?
Solution: Let $L$, the length of the shaft, be $L = A_1 + A_2 + B$. Then
$$L \sim N(6 + 6 + 35.2, 0.4 + 0.4 + 0.6) = N(47.2, 1.4)$$
and so
$$\begin{aligned}
P(46.8 < L < 47.5) &= P\left(\frac{46.8 - 47.2}{\sqrt{1.4}} < \frac{L - 47.2}{\sqrt{1.4}} < \frac{47.5 - 47.2}{\sqrt{1.4}}\right) \\
&= P(-0.34 < Z < 0.25) \quad \text{where } Z \sim N(0, 1) \\
&= P(Z < 0.25) - [1 - P(Z < 0.34)] \\
&= 0.59871 + 0.63307 - 1 \\
&= 0.23178
\end{aligned}$$
that is, 23.18% are acceptable and 76.82% must be scrapped. Obviously we have to find a way to reduce the variability in the lengths of the parts. This is a common problem in manufacturing.
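The shaft calculation can be reproduced as follows. This is a sketch with illustrative variable names, again converting the variance in $N(\mu, \sigma^2)$ to the standard deviation that `statistics.NormalDist` expects.

```python
from math import sqrt
from statistics import NormalDist

# L = A1 + A2 + B with independent parts: means add and variances add.
mean_length = 6 + 6 + 35.2
var_length = 0.4 + 0.4 + 0.6

shaft = NormalDist(mu=mean_length, sigma=sqrt(var_length))

# Probability the assembled length falls inside the acceptable range.
p_acceptable = shaft.cdf(47.5) - shaft.cdf(46.8)
p_scrapped = 1 - p_acceptable

print(f"acceptable: {p_acceptable:.4f}, scrapped: {p_scrapped:.4f}")
```

The small difference from 0.23178 comes from the table-based solution rounding the standardized limits to two decimals.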
Exercise: How could we reduce the percent of shafts being scrapped? (What if we reduced the
variance of A and B parts each by 50%?)
Example: The heights of adult females in a large population are well represented by a Normal distribution with mean 64 inches and variance 6.2 (inches)$^2$.
(a) Find the proportion of females whose height is between 63 and 65 inches.
(b) Suppose 10 women are randomly selected, and let $\bar{X}$ be their average height, that is, $\bar{X} = \frac{1}{10}\sum_{i=1}^{10} X_i$, where $X_1, X_2, \ldots, X_{10}$ are the heights of the 10 women. Find $P(63 \le \bar{X} \le 65)$.
(c) Suppose $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is the average height of $n$ women selected at random. Find the smallest value of $n$ such that $P(|\bar{X} - 64| \le 1) \ge 0.95$.
Solution:
(a) Let $X \sim N(64, 6.2)$ be the height of a randomly chosen female. Then
$$\begin{aligned}
P(63 \le X \le 65) &= P\left(\frac{63 - 64}{\sqrt{6.2}} \le \frac{X - 64}{\sqrt{6.2}} \le \frac{65 - 64}{\sqrt{6.2}}\right) \\
&= P(-0.40 \le Z \le 0.40) \quad \text{where } Z \sim N(0, 1) \\
&= 2P(Z \le 0.40) - 1 \\
&= 2(0.65542) - 1 \\
&= 0.31084
\end{aligned}$$
and therefore 31% of females have a height between 63 and 65 inches.
(b) $\bar{X} \sim N\left(64, \frac{6.2}{10}\right)$ so
$$\begin{aligned}
P(63 \le \bar{X} \le 65) &= P\left(\frac{63 - 64}{\sqrt{0.62}} \le \frac{\bar{X} - 64}{\sqrt{0.62}} \le \frac{65 - 64}{\sqrt{0.62}}\right) \\
&= P(-1.27 \le Z \le 1.27) \quad \text{where } Z \sim N(0, 1) \\
&= 2P(Z \le 1.27) - 1 \\
&= 2(0.89796) - 1 = 0.79592
\end{aligned}$$
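(c) Since $\bar{X} \sim N\left(64, \frac{6.2}{n}\right)$ we want
$$\begin{aligned}
P(|\bar{X} - 64| \le 1) &= P\left(\frac{|\bar{X} - 64|}{\sqrt{6.2/n}} \le \frac{1}{\sqrt{6.2/n}}\right) \\
&= P\left(|Z| \le \sqrt{\frac{n}{6.2}}\right) \quad \text{where } Z \sim N(0, 1) \\
&\ge 0.95
\end{aligned}$$
But $P(|Z| \le 1.96) = 0.95$, so we need $\sqrt{\frac{n}{6.2}} \ge 1.96$ or $n \ge (1.96)^2 (6.2) = 23.82$. Since $n$ must be an integer, the smallest value of $n$ is 24.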
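The sample-size argument in part (c) can be written as a small function. This is a sketch under the same Normal assumptions; the function names `smallest_n` and `coverage` are hypothetical, and the quantile comes from `statistics.NormalDist().inv_cdf` rather than a table.

```python
from math import ceil, sqrt
from statistics import NormalDist

def smallest_n(variance, half_width, prob):
    """Smallest n with P(|X-bar - mu| <= half_width) >= prob for Normal data."""
    z = NormalDist().inv_cdf((1 + prob) / 2)  # e.g. about 1.96 when prob = 0.95
    return ceil(z**2 * variance / half_width**2)

def coverage(n, variance, half_width):
    """P(|X-bar - mu| <= half_width) when X-bar ~ N(mu, variance / n)."""
    z = half_width / sqrt(variance / n)
    return 2 * NormalDist().cdf(z) - 1

n = smallest_n(variance=6.2, half_width=1.0, prob=0.95)
print(n, round(coverage(n, 6.2, 1.0), 4), round(coverage(n - 1, 6.2, 1.0), 4))
```

Evaluating `coverage` at $n$ and $n - 1$ confirms that the requirement is met at the returned value and just missed one step below it.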
Remark: This shows that if we were to select a random sample of $n = 24$ persons, then with probability at least 0.95 their average height $\bar{X}$ would be within 1 inch of the average height $\mu = 64$ of the whole population of women. So if we did not know $\mu$ then we could estimate it to within 1 inch (with probability 0.95) by taking a sample of only $n = 24$ persons, which is not very large.
Exercise: Find the smallest value of $n$ such that $P(|\bar{X} - 64| \le 0.5) \ge 0.95$.
These ideas form the basis of statistical sampling and estimation of unknown parameter values in populations and processes (STAT 231). If $X \sim N(\mu, \sigma^2)$ and we know roughly what $\sigma$ is, but don’t know $\mu$, then we can use the fact that $\bar{X} \sim N(\mu, \sigma^2/n)$ to find the probability that the mean $\bar{X}$ from a sample of size $n$ will be within a given distance of the unknown mean $\mu$.
Problems
9.6.1 Let $X \sim N(10, 4)$ and $Y \sim N(3, 100)$ be independent. Find:
(a) $P(8.4 < X < 12.2)$
(b) $P(2Y > X)$
(c) $P(\bar{Y} < 0)$ where $\bar{Y}$ is the sample mean of 25 independent observations on $Y$.
9.6.2 Suppose $X \sim N(5, 4)$ and independently $Y \sim G(7, 3)$. Find:
(a) The probability $2X$ differs from $Y$ by more than 4.
(b) Suppose $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ where $X_i \sim N(5, 4)$, $i = 1, 2, \ldots, n$ independently. Find the smallest value of $n$ such that $P(|\bar{X} - 5| < 0.1) \ge 0.98$.
9.7 Indicator Random Variables

The results for linear combinations of random variables provide a way of breaking up more complicated
problems, involving mean and variance, into simpler pieces using indicator variables; an indicator
variable is just a binary variable (0 or 1) that indicates whether or not some event occurs. We’ll illustrate
this important method with 3 examples.
Example: Mean and Variance of a Binomial Random Variable

Let $X \sim$ Binomial$(n, p)$. Define new random variables $X_i$ by
$X_i = 0$ if the $i$’th trial was a failure
$X_i = 1$ if the $i$’th trial was a success.
The random variable $X_i$ indicates whether the outcome “success” occurred on the $i$’th trial. The trick we use is that the total number of successes, $X$, is the sum of the $X_i$’s:
$$X = \sum_{i=1}^{n} X_i.$$
We can find the mean and variance of $X_i$ and then use our results for the mean and variance of a sum to get the mean and variance of $X$. First,
$$E(X_i) = \sum_{x_i=0}^{1} x_i f(x_i) = 0 f(0) + 1 f(1) = f(1)$$
But $f(1) = p$ since the probability of success is $p$ on each trial. Therefore $E(X_i) = p$. Since $X_i = 0$ or 1, $X_i = X_i^2$, and therefore
$$E(X_i^2) = E(X_i) = p.$$
Thus
$$\operatorname{Var}(X_i) = E(X_i^2) - [E(X_i)]^2 = p - p^2 = p(1 - p).$$
In the Binomial distribution the trials are independent so the $X_i$’s are also independent. Thus
$$E(X) = E\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} E(X_i) = \sum_{i=1}^{n} p = np$$
$$\operatorname{Var}(X) = \operatorname{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \operatorname{Var}(X_i) = \sum_{i=1}^{n} p(1 - p) = np(1 - p)$$
These, of course, are the same as we derived previously for the mean and variance of the Binomial
distribution. Note how simple the derivation here is!
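As a check on the indicator-variable result, the mean and variance can also be computed directly from the Binomial probability function. The sketch below uses exact fractions; the values $n = 10$ and $p = 3/10$ are arbitrary choices made for illustration.

```python
from fractions import Fraction
from math import comb

def binomial_mean_and_variance(n, p):
    """Mean and variance computed directly from the Binomial(n, p) probability function."""
    pf = {x: comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)}
    mean = sum(x * fx for x, fx in pf.items())
    second_moment = sum(x * x * fx for x, fx in pf.items())
    return mean, second_moment - mean**2

n, p = 10, Fraction(3, 10)  # arbitrary illustrative values
mean, variance = binomial_mean_and_variance(n, p)

# Compare with np and np(1 - p) from the indicator-variable argument.
print(mean, n * p, variance, n * p * (1 - p))
```

The direct sums agree exactly with $np$ and $np(1 - p)$, but take noticeably more work than adding up $n$ copies of $p$ and $p(1 - p)$.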
Remark: If $X_i$ is a binary random variable with $P(X_i = 1) = p = 1 - P(X_i = 0)$ then $E(X_i) = p$ and $\operatorname{Var}(X_i) = p(1 - p)$, as shown above. (Note that $X_i \sim$ Binomial$(1, p)$ is actually a Binomial random variable.) In some problems the $X_i$’s are not independent, and then we also need covariances.
Example: Let X have a Hypergeometric distribution. Find the mean and variance of X.
Solution: As above, let us think of the setting, which involves drawing $n$ items at random from a total of $N$, of which $r$ are “S” and $N - r$ are “F” items. Define
$$X_i = \begin{cases} 0 & \text{if } i\text{’th draw is a failure (F) item} \\ 1 & \text{if } i\text{’th draw is a success (S) item.} \end{cases}$$
Then $X = \sum_{i=1}^{n} X_i$ as for the Binomial example, but now the $X_i$’s are dependent. (For example, what we get on the first draw affects the probabilities of S and F for the second draw, and so on.) Therefore we need to find $\operatorname{Cov}(X_i, X_j)$ for $i \neq j$ as well as $E(X_i)$ and $\operatorname{Var}(X_i)$ in order to use our formula for the variance of a sum.
We see first that $P(X_i = 1) = r/N$ for each of $i = 1, 2, \ldots, n$. (If the draws are random then the probability an S occurs in draw $i$ is just equal to the probability position $i$ is an S when we arrange $r$ S’s and $N - r$ F’s in a row.) This immediately gives
$$E(X_i) = \frac{r}{N}$$
$$\operatorname{Var}(X_i) = \frac{r}{N}\left(1 - \frac{r}{N}\right)$$
since
$$\operatorname{Var}(X_i) = E(X_i^2) - [E(X_i)]^2 = E(X_i) - [E(X_i)]^2$$
The covariance of $X_i$ and $X_j$ ($i \neq j$) is equal to $E(X_i X_j) - E(X_i)E(X_j)$, so we need
$$\begin{aligned}
E(X_i X_j) &= \sum_{x_i=0}^{1}\sum_{x_j=0}^{1} x_i x_j f(x_i, x_j) \\
&= f(1, 1) \\
&= P(X_i = 1, X_j = 1)
\end{aligned}$$
The probability of an S on both draws $i$ and $j$ is just
$$\frac{r}{N} \cdot \frac{r - 1}{N - 1} = P(X_i = 1)P(X_j = 1 \mid X_i = 1)$$
Thus,
$$\begin{aligned}
\operatorname{Cov}(X_i, X_j) &= E(X_i X_j) - E(X_i)E(X_j) \\
&= \frac{r(r - 1)}{N(N - 1)} - \frac{r}{N} \cdot \frac{r}{N} = \frac{r}{N}\left(\frac{r - 1}{N - 1} - \frac{r}{N}\right) \\
&= -\frac{r(N - r)}{N^2 (N - 1)}
\end{aligned}$$
(Does it make sense that $\operatorname{Cov}(X_i, X_j)$ is negative? If you draw a success in draw $i$, are you more or less likely to have a success on draw $j$?)
Now we find $E(X)$ and $\operatorname{Var}(X)$. First,
$$E(X) = E\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} E(X_i) = \sum_{i=1}^{n} \frac{r}{N} = n\frac{r}{N}$$
Before finding $\operatorname{Var}(X)$, how many pairs $(X_i, X_j)$ are there for which $i < j$? Two distinct indices can be chosen from $1, 2, \ldots, n$ in $\binom{n}{2}$ ways, and each such choice can only be written in one way to make $i < j$. So there are $\binom{n}{2}$ pairs with $i < j$ (e.g. if $i = 1, 2, 3$ and $j = 1, 2, 3$, the pairs with $i < j$ are $(1, 2)$, $(1, 3)$ and $(2, 3)$, so there are $\binom{3}{2} = 3$ different pairs). Therefore
$$\begin{aligned}
\operatorname{Var}(X) &= \operatorname{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \operatorname{Var}(X_i) + 2\sum_{i<j} \operatorname{Cov}(X_i, X_j) \\
&= n\frac{r(N - r)}{N^2} + 2\binom{n}{2}\left[-\frac{r(N - r)}{N^2 (N - 1)}\right] \\
&= n\frac{r}{N}\left(\frac{N - r}{N}\right)\left[1 - \frac{n - 1}{N - 1}\right] \quad \text{since } 2\binom{n}{2} = \frac{2n(n - 1)}{2} = n(n - 1) \\
&= n\frac{r}{N}\left(1 - \frac{r}{N}\right)\left(\frac{N - n}{N - 1}\right)
\end{aligned}$$
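The Hypergeometric mean and variance derived above can be compared with a direct calculation from the probability function. This sketch uses exact fractions; the values $N = 20$, $r = 8$, $n = 5$ are arbitrary and chosen only for illustration.

```python
from fractions import Fraction
from math import comb

def hypergeometric_mean_and_variance(N, r, n):
    """Mean and variance computed directly from the Hypergeometric probability function."""
    total = comb(N, n)
    pf = {x: Fraction(comb(r, x) * comb(N - r, n - x), total) for x in range(n + 1)}
    mean = sum(x * fx for x, fx in pf.items())
    second_moment = sum(x * x * fx for x, fx in pf.items())
    return mean, second_moment - mean**2

N, r, n = 20, 8, 5  # arbitrary illustrative values
mean, variance = hypergeometric_mean_and_variance(N, r, n)

# Formulas obtained with indicator variables.
mean_formula = n * Fraction(r, N)
variance_formula = n * Fraction(r, N) * (1 - Fraction(r, N)) * Fraction(N - n, N - 1)

print(mean, mean_formula, variance, variance_formula)
```

The factor $(N - n)/(N - 1)$ is what separates this variance from the Binomial value $np(1 - p)$ with $p = r/N$; it reflects the negative covariance between draws.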
In the last two examples, we know $f(x)$, and could have found $E(X)$ and $\operatorname{Var}(X)$ without using indicator variables. In the next example $f(x)$ is not known and is difficult to find, but we can still use indicator variables for obtaining $\mu$ and $\sigma^2$. The following example is a famous problem in probability.
Example: We have $N$ letters to $N$ different people, and $N$ envelopes addressed to those $N$ people. One letter is put in each envelope at random. Find the mean and variance of the number of letters placed in the right envelope.
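Solution: Let
$$X_i = \begin{cases} 0 & \text{if letter } i \text{ is not in envelope } i \\ 1 & \text{if letter } i \text{ is in envelope } i. \end{cases}$$
Then $\sum_{i=1}^{N} X_i$ is the number of correctly placed letters. Once again, the $X_i$’s are dependent (Why?).
First
$$E(X_i) = \sum_{x_i=0}^{1} x_i f(x_i) = f(1) = \frac{1}{N} = E(X_i^2)$$
since there is 1 chance in $N$ that letter $i$ will be put in envelope $i$ and then,
$$\operatorname{Var}(X_i) = E(X_i) - [E(X_i)]^2 = \frac{1}{N} - \frac{1}{N^2} = \frac{1}{N}\left(1 - \frac{1}{N}\right).$$
Exercise: Before calculating $\operatorname{Cov}(X_i, X_j)$, what sign do you expect it to have? If letter $i$ is correctly placed does that make it more or less likely that letter $j$ will be placed correctly?
Next, $E(X_i X_j) = f(1, 1)$. (As in the last example, this is the only non-zero term in the sum.) Now, $f(1, 1) = \frac{1}{N} \cdot \frac{1}{N - 1}$ since once letter $i$ is correctly placed there is 1 chance in $N - 1$ of letter $j$ going in envelope $j$. Therefore
$$E(X_i X_j) = \frac{1}{N(N - 1)}.$$
For the covariance we have
$$\begin{aligned}
\operatorname{Cov}(X_i, X_j) &= E(X_i X_j) - E(X_i)E(X_j) \\
&= \frac{1}{N(N - 1)} - \frac{1}{N} \cdot \frac{1}{N} = \frac{1}{N}\left(\frac{1}{N - 1} - \frac{1}{N}\right) \\
&= \frac{1}{N^2 (N - 1)}
\end{aligned}$$
Therefore
$$\begin{aligned}
E\left(\sum_{i=1}^{N} X_i\right) &= \sum_{i=1}^{N} E(X_i) \\
&= \sum_{i=1}^{N} \frac{1}{N} = \frac{1}{N}\, N \\
&= 1
\end{aligned}$$
and
$$\begin{aligned}
\operatorname{Var}\left(\sum_{i=1}^{N} X_i\right) &= \sum_{i=1}^{N} \operatorname{Var}(X_i) + 2\sum_{i<j} \operatorname{Cov}(X_i, X_j) \\
&= \sum_{i=1}^{N} \frac{1}{N}\left(1 - \frac{1}{N}\right) + 2\binom{N}{2}\frac{1}{N^2 (N - 1)} \\
&= N\,\frac{1}{N}\left(1 - \frac{1}{N}\right) + 2\binom{N}{2}\frac{1}{N^2 (N - 1)} \\
&= 1 - \frac{1}{N} + 2\,\frac{N(N - 1)}{2}\,\frac{1}{N^2 (N - 1)} \\
&= 1
\end{aligned}$$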
Common sense often helps in this course, but for this example there is no way of being able to say this
result is obvious. On average one letter will be correctly placed and the variance will be one, regardless
of how many letters there are.
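Since the result is not intuitive, it may help to verify it by brute force for small $N$. The sketch below enumerates every way of placing the letters (every permutation) and computes the mean and variance of the number of matches exactly; the function name is hypothetical, and the check starts at $N = 2$ because the covariance argument needs at least two letters.

```python
from fractions import Fraction
from itertools import permutations

def matching_mean_and_variance(n_letters):
    """Exact mean and variance of the number of letters in the right envelope."""
    # perm[i] is the envelope that receives letter i; all placements are equally likely.
    counts = [
        sum(1 for letter, envelope in enumerate(perm) if letter == envelope)
        for perm in permutations(range(n_letters))
    ]
    total = len(counts)
    mean = Fraction(sum(counts), total)
    variance = Fraction(sum(c * c for c in counts), total) - mean**2
    return mean, variance

results = {n: matching_mean_and_variance(n) for n in range(2, 8)}
for n, (mean, variance) in results.items():
    print(f"N = {n}: mean = {mean}, variance = {variance}")
```

Enumeration grows like $N!$, so this is only practical for small $N$, which is exactly why the indicator-variable argument is valuable.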
Problems
9.7.1 In a row of 25 switches, each is considered to be “on” or “off”. The probability of being on is 0.6 for each switch, independently of the other switches. Find the mean and variance of the number of unlike pairs among the 24 pairs of adjacent switches.
9.7.2 A plastic fabricating company produces items in strips of 24, with the items connected by a thin piece of plastic:
Item 1 - Item 2 - Item 3 - ... - Item 23 - Item 24
A cutting machine then cuts the connecting pieces to separate the items, with the 23 cuts made independently. There is a 10% chance the machine will fail to cut a connecting piece. Find the mean and standard deviation of the number of the 24 items which are completely separate after the cuts have been made. (Hint: Let $X_i = 0$ if item $i$ is not completely separate, and $X_i = 1$ if item $i$ is completely separate.)

1. The joint probability function of $(X, Y)$ is given by:

                    x
f(x, y)       0      1      2
y    0        0.15   0.1    0.05
     1        0.35   0.2    0.15

(a) Find the marginal probability function of X and the marginal probability function of Y .
(b) Are X and Y independent random variables? Why?
(c) Find P (X > Y ).
(d) Find the conditional probability function of X given Y = 0.
(e) Find the probability function of T = X + Y .
2. Consider Chapter 2, Problem 7, which concerned machine recognition of handwritten digits.

Recall that p(x; y) was the probability that the number actually written was x, and the number
identified by the machine was y.
(a) Are X and Y independent random variables? Why?

(b) Find P (X = Y ) = probability that a random number is correctly identified.
(c) If the number written is a 5, what is the probability that it is incorrectly identified?
3. In a quality control inspection, items are classified as having a minor defect, a major defect, or
as being acceptable. A carton of 10 items contains 2 with a minor defect, 1 with a major defect,
and 7 acceptable. Three items are chosen at random without replacement. Let X be the number
selected with a minor defect and let Y be the number selected with a major defect.
(a) Find the joint probability function of X and Y .

(b) Find the marginal probability function of X and the marginal probability function of Y .
(c) Find P (X = Y ).
(d) Find $P(X = 1 \mid Y = 0)$.
4. A box contains 5 yellow and 3 red balls, from which 4 balls are drawn one at a time, at random,
without replacement. Let X be the number of yellow balls on the first two draws and Y the
number of yellow balls on all 4 draws.
(a) Find the joint probability function of X and Y .

(b) Find the marginal probability function of X and the marginal probability function of Y .
(c) Are X and Y independent random variables? Why?
5. For a person whose car insurance and house insurance are with the same company, let $X$ and $Y$ represent the number of claims on the car and house policies, respectively, in a given year. Suppose that for a certain group of individuals, $X \sim$ Poisson$(0.1)$ and $Y \sim$ Poisson$(0.05)$.
(a) If $X$ and $Y$ are independent random variables, find $P(X + Y > 1)$.
(b) If $X$ and $Y$ are independent random variables, find the mean and variance of $X + Y$.
(c) Suppose it was learned that $P(X = 0, Y = 0)$ was very close to 0.94. Show why $X$ and $Y$ cannot be independent random variables in this case. What might explain the non-independence?
6. Let $X$ and $Y$ be discrete random variables with joint probability function
$$f(x, y) = \frac{2^{x+y} e^{-4}}{x!\,y!} \quad \text{for } x = 0, 1, 2, \ldots \text{ and } y = 0, 1, 2, \ldots$$
(a) Find the marginal probability function of $X$ and the marginal probability function of $Y$ without evaluating any sums.
(b) Find the probability function of the random variable $T = X + Y$.
7. In an auto parts company an average of defective parts are produced per shift. The number,
X, of defective parts produced has a Poisson distribution. An inspector checks all parts prior to
shipping them, but there is a 10% chance that a defective part will slip by undetected. Let Y be the
number of defective parts the inspector finds on a shift. Find the conditional probability function
of X given Y = y. (The company wants to know how many defective parts are produced, but
can only know the number which were actually detected.)
8. In a breeding experiment involving horses the offspring are of four genetic types with probabilities:

Type          1      2      3      4      Total
Probability   3/16   5/16   5/16   3/16   1

A group of 40 independent offspring are observed.
(a) Find the probability that there are 10 offspring of each type.
(b) Find the probability that the total number of types 1 and 2 is 16.
(c) Find the probability that there are exactly 10 offspring of type 1, given that the total number
of types 1 and 2 is 16.
9. Bacteria are distributed through river water according to a Poisson process with an average of 5
per 100 cc of water. Five 50 cc samples of water are collected. Find the probability that exactly
one sample has no bacteria and exactly two samples have one bacterium.
10. A certain type of light bulb has lifetimes that can be modelled by an Exponential distribution
with mean 1000 hours.
(a) What proportion of light bulbs last less than 500 hours? between 500 and 1000 hours?
between 1000 and 1500 hours? longer than 1500 hours?
(b) For a carton of 50 light bulbs, find the probability that 15 light bulbs last less than 500 hours,
15 light bulbs last between 500 and 1000 hours, and 10 light bulbs last between 1000 and
1500 hours.
(c) For a carton of 50 light bulbs find the probability that 10 or more light bulbs last longer
than 1500 hours.
11. For Chapter 8, Problem 15 suppose you have a class of 50 students.
(a) What is the probability that exactly 5 students receive A’s, 15 students receive B’s, 10
students receive C’s and 15 students receive D’s.
(b) What is the probability that at least 45 students receive marks above an F ?
(c) What is the joint probability function of the number of students who receive A’s and the
number of students who receive B’s?
12. In a particular city, the probability a call to a fire department concerns various situations is as
given below:
Type                                                 Probability
1. fire in a detached home                           $p_1 = 0.10$
2. fire in a semi detached home                      $p_2 = 0.05$
3. fire in an apartment or multiple unit residence   $p_3 = 0.05$
4. fire in a non-residential building                $p_4 = 0.15$
5. non-fire-related emergency                        $p_5 = 0.15$
6. false alarm                                       $p_6 = 0.50$
Let $X_i$ represent the number of calls of type $i$, $i = 1, 2, \ldots, 6$ in a set of 10 calls.
(a) Give the joint probability function for $X_1, X_2, \ldots, X_6$.
(b) What is the probability there is at least one apartment fire, given that there are 4 fire-related calls?
(c) If the average costs of calls of types $1, 2, \ldots, 6$ are (in \$100 units) 5, 5, 7, 20, 4, 2 respectively, what is the expected total cost of the 10 calls?
13. Blood donors arrive at a clinic and are classified as type A, type O, or other types. The blood types of donors are independent with $P(\text{type A}) = p$, $P(\text{type O}) = q$, and $P(\text{other type}) = 1 - p - q$. Let $X$ = number of type A donors and $Y$ = number of type O donors arriving before the tenth other type.
(a) Find the joint probability function, f (x; y).

(b) Find the marginal probability function of X.
(c) Find the conditional probability function of Y given X = x.
14. Accidents occur on Wednesday’s at a particular intersection at random at the average rate of
accidents per Wednesday according to a Poisson process. Define the random variable
Xi = number of accidents on Wednesday at this intersection in week i, i = 1; 2; : : : ; n.
(a) Suppose n = 6 and the number of accidents observed on 6 consecutive Wednesday’s was
0, 2, 0, 1, 3, 1. What is the probability of observing these data if = 1? (Remember the
Poisson process assumption that the number of events in non-overlapping time intervals are
independent.)
(b) Suppose is unknown. What is the probability of observing these data as a function of ?
15. For the joint probability function in Problem 1, find the correlation coefficient of X and Y .
16. If X and Y are random variables with V ar(X) = 13, V ar(Y ) = 34 and = 0:7 then find
V ar (X 2Y ).
17. Let $X$ and $Y$ be independent random variables with $E(X) = E(Y) = 0$, $\operatorname{Var}(X) = 1$ and $\operatorname{Var}(Y) = 2$. Find $\operatorname{Cov}(X + Y, X - Y)$.
18. Jane and Jack each toss a fair coin twice. Let $X$ be the number of heads Jane obtains and $Y$ the number of heads Jack obtains. Define $U = X + Y$ and $V = X - Y$.
(a) Find $E(U)$ and $E(V)$.
(b) Find $\operatorname{Var}(U)$ and $\operatorname{Var}(V)$.
(c) Find $\operatorname{Cov}(U, V)$. Are $U$ and $V$ independent random variables? Why?
19. Let $X$ and $Y$ be random variables with joint probability function
$$f(x, y) = \frac{n!}{x!\,y!\,(n - x - y)!}\, p^x q^y (1 - p - q)^{n - x - y} \quad \text{for } x = 0, 1, \ldots, n;\ y = 0, 1, \ldots, n \text{ and } x + y \le n$$
(a) What is the distribution of $T = X + Y$? Either explain why or derive this result.
(b) Find $E(T)$ and $\operatorname{Var}(T)$.
(c) Using (b) find $\operatorname{Cov}(X, Y)$, and explain why you expect it to have the sign it does.
20. In a particular city, let the random variable $X$ represent the number of children in a randomly selected household, and let $Y$ represent the number of female children. Assume that the probability a child is female is 0.5, regardless of the size of the household they live in, and that the marginal distribution of $X$ is as follows:

x          0      1      2      3      4      5      6      7      8      Total
P(X = x)   0.20   0.25   0.35   0.10   0.05   0.02   0.01   0.01   0.01   1

(a) Find E(X).

(b) Find the marginal probability function for the random variable Y = number of girls in a
randomly chosen family. Find E(Y ).
21. Suppose $X$ and $Y$ are discrete random variables with joint probability function $f(x, y)$. If $g(x, y)$ is a function such that $a \le g(x, y) \le b$ for all $(x, y)$ in the range of $(X, Y)$ then show that $a \le E[g(X, Y)] \le b$.
22. Let $X_i$ = the return on stock $i$, $i = 1, 2, 3$. Suppose $E(X_i) = 0.08$, $i = 1, 2, 3$ and $\operatorname{Var}(X_1) = (0.2)^2$, $\operatorname{Var}(X_2) = (0.3)^2$, $\operatorname{Var}(X_3) = (0.4)^2$. Assuming $X_1, X_2, X_3$ are independent random variables, find portfolio weights $w_1, w_2, w_3$ so that the linear combination $w_1 X_1 + w_2 X_2 + w_3 X_3$ has the smallest variance among all such linear combinations subject to the constraint $w_1 + w_2 + w_3 = 1$.
23. Suppose $X_i \sim Poisson(\mu)$, $i = 1, 2, \ldots, n$ independently. Let
\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
\]
Find $E(\bar{X})$ and $Var(\bar{X})$. What happens to $Var(\bar{X})$ as $n \to \infty$?
24. Suppose Xi v Geometric ( ), i = 1; 2; : : : ; n independently. Find E(X) and V ar(X). What

happens to V ar(X) as n ! 1?
25. Suppose $X_i \sim Exponential(\theta)$, $i = 1, 2, \ldots, n$ independently. Find $E(\bar{X})$ and $Var(\bar{X})$. What
happens to $Var(\bar{X})$ as $n \to \infty$?
26. Suppose $X_1, X_2, \ldots, X_n$ are independent and identically distributed random variables with
$E(X_i) = \mu$ and $Var(X_i) = \sigma^2$, $i = 1, 2, \ldots, n$.
(a) Find $E(X_i^2)$.
(b) Find $E(\bar{X})$, $Var(\bar{X})$ and $E[(\bar{X})^2]$.
(c) Use (a) and (b) to show that $E(S^2) = \sigma^2$ where
\[
S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2
    = \frac{1}{n-1} \left[ \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right]
\]
27. Let X G( 1:4; 1:5) (recall G stands for the Gaussian distribution introduced in Section 8.5)
and Y N ( 2:1; 4) independently. Find:
(a) P (X + Y > 6)
(b) P ( 2X + Y < 3)
(c) P (Y < X)
28. An automobile driveshaft is assembled by placing three independent parts A, B and C
end to end in a straight line. The standard deviations of the lengths of parts A, B and C are 0.6,
0.8, and 0.7 respectively.
(a) Find the standard deviation of the length of the assembled driveshaft.
(b) What percent reduction would there be in the standard deviation of the assembled driveshaft
if the standard deviation of the length of part B were cut in half?
29. The amount of wine in a bottle has a Normal distribution with mean 1.05 liters and variance
0.0004 liters$^2$.
(a) A bottle is labelled as containing 1 liter. What is the probability the bottle contains less
than 1 liter?
(b) The volume of a cask has a Normal distribution with mean 22 liters and variance 0.16
liters$^2$. What is the probability the contents of 20 randomly chosen wine bottles will fit
inside a randomly chosen cask?
30. A turbine shaft is made up of four sections. The lengths of the sections are independent and
have Normal distributions with different $\mu$ and $\sigma$: (8.10, 0.22), (7.25, 0.20), (9.75, 0.24), and
(3.10, 0.20). What is the probability an assembled shaft meets the specifications $28 \pm 0.26$?
31. The examination scores obtained by a large group of students can be modelled by a Normal
distribution with a mean of 65% and a standard deviation of 10%.
(a) Find the probability that the average score in a random group of 25 students exceeds 70%.
(b) Find the probability that the average scores of two distinct random groups of 25 students
differ by more than 5%.
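32. Suppose $X_i \sim G(\mu, \sigma)$, $i = 1, 2, \ldots, n$ independently.
(a) What is the distribution of
\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \; ?
\]
(b) Find $E(\bar{X})$ and $Var(\bar{X})$. What happens to $Var(\bar{X})$ as $n \to \infty$?
(c) Calculate $P\left( |\bar{X} - \mu| \le 1.96\sigma/\sqrt{n} \right)$.
(d) If $\sigma = 12$, how large should $n$ be to ensure that $P\left( |\bar{X} - \mu| \le 1.0 \right)$ is greater than 0.95?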
33. A necklace consists of 5 beads on a string. The beads for making the necklace are drawn at
random from a box containing a very large number of beads. Two-thirds of the beads are pink
and one-third of the beads are blue. Let X1 = 1(0) if beads 1 and 2 are of different (same)
colour, X2 = 1(0) if beads 2 and 3 are of different (same) colour, . . . , and X5 = 1(0) if beads 5
and 1 are of different (same) colour. In the figure is an example of a possible necklace. For this
necklace X1 = 0, X2 = 1, X3 = 1, X4 = 1, and X5 = 1. Find the mean and variance of the
number of unlike pairs of adjacent beads in the necklace.
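[Figure: a necklace of five beads numbered 1 to 5 arranged in a circle, with $X_1, X_2, \ldots, X_5$ labelling the links between adjacent beads]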
34. The inhabitants of the beautiful and ancient canal city of Pentapolis live on 5 islands separated
from each other by water. Bridges cross from one island to another as shown.
[Figure: map of the islands of Pentapolis and the bridges connecting them]
On any day, a bridge can be closed, with probability p, for restoration work. Assuming that
the eight bridges are closed independently, find the mean and variance of the number of islands
which are completely cut off because of restoration work.
35. A multiple choice exam has 100 questions, each with 5 possible answers. One mark is awarded
for a correct answer and 1/4 mark is deducted for an incorrect answer. A particular student
has probability pi of knowing the correct answer to the i’th question, independently of other
questions.
(a) Suppose that when the student does not know the answer to a question, s/he guesses randomly.
Let $T$ be the student's final mark on the exam. Show that
\[
E(T) = \sum_{i=1}^{100} p_i
\quad \text{and} \quad
Var(T) = \sum_{i=1}^{100} p_i (1 - p_i) + \frac{1}{4} \left( 100 - \sum_{i=1}^{100} p_i \right)
\]
(b) Let $S$ be the student's final mark on the exam if s/he does not guess. Show that
\[
E(S) = \sum_{i=1}^{100} p_i
\quad \text{and} \quad
Var(S) = \sum_{i=1}^{100} p_i (1 - p_i)
\]
(c) Compare the variances in (a) and (b) when
(i) $p_i = 0.9$ for $i = 1, 2, \ldots, 100$
(ii) $p_i = 0.5$ for $i = 1, 2, \ldots, 100$
36. Hash Tables Continued: See Chapter 5, Problem 26. For a hash table of size M and n keys
determine the following:
(a) the expected number of keys in a given list

(b) the expected number of empty slots
(c) the expected number of collisions
(d) the expected number of keys in the table when the event “every slot has at least one key”
occurs for the first time. Hint: Let $X_i$ = number of keys in the table when a total of $i$ slots
are assigned for the first time, $i = 1, 2, \ldots, M$ and use the approximation
$\sum_{j=1}^{M} \frac{1}{j} \approx \ln M$.
37. A Markov chain has a doubly stochastic transition matrix if both the row sums and the column
sums of the transition matrix $P$ are all 1. Show that for such a Markov chain, the Uniform
distribution on $\{1, 2, \ldots, N\}$ is a stationary distribution.
38. A salesperson named Chen sells in three cities A, B, and C. Chen never sells in the same city on
successive weeks. If Chen sells in city A, then the next week Chen always sells in B. However
if Chen sells in either B or C, then the next week Chen is twice as likely to sell in city A as in
the other city. What is the long-run proportion of time Chen spends in each of the three cities?
39. Find $\lim_{n \to \infty} P^n$ where
\[
P = \begin{bmatrix}
0 & 1 & 0 \\
\frac{1}{6} & \frac{1}{2} & \frac{1}{3} \\
0 & \frac{2}{3} & \frac{1}{3}
\end{bmatrix}
\]
40. Waterloo in January is blessed by many things, but not by good weather. There are never two
nice days in a row. If there is a nice day, we are just as likely to have snow as rain the next day. If
we have snow or rain, there is an even chance of having the same the next day. If there is a change
from snow or rain, only half of the time is this a change to a nice day. Taking as states the kinds
of weather R, N, and S, the transition probabilities $P$ are as follows:
\[
P = \begin{array}{c|ccc}
  & R & N & S \\ \hline
R & \frac{1}{2} & \frac{1}{4} & \frac{1}{4} \\
N & \frac{1}{2} & 0 & \frac{1}{2} \\
S & \frac{1}{4} & \frac{1}{4} & \frac{1}{2}
\end{array}
\]
If today is raining, find the probability of Rain, Nice, Snow three days from now. Find the
probabilities of the three states in five days, given (i) today is raining (ii) today is nice (iii) today
is snowing.
41. One-card Poker: A card game, which, for the purposes of this question we will call Metzler
Poker, is played as follows. Each of two players bets an initial $1 and is dealt a card from a deck
of 13 cards numbered 1; 2; : : : ; 13. Upon looking at their card, each player then decides (unaware
of the other’s decision) whether or not to increase their bet by $5 (to a total stake of $6). If both
increase the stake (“raise”), then the player with the higher card wins both stakes, that is, they get
their money back as well as the other player’s $6. If one person increases and the other does not,
then the player who increases automatically wins the pot. If neither person increases the stake,
then it is considered a draw: each player receives their own $1 back. Suppose that Players A and
B have similar strategies, based on threshold numbers $\{a, b\}$ they have chosen between 1 and
13. A chooses to raise whenever their card is greater than or equal to $a$ and B whenever B's card
is greater than or equal to $b$.
(a) Suppose B always raises (so that $b = 1$). What is the expected value of A's win or loss for
the different possible values of $a = 1, 2, \ldots, 13$?
(b) Suppose a and b are arbitrary. Given that both players raise, what is the probability that A
wins? What is the expected value of A’s win or loss?
(c) Suppose you know that b = 11. Find your expected win or loss for various values of a and
determine the optimal value. How much do you expect to make or lose per game under this
optimal strategy?
42. Searching a database: Suppose that we are given 3 records, R1 ; R2 ; R3 initially stored in that
order. The cost of accessing the j’th record in the list is j so we would like the more frequently
accessed records to be near the front of the list. Whenever a request for record j is processed,
the “move-to-front” heuristic stores Rj at the front of the list and the others in the original order.
For example if the first request is for record 2, then the records will be re-stored in the order
R2 ; R1 ; R3 . Assume that on each request, record j is requested with probability pj , for
j = 1; 2; 3.
(a) Show that if Xj is the permutation that obtains after j requests for records (e.g.
X2 = (2; 1; 3)), then Xj ; j = 1; 2; : : : is a Markov chain.
(b) Find the stationary distribution of this Markov chain. Hint: What is the probability that Xj
takes the form (2; ; ) for large j?
(c) Find the expected long-run cost per record accessed in the case $(p_1, p_2, p_3) = (0.1, 0.3, 0.6)$.
(d) How does this expected long-run cost compare with keeping the records in random order,
and with keeping them in order of decreasing values of $p_j$ (only possible if we know $p_j$)?
43. Secretary Problem: Suppose you are to interview $N$ candidates for a job, one at a time. You
must decide immediately after each interview whether to hire the current candidate or not and
you wish to maximize your chances of choosing the best person for the job (there is no benefit
from choosing the second or third best). For simplicity, assume candidate $i$ has numerical value
$X_i$ chosen without replacement from $\{1, 2, \ldots, N\}$ where 1 = worst, $N$ = best. Our strategy
is to interview $k$ candidates first, and then pick the first of the remaining $N - k$ that has value
greater than $\max(X_1, X_2, \ldots, X_k)$.
(a) What is the best choice of $k$? Hint: use the approximation $\sum_{j=1}^{n-1} \frac{1}{j} \approx \ln(n)$.
(b) For the value of $k$ found in (a), what is the approximate probability that you do choose the
maximum?
44. Challenge problem: A drunken probabilist stands $n$ steps from a cliff's edge. He takes random
steps, either towards or away from the cliff, each step independent of the previous step. On each
step the probability he takes a step away from the cliff is $\frac{2}{3}$ and the probability he takes a step
towards the cliff is $\frac{1}{3}$. What is the probability he does not fall off the cliff?
45. Challenge problem: Let $X$ be a continuous random variable with probability density function
$f_1(x)$ and let $Y$ be a discrete random variable. We define the conditional probability density
function of $X$ given $Y = y$ as
\[
f_1(x \mid y) = \frac{d}{dx} P(X \le x \mid Y = y)
\]
and the conditional probability function of $Y$ given $X = x$ as
\[
f_2(y \mid x) = P(Y = y \mid X = x) = \frac{f_1(x \mid y)\, P(Y = y)}{f_1(x)}
\]
(a) Show that
\[
f_2(y) = P(Y = y) = \int_{-\infty}^{\infty} f_2(y \mid x)\, f_1(x)\, dx
\]
(b) Assume we have a coin which is not fair and we do not know the probability of a head for
the coin. As such, we model the probability of heads by a random variable $X \sim U(0, 1)$.
This is an appropriate model as the probability of heads can be any number in the interval
$[0, 1]$. We want to find the probability function of $Y$ = the number of heads in $n$ tosses of
the coin. Clearly,
\[
P(Y = y \mid X = x) = \binom{n}{y} x^y (1 - x)^{n - y}
\]
Use the result in part (a) to show that
\[
P(Y = y) = \frac{1}{n + 1}
\]
Hint: Use the identity
\[
\int_0^1 \theta^a (1 - \theta)^b \, d\theta = \frac{a!\, b!}{(a + b + 1)!}
\]
(c) Let $Y$ be an indicator variable with $Y = 1$ if $|X| > 1$ and $Y = 0$ if $|X| \le 1$. Find the
conditional probability density function of $X$ given $Y = 0$.
46. Challenge problem: Suppose $U_i \sim U(0, 1)$, $i = 1, 2, \ldots$ independently. Define the random
variable
\[
N = \min \left\{ n : \sum_{i=1}^{n} U_i \ge k \right\}
\]
where $k$ is a positive real number. What is the expected value of $N$? How would you approximate
this expected value if $k$ were large?
10. C.L.T., NORMAL
APPROXIMATIONS and M.G.F.’s
10.1 Central Limit Theorem (C.L.T.) and Normal Approximations

The Normal distribution can, under certain conditions, be used to approximate probabilities for linear
combinations of variables having a non-Normal distribution. This remarkable property follows from
an amazing result called the Central Limit Theorem. There are actually several versions of the Central
Limit Theorem. The version given below is one of the simplest.
Example: The major reason that the Normal distribution is so commonly used is that it tends to
approximate the distribution of sums of random variables. For example, if we throw $n$ fair dice and
$S_n$ is the sum of the outcomes, what is the distribution of $S_n$? The tables below give the probability
function of $S_n$ for $n = 2$ and $n = 3$. Each numerator is the number of ways in which the given value
can be obtained, and the corresponding probability is obtained by dividing by $6^n$. For example, on the
throw of $n = 1$ die the possible outcomes are $1, 2, \ldots, 6$ with probabilities all $1/6$, as indicated in
the first panel of the histogram in Figure 10.1.
For $S_2$ = the sum of 2 fair dice, the possible values are $\{2, 3, \ldots, 12\}$. The probability function for
$S_2$ is:
s            2     3     4     5     6     7     8     9     10    11    12
P(S_2 = s)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
The probability histogram is shown in the second panel of Figure 10.1.

For $S_3$ = the sum of 3 fair dice, the possible values are $\{3, 4, \ldots, 18\}$. The probability function for
$S_3$ is:
s            3      4      5      6       7       8       9       10
P(S_3 = s)   1/216  3/216  6/216  10/216  15/216  21/216  25/216  27/216

s            11      12      13      14      15      16     17     18
P(S_3 = s)   27/216  25/216  21/216  15/216  10/216  6/216  3/216  1/216
Figure 10.1: Probability histograms for the sum of n rolls of a die for n = 1, 2, 3
The probability histogram is shown in the third panel of Figure 10.1. The probability histogram for
S3 already resembles a Normal probability density function. In general, these distributions show a
simple pattern. For n = 1, the probability function is a constant (polynomial of degree 0). For n = 2,
the probability histogram can be constructed from two linear functions spliced together (polynomials
of degree 1). For n = 3, the probability histogram can be constructed from three quadratic functions
(polynomials of degree 2). The shapes of these probability histograms rapidly approach the shape of
the Normal probability density function as n increases.
You can simulate the throws of any number of dice and illustrate the behaviour of the sums at
http://www.math.csusb.edu/faculty/stanton/probstat/clt.html.
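The tables for $S_2$ and $S_3$ can also be reproduced by direct counting. The following is a minimal Python sketch, not part of the original notes; the function names `dice_sum_counts` and `dice_sum_pf` are hypothetical helpers chosen for illustration. It builds the number of ways to obtain each total by adding one die at a time.

```python
from fractions import Fraction


def dice_sum_counts(n, faces=6):
    """Return {s: number of ways to roll a total of s with n fair dice}."""
    counts = {0: 1}  # with zero dice there is exactly one way to get a total of 0
    for _ in range(n):
        new_counts = {}
        for total, ways in counts.items():
            for face in range(1, faces + 1):
                new_counts[total + face] = new_counts.get(total + face, 0) + ways
        counts = new_counts
    return counts


def dice_sum_pf(n, faces=6):
    """Return the probability function of S_n as exact fractions."""
    counts = dice_sum_counts(n, faces)
    return {s: Fraction(ways, faces ** n) for s, ways in sorted(counts.items())}


# Print the numerators of the table for S_3 (each is divided by 6^3 = 216).
for s, ways in sorted(dice_sum_counts(3).items()):
    print(s, f"{ways}/216")
```

Calling `dice_sum_counts(n)` for larger `n` and plotting the values gives histograms like those in Figure 10.1, which is one way to see the bell shape emerge.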
This example illustrates what happens in general with the distribution of the sum of independent
random variables from any distribution with a finite mean and variance. If $X_1, X_2, \ldots, X_n$ are
independent discrete random variables all having the same distribution, with mean $\mu$ and variance
$\sigma^2$, then as $n \to \infty$, the shape of the probability histogram for the random variable
$S_n = \sum_{i=1}^{n} X_i$ approaches the shape of a $N(n\mu, n\sigma^2)$ probability density function, or
equivalently the shape of the probability histogram for $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ approaches
the shape of a $N(\mu, \sigma^2/n)$ probability density function. If $X_1, X_2, \ldots, X_n$ are independent
continuous random variables all having the same distribution, with mean $\mu$ and variance $\sigma^2$, then
as $n \to \infty$, the shape of the probability density function of the random variable
$S_n = \sum_{i=1}^{n} X_i$ approaches the shape of a $N(n\mu, n\sigma^2)$ probability density function and the
shape of the probability density function of $\bar{X}$ approaches the shape of a $N(\mu, \sigma^2/n)$
probability density function.
The following theorem is the mathematical statement of these results.
Theorem 39 (see footnote 39) Central Limit Theorem: If $X_1, X_2, \ldots, X_n$ are independent random
variables all having the same distribution, with mean $\mu$ and variance $\sigma^2$, then as $n \to \infty$, the
cumulative distribution function of the random variable
\[
\frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}} = \frac{S_n - n\mu}{\sigma\sqrt{n}}
\]
approaches the $N(0, 1)$ cumulative distribution function. Similarly, the cumulative distribution
function of
\[
\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}
\]
approaches the $N(0, 1)$ cumulative distribution function.
This is a theorem about limits. We will use it when $n$ is large, but finite, to approximate the
distribution of $S_n$ or $\bar{X}$ by a Normal distribution. That is, we will use
\[
S_n = \sum_{i=1}^{n} X_i \text{ has approximately a } N(n\mu, n\sigma^2) \text{ distribution for large } n
\]
and
\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \text{ has approximately a } N\left(\mu, \frac{\sigma^2}{n}\right) \text{ distribution for large } n
\]
Note that as $n \to \infty$, both distributions $N(n\mu, n\sigma^2)$ and $N(\mu, \sigma^2/n)$ fail to exist. (The former
because both $n\mu$ and $n\sigma^2 \to \infty$, the latter because $\sigma^2/n \to 0$.)
Footnote 39: A proof is given in Section 10.3.
Notes:
(1) The Central Limit Theorem does not hold if the common mean $\mu$ and variance $\sigma^2$ do not exist.
The Cauchy distribution introduced in Problem 8.20 is an example of such a distribution.
(2) We use the Central Limit Theorem to approximate the distribution of the sum $S_n = \sum_{i=1}^{n} X_i$ or
average $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$. The accuracy of the approximation depends on $n$ (bigger is better) and
also on the actual distribution of the $X_i$'s. The approximation works better for small $n$ when the
shape of the probability function/probability density function of $X_i$ is symmetric (for example,
the $U(a, b)$ probability density function) or nearly symmetric (for example, the $Poisson(5)$
probability function).
(3) In Section 9.6, the distributions of linear combinations of independent Normal random variables
were given. In particular if $X_1, X_2, \ldots, X_n$ are independent $N(\mu, \sigma^2)$ random variables then
\[
S_n = \sum_{i=1}^{n} X_i \sim N(n\mu, n\sigma^2)
\quad \text{and} \quad
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \sim N(\mu, \sigma^2/n)
\]
Thus, if the $X_i$'s themselves have a Normal distribution, then $S_n$ and $\bar{X}$ have exactly Normal
distributions for all values of $n$. If the $X_i$'s do not have a Normal distribution themselves,
then $S_n$ and $\bar{X}$ have approximately Normal distributions when $n$ is large. From this distinction
you should be able to guess that if the shape of the probability (density) function of $X_i$ is
somewhat Normal shaped then the approximation will be good for smaller values of $n$. If the
shape of the probability (density) function of $X_i$ is very non-Normal shaped (for example, an
$Exponential(\theta)$ probability density function) then the approximation will be poor for small $n$.
(This is related to the second remark in (2).)
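The effect of skewness described in Notes (2) and (3) can be checked numerically. The sketch below is an illustration added here, not part of the original notes. It assumes the standard fact that the sum of $n$ independent Exponential variables with mean $\theta$ has cumulative distribution function $1 - \sum_{k=0}^{n-1} e^{-s/\theta} (s/\theta)^k / k!$, and it measures the largest gap between this exact function and the Normal approximation on a grid. The helper names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def erlang_cdf(s, n, theta):
    """Exact P(S_n <= s) for a sum of n independent Exponential variables with mean theta."""
    if s <= 0:
        return 0.0
    x = s / theta
    term = math.exp(-x)  # k = 0 term of the Poisson-type sum
    total = term
    for k in range(1, n):
        term *= x / k
        total += term
    return 1.0 - total


def clt_cdf(s, n, theta):
    """Normal approximation: S_n is approximately N(n*theta, n*theta^2)."""
    return normal_cdf((s - n * theta) / (theta * math.sqrt(n)))


def max_cdf_error(n, theta=1.0, grid=2000):
    """Largest absolute difference between exact and approximate cdf on a grid."""
    upper = n * theta + 6 * theta * math.sqrt(n)
    points = (upper * i / grid for i in range(grid + 1))
    return max(abs(erlang_cdf(s, n, theta) - clt_cdf(s, n, theta)) for s in points)


for n in (1, 5, 30, 200):
    print(n, round(max_cdf_error(n), 4))
```

The printed errors shrink as $n$ grows, but slowly, which is consistent with the warning that the approximation is poor for small $n$ when the $X_i$'s are very skewed.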
Example: Hamburger patties are packed eight to a box, and each box is supposed to have 1 kilogram
of meat in it. The weights of the patties vary a little because they are mass produced, and the weight $X$
of a single patty is actually a random variable with mean $\mu = 0.128$ kilogram and standard deviation
$\sigma = 0.005$ kilogram. Find the probability a box has at least 1 kilogram of meat, assuming that the
weights of the eight patties in any given box are independent.
Solution: Let $X_1, X_2, \ldots, X_8$ be the weights of the eight patties in a box, and $S_8 = X_1 + X_2 + \cdots + X_8$
be their total weight. By the Central Limit Theorem, $S_8$ has approximately a $N(8\mu, 8\sigma^2)$ distribution.
We will assume this approximation is reasonable even though $n = 8$ is small. (This is likely okay
because the distribution of $X$ is likely fairly close to Normal.)
Thus $S_8 \sim N(1.024, 0.0002)$ approximately and
\[
\begin{aligned}
P(S_8 > 1) &\approx P\left( Z > \frac{1 - 1.024}{\sqrt{0.0002}} \right) \quad \text{where } Z \sim N(0, 1) \\
&= P(Z > -1.70) = P(Z \le 1.70) \\
&= 0.95543
\end{aligned}
\]
Note: We see that only about 96% of the boxes actually have 1 kilogram or more of hamburger. What
would you recommend be done to increase this probability to 99%?
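The calculation above is easy to repeat for other values of the mean and standard deviation of a patty. The following Python sketch is an added illustration; the function name `prob_box_meets_target` is hypothetical and the Normal approximation for $S_8$ is assumed to be adequate, as in the solution.

```python
import math


def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_box_meets_target(mu, sigma, n=8, target=1.0):
    """Approximate P(S_n >= target) when S_n is treated as N(n*mu, n*sigma^2)."""
    mean_total = n * mu
    sd_total = sigma * math.sqrt(n)
    return 1.0 - normal_cdf((target - mean_total) / sd_total)


# Values from the example: mu = 0.128 kg, sigma = 0.005 kg, eight patties per box.
print(round(prob_box_meets_target(0.128, 0.005), 4))
```

The result differs slightly from 0.95543 in the fourth decimal place because the code does not round $z$ to 1.70 before evaluating the Normal cumulative distribution function. Trying other values of `mu` or `sigma` is one way to explore the question posed in the note.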
Example: Suppose fires reported to a fire station satisfy the conditions for a Poisson process, with a
mean of 1 fire every 4 hours. Find the probability the 500th fire of the year is reported on the 84th day
of the year.
Solution: Let $X_i$ be the time between the $(i-1)$st and $i$th fires ($X_1$ is the time to the first fire). Then
$X_i$ has an Exponential distribution with $\theta = 1/\lambda = 4$ hours, or $\theta = 1/6$ day. Since
$S_{500} = \sum_{i=1}^{500} X_i$ is the time until the 500th fire, we want to find $P(83 < S_{500} \le 84)$. While the
Exponential probability density function is very skewed and not very Normal shaped, we are summing a
large number of independent Exponential variables. Hence, by the Central Limit Theorem, $S_{500}$ has
approximately a $N(500\mu, 500\sigma^2)$ distribution, where $\mu = E(X_i)$ and $\sigma^2 = Var(X_i)$. For the
Exponential distribution, $\mu = \theta = 1/6$ and $\sigma^2 = \theta^2 = 1/36$ so
\[
\begin{aligned}
P(83 < S_{500} \le 84) &\approx P\left( \frac{83 - \frac{500}{6}}{\sqrt{\frac{500}{36}}} < Z \le \frac{84 - \frac{500}{6}}{\sqrt{\frac{500}{36}}} \right) \quad \text{where } Z \sim N(0, 1) \\
&= P(-0.09 < Z \le 0.18) \\
&= P(Z \le 0.18) - P(Z \le -0.09) \\
&= P(Z \le 0.18) - [1 - P(Z \le 0.09)] \\
&= 0.57142 + 0.53586 - 1 \\
&= 0.10728
\end{aligned}
\]
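As a check on this approximation, the probability can also be computed without the Central Limit Theorem. The sketch below is an added illustration, not from the original notes. It relies on the Poisson process relationship that $S_{500} \le s$ exactly when at least 500 fires occur in $[0, s]$, so that $P(S_{500} \le s) = 1 - \sum_{k=0}^{499} e^{-s/\theta} (s/\theta)^k / k!$. The function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def time_to_nth_event_cdf(s, n, theta):
    """Exact P(S_n <= s): at least n events of a Poisson process occur by time s."""
    x = s / theta  # expected number of events in [0, s]
    term = math.exp(-x)
    total = term
    for k in range(1, n):
        term *= x / k
        total += term
    return 1.0 - total


n, theta = 500, 1.0 / 6.0  # 500 fires, mean time between fires 1/6 day
mean, sd = n * theta, theta * math.sqrt(n)

normal_approx = normal_cdf((84 - mean) / sd) - normal_cdf((83 - mean) / sd)
exact = time_to_nth_event_cdf(84, n, theta) - time_to_nth_event_cdf(83, n, theta)

print("Normal approximation:", round(normal_approx, 4))
print("Exact:", round(exact, 4))
```

The Normal value printed here is computed with unrounded $z$ values, so it is slightly smaller than the 0.10728 obtained from the tables with $z$ rounded to two decimals.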
Example: This example is frivolous but shows how the Normal distribution can approximate even
sums of discrete random variables. In an apple orchard, suppose the number $X$ of worms in an apple
has probability function:
x       0    1    2    3    Total
f(x)    0.4  0.3  0.2  0.1  1
Find the probability a basket with 250 apples in it has between 225 and 260 (inclusive) worms in it.
Solution:
\[
\begin{aligned}
\mu = E(X) &= \sum_{x=0}^{3} x f(x) = 0(0.4) + 1(0.3) + 2(0.2) + 3(0.1) = 1 \\
E(X^2) &= \sum_{x=0}^{3} x^2 f(x) = (0)^2(0.4) + (1)^2(0.3) + (2)^2(0.2) + (3)^2(0.1) = 2 \\
\sigma^2 = Var(X) &= E(X^2) - \mu^2 = 2 - (1)^2 = 1
\end{aligned}
\]
By the Central Limit Theorem, $S_{250} = \sum_{i=1}^{250} X_i$ has approximately a $N(250\mu, 250\sigma^2)$ distribution,
where $X_i$ is the number of worms in the $i$th apple. Therefore $S_{250}$ has approximately a $N(250, 250)$
distribution and
\[
\begin{aligned}
P(225 \le S_{250} \le 260) &\approx P\left( \frac{225 - 250}{\sqrt{250}} \le Z \le \frac{260 - 250}{\sqrt{250}} \right) \quad \text{where } Z \sim N(0, 1) \\
&= P(-1.58 \le Z \le 0.63) \\
&= P(Z \le 0.63) - [1 - P(Z \le 1.58)] \\
&= 0.73565 + 0.94295 - 1 \\
&= 0.67860
\end{aligned}
\]
While this approximation is adequate, we can improve its accuracy, as follows. When $X_i$ has a discrete
distribution, as it does here, $S_n = \sum_{i=1}^{n} X_i$ will always remain discrete no matter how large $n$ gets. So the
shape of the probability histogram of $S_n$, while Normal shaped, will never be exactly Normal shaped.
Consider a probability histogram for the random variable $S_{250}$, as shown in Figure 10.2. (Only part of
the histogram is shown.)
Figure 10.2: Probability histogram for $S_{250}$ (the points 224.5 and 260.5 are marked on the horizontal axis)
The area of the bar on the interval $[s - 0.5, s + 0.5]$ is equal to $P(S_{250} = s)$. The smooth curve is the
probability density function for the approximating Normal distribution. Then $\sum_{s=225}^{260} P(S_{250} = s)$ is the
total area of all bars of the histogram for $s = 225, 226, \ldots, 260$. These bars actually span the interval
of values $[224.5, 260.5]$. The left and right end bars are more easily seen in Figure 10.3.
Figure 10.3: A magnification of the bars at the left and right hand end of the interval (left end at 224.5, right end at 260.5)
We can obtain a more accurate approximation by finding the area under the Normal probability
density function from 224.5 to 260.5, that is,
\[
\begin{aligned}
P(225 \le S_{250} \le 260) &= P(224.5 < S_{250} < 260.5) \\
&\approx P\left( \frac{224.5 - 250}{\sqrt{250}} < Z < \frac{260.5 - 250}{\sqrt{250}} \right) \quad \text{where } Z \sim N(0, 1) \\
&= P(-1.61 < Z < 0.66) = 0.74537 + 0.94630 - 1 \\
&= 0.69167
\end{aligned}
\]
Unless making this adjustment greatly complicates the solution, it is preferable to make this “continuity
correction”.
Suppose you required the probability of a single value such as
\[
P(S_{250} = 225)
\]
Then without the continuity correction we obtain the silly approximation
\[
P(S_{250} = 225) \approx P\left( Z = \frac{225 - 250}{\sqrt{250}} \right) = 0
\]
In such a case we need to use the continuity correction. We obtain
\[
\begin{aligned}
P(S_{250} = 225) &= P(224.5 < S_{250} < 225.5) \\
&\approx P\left( \frac{224.5 - 250}{\sqrt{250}} < Z < \frac{225.5 - 250}{\sqrt{250}} \right) \quad \text{where } Z \sim N(0, 1) \\
&= P(-1.61 < Z < -1.55) = 0.9463 - 0.93953 = 0.0068
\end{aligned}
\]
and although this is small, it is certainly not zero.
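Because $S_{250}$ takes only the values $0, 1, \ldots, 750$, its exact probability function can be computed by repeated convolution and compared with both Normal approximations. The following Python sketch is an added illustration under that approach; the helper `pf_of_sum` and the variable names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pf_of_sum(pf, n):
    """Probability function of the sum of n independent copies of X, where pf[x] = P(X = x)."""
    dist = [1.0]  # the sum of zero variables equals 0 with probability 1
    for _ in range(n):
        new = [0.0] * (len(dist) + len(pf) - 1)
        for s, ps in enumerate(dist):
            for x, px in enumerate(pf):
                new[s + x] += ps * px
        dist = new
    return dist


pf = [0.4, 0.3, 0.2, 0.1]  # worms per apple
n = 250
mu = sum(x * p for x, p in enumerate(pf))
var = sum(x * x * p for x, p in enumerate(pf)) - mu ** 2
mean, sd = n * mu, math.sqrt(n * var)

dist = pf_of_sum(pf, n)
exact = sum(dist[225:261])  # P(225 <= S_250 <= 260)
plain = normal_cdf((260 - mean) / sd) - normal_cdf((225 - mean) / sd)
corrected = normal_cdf((260.5 - mean) / sd) - normal_cdf((224.5 - mean) / sd)

exact_single = dist[225]  # P(S_250 = 225)
corrected_single = normal_cdf((225.5 - mean) / sd) - normal_cdf((224.5 - mean) / sd)

print("interval:", round(exact, 4), round(plain, 4), round(corrected, 4))
print("single value:", round(exact_single, 4), round(corrected_single, 4))
```

The approximations printed here use unrounded $z$ values, so they differ a little in the later decimals from the table-based answers above.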
Notes:
(1) A continuity correction should not be applied when approximating a continuous distribution by
the Normal distribution. Since the correction involves going halfway to the next possible value,
there would be no adjustment to make if the random variable takes on real values.
(2) Rather than trying to guess or remember when to add 0.5 and when to subtract 0.5, it is often
helpful to sketch a histogram and shade the bars you wish to include. It should then be obvious
which value to use.
(3) Whenever approximating the probability of a single value for a discrete distribution, such as
$P(X = 50)$ where $X$ is $Binomial(100, 0.5)$, you need to use the continuity correction. Otherwise,
for approximating the Binomial with large $n$, it is not necessary to use the correction.
Normal Approximation to the Poisson Distribution

Let $X$ be a random variable with a $Poisson(\mu)$ distribution and suppose $\mu$ is large. For the moment
suppose that $\mu$ is an integer and recall that if we add $\mu$ independent Poisson random variables, each with
parameter 1, then the sum has the Poisson distribution with parameter $\mu$. In general, a Poisson random
variable with large expected value can be written as the sum of a large number of independent random
variables, and so the Central Limit Theorem implies that it must be close to Normally distributed.
Theorem 40 Normal Approximation to Poisson: Suppose $X \sim Poisson(\mu)$. Then the cumulative
distribution function of the standardized random variable
\[
Z = \frac{X - \mu}{\sqrt{\mu}}
\]
approaches that of a standard Normal random variable as $\mu \to \infty$.
We prove this theorem in Section 10.2.
Example: Suppose $X \sim Poisson(\mu)$. Use the Normal approximation to approximate
\[
P(X > \mu)
\]
Compare this approximation with the true value when $\mu = 9$.
Solution: Theorem 40 implies that the cumulative distribution function of the standardized random
variable
\[
\frac{X - \mu}{\sqrt{\mu}}
\]
(note: identify $E(X)$ and $Var(X)$ in the above standardization) approaches the cumulative distribution
function of a standard Normal random variable $Z$. In particular, without a continuity correction,
\[
P(X > \mu) = P\left( \frac{X - \mu}{\sqrt{\mu}} > 0 \right) \to P(Z > 0) = 0.5 \quad \text{as } \mu \to \infty
\]
Computing the true value when $\mu = 9$ we obtain
\[
P(X > 9) = 1 - P(X \le 9) = 1 - \left( e^{-9} + 9e^{-9} + \frac{9^2}{2!} e^{-9} + \cdots + \frac{9^9}{9!} e^{-9} \right) = 1 - 0.5874 = 0.4126
\]
The Normal approximation without a continuity correction gives
\[
P(X > 9) \approx P\left( Z > \frac{9 - 9}{3} \right) = P(Z > 0) = 0.5
\]
which is not close to the true value 0.4126. The Normal approximation with a continuity correction
gives
\[
P(X > 9) \approx P\left( Z > \frac{9.5 - 9}{3} \right) \approx P(Z > 0.17) = 0.4325
\]
which is much closer to the true value.
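The three numbers in this example can be reproduced directly. The sketch below is an added illustration using only the Poisson probability function and the Normal cumulative distribution function; the helper names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def poisson_cdf(k, mu):
    """Exact P(X <= k) for X ~ Poisson(mu)."""
    term = math.exp(-mu)
    total = term
    for j in range(1, k + 1):
        term *= mu / j
        total += term
    return total


mu = 9
exact = 1.0 - poisson_cdf(9, mu)                          # P(X > 9)
plain = 1.0 - normal_cdf((9 - mu) / math.sqrt(mu))        # no continuity correction
corrected = 1.0 - normal_cdf((9.5 - mu) / math.sqrt(mu))  # with continuity correction

print(round(exact, 4), round(plain, 4), round(corrected, 4))
```

With $z = 0.5/3$ left unrounded the corrected value is about 0.434 rather than the table-based value for $z = 0.17$; either way it is far closer to the exact probability than 0.5.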
Normal Approximation to the Binomial Distribution

It is well-known that the probability histogram for a Binomial distribution, at least for large values
of $n$, resembles a bell-shaped or Normal curve. The most common demonstration of this is with a
mechanical device common in science museums called a “Galton board” or “Quincunx” (see footnote 40),
which drops balls through a mesh of equally spaced pins (see Figure 10.4). Notice that if balls either go
to the right or left at each of the eight levels of pins, independently of the movement of the other balls, then
$X$ = number of moves to the right has a $Binomial(8, 0.5)$ distribution. If the balls are dropped from
location 0 (on the $x$ axis) then the ball eventually rests at location $2X - 8$ which is approximately
Normally distributed since $X$ is approximately Normal.
Footnote 40: The word comes from Latin quinque (five) uncia (twelve) and means five twelfths.
Figure 10.4: A "Galton Board" or "Quincunx"
The following result is proved using the Central Limit Theorem.
Theorem 41 Normal Approximation to Binomial: Suppose $X \sim Binomial(n, p)$. Then for $n$ large,
the random variable
\[
W = \frac{X - np}{\sqrt{np(1 - p)}} \text{ has approximately a } N(0, 1) \text{ distribution}
\]
Proof: We use indicator variables $X_i$, $i = 1, 2, \ldots, n$ where $X_i = 1$ if the $i$th trial in the Binomial
process is an “S” outcome and $X_i = 0$ if it is an “F” outcome. Then $X = \sum_{i=1}^{n} X_i$ and we can use the
Central Limit Theorem. Since
\[
\mu = E(X_i) = p \quad \text{and} \quad \sigma^2 = Var(X_i) = p(1 - p)
\]
we have that as $n \to \infty$ the cumulative distribution function of the random variable
\[
W = \frac{\sum_{i=1}^{n} X_i - np}{\sqrt{np(1 - p)}} = \frac{X - np}{\sqrt{np(1 - p)}}
\]
approaches the cumulative distribution function of a $N(0, 1)$ random variable, as required.
Remark: We can write the Normal approximation either as
$\frac{X - np}{\sqrt{np(1 - p)}} \sim N(0, 1)$ approximately or as
$X \sim N(np, np(1 - p))$ approximately.
Remark: The continuity correction method can be used here. The following numerical example
illustrates the procedure.
Example: Suppose $X \sim Binomial(n, p)$.
(a) If $n = 20$ and $p = 0.4$, approximate the probability $P(4 \le X \le 12)$. Compare the answer with the
exact value.
(b) If $n = 100$ and $p = 0.4$, approximate the probability $P(34 \le X \le 48)$. Compare the answer with
the exact value.
Solution (a) By the Normal approximation to the Binomial we have $X \sim N(8, 4.8)$ approximately.
Without the continuity correction we have
\[
\begin{aligned}
P(4 \le X \le 12) &= P\left( \frac{4 - 8}{\sqrt{4.8}} \le \frac{X - 8}{\sqrt{4.8}} \le \frac{12 - 8}{\sqrt{4.8}} \right) \\
&\approx P(-1.826 \le Z \le 1.826) \quad \text{where } Z \sim N(0, 1) \\
&= 0.932
\end{aligned}
\]
Using the continuity correction method, we get
\[
\begin{aligned}
P(4 \le X \le 12) &\approx P\left( \frac{3.5 - 8}{\sqrt{4.8}} \le Z \le \frac{12.5 - 8}{\sqrt{4.8}} \right) \quad \text{where } Z \sim N(0, 1) \\
&= P(-2.054 \le Z \le 2.054) \\
&= 0.960
\end{aligned}
\]
The exact probability is
\[
\sum_{x=4}^{12} \binom{20}{x} (0.4)^x (0.6)^{20 - x} = 0.963
\]
which was calculated using the R function pbinom(). As expected the continuity correction method
gives a more accurate approximation.
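(b) By the Normal approximation to the Binomial we have $X \sim N(40, 24)$ approximately so without
the continuity correction we have
\[
\begin{aligned}
P(34 \le X \le 48) &\approx P\left( \frac{34 - 40}{\sqrt{24}} \le Z \le \frac{48 - 40}{\sqrt{24}} \right) \quad \text{where } Z \sim N(0, 1) \\
&= P(-1.225 \le Z \le 1.633) \\
&= 0.9488 - (1 - 0.8897) = 0.8385
\end{aligned}
\]
With the continuity correction
\[
\begin{aligned}
P(34 \le X \le 48) &\approx P\left( \frac{33.5 - 40}{\sqrt{24}} \le Z \le \frac{48.5 - 40}{\sqrt{24}} \right) \quad \text{where } Z \sim N(0, 1) \\
&= P(-1.327 \le Z \le 1.735) \\
&= 0.9586 - (1 - 0.9076) = 0.866
\end{aligned}
\]
The exact value to three decimal places is
\[
\sum_{x=34}^{48} \binom{100}{x} (0.4)^x (0.6)^{100 - x} = 0.866
\]
so the approximation and exact answer agree to three decimal places.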
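The notes compute the exact values with the R function pbinom(). An equivalent check can be sketched in Python using only the standard library; this is an added illustration, and the function names `binomial_prob` and `normal_approx` are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def binomial_prob(lo, hi, n, p):
    """Exact P(lo <= X <= hi) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(lo, hi + 1))


def normal_approx(lo, hi, n, p, continuity=True):
    """Normal approximation to P(lo <= X <= hi), optionally with continuity correction."""
    mean, sd = n * p, math.sqrt(n * p * (1 - p))
    shift = 0.5 if continuity else 0.0
    return normal_cdf((hi + shift - mean) / sd) - normal_cdf((lo - shift - mean) / sd)


for n, lo, hi in [(20, 4, 12), (100, 34, 48)]:
    print(
        n,
        round(binomial_prob(lo, hi, n, 0.4), 3),
        round(normal_approx(lo, hi, n, 0.4, continuity=False), 3),
        round(normal_approx(lo, hi, n, 0.4, continuity=True), 3),
    )
```

Each printed row gives $n$, the exact probability, the approximation without the continuity correction, and the approximation with it.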
Note: The error of the Normal approximation decreases as $n$ increases, but it is a good idea to use
the continuity correction when it is convenient. For example, if we are using a Normal approximation
to a discrete distribution like the Binomial which takes integer values and the standard deviation of the
Binomial is 10, then the continuity correction makes a difference of $0.5/10 = 0.05$ to the
number we look up in the table. This can result in a difference in the probability of up to around 0.02,
and the difference is larger when the standard deviation is smaller.
If you are willing to tolerate errors in probabilities of that magnitude, your rule of thumb might be to
use the continuity correction whenever the standard deviation of the integer-valued random variable
being approximated is less than 10.
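The figure of "around 0.02" can be checked with a short calculation. The sketch below is an added illustration with a hypothetical function name. It assumes the worst case for a single table lookup, which occurs when the shift of $0.5/\text{sd}$ in the $z$ value is centred at $z = 0$, where the Normal density is highest.

```python
import math


def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def max_shift_in_probability(sd):
    """Largest change in one Normal cdf value when z moves by 0.5 / sd."""
    delta = 0.5 / sd
    return normal_cdf(delta / 2) - normal_cdf(-delta / 2)


for sd in (2, 5, 10, 20):
    print(sd, round(max_shift_in_probability(sd), 4))
```

For a standard deviation of 10 the change is just under 0.02, and it roughly doubles each time the standard deviation is halved.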
Example 10.1: Binomial sample size calculation Let $p$ be the proportion of Canadians who think
Canada should adopt the US dollar.
(a) Suppose 400 Canadians are randomly chosen and asked their opinion. Let $X$ be the number who
say yes. Find the probability that the proportion, $X/400$, of people who say yes is within 0.02 of $p$, if
$p = 0.20$.
(b) Suppose for a future opinion poll we want to determine the number, $n$, to survey to ensure that there
is a 95% chance that $X/n$ lies within 0.02 of $p$. Suppose $p = 0.20$ is known.
(c) Repeat (b) when the value of $p$ is unknown. (Note that this would be the more realistic situation in
the case of conducting an opinion poll.)
Solution:
a) Since $X$ = the number of Canadians who say yes it is reasonable to assume $X \sim Binomial(400, 0.2)$.
(How well do you think the assumptions for a Binomial distribution hold in this case?) By the
Normal approximation to the Binomial we have that $X$ has approximately a Normal distribution
with mean $\mu = np = (400)(0.2) = 80$ and variance $\sigma^2 = np(1 - p) = (400)(0.2)(0.8) = 64$.
Therefore, applying the continuity correction,
\[
\begin{aligned}
P\left( \left| \frac{X}{400} - 0.2 \right| \le 0.02 \right)
&= P(|X - 80| \le 8) = P(|X - 80| \le 8.5) \\
&\approx P\left( |Z| \le \frac{8.5}{\sqrt{64}} \right) = P(|Z| \le 1.06) \quad \text{where } Z \sim N(0, 1) \\
&= 2P(Z \le 1.06) - 1 = 2(0.85543) - 1 \\
&= 0.71086
\end{aligned}
\]
b) Since $n$ is unknown, it is difficult to apply a continuity correction. If $n$ is large the continuity
correction changes the answer by very little so we do not apply a continuity correction in this
case. By the Normal approximation to the Binomial we have that $X$ has approximately a Normal
distribution with mean $\mu = np = 0.2n$ and variance $\sigma^2 = np(1 - p) = 0.16n$. We want to find
$n$ such that
\[
P\left( \left| \frac{X}{n} - 0.2 \right| \le 0.02 \right) \ge 0.95
\]
Now
\[
\begin{aligned}
P\left( \left| \frac{X}{n} - 0.2 \right| \le 0.02 \right)
&= P(|X - 0.2n| \le 0.02n) \\
&= P\left( \frac{|X - 0.2n|}{\sqrt{0.16n}} \le \frac{0.02n}{\sqrt{0.16n}} \right) \\
&\approx P\left( |Z| \le 0.05\sqrt{n} \right) \quad \text{where } Z \sim N(0, 1)
\end{aligned}
\]
Since $P(|Z| \le 1.96) = 0.95$ we want $n$ such that
\[
0.05\sqrt{n} \ge 1.96 \quad \text{or} \quad n \ge \left( \frac{1.96}{0.05} \right)^2 = 1536.64
\]
In other words, we need to survey 1537 people to be at least 95% sure that $X/n$ lies within 0.02 of
$p = 0.2$. Note that $n = 1537$ is large so using a continuity correction would not affect the final
answer.
c) By the Normal approximation to the Binomial, $X \sim N(np, np(1-p))$ approximately. We want
to find n such that
\[
P\left(\left|\frac{X}{n} - p\right| \le 0.02\right) \ge 0.95
\]
Now
\[
\begin{aligned}
P\left(\left|\frac{X}{n} - p\right| \le 0.02\right)
&= P(|X - np| \le 0.02n) \\
&= P\left(\frac{|X - np|}{\sqrt{np(1-p)}} \le \frac{0.02n}{\sqrt{np(1-p)}}\right) \\
&\approx P\left(|Z| \le \frac{0.02\sqrt{n}}{\sqrt{p(1-p)}}\right) \quad \text{where } Z \sim N(0,1)
\end{aligned}
\]
Since $P(|Z| \le 1.96) = 0.95$ we want n such that
\[
\frac{0.02\sqrt{n}}{\sqrt{p(1-p)}} \ge 1.96 \quad \text{or} \quad n \ge \left(\frac{1.96}{0.02}\right)^2 p(1-p)
\]
Unfortunately this does not give us an explicit expression for n because we don’t know p. The
way out of this dilemma is to find the maximum value of
\[
\left(\frac{1.96}{0.02}\right)^2 p(1-p)
\]
If we choose n this large, then we can be sure of having the required precision in our estimate,
$X/n$, for any p. It’s easy to see that $p(1-p)$ is a maximum when $p = 0.5$. Therefore we take
\[
n \ge \left(\frac{1.96}{0.02}\right)^2 \left(\frac{1}{2}\right)\left(1 - \frac{1}{2}\right) = 2401
\]
that is, if we survey $n = 2401$ people we can be 95% sure that $X/n$ lies within 0.02 of p, regardless
of the value of p.
Remark: This method is used when poll results are reported in the media: you often see or hear that
“this poll is accurate to within 3 percent 19 times out of 20”. This is saying that n was big enough so that
$P(p - 0.03 \le X/n \le p + 0.03)$ was 95%. (This requires n of about 1067.)
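The calculations in Example 10.1 can be checked numerically. The following is a minimal sketch, not part of the original notes: the helper names `normal_cdf`, `binomial_prob_between` and `sample_size` are hypothetical, and exact fractions are used for the sample size so that rounding up is not disturbed by floating-point error. The Normal approximation here uses the unrounded value 8.5/8 = 1.0625 rather than the table value 1.06, so it differs slightly from 0.71086.

```python
from fractions import Fraction
import math


def normal_cdf(z):
    """Standard Normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def binomial_prob_between(n, p, lo, hi):
    """Exact P(lo <= X <= hi) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(lo, hi + 1))


def sample_size(margin, p="0.5", z="1.96"):
    """Smallest integer n with n >= (z / margin)^2 * p * (1 - p).

    Arguments are decimal strings so the arithmetic is exact.
    The default p = 0.5 is the worst case used in part (c).
    """
    margin, p, z = Fraction(margin), Fraction(p), Fraction(z)
    return math.ceil((z / margin) ** 2 * p * (1 - p))


# Part (a): exact Binomial probability versus the continuity-corrected approximation.
exact = binomial_prob_between(400, 0.2, 72, 88)
approx = 2 * normal_cdf(8.5 / math.sqrt(64)) - 1
print(exact, approx)

# Parts (b), (c) and the Remark (margin 0.03, p unknown).
print(sample_size("0.02", p="0.2"), sample_size("0.02"), sample_size("0.03"))  # 1537 2401 1068
```

For a margin of 0.03 the bound is about 1067.1, so rounding up gives 1068, consistent with the "about 1067" quoted in the Remark.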
Problems
10.1.1 Tomato seeds germinate (sprout to produce a plant) independently of each other, with probability
0.8 of each seed germinating. Give an expression for the probability that at least 75 seeds out of
100 which are planted in soil germinate. Evaluate this using a suitable approximation.
10.1.2 A metal parts manufacturer inspects each part produced. 60% are acceptable as produced, 30%
have to be repaired, and 10% are beyond repair and must be scrapped. It costs the manufacturer
$10 to repair a part, and $100 (in lost labour and materials) to scrap a part. Find the approximate
probability that the total cost associated with inspecting 80 parts will exceed $1200.
10.2 Moment Generating Functions

Univariate Discrete Distributions
We have seen two functions which characterize a distribution of a random variable, the probability
function/probability density function and the cumulative distribution function. If we are given the
probability function/probability density function of a random variable X or the cumulative distribution
function of the random variable X then we can determine everything there is to know about the distri-
bution of X. There is a third type of function, the moment generating function, which also uniquely
determines a distribution. The moment generating function is closely related to other transforms used
in mathematics, the Laplace and Fourier transforms.
We first consider moment generating functions for discrete random variables.
Definition 42 Consider a discrete random variable X with probability function $f(x)$. The moment
generating function (m.g.f.) of X is defined as
\[
M(t) = E(e^{tX}) = \sum_{\text{all } x} e^{tx} f(x)
\]
We will assume that the moment generating function is defined and finite for values of t in an interval
around 0 (that is, for some $a > 0$, $\sum_{x} e^{tx} f(x) < \infty$ for all $t \in [-a, a]$).
The moments of a random variable X are the expectations of the functions $X^k$ for $k = 1, 2, \ldots$.
The expected value $E(X^k)$ is called the k’th moment of X. The mean $\mu = E(X)$ is therefore the first
moment, $E(X^2)$ is the second moment and so on. It is often easy to find the moments of a probability
distribution mathematically by using the moment generating function. This often gives easier derivations
of means and variances than the direct summation methods in Chapter 7. The following theorem
gives a useful property of moment generating functions.
Theorem 43 Let the random variable X have moment generating function $M(t)$. Then
\[
E(X^k) = M^{(k)}(0) \quad \text{for } k = 1, 2, \ldots
\]
where $M^{(k)}(0)$ stands for $d^k M(t)/dt^k$ evaluated at $t = 0$.
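Proof:
$M(t) = \sum_{\text{all } x} e^{tx} f(x)$ and if the sum converges, then
\[
\begin{aligned}
M^{(k)}(t) &= \frac{d^k}{dt^k} \sum_{\text{all } x} e^{tx} f(x) \\
&= \sum_{\text{all } x} \frac{d^k}{dt^k}\left(e^{tx}\right) f(x) \\
&= \sum_{\text{all } x} x^k e^{tx} f(x)
\end{aligned}
\]
Therefore $M^{(k)}(0) = \sum_{\text{all } x} x^k f(x) = E(X^k)$, as stated.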
Theorem 43 gives us another way to find the moments for a distribution.
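Example: Suppose X has a Binomial(n, p) distribution. Then its moment generating function is
\[
\begin{aligned}
M(t) &= \sum_{x=0}^{n} e^{tx} \binom{n}{x} p^x (1-p)^{n-x} \\
&= \sum_{x=0}^{n} \binom{n}{x} (pe^t)^x (1-p)^{n-x} \\
&= (pe^t + 1 - p)^n \quad \text{by the Binomial Theorem, for all } t \in \Re
\end{aligned}
\]
Therefore
\[
\begin{aligned}
M'(t) &= npe^t (pe^t + 1 - p)^{n-1} \\
M''(t) &= npe^t (pe^t + 1 - p)^{n-1} + n(n-1)p^2 e^{2t} (pe^t + 1 - p)^{n-2}
\end{aligned}
\]
and so
\[
\begin{aligned}
E(X) &= M'(0) = np \\
E(X^2) &= M''(0) = np + n(n-1)p^2 \\
Var(X) &= E(X^2) - [E(X)]^2 = np(1-p)
\end{aligned}
\]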
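As a numerical illustration of Theorem 43, the derivatives of the Binomial moment generating function at $t = 0$ can be approximated by central finite differences and compared with $np$ and $np + n(n-1)p^2$. This is only a sketch: the function names, the step size `h`, and the values $n = 10$, $p = 0.3$ are arbitrary choices made for illustration.

```python
import math


def binomial_mgf(t, n, p):
    """M(t) = (p e^t + 1 - p)^n for X ~ Binomial(n, p)."""
    return (p * math.exp(t) + 1 - p) ** n


def derivative_at_zero(f, order, h=1e-3):
    """Approximate f'(0) or f''(0) with central finite differences."""
    if order == 1:
        return (f(h) - f(-h)) / (2 * h)
    if order == 2:
        return (f(h) - 2 * f(0.0) + f(-h)) / h**2
    raise ValueError("only orders 1 and 2 are supported")


n, p = 10, 0.3


def mgf(t):
    return binomial_mgf(t, n, p)


first_moment = derivative_at_zero(mgf, 1)    # approximates E(X) = np
second_moment = derivative_at_zero(mgf, 2)   # approximates E(X^2) = np + n(n-1)p^2
variance = second_moment - first_moment**2   # approximates np(1-p)

print(first_moment, n * p)
print(second_moment, n * p + n * (n - 1) * p**2)
print(variance, n * p * (1 - p))
```

The finite-difference values agree with the closed forms only up to the truncation error of the approximation, which is why the comparison is made with a tolerance rather than exact equality.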
Exercise: Moment generating function for a Poisson random variable

Use the Exponential series to show that if $X \sim \text{Poisson}(\mu)$ then the moment generating function is
\[
M(t) = \exp\left(-\mu + \mu e^t\right) \quad \text{for all } t \in \Re
\]
Use this to show that $E(X) = \mu$ and $Var(X) = \mu$.
The moment generating function uniquely identifies a distribution in the sense that if two random
variables have the same moment generating function, they have the same distribution (so the same
probability function, cumulative distribution function, moments, etc.). Of course the moment generating
functions must match for all values of t; in other words they agree as functions, not just at a
few points. For example, if we can show somehow that the moment generating function of a random
variable X is
\[
M(t) = e^{2(e^t - 1)} \quad \text{for all } t \in \Re
\]
then we know from the previous exercise that the random variable must have a Poisson(2) distribution.
This means that if we are able to determine the moment generating function for a given random variable
then the moment generating function can be used to identify its distribution. This gives us another
technique for finding the distribution of a random variable.
Theorem 44 Uniqueness Theorem for Moment Generating Functions: Suppose that random
variables X and Y have moment generating functions $M_X(t)$ and $M_Y(t)$ respectively. If
$M_X(t) = M_Y(t)$ for all t then X and Y have the same distribution.
Moment generating functions can also be used to determine that a sequence of distributions gets
closer and closer to some limiting distribution. To show this (albeit a bit loosely), suppose that a
sequence of probability functions $f_n(x)$ have corresponding moment generating functions
\[
M_n(t) = \sum_{\text{all } x} e^{tx} f_n(x)
\]
Suppose moreover that the probability functions $f_n(x)$ converge to another probability function $f(x)$
pointwise in x as $n \to \infty$. This is what we mean by convergence of discrete distributions. Then since
\[
f_n(x) \to f(x) \quad \text{as } n \to \infty \text{ for each } x, \tag{10.1}
\]
\[
\sum_{\text{all } x} e^{tx} f_n(x) \to \sum_{\text{all } x} e^{tx} f(x) \quad \text{as } n \to \infty \text{ for each } t \tag{10.2}
\]
which says that $M_n(t)$ converges to $M(t)$, the moment generating function of the limiting distribution.
It shouldn’t be too surprising that a very useful converse to this result also holds. (This is strictly an
aside and may be of interest only to those with a thing for infinite series: does convergence of the
individual terms of a series, as in (10.1), always guarantee that the sum of the series also converges,
as in (10.2)?)
Suppose conversely that $X_n$ has moment generating function $M_n(t)$ and $M_n(t) \to M(t)$ for each
t such that $M(t) < \infty$. Then the distribution of $X_n$ approaches the distribution with moment
generating function $M(t)$. For example, we saw in Chapter 5 that a Binomial(n, p) distribution with very
large n and very small p is close to a Poisson distribution with parameter $\mu = np$. Consider the moment
generating function of such a Binomial random variable
\[
\begin{aligned}
M(t) &= \left(pe^t + 1 - p\right)^n \\
&= \left[1 + p\left(e^t - 1\right)\right]^n \\
&= \left[1 + \frac{\mu}{n}\left(e^t - 1\right)\right]^n
\end{aligned}
\]
Now take the limit of this expression as $n \to \infty$. Since in general
\[
\lim_{n \to \infty} \left(1 + \frac{c}{n}\right)^n = e^c
\]
we have
\[
\lim_{n \to \infty} \left[1 + \frac{\mu}{n}\left(e^t - 1\right)\right]^n = e^{\mu(e^t - 1)}
\]
and this is the moment generating function of a Poisson distribution with parameter $\mu$. This shows
a little more formally than we did earlier that the Binomial distribution with small p approaches the
Poisson distribution with mean $\mu = np$ as $n \to \infty$.
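The limit above can be observed numerically by holding $\mu = np$ fixed and letting n grow. The sketch below is illustrative only; the values $\mu = 2$ and $t = 0.5$ and the function names are assumptions made for the example.

```python
import math


def binomial_mgf(t, n, p):
    """M(t) = (p e^t + 1 - p)^n for a Binomial(n, p) random variable."""
    return (p * math.exp(t) + 1 - p) ** n


def poisson_mgf(t, mu):
    """M(t) = exp(mu (e^t - 1)) for a Poisson(mu) random variable."""
    return math.exp(mu * (math.exp(t) - 1))


mu, t = 2.0, 0.5
sizes = (10, 100, 1000, 10000)

# Absolute gap between the Binomial(n, mu/n) m.g.f. and the Poisson(mu) m.g.f. at t.
gaps = [abs(binomial_mgf(t, n, mu / n) - poisson_mgf(t, mu)) for n in sizes]

for n, gap in zip(sizes, gaps):
    print(n, gap)
```

The printed gaps shrink as n increases, which is the convergence $M_n(t) \to M(t)$ evaluated at a single value of t.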
Moment Generating Function of a Continuous Random Variable
For continuous random variables the moment generating function is defined in a manner analogous to
discrete random variables.
Definition 45 Consider a continuous random variable X with probability density function $f(x)$. The
moment generating function (m.g.f.) of X is defined as
\[
M(t) = E(e^{tX}) = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx
\]
We will assume that the moment generating function is defined and finite for values of t in an interval
around 0 (that is, for some $a > 0$, $\int_{-\infty}^{\infty} e^{tx} f(x)\,dx < \infty$ for all $t \in [-a, a]$).
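Example: Moment generating function of a $N(\mu, \sigma^2)$ random variable. If X has the $N(\mu, \sigma^2)$
distribution, then
\[
\begin{aligned}
M(t) = E(e^{tX}) &= \int_{-\infty}^{\infty} e^{tx} f(x)\,dx \\
&= \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tx} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] dx \\
&= \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left[-\frac{1}{2\sigma^2}\left(x^2 - 2\mu x - 2xt\sigma^2 + \mu^2\right)\right] dx \\
&= e^{\mu t + \sigma^2 t^2/2}\, \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left\{-\frac{1}{2\sigma^2}\left[x^2 - 2\left(\mu + t\sigma^2\right)x + \left(\mu + t\sigma^2\right)^2\right]\right\} dx \\
&= e^{\mu t + \sigma^2 t^2/2}\, \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left\{-\frac{1}{2\sigma^2}\left[x - \left(\mu + t\sigma^2\right)\right]^2\right\} dx \\
&= e^{\mu t + \sigma^2 t^2/2} \quad \text{for all } t \in \Re
\end{aligned}
\]
where the last step follows since
\[
\frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left\{-\frac{1}{2\sigma^2}\left[x - \left(\mu + t\sigma^2\right)\right]^2\right\} dx
\]
is just the integral of a $N(\mu + t\sigma^2, \sigma^2)$ probability density function and is therefore equal to one. This
confirms the values we obtained before for the mean and the variance of the Normal distribution:
\[
\begin{aligned}
M'(0) &= e^{\mu t + \sigma^2 t^2/2}\left(\mu + t\sigma^2\right)\Big|_{t=0} = \mu \\
M''(0) &= e^{\mu t + \sigma^2 t^2/2}\left[\sigma^2 + \left(\mu + t\sigma^2\right)^2\right]\Big|_{t=0} = \sigma^2 + \mu^2 = E(X^2)
\end{aligned}
\]
from which we obtain
\[
Var(X) = E(X^2) - [E(X)]^2 = \sigma^2
\]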
Exercise: Suppose $X \sim \text{Exponential}(\theta)$. Use the moment generating function of X given by
\[
M(t) = \frac{1}{1 - \theta t} \quad \text{for } t < \frac{1}{\theta}
\]
(see Problem 15) to show $E(X) = \theta$ and $Var(X) = \theta^2$.
Note: The moment generating function uniquely identifies continuous distributions as well, so that
Theorem 44 also holds for continuous random variables. We use this property in the proof below.
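Proof of Theorem 40: The moment generating function of a Poisson($\mu$) random variable X is given
by
\[
M_X(t) = e^{-\mu + \mu e^t}
\]
Then the standardized random variable is
\[
Z = \frac{X - \mu}{\sqrt{\mu}}
\]
and this has moment generating function
\[
\begin{aligned}
M_Z(t) = E(e^{tZ}) &= E\left[e^{t(X-\mu)/\sqrt{\mu}}\right] \\
&= e^{-t\sqrt{\mu}}\, E\left(e^{tX/\sqrt{\mu}}\right) \\
&= e^{-t\sqrt{\mu}}\, M_X\left(t/\sqrt{\mu}\right)
\end{aligned}
\]
This is easier to work with if we take logarithms,
\[
\begin{aligned}
\ln[M_Z(t)] &= -t\sqrt{\mu} - \mu + \mu e^{t/\sqrt{\mu}} \\
&= \mu\left(e^{t/\sqrt{\mu}} - 1 - \frac{t}{\sqrt{\mu}}\right)
\end{aligned}
\]
Now as $\mu \to \infty$,
\[
\frac{t}{\sqrt{\mu}} \to 0
\]
and
\[
e^{t/\sqrt{\mu}} = 1 + \frac{t}{\sqrt{\mu}} + \frac{1}{2}\frac{t^2}{\mu} + o\left(\mu^{-1}\right)
\]
so
\[
\begin{aligned}
\ln[M_Z(t)] &= \mu\left(e^{t/\sqrt{\mu}} - 1 - \frac{t}{\sqrt{\mu}}\right) \\
&= \mu\left[\frac{t^2}{2\mu} + o\left(\mu^{-1}\right)\right] \to \frac{t^2}{2} \quad \text{as } \mu \to \infty
\end{aligned}
\]
where $o(\mu^{-1})$ represents terms that go to zero faster than $\mu^{-1}$ as $\mu \to \infty$. Therefore the moment
generating function of the standardized Poisson random variable Z approaches $e^{t^2/2}$, which is the moment
generating function of the standard Normal, and this implies that the Poisson distribution approaches
the Normal as $\mu \to \infty$.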
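The limit in this proof can be checked numerically by evaluating the logarithm of the standardized Poisson moment generating function for increasing values of the Poisson mean and comparing it with $t^2/2$. This is a sketch under stated assumptions: the function name and the choice $t = 1$ are hypothetical, and `math.expm1` is used only to reduce rounding error when $t/\sqrt{\mu}$ is small.

```python
import math


def log_mgf_standardized_poisson(t, mu):
    """ln M_Z(t) = mu * (exp(t/sqrt(mu)) - 1 - t/sqrt(mu)) for Z = (X - mu)/sqrt(mu)."""
    x = t / math.sqrt(mu)
    return mu * (math.expm1(x) - x)


t = 1.0
means = (1.0, 100.0, 10000.0)

# Distance from the standard Normal value ln(e^{t^2/2}) = t^2/2.
gaps = [abs(log_mgf_standardized_poisson(t, mu) - t**2 / 2) for mu in means]

for mu, gap in zip(means, gaps):
    print(mu, gap)
```

The gap decreases roughly like $t^3/(6\sqrt{\mu})$, the next term of the series expansion, so the convergence is fairly slow.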
We can similarly use moment generating functions to prove convergence of the distribution of a
Binomial random variable to the Normal distribution (see Problem 10:13).
10.3 Multivariate Moment Generating Functions

Suppose we have two possibly dependent random variables (X, Y) and we wish to characterize their
joint distribution using a moment generating function. Just as the probability function and the cumulative
distribution function are, in this case, functions of two arguments, so is the moment generating
function.
Definition 46 The joint moment generating function of (X, Y) is
\[
M(s, t) = E\left(e^{sX + tY}\right)
\]
Recall that if X, Y are independent random variables and $g_1(X)$ and $g_2(Y)$ are any two functions,
then
\[
E[g_1(X)g_2(Y)] = E[g_1(X)]E[g_2(Y)] \tag{10.3}
\]
and so with $g_1(X) = e^{sX}$ and $g_2(Y) = e^{tY}$ we obtain, for independent random variables X, Y,
\[
M(s, t) = M_X(s)M_Y(t)
\]
the product of the moment generating functions of X and Y respectively.
There is another labour-saving property of moment generating functions for independent random
variables. Suppose X; Y are independent discrete random variables with moment generating functions
$M_X(t)$ and $M_Y(t)$ respectively. Suppose you wish to find the moment generating function of the sum
$Z = X + Y$. One could attack this problem by first determining the probability function of Z,
\[
\begin{aligned}
f_Z(z) = P(Z = z) &= \sum_{\text{all } x} P(X = x, Y = z - x) \\
&= \sum_{\text{all } x} P(X = x)P(Y = z - x) \\
&= \sum_{\text{all } x} f_X(x) f_Y(z - x)
\end{aligned}
\]
and then calculating
\[
E(e^{tZ}) = \sum_{\text{all } z} e^{tz} f_Z(z)
\]
Evidently lots of work! On the other hand, using (10.3) with
\[
g_1(X) = e^{tX} \quad \text{and} \quad g_2(Y) = e^{tY}
\]
gives
\[
\begin{aligned}
M_Z(t) &= E\left[e^{t(X+Y)}\right] \\
&= E\left(e^{tX}\right)E\left(e^{tY}\right) \\
&= M_X(t)M_Y(t)
\end{aligned}
\]
Theorem 47 The moment generating function of the sum of independent random variables is the prod-
uct of the individual moment generating functions.
Example: If X and Y are independent Bernoulli random variables with probability function

x       0       1
f(x)    1 - p   p

then both have moment generating function
\[
M_X(t) = M_Y(t) = 1 - p + pe^t
\]
and so the moment generating function of the sum Z is $M_X(t)M_Y(t) = (1 - p + pe^t)^2$. Similarly, if we
add another independent Bernoulli the moment generating function is $(1 - p + pe^t)^3$, and in general the
moment generating function of the sum of n independent Bernoulli random variables is $(1 - p + pe^t)^n$, which is the moment generating
function of a Binomial(n, p) distribution. This confirms that the sum of independent Bernoulli random
variables has a Binomial(n, p) distribution.
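The two routes described in this section, convolving probability functions versus multiplying moment generating functions, can be compared directly for a sum of Bernoulli random variables. The following is a minimal sketch; the helper names and the values $p = 0.3$, $n = 4$, $t = 0.7$ are chosen only for illustration.

```python
import math


def convolve(f, g):
    """Probability function of X + Y for independent X, Y on 0, 1, 2, ..."""
    h = [0.0] * (len(f) + len(g) - 1)
    for x, fx in enumerate(f):
        for y, gy in enumerate(g):
            h[x + y] += fx * gy
    return h


def mgf(pf, t):
    """M(t) = sum of e^{tx} f(x) for a probability function given as a list."""
    return sum(math.exp(t * x) * prob for x, prob in enumerate(pf))


p, n, t = 0.3, 4, 0.7
bernoulli = [1 - p, p]

# The "lots of work" route: repeated convolution of the Bernoulli probability function.
sum_pf = [1.0]
for _ in range(n):
    sum_pf = convolve(sum_pf, bernoulli)

# Binomial(n, p) probability function for comparison.
binomial_pf = [math.comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]

print(sum_pf)
print(binomial_pf)
print(mgf(sum_pf, t), mgf(bernoulli, t) ** n)
```

Both the probability functions and the moment generating functions agree up to rounding, as Theorem 47 and the Uniqueness Theorem suggest they should.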
We now prove that a linear combination of independent Normal random variables has a Normal
distribution.
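Theorem 48 If $X_i \sim N(\mu_i, \sigma_i^2)$, $i = 1, 2, \ldots, n$ independently and $a_1, a_2, \ldots, a_n$ are constants,
then
\[
\sum_{i=1}^{n} a_i X_i \sim N\left(\sum_{i=1}^{n} a_i \mu_i,\ \sum_{i=1}^{n} a_i^2 \sigma_i^2\right)
\]
Proof: Recall that the moment generating function for a $N(\mu, \sigma^2)$ random variable is
$M(t) = \exp\left(\mu t + \frac{1}{2}\sigma^2 t^2\right)$ for all $t \in \Re$. Since $X_i \sim N(\mu_i, \sigma_i^2)$ the moment generating function of
$a_i X_i$ is
\[
E\left(e^{t a_i X_i}\right) = E\left(e^{(t a_i) X_i}\right) = \exp\left[\mu_i (a_i t) + \frac{1}{2}\sigma_i^2 (a_i t)^2\right] \quad \text{for all } t \in \Re
\]
Since $X_i$, $i = 1, 2, \ldots, n$ are independent random variables, by Theorem 47 the moment generating
function of $Y = \sum_{i=1}^{n} a_i X_i$ is the product of the individual moment generating functions, so
\[
E\left(e^{tY}\right) = \prod_{i=1}^{n} \exp\left[\mu_i (a_i t) + \frac{1}{2}\sigma_i^2 (a_i t)^2\right]
= \exp\left[\left(\sum_{i=1}^{n} a_i \mu_i\right) t + \frac{1}{2}\left(\sum_{i=1}^{n} a_i^2 \sigma_i^2\right) t^2\right]
\]
which we recognize as the moment generating function of a $N\left(\sum_{i=1}^{n} a_i \mu_i,\ \sum_{i=1}^{n} a_i^2 \sigma_i^2\right)$ random variable.
Therefore by the Uniqueness Theorem
\[
\sum_{i=1}^{n} a_i X_i \sim N\left(\sum_{i=1}^{n} a_i \mu_i,\ \sum_{i=1}^{n} a_i^2 \sigma_i^2\right)
\]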
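A small simulation can illustrate (though of course not prove) the mean and variance given by Theorem 48. The constants, means, standard deviations, seed and sample size below are arbitrary values assumed for the example, and the comparison is subject to ordinary sampling error.

```python
import random

random.seed(230)

# Hypothetical constants a_i, means mu_i and standard deviations sigma_i.
a = [2.0, -1.0, 0.5]
mu = [1.0, 0.0, 3.0]
sigma = [1.0, 2.0, 0.5]


def draw():
    """One realization of sum(a_i * X_i) with independent X_i ~ N(mu_i, sigma_i^2)."""
    return sum(ai * random.gauss(m, s) for ai, m, s in zip(a, mu, sigma))


samples = [draw() for _ in range(100_000)]

# Values predicted by Theorem 48.
theory_mean = sum(ai * m for ai, m in zip(a, mu))
theory_var = sum(ai**2 * s**2 for ai, s in zip(a, sigma))

# Sample mean and (population-style) sample variance.
sample_mean = sum(samples) / len(samples)
sample_var = sum((x - sample_mean) ** 2 for x in samples) / len(samples)

print(theory_mean, sample_mean)
print(theory_var, sample_var)
```

Matching the first two moments does not by itself show Normality; that part of the theorem rests on the moment generating function argument in the proof.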
Proof of the Central Limit Theorem
We now wish to use our knowledge of moment generating functions to prove the Central Limit Theorem.
We prove that if $X_i$ are independent identically distributed random variables with $E(X_i) = \mu$,
$Var(X_i) = \sigma^2$, then
\[
S_n = \frac{1}{\sigma\sqrt{n}} \sum_{i=1}^{n} (X_i - \mu)
\]
converges in distribution to N(0, 1) by showing that the corresponding moment generating functions
converge (assuming that they are finite).
To do this we note that the random variable $Y_i = (X_i - \mu)/\sigma$ has $E(Y_i) = 0$ and $Var(Y_i) = 1$.
By a Taylor series expansion of $e^{Yt}$ with remainder term $r(x)$, we have
\[
\begin{aligned}
E\left(e^{tY}\right) &= E\left[1 + tY + \frac{t^2}{2}Y^2 + r(tY)\right] \quad \text{where } \frac{r(x)}{x^2} \to 0 \text{ as } x \to 0 \\
&= 1 + tE(Y) + \frac{t^2}{2}E(Y^2) + o(t^2) \quad \text{as } t \to 0 \\
&= 1 + \frac{t^2}{2} + o(t^2) \quad \text{as } t \to 0
\end{aligned}
\tag{10.4}
\]
where $o(t^2)$ means terms which go to zero faster than $t^2$, that is, $o(t^2)/t^2 \to 0$ as $t \to 0$.[41]
Proof of Theorem 39 - Central Limit Theorem:

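The $X_i$’s are independent and identically distributed random variables. We assume that their common
moment generating function $M_X(t)$ exists. This implies that the moment generating function of the
random variable $Y_i = (X_i - \mu)/\sigma$ also exists and is given by
\[
M(t) = E\left[e^{t(X_i - \mu)/\sigma}\right] = e^{-\mu t/\sigma}\, M_X\left(\frac{t}{\sigma}\right)
\]
By (10.4),
\[
M(t) = 1 + \frac{1}{2}t^2 + o(t^2) \quad \text{where } \frac{o(t^2)}{t^2} \to 0 \text{ as } t \to 0
\]
Replacing t by $t/\sqrt{n}$ we have
\[
M\left(\frac{t}{\sqrt{n}}\right) = 1 + \frac{1}{2}\left(\frac{t}{\sqrt{n}}\right)^2 + o\left(n^{-1}\right) \quad \text{where } \frac{o(n^{-1})}{n^{-1}} \to 0 \text{ as } n \to \infty \tag{10.5}
\]
Since the moment generating function of the sum of independent random variables is the product
of the individual moment generating functions, the moment generating function of $\sum_{i=1}^{n} Y_i$ is equal
to $[M(t)]^n$.
Let $M_n(t)$ be the moment generating function of $S_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i$. Then
\[
M_n(t) = E\left[\exp\left(t\,\frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i\right)\right] = \left[M\left(\frac{t}{\sqrt{n}}\right)\right]^n
\]
and by (10.5)
\[
M_n(t) = \left[1 + \frac{1}{2}\left(\frac{t}{\sqrt{n}}\right)^2 + o\left(n^{-1}\right)\right]^n \quad \text{as } n \to \infty
\]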
[41] If you were REALLY paying attention here, you might wonder about the logic of these steps. If $r(Yt)/t^2 \to 0$ for
random variable Y, how do we know $E[r(Yt)]/t^2 \to 0$ as $t \to 0$? The proof of this is unfortunately beyond the scope of
this course.
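Using $\ln(1 + x) = x + O(x^2)$ as $x \to 0$ we have
\[
\begin{aligned}
\ln[M_n(t)] &= \ln\left\{\left[1 + \frac{1}{2}\left(\frac{t}{\sqrt{n}}\right)^2 + o\left(n^{-1}\right)\right]^n\right\} \\
&= n \ln\left[1 + \frac{1}{2}\left(\frac{t}{\sqrt{n}}\right)^2 + o\left(n^{-1}\right)\right] \\
&= n\left[\frac{1}{2}\left(\frac{t}{\sqrt{n}}\right)^2 + o\left(n^{-1}\right) + O\left(n^{-2}\right)\right] \\
&= n\left[\frac{1}{2}\left(\frac{t}{\sqrt{n}}\right)^2 + o\left(n^{-1}\right)\right] \\
&= \frac{1}{2}t^2 + \frac{o(n^{-1})}{n^{-1}} \to \frac{1}{2}t^2 \quad \text{as } n \to \infty
\end{aligned}
\]
This implies that
\[
M_n(t) = \left[1 + \frac{1}{2}\left(\frac{t}{\sqrt{n}}\right)^2 + o\left(n^{-1}\right)\right]^n \to e^{t^2/2} \quad \text{as } n \to \infty
\]
Since $e^{t^2/2}$ is the moment generating function of a N(0, 1) random variable, by the uniqueness of
moment generating functions we have that $S_n$ converges in distribution to the N(0, 1) distribution.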
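The convergence $[M(t/\sqrt{n})]^n \to e^{t^2/2}$ can be seen numerically for a specific standardized distribution. The sketch below assumes, purely for illustration, that each $Y_i$ takes the values $-1$ and $+1$ with probability 1/2 each (so $E(Y_i) = 0$, $Var(Y_i) = 1$ and $M(t) = \cosh t$); this choice is not taken from the notes.

```python
import math


def mgf_y(t):
    """M(t) for Y = +1 or -1 with probability 1/2 each: (e^t + e^-t)/2 = cosh(t)."""
    return math.cosh(t)


def mgf_sn(t, n):
    """M_n(t) = [M(t / sqrt(n))]^n, the m.g.f. of S_n = (Y_1 + ... + Y_n)/sqrt(n)."""
    return mgf_y(t / math.sqrt(n)) ** n


t = 1.0
limit = math.exp(t**2 / 2)  # m.g.f. of N(0, 1) evaluated at t
sizes = (1, 10, 100, 10000)

gaps = [abs(mgf_sn(t, n) - limit) for n in sizes]

for n, gap in zip(sizes, gaps):
    print(n, gap)
```

The gaps shrink toward zero as n grows, matching the limit derived in the proof for this particular choice of distribution.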

1. When people are asked to make up a random number between 0 and 1, it has been found that the
distribution of the numbers, X, has probability density function close to
\[
f(x) = \begin{cases} 4x, & 0 < x \le 1/2 \\ 4(1 - x), & 1/2 < x < 1 \end{cases}
\]
(rather than the U(0, 1) distribution which would be expected). See Chapter 8, Problem 2 for
E(X) and Var(X).
(a) Let $X_i$ be the i’th “random number”, $i = 1, 2, \ldots, 100$ and $S = \sum_{i=1}^{100} X_i$. Approximate
$P(49.0 \le S \le 50.5)$.
(b) Approximate $P(49.0 \le S \le 50.5)$ if $X_i \sim U(0, 1)$, $i = 1, 2, \ldots, 100$.
2. For Chapter 9, Problem 35, approximate the probability of passing the exam, both with and
without guessing, if (a) each $p_i = 0.45$; (b) each $p_i = 0.55$. What is the best strategy for passing
the course if (a) $p_i = 0.45$ (b) $p_i = 0.55$?
3. In a survey of n voters from a given riding in Canada, the proportion $x/n$ who say they would
vote Conservative is used to estimate p, the probability a voter votes Conservative (x is the
number of Conservative supporters in the survey). If Conservative support is actually 16%, how
large should n be so that with probability 0.95, the estimate will be in error by at most 0.03? Hint:
See Example 10.1.
4. When blood samples are tested for the presence of a disease, samples from 20 people are pooled
and analysed together. If the analysis is negative, none of the 20 people is infected. If the pooled
sample is positive, at least one of the 20 people is infected so they must each be tested separately;
that is, a total of 21 tests is required. The probability a person has the disease is 0.02.
(a) Find the mean and variance of the number of tests required for each group of 20.
(b) For 2000 people, tested in groups of 20, find the mean and variance of the total number of
tests. What assumption(s) has been made about the pooled samples?
(c) Find the approximate probability that more than 800 tests are required for the 2000 people.
5. Suppose 80% of people who buy a new car say they are satisfied with the car when surveyed one
year after purchase. Let X be the number of people in a group of 60 randomly chosen new car
buyers who report satisfaction with their car. Let Y be the number of satisfied owners in a second
(independent) survey of 62 randomly chosen new car buyers. Using a suitable approximation,
find P (jX Y j 3). A continuity correction is expected.
6. Suppose that the unemployment rate in Canada is 7%.
(a) Find the approximate probability that in a random sample of 10,000 persons in the labour
force, the number of unemployed will be between 675 and 725 inclusive. Since n = 10,000
is large a continuity correction is not needed.
(b) How large a random sample would be required so that, with probability 0.95, the proportion
of unemployed persons in the sample is between 6.9% and 7.1%? Hint: See Example 10.1.
7. Requests to a web server are assumed to follow a Poisson process. On average there are two
requests per second.
(a) Give an expression for the probability that between 110 and 135 (inclusive) requests are
received in a one minute interval. Approximate this probability using a suitable approxi-
mation.
(b) Suppose the web server crashes if more than 150 requests are received in a one minute
interval. Give an expression for the probability that the web server crashes. Approximate
this probability using a suitable approximation.
(c) Suppose requests are observed beginning at midnight. Approximate the probability that the
waiting time until the 600’th request is less than four and a half minutes.
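8. The following calculations will be useful for STAT 221/231/241.
(a) Suppose $X \sim \text{Binomial}(n, p)$ where n is large. Approximate
\[
P\left(\frac{X}{n} - 1.645\sqrt{\frac{p(1-p)}{n}} \le p \le \frac{X}{n} + 1.645\sqrt{\frac{p(1-p)}{n}}\right)
\]
You may ignore the continuity correction.
(b) Suppose $X_i \sim \text{Poisson}(\mu)$, $i = 1, 2, \ldots, n$ where n is large. Let $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. Approximate
\[
P\left(\bar{X} - 1.96\sqrt{\frac{\mu}{n}} \le \mu \le \bar{X} + 1.96\sqrt{\frac{\mu}{n}}\right)
\]
You may ignore the continuity correction.
(c) Suppose $X_i \sim \text{Exponential}(\theta)$, $i = 1, 2, \ldots, n$ where n is large. Let $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$.
Approximate
\[
P\left(\bar{X} - 2.576\sqrt{\frac{\theta^2}{n}} \le \theta \le \bar{X} + 2.576\sqrt{\frac{\theta^2}{n}}\right)
\]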
9. Gambling: Your chances of winning or losing money can be calculated in many games of
chance as described here.
Suppose each time you play a game (or place a bet) of $1 that the probability you win (thus
ending up with a profit of $1) is 0.49 and the probability you lose (meaning your “profit” is -$1)
is 0.51.
(a) Let S represent your profit after n independent plays or bets. Give a Normal approximation
for the distribution of S.
(b) If n = 20, determine P (S 0). (This is the probability you are “ahead” after 20 plays.)
Also find P (S 0) if n = 50 and n = 100. What do you conclude?
Note: For many casino games (roulette, blackjack) there are bets for which your probability
of winning is only a little less than 0.5. However, as you play more and more times, the
probability you lose (end up “behind”) approaches 1.
(c) Suppose now you are the casino owner. If all players combined place n = 100,000 one
dollar bets in an evening, let Y be your profit. Find the value c with the property that
$P(Y > c) = 0.99$. Explain in words what this means.
10. Gambling: Crown and Anchor: Crown and Anchor is a game that is sometimes played at
charity casinos or just for fun. It can be played with a “wheel of fortune” or with three dice,
in which each die has its six sides labelled with a crown, an anchor, and the four card suits
club, diamond, heart and spade, respectively. You bet an amount (let’s say $1) on one of the six
symbols: let’s suppose you bet on “heart”. The three dice are then rolled simultaneously and you
win $t if t hearts turn up (t = 0, 1, 2, 3).
(a) Let S represent your profits from playing the game n times. Give a Normal approximation
for the distribution of S.
(b) Find (approximately) the probability that S > 0 if
(i) n = 10
(ii) n = 50
(a) Find the moment generating function of X.

(b) Use the moment generating function to determine E (X) and V ar (X).
12. Suppose X has a discrete Uniform distribution on $\{a, a + 1, \ldots, b\}$ with probability function
\[
P(X = x) = \frac{1}{b - a + 1} \quad \text{for } x = a, a + 1, \ldots, b
\]
(b) Use the moment generating function to determine $E(X)$ and $E(X^2)$.
13. Let X be a discrete random variable taking values in the set $\{0, 1, 2\}$ with $E(X) = 1$ and
$E(X^2) = 1.5$.
(a) Find $P(X = x)$, $x = 0, 1, 2$ and thus determine the moment generating function of X.
(b) Determine $E(X^3)$ and $E(X^4)$.
(c) Show that any probability distribution on $\{0, 1, 2\}$ is completely determined by its first two
moments.
14. Find the distributions that correspond to the following moment generating functions:
(a)
\[
M(t) = \frac{1}{3e^{-t} - 2} \quad \text{for } t < \ln(3/2)
\]
(b)
\[
M(t) = e^{2(e^t - 1)} \quad \text{for } t \in \Re
\]
15. Suppose $X \sim \text{Exponential}(\theta)$.

(b) Use the moment generating function to determine E (X) and V ar (X).
16. Let $X_1, X_2, \ldots, X_n$ be independent N(1, 2) random variables. For each of the following random
variables, find the moment generating function and use the Uniqueness Theorem to determine
its distribution.
(a) $Y = -3X_1 + 4$
(b) $T = X_1 + X_2$
(c) $S_n = X_1 + X_2 + \cdots + X_n$
(d) $Z = n^{-1/2}(S_n - n)$
17. Suppose $X \sim \text{Poisson}(\mu_1)$ and $Y \sim \text{Poisson}(\mu_2)$ independently. Use moment generating
functions to prove that $X + Y \sim \text{Poisson}(\mu_1 + \mu_2)$.
18. Suppose X is a continuous random variable with the probability density function
\[
f(x) = \frac{1}{\theta^2}\, x e^{-x/\theta} \quad \text{for } x > 0 \text{ and } \theta > 0
\]

(b) Suppose $X \sim \text{Exponential}(\theta)$ and independently $Y \sim \text{Exponential}(\theta)$. Use moment
generating functions to find the distribution of $S = X + Y$.
19. Recall that the moment generating function of the Binomial random variable X is
\[
M(t) = (1 - p + pe^t)^n \quad \text{for } t \in \Re
\]
Find the moment generating function of the standardized random variable
\[
Z_n = \frac{X - np}{\sqrt{np(1-p)}}
\]
Assume p is fixed and $n \to \infty$. Show that the moment generating function of $Z_n$ satisfies
\[
E\left(e^{Z_n t}\right) \to e^{t^2/2} \quad \text{as } n \to \infty
\]
This result implies that the standardized Binomial random variable $Z_n$ approaches the standard
Normal distribution.
20. A model for stock returns: A common model for stock returns is as follows: the number of
trades N of stock XXX in a given day has a Poisson distribution with parameter . At each trade,
say the i’th trade, the change in the price of the stock is Xi and has a Normal distribution with
mean 0 and variance 2 , say and these changes are independent of one another and independent
of N . Find the moment generating function of the total change in stock price over the day. Is
this a distribution that you recognise? What is its mean and variance?
11. SOLUTIONS TO SECTION PROBLEMS
3.1.1 (a) Each student can choose in 4 ways and they each get to choose.
(i) Suppose we list the points in S in a specific order, for example (choice of student A,
choice of student B, choice of student C), so that the point (1, 2, 3) indicates A chose
section 1, B chose section 2 and C chose section 3. Then S looks like
{(1, 1, 1), (1, 1, 2), (1, 1, 3), ...}
Since each student can choose in 4 ways regardless of the choice of the other two
students, by the multiplication rule S has $4 \times 4 \times 4 = 64$ points.
(ii) To satisfy the condition, the first student can choose in 4 ways and the others then only
have 1 section they can go in. Therefore the probability they are all in the same section
is $\frac{4 \times 1 \times 1}{64} = 1/16$.
(iii) To satisfy the condition, the first to pick has 4 ways to choose, the next has 3 sections
left, and the last has 2 sections left. Therefore the probability they are all in different
sections is $\frac{4 \times 3 \times 2}{64} = 3/8$.
(iv) To satisfy the condition, each has 3 ways to choose a section. Therefore the probability
there is no-one in section 1 is $\frac{3 \times 3 \times 3}{64} = 27/64$.
(b) (i) Now S has $n^s$ points, each a sequence like (1, 2, 3, 2, ...) of length s.
(ii) $P(\text{all in same section}) = n \times 1 \times 1 \times \cdots \times 1 / n^s = 1/n^{s-1}$.
(iii) $P(\text{different sections}) = n(n-1)(n-2)\cdots(n-s+1)/n^s = \frac{n^{(s)}}{n^s}$.
(iv) $P(\text{nobody in section 1}) = (n-1)(n-1)(n-1)\cdots(n-1)/n^s = \frac{(n-1)^s}{n^s}$.
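The three probabilities in part (a) of 3.1.1 can be confirmed by listing all 64 points of the sample space and counting. This is a brute-force sketch for illustration; the helper name `prob` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space: (choice of A, choice of B, choice of C), each a section 1..4.
sections = range(1, 5)
points = list(product(sections, repeat=3))


def prob(event):
    """Probability of an event when all sample points are equally likely."""
    favourable = sum(1 for pt in points if event(pt))
    return Fraction(favourable, len(points))


same_section = prob(lambda pt: len(set(pt)) == 1)
different_sections = prob(lambda pt: len(set(pt)) == 3)
nobody_in_section_1 = prob(lambda pt: 1 not in pt)

print(len(points), same_section, different_sections, nobody_in_section_1)  # 64 1/16 3/8 27/64
```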
3.1.2 (a) There are 26 ways to choose each of the 3 letters, so in all the letters can be chosen in
$26 \times 26 \times 26$ ways. If all letters are the same, there are 26 ways to choose the first letter, and
only 1 way to choose the remaining 2 letters. So $P(\text{all letters the same})$ is $\frac{26 \times 1 \times 1}{26^3} = 1/26^2$.
(b) There are $10 \times 10 \times 10$ ways to choose the 3 digits. The number of ways to choose all even
digits is $4 \times 4 \times 4$. The number of ways to choose all odd digits is $5 \times 5 \times 5$. Therefore
$P(\text{all even or all odd}) = \frac{4^3 + 5^3}{10^3} = 0.189$.
3.1.3 (a) There are 35 symbols in all (26 letters + 9 numbers). The number of different 6-symbol
passwords is $35^6 - 26^6$ (we need to subtract off the $26^6$ arrangements in which only letters
are used, since there must be at least one number). Similarly, we get the number of 7-symbol
and 8-symbol passwords as $35^7 - 26^7$ and $35^8 - 26^8$. The total number of possible
passwords is then
\[
(35^6 - 26^6) + (35^7 - 26^7) + (35^8 - 26^8)
\]
(b) Let N be the answer to part (a) (the total number of possible passwords). Assuming you never
try the same password twice, the probability you find the correct password within the first
1,000 tries is
\[
\begin{aligned}
&P(\text{first password works}) + P(\text{second password works}) + \cdots + P(\text{1000'th password works}) \\
&= \frac{1}{N} + \frac{N-1}{N} \cdot \frac{1}{N-1} + \cdots + \frac{N-1}{N} \cdot \frac{N-2}{N-1} \cdots \frac{1}{N-999} = \frac{1000}{N}
\end{aligned}
\]
3.4.1 There are 7! different orders.
(a) We can stick the even digits together in 3! orders. This block of even digits plus the 4 odd
digits can be arranged in 5! orders. Therefore $P(\text{even together}) = \frac{3!\,5!}{7!} = 1/7$.
(b) For even at ends, there are 3 ways to fill the first place, 2 ways to fill the last place and
5! ways to arrange the middle 5 digits. For odd at ends there are 4 ways to fill the first place,
3 ways to fill the last place and 5! ways to arrange the middle 5 digits. Therefore
$P(\text{even or odd at ends}) = \frac{(3)(2)(5!) + (4)(3)(5!)}{7!} = \frac{3}{7}$.
3.4.2 The total number of arrangements is $\frac{9!}{3!\,2!}$.
(a) E at each end gives $\frac{7!}{2!}$ arrangements of the middle 7 letters. L at each end gives $\frac{7!}{3!}$
arrangements of the middle 7 letters. Therefore
\[
P(\text{word begins and ends with the same letter}) = \frac{\frac{7!}{2!} + \frac{7!}{3!}}{\frac{9!}{3!\,2!}} = \frac{1}{9}
\]
(b) The X, C and N can be “stuck” together in 3! ways to form a single unit. We can then
arrange the 3 E’s, 2 L’s, T, and (XCN) in $\frac{7!}{3!\,2!}$ ways. Therefore
\[
P(\text{XCN together}) = \frac{3! \times \frac{7!}{3!\,2!}}{\frac{9!}{3!\,2!}} = \frac{1}{12}
\]
(c) There is only 1 way to arrange the letters in the order CEEELLNTX. Therefore
\[
P(\text{alphabetical order}) = \frac{1}{\frac{9!}{3!\,2!}} = \frac{12}{9!}
\]
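The counts in 3.4.2 can be verified by brute force: generate every distinct arrangement of the nine letters (taken here from the alphabetical ordering CEEELLNTX given in part (c)) and count the arrangements in each event. This is only a sketch; the helper names are hypothetical, and "together" is read as X, C and N occupying three consecutive positions in any order, matching the 3! factor in the solution.

```python
from fractions import Fraction
from itertools import permutations

# All distinct arrangements of the letters (3 E's, 2 L's, and C, N, T, X once each).
words = set(permutations("CEEELLNTX"))
total = len(words)  # should equal 9!/(3! 2!)


def together(word, letters="XCN"):
    """True if the given (distinct) letters occupy consecutive positions, in any order."""
    positions = sorted(word.index(ch) for ch in letters)
    return positions[-1] - positions[0] == len(letters) - 1


same_ends = Fraction(sum(1 for w in words if w[0] == w[-1]), total)
xcn_together = Fraction(sum(1 for w in words if together(w)), total)
alphabetical = Fraction(sum(1 for w in words if list(w) == sorted(w)), total)

print(total, same_ends, xcn_together, alphabetical)  # 30240 1/9 1/12 1/30240
```

Note that 1/30240 is the same value as $12/9!$ in part (c), since $9! = 362880 = 12 \times 30240$.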
3.5.1 (a) The 8 cars can be chosen in $\binom{160}{8}$ ways. We can choose x with faulty emission controls and
$(8 - x)$ with good ones in $\binom{35}{x}\binom{125}{8-x}$ ways. Therefore
\[
P(\text{at least 3 faulty}) = \frac{\sum_{x=3}^{8} \binom{35}{x}\binom{125}{8-x}}{\binom{160}{8}}
\]
(b) This assumes all $\binom{160}{8}$ combinations are equally likely. This assumption probably doesn’t
hold since the inspector would tend to select older cars or those in bad shape.
3.5.2 (a) The first 6 finishes can be chosen in $\binom{15}{6}$ ways. Choose 4 from numbers 1, 2, ..., 9 in $\binom{9}{4}$
ways and 2 from numbers 10, ..., 15 in $\binom{6}{2}$ ways. Therefore
\[
P(\text{4 single digits in top 6}) = \frac{\binom{9}{4}\binom{6}{2}}{\binom{15}{6}} = \frac{54}{143}
\]
(b) Need 2 single digit and 2 double digit numbers in the first four and then a single digit.
This occurs in $\binom{9}{2}\binom{6}{2} \times 7$ ways. Therefore
\[
P(\text{fifth digit is the third single digit}) = \frac{\binom{9}{2}\binom{6}{2}}{\binom{15}{4}} \times \frac{7}{11} = \frac{36}{143}
\]
Alternate Solution: There are $15^{(5)}$ ways to choose the first 5 in order. We can choose in
order, 2 double digit and 3 single digit finishers in $6^{(2)} 9^{(3)}$ ways, and then choose which 2
of the first 4 places have double digit numbers in $\binom{4}{2}$ ways. Therefore
\[
P(\text{fifth digit is the third single digit}) = \frac{6^{(2)}\, 9^{(3)} \binom{4}{2}}{15^{(5)}} = \frac{36}{143}
\]
(c) Choose 13 in 1 way and the other 6 numbers in $\binom{12}{6}$ ways (from 1, 2, ..., 12). Therefore
\[
P(\text{13 is highest}) = \frac{\binom{12}{6}}{\binom{15}{7}} = \frac{28}{195}
\]
Alternate Solution: From the $\binom{13}{7}$ ways to choose 7 numbers from 1, 2, ..., 13 subtract
the $\binom{12}{7}$ which don’t include 13 (that is, all 7 chosen from 1, 2, ..., 12). Therefore
\[
P(\text{13 is highest}) = \frac{\binom{13}{7} - \binom{12}{7}}{\binom{15}{7}} = \frac{28}{195}
\]
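The fractions in 3.5.2 can be reproduced with exact integer arithmetic. The sketch below uses `math.comb` for binomial coefficients and `math.perm` for the falling factorials written as $n^{(k)}$ in these notes; the variable names are hypothetical.

```python
from fractions import Fraction
from math import comb, perm

# (a) 4 single-digit numbers among the top 6 finishers.
part_a = Fraction(comb(9, 4) * comb(6, 2), comb(15, 6))

# (b) 2 single and 2 double digits in the first four, then a single digit fifth.
part_b = Fraction(comb(9, 2) * comb(6, 2), comb(15, 4)) * Fraction(7, 11)
part_b_alt = Fraction(perm(6, 2) * perm(9, 3) * comb(4, 2), perm(15, 5))

# (c) 13 is the highest of 7 chosen numbers.
part_c = Fraction(comb(12, 6), comb(15, 7))
part_c_alt = Fraction(comb(13, 7) - comb(12, 7), comb(15, 7))

print(part_a, part_b, part_b_alt, part_c, part_c_alt)  # 54/143 36/143 36/143 28/195 28/195
```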
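4.1.1 Let A be the event “Mandarin or Cantonese speaking”, let B be the event “Spanish speaking”,
and let C be the event “French speaking”. We are given
\[
P(A) = 0.4,\quad P(B) = 0.25,\quad P(C) = 0.5,\quad P(B \cap C) = 0.1,
\]
\[
P(A \cap C) = 0.12,\quad P(A \cap B \cap C) = 0.02 \quad \text{and} \quad P(\bar{A} \cap \bar{B} \cap \bar{C}) = 0.08
\]
See Figure 10.5, where $x = P(A \cap B \cap \bar{C})$. Since
\[
\begin{aligned}
1 &= 0.08 + 0.3 + 0.08 + 0.1 + 0.02 + (0.28 - x) + x + (0.15 - x) \\
&= 1.01 - x
\end{aligned}
\]
therefore $x = 0.01$ and $P(A \cap B \cap \bar{C}) = 0.01$.
[Venn diagram of A, B, C in S. Regions: A only = 0.28 - x; A and B only = x; B only = 0.15 - x;
A and C only = 0.1; B and C only = 0.08; C only = 0.3; A, B and C = 0.02; outside all three = 0.08]
Figure 10.5: Venn diagram for Problem 4.1.1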
4.1.2 $P(M \cap L) = 0.15$, $P(M) = 0.45$, $P(L) = 0.45$. See Figure 10.6.
[Venn diagram of M and L. Regions: M only = 0.30; M and L = 0.15; L only = 0.30]
Figure 10.6: Venn diagram for Problem 4.1.2
The region outside the circles represents females to the right. To make $P(S) = 1$, we need $P(F \cap R) = 0.25$.

4.2.1 (a)
\[
\begin{aligned}
P(A \cup B \cup C) &= P(A) + P(B) + P(C) - P(AB) - P(AC) - P(BC) + P(ABC) \\
&= 1 - 0.1 - [P(AC) + P(BC) - P(ABC)] \\
&= 0.9 - P(AC \cup BC)
\end{aligned}
\]
Therefore $P(A \cup B \cup C) = 0.9$ is the largest value, and this occurs when $P(AC \cup BC) = 0$.
(b) If each point in the sample space has strictly positive probability, then if $P(AC \cup BC) = 0$,
then $AC = \emptyset$ and $BC = \emptyset$, so that A and C are mutually exclusive and B and C are
mutually exclusive. Otherwise we cannot make this determination. While A and C could
be mutually exclusive, it can’t be determined for sure.
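4.2.2
\[
\begin{aligned}
P(A \cup B) = P(A \text{ or } B \text{ occur}) &= 1 - P(A \text{ doesn't occur AND } B \text{ doesn't occur}) \\
&= 1 - P(\bar{A} \cap \bar{B})
\end{aligned}
\]
Alternatively, $S = (A \cup B) \cup (\bar{A} \cap \bar{B})$ is a partition, so $P(S) = 1 \Rightarrow P(A \cup B) + P(\bar{A} \cap \bar{B}) = 1$.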
4.3.1 (a) Points giving a total of 9 are: (3, 6), (4, 5), (5, 4) and (6, 3). The probabilities are
(0.1)(0.3) = 0.03 for (3, 6) and for (6, 3), and (0.2)(0.2) = 0.04 for (4, 5) and for (5, 4).
Therefore P{(3, 6) or (4, 5) or (5, 4) or (6, 3)} = 0.03 + 0.04 + 0.04 + 0.03 = 0.14.
(b) There are $\binom{4}{1}$ arrangements with 1 nine and 3 non-nines. Each arrangement has probability
$(0.14)(0.86)^3$.
Therefore $P(\text{nine on 1 of 4 repetitions}) = \binom{4}{1}(0.14)(0.86)^3 = 0.3562$.
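4.3.2 Let W = {at least 1 woman student} and F = {at least 1 French speaking student}.
\[
P(W \cap F) = 1 - P\left(\overline{W \cap F}\right) = 1 - P(\bar{W} \cup \bar{F}) = 1 - \left[P(\bar{W}) + P(\bar{F}) - P(\bar{W} \cap \bar{F})\right]
\]
But
\[
\begin{aligned}
P(\bar{W} \cap \bar{F}) &= P(\text{no women students and no French speaking students}) \\
&= P(\text{all 10 students are men who don't speak French})
\end{aligned}
\]
and
$P(\text{woman who speaks French}) = P(\text{woman})P(\text{French} \mid \text{woman}) = 0.45 \times 0.20 = 0.09$.
From the Venn diagram, $P(\text{man who does not speak French}) = 0.49$.
[Venn diagram of Woman and French. Regions: Woman only = 0.36; Woman and French = 0.09;
French only = 0.06; neither = 0.49]
Figure 10.7:
Therefore
\[
P(\bar{W} \cap \bar{F}) = (0.49)^{10}, \quad P(\bar{W}) = (0.55)^{10}, \quad P(\bar{F}) = (0.85)^{10}
\]
and
\[
P(W \cap F) = 1 - \left[(0.55)^{10} + (0.85)^{10} - (0.49)^{10}\right] = 0.8014
\]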
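4.3.3 Since $B = (A \cap B) \cup (\bar{A} \cap B)$ and $P(B) = P(A \cap B) + P(\bar{A} \cap B)$ then
\[ P(\bar{A} \cap B) = P(B) - P(A \cap B) \quad (1) \]
By De Morgan's Laws
\[ P(\bar{A} \cap \bar{B}) = P(\overline{A \cup B}) \quad (2) \]
\[ \begin{aligned} & P(\bar{A} \cap \bar{B}) = P(\bar{A}) P(\bar{B}) \quad \text{since } \bar{A} \text{ and } \bar{B} \text{ are independent events} \\ \Leftrightarrow\ & P(\overline{A \cup B}) = P(\bar{A} \cap \bar{B}) = P(\bar{A}) P(\bar{B}) \quad \text{by (2)} \\ \Leftrightarrow\ & 1 - P(A \cup B) = P(\bar{A}) P(\bar{B}) \\ \Leftrightarrow\ & 1 - [P(A) + P(B) - P(A \cap B)] = P(\bar{A}) [1 - P(B)] \\ \Leftrightarrow\ & [1 - P(A)] - [P(B) - P(A \cap B)] = P(\bar{A}) - P(\bar{A}) P(B) \\ \Leftrightarrow\ & P(\bar{A}) - P(\bar{A} \cap B) = P(\bar{A}) - P(\bar{A}) P(B) \quad \text{by (1)} \\ \Leftrightarrow\ & P(\bar{A}) P(B) = P(\bar{A} \cap B) \end{aligned} \]
Therefore $\bar{A}$ and $\bar{B}$ are independent events if and only if $\bar{A}$ and $B$ are independent events.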
4.5.1 Let B = {bus} and L = {late}.
\[ P(B \mid L) = \frac{P(B \cap L)}{P(L)} = \frac{P(L \mid B) P(B)}{P(L \mid B) P(B) + P(L \mid \bar{B}) P(\bar{B})} = \frac{(0.3)(0.2)}{(0.3)(0.2) + (0.7)(0.1)} = \frac{6}{13} \]
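The Bayes' Theorem calculation in 4.5.1 can be checked with exact fractions. The sketch below is illustrative only: the function name `posterior` is hypothetical, and it assumes the reading $P(B) = 0.3$, $P(L \mid B) = 0.2$, $P(L \mid \bar{B}) = 0.1$, which is the assignment consistent with the factors $(0.3)(0.2)$ and $(0.7)(0.1)$ in the solution.

```python
from fractions import Fraction


def posterior(prior, like_event, like_complement):
    """Bayes' Theorem for an event B and its complement.

    prior           = P(B)
    like_event      = P(L | B)
    like_complement = P(L | not B)
    Returns P(B | L).
    """
    numerator = like_event * prior
    denominator = numerator + like_complement * (1 - prior)
    return numerator / denominator


# Assumed reading of the problem data (not stated explicitly in the solution).
p_bus_given_late = posterior(Fraction(3, 10), Fraction(2, 10), Fraction(1, 10))
print(p_bus_given_late)  # 6/13
```

Using `Fraction` avoids rounding, so the result can be compared directly with the fraction in the solution.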
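4.5.2 Let F = {fair} and H = {5 heads}.
\[ \begin{aligned} P(F \mid H) &= \frac{P(F \cap H)}{P(H)} = \frac{P(H \mid F) P(F)}{P(H \mid F) P(F) + P(H \mid \bar{F}) P(\bar{F})} \\ &= \frac{\left(\frac{3}{4}\right) \binom{6}{5} \left(\frac{1}{2}\right)^6}{\left(\frac{3}{4}\right) \binom{6}{5} \left(\frac{1}{2}\right)^6 + \left(\frac{1}{4}\right) \binom{6}{5} (0.8)^5 (0.2)^1} = 0.4170 \end{aligned} \]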
4.5.3 Let H = {defective headlights}, M = {defective muffler}.
\[ P(M \mid H) = \frac{P(M \cap H)}{P(H)} = \frac{P(M \cap H)}{P\left((M \cap H) \cup (\bar{M} \cap H)\right)} = \frac{0.1}{0.1 + 0.15} = 0.4 \]
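4.6.1 By the Binomial Theorem
\[ \sum_{x=0}^{n} \binom{n}{x} a^x = (1 + a)^n \quad \text{for all } n \in \mathbb{Z}^+ \text{ and } a \in \Re \]
Differentiate with respect to $a$ on both sides:
\[ \sum_{x=0}^{n} x \binom{n}{x} a^{x-1} = n(1 + a)^{n-1} \]
Multiply by $a$ to get
\[ \sum_{x=0}^{n} x \binom{n}{x} a^x = na(1 + a)^{n-1} \]
Let $a = \frac{p}{1-p}$. Then
\[ \sum_{x=0}^{n} x \binom{n}{x} \left(\frac{p}{1-p}\right)^x = n \left(\frac{p}{1-p}\right) \left(1 + \frac{p}{1-p}\right)^{n-1} = \frac{np}{1-p} \cdot \frac{1}{(1-p)^{n-1}} = \frac{np}{(1-p)^n} \]
Multiply by $(1-p)^n$:
\[ \sum_{x=0}^{n} x \binom{n}{x} \left(\frac{p}{1-p}\right)^x (1-p)^n = \sum_{x=0}^{n} x \binom{n}{x} p^x (1-p)^{n-x} = \frac{np}{(1-p)^n} (1-p)^n = np \]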
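4.6.3
\[ \begin{aligned} \sum_{x=0}^{\infty} \binom{-k}{x} p^k (p-1)^x &= p^k \sum_{x=0}^{\infty} \binom{-k}{x} (p-1)^x \quad \text{converges since } |p - 1| < 1 \\ &= p^k (1 + p - 1)^{-k} \quad \text{by the Binomial Theorem} \\ &= 1 \end{aligned} \]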
5.1.1 We need $f(x) \ge 0$ and $\sum_{x=0}^{2} f(x) = 1$.
\[ 9c^2 + 9c + c^2 = 10c^2 + 9c = 1 \]
Therefore
\[ 10c^2 + 9c - 1 = 0 \]
\[ (10c - 1)(c + 1) = 0 \]
\[ c = 1/10 \text{ or } -1 \]
But if $c = -1$ we have $f(1) < 0$ which is impossible. Therefore $c = 0.1$.
5.1.2 We are arranging Y F O O O where Y = {you}, F = {friend}, O = {other}. There are $\frac{5!}{3!} = 20$ distinct arrangements.
X = 0: YFOOO, ..., OOOYF has 4 arrangements with Y first and 4 with F first.
X = 1: YOFOO, ..., OOYOF has 3 arrangements with Y first and 3 with F first.
X = 2: YOOFO, OYOOF has 2 with Y first and 2 with F first.
X = 3: YOOOF has 1 with Y first and 1 with F first.
x       0     1     2     3
f(x)    0.4   0.3   0.2   0.1
F(x)    0.4   0.7   0.9   1
5.3.1 (a) Using the Hypergeometric distribution,
\[ f(0) = \frac{\binom{d}{0} \binom{12-d}{7}}{\binom{12}{7}} \]
d       0     1      2      3
f(0)    1     5/12   5/33   5/110
(b) While we could find no tainted tins if d is as big as 3, it is not likely to happen. This implies the box is not likely to have as many as 3 tainted tins.
5.3.2 Considering order, there are $N^{(n)}$ points in S. We can choose which x of the n selections will have "success" in $\binom{n}{x}$ ways. We can arrange the x "successes" in their selected positions in $r^{(x)}$ ways and the $(n - x)$ "failures" in the remaining positions in $(N - r)^{(n-x)}$ ways.
Therefore
\[ f(x) = \frac{\binom{n}{x} r^{(x)} (N - r)^{(n-x)}}{N^{(n)}} \]
with x ranging from $\max(0, n - (N - r))$ to $\min(n, r)$.
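5.4.1 (a) Using Hypergeometric, with $N = 130$, $r = 26$, $n = 6$,
\[ f(2) = \frac{\binom{26}{2} \binom{104}{4}}{\binom{130}{6}} = 0.2506 \]
(b) Using the Binomial approximation to the Hypergeometric
\[ f(2) \approx \binom{6}{2} \left(\frac{26}{130}\right)^2 \left(\frac{104}{130}\right)^4 = 0.2458 \]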
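The two values in 5.4.1 can be reproduced numerically. This is a minimal sketch using only the Python standard library; the function names `hypergeom_pmf` and `binom_pmf` are illustrative and not part of the notes.

```python
from math import comb


def hypergeom_pmf(x, N, r, n):
    """P(X = x) when n items are drawn without replacement from N items,
    r of which are "successes"."""
    return comb(r, x) * comb(N - r, n - x) / comb(N, n)


def binom_pmf(x, n, p):
    """P(X = x) for a Binomial(n, p) random variable."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


exact = hypergeom_pmf(2, N=130, r=26, n=6)
approx = binom_pmf(2, n=6, p=26 / 130)
print(round(exact, 4), round(approx, 4))
```

The approximation is reasonable here because the sample size n = 6 is small relative to N = 130.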
5.4.2 (a) Let A = {camera A is picked} and B = {camera B is picked} and assume shots are independent with a constant failure probability.
\[ \begin{aligned} P(\text{fail twice}) &= P(A) P(\text{fail twice} \mid A) + P(B) P(\text{fail twice} \mid B) \\ &= \frac{1}{2} \binom{10}{2} (0.1)^2 (0.9)^8 + \frac{1}{2} \binom{10}{2} (0.05)^2 (0.95)^8 = 0.1342 \end{aligned} \]
(b)
\[ P(A \mid \text{fail twice}) = \frac{P(A \text{ and fail twice})}{P(\text{fail twice})} = \frac{\frac{1}{2} \binom{10}{2} (0.1)^2 (0.9)^8}{0.1342} = 0.7219 \]
5.5.1 We need $(x - 25)$ "failures" before our 25th "success".
\[ f(x) = \binom{x-1}{x-25} (0.2)^{25} (0.8)^{x-25} \quad \text{or} \quad \binom{x-1}{24} (0.2)^{25} (0.8)^{x-25}, \quad x = 25, 26, 27, \ldots \]
5.5.2 (a) In the first $(x + 17)$ selections we need to get x defective (use Hypergeometric distribution) and then we need a good one on the $(x + 18)$'th draw. Therefore
\[ f(x) = \frac{\binom{200}{x} \binom{2300}{17}}{\binom{2500}{x+17}} \cdot \frac{2283}{2500 - (x + 17)}, \quad x = 0, 1, \ldots, 200 \]
(b) Since 2500 is large and we're only choosing a few of them, we can approximate the Hypergeometric portion of $f(x)$ using Binomial
\[ f(2) \approx \binom{19}{2} \left(\frac{200}{2500}\right)^2 \left(1 - \frac{200}{2500}\right)^{17} \left(\frac{2283}{2481}\right) = 0.2440 \]
5.6.1 Using the Geometric distribution we have
\[ P(x \text{ not leaky found before first leaky}) = (0.7)^x (0.3) = f(x) \]
\[ \begin{aligned} P(X \ge n - 1) &= f(n - 1) + f(n) + f(n + 1) + \cdots \\ &= (0.7)^{n-1} (0.3) + (0.7)^n (0.3) + (0.7)^{n+1} (0.3) + \cdots \\ &= \frac{(0.7)^{n-1} (0.3)}{1 - 0.7} = (0.7)^{n-1} = 0.05 \end{aligned} \]
\[ (n - 1) \log(0.7) = \log(0.05), \quad \text{so } n = 9.4 \]
At least 9.4 cars means 10 or more cars must be checked. Therefore $n = 10$.
5.7.1 (a) Let X be the number who don't show. Then $X \sim$ Binomial(122, 0.03).
\[ \begin{aligned} P(\text{not enough seats}) &= P(X = 0 \text{ or } 1) \\ &= \binom{122}{0} (0.03)^0 (0.97)^{122} + \binom{122}{1} (0.03)^1 (0.97)^{121} \\ &= 0.1161 \end{aligned} \]
(To use a Poisson approximation we need p near 0. That is why we defined "success" as not showing up.)
For Poisson, $\mu = np = (122)(0.03) = 3.66$
\[ f(0) + f(1) = e^{-3.66} + 3.66 e^{-3.66} = 0.1199 \]
(b) Binomial requires all passengers to be independent as to showing up for the flight, and that
each passenger has the same probability of showing up. Passengers are not likely inde-
pendent since people from the same family or company are likely to all show up or all not
show. Even strangers arriving on an earlier incoming flight would not miss their flight inde-
pendently if the flight was delayed. Passengers may all have roughly the same probability
of showing up, but even this is suspect. People travelling in different fare categories or in
different classes (e.g. charter fares versus first class) may have different probabilities of
showing up.
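The Binomial and Poisson values in part (a) of 5.7.1 can be compared directly. The sketch below uses only the standard library; the helper names `binom_cdf` and `poisson_cdf` are illustrative.

```python
from math import comb, exp, factorial


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k + 1))


def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu)."""
    return sum(mu**x * exp(-mu) / factorial(x) for x in range(k + 1))


n, p = 122, 0.03          # 122 tickets sold, each passenger a no-show with probability 0.03
exact = binom_cdf(1, n, p)           # P(0 or 1 no-shows)
approx = poisson_cdf(1, n * p)       # Poisson approximation with mu = np = 3.66
print(round(exact, 4), round(approx, 4))
```

Both calculations rely on the independence and equal-probability assumptions questioned in part (b), so neither should be read as more than a model-based estimate.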
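5.8.1 (a)
\[ \lambda = 3, \quad t = 2.5, \quad \mu = \lambda t = 7.5 \]
\[ f(6) = \frac{7.5^6 e^{-7.5}}{6!} = 0.1367 \]
(b)
\[ \begin{aligned} P\left(2 \text{ in 1st minute} \mid 6 \text{ in } 2\tfrac{1}{2} \text{ minutes}\right) &= \frac{P\left(2 \text{ in 1st min. and 6 in } 2\tfrac{1}{2} \text{ min.}\right)}{P\left(6 \text{ in } 2\tfrac{1}{2} \text{ min.}\right)} \\ &= \frac{P\left(2 \text{ in 1st min. and 4 in last } 1\tfrac{1}{2} \text{ min.}\right)}{P\left(6 \text{ in } 2\tfrac{1}{2} \text{ min.}\right)} \\ &= \frac{\left(\frac{3^2 e^{-3}}{2!}\right) \left(\frac{4.5^4 e^{-4.5}}{4!}\right)}{\frac{7.5^6 e^{-7.5}}{6!}} \\ &= \binom{6}{2} \left(\frac{3}{7.5}\right)^2 \left(\frac{4.5}{7.5}\right)^4 = 0.3110 \end{aligned} \]
Note this is a Binomial probability.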
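5.8.2 Assume that the conditions for a Poisson process are met, with lines as units of "time":
(a) $\lambda = 0.02$ per line; $t = 1$ line; $\mu = \lambda t = 0.02$
\[ f(0) = \frac{\mu^0 e^{-\mu}}{0!} = e^{-0.02} = 0.9802 \]
(b) $\mu_1 = 80 \times 0.02 = 1.6$; $\mu_2 = 90 \times 0.02 = 1.8$
\[ \left(\frac{\mu_1^2 e^{-\mu_1}}{2!}\right) \left(\frac{\mu_2^2 e^{-\mu_2}}{2!}\right) = 0.0692 \]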
5.9.1 Consider a 1 minute period with no occurrences as a "success". Then X has a Geometric distribution. The probability of "success" is
\[ f(0) = \frac{\lambda^0 e^{-\lambda}}{0!} = e^{-\lambda}. \]
Therefore
\[ f(x) = (e^{-\lambda}) (1 - e^{-\lambda})^{x-1}, \quad x = 1, 2, 3, \ldots \]
(There must be $(x - 1)$ failures before the first success.)
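5.9.2 (a) $\mu = 3 \times 1.25 = 3.75$
\[ f(0) = \frac{3.75^0 e^{-3.75}}{0!} = 0.0235 \]
(b) $\left(1 - e^{-3.75}\right)^{14} e^{-3.75}$, using a Geometric distribution
(c) Use a Binomial distribution
\[ f(x) = \binom{100}{x} \left(e^{-3.75}\right)^x \left(1 - e^{-3.75}\right)^{100-x} \]
Approximate this by Poisson with $\mu = np = 100 e^{-3.75} \approx 2.35$.
\[ f(x) \approx e^{-2.35} \frac{2.35^x}{x!} \quad (n \text{ large, } p \text{ small}). \]
Thus, $P(X \ge 4) = 1 - P(X \le 3) = 1 - 0.789 = 0.211$.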
7.3.1 Let X = the organization's profit. The profit depends on the ticket number picked. Since a 3 digit number can have either all digits equal, two different digits or three different digits, there are 3 possible cases to consider.
Case 1: All digits are the same. There are 10 such tickets, 000, 111, 222, . . . , 999 so the probability of drawing such a ticket is 10/1000. If one of these tickets is drawn the profit is $X = 1000 - 200 = 800$ since the organization takes in $1000 and pays out $200 for the one winning ticket.
Case 2: There are 2 digits the same, for example, 211, 121, 112, or 662, 626, 266, etc. There are 10(9) = 90 ways to choose the 2 numbers and 3 ways to arrange them so there are a total of 90(3) = 270 such tickets and so the probability of drawing such a ticket is 270/1000. If one of these tickets is drawn the profit is $X = 1000 - 3(200) = 400$ since there are 3 winners.
Case 3: Since there are a total of 1000 tickets and only 3 types of tickets, we can find the total number of tickets with 3 different numbers by subtraction. There are $1000 - 10 - 270 = 720$ tickets with 3 different digits. Therefore the probability of drawing a ticket with 3 different digits is 720/1000. Note that if ticket 123 is drawn then the tickets 123, 132, 213, 231, 312, 321 all win. If one of these tickets is drawn the profit is $X = 1000 - 6(200) = -200$.
Therefore the expected profit is
\[ E(X) = 800 \left(\frac{10}{1000}\right) + 400 \left(\frac{270}{1000}\right) + (-200) \left(\frac{720}{1000}\right) = -28 \text{ dollars} \]
that is, on average the organization loses $28.
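The three-case count in 7.3.1 can be cross-checked by brute force over all 1000 tickets. The sketch below assumes, as the solution does, that the organization takes in $1000 in total and pays $200 to every ticket whose digits are a rearrangement of the drawn number; the function name `expected_profit` is illustrative.

```python
from itertools import permutations


def expected_profit(income=1000, payout=200):
    """Average profit over the 1000 equally likely drawn tickets 000..999."""
    total = 0
    for number in range(1000):
        digits = f"{number:03d}"
        # Every distinct rearrangement of the drawn digits is a winning ticket.
        winners = len(set(permutations(digits)))
        total += income - payout * winners
    return total / 1000


print(expected_profit())  # -28.0
```

The enumeration never refers to the three cases explicitly, so agreement with the hand calculation is an independent check on the counts 10, 270 and 720.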

7.4.1 Suppose n tickets are sold. Let the random variable $X_n$ be the number of people who show up. Then $X_n$ has a Binomial(n, p) distribution with $p = 0.97$. For the Binomial distribution, $E(X_n) = np$. The expected revenue as a function of $n > 120$ is
\[ \begin{aligned} h(n) &= 100 E(X_n) - 500 E\left[(X_n - 120)^+\right] \\ &= 100np - 500 \sum_{x=121}^{n} (x - 120) \binom{n}{x} p^x (1 - p)^{n-x} \end{aligned} \]
where $(X_n - 120)^+ = \max(0, X_n - 120)$. If $n \le 120$, then $h(n) = 100np$ and since this is an increasing function of n, for $n \le 120$, we need only consider the case $n > 120$ in attempting to maximize the function $h(n)$. Consider the values of $h(n)$ for $n = 121, 122, 123, 124$ since the number of tickets sold must be a positive integer.
\[ h(121) = 100(121)(0.97) - 500(0.97)^{121} = 11{,}724.46 \]
\[ h(122) = 100(122)(0.97) - 1000(0.97)^{122} - (500)\, 122 (0.97)^{121} [1 - (0.97)] = 11{,}763.77 \]
\[ h(123) = 100(123)(0.97) - 500 \sum_{x=121}^{123} (x - 120) \binom{123}{x} (0.97)^x (0.03)^{123-x} = 11{,}721.13 \]
\[ h(124) = 100(124)(0.97) - 500 \sum_{x=121}^{124} (x - 120) \binom{124}{x} (0.97)^x (0.03)^{124-x} = 11{,}579 \]
It would appear that the function $h(n)$ for $n = 121, 122, \ldots$ has a maximum at $n = 122$ which would indicate that the optimal number of tickets to be sold is $n = 122$.
Can we prove that $n = 122$ does indeed correspond to a maximum? Note that $X_{n+1} = X_n + Y$ where $X_n$, $Y$ are independent random variables and Y has a Bernoulli(p) distribution with mean $E(Y) = p$. Now
\[ h(n + 1) = 100 E(X_n + Y) - 500 E\left[(X_n + Y - 120)^+\right] \]
and
\[ \begin{aligned} h(n + 1) - h(n) &= 100 E(X_n + Y) - 500 E\left[(X_n + Y - 120)^+\right] - 100 E(X_n) + 500 E\left[(X_n - 120)^+\right] \\ &= 100 E(Y) - 500 \left\{ E\left[(X_n + Y - 120)^+\right] - E\left[(X_n - 120)^+\right] \right\} \\ &= 100p - 500 \left\{ E\left[(X_n + Y - 120)^+\right] - E\left[(X_n - 120)^+\right] \right\} \end{aligned} \]
Let
\[ g(x, y) = (x + y - 120)^+ - (x - 120)^+ = \begin{cases} 1 & x \ge 120,\ y = 1 \\ 0 & \text{otherwise} \end{cases} \]
then
\[ \begin{aligned} h(n + 1) - h(n) &= 100p - 500 \left\{ E\left[(X_n + Y - 120)^+\right] - E\left[(X_n - 120)^+\right] \right\} \\ &= 100p - 500 E[g(X_n, Y)] \end{aligned} \]
Since
\[ E[g(X_n, Y)] = P(X_n \ge 120, Y = 1) = p P(X_n \ge 120) \]
we have
\[ h(n + 1) - h(n) = 100p - 500p P(X_n \ge 120) \]
Since $P(X_n \ge 120)$ is an increasing function of n, $h(n + 1) - h(n)$ is a decreasing function of n. That is, if $h(n_0 + 1) - h(n_0) < 0$ for some value of $n_0 > 120$ (in our case above $n_0 + 1 = 123$ or $n_0 = 122$) then $h(n + 1) - h(n)$ is negative for all $n > n_0$ and $h(n) \le h(n_0)$ for all $n > n_0$ which proves that the maximum value occurs at $n_0$.
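The values of $h(n)$ in 7.4.1 can be evaluated directly from the defining sum. This is a sketch under the assumptions of the solution (revenue of 100 per passenger who shows up, a cost of 500 for each passenger beyond 120 seats, show-up probability 0.97); the function name `expected_revenue` and its parameter names are illustrative.

```python
from math import comb


def expected_revenue(n, p=0.97, seats=120, fare=100, penalty=500):
    """h(n) = fare * E(X_n) - penalty * E[(X_n - seats)^+], X_n ~ Binomial(n, p)."""
    expected_overflow = sum(
        (x - seats) * comb(n, x) * p**x * (1 - p) ** (n - x)
        for x in range(seats + 1, n + 1)
    )
    return fare * n * p - penalty * expected_overflow


for n in range(121, 125):
    print(n, round(expected_revenue(n), 2))

# Search a modest range of ticket counts for the maximizer of h(n).
best_n = max(range(100, 141), key=expected_revenue)
print(best_n)
```

The search range 100 to 140 is an arbitrary choice for illustration; the argument in the solution shows that no larger n can do better once the increments turn negative.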
7.4.2 (a) Let X be the number of words needing correction and let T be the time to type the passage. Then $X \sim$ Binomial(450, 0.04) and $T = 450 + 15X$. X has mean $np = 18$ and variance $np(1 - p) = 17.28$.
\[ E(T) = E(450 + 15X) = 450 + 15E(X) = 450 + (15)(18) = 720 \]
\[ Var(T) = Var(450 + 15X) = 15^2 Var(X) = 3888. \]
(b) At 45 words per minute, each word takes $1\tfrac{1}{3}$ seconds. $X \sim$ Binomial(450, 0.02) and
\[ T = 450 \left(1\tfrac{1}{3}\right) + 15X = 600 + 15X \]
$E(X) = 450 \times 0.02 = 9$; $E(T) = 600 + (15)(9) = 735$, so it takes longer on average.
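8.1.1 (a) Since
\[ \int_{-1}^{1} k x^2 \, dx = k \left. \frac{x^3}{3} \right|_{-1}^{1} = \frac{2k}{3} = 1 \quad \text{and therefore } k = \frac{3}{2} \]
(b)
\[ F(x) = \begin{cases} 0 & \text{for } x \le -1 \\ \int_{-1}^{x} \frac{3}{2} x^2 \, dx = \left. \frac{x^3}{2} \right|_{-1}^{x} = \frac{x^3}{2} + \frac{1}{2} & \text{for } -1 < x < 1 \\ 1 & \text{for } x \ge 1 \end{cases} \]
(c)
\[ P(-0.1 < X < 0.2) = F(0.2) - F(-0.1) = 0.504 - 0.4995 = 0.0045 \]
(d)
\[ E(X) = \int_{-1}^{1} x \cdot \frac{3}{2} x^2 \, dx = \frac{3}{2} \int_{-1}^{1} x^3 \, dx = \left. \frac{3}{8} x^4 \right|_{-1}^{1} = 0 \]
\[ E(X^2) = \int_{-1}^{1} x^2 \cdot \frac{3}{2} x^2 \, dx = \left. \frac{3}{10} x^5 \right|_{-1}^{1} = \frac{3}{5} \]
\[ Var(X) = E(X^2) - [E(X)]^2 = \frac{3}{5} \]
(e)
\[ \begin{aligned} F_Y(y) = P(Y \le y) = P(X^2 \le y) &= P(-\sqrt{y} \le X \le \sqrt{y}) \\ &= F_X(\sqrt{y}) - F_X(-\sqrt{y}) = \left[ \frac{(\sqrt{y})^3}{2} + \frac{1}{2} \right] - \left[ \frac{(-\sqrt{y})^3}{2} + \frac{1}{2} \right] = y^{3/2} \end{aligned} \]
Therefore $f(y) = \frac{d}{dy} F_Y(y) = \frac{3}{2} \sqrt{y}$ for $0 \le y < 1$ and is 0 otherwise.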
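8.1.2 (a)
\[ \lim_{x \to \infty} F(x) = 1 = \lim_{x \to \infty} \frac{k x^n}{1 + x^n} = \lim_{x \to \infty} \frac{k}{\frac{1}{x^n} + 1} = k \quad \text{so } k = 1 \]
(b)
\[ f(x) = \frac{d}{dx} F(x) = \frac{n x^{n-1}}{(1 + x^n)^2} \quad \text{for } x > 0 \]
and zero otherwise.
(c) Let m be the median. Then
\[ F(m) = 0.5 = \frac{m^n}{1 + m^n} \]
Therefore $m^n = 1$ and so the median equals 1.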
8.2.1
\[ F(x) = \int_{-1}^{x} \frac{3}{2} u^2 \, du = \frac{x^3 + 1}{2} \quad \text{for } x > -1 \]
If $y = F(x) = \frac{x^3 + 1}{2}$ is a random number between 0 and 1, then $x = (2y - 1)^{1/3}$.
For $y = 0.27125$ we get $x = (-0.4575)^{1/3} = -0.77054$.
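The inverse transform step in 8.2.1 can be sketched in code. One practical detail, which is an implementation note and not part of the solution: in Python a negative number raised to the power 1/3 yields a complex result, so the real cube root is taken through the sign and absolute value. The function names `cdf` and `inverse_cdf` are illustrative.

```python
import math


def cdf(x):
    """F(x) = (x^3 + 1) / 2 on -1 < x < 1."""
    return (x**3 + 1) / 2


def inverse_cdf(y):
    """Solve y = F(x) for x, i.e. x = (2y - 1)^(1/3), using the real cube root."""
    v = 2 * y - 1
    return math.copysign(abs(v) ** (1 / 3), v)


x = inverse_cdf(0.27125)
print(round(x, 5))  # -0.77054
```

Feeding uniform random numbers on (0, 1) through `inverse_cdf` would generate observations from the density $\frac{3}{2}x^2$ on $(-1, 1)$.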
8.3.1 Let the time to disruption be X. Then
\[ P(X \le 8) = F(8) = 1 - e^{-8/\theta} = 0.25 \]
Therefore $e^{-8/\theta} = 0.75$ or $\theta = -8 / \ln(0.75) = 27.81$ hours.

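8.3.2 (a) $F(x) = P(\text{distance} \le x) = 1 - P(\text{distance} > x) = 1 - P(\text{0 flaws or 1 flaw within radius } x)$, where the number of flaws has a Poisson distribution with mean $\mu = \lambda \pi x^2$:
\[ F(x) = 1 - \frac{\mu^0 e^{-\mu}}{0!} - \frac{\mu^1 e^{-\mu}}{1!} = 1 - e^{-\lambda \pi x^2} \left(1 + \lambda \pi x^2\right) \]
\[ f(x) = \frac{d}{dx} F(x) = 2 \lambda^2 \pi^2 x^3 e^{-\lambda \pi x^2} \quad \text{for } x > 0 \]
(b)
\[ \begin{aligned} E(X) &= \int_0^{\infty} x \cdot 2 \lambda^2 \pi^2 x^3 e^{-\lambda \pi x^2} \, dx \\ &= \int_0^{\infty} 2 \lambda^2 \pi^2 x^4 e^{-\lambda \pi x^2} \, dx \quad \text{let } y = \lambda \pi x^2 \text{ with } dx = \frac{dy}{2 \sqrt{\lambda \pi y}} \\ &= \int_0^{\infty} 2 y^2 e^{-y} \frac{dy}{2 \sqrt{\lambda \pi y}} = \frac{1}{\sqrt{\lambda \pi}} \int_0^{\infty} y^{3/2} e^{-y} \, dy \\ &= \frac{1}{\sqrt{\lambda \pi}} \Gamma\left(\frac{5}{2}\right) = \frac{1}{\sqrt{\lambda \pi}} \cdot \frac{3}{2} \Gamma\left(\frac{3}{2}\right) \\ &= \frac{1}{\sqrt{\lambda \pi}} \cdot \frac{3}{2} \cdot \frac{1}{2} \Gamma\left(\frac{1}{2}\right) = \frac{\frac{3}{2} \cdot \frac{1}{2} \sqrt{\pi}}{\sqrt{\lambda \pi}} = \frac{3}{4 \sqrt{\lambda}} \end{aligned} \]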
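8.5.1
\[ \begin{aligned} P(|X - \mu| < \sigma) &= P(\mu - \sigma < X < \mu + \sigma) = P(-1 < Z < 1) \\ &= F(1) - [1 - F(1)] = 0.8413 - (1 - 0.8413) \\ &= 68.26\% \text{ (about 2/3)} \end{aligned} \]
\[ \begin{aligned} P(|X - \mu| < 2\sigma) &= P(\mu - 2\sigma < X < \mu + 2\sigma) = P(-2 < Z < 2) \\ &= F(2) - [1 - F(2)] = 0.9772 - (1 - 0.9772) \\ &= 95.44\% \text{ (about 95\%)} \end{aligned} \]
Similarly
\[ P(|X - \mu| < 3\sigma) = P(-3 < Z < 3) = 99.73\% \text{ (over 99\%)} \]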
9.1.1 (a) The marginal probability functions are:
x        0     1     2          y        0     1     2
f1(x)    0.3   0.2   0.5        f2(y)    0.3   0.4   0.3
Since $f_1(x) f_2(y) = f(x, y)$ does not hold for all $(x, y)$, X and Y are not independent random variables; e.g. $f_1(1) f_2(1) = 0.08 \ne 0.05$.
(b)
\[ f(y \mid X = 0) = \frac{f(0, y)}{f_1(0)} = \frac{f(0, y)}{0.3} \]
y               0     1     2
f(y | X = 0)    0.3   0.5   0.2
(c)
d       -2     -1     0      1      2
f(d)    0.06   0.24   0.29   0.26   0.15
(e.g. $P(D = 0) = f(0, 0) + f(1, 1) + f(2, 2)$)
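9.1.2
\[ f(x, y) = f(x) f(y) = \binom{x + k - 1}{x} \binom{y + \ell - 1}{y} p^{k + \ell} (1 - p)^{x + y} \]
\[ \begin{aligned} f(t) &= \sum_{x=0}^{t} f(x, y = t - x) \\ &= \sum_{x=0}^{t} \binom{x + k - 1}{x} \binom{t - x + \ell - 1}{t - x} p^{k + \ell} (1 - p)^t \\ &= \sum_{x=0}^{t} (-1)^x \binom{-k}{x} (-1)^{t - x} \binom{-\ell}{t - x} p^{k + \ell} (1 - p)^t \\ &= (-1)^t p^{k + \ell} (1 - p)^t \sum_{x=0}^{t} \binom{-k}{x} \binom{-\ell}{t - x} \\ &= (-1)^t \binom{-k - \ell}{t} p^{k + \ell} (1 - p)^t \quad \text{using the Hypergeometric Identity} \\ &= \binom{t + k + \ell - 1}{t} p^{k + \ell} (1 - p)^t, \quad t = 0, 1, 2, \ldots \end{aligned} \]
using the given identity on $(-1)^t \binom{-k - \ell}{t}$. (T has a Negative Binomial distribution.)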
9.2.1 (a) Use a Multinomial distribution.
\[ P(3 \text{ A's}, 11 \text{ B's}, 7 \text{ C's and } 4 \text{ D's}) = \frac{25!}{3! \, 11! \, 7! \, 4!} (0.1)^3 (0.4)^{11} (0.3)^7 (0.2)^4 \]
(b) Group C's and D's into a single category.
\[ P(3 \text{ A's and } 11 \text{ B's}) = \frac{25!}{3! \, 11! \, 11!} (0.1)^3 (0.4)^{11} (0.5)^{11} \]
(c) Of the 21 non D's we need 3 A's, 11 B's and 7 C's. The (conditional) probabilities for the non-D's are: 1/8 for A, 4/8 for B, and 3/8 for C.
(e.g. $P(A \mid \bar{D}) = P(A) / P(\bar{D}) = 0.1 / 0.8 = 1/8$)
Therefore
\[ f(3 \text{ A's}, 11 \text{ B's}, 7 \text{ C's} \mid 4 \text{ D's}) = \frac{21!}{3! \, 11! \, 7!} \left(\frac{1}{8}\right)^3 \left(\frac{4}{8}\right)^{11} \left(\frac{3}{8}\right)^7 \]
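9.2.2 $\mu = 0.6 \times 12 = 7.2$
\[ p_1 = P(\text{fewer than 5 chips}) = \sum_{x=0}^{4} \frac{7.2^x e^{-7.2}}{x!} \]
\[ p_2 = P(\text{more than 9 chips}) = 1 - \sum_{x=0}^{9} \frac{7.2^x e^{-7.2}}{x!} \]
(a) $\binom{12}{3} p_1^3 (1 - p_1)^9$
(b) $\frac{12!}{3! \, 7! \, 2!} p_1^3 p_2^7 (1 - p_1 - p_2)^2$
(c) Given that 7 have > 9 chips, the remaining 5 are of 2 types: under 5 chips, or 5 to 9 chips.
\[ P(< 5 \mid \le 9 \text{ chips}) = \frac{P(< 5 \text{ and } \le 9)}{P(\le 9)} = \frac{p_1}{1 - p_2} \]
Using a Binomial distribution,
\[ P(3 \text{ under } 5 \mid 7 \text{ over } 9) = \binom{5}{3} \left(\frac{p_1}{1 - p_2}\right)^3 \left(1 - \frac{p_1}{1 - p_2}\right)^2 \]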
9.4.1
x        0     1     2          y        0     1
f1(x)    0.2   0.5   0.3        f2(y)    0.3   0.7
\[ E(X) = (0 \times 0.2) + (1 \times 0.5) + (2 \times 0.3) = 1.1 \]
\[ E(Y) = (0 \times 0.3) + (1 \times 0.7) = 0.7 \]
\[ E(X^2) = (0^2 \times 0.2) + (1^2 \times 0.5) + (2^2 \times 0.3) = 1.7 \]
\[ E(Y^2) = 0.7 \]
\[ Var(X) = 1.7 - 1.1^2 = 0.49 \]
\[ Var(Y) = 0.7 - (0.7)^2 = 0.21 \]
\[ E(XY) = (1 \times 1 \times 0.35) + (2 \times 1 \times 0.21) = 0.77 \]
\[ Cov(X, Y) = 0.77 - (1.1)(0.7) = 0 \]
Therefore
\[ \rho = \frac{Cov(X, Y)}{\sqrt{Var(X) Var(Y)}} = 0 \]
While $\rho = 0$ indicates X and Y may be independent (and indeed are in this case), it does not prove that they are independent. It only indicates that there is no linear relationship between X and Y.
9.4.2
(a)
x 2 4 6 y 1 1
3 5
f1 (x) 3=8 3=8 1=4 f2 (y) 8 +p 8 p
3 3 1 3
E(X) = 2 8 + 4 8 + 6 4 = 15=4; E(Y ) = 8 p + 58 p= 1
4 2p;
1 1 1
E(XY ) = 2 8 + 4 4 + + 6 4 p = 54 12p
5 15 15
Cov(X; Y ) = 0 = E(XY ) E(X)E(Y ) ) 4 12p = 16 2 p
Therefore p = 5=72
(b) If X and Y are independent then Cov(X; Y ) = 0, and so p must be 5/72. But if
p = 5=72 then
3 4 1
f1 (2)f2 ( 1) = = 6= f (2; 1)
8 9 6
Therefore X and Y cannot be independent for any value of p
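9.5.1
x        0     1     2
f1(x)    0.5   0.3   0.2
\[ E(X) = (0 \times 0.5) + (1 \times 0.3) + (2 \times 0.2) = 0.7 \]
\[ E(X^2) = (0^2 \times 0.5) + (1^2 \times 0.3) + (2^2 \times 0.2) = 1.1 \]
\[ Var(X) = E(X^2) - [E(X)]^2 = 0.61 \]
\[ \begin{aligned} E(XY) &= \sum_{\text{all } x, y} x y f(x, y) \quad \text{and this has only two non-zero terms} \\ &= (1 \times 1 \times 0.2) + (2 \times 1 \times 0.15) = 0.5 \end{aligned} \]
\[ Cov(X, Y) = E(XY) - E(X) E(Y) = 0.01 \]
\[ \begin{aligned} Var(3X - 2Y) &= 9 Var(X) + (-2)^2 Var(Y) + 2(3)(-2) Cov(X, Y) \\ &= 9(0.61) + 4(0.21) - 12(0.01) = 6.21 \end{aligned} \]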
9.5.2
\[ \rho = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} = 0.5 \]
\[ Cov(X, Y) = 0.5 \sqrt{1.69 \times 4} = 1.3 \]
\[ Var(U) = Var(2X - Y) = 4 \sigma_X^2 + \sigma_Y^2 - 4 Cov(X, Y) = 5.56 \]
Therefore the standard deviation of U = 2.36.

9.5.3
\[ \begin{aligned} Cov(X_{i-1}, X_i) &= Cov(Y_{i-2} + Y_{i-1}, Y_{i-1} + Y_i) \\ &= Cov(Y_{i-2}, Y_{i-1}) + Cov(Y_{i-2}, Y_i) + Cov(Y_{i-1}, Y_{i-1}) + Cov(Y_{i-1}, Y_i) \\ &= 0 + 0 + Var(Y_{i-1}) + 0 = \sigma^2 \end{aligned} \]
\[ Cov(X_i, X_j) = 0 \quad \text{for } j \ne i \pm 1 \]
and
\[ Var(X_i) = Var(Y_{i-1}) + Var(Y_i) = 2\sigma^2 \]
so
\[ Var\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} Var(X_i) + 2 \sum_{i=2}^{n} Cov(X_{i-1}, X_i) = n(2\sigma^2) + 2(n - 1)\sigma^2 = (4n - 2)\sigma^2 \]
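9.6.1 (a)
\[ \begin{aligned} P(8.4 < X < 12.2) &= P\left(\frac{8.4 - 10}{2} < Z < \frac{12.2 - 10}{2}\right) \quad \text{where } Z \sim N(0, 1) \\ &= P(-0.8 < Z < 1.1) = P(Z < 1.1) - P(Z < -0.8) \\ &= P(Z < 1.1) - [1 - P(Z < 0.8)] \\ &= 0.8643 + 0.7881 - 1 = 0.6524 \end{aligned} \]
(b) Since $2Y - X$ is Normally distributed with mean $2(3) - 10 = -4$, and variance $2^2(100) + (-1)^2(4) = 404$ then
\[ \begin{aligned} P(2Y > X) = P(2Y - X > 0) &= P\left(Z > \frac{0 - (-4)}{\sqrt{404}} = 0.20\right) \\ &= P(Z > 0.20) = 1 - P(Z < 0.20) = 1 - 0.5793 = 0.4207 \end{aligned} \]
(c) $\bar{Y}$ is Normally distributed with mean 3, and variance $100/25 = 4$. Therefore
\[ \begin{aligned} P(\bar{Y} < 0) &= P\left(Z < \frac{0 - 3}{2} = -1.5\right) \quad \text{where } Z \sim N(0, 1) \\ &= P(Z > 1.5) = 1 - P(Z \le 1.5) = 1 - 0.9332 = 0.0668 \end{aligned} \]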
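9.6.2 (a) Since $2X - Y$ is Normally distributed with mean $2(5) - 7 = 3$ and variance $2^2(4) + 9 = 25$ then
\[ \begin{aligned} P(|2X - Y| > 4) &= P(2X - Y > 4) + P(2X - Y < -4) \\ &= P\left(Z > \frac{4 - 3}{5} = 0.20\right) + P\left(Z < \frac{-4 - 3}{5} = -1.40\right) \\ &= 0.42074 + 0.08076 = 0.5015 \end{aligned} \]
(b) Since $\bar{X} \sim N(5, 4/n)$
\[ P\left(|\bar{X} - 5| < 0.1\right) = P\left(|Z| < \frac{0.1}{2 / \sqrt{n}}\right) = P\left(|Z| < 0.05 \sqrt{n}\right) = 0.98 \]
Since $P(|Z| < 2.3263) = 0.98$ we solve $0.05 \sqrt{n} = 2.3263$ to obtain $n = 2164.7$ so $n = 2165$.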
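The sample size calculation in part (b) of 9.6.2 can be sketched with the standard library's Normal distribution. The function name `required_sample_size` and its parameter names are illustrative; the code assumes the same setup as the solution, namely a sample mean with standard deviation $\sigma / \sqrt{n}$.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(sigma, half_width, coverage):
    """Smallest n with P(|Xbar - mu| < half_width) >= coverage
    when Xbar ~ N(mu, sigma^2 / n)."""
    # Two-sided coverage c corresponds to the (1 + c) / 2 quantile of N(0, 1).
    z = NormalDist().inv_cdf((1 + coverage) / 2)
    return ceil((z * sigma / half_width) ** 2)


print(required_sample_size(sigma=2, half_width=0.1, coverage=0.98))  # 2165
```

Rounding up with `ceil` mirrors the step from 2164.7 to 2165 in the solution, since n must be an integer.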
9.7.1 Let
\[ X_i = \begin{cases} 0, & \text{if the } i\text{'th pair is alike} \\ 1, & \text{if the } i\text{'th pair is unalike} \end{cases} \quad i = 1, 2, \ldots, 24. \]
\[ E(X_i) = \sum_{x_i = 0}^{1} x_i f(x_i) = 1 f(1) = P(\text{ON OFF} \cup \text{OFF ON}) = (0.6)(0.4) + (0.4)(0.6) = 0.48 \]
\[ E(X_i^2) = E(X_i) = 0.48 \quad (\text{for } X_i = 0 \text{ or } 1) \]
\[ Var(X_i) = 0.48 - (0.48)^2 = 0.2496 \]
Consider a pair which has no common switch such as $X_1$, $X_3$. Since $X_1$ depends on switches 1 and 2 and $X_3$ on switches 3 and 4, and since the switches are set independently, $X_1$ and $X_3$ are independent and so $Cov(X_1, X_3) = 0$. In fact all pairs are independent if they have no common switch, but may not be independent if the pairs are adjacent. In this case, for example, since $X_i X_{i+1}$ is also an indicator random variable,
\[ \begin{aligned} E(X_i X_{i+1}) &= P(X_i X_{i+1} = 1) \\ &= P(\text{ON OFF ON} \cup \text{OFF ON OFF}) \\ &= (0.6)(0.4)(0.6) + (0.4)(0.6)(0.4) = 0.24 \end{aligned} \]
Therefore
\[ \begin{aligned} Cov(X_i, X_{i+1}) &= E(X_i X_{i+1}) - E(X_i) E(X_{i+1}) \\ &= 0.24 - (0.48)^2 = 0.0096 \end{aligned} \]
\[ E\left(\sum_{i=1}^{24} X_i\right) = \sum_{i=1}^{24} E(X_i) = 24 \times 0.48 = 11.52 \]
\[ \begin{aligned} Var\left(\sum_{i=1}^{24} X_i\right) &= \sum_{i=1}^{24} Var(X_i) + 2 \sum_{i=1}^{23} Cov(X_i, X_{i+1}) = (24 \times 0.2496) + (2 \times 23 \times 0.0096) \\ &= 6.432 \end{aligned} \]
9.7.2 Using $X_i$ as defined, $E(X_i) = \sum_{x_i = 0}^{1} x_i f(x_i) = f(1) = E(X_i^2)$ since $X_i = X_i^2$. We have
\[ E(X_1) = E(X_{24}) = 0.9 \quad \text{since only one cut is needed} \]
\[ E(X_2) = E(X_3) = \cdots = E(X_{23}) = (0.9)^2 = 0.81 \quad \text{since two cuts are needed} \]
\[ Var(X_1) = Var(X_{24}) = 0.9 - (0.9)^2 = 0.09 \]
\[ Var(X_2) = Var(X_3) = \cdots = Var(X_{23}) = 0.81 - (0.81)^2 = 0.1539 \]
Also
\[ Cov(X_i, X_j) = 0 \quad \text{if } j \ne i \pm 1 \text{ since there are no common pieces and cuts are independent.} \]
Since
\[ E(X_i X_{i+1}) = \sum x_i x_{i+1} f(x_i, x_{i+1}) = f(1, 1) = \begin{cases} (0.9)^2 & \text{for } i = 1 \text{ or } i = 23 \text{ (two cuts are needed)} \\ (0.9)^3 & \text{for } i = 2, 3, \ldots, 22 \text{ (three cuts are needed)} \end{cases} \]
we have
\[ Cov(X_i, X_{i+1}) = E(X_i X_{i+1}) - E(X_i) E(X_{i+1}) = \begin{cases} (0.9)^2 - (0.9)(0.9)^2 = 0.081 & \text{for } i = 1 \text{ or } i = 23 \\ (0.9)^3 - (0.9)^2 (0.9)^2 = 0.0729 & \text{for } i = 2, 3, \ldots, 22 \end{cases} \]
Therefore
\[ E\left(\sum_{i=1}^{24} X_i\right) = \sum_{i=1}^{24} E(X_i) = (2 \times 0.9) + (22 \times 0.81) = 19.62 \]
\[ \begin{aligned} Var\left(\sum_{i=1}^{24} X_i\right) &= \sum_{i=1}^{24} Var(X_i) + 2 \sum_{i < j} Cov(X_i, X_j) \\ &= (2 \times 0.09) + (22 \times 0.1539) + 2[(2 \times 0.081) + (21 \times 0.0729)] = 6.9516 \end{aligned} \]
and the standard deviation of $\sum_{i=1}^{24} X_i$ equals $\sqrt{6.9516} = 2.64$.
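10.1.1 Let X be the number germinating. Then $X \sim$ Binomial(100, 0.8). Approximate using a Normal distribution with $\mu = np = 80$ and $\sigma^2 = np(1 - p) = 16$.
\[ \begin{aligned} P(X \ge 75) &= \sum_{x=75}^{100} \binom{100}{x} (0.8)^x (0.2)^{100-x} \\ &\approx P\left(Z > \frac{74.5 - 80}{4}\right) \quad \text{where } Z \sim N(0, 1) \\ &= P(Z > -1.38) \\ &= P(Z \le 1.38) = 0.9162 \end{aligned} \]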
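10.1.2 Let $X_i$ be the cost associated with inspecting part i.
\[ E(X_i) = (0 \times 0.6) + (10 \times 0.3) + (100 \times 0.1) = 13 \]
\[ E(X_i^2) = (0^2 \times 0.6) + (10^2 \times 0.3) + (100^2 \times 0.1) = 1030 \]
\[ Var(X_i) = 1030 - 13^2 = 861 \]
By the Central Limit Theorem $\sum_{i=1}^{80} X_i$ is Normal with mean $80 \times 13 = 1040$ and variance $80 \times 861 = 68880$ approximately. Since $\sum_{i=1}^{80} X_i$ increases in $10 increments,
\[ \begin{aligned} P\left(\sum_{i=1}^{80} X_i > 1200\right) &\approx P\left(Z > \frac{1205 - 1040}{\sqrt{68880}}\right) \quad \text{where } Z \sim N(0, 1) \\ &= P(Z > 0.63) \\ &= 1 - P(Z \le 0.63) = 0.2643 \end{aligned} \]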
12. SOLUTIONS TO END OF CHAPTER PROBLEMS
Chapter 2:
2.1 (a) Label the profs A, B, C and D.
S = {AA, AB, AC, AD, BA, BB, BC, BD, CA, CB, CC, CD, DA, DB, DC, DD}
(b) 1/4
2.2 (a) A sample space is {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}. All outcomes are equally probable with probability $\frac{1}{8}$.
(b)
\[ P(\text{two heads}) = P(\{HHT, HTH, THH\}) = \frac{3}{8} \]
(c)
\[ P(\text{two consecutive tails}) = P(\{HTT, TTH\}) = \frac{2}{8} \]
2.3 (a) A suitable sample space is
S = {(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5),
     (2, 1), (3, 1), (4, 1), (5, 1), (3, 2), (4, 2), (5, 2), (4, 3), (5, 3), (5, 4)}
All outcomes are equally probable with probability $\frac{1}{20}$.
(b)
\[ P(\text{both numbers are odd}) = P(\{(1, 3), (1, 5), (3, 5), (3, 1), (5, 1), (5, 3)\}) = \frac{6}{20} \]
(c)
\[ P(\text{two numbers are consecutive}) = P(\{(1, 2), (2, 3), (3, 4), (4, 5), (2, 1), (3, 2), (4, 3), (5, 4)\}) = \frac{8}{20} \]
2.4 (a) Let XWYZ represent the outcome that X is in W’s envelope, W is in X’s envelope, Y is in Y’s envelope and Z is in Z’s envelope. Similarly let ZXYW represent the outcome that Z is in W’s envelope, X is in X’s envelope, Y is in Y’s envelope and W is in Z’s envelope. With this notation the set of all possible outcomes are the 4! = 24 possible arrangements of the letters WXYZ as listed below:
S = {WXYZ, XWYZ, YWXZ, ZWXY,
     WXZY, XWZY, YWZX, ZWYX,
     WYXZ, XYWZ, YXWZ, ZXWY,
     WYZX, XYZW, YXZW, ZXYW,
     WZXY, XZWY, YZWX, ZYXW,
     WZYX, XZYW, YZXW, ZYWX}
(b) A = {WXYZ, WXZY, WYXZ, WYZX, WZXY, WZYX}
B = {XWZY, XYZW, XZWY, YWZX, YZWX, YZXW, ZWXY, ZYXW, ZYWX}
C = {WXZY, WYXZ, WZYX, ZXYW, YXWZ, XWYZ}
D = $\emptyset$
(c)
\[ P(A) = \frac{6}{24}, \quad P(B) = \frac{9}{24}, \quad P(C) = \frac{6}{24}, \quad P(D) = P(\emptyset) = 0 \]
2.5 (a) Let ijk represent the outcome "ball 1 is in box i, ball 2 is in box j and ball 3 is in box k" where i, j, k = 1, 2, 3. Then
S = {111, 222, 333, 112, 121, 211, 113, 131, 311, 221, 212, 122,
     223, 232, 322, 331, 313, 133, 332, 323, 233, 123, 132, 213, 231, 312, 321}
(b) Since A = {222, 333, 223, 232, 322, 332, 323, 233}, $P(A) = \frac{8}{27}$.
Since B = {333}, $P(B) = \frac{1}{27}$.
Since C = {123, 132, 213, 231, 312, 321}, $P(C) = \frac{6}{27} = \frac{2}{9}$.
(c)
\[ P(A) = \frac{(n - 1)^3}{n^3}, \quad P(B) = \frac{(n - 2)^3}{n^3}, \quad P(C) = \frac{n(n - 1)(n - 2)}{n^3} \]
(d)
\[ P(A) = \frac{(n - 1)^k}{n^k}, \quad P(B) = \frac{(n - 2)^k}{n^k}, \quad P(C) = \frac{n(n - 1) \cdots (n - k + 1)}{n^k} \]
2.6 (a) 0.018 (b) 0.020 (c) 18/78 = 0.231
2.7 (b) 0.978

Chapter 3:
3.1
4 6(5) 5 5(4) 10 5(4)
(a) (b) (c)
7(6) 7(6) 7(6)
3.2
6 4!
4!
6(4) 2!2! 2 2!2!
(a) 4 (b) (c)
6 64 64
3.3
5
7 6
7(5) 7 1 2 65 55
(a) (b) = 4 (c) (d) 1 (e)
75 75 7 75 75 75
3.4 (a)
\[ \text{(i) } \frac{(n - 1)^k}{n^k} \qquad \text{(ii) } \frac{n^{(k)}}{n^k} \]
(b) All $n^k$ outcomes are equally likely. That is, all n floors are equally likely to be selected, and each person's selection is unrelated to each other person's selection. Both assumptions are doubtful since people may be travelling together (e.g. same family) and the floors may not have equal traffic (e.g. more likely to use the stairs for going up 1 floor than for 10 floors).
4 12 36
2 4 7
52
13
3.6
!
8! 8! 8!
1 1
(a) 10!
= 10!
(b) 3!3!
10!
(c) 2 3!2!
10!
+ 3!3!
10!
3!3!2!1!1! 3!3!2! 3!3!2! 3!3!2! 3!3!2!
8! 7!
(d) 3!2!
10!
(e) 2!2!10!
3!3!2! 3!3!2!
3.7
10 10
3 1 3 1 10(3)
(a) = (b) =
10(3) 3! 103 3! 103
3.8 (a) The probability that every person has a different birthday is
\[ \frac{365^{(n)}}{365^n} \]
(b)
\[ p(n) = 1 - \frac{365^{(n)}}{365^n} \quad \text{for } n = 1, 2, \ldots, 365 \]
(c) [Plot of $p(n)$ against $n$ for $n$ from 0 to 80.] Since $p(23) = 0.5073$, if there are 23 or more people in the room then the probability at least two people have the same birthday is greater than 0.5.
3.9 (a) $\frac{1}{n}$ (b) $\frac{2}{n}$
3.10 (a) For nine tickets the sets of 3 tickets which form an arithmetic sequence are
A = {{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6}, {5, 6, 7}, {6, 7, 8}, {7, 8, 9},
     {1, 3, 5}, {2, 4, 6}, {3, 5, 7}, {4, 6, 8}, {5, 7, 9},
     {1, 4, 7}, {2, 5, 8}, {3, 6, 9},
     {1, 5, 9}}
and
\[ P(A) = \frac{7 + 5 + 3 + 1}{\binom{9}{3}} \]
(b) For $2n + 1$ tickets
A = {{1, 2, 3}, {2, 3, 4}, ..., {2n - 1, 2n, 2n + 1},
     {1, 3, 5}, {2, 4, 6}, ..., {2n - 3, 2n - 1, 2n + 1},
     ...
     {1, n, 2n - 1}, {2, n + 1, 2n}, {3, n + 2, 2n + 1},
     {1, n + 1, 2n + 1}}
and
\[ \begin{aligned} P(A) &= \frac{(2n - 1) + (2n - 3) + \cdots + 3 + 1}{\binom{2n+1}{3}} = \frac{1 + 3 + \cdots + (2n - 3) + (2n - 1)}{\binom{2n+1}{3}} \\ &= \frac{1 + 2 + 3 + 4 + \cdots + (2n - 3) + (2n - 2) + (2n - 1) - [2 + 4 + \cdots + (2n - 2)]}{\binom{2n+1}{3}} \\ &= \frac{1 + 2 + 3 + 4 + \cdots + (2n - 3) + (2n - 2) + (2n - 1) - 2[1 + 2 + \cdots + (n - 1)]}{\binom{2n+1}{3}} \\ &= \frac{\sum_{i=1}^{2n-1} i - 2 \sum_{i=1}^{n-1} i}{\binom{2n+1}{3}} = \frac{\frac{(2n - 1)(2n)}{2} - 2 \left[\frac{(n - 1) n}{2}\right]}{\binom{2n+1}{3}} = \frac{n^2}{\binom{2n+1}{3}} \end{aligned} \]
3.11 (a)
\[ \text{(i) } \frac{\frac{4!}{2! \, 2!}}{10^4} \qquad \text{(ii) } \frac{4!}{10^4} \qquad \text{(b) } \frac{10^{(4)}}{10^4} \]
3.12 (a)
\[ \frac{\binom{6}{2} \binom{19}{3}}{\binom{25}{5}} \]
(b) Let N = the unknown number of deer in the area. We know that the proportion of these deer which have been tagged is $6/N$. The proportion of deer in the sample of 5 deer which have been tagged is $2/5$. It seems reasonable to estimate the population proportion $6/N$ using the sample proportion $2/5$. Solving $6/N = 2/5$ gives $N = 15$ as an estimate of the number of deer in the area.
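3.13
\[ \text{(a) } \frac{\binom{6}{3} \binom{43}{3}}{\binom{49}{6}} \qquad \text{(b) } \frac{\binom{6}{0} \binom{43}{6}}{\binom{49}{6}} \qquad \text{(c) } \frac{\binom{6}{x} \binom{43}{6-x}}{\binom{49}{6}} \quad \text{for } x = 0, 1, \ldots, 6 \]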
3.14 (a)
5(3) 263
(b)
3 2
5 21
1 2 21 21 26 253
(i) (ii) (iii) (iv) 1
263 263 263 263
(c)
3 (2)
5 21
1 2 21(2) 24 25(3)
(i) (ii) (iii) (iv) 1
26(3) 26(3) 26(3) 26(3)
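3.15 (a) The probability of at least one collision is one minus the probability of no collisions or
\[ \begin{aligned} 1 - \frac{M^{(n)}}{M^n} &= 1 - \left(\frac{M}{M}\right) \left(\frac{M - 1}{M}\right) \left(\frac{M - 2}{M}\right) \cdots \left(\frac{M - (n - 1)}{M}\right) \\ &= 1 - \left(1 - \frac{1}{M}\right) \left(1 - \frac{2}{M}\right) \cdots \left(1 - \frac{n - 1}{M}\right) \end{aligned} \]
(b)
\[ \begin{aligned} 1 - \left(1 - \frac{1}{M}\right) \left(1 - \frac{2}{M}\right) \cdots \left(1 - \frac{n - 1}{M}\right) &\approx 1 - e^{-1/M} e^{-2/M} \cdots e^{-(n-1)/M} \\ &= 1 - \exp\left(-\frac{1}{M} \sum_{i=1}^{n-1} i\right) = 1 - e^{-\frac{1}{M} n(n-1)/2} \end{aligned} \]
(c) We want
\[ 1 - \frac{M^{(n)}}{M^n} \le 0.5 \]
or approximately
\[ 1 - e^{-\frac{1}{M} n(n-1)/2} \le 0.5 \]
Solving
\[ 1 - e^{-\frac{1}{M} n(n-1)/2} = 0.5 \]
gives
\[ n(n - 1) = 2M \log\left(\frac{1}{1 - 0.5}\right) = 2M \log 2 \]
or $n \approx \sqrt{2M \log 2}$. Therefore n should be less than $\sqrt{2M \log 2}$.
(d) $\sqrt{2M \log 2} = M^{1/2} \sqrt{2 \log 2} \approx 1.18 M^{1/2} \approx M^{1/2}$ so if $M = 2^L$, then $\sqrt{2M \log 2} \approx 2^{L/2}$.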
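The exact collision probability in 3.15 (a), the exponential approximation in (b) and the threshold in (c) can be compared numerically. This is a minimal sketch; the function names are illustrative, and the test values M = 365 and M = 2^20 are chosen here only as examples (M = 365 reproduces the birthday setting of Problem 3.8).

```python
from math import exp, log, sqrt


def collision_prob_exact(n, M):
    """1 - M^(n) / M^n: probability of at least one collision among n items
    placed independently and uniformly into M slots."""
    no_collision = 1.0
    for i in range(n):
        no_collision *= (M - i) / M
    return 1 - no_collision


def collision_prob_approx(n, M):
    """Approximation 1 - exp(-n(n-1) / (2M)) from part (b)."""
    return 1 - exp(-n * (n - 1) / (2 * M))


def half_threshold(M):
    """Approximate n at which the collision probability reaches 0.5."""
    return sqrt(2 * M * log(2))


print(round(collision_prob_exact(23, 365), 4))
print(round(collision_prob_approx(23, 365), 4))
print(round(half_threshold(365), 2))
```

The approximation improves as n/M gets smaller, since it rests on $1 - i/M \approx e^{-i/M}$ for each factor.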
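3.16
\[ \text{(a) } 1 - \frac{\binom{48}{3}}{\binom{50}{3}} \qquad \text{(b) } 1 - \frac{\binom{45}{2}}{\binom{47}{2}} \qquad \text{(c) } \frac{\binom{48}{3}}{\binom{50}{5}} \]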
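3.17 Let Q = {heads on quarter} and D = {heads on dime}. Then
\[ \begin{aligned} P(\text{Both heads at same time}) &= P(QD \cup \bar{Q}\bar{D}QD \cup \bar{Q}\bar{D}\bar{Q}\bar{D}QD \cup \cdots) \\ &= (0.6)(0.5) + (0.4)(0.5)(0.6)(0.5) + (0.4)(0.5)(0.4)(0.5)(0.6)(0.5) + \cdots \\ &= \frac{(0.6)(0.5)}{1 - (0.4)(0.5)} = 3/8 \quad \text{by the Geometric series} \end{aligned} \]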
3.18 Solution not provided.
Chapter 4:
4.1 0.75, 0.6, 0.65, 0, 1, 0.35, 1
4.2
\[ P(A) = 0.01 \quad P(B) = 0.72 \quad P(C) = (0.9)^3 \quad P(D) = (0.5)^3 \quad P(E) = (0.5)^2 \]
\[ P(B \cap E) = 0.12 \quad P(B \cup D) = 0.785 \quad P(B \cup D \cup E) = 0.886 \]
\[ P((A \cup B) \cap D) = 0.065 \quad P(A \cup (B \cap D)) = 0.07 \]
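4.3
\[ \begin{aligned} P(A \mid \bar{B}) &= \frac{P(A \cap \bar{B})}{P(\bar{B})} = \frac{P(A) - P(A \cap B)}{1 - P(B)} \\ &= \frac{P(A) - P(B) P(A \mid B)}{1 - P(B)} = \frac{0.3 - (0.4)(0.5)}{1 - 0.4} = \frac{1}{6} \end{aligned} \]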
4.4 (a) $(0.7)^8$ (b) $(0.9)^8$ (c) $(0.6)^8$ (d) $1 - \left[(0.7)^8 + (0.9)^8 - (0.6)^8\right]$
4.5 Since A and B are independent events $P(A \cap B) = (0.3)(0.2) = 0.06$.
Therefore $P(A \cup B) = P(A) + P(B) - P(A \cap B) = 0.3 + 0.2 - 0.06 = 0.44$.
4.6 Since $E = A \cup B$, $F = A \cap B$ and E and F are independent events we have
\[ P(E \cap F) = P((A \cup B) \cap (A \cap B)) = P(A \cap B) \]
and
\[ P(E \cap F) = P(E) P(F) = P(A \cup B) P(A \cap B) \]
which implies
\[ P(A \cup B) P(A \cap B) = P(A \cap B) \]
or
\[ P(A \cap B) [1 - P(A \cup B)] = 0 \]
This statement holds only if $P(A \cap B) = 0$ or $P(A \cup B) = 1$.
But $1 = P(A \cup B) = 1 - P(\bar{A} \cap \bar{B})$ which implies $P(\bar{A} \cap \bar{B}) = 0$. Therefore either $P(A \cap B) = 0$ or $P(\bar{A} \cap \bar{B}) = 0$ as required.
4.7 A necessary and sufficient condition is
\[ \frac{f}{F} = \frac{m}{M} \]
4.8 Note that A and B are independent events since you are given that A and B are independent
events (see solution to 4.3.3).
P (A [ B) = 0:15 + (1 0:3) (0:15) (1 0:3) = 0:745

P (B \ D \ A) P (B \ A)
P (B \ DjA) = = since A D
P (A) P (A)
P (B) P (A)
= = P (B) = 0:7
P (A)
P B[D =P B\D =1 P B\D =1 P DjB P B
=1 (0:8) (0:3) = 0:76

P C \ A[B P C \ A [ (C \ B)
P CjA [ B = =
P A[B P A[B
P (C [ (C \ B)) P (C) P (C)
= = =
P A[B P A[B P A + P (B) P A P (B)
0:1
= = 0:1047
0:85 + 0:7 (0:85) (0:7)
4.9 Let $C_i$ be the event that component i is working, $i = 1, 2, 3, 4$. Using Rule 4b and the fact that the components function independently we have
\[ \begin{aligned} & P(\text{system is working properly}) \\ &= P((C_1 \cap C_2) \cup (C_1 \cap C_4) \cup (C_3 \cap C_4)) \\ &= P(C_1 \cap C_2) + P(C_1 \cap C_4) + P(C_3 \cap C_4) \\ &\quad - P((C_1 \cap C_2) \cap (C_1 \cap C_4)) - P((C_1 \cap C_2) \cap (C_3 \cap C_4)) \\ &\quad - P((C_1 \cap C_4) \cap (C_3 \cap C_4)) + P((C_1 \cap C_2) \cap (C_1 \cap C_4) \cap (C_3 \cap C_4)) \\ &= P(C_1 \cap C_2) + P(C_1 \cap C_4) + P(C_3 \cap C_4) \\ &\quad - P(C_1 \cap C_2 \cap C_4) - P(C_1 \cap C_2 \cap C_3 \cap C_4) \\ &\quad - P(C_1 \cap C_3 \cap C_4) + P(C_1 \cap C_2 \cap C_3 \cap C_4) \\ &= P(C_1) P(C_2) + P(C_1) P(C_4) + P(C_3) P(C_4) \\ &\quad - P(C_1) P(C_2) P(C_4) - P(C_1) P(C_3) P(C_4) \\ &= (0.9)(0.8) + (0.9)(0.6) + (0.7)(0.6) - (0.9)(0.8)(0.6) - (0.9)(0.7)(0.6) \\ &= 0.87 \end{aligned} \]
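The inclusion-exclusion result in 4.9 can be cross-checked by enumerating all 16 working/failed states of the four components. The sketch assumes the component reliabilities 0.9, 0.8, 0.7, 0.6 for $C_1, \ldots, C_4$ that appear in the last lines of the solution; the function name `system_reliability` is illustrative.

```python
from itertools import product


def system_reliability(p=(0.9, 0.8, 0.7, 0.6)):
    """P(system works), where the system works if
    (C1 and C2) or (C1 and C4) or (C3 and C4), with independent components."""
    total = 0.0
    for state in product([True, False], repeat=4):
        c1, c2, c3, c4 = state
        if (c1 and c2) or (c1 and c4) or (c3 and c4):
            # Probability of this exact pattern of working/failed components.
            prob = 1.0
            for works, p_i in zip(state, p):
                prob *= p_i if works else 1 - p_i
            total += prob
    return total


print(round(system_reliability(), 4))  # 0.87
```

Enumeration scales poorly (2^k states for k components) but needs no algebra, which makes it a useful check on a hand-derived formula.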
4.10
5 4
(a) (0:7)3 (0:3)2 (b) (0:7)4 (0:3)1 (c) (0:7)3 (0:3)2
3 2
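4.11 (a) Since students answer independently
\[ \begin{aligned} P(\text{all 3 students get the correct answer}) &= P(A \cap B \cap C) \\ &= P(A) P(B) P(C) \\ &= (0.9)(0.7)(0.4) \\ &= 0.252 \end{aligned} \]
(b)
\[ \begin{aligned} & P(\text{exactly two students get the correct answer}) \\ &= P(A \cap B \cap \bar{C}) + P(A \cap \bar{B} \cap C) + P(\bar{A} \cap B \cap C) \\ &= P(A) P(B) P(\bar{C}) + P(A) P(\bar{B}) P(C) + P(\bar{A}) P(B) P(C) \\ &= (0.9)(0.7)(0.6) + (0.9)(0.3)(0.4) + (0.1)(0.7)(0.4) \\ &= 0.514 \end{aligned} \]
(c)
\[ \begin{aligned} P(\text{C is wrong} \mid \text{2 students correct}) &= \frac{P(\text{C is wrong and 2 students correct})}{P(\text{2 students correct})} \\ &= \frac{P(A \cap B \cap \bar{C})}{0.514} = \frac{0.378}{0.514} = 0.7354 \end{aligned} \]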
4.12 (a) 0.48 (b) $(0.6)(0.5) + (0.48)(0.5) = 0.54$ (c) $\frac{(0.6)(0.5)}{0.54} = \frac{5}{9}$
4.13 (a) $(0.05)(0.3) + (0.04)(0.6) + (0.02)(0.1) = 0.041$
(b) $1 - (1 - 0.041)^{10} = 0.342$
4.14 (a) 0:1225; 0:175 (b) 0:395
4.15 The probability of C for the first $n - 1$ trials and then A occurs on the n'th trial is $r^{n-1} p$. Add over all $n \ge 1$ using the Geometric series.
4.16
4.17 (a) The probability all three positions show a flower is
\[ \left(\frac{2}{10}\right) \left(\frac{6}{10}\right) \left(\frac{2}{10}\right) = 0.024 \]
(b) Suppose there are $m \ge 1$ flowers on wheel 1, $n \ge 1$ flowers on wheel 2, and $10 - m - n \ge 1$ on wheel 3. The probability all three positions show a flower is
\[ \left(\frac{m}{10}\right) \left(\frac{n}{10}\right) \left(\frac{10 - m - n}{10}\right) = \frac{mn(10 - m - n)}{10^3} \]
Let $f(n, m) = mn(10 - m - n)$. We want to minimize $f(n, m)$ subject to the restrictions $m \ge 1$, $n \ge 1$, and $10 - m - n \ge 1$. For each value of m, $f(n, m) = mn(10 - m - n)$ is a quadratic function of n which is minimized for $n = 1$ or $n = 9 - m$. Now $f(1, m) = f(9 - m, m) = m(9 - m)$ which is minimized for $m = 1$ or $m = 8$. Now $f(1, 1) = f(1, 8) = f(8, 1) = 8$ and the values of $(m, n, 10 - m - n)$ which minimize the probability all three positions show a flower are $(1, 1, 8)$, $(1, 8, 1)$, and $(8, 1, 1)$.
4.18 (a) $0.010 + 0.016 + 0.040 = 0.066$
(b) $0.066 + 0.010 = 0.076$
(c) $0.185 + 0.683 + 0.016 + 0.056 = 0.924$ or $1 - 0.076 = 0.924$
(d)
\[ \frac{0.185 + 0.683}{0.066 + 0.185 + 0.683} = 0.929 \]
(e) The events are not independent since
\[ \begin{aligned} 0.010 &= P(\text{unemployed} \cap \text{no certificate, diploma or degree}) \\ &\ne P(\text{unemployed}) P(\text{no certificate, diploma or degree}) = (0.066)(0.076) \end{aligned} \]
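4.19
(a)
\[ \begin{aligned} P(\text{Yes}) &= P(\text{Yes} \mid B) P(B) + P(\text{Yes} \mid A) P(A) \\ &= p \left(\frac{80}{100}\right) + \left(\frac{2}{12}\right) \left(\frac{20}{100}\right) = \frac{4p}{5} + \frac{1}{30} \end{aligned} \]
(b)
\[ p = \frac{(30x/n) - 1}{24} \]
(c)
\[ P(B \mid \text{Yes}) = \frac{P(\text{Yes} \mid B) P(B)}{P(\text{Yes})} = \frac{\frac{4p}{5}}{\frac{4p}{5} + \frac{1}{30}} = \frac{24p}{1 + 24p} \]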
4.20 0:9, 0:061, 0:078
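4.21 (a)
\[ \begin{aligned} P(A) &= P(\text{Message contains the word Viagra}) \\ &= P(A \mid \text{Spam}) P(\text{Spam}) + P(A \mid \text{Not Spam}) P(\text{Not Spam}) \\ &= (0.2)(0.5) + (0.001)(0.5) = 0.1005 \end{aligned} \]
(b)
\[ P(\text{Spam} \mid A) = \frac{P(A \mid \text{Spam}) P(\text{Spam})}{P(A)} = \frac{(0.2)(0.5)}{0.1005} = 0.995 \]
\[ P(\text{Not Spam} \mid A) = 1 - 0.995 = 0.005 \]
(c)
\[ P(\text{declared as Spam} \mid \text{Spam}) = P(A \mid \text{Spam}) = 0.2 \]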
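4.22 (a)
\[ P(A_1 A_2 A_3) = P(A_1 A_2 A_3 \mid \text{Spam})\,P(\text{Spam}) + P(A_1 A_2 A_3 \mid \text{Not Spam})\,P(\text{Not Spam}) \]
\[ = P(A_1 \mid \text{Spam})\,P(A_2 \mid \text{Spam})\,P(A_3 \mid \text{Spam})\,P(\text{Spam}) \]
\[ \quad + P(A_1 \mid \text{Not Spam})\,P(A_2 \mid \text{Not Spam})\,P(A_3 \mid \text{Not Spam})\,P(\text{Not Spam}) \]
\[ = (0.2)(0.1)(0.1)(0.5) + (0.005)(0.004)(0.005)(0.5) = 0.00100005 \]
(b)
\[ P(\text{Spam} \mid A_1 A_2 A_3) = \frac{P(A_1 A_2 A_3 \mid \text{Spam})\,P(\text{Spam})}{P(A_1 A_2 A_3)} = \frac{(0.2)(0.1)(0.1)(0.5)}{0.00100005} = 0.99995 \]
(c)
\[ P(A_1 A_2 \bar{A}_3) = P(A_1 A_2 \bar{A}_3 \mid \text{Spam})\,P(\text{Spam}) + P(A_1 A_2 \bar{A}_3 \mid \text{Not Spam})\,P(\text{Not Spam}) \]
\[ = (0.2)(0.1)(0.9)(0.5) + (0.005)(0.004)(0.995)(0.5) = 0.00900995 \]
\[ P(\text{Spam} \mid A_1 A_2 \bar{A}_3) = \frac{P(A_1 A_2 \bar{A}_3 \mid \text{Spam})\,P(\text{Spam})}{P(A_1 A_2 \bar{A}_3)} = \frac{(0.2)(0.1)(0.9)(0.5)}{0.00900995} = 0.99889 \]
(d)
\[ P(\text{declared as Spam} \mid \text{Spam}) = P(A_1 \cup A_2 \cup A_3 \mid \text{Spam}) \]
\[ = P\left(\overline{\bar{A}_1 \cap \bar{A}_2 \cap \bar{A}_3} \mid \text{Spam}\right) = 1 - P(\bar{A}_1 \cap \bar{A}_2 \cap \bar{A}_3 \mid \text{Spam}) \]
\[ = 1 - P(\bar{A}_1 \mid \text{Spam})\,P(\bar{A}_2 \mid \text{Spam})\,P(\bar{A}_3 \mid \text{Spam}) \]
\[ = 1 - (0.8)(0.9)(0.9) = 0.352 \]
which is larger than 0.2.

(e)
\[ P(\text{declared as Spam}) = P(\text{declared as Spam} \mid \text{Spam})\,P(\text{Spam}) + P(\text{declared as Spam} \mid \text{Not Spam})\,P(\text{Not Spam}) \]
\[ = [1 - (0.8)(0.9)(0.9)](0.5) + [1 - (0.995)(0.996)(0.995)](0.5) = 0.18296755 \]
\[ P(\text{Spam} \mid \text{declared as Spam}) = \frac{P(\text{declared as Spam} \mid \text{Spam})\,P(\text{Spam})}{P(\text{declared as Spam})} = \frac{[1 - (0.8)(0.9)(0.9)](0.5)}{0.18296755} = 0.961919 \]
(f)
\[ P(A_1 \mid \text{declared as Spam}) = \frac{P(\text{declared as Spam} \mid A_1)\,P(A_1)}{P(\text{declared as Spam})} = \frac{(1)\,P(A_1)}{P(\text{declared as Spam})} \]
\[ = \frac{P(A_1 \mid \text{Spam})\,P(\text{Spam}) + P(A_1 \mid \text{Not Spam})\,P(\text{Not Spam})}{P(\text{declared as Spam})} = \frac{(0.2)(0.5) + (0.005)(0.5)}{0.18296755} = 0.560209 \]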
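The calculations in 4.22 can be checked numerically. The following is a minimal Python sketch, not part of the original solution; the variable names are illustrative, and the word probabilities are the ones used in the solution above, with the three words treated as conditionally independent given the message type.

```python
from math import prod

# Probabilities taken from the solution to 4.22 (names are illustrative).
p_spam = 0.5
p_word_given_spam = [0.2, 0.1, 0.1]          # P(A_i | Spam)
p_word_given_not_spam = [0.005, 0.004, 0.005]  # P(A_i | Not Spam)

# (a), (b): all three words present, then Bayes' Theorem.
joint_spam = prod(p_word_given_spam) * p_spam
joint_not_spam = prod(p_word_given_not_spam) * (1 - p_spam)
p_all_three = joint_spam + joint_not_spam
posterior_all_three = joint_spam / p_all_three

# (d), (e): declared as spam when at least one word is present.
p_declared_given_spam = 1 - prod(1 - q for q in p_word_given_spam)
p_declared_given_not_spam = 1 - prod(1 - q for q in p_word_given_not_spam)
p_declared = (p_declared_given_spam * p_spam
              + p_declared_given_not_spam * (1 - p_spam))
posterior_declared = p_declared_given_spam * p_spam / p_declared

print(p_all_three, posterior_all_three)
print(p_declared_given_spam, p_declared, posterior_declared)
```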
4.23 (a) Note that $P(\text{feature present} \mid F) = r\,P(\text{feature present} \mid \bar{F}) = 0.02r$.
\[ P(F \mid \text{feature present}) = \frac{P(\text{feature present} \mid F)\,P(F)}{P(\text{feature present})} \]
\[ = \frac{P(\text{feature present} \mid F)\,P(F)}{P(\text{feature present} \mid F)\,P(F) + P(\text{feature present} \mid \bar{F})\,P(\bar{F})} \]
\[ = \frac{(0.02r)(0.0005)}{(0.02r)(0.0005) + (0.02)(0.9995)} = \frac{r}{r + 1999} \]
For $r = 10$, 30 and 50 we obtain 0.005, 0.0148, 0.0244 respectively.

(b) $P(\text{flagged}) = P(\text{feature present}) = (0.02r)(0.0005) + (0.02)(0.9995)$ and if $r = 50$ we
have $P(\text{flagged}) = (0.02)(50)(0.0005) + (0.02)(0.9995) = 0.02049$, so 2.049% of transactions
will be flagged.
Chapter 5:
5.1 (a)
\[ 1 = \sum_{x \in A} f(x) = 0.1c + 0.2c + 0.5c + c + 0.2c = 2c \quad \text{so } c = 1/2 = 0.5 \]
\[ P(X > 2) = P(X = 3) + P(X = 4) = c + 0.2c = 1.2c = 0.6 \]
(b)
x                    0      1      2      3      4
F(x) = P(X <= x)     0.05   0.15   0.4    0.9    1
5.2 (a) $4k^2 = 1$ so $k = \frac{1}{2} = 0.5$. $P(2 < X \le 4) = P(X \le 4) - P(X \le 2) = 0.5 - 0.2 = 0.3$
(b)
x       1      2      3      4      5      Total
f(x)    0.05   0.15   0.05   0.25   0.5    1
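5.3 (a)
\[ P(X = 5) = P(X \le 5) - P(X \le 4) = \left(1 - 2^{-5}\right) - \left(1 - 2^{-4}\right) = \frac{1}{32} \]
\[ P(X \ge 5) = 1 - P(X \le 4) = 1 - \left(1 - 2^{-4}\right) = \frac{1}{16} \]
(b) For $x = 1, 2, \ldots$
\[ f(x) = P(X = x) = P(X \le x) - P(X \le x - 1) = F(x) - F(x - 1) = \left(1 - 2^{-x}\right) - \left(1 - 2^{-x+1}\right) = 2^{-x}(2 - 1) = 2^{-x} \]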
5.4 (a) (i) Every entry below has denominator $10^{(2)}$.
x        1   2   3   4   5    6    7    8    9    Total
f_X(x)   2   4   6   8   10   12   14   16   18   1
or
\[ f_X(x) = \frac{2x}{10^{(2)}}, \quad x = 1, 2, \ldots, 9 \]
(ii) Every entry below has denominator $10^{(2)}$.
y        1   2   3   4   5   6   7   8   9
f_Y(y)   2   2   4   4   6   6   8   8   10
y        10   11   12   13   14   15   16   17   Total
f_Y(y)   8    8    6    6    4    4    2    2    1
or
\[ f_Y(y) = \begin{cases} \dfrac{10 - |y - 9|}{10^{(2)}}, & y = 1, 3, 5, \ldots, 17 \\[2mm] \dfrac{9 - |y - 9|}{10^{(2)}}, & y = 2, 4, 6, \ldots, 18 \end{cases} \]
(b) (i) Every entry below has denominator $10^2$.
x        0   1   2   3   4   5    6    7    8    9    Total
f_X(x)   1   3   5   7   9   11   13   15   17   19   1
or
\[ f_X(x) = \frac{2x + 1}{10^2}, \quad x = 0, 1, \ldots, 9 \]
(ii) Every entry below has denominator $10^2$.
y        0   1   2   3   4   5   6   7   8
f_Y(y)   1   2   3   4   5   6   7   8   9
y        9    10   11   12   13   14   15   16   17   18   Total
f_Y(y)   10   9    8    7    6    5    4    3    2    1    1
or
\[ f_Y(y) = \frac{10 - |y - 9|}{10^2}, \quad y = 0, 1, \ldots, 18 \]
5.5 (a)
\[ \sum_{x=0}^{\infty} p(1-p)^x = \frac{p}{1 - (1-p)} = 1 \quad \text{by the Geometric series} \]
(b) For $x = 0, 1, \ldots$
\[ P(X < x) = 1 - P(X \ge x) = 1 - \sum_{t=x}^{\infty} p(1-p)^t = 1 - \frac{p(1-p)^x}{1 - (1-p)} = 1 - (1-p)^x \quad \text{by the Geometric series} \]
(c)
\[ P(X \text{ is odd}) = P(X = 1) + P(X = 3) + P(X = 5) + \cdots = p(1-p) + p(1-p)^3 + p(1-p)^5 + \cdots = \frac{p(1-p)}{1 - (1-p)^2} \quad \text{by the Geometric series} \]
(d)
\[ P(X \text{ is divisible by } 3) = P(X = 0) + P(X = 3) + P(X = 6) + \cdots = p + p(1-p)^3 + p(1-p)^6 + \cdots = \frac{p}{1 - (1-p)^3} \]
(e)
\[ P(R = 0) = P(X = 0) + P(X = 4) + P(X = 8) + \cdots = p + p(1-p)^4 + p(1-p)^8 + \cdots = \frac{p}{1 - (1-p)^4} \]
The cases $r = 1, 2, 3$ can be done similarly to obtain
\[ P(R = r) = \frac{p(1-p)^r}{1 - (1-p)^4}, \quad r = 0, 1, 2, 3 \]
5.6 (a) The probability function of $X$ = number of defective chips in a sample of twenty is
\[ P(X = x) = \frac{\binom{50}{x}\binom{950}{20-x}}{\binom{1000}{20}}, \quad x = 0, 1, \ldots, 20 \]
(b)
\[ P(X \ge 2) = \sum_{x=2}^{20} \frac{\binom{50}{x}\binom{950}{20-x}}{\binom{1000}{20}} = 1 - \sum_{x=0}^{1} \frac{\binom{50}{x}\binom{950}{20-x}}{\binom{1000}{20}} \]
(c) Since $n = 20$ draws is small compared to $N = 1000$ = total number of items, this probability
can be approximated using the Binomial approximation
\[ P(X \ge 2) \approx 1 - \sum_{x=0}^{1} \binom{20}{x}(0.05)^x(0.95)^{20-x} = 1 - (0.95)^{20} - 20(0.05)(0.95)^{19} = 0.264 \]
5.7
74 76
4 y 12 y
(a) (b) 150 (c) 0:0176
15 12
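5.8 Let $X$ = the number of corrupted bits. Then $X \sim \text{Binomial}(10^4, 10^{-5})$.

(a)
\[ P(X = 0) = \binom{10^4}{0}\left(10^{-5}\right)^0\left(1 - 10^{-5}\right)^{10^4} = \left(1 - 10^{-5}\right)^{10^4} = 0.904837 \]
(b)
\[ P(X \le 1) = \left(1 - 10^{-5}\right)^{10^4} + \binom{10^4}{1}\left(10^{-5}\right)^1\left(1 - 10^{-5}\right)^{10^4 - 1} = \left(1 - 10^{-5}\right)^{10^4} + 10^4\left(10^{-5}\right)\left(1 - 10^{-5}\right)^{10^4 - 1} = 0.9953216 \]
(c) Since $n = 10^4$ is large and $p = 10^{-5}$ is small, the Poisson approximation to the Binomial
may be used with $\mu = 10^4 \cdot 10^{-5} = 0.1$.
\[ P(X = 0) \approx \frac{(0.1)^0 e^{-0.1}}{0!} = e^{-0.1} = 0.9048374 \]
\[ P(X \le 1) \approx e^{-0.1} + \frac{(0.1)^1 e^{-0.1}}{1!} = 1.1e^{-0.1} = 0.9953212 \]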
5.9 (a)
\[ 1 - \frac{\binom{500}{0}\binom{499500}{10}}{\binom{500000}{10}} = 0.009955209 \]
\[ \approx 1 - \binom{10}{0}(0.001)^0(0.999)^{10} = 1 - (0.999)^{10} \approx 0.009955120 \]
The Binomial approximation to the Hypergeometric is valid since the number of draws, $n = 10$,
is small relative to the total number of items, $N = 500000$.
(b)
\[ 1 - \frac{\binom{500}{0}\binom{499500}{2000}}{\binom{500000}{2000}} = 0.865342 \]
\[ \approx 1 - \binom{2000}{0}(0.001)^0(0.999)^{2000} = 1 - (0.999)^{2000} = 0.8648001 \]
\[ \approx 1 - \frac{e^{-2}\,2^0}{0!} = 1 - e^{-2} = 0.864665 \]
The Binomial approximation to the Hypergeometric is valid since the number of draws,
$n = 2000$, is still small relative to the total number of items, $N = 500000$. The Poisson
approximation to the Binomial is valid since $n = 2000$ is large, and $p = 0.001$ is small.
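The three values in 5.9 can be compared side by side. The sketch below is an illustration only (the function name is hypothetical); it computes the exact Hypergeometric probability of at least one type 1 item as one minus a running product, which avoids very large binomial coefficients, and then evaluates the Binomial and Poisson approximations.

```python
import math

def prob_at_least_one(total_items, type1_items, draws):
    """Exact P(X >= 1) when `draws` items are taken without replacement."""
    p_none = 1.0
    for i in range(draws):
        # Probability that draw i+1 is not type 1, given none so far.
        p_none *= (total_items - type1_items - i) / (total_items - i)
    return 1.0 - p_none

N, r = 500_000, 500
p = r / N  # proportion of type 1 items, 0.001

for n in (10, 2000):
    exact = prob_at_least_one(N, r, n)
    binomial = 1 - (1 - p) ** n      # Binomial approximation
    poisson = 1 - math.exp(-n * p)   # Poisson approximation with mu = n * p
    print(n, exact, binomial, poisson)
```

For $n = 10$ the Binomial value agrees with the exact one to about seven decimal places; for $n = 2000$ the agreement drops to about three decimal places, as in the solution.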
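5.10 (a)
\[ \frac{\binom{r}{x}\binom{N-r}{n-x}}{\binom{N}{n}} = \frac{\dfrac{r^{(x)}}{x!}\,\dfrac{(N-r)^{(n-x)}}{(n-x)!}}{\dfrac{N^{(n)}}{n!}} = \frac{n!}{x!\,(n-x)!}\,\frac{r^{(x)}(N-r)^{(n-x)}}{N^{(n)}} = \binom{n}{x}\frac{r^{(x)}(N-r)^{(n-x)}}{N^{(n)}} \]
Substituting $r = pN$ we obtain
\[ \frac{\binom{r}{x}\binom{N-r}{n-x}}{\binom{N}{n}} = \binom{n}{x}\frac{(pN)^{(x)}(N-pN)^{(n-x)}}{N^{(n)}} = \binom{n}{x}\frac{(pN)^{(x)}\,[(1-p)N]^{(n-x)}}{N^{(x)}\,(N-x)^{(n-x)}} \]
(b) Since
\[ \frac{(pN)^{(x)}}{N^{(x)}} = \frac{(pN)(pN-1)(pN-2)\cdots(pN-x+1)}{N(N-1)(N-2)\cdots(N-x+1)} = p^x\left(\frac{1 - \frac{1}{pN}}{1 - \frac{1}{N}}\right)\left(\frac{1 - \frac{2}{pN}}{1 - \frac{2}{N}}\right)\cdots\left(\frac{1 - \frac{x-1}{pN}}{1 - \frac{x-1}{N}}\right) \]
then
\[ \lim_{N \to \infty} \frac{(pN)^{(x)}}{N^{(x)}} = p^x \]
Similarly
\[ \lim_{N \to \infty} \frac{[(1-p)N]^{(n-x)}}{(N-x)^{(n-x)}} = (1-p)^{n-x} \]
and thus
\[ \lim_{N \to \infty} \binom{n}{x}\frac{(pN)^{(x)}\,[(1-p)N]^{(n-x)}}{N^{(x)}\,(N-x)^{(n-x)}} = \binom{n}{x}p^x(1-p)^{n-x} \]
(c) This result justifies the Binomial approximation to the Hypergeometric when $N$ gets large
but $p = r/N$, the proportion of type 1 items, and $n$, the number of items drawn, are held fixed.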
5.11
\[ P(X = x) = \frac{\binom{35}{x}\binom{70}{7}}{\binom{105}{x+7}} \cdot \frac{63}{105 - (x+7)} \quad \text{for } x = 0, 1, \ldots, 35 \]
5.12 (a) $(0.99)^{20} = 0.8179$; $1 - (0.99)^{20} = 0.1821$

(b) $(0.99)^{30} = 0.7397$
(c)
\[ \sum_{y=96}^{\infty} \binom{y+3}{3}(0.01)^4(0.99)^y = 1 - \sum_{y=0}^{95} \binom{y+3}{3}(0.01)^4(0.99)^y \]
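5.13 (a)
\[ \left(\frac{8}{9}\right)^{15} = 0.1709 \]
(b)
\[ \left(\frac{8}{9}\right)^{24}\left(\frac{1}{9}\right) = 0.006578 \]
(c)
\[ \left(\frac{8}{9}\right)^{60} + 60\left(\frac{1}{9}\right)\left(\frac{8}{9}\right)^{59} = 0.007249 \]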
5.14 (a)
\[ P(X \ge x) = p(1-p)^x + p(1-p)^{x+1} + p(1-p)^{x+2} + \cdots = \frac{p(1-p)^x}{1 - (1-p)} = (1-p)^x \quad \text{for } x = 0, 1, \ldots \]
(b) $X = 0$
5.15 (b) Let $X$ = number of requests in a one second interval. Then $X \sim \text{Poisson}(2)$.
\[ P(X \ge 3) = 1 - P(X \le 2) = 1 - \sum_{x=0}^{2} \frac{2^x e^{-2}}{x!} = 1 - 5e^{-2} = 0.3233 \]
(c) Let $Y$ = number of requests in a one minute = 60 second interval. Then $Y \sim \text{Poisson}(2 \times 60)$.
\[ P(Y > 125) = \sum_{y=126}^{\infty} \frac{(120)^y e^{-120}}{y!} = 1 - \sum_{y=0}^{125} \frac{(120)^y e^{-120}}{y!} = 0.3038 \quad \text{(calculated using R)} \]
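The solution to 5.15 (c) was computed in R. As an alternative, here is a minimal Python sketch using only the standard library; `poisson_cdf` is a hypothetical helper written for this illustration, and it builds each Poisson term from the previous one rather than evaluating large factorials directly.

```python
import math

def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), summing terms via the recurrence
    f(x) = f(x - 1) * mu / x, starting from f(0) = exp(-mu)."""
    term = math.exp(-mu)
    total = term
    for x in range(1, k + 1):
        term *= mu / x
        total += term
    return total

p_at_least_3 = 1 - poisson_cdf(2, 2)        # part (b), one second
p_more_than_125 = 1 - poisson_cdf(125, 120)  # part (c), one minute
print(round(p_at_least_3, 4), round(p_more_than_125, 4))
```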
5.16 0.9989
5.17 (a) 0.0758 (b) 0.0488 (c) $\binom{10}{y}\left(e^{-10\lambda}\right)^y\left(1 - e^{-10\lambda}\right)^{10-y}$ (d) $\lambda = 0.12$
5.18 (a) 0.0769 (b) 0.2019; 0.4751
5.19 (a)
\[ \binom{7}{5}(0.8)^5(0.2)^2 = 0.2753 \]
(b)
\[ \binom{6}{2}(0.8)^5(0.2)^2 = 0.1966 \]
(c) Let $X$ = number of power failures in one month. Since power failures occur independently
of each other at a uniform rate through the months of the year, with little chance of 2 or more
occurring simultaneously, then $X \sim \text{Poisson}(\mu)$. To determine $\mu$ we note that
\[ 0.8 = P(X = 0) = \frac{\mu^0 e^{-\mu}}{0!} = e^{-\mu} \quad \text{or} \quad \mu = -\ln(0.8) \]
Therefore
\[ P(\text{more than 1 power failure in one month}) = P(X > 1) = 1 - P(X \le 1) = 1 - P(X = 0) - P(X = 1) \]
\[ = 1 - 0.8 - \frac{[-\ln(0.8)]^1 e^{\ln(0.8)}}{1!} = 0.02149 \]
5.20 (a) Let $X$ = number of spruce budworms in a one hectare plot. Then $X \sim \text{Poisson}(\lambda)$ since
spruce budworms are distributed through a forest according to a Poisson process so that the
average is $\lambda$ per hectare.
\[ P(\text{a one hectare plot contains at least } k \text{ spruce budworms}) = P(X \ge k) = 1 - P(X \le k - 1) = 1 - \sum_{x=0}^{k-1} \frac{\lambda^x e^{-\lambda}}{x!} \]
Now
\[ P(\text{at least 1 of } n \text{ one hectare plots contains at least } k \text{ spruce budworms}) \]
\[ = 1 - P(\text{none of } n \text{ one hectare plots contain at least } k \text{ spruce budworms}) \]
\[ = 1 - \binom{n}{0}\left[1 - \sum_{x=0}^{k-1} \frac{\lambda^x e^{-\lambda}}{x!}\right]^0\left[\sum_{x=0}^{k-1} \frac{\lambda^x e^{-\lambda}}{x!}\right]^n = 1 - \left[\sum_{x=0}^{k-1} \frac{\lambda^x e^{-\lambda}}{x!}\right]^n \]
(b) (Could probably argue for other answers also.) Budworms may not be distributed at a uniform
rate over the forest and they may not occur singly.
5.21 (a) 0.2048 (b) 0.0734 (c) 0.428 (d) 0.1404

5.22
(b) $p = \dfrac{(4.5)^{11} e^{-4.5}}{11!} = 0.004264$
(c) $1 - \sum\limits_{x=0}^{10} \dfrac{(4.5)^x e^{-4.5}}{x!} = 0.006669$
(d) $\binom{20}{2} p^2 (1-p)^{18} = 0.003199$
(e) (i) $\binom{999}{11} p^{12} (1-p)^{988} = \binom{999}{11} p^{11} (1-p)^{988} \cdot p = 0.00001244$
(ii) $\binom{999}{11} p^{11} (1-p)^{988} \cdot p \approx \dfrac{\mu^{11} e^{-\mu}}{11!} \cdot p = 0.00001265$
On the first 999 attempts we essentially have a Binomial distribution with $n = 999$ (large),
$p = 0.004264$ (near 0) so the Poisson approximation can be used with $\mu = 999(0.004264) = 4.2601$.
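5.23
(a) $P(\text{no bubbles}) = \dfrac{(0.96)^0 e^{-0.96}}{0!} = e^{-0.96}$
(b) $P(\text{more than one bubble}) = 1 - \left[\dfrac{(0.96)^0 e^{-0.96}}{0!} + \dfrac{(0.96)^1 e^{-0.96}}{1!}\right] = 1 - 1.96e^{-0.96} = 0.2495$
(c) $P(X = x) = \binom{n}{x}\left(e^{-0.96}\right)^x\left(1 - e^{-0.96}\right)^{n-x}$, $x = 0, 1, \ldots, n$
(d)
\[ \sum_{x=11}^{100} \binom{100}{x}\left(1 - 1.96e^{-0.96}\right)^x\left(1.96e^{-0.96}\right)^{100-x} = 1 - \sum_{x=0}^{10} \binom{100}{x}\left(1 - 1.96e^{-0.96}\right)^x\left(1.96e^{-0.96}\right)^{100-x} = 0.9999 \]
(e) $\dfrac{(0.8\lambda)^0 e^{-0.8\lambda}}{0!} = e^{-0.8\lambda} \ge 0.5$ or $\lambda \le -1.25\ln(0.5)$ bubbles per m$^2$
(f) $e^{-0.8\lambda} + (0.8\lambda)e^{-0.8\lambda} \ge 0.95$ which must be solved numerically to obtain $\lambda \approx 0.4442$.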
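The boundary value in 5.23 (f) has no closed form. One way to find it is bisection on the left-hand side minus 0.95, which is decreasing in the rate. The sketch below is illustrative only: the function names, the bracket $[0, 5]$ and the tolerance are assumptions, and the rate symbol is read as $\lambda$.

```python
import math

def prob_at_most_one(rate, area=0.8):
    """P(at most one bubble) for a Poisson count with mean rate * area."""
    mean = rate * area
    return math.exp(-mean) * (1 + mean)

def solve_rate(target=0.95, lo=0.0, hi=5.0, tol=1e-10):
    """Bisection: prob_at_most_one is decreasing, equals 1 at rate 0."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if prob_at_most_one(mid) >= target:
            lo = mid   # requirement still met, so the boundary is higher
        else:
            hi = mid
    return (lo + hi) / 2

max_rate = solve_rate()
print(round(max_rate, 4))
```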
5.24 (a) 0.555 (b) 0.828; 0.965 (c) 0.789; 0.946 (d) $n = 1067$
5.25 (a) $\binom{x-1}{999}(0.3192)^{1000}(0.6808)^{x-1000}$
(b) 0.002, 0.051, 0.350, 0.797

(c) $\binom{3200}{y}(0.3192)^y(0.6808)^{3200-y}$; 0.797
5.26 (a) A key will be assigned to a given list (Success) with probability $1/M$ and not assigned (Failure)
with probability $1 - 1/M$. Since the keys are assigned independently, we have a sequence
of $n$ Bernoulli trials. The probability that exactly $x$ of the $n$ keys are assigned to a given list is
given by the Binomial distribution
\[ \binom{n}{x}\left(\frac{1}{M}\right)^x\left(1 - \frac{1}{M}\right)^{n-x} \quad \text{for } x = 0, 1, \ldots, n \]
(b) If $n$, the number of keys, is large and $M$, the size of the hash table, is large so that $1/M$ is small
then we have
\[ \binom{n}{x}\left(\frac{1}{M}\right)^x\left(1 - \frac{1}{M}\right)^{n-x} \approx \frac{\mu^x e^{-\mu}}{x!} \quad \text{for } x = 0, 1, \ldots \]
by the Poisson approximation to the Binomial, where $\mu = n(1/M) = n/M$.

(c) If $\mu = 10$
\[ P(X \ge x) \approx e^{-10}\left(\frac{10e}{x}\right)^x \]
Thus
\[ P(X \ge 15) \approx e^{-10}\left(\frac{10e}{15}\right)^{15} = 0.3389 \]
and
\[ P(X \ge 20) \approx e^{-10}\left(\frac{10e}{20}\right)^{20} = 0.0210 \]
5.27 (b)
\[ 1 - \sum_{x=0}^{2} \frac{(0.75)^x e^{-0.75}}{x!} = 0.04051 \]
(c) $e^{-0.75(2)} = 0.2231$

(d) $e^{-0.75(3)} = 0.1054$
(e) $0.75e^{-0.75} = 0.3543$
\[ \binom{10}{7}\left(0.75e^{-0.75}\right)^7\left(1 - 0.75e^{-0.75}\right)^3 = 0.02263 \]
5.28 (a) Let $X_3$ = number of bits which are flipped in transmission in a group of three bits. Then
$X_3 \sim \text{Binomial}(3, p)$. A group of three repeated bits will be decoded correctly if none or one
bit is flipped in transmission. Therefore
\[ P(\text{a group of three repeated bits is decoded correctly}) = P(X_3 \le 1) = P(X_3 = 0) + P(X_3 = 1) \]
\[ = \binom{3}{0}p^0(1-p)^3 + \binom{3}{1}p^1(1-p)^2 = (1-p)^3 + 3p(1-p)^2 \]
(b) If no ECC is used then a message of length four is decoded correctly if no bits are flipped in
transmission, which occurs with probability $(1-p)^4$.
(c) If TRC is used then the original message of length four is sent as a string of length twelve.
The message is decoded correctly if each group of three is decoded correctly. Using the result
from (a)
\[ P(\text{original message of length four is decoded correctly}) = \left[(1-p)^3 + 3p(1-p)^2\right]^4 \]
(d) See table given in Problem 5.29 (c). For $p = 0.2$ the probability that the message is not
decoded correctly exceeds 50% if no ECC is used, whereas if TRC is used the probability the
message is decoded correctly is approximately 64%. As $p$ decreases in value the probability the
message is decoded correctly increases for both TRC and no ECC, as one would expect. For TRC
the probability the message is decoded correctly is nearly 1 for $p = 0.01$.
(e) Let $X_5$ = number of bits which are flipped in transmission in a group of five bits. Then
$X_5 \sim \text{Binomial}(5, p)$. A group of five repeated bits will be decoded correctly if none, one or
two bits are flipped in transmission. Therefore
\[ P(\text{a group of five repeated bits is decoded correctly}) = P(X_5 \le 2) = P(X_5 = 0) + P(X_5 = 1) + P(X_5 = 2) \]
\[ = \binom{5}{0}p^0(1-p)^5 + \binom{5}{1}p^1(1-p)^4 + \binom{5}{2}p^2(1-p)^3 = (1-p)^5 + 5p(1-p)^4 + 10p^2(1-p)^3 \]
(f) Let $X_k$ = number of bits which are flipped in transmission in a group of $k$ bits. Then
$X_k \sim \text{Binomial}(k, p)$. A group of $k$ repeated bits will be decoded correctly if $0, 1, \ldots,$ or
$(k-1)/2$ bits are flipped in transmission. Therefore
\[ P(k, p) = P(\text{a group of } k \text{ repeated bits is decoded correctly}) = P\left(X_k \le \frac{k-1}{2}\right) = \sum_{x=0}^{(k-1)/2} \binom{k}{x}p^x(1-p)^{k-x} \]
P(k, p)     k = 3     k = 5     k = 7     k = 9
p = 0.2     0.896     0.9421    0.9667    0.9804
p = 0.1     0.972     0.9914    0.9973    0.9991
p = 0.05    0.9928    0.9988    0.9998    1.0000
As the number of repeated bits, $k$, is increased the probability that a group of $k$ repeated bits is
decoded correctly approaches 1 for each value of $p$.
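The sum in 5.28 (f) is easy to evaluate directly. The following sketch is an illustration under the stated Binomial model; `majority_decode_prob` is a hypothetical name, and $k$ is assumed odd so that $(k-1)/2$ is an integer.

```python
from math import comb

def majority_decode_prob(k, p):
    """P(X_k <= (k - 1) / 2) for X_k ~ Binomial(k, p), with k odd.

    This is the probability that at most (k - 1) / 2 of the k repeated
    bits are flipped, so majority voting recovers the original bit.
    """
    max_flips = (k - 1) // 2
    return sum(comb(k, x) * p**x * (1 - p) ** (k - x)
               for x in range(max_flips + 1))

for p in (0.2, 0.1, 0.05):
    row = [round(majority_decode_prob(k, p), 4) for k in (3, 5, 7, 9)]
    print(p, row)
```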
5.29 (a) A correctable message is received using Hamming(7,4) if none or one of the seven bits sent
are flipped, which occurs with probability
\[ \binom{7}{0}p^0(1-p)^7 + \binom{7}{1}p^1(1-p)^6 = (1-p)^7 + 7p(1-p)^6 \]
(b) See values given in the table in (c).

(c)
No. bits sent   ECC             p = 0.2   p = 0.1   p = 0.05   p = 0.01
4               No ECC          0.4096    0.6561    0.8145     0.9606
12              TRC             0.6445    0.8926    0.9713     0.9988
7               Hamming(7,4)    0.5767    0.8503    0.9556     0.9980
As before, the probabilities increase as the value of $p$ decreases. Of most interest is to compare
TRC, which requires 12 bits, to Hamming(7,4), which requires only 7 bits. Hamming(7,4)
performs almost as well as TRC for fewer bits sent.
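The three rows of the table in (c) follow from the formulas in 5.28 (b), 5.28 (c) and 5.29 (a). A short sketch that evaluates them is given below; the function names are illustrative and are not part of the original notes.

```python
def no_ecc(p):
    """All four message bits arrive unflipped."""
    return (1 - p) ** 4

def trc(p):
    """Each of the four groups of three has at most one flipped bit."""
    group_ok = (1 - p) ** 3 + 3 * p * (1 - p) ** 2
    return group_ok ** 4

def hamming_7_4(p):
    """At most one of the seven transmitted bits is flipped."""
    return (1 - p) ** 7 + 7 * p * (1 - p) ** 6

for p in (0.2, 0.1, 0.05, 0.01):
    print(p, round(no_ecc(p), 4), round(trc(p), 4), round(hamming_7_4(p), 4))
```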
Chapter 7:
7.1 $E(X) = 4$, $E(X^2) = 17.6$, $\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = 17.6 - 16 = 1.6$

$E(X) = 2.5$, $E(X^2) = 7.2$, $\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = 7.2 - 6.25 = 0.95$
7.2 $E(X) = 2.775$, $E(X^2) = 10.275$,

$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = 10.275 - (2.775)^2 = 2.574375$
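7.3 (a) The person wins $\$2^x$ if $x$ tosses are needed for $x = 1, 2, 3, 4, 5$ but loses \$256 if $x > 5$. Note
that
\[ P(X > 5) = \frac{1}{2^6} + \frac{1}{2^7} + \frac{1}{2^8} + \cdots = \frac{1/2^6}{1 - \frac{1}{2}} = \frac{1}{2^5} \]
Let $W$ = winnings. Then the probability function for $W$ is
w           2      2^2      2^3      2^4      2^5      -256     Total
P(W = w)    1/2    1/2^2    1/2^3    1/2^4    1/2^5    1/2^5    1
Therefore
\[ E(W) = 2\left(\frac{1}{2}\right) + 2^2\left(\frac{1}{2^2}\right) + 2^3\left(\frac{1}{2^3}\right) + 2^4\left(\frac{1}{2^4}\right) + 2^5\left(\frac{1}{2^5}\right) + (-256)\left(\frac{1}{2^5}\right) = 5 - 8 = -3 \text{ dollars} \]
(b)
\[ E(W^2) = (2)^2\left(\frac{1}{2}\right) + \left(2^2\right)^2\left(\frac{1}{2^2}\right) + \left(2^3\right)^2\left(\frac{1}{2^3}\right) + \left(2^4\right)^2\left(\frac{1}{2^4}\right) + \left(2^5\right)^2\left(\frac{1}{2^5}\right) + (-256)^2\left(\frac{1}{2^5}\right) = 2110 \]
and
\[ \mathrm{Var}(W) = E(W^2) - [E(W)]^2 = 2110 - (-3)^2 = 2101 \text{ (dollars)}^2 \]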
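The mean and variance in 7.3 can be reproduced exactly with rational arithmetic. This is a small sketch for checking purposes only; the variable names are illustrative.

```python
from fractions import Fraction

# Probability function of W from the table: win 2^x with probability
# 1/2^x for x = 1..5, lose 256 with probability 1/2^5.
outcomes = [(2**x, Fraction(1, 2**x)) for x in range(1, 6)]
outcomes.append((-256, Fraction(1, 2**5)))

total_probability = sum(prob for _, prob in outcomes)
mean_w = sum(w * prob for w, prob in outcomes)
second_moment = sum(w * w * prob for w, prob in outcomes)
var_w = second_moment - mean_w**2

print(total_probability, mean_w, second_moment, var_w)
```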
7.4 (a) Let $Y$ = number of cups drunk by Yasmin and let $Z$ = number of cups drunk by Zack. Then
$Y = 2X^2$ and $Z = |2X - 1|$.
\[ E(Y) = E(2X^2) = \sum_{x=0}^{5} (2x^2)\,P(X = x) = 2\sum_{x=0}^{5} x^2\,P(X = x) \]
\[ = 2[(0)^2(0.09) + (1)^2(0.10) + (2)^2(0.25) + (3)^2(0.40) + (4)^2(0.15) + (5)^2(0.01)] = 14.7 \]
On average Yasmin drinks 14.7 cups of coffee per week.
\[ E(Z) = E(|2X - 1|) = \sum_{x=0}^{5} |2x - 1|\,P(X = x) \]
\[ = |2(0) - 1|(0.09) + |2(1) - 1|(0.10) + |2(2) - 1|(0.25) + |2(3) - 1|(0.40) + |2(4) - 1|(0.15) + |2(5) - 1|(0.01) = 4.08 \]
On average Zack drinks 4.08 cups of coffee per week.

(b) Since
\[ E(Y^2) = E\left[(2X^2)^2\right] = E(4X^4) = 4\sum_{x=0}^{5} x^4\,P(X = x) \]
\[ = 4[(0)^4(0.09) + (1)^4(0.10) + (2)^4(0.25) + (3)^4(0.40) + (4)^4(0.15) + (5)^4(0.01)] = 324.6 \]
therefore
\[ \mathrm{Var}(Y) = E(Y^2) - [E(Y)]^2 = 324.6 - (14.7)^2 = 108.51 \]
Since
\[ E(Z^2) = E\left(|2X - 1|^2\right) = \sum_{x=0}^{5} |2x - 1|^2\,P(X = x) \]
\[ = |2(0) - 1|^2(0.09) + |2(1) - 1|^2(0.10) + |2(2) - 1|^2(0.25) + |2(3) - 1|^2(0.40) + |2(4) - 1|^2(0.15) + |2(5) - 1|^2(0.01) = 20.6 \]
therefore
\[ \mathrm{Var}(Z) = E(Z^2) - [E(Z)]^2 = 20.6 - (4.08)^2 = 3.9536 \]
7.5 (a) Using the results of Problem 4.6.2 we have
\[ E(X) = \sum_{x=1}^{\infty} x\,p(1-p)^x = p(1-p)\sum_{x=1}^{\infty} x(1-p)^{x-1} = p(1-p)\frac{1}{[1 - (1-p)]^2} = \frac{1-p}{p} \]
\[ E[X(X-1)] = \sum_{x=2}^{\infty} x(x-1)\,p(1-p)^x = p(1-p)^2\sum_{x=2}^{\infty} x(x-1)(1-p)^{x-2} = p(1-p)^2\frac{2}{[1 - (1-p)]^3} = \frac{2(1-p)^2}{p^2} \]
and
\[ \mathrm{Var}(X) = E[X(X-1)] + E(X) - [E(X)]^2 = \frac{2(1-p)^2}{p^2} + \frac{1-p}{p} - \left(\frac{1-p}{p}\right)^2 \]
\[ = \left(\frac{1-p}{p}\right)^2 + \frac{p(1-p)}{p^2} = \frac{(1-p)}{p^2}(1 - p + p) = \frac{1-p}{p^2} \]
(b) Let $Y$ = the number of trials to obtain the first success. Then $Y = X + 1$ and therefore
\[ E(Y) = E(X + 1) = E(X) + 1 = \frac{1-p}{p} + 1 = \frac{1}{p} \]
7.6 Let $X$ be the number of tests applied to an individual.
\[ P(X = 1) = P(\text{individual tests negative}) = P(\text{tests negative} \mid D)\,P(D) + P(\text{tests negative} \mid \bar{D})\,P(\bar{D}) = 0 + (0.95)(0.98) = 0.931 \]
\[ P(X = 2) = P(\text{individual tests positive}) = P(\text{tests positive} \mid D)\,P(D) + P(\text{tests positive} \mid \bar{D})\,P(\bar{D}) = 1 - 0.931 = 0.069 \]
Then the expected cost per person is $\$10(0.931) + \$110(0.069) = \$16.90$.
7.7 (a) For test B, $P(\text{test positive}) = 0.02$ so the expected number of cases detected among 150
people is $150(0.02) = 3$ cases.
(b) For test A, $P(\text{test positive}) = (0.8)(0.02) + (0.05)(0.98) = 0.065$ so the expected cost is
$2000 + 100(2000)(0.065) = 15000$ and the expected number of cases detected is
$2000(0.02)(0.8) = 32$ cases.
7.8 (a) The possible number of tests done is 1 if all $k$ people are negative or $k + 1$ tests if at least
one person is positive.
x           1          k + 1            Total
P(X = x)    (1-p)^k    1 - (1-p)^k      1
so
\[ E(X) = (1)(1-p)^k + (k+1)\left[1 - (1-p)^k\right] = k + 1 - k(1-p)^k \]
(b)
\[ \text{Expected number of tests for } \frac{n}{k} \text{ groups} = \frac{n}{k}\left[k + 1 - k(1-p)^k\right] = n + \frac{n}{k} - n(1-p)^k \]
which gives $1.01n$, $0.249n$, $0.196n$ for $k = 1, 5, 10$.
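The formula in 7.8 (b) can be tabulated over group sizes. The sketch below is illustrative only: the function name is hypothetical, and the value $p = 0.01$ is an assumption inferred from the stated answers, since the solution does not restate it.

```python
def expected_tests_per_person(k, p):
    """Expected tests per person with pooled groups of size k.

    From (b): (n / k) * [k + 1 - k (1 - p)^k] tests for n people,
    which is 1 + 1/k - (1 - p)^k per person.
    """
    return 1 + 1 / k - (1 - p) ** k

p = 0.01  # assumed prevalence, chosen to match the stated answers
for k in (1, 5, 10):
    print(k, round(expected_tests_per_person(k, p), 3))
```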
7.9 (a) If you bet \$1 on 10 consecutive plays then your expected net winnings equals
\[ 10\left[\frac{18}{37}(1) + \frac{19}{37}(-1)\right] = 10\left(-\frac{1}{37}\right) = -\frac{10}{37} \text{ dollars} \]
If you bet \$10 on a single play then your expected net winnings equals
\[ \frac{18}{37}(10) + \frac{19}{37}(-10) = -\frac{10}{37} \text{ dollars} \]
(b) If you bet \$1 on 10 consecutive plays then the probability you make a profit is the probability
you win 6 or more times, which equals
\[ \sum_{x=6}^{10} \binom{10}{x}\left(\frac{18}{37}\right)^x\left(\frac{19}{37}\right)^{10-x} = 0.3442 \]
If you bet \$10 on a single play then the probability you make a profit is $\frac{18}{37} = 0.4865$.
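7.10 Expected winnings per dollar spent is equal to
\[ (20)\left(\frac{2}{10}\right)\left(\frac{6}{10}\right)\left(\frac{2}{10}\right) + (10)\left(\frac{4}{10}\right)\left(\frac{3}{10}\right)\left(\frac{3}{10}\right) + (5)\left(\frac{4}{10}\right)\left(\frac{1}{10}\right)\left(\frac{5}{10}\right) = 0.94 \text{ dollars} \]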
7.11 (a)
\[ \text{Expected net profit} = \sum_{i=1}^{n} (a_i - 1)\,p_i + (-1)\,p_{n+1} = \sum_{i=1}^{n} a_i p_i - \sum_{i=1}^{n+1} p_i = \sum_{i=1}^{n} a_i p_i - 1 \]
(b)
\[ \text{Expected net profit} = 3(0.1) + 5(0.04 + 0.04 + 0.04) - 1 = -0.10 \text{ dollars} \]
(c) The expected profit is
\[ -0.05 = \sum_{i=1}^{n} d\,b_i\,p_i + (-1)\,p_{n+1} = d\sum_{i=1}^{n} \frac{1}{p_i}\,p_i + (-1)\,p_{n+1} = dn - p_{n+1} \]
so
\[ d = (p_{n+1} - 0.05)/n \]
If $n = 10$ and $p_{n+1} = 0.7$ then $d = 0.065$.
7.12 (a) Let $X_i$ be the winnings when attempting Question $i$ first and let $i$ be the event Question $i$ is
answered correctly, $i = A, B$.
\[ E(X_A) = \sum_{x \in S} x\,P(X_A = x) = (100 + 200)\,P(A \cap B) + (100)\,P(A \cap \bar{B}) + (0)\,P(\bar{A}) \]
\[ = (300)\,P(A)\,P(B) + (100)\,P(A)\,P(\bar{B}) = (300)(0.8)(0.6) + (100)(0.8)(0.4) = 176 \]
\[ E(X_B) = \sum_{x \in S} x\,P(X_B = x) = (100 + 200)\,P(B \cap A) + (200)\,P(B \cap \bar{A}) + (0)\,P(\bar{B}) \]
\[ = (300)\,P(B)\,P(A) + (200)\,P(B)\,P(\bar{A}) = (300)(0.6)(0.8) + (200)(0.6)(0.2) = 168 \]
Therefore the expected winnings are greatest when attempting Question A first.
(b)
\[ E(X_A) = (300)\,P(A \cap B) + (100)\,P(A \cap \bar{B}) + (-50)\,P(\bar{A}) = (300)(0.8)(0.6) + (100)(0.8)(0.4) - (50)(0.2) = 166 \]
\[ E(X_B) = (300)\,P(B \cap A) + (200)\,P(B \cap \bar{A}) + (-50)\,P(\bar{B}) = (300)(0.6)(0.8) + (200)(0.6)(0.2) - (50)(0.4) = 148 \]
Question A should still be attempted first to maximize expected winnings.

7.13 Let $N$ = net profit. Then $N = 59.5n - 25 - 200X^2$ where $X \sim \text{Binomial}(n, 0.05)$, which
implies $E(X) = n(0.05)$ and $\mathrm{Var}(X) = n(0.05)(0.95)$. Also since
$\mathrm{Var}(X) = E(X^2) - [E(X)]^2$ we know $E(X^2) = \mathrm{Var}(X) + [E(X)]^2$. The expected net
profit is
\[ E(N) = E(59.5n - 25 - 200X^2) = 59.5n - 25 - 200E(X^2) = 59.5n - 25 - 200\left\{\mathrm{Var}(X) + [E(X)]^2\right\} = 50n - 0.5n^2 - 25 \]
which is maximized for $n = 50$.
7.14 (a) Let $N$ = the number of trick-or-treaters that arrive in the first half hour. Since arrivals follow
a Poisson process, $N \sim \text{Poisson}(6)$.
\[ P(5 \le N \le 7) = P(N = 5) + P(N = 6) + P(N = 7) = \frac{e^{-6}(6)^5}{5!} + \frac{e^{-6}(6)^6}{6!} + \frac{e^{-6}(6)^7}{7!} = e^{-6}\left(\frac{1296}{7}\right) = 0.45892 \]
(b) Let $M$ = the number of trick-or-treaters that arrive per 3.5 hours. Therefore $M \sim \text{Poisson}(\mu)$
where $\mu = 3.5 \times 12 = 42$. Since $E(M) = 42$, the expected number of trick-or-treaters over the
whole evening is 42.
(c) Using $M$ as in (b), we note that
\[ \frac{P(M = x)}{P(M = x + 1)} = \frac{e^{-42}\,\dfrac{42^x}{x!}}{e^{-42}\,\dfrac{42^{x+1}}{(x+1)!}} = \frac{x + 1}{42} \]
This is $< 1$ if $x < 41$ and $> 1$ if $x > 41$. It follows that outcomes become more likely as $x$
increases to a maximum at 41 and then less likely thereafter. The two outcomes, $M = 41$ and
$M = 42$, are equally likely (and occur with probability $e^{-42}\dfrac{42^{41}}{41!} = e^{-42}\dfrac{42^{42}}{42!} \approx 0.0614$).
(d) Since $\mathrm{Var}(M) = \mu$, the variance of the number of trick-or-treaters arriving over the whole
evening is 42.
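7.15 Let $X$ = the number of weeks the stock price increases in value by \$1. Then $X \sim \text{Binomial}\left(13, \frac{1}{2}\right)$.
The price of the stock in 13 weeks is $S = 50 + X - (13 - X) = 37 + 2X$. The return from the
option is $R = \max(37 + 2X - 55, 0) = \max(2X - 18, 0)$. Therefore
\[ E(R) = 0 + \sum_{x=10}^{13} (2x - 18)\binom{13}{x}\left(\frac{1}{2}\right)^x\left(1 - \frac{1}{2}\right)^{13-x} = \left[2\binom{13}{10} + 4\binom{13}{11} + 6\binom{13}{12} + 8\binom{13}{13}\right]\left(\frac{1}{2}\right)^{13} = \frac{485}{4096} \]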
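The expected return in 7.15 can be confirmed by summing the payoff over all 14 possible values of $X$. This is a minimal sketch with illustrative parameter names; exact fractions are used so the result can be compared with $485/4096$ directly.

```python
from fractions import Fraction
from math import comb

def expected_option_return(weeks=13, start_price=50, strike=55):
    """E[max(S - strike, 0)] where S = start_price + X - (weeks - X)
    and X ~ Binomial(weeks, 1/2) counts the weeks with a $1 increase."""
    total = Fraction(0)
    for x in range(weeks + 1):
        price = start_price + x - (weeks - x)
        payoff = max(price - strike, 0)
        total += payoff * comb(weeks, x) * Fraction(1, 2) ** weeks
    return total

expected_return = expected_option_return()
print(expected_return, float(expected_return))
```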
7.16 Let $X$ = time to display the information.
(a) Without a cache, $X = 50 + 70 + 50 = 170$ only, so the expected time to display is 170 ms.
(b) With a cache, $X = 10$ or $10 + 50 + 70 + 50 = 180$ (since the cache is always searched
first) with probabilities 0.2 and 0.8 respectively, so the expected time to display is
$10(0.2) + 180(0.8) = 146$ ms.
(c) Let $p$ be the probability of a cache hit. Solving $170 = 10p + 180(1 - p)$ gives $p = 0.059$.
Therefore even with only around a 6% probability of a cache hit it is still worthwhile to use a
cache!
Chapter 8:
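8.1 (a) Solving
\[ 1 = \int_{-\infty}^{\infty} f(x)\,dx = 0 + k\int_{-1}^{1} \left(1 - x^2\right)dx + 0 = k\int_{-1}^{1} \left(1 - x^2\right)dx \]
gives $k = 3/4$.
\[ F(x) = \begin{cases} 0 & x \le -1 \\ \frac{3}{4}\int_{-1}^{x} \left(1 - u^2\right)du = \frac{1}{4}\left(2 + 3x - x^3\right) & -1 < x < 1 \\ 1 & x \ge 1 \end{cases} \]
(b) We need to find $c$ such that
\[ 0.95 = P(-c \le X \le c) = \frac{1}{4}\left[\left(2 + 3c - c^3\right) - \left(2 - 3c + c^3\right)\right] = \frac{1}{4}\left(6c - 2c^3\right) = \frac{1}{2}\left(3c - c^3\right) \]
or
\[ c^3 - 3c + 1.9 = 0 \]
This cubic equation must be solved numerically, which gives $c = 0.811$.
(c) $\mu = E(X) = 0$ since the p.d.f. is symmetric about $x = 0$.
\[ \sigma^2 = \mathrm{Var}(X) = E\left[(X - 0)^2\right] = E(X^2) = 0.75\int_{-1}^{1} x^2\left(1 - x^2\right)dx = 0.2 \]
\[ \sigma = sd(X) = \sqrt{0.2} \]
(d) For $0 < y < 1$
\[ G(y) = P(Y \le y) = P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y}) = F(\sqrt{y}) - F(-\sqrt{y}) \]
\[ g(y) = \frac{d}{dy}G(y) = f(\sqrt{y})\frac{d}{dy}\sqrt{y} - f(-\sqrt{y})\frac{d}{dy}(-\sqrt{y}) = f(\sqrt{y})\frac{1}{2\sqrt{y}} + f(-\sqrt{y})\frac{1}{2\sqrt{y}} \]
\[ = 2f(\sqrt{y})\frac{1}{2\sqrt{y}} \quad \text{since } f(-\sqrt{y}) = f(\sqrt{y}) \text{ by symmetry of } f(x) \]
\[ = \frac{3}{4}(1 - y)\frac{1}{\sqrt{y}} = \frac{3}{4}\left(y^{-1/2} - y^{1/2}\right) \]
Therefore
\[ g(y) = \begin{cases} \frac{3}{4}\left(y^{-1/2} - y^{1/2}\right) & 0 < y < 1 \\ 0 & \text{otherwise} \end{cases} \]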
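Part (b) of 8.1 states that the cubic $c^3 - 3c + 1.9 = 0$ must be solved numerically. One simple option is bisection on the interval $(0, 1)$, where the cubic changes sign. The sketch below is illustrative; the function names and tolerance are assumptions, not part of the original solution.

```python
def cubic(c):
    """Left-hand side of c^3 - 3c + 1.9 = 0 from part (b)."""
    return c**3 - 3 * c + 1.9

def bisect(func, lo, hi, tol=1e-10):
    """Root of a continuous function with func(lo) and func(hi) of opposite sign."""
    f_lo = func(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = func(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid   # root lies in the upper half
        else:
            hi = mid                # root lies in the lower half
    return (lo + hi) / 2

# cubic(0) = 1.9 > 0 and cubic(1) = -0.1 < 0, so a root lies in (0, 1).
c = bisect(cubic, 0.0, 1.0)
coverage = 0.5 * (3 * c - c**3)  # P(-c <= X <= c), should be 0.95
print(round(c, 3), round(coverage, 4))
```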
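8.2 (a) The area under the p.d.f. and above the x-axis consists of two triangles with
area $= \frac{1}{2}\left(\frac{1}{2}\right)(2) + \frac{1}{2}\left(\frac{1}{2}\right)(2) = 1$ and therefore $\int_{-\infty}^{\infty} f(x)\,dx = 1$.
(b)
\[ P(0.25 \le X \le 0.8) = \int_{0.25}^{0.8} f(x)\,dx = \int_{0.25}^{0.5} 4x\,dx + \int_{0.5}^{0.8} 4(1 - x)\,dx = 0.375 + 0.42 = 0.795 \]
(c) Since the p.d.f. is symmetric about the line $x = 1/2$, the median is equal to $1/2$.
To find the 10th percentile we need to find the value $c$ such that $0.1 = F(c)$. Since $0.5 = F(1/2)$
we know that $c$ must lie between 0 and $1/2$. Therefore $c$ is the solution to
\[ 0.1 = \int_{0}^{c} 4x\,dx = 2x^2\Big|_{0}^{c} = 2c^2 \]
which gives $c = \sqrt{0.05} \approx 0.2236$.
(d) Since the p.d.f. is symmetric about the line $x = 1/2$, the mean is $\mu = E(X) = 1/2$.
\[ E(X^2) = \int_{-\infty}^{\infty} x^2 f(x)\,dx = 0 + \int_{0}^{1/2} 4x^3\,dx + \int_{1/2}^{1} 4x^2(1 - x)\,dx + 0 = \frac{7}{24} \]
\[ \mathrm{Var}(X) = E(X^2) - [E(X)]^2 = \frac{7}{24} - \left(\frac{1}{2}\right)^2 = \frac{1}{24} \]
(e) For $-1 < y < 1$
\[ G(y) = P(Y \le y) = P(2(X - 1/2) \le y) = P\left(X \le \frac{y+1}{2}\right) = F\left(\frac{y+1}{2}\right) \]
\[ g(y) = \frac{d}{dy}G(y) = f\left(\frac{y+1}{2}\right)\frac{d}{dy}\left(\frac{y+1}{2}\right) = f\left(\frac{y+1}{2}\right)\frac{1}{2} \]
Therefore
\[ g(y) = \begin{cases} y + 1 & -1 < y \le 0 \\ 1 - y & 0 < y < 1 \\ 0 & \text{otherwise.} \end{cases} \]
(f) For $0 < z < 1$
\[ H(z) = P(Z \le z) = P(X^3 \le z) = P\left(X \le z^{1/3}\right) = F\left(z^{1/3}\right) \]
since $P(X < 0) = 0$. Therefore
\[ h(z) = \frac{d}{dz}H(z) = f\left(z^{1/3}\right)\frac{d}{dz}z^{1/3} = f\left(z^{1/3}\right)\frac{1}{3z^{2/3}} = \begin{cases} 4z^{-1/3}/3 & 0 < z \le (0.5)^3 \\ 4\left(z^{-2/3} - z^{-1/3}\right)/3 & (0.5)^3 < z < 1 \\ 0 & \text{otherwise.} \end{cases} \]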
8.3 For $0 < y < 1$
\[ G(y) = P(Y \le y) = P((X + 10)/20 \le y) = P(X \le 20y - 10) = F(20y - 10) \]
\[ g(y) = \frac{d}{dy}G(y) = f(20y - 10)\frac{d}{dy}(20y - 10) = \frac{1}{20}(20) = 1 \]
which is the p.d.f. of a $U(0, 1)$ random variable.
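8.4
\[ P(|X - \mu| \le 2\sigma) = P(\mu - 2\sigma \le X \le \mu + 2\sigma) = 2P(\mu \le X \le \mu + 2\sigma) \quad \text{by symmetry} \]
\[ = 2[P(X \le \mu + 2\sigma) - P(X \le \mu)] \]
\[ = 2[P(X \le \mu + 2\sigma) - 0.5] \quad \text{since } P(X \le \mu) = P(X > \mu) = 0.5 \]
\[ = 2P(X \le \mu + 2\sigma) - 1 \]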
8.5 (a) For females
\[ P(X > 80 \mid X > 30) = \frac{P(X > 80, X > 30)}{P(X > 30)} = \frac{P(X > 80)}{P(X > 30)} = \frac{0.704}{0.996} = 0.707 \]
\[ P(X > 90 \mid X > 30) = \frac{P(X > 90)}{P(X > 30)} = \frac{0.396}{0.996} = 0.398 \]
For males
\[ P(X > 80 \mid X > 30) = \frac{P(X > 80)}{P(X > 30)} = \frac{0.603}{0.989} = 0.610 \]
\[ P(X > 90 \mid X > 30) = \frac{P(X > 90)}{P(X > 30)} = \frac{0.273}{0.989} = 0.276 \]
(b)
\[ P(X > 90) = P(X > 90 \mid \text{Female})\,P(\text{Female}) + P(X > 90 \mid \text{Male})\,P(\text{Male}) = (0.396)(0.49) + (0.273)(0.51) = 0.333 \]
8.6 (a) f (x) is a probability density function for > 1; since for > 1; f (x) 0 for all x 2 <
and
Z1 Z1
f (x)dx = ( + 1) x dx = x +1 j10 = 1
1 0
(b)
Z0:5
+1 0:5 +1
P (X 0:5) = ( + 1) x dx = x j0 = (0:5)
0
(c) For k = 1; 2; : : :
Z1 Z1 Z1
k k k
E X = x f (x)dx = x ( + 1) x dx = ( + 1) xk+ dx
1 0 0
+1 +k+1 1 +1
= x j0 = :
+k+1 +k+1
Therefore
+1 +1
E (X) = ; E X2 =
+2 +3
and
2
2 2 +1 +1 +1
V ar (X) = E X [E (X)] = =
+3 +2 ( + 2)2 ( + 3)
(d) For y > 1
1 1 1
G (y) = P (Y y) = P y =P X> =1 P X
X y y
1
=1 F since F (x) = P (X x)
y
d d 1
g (y) = G (y) = 1 F
dy dy y
1 d 1 1 1 +1
= f =f = for y > 1
y dy y y y2 y +2
Therefore (
+1
y +2 y>1
g (y) =
0 otherwise
8.7 (a)
Z1 Z1
x2 = x2 = b
1= f (x)dx = 0 + k xe dx = k lim e j0
b!1 2
1 0
b2 = 2
=k 1 lim e =k and therefore k = :
2 b!1 2
8
< 0 x 0
F (x) = Rx u2 = x2 =
: 2u e du = 1 e x>0
0
(b)
Z1 Z1 Z1
2x x2 = 2x2 x2 = x2 2x
E (X) = xf (x)dx = 0 + x e dx = e dx let y = , dy = dx
1 0 0
Z1 Z1 p
1=2 1=2 y 1=2 3=2 1 y 1=2 3 1=2 1 1
= y e dy = y e dy = = =
2 2 2 2
0 0
Z1 Z1 Z1
2 2 2 2x x2 = 2x3 x2 = x2 2x
E X = x f (x)dx = 0 + x e dx = e dx let y = , dy = dx
1 0 0
Z1 Z1
y
= ye dy = y2 1
e y
dy = (2) = (1) =
0 0
p !2
2 4
V ar (X) = E X 2 [E (X)] = =
2 4
(c) For y > 0
X2 p
G (y) = P (Y y) = P y =P X y since P (X 0) = 0
p
=F y
p p p
d p dp p 2 y y y
g (y) = G (y) = f y y=f y p = e p =e
dy dy 2 y 2 y
which is the p.d.f. of an Exponential (1) random variable. Therefore Y = X 2 = v Exponential (1).
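8.8 Since $X$ is the diameter of the sphere
\[ Y = \frac{4}{3}\pi\left(\frac{X}{2}\right)^3 = \frac{\pi}{6}X^3 \quad \text{and} \quad X = \left(\frac{6}{\pi}Y\right)^{1/3}. \]
Since $X \sim U(0.6, 1)$
\[ f(x) = \frac{1}{1 - 0.6} = \frac{5}{2} \quad \text{for } 0.6 < x < 1. \]
The range of $Y$ is
\[ \frac{\pi}{6}(0.6)^3 < y < \frac{\pi}{6}(1)^3 \quad \text{or} \quad 0.036\pi \le y \le \frac{\pi}{6} \]
Therefore for $0.036\pi \le y \le \frac{\pi}{6}$
\[ G(y) = P(Y \le y) = P\left(\frac{\pi}{6}X^3 \le y\right) = P\left(X \le \left(\frac{6}{\pi}y\right)^{1/3}\right) = F\left(\left(\frac{6}{\pi}y\right)^{1/3}\right) \]
and
\[ g(y) = \frac{d}{dy}G(y) = f\left(\left(\frac{6}{\pi}y\right)^{1/3}\right)\frac{d}{dy}\left(\frac{6}{\pi}y\right)^{1/3} = \frac{5}{2}\left(\frac{6}{\pi}\right)^{1/3}\frac{1}{3}y^{-2/3} = \frac{5}{6}\left(\frac{6}{\pi}\right)^{1/3}y^{-2/3} \]
Therefore
\[ g(y) = \begin{cases} \frac{5}{6}\left(\frac{6}{\pi}\right)^{1/3}y^{-2/3} & 0.036\pi \le y \le \frac{\pi}{6} \\ 0 & \text{otherwise} \end{cases} \]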
8.9 (a) Let $X$ = magnitude of earthquake. Then $X \sim \text{Exponential}(2.5)$. The probability an
earthquake has a magnitude greater than 5 is $P(X > 5) = e^{-5/2.5} = e^{-2}$.
(b) The probability that among 3 earthquakes there are none with a magnitude greater than 5 is
$\binom{3}{0}\left(e^{-2}\right)^0\left(1 - e^{-2}\right)^3 = \left(1 - e^{-2}\right)^3$.
(c) By the memoryless property of the Exponential distribution
\[ P(X > 5 \mid X > 4) = P(X > 1) = e^{-1/2.5} = e^{-0.4} \]
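8.10 Let $X$ = lifetime of this type of light bulb in hours. Then $X \sim \text{Exponential}(1000)$.
(a) $E(X) = 1000$ hours and $sd(X) = \sqrt{(1000)^2} = 1000$ hours.
(b) Let $Y$ = lifetime of this type of light bulb in days. Then $Y = X/24$ and
\[ E(Y) = E\left(\frac{X}{24}\right) = \frac{1}{24}E(X) = \frac{1}{24}(1000) = \frac{125}{3} \text{ days} \]
\[ \mathrm{Var}(Y) = \mathrm{Var}\left(\frac{X}{24}\right) = \left(\frac{1}{24}\right)^2 \mathrm{Var}(X) = \left(\frac{1}{24}\right)^2 (1000)^2 = \left(\frac{1000}{24}\right)^2 = \left(\frac{125}{3}\right)^2 \]
\[ sd(Y) = \sqrt{\left(\frac{125}{3}\right)^2} = \frac{125}{3} \text{ days} \]
(c) Solving
\[ 0.5 = P(X > m) = e^{-m/1000} \]
gives
\[ m = 1000\ln 2 \approx 693.15 \text{ hours} \]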
8.11 (a) Let $X$ = waiting time until the next accident. Since accidents occur according to a Poisson
process with rate $\lambda = 0.5$ accidents per day, $X$ has an Exponential distribution with mean
$\theta = 1/\lambda = 1/0.5 = 2$ days. Since $E(Y) = \theta$ if $Y \sim \text{Exponential}(\theta)$, $E(X) = 2$ days is
the expected waiting time until the next accident.
(b)
\[ P(\text{waiting time is less than 12 hours}) = P(\text{waiting time is less than 0.5 days}) = P(X < 0.5) = 1 - e^{-0.5/2} = 1 - e^{-0.25} \]
(c) By the memoryless property of the Exponential distribution
\[ P(X > 1 \mid X > 0.5) = P(X > 0.5) = e^{-0.25} \]
8.12 (a) Let $X$ be the lifetime kilometer-age of the car. Then $X$ has an Exponential distribution with
mean 20 thousand kilometers. By the memoryless property of the Exponential distribution,
\[ P(X > 20 + 10 \mid X > 10) = P(X > 20) = e^{-20/20} = e^{-1} = 0.3679 \]
(b) If $X$ = the lifetime kilometer-age of the car (in thousands of kilometers) is a $U(0, 40)$ random
variable then
\[ P(X > 20 + 10 \mid X > 10) = \frac{P(X > 30, X > 10)}{P(X > 10)} = \frac{P(X > 30)}{P(X > 10)} = \frac{\int_{30}^{40} \frac{1}{40}\,dx}{\int_{10}^{40} \frac{1}{40}\,dx} = \frac{1}{3} = 0.3333 \]
8.13 Let $X$ = waiting time in days between server crashes. Since on average there are three server
crashes per day, $X \sim \text{Exponential}(1/3)$.
(a) Since 8 hours = 1/3 day, the probability that the waiting time between two consecutive
crashes is greater than 8 hours is
\[ P\left(X > \frac{1}{3}\right) = \int_{1/3}^{\infty} 3e^{-3x}\,dx = e^{-3(1/3)} = e^{-1} = 0.368 \]
(b) By the memoryless property of the Exponential distribution
\[ P\left(X > \frac{1}{24} + \frac{1}{3} \,\Big|\, X > \frac{1}{3}\right) = P\left(X > \frac{1}{24}\right) = e^{-3(1/24)} = 0.882 \]
since one hour = 1/24 days.

(c) Let $N$ = number of crashes in a day. Then $N \sim \text{Poisson}(3)$ and the probability that there
are fewer than three crashes in a day is
\[ P(N < 3) = P(N \le 2) = \sum_{n=0}^{2} \frac{3^n e^{-3}}{n!} = 0.423 \]
8.14 (a) Clearly, f (x) 0 for any x 2 R. Since

Z1 Z1
x 1 x
f (x)dx = 0 + e dx
( )
1 0
Z 1
1 x x
= x 1 e dx let = y
( )
Z0 1
1
= ( y) 1 e y d( y)
( )
Z 10
1 1 y
= y e dy = 1
( ) 0
therefore f is a legitimate probability density function.
(b) Z Z
1 1
x 1 x ( + 1)
E(X) = xf (x)dx = x e dx = =
1 0 ( ) ( )
Similarly,
Z 1 Z 1
2 2 x 1 x ( + 2)
E(X ) = x f (x)dx = x2 e dx = 2
= ( + 1) 2
1 0 ( ) ( )
Hence,
V ar(X) = E X 2 [E (X)]2 = 2
(c) For = 1, we have (1) = 1. Then,

( x
1
e x 0
f (x) =
0 otherwise:
This is the probability density function of an Exponential ( ) random variable.
8.15 0.06681, 0.24173, 0.38292, 0.24173, 0.06681

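8.16 Since $X \sim N(10, 16)$, to find the 20th percentile we need to find $c$ such that
$P(X \le c) = 0.2$ or $P\left(Z \le \frac{c - 10}{4}\right) = 0.2$ where $Z \sim N(0, 1)$.
From the inverse Normal table we have $P(Z \le 0.8416) = 0.8$ which implies
$P(Z \le -0.8416) = 0.2$, which gives $\frac{c - 10}{4} = -0.8416$ or $c = 6.6336$.
To find the 40th percentile we need to find $c$ such that $P(X \le c) = 0.4$ or $P\left(Z \le \frac{c - 10}{4}\right) = 0.4$.
From the inverse Normal table we have $P(Z \le 0.2533) = 0.6$ which implies
$P(Z \le -0.2533) = 0.4$, which gives $\frac{c - 10}{4} = -0.2533$ or $c = 8.9868$.
To find the 60th percentile we need to find $c$ such that $P\left(Z \le \frac{c - 10}{4}\right) = 0.6$.
From the inverse Normal table we have $P(Z \le 0.2533) = 0.6$, which gives
$\frac{c - 10}{4} = 0.2533$ or $c = 11.0132$.
To find the 80th percentile we need to find $c$ such that $P\left(Z \le \frac{c - 10}{4}\right) = 0.8$.
From the inverse Normal table we have $P(Z \le 0.8416) = 0.8$, which gives
$\frac{c - 10}{4} = 0.8416$ or $c = 13.3664$.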
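The percentiles in 8.16 can also be obtained without tables. The sketch below uses `statistics.NormalDist` from the Python standard library; note that $N(10, 16)$ has standard deviation 4, and the results agree with the table-based answers to about three decimal places because the table values of $z$ are rounded.

```python
from statistics import NormalDist

# X ~ N(10, 16): mean 10, variance 16, so standard deviation 4.
x_dist = NormalDist(mu=10, sigma=4)

percentiles = {q: x_dist.inv_cdf(q) for q in (0.2, 0.4, 0.6, 0.8)}
for q, c in percentiles.items():
    print(f"{int(q * 100)}th percentile: {c:.4f}")
```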
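8.17 (a) Since $X \sim N\left(2, (0.01)^2\right)$
\[ P(X < 2) = P\left(\frac{X - 2}{0.01} < \frac{2 - 2}{0.01}\right) = P(Z < 0) = 0.5 \quad \text{where } Z \sim N(0, 1) \]
(b) Find $c$ such that
\[ 0.9 = P(|X - \mu| \le c) = P\left(\frac{|X - \mu|}{0.01} \le \frac{c}{0.01}\right) \]
Since $P(|Z| \le 1.6449) = 0.9$ then $c = (0.01)(1.6449) = 0.016449$.
(c) $X \sim N\left(\mu, (0.01)^2\right)$. We want $\mu$ such that $P(X < 2) < 0.01$. Since
\[ P(X < 2) = P\left(Z < \frac{2 - \mu}{0.01}\right) \quad \text{where } Z \sim N(0, 1) \]
and $P(Z < -2.3263) = 0.01$, therefore
\[ \frac{2 - \mu}{0.01} < -2.3263 \quad \text{or} \quad \mu > 2.023263 \]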
8.18 Let $X$ be the bolt's diameter. Then $X \sim N\left(1.2, (0.005)^2\right)$.
\[ P(X > 1.21 \text{ or } X < 1.19) = P(X > 1.21) + P(X < 1.19) \]
\[ = P\left(\frac{X - 1.2}{0.005} > \frac{1.21 - 1.2}{0.005}\right) + P\left(\frac{X - 1.2}{0.005} < \frac{1.19 - 1.2}{0.005}\right) \]
\[ = P(Z > 2) + P(Z < -2) \quad \text{where } Z \sim N(0, 1) \]
\[ = 1 - P(Z < 2) + [1 - P(Z < 2)] = 2[1 - P(Z < 2)] = 2[1 - F(2)] = 2(1 - 0.97725) = 0.0455 \]
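8.19 The average wholesale price per egg is
\[ 5P\left(Z < \frac{37 - 40}{2}\right) + 6P\left(\frac{37 - 40}{2} < Z < \frac{42 - 40}{2}\right) + 7P\left(Z > \frac{42 - 40}{2}\right) \]
\[ = 5P(Z < -1.5) + 6P(-1.5 < Z < 1) + 7P(Z > 1) \quad \text{where } Z \sim N(0, 1) \]
\[ = 5(1 - 0.93319) + 6(0.84134 + 0.93319 - 1) + 7(1 - 0.84134) = 6.092 \text{ cents} \]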
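8.20 Let $X$ = lifetime of computer chip. Then $X \sim N\left(5 \times 10^6, \left(5 \times 10^5\right)^2\right)$.
(a) Since
\[ P\left(X \le 6 \times 10^6\right) = P\left(Z \le \frac{6 \times 10^6 - 5 \times 10^6}{5 \times 10^5}\right) = P(Z \le 2) = 0.97725 \quad \text{where } Z \sim N(0, 1) \]
the proportion of computer chips that last less than $6 \times 10^6$ hours is 0.97725.
(b) Since
\[ P\left(X > 4 \times 10^6\right) = P\left(Z > \frac{4 \times 10^6 - 5 \times 10^6}{5 \times 10^5}\right) = P(Z > -2) = P(Z < 2) = 0.97725 \quad \text{where } Z \sim N(0, 1) \]
the proportion of computer chips that last longer than $4 \times 10^6$ hours is 0.97725.
(c) Let $Y$ = the lifetime of improved computer chip. Then $Y \sim N\left(\mu_{new}, \left(5 \times 10^5\right)^2\right)$. The
manufacturer wants
\[ 0.95 \le P\left(Y > 4.5 \times 10^6\right) = P\left(Z > \frac{4.5 \times 10^6 - \mu_{new}}{5 \times 10^5}\right) \quad \text{where } Z \sim N(0, 1) \]
Since $P(Z \le 1.6449) = 0.95$ implies $P(Z > -1.6449) = 0.95$, the manufacturer should choose $\mu_{new}$ such
that
\[ \frac{4.5 \times 10^6 - \mu_{new}}{5 \times 10^5} \le -1.6449 \]
or
\[ \mu_{new} \ge 4.5 \times 10^6 + 1.6449\left(5 \times 10^5\right) = 4.5 \times 10^6 + 0.82245 \times 10^6 = 5.32245 \times 10^6 \]
Therefore the new mean should be at least $5.32245 \times 10^6$.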

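8.21 (a) Let $X$ = temperature of CPU. Since $X \sim N\left(60, 5^2\right)$
\[ P(X > 75) = P\left(Z > \frac{75 - 60}{5}\right) = P(Z > 3) = 1 - 0.99865 = 0.00135 \quad \text{where } Z \sim N(0, 1) \]
(b) Since $P(Z < 1.2816) = 0.9$ where $Z \sim N(0, 1)$, therefore
\[ \frac{c - 60}{5} = 1.2816 \quad \text{or} \quad c = 66.408 \]
(c) Since $P(Z > 2.3263) = 0.01$ where $Z \sim N(0, 1)$, therefore
\[ \frac{95 - \mu_{new}}{5} = 2.3263 \quad \text{or} \quad \mu_{new} = 83.3685 \]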
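8.22 (a)
\[ P(\text{false negative}) = P(X < d) \text{ if } X \sim N\left(\mu_1, \sigma_1^2\right) \]
\[ = P(X < 5) \text{ if } X \sim N\left(10, (6)^2\right) \]
\[ = P\left(Z < \frac{5 - 10}{6}\right) \quad \text{where } Z \sim N(0, 1) \]
\[ \approx P(Z < -0.83) = 1 - P(Z < 0.83) = 1 - 0.79673 = 0.20327 \]
\[ P(\text{false positive}) = P(X \ge d) \text{ if } X \sim N\left(\mu_0, \sigma_0^2\right) \]
\[ = P(X \ge 5) \text{ if } X \sim N\left(0, (4)^2\right) \]
\[ = P\left(Z \ge \frac{5 - 0}{4}\right) \quad \text{where } Z \sim N(0, 1) \]
\[ = 1 - P(Z < 1.25) = 1 - 0.89435 = 0.10565 \]
(b)
\[ P(\text{false negative}) = P(X < 5) \text{ if } X \sim N\left(10, (3)^2\right) \]
\[ = P\left(Z < \frac{5 - 10}{3}\right) \quad \text{where } Z \sim N(0, 1) \]
\[ \approx P(Z < -1.67) = 1 - P(Z < 1.67) = 1 - 0.95254 = 0.04746 \]
\[ P(\text{false positive}) = P(X \ge 5) \text{ if } X \sim N\left(0, (3)^2\right) \]
\[ = P\left(Z \ge \frac{5 - 0}{3}\right) \quad \text{where } Z \sim N(0, 1) \]
\[ \approx 1 - P(Z < 1.67) = 1 - 0.95254 = 0.04746 \]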
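The error probabilities in 8.22 can be evaluated without rounding $z$ to two decimal places. The sketch below uses `statistics.NormalDist`; the helper name `error_rates` and its parameter names are illustrative. Because the table lookup rounds $5/6$ to 0.83 and $5/3$ to 1.67, the exact values differ slightly from those above (about 0.2023 instead of 0.20327, and about 0.0478 instead of 0.04746).

```python
from statistics import NormalDist

def error_rates(d, mu0, sigma0, mu1, sigma1):
    """False negative and false positive probabilities for cutoff d.

    A false negative occurs when X < d although X ~ N(mu1, sigma1^2);
    a false positive occurs when X >= d although X ~ N(mu0, sigma0^2).
    """
    false_negative = NormalDist(mu1, sigma1).cdf(d)
    false_positive = 1 - NormalDist(mu0, sigma0).cdf(d)
    return false_negative, false_positive

fn_a, fp_a = error_rates(d=5, mu0=0, sigma0=4, mu1=10, sigma1=6)  # part (a)
fn_b, fp_b = error_rates(d=5, mu0=0, sigma0=3, mu1=10, sigma1=3)  # part (b)
print(round(fn_a, 4), round(fp_a, 4), round(fn_b, 4), round(fp_b, 4))
```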
8.23 (a) False positive probabilities are $P(Z > d/3) = 0.0475$, 0.092, 0.023 for $Z \sim N(0, 1)$ and
$d = 5, 4, 6$ in (i), (ii), (iii).
False negative probabilities are $P(Z < (d - 10)/3) = 0.0475$, 0.023, 0.092 for $Z \sim N(0, 1)$
and $d = 5, 4, 6$ in (i), (ii), (iii).
(b) The factors are the security (proportion of spam in email) and proportion of legitimate messages
that are filtered out.
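8.24 (a) $S_n$ is the time we need to wait until the n'th hit occurs.
(b) If $S_n \le t$, then the n'th hit has happened sometime in $(0, t]$. Therefore $X_t \ge n$, because $X_t$ counts the number of hits in $(0, t]$. Conversely, if $X_t \ge n$, the number of hits that have occurred up to time $t$ is at least $n$. Therefore the n'th hit has happened sometime in $(0, t]$, that is, $S_n \le t$.
(c) We know that $X_t \sim Poisson(\lambda t)$. Therefore,
$$P(S_n \le t) = P(X_t \ge n) = 1 - P(X_t < n) = 1 - \sum_{j=0}^{n-1} P(X_t = j) = 1 - \sum_{j=0}^{n-1} \frac{e^{-\lambda t}(\lambda t)^j}{j!}$$
(d) The probability density function of $S_n$ is given by
$$\begin{aligned}
f_{S_n}(t) = \frac{d}{dt} P(S_n \le t) &= \sum_{j=0}^{n-1} \left[ \lambda e^{-\lambda t}\frac{(\lambda t)^j}{j!} - e^{-\lambda t}\frac{j\lambda(\lambda t)^{j-1}}{j!} \right] \\
&= \lambda e^{-\lambda t}\left\{ 1 + \sum_{j=1}^{n-1}\left[ \frac{(\lambda t)^j}{j!} - \frac{(\lambda t)^{j-1}}{(j-1)!} \right] \right\}
= \lambda e^{-\lambda t}\left[ 1 + \frac{(\lambda t)^{n-1}}{(n-1)!} - \frac{(\lambda t)^0}{0!} \right] \\
&= \frac{\lambda^n t^{n-1} e^{-\lambda t}}{(n-1)!}.
\end{aligned}$$
This is for $t > 0$. For $t < 0$, we simply have $f_{S_n}(t) = 0$. Therefore,
$$f_{S_n}(t) = \begin{cases} \dfrac{\lambda^n t^{n-1} e^{-\lambda t}}{(n-1)!} & \text{if } t > 0 \\ 0 & \text{if } t < 0 \end{cases}$$
Noting that $\Gamma(n) = (n-1)!$, we have
$$f_{S_n}(t) = \begin{cases} \dfrac{\lambda^n t^{n-1} e^{-\lambda t}}{\Gamma(n)} & \text{if } t > 0 \\ 0 & \text{if } t < 0 \end{cases}$$
which is the probability density function of a Gamma random variable with parameters $\alpha = n$ and $\beta = 1/\lambda$.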
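As a sanity check on 8.24, one can compare the cumulative distribution function from part (c) with a numerical integral of the density from part (d). This is a rough sketch only: the parameter values, the function names and the midpoint-rule integrator are illustrative choices, not part of the original solution.

```python
import math

def cdf_from_poisson(n, lam, t):
    """P(S_n <= t) = 1 - sum_{j<n} e^{-lam t} (lam t)^j / j!  (part (c))."""
    return 1.0 - sum(math.exp(-lam * t) * (lam * t) ** j / math.factorial(j)
                     for j in range(n))

def gamma_pdf(n, lam, t):
    """Density from part (d): lam^n t^(n-1) e^{-lam t} / (n-1)!."""
    return lam ** n * t ** (n - 1) * math.exp(-lam * t) / math.factorial(n - 1)

def integrate_pdf(n, lam, t, steps=20000):
    """Midpoint-rule approximation of the integral of the density over (0, t)."""
    h = t / steps
    return h * sum(gamma_pdf(n, lam, (i + 0.5) * h) for i in range(steps))

# Illustrative values (assumed here): third hit, rate 2 per unit time, t = 1.5
n, lam, t = 3, 2.0, 1.5
lhs = cdf_from_poisson(n, lam, t)
rhs = integrate_pdf(n, lam, t)
print(lhs, rhs)  # the two values should agree closely
```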
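8.25 (a) To show that $E(X)$ does not exist we need to show the improper integral
$$\int_{-\infty}^{\infty} \frac{\theta x}{\pi(\theta^2 + x^2)}\,dx$$
does not converge. By a change of variable
$$\int_{-\infty}^{\infty} \frac{x}{\theta^2 + x^2}\,dx = \int_{-\infty}^{\infty} \frac{x}{1 + x^2}\,dx$$
For this integral to converge the integral
$$\int_{1}^{\infty} \frac{x}{1 + x^2}\,dx$$
must converge. Since for $x \ge 1$
$$\frac{x}{1+x^2} \ge \frac{x}{x^2+x^2} = \frac{1}{2}\cdot\frac{1}{x} \quad \text{and} \quad \int_{1}^{\infty} \frac{1}{2}\cdot\frac{1}{x}\,dx \text{ diverges}$$
therefore by the Comparison Test for Improper Integrals the integral $\int_{1}^{\infty} \frac{x}{1+x^2}\,dx$ does not converge and thus $E(X)$ does not exist. It follows that the variance $Var(X) = E\left[(X-\mu)^2\right]$ does not exist since $E(X) = \mu$ does not exist.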
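(b) Let $Y = X^{-1}$. Show that $Y$ has a Cauchy distribution with parameter $\theta^{-1}$. Since $P(Y \le y) = P(X^{-1} \le y)$ and $X$ can be positive or negative, we have two cases:
Case 1: Let $y > 0$. Then,
$$P(X^{-1} \le y) = P(X^{-1} \le y, X > 0) + P(X^{-1} \le y, X < 0) = P\left(X \ge \frac{1}{y}\right) + P(X < 0)$$
This follows since for $y > 0$, $\{X^{-1} \le y, X > 0\} = \{X \ge \frac{1}{y}\}$ and $\{X^{-1} \le y, X < 0\} = \{X < 0\}$. Also, since the probability density function of $X$ is symmetric around the origin, $P(X < 0) = 0.5$. Therefore,
$$P(X^{-1} \le y) = \frac{3}{2} - P(X \le y^{-1})$$
Case 2: Let $y < 0$. In this case, $\{X^{-1} \le y\} = \{y^{-1} \le X < 0\}$. Then,
$$P(X^{-1} \le y) = P(y^{-1} \le X < 0) = P(X < 0) - P(X \le y^{-1}) = 0.5 - P(X \le y^{-1})$$
Therefore
$$P(Y \le y) = \begin{cases} 1.5 - P(X \le y^{-1}) & y > 0 \\ 0.5 - P(X \le y^{-1}) & y < 0 \end{cases}$$
For any $y \ne 0$,
$$\begin{aligned}
f_Y(y) = \frac{d}{dy} P(Y \le y) &= -\frac{d}{dy} P(X \le y^{-1}) = y^{-2} f_X(y^{-1}) \\
&= \frac{\theta}{\pi(\theta^2 y^2 + 1)} = \frac{\theta^{-1}}{\pi\left[y^2 + (\theta^{-1})^2\right]}
\end{aligned}$$
Therefore $Y = X^{-1}$ is a Cauchy random variable with parameter $\theta^{-1}$.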
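(c)
$$\begin{aligned}
F(x) = \int_{-\infty}^{x} f(u)\,du &= \int_{-\infty}^{x} \frac{\theta}{\pi(\theta^2 + u^2)}\,du
= \lim_{b \to -\infty} \left[\frac{1}{\pi}\arctan\left(\frac{u}{\theta}\right)\right]_{b}^{x} \\
&= \frac{1}{\pi}\left[\arctan\left(\frac{x}{\theta}\right) - \lim_{b \to -\infty}\arctan\left(\frac{b}{\theta}\right)\right]
= \frac{1}{\pi}\left[\arctan\left(\frac{x}{\theta}\right) + \frac{\pi}{2}\right] \\
&= \frac{1}{2} + \frac{1}{\pi}\arctan\left(\frac{x}{\theta}\right) \quad \text{for } x \in \Re
\end{aligned}$$
Since $F$ is increasing, the inverse cumulative distribution function is in fact the inverse function of $F$, so
$$F^{-1}(s) = \theta\tan\left[\pi\left(s - \frac{1}{2}\right)\right], \quad s \in [0, 1]$$
(d) Suppose $U \sim U(-1, 0)$. We know that if $V \sim U(0, 1)$, then $F^{-1}(V)$ is a Cauchy random variable with parameter $\theta$. However if $U \sim U(-1, 0)$ then $1 + U \sim U(0, 1)$. Therefore,
$g(U) = F^{-1}(1 + U) = \theta\tan[\pi(U + 0.5)]$ is the desired function.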
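The inverse-transform construction in 8.25 (c) and (d) can be sketched in code. This is only an illustration under assumed settings: the value of theta, the seed and the helper names are hypothetical choices, and the empirical check uses the fact that $F(\theta) = 0.75$ from the formula in (c).

```python
import math
import random

def cauchy_cdf(x, theta):
    """F(x) = 1/2 + arctan(x / theta) / pi, from part (c)."""
    return 0.5 + math.atan(x / theta) / math.pi

def cauchy_inv_cdf(s, theta):
    """F^{-1}(s) = theta * tan(pi * (s - 1/2)), for 0 < s < 1."""
    return theta * math.tan(math.pi * (s - 0.5))

def g(u, theta):
    """Part (d): maps U ~ U(-1, 0) to a Cauchy(theta) value."""
    return theta * math.tan(math.pi * (u + 0.5))

theta = 2.0                      # illustrative parameter value (assumption)
rng = random.Random(230)         # fixed seed so the run is repeatable
samples = [g(-rng.random(), theta) for _ in range(20000)]

# Round trip: F(F^{-1}(s)) should return s
round_trip = [cauchy_cdf(cauchy_inv_cdf(s, theta), theta) for s in (0.1, 0.5, 0.9)]

# Empirical P(X <= theta) should be near F(theta) = 0.75
frac_below_theta = sum(x <= theta for x in samples) / len(samples)
print(round_trip, frac_below_theta)
```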
Chapter 9:
9.1 (a) The marginal probability functions f1(x) and f2(y) are given in the table

                              x
    f(x, y)             0      1      2      f2(y) = P(Y = y)
    y = 0               0.15   0.1    0.05   0.3
    y = 1               0.35   0.2    0.15   0.7
    f1(x) = P(X = x)    0.5    0.3    0.2    1

(b) X and Y are not independent random variables since
$$P(X = 1, Y = 0) = 0.1 \ne P(X = 1)P(Y = 0) = (0.3)(0.3) = 0.09$$
(c) $P(X > Y) = f(1, 0) + f(2, 0) + f(2, 1) = 0.3$

(d) Conditional probability function of X given Y = 0:

    x                   0                1               2                Total
    P(X = x | Y = 0)    0.15/0.3 = 3/6   0.1/0.3 = 2/6   0.05/0.3 = 1/6   1

(e) Probability function of T = X + Y:

    t            0      1                   2                   3      Total
    P(T = t)     0.15   0.1 + 0.35 = 0.45   0.05 + 0.2 = 0.25   0.15   1
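The calculations in 9.1 are mechanical once the joint table is stored. The following sketch (a dictionary keyed by (x, y); the variable names are chosen here for illustration) recomputes the marginals, the independence check, P(X > Y) and the distribution of T = X + Y.

```python
# Joint probability function f(x, y) from the table in 9.1, keyed by (x, y)
joint = {
    (0, 0): 0.15, (1, 0): 0.10, (2, 0): 0.05,
    (0, 1): 0.35, (1, 1): 0.20, (2, 1): 0.15,
}
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

# (a) marginal probability functions
f1 = {x: sum(joint[(x, y)] for y in ys) for x in xs}
f2 = {y: sum(joint[(x, y)] for x in xs) for y in ys}

# (b) X and Y are independent only if f(x, y) = f1(x) f2(y) for every pair
independent = all(abs(joint[(x, y)] - f1[x] * f2[y]) < 1e-12
                  for x in xs for y in ys)

# (c) P(X > Y)
p_x_gt_y = sum(p for (x, y), p in joint.items() if x > y)

# (e) probability function of T = X + Y
dist_T = {}
for (x, y), p in joint.items():
    dist_T[x + y] = dist_T.get(x + y, 0.0) + p

print(f1, f2, independent, p_x_gt_y, dist_T)
```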
9.2 (a)
y
p (x; y) 0 1 2 3 4 5 6 7 8 9 p1 (x)
0 0:096 0 0 0 0 0 0:004 0 0 0 0:1
1 0 0:1 0 0 0 0 0 0 0 0 0:1
2 0 0 0:1 0 0 0 0 0 0 0 0:1
3 0 0 0 0:1 0 0 0 0 0 0 0:1
x 4 0 0 0 0 0:098 0 0 0:002 0 0 0:1
5 0 0 0 0 0 0:095 0 0 0 0:005 0:1
6 0:004 0 0 0 0 0 0:096 0 0 0 0:1
7 0 0 0 0 0:002 0 0 0:098 0 0 0:1
8 0 0 0 0 0 0 0 0 0:1 0 0:1
9 0 0 0 0 0 0:005 0 0 0 0:095 0:1
p2 (y) 0:1 0:1 0:1 0:1 0:1 0:1 0:1 0:1 0:1 0:1 1
Since $P(X = 0, Y = 0) = 0.096 \ne P(X = 0)P(Y = 0) = (0.1)(0.1) = 0.01$, therefore X and Y are not independent random variables.
(b)
$$P(X = Y) = \sum_{(x,y):\, x = y} p(x, y) = 2(0.096) + 4(0.1) + 2(0.098) + 2(0.095) = 0.978$$
(c) $P(\text{number 5 is identified incorrectly} \mid \text{number is a five}) = p(5, 9)/0.1 = 0.005/0.1 = 0.05$.
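9.3 (a) The joint probability function of X and Y is
$$f(x, y) = \frac{\binom{2}{x}\binom{1}{y}\binom{7}{3-x-y}}{\binom{10}{3}} \quad x = 0, 1, 2;\ y = 0, 1;\ x + y \le 3$$
(b) The marginal probability function of X is
$$f_1(x) = \frac{\binom{2}{x}\binom{8}{3-x}}{\binom{10}{3}} \quad x = 0, 1, 2$$
The marginal probability function of Y is
$$f_2(y) = \frac{\binom{1}{y}\binom{9}{3-y}}{\binom{10}{3}} \quad y = 0, 1$$
(c)
$$\begin{aligned}
P(X = Y) &= \sum_{(x,y):\, x = y} f(x, y) = P(X = 0, Y = 0) + P(X = 1, Y = 1) \\
&= \frac{\binom{2}{0}\binom{1}{0}\binom{7}{3}}{\binom{10}{3}} + \frac{\binom{2}{1}\binom{1}{1}\binom{7}{1}}{\binom{10}{3}} = \frac{49}{120}
\end{aligned}$$
(d)
$$P(X = 1 \mid Y = 0) = \frac{P(X = 1, Y = 0)}{P(Y = 0)} = \frac{\binom{2}{1}\binom{1}{0}\binom{7}{2} \big/ \binom{10}{3}}{\binom{1}{0}\binom{9}{3} \big/ \binom{10}{3}} = \frac{1}{2}$$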
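The answers to 9.3 (c) and (d) can be confirmed with exact rational arithmetic. The sketch below uses Python's `math.comb` and `fractions.Fraction`; the function name `f` simply mirrors the joint probability function in part (a).

```python
from fractions import Fraction
from math import comb

def f(x, y):
    """Joint probability function from 9.3 (a)."""
    return Fraction(comb(2, x) * comb(1, y) * comb(7, 3 - x - y), comb(10, 3))

# Sanity check: the joint probabilities sum to one
total = sum(f(x, y) for x in range(3) for y in range(2))

# (c) P(X = Y)
p_equal = f(0, 0) + f(1, 1)

# (d) P(X = 1 | Y = 0), with P(Y = 0) obtained by summing over x
p_y0 = sum(f(x, 0) for x in range(3))
p_x1_given_y0 = f(1, 0) / p_y0

print(total, p_equal, p_x1_given_y0)
```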
9.4 (a) The event "x yellow balls on 1st 2 draws and y yellow balls on 4 draws" only happens if x yellow balls are drawn on the first 2 draws and the remaining y − x yellow balls are drawn on the last 2 draws. The joint probability function of X and Y is
$$\begin{aligned}
P(X = x, Y = y) &= P(x \text{ yellow balls on 1st 2 draws and } y - x \text{ yellow balls on last 2 draws}) \\
&= P(x \text{ yellow balls on 1st 2 draws}) \\
&\quad \times P(y - x \text{ yellow balls on last 2 draws given } x \text{ yellow balls on 1st 2 draws}) \\
&= \frac{\binom{5}{x}\binom{3}{2-x}}{\binom{8}{2}} \cdot \frac{\binom{5-x}{y-x}\binom{3-(2-x)}{2-(y-x)}}{\binom{6}{2}} \\
&= \frac{\binom{5}{x}\binom{3}{2-x}}{\binom{8}{2}} \cdot \frac{\binom{5-x}{y-x}\binom{x+1}{2+x-y}}{\binom{6}{2}} \quad \text{for } x = 0, 1, 2;\ y = \max(1, x), x+1, x+2
\end{aligned}$$
(b) Since X = number of yellow balls in first 2 draws without replacement, the marginal distribution of X is Hypergeometric with marginal probability function
$$P(X = x) = \frac{\binom{5}{x}\binom{3}{2-x}}{\binom{8}{2}} \quad x = 0, 1, 2$$
Similarly since Y = number of yellow balls in all 4 draws without replacement, the marginal probability function of Y is
$$P(Y = y) = \frac{\binom{5}{y}\binom{3}{4-y}}{\binom{8}{4}} \quad y = 1, 2, 3, 4.$$
Since
$$P(X = 0, Y = 3) = 0 \ne P(X = 0)P(Y = 3) = \frac{\binom{3}{2}}{\binom{8}{2}} \cdot \frac{\binom{5}{3}\binom{3}{1}}{\binom{8}{4}}$$
therefore X and Y are not independent random variables.
9.5 (a)
$$\begin{aligned}
P(X + Y > 1) &= \sum_{(x,y):\, x+y>1} P(X = x, Y = y) = 1 - \sum_{(x,y):\, x+y \le 1} P(X = x, Y = y) \\
&= 1 - [P(X = 0, Y = 0) + P(X = 0, Y = 1) + P(X = 1, Y = 0)] \\
&= 1 - P(X = 0)P(Y = 0) - P(X = 0)P(Y = 1) - P(X = 1)P(Y = 0)
\end{aligned}$$
since X and Y are independent random variables. Since $X \sim Poisson(0.1)$ and $Y \sim Poisson(0.05)$
$$\begin{aligned}
P(X + Y > 1) &= 1 - e^{-0.1}e^{-0.05} - e^{-0.1}\frac{(0.05)^1 e^{-0.05}}{1!} - \frac{(0.1)^1 e^{-0.1}}{1!}e^{-0.05} \\
&= 1 - e^{-0.15}(1 + 0.05 + 0.1) = 1 - 1.15e^{-0.15}
\end{aligned}$$
(b) $E(X + Y) = E(X) + E(Y) = 0.1 + 0.05 = 0.15$ and
$Var(X + Y) = Var(X) + Var(Y) = 0.1 + 0.05 = 0.15$.
9.6 (a) Note that
$$f(x, y) = P(X = x, Y = y) = \frac{2^{x+y}e^{-4}}{x!\,y!} = \frac{2^x e^{-2}}{x!} \cdot \frac{2^y e^{-2}}{y!} \quad \text{for } x = 0, 1, \ldots \text{ and } y = 0, 1, \ldots$$
We recognize $\frac{2^x e^{-2}}{x!}$ as the probability function of a Poisson(2) random variable. Therefore $X \sim Poisson(2)$ and $Y \sim Poisson(2)$ independently.
(b) Since $X \sim Poisson(2)$ and $Y \sim Poisson(2)$ independently, therefore $X + Y \sim Poisson(4)$ by Theorem 29.
9.7 First note that

X
P (Y = y) = P (X = x; Y = y)
all x
X
= P (Y = yjX = x)P (X = x) by the Product Rule
all x
X
= f2 (yjx) f1 (x)
all x
Since X v P oisson ( )
xe
f1 (x) =
for x = 0; 1; : : : :
x!
Also, for a given number x of defective items produced, the number, Y, detected has a Binomial distribution with n = x and p = 0.9, assuming each inspection takes place independently, so
$$f_2(y \mid x) = \binom{x}{y}(0.9)^y (0.1)^{x-y} \quad \text{for } y = 0, 1, \ldots, x.$$
y
Therefore
f (x; y) = f1 (x)f2 (yjx)

xe x! y = 0; 1; : : : ; x x = y; y + 1; : : :
= (0:9)y (0:1)x y
for or
x! y!(x y)! x = 0; 1; : : : y = 0; 1; : : :
To get f1 (xjy) we need f2 (y). We have

X 1
X xe
f2 (y) = f (x; y) = (0:9)y (0:1)x y
x=y
y!(x y)!
all x
(x y since the number of defective items produced can’t be less than the number detected)
1
X
(0:9)y e x (0:1)x y
=
y! x=y
(x y)!
Then
X1
(0:9 )y e (0:1 )x y
f2 (y) =
y! x=y
(x y)!
(0:9 )y e (0:1 )0 (0:1 )1 (0:1 )2
= + + +
y! 0! 1! 2!
(0:9 )y e
= e0:1 by the Exponential series
y!
(0:9 )y e 0:9
= for y = 0; 1; : : :
y!
Therefore
xe (0:9)y (0:1)x y
f (x; y) y!((x y)!
f1 (xjy) = = (0:9)y y e :9
f2 (y)
y!
(0:1 )x y e 0:1
= for x = y; y + 1; y + 2; : : :
(x y)!
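9.8 Let $X_i$ = the number of offspring of type i in a sample of size 40, i = 1, 2, 3, 4. Then
$$P(X_1 = x_1, X_2 = x_2, X_3 = x_3, X_4 = x_4) = \frac{40!}{x_1!\,x_2!\,x_3!\,x_4!}\left(\frac{3}{16}\right)^{x_1}\left(\frac{5}{16}\right)^{x_2}\left(\frac{5}{16}\right)^{x_3}\left(\frac{3}{16}\right)^{x_4}$$
$$x_i = 0, 1, \ldots;\ i = 1, 2, 3, 4 \text{ and } x_1 + x_2 + x_3 + x_4 = 40$$
(a)
$$\begin{aligned}
P(X_1 = 10, X_2 = 10, X_3 = 10, X_4 = 10) &= \frac{40!}{10!\,10!\,10!\,10!}\left(\frac{3}{16}\right)^{10}\left(\frac{5}{16}\right)^{10}\left(\frac{5}{16}\right)^{10}\left(\frac{3}{16}\right)^{10} \\
&= \frac{40!}{(10!)^4}\left(\frac{3}{16}\right)^{20}\left(\frac{5}{16}\right)^{20}
\end{aligned}$$
(b) The probability of a type 1 or type 2 offspring is $\frac{3}{16} + \frac{5}{16} = \frac{1}{2}$. Therefore $X_1 + X_2 \sim Binomial\left(40, \frac{1}{2}\right)$ and
$$P(X_1 + X_2 = 16) = \binom{40}{16}\left(\frac{1}{2}\right)^{40}$$
(c)
$$\begin{aligned}
P(X_1 = 10 \mid X_1 + X_2 = 16) &= \frac{P(X_1 = 10, X_1 + X_2 = 16)}{P(X_1 + X_2 = 16)} = \frac{P(X_1 = 10, X_2 = 6)}{P(X_1 + X_2 = 16)} \\
&= \frac{\frac{40!}{10!\,6!\,24!}\left(\frac{3}{16}\right)^{10}\left(\frac{5}{16}\right)^{6}\left(\frac{8}{16}\right)^{24}}{\frac{40!}{16!\,24!}\left(\frac{8}{16}\right)^{40}} \\
&= \binom{16}{10}\left(\frac{3}{8}\right)^{10}\left(\frac{5}{8}\right)^{6}
\end{aligned}$$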
9.9 Let X = number of bacteria in 50 cubic centimeters of water. Then X has a Poisson(2.5) distribution.
$$P(X = x) = \frac{(2.5)^x e^{-2.5}}{x!} \quad x = 0, 1, \ldots$$
Then if $(X_0, X_1, X_{2+})$ represent the number of samples with 0, 1, and 2 or more bacteria in the five samples, having respectively probabilities $e^{-2.5}$, $2.5e^{-2.5}$ and $1 - 3.5e^{-2.5}$, we have
$$P(X_0 = 1, X_1 = 2, X_{2+} = 2) = \frac{5!}{1!\,2!\,2!}\left(e^{-2.5}\right)^1\left(2.5e^{-2.5}\right)^2\left(1 - 3.5e^{-2.5}\right)^2$$
9.10 Let X = the lifetime of a light bulb. Then $X \sim Exponential(1000)$.

(a)
$$\begin{aligned}
P(X < 500) &= 1 - e^{-500/1000} = 1 - e^{-0.5} \\
P(500 < X < 1000) &= e^{-0.5} - e^{-1} \\
P(1000 < X < 1500) &= e^{-1} - e^{-1.5} \\
P(X > 1500) &= e^{-1.5}
\end{aligned}$$
(b) Let A be the event: 15 light bulbs last less than 500 hours, 15 light bulbs last between 500 and 1000 hours, and 10 light bulbs last between 1000 and 1500 hours.
$$P(A) = \frac{50!}{15!\,15!\,10!\,10!}\left(1 - e^{-0.5}\right)^{15}\left(e^{-0.5} - e^{-1}\right)^{15}\left(e^{-1} - e^{-1.5}\right)^{10}\left(e^{-1.5}\right)^{10}$$
(c)
$$P(\text{10 or more light bulbs last longer than 1500 hours}) = 1 - \sum_{y=0}^{9}\binom{50}{y}\left(e^{-1.5}\right)^{y}\left(1 - e^{-1.5}\right)^{50-y}$$
9.11 From Chapter 8, Problem 15 we have P(A) = 0.06681, P(B) = 0.24173, P(C) = 0.38292, P(D) = 0.24173, P(F) = 0.06681.
(a)
$$P(\text{5 A's, 15 B's, 10 C's and 15 D's}) = \frac{50!}{5!\,15!\,10!\,15!\,5!}(0.06681)^5 (0.24173)^{15} (0.38292)^{10} (0.24173)^{15} (0.06681)^5$$
(b)
$$P(\text{at least 45 students have marks above an F}) = \sum_{y=45}^{50}\binom{50}{y}(0.93319)^y (0.06681)^{50-y}$$
(c) Let X = number of students who receive A's and let Y = the number of students who receive B's in a class of 50 students. Then the joint probability function of X and Y is
$$P(X = x, Y = y) = \frac{50!}{x!\,y!\,(50-x-y)!}(0.06681)^x (0.24173)^y (0.69146)^{50-x-y} \quad x, y = 0, 1, \ldots;\ x + y \le 50$$
9.12 (a)
$$P(X_1 = x_1, \ldots, X_6 = x_6) = \frac{10!}{x_1!\,x_2! \cdots x_6!}(0.1)^{x_1}(0.05)^{x_2}(0.05)^{x_3}(0.15)^{x_4}(0.15)^{x_5}(0.5)^{x_6}$$
$$x_i = 0, 1, \ldots;\ x_1 + x_2 + \cdots + x_6 = 10$$
(b)
$$\begin{aligned}
&P(\text{at least one apartment fire given 4 fire-related calls}) \\
&= P(X_3 \ge 1 \mid X_1 + X_2 + X_3 + X_4 = 4) = 1 - P(X_3 = 0 \mid X_1 + X_2 + X_3 + X_4 = 4) \\
&= 1 - \frac{P(X_3 = 0, X_1 + X_2 + X_3 + X_4 = 4)}{P(X_1 + X_2 + X_3 + X_4 = 4)} = 1 - \frac{P(X_3 = 0, X_1 + X_2 + X_4 = 4)}{P(X_1 + X_2 + X_3 + X_4 = 4)} \\
&= 1 - \frac{\frac{10!}{0!\,4!\,6!}(0.05)^0 (0.1 + 0.05 + 0.15)^4 (0.65)^6}{\frac{10!}{4!\,6!}(0.35)^4 (0.65)^6} = 1 - \frac{(0.3)^4}{(0.35)^4} = 1 - \left(\frac{6}{7}\right)^4
\end{aligned}$$
(c) Since $X_i \sim Binomial(10, p_i)$ then $E(X_i) = 10p_i$. The total cost T is given by
$$T = 100(5X_1 + 5X_2 + 7X_3 + 20X_4 + 4X_5 + 2X_6).$$
The expected cost is
$$\begin{aligned}
E(T) &= 100[5E(X_1) + 5E(X_2) + 7E(X_3) + 20E(X_4) + 4E(X_5) + 2E(X_6)] \\
&= 100(10)[5(0.1) + 5(0.05) + 7(0.05) + 20(0.15) + 4(0.15) + 2(0.5)] \\
&= 5700 \text{ dollars}
\end{aligned}$$
9.13 (a) The joint probability function of X and Y is
$$P(X = x, Y = y) = \frac{(9 + x + y)!}{x!\,y!\,9!}\,p^x q^y (1 - p - q)^{10} \quad x, y = 0, 1, 2, \ldots$$
(b)
$$\begin{aligned}
P(X = x) &= \sum_{y=0}^{\infty}\frac{(9 + x + y)!}{x!\,y!\,9!}\,p^x q^y (1 - p - q)^{10} \\
&= \frac{(9 + x)!}{x!\,9!}\,p^x (1 - p - q)^{10}\sum_{y=0}^{\infty}\frac{(9 + x + y)!}{y!\,(9 + x)!}\,q^y \\
&= \binom{9+x}{x}\frac{p^x (1 - p - q)^{10}}{(1 - q)^{10+x}}\sum_{y=0}^{\infty}\binom{(10 + x) + y - 1}{y}q^y (1 - q)^{10+x} \\
&= \binom{9+x}{x}p^x (1 - p - q)^{10}(1 - q)^{-10-x} \quad x = 0, 1, \ldots
\end{aligned}$$
where the sum is equal to one since
$$\sum_{y=0}^{\infty}\binom{k + y - 1}{y}q^y (1 - q)^k = 1$$
because it is a sum over all values for a Negative Binomial probability function.
(c) The conditional probability function of Y given X = x is
$$\begin{aligned}
P(Y = y \mid X = x) = \frac{P(X = x, Y = y)}{P(X = x)} &= \frac{\frac{(9+x+y)!}{x!\,y!\,9!}\,p^x q^y (1 - p - q)^{10}}{\frac{(9+x)!}{x!\,9!}\,p^x (1 - p - q)^{10}(1 - q)^{-10-x}} \\
&= \binom{(10 + x) + y - 1}{y}q^y (1 - q)^{x+10} \quad y = 0, 1, 2, \ldots
\end{aligned}$$
which we recognize as a Negative Binomial probability function with k = 10 + x and p = 1 − q.
9.14 (a)
P (X1 = 0; X2 = 2; X3 = 0; X4 = 1; X5 = 3; X6 = 1)
= P (X1 = 0) P (X2 = 2) P (X3 = 0) P (X4 = 1) P (X5 = 3) P (X6 = 1)
10 e 1 12 e 1 10 e 1 11 e 1 13 e 1 11 e 1
=
0! 2! 0! 1! 3! 1!
e 6
= for > 0:
12
(b)
P (X1 = 0; X2 = 2; X3 = 0; X4 = 1; X5 = 3; X6 = 1)
= P (X1 = 0) P (X2 = 2) P (X3 = 0) P (X4 = 1) P (X5 = 3) P (X6 = 1)
0e 2e 0e 1e 3e 1e
=
0! 2! 0! 1! 3! 1!
7e 6
= for > 0:
12
9.15

                     x
    f(x, y)     0      1      2      f2(y)
    y = 0       0.15   0.1    0.05   0.3
    y = 1       0.35   0.2    0.15   0.7
    f1(x)       0.5    0.3    0.2    1

$$E(X) = 0 + 0.3 + 2(0.2) = 0.7, \quad E(X^2) = 0 + 0.3 + (2)^2(0.2) = 1.1, \quad Var(X) = 1.1 - (0.7)^2 = 0.61$$
$$E(Y) = 0.7, \quad E(Y^2) = 0.7, \quad Var(Y) = 0.7 - (0.7)^2 = 0.21$$
$$E(XY) = 0.2 + 2(0.15) = 0.5, \quad Cov(X, Y) = 0.5 - (0.7)(0.7) = 0.01$$
$$\rho = \frac{0.01}{\sqrt{(0.61)(0.21)}} = 0.02794$$
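The moments in 9.15 follow directly from the joint table. Below is a small sketch (the helper `expect` and the variable names are illustrative) that recomputes the variances, covariance and correlation coefficient.

```python
import math

# Joint probability function from the table in 9.15, keyed by (x, y)
joint = {
    (0, 0): 0.15, (1, 0): 0.10, (2, 0): 0.05,
    (0, 1): 0.35, (1, 1): 0.20, (2, 1): 0.15,
}

def expect(g):
    """E[g(X, Y)] = sum of g(x, y) f(x, y) over all (x, y)."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: x * x) - mean_x ** 2
var_y = expect(lambda x, y: y * y) - mean_y ** 2
cov_xy = expect(lambda x, y: x * y) - mean_x * mean_y
rho = cov_xy / math.sqrt(var_x * var_y)

print(var_x, var_y, cov_xy, rho)
```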
9.16 Note that
$$Cov(X, Y) = \rho\sqrt{Var(X)}\sqrt{Var(Y)} = (-0.7)\sqrt{13}\sqrt{34} = (-0.7)\sqrt{442}$$
$$\begin{aligned}
Var(X - 2Y) &= Var(X) + (-2)^2 Var(Y) + 2(1)(-2)Cov(X, Y) \\
&= 13 + 4(34) - 4\left[(-0.7)\sqrt{442}\right] = 149 + 2.8\sqrt{442} \\
&= 207.867
\end{aligned}$$
9.17
$$\begin{aligned}
Cov(X + Y, X - Y) &= Var(X) - Cov(X, Y) + Cov(X, Y) - Var(Y) \\
&= Var(X) - Var(Y) = 1 - 2 = -1
\end{aligned}$$
9.18 (a)
$$E(U) = E(X + Y) = E(X) + E(Y) = 2(0.5) + 2(0.5) = 2$$
$$E(V) = E(X - Y) = E(X) - E(Y) = 2(0.5) - 2(0.5) = 0$$
(b)
$$Var(U) = Var(X + Y) = Var(X) + Var(Y) = 2(0.5)^2 + 2(0.5)^2 = 1$$
$$Var(V) = Var(X - Y) = Var(X) + Var(Y) = 2(0.5)^2 + 2(0.5)^2 = 1$$
(c)
$$Cov(U, V) = Cov(X + Y, X - Y) = Var(X) - Var(Y) = 0$$
Although $Cov(U, V) = 0$, U and V are not independent since $P(U = 0) \ne 0$ and $P(V = 1) \ne 0$ but $P(U = 0, V = 1) = 0$.
9.19 (a) For a trinomial distribution there are three possible outcomes. Call these possible outcomes A, B and C where P(A) = p, P(B) = q and P(C) = 1 − p − q. Since the joint distribution of X and Y is given as
$$f(x, y) = \frac{n!}{x!\,y!\,(n - x - y)!}\,p^x q^y (1 - p - q)^{n-x-y} \quad \text{for } x = 0, 1, \ldots, n;\ y = 0, 1, \ldots, n \text{ and } x + y \le n$$
then the random variable X counts the number of times outcome A occurs and Y counts the number of times outcome B occurs. Therefore the random variable T = X + Y counts the number of times outcome A or B occurs. Since $P(A \cup B) = p + q$ then $T = X + Y \sim Binomial(n, p + q)$.
(b) Since $T = X + Y \sim Binomial(n, p + q)$ then $E(T) = n(p + q)$ and $Var(T) = n(p + q)(1 - p - q)$.
(c) $Var(T) = n(p + q)(1 - p - q)$. Also
$$Var(T) = Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)$$
or
$$Cov(X, Y) = \frac{1}{2}[Var(T) - Var(X) - Var(Y)]$$
Since the marginal distributions of a Multinomial distribution are Binomial we know $X \sim Binomial(n, p)$ and $Y \sim Binomial(n, q)$ and thus $Var(X) = np(1 - p)$ and $Var(Y) = nq(1 - q)$. Therefore
$$\begin{aligned}
Cov(X, Y) &= \frac{1}{2}[Var(T) - Var(X) - Var(Y)] \\
&= \frac{1}{2}[n(p + q)(1 - p - q) - np(1 - p) - nq(1 - q)] \\
&= \frac{1}{2}[-npq - npq] = -npq
\end{aligned}$$
We would expect the covariance to be negative since we know that if X is large (number of A outcomes is large) then Y must be small (number of B outcomes is small) since the total number of trials n is fixed.
9.20 (a)
E (X)
= 0 (0:2) + 1 (0:25) + 2 (0:35) + 3 (0:1) + 4 (0:05) + 5 (0:02) + 6 (0:02) + 7 (0:01) + 8 (0:01)
= 1:76
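(b)
$$P(Y = y) = \sum_{x=y}^{8}\binom{x}{y}\left(\frac{1}{2}\right)^{x} f(x)$$
$$\begin{aligned}
E(Y) &= \sum_{y=0}^{8}\sum_{x=y}^{8} y\binom{x}{y}\left(\frac{1}{2}\right)^{x} f(x) && \text{change the order of summation} \\
&= \sum_{x=0}^{8} f(x)\left[\sum_{y=0}^{x} y\binom{x}{y}\left(\frac{1}{2}\right)^{x}\right] && \text{term in [ ] is the mean of a } Binomial\left(x, \tfrac{1}{2}\right) \text{ r.v.} \\
&= \sum_{x=0}^{8} f(x)\,\frac{x}{2} = \frac{1}{2}\sum_{x=0}^{8} x f(x) \\
&= \frac{1}{2}E(X) = \frac{1}{2}(1.76) = 0.88
\end{aligned}$$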
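9.21
$$E[g(X, Y)] = \sum_{\text{all } (x,y)} g(x, y) f(x, y) \le b\sum_{\text{all } (x,y)} f(x, y) = b.$$
Similarly $E[g(X, Y)] \ge a$.
9.22 The optimal weights are
$$w_1 = \frac{1}{c}\cdot\frac{1}{\sigma_1^2}, \quad w_2 = \frac{1}{c}\cdot\frac{1}{\sigma_2^2}, \quad w_3 = \frac{1}{c}\cdot\frac{1}{\sigma_3^2} \quad \text{where } c = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} + \frac{1}{\sigma_3^2}$$
and $\sigma_1 = 0.2$, $\sigma_2 = 0.3$, $\sigma_3 = 0.4$.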
9.23
1P n 1P n 1
E X = E (Xi ) = = (n ) =
n i=1 n i=1 n
and since the Xi ’s are independent random variables
2 n
1 P
V ar X = V ar (Xi )
n i=1
2 n 2
1 P 1
= = (n ) = ! 0 as n ! 1
n i=1 n n
9.24
1P n 1P n 1 1 1 1
E X = E (Xi ) = = n =
n i=1 n i=1 n
2 n
1 P
and V ar X = V ar (Xi )
n i=1
2 n 2
1 P1 1 1 1 1
= 2
= n 2
= 2
! 0 as n ! 1
n i=1 n n
9.25
1P n 1P n 1
E X = E (Xi ) = = (n ) =
n i=1 n i=1 n
2 n
1 P
and V ar X = V ar (Xi )
n i=1
2 n 2 2
1 P 2 1 2
= = n = ! 0 as n ! 1
n i=1 n n
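9.26 (a) $E(X_i^2) = Var(X_i) + [E(X_i)]^2 = \sigma^2 + \mu^2$.
(b)
$$E(\bar{X}) = E\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}\sum_{i=1}^{n}\mu = \frac{1}{n}(n\mu) = \mu$$
Since the $X_i$'s are independent random variables
$$Var(\bar{X}) = Var\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \left(\frac{1}{n}\right)^2\sum_{i=1}^{n} Var(X_i) = \left(\frac{1}{n}\right)^2\sum_{i=1}^{n}\sigma^2 = \left(\frac{1}{n}\right)^2 n\sigma^2 = \frac{\sigma^2}{n}$$
$$E(\bar{X}^2) = [E(\bar{X})]^2 + Var(\bar{X}) = \mu^2 + \frac{\sigma^2}{n}$$
(c)
$$\begin{aligned}
E(S^2) &= \frac{1}{n-1}\left[\sum_{i=1}^{n} E(X_i^2) - nE(\bar{X}^2)\right] \\
&= \frac{1}{n-1}\left[\sum_{i=1}^{n}(\sigma^2 + \mu^2) - n\left(\mu^2 + \frac{\sigma^2}{n}\right)\right] \\
&= \frac{1}{n-1}\left[n\sigma^2 + n\mu^2 - n\mu^2 - \sigma^2\right] \\
&= \frac{1}{n-1}(n - 1)\sigma^2 \\
&= \sigma^2
\end{aligned}$$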
9.27 (a) Since $X \sim G(-1.4, 1.5)$ and $Y \sim N(-2.1, 4)$ independently, then
$$E(X + Y) = E(X) + E(Y) = -1.4 + (-2.1) = -3.5$$
and
$$Var(X + Y) = Var(X) + Var(Y) = (1.5)^2 + 4 = 2.25 + 4 = 6.25 = (2.5)^2$$
so $X + Y \sim G(-3.5, 2.5)$ and
$$\begin{aligned}
P(X + Y > -6) &= P\left(Z > \frac{-6 - (-3.5)}{2.5}\right) = P\left(Z > \frac{-2.5}{2.5}\right) \text{ where } Z \sim N(0,1) \\
&= P(Z > -1) = P(Z \le 1) = 0.84134
\end{aligned}$$
(b) Since $X \sim G(-1.4, 1.5)$ and $Y \sim N(-2.1, 4)$ independently, $-2X + Y \sim N(0.7, 13)$ and
$$P(-2X + Y < 3) = P\left(Z < \frac{3 - 0.7}{\sqrt{13}}\right) \approx P(Z < 0.64) = 0.73891$$
(c) Since $X \sim G(-1.4, 1.5)$ and $Y \sim N(-2.1, 4)$ independently, $Y - X \sim N(-0.7, 6.25)$ and
$$\begin{aligned}
P(Y < X) = P(Y - X < 0) &= P\left(Z < \frac{0 - (-0.7)}{2.5}\right) \text{ where } Z \sim N(0,1) \\
&= P\left(Z < \frac{0.7}{2.5}\right) = P(Z < 0.28) \\
&= 0.61026
\end{aligned}$$
9.28 (a) 1:22

(b) 17:67%
9.29 (a) Let X = amount of wine in a bottle. Then $X \sim N(1.05, 0.0004)$. A bottle is labelled as containing 1 liter. The probability that the bottle contains less than 1 liter is
$$\begin{aligned}
P(\text{bottle contains less than 1 liter}) &= P(X < 1) = P\left(Z < \frac{1 - 1.05}{0.02}\right) \text{ where } Z \sim N(0,1) \\
&= P(Z < -2.5) = 1 - P(Z \le 2.5) = 1 - 0.99379 = 0.00621
\end{aligned}$$
(b) Let V = volume of a cask. Then $V \sim N(22, 0.16)$. Let $X_i$ = amount of wine in the ith bottle, i = 1, 2, ..., 20. Then $X_i \sim N(1.05, 0.0004)$, i = 1, 2, ..., 20 independently. Therefore
$$T = \sum_{i=1}^{20} X_i \sim N(20(1.05), 20(0.0004)) \quad \text{or} \quad T \sim N(21, 0.008).$$
$$P(\text{contents of 20 bottles fit inside}) = P(V \ge T) = P(V - T \ge 0)$$
Since $V \sim N(22, 0.16)$ independently of $T \sim N(21, 0.008)$, $V - T \sim N(22 - 21, 0.16 + 0.008)$ or $V - T \sim N(1, 0.168)$.
Therefore
$$\begin{aligned}
P(V - T \ge 0) &= P\left(Z \ge \frac{0 - 1}{\sqrt{0.168}}\right) \text{ where } Z \sim N(0,1) \\
&\approx P(Z \ge -2.44) = P(Z \le 2.44) = 0.99266
\end{aligned}$$
9.30 0:4134
9.31 (a) 0:0062 (b) 0:0771

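9.32 (a) Since $\bar{X}$ is a linear combination of independent Normal random variables it has a Normal distribution.
(b)
$$E(\bar{X}) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}\sum_{i=1}^{n}\mu = \frac{1}{n}(n\mu) = \mu$$
$$Var(\bar{X}) = \left(\frac{1}{n}\right)^2\sum_{i=1}^{n} Var(X_i) = \left(\frac{1}{n}\right)^2\sum_{i=1}^{n}\sigma^2 = \left(\frac{1}{n}\right)^2 n\sigma^2 = \frac{\sigma^2}{n}$$
(c)
$$P\left(\left|\bar{X} - \mu\right| \le 1.96\sigma/\sqrt{n}\right) = P(|Z| \le 1.96) \text{ where } Z \sim N(0,1)$$
$$= 2P(Z \le 1.96) - 1 = 2(0.975) - 1 = 0.95$$
(d) We want $P\left(\left|\bar{X} - \mu\right| \le 1.0\right) \ge 0.95$ where $\bar{X} \sim G(\mu, 12/\sqrt{n})$ or
$$P\left(\left|\bar{X} - \mu\right| \le 1.0\right) = P\left(\frac{\left|\bar{X} - \mu\right|}{12/\sqrt{n}} \le \frac{1.0}{12/\sqrt{n}}\right) = P\left(|Z| \le \frac{\sqrt{n}}{12}\right) \ge 0.95 \quad \text{where } Z \sim N(0,1)$$
Since $P(|Z| \le 1.96) = 0.95$ we want $\sqrt{n}/12 \ge 1.96$ or $n \ge (1.96)^2(144) = 553.2$. Therefore n should be at least 554.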
9.33 Let $T = X_1 + X_2 + \cdots + X_5$ = number of adjacent pairs of unlike beads in a necklace.

Since $E(X_1) = E(X_2) = \cdots = E(X_5)$ and
$$\begin{aligned}
E(X_1) = P(X_1 = 1) &= P(\text{Bead 1 is Pink and Bead 2 is Blue}) + P(\text{Bead 1 is Blue and Bead 2 is Pink}) \\
&= \frac{2}{3}\cdot\frac{1}{3} + \frac{1}{3}\cdot\frac{2}{3} = \frac{4}{9}
\end{aligned}$$
therefore
$$E(T) = 5\left(\frac{4}{9}\right) = \frac{20}{9}.$$
Now $Var(X_1) = Var(X_2) = \cdots = Var(X_5)$ and
$$Var(X_1) = P(X_1 = 1)[1 - P(X_1 = 1)] = \frac{4}{9}\cdot\frac{5}{9} = \frac{20}{81}$$
To find $Cov(X_1, X_2)$ we note that
$$\begin{aligned}
E(X_1 X_2) = P(X_1 = 1, X_2 = 1) &= P(\text{Bead 1 is Pink, Bead 2 is Blue, Bead 3 is Pink}) \\
&\quad + P(\text{Bead 1 is Blue, Bead 2 is Pink, Bead 3 is Blue}) \\
&= \frac{2}{3}\cdot\frac{1}{3}\cdot\frac{2}{3} + \frac{1}{3}\cdot\frac{2}{3}\cdot\frac{1}{3} = \frac{6}{27} = \frac{18}{81}
\end{aligned}$$
and therefore
$$Cov(X_1, X_2) = E(X_1 X_2) - E(X_1)E(X_2) = \frac{18}{81} - \frac{4}{9}\cdot\frac{4}{9} = \frac{2}{81}.$$
Now $Cov(X_1, X_2) = Cov(X_2, X_3) = Cov(X_3, X_4) = Cov(X_4, X_5) = Cov(X_5, X_1) = \frac{2}{81}$ and all other covariances are zero. Therefore
$$Var(T) = 5\left(\frac{20}{81}\right) + 2(5)\left(\frac{2}{81}\right) = \frac{100 + 20}{81} = \frac{40}{27}.$$
9.34 p3 (4 + p); 4p3 (1 p3 ) + p4 (1 p4 ) + 8p5 (1 p2 )
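9.35 Let
$$X_i = \begin{cases} 1 & \text{student answers question } i \text{ correctly} \\ 0 & \text{otherwise} \end{cases} \quad i = 1, 2, \ldots, 100$$
(a) If the student guesses randomly then
$$\begin{aligned}
P(X_i = 1) &= P(\text{student answers question } i \text{ correctly}) \\
&= P(\text{student knows the answer}) + P(\text{student does not know the answer but guesses correctly}) \\
&= p_i + \frac{1}{5}(1 - p_i) = \frac{4}{5}p_i + \frac{1}{5} = \frac{1}{5}(1 + 4p_i) = q_i, \quad i = 1, 2, \ldots, 100
\end{aligned}$$
Then $X_i \sim Binomial(1, q_i)$ with
$$E(X_i) = q_i = \frac{1}{5}(1 + 4p_i)$$
$$\begin{aligned}
Var(X_i) = q_i(1 - q_i) &= \frac{1}{5}(1 + 4p_i)\cdot\frac{4}{5}(1 - p_i) = \frac{4}{25}(1 + 4p_i)(1 - p_i) \\
&= \frac{4}{25}[(1 - p_i) + 4p_i(1 - p_i)] \quad \text{for } i = 1, 2, \ldots, 100
\end{aligned}$$
Let $S = \sum_{i=1}^{100} X_i$. Then
$$E(S) = E\left(\sum_{i=1}^{100} X_i\right) = \sum_{i=1}^{100} E(X_i) = \sum_{i=1}^{100}\frac{1}{5}(1 + 4p_i) = \frac{100}{5} + \frac{4}{5}\sum_{i=1}^{100} p_i$$
and
$$Var(S) = Var\left(\sum_{i=1}^{100} X_i\right) = \sum_{i=1}^{100} Var(X_i) = \frac{4}{25}\sum_{i=1}^{100}[(1 - p_i) + 4p_i(1 - p_i)].$$
The student's total mark is given by
$$T = \sum_{i=1}^{100} X_i - \frac{1}{4}\left(100 - \sum_{i=1}^{100} X_i\right) = \frac{5}{4}\sum_{i=1}^{100} X_i - 25 = \frac{5}{4}S - 25.$$
Therefore
$$E(T) = E\left(\frac{5}{4}S - 25\right) = \frac{5}{4}E(S) - 25 = \frac{5}{4}\left(\frac{100}{5} + \frac{4}{5}\sum_{i=1}^{100} p_i\right) - 25 = \sum_{i=1}^{100} p_i$$
and
$$\begin{aligned}
Var(T) = Var\left(\frac{5}{4}S - 25\right) = \frac{25}{16}Var(S) &= \frac{25}{16}\cdot\frac{4}{25}\sum_{i=1}^{100}[(1 - p_i) + 4p_i(1 - p_i)] \\
&= \sum_{i=1}^{100} p_i(1 - p_i) + \frac{1}{4}\left(100 - \sum_{i=1}^{100} p_i\right)
\end{aligned}$$
as required.
(b) If the student does not guess then $X_i \sim Binomial(1, p_i)$ with $E(X_i) = p_i$ and $Var(X_i) = p_i(1 - p_i)$. The student's total mark is $S = \sum_{i=1}^{100} X_i$ with
$$E(S) = E\left(\sum_{i=1}^{100} X_i\right) = \sum_{i=1}^{100} E(X_i) = \sum_{i=1}^{100} p_i$$
and
$$Var(S) = Var\left(\sum_{i=1}^{100} X_i\right) = \sum_{i=1}^{100} Var(X_i) = \sum_{i=1}^{100} p_i(1 - p_i)$$
as required.
(c) (i) If $p_i = 0.9$ then $Var(T) = 11.5$ and $Var(S) = 9$.
(ii) If $p_i = 0.5$ then $Var(T) = 37.5$ and $Var(S) = 25$.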
9.36 (a) X = the number of keys assigned to a given list has a $Binomial\left(n, \frac{1}{M}\right)$ distribution. The expected number of keys assigned to a given list is
$$E(X) = n\left(\frac{1}{M}\right) = \frac{n}{M}$$
(b) Consider slot i in the hash table and let $S_i = 1$ if the slot is empty and $S_i = 0$ otherwise, i = 1, 2, ..., M. Then
$$P(S_i = 1) = \left(1 - \frac{1}{M}\right)^n$$
and
$$E(S_i) = (1)\left(1 - \frac{1}{M}\right)^n + (0)\left[1 - \left(1 - \frac{1}{M}\right)^n\right] = \left(1 - \frac{1}{M}\right)^n$$
Now $S = S_1 + S_2 + \cdots + S_M$ = the number of empty slots and the expected number of empty slots is
$$E(S) = E(S_1 + S_2 + \cdots + S_M) = E(S_1) + E(S_2) + \cdots + E(S_M) = M\left(1 - \frac{1}{M}\right)^n$$
(c) We first note that
    number of collisions = n − number of occupied slots
and
    number of occupied slots = M − number of empty slots
so
    number of collisions = n − (M − number of empty slots) = n − M + number of empty slots
Using the result from (b) we have
$$E(\text{number of collisions}) = n - M + E(\text{number of empty slots}) = n - M + M\left(1 - \frac{1}{M}\right)^n$$
(d) Let $X_i$ = number of keys in the table when a total of i slots are assigned for the first time, i = 1, 2, ..., M. Then $T = \sum_{i=1}^{M} X_i$ = number of keys in the table when every slot has at least one key for the first time. Now $X_1 = 1$ with probability one so $E(X_1) = 1$. $X_2$ = the number of keys assigned when a second slot is assigned for the first time in a sequence of Bernoulli trials where a success is "a second slot is chosen for the first time" and $P(\text{Success}) = \frac{M-1}{M}$. Recall if $X \sim Geometric(p)$ then $E(X) = (1 - p)/p$. Therefore
$$E(X_2) = 1 + \frac{1 - \frac{M-1}{M}}{\frac{M-1}{M}} = \frac{\frac{M-1}{M} + 1 - \frac{M-1}{M}}{\frac{M-1}{M}} = \frac{M}{M-1}$$
Similarly, $X_3$ = the number of keys assigned when a third slot is assigned for the first time in a sequence of Bernoulli trials where a success is "a third slot is chosen for the first time" and $P(\text{Success}) = \frac{M-2}{M}$. Therefore $E(X_3) = \frac{M}{M-2}$. Continuing in this manner we find
$$E(T) = \sum_{i=1}^{M} E(X_i) = \sum_{i=1}^{M}\frac{M}{M - i + 1} = M\sum_{j=1}^{M}\frac{1}{j}$$
which is M times the sum of the first M terms of the harmonic series, which does not have a closed form. Using the approximation
$$\sum_{j=1}^{M}\frac{1}{j} \approx \ln M$$
we have
$$E(T) = M\sum_{j=1}^{M}\frac{1}{j} \approx M\ln M$$
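The formulas in 9.36 are easy to tabulate. The sketch below is illustrative only: the function names and the example values of n and M are assumptions made here, not part of the problem.

```python
import math

def expected_empty_slots(n, M):
    """Part (b): E(S) = M (1 - 1/M)^n."""
    return M * (1 - 1 / M) ** n

def expected_collisions(n, M):
    """Part (c): E(collisions) = n - M + M (1 - 1/M)^n."""
    return n - M + expected_empty_slots(n, M)

def expected_keys_to_fill(M):
    """Part (d): E(T) = M * (1 + 1/2 + ... + 1/M)."""
    return M * sum(1 / j for j in range(1, M + 1))

# Example values (assumed for illustration): 1000 keys hashed into 500 slots
n, M = 1000, 500
print(expected_empty_slots(n, M), expected_collisions(n, M))

# Exact harmonic-sum value next to the M ln M approximation from the solution
print(expected_keys_to_fill(M), M * math.log(M))
```

The approximation $M \ln M$ drops lower-order terms of the harmonic sum, so the exact value printed first is somewhat larger.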
9.37 Suppose P is $N \times N$ and let $\mathbf{1}$ be a column vector of ones of length N. Consider the probability vector corresponding to the discrete Uniform distribution $\pi = \frac{1}{N}\mathbf{1}$. Then
$$\pi^T P = \frac{1}{N}\mathbf{1}^T P = \frac{1}{N}\left(\sum_{i=1}^{N} P_{i1}, \sum_{i=1}^{N} P_{i2}, \ldots, \sum_{i=1}^{N} P_{iN}\right) = \frac{1}{N}\mathbf{1}^T = \pi^T$$
since P is doubly stochastic. Therefore $\pi$ is a stationary distribution of the Markov chain.
9.38 The transition matrix is
$$P = \begin{bmatrix} 0 & 1 & 0 \\ \frac{2}{3} & 0 & \frac{1}{3} \\ \frac{2}{3} & \frac{1}{3} & 0 \end{bmatrix}$$
from which, solving $\pi^T P = \pi^T$ and rescaling so that the sum of the probabilities is one, we obtain $\pi^T = (0.4, 0.45, 0.15)$, the long run fraction of time spent in cities A, B, C respectively.
9.39 By arguments similar to those in Section 9.3, the limiting matrix has rows all identically $\pi^T$ where the vector $\pi^T$ contains the stationary probabilities satisfying $\pi^T P = \pi^T$ and
$$P = \begin{bmatrix} 0 & 1 & 0 \\ \frac{1}{6} & \frac{1}{2} & \frac{1}{3} \\ 0 & \frac{2}{3} & \frac{1}{3} \end{bmatrix}$$
The solution is $\pi^T = (0.1, 0.6, 0.3)$ and the limit is
$$\begin{bmatrix} 0.1 & 0.6 & 0.3 \\ 0.1 & 0.6 & 0.3 \\ 0.1 & 0.6 & 0.3 \end{bmatrix}$$
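The stationary distributions quoted in 9.38 and 9.39 can be reproduced numerically. The sketch below does not solve the linear system used in the solutions; instead it repeatedly multiplies a starting probability vector by P (power iteration), which converges here because both chains are irreducible and aperiodic. The function name and iteration count are illustrative choices.

```python
def stationary(P, iterations=200):
    """Approximate the stationary row vector by iterating pi <- pi P."""
    n = len(P)
    pi = [1.0 / n] * n                      # start from the uniform distribution
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Transition matrix from 9.38 (cities A, B, C)
P_938 = [
    [0.0, 1.0, 0.0],
    [2 / 3, 0.0, 1 / 3],
    [2 / 3, 1 / 3, 0.0],
]

# Transition matrix from 9.39
P_939 = [
    [0.0, 1.0, 0.0],
    [1 / 6, 1 / 2, 1 / 3],
    [0.0, 2 / 3, 1 / 3],
]

pi_938 = stationary(P_938)
pi_939 = stationary(P_939)
print(pi_938, pi_939)
```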
9.40 If today is raining, the probability of Rain, Nice, Snow three days from now is obtainable from the first row of the matrix $P^3$; that is, (0.406, 0.203, 0.391). The probabilities of the three states in five days, given (1) today is raining, (2) today is nice, (3) today is snowing, are the three rows of the matrix $P^5$. In this case all rows are identical to three decimals; they are all equal to the equilibrium distribution $\pi^T = (0.400, 0.200, 0.400)$.
9.41 (a) If a > b, and both parties raise then the probability B wins is
13 b 1
<
2(13 a) 2
13 2a+b
and the probability A wins is 1 minus this or 2(13 a) . If a b; then the probability A wins is
13 a
2(13 b)
In the special case b = 1, count the number of possible pairs (i; j) for which A = i a and
B = j > A:
1 A = 12
2 A = 11
.. ..
. .
13 a A=a
(13 a)(13 a+1)
2 Total
This leads to
(13 a)(13 a + 1)
P (B > A; A a) =
2(132 )
Similarly, since the number of pairs (A; B) for which A a; and B < a is (13 a + 1)(a 1);
we have
P (A > B; A a) = P (A > B; A a; B a) + P (A > B; A a; B < a)

(13 a)(13 a + 1) (13 a + 1)(a 1) (14 a)(a + 11)
= + =
2(132 ) 132 2(132 )
Therefore, in case b = 1, the expected winnings of A are
1P (B raises, A does not) 6P (both raise, B wins) + 6P (both raise, A wins)

= 1P (A < a) 6P (B > A; A a) + 6P (A > B; A a)
a 1 (13 a)(13 a + 1) (14 a)(a + 11)
= 1 6 +6
13 2(132 ) 2(132 )
6 2 77 71 1
= a + a = (a 1) (6a 71)
169 169 169 169
whose maximum (over real a) is at 77=12 and over integer a, at 6 or 7. For a = 1; 2; : : : ; 13

this gives expected winnings of 0, 0:38462, 0:69231, 0:92308, 1:0769, 1:1538, 1:1538, 1:0769,
0:92308, 0:69231, 0:38462, 0, 0:46154 respectively, and the maximum occurs for a = 6 or 7.
(b) We want P (A > B; A a; B b): Count the number of pairs (i; j) for which A a and
B b and A > B: Assume that b a.
1 B = 12
2 B = 11
.. ..
. .
13 a B=a
(a b)(13 a + 1) b B<a
for a total of
(13 a)(13 a + 1) 1
+ (a b)(13 a + 1) = (14 a) (13 + a 2b)
2 2
and
(14 a) (13 + a 2b)
P (A > B; A a; B b) =
2(132 )
Similarly
(13 a)(13 a + 1)
P (A < B; A a; B b) = P (A < B; A a) =
2(132 )
Therefore the expected return to A (still assuming b a) is
1P (A < a; B b) + 1P (A a; B < b)
+6P (A > B; A a; B b) 6P (A < B; A a; B b)
(a 1)(13 b + 1) (b 1)(13 a + 1)
= 1 2
+1
13 132
(14 a) (13 + a 2b) (13 a)(13 a + 1)
+6 2
6
2(13 ) 2(132 )
1
= (71 6a) (a b)
132
If b > a then the expected return to B is obtained by switching the role of a; b above, namely
1
(71 6b) (b a)
132
and so the expected return to A is
1
(71 6b) (a b)
132
In general, then the expected return to A is
1
(71 6 max(a; b)) (a b)
132
(c) By part (b), A’s possible expected profit per game for a = 1; 2; : : : ; 13 and b = 11 is
1 6 71
(71 6 max(a; 11)) (a 11) = max(a; 11) (a 11)
132 132 6
For a = 1; 2; : : : ; 13 these are, respectively, 0:2959, 0:2663, 0:2367, 0:2071, 0:1775,

0:1479, 0:1183, 0:0888, 0:0592, 0:0296, 0, 0:0059, 0:0828. There is no strategy
that provides a positive expected return. The optimal is the break-even strategy a = 11. (Note:
in this two-person zero-sum game, a = 11 and b = 11 is a minimax solution.)
9.42 (a) The permutation Xj+1 after j + 1 requests depends only on the permutation Xj before and
the record requested at time j + 1. Thus the new state depends only on the old state Xj (without
knowing the previous states) and the record currently requested.
(b) For example the long-run probability of the state (i, j, k) is $q_i p_j$, where $q_i = \frac{p_i}{1 - p_i}$.
(c) The probability that record j is in position k is
$$\begin{cases} p_j & \text{for } k = 1 \\ p_j(Q - q_j) & \text{for } k = 2 \\ 1 - p_j(1 + Q - q_j) & \text{for } k = 3 \end{cases}$$
where $Q = \sum_{i=1}^{3} q_i$. The expected cost of accessing a record in the long run is
$$\sum_{j=1}^{3}\left\{p_j^2 + 2p_j^2(Q - q_j) + 3p_j[1 - p_j(1 + Q - q_j)]\right\}$$
Substituting $p_1 = 0.1$, $p_2 = 0.3$, $p_3 = 0.6$ gives $q_1 = \frac{1}{9}$, $q_2 = \frac{3}{7}$, $q_3 = \frac{6}{4}$ and $Q = \frac{1}{9} + \frac{3}{7} + \frac{6}{4} = 2.0397$ and the expected cost is 1.7214.
(d) If they are in random order, the expected cost is $1\left(\frac{1}{3}\right) + 2\left(\frac{1}{3}\right) + 3\left(\frac{1}{3}\right) = 2$. If they are ordered in terms of decreasing $p_j$, the expected cost is $p_3^2 + 2p_2^2 + 3p_1^2 = 0.57$.
9.44 Let J = index of maximum. $P(J = j) = \frac{1}{N}$, for j = 1, 2, ..., N. Let A = "your strategy chooses the maximum".
A occurs only if $J > k$ and if $\max\{X_i;\ k < i < J\} < \max\{X_i;\ 1 \le i \le k\}$. Given $J = j > k$, the probability of this is the probability that $\max\{X_i;\ 1 \le i < j\}$ occurs among the first k values, which occurs with probability $\frac{k}{j-1}$. Therefore,
$$\begin{aligned}
P(A) &= \sum_{j} P(A \mid J = j)P(J = j) = \frac{1}{N}\sum_{j=k+1}^{N} P(A \mid J = j) \\
&= \sum_{j=k+1}^{N}\frac{k}{j-1}\cdot\frac{1}{N} = \frac{k}{N}\left[\frac{1}{k} + \frac{1}{k+1} + \cdots + \frac{1}{N-1}\right] \approx \frac{k}{N}\ln\left(\frac{N}{k}\right)
\end{aligned}$$
Note that the value of x maximizing $x\ln(1/x)$ is $x = e^{-1} \approx 0.37$ so roughly, the best k is $Ne^{-1}$. The probability that you select the maximum is approximately $e^{-1} \approx 0.37$.
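The exact sum in 9.44 can be evaluated for every k to see how close the best k is to $N e^{-1}$. This is a sketch only; N = 100 is an example value chosen here, and the function name is illustrative.

```python
import math

def prob_select_max(N, k):
    """P(A) = (1/N) * sum_{j=k+1}^{N} k / (j - 1), for 1 <= k < N."""
    return sum(k / (j - 1) for j in range(k + 1, N + 1)) / N

N = 100  # example number of values (assumption for illustration)
best_k = max(range(1, N), key=lambda k: prob_select_max(N, k))
best_p = prob_select_max(N, best_k)

print(best_k, best_p)       # best cutoff and its success probability
print(N / math.e)           # the rough rule k = N / e from the solution
```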
9.45 (a) By definition

d
f1 (xjy) = P (X xjY = y)
dx
and
f1 (xjy) P (Y = y)
f2 (yjx) = P (Y = yjX = x) =
f1 (x)
Note that
Z1 Z1
d
f1 (xjy) dx = P (X xjY = y) dx
dx
1 1
= lim P (X xjY = y) lim P (X xjY = y) = 1 0=1

x!1 x!1
Since
f1 (xjy) P (Y = y)
f2 (yjx) = P (Y = yjX = x) =
f1 (x)
we have
f1 (xjy) P (Y = y) = f2 (yjx) f1 (x)
and
Z1 Z1
f1 (xjy) P (Y = y) dx = f2 (yjx) f1 (x) dx
1 1
but
Z1 Z1
f1 (xjy) P (Y = y) dx = P (Y = y) f1 (xjy) dx = P (Y = y)
1 1
and therefore
Z1
f2 (yjx) f1 (x) dx = P (Y = y) = f2 (y)
1
as required.
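(b) Since $X \sim U(0, 1)$ and $P(Y = y \mid X = x) = f_2(y \mid x) = \binom{n}{y}x^y(1 - x)^{n-y}$ therefore
$$\begin{aligned}
f_2(y) = P(Y = y) &= \int_{-\infty}^{\infty} f_2(y \mid x) f_1(x)\,dx = \int_{0}^{1}\binom{n}{y}x^y(1 - x)^{n-y}(1)\,dx \\
&= \binom{n}{y}\int_{0}^{1} x^y(1 - x)^{n-y}\,dx = \frac{n!}{y!\,(n - y)!}\cdot\frac{y!\,(n - y)!}{(n + 1)!} = \frac{1}{n + 1}
\end{aligned}$$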
(c) To find the conditional probability density function of X given Y = 0 we note that
P (X x; jXj 1)
P (X xjY = 0) = P (X xj jXj 1) =
P (jXj 1)
and
8
>
< 0 if x < 1
P (X x; jXj 1) = P ( 1 < X x) if jxj 1
>
:
P (jXj 1) for x > 1
8
>
< 0 if x < 1
= F1 (x) F1 ( 1) if jxj 1
>
:
P (jXj 1) for x > 1
where F1 (x) = P (X x). Note also that

8
>
< 0 if x < 1
d
P (X x; jXj 1) = f1 (x) if jxj 1
dx >
:
0 for x > 1
and therefore the conditional probability density function of X given Y = 0 is

8
>
< 0 if x < 1
d f1 (x)
f1 (xj0) = P (X xjY = 0) = P (jXj 1) if jxj 1
dx >
:
0 for x > 1
Chapter 10:
10.1 (a) Since $E(X_i) = 1/2$ and $Var(X_i) = 1/24$, $E(S) = 100(1/2) = 50$ and $Var(S) = 100(1/24) = 25/6$. Since S is the sum of independent and identically distributed random variables then by the Central Limit Theorem $S = \sum_{i=1}^{100} X_i$ will have approximately a $N(50, 25/6)$ distribution.
Therefore
$$\begin{aligned}
P(49.0 \le S \le 50.5) &\approx P\left(\frac{49 - 50}{\sqrt{25/6}} \le Z \le \frac{50.5 - 50}{\sqrt{25/6}}\right) \text{ where } Z \sim N(0,1) \\
&= P(-0.4899 \le Z \le 0.2449) \\
&\approx P(Z \le 0.24) + P(Z \le 0.49) - 1 \\
&= 0.59484 + 0.68793 - 1 \\
&= 0.28277
\end{aligned}$$
(b) If $X_i \sim U(0, 1)$ then $E(X_i) = 1/2$ and $Var(X_i) = 1/12$. Then $E(S) = 100(1/2) = 50$ and $Var(S) = 100(1/12) = 25/3$. Since S is the sum of independent and identically distributed random variables then by the Central Limit Theorem $S = \sum_{i=1}^{100} X_i$ will have approximately a $N(50, 25/3)$ distribution.
Therefore
$$\begin{aligned}
P(49.0 \le S \le 50.5) &\approx P\left(\frac{49.0 - 50}{\sqrt{25/3}} \le Z \le \frac{50.5 - 50}{\sqrt{25/3}}\right) \text{ where } Z \sim N(0,1) \\
&= P(-0.3464 \le Z \le 0.1732) \\
&\approx P(Z \le 0.17) + P(Z \le 0.35) - 1 \\
&= 0.5675 + 0.63683 - 1 \\
&= 0.20433
\end{aligned}$$
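10.2 Recall that if the student does not guess
$$E(S) = \sum_{i=1}^{100} p_i \quad \text{and} \quad Var(S) = \sum_{i=1}^{100} p_i(1 - p_i)$$
where $S = \sum_{i=1}^{100} X_i$ is their total mark.
If the student guesses then their total mark T is
$$T = \sum_{i=1}^{100} X_i - \frac{1}{4}\left(100 - \sum_{i=1}^{100} X_i\right) = \frac{5}{4}\sum_{i=1}^{100} X_i - 25 = \frac{5}{4}S - 25$$
where
$$E(S) = \frac{100}{5} + \frac{4}{5}\sum_{i=1}^{100} p_i \quad \text{and} \quad Var(S) = \frac{4}{25}\sum_{i=1}^{100}[(1 - p_i) + 4p_i(1 - p_i)]$$
(a) If $p_i = 0.45$ and the student does not guess then $E(S) = 100(0.45) = 45$ and $Var(S) = 100(0.45)(0.55) = 24.75$. Since S is the sum of independent and identically distributed random variables then by the Central Limit Theorem $S = \sum_{i=1}^{100} X_i$ has approximately a $N(45, 24.75)$ distribution. Therefore
$$\begin{aligned}
P(S \ge 50) &\approx P\left(Z \ge \frac{49.5 - 45}{\sqrt{24.75}}\right) \text{ where } Z \sim N(0,1) \\
&= 1 - P(Z \le 0.9045) \approx 1 - P(Z \le 0.90) = 1 - 0.81594 \\
&= 0.18406
\end{aligned}$$
If $p_i = 0.45$ and the student guesses then $E(S) = \frac{100}{5} + \frac{4}{5}(100)(0.45) = 56$ and $Var(S) = \frac{4}{25}[100(0.55) + 400(0.45)(0.55)] = 24.64$.
$$\begin{aligned}
P(T \ge 50) = P\left(\frac{5}{4}S - 25 \ge 50\right) = P(S \ge 60) &\approx P\left(Z \ge \frac{59.5 - 56}{\sqrt{24.64}}\right) \text{ where } Z \sim N(0,1) \\
&= 1 - P(Z \le 0.7051) \approx 1 - P(Z \le 0.71) = 1 - 0.76115 \\
&= 0.23885
\end{aligned}$$
(b) If $p_i = 0.55$ and the student does not guess then $E(S) = 100(0.55) = 55$ and $Var(S) = 100(0.55)(0.45) = 24.75$. Since S is the sum of independent and identically distributed random variables then by the Central Limit Theorem $S = \sum_{i=1}^{100} X_i$ has approximately a $N(55, 24.75)$ distribution. Therefore
$$\begin{aligned}
P(S \ge 50) &\approx P\left(Z \ge \frac{49.5 - 55}{\sqrt{24.75}}\right) \text{ where } Z \sim N(0,1) \\
&= 1 - P(Z \le -1.1055) \approx P(Z \le 1.11) \\
&= 0.86650
\end{aligned}$$
If $p_i = 0.55$ and the student guesses then $E(S) = \frac{100}{5} + \frac{4}{5}(100)(0.55) = 64$ and $Var(S) = \frac{4}{25}[100(0.45) + 400(0.55)(0.45)] = 23.04$.
$$\begin{aligned}
P(T \ge 50) = P\left(\frac{5}{4}S - 25 \ge 50\right) = P(S \ge 60) &\approx P\left(Z \ge \frac{59.5 - 64}{\sqrt{23.04}}\right) \text{ where } Z \sim N(0,1) \\
&= 1 - P(Z \le -0.9375) \approx P(Z \le 0.94) \\
&= 0.82639
\end{aligned}$$
If $p_i = 0.45$ then the best strategy for passing is to guess and if $p_i = 0.55$ then the best strategy for passing is to not guess.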
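10.3 We want to find $n$ such that
$$P\left(\left|\frac{X}{n} - 0.16\right| \le 0.03\right) \ge 0.95$$
where $X \sim Binomial(n, 0.16)$. By the Normal approximation to the Binomial
$$
\begin{aligned}
P\left(\left|\frac{X}{n} - 0.16\right| \le 0.03\right) &= P\left(\frac{|X - 0.16n|}{\sqrt{n(0.16)(0.84)}} \le \frac{0.03n}{\sqrt{n(0.16)(0.84)}}\right) \\
&\approx P\left(|Z| \le 0.08183\sqrt{n}\right) \quad \text{where } Z \sim N(0, 1)
\end{aligned}
$$
Since $P(|Z| \le 1.96) = 0.95$ then we want $0.08183\sqrt{n} \ge 1.96$ or $n \ge (1.96/0.08183)^2 = (23.95)^2 = 573.6$. Therefore $n$ should be at least 574.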
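The sample size calculation in 10.3 reduces to $n \ge z^2 p(1-p)/d^2$, where $d$ is the allowed margin. A minimal sketch, assuming that formula; the function name `min_sample_size` is illustrative only:

```python
from math import ceil

def min_sample_size(p, margin, z=1.96):
    """Smallest n with z * sqrt(p(1-p)/n) <= margin (Normal approximation to the Binomial)."""
    return ceil(z * z * p * (1 - p) / (margin * margin))

print(min_sample_size(0.16, 0.03))  # 574
```

Applying the same function to other values of $p$ and $d$ can give answers that differ slightly from hand calculations, because the hand calculations round intermediate quantities.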
10.4 (a) Expected number of tests $= (1)(0.98)^{20} + (21)\left[1 - (0.98)^{20}\right] = 7.6478$
Variance of number of tests $= (1)^2(0.98)^{20} + (21)^2\left[1 - (0.98)^{20}\right] - (7.6478)^2 = 88.7630$
(b) For 2000 people the expected number of tests is (100) (7:6478) = 764:78, and the variance
of the number of tests is (100) (88:7630) = 8876:30, since people within pooled samples are
independent and each pooled sample is independent of each other pooled sample.
(c) Let $N$ = number of tests for 2000 people. Now $N = \sum_{i=1}^{100} N_i$ where $N_i$ = number of tests required in the $i$th group of 20 people. Since $N$ is the sum of 100 independent and identically distributed random variables then by the Central Limit Theorem $N$ has approximately a $N(764.78, 8876.30)$ distribution. Since the possible values for $N$ are $n = 100, 120, 140, \ldots, 2100$ the continuity correction is $20/2 = 10$. Therefore
$$
\begin{aligned}
P(N > 800) &\approx P\left(Z \ge \frac{(800 + 10) - 764.78}{\sqrt{8876.30}}\right) \quad \text{where } Z \sim N(0, 1) \\
&= 1 - P(Z \le 0.48) = 1 - 0.68439 \\
&= 0.31561
\end{aligned}
$$
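The three parts of 10.4 chain together naturally in code. This is a minimal sketch under the assumptions used in the solution (pools of 20, each person negative with probability 0.98, 1 test if the pool is negative and 21 otherwise); all variable names are illustrative.

```python
from math import erf, sqrt

def phi(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

p_pool_negative = 0.98 ** 20

# One pool: 1 test if the pooled sample is negative, 21 tests otherwise.
mean_one = 1 * p_pool_negative + 21 * (1 - p_pool_negative)
var_one = 1 ** 2 * p_pool_negative + 21 ** 2 * (1 - p_pool_negative) - mean_one ** 2

# 2000 people form 100 independent pools.
pools = 100
mean_total = pools * mean_one
var_total = pools * var_one

# N moves in steps of 20, so the continuity correction is 20 / 2 = 10.
prob_more_than_800 = 1.0 - phi((800 + 10 - mean_total) / sqrt(var_total))

print(round(mean_one, 4), round(var_one, 4), round(prob_more_than_800, 4))
```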
10.5 Since $X \sim Binomial(60, 0.8)$, then
$$E(X) = 60(0.8) = 48 \quad \text{and} \quad Var(X) = 60(0.8)(0.2) = 9.6$$
Since $Y \sim Binomial(62, 0.8)$, then
$$E(Y) = 62(0.8) = 49.6 \quad \text{and} \quad Var(Y) = 62(0.8)(0.2) = 9.92$$
Now $E(X - Y) = 48 - 49.6 = -1.6$ and since $X$ and $Y$ are independent random variables
$$Var(X - Y) = Var(X) + Var(Y) = 9.6 + 9.92 = 19.52$$
By the Normal approximation to the Binomial, $X$ has approximately a $N(48, 9.6)$ distribution and $Y$ has approximately a $N(49.6, 9.92)$ distribution. Since $X$ and $Y$ are independent random variables then $X - Y$ has approximately a $N(-1.6, 19.52)$ distribution. Therefore
$$
\begin{aligned}
P(|X - Y| \ge 3) &= 1 - P(|X - Y| < 3) = 1 - P(-3 < X - Y < 3) \\
&\approx 1 - P\left(\frac{-2.5 - (-1.6)}{\sqrt{19.52}} \le Z \le \frac{2.5 - (-1.6)}{\sqrt{19.52}}\right) \quad \text{where } Z \sim N(0, 1) \\
&= 1 - P(-0.20 \le Z \le 0.93) \\
&= 1 - [P(Z \le 0.93) - 1 + P(Z \le 0.20)] \\
&= 2 - 0.82381 - 0.57926 \\
&= 0.59693
\end{aligned}
$$
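10.6 (a) Let $X$ = number of unemployed people in a sample of 10000 persons. Then $X \sim Binomial(10000, 0.07)$. By the Normal approximation to the Binomial $X$ has approximately a $N(700, 651)$ distribution. Therefore
$$
\begin{aligned}
P(675 \le X \le 725) &\approx P\left(\frac{675 - 700}{\sqrt{651}} \le Z \le \frac{725 - 700}{\sqrt{651}}\right) \quad \text{where } Z \sim N(0, 1) \\
&= P(|Z| \le 0.98) = 2P(Z \le 0.98) - 1 \\
&= 2(0.83646) - 1 \\
&= 0.67292
\end{aligned}
$$
Note that since $n = 10000$ is very large a continuity correction has not been used.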
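(b) We need to find $n$ such that
$$P\left(0.069 \le \frac{X}{n} \le 0.071\right) = P\left(\left|\frac{X}{n} - 0.07\right| \le 0.001\right) \ge 0.95$$
where $X \sim Binomial(n, 0.07)$. By the Normal approximation to the Binomial
$$
\begin{aligned}
P\left(\left|\frac{X}{n} - 0.07\right| \le 0.001\right) &= P\left(\frac{|X - 0.07n|}{\sqrt{n(0.07)(0.93)}} \le \frac{0.001n}{\sqrt{n(0.07)(0.93)}}\right) \\
&\approx P\left(|Z| \le 0.003919\sqrt{n}\right) \quad \text{where } Z \sim N(0, 1)
\end{aligned}
$$
Since $P(|Z| \le 1.96) = 0.95$ then we want $0.003919\sqrt{n} \ge 1.96$ or $n \ge (1.96/0.003919)^2 = (500.1276)^2 = 250127.6$. Therefore $n$ should be at least $250{,}128$.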
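10.7 Let $X$ = number of requests in a one minute = 60 second interval. Then $X \sim Poisson(2 \times 60)$. Since $\mu = 120$ is large we can use the Normal approximation to the Poisson.
(a)
$$
\begin{aligned}
P(110 \le X \le 135) &= \sum_{x=110}^{135} \frac{(120)^x e^{-120}}{x!} \\
&\approx P\left(\frac{109.5 - 120}{\sqrt{120}} \le Z \le \frac{135.5 - 120}{\sqrt{120}}\right) \quad \text{where } Z \sim N(0, 1) \\
&= P(-0.96 \le Z \le 1.41) \\
&= P(Z \le 1.41) - P(Z \le -0.96) \\
&= P(Z \le 1.41) - [1 - P(Z \le 0.96)] \\
&= P(Z \le 1.41) + P(Z \le 0.96) - 1 \\
&= 0.92073 + 0.83147 - 1 \\
&= 0.7522
\end{aligned}
$$
(b)
$$P(X > 150) = \sum_{x=151}^{\infty} \frac{(120)^x e^{-120}}{x!} = 1 - \sum_{x=0}^{150} \frac{(120)^x e^{-120}}{x!}$$
$$
\begin{aligned}
P(X > 150) &\approx P\left(Z \ge \frac{150.5 - 120}{\sqrt{120}}\right) \quad \text{where } Z \sim N(0, 1) \\
&= P(Z \ge 2.78) \\
&= 1 - 0.99728 \\
&= 0.00272
\end{aligned}
$$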
(c) Let $X_i$ = waiting time between requests $(i - 1)$ and $i$, $i = 1, 2, \ldots, 600$. Then $X_i$ has an Exponential distribution with mean $1/2 = 0.5$ seconds and variance $(1/2)^2 = 0.25$ (seconds)$^2$. The waiting time until the 600th request is $S = X_1 + X_2 + \cdots + X_{600}$. Since $S$ is the sum of independent and identically distributed random variables then by the Central Limit Theorem $S$ will have approximately a $N(600(0.5), 600(0.25)) = N(300, 150)$ distribution.
$$
\begin{aligned}
P(S < (4.5)(60)) &= P(S < 270) \\
&\approx P\left(Z < \frac{270 - 300}{\sqrt{150}}\right) \quad \text{where } Z \sim N(0, 1) \\
&= P(Z < -2.45) = 1 - P(Z < 2.45) = 1 - 0.99286 \\
&= 0.00714
\end{aligned}
$$
Note that a continuity correction is not used since $S$ is a continuous random variable.
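To see how good the Normal approximation in 10.7(a) is, the exact Poisson sum can be computed directly and compared with the continuity-corrected value. A minimal sketch; the helper names `phi` and `poisson_pmf` are illustrative, and the probability function is evaluated on the log scale only to avoid overflow in $120^x$ and $x!$.

```python
from math import erf, exp, lgamma, log, sqrt

def phi(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def poisson_pmf(x, mu):
    """P(X = x) for X ~ Poisson(mu), computed on the log scale."""
    return exp(x * log(mu) - mu - lgamma(x + 1))

mu = 120
exact = sum(poisson_pmf(x, mu) for x in range(110, 136))
approx = phi((135.5 - mu) / sqrt(mu)) - phi((109.5 - mu) / sqrt(mu))

print(round(exact, 4), round(approx, 4))
```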
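10.8 (a) From Theorem 41 we have
$$\frac{X - np}{\sqrt{np(1 - p)}} \sim N(0, 1) \text{ approximately}$$
which implies
$$\frac{\frac{X}{n} - p}{\sqrt{\frac{p(1 - p)}{n}}} \sim N(0, 1) \text{ approximately}$$
Therefore
$$
\begin{aligned}
&P\left(\frac{X}{n} - 1.645\sqrt{\frac{p(1 - p)}{n}} \le p \le \frac{X}{n} + 1.645\sqrt{\frac{p(1 - p)}{n}}\right) \\
&= P\left(-1.645 \le \frac{\frac{X}{n} - p}{\sqrt{\frac{p(1 - p)}{n}}} \le 1.645\right) \\
&\approx P(-1.645 \le Z \le 1.645) \quad \text{where } Z \sim N(0, 1) \\
&= 2P(Z \le 1.645) - 1 \\
&= 2(0.95) - 1 = 0.9
\end{aligned}
$$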
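(b) Since $X_i \sim Poisson(\mu)$, $i = 1, 2, \ldots, n$ where $n$ is large then by Theorem 39
$$\frac{\bar{X} - \mu}{\sqrt{\mu/n}} \sim N(0, 1) \text{ approximately}$$
Therefore
$$
\begin{aligned}
&P\left(\bar{X} - 1.96\sqrt{\frac{\mu}{n}} \le \mu \le \bar{X} + 1.96\sqrt{\frac{\mu}{n}}\right) \\
&= P\left(-1.96 \le \frac{\bar{X} - \mu}{\sqrt{\mu/n}} \le 1.96\right) \\
&\approx P(-1.96 \le Z \le 1.96) \quad \text{where } Z \sim N(0, 1) \\
&= 2P(Z \le 1.96) - 1 \\
&= 2(0.975) - 1 = 0.95
\end{aligned}
$$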
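(c) Since $X_i \sim Exponential(\theta)$, $i = 1, 2, \ldots, n$ where $n$ is large then by Theorem 39
$$\frac{\bar{X} - \theta}{\sqrt{\theta^2/n}} \sim N(0, 1) \text{ approximately}$$
Therefore
$$
\begin{aligned}
&P\left(\bar{X} - 2.576\sqrt{\frac{\theta^2}{n}} \le \theta \le \bar{X} + 2.576\sqrt{\frac{\theta^2}{n}}\right) \\
&= P\left(-2.576 \le \frac{\bar{X} - \theta}{\sqrt{\theta^2/n}} \le 2.576\right) \\
&\approx P(-2.576 \le Z \le 2.576) \quad \text{where } Z \sim N(0, 1) \\
&= 2P(Z \le 2.576) - 1 \\
&= 2(0.995) - 1 = 0.99
\end{aligned}
$$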
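The statement in 10.8(c) can be checked by simulation. The sketch below is illustrative only: the values $\theta = 2$, $n = 100$, the number of repetitions and the seed are arbitrary assumptions, and `coverage` is a hypothetical helper name. It estimates how often the interval $\bar{X} \pm 2.576\sqrt{\theta^2/n}$ contains $\theta$.

```python
import random
from math import sqrt

def coverage(theta=2.0, n=100, z=2.576, reps=2000, seed=230):
    """Estimate P(Xbar - z*sqrt(theta^2/n) <= theta <= Xbar + z*sqrt(theta^2/n))."""
    rng = random.Random(seed)
    half_width = z * sqrt(theta ** 2 / n)
    hits = 0
    for _ in range(reps):
        # expovariate takes the rate 1/theta, so each draw has mean theta.
        xbar = sum(rng.expovariate(1.0 / theta) for _ in range(n)) / n
        if xbar - half_width <= theta <= xbar + half_width:
            hits += 1
    return hits / reps

print(coverage())
```

The estimate should be close to 0.99; it will not match exactly, both because of simulation error and because the Normal approximation is itself only approximate for finite $n$.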
10.9 (a) If you play $n$ times then your expected profit is
$$E(S) = n[(1)(0.49) + (-1)(0.51)] = -0.02n$$
and the variance of your profit is
$$Var(S) = n\left[(1)^2(0.49) + (-1)^2(0.51) - (-0.02)^2\right] = 0.9996n$$
Since $S$ is the sum of independent and identically distributed random variables then, by the Central Limit Theorem, $S$ has approximately a $N(-0.02n, 0.9996n)$ distribution.
(b) If $n = 20$, the possible values of $S$ are $x = -20, -18, \ldots, -2, 0, 2, \ldots, 18, 20$ and the continuity correction is $2/2 = 1$.
$$P(S \ge 0) \approx P\left(Z \ge \frac{(0 - 1) - (-0.02)(20)}{\sqrt{0.9996(20)}}\right) = P(Z \ge -0.13) = P(Z \le 0.13) = 0.55172$$
If $n = 50$, the possible values of $S$ are $x = -50, -48, \ldots, -2, 0, 2, \ldots, 48, 50$.
$$P(S \ge 0) \approx P\left(Z \ge \frac{(0 - 1) - (-0.02)(50)}{\sqrt{0.9996(50)}}\right) = P(Z \ge 0) = 0.5$$
If $n = 100$, the possible values of $S$ are $x = -100, -98, \ldots, -2, 0, 2, \ldots, 98, 100$.
$$P(S \ge 0) \approx P\left(Z \ge \frac{(0 - 1) - (-0.02)(100)}{\sqrt{0.9996(100)}}\right) = P(Z \ge 0.10) = 1 - P(Z \le 0.10) = 1 - 0.53983 = 0.46017$$
The more you play, the smaller your chance of winning.

(c) For the casino owner, $Y$ has approximately a $N(0.02n, 0.9996n)$ distribution. For $n = 100{,}000$, $Y$ has approximately a $N(2000, 99960)$ distribution. We want to find $c$ such that $P(Y > c) = 0.99$. Since
$$P(Y > c) \approx P\left(Z > \frac{(c + 1) - 2000}{\sqrt{99960}}\right) = P\left(Z > \frac{c - 1999}{\sqrt{99960}}\right) \quad \text{where } Z \sim N(0, 1)$$
and $P(Z > -2.3263) = 0.99$ then
$$\frac{c - 1999}{\sqrt{99960}} = -2.3263 \quad \text{or} \quad c = 1999 - (2.3263)\sqrt{99960} = 1263.506$$
With probability 0.99 the casino owner's profit is at least \$1263.51.
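The pattern in 10.9(b) is easy to tabulate for any number of plays. A minimal sketch, assuming the mean $-0.02n$, variance $0.9996n$ and continuity correction of 1 from the solution above; `prob_not_losing` is an illustrative name.

```python
from math import erf, sqrt

def phi(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def prob_not_losing(n):
    """Approximate P(S >= 0) after n plays, where S moves in steps of 2."""
    mean = -0.02 * n
    var = 0.9996 * n
    z = ((0 - 1) - mean) / sqrt(var)
    return 1.0 - phi(z)

for n in (20, 50, 100, 1000):
    print(n, round(prob_not_losing(n), 4))
```

The values for $n = 20$ and $n = 100$ differ slightly from the table-based answers because $z$ is not rounded to two decimals here.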
10.10 (a) Let $T$ be the number of hearts which turn up. Then $T \sim Binomial(3, 1/6)$ with $E(T) = 3(1/6) = 1/2$ and $Var(T) = 3(1/6)(5/6) = 5/12$. The profit for one play is $T - 1$ with $E(T - 1) = 1/2 - 1 = -1/2$ and $Var(T - 1) = Var(T) = 5/12$. If you play the game $n$ times then your expected profit is
$$E(S) = n[E(T - 1)] = -\frac{n}{2}$$
and the variance of your profit is
$$Var(S) = n[Var(T - 1)] = \frac{5n}{12}$$
Since $S$ is the sum of independent and identically distributed random variables then, by the Central Limit Theorem, $S$ has approximately a $N(-n/2, 5n/12)$ distribution.
(b) (i) If $n = 10$
$$P(S > 0) \approx P\left(Z > \frac{0 + 0.5 - (-5)}{\sqrt{50/12}}\right) = P(Z > 2.69) = 1 - P(Z < 2.69) = 1 - 0.99643 = 0.00357$$
(ii) If $n = 50$
$$P(S > 0) \approx P\left(Z > \frac{0 + 0.5 - (-25)}{\sqrt{250/12}}\right) = P(Z > 5.58677) \approx 0$$
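10.11 (a)
$$
\begin{aligned}
M(t) = E\left(e^{tX}\right) &= \sum_{x=0}^{\infty} e^{tx} p(1 - p)^x = p\sum_{x=0}^{\infty}\left[(1 - p)e^t\right]^x \\
&= \frac{p}{1 - (1 - p)e^t} \quad \text{by the Geometric series for } t < -\ln(1 - p) \\
&= \frac{p}{1 - qe^t} \quad \text{where } q = 1 - p
\end{aligned}
$$
(b)
$$
\begin{aligned}
M'(t) &= \frac{d}{dt}\left[p\left(1 - qe^t\right)^{-1}\right] = p(-1)\left(1 - qe^t\right)^{-2}\left(-qe^t\right) \\
&= pqe^t\left(1 - qe^t\right)^{-2} = \frac{pqe^t}{(1 - qe^t)^2}
\end{aligned}
$$
$$E(X) = M'(0) = \frac{pq}{p^2} = \frac{q}{p}$$
$$
\begin{aligned}
M''(t) &= \frac{d}{dt}\left[pqe^t\left(1 - qe^t\right)^{-2}\right] = pq\left[e^t(-2)\left(1 - qe^t\right)^{-3}\left(-qe^t\right) + e^t\left(1 - qe^t\right)^{-2}\right] \\
&= \frac{pqe^t\left[2qe^t + 1 - qe^t\right]}{(1 - qe^t)^3}
\end{aligned}
$$
$$E(X^2) = M''(0) = \frac{pq(2q + 1 - q)}{(1 - q)^3} = \frac{pq(1 + q)}{p^3} = \frac{q(1 + q)}{p^2}$$
$$Var(X) = E(X^2) - [E(X)]^2 = \frac{q(1 + q)}{p^2} - \left(\frac{q}{p}\right)^2 = \frac{q}{p^2}$$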
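The closed forms $E(X) = q/p$ and $Var(X) = q/p^2$ obtained from the moment generating function can be sanity-checked by summing the series for the moments directly. A minimal sketch; the function name, the truncation point of 2000 terms and the test value $p = 0.3$ are arbitrary choices for illustration.

```python
def geometric_moments(p, terms=2000):
    """Mean and variance of X with P(X = x) = p(1-p)^x, x = 0, 1, ..., by truncated sums."""
    q = 1.0 - p
    m1 = sum(x * p * q ** x for x in range(terms))
    m2 = sum(x * x * p * q ** x for x in range(terms))
    return m1, m2 - m1 ** 2

p = 0.3
q = 1.0 - p
mean, var = geometric_moments(p)
print(mean, q / p)
print(var, q / p ** 2)
```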
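10.12
$$M(t) = E\left(e^{tX}\right) = \frac{1}{b - a + 1}\sum_{x=a}^{b} e^{xt} = \frac{e^{at} - e^{(b+1)t}}{(1 - e^t)(b - a + 1)} \quad \text{for } t \ne 0$$
$$M'(t) = \frac{1}{b - a + 1}\sum_{x=a}^{b}\frac{d}{dt}e^{xt} = \frac{1}{b - a + 1}\sum_{x=a}^{b} xe^{xt}$$
$$E(X) = M'(0) = \frac{1}{b - a + 1}\sum_{x=a}^{b} x = \frac{1}{2(b - a + 1)}\left[b(b + 1) - (a - 1)a\right]$$
$$M''(t) = \frac{1}{b - a + 1}\sum_{x=a}^{b}\frac{d^2}{dt^2}e^{xt} = \frac{1}{b - a + 1}\sum_{x=a}^{b} x^2e^{xt}$$
$$E(X^2) = M''(0) = \frac{1}{b - a + 1}\sum_{x=a}^{b} x^2 = \frac{1}{6(b - a + 1)}\left[b(b + 1)(2b + 1) - (a - 1)a(2a - 1)\right]$$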
10.13 (a) Since $X$ only takes on values $0, 1, 2$ the moment generating function of $X$ is
$$
\begin{aligned}
M(t) &= e^{t(0)}P(X = 0) + e^{t(1)}P(X = 1) + e^{t(2)}P(X = 2) \\
&= P(X = 0) + e^tP(X = 1) + e^{2t}P(X = 2)
\end{aligned}
$$
Taking two derivatives with respect to $t$ we have
$$M'(t) = e^tP(X = 1) + 2e^{2t}P(X = 2)$$
$$M''(t) = e^tP(X = 1) + 4e^{2t}P(X = 2)$$
Since $M'(0) = E(X) = 1$ and $M''(0) = E(X^2) = 1.5$ we have
$$1 = E(X) = M'(0) = P(X = 1) + 2P(X = 2)$$
and
$$1.5 = E(X^2) = M''(0) = P(X = 1) + 4P(X = 2)$$
Solving these two equations in two unknowns gives $P(X = 2) = 0.25$ and $P(X = 1) = 0.5$ and thus $P(X = 0) = 0.25$. Therefore
$$M(t) = 0.25 + 0.5e^t + 0.25e^{2t} \quad \text{for } t \in \Re$$
(b)
$$M^{(3)}(t) = e^tP(X = 1) + 8e^{2t}P(X = 2) = e^t(0.5) + 8e^{2t}(0.25)$$
and $E(X^3) = M^{(3)}(0) = 0.5 + 2 = 2.5$
$$M^{(4)}(t) = e^tP(X = 1) + 16e^{2t}P(X = 2) = e^t(0.5) + 16e^{2t}(0.25)$$
and $E(X^4) = M^{(4)}(0) = 0.5 + 4 = 4.5$
(c) Given the first two moments $E(X) = m_1$ and $E(X^2) = m_2$, there is a unique solution to the equations $p_0 + p_1 + p_2 = 1$, $p_1 + 2p_2 = m_1$, $p_1 + 4p_2 = m_2$ where $p_i = P(X = i)$, $i = 0, 1, 2$.
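10.14 (a) Expand $M(t)$ in a power series in powers of $e^t$, that is
$$
\begin{aligned}
M(t) &= \frac{1}{3e^{-t} - 2} = \frac{\frac{1}{3}e^t}{1 - \frac{2}{3}e^t} \\
&= \frac{1}{3}e^t\sum_{i=0}^{\infty}\left(\frac{2}{3}e^t\right)^i \quad \text{by the Geometric series if } \frac{2}{3}e^t < 1 \\
&= \sum_{i=0}^{\infty}\frac{1}{3}\left(\frac{2}{3}\right)^i e^{t(i+1)} \\
&= \sum_{x=1}^{\infty}\frac{1}{3}\left(\frac{2}{3}\right)^{x-1} e^{tx}
\end{aligned}
$$
Therefore
$$P(X = x) = \text{coefficient of } e^{xt} = \frac{1}{3}\left(\frac{2}{3}\right)^{x-1} \quad \text{for } x = 1, 2, \ldots$$
which we recognize as being the probability function of $X$ = the total number of trials until the first success in a sequence of Bernoulli trials with $P(S) = \frac{1}{3}$.
(b)
$$
\begin{aligned}
M(t) &= e^{2(e^t - 1)} = e^{-2}e^{2e^t} \\
&= e^{-2}\sum_{x=0}^{\infty}\frac{\left(2e^t\right)^x}{x!} \quad \text{by the Exponential series for } t \in \Re \\
&= \sum_{x=0}^{\infty}\frac{e^{-2}2^x}{x!}e^{tx}
\end{aligned}
$$
Therefore
$$P(X = x) = \text{coefficient of } e^{xt} = \frac{2^xe^{-2}}{x!} \quad \text{for } x = 0, 1, \ldots$$
which we recognize as being the probability function of a $Poisson(2)$ random variable.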
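10.15 (a)
$$
\begin{aligned}
M(t) = E\left(e^{tX}\right) &= \int_0^{\infty} e^{xt}\frac{1}{\theta}e^{-x/\theta}\,dx = \frac{1}{\theta}\int_0^{\infty} e^{-x\left(\frac{1}{\theta} - t\right)}\,dx \\
&= \frac{1}{\theta}\cdot\frac{1}{\frac{1}{\theta} - t} = \frac{1}{1 - \theta t} \quad \text{if } t < \frac{1}{\theta}
\end{aligned}
$$
If $t \ge \frac{1}{\theta}$, the integral $\int_0^{\infty} e^{-x\left(\frac{1}{\theta} - t\right)}\,dx$ does not converge and the moment generating function does not exist for $t \ge \frac{1}{\theta}$.
(b)
$$M'(t) = \frac{d}{dt}(1 - \theta t)^{-1} = (-1)(1 - \theta t)^{-2}(-\theta) = \theta(1 - \theta t)^{-2}$$
$$E(X) = M'(0) = \theta$$
$$M''(t) = \frac{d}{dt}\left[\theta(1 - \theta t)^{-2}\right] = \theta\left[(-2)(1 - \theta t)^{-3}(-\theta)\right] = 2\theta^2(1 - \theta t)^{-3}$$
$$E\left(X^2\right) = M''(0) = 2\theta^2$$
$$Var(X) = E(X^2) - [E(X)]^2 = 2\theta^2 - (\theta)^2 = \theta^2$$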
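10.16 Recall that if $X \sim N(\mu, \sigma^2)$ then the moment generating function of $X$ is $M_X(t) = e^{\mu t + \sigma^2t^2/2}$ for $t \in \Re$. If $X_i \sim N(1, 2)$, $i = 1, 2, \ldots, n$ then the moment generating function of $X_i$ is $M_i(t) = E\left(e^{tX_i}\right) = e^{t + t^2}$ for $t \in \Re$, $i = 1, 2, \ldots, n$.
(a) The moment generating function of $Y = -3X_1 + 4$ is
$$
\begin{aligned}
M_Y(t) &= E\left(e^{tY}\right) = E\left(e^{t(-3X_1 + 4)}\right) \\
&= e^{4t}E\left(e^{(-3t)X_1}\right) \\
&= e^{4t}M_1(-3t) = e^{4t}e^{-3t + (-3t)^2} \\
&= e^{t + 9t^2} \quad \text{for } t \in \Re
\end{aligned}
$$
which is the moment generating function of a $N(1, 18)$ random variable. By the Uniqueness Theorem $Y \sim N(1, 18)$.
(b) The moment generating function of $T = X_1 + X_2$ is
$$
\begin{aligned}
M_T(t) &= E\left(e^{tT}\right) = E\left(e^{t(X_1 + X_2)}\right) \\
&= E\left(e^{tX_1}\right)E\left(e^{tX_2}\right) \quad \text{since } X_1 \text{ and } X_2 \text{ are independent random variables} \\
&= e^{t + t^2}e^{t + t^2} \\
&= e^{2t + 2t^2} \quad \text{for } t \in \Re
\end{aligned}
$$
which is the moment generating function of a $N(2, 4)$ random variable. By the Uniqueness Theorem, $T \sim N(2, 4)$.
(c) The moment generating function of $S_n = X_1 + X_2 + \cdots + X_n$ is
$$
\begin{aligned}
M_{S_n}(t) &= E\left(e^{tS_n}\right) = E\left(e^{t(X_1 + X_2 + \cdots + X_n)}\right) \\
&= E\left(\prod_{i=1}^{n} e^{tX_i}\right) = \prod_{i=1}^{n} E\left(e^{tX_i}\right) \quad \text{since the } X_i\text{'s are independent random variables} \\
&= \prod_{i=1}^{n} M_i(t) = \prod_{i=1}^{n} e^{t + t^2} \\
&= e^{nt + nt^2} \quad \text{for } t \in \Re
\end{aligned}
$$
which is the moment generating function of a $N(n, 2n)$ random variable. By the Uniqueness Theorem, $S_n \sim N(n, 2n)$.
(d) The moment generating function of $Z = (2n)^{-1/2}(S_n - n)$ is
$$
\begin{aligned}
M_Z(t) &= E\left(e^{tZ}\right) = E\left(e^{t(2n)^{-1/2}(S_n - n)}\right) \\
&= e^{-t\left(2^{-1/2}\right)n^{1/2}}E\left(e^{t(2n)^{-1/2}S_n}\right) \\
&= e^{-t\left(2^{-1/2}\right)n^{1/2}}M_{S_n}\left(t(2n)^{-1/2}\right) \\
&= e^{-t\left(2^{-1/2}\right)n^{1/2}}e^{n\left(t(2n)^{-1/2}\right) + n\left(t(2n)^{-1/2}\right)^2} \\
&= e^{t^2/2} \quad \text{for } t \in \Re
\end{aligned}
$$
which is the moment generating function of a $N(0, 1)$ random variable. By the Uniqueness Theorem, $Z \sim N(0, 1)$.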
10.17 Since $X \sim Poisson(\mu_1)$, the moment generating function of $X$ is
$$M_X(t) = e^{-\mu_1 + \mu_1e^t}$$
Since $Y \sim Poisson(\mu_2)$, the moment generating function of $Y$ is
$$M_Y(t) = e^{-\mu_2 + \mu_2e^t}$$
Since $X$ and $Y$ are independent random variables, the moment generating function of the sum $X + Y$ is the product of the moment generating functions, that is,
$$M_X(t)M_Y(t) = e^{-\mu_1 + \mu_1e^t}e^{-\mu_2 + \mu_2e^t} = e^{-(\mu_1 + \mu_2) + (\mu_1 + \mu_2)e^t}$$
Note that this is the moment generating function of a Poisson distribution with parameter $\mu_1 + \mu_2$. Therefore by the Uniqueness Theorem $X + Y \sim Poisson(\mu_1 + \mu_2)$.
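The conclusion of 10.17 can also be verified directly from the probability functions, without moment generating functions, by convolving the two Poisson distributions. A minimal sketch with the illustrative choices $\mu_1 = 2$ and $\mu_2 = 3$; the function names are hypothetical.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """P(X = x) for X ~ Poisson(mu)."""
    return exp(-mu) * mu ** x / factorial(x)

def sum_pmf(s, mu1, mu2):
    """P(X + Y = s) for independent X ~ Poisson(mu1), Y ~ Poisson(mu2), by convolution."""
    return sum(poisson_pmf(x, mu1) * poisson_pmf(s - x, mu2) for x in range(s + 1))

for s in range(6):
    print(s, round(sum_pmf(s, 2, 3), 6), round(poisson_pmf(s, 5), 6))
```

The two printed columns agree, as the Uniqueness Theorem argument predicts.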
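10.18 (a) The moment generating function of $X$ is
$$M(t) = E\left(e^{tX}\right) = \int_0^{\infty} e^{xt}\frac{1}{\theta^2}xe^{-x/\theta}\,dx = \frac{1}{\theta^2}\int_0^{\infty} xe^{-\left(\frac{1}{\theta} - t\right)x}\,dx = \frac{1}{(1 - \theta t)^2} \quad \text{if } t < \frac{1}{\theta}$$
(b) From the solution to Chapter 10, Problem 15 we have
$$M_X(t) = M_Y(t) = \frac{1}{1 - \theta t} \quad \text{for } t < \frac{1}{\theta}$$
Therefore the moment generating function of $S = X + Y$ is
$$M_S(t) = E\left[e^{t(X + Y)}\right] = E\left(e^{tX}\right)E\left(e^{tY}\right) = M_X(t)M_Y(t) = \frac{1}{(1 - \theta t)^2} \quad \text{for } t < \frac{1}{\theta}$$
and since this is the moment generating function of the distribution obtained in (a), $S$ must have the probability density function $f(s) = \frac{1}{\theta^2}se^{-s/\theta}$ for $s > 0$.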
10.19
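10.20 Let $Y$ = total change over day. Given $N = n$, $Y$ has a $N(0, n\sigma^2)$ distribution and therefore
$$E\left(e^{tY} \mid N = n\right) = \exp\left(\frac{n\sigma^2t^2}{2}\right)$$
$$
\begin{aligned}
M_Y(t) = E\left(e^{tY}\right) &= \sum_{n=0}^{\infty} E\left[e^{tY} \mid N = n\right]P(N = n) \\
&= e^{-\lambda}\sum_{n=0}^{\infty}\exp\left(\frac{n\sigma^2t^2}{2}\right)\frac{\lambda^n}{n!} \\
&= e^{-\lambda}\sum_{n=0}^{\infty}\frac{\left(\lambda e^{\sigma^2t^2/2}\right)^n}{n!} \\
&= \exp\left(-\lambda + \lambda e^{\sigma^2t^2/2}\right) \quad \text{by the Exponential series}
\end{aligned}
$$
This is not a moment generating function we have seen in this course. The mean is $M_Y'(0) = 0$ and the variance is $M_Y''(0) = \lambda\sigma^2$.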
13. SAMPLE TESTS
Sample Midterm 1
1. Four students were late for an exam. Their excuse was that the car they shared had a flat tire on the
way. The instructor, suspecting that they were not telling the truth, asked them each separately which
tire went flat. Assume each student will randomly pick one of the tires 1, 2, 3, or 4, independently of
each other. Find the probability of each of the following events:
(a) A = they all pick the same tire.
(b) B = they all pick a different tire.
(c) C = at least two of them pick the same tire.
(d) D = exactly one student picks tire 1 and exactly one student picks tire 3.
2. The letters of the word PROBABILITY are arranged at random to form a “word”. Find the proba-
bility of each of the following events:
(a) A = the word ends with the letter Y
(b) B = the two B’s occur side by side in the word
(c) C = the word ends with the letter Y and the two B’s occur side by side
(d) D = the word does not end with the letter Y and the B’s do not occur side by side.
3. In a class of 60 students, 40% are international students. Five students are chosen at random.
(a) If the students are chosen without replacement, what is the probability none of them are interna-
tional students?
(b) If the students are chosen without replacement, what is the probability at least two of them are
international students?
(c) If the students are chosen with replacement, what is the probability none of them are international
students?
(d) If the students are chosen with replacement, what is the probability exactly one of them is an
international student?
4. (a) $A$ and $B$ are mutually exclusive events with $P(A) = 0.6$ and $P(B) = 0.3$. Find $P(A \cap B)$ and $P(A \cup B)$.
(b) $A$ and $B$ are independent events with $P(A) = 0.5$ and $P(B) = 0.1$. Find $P(A \cap B)$ and $P(A \cup B)$.
(c) P (A) = 0:4, P (B) = 0:6, and P (AjB) = 0:5. Find P (BjA) and P BjA .
(d) $P(B) = 0.3$, $P(A|B) = 0.6$, and $P(A|\bar{B}) = 0.2$. Find $P(A)$ and $P(\bar{A})$.
5. Students Aziz, Bo and Chun each independently write a tutorial test. The probability of passing the
test is 0:8 for Aziz, 0:6 for Bo, and 0:7 for Chun.
(a) Find the probability that at least one of them passes the test.
(b) Find the probability that exactly two of them pass the test.
(c) If exactly two of them pass the test, what is the probability it was Bo who did not pass the test?
6. In 2013, 10% of all immigrants to Canada were refugees. Forty-five percent of the refugees were
under 25 years old, and 30% of the non-refugee immigrants were under 25 years old. A person is
chosen at random from those who immigrated to Canada in 2013.
(a) What is the probability the randomly chosen person is a refugee and under 25 years old?
(b) What is the probability the randomly chosen person is a non-refugee immigrant and under 25 years
old?
(c) What is the probability the randomly chosen person is under 25 years old?
(d) If the randomly chosen person is under 25 years old, then what is the probability the person is a
refugee?
Sample Midterm 2
1. Traffic accidents at the intersection of University Avenue and Westmount Road occur according to a
Poisson process with an average rate of 0:5 accidents per day. If no accidents occur during a week of
seven days (Sunday to Saturday), the week is declared a “Safe-Week”.
(a) Find the probability of a Safe-Week.
(b) Find the probability that in a period of 10 non-overlapping weeks there is at most 1 Safe-Week.
(c) Find the probability that there are 6 accidents during the two-week period November 1-14.
(d) Given that 6 accidents occurred during the two-week period November 1-14, find the probability
that the first week (November 1-7) was a Safe-Week.
(e) Suppose an accident has just occurred. What is the expected waiting time until the next accident?
2. Suppose the random variable $X$ has a $Geometric(p)$ distribution.
(a) Prove that $P(X \ge x) = (1 - p)^x$ for $x = 0, 1, 2, \ldots$.
(b) Prove that $P(X \ge x + y \mid X \ge x) = P(X \ge y)$ for all non-negative integers $x$ and $y$.
(c) Prove that $E(X) = (1 - p)/p$. Be sure to show all your work.
(d) If p is the probability of success in a sequence of Bernoulli trials then find the expected total number
of trials to obtain the first success.
3. $X$ is a continuous random variable with cumulative distribution function
$$
F(x) = \begin{cases}
0 & \text{if } x \le 0; \\
2x^2 & \text{if } 0 < x \le \frac{1}{2}; \\
4x - 2x^2 - 1 & \text{if } \frac{1}{2} < x < 1 \\
1 & \text{if } x \ge 1
\end{cases}
$$
(a) Find $f(x)$, the probability density function of $X$, for all $x \in \Re$ (the set of real numbers).
(b) Find P (X 0:2).
(c) Show E (2X + 1) = 2.
(d) If $Var(X) = 1/24$, then find $E(X^2)$ WITHOUT using integration.
4. In a large population the probability a randomly chosen person has a rare disease is 0:02. An
inexpensive diagnostic test gives a false positive result (person does not have the disease but the test
says they do) with probability 0:05 and a false negative result (person has the disease but the test says
they don’t) with probability 0:01. The inexpensive test costs $10. If a person tests positive they are
given a more expensive diagnostic test that costs $100 which correctly identifies all persons with the
disease.
(a) What is the expected cost per person for this testing protocol?
(b) To reduce the number of cases being missed due to false negative results, a second test is added to
the testing protocol above as follows: If a person tests negative on the first test using the inexpensive
test then the person is tested again using the inexpensive test. If the second test is negative then no
more testing is done. If the second test is positive then the person is tested with the more expensive
test. What is the expected cost per person for this testing protocol?
1
f (x) = x for 0 < x < 1
and zero otherwise where > 0 is a constant.
(a) Find P (X 0:25).
(b) Find E X k for k = 1; 2; : : :.
(c) Let Y = ln X. Show that Y v Exponential(1). Be sure to show all your work.
6. For each of the functions in the table indicate with a X which of the statements A-M is true. For example, statement A is true for all these functions so there is a X in each box in the column labelled A.
                                        A  B  C  D  E  F  G  H  I  J  K  L  M
the p.f. f(x) of a discrete r.v.        X
the c.d.f. F(x) of a discrete r.v.      X
the p.d.f. f(x) of a continuous r.v.    X
the c.d.f. F(x) of a continuous r.v.    X
= do not use this box

A: The value of the function is always non-negative.
B: Every value of the function lies in the interval [0; 1].
C: The limit of the function as $x \to \infty$ equals 1.
D: The limit of the function as $x \to -\infty$ equals 0.
E: The domain of the function is countable.
F: The domain of the function is $\Re$ (the set of real numbers).
G: The function is non-decreasing for all $x \in \Re$.
H: The function is increasing for all $x \in \Re$.
I: The function is right-continuous for all $x \in \Re$.
J: The function is continuous for all $x \in \Re$.
K: The sum of the function over all values of x equals 1.
L: The area bounded by the graph of the curve of the function and the x axis equals 1.
M: The derivative of the function is equal to P (X = x).

Sample Final Exam
Part A: Circle the letter corresponding to the correct answer.
1. Three numbers are drawn at random WITH replacement from the digits f0; 1; 2; 3; 4; 5; 6; 7; 8; 9g.
The probability that there is a repeated number among the three numbers drawn is:
3 102 +10
A: 103
2
B: 1 3 10103
C: 1 101093 8
10 9+10
D: 103
3 10 8
E: 10 9 8
2. If two events A and B are independent and mutually exclusive, then:
A: this is impossible
B: A must have a probability 1
C: both A and B must have probability 1
D: both A and B must have probability 0
E: either A or B (or both) have probability 0
3. In a specific population 50% of all people are males. Five percent of the males are colour-blind,
and 0:25% of the females are colour-blind. If a randomly chosen person is colour-blind, then the
probability, to 3 decimal places, that the person is a male is:
A: 0:050
B: 0:025
C: 0:952
D: 0:026
E: None of the above
4. Sharks normally attack swimmers at Myhammy Beach on average about one day in 200. There
have been no shark attacks in the last 400 days. The probability of this happening is approxi-
mately:
A: $\frac{1}{2}$
B: $e^{-1}$
C: $2e^{-2}$
D: $e^{-2}$
E: none of the above
5. Suppose new posts on a forum occur independently at a constant rate of 3 posts per half hour.
The probability that exactly 20 non-overlapping minutes in a half-hour period contain no new
posts is:
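A: $\binom{30}{20}\left(e^{-0.1}\right)^{10}\left(1 - e^{-0.1}\right)^{20}$
B: $\binom{30}{20}\left(e^{-0.1}\right)^{20}\left(1 - e^{-0.1}\right)^{10}$
C: $\binom{29}{19}\left(e^{-0.1}\right)^{20}\left(1 - e^{-0.1}\right)^{10}$
D: $\binom{30}{20}\left(e^{-0.1}\right)^{30}$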

6. Suppose $X$ is a non-negative random variable with $E(X^2) = 6$ and $Var(X) = 2$ then
A: E (X) = 2
B: E (X) = 4
C: E (X) = 2
D: E (X) = 6
E: there is not enough information to determine E (X).
7. A certain river floods every year. Suppose the low-water mark is set at one meter and the high-
water mark is modeled by the random variable X with cumulative distribution function:
$$
F(x) = \begin{cases}
1 - \frac{1}{x^2} & x \ge 1 \\
0 & x < 1
\end{cases}
$$
The probability that the high-water mark is greater than 3m but less than 4m is:
A: $\frac{137}{144}$
B: $\frac{7}{144}$
C: $\frac{16}{144}$
D: $\frac{9}{144}$

8. In the top half of the graph below is the probability density function of the random variable X
and in the bottom half is the probability density function of the random variable Y . Assume the
probability density function equals 0 outside the visible area of the graphs.
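[Figure: two panels. Top panel: probability density function $f(x)$ of $X$. Bottom panel: probability density function $f(y)$ of $Y$. In both panels the horizontal axis runs from 0 to 4 and the vertical axis from 0 to 0.8.]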
Which one of the following statements is false?
A: E (Y ) > E (X)
B: E (X) = 1
C: P (X = 1) = P (Y = 1)
D: sd (X) > sd (Y )
E: V ar (X) 1
9. Suppose that $X \sim U(1, 6)$ and $Y \sim U(1, 20)$. Which one of the following statements is true?
A: P (X > 3) < P (Y > 10)

B: P (X > 3) = P (Y > 10)
C: P (X > 3) > P (Y > 10)
D: Not enough information to determine.
10. Suppose X v Exponential(2). Then P (X < 3jX > 1) is equal to:
1 1
A: 2e
B: e 1
C: 1 e 1
D: 1 e 2
E: e 2
11. Suppose X is a random variable with V ar (X) > 0. Which one of the following statements is
true?
A: E(X 2 ) > [E(X)]2

B: E(X 2 ) = [E(X)]2
C: E(X 2 ) < [E(X)]2
D: Not enough information to determine.
12. Average daily caffeine consumption is 165 mg. Ninety-nine percent of people consume less
than 380 mg. Assuming daily caffeine consumption follows a Normal distribution, the standard
deviation is:
A: 130:7
B: 107:5
C: 167:8
D: 92:4
13. Suppose $X \sim Poisson(2)$, $Y \sim Poisson(3)$, and that $X$ and $Y$ are independent. The joint probability function of $X$ and $Y$ is:
A: $f(x, y) = e^{-6}\frac{2^x3^y}{x!y!}$; $x = 0, 1, 2, \ldots$; $y = 0, 1, 2, \ldots$
B: $f(x, y) = e^{-5}\frac{5^{x+y}}{x!y!}$; $x = 0, 1, 2, \ldots$; $y = 0, 1, 2, \ldots$
C: $f(x, y) = e^{-5}\frac{2^x3^y}{x!y!}$; $x = 0, 1, 2, \ldots$; $y = 0, 1, 2, \ldots$
D: $f(x, y) = e^{-6}\frac{6^{x+y}}{x!y!}$; $x = 0, 1, 2, \ldots$; $y = 0, 1, 2, \ldots$
14. Suppose X v N ( 2; 1), Y v N (2; 4) and Z v N (0; 1) independently. Let W = 3X + Y +

2Z. Which one of the following statements is true?
A: W v N (3; 9)
B: W v N (8; 9)
C: W v U ( 4; 17)
D: W v N (8; 17)
E: W v N (3; 17)
15. The random variable which would be the LEAST accurately approximated using the Central
Limit Theorem is:
A: the sum of 40 fair 6-sided dice.

B: the average grade of 913 students in STAT 230.
C: the total waiting time for 5 events in a Poisson process with rate = 10 events per hour.
D: the number of Heads in 50 flips of a fair coin.
E: the number of events in 5 hours in a Poisson process with rate = 10 events per hour.
16. If X v Binomial (100; 0:4) then P (X 45) is best approximated by:
A: P Z 45:5
p 40
24
where Z v N (0; 1)
B: P Z 44:5
p 40
24
where Z v N (0; 1)
C: P Z 45
p 40
24
where Z v N (0; 1)
D: P Z 46:5
p 40
24
where Z v N (0; 1)
Part B: Fill in the blank
1. For each of (a) to (j) choose the appropriate name of the distribution for the random variable
X from the following list: Discrete Uniform, Hypergeometric, Binomial, Negative Binomial,
Geometric, Poisson, Continuous Uniform, Exponential, Normal, and Multinomial:
(a) A researcher is interested in studying a rare disease among beavers in Algonquin National
Park. The researcher decides to capture and test beavers until the first beaver with the
disease is found. X = number of disease-free beavers tested by the researcher.
(b) The pointer on a circular spinner is spun. X = point on the circumference of the circle at
which the pointer stops (assume almost no friction).
(c) An instructor has n identical looking keys in the bottom of her knapsack and only one of
the keys opens the door to her office. She draws a key from her knapsack and tries to open
her office door. If the key does not work she draws another key. She continues this process
until she obtains the correct key. X = the draw on which she obtains the correct key where
the draws are numbered 1 (1st draw), 2 (2nd draw), etc.
(d) The probability of winning any prize in a weekly lottery is p. Jamie decides to purchase
one lottery ticket each week until s/he wins 3 prizes. X = number of weeks in which s/he
wins no prizes.
(e) Electrical power failures in a large Canadian city occur independently of each other through-
out the year at a uniform rate with little chance of more than one failure on a given day.
X = number of power failures in a month.
(f) In a shipment of N smartphones there are D defective smartphones. A sample of n smart-

phones are chosen at random and tested. X = number of defective smartphones in the
sample.
(g) Aziz, Bo and Chow play a game together in which Aziz wins with probability p, Bo wins
with probability q, and Chow wins with probability r (p + q + r = 1). They play the game
n times. X = number of times Aziz or Bo wins.
(h) Since men tend to have larger feet on average than women a very long footprint at a crime
scene might indicate the criminal is male. A criminal investigator randomly selects 100
males and measures their right foot in centimeters. X = length of right foot in centimeters
of a randomly chosen male from the 100 measured.
(i) Hits on a particular website occur independently of each other at a uniform rate throughout
the day with little chance of more than one hit in a one minute interval. X = waiting time
between consecutive hits on the website.
(j) In a very large city, the probability that a randomly chosen person supports a new bylaw
banning Christmas decorations until after Remembrance Day is equal to p. A sample of 100
people are selected at random. X = number of people in sample who support the bylaw.
2. Here are five concepts covered in STAT 230:
A: Bernoulli trials
B: Poisson process
C: Binomial approximation to the Hypergeometric
D: Poisson approximation to the Binomial
E: Central Limit Theorem
For each of the following statements indicate with a letter A, B, C, D, or E which of the
above concepts is best associated with that statement.
(a) $n$ random variables are independent and identically distributed with mean $\mu$ and variance $\sigma^2$.
(b) Events occur at a uniform rate over time.
(c) Trials are independent.
(d) The probability p of one of only two possible outcomes is constant on each trial.
(e) The number of random draws n made without replacement from a population of two types
of items is small relative to the size of the population.
(f) The probability of 2 or more events in a sufficiently short period of time is approximately
zero.
(g) The probability p of one of only two possible outcomes is constant and small on each trial.
(h) The number of events occurring in non-overlapping time intervals are independent.
(i) The number of random variables $n$ in the sum or average approaches $\infty$.
(j) The number of independent trials n is large.

Part C: Long Answer
1. X and Y are discrete random variables with joint probability function
f(x, y) = P(X = x, Y = y)
                 x
            0      1      2
     -1   0.15   0.05   0.15
y     0   0.15   0.05   0.20
      1   0.10   0.00   0.15
(a) Are X and Y independent random variables? Justify your answer.

(b) Find the covariance of X and Y .
(c) Find the correlation coefficient of X and Y .
(d) Find V ar (2X Y + 1).
(e) Tabulate the conditional probability function of X given Y = 0:
(f) Tabulate the probability function of T = X + Y .
2. The weights of full-term babies born in Ontario are Normally distributed with mean $\mu = 3.5$ kg and standard deviation $\sigma = 0.5$ kg.
(a) What proportion of full-term babies born in Ontario weigh more than 4:25 kg?
(b) What proportion of full-term babies born in Ontario weigh between 3:1 and 4:25 kg?
(c) What proportion of full-term babies born in Ontario have weights within one standard de-
viation of the mean?
(d) A sample of 9 babies is drawn from all full-term babies born in Ontario in 2014. Give
an expression for the probability that exactly 1 baby weighs more than 4:25 kg, exactly 5
babies weigh between 3:1 kg and 4:25 kg, and exactly 3 babies weigh less than 3:1 kg. You
do not need to evaluate the expression.
(e) A sample of 9 babies is drawn from all full-term babies born in Ontario in 2014. What is
the probability that their average weight exceeds 3:4 kg?
(f) Let $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ be the average weight of $n$ babies chosen at random. Find the smallest value of $n$ such that $P\left(\left|\bar{X} - 3.5\right| \le 0.05\right) \ge 0.9$.
3. Suppose $X$ is a continuous random variable with probability density function:
$$
f(x) = \begin{cases}
\frac{1}{\theta^2}xe^{-x/\theta} & x \ge 0 \\
0 & \text{otherwise.}
\end{cases}
$$
(a) Using the Gamma function or integration by parts show that $E(X^k) = \theta^k(k + 1)!$ for $k = 1, 2, \ldots$.
(b) Use the result given in (a) to find $E(X)$ and $Var(X)$.
(c) Find the probability density function of $Y = \sqrt{X}$.
(d) Suppose that $X_1, X_2, \ldots, X_{98}$ are independent random variables, each having the probability density function $f(x)$. Let $\bar{X} = \sum_{i=1}^{98} X_i/98$ denote the sample mean. Use a suitable approximation to calculate the probability
$$P\left(\frac{\left|\bar{X} - 2\theta\right|}{\theta/7} < 1.15\right).$$
4. Ten friends go to an all-you-can-eat sushi restaurant and sit at one large round table. Each person
likes spicy food with probability 0:6, independently of each other. We say a “match” occurs
when two people sitting next to each other BOTH like spicy food or BOTH do not like spicy
food. Let
$$
X_i = \begin{cases}
1 & \text{if there is a “match” between person } i \text{ and person } i + 1 \\
0 & \text{otherwise.}
\end{cases}
$$
for $i = 1, 2, \ldots, 10$ where person 11 is defined to be person 1 since they are at a circular table.
(a) Find the expected value of Xi .

(b) Find the expected total number of “matches” at the table.
(c) Find the variance of Xi .
(d) Show that the covariance between X1 and X2 is exactly 0:0096.
(e) Find the variance of the total number of “matches” at the table.
14. SOLUTIONS TO SAMPLE TESTS
Sample Midterm 1 Solutions

1. (a) $A$ = they all pick the same tire.
Sample space $= \{1111, 1122, 1234, \ldots, 4444\}$ = the set of all $4^4$ permutations of the numbers
1, 2, 3, 4 with repeats. All outcomes are equally probable.
Since $A = \{1111, 2222, 3333, 4444\}$ then
\[ P(A) = \frac{4}{4^4} = \frac{1}{4^3} = \frac{1}{64} = 0.016 \]
(b) $B$ = they all pick a different tire.

$B$ = the set of all $4!$ permutations of the numbers 1, 2, 3, 4 without repeats. Therefore
\[ P(B) = \frac{4!}{4^4} = \frac{3}{32} = 0.094 \]
(c) $C$ = at least two of them pick the same tire.

Since the complement of the event ‘at least 2 of them pick the same tire’ is the event ‘they all pick
a different tire’, therefore
\[ P(C) = 1 - P(B) = 1 - \frac{4!}{4^4} = 1 - \frac{3}{32} = \frac{29}{32} = 0.906 \]
(d) $D$ = exactly one student picks tire 1 and exactly one student picks tire 3.
$D$ = the set of all $4!$ permutations of the numbers 1234, all $\frac{4!}{2!\,1!\,1!}$ permutations of the numbers
1322, and all $\frac{4!}{2!\,1!\,1!}$ permutations of the numbers 1344. Therefore
\[ P(D) = \frac{4! + 2 \cdot \frac{4!}{2!\,1!\,1!}}{4^4} = \frac{24 + 24}{4^4} = \frac{3}{16} = 0.188 \]
2. (a) $A$ = the word ends with the letter Y.

Sample space = the set of all $\frac{11!}{2!\,2!}$ permutations of the letters BBIIPROALTY. All outcomes are
equally probable.
There is only 1 way to place the Y. The remaining 10 letters can be arranged in $\frac{10!}{2!\,2!}$ ways. Therefore
\[ P(A) = \frac{1 \cdot \frac{10!}{2!\,2!}}{\frac{11!}{2!\,2!}} = \frac{1}{11} = 0.091 \]
(b) $B$ = the two B’s occur side by side in the word.

Consider BB as one letter. The number of arrangements of the letters BB IIPROALTY is $\frac{10!}{2!}$.
Therefore
\[ P(B) = \frac{\frac{10!}{2!}}{\frac{11!}{2!\,2!}} = \frac{2}{11} = 0.182 \]
(c) $C$ = the word ends with the letter Y and the two B’s occur side by side.
There is only 1 way to place the Y and we consider BB as one letter. The number of arrangements
of BB IIPROALT is $\frac{9!}{2!}$.
Therefore
\[ P(C) = P(A \cap B) = \frac{1 \cdot \frac{9!}{2!}}{\frac{11!}{2!\,2!}} = \frac{2}{110} = \frac{1}{55} = 0.018 \]
(d) $D$ = the word does not end with the letter Y and the B’s do not occur side by side.
\[ \begin{aligned}
P(D) &= P(\bar{A} \cap \bar{B}) = P(\overline{A \cup B}) \quad \text{by De Morgan’s Laws} \\
&= 1 - P(A \cup B) \\
&= 1 - [P(A) + P(B) - P(A \cap B)] \quad \text{by the Sum Rule} \\
&= 1 - \frac{1}{11} - \frac{2}{11} + \frac{1}{55} = \frac{41}{55} \\
&= 0.745
\end{aligned} \]
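As a quick arithmetic check, the counts used above can be evaluated exactly. This is only a sketch of the same formulas; the variable names are illustrative.

```python
from fractions import Fraction
from math import factorial

# Arrangements of the 11 letters BBIIPROALTY (two B's and two I's repeated).
total = Fraction(factorial(11), factorial(2) * factorial(2))

n_a = Fraction(factorial(10), factorial(2) * factorial(2))  # Y fixed in the last position
n_b = Fraction(factorial(10), factorial(2))                 # BB treated as one letter
n_ab = Fraction(factorial(9), factorial(2))                 # Y last and BB as one letter

p_a, p_b, p_ab = n_a / total, n_b / total, n_ab / total
p_d = 1 - (p_a + p_b - p_ab)  # De Morgan's Laws, then the Sum Rule

print(p_a, p_b, p_ab, p_d)
```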
3. (a) Let $A_i$ be the event exactly $i$ international students are chosen, $i = 0, 1, \ldots, 5$. Then
\[ P(A_i) = \frac{\binom{24}{i}\binom{36}{5-i}}{\binom{60}{5}}, \quad i = 0, 1, \ldots, 5. \]
Therefore
\[ P(\text{none are international students}) = P(A_0) = \frac{\binom{24}{0}\binom{36}{5}}{\binom{60}{5}} = \frac{\binom{36}{5}}{\binom{60}{5}} = 0.069 \]
(b) If the students are chosen without replacement
\[ \begin{aligned}
P(\text{at least 2 are international students}) &= 1 - P(A_0) - P(A_1) \\
&= 1 - \frac{\binom{36}{5}}{\binom{60}{5}} - \frac{\binom{24}{1}\binom{36}{4}}{\binom{60}{5}} \\
&= 1 - 0.069 - 0.259 \\
&= 0.672
\end{aligned} \]
Alternatively
\[ \begin{aligned}
P(\text{at least 2 are international students}) &= P(A_2) + P(A_3) + P(A_4) + P(A_5) \\
&= \sum_{i=2}^{5} \frac{\binom{24}{i}\binom{36}{5-i}}{\binom{60}{5}} \\
&= 0.3608 + 0.2335 + 0.0700 + 0.0078 \\
&= 0.672
\end{aligned} \]
(c) If the students are chosen with replacement
\[ P(\text{none are international students}) = \frac{36^5}{60^5} = (0.6)^5 = 0.078 \]
(d) If the students are chosen with replacement, then the probability we draw the international student
first followed by 4 non-international students is $\frac{24 \cdot 36^4}{60^5} = (0.4)(0.6)^4$. However the international
student could also be drawn on the 2nd, 3rd, 4th, or 5th draw. Therefore
\[ P(\text{exactly 1 international student}) = 5\,(0.4)(0.6)^4 = 0.259 \]
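The Hypergeometric (without replacement) and Binomial (with replacement) calculations above can be reproduced with `math.comb`. This is a minimal sketch; the function name `hypergeom_pf` and its default arguments are assumptions made for illustration.

```python
from math import comb

def hypergeom_pf(x, N=60, r=24, n=5):
    """P(exactly x international students) when n are drawn without replacement."""
    return comb(r, x) * comb(N - r, n - x) / comb(N, n)

# Without replacement (Hypergeometric).
p_none = hypergeom_pf(0)
p_at_least_two = 1 - hypergeom_pf(0) - hypergeom_pf(1)
p_at_least_two_alt = sum(hypergeom_pf(i) for i in range(2, 6))

# With replacement (Binomial with success probability 24/60 = 0.4).
p = 24 / 60
p_none_wr = (1 - p) ** 5
p_exactly_one_wr = comb(5, 1) * p * (1 - p) ** 4

print(round(p_none, 3), round(p_at_least_two, 3),
      round(p_none_wr, 3), round(p_exactly_one_wr, 3))
```

Computing "at least 2" both ways, by complement and by direct sum, mirrors the two approaches shown in part (b).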

4. (a) Since $A$ and $B$ are mutually exclusive events, $P(A \cap B) = 0$ and
\[ P(A \cup B) = P(A) + P(B) = 0.6 + 0.3 = 0.9 \]
(b) Since $A$ and $B$ are independent events, $P(A \cap B) = P(A)P(B) = (0.5)(0.1) = 0.05$ and
\[ \begin{aligned}
P(A \cup B) &= P(A) + P(B) - P(A \cap B) \quad \text{by the Sum Rule} \\
&= 0.5 + 0.1 - 0.05 \\
&= 0.55
\end{aligned} \]
(c)
\[ \begin{aligned}
P(A \cap B) &= P(A|B)P(B) \quad \text{by the Product Rule} \\
&= (0.5)(0.6) = 0.3 \\
P(B|A) &= \frac{P(A \cap B)}{P(A)} = \frac{0.3}{0.4} = \frac{3}{4} = 0.75 \\
P(\bar{B}|A) &= 1 - P(B|A) = 1 - 0.75 = 0.25
\end{aligned} \]
(d)
\[ \begin{aligned}
P(A) &= P(A \cap B) + P(A \cap \bar{B}) \\
&= P(A|B)P(B) + P(A|\bar{B})P(\bar{B}) \quad \text{by the Product Rule} \\
&= (0.6)(0.3) + (0.2)(0.7) = 0.18 + 0.14 \\
&= 0.32 \\
P(\bar{A}) &= 1 - P(A) = 1 - 0.32 \\
&= 0.68
\end{aligned} \]
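5. (a) Let $A$ be the event Aziz passes, $B$ be the event Bo passes, and $C$ be the event Chun passes.
These events are independent events with $P(A) = 0.8$, $P(B) = 0.6$, and $P(C) = 0.7$.
\[ \begin{aligned}
P(\text{at least 1 passes}) &= 1 - P(\text{none of them pass}) = 1 - P(\bar{A} \cap \bar{B} \cap \bar{C}) \\
&= 1 - P(\bar{A})P(\bar{B})P(\bar{C}) \quad \text{since the events are independent} \\
&= 1 - (0.2)(0.4)(0.3) = 1 - 0.024 \\
&= 0.976
\end{aligned} \]
(b)
\[ \begin{aligned}
P(\text{exactly 2 pass}) &= P(A \cap B \cap \bar{C}) + P(A \cap \bar{B} \cap C) + P(\bar{A} \cap B \cap C) \\
&= P(A)P(B)P(\bar{C}) + P(A)P(\bar{B})P(C) + P(\bar{A})P(B)P(C) \quad \text{since the events are independent} \\
&= (0.8)(0.6)(0.3) + (0.8)(0.4)(0.7) + (0.2)(0.6)(0.7) \\
&= 0.144 + 0.224 + 0.084 \\
&= 0.452
\end{aligned} \]
(c)
\[ \begin{aligned}
P(\text{Bo did not pass the test} \mid \text{exactly 2 pass}) &= P(A \cap \bar{B} \cap C \mid \text{exactly 2 pass}) \\
&= \frac{P(A \cap \bar{B} \cap C \cap \{\text{exactly 2 pass}\})}{P(\text{exactly 2 pass})} \\
&= \frac{P(A \cap \bar{B} \cap C)}{P(\text{exactly 2 pass})} \\
&= \frac{0.224}{0.452} \\
&= 0.496
\end{aligned} \]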
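With only three independent events there are just $2^3 = 8$ pass/fail outcomes, so the three answers above can be verified by enumeration. This is a sketch under the stated independence assumption; the dictionary and function names are illustrative.

```python
from itertools import product

# Pass probabilities given in the solution; the events are assumed independent.
pass_prob = {"Aziz": 0.8, "Bo": 0.6, "Chun": 0.7}
names = list(pass_prob)

def outcome_prob(outcome):
    """Probability of one pass/fail pattern, e.g. (True, False, True)."""
    p = 1.0
    for name, passed in zip(names, outcome):
        p *= pass_prob[name] if passed else 1 - pass_prob[name]
    return p

outcomes = list(product([True, False], repeat=3))

p_at_least_one = sum(outcome_prob(o) for o in outcomes if any(o))
p_exactly_two = sum(outcome_prob(o) for o in outcomes if sum(o) == 2)
# Bo is the second entry of each outcome tuple.
p_bo_fails_and_two = sum(outcome_prob(o) for o in outcomes if sum(o) == 2 and not o[1])
p_bo_fails_given_two = p_bo_fails_and_two / p_exactly_two

print(round(p_at_least_one, 3), round(p_exactly_two, 3), round(p_bo_fails_given_two, 3))
```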
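6. (a) Let $R$ be the event the person is a refugee and let $A$ be the event the person is under 25 years old.
\[ \begin{aligned}
P(\text{person is a refugee and under 25 years old}) &= P(R \cap A) = P(A|R)P(R) \quad \text{by the Product Rule} \\
&= (0.45)(0.1) \\
&= 0.045
\end{aligned} \]
(b)
\[ \begin{aligned}
P(\text{person is a non-refugee immigrant and under 25 years old}) &= P(\bar{R} \cap A) \\
&= P(A|\bar{R})P(\bar{R}) \quad \text{by the Product Rule} \\
&= (0.3)(1 - 0.1) \\
&= 0.27
\end{aligned} \]
(c)
\[ \begin{aligned}
P(\text{person is under 25 years old}) &= P(A) \\
&= P(R \cap A) + P(\bar{R} \cap A) \\
&= 0.045 + 0.27 \\
&= 0.315
\end{aligned} \]
(d)
\[ \begin{aligned}
P(\text{person is a refugee} \mid \text{person is under 25 years old}) &= P(R|A) \\
&= \frac{P(A \cap R)}{P(A)} \\
&= \frac{0.045}{0.315} \\
&= 0.143
\end{aligned} \]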
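Parts (a) to (d) follow the usual Product Rule, Law of Total Probability, Bayes’ Theorem sequence, which can be written out in a few lines. The variable names below are illustrative; the three input probabilities are the ones used in the solution.

```python
# Inputs taken from the solution above.
p_refugee = 0.1                    # P(R)
p_under25_given_refugee = 0.45     # P(A | R)
p_under25_given_nonrefugee = 0.3   # P(A | not R)

# Product Rule for the two joint probabilities.
p_refugee_and_under25 = p_under25_given_refugee * p_refugee
p_nonrefugee_and_under25 = p_under25_given_nonrefugee * (1 - p_refugee)

# Law of Total Probability, then Bayes' Theorem.
p_under25 = p_refugee_and_under25 + p_nonrefugee_and_under25
p_refugee_given_under25 = p_refugee_and_under25 / p_under25

print(round(p_refugee_and_under25, 3), round(p_nonrefugee_and_under25, 3),
      round(p_under25, 3), round(p_refugee_given_under25, 3))
```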
Sample Midterm 2 Solutions

1. (a) Accidents occur at the average rate of 0.5 accidents per day or $(7)(0.5) = 3.5$ accidents per
week (7 days).
\[ P(\text{Safe-Week}) = P(0 \text{ accidents in 1 week}) = \frac{(3.5)^0 e^{-3.5}}{0!} = e^{-3.5} = 0.030 \]
(b) Let $Y$ = number of Safe-Weeks in a 10 week period. Then $Y \sim \text{Binomial}(10, e^{-3.5})$.
\[ \begin{aligned}
P(\text{there is at most 1 Safe-Week in a 10 week period}) &= P(Y \le 1) \\
&= \binom{10}{0}\left(e^{-3.5}\right)^0\left(1 - e^{-3.5}\right)^{10} + \binom{10}{1}\left(e^{-3.5}\right)^1\left(1 - e^{-3.5}\right)^9 \\
&= \left(1 - e^{-3.5}\right)^{10} + 10e^{-3.5}\left(1 - e^{-3.5}\right)^9 \\
&= 0.965
\end{aligned} \]
(c) Accidents occur at the average rate of 0.5 accidents per day or $(14)(0.5) = 7$ accidents per two-week
period (14 days).
\[ P(6 \text{ accidents in 2 week period}) = \frac{(7)^6 e^{-7}}{6!} = 0.149 \]
(d)
\[ \begin{aligned}
P(0 \text{ accidents the 1st week} \mid 6 \text{ accidents in 2-week period})
&= \frac{P(0 \text{ accidents the 1st week and 6 accidents in 2-week period})}{P(6 \text{ accidents in 2-week period})} \\
&= \frac{P(0 \text{ accidents the 1st week and 6 accidents the 2nd week})}{P(6 \text{ accidents in 2-week period})} \\
&= \frac{\dfrac{(3.5)^0 e^{-3.5}}{0!} \cdot \dfrac{(3.5)^6 e^{-3.5}}{6!}}{\dfrac{(7)^6 e^{-7}}{6!}} \\
&= \frac{(3.5)^6}{(7)^6} = (0.5)^6 \\
&= 0.016
\end{aligned} \]
(e) Let $Y$ = waiting time until the next accident. Since accidents occur at the average rate of 0.5
accidents per day, $Y \sim \text{Exponential}(0.5^{-1})$ and $E(Y) = 0.5^{-1} = 2$ days.
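The Poisson and Binomial values in parts (a) to (d) can be checked numerically. The sketch below uses only the standard library; the helper `poisson_pf` and the variable names are illustrative.

```python
from math import comb, exp, factorial

def poisson_pf(x, mu):
    """Poisson probability function with mean mu."""
    return mu ** x * exp(-mu) / factorial(x)

rate_per_day = 0.5

# (a) No accidents in one week (mean 3.5).
p_safe_week = poisson_pf(0, 7 * rate_per_day)

# (b) At most 1 Safe-Week in 10 weeks: Binomial(10, p_safe_week).
p_at_most_one = sum(
    comb(10, y) * p_safe_week ** y * (1 - p_safe_week) ** (10 - y) for y in range(2)
)

# (c) Six accidents in a two-week period (mean 7).
p_six_in_two_weeks = poisson_pf(6, 14 * rate_per_day)

# (d) All six accidents fall in the second week, given six in total.
p_none_first_week_given_six = (
    poisson_pf(0, 3.5) * poisson_pf(6, 3.5) / p_six_in_two_weeks
)

print(round(p_safe_week, 3), round(p_at_most_one, 3),
      round(p_six_in_two_weeks, 3), round(p_none_first_week_given_six, 3))
```

The conditional probability in (d) reduces to $(0.5)^6$ because, given the total, each accident is equally likely to fall in either week.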
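2. (a)
\[ \begin{aligned}
P(X \ge x) &= P(X = x) + P(X = x + 1) + P(X = x + 2) + \cdots \\
&= p(1-p)^x + p(1-p)^{x+1} + p(1-p)^{x+2} + \cdots \quad \text{which is a Geometric series} \\
&= \frac{p(1-p)^x}{1 - (1-p)} \\
&= (1-p)^x \quad \text{for } x = 0, 1, \ldots
\end{aligned} \]
(b)
\[ \begin{aligned}
P(X \ge x + y \mid X \ge x) &= \frac{P(X \ge x + y \text{ and } X \ge x)}{P(X \ge x)} = \frac{P(X \ge x + y)}{P(X \ge x)} \\
&= \frac{(1-p)^{x+y}}{(1-p)^x} \\
&= (1-p)^y = P(X \ge y)
\end{aligned} \]
which holds for all non-negative integers $x$ and $y$.
(c) By the Geometric series we have
\[ \sum_{i=0}^{\infty} a r^i = \frac{a}{1 - r}, \quad |r| < 1. \]
By differentiating with respect to $r$ we obtain
\[ \sum_{i=1}^{\infty} a\, i\, r^{i-1} = \frac{a}{(1 - r)^2}, \quad |r| < 1. \]
Therefore
\[ E(X) = \sum_{x=1}^{\infty} x\, p(1-p)^x = p(1-p) \sum_{x=1}^{\infty} x (1-p)^{x-1} = \frac{p(1-p)}{[1 - (1-p)]^2} = \frac{1-p}{p}. \]
(d) Let $N$ = total number of trials to obtain the first success. Then $N = X + 1$ and
\[ E(N) = E(X + 1) = E(X) + 1 = \frac{1-p}{p} + 1 = \frac{1 - p + p}{p} = \frac{1}{p} \]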
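The tail formula, the memoryless property and the mean derived above can be checked numerically for a particular $p$. This is a sketch only: the value $p = 0.3$ and the truncation point of the infinite sums are arbitrary choices made for illustration.

```python
p = 0.3          # hypothetical success probability, chosen only for illustration
TERMS = 2000     # truncation point; (1 - p)**2000 is negligible

def geom_pf(x):
    """P(X = x) for the number of failures before the first success."""
    return p * (1 - p) ** x

def tail(x):
    """Numerical approximation to P(X >= x) by summing the probability function."""
    return sum(geom_pf(k) for k in range(x, x + TERMS))

# (a) P(X >= x) = (1 - p)^x
tail_4 = tail(4)

# (b) Memoryless property: P(X >= x + y | X >= x) = P(X >= y)
conditional = tail(4 + 3) / tail(4)

# (c), (d) E(X) = (1 - p)/p and E(N) = E(X) + 1 = 1/p
mean_x = sum(k * geom_pf(k) for k in range(TERMS))
mean_n = mean_x + 1

print(tail_4, conditional, mean_x, mean_n)
```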
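3. (a) Since
\[ F(x) = \begin{cases} 0 & \text{if } x \le 0 \\ 2x^2 & \text{if } 0 < x \le \frac{1}{2} \\ 4x - 2x^2 - 1 & \text{if } \frac{1}{2} < x < 1 \\ 1 & \text{if } x \ge 1 \end{cases} \]
we have
\[ f(x) = \frac{d}{dx} F(x) = \begin{cases} 0 & \text{if } x \le 0 \text{ or if } x \ge 1 \\ 4x & \text{if } 0 < x \le \frac{1}{2} \\ 4 - 4x & \text{if } \frac{1}{2} < x < 1 \end{cases} \]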
Note: at the breakpoints $x = 0, 0.5, 1$ the pieces of $F(x)$ change, but the one-sided derivatives agree there, so
$f(0) = 0 = f(1)$ and $f(0.5) = 2$.
[Figure: graph of the probability density function f(x) against x]
(b)
\[ P(X \ge 0.2) = 1 - P(X \le 0.2) = 1 - F(0.2) = 1 - 2(0.2)^2 = 1 - 0.08 = 0.92 \]
(c) By symmetry of the probability density function $E(X) = \frac{1}{2}$ and therefore
\[ E(2X + 1) = 2E(X) + 1 = 2\left(\frac{1}{2}\right) + 1 = 2 \]
(d)
\[ E(X^2) = \mathrm{Var}(X) + [E(X)]^2 = \frac{1}{24} + \left(\frac{1}{2}\right)^2 = \frac{7}{24} = 0.292 \]
4. (a)
\[ \text{Expected Cost} = 110\,[(0.02)(0.99) + (0.98)(0.05)] + 10\,[(0.02)(0.01) + (0.98)(0.95)] = 16.880 \]
(b)
\[ \begin{aligned}
\text{Expected Cost} &= 110\,[(0.02)(0.99) + (0.98)(0.05)] \\
&\quad + 20\,\left[(0.02)(0.01)^2 + (0.98)(0.95)^2\right] \\
&\quad + 120\,[(0.02)(0.01)(0.99) + (0.98)(0.95)(0.05)] \\
&= 30.867
\end{aligned} \]
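5. (a)
\[ P(X \le 0.25) = \int_0^{0.25} \theta x^{\theta - 1}\,dx = x^{\theta}\Big|_0^{0.25} = (0.25)^{\theta} \]
(b)
\[ \begin{aligned}
E(X^k) &= \int_0^1 x^k\, \theta x^{\theta - 1}\,dx \\
&= \theta \int_0^1 x^{\theta + k - 1}\,dx \\
&= \frac{\theta}{\theta + k}\, x^{\theta + k}\Big|_0^1 \\
&= \frac{\theta}{\theta + k} \quad \text{for } k = 1, 2, \ldots
\end{aligned} \]
(c) Let $F(x) = P(X \le x)$ be the cumulative distribution function for $X$ and $G(y) = P(Y \le y)$ be
the cumulative distribution function for $Y = -\theta \ln X$. For $y > 0$
\[ \begin{aligned}
G(y) &= P(Y \le y) \\
&= P(-\theta \ln X \le y) \\
&= P\left(X \ge e^{-y/\theta}\right) \\
&= 1 - F\left(e^{-y/\theta}\right)
\end{aligned} \]
For $y > 0$ the probability density function for $Y$ is
\[ \begin{aligned}
g(y) &= \frac{d}{dy}\left[1 - F\left(e^{-y/\theta}\right)\right] \\
&= -f\left(e^{-y/\theta}\right) \frac{d}{dy}\, e^{-y/\theta} \quad \text{by the Chain Rule} \\
&= \theta \left(e^{-y/\theta}\right)^{\theta - 1} \cdot \frac{1}{\theta}\, e^{-y/\theta} \\
&= e^{-y}
\end{aligned} \]
For $y \le 0$, $g(y) = 0$. Since $g(y)$ is the probability density function for an Exponential(1) random
variable we have shown that $Y = -\theta \ln X \sim \text{Exponential}(1)$.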
6. Properties (A to M, listed below) marked X for each function:

the p.f. f(x) of a discrete r.v.:        A, B, E, K
the c.d.f. F(x) of a discrete r.v.:      A, B, C, D, F, G, I
the p.d.f. f(x) of a continuous r.v.:    A, F, L
the c.d.f. F(x) of a continuous r.v.:    A, B, C, D, F, G, I, J

A: The value of the function is always non-negative.
B: Every value of the function lies in the interval [0, 1].
C: The limit of the function as $x \to \infty$ equals 1.
D: The limit of the function as $x \to -\infty$ equals 0.
E: The domain of the function is countable.
F: The domain of the function is $\mathbb{R}$ (the set of real numbers).
G: The function is non-decreasing for all $x \in \mathbb{R}$.
H: The function is increasing for all $x \in \mathbb{R}$.
I: The function is right-continuous for all $x \in \mathbb{R}$.
J: The function is continuous for all $x \in \mathbb{R}$.
K: The sum of the function over all values of x equals 1.
L: The area bounded by the graph of the curve of the function and the x axis equals 1.
M: The derivative of the function is equal to $P(X = x)$.

Sample Exam Solutions

Part A:
1. C
2. E
3. C
4. D
5. B
6. A
7. B
8. D
9. C
10. C
11. A
12. D
13. C
14. D
15. C
16. B
Part B:
1. (a) Geometric
(b) Continuous Uniform
(c) Discrete Uniform
(d) Negative Binomial
(e) Poisson
(f) Hypergeometric
(g) Binomial
(h) Normal
(i) Exponential
(j) Binomial
(a) E
(b) B
(c) A (also D and E)
(d) A
(e) C
(f) B
(g) D
(h) B
(i) E
(j) D (also E)
Part C: Long Answer
1. X and Y are discrete random variables with joint probability function f(x, y) = P(X = x, Y = y):

                 x = 0    x = 1    x = 2    P(Y = y)
   y = 1         0.15     0.05     0.15     0.35
   y = 0         0.15     0.05     0.20     0.40
   y = -1        0.10     0.00     0.15     0.25
   P(X = x)      0.40     0.10     0.50     1.00

(a) Since
\[ P(X = 1, Y = -1) = 0 \ne P(X = 1)P(Y = -1) = (0.1)(0.25) \]
therefore X and Y are not independent random variables.

(b)
\[ \begin{aligned}
E(X) &= (1)(0.1) + (2)(0.5) = 1.1 \\
E(Y) &= (1)(0.35) + (-1)(0.25) = 0.1 \\
E(XY) &= (1)(1)(0.05) + (2)(1)(0.15) + (2)(-1)(0.15) = 0.05 \\
\mathrm{Cov}(X, Y) &= 0.05 - (1.1)(0.1) = -0.06
\end{aligned} \]
(c)
\[ \begin{aligned}
E(X^2) &= (1)^2(0.1) + (2)^2(0.5) = 2.1 \\
\mathrm{Var}(X) &= 2.1 - (1.1)^2 = 0.89 \\
E(Y^2) &= (1)^2(0.35) + (-1)^2(0.25) = 0.6 \\
\mathrm{Var}(Y) &= 0.6 - (0.1)^2 = 0.59 \\
\rho(X, Y) &= \frac{-0.06}{\sqrt{(0.89)(0.59)}} = -0.083
\end{aligned} \]
(d)
\[ \begin{aligned}
\mathrm{Var}(2X - Y + 1) &= \mathrm{Var}(2X - Y) \\
&= (2)^2\,\mathrm{Var}(X) + (-1)^2\,\mathrm{Var}(Y) + 2(2)(-1)\,\mathrm{Cov}(X, Y) \\
&= (4)(0.89) + 0.59 + (-4)(-0.06) \\
&= 3.56 + 0.59 + 0.24 \\
&= 4.39
\end{aligned} \]
(e)
   x                  0                   1                   2                 Total
   P(X = x | Y = 0)   0.15/0.40 = 0.375   0.05/0.40 = 0.125   0.20/0.40 = 0.5   1.0
(f)
   t            -1      0       1       2       3       Total
   P(T = t)     0.10    0.15    0.35    0.25    0.15    1
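The moments in parts (b) to (d) and the tables in (e) and (f) can all be computed directly from the joint probability function. The sketch below is illustrative: the helper `expect` is a made-up name, and $T = X + Y$ is an assumption (the definition of $T$ is not restated in the solution), chosen because it reproduces the table in (f).

```python
from math import sqrt

# Joint probability function f(x, y), keyed by (x, y).
joint = {
    (0, 1): 0.15, (1, 1): 0.05, (2, 1): 0.15,
    (0, 0): 0.15, (1, 0): 0.05, (2, 0): 0.20,
    (0, -1): 0.10, (1, -1): 0.00, (2, -1): 0.15,
}

def expect(g):
    """E[g(X, Y)] under the joint probability function."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

mean_x, mean_y = expect(lambda x, y: x), expect(lambda x, y: y)
cov_xy = expect(lambda x, y: x * y) - mean_x * mean_y
var_x = expect(lambda x, y: x ** 2) - mean_x ** 2
var_y = expect(lambda x, y: y ** 2) - mean_y ** 2
rho = cov_xy / sqrt(var_x * var_y)
var_2x_minus_y = 4 * var_x + var_y - 4 * cov_xy

# (e) Conditional distribution of X given Y = 0.
p_y0 = sum(p for (x, y), p in joint.items() if y == 0)
cond_x_given_y0 = {x: joint[(x, 0)] / p_y0 for x in (0, 1, 2)}

# (f) Distribution of T, ASSUMING T = X + Y.
t_pf = {}
for (x, y), p in joint.items():
    t_pf[x + y] = t_pf.get(x + y, 0.0) + p

print(round(cov_xy, 2), round(rho, 3), round(var_2x_minus_y, 2))
```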

2. The weights of full-term babies born in Ontario are Normally distributed with mean $\mu = 3.5$ kg
and standard deviation $\sigma = 0.5$ kg.
(a) Let $X$ = weight of randomly chosen full-term baby. Then $X \sim N(3.5, (0.5)^2)$.
\[ \begin{aligned}
P(X > 4.25) &= P\left(\frac{X - 3.5}{0.5} > \frac{4.25 - 3.5}{0.5}\right) \\
&= P(Z > 1.5) \quad \text{where } Z \sim N(0, 1) \\
&= 1 - P(Z \le 1.5) \\
&= 1 - 0.93319 \\
&= 0.06681 \approx 0.067
\end{aligned} \]
Therefore the proportion of full-term babies born in Ontario that weigh more than 4.25 kg
is 0.067.
(b)
\[ \begin{aligned}
P(3.1 \le X \le 4.25) &= P\left(\frac{3.1 - 3.5}{0.5} \le \frac{X - 3.5}{0.5} \le \frac{4.25 - 3.5}{0.5}\right) \\
&= P(-0.8 \le Z \le 1.5) \quad \text{where } Z \sim N(0, 1) \\
&= P(Z \le 1.5) - P(Z \le -0.8) \\
&= P(Z \le 1.5) - [1 - P(Z \le 0.8)] \\
&= P(Z \le 1.5) + P(Z \le 0.8) - 1 = 0.93319 + 0.78814 - 1 \\
&= 0.72133 \approx 0.721
\end{aligned} \]
Therefore the proportion of full-term babies born in Ontario that weigh between 3.1 and
4.25 kg is 0.721.
(c)
\[ \begin{aligned}
P(|X - 3.5| \le 0.5) &= P\left(\frac{|X - 3.5|}{0.5} \le \frac{0.5}{0.5}\right) \\
&= P(|Z| \le 1) \quad \text{where } Z \sim N(0, 1) \\
&= 2P(Z \le 1) - 1 = 2(0.84134) - 1 \\
&= 0.68268 \approx 0.683
\end{aligned} \]
Therefore the proportion of full-term babies born in Ontario that have weights within one
standard deviation of the mean is 0.683.
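(d)
P(exactly 1 baby weighs more than 4.25 kg,
exactly 5 babies weigh between 3.1 kg and 4.25 kg,
and exactly 3 babies weigh less than 3.1 kg)
\[ = \frac{9!}{1!\,5!\,3!}\,(0.067)^1 (0.721)^5 (0.212)^3 \]
(e) Let $X_i$ = weight of i’th baby, $i = 1, 2, \ldots, 9$. Then $X_i \sim N(3.5, (0.5)^2)$, $i = 1, 2, \ldots, 9$
independently and $\bar{X} \sim N\left(3.5, \frac{(0.5)^2}{9}\right)$ or $\bar{X} \sim N\left(3.5, \left(\frac{0.5}{3}\right)^2\right)$.
\[ \begin{aligned}
P(\bar{X} > 3.4) &= P\left(\frac{\bar{X} - 3.5}{0.5/3} > \frac{3.4 - 3.5}{0.5/3}\right) \\
&= P(Z > -0.6) \quad \text{where } Z \sim N(0, 1) \\
&= P(Z < 0.6) = 0.72575 \approx 0.726
\end{aligned} \]
(f) Since $X_i \sim N(3.5, (0.5)^2)$, $i = 1, 2, \ldots, n$ independently, then $\bar{X} \sim N\left(3.5, \frac{(0.5)^2}{n}\right)$.
We want
\[ \begin{aligned}
0.9 \le P\left(|\bar{X} - 3.5| \le 0.05\right) &= P\left(\frac{|\bar{X} - 3.5|}{0.5/\sqrt{n}} \le \frac{0.05}{0.5/\sqrt{n}}\right) \\
&= P\left(|Z| \le 0.1\sqrt{n}\right) \quad \text{where } Z \sim N(0, 1) \\
&= 2P\left(Z \le 0.1\sqrt{n}\right) - 1
\end{aligned} \]
or
\[ 0.95 \le P\left(Z \le 0.1\sqrt{n}\right). \]
Since
\[ P(Z \le 1.6449) = 0.95 \]
we need
\[ 0.1\sqrt{n} \ge 1.6449 \quad \text{or} \quad n \ge (16.449)^2 = 270.570 \]
so the smallest value of n is 271.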
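The table lookups in this question can be reproduced with `statistics.NormalDist` from the Python standard library. This is a sketch of the same calculations rather than a replacement for the N(0, 1) tables; the variable names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()                   # standard Normal, N(0, 1)
X = NormalDist(mu=3.5, sigma=0.5)  # weight of one full-term baby

p_over_425 = 1 - X.cdf(4.25)             # part (a)
p_between = X.cdf(4.25) - X.cdf(3.1)     # part (b)
p_within_1sd = X.cdf(4.0) - X.cdf(3.0)   # part (c)

# Part (e): the mean of 9 independent weights has standard deviation 0.5/3.
xbar = NormalDist(mu=3.5, sigma=0.5 / sqrt(9))
p_mean_over_34 = 1 - xbar.cdf(3.4)

# Part (f): smallest n with 0.1 * sqrt(n) >= the 0.95 quantile of N(0, 1).
n_min = ceil((Z.inv_cdf(0.95) * 0.5 / 0.05) ** 2)

print(round(p_over_425, 3), round(p_between, 3), round(p_within_1sd, 3),
      round(p_mean_over_34, 3), n_min)
```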

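3. Suppose X is a continuous random variable with probability density function:
\[ f(x) = \begin{cases} \dfrac{1}{\theta^2}\, x e^{-x/\theta} & x \ge 0 \\ 0 & \text{otherwise.} \end{cases} \]
(a) Let $y = x/\theta$. Then
\[ \begin{aligned}
E(X^k) &= \int_0^{\infty} x^k\, \frac{1}{\theta^2}\, x e^{-x/\theta}\,dx = \frac{1}{\theta^2} \int_0^{\infty} x^{k+1} e^{-x/\theta}\,dx \\
&= \frac{1}{\theta^2} \int_0^{\infty} (\theta y)^{k+1} e^{-y}\, \theta\,dy = \theta^k \int_0^{\infty} y^{(k+2)-1} e^{-y}\,dy = \theta^k\, \Gamma(k + 2) \\
&= \theta^k (k + 1)!
\end{aligned} \]
(b) Let $k = 1$ to obtain
\[ E(X) = \theta^1 (1 + 1)! = 2\theta. \]
Let $k = 2$ to obtain
\[ E(X^2) = \theta^2 (2 + 1)! = 6\theta^2. \]
Then
\[ \mathrm{Var}(X) = E(X^2) - [E(X)]^2 = 6\theta^2 - (2\theta)^2 = 2\theta^2 \]
(c) For $y > 0$ the c.d.f. of $Y$ is
\[ \begin{aligned}
G(y) &= P(Y \le y) = P\left(\sqrt{X} \le y\right) \\
&= P\left(X \le y^2\right) \\
&= F\left(y^2\right)
\end{aligned} \]
where $F(x) = P(X \le x)$ is the c.d.f. of $X$.
For $y > 0$ the p.d.f. of $Y$ is
\[ \begin{aligned}
g(y) &= \frac{d}{dy} G(y) = f\left(y^2\right) \frac{d}{dy}\, y^2 \\
&= \frac{1}{\theta^2}\, y^2 e^{-y^2/\theta}\, (2y) \\
&= \frac{2}{\theta^2}\, y^3 e^{-y^2/\theta}
\end{aligned} \]
and $g(y) = 0$ for $y \le 0$.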

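(d) Since $E(X_i) = 2\theta$ and $\mathrm{Var}(X_i) = 2\theta^2$, $i = 1, 2, \ldots, 98$, then by the Central Limit
Theorem
\[ \bar{X} \sim N\left(2\theta, \frac{2\theta^2}{98}\right) \text{ approximately} \]
\[ \text{or } \bar{X} \sim N\left(2\theta, \left(\frac{\theta}{7}\right)^2\right) \text{ approximately} \]
\[ \text{or } \frac{\bar{X} - 2\theta}{\theta/7} \sim N(0, 1) \text{ approximately.} \]
Therefore
\[ \begin{aligned}
P\left(\frac{|\bar{X} - 2\theta|}{\theta/7} < 1.15\right) &\approx P(|Z| < 1.15) \quad \text{where } Z \sim N(0, 1) \\
&= 2P(Z < 1.15) - 1 \\
&= 2(0.87493) - 1 \\
&= 0.74986 \approx 0.750
\end{aligned} \]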
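The Central Limit Theorem approximation above can be compared with a small simulation. The sketch below assumes the density in this question is that of a Gamma random variable with shape 2 and scale $\theta$ (which matches $\frac{1}{\theta^2} x e^{-x/\theta}$), and uses an arbitrary value of $\theta$, seed, and number of repetitions chosen only for illustration.

```python
import random
from statistics import NormalDist

random.seed(230)    # arbitrary seed so the run is repeatable
theta = 1.5         # hypothetical value; the probability does not depend on it
n, reps = 98, 5_000

hits = 0
for _ in range(reps):
    # gammavariate(2, theta) has density x * exp(-x/theta) / theta**2 for x >= 0.
    xbar = sum(random.gammavariate(2, theta) for _ in range(n)) / n
    if abs(xbar - 2 * theta) / (theta / 7) < 1.15:
        hits += 1

simulated = hits / reps
clt_value = 2 * NormalDist().cdf(1.15) - 1  # the Normal approximation

print(round(clt_value, 3), simulated)
```

The simulated proportion varies from run to run and with the seed, so it should be close to, but not exactly equal to, the approximate value 0.750.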
4. Ten friends go to an all-you-can-eat sushi restaurant and sit at one large round table. Each person
likes spicy food with probability 0.6, independently of each other. We say a “match” occurs
when two people sitting next to each other BOTH like spicy food or BOTH do not like spicy
food. Let
\[ X_i = \begin{cases} 1 & \text{if there is a “match” between person } i \text{ and person } i + 1 \\ 0 & \text{otherwise} \end{cases} \]
for $i = 1, 2, \ldots, 10$, where person 11 is defined to be person 1 since they are at a circular table.
(a) Let $F_i$ be the event person $i$ likes spicy food and $\bar{F}_i$ be the event person $i$ does not like
spicy food.
\[ \begin{aligned}
E(X_i) &= P(X_i = 1) = P(F_i \cap F_{i+1}) + P(\bar{F}_i \cap \bar{F}_{i+1}) \\
&= (0.6)(0.6) + (0.4)(0.4) = 0.52
\end{aligned} \]
(b) Let $T = X_1 + X_2 + \cdots + X_{10}$ = total number of matches. Then
\[ \begin{aligned}
E(T) &= E(X_1) + E(X_2) + \cdots + E(X_{10}) \\
&= 10\,(0.52) = 5.2
\end{aligned} \]
(c)
\[ \begin{aligned}
\mathrm{Var}(X_i) &= P(X_i = 1)\,[1 - P(X_i = 1)] \\
&= (0.52)(0.48) = 0.2496 \approx 0.250
\end{aligned} \]
(d)
\[ \begin{aligned}
E(X_1 X_2) &= P(X_1 = 1, X_2 = 1) \\
&= P(F_1 \cap F_2 \cap F_3) + P(\bar{F}_1 \cap \bar{F}_2 \cap \bar{F}_3) \\
&= (0.6)^3 + (0.4)^3 = 0.28
\end{aligned} \]
Therefore
\[ \begin{aligned}
\mathrm{Cov}(X_1, X_2) &= E(X_1 X_2) - E(X_1)E(X_2) \\
&= 0.28 - (0.52)^2 = 0.0096
\end{aligned} \]
(e) Since $\mathrm{Cov}(X_1, X_2) = \mathrm{Cov}(X_2, X_3) = \cdots = \mathrm{Cov}(X_9, X_{10}) = \mathrm{Cov}(X_{10}, X_1) = 0.0096$
and all other covariances are equal to zero, therefore
\[ \mathrm{Var}(T) = 10\,(0.2496) + 2\,(10)(0.0096) = 2.688 \]
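Because there are only $2^{10} = 1024$ like/dislike patterns around the table, the mean and variance of $T$ can be verified by exact enumeration. This is a minimal sketch; the variable names are illustrative.

```python
from itertools import product

n, p = 10, 0.6   # ten people, each likes spicy food with probability 0.6

mean_t = 0.0
second_moment_t = 0.0
for likes in product([True, False], repeat=n):
    # Probability of this particular pattern of preferences.
    prob = 1.0
    for likes_spicy in likes:
        prob *= p if likes_spicy else 1 - p
    # A match occurs when neighbours i and i + 1 agree (person 11 is person 1).
    matches = sum(likes[i] == likes[(i + 1) % n] for i in range(n))
    mean_t += matches * prob
    second_moment_t += matches ** 2 * prob

var_t = second_moment_t - mean_t ** 2

print(round(mean_t, 4), round(var_t, 4))
```

The enumeration does not use the covariance argument at all, so agreement with 5.2 and 2.688 is an independent check on parts (b) and (e).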

15. SUMMARY OF DISTRIBUTIONS AND N(0, 1) TABLES
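Summary of Discrete Distributions

Each entry gives the notation and parameters, the probability function $f(x)$, the mean $E(X)$, the variance $\mathrm{Var}(X)$, and the moment generating function $M(t)$. Throughout, $q = 1 - p$.

Discrete Uniform(a, b); $b \ge a$; $a, b$ integers
   Probability function: $f(x) = \frac{1}{b - a + 1}$, $x = a, a + 1, \ldots, b$
   Mean: $\frac{a + b}{2}$
   Variance: $\frac{(b - a + 1)^2 - 1}{12}$
   M.g.f.: $M(t) = \frac{1}{b - a + 1} \sum_{x=a}^{b} e^{tx}$, $t \in \mathbb{R}$

Hypergeometric(N, r, n); $N = 1, 2, \ldots$; $n = 0, 1, \ldots, N$; $r = 0, 1, \ldots, N$
   Probability function: $f(x) = \frac{\binom{r}{x}\binom{N - r}{n - x}}{\binom{N}{n}}$, $x = \max(0, n - N + r), \ldots, \min(r, n)$
   Mean: $\frac{nr}{N}$
   Variance: $\frac{nr}{N}\left(1 - \frac{r}{N}\right)\frac{N - n}{N - 1}$
   M.g.f.: not tractable

Binomial(n, p); $0 \le p \le 1$, $q = 1 - p$; $n = 1, 2, \ldots$
   Probability function: $f(x) = \binom{n}{x} p^x q^{n - x}$, $x = 0, 1, \ldots, n$
   Mean: $np$
   Variance: $npq$
   M.g.f.: $M(t) = (pe^t + q)^n$, $t \in \mathbb{R}$

Bernoulli(p); $0 \le p \le 1$, $q = 1 - p$
   Probability function: $f(x) = p^x q^{1 - x}$, $x = 0, 1$
   Mean: $p$
   Variance: $pq$
   M.g.f.: $M(t) = pe^t + q$, $t \in \mathbb{R}$

Negative Binomial(k, p); $0 < p \le 1$, $q = 1 - p$; $k = 1, 2, \ldots$
   Probability function: $f(x) = \binom{x + k - 1}{x} p^k q^x = \binom{-k}{x} p^k (-q)^x$, $x = 0, 1, \ldots$
   Mean: $\frac{kq}{p}$
   Variance: $\frac{kq}{p^2}$
   M.g.f.: $M(t) = \left(\frac{p}{1 - qe^t}\right)^k$, $t < -\ln q$

Geometric(p); $0 < p \le 1$, $q = 1 - p$
   Probability function: $f(x) = pq^x$, $x = 0, 1, \ldots$
   Mean: $\frac{q}{p}$
   Variance: $\frac{q}{p^2}$
   M.g.f.: $M(t) = \frac{p}{1 - qe^t}$, $t < -\ln q$

Poisson($\mu$); $\mu \ge 0$
   Probability function: $f(x) = \frac{e^{-\mu}\mu^x}{x!}$, $x = 0, 1, \ldots$
   Mean: $\mu$
   Variance: $\mu$
   M.g.f.: $M(t) = e^{\mu(e^t - 1)}$, $t \in \mathbb{R}$

Multinomial($n; p_1, p_2, \ldots, p_k$); $0 \le p_i \le 1$, $i = 1, 2, \ldots, k$, and $\sum_{i=1}^{k} p_i = 1$
   Probability function: $f(x_1, x_2, \ldots, x_k) = \frac{n!}{x_1!\,x_2! \cdots x_k!}\, p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$, $x_i = 0, 1, \ldots, n$, $i = 1, 2, \ldots, k$, and $\sum_{i=1}^{k} x_i = n$
   Mean: $E(X_i) = np_i$, $i = 1, 2, \ldots, k$
   Variance: $\mathrm{Var}(X_i) = np_i(1 - p_i)$, $i = 1, 2, \ldots, k$
   M.g.f.: $M(t_1, t_2, \ldots, t_{k-1}) = \left(p_1 e^{t_1} + p_2 e^{t_2} + \cdots + p_{k-1} e^{t_{k-1}} + p_k\right)^n$, $t_i \in \mathbb{R}$, $i = 1, 2, \ldots, k - 1$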
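Summary of Continuous Distributions

Each entry gives the notation and parameters, the probability density function $f(x)$, the mean $E(X)$, the variance $\mathrm{Var}(X)$, and the moment generating function $M(t)$.

Uniform(a, b); $b > a$
   Probability density function: $f(x) = \frac{1}{b - a}$, $a \le x \le b$
   Mean: $\frac{a + b}{2}$
   Variance: $\frac{(b - a)^2}{12}$
   M.g.f.: $M(t) = \frac{e^{bt} - e^{at}}{(b - a)t}$ for $t \ne 0$; $M(t) = 1$ for $t = 0$

Exponential($\theta$); $\theta > 0$
   Probability density function: $f(x) = \frac{1}{\theta} e^{-x/\theta}$, $x \ge 0$
   Mean: $\theta$
   Variance: $\theta^2$
   M.g.f.: $M(t) = \frac{1}{1 - \theta t}$, $t < \frac{1}{\theta}$

$N(\mu, \sigma^2) = G(\mu, \sigma)$; $\mu \in \mathbb{R}$, $\sigma^2 > 0$
   Probability density function: $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2/(2\sigma^2)}$, $x \in \mathbb{R}$
   Mean: $\mu$
   Variance: $\sigma^2$
   M.g.f.: $M(t) = e^{\mu t + \sigma^2 t^2/2}$, $t \in \mathbb{R}$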
N(0,1) Cumulative
Distribution Function
This table gives values of F(x) = P(X ≤ x) for X ~ N(0,1) and x ≥ 0

x 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.50000 0.50399 0.50798 0.51197 0.51595 0.51994 0.52392 0.52790 0.53188 0.53586
0.1 0.53983 0.54380 0.54776 0.55172 0.55567 0.55962 0.56356 0.56749 0.57142 0.57535
0.2 0.57926 0.58317 0.58706 0.59095 0.59483 0.59871 0.60257 0.60642 0.61026 0.61409
0.3 0.61791 0.62172 0.62552 0.62930 0.63307 0.63683 0.64058 0.64431 0.64803 0.65173
0.4 0.65542 0.65910 0.66276 0.66640 0.67003 0.67364 0.67724 0.68082 0.68439 0.68793
0.5 0.69146 0.69497 0.69847 0.70194 0.70540 0.70884 0.71226 0.71566 0.71904 0.72240
0.6 0.72575 0.72907 0.73237 0.73565 0.73891 0.74215 0.74537 0.74857 0.75175 0.75490
0.7 0.75804 0.76115 0.76424 0.76730 0.77035 0.77337 0.77637 0.77935 0.78230 0.78524
0.8 0.78814 0.79103 0.79389 0.79673 0.79955 0.80234 0.80511 0.80785 0.81057 0.81327
0.9 0.81594 0.81859 0.82121 0.82381 0.82639 0.82894 0.83147 0.83398 0.83646 0.83891
1.0 0.84134 0.84375 0.84614 0.84849 0.85083 0.85314 0.85543 0.85769 0.85993 0.86214
1.1 0.86433 0.86650 0.86864 0.87076 0.87286 0.87493 0.87698 0.87900 0.88100 0.88298
1.2 0.88493 0.88686 0.88877 0.89065 0.89251 0.89435 0.89617 0.89796 0.89973 0.90147
1.3 0.90320 0.90490 0.90658 0.90824 0.90988 0.91149 0.91309 0.91466 0.91621 0.91774
1.4 0.91924 0.92073 0.92220 0.92364 0.92507 0.92647 0.92785 0.92922 0.93056 0.93189
1.5 0.93319 0.93448 0.93574 0.93699 0.93822 0.93943 0.94062 0.94179 0.94295 0.94408
1.6 0.94520 0.94630 0.94738 0.94845 0.94950 0.95053 0.95154 0.95254 0.95352 0.95449
1.7 0.95543 0.95637 0.95728 0.95818 0.95907 0.95994 0.96080 0.96164 0.96246 0.96327
1.8 0.96407 0.96485 0.96562 0.96638 0.96712 0.96784 0.96856 0.96926 0.96995 0.97062
1.9 0.97128 0.97193 0.97257 0.97320 0.97381 0.97441 0.97500 0.97558 0.97615 0.97670
2.0 0.97725 0.97778 0.97831 0.97882 0.97932 0.97982 0.98030 0.98077 0.98124 0.98169
2.1 0.98214 0.98257 0.98300 0.98341 0.98382 0.98422 0.98461 0.98500 0.98537 0.98574
2.2 0.98610 0.98645 0.98679 0.98713 0.98745 0.98778 0.98809 0.98840 0.98870 0.98899
2.3 0.98928 0.98956 0.98983 0.99010 0.99036 0.99061 0.99086 0.99111 0.99134 0.99158
2.4 0.99180 0.99202 0.99224 0.99245 0.99266 0.99286 0.99305 0.99324 0.99343 0.99361
2.5 0.99379 0.99396 0.99413 0.99430 0.99446 0.99461 0.99477 0.99492 0.99506 0.99520
2.6 0.99534 0.99547 0.99560 0.99573 0.99585 0.99598 0.99609 0.99621 0.99632 0.99643
2.7 0.99653 0.99664 0.99674 0.99683 0.99693 0.99702 0.99711 0.99720 0.99728 0.99736
2.8 0.99744 0.99752 0.99760 0.99767 0.99774 0.99781 0.99788 0.99795 0.99801 0.99807
2.9 0.99813 0.99819 0.99825 0.99831 0.99836 0.99841 0.99846 0.99851 0.99856 0.99861
3.0 0.99865 0.99869 0.99874 0.99878 0.99882 0.99886 0.99889 0.99893 0.99896 0.99900
3.1 0.99903 0.99906 0.99910 0.99913 0.99916 0.99918 0.99921 0.99924 0.99926 0.99929
3.2 0.99931 0.99934 0.99936 0.99938 0.99940 0.99942 0.99944 0.99946 0.99948 0.99950
3.3 0.99952 0.99953 0.99955 0.99957 0.99958 0.99960 0.99961 0.99962 0.99964 0.99965
3.4 0.99966 0.99968 0.99969 0.99970 0.99971 0.99972 0.99973 0.99974 0.99975 0.99976
3.5 0.99977 0.99978 0.99978 0.99979 0.99980 0.99981 0.99981 0.99982 0.99983 0.99983

N(0,1) Quantiles: This table gives values of F-1(p) for p ≥ 0.5

p 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.075 0.08 0.09 0.095
0.5 0.0000 0.0251 0.0502 0.0753 0.1004 0.1257 0.1510 0.1764 0.1891 0.2019 0.2275 0.2404
0.6 0.2533 0.2793 0.3055 0.3319 0.3585 0.3853 0.4125 0.4399 0.4538 0.4677 0.4959 0.5101
0.7 0.5244 0.5534 0.5828 0.6128 0.6433 0.6745 0.7063 0.7388 0.7554 0.7722 0.8064 0.8239
0.8 0.8416 0.8779 0.9154 0.9542 0.9945 1.0364 1.0803 1.1264 1.1503 1.1750 1.2265 1.2536
0.9 1.2816 1.3408 1.4051 1.4758 1.5548 1.6449 1.7507 1.8808 1.9600 2.0537 2.3263 2.5758
