Probability Aug 16
Contents

1 Foundations
    1.1 Embracing uncertainty
    1.2 Axioms of probability
    1.3 Calculating the size of various sets
    1.4 Probability experiments with equally likely outcomes
    1.5 Sample spaces with infinite cardinality
    1.6 Short Answer Questions
    1.7 Problems
2 Discrete-type random variables
    2.12.5 Reliability of a single backup
    2.13 Short Answer Questions
    2.14 Problems
5 Wrap-up

6 Appendix
    6.1 Some notation
    6.2 Some sums
    6.3 Frequently used distributions
        6.3.1 Key discrete-type distributions
        6.3.2 Key continuous-type distributions
    6.4 Normal tables
    6.5 Answers to short answer questions
    6.6 Solutions to even numbered problems
Preface
A key objective of these notes is to convey how to deal with uncertainty in both qualitative and
quantitative ways. Uncertainty is typically modeled as randomness. We must make decisions with
partial information all the time in our daily lives, for instance when we decide what activities to
pursue. Engineers deal with uncertainty in their work as well, often with precision and analysis.
A challenge in applying reasoning to real world situations is to capture the main issues in a
mathematical model. The notation that we use to frame a problem can be critical to understanding
or solving the problem. There are often events, or variables, that need to be given names.
Probability theory is widely used to model systems in engineering and scientific applications.
These notes adopt the most widely used framework of probability, namely the one based on Kolmogorov's axioms of probability. The idea is to assume a mathematically solid definition of the
model. This structure encourages a modeler to have a consistent, if not completely accurate, model.
It also offers a commonly used mathematical language for sharing models and calculations.
Part of the process of learning to use the language of probability theory is learning classifications
of problems into broad areas. For example, some problems involve finite numbers of possible
alternatives, while others concern real-valued measurements. Many problems involve interaction of
physically independent processes. Certain laws of nature or mathematics cause some probability
distributions, such as the normal bell-shaped distribution often mentioned in popular literature, to
frequently appear. Thus, there is an emphasis in these notes on well-known probability distributions
and why each of them arises frequently in applications.
These notes were written for the undergraduate course, ECE 313: Probability with Engineering
Applications, offered by the Department of Electrical and Computer Engineering at the University
of Illinois at Urbana-Champaign. The official prerequisites of the course ensure that students have
had calculus, including Taylor series expansions, integration over regions in the plane, the use of
polar coordinates, and some basic linear algebra.
The author gratefully acknowledges the students and faculty who have participated in this
course through the years. He is particularly grateful to Professor D. V. Sarwate, who first introduced the course and built up much material for it on the course website.
B. Hajek
August 2016
Organization
Chapter 1 presents an overview of the many applications of probability theory, and then explains
the basic concepts of a probability model and the axioms commonly assumed of probability models.
Often probabilities are assigned to possible outcomes based on symmetry. For example, when a six-sided die is rolled, it is usually assumed that the probability that a particular number i shows is 1/6, for 1 ≤ i ≤ 6. For this reason, we also discuss in Chapter 1 how to determine the sizes of various
finite sets of possible outcomes.
Random variables are introduced in Chapter 2 and examined in the context of a finite, or
countably infinite, set of possible outcomes. Notions of expectation (also known as mean), variance,
hypothesis testing, parameter estimation, multiple random variables, and well known probability
distributions (Poisson, geometric, and binomial) are covered. The Bernoulli process is considered; it
provides a simple setting to discuss a long, even infinite, sequence of event times, and provides a
tie between the binomial and geometric probability distributions.
The focus shifts in Chapter 3 from discrete-type random variables to continuous-type random
variables. The chapter takes advantage of many parallels and connections between discrete-type
and continuous-type random variables. The most important well known continuous-type distributions are covered: uniform, exponential, and normal (also known as Gaussian). Poisson processes
are introduced; they are continuous-time limits of the Bernoulli processes described in Chapter
2. Parameter estimation and binary hypothesis testing are covered for continuous-type random
variables in this chapter as they are for discrete-type random variables in Chapter 2.
Chapter 4 considers groups of random variables, with an emphasis on two random variables.
Topics include describing the joint distribution of two random variables, covariance and correlation coefficient, and prediction or estimation of one random variable given observation of another.
Somewhat more advanced notions from calculus come in here, in order to deal with joint probability
densities, entailing, for example, integration over regions in two dimensions.
Short answer questions and problems can be found at the end of each chapter with answers to
the questions and even numbered problems provided in the appendix. Most of the short answer
questions also have video links to solutions, which should open if you click on [video] in the pdf version of these notes, provided you have a browser and an internet connection. A brief wrap-up is given in Chapter 5. A small number of other videos are provided for examples. These videos are not meant to substitute for reading the notes; it is recommended that students attempt to solve the problems and watch the videos afterwards if needed.
Chapter 1
Foundations
1.1 Embracing uncertainty
We survive and thrive in an uncertain world. What are some uses of probability in everyday life?
In engineering? Below is an incomplete list:
Call centers and other staffing problems: Experts with different backgrounds are needed
to staff telephone call centers for major financial investment companies, travel reservation
services, and consumer product support. Management must decide the number of staff and the
mix of expertise so as to meet availability and waiting time targets. A similar problem is faced
by large consulting service companies, hiring consultants that can be grouped into multiple
overlapping teams for different projects. The basic problem is to weigh future uncertain
demand against staffing costs.
Electronic circuits: Scaling down the power and energy of electronic circuits reduces the
reliability and predictability of many individual elements, but the circuits must nevertheless
be engineered so the overall circuit is reliable.
Wireless communication: Wireless links are subject to fading, interference from other
transmitters, Doppler spread due to mobility, and multipath propagation. The demand, such
as the number of simultaneous users of a particular access point or base station, is also time
varying and not fully known in advance. These and other effects can vary greatly with time
and location, but yet the wireless system must be engineered to meet acceptable call quality
and access probabilities.
Medical diagnosis and treatment: Physicians and pharmacologists must estimate the
most suitable treatments for patients in the face of uncertainty about the exact condition of
the patient or the effectiveness of the various treatment options.
Spread of infectious diseases: Centers for disease control need to decide whether to institute massive vaccination or other preventative measures in the face of globally threatening,
possibly mutating diseases in humans and animals.
Information system reliability and security: System designers must weigh the costs
and benefits of measures for reliability and security, such as levels of backups and firewalls,
in the face of uncertainty about threats from equipment failures or malicious attackers.
Evaluation of financial instruments, portfolio management: Investors and portfolio
managers form portfolios and devise and evaluate financial instruments, such as mortgage
backed securities and derivatives, to assess risk and payoff in an uncertain financial environment.
Financial investment strategies, venture capital: Individuals raising money for, or
investing in, startup activities must assess potential payoffs against costs, in the face of
uncertainties about a particular new technology, competitors, and prospective customers.
Modeling complex systems: Models incorporating probability theory have been developed
and are continuously being improved for understanding the brain, gene pools within populations, weather and climate forecasts, microelectronic devices, and imaging systems such as
computer aided tomography (CAT) scan and radar. In such applications, there are far too
many interacting variables to model in detail, so probabilistic models of aggregate behavior
are useful.
Modeling social science: Various groups, from politicians to marketing folks, are interested
in modeling how information spreads through social networks. Much of the modeling in this
area of social science involves models of how people make decisions in the face of uncertainty.
Insurance industry: Actuaries price policies for natural disasters, life insurance, medical
insurance, disability insurance, liability insurance, and other policies, pertaining to persons,
houses, automobiles, oil tankers, aircraft, major concerts, sports stadiums and so on, in the
face of much uncertainty about the future.
Reservation systems: Electronic reservation systems dynamically set prices for hotel rooms,
airline travel, and increasingly for shared resources such as smart cars and electrical power
generation, in the face of uncertainty about future supply and demand.
Reliability of major infrastructures: The electric power grid, including power generating stations, transmission lines, and consumers is a complex system with many redundancies.
Still, breakdowns occur, and guidance for investment comes from modeling the most likely
sequences of events that could cause outage. Similar planning and analysis is done for communication networks, transportation networks, water, and other infrastructure.
Games, such as baseball, gambling, and lotteries: Many games involve complex calculations with probabilities. For example, a professional baseball pitcher's choice of pitch has a complex interplay with the anticipation of the batter. For another example, computer rankings of sports teams based on win-loss records are a subject of interesting modeling.
Commerce, such as online auctions: Sellers post items on online auction sites, setting initial
prices and possibly hidden reserve prices, without complete knowledge of the total demand
for the objects sold.
Online search and advertising: Search engines decide which webpages and which advertisements to display in response to queries, without knowing precisely what the viewer is
seeking.
Personal financial decisions: Individuals make decisions about major purchases, investments, and insurance, in the presence of uncertainty.
Personal lifestyle decisions: Individuals make decisions about diet, exercise, studying for
exams, investing in personal relationships, all in the face of uncertainty about such things as
health, finances, and job opportunities.
Hopefully you are convinced that uncertainty is all around us, in our daily lives and in many
professions. How can probability theory help us survive, and even thrive, in the face of such
uncertainty? Probability theory:
provides a language for people to discuss/communicate/aggregate knowledge about uncertainty. Standard deviation, for example, is widely used when results of opinion polls are described. The language of probability theory lets people break down complex problems, argue about pieces of them with each other, and then aggregate information about subsystems to analyze a whole system.
provides guidance for statistical decision making and estimation or inference. The
theory provides concrete recommendations about what rules to use in making decisions or
inferences, when uncertainty is involved.
provides modeling tools and ways to deal with complexity. For complex situations,
the theory provides approximations and bounds useful for reducing or dealing with complexity
when applying the theory.
What does probability mean? If I roll a fair six-sided die, what is the probability a six shows?
How do I know? What happens if I roll the same die a million times?
What does it mean for a weather forecaster to say the probability of rain tomorrow is 30%?
Here is one system we could use to better understand a forecaster, based on incentives. Suppose
the weather forecaster is paid p (in some monetary units, such as hundreds of dollars) if she declares
that the probability of rain tomorrow is p. If it does rain, no more payment is made, either to or
from the forecaster. If it does not rain, the forecaster keeps the initial payment of p, but she has to pay −ln(1 − p). In view of these payment rules, if the weather forecaster believes, based on all the information she has examined, that the probability of rain tomorrow is q, and if she reports p, her expected total payoff is p + (1 − q) ln(1 − p). For fixed q, this payoff is maximized at p = q.
That is, the forecaster would maximize the total payoff she expects to receive by reporting her
best estimate. Someone receiving her estimate would then have some understanding about what
it meant. (See A.H. Murphy and R.L. Winkler (1984), "Probability Forecasting in Meteorology," Journal of the American Statistical Association, 79 (387), 489-500.)
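To see why the reported value p = q maximizes this expected payoff, differentiate with respect to p:

d/dp [ p + (1 − q) ln(1 − p) ] = 1 − (1 − q)/(1 − p),

which is positive for p < q, zero at p = q, and negative for p > q, so honest reporting is optimal under this payment scheme.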
1.2 Axioms of probability
There are many philosophies about probability. In these notes we do not present or advocate any
one comprehensive philosophy for the meaning and application of probability theory, but we present
the widely used axiomatic framework. The framework is based on certain reasonable mathematical
axioms. When faced with a real life example, we use a mathematical model satisfying the axioms.
Then we can use properties implied by the axioms to do calculations or perform reasoning about
the model, and therefore about the original real life example. A similar approach is often taken in
the study of geometry. Once a set of axioms is accepted, additional properties can be derived from
them.
Before we state the axioms, we discuss a very simple example to introduce some terminology.
Suppose we roll a fair die, with each of the numbers one through six represented on a face of the die,
and observe which of the numbers shows (i.e. comes up on top when the die comes to rest). There
are six possible outcomes to this experiment, and because of the symmetry we declare that each
should have equal probability, namely, 1/6. Following tradition, we let Ω (pronounced omega) denote the sample space, which is the set of possible outcomes. For this example, we could take Ω = {1, 2, 3, 4, 5, 6}. Performing the experiment of rolling a fair die corresponds to selecting an outcome from Ω. An event is a subset of Ω. An event is said to occur or to be true when the
experiment is performed if the outcome is in the event. Each event A has an associated probability,
P (A). For this experiment, {1} is the event that one shows, and we let P ({1}) = 1/6. For brevity,
we write this as P {1} = 1/6. Similarly, we let P {2} = P {3} = P {4} = P {5} = P {6} = 1/6.
But there are other events. For example, we might define B to be the event that the number that
shows is two or smaller. Equivalently, B = {1, 2}. Since B has two outcomes, it's reasonable that
P (B) = 2/6 = 1/3. And E could be the event that the number that shows is even, so E = {2, 4, 6},
and P (E) = 3/6 = 1/2.
Starting with two events, such as B and E just considered, we can describe more events using "and" and "or," where "and" corresponds to the intersection of events, and "or" corresponds to the union of events. This gives rise to the following two events:¹

{the number that shows is two or smaller and even} = BE = {1, 2} ∩ {2, 4, 6} = {2}
{the number that shows is two or smaller or even} = B ∪ E = {1, 2} ∪ {2, 4, 6} = {1, 2, 4, 6}.

The probabilities of these events are P(BE) = 1/6 and P(B ∪ E) = 4/6 = 2/3. Let O be the event that the number that shows is odd, or O = {1, 3, 5}. Then:

{the number that shows is even and odd} = EO = {2, 4, 6} ∩ {1, 3, 5} = ∅.

¹Here BE denotes the intersection of the sets B and E. It is the same as B ∩ E. See Appendix 6.1 for set notation.
The complement of an event is the set of outcomes of Ω that are not in the event. For example, the complement of B, written B^c, is the event {3, 4, 5, 6}. Then, when the die is rolled, either B is true or B^c is true, but not both. That is, B ∪ B^c = Ω and BB^c = ∅. Thus, whatever events we might be interested in initially, we might also want to discuss events that are intersections, unions, or complements of the events given initially. The empty set, ∅, and the whole space of outcomes, Ω, should be events, because they can naturally arise through taking complements and intersections or unions. For example, if A is any event, then A ∪ A^c = Ω and AA^c = ∅. The complement of Ω is ∅, and vice versa.
A bit more terminology is introduced before we describe the axioms precisely. One event is said
to exclude another event if an outcome being in the first event implies the outcome is not in the
second event. For example, the event O excludes the event E. Of course, E excludes O as well.
Two or more events E1, E2, . . . , En are said to be mutually exclusive if at most one of the events can be true. Equivalently, the events E1, E2, . . . , En are mutually exclusive if Ei ∩ Ej = ∅ whenever i ≠ j. That is, the events are disjoint sets. If events E1, E2, . . . , En are mutually exclusive, and if E1 ∪ · · · ∪ En = Ω, then the events are said to form a partition of Ω. For example, if A is an event, then A and A^c form a partition of Ω.
De Morgan's law in the theory of sets is that the complement of the union of two sets is the intersection of the complements. Or vice versa: the complement of the intersection is the union of the complements:

(A ∪ B)^c = A^c B^c        (AB)^c = A^c ∪ B^c.        (1.1)
De Morgan's law is easy to verify using the Karnaugh map for two events, shown in Figure 1.1. The idea of the map is that the events AB, AB^c, A^cB, and A^cB^c form a partition of Ω. The lower half of the figure indicates the events A, B, and A ∪ B, respectively. The only part of Ω that A ∪ B does not cover is A^cB^c, proving the first version of De Morgan's law.

Figure 1.1: Karnaugh map showing how two sets, A and B, partition Ω.
The Axioms of Probability The proof of De Morgan's law outlined above shows that it is a universal truth about sets; it does not need to be assumed. In contrast, the axioms described next are intuitively reasonable properties that we require to be true of a probability model. The set of axioms together defines what we mean by a valid probability model.
An experiment is modeled by a probability space, which is a triplet (Ω, F, P). We will read this triplet as "Omega, Script F, P." The first component, Ω, is a nonempty set. Each element ω of Ω is called an outcome and Ω is called the sample space. The second component, F, is a set of subsets of Ω called events. The final component, P, of the triplet (Ω, F, P), is a probability measure on F,
which assigns a probability, P (A), to each event A. The axioms of probability are of two types:
event axioms, which are about the set of events F, and probability axioms, which are about the
probability measure P.
Event axioms The set of events, F, is required to satisfy the following axioms:

Axiom E.1 Ω is an event (i.e. Ω ∈ F).

Axiom E.2 If A is an event then A^c is an event (i.e. if A ∈ F then A^c ∈ F).

Axiom E.3 If A and B are events then A ∪ B is an event (i.e. if A, B ∈ F then A ∪ B ∈ F). More generally, if A1, A2, . . . is a list of events then the union of all of these events (the set of outcomes in at least one of them), A1 ∪ A2 ∪ · · · , is also an event.
One choice of F that satisfies the above axioms is the set of all subsets of Ω. In fact, in these notes, whenever the sample space Ω is finite or countably infinite (which means the elements of Ω can be arranged in an infinite list, indexed by the positive integers), we let F be the set of all subsets of Ω. When Ω is uncountably infinite, it is sometimes mathematically impossible to define a suitable probability measure on the set of all subsets of Ω in a way consistent with the probability axioms below. To avoid such problems, we simply don't allow all subsets of such an Ω to be events, but the set of events F can be taken to be a rich collection of subsets of Ω that includes any subset of Ω we are likely to encounter in applications.
If the Axioms E.1-E.3 are satisfied, the set of events has other intuitively reasonable properties,
and we list some of them below. We number these properties starting at 4, because the list is a
continuation of the three axioms, but we use the lower case letter e to label them, reserving the
upper case letter E for the axioms.2
Property e.4 The empty set, ∅, is an event (i.e. ∅ ∈ F). That is because Ω is an event by Axiom E.1, so Ω^c is an event by Axiom E.2. But Ω^c = ∅, so ∅ is an event.
Property e.5 If A and B are events, then AB is an event. To see this, start with De Morgan's law: AB = (A^c ∪ B^c)^c. By Axiom E.2, A^c and B^c are events. So by Axiom E.3, A^c ∪ B^c is an event. So by Axiom E.2 a second time, (A^c ∪ B^c)^c is an event, which is just AB.

Property e.6 More generally, if B1, B2, . . . is a list of events then the intersection of all of these events (the set of outcomes in all of them), B1B2 · · · , is also an event. This is true by the same reasoning given for Property e.5, starting with the fact B1B2 · · · = (B1^c ∪ B2^c ∪ · · ·)^c.
²In fact, the choice of which properties to make axioms is not unique. For example, we could have made Property e.4 an axiom instead of Axiom E.1.
Probability axioms The probability measure P is required to satisfy the following axioms:

Axiom P.1 For any event A, P(A) ≥ 0.

Axiom P.2 If A, B ∈ F and if A and B are mutually exclusive, then P(A ∪ B) = P(A) + P(B). More generally, if E1, E2, . . . is an infinite list (i.e. countably infinite collection) of mutually exclusive events, P(E1 ∪ E2 ∪ · · ·) = P(E1) + P(E2) + · · · .

Axiom P.3 P(Ω) = 1.
If Axioms P.1-P.3 are satisfied (and Axioms E.1-E.3 are also satisfied) then the probability measure P has other intuitively reasonable properties. We list them here:

Property p.4 For any event A, P(A^c) = 1 − P(A). That is because A and A^c are mutually exclusive events and Ω = A ∪ A^c. So Axioms P.2 and P.3 yield P(A) + P(A^c) = P(A ∪ A^c) = P(Ω) = 1.

Property p.5 For any event A, P(A) ≤ 1. That is because if A is an event, then P(A) = 1 − P(A^c) ≤ 1 by Property p.4 and by the fact, from Axiom P.1, that P(A^c) ≥ 0.

Property p.6 P(∅) = 0. That is because ∅ and Ω are complements of each other, so by Property p.4 and Axiom P.3, P(∅) = 1 − P(Ω) = 0.

Property p.7 If A ⊂ B then P(A) ≤ P(B). That is because B = A ∪ (A^cB), A and A^cB are mutually exclusive, and P(A^cB) ≥ 0, so P(A) ≤ P(A) + P(A^cB) = P(A ∪ (A^cB)) = P(B).

Property p.8 P(A ∪ B) = P(A) + P(B) − P(AB). That is because, as illustrated in Figure 1.1, A ∪ B can be written as the union of three mutually exclusive sets: A ∪ B = (AB^c) ∪ (A^cB) ∪ (AB). So

P(A ∪ B) = P(AB^c) + P(A^cB) + P(AB)
         = (P(AB^c) + P(AB)) + (P(A^cB) + P(AB)) − P(AB)
         = P(A) + P(B) − P(AB).

Property p.9 P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(AB) − P(AC) − P(BC) + P(ABC). This is a generalization of Property p.8 and can be proved in a similar way.
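Property p.8 is easy to check numerically for the fair-die model of Section 1.2; here is a small Python sketch using the events B and E defined there:

from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}          # fair-die sample space

def prob(event):
    # all outcomes equally likely, so P(A) = |A| / |omega|
    return Fraction(len(event), len(omega))

B = {1, 2}      # the number showing is two or smaller
E = {2, 4, 6}   # the number showing is even
print(prob(B | E))                          # 2/3
print(prob(B) + prob(E) - prob(B & E))      # 2/3, agreeing with Property p.8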
Example 1.2.1 (Toss of a fair coin) Suppose the experiment is to flip a coin to see if it shows
heads or tails. Using H for heads and T for tails, the experiment is modeled by the following choice
of Ω, F, and P:

Ω = {H, T}
F = {{H}, {T}, {H, T}, ∅}
P{H} = P{T} = 1/2,    P(Ω) = P{H, T} = 1,    P(∅) = 0.
Example 1.2.2 A particular experiment is to observe the color of a traffic signal at the time it is
approached by a vehicle. The sample space is Ω = {green, yellow, red} and we let any subset of Ω be an event. What probability measure, P, should we choose? Since there are three colors, we could
declare them to be equally likely, and thus have probability 1/3 each. But here is an intuitively
more reasonable choice. Suppose when we examine the signal closely, we notice that the color of
the signal goes through cycles of duration 75 seconds. In each cycle the signal dwells on green for
30 seconds, then dwells on yellow for 5 seconds, then dwells on red for 40 seconds. Assuming that
the arrival time of the vehicle is random, and not at all connected to the signal (in particular, the
traffic signal is isolated and not synchronized with other signals that the vehicle passes) then it
seems intuitively reasonable to assign probabilities to the colors that are proportional to their dwell times. Hence, we declare that P{green} = 30/75 = 2/5, P{yellow} = 5/75 = 1/15, and P{red} = 40/75 = 8/15.
Note that the three outcomes are not equally likely.
1.3 Calculating the size of various sets

An important class of probability spaces are those such that the set of outcomes, Ω, is finite, and all outcomes have equal probability. Therefore, the probability for any event A is P(A) = |A|/|Ω|, where |A| is the number of elements in A and |Ω| is the number of elements in Ω. This notation |A| is
the same as what we use for absolute value, but the argument is a set, not a value. The number of
elements in a set is called the cardinality of the set. Thus, it is important to be able to count the
number of elements in various sets. Such counting problems are the topic of this section.
Principle of counting Often we are faced with finding the number of vectors or sets satisfying certain conditions. For example, we might want to find the number of pairs of the form (X, N) such that X ∈ {H, T} and N ∈ {1, 2, 3, 4, 5, 6}. These pairs correspond to outcomes of an experiment in which a coin is flipped and a die is rolled, with X representing the side showing on the coin (H for heads or T for tails) and N being the number showing on the die. For example, (H, 3), or H3 for short, corresponds to the coin showing heads and the die showing three. The set of all possible outcomes can be listed as:

H1 H2 H3 H4 H5 H6
T1 T2 T3 T4 T5 T6.

Obviously there are twelve possible outcomes. There are two ways to choose X, and for every choice of X, there are six choices of N. So the number of possible outcomes is 2 × 6 = 12.
This example is a special case of the principle of counting: If there are m ways to select one
variable and n ways to select another variable, and if these two selections can be made independently, then there is a total of mn ways to make the pair of selections. The principle extends to
more than two variables, as illustrated by the next example.
Example 1.3.1 Find the number of possible 8-bit bytes. An example of such a byte is 00010100.
Solution: There are two ways to select the first bit, and for each of those, there are two ways
to select the second bit, and so on. So the number of possible bytes is 2^8. (Each byte represents a
number in the range 0 to 255 in base two representation.)
The idea behind the principle of counting is more general than the principle of counting itself.
Counting the number of ways to assign values to variables can often be done by the same approach,
even if the choices are not independent, as long as the choice of one variable does not affect the
number of choices possible for the second variable. This is illustrated in the following example.
Example 1.3.2 Find the number of four letter sequences that can be obtained by ordering the
letters A, B, C, D, without repetition. For example, ADCB is one possibility.
Solution: There are four ways to select the first letter of the sequence, and for each of those,
there are three ways to select the second letter, and for each of those, two ways to select the third
letter, and for each of those, one way to select the final letter. So the number of possibilities is
4 3 2 1 = 4! (read four factorial). Here the possibilities for the second letter are slightly limited
by the first letter, because the second letter must be different from the first. Thus, the choices of
letter for each position are not independent. If the letter A is used in the first position, for example,
it cant be used in the second position. The problem is still quite simple, however, because the
choice of letter for the first position does not affect the number of choices for the second position.
And the first two choices dont affect the number of choices for the third position.
In general, the number of ways to order n distinct objects is n! = n(n − 1) · · · 2 · 1. An ordering
of n distinct objects is called a permutation, so the number of permutations of n distinct objects is
n!. The next example indicates how to deal with cases in which the objects are not distinct.
Example 1.3.3 How many orderings of the letters AAB are there, if we don't distinguish between
the two As?
Solution: There are three orderings: AAB, ABA, BAA. But if we put labels on the two As,
writing them as A1 and A2, then the three letters become distinct, and so there are 3!, equal to six, possible orderings:

A1A2B  A2A1B
A1BA2  A2BA1
BA1A2  BA2A1.
These six orderings are written in pairs that would be identical to each other if the labels on the As
were erased. For example, A1 A2 B and A2 A1 B would both become AAB if the labels were erased.
For each ordering without labels, such as AAB, there are two orderings with labels, because there
are two ways to order the labels on the As.
Example 1.3.4 [video] How many orderings of the letters ILLINI are there, if we don't distinguish the Is from each other and we don't distinguish the Ls from each other?³
Solution: Since there are six letters, if the Is were labeled as I1 , I2 , I3 and the Ls were labeled
as L1 and L2 , then the six letters would be distinct and there would be 6!=720 orderings, including
I1I3L2NI2L1. Each ordering without labels, such as IILNIL, corresponds to 3! · 2 = 12 orderings with labels, because there are 3! ways to label the three Is, and for each of those, there are two
ways to label the Ls. Hence, there are 720/12 = 60 ways to form a sequence of six letters, using
three identical Is, two identical Ls, and one N.
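This count is small enough to confirm by brute force; a short Python sketch that generates all labeled orderings and collapses duplicates:

from itertools import permutations

# permutations("ILLINI") yields all 6! = 720 labeled orderings; putting them
# in a set collapses orderings that look the same once the repeated Is and Ls
# are no longer distinguished.
distinct_orderings = set(permutations("ILLINI"))
print(len(distinct_orderings))   # 60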
Principle of Over Counting The above two examples illustrate the principle of over counting,
which can be stated as follows: For an integer K ≥ 1, if each element of a set is counted K times,
then the number of elements in the set is the total count divided by K.
Example 1.3.5 Suppose nine basketball players are labeled by the letters A, B, C, D, E, F, G, H,
and I. A lineup is a subset consisting of five players. For example, {A, B, D, E, H} is a possible lineup. The order of the letters in the lineup is not relevant. That is, {A, B, D, E, H} and
{A, B, E, D, H} are considered to be the same. So how many distinct lineups are there?
Solution: If the order did matter, the problem would be easier. There would be 9 ways to select
the first player, and for each of those, 8 ways to select the second player, 7 ways to select the third,
6 ways to select the fourth, and 5 ways to select the fifth. Thus, there are 9 · 8 · 7 · 6 · 5 ways to select lineups in a given order. Since there are 5! ways to order a given lineup, each lineup would appear 5! times in a list of all possible lineups with all possible orders. So, by the principle of over counting, the number of distinct lineups, with the order of a lineup not mattering, is 9 · 8 · 7 · 6 · 5/5! = 126.
Example 1.3.6 How many binary sequences of length 9 have (exactly) 5 ones? One such sequence
is 110110010.
Solution: If the ones were distinct and the zeros were distinct, there would be 9! choices. But
there are 5! ways to order the ones, and for each of those, 4! ways to order the zeros, so there are
5! · 4! orderings with labels for each ordering without labels. Thus, the number of binary sequences of length 9 having five ones is 9!/(5! 4!). This is the same as the solution to Example 1.3.5. In fact, there
is a one-to-one correspondence between lineups and binary sequences of length 9 with 5 ones. The
positions of the 1s indicate which players are in the lineup. The sequence 110110010 corresponds
to the lineup {A, B, D, E, H}.
³ILLINI, pronounced "ill LIE nigh," is the nickname for the students and others at the University of Illinois.
In general, the number of subsets of size k of a set of n distinct objects can be determined as follows. There are n ways to select the first object, n − 1 ways to select the second object, and so on, until there are n − k + 1 ways to select the kth object. By the principle of counting, that gives a total count of n(n − 1) · · · (n − k + 1), but this chooses k distinct objects in a particular order. By definition of the word "set," the order of the elements within a set does not matter. Each set of k objects is counted k! ways by this method, so by the principle of over counting, the number of subsets of size k of a set of n distinct objects (with the order not mattering) is given by n(n − 1) · · · (n − k + 1)/k!. This is equal to n!/((n − k)! k!), is called "n choose k," and is written as the binomial coefficient C(n, k) = n!/((n − k)! k!).

It is useful to keep in mind that C(n, k) = C(n, n − k) and that

C(n, k) = n(n − 1) · · · (n − k + 1)/k!,

where the numerator has k factors. For example, C(8, 3) and C(8, 5) are both equal to (8 · 7 · 6)/(3 · 2 · 1) = 56, which is also equal to (8 · 7 · 6 · 5 · 4)/(5 · 4 · 3 · 2 · 1).
Note that (a + b)^2 = a^2 + 2ab + b^2 and (a + b)^3 = a^3 + 3a^2 b + 3ab^2 + b^3. In general,

(a + b)^n = Σ_{k=0}^{n} C(n, k) a^k b^(n−k).        (1.2)

Equation (1.2) follows by writing (a + b)^n = (a + b)(a + b) · · · (a + b), a product of n factors, and then noticing that the coefficient of a^k b^(n−k) in (a + b)^n is the number of ways to select k out of the n factors from which to select a.
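Equation (1.2) can be spot-checked numerically; here is a small Python sketch for one choice of a, b, and n (math.comb(n, k) computes C(n, k)):

from math import comb

a, b, n = 2, 3, 5
lhs = (a + b) ** n
rhs = sum(comb(n, k) * a**k * b**(n - k) for k in range(n + 1))
print(lhs, rhs)   # both print 3125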
1.4 Probability experiments with equally likely outcomes
This section presents more examples of counting, and related probability experiments such that all
outcomes are equally likely.
Example 1.4.1 Suppose there are nine socks loose in a drawer in a dark room which are identical
except six are orange and three are blue. Someone selects two at random, all possibilities being
equally likely. What is the probability the two socks are the same color? Also, if instead, three
socks were selected, what is the probability that at least two of them are the same color?
Solution: For the first question, we could imagine the socks are numbered one through nine, with socks numbered one through six being orange and socks numbered seven through nine being blue. Let Ω be the set of all subsets of {1, 2, 3, 4, 5, 6, 7, 8, 9} of size two. The number of elements of Ω is given by |Ω| = C(9, 2) = (9 · 8)/2 = 36. The number of ways two orange socks could be chosen is C(6, 2) = 15, and the number of ways two blue socks could be chosen is C(3, 2) = 3. Thus, the probability two socks chosen at random have the same color is (15 + 3)/36 = 1/2.

The second question is trivial. Whatever set of three socks is selected, at least two of them are the same color. So the answer to the second question is one.
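The first answer can be confirmed by enumerating all 36 equally likely pairs of socks; a small Python sketch, using the same numbering of socks as in the solution:

from itertools import combinations
from fractions import Fraction

color = {i: ("orange" if i <= 6 else "blue") for i in range(1, 10)}   # socks 1-6 orange, 7-9 blue
pairs = list(combinations(range(1, 10), 2))                           # all 36 outcomes
same_color = [p for p in pairs if color[p[0]] == color[p[1]]]
print(Fraction(len(same_color), len(pairs)))                          # 1/2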
Example 1.4.2 Let an experiment consist of rolling two fair dice, and define the following three
events about the numbers showing: A = "the sum is even," B = "the sum is a multiple of three," and C = "the numbers are the same." Display the outcomes in a three-event Karnaugh map, and find
P (ABC).
Solution. We write ij for the outcome such that i appears on the first die and j appears on the
second die. The outcomes are displayed in a Karnaugh map in Figure 1.2. It is a little tedious to fill
in the map, but it is pretty simple to check correctness of the map. For correctness it suffices that
Figure 1.2: Karnaugh map for the roll of two fair dice and events A, B, and C.
for each of the events A, A^c, B, B^c, C, C^c, the correct outcomes appear in the corresponding lines (rows or columns). For example, all 18 outcomes in A are in the bottom row of the map. All 18 outcomes of A^c are in the upper row of the map. Similarly we can check for B, B^c, C, and C^c. There are 36 elements in Ω, and ABC = {33, 66}, which has two outcomes. So P(ABC) = 2/36 = 1/18.
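A direct enumeration reproduces this answer; a short Python sketch:

from fractions import Fraction

outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
A = {o for o in outcomes if sum(o) % 2 == 0}    # the sum is even
B = {o for o in outcomes if sum(o) % 3 == 0}    # the sum is a multiple of three
C = {o for o in outcomes if o[0] == o[1]}       # the numbers are the same
print(Fraction(len(A & B & C), len(outcomes)))  # 1/18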
Example 1.4.3 [video] [video] Suppose a deck of playing cards has 52 cards, represented by the
set C :
C = {1C, 2C, . . . , 13C, 1D, 2D, . . . , 13D, 1H, 2H, . . . 13H, 1S, 2S, . . . , 13S}.
Here C, D, H, or S stands for the suit of a card: clubs, diamonds, hearts, or spades.
Suppose five cards are drawn at random from the deck, with all possibilities being equally likely.
In the terminology of the game of poker, a FULL HOUSE is the event that three of the cards all
have the same number, and the other two cards both have some other number. For example, the
outcome {1D, 1H, 1S, 2H, 2S} is an element of FULL HOUSE, because three of the cards have the
number 1 and the other two cards both have the number 2. STRAIGHT is the event that the
numbers on the five cards can be arranged to form five consecutive integers, or that the numbers
can be arranged to get the sequence 10,11,12,13,1 (because, by tradition, the 1 card, or ace, can
act as either the highest or lowest number). For example, the outcome {3D, 4H, 5H, 6S, 7D} is an
element of STRAIGHT. Find P (FULL HOUSE) and P (STRAIGHT).
Solution: The sample space is Ω = {A : A ⊂ C and |A| = 5}, and the number of possible outcomes is |Ω| = C(52, 5). To select an outcome in FULL HOUSE, there are 13 ways to select a number for the three cards with the same number, and then 12 ways to select a number for the other two cards. Once the two numbers are selected, there are C(4, 3) ways to select 3 of the 4 suits for the three cards with the same number, and C(4, 2) ways to select 2 of the 4 suits for the two cards with the other number. Thus,

P(FULL HOUSE) = 13 · 12 · C(4, 3) · C(4, 2) / C(52, 5) = 6/4165 ≈ 0.0014.

To select an outcome in STRAIGHT, there are ten choices for the set of five integers on the five cards that can correspond to STRAIGHT, and for each of those, there are 4^5 choices of what suit is assigned to the cards with each of the five consecutive integers. Thus,

P(STRAIGHT) = 10 · 4^5 / C(52, 5) ≈ 0.0039.
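Both probabilities can also be evaluated directly; a short Python sketch using math.comb:

from math import comb

hands = comb(52, 5)                     # total number of five-card hands
p_full_house = 13 * 12 * comb(4, 3) * comb(4, 2) / hands
p_straight = 10 * 4**5 / hands          # STRAIGHT as defined in Example 1.4.3
print(round(p_full_house, 4))           # 0.0014
print(round(p_straight, 4))             # 0.0039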
1.5 Sample spaces with infinite cardinality

It is appropriate for many experiments to use a sample space with infinite cardinality. For
example, an experiment could be to randomly select a number from the interval [0, 1], and there
are infinitely many numbers in the interval [0, 1]. This section discusses infinite sets in general, and
includes examples of sample spaces with infinitely many elements. The section highlights some
implications of Axiom P.2.
The previous two sections discuss how to find the cardinality of some finite sets. What about
the cardinality of infinite sets? Do all infinite sets have the same number of elements? In a
strong sense, no. The smallest sort of infinite set is called a countably infinite set, or a set with
countably infinite cardinality, which means that all the elements of the set can be placed in a list.
Some examples of countably infinite sets are the set of nonnegative integers Z+ = {0, 1, 2, . . .},
the set of all integers Z = {0, 1, −1, 2, −2, 3, −3, . . .}, and the set of positive rational numbers Q+ = {i/j : i ≥ 1, j ≥ 1, i, j integers}. Figure 1.3 shows a two dimensional array that contains every
positive rational number at least once. The zig-zag path in Figure 1.3 shows how all the elements
can be placed on an infinite list.
Figure 1.3: A two dimensional array containing every positive rational number, traversed by a zig-zag path.
can be different finite lengths. This is called a variable length encoding of the set. It is enough to
represent the index of the element in the list. For example, if ω is the twenty-sixth element of the list, then the representation of ω would be the binary expansion of 26, or 11010. Proposition 1.5.1
means that it is impossible to index the set of real numbers by variable length binary strings of
finite length.
Example 1.5.2 Consider an experiment in which a player is asked to choose a positive integer.
To model this, let the space of outcomes Ω be the set of positive integers. Thus, Ω is countably infinite. Mathematically, the case that Ω is countably infinite is similar to the case that Ω is finite. In particular, we can continue to let F, the set of events, be the set of all subsets of Ω. To name a specific choice of a probability measure P, we use the following rather arbitrary choice. Let

(p1, p2, p3, . . .) = (1 − 1/2, 1/2 − 1/3, 1/3 − 1/4, . . .) = (1/2, 1/6, 1/12, 1/20, . . .),

or, equivalently, pi = 1/(i(i + 1)) for i ≥ 1. Note that p1 + · · · + pi = 1 − 1/(i + 1) for i ≥ 1, and therefore p1 + p2 + · · · = 1. Assume that the player chooses integer i with probability pi. Since the outcomes in any event A can be listed in a sequence, Axiom P.2 requires that for any event A, P(A) = Σ_{i∈A} pi. For example, P{1, 4, 5} = 1/2 + 1/20 + 1/30 ≈ 0.5833. If A is the event that the number chosen is a multiple of five, then P(A) can be calculated numerically:

P(A) = 1/((5)(6)) + 1/((10)(11)) + 1/((15)(16)) + · · · ≈ 0.05763.        (1.3)
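The value in (1.3) can be reproduced by summing enough terms of the series; a small Python sketch:

# P{the number chosen is 5k} = 1/((5k)(5k+1)); sum over k = 1, 2, ...
total = sum(1.0 / ((5 * k) * (5 * k + 1)) for k in range(1, 100001))
print(round(total, 4))   # approximately 0.0576, agreeing with (1.3)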
Taking a = b, we see that singleton sets {a} are events, and these sets have probability zero. In
order for the event axioms to be true, open intervals (a, b) must also be events and P((a, b)) = b − a. Any open subset of Ω can be expressed as the union of a finite or countably infinite set of open intervals, and any closed set is the complement of an open set, so F should contain all open and all closed subsets of Ω. Thus, F must contain any set that is the intersection of countably
many open sets, and so on.
Example 1.5.3 describes the probability space (Ω, F, P) for the experiment of selecting a point from the unit interval, [0, 1], such that the probability the outcome is in an interval [a, b] is b − a. This is true even if a = b, which means that the probability of any particular singleton subset {x} of [0, 1] is equal to zero. But the entire interval [0, 1] is the union of all such sets {x}, and those sets are mutually exclusive. Why then, doesn't Axiom P.2 imply that P(Ω) = Σ_x P{x} = 0, in contradiction to Axiom P.3? The answer is that Axiom P.2 does not apply to this situation because it only holds for a finite or countably infinite collection of events, whereas the set [0, 1] is uncountably infinite.
The next two examples make use of the formula for the sum of a geometric series, so we derive
the formula here. A geometric series with first term one has the form 1, x, x^2, x^3, . . . . Equivalently, the kth term is x^k for k ≥ 0. Observe that for any value of x:

(1 − x)(1 + x + x^2 + · · · + x^n) = 1 − x^(n+1),

because when the product on the left hand side is expanded, the terms of the form x^k for 1 ≤ k ≤ n cancel out. Therefore, for x ≠ 1, the following equivalent expressions hold for the partial sums of a geometric series:

1 + x + x^2 + · · · + x^n = (1 − x^(n+1))/(1 − x)    or    Σ_{k=0}^{n} x^k = (1 − x^(n+1))/(1 − x).        (1.4)

The sum of an infinite series is equal to the limit of the nth partial sum as n → ∞. If |x| < 1 then lim_{n→∞} x^(n+1) = 0. So letting n → ∞ in (1.4) yields that for |x| < 1:

1 + x + x^2 + · · · = 1/(1 − x)    or    Σ_{k=0}^{∞} x^k = 1/(1 − x).        (1.5)
The formula (1.4) for partial sums and the formula (1.5) for infinite sums of a geometric series are
used frequently in these notes.
Example 1.5.4 (Repeated binary trials) Suppose we would like to represent an infinite sequence
of binary observations, where each observation is a zero or one with equal probability. For example,
the experiment could consist of repeatedly flipping a fair coin, and recording a one each time it
shows heads and a zero each time it shows tails. Then an outcome would be an infinite sequence,
ω = (ω1, ω2, . . .), such that for each i ≥ 1, ωi ∈ {0, 1}. Let Ω be the set of all such ω's. The set of events can be taken to be large enough so that any set that can be defined in terms of only finitely many of the observations is an event. In particular, for any binary sequence (b1, . . . , bk) of some finite length k, the set {ω ∈ Ω : ωi = bi for 1 ≤ i ≤ k} should be in F, and the probability of such a set is taken to be 2^(−k).
There are also events that don't depend on a fixed, finite number of observations. For example, suppose there are two players who take turns performing the coin flips, with the first one to get heads being the winner. Let F be the event that the player going first wins. Show that F is an event and then
find its probability.
Solution: For k ≥ 1, let Ek be the event that the first one occurs on the kth observation. That is, Ek is the event that the first k observations are given by the binary sequence (b1, . . . , bk) = (0, 0, . . . , 0, 1), and its probability is given by P{Ek} = 2^(−k).

Observe that F = E1 ∪ E3 ∪ E5 ∪ . . . , so F is an event by Axiom E.3. Also, the events E1, E3, . . . are mutually exclusive, so by the full version of Axiom P.2 and (1.5):

P(F) = P(E1) + P(E3) + · · · = (1/2)[1 + 1/4 + (1/4)^2 + · · ·] = (1/2) · 1/(1 − (1/4)) = 2/3.
The player who goes first has twice the chance of winning as the other player.
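A quick simulation is a useful sanity check here; the estimate produced by the following Python sketch should be close to 2/3:

import random

trials, first_player_wins = 100000, 0
for _ in range(trials):
    flip = 1
    while random.random() < 0.5:   # tails: keep flipping
        flip += 1
    if flip % 2 == 1:              # first heads occurred on an odd-numbered flip
        first_player_wins += 1
print(first_player_wins / trials)  # approximately 2/3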
Example 1.5.5 (Selection of a point in a square) Take Ω to be the square region in the plane,

Ω = {(x, y) : 0 ≤ x < 1, 0 ≤ y < 1}.

It can be shown that there is a probability space (Ω, F, P) such that any rectangular region that is a subset of Ω of the form R = {(u, v) : a ≤ u < b, c ≤ v < d} is an event, and

P(R) = area of R = (b − a)(d − c).

Let T be the triangular region T = {(x, y) : x ≥ 0, y ≥ 0, x + y < 1}. Since T is not rectangular, it is not immediately clear whether T is an event. Show that T is an event, and find P(T), using the axioms.

Solution: Consider the infinite sequence of square regions shown in Figure 1.4. Square 1 has area 1/4, the next two squares each have area (1/4)^2, the next four squares each have area (1/4)^3, and so on.

Figure 1.4: An infinite sequence of squares whose union is the triangular region T.
The set of squares is countably infinite, and their union is T , so T is an event by Axiom E.3. Since
the square regions are mutually exclusive, Axiom P.2 implies that P (T ) is equal to the sum of the
areas of the squares:
P(T) = 1/4 + 2(1/4)^2 + 2^2 (1/4)^3 + 2^3 (1/4)^4 + · · ·
     = (1/4)(1 + 2^(−1) + 2^(−2) + 2^(−3) + · · ·)
     = (1/4) · 2 = 1/2.
Of course, it is reasonable that P (T ) = 1/2, because T takes up half of the area of the square. In
fact, any reasonable subset of the square is an event, and the probability of any event A is the area
of A.
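The geometric series above can also be summed numerically as a sanity check; a short Python sketch:

# Stage k (k = 0, 1, 2, ...) contributes 2^k squares, each of area (1/4)^(k+1).
total_area = sum(2**k * (1 / 4) ** (k + 1) for k in range(60))
print(total_area)   # 0.5, up to floating-point rounding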
1.6 Short Answer Questions
Section 1.2[video]
1. What is P(AB) if P(A) = 0.5, P(B) = 0.46, and P(A ∪ B) = 2P(AB)?
2. What is P(ABC) if P(A) = P(B) = P(C) = 0.5, P(A ∪ B) = 0.55, P(A ∪ C) = 0.7, P(BC) = 0.3, and P(ABC) = 2P(ABC^c)?
Section 1.3[video]
1. Find the number of 8 letter sequences made by ordering the letters from BASEBALL
2. Find the number of 6 letter sequences made by deleting two letters from BASEBALL and
ordering the remaining six.
3. How many distinct color combinations can be displayed by a set of three marbles drawn from
a bag containing six marbles: three orange, two blue, and one white? Suppose the order of
the three marbles doesn't matter.
4. How many essentially different ways are there to string eight distinct beads on a necklace?
(If one way is the mirror image of another or the same as another up to rotation, they are
considered to be the same.)
Section 1.4[video]
1. If the letters of ILLINI are randomly ordered, all orderings being equally likely, what is the
probability the three Is are consecutive?
2. If the letters of ILLINI are randomly ordered, all orderings being equally likely, what is the
probability no position has the same letter as in the original order?
Section 1.5[video]
1. Consider the ordering of rational numbers indicated in Figure 1.3. What is the 100th rational
number on the list?
1.7 Problems
(a) Define a sample space for this experiment. Suppose that the order that the people
draw the socks doesn't matter; all that is recorded is which two socks each person selects.
(b) Determine |Ω|, the cardinality of Ω.
(c) Determine the number of outcomes in M.
(d) Find P (M ).
(e) Find a short way to calculate P(M) that doesn't require finding |M| and |Ω|. (Hint: Write
P (M ) as one over an integer and factor the integer.)
1.10. [Two more poker hands I]
Suppose five cards are drawn from a standard 52 card deck of playing cards, as described in
Example 1.4.3, with all possibilities being equally likely.
(a) TWO PAIR is the event that two cards both have one number, two other cards both
have some other number, and the fifth card has a number different from the other two
numbers. Find P (TWO PAIR).
(b) THREE OF A KIND is the event that three of the cards all have the same number, and
the other cards have numbers different from each other and different from the three with
the same number. Find P (THREE OF A KIND).
(c) FOUR OF A KIND is the event that four of the five cards have the same number. Find
P (FOUR OF A KIND).
1.11. [Two more poker hands II]
Suppose five cards are drawn from a standard 52 card deck of playing cards, as described in
Example 1.4.3, with all possibilities being equally likely.
(a) FLUSH is the event that all five cards have the same suit. Find P (FLUSH).
(b) FOUR OF A KIND is the event that four of the five cards have the same number. Find
P (FOUR OF A KIND).
1.12. [Some identities satisfied by binomial coefficients]
[video] Using only the fact that C(n, k) is the number of ways to select a set of k objects from a set of n objects, explain in words why each of the following identities is true. The idea is to identify how to count something two different ways. For example, to explain why 2^n = Σ_{k=0}^{n} C(n, k), you could note that 2^n is the total number of subsets of a set of n objects, because there are two choices for each object (i.e. include or not include in the set) and the n choices are independent. The right hand side is also the number of subsets of a set of n objects, with the kth term being the number of such subsets of cardinality k.
(a) C(n, k) = C(n, n − k) for 0 ≤ k ≤ n.
(b) C(n, k) = C(n − 1, k − 1) + C(n − 1, k) for 1 ≤ k ≤ n − 1.
(c) C(2n, n) = Σ_{k=0}^{n} C(n, k)^2. (Hint: C(n, k)^2 = C(n, k) C(n, n − k). Consider a set of 2n distinct objects, half orange and half blue.)
(d) C(n, k) = Σ_{l=k}^{n} C(l − 1, k − 1) for 1 ≤ k ≤ n. (Hint: If a set of k objects is selected from among n objects numbered one through n, what are the possible values of the highest numbered object selected? For example, if n = 5 and k = 3, the equality becomes C(5, 3) = C(2, 2) + C(3, 2) + C(4, 2), or 10 = 1 + 3 + 6. The ten subsets can be divided into three groups: {1, 2, 3}; {1, 2, 4}, {1, 3, 4}, {2, 3, 4}; and {1, 2, 5}, {1, 3, 5}, {1, 4, 5}, {2, 3, 5}, {2, 4, 5}, {3, 4, 5}, such that the set in the first group has largest element 3, the sets in the second group have largest element 4, and the sets in the third group have largest element 5.)
Chapter 2
Discrete-type random variables
Chapter 1 focuses largely on events and their probabilities. An event is closely related to a binary
variable; if a probability experiment is performed, then a particular event either occurs or does not
occur. A natural and useful generalization allows for more than two values:
Definition 2.1.1 A random variable is a real-valued function on Ω.

Thus, if X is a random variable for a probability space (Ω, F, P), if the probability experiment is performed, which means a value ω is selected from Ω, then the value of the random variable is X(ω). The value X(ω) is called the realized value of X for outcome ω. A random variable can have many possible values, and for a given subset of the real numbers, there is some probability that the value of the random variable is in the set. If A ⊂ R, then {ω : X(ω) ∈ A} is the event that the value of X is in A. For brevity, we usually write such an event as {X ∈ A} and its probability as P{X ∈ A}.
A random variable is said to be discrete-type if there is a finite set u1, . . . , un or a countably infinite set u1, u2, . . . such that

P{X ∈ {u1, u2, . . .}} = 1.        (2.1)

The probability mass function (pmf) for a discrete-type random variable X, pX, is defined by pX(u) = P{X = u}. Note that (2.1) can be written as:

Σ_i pX(ui) = 1.
Example 2.1.3 Let S be the sum of the numbers showing on a pair of fair dice when they are
rolled. Find the pmf of S.
Solution: The underlying sample space is Ω = {(i, j) : 1 ≤ i ≤ 6, 1 ≤ j ≤ 6}, and it has 36 possible outcomes, each having probability 1/36. The smallest possible value of S is 2, and {S = 2} = {(1, 1)}. That is, there is only one outcome resulting in S = 2, so pS(2) = 1/36. Similarly, {S = 3} = {(1, 2), (2, 1)}, so pS(3) = 2/36. And {S = 4} = {(1, 3), (2, 2), (3, 1)}, so pS(4) = 3/36, and so forth. The pmf of S is shown in Fig. 2.1.

Figure 2.1: The pmf of the sum of numbers showing for rolls of two fair dice.
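The full pmf plotted in Figure 2.1 can be generated by enumeration; a short Python sketch:

from collections import Counter

counts = Counter(i + j for i in range(1, 7) for j in range(1, 7))
pmf_S = {s: f"{c}/36" for s, c in sorted(counts.items())}
print(pmf_S)   # {2: '1/36', 3: '2/36', ..., 7: '6/36', ..., 12: '1/36'}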
Example 2.1.4 Suppose two fair dice are rolled and that Y represents the maximum of the two
numbers showing. The same set of outcomes, Ω = {(i, j) : 1 ≤ i ≤ 6, 1 ≤ j ≤ 6}, can be used as in Example 2.1.3. For example, if a 3 shows on the first die and a 5 shows on the second, then Y = 5. That is, Y((3, 5)) = 5. In general, Y((i, j)) = max{i, j} for (i, j) ∈ Ω. Determine the pmf of Y.
Solution: The possible values of Y are 1, 2, 3, 4, 5, and 6. That is, the support of pY is {1, 2, 3, 4, 5, 6}.
There is only one outcome in Ω such that Y = 1. Specifically, {Y = 1} = {(1, 1)}. Similarly,
{Y = 2} = {(2, 2), (1, 2), (2, 1)},
{Y = 3} = {(3, 3), (1, 3), (2, 3), (3, 1), (3, 2)}, and so forth. The pmf of Y is shown in Fig. 2.2.
Figure 2.2: The pmf for the maximum of numbers showing for rolls of two fair dice.
2.2 The mean and variance of a random variable
The mean of a random variable is a weighted average of the possible values of the random variable,
such that the weights are given by the pmf:
Definition 2.2.1 The mean (also called expectation) of a random variable X with pmf pX is denoted by E[X] and is defined by $E[X] = \sum_i u_i p_X(u_i)$, where u1, u2, . . . is the list of possible values of X.
Example 2.2.2 Let X be the number showing for a roll of a fair die. Find E[X].
Solution: $E[X] = 1\cdot\tfrac{1}{6} + 2\cdot\tfrac{1}{6} + 3\cdot\tfrac{1}{6} + 4\cdot\tfrac{1}{6} + 5\cdot\tfrac{1}{6} + 6\cdot\tfrac{1}{6} = \frac{1+2+3+4+5+6}{6} = 3.5.$
Example 2.2.3 Let Y be the number of distinct numbers showing when three fair dice are rolled.
Find the pmf and mean of Y.
Solution The underlying sample space is Ω = {i1 i2 i3 : 1 ≤ i1 ≤ 6, 1 ≤ i2 ≤ 6, 1 ≤ i3 ≤ 6}. There are six ways to choose i1, six ways to choose i2, and six ways to choose i3, so that |Ω| = 6 · 6 · 6 = 216. What are the possible values of Y ? Clearly Y takes values 1, 2, or 3. The outcomes in Ω that give rise to each possible value of Y are:
{Y = 1} = {111, 222, 333, 444, 555, 666}
{Y = 2} = {112, 121, 211, 113, 131, 311, 114, 141, 411, . . . , 665, 656, 566}
{Y = 3} = {i1 i2 i3 : i1 , i2 , i3 are distinct}.
Obviously, |{Y = 1}| = 6. The outcomes of {Y = 2} are listed above in groups of three, such as
112,121,211. There are thirty such groups of three, because there are six ways to choose which
number appears twice, and then five ways to choose which number appears once, in the outcomes
in a group. Therefore, |{Y = 2}| = 6 · 5 · 3 = 90. There are six choices for the first number of an outcome in {Y = 3}, then five choices for the second, and then four choices for the third. So |{Y = 3}| = 6 · 5 · 4 = 120. To double check our work, we note that 6 + 90 + 120 = 216, as
expected. So, pY(1) = 6/216 = 1/36, pY(2) = 90/216 = 15/36, and pY(3) = 120/216 = 20/36. The mean of Y is NOT simply (1 + 2 + 3)/3 because the three possible values are not equally likely. The correct value is
$E[Y] = 1\cdot\tfrac{1}{36} + 2\cdot\tfrac{15}{36} + 3\cdot\tfrac{20}{36} = \tfrac{91}{36} \approx 2.527.$
The [video] gives a physical interpretation of the mean of a pmf, and describes how to build
and calibrate a nine volt calculator for calculating means.
Example 2.2.4 Suppose X is a random variable taking values in {-2, -1, 0, 1, 2, 3, 4, 5}, each with probability 1/8. Let Y = X². Find E[Y].
Solution. The pmf of X is pX(u) = 1/8 for -2 ≤ u ≤ 5 and pX(u) = 0 otherwise.
The definition of E[Y ] involves the pmf of Y, so let us find the pmf of Y. We first think about the
possible values of Y (which are 0, 1, 4, 9, 16, and 25), and then for each possible value u, find
pY (u) = P {Y = u}. For example, for u = 1, pY (1) = P {Y = 1} = P {X = 1 or X = 1} = 2/8.
The complete list of nonzero values of pY is as follows:
u     pY(u)
0     1/8
1     2/8
4     2/8
9     1/8
16    1/8
25    1/8
Therefore, $E[Y] = 0\cdot\tfrac{1}{8} + 1\cdot\tfrac{2}{8} + 4\cdot\tfrac{2}{8} + 9\cdot\tfrac{1}{8} + 16\cdot\tfrac{1}{8} + 25\cdot\tfrac{1}{8} = \tfrac{60}{8} = 7.5.$
You may have noticed, without even thinking about it, that there is another way to compute
E[Y ] in Example 2.2.4. We can find E[Y ] without finding the pmf of Y. Instead of first adding
together some values of pX to find pY and then summing over the possible values of Y, we can just
directly sum over the possible values of X, yielding
$E[Y] = (-2)^2\tfrac{1}{8} + (-1)^2\tfrac{1}{8} + 0^2\tfrac{1}{8} + 1^2\tfrac{1}{8} + 2^2\tfrac{1}{8} + 3^2\tfrac{1}{8} + 4^2\tfrac{1}{8} + 5^2\tfrac{1}{8}. \qquad (2.2)$
This formula is so natural it is called the law of the unconscious statistician (LOTUS).
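As an illustration of the two routes to E[Y] just described, the following Python sketch (ours, not the text's) computes E[X²] for Example 2.2.4 both by LOTUS and by first forming the pmf of Y; the two answers agree.

```python
from fractions import Fraction
from collections import defaultdict

# pmf of X from Example 2.2.4: uniform on {-2, -1, ..., 5}.
p_X = {u: Fraction(1, 8) for u in range(-2, 6)}
g = lambda u: u * u  # Y = g(X) = X^2

# Route 1 (LOTUS): sum g(u) p_X(u) directly over the values of X.
E_Y_lotus = sum(g(u) * p for u, p in p_X.items())

# Route 2: first build the pmf of Y, then average over the values of Y.
p_Y = defaultdict(Fraction)
for u, p in p_X.items():
    p_Y[g(u)] += p
E_Y_direct = sum(v * p for v, p in p_Y.items())

print(E_Y_lotus, E_Y_direct)  # both 15/2 = 7.5
```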
Example 2.2.5 Suppose two fair dice are rolled. Find the pmf and the mean of the product of
the two numbers showing.
Table 2.1: Y = X1 X2 as a function of the numbers showing on the two dice.
X1\X2   1    2    3    4    5    6
  1     1    2    3    4    5    6
  2     2    4    6    8   10   12
  3     3    6    9   12   15   18
  4     4    8   12   16   20   24
  5     5   10   15   20   25   30
  6     6   12   18   24   30   36

Figure 2.3: The pmf for the product of two fair dice.
Solution Let X1 denote the number showing on the first die and X2 denote the number showing on the second die. Then P{(X1, X2) = (i, j)} = 1/36 for 1 ≤ i ≤ 6 and 1 ≤ j ≤ 6. Let Y = X1 X2. The easiest way to calculate the pmf of Y is to go through all 36 possible outcomes, and add the probabilities for outcomes giving the same value of Y. A table of Y as a function of X1 and X2 is shown in Table 2.1. There is only one outcome giving Y = 1, so pY(1) = 1/36. There are two outcomes giving Y = 2, so pY(2) = 2/36, and so on. The pmf of Y is shown in Figure 2.3. One way to compute E[Y] would be to use the definition of expectation and the pmf of Y shown. An alternative that we take is to use LOTUS. We have that Y = g(X1, X2), where g(i, j) = ij. Each of the 36 possible values of (X1, X2) has probability 1/36, so we have
$E[Y] = \frac{1}{36}\sum_{i=1}^{6}\sum_{j=1}^{6} ij = \frac{1}{36}\left(\sum_{i=1}^{6} i\right)\left(\sum_{j=1}^{6} j\right) = \frac{(21)^2}{36} = \left(\frac{7}{2}\right)^2 = \frac{49}{4} = 12.25.$
Example 2.2.6 Let X be a random variable with a pmf pX and consider the new random variable Y = X² + 3X. Use LOTUS to express E[Y] in terms of E[X] and E[X²].
Solution: By LOTUS,
$E[Y] = \sum_i (u_i^2 + 3u_i)\, p_X(u_i) = \left(\sum_i u_i^2\, p_X(u_i)\right) + \left(\sum_i 3u_i\, p_X(u_i)\right) = E[X^2] + 3E[X].$
The LOTUS equation (2.2) illustrates that expectations are weighted averages, with the weighting given by the pmf of the underlying random variable X. An important implication, illustrated
in Example 2.2.6, is that expectation is a linear operation. A more general statement of this linearity is the following. If g(X) and h(X) are functions of X, and a, b, and c are constants, then
ag(X) + bh(X) + c is also a function of X, and the same method as in Example 2.2.6 shows that
E[ag(X) + bh(X) + c] = aE[g(X)] + bE[h(X)] + c.
(2.3)
By the same reasoning, if X and Y are random variables on the same probability space, then
E[ag(X, Y ) + bh(X, Y ) + c] = aE[g(X, Y )] + bE[h(X, Y )] + c.
In particular, E[X + Y ] = E[X] + E[Y ].
Variance and standard deviation Suppose you are to be given a payment, with the size of the payment, in some unit of money, given by either X or by Y, described as follows. The random variable X is equal to 100 with probability one, whereas pY(100000) = 1/1000 and pY(0) = 999/1000. Would you be equally happy with either payment? Both X and Y have mean 100. This example illustrates that two random variables with quite different pmfs can have the same mean. The pmf for X is concentrated on the mean value, while the pmf for Y is considerably spread out.
The variance of a random variable X is a measure of how spread out the pmf of X is. Letting µX = E[X], the variance is defined by:
Var(X) = E[(X - µX)²].    (2.4)
The difference X - µX is called the deviation of X (from its mean). The deviation is the error if X is predicted by µX. By linearity of expectation, the mean of the deviation is zero: E[X - µX] = E[X] - µX = µX - µX = 0. Sometimes Var(X) is called the mean square deviation of X, because it is the mean of the square of the deviation. It might seem a little arbitrary that variance is defined using a power of two, rather than some other power, such as four. It does make sense to talk about the mean fourth power deviation, E[(X - µX)⁴], or the mean absolute deviation, E[|X - µX|]. However, the mean square deviation has several important mathematical properties
that those alternatives lack. The standard deviation of X, written σX, is defined by σX = √Var(X). The standardized version of X is the random variable (X - µX)/σX, which has mean zero and variance
$E\!\left[\left(\frac{X - \mu_X}{\sigma_X}\right)^2\right] = \frac{\mathrm{Var}(X)}{\sigma_X^2} = 1.$
Note that even if X is a measurement in some units such as meters, the standardized random variable (X - µX)/σX is dimensionless, because the standard deviation σX is in the same units as X.
Using the linearity of expectation, we derive another expression for Var(X):
Var(X) = E[X² - 2XµX + µX²]
       = E[X²] - 2µX E[X] + µX²
       = E[X²] - µX².    (2.5)
For an integer i ≥ 1, the ith moment of X is defined to be E[X^i]. Therefore, the variance of a
random variable is equal to its second moment minus the square of its first moment. The definition
of variance, (2.4), is useful for keeping in mind the meaning and scaling properties of variance,
while the equivalent expression (2.5) is often more useful for computing the variance of a random
variable from its distribution.
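The two expressions (2.4) and (2.5) are easy to compare numerically; the short sketch below (the helper name mean_var is ours) computes both for the fair-die pmf and confirms they agree.

```python
from fractions import Fraction

def mean_var(pmf):
    """Mean and variance of a discrete pmf given as {value: probability}.

    The variance is computed two ways: from the definition (2.4) and from
    the shortcut (2.5); the two must agree."""
    mu = sum(u * p for u, p in pmf.items())
    var_def = sum((u - mu) ** 2 * p for u, p in pmf.items())    # E[(X - mu)^2]
    var_alt = sum(u * u * p for u, p in pmf.items()) - mu ** 2  # E[X^2] - mu^2
    assert var_def == var_alt
    return mu, var_def

# Fair die: mean 7/2, variance 35/12.
die = {u: Fraction(1, 6) for u in range(1, 7)}
print(mean_var(die))
```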
2.3
Conditional probabilities
Let A and B be two events for some probability experiment. The conditional probability of B given
A is defined by
$P(B|A) = \begin{cases} \dfrac{P(AB)}{P(A)} & \text{if } P(A) > 0 \\ \text{undefined} & \text{if } P(A) = 0. \end{cases}$
It is not defined if P(A) = 0, which has the following meaning. If you were to write a computer routine to compute P(B|A) and the inputs are P(AB) = 0 and P(A) = 0, your routine shouldn't simply return the value zero. Rather, your routine should generate an error message such as "input error: conditioning on event of probability zero." Such an error message would help you or others find errors in larger computer programs which use the routine.
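A minimal sketch of such a routine in Python (the function name conditional_prob is ours, not the text's):

```python
def conditional_prob(p_ab: float, p_a: float) -> float:
    """Return P(B|A) = P(AB)/P(A), refusing to condition on a null event."""
    if p_a == 0:
        raise ValueError("input error: conditioning on event of probability zero")
    return p_ab / p_a

print(conditional_prob(4/36, 5/36))  # 0.8 (= 4/5), as in Example 2.3.1 below
```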
Intuitively, if you know that event A is true for a probability experiment, then you know that
the outcome of the probability experiment is in A. Conditioned on that, whether B is also true
should only depend on the outcomes in B that are also in A, which is the set AB. If B is equal
to A, then, given A is true, B should be true with conditional probability one. That is why the
definition of P (B|A) has P (A) in the denominator.
One of the very nice things about elementary probability theory is the simplicity of this definition of conditional probability. Sometimes we might get conflicting answers when calculating the
probability of some event, using two different intuitive methods. When that happens, inevitably,
at least one of the methods has a flaw in it, and falling back on simple definitions such as the
definition of conditional probability clears up the conflict, and sharpens our intuition.
The following examples show that the conditional probability of an event can be smaller than,
larger than, or equal to, the unconditional probability of the event. (Here, the phrase unconditional
probability of the event is the same as the probability of the event; the word unconditional is
used just to increase the contrast with conditional probability.)
Example 2.3.1 Roll two dice and observe the numbers coming up. Define two events by: A=the
sum is six, and B=the numbers are not equal. Find and compare P (B) and P (B|A).
Solution: The sample space is Ω = {(i, j) : 1 ≤ i ≤ 6, 1 ≤ j ≤ 6}, which has 36 equally likely outcomes. To find P(B) we count the number of outcomes in B. There are six choices for the number
coming up on the first die, and for each of those, five choices for the number coming up on the
second die that is different from the first. So B has 6 · 5 = 30 outcomes, and P(B) = 30/36 = 5/6.
Another way to see that P (B) = 5/6 is to notice that whatever the number on the first die is, the
probability that the number on the second die will be different from it is 5/6.
Since A = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)} we have P (A) = 5/36. Similarly, since
AB = {(1, 5), (2, 4), (4, 2), (5, 1)} we have P (AB) = 4/36.
Therefore, P(B|A) = P(AB)/P(A) = (4/36)/(5/36) = 4/5. Note that, for this example, P(B|A) < P(B). That
is, if one learns that the sum is six, the chances that different numbers show on the dice decreases
from the original probability that different numbers show on the dice.
Example 2.3.2 Continuing with the previous example, find and compare P (B c ) and P (B c |A).
Solution: Here, B^c is the event that the numbers coming up are the same. Since six outcomes are in B^c, P(B^c) = 6/36 = 1/6. Since AB^c = {(3, 3)}, P(AB^c) = 1/36. As noted above, P(A) = 5/36. Therefore, P(B^c|A) = P(AB^c)/P(A) = (1/36)/(5/36) = 1/5. Another way to compute P(B^c|A) would have been to
notice that P (B|A)+P (B c |A) = 1, and use the fact, from Example 2.3.1, that P (B|A) = 4/5. Note
that, for this example, P (B c |A) > P (B c ). That is, if one learns that the sum is six, the chances
that the numbers coming up are the same number increases from the original probability that the
numbers coming up are the same.
Example 2.3.3 Again, consider the rolls of two fair dice. Let E=the number showing on the
first die is even, and F =the sum of the numbers showing is seven. Find and compare P (F ) and
P (F |E).
Solution: Since F has six outcomes, P(F) = 6/36 = 1/6. Since EF = {(2, 5), (4, 3), (6, 1)}, P(EF) = 3/36 = 1/12. Since E has 18 outcomes, P(E) = 1/2. Therefore, P(F|E) = P(EF)/P(E) = (1/12)/(1/2) = 1/6. Note that,
for this example, P (F ) = P (F |E). That is, if one learns that the number coming up on the first
die is even, the conditional probability that the numbers coming up on the dice sum to seven is the
same as the original probability that the numbers sum to seven.
2.4 Independence and the binomial distribution
2.4.1 Mutually independent events
Let A and B be two events for some probability space. Consider first the case that P (A) > 0.
As seen in the previous section, it can be that P (B|A) = P (B), which intuitively means that
knowledge that A is true does not affect the probability that B is true. It is then natural to
consider the events to be independent. If P(B|A) ≠ P(B), then knowledge that A is true does affect the probability that B is true, and it is natural to consider the events to be dependent (i.e. not independent). Since, by definition, P(B|A) = P(AB)/P(A), the condition P(B|A) = P(B) is equivalent to P(AB) = P(A)P(B).
Let's consider the other case: P(A) = 0. Should we consider A and B to be independent? It doesn't make sense to condition on A, but P(A^c) = 1, so we can consider P(B|A^c) instead. It holds that P(B|A^c) = P(A^cB)/P(A^c) = P(A^cB) = P(B) - P(AB) = P(B), because P(AB) = 0 when P(A) = 0. Therefore, P(B|A^c) = P(B). That is, if P(A) = 0, knowledge that A is not true does not affect the probability of B. So it is natural to consider A to be independent of B.
These observations motivate the following definition, which has the advantage of applying
whether or not P (A) = 0 :
Definition 2.4.1 Event A is independent of event B if P (AB) = P (A)P (B).
Note that the condition in the definition of independence is symmetric in A and B. Therefore,
A is independent of B if and only if B is independent of A. Another commonly used terminology for
these two equivalent relations is to say that A and B are mutually independent. Here, mutually
means that independence is a property of the two events. It does not make sense to say that a
single event A is independent, without reference to some other event.
If the experiment underlying the probability space (, F, P ) involves multiple physically separated parts, then it is intuitively reasonable that an event involving one part of the experiment
should be independent of another event that involves some other part of the experiment that is
physically separated from the first. For example, when an experiment involves the rolls of two fair
dice, it is implicitly assumed that the rolls of the two dice are physically independent, and an event
A concerning the number showing on the first die would be physically independent of any event
concerning the number showing on the second die. So, often in formulating a model, it is assumed
that if A and B are physically independent, then they should be independent under the probability
model (, F, P ).
The condition for A to be independent of B, namely P (AB) = P (A)P (B) is just a single
equation that can be true even if A and B are not physically independent.
Example 2.4.2 Consider a probability experiment related to the experiment discussed in Section
1.3, in which a fair coin is flipped and a die is rolled, with N denoting the side showing on the
coin and X denoting the number showing on the die. We should expect the event {N = H}
to be independent of the event {X = 6}, because they are physically independent events. This
independence holds, assuming all twelve outcomes in Ω are equally likely, because then P{N = H, X = 6} = 1/12 = (1/2)(1/6) = P{N = H}P{X = 6}.
Example 2.4.3 Suppose the probability experiment is to roll a single die. Let A be the event
that the outcome is even, and let B be the event that the outcome is a multiple of three. Since
these events both involve the outcome of a single roll of a die, we would not consider them to be physically independent. However, A = {2, 4, 6}, B = {3, 6}, and AB = {6}. So P(A) = 1/2, P(B) = 1/3, and P(AB) = 1/6. Therefore, P(AB) = P(A)P(B), which means that A and B are
mutually independent. This is a simple example showing that events can be mutually independent,
even if they are not physically independent.
Here is a final note about independence of two events. Suppose A is independent of B. Then
P(A^cB) = P(B) - P(AB) = (1 - P(A))P(B) = P(A^c)P(B),
so Ac is independent of B. Similarly, A is independent of B c , and therefore, by the same reasoning,
Ac is independent of B c . In summary, the following four conditions are equivalent: A is independent
of B, Ac is independent of B, A is independent of B c , Ac is independent of B c .
Let us now consider independence conditions for three events. The following definition simply
requires any one of the events to be independent of any one of the other events.
Definition 2.4.4 Events A, B, and C are pairwise independent if P (AB) = P (A)P (B), P (AC) =
P (A)P (C) and P (BC) = P (B)P (C).
Example 2.4.5 Suppose two fair coins are flipped, so = {HH, HT, T H, T T }, and the four
outcomes in are equally likely. Let
A = {HH, HT } =first coin shows heads,
B = {HH, T H} =second coin shows heads,
C = {HH, T T } =both coins show heads or both coins show tails.
It is easy to check that A, B, and C are pairwise independent. Indeed, P (A) = P (B) = P (C) = 0.5
and P (AB) = P (AC) = P (BC) = 0.25. We would consider A to be physically independent of B
as well, because they involve flips of different coins. Note that P(A|BC) = 1 ≠ P(A). That is,
knowing that both B and C are true affects the probability that A is true. So A is not independent
of BC.
Example 2.4.5 illustrates that pairwise independence of events does not imply that any one
of the events is independent of the intersection of the other two events. In order to have such
independence, a stronger condition is used to define independence of three events:
Definition 2.4.6 Events A, B, and C are independent if they are pairwise independent and if
P (ABC) = P (A)P (B)P (C).
Suppose A, B, C are independent. Then A (or Ac ) is independent of any event that can be
made from B and C by set operations. For example, A is independent of BC because P (A(BC)) =
P(ABC) = P(A)P(B)P(C) = P(A)P(BC). For a somewhat more complicated example, here's a proof that A is independent of B ∪ C:
P(A(B ∪ C)) = P(AB) + P(AC) - P(ABC)
            = P(A)[P(B) + P(C) - P(B)P(C)]
            = P(A)P(B ∪ C).
If the three events A, B, and C have to do with three physically separated parts of a probability
experiment, then we would expect them to be independent. But three events could happen to be
independent even if they are not physically separated. The definition of independence for three
events involves four equalities: one for each pairwise independence, and the final one: P(ABC) =
P (A)P (B)P (C).
Finally, we give a definition of independence for any finite collection of events, which generalizes
the above definitions for independence of two or three events.
Definition 2.4.7 Events A1 , A2 , . . . , An are independent if
$P(A_{i_1} A_{i_2} \cdots A_{i_k}) = P(A_{i_1})P(A_{i_2}) \cdots P(A_{i_k})$
whenever 2 ≤ k ≤ n and 1 ≤ i1 < i2 < · · · < ik ≤ n.
The definition of independence is strong enough that if new events are made by set operations on
nonoverlapping subsets of the original events, then the new events are also independent. That is,
suppose A1, A2, . . . , An are independent events, suppose n = n1 + · · · + nk with ni ≥ 1 for each i, and suppose B1 is defined by Boolean operations (intersections, complements, and unions) of the first n1 events A1, . . . , An1, B2 is defined by Boolean operations on the next n2 events, An1+1, . . . , An1+n2,
and so on, then B1 , . . . , Bk are independent.
2.4.2 Independent random variables
Definition 2.4.8 Random variables X and Y are independent if any event of the form {X ∈ A} is independent of any event of the form {Y ∈ B}.
If X and Y are independent random variables and i and j are real values, then
P {X = i, Y = j} = pX (i)pY (j) (this follows by taking A = {i} and B = {j} in the definition of
independence). Conversely, if X and Y are discrete-type random variables such that
P{X = i, Y = j} = pX(i)pY(j) for all i, j, then for any subsets A and B of the real numbers,
$P\{X \in A, Y \in B\} = \sum_{(i,j):\, i \in A,\, j \in B} P\{X = i, Y = j\} = \sum_{i \in A}\sum_{j \in B} p_X(i)p_Y(j) = \left(\sum_{i \in A} p_X(i)\right)\left(\sum_{j \in B} p_Y(j)\right) = P\{X \in A\}P\{Y \in B\},$
so that {X ∈ A} and {Y ∈ B} are mutually independent events. Thus, discrete random variables X and Y are independent if and only if P{X = i, Y = j} = pX(i)pY(j) for all i, j.
More generally, random variables (not necessarily discrete-type) X1, X2, . . . , Xn are mutually independent if any set of events of the form {X1 ∈ A1}, {X2 ∈ A2}, . . . , {Xn ∈ An} are mutually independent. Independence of random variables is discussed in more detail in Section 4.4.
2.4.3
Bernoulli distribution
Some distributions arise so frequently that they have names. Two such distributions are discussed
in this section: the Bernoulli and binomial distributions. The geometric and Poisson distributions
are two other important discrete-type distributions with names, and they are introduced in later
sections.
A random variable X is said to have the Bernoulli distribution with parameter p, where 0 ≤ p ≤ 1, if P{X = 1} = p and P{X = 0} = 1 - p. Note that E[X] = p. Since X = X², E[X²] = E[X] = p. So Var(X) = E[X²] - E[X]² = p - p² = p(1 - p). The variance is plotted as a function of p in Figure 2.4. It is symmetric about 1/2, and achieves its maximum value, 1/4, at p = 1/2.
2.4.4
Binomial distribution
Suppose n independent Bernoulli trials are conducted, each resulting in a one with probability p and a zero with probability 1 - p. Let X denote the total number of ones occurring in the n trials. Any particular outcome with k ones and n - k zeros, such as 11010101 if n = 8 and k = 5, has probability p^k(1 - p)^{n-k}. Since there are \binom{n}{k} such outcomes, we find that the pmf of X is
$p_X(k) = \binom{n}{k} p^k (1-p)^{n-k} \quad \text{for } 0 \le k \le n.$
The distribution of X is called the binomial distribution with parameters n and p. Figure 2.5 shows
the binomial pmf for n = 24 and p = 1/3.
Figure 2.5: The pmf of a binomial random variable with n = 24 and p = 1/3.
Since a partition for the sample space is the set of n + 1 events of the form {X = k} for 0 ≤ k ≤ n, it follows from the axioms of probability that the pmf pX just derived sums to one. We will double check that fact using a series expansion. Recall
that the Taylor series expansion of a function f about a point xo is given by
$f(x) = f(x_o) + f'(x_o)(x - x_o) + f''(x_o)\frac{(x - x_o)^2}{2} + f'''(x_o)\frac{(x - x_o)^3}{3!} + \cdots$
The Maclaurin series expansion of a function f is the Taylor series expansion about x_o = 0:
$f(x) = f(0) + f'(0)x + f''(0)\frac{x^2}{2} + f'''(0)\frac{x^3}{3!} + \cdots \qquad (2.6)$
Substituting x = p/(1 - p) into (2.6) applied to f(x) = (1 + x)^n and multiplying through by (1 - p)^n yields that
$\sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} = 1,$
confirming that the pmf sums to one. The mean of the binomial distribution is given by
$E[X] = \sum_{k=0}^{n} k\binom{n}{k} p^k (1-p)^{n-k} = np \sum_{k=1}^{n} \frac{(n-1)!}{(n-k)!(k-1)!}\, p^{k-1} (1-p)^{n-k} = np \sum_{l=0}^{n-1} \binom{n-1}{l} p^l (1-p)^{n-1-l} \ (\text{here } l = k-1) \ = np. \qquad (2.7)$
The variance of the binomial distribution is given by Var(X) = np(1 - p). This fact can be shown using the pmf, but a simpler derivation is given in Example 4.8.1.
To explore the shape of the pmf, we examine the ratio of consecutive terms:
$\frac{p(k)}{p(k-1)} = \frac{\frac{n!}{k!(n-k)!}\, p^k (1-p)^{n-k}}{\frac{n!}{(k-1)!(n-k+1)!}\, p^{k-1} (1-p)^{n-k+1}} = \frac{(n-k+1)p}{k(1-p)}.$
Therefore, p(k) ≥ p(k - 1) if and only if (n - k + 1)p ≥ k(1 - p), or equivalently, k ≤ (n + 1)p. Therefore, letting k* = ⌊(n + 1)p⌋ (that is, k* is the largest integer less than or equal to (n + 1)p, which is approximately equal to np), the following holds:
p(0) < · · · < p(k* - 1) = p(k*) > p(k* + 1) > · · · > p(n)   if (n + 1)p is an integer
p(0) < · · · < p(k* - 1) < p(k*) > p(k* + 1) > · · · > p(n)   if (n + 1)p is not an integer.
That is, the pmf p(k) increases monotonically as k increases from 0 to k* and it decreases monotonically as k increases from k* to n. Thus, k* is the value of k that maximizes p(k); which is to say that k* is the mode of the pmf. For example, if n = 24 and p = 1/3, the pmf is maximized at k* = 8, as seen in Figure 2.5.
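A quick numerical check of these facts for n = 24 and p = 1/3 (helper names below are ours): the pmf sums to one, and its mode equals ⌊(n + 1)p⌋ = 8.

```python
from math import comb, floor

def binom_pmf(n, p):
    # p_X(k) = C(n, k) p^k (1-p)^(n-k) for 0 <= k <= n
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

n, p = 24, 1/3
pmf = binom_pmf(n, p)
print(abs(sum(pmf) - 1.0) < 1e-12)                                  # pmf sums to one
print(max(range(n + 1), key=lambda k: pmf[k]), floor((n + 1) * p))  # mode 8 = floor((n+1)p)
```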
A number m is called the median of the binomial distribution if $\sum_{k:\, k \le m} p(k) \ge 0.5$ and $\sum_{k:\, k \ge m} p(k) \ge 0.5$. It can be shown that any median m of the binomial distribution satisfies |m - np| ≤ max{p, 1 - p},¹ so the median is always close to the mean, np.
¹R. Kaas and J.M. Buhrman, "Mean, median, and mode in binomial distributions," Statistica Neerlandica, vol. 34, no. 1, pp. 13-18, 2008.
Example 2.4.9 [video] Suppose two teams, A and B, play a best-of-seven series of games. Assume
that ties are not possible in each game, that each team wins a given game with probability one
half, and the games are independent. The series ends once one of the teams has won four games,
because after winning four games, even if all seven games were played, that team would have more
wins than the other team, and hence it wins the series. Let Y denote the total number of games
played. Find the pmf of Y.
Solution. The possible values of Y are 4, 5, 6, or 7, so let 4 ≤ n ≤ 7. The event {Y = n} can be
expressed as the union of two events:
{Y = n} = {Y = n, A wins the series} {Y = n, B wins the series}.
The events {Y = n, A wins the series} and {Y = n, B wins the series} are mutually exclusive, and,
by symmetry, they have the same probability. Thus, pY (n) = 2P {Y = n, A wins the series}. Next,
notice that {Y = n, A wins the series} happens if and only if A wins three out of the first n - 1 games, and A also wins the nth game. The number of games that team A wins out of the first n - 1 games has the binomial distribution with parameters n - 1 and p = 1/2. So, for 4 ≤ n ≤ 7,
pY(n) = 2 P{Y = n, A wins the series}
      = 2 P{A wins 3 of the first n - 1 games} P{A wins the nth game}
      $= 2 \binom{n-1}{3}\left(\tfrac{1}{2}\right)^{n-1}\cdot\tfrac{1}{2} = \binom{n-1}{3}\left(\tfrac{1}{2}\right)^{n-1},$
or (pY(4), pY(5), pY(6), pY(7)) = (1/8, 1/4, 5/16, 5/16). By the same reasoning, more generally, if A wins each game with some probability p, not necessarily equal to 1/2:
$p_Y(n) = \binom{n-1}{3}\left(p^4(1-p)^{n-4} + (1-p)^4 p^{n-4}\right) \quad \text{for } 4 \le n \le 7.$
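A short Python check of these probabilities (helper names are ours): the pmf computed from the formula just stated for p = 1/2, together with a Monte Carlo simulation of the series.

```python
from math import comb
import random

def series_length_pmf(p):
    # P(Y = n) for a best-of-seven series in which A wins each game with probability p.
    return {n: comb(n - 1, 3) * (p**4 * (1 - p)**(n - 4) + (1 - p)**4 * p**(n - 4))
            for n in range(4, 8)}

print(series_length_pmf(0.5))  # {4: 0.125, 5: 0.25, 6: 0.3125, 7: 0.3125}

def simulate(p, trials=100_000, rng=random.Random(0)):
    counts = {n: 0 for n in range(4, 8)}
    for _ in range(trials):
        a = b = 0
        while a < 4 and b < 4:          # play until one team has four wins
            if rng.random() < p:
                a += 1
            else:
                b += 1
        counts[a + b] += 1
    return {n: c / trials for n, c in counts.items()}

print(simulate(0.5))  # close to the exact values above
```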
Example 2.4.10 Let X denote the number showing for one roll of a fair die. Find Var(X) and the standard deviation, σX.
Solution: From Example 2.2.2, E[X] = 3.5. Also, E[X²] = (1 + 4 + 9 + 16 + 25 + 36)/6 = 91/6, so Var(X) = E[X²] - E[X]² = 91/6 - 49/4 = 35/12, and σX = √(35/12) ≈ 1.7.
Example 2.4.11 For each of the following two choices of distribution for X, let Y be the standardized version of X. Sketch the pmfs of X and Y (defined in Section 2.2). (a) X is the number
generated by rolling a fair die. (b) X has the binomial distribution with parameters n = 4 and
p = 0.5.
Solution: (a) The mean of X is 3.5 and the standard deviation is approximately 1.7, so to get
the pmf of Y we shift the pmf of X to the left by 3.5 (i.e. we center it) and then scale by shrinking
the pmf horizontally by the factor 1.7. The values of the pmf for Y are still all 1/6.
(b) The mean of X is np = 2 and the variance is np(1 - p) = 1. Since the variance is already one, Y = X - 2; no scaling is necessary. The pmf of Y is the pmf of X shifted to the left by two (i.e. it is centered).
2.5
Geometric distribution
Suppose a sequence of independent trials are conducted, such that the outcome of each trial is modeled by a Bernoulli random variable with parameter p. Here p is assumed to satisfy 0 < p ≤ 1. Thus, the outcome of each trial is one with probability p and zero with probability 1 - p. Let L denote the number of trials conducted until the outcome of a trial is one. Let us find the pmf of L. Clearly the range of possible values of L is the set of positive integers. Also, L = 1 if and only if the outcome of the first trial is one. Thus, pL(1) = p. The event {L = 2} happens if the outcome of the first trial is zero and the outcome of the second trial is one. By independence of the trials, it follows that pL(2) = (1 - p)p. Similarly, the event {L = 3} happens if the outcomes of the first two trials are zero and the outcome of the third trial is one, so pL(3) = (1 - p)(1 - p)p = (1 - p)²p. By the same reasoning, for any k ≥ 1, the event {L = k} happens if the outcomes of the first k - 1 trials are zeros and the outcome of the kth trial is one. Therefore,
pL(k) = (1 - p)^{k-1} p   for k ≥ 1.
The tail probabilities for L have an even simpler form. Specifically, if k ≥ 0 then the event {L > k} happens if the outcomes of the first k trials are zeros. So
P{L > k} = (1 - p)^k   for k ≥ 0.    (2.8)
The random variable L is said to have the geometric distribution with parameter p. The fact that the sum of the probabilities is one,
$\sum_{k=1}^{\infty} (1-p)^{k-1} p = 1,$
is a consequence of taking x = 1 - p in the formula (1.5) for the sum of a geometric series. By differentiating each side of (1.5), setting x = 1 - p, and rearranging, we find that $E[L] = \sum_{k=1}^{\infty} k\, p_L(k) = \frac{1}{p}$. For example, if p = 0.01, E[L] = 100.
Another, more elegant way to find E[L] is to condition on the outcome of the first trial. If the outcome of the first trial is one, then L is one. If the outcome of the first trial is zero, then L is one plus the number of additional trials until there is a trial with outcome one, which we call L̃. Therefore, E[L] = p · 1 + (1 - p)E[1 + L̃]. However, L and L̃ have the same distribution, because they are both equal to the number of trials needed until the outcome of a trial is one. So E[L̃] = E[L], and the above equation becomes E[L] = 1 + (1 - p)E[L], from which it follows that E[L] = 1/p.
The variance of L can be found similarly. We could differentiate each side of (1.5) twice, set x = 1 - p, and rearrange. Here we take the alternative approach, just described for finding the mean. By the same reasoning as before,
E[L²] = p + (1 - p)E[(1 + L̃)²], or E[L²] = p + (1 - p)E[(1 + L)²].
Expanding out, using the linearity of expectation, yields
E[L²] = p + (1 - p)(1 + 2E[L] + E[L²]).
Solving for E[L²], using the fact E[L] = 1/p, yields E[L²] = (2 - p)/p². Therefore, Var(L) = E[L²] - E[L]² = (1 - p)/p². The standard deviation of L is σL = √Var(L) = √(1 - p)/p.
It is worth remembering that if p is very small, the mean E[L] = 1/p is very large, and the standard deviation is nearly as large as the mean (just smaller by the factor √(1 - p)).
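A simulation sketch (ours, not the text's) that generates geometric samples by running Bernoulli trials and compares the empirical mean and variance with 1/p and (1 - p)/p².

```python
import random

def sample_geometric(p, rng):
    """Number of Bernoulli(p) trials up to and including the first success."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k

p, trials = 0.2, 200_000
rng = random.Random(1)
samples = [sample_geometric(p, rng) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((x - mean) ** 2 for x in samples) / trials
print(mean, 1 / p)             # both close to 5.0
print(var, (1 - p) / p ** 2)   # both close to 20.0
```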
Example 2.5.1 Suppose a fair die is repeatedly rolled until each of the numbers one through six
shows at least once. What is the mean number of rolls?
Solution. The total number of rolls, R, can be expressed as R = R1 + . . . + R6, where for 1 ≤ i ≤ 6, Ri is the number of rolls made after i - 1 distinct numbers have shown, up to and including the roll such that the ith distinct number shows. For example, if the sequence of numbers showing is 2, 4, 2, 3, 4, 4, 3, 5, 3, 5, 4, 4, 6, 2, 3, 3, 4, 1, insert a vertical bar just after each roll that shows a new distinct number to get 2|4|2, 3|4, 4, 3, 5|3, 5, 4, 4, 6|2, 3, 3, 4, 1|. Then Ri is the number of numbers in the ith group. After i - 1 distinct numbers have shown, each subsequent roll shows a new number with probability (7 - i)/6, so Ri has the geometric distribution with parameter (7 - i)/6 and E[Ri] = 6/(7 - i). By linearity of expectation,
$E[R] = 6\left(\tfrac{1}{1} + \tfrac{1}{2} + \tfrac{1}{3} + \tfrac{1}{4} + \tfrac{1}{5} + \tfrac{1}{6}\right) = 14.7.$
This is a special case of the coupon collector problem, with n = 6 coupon types. In general, if there are n coupon types, the expected number of coupons needed until at least one coupon of each type is obtained is n(1/1 + 1/2 + · · · + 1/n) ≈ n ln(n).
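The coupon collector formula is easy to check numerically; the sketch below (helper names ours) evaluates n(1/1 + · · · + 1/n) for n = 6 and compares it with a Monte Carlo estimate of the mean number of die rolls.

```python
from fractions import Fraction
import random

def expected_coupons(n):
    # n * (1/1 + 1/2 + ... + 1/n)
    return n * sum(Fraction(1, k) for k in range(1, n + 1))

print(float(expected_coupons(6)))   # 14.7

rng = random.Random(2)
def rolls_to_collect_all(faces=6):
    seen, rolls = set(), 0
    while len(seen) < faces:
        seen.add(rng.randint(1, faces))
        rolls += 1
    return rolls

trials = 100_000
print(sum(rolls_to_collect_all() for _ in range(trials)) / trials)  # close to 14.7
```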
The geometric distribution has the memoryless property: for k, n ≥ 0,
$P\{L > k + n \mid L > n\} = \frac{P\{L > k + n, L > n\}}{P\{L > n\}} = \frac{P\{L > k + n\}}{P\{L > n\}} = \frac{(1-p)^{k+n}}{(1-p)^n} = (1-p)^k = P\{L > k\}.$
That is, given that the first n trials produced no ones, the number of additional trials needed until a one occurs has the same distribution as L itself.
2.6 Bernoulli process and the negative binomial distribution
Recall that a random variable has the Bernoulli distribution with parameter p if it is equal to one
with probability p and to zero otherwise. A Bernoulli process is an infinite sequence, X1 , X2 , . . . ,
of Bernoulli random variables, all with the same parameter p, and independent of each other. Therefore, for example, P{X5 = 1, X6 = 1, X7 = 0, X12 = 1} = p³(1 - p). The kth random variable Xk indicates the (random) outcome of the kth trial in an infinite sequence of trials. For any ω in the underlying probability space, the Bernoulli process has a corresponding realized value Xk(ω) for each time k, and that function of time is called the sample path of the Bernoulli process for outcome ω. A sample path of a Bernoulli process is illustrated in Figure 2.6. The figure indicates
the realized trial outcomes Xk(ω), together with the quantities L1, S1, L2, S2, L3, S3 discussed in this section. Two random sequences determined by a Bernoulli process are the following:
The cumulative number of ones in k trials, for k ≥ 0: (C0, C1, C2, . . .). For k fixed, Ck is the number of ones in k independent Bernoulli trials, so it has the binomial distribution with parameters k and p. More generally, for 0 ≤ k < l, the increment Cl - Ck is the number of ones in l - k Bernoulli trials, so it has the binomial distribution with parameters l - k and p. Also, the increments of C over nonoverlapping intervals are independent.
The cumulative numbers of trials for j ones, for j ≥ 0: (S0, S1, S2, . . .). As discussed below, for integers r ≥ 1, Sr has the negative binomial distribution with parameters r and p.
Negative binomial distribution In the remainder of this section we discuss the distribution of Sr, which is the number of trials required for r ones, for r ≥ 1. The possible values of Sr are r, r + 1, r + 2, . . . . So let n ≥ r, and let k = n - r. The event {Sr = n} is determined by the outcomes of the first n trials. The event is true if and only if there are r - 1 ones and k zeros in the first k + r - 1 trials, and trial n is a one. There are \binom{n-1}{r-1} such sequences of length n, and each has probability p^{r-1}(1 - p)^{n-r} · p. Therefore, the pmf of Sr is given by
$p(n) = \binom{n-1}{r-1} p^r (1-p)^{n-r} \quad \text{for } n \ge r.$
This is called the negative binomial distribution with parameters r and p. To check that the pmf sums to one, begin with the Maclaurin series expansion (i.e. Taylor series expansion about zero) of (1 - x)^{-r}:
$(1-x)^{-r} = \sum_{k=0}^{\infty} \binom{k+r-1}{r-1} x^k.$
Substituting x = 1 - p and multiplying through by p^r yields
$\sum_{n=r}^{\infty} \binom{n-1}{r-1} p^r (1-p)^{n-r} = p^r\,(1-(1-p))^{-r} = 1.$
Use of the expansion of (1 - x)^{-r} here, in analogy to the expansion of (1 + x)^n used for the binomial distribution, explains the name negative binomial distribution. Since Sr = L1 + · · · + Lr, where each Lj has mean 1/p, E[Sr] = r/p. It is shown in Example 4.8.1 that Var(Sr) = r Var(L1) = r(1 - p)/p².
2.7 Poisson distribution
By definition, the Poisson probability distribution with parameter λ > 0 is the one with pmf p(k) = λ^k e^{-λ}/k! for k ≥ 0. In particular, 0! is defined to equal one, so p(0) = e^{-λ}. The next three terms of the pmf are p(1) = λe^{-λ}, p(2) = (λ²/2)e^{-λ}, and p(3) = (λ³/6)e^{-λ}. The Poisson distribution arises frequently in practice, because it is a good approximation for a binomial distribution with parameters n and p, when n is very large, p is very small, and λ = np, as occurs when counting the number of occurrences of a rare event among a large number of independent trials.
Binomial distributions have two parameters, namely n and p, and they involve binomial coefficients, which can be cumbersome. Poisson distributions are simpler, having only one parameter, λ, and no binomial coefficients. So it is worthwhile using the Poisson distribution rather than the binomial distribution for large n and small p. We now derive a limit result to give evidence that this is a good approximation. Let λ > 0, let n and k be integers with n ≥ λ and 0 ≤ k ≤ n, and let pb(k) denote the probability mass at k of the binomial distribution with parameters n and p = λ/n. We first consider the limit of the mass of the binomial distribution at k = 0. Note that
$\ln p_b(0) = \ln (1-p)^n = n \ln(1-p) = n \ln\!\left(1 - \frac{\lambda}{n}\right).$
By Taylor's theorem, ln(1 + u) = u + o(u) where o(u)/u → 0 as u → 0. So, using u = -λ/n,
$\ln p_b(0) = -\lambda + n\, o\!\left(\frac{\lambda}{n}\right) \to -\lambda \quad \text{as } n \to \infty.$
Therefore,
$p_b(0) = \left(1 - \frac{\lambda}{n}\right)^n \to e^{-\lambda} \quad \text{as } n \to \infty, \qquad (2.9)$
so the probability mass at zero for the binomial distribution converges to the probability mass at zero for the Poisson distribution. Similarly, for any integer k ≥ 0 fixed,
$p_b(k) = \binom{n}{k} p^k (1-p)^{n-k} = \frac{n(n-1)\cdots(n-k+1)}{k!}\,\frac{\lambda^k}{n^k}\left(1 - \frac{\lambda}{n}\right)^{n-k} = \frac{\lambda^k\, p_b(0)}{k!}\left[\frac{n(n-1)\cdots(n-k+1)}{n^k}\left(1 - \frac{\lambda}{n}\right)^{-k}\right] \qquad (2.10)$
$\to \frac{\lambda^k e^{-\lambda}}{k!} \quad \text{as } n \to \infty,$
because (2.9) holds, and the terms in square brackets in (2.10) converge to one.
Checking that the Poisson distribution sums to one, and deriving the mean and variance of the
Poisson distribution, can be done using the pmf as can be done for the binomial distribution. This
is not surprising, given that the Poisson distribution is a limiting form of the binomial distribution.
The Maclaurin series for e^x plays a role:
$e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!}. \qquad (2.11)$
In particular,
$\sum_{k=0}^{\infty} \frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda} e^{\lambda} = 1,$
so the pmf does sum to one. Following (2.7) line for line yields that if Y has the Poisson distribution with parameter λ,
$E[Y] = \sum_{k=0}^{\infty} k\,\frac{\lambda^k e^{-\lambda}}{k!} = \sum_{k=1}^{\infty} k\,\frac{\lambda^k e^{-\lambda}}{k!} = \lambda \sum_{k=1}^{\infty} \frac{\lambda^{k-1} e^{-\lambda}}{(k-1)!} = \lambda \sum_{l=0}^{\infty} \frac{\lambda^{l} e^{-\lambda}}{l!} \ (\text{here } l = k-1) \ = \lambda.$
Similarly, it can be shown that Var(Y) = λ. The mean and variance can be obtained by taking the limit of the mean and limit of the variance of the binomial distribution with parameters n and p = λ/n, as n → ∞, as follows. The mean of the Poisson distribution is lim_{n→∞} n·(λ/n) = λ, and the variance of the Poisson distribution is lim_{n→∞} n·(λ/n)·(1 - λ/n) = λ.
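A small numerical illustration of the approximation (variable names ours): the binomial(n, λ/n) probabilities for large n are close to the Poisson(λ) probabilities.

```python
from math import comb, exp, factorial

lam, n = 3.0, 1000
p = lam / n

def binom_pmf(k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k):
    return lam**k * exp(-lam) / factorial(k)

# The two columns of probabilities agree to several decimal places.
for k in range(6):
    print(k, round(binom_pmf(k), 5), round(poisson_pmf(k), 5))
```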
2.8 Maximum likelihood parameter estimation
Sometimes when we devise a probability model for some situation we have a reason to use a particular type of probability distribution, but there may be a parameter that has to be selected. A common approach is to collect some data and then estimate the parameter using the observed data. For example, suppose we decide that an experiment is accurately modeled by a probability model with a random variable X, and that the pmf of X is pθ, where θ is a parameter, but the value of the parameter is not known before the experiment is performed. When the experiment is performed, suppose we observe a particular value k for X. According to the probability model, the probability of k being the observed value for X, before the experiment was performed, would have been pθ(k). It is said that the likelihood that X = k is pθ(k). The maximum likelihood estimate of θ for observation k, denoted by θ̂_ML(k), is the value of θ that maximizes the likelihood, pθ(k), with respect to θ. Intuitively, the maximum likelihood estimate is the value of θ that best explains the observed value k, or makes it the least surprising.
A way to think about it is that pθ(k) depends on two variables: the parameter θ and the observed value k for X. It is the likelihood that X = k, which depends on θ. For parameter estimation, the goal is to come up with an estimate of the parameter for a given value k of the observed value. It wouldn't make sense to maximize the likelihood with respect to the observed value k, because k is assumed to be given. Rather, the maximization is performed with respect to the unknown parameter value; the maximum likelihood estimate is the value of the parameter that maximizes the likelihood of the observed value.
In the following context it makes sense to maximize pθ(k) with respect to k for θ fixed. Suppose you know θ and then you enter into a guessing game, in which you guess what the value of X will be, before the experiment is performed. If your guess is k, then your probability of winning is pθ(k), so you would maximize your probability of winning by guessing the value of k that maximizes pθ(k). For parameter estimation, the value k is observed and the likelihood is maximized with respect to θ; for the guessing game, the parameter θ is known and the likelihood is maximized with respect to k.
Example 2.8.1 Suppose a bent coin is given to a student. The coin is badly bent, but the student
can still flip the coin and see whether it shows heads or tails. The coin shows heads with probability
p each time it is flipped. The student flips the coin n times for some large value of n (for example,
n = 1000 is reasonable). Heads shows on k of the flips. Find the ML estimate of p.
Solution: The parameter to be estimated here is p, so p plays the role of θ in the definition of ML estimation. We use the letter p in this example instead of θ because p is commonly used in connection with Bernoulli trials. Let X be the number of times heads shows for n flips of the coin. It has the binomial distribution with parameters n and p. Therefore, the likelihood that X = k is pX(k) = \binom{n}{k} p^k (1 - p)^{n-k}. Here n is known (the number of times the coin is flipped) and k is known (the number of times heads shows). Therefore, the maximum likelihood estimate, p̂_ML, is the value of p that maximizes \binom{n}{k} p^k (1 - p)^{n-k}. Equivalently, p̂_ML is the value of p that maximizes p^k(1 - p)^{n-k}. First, assume that 1 ≤ k ≤ n - 1. Then
$\frac{d\left(p^k (1-p)^{n-k}\right)}{dp} = p^{k-1}(1-p)^{n-k-1}\left(k(1-p) - (n-k)p\right) = p^{k-1}(1-p)^{n-k-1}(k - np),$
which is positive for p < k/n and negative for p > k/n, so the likelihood is maximized at p = k/n. If k = 0 the likelihood (1 - p)^n is maximized at p = 0 = k/n, and if k = n the likelihood p^n is maximized at p = 1 = k/n. In all cases, p̂_ML = k/n.
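As a sanity check on this calculation, a brute-force grid search over p (sketch below, names ours) locates the maximizer of the binomial likelihood at k/n.

```python
from math import comb

def binomial_likelihood(p, n, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, k = 100, 37
# Grid search over p; the maximizer should be (close to) k/n.
grid = [i / 10000 for i in range(1, 10000)]
p_hat = max(grid, key=lambda p: binomial_likelihood(p, n, k))
print(p_hat, k / n)  # 0.37 and 0.37
```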
Example 2.8.2 Suppose it is assumed that X is drawn at random from the numbers 1 through n, with each possibility being equally likely (i.e. having probability 1/n). Suppose n is unknown but that it is observed that X = k, for some known value k. Find the ML estimator of n given X = k is observed.
Solution: The pmf can be written as pX(k) = (1/n) I_{1≤k≤n}. Recall that I_{1≤k≤n} is the indicator function of {1 ≤ k ≤ n}, equal to one on that set and equal to zero elsewhere. The whole idea now is to think of pX(k) not as a function of k (because k is the given observation), but rather, as a function of n. An example is shown in Figure 2.7. It is zero if n ≤ k - 1; it jumps up to 1/k at n = k, and it decreases as n increases beyond k. Therefore, the likelihood is maximized at n = k, so n̂_ML(k) = k.
Example 2.8.3 Suppose X has the geometric distribution with some parameter p which is unknown. Suppose a particular value k for X is observed. Find the maximum likelihood estimate, p̂_ML.
Solution: The pmf is given by pX(k) = (1 - p)^{k-1} p for k ≥ 1. If k = 1 we have to maximize pX(1) = p with respect to p, and p = 1 is clearly the maximizer. So p̂_ML = 1 if k = 1. If k ≥ 2 then pX(k) = 0 for p = 0 or p = 1. Since ((1 - p)^{k-1}p)' = ((1 - p) - (k - 1)p)(1 - p)^{k-2} = (1 - kp)(1 - p)^{k-2}, we conclude that pX(k) is increasing in p for 0 ≤ p ≤ 1/k and decreasing in p for 1/k ≤ p ≤ 1. Therefore, p̂_ML = 1/k if k ≥ 2. This expression is correct for k = 1 as well, so for any k ≥ 1, p̂_ML = 1/k.
Example 2.8.4 It is assumed that X has a Poisson distribution with some parameter λ with λ ≥ 0, but the value of λ is unknown. Suppose it is observed that X = k for a particular integer k. Find the maximum likelihood estimate of λ.
Solution: The likelihood of observing X = k is λ^k e^{-λ}/k!; the value of λ maximizing this likelihood is to be found for k fixed. Equivalently, the value of λ maximizing λ^k e^{-λ} is to be found. If k ≥ 1,
$\frac{d(\lambda^k e^{-\lambda})}{d\lambda} = (k - \lambda)\lambda^{k-1} e^{-\lambda},$
so the likelihood is increasing in λ for λ < k and decreasing in λ for λ > k; the likelihood is maximized at λ = k. Thus, λ̂_ML(k) = k. If k = 0, the likelihood is e^{-λ}, which is maximized by λ = 0, so λ̂_ML(k) = k for all k ≥ 0.
2.9 Markov and Chebychev inequalities and confidence intervals
The mean and variance of a random variable are two key numbers that roughly summarize the
distribution of the random variable. In some applications, the mean, and possibly the variance, of
a random variable are known, but the distribution of the random variable is not. The following
two inequalities can still be used to provide bounds on the probability a random variable takes
on a value far from its mean. The second of these is used to provide a confidence interval on an
estimator of the parameter p of a binomial distribution.
First, the Markov inequality states that if Y is a nonnegative random variable, then for c > 0,
$P\{Y \ge c\} \le \frac{E[Y]}{c}.$
To prove Markov's inequality for a discrete-type nonnegative random variable Y with possible values u1, u2, . . . , note that for each i, ui is bounded below by zero if ui < c, and ui is bounded below by c if ui ≥ c. Thus,
$E[Y] = \sum_i u_i\, p_Y(u_i) \ \ge\ \sum_{i:\, u_i < c} 0 \cdot p_Y(u_i) + \sum_{i:\, u_i \ge c} c\, p_Y(u_i) = c \sum_{i:\, u_i \ge c} p_Y(u_i) = c\, P\{Y \ge c\},$
which implies the Markov inequality. Equality holds in the Markov inequality if and only if pY(0) + pY(c) = 1.
Example 2.9.1 Suppose 200 balls are distributed among 100 buckets, in some particular but
unknown way. For example, all 200 balls could be in the first bucket, or there could be two balls
in each bucket, or four balls in fifty buckets, etc. What is the maximum number of buckets that
could each have at least five balls?
Solution: This question can be answered without probability theory, but it illustrates the essence
of the Markov inequality. If asked to place 200 balls within 100 buckets to maximize the number
of buckets with five or more balls, you would naturally put five balls in the first bucket, five in the
second bucket, and so forth, until you ran out of balls. In that way, you'd have 40 buckets with five
or more balls; that is the maximum possible; if there were 41 buckets with five or more balls then
there would have to be at least 205 balls. This solution can also be approached using the Markov
inequality as follows. Suppose the balls are distributed among the buckets in some particular way. The probability experiment is to select one of the buckets at random, with all buckets having equal probability. Let Y denote the number of balls in the randomly selected bucket. Then Y is a nonnegative random variable with
$E[Y] = \sum_{i=1}^{100} (\text{number of balls in bucket } i) \cdot \frac{1}{100} = \frac{200}{100} = 2,$
so by Markov's inequality, P{Y ≥ 5} ≤ 2/5 = 0.4. That is, the fraction of buckets with five or more balls is less than or equal to 0.4. Equality is achieved if and only if the only possible values of Y are zero and five, that is, if and only if each bucket is either empty or has exactly five balls.
Second, the Chebychev inequality states that if X is a random variable with finite mean µ and variance σ², then for any d > 0,
$P\{|X - \mu| \ge d\} \le \frac{\sigma^2}{d^2}. \qquad (2.12)$
The Chebychev inequality follows by applying the Markov inequality with Y = |X - µ|² and c = d². A slightly different way to write the Chebychev inequality is to let d = aσ, for any constant a > 0, to get
$P\{|X - \mu| \ge a\sigma\} \le \frac{1}{a^2}. \qquad (2.13)$
In words, this form of the Chebychev inequality states that the probability that a random variable differs from its mean by a or more standard deviations is less than or equal to 1/a².
Confidence Intervals The Chebychev inequality can be used to provide confidence intervals for
estimators. Confidence intervals are often given when some percentages are estimated based on
samples from a large population. For example, an opinion poll report might state that, based on a
survey of some voters, 64% favor a certain proposal, with polling accuracy ±5%. In this case, we
would call [59%, 69%] the confidence interval. Also, although it is not always made explicit, there is
usually a level of confidence associated with the confidence interval. For example, a 95% confidence
in a confidence interval means that, from the perspective of someone before the data is observed,
the probability the (random) confidence interval contains the fixed, true percentage is at least 95%.
(It does not mean, from the perspective of someone who knows the value of the confidence interval
computed from observed data, that the probability the true parameter is in the interval is at least
95%.)
In practice, the true fraction of voters that favor a certain proposition would be a given number,
say p. In order to estimate p we could select n voters at random, where n is much smaller than the
total population of voters. For example, a given poll might survey n = 200 voters to estimate the
fraction of voters, within a population of several thousand voters, that favor a certain proposition.
The resulting estimate of p would be p̂ = X/n, where X denotes the number of the n sampled
voters who favor the proposition. We shall discuss how a confidence interval with a given level of
confidence could be determined for this situation. Assuming the population is much larger than n,
it is reasonable to model X as a binomial random variable with parameters n and p. The mean of X is np, so Chebychev's inequality yields that for any constant a > 0:
$P\{|X - np| \ge a\sigma\} \le \frac{1}{a^2}.$
Another way to put it is:
$P\left\{\left|\frac{X}{n} - p\right| \ge \frac{a\sigma}{n}\right\} \le \frac{1}{a^2}, \quad \text{or, equivalently,} \quad P\left\{\left|\frac{X}{n} - p\right| < \frac{a\sigma}{n}\right\} \ge 1 - \frac{1}{a^2}.$
Still another way to put this, using p̂ = X/n and σ = √(np(1 - p)), is:
$P\left\{p \in \left(\hat{p} - a\sqrt{\tfrac{p(1-p)}{n}},\ \hat{p} + a\sqrt{\tfrac{p(1-p)}{n}}\right)\right\} \ge 1 - \frac{1}{a^2}. \qquad (2.14)$
Since p(1 - p) ≤ 1/4 for any value of p, (2.14) implies
$P\left\{p \in \left(\hat{p} - \frac{a}{2\sqrt{n}},\ \hat{p} + \frac{a}{2\sqrt{n}}\right)\right\} \ge 1 - \frac{1}{a^2}. \qquad (2.15)$
Again, in (2.15), it is the interval, not p, that is random. The larger the constant a is, the greater the probability that p is in the random interval. Table 2.2 shows some possible choices of a and the associated lower bound 1 - (1/a²), which is the confidence level associated with the interval. The
confidence interval given by (2.15) is on the conservative side; the probability on the left-hand side
of (2.15) may be much closer to one than the right-hand side. Often in practice other confidence
intervals are used which are based on different assumptions (see Example 3.6.10).
Example 2.9.2 Suppose the fraction p of telephone numbers that are busy in a large city at a given time is to be estimated by p̂ = X/n, where n is the number of phone numbers that will be tested and X will be the number of the tested numbers found to be busy. If p is to be estimated to within 0.1 with 96% confidence, how many telephone numbers should be sampled, based on (2.15)?
Table 2.2: Chebychev confidence levels for several choices of a.
a      1 - (1/a²)
2      75%
5      96%
10     99%
Solution: For 96% confidence we take a = 5, so the half-width of the confidence interval is a/(2√n) = 2.5/√n, which should be less than or equal to 0.1. This requires n ≥ (2.5/0.1)² = 625.
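The same calculation is easy to mirror in a few lines of Python (the variable names are ours); exact arithmetic with fractions avoids floating-point round-off.

```python
from fractions import Fraction

# Example 2.9.2: 96% confidence corresponds to a = 5 (since 1 - 1/a^2 = 0.96),
# and the half-width a/(2*sqrt(n)) must be at most 0.1, i.e. n >= (a/(2*0.1))^2.
a, half_width = 5, Fraction(1, 10)
n_required = (a / (2 * half_width)) ** 2
print(n_required)   # 625
```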
2.10 The law of total probability and Bayes formula
Events E1, . . . , Ek are said to form a partition of Ω if the events are mutually exclusive and Ω = E1 ∪ · · · ∪ Ek. Of course for a partition, P(E1) + · · · + P(Ek) = 1. More generally, for any event A, the law of total probability holds because A is the union of the mutually exclusive sets AE1, AE2, . . . , AEk:
P(A) = P(AE1) + · · · + P(AEk).
If P(Ei) ≠ 0 for each i, this can be written as
P(A) = P(A|E1)P(E1) + · · · + P(A|Ek)P(Ek).
Figure 2.8 illustrates the conditions of the law of total probability.
Figure 2.8: Partitioning a set A using a partition of Ω.
The definition of conditional probability and the law of total probability lead to Bayes formula for P(Ei|A) (if P(A) ≠ 0) in simple form:
$P(E_i|A) = \frac{P(AE_i)}{P(A)} = \frac{P(A|E_i)P(E_i)}{P(A)}, \qquad (2.16)$
or in expanded form:
$P(E_i|A) = \frac{P(A|E_i)P(E_i)}{P(A|E_1)P(E_1) + \cdots + P(A|E_k)P(E_k)}. \qquad (2.17)$
An important point about (2.16) is that it is a formula for P (Ei |A), whereas the law of total
probability uses terms of the form P (A|Ei ). In many instances, as illustrated in the following
examples, the values P (A|Ei ) are specified by the probability model, and (2.16) or (2.17) are used
for calculating P (Ei |A).
Example 2.10.1 There are three dice in a bag. One has one red face, another has two red faces,
and the third has three red faces. One of the dice is drawn at random from the bag, each die having
an equal chance of being drawn. The selected die is repeatedly rolled.
(a) What is the probability that red shows on the first roll?
(b) Given that red shows on the first roll, what is the conditional probability that red shows on
the second roll?
(c) Given that red shows on the first three rolls, what is the conditional probability that the selected
die has red on three faces?
Solution: Let Ei be the event that the die with i red faces is drawn from the bag, for i = 1, 2, or 3. Let Rj denote the event that red shows on the jth roll of the die.
(a) By the law of total probability,
$P(R_1) = P(R_1|E_1)P(E_1) + P(R_1|E_2)P(E_2) + P(R_1|E_3)P(E_3) = \frac{1}{6}\cdot\frac{1}{3} + \frac{2}{6}\cdot\frac{1}{3} + \frac{3}{6}\cdot\frac{1}{3} = \frac{1}{3}.$
(b) We need to find P(R2|R1), and we'll begin by finding P(R1R2). By the law of total probability,
$P(R_1R_2) = P(R_1R_2|E_1)P(E_1) + P(R_1R_2|E_2)P(E_2) + P(R_1R_2|E_3)P(E_3) = \left(\frac{1}{6}\right)^2\frac{1}{3} + \left(\frac{2}{6}\right)^2\frac{1}{3} + \left(\frac{3}{6}\right)^2\frac{1}{3} = \frac{7}{54}.$
Therefore, P(R2|R1) = P(R1R2)/P(R1) = (7/54)/(1/3) = 7/18. (Note: we have essentially derived/used Bayes formula.)
(c) We need to find P(E3|R1R2R3), and will do so by finding the numerator and denominator in the definition of P(E3|R1R2R3). The numerator is given by P(E3R1R2R3) = P(E3)P(R1R2R3|E3) = (1/3)(3/6)³ = 1/24. Using the law of total probability for the denominator yields
$P(R_1R_2R_3) = P(R_1R_2R_3|E_1)P(E_1) + P(R_1R_2R_3|E_2)P(E_2) + P(R_1R_2R_3|E_3)P(E_3) = \left(\frac{1}{6}\right)^3\frac{1}{3} + \left(\frac{2}{6}\right)^3\frac{1}{3} + \left(\frac{3}{6}\right)^3\frac{1}{3} = \frac{1}{18}.$
Therefore, P(E3|R1R2R3) = P(E3R1R2R3)/P(R1R2R3) = (1/24)/(1/18) = 18/24 = 3/4. (Again, we have essentially used Bayes formula.)
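The Bayes computation in this example is short enough to mirror directly in code; the sketch below (names ours) reproduces P(E3 | R1R2R3) = 3/4.

```python
from fractions import Fraction

# Example 2.10.1: the die with i red faces is drawn with prior 1/3, then rolled repeatedly.
priors = {i: Fraction(1, 3) for i in (1, 2, 3)}
p_red = {i: Fraction(i, 6) for i in (1, 2, 3)}

def posterior_after_reds(num_reds):
    """P(E_i | red on the first num_reds rolls), by Bayes formula (2.17)."""
    joint = {i: priors[i] * p_red[i] ** num_reds for i in priors}
    total = sum(joint.values())
    return {i: joint[i] / total for i in joint}

print(posterior_after_reds(3)[3])  # 3/4, matching part (c)
```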
Example 2.10.2 Consider a two stage experiment. First roll a die, and let X denote the number
showing. Then flip a fair coin X times, and let Y denote the total number of times heads shows.
Find P {Y = 3} and P (X = 3|Y = 3).
Solution: By the law of total probability,
$P\{Y = 3\} = \sum_{j=1}^{6} P\{Y = 3 \mid X = j\}\, P\{X = j\} = \frac{1}{6}\left[0 + 0 + \binom{3}{3}2^{-3} + \binom{4}{3}2^{-4} + \binom{5}{3}2^{-5} + \binom{6}{3}2^{-6}\right] = \frac{1}{6}\left[0 + 0 + \frac{1}{8} + \frac{1}{4} + \frac{10}{32} + \frac{20}{64}\right] = \frac{1}{6}.$
Also, $P\{X = 3, Y = 3\} = \frac{1}{6}\binom{3}{3}2^{-3} = \frac{1}{48}$, so $P(X = 3 \mid Y = 3) = \frac{P\{X = 3, Y = 3\}}{P\{Y = 3\}} = \frac{1/48}{1/6} = \frac{1}{8}.$
Example 2.10.3 According to the Center for Disease Control (CDC), "Compared to nonsmokers, men who smoke are about 23 times more likely to develop lung cancer and women who smoke are about 13 times more likely." (a) If you learn that a person is a woman who has been diagnosed
with lung cancer, and you know nothing else about the person, what is the probability she is a
smoker? In your solution, use the CDC information that roughly 15% of all women smoke.
(b) Suppose that in the USA, 15% of women are smokers, 18% of all adults are smokers, and half
of adults are women. What fraction of adult smokers in the USA are women?
Solution: (a) Let c denote the fraction of nonsmoking women who get lung cancer (over some time period). Then 13c is the fraction of smoking women who get lung cancer over the same time period. The total probability a typical woman gets lung cancer is the sum of the probability the woman is a smoker and gets lung cancer plus the probability she is a nonsmoker and gets lung cancer, or
P{a woman gets lung cancer during the time period} = (0.15)13c + (0.85)c.
Thus, the conditional probability a woman is a smoker given she got lung cancer is the first of the terms over the sum:
$\frac{(0.15)13c}{(0.15)13c + (0.85)c} = \frac{(0.15)13}{(0.15)13 + 0.85} = \frac{1.95}{2.80} \approx 70\%.$
(b) The fraction of adults that smoke, namely 18%, is the average of the fraction of women that smoke, 15%, and the fraction of men that smoke. It follows that 21% of men smoke because (15 + 21)/2 = 18. Thus, the ratio of the number of women that smoke to the total number of adults that smoke is 15/(15 + 21) = 15/36 = 5/12.
Example 2.10.4 Consider two boxes as shown in Figure 2.9. Box 1 has three black and two white balls, and Box 2 has two black and two white balls. In Step 1, a ball is selected at random from Box 1 and transferred to Box 2; in Step 2, a ball is selected at random from Box 2 and removed. Let W be the event that a white ball is transferred in Step 1, and let B be the event that a black ball is removed in Step 2. Find P(W), P(B), and P(W|B).
Solution: First, P(W) = 2/5. Second, by the law of total probability, P(B) = P(W)P(B|W) + P(W^c)P(B|W^c) = (2/5)(2/5) + (3/5)(3/5) = 13/25. Third,
$P(W|B) = \frac{P(WB)}{P(B)} = \frac{P(W)P(B|W)}{P(B)} = \frac{\frac{2}{5}\cdot\frac{2}{5}}{\frac{13}{25}} = \frac{4}{13}.$
=
=
(We just used Bayes formula, perhaps without even realizing it.) For this example it is interesting
to compare P (W |B) to P (W ). We might reason about the ordering as follows. Transferring a white
ball (i.e. event W ) makes removing a black ball in Step 2 (i.e. event B) less likely. So given that
B is true, we would expect the conditional probability of W to be smaller than the unconditional
probability. That is indeed the case (4/13 < 2/5).
As we've seen, the law of total probability can be applied to calculate the probability of an event, if there is a partition of the sample space. The law of total probability can also be used to compute the mean of a random variable. The conditional mean of a discrete-type random variable X given an event A is defined the same way as the original unconditional mean, but using the conditional pmf:
$E[X|A] = \sum_i u_i\, P(X = u_i \mid A),$
and, more generally, $E[g(X)|A] = \sum_i g(u_i)\, P(X = u_i \mid A)$.
These equations were used implicitly in Section 2.5 to derive the mean and variance of a geometrically distributed random variable. The law of total probability extends from probabilities to conditional expectations as follows. If E1, . . . , EJ is a partition of the sample space, and X is a random variable, then
$E[X] = \sum_{j=1}^{J} E[X|E_j]\, P(E_j).$
Example 2.10.5 Let 0 < p < 0.5. Suppose there are two biased coins. The first coin shows heads with probability p and the second coin shows heads with probability q, where q = 1 - p. Consider the following two stage experiment. First, select one of the two coins at random, with each coin being selected with probability one half, and then flip the selected coin n times. Let X be the number of times heads shows. Compute the pmf, mean, and standard deviation of X.
Solution: Let A be the event that the first coin is selected. By the law of total probability, for 0 ≤ k ≤ n,
$p_X(k) = P\{X = k \mid A\}P(A) + P\{X = k \mid A^c\}P(A^c) = \frac{1}{2}\binom{n}{k}p^k q^{n-k} + \frac{1}{2}\binom{n}{k}q^k p^{n-k}.$
The two conditional pmfs and the unconditional pmf of X are shown in Figure 2.10 for n = 24 and p = 1/3.
Figure 2.10: The pmfs of X for n = 24 and p = 1/3. Top left: conditional pmf of X given the first
coin is selected. Top right: conditional pmf of X given the second coin is selected. Bottom: The
unconditional pmf of X.
The two plots in the top of the figure are the binomial pmf for n = 24 and parameter p = 1/3
and the binomial pmf for n = 24 and the parameter p = 2/3. The bottom plot in the figure shows
the resulting pmf of X. The mean of X can be calculated using the law of total probability:
$E[X] = E[X|A]P(A) + E[X|A^c]P(A^c) = \frac{np}{2} + \frac{nq}{2} = \frac{n(p+q)}{2} = \frac{n}{2}.$
To calculate Var(X) we will use the fact Var(X) = E[X²] - (E[X])², and apply the law of total probability:
$E[X^2] = E[X^2|A]P(A) + E[X^2|A^c]P(A^c).$
Here E[X²|A] is the second moment of the binomial distribution with parameters n and p, which is equal to the mean squared plus the variance: E[X²|A] = (np)² + npq. Similarly, E[X²|A^c] = (nq)² + nqp. Therefore,
$E[X^2] = \frac{(np)^2 + npq}{2} + \frac{(nq)^2 + nqp}{2} = n^2\,\frac{p^2 + q^2}{2} + npq,$
so
$\mathrm{Var}(X) = E[X^2] - E[X]^2 = n^2\,\frac{p^2 + q^2}{2} + npq - \frac{n^2}{4} = \left(\frac{n(1-2p)}{2}\right)^2 + np(1-p).$
Note that
$\sigma_X = \sqrt{\mathrm{Var}(X)} \ge \frac{(1-2p)n}{2},$
which for fixed p grows linearly with n. In comparison, the standard deviation of a binomial random variable with parameters n and p is √(np(1 - p)), which grows only proportionally to √n as n increases with p fixed.
2.11 Binary hypothesis testing
The basic framework for binary hypothesis testing is illustrated in Figure 2.11: one of two hypotheses, H1 or H0, is in effect; the system generates an observation X; and a decision rule, applied to the observed value of X, declares either H1 or H0. It is assumed that if hypothesis Hi is true, then X has pmf pi. The two pmfs can be displayed as the two rows of a likelihood matrix, with one column for each possible value of X.
In practice, the numbers in the table might be based on data accumulated from past experiments
when either one or the other hypothesis is known to be true. As mentioned above, a decision
rule specifies, for each possible observation, which hypothesis is declared. A decision rule can
be conveniently displayed on the likelihood matrix by underlining one entry in each column, to
specifying which hypothesis is to be declared for each possible value of X. An example of a decision
rule is shown below, where H1 is declared whenever X 1. For example, if X = 2 is observed,
then H1 is declared, because the entry underlined under X = 2 is in the H1 row of the likelihood
matrix.
        X = 0   X = 1   X = 2   X = 3
H1       0.0     0.1     0.3     0.6
H0       0.4     0.3     0.2     0.1
(The underlined entries, indicating the decision rule used for this example, are 0.4 in the H0 row under X = 0, and 0.1, 0.3, 0.6 in the H1 row under X = 1, 2, 3.)
Since there are two possibilities for which hypothesis is true, and two possibilities for which hypothesis is declared, there are four possible outcomes:
Hypothesis H0 is true and H0 is declared.
Hypothesis H1 is true and H1 is declared.
Hypothesis H0 is true and H1 is declared. This is called a false alarm.²
Hypothesis H1 is true and H0 is declared. This is called a miss.
The conditional probability of a false alarm given that H0 is true is denoted pfalse alarm, and the conditional probability of a miss given that H1 is true is denoted pmiss.
2.11.1 Maximum likelihood (ML) decision rule
The ML decision rule declares the hypothesis which maximizes the probability (or likelihood) of the observation. Operationally, the ML decision rule can be stated as follows: underline the larger entry in each column of the likelihood matrix. If the entries in a column of the likelihood matrix
are identical, then either can be underlined. The choice may depend on other considerations such
as whether we wish to minimize pfalse alarm or pmiss . The ML rule for the example likelihood matrix
above is the following:
              X = 0    X = 1    X = 2    X = 3
    H1         0.0      0.1      0.3      0.6
    H0         0.4      0.3      0.2      0.1
(Underlines indicate the ML decision rule: 0.4 and 0.3 are underlined in the H0 row for X = 0 and X = 1, and 0.3 and 0.6 are underlined in the H1 row for X = 2 and X = 3, so the ML rule declares H1 whenever X ≥ 2.)
It is easy to check that for the ML decision rule, p_false alarm = 0.2 + 0.1 = 0.3 and p_miss = 0.0 + 0.1 = 0.1.
There is another way to express the ML decision rule. Note that for two positive numbers a and b, the statement a > b is equivalent to the statement a/b > 1. Thus, the ML rule can be rewritten in a form called a likelihood ratio test (LRT) as follows. Define the likelihood ratio Λ(k) for each possible observation k as the ratio of the two conditional probabilities:

Λ(k) = p_1(k) / p_0(k).

The ML rule is thus equivalent to deciding that H1 is true if Λ(X) > 1 and deciding H0 is true if Λ(X) < 1. The ML rule can be compactly written as

Λ(X)  > 1 : declare H1 is true
      < 1 : declare H0 is true.
We shall see that the other decision rule discussed below, the MAP rule, can also be expressed as an LRT, but with the threshold 1 changed to some other positive constant τ. The LRT with threshold τ can be written as

Λ(X)  > τ : declare H1 is true
      < τ : declare H0 is true.
Note that if the threshold τ is increased, then there are fewer observations that lead to deciding H1 is true. Thus, as τ increases, p_false alarm decreases and p_miss increases. For most binary hypothesis testing problems there is no rule that simultaneously makes both p_false alarm and p_miss small. In a sense, the LRTs are the best possible family of rules, and the threshold τ can be used to select a given operating point on the tradeoff between the two error probabilities. As noted above, the ML rule is an LRT with threshold τ = 1.
2.11.2 Maximum a posteriori probability (MAP) decision rule
The other decision rule we discuss requires the computation of joint probabilities such as P({X = 1} ∩ H1). For brevity we write this probability as P(H1, X = 1). Such probabilities cannot be deduced from the likelihood matrix alone. Rather, it is necessary for the system designer to assume some values for P(H0) and P(H1). Let the assumed value of P(Hi) be denoted by π_i, so π_0 = P(H0) and π_1 = P(H1). The probabilities π_0 and π_1 are called prior probabilities, because they are the probabilities assumed prior to when the observation is made.
Together the conditional probabilities listed in the likelihood matrix and the prior probabilities determine the joint probabilities P(Hi, X = k), because P(Hi, X = k) = π_i p_i(k). The joint probability matrix is the matrix of joint probabilities P(Hi, X = k). For our first example, suppose π_0 = 0.8 and π_1 = 0.2. Then the joint probability matrix is given by

              X = 0    X = 1    X = 2    X = 3
    H1         0.00     0.02     0.06     0.12
    H0         0.32     0.24     0.16     0.08

Note that the row for Hi of the joint probability matrix is π_i times the corresponding row of the likelihood matrix. Since the row sums for the likelihood matrix are one, the sum for row Hi of the joint probability matrix is π_i. Therefore, the sum of all entries in the joint probability matrix is one. The joint probability matrix can be viewed as a Venn diagram.
Conditional probabilities such as P(H1|X = 2) and P(H0|X = 2) are called a posteriori probabilities, because they are probabilities that an observer would assign to the two hypotheses after making the observation (in this case observing that X = 2). Given an observation, such as X = 2, the maximum a posteriori (MAP) decision rule chooses the hypothesis with the larger conditional probability. By Bayes' formula,

P(H1|X = 2) = P(H1, X = 2)/P(X = 2) = P(H1, X = 2)/(P(H1, X = 2) + P(H0, X = 2)) = 0.06/(0.06 + 0.16) ≈ 0.273.

That is, P(H1|X = 2) is the top number in the column for X = 2 in the joint probability matrix divided by the sum of the numbers in the column for X = 2. Similarly, the conditional probability P(H0|X = 2) is the bottom number in the column for X = 2 divided by the sum of the numbers in the column for X = 2. Since the denominators are the same (both denominators are equal to P{X = 2}), it follows that whether P(H1|X = 2) > P(H0|X = 2) is equivalent to whether the top entry in the column for X = 2 is greater than the bottom entry in the column for X = 2.
Thus, the MAP decision rule can be specified by underlining the larger entry in each column of
the joint probability matrix. For our original example, the MAP rule is given by
              X = 0    X = 1    X = 2    X = 3
    H1         0.00     0.02     0.06     0.12
    H0         0.32     0.24     0.16     0.08
(Underlines indicate the MAP decision rule: the larger entry in each column is in the H0 row for X = 0, 1, 2 and in the H1 row for X = 3, so the MAP rule declares H1 only when X = 3.)
Thus, if the observation is X = k, the MAP rule declares hypothesis H1 is true if π_1 p_1(k) > π_0 p_0(k), or equivalently if Λ(k) > π_0/π_1, where Λ is the likelihood ratio defined above. Therefore, the MAP rule is equivalent to the LRT with threshold τ = π_0/π_1.
If π_1 = π_0, the prior is said to be uniform, because it means the hypotheses are equally likely. For the uniform prior the threshold for the MAP rule is one, and the MAP rule is the same as the ML rule. Does it make sense that if π_0 > π_1, then the threshold for the MAP rule (in LRT form) is greater than one? Indeed it does, because a larger threshold value in the LRT means there are fewer observations leading to deciding H1 is true, which is appropriate behavior if π_0 > π_1.
The MAP rule has a remarkable optimality property, as we now explain. The average error probability, which we call pe, for any decision rule can be written as pe = π_0 p_false alarm + π_1 p_miss.
A decision rule is specified by underlining one number from each column of the joint probability
matrix. The corresponding pe is the sum of all numbers in the joint probability matrix that are
not underlined. From this observation it easily follows that, among all decision rules, the MAP
decision rule is the one that minimizes pe . That is why some books call the MAP rule the minimum
probability of error rule.
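These rules are easy to mechanize. The following sketch, not part of the original text and assuming Python with NumPy, applies the ML and MAP rules to the running example and evaluates p_false alarm, p_miss, and pe for each, confirming that the MAP rule achieves the smaller pe for the assumed priors.

    import numpy as np

    # Likelihood matrix for the running example: p1(k) = P(X=k|H1), p0(k) = P(X=k|H0), k = 0,...,3.
    p1 = np.array([0.0, 0.1, 0.3, 0.6])
    p0 = np.array([0.4, 0.3, 0.2, 0.1])
    pi0, pi1 = 0.8, 0.2                        # assumed prior probabilities

    ml_declares_H1  = p1 > p0                  # ML rule: larger entry in each column of the likelihood matrix
    map_declares_H1 = pi1 * p1 > pi0 * p0      # MAP rule: larger entry in each column of the joint probability matrix

    def error_probs(declares_H1):
        p_false_alarm = p0[declares_H1].sum()      # P(declare H1 | H0)
        p_miss = p1[~declares_H1].sum()            # P(declare H0 | H1)
        return p_false_alarm, p_miss, pi0 * p_false_alarm + pi1 * p_miss

    print("ML :", error_probs(ml_declares_H1))     # (0.3, 0.1, pe = 0.26)
    print("MAP:", error_probs(map_declares_H1))    # (0.1, 0.4, pe = 0.16)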
Examples are given in the remainder of this section, illustrating the use of ML and MAP decision
rules for hypothesis testing.
Example 2.11.1 Suppose you have a coin and you know that either H1: the coin is biased, showing heads on each flip with probability 2/3, or H0: the coin is fair. Suppose you flip the coin five times. Let X be the number of times heads shows. Describe the ML and MAP decision rules, and find p_false alarm, p_miss, and pe for both of them, using the prior (π_0, π_1) = (0.2, 0.8) for the MAP rule and for defining pe for both rules.
Solution: The rows of the likelihood matrix consist of the pmf of the binomial distribution with n = 5 and p = 2/3 for H1 and p = 1/2 for H0:

            X = 0        X = 1            X = 2              X = 3             X = 4         X = 5
   H1      (1/3)^5   5(2/3)(1/3)^4   10(2/3)^2(1/3)^3   10(2/3)^3(1/3)^2   5(2/3)^4(1/3)    (2/3)^5
   H0      (1/2)^5     5(1/2)^5         10(1/2)^5          10(1/2)^5          5(1/2)^5      (1/2)^5

The binomial coefficients cancel in the likelihood ratio, so

Λ(k) = p_1(k)/p_0(k) = (2/3)^k (1/3)^{5−k} / (1/2)^5 = 2^k (2/3)^5 = 2^k / 7.6   (approximately).

The ML rule is the LRT with threshold one: declare H1 whenever 2^X ≥ 7.6, that is, whenever X ≥ 3. The MAP rule is the LRT with threshold τ = π_0/π_1 = 0.25: declare H1 whenever 2^X ≥ (0.25)(7.6) = 1.9, that is, whenever X ≥ 1.
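To fill in the numerical error probabilities requested in the example, the following sketch, not part of the original text and assuming Python with NumPy and SciPy, evaluates both rules.

    import numpy as np
    from scipy.stats import binom

    k = np.arange(6)
    p1 = binom.pmf(k, 5, 2/3)      # likelihood under H1 (biased coin)
    p0 = binom.pmf(k, 5, 1/2)      # likelihood under H0 (fair coin)
    pi0, pi1 = 0.2, 0.8

    for name, declares_H1 in [("ML ", p1 > p0), ("MAP", pi1 * p1 > pi0 * p0)]:
        pfa = p0[declares_H1].sum()
        pmiss = p1[~declares_H1].sum()
        pe = pi0 * pfa + pi1 * pmiss
        print(name, np.flatnonzero(declares_H1), pfa, pmiss, pe)
    # ML  declares H1 for X >= 3: pfa = 0.5,   pmiss ~ 0.210, pe ~ 0.268
    # MAP declares H1 for X >= 1: pfa ~ 0.969, pmiss ~ 0.004, pe ~ 0.197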
Example 2.11.2 (Detection problem with Poisson distributed observations) A certain deep space transmitter uses on-off modulation of a laser to send a bit, with value either zero or one. If the bit is zero, the number of photons, X, arriving at the receiver has the Poisson distribution with mean λ_0 = 2; and if the bit is one, X has the Poisson distribution with mean λ_1 = 6. A decision rule is needed to decide, based on observation of X, whether the bit was a zero or a one. Describe (a) the ML decision rule, and (b) the MAP decision rule under the assumption that sending a zero is a priori five times more likely than sending a one (i.e. π_0/π_1 = 5). Express both rules as directly in terms of X as possible.
Solution: The ML rule is to decide a one is sent if Λ(X) ≥ 1, where Λ is the likelihood ratio function, defined by

Λ(k) = P(X = k | one is sent) / P(X = k | zero is sent) = (e^{−λ_1} λ_1^k / k!) / (e^{−λ_0} λ_0^k / k!) = (λ_1/λ_0)^k e^{−(λ_1 − λ_0)} = 3^k e^{−4} = 3^k / 54.6.

Therefore, the ML decision rule is to decide a one is sent if 3^X ≥ 54.6, or equivalently, if X ≥ 4.
The MAP rule is to decide a one is sent if Λ(X) ≥ π_0/π_1, where Λ is the likelihood ratio already found. So the MAP rule with π_0/π_1 = 5 decides a one is sent if 3^X/54.6 ≥ 5, or equivalently, if X ≥ 6.
Note that when X = 4 or X = 5, the ML rule decides that a one was transmitted, not a zero, but
because zeroes are so much more likely to be transmitted than ones, the MAP rule decides in favor
of a zero in this case.
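The two thresholds can be checked numerically. The short sketch below is not part of the original text; it assumes Python with NumPy and SciPy.

    import numpy as np
    from scipy.stats import poisson

    lam0, lam1 = 2, 6
    k = np.arange(0, 21)                                   # enough of the support for this example
    Lambda = poisson.pmf(k, lam1) / poisson.pmf(k, lam0)   # likelihood ratio, equal to 3**k * exp(-4)

    print(k[Lambda >= 1].min())      # ML rule threshold on X: prints 4
    print(k[Lambda >= 5].min())      # MAP rule threshold with pi0/pi1 = 5: prints 6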
Example 2.11.3 (Sensor fusion) Two motion detectors are used to detect the presence of a person
in a room, as part of an energy saving temperature control system. The first sensor outputs a value
X and the second sensor outputs a value Y . Both outputs have possible values {0, 1, 2}, with larger
numbers tending to indicate that a person is present. Let H0 be the hypothesis a person is absent
and H1 be the hypothesis a person is present. The likelihood matrices for X and for Y are shown:
              X = 0    X = 1    X = 2                    Y = 0    Y = 1    Y = 2
    H1         0.1      0.3      0.6           H1         0.1      0.1      0.8
    H0         0.8      0.1      0.1           H0         0.7      0.2      0.1
For example, P (Y = 2|H1 ) = 0.8. Suppose, given one of the hypotheses is true, the sensors provide
conditionally independent readings, so that
P(X = i, Y = j | Hk) = P(X = i | Hk) P(Y = j | Hk) for i, j ∈ {0, 1, 2} and k ∈ {0, 1}.
(a) Find the likelihood matrix for the observation (X, Y ) and indicate the ML decision rule. To be
definite, break ties in favor of H1 .
(b) Find pfalse alarm and pmiss for the ML rule found in part (a).
(c) Suppose, based on past experience, prior probabilities π_1 = P(H1) = 0.2 and π_0 = P(H0) = 0.8
are assigned. Compute the joint probability matrix and indicate the MAP decision rule.
(d) For the MAP decision rule, compute p_false alarm, p_miss, and the unconditional probability of error pe = π_0 p_false alarm + π_1 p_miss.
(e) Using the same priors as in part (c), compute the unconditional error probability, pe , for the
ML rule from part (a). Is it smaller or larger than pe found for the MAP rule in (d)?
Solution: (a) By the assumed conditional independence, each entry of the likelihood matrix for the pair (X, Y) is the product of the corresponding entries of the two individual likelihood matrices:

 (X, Y)   (0,0)  (0,1)  (0,2)  (1,0)  (1,1)  (1,2)  (2,0)  (2,1)  (2,2)
   H1      0.01   0.01   0.08   0.03   0.03   0.24   0.06   0.06   0.48
   H0      0.56   0.16   0.08   0.07   0.02   0.01   0.07   0.02   0.01

The ML decisions are indicated by the underlined elements: the larger number in each column is underlined, with the tie in the column for (0, 2) broken in favor of H1, as specified in the problem statement. Thus the ML rule declares H1 for the observations (0, 2), (1, 1), (1, 2), (2, 1), and (2, 2). Note that the row sums are both one.
(b) For the ML rule, pfalse alarm is the sum of the entries in the row for H0 in the likelihood matrix
that are not underlined. So pfalse alarm = 0.08 + 0.02 + 0.01 + 0.02 + 0.01 = 0.14.
For the ML rule, pmiss is the sum of the entries in the row for H1 in the likelihood matrix that are
not underlined. So pmiss = 0.01 + 0.01 + 0.03 + 0.06 = 0.11.
(c) The joint probability matrix is given by

 (X, Y)   (0,0)   (0,1)   (0,2)   (1,0)   (1,1)   (1,2)   (2,0)   (2,1)   (2,2)
   H1      0.002   0.002   0.016   0.006   0.006   0.048   0.012   0.012   0.096
   H0      0.448   0.128   0.064   0.056   0.016   0.008   0.056   0.016   0.008
(The matrix specifies P (X = i, Y = j, Hk ) for each hypothesis Hk and for each possible observation
value (i, j). The 18 numbers in the matrix sum to one. The MAP decisions are indicated by
the underlined elements in the joint probability matrix. The larger number in each column is
underlined.)
(d) For the MAP rule,
p_false alarm = P((X, Y) ∈ {(1, 2), (2, 2)} | H0) = 0.01 + 0.01 = 0.02,
and
p_miss = P((X, Y) ∉ {(1, 2), (2, 2)} | H1) = 1 − P((X, Y) ∈ {(1, 2), (2, 2)} | H1) = 1 − 0.24 − 0.48 = 0.28.
Thus, for the MAP rule, pe = (0.8)(0.02) + (0.2)(0.28) = 0.072. (This pe is also the sum of the
probabilities in the joint probability matrix that are not underlined.)
(e) Using the conditional probabilities found in (a) and the given values of π_0 and π_1 yields that for the ML rule: pe = (0.8)(0.14) + (0.2)(0.11) = 0.134, which is larger than the value 0.072 for the MAP rule, as expected because of the optimality of the MAP rule for the given priors.
2.12 Reliability
Reliability of complex systems is of central importance to many engineering design problems. Extensive terminology, models, and graphical representations have been developed within many different
fields of engineering, from construction of major structures to logistics of complex operations. A
common theme is to try to evaluate the reliability of a large system by recursively evaluating the
reliability of its subsystems. Often no more underlying probability theory is required beyond that
covered earlier in this chapter. However, intuition can be sharpened by considering the case that
many of the events have very small probabilities.
2.12.1 Union bound
A general tool for bounding failure probabilities is the following. Given two events A and B, the union bound is

P(A ∪ B) ≤ P(A) + P(B).

A proof of the bound is that P(A) + P(B) − P(A ∪ B) = P(AB) ≥ 0. If the bound, P(A) + P(B), is used as an approximation to P(A ∪ B), the error or gap is P(AB). If A and B have large probabilities, the gap can be significant; P(A) + P(B) might even be larger than one. However, in general, P(AB) ≤ min{P(A), P(B)}, so the bound is never larger than two times P(A ∪ B). Better yet, if A and B are independent and have small probabilities, then P(AB) = P(A)P(B) ≪ P(A ∪ B) (here ≪ means much smaller than). For example, if P(A) = P(B) = 0.001 and A and B are independent, then P(A ∪ B) = 0.002 − 0.000001, which is very close to the union bound, 0.002.
The union bound can be extended from a bound on the union of two events to a bound on the union of m events for any m ≥ 2. The union bound on the probability of the union of a set of m events is

P(A1 ∪ A2 ∪ ⋯ ∪ Am) ≤ P(A1) + P(A2) + ⋯ + P(Am).        (2.18)

The union bound for m events follows from the union bound for two events and an argument by induction on m. The union bound is applied in the following sections.
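As a small illustration, not from the original text and written in plain Python, the union bound can be compared with the exact probability of a union of independent events, confirming that the bound is essentially tight when the individual probabilities are small.

    p, m = 0.001, 5
    exact = 1 - (1 - p) ** m       # P(A1 ∪ ... ∪ Am) for m independent events, each of probability p
    bound = m * p                  # union bound (2.18)
    print(exact, bound)            # 0.004990..., 0.005: the bound is nearly exact for small p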
2.12.2 Network outage probability
The network outage problem addressed here is one of many different ways to describe and analyze system reliability involving serial and parallel subsystems. An s − t network consists of a source node s, a terminal node t, possibly some additional nodes, and some links that connect pairs of nodes. Figure 2.12 pictures five such networks, network A through network E. Network A has two links, link 1 and link 2, in parallel. Network B has two links in series. Network C has two parallel paths, with each path consisting of two links in series. Network D has two stages of two parallel links. Network E has five links.
Suppose each link i fails with probability p_i. Let F_i be the event that link i fails, so p_i = P(F_i). Assume that for a given network, the links fail independently; that is, the F_i's are independent events. Network outage is said to occur if at least one link fails along every s − t path, where an s − t path is a set of links that connects s to t. Let F denote the event of network outage. We will compute P(F) for each of the networks in turn.
For network A (two links in parallel), outage occurs only if both links fail, so P(F) = p_1 p_2. For network B (two links in series), outage occurs if at least one link fails, so P(F) = 1 − q_1 q_2 = p_1 + p_2 − p_1 p_2, where q_i = 1 − p_i. For network C, outage occurs if and only if both series paths fail, so P(F) = (p_1 + p_2 − p_1 p_2)(p_3 + p_4 − p_3 p_4), which is approximately 0.000004 if p_i = 0.001 for all i. For network D, the event that the first stage fails can be written as F_1 F_3, which has probability p_1 p_3, and the event that the second stage fails can be written as F_2 F_4, which has probability p_2 p_4. Since the two stages involve separate sets of links, the first stage fails independently of the second stage. So the network outage probability is P{first stage fails} + P{second stage fails} − P{both stages fail}. That is, for network D, P(F) = p_1 p_3 + p_2 p_4 − p_1 p_2 p_3 p_4. If p_i = 0.001 for all i, P(F) ≈ 0.000002, or about half the outage
probability for network C.
Finally, consider network E. One way to approach the computation of P(F) is to use the law of total probability, and consider two cases: either link 5 fails or it doesn't fail. So P(F) = P(F|F_5)P(F_5) + P(F|F_5ᶜ)P(F_5ᶜ). Now P(F|F_5) is the network outage probability if link 5 fails. If link 5 fails we can erase it from the picture, and the remaining network is identical to network C. So P(F|F_5) for network E is equal to P(F) for network C. If link 5 does not fail, the network becomes equivalent to network D. So P(F|F_5ᶜ) for network E is equal to P(F) for network D. Combining these yields:

P(outage in network E) = p_5 P(outage in network C) + (1 − p_5) P(outage in network D).
Another way to calculate P(F) for network E is by closer comparison to network D. If the same four links used in network D are used along with link 5 in network E, then network E fails whenever network D fails. Moreover, some thought shows there are exactly two ways network E can fail such that network D does not fail: namely, links 1, 4, and 5 are the only ones that fail, or links 2, 3, and 5 are the only ones that fail. So P(F) for network E is equal to P(F) for network D plus p_1 q_2 q_3 p_4 p_5 + q_1 p_2 p_3 q_4 p_5, where q_i = 1 − p_i for all i. This yields the outage probability given in Figure 2.12. For example, if p_i = 0.001 for 1 ≤ i ≤ 5, then P(F) = 0.000002001995002 ≈ 0.000002. The network outage probability is determined mainly by the probability that both links 1 and 3 fail or both links 2 and 4 fail. Thus, the outage probability of network E is about the same as for network D.
Applying the union bound. The union bound doesn't apply and isn't needed for network A. For network B, F = F_1 ∪ F_2. The union bound for two events implies that P(F) ≤ p_1 + p_2. The numerical value is 0.002 if p_1 = p_2 = 0.001.
For network C, F = F_1F_3 ∪ F_1F_4 ∪ F_2F_3 ∪ F_2F_4, so that the union bound for m = 4 yields P(F) ≤ p_1p_3 + p_1p_4 + p_2p_3 + p_2p_4 = (p_1 + p_2)(p_3 + p_4). This upper bound could also be obtained by first upper bounding the probability of failure of the upper branch by p_1 + p_2 and the lower branch by p_3 + p_4, and multiplying these bounds because the branches are independent. The numerical value is 0.000004 in case p_i = 0.001 for 1 ≤ i ≤ 4.
For network D, F = F_1F_3 ∪ F_2F_4. The union bound for two events yields P(F) ≤ p_1p_3 + p_2p_4. The numerical value is 0.000002 in case p_i = 0.001 for 1 ≤ i ≤ 4.
For network E, F = F_1F_3 ∪ F_2F_4 ∪ F_1F_5F_4 ∪ F_3F_5F_2. The union bound for four events yields P(F) ≤ p_1p_3 + p_2p_4 + p_1p_5p_4 + p_3p_5p_2. The numerical value is 0.000002002 in case p_i = 0.001 for 1 ≤ i ≤ 5, which is approximately 0.000002.
Note that for all four networks, B through E, the numerical value of the union bound, in case p_i = 0.001 for all i, is close to the exact value of the network outage probability computed above.
Calculation by exhaustive listing of network states Another way to calculate the outage
probability of an s t network is to enumerate all the possible network states, determine which
ones correspond to network outage, and then add together the probabilities of those states. The
method becomes intractable for large networks,3 but it is often useful for small networks and it
sharpens intuition about most likely failure modes.
For example, consider network C. Since there are four links, the network state can be represented by a length four binary string such as 0110, such that the ith bit in the string is one if link i fails and zero otherwise. This computation is illustrated in Table 2.3, where we let q_i = 1 − p_i. For example, if the network
Table 2.3: A table for calculating P(F) for network C.

  State   Network fails?   probability
  0000         no
  0001         no
  0010         no
  0011         no
  0100         no
  0101         yes          q1 p2 q3 p4
  0110         yes          q1 p2 p3 q4
  0111         yes          q1 p2 p3 p4
  1000         no
  1001         yes          p1 q2 q3 p4
  1010         yes          p1 q2 p3 q4
  1011         yes          p1 q2 p3 p4
  1100         no
  1101         yes          p1 p2 q3 p4
  1110         yes          p1 p2 p3 q4
  1111         yes          p1 p2 p3 p4
state is 0110, meaning that links 2 and 3 fail and links 1 and 4 don't, then the network fails. The
probability of this state is q1 p2 p3 q4 . For given numerical values of p1 through p4 , numerical values
can be computed for the last column of the table, and the sum of those values is P (F ).
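The same enumeration is easy to automate. The following sketch is not part of the original text; it is plain Python and assumes, as in Table 2.3, that network C consists of two series paths, links 1 and 2 in the upper path and links 3 and 4 in the lower path.

    from itertools import product

    p = [0.001] * 4                      # failure probabilities of links 1-4
    q = [1 - pi for pi in p]

    def network_C_fails(state):
        f1, f2, f3, f4 = state           # entry is 1 if the corresponding link fails
        return (f1 or f2) and (f3 or f4) # both series paths must be broken

    P_F = 0.0
    for state in product([0, 1], repeat=4):      # all 16 network states, as in Table 2.3
        if network_C_fails(state):
            prob = 1.0
            for fails, pi, qi in zip(state, p, q):
                prob *= pi if fails else qi
            P_F += prob
    print(P_F)    # about 3.996e-06, matching (p1 + p2 - p1*p2)*(p3 + p4 - p3*p4)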
2.12.3 Distribution of the capacity of a flow network
An s − t flow network is an s − t network such that each link has a capacity. Two s − t flow networks are shown in Figure 2.13. We assume as before that each link i fails with probability p_i. If a link fails, it cannot carry any flow. If a link does not fail, it can carry flow at a rate up to its capacity.

³ From a computational complexity point of view, the problem of computing the outage probability, also called the reliability problem, has been shown to be P-space complete, meaning exact solution for large networks is probably not feasible.
The pmf of the capacity, Y, of a flow network can be calculated by listing all possible network states, determining the capacity for each state, and summing the probabilities of the states with that capacity. The first few lines of such a table are shown in Table 2.4. For example, pY(10) is the sum of all probabilities in rows such that the capacity shown in the row is 10.
Table 2.4: A table for calculating the distribution of capacity of network G.

  State    capacity   probability
  00000       30      q1 q2 q3 q4 q5
  00001       20      q1 q2 q3 q4 p5
  00010       20      q1 q2 q3 p4 q5
  00011       10      q1 q2 q3 p4 p5
  00100       10      q1 q2 p3 q4 q5
  00101       10      q1 q2 p3 q4 p5
  00110       10      q1 q2 p3 p4 q5
  00111       10      q1 q2 p3 p4 p5
    ...       ...     ...
  11111        0      p1 p2 p3 p4 p5

2.12.4
The array shown in the figure below illustrates a two-dimensional error detecting code for use in digital systems such as computer memory or digital communication systems. There are 49 data bits, arranged in a 7 × 7 data block; there is a row parity bit for each of the seven rows, a column parity bit for each of the seven columns, and an overall parity bit, for a total of 64 bits. The parity bits are chosen so that every row and every column of the resulting 8 × 8 array has even parity.

[Figure: an 8 × 8 array consisting of a 7 × 7 data block, a column of row parity bits, a row of column parity bits, and an overall parity bit.]

When the array is stored or transmitted, some of the 64 bits may be flipped; such flipped bits are called bit errors, and the binary array with ones at the locations of the bits in error is called the error pattern. There are 2^64 error patterns, including the pattern with no errors. When the data is needed, the reader
or receiver first checks to see if all rows and columns have even parity. If yes, the data is deemed
to be correct. If no, the data is deemed to be corrupted, and we say the errors were detected.
Since there may be backup copies in storage, or the data can possibly be retransmitted, the effect
of errors is not severe if they are detected. But the effect of errors can be severe if they are not
detected, which can happen for some nonzero error patterns.
It is explained next that any error pattern with either one, two, or three bit errors is detected,
but certain error patterns with four bit errors are not detected. An error pattern is undetected if
and only if the parity of each row and column is even, which requires that the parity of the entire
error pattern be even. Thus, any error pattern with an odd number of bit errors is detected. In
particular, error patterns with one or three bit errors are detected. If there are two bit errors, they
are either in different rows or in different columns (or both). If they are in different rows then those
rows have odd parity, and if they are in different columns then those columns have odd parity. So,
either way, the error pattern is detected. A pattern with four bit errors is undetected if and only
if there are two rows and two columns such that the bit errors are at the four intersections of the
two rows and two columns.
We now discuss how to bound the probability of undetected error, assuming that each bit is in error with probability 0.001, independently of all other bits. Let Y denote the number of bits in error. As already discussed, there can be undetected bit errors only if Y ≥ 4. So the probability there exist undetected bit errors is less than or equal to P{Y ≥ 4}. We have Y = X_1 + ⋯ + X_64, where X_i is one if the ith bit in the array is in error, and X_i = 0 otherwise. By assumption, the random variables X_1, ..., X_64 are independent Bernoulli random variables with parameter p = 0.001, so Y is a binomial random variable with parameters n = 64 and p = 0.001. The event {Y ≥ 4} can be expressed as the union of (64 choose 4) events of the form {X_i = 1 for all i ∈ A}, indexed by the subsets A of {1, ..., 64} with |A| = 4. For such an A, P{X_i = 1 for all i ∈ A} = ∏_{i∈A} P{X_i = 1} = p^4. So the union bound with m = (64 choose 4) yields P{Y ≥ 4} ≤ (64 choose 4) p^4. (This is in contrast to the fact that P{Y = 4} = (64 choose 4) p^4 (1 − p)^60.) Thus,

P(undetected errors) ≤ P{Y ≥ 4} ≤ (64 choose 4) p^4 = (635376) p^4 = 0.635376 × 10^{−6}.
A considerably tighter bound can be obtained with a little more work. Four errors go undetected if and only if they are at the intersection points of two columns and two rows. There are (8 choose 2) = 28 ways to choose two columns and (8 choose 2) = 28 ways to choose two rows, so there are 28^2 = 784 error patterns with four errors that go undetected. The probability of any given one of those patterns occurring is less than p^4, so by the union bound, the probability that the actual error pattern is an undetected pattern of four bit errors is less than or equal to (784) p^4. Any pattern of five bit errors is detected, because any pattern with an odd number of bit errors is detected. So the only other way undetected errors can occur is if Y ≥ 6, and P{Y ≥ 6} can again be bounded by the union bound. Thus,

P(undetected errors) ≤ (8 choose 2)^2 10^{−12} + (64 choose 6) 10^{−18}.
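The two bounds are easy to evaluate numerically; the sketch below is not part of the original text and uses only the Python standard library.

    from math import comb

    p = 0.001
    crude   = comb(64, 4) * p**4                          # union bound over all four-element subsets
    tighter = comb(8, 2)**2 * p**4 + comb(64, 6) * p**6   # undetected four-error patterns plus bound on P{Y >= 6}
    print(crude)     # 6.35376e-07
    print(tighter)   # about 8.6e-10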
2.12.5 Reliability of a single backup
Suppose there are two machines (such as computers, generators, vehicles, etc.) each used to back up
the other. Suppose each day, given a machine was up the day before or was finished being repaired
the day before or that it is the first day, the machine is up on the given day with probability 0.999
and it goes down with probability 0.001. If a machine goes down on a given day, then the machine
stays down for repair for a total of five days. The two machines go up and down independently.
We would like to estimate the mean time until there is a day such that both machines are down.
This is a challenging problem. Can you show that the mean time until outage is between 200
and 400 years? Here is a reasonably accurate solution. Given both machines were working the day
before, or were just repaired, or it is the first day, the probability that at least one of them fails
on a given day is about 0.002. That is, the waiting time until at least one machine goes down is
approximately geometrically distributed with parameter p = 0.002. The mean of such a distribution
is 1/p, so the mean time until at least one of the machines goes down is about 500 days. Given that
one machine goes down, the probability the other machine goes down within the five day repair time is about 5/1000 = 1/200. That is, the number of repair cycles until a double outage occurs has a geometric distribution with parameter about 1/200, so on average, we'll have to wait for 200 cycles, each averaging about 500 days. Therefore, the mean total waiting time until double outage is about 500 × 200 = 100,000 days, or approximately 274 years.
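A rough Monte Carlo check of this estimate is sketched below; it is not part of the original text and is plain Python. With only 100 trials the estimate is noisy, but it should land in the general vicinity of 100,000 days, i.e. a few hundred years.

    import random

    def days_until_double_outage(p_down=0.001, repair_days=5, rng=random):
        # A working machine goes down on a given day with probability p_down and then
        # stays down for repair_days days, counting the day it goes down.
        down_until = [0, 0]              # last day on which each machine is still down
        day = 0
        while True:
            day += 1
            for i in (0, 1):
                if day > down_until[i] and rng.random() < p_down:
                    down_until[i] = day + repair_days - 1
            if day <= down_until[0] and day <= down_until[1]:
                return day               # first day on which both machines are down

    random.seed(1)
    trials = 100
    print(sum(days_until_double_outage() for _ in range(trials)) / trials)  # roughly 100,000 days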
2.13 Short Answer Questions
Section 2.2[video]
1. Ten balls, numbered one through ten, are in a bag. Three are drawn out at random without
replacement, all possibilities being equally likely. Find E[S], where S is the sum of numbers
on the three balls.
2. Find Var(X) if the pmf of X is pX(i) = i/10 for 1 ≤ i ≤ 4.
Section 2.3[video]
1. Two fair dice are rolled. Find the conditional probability that doubles are rolled given the sum
is even.
2. Two fair dice are rolled. Find the conditional probability the sum is six given the product is
even.
3. What is the maximum number of mutually exclusive events that can exist for a given probability
experiment, if each of them has probability 0.3?
Section 2.4[video]
1. Find the probability a one is rolled exactly four times in six rolls of a fair die.
2. Find Var(3X + 5) assuming X has the binomial distribution with parameters n = 12 and
p = 0.9.
3. Each step a certain robot takes is forward with probability 2/3 and backward with probability
1/3, independently of all other steps. What is the probability the robot is two steps in front
of its starting position after taking 10 steps.
Section 2.5[video]
1. Find the median of the geometric distribution with parameter p = .02.
2. Suppose each trial in a sequence of independent trials is a success with probability 0.5. What
is the mean number of trials until two consecutive trials are successful?
Section 2.6[video]
1. Suppose a Bernoulli process starts with 0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,1 Identify the corresponding values of L1 through L4 .
2. Suppose a Bernoulli process starts with 0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,1 Identify the corresponding values of C10 through C14 .
3. Suppose a Bernoulli process starts with 0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,1 Identify the corresponding values of S1 through S4 .
Section 2.7[video]
1. Suppose X is a Poisson random variable such that P {X = 0} = 0.1. Find E[X].
2. Suppose X is a Poisson random variable with E[X 2 ] = 12. Find P {X = 2}.
Section 2.8[video]
1. Suppose X has a Poisson distribution with mean θ². Find θ̂_ML(10), the maximum likelihood estimate of θ if it is observed that X = 10.
2. Suppose X = 2Y + 4, where Y has the geometric distribution with parameter p. Find p̂_ML(10), the maximum likelihood estimate of p if it is observed that X = 10.
Section 2.9[video]
1. Let X have mean 100 and standard deviation σX = 5. Find the upper bound on P{|X − 100| ≥ 20} provided by the Chebychev inequality.
2. Suppose X denotes the number of heads showing in one million flips of a fair coin. Use the Chebychev inequality to identify d large enough that P{|X − 500,000| ≥ d} ≤ 0.1.
Section 2.10[video]
1. Two coins, one fair and one with a 60% vs. 40% bias towards heads, are in a pocket, one
is drawn out randomly, each having equal probability, and flipped three times. What is the
probability that heads shows on all three flips?
2. The number of passengers for a limousine pickup is thought to be either 1, 2, 3, or 4, each with
equal probability, and the number of pieces of luggage of each passenger is thought to be 1
or 2, with equal probability, independently for different passengers. What is the probability
that there will be six or more pieces of luggage?
3. Two fair dice are rolled. What is the conditional probability the sum is a multiple of three,
given it is a multiple of two?
Section 2.11[video]
1. Suppose observation X has pmf p1(i) = i/10 for 1 ≤ i ≤ 4 if H1 is true, and pmf p0(i) = 0.25 for 1 ≤ i ≤ 4 if H0 is true. Find p_miss and p_false alarm, respectively, for the ML decision rule.
2. Suppose observation X has pmf p1(i) = i/10 for 1 ≤ i ≤ 4 if H1 is true, and pmf p0(i) = 0.25 for 1 ≤ i ≤ 4 if H0 is true. Suppose H0 is twice as likely as H1, a priori (i.e. π_0/π_1 = 2). Find the minimum possible average probability of error.
3. Suppose X has the Poisson distribution with parameter 10 if H1 is true and the Poisson distribution with parameter 3 if H0 is true. The ML decision rule declares that H1 is true if X ≥ τ. Find the threshold τ.
Section 2.12[video]
1. Consider a storage system with five servers which does not fail as long as three or more servers are functioning. Suppose the system fails if three or more of the servers crash, and suppose
each server crashes independently in a given day with probability 0.001. Find a numerical
upper bound, provided by the union bound, on the probability of system failure in a given
day.
2. Suppose each corner of a three dimensional cube burns out with probability 0.001, independently
of the other corners. Find an upper bound, using the union bound, on the probability that
there exist two neighboring corners that both burn out.
2.14 Problems
Discrete random variables (Sections 2.1 and 2.2)
(b) Find the mean, E[X], and standard deviation, σX, of X. Correct numerical answers are fine, but show your work.
(c) Derive the pmf of Y and sketch it.
(d) Find the mean, E[Y], and standard deviation, σY, of Y. Correct numerical answers are fine, but show your work. (Hint: It may be helpful to use a spreadsheet or computer program.)
(e) Which is larger, σX or σY? Is that consistent with your sketches of the pmfs?
(f) The random variable X takes values in the set {1, 2, 3, 4, 5, 6}. Specify another pmf on
the same set which has a larger standard deviation than X.
2.7. [Up to five rounds of double or nothing]
A gambler initially having one chip participates in up to five rounds of gambling. In each
round, the gambler bets all of her chips, and with probability one half, she wins, thereby
doubling the number of chips she has, and with probability one half, she loses all her chips.
Let X denote the number of chips the gambler has after five rounds, and let Y denote
the maximum number of chips the gambler ever has (with stopping after five rounds). For
example, if the gambler wins in the first three rounds and loses in the fourth, Y = 8. If the
gambler loses in the first round, Y = 1.
(a) Find the pmf, mean, and variance of X.
(b) Find the pmf, mean, and variance of Y.
2.8. [Selecting supply for a random demand]
A reseller reserves and prepays for L rooms at a luxury hotel for a special event, at a price of a dollars per room, and the reseller sells the rooms for b dollars per room, for some known, fixed values a and b with 0 < a ≤ b. Letting U denote the number of potential buyers, the actual number of buyers is min{U, L} and the profit of the reseller is b min{U, L} − aL. The reseller must declare L before observing U, but when L is selected the reseller assumes U takes on the possible values 1, 2, ..., M, each with probability 1/M, for some known M ≥ 1.
(a) Express the expected profit of the reseller in terms of M, L, a, and b. Simplify your answer as much as possible. For simplicity, without loss of generality, assume 0 ≤ L ≤ M. (Hint: 1 + ⋯ + L = L(L+1)/2. Your answer should be valid whenever 0 ≤ L ≤ M; for L = 3 and M = 5 it should reduce to E[profit] = 12b/5 − 3a.)
(b) Determine the value of L as a function of a, b and M that maximizes the expected profit.
2.9. [The Zipf distribution]
The Zipf distribution has been found to model well the distribution of popularity of items such as books or videos in a library. The Zipf distribution with parameters M and α is the distribution on {1, ..., M} with pmf proportional to 1/k^α.
(a) If X has the Zipf distribution with parameters M = 2000 and α = 0.8 (suitable for a typical video library), what is P{X ≤ 500}?
(b) If Y has the Zipf distribution with parameters M = 2000 and α > 0, for what numerical value of α is it true that P{Y ≤ 100} = 30%?
with μ = 1/2 and σ² = 4/3.
(b) Express a and b in terms of μ and m2. Determine and sketch the region of (μ, m2) pairs for which there is a valid choice of (a, b).
(c) Determine and sketch the set of (μ, σ²) pairs for which there is a valid choice of (a, b). (Hint: For μ fixed, the set of possible values of σ² is the set of possible values of m2 − μ².)
(a) Write a computer program that can simulate N independent rolls of a fair die for any
integer value of N that you put into the program. Display the empirical distribution,
where the empirical distribution for such an experiment is the fraction of rolls that are
one, the fraction of rolls that are two, and so on, up to the fraction of rolls that are six,
for N = 100, N = 1000 and N = 10, 000. Your program output should look something
like the histograms shown.
[Example output: three histograms of the empirical distribution, for N = 100, N = 1000, and N = 10000, with the vertical axis (fraction of rolls) running from 0 to about 0.25.]
(b) Run your program at least twenty times; the displayed triplet of empirical distributions
should change each time. See how much the distributions change from one simulation
to the next, for each of the three values of N. The empirical distribution for the case
N = 100 can often be far from uniform. Here is what to turn in for this part: Find,
print out, and turn in the printout, for an example triplet of plots produced by the
program such that the first empirical distribution (i.e. the one for N = 100) is far from
the uniform one.
(c) Produce a modification of the program so that for each N , two dice are rolled N times,
and for each time the sum of the two dice is recorded. The empirical distributions will
thus have support over the range from 2 to 12. Print out and turn in a figure showing
empirical distributions for N = 100, N = 1000, and N = 10000, and also turn in a copy
of the computer code you used. (Hint: Even though the sum of two dice is always at
least 2, it might be easier to make the plot if one of the bins for the histogram is centered
at one; the count for such bin will always be zero.)
so on. Let X denote the number of times player 1 is a winner. Find P {X = 0}, P {X = 1},
and P{X = 2}. (Hint: Your answers should sum to 3/4.)
2.25. [Ultimate verdict]
Suppose each time a certain defendant is given a jury trial for a particular charge (such as
trying to sell a seat in the US Senate), an innocent verdict is given with probability qI , a
guilty verdict is given with probability qG , and a mistrial occurs with probability qM , where
qI , qG , and qM are positive numbers that sum to one. Suppose the prosecutors are determined
to get a guilty or innocent verdict, so that after any number of consecutive mistrials, another
trial is given. The process ends immediately after the first trial with a guilty or innocent
verdict; appeals are not considered. Let T denote the total number of trials required, and let
I denote the event that the verdict for the final trial is innocent.
(a) Find P (I|T = 1). Express your answer in terms of qI and qG .
(b) Find the pmf of T.
(c) Find P (I). Express your answer in terms of qI and qG .
(d) Compare your answers to parts (a) and (c). For example, is one always larger than the
other?
2.26. [ML parameter estimation for independent geometrically distributed rvs]
A certain task needs to be completed n times, where each completion requires multiple attempts. Let Li be the number of attempts that are needed to complete the task for the ith
time. Suppose that L1 , . . . , Ln are independent and each is geometrically distributed with
the same parameter p, to be estimated.
(a) Suppose it is observed that (L1 , . . . , Ln ) = (k1 , . . . , kn ) for some particular vector of
positive integers, (k1 , . . . , kn ). Write the probability (i.e. likelihood) of this observation
as simply as possible in terms of p and (k1, ..., kn). (Hint: By the assumed independence, the likelihood factors: P{(L1, ..., Ln) = (k1, ..., kn)} = P{L1 = k1} ⋯ P{Ln = kn}.)
(b) Find p̂_ML if it is observed that (L1, ..., Ln) = (k1, ..., kn). Simplify your answer as much as possible.
2.27. [Maximum likelihood estimation and the Poisson distribution]
An auto insurance company wishes to charge monthly premiums based on an individual's risk factor. It defines the risk factor as the probability p that the individual is involved in an auto accident during a trip. Assume that whether an accident occurred on one trip is independent of accidents occurring on others, i.e., the insurance company assumes that drivers are reckless and don't learn to be cautious after being in an accident. The insurance company assumes that each driver will be driving 120 trips a month.
(a) Determine the maximum likelihood estimate of the risk factor, p̂_ML, if no accidents are reported by a driver in a month. Repeat for the cases when the driver reports 1, 2, and 3 accidents.
(b) Assume that the actual value of p = 0.01. Compute the approximate values of P {X = k}
for k = 0, 1, 2, 3 using the Poisson approximation to the binomial distribution, and
compare those approximations to the actual probabilities computed using the binomial
distribution.
2.28. [Scaling of a confidence interval]
Suppose the fraction of people in Tokyo in favor of a certain referendum will be estimated by
a poll. A confidence interval based on the Chebychev bound will be used (i.e. the interval is centered at p̂ with width a/√n and confidence level 1 − 1/a², for some constant a, where p̂ is the fraction of the n people sampled that are in favor of the referendum.)
(a) Suppose the width of the confidence interval would be 0.1 for sample size n = 300 and
some given confidence level. How many samples would be needed instead to yield a
confidence interval that has only half the width, for the same level of confidence?
(b) What is the confidence level for the test of part (a)?
(c) Keeping the width of the confidence interval at 0.1 as in (a), how many samples would
be required for a 96% confidence level?
2.29. [Estimation of signal amplitude for Poisson observation]
The number of photons X detected by a particular sensor over a particular time period is
assumed to have the Poisson distribution with mean 1 + a², where a is the amplitude of an incident field. It is assumed a ≥ 0, but otherwise a is unknown.
(a) Find the maximum likelihood estimate, â_ML, of a for the observation X = 6.
(b) Find the maximum likelihood estimate, â_ML, of a given that it is observed X = 0.
2.30. [Parameter estimation for the binomial distribution]
Suppose X has the binomial distribution with parameters n and p.
(a) Suppose (for this part only) p = 0.03 and n = 100. Find the Poisson approximation to
P {X 2}. (You may leave one or more powers of e in your answer, but not an infinite
number of terms.)
(b) Suppose (for this part only) p is unknown and n = 10, 000, and based on observation of
X we want to estimate p within 0.025. That is, we will use a confidence interval with
half-width 0.025. Whats the largest confidence level we can claim? (The confidence
level is the probability, from the viewpoint before X is observed, that the confidence
interval will contain the true value p.)
(c) Suppose (for this part only) it is known that p = 0.03, but n is unknown. The parameter
n is to be estimated. Suppose it is observed that X = 7. Find the maximum likelihood
estimate n̂_ML. (Hint: It is difficult to differentiate with respect to the integer parameter
n, so another approach is needed to identify the minimizing value. Think about how
a function on the integers behaves near its maximum. How can you tell whether the
function is increasing or decreasing from one integer value to the next?)
Bayes Formula and binary hypothesis testing Sections 2.10 & 2.11
2.31. [Explaining a sum]
Suppose S = X1 + X2 + X3 + X4 where X1, X2, X3, X4 are mutually independent and Xi has the Bernoulli distribution with parameter pi = i/5 for 1 ≤ i ≤ 4.
(a) Find P {S = 1}.
(b) Find P (X1 = 1|S = 1).
2.32. [The weight of a positive]
(Based on G. Gigerenzer, Calculated Risks, Simon and Schuster, 2002 and S. Strogatz NYT
article, April 25, 2010.) Women aged 40 to 49 years of age have a low incidence of breast cancer; the fraction is estimated at 0.8%. Given a woman with breast cancer has a mammogram,
the probability of detection (i.e. a positive mammogram) is estimated to be 90%. Given a
woman does not have breast cancer, the probability of a false positive (i.e a false alarm) is
estimated to be 7%.
(a) Based on the above numbers, given a woman aged 40 to 49 has a mammogram, what is
the probability the mammogram will be positive?
(b) Given a woman aged 40 to 49 has a positive mammogram, what is the conditional
probability the woman has breast cancer?
(c) For 1000 women aged 40 to 49 getting mammograms for the first time, how many are
expected to have breast cancer, for how many of those is the mammogram positive, and
how many are expected to get a false positive?
2.33. [Which airline was late?]
Three airlines fly out of the Bloomington airport:
American has five flights per day; 20% depart late,
AirTrans has four flights per day; 5% depart late,
Delta has nine flights per day; 10% depart late.
(a) What fraction of flights flying out of the Bloomington airport depart late?
(b) Given that a randomly selected flight departs late (with all flights over a long period of
time being equally likely to be selected) what is the probability the flight is an American
flight?
2.34. [Conditional distribution of half-way point]
Consider a robot taking a random walk on the integer line. The robot starts at zero at time
zero. After that, between any two consecutive integer times, the robot takes a unit length
… p_false alarm and p_miss for the MAP rule, using the prior probabilities π_0 = 2/3 and π_1 = 1/3.
(d) Find pe for the MAP rule found in part (c), assuming the prior used in part (c) is true.
(e) For what values of π_0/π_1 does the MAP rule always decide H0? Assume ties are broken in favor of H1.
            attempts   field goals   shooting percentage
  (p_h)        281         119             42.35%
  (p_a)        521         212             40.70%
We take the two numbers of attempts, 281 and 521, as given, and not part of the random
experiment.
(a) Using the methodology of Section 2.9, use the given data to calculate 95% confidence
intervals for ph and for pa . (Note: If the two intervals you find intersect each other,
then the accepted scientific methodology would be to say that there is not significant
evidence in the data to reject the null hypothesis, H0 . In other words, the experiment is
inconclusive.)
(b) Suppose an analysis of data similar to this one were conducted, and the two intervals
calculated in part (a) did not intersect. What statement could we make in support of
hypothesis H1? (Hint: Refer to equation (2.11) of the notes, for which the true value p is fixed and arbitrary, and p̂ = X/n is random.)
2.39. [Detection problem with the geometric distribution]
The number of attempts, Y, required for a certain basketball player to make a 25 foot shot,
is observed, in order to choose one of the following two hypotheses:
H1 (outstanding player) : Y has the geometric distribution with parameter p = 0.5
H0 (average player) :
Y has the geometric distribution with parameter p = 0.2.
4
This hypothesis testing problem falls outside the scope of Section 2.11, because the hypotheses are composite
hypotheses, meaning that they involve one or more unknown parameters. For example, H0 specifies only that ph = pa ,
without specifying the numerical value of the probabilities. Problems like this are often faced in scientific experiments.
A common methodology is based on the notion of p-value, and on certain functions of the data that are not sensitive
to the parameters, such as in T tests or F tests. While the details are beyond the scope of this course, this problem
aims to give some insight into this common problem in scientific data analysis.
(1 − a)/(a − a⁷), so that
(a) Find simple expressions for pi(u1, ..., un) = P(X1 = u1, ..., Xn = un | Hi) for i = 0 and i = 1, where ui ∈ {1, 2, 3, 4, 5, 6} for each i. Express your answers using the variables tk, for 1 ≤ k ≤ 6, where tk is the number of the n rolls that show k. The vector (t1, ..., t6) is called the type vector of (u1, ..., un). Intuitively, the order of the observations shouldn't matter, so decision rules will naturally depend only on the type vector of the observation sequence.
(b) Find a simple expression for the likelihood ratio, Λ(u1, ..., un) = p1(u1, ..., un)/p0(u1, ..., un), and describe, as simply as possible, the likelihood ratio test for H1 vs. H0 given the observations (u1, ..., un).
(c) In particular, suppose that n = 100, (t1 , . . . , t6 ) = (18, 12, 13, 19, 18, 20), and a = 1.1.
Which hypothesis does the maximum likelihood decision rule select?
[Figure: a logic circuit with inputs B1, ..., B4, two AND gates and an OR gate; the input to the second AND gate that should come from the OR gate is stuck at 1.]
(a) Suppose there is a stuck at one fault as shown, so that the value one is always fed into
the second AND gate, instead of the output of the OR gate. Assuming that B1 , . . . , B4
are independent and equally likely to be zero or one, what is the probability that the
output value Y is incorrect?
(b) Suppose that the circuit is working correctly with probability 0.5, or has the indicated
stuck at one fault with probability 0.5. Suppose three distinct randomly generated test
patterns are applied to the circuit. (Here, a test pattern is a binary sequence of length
[Figure: three s − t networks, labeled Network A, Network B, and Network C.]
For each network, suppose that each link fails with probability p, independently of the other
links, and suppose network outage occurs if at least one link fails on each path from s to t.
Not all the links are labeled; for example, Network A has six links. For simplicity, we suppose that links can only be used in the forward direction, so that the paths with five links for Network B do not count. Let F denote the event of network outage.
(a) Find P (F ) in terms of p for Network A, and give the numerical value of P (F ) for
p = 0.001, accurate to within four significant digits.
(b) This part aims to find P (F ) for Network B, using the law of total probability, based
on the partition of the sample space into the events: D0 , D1 , D2,s , D2,d , D3 , D4 . Here,
Di is the event that exactly i links from among links {1, 2, 3, 4} fail, for i = 0, 1, 3, 4;
D2,s is the event that exactly two links from among links {1, 2, 3, 4} fail and they are on
the same side (i.e. either links 1 and 2 fail or links 3 and 4 fail); D2,d is the event that
exactly two links from among links {1, 2, 3, 4} fail and they are on different sides. Find
the probability of each of these events, find the conditional probabilities of F given any
one of these events, and finally, find P (F ). Express your answers as a function of p, and
give the numerical value of P (F ) for p = 0.001, accurate to within four significant digits.
(c) Find the numerical value of P(D2,d | F) for p = 0.001.
(d) Find the limit of the ratio of the outage probability for Network B to the outage probability for Network A, as p → 0. Explain the limit. Is it zero? (Hint: For each network, P(F) is a polynomial in p. As p → 0, the term with the smallest power of p dominates.)
(e) Without doing any detailed calculations, based on the reasoning used in part (d), give the limit of the ratio of the outage probability for Network C to the outage probability for Network A, as p → 0. Explain the limit. Is it zero?
2.46. [Distribution of capacity of an s − t flow network]
Consider the following s − t flow network. The link capacities, in units of some quantity per unit time, are shown for links that do not fail. Suppose each link fails with some probability p and if a link fails it can carry no flow. Let Y denote the s − t capacity of the network.
[Figure: an s − t flow network with five links, with capacities C1 = 20, C2 = 8, C3 = 10, C4 = 20, and C5 = 8.]
Find the pmf of Y. Express your answer in terms of p. To facilitate grading, express each
nonzero term of the pmf as a polynomial in p, with terms arranged in increasing powers of p.
(Hint: What is P {Y > 0}?)
2.47. [Reliability of a fully connected s t network with five nodes]
Consider the s − t network shown. [Figure: an s − t network with five nodes, including s and t, and a link between every pair of nodes.]
Suppose each link fails independently with probability p = 0.001. The network is said to fail if at least one link fails along each path from s to t.
(a) Identify the minimum number of link failures, k*, that can cause the network to fail.
(b) There are (10 choose k*) sets of k* links. So by the union bound, the probability of network failure is bounded above: P(F) ≤ (10 choose k*) p^{k*}. Compute the numerical value of this bound.
(c) A blocking set is a set of links such that if every link in the set fails then the network fails. By definition, k* is the minimum number of links in a blocking set. Show that there are exactly two blocking sets with k* links.
(d) Using the result of part (c), derive a tighter upper bound on the probability of system failure, and give its numerical value in case p = 0.001.
Chapter 3
3.1 Cumulative distribution functions
Let a probability space (Ω, F, P) be given. Recall that in Chapter 2 we defined a random variable to be a function X from Ω to the real line R. To be on mathematically firm ground, random variables are also required to have the property that sets of the form {ω : X(ω) ≤ c} should be events, meaning that they should be in F. Since a probability measure P assigns a probability to every event, every random variable X has a cumulative distribution function (CDF), denoted by FX. It is the function, with domain the real line R, defined by

FX(c) = P{ω : X(ω) ≤ c} = P{X ≤ c} (for short).

Example 3.1.1 Let X denote the number showing for a roll of a fair die, so the pmf of X is pX(i) = 1/6 for integers i with 1 ≤ i ≤ 6. The CDF FX is shown in Figure 3.1.
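A small numerical companion to this example is sketched below; it is not part of the original text and assumes Python with NumPy. It evaluates the CDF of the fair die at a few points, showing the jumps of size 1/6 at the integers 1 through 6.

    import numpy as np

    values = np.arange(1, 7)
    pmf = np.full(6, 1/6)

    def F_X(c):
        # CDF of the number showing on a roll of a fair die
        return pmf[values <= c].sum()

    for c in [0.5, 1, 2.5, 6, 7]:
        print(c, F_X(c))      # FX jumps by 1/6 at each of the integers 1 through 6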
Given a function F and a point x, denote the left and right limits of F at x by

F(x−) = lim_{y→x, y<x} F(y)    and    F(x+) = lim_{y→x, y>x} F(y).
Note that the CDF in Example 3.1.1 has six jumps of size 1/6. The jumps are located at
the six possible values of X, namely at the integers one through six, and the size of each of
those jumps is 1/6, which is the probability assigned to each of the six possible values. The
value of the CDF exactly at a jump point is equal to the right limit at that point. For example,
FX(1) = FX(1+) = 1/6 and FX(1−) = 0. The size of the jump at any x can be written as ΔFX(x) = FX(x) − FX(x−).
The CDF of an arbitrary random variable X determines the probabilities of any events of the form {X ∈ A}. Of course, the CDF of a random variable X determines P{X ≤ c} for any real number c; by definition it is just FX(c). Similarly, P{X ∈ (a, b]} = FX(b) − FX(a) whenever a < b. The next proposition explains how FX also determines probabilities of the form P{X < c} and P{X = c}.
Proposition 3.1.2 Let X be a random variable and let c be any real number. Then P{X < c} = FX(c−) and P{X = c} = ΔFX(c), where FX is the CDF of X.
Proof. Fix c and let c1, c2, ... be a sequence with c1 < c2 < ⋯ such that lim_{j→∞} cj = c. Let G1 be the event G1 = {X ≤ c1} and for j ≥ 2 let Gj = {c_{j−1} < X ≤ cj}. Then for any n ≥ 1, {X ≤ cn} = G1 ∪ G2 ∪ ⋯ ∪ Gn. Also, {X < c} = G1 ∪ G2 ∪ ⋯, and the events G1, G2, ... are mutually exclusive. Therefore, by Axiom P.2,

P{X < c} = P(G1) + P(G2) + ⋯.

The sum of a series is, by definition, the limit of the sum of the first n terms as n → ∞. Therefore,

P{X < c} = lim_{n→∞} Σ_{k=1}^{n} P(Gk) = lim_{n→∞} P{X ≤ cn} = lim_{n→∞} FX(cn) = FX(c−).

That is, P{X < c} = FX(c−). The second conclusion of the proposition follows directly from the first: P{X = c} = P{X ≤ c} − P{X < c} = FX(c) − FX(c−) = ΔFX(c).
Solution: (a) The CDF has only one jump, namely a jump of size 0.5 at u = 0. Thus, P{X = 0} = 0.5, and there are no values of u with u ≠ 0 such that P{X = u} > 0.
(b) P{X ≤ 0} = FX(0) = 1. (c) P{X < 0} = FX(0−) = 0.5.
Example 3.1.4 Let X have the CDF shown in Figure 3.3. Find the numerical values of the
following quantities:
(a) P{X ≤ 1}, (b) P{X < 10}, (c) P{X ≤ 10}, (d) P{X = 10}, (e) P{|X − 5| ≤ 0.1}.
[Figure 3.3: the CDF FX(u), plotted for 0 ≤ u ≤ 15, taking values between 0 and 1, with gridlines at 0.25, 0.5, and 0.75.]
[Figure: six candidate CDF plots, labeled (a) through (f).]
Solution: The functions shown in plots (a), (c), and (f) are valid CDFs and the other three are not. The function in (b) is not nondecreasing, and it does not converge to zero at −∞. The function in (d) does not converge to zero at −∞. The function in (e) is not right continuous.
The vast majority of random variables described in applications are one of two types, to be described next. A random variable X is a discrete-type random variable if there is a finite or countably infinite set of values {ui : i ∈ I} such that P{X ∈ {ui : i ∈ I}} = 1. The probability mass function (pmf) of a discrete-type random variable X, denoted pX(u), is defined by pX(u) = P{X = u}. Typically the pmf of a discrete random variable is much more useful than the CDF. However, the pmf and CDF of a discrete-type random variable are related by pX(u) = ΔFX(u), and conversely,

FX(c) = Σ_{u: u ≤ c} pX(u),        (3.1)

where the sum in (3.1) is taken only over u such that pX(u) ≠ 0. If X is a discrete-type random variable with only finitely many mass points in any finite interval, then FX is a piecewise constant function.
A random variable X is a continuous-type random variable if the CDF is the integral of a function:

FX(c) = ∫_{−∞}^{c} fX(u) du.

The function fX is called the probability density function. Continuous-type random variables are the subject of the next section.
The relationship among CDFs, pmfs, and pdfs is summarized in Figure 3.5.

[Figure 3.5: the CDF FX always exists; the pmf pX, for discrete-type random variables, and the pdf fX, for continuous-type random variables, are often more useful.]
3.2 Continuous-type random variables
As defined at the end of the previous section, a continuous-type random variable X has a pdf fX such that FX(c) = ∫_{−∞}^{c} fX(u) du for all c ∈ R. The support of a pdf fX is the set of u such that fX(u) > 0.
It follows by the fundamental theorem of calculus that if X is a continuous-type random variable and if the pdf fX is continuous, then the pdf is the derivative of the CDF: fX = FX′. In particular, if X is a continuous-type random variable with a continuous pdf, FX is differentiable, and therefore FX is a continuous function.¹ That is, there are no jumps in FX, so for any constant v, P{X = v} = 0.
It may seem strange at first that P {X = v} = 0 for all numbers v, because if we add these
probabilities up over all v, we get zero. It would seem like X cant take on any real value. But remember that there are uncountably many real numbers, and the axiom that probability is additive,
Axiom P.2, only holds for countably infinite sums.
If a < b then

P{a < X < b} = P{a ≤ X ≤ b} = ∫_a^b fX(u) du.

So when we work with continuous-type random variables, we don't have to be precise about whether the endpoints of intervals are included when calculating probabilities.
It follows that the integral of fX over every interval (a, b) is greater than or equal to zero, so fX must be a nonnegative function. Also,

1 = lim_{a→−∞} lim_{b→+∞} (FX(b) − FX(a)) = ∫_{−∞}^{∞} fX(u) du.

Therefore, fX integrates to one. In most applications, the density functions fX are continuous, or piecewise continuous.
Although P{X = u} = 0 for any real value of u, there is still a fairly direct interpretation of fX involving probabilities. Suppose uo is a constant such that fX is continuous at uo. Since fX(uo) is the derivative of FX at uo, it is also the symmetric derivative of FX at uo, because taking h → 0 on each side of

(FX(uo + h) − FX(uo − h)) / (2h) = (1/2) [ (FX(uo + h) − FX(uo))/h + (FX(uo) − FX(uo − h))/h ]

yields

lim_{h→0} (FX(uo + h) − FX(uo − h)) / (2h) = (1/2)(fX(uo) + fX(uo)) = fX(uo).        (3.2)

Substituting h = ε/2 and considering ε > 0 only, (3.2) yields

fX(uo) = lim_{ε→0} P{uo − ε/2 < X < uo + ε/2} / ε,        (3.3)

or equivalently,

P{uo − ε/2 < X < uo + ε/2} = fX(uo) ε + o(ε),        (3.4)

where o(ε) represents a term such that lim_{ε→0} o(ε)/ε = 0. The equivalent equations (3.3) and (3.4) show how the pdf fX is directly related to probabilities of events involving X.

¹ In these notes we only consider pdfs fX that are at least piecewise continuous, so that fX(u) = FX′(u) except at discontinuity points of fX. The CDF FX is continuous in general for continuous-type random variables.
Many of the definitions and properties for discrete-type random variables carry over to continuous-type random variables, with summation replaced by integration. The mean (or expectation), E[X], of a continuous-type random variable X is defined by

μX = E[X] = ∫_{−∞}^{∞} u fX(u) du.
The Law of the Unconscious Statistician (LOTUS) holds and is the fact that for a function g:
Z
E[g(X)] =
g(u)fX (u)du.
It follows from LOTUS, just as in the case of discrete-type random variables, that expectation is a
linear operation. For example, E[aX 2 + bX + c] = aE[X 2 ] + bE[X] + c.
Variance is defined for continuous-type random variables exactly as it is for discrete-type random
variables: Var(X) = E[(X X )2 ], and it has the same properties. It is a measure of how spread
out
p
2 , where =
the distribution of X is. As before, the variance of X is often denoted by X
Var(X)
X
is called the standard deviation of X. If X is a measurement in some units, then X is in the same
units. For example. if X represents a measurement in feet, then Var(X) represents a number of
feet2 and X represents a number of feet. Exactly as shown in Section 2.2 for discrete-type random
variables, the variance for continuous-type random variables scales as Var(aX + b) = a2 Var(X).
X
The standardized random variable, X
X , is a dimensionless random variable with mean zero and
variance one.
Finally, as before, the variance of a random variable is equal to its second moment minus the
square of its first moment:
Var(X) = E[X 2 2X X + 2X ]
= E[X 2 ] 2X E[X] + 2X
= E[X 2 ] 2X .
Example 3.2.2 Suppose X has the following pdf, where A is a constant to be determined:
A(1 u2 ) 1 u 1
fX (u) =
0
else.
Find A, P {0.5 < X < 1.5}, FX , X , Var(X), and X .
Solution:
$$1 = \int_{-\infty}^{\infty} f_X(u)\,du = \int_{-1}^{1} A(1 - u^2)\,du = A\left[u - \frac{u^3}{3}\right]_{-1}^{1} = \frac{4A}{3},$$
so $A = \frac{3}{4}$.
The support of $f_X$ (the set on which it is not zero) is the interval [-1, 1]. To find $P\{0.5 < X < 1.5\}$, we only need to integrate over the portion of the interval [0.5, 1.5] in the support of $f_X$. That is, we only need to integrate over [0.5, 1]:
$$P\{0.5 < X < 1.5\} = \int_{0.5}^{1.5} f_X(u)\,du = \int_{0.5}^{1} \frac{3(1 - u^2)}{4}\,du = \frac{3}{4}\left[u - \frac{u^3}{3}\right]_{0.5}^{1} = \frac{3}{4}\left(\frac{2}{3} - \frac{11}{24}\right) = \frac{5}{32}.$$
Because the support of $f_X$ is the interval [-1, 1], we can immediately write down the partial answer
$$F_X(c) = \begin{cases} 0 & c \leq -1 \\ ? & -1 < c < 1 \\ 1 & c \geq 1, \end{cases}$$
where the question mark represents what hasn't yet been determined, namely, the value of $F_X(c)$ for $-1 < c < 1$. So let $-1 < c < 1$. Then
$$F_X(c) = P\{X \leq c\} = \int_{-1}^{c} \frac{3(1 - u^2)}{4}\,du = \frac{3}{4}\left[u - \frac{u^3}{3}\right]_{-1}^{c} = \frac{2 + 3c - c^3}{4}.$$
Therefore,
$$F_X(c) = \begin{cases} 0 & c \leq -1 \\ \frac{2 + 3c - c^3}{4} & -1 < c < 1 \\ 1 & c \geq 1. \end{cases}$$
The mean is
$$\mu_X = E[X] = \int_{-\infty}^{\infty} u f_X(u)\,du = 0,$$
because $u f_X(u)$ is an odd function of u. The second moment is
$$E[X^2] = \int_{-1}^{1} \frac{3}{4}u^2(1 - u^2)\,du = \frac{3}{4}\int_{-1}^{1}(u^2 - u^4)\,du = 0.2.$$
Thus, Var(X) = 0.2 and $\sigma_X = \sqrt{0.2} \approx 0.447$.
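As a quick numerical sanity check on Example 3.2.2, the constants and probabilities can be reproduced by numerical integration. This is only an illustrative sketch, assuming NumPy and SciPy are available; the helper names are ours, not from the text.

```python
# Numerical check of Example 3.2.2 (assumes scipy is installed).
from scipy.integrate import quad

def shape(u):
    # Unnormalized pdf shape: 1 - u^2 on [-1, 1], zero elsewhere.
    return 1.0 - u*u if -1.0 <= u <= 1.0 else 0.0

total, _ = quad(shape, -1, 1)
A = 1 / total                                  # normalizing constant
f = lambda u: A * shape(u)

prob, _ = quad(f, 0.5, 1.0)                    # P{0.5 < X < 1.5}; support ends at 1
mean, _ = quad(lambda u: u * f(u), -1, 1)
second, _ = quad(lambda u: u*u * f(u), -1, 1)

print(A)                   # 0.75
print(prob, 5/32)          # both about 0.15625
print(mean)                # about 0
print(second - mean**2)    # about 0.2
```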
3.3 Uniform distribution

Let a < b. A random variable X is uniformly distributed over the interval [a, b] if
$$f_X(u) = \begin{cases} \frac{1}{b-a} & a \leq u \leq b \\ 0 & \text{else.} \end{cases}$$
The mean of X is given by
$$E[X] = \frac{1}{b-a}\int_a^b u\,du = \frac{u^2}{2(b-a)}\bigg|_a^b = \frac{b^2 - a^2}{2(b-a)} = \frac{(b-a)(b+a)}{2(b-a)} = \frac{a+b}{2}.$$
Thus, the mean is the midpoint of the interval. The second moment of X is given by
$$E[X^2] = \frac{1}{b-a}\int_a^b u^2\,du = \frac{u^3}{3(b-a)}\bigg|_a^b = \frac{b^3 - a^3}{3(b-a)} = \frac{(b-a)(b^2 + ab + a^2)}{3(b-a)} = \frac{a^2 + ab + b^2}{3}.$$
So
$$\mathrm{Var}(X) = E[X^2] - E[X]^2 = \frac{a^2 + ab + b^2}{3} - \left(\frac{a+b}{2}\right)^2 = \frac{(a-b)^2}{12}.$$
Note that the variance is proportional to the length of the interval squared. A useful special case is when a = 0 and b = 1, in which case X is uniformly distributed over the unit interval [0, 1]. In that case, for any $k \geq 0$, the kth moment is given by
$$E[X^k] = \int_0^1 u^k\,du = \frac{u^{k+1}}{k+1}\bigg|_0^1 = \frac{1}{k+1},$$
and the variance is $\frac{1}{3} - \left(\frac{1}{2}\right)^2 = \frac{1}{12}$.
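These formulas are easy to check by simulation. The sketch below (assuming NumPy; names are illustrative) draws samples uniformly from [a, b] and compares the empirical mean and variance with (a+b)/2 and (b-a)^2/12:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 7.0
x = rng.uniform(a, b, size=200_000)

print(x.mean(), (a + b) / 2)        # both close to 4.5
print(x.var(), (b - a) ** 2 / 12)   # both close to 25/12, about 2.083

# kth moments for the uniform distribution on [0, 1]: E[U^k] = 1/(k+1)
u = rng.uniform(0, 1, size=200_000)
for k in range(1, 5):
    print(k, (u ** k).mean(), 1 / (k + 1))
```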
3.4 Exponential distribution

A random variable T has the exponential distribution with parameter $\lambda > 0$ if its pdf is given by
$$f_T(t) = \begin{cases} \lambda e^{-\lambda t} & t \geq 0 \\ 0 & \text{else.} \end{cases}$$
The pdfs for the exponential distributions with parameters $\lambda$ = 1, 2, and 4 are shown in Figure 3.6. Note that the initial value of the pdf is given by $f_T(0) = \lambda$, and the rate of decay of $f_T(t)$ as t increases is also $\lambda$. Of course, the area under the graph of the pdf is one for any value of $\lambda$. The mean is decreasing in $\lambda$, because the larger $\lambda$ is, the more concentrated the pdf becomes towards the left.

Figure 3.6: The pdfs for the exponential distributions with $\lambda$ = 1, 2, and 4.

The CDF, evaluated at $t \geq 0$, is given by
$$F_T(t) = \int_{-\infty}^{t} f_T(s)\,ds = \int_0^t \lambda e^{-\lambda s}\,ds = -e^{-\lambda s}\Big|_0^t = 1 - e^{-\lambda t}.$$
Therefore, in general,
$$F_T(t) = \begin{cases} 1 - e^{-\lambda t} & t \geq 0 \\ 0 & t < 0. \end{cases}$$
The complementary CDF, defined by $F_T^c(t) = P\{T > t\} = 1 - F_T(t)$, therefore satisfies
$$F_T^c(t) = \begin{cases} e^{-\lambda t} & t \geq 0 \\ 1 & t < 0. \end{cases}$$
To find the mean and variance we first find a general formula for the nth moment of T, for $n \geq 1$, using integration by parts:
$$E[T^n] = \int_0^{\infty} t^n \lambda e^{-\lambda t}\,dt = -t^n e^{-\lambda t}\Big|_0^{\infty} + \int_0^{\infty} n t^{n-1} e^{-\lambda t}\,dt = 0 + \frac{n}{\lambda}\int_0^{\infty} t^{n-1}\lambda e^{-\lambda t}\,dt = \frac{n}{\lambda}E[T^{n-1}].$$
In particular, $E[T] = \frac{1}{\lambda}$ and $E[T^2] = \frac{2}{\lambda^2}$, and in general, by induction on n, $E[T^n] = \frac{n!}{\lambda^n}$. Therefore, $\mathrm{Var}(T) = E[T^2] - E[T]^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}$. The standard deviation of T is $\sigma_T = \frac{1}{\lambda}$. The standard deviation is equal to the mean. (Sound familiar? Recall that for p close to zero, the geometric distribution with parameter p has standard deviation nearly equal to the mean.)
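The recursion $E[T^n] = (n/\lambda)E[T^{n-1}]$ and the closed form $E[T^n] = n!/\lambda^n$ can be checked numerically. A sketch, assuming SciPy is available; the variable names are ours:

```python
import math
from scipy.integrate import quad

lam = 2.0
for n in range(1, 5):
    # nth moment of an Exponential(lam) random variable by numerical integration
    moment, _ = quad(lambda t: t**n * lam * math.exp(-lam * t), 0, math.inf)
    print(n, moment, math.factorial(n) / lam**n)   # the two columns agree

mean, var = 1 / lam, 1 / lam**2
print(mean, var, math.sqrt(var))   # standard deviation equals the mean
```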
Example 3.4.1 Let T be an exponentially distributed random variable with parameter $\lambda = \ln 2$. Find the simplest expression possible for $P\{T \geq t\}$ as a function of t for $t \geq 0$, and find $P(T \leq 1 \mid T \leq 2)$.
Solution: For $t \geq 0$, $P\{T \geq t\} = e^{-(\ln 2)t} = 2^{-t}$. Also,
$$P(T \leq 1 \mid T \leq 2) = \frac{P\{T \leq 1, T \leq 2\}}{P\{T \leq 2\}} = \frac{P\{T \leq 1\}}{P\{T \leq 2\}} = \frac{1 - \frac{1}{2}}{1 - \frac{1}{4}} = \frac{2}{3}.$$

Memoryless property of the exponential distribution Suppose T is exponentially distributed with parameter $\lambda$. Then for any $s, t \geq 0$,
$$P(T > s + t \mid T > s) = \frac{P\{T > s + t, T > s\}}{P\{T > s\}} = \frac{P\{T > s + t\}}{P\{T > s\}} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P\{T > t\}.$$
That is, $P(T > s + t \mid T > s) = P\{T > t\}$. This is called the memoryless property for continuous time. If T is the lifetime of a component installed in a system at time zero, the memoryless property of T has the following interpretation: given that the component is still working after s time units, the probability it will continue to be working after t additional time units is the same as the probability a new component would still be working after t time units. As discussed later in Section 3.9, the memoryless property is equivalent to the failure rate being constant.
Connection between exponential and geometric distributions As just noted, the exponential distribution has the memoryless property in continuous time. Recall from Section 2.5 that
the geometric distribution has the memoryless property in discrete time. We shall further illustrate the close connection between the exponential and geometric distributions by showing that the
exponential distribution is the limit of scaled geometric distributions. In essence, the exponential
distribution is the continuous time analog of the geometric distribution. When systems are modeled
or simulated, it is useful to be able to approximate continuous variables by discrete ones, which are
easily represented in digital computers.
Fix > 0, which should be thought of as a failure rate, measured in inverse seconds. Let h > 0
represent a small duration of time, measured in seconds. Imagine there is a clock that ticks once
every h time units. Thus, the clock ticks occur at times h, 2h, 3h, .... If we measure how long a lightbulb lasts by the number of clock ticks, the smaller h is, the larger the number of ticks. We can model the lifetime of a lightbulb as a discrete random variable if we assume the bulb can only fail at the times of the clock ticks. For small values of h this can give a good approximation to a continuous-type random variable. For small values of h, the probability p that the lightbulb fails between consecutive clock ticks should be proportionally small. Specifically, set $p = \lambda h$. Consider a lightbulb which is new at time zero, and at each clock tick it fails with probability p, given it hasn't failed earlier. Let $L_h$ be the number of ticks until the lightbulb fails. Then $L_h$ has the geometric distribution with parameter p. The mean of $L_h$ is given by $E[L_h] = \frac{1}{p} = \frac{1}{\lambda h}$, which converges to infinity as $h \to 0$. That's what happens when we measure the lifetime by the number of clock ticks using a clock with a very high tick rate. Let $T_h$ be the amount of time until the lightbulb fails. Then $T_h = hL_h$. We shall show that the distribution of $T_h$ is close to the exponential distribution with parameter $\lambda$ for small h.
We have a handle on the distribution of $T_h$ because we know the distribution of $L_h$. The complementary CDF of $T_h$ can be found as follows. For any $c \geq 0$,
$$P\{T_h > c\} = P\{L_h h > c\} = P\{L_h > \lfloor c/h \rfloor\} = (1 - \lambda h)^{\lfloor c/h \rfloor},$$
where the last equality follows from (2.8). Recall from (2.9) that $\left(1 - \frac{\lambda}{n}\right)^n \to e^{-\lambda}$ as $n \to \infty$. Replacing n by 1/h implies that $(1 - \lambda h)^{1/h} \to e^{-\lambda}$ as $h \to 0$, and therefore, $(1 - \lambda h)^{c/h} \to e^{-\lambda c}$ as $h \to 0$. Also, the difference between $\lfloor c/h \rfloor$ and c/h is not important, because $1 \geq (1 - \lambda h)^{c/h}/(1 - \lambda h)^{\lfloor c/h \rfloor} \geq (1 - \lambda h) \to 1$ as $h \to 0$. Therefore, if T has the exponential distribution with parameter $\lambda$,
$$P\{T_h > c\} = (1 - \lambda h)^{\lfloor c/h \rfloor} \to e^{-\lambda c} = P\{T > c\}$$
as $h \to 0$. So $1 - P\{T_h > c\} \to 1 - P\{T > c\}$ as well. That is, the CDF of $T_h$ converges to the CDF of T. In summary, the CDF of h times a geometrically distributed random variable with parameter p = $\lambda$h converges to the CDF of an exponential random variable with parameter $\lambda$, as $h \to 0$.
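The convergence is easy to see numerically. The sketch below (plain Python; variable names are ours) compares $P\{T_h > c\} = (1 - \lambda h)^{\lfloor c/h \rfloor}$ with $e^{-\lambda c}$ for a few values of h:

```python
import math

lam = 1.5     # failure rate, per second
c = 2.0       # time at which the complementary CDFs are compared
for h in [0.1, 0.01, 0.001, 0.0001]:
    p = lam * h                                   # per-tick failure probability
    scaled_geom = (1 - p) ** math.floor(c / h)    # P{T_h > c}
    print(h, scaled_geom, math.exp(-lam * c))     # approaches e^{-lambda c} as h -> 0
```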
3.5 Poisson processes
Bernoulli processes are discussed in Section 2.6. Here we examine Poisson processes. Just as
exponential random variables are limits of scaled geometric random variables (as seen in Section
3.4), Poisson processes are limits of scaled Bernoulli processes.
3.5.1

Figure 3.7: (a) A sample path of a Bernoulli process and (b) the associated time-scaled sample path of the time-scaled Bernoulli process, for h = 0.1.

A Bernoulli process can be viewed as a random counting process: the associated
random process tracks the number of counts versus time. See Figure 3.7. Figure 3.7(a) shows a
Bernoulli process, along with the associated random variables defined in Section 2.6: the number
of additional trials needed for each additional outcome of a one (the Ls), the total number of trials
needed for a given number of ones (the Ss), and the cumulative number of ones for a given number
of trials (the Cs). Note that the index in the sketch is k, which indexes the trials, beginning with
trial one. Figure 3.7(b) shows the corresponding time-scaled Bernoulli process for h = 0.1. The
seventh trial is the first trial to result in a count, so the time of the first count is 7h, as shown in
3.7(b) for h = 0.1. We define the following random variables to describe the time-scaled Bernoulli
process with time step h:
$\tilde{U}_j = hL_j$: the amount of time between the $(j-1)$th count and the jth count;
$\tilde{T}_j = hS_j$: the time the jth count occurs;
$\tilde{N}_t = C_{\lfloor t/h \rfloor}$: the number of counts up to time t.
The tildes on the random variables here are used to distinguish the variables from the similar random variables for a Poisson process, defined below.
Suppose $\lambda$ is fixed and that h is so small that $p = \lambda h$ is much smaller than one. Then the random variables describing the scaled Bernoulli process have simple approximate distributions. Each $L_j$ is a geometrically distributed random variable with parameter p, so as explained in Section 3.4, the scaled version of $L_j$, namely $\tilde{U}_j = hL_j$, approximately has the exponential distribution with parameter $\lambda = p/h$. For t fixed, $\tilde{N}_t$ is the sum of $\lfloor t/h \rfloor$ Bernoulli random variables with parameter p. Therefore, $\tilde{N}_t$ has the binomial distribution with parameters $\lfloor t/h \rfloor$ and $p = \lambda h$. So $E[\tilde{N}_t] = \lfloor t/h \rfloor \lambda h \approx \lambda t$. Recall from Section 2.7 that the limit of a binomial distribution as $n \to \infty$ and $p \to 0$ with $np \to \lambda$ is the Poisson distribution with parameter $\lambda$. Therefore, as $h \to 0$, the limiting distribution of $\tilde{N}_t$ is the Poisson distribution with mean $\lambda t$. More generally, if $0 \leq s < t$, the distribution of the increment $\tilde{N}_t - \tilde{N}_s$ converges to the Poisson distribution with parameter $\lambda(t - s)$. Also, the increments of $\tilde{N}_t$ over disjoint intervals are independent random variables.
3.5.2

A Poisson process with rate $\lambda > 0$ is obtained as the limit of scaled Bernoulli random counting processes as $h \to 0$ and $p \to 0$ such that $p/h \to \lambda$. This limiting picture is just used to motivate the definition of Poisson processes, given below, and to explain why Poisson processes naturally arise in applications. A sample path of a Poisson process (i.e. the function of time the process yields for some particular $\omega$ in $\Omega$) is shown in Figure 3.8. The variable $N_t$ for each $t \geq 0$ is the cumulative number of counts up to time t. The count times are denoted by $T_1, T_2, \ldots$ and the intercount times by $U_1, U_2, \ldots$, so that
$$N_t = \sum_{n=1}^{\infty} I_{\{t \geq T_n\}}, \qquad T_n = \min\{t : N_t \geq n\}, \qquad T_n = U_1 + \cdots + U_n.$$

Figure 3.8: A sample path of a Poisson process, showing the count times $T_1, T_2, T_3$ and the intercount times $U_1, U_2, U_3$.

Definition 3.5.1 Let $\lambda \geq 0$. A Poisson process with rate $\lambda$ is a random counting process $N = (N_t : t \geq 0)$ such that
N.1 N has independent increments: if $0 \leq t_0 \leq t_1 \leq \cdots \leq t_n$, the increments $N_{t_1} - N_{t_0}, N_{t_2} - N_{t_1}, \ldots, N_{t_n} - N_{t_{n-1}}$ are independent.
N.2 The increment $N_t - N_s$ has the $Poi(\lambda(t - s))$ distribution for $t \geq s$.

Proposition 3.5.2 Let N be a random counting process and let $\lambda > 0$. The following are equivalent:
(a) N is a Poisson process with rate $\lambda$.
(b) The intercount times $U_1, U_2, \ldots$ are mutually independent, exponentially distributed random variables with parameter $\lambda$.
Proof. Either (a) or (b) provides a specific probabilistic description of a random counting process. Furthermore, both descriptions carry over as limits from the Bernoulli random counting process, as $h, p \to 0$ with $p/h \to \lambda$. This provides a basis for a proof of the proposition.
Example 3.5.3 Consider a Poisson process on the interval [0, T] with rate $\lambda > 0$, and let $\tau$ be a constant with $0 < \tau < T$. Define $X_1$ to be the number of counts during $[0, \tau]$, $X_2$ to be the number of counts during $[\tau, T]$, and X to be the total number of counts during [0, T]. Let i, j, n be nonnegative integers such that n = i + j. Express the following probabilities in terms of n, i, j, $\lambda$, T, and $\tau$, simplifying your answers as much as possible:
(a) $P\{X = n\}$, (b) $P\{X_1 = i\}$, (c) $P\{X_2 = j\}$, (d) $P(X_1 = i \mid X = n)$, (e) $P(X = n \mid X_1 = i)$.
Solution:
(a) $P\{X = n\} = \frac{e^{-\lambda T}(\lambda T)^n}{n!}$.
(b) $P\{X_1 = i\} = \frac{e^{-\lambda \tau}(\lambda \tau)^i}{i!}$.
(c) $P\{X_2 = j\} = \frac{e^{-\lambda(T - \tau)}(\lambda(T - \tau))^j}{j!}$.
(d)
$$P(X_1 = i \mid X = n) = \frac{P\{X_1 = i, X = n\}}{P\{X = n\}} = \frac{P\{X_1 = i, X_2 = j\}}{P\{X = n\}} = \frac{P\{X_1 = i\}P\{X_2 = j\}}{P\{X = n\}} = \frac{n!}{i!\,j!}\left(\frac{\tau}{T}\right)^i\left(\frac{T - \tau}{T}\right)^j = \binom{n}{i}p^i(1 - p)^{n - i},$$
where $p = \frac{\tau}{T}$. As a function of i, the answer is thus the pmf of a binomial distribution. This result is indicative of the following stronger property that can be shown: given there are n counts during [0, T], the times of the counts are independent and uniformly distributed over the interval [0, T].
(e) Since $X_1$ and $X_2$ are numbers of counts in disjoint time intervals, they are independent. Therefore,
$$P(X = n \mid X_1 = i) = P(X_2 = j \mid X_1 = i) = P\{X_2 = j\} = \frac{e^{-\lambda(T - \tau)}(\lambda(T - \tau))^j}{j!},$$
which can also be written as $P(X = n \mid X_1 = i) = \frac{e^{-\lambda(T - \tau)}(\lambda(T - \tau))^{n - i}}{(n - i)!}$. That is, given $X_1 = i$, the total number of counts is i plus a random number of counts. The random number of counts has the Poisson distribution with mean $\lambda(T - \tau)$.
Example 3.5.4 Calls arrive to a cell in a certain wireless communication system according to a Poisson process with arrival rate $\lambda$ = 2 calls per minute. Measure time in minutes and consider an interval of time beginning at time t = 0. Let $N_t$ denote the number of calls that arrive up until time t. For a fixed t > 0, the random variable $N_t$ is a Poisson random variable with parameter 2t, so its pmf is given by $P\{N_t = i\} = \frac{e^{-2t}(2t)^i}{i!}$ for nonnegative integers i.
(a) Find the probability of each of the following six events:
E1 = "No calls arrive in the first 3.5 minutes."
E2 = "The first call arrives after time t = 3.5."
E3 = "Two or fewer calls arrive in the first 3.5 minutes."
E4 = "The third call arrives after time t = 3.5."
E5 = "The third call arrives after time t." (for general t > 0)
E6 = "The third call arrives before time t." (for general t > 0)
(b) Derive the pdf of the arrival time of the third call.
(c) Find the expected arrival time of the tenth call.
Solution: Since $\lambda$ = 2, $N_{3.5}$ has the Poisson distribution with mean 7. Therefore, $P(E_1) = P\{N_{3.5} = 0\} = \frac{e^{-7}(7)^0}{0!} = e^{-7} \approx 0.00091$.
Event E2 is the same as E1, so $P(E_2) \approx 0.00091$.
Using the pmf of the Poisson distribution with mean 7 yields: $P(E_3) = P\{N_{3.5} \leq 2\} = P\{N_{3.5} = 0\} + P\{N_{3.5} = 1\} + P\{N_{3.5} = 2\} = e^{-7}\left(1 + 7 + \frac{7^2}{2}\right) \approx 0.0296$.
Event E4 is the same as event E3, so $P(E_4) \approx 0.0296$.
Event E5 is the same as the event that the number of calls that arrive by time t is less than or equal to two, so $P(E_5) = P\{N_t \leq 2\} = P\{N_t = 0\} + P\{N_t = 1\} + P\{N_t = 2\} = e^{-2t}\left(1 + 2t + \frac{(2t)^2}{2}\right)$.
Event E6 is just the complement of E5, so $P(E_6) = 1 - P(E_5) = 1 - e^{-2t}\left(1 + 2t + \frac{(2t)^2}{2}\right)$.
(b) As a function of t, $P(E_6)$ is the CDF of the arrival time of the third call. To get the pdf, differentiate it to get
$$f(t) = e^{-2t}\,2\left(1 + 2t + \frac{(2t)^2}{2}\right) - e^{-2t}(2 + 4t) = e^{-2t}(2t)^2 = \frac{e^{-2t}2^3 t^2}{2}.$$
This is the Erlang density with parameters r = 3 and $\lambda$ = 2.
(c) The expected time between arrivals is $1/\lambda$, so the expected time until the tenth arrival is $10/\lambda = 5$.
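The probabilities in Example 3.5.4 can be reproduced in a few lines of code. A sketch, assuming SciPy's poisson and erlang distributions are available; the numeric value of t in part (b) is arbitrary:

```python
import math
from scipy.stats import poisson, erlang

lam = 2.0                                # calls per minute
print(poisson.pmf(0, lam * 3.5))         # P(E1): no calls in the first 3.5 minutes, ~0.00091
print(poisson.cdf(2, lam * 3.5))         # P(E3): two or fewer calls in 3.5 minutes, ~0.0296

# (b) The third arrival time is Erlang with r = 3 and rate 2; SciPy uses scale = 1/rate.
t = 1.7
print(erlang.pdf(t, 3, scale=1/lam))     # matches 4 t^2 e^{-2t}
print(4 * t**2 * math.exp(-2 * t))

print(10 / lam)                          # (c) expected arrival time of the tenth call
```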
(c) By the definition of conditional probability,
$$P(B_{020} \mid A) = \frac{P(B_{020} \cap A)}{P(A)} = \frac{P(B_{020})}{P(A)}.$$
Notice that in part (b) we applied the law of total probability to find P(A), and in part (c) we applied the definition of the conditional probability $P(B_{020} \mid A)$. Together, this amounts to an application of Bayes' rule for finding $P(B_{020} \mid A)$.
3.5.3
Let $T_r$ denote the time of the rth count of a Poisson process. Thus, $T_r = U_1 + \cdots + U_r$, where $U_1, \ldots, U_r$ are independent, exponentially distributed random variables with parameter $\lambda$. One way to derive the pdf $f_{T_r}$ is to use this characterization of $T_r$ and the method of Section 4.5.2, showing how to find the pdf of the sum of independent continuous-type random variables. But the following method is less work. Notice that for a fixed time t, the event $\{T_r > t\}$ can be written as $\{N_t \leq r - 1\}$, because the rth count happens after time t if and only if the number of counts that happened by time t is less than or equal to r - 1. Therefore,
$$P\{T_r > t\} = \sum_{k=0}^{r-1} \frac{\exp(-\lambda t)(\lambda t)^k}{k!}.$$
Differentiating with respect to t and multiplying by -1 yields the pdf:
$$f_{T_r}(t) = \exp(-\lambda t)\left(\sum_{k=0}^{r-1} \frac{\lambda^{k+1}t^k}{k!} - \sum_{k=1}^{r-1} \frac{\lambda^k t^{k-1}}{(k-1)!}\right) = \exp(-\lambda t)\left(\sum_{k=0}^{r-1} \frac{\lambda^{k+1}t^k}{k!} - \sum_{k=0}^{r-2} \frac{\lambda^{k+1}t^k}{k!}\right) = \frac{\exp(-\lambda t)\lambda^r t^{r-1}}{(r-1)!}.$$
The distribution of $T_r$ is called the Erlang distribution with parameters r and $\lambda$. The mean of $T_r$ is $\frac{r}{\lambda}$, because $T_r$ is the sum of r random variables, each with mean $1/\lambda$. It is shown in Example 4.8.1 that $\mathrm{Var}(T_r) = \frac{r}{\lambda^2}$.
Recall from Section 3.4 that the exponential distribution is the limit of a scaled geometric random variable. In the same way, the sum of r independent exponential random variables is the scaled limit of the sum of r independent geometric random variables. That is exactly what we just showed: the Erlang distribution with parameters r and $\lambda$ is the limit of a scaled negative binomial random variable $hS_r$, where $S_r$ has the negative binomial distribution with parameters r and $p = \lambda h$.
(The Erlang distribution can be generalized to the case that r is any positive real number, not necessarily an integer, by replacing the term $(r-1)!$ by $\Gamma(r)$, where $\Gamma$ is the gamma function defined by $\Gamma(r) = \int_0^{\infty} t^{r-1}e^{-t}\,dt$. Distributions in this more general family are called gamma distributions. Thus, Erlang distributions are special cases of gamma distributions for the case that r is a positive integer, in which case $\Gamma(r) = (r-1)!$.)
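A quick simulation illustrates that the sum of r independent Exponential($\lambda$) intercount times has the Erlang distribution derived above. This is only a sketch, assuming NumPy and SciPy; the names are ours:

```python
import numpy as np
from scipy.stats import erlang

rng = np.random.default_rng(1)
lam, r = 2.0, 3
# T_r as the sum of r independent exponential intercount times
samples = rng.exponential(scale=1/lam, size=(100_000, r)).sum(axis=1)

print(samples.mean(), r / lam)       # mean r/lambda
print(samples.var(), r / lam**2)     # variance r/lambda^2
t = 2.5
# Empirical tail probability versus the Erlang complementary CDF
print((samples > t).mean(), erlang.sf(t, r, scale=1/lam))
```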
3.6
3.6.1
Let X be a random variable with pdf $f_X$ and let Y = aX + b where a > 0.(3) The pdf of Y is given by the following scaling rule:
$$Y = aX + b \quad \Longrightarrow \quad f_Y(v) = \frac{1}{a}f_X\left(\frac{v - b}{a}\right). \qquad (3.5)$$
We explain what the scaling rule means graphically, assuming $a \geq 1$. The situation for $0 < a \leq 1$ is similar. To obtain $f_Y$ from $f_X$, first stretch the graph of $f_X$ horizontally by a factor a and shrink it vertically by a factor a. That operation leaves the area under the graph equal to one, and produces the pdf of aX. Then shift the graph horizontally by b (to the right if b > 0 or to the left if b < 0).
Here is a derivation of the scaling rule (3.5). Since a > 0, the event $\{aX + b \leq v\}$ is the same as $\left\{X \leq \frac{v-b}{a}\right\}$, so the CDF of Y can be expressed as follows:
$$F_Y(v) = P\{aX + b \leq v\} = P\left\{X \leq \frac{v - b}{a}\right\} = F_X\left(\frac{v - b}{a}\right).$$
Differentiate $F_Y(v)$ with respect to v, using the chain rule of calculus and the fact $F_X' = f_X$, to obtain (3.5):
$$f_Y(v) = F_Y'(v) = \frac{1}{a}f_X\left(\frac{v - b}{a}\right).$$
Section 3.2 recounts how the mean, variance, and standard deviation of Y are related to the mean, variance, and standard deviation of X, in case Y = aX + b. These relations are the same ones discussed in Section 2.2 for discrete-type random variables, namely:
$$E[Y] = aE[X] + b, \qquad \mathrm{Var}(Y) = a^2\,\mathrm{Var}(X), \qquad \sigma_Y = a\,\sigma_X.$$
Example 3.6.1 Let X denote the high temperature, in degrees C (Celsius), for a certain day of the year in some city, and let Y denote the same temperature, but in degrees F (Fahrenheit). The conversion formula is Y = (1.8)X + 32. This is the linear transformation that maps zero degrees C to 32 degrees F and 100 degrees C to 212 degrees F.
(a) Express $f_Y$ in terms of $f_X$.
(b) Sketch $f_Y$ in the case X is uniformly distributed over the interval [15, 20].
Solution: (a) By the scaling formula with a = 1.8 and b = 32, $f_Y(c) = f_X\left(\frac{c - 32}{1.8}\right)/1.8$.
(b) The case when X is uniformly distributed over [15, 20] leads to Y uniformly distributed over [59, 68]. This is illustrated in Figure 3.9, which shows the pdf of X and the pdf of Y.

(3) The case a < 0 is discussed in Example 3.8.4, included in the section on functions of a random variable.

Figure 3.9: The pdfs of X (in degrees C) and of Y (in degrees F).
Example 3.6.2 Let T denote the duration of a waiting time in a service system, measured in seconds. Suppose T is exponentially distributed with parameter $\lambda$ = 0.01. Let S denote the same waiting time, but measured in minutes. Find the mean and the pdf of S.
Solution: First we identify the pdf and mean of T. By definition of the exponential distribution with parameter $\lambda$ = 0.01, the pdf of T is
$$f_T(u) = \begin{cases} (0.01)e^{-(0.01)u} & \text{for } u \geq 0 \\ 0 & \text{for } u < 0, \end{cases}$$
where the variable u is a measure of time in seconds. The mean of an exponentially distributed random variable with parameter $\lambda$ is $\frac{1}{\lambda}$, so $E[T] = \frac{1}{0.01} = 100$.
The waiting time in minutes is given by $S = \frac{T}{60}$. By the linearity of expectation, $E[S] = \frac{E[T]}{60} = \frac{100}{60} = 1.66\ldots$. That is, the mean waiting time is 100 seconds, or 1.666... minutes. By the scaling formula with a = 1/60 and b = 0,
$$f_S(v) = 60 f_T(60v) = \begin{cases} (0.6)e^{-(0.6)v} & \text{for } v \geq 0 \\ 0 & \text{for } v < 0, \end{cases}$$
where v is a measure of time in minutes. Examining this pdf shows that S is exponentially distributed with parameter 0.60. From this fact, we can find the mean of S a second way: it is one over the parameter in the exponential distribution for S, namely $\frac{1}{0.6}$ = 1.666..., as already noted.
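The change of units in Example 3.6.2 is a direct instance of the scaling rule (3.5) with a = 1/60 and b = 0, and it is easy to confirm by simulation. A sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(2)
T = rng.exponential(scale=1/0.01, size=200_000)   # waiting times in seconds, mean 100
S = T / 60                                        # the same waiting times in minutes

print(T.mean())       # about 100 seconds
print(S.mean())       # about 1.67 minutes
# For an exponential random variable the parameter is 1/mean, so S has parameter ~0.6.
print(1 / S.mean())
```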
Example 3.6.3 Let X be a uniformly distributed random variable on some interval [a, b]. Find the distribution of the standardized random variable $\frac{X - \mu_X}{\sigma_X}$.
Solution: The mean, $\mu_X$, is the midpoint of the interval [a, b], and the standard deviation is $\sigma_X = \frac{b - a}{2\sqrt{3}}$ (see Section 3.3). The pdf for $X - \mu_X$ is obtained by shifting the pdf of X to be centered at zero. Thus, $X - \mu_X$ is uniformly distributed over the interval $\left[-\frac{b-a}{2}, \frac{b-a}{2}\right]$. When this random variable is divided by $\sigma_X$, the resulting pdf is shrunk horizontally by the factor $\sigma_X$. This results in a uniform distribution over the interval $\left[-\frac{b-a}{2\sigma_X}, \frac{b-a}{2\sigma_X}\right] = [-\sqrt{3}, \sqrt{3}]$. This makes sense, because the uniform distribution over the interval $[-\sqrt{3}, \sqrt{3}]$ is the unique uniform distribution with mean zero and variance one.
3.6.2
The Gaussian distribution, also known as the normal distribution, has the pdf
$$f(u) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(u - \mu)^2}{2\sigma^2}\right),$$
where the parameters are $\mu$ and $\sigma^2$. The distribution is often denoted as the $N(\mu, \sigma^2)$ distribution, which is read aloud as "the normal $\mu$, $\sigma^2$ distribution." Later in this section it is shown that the pdf integrates to one, the mean is $\mu$, and the variance is $\sigma^2$. The pdf is pictured in Figure 3.10.

Figure 3.10: The $N(\mu, \sigma^2)$ pdf, with peak value $\frac{1}{\sqrt{2\pi\sigma^2}}$ at $u = \mu$; about 68.3% of the probability mass lies within $\sigma$ of $\mu$, about 13.6% in each band between $\sigma$ and $2\sigma$ from $\mu$, and about 2.3% in each tail beyond $2\sigma$.

The CDF of the standard normal distribution, N(0, 1), is traditionally denoted by $\Phi$:
$$\Phi(u) = \int_{-\infty}^{u} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{v^2}{2}\right)dv,$$
and the complementary CDF of the N(0, 1) distribution is traditionally denoted by the letter Q (at least in much of the systems engineering literature). So
$$Q(u) = \int_u^{\infty} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{v^2}{2}\right)dv = 1 - \Phi(u) = \Phi(-u).$$
Since $Q(u) = 1 - \Phi(u) = \Phi(-u)$, any probabilities we can express using the Q function we can also express using the $\Phi$ function. There is no hard and fast rule for whether to use the $\Phi$ function or the Q function, but typically we use Q(u) for values of u that are larger than three or four, and $\Phi(u)$ for smaller positive values of u. When we deal with probabilities that are close to one it is usually more convenient to represent them as one minus something, for example writing $1 - 8.5\times 10^{-6}$ instead of 0.9999915. The $\Phi$ and Q functions are available on many programmable calculators and on Internet websites. Some numerical values of these functions are given in Tables 6.1 and 6.2, in the appendix.
Let $\mu$ be any number and $\sigma > 0$. If X is a standard Gaussian random variable, and $Y = \sigma X + \mu$, then Y is a $N(\mu, \sigma^2)$ random variable. Indeed,
$$f_X(u) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{u^2}{2}\right),$$
so by the scaling rule (3.5),
$$f_Y(v) = \frac{1}{\sigma}f_X\left(\frac{v - \mu}{\sigma}\right) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(v - \mu)^2}{2\sigma^2}\right),$$
so $f_Y$ is indeed the $N(\mu, \sigma^2)$ pdf. Graphically, this means that the $N(\mu, \sigma^2)$ pdf can be obtained from the standard normal pdf by stretching it horizontally by a factor $\sigma$, shrinking it vertically by a factor $\sigma$, and sliding it over by $\mu$.
Working in the other direction, if Y has the $N(\mu, \sigma^2)$ distribution, then the standardized version of Y, namely $X = \frac{Y - \mu}{\sigma}$, is a standard normal random variable. Graphically, this means that the standard normal pdf can be obtained from the $N(\mu, \sigma^2)$ pdf by sliding it over by $-\mu$ (so it becomes centered at zero), shrinking it by a factor $\sigma$ horizontally and stretching it by a factor $\sigma$ vertically.
Let's check that the $N(\mu, \sigma^2)$ density indeed integrates to one, has mean $\mu$, and variance $\sigma^2$. To show that the normal density integrates to one, it suffices to check that the standard normal density integrates to one, because the density for general $\mu$ and $\sigma^2$ is obtained from the standard normal pdf by the scaling rule, which preserves the total integral of the density. Let $I = \int_{-\infty}^{\infty} e^{-u^2/2}\,du$. Then, changing to polar coordinates,
$$I^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(u^2 + v^2)/2}\,du\,dv = \int_0^{2\pi}\int_0^{\infty} e^{-r^2/2}\,r\,dr\,d\theta = 2\pi\int_0^{\infty} e^{-r^2/2}\,r\,dr = 2\pi\left(-e^{-r^2/2}\right)\Big|_0^{\infty} = 2\pi.$$
Therefore, $I = \sqrt{2\pi}$, which means the standard normal density integrates to one, as claimed.
The fact that $\mu$ is the mean of the $N(\mu, \sigma^2)$ density follows from the fact that the density is symmetric about the point $\mu$. To check that $\sigma^2$ is indeed the variance of the $N(\mu, \sigma^2)$ density, first note that if X is a standard normal random variable, then LOTUS and integration by parts yield
$$E[X^2] = \int_{-\infty}^{\infty} \frac{u^2}{\sqrt{2\pi}}\exp\left(-\frac{u^2}{2}\right)du = \int_{-\infty}^{\infty} u\cdot\frac{u}{\sqrt{2\pi}}\exp\left(-\frac{u^2}{2}\right)du = \left.-\frac{u}{\sqrt{2\pi}}\exp\left(-\frac{u^2}{2}\right)\right|_{-\infty}^{\infty} + \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{u^2}{2}\right)du = 0 + 1 = 1.$$
Since X has mean zero, the variance, Var(X), of a standard normal random variable is one. Finally, a random variable Y with the $N(\mu, \sigma^2)$ distribution can be written as $\sigma X + \mu$, where X is a standard normal random variable, so by the scaling formula for variances, $\mathrm{Var}(Y) = \sigma^2\,\mathrm{Var}(X) = \sigma^2$, as claimed.
Example 3.6.4 Let X have the N (10, 16) distribution (i.e. Gaussian distribution with mean 10
and variance 16). Find the numerical values of the following probabilities:
$P\{X \geq 15\}$, $P\{X \leq 5\}$, $P\{X^2 \geq 400\}$, and $P\{X = 2\}$.
Solution: The idea is to use the fact that $\frac{X - 10}{4}$ is a standard normal random variable, and use either the $\Phi$ or Q function.
$$P\{X \geq 15\} = P\left\{\frac{X - 10}{4} \geq \frac{15 - 10}{4}\right\} = Q(1.25) = 1 - \Phi(1.25) \approx 0.1056.$$
$$P\{X \leq 5\} = P\left\{\frac{X - 10}{4} \leq \frac{5 - 10}{4}\right\} = \Phi(-1.25) = Q(1.25) \approx 0.1056.$$
$$P\{X^2 \geq 400\} = P\{X \geq 20\} + P\{X \leq -20\} = P\left\{\frac{X - 10}{4} \geq 2.5\right\} + P\left\{\frac{X - 10}{4} \leq -7.5\right\} = Q(2.5) + Q(7.5) \approx Q(2.5) = 1 - \Phi(2.5) \approx 0.0062.$$
(Note: $Q(7.5) < 10^{-12}$.) Finally, $P\{X = 2\} = 0$, because X is a continuous-type random variable.
Example 3.6.5 Suppose X has mean 10 and variance 3. Find $P\{X < 8.27\}$ under the assumption that (a) X is Gaussian, (b) X is uniformly distributed.
Solution: (a) If X has the N(10, 3) distribution, then $\frac{X - 10}{\sqrt{3}}$ is a standard normal random variable, and $8.27 \approx 10 - \sqrt{3}$. Therefore,
$$P\{X < 8.27\} = P\left\{\frac{X - 10}{\sqrt{3}} \leq -1\right\} = \Phi(-1) = 1 - \Phi(1) = Q(1) \approx 0.1587.$$
(b) A random variable uniformly distributed on [a, b] has mean $\frac{a+b}{2}$ and variance $\frac{(b-a)^2}{12}$. Hence, a + b = 20 and b - a = 6, giving a = 7, b = 13. That is, X is uniformly distributed over [7, 13]. Therefore, $P\{X < 8.27\} = \frac{8.27 - 7}{6} \approx 0.211$.
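Probabilities like those in the two examples above are usually computed with software rather than tables. A sketch assuming SciPy (norm.sf is the complementary CDF, our stand-in for the Q function):

```python
from scipy.stats import norm

Phi = norm.cdf
Q = norm.sf                         # complementary CDF of N(0, 1)

# X ~ N(10, 16), so sigma = 4
print(Q((15 - 10) / 4))             # P{X >= 15}, about 0.1056
print(Phi((5 - 10) / 4))            # P{X <= 5}, about 0.1056
print(Q(2.5) + Q(7.5))              # P{X^2 >= 400}, about 0.0062

# X with mean 10 and variance 3, Gaussian model
print(Phi((8.27 - 10) / 3**0.5))    # about 0.159
```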
In some applications the distribution of a random variable seems nearly Gaussian, but the
random variable is also known to take values in some interval. One way to model the situation is
used in the following example.
Example 3.6.6 Suppose the random variable X has pdf
$$f_X(u) = \begin{cases} \frac{K}{2\sqrt{2\pi}}\exp\left(-\frac{(u - 2)^2}{8}\right) & 0 \leq u \leq 4, \\ 0 & \text{otherwise,} \end{cases}$$
where K is a constant to be determined. Determine K, the CDF $F_X$, and the mean E[X].
Solution: Note that for $u \in [0, 4]$, $f_X(u) = K f_Z(u)$, where Z has the $N(\mu = 2, \sigma^2 = 4)$ distribution. That is, $f_X$ is obtained by truncating $f_Z$ to the interval [0, 4] (i.e. setting it to zero outside the interval) and then multiplying it by a constant K > 1 so $f_X$ integrates to one. Therefore,
$$1 = \int_0^4 f_X(u)\,du = K\int_0^4 f_Z(u)\,du = K\,P\{0 \leq Z \leq 4\} = K\,P\left\{\frac{0 - 2}{2} \leq \frac{Z - 2}{2} \leq \frac{4 - 2}{2}\right\} = K\,P\left\{-1 \leq \frac{Z - 2}{2} \leq 1\right\}$$
$$= (\Phi(1) - \Phi(-1))K = (\Phi(1) - (1 - \Phi(1)))K = (2\Phi(1) - 1)K = (0.6826)K,$$
so $K = \frac{1}{0.6826} \approx 1.465$. For the CDF, clearly $F_X(v) = 0$ for $v \leq 0$ and $F_X(v) = 1$ for $v \geq 4$. For $0 < v \leq 4$,
$$F_X(v) = \int_0^v f_X(u)\,du = K\,P\{0 \leq Z \leq v\} = \left(\Phi\left(\frac{v - 2}{2}\right) - \Phi(-1)\right)K = \left(\Phi\left(\frac{v - 2}{2}\right) - 0.1587\right)K.$$
Thus,
$$F_X(v) = \begin{cases} 0 & \text{if } v \leq 0 \\ \left(\Phi\left(\frac{v - 2}{2}\right) - 0.1587\right)K & \text{if } 0 < v \leq 4 \\ 1 & \text{if } v \geq 4. \end{cases}$$
Finally, as for finding E[X], since the pdf $f_X(u)$ is symmetric about the point u = 2, and $\int_0^{\infty} u f_X(u)\,du$ and $\int_{-\infty}^{0} u f_X(u)\,du$ are both finite, it follows that E[X] = 2.
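The constant K and the truncated CDF are easy to evaluate numerically; a sketch assuming SciPy (the function name F_X is ours):

```python
from scipy.stats import norm

mu, sigma = 2.0, 2.0                                     # Z ~ N(2, 4)
mass = norm.cdf(4, mu, sigma) - norm.cdf(0, mu, sigma)   # P{0 <= Z <= 4} = 2*Phi(1) - 1
K = 1 / mass
print(mass, K)                                           # about 0.6827 and 1.465

def F_X(v):
    # CDF of Z truncated to [0, 4]
    if v <= 0:
        return 0.0
    if v >= 4:
        return 1.0
    return K * (norm.cdf(v, mu, sigma) - norm.cdf(0, mu, sigma))

print(F_X(2.0))   # 0.5, by symmetry about u = 2
```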
3.6.3
The Gaussian distribution arises frequently in practice, because of the phenomenon known as the central limit theorem (CLT), and the associated Gaussian approximation. There are many mathematical formulations of the CLT which differ in various details, but the main idea is the following: if many independent random variables are added together, and if each of them is small in magnitude compared to the sum, then the sum has an approximately Gaussian distribution. That is, if the sum is X, and if $\tilde{X}$ is a Gaussian random variable with the same mean and variance as X, then X and $\tilde{X}$ have approximately the same CDF:
$$P\{X \leq v\} \approx P\{\tilde{X} \leq v\} \qquad \text{(Gaussian approximation).}$$
An important special case is when X is the sum of n Bernoulli random variables, each having the same parameter p; in other words, when X has the binomial distribution with parameters n and p. The approximation is most accurate if both np and n(1 - p) are at least moderately large, and the probabilities being approximated are not extremely close to zero or one. As an example, suppose X has the binomial distribution with parameters n = 10 and p = 0.2. Then E[X] = np = 2 and Var(X) = np(1 - p) = 1.6. Let $\tilde{X}$ be a Gaussian random variable with the same mean and variance as X. The CDFs of X and $\tilde{X}$ are shown in Figure 3.11. The two CDFs cannot be close everywhere because the CDF of X is piecewise constant with jumps at integer points, and the CDF of $\tilde{X}$ is continuous. Notice, however, that the functions are particularly close when the argument is halfway between two consecutive integers. For example, $P\{X \leq 2.5\} = 0.6778$ and $P\{\tilde{X} \leq 2.5\} = 0.6537$. Since X is an integer-valued random variable, $P\{X \leq 2.5\} = P\{X \leq 2\}$. Therefore, we have
$$P\{X \leq 2\} = P\{X \leq 2.5\} \approx P\{\tilde{X} \leq 2.5\}.$$
So, $P\{\tilde{X} \leq 2.5\}$ is a fairly good approximation to $P\{X \leq 2\}$. In particular, it is a better approximation than $P\{\tilde{X} \leq 2\}$ is; in this case $P\{\tilde{X} \leq 2\} = 0.5$.
Figure 3.11: The CDF of a binomial random variable with parameters n = 10 and p = 0.2, and the
Gaussian approximation of it.
In general, when X is an integer-valued random variable, the Gaussian approximation with the continuity correction is:
$$P\{X \leq k\} \approx P\{\tilde{X} \leq k + 0.5\}, \qquad P\{X \geq k\} \approx P\{\tilde{X} \geq k - 0.5\}.$$
A simple way to remember the continuity correction is to think about how the pmf of X could be approximated. Namely, for any integer k, we ideally have $P\{X = k\} \approx \int_{k - 0.5}^{k + 0.5} f_{\tilde{X}}(u)\,du$. So, for example, if we add this equation over k = 9, 10, 11, we get $P\{9 \leq X \leq 11\} \approx \int_{8.5}^{11.5} f_{\tilde{X}}(u)\,du = P\{8.5 \leq \tilde{X} \leq 11.5\}$. Similarly, $P\{0 \leq X < 3\} \approx \int_{-0.5}^{2.5} f_{\tilde{X}}(u)\,du = P\{-0.5 \leq \tilde{X} \leq 2.5\}$.
The example just given shows that, with the continuity correction, the Gaussian approximation is fairly accurate even for small values of n. The approximation improves as n increases. The CDFs of X and $\tilde{X}$ are shown in Figure 3.12 in case X has the binomial distribution with parameters n = 30 and p = 0.2.
As mentioned earlier, the Gaussian approximation is backed by various forms of the central limit theorem (CLT). Historically, the first version of the CLT proved is the following theorem, which pertains to the case of binomial distributions. Recall that if n is a positive integer and 0 < p < 1, a binomial random variable $S_{n,p}$ with parameters n and p has mean np and variance np(1 - p). Therefore, the standardized version of $S_{n,p}$ is $\frac{S_{n,p} - np}{\sqrt{np(1-p)}}$.

Theorem 3.6.7 (DeMoivre-Laplace limit theorem) Suppose $S_{n,p}$ is a binomial random variable with parameters n and p. For p fixed, with 0 < p < 1, and any constant c,
$$\lim_{n \to \infty} P\left\{\frac{S_{n,p} - np}{\sqrt{np(1 - p)}} \leq c\right\} = \Phi(c).$$
Figure 3.12: The CDF of a binomial random variable with parameters n = 30 and p = 0.2, and the
Gaussian approximation of it.
The practical implication of the DeMoivre-Laplace limit theorem is that, for large n, the standardized version of Sn,p has approximately the standard normal distribution, or equivalently, that Sn,p
has approximately the same CDF as a Gaussian random variable with the same mean and variance.
That is, the DeMoivre-Laplace limit theorem gives evidence that the Gaussian approximation to
the binomial distribution is a good one. A more general version of the CLT is stated in Section
4.10.2.
Example 3.6.8 (a) Suppose a fair coin is flipped a thousand times, and let X denote the number of times heads shows. Using the Gaussian approximation with the continuity correction,(5) find the approximate numerical value of K so that $P\{X \geq K\} \approx 0.01$. (b) Repeat, but now assume the coin is flipped a million times.
Solution: (a) The random variable X has the binomial distribution with parameters n = 1000 and p = 0.5. It thus has mean $\mu_X = np = 500$ and standard deviation $\sigma = \sqrt{np(1 - p)} = \sqrt{250} \approx 15.8$. By the Gaussian approximation with the continuity correction,
$$P\{X \geq K\} = P\{X \geq K - 0.5\} \approx P\left\{\frac{X - \mu_X}{\sigma} \geq \frac{K - 0.5 - \mu_X}{\sigma}\right\} = Q\left(\frac{K - 0.5 - 500}{15.8}\right).$$
Setting this equal to 0.01 and using $Q(2.325) \approx 0.01$ yields $K \approx 500 + 0.5 + (2.325)(15.8) \approx 537.26$. Thus, K = 537 or K = 538 should do. So, if the coin is flipped a thousand times, there is about a one percent chance that heads shows for more than 53.7% of the flips.
(b) By the same reasoning, we should take $K \approx \mu + 2.325\sigma + 0.5$, where, for $n = 10^6$, $\mu = 500000$ and $\sigma = \sqrt{250{,}000} = 500$. Thus, K = 501163 should do. So, if the coin is flipped a million times, there is about a one percent chance that heads shows for more than 50.12% of the flips.

(5) Use of the continuity correction is specified here to be definite, although n is so large that the continuity correction is not very important.
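The value of K in Example 3.6.8(a) can be cross-checked against the exact binomial distribution. A sketch assuming SciPy (norm.isf is the inverse complementary CDF, binom.sf the binomial tail):

```python
from scipy.stats import norm, binom

n, p = 1000, 0.5
mu = n * p
sigma = (n * p * (1 - p)) ** 0.5

# Gaussian approximation with continuity correction:
# P{X >= K} ~ Q((K - 0.5 - mu)/sigma) = 0.01  =>  K ~ mu + 0.5 + 2.33*sigma
K = mu + 0.5 + norm.isf(0.01) * sigma
print(K)                      # about 537.3

# Exact binomial tails near that K
print(binom.sf(536, n, p))    # P{X >= 537}, a bit above 0.01
print(binom.sf(537, n, p))    # P{X >= 538}, a bit below 0.01
```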
Example 3.6.9 Suppose a testing service is to administer a standardized test at a given location
on a certain test date, and only students who have preregistered can take the test. Suppose 60 students preregistered, and each of those students actually shows up to take the test with probability
p = 5/6. Let X denote the number of preregistered students showing up to take the test.
(a) Using the Gaussian approximation with the continuity correction, find the approximate numerical value of $P\{X \leq 52\}$.
(b) Similarly, find the approximate numerical value of $P\{|X - 50| \geq 6\}$.
(c) Find the Chebychev upper bound on $P\{|X - 50| \geq 6\}$. Compare it to your answer to part (b).
Solution: (a) The random variable X has the binomial distribution with parameters n = 60 and p = 5/6. It thus has mean $\mu_X = np = 50$ and standard deviation $\sigma_X = \sqrt{np(1 - p)} \approx 2.887$. Thus, $\frac{X - 50}{2.887}$ is approximately a standard normal random variable. Therefore,
$$P\{X \leq 52\} = P\{X \leq 52.5\} = P\left\{\frac{X - 50}{2.887} \leq \frac{52.5 - 50}{2.887}\right\} = P\left\{\frac{X - 50}{2.887} \leq 0.866\right\} \approx \Phi(0.866) \approx 0.807.$$
The Gaussian approximation in this case is fairly accurate: numerical calculation, using the binomial distribution, yields that $P\{X \leq 52\} \approx 0.8042$.
(b)
$$P\{|X - 50| \geq 6\} = P\{X \leq 44 \text{ or } X \geq 56\} = P\{X \leq 44.5 \text{ or } X \geq 55.5\} = P\left\{\frac{X - 50}{2.887} \leq -1.905 \text{ or } \frac{X - 50}{2.887} \geq 1.905\right\} \approx 2Q(1.905) \approx 0.0567.$$
The Gaussian approximation in this case is fairly accurate: numerical calculation, using the binomial distribution, yields that $P\{|X - 50| \geq 6\} \approx 0.0540$.
(c) The Chebychev inequality implies that $P\{|X - 50| \geq 6\} \leq \frac{\sigma_X^2}{36} \approx 0.23$, which is roughly four times larger than the value found in part (b).
Example 3.6.10 A campus network engineer would like to estimate the fraction p of packets going over the fiber optic link from the campus network to Bardeen Hall that are digital video disk (DVD) packets. The engineer writes a script to examine n packets, counts the number X that are DVD packets, and uses $\hat{p} = \frac{X}{n}$ to estimate p. The inspected packets are separated by hundreds of other packets, so it is reasonable to assume that each packet is a DVD packet with probability p, independently of the other packets.
(a) Using the Gaussian approximation to the binomial distribution, find an approximation to $P\{|\hat{p} - p| \leq \delta\}$ as a function of p, n, and $\delta$. Evaluate it for p = 0.5 and for p = 0.1, with $\delta$ = 0.02 and n = 1000.
(b) If p = 0.5 and n = 1000, find $\delta$ so that $P\{|\hat{p} - p| \leq \delta\} \geq 0.99$, or equivalently, $P\{p \in [\hat{p} - \delta, \hat{p} + \delta]\} \geq 0.99$. Note that p is not random, but the confidence interval $[\hat{p} - \delta, \hat{p} + \delta]$ is random. So we want to find the half-width $\delta$ of the interval so we have 99% confidence that the interval will contain the true value of p.
(c) Repeat part (b), but for p = 0.1.
(d) However, the campus network engineer doesn't know p to begin with, so she can't select the half-width of the confidence interval as a function of p. A reasonable approach is to select $\delta$ so that the Gaussian approximation to $P\{p \in [\hat{p} - \delta, \hat{p} + \delta]\}$ is greater than or equal to 0.99 for any value of p. Find such a $\delta$ for n = 1000.
(e) Using the same approach as in part (d), what n is needed (not depending on p) so that the random confidence interval $[\hat{p} - 0.01, \hat{p} + 0.01]$ contains p with probability at least 0.99 (according to the Gaussian approximation of the binomial)?
Solution: (a) Since X has the binomial distribution with parameters n and p, the Gaussian approximation yields
$$P\{|\hat{p} - p| \leq \delta\} = P\left\{\left|\frac{X}{n} - p\right| \leq \delta\right\} = P\left\{\left|\frac{X - np}{\sqrt{np(1 - p)}}\right| \leq \delta\sqrt{\frac{n}{p(1 - p)}}\right\} \approx 2\Phi\left(\delta\sqrt{\frac{n}{p(1 - p)}}\right) - 1.$$
For n = 1000, $\delta$ = 0.02, and p = 0.5, this is equal to $2\Phi(1.265) - 1 \approx 0.794 = 79.4\%$, and for n = 1000, $\delta$ = 0.02, and p = 0.1, this is equal to $2\Phi(2.108) - 1 \approx 0.965 = 96.5\%$.
(b) Select $\delta$ so that $2\Phi\left(\delta\sqrt{\frac{1000}{p(1 - p)}}\right) - 1 = 0.99$. Observing from Table 6.1 for the standard normal CDF that $2\Phi(2.58) - 1 \approx 0.99$, we select $\delta$ so that $\delta\sqrt{\frac{1000}{p(1 - p)}} = 2.58$, or $\delta = 2.58\sqrt{\frac{p(1 - p)}{1000}}$. The 99% confidence interval for p = 0.5 requires $\delta = 2.58\sqrt{\frac{0.5(1 - 0.5)}{1000}} \approx 0.04$.
(c) Similarly, the 99% confidence interval for p = 0.1 requires $\delta = 2.58\sqrt{\frac{0.1(0.9)}{1000}} \approx 0.025$.
(d) The product p(1 - p), and hence the required $\delta$, is maximized by p = 0.5. Thus, if $\delta$ = 0.04 as found in part (b), then the confidence interval contains p with probability at least 0.99 (no matter what the value of p is), at least up to the accuracy of the Gaussian approximation.
(e) The value of n needed for p = 0.5 works for any p (the situation is similar to that in part (d)), so n needs to be selected so that $0.01 = 2.58\sqrt{\frac{0.5(1 - 0.5)}{n}}$. This yields $n = \left(\frac{2.58}{0.01}\right)^2(0.5)(1 - 0.5) \approx 16{,}641$.
3.7
As discussed in Section 2.8 for discrete-type random variables, sometimes when we devise a probability model for some situation we have a reason to use a particular type of probability distribution, but there may be a parameter that has to be selected. A common approach is to collect some data and then estimate the parameter using the observed data. For example, suppose the parameter is $\theta$, and suppose that an observation is given with pdf $f_\theta$ that depends on $\theta$. Section 2.8 suggests estimating $\theta$ by the value that maximizes the probability that the observed value is u. But for continuous-type observations, the probability of a specific observation u is zero, for any value of $\theta$. However, if $f_\theta$ is a continuous pdf for each value of $\theta$, then recall from the interpretation, (3.4), of a pdf, that for $\epsilon$ sufficiently small, $f_\theta(u) \approx \frac{1}{\epsilon}P_\theta\left\{u - \frac{\epsilon}{2} < X < u + \frac{\epsilon}{2}\right\}$. That is, $f_\theta(u)$ is proportional to the probability that the observation is in an $\epsilon$-width interval centered at u, where the constant of proportionality, namely $\frac{1}{\epsilon}$, is the same for all $\theta$. Following tradition, in this context, we call $f_\theta(u)$ the likelihood of the observation u. The maximum likelihood estimate of $\theta$ for observation u, denoted by $\hat{\theta}_{ML}(u)$, is defined to be the value of $\theta$ that maximizes the likelihood, $f_\theta(u)$, with respect to $\theta$.

Example 3.7.1 Suppose a random variable T has the exponential distribution with parameter $\lambda$, and suppose it is observed that T = t, for some fixed value of t. Find the ML estimate, $\hat{\lambda}_{ML}(t)$, of $\lambda$, based on the observation T = t.
Solution: The estimate, $\hat{\lambda}_{ML}(t)$, is the value of $\lambda > 0$ that maximizes $\lambda e^{-\lambda t}$ with respect to $\lambda$, for t fixed. Since
$$\frac{d(\lambda e^{-\lambda t})}{d\lambda} = (1 - \lambda t)e^{-\lambda t},$$
the likelihood is increasing in $\lambda$ for $0 \leq \lambda \leq \frac{1}{t}$ and it is decreasing in $\lambda$ for $\lambda \geq \frac{1}{t}$, so the likelihood is maximized at $\lambda = \frac{1}{t}$. That is, $\hat{\lambda}_{ML}(t) = \frac{1}{t}$.

Example 3.7.2 Suppose it is assumed that X is drawn at random from the uniform distribution on the interval [0, b], where b is a parameter to be estimated. Find the ML estimator of b given X = u is observed. (This is the continuous-type distribution version of Example 2.8.2.)
Solution: The pdf can be written as $f_b(u) = \frac{1}{b}I_{\{0 \leq u \leq b\}}$. Recall that $I_{\{0 \leq u \leq b\}}$ is the indicator function of $\{0 \leq u \leq b\}$, equal to one on that set and equal to zero elsewhere. The whole idea now is to think of $f_b(u)$ not as a function of u (because u is the given observation), but rather as a function of b. It is zero if b < u; it jumps up to $\frac{1}{u}$ at b = u; as b increases beyond u the function decreases. It is thus maximized at b = u. That is, $\hat{b}_{ML}(u) = u$.
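Both ML estimates can also be recovered numerically by maximizing the likelihood over a grid of parameter values, which is a useful check when no closed form is available. A sketch assuming NumPy (the grid search is ours, not a method from the text):

```python
import numpy as np

# Example 3.7.1: observe T = t for an Exponential(lambda) random variable.
t = 2.5
lams = np.linspace(0.01, 5, 10_000)
likelihood = lams * np.exp(-lams * t)
print(lams[np.argmax(likelihood)], 1 / t)     # both about 0.4

# Example 3.7.2: observe X = u for a Uniform[0, b] random variable.
u = 1.7
bs = np.linspace(0.01, 5, 10_000)
likelihood_b = np.where(bs >= u, 1 / bs, 0.0)
print(bs[np.argmax(likelihood_b)], u)         # both about 1.7
```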
3.8
3.8.1
Often one random variable is a function of another. Suppose Y = g(X) for some function g and
a random variable X with a known pdf, and suppose we want to describe the distribution of Y. A
general way to approach this problem is the following three step procedure:
Step 1: Scope the problem. Identify the support of X. Sketch the pdf of X and sketch g. Identify the support of Y. Determine whether Y is a continuous-type or discrete-type random variable. Then take a breath: you've done very important ground work here.
Step 2: Find the CDF of Y. Use the definition of the CDF: for any constant c, $F_Y(c) = P\{Y \leq c\} = P\{g(X) \leq c\}$. In order to find the probability of the event $\{g(X) \leq c\}$, try to describe it in a way that involves X in a simple way. In Step 2 most of the work is usually in finding $F_Y(c)$ for values of c that are in the support of Y.
Step 3: Differentiate $F_Y$ to find its derivative, which is $f_Y$. Typically the pdf gives a more intuitive idea about the distribution of Y than the CDF.
The above three step procedure addresses the case that Y is continuous-type. If the function g is piecewise constant, such as a quantizer function, then Y is a discrete-type random variable. In that case, Steps 2 and 3 should be replaced by the following single step:
Step 2: (if Y is discrete-type) Find the pmf of Y. Work with the definition of the pmf: $p_Y(v) = P\{Y = v\} = P\{g(X) = v\}$. In order to find the probability of the event $\{g(X) = v\}$, try to describe it in a way that involves X in a simple way. Basically, $P\{g(X) = v\} = \int_{\{u : g(u) = v\}} f_X(u)\,du$. Here $p_Y(v)$ only needs to be identified for v in the support of Y. Usually the pmf is the desired answer and there is no need to find the CDF.
Example 3.8.1 Suppose $Y = X^2$, where X is a random variable with pdf $f_X(u) = \frac{e^{-|u|}}{2}$ for $u \in \mathbb{R}$. Find the pdf, mean, and variance of Y.
Solution: Note that Y = g(X), where g is the function $g(u) = u^2$. Sketches of $f_X$ and g are shown in Figure 3.13. The support of X is the entire real line, and g maps the real line onto the nonnegative reals, so the support of Y is the nonnegative reals. Also, there is no value v such that $P\{Y = v\} > 0$. So Y is a continuous-type random variable and we will find its CDF. At this point we have
$$F_Y(c) = \begin{cases} 0 & c < 0 \\ ??? & c \geq 0; \end{cases}$$
we must find $F_Y(c)$ for $c \geq 0$. This completes Step 1, scoping the problem, so we take a breath.

Figure 3.13: The pdf $f_X(u) = e^{-|u|}/2$ and the function $g(u) = u^2$.
Continuing, for $c \geq 0$,
$$F_Y(c) = P\{X^2 \leq c\} = P\{-\sqrt{c} \leq X \leq \sqrt{c}\} = \int_{-\sqrt{c}}^{\sqrt{c}} f_X(u)\,du = \int_{-\sqrt{c}}^{\sqrt{c}} \frac{\exp(-|u|)}{2}\,du = \int_0^{\sqrt{c}} \exp(-u)\,du = 1 - e^{-\sqrt{c}}.$$
Therefore,
$$F_Y(c) = \begin{cases} 0 & c < 0 \\ 1 - e^{-\sqrt{c}} & c \geq 0. \end{cases}$$
Finally, taking the derivative of $F_Y$ (using the chain rule of calculus) yields
$$f_Y(c) = \begin{cases} \frac{e^{-\sqrt{c}}}{2\sqrt{c}} & c > 0 \\ 0 & c \leq 0. \end{cases}$$
Note that the value of a pdf at a single point can be changed without changing the integrals of the pdf. In this case we decided to let $f_Y(0) = 0$.
To find the mean and variance of Y we could use the pdf just found. However, instead of that, we use LOTUS:
$$E[Y] = E[X^2] = \int_{-\infty}^{\infty} \frac{u^2 e^{-|u|}}{2}\,du = \int_0^{\infty} u^2 e^{-u}\,du = 2! = 2$$
and
$$E[Y^2] = E[X^4] = \int_{-\infty}^{\infty} (g(u))^2 f_X(u)\,du = \int_{-\infty}^{\infty} \frac{u^4 e^{-|u|}}{2}\,du = \int_0^{\infty} u^4 e^{-u}\,du = 4! = 24.$$
(Here we use the fact that $\int_0^{\infty} u^n e^{-u}\,du = n!$ for nonnegative integers n.) Therefore, $\mathrm{Var}(Y) = E[Y^2] - E[Y]^2 = 24 - 4 = 20$.

Example 3.8.2 Suppose $Y = X^2$, where X has the N(2, 3) distribution. Find the pdf of Y.
Solution: The support of Y is the set of nonnegative reals, so let $c \geq 0$. Since $\frac{X - 2}{\sqrt{3}}$ is a standard normal random variable,
$$F_Y(c) = P\{X^2 \leq c\} = P\{-\sqrt{c} \leq X \leq \sqrt{c}\} = P\left\{\frac{-\sqrt{c} - 2}{\sqrt{3}} \leq \frac{X - 2}{\sqrt{3}} \leq \frac{\sqrt{c} - 2}{\sqrt{3}}\right\} = \Phi\left(\frac{\sqrt{c} - 2}{\sqrt{3}}\right) - \Phi\left(\frac{-\sqrt{c} - 2}{\sqrt{3}}\right).$$
Differentiate with respect to c, using the chain rule and the fact $\Phi'(s) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{s^2}{2}\right)$, to obtain
$$f_Y(c) = \begin{cases} \frac{1}{\sqrt{24\pi c}}\left\{\exp\left(-\frac{(\sqrt{c} + 2)^2}{6}\right) + \exp\left(-\frac{(\sqrt{c} - 2)^2}{6}\right)\right\} & \text{if } c \geq 0 \\ 0 & \text{if } c < 0. \end{cases} \qquad (3.6)$$
Example 3.8.3 Suppose X is uniformly distributed over the interval [0, 3], and $Y = (X - 1)^2$. Find the CDF, pdf, and expectation of Y.
Solution. Since X ranges over the interval [0, 3], Y ranges over the interval [0, 4]. The expression for $F_Y(c)$ is qualitatively different for $0 \leq c \leq 1$ and $1 \leq c \leq 4$. In each case, $F_Y(c)$ is equal to one third the length of the interval of u values in [0, 3] satisfying $(u - 1)^2 \leq c$. For $0 \leq c \leq 1$,
$$F_Y(c) = P\{(X - 1)^2 \leq c\} = P\{1 - \sqrt{c} \leq X \leq 1 + \sqrt{c}\} = \frac{2\sqrt{c}}{3}.$$
For $1 \leq c \leq 4$,
$$F_Y(c) = P\{(X - 1)^2 \leq c\} = P\{0 \leq X \leq 1 + \sqrt{c}\} = \frac{1 + \sqrt{c}}{3}.$$
Therefore,
$$F_Y(c) = \begin{cases} 0 & c < 0 \\ \frac{2\sqrt{c}}{3} & 0 \leq c < 1 \\ \frac{1 + \sqrt{c}}{3} & 1 \leq c < 4 \\ 1 & c \geq 4. \end{cases}$$
Differentiating yields
$$f_Y(c) = \frac{dF_Y(c)}{dc} = \begin{cases} \frac{1}{3\sqrt{c}} & 0 \leq c < 1 \\ \frac{1}{6\sqrt{c}} & 1 \leq c < 4 \\ 0 & \text{else.} \end{cases}$$
By LOTUS,
$$E[Y] = E[(X - 1)^2] = \int_0^3 \frac{1}{3}(u - 1)^2\,du = 1.$$
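The CDF and mean found in Example 3.8.3 are easy to verify by simulation. A sketch assuming NumPy (the function F_Y is ours):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(0, 3, size=300_000)
Y = (X - 1) ** 2

def F_Y(c):
    # CDF derived in Example 3.8.3
    if c < 0:
        return 0.0
    if c < 1:
        return 2 * np.sqrt(c) / 3
    if c < 4:
        return (1 + np.sqrt(c)) / 3
    return 1.0

for c in [0.25, 0.5, 2.0, 3.5]:
    print(c, (Y <= c).mean(), F_Y(c))   # empirical vs. derived CDF
print(Y.mean())                         # about 1, matching E[Y] = 1
```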
Example 3.8.4 Suppose X is a continuous-type random variable. (a) Describe the distribution of -X in terms of $f_X$. (b) More generally, describe the distribution of aX + b in terms of $f_X$, for constants a and b with $a \neq 0$. (This generalizes Section 3.6.1, which covers only the case a > 0.)
Solution: (a) Let Y = -X, or equivalently, Y = g(X) where g(u) = -u. We shall find the pdf of Y after first finding the CDF. For any constant c, $F_Y(c) = P\{Y \leq c\} = P\{-X \leq c\} = P\{X \geq -c\} = 1 - F_X(-c)$. Differentiating with respect to c yields $f_Y(c) = f_X(-c)$. Geometrically, the graph of $f_Y$ is obtained by reflecting the graph of $f_X$ about the vertical axis.
(b) Suppose now that Y = aX + b. The pdf of Y in case a > 0 is given in Section 3.6.1. So suppose a < 0. Then $F_Y(c) = P\{aX + b \leq c\} = P\{aX \leq c - b\} = P\left\{X \geq \frac{c - b}{a}\right\} = 1 - F_X\left(\frac{c - b}{a}\right)$. Differentiating with respect to c yields
$$f_Y(c) = f_X\left(\frac{c - b}{a}\right)\frac{1}{|a|}, \qquad (3.7)$$
where we use the fact that $-a = |a|$ for a < 0. Actually, (3.7) is also true if a > 0, because in that case it is the same as (3.5). So (3.7) gives the pdf of Y = aX + b for any $a \neq 0$.
Example 3.8.5 Suppose a vehicle is traveling in a straight line at constant speed a, and that a random direction is selected, subtending an angle $\Theta$ from the direction of travel. Suppose $\Theta$ is uniformly distributed over the interval $[0, \pi]$. See Figure 3.15. Then the effective speed of the vehicle in the random direction is $B = a\cos(\Theta)$. Find the pdf of B.

Figure 3.15: Direction of travel and a random direction.

Solution: The range of $a\cos(\Theta)$ as $\Theta$ ranges over $[0, \pi]$ is the interval $[-a, a]$. Therefore, $F_B(c) = 0$ for $c \leq -a$ and $F_B(c) = 1$ for $c \geq a$. Let now $-a < c < a$. Then, because cos is monotone nonincreasing on the interval $[0, \pi]$,
$$F_B(c) = P\{a\cos(\Theta) \leq c\} = P\left\{\cos(\Theta) \leq \frac{c}{a}\right\} = P\left\{\Theta \geq \cos^{-1}\left(\frac{c}{a}\right)\right\} = 1 - \frac{\cos^{-1}\left(\frac{c}{a}\right)}{\pi}.$$
Differentiating with respect to c yields
$$f_B(c) = \begin{cases} \frac{1}{\pi\sqrt{a^2 - c^2}} & |c| < a \\ 0 & |c| > a. \end{cases}$$

Figure 3.16: The pdf of the effective speed in a uniformly distributed direction.

Example 3.8.6 Suppose $Y = \tan(\Theta)$, as illustrated in Figure 3.17, where $\Theta$ is uniformly distributed over the interval $\left(-\frac{\pi}{2}, \frac{\pi}{2}\right)$. Find the pdf of Y.

Figure 3.17: A horizontal line, a fixed point at unit distance, and a line through the point with random direction.

Solution: The function $\tan(\theta)$ increases from $-\infty$ to $\infty$ over the interval $\left(-\frac{\pi}{2}, \frac{\pi}{2}\right)$, so the support of $f_Y$ is the entire real line. For any real c,
$$F_Y(c) = P\{Y \leq c\} = P\{\tan(\Theta) \leq c\} = P\{\Theta \leq \tan^{-1}(c)\} = \frac{\tan^{-1}(c) + \frac{\pi}{2}}{\pi}.$$
Differentiating the CDF with respect to c yields that Y has the Cauchy distribution, with pdf
$$f_Y(c) = \frac{1}{\pi(1 + c^2)}, \qquad -\infty < c < \infty.$$
Example 3.8.7 Given an angle $\theta$ expressed in radians, let $(\theta \bmod 2\pi)$ denote the equivalent angle in the interval $[0, 2\pi]$. Thus, $(\theta \bmod 2\pi)$ is equal to $\theta + 2\pi n$, where the integer n is such that $0 \leq \theta + 2\pi n < 2\pi$.
Let $\Theta$ be uniformly distributed over $[0, 2\pi]$, let h be a constant, and let $\tilde{\Theta} = ((\Theta + h) \bmod 2\pi)$. Find the distribution of $\tilde{\Theta}$.
Solution: By its definition, $\tilde{\Theta}$ takes values in the interval $[0, 2\pi]$, so fix c with $0 \leq c \leq 2\pi$ and seek to find $P\{\tilde{\Theta} \leq c\}$. Since h can be replaced by $(h \bmod 2\pi)$ if necessary, we can assume without loss of generality that $0 \leq h < 2\pi$. Then $\tilde{\Theta} = g(\Theta)$, where the function $g(u) = ((u + h) \bmod 2\pi)$ is graphed in Figure 3.18. Two cases are somewhat different. The first, shown in Figure 3.18(a), is that $0 \leq c \leq h$. In this case, $\tilde{\Theta} \leq c$ if $\Theta$ is in the interval $[2\pi - h, 2\pi - h + c]$, of length c. Therefore, in this case, $P\{\tilde{\Theta} \leq c\} = \frac{c}{2\pi}$. The other case is that $h < c \leq 2\pi$, shown in Figure 3.18(b). In this case, $\tilde{\Theta} \leq c$ if $\Theta$ is in the union of intervals $[2\pi - h, 2\pi] \cup [0, c - h]$, which has total length c. So, again, $P\{\tilde{\Theta} \leq c\} = \frac{c}{2\pi}$. Therefore, in either case, $P\{\tilde{\Theta} \leq c\} = \frac{c}{2\pi}$, so that $\tilde{\Theta}$ is itself uniformly distributed over $[0, 2\pi]$.

Figure 3.18: The function g such that $\tilde{\Theta} = g(\Theta)$, shown for (a) $0 \leq c \leq h$ and (b) $h < c \leq 2\pi$.

Angles can be viewed as points on the unit circle in the plane. The result of this example is that, if an angle is uniformly distributed on the unit circle, then the angle plus a constant is also uniformly distributed over the unit circle.
Example 3.8.8 Express the pdf of |X| in terms of the pdf of X, for an arbitrary continuous-type random variable X. Draw a sketch and give a geometric interpretation of the solution.
Solution: We seek the pdf of Y, where Y = g(X) and g(u) = |u|. The variable Y takes nonnegative values, and for $c \geq 0$, $F_Y(c) = P\{Y \leq c\} = P\{-c \leq X \leq c\} = F_X(c) - F_X(-c)$. Thus,
$$F_Y(c) = \begin{cases} F_X(c) - F_X(-c) & c \geq 0 \\ 0 & c \leq 0. \end{cases}$$
Differentiating to get the pdf yields
$$f_Y(c) = \begin{cases} f_X(c) + f_X(-c) & c \geq 0 \\ 0 & c < 0. \end{cases}$$
Basically, for each c > 0, there are two terms in the expression for $f_Y(c)$ because there are two ways for Y to be c: either X = c or X = -c. A geometric interpretation is given in Figure 3.19: the part of the graph of $f_X$ over the negative axis is reflected about the vertical axis and added to the part over the nonnegative axis to produce $f_Y$.

Figure 3.19: A geometric interpretation of the pdf of |X|.
Example 3.8.9 Let X be an exponentially distributed random variable with parameter $\lambda$. Let $Y = \lfloor X \rfloor$, which is the integer part of X, and let $R = X - \lfloor X \rfloor$, which is the remainder. Describe the distributions of Y and R, and find the limit of the pdf of R as $\lambda \to 0$.
Solution: The variable Y is a discrete-type random variable with support $\{0, 1, 2, \ldots\}$. For an integer $k \geq 0$, $P\{Y = k\} = P\{k \leq X < k + 1\} = e^{-\lambda k} - e^{-\lambda(k+1)} = e^{-\lambda k}(1 - e^{-\lambda})$. The remainder R takes values in the interval [0, 1]. For $0 \leq c \leq 1$,
$$F_R(c) = P\{R \leq c\} = \sum_{k=0}^{\infty} P\{k \leq X \leq k + c\} = \sum_{k=0}^{\infty}\int_k^{k+c}\lambda e^{-\lambda u}\,du = \sum_{k=0}^{\infty}\left(e^{-\lambda k} - e^{-\lambda(k+c)}\right) = \sum_{k=0}^{\infty}e^{-\lambda k}\left(1 - e^{-\lambda c}\right) = \frac{1 - e^{-\lambda c}}{1 - e^{-\lambda}}.$$
Differentiating with respect to c yields
$$f_R(c) = \begin{cases} \frac{\lambda e^{-\lambda c}}{1 - e^{-\lambda}} & 0 \leq c \leq 1 \\ 0 & \text{otherwise.} \end{cases}$$
The limit of $f_R$ as $\lambda \to 0$ is the pdf of the uniform distribution on the interval [0, 1]. Intuitively, the remainder R is nearly uniformly distributed over [0, 1] for small $\lambda$, because for such $\lambda$ the density of X is spread out over a large range of integers.
Example 3.8.10 This example illustrates many possible particular cases. Suppose X has a pdf $f_X$ which is supported on an interval [a, b]. Suppose Y = g(X) where g is a strictly increasing function mapping the interval (a, b) onto the interval (A, B). The situation is illustrated in Figure 3.21. The support of Y is the interval [A, B]. So let A < c < B. There is a value $g^{-1}(c)$ on the u axis such that $g(g^{-1}(c)) = c$, and
$$F_Y(c) = P\{g(X) \leq c\} = P\{X \leq g^{-1}(c)\} = F_X(g^{-1}(c)).$$
Differentiating with respect to c, using the chain rule and the fact that the derivative of $g^{-1}(c)$ is $\frac{1}{g'(g^{-1}(c))}$, yields
$$f_Y(c) = \frac{f_X(g^{-1}(c))}{g'(g^{-1}(c))}. \qquad (3.8)$$

Figure 3.21: A strictly increasing function g mapping the support [a, b] of $f_X$ onto [A, B].

A variation of this example would be to assume g is strictly decreasing. Then (3.8) holds with $g'(g^{-1}(c))$ replaced by $|g'(g^{-1}(c))|$. More generally, suppose g is continuously differentiable with no flat spots, but increasing on some intervals and decreasing on other intervals. For any real number c let $u_1, \ldots, u_n$ be the values of u such that $g(u_k) = c$. Here $n \geq 0$. Then (3.8) becomes $f_Y(c) = \sum_k f_X(u_k)\frac{1}{|g'(u_k)|}$.
Example 3.8.11 Suppose X is a continuous-type random variable with CDF FX . Let Y be the
result of applying FX to X, that is, Y = FX (X). Find the distribution of Y.
Solution: Since Y takes values in the interval [0, 1], let 0 < v < 1. Since $F_X$ increases continuously from zero to one, there is a value $c_v$ such that $F_X(c_v) = v$. Then $P\{F_X(X) \leq v\} = P\{X \leq c_v\} = F_X(c_v) = v$. That is, $F_X(X)$ is uniformly distributed over the interval [0, 1]. This result may seem
surprising at first, but it is natural if it is thought of in terms of percentiles. Consider, for example,
the heights of all the people within a large group. Ignore ties. A particular person from the
group is considered to be in the 90th percentile according to height, if the fraction of people in the
group shorter than that person is 90%. So 90% of the people are in the 90th percentile or smaller.
Similarly, 50% of the people are in the 50th percentile or smaller. So the percentile ranking of
a randomly selected person from the group is uniformly distributed over the range from zero to
one hundred percent. This result does not depend on the distribution of heights within the group.
For this example, Y can be interpreted as the rank of X (expressed as a fraction rather than as a
percentile) relative to the distribution assumed for X.
3.8.2

Suppose F is a CDF. Define the inverse of F by
$$F^{-1}(u) = \min\{c : F(c) \geq u\}, \qquad 0 < u < 1. \qquad (3.9)$$
If the graphs of F and $F^{-1}$ are closed up by adding vertical lines at jump points, then the graphs are reflections of each other about the line through the origin of slope one, as illustrated in Figure 3.22. It is not hard to check that for any real $c_o$ and $u_o$ with $0 < u_o < 1$,
$$F^{-1}(u_o) \leq c_o \quad\text{if and only if}\quad u_o \leq F(c_o).$$
Therefore, if U is uniformly distributed on the interval [0, 1], then $P\{F^{-1}(U) \leq c_o\} = P\{U \leq F(c_o)\} = F(c_o)$ for every real $c_o$; that is, $F^{-1}(U)$ has CDF F. This gives a way to generate a random variable with a specified CDF F starting from a uniformly distributed random variable.

Figure 3.22: The graphs of a CDF F and its inverse $F^{-1}$.
Example 3.8.13 Find a function g so that, if U is uniformly distributed over the interval [0, 1], then g(U) has the distribution of the number showing for the experiment of rolling a fair die.
Solution: The desired CDF of g(U) is shown in Figure 3.1. Using $g = F^{-1}$ and using (3.9) or the graphical method illustrated in Figure 3.22 to find $F^{-1}$, we get that for 0 < u < 1, g(u) = i for $\frac{i-1}{6} < u \leq \frac{i}{6}$, for $1 \leq i \leq 6$. To double check the answer, note that if $1 \leq i \leq 6$, then
$$P\{g(U) = i\} = P\left\{\frac{i - 1}{6} < U \leq \frac{i}{6}\right\} = \frac{1}{6},$$
so g(U) has the correct pmf, and hence the correct CDF.
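This recipe is exactly how discrete samples are produced from uniform random numbers in practice. A sketch assuming NumPy (the function g below is our implementation of $F^{-1}$ for the fair die):

```python
import numpy as np

rng = np.random.default_rng(4)

def g(u):
    # F^{-1} for a fair die: g(u) = i when (i-1)/6 < u <= i/6
    return int(np.ceil(6 * u))

rolls = [g(u) for u in rng.uniform(0, 1, size=60_000)]
counts = np.bincount(rolls)[1:7]
print(counts / len(rolls))    # each entry close to 1/6
```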
Example 3.8.14 (This example is a puzzle that doesn't use the theory of this section.) We've discussed how to generate a random variable with a specified distribution starting with a uniformly distributed random variable. Suppose instead that we would like to generate Bernoulli random variables with parameter p = 0.5 using flips of a biased coin. That is, suppose that the probability the coin shows H (for heads) is a, where a is some number that might not be precisely known. Explain how this coin can be used to generate independent Bernoulli random variables with p = 0.5.
Solution: Here is one solution. Flip the coin repeatedly, and look at the outcomes two at a time. If the first two outcomes are the same, ignore them. If they are different, then they are either HT (a head followed by a tail) or TH, and these two possibilities each have probability a(1 - a). In this case, give as output either H or T, whichever appeared first in those two flips. Then
$$P(\text{output} = H \mid \text{there is output after two flips}) = P(\text{first two flips are } HT \mid \text{first two flips are } HT \text{ or } TH) = \frac{P(\text{first two flips are } HT)}{P(\text{first two flips are } HT) + P(\text{first two flips are } TH)} = \frac{a(1 - a)}{2a(1 - a)} = 0.5.$$
Then repeat this procedure using the third and fourth flips, and so forth. One might wonder how many times the biased coin must be flipped on average to produce one output. The required number of pairs of flips has the geometric distribution, with parameter 2a(1 - a). Therefore, the mean number of pairs of coin flips required until an HT or TH is observed is 1/(2a(1 - a)). So the mean number of coin flips required to produce one Bernoulli random variable with parameter 0.5 using the biased coin by this method is 1/(a(1 - a)).
3.8.3
There is a simple rule for determining the expectation, E[X], of a random variable X directly from its CDF, called the area rule for expectation. See Figure 3.23, which shows an example of a CDF $F_X$ plotted as a function of c in the c-u plane.

Figure 3.23: E[X] is the area of the "+" region minus the area of the "-" region.

Consider the infinite strip bounded by the c axis and the horizontal line given by u = 1. Then E[X] is the area of the region in the strip to the right of the u axis above the CDF, minus the area of the region in the strip to the left of the u axis below the CDF, as long as at least one of the two regions has finite area. The area rule can also be written as an equation, by integrating over the c axis:
$$E[X] = \int_0^{\infty}(1 - F_X(c))\,dc - \int_{-\infty}^{0} F_X(c)\,dc, \qquad (3.10)$$
or, by integrating over the u axis,
$$E[X] = \int_0^1 F_X^{-1}(u)\,du. \qquad (3.11)$$
The area rule can be justified as follows. As noted in Section 3.8.2, if U is a random variable uniformly distributed on the interval [0, 1], then $F_X^{-1}(U)$ has the same distribution as X. Therefore, it also has the same expectation: $E[X] = E[F_X^{-1}(U)]$. Since $F_X^{-1}(U)$ is a function of U, its expectation can be found by LOTUS. Therefore, $E[X] = E[F_X^{-1}(U)] = \int_0^1 F_X^{-1}(u)\,du$, which proves (3.11), and hence the area rule itself, because the right hand sides of both (3.10) and (3.11) are equal to the area of the region in the strip to the right of the u axis above the CDF, minus the area of the region in the strip to the left of the u axis below the CDF.
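The area rule (3.10) is easy to check for a specific distribution, for example the exponential distribution with parameter $\lambda$, whose mean is $1/\lambda$. A sketch assuming SciPy:

```python
import math
from scipy.integrate import quad

lam = 0.5
F = lambda c: 1 - math.exp(-lam * c) if c >= 0 else 0.0

right, _ = quad(lambda c: 1 - F(c), 0, math.inf)   # area above the CDF, right of the u axis
left, _ = quad(F, -math.inf, 0)                    # area below the CDF, left of the u axis
print(right - left, 1 / lam)                       # both equal 2.0
```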
3.9
Eventually a system or a component of a particular system will fail. Let T be a random variable
that denotes the lifetime of this item. Suppose T is a positive random variable with pdf fT . The
failure rate function, h = (h(t) : t 0), of T (and of the item itself) is defined by the following
limit:
P (t < T t + |T > t)
4
h(t) = lim
.
0
That is, given the item is still working after t time units, the probability the item fails within the
next time units is h(t) + o().
The failure rate function is determined by the distribution of T as follows:
P {t < T t + }
FT (t + ) FT (t)
= lim
0
P {T > t}
(1 FT (t))
1
FT (t + ) FT (t)
lim
(1 FT (t)) 0
fT (t)
,
1 FT (t)
h(t) = lim
0
=
=
(3.12)
Rt
0
h(s)ds
(3.13)
139
It is easy to check that F given by (3.13) has failure rate function h. To derive (3.13), and hence show it gives the unique distribution with failure rate function h, start with the fact that we would like $F'/(1-F) = h$. Equivalently, $(-\ln(1-F))' = h$, or, integrating over [0, t], $-\ln(1-F(t)) = -\ln(1-F(0)) + \int_0^t h(s)\,ds$. Since F should be the CDF of a nonnegative continuous-type random variable, F(0) = 0, or $-\ln(1-F(0)) = 0$. So $-\ln(1-F(t)) = \int_0^t h(s)\,ds$, which is equivalent to (3.13).
For a given failure rate function h, the mean lifetime of the item can be computed by first computing the CDF by (3.13) and then using the area rule for expectation, (3.10), which, since the lifetime T is nonnegative, becomes $E[T] = \int_0^\infty (1 - F(t))\,dt$.
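The recipe just described, compute the CDF from h by (3.13) and then apply the area rule, can be carried out numerically. A minimal sketch, using a constant failure rate (for which the mean lifetime is known) as a check; the rate value and grid are illustrative choices:

```python
import numpy as np

h = lambda s: 1.5 * np.ones_like(s)      # illustrative failure rate function (constant 1.5)
dt = 1e-4
t = np.arange(0.0, 30.0, dt)
H = np.cumsum(h(t)) * dt                 # cumulative hazard: integral of h over [0, t]
F = 1.0 - np.exp(-H)                     # CDF of the lifetime, from (3.13)
E_T = np.sum(1.0 - F) * dt               # area rule: E[T] = integral of (1 - F)
print(E_T)                               # approximately 1/1.5 = 0.6667 for this h
```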
Example 3.9.1 (a) Find the failure rate function for an exponentially distributed random variable with parameter λ. (b) Find the distribution with the linear failure rate function h(t) = t/σ² for t ≥ 0. (c) Find the failure rate function of T = min{T1, T2}, where T1 and T2 are independent random variables such that T1 has failure rate function h1 and T2 has failure rate function h2.
Solution: (a) If T has the exponential distribution with parameter λ, then for t ≥ 0, f_T(t) = λe^{−λt} and 1 − F_T(t) = e^{−λt}, so by (3.12), h(t) = λ for all t ≥ 0. That is, the exponential distribution with parameter λ has constant failure rate λ. The constant failure rate property is connected with the memoryless property of the exponential distribution; the memoryless property implies that P(t < T ≤ t + ε | T > t) = P{T ≤ ε}, which in view of the definition of h shows that h is constant.
(b) If h(t) = t/σ² for t ≥ 0, then by (3.13), F_T(t) = 1 − e^{−t²/(2σ²)} for t ≥ 0, so
$$f_T(t) = \begin{cases} \frac{t}{\sigma^2} e^{-t^2/(2\sigma^2)} & t \ge 0 \\ 0 & \text{else.} \end{cases}$$
That is, T has the Rayleigh distribution with parameter σ².
(c) Since T1 and T2 are independent,
$$P\{T > t\} = P\{T_1 > t \text{ and } T_2 > t\} = P\{T_1 > t\}P\{T_2 > t\} = e^{-\int_0^t h_1(s)ds}\, e^{-\int_0^t h_2(s)ds} = e^{-\int_0^t h(s)ds},$$
where h = h1 + h2. Therefore, the failure rate function for the minimum of two independent random variables is the sum of their failure rate functions. This makes intuitive sense; for a two-component system that fails when either component fails, the rate of system failure is the sum of the rates of component failure.
Example 3.9.2 Suppose the failure rate function for an item with lifetime T is h(t) = α for 0 ≤ t < 1 and h(t) = β for t ≥ 1. Find the CDF and mean of the lifetime T.
Solution: By (3.13), F_T(t) = 1 − e^{−αt} for 0 ≤ t ≤ 1 and F_T(t) = 1 − e^{−α − β(t−1)} for t ≥ 1. By the area rule for expectation,
$$E[T] = \int_0^1 e^{-\alpha t}\, dt + \int_1^{\infty} e^{-\alpha - \beta(t-1)}\, dt = \frac{1 - e^{-\alpha}}{\alpha} + \frac{e^{-\alpha}}{\beta}.$$
We remark that E[T] decreases if α or β increases, but the relationship is not simple unless α = β, corresponding to constant failure rate and the exponential distribution for T.
Two commonly appearing types of failure rate functions are shown in Figure 3.24.
[Figure 3.24: Failure rate functions for (a) nearly constant lifetime and (b) an item with a burn-in period (bathtub failure rate function).]
The failure rate function in Figure 3.24(a) corresponds to the lifetime T being nearly deterministic, because
the failure rate is very small until some fixed time, and then it shoots up quickly at that time; with
high probability, T is nearly equal to the time at which the failure rate function shoots up.
The failure rate function in Figure 3.24(b) corresponds to an item with a significant failure
rate when the item is either new or sufficiently old. The large initial failure rate could be due to
manufacturing defects or a critical burn-in requirement, and the large failure rate for sufficiently
old items could be a result of wear or fatigue.
3.10 Binary hypothesis testing with continuous-type observations
Section 2.11 describes the binary hypothesis testing problem, with a focus on the case that the
observation X is discrete-type. There are two hypotheses, H1 and H0 , and if hypothesis Hi is true
then the observation is assumed to have a pmf pi , for i = 1 or i = 0. The same framework works
for continuous-type observations, as we describe in this section.
Suppose f1 and f0 are two known pdfs, and suppose that if Hi is true, then the observation
X is a continuous-type random variable with pdf fi . If a particular value of X is observed, the
decision rule has to specify which hypothesis to declare. For continuous-type random variables,
the probability that X = u is zero for any value of u, for either hypothesis. But recall that if ε is small, then εf_i(u) is the approximate probability that the observation is within a length-ε interval centered at u, if H_i is the true hypothesis. For this reason, we still call f_i(u) the likelihood of X = u if H_i is the true hypothesis. We also define the likelihood ratio, Λ(u), for u in the support of f_1 or f_0, by
$$\Lambda(u) = \frac{f_1(u)}{f_0(u)}.$$
Example 3.10.1 Suppose, for known constants m_0 < m_1 and σ² > 0, that f_i is the N(m_i, σ²) pdf, for i = 0, 1, so
$$\Lambda(u) = \frac{f_1(u)}{f_0(u)} = \exp\left( -\frac{(u - m_1)^2}{2\sigma^2} + \frac{(u - m_0)^2}{2\sigma^2} \right) = \exp\left( \frac{m_1 - m_0}{\sigma^2}\left( u - \frac{m_0 + m_1}{2} \right) \right).$$
Since m_1 > m_0, the likelihood ratio Λ(u) is increasing in u, so the ML rule, which declares H_1 whenever Λ(X) > 1, is equivalent to declaring H_1 whenever X > τ_{ML}, where τ_{ML} = (m_0 + m_1)/2. (The letter τ is used for the threshold here because it is the threshold for X directly, whereas the letter γ is used for the threshold applied to the likelihood ratio.)
The LRT for a general threshold γ and a general binary hypothesis testing problem with continuous-type observations is equivalent to
$$\ln \Lambda(X) \begin{cases} > \ln\gamma & \text{declare } H_1 \text{ is true} \\ < \ln\gamma & \text{declare } H_0 \text{ is true,} \end{cases}$$
which for this example can be expressed as a threshold test for X:
$$X \begin{cases} > \frac{\sigma^2}{m_1 - m_0}\ln\gamma + \frac{m_0 + m_1}{2} & \text{declare } H_1 \text{ is true} \\ < \frac{\sigma^2}{m_1 - m_0}\ln\gamma + \frac{m_0 + m_1}{2} & \text{declare } H_0 \text{ is true.} \end{cases}$$
In particular, the MAP rule is obtained by letting γ = π_0/π_1, and it becomes:
$$X \begin{cases} > \tau_{MAP} & \text{declare } H_1 \text{ is true} \\ < \tau_{MAP} & \text{declare } H_0 \text{ is true,} \end{cases}$$
where $\tau_{MAP} = \frac{\sigma^2}{m_1 - m_0}\ln\frac{\pi_0}{\pi_1} + \frac{m_0 + m_1}{2}$.
For this example, both the ML and MAP rules have the form
$$X \begin{cases} > \tau & \text{declare } H_1 \text{ is true} \\ < \tau & \text{declare } H_0 \text{ is true.} \end{cases}$$
Therefore, we shall examine the error probabilities for a test of that form. The error probabilities
are given by the areas of the shaded regions shown in Figure 3.25.
"
p
false alarm
p
miss
m1
Figure 3.25: Error probabilities for direct threshold detection between two normal pdfs with the
same variance.
$$p_{\text{false alarm}} = P(X > \tau \mid H_0) = P\left( \frac{X - m_0}{\sigma} > \frac{\tau - m_0}{\sigma} \,\Big|\, H_0 \right) = Q\left( \frac{\tau - m_0}{\sigma} \right),$$
and similarly
$$p_{\text{miss}} = P(X < \tau \mid H_1) = Q\left( \frac{m_1 - \tau}{\sigma} \right).$$
Setting τ = τ_ML = (m_0 + m_1)/2 yields that the error probabilities for the ML rule for this example are
$$p_{\text{false alarm}} = p_{\text{miss}} = Q\left( \frac{m_1 - m_0}{2\sigma} \right).$$
Note that (m_1 − m_0)/σ can be interpreted as a signal-to-noise ratio. The difference in the means, m_1 − m_0, can be thought of as the difference between the hypotheses due to the signal, and σ is the standard deviation of the noise. The error probabilities for the MAP rule can be obtained by substituting in τ = τ_MAP in the above expressions.
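As an illustrative check (the values of m0, m1, and σ below are made up, not from the text), the ML error probabilities can be computed from the Q-function and compared with a Monte Carlo estimate:

```python
import numpy as np
from math import erf, sqrt

def Q(x):                       # standard normal complementary CDF
    return 0.5 * (1.0 - erf(x / sqrt(2.0)))

m0, m1, sigma = 0.0, 2.0, 1.0   # illustrative values
tau = (m0 + m1) / 2             # ML threshold

print(Q((tau - m0) / sigma))    # p_false_alarm = Q(1) ~ 0.1587
print(Q((m1 - tau) / sigma))    # p_miss        = Q(1) ~ 0.1587

rng = np.random.default_rng(0)  # Monte Carlo check
x0 = rng.normal(m0, sigma, 1_000_000)
x1 = rng.normal(m1, sigma, 1_000_000)
print((x0 > tau).mean(), (x1 < tau).mean())
```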
Example 3.10.2 Based on a sensor measurement X, it has to be decided which hypothesis about a remotely monitored machine is true: H_0: the machine is working vs. H_1: the machine is broken. Suppose if H_0 is true X is normal with mean 0 and variance a², and if H_1 is true X is normal with mean 0 and variance b². Suppose a and b are known and that 0 < a < b. Find the ML decision rule, the MAP decision rule (for a given choice of π_0 and π_1), and the error probabilities, p_false alarm and p_miss, for both rules.
Solution: The likelihood ratio is
$$\Lambda(u) = \frac{\frac{1}{b\sqrt{2\pi}} e^{-\frac{u^2}{2b^2}}}{\frac{1}{a\sqrt{2\pi}} e^{-\frac{u^2}{2a^2}}} = \frac{a}{b}\, e^{-\frac{u^2}{2b^2} + \frac{u^2}{2a^2}} = \frac{a}{b}\, e^{\frac{u^2}{2}\left( \frac{1}{a^2} - \frac{1}{b^2} \right)}.$$
The ML rule is to choose H_1 when Λ(X) > 1. Thus, by taking the natural logarithm of both sides of this inequality we obtain the rule: if $\ln\frac{a}{b} + \frac{X^2}{2}\left(\frac{1}{a^2} - \frac{1}{b^2}\right) > 0$, choose H_1. Equivalently, after a bit of algebra, we find that the ML rule selects H_1 when
$$\ln\frac{b}{a} < \frac{X^2}{2} \cdot \frac{b^2 - a^2}{a^2 b^2}, \quad \text{or} \quad \frac{2a^2 b^2 \ln(b/a)}{b^2 - a^2} < X^2.$$
Thus, the ML rule can be expressed as a threshold test on the magnitude |X| of X:
$$|X| \begin{cases} > K & \text{declare } H_1 \text{ is true} \\ < K & \text{declare } H_0 \text{ is true,} \end{cases} \qquad (3.14)$$
where $K = K_{ML} = ab\sqrt{\frac{2\ln(b/a)}{b^2 - a^2}}$. Figure 3.10.2 plots the pdfs for the case a = 1 and b = 4.
The MAP rule is to choose H_1 when Λ(X) > π_0/π_1. After a bit of algebra, we derive the rule that H_1 should be chosen when
$$\ln\frac{b\pi_0}{a\pi_1} < \frac{X^2}{2} \cdot \frac{b^2 - a^2}{a^2 b^2},$$
which is again a threshold test on |X| of the form (3.14), with $K = K_{MAP} = ab\sqrt{\frac{2\ln(b\pi_0/(a\pi_1))}{b^2 - a^2}}$ (provided bπ_0 ≥ aπ_1; otherwise H_1 is declared for every observation). For a threshold test of the form (3.14),
$$p_{\text{false alarm}} = P\{|X| > K \mid H_0\} = 2Q\!\left(\frac{K}{a}\right), \qquad
p_{\text{miss}} = P\{|X| < K \mid H_1\} = \int_{-K}^{K} f_1(u)\, du = 1 - 2Q\!\left(\frac{K}{b}\right).$$
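For the case a = 1 and b = 4 mentioned above, the ML threshold and error probabilities can be evaluated numerically; the Monte Carlo check below is an illustrative sketch.

```python
import numpy as np
from math import erf, sqrt, log

def Q(x):
    return 0.5 * (1.0 - erf(x / sqrt(2.0)))

a, b = 1.0, 4.0
K = a * b * sqrt(2 * log(b / a) / (b**2 - a**2))   # ML threshold on |X|
print(K)                                           # about 1.72

print(2 * Q(K / a))          # p_false_alarm for the ML rule
print(1 - 2 * Q(K / b))      # p_miss for the ML rule

rng = np.random.default_rng(0)                     # Monte Carlo check
print((np.abs(rng.normal(0, a, 1_000_000)) > K).mean())
print((np.abs(rng.normal(0, b, 1_000_000)) < K).mean())
```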
Example 3.10.3 Based on a sensor measurement X, it has to be decided which hypothesis about a remotely monitored room is true: H_0: the room is empty vs. H_1: a person is present in the room. Suppose if H_0 is true then X has pdf f_0(x) = (1/2)e^{−|x+1|} and if H_1 is true then X has pdf f_1(x) = (1/2)e^{−|x−1|}. Both densities are defined on the whole real line. These distributions are examples of the Laplace distribution. Find the ML decision rule, the MAP decision rule for prior probability distribution (π_0, π_1) = (2/3, 1/3), and the associated error probabilities, including the average error probability based on the given prior.
Solution: To help with the computations, we express the pdfs without using absolute value signs:
$$f_1(u) = \begin{cases} \frac{1}{2} e^{u-1} & u < 1 \\ \frac{1}{2} e^{-(u-1)} & u \ge 1, \end{cases} \qquad
f_0(u) = \begin{cases} \frac{1}{2} e^{u+1} & u < -1 \\ \frac{1}{2} e^{-(u+1)} & u \ge -1. \end{cases}$$
Therefore,
$$\Lambda(u) = \frac{e^{-|u-1|}}{e^{-|u+1|}} = \begin{cases} \frac{e^{u-1}}{e^{u+1}} = e^{-2} & u < -1 \\ \frac{e^{u-1}}{e^{-(u+1)}} = e^{2u} & -1 \le u < 1 \\ \frac{e^{-(u-1)}}{e^{-(u+1)}} = e^{2} & u \ge 1. \end{cases}$$
The likelihood ratio Λ(u) is nondecreasing and it crosses 1 at u = 0. Thus, the ML decision rule is to decide H_1 is true if X > 0 and decide H_0 otherwise. The error probabilities for the ML decision rule are:
$$p_{\text{false alarm}} = \int_0^{\infty} f_0(u)\, du = \int_0^{\infty} \frac{e^{-(u+1)}}{2}\, du = \frac{1}{2e} \approx 0.1839.$$
By symmetry, or by a similar computation, we see that p_miss is also given by p_miss = 1/(2e). Of course the average error probability for the ML rule is also 1/(2e) for any prior distribution.
The MAP decision rule is to choose H_1 if Λ(X) > π_0/π_1 = 2, and choose H_0 otherwise. Note that the solution of e^{2u} = 2 is u = (ln 2)/2, which is between −1 and 1. Thus, the MAP decision rule is to decide H_1 is true if X > (ln 2)/2 and decide H_0 otherwise. For this rule, p_false alarm = ∫_{(ln 2)/2}^{∞} (1/2)e^{−(u+1)} du = 1/(2√2 e) and p_miss = ∫_{−∞}^{(ln 2)/2} (1/2)e^{u−1} du = √2/(2e). The average error probability for the MAP rule using the given prior distribution is
$$p_e = (2/3)\, p_{\text{false alarm}} + (1/3)\, p_{\text{miss}} = \frac{\sqrt{2}}{3e} \approx 0.1734.$$
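A short Monte Carlo sketch (illustrative; the seed and sample size are arbitrary) confirms these error probabilities, using the fact that the two hypotheses correspond to Laplace distributions centered at −1 and +1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x0 = rng.laplace(loc=-1.0, scale=1.0, size=n)   # samples under H0
x1 = rng.laplace(loc=+1.0, scale=1.0, size=n)   # samples under H1

# ML rule: declare H1 when X > 0
print((x0 > 0).mean(), (x1 <= 0).mean())        # both close to 1/(2e) ~ 0.1839

# MAP rule for (pi0, pi1) = (2/3, 1/3): declare H1 when X > (ln 2)/2
t = np.log(2) / 2
pfa, pmiss = (x0 > t).mean(), (x1 <= t).mean()
print((2/3) * pfa + (1/3) * pmiss)              # close to sqrt(2)/(3e) ~ 0.1734
```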
3.11 Short Answer Questions
Section 3.1 [video]
1. Suppose X is a random variable with CDF F_X(u) = 1/(1 + e^{−u}).
2. Suppose X is a random variable with CDF F_X(u) = 1/(1 + exp(−⌊u⌋)).
Section 3.2[video]
1. Find Var(X) if f_X(u) = 2u for 0 ≤ u ≤ 1 and f_X(u) = 0 elsewhere.
2. Find E[cos(X)] if f_X(u) = cos(u)/2 for −π/2 ≤ u ≤ π/2 and f_X(u) = 0 elsewhere.
Section 3.3 [video]
3. Given three counts of a Poisson process occur during the interval [0, 2], what is the conditional
probability that no counts occur during [0, 1]? (Hint: The answer does not depend on the
rate parameter.)
Section 3.6[video]
1. Suppose the graph of the pdf fY of Y has a triangular shape, symmetric about zero, and Y is
standardized (mean zero and variance one). Find P {Y 1}.
2. Suppose X has the N (3, 4) distribution. Find P {X 5}.
3. Let Y have the N (1, 9) distribution. Find P {Y 2 4}.
4. Let X have the binomial distribution with parameters 100 and 0.4. Find the approximation to
P {X 45} yielded by the Gaussian approximation with the continuity correction.
Section 3.7[video]
1. Let T have the pdf f_{σ²}(t) = I_{{t≥0}} (t/σ²) exp(−t²/(2σ²)) (i.e., T has the Rayleigh pdf with parameter σ²), where σ² is unknown. Find the ML estimate of σ² for the observation T = 10.
2.
Section 3.8[video]
1. Find the pdf of X² assuming X has the N(0, σ²) distribution for some σ² > 0.
2. Find the increasing function g so that if U is uniformly distributed on [0, 1], then g(U) has pdf f(t) = 2t for 0 ≤ t ≤ 1 and f(t) = 0 otherwise.
Section 3.9[video]
1. Find E[T] assuming T is a random variable with failure rate function h(t) = a/(1 − t) for 0 ≤ t < 1, where a > 0.
2. The Pareto distribution with minimum value one and shape parameter α > 0 has complementary CDF P{T > t} = t^{−α} for t ≥ 1. Find the failure rate function h(t) for t ≥ 0. (Hint: It is zero for 0 ≤ t < 1.)
Section 3.10[video]
1. Suppose the observation X is exponentially distributed with parameter one if H_0 is true, and is uniformly distributed over the interval [0, 1] if H_1 is true. Find p_miss and p_false alarm for the ML decision rule.
2. Suppose the observation X is exponentially distributed with parameter one if H_0 is true, and is uniformly distributed over the interval [0, 1] if H_1 is true. Suppose H_0 is a priori true with probability π_0 = 2/3. Find the average error probability for the MAP decision rule.
3. Suppose the observation X has pdf f_0(u) = exp(−2|u|) if H_0 is true and pdf f_1(u) = (1/2)exp(−|u|) if H_1 is true. Find the largest value of the prior probability for H_0, π_0, so that the MAP rule always declares that H_1 is true.
3.12 Problems
Cumulative distribution functions (CDFs) Section 3.1
[Figure: a piecewise-constant CDF F_X(u) with values 0.25, 0.5, 0.75, and 1, plotted for u between −15 and 10.]
[Figure: a CDF F_X(c) with values 0.25, 0.5, and 0.75, plotted for c up to 10.]
(a) P {X = 5}
(b) P {X = 0}
(c) P{|X| ≤ 5}
(d) P{X² ≤ 4}
3.3. [Recognizing a valid CDF]
Which of the six plots below show valid CDFs? For each one that is not valid, state a property
of CDFs that is violated.
(a)
(d)
(b)
(c)
(e)
(f)
[Fragments of problem statements: a pdf of the form f(u) = Au for 0 < u ≤ 1, B(2 − u) for 1 < u ≤ 2, and 0 otherwise; a pdf of the form a + bu² for 0 ≤ u ≤ 1 and 0 otherwise; and a plotted CDF with jump sizes a, b, and c.]
(b) Now suppose the street is of infinite length, stretching from point 0 outward to ∞. If the distance of a fire from point 0 is exponentially distributed with rate λ, where should the fire station now be located?
3.10. [Uniform and exponential distribution II]
A factory produced two equal size batches of radios. All the radios look alike, but the lifetime of a radio in the first batch is uniformly distributed from zero to two years, while the lifetime of a radio in the second batch is exponentially distributed with parameter λ = 0.1 (years)^{−1}.
(a) Suppose Alicia bought a radio and after five years it is still working. What is the
conditional probability it will still work for at least three more years?
(b) Suppose Venkatesh bought a radio and after one year it is still working. What is the
conditional probability it will work for at least three more years?
3.11. [Poisson process]
A certain application in a cloud computing system is accessed on average by 15 customers
per minute. Find the probability that in a one minute period, three customers access the
application in the first ten seconds and two customers access the application in the last fifteen
seconds. (Any number could access the system in between these two time intervals.)
3.12. [Disk crashes modeled by a Poisson process]
Suppose disk crashes in a particular cloud computing center occur according to a Poisson process with mean rate λ per hour, 24 hours per day, seven days a week (i.e. 24/7). Express each of the following quantities in terms of λ, and explain or show your work.
(a) The expected number of disk crashes in a 24 hour period.
(b) The probability there are exactly three disk crashes in a five hour period.
(c) The mean number of hours from the beginning of a particular day, until three disks have
crashed.
(d) The pdf of the time the third disk crashes.
3.13. [Poisson buses]
Buses arrive at a station starting at time zero according to a Poisson process (N_t : t ≥ 0) with rate λ, where N_t denotes the number of buses arriving up to time t.
(a) Given exactly one bus arrives during the interval [0, 3], find the probability it arrives
before t = 1. Show your work or explain your reasoning.
(b) Let T2 be the arrival time of the second bus. What is the distribution of T2 ? Find the
pdf.
(c) What is the probability that there is a bus arriving exactly at time t = 1?
3.14. [Poisson tweets]
Suppose tweets from @313 follow a Poisson process with rate 1/7 per day.
(a) What is the probability there is exactly one tweet every week for the next four weeks?
(b) What is the probability it takes more than two weeks to receive three tweets?
(c) If there have been no tweets during two weeks of waiting, what is the mean amount of
additional time until five tweets have been sent?
(a) Describe the maximum likelihood decision rule for an observation k, where k is an arbitrary integer with 0 ≤ k ≤ 72. Express the rule in terms of k as simply as possible. To be definite, in case of a tie in likelihoods, declare H_1 to be the hypothesis.
(b) Suppose a particular decision rule declares that H_1 is the true hypothesis if and only if X ≥ 34. Find the approximate value of p_false alarm for this rule by using the Gaussian approximation to the binomial distribution. (To be definite, don't use the continuity correction.)
(c) Describe the MAP decision rule for an observation k, where k is an arbitrary integer with 0 ≤ k ≤ 72, for the prior distribution π_0 = 0.9 and π_1 = 0.1. Express the rule in terms of k as simply as possible.
(d) Assuming the same prior distribution as in part (c), find P(H_0 | X = 38).
[Fragments of problem statements: an observation Y that takes one of two values depending on whether the input is ≥ 0 or < 0, for some positive constant, so Y is the output of a binary quantizer with input X; a pdf of the form e^{−|u|}/2; and a pdf f_1(u) = |u| for −1 ≤ u ≤ 1 and 0 otherwise.]
(a) Describe the ML decision rule for deciding which hypothesis is true for observation X.
(b) Find p_false alarm and p_miss for the ML rule.
(c) Suppose it is assumed a priori that H_0 is true with probability π_0 and H_1 is true with probability π_1 = 1 − π_0. For what values of π_0 does the MAP decision rule declare H_1 with probability one, no matter which hypothesis is really true?
(d) Suppose it is assumed a priori that H_0 is true with probability π_0 and H_1 is true with probability π_1 = 1 − π_0. For what values of π_0 does the MAP decision rule declare H_0 with probability one, no matter which hypothesis is really true?
3.36. [(COMPUTER EXERCISE) Running averages of independent, identically distributed random variables]
Consider the following experiment, for an integer N ≥ 1. (a) Suppose U_1, U_2, . . . , U_N are mutually independent, uniformly distributed random variables on the interval [−0.5, 0.5]. Let S_n = U_1 + · · · + U_n denote the cumulative sum for 1 ≤ n ≤ N. Simulate this experiment on a computer and make two plots, the first showing S_n/n for 1 ≤ n ≤ 100 and the second showing S_n/n for 1 ≤ n ≤ 10000. (b) Repeat part (a), but change S_n to S_n = Y_1 + · · · + Y_n where Y_k = tan(πU_k). (This choice makes each Y_k have the Cauchy distribution, f_Y(v) = 1/(π(1 + v²)); see Example 3.8.6. Since the pdf f_Y is symmetric about zero, it is tempting to think that E[Y_k] = 0, but that is false; E[Y_k] is not well defined because ∫_0^∞ v f_Y(v) dv = +∞ and ∫_{−∞}^0 v f_Y(v) dv = −∞. Thus, for any function g(n) defined for n ≥ 1, it is possible to select a_n and b_n so that ∫_{a_n}^{b_n} v f_Y(v) dv = g(n) for all n. It is said, therefore, that the integral ∫_{−∞}^{∞} v f_Y(v) dv is indeterminate.)
3.37. [Generation of random variables with specified probability density function]
Find a function g so that, if U is uniformly distributed over the interval [0, 1], and X = g(U ),
then X has the pdf:
$$f_X(v) = \begin{cases} 2v & \text{if } 0 \le v \le 1 \\ 0 & \text{else.} \end{cases}$$
(Hint: Begin by finding the cumulative distribution function FX .)
3.38. [Failure rate of a network with two parallel links]
Consider the s t network with two parallel links, as shown:
[Figure: an s–t network with two parallel links, labeled 1 and 2, connecting node s to node t.]
Suppose that for each i, link i fails at time T_i, where T_1 and T_2 are independent, exponentially distributed with some parameter λ > 0. The network fails at time T, where T = max{T_1, T_2}.
(a) Express F_T^c(t) = P{T > t} for t ≥ 0 in terms of t and λ.
(b) Find the pdf of T.
(c) Find the failure rate function, h(t), for the network. Simplify your answer as much as possible. (Hint: Check that your expression for h satisfies h(0) = 0 and lim_{t→∞} h(t) = λ.)
(d) Find P(min{T_1, T_2} < t | T > t) and verify that h(t) = λ P{min{T_1, T_2} < t | T > t}. That is, the network failure rate at time t is λ times the conditional probability that at least one of the links has already failed by time t, given the network has not failed by time t.
3.39. [Cauchy vs. Gaussian detection problem]
On the basis of a sensor output X, it is to be decided which hypothesis is true: H_0 or H_1. Under H_1, X is a Gaussian random variable with mean zero and variance σ² = 0.5: f_1(u) = (1/√π) e^{−u²}. Under H_0, X has the Cauchy density, f_0(u) = 1/(π(1 + u²)). (The CDFs under H_0 and H_1 are F_0(c) = 1/2 + arctan(c)/π and F_1(c) = Φ(c√2), respectively.)
(c) Consider the decision rule that decides H_1 is true if |X| ≤ 1.4 and decides H_0 otherwise. This rule is the MAP rule for some prior distribution (π_1, π_0). Find the ratio γ = π_0/π_1 for that distribution.
Chapter 4
4.1 Joint cumulative distribution functions
Recall that any random variable has a CDF, but pmfs (for discrete-type random variables) and
pdfs (for continuous-type random variables) are usually simpler. This situation is illustrated in
Figure 3.5. In this chapter well see that the same scheme holds for joint distributions of multiple
random variables. We begin in this section with a brief look at joint CDFs.
Let X and Y be random variables on a single probability space (, F, P ). The joint cumulative
distribution function (CDF) is the function of two variables defined by
$$F_{X,Y}(u_o, v_o) = P\{X \le u_o, Y \le v_o\}$$
for any (u_o, v_o) ∈ ℝ². (Here, ℝ² denotes the set of all pairs of real numbers, or, equivalently, the two-dimensional Euclidean plane.) That is, F_{X,Y}(u_o, v_o) is the probability that the random point (X, Y) falls into the shaded region in the u–v plane, shown in Figure 4.1.
[Figure 4.1: the region {(u, v) : u ≤ u_o, v ≤ v_o} in the u–v plane, shaded below and to the left of the point (u_o, v_o).]
The joint CDF determines the probability that (X, Y) falls into any rectangle. If a < b and c < d, then
$$P\{a < X \le b,\ c < Y \le d\} = F_{X,Y}(b, d) - F_{X,Y}(a, d) - F_{X,Y}(b, c) + F_{X,Y}(a, c), \qquad (4.1)$$
as illustrated in Figure 4.2.
[Figure 4.2: a rectangle (a, b] × (c, d] in the u–v plane; P{(X, Y) ∈ shaded region} is equal to F_{X,Y} evaluated at the corners with signs shown.]
The joint CDF of X and Y also determines the probabilities of any
event concerning X alone, or Y alone. To show this, we show that the CDF of X alone is determined
by the joint CDF of X and Y.
Proposition 4.1.1 Suppose X and Y have joint CDF F_{X,Y}. For each u fixed, the limit lim_{v→∞} F_{X,Y}(u, v) exists; call it F_{X,Y}(u, ∞). Furthermore, F_X(u) = F_{X,Y}(u, ∞). Similarly, F_Y(v) = F_{X,Y}(∞, v).
Also, F_{X,Y}(u, v) is nondecreasing in v. Thus, lim_{v→∞} F_{X,Y}(u, v) exists and is equal to F_X(u), as claimed.
Properties of CDFs The joint CDF, FX,Y , for a pair of random variables X and Y , has the
following properties. For brevity, we drop the subscripts on FX,Y and write it simply as F :
• 0 ≤ F(u, v) ≤ 1 for all (u, v) ∈ ℝ²
• F(u, v) is nondecreasing in u and is nondecreasing in v
• F(u, v) is right-continuous in u and right-continuous in v
• If a < b and c < d, then F(b, d) − F(b, c) − F(a, d) + F(a, c) ≥ 0
• lim_{u→−∞} F(u, v) = 0 for each v, and lim_{v→−∞} F(u, v) = 0 for each u
• lim_{u→∞} lim_{v→∞} F(u, v) = 1
It can be shown that any function F satisfying the above properties is the joint CDF for some pair
of random variables (X, Y ).
4.2 Joint probability mass functions
If X and Y are each discrete-type random variables on the same probability space, they have a
joint probability mass function (joint pmf), denoted by pX,Y . It is defined by pX,Y (u, v) = P {X =
u, Y = v}. If the numbers of possible values are small, the joint pmf can be easily described in a
table. The joint pmf determines the probabilities of any events that can be expressed as conditions
on X and Y. In particular, the pmfs of X and Y can be recovered from the joint pmf, using the law
of total probability, as follows. By definition, since X is a discrete-type random variable, there is a
finite or countably infinite set of possible values of X; denote them by u1 , u2 , . . . . Similarly, there
is a finite or countably infinite set of possible values of Y ; denote them by v1 , v2 , . . .
The events of the form {Y = v_j} are mutually exclusive, there are at most countably infinitely many such events, and together with the zero-probability event F = {Y ∉ {v_1, v_2, . . .}}, they partition Ω. So for any u ∈ ℝ, the event {X = u} can be written as a union of mutually exclusive events:
$$\{X = u\} = \bigcup_j \{X = u, Y = v_j\} \cup (\{X = u\} \cap F).$$
Therefore, by the additivity axiom of probability, Axiom P.2, and the fact P({X = u} ∩ F) = 0 (because P(F) = 0):
$$P\{X = u\} = \sum_j P\{X = u, Y = v_j\}, \qquad (4.2)$$
or equivalently,
$$p_X(u) = \sum_j p_{X,Y}(u, v_j). \qquad (4.3)$$
It is useful to consider (4.2), or equivalently, (4.3), to be an instance of the law of total probability based on the partition F, {Y = v_1}, {Y = v_2}, . . . Similarly,
$$p_Y(v) = \sum_i p_{X,Y}(u_i, v).$$
In this case, p_X and p_Y are called the marginal pmfs of the joint pmf, p_{X,Y}. The conditional pmfs are also determined by the joint pmf. For example, the conditional pmf of Y given X, p_{Y|X}(v|u_o), is defined for all u_o such that p_X(u_o) > 0 by:
$$p_{Y|X}(v|u_o) = P(Y = v \mid X = u_o) = \frac{p_{X,Y}(u_o, v)}{p_X(u_o)}.$$
Example 4.2.1 Let (X, Y ) have the joint pmf given by Table 4.1. The graph of the pmf is shown
[Table 4.1: A simple joint pmf, with rows labeled Y = 3, Y = 2, Y = 1 and columns labeled X = 1, X = 2, X = 3. The recoverable nonzero entries are 0.1 in the X = 1 column; 0.1, 0.2, and 0.3 in the X = 2 column; and 0.2 and 0.1 in the X = 3 column; all other entries are zero.]
in Figure 4.3. Find (a) the pmf of X, (b) the pmf of Y , (c) P {X = Y }, (d) P {X > Y },
(e) pY |X (v|2), which is a function of v.
[Figure 4.3: the joint pmf p_{X,Y}(u, v) of Table 4.1, shown as a bar plot over the u–v plane.]
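Marginal and conditional pmfs are computed from a joint pmf exactly as in (4.3): sum over the other variable, then normalize a row or column. A minimal sketch with an illustrative 3×3 joint pmf (made up for the example, not necessarily the values of Table 4.1):

```python
import numpy as np

# Illustrative joint pmf; rows indexed by u = 1, 2, 3 and columns by v = 1, 2, 3.
p = np.array([[0.1, 0.1, 0.0],
              [0.3, 0.2, 0.1],
              [0.0, 0.1, 0.1]])
assert abs(p.sum() - 1.0) < 1e-12

p_X = p.sum(axis=1)            # marginal pmf of X: sum over v
p_Y = p.sum(axis=0)            # marginal pmf of Y: sum over u
p_Y_given_X2 = p[1] / p_X[1]   # conditional pmf of Y given X = 2

print(p_X, p_Y, p_Y_given_X2)
```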
The following three properties are necessary and sufficient for p to be a valid joint pmf:
Property pmf.1: p is nonnegative.
Property pmf.2: There are finite or countably infinite sets {u_1, u_2, . . .} and {v_1, v_2, . . .} such that p(u, v) = 0 if u ∉ {u_1, u_2, . . .} or if v ∉ {v_1, v_2, . . .}.
Property pmf.3: Σ_i Σ_j p(u_i, v_j) = 1.
4.3 Joint probability density functions
The random variables X and Y are jointly continuous-type if there exists a function f_{X,Y}, called the joint probability density function (pdf), such that F_{X,Y}(u_o, v_o) is obtained for any (u_o, v_o) ∈ ℝ² by integration over the shaded region in Figure 4.1:
$$F_{X,Y}(u_o, v_o) = \int_{-\infty}^{u_o} \int_{-\infty}^{v_o} f_{X,Y}(u, v)\, dv\, du.$$
It follows from (4.1) that if R is the rectangular region (a, b] × (c, d] in the plane, shown in Figure 4.2, then
$$P\{(X, Y) \in R\} = \iint_R f_{X,Y}(u, v)\, du\, dv. \qquad (4.4)$$
More generally, (4.4) holds for any set R in the plane that has a piecewise differentiable boundary,
because such sets can be approximated by a finite or countably infinite union of disjoint rectangular
regions. If g is a function on the plane then the expectation of the random variable g(X, Y ) can be
computed using the law of the unconscious statistician (LOTUS), just as for functions of a single
random variable:
$$E[g(X, Y)] = \iint g(u, v) f_{X,Y}(u, v)\, du\, dv, \qquad (4.5)$$
where all the integrals are over the entire real line.
The following two properties are necessary and sufficient for f to be a valid joint pdf:
Property pdf.1: For any (u, v) ∈ ℝ², f(u, v) ≥ 0.
Property pdf.2: The integral of f over ℝ² is one, i.e. ∫∫ f(u, v) du dv = 1.
Property pdf.1 has to be true for any pdf f because, by (4.4), the integral of f over any set is
the probability of an event, which must be nonnegative. Property pdf.2 has to be true for any
pdf because the integral of f over R2 is, by (4.4), equal to P {(X, Y ) R2 }, which is equal to one
because, by definition, the pair (X, Y ) always takes a value in R2 .
Note that pdfs are defined over the entire plane R2 , although often the pdfs are zero over large
regions. The support of a joint pdf is the set over which it is nonzero.
The pdf of X alone or of Y alone can be obtained from the joint pdf, as follows. Given any real number u_o, let R(u_o) be the region to the left of the vertical line {u = u_o} in the (u, v) plane: R(u_o) = {(u, v) : −∞ < u ≤ u_o, −∞ < v < ∞}. Then by (4.4), if X and Y are jointly continuous-type,
$$F_X(u_o) = P\{X \le u_o\} = P\{(X, Y) \in R(u_o)\} = \int_{-\infty}^{u_o} \left[ \int_{-\infty}^{\infty} f_{X,Y}(u, v)\, dv \right] du. \qquad (4.6)$$
Equation (4.6) expresses F_X(u_o) as the integral over (−∞, u_o] of a quantity in square brackets, for any real number u_o. So by the definition of pdfs (i.e. Definition 3.2.1), the quantity in square brackets is the pdf of X:
$$f_X(u) = \int_{-\infty}^{\infty} f_{X,Y}(u, v)\, dv. \qquad (4.7)$$
Similarly,
$$f_Y(v) = \int_{-\infty}^{\infty} f_{X,Y}(u, v)\, du. \qquad (4.8)$$
The pdfs fX and fY are called the marginal pdfs of the joint distribution of X and Y. Since X is
trivially a function of X and Y , the mean of X can be computed directly from the joint pdf by
LOTUS:
$$E[X] = \iint u\, f_{X,Y}(u, v)\, du\, dv.$$
If the integration over v is performed first, it yields the definition of E[X] in terms of the marginal pdf f_X:
$$E[X] = \int u \left[ \int f_{X,Y}(u, v)\, dv \right] du = \int u f_X(u)\, du.$$
The conditional pdf of Y given X = u_o is defined, for u_o such that f_X(u_o) > 0, by
$$f_{Y|X}(v|u_o) = \frac{f_{X,Y}(u_o, v)}{f_X(u_o)}, \qquad -\infty < v < +\infty. \qquad (4.9)$$
Graphically, the connection between f_{X,Y} and f_{Y|X}(v|u_o) for u_o fixed is quite simple. For u_o fixed, the right-hand side of (4.9) depends on v in only one place; the denominator, f_X(u_o), is just a constant. So f_{Y|X}(v|u_o) as a function of v is proportional to f_{X,Y}(u_o, v), and the graph of f_{X,Y}(u_o, v) with respect to v is obtained by slicing through the graph of the joint pdf along the line u = u_o in the u–v plane. The choice of the constant f_X(u_o) in the denominator in (4.9) makes f_{Y|X}(v|u_o), as a function of v for u_o fixed, itself a pdf. Indeed, it is nonnegative, and
$$\int f_{Y|X}(v|u_o)\, dv = \int \frac{f_{X,Y}(u_o, v)}{f_X(u_o)}\, dv = \frac{1}{f_X(u_o)} \int f_{X,Y}(u_o, v)\, dv = \frac{f_X(u_o)}{f_X(u_o)} = 1.$$
In practice, given a value u_o for X, we think of f_{Y|X}(v|u_o) as a new pdf for Y, based on our change of view due to observing the event X = u_o. There is a bit of difficulty in this interpretation, because P{X = u_o} = 0 for any u_o. But if the joint density function is sufficiently regular, then the conditional density is a limit of conditional probabilities of events:
$$f_{Y|X}(v|u_o) = \lim_{\epsilon \to 0} \frac{P\left( Y \in \left[v - \tfrac{\epsilon}{2}, v + \tfrac{\epsilon}{2}\right] \,\middle|\, X \in \left[u_o - \tfrac{\epsilon}{2}, u_o + \tfrac{\epsilon}{2}\right] \right)}{\epsilon}.$$
Also, the definition of conditional pdf has the intuitive property that f_{X,Y}(u, v) = f_{Y|X}(v|u) f_X(u). So (4.8) yields a version of the law of total probability for the pdf of Y:
$$f_Y(v) = \int f_{Y|X}(v|u) f_X(u)\, du. \qquad (4.10)$$
The expectation computed using the conditional pdf is called the conditional expectation (or conditional mean) of Y given X = u, written as
$$E[Y \mid X = u] = \int v f_{Y|X}(v|u)\, dv. \qquad (4.11)$$
where c is a constant to be determined. The pdf and its support are shown in Figure 4.4. Find the constant c, the marginal pdf of X, and the conditional pdf of Y given X.
[Figure 4.4: the joint pdf f_{X,Y}(u, v) over its triangular support {(u, v) : u ≥ 0, v ≥ 0, u + v ≤ 1}.]
Solution: For 0 ≤ u_o ≤ 1,
$$f_X(u_o) = \int_0^{1-u_o} c(1 - u_o - v)\, dv = \frac{c(1-u_o)^2}{2},$$
and f_X(u_o) = 0 otherwise.
Note that if we hadn't already found that c = 6, we could find c using the expression for f_X just found, because f_X must integrate to one. That yields:
$$1 = \int_{-\infty}^{\infty} f_X(u)\, du = \int_0^1 \frac{c(1-u)^2}{2}\, du = \left. \frac{-c(1-u)^3}{6} \right|_0^1 = \frac{c}{6},$$
or, again, c = 6. Thus, f_X has support [0, 1) with f_X(u_o) = 3(1 − u_o)² for 0 ≤ u_o ≤ 1.
The conditional density f_{Y|X}(v|u_o) is defined only if u_o is in the support of f_X: 0 ≤ u_o < 1. For such u_o,
$$f_{Y|X}(v|u_o) = \begin{cases} \frac{2(1 - u_o - v)}{(1-u_o)^2} & 0 \le v \le 1 - u_o \\ 0 & \text{else.} \end{cases}$$
That is, the conditional pdf of Y given X = u_o has a triangular shape, over the interval [0, 1 − u_o], as shown in Figure 4.5. This makes sense geometrically: a slice through the three-dimensional region under the graph of the joint pdf along the plane u = u_o, renormalized to integrate to one, has a triangular shape.
[Figure 4.5: the conditional pdf f_{Y|X}(v|u_o), a decreasing triangular density on [0, 1 − u_o] with value 2/(1 − u_o) at v = 0.]
Uniform joint pdfs A simple class of joint distributions are the distributions that are uniform over some set. Let S be a subset of the plane with finite area. Then the uniform distribution over S is the one with the following pdf:
$$f_{X,Y}(u, v) = \begin{cases} \frac{1}{\text{area of } S} & \text{if } (u, v) \in S \\ 0 & \text{else.} \end{cases}$$
If A is a subset of the plane, then
$$P\{(X, Y) \in A\} = \frac{\text{area of } A \cap S}{\text{area of } S}. \qquad (4.12)$$
The support of the uniform pdf over S is simply S itself. The next two examples concern uniform
distributions over two different subsets of R2 .
Example 4.3.2 Suppose (X, Y) is uniformly distributed over the unit circle. That is, take S = {(u, v) : u² + v² ≤ 1}. Since the area of S is π, the joint pdf is given by
$$f_{X,Y}(u, v) = \begin{cases} \frac{1}{\pi} & \text{if } u^2 + v^2 \le 1 \\ 0 & \text{else.} \end{cases}$$
The pdf and its support are pictured in Figure 4.6.
[Figure 4.6: The pdf for (X, Y) uniformly distributed over the unit circle in ℝ².]
The three-dimensional region under the pdf (and above the u–v plane) is a cylinder centered at the origin with height 1/π. Note that the volume
of the region is 1, as required.
(a) Find P{(X, Y) ∈ A} for A = {(u, v) : u ≥ 0, v ≥ 0}.
(b) Find P{X² + Y² ≤ r²} for r ≥ 0.
(c) Find the pdf of X.
(d) Find the conditional pdf of Y given X.
Solution: (a) The region A ∩ S is a quarter of the support, S, so by symmetry, P{(X, Y) ∈ A} = 1/4.
(b) If 0 ≤ r ≤ 1, the region {(u, v) : u² + v² ≤ r²} is a disk of radius r contained in S, so its intersection with S is the whole disk, which has area πr². Dividing this by the area of S yields P{X² + Y² ≤ r²} = r² for 0 ≤ r ≤ 1. If r > 1, the region {(u, v) : u² + v² ≤ r²} contains S, so P{X² + Y² ≤ r²} = 1 for r > 1.
(c) The marginal, f_X, is given by
$$f_X(u_o) = \int_{-\infty}^{\infty} f_{X,Y}(u_o, v)\, dv = \begin{cases} \int_{-\sqrt{1-u_o^2}}^{\sqrt{1-u_o^2}} \frac{1}{\pi}\, dv = \frac{2\sqrt{1-u_o^2}}{\pi} & \text{if } |u_o| \le 1 \\ 0 & \text{else.} \end{cases}$$
(d) The conditional density f_{Y|X}(v|u_o) is undefined if |u_o| ≥ 1, so suppose |u_o| < 1. Then
$$f_{Y|X}(v|u_o) = \begin{cases} \frac{1/\pi}{2\sqrt{1-u_o^2}/\pi} = \frac{1}{2\sqrt{1-u_o^2}} & \text{if } -\sqrt{1-u_o^2} \le v \le \sqrt{1-u_o^2} \\ 0 & \text{else.} \end{cases}$$
That is, if |u_o| < 1, then given X = u_o, Y is uniformly distributed over the interval $\left[ -\sqrt{1-u_o^2},\, \sqrt{1-u_o^2} \right]$. This makes sense geometrically: a slice through the cylindrically shaped region under the joint pdf is a rectangle.
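A quick Monte Carlo sketch (illustrative seed and sample sizes) of parts (a) and (b), sampling uniformly from the unit circle by rejection from the enclosing square:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
# Sample uniformly from the unit disk by rejection from the square [-1, 1]^2.
pts = rng.uniform(-1, 1, size=(2 * n, 2))
pts = pts[(pts ** 2).sum(axis=1) <= 1.0][:n]
x, y = pts[:, 0], pts[:, 1]

print(((x >= 0) & (y >= 0)).mean())        # part (a): close to 1/4
r = 0.7
print((x**2 + y**2 <= r**2).mean())        # part (b): close to r^2 = 0.49
```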
171
v
f
(u,v)
X,Y
0.5
0.5
4/3
support
S
u
u
0.5
0.5
Figure 4.7: The pdf fX,Y and its support, S, for (X, Y ) uniformly distributed over S.
Example 4.3.3 Here's a second example of a uniform distribution over a set in the plane. Suppose (X, Y) is uniformly distributed over the set S = {(u, v) : 0 ≤ u ≤ 1, 0 ≤ v ≤ 1, max{u, v} ≥ 0.5}. The pdf and its support set S are shown in Figure 4.7. Since the area of S is 3/4, the pdf is:
$$f_{X,Y}(u, v) = \begin{cases} \frac{4}{3} & \text{if } (u, v) \in S \\ 0 & \text{else.} \end{cases}$$
$$f_X(u_o) = \int_{-\infty}^{\infty} f_{X,Y}(u_o, v)\, dv = \begin{cases} \int_{0.5}^{1} \frac{4}{3}\, dv = \frac{2}{3} & 0 \le u_o < 0.5 \\ \int_0^1 \frac{4}{3}\, dv = \frac{4}{3} & 0.5 \le u_o \le 1 \\ 0 & \text{else.} \end{cases}$$
[Figure 4.8: The marginal pdf f_X and the conditional pdfs f_{Y|X}(v|u_o), for 0 < u_o < 0.5 and for 0.5 < u_o < 1, for (X, Y) uniformly distributed over S.]
The conditional density f_{Y|X}(v|u_o) is undefined if u_o < 0 or u_o > 1. It is well defined for 0 ≤ u_o ≤ 1. Since f_X(u_o) has different values for u_o < 0.5 and u_o ≥ 0.5, we consider these two cases separately:
$$f_{Y|X}(v|u_o) = \begin{cases} 2 & 0.5 \le v \le 1 \\ 0 & \text{else} \end{cases} \ \text{for } 0 \le u_o < 0.5, \qquad
f_{Y|X}(v|u_o) = \begin{cases} 1 & 0 \le v \le 1 \\ 0 & \text{else} \end{cases} \ \text{for } 0.5 \le u_o \le 1.$$
That is, if 0 ≤ u_o < 0.5, the conditional distribution of Y given X = u_o is the uniform distribution over the interval [0.5, 1], and if 0.5 ≤ u_o ≤ 1, the conditional distribution of Y given X = u_o is the uniform distribution over the interval [0, 1]. The conditional distribution f_{Y|X}(v|u_o) is not defined for other values of u_o. The conditional pdfs are shown in Figure 4.8.
Example 4.3.4 Suppose X and Y have the joint pdf
$$f_{X,Y}(u, v) = \begin{cases} e^{-u} & \text{if } 0 \le v \le u \\ 0 & \text{else.} \end{cases} \qquad (4.13)$$
The pdf and its support are shown in Figure 4.9. Find the marginal and conditional pdfs.
[Figure 4.9: the joint pdf f_{X,Y}(u, v) of (4.13) and its support, the region {(u, v) : 0 ≤ v ≤ u}.]
Solution: The marginal pdfs are
$$f_X(u_o) = \begin{cases} \int_0^{u_o} e^{-u_o}\, dv = u_o e^{-u_o} & u_o \ge 0 \\ 0 & \text{else,} \end{cases} \qquad
f_Y(v_o) = \begin{cases} \int_{v_o}^{\infty} e^{-u}\, du = e^{-v_o} & v_o \ge 0 \\ 0 & \text{else.} \end{cases}$$
The conditional density f_{Y|X}(v|u_o) is undefined if u_o ≤ 0. For u_o > 0:
$$f_{Y|X}(v|u_o) = \begin{cases} \frac{e^{-u_o}}{u_o e^{-u_o}} = \frac{1}{u_o} & 0 \le v \le u_o \\ 0 & \text{else.} \end{cases}$$
[Figure 4.10: The marginal and conditional pdfs for the joint pdf in (4.13).]
That is, the conditional distribution of Y given X = u_o is the uniform distribution over the interval [0, u_o]. The conditional density f_{X|Y}(u|v_o) is undefined if v_o ≤ 0. For v_o > 0:
$$f_{X|Y}(u|v_o) = \begin{cases} \frac{e^{-u}}{e^{-v_o}} = e^{-(u - v_o)} & u \ge v_o \\ 0 & \text{else.} \end{cases}$$
That is, the conditional distribution of X given Y = v_o is the exponential distribution with parameter one, shifted to the right by v_o. The conditional pdfs are shown in Figure 4.10.
Example 4.3.5 Suppose the unit interval [0, 1] is randomly divided into three subintervals by two
points, located as follows. The point X is uniformly distributed over the interval [0, 1]. Given
X = u, a second point is uniformly distributed over the interval [u, 1]. Let Y denote the length of
the middle subinterval. If the unit interval represents a stick, and the two points represent locations
of breaks, then X and Y are the lengths of the left and center sticks formed. See Figure 4.11. Find the joint pdf of X and Y, and the pdf of Y.
[Figure 4.11: the unit interval [0, 1] broken into three pieces; X is the length of the leftmost piece and Y is the length of the middle piece.]
Solution: The marginal pdf of X is f_X(u) = 1 for 0 ≤ u ≤ 1 and 0 else. Given X = u with 0 < u < 1, the length Y of the middle subinterval takes values in the interval [0, 1 − u]. Since the second random point is uniformly distributed over [u, 1], and Y is the difference between that point and the constant u, the conditional distribution of Y is uniform over the interval [0, 1 − u]. That is, if 0 < u < 1:
$$f_{Y|X}(v|u) = \begin{cases} \frac{1}{1-u} & 0 \le v \le 1 - u \\ 0 & \text{else.} \end{cases}$$
For 0 < u < 1, f_{X,Y}(u, v) = f_X(u) f_{Y|X}(v|u) = f_{Y|X}(v|u), and f_{X,Y}(u, v) = 0 otherwise. Therefore,
$$f_{X,Y}(u, v) = \begin{cases} \frac{1}{1-u} & (u, v) \in T \\ 0 & \text{else,} \end{cases}$$
where T = {(u, v) : 0 < u < 1, 0 ≤ v ≤ 1 − u}.
$$f_Y(v_o) = \int f_{X,Y}(u, v_o)\, du = \int_0^{1-v_o} \frac{1}{1-u}\, du = \left[ -\ln(1-u) \right]_0^{1-v_o} = -\ln(v_o).$$
This is infinite if v_o = 0, but the density at a single point doesn't make a difference, so we set f_Y(0) = 0. Thus, we have:
$$f_Y(v) = \begin{cases} -\ln(v) & 0 < v < 1 \\ 0 & \text{else.} \end{cases}$$
Example 4.3.6 Suppose Z = Y/X², where X and Y have the joint pdf
$$f_{X,Y}(u, v) = \begin{cases} 1 & \text{if } 0 \le u \le 1,\ 0 \le v \le 1 \\ 0 & \text{else.} \end{cases}$$
(a) Find the numerical value of P{Z ≤ 0.5}. Also, sketch a region in the plane so that the area of the region is P{Z ≤ 0.5}. (b) Find the numerical value of P{Z ≤ 4}. Also, sketch a region in the plane so that the area of the region is P{Z ≤ 4}.
Solution. (a) P{Z ≤ 0.5} = P{Y ≤ 0.5X²} = ∫_0^1 0.5u² du = 1/6.
[Figure: the region under the curve v = 0.5u² within the unit square; its area is P{Z ≤ 0.5}.]
(b) P{Z ≤ 4} = P{Y ≤ 4X²} = ∫_0^{0.5} 4u² du + 0.5 = 1/6 + 1/2 = 2/3.
[Figure: the region under the curve v = 4u² within the unit square, together with the part of the square to the right of u = 0.5; its area is P{Z ≤ 4}.]
4.4 Independence of random variables
Independence of events is discussed in Section 2.4. Recall that two events, A and B, are independent, if and only if P (AB) = P (A)P (B). Events A, B, and C are mutually independent if they
are pairwise independent, and P (ABC) = P (A)P (B)P (C). The general definition of mutual independence for n random variables was given in Section 2.4.2. Namely, X1 , X2 , . . . , Xn are mutually
independent if any set of events of the form {X_1 ∈ A_1}, {X_2 ∈ A_2}, . . . , {X_n ∈ A_n} are mutually independent. In this section we cover in more detail the special case of independence for two
random variables, although factorization results are given which can be extended to the case of n
mutually independent random variables.
4.4.1
Definition 4.4.1 Random variables X and Y are defined to be independent if any pair of events of the form {X ∈ A} and {Y ∈ B} are independent. That is:
$$P\{X \in A, Y \in B\} = P\{X \in A\} P\{Y \in B\}. \qquad (4.14)$$
In particular, taking A = (−∞, u_o] and B = (−∞, v_o] shows that if X and Y are independent, then the joint CDF factors:
$$F_{X,Y}(u_o, v_o) = F_X(u_o) F_Y(v_o) \quad \text{for all } (u_o, v_o) \in \mathbb{R}^2. \qquad (4.15)$$
Since the CDF completely determines probabilities of the form P{X ∈ A, Y ∈ B}, it turns out that the converse is also true: if the CDF factors (i.e. (4.15) holds for all u_o, v_o), then X and Y are independent (i.e. (4.14) holds for all A, B).
In practice, we like to use the generality of (4.14) when doing calculations, but the condition
(4.15) is easier to check. To illustrate that (4.15) is stronger than it might appear, suppose that
(4.15) holds for all values of u_o, v_o, and suppose a < b and c < d. By the four point difference formula for CDFs, illustrated in Figure 4.2,
$$P\{a < X \le b, c < Y \le d\} = F_X(b)F_Y(d) - F_X(b)F_Y(c) - F_X(a)F_Y(d) + F_X(a)F_Y(c) = (F_X(b) - F_X(a))(F_Y(d) - F_Y(c)) = P\{a < X \le b\} P\{c < Y \le d\}.$$
Therefore, if (4.15) holds for all values of uo , vo , then (4.14) holds whenever A = (a, b] and B = (c, d]
for some a, b, c, d. It can be shown that (4.14) also holds when A and B are finite unions of intervals,
and then for all choices of A and B.
Recall that for discrete-type random variables it is usually easier to work with pmfs, and for
jointly continuous-type random variables it is usually easier to work with pdfs, than with CDFs.
Fortunately, in those instances, independence is also equivalent to a factorization property for a
joint pmf or pdf. Therefore, discrete-type random variables X and Y are independent if and only
if the joint pmf factors:
pX,Y (u, v) = pX (u)pY (v),
for all u, v. And for jointly continuous-type random variables, X and Y are independent if and only
if the joint pdf factors:
fX,Y (u, v) = fX (u)fY (v).
4.4.2
Suppose X and Y are jointly continuous with joint pdf f_{X,Y}. So X and Y are independent if and only if f_{X,Y}(u, v) = f_X(u) f_Y(v) for all u and v. It takes a little practice to be able to tell, given a choice of f_{X,Y}, whether independence holds. Some propositions are given in this section that can often be used to help determine whether X and Y are independent. To begin, the following proposition gives a condition equivalent to independence, based on conditional pdfs.
Proposition 4.4.2 X and Y are independent if and only if the following condition holds: For all u ∈ ℝ, either f_X(u) = 0 or f_{Y|X}(v|u) = f_Y(v) for all v ∈ ℝ.
Proof. (if part) If the condition holds, then for any (u, v) ∈ ℝ² there are two cases: (i) f_X(u) = 0, so f_{X,Y}(u, v) = 0, so f_{X,Y}(u, v) = f_X(u) f_Y(v). (ii) f_X(u) > 0, and then f_{X,Y}(u, v) = f_X(u) f_{Y|X}(v|u) = f_X(u) f_Y(v). So in either case, f_{X,Y}(u, v) = f_X(u) f_Y(v). Since this is true for all (u, v) ∈ ℝ², X and Y are independent.
(only if part) If X and Y are independent, then f_{X,Y}(u, v) = f_X(u) f_Y(v) for all (u, v) ∈ ℝ², so if f_X(u) > 0, then f_{Y|X}(v|u) = f_{X,Y}(u, v)/f_X(u) = f_Y(v), for all v ∈ ℝ.
The remaining propositions have to do with the notion of product sets. Suppose A and B each consist of a finite union of disjoint finite intervals of the real line. Let |A| denote the sum of the lengths of all intervals making up A, and |B| denote the sum of the lengths of all intervals making up B. The product set of A and B, denoted by A × B, is defined by A × B = {(u, v) : u ∈ A, v ∈ B}, as illustrated in Figure 4.12. The total area of the product set, |A × B|, is equal to |A| · |B|.
Definition 4.4.3 A subset S of ℝ² has the swap property if for any two points (a, b) ∈ S and (c, d) ∈ S, the points (a, d) and (c, b) are also in S.
The swap property holds for a product set A × B because if (a, b) ∈ A × B and (c, d) ∈ A × B, then both a and c are in A and both b and d are in B, so (a, d) and (c, b) must be in A × B. It is not difficult to show that the converse is true too, so the following proposition holds.
[Figure 4.12: a product set A × B in the u–v plane, where A and B are each finite unions of intervals.]
Example 4.4.7 Decide whether X and Y are independent for each of the following three pdfs:
(a) f_{X,Y}(u, v) = { Cu²v² for u, v ≥ 0, u + v ≤ 1; 0 else }, for an appropriate choice of C.
(b) f_{X,Y}(u, v) = { u + v for u, v ∈ [0, 1]; 0 else }.
(c) f_{X,Y}(u, v) = { 9u²v² for u, v ∈ [0, 1]; 0 else }.
Solution: (a) No, X and Y are not independent. This one is a little tricky because the function Cu²v² does factor into a function of u times a function of v. However, f_{X,Y}(u, v) is only equal to Cu²v² over a portion of the plane. One reason (only one reason is needed!) X and Y are not independent is that the support of f_{X,Y} is not a product set. For example, f_{X,Y}(0.3, 0.6) > 0 and f_{X,Y}(0.6, 0.3) > 0, but f_{X,Y}(0.6, 0.6) = 0. That is, (0.3, 0.6) and (0.6, 0.3) are both in the support, but (0.6, 0.6) is not, so the support fails the swap test and is not a product set by Proposition 4.4.4. Another reason X and Y are not independent is that the conditional distribution of Y given X = u depends on u: for u fixed, the conditional density of Y given X = u has support equal to the interval [0, 1 − u]. Since the support of the conditional pdf of Y given X = u depends on u, the conditional pdf itself depends on u. So X and Y are not independent by Proposition 4.4.2.
(b) No, X and Y are not independent. One reason is that the function u + v on [0, 1]² does not factor into a function of u times a function of v.¹ Another reason is that the conditional distributions of Y given X = u depend on u: f_{Y|X}(v|u) = { (u + v)/(u + 0.5) for v ∈ [0, 1]; 0 else } for 0 ≤ u ≤ 1. So X and Y are not independent by Proposition 4.4.2.
(c) Yes, because f_{X,Y}(u, v) = f_X(u) f_Y(v), where f_X(u) = { 3u² for u ∈ [0, 1]; 0 else } and f_Y = f_X.
¹ If the density did factor, then for 0 < u_1 < u_2 < 1 and 0 < v_1 < v_2 < 1 the following would have to hold: (u_1 + v_1)(u_2 + v_2) = (u_1 + v_2)(u_2 + v_1), or equivalently, (u_2 − u_1)(v_2 − v_1) = 0, which is a contradiction.
4.5 Distribution of sums of random variables
Section 3.8 describes a two- or three-step procedure for finding the distribution of a random variable that is a function, g(X), of another random variable. If g(X) is a discrete random variable, Step 1 is to scope the problem, and Step 2 is to find the pmf of g(X). If g(X) is a continuous-type random variable, Step 1 is to scope the problem, Step 2 is to find the CDF, and Step 3 is to differentiate the CDF to find the pdf. The same procedures work to find the pmf or pdf of a random variable that is a function of two random variables, having the form g(X, Y). An important function of (X, Y) is the sum, X + Y. We've seen by linearity of expectation, (4.5), that E[X + Y] = E[X] + E[Y]. This section focuses on determining the distribution of the sum, X + Y, under various assumptions on the joint distribution of X and Y.
4.5.1
Suppose S = X + Y, where X and Y are integer-valued random variables. We shall derive the pmf
of S in terms of the joint pmf of X and Y. For a fixed value k, the possible ways to get S = k can
be indexed according to the value of X. That is, for S = k to happen, it must be that X = j and Y = k − j for some value of j. Therefore, by the law of total probability,
$$p_S(k) = P\{X + Y = k\} = \sum_j P\{X = j, Y = k - j\} = \sum_j p_{X,Y}(j, k-j). \qquad (4.16)$$
If X and Y are independent, then p_{X,Y}(j, k − j) = p_X(j) p_Y(k − j), and (4.16) becomes:
$$p_S(k) = \sum_j p_X(j) p_Y(k - j). \qquad (4.17)$$
Example 4.5.1 Suppose X has the binomial distribution with parameters m and p, Y has the
binomial distribution with parameters n and p (the same p as for X), and X and Y are independent.
Describe the distribution of S = X + Y.
Solution: This problem can be solved with a little thought and no calculation, as follows. Recall
that the distribution of X arises as the number of times heads shows if a coin with bias p is flipped
m times. Then Y could be thought of as the number of times heads shows in n additional flips of
the same coin. Thus, S = X + Y is the number of times heads shows in m + n flips of the coin. So
S has the binomial distribution with parameters m + n and p.
That was easy, or at least it didn't require tedious computation. Let's try doing it another way, and at the same time get practice using the convolution formula (4.17). Since the support of X is {0, 1, . . . , m} and the support of Y is {0, 1, . . . , n}, the support of S is {0, 1, . . . , m + n}. So select an integer k with 0 ≤ k ≤ m + n. Then (4.17) becomes
$$p_S(k) = \sum_{j=0}^{\min(m,k)} \binom{m}{j} p^j (1-p)^{m-j} \binom{n}{k-j} p^{k-j} (1-p)^{n-k+j}
= p^k (1-p)^{m+n-k} \sum_{j=0}^{\min(m,k)} \binom{m}{j} \binom{n}{k-j}
= \binom{m+n}{k} p^k (1-p)^{m+n-k},$$
where the last step can be justified as follows. Recall that $\binom{m+n}{k}$ is the number of subsets of size k that can be formed by selecting from among m + n distinct objects. Suppose the first m objects are orange and the other n objects are blue. Then the number of subsets of size k that can be made using the m + n objects, such that j of the objects in the set are orange, is $\binom{m}{j}\binom{n}{k-j}$, because the j orange objects in the set can be chosen separately from the k − j blue objects. Then summing over j gives $\binom{m+n}{k}$. This verifies the equation, and completes the second derivation of the fact that S has the binomial distribution with parameters m + n and p.
Example 4.5.2 Suppose X and Y are independent random variables such that X has the Poisson distribution with parameter λ_1 and Y has the Poisson distribution with parameter λ_2. Describe the distribution of S = X + Y.
Solution: This problem can also be solved with a little thought and no calculation, as follows. Recall that the Poisson distribution is a limiting form of the binomial distribution with large n and small p. So let p be a very small positive number, and let m = λ_1/p and n = λ_2/p. (Round to the nearest integer if necessary so that m and n are integers.) Then the distribution of X is well approximated by the binomial distribution with parameters m and p, and the distribution of Y is well approximated by the binomial distribution with parameters n and p. So by Example 4.5.1, the distribution of X + Y is well approximated by the binomial distribution with parameters m + n and p. But m + n is large and p is small, with the product (m + n)p = mp + np = λ_1 + λ_2. Therefore, the distribution of X + Y is well approximated by the Poisson distribution with parameter λ_1 + λ_2. The approximations become better and better as p → 0, so we conclude that X + Y has the Poisson distribution with parameter λ_1 + λ_2.
Let's derive the solution again, this time using the convolution formula (4.17). The support of X + Y is the set of nonnegative integers, so select an integer k ≥ 0. Then (4.17) becomes
$$p_{X+Y}(k) = \sum_{j=0}^{k} \frac{\lambda_1^j e^{-\lambda_1}}{j!} \cdot \frac{\lambda_2^{k-j} e^{-\lambda_2}}{(k-j)!}
= \sum_{j=0}^{k} \frac{\lambda_1^j \lambda_2^{k-j}}{j!(k-j)!}\, e^{-(\lambda_1 + \lambda_2)}
= \frac{\lambda^k e^{-\lambda}}{k!} \sum_{j=0}^{k} \binom{k}{j} p^j (1-p)^{k-j}
= \frac{\lambda^k e^{-\lambda}}{k!},$$
where λ = λ_1 + λ_2 and p = λ_1/λ. In the last step we used the fact that the sum over the binomial pmf with parameters k and p is one. Thus, we see again that X + Y has the Poisson distribution with parameter λ_1 + λ_2.
Example 4.5.3 Suppose X and Y represent the numbers showing for rolls of two fair dice. Thus,
each is uniformly distributed over the set {1, 2, 3, 4, 5, 6} and they are independent. The convolution
formula for the pmf of S = X + Y is an instance of the fact that the convolution of two identical
rectangle functions results in a triangle shaped function. See Fig. 4.13.
[Figure 4.13: The pmfs p_X and p_Y, each uniform with height 1/6 on {1, . . . , 6}, and their convolution p_S, a triangular pmf on {2, . . . , 12}: the pmf for the sum of the numbers showing for rolls of two fair dice.]
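Formula (4.17) can be evaluated numerically as a discrete convolution; for example (an illustrative check, not from the text), the pmf of the sum of two fair dice:

```python
import numpy as np

p_die = np.full(6, 1 / 6)            # pmf of one fair die on values 1..6
p_sum = np.convolve(p_die, p_die)    # pmf of the sum, on values 2..12 (formula (4.17))

for k, prob in zip(range(2, 13), p_sum):
    print(k, round(prob, 4))         # triangular: 1/36, 2/36, ..., 6/36, ..., 1/36
```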
4.5.2
Suppose S = X + Y where X and Y are jointly continuous-type. We will express the pdf fS in
terms of the joint pdf, fX,Y . The method will be to first find the CDF of S and then differentiate
it to get the pdf. For any c ∈ ℝ, the event {S ≤ c} is the same as the event that the random point (X, Y) in the plane falls into the shaded region of Figure 4.14.
[Figure 4.14: the half-plane {(u, v) : u + v ≤ c} below the line u + v = c.]
The shaded region can be integrated over by integrating over v for each fixed u, so
$$F_S(c) = P\{S \le c\} = \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{c-u} f_{X,Y}(u, v)\, dv \right] du.$$
Therefore,
$$f_S(c) = \frac{dF_S(c)}{dc} = \frac{d}{dc} \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{c-u} f_{X,Y}(u, v)\, dv \right] du = \int_{-\infty}^{\infty} f_{X,Y}(u, c - u)\, du. \qquad (4.18)$$
The integral in (4.18) can be viewed as the integral of f_{X,Y} over the line u + v = c shown in Figure 4.14. This is an integral form of the law of total probability, because in order for X + Y = c, it is necessary that there is some value u such that X = u and Y = c − u. The integral in (4.18) integrates (and integration is a type of summation) over all possible values of u.
If X and Y are independent, so that f_{X,Y}(u, v) = f_X(u) f_Y(v), then (4.18) becomes
$$f_S(c) = \int_{-\infty}^{\infty} f_X(u) f_Y(c - u)\, du. \qquad (4.19)$$
Note the strong similarity between (4.16) and (4.17), derived for sums of integer-valued random
variables, and (4.18) and (4.19), derived for sums of continuous-type random variables.
Example 4.5.4 Suppose X and Y are independent, with each being uniformly distributed over
the interval [0, 1]. Find the pdf of S = X + Y.
Solution: We have f_S = f_X * f_Y, and we will use Figure 4.15 to help work out the integral in (4.19). As shown, the graph of the function f_Y(c − u), as a function of u, is a rectangle over the interval [c − 1, c].
[Figure 4.15: the rectangles f_X(u) (over [0, 1]) and f_Y(c − u) (over [c − 1, c]) plotted versus u, for a value of c in (0, 1] and for a value of c in (1, 2].]
If 0 < c ≤ 1, the two rectangles overlap on the interval [0, c], which has length c, and therefore f_X * f_Y(c) = c. If 1 < c ≤ 2, they overlap on the interval [c − 1, 1], which has length 2 − c, and therefore f_X * f_Y(c) = 2 − c. Otherwise, the value of the convolution is zero. Summarizing,
$$f_S(c) = \begin{cases} c & 0 < c \le 1 \\ 2 - c & 1 < c \le 2 \\ 0 & \text{else,} \end{cases}$$
so the graph of f_S has the triangular shape shown in Figure 4.16.
[Figure 4.16: the triangular pdf f_S(c), rising from 0 at c = 0 to 1 at c = 1 and back to 0 at c = 2.]
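A two-line Monte Carlo sketch (illustrative seed and sample size) is consistent with the triangular pdf: under f_S, P{S ≤ 0.5} = 0.5²/2 = 0.125 and P{S ≤ 1.5} = 1 − 0.125.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
s = rng.uniform(0, 1, n) + rng.uniform(0, 1, n)   # sum of two independent uniforms

print((s <= 0.5).mean(), (s <= 1.5).mean())       # close to 0.125 and 0.875
```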
Example 4.5.5 Suppose X and Y are independent, with X having the N(0, σ_1²) distribution and Y having the N(0, σ_2²) distribution. Find the pdf of X + Y.
Solution: By (4.19),
$$f_{X+Y}(c) = \int_{-\infty}^{\infty} f_X(u) f_Y(c - u)\, du
= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-\frac{u^2}{2\sigma_1^2}} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-\frac{(c-u)^2}{2\sigma_2^2}}\, du
= \frac{1}{2\pi \sigma_1 \sigma_2} \int_{-\infty}^{\infty} \exp\left( -\frac{u^2}{2\sigma_1^2} - \frac{(c-u)^2}{2\sigma_2^2} \right) du. \qquad (4.20)$$
The exponent can be written as $-(Au^2 + 2Bu + C)/2$, where $A = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}$, $B = -\frac{c}{\sigma_2^2}$, and $C = \frac{c^2}{\sigma_2^2}$. Completing the square,
$$\exp\left( -\frac{Au^2 + 2Bu + C}{2} \right) = \left[ \frac{1}{\sqrt{2\pi/A}} \exp\left( -\frac{(u + B/A)^2}{2/A} \right) \right] \sqrt{\frac{2\pi}{A}}\, e^{B^2/(2A) - C/2}, \qquad (4.21)$$
and the quantity in square brackets in (4.21) is a normal pdf that integrates to one, yielding the identity:
$$\int_{-\infty}^{\infty} e^{-(Au^2 + 2Bu + C)/2}\, du = \sqrt{\frac{2\pi}{A}}\, e^{B^2/(2A) - C/2}.$$
Substituting in the values of A, B, and C gives $B^2/(2A) - C/2 = -\frac{c^2}{2(\sigma_1^2 + \sigma_2^2)}$ and $\frac{1}{2\pi\sigma_1\sigma_2}\sqrt{2\pi/A} = \frac{1}{\sqrt{2\pi}\,\sigma}$, where $\sigma^2 = \sigma_1^2 + \sigma_2^2$. Therefore,
$$f_{X+Y}(c) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{c^2}{2\sigma^2}},$$
where σ² = σ_1² + σ_2². That is, the sum of two independent, mean zero Gaussian random variables is again Gaussian, and the variance of the sum, σ², is equal to the sum of the variances. The result can be extended to the case that X and Y have nonzero means μ_1 and μ_2. In that case, the sum is again Gaussian, with mean μ = μ_1 + μ_2 and variance σ² = σ_1² + σ_2².
The above derivation is somewhat tedious. There are two more elegant but more advanced ways to show that the convolution of two Gaussian pdfs is again a Gaussian pdf. (1) One way is to use Fourier transforms: convolution of signals is equivalent to multiplication of their Fourier transforms in the Fourier domain. The Fourier transform of the N(0, σ²) pdf is exp(−σ²(2πf)²/2), so the result is equivalent to the fact:
$$\exp(-\sigma_1^2 (2\pi f)^2/2) \cdot \exp(-\sigma_2^2 (2\pi f)^2/2) = \exp(-(\sigma_1^2 + \sigma_2^2)(2\pi f)^2/2).$$
(2) The other way is to use the fact that the sum of two independent binomial random variables with the same p parameter is again a binomial random variable, and then appeal to the DeMoivre-Laplace limit theorem, stating Gaussian distributions are limits of binomial distributions. A similar approach was used in Examples 4.5.1 and 4.5.2.
4.6
Three examples are presented in this section. The first two examples continue the theme from
the previous section; they illustrate, for functions of two random variables, use of the three step
procedure of Section 3.8 for finding the pdf of a function g(X, Y ) of two random variables. A key
step in the procedure is to calculate the probabilities of events involving two random variables in
terms of their joint pdf. The third example also involves calculating the probability of an event
determined by two random variables.
Example 4.6.1 Suppose W = max(X, Y ), where X and Y are independent, continuous-type
random variables. Express fW in terms of fX and fY . (Note: This example complements the
analysis of the minimum of two random variables, discussed in Section 3.9 in connection with
failure rate functions.)
Solution: We compute F_W first. A nice observation is that, for any constant t, max(X, Y) ≤ t if and only if X ≤ t and Y ≤ t. Equivalently, the following equality holds for events: {max(X, Y) ≤ t} = {X ≤ t} ∩ {Y ≤ t}. By the independence of X and Y,
$$F_W(t) = P\{\max(X, Y) \le t\} = P\{X \le t\} P\{Y \le t\} = F_X(t) F_Y(t).$$
Differentiating with respect to t yields
$$f_W(t) = f_X(t) F_Y(t) + f_Y(t) F_X(t). \qquad (4.22)$$
There is a nice interpretation of (4.22), explained by the following alternative derivation of it. If h is a small positive number, the probability that W is in the interval (t, t + h] is f_W(t)h + o(h), where o(h)/h → 0 as h → 0.² There are three mutually exclusive ways that W can be in (t, t + h]:
Y ≤ t and X ∈ (t, t + h]: has probability F_Y(t) f_X(t)h + o(h)
X ≤ t and Y ∈ (t, t + h]: has probability F_X(t) f_Y(t)h + o(h)
X ∈ (t, t + h] and Y ∈ (t, t + h]: has probability f_X(t)h f_Y(t)h + o(h)
So f_W(t)h + o(h) = F_Y(t) f_X(t)h + F_X(t) f_Y(t)h + f_X(t) f_Y(t)h² + o(h). Dividing this by h and letting h → 0 yields (4.22).
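A quick simulation sketch (illustrative choices: X and Y exponential with different means, an arbitrary test point t) checks the relation F_W(t) = F_X(t) F_Y(t) behind (4.22):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.exponential(scale=1.0, size=n)   # X with mean 1
y = rng.exponential(scale=2.0, size=n)   # Y with mean 2
w = np.maximum(x, y)

t = 1.5
F_X = 1 - np.exp(-t)                     # CDF of X at t
F_Y = 1 - np.exp(-t / 2)                 # CDF of Y at t
print((w <= t).mean(), F_X * F_Y)        # the two values should nearly agree
```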
Example 4.6.2 Suppose X and Y are jointly continuous-type random variables. Let R = √(X² + Y²), so R is the distance of the random point (X, Y) from the origin. Express f_R in terms of f_{X,Y}.
Solution: Clearly R is a nonnegative random variable. We proceed to find its CDF. Let c > 0, and let D(c) denote the disk of radius c centered at the origin. Using polar coordinates,
$$F_R(c) = P\{R \le c\} = P\{(X, Y) \in D(c)\} = \iint_{D(c)} f_{X,Y}(u, v)\, du\, dv = \int_0^{2\pi} \int_0^{c} f_{X,Y}(r\cos\theta, r\sin\theta)\, r\, dr\, d\theta.$$
Differentiating with respect to c yields
$$f_R(c) = \int_0^{2\pi} f_{X,Y}(c\cos\theta, c\sin\theta)\, c\, d\theta. \qquad (4.23)$$
The integral in (4.23) is just the path integral of f_{X,Y} over the circle of radius c. This makes sense, because the only way R can be close to c is if (X, Y) is close to the circle, and so (4.23) is a continuous-type example of the law of total probability.
A special case is when f_{X,Y} is circularly symmetric, which by definition means that f_{X,Y}(u, v) depends on (u, v) only through the value r = √(u² + v²). Equivalently, circularly symmetric means that f_{X,Y}(u, v) = f_{X,Y}(r, 0). So, if f_{X,Y} is circularly symmetric, (4.23) simplifies to:
$$f_R(c) = (2\pi c)\, f_{X,Y}(c, 0).$$
² See Appendix 6.1 for additional explanation of little-oh notation, commonly used in calculus for approximation errors.
Example 4.6.3 (Buffon's needle problem) Suppose a needle of unit length is thrown at random
onto a large grid of lines with unit spacing. Find the probability the needle, after it comes to rest,
intersects a grid line.
Solution: An important aspect of solving this problem is how to model it, and that depends on
how we think about it. Here is one possible way to think about it, but there are others. Imagine that the needle lands in a plane with horizontal and vertical directions, and that the grid lines are horizontal. When the needle lands, let Θ, with 0 ≤ Θ ≤ π, denote the angle between the line going through the needle and a grid line, measured counter-clockwise from the grid line, as shown in Figure 4.17. Assume that Θ is uniformly distributed over the interval [0, π].
[Figure 4.17: the needle at angle Θ to the horizontal grid lines; its vertical extent is sin(Θ).]
The vertical extent of the needle is sin(Θ). Let U denote the distance from the lower endpoint of the needle to the first grid line above that point, and assume U is uniformly distributed over [0, 1] and independent of Θ. The needle intersects a grid line if and only if U ≤ sin(Θ). Therefore,
$$P\{\text{needle intersects grid}\} = \int_0^{\pi} \sin(\theta)\, \frac{d\theta}{\pi} = \left. \frac{-\cos(\theta)}{\pi} \right|_0^{\pi} = \frac{2}{\pi} \approx 0.6366.$$
The answer can also be written as E[sin(Θ)]. A way to interpret this is that, given Θ, the probability the needle intersects a line is sin(Θ), and averaging over all Θ (in a continuous-type version of the law of total probability) gives the overall answer.
Letting p = P{needle intersects grid}, the above shows that p = 2/π. Here is another derivation you can show your friends or students. Begin by asking: what is the mean number of times the needle crosses the grid? Of course, the answer is p, because the number of crossings is either one (with probability p) or zero (with probability 1 − p). Then bend the needle to about a right angle in the middle, throw it on the grid and ask again: what is the mean number of times the needle crosses the grid? Convince yourself and the audience that the mean is still p; bending the needle does not change the mean, because each small segment of the needle has the same probability of intersecting the grid no matter the shape of the needle, and the sum of expectations is the expectation of the sum. Next ask: what if a longer bent needle, of length ℓ, is randomly dropped onto the grid; then what is the mean number of intersections of the needle with the grid? Convince yourself and the audience that the expected number is proportional to the length of the needle, so the expected number is ℓp. Now, bend a needle of length π into a ring of diameter one; the circumference is π. By the discussion so far, the mean number of intersections of the ring with the grid when the ring is randomly tossed onto the grid is πp. But no matter where the ring lands, it intersects the grid exactly twice. Thus, the mean number of intersections is two, which is equal to πp. So, p = 2/π.
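The 2/π answer is easy to check by simulation, using the model above (Θ uniform on [0, π], U uniform on [0, 1], and an intersection exactly when U ≤ sin(Θ)); the seed and sample size below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
theta = rng.uniform(0, np.pi, n)   # angle of the needle to the horizontal grid
u = rng.uniform(0, 1, n)           # distance from the lower endpoint to the grid line above

p_hat = (u <= np.sin(theta)).mean()
print(p_hat, 2 / np.pi)            # estimate vs. exact value 2/pi ~ 0.6366
```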
Example 4.6.4 Consider the following variation of Buffon's needle problem (Example 4.6.3). Suppose a needle of unit length is thrown at random onto a plane with both a vertical grid and a horizontal grid, each with unit spacing. Find the probability the needle, after it comes to rest, does NOT intersect any grid line.
Solution: Let M_h be the event that the needle misses the horizontal grid (i.e. does not intersect a horizontal grid line) and let M_v denote the event that the needle misses the vertical grid. We seek to find P(M_h ∩ M_v). By the solution to Buffon's needle problem, P(M_h) = P(M_v) = 1 − 2/π. If M_h and M_v were independent, we would have that P(M_h ∩ M_v) = (1 − 2/π)² ≈ (0.363)² ≈ 0.132. But these events are not independent.
Let Θ be defined relative to the horizontal grid as in the solution of Buffon's needle problem. Then the vertical displacement of the needle is sin(Θ) and the horizontal displacement is |cos(Θ)|. Assume that the position of the needle relative to the horizontal grid is independent of its position relative to the vertical grid. Let U be as in the solution to Buffon's needle problem, and let V similarly denote the distance from the leftmost endpoint of the needle to the first vertical grid line to the right of that point, as shown in Figure 4.19. Then U and V are independent, and the needle misses both grids if and only if U ≥ sin(Θ) and V ≥ |cos(Θ)|. Therefore,
$$P(M_h \cap M_v \mid \Theta = \theta) = P\{U \ge \sin(\theta), V \ge |\cos(\theta)|\} = P\{U \ge \sin(\theta)\} P\{V \ge |\cos(\theta)|\} = (1 - \sin(\theta))(1 - |\cos(\theta)|).$$
[Figure 4.19: Variation of the Buffon needle problem, with horizontal and vertical grids; the needle at angle Θ has vertical extent sin(Θ) and horizontal extent |cos(Θ)|, with distances U and V to the nearest horizontal and vertical grid lines.]
Averaging over using its pdf yields (using the trigometric identity 2 sin() cos() = sin(2))
Z
1
P (Mh Mh ) =
(1 sin())(1 | cos()|)d
0
Z
2 /2
=
(1 sin())(1 cos())d
0
Z
2 /2
=
1 sin() cos() sin() cos()d
0
Z
2 /2
sin(2)
=
1 sin() cos()
d
0
2
2
1
3
=
11+
= 1 0.045.
2
2
The true probability of missing both grids is almost three times smaller than what it would be if
Mh and Mv were independent. Intuitively, the events are negatively correlated. That is because
if Mh is true, it is likely that the position of the needle is more horizontal than vertical, and that
makes it less likely that Mv is true. A way to express the negative correlation is to note that

   P(Mv | Mh) = P(Mh ∩ Mv)/P(Mh) ≈ 0.045/0.363 ≈ 0.124,

which is nearly three times smaller than the unconditional probability, P(Mv) ≈ 0.363.
4.7
The previous two sections present examples with one random variable that is a function of two
other random variables. For example, X + Y is a function of (X, Y ). In this section we consider
the case that there are two random variables, W and Z, that are both functions of (X, Y ), and we
see how to determine the joint pdf of W and Z from the joint pdf of X and Y. For example, (X, Y )
could represent a random point in the plane in the usual rectangular coordinates, and we may be
interested in determining the joint distribution of the polar coordinates of the same point.
4.7.1
To get started, first consider the case that W and Z are both linear functions of X and Y. It is much
simpler to work with matrix notation, and, following the usual convention, we represent points in
the plane as column vectors. So sometimes we write the column vector (X, Y)ᵀ instead of (X, Y),
and we write fX,Y((u, v)ᵀ) instead of fX,Y(u, v).
Suppose X and Y have a joint pdf fX,Y, and suppose W = aX + bY and Z = cX + dY for
some constants a, b, c, and d. Equivalently, in matrix notation, suppose (X, Y)ᵀ has a joint pdf fX,Y,
and suppose

   (W, Z)ᵀ = A (X, Y)ᵀ,   where A = [a b; c d].

Thus, we begin with a random point (X, Y)ᵀ and get another random point (W, Z)ᵀ. For ease of analysis,
we can suppose that (X, Y)ᵀ is in the u-v plane and (W, Z)ᵀ is in the α-β plane. That is, (W, Z)ᵀ is the
image of (X, Y)ᵀ under the linear mapping:

   (α, β)ᵀ = A (u, v)ᵀ.

The determinant of A is defined by det(A) = ad − bc. If det A ≠ 0 then the mapping has an inverse,
given by

   (u, v)ᵀ = A⁻¹ (α, β)ᵀ,   where A⁻¹ = (1/det A) [d −b; −c a].

An important property of such linear transformations is that if R is a set in the u-v plane and
if S is the image of R under the mapping, then area(S) = |det(A)| area(R). Consider the problem of
finding fW,Z(α, β) for some fixed (α, β). If there is a small rectangle S with a corner at (α, β), then
fW,Z(α, β) ≈ P{(W, Z) ∈ S}/area(S). Now {(W, Z) ∈ S} is the same as {(X, Y) ∈ R}, where R is the
preimage of S under the linear transformation. Since S is a small set, R is also a small set. So
P{(W, Z) ∈ S} = P{(X, Y) ∈ R} ≈ fX,Y(u, v) area(R), where (u, v)ᵀ = A⁻¹(α, β)ᵀ. Thus,
fW,Z(α, β) ≈ fX,Y(u, v) area(R)/area(S) = (1/|det A|) fX,Y(u, v). This observation leads to the following
proposition:

Proposition 4.7.1 Suppose (W, Z)ᵀ = A (X, Y)ᵀ, where (X, Y)ᵀ has joint pdf fX,Y and
det(A) ≠ 0. Then (W, Z)ᵀ has joint pdf given by

   fW,Z(α, β) = (1/|det A|) fX,Y( A⁻¹ (α, β)ᵀ ).
Example 4.7.2 Suppose X and Y have joint pdf fX,Y, and W = X − Y and Z = X + Y. Express
the joint pdf of W and Z in terms of fX,Y.

Solution: In matrix notation, (W, Z)ᵀ = A (X, Y)ᵀ with

   A = [1 −1; 1 1],   so det A = 2 and A⁻¹ = [1/2 1/2; −1/2 1/2].

Equivalently, (α, β) = (u − v, u + v) if and only if u = (α + β)/2 and v = (β − α)/2. Therefore, by
Proposition 4.7.1,

   fW,Z(α, β) = (1/2) fX,Y( (α + β)/2, (β − α)/2 )

for all (α, β) ∈ R².
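Proposition 4.7.1 can be sanity-checked by simulation. The sketch below is an illustration added here (not part of the original solution); it assumes X and Y are independent standard normal random variables (any joint pdf would do), forms W = X − Y and Z = X + Y, and compares an empirical estimate of fW,Z at a few points with the formula (1/2) fX,Y((α+β)/2, (β−α)/2) from Example 4.7.2.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 2 * 10**6
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)
    w, z = x - y, x + y                       # W = X - Y, Z = X + Y

    def f_xy(u, v):                           # joint pdf of (X, Y): independent N(0,1)
        return np.exp(-(u**2 + v**2) / 2) / (2 * np.pi)

    h = 0.1                                   # half-width of the box used to estimate the density
    for a, b in [(0.0, 0.0), (1.0, 0.5), (-0.5, 2.0)]:
        inside = (np.abs(w - a) < h) & (np.abs(z - b) < h)
        est = inside.mean() / (2 * h) ** 2    # empirical density of (W, Z) near (a, b)
        formula = 0.5 * f_xy((a + b) / 2, (b - a) / 2)
        print((a, b), "empirical:", round(est, 4), "formula:", round(formula, 4))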
Example 4.7.3 Suppose X and Y are independent, continuous-type random variables. Find the
joint pdf of W and Z, where W = X + Y and Z = Y. Also, find the pdf of W.

Solution: In matrix notation, (W, Z)ᵀ = A (X, Y)ᵀ with A = [1 1; 0 1], so det A = 1 and
A⁻¹ = [1 −1; 0 1]. By Proposition 4.7.1 and the independence of X and Y,

   fW,Z(α, β) = fX,Y(α − β, β) = fX(α − β) fY(β).

The pdf of W is the corresponding marginal:

   fW(α) = ∫_{−∞}^{∞} fX(α − β) fY(β) dβ.

Equivalently, fW is the convolution: fW = fX ∗ fY. This expression for the pdf of the sum of two
independent continuous-type random variables was found by a different method in Section 4.5.2.
4.7.2

Consider next the case that (W, Z)ᵀ = g((X, Y)ᵀ) for a possibly nonlinear mapping g from the u-v plane
to the α-β plane, with coordinate functions g1 and g2, so that α = g1(u, v) and β = g2(u, v). The
Jacobian of g is the matrix of first-order partial derivatives,

   J = J(u, v) = [ ∂g1/∂u  ∂g1/∂v ; ∂g2/∂u  ∂g2/∂v ].

The Jacobian is also called the matrix derivative of g. Just as for functions of one variable, the
function g near a fixed point (u_o, v_o)ᵀ can be approximated by a linear function:

   g((u, v)ᵀ) ≈ g((u_o, v_o)ᵀ) + A (u − u_o, v − v_o)ᵀ,

where the matrix A is given by A = J(u_o, v_o). More relevant, for our purposes, is the related
fact that for a small set R near a point (u, v), if S is the image of R under the mapping, then
area(S)/area(R) ≈ |det(J)|.

If (W, Z)ᵀ = g((X, Y)ᵀ), and we wish to find the pdf of (W, Z)ᵀ at a particular point (α, β), we need to
consider values of (u, v) such that g(u, v) = (α, β). The simplest case is if there is at most one
value of (u, v) so that g(u, v) = (α, β). If that is the case for all (α, β), g is said to be a one-to-one
function. If g is one-to-one, then g⁻¹(α, β) is well-defined for all (α, β) in the range of g.
These observations lead to the following proposition:

Proposition 4.7.4 Suppose (W, Z)ᵀ = g((X, Y)ᵀ), where (X, Y)ᵀ has pdf fX,Y, and g is a one-to-one mapping
from the support of fX,Y to R². Suppose the Jacobian J of g exists, is continuous, and has nonzero
determinant everywhere. Then (W, Z)ᵀ has joint pdf given by

   fW,Z(α, β) = (1/|det J|) fX,Y( g⁻¹(α, β) ),

where J is evaluated at (u, v) = g⁻¹(α, β).
192
Solution: The vector (X, Y ) in the u v plane is transformed into the vector (W, Z) in the
plane under a mapping g that maps u, v to = u2 and = u(1 + v). The image in the plane
of the square [0, 1]2 (with its left side removed) in the u v plane is the set A given by
v
1
"
u
1
+ 1
2
if (, ) A
else.
Example 4.7.6 Suppose X and Y are independent N(0, σ²) random variables. View (X, Y)ᵀ as a
random point in the u-v plane, and let (R, Θ) denote the polar coordinates of that point. Find
the joint pdf of R and Θ.

Solution: Changing from rectangular coordinates to polar coordinates can be viewed as a mapping g
from the u-v plane to the set [0, ∞) × [0, 2π) in the r-θ plane, where r = (u² + v²)^{1/2} and
θ = tan⁻¹(v/u). The inverse of this mapping is given by

   u = r cos(θ),   v = r sin(θ).

The joint pdf of X and Y is

   fX,Y(u, v) = (1/(2πσ²)) e^{−(u²+v²)/(2σ²)} = (1/(2πσ²)) e^{−r²/(2σ²)}.

The Jacobian of g is

   J = [ ∂r/∂u  ∂r/∂v ; ∂θ/∂u  ∂θ/∂v ] = [ u/r  v/r ; −v/r²  u/r² ],

and so det(J) = 1/r. (This corresponds to the well known fact from calculus that du dv = r dr dθ
for changing integrals in rectangular coordinates to integrals in polar coordinates.) Therefore,
Proposition 4.7.4 yields that for (r, θ) in the support [0, ∞) × [0, 2π) of (R, Θ),

   fR,Θ(r, θ) = r fX,Y(r cos(θ), r sin(θ)) = (r/(2πσ²)) e^{−r²/(2σ²)}.

Of course fR,Θ(r, θ) = 0 off the range of the mapping. The joint density factors into a function of r
and a function of θ, so R and Θ are independent. Moreover, R has the Rayleigh distribution with
parameter σ², and Θ is uniformly distributed on [0, 2π]. The distribution of R here could also be
found using the result of Example 4.6.2, but the analysis here shows that, in addition, for the case
X and Y are independent and both N(0, σ²), the distance R from the origin is independent of the
angle Θ.

The result of this example can be used to generate two independent N(0, σ²) random variables,
X and Y, beginning with two independent random variables, A and B, that are uniform on the
interval [0, 1], as follows. Let R = σ√(−2 ln(A)), which can be shown to have the Rayleigh distribution
with parameter σ², and let Θ = 2πB, which is uniformly distributed on [0, 2π]. Then let X =
R cos(Θ) and Y = R sin(Θ).
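Here is a minimal sketch of the generation method just described; the value σ = 2, the seed, and the sample size are arbitrary choices made here for illustration. The sketch checks that the generated X and Y have sample variance close to σ² and sample correlation close to zero.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 10**6
    sigma = 2.0                               # illustrative value of the parameter sigma

    a = rng.uniform(size=n)
    b = rng.uniform(size=n)
    r = sigma * np.sqrt(-2.0 * np.log(a))     # Rayleigh(sigma^2) radius
    theta = 2 * np.pi * b                     # uniform angle on [0, 2*pi)

    x = r * np.cos(theta)
    y = r * np.sin(theta)
    print("sample variances:", x.var(), y.var())            # both close to sigma**2 = 4
    print("sample correlation:", np.corrcoef(x, y)[0, 1])   # close to 0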
Example 4.7.7 Suppose a ball is thrown upward at time t = 0 with initial height X and initial
upward velocity Y, such that X and Y are independent, exponentially distributed with parameter
λ. The height at time t is thus H(t) = X + Yt − ct²/2, where c is the gravitational constant. Let
T denote the time the ball reaches its maximum height and M denote the maximum height. Find
the joint pdf of T and M, fT,M(τ, μ).

Solution: Solving H′(t) = 0 to find T yields that T = Y/c and M = X + Y²/(2c). Or (T, M)ᵀ = g((X, Y)ᵀ),
where g is the mapping from the u-v plane to the τ-μ plane given by τ = g1(u, v) = v/c and
μ = g2(u, v) = u + v²/(2c). The support of fX,Y is the positive quadrant of the plane, and fX,Y(u, v) =
fX(u)fY(v) = λ²e^{−λ(u+v)} for (u, v) in the positive quadrant. The support of fT,M is the range
of g(u, v) as (u, v) ranges over the support of fX,Y. For (τ, μ) to be in the support of fT,M, it
must be that τ = v/c for some v ≥ 0, or v = cτ, and then μ = u + v²/(2c) = u + cτ²/2. Therefore, for
τ ≥ 0 fixed, the point (τ, μ) is in the support if μ ≥ cτ²/2. That is, the support of fT,M(τ, μ) is
{(τ, μ) : τ ≥ 0, μ ≥ cτ²/2}, the region on or above the parabola μ = cτ²/2 in the τ-μ plane.

The mapping g is one-to-one, because for any (τ, μ) in the range set, u = μ − cτ²/2 and v = cτ.
The Jacobian of g is given by

   J = J(u, v) = [0  1/c; 1  v/c],

so that |det(J)| = 1/c. Therefore,

   fT,M(τ, μ) = (1/|det J|) fX,Y( g⁻¹(τ, μ) ) = c λ² exp(−λ(μ − cτ²/2 + cτ))   if τ ≥ 0 and μ ≥ cτ²/2,
   fT,M(τ, μ) = 0   else.

4.7.3
Example 4.7.8 Suppose W = min{X, Y } and Z = max{X, Y }, where X and Y are jointly
continuous-type random variables. Express fW,Z in terms of fX,Y .
Solution: Note that (W, Z) is the image of (X, Y) under the mapping from the u-v plane to
the α-β plane defined by α = min{u, v} and β = max{u, v}, shown in Figure 4.21. This mapping
maps R² into the set {(α, β) : α ≤ β}, which is the set of points on or above the diagonal (i.e. the
line α = β) in the α-β plane. Since X and Y are jointly continuous, P{W = Z} = P{X = Y} =
∫_{−∞}^{∞} ( ∫_v^v fX,Y(u, v) du ) dv = 0, so it does not matter how the joint density of (W, Z) is defined exactly
on the diagonal; we will set it to zero there. Let U = {(α, β) : α < β}, which is the region that lies
strictly above the diagonal. For any subset A ⊂ U, {(W, Z) ∈ A} = {(X, Y) ∈ A} ∪ {(Y, X) ∈ A},
and these two events are mutually exclusive, so

   P{(W, Z) ∈ A} = P{(X, Y) ∈ A} + P{(Y, X) ∈ A}
                 = ∫∫_A fX,Y(α, β) dα dβ + ∫∫_A fX,Y(β, α) dα dβ,

where in the last step we simply changed the variables of integration. Consequently, the joint pdf
of (W, Z) is given by

   fW,Z(α, β) = fX,Y(α, β) + fX,Y(β, α)   if α < β,
   fW,Z(α, β) = 0   else.

There are two terms on the right hand side because for each (α, β) with α < β there are two points
in the (u, v) plane that map into that point: (α, β) and (β, α). No Jacobian factors such as those
in Section 4.7.2 appear for this example because the mappings (u, v) → (u, v) and (u, v) → (v, u)
both have Jacobians with determinant equal to one. Geometrically, to get fW,Z from fX,Y imagine
spreading the probability mass for (X, Y) on the plane, and then folding the plane at the diagonal
by swinging the part of the plane below the diagonal to above the diagonal, and then adding
together the two masses above the diagonal. This interpretation is similar to the one found for the
pdf of |X|, in Example 3.8.8.
4.8
The first and second moments, or equivalently, the mean and variance, of a single random variable
convey important information about the distribution of the variable, and the moments are often
simpler to deal with than pmfs, pdfs, or CDFs. Use of moments is even more important when
considering more than one random variable at a time. That is because joint distributions are much
more complex than distributions for individual random variables.
Let X and Y be random variables with finite second moments. Three important related quantities are:

   the correlation: E[XY]
   the covariance: Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
   the correlation coefficient: ρX,Y = Cov(X, Y)/√(Var(X)Var(Y)) = Cov(X, Y)/(σX σY).

Covariance generalizes variance, in the sense that Var(X) = Cov(X, X). Recall that there are
useful shortcuts for computing variance: Var(X) = E[X(X − E[X])] = E[X²] − E[X]². Similar
shortcuts exist for computing covariances:

   Cov(X, Y) = E[X(Y − E[Y])] = E[(X − E[X])Y] = E[XY] − E[X]E[Y].

In particular, if either X or Y has mean zero, then E[XY] = Cov(X, Y).
Random variables X and Y are called uncorrelated if Cov(X, Y) = 0. (If Var(X) > 0 and
Var(Y) > 0, so that ρX,Y is well defined, then X and Y being uncorrelated is equivalent to
ρX,Y = 0.) If Cov(X, Y) > 0, or equivalently, ρX,Y > 0, the variables are said to be positively
correlated, and if Cov(X, Y) < 0, or equivalently, ρX,Y < 0, the variables are said to be negatively
correlated. If X and Y are independent, then E[XY] = E[X]E[Y], which implies that X and
Y are uncorrelated. The converse is false: uncorrelated does not imply independence. In fact,
independence is a much stronger condition than being uncorrelated. Specifically, independence
requires a large number of equations to hold, namely FX,Y(u, v) = FX(u)FY(v) for every real value
of u and v. The condition of being uncorrelated requires only a single equation to hold.
Three or more random variables are said to be uncorrelated if they are pairwise uncorrelated.
That is, there is no difference between a set of random variables being uncorrelated or being pairwise
uncorrelated. Recall from Section 2.4.1 that, in contrast, independence of three or more events is
a stronger property than pairwise independence. Therefore, mutual independence of n random
variables is a stronger property than pairwise independence. Pairwise independence of n random
variables implies that they are uncorrelated.
Covariance is linear in each of its two arguments, and adding a constant to a random variable
does not change the covariance of that random variable with other random variables:
Cov(X + Y, U + V ) = Cov(X, U ) + Cov(X, V ) + Cov(Y, U ) + Cov(Y, V )
Cov(aX + b, cY + d) = acCov(X, Y ),
for constants a, b, c, d. The variance of the sum of uncorrelated random variables is equal to the
sum of the variances of the random variables. For example, if X and Y are uncorrelated,
Var(X + Y ) = Cov(X + Y, X + Y ) = Cov(X, X) + Cov(Y, Y ) + 2Cov(X, Y ) = Var(X) + Var(Y ),
and this calculation extends to three or more random variables.

For example, consider the sum Sn = X1 + ⋯ + Xn, such that X1, …, Xn are uncorrelated (so
Cov(Xi, Xj) = 0 if i ≠ j) with E[Xi] = μ and Var(Xi) = σ² for 1 ≤ i ≤ n. Then

   E[Sn] = nμ,                                                           (4.24)

and

   Var(Sn) = Cov( Σ_{i=1}^n Xi , Σ_{j=1}^n Xj ) = Σ_{i=1}^n Σ_{j=1}^n Cov(Xi, Xj)
           = Σ_{i=1}^n Cov(Xi, Xi) + Σ_{i,j: i≠j} Cov(Xi, Xj)
           = Σ_{i=1}^n Var(Xi) + 0 = nσ².                                (4.25)
The standardized version of Sn is thus (Sn − nμ)/√(nσ²), which has mean zero and variance one.
Example 4.8.1 Identify the mean and variance of (a) a binomial random variable with parameters
n and p, (b) a negative binomial random variable with parameters r and p, and (c) an Erlang random
variable with parameters r and λ.

Solution: (a) A binomial random variable with parameters n and p has the form S = X1 + ⋯ +
Xn, where X1, . . . , Xn are independent Bernoulli random variables with parameter p. So E[Xi] = p
for each i, Var(Xi) = p(1 − p), and, since the Xi's are independent, they are uncorrelated. Thus,
by (4.24) and (4.25), E[S] = np and Var(S) = np(1 − p).
(b) Similarly, as seen in Section 2.6, a negative binomial random variable, Sr, with parameters r and
p, arises as the sum of r independent geometrically distributed random variables with parameter
p. Each such geometric random variable has mean 1/p and variance (1 − p)/p². So, E[Sr] = r/p and
Var(Sr) = r(1 − p)/p².
(c) Likewise, as seen in Section 3.5, an Erlang random variable, Tr, with parameters r and λ,
arises as the sum of r independent exponentially distributed random variables with parameter λ.
An exponentially distributed random variable with parameter λ has mean 1/λ and variance 1/λ².
Therefore, E[Tr] = r/λ and Var(Tr) = r/λ².
It is clear from the definition that the correlation coefficient ρX,Y is a scaled version of Cov(X, Y).
The units that E[XY] or Cov(X, Y) are measured in are the product of the units that X is measured
in times the units that Y is measured in. For example, if X is in kilometers and Y is in
seconds, then Cov(X, Y) is in kilometer-seconds. If we were to change units of the first variable to
meters, then X in kilometers would be changed to 1000X in meters, and the covariance between
the new measurement in meters and Y would be Cov(1000X, Y) = 1000Cov(X, Y), which would be
measured in meter-seconds. In contrast, the correlation coefficient ρX,Y is dimensionless: it carries
no units. That is because the units of the denominator, σX σY, in the definition of ρX,Y, are the
units of X times the units of Y, which are also the units of the numerator, Cov(X, Y). The situation
is similar to the use of the standardized versions of random variables X and Y, namely (X − E[X])/σX
and (Y − E[Y])/σY. These standardized versions have mean zero, variance one, and are dimensionless. In
fact, the covariance between the standardized versions of X and Y is ρX,Y:

   Cov( (X − E[X])/σX , (Y − E[Y])/σY ) = Cov(X − E[X], Y − E[Y])/(σX σY) = Cov(X, Y)/(σX σY) = ρX,Y.

If the units of X or Y are changed (by linear or affine scaling, such as changing from kilometers to
meters, or degrees C to degrees F) the correlation coefficient does not change:

   ρ_{aX+b, cY+d} = ρ_{X,Y}   for a, c > 0.

In a sense, therefore, the correlation coefficient ρX,Y is the standardized version of the covariance,
Cov(X, Y), or of the correlation, E[XY]. As shown in the corollary of the following proposition,
correlation coefficients are always in the interval [−1, 1]. As shown in Section 4.9, covariance and
correlation coefficients play a central role in estimating Y by a linear function of X. Positive
correlation (i.e. ρX,Y > 0) means that X and Y both tend to be large or both tend to be small,
whereas a negative correlation (i.e. ρX,Y < 0) means that X and Y tend to be opposites: if X is
larger than average it tends to indicate that Y is smaller than average. The extreme case ρX,Y = 1
means Y can be perfectly predicted by a linear function aX + b with a > 0, and the extreme
case ρX,Y = −1 means Y can be perfectly predicted by a linear function aX + b with a < 0. As
mentioned earlier in this section, X and Y are said to be uncorrelated if Cov(X, Y) = 0, and being
uncorrelated does not imply independence.
Proposition 4.8.3 (Schwarz's inequality) For two random variables X and Y:

   |E[XY]| ≤ √( E[X²] E[Y²] ).                                           (4.26)

Furthermore, if E[X²] ≠ 0, equality holds in (4.26) (i.e. |E[XY]| = √(E[X²]E[Y²])) if and only if
P{Y = cX} = 1 for some constant c.

Proof. If E[X²] = 0 then P{X = 0} = 1, both sides of (4.26) are zero, and the inequality holds.
So suppose E[X²] ≠ 0 and let λ = E[XY]/E[X²]. Then

   0 ≤ E[(Y − λX)²] = E[Y²] − 2λE[XY] + λ²E[X²] = E[Y²] − E[XY]²/E[X²],  (4.27)

which implies that E[XY]² ≤ E[X²]E[Y²], which is equivalent to the Schwarz inequality. If
P{Y = cX} = 1 for some c then equality holds in (4.26), and conversely, if equality holds in
(4.26) then equality also holds in (4.27), so E[(Y − λX)²] = 0, and therefore P{Y = cX} = 1 for
c = λ.
Corollary 4.8.4 For two random variables X and Y,

   |Cov(X, Y)| ≤ √( Var(X) Var(Y) ).

Furthermore, if Var(X) ≠ 0 then equality holds if and only if Y = aX + b for some constants a
and b. Consequently, if Var(X) and Var(Y) are not zero, so the correlation coefficient ρX,Y is well
defined, then

   |ρX,Y| ≤ 1,
   ρX,Y = 1 if and only if Y = aX + b for some a, b with a > 0, and
   ρX,Y = −1 if and only if Y = aX + b for some a, b with a < 0.

Proof. The corollary follows by applying the Schwarz inequality to the random variables X − E[X]
and Y − E[Y].
Example 4.8.5 Suppose the covariance matrix of a random vector (X1, X2, X3)ᵀ is

   [ 5 2 0 ]
   [ 2 5 2 ]
   [ 0 2 5 ].

Here, the ijth entry of the matrix, meaning the entry in the ith row from the top and jth column from
the left, is Cov(Xi, Xj). For example, Cov(X1, X2) = 2.
(a) Find Cov(X1 + X2, X1 + X3).
(b) Find a so that X2 − aX1 is uncorrelated with X1.
(c) Find the correlation coefficient, ρ_{X1,X2}.
(d) Find Var(X1 + X2 + X3).

Solution: Let ci,j = Cov(Xi, Xj). Then
(a) Cov(X1 + X2, X1 + X3) = c1,1 + c1,3 + c2,1 + c2,3 = 5 + 0 + 2 + 2 = 9.
(b) Cov(X2 − aX1, X1) = c2,1 − a c1,1 = 2 − 5a, which is zero for a = 2/5.
(c) Var(Xi) = Cov(Xi, Xi) = 5 for all i and Cov(X1, X2) = 2, so ρ_{X1,X2} = 2/√(5·5) = 2/5.
(d) Var(X1 + X2 + X3) = Cov( Σ_{i=1}^3 Xi , Σ_{j=1}^3 Xj ) = 23, because the covariance expands out to the
sum of all entries in the covariance matrix.
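Since each answer is a bilinear form in the covariance matrix, the arithmetic can be checked mechanically. The following sketch (an illustration added here, not part of the original example) does so with NumPy.

    import numpy as np

    K = np.array([[5, 2, 0],
                  [2, 5, 2],
                  [0, 2, 5]], dtype=float)    # covariance matrix of (X1, X2, X3)

    # Cov(a'X, b'X) = a' K b for coefficient vectors a and b.
    a = np.array([1.0, 1.0, 0.0])             # coefficients of X1 + X2
    b = np.array([1.0, 0.0, 1.0])             # coefficients of X1 + X3
    print("Cov(X1+X2, X1+X3) =", a @ K @ b)                              # 9

    print("a making X2 - a*X1 uncorrelated with X1:", K[1, 0] / K[0, 0]) # 2/5
    print("rho(X1, X2) =", K[0, 1] / np.sqrt(K[0, 0] * K[1, 1]))         # 2/5
    print("Var(X1+X2+X3) =", K.sum())                                    # 23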
Example 4.8.6 Suppose a fair die is rolled n times. For 1 ≤ k ≤ n, let Xk = 1 if a one shows on the
kth roll and Xk = 0 otherwise, and let Yk = 1 if a two shows on the kth roll and Yk = 0 otherwise.
Let X = Σ_{k=1}^n Xk, which is the number of ones showing, and Y = Σ_{k=1}^n Yk, which is the number
of twos showing. Note that if a histogram is made recording the number of occurrences of each of
the six numbers, then X and Y are the heights of the first two entries in the histogram.
(a) Find E[X1] and Var(X1).
(b) Find E[X] and Var(X).
(c) Find Cov(Xi, Yj) if 1 ≤ i ≤ n and 1 ≤ j ≤ n. (Hint: Does it make a difference if i = j?)
(d) Find Cov(X, Y).
(e) Find the correlation coefficient ρX,Y. Are X and Y positively correlated, uncorrelated, or negatively correlated?
Solution: (a) Each Xk is a Bernoulli random variable with parameter p = 1/6, so E[Xk] = 1/6 and
Var(Xk) = E[Xk²] − E[Xk]² = p − p² = p(1 − p) = 5/36.
(b) E[X] = nE[X1] = n/6, and Var(X) = nVar(X1) = 5n/36.
(c) If i ≠ j then Cov(Xi, Yj) = 0, because Xi and Yj are associated with different, independent
dice rolls. If i = j the situation is different. The joint pmf of Xi and Yi for any i is:

   pXi,Yi(1, 0) = 1/6,   pXi,Yi(0, 1) = 1/6,   pXi,Yi(0, 0) = 4/6.

In particular, XiYi is always equal to zero, because it is not possible for both a one and a two to show
on a single roll of the die. Thus, E[XiYi] = 0. Therefore, Cov(Xi, Yi) = E[XiYi] − E[Xi]E[Yi] =
0 − (1/6)(1/6) = −1/36. Not surprisingly, Xi and Yi are negatively correlated.
(d) Using the answer to part (c) yields

   Cov(X, Y) = Cov( Σ_{i=1}^n Xi , Σ_{j=1}^n Yj ) = Σ_{i=1}^n Σ_{j=1}^n Cov(Xi, Yj)
             = Σ_{i=1}^n Cov(Xi, Yi) + Σ_{i,j: i≠j} Cov(Xi, Yj)
             = −n/36 + 0 = −n/36.

(e) Using the definition of correlation coefficient and the answers to (b) and (d) yields:

   ρX,Y = (−n/36) / √( (5n/36)(5n/36) ) = −1/5.

Since Cov(X, Y) < 0, or, equivalently, ρX,Y < 0, X and Y are negatively correlated. This makes
sense; if X is larger than average, it means that there were more ones showing than average, which
would imply that there should be somewhat fewer twos showing than average.
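A short simulation makes the negative correlation concrete. In the sketch below (added here for illustration) the number of rolls n = 60 and the number of trials are arbitrary choices; the sample correlation should come out near −1/5 for any n.

    import numpy as np

    rng = np.random.default_rng(3)
    n, trials = 60, 200_000
    rolls = rng.integers(1, 7, size=(trials, n))   # each row: n rolls of a fair die

    x = (rolls == 1).sum(axis=1)    # number of ones in each trial
    y = (rolls == 2).sum(axis=1)    # number of twos in each trial

    print("sample Cov(X, Y):", np.cov(x, y)[0, 1], "  theory:", -n / 36)
    print("sample rho(X, Y):", np.corrcoef(x, y)[0, 1], "  theory:", -1 / 5)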
Example 4.8.7 Suppose X1, . . . , Xn are independent and identically distributed random variables,
with mean μ and variance σ². It might be that the mean and variance are unknown, and that the
distribution is not even known to be of a particular type, so maximum likelihood estimation is not
appropriate. In this case it is reasonable to estimate μ and σ² by the sample mean and sample
variance defined as follows:

   X̂ = (1/n) Σ_{k=1}^n Xk,      σ̂² = (1/(n−1)) Σ_{k=1}^n (Xk − X̂)².

Note the perhaps unexpected appearance of n − 1 in the sample variance. Of course, we should
have n ≥ 2 to estimate the variance (assuming we don't know the mean) so it is not surprising that
the formula is not defined if n = 1. An estimator is called unbiased if the mean of the estimator is
equal to the parameter that is being estimated. (a) Is the sample mean an unbiased estimator of
μ? (b) Find the mean square error, E[(μ − X̂)²], for estimation of the mean by the sample mean.
(c) Is the sample variance an unbiased estimator of σ²?
Solution: (a) By linearity of expectation,

   E[X̂] = (1/n) Σ_{k=1}^n E[Xk] = (1/n) Σ_{k=1}^n μ = μ,

so X̂ is an unbiased estimator of μ.
(b) The mean square error for estimating μ by X̂ is given by

   E[(μ − X̂)²] = Var(X̂) = (1/n²) Var( Σ_{k=1}^n Xk ) = (1/n²) Σ_{k=1}^n Var(Xk) = (1/n²) Σ_{k=1}^n σ² = σ²/n.
(c) First,

   E[σ̂²] = (1/(n−1)) Σ_{k=1}^n E[(Xk − X̂)²] = (n/(n−1)) E[(X1 − X̂)²],

because, by symmetry, E[(Xk − X̂)²] = E[(X1 − X̂)²] for all k. Now, E[X1 − X̂] = μ − μ = 0, so

   E[(X1 − X̂)²] = Var(X1 − X̂)
                = Var( ((n−1)/n) X1 − (1/n) Σ_{k=2}^n Xk )
                = ((n−1)/n)² σ² + Σ_{k=2}^n (1/n²) σ²
                = ((n−1)/n) σ².

Therefore,

   E[σ̂²] = (n/(n−1)) · ((n−1)/n) σ² = σ²,

so σ̂² is an unbiased estimator of σ².
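The effect of the divisor n − 1 versus n is easy to see numerically. The sketch below is illustrative only: the exponential distribution, the value n = 5, and the number of trials are arbitrary choices; any distribution with known variance would do.

    import numpy as np

    rng = np.random.default_rng(4)
    n, trials = 5, 200_000
    data = rng.exponential(scale=1.0, size=(trials, n))   # true variance is 1

    xbar = data.mean(axis=1, keepdims=True)
    ss = ((data - xbar) ** 2).sum(axis=1)

    print("mean of sample variance with divisor n-1:", (ss / (n - 1)).mean())  # about 1 (unbiased)
    print("mean of sample variance with divisor n  :", (ss / n).mean())        # about (n-1)/n = 0.8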
Example 4.8.8 (A portfolio selection problem) Suppose you are an investment fund manager
with three financial instruments to invest your funds in for a one year period. Assume that, based
on past performance, the returns on investment for the instruments have the following means and
standard deviations.

   Instrument        mean return   standard deviation
   Stock fund (S)    μS = 1.1      σS = 0.15
   Bond fund (B)     μB = 1.0      σB = 0.15
   T-bills (T)       μT = 1.02     σT ≈ 0

(So T ≈ 1.02 deterministically.) Also assume the correlation coefficient between the stocks and bonds
is ρS,B = −0.8. Some fraction of the funds is to be invested in stocks, some fraction in bonds, and the
rest in T-bills, and at the end of the year the return per unit of funds is R. There is no single optimal
choice of what values to use for these fractions; there is a tradeoff between the mean, μR (larger is
better), and the standard deviation, σR (smaller is better). Plot your answers to the parts below using
a horizontal axis for mean return ranging from 1.0 to 1.1, and a vertical axis for standard deviation
ranging from 0 to 0.15. Label the points PS = (1.1, 0.15), PB = (1, 0.15) and PT = (1.02, 0) on the
plot corresponding to the three possibilities for putting all the funds into one instrument.
(a) Let Rα = αS + (1 − α)T, so Rα is the random return resulting when a fraction α of the funds
is put into stocks and a fraction 1 − α is put into T-bills. Determine and plot the set of (μRα, σRα)
pairs as α ranges from zero to one.
(b) Let Rα = αS + (1 − α)B, so Rα is the random return resulting when a fraction α of the
funds is put into stocks and a fraction 1 − α is put into bonds. Determine and plot the set of
(μRα, σRα) pairs as α ranges from zero to one. (Hint: Use the fact ρS,B = −0.8.)
(c) Combining parts (a) and (b), let Rα,β = βRα + (1 − β)T, so Rα,β is the random return
resulting when a fraction 1 − β of the funds is invested in T-bills as in part (a), and a fraction β of
the funds is invested in the same mixture of stock and bond funds considered in part (b). For each
α ∈ {0.2, 0.4, 0.6, 0.8}, determine and plot the set of (μRα,β, σRα,β) pairs as β ranges from zero to
one. (Hint: You may express your answers in terms of (μRα, σRα) found in part (b).)
(d) As α and β both vary over the interval [0, 1], the corresponding point (μRα,β, σRα,β) sweeps
out the set of achievable (mean, standard deviation) pairs. An achievable pair (μ̃R, σ̃R) is said to
be strictly better than an achievable pair (μR, σR) if either (μR < μ̃R and σR ≥ σ̃R) or (μR ≤ μ̃R
and σR > σ̃R). An achievable pair (μ̃R, σ̃R) is undominated if there is no other achievable pair
strictly better than it. Identify the set of undominated achievable pairs.
Solution: (a) By linearity of expectation, μRα = αμS + (1 − α)μT. Since T ≈ 1.02 is (essentially)
deterministic, σ²Rα = Var(αS) = α²σ²S, and so σRα = ασS. As α ranges from 0 to 1, the point
(μRα, σRα) sweeps out the line segment from PT to PS. See Figure 4.22.
Figure 4.22: Mean and standard deviation of returns for various portfolios. The horizontal axis is
the mean return (from 1.0 to 1.1) and the vertical axis is the standard deviation of the return. The
points PS, PB, and PT are labeled; curve (a) shows the S+T mixtures, curve (b) the S+B mixtures,
and the dashed lines (d) the S+B+T mixtures, with an arrow indicating the directions of better
performance.
(b) By linearity of expectation, μRα = αμS + (1 − α)μB. By the fact Cov(S, B) = σS σB ρS,B, the
variance of Rα is

   σ²Rα = α²σ²S + (1 − α)²σ²B + 2α(1 − α)σS σB ρS,B.

As α ranges from 0 to 1, the point (μRα, σRα) sweeps out the curve labeled S+B in Figure 4.22,
running from PB to PS.
(c) By linearity of expectation, μRα,β = βμRα + (1 − β)(1.02), and, since T is deterministic,
Var(Rα,β) = β²Var(Rα). Thus σRα,β = βσRα. As β ranges from 0 to 1, the point (μRα,β, σRα,β) sweeps
out the line segment from PT to the point found in part (b) corresponding to α. Those lines for
α ∈ {0.2, 0.4, 0.6, 0.8} are shown as dashed lines in Figure 4.22.
(d) By part (c), the set of all achievable points is the union of all line segments from PT to
points on the curve found in part (b). Consider the line through PT that is tangent to the curve
found in part (b). The set of undominated points consists of the segment of that line from PT up
to the point of tangency, and then continues along the curve found in part (b) up to the point PS.
See Figure 4.22. Thus, the optimal (and only) portfolio with zero standard deviation is the one
investing all funds in T-bills. For small, positive standard deviation, the optimal portfolio uses a
mixture of all three instruments. For larger standard deviation, less than 0.15, the optimal
portfolio uses a mixture of stock and bond funds only. Finally, the largest mean return, 1.1, is
uniquely achieved by investing only in the stock fund, for which the standard deviation is 0.15.
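The curves in Figure 4.22 can be recomputed directly from the formulas in parts (a)-(c). The sketch below (added here for illustration) prints a few (mean, standard deviation) pairs, using the parameter values stated above, including the correlation ρS,B = −0.8 as reconstructed in this version of the text.

    import numpy as np

    mu_s, mu_b, mu_t = 1.10, 1.00, 1.02
    sig_s, sig_b = 0.15, 0.15
    rho_sb = -0.8                              # correlation between stock and bond returns (as used above)
    cov_sb = rho_sb * sig_s * sig_b

    for alpha in np.linspace(0, 1, 6):
        # R_alpha = alpha*S + (1-alpha)*B   (part (b))
        mu = alpha * mu_s + (1 - alpha) * mu_b
        var = alpha**2 * sig_s**2 + (1 - alpha)**2 * sig_b**2 + 2 * alpha * (1 - alpha) * cov_sb
        sd = np.sqrt(var)
        # mixing a fraction beta of R_alpha with T-bills scales the std deviation by beta (part (c))
        beta = 0.5
        print(f"alpha={alpha:.1f}: S+B mix ({mu:.3f}, {sd:.3f});  "
              f"with beta={beta}: ({beta*mu + (1-beta)*mu_t:.3f}, {beta*sd:.3f})")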
4.9

4.9.1

Let Y be a random variable with some known distribution. Suppose Y is not observed but that
we wish to estimate Y. If we use a constant δ to estimate Y, the estimation error will be Y − δ.
The mean square error (MSE) for estimating Y by δ is defined by E[(Y − δ)²]. By LOTUS, if Y is
a continuous-type random variable,

   MSE (for estimation of Y by a constant δ) = ∫_{−∞}^{∞} (y − δ)² fY(y) dy.   (4.28)

From this expression it is easy to see that the mean square error is minimized with respect to δ if
and only if δ = E[Y], and the minimum possible value is Var(Y).
4.9.2
Unconstrained estimators

Suppose instead that we wish to estimate Y based on an observation X. If we use the estimator
g(X) for some function g, the resulting mean square error (MSE) is E[(Y − g(X))²]. We want to
find g to minimize the MSE. The resulting estimator g*(X) is called the unconstrained optimal
estimator of Y based on X because no constraints are placed on the function g.

Suppose you observe X = 10. What do you know about Y? Well, if you know the joint pdf
of X and Y, you also know or can derive the conditional pdf of Y given X = 10, denoted by
fY|X(v|10). Based on the fact, discussed above, that the minimum MSE constant estimator for a
random variable is its mean, it makes sense to estimate Y by the conditional mean:

   E[Y | X = 10] = ∫_{−∞}^{∞} v fY|X(v|10) dv.

The resulting conditional MSE is the variance of Y, computed using the conditional distribution
of Y given X = 10:

   E[(Y − E[Y|X = 10])² | X = 10] = E[Y²|X = 10] − (E[Y|X = 10])².
Conditional expectation indeed gives the optimal estimator, as we show now. Recall that
fX,Y(u, v) = fX(u) fY|X(v|u). So

   MSE = E[(Y − g(X))²] = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} (v − g(u))² fY|X(v|u) dv ) fX(u) du.   (4.29)

For each u fixed, the integral in parentheses in (4.29) has the same form as the integral (4.28).
Therefore, for each u, the integral in parentheses in (4.29) is minimized by using g(u) = g*(u),
where

   g*(u) = E[Y | X = u] = ∫_{−∞}^{∞} v fY|X(v|u) dv.   (4.30)

The resulting minimum MSE is

   MSE = ∫ ( ∫ (v − g*(u))² fY|X(v|u) dv ) fX(u) du                     (4.31)
       =(a) ∫ ( ∫ v² fY|X(v|u) dv − (g*(u))² ) fX(u) du                 (4.32)
       = E[Y²] − E[(g*(X))²],                                           (4.33)

where the equality (a) follows from the shortcut Var(Y) = E[Y²] − E[Y]², applied using the
conditional distribution of Y given X = u. In summary, the minimum MSE unconstrained estimator
of Y given X is E[Y|X] = g*(X) where g*(u) = E[Y|X = u], and expressions for the MSE are
given by (4.31)-(4.33).
4.9.3
Linear estimators

In practice it is not always possible to compute g*(u). Either the integral in (4.30) may not have a
closed form solution, or the conditional density fY|X(v|u) may not be available or might be difficult
to compute. The problems might be more than computational. There might not even be a good
way to decide what joint pdf fX,Y to use in the first place. A reasonable alternative to using g* is
to consider linear estimators of Y given X. A linear estimator has the form L(X) = aX + b, and
to specify L we only need to find the two constants a and b, rather than finding a whole function
g*. The MSE for the linear estimator aX + b is

   MSE = E[(Y − (aX + b))²].
Next we identify the linear estimator that minimizes the MSE. One approach is to multiply out
(Y − (aX + b))², take the expectation, and set the derivative with respect to a equal to zero and
the derivative with respect to b equal to zero. That would yield two equations for the unknowns a
and b. We will take a slightly different approach, first finding the optimal value of b as a function
of a, substituting that in, and then minimizing over a. The MSE can be written as follows:

   E[((Y − aX) − b)²].

Therefore, we see that for a given value of a, the constant b should be the minimum MSE constant
estimator of Y − aX, which is given by b = E[Y − aX] = μY − aμX. Therefore, the optimal linear
estimator has the form aX + μY − aμX or, equivalently, μY + a(X − μX), and the corresponding
MSE is given by

   MSE = E[(Y − μY − a(X − μX))²]
       = Var(Y − aX)
       = Cov(Y − aX, Y − aX)
       = Var(Y) − 2aCov(Y, X) + a²Var(X).                                (4.34)

It remains to find the constant a. The MSE is quadratic in a, so taking the derivative with respect
to a and setting it equal to zero yields that the optimal choice of a is a* = Cov(Y, X)/Var(X). Therefore,
the minimum MSE linear estimator is given by L*(X) = Ê[Y|X], where

   Ê[Y|X] = μY + (Cov(Y, X)/Var(X)) (X − μX)                            (4.35)
          = μY + σY ρX,Y (X − μX)/σX.                                    (4.36)

Setting a in (4.34) to a* gives the following expression for the minimum possible MSE:

   minimum MSE for linear estimation = σ²Y − (Cov(X, Y))²/Var(X) = σ²Y (1 − ρ²X,Y).   (4.37)
Since Var( Ê[Y|X] ) = (σY ρX,Y/σX)² Var(X) = σ²Y ρ²X,Y, the following alternative expressions for the
minimum MSE hold:

   minimum MSE for linear estimation = σ²Y − Var(Ê[Y|X]) = E[Y²] − E[Ê[Y|X]²].   (4.38)

In summary, the minimum mean square error linear estimator is given by (4.35) or (4.36), and
the resulting minimum mean square error is given by (4.37) or (4.38). The minimum MSE linear
estimator Ê[Y|X] is called the wide sense conditional expectation of Y given X. In analogy with
regular conditional expectation, we write

   L*(u) = Ê[Y | X = u] = μY + (Cov(Y, X)/Var(X)) (u − μX),

so that Ê[Y|X] is equal to the linear function Ê[Y|X = u] applied to the random variable X. The
notations L*(X) and Ê[Y|X] are both used for the minimum MSE linear estimator of Y given X.
Combining the above, the three MSEs are ordered as follows:

   MSE for g*(X) = E[Y|X]  ≤  MSE for L*(X) = Ê[Y|X]  ≤  MSE for δ* = E[Y], which equals σ²Y.   (4.39)

Note that all three estimators are linear as functions of the variable to be estimated:

   E[aY + bZ + c] = aE[Y] + bE[Z] + c
   E[aY + bZ + c | X] = aE[Y|X] + bE[Z|X] + c
   Ê[aY + bZ + c | X] = aÊ[Y|X] + bÊ[Z|X] + c.
Example 4.9.1 Let X = Y + N, where Y has the exponential distribution with parameter λ, and
N is Gaussian with mean 0 and variance σ²N. Suppose the variables Y and N are independent, and
the parameters λ and σ²N are known and strictly positive. (Recall that E[Y] = 1/λ and Var(Y) =
σ²Y = 1/λ².)
(a) Find Ê[Y|X], the minimum MSE linear estimator of Y given X, and also find the resulting MSE.
(b) Find an unconstrained estimator of Y yielding a strictly smaller MSE than Ê[Y|X] does.

Solution: (a) Since Y and N are independent, Cov(Y, X) = Cov(Y, Y + N) = Var(Y) = 1/λ²,
Var(X) = Var(Y) + Var(N) = 1/λ² + σ²N, and μX = μY = 1/λ. So, by (4.35),

   Ê[Y|X] = 1/λ + ( (1/λ²)/(1/λ² + σ²N) ) (X − 1/λ) = λσ²N/(1 + λ²σ²N) + X/(1 + λ²σ²N),

and by (4.37),

   MSE for Ê[Y|X] = σ²Y − σ⁴Y/(σ²Y + σ²N) = σ²Y σ²N/(σ²Y + σ²N) = σ²N/(1 + λ²σ²N).

(b) Although Y is always nonnegative, the estimator L*(X) can be negative. An estimator with
smaller MSE is Ŷ = max{0, Ê[Y|X]}, because (Y − Ŷ)² ≤ (Y − Ê[Y|X])² with probability one,
and the inequality is strict whenever Ê[Y|X] < 0.
Example 4.9.2 Suppose (X, Y) is uniformly distributed over the triangular region with vertices at
(−1, 0), (0, 1), and (1, 1), shown in Figure 4.23. (a) Find and sketch the minimum MSE estimator
E[Y|X = u], and find its MSE. (b) Find and sketch the minimum MSE linear estimator Ê[Y|X = u],
and find its MSE.

Figure 4.23: The support of fX,Y.

Solution: The joint pdf is fX,Y(u, v) = 2 on the triangular support (which has area 1/2). The pdf of
X is

   fX(u) = ∫_{(1+u)/2}^{1+u} 2 dv = 1 + u,   −1 < u ≤ 0,
   fX(u) = ∫_{(1+u)/2}^{1} 2 dv = 1 − u,     0 ≤ u ≤ 1,
   fX(u) = 0,                                 else,
so the graph of fX is the triangle shown in Figure 4.24(a). This is intuitively clear because the
lengths of the vertical cross-sections of the support of fX,Y increase linearly from 0 over the interval
−1 < u ≤ 0, and then decrease linearly back down to zero over the interval 0 ≤ u ≤ 1.

Figure 4.24: (a) The pdf of X. (b) The functions used for the minimum MSE unconstrained
estimator and linear estimator.

Since fX(u) > 0 only for −1 < u < 1, the conditional density fY|X(v|u) is defined only for u in that
interval. The cases −1 < u ≤ 0 and 0 < u < 1 will be considered separately, because fX has a
different form over those two intervals. The conditional pdf of Y given X = u is

   for −1 < u ≤ 0:  fY|X(v|u) = 2/(1 + u) for (1+u)/2 ≤ v ≤ 1 + u, and 0 else
                    (uniform on [(1+u)/2, 1 + u]),
   for 0 ≤ u < 1:   fY|X(v|u) = 2/(1 − u) for (1+u)/2 ≤ v ≤ 1, and 0 else
                    (uniform on [(1+u)/2, 1]).

As indicated, the distribution of Y given X = u, for u fixed, is the uniform distribution over an interval
depending on u, as long as −1 < u < 1. The mean of a uniform distribution over an interval
is the midpoint of the interval, and thus:

   g*(u) = E[Y|X = u] = 3(1 + u)/4,   −1 < u ≤ 0,
   g*(u) = E[Y|X = u] = (3 + u)/4,    0 < u < 1,
   g*(u) undefined else.
This estimator is shown in Figure 4.24(b). Note that this estimator could have been drawn by
inspection. For each value of u in the interval −1 < u < 1, E[Y|X = u] is just the center of mass
of the cross section of the support of fX,Y along the vertical line determined by u. That is true
whenever (X, Y) is uniformly distributed over some region. The mean square error given X = u
is the variance of a uniform distribution on an interval of length (1 − |u|)/2, which is (1 − |u|)²/48.
Averaging over u yields

   MSE for g*(X) = ∫_{−1}^{0} (1 + u) (1 − |u|)²/48 du + ∫_{0}^{1} (1 − u) (1 − |u|)²/48 du
                 = 2 ∫_0^1 (1 − u)³/48 du = 1/96.
To find the minimum MSE linear estimator, we need the relevant moments. Integrating over the
support using horizontal slices (for 0 ≤ v ≤ 1, u ranges over [v − 1, 2v − 1], an interval of length v):

   E[Y] = ∫_0^1 ∫_{v−1}^{2v−1} 2v du dv = ∫_0^1 2v² dv = 2/3,

   E[XY] = ∫_0^1 ∫_{v−1}^{2v−1} 2uv du dv = ∫_0^1 v[(2v−1)² − (v−1)²] dv = ∫_0^1 (3v³ − 2v²) dv = 3/4 − 2/3 = 1/12,

   E[X²] = 2 ∫_0^1 u²(1 − u) du = 1/6.

A glance at fX shows E[X] = 0, so Var(X) = E[X²] = 1/6 and Cov(X, Y) = E[XY] = 1/12. Therefore,
by (4.35),

   Ê[Y|X = u] = L*(u) = 2/3 + ( (1/12)/(1/6) ) u = 2/3 + u/2.
This estimator is shown in Figure 4.24(b). While the exact calculation of Ê[Y|X = u] was tedious,
its graph could have been drawn approximately by inspection. It is a straight line that tries to be
close to E[Y|X = u] for all u. To find the MSE we shall use (4.38). By LOTUS,

   E[Y²] = ∫_0^1 ∫_{v−1}^{2v−1} 2v² du dv = ∫_0^1 2v³ dv = 1/2,

and E[ Ê[Y|X]² ] = E[(2/3 + X/2)²] = 4/9 + (2/3)E[X] + E[X²]/4 = 4/9 + 1/24 = 35/72. Thus,

   MSE for Ê[Y|X] = 1/2 − 35/72 = 1/72.

Note that (MSE using E[Y|X]) = 1/96 ≤ (MSE using Ê[Y|X]) = 1/72 ≤ Var(Y) = 1/18, so the three
MSEs are ordered in accordance with (4.39).
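The three MSE values 1/96, 1/72, and 1/18 can be checked by simulation. The sketch below (added here for illustration; the sample size is an arbitrary choice) samples the triangle by rejection from the bounding box [−1, 1] × [0, 1] and evaluates the squared errors of the three estimators.

    import numpy as np

    rng = np.random.default_rng(5)
    n = 4 * 10**6
    u = rng.uniform(-1, 1, size=n)
    v = rng.uniform(0, 1, size=n)
    keep = (v >= (u + 1) / 2) & (v <= u + 1)    # inside the triangle with vertices (-1,0), (0,1), (1,1)
    x, y = u[keep], v[keep]

    g_star = np.where(x <= 0, 3 * (1 + x) / 4, (3 + x) / 4)   # E[Y | X = u]
    l_star = 2 / 3 + x / 2                                     # best linear estimator
    const = 2 / 3                                              # best constant estimator, E[Y]

    print("MSE g*   :", ((y - g_star) ** 2).mean(), "  theory:", 1 / 96)
    print("MSE L*   :", ((y - l_star) ** 2).mean(), "  theory:", 1 / 72)
    print("MSE const:", ((y - const) ** 2).mean(), "  theory:", 1 / 18)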
4.10
The law of large numbers, in practical applications, has to do with approximating sums of random
variables by a constant. The Gaussian approximation, backed by the central limit theorem, in
practical applications, has to do with a more refined approximation: approximating sums of random
variables by a single Gaussian random variable.
4.10.1

There are many forms of the law of large numbers (LLN). The law of large numbers is about the
sample average of n random variables: Sn/n, where Sn = X1 + . . . + Xn. The random variables have
the same mean, μ. The random variables are assumed to be independent, or weakly dependent, and
some condition is placed on the sizes of the individual random variables. The conclusion is that as
n → ∞, Sn/n converges in some sense to the mean, μ. The following version of the LLN has a simple
proof.

Proposition 4.10.1 (Law of large numbers) Suppose X1, X2, . . . is a sequence of uncorrelated
random variables such that each Xk has finite mean μ and variance less than or equal to C. Then
for any ε > 0,

   P{ |Sn/n − μ| ≥ ε } ≤ C/(nε²) → 0   as n → ∞.
Proof. The mean of Sn/n is given by

   E[Sn/n] = E[ (Σ_{k=1}^n Xk)/n ] = (Σ_{k=1}^n E[Xk])/n = nμ/n = μ.

The variance of Sn/n satisfies

   Var(Sn/n) = (1/n²) Σ_{k=1}^n Var(Xk) ≤ nC/n² = C/n.

Therefore, the proposition follows from the Chebychev inequality, (2.12), applied to the random
variable Sn/n.
The law of large numbers is illustrated in Figure 4.25, which was made using a random number
generator on a computer. For each n ≥ 1, Sn is the sum of the first n terms of a sequence of
independent random variables, each uniformly distributed on the interval [0, 1]. Figure 4.25(a)
illustrates the statement of the LLN, indicating convergence of the averages, Sn/n, towards the mean,
0.5, of the individual uniform random variables. The same sequence Sn is shown in Figure 4.25(b),
except the Sn's are not divided by n. The sequence of partial sums Sn converges to +∞. The LLN
tells us that the asymptotic slope is equal to 0.5. The sequence Sn is not expected to get closer to
n/2 as n increases; it is just expected to have the same asymptotic slope. In fact, the central limit
theorem, given in the next section, implies that for large n, the difference Sn − n/2 has approximately
the Gaussian distribution with mean zero, variance n/12, and standard deviation √(n/12).
Figure 4.25: Results of simulation of sums of independent random variables, uniformly distributed
on [0, 1]. (a) Sn/n vs. n. (b) Sn vs. n.
Example 4.10.2 Suppose a fair die is rolled 1000 times. What is a rough approximation to the
sum of the numbers showing, based on the law of large numbers?
Solution: The sum is S1000 = X1 + X2 + ⋯ + X1000, where Xk denotes the number showing on
the kth roll. Since E[Xk] = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5, the expected value of S1000 is
(1000)(3.5) = 3500.
By the law of large numbers, we expect the sum to be near 3500.
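A small simulation in the spirit of Figure 4.25 illustrates this; the seed and the values of n reported below are arbitrary choices made for illustration.

    import numpy as np

    rng = np.random.default_rng(6)
    rolls = rng.integers(1, 7, size=1000)      # 1000 rolls of a fair die
    s = rolls.cumsum()                         # partial sums S_n
    for n in (10, 100, 1000):
        print(f"n={n:4d}  S_n={s[n-1]:5d}  S_n/n={s[n-1]/n:.3f}")
    print("final sum:", s[-1], " (the law of large numbers suggests about 3500)")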
Example 4.10.3 Suppose X1, . . . , X100 are random variables, each with mean μ = 5 and variance
σ² = 1. Suppose also that |Cov(Xi, Xj)| ≤ 0.1 if |i − j| = 1, and Cov(Xi, Xj) = 0 if |i − j| ≥ 2. Let
S100 = X1 + ⋯ + X100.
(a) Show that Var(S100) ≤ 120.
(b) Use part (a) and Chebychev's inequality to find an upper bound on P(|S100/100 − 5| ≥ 0.5).

Solution: (a) Expanding Var(S100) = Cov(S100, S100) as a double sum and dropping the terms that
are zero yields

   Var(S100) = Σ_{i=1}^{100} Cov(Xi, Xi) + Σ_{i=1}^{99} Cov(Xi, Xi+1) + Σ_{i=2}^{100} Cov(Xi, Xi−1)
             ≤ 100(1) + 99(0.1) + 99(0.1) = 119.8 ≤ 120.

(b) The mean of S100/100 is μ = 5, and Var(S100/100) = Var(S100)/100² ≤ 120/10⁴ = 0.012. Thus, by
Chebychev's inequality,

   P( |S100/100 − 5| ≥ 0.5 ) ≤ 0.012/(0.5)² = 0.048.

4.10.2
The following version of the central limit theorem (CLT) generalizes the DeMoivre-Laplace limit
theorem, discussed in Section 3.6.3. While the DeMoivre-Laplace limit theorem pertains to sums of
independent, identically distributed Bernoulli random variables, the version here pertains to sums
of independent, identically distributed random variables with any distribution, so long as their
mean and variance are finite.

Theorem 4.10.4 (Central limit theorem) Suppose X1, X2, . . . are independent, identically distributed
random variables, each with mean μ and variance σ². Let Sn = X1 + ⋯ + Xn. Then:

   lim_{n→∞} P( (Sn − nμ)/√(nσ²) ≤ c ) = Φ(c).

The practical implication of the central limit theorem is the same as that of the DeMoivre-Laplace
limit theorem. That is, the CLT gives evidence that the Gaussian approximation, discussed in
Section 3.6.3, is a good one, for sums of independent, identically distributed random variables.
Figure 4.26 illustrates the CLT for the case the Xk's are uniformly distributed on the interval [0, 1].
The approximation in this case is so good that some simulation programs generate Gaussian random
variables by generating six uniformly distributed random variables and adding them together to
produce one approximately Gaussian random variable.
Example 4.10.5 Let S denote the sum of the numbers showing in 1000 rolls of a fair die. By
the law of large numbers, S is close to 3500 with high probability. Find the number L so that
P{|S − 3500| ≤ L} ≈ 0.9. To be definite, use the continuity correction.
Figure 4.26: Comparison of the pdf of Sn to the Gaussian pdf with the same mean and variance,
for sums of uniformly distributed random variables on [0, 1]. For n = 3 and n = 4, the Gaussian
pdf is the one with a slightly higher peak.
Solution: As noted in Example 2.4.10, the number showing for one roll of a die has mean μ = 3.5
and variance σ² = 2.9167. Therefore, S has mean 3500, variance 2916.7, and standard deviation
√2916.7 = 54.01. By the Gaussian approximation with the continuity correction,

   P{|S − 3500| ≤ L} = P{|S − 3500| ≤ L + 0.5}
                     = P{ |S − 3500|/54.01 ≤ (L + 0.5)/54.01 }
                     ≈ 1 − 2Q( (L + 0.5)/54.01 ).

Now Q(1.645) = 0.05, so the desired value of L solves

   (L + 0.5)/54.01 ≈ 1.645,

or L ≈ 88.34. Thus, L = 88 should give the best approximation.
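The answer can also be checked by simulation without the Gaussian approximation. The sketch below (added here for illustration; the number of experiments is an arbitrary choice) estimates P{|S − 3500| ≤ 88} directly.

    import numpy as np

    rng = np.random.default_rng(7)
    trials, hits = 100_000, 0
    for _ in range(100):                           # 100 chunks of 1000 experiments each
        s = rng.integers(1, 7, size=(1000, 1000)).sum(axis=1)   # 1000 sums of 1000 die rolls
        hits += int((np.abs(s - 3500) <= 88).sum())
    print("P{|S - 3500| <= 88} ~", hits / trials)   # close to 0.9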
Example 4.10.6 Suppose each of 100 real numbers are rounded to the nearest integer and then
added. Assume the individual roundoff errors are independent and uniformly distributed over the
interval [−0.5, 0.5]. Using the Gaussian approximation suggested by the CLT, find the approximate
probability that the absolute value of the sum of the errors is greater than 5.

Solution: The mean of each roundoff error is zero and the variance is ∫_{−0.5}^{0.5} u² du = 1/12. Thus,
E[S] = 0 and Var(S) = 100/12 = 8.333. Thus,

   P{|S| ≥ 5} = P{ |S|/√8.333 ≥ 5/√8.333 } ≈ 2Q(5/√8.333) = 2Q(1.732) = 2(1 − Φ(1.732)) = 0.083.
Example 4.10.7 Suppose each day of the year, the value of a particular stock: increases by one
percent with probability 0.5, remains the same with probability 0.4, and decreases by one percent
with probability 0.1. Changes on different days are independent. Consider the value after one year
(365 days), beginning with one unit of stock. Find (a) the probability the stock at least triples
in value, (b) the probability the stock at least quadruples in value, (c) the median value after one
year.
Solution: The value, Y, after one year, is given by Y = D1 D2 ⋯ D365, where Dk for each k is the
growth factor for day k, with pmf pD(1.01) = 0.5, pD(1) = 0.4, pD(0.99) = 0.1. The CLT pertains
to sums of random variables, so we will apply the Gaussian approximation to ln(Y), which is a sum
of a large number of random variables:

   ln Y = Σ_{k=1}^{365} ln Dk.

Thus, ln Y is approximately Gaussian with mean 365 E[ln D1] = 1.450 and standard deviation
√(365 Var(ln D1)) = 0.127. Therefore,

   P{Y ≥ c} = P{ln(Y) ≥ ln(c)}
            = P{ (ln(Y) − 1.450)/0.127 ≥ (ln(c) − 1.450)/0.127 }
            ≈ Q( (ln(c) − 1.450)/0.127 ).

In particular:
(a) P{Y ≥ 3} ≈ Q(−2.77) ≈ 0.997.
(b) P{Y ≥ 4} ≈ Q(−0.4965) = 0.69.
(c) The median is the value c such that P{Y ≥ c} = 0.5, which by the Gaussian approximation is
e^{1.450} = 4.26. (This is the same result one would get by the following argument, based on the law
of large numbers. We expect, during the year, the stock to increase by one percent about 365(0.5)
times, and to decrease by one percent about 365(0.1) times. That leads to a year end value of
(1.01)^{182.5} (0.99)^{36.5} = 4.26.)
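Simulating the stock directly gives a check on the Gaussian approximation. The sketch below (added here for illustration; the number of trials is an arbitrary choice) draws the multinomial counts of up, flat, and down days for each simulated year.

    import numpy as np

    rng = np.random.default_rng(8)
    trials = 500_000
    counts = rng.multinomial(365, [0.5, 0.4, 0.1], size=trials)   # (# up days, # flat days, # down days)
    y = 1.01 ** counts[:, 0] * 0.99 ** counts[:, 2]               # value after one year

    print("P{Y >= 3}  :", (y >= 3).mean())     # about 0.997
    print("P{Y >= 4}  :", (y >= 4).mean())     # about 0.69
    print("median of Y:", np.median(y))        # about 4.26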
4.11

Recall that Gaussian distributions often arise in practice; this fact is explained by the CLT. The
CLT can be extended to two, or even more, correlated random variables. For example, suppose
(U1, V1), (U2, V2), . . . are independent, identically distributed pairs of random variables. For example,
Ui might be the height, and Vi the weight, of the ith student to enroll at a university. Suppose
for convenience that they have mean zero. Then as n → ∞, the pair ( (U1 + ⋯ + Un)/√n , (V1 + ⋯ + Vn)/√n )
has a limiting bivariate distribution, where bivariate means the limit distribution is a joint distribution
of two random variables. Suppose X and Y have such a limit distribution. Then X and Y must
each be Gaussian random variables, by the CLT. But also for any constants a and b, aX + bY
must be Gaussian, because aX + bY has the limiting distribution of ((aU1 + bV1) + ⋯ + (aUn + bVn))/√n,
which is Gaussian by the CLT. This observation motivates the following definition:

Definition 4.11.1 Random variables X and Y are said to be jointly Gaussian if every linear
combination aX + bY is a Gaussian random variable. (For the purposes of this definition, a
constant is considered to be a Gaussian random variable with variance zero.)
Being jointly Gaussian includes the case that X and Y are each Gaussian and linearly related:
X = aY + b for some a, b or Y = aX + b for some a, b. In these cases, X and Y do not have a
joint pdf. Aside from those two degenerate cases, a pair of jointly Gaussian random variables has
a bivariate normal (or Gaussian) pdf, given by

   fX,Y(u, v) = ( 1/(2πσX σY √(1 − ρ²)) )
                exp( − [ ((u − μX)/σX)² + ((v − μY)/σY)² − 2ρ ((u − μX)/σX)((v − μY)/σY) ] / (2(1 − ρ²)) ),   (4.40)

where the five parameters μX, μY, σX, σY, ρ satisfy σX > 0, σY > 0 and −1 < ρ < 1. As shown
below, μX and μY are the means of X and Y, respectively, σX and σY are the standard deviations,
respectively, and ρ is the correlation coefficient. We shall describe some properties of the bivariate
normal pdf in the remainder of this section.

There is a simple way to recognize whether a pdf is a bivariate normal. Namely, such a pdf has
the form: fX,Y(u, v) = C exp(−P(u, v)), where P is a second order polynomial of two variables:

   P(u, v) = au² + buv + cv² + du + ev + f.

The constant C is selected so that the pdf integrates to one. Such a constant exists if and only if
P(u, v) → +∞ as |u| + |v| → +∞, which requires a > 0, c > 0 and b² − 4ac < 0. Without loss of
generality, we can take f = 0, because it can be incorporated into the constant C. Thus, the set of
bivariate normal pdfs can be parameterized by the five parameters: a, b, c, d, e.

4.11.1
Suppose W and Z are independent, standard normal random variables. Their joint pdf is the
product of their individual pdfs:

   fW,Z(α, β) = ( e^{−α²/2}/√(2π) ) ( e^{−β²/2}/√(2π) ) = e^{−(α²+β²)/2}/(2π).

This joint pdf is called the standard bivariate normal pdf, and it is the special case of the general
bivariate normal pdf obtained by setting the means to zero, the standard deviations to one, and
the correlation coefficient to zero. The general bivariate normal pdf can be obtained from fW,Z by
a linear transformation. Specifically, if X and Y are related to W and Z by

   (X, Y)ᵀ = A (W, Z)ᵀ + (μX, μY)ᵀ,

where A is the matrix

   A = [ √(σ²X(1+ρ)/2)    √(σ²X(1−ρ)/2) ;
         √(σ²Y(1+ρ)/2)   −√(σ²Y(1−ρ)/2) ],

then fX,Y is given by (4.40), as can be shown by the method of Section 4.7.1.
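The linear construction just described can be used directly to generate jointly Gaussian pairs. The sketch below (added here for illustration) uses the illustrative parameter values from Figure 4.27 (μX = 3, μY = 4, σX = 2, σY = 1, ρ = 0.5) together with the matrix A shown above, and checks the sample moments.

    import numpy as np

    rng = np.random.default_rng(9)
    mu_x, mu_y, sig_x, sig_y, rho = 3.0, 4.0, 2.0, 1.0, 0.5

    A = np.array([[sig_x * np.sqrt((1 + rho) / 2),  sig_x * np.sqrt((1 - rho) / 2)],
                  [sig_y * np.sqrt((1 + rho) / 2), -sig_y * np.sqrt((1 - rho) / 2)]])

    wz = rng.standard_normal(size=(2, 10**6))       # independent standard normals W, Z
    xy = A @ wz + np.array([[mu_x], [mu_y]])        # X is the first row, Y the second
    x, y = xy

    print("means      :", x.mean(), y.mean())              # about 3 and 4
    print("std devs   :", x.std(), y.std())                # about 2 and 1
    print("correlation:", np.corrcoef(x, y)[0, 1])         # about 0.5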
Figure 4.27 illustrates the geometry of the bivariate normal pdf.

Figure 4.27: (a) Mesh plots of both the standard bivariate normal, and the bivariate normal with
μX = 3, μY = 4, σX = 2, σY = 1, ρ = 0.5, shown on the same axes. (b) Contour plots of the same
pdfs.

The graph of the standard bivariate normal, fW,Z, has a bell shape that is rotationally symmetric
about the origin. The level sets are thus circles centered at the origin. The peak value of fW,Z is
1/(2π). The half-peak level set is the set of (α, β) pairs such that fW,Z(α, β) is equal to half of the
peak value, or {(α, β) : α² + β² = 2 ln 2 ≈ (1.18)²}, which is a circle centered at the origin with radius
approximately 1.18. Since all the bivariate normal pdfs are obtained from fW,Z by linear scaling and
translation, the half-peak level set of a general bivariate normal pdf completely determines the pdf.
The half-peak level set for the general bivariate normal fX,Y given by (4.40) is

   { (u, v) : [ ((u − μX)/σX)² + ((v − μY)/σY)² − 2ρ ((u − μX)/σX)((v − μY)/σY) ] / (1 − ρ²) = 2 ln 2 ≈ (1.18)² },

which is an ellipse. The space of ellipses in the plane is five dimensional: two coordinates specify the
center of an ellipse, two numbers give the lengths of the major and minor axes, and a final number
gives the angle of the major axis from the horizontal. This gives another way to parameterize the
set of bivariate normal pdfs using five parameters.
4.11.2

Proposition 4.11.2 Suppose X and Y have the bivariate normal pdf with parameters μX, μY, σX, σY,
and ρ. Then
(a) X has the N(μX, σ²X) distribution, and Y has the N(μY, σ²Y) distribution.
(b) Any linear combination of the form aX + bY is a Gaussian random variable (i.e., X and Y
are jointly Gaussian).
(c) ρ is the correlation coefficient between X and Y (i.e. ρX,Y = ρ).
(d) X and Y are independent if and only if ρ = 0.
(e) For estimation of Y from X, L*(X) = g*(X). Equivalently, E[Y|X] = Ê[Y|X]. That is, the
best unconstrained estimator of Y given X is a linear function of X.
Proof. To begin, consider the case μX = μY = 0 and σX = σY = 1; the general case then follows
because X and Y are obtained from standardized versions by linear scaling and translation. In this
case the pdf (4.40) can be factored as

   fX,Y(u, v) = ( (1/√(2π)) exp(−u²/2) ) · ( (1/√(2π(1 − ρ²))) exp( −(v − ρu)²/(2(1 − ρ²)) ) ).

The first factor is a function of u alone, and is the standard normal pdf. The second factor, as a
function of v for u fixed, is a Gaussian pdf with mean ρu and variance 1 − ρ². In particular, the
integral of the second factor with respect to v is one. Therefore, the first factor is the marginal pdf
of X and the second factor is the conditional pdf of Y given X = u:

   fX(u) = (1/√(2π)) exp(−u²/2),
   fY|X(v|u) = (1/√(2π(1 − ρ²))) exp( −(v − ρu)²/(2(1 − ρ²)) ).             (4.42)

Thus, X is a standard normal random variable. By symmetry, Y is also a standard normal random
variable. This proves (a).

The class of bivariate normal pdfs is preserved under linear transformations corresponding to
multiplication of (X, Y)ᵀ by a matrix A if det A ≠ 0. Given a and b, we can select c and d so that
the matrix A = [a b; c d] has det(A) = ad − bc ≠ 0. Then the random vector A(X, Y)ᵀ has a bivariate
normal pdf, so by part (a) already proven, both of its coordinates are Gaussian random variables.
In particular, its first coordinate, aX + bY, is a Gaussian random variable. This proves (b).

By (4.42), given X = u, the conditional distribution of Y is Gaussian with mean ρu and variance
1 − ρ². Therefore, g*(u) = E[Y|X = u] = ρu. Since X and Y are both standard (i.e. they have
mean zero and variance one), ρX,Y = E[XY], so

   ρX,Y = E[XY] = ∫∫ uv fX(u) fY|X(v|u) dv du = ∫ u fX(u) ( ∫ v fY|X(v|u) dv ) du = ∫ ρ u² fX(u) du = ρ.

This proves (c). For (d), if ρ = 0 then by (4.42) the conditional pdf fY|X(v|u) does not depend on u,
so fX,Y(u, v) = fX(u)fY(v) and X and Y are independent; conversely, if X and Y are independent
they are uncorrelated, so ρ = ρX,Y = 0. Finally, for (e), g*(u) = E[Y|X = u] = ρu is a linear function
of u, so the best unconstrained estimator is itself linear and therefore coincides with the best linear
estimator: L*(X) = g*(X).
Example 4.11.3 Suppose X and Y are jointly Gaussian random variables with mean zero, Var(X) = 5,
Var(Y) = 2, and Cov(X, Y) = −1. Find P{X + 2Y ≥ 1}.

Solution: Let Z = X + 2Y. Then Z is a linear combination of jointly Gaussian random variables,
so Z itself is a Gaussian random variable. Also, E[Z] = E[X] + 2E[Y] = 0 and

   σ²Z = σ²X + 4Cov(X, Y) + 4σ²Y = 5 − 4 + 8 = 9.

Thus, P{X + 2Y ≥ 1} = P{Z ≥ 1} = P{Z/3 ≥ 1/3} = Q(1/3) = 1 − Φ(1/3) ≈ 0.3694.
Example 4.11.4 Let X and Y be jointly Gaussian random variables with mean zero, variance
one, and Cov(X, Y) = ρ. Find E[Y²|X], the best estimator of Y² given X. (Hint: X and Y² are
not jointly Gaussian. But you know the conditional distribution of Y given X = u and can use it
to find the conditional second moment of Y given X = u.)

Solution: Recall the fact that E[Z²] = E[Z]² + Var(Z) for a random variable Z. The idea
is to apply the fact to the conditional distribution of Y given X. Given X = u, the conditional
distribution of Y is Gaussian with mean ρu and variance 1 − ρ². Thus, E[Y²|X = u] = (ρu)² + 1 − ρ².
Therefore, E[Y²|X] = (ρX)² + 1 − ρ².
Example 4.11.5 Suppose X and Y are zero-mean unit-variance jointly Gaussian random variables
with correlation coefficient ρ = 0.5.
(a) Find Var(3X − 2Y).
(b) Find the numerical value of P{(3X − 2Y)² ≤ 28}.
(c) Find the numerical value of E[Y|X = 3].

Solution:
(a) Var(3X − 2Y) = Cov(3X − 2Y, 3X − 2Y) = 9Var(X) − 12Cov(X, Y) + 4Var(Y) = 9 − 6 + 4 = 7.
(b) The random variable 3X − 2Y has mean zero and is Gaussian, with variance 7. Therefore,

   P{(3X − 2Y)² ≤ 28} = P{ |3X − 2Y|/√7 ≤ √(28/7) } = 2(Φ(2) − 0.5) ≈ 0.9545.

(c) Since X and Y are jointly Gaussian, E[Y|X = 3] = Ê[Y|X = 3], so plugging numbers into
(4.36) yields:

   E[Y|X = 3] = μY + σY ρX,Y (3 − μX)/σX = 3ρX,Y = 1.5.
4.12
Section 4.1[video]
1. Suppose X and Y are random variables with values in [0, 1] and joint CDF that satisfies
FX,Y(u, v) = (u²v² + uv³)/2 for (u, v) ∈ [0, 1]². Find FX,Y(2, 0.5).
2. Suppose X and Y are random variables with values in [0, 1] and joint CDF that satisfies
FX,Y(u, v) = (u²v² + uv³)/2 for (u, v) ∈ [0, 1]². Find P( (X, Y) ∈ [1/3, 2/3] × [1/2, 1] ).
Section 4.2[video]
1. Find P{X = Y} if pX,Y(i, j) = 2^{−(i+j+2)} for nonnegative integers i and j.
Section 4.3[video]
1. Find the constant c so that f defined by f(u, v) = c(u² + uv + v²) I_{(u,v)∈[0,1]²} is a valid pdf.
2. Find E[X] if (X, Y) has joint pdf fX,Y(u, v) = (4/5)(1 + uv) I_{(u,v)∈[0,1]²}.
3. Consider an equilateral triangle inscribed in a circle. If a random point is uniformly distributed
over the region bounded by the circle, what is the probability the point is inside the triangle?
4. Find E[|X − Y|] assuming (X, Y) is uniformly distributed over the square region [0, 1]².
5. Suppose (X, Y) has joint pdf fX,Y(u, v) = c(1 + uv) I_{u≥0, v≥0, u+v≤1}, where the value of the
constant c is not needed to answer this question. Find E[Y | X = 0.5].
Section 4.4[video]
1. Suppose X and Y are independent random variables such that X is uniformly distributed over
the interval [0, 2] and Y is uniformly distributed over the interval [1, 3]. Find P{X ≤ Y}.
2. Suppose X and Y are independent random variables such that X has the exponential distribution
with parameter λ = 2 and Y is uniformly distributed over the interval [0, 1]. Find P{X ≤ Y}.
3. Find E[1/(X + Y)], assuming that X and Y are independent random variables, each uniformly
distributed over the interval [0, 1].
Section 4.5[video]
1. Find the maximum value of the pmf of S (i.e. max_k pS(k)) where S = X + Y, and X and Y
are independent with pX(i) = i/10 for 1 ≤ i ≤ 4 and pY(j) = j²/30 for 1 ≤ j ≤ 4.
2. Find the maximum value of the pdf of S (i.e. max_c fS(c)) where S = X + Y and (X, Y) is
uniformly distributed over {(u, v) : u² + v² ≤ 1}.
Section 4.6[video]
1. Suppose an ant is positioned at a random point that is uniformly distributed over a square
region with unit area. Let D be the minimum distance the ant would need to walk to exit
the region. Find E[D].
2. Suppose X and Y are independent random variables, with each being uniformly distributed
over the interval [0.9, 1.1]. Find P{XY ≤ 1.1}.
Section 4.7[video]
1. Let W = X/(X + Y) and Z = X + Y. Find
Section 4.8[video]
1. Find Cov(X, Y) if Var(X + Y) = 9 and Var(X − Y) = 6.
2. Find Var(X + 2Y) under the assumptions E[X] = E[Y] = 1, E[X²] = E[XY] = 5 and
E[Y²] = 10.
3. Find the maximum possible value of Cov(X, Y) subject to Var(X) = 12 and Var(Y) = 3.
4. Find Var(S30/30) where S30 = X1 + ⋯ + X30 such that the Xi's are uncorrelated, E[Xi] = 1
and Var(Xi) = 4.
5. Find Var(X1 + ⋯ + X100) where Cov(Xi, Xj) = max{3 − |i − j|, 0} for i, j ∈ {1, …, 100}.
Section 4.9[video]
1. Suppose X = Y + N where Y and N are uncorrelated with mean zero, Var(Y) = 4, and
Var(N) = 8. Find the minimum MSE for linear estimation, E[(Y − Ê[Y|X])²], and find
Ê[Y|X = 6].
2. Suppose X has pdf fX(u) = e^{−|u|}/2.
3. Suppose (Nt : t ≥ 0) is a Poisson random process with rate λ = 10. Find E[N7 | N5 = 42].
Section 4.10[video]
1. A gambler repeatedly plays a game, each time winning two units of money with probability 0.3
and losing one unit of money with probability 0.7. Let A be the event that after 1000 plays,
the gambler has at least as much money as at the beginning. Find an upper bound on P (A)
using the Chebychev inequality as in Proposition 4.10.1.
2. A gambler repeatedly plays a game, each time winning two units of money with probability 0.3
and losing one unit of money with probability 0.7. Let A be the event that after 1000 plays,
the gambler has at least as much money as at the beginning. Find the approximate value of
P (A) based on the Gaussian approximation. (To be definite, use the continuity correction.)
Section 4.11[video]
1. Suppose X and Y are jointly Gaussian with equal variances. Find a ≥ 0 so that X + aY is
independent of X − aY.
2. Suppose X and Y are jointly Gaussian with mean zero, σ²X = 3, σ²Y = 12, and ρ = 0.5. Find
P{X − 2Y ≥ 10}.
4.13
Problems
        v = 4   v = 5   v = 6
u = 0   0       0.2     0
u = 1   0.1     0       0.2
u = 2   0.1     0       0.1
u = 3   0.2     0       0.1
0
(
ln(u)v
21
0
(96)u2 v 2
4 (u2 +v 2 )
e
u, v 0
else.
0 u 1, 1 v 4
else.
u2 + v 2 1
else.
(c) Find .
(d) For what values of u is the conditional pdf fY |X (v|u) well defined for all v?
(e) Find P {X < Y }.
4.11. [Near simultaneous failure times]
Suppose S and T represent the lifetimes of two computers. Suppose the lifetimes are mutually
independent, and each has the exponential distribution with parameter λ = 1.
(a) Find P{|S − T| ≤ 1} by setting up an integral over {(u, v) : u ≥ 0, v ≥ 0, |u − v| ≤ 1}
and evaluating the integral.
(b) Use the memoryless property of the exponential distribution and the fact the computer
lifetimes have the same distribution, to explain why P{|S − T| ≤ 1} is equal to P{S ≤ 1},
or equivalently, to P{T ≤ 1}.
4.12. [The volume of a random cylinder]
Consider a cylinder with height H and radius R. Its volume V is given by V = πHR². Suppose
H and R are mutually independent, and each is uniformly distributed over the interval [0, 1].
(a) Using LOTUS, find the mean of V.
(b) What is the support of the distribution of V?
(c) Find the CDF of V.
(d) Find the pdf of V.
f_{X,Y}(u, v) = A (1 - u^2 - v^2) for u^2 + v^2 < 1, and f_{X,Y}(u, v) = 0 else.
Hint: Use of polar coordinates is useful for all parts of this problem.
(a) Find the value of A.
(b) Let Z = X^2 + Y^2. Find the pdf of the random variable Z.
(c) Find E[Z^5] using LOTUS for joint pdfs: E[g(X, Y)] = integral over R^2 of g(u, v) f_{X,Y}(u, v) du dv.
4.16. [A function of two random variables]
Two resistors are connected in series to a one volt voltage source. Suppose that the resistance
values R1 and R2 (measured in ohms) are independent random variables, each uniformly
distributed on the interval (0,1). Find the pdf fI (a) of the current I (measured in amperes)
in the circuit.
rho_{X_i, Y_j} = 3/4 if i = j, 1/4 if |i - j| = 1, and 0 else.
Let W = sum_{i=1}^n X_i and Z = sum_{i=1}^n Y_i. Express Cov(W, Z) as a function of n.
(d) Express the linear estimator found in part (c) as a linear combination of X1 and X2 .
4.27. [A singular estimation problem]
Suppose Y represents a signal, and P{Y = 1} = P{Y = -1} = 0.5. Suppose N represents
noise, and N is uniformly distributed over the interval [-1, 1]. Assume Y and N are mutually
independent, and let X = Y + N.
(a) Find g*(X), the MMSE estimator of Y given X. (Hint: Consider some typical values of
(X, Y) and think about how Y can be estimated from X.)
(b) Find L*(X), the MMSE linear estimator of Y given X.
4.28. [Simple estimation problems]
Suppose (X, Y) is uniformly distributed over the set {(u, v) : 0 <= v <= u <= 1}.
(a) Find E[Y^2 | X = u] for 0 <= u <= 1.
(b) Find E^[Y | X = u]. (Hint: There is a Y^2 in part (a) but only a Y in this part. This
problem can be done with little to no calculation.)
4.29. [An estimation problem]
Suppose X and Y have the following joint pdf:
f_{X,Y}(u, v) = 8uv/(15)^4 for u >= 0, v >= 0, u^2 + v^2 <= (15)^2, and 0 else.
(a) Find the constant estimator of Y with the smallest mean square error (MSE), and
find the MSE.
(b) Find the unconstrained estimator, g*(X), of Y based on observing X, with the smallest
MSE, and find the MSE.
(c) Find the linear estimator, L*(X), of Y based on observing X, with the smallest MSE,
and find the MSE. (Hint: You may use the fact E[XY] = 75*pi/4 = 58.904, which can be
derived using integration in polar coordinates.)
LLN, CLT, and joint Gaussian distribution Sections 4.10 & 4.11
4.30. [Law of Large Numbers and Central Limit Theorem]
A fair die is rolled n times. Let S_n = X_1 + X_2 + ... + X_n, where X_i is the number showing on
the i-th roll. Determine a condition on n so that the probability the sample average S_n/n is within
1% of the mean mu_X is greater than 0.95. (Note: This problem is related to Example 4.10.5.)
(a) Solve the problem using the form of the law of large numbers based on the Chebychev
inequality (i.e. Proposition 4.10.1 in the notes).
(b) Solve the problem using the Gaussian approximation for S_n, which is suggested by the
CLT. (Do not use the continuity correction, because, unless 3.5n +/- (0.01)n*mu_X are integers,
inserting the term +/- 0.5 is not applicable.)
4.31. [Rate of convergence in law of large numbers for independent Gaussians]
By the law of large numbers, if epsilon > 0 and S_n = X_1 + ... + X_n, where X_1, X_2, ... are
uncorrelated with mean zero and bounded variance, then lim_{n -> infinity} P{|S_n/n| >= epsilon} = 0.
(a) Express P{|S_n/n| >= epsilon} in terms of n, epsilon, and the Q function, for the special case that X_1, X_2, ...
are independent, N(0, 1) random variables.
(b) An upper bound for Q(x) for x > 0, which is also a good approximation for x at least
moderately large, is given by
Q(x) = integral_x^infinity (1/sqrt(2 pi)) e^{-u^2/2} du <= integral_x^infinity (u/x)(1/sqrt(2 pi)) e^{-u^2/2} du = (1/(x sqrt(2 pi))) e^{-x^2/2}.
Use this bound to obtain an upper bound on the probability in part (a) in terms of epsilon
and n but not using the Q function.
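The quality of the bound in part (b) can be seen numerically. This Python sketch (not from the notes) compares Q(x), computed with the standard library error function, to the claimed upper bound.

import math

def Q(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def q_upper_bound(x):
    # Bound from part (b): valid for x > 0
    return math.exp(-x * x / 2.0) / (x * math.sqrt(2.0 * math.pi))

for x in [0.5, 1.0, 2.0, 3.0, 4.0]:
    print(x, Q(x), q_upper_bound(x))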
4.32. [Marathon blackjack]
In a particular version of the card game blackjack offered by gambling casinos, if a player
uses a particular optimized strategy, then in one game with one unit of money initial bet,
the mean net return is -0.0029 and the standard deviation of the net return is 1.1418 (which can
be squared to get the variance). Suppose a player uses this strategy and bets $100 on each
game, regardless of how much the player won or lost in previous games.
(a) What is the expected net gain of the player after 1000 games? (Answer should be a
negative dollar amount.)
(b) What is the probability the player is ahead after 1000 games? (Use the Gaussian approximation suggested by the central limit theorem for this and other parts below.)
(c) What is the probability the player is ahead by at least $1000 after 1000 games?
(d) What value of n is such that after playing n games (with the same initial bet per game),
the probability the player is ahead after n games is about 0.4?
4.33. [Achieving potential in a class]
(The following is roughly based on the ECE 313 grading scheme, ignoring homework scores
and the effect of partial credit.) Consider a class in which grades are based entirely on
midterm and final exams. In all, the exams have 100 separate parts, each worth 5 points. A
student scoring at least 85%, or 425 points in total, is guaranteed an A score. Throughout
this problem, consider a particular student, who, based on performance in other courses
and amount of effort put into the class, is estimated to have a 90% chance to complete any
particular part correctly. Problem parts not completed correctly receive zero credit.
Footnote: See http://wizardofodds.com/games/blackjack/appendix/4/
where sigma^2 is the variance of X: sigma = sqrt(np(1 - p)) <= sqrt(n)/2. If n is known and p is estimated
by p^ = X/n, it follows that the confidence interval with endpoints p^ +/- a/(2 sqrt(n)) contains p with
probability at least 1 - 1/a^2. (See Section 2.9.) A less conservative, commonly used approach
is to note that by the central limit theorem,
P{|X - np| >= a sigma} ~ 2Q(a).    (4.44)
Example 2.9.2 showed that n = 625 is large enough for the random interval with endpoints
p^ +/- 10% to contain the true value p with probability at least 96%. Calculate the value of
n that would be sufficient for the same precision (i.e. within 10% of p) and confidence (i.e.
96%) based on (4.44) rather than (4.43). Explain your reasoning.
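One way to carry out the calculation asked for above is sketched below in Python (not from the notes); it assumes the worst-case bound sigma <= sqrt(n)/2 and a crude bisection inverse of Q, so it is an illustration rather than the official solution.

import math

def Q(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def Qinv(p, lo=0.0, hi=10.0):
    # crude bisection inverse of the Q function
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if Q(mid) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

a = Qinv(0.02)                          # 2Q(a) = 0.04, so a is about 2.054
n = math.ceil((a / (2 * 0.1)) ** 2)     # need a/(2*sqrt(n)) <= 0.1 with sigma <= sqrt(n)/2
print(a, n)                             # n is about 106, far smaller than 625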
4.36. [Conditional means for a joint Gaussian pdf ]
Suppose X and Y have a bivariate Gaussian joint distribution with E[X] = E[Y ] = 0 and
Var(X) = 1. (The variance of Y and the correlation coefficient are not given.) Finally, suppose
X is independent of X + Y.
(a) Find Cov(X, Y ).
(b) Find E[X|X + Y = 2].
(c) Find E[Y |X = 2].
4.37. [Transforming joint Gaussians to independent random variables]
Suppose X and Y are jointly Gaussian such that X is N(0, 9), Y is N(0, 4), and the correlation
coefficient is denoted by rho. The solutions to the questions below may depend on rho and may
fail to exist for some values of rho.
(a) For what value(s) of a is X independent of X + aY ?
(b) For what value(s) of b is X + Y independent of X bY ?
(c) For what value(s) of c is X + cY independent of X cY ?
(d) For what value(s) of d is X + dY independent of (X dY )3 ?
4.38. [Jointly Gaussian Random Variables I]
Suppose X and Y are jointly Gaussian with mu_X = 1, mu_Y = 2, sigma_X^2 = 9, sigma_Y^2 = 16, and
Cov(X, Y) = 6.
(a) Describe the marginal distribution of X in words and write the explicit formula for its
pdf, fX (u).
(b) Describe the conditional distribution of Y given X = 5 in words, and write the explicit
formula for the conditional pdf, fY |X (v|5).
(c) Find the numerical value of P (Y 2|X = 5).
(d) Find the numerical value of E[Y 2 |X = 5].
4.39. [Jointly Gaussian Random Variables II]
Suppose Y and W are jointly Gaussian random variables with E[Y ] = 2, E[W ] = 0, Var(Y ) =
16, Var(W) = 4, and rho_{Y,W} = 0.25. Let X = 3Y + W + 3.
(a) Find E[X] and Var(X).
(b) Calculate the numerical value of P {X 20}.
(a) Find E[Y |X = u] as a function of u.
(b) Describe in words the conditional distribution of Y given X = u.
Chapter 5
Wrap-up
The topics in these notes are listed in both the table of contents and the index. This chapter briefly
summarizes the material, while highlighting some of the connections among the topics.
The probability axioms allow for a mathematical basis for modeling real-world systems with
uncertainty, and allow for both discrete-type and continuous-type random variables. Counting
problems naturally arise for calculating probabilities when all outcomes are equally likely, and a
recurring idea for counting the total number of ways something can be done is to do it sequentially,
such as in, "There are n1 choices for the first ball, and for each choice of the first ball, there are
n2 ways to choose the second ball, and so on." Working with basic probabilities includes working
with Karnaugh maps, de Morgan's laws, and definitions of conditional probabilities and mutual
independence of two or more events. Our intuition can be strengthened and many calculations
made by appealing to the law of total probability and the definition of conditional probability. In
particular, we sometimes look back, conditioning on what happened at the end of some scenario,
and ask what is the conditional probability that the observation happened in a particular way
using Bayes' rule. Binomial coefficients form a link between counting and the binomial probability
distribution.
A small number of key discrete-type and continuous-type distributions arise again and again
in applications. Knowing the form of the CDFs, pdfs, or pmfs, and formulas for the means and
variances, and why each distribution arises frequently in nature and applications, can thus lead to
efficient modeling and problem solving. There are relationships among the key distributions. For
example, the binomial distribution generalizes the Bernoulli, and the Poisson distribution is the
large n, small p limit of the binomial distribution with np = lambda. The exponential distribution is
the continuous time version of the geometric distribution; both are memoryless. The exponential
distribution is the limit of scaled geometric distributions, and the Gaussian (or normal) distribution,
by the central limit theorem, is the limit of standardized sums of large numbers of independent,
identically distributed random variables.
The following important concepts apply to both discrete-type random variables and continuous-type random variables:
Independence of random variables
Marginals and conditionals
Functions of one or more random variables, the two or three step procedure to find their
distributions, and LOTUS to find their means
E[X], Var(X), E[XY], Cov(X, Y), rho_{X,Y}, sigma_X, and relationships among these.
Binary hypothesis testing (ML rule, MAP rule as likelihood ratio tests)
Maximum likelihood parameter estimation
The minimum MSE estimators of Y: the constant estimator E[Y], the linear estimator L*(X) = E^[Y|X], and the unconstrained estimator g*(X) = E[Y|X].
Markov, Chebychev, and Schwarz inequalities (The Chebychev inequality can be used for
confidence intervals; the Schwarz inequality implies correlation coefficients are between one
and minus one.)
Law of large numbers and central limit theorem
Poisson random processes arise as limits of scaled Bernoulli random processes. Discussion of
these processes together entails the Bernoulli, binomial, geometric, negative geometric, exponential,
Poisson, and Erlang distributions.
Reliability in these notes is discussed largely in discrete settings, such as the outage probability
for an s-t network. Failure rate functions for random variables are discussed for continuous-time
positive random variables only, but could be formulated for discrete time.
There are two complementary approaches for dealing with multiple random variables in statistical modeling and analysis, described briefly by the following two lists:
Distribution approach
  joint pmf or joint pdf or joint CDF
  marginals, conditionals
  independent
  E[Y|X]
Moment approach
  means, (co)variances, correlation coefficients
  uncorrelated (i.e. Cov(X, Y) = 0 or rho_{X,Y} = 0)
  E^[Y|X]
That is, on one hand, it sometimes makes sense to postulate or estimate joint distributions. On
the other hand, it sometimes makes sense to postulate or estimate joint moments, without explicitly estimating distributions. For jointly Gaussian random variables, the two approaches are
equivalent. That is, working with the moments is equivalent to working with the distributions
themselves. Independence, which in general is stronger than being uncorrelated, is equivalent to
being uncorrelated for the case of jointly Gaussian random variables.
Chapter 6
Appendix
6.1 Some notation

A^c : complement of A
A union B : union of A and B
AB (= A intersect B) : intersection of A and B
AB^c : set of elements of A not in B
I_A(x) : indicator function of A, equal to 1 if x in A and 0 else
(a, b] = {x : a < x <= b}
[a, b) = {x : a <= x < b}
[a, b] = {x : a <= x <= b}
Z : set of integers
Z_+ : set of nonnegative integers
R_+ : set of nonnegative real numbers
floor(t) : greatest integer n such that n <= t
ceil(t) : least integer n such that n >= t

The little oh notation for small numbers is sometimes used: o(h) represents a quantity such
that lim_{h -> 0} o(h)/h = 0. For example, the following equivalence holds:
lim_{h -> 0} (f(t + h) - f(t))/h = f'(t)   if and only if   f(t + h) = f(t) + f'(t)h + o(h).
6.2 Some sums
sum_{k=0}^{n} r^k = 1 + r + r^2 + ... + r^n = (1 - r^{n+1})/(1 - r)   if r != 1

sum_{k=0}^{infinity} r^k = 1 + r + r^2 + ... = 1/(1 - r)   if |r| < 1

sum_{k=0}^{infinity} r^k/k! = 1 + r + r^2/2 + r^3/6 + ... = e^r

sum_{k=0}^{n} C(n, k) a^k b^{n-k} = a^n + n a^{n-1} b + (n(n-1)/2) a^{n-2} b^2 + ... + n a b^{n-1} + b^n = (a + b)^n

sum_{k=1}^{n} k = 1 + 2 + ... + n = n(n + 1)/2

sum_{k=1}^{n} k^2 = 1 + 4 + 9 + ... + n^2 = n(n + 1)(2n + 1)/6
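The identities above are easy to spot-check numerically. The following Python sketch (not part of the notes) verifies each one for arbitrarily chosen values of n, r, a, b.

from math import comb, exp, factorial

n, r, a, b = 6, 0.7, 1.3, 2.1
assert abs(sum(r**k for k in range(n + 1)) - (1 - r**(n + 1)) / (1 - r)) < 1e-12
assert abs(sum(r**k for k in range(200)) - 1 / (1 - r)) < 1e-12          # |r| < 1
assert abs(sum(r**k / factorial(k) for k in range(60)) - exp(r)) < 1e-12
assert abs(sum(comb(n, k) * a**k * b**(n - k) for k in range(n + 1)) - (a + b)**n) < 1e-9
assert sum(range(1, n + 1)) == n * (n + 1) // 2
assert sum(k * k for k in range(1, n + 1)) == n * (n + 1) * (2 * n + 1) // 6
print("all identities check out numerically")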
6.3 Frequently used distributions

6.3.1 Key discrete-type distributions

Bernoulli(p): 0 <= p <= 1
pmf: p(i) = p if i = 1, and p(i) = 1 - p if i = 0
mean: p    variance: p(1 - p)
Example: One if heads shows and zero if tails shows for the flip of a coin. The coin is called fair if
p = 1/2 and biased otherwise.
Binomial(n, p): n >= 1, 0 <= p <= 1
pmf: p(i) = C(n, i) p^i (1 - p)^{n-i} for 0 <= i <= n
mean: np    variance: np(1 - p)
Poisson(lambda): lambda >= 0
pmf: p(i) = lambda^i e^{-lambda}/i! for i >= 0
mean: lambda    variance: lambda
Example: Number of phone calls placed during a ten second interval in a large city.
Significant property: The Poisson pmf is the limit of the binomial pmf as n -> +infinity and p -> 0 in
such a way that np -> lambda.
Geometric(p): 0 < p <= 1
pmf: p(i) = (1 - p)^{i-1} p for i >= 1
mean: 1/p    variance: (1 - p)/p^2
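The listed means and variances can be confirmed numerically. The sketch below (Python, illustrative only, with arbitrarily chosen parameter values) sums each pmf directly.

from math import comb, exp, factorial

def mean_var(pmf_pairs):
    m = sum(i * p for i, p in pmf_pairs)
    v = sum(i * i * p for i, p in pmf_pairs) - m * m
    return m, v

n, p, lam = 10, 0.3, 2.5
binom = [(i, comb(n, i) * p**i * (1 - p)**(n - i)) for i in range(n + 1)]
poisson = [(i, lam**i * exp(-lam) / factorial(i)) for i in range(200)]
geometric = [(i, (1 - p)**(i - 1) * p) for i in range(1, 2000)]

print(mean_var(binom))      # (np, np(1-p)) = (3.0, 2.1)
print(mean_var(poisson))    # (lambda, lambda) = (2.5, 2.5)
print(mean_var(geometric))  # (1/p, (1-p)/p^2) = (3.33..., 7.77...)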
6.3.2 Key continuous-type distributions

Uniform(a, b): a < b
pdf: f(u) = 1/(b - a) for a <= u <= b
mean: (a + b)/2    variance: (b - a)^2/12

Exponential(lambda): lambda > 0
pdf: f(t) = lambda e^{-lambda t} for t >= 0
mean: 1/lambda    variance: 1/lambda^2
Example: Time elapsed between noon sharp and the first time a telephone call is placed after that,
in a city, on a given day.
Significant property: memoryless in continuous time, P{T >= s + t | T >= s} = P{T >= t} for s, t >= 0.
Any nonnegative random variable with the memoryless property in continuous time is exponentially
distributed. Failure rate function is constant: h(t) = lambda.
Erlang(r, lambda): r >= 1, lambda > 0
pdf: f(t) = lambda^r t^{r-1} e^{-lambda t}/(r - 1)! for t >= 0
mean: r/lambda    variance: r/lambda^2
Significant property: The distribution of the sum of r independent random variables, each having
the exponential distribution with parameter lambda.
Gaussian or Normal(mu, sigma^2): mu in R, sigma^2 >= 0
pdf: f(u) = (1/sqrt(2 pi sigma^2)) exp(-(u - mu)^2/(2 sigma^2))
mean: mu    variance: sigma^2
Notation: Q(c) = 1 - Phi(c) = integral_c^infinity (1/sqrt(2 pi)) e^{-u^2/2} du
Significant property (CLT): For independent, identically distributed r.v.s with mean mu, variance sigma^2:
lim_{n -> infinity} P{ (X_1 + ... + X_n - n mu)/sqrt(n sigma^2) <= c } = Phi(c)
Rayleigh(sigma^2): sigma^2 > 0
pdf: f(r) = (r/sigma^2) exp(-r^2/(2 sigma^2)) for r > 0
CDF: 1 - exp(-r^2/(2 sigma^2))
mean: sigma sqrt(pi/2)    variance: (2 - pi/2) sigma^2
Example: Instantaneous value of the envelope of a mean zero, narrow band noise signal.
6.4 Normal tables
Tables 6.1 and 6.2 below were computed using Abramowitz and Stegun, Handbook of Mathematical
Functions, Formula 7.1.26, which has maximum error at most 1.5 x 10^{-7}.
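For readers working numerically, the table values can be regenerated (up to rounding) with the standard library error function. This Python sketch is a convenience only; it does not use the Abramowitz-Stegun formula mentioned above.

import math

def Phi(x):
    # area under the standard normal pdf to the left of x
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Q(x):
    # area under the standard normal pdf to the right of x
    return 0.5 * math.erfc(x / math.sqrt(2.0))

print(round(Phi(1.96), 4), round(Q(1.96), 4))   # 0.9750 0.0250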
Table 6.1: Phi function, the area under the standard normal pdf to the left of x.

x     0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
0.0  0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1  0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2  0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3  0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4  0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5  0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6  0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7  0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8  0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9  0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0  0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1  0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2  0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3  0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4  0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5  0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6  0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7  0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8  0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9  0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0  0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1  0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2  0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3  0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4  0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5  0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6  0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7  0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8  0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9  0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0  0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
Table 6.2: Q function, the area under the standard normal pdf to the right of x.

x     0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
0.0  0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641
0.1  0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
0.2  0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859
0.3  0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
0.4  0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121
0.5  0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776
0.6  0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451
0.7  0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148
0.8  0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
0.9  0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611
1.0  0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
1.1  0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
1.2  0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
1.3  0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
1.4  0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681
1.5  0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559
1.6  0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455
1.7  0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
1.8  0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
1.9  0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
2.0  0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183
2.1  0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143
2.2  0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110
2.3  0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084
2.4  0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064
2.5  0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048
2.6  0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036
2.7  0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026
2.8  0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019
2.9  0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014
3.0  0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010

Q(x) to seven decimal places, for x = 0.0 to 4.0 in steps of 0.2:

x     0.0        0.2        0.4        0.6        0.8
0.0  0.5000000  0.4207403  0.3445783  0.2742531  0.2118553
1.0  0.1586553  0.1150697  0.0807567  0.0547993  0.0359303
2.0  0.0227501  0.0139034  0.0081975  0.0046612  0.0025552
3.0  0.0013500  0.0006872  0.0003370  0.0001591  0.0000724
4.0  0.0000317  0.0000134  0.0000054  0.0000021  0.0000008
6.5 Answers to short answer questions

Section 1.2: 1. 0.32  2. 0.30
Section 1.3: 1. 5040  2. 2790  3. 6  4. 2520
Section 1.4: 1. 1/5  2. 1/20
Section 1.5: 1. 6/9
Section 2.2: 1. 16.5  2. 1  3. 9
Section 2.3: 1. 1/3  2. 2/27  3. 3
Section 2.4: 1. 0.00804  2. 9.72  3. 0.2276
Section 2.5: 1. 35  2. 6
Section 2.6: 1. 7,2,5,4  2. 2,2,2,2,3  3. 7,9,14,18
Section 2.7: 1. 2.303  2. 0.2240
Section 2.8: 1. 10  2. 1/3
Section 2.9: 1. 1/16  2. 1,582
Section 2.10: 1. 0.1705  2. 0.2031  3. 1/3
Section 2.11: 1. 0.3, 0.5  2. 1/3  3. 6
Section 2.12: 1. 10^{-8}  2. 1.2 x 10^{-5}
Section 3.1: 1. 0.73775  2. 0.9345879
Section 3.2: 1. 1/18  2. pi/4
Section 3.3: 1. 4/45  2. 1/3
Section 3.4: 1. 0.21972  2. 34.6573
Section 3.5: 1. 0.0029166  2. 41.589, 6.45  3. 0.125
Section 3.6: 1. 0.175085  2. 0.1587  3. 0.5281  4. 0.1792
Section 3.7: 1. 50  2. 4.5249
Section 3.9: 1. 1/(1 + a)  2. t I{t >= 1}
Section 3.10: 1. 0, 0.63212  2. 0.3191  3. 1/3
Section 4.1: 1. 3/16  2. 0.2708333
Section 4.2: 1. 1/3
Section 4.3: 1. 12/11  2. 8/15  3. 0.4135  4. 1/3
Section 4.4: 1. 1/8  2. 0.43233  3. 1.38629
Section 4.5: 1. 0.28  2. 0.45016
Section 4.6: 1. 1/6  2. 0.12897
Section 4.7: 1. 0.8
Section 4.8: 1. 0.75  2. 56  3. 6  4. 4  5. 892
Section 4.9: 1. 2.6666, 2  2. 12X  3. 62
Section 4.10: 1. 0.19  2. 0.01105
Section 4.11: 1. 1  2. 0.05466
2 c   5. 7/27
6.6 Solutions to even numbered problems
        B          B^c
A       0.3        a
A^c     0.3 - a    0.4

We filled in the variable a for P(AB^c), and then, since the sum of the probabilities is one, it
must be that P(A^c B) = 0.3 - a. The valid values of a are 0 <= a <= 0.3, and (P(A), P(B)) =
(0.3 + a, 0.6 - a). So, in parametric form, the set of possible values of (P(A), P(B)) is {(0.3 +
a, 0.6 - a) : 0 <= a <= 0.3}. Equivalent ways to write this set are {(u, v) : v = 0.9 - u and 0.3 <=
u <= 0.6} or {(x, 0.9 - x) : 0.3 <= x <= 0.6}. The set is also represented by the solid line segment
in the following sketch:
[Sketch: the line segment from (P(A), P(B)) = (0.3, 0.6) to (0.6, 0.3), with both axes marked at 0.3, 0.6, and 0.9.]
B
12,14,16,21,
25,41,45,52,
54,56,61,65
11,15,22,24,26,
42,44,46,51,55,
(a) 62,64,66
(b) P (AB) =
5
36
B
23,32,34,
36,43,63
13,31,33,
35,53
0.13888.
[Venn diagram sketches for Twitter (T), Facebook (F), and iPad (I) membership, with the region counts filled in at successive stages.]
Finally, since at least one student is neither on Twitter nor on Facebook, 2b + b > 0 or b > 0.
There are only 12 students total not on Facebook, so b = 1. The final diagram is shown on
the right. Two students are not on Twitter or Facebook and don't have iPads.
(3 * 6 * 11)/(5 * 17 * 49) = 0.0475
(2 * 4 * 11)/(5 * 17 * 49) = 0.0211
(c) There are 13 ways to choose the number common to four of the cards. Given that choice,
there are 12 ways to choose the number showing on the remaining card, and four ways
to choose the suit of that card. Thus,
P(FOUR OF A KIND) = (13 * 12 * 4)/C(52, 5) = (13 * 12 * 4 * 5 * 4 * 3 * 2)/(52 * 51 * 50 * 49 * 48) = 1/4165 = 0.0002401
(c) There are C(2n, n) ways to choose n objects from a set of 2n objects, half of which are orange
and half of which are blue. The number of such ways containing exactly k orange objects
and n - k blue objects is C(n, k) C(n, n - k). Summing over k yields the original C(2n, n) possibilities.
(d) Suppose the n objects are numbered 1 through n. Every set of k objects contains a
highest numbered object. The number of subsets of the n objects of size k with largest
object l is the number of subsets of the first l - 1 objects of size k - 1, equal to C(l - 1, k - 1).
Summing over l gives the identity.
5/18
3/18
(b)
1/3
(c)
4/12
6k
18 .
1/2
1/6
1/12
[Sketches of the pmfs omitted.] The pmf is p_D(k) = 2(6 - k)/36 = (6 - k)/18 for k = 1, ..., 5, and p_D(0) = 6/36 = 3/18.
E[D] = (3/18)*0 + (5/18)*1 + (4/18)*2 + (3/18)*3 + (2/18)*4 + (1/18)*5 = 35/18 = 1.944
E[D^2] = (3/18)*0^2 + (5/18)*1^2 + (4/18)*2^2 + (3/18)*3^2 + (2/18)*4^2 + (1/18)*5^2 = 105/18 = 5.8333
Var(D) = E[D^2] - E[D]^2 = 2.0525
1)/(1
2)
2n
1
2
the shoes sequentially. Regardless of what the first shoe drawn is, the chance of getting
its mate on the second draw is
perfect sense.
1
. Note that this has value 1 if n = 1, which makes
2n 1
(b) Anyoneof the n left shoes can be paired with any one of the n right shoes. So, n2 of
2n
the
choices yield one left shoe and one right shoe in the two drawn, giving that
2
P (one L, one R ) =
n2
2n =
2
n2
n
=
.
2n(2n 1)/(1 2)
2n 1
Again, more simply, regardless of what the first shoe is, the chances of getting a shoe of
n
the opposite footality on the second draw is
.
2n 1
Note that this has value 1 if n = 1, which makes perfect sense.
Suppose now that n 2 and that you choose 3 shoes at random from the bag.
(c) Any of the n pairs and any of the other 2n 2 shoes form a set of 3 shoes. Hence,
3
n(2n 2)
n(2n 2)
=
.
=
P (pair among three) =
2n
2n(2n
1)(2n
2)/(1
3)
2n
1
3
Note that this has value 1 when n = 2, which makes perfect sense.
n
n ways to choose
shoe,
and
(d) There are n2 n ways to choose two left shoes and one right
2
n
2
two right shoes and one left shoe. Adding gives 2 2 n = n (n 1) ways to choose at
n2 (n 1)
3n
least one of each. So P (one L and one R among three) =
=
.
2n
4n 2
3
Note that this has value 1 when n = 2, which makes perfect sense.
2.6. [Mean and standard deviation of two simple random variables]
(a) For 1 k 6, there are six sample points with X(i, j) = k, so pX (k) =
(b) E[X] = 1+2+3+4+5+6
= 3.5. To compute X we can first find E[X 2 ] =
6 p
and then X = E[X 2 ] E[X]2 = 1.707825.
6
36
= 16 .
(c) {Y = 1} = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 1), (3, 1), (4, 1), (5, 1), 6, 1)},
{Y = 2} = {(2, 2), (2, 3), (2, 4), (2, 5), (2, 6), (3, 2), (4, 2), (5, 2), (6, 2)},
{Y = 3} = {((3, 3), (3, 4), (3, 5), (3, 6), (4, 3), (5, 3), (6, 3)}, and so on. In general, for
1 k 6, there are 13 2k sample points for which Y = k. Thus, pY (k) = 132k
for
36
1 k 6.
p (k)
Y
p (k)
X
1/6
1/6
k
1
(d) Using part (c) or using a spreadsheet/program, we find E[Y ] = 2.5277 . . . and Y2 =
1.9714062, so Y = 1.404.
(e) Y < X . This is consistent with the sketches. Intuitively, the triangular shape of
pY is more concentrated than the rectangular shape of pX . To elaborate, this is not
a mathematical statement; the mathematical fact is simply that (X) (Y ). For
intuition, look at the sketches of the pmfs in part (c) above. Suppose you start out with
the pmf of X, which has mass uniformly spread out over the six points. Then you take
your right hand and push some of the mass on the right side over to the left. There
is very little mass left at points 5 and 6. The mass gets more bunched up. Maybe
it helps to squint a bit when you look at the figure. The standard deviation would be
even smaller if you took both hands and pushed mass together from both sides to get a
symmetric triangular shape.
(f) Here is one example. Consider a random variable Zpwith pZ (1) = pZ (6) = 0.5. Then
1+36
2
E[Z] = 1+6
= 18.5, and Z = 18.5 (3.5)2 = 2.5. In fact, this
2 = 3.5, E[Z ] =
2
distribution has the largest variance and standard deviation of any distribution on the
set {1, 2, 3, 4, 5, 6}.
2.8. [Selecting supply for a random demand]
(a) Begin by calculating:
E[min{U, L}] =
M
1 X
min{i, L}
M
i=1
L
X
1
M
min{i, L} +
i=1
L
X
1
M
i=1
M
X
!
min{i, L}
i=L+1
i+
M
X
!
L
i=L+1
L(L + 1)
+ (M L)L
=
2
L2 L
= L
2M
1
M
Therefore,
E[profit] = (b a)L b
L2 L
2M
2
L
(b) As seen in part (a), the expected profit is h(L), where h(L) = (b a)L b L2M
. The
graph of h is a parabola facing downward. If we ignore the requirement that L be an
integer we can solve h0 (L) = 0 and find that h is maximized at Lr , where
Lr =
1
a
+M 1
.
2
b
Since the function h is symmetric about Lr , the maximizing integer value L is obtained
by rounding Lr to the nearest integer. That
is, the revenue maximizing integer value of
L is the integer nearest to Lr . (If M 1 ab happens to be an integer, then the fractional
a
b
and M 1
a
b
Alternative approach To find the integer L that maximizes h(L) we calculate that
h(L + 1) h(L) = (b a) bL
M , and note that it is strictly decreasing in L. A maximizing
integer L is the minimum integer L such that h(L + 1) h(L) 0, or equivalently, the
L = dM (1 ab )e. ( If
minimuminteger L such that L M 1 ab , or equivalently,
a
a
M 1 b happens
to be an integer, then L = M 1 b and h(L + 1) h(L ) = 0,
so M 1 ab and M 1 ab + 1 are both choices for L that maximize expected profit.)
L
Comment: The expression h(L + 1) h(L) = b(1 M
) a has a nice interpretation.
L
The term (1 M ) equals P {U L+1}, which is the probability an (L+1)th room would
be sold, and a is the price to the reseller of reserving and prepaying for one additional
room.
2.10. [First and second moments of a ternary random variable ]
(a) Since m2 = 2 + 2 , the given requirement is equivalent to = 12 and m2 = 1. In general,
and m2 can be expressed in terms of a and b as follows: = b a and m2 = a + b.
So b a = 12 and a + b = 1, or a = 0.25 and b = 0.75. (Note that this choice for (a, b) is
valid, i.e. a 0, b 0 and a + b 1.)
(b) As noted in the solution to part (a), in general, and m2 can be expressed in terms of a
and b as follows: = b a and m2 = a + b. Given and m2 , this gives two equations for
the two unknowns a and b. Solving yields a = m22 and b = m22+ . The constraint a 0
translates to m2 The constraint b 0 translates to m2 . These two constraints
are equivalent to the combined constraint || m2 . The constraint a + b 1 translates
to m2 1. In summary, there is a valid choice for (a, b) if and only if || m2 1. See
the sketch of this region.
(b)
m
1
(c)
(c) As seen in part (b), the mean must be in the range 1 1. For any such , m2
can be anywhere in the interval [||, 1], and so, by the hint, 2 can be anywhere in the
interval [|| 2 , 1 2 ]. See the sketch of this region.
1
6
36 / 36
= 16 .
(c) This event contains the eleven outcomes {61, 62, 63, 64, 65, 66, 56, 46, 36, 26, 16}, so it has
probability 11/36.
(d) P(at least one 6 | not doubles) = P(at least one 6, not doubles)/P(not doubles) = (10/36)/(30/36) = 10/30 = 1/3.
(This answer is slightly larger than the answer to (c), as expected.)
2.14. [Independence]
(a) For brevity, we write ij for the outcome that the orange die shows i and the blue die shows
j. Then, A = {26, 34, 43, 62}, B = {ij : 1 j < i 6}, C = {13, 22, 31, 26, 35, 44, 53, 62, 66},
and D = {1j : 1 j 6} {3j : 1 j 6}.
1
5
1
1
P (A) = , P (B) = , P (C) = , P (D) =
9
12
4
3
1
1
1
, P (AC) = P {26, 62} = , P (AD) = P {34} =
18
18
36
1
1
1
P (BC) = P {31, 53, 62} = , P (BD) = P {31, 32} = , P (CD) = P {13, 31, 35} =
12
18
12
So C and D are mutually independent; no other pair is mutually independent.
P (AB) = P {43, 62} =
(b) No three events are even pairwise independent by part (a), so no three events are mutually independent.
2.16. [A team selection problem]
(a) There are 74 = 73 = 765
321 = 35 ways to select the four people, and the number of ways
to select a team with Alice on it is the number of ways to select three of the other six
654
4
people to go with Alice, or 63 = 321
= 20. So P (A) = 20
35 = 7 . ALTERNATIVELY,
a faction 4/7 of the entire debate team is selected so by symmetry, each person has a
probability 4/7 of being selected.
(b) Given Bob is selected, there
is
are 20 ways to fill out the rest of the team, but if Alice
5
10
also selected, there are 2 = 10 ways to fill out the rest of the team, so P (A|B) = 20 =
0.5. ALTERNATIVELY by the definition, P (A|B) = PP(AB)
(B) = (number of selections
including both Alice and Bob)/(number of selections including Bob) = 10
20 = 0.5. OR
ALTERNATIVELY given Bob is selected, half of the other six people are also selected
so by symmetry, Alice has chance 0.5 to be among them.
6|
20
10
+ 20
(c) P (AB) = P (A)+P (B)P (AB) = 35
35 35 = 7 . ALTERNATIVELY, the number
of selections that exclude both Alice and Bob is the number of ways to select four out
5
of the other five people, which is 5. So P (A B) = 1 P ((A B)c ) = 1 35
= 7.6
1
5
(b) P {X = 1} = P {Y = 0} = 32
. P {X = 4} = P {Y = 1} = 32
.
10
P {X = 16} = P {Y = 2} = 32 . P {X = 64} = P {Y = 3} = 10
32 .
5
1
P {X = 256} = P {Y = 4} = 32
. P {X = 1024} = P {Y = 5} = 32
.
(To elaborate on the above solution, perhaps it would help to consider some particular
outcomes for the five weeks. Since there are two possibilities for each week, there are
25 = 32 possibilities for the five weeks. For example, if the value is cut in half all five
weeks (represent this by 00000) then X = 1 (start out with 32 and divide it by 2 five
times). If the investment doubles the first week and is cut in half the other four weeks
(represent this by 10000) then X = 4 (thered be 64 after one week and that gets cut in
half four times. Similarly, X = 4 if the investment doubles the second week and is cut in
half the other four weeks. Continuing, we see there are 5 ways for X = 4. The possible
values of X are 1,4, 16, 64, 256, 1024. X is not a binomially distributed random variable.
But X is determined by the number of good weeks, when the investment doubles. It
doesnt depend on the order of good and bad weeks. The number of good weeks out of
the five weeks has the binomial distributionit is the number of successes for a specified
number of independent trials with the same success probability.)
1
5
10
5
1
(c) E[X] = 1 32
+ 4 32
+ 16 32
+ 64 10
32 + 256 32 + 1024 32 = 97.65625. The TV
commercial understates the performance - undoubtedly a first!
(b) Let n be the number of flights. As each flight is independent of the other, and if X
represents the number of crashes in n flights, then its pmf is binomially distributed with
parameters
(n, p). Thus, the probability of having at least one crash in n flights is given
by 1 n0 (1 p)n = 1 (1 p)n . Searching over values of n starting from 1, we find that
n = 100, 006 flights are needed in order for the probability of the aircraft experiencing
at least one crash reaches 0.01%.
(c) Sweeping pc from say 0 upwards in the equation below until it evaluates to 109 :
4
X
4
k=2
(pc )k (1 pc )4k
we get pc = 1.2910 105 . This number dictates how reliable each aircraft subsystem
needs to be designed.
(a) This is the same as the probability exactly two even numbers show. Each die shows an
even number with probability 0.5, so the number of even numbers showing,
X, has the
binomial distribution with parameter p = 0.5. Therefore, P {X = 2} = 42 (0.5)2 (0.5)2 =
6
3
16 = 8 .
(b) Each roll does not produce two even and two odd numbers with probability 5/8, and for
125
that to happen on three rolls has probability (5/8)3 = 512
.
2.24. [A knockout game]
P{X = 0} = P{of players 1,2: 2 has the higher number } = 1/2
P{X = 1} = P{of players 1,2,3: 3 has largest, 1 the next largest} = (1/3)(1/2) = 1/6
P{X = 2} = P{of 1,2,3,4: 4 has largest, 1 the next largest} = (1/4)(1/3) = 1/12
2.26. [ML parameter estimation for independent geometrically distributed rvs]
(a) Since P{L_i = k_i} = p(1 - p)^{k_i - 1} for each i, the likelihood is
p(1 - p)^{k_1 - 1} *** p(1 - p)^{k_n - 1} = p^n (1 - p)^{s-n}, where s = k_1 + ... + k_n. That is, s is the
total number of attempts needed for the completion of the n tasks.
(b) By definition, p^_ML is the value of p that maximizes the likelihood found in (a). A special
case is s = n, i.e., each completion requires only one attempt. The likelihood in that case
is p^n, which is maximized when p = 1. Suppose now that s > n. Differentiating the
likelihood with respect to p yields:
d(p^n (1 - p)^{s-n})/dp = p^{n-1}(1 - p)^{s-n-1}(n - sp).
The derivative is zero at p = n/s (and is positive for p smaller than n/s and negative for
p larger than n/s). So p^_ML = n/s. This formula also works for the case s = n, already
discussed, so it works in general. This formula makes intuitive sense; during the observation period, n successes are observed out of a total of s attempts, so p^_ML is the
fraction of observed attempts that are successful.
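As a numerical check of part (b), the short Python sketch below maximizes the log likelihood over a grid for some made-up observations (the k values are hypothetical, not from the problem) and compares the result with n/s.

import math

ks = [3, 1, 4, 2, 2]                       # hypothetical observed attempt counts
n, s = len(ks), sum(ks)

def log_likelihood(p):
    return n * math.log(p) + (s - n) * math.log(1 - p)

p_hat = n / s                              # claimed maximizer
best = max((i / 1000 for i in range(1, 1000)), key=log_likelihood)
print(p_hat, best)                         # both close to n/s = 5/12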
2.28. [Scaling of a confidence interval]
(a) The width of the original window is a/sqrt(n), where a determines the confidence level (which
we don't need to find yet). To reduce this width by a factor of two, n should be increased
by a factor of four. So 1200 samples would be needed.
(b) The confidence interval has width 0.1 if a/sqrt(n) = 0.1. Solving for a with n = 300 yields
a = (0.1) sqrt(300), so the confidence level is given by 1 - 1/a^2 = 1 - 1/((0.01)(300)) = 2/3.
(c) In order for 1 - 1/a^2 = 0.96 we need a^2 = 1/0.04 = 25, or a = 5. Then, in order to have
a/sqrt(n) = 0.1, we need sqrt(n) = 5/0.1 = 50, or n = 2500.
n
1
a2
=1
1
52
2 n
a
,
2 10,000
so a = 5. Therefore,
= 0.96.
L(n)
L(n1)
L(n)
L(n1)
Bayes Formula and binary hypothesis testing Sections 2.10 & 2.11
2.32. [The weight of a positive]
(a) P(positive) = P(positive|cancer)P(cancer) + P(positive|no cancer)P(no cancer)
= (0.9)(0.008) + (0.07)(0.992) = 0.07664, about 7.7%.
(b) P(cancer|positive) = P(positive|cancer)P(cancer)/P(positive) = (0.9)(0.008)/0.07664 = 0.0939, about 9.4%.
(c) Out of 1000 women getting a mammogram, we expect 1000*0.008 = 8 women to have
breast cancer, and 8*(0.9) = 7.2 of those to get a positive mammogram. Expect
1000*(0.992)*(0.07) = 69.4 women to get a false positive. There is a debate within the
health industry as to whether women in this age range should get mammograms.
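The arithmetic above is easy to reproduce. The following Python sketch (illustrative only) recomputes the total probability of a positive test and the posterior probability of cancer given a positive test.

p_cancer = 0.008
p_pos_given_cancer = 0.9
p_pos_given_no_cancer = 0.07

p_pos = p_pos_given_cancer * p_cancer + p_pos_given_no_cancer * (1 - p_cancer)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(p_pos, p_cancer_given_pos)   # about 0.07664 and 0.094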
2.34. [Conditional distribution of half-way point]
(a) In this problem we use repeatedly
the fact that the probability that k out of n consecutive steps are right steps is nk 2n . That is because any particular sequence of n steps
2
4
36
9
P ({X = 0}F ) =
28 =
=
2
256
64
1
.
256
)
(d) By the definition of conditional probability, pX (i|F ) = P ({X=i}F
. So by the answers to
P (F )
parts (a) and (c), we find that the support of the conditional pmf is {4, 2, 0, 2, 4} and
pX (0|A) =
36
256
70
256
36
70
pX (2) = pX (2) =
pX (4) = pX (4) =
1
256
70
256
16
256
70
256
16
70
1
70
!4
!4
shows that the conditional pmf of X given F is more concentrated around zero than the
unconditional pmf of X.
2.36. [A simple hypothesis testing problem with discrete observations]
(a) The likelihood ratio function is Lambda(i) = p_1(i)/p_0(i) = 9 i^2/60. The ML rule is as follows: if X = i,
decide H_1 if Lambda(i) >= 1, or equivalently, if |i| >= 2.58, or equivalently, because i is integer
valued, if |i| >= 3. Equivalently, Gamma_1 = {-4, -3, 3, 4} and Gamma_0 = {-2, -1, 0, 1, 2}. (The
meaning of Gamma_1 and Gamma_0 is that if X is in Gamma_1 we declare that H_1 is true, and if X is in Gamma_0 we
declare that H_0 is true.)
9(4)2
60
0
1
> 2.4.
(a) Using a = sqrt(20) so that 1 - 1/a^2 = 0.95, the symmetric, two-sided confidence interval for
the parameter p of a binomial distribution with known n and observed p^ = k/n has endpoints
k/n +/- sqrt(20)/(2 sqrt(n)). The given data yields the confidence interval [0.290, 0.557] for p_h and the
interval [0.309, 0.505] for p_a. These intervals intersect; we say the data is not conclusive.
(b) It can be said that if H0 were true, and if p denotes the common value of ph and pa ,
from a perspective before the experiment is conducted, the probability the confidence
intervals will not intersect is less than or equal to the probability that at least one of
the confidence intervals will not contain p. The probability the confidence interval for ph
will not contain p is less than or equal to 5%, and the probability the confidence interval
for pa will not contain p is less than or equal to 5%. So the probability that at least
one of the two confidence intervals will not contain p is less than or equal to 10%. So
therefore, if H0 is true, before we know about the data, we would say the probability
the intervals do not intersect is less than or equal to 10%. Thus, if and when we observe
nonintersecting confidence intervals, we can say either H0 is false, or we just observed
data with an unlikely extreme. Note that we cannot conclude that there is a 90% chance
that H1 is true.
2.40. [Hypothesis testing for independent geometrically distributed observations]
(a) In Problem 2.26 we've seen that if p is the true parameter, then P{(L_1, . . . , L_n) =
(k_1, . . . , k_n)} = p^n (1 - p)^{s-n}, where s = k_1 + ... + k_n. The likelihood ratio for a given
observed vector (k_1, . . . , k_n) is the ratio of this probability for p = 0.25 to the probability
for p = 0.5:
Lambda(k_1, . . . , k_n) = ((0.25)^n (0.75)^{s-n}) / ((0.5)^n (0.5)^{s-n}) = (1.5)^s / 3^n.
equal to 1100, the second test pattern is not equal to 1100 with probability 14/15. Given that
the first two test patterns are not equal to 1100, the third test pattern is not equal to 1100
with probability 13/14. Multiplying these three probabilities gives P(A|F) = 13/16.) Thus, by
Bayes' rule,
P(F|A) = P(F, A)/P(A) = P(A|F)P(F)/(P(A|F)P(F) + P(A|F^c)P(F^c)) = ((13/16)(1/2))/((13/16)(1/2) + 1*(1/2)) = 13/29 = 0.4483.
F(c) = 0 for c <= 0;  c^2/2 for 0 <= c < 1;  -c^2/2 + 2c - 1 for 1 <= c < 2;  1 for c >= 2.
There are several ways to derive the above solution. One is to perform the integration of
the pdf following the definition. Other ways are to argue geometrically, using the fact
that the area of a triangle is one half the base times height. For example, if c is in [1, 2],
then F(c) is one minus the area of the triangle formed by the pdf over the interval [c, 2].
That triangle has base and height 2 - c, so its area is (1/2)(2 - c)^2, and the CDF is
1 - (1/2)(2 - c)^2. A different way for c in [1, 2] is to add the area of the left half of the
pdf (which is 1/2) to the area of the trapezoid formed by the pdf over the interval [1, c].
This gives F(c) = (1/2) + (c - 1)(1 + 2 - c)/2, because the area of a trapezoid is the
base times the average height.
3.6. [Selecting pdf parameters to match a mean and CDF]
The density is symmetric about a so the mean is a. Therefore a = X = 2.
In order for fX to be a valid pdf, it must integrate to one. The region under fX consists
of a rectangle of area 2b(1) and two triangles of area c(1)/2 each, so its total area is 2b + c;
therefore, 2b + c = 1.
By inspection of the figure, FX (a b) is the area of the first triangle, so FX (a b) = 2c . By the
2
2
formula given for FX , FX (a b) = 5c6 . So 2c = 5c6 , or c = 0.6. ( ALTERNATIVELY, taking
the derivative of the expression given form FX shows that fX (u) = 53 (u (a b c)) in the
interval [a b c, a b]. Setting fX (a b) = 1 givens 5c
3 = 1 or c = 0.6. ALTERNATIVELY,
5
00
0
taking the derivative of FX twice yields FX = fX = 3 over the interval [a b c, a b]. By the
picture, the slope of fX over the interval is 1c . So 53 = 1c or c = 0.6. There are other ways to
derive c = 0.6; they all involve comparing the picture of fX over the interval [a b c, a b]
to the formula given for FX .) Since 2b + c = 1, b = 0.2.
3.8. [A continuous approximation of the Zipf distribution]
(a) In order for fY to integrate to one,
Z M +0.5
u1 M +0.5 (M + 0.5)1 (0.5)1
C=
u du =
=
1 0.5
1
0.5
(b) Using the formula for the integral from part (a), we find
R 500+0.5
u du
(500.5)0.2 (0.5)0.2
P {Y 500.5} = R 0.5
=
= 0.7011
2000+0.5
(2000.5)0.2 (0.5)0.2
u du
0.5
(This is close to P {X 500}, which is 0.6997.)
3.10. [Uniform and exponential distribution II]
(a) Let B be the event the radio comes from the first batch and let L be the lifetime of the
radio. Lets calculate P {L c} for any constant c 0, because it will be needed for
several values of c. By the law of total probability,
P {L c} = P (L c|B)P (B) + P (L c|B c )P (B c ).
Since L is uniformly distributed over the interval [0, 2] if B is true,
1 2c 0 c 2
.
P (L c|B) = 1 P (L c|B) =
0
c2
Since L has the Exp(0.1) distribution if B c is true, P (L c|B c ) = ec/10 for all c 0.
Combining these facts and using P (B) = P (B c ) = 21 , yields
P (L c) =
1
2 (1
2c + ec/10 ) 0 c 2
1 c/10
)
c2
2 (e
P (L 5 + 3|L 5) =
1 0.8
2e
1 0.5
2e
(6.1)
= e0.3 = 0.7408
A shorter answer is the following. Given L 5, the radio must be from the second
batch, and thus have an exponentially distributed lifetime with parameter . By the
memoryless property of the exponential distribution, the radio after five years is as good
a new. So the probability it lasts at least three more years is e3 = e0.3 .
(b) By the definition of conditional probabilities and (6.1),
P (L 1 + 3|L 1) =
1 0.4
2e
1 1
0.1 )
2(2 + e
= 0.4771
3
3
Note that we could get the same answer without using the symmetry property:
X +4
6 + 4
2
2
P {X 6} = P
=
=Q
.
3
3
3
3
(c)
P {0 < X < 2} = P {X > 0} P {X 2}
X +4
X +4
= P
> 4/3 P
2
3
3
= Q(4/3) Q(2).
We remark that in the above solution we could have written instead of > and/or
> instead of , because X is a continuous type random variable.
(d)
P {X 2 < 9} = P {3 < X < 3}
1
X +4
= P
<
< 7/3
3
3
= Q(1/3) Q(7/3).
3.18. [Blind guessing answers on an exam]
(a) By the problem description we take X to have the binomial distribution with parameters
n = 10 and p = 0.5. Since S = 3X - 3(10 - X) = 6X - 30,
P{S >= 12} = P{6X - 30 >= 12} = P{X >= 7}.
Since E[X] = 5, the Markov inequality yields:
P{S >= 12} = P{X >= 7} <= E[X]/7 = 5/7.
(b) By the observed symmetry and the connection to X,
P{S >= 12} = (0.5)P{|S| >= 12} = (0.5)P{|X - 5| >= 2}.
Since E[X] = 5 and Var(X) = 10(0.5)(0.5) = 5/2, the Chebychev inequality yields
P{S >= 12} = (0.5)P{|X - 5| >= 2} <= (0.5) Var(X)/2^2 = 5/16.
(c)
P{S >= 12} = P{X >= 7} = P{X >= 6.5} = P{(X - 5)/sqrt(2.5) >= 1.5/sqrt(2.5)} ~ Q(1.5/sqrt(2.5)) = 0.1714.
176
X
log2 (3)
= P {X 110}.
=1Q
= 0.9568.
P {S 1} = P
7
7
7
7
Using the continuity correction as follows gives a more accurate estimate. To derive it,
note that P {X 110} = P {X 110.5} because X is integer valued, and approximate
P {X 110.5} :
X 98
110.5 98
12.5
12.5
P {S 1} = P
=1Q
= 0.9629.
7
7
7
7
(Using the binomial distribution directly gives P {S 1} = P {X 110} = 0.9631.)
3.22. [ML parameter estimation for Rayleigh and uniform distributions]
(a) By definition, theta^_ML(10) is the value of theta that maximizes the likelihood of X = 10. The
likelihood of X = 10 is f_theta(10) = (10/theta) e^{-50/theta}, and the log likelihood is ln(10) - ln(theta) - 50/theta.
Differentiation with respect to theta yields d ln f_theta(10)/d theta = -1/theta + 50/theta^2. This derivative is zero for
theta = 50, and it is positive for theta < 50 and negative for theta > 50. Hence, theta^_ML(10) = 50.
(Note: If theta is replaced by sigma^2, then f_theta is the Rayleigh pdf with parameter sigma^2. The same
reasoning as above shows that, in general, for observation X = u, (sigma^2)^_ML = u^2/2.)
(b) The pdf is f_a(u) = a if 1/a <= u <= 2/a, and f_a(u) = 0 else.
In order for f_a(3) > 0, the support of f_a must include the observed value u = 3, or
1/a <= 3 <= 2/a, or equivalently 1/3 <= a <= 2/3. So f_a(3) is an increasing function of a over the
interval 1/3 <= a <= 2/3, and it is zero elsewhere. Therefore, the ML estimate is the largest
value of a in this interval. Thus, a^_ML = 2/3.
3.24. [ML parameter estimation for independent exponentially distributed rvs]
(a) Since f_lambda(u_i) = lambda exp(-lambda u_i) for each i, the likelihood is
lambda exp(-lambda u_1) *** lambda exp(-lambda u_n) = lambda^n exp(-lambda t), where t = u_1 + ... + u_n. That is, t is the
sum of the lifetimes of all n lasers.
(b) By definition, lambda^_ML is the value of lambda that maximizes the likelihood found in (a). Differentiating the likelihood with respect to lambda yields:
d(lambda^n exp(-lambda t))/d lambda = lambda^{n-1} exp(-lambda t)(n - lambda t).
The derivative is zero at lambda = n/t (and is positive for lambda smaller than n/t and negative for
lambda larger than n/t). So lambda^_ML = n/t. This formula makes intuitive sense; n laser failures
happened during operation of total duration t, so lambda^_ML is the observed rate of failures
per unit time.
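The conclusion lambda^_ML = n/t can be checked numerically in the same spirit as Problem 2.26. The Python sketch below uses hypothetical lifetimes (not data from the problem) and maximizes the log likelihood over a grid.

import math

us = [1.2, 0.7, 2.5, 0.4, 1.1]             # hypothetical observed lifetimes
n, t = len(us), sum(us)

def log_likelihood(lam):
    return n * math.log(lam) - lam * t

lam_hat = n / t
best = max((i / 1000 for i in range(1, 5000)), key=log_likelihood)
print(lam_hat, best)                       # both close to n/t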
E[X 2 ]
u2
0 3 du
R3
3
u3
9
0
= 3.
2 3
3 .
(c) The range of Y is (, ln 3]. Thus, using the fact the pdf of X is
FY (c) =
P {ln X c} = P {X ec } =
1
ec
3
1
3
< c ln 3
.
c ln 3
(a) Y takes values in {, }; its pmf is
P {X 0} = 0.5 if u =
P {X < 0} = 0.5 if u =
pY (u) =
0
else
(b) Note that (X Y )2 is a function of X. If u < 0 and X = u then the quantizer output
is -, so the squared error is (u ())2 = (u + )2 . For example, if = 1, and if the
input is -1.2, the output will be -1, and the squared error is (0.2)2 . If u > 0 then the
quantizer output is so the squared error is (u )2 . Thus, in general (X Y )2 = h(X)
where h is the function defined by
(u + )2 if u 0
h(u) =
(u )2 if u < 0
So using LOTUS and symmetry yields
Z
E[(X Y ) ] = E[h(X)] =
(u + )
2e
Z
du +
(u )2
2
0
Z
=
(u )2 eu du
Z0
=
(u2 2u + 2 )eu du
eu
du
2
= 2 2 + 2
(c) The derivative of the mean square error with respect to is 2(1 + ), so the mean
square error is decreasing in for < 1 and increasing for > 1. So = 1 minimizes
the mean square error. ALTERNATIVELY, the mean square error can be expressed as
1 + (1 )2 , which makes it clear that = 1 minimizes it.
3.30. [Log uniform and log normal random variables]
(a) Z takes values in the set [ea , eb ]; for ea c eb :
FZ (c) = P {eU c} = P {U ln c} =
ln c a
.
ba
Differentiating yields
fZ (c) =
(b) By LOTUS, E[Z] =
eu
a ba du
Rb
1
c(ba)
ea c eb
else.
eb ea
ba .
(d) By LOTUS,
E[Y] = integral (1/sqrt(2 pi)) e^u exp(-u^2/2) du
     = integral (1/sqrt(2 pi)) exp(-(u^2 - 2u)/2) du
     = exp(1/2) integral (1/sqrt(2 pi)) exp(-(u - 1)^2/2) du
     = exp(1/2) = sqrt(e) = 1.64872.
p
|u|fX (u)du
=
=
=
=
1
u 2 du +
2u
Z
1
u 2 du
u
1
Z
u3/2 du
1
1/2
(2u
) = 2
1
du
2u2
FY (c) = P {Y c} = 2P {1 X c } =
1
c2
2
1
1 c
1
du = = 1 2 .
2
u
u 1
c
2/c3 if c 1
0
else.
(c) By Example 3.8.11 of the notes, the function h(u) is just the CDF of X : h(c) = FX (c).
For c 1,
Z c
1 c
1
1
FX (c) =
du =
=
2
2u
2c
2u
In particular, FX (1) = 0.5. Since fX (u) = 0 for u (1, 1), it follows that FX is flat
over the interval [1, 1], with FX (1) = 0.5. For c 1
Z c
1
1 c
1
FX (c) = 0.5 +
du = 0.5 + = 1
2
2u
2u
2c
1
1
Combining the above,
1
2c
0.5
h(c) = FX (c) =
1
1 2c
if c 1
if 1 < c 1
if c > 1.
Rt
respect to
0 r(s)ds with
R
0t r(s)ds
0
FT (t) = r(t)e
. Thus,
t is
r(t). Using that fact and the chain rule yields fT (t) =
the
fT (t)
failure rate function of T is given by h(t) = 1FT (t) = r(t). That is, r is the failure rate
function of T.
3.36. [(COMPUTER EXERCISE) Running averages of independent, identically distributed random variables]
[Plots of S_n/n versus n, for n up to 10,000, for the uniform distribution and for the other specified distribution.]
(a) {T > t} = {T1 > t} {T2 > t} and the events {T1 > t} and {T2 > t} are independent
(because T1 and T2 are independent) so
P {T > t} = P {T1 > t} + P {T2 > t} P {T1 > t}P {T2 > t} = 2et e2t .
(b) fT (t) = (P {T > t})0 = 2(et e2t ) for t 0. Also, fT (t) = 0 for t < 0.
(c) Using the answers to parts (a) and (b),
fT (t)
2(et e2t )
2(1 et )
1
h(t) =
.
=
=
= 1 t
P {T > t}
2et e2t
2 et
2e 1
(d) By the definition of conditional probabilities,
P {min{T1 , T2 } < t|T > t} =
Observe that {min{T1 , T2 } < t, T > t} = {T1 < t, T2 > t} {T2 < t, T1 > t} and
the two events {T1 < t, T2 > t} and {T1 < t, T2 > t} are mutually exclusive, and by
symmetry, they have equal probability. So P {min{T1 , T2 } < t, T > t} = 2P {T1 <
t, T2 > t} = 2P {T1 < t}P {T2 > t} = 2(1 et )et . Combining this with part (a)
yields an expression for P {min{T1 , T2 } < t|T > t}, which when multiplied by is the
same as the expression for h found in part (c).
Since the support of fY is R+ , the conditional pdf fX|Y (u|v) is well-defined only for
v 0. For such v,
(
ve(1+u)v
= veuv u 0
ev
fX|Y (u|v) =
0
u < 0.
That is, the conditional distribution of X given Y = v is the exponential distribution
with parameter v.
(c) The joint CDF FX,Y (uo , vo ) is zero if either uo < 0 or vo < 0. For uo 0 and vo 0,
Z vo Z uo
FX,Y (u0 , v0 ) =
ve(1+u)v dudv
Z0 vo 0 Z uo
v
uv
e
ve
du dv
=
0
0
Z vo
ev (1 euo v )dv
=
Z0 vo
(ev ev(1+uo ) )dv
=
0
o
1 n
=
uo + evo (1+uo ) evo
1 + uo
(d) No. For example, the answer to part (b) shows that fY |X (v|u) is well defined for u > 0
and it is not a function of v alone.
ev 0 u 1, v 0
0
else.
(b) The probability to be found, P {Y X}, is the integral of the joint density over the set
{(u, v) : v u} intersected with the support of fX,Y , which is the shaded region shown:
v
1
u=v
u
0
R1R
R1
(c) The probability to be found, P {XeY 1}, is the integral of the joint density over the set
{(u, v) : uev 1} intersected with the support of fX,Y . Note that for u and v positive,
the condition uev 1 is equivalent to v ln(u), so the region to be integrated over is
shown:
v
v=!ln(u)
u
0
R1R
R1
Thus, P {XeY 1} = 0 ln(u) ev dvdu = 0 u du =
1} = 1 and lim P {XeY 1} = 0.
1
1+ .
So lim0 P {XeY
2e2v for v 0
0
for v < 0
fX,Y (u,v)
fX (u)
2e2v .
(c) The probability in question is the integral of the joint pdf over the region {u 0, v
0, u + 2v 2}.
v
1
u+2v=2
u
0
R 2 R 1u/2
0
2ueu2v dvdu or
R 1 R 12v
0
2ueu2v dudv
2 Z 1u/2
P {X + 2Y 2} =
0
2ueu2v dvdu
0
2
1u/2
ue
=
Z
2e2v dvdu
0
2
ueu ue2 du
0
2 Z 2
Z
u
u
2
= ue +
e du e
=
udu
u
1
They are not independent because, for example, the support of fX,Y is not a product
set (see Propositions 4.4.3 and 4.4.4).
(2u)3
2
(u + v )dv = 2u +
.
3
2
fX (u) =
0
4u3
1 1
4
2
2 u +
du = 2
= .
+
3
3 3
3
1 Z 2u
3
(u + v 2 )dvdu
4
0
u
Z
7u3
3 1
2
u +
du
4 0
3
!
1
7
11
3 u3 1 7u4 1
+
= +
=
4 3 0
12 0
4 16
16
P {X < Y } =
=
=
E[V ] =
Z 1Z 1
=
0
Z 01
=
0
uv 2 dudv
Z 1
1
1
udu
v 2 dv =
=
2
3
6
0
(b) Always V 0 because H 0 and R 0. V can be close to zero because the same is
true for H and R. The largest V can be is , which happens if H = R = 1. Thus, the
distribution of V has support [0, ].
(c) FV (c) = 0 for c 0 and FV (c) = 1 for c . So fix c with 0 c and focus on
calculating FV (c) = P {HR2 c}. It is the same as the area of the region {(u, v)
[0, 1]2 : uv 2 c}, which is pictured below as a shaded region.
u
1
c/
The p
curved part of the boundary is given by the equation uv 2 = c, or equivalently,
c
c
v = u
. The curve intersects the horizontal
line
v = 1 at u = . Thus, the shaded
c
region
c is the union of the rectangle with u 0, and the region under the curve with
u , 1 . Therefore,
n
co
FV (c) = P HR2
Z 1Z c
Z cZ 1
u
=
dvdu +
dvdu
0
r Z 1
1
c
c
=
+
u 2 du
r
r
c
c
c
=
+
2 1
r
c
c
= 2
Summarizing,
p0
2 c
FV (c) =
c0
0c
c
1
c
0<c
else
(Note: Alternatively we could have let the pdf equal infinity at c = 0. The value of a pdf
at a single point doesnt matter, because the value of the pdf at a single point doesnt
change the integrals of the pdf.)
v
a
u
a/2
The integral can be computed using either integration with respect to u on the outside or
integration with respect to v on the outside. Using u on the outside permits us to do the
computation using only one double integral:
Z a/2 Z au
2euv dv du = 1 (1 + a)ea .
FZ (a) =
0
Therefore,
FZ (a) =
1 (1 + a)ea a 0
0
else.
Differentiating the CDF yields fZ (a) = aea for a 0. That is, Z has the Erlang distribution
with parameters r = 2 and = 1. (This could have been deduced without calculation. The
joint CDF is the same as the joint CDF of two independent exponential random variables,
conditioned on the first to be smaller than the second. Since the two random variables are
independent and identically distributed, such conditioning does not change the distribution
of the sum, and the sum of two independent exponentials has the Erlang distribution with
r = 2.)
4.16. [A function of two random variables]
1
The current I is given by I = R1 +R
. One approach to this problem is to use the fact that the
2
distribution of R1 +R2 has the triangular pdf as found in Example 4.5.4. Here we take a direct
approach. Note that I takes values in the set [ 12 , ). For a in that range, FI (a) = P {I
a} = P {1/(R1 + R2 ) a} = P {R1 + R2 a1 }. Since (R1 , R2 ) is uniformly distributed over
the unit square region [0, 1]2 , the probability P {R1 + R2 a1 } is the area of the shaded
region, consisting of the intersection of [0, 1]2 and the half-plane {(u, v) : u + v a1 }. The
shape of the intersection is qualitatively different for a < 1 and a > 1; an example of each of
these two cases is shown in the figure.
2!a!1
u
1
0.5< a < 1
u
1
a!1
1 <a < !
We thus consider these two cases separately. In case 0.5 < a < 1 the shaded region is
triangular with area one half base times height. The edge of the half plane, given by {u + v =
a1 }, intersects the line v = 1 at u = a1 1. Therefore, the width of the triangular region
is 1 (a1 1) or 2 a1 , and the height is the same. In case 1 < a < the shaded region
is [0, 1]2 with a triangular region removed from the lower left corner, so its area is one minus
the area of the triangle. The width and height of the triangle are a1 . Therefore,
0
a < 1/2
(2 a1 )2 /2 1/2 a 1
FI (a) =
2a2 a3 , 1/2 a 1
a3 ,
1<a<
(b) No. For example, suppose X is any random variable with positive variance and P{X =
Y} = 1. Then Var(X) = Var(Y) but Cov(X, Y) = Var(X) != 0.
(c) Yes, because Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y ), so the given statement is
equivalent to Var(X) + Var(Y ) + 2Cov(X, Y ) = Var(X) + Var(Y ), or Cov(X, Y ) = 0.
(d) No. For example, suppose P{X = Y = 1} = P{X = Y = -1} = 0.5. Then X^2 and
Y^2 are both equal to one with probability one, so Cov(X^2, Y^2) = Cov(1, 1) = 0, but
Cov(X, Y) = E[XY] - E[X]E[Y] = 1 - 0*0 = 1 != 0.
4.20. [Variances and covariances of sums]
(a) Cov(3X + 2, 5Y 1) = Cov(3X, 5Y ) = 15Cov(X, Y ).
(b)
Cov(2X + 1, X + 5Y 1) = Cov(2X, X + 5Y ) = Cov(2X, X) + Cov(2X, 5Y )
= 2Cov(X, X) + 10Cov(X, Y ) = 2Var(X) + 10Cov(X, Y )
(c)
Cov(2X + 3Z, Y + 2Z) = Cov(2X, Y) + Cov(2X, 2Z) + Cov(3Z, Y) + Cov(3Z, 2Z)
= 2Cov(X, Y) + 4Cov(X, Z) + 3Cov(Z, Y) + 6Cov(Z, Z)
= 2Cov(X, Y) + 6Var(Z)
4.22. [Signal to noise ratio with correlated observations]
X1 +X2
.
2
X1 + X2
=
E[S] = S = E
2
Var(X1 + X2 )
2 2 + 2Cov(X1 , X2 )
2 + Cov(X1 , X2 )
S2 =
=
=
4
4
2
22
SN RS =
2 + Cov(X1 , X2 )
22
.
2 (1 + X1 X2 )
MSE = Var(Y )
Cov(Y, X1 )2 Cov(Y, X2 )2
.
Var(X1 )
Var(X2 )
(b)

Var(X̃2) = Cov(X2 - (0.5)X1, X2 - (0.5)X1)
= Var(X2) - 2(0.5) Cov(X1, X2) + (0.5)^2 Var(X1)
= 1 - 0.5 + 0.25 = 0.75.

Cov(Y, X̃2) = Cov(Y, X2 - (0.5)X1)
= Cov(Y, X2) - (0.5) Cov(Y, X1)
= 0.8 - (0.5)(0.8) = 0.4.
(c) a = E[Y] = 0, b = Cov(Y, X1)/Var(X1) = 0.8, and c = Cov(Y, X̃2)/Var(X̃2) = 0.4/0.75, and

MSE = Var(Y) - Cov(Y, X1)^2/Var(X1) - Cov(Y, X̃2)^2/Var(X̃2) = 1 - 0.64 - (0.4)^2/0.75 = 0.1466....

Thus the estimator is 0.8 X1 + (0.4/0.75)(X2 - (0.5)X1) = (0.4/0.75)(X1 + X2).
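The same answer can be obtained by solving the normal equations directly. The sketch below uses the covariance values appearing in the solution (zero means, unit variances, Cov(Y, X1) = Cov(Y, X2) = 0.8, Cov(X1, X2) = 0.5); it is a verification aid, not part of the original solution.

import numpy as np

Sigma_X = np.array([[1.0, 0.5],
                    [0.5, 1.0]])        # covariance matrix of (X1, X2)
c = np.array([0.8, 0.8])                # (Cov(Y, X1), Cov(Y, X2))
var_Y = 1.0

w = np.linalg.solve(Sigma_X, c)         # coefficients of X1 and X2
mse = var_Y - c @ w
print(w)                                # both about 0.5333 = 0.4/0.75
print(mse)                              # about 0.1467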
LLN, CLT, and joint Gaussian distribution (Sections 4.10 & 4.11)
4.30. [Law of Large Numbers and Central Limit Theorem]
(a) From the LLN (Chebychev inequality), we have

P{ |S_n/n - μ_X| ≥ 0.01 μ_X } ≤ σ_X^2/(n 10^{-4} μ_X^2),

so

P{ |S_n/n - μ_X| < 0.01 μ_X } ≥ 1 - σ_X^2/(n 10^{-4} μ_X^2) ≥ 0.95

holds if n ≥ σ_X^2/((0.05) 10^{-4} μ_X^2), which with μ_X = 3.5 and σ_X^2 = 35/12 (see part (b))
gives n ≥ 47620. (If we were to plug in the cruder approximation 2.92 for σ_X^2 we would
get n ≥ 47674, which for most purposes is close enough.)
(b) E[S_n] = 3.5n, Var(S_n) = (2.9167)n, and the standard deviation of S_n is 1.7078√n. By the Gaussian approximation,

P{ |S_n - 3.5n| ≤ (0.035)n } ≈ 1 - 2Q((0.035)n/(1.7078√n)) = 1 - 2Q((0.035)√n/1.7078),

and this is at least 0.95 provided the argument of Q is at least 1.96, since from the normal tables Q(x) = 0.025 for x = 1.96. Thus, n ≥ ⌈((1.96)(1.7078)/(0.035))^2⌉ =
9147 is sufficient. (We would get a slightly different answer if we took a cruder approximation of σ_X.)
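Both sample-size calculations are easy to reproduce; the sketch below assumes, as the numbers in part (b) indicate, that each X_i is a roll of a fair die (mean 3.5, variance 35/12).

import math

mu = 3.5
var = 35 / 12                                    # about 2.9167

# (a) Chebychev bound: var / (n * (0.01 * mu)**2) <= 0.05
n_chebychev = math.ceil(var / (0.05 * (0.01 * mu) ** 2))
print(n_chebychev)                               # 47620

# (b) Gaussian (CLT) approximation: 0.035 * sqrt(n) / sigma >= 1.96
sigma = math.sqrt(var)                           # about 1.7078
n_clt = math.ceil((1.96 * sigma / 0.035) ** 2)
print(n_clt)                                     # 9147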
4.32. [Marathon blackjack]
(a) The expected net gain is (1000)($100)(-0.0029) = -$290.
(b) The net gain, in dollars, is the sum of the gains for the 1000 games, S = X1 + ··· + Xn,
where each Xi has mean 100(-0.0029) = -0.29 and standard deviation (100)(1.141) ≈ 114. As already
mentioned, S has mean 1000(-0.29) = -290, and its standard deviation is 114√1000 ≈ 3605. By the Gaussian approximation backed by the central limit
theorem,

P{S > 0} = P{ (S + 290)/3605 > 290/3605 } ≈ Q(290/3605) = 0.4679.
(c)

P{S > 1000} = P{ (S + 290)/3605 > 1290/3605 } ≈ Q(1290/3605) = 0.3602.
Similarly, with n games in place of 1000,

P{S > 0} = P{ (S + 0.29n)/(114√n) > 0.29n/(114√n) } ≈ Q((0.29)√n/114).

Since Q(0.253) = 0.4, we solve (0.29)√n/114 = 0.253, or n = ((0.253)(114)/0.29)^2 ≈ 9891.
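The Gaussian-approximation numbers above can be reproduced with a few lines; here the Q function is written via erfc, and the results match the table-based values to rounding.

from math import erfc, sqrt

def Q(x):
    """Standard normal complementary CDF."""
    return 0.5 * erfc(x / sqrt(2))

mean_per_game, sd_per_game, n = -0.29, 114.0, 1000
mean_S = n * mean_per_game                 # -290
sd_S = sd_per_game * sqrt(n)               # about 3605

print(Q(-mean_S / sd_S))                   # P{S > 0}   , about 0.468
print(Q((1000 - mean_S) / sd_S))           # P{S > 1000}, about 0.360

x = 0.253                                  # Q(0.253) is about 0.4 (table value)
print((x * sd_per_game / abs(mean_per_game)) ** 2)   # about 9891 games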
For X with the Poisson distribution with mean 10,

P{X = 12} = 10^{12} e^{-10}/12! = 0.09478.

(c) Since X has mean and variance equal to λ = 10, the random variable X̃ with the N(10, 10)
distribution can be used to approximate X, with a continuity correction:

P{X = 12} ≈ P{ 11.5 ≤ X̃ ≤ 12.5 } = P{ (11.5 - 10)/√10 ≤ (X̃ - 10)/√10 ≤ (12.5 - 10)/√10 } = Q(0.47) - Q(0.79) ≈ 0.10,

which is off from the exact value by around 10%. This inaccuracy is not too surprising, since ten is a small number of
random variables for the CLT-based Gaussian approximation (a Poisson random variable with
mean 10 has the same distribution as the sum of ten independent Poisson random variables, each with mean one).
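A short sketch comparing the exact Poisson probability with the continuity-corrected Gaussian approximation used in part (c):

from math import erfc, exp, factorial, sqrt

def Q(x):
    return 0.5 * erfc(x / sqrt(2))

lam, k = 10, 12
exact = lam**k * exp(-lam) / factorial(k)            # 0.09478
approx = Q((k - 0.5 - lam) / sqrt(lam)) - Q((k + 0.5 - lam) / sqrt(lam))
print(exact, approx)                                 # about 0.0948 vs 0.103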
4.36. [Conditional means for a joint Gaussian pdf]
(a) Since X and X + Y are independent, they are uncorrelated. So

0 = Cov(X, X + Y) = Cov(X, X) + Cov(X, Y) = Var(X) + Cov(X, Y) = 1 + Cov(X, Y).

Therefore, Cov(X, Y) = -1.

(b) Since X and X + Y are independent, E[X | X + Y = 2] = E[X] = 0.

(c) Since X and Y are jointly Gaussian,

E[Y | X = 2] = Ê[Y | X = 2] = μ_Y + (Cov(Y, X)/Var(X))(2 - μ_X) = 0 + (-1/1)(2 - 0) = -2.
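A Monte Carlo check of (a)-(c). The text specifies Var(X) = 1, E[X] = 0, and that W = X + Y is independent of X; in the sketch below W is taken to be N(0, 1), which is an assumption (its variance is immaterial, and its zero mean matches the answer -2 in part (c)).

import numpy as np

rng = np.random.default_rng(0)
n = 2_000_000
x = rng.normal(size=n)                  # X ~ N(0, 1)
w = rng.normal(size=n)                  # W = X + Y, independent of X (assumed N(0,1))
y = w - x                               # so that X + Y = W

print(np.cov(x, y)[0, 1])               # about -1, as in (a)
print(x[np.abs(w - 2) < 0.05].mean())   # E[X | X + Y = 2], about 0 as in (b)
print(y[np.abs(x - 2) < 0.05].mean())   # E[Y | X = 2], about -2 as in (c)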
4.38. [Jointly Gaussian Random Variables I]
(a) X has the N(1, 9) distribution; f_X(u) = (1/√(18π)) e^{-(u-1)^2/18}.

(b) By the formula for wide sense conditional expectation, Ê[Y | X = 5] = 2 + (6/9)(5 - 1) = 4.667,
and by the formula for the corresponding MSE, σ_e^2 = σ_Y^2 - Cov(X, Y)^2/σ_X^2 = 16 - 6^2/9 = 12.
(c) Since Z and Y are linear combinations of the jointly Gaussian random variables X
and Y, the variables Z and Y are jointly Gaussian. Therefore, the best unconstrained
estimator of Y given Z is the best linear estimator of Y given Z. So

g*(Z) = L*(Z) = Ê[Y | Z] = E[Y] + (Cov(Y, Z)/Var(Z))(Z - E[Z]).

Using

Cov(Y, Z) = Cov(Y, X + 4Y - 1) = Cov(Y, X) + 4 Var(Y) = 8 + 100 = 108,

we find

g*(Z) = L*(Z) = 5 + (108/480)(Z - 23) = 5 + (27/120)(Z - 23)

and

MSE = Var(Y) - (Cov(Y, Z))^2/Var(Z) = 25 - (108)^2/480 = 0.7.
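The numbers in part (c) follow from a couple of lines of arithmetic with the covariance values used above (Cov(X, Y) = 8, Var(Y) = 25, Var(Z) = 480):

var_Y, cov_XY, var_Z = 25.0, 8.0, 480.0     # values used in the solution above
cov_YZ = cov_XY + 4 * var_Y                 # Cov(Y, X + 4Y - 1) = 108
print(cov_YZ / var_Z)                       # slope 108/480 = 0.225 = 27/120
print(var_Y - cov_YZ**2 / var_Z)            # MSE = 25 - 108^2/480 = 0.7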
The conditional mean is 152 + ((35)(0.71)/19)(u - 67) = 152 + 1.31(u - 67), and σ_e = σ_Y √(1 - ρ^2) = 35√(1 - 0.71^2) = 24.65. Thus, given X = u, the conditional distribution of Y is normal with mean 152 + 1.31(u - 67) and standard deviation 24.65 (i.e.
variance 607.5).
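The conditional-distribution parameters quoted here follow from the standard bivariate normal formulas; the sketch below plugs in the values that survive in the solution (μ_X = 67, σ_X = 19, μ_Y = 152, σ_Y = 35, ρ = 0.71), with u = 70 as a hypothetical example value of X.

from math import sqrt

mu_X, sigma_X = 67.0, 19.0
mu_Y, sigma_Y = 152.0, 35.0
rho = 0.71

slope = rho * sigma_Y / sigma_X             # about 1.31
sigma_e = sigma_Y * sqrt(1 - rho**2)        # about 24.65

u = 70.0                                    # hypothetical observed value of X
print(mu_Y + slope * (u - mu_X), sigma_e)   # conditional mean and std deviation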
Index
area rule for expectation, 137
average error probability, 63
Bayes formula, 53
Bernoulli distribution, 37, 240
Bernoulli process, 43
binomial coefficient, 13
binomial distribution, 38, 240
cardinality, 10
Cauchy distribution, 131
CDF, see cumulative distribution function
central limit theorem, 213
Chebychev inequality, 51
complementary CDF, 104
conditional
expectation, 167
mean, see conditional expectation
pdf, 167
probability, 32
confidence interval, 52
confidence level, 52
correlation
coefficient, 196
count times, 108
coupon collector problem, 43
cumulative distribution function, 95
inverse of, 135
joint, 161
decision rule, 60
De Morgan's laws, 7
determinant, 189
distribution of a function of a random variable,
125
Erlang distribution, 112
event, 6
events, 8
right, 96
LOTUS, see law of the unconscious statistician
LRT, see likelihood ratio test
Maclaurin series, 38
MAP decision rule, see maximum a posteriori decision rule
Markov inequality, 50
maximum a posteriori decision rule, 63
maximum likelihood
decision rule, 61
parameter estimation, 124
maximum likelihood parameter estimate, 47
mean of a random variable, 27
mean square error, 204, 205
memoryless property
in continuous time, 105
in discrete time, 43, 241
miss probability, 61
mode of a pmf, 39
moment, 31
n choose k, 13
negative binomial distribution, 45
network outage, 67
one-to-one function, 191
partition, 7, 53
pdf, see probability density function
pmf, see probability mass function
Poisson distribution, 241
Poisson process, 109
principle of counting, 10
principle of over counting, 12
prior probabilities, 62
probability density function, 100
conditional, 167
joint, 165
marginal, 167
probability mass function, 25
conditional, 164
joint, 163
marginal, 164
probability measure, 8
probability space, 8
product set, 176
random variable, 25
continuous-type, 99
discrete-type, 25, 98
Rayleigh distribution, 139, 193
sample mean, 201
sample path, 43
sample space, 6, 8
sample variance, 201
scaling rule for pdfs, 113
Schwarz's inequality, 198
sensor fusion, 65
standard deviation, 31
standard normal distribution, 115
standardized version of a random variable, 31
support
of a joint pdf, 166
of a pdf, 100
of a pmf, 25
Taylor series, 38
unbiased estimator, 201
uncorrelated random variables, 196
uniform prior, 63
union bound, 67
variance, 30
wide-sense conditional expectation, 207