Probability For Statistics and Machine Learning
Series Editors:
G. Casella
S. Fienberg
I. Olkin
Anirban DasGupta
Department of Statistics
Purdue University
150 N. University Street
West Lafayette, IN 47907, USA
[email protected]
Mathematica® is a registered trademark of Wolfram Research, Inc.
Preface
and thoughtfulness that she put into correcting my numerous errors. The staff at
SPi Technologies, Chennai, India did an astounding and marvelous job of produc-
ing this book. Six anonymous reviewers gave extremely gracious and constructive
comments, and their input has helped me in various dimensions to make this a
better book. Doug Crabill is the greatest computer systems administrator, and with
an infectious pleasantness has bailed me out of my stupidity far too many times.
I also want to mention my fond memories and deep-rooted feelings for the Indian
Statistical Institute, where I had all of my college education. It was just a wonderful
place for research, education, and friendships. Nearly everything that I know is due
to my years at the Indian Statistical Institute, and for this I am thankful.
This is the third time that I have written a book in contract with John Kimmel.
John is much more than a nearly unique person in the publishing world. To me,
John epitomizes sensitivity and professionalism, a singular combination. I have now
known John for almost six years, and it is very very difficult not to appreciate and
admire him a whole lot for his warmth, style, and passion for the subjects of statis-
tics and probability. Ironically, the day that this book entered production, the news
came that John was leaving Springer. I will remember John’s contribution to my
professional growth with enormous respect and appreciation.
Chapter 1
Review of Univariate Probability
are HH, HT, TH, TT, with H denoting the occurrence of heads and T denoting the occurrence of tails. We call
Ω = {1, 2, 3, ...}.
Sample spaces can also be uncountably infinite; for example, consider the experiment of choosing a number at random from the interval [0, 1]. The sample space of this experiment is Ω = [0, 1]. In this case, Ω is an uncountably infinite set. In all cases, individual elements of a sample space are denoted as ω. The first task is to define events and to explain the meaning of the probability of an event.
Definition 1.1. Let Ω be the sample space of an experiment. Then any subset A of Ω, including the empty set and the entire sample space Ω, is called an event.
Events may contain even a single sample point ω, in which case the event is a singleton set {ω}. We want to assign probabilities to events. But we want to assign probabilities in a way that they are logically consistent. In fact, this cannot be done in general if we insist on assigning probabilities to arbitrary collections of sample points, that is, arbitrary subsets of the sample space Ω. We can only define probabilities for such subsets of Ω that are tied together like a family, the exact concept being that of a σ-field. In most applications, including those cases where the sample space is infinite, events that we would normally want to think about will be members of such an appropriate σ-field. So we do not mention the need for consideration of σ-fields any further, and get along with thinking of events as subsets of the sample space Ω, including in particular the empty set and the entire sample space Ω itself.
Here is a definition of what counts as a legitimate probability on events: a probability, or probability measure, P assigns to each event A a number P(A) such that (a) P(A) ≥ 0 for every event A; (b) P(Ω) = 1; (c) for any countable collection of pairwise disjoint events A_1, A_2, ..., P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).
Property (c) is known as countable additivity. Note that it is not something that
can be proved, but it is like an assumption or an axiom. In our experience, we have
seen that operating as if the assumption is correct leads to useful and credible an-
swers in many problems, and so we accept it as a reasonable assumption. Not all
probabilists agree that countable additivity is natural; but we do not get into that debate in this book. One important point is that finite additivity is subsumed in countable additivity; that is, if A_1, A_2, ..., A_m are some finite number m of disjoint subsets of Ω, then P(∪_{i=1}^m A_i) = Σ_{i=1}^m P(A_i). Also, it is useful to note that the last two conditions in the definition of a probability measure imply that P(∅), the probability of the empty set or the null event, is zero.
One notational convention is that, strictly speaking, for an event that is just a singleton set {ω}, we should write P({ω}) to denote its probability. But to reduce clutter, we simply use the more convenient notation P(ω).
One pleasant consequence of the axiom of countable additivity is the following
basic result. We do not prove it here as it is a simple result; see DasGupta (2010) for
a proof.
Example 1.1 (The Shoe Problem). Suppose there are five pairs of shoes in a closet and four shoes are taken out at random. What is the probability that among the four that are taken out, there is at least one complete pair?
The total number of sample points is C(10, 4) = 210. Because selection was done completely at random, we assume that all sample points are equally likely. At least one complete pair would mean two complete pairs, or exactly one complete pair and two other nonconforming shoes. Two complete pairs can be chosen in C(5, 2) = 10 ways. Exactly one complete pair can be chosen in C(5, 1) C(4, 2) × 2 × 2 = 120 ways. The C(5, 1) term is for choosing the pair that is complete; the C(4, 2) term is for choosing two incomplete pairs, and then from each incomplete pair, one chooses the left or the right shoe. Thus, the probability that there will be at least one complete pair among the four shoes chosen is (10 + 120)/210 = 13/21 = .62.
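The counting argument above is easy to confirm by brute force; the following sketch (the labels and variable names are mine, not the book's) enumerates all C(10, 4) = 210 possible draws:

```python
from itertools import combinations
from fractions import Fraction

# Shoes labeled (pair, side): 5 pairs, a left and a right shoe each.
shoes = [(pair, side) for pair in range(5) for side in ("L", "R")]

draws = list(combinations(shoes, 4))
favorable = 0
for draw in draws:
    pairs = [p for p, _ in draw]
    # at least one complete pair: some pair label appears twice in the draw
    if any(pairs.count(p) == 2 for p in set(pairs)):
        favorable += 1

prob = Fraction(favorable, len(draws))
print(prob)  # 13/21
```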
Example 1.2 (Five-Card Poker). In five-card poker, a player is given 5 cards from a full deck of 52 cards at random. Various named hands of varying degrees of rarity exist. In particular, we want to calculate the probabilities of A = two pairs and B = a flush. Two pairs is a hand with 2 cards each of 2 different denominations and the fifth card of some other denomination; a flush is a hand with 5 cards of the same suit, but the cards cannot be of denominations in a sequence.
Then, P(A) = C(13, 2) C(4, 2)^2 C(44, 1) / C(52, 5) = .04754.
To find P(B), note that there are 10 ways to select 5 cards from a suit such that the cards are in a sequence, namely, {A, 2, 3, 4, 5}, {2, 3, 4, 5, 6}, ..., {10, J, Q, K, A}, and so,
P(B) = C(4, 1) [C(13, 5) − 10] / C(52, 5) = .00197.
These are basic examples of counting arguments that are useful whenever there
is a finite sample space and we assume that all sample points are equally likely.
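Both probabilities can be reproduced with exact integer arithmetic; a quick sketch using Python's math.comb:

```python
from math import comb

# Two pairs: choose 2 denominations for the pairs, 2 suits within each,
# then the fifth card from the 44 cards of the remaining 11 denominations.
p_two_pairs = comb(13, 2) * comb(4, 2)**2 * comb(44, 1) / comb(52, 5)

# Flush: one suit, any 5 of its 13 cards, minus the 10 straight sequences.
p_flush = comb(4, 1) * (comb(13, 5) - 10) / comb(52, 5)

print(round(p_two_pairs, 5), round(p_flush, 5))  # 0.04754 0.00197
```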
A major result in combinatorial probability is the inclusion–exclusion formula, which says the following. Let A_1, A_2, ..., A_n be n events, and let
S_1 = Σ_{i=1}^n P(A_i); S_2 = Σ_{1≤i<j≤n} P(A_i ∩ A_j); S_3 = Σ_{1≤i<j<k≤n} P(A_i ∩ A_j ∩ A_k); etc.
Then,
P(∪_{i=1}^n A_i) = Σ_{i=1}^n P(A_i) − Σ_{1≤i<j≤n} P(A_i ∩ A_j) + Σ_{1≤i<j<k≤n} P(A_i ∩ A_j ∩ A_k) − ... + (−1)^{n+1} P(A_1 ∩ A_2 ∩ ... ∩ A_n)
= S_1 − S_2 + S_3 − ... + (−1)^{n+1} S_n.
Example 1.3 (Missing Suits in a Bridge Hand). Consider a specific player, say
North, in a Bridge game. We want to calculate the probability that North’s hand
is void in at least one suit. Towards this, denote the suits as 1, 2, 3, 4 and let
Ai D North’s hand is void in suit i.
Then, by the inclusion–exclusion formula,
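The example's actual computation is cut off in this excerpt; the following sketch (the helper name is mine) shows how the inclusion–exclusion sum plays out numerically, using P(void in j specified suits) = C(52 − 13j, 13)/C(52, 13):

```python
from math import comb

deck, hand = 52, 13

def p_void_suits(j):
    # probability that North's 13 cards avoid j specified suits entirely
    return comb(deck - 13 * j, hand) / comb(deck, hand)

# inclusion-exclusion over the 4 suits (a hand cannot be void in all 4)
p = sum((-1)**(j + 1) * comb(4, j) * p_void_suits(j) for j in range(1, 4))
print(round(p, 4))  # ≈ 0.051
```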
The inclusion–exclusion formula can be hard to apply exactly, because the quantities S_j for large indices j can be difficult to calculate. However, fortunately, the inclusion–exclusion formula leads to bounds in both directions for the probability of the union of n general events. Writing p_n = P(∪_{i=1}^n A_i), we have the following series of bounds:
p_n ≤ S_1; p_n ≥ S_1 − S_2; p_n ≤ S_1 − S_2 + S_3; ....
In addition,
P(∩_{i=1}^n A_i) ≥ 1 − Σ_{i=1}^n P(A_i^c).
Both conditional probability and independence are fundamental concepts for proba-
bilists and statisticians alike. Conditional probabilities correspond to updating one’s
beliefs when new information becomes available. Independence corresponds to ir-
relevance of a piece of new information, even when it is made available. In addition,
the assumption of independence can and does significantly simplify development,
mathematical analysis, and justification of tools and procedures.
Definition 1.4. Let A, B be general events with respect to some sample space Ω, and suppose P(A) > 0. The conditional probability of B given A is defined as
P(B|A) = P(A ∩ B)/P(A).
Theorem 1.4. (a) (Multiplicative Formula). For any two events A, B such that P(A) > 0, one has P(A ∩ B) = P(A)P(B|A);
(b) For any two events A, B such that 0 < P(A) < 1, one has P(B) = P(B|A)P(A) + P(B|A^c)P(A^c);
(c) (Total Probability Formula). If A_1, A_2, ..., A_k form a partition of the sample space Ω (i.e., A_i ∩ A_j = ∅ for all i ≠ j, and ∪_{i=1}^k A_i = Ω), and if 0 < P(A_i) < 1 for all i, then
P(B) = Σ_{i=1}^k P(B|A_i)P(A_i).
Example 1.4. One of two urns has a red and b black balls, and the other has c red
and d black balls. One ball is chosen at random from each urn, and then one of these
two balls is chosen at random. What is the probability that this ball is red?
If each ball selected from the two urns is red, then the final ball is definitely red. If exactly one of those two balls is red, then the final ball is red with probability 1/2. If neither of those two balls is red, then the final ball cannot be red. Thus, the probability that the final ball is red is
[ac + (ad + bc)/2] / [(a + b)(c + d)] = (2ac + ad + bc) / [2(a + b)(c + d)].
For instance, with a = 99, b = 1, c = 1, d = 1, this equals 298/400 = .745. Although the total percentage of red balls in the two urns is more than 98%, the chance that the final ball selected would be red is just about 75%.
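A small check of the formula with exact rational arithmetic; the counts a = 99, b = 1, c = 1, d = 1 are the ones consistent with the .745 and 98% figures quoted above:

```python
from fractions import Fraction

def p_red(a, b, c, d):
    # condition on how many of the two drawn balls are red:
    # both red -> red for sure; exactly one red -> 1/2; none red -> 0
    both = Fraction(a, a + b) * Fraction(c, c + d)
    one = (Fraction(a, a + b) * Fraction(d, c + d)
           + Fraction(b, a + b) * Fraction(c, c + d))
    return both + one / 2

print(p_red(99, 1, 1, 1))  # 149/200, i.e. .745
```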
Example 1.5 (A Clever Conditioning Argument). Coin A gives heads with probabil-
ity s and coin B gives heads with probability t. They are tossed alternately, starting
off with coin A. We want to find the probability that the first head is obtained on
coin A.
We find this probability by conditioning on the outcomes of the first two tosses;
more precisely, define
A_1 = {H} = first toss gives H; A_2 = {TH}; A_3 = {TT}.
Let also
A = the first head is obtained on coin A.
One of the three events A_1, A_2, A_3 must happen, and they are also mutually exclusive. Therefore, by the total probability formula,
P(A) = Σ_{i=1}^3 P(A_i)P(A|A_i) = s × 1 + (1 − s)t × 0 + (1 − s)(1 − t)P(A)
⇒ P(A) = s/[1 − (1 − s)(1 − t)] = s/(s + t − st).
As an example, let s = .4, t = .5. Note that coin A is biased against heads. Even then, s/(s + t − st) = .4/.7 = .57 > .5. We see that there is an advantage in starting first.
Example 1.6 (Lotteries). Although many people buy lottery tickets out of an expec-
tation of good luck, probabilistically speaking, buying lottery tickets is usually a
waste of money. Here is an example. Suppose in a weekly state lottery, five of the numbers 00, 01, ..., 49 are selected without replacement at random, and someone holding exactly those numbers wins the lottery. Then, the probability that someone holding one ticket will be the winner in a given week is
1/C(50, 5) = 4.72 × 10^{−7}.
Suppose this person buys a ticket every week for 40 years. Then, the probability that he will win the lottery on at least one week is 1 − (1 − 4.72 × 10^{−7})^{52×40} = .00098 < .001, still a very small probability. We assumed in this calculation that the weekly lotteries are all mutually independent, a reasonable assumption. The calculation would fall apart if we did not make this independence assumption.
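Both figures are easy to reproduce:

```python
from math import comb

p_week = 1 / comb(50, 5)                # one ticket, one weekly draw
p_ever = 1 - (1 - p_week)**(52 * 40)    # at least one win in 40 years of weekly play
print(p_week, round(p_ever, 5))
```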
It is not uncommon to see the conditional probabilities P(A|B) and P(B|A) confused with each other. Suppose in some group of lung cancer patients, we see a large percentage of smokers. If we define B to be the event that a person is a smoker, and A to be the event that a person has lung cancer, then all we can conclude is that in our group of people P(B|A) is large. But we cannot conclude from just this information that smoking increases the chance of lung cancer, that is, that P(A|B) is large. In order to calculate a conditional probability P(A|B) when we know the other conditional probability P(B|A), a simple formula known as Bayes' theorem is useful. Here is a statement of a general version of Bayes' theorem. If A_1, A_2, ..., A_m form a partition of the sample space Ω, with P(A_i) > 0 for all i, and B is an event with P(B) > 0, then for each j,
P(A_j|B) = P(B|A_j)P(A_j) / Σ_{i=1}^m P(B|A_i)P(A_i).
Example 1.7 (Multiple Choice Exams). Suppose that the questions in a multiple
choice exam have five alternatives each, of which a student has to pick one as the
correct alternative. A student either knows the truly correct alternative with proba-
bility :7, or she randomly picks one of the five alternatives as her choice. Suppose a
particular problem was answered correctly. We want to know what the probability
is that the student really knew the correct answer.
Define A = the student knew the correct answer, and B = the question was answered correctly. Then, by Bayes' theorem,
P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|A^c)P(A^c)] = (1 × .7)/(1 × .7 + .2 × .3) = .921.
Before the student answered the question, our probability that she would know the
correct answer to the question was :7; but once she answered it correctly, the poste-
rior probability that she knew the correct answer increases to :921. This is exactly
what Bayes’ theorem does; it updates our prior belief to the posterior belief, when
new evidence becomes available.
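The posterior computation, spelled out (the variable names are mine):

```python
p_knows = 0.7          # prior: student knows the answer
p_correct_knows = 1.0  # she answers correctly if she knows
p_correct_guess = 0.2  # otherwise she guesses among 5 alternatives

# Bayes' theorem: P(knows | correct)
posterior = (p_correct_knows * p_knows) / (
    p_correct_knows * p_knows + p_correct_guess * (1 - p_knows))
print(round(posterior, 3))  # 0.921
```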
In some sense, the entire subject of probability and statistics is about distributions of
random variables. Random variables, as the very name suggests, are quantities that
vary, over time, or from individual to individual, and the reason for the variability is
some underlying random process. Depending on exactly how an underlying exper-
iment ends, the random variable takes different values. In other words, the value
of the random variable is determined by the sample point ! that prevails, when the
underlying experiment is actually conducted. We cannot know a priori the value
of the random variable, because we do not know a priori which sample point ! will
prevail when the experiment is conducted. We try to understand the behavior of
a random variable by analyzing the probability structure of that underlying random
experiment.
Random variables, like probabilities, originated in gambling. Therefore, the random variables that come to us most naturally are integer-valued random variables; for example, the sum of the two rolls when a die is rolled twice. Integer-valued
random variables are special cases of what are known as discrete random variables.
Discrete or not, a common mathematical definition of all random variables is the
following.
Definition 1.6. Let Ω be a sample space corresponding to some experiment and let X : Ω → R be a function from the sample space to the real line. Then X is called a random variable.
Discrete random variables are those that take a finite or a countably infinite
number of possible values. In particular, all integer-valued random variables are
discrete. From the point of view of understanding the behavior of a random variable,
the important thing is to know the probabilities with which X takes its different
possible values.
Proposition. Let F(x) be the CDF of some random variable X. Then, for any x,
(a) P(X = x) = F(x) − lim_{y↑x} F(y) = F(x) − F(x−), including those points x for which P(X = x) = 0.
(b) P(X ≥ x) = P(X > x) + P(X = x) = (1 − F(x)) + (F(x) − F(x−)) = 1 − F(x−).
F(x) = 0 if x < 0;
= .304 if 0 ≤ x < 1;
= .743 if 1 ≤ x < 2;
= .956 if 2 ≤ x < 3;
= .997 if 3 ≤ x < 4;
= 1 if x ≥ 4.
Example 1.9 (Indicator Variables). Consider the experiment of rolling a fair die
twice and now define a random variable Y as follows.
Definition 1.10. Let A be any event in a sample space Ω. The indicator random variable for A is defined as
I_A = 1 if A happens;
I_A = 0 if A does not happen.
Thus, the distribution of an indicator variable is simply P(I_A = 1) = P(A); P(I_A = 0) = 1 − P(A).
An indicator variable is also called a Bernoulli variable with parameter p, where
p is just P .A/. We later show examples of uses of indicator variables in calculation
of expectations.
In applications, we are sometimes interested in the distribution of a function,
say g.X /, of a basic random variable X . In the discrete case, the distribution of a
function is found in the obvious way.
Note that g(X) is a one-to-one function of X, but h(X) is not one-to-one. The values of Y are 0, ±1, ±8, ±27. For example, P(Y = 0) = P(X = 0) = c = 5/13; P(Y = 1) = P(X = 1) = c/2 = 5/26, and so on. In general, for y = 0, ±1, ±8, ±27, P(Y = y) = P(X = y^{1/3}) = c/(1 + y^{2/3}), with c = 5/13.
However, Z = h(X) is not a one-to-one function of X. The possible values of Z are as follows.

x     h(x)
−3    1
−2    0
−1    1
0     0
1     1
2     0
3     1
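As a check of the one-to-one case, the following sketch computes the pmf of Y = X^3 from the pmf P(X = x) = c/(1 + x^2) on {−3, ..., 3} used in this example:

```python
from fractions import Fraction

# pmf of X on {-3, ..., 3}: P(X = x) = c/(1 + x^2), with c = 5/13
c = Fraction(5, 13)
pmf_x = {x: c / (1 + x * x) for x in range(-3, 4)}
assert sum(pmf_x.values()) == 1  # valid pmf

# distribution of Y = g(X) = X^3; g is one-to-one, so probabilities carry over
pmf_y = {}
for x, p in pmf_x.items():
    pmf_y[x**3] = pmf_y.get(x**3, 0) + p

print(pmf_y[0], pmf_y[1])  # 5/13 5/26
```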
Example 1.11 (Two Simple Illustrations). Consider the experiment of tossing a fair
coin (or any coin) four times. Suppose X1 is the number of heads in the first two
tosses, and X2 is the number of heads in the last two tosses. Then, it is intuitively
clear that X1 ; X2 are independent, because the last two tosses carry no informa-
tion regarding the first two tosses. The independence can be easily mathematically
verified by using the definition of independence.
Next, consider the experiment of drawing 13 cards at random from a deck of 52 cards. Suppose X_1 is the number of aces and X_2 is the number of clubs among the 13 cards. Then, X_1, X_2 are not independent. For example, P(X_1 = 4, X_2 = 0) = 0, but P(X_1 = 4) and P(X_2 = 0) are both > 0, and so P(X_1 = 4)P(X_2 = 0) > 0. So, X_1, X_2 cannot be independent.
The next few results summarize the most fundamental properties of expectations.
Example 1.12. Let X be the number of heads obtained in two tosses of a fair coin. The pmf of X is p(0) = p(2) = 1/4, p(1) = 1/2. Therefore, E(X) = 0 × 1/4 + 1 × 1/2 + 2 × 1/4 = 1. Because the coin is fair, we expect it to show heads 50% of the number of times it is tossed, which is 50% of 2, that is, 1.
Example 1.13 (Dice Sum). Let X be the sum of the two rolls when a fair die is rolled twice. The pmf of X is p(2) = p(12) = 1/36; p(3) = p(11) = 2/36; p(4) = p(10) = 3/36; p(5) = p(9) = 4/36; p(6) = p(8) = 5/36; p(7) = 6/36. Therefore, E(X) = 2 × 1/36 + 3 × 2/36 + 4 × 3/36 + ... + 12 × 1/36 = 7. This can also be seen by letting X_1 = the face obtained on the first roll, X_2 = the face obtained on the second roll, and by using E(X) = E(X_1 + X_2) = E(X_1) + E(X_2) = 3.5 + 3.5 = 7.
Let us now make this problem harder. Suppose that a fair die is rolled 10 times and X is the sum of all 10 rolls. The pmf of X is no longer so simple; it will be cumbersome to write it down. But, if we let X_i = the face obtained on the ith roll, it is still true by the linearity of expectations that E(X) = E(X_1 + X_2 + ... + X_10) = E(X_1) + E(X_2) + ... + E(X_10) = 3.5 × 10 = 35. We can easily compute the expectation, although the pmf would be difficult to write down.
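The dice-sum expectation can be confirmed by full enumeration of the 36 equally likely outcomes:

```python
from fractions import Fraction

# exact expectation of the sum of two fair dice by enumerating all outcomes
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
e_sum = sum(Fraction(i + j, 1) for i, j in outcomes) / len(outcomes)
print(e_sum)  # 7
```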
Example 1.14 (A Random Variable Without a Finite Expectation). Let X take the positive integers 1, 2, 3, ... as its values with the pmf
p(x) = P(X = x) = 1/(x(x + 1)), x = 1, 2, 3, ....
This is a valid pmf, because obviously 1/(x(x + 1)) > 0 for any x = 1, 2, 3, ..., and also the infinite series Σ_{x=1}^∞ 1/(x(x + 1)) sums to 1, a fact from calculus. Now,
E(X) = Σ_{x=1}^∞ x p(x) = Σ_{x=1}^∞ x/(x(x + 1)) = Σ_{x=1}^∞ 1/(x + 1) = Σ_{x=2}^∞ 1/x = ∞,
because the harmonic series diverges; so X does not have a finite expectation.
Proposition. Let X be an integer-valued random variable such that it can be represented as X = Σ_{i=1}^m c_i I_{A_i} for some m, constants c_1, c_2, ..., c_m, and suitable events A_1, A_2, ..., A_m. Then, E(X) = Σ_{i=1}^m c_i P(A_i).
Example 1.15 (Coin Tosses). Suppose a coin that has probability p of showing heads in any single toss is tossed n times, and let X denote the number of times in the n tosses that a head is obtained. Then, X = Σ_{i=1}^n I_{A_i}, where A_i is the event that a head is obtained in the ith toss. Therefore, E(X) = Σ_{i=1}^n P(A_i) = Σ_{i=1}^n p = np. A direct calculation of the expectation would involve finding the pmf of X and obtaining the sum Σ_{x=0}^n x P(X = x); it can also be done that way, but that is a much longer calculation.
The random variable X of this example is a binomial random variable with parameters n and p. Its pmf is given by the formula P(X = x) = C(n, x) p^x (1 − p)^{n−x}, x = 0, 1, 2, ..., n.
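A numerical check that the longer pmf-based sum indeed agrees with np (the particular n and p are my illustrative choices):

```python
from math import comb

n, p = 10, 0.3
# direct expectation from the binomial pmf: sum of x * P(X = x)
e_direct = sum(x * comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1))
print(e_direct, n * p)
```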
Example 1.16 (Consecutive Heads in Coin Tosses). Suppose a coin with probability p for heads in a single toss is tossed n times. How many times can we expect to see a head followed by at least one more head? For example, if n = 5 and we see the outcome HTHHH, then we see a head followed by at least one more head twice. Define A_i = the ith and the (i + 1)th toss both result in heads. Then
X = number of times a head is followed by at least one more head = Σ_{i=1}^{n−1} I_{A_i},
and so E(X) = Σ_{i=1}^{n−1} P(A_i) = Σ_{i=1}^{n−1} p^2 = (n − 1)p^2. For example, if a fair coin is tossed 20 times, we can expect to see a head followed by another head about five times (19 × .5^2 = 4.75).
Another useful technique for calculating expectations of nonnegative integer-valued random variables is based on the CDF of the random variable, rather than directly on the pmf. This method is useful when calculating probabilities of the form P(X > x) is logically more straightforward than directly calculating P(X = x). Here is the expectation formula based on the tail CDF: if X is a nonnegative integer-valued random variable, then E(X) = Σ_{n=0}^∞ P(X > n).
Example 1.17 (Family Planning). Suppose a couple will have children until they have at least one child of each sex. How many children can they expect to have? Let X denote the childbirth at which they have a child of each sex for the first time. Suppose the probability that any particular childbirth will be a boy is p, and that all births are independent. Then, for n ≥ 1,
P(X > n) = P(the first n children are all boys or all girls) = p^n + (1 − p)^n.
Therefore, E(X) = 2 + Σ_{n=2}^∞ [p^n + (1 − p)^n] = 2 + p^2/(1 − p) + (1 − p)^2/p = 1/(p(1 − p)) − 1. If boys and girls are equally likely on any childbirth, then this says that a couple waiting to have a child of each sex can expect to have three children.
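A check of the tail-sum computation (the geometric tail is truncated at a point where it is negligible):

```python
p = 0.5  # probability that a birth is a boy
# tail-sum formula: E(X) = sum over n >= 0 of P(X > n);
# here P(X > 0) = P(X > 1) = 1, and P(X > n) = p^n + (1-p)^n for n >= 2
e_children = 2 + sum(p**n + (1 - p)**n for n in range(2, 200))
print(round(e_children, 6))  # 3.0
```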
In particular, E[(X − μ)^3] will be zero. Likewise, statisticians also use the fourth moment around the mean as a measure of how spiky the distribution is around the mean. To make these indices independent of the choice of unit of measurement (e.g., inches or centimeters), they use certain scaled measures of asymmetry and peakedness. Here are the definitions.
Definition 1.16. (a) Let X be a random variable with E[|X|^3] < ∞. The skewness of X is defined as
β = E[(X − μ)^3]/σ^3.
(b) Suppose X is a random variable with E[X^4] < ∞. The kurtosis of X is defined as
κ = E[(X − μ)^4]/σ^4 − 3.
The skewness β is zero for symmetric distributions, but the converse need not be true. The kurtosis κ is necessarily ≥ −2, but can be arbitrarily large, with spikier distributions generally having a larger kurtosis. But a very good interpretation of κ is not really available. We later show that κ = 0 for all normal distributions; hence the motivation for subtracting 3 in the definition of κ.
Example 1.18 (Variance of Number of Heads). Consider the experiment of two tosses of a fair coin and let X be the number of heads obtained. Then, we have seen that p(0) = p(2) = 1/4 and p(1) = 1/2. Thus, E(X^2) = 0 × 1/4 + 1 × 1/2 + 4 × 1/4 = 3/2, and E(X) = 1. Therefore, Var(X) = E(X^2) − μ^2 = 3/2 − 1 = 1/2, and the standard deviation is σ = √.5 = .707.
Example 1.19 (A Random Variable with an Infinite Variance). If a random variable
has a finite variance, then it can be shown that it must have a finite mean. This
example shows that the converse need not be true.
Let X be a discrete random variable with the pmf
P(X = x) = c/(x(x + 1)(x + 2)), x = 1, 2, 3, ...,
where c = 4. Therefore, by direct verification, X has a finite expectation. Let us now examine the second moment of X.
E(X^2) = Σ_{x=1}^∞ x^2 × 4/(x(x + 1)(x + 2)) = 4 Σ_{x=1}^∞ x/((x + 1)(x + 2)) = ∞,
because the series Σ_{x=1}^∞ x/((x + 1)(x + 2)) is not finitely summable, a fact from calculus. Because E(X^2) is infinite, but E(X) is finite, σ^2 = E(X^2) − [E(X)]^2 must also be infinite.
If a collection of random variables is independent, then just like the expectation,
the variance also adds up. Precisely, one has the following very useful fact.
Theorem 1.10. Let X_1, X_2, ..., X_n be n independent random variables. Then, Var(X_1 + X_2 + ... + X_n) = Var(X_1) + Var(X_2) + ... + Var(X_n).
An important corollary of this result is the following variance formula for the
mean, XN , of n independent and identically distributed random variables.
Corollary 1.1. Let X_1, X_2, ..., X_n be independent random variables with a common variance σ^2 < ∞. Let X̄ = (X_1 + ... + X_n)/n. Then Var(X̄) = σ^2/n.
1.4 Inequalities
The mean and the variance, together, have earned the status of being the two most common summaries of a distribution. A relevant question is whether μ, σ are useful summaries of the distribution of a random variable. The answer is a qualified yes. The inequalities below suggest that knowing just the values of μ, σ, it is in fact possible to say something useful about the full distribution.
Theorem 1.11. (a) (Chebyshev's Inequality). Suppose E(X) = μ and Var(X) = σ^2, assumed to be finite. Let k be any positive number. Then
P(|X − μ| ≥ kσ) ≤ 1/k^2.
(b) (Markov's Inequality). Suppose X takes only nonnegative values, and suppose E(X) = μ, assumed to be finite. Let c be any positive number. Then
P(X ≥ c) ≤ μ/c.
The virtue of these two inequalities is that they make no restrictive assumptions on the random variable X. Whenever μ, σ are finite, Chebyshev's inequality is applicable, and whenever μ is finite, Markov's inequality applies, provided the random variable is nonnegative. However, the universal nature of these inequalities also makes them typically quite conservative.
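A small illustration of how conservative Chebyshev's inequality can be: for a Binomial(100, 1/2) count (a setup of my choosing, not the book's), the exact two-sided tail at k = 2 standard deviations is far below the bound 1/k^2 = .25:

```python
from math import comb

# X ~ Binomial(100, 1/2): mu = 50, sigma = 5
n, mu, sigma = 100, 50, 5.0
pmf = [comb(n, x) * 0.5**n for x in range(n + 1)]

k = 2.0
exact = sum(p for x, p in enumerate(pmf) if abs(x - mu) >= k * sigma)
chebyshev = 1 / k**2
print(round(exact, 4), chebyshev)
```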
Although Chebyshev’s inequality usually gives conservative estimates for tail
probabilities, it does imply a major result in probability theory in a special case.
The only condition needed is that E(|X_i|) should be finite. This is called the strong law of large numbers. It is impossible to prove it without using much more sophisticated concepts and techniques than we are using here. The strong law of large numbers is treated later in the book. Inequalities better than Chebyshev's or Markov's inequality are available under additional restrictions on the distribution of the underlying random variable X. We state three other inequalities that can sometimes give bounds better than what Chebyshev's or Markov's inequality can give.
Suppose E(X) = μ and Var(X) = σ^2, assumed finite, and let k be a positive number. Then
P(X − μ ≥ kσ) ≤ 1/(k^2 + 1);
P(X − μ ≤ −kσ) ≤ 1/(k^2 + 1).
If, in addition, X is nonnegative and 0 < c < 1, then
P(X > cμ) ≥ (1 − c)^2 μ^2/(σ^2 + μ^2),
and
P(X = 0) ≤ σ^2/(σ^2 + μ^2).
These inequalities may be seen in Rao (1973), Paley and Zygmund (1932), and Alon
and Spencer (2000, p. 58), respectively.
The area of probability inequalities is an extremely rich and diverse area. The
reason for it is that inequalities are tremendously useful in giving approximate an-
swers when the exact answer to a problem, or a calculation, is very hard or perhaps
even impossible to obtain. We periodically present and illustrate inequalities over
the rest of the book. Some really basic inequalities based on moments are presented
in the next theorem.
(b) (Hölder's Inequality). Let X, Y be two random variables, and 1 < p < ∞ a real number such that E(|X|^p) < ∞. Let q = p/(p − 1), and suppose E(|Y|^q) < ∞. Then,
E(|XY|) ≤ [E(|X|^p)]^{1/p} [E(|Y|^q)]^{1/q}.
(c) (Minkowski's Inequality). Let X, Y be two random variables, and p ≥ 1 a real number such that E(|X|^p), E(|Y|^p) < ∞. Then,
[E(|X + Y|^p)]^{1/p} ≤ [E(|X|^p)]^{1/p} + [E(|Y|^p)]^{1/p}.
⇒ m(7 − m) ≤ 10.63
⇒ m ≤ 4.77,
because the quadratic m(7 − m) − 10.63 = 0 has the two roots m = 2.23, 4.77. It is interesting that this bound is reasonably accurate, as the exact value of m = E(X) is 161/36 = 4.47.
1.5 Generating and Moment-Generating Functions

Definition. Let X be a nonnegative integer-valued random variable. The probability-generating function of X is defined as G(s) = E[s^X] = Σ_{k=0}^∞ P(X = k)s^k, wherever the expectation is finite.
In this definition, 0^0 is to be understood as being equal to 1. Note that G(s) is always finite for |s| ≤ 1, but it could be finite over a larger interval, depending on the specific random variable X.
Two basic properties of the generating function are the following.
Theorem 1.15. (a) Suppose G(s) is finite in some open interval containing the origin. Then, G(s) is infinitely differentiable in that open interval, and P(X = k) = G^{(k)}(0)/k!, k ≥ 0, where G^{(0)}(0) means G(0).
(b) If lim_{s↑1} G^{(k)}(s) is finite, then E[X(X − 1) ... (X − k + 1)] exists and is finite, and G^{(k)}(1) = lim_{s↑1} G^{(k)}(s) = E[X(X − 1) ... (X − k + 1)].
Remark. The kth factorial moment of X exists if and only if the kth moment E(X^k) exists.
One of the most important properties of generating functions is the following.
One reason that the generating function is useful as a tool is its distribution
determining property, in the following sense.
Theorem 1.17. Let G(s) and H(s) be the generating functions of two random variables X, Y. If G(s) = H(s) in any nonempty open interval, then X, Y have the same distribution.
Summarizing, then, from the generating function of a nonnegative integer-valued random variable X one can find the pmf of X and every moment of X, including identifying the moments that are infinite.
Example 1.21 (Discrete Uniform Distribution). Suppose X has the discrete uniform distribution on {1, 2, ..., n}. Then, its generating function is
G(s) = E[s^X] = Σ_{x=1}^n s^x P(X = x) = (1/n) Σ_{x=1}^n s^x = s(s^n − 1)/(n(s − 1)),
by summing the geometric series Σ_{x=1}^n s^x. As a check, if we differentiate G(s) once, we get
G'(s) = [1 + s^n(n(s − 1) − 1)]/(n(s − 1)^2).
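A numerical sanity check of the closed form against the direct sum (the particular n and s are arbitrary choices of mine):

```python
# discrete uniform on {1, ..., n}: compare the closed form for G(s)
# with the direct sum defining it
n, s = 6, 0.7
g_direct = sum(s**x for x in range(1, n + 1)) / n
g_closed = s * (s**n - 1) / (n * (s - 1))
print(g_direct, g_closed)
```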
The first derivative of G(s) is G'(s) = e^{s−1}, and therefore G'(1) = e^0 = 1. From our theorem above, we conclude that E(X) = 1. Indeed, the pmf that we have in this example is the pmf of the so-called Poisson distribution with mean one. The pmf of the Poisson distribution with a general mean λ is p(x) = e^{−λ}λ^x/x!, x = 0, 1, 2, .... The Poisson distribution is an extremely important distribution in probability theory and is studied in more detail below.
We have defined the probability-generating function only for nonnegative
integer-valued random variables. The moment-generating function is usually dis-
cussed in the context of general random variables, not necessarily integer-valued,
or discrete. The two functions are connected. Here is the formal definition.
Theorem 1.18. (a) Suppose the mgf ψ(t) of a random variable X is finite in some open interval containing zero. Then, ψ(t) is infinitely differentiable in that open interval, and for any k ≥ 1,
E(X^k) = ψ^{(k)}(0).
For the discrete uniform distribution on {1, ..., n},
ψ(t) = E[e^{tX}] = (1/n) Σ_{k=1}^n e^{tk} = e^t(e^{nt} − 1)/(n(e^t − 1)).
By direct differentiation,
ψ'(t) = e^t(1 + ne^{(n+1)t} − (n + 1)e^{nt})/(n(e^t − 1)^2).
On applying L'Hospital's rule twice, we get the previously derived fact that E(X) = (n + 1)/2.
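The mgf route to moments can also be checked numerically: define ψ(t) by direct summation and estimate ψ'(0) with a central difference (the step size and tolerance are my choices), which should reproduce the mean (n + 1)/2:

```python
from math import exp

# mgf of the discrete uniform distribution on {1, ..., n}, by direct summation
n = 6
def psi(t):
    return sum(exp(t * k) for k in range(1, n + 1)) / n

# E(X) = psi'(0), estimated by a central difference
h = 1e-5
mean_est = (psi(h) - psi(-h)) / (2 * h)
print(round(mean_est, 4))  # ~ (n + 1)/2 = 3.5
```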
Definition 1.20. Let a random variable X have a finite jth moment for some specified j ≥ 1. The jth central moment of X is defined as μ_j = E[(X − μ)^j], where μ = E(X).
Definition 1.21. Let X have a finite mgf ψ(t) in some neighborhood of zero, and let K(t) = log ψ(t), when it exists. The rth cumulant of X is defined as κ_r = (d^r/dt^r) K(t)|_{t=0}. Equivalently, the cumulants of X are the coefficients κ_n in the power series expansion K(t) = Σ_{n=1}^∞ κ_n t^n/n!, within its radius of convergence.
Note that K(t) = log ψ(t) implies that e^{K(t)} = ψ(t). By equating coefficients in the power series expansion of e^{K(t)} with those in the power series expansion of ψ(t), it is easy to express the first few moments (and therefore, the first few central moments) in terms of the cumulants. Indeed, denoting c_i = E(X^i), μ = E(X) = c_1, μ_i = E(X − μ)^i, σ^2 = μ_2, one obtains the expressions
c_1 = κ_1; c_2 = κ_2 + κ_1^2; c_3 = κ_3 + 3κ_1κ_2 + κ_1^3;
c_4 = κ_4 + 4κ_1κ_3 + 3κ_2^2 + 6κ_1^2κ_2 + κ_1^4.
The corresponding expressions for the central moments are much simpler:
μ_2 = κ_2; μ_3 = κ_3; μ_4 = κ_4 + 3κ_2^2,
or, inverting,
κ_2 = μ_2; κ_3 = μ_3; κ_4 = μ_4 − 3μ_2^2.
High-order cumulants have quite complex expressions in terms of the central moments μ_j; the corresponding expressions in terms of the c_j are even more complex. The derivations of the expressions stated above involve straightforward differentiation; we do not present the algebra. It is useful to know these expressions for some problems in statistics.
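The moment-cumulant relations above can be checked numerically on a distribution with known cumulants; all cumulants of a Poi(λ) distribution equal λ, since K(t) = λ(e^t − 1). A sketch (not part of the text):

```python
import math

# Numeric check of the moment-cumulant relations, using Poi(lam),
# whose cumulants are all equal to lam.
lam = 2.0
k1 = k2 = k3 = k4 = lam

def raw_moment(j, terms=80):
    # E(X^j) computed directly from the Poisson pmf (truncated sum)
    return sum(x**j * math.exp(-lam) * lam**x / math.factorial(x)
               for x in range(terms))

c1, c2, c3, c4 = (raw_moment(j) for j in range(1, 5))
```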
1.6 Applications of Generating Functions to a Pattern Problem

We now describe some problems in discrete probability that are generally known as
problems of patterns. Generating functions turn out to be crucially useful in analyzing many of these problems. Suppose a coin with probability p for heads is tossed
repeatedly. How long does it take before we see three heads in succession for the
first time? Questions such as this, which pertain to waiting times for seeing one or
more specified patterns, are particularly amenable to the use of generating functions.
Theorem 1.19. Suppose a coin with probability p for heads is tossed repeatedly,
and N = N_r is the first toss at which a head run of length r is obtained. Then
$$E(N) = \sum_{k=1}^{r} \left(\frac{1}{p}\right)^k; \qquad \mathrm{Var}(N) = \frac{1 - p^{2r+1} - (2r+1)\,q\,p^r}{q^2 p^{2r}},$$
where q = 1 − p.
Proof. Let p_k = P(N = k). The trick is to write a recursion relation for the sequence p_k and then convert it to a generating function problem. This technique has
been found to be successful in solving numerous hard combinatorial problems.
Clearly, p₁ = p₂ = ⋯ = p_{r−1} = 0. Also, p_r = p^r. The first head run of
length r occurs at the (r + 1)th trial if and only if the first trial is a tail and the
last r trials are all heads; therefore, p_{r+1} = q p^r, where q = 1 − p. Similarly,
p_{r+2} = q²p^r + pq·p^r = q p_{r+1} + pq·p_r. For a general k ≥ r + 1, we have the
recursion relation
$$p_k = q\,p_{k-1} + pq\,p_{k-2} + \cdots + p^{r-1} q\,p_{k-r}.$$
Multiplying by s^k and summing over k, then summing the geometric series 1 + ps + ⋯ + (ps)^{r−1} and using the fact that p + q = 1, the generating function of N is obtained by this very clever recursion; it is
$$G(s) = \frac{p^r s^r (1 - ps)}{1 - s + q p^r s^{r+1}}.$$
Inasmuch as we have a closed-form formula for G(s), we can determine p_k for any
specified k by simply repeated differentiation; we can also obtain the expected value
of N as E(N) = G′(1).
By using the fact that G″(1) = E[N(N − 1)], we can obtain the second moment,
and from there the variance of N. □
Example 1.25 (Run of Heads). If the coin is a fair coin, we get the result that the
expected number of tosses necessary to get the first head run of length r is 2 + 2²
+ ⋯ + 2^r; for example, it takes on average 14 tosses of a fair coin to obtain a run
of three heads for the first time.
Computing with the variance formula in the theorem, one can see that the
variance of N is very large for r > 3; sometimes one has to wait a very long time
to see a run of four or more consecutive heads.
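The expected value of 14 tosses for a run of three heads can be checked by simulation (a sketch, not part of the text):

```python
import random

# Estimate E(N) for the first run of r = 3 heads with a fair coin;
# the theorem gives 2 + 4 + 8 = 14.
random.seed(1)

def tosses_until_run(r=3, p=0.5):
    run = tosses = 0
    while run < r:
        tosses += 1
        run = run + 1 if random.random() < p else 0  # reset the run on tails
    return tosses

trials = 100_000
avg = sum(tosses_until_run() for _ in range(trials)) / trials
```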
1.7 Standard Discrete Distributions

A few special discrete distributions arise very frequently in applications. Either the
underlying probability mechanism of a problem is such that one of these distributions is truly the correct distribution for that problem, or the problem may be such
that one of these distributions is a very good choice to model it. The special
distributions we present are the binomial, the geometric, the negative binomial,
the hypergeometric, and the Poisson.
The Binomial Distribution. The binomial distribution represents a sequence of
independent coin tossing experiments. Suppose a coin with probability p, 0 < p < 1,
for heads in a single trial is tossed independently a prespecified number of times,
say n times, n ≥ 1. Let X be the number of times in the n tosses that a head is
obtained. Then the pmf of X is
$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n-x}, \qquad x = 0, 1, \ldots, n,$$
the $\binom{n}{x}$ term giving the number of ways to choose the x tosses out of the n tosses in which the heads occur.
Coin tossing, of course, is just an artifact. Suppose a trial can result in only one
of two outcomes, called a success (S) or a failure (F), the probability of obtaining
a success being p in any trial. Such a trial is called a Bernoulli trial. Suppose a
Bernoulli trial is repeated independently a prespecified number of times, say n times.
Let X be the number of times in the n trials that a success is obtained. Then X has
the pmf given above, and we say that X has a binomial distribution with parameters
n and p, and write X ~ Bin(n, p).
The Geometric Distribution. Suppose a coin with probability p, 0 < p < 1, for
heads in a single trial is repeatedly tossed until a head is obtained for the first time.
Assume that the tosses are independent. Let X be the number of the toss at which
the very first head is obtained. Then the pmf of X is
$$P(X = x) = p(1 - p)^{x-1}, \qquad x = 1, 2, 3, \ldots.$$
The Poisson Distribution. A random variable X has the Poisson distribution with parameter λ > 0, written X ~ Poi(λ), if
$$P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}, \qquad x = 0, 1, 2, \ldots;$$
by using the power series expansion $e^{\lambda} = \sum_{x=0}^{\infty} \frac{\lambda^x}{x!}$, it follows that this is indeed
a valid pmf.
Three specific situations where a Poisson distribution is almost routinely adopted
as a model are the following.
(A) The number of times a specific event happens in a specified period of time,
for example, the number of phone calls received by someone over a 24-hour
period.
(B) The number of times a specific event or phenomenon is observed in a specified
amount of area or volume, for example, the number of bacteria of a certain kind
in one liter of a sample of water, or the number of misprints per page of a book,
and so on.
(C) The number of times a success is obtained when a Bernoulli trial with success
probability p is repeated independently n times, with p being small and n being
large, such that the product np has a moderate value, say between .5 and 10.
Thus, although the true distribution is a binomial, a Poisson distribution is used
as an effective and convenient approximation.
For X ~ Geo(p),
$$E(X) = \frac{1}{p}; \qquad \mathrm{Var}(X) = \frac{q}{p^2}.$$
(d) The integer part of λ is always a mode of X ~ Poi(λ). If λ is itself an integer, then λ and
λ − 1 are both modes of X.
Example 1.26 (Guessing on a Multiple Choice Exam). A multiple choice test with
20 questions has five possible answers for each question. A completely unprepared
student picks the answer for each question at random and independently. Suppose
X is the number of questions that the student answers correctly.
We identify each question with a Bernoulli trial and a correct answer as a success.
Because there are 20 questions and the student picks the answer at random from
five choices, X ~ Bin(n, p), with n = 20, p = 1/5 = .2. We can now answer any
question we want about X.
For example,
$$P(\text{the student gets every answer right}) = P(X = 20) = .2^{20} = 1.05 \times 10^{-14},$$
a near impossibility. Suppose the instructor has decided that it will take at least 13
correct answers to pass this test. Then,
$$P(\text{the student will pass}) = \sum_{x=13}^{20} \binom{20}{x} (.2)^x (.8)^{20-x} = .000015.$$
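These binomial probabilities are easy to reproduce directly (a sketch, not part of the text):

```python
from math import comb

# Reproducing the exam calculations: X ~ Bin(20, 0.2).
n, p = 20, 0.2
p_all_right = p**n                      # P(X = 20)
p_pass = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(13, n + 1))
```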
Example 1.27 (Meeting Someone with the Same Birthday). Suppose you were
born on October 15. How many different people do you have to meet before you
find someone who was also born on October 15? Under the usual conditions of
equally likely birthdays, and independence of the birthdays of all people that you
will meet, the number of people X you have to meet to find the first person with the
same birthday as yours is geometric: X ~ Geo(p) with p = 1/365. The pmf of X is
P(X = x) = p(1 − p)^{x−1}. Thus, for any given k,
$$P(X > k) = \sum_{x=k+1}^{\infty} p(1-p)^{x-1} = p\sum_{x=k}^{\infty} (1-p)^{x} = (1-p)^k.$$
For example, the chance that you will have to meet more than 1000 people to find
someone with the same birthday as yours is (364/365)^{1000} = .064.
This computation illustrates the lack of memory property of the geometric distribution: for all m, n ≥ 1, P(X > m + n | X > n) = P(X > m). That is, suppose you are waiting for some event to happen for the first time. You
have tried, say, 20 times, and you still have not succeeded. You may feel that success is
due anytime now. But the chance that it will take another ten tries is the same as if
you had just started, forgetting that you have already been patient for a long time and
tried very hard for success.
The proof is simple. Indeed,
$$P(X > m + n \mid X > n) = \frac{P(X > m + n)}{P(X > n)} = \frac{\sum_{x > m+n} p(1-p)^{x-1}}{\sum_{x > n} p(1-p)^{x-1}}$$
$$= \frac{(1-p)^{m+n}}{(1-p)^n} = (1-p)^m = P(X > m).$$
Although the idea is extremely original, this estimate can run into various kinds of
difficulties if, for example, the fish from the first catch cluster together after being
returned, or hide, or if the fish population has changed between the two catches due to
birth or death, and of course if X turns out to be zero. Modifications of this estimate
(known as the Petersen estimate) are widely used in wildlife estimation, in censuses, and
by governments for estimating tax fraud and the number of people afflicted with
some infection.
Example 1.30 (Events over Time). April receives three phone calls at her home on
the average per day. On what percentage of days does she receive no phone calls;
more than five phone calls?
Because the number of calls received in a 24-hour period counts the occurrences
of an event in a fixed time period, we model X D number of calls received by April
on one day as a Poisson random variable with mean 3. Then,
$$P(X = 0) = e^{-3} = .0498; \qquad P(X > 5) = 1 - P(X \le 5) = 1 - e^{-3}\sum_{x=0}^{5} \frac{3^x}{x!} = 1 - .9161 = .0839.$$
Thus, she receives no calls on 4.98% of the days and she receives more than five
calls on 8.39% of the days. It is important to understand that X has only been
modeled as a Poisson random variable, and other models could also be reasonable.
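The two Poisson(3) probabilities can be reproduced directly (a sketch, not part of the text):

```python
import math

# Phone-call probabilities for X ~ Poi(3).
lam = 3.0
p0 = math.exp(-lam)                                  # P(X = 0)
p_le_5 = sum(math.exp(-lam) * lam**x / math.factorial(x) for x in range(6))
p_gt_5 = 1 - p_le_5                                  # P(X > 5)
```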
Example 1.31 (A Hierarchical Model with a Poisson Base). Suppose a chick lays
a Poi(λ) number of eggs in some specified period of time, say a month. Each egg
has a probability p of actually developing. We want to find the distribution of the
number of eggs that actually develop during that period of time.
Let X ~ Poi(λ) denote the number of eggs the chick lays, and Y the number of
eggs that develop. First,
$$P(Y = 0) = \sum_{x=0}^{\infty} P(Y = 0 \mid X = x)\,P(X = x) = \sum_{x=0}^{\infty} (1-p)^x \frac{e^{-\lambda}\lambda^x}{x!}$$
$$= e^{-\lambda}\sum_{x=0}^{\infty} \frac{(\lambda(1-p))^x}{x!} = e^{-\lambda}\, e^{\lambda(1-p)} = e^{-\lambda p}.$$
In general,
$$P(Y = y) = \sum_{x=y}^{\infty} \binom{x}{y} p^y (1-p)^{x-y} \frac{e^{-\lambda}\lambda^x}{x!} = e^{-\lambda}\,\frac{(p/(1-p))^y}{y!} \sum_{x=y}^{\infty} \frac{((1-p)\lambda)^x}{(x-y)!}$$
$$= e^{-\lambda}\,\frac{(p/(1-p))^y}{y!}\,(\lambda(1-p))^y \sum_{n=0}^{\infty} \frac{(\lambda(1-p))^n}{n!} = \frac{e^{-\lambda p}(\lambda p)^y}{y!},$$
and so we recognize that Y ~ Poi(λp). What is interesting here is that the distribution still remains Poisson, under assumptions that seem to be very realistic
physically.
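The conclusion Y ~ Poi(λp) can be checked by simulating the hierarchy directly (a sketch, not part of the text; the Poisson sampler is Knuth's classical uniform-product method):

```python
import random, math

# Simulate X ~ Poi(lam) eggs, each developing with probability p;
# the mean of Y should be close to lam * p.
random.seed(7)
lam, p = 6.0, 0.5

def sample_poisson(lam):
    # Knuth's method: multiply uniforms until the product drops below e^{-lam}
    limit = math.exp(-lam)
    prod, k = 1.0, 0
    while prod > limit:
        prod *= random.random()
        k += 1
    return k - 1

sims = 50_000
total = 0
for _ in range(sims):
    x = sample_poisson(lam)
    y = sum(1 for _ in range(x) if random.random() < p)  # thinned count
    total += y
mean_y = total / sims
```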
A binomial random variable is the sum of n indicator variables. When the expectation of these indicator variables, namely p, is small, and the number of summands n
is large, the Poisson distribution provides a good approximation to the binomial. The
Poisson distribution can also sometimes serve as a good approximation when the
indicators are independent but have different expectations p_i, or when the indicator
variables have some weak dependence. We start with the Poisson approximation to
the binomial when n is large and p is small.
If X_n ~ Bin(n, p_n), where np_n → λ, and Y ~ Poi(λ), then for each fixed k,
$$P(X_n = k) \to P(Y = k),$$
as n → ∞.
In fact, the convergence is not just pointwise for each fixed k, but it is uniform
in k. This follows from the next theorem.
Theorem 1.27 (Le Cam, Barbour and Hall, Steele). Let X_n = B₁ + B₂ + ⋯ + B_n,
where the B_i are independent Bernoulli variables with parameters p_i = p_{i,n}. Let Y_n
~ Poi(λ), where λ = λ_n = Σ_{i=1}^n p_i. Then,
$$\sum_{k=0}^{\infty} |P(X_n = k) - P(Y_n = k)| \le \frac{2(1 - e^{-\lambda})}{\lambda} \sum_{i=1}^{n} p_i^2.$$
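A small numeric check (not part of the text) that the Poisson approximation error stays within such a bound; the bound expression coded here is the Barbour–Hall form $\frac{2(1-e^{-\lambda})}{\lambda}\sum p_i^2$, an assumption on my part:

```python
import math
from math import comb

# Total variation between Bin(n, p) and Poi(np) for n = 100, p = 0.02,
# compared with the Barbour-Hall-type bound (assumed form).
n, p = 100, 0.02
lam = n * p

def bin_pmf(k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poi_pmf(k):
    return math.exp(-lam) * lam**k / math.factorial(k)

diff = sum(abs(bin_pmf(k) - poi_pmf(k)) for k in range(n + 1))
bound = 2 * (1 - math.exp(-lam)) / lam * n * p**2
```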
The number of successes X in n plays is approximately Poi(λ) with λ = .00043n, so the probability of at least two successes is P(X ≥ 2) = 1 − (1 + λ)e^{−λ}; here,
$$1 - (1 + .00043n)e^{-.00043n} = .006.$$
If he plays for ten years,
$$1 - (1 + .00043n)e^{-.00043n} = .022.$$
We can see that the chances of any luck are at best moderate even after prolonged tries.
Sums of random variables arise very naturally in practical applications. For example, the revenue over a year is the sum of the monthly revenues; the time taken to
finish a test with ten problems is the sum of the times taken to finish the individual
problems, and so on. Sometimes we can reasonably assume that the various random
variables being added are independent. Thus, the following general question is an
important one.
Suppose X₁, X₂, ..., X_k are k independent random variables, and suppose we
know the distributions of the individual X_i. What is the distribution of the sum
X₁ + X₂ + ⋯ + X_k?
In general, this is a very difficult question. Interestingly, if the individual X_i have
one of the distinguished distributions we have discussed in this chapter, then their
sum often also has a distribution of the same type.
Theorem 1.28. (a) Suppose X₁, X₂, ..., X_k are k independent binomial random
variables, with X_i ~ Bin(n_i, p). Then X₁ + X₂ + ⋯ + X_k ~ Bin(n₁ + n₂ + ⋯ + n_k, p);
(b) Suppose X₁, X₂, ..., X_k are k independent negative binomial random variables, with X_i ~ NB(r_i, p). Then X₁ + X₂ + ⋯ + X_k ~ NB(r₁ + r₂ + ⋯ + r_k, p);
(c) Suppose X₁, X₂, ..., X_k are k independent Poisson random variables, with
X_i ~ Poi(λ_i). Then X₁ + X₂ + ⋯ + X_k ~ Poi(λ₁ + λ₂ + ⋯ + λ_k).
The statement that $P(a \le X \le b) = \int_a^b f(x)\,dx$ is the same as saying that if
we plot the density function f(x), then the area under the graph between a and
b will give the probability that X is between a and b, while the statement that
$\int_{-\infty}^{\infty} f(x)\,dx = 1$ is the same as saying that the area under the entire graph must be
one. This is a visually helpful way to think of probabilities for continuous random
variables; larger areas under the graph of the density function correspond to larger
probabilities.
The density function f(x) can in principle be used to calculate the probability
that the random variable X belongs to a general set A, not just an interval. Indeed,
$P(X \in A) = \int_A f(x)\,dx$.
Caution. Integrals over completely general sets A in the real line are not defined.
To make this completely rigorous, one has to use measure theory and the concept of
the Lebesgue integral. However, generally we only want to calculate P(X ∈ A) for
sets A that are countable unions of intervals. For such sets, defining the integral
$\int_A f(x)\,dx$ is not a problem, and we can proceed as if we were just calculating ordinary integrals.
The definition of the cumulative distribution function remains the same as before.
Definition 1.23. Let X be a continuous random variable with a pdf f(x). Then the
CDF of X is defined as
$$F(x) = P(X \le x) = P(X < x) = \int_{-\infty}^{x} f(t)\,dt.$$
Remark. At any point x₀ at which f(x) is continuous, the CDF F(x) is differentiable, and F′(x₀) = f(x₀). In particular, if f(x) is continuous everywhere, then
F′(x) = f(x) at all x.
Again, to be strictly rigorous, one really needs to say in the above sentence that
F′(x) = f(x) at almost all x, a concept in measure theory.
Example 1.33 (Using the Density to Calculate a Probability). Suppose X has the
uniform density on [0, 1] defined by f(x) = 1, 0 ≤ x ≤ 1, and f(x) = 0 otherwise.
We write X ~ U[0, 1]. Consider the events B = {X(1 − X) ≤ .21}, C = {X² ≥ 1/4}, and D = {X is rational}.
Note that x(1 − x) = .21 has two roots in [0, 1], namely x = .3, .7, and
x(1 − x) ≤ .21 if x ≤ .3 or x ≥ .7. Therefore,
$$P(B) = P(X \le .3) + P(X \ge .7) = \int_0^{.3} dx + \int_{.7}^{1} dx = .3 + .3 = .6.$$
Next, since X takes values in [0, 1],
$$X^2 \ge \frac{1}{4} \iff X \ge \frac{1}{2}.$$
Thus,
$$P(C) = P\left(X \ge \frac{1}{2}\right) = \int_{1/2}^{1} dx = \frac{1}{2}.$$
Finally, the set of rationals in [0, 1] is a countable set. Therefore,
$$P(D) = \sum_{x:\,x \text{ is rational}} P(X = x) = \sum_{x:\,x \text{ is rational}} 0 = 0.$$
Example 1.34 (From CDF to PDF and Median). Consider the function F(x) = 0
if x < 0; = 1 − e^{−x} if 0 ≤ x < ∞. This is a nonnegative nondecreasing function
that goes to one as x → ∞, is continuous at every real number x, and is also differentiable at every x except x = 0. Thus, it is the CDF of a continuous random variable,
and the PDF can be obtained by the relation f(x) = F′(x) = e^{−x}, 0 < x < ∞,
and f(x) = F′(x) = 0, x < 0. At x = 0, F(x) is not differentiable. But we can
define the PDF in any manner we like at one specific point; so, to be specific, we
write our PDF as
$$f(x) = e^{-x} \ \text{if } 0 \le x < \infty; \qquad = 0 \ \text{if } x < 0.$$
This density is called the standard exponential density and is enormously important
in practical applications.
From the formula for the CDF, we see that F(m) = .5 ⇒ 1 − e^{−m} = .5 ⇒
e^{−m} = .5 ⇒ m = log 2 = .693. Thus, we have established that the standard
exponential density has median log 2 = .693.
In general, given a number p, there can be infinitely many values x such that
F(x) = p. Any such value splits the distribution into two parts: 100p% of the
probability below it, and 100(1 − p)% above. Such a value is called the pth quantile
or percentile of F. However, in order to give a prescription for choosing a unique
value when there is more than one x at which F(x) = p, the following definition is
adopted.
Definition 1.24. Let X have the CDF F(x). Let 0 < p < 1. The pth quantile or
the pth percentile of X is defined to be the first x such that F(x) ≥ p.
The function F^{−1}(p) is also sometimes denoted as Q(p) and is called the quantile
function of F or X.
Remark. Statisticians call Q(.25) and Q(.75) the first and the third quartile of F
or X.
The distribution of a continuous random variable is completely described if we
describe either its density function, or its CDF. For flexible modeling, it is useful
to know how to create new densities or new CDFs out of densities or CDFs that
we have already thought of. This is similar to generating new functions out of old
functions in calculus. The following theorem describes some standard methods to
make new densities or CDFs out of already available ones.
Theorem 1.29. (a) Let f(x) be any density function. Then, for any real number μ
and any σ > 0,
$$g(x) = g_{\mu,\sigma}(x) = \frac{1}{\sigma} f\!\left(\frac{x - \mu}{\sigma}\right)$$
is also a valid density function.
(b) Let f₁(x), ..., f_k(x) be density functions, and let p₁, ..., p_k be nonnegative constants with Σ_{i=1}^k p_i = 1. Then the mixture
$$f(x) = \sum_{i=1}^{k} p_i f_i(x)$$
is also a valid density function.
Example 1.36 (The Double Exponential Density). We have previously seen the standard exponential density on [0, ∞) defined as e^{−x}, x ≥ 0. We can extend this to the
negative real numbers by writing −x for x in the above formula; that is, simply
define the density to be e^{x} for x ≤ 0. Then, we have an overall function that equals
e^{−x} for x ≥ 0 and e^{x} for x ≤ 0.
This function integrates to
$$\int_0^{\infty} e^{-x}\,dx + \int_{-\infty}^{0} e^{x}\,dx = 1 + 1 = 2.$$
So, if we use a normalizing constant of 1/2, then we get a valid density on the entire
real line:
$$f(x) = \frac{1}{2}e^{-|x|}, \quad -\infty < x < \infty;$$
that is, f(x) = (1/2)e^{−x} for x ≥ 0 and f(x) = (1/2)e^{x} for x ≤ 0.
Example 1.37 (The Normal Density). The double exponential density tapers off
to zero at a linear exponential rate at both tails (i.e., as x → ±∞). If we force
the density to taper off at a quadratic exponential rate, then we will get a function
like e^{−ax²}, for some chosen a > 0. Although this is obviously nonnegative, and
also has a finite integral over the whole real line, it does not integrate to one. So we
need a normalizing constant to make it a valid density function. Densities of this
form are called normal densities, and occupy the central place among all distributions in the theory and practice of probability and statistics. Gauss, while using the
method of least squares for analyzing astronomical data, used the normal distribution to justify least squares methods; the normal distribution is also often called the
Gaussian distribution, although de Moivre and Laplace both worked with it before
Gauss. Physical data on many types of variables approximately fit a normal distribution. The theory of statistical methods is often best understood when the underlying
distribution is normal. The normal distributions have many unique properties not
shared by any other distribution. For all these reasons, the normal density,
also called the bell curve, is the most used, important, and well-studied distribution.
Let
$$f(x) = f(x \mid \mu, \sigma) = c\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty,$$
where c is a normalizing constant. The normalizing constant can be proved to be
equal to $\frac{1}{\sigma\sqrt{2\pi}}$. Thus, a normal density with parameters μ and σ is given by
$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty.$$
We write X ~ N(μ, σ²); we show later that the two parameters μ and σ² are
the mean and the variance of this distribution. Note that the N(μ, σ²) density is a
location-scale parameter density.
If μ = 0 and σ = 1, this simplifies to the formula
$$\frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}, \quad -\infty < x < \infty,$$
and is universally denoted by the notation φ(x). It is called the standard normal
density. The standard normal density, then, is
$$\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}, \quad -\infty < x < \infty.$$
1.10 Functions of a Continuous Random Variable

As for discrete random variables, we are often interested in the distribution of some
function g(X) of a continuous random variable X. For example, X could measure
the input into some production process, and g(X) could be a function that describes
the output. For one-to-one functions g(X), one has the following important formula.
Theorem 1.30 (The Jacobian Formula). Let X have a continuous pdf f(x)
and a CDF F(x), and suppose Y = g(X) is a strictly monotone function of X with
a nonzero derivative. Then Y has the pdf
$$f_Y(y) = \frac{f(g^{-1}(y))}{|g'(g^{-1}(y))|}.$$
In particular, for a linear function Y = a + bX with b ≠ 0,
$$f_Y(y) = \frac{1}{|b|}\, f\!\left(\frac{y - a}{b}\right).$$
Example 1.39 (From Exponential to Uniform). Suppose X has the standard exponential density f(x) = e^{−x}, x ≥ 0. Let Y = g(X) = e^{−X}. Again, g(X) is a
strictly monotone (decreasing) function, and the inverse function is g^{−1}(y) = −log y,
0 < y < 1. Also, g′(x) = −e^{−x}; therefore, by the Jacobian formula,
$$f_Y(y) = \frac{f(-\log y)}{|g'(-\log y)|} = \frac{e^{\log y}}{e^{\log y}} = 1, \quad 0 < y < 1,$$
so Y ~ U[0, 1].
Theorem 1.31. Let X have a continuous CDF F(x). Consider the new random
variables Y = 1 − F(X) and Z = F(X). Then both Y and Z are distributed as
U[0, 1].
It is useful to remember this result in informal notation:
$$F(X) = U, \quad \text{and} \quad F^{-1}(U) = X.$$
The implication is a truly useful one. Suppose for purposes of computer experiments
we want to have computer-simulated values of some random variable X that has
some CDF F and the quantile function Q = F^{−1}. Then all we need to do is
to have the computer generate U[0, 1] values, say u₁, u₂, ..., u_n, and use x₁ =
F^{−1}(u₁), x₂ = F^{−1}(u₂), ..., x_n = F^{−1}(u_n) as the set of simulated values for
our random variable of interest, namely X. Thus, the problem can be reduced to
simulation of uniform values, a simple task. The technique has so many uses that
there is a name for this particular function Z = F^{−1}(U) of a uniform random
variable U.
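The quantile-transformation recipe above can be sketched for the standard exponential, where F(x) = 1 − e^{−x} gives F^{−1}(u) = −log(1 − u) (a sketch, not part of the text):

```python
import random, math

# Inverse-CDF simulation of the standard exponential from U[0, 1] draws.
random.seed(3)
n = 100_000
xs = [-math.log(1 - random.random()) for _ in range(n)]
mean = sum(xs) / n      # should be close to E(X) = 1
```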
Example 1.40 (The Cauchy Distribution). The Cauchy density, like the normal
and the double exponential, is also symmetric and unimodal, but its properties are
very different. It is such an atypical density that we often think of the Cauchy density
first when we look for a counterexample to a conjecture. There is a very interesting
way to obtain a Cauchy density from a uniform density by using the quantile
transformation. We describe that derivation in this example.
Suppose a person holds a flashlight in her hand, and standing one foot away
from an infinitely long wall, points the beam of light in a random direction. Here,
by random direction, we mean that the point of landing of the light ray makes an
angle X with the individual (considered to be a straight line one foot long), and
this angle X ~ U[−π/2, π/2]. Let Y be the horizontal distance of the point at
which the light lands from the person, with Y being considered negative if the light
lands on the person's left, and positive if it lands on the person's right.
Then, by elementary trigonometry,
$$\tan(X) = \frac{Y}{1} \Rightarrow Y = \tan(X).$$
Now g(X) = tan X is a strictly monotone function of X on [−π/2, π/2], and the inverse function
is g^{−1}(y) = arctan(y), −∞ < y < ∞. Also, g′(x) = 1 + tan²x. Putting it all
together,
$$f_Y(y) = \frac{1/\pi}{1 + [\tan(\arctan y)]^2} = \frac{1}{\pi(1 + y^2)}, \quad -\infty < y < \infty.$$
Example 1.41 (An Interesting Function that Is Not Strictly Monotone). Suppose X
has the standard normal density $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ on (−∞, ∞). We want to find
the density of Y = g(X) = X². However, we immediately realize that X² is not a
strictly monotone function on the whole real line (its graph is a parabola). Thus, the
general formula given above for densities of strictly monotone functions cannot be
applied in this problem. We attack the problem directly. Thus, for y > 0,
$$P(Y \le y) = P(X^2 \le y) = P(X^2 \le y, X > 0) + P(X^2 \le y, X < 0)$$
$$= P(0 < X \le \sqrt{y}) + P(-\sqrt{y} \le X < 0)$$
$$= F(\sqrt{y}) - F(0) + [F(0) - F(-\sqrt{y})] = F(\sqrt{y}) - F(-\sqrt{y}).$$
Differentiating with respect to y, the density of Y is
$$f_Y(y) = \frac{f(\sqrt{y}) + f(-\sqrt{y})}{2\sqrt{y}} = \frac{e^{-y/2}}{\sqrt{2\pi y}}, \quad y > 0.$$
This is a very special density in probability and statistics, and is called the
chi-square density with one degree of freedom. We have thus proved that the square
of a standard normal random variable has a chi-square distribution with one degree
of freedom.
There is an analogous Jacobian formula for transformations g.X / that are not
one-to-one. Basically, we need to break the problem up into disjoint intervals, on
each of which the function g is one-to-one, apply the usual Jacobian technique on
each such subinterval, and then piece them together. Here is the formula.
Theorem 1.32 (Density of a Nonmonotone Transformation). Let X have a
continuous pdf f(x) and let Y = g(X) be a transformation of X such that for a
given y, the equation g(x) = y has at most countably many roots, say x₁, x₂, ...,
where the x_i depend on the given y. Assume also that g has a nonzero derivative at
each x_i. Then Y has the pdf
$$f_Y(y) = \sum_{i} \frac{f(x_i)}{|g'(x_i)|}.$$
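The chi-square conclusion of Example 1.41 can be checked empirically (a sketch, not part of the text): if X ~ N(0,1) and Y = X², then P(Y ≤ 1) = F(1) − F(−1), computable via the error function.

```python
import random, math

# Empirical P(X^2 <= 1) for standard normal X versus the exact value
# F(1) - F(-1) = erf(1/sqrt(2)) ~ 0.6827.
random.seed(5)
n = 100_000
prop = sum(1 for _ in range(n) if random.gauss(0, 1)**2 <= 1) / n
exact = math.erf(1 / math.sqrt(2))
```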
Theorem 1.33. Let X be a continuous random variable with pdf f(x). Let g(X) be
a function of X. The expectation of g(X) exists if and only if $\int_{-\infty}^{\infty} |g(x)| f(x)\,dx < \infty$, in which case the expectation of g(X) is
$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx.$$
The definitions of moments and the variance remain the same as in the discrete case.
Definition 1.29. Let X be a continuous random variable with pdf f(x). Then the
kth moment of X is defined to be E(X^k), k ≥ 1. We say that the kth moment does
not exist if E(|X|^k) = ∞.
Corollary. Suppose X is a continuous random variable with pdf f(x). Then its
variance, provided it exists, is equal to
$$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2.$$
One simple observation that saves calculations, but is sometimes overlooked, is the
following fact; the proof of it merely uses the integration result that the integral of
the product of an odd function and an even function on a symmetric interval is zero,
if the integral exists.
where s = (a + b + c)/2. When all the side lengths are equal, say to a, this reduces to
(√3/4)a². Therefore, in this example, we want the mean and variance of Y = (√3/4)X².
The mean is
$$E(Y) = \frac{\sqrt{3}}{4} E(X^2) = \frac{\sqrt{3}}{4}\cdot\frac{1}{3} = \frac{1}{4\sqrt{3}}.$$
The variance equals
$$\mathrm{var}(Y) = E(Y^2) - [E(Y)]^2 = \frac{3}{16}E(X^4) - \frac{1}{48} = \frac{3}{16}\cdot\frac{1}{5} - \frac{1}{48} = \frac{3}{80} - \frac{1}{48} = \frac{1}{60}.$$
For the next example, we need the definition of the Gamma function, with which it will repeatedly be necessary for us to work in this text:
$$\Gamma(\alpha) = \int_0^{\infty} e^{-x} x^{\alpha - 1}\,dx, \quad \alpha > 0.$$
For a standard exponential random variable X, the moments are E(X^k) = Γ(k + 1) = k!; in particular,
$$E(X) = 1; \qquad E(X^2) = 2.$$
Example 1.44 (Absolute Value of a Standard Normal). This is often required in calculations in statistical theory. Let X have the standard normal distribution; we want
to find E(|X|). By definition,
$$E(|X|) = \int_{-\infty}^{\infty} |x| f(x)\,dx = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} |x| e^{-x^2/2}\,dx = \frac{2}{\sqrt{2\pi}}\int_0^{\infty} x e^{-x^2/2}\,dx$$
(because $|x|e^{-x^2/2}$ is an even function of x on (−∞, ∞))
$$= \frac{2}{\sqrt{2\pi}}\int_0^{\infty} \left(-\frac{d}{dx} e^{-x^2/2}\right) dx = -\frac{2}{\sqrt{2\pi}}\, e^{-x^2/2}\Big|_0^{\infty} = \frac{2}{\sqrt{2\pi}} = \sqrt{\frac{2}{\pi}}.$$
Example 1.45 (A Random Variable Whose Expectation Does Not Exist). Consider the standard Cauchy random variable with the density $f(x) = \frac{1}{\pi(1+x^2)}$,
−∞ < x < ∞. Recall that for E(X) to exist, we must have $\int_{-\infty}^{\infty} |x| f(x)\,dx < \infty$.
But,
$$\int_{-\infty}^{\infty} |x| f(x)\,dx = \frac{1}{\pi}\int_{-\infty}^{\infty} \frac{|x|}{1+x^2}\,dx = \frac{2}{\pi}\int_0^{\infty} \frac{x}{1+x^2}\,dx \ge \frac{2}{\pi}\int_0^{M} \frac{x}{1+x^2}\,dx = \frac{1}{\pi}\log(1+M^2)$$
for every M > 0, and this diverges to ∞ as M → ∞. Therefore, the expectation of a
standard Cauchy random variable does not exist.
because $x^{2n+1}$ is an odd function and $e^{-x^2/2}$ is an even function. Thus, every odd
moment of the standard normal distribution is zero.
Next, take k to be an even integer, say k = 2n, n ≥ 1. Then,
$$E(X^k) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x^{2n} e^{-x^2/2}\,dx = \frac{2}{\sqrt{2\pi}}\int_0^{\infty} x^{2n} e^{-x^2/2}\,dx$$
$$= \frac{2}{\sqrt{2\pi}}\int_0^{\infty} z^n e^{-z/2}\,\frac{1}{2\sqrt{z}}\,dz = \frac{1}{\sqrt{2\pi}}\int_0^{\infty} z^{n-1/2} e^{-z/2}\,dz$$
(substituting z = x²). Substituting again z = 2u,
$$E(X^{2n}) = \frac{1}{\sqrt{2\pi}}\int_0^{\infty} (2u)^{n-1/2} e^{-u}\,2\,du = \frac{2^n}{\sqrt{\pi}}\int_0^{\infty} u^{n-1/2} e^{-u}\,du.$$
Now, we recognize $\int_0^{\infty} u^{n-1/2} e^{-u}\,du$ to be $\Gamma(n + \frac{1}{2})$, and so we get the formula
$$E(X^{2n}) = \frac{2^n\,\Gamma(n + \frac{1}{2})}{\sqrt{\pi}}, \quad n \ge 1.$$
By the identity $\Gamma(n + \frac{1}{2}) = \frac{(2n)!\sqrt{\pi}}{4^n n!}$, this reduces to
$$E(X^{2n}) = \frac{(2n)!}{2^n n!}, \quad n \ge 1.$$
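The two expressions for the even normal moments can be checked against each other numerically (a sketch, not part of the text):

```python
import math

# Even moments of a standard normal: the Gamma-function form should match
# the factorial form (2n)!/(2^n n!); e.g. E(X^2) = 1, E(X^4) = 3, E(X^6) = 15.
def even_moment_gamma(n):
    return 2**n * math.gamma(n + 0.5) / math.sqrt(math.pi)

def even_moment_factorial(n):
    return math.factorial(2 * n) / (2**n * math.factorial(n))
```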
We now describe methods to calculate moments of a random variable from its survival function, namely $\bar{F}(x) = 1 - F(x)$. There are important relationships between
the existence of moments and the rapidity with which the survival function goes to
zero as |x| → ∞.
(a) Let X be a nonnegative random variable and suppose E(X) exists. Then
x[1 − F(x)] → 0 as x → ∞.
(b) Let X be a nonnegative random variable and suppose E(X) exists. Then
$E(X) = \int_0^{\infty} [1 - F(x)]\,dx$.
(c) Let X be a nonnegative random variable and suppose E(X^k) exists, where k ≥ 1
is a given positive integer. Then
$$x^k \bar{F}(x) = x^k [1 - F(x)] \to 0, \quad \text{as } x \to \infty.$$
(d) Let X be a nonnegative random variable and suppose E(X^k) exists. Then
$$E(X^k) = \int_0^{\infty} (k x^{k-1})[1 - F(x)]\,dx.$$
(e) Let X be a general real-valued random variable and suppose E(X) exists. Then
x[1 − F(x)] → 0 and xF(−x) → 0 as x → ∞.
(f) Let X be a general real-valued random variable and suppose E(X) exists. Then
$$E(X) = \int_0^{\infty} [1 - F(x)]\,dx - \int_{-\infty}^{0} F(x)\,dx.$$
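Part (b) can be verified numerically for the standard exponential, whose survival function is 1 − F(x) = e^{−x} and whose mean is 1 (a sketch, not part of the text):

```python
import math

# Trapezoidal approximation of the integral of the survival function e^{-x}
# over [0, 50]; the tail beyond 50 is negligible.
h, upper = 0.001, 50.0
steps = int(upper / h)
integral = sum(0.5 * (math.exp(-i * h) + math.exp(-(i + 1) * h)) * h
               for i in range(steps))
```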
For the standard exponential distribution,
$$\psi(t) = E(e^{tX}) = \int_0^{\infty} e^{tx} e^{-x}\,dx = \frac{1}{1-t}$$
if t < 1, and it equals +∞ if t ≥ 1. Thus, the mgf of the standard exponential distribution is finite if and only if t < 1. So, the moments can be found by differentiating
the mgf, namely, E(X^n) = ψ^{(n)}(0). Now, at any t < 1, by direct differentiation,
$$\psi^{(n)}(t) = \frac{n!}{(1-t)^{n+1}} \ \Rightarrow\ E(X^n) = \psi^{(n)}(0) = n!,$$
a result we have derived before directly.
because $\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-z^2/2}\,dz$ is the integral of the standard normal density, and so
must be equal to one.
We have therefore proved that the mgf of the standard normal distribution exists
at any real t and equals $\psi(t) = e^{t^2/2}$.
The mgf is useful in deriving inequalities on probabilities of tail values of a random variable, inequalities that have proved to be extremely useful in many problems in statistics
and probability. In particular, these inequalities typically give much sharper bounds
on the probability that a random variable is far from its mean value than
Chebyshev's inequality can give. Such probabilities are called large deviation probabilities. We treat large deviations in detail in Chapter 17. We present a particular
large deviation inequality below and then present some neat applications.
Theorem 1.34 (Chernoff–Bernstein Inequality). Let X have the mgf ψ(t), and
assume that ψ(t) < ∞ for t < t₀, for some t₀, 0 < t₀ ≤ ∞. Let κ(t) = log ψ(t),
and for a real number x, define
$$I(x) = \sup_{0 < t < t_0}\, [tx - \kappa(t)].$$
Then,
$$P(X \ge x) \le e^{-I(x)}.$$
See Bernstein (1927) and Chernoff (1952) for this inequality and other refinements
of it.
To apply the Chernoff–Bernstein inequality, it is necessary to be able to find the
mgf ψ(t) and then be able to find the function I(x), which is called the rate function
of X.
Example 1.50 (Testing the Bound in the Standard Normal Case). Suppose X is a
standard normal variable. Then the exact value of the probability P(X > x) =
1 − P(X ≤ x) = 1 − Φ(x) is easily computable, although no closed formula can be
written for it. The Chebyshev inequality gives, for x > 0,
$$P(X > x) = \frac{1}{2} P(|X| > x) \le \frac{1}{2x^2}.$$
To apply the Chernoff–Bernstein bound, use the formula $\psi(t) = e^{t^2/2} \Rightarrow \kappa(t) = t^2/2 \Rightarrow I(x) = \sup_{t > 0}[tx - t^2/2] = x^2/2$. Therefore,
$$P(X > x) \le e^{-I(x)} = e^{-x^2/2}.$$
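The three quantities in this example can be compared numerically at, say, x = 3 (a sketch, not part of the text):

```python
import math

# Exact normal tail 1 - Phi(3) versus the Chernoff-Bernstein and Chebyshev bounds.
x = 3.0
exact = 0.5 * (1 - math.erf(x / math.sqrt(2)))   # 1 - Phi(x)
chernoff = math.exp(-x**2 / 2)
chebyshev = 1 / (2 * x**2)
```

The Chernoff bound is much closer to the truth than Chebyshev's at this point, as the text asserts for large deviation bounds in general.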
Example 1.51. Let X be any random variable with a finite mean μ. Consider the
function g(x) = e^{ax}, where a is a real number. Then, by the second derivative test,
g is a convex function on the entire real line, and therefore, by Jensen's inequality,
$$E(e^{aX}) \ge e^{a\mu}.$$
$$E(X^{\alpha_1})\,E(X^{\beta - \alpha_1}) \ge E(X^{\alpha_2})\,E(X^{\beta - \alpha_2}).$$
Example 1.52 (An Expectation Paradox). Suppose X, Y are two positive nonconstant independent random variables with the same distribution; for example, X, Y
could be independent variables with a uniform distribution on [5, 10]. We need
the assumption that the common distribution of X and Y is such that E(1/X) =
E(1/Y) < ∞.
Let R = X/Y. Then, by Jensen's inequality,
$$E(R) = E\left(\frac{X}{Y}\right) = E(X)\,E\left(\frac{1}{Y}\right) > E(X)\,\frac{1}{E(Y)} = 1.$$
1.12 Some Special Continuous Distributions

The Uniform Distribution. The U[a, b] density and CDF are
$$f(x) = \frac{1}{b - a}, \ a \le x \le b; \qquad f(x) = 0 \ \text{otherwise};$$
$$F(x) = 0, \ x < a; \qquad F(x) = \frac{x - a}{b - a}, \ a \le x \le b; \qquad F(x) = 1, \ x > b.$$
(c) The mgf of the U[a, b] distribution equals
$$\psi(t) = \frac{e^{tb} - e^{ta}}{(b - a)t}, \quad t \ne 0.$$
(d) The nth moment of the U[a, b] distribution equals
$$E(X^n) = \frac{b^{n+1} - a^{n+1}}{(b - a)(n + 1)}.$$
(e) The mean and the variance of the U[a, b] distribution equal
$$\mu = \frac{a + b}{2}; \qquad \sigma^2 = \frac{(b - a)^2}{12}.$$
Example 1.53. A point is selected at random on the unit interval, dividing it into
two pieces with total length 1. Find the probability that the ratio of the length of the
shorter piece to the length of the longer piece is less than 1=4.
Let X ~ U[0, 1]; we want $P\!\left(\frac{\min\{X,\,1-X\}}{\max\{X,\,1-X\}} < \frac{1}{4}\right)$. This happens only if X < 1/5
or X > 4/5. Therefore, the required probability is P(X < 1/5) + P(X > 4/5) =
1/5 + 1/5 = 2/5.
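The answer 2/5 is easy to confirm by simulation (a sketch, not part of the text):

```python
import random

# Break the unit interval at a uniform point and check how often the
# shorter-to-longer ratio falls below 1/4; the answer should be near 0.4.
random.seed(11)
n = 100_000
count = 0
for _ in range(n):
    x = random.random()
    short, long_ = min(x, 1 - x), max(x, 1 - x)
    if short / long_ < 0.25:
        count += 1
prop = count / n
```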
quite a bit smaller than 50%. This implies that the median of the distribution has to
be smaller than 4, where 4 is the mean. Indeed, the median is a number m such that
$F(m) = \frac12$ (the median is unique in this example) $\Rightarrow 1 - e^{-m/4} = \frac12 \Rightarrow m = 4\log 2 = 2.77$.
This phenomenon that the mean is larger than the median is quite typical of
distributions that have a long right tail, as does the exponential.
In general, if $X \sim \text{Exp}(\lambda)$, the median of X is $\lambda \log 2$.
Example 1.55 (Lack of Memory of the Exponential Distribution). The exponential
densities have a lack of memory property similar to the one we established for the
geometric distribution. Let $X \sim \text{Exp}(\lambda)$, and let s, t be positive numbers. The lack
of memory property is that $P(X > s + t \mid X > s) = P(X > t)$. So, suppose that X
is the waiting time for an elevator, and suppose that you have already waited $s = 3$
minutes. Then the probability that you have to wait another two minutes is the same
as the probability that you would have to wait two minutes if you had just arrived. This
is not true if the waiting time distribution is something other than an exponential.
The proof of the property is simple:
$$P(X > s + t \mid X > s) = \frac{P(X > s + t)}{P(X > s)} = \frac{e^{-(s+t)/\lambda}}{e^{-s/\lambda}} = e^{-t/\lambda} = P(X > t).$$
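The lack of memory property can also be checked numerically; the sketch below (our own, with $\lambda = 4$, $s = 3$, $t = 2$ matching the elevator discussion) evaluates both sides:

```python
import math

def surv(x, lam):
    # P(X > x) for X exponential with mean lam: e^{-x/lam}
    return math.exp(-x / lam)

lam, s, t = 4.0, 3.0, 2.0
lhs = surv(s + t, lam) / surv(s, lam)   # P(X > s + t | X > s)
rhs = surv(t, lam)                      # P(X > t)

assert math.isclose(lhs, rhs)
```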
$$f_Y(y) = \frac{f(y^{1/\alpha})}{|g'(y^{1/\alpha})|} = \frac{e^{-y^{1/\alpha}}}{\alpha\, y^{(\alpha-1)/\alpha}} = \frac{1}{\alpha}\, y^{(1-\alpha)/\alpha}\, e^{-y^{1/\alpha}}, \quad y > 0.$$
$$f(x \mid \beta, \lambda) = \frac{\beta}{\lambda}\left(\frac{x}{\lambda}\right)^{\beta-1} e^{-(x/\lambda)^\beta}, \quad x > 0.$$
$$f(x \mid \alpha, \lambda) = \frac{e^{-x/\lambda}\, x^{\alpha-1}}{\lambda^\alpha\, \Gamma(\alpha)}, \quad x > 0;\ \alpha, \lambda > 0;$$
we write $X \sim G(\alpha, \lambda)$. The Gamma density reduces to the exponential density with
mean $\lambda$ when $\alpha = 1$; for $\alpha < 1$, the Gamma density is decreasing and unbounded,
whereas for large $\alpha$, it becomes nearly a bell-shaped curve. A plot of some Gamma
densities in Fig. 1.4 reveals these features.
The basic facts about a Gamma distribution are given in the following theorem.
Theorem 1.39. (a) The CDF of the G.˛; / density is the normalized incomplete
Gamma function
$$F(x) = \frac{\gamma(\alpha, x/\lambda)}{\Gamma(\alpha)},$$
where $\gamma(\alpha, x) = \int_0^x e^{-t}\, t^{\alpha-1}\, dt$.
[Figure 1.4: plots of Gamma densities for several parameter choices.]
(b) The nth moment equals
$$E(X^n) = \frac{\lambda^n\, \Gamma(\alpha + n)}{\Gamma(\alpha)}, \quad n \ge 1.$$
(c) The mgf equals
$$\psi(t) = (1 - \lambda t)^{-\alpha}, \quad t < \frac{1}{\lambda}.$$
(d) The mean and the variance equal
$$\mu = \alpha\lambda; \qquad \sigma^2 = \alpha\lambda^2.$$
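The moment formula in (b) can be checked against the mean and variance in (d); the helper name `gamma_moment` below is ours, not from the text:

```python
import math

def gamma_moment(n, alpha, lam):
    # E(X^n) = lam^n * Gamma(alpha + n) / Gamma(alpha) for X ~ G(alpha, lam)
    return lam**n * math.gamma(alpha + n) / math.gamma(alpha)

alpha, lam = 2.5, 3.0
mean = gamma_moment(1, alpha, lam)
var = gamma_moment(2, alpha, lam) - mean**2

assert math.isclose(mean, alpha * lam)       # mu = alpha * lam
assert math.isclose(var, alpha * lam**2)     # sigma^2 = alpha * lam^2
```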
which agrees with the mgf of a $G(n, \lambda)$ distribution, and therefore, by the distribution-determining property of mgfs, it follows that $X_1 + X_2 + \cdots + X_n \sim G(n, \lambda)$.
Example 1.57 (The General Chi-Square Distribution). We saw in the previous
section that the distribution of the square of a standard normal variable is the chi-
square distribution with one degree of freedom. A natural question is what is the
distribution of the sum of squares of several independent standard normal variables.
Although we do not yet have the technical tools necessary to derive this distribution, it turns out that this distribution is in fact a Gamma distribution. Precisely, if
$X_1, X_2, \ldots, X_m$ are m independent standard normal variables, then $T = \sum_{i=1}^m X_i^2$
has a $G(\frac{m}{2}, 2)$ distribution, and therefore has the density
$$f_m(t) = \frac{e^{-t/2}\, t^{m/2 - 1}}{2^{m/2}\, \Gamma(\frac{m}{2})}, \quad t > 0.$$
This is called the chi-square density with m degrees of freedom, and arises in numerous contexts in statistics and probability. We write $T \sim \chi^2_m$. From the general
formulas for the mean and variance of a Gamma distribution, we get that $E(T) = m$ and $\text{Var}(T) = 2m$.
The chi-square density is rather skewed for small m, but becomes approximately
bell-shaped when m gets large; we have seen this for general Gamma densities.
One especially important context in which the chi-square distribution arises is in
consideration of the distribution of the sample variance for iid normal observations.
The sample variance of a set of n random variables $X_1, X_2, \ldots, X_n$ is defined as
$s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$, where $\bar{X} = \frac{X_1 + \cdots + X_n}{n}$ is the mean of $X_1, \ldots, X_n$. The
name sample variance derives from the following property.
$$\sum_{i=1}^n (X_i - \bar{X})^2 = \sum_{i=1}^n (X_i^2 - 2X_i\bar{X} + \bar{X}^2) = \sum_{i=1}^n X_i^2 - 2n\bar{X}^2 + n\bar{X}^2 = \sum_{i=1}^n X_i^2 - n\bar{X}^2.$$
Therefore,
$$E(s^2) = \frac{1}{n-1}\, E\!\left[\sum_{i=1}^n X_i^2 - n\bar{X}^2\right] = \frac{1}{n-1}\left[n(\sigma^2 + \mu^2) - n\!\left(\frac{\sigma^2}{n} + \mu^2\right)\right] = \sigma^2.$$
If, in particular, $X_1, \ldots, X_n$ are iid $N(\mu, \sigma^2)$, then the $X_i - \bar{X}$ are also normally distributed, each with mean zero. However, they are no longer independent. If we sum
their squares, then the sum of the squares will still be distributed as a chi square, but
there will be a loss of one degree of freedom, due to the fact that the $X_i - \bar{X}$ are not
independent, even though the $X_i$ are independent.
We state this important fact formally.
Theorem 1.41. Suppose $X_1, \ldots, X_n$ are iid $N(\mu, \sigma^2)$. Then
$$\frac{\sum_{i=1}^n (X_i - \bar{X})^2}{\sigma^2} \sim \chi^2_{n-1}.$$
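A small simulation (our own sketch) illustrates the theorem: averaged over many samples, $(n-1)s^2/\sigma^2$ should be close to $n - 1$, the mean of a $\chi^2_{n-1}$ variable:

```python
import random

random.seed(2)
mu, sigma, n, reps = 10.0, 2.0, 5, 20_000
scaled = []                        # values of (n - 1) s^2 / sigma^2
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    scaled.append((n - 1) * s2 / sigma**2)

mean_scaled = sum(scaled) / reps   # chi^2_{n-1} has mean n - 1 = 4
assert abs(mean_scaled - (n - 1)) < 0.1
```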
Example 1.58 (Inverse Gamma Distribution). Suppose X G.˛; /. The distribu-
tion of X1 is called the inverse Gamma distribution. We derive its density.
$$f_Y(y) = \frac{e^{-1/(\lambda y)}\, y^{-\alpha - 1}}{\lambda^\alpha\, \Gamma(\alpha)}, \quad y > 0.$$
The inverse Gamma density is extremely skewed for small values of $\alpha$; furthermore,
the right tail is so heavy for small $\alpha$ that the mean does not exist if $\alpha \le 1$. Inverse
Gamma distributions are quite popular in studies of economic inequality, reliability
problems, and as prior distributions in Bayesian statistics.
For continuous random variables that take values between 0 and 1, the most standard family of densities is the family of Beta densities. Their popularity is due to
their analytic tractability and to the large variety of shapes that Beta densities
can take as the parameter values change. The family is a generalization of the $U[0,1]$
density.
Definition 1.34. X is said to have a Beta density with parameters ˛ and ˇ if it has
the density
$$f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}, \quad 0 \le x \le 1;\ \alpha, \beta > 0,$$
$$E(X^n) = \frac{\Gamma(\alpha + n)\, \Gamma(\alpha + \beta)}{\Gamma(\alpha + \beta + n)\, \Gamma(\alpha)}.$$
$$\mu = \frac{\alpha}{\alpha + \beta}; \qquad \sigma^2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.$$
(b) The nth moment exists if and only if $n < \alpha$, in which case
$$E(X^n) = \frac{\alpha\, \sigma^n}{\alpha - n}.$$
(c) For $\alpha > 1$, the mean exists; for $\alpha > 2$, the variance exists. Furthermore, they
equal
$$E(X) = \frac{\alpha\sigma}{\alpha - 1}; \qquad \text{Var}(X) = \frac{\alpha\, \sigma^2}{(\alpha-1)^2(\alpha-2)}.$$
We next define the Gumbel law. A random variable X is said to have the Gumbel
density with parameters $\mu$, $\sigma$ if it has the density
$$f(x) = \frac{1}{\sigma}\, e^{-\frac{x-\mu}{\sigma}}\, e^{-e^{-\frac{x-\mu}{\sigma}}}, \quad -\infty < x < \infty.$$
Empirical data on many types of variables across disciplines tend to exhibit uni-
modality and only a small amount of skewness. It is quite common to use a normal
distribution as a model for such data. The normal distribution occupies the central
place among all distributions in probability and statistics. There is also the cen-
tral limit theorem, which says that the sum of many small independent quantities
approximately follows a normal distribution. By a combination of reputation, con-
venience, mathematical justification, empirical experience, and habit, the normal
distribution has become the most ubiquitous of all distributions. Detailed algebraic
properties can be seen in Rao (1973), Kendall and Stuart (1976), and Feller (1971).
Petrov (1975) is a masterly account of the role of the normal distribution in the limit
theorems of probability.
We have actually already defined a normal density. But let us recall the definition
here.
The CDF ˆ.x/ cannot be written in terms of the elementary functions, but can be
computed at a given value x, and tables of the values of ˆ.x/ are widely available.
For example, here are some selected values.
x      Φ(x)
−4     .00003
−3     .00135
−2     .02275
−1     .15866
0      .5
1      .84134
2      .97725
3      .99865
4      .99997
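These tabulated values are easy to reproduce: Φ can be computed from the error function in Python's standard library (the function name `Phi` is our own):

```python
import math

def Phi(x):
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Reproduce the selected values of the table
table = {-4: .00003, -3: .00135, -2: .02275, -1: .15866,
          0: .5, 1: .84134, 2: .97725, 3: .99865, 4: .99997}
for x, val in table.items():
    assert abs(Phi(x) - val) < 5e-6
```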
$$E(Z^{2k}) = \frac{(2k)!}{2^k\, k!}, \quad k \ge 1.$$
(e) The mgf of the $N(\mu, \sigma^2)$ distribution exists at all real t, and equals
$$\psi(t) = e^{\mu t + \frac{t^2\sigma^2}{2}}.$$
(f) If $X \sim N(\mu, \sigma^2)$, then $\frac{X - \mu}{\sigma} \sim N(0, 1)$.
Next, what is the lowest setting of the thermostat that will maintain a temperature
of at least $72^\circ$C with a probability of .99?
We want to find the value of d that makes $P(X \ge 72) = .99 \Rightarrow P(X < 72) = .01$. Now, from a standard normal table, $P(Z < -2.326) = .01$. Therefore, we
want to find d that makes $d + \sigma(-2.326) = 72 \Rightarrow d - .5 \times 2.326 = 72 \Rightarrow d = 72 + .5 \times 2.326 = 73.16^\circ$C.
on some manipulation. For example, if $\sigma = 1$, then this equals $2\sum_{i=0}^{\infty}\,[1 - \Phi(i + 1/2)] = .76358$,
whereas the unrounded $|X|$ has the expectation $\sqrt{2/\pi} = .79789$. The effect of
rounding is not serious when $\sigma = 1$.
A plot of the expected value of Y and the expected value of jX j is shown in
Fig. 1.5 to study the effect of rounding.
Fig. 1.5 Expected value of rounded and unrounded $|X|$ when $X \sim N(0, \sigma^2)$
We can see that the effect of rounding is uniformly small. There is classic
literature on corrections needed in computing means, variances, and higher mo-
ments when data are rounded. These are known as Sheppard’s corrections. Kendall
and Stuart (1976) give a thorough treatment of these needed corrections.
$$f_Y(y) = \frac{1}{y\sigma\sqrt{2\pi}}\, e^{-\frac{(\log y - \mu)^2}{2\sigma^2}}, \quad y > 0;$$
$$E(Y) = e^{\mu + \frac{\sigma^2}{2}}; \qquad \text{Var}(Y) = (e^{\sigma^2} - 1)\, e^{2\mu + \sigma^2}.$$
One of the main reasons for the popularity of the lognormal distribution is its
skewness; the lognormal density is extremely skewed for large values of $\sigma$. The
coefficient of skewness has the formula
$$\beta = (2 + e^{\sigma^2})\sqrt{e^{\sigma^2} - 1} \to \infty, \quad \text{as } \sigma \to \infty.$$
Note that the lognormal densities do not have a finite mgf at any $t > 0$, although
all of their moments are finite. The lognormal is also the only standard continuous distribution that is
not determined by its moments. That is, there exist other distributions besides the
lognormal all of whose moments exactly coincide with the moments of a given lognormal distribution. This is not true of any other distribution with a name that we
have come across in this chapter. For example, the normal and the Poisson distributions are both determined by their moments.
We had remarked in the above that sums of many independent variables tend to
be approximately normally distributed. A precise version of this is the central limit
theorem, which we study in the next section. What is interesting is that sums of any
number of independent normal variables are exactly normally distributed. Here is
the result.
Example 1.64 (Confidence Interval and Margin of Error). Suppose some random
variable $X \sim N(\mu, \sigma^2)$, and we have n independent observations $X_1, X_2, \ldots, X_n$
on this variable X; another way to put it is that $X_1, X_2, \ldots, X_n$ are iid $N(\mu, \sigma^2)$.
Therefore, $\bar{X} \sim N(\mu, \sigma^2/n)$, and we have
$$P(\bar{X} - 1.96\sigma/\sqrt{n} \le \mu \le \bar{X} + 1.96\sigma/\sqrt{n})$$
$$= P(-1.96\sigma/\sqrt{n} \le \bar{X} - \mu \le 1.96\sigma/\sqrt{n})$$
$$= P\!\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = \Phi(1.96) - \Phi(-1.96) = .95.$$
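A simulation sketch (our own, not from the text) of the resulting 95% confidence interval $\bar{X} \pm 1.96\sigma/\sqrt{n}$ confirms the coverage statement:

```python
import math
import random

random.seed(3)
mu, sigma, n, reps = 5.0, 2.0, 25, 10_000
hits = 0
for _ in range(reps):
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    half = 1.96 * sigma / math.sqrt(n)       # the margin of error
    if xbar - half <= mu <= xbar + half:
        hits += 1
coverage = hits / reps                       # should be close to .95

assert abs(coverage - 0.95) < 0.01
```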
In 1981, Charles Stein gave a simple lemma for a normal distribution, and extended
it to the case of a finite number of independent normal variables, which seems in-
nocuous on its face, but has proved to be a really powerful tool in numerous areas
of statistics. It has also had its technical influence on the area of Poisson approxi-
mations, which we briefly discussed in this chapter. We present the basic lemma, its
extension to the case of several independent variables, and show some applications.
It would not be possible to give more than just a small glimpse of the applications of
Stein’s lemma here; the applications are too varied. Regrettably, no comprehensive
book or review of the various applications of Stein’s lemma is available at this time.
1.14 Stein’s Lemma 67
The original article is Stein (1981); Wasserman (2006) and Diaconis and Zabell
(1991) are two of the best sources to learn more about Stein’s lemma.
Theorem 1.47. (a) Let $X \sim N(\mu, \sigma^2)$, and suppose $g : \mathbb{R} \to \mathbb{R}$ is such that g is
differentiable at all but at most a finite number of points, and
(i) for some $\tau < \sigma$, $g(x)\, e^{-\frac{x^2}{2\tau^2}} \to 0$ as $x \to \pm\infty$;
(ii) $E[|g'(X)|] < \infty$.
Then,
$$E[(X - \mu)\,g(X)] = \sigma^2\, E[g'(X)].$$
(b) Let $X_1, X_2, \ldots, X_k$ be independent $N(\mu_i, \sigma^2)$ variables, and suppose $g : \mathbb{R}^k \to \mathbb{R}$ is such that g has a partial derivative with respect to each $x_i$ at
all but at most a finite number of points. Then,
$$E[(X_i - \mu_i)\, g(X_1, X_2, \ldots, X_k)] = \sigma^2\, E\!\left[\frac{\partial}{\partial x_i}\, g(X_1, X_2, \ldots, X_k)\right].$$
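For $\mu = 0$, $\sigma = 1$, and $g(x) = x^{2k-1}$, both sides of Stein's identity can be evaluated exactly using the even-moment formula $E(Z^{2k}) = (2k)!/(2^k k!)$; the check below is our own illustration:

```python
import math

def even_moment(k):
    # E(Z^{2k}) = (2k)! / (2^k k!) for Z ~ N(0, 1); even_moment(0) = 1
    return math.factorial(2 * k) // (2**k * math.factorial(k))

# Stein: E[(X - mu) g(X)] = sigma^2 E[g'(X)].  Take mu = 0, sigma = 1,
# g(x) = x^{2k-1}, so that g'(x) = (2k - 1) x^{2k-2}.
for k in range(1, 6):
    lhs = even_moment(k)                      # E[X * X^{2k-1}] = E[X^{2k}]
    rhs = (2 * k - 1) * even_moment(k - 1)    # E[(2k-1) X^{2k-2}]
    assert lhs == rhs
```

For instance, with k = 2 both sides equal 3, the fourth moment of a standard normal.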
and
$$\left((\mu - 1)x + x^2\right) e^{-\frac{x^2}{2\sigma^2}} \quad \text{is uniformly bounded in } x.$$
The principal applications of Stein's lemma are in statistical theory. Here, we show
a simple application.
In 1981, Herman Chernoff gave a proof of an inequality for the normal distribu-
tion that essentially says that a smoothing operation, such as integration, is going to
reduce the variance of a function. The inequality has since been extensively general-
ized; see Chernoff (1981) for this inequality. We present this inequality, but present
a different proof, which works more generally.
Theorem 1.48. Let $X \sim N(0, 1)$, and let $g : \mathbb{R} \to \mathbb{R}$ be a function such that g is
once continuously differentiable, and $E[(g'(X))^2] < \infty$. Then,
$$\text{Var}(g(X)) \le E[(g'(X))^2].$$
Proof. We need to use the fact that for any random variable Y; ŒE.Y /2 E.Y 2 /;
we choose the variable Y suitably in the proof below. By the fundamental theorem
of calculus,
$$(g(x) - g(0))^2 = \left(\int_0^x g'(t)\, dt\right)^2 = x^2 \left(\frac{1}{x}\int_0^x g'(t)\, dt\right)^2$$
$$\le x^2\, \frac{1}{x}\int_0^x [g'(t)]^2\, dt = x \int_0^x [g'(t)]^2\, dt,$$
Example 1.66. Let $X \sim N(0, 1)$. As a simple example, let $g(X) = (X - a)^2$, where
a is a general constant. With some algebra, the exact variance of g(X) can be found;
we use Chernoff's inequality to find an upper bound on the variance.
Clearly, $g'(x) = 2(x - a)$, and so, by Chernoff's inequality, $\text{Var}[(X - a)^2] \le E[4(X - a)^2] = 4(1 + a^2)$.
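For this example the exact variance is $2 + 4a^2$ (from $\text{Var}(X^2) = 2$, $\text{Var}(X) = 1$, and $\text{Cov}(X^2, X) = 0$ for a standard normal), so the bound can be verified directly; the sketch below is our own:

```python
def exact_var(a):
    # Var[(X - a)^2] = Var(X^2) - 4a Cov(X^2, X) + 4 a^2 Var(X) = 2 + 4 a^2
    return 2 + 4 * a * a

def chernoff_bound(a):
    # E[(g'(X))^2] with g'(x) = 2(x - a): 4 E[(X - a)^2] = 4 (1 + a^2)
    return 4 * (1 + a * a)

for a in [0.0, 0.5, 1.0, 2.0, 5.0]:
    assert exact_var(a) <= chernoff_bound(a)   # slack is exactly 2 for every a
```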
Example 1.67. Consider a general cubic polynomial $g(X) = a + bX + cX^2 + dX^3$,
and suppose that $X \sim N(0, 1)$. The derivative of g is $g'(x) = 3dx^2 + 2cx + b \Rightarrow (g'(x))^2 = 9d^2x^4 + 12cdx^3 + (4c^2 + 6bd)x^2 + 4bcx + b^2$. Because X is standard
normal, $E(X) = E(X^3) = 0$, $E(X^2) = 1$, $E(X^4) = 3$. Thus, by Chernoff's
inequality, for a general cubic polynomial,
$$\text{Var}(a + bX + cX^2 + dX^3) \le b^2 + 4c^2 + 6bd + 27d^2.$$
Let
$$B(g) = \frac{\text{Var}(g(X))}{\sigma^2\, E[(g'(X))^2]}.$$
Many of the special discrete and special continuous distributions that we have
discussed can be well approximated by a normal distribution, for suitable config-
urations of their underlying parameters. Typically, the normal approximation works
well when the parameter values are such that the skewness of the distribution is
small. For example, binomial distributions are well approximated by a normal when
n is large and p is not too small or too large. Gamma distributions are well ap-
proximated by a normal when the shape parameter ˛ is large. There is a unifying
mathematical result here. The unifying mathematical result is one of the most im-
portant results in all of mathematics, and is called the central limit theorem. The
subject of central limit theorems is incredibly diverse. In this section, we present the
basic or the canonical central limit theorem, and present its applications to certain
problems with which we are already familiar. Among numerous excellent references
on central limit theorems, we recommend Feller (1968, 1971) for lucid exposition
and examples. The subject of central limit theorems also has a really interesting his-
tory; we recommend Le Cam (1986) and Stigler (1986) for reading some history
of the central limit theorem. Careful and comprehensive mathematical treatment is
available in Hall (1992) and Bhattacharya and Rao (1986). For a diverse selection
of examples, see DasGupta (2008).
A very important case in which the general central limit theorem applies is the bino-
mial distribution. The CLT allows us to approximate clumsy binomial probabilities
as n ! 1.
The de Moivre–Laplace CLT tells us that if $X \sim \text{Bin}(n, p)$, then we can approximate a probability of the type $P(X \le k)$ as
$$P(X \le k) = P\!\left(\frac{X - np}{\sqrt{np(1-p)}} \le \frac{k - np}{\sqrt{np(1-p)}}\right) \approx \Phi\!\left(\frac{k - np}{\sqrt{np(1-p)}}\right).$$
Note that, in applying the normal approximation in the binomial case, we are using
a continuous distribution to approximate a discrete distribution taking only integer
values. The quality of the approximation improves, sometimes dramatically, if we
fill up the gaps between the successive integers. That is, pretend that an event of the
form $X = x$ really corresponds to $x - \frac12 \le X \le x + \frac12$. In that case, in order to
approximate $P(X \le k)$, we in fact expand the domain of the event to $k + \frac12$, and
approximate $P(X \le k)$ as
$$P(X \le k) \approx \Phi\!\left(\frac{k + \frac12 - np}{\sqrt{np(1-p)}}\right).$$
Example 1.68 (Coin Tossing). This is the simplest example of a normal approxi-
mation of binomial probabilities. We solve a number of problems by applying the
normal approximation method.
First, suppose a fair coin is tossed 100 times. What is the probability that we
obtain between 45 and 55 heads? Denoting X as the number of heads obtained
in 100 tosses, $X \sim \text{Bin}(n, p)$, with $n = 100$, $p = .5$. Therefore, by using the
continuity-corrected normal approximation,
$$P(45 \le X \le 55) \approx \Phi\!\left(\frac{55.5 - 50}{\sqrt{12.5}}\right) - \Phi\!\left(\frac{44.5 - 50}{\sqrt{12.5}}\right)$$
$$= \Phi(1.56) - \Phi(-1.56) = .9406 - .0594 = .8812.$$
So, the probability that the percentage of heads is between 45% and 55% is high, but
not really high, if we toss the coin 100 times. Here is the next question. How many
times do we need to toss a fair coin to be 99% sure that the percentage of heads
will be between 45% and 55%? The percentage of heads is between 45% and 55% if
and only if the number of heads is between $.45n$ and $.55n$. Using the continuity-corrected normal approximation again, we want
$$.99 = \Phi\!\left(\frac{.55n + .5 - .5n}{\sqrt{.25n}}\right) - \Phi\!\left(\frac{.45n - .5 - .5n}{\sqrt{.25n}}\right)$$
$$\Rightarrow .99 = 2\Phi\!\left(\frac{.05n + .5}{\sqrt{.25n}}\right) - 1.$$
Now, from a standard normal table, we find that $\Phi(2.575) = .995$. Therefore, we
equate
$$\frac{.05n + .5}{\sqrt{.25n}} = 2.575 \Rightarrow .05n + .5 = 2.575 \times .5\sqrt{n} = 1.2875\sqrt{n}.$$
Writing $\sqrt{n} = x$, we have here a quadratic equation $.05x^2 - 1.2875x + .5 = 0$ to
solve. The root we want is $x = 25.36$, and squaring it gives $n \approx (25.36)^2 \approx 643$.
Thus, an approximate value of n such that in n tosses of a fair coin, the percentage of
heads will be between 45% and 55% with a 99% probability is $n = 643$. Most people
find that the value of n needed is higher than what they would have guessed.
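Solving the displayed quadratic numerically gives a quick check of the arithmetic (the sketch is our own):

```python
import math

# Solve .05 x^2 - 1.2875 x + .5 = 0 for x = sqrt(n)
a, b, c = 0.05, -1.2875, 0.5
disc = b * b - 4 * a * c
x = (-b + math.sqrt(disc)) / (2 * a)   # the larger root is the one we want
n = math.ceil(x * x)                   # round up to a whole number of tosses

assert abs(x - 25.36) < 0.01
assert n == 643
```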
Example 1.69 (Random Walk). The theory of random walks is one of the most beau-
tiful areas of probability. Here, we give an introductory example that makes use of
the normal approximation to a binomial.
Suppose a drunkard is standing at time zero (say 11:00 PM) at some point, and
every second he either moves one step to the right, or one step to the left, with
equal probability, of where he is at that time. What is the probability that after two
minutes, he will be ten or more steps away from where he started? Note that the
drunkard will take 120 steps in 2 minutes.
Let the drunkard's movement at the ith step be denoted as $X_i$; then, $P(X_i = \pm 1) = .5$. So, we can think of $X_i$ as $X_i = 2Y_i - 1$, where $Y_i \sim \text{Ber}(.5)$, $1 \le i \le n = 120$. If we assume that the drunkard's successive movements $X_1, X_2, \ldots$ are
independent, then $Y_1, Y_2, \ldots$ are also independent, and so, $S_n = Y_1 + Y_2 + \cdots + Y_n \sim \text{Bin}(n, .5)$. Furthermore,
$$P(|2(Y_1 + Y_2 + \cdots + Y_n) - n| \ge 10)$$
$$= P\!\left(S_n \ge \frac{n}{2} + 5\right) + P\!\left(S_n \le \frac{n}{2} - 5\right)$$
$$= P\!\left(\frac{S_n - \frac{n}{2}}{\sqrt{.25n}} \ge \frac{5}{\sqrt{.25n}}\right) + P\!\left(\frac{S_n - \frac{n}{2}}{\sqrt{.25n}} \le -\frac{5}{\sqrt{.25n}}\right) \approx 2\left[1 - \Phi\!\left(\frac{5}{\sqrt{30}}\right)\right] = .3613.$$
The normal approximation to the binomial distribution forms the basis for most of
the confidence intervals for the parameter p in common use. We describe two of
these in this section, the Wald confidence interval and the score confidence interval
for p. The Wald interval used to be the textbook interval, but the score interval is
gaining in popularity due to recent research establishing unacceptably poor proper-
ties of the Wald interval. The derivation of each interval is sketched below.
Let $X \sim \text{Bin}(n, p)$. By the normal approximation to the $\text{Bin}(n, p)$ distribution,
for large n, $X \approx N(np, np(1-p))$, and therefore the standardized binomial variable
$\frac{X - np}{\sqrt{np(1-p)}} \approx N(0, 1)$. This implies
[Figure: simulated sample paths of the drunkard's random walk over 120 steps.]
A famous theorem in probability places an upper bound on the error of the normal
approximation in the central limit theorem. If we make this upper bound itself small,
then we can be confident that the normal approximation will be accurate. This up-
per bound on the error of the normal approximation is known as the Berry–Esseen
bound. Specialized to the binomial case, it says the following; a proof can be seen in
Bhattacharya and Rao (1986) or in Feller (1968). The general Berry–Esseen bound
is treated in this text in Chapter 8.
$$|P(X \le x) - P(Y \le x)| \le \frac{4}{5}\cdot\frac{1 - 2p(1-p)}{\sqrt{np(1-p)}}.$$
$$f_n(x) = \frac{1}{2^n\,(n-1)!} \sum_{k=0}^{\lfloor \frac{n+x}{2} \rfloor} (-1)^k \binom{n}{k} (n + x - 2k)^{n-1}, \quad \text{if } |x| \le n;$$
$$S_n = \sum_{i=1}^{9} X_i \approx N(36, 9).$$
Fig. 1.7 Exact and approximating normal density for sum of uniforms; n D 2
Fig. 1.8 Exact and approximating normal density for sum of uniforms; n D 4
Therefore,
$$P(S_n \le 30) \approx \Phi\!\left(\frac{30 - 36}{3}\right) = \Phi(-2) = .0228.$$
Fig. 1.9 Exact and approximating normal density for sum of uniforms; n D 8
So we see that although the chances of having any accidents in a particular plant in
any particular year are small, collectively, and in the long run the chances are high
that there will be quite a few such accidents.
$$P\!\left(-1.96 \le \frac{X - \mu}{\sqrt{\mu}} \le 1.96\right) \approx .95$$
$$\Leftrightarrow P\!\left(\frac{(X - \mu)^2}{\mu} \le (1.96)^2\right) \approx .95.$$
In statistics, one often treats the parameter $\mu$ as unknown, and uses the data value
X to estimate the unknown $\mu$. The statement above is interpreted as saying that with
approximately 95% probability, $\mu$ will fall inside the interval of values
$$(X + 1.92) - \sqrt{3.69 + 3.84X} \le \mu \le (X + 1.92) + \sqrt{3.69 + 3.84X},$$
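This interval can be reproduced by inverting the quadratic inequality $(X - \mu)^2 \le 1.96^2\,\mu$ directly; the helper name `poisson_score_ci` below is our own, not from the text:

```python
import math

def poisson_score_ci(x):
    # Invert (x - mu)^2 <= 1.96^2 mu, i.e. mu^2 - (2x + 1.96^2) mu + x^2 <= 0,
    # via the quadratic formula for mu
    z2 = 1.96**2
    center = x + z2 / 2                    # = x + 1.92 (to two decimals)
    half = math.sqrt(z2 * x + z2**2 / 4)   # = sqrt(3.84 x + 3.69) (to two decimals)
    return center - half, center + half

lo, hi = poisson_score_ci(10)
# Agrees with the closed form (X + 1.92) -/+ sqrt(3.69 + 3.84 X)
assert math.isclose(lo, 10 + 1.92 - math.sqrt(3.69 + 3.84 * 10), rel_tol=1e-3)
assert math.isclose(hi, 10 + 1.92 + math.sqrt(3.69 + 3.84 * 10), rel_tol=1e-3)
```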
The functions $\hat{f}_{n,0}(x)$, $\hat{f}_{n,1}(x)$, and $\hat{f}_{n,2}(x)$ are called the CLT approximation, the
first-order Edgeworth expansion, and the second-order Edgeworth expansion for the
density of the mean.
The approximations are of the form
$$\Phi(x) + \frac{p_1(x)}{\sqrt{n}}\,\phi(x) + \frac{p_2(x)}{n}\,\phi(x) + \cdots.$$
The relevant polynomials p1 .x/; p2 .x/ are related to some very special polyno-
mials, known as Hermite polynomials. Hermite polynomials are obtained from
successive differentiation of the standard normal density $\phi(x)$. Precisely, the jth
Hermite polynomial $H_j(x)$ is defined by the relation
$$\frac{d^j}{dx^j}\,\phi(x) = (-1)^j\, H_j(x)\,\phi(x).$$
In particular, $H_1(x) = x$, $H_2(x) = x^2 - 1$, $H_3(x) = x^3 - 3x$, and $H_4(x) = x^4 - 6x^2 + 3$.
By comparing the formulas for the refined density approximations to the formulas
for the Hermite polynomials, the connection becomes obvious. They arise in the
density approximation formulas as a matter of fact; there is no intuition for it.
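The Hermite polynomials satisfy the recursion $H_{j+1}(x) = x\,H_j(x) - j\,H_{j-1}(x)$ (a standard fact, stated here as an assumption since the text does not give it), which makes them easy to generate:

```python
def hermite(j, x):
    # Probabilists' Hermite polynomials: H_0 = 1, H_1 = x,
    # and the recursion H_{m+1}(x) = x H_m(x) - m H_{m-1}(x)
    h_prev, h = 1.0, x
    if j == 0:
        return h_prev
    for m in range(1, j):
        h_prev, h = h, x * h - m * h_prev
    return h

# Check against H_2 = x^2 - 1, H_3 = x^3 - 3x, H_4 = x^4 - 6x^2 + 3
for x in [-1.5, 0.0, 2.0]:
    assert hermite(2, x) == x**2 - 1
    assert hermite(3, x) == x**3 - 3*x
    assert hermite(4, x) == x**4 - 6*x**2 + 3
```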
Exercises
Exercise 1.1. The population of Danville is 20,000. Can it be said with certainty
that there must be two or more people in Danville with exactly the same three
initials?
Exercise 1.2. The letters in the word FULL are rearranged at random. What is the
probability that it still spells FULL?
Exercise 1.3 (Skills Exercise). Let E; F , and G be three events. Find expressions
for the following events:
(a) Only E occurs
(b) Both E and G occur, but not F
Exercise 1.4. An urn contains 5 red, 5 black, and 5 white balls. If 3 balls are chosen
without replacement at random, what is the probability that they are of exactly 2
different colors?
Exercise 1.5 (Matching Problem). Four men throw their watches into the sea, and
the sea brings each man one watch back at random. What is the probability that at
least one man gets his own watch back?
Exercise 1.7 * (The General Shoes Problem). There are n pairs of shoes of n dis-
tinct colors in a closet and 2m are pulled out at random from the 2n shoes. What is
the probability that there is at least one complete pair among the shoes pulled?
Exercise 1.8. * n people are lined up at random for a photograph. What is
the probability that a specified set of r people happen to be next to each other?
Exercise 1.9. Calculate the probability that in Bridge, the hand of at least one player
is void in a particular suit.
Exercise 1.10 * (The Rumor Problem). In a town with n residents, someone starts
a rumor by saying it to one of the other $n - 1$ residents. Thereafter, each recipient
passes the rumor on to one of the other residents, chosen at random. What is the
probability that by the kth time that the rumor has been told, it has not come back to
someone who has already heard it?
Exercise 1.11. Jen will call Cathy on Saturday with a 60% probability. She will
call Cathy on Sunday with an 80% probability. The probability that she will call on
neither of the two days is 10%. What is the probability that she will call on Sunday
if she calls on Saturday?
Exercise 1.12. Two distinct cards are drawn, one at a time, from a deck of 52 cards.
The first chosen card is the ace of spades. What is the probability that the second
card is neither an ace nor a spade?
Exercise 1.13. Suppose $P(A) = P(B) = .9$. Give a useful lower bound on
$P(B \mid A)$.
Exercise 1.14. * The probability that a coin will show all heads or all tails when
tossed four times is .25. What is the probability that it will show two heads and two
tails?
Exercise 1.18 (The Parking Problem). At a parking lot, there are 12 places ar-
ranged in a row. A man observed that there were 8 cars parked, and that the four
empty places were adjacent to each other. Given that there are 4 empty places, is
this arrangement surprising?
Exercise 1.19. Suppose a fair die is rolled twice and suppose X is the absolute
value of the difference of the two rolls. Find the pmf and the CDF of X and plot the
CDF. Find a median of X ; is the median unique?
Exercise 1.20 * (A Two-Stage Experiment). Suppose a fair die is rolled once and
the number observed is N . Then a fair coin is tossed N times. Let X be the number
of heads obtained. Find the pmf, the CDF, and the expected value of X . Does the
expected value make sense intuitively?
Exercise 1.21. * Find a discrete random variable X such that $E(X) = E(X^3) = 0$,
$E(X^2) = E(X^4) = 1$.
Exercise 1.22 * (Waiting Time). An urn contains four red and four green balls that
are taken out without replacement, one at a time, at random. Let X be the first draw
at which a green ball is taken out. Find the pmf and the expected value of X .
Exercise 1.23 * (Runs). Suppose a fair die is rolled n times. By using the indicator
variable method, find the expected number of times that a six is followed by at least
two other sixes. Now compute the value when n D 100.
Exercise 1.24. * Suppose a couple will have children until they have at least two
children of each sex. By using the tail sum formula, find the expected value of the
number of children the couple will have.
Exercise 1.27 * (Tail Sum Formula for the Second Moment). Let X be a nonnegative integer-valued random variable. Show that
$$E(X^2) - E(X) = 2\sum_{n=1}^{\infty} n\, P(X > n).$$
Exercise 1.30 (Existence of Some Moments, but Not All). Give an example of
a random variable X taking the values $1, 2, 3, \ldots$ such that $E(X^k) < \infty$ for any
$k < p$ (p is specified), but $E(X^p) = \infty$.
Exercise 1.31. Find the generating function and the mgf of the random variable X
with the pmf $P(X = n) = \frac{1}{2^n}$, $n = 1, 2, 3, \ldots$
Exercise 1.32 (MGF of a Linear Function). Suppose X has the mgf $\psi(t)$. Find
an expression for the mgf of $aX + b$, where a, b are real constants.
Exercise 1.33 (Convexity of the MGF). Suppose X has the mgf $\psi(t)$, finite in
some open interval. Show that $\psi(t)$ is convex in that open interval.
Exercise 1.34. Suppose G(s), H(s) are both generating functions. Show that
$pG(s) + (1-p)H(s)$ is also a valid generating function for any p in (0, 1). What
is an interesting interpretation of the distribution that has $pG(s) + (1-p)H(s)$ as
its generating function?
Exercise 1.35. * Give an example of a random variable X such that X has a finite
mgf at any t, but X 2 does not have a finite mgf at any t > 0.
Exercise 1.36. Suppose a fair coin is tossed n times. Find the probability that ex-
actly half of the tosses result in heads, when n D 10; 30; 50; where does the
probability seem to converge as n becomes large?
Exercise 1.37. Suppose one coin with probability .4 for heads, one with proba-
bility .6 for heads, and one that is a fair coin are each tossed once. Find the pmf of
the total number of heads obtained; is it a binomial distribution?
Exercise 1.38. In repeated rolling of a fair die, find the minimum number of rolls
necessary in order for the probability of at least one six to be:
Exercise 1.40 * (Poisson Approximation). One hundred people will each toss a
fair coin 200 times. Approximate the probability that at least 10 of the 100 people
would each have obtained exactly 100 heads and 100 tails.
Exercise 1.41. Suppose a fair coin is tossed repeatedly. Find the probability that 3
heads will be obtained before 4 tails.
Generalize to r heads and s tails.
Exercise 1.44 (Poisson Approximation). There are 5 coins on a desk, with prob-
abilities .05, .1, .05, .01, and .04 for heads. By using a Poisson approximation, find
the probability of obtaining at least one head when the five coins are each tossed
once.
Is the number of heads obtained binomially distributed in this problem?
Exercise 1.45. Let $X \sim \text{Bin}(n, p)$. Prove that $P(X \text{ is even}) = \frac12 + \frac{(1-2p)^n}{2}$.
Hence, show that $P(X \text{ is even})$ is larger than $\frac12$ for any n if $p < \frac12$, but that it is
larger than $\frac12$ for only even values of n, if $p > \frac12$.
Exercise 1.47. Show that for every p, $0 \le p \le 1$, the function $f(x) = p\sin x + (1-p)\cos x$, $0 \le x \le \pi/2$ (and $f(x) = 0$ otherwise), is a density function. Find
its CDF and use it to find all the medians.
Exercise 1.50 * (Random Division). Jen’s dog broke her six-inch long pencil off
at a random point on the pencil. Find the density function and the expected value of
the ratio of the lengths of the shorter piece and the longer piece of the pencil.
Exercise 1.52 (Percentiles of the Standard Cauchy). Find the pth percentile of
the standard Cauchy density for a general p, and compute it for p D :75.
Exercise 1.53 * (Functional Similarity). Suppose X has the standard Cauchy density. Show that $X - \frac{1}{X}$ also has a Cauchy density.
Can you find another function with this property on your own?
Hint: Think of simple rational functions.
Exercise 1.54 * (An Intriguing Identity). Suppose X has the standard Cauchy
density. Give a rigorous proof that $P(X > 1) = P(X > 2) + P(X > 3)$.
Exercise 1.55 * (Integer Part). Suppose X has a uniform distribution on [0, 10.5].
Find the expected value of the integer part of X.
Exercise 1.56 (The Density Function of the Density Function). Suppose X has a
density function f .x/. Find the density function of f .X / when f .x/ is the standard
normal density.
Exercise 1.58. Suppose X is a positive random variable with mean one. Show that
$E(\log X) \le 0$.
Exercise 1.59. Suppose X is a positive random variable with four finite moments.
Show that $E(X)\,E(X^3) \ge [E(X^2)]^2$.
Exercise 1.60 (Rate Function for Exponential). Derive the rate function I.x/ of
the Chernoff–Bernstein inequality for the standard exponential density, and hence
derive a bound for P .X > x/.
Exercise 1.61 * (Rate Function for the Double Exponential). Derive the rate
function I.x/ of the Chernoff–Bernstein inequality for the double exponential den-
sity and hence derive a bound for P .X > x/:
Exercise 1.62. X is uniformly distributed on some interval Œa; b. If its mean is 2,
and variance is 3, what are the values of a; b?
Exercise 1.63. Let $X \sim U[0,1]$. Find the density of each of the following:
(a) $\frac{X}{3 - X}$;
(b) $\left(X - \frac12\right)^4$;
(c) $\sin 2\pi X$.
Exercise 1.64 * (Mode of a Beta Density). Show that if a Beta density has a mode
in the open interval (0, 1), then we must have $\alpha > 1$, $\alpha + \beta > 2$, in which case the
mode is unique and equals $\frac{\alpha - 1}{\alpha + \beta - 2}$.
$$P(X \le n) = P(G(n + 1, 1) > \lambda),$$
Exercise 1.71 * (A Relation Between Binomial and Beta). Suppose $X \sim \text{Bin}(n, p)$.
Prove that
$$P(X \le k - 1) = P(B(k, n - k + 1) > p),$$
where $B(k, n - k + 1)$ denotes a Beta random variable with parameters k and $n - k + 1$.
Exercise 1.72. Suppose X has the standard Gumbel density. Find the density of
$e^{-X}$.
Exercise 1.73. Suppose X is uniformly distributed on [0, 1]. Find the density of
$\log\log\frac{1}{X}$.
Exercise 1.76. The 25th and the 75th percentiles of a normally distributed random
variable are $-1$ and $+1$. What is the probability that the random variable is between
$-2$ and $+2$?
Exercise 1.77 (Standard Normal CDF in Terms of the Error Function). In some places, instead of the standard normal CDF, one sees use of the error function

erf(x) = (2/√π) ∫₀ˣ e^{−t²} dt.

Express Φ(x) in terms of erf(x).
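Readers who want to check their answer to this exercise numerically can compare the erf-based expression with the standard normal CDF obtained by direct numerical integration of the density. A minimal sketch using only the Python standard library (the function names are ours, chosen for illustration):

```python
import math

def phi_quadrature(x, n=100_000):
    # Standard normal CDF by trapezoidal integration of the density
    # from -10 (effectively minus infinity) up to x.
    a = -10.0
    h = (x - a) / n
    dens = lambda t: math.exp(-t * t / 2) / math.sqrt(2 * math.pi)
    s = 0.5 * (dens(a) + dens(x)) + sum(dens(a + i * h) for i in range(1, n))
    return s * h

def phi_via_erf(x):
    # The identity the exercise asks for: Phi(x) = (1 + erf(x / sqrt(2))) / 2.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

for x in (-2.0, 0.0, 1.0, 1.96):
    assert abs(phi_quadrature(x) - phi_via_erf(x)) < 1e-6
```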
Exercise 1.81 (Median and Mode of Lognormal). Show that a general lognormal
density is unimodal, and find its mode and median.
Hint: For the median, remember that a lognormal variable is e^X, where X is a normal variable.
Exercise 1.83. Suppose X ~ N(0, 1), Y ~ N(0, 9), and X, Y are independent. Find the value of P((X − Y)² > 5).
Exercise 1.84. A fair die is rolled 25 times. Let X be the number of times a six is obtained. Find the exact value of P(X = 6), and compare it to a normal approximation of P(X = 6).
Exercise 1.85. A basketball player has a history of converting 80% of his free
throws. Find a normal approximation with a continuity correction of the probability
that he will make between 18 and 22 throws out of 25 free throws.
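An answer to an exercise of this type can always be checked against the exact binomial probability. A sketch (standard library only; the continuity-corrected interval 17.5 to 22.5 is the usual choice):

```python
import math
from math import comb

def phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p = 25, 0.8
# Exact: P(18 <= X <= 22) for X ~ Bin(25, 0.8).
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(18, 23))

mu = n * p                        # 20
sd = math.sqrt(n * p * (1 - p))   # 2
# Continuity correction: P(18 <= X <= 22) is approximated by
# P(17.5 < N(mu, sd^2) < 22.5).
approx = phi((22.5 - mu) / sd) - phi((17.5 - mu) / sd)

assert abs(exact - approx) < 0.02
```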
Exercise 1.86 (Airline Overbooking). An airline knows from past experience that
10% of fliers with a confirmed reservation do not show up for the flight. Suppose a
flight has 250 seats. How many reservations over 250 can the airline permit, if they
want to be 95% sure that no more than two passengers with a confirmed reservation
would have to be bumped?
Exercise 1.90 (Anything That Can Happen Will Eventually Happen). If you predict in advance the outcomes of 10 tosses of a fair coin, the probability that you get them all correct is (0.5)¹⁰, which is very small. Show that if 2000 people each try to predict the 10 outcomes correctly, the chance that at least one of them succeeds is better than 85%.
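The computation behind this exercise is a one-liner, using independence of the 2000 predictors; a sketch:

```python
p_single = 0.5 ** 10                          # one person's chance: 1/1024
p_at_least_one = 1 - (1 - p_single) ** 2000   # about 0.858
assert p_at_least_one > 0.85
```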
Exercise 1.91 * (Random Walk). Consider the drunkard's random walk example. Find the probability that the drunkard will be at least 10 steps to the right of his starting point after 200 steps. Compute a normal approximation.
Exercise 1.92 (Test Your Intuition). Suppose a fair coin is tossed 100 times.
Which is more likely: you will get exactly 50 heads, or you will get more than
60 heads?
Exercise 1.93 * (Density of Uniform Sums). Give a direct proof that the density of S_n/√(n/3) at zero converges to φ(0), where S_n is the sum of n independent U[−1, 1] variables.
Exercise 1.94 (Confidence Interval for Poisson Mean). Derive a formula for an approximate 99% confidence interval for a Poisson mean, by using the normal approximation to the Poisson distribution. Compare your formula to the formula for an approximate 95% confidence interval that was worked out in the text. Compute the 95% and the 99% confidence intervals if X = 5, 8, 12.
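A sketch of one such formula, assuming the simplest normal approximation (λ̂ = X with estimated standard error √X; the exact formula intended in the text may differ, e.g. by solving a quadratic in λ):

```python
import math

def poisson_ci(x, z):
    # Approximate CI for a Poisson mean lambda, based on X ~ N(lambda, lambda)
    # with lambda estimated by X.  This is an illustrative sketch only.
    half = z * math.sqrt(x)
    return (max(0.0, x - half), x + half)

for x in (5, 8, 12):
    lo95, hi95 = poisson_ci(x, 1.96)    # 95% critical value
    lo99, hi99 = poisson_ci(x, 2.576)   # 99% critical value
    # The 99% interval always contains the 95% interval.
    assert lo99 <= lo95 < x < hi95 <= hi99
```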
Chapter 2
Multivariate Discrete Distributions
Example 2.1 (Coin Tossing). Consider the experiment of tossing a fair coin three
times. Let X be the number of heads among the first two tosses, and Y the num-
ber of heads among the last two tosses. If we consider X and Y individually, we
realize immediately that they are each Bin(2, 0.5) random variables. But the individual distributions hide part of the full story. For example, if we knew that X was 2,
then that would imply that Y must be at least 1. Thus, their joint behavior cannot
be fully understood from their individual distributions; we must study their joint
distribution.
Here is what we mean by their joint distribution. The sample space of this experiment is

{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.

Each sample point has an equal probability 1/8. Denoting the sample points as ω₁, ω₂, …, ω₈, we see that if ω₁ prevails, then X(ω₁) = Y(ω₁) = 2, but if ω₂ prevails, then X(ω₂) = 2, Y(ω₂) = 1. The combinations of all possible values of (X, Y) are
.X; Y / are
(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2).
The joint distribution of (X, Y) provides the probability p(x, y) = P(X = x, Y = y) for each such combination of possible values (x, y). Indeed, by direct counting using the eight equally likely sample points, we see that

p(0, 0) = 1/8, p(0, 1) = 1/8, p(0, 2) = 0, p(1, 0) = 1/8, p(1, 1) = 1/4;
p(1, 2) = 1/8, p(2, 0) = 0, p(2, 1) = 1/8, p(2, 2) = 1/8.
            Y
  X      0      1      2
  0     1/8    1/8     0
  1     1/8    1/4    1/8
  2      0     1/8    1/8
Such a layout is a convenient way to present the joint distribution of two discrete
random variables with a small number of values. The distribution itself is called the
joint pmf ; here is a formal definition.
Definition 2.1. Let X, Y be two discrete random variables with respective sets of values x1, x2, …, and y1, y2, …, defined on a common sample space Ω. The joint pmf of X, Y is defined to be the function p(x_i, y_j) = P(X = x_i, Y = y_j), i, j ≥ 1, and p(x, y) = 0 at any other point (x, y) in ℝ².
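Returning to the coin-tossing example, the joint pmf worked out above can be verified by brute-force enumeration of the eight outcomes; a minimal sketch:

```python
from itertools import product
from collections import Counter

# Enumerate the 8 equally likely outcomes of three fair-coin tosses.
pmf = Counter()
for tosses in product("HT", repeat=3):
    x = tosses[0:2].count("H")   # heads among the first two tosses
    y = tosses[1:3].count("H")   # heads among the last two tosses
    pmf[(x, y)] += 1 / 8

assert abs(pmf[(1, 1)] - 1 / 4) < 1e-12
assert abs(pmf[(0, 1)] - 1 / 8) < 1e-12
assert pmf[(0, 2)] == 0 and pmf[(2, 0)] == 0
```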
Example 2.2 (Maximum and Minimum of Dice Rolls). A fair die is rolled twice; let X be the larger and Y the smaller of the two rolls. The joint pmf p(x, y) is given in the following table:

            Y
  X      1      2      3      4      5      6
  1    1/36     0      0      0      0      0
  2    1/18   1/36     0      0      0      0
  3    1/18   1/18   1/36     0      0      0
  4    1/18   1/18   1/18   1/36     0      0
  5    1/18   1/18   1/18   1/18   1/36     0
  6    1/18   1/18   1/18   1/18   1/18   1/36
The individual pmfs of X, Y are easily recovered from the joint distribution. For example,

P(X = 1) = Σ_{y=1}^{6} P(X = 1, Y = y) = 1/36,  and

P(X = 2) = Σ_{y=1}^{6} P(X = 2, Y = y) = 1/18 + 1/36 = 1/12,
and so on. The individual pmfs are obtained by summing the joint probabilities over all values of the other variable. They are:

  x        1      2      3      4      5      6
  pX(x)  1/36   3/36   5/36   7/36   9/36  11/36

  y        1      2      3      4      5      6
  pY(y) 11/36   9/36   7/36   5/36   3/36   1/36
Example 2.3. Suppose X, Y have the joint pmf p(x, y) = c(x + y), 1 ≤ x, y ≤ n. The constant c is determined by the requirement that the total probability be one:

Σ_{x=1}^{n} Σ_{y=1}^{n} p(x, y) = 1
⟺ c Σ_{x=1}^{n} Σ_{y=1}^{n} (x + y) = 1
⟺ c Σ_{x=1}^{n} [nx + n(n + 1)/2] = 1
⟺ c [n²(n + 1)/2 + n²(n + 1)/2] = 1
⟺ c n²(n + 1) = 1
⟺ c = 1/(n²(n + 1)).
The joint pmf is symmetric between x and y (because x + y = y + x), and so X, Y have the same marginal pmf. For example, X has the pmf

pX(x) = Σ_{y=1}^{n} p(x, y) = [1/(n²(n + 1))] Σ_{y=1}^{n} (x + y)
      = [1/(n²(n + 1))] [nx + n(n + 1)/2]
      = x/(n(n + 1)) + 1/(2n),  1 ≤ x ≤ n.
Suppose now we want to compute P(X > Y). This can be found by summing p(x, y) over all combinations for which x > y. But this longer calculation can be avoided by using a symmetry argument that is often very useful. Note that because the joint pmf is symmetric between x and y, we must have P(X > Y) = P(Y > X) = p (say). But, also,

P(X > Y) + P(Y > X) + P(X = Y) = 1 ⟹ 2p + P(X = Y) = 1
⟹ p = [1 − P(X = Y)]/2.

Now,

P(X = Y) = Σ_{x=1}^{n} p(x, x) = c Σ_{x=1}^{n} 2x = [1/(n²(n + 1))] n(n + 1) = 1/n.

Therefore, P(X > Y) = p = (n − 1)/(2n) ≈ 1/2 for large n.
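Both the normalizing constant and the value of P(X > Y) can be confirmed exactly by enumeration; a sketch with n = 6, using exact rational arithmetic from the standard library:

```python
from fractions import Fraction

n = 6
c = Fraction(1, n * n * (n + 1))   # the normalizing constant derived above
pmf = {(x, y): c * (x + y)
       for x in range(1, n + 1) for y in range(1, n + 1)}

assert sum(pmf.values()) == 1      # total probability is one
p_x_gt_y = sum(p for (x, y), p in pmf.items() if x > y)
assert p_x_gt_y == Fraction(n - 1, 2 * n)
```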
Example 2.4 (Dice Rolls Revisited). Consider again the example of two rolls of a fair die, and suppose X, Y are the larger and the smaller of the two rolls. We have worked out the joint distribution of (X, Y) in Example 2.2. Suppose we want to find the distribution of the difference, X − Y. The possible values of X − Y are 0, 1, …, 5, and we find P(X − Y = k) by using the joint distribution of (X, Y):

P(X − Y = 0) = p(1, 1) + p(2, 2) + ⋯ + p(6, 6) = 1/6;
P(X − Y = 1) = p(2, 1) + p(3, 2) + ⋯ + p(6, 5) = 5/18;
P(X − Y = 2) = p(3, 1) + p(4, 2) + p(5, 3) + p(6, 4) = 2/9;
P(X − Y = 3) = p(4, 1) + p(5, 2) + p(6, 3) = 1/6;
P(X − Y = 4) = p(5, 1) + p(6, 2) = 1/9;
P(X − Y = 5) = p(6, 1) = 1/18.

There is no way to find the distribution of X − Y except by using the joint distribution of (X, Y).
Suppose now we also want to know the expected value of X − Y. Now that we have the distribution of X − Y worked out, we can find the expectation by directly using the definition of expectation:

E(X − Y) = Σ_{k=0}^{5} k P(X − Y = k) = 5/18 + 4/9 + 1/2 + 4/9 + 5/18 = 35/18.

Alternatively, by linearity of expectation and the marginal pmfs worked out in Example 2.2,

E(X − Y) = E(X) − E(Y) = 161/36 − 91/36 = 35/18.
Sometimes we want to know what the expected value is of one of the variables,
say X , if we knew the value of the other variable Y . For example, in the die tossing
experiment above, what should we expect the larger of the two rolls to be if the
smaller roll is known to be 2?
To answer this question, we have to find the probabilities of the various values
of X , conditional on knowing that Y equals some given y, and then average by
using these conditional probabilities. Here are the formal definitions.
Definition 2.4 (Conditional Distribution). Let (X, Y) have the joint pmf p(x, y). The conditional distribution of X given Y = y is defined to be

p(x|y) = P(X = x | Y = y) = p(x, y)/pY(y),

for any y such that pY(y) > 0.
Definition 2.5 (Independence). Let .X; Y / have the joint pmf p.x; y/. Then X; Y
are said to be independent if
The third equivalent condition in the above list is usually the most convenient one
to verify and use.
One more frequently useful fact about conditional expectations is the following.
Proposition. Suppose X, Y are independent random variables. Then, for any function g(X) such that the expectations below exist, and for any y,

E[g(X) | Y = y] = E[g(X)].
Example 2.5 (Maximum and Minimum in Dice Rolls). In the experiment of two rolls of a fair die, we have worked out the joint distribution of (X, Y), where X is the larger and Y the smaller of the two rolls. Using this joint distribution, we can now find the conditional distributions. For instance,
            Y
  X      0      1
  0      s      t
  1      u      v

E(X|Y = 0) = [0 · p(0, 0) + 1 · p(1, 0)] / [p(0, 0) + p(1, 0)] = u/(s + u);
E(X|Y = 1) = [0 · p(0, 1) + 1 · p(1, 1)] / [p(0, 1) + p(1, 1)] = v/(t + v).

Therefore,

E(X|Y = 1) − E(X|Y = 0) = v/(t + v) − u/(s + u) = (vs − ut)/[(t + v)(s + u)],

and, for y = 0, 1,

E(X|Y = y) = u/(s + u) + {(vs − ut)/[(t + v)(s + u)]} y,
In the dice example, with X the maximum and Y the minimum of two rolls of a fair die,

E(X|Y = 1) = [1·(1/36) + (2 + 3 + 4 + 5 + 6)·(1/18)] / [1/36 + 5·(1/18)] = (41/36)/(11/36) = 41/11 ≈ 3.73;

as another example,

E(X|Y = 3) = [3·(1/36) + (4 + 5 + 6)·(1/18)] / [1/36 + 3·(1/18)] = (33/36)/(7/36) = 33/7 ≈ 4.71;

and,

E(X|Y = 5) = [5·(1/36) + 6·(1/18)] / [1/36 + 1/18] = (17/36)/(3/36) = 17/3 ≈ 5.67.
We notice that E(X|Y = 5) > E(X|Y = 3) > E(X|Y = 1); in fact, it is true that E(X|Y = y) is increasing in y in this example. This does make intuitive sense.
Just as in the case of a distribution of a single variable, we often also want a mea-
sure of variability in addition to a measure of average for conditional distributions.
This motivates defining a conditional variance.
Definition 2.6 (Conditional Variance). Let (X, Y) have the joint pmf p(x, y). Let μX(y) = E(X|Y = y). The conditional variance of X given Y = y is defined to be

Var(X|Y = y) = E[(X − μX(y))² | Y = y] = Σ_x (x − μX(y))² p(x|y).
Example 2.8 (Conditional Variance in Dice Experiment). We work out the conditional variance of the maximum of two rolls of a die given the minimum. That is, suppose a fair die is rolled twice, and X, Y are the larger and the smaller of the two rolls; we want to compute Var(X|y).
For example, if y = 3, then μX(y) = E(X|Y = y) = E(X|Y = 3) = 4.71 (see the previous example). Therefore,

Var(X|3) = Σ_x (x − 4.71)² p(x|3)
         = [(3 − 4.71)²·(1/36) + (4 − 4.71)²·(1/18) + (5 − 4.71)²·(1/18) + (6 − 4.71)²·(1/18)] / [1/36 + 1/18 + 1/18 + 1/18]
         = 1.06.
To summarize, given that the minimum of two rolls of a fair die is 3, the expected
value of the maximum is 4.71 and the variance of the maximum is 1.06.
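Both conditional summaries can be verified by enumerating the 36 equally likely outcomes; a sketch:

```python
from itertools import product

# (max, min) for each of the 36 equally likely outcomes of two die rolls.
outcomes = [(max(a, b), min(a, b)) for a, b in product(range(1, 7), repeat=2)]

def cond_mean_var(y):
    xs = [x for x, ymin in outcomes if ymin == y]   # X-values given Y = y
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, v

m3, v3 = cond_mean_var(3)
assert abs(m3 - 33 / 7) < 1e-12    # E(X | Y = 3), about 4.71
assert abs(v3 - 1.06) < 0.01       # Var(X | Y = 3), about 1.06
```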
These two values, E(X|y) and Var(X|y), change as we change the given value y. Thus E(X|y) and Var(X|y) are functions of y, and for each separate y, a new calculation is needed. If X, Y happen to be independent, then of course, whatever y is, E(X|y) = E(X), and Var(X|y) = Var(X).
The next result is an important one in many applications.
Suppose, for example, that X ~ Poi(λ) and Y ~ Poi(μ) are independent. Then, for 0 ≤ x ≤ t,

P(X = x | X + Y = t) = P(X = x, X + Y = t)/P(X + Y = t)
                     = P(X = x, Y = t − x)/P(X + Y = t)
                     = [e^{−λ} λˣ/x!] [e^{−μ} μ^{t−x}/(t − x)!] · t!/[e^{−(λ+μ)} (λ + μ)ᵗ]
                     = (t choose x) [λ/(λ + μ)]ˣ [μ/(λ + μ)]^{t−x},

which is the pmf of the Bin(t, λ/(λ + μ)) distribution. □
Proof. We prove this for the discrete case. By definition of conditional expectation,

μX(y) = Σ_x x p(x, y) / pY(y)

⟹ Σ_y μX(y) pY(y) = Σ_y Σ_x x p(x, y) = Σ_x Σ_y x p(x, y)
                   = Σ_x x Σ_y p(x, y) = Σ_x x pX(x) = E(X).
The corresponding variance calculation formula is the following. The proof of this uses the iterated mean formula above, applying it to (X − μX)². □
Remark. These two formulas for iterated expectation and iterated variance are valid
for all types of variables, not just the discrete ones. Thus, these same formulas still
hold when we discuss joint distributions for continuous random variables in the next
chapter.
Some operational formulas that one should be familiar with are summarized
below.
Let us see some applications of the two iterated expectation and iterated variance
formulas.
Example 2.9 (A Two-Stage Experiment). Suppose n fair dice are rolled. Those that
show a six are rolled again. What are the mean and the variance of the number of
sixes obtained in the second round of this experiment?
Define Y to be the number of dice in the first round that show a six, and X the number of dice in the second round that show a six. Given Y = y, X ~ Bin(y, 1/6), and Y itself is distributed as Bin(n, 1/6). Therefore,

E(X) = E[E(X|Y)] = E[Y/6] = n/36.
Also, by the iterated variance formula,

Var(X) = E[Var(X|Y)] + Var[E(X|Y)] = E[Y(1/6)(5/6)] + Var(Y/6)
       = 5n/216 + 5n/1296 = 35n/1296.
A similar computation applies to the chicken–eggs example, in which a chicken lays a Poisson(λ) number of eggs and each egg is fertilized, independently, with probability p. Interestingly, the number of eggs actually fertilized has the same mean and variance, λp. (Can you see why?)
Remark. In all of these examples, it is important to choose wisely the variable Y on which one conditions; the efficiency of the technique depends crucially on this choice.
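The mean found in the two-stage dice experiment can be checked exactly, by summing over the distribution of the first-round count Y; a sketch in exact rational arithmetic:

```python
from math import comb
from fractions import Fraction

def binom_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n = 12
p = Fraction(1, 6)
# E(X) = sum over y of P(Y = y) * E(X | Y = y), where E(X | Y = y) = y/6.
mean_x = sum(binom_pmf(n, p, y) * y * p for y in range(n + 1))
assert mean_x == Fraction(n, 36)
```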
Sometimes a formal generalization of the iterated expectation formula when a
third variable Z is present is useful. It is particularly useful in hierarchical statis-
tical modeling of distributions, where an ultimate marginal distribution for some
X is constructed by first conditioning on a number of auxiliary variables, and then
gradually unconditioning them. We state the more general iterated expectation for-
mula; its proof is exactly similar to that of the usual iterated expectation formula.
We know that variance is additive for independent random variables; that is, if X1, X2, …, Xn are independent random variables, then Var(X1 + X2 + ⋯ + Xn) = Var(X1) + ⋯ + Var(Xn). In particular, for two independent random variables X, Y, Var(X + Y) = Var(X) + Var(Y). However, in general, variance is not additive. Let us do the general calculation for Var(X + Y).
We thus have the extra term 2[E(XY) − E(X)E(Y)] in the expression for Var(X + Y); of course, when X, Y are independent, E(XY) = E(X)E(Y), and so the extra term drops out. But, in general, one has to keep the extra term. The quantity E(XY) − E(X)E(Y) is called the covariance of X and Y.
Definition 2.7 (Covariance). Let X, Y be two random variables defined on a common sample space Ω, such that E(XY), E(X), E(Y) all exist. The covariance of X and Y is defined as

Cov(X, Y) = E(XY) − E(X)E(Y),

and the correlation of X and Y is defined as

ρ_{X,Y} = Cov(X, Y) / [√Var(X) √Var(Y)].
Some important properties of covariance and correlation are put together in the next
theorem.
Theorem 2.6 (Properties of Covariance and Correlation). Provided that the required variances and covariances exist,

(a) Cov(X, c) = 0 for any X and any constant c;
(b) Cov(X, X) = Var(X) for any X;
(c) Cov(Σ_{i=1}^{n} a_i X_i, Σ_{j=1}^{m} b_j Y_j) = Σ_{i=1}^{n} Σ_{j=1}^{m} a_i b_j Cov(X_i, Y_j),

and in particular,

Var(aX + bY) = Cov(aX + bY, aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y),

and

Var(Σ_{i=1}^{n} X_i) = Σ_{i=1}^{n} Var(X_i) + 2 Σ_{i<j} Cov(X_i, X_j);

(d) Cov(X, Y) = 0 if X, Y are independent;
(e) ρ_{a+bX, c+dY} = sgn(bd) ρ_{X,Y};
(f) −1 ≤ ρ_{X,Y} ≤ 1.
Proof of part (c):

Cov(Σ_{i=1}^{n} a_i X_i, Σ_{j=1}^{m} b_j Y_j)
= Σ_{i=1}^{n} Σ_{j=1}^{m} a_i b_j E(X_i Y_j) − Σ_{i=1}^{n} Σ_{j=1}^{m} a_i b_j E(X_i)E(Y_j)
= Σ_{i=1}^{n} Σ_{j=1}^{m} a_i b_j [E(X_i Y_j) − E(X_i)E(Y_j)]
= Σ_{i=1}^{n} Σ_{j=1}^{m} a_i b_j Cov(X_i, Y_j).
Part (d) follows on noting that E(XY) = E(X)E(Y) if X, Y are independent. For part (e), first note that Cov(a + bX, c + dY) = bd Cov(X, Y) by using part (a) and part (c). Also, Var(a + bX) = b² Var(X), Var(c + dY) = d² Var(Y)

⟹ ρ_{a+bX, c+dY} = bd Cov(X, Y) / [√(b² Var(X)) √(d² Var(Y))]
                 = bd Cov(X, Y) / [|b| √Var(X) |d| √Var(Y)]
                 = (bd/|bd|) ρ_{X,Y} = sgn(bd) ρ_{X,Y}.

The proof of part (f) uses the Cauchy–Schwarz inequality (see Chapter 1): for any two random variables U, V, [E(UV)]² ≤ E(U²)E(V²). Let

U = (X − E(X))/√Var(X),  V = (Y − E(Y))/√Var(Y).
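The bilinearity identities of Theorem 2.6 can be checked numerically on made-up data (sample covariances obey the same algebra); a sketch:

```python
import random

random.seed(7)
xs = [random.random() for _ in range(4000)]
ys = [x + random.random() for x in xs]   # correlated with xs by construction

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return mean([(a - mu) * (b - mv) for a, b in zip(u, v)])

a, b = 2.0, -3.0
zs = [a * x + b * y for x, y in zip(xs, ys)]
lhs = cov(zs, zs)   # Var(aX + bY), computed as a covariance with itself
rhs = a * a * cov(xs, xs) + b * b * cov(ys, ys) + 2 * a * b * cov(xs, ys)
assert abs(lhs - rhs) < 1e-9
```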
Example 2.11 (Correlation Between Minimum and Maximum in Dice Rolls). Consider again the experiment of rolling a fair die twice, and let X, Y be the maximum and the minimum of the two rolls. We want to find the correlation between X, Y. The joint distribution of (X, Y) was worked out in Example 2.2. From the joint distribution,

E(XY) = 1·(1/36) + 2·(1/18) + 4·(1/36) + 3·(1/18) + 6·(1/18) + 9·(1/36) + ⋯ + 30·(1/18) + 36·(1/36) = 49/4.

The marginal pmfs of X, Y were also worked out in Example 2.2. From the marginal pmfs, by direct calculation, E(X) = 161/36, E(Y) = 91/36, Var(X) = Var(Y) = 2555/1296. Therefore,

ρ_{X,Y} = [49/4 − (161/36)(91/36)] / (2555/1296) = 1225/2555 = 35/73 ≈ 0.48.
The correlation between the maximum and the minimum is in fact positive for any number of rolls of a die, although the correlation converges to zero as the number of rolls goes to ∞.
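All the moments used here can be verified exactly by enumeration; a sketch in exact rational arithmetic (the squared correlation is compared, to avoid square roots of fractions):

```python
from fractions import Fraction
from itertools import product

pairs = [(max(a, b), min(a, b)) for a, b in product(range(1, 7), repeat=2)]
n = Fraction(len(pairs))
ex = sum(Fraction(x) for x, _ in pairs) / n
ey = sum(Fraction(y) for _, y in pairs) / n
exy = sum(Fraction(x * y) for x, y in pairs) / n
vx = sum((Fraction(x) - ex) ** 2 for x, _ in pairs) / n
vy = sum((Fraction(y) - ey) ** 2 for _, y in pairs) / n

assert ex == Fraction(161, 36) and ey == Fraction(91, 36)
assert exy == Fraction(49, 4)
rho_sq = (exy - ex * ey) ** 2 / (vx * vy)
assert rho_sq == Fraction(35, 73) ** 2
```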
Example 2.12 (Correlation in the Chicken–Eggs Example). Consider again the example of a chicken laying a Poisson number of eggs N with mean λ, and each egg fertilizing, independently of the others, with probability p. If X is the number of eggs actually fertilized, we want to find the correlation between the number of eggs laid and the number fertilized, that is, the correlation between X and N.
First, E(XN) = E[N E(X|N)] = p E(N²) = p(λ² + λ), so that Cov(X, N) = p(λ² + λ) − (λp)λ = λp. Since Var(N) = λ and Var(X) = λp,

ρ_{X,N} = λp / √(λ · λp) = √p.

Thus, the correlation goes up with the fertility rate of the eggs.
Example 2.13 (Best Linear Predictor). Suppose X and Y are two jointly distributed
random variables, and either by necessity, or by omission, the variable Y was not
observed. But X was observed, and there may be some information in the X value
about Y . The problem is to predict Y by using X . Linear predictors, because of their
functional simplicity, are appealing. The mathematical problem is to choose the best
linear predictor a C bX of Y , where best is defined as the predictor that minimizes
the mean squared error EŒY .a C bX /2 . We show that the answer has something
to do with the covariance between X and Y .
By expanding the square, the mean squared error R(a, b) = E[Y − (a + bX)]² is a quadratic function of (a, b). Setting its partial derivatives equal to zero,

∂R/∂a = 2a + 2bE(X) − 2E(Y) = 0 ⟺ a + bE(X) = E(Y);
∂R/∂b = 2bE(X²) + 2aE(X) − 2E(XY) = 0 ⟺ aE(X) + bE(X²) = E(XY).

Solving the two equations, b = Cov(X, Y)/Var(X) and a = E(Y) − bE(X), and therefore the

best linear predictor of Y = E(Y) − [Cov(X, Y)/Var(X)] E(X) + [Cov(X, Y)/Var(X)] X
                           = E(Y) + [Cov(X, Y)/Var(X)] [X − E(X)].
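The covariance formula for the best linear predictor can be tried out on simulated data (the linear-plus-noise model below is made up for illustration); any perturbation of the fitted coefficients should fail to lower the mean squared error:

```python
import random

random.seed(1)
xs = [random.gauss(0, 1) for _ in range(5000)]
ys = [2 * x + 1 + random.gauss(0, 0.5) for x in xs]   # illustrative data

def mean(v):
    return sum(v) / len(v)

mx, my = mean(xs), mean(ys)
cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
var = mean([(x - mx) ** 2 for x in xs])
b = cov / var            # Cov(X, Y) / Var(X), as derived above
a = my - b * mx          # E(Y) - b E(X)

def mse(a_, b_):
    return mean([(y - (a_ + b_ * x)) ** 2 for x, y in zip(xs, ys)])

base = mse(a, b)
for da, db in ((0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)):
    assert mse(a + da, b + db) >= base
```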
Example 2.14 (Zero Correlation Does Not Mean Independence). If X, Y are independent, then necessarily Cov(X, Y) = 0, and hence the correlation is also zero. The converse is not true. Take a three-valued random variable X with the pmf P(X = ±1) = p, P(X = 0) = 1 − 2p, 0 < p < 1/2. Let the other variable Y be Y = X². Then E(XY) = E(X³) = 0, and E(X)E(Y) = 0, because E(X) = 0. Therefore, Cov(X, Y) = 0. But X, Y are certainly not independent; for example, P(Y = 0 | X = 0) = 1, but P(Y = 0) = 1 − 2p < 1.
Indeed, if X has a distribution symmetric around zero, and if X has three finite moments, then X and X² always have a zero correlation, although they are not independent.
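This counterexample is small enough to check exactly; a sketch with p = 1/4:

```python
from fractions import Fraction

p = Fraction(1, 4)
pmf_x = {-1: p, 0: 1 - 2 * p, 1: p}

e_x = sum(x * q for x, q in pmf_x.items())
e_y = sum(x * x * q for x, q in pmf_x.items())        # Y = X^2
e_xy = sum(x * x * x * q for x, q in pmf_x.items())   # XY = X^3
assert e_xy - e_x * e_y == 0                          # covariance is zero

p_y0_given_x0 = 1              # X = 0 forces Y = 0
p_y0 = pmf_x[0]                # 1 - 2p < 1
assert p_y0_given_x0 != p_y0   # yet X and Y are dependent
```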
The extension of the concepts for the bivariate discrete case to the multivariate dis-
crete case is straightforward. We give the appropriate definitions and an important
example, namely that of the multinomial distribution, an extension of the binomial
distribution.
Once again, we mention that it is not convenient or interesting to work with the
CDF for discrete random variables; for discrete variables, it is preferable to work
with the pmf.
Analogous to the case of one random variable, we can define the joint mgf for sev-
eral random variables. The definition is the same for all types of random variables,
discrete or continuous, or other mixed types. As in the one-dimensional case, the
joint mgf of several random variables is also a very useful tool. First, we repeat the definition of the expectation of a function of several random variables; see Chapter 1, where it was first introduced and defined. The definition below is equivalent to what was given in Chapter 1.
Remark. It is important to note that the last two theorems are not limited to discrete
random variables; they are valid for general random variables. The proofs of these
two theorems follow the same arguments as in the one-dimensional case, namely
that when an mgf exists in a nonempty open rectangle, it can be differentiated in-
finitely often with respect to each variable ti inside the expectation; that is, the order
of the derivative and the expectation can be interchanged.
One of the most important multivariate discrete distributions is the multinomial dis-
tribution. The multinomial distribution corresponds to n balls being distributed to k
cells, independently, with each ball having the probability pi of being dropped into
the i th cell. The random variables under consideration are X1 ; X2 ; : : : ; Xk , where
Xi is the number of balls that get dropped into the i th cell. Then their joint pmf is
the multinomial pmf defined below.
Definition 2.13. A multivariate random vector (X1, X2, …, Xk) is said to have a multinomial distribution with parameters n, p1, p2, …, pk if it has the pmf

P(X1 = x1, X2 = x2, …, Xk = xk) = [n!/(x1! x2! ⋯ xk!)] p1^{x1} p2^{x2} ⋯ pk^{xk},

x_i ≥ 0, Σ_{i=1}^{k} x_i = n, p_i ≥ 0, Σ_{i=1}^{k} p_i = 1.

We write (X1, X2, …, Xk) ~ Mult(n, p1, …, pk) to denote a random vector with a multinomial distribution.
Example 2.15 (Dice Rolls). Suppose a fair die is rolled 30 times. We want to find the probabilities that
(i) each face is obtained exactly five times;
(ii) the number of sixes is at least five.
If we denote the number of times face number i is obtained as Xi, then (X1, X2, …, X6) ~ Mult(n, p1, …, p6), where n = 30 and each p_i = 1/6. Therefore,

P(X1 = 5, X2 = 5, …, X6 = 5) = [30!/(5!)⁶] (1/6)⁵ ⋯ (1/6)⁵ = [30!/(5!)⁶] (1/6)³⁰ = 0.0004.
Next, each of the 30 rolls will either be a six or not, independently of the other rolls, with probability 1/6, and so X6 ~ Bin(30, 1/6). Therefore,

P(X6 ≥ 5) = 1 − P(X6 ≤ 4) = 1 − Σ_{x=0}^{4} (30 choose x) (1/6)ˣ (5/6)^{30−x} = 0.5757.
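Both numbers in this example are easy to reproduce; a sketch:

```python
from math import comb, factorial

# (i) Probability that each face appears exactly five times in 30 rolls.
p_equal = factorial(30) / factorial(5) ** 6 * (1 / 6) ** 30
assert abs(p_equal - 0.0004) < 5e-5

# (ii) Probability of at least five sixes: X6 ~ Bin(30, 1/6).
p_le_4 = sum(comb(30, x) * (1 / 6) ** x * (5 / 6) ** (30 - x) for x in range(5))
assert abs((1 - p_le_4) - 0.5757) < 5e-4
```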
Example 2.16 (Bridge). Consider a bridge game with four players: North, South, East, and West. We want to find the probability that North and South together have two or more aces. Let Xi denote the number of aces in the hands of player i, i = 1, 2, 3, 4; we let i = 1, 2 mean North and South. Then, we want to find P(X1 + X2 ≥ 2).
The joint distribution of (X1, X2, X3, X4) is Mult(4, 1/4, 1/4, 1/4, 1/4) (think of each ace as a ball, and the four players as cells). Then, (X1 + X2, X3 + X4) ~ Mult(4, 1/2, 1/2). Therefore,

P(X1 + X2 ≥ 2) = [4!/(2! 2!)] (1/2)⁴ + [4!/(3! 1!)] (1/2)⁴ + [4!/(4! 0!)] (1/2)⁴ = 11/16.
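Since X1 + X2 ~ Bin(4, 1/2) here, the answer is a short exact computation; a sketch:

```python
from fractions import Fraction
from math import comb

# X1 + X2 ~ Bin(4, 1/2): the number of the four aces held by North and South.
p = sum(Fraction(comb(4, k), 2 ** 4) for k in range(2, 5))
assert p == Fraction(11, 16)
```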
Important formulas and facts about the multinomial distribution are given in the
next theorem.
Proof. Define W_{ir} as the indicator of the event that the rth ball lands in the ith cell. Note that for a given i, the variables W_{ir} are independent. Then,

X_i = Σ_{r=1}^{n} W_{ir},

and therefore E(X_i) = Σ_{r=1}^{n} E(W_{ir}) = np_i, and Var(X_i) = Σ_{r=1}^{n} Var(W_{ir}) = np_i(1 − p_i). Part (b) follows from the definition of a multinomial experiment
(the trials are identical and independent, and each ball either does or does not land in the ith cell). For part (c),

Cov(X_i, X_j) = Cov(Σ_{r=1}^{n} W_{ir}, Σ_{s=1}^{n} W_{js})
              = Σ_{r=1}^{n} Σ_{s=1}^{n} Cov(W_{ir}, W_{js})
              = Σ_{r=1}^{n} Cov(W_{ir}, W_{jr}) = −n p_i p_j,

since, for i ≠ j, W_{ir} W_{jr} = 0, so that Cov(W_{ir}, W_{jr}) = −E(W_{ir})E(W_{jr}) = −p_i p_j. Part (d) follows immediately from part (c) and part (a). Part (e) is a calculation, and is omitted. □
Example 2.17 (MGF of the Multinomial Distribution). Let (X1, X2, …, Xk) ~ Mult(n, p1, p2, …, pk). Then the mgf ψ(t1, t2, …, tk) exists at all t, and a formula follows easily. Indeed,

E[e^{t1 X1 + ⋯ + tk Xk}] = Σ_{x_i ≥ 0, Σ_i x_i = n} [n!/(x1! ⋯ xk!)] e^{t1 x1} e^{t2 x2} ⋯ e^{tk xk} p1^{x1} p2^{x2} ⋯ pk^{xk}
                         = Σ_{x_i ≥ 0, Σ_i x_i = n} [n!/(x1! ⋯ xk!)] (p1 e^{t1})^{x1} (p2 e^{t2})^{x2} ⋯ (pk e^{tk})^{xk}
                         = (p1 e^{t1} + p2 e^{t2} + ⋯ + pk e^{tk})ⁿ,

by the multinomial theorem.
P(X1 = x1, X2 = x2, …, Xk = xk)
= Σ_{n=0}^{∞} P(X1 = x1, X2 = x2, …, Xk = xk | N = n) e^{−λ} λⁿ/n!
= Σ_{n=0}^{∞} [(x1 + x2 + ⋯ + xk)!/(x1! x2! ⋯ xk!)] p1^{x1} p2^{x2} ⋯ pk^{xk} (e^{−λ} λⁿ/n!) I_{n = x1 + x2 + ⋯ + xk}
= e^{−λ} [λ^{x1 + x2 + ⋯ + xk}/(x1! x2! ⋯ xk!)] p1^{x1} p2^{x2} ⋯ pk^{xk}
= e^{−λ} (λp1)^{x1} (λp2)^{x2} ⋯ (λpk)^{xk} / (x1! x2! ⋯ xk!)
= Π_{i=1}^{k} e^{−λ p_i} (λp_i)^{x_i}/x_i!,

which establishes that the unconditional joint pmf of (X1, X2, …, Xk) is the product of k Poisson pmfs, and so X1, X2, …, Xk must be independent, with X_i ~ Poi(λp_i). □
Example 2.18 (No Empty Cells). Suppose n balls are distributed independently and
at random into k cells. We want to find a formula for the probability that no cell
remains empty.
We use the Poissonization technique to solve this problem. We want a formula for P(Y1 ≠ 0, Y2 ≠ 0, …, Yk ≠ 0). Marginally, each X_i ~ Poi(λ/k), and therefore,
Exercises
Exercise 2.1. Consider the experiment of picking one word at random from the
sentence
ALL IS WELL IN THE NEWELL FAMILY
Let X be the length of the word selected and Y the number of Ls in it. Find in a
tabular form the joint pmf of X and Y , their marginal pmfs, means, and variances,
and the correlation between X and Y .
Exercise 2.2. A fair coin is tossed four times. Let X be the number of heads, Z the
number of tails, and Y D jX Zj. Find the joint pmf of .X; Y /, and E.Y /.
Exercise 2.3. Consider the joint pmf p(x, y) = cxy, 1 ≤ x ≤ 3, 1 ≤ y ≤ 3.
(a) Find the normalizing constant c.
(b) Are X, Y independent? Prove your claim.
(c) Find the expectations of X, Y, XY.
Exercise 2.4. Consider the joint pmf p(x, y) = cxy, 1 ≤ x ≤ y ≤ 3.
(a) Find the normalizing constant c.
(b) Are X, Y independent? Prove your claim.
(c) Find the expectations of X, Y, XY.
Exercise 2.5. A fair die is rolled twice. Let X be the maximum and Y the minimum of the two rolls. By using the joint pmf of (X, Y) worked out in the text, find the pmf of X/Y, and hence the mean of X/Y.
Exercise 2.6. A hat contains four slips of paper, numbered 1, 2, 3, and 4. Two slips
are drawn at random, without replacement. X is the number on the first slip and Y
the sum of the two numbers drawn. Write in a tabular form the joint pmf of .X; Y /.
Hence find the marginal pmfs. Are X; Y independent?
Exercise 2.8. A fair die is rolled four times. Find the probabilities that:
(a) at least one six is obtained;
(b) exactly one six and exactly one two are obtained;
(c) exactly one six, one two, and two fours are obtained.
Exercise 2.11. Suppose X and Y are independent Geo(p) random variables. Find P(X ≥ Y) and P(X > Y).
Exercise 2.12 *. Suppose X and Y are independent Poi(λ) random variables. Find P(X ≥ Y) and P(X > Y).
Exercise 2.13. Suppose X and Y are independent and take the values 1, 2, 3, 4 with
probabilities .2, .3, .3, .2. Find the pmf of X C Y .
Exercise 2.14. Two random variables have the joint pmf p(x, x + 1) = 1/(n + 1), x = 0, 1, …, n. Answer the following questions with as little calculation as possible.
(a) Are X; Y independent?
(b) What is the variance of Y X ?
(c) What is Var.Y jX D 1/?
Exercise 2.18. Suppose a fair die is rolled twice. Let X; Y be the two rolls. Find
the following with as little calculation as possible:
(a) E(X + Y | Y = y).
(b) E(XY | Y = y).
(c) Var(X²Y | Y = y).
(d) ρ_{X+Y, X−Y}.
Exercise 2.19 (A Waiting Time Problem). In repeated throws of a fair die, let X
be the throw in which the first six is obtained, and Y the throw in which the second
six is obtained.
(a) Find the joint pmf of (X, Y).
(b) Find the expectation of Y − X.
(c) Find E(Y − X | X = 8).
(d) Find Var(Y − X | X = 8).
Exercise 2.20 * (Family Planning). A couple want to have a child of each sex, but
they will have at most four children. Let X be the total number of children they
will have and Y the number of girls at the second childbirth. Find the joint pmf of
(X, Y), and the conditional expectation of X given Y = y, y = 0, 2.
Exercise 2.27 (Joint MGF). Suppose a fair die is rolled four times. Let X be the number of ones and Y the number of sixes. Find the joint mgf of X and Y, and hence the covariance between X and Y.
Exercise 2.29 (Joint MGF). In repeated throws of a fair die, let X be the throw in
which the first six is obtained, and Y the throw in which the second six is obtained.
Find the joint mgf of X; Y , and hence the covariance between X and Y .
Exercise 2.30 * (Poissonization). A fair die is rolled 30 times. By using the Pois-
sonization theorem, find the probability that the maximum number of times any face
appears is 9 or more.
Chapter 3
Multidimensional Densities

These are all just the most natural extensions of the corresponding one-dimensional facts to the present multidimensional case. We now formally define a joint density function.
P(a1 ≤ X1 ≤ b1, a2 ≤ X2 ≤ b2, …, an ≤ Xn ≤ bn)
= ∫_{an}^{bn} ⋯ ∫_{a2}^{b2} ∫_{a1}^{b1} f(x1, x2, …, xn) dx1 dx2 ⋯ dxn.
The definition of the joint CDF is the same as that given in the discrete case. But
now the joint CDF is an integral of the density rather than a sum. Here is the precise
definition.
Definition 3.2. Let X be an n-dimensional random vector with the density function f(x_1, x_2, …, x_n). The joint CDF, or simply the CDF, of X is defined as

F(x_1, x_2, …, x_n) = ∫_{−∞}^{x_n} ⋯ ∫_{−∞}^{x_1} f(t_1, …, t_n) dt_1 ⋯ dt_n.
As in the one-dimensional case, both the CDF and the density completely specify the distribution of a continuous random vector, and one can be obtained from the other. We know how to obtain the CDF from the density; the reverse relation is that (for almost all (x_1, x_2, …, x_n))

f(x_1, x_2, …, x_n) = ∂^n/(∂x_1 ⋯ ∂x_n) F(x_1, x_2, …, x_n).

Again, the qualification almost all is necessary for a rigorous description of the interrelation between the CDF and the density, but we operate as though the identity above holds for all (x_1, x_2, …, x_n).
Analogous to the case of several discrete variables, the marginal densities are
obtained by integrating out (instead of summing) all the other variables. In fact,
all lower-dimensional marginals are obtained that way. The precise statement is the
following.
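This integrating-out recipe is easy to check numerically. The sketch below (plain NumPy; the truncation point and grid size are arbitrary choices of ours) integrates the joint density f(x, y) = x e^{−x(1+y)}, x, y ≥ 0 — a density that reappears later in this section — over y, and compares the result with the known marginal e^{−x}:

```python
import numpy as np

# Midpoint-rule approximation of f1(x) = integral over y of f(x, y),
# for the joint density f(x, y) = x * exp(-x * (1 + y)), x, y >= 0,
# whose x-marginal is e^{-x}.  The truncation at y = 50 and the grid
# size are arbitrary numerical choices.
def joint(x, y):
    return x * np.exp(-x * (1.0 + y))

x0 = 1.3
edges = np.linspace(0.0, 50.0, 200_001)
mids = 0.5 * (edges[:-1] + edges[1:])
dy = edges[1] - edges[0]
f1 = float(np.sum(joint(x0, mids)) * dy)

print(round(f1, 4), round(float(np.exp(-x0)), 4))
```

The two printed numbers agree to several decimal places, as the theory says they must.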
3.1 Joint Density Function and Its Role 125
f(x_1, x_2, …, x_n) = ∏_{i=1}^{n} f_i(x_i),

F(x_1, x_2, …, x_n) = ∏_{i=1}^{n} F_i(x_i);

partially differentiating both sides of the latter successively with respect to x_1, x_2, …, x_n, it follows that the joint density factorizes as f(x_1, x_2, …, x_n) = ∏_{i=1}^{n} f_i(x_i). ∎
f(x, y) = 1 if 0 ≤ x, y ≤ 1,
        = 0 otherwise.

f(x, y) = c if x, y ≥ 0, x + y ≤ 1,
        = 0 otherwise.
Contrary to the previous example, X, Y are not independent now. There are many ways to see this. For example,

P(X > 1/2 | Y > 1/2) = 0.

But P(X > 1/2) = ∫_{1/2}^{1} 2(1 − x) dx = 1/4 ≠ 0. So X, Y cannot be independent. We can also see that the joint density f(x, y) does not factorize as the product of the marginal densities, and so X, Y cannot be independent.
Hence, f .x; y/ is a valid joint density. It is plotted in Fig. 3.2. Next, let us find the
marginal densities:
128 3 Multidimensional Densities
f_1(x) = ∫_0^∞ x e^{−x(1+y)} dy = x ∫_0^∞ e^{−x(1+y)} dy
  = x e^{−x} ∫_0^∞ e^{−xy} dy = x · (e^{−x}/x)·… wait, more simply, = x e^{−x} · (1/x) = e^{−x}, x ≥ 0.
Example 3.4 (Nonuniform Joint Density with Uniform Marginals). Let (X, Y) have the joint density function f(x, y) = c − 2(c − 1)(x + y − 2xy), x, y ∈ [0, 1], 0 < c < 2. This is nonnegative in the unit square, as can be seen by considering the cases c < 1, c = 1, c > 1 separately. Also,

∫_0^1 ∫_0^1 f(x, y) dx dy = c − 2(c − 1) ∫_0^1 ∫_0^1 (x + y − 2xy) dx dy
  = c − 2(c − 1) ∫_0^1 (1/2 + y − y) dy = c − (c − 1) = 1.
Similarly, the marginal density of Y is also the constant function 1. So each marginal is uniform, although the joint density is not uniform if c ≠ 1.
Example 3.5 (Using the Density to Calculate a Probability). Suppose (X, Y) has the joint density f(x, y) = 6xy², x, y ≥ 0, x + y ≤ 1. Thus, this is yet another density on the triangle with vertices at (0, 0), (1, 0), and (0, 1). We want to find P(X + Y < 1/2). By definition,

P(X + Y < 1/2) = ∫∫_{(x,y): x, y ≥ 0, x + y < 1/2} 6xy² dx dy
  = 6 ∫_0^{1/2} ∫_0^{1/2 − y} x y² dx dy
  = 6 ∫_0^{1/2} y² (1/2 − y)²/2 dy
  = 3 ∫_0^{1/2} y² (1/2 − y)² dy
  = 3 × (1/960) = 1/320.
This example gives an elementary illustration of the need to work out the limits of
the iterated integrals carefully while using a joint density to calculate the probability
of some event. In fact, properly finding the limits of the iterated integrals is the part
that requires the greatest care when working with joint densities.
Example 3.6 (Uniform Distribution in a Circle). Suppose C denotes the unit circle in the plane:

C = {(x, y) : x² + y² ≤ 1}.

We pick a point (X, Y) at random from C; what that means is that (X, Y) has the density

f(x, y) = c if (x, y) ∈ C,

and is zero otherwise. Because

∫∫_C f(x, y) dx dy = c ∫∫_C dx dy = c × Area of C = cπ = 1,
The joint density f(x, y) is symmetric between x and y (i.e., f(x, y) = f(y, x)); thus Y has the same marginal density as X, that is,

f_2(y) = (2√(1 − y²))/π, −1 ≤ y ≤ 1.

Because f(x, y) ≠ f_1(x) f_2(y), X, Y are not independent. Note that if X, Y have a joint uniform density in the unit square, we find them to be independent; but now, when they have a uniform density in the unit circle, we find them to be not independent. In fact, the following general rule holds.
Suppose a joint density f(x, y) can be written in the form g(x)h(y), (x, y) ∈ S, with f(x, y) zero otherwise. Then X, Y are independent if and only if S is a rectangle (including squares).
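A small Monte Carlo sketch illustrates this rule for the non-rectangular support of the unit circle (sample size, seed, and the .8 threshold are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(7)

# Uniform points on the unit disk by rejection sampling from the square.
pts = rng.uniform(-1.0, 1.0, size=(400_000, 2))
disk = pts[pts[:, 0] ** 2 + pts[:, 1] ** 2 <= 1.0]
X, Y = disk[:, 0], disk[:, 1]

# The support is not a rectangle, so X, Y are dependent: the event
# {X > .8, Y > .8} is impossible on the disk (since 2 * .8^2 > 1),
# while the product of the marginal probabilities is positive.
p_joint = float(np.mean((X > 0.8) & (Y > 0.8)))
p_prod = float(np.mean(X > 0.8) * np.mean(Y > 0.8))

print(p_joint, round(p_prod, 4))
```

The joint event is literally impossible on the disk, while the product of the marginals is positive, so independence fails.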
Example 3.7 (An Interesting Property of Exponential Variables). Suppose X, Y are independent Exp(λ) and Exp(μ) variables. We want to find P(X ≤ Y). A possible application is the following. Suppose you have two televisions at your home, a plasma unit with a mean lifetime of five years, and an ordinary unit with a mean lifetime of ten years. What is the probability that the plasma tv will fail before the ordinary one?
From our general definition of probabilities of events, we need to calculate ∫∫_{x,y > 0, x ≤ y} f(x, y) dx dy. In general, there need not be an interesting answer for this integral. But here, in the independent exponential case, there is.
Since X, Y are independent, the joint density is f(x, y) = (1/(λμ)) e^{−x/λ − y/μ}, x, y > 0. Therefore,

P(X ≤ Y) = ∫∫_{x,y > 0, x ≤ y} (1/(λμ)) e^{−x/λ − y/μ} dx dy
  = ∫_0^∞ ∫_0^y (1/(λμ)) e^{−x/λ − y/μ} dx dy
  = ∫_0^∞ (1/μ) e^{−y/μ} [∫_0^y (1/λ) e^{−x/λ} dx] dy
  = ∫_0^∞ (1/μ) e^{−y/μ} (1 − e^{−y/λ}) dy
  = 1 − (1/μ) ∫_0^∞ e^{−y(1/λ + 1/μ)} dy
  = 1 − (1/μ) × 1/(1/λ + 1/μ)
  = 1 − λ/(λ + μ) = μ/(λ + μ) = 1/(1 + λ/μ).

Thus, the probability that X is less than Y depends in a very simple way on just the quantity E(X)/E(Y) = λ/μ.
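A simulation sketch of the television example confirms the formula (the means 5 and 10 come from the example; the sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# The TV example: X ~ Exp(mean 5), Y ~ Exp(mean 10).  The formula gives
# P(X <= Y) = 1/(1 + E(X)/E(Y)) = 1/(1 + 5/10) = 2/3.
lam, mu = 5.0, 10.0
exact = 1.0 / (1.0 + lam / mu)

X = rng.exponential(scale=lam, size=1_000_000)
Y = rng.exponential(scale=mu, size=1_000_000)
approx = float(np.mean(X <= Y))

print(round(exact, 4), round(approx, 4))
```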
f(x_1, x_2, …, x_n) = c, −1 ≤ x_i ≤ 1, 1 ≤ i ≤ n,

where c = 1/2^n. Also, by definition of probability,

P(X ∈ B_n) = ∫_{B_n} c dx_1 dx_2 ⋯ dx_n = Vol(B_n)/2^n,

where Vol(B_n) is the volume of the n-dimensional unit sphere B_n, and equals

Vol(B_n) = π^{n/2}/Γ(n/2 + 1).

Thus, finally,

P(X ∈ B_n) = π^{n/2}/(2^n Γ(n/2 + 1)).
This is a very pretty formula. Let us evaluate this probability for various values of n, and examine the effect of increasing the number of dimensions on this probability. Here is a table.

n     P(X ∈ B_n)
2     .785
3     .524
4     .308
5     .164
6     .081
10    .002
12    .0003
15    .00001
18    3.13 × 10⁻⁷
We see that in ten dimensions there is only a 1 in 500 chance that a uniform random vector will fall in the central inscribed sphere, and in 18 dimensions the chance is much less than one in a million. Therefore, when you are dealing with a large number of random variables at the same time, you will need a huge amount of sample data to learn about the behavior of their joint distribution in the central region; most of the data will come from the corners! This phenomenon has been termed the curse of dimensionality.
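The entries of the table come straight from the displayed formula; a short sketch (the helper name p_ball is ours) reproduces them with math.gamma:

```python
import math

# P(X in B_n) = pi^(n/2) / (2^n * Gamma(n/2 + 1)): the chance that a
# point uniform in the cube [-1, 1]^n lands in the inscribed unit ball.
def p_ball(n):
    return math.pi ** (n / 2.0) / (2.0 ** n * math.gamma(n / 2.0 + 1.0))

for n in (2, 3, 5, 10, 18):
    print(n, p_ball(n))
```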
Remark. It is clear from the definition that the expectation of each individual X_i can be evaluated either by interpreting X_i as a function of the full vector (X_1, X_2, …, X_n), or by simply using the marginal density f_i(x) of X_i; that is,
3.2 Expectation of Functions 133
E(X_i) = ∫_{R^n} x_i f(x_1, x_2, …, x_n) dx_1 dx_2 ⋯ dx_n = ∫_{−∞}^{∞} x f_i(x) dx.
A similar comment applies to any function h(X_i) of X_i alone. All the properties of expectations that we have previously established, for example linearity of expectations, continue to hold in the multidimensional case. Thus, for two independent U[0, 1] variables X and Y,

E(|X − Y|) = ∫_0^1 ∫_0^1 |x − y| dx dy
  = ∫_0^1 [∫_0^y (y − x) dx + ∫_y^1 (x − y) dx] dy
  = ∫_0^1 [y² − y²/2 + (1 − y²)/2 − y(1 − y)] dy
  = ∫_0^1 [1/2 − y + y²] dy
  = 1/2 − 1/2 + 1/3 = 1/3.
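A one-line Monte Carlo check of the value 1/3 just computed (sample size and seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo check of E|X - Y| = 1/3 for independent U[0, 1] variables.
X = rng.random(1_000_000)
Y = rng.random(1_000_000)
est = float(np.mean(np.abs(X - Y)))

print(round(est, 3))
```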
Example 3.11 (Use of Polar Coordinates). Suppose a point (x, y) is picked at random from inside the unit circle. We want to find its expected distance from the center of the circle. Thus, let (X, Y) have the joint density

f(x, y) = 1/π, x² + y² ≤ 1,

so that

E[√(X² + Y²)] = (1/π) ∫∫_{(x,y): x² + y² ≤ 1} √(x² + y²) dx dy.

Transform to polar coordinates

x = r cos θ, y = r sin θ,
The bivariate normal density is one of the most important densities for two jointly distributed continuous random variables, just as the univariate normal density is for one continuous variable. Many correlated random variables across the applied and social sciences are approximately distributed as a bivariate normal. A typical example is the joint distribution of two size variables, such as height and weight.
Definition 3.4. The function f(x, y) = (1/(2π)) e^{−(x² + y²)/2}, −∞ < x, y < ∞, is called the bivariate standard normal density.
Clearly, we see that f(x, y) = φ(x)φ(y) for all x, y. Therefore, the bivariate standard normal distribution corresponds to a pair of independent standard normal variables X, Y. If we make the linear transformation

U = μ₁ + σ₁ X,
V = μ₂ + σ₂ [ρX + √(1 − ρ²) Y],

then we get the general five-parameter bivariate normal density, with means μ₁, μ₂, standard deviations σ₁, σ₂, and correlation ρ_{U,V} = ρ; here, −1 < ρ < 1.
Definition 3.5. The density of the five-parameter bivariate normal distribution is

f(x, y) = 1/(2πσ₁σ₂√(1 − ρ²)) exp{ −[(x − μ₁)²/σ₁² − 2ρ(x − μ₁)(y − μ₂)/(σ₁σ₂) + (y − μ₂)²/σ₂²] / (2(1 − ρ²)) },
  −∞ < x, y < ∞.

If μ₁ = μ₂ = 0, σ₁ = σ₂ = 1, then the bivariate normal density has just the parameter ρ, and it is denoted as SBVN(ρ).
If we sample observations from a general bivariate normal distribution and plot the data points as points in the plane, then they would roughly plot out to an elliptical shape. The reason for this approximate elliptical shape is that the exponent in the formula for the density function is a quadratic form in the variables. Figure 3.3 is a simulation of 1000 values from a bivariate normal distribution. The roughly elliptical shape is clear. It is also seen in the plot that the center of the point cloud is quite close to the true means of the variables, which were chosen to be μ₁ = 4.5, μ₂ = 4.
From the representation we have given above of the general bivariate normal vector (U, V) in terms of independent standard normals X, Y, it follows that

E(UV) = μ₁μ₂ + ρσ₁σ₂
⇒ Cov(U, V) = ρσ₁σ₂.
The symmetric matrix with the variances as diagonal entries and the covariance as
the off-diagonal entry is called the variance–covariance matrix, or the dispersion
3.3 Bivariate Normal 137
Fig. 3.3 Simulation of a bivariate normal with means 4.5, 4; variances 1; correlation .75
Fig. 3.4 Bivariate normal densities with zero means, unit variances, and ρ = 0, .5
matrix, or sometimes simply the covariance matrix, of (U, V). Thus, the covariance matrix of (U, V) is

Σ = [ σ₁²      ρσ₁σ₂
      ρσ₁σ₂    σ₂²  ].
A plot of the SBVN(ρ) density is provided in Fig. 3.4 for ρ = 0, .5; the zero correlation case corresponds to independence. We see from the plots that the bivariate density has a unique peak at the mean point (0, 0) and falls off from that point like a mound. The higher the correlation, the more the density concentrates near a plane. In the limiting case, when ρ = ±1, the density becomes fully concentrated on a plane, and we call it a singular bivariate normal.
When ρ = 0, the bivariate normal density does factorize into the product of the two marginal densities. Therefore, if ρ = 0, then U, V are actually independent, and so, in that case, P(U > μ₁, V > μ₂) = P(each variable is larger than its mean value) = (1/2)(1/2) = 1/4. When the parameters are general, one has the following classic formula.
Theorem 3.1 (A Classic Bivariate Normal Formula). Let (U, V) have the five-parameter bivariate normal density with parameters μ₁, μ₂, σ₁, σ₂, ρ. Then

P(U > μ₁, V > μ₂) = 1/4 + arcsin(ρ)/(2π).
Example 3.13. Suppose a bivariate normal vector (U, V) has correlation ρ. Then, by applying the formula above, whatever μ₁, μ₂ are,

P(U > μ₁, V > μ₂) = 1/4 + (1/(2π)) arcsin(1/2) = 1/3

when ρ = 1/2. When ρ = .75, the probability increases to .385. In the limit, when ρ → 1, the probability tends to .5. That is, when ρ → 1, all the probability becomes confined to the first and the third quadrants, {U > μ₁, V > μ₂} and {U < μ₁, V < μ₂}, with the probability of each of these two quadrants approaching .5.
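Both numbers in this example can be checked in a few lines; the sketch below evaluates the arcsine formula and then confirms the ρ = .75 case by simulation, using the representation of (U, V) in terms of independent standard normals given earlier (helper names are ours):

```python
import math
import numpy as np

# Quadrant probability: P(U > mu1, V > mu2) = 1/4 + arcsin(rho)/(2*pi).
def quadrant(rho):
    return 0.25 + math.asin(rho) / (2.0 * math.pi)

print(round(quadrant(0.5), 4), round(quadrant(0.75), 3))

# Monte Carlo confirmation at rho = .75, using the representation
# U = X, V = rho * X + sqrt(1 - rho^2) * Y with X, Y iid N(0, 1).
rng = np.random.default_rng(3)
rho = 0.75
X = rng.standard_normal(1_000_000)
Y = rng.standard_normal(1_000_000)
V = rho * X + math.sqrt(1.0 - rho ** 2) * Y
mc = float(np.mean((X > 0.0) & (V > 0.0)))

print(round(mc, 3))
```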
Another important property of a bivariate normal distribution is the following
result.
Theorem 3.2. Let (U, V) have a general five-parameter bivariate normal distribution. Then any linear function aU + bV of (U, V) is normally distributed:

aU + bV ∼ N(aμ₁ + bμ₂, a²σ₁² + b²σ₂² + 2abρσ₁σ₂).

This follows from the representation

U = μ₁ + σ₁ X,
V = μ₂ + σ₂ [ρX + √(1 − ρ²) Y].
In fact, a result stronger than the previous theorem holds. What is true is that any two linear functions of (U, V) will again be distributed as a bivariate normal. Here is the stronger result.
Theorem 3.3. Let (U, V) have a general five-parameter bivariate normal distribution. Let Z = aU + bV, W = cU + dV be two linear functions, such that ad − bc ≠ 0. Then (Z, W) also has a bivariate normal distribution, with the correspondingly transformed parameters.
The proof of this theorem is similar to the proof of the previous theorem, and the
details are omitted.
Example 3.14 (Independence of Mean and Variance). Suppose X₁, X₂ are two iid N(μ, σ²) variables. Then, of course, they are also jointly bivariate normal. Define now two linear functions

Z = X₁ + X₂, W = X₁ − X₂.

And, of course, X̄ = (X₁ + X₂)/2 = Z/2. Therefore, it follows that X̄ and s² are independent.
This is true not just for two observations, but for any number of iid observations from a normal distribution. This is proved after we introduce multivariate normal distributions, and it is also proved in Chapter 18 by using Basu's theorem.
Example 3.15 (Normal Marginals Do Not Guarantee Joint Normal). Although joint bivariate normality of two random variables implies that each variable must be marginally a univariate normal, the converse is in general not true.
Let Z ∼ N(0, 1), and let U be a two-valued random variable with the pmf P(U = ±1) = 1/2. Take U and Z to be independent. Define now X = U|Z| and Y = Z.
Then each of X, Y has a standard normal distribution. That X has a standard normal distribution is easily seen in many ways, for example by just evaluating its CDF. Take x > 0; then

P(X ≤ x) = (1/2) P(X ≤ x | U = 1) + (1/2) P(X ≤ x | U = −1)
  = (1/2) P(|Z| ≤ x) + (1/2) × 1
  = 1/2 + (1/2)[2Φ(x) − 1] = Φ(x).
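A simulation sketch makes the failure of joint normality vivid: X + Y would be normal if (X, Y) were bivariate normal (by Theorem 3.2), but here X + Y has an atom at 0 of probability 1/2 (the construction is from the example; sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)

# X = U|Z| with Z ~ N(0, 1), U = +/-1 independent of Z, and Y = Z.
# X is exactly standard normal, yet (X, Y) is not bivariate normal:
# X + Y = U|Z| + Z equals 0 exactly when U is opposite in sign to Z,
# an event of probability 1/2, so X + Y has an atom at 0.
n = 500_000
Z = rng.standard_normal(n)
U = rng.choice([-1.0, 1.0], size=n)
X = U * np.abs(Z)

print(round(float(np.mean(X)), 2), round(float(np.var(X)), 2))  # ~ 0, ~ 1
print(round(float(np.mean(X + Z == 0.0)), 3))                   # ~ 0.5
```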
f(x|y) = f(x | Y = y) = f(x, y)/f_Y(y), for all y such that f_Y(y) > 0.
3.4 Conditional Densities and Expectations 141
f(y|x) = f(x, y)/f_X(x), for all x such that f_X(x) > 0.
Theorem 3.4 (Bayes' Theorem for Conditional Densities). Let (X, Y) have a joint density f(x, y). Then, for all x, y such that f_X(x) > 0, f_Y(y) > 0,

f(y|x) = f(x|y) f_Y(y) / f_X(x).
Proof.

f(x|y) f_Y(y) / f_X(x) = [f(x, y)/f_Y(y)] × f_Y(y) / f_X(x) = f(x, y)/f_X(x) = f(y|x).

Thus, we can convert one conditional density to the other one by using Bayes' theorem; note the similarity to Bayes' theorem discussed in Chapter 1. ∎
Definition 3.7 (Conditional Variance). Let (X, Y) have a joint density f(x, y). The conditional variance of X given Y = y is defined as

Var(X|y) = Var(X | Y = y) = ∫_{−∞}^{∞} (x − μ_X(y))² f(x, y) dx / ∫_{−∞}^{∞} f(x, y) dx,

where μ_X(y) = E(X | Y = y).
Remark. All the facts and properties about conditional pmfs and conditional ex-
pectations that were presented in the previous chapter for discrete random variables
continue to hold verbatim in the continuous case, with densities replacing the pmfs
in their statements. In particular, the iterated expectation and variance formula, and
all the rules about conditional expectations and variance in Section 2.3 hold in the
continuous case.
An important optimizing property of the conditional expectation is that the best
predictor of Y based on X among all possible predictors is the conditional expecta-
tion of Y given X . Here is the exact result.
Proof. Denote μ_Y(x) = E(Y | X = x). Then, by the property of the mean of any random variable U that E(U − E(U))² ≤ E(U − a)² for any a, we get that here,

E[(Y − μ_Y(x))² | X = x] ≤ E[(Y − g(x))² | X = x]

for any x. Inasmuch as this inequality holds for any x, it also holds on taking an expectation:

E_X E[(Y − μ_Y(x))² | X = x] ≤ E_X E[(Y − g(x))² | X = x]
⇒ E_{X,Y} (Y − μ_Y(X))² ≤ E_{X,Y} (Y − g(X))²,

where the final line is a consequence of the iterated expectation formula (see Chapter 2). ∎
f(x, y) = 2 if x, y ≥ 0, x + y ≤ 1.

f(x|y) = f(x, y)/f_Y(y) = 1/(1 − y), 0 ≤ x ≤ 1 − y,

and therefore

E(X|y) = (1 − y)/2, for all y, 0 < y < 1.
Also, the conditional variance of X given Y = y is, by the general variance formula for uniform distributions,

Var(X|y) = (1 − y)²/12.
Example 3.17 (Uniform Distribution in a Circle). Let (X, Y) have a uniform density in the unit circle, f(x, y) = 1/π, x² + y² ≤ 1. We find the conditional expectation of X given Y = y. First, the conditional density is

f(x|y) = f(x, y)/f_Y(y) = (1/π) / ((2/π)√(1 − y²)) = 1/(2√(1 − y²)),
  −√(1 − y²) ≤ x ≤ √(1 − y²).

Thus, we have the interesting result that the conditional density of X given Y = y is uniform on [−√(1 − y²), √(1 − y²)]. It being an interval symmetric about zero, we have in addition the result that for any y, E(X | Y = y) = 0.
Let us now find the conditional variance. The conditional distribution of X given Y = y is uniform on [−√(1 − y²), √(1 − y²)]; therefore, by the general variance formula for uniform distributions,

Var(X|y) = (2√(1 − y²))²/12 = (1 − y²)/3.

Thus, the conditional variance decreases as y moves away from zero, which makes sense intuitively, because as y moves away from zero, the line segment in which x varies becomes smaller.
To find E(X | Y = y), our first task would be to find f(x|y), the conditional density of X given Y = y. This is, by its definition,

f(x|y) = (1/x) f(x) / ∫_y^1 (1/t) f(t) dt, y < x < 1.

Therefore,

E(X | Y = y) = ∫_y^1 x f(x|y) dx = ∫_y^1 x (1/x) f(x) dx / ∫_y^1 (1/x) f(x) dx
  = (1 − F(y)) / ∫_y^1 (1/x) f(x) dx.

With f the uniform density on [0, 1], this gives

E(X | Y = y) = (1 − F(y)) / ∫_y^1 (1/x) dx = (1 − y)/(− log y), 0 < y < 1.
The important thing to note is that although X has marginally a uniform density and expectation 1/2, given Y = y, X is not uniformly distributed, and E(X | Y = y) is not 1/2. Indeed, as Fig. 3.5 shows, E(X | Y = y) is an increasing function of y, increasing from zero at y = 0 to one at y = 1.
Example 3.19 (E(X | Y = y) Exists for Any y, but E(X) Does Not). Consider the setup of the preceding example once again (i.e., X ∼ f(x), and given X = x, Y ∼ U[0, x]). Suppose f(x) = 1/x², x ≥ 1. Then the marginal expectation E(X) does not exist, because ∫_1^∞ x (1/x²) dx = ∫_1^∞ (1/x) dx diverges.
However, from the general formula in the preceding example,

E(X | Y = y) = (1 − F(y)) / ∫_y^∞ (f(x)/x) dx = (1/y) / (1/(2y²)) = 2y,
Example 3.22 (Conditional Density Given the Sum). Suppose X, Y are two independent Exp(1) variables. What is the conditional density of X given that X + Y = t? Denote X + Y = T. Then we know from Chapter 1 that T ∼ G(2, 1). Also, by definition of probabilities for jointly continuous random variables, denoting the joint density of (X, Y) as f(x, y),

P(X ≤ x, T ≤ t) = ∫∫_{u ≤ x, u+v ≤ t} f(u, v) du dv
  = ∫∫_{0 < u ≤ x, 0 < u+v ≤ t} e^{−u−v} du dv
  = ∫_0^x e^{−u} [∫_0^{t−u} e^{−v} dv] du
  = ∫_0^x e^{−u} (1 − e^{−(t−u)}) du
  = ∫_0^x e^{−u} du − ∫_0^x e^{−t} du
  = 1 − e^{−x} − x e^{−t},

for x > 0, t > x.
Therefore, the joint density of X and T is

f_{X,T}(x, t) = ∂²/(∂x ∂t) [1 − e^{−x} − x e^{−t}] = e^{−t}, 0 < x < t < ∞.
3.5 Posterior Densities, Likelihood Functions, and Bayes Estimates 147
E(X | X + Y = t) = E(Y | X + Y = t)
⇒ t = E(X + Y | X + Y = t) = 2 E(X | X + Y = t)
⇒ E(X | X + Y = t) = t/2.

So, if we wanted just the conditional expectation, then the conditional density calculation would not be necessary in this case. This sort of symmetry argument is often very useful in reducing algebraic calculations. But one needs to be absolutely sure that the symmetry argument will be valid in a given problem.
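The conclusion E(X | X + Y = t) = t/2 can be checked by simulation, approximating the conditioning event by a thin slab around t (the value t = 3 and the slab width are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(11)

# For iid Exp(1) variables, the conditional density of X given
# X + Y = t is uniform on (0, t), so E(X | X + Y = t) = t/2.  The
# conditioning is approximated by keeping pairs with |X + Y - t| < 0.01.
X = rng.exponential(size=2_000_000)
Y = rng.exponential(size=2_000_000)
t = 3.0
Xc = X[np.abs(X + Y - t) < 0.01]

print(len(Xc) > 1000, round(float(np.mean(Xc)), 2))  # mean near t/2 = 1.5
```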
Definition 3.8. Suppose for some fixed n ≥ 1, (X₁, …, Xₙ) have the joint density, or the joint pmf, f(x_1, …, x_n | θ), where θ is a real-valued parameter taking values in an interval (a, b), where a, b may be ±∞. Formally, consider θ itself to be a random variable, and suppose θ has a density g(θ) on (a, b). Then the conditional density of θ given X₁ = x_1, …, Xₙ = x_n is called the posterior density of θ, and is given by

f(θ | x_1, …, x_n) = f(x_1, …, x_n | θ) g(θ) / ∫_a^b f(x_1, …, x_n | θ) g(θ) dθ.

The function l(θ) = f(x_1, …, x_n | θ) is called the likelihood function, the function g(θ) is called the prior density, and the conditional expectation of θ given X₁ = x_1, …, Xₙ = x_n, E(θ | X₁ = x_1, …, Xₙ = x_n), if it exists, is called the posterior mean of θ.
Remark. Note that in the expression for the posterior density, only the numerator depends on θ. The denominator depends only on x_1, …, x_n, because in the denominator θ is being completely integrated out. So we should think of the denominator in the expression for the posterior density as merely a normalizing constant.
Note also that if (a, b) is a bounded interval, and we take g to be the uniform density on (a, b), then, apart from the normalizing constant in the denominator, the posterior density of θ is exactly the same as the likelihood function.
Example 3.23 (Posterior Density for an Exponential Mean). Suppose we have a single observation X ∼ Exp(λ), and that λ has the marginal density g(λ) = 2λ, 0 < λ < 1. Then, by Bayes' theorem,

f(λ|x) = f(x|λ) g(λ) / f_X(x)
  = f(x|λ) g(λ) / ∫_0^1 f(x, λ) dλ = f(x|λ) g(λ) / ∫_0^1 f(x|λ) g(λ) dλ
  = (1/λ) e^{−x/λ} · 2λ / ∫_0^1 (1/λ) e^{−x/λ} · 2λ dλ
  = e^{−x/λ} / k(x),

where k(x) denotes the integral ∫_0^1 e^{−x/λ} dλ, which exists, but does not have a simple final formula. Thus, finally, the posterior density of λ, given that the data value X = x, is

f(λ|x) = e^{−x/λ} / k(x), 0 < λ < 1.
We give a plot of the prior density for λ, along with the posterior density for λ, in Fig. 3.6. A comparison of the two density plots explains the effect of the data value X = x on updating the prior density to the posterior density. We see from the plots that the data value (x = 3) makes larger values of λ more likely under the posterior than they were under the prior.
Example 3.24 (Posterior Mean for Binomial p). Suppose X ∼ Bin(n, p), where the probability of success p is treated as an unknown parameter. For example, you may take a sample of n people independently from a population and count how many are vegetarians. Then p will correspond to the fraction of vegetarians in the entire population, and it seems likely that you cannot really know what that proportion is in the entire population.
In the Bayesian approach, you have to assign the parameter p a distribution. For simplicity of calculations, suppose we give p the U[0, 1] prior. So the Bayes model is

p ∼ U[0, 1], X | p ∼ Bin(n, p).
The posterior density, by definition, is the conditional density of p given X = x, x being the actual observed value of X. Then, from Bayes' theorem for conditional densities,

f(p|x) = f(x|p) g(p) / ∫_0^1 f(x|p) g(p) dp
  = (n choose x) p^x (1 − p)^{n−x} / ∫_0^1 (n choose x) p^x (1 − p)^{n−x} dp
  = p^x (1 − p)^{n−x} / ∫_0^1 p^x (1 − p)^{n−x} dp
  = p^x (1 − p)^{n−x} / [Γ(x + 1)Γ(n − x + 1)/Γ(n + 2)].
Here, in the last line, the denominator is obtained by actually doing the integration. But if we did not bother to do the integration in the denominator, and just looked at the numerator, we would have realized that, apart from an as yet unevaluated constant term that the denominator will contribute, the posterior density is a Beta density with parameters x + 1 and n − x + 1, respectively. From the formula for the mean of a Beta density (see Chapter 1), we get the additional formula for the conditional expectation:

E(p | X = x) = (x + 1)/((x + 1) + (n − x + 1)) = (x + 1)/(n + 2).
This is called the posterior mean. So, if you believe in your U[0, 1] prior for p, then as a Bayesian you may wish to estimate p by (x + 1)/(n + 2), the posterior mean. This is different from x/n, the more common estimate of p. The slight alteration is caused by treating p as a random variable, and by adopting the Bayesian approach.
More generally, if p has the general Beta prior density, p ∼ Be(α, β), then the same calculation as above shows that the posterior density is another Beta, and it is the Be(x + α, n − x + β) density.
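The conjugate updating rule is a one-liner; the sketch below (function name ours) encodes the Be(x + α, n − x + β) posterior mean and recovers (x + 1)/(n + 2) for the uniform prior:

```python
# Beta-binomial conjugacy: a Be(alpha, beta) prior together with x
# successes in n trials yields a Be(x + alpha, n - x + beta) posterior,
# whose mean is (x + alpha)/(n + alpha + beta).
def posterior_mean(x, n, alpha=1.0, beta=1.0):
    return (x + alpha) / (n + alpha + beta)

# The uniform prior is the alpha = beta = 1 case, giving (x + 1)/(n + 2):
print(posterior_mean(3, 10))   # 4/12, versus the common estimate 3/10
```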
Example 3.25 (Posterior Density for a Poisson Mean). Suppose we have n iid sample values X₁, X₂, …, Xₙ from a Poisson distribution with mean λ. The parameter λ is considered unknown. Therefore, in the Bayesian approach, we have to choose a prior distribution for it. Suppose we choose a standard exponential prior for λ: the Bayes model is

f(x_1, x_2, …, x_n | λ) = ∏_{i=1}^n [e^{−λ} λ^{x_i} / x_i!] = e^{−nλ} λ^{Σ_{i=1}^n x_i} / ∏_{i=1}^n x_i!,

and g(λ) = e^{−λ}, λ > 0.
Therefore, by definition of a posterior density,

f(λ | x_1, x_2, …, x_n) = f(x_1, x_2, …, x_n | λ) g(λ) / ∫_0^∞ f(x_1, x_2, …, x_n | λ) g(λ) dλ
  = e^{−(n+1)λ} λ^{Σ_i x_i} / ∫_0^∞ e^{−(n+1)λ} λ^{Σ_i x_i} dλ
  = e^{−(n+1)λ} λ^{Σ_i x_i} / [Γ(Σ_i x_i + 1)/(n + 1)^{Σ_i x_i + 1}]
  = (n + 1)^{Σ_i x_i + 1} λ^{Σ_i x_i} e^{−(n+1)λ} / Γ(Σ_i x_i + 1).

Once again, the integration in the denominator did not really need to be done. By simply looking at the numerator, we would recognize this to be a Gamma density with shape parameter Σ_i x_i + 1 and scale parameter 1/(n + 1). From the general formula for the mean of a Gamma density (see Chapter 1), we get the posterior mean formula E(λ | x_1, x_2, …, x_n) = (Σ_i x_i + 1)/(n + 1). Once again, it is a slight alteration of the estimate one may have thought of intuitively, namely the estimate Σ_i x_i / n.
f(μ | x_1, …, x_n) = e^{−(1/2) Σ_i (x_i − μ)²} e^{−μ²/2} / ∫_{−∞}^{∞} e^{−(1/2) Σ_i (x_i − μ)²} e^{−μ²/2} dμ
  = e^{−((n+1)/2) μ² + μ Σ_i x_i} / ∫_{−∞}^{∞} e^{−((n+1)/2) μ² + μ Σ_i x_i} dμ
  = e^{−((n+1)/2) (μ − Σ_i x_i/(n + 1))²} / ∫_{−∞}^{∞} e^{−((n+1)/2) (μ − Σ_i x_i/(n + 1))²} dμ.
Once again, it is not necessary to work out the integral in the denominator, although it is certainly not difficult to do so. All we need to recognize is that the numerator, after doing all the algebra that we did, has reduced to yet another normal density in μ, namely, a normal density with mean Σ_i x_i/(n + 1) and variance 1/(n + 1). If we did go through the chores of actually performing the integration in the denominator, we would surely find it to be just the normalizing constant of this normal density. The conclusion is that if X₁, X₂, …, Xₙ ∼ N(μ, 1), and μ has a standard normal prior, then the posterior density of μ is the N(Σ_i x_i/(n + 1), 1/(n + 1)) density. In particular, the posterior mean is E(μ | x_1, …, x_n) = Σ_i x_i/(n + 1), and the variance of the posterior distribution is 1/(n + 1). Note that the posterior variance does not depend on x_1, …, x_n! This rather remarkable fact is entirely specific to the choice of a normal prior density; any normal prior will result in this constant posterior variance property.
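A small sketch of this normal–normal update (helper name ours; the data are simulated with true mean 2, an arbitrary choice) shows the posterior mean tracking the data while the posterior variance is exactly 1/(n + 1) regardless of the sample:

```python
import numpy as np

rng = np.random.default_rng(2)

# N(mu, 1) data with a N(0, 1) prior on mu: the posterior is
# N(sum(x)/(n + 1), 1/(n + 1)).  The posterior variance is the same
# for every data set of size n.
def posterior(xs):
    n = len(xs)
    return float(np.sum(xs)) / (n + 1), 1.0 / (n + 1)

xs = rng.normal(loc=2.0, size=50)
post_mean, post_var = posterior(xs)

print(round(post_mean, 2), post_var == 1.0 / 51)
```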
Definition 3.9. Let X ∼ f(x|θ), and suppose θ ∼ g(θ), where g belongs to some family of densities G. The family G is called a family of conjugate priors for the model f if the posterior density f(θ|x) is also a member of G for any x.
The Beta family is a conjugate family for the binomial case. The Gamma family is conjugate for the Poisson case. Normal distributions on μ form a conjugate family for the mean in the normal case. Conjugate families are not unique, and in each new problem, one has to find a convenient one by inspection.
The posterior density combines the likelihood function l(θ) and the prior density g(θ), with suitable normalization. Fisher's idea was to use just the likelihood function itself as the yardstick for assessing the credibility of each θ for being the true value of the parameter. If the likelihood function l(θ) is large at some θ, that value is consistent with the data that were obtained; on the other hand, if the likelihood function l(θ) is small at some θ, that value is inconsistent with the data that were obtained. Fisher suggested maximizing the likelihood function over all possible values of θ, and using the maximizer as an estimate of θ. This is the celebrated maximum likelihood estimate.
Many think that maximum likelihood is the greatest conceptual invention in the history of statistics. Although in some high- or infinite-dimensional problems the computation and performance of maximum likelihood estimates are less than desirable, or even poor, in a vast majority of models in practical use MLEs are about the best that one can do. They have many asymptotic optimality properties that translate into fine performance in finite samples. We give a few illustrative examples, after defining an MLE.
Definition 3.10. Suppose, given a parameter θ, X^{(n)} = (X₁, …, Xₙ) have a joint pdf or joint pmf f(x_1, …, x_n | θ), θ ∈ Θ. Any value θ̂ = θ̂(X₁, …, Xₙ) at which the likelihood function l(θ) = f(x_1, …, x_n | θ) is maximized is called a maximum likelihood estimate (MLE) of θ, provided θ̂ ∈ Θ, and l(θ̂) < ∞.
It is important to understand that an MLE need not exist, or be unique. But in many examples it exists and is unique for any data set X^{(n)}. In maximizing the likelihood function over θ, any pure constant terms not involving θ may be ignored. Also, in many standard models it is more convenient to maximize L(θ) = log l(θ); this simplifies the algebra without affecting the correctness of the final answer.
Example 3.27 (MLE of Binomial p). Let X₁, …, Xₙ be iid Ber(p), 0 < p < 1. Then, writing X = Σ_{i=1}^n X_i (the total number of successes in these n trials), the likelihood function is

l(p) = p^X (1 − p)^{n−X}, so that L(p) = X log p + (n − X) log(1 − p).

For 0 < X < n, L(p) has a unique stationary point, namely a point at which the first derivative L′(p) = 0. This point is p = X/n. Furthermore, it is easily verified that
3.6 Maximum Likelihood Estimates 153
L″(p) < 0 for all p ∈ (0, 1); that is, L(p) is strictly concave. So, for 0 < X < n, there is a unique MLE of p, and it is just the common-sense estimate X/n, the sample proportion of successes. If X = 0 or n, the likelihood function is maximized at a boundary value p = 0 or 1. In those two cases, an MLE of p does not exist.
Example 3.28 (Mean of an Exponential). Let X₁, …, Xₙ be iid Exp(λ), λ > 0. Then,

l(λ) = (1/λ^n) e^{−(1/λ) Σ_{i=1}^n X_i} ⇒ L(λ) = −(1/λ) Σ_{i=1}^n X_i − n log λ.

L(λ) has a unique stationary point, it being λ = Σ_{i=1}^n X_i / n = X̄. Furthermore, L″(λ) < 0 at λ = X̄. Also note that l(λ) → 0 as λ → 0 or ∞. These three facts together imply that for all possible data sets X₁, …, Xₙ, there is a unique MLE of λ, and it is the sample mean X̄.
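A crude grid search over λ confirms that the log-likelihood just derived peaks at the sample mean (the true scale 2.5, the grid, and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)

# Grid-search check that the exponential log-likelihood
# L(lam) = -sum(x)/lam - n*log(lam) is maximized at the sample mean.
xs = rng.exponential(scale=2.5, size=200)

def loglik(lam):
    return -float(np.sum(xs)) / lam - len(xs) * np.log(lam)

grid = np.linspace(0.5, 10.0, 4001)
best = float(grid[np.argmax([loglik(l) for l in grid])])
xbar = float(np.mean(xs))

print(round(xbar, 3), round(best, 3))  # grid maximizer sits at ~ xbar
```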
Example 3.29 (MLE of Normal Mean and Variance). Let X₁, …, Xₙ be iid N(μ, σ²), −∞ < μ < ∞, σ² > 0. This is a two-parameter example. The likelihood function is

l(μ, σ²) = e^{−(1/(2σ²)) Σ_{i=1}^n (X_i − μ)²} / (σ²)^{n/2}.

Maximizing a function of two variables by calculus methods has to be done carefully, because the second derivative tests are subtle and must be carefully applied. We instead obtain the MLEs of μ and σ² directly, as follows. The argument uses a sequence of simple inequalities. As usual, let X̄ = Σ_{i=1}^n X_i / n, and also let s₀² = (1/n) Σ_{i=1}^n (X_i − X̄)²; note that this is different from the sample variance s² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)². The argument below shows that the unique MLEs of μ and σ² are X̄ and s₀².
l(θ) = ∏_{i=1}^n (1/θ) I_{0 ≤ x_i ≤ θ} = (1/θ^n) I_{max(x_1,…,x_n) ≤ θ} I_{min(x_1,…,x_n) ≥ 0}
  = (1/θ^n) I_{max(x_1,…,x_n) ≤ θ},

because under the model, with probability one under any θ > 0, min(X₁, …, Xₙ) is greater than zero.
Suppose (X, Y) have a joint bivariate normal distribution. A very important property of the bivariate normal is that each conditional distribution, the distribution of Y given X = x and that of X given Y = y, is a univariate normal, for any x and any y. This really helps in easily computing conditional probabilities involving one variable when the other variable is held fixed at some specific value.
Theorem 3.5. Let (X, Y) have a bivariate normal distribution with parameters μ₁, μ₂, σ₁, σ₂, ρ. Then

(a) X | Y = y ∼ N(μ₁ + ρ(σ₁/σ₂)(y − μ₂), σ₁²(1 − ρ²));
(b) Y | X = x ∼ N(μ₂ + ρ(σ₂/σ₁)(x − μ₁), σ₂²(1 − ρ²)).

In particular, the mean of each conditional distribution is a linear function of the conditioning value, and the variance of each conditional distribution is a constant, and does not depend on the conditioning values x or y.
The proof of this theorem involves some tedious integration manipulations, and we omit it; the details of the proof are available in Tong (1990).
Remark. We see here that the conditional expectation is linear in the bivariate normal case. Specifically, E(Y | X = x) = μ₂ + ρ(σ₂/σ₁)(x − μ₁). Previously, we have seen in Chapter 2 that the conditional expectation E(Y|X) is, in general, the best predictor of Y based on X. Now we see that the conditional expectation is a linear predictor in the bivariate normal case, and it is the best predictor, and therefore also the best linear predictor. In Chapter 2, we called the best linear predictor the regression line of Y on X. Putting it all together, we have the very special result that in the bivariate normal case, the regression line of Y on X and the best overall predictor are the same:
For bivariate normal distributions, the conditional expectation of one variable given the other coincides with the regression line of that variable on the other variable.
3.8 Useful Formulas and Characterizations for Bivariate Normal
Example 3.31. Suppose incomes of husbands and wives in a population are bivari-
ate normal with means 75 and 60 (in thousands of dollars), standard deviations 20
each, and a correlation of .75. We want to know in what percentage of those families
where the wife earns 80,000 dollars, the family income exceeds 175,000 dollars.
Denote the income of the husband and the wife by X and Y. Then, we want to find \(P(X + Y > 175 \mid Y = 80)\). By the above theorem, \(X \mid Y = 80 \sim N(75 + .75(80 - 60),\; 400(1 - .75^2)) = N(90, 175)\). Therefore,
\[
P(X + Y > 175 \mid Y = 80) = P(X > 95 \mid Y = 80)
= P\!\left(Z > \frac{95 - 90}{\sqrt{175}}\right) = P(Z > .38) = .3520,
\]
where Z denotes a standard normal variable.
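As a quick numerical check of this example (not part of the text; variable names are mine), the conditional mean, conditional variance, and the desired probability can be computed with scipy:

```python
from scipy.stats import norm

# Bivariate normal of the incomes example: means 75, 60; SDs 20, 20; correlation .75
mu1, mu2, s1, s2, rho = 75.0, 60.0, 20.0, 20.0, 0.75
y = 80.0
cond_mean = mu1 + rho * (s1 / s2) * (y - mu2)   # E(X | Y = 80) = 90
cond_var = s1**2 * (1 - rho**2)                 # Var(X | Y = 80) = 175
# P(X + Y > 175 | Y = 80) = P(X > 95 | Y = 80)
p = norm.sf(95.0, loc=cond_mean, scale=cond_var**0.5)
print(round(cond_mean, 1), round(cond_var, 1), round(p, 4))
```

The small difference from .3520 in the text comes only from rounding the z-value to .38 there.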
Example 3.32 (Galton’s Observation: Regression to the Mean). This example is
similar to the previous example, but makes a different and interesting point. It is often
found that students who get a very good grade on the first midterm, do not do as
well on the second midterm. We can try to explain it by doing a bivariate normal
calculation.
Denote the grade on the first midterm by X , that on the second midterm by Y ,
and suppose X; Y are jointly bivariate normal with means 70, standard deviations
10, and a correlation .7. Suppose a student scored 90 on the first midterm. What are
the chances that she will get a lower grade on the second midterm?
This is
\[
P(Y < X \mid X = 90) = P(Y < 90 \mid X = 90)
= P\!\left(Z < \frac{90 - 84}{\sqrt{51}}\right) = P(Z < .84) = .7995,
\]
where Z is a standard normal variable, and we have used the fact that \(Y \mid X = 90 \sim N(70 + .7(90 - 70),\; 100(1 - .7^2)) = N(84, 51)\).
Thus, with a fairly high probability, the student will not be able to match her first
midterm grade on the second midterm. The phenomenon of regression to mediocrity
was popularized by Galton, who noticed that the offspring of very tall parents tended
to be much closer to being of just about average height, and the extreme tallness in
the parents was not commonly passed on to the children.
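A small simulation (my sketch; the cutoff of 85 for a "very good" first-midterm grade is an arbitrary choice, not from the text) shows the same regression-to-the-mean effect empirically:

```python
import numpy as np

# Midterm grades (X, Y) bivariate normal: means 70, SDs 10, correlation .7,
# generated via the conditional distribution Y | X = x ~ N(70 + .7(x - 70), 51).
rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(70, 10, n)
y = 70 + 0.7 * (x - 70) + rng.normal(0, np.sqrt(51), n)
top = x > 85                          # students with a very good first midterm
frac_lower = np.mean(y[top] < x[top])
print(round(frac_lower, 2))           # most high scorers do worse the second time
```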
A number of extremely elegant characterizations and also some very neat formulas
for useful quantities are available for a general bivariate normal distribution. An-
other practical issue is the numerical computation of bivariate normal probabilities.
Although tables are widely available for the univariate standard normal, for the
bivariate normal corresponding tables are found only in specialized sources, and
are sketchy. Thus, a simple and reasonably accurate approximation formula is prac-
tically useful. We deal with these issues in this section.
First, we need some notation. For jointly distributed random variables X, Y with means \(\mu_1, \mu_2\), standard deviations \(\sigma_1, \sigma_2\), and positive integers r, s, we denote
\[
\mu_{r,s} = \frac{E[(X - \mu_1)^r (Y - \mu_2)^s]}{\sigma_1^r \sigma_2^s}; \qquad
\nu_{r,s} = \frac{E[|X - \mu_1|^r\, |Y - \mu_2|^s]}{\sigma_1^r \sigma_2^s}.
\]
We then have the following useful formulas for a general bivariate normal
distribution.
and
\[
\operatorname{Cov}(X, Y) = \rho\sigma_1\sigma_2; \qquad
\operatorname{Var}(X - Y) = \sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2.
\]
There are many approximations to the CDF of a general bivariate normal distribu-
tion. The most accurate ones are too complex for quick use. The relatively simple
approximations are not computationally accurate for all configurations of the argu-
ments and the parameters. Keeping a balance between simplicity and accuracy, we
present here two approximations.
Mee–Owen Approximation. Let \((X, Y)\) have the general five-parameter bivariate normal distribution. Then,
\[
P(X \le \mu_1 + h\sigma_1,\; Y \le \mu_2 + k\sigma_2) \approx \Phi(h)\,\Phi\!\left(\frac{k - c}{\sigma}\right),
\]
where \(c = -\rho\,\frac{\phi(h)}{\Phi(h)}\), \(\sigma^2 = 1 + \rho h c - c^2\).

Cox–Wermuth Approximation.
\[
P(X \ge \mu_1 + h\sigma_1,\; Y \ge \mu_2 + k\sigma_2) \approx \Phi(-h)\,\Phi\!\left(\frac{\rho\,\xi(h) - k}{\sqrt{1 - \rho^2}}\right),
\]
where \(\xi(h) = \frac{\phi(h)}{1 - \Phi(h)}\).
See Mee and Owen (1983) and Cox and Wermuth (1991) for the motivation
behind these approximations. See Plackett (1954) for reducing the dimension of
the integral for computing multivariate normal probabilities. Genz (1993) provides
some comparison of the different algorithms and approximations for computing
multivariate normal probabilities.
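One possible coding of the two approximations in the standardized case (a sketch; the function names are mine, and the formulas follow the truncated-mean/variance reading of the approximations above):

```python
import numpy as np
from scipy.stats import norm

def mee_owen(h, k, rho):
    # Approximate P(X <= h, Y <= k), standardized bivariate normal: replace the
    # law of Y given X <= h by a normal with the matching mean and variance.
    c = -rho * norm.pdf(h) / norm.cdf(h)   # E(Y | X <= h)
    s2 = 1 + rho * h * c - c**2            # Var(Y | X <= h); may go negative
    if s2 <= 0:
        raise ValueError("Mee-Owen approximation not usable here")
    return norm.cdf(h) * norm.cdf((k - c) / np.sqrt(s2))

def cox_wermuth(h, k, rho):
    # Approximate P(X >= h, Y >= k); xi(h) is the inverse Mills ratio E(X | X >= h).
    xi = norm.pdf(h) / norm.sf(h)
    return norm.sf(h) * norm.cdf((rho * xi - k) / np.sqrt(1 - rho**2))

print(round(mee_owen(2, 2, 0.3), 4))    # compare with Fig. 3.7: about 0.9564
print(round(cox_wermuth(1, 1, 0.5), 4))
```

Both reduce to the exact independent-case product when \(\rho = 0\), which is a useful sanity check.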
Fig. 3.7 P(X < 2, Y < 2) and the Mee–Owen approximation in the standardized case
Fig. 3.8 P(X > k, Y > k) and the Cox–Wermuth approximation in the standardized case, rho = .5
in two trial cases. We can see from Fig. 3.8 that the Cox–Wermuth approximation is nearly exact in the trial case. The Mee–Owen approximation in Fig. 3.7 is reasonable, but not very accurate. It should also be noted that the quantity \(\sigma^2\) in the Mee–Owen approximation can be negative, in which case the approximation is not usable. Generally, the Mee–Owen approximation is inaccurate or unusable if \(h, k, \rho\) are large. The Cox–Wermuth approximation should not be used when \(\rho\) is large (\(> .75\) or so).
In applications, one is very often interested in finding the expectation of one random
variable given that another random variable belongs to some set, rather than given
that it is exactly equal to some value. For instance, we may want to know what the
average income of husbands is among those families where the wife earns more than
$100,000.
3.9 Conditional Expectation Given a Set and Borel's Paradox
The mathematical formulation of the problem is to find E.X jY 2 A/, for some
given set A. It is not possible to talk rigorously about this without using measure
theory. In fact, even defining E.X jY 2 A/ can be a problem. We limit ourselves to
special types of sets A.
Definition 3.11. Let .X; Y / have a joint density f .x; y/, with marginal densities
fX .x/; fY .y/. Let A be a subset of the real line such that P .Y 2 A/ > 0. Then
\[
E(X \mid Y \in A) = \frac{\int_{-\infty}^{\infty} \int_A x f(x, y)\, dy\, dx}{\int_{-\infty}^{\infty} \int_A f(x, y)\, dy\, dx}.
\]
Remark. When the conditioning event A has probability zero, we can get into para-
doxical situations when we try to compute E.X jY 2 A/. What happens is that it
may be possible to rewrite the conditioning event Y 2 A as an equivalent event
V 2 B for some carefully chosen function V D V .X; Y /. Yet, when we com-
pute E.X jY 2 A/ and E.X jV 2 B/, we arrive at different answers! The paradox
arises because of subtleties of measure zero sets. It is not possible to describe how
one avoids such a paradox without knowledge of abstract measure theory. We do, however, give an example illustrating this paradox, popularly known as Borel's paradox.
Example 3.34 (Borel's Paradox). Let (X, Y) have the joint density
\[
f(x, y) = 1, \quad \text{if } 0 \le x \le 1,\; -x \le y \le 1 - x.
\]
The conditional density of X given Y = 0 works out to
\[
f(x \mid Y = 0) = 1, \quad 0 \le x \le 1.
\]
This is just the uniform density on [0, 1], and so we get \(E(X \mid Y = 0) = .5\).

Now transform (X, Y) by the one-to-one transformation \((X, Y) \to (U, V)\), where \(U = X,\; V = \frac{X + Y}{X}\). The Jacobian of the transformation is \(J = u\), and hence the joint density of (U, V) is
\[
f_{U,V}(u, v) = u, \quad 0 < u < 1,\; 0 < v < \frac{1}{u}.
\]
Example 3.35 (Mean Residual Life). In survival analysis and medicine, a quantity
of great interest is the mean residual life. Suppose that a person afflicted with some
disease has survived five years. How much longer can the patient be expected to
survive? Thus, suppose X is a continuous random variable with density f .x/. We
want to find \(E(X - c \mid X \ge c)\). Assuming that \(P(X \ge c) > 0\),
\[
E(X - c \mid X \ge c) = E(X \mid X \ge c) - c = \frac{\int_c^{\infty} x f(x)\, dx}{1 - F(c)} - c.
\]
For example, if X is an exponential random variable with mean \(\lambda\), then
\[
E(X - c \mid X \ge c) = \frac{\int_c^{\infty} \frac{x}{\lambda} e^{-x/\lambda}\, dx}{e^{-c/\lambda}} - c
= \frac{(c + \lambda)\, e^{-c/\lambda}}{e^{-c/\lambda}} - c = \lambda,
\]
which is independent of c. We recognize that this is just the lack of memory property
of an exponential distribution manifesting itself in the mean residual life calculation.
In contrast, suppose \(X \sim N(0, 1)\) (of course, in reality a survival time X cannot have mean zero!). Then,
\[
E(X - c \mid X \ge c) = E(X \mid X \ge c) - c
= \frac{\int_c^{\infty} x \phi(x)\, dx}{1 - \Phi(c)} - c
= \frac{\phi(c)}{1 - \Phi(c)} - c = \frac{1}{R(c)} - c,
\]
where \(R(c) = \frac{1 - \Phi(c)}{\phi(c)}\) is the Mills ratio. The calculation shows that the Mills ratio
arises very naturally in a calculation of interest in survival analysis.
Note that now the mean residual life is no longer independent of c. Take c to be
positive. From Laplace’s expansion for the Mills ratio (see Chapter 1),
\[
\frac{1}{R(c)} - c \approx \frac{c^3}{c^2 - 1} - c = \frac{c}{c^2 - 1} \approx \frac{1}{c}.
\]
That is, the mean residual life is approximately equal to \(\frac{1}{c}\), which is a decreasing function of c. So, unlike in the exponential case, if survival time is normal, then a patient who has survived a long time is increasingly unlikely to survive much longer.
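The two mean residual life calculations above can be checked numerically; the following sketch (the generic helper `mrl` is my naming, not the book's) uses numerical integration:

```python
import numpy as np
from scipy.stats import norm, expon
from scipy import integrate

def mrl(pdf, sf, c):
    # mean residual life E(X - c | X >= c) = (int_c^inf x f(x) dx) / (1 - F(c)) - c
    num, _ = integrate.quad(lambda x: x * pdf(x), c, np.inf)
    return num / sf(c) - c

lam = 2.0
mrl_exp = mrl(lambda x: expon.pdf(x, scale=lam), lambda x: expon.sf(x, scale=lam), 5.0)
print(round(mrl_exp, 4))       # exponential: equals the mean lam for every c
mrl_norm = mrl(norm.pdf, norm.sf, 3.0)
print(round(mrl_norm, 4))      # standard normal at c = 3: roughly 1/c
```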
Exercises
Exercise 3.1. Suppose (X, Y) have the joint density \(f(x, y) = cxy,\; x, y \in [0, 1]\).
(a) Find the normalizing constant c.
(b) Are X, Y independent?
(c) Find the marginal densities and expectations of X, Y.
(d) Find the expectation of XY.

Exercise 3.2. Suppose (X, Y) have the joint density \(f(x, y) = cxy,\; x, y \ge 0,\; x + y \le 1\).
(a) Find the normalizing constant c.
(b) Are X, Y independent?
(c) Find the marginal densities and expectations of X, Y.
(d) Find the expectation of XY.
Exercise 3.5 (Uniform Distribution in a Sphere). Suppose (X, Y, Z) has the density \(f(x, y, z) = c\), if \(x^2 + y^2 + z^2 \le 1\).
(a) Find the constant c.
(b) Are any of X, Y, or Y, Z, or X, Z pairwise independent?
(c) Find the marginal densities and expectations of X, Y, Z.
Exercise 3.6. Suppose X, Y are independent U[0, 1] variables. Find the conditional expectation \(E(|X - Y| \mid Y = y)\).
Exercise 3.13 * (Breaking a Stick). Suppose \(X \sim U[0, 1]\), and given that \(X = x\), \(Y \sim U[0, x]\). Let \(U = 1 - X,\; V = Y,\; W = X - Y\). Find the expectation of the maximum of U, V, W.
This amounts to breaking a stick, and then breaking the left piece again.
Exercise 3.18. Let \(X_1, \ldots, X_n\) be iid \(N(\theta, \theta),\; \theta > 0\). Show that there is a unique MLE of \(\theta\), and find it.
Exercise 3.19 (MLE in a Genetics Problem). According to Mendel's law, the genotypes aa, Aa, and AA in a population in genetic equilibrium with respect to a single gene having two alleles occur in proportions \(f^2\), \(2f(1 - f)\), and \((1 - f)^2\). Suppose n individuals are sampled from the population and the numbers of observed individuals of each genotype are \(n_1, n_2, n_3\), respectively. Find the MLE of f.
Exercise 3.20 * (MLE in a Discrete Parameter Problem). Two independent proofreaders A and B are asked to read a manuscript containing N errors; \(N \ge 0\) is unknown. \(n_1\) errors are found by A alone, \(n_2\) by B alone, and \(n_{12}\) by both. What is the MLE of N? State your assumptions.
Exercise 3.21 * (MLE for Double Exponential Case). Let \(X_1, \ldots, X_n\) be iid \(\text{DoubleExp}(\mu, 1)\). Show that the sample median is one MLE of \(\mu\); is it the only MLE?
Exercise 3.22 * (MLE of Common Mean). Suppose \(X_1, X_2, \ldots, X_m\) are iid \(N(\mu, \sigma_1^2)\) and \(Y_1, Y_2, \ldots, Y_n\) are iid \(N(\mu, \sigma_2^2)\), and all m + n observations are independent. Find the MLE of \(\mu\).
Exercise 3.23 * (MLE Under a Constraint). Let \(X_1, \ldots, X_n\) be iid \(N(\mu, 1)\), where we know that \(\mu \ge 0\). Show that there is a unique MLE of \(\mu\), and find it.
Hint: Think intuitively.
Exercise 3.24 * (MLE in the Gamma Case). Let \(X_1, \ldots, X_n\) be iid \(G(\alpha, \lambda),\; \alpha > 0,\; \lambda > 0\). Show that there is a unique MLE of \((\alpha, \lambda)\), which is the only stationary point of the logarithm of the likelihood function. Compute it for the following simple dataset (n = 8): .5, 1, 1.4, 2, 1, 2.5, 1.5, 2.
Exercise 3.25 (Bivariate Normal Probability). Suppose X, Y are jointly bivariate normal with zero means, unit standard deviations, and correlation \(\rho\). Find all values of \(\rho\) for which \(\frac{1}{4} \le P(X > 0, Y > 0) \le \frac{5}{12}\).
Exercise 3.26. Suppose X, Y are jointly bivariate normal with zero means, unit standard deviations, and correlation \(\rho = .75\). Find \(P(Y > 2 \mid X = 1)\).

Exercise 3.27. Suppose X, Y are jointly bivariate normal with general parameters. Characterize all constants a, b such that X + Y and aX + bY are independent.

Exercise 3.28 * (Probability of a Diamond). Suppose X, Y, Z are independent U[−1, 1] variables. Find the probability that \(|X| + |Y| + |Z| \le 1\).
Exercise 3.29 (Missing the Bus). A bus arrives at a random time between 9:00 AM
and 9:15 AM at a stop. Tim will arrive at that stop at a random time between 9:00
AM and 9:15 AM, independently of the bus, and will wait for (at most) five minutes
at the stop. Find the probability that Tim will meet the bus.
Exercise 3.30. Cathy and Jen plan to meet at a cafe and each will arrive at the cafe
at a random time between 11:00 AM and 11:30 AM, independently of each other.
Find the probability that the first to arrive has to wait between 5 and 10 minutes for
the other to arrive.
Exercise 3.31 (Bivariate Normal Probability). Suppose the amounts of oil (in
barrels) lifted on a given day from two wells are jointly bivariate normal, with means
150 and 200, and variances 100 and 25, and correlation .5. What is the probability
that the total amount lifted is larger than 400 barrels on one given day? The proba-
bility that the amounts lifted from the two wells on one day differ by more than 50
barrels?
Exercise 3.35 (Bivariate Normal Calculation). Suppose \(X \sim N(0, 1)\), and given X = x, \(Y \sim N(x + 1, 1)\).
(a) What is the marginal distribution of Y?
(b) What is the correlation between X and Y?
(c) What is the conditional distribution of X given Y = y?
Exercise 3.41 * (Buffon’s Needle). Suppose the plane is gridded by a series of par-
allel lines, drawn h units apart. A needle of length l is dropped at random on the
plane. Let p.l; h/ be the probability that the needle intersects one of the parallel
lines. Show that
(a) \(p(l, h) = \frac{2l}{\pi h}\), if \(l \le h\);
(b) \(p(l, h) = \frac{2}{\pi h}\left[l - \sqrt{l^2 - h^2} - h \arcsin\!\left(\frac{h}{l}\right)\right] + 1\), if \(l > h\).
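The Buffon probabilities are easy to check by Monte Carlo (a sketch, not a solution to the exercise; the sampling scheme — center distance to the nearest line and acute angle — is the standard one):

```python
import numpy as np

def buffon_mc(l, h, n=1_000_000, seed=1):
    # Drop the needle: distance of its center to the nearest line is U(0, h/2),
    # the acute angle with the lines is U(0, pi/2); it crosses iff x <= (l/2) sin(theta).
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, h / 2, n)
    theta = rng.uniform(0, np.pi / 2, n)
    return np.mean(x <= (l / 2) * np.sin(theta))

def buffon_exact(l, h):
    if l <= h:
        return 2 * l / (np.pi * h)
    return 2 / (np.pi * h) * (l - np.sqrt(l**2 - h**2) - h * np.arcsin(h / l)) + 1

print(round(buffon_mc(1, 2), 3), round(buffon_exact(1, 2), 3))   # short needle
print(round(buffon_mc(3, 2), 3), round(buffon_exact(3, 2), 3))   # long needle
```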
References
Cox, D. and Wermuth, N. (1991). A simple approximation for bivariate and trivariate normal inte-
grals, Internat. Statist. Rev., 59, 263–269.
Genz, A. (1993). Comparison of methods for the computation of multivariate normal probabilities, Computing Sciences and Statistics, 25, 400–405.
Kagan, A., Linnik, Y., and Rao, C. R. (1973). Characterization Problems in Mathematical Statis-
tics, Wiley, New York.
Kamat, A. (1953). Incomplete and absolute moments of the multivariate normal distribution, with
applications, Biometrika, 40, 20–34.
Mee, R. and Owen, D. (1983). A simple approximation for bivariate normal probability, J. Qual. Tech., 15, 72–75.
Patel, J. and Read, C. (1996). Handbook of the Normal Distribution, Marcel Dekker, New York.
Plackett, R. (1954). A reduction formula for multivariate normal probabilities, Biometrika, 41,
351–360.
Tong, Y. (1990). Multivariate Normal Distribution, Springer-Verlag, New York.
Chapter 4
Advanced Distribution Theory
\[
\begin{aligned}
P(X + Y \le z) &= E[I_{X+Y \le z}]
= E_Y[E(I_{X+Y \le z} \mid Y = y)] = E_Y[E(I_{X+y \le z} \mid Y = y)]\\
&= E_Y[P(X \le z - y \mid Y = y)]
= E_Y[F_{X \mid Y}(z - y)] = \int_{-\infty}^{\infty} F_{X \mid Y}(z - y)\, f_Y(y)\, dy.
\end{aligned}
\]
In particular, if X and Y are independent, then the conditional CDF \(F_{X \mid Y}(u)\) will be the same as the marginal CDF \(F_X(u)\) of X. In this case, the expression above simplifies to
\[
P(X + Y \le z) = \int_{-\infty}^{\infty} F_X(z - y)\, f_Y(y)\, dy.
\]
The density of X + Y can be obtained by differentiating the CDF of X + Y:
\[
f_{X+Y}(z) = \frac{d}{dz} P(X + Y \le z)
= \frac{d}{dz} \int_{-\infty}^{\infty} F_X(z - y)\, f_Y(y)\, dy
= \int_{-\infty}^{\infty} \frac{d}{dz} F_X(z - y)\, f_Y(y)\, dy
= \int_{-\infty}^{\infty} f_X(z - y)\, f_Y(y)\, dy.
\]
Indeed, this is the general formula for the density of the sum of two real-valued independent continuous random variables.
More generally, if X, Y are not necessarily independent, and have joint density f(x, y), then Z = X + Y has the density
\[
f_Z(z) = \int_{-\infty}^{\infty} f_{X \mid Y}(z - y \mid y)\, f_Y(y)\, dy.
\]
For example, if X, Y are independent exponentials with mean \(\lambda\), the convolution formula gives
\[
f_Z(z) = \frac{z\, e^{-z/\lambda}}{\lambda^2}, \quad z > 0,
\]
which is the density of a Gamma distribution with parameters 2 and \(\lambda\). Recall that we had proved this earlier in Chapter 1 by using mgfs.
Example 4.3 (Difference of Exponentials). Let U, V be independent standard exponentials. We want to find the density of Z = U − V. Writing X = U and Y = −V, we notice that Z = X + Y, and X, Y are still independent. However, now Y is a negative exponential, and so has density \(f_Y(y) = e^{y},\; y < 0\). It is also important to note that Z can now take any real value, positive or negative. Substituting into the formula for the convolution density,
\[
f_Z(z) = \int_{-\infty}^{\infty} e^{-(z - y)}\, I_{y < z}\; e^{y}\, I_{y < 0}\, dy.
\]
Carrying out the integration separately for \(z \ge 0\) and \(z < 0\), we get
\[
f_Z(z) = \frac{1}{2} e^{-|z|}, \quad -\infty < z < \infty;
\]
that is, if X, Y are independent standard exponentials, then the difference X − Y has a standard double exponential density. This representation of the double exponential is often useful. Also note that although the standard exponential distribution
nential is often useful. Also note that although the standard exponential distribution
is obviously not symmetric, the distribution of the difference of two independent
exponentials is symmetric. This is a useful technique for symmetrizing a random
variable.
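A quick simulation check of this fact (my sketch; the comparison points are arbitrary):

```python
import numpy as np

# X - Y for independent standard exponentials should have the standard
# double exponential (Laplace) density (1/2) e^{-|z|}.
rng = np.random.default_rng(0)
z = rng.exponential(size=500_000) - rng.exponential(size=500_000)
# Laplace CDF: (1/2) e^t for t < 0, and 1 - (1/2) e^{-t} for t >= 0
for t in (-1.0, 0.0, 1.5):
    cdf = 0.5 * np.exp(t) if t < 0 else 1 - 0.5 * np.exp(-t)
    print(round(np.mean(z <= t), 3), round(cdf, 3))
```

The symmetry of the difference is visible even though each exponential is itself very skewed.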
Example 4.4 (A Neat General Formula). Suppose X, Y are positive random variables with a joint density of the form \(f(x, y) = g(x + y)\). What is the density of the convolution?
Note that now X; Y are in general not independent, because a joint density of
the form g.x C y/ does not in general factorize into the product form necessary for
independence. First, the conditional density
\[
f_{X \mid Y}(x \mid y) = \frac{g(x + y)}{\int_0^{\infty} g(x + y)\, dx}
= \frac{g(x + y)}{\int_y^{\infty} g(x)\, dx} = \frac{g(x + y)}{\bar{G}(y)},
\]
writing \(\bar{G}(y)\) for \(\int_y^{\infty} g(x)\, dx\). Also, the marginal density of Y is
\[
f_Y(y) = \int_0^{\infty} g(x + y)\, dx = \int_y^{\infty} g(x)\, dx = \bar{G}(y).
\]
Substituting into the general case formula for the density of a sum,
\[
f_Z(z) = \int_0^z \frac{g(z)}{\bar{G}(y)}\, \bar{G}(y)\, dy = z\, g(z),
\]
Similarly, for the sum of two independent standard Cauchys, the convolution formula gives
\[
f_Z(z) = \frac{1}{\pi^2} \int_{-\infty}^{\infty} \frac{dy}{(1 + (z - y)^2)(1 + y^2)} = \frac{2}{\pi(4 + z^2)}.
\]
Next, suppose \(X \sim N(0, \sigma^2)\) and \(Y \sim \text{Poi}(\lambda)\) are independent. Then the density of the sum is
\[
f_Z(z) = \sum_{y = 0}^{\infty} \frac{1}{\sigma}\, \phi\!\left(\frac{z - y}{\sigma}\right) \frac{e^{-\lambda} \lambda^{y}}{y!}.
\]
For purposes of summary and easy reference, we list some convolutions of common types below.

4.2 Products and Quotients and the t- and F-Distribution

Suppose X, Y are two random variables. Then two other functions that arise naturally in many applications are the product XY and the quotient X/Y. Following exactly the same technique as for convolutions, one can find the density of each of XY and X/Y. More precisely, one first finds the CDF by using the iterated expectation technique, exactly as we did for convolutions, and then differentiates the CDF to obtain the density. Here are the density formulas; they are extremely important and useful. They are proved in exactly the same way that the formula for the density of the convolution was obtained above; one conditions, and then takes an iterated expectation. Therefore, the formal detail is omitted.
Theorem 4.2. Let X, Y be continuous random variables with a joint density f(x, y). Let \(U = XY,\; V = \frac{X}{Y}\). Then the densities of U, V are given by
\[
f_U(u) = \int_{-\infty}^{\infty} \frac{1}{|x|}\, f\!\left(x, \frac{u}{x}\right) dx; \qquad
f_V(v) = \int_{-\infty}^{\infty} |y|\, f(vy, y)\, dy.
\]
\[
= \frac{\left(\min\{\tfrac{1}{v}, 1\}\right)^2}{2}, \quad 0 < v < \infty;
\]
thus, the density of the quotient V is
\[
f_V(v) = \frac{1}{2}, \quad \text{if } 0 < v \le 1; \qquad
f_V(v) = \frac{1}{2v^2}, \quad \text{if } v > 1.
\]
The density of the quotient is plotted in Fig. 4.2; we see that it is continuous, but not differentiable at v = 1.
Example 4.8 (Ratio of Standard Normals). The distribution of the ratio of two independent standard normal variables is an interesting one; we show now that it is in fact a standard Cauchy distribution. Indeed, by applying the general formula, the density of the quotient \(V = \frac{X}{Y}\) is
\[
f_V(v) = \int_{-\infty}^{\infty} |y|\, f(vy, y)\, dy
= \frac{1}{2\pi} \int_{-\infty}^{\infty} |y|\, e^{-\frac{y^2}{2}(1 + v^2)}\, dy
= \frac{1}{\pi(1 + v^2)}, \quad -\infty < v < \infty,
\]
which is the standard Cauchy density.
Fig. 4.2 The density of the quotient V
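The Cauchy conclusion is easy to confirm by simulation (a sketch, not part of the text), comparing empirical quantiles of the ratio with Cauchy quantiles:

```python
import numpy as np
from scipy.stats import cauchy

# Ratio of two independent standard normals should be standard Cauchy.
rng = np.random.default_rng(0)
v = rng.standard_normal(400_000) / rng.standard_normal(400_000)
for q in (0.25, 0.5, 0.75, 0.9):
    print(round(np.quantile(v, q), 2), round(cauchy.ppf(q), 2))
```

Quantiles are the right thing to compare here: the Cauchy has no mean, so sample averages of the ratio do not settle down.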
Example 4.9 (The F-Distribution). Let \(X \sim G(\alpha, 1),\; Y \sim G(\beta, 1)\), and suppose X, Y are independent. The distribution of the ratio \(R = \frac{X/\alpha}{Y/\beta}\) arises in statistics in many contexts and is called an F-distribution. We derive the explicit form of the density here.

First, we find the density of \(\frac{X}{Y}\), from which the density of \(R = \frac{X/\alpha}{Y/\beta}\) follows easily. Again, by applying the general formula for the density of a quotient, the density of the quotient \(V = \frac{X}{Y}\) is
\[
\begin{aligned}
f_V(v) &= \int_{-\infty}^{\infty} |y|\, f(vy, y)\, dy
= \frac{1}{\Gamma(\alpha)\Gamma(\beta)} \int_0^{\infty} y\, e^{-y(1 + v)} (vy)^{\alpha - 1} y^{\beta - 1}\, dy\\
&= \frac{v^{\alpha - 1}}{\Gamma(\alpha)\Gamma(\beta)} \int_0^{\infty} e^{-y(1 + v)}\, y^{\alpha + \beta - 1}\, dy
= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \frac{v^{\alpha - 1}}{(1 + v)^{\alpha + \beta}}, \quad 0 < v < \infty.
\end{aligned}
\]
Since \(R = \frac{\beta}{\alpha} V\), writing \(c = \frac{\beta}{\alpha}\) we have
\[
f_R(r) = \frac{1}{c}\, f_V\!\left(\frac{r}{c}\right),
\]
where \(f_V\) is the function we just derived above. If we simplify \(f_R(r)\), we get the final expression
\[
f_R(r) = \frac{(\beta/\alpha)^{\beta}}{B(\alpha, \beta)}\, \frac{r^{\alpha - 1}}{\left(r + \frac{\beta}{\alpha}\right)^{\alpha + \beta}}, \quad r > 0.
\]
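The density of the quotient V = X/Y of two independent Gammas can be verified numerically; the sketch below (parameter values are arbitrary) checks that it integrates to one and matches simulation:

```python
import numpy as np
from scipy.special import gamma as G
from scipy import integrate

def f_V(v, a, b):
    # density of V = X/Y for X ~ Gamma(a, 1), Y ~ Gamma(b, 1), derived above
    return G(a + b) / (G(a) * G(b)) * v**(a - 1) / (1 + v)**(a + b)

a, b = 2.5, 3.0
total, _ = integrate.quad(lambda v: f_V(v, a, b), 0, np.inf)
print(round(total, 6))   # should be 1 up to quadrature error

rng = np.random.default_rng(0)
v = rng.gamma(a, size=300_000) / rng.gamma(b, size=300_000)
p_emp = np.mean(v <= 1.0)
p_num, _ = integrate.quad(lambda t: f_V(t, a, b), 0, 1.0)
print(round(p_emp, 3), round(p_num, 3))
```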
Example 4.10 (The Student t-Distribution). Once again, the t-distribution is one that arises frequently in statistics. Suppose \(X \sim N(0, 1),\; Z \sim \chi^2_m\), and suppose X, Z are independent. Let \(Y = \sqrt{\frac{Z}{m}}\). Then the distribution of the quotient \(V = \frac{X}{Y}\) is called the t-distribution with m degrees of freedom. We derive its density in this example.
Recall that Z has the density
\[
f_Z(z) = \frac{e^{-z/2}\, z^{m/2 - 1}}{2^{m/2}\, \Gamma(\frac{m}{2})}, \quad z > 0.
\]
Therefore, \(Y = \sqrt{\frac{Z}{m}}\) has the density
\[
f_Y(y) = \frac{m^{m/2}\, e^{-m y^2/2}\, y^{m - 1}}{2^{m/2 - 1}\, \Gamma(\frac{m}{2})}, \quad y > 0.
\]
Because, by hypothesis, X and Z are independent, it follows that X and Y are also independent, and so their joint density f(x, y) is just the product of the marginal densities of X and Y.
Once again, by applying our general formula for the density of a quotient,
\[
\begin{aligned}
f_V(v) &= \int_{-\infty}^{\infty} |y|\, f(vy, y)\, dy
= \frac{m^{m/2}}{\sqrt{2\pi}\, 2^{m/2 - 1}\, \Gamma(\frac{m}{2})} \int_0^{\infty} y\, e^{-v^2 y^2/2}\, e^{-m y^2/2}\, y^{m - 1}\, dy\\
&= \frac{m^{m/2}}{\sqrt{2\pi}\, 2^{m/2 - 1}\, \Gamma(\frac{m}{2})} \int_0^{\infty} e^{-(v^2 + m) y^2/2}\, y^{m}\, dy
= \frac{m^{m/2}\, \Gamma(\frac{m + 1}{2})\, 2^{(m - 1)/2}}{\sqrt{2\pi}\, 2^{m/2 - 1}\, \Gamma(\frac{m}{2})\, (m + v^2)^{(m + 1)/2}}\\
&= \frac{m^{m/2}\, \Gamma(\frac{m + 1}{2})}{\sqrt{\pi}\, \Gamma(\frac{m}{2})\, (m + v^2)^{(m + 1)/2}}
= \frac{\Gamma(\frac{m + 1}{2})}{\sqrt{m\pi}\, \Gamma(\frac{m}{2})\, \left(1 + \frac{v^2}{m}\right)^{(m + 1)/2}}, \quad -\infty < v < \infty.
\end{aligned}
\]
m
Fig. 4.3 t Density for m D 3, 20, 30 degrees of freedom with N(0,1) density superimposed
\[
= \frac{1}{\pi} \int_{|u|}^{\infty} \frac{x\, e^{-x^2/2}}{\sqrt{x^2 - u^2}}\, dx
= \frac{1}{\pi}\sqrt{\frac{\pi}{2}}\, e^{-u^2/2}
= \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2},
\]
4.3 Transformations
The simple technique that we used in the previous section to derive the density of
a sum or a product does not extend to functions of a more complex nature. Con-
sider the simple case of just two continuous variables X; Y with some joint density
f .x; y/, and suppose we want to find the density of some function U D g.X; Y /.
Then, the general technique is to pair up U with another function V D h.X; Y /, and
first obtain the joint CDF of .U; V / from the joint CDF of .X; Y /. The pairing up has
to be done carefully: only some judicious choices of V will work in a given example.
Having found the joint CDF of .U; V /, by differentiation one finds the joint density
of .U; V /, and then finally integrates v out to obtain the density of just U . Fortu-
nately, this agenda does work out, because the transformation from .X; Y / to .U; V /
can be treated as just a change of variable in manipulation with double integrals, and
calculus tells us how to find double integrals by making suitable changes of vari-
ables (i.e., substitutions). Indeed, the method works out for any number of jointly
distributed variables, X1 ; X2 ; : : : ; Xn , and a function U D g.X1 ; X2 ; : : : ; Xn /, and
the reason it works out is that the whole method is just a change of variables in
manipulating a multivariate integral.
Here is the theorem on density of a multivariate transformation, a major theo-
rem in multivariate distribution theory. It is really nothing but the change of variable
theorem of multivariate calculus. After all, probabilities in the continuous case are
integrals, and an integral can be evaluated by changing variables to a new set of coor-
dinates. If we do that, then we have to put in the Jacobian term coming from making
the change of variable. Translated into densities, the theorem is the following.
Theorem 4.3 (Multivariate Jacobian Formula). Let \(X = (X_1, X_2, \ldots, X_n)\) have the joint density function \(f(x_1, x_2, \ldots, x_n)\), such that there is an open set \(S \subseteq \mathbb{R}^n\) with \(P(X \in S) = 1\). Suppose \(u_i = g_i(x_1, x_2, \ldots, x_n),\; 1 \le i \le n\), are n real-valued functions of \(x_1, x_2, \ldots, x_n\) such that

(a) \((x_1, x_2, \ldots, x_n) \to (g_1(x_1, x_2, \ldots, x_n), \ldots, g_n(x_1, x_2, \ldots, x_n))\) is a one-to-one function of \((x_1, x_2, \ldots, x_n)\) on S with range space T.
(b) The inverse functions \(x_i = h_i(u_1, u_2, \ldots, u_n),\; 1 \le i \le n\), are continuously differentiable on T with respect to each \(u_j\).
(c) The Jacobian determinant \(J = \det\!\left(\frac{\partial x_i}{\partial u_j}\right)\) is nonzero.

Then the joint density of \((U_1, U_2, \ldots, U_n)\) is given by
\[
f_{U_1, \ldots, U_n}(u_1, \ldots, u_n) = f(h_1(u_1, \ldots, u_n), \ldots, h_n(u_1, \ldots, u_n))\, |J|, \quad (u_1, \ldots, u_n) \in T.
\]
4.4 Applications of Jacobian Formula

Example 4.12. For independent standard exponentials X, Y, let \(U = \frac{X}{X + Y},\; V = X + Y\). The inverse functions are \(x = uv,\; y = v(1 - u)\), so that
\[
\frac{\partial x}{\partial u} = v; \quad \frac{\partial x}{\partial v} = u; \quad
\frac{\partial y}{\partial u} = -v; \quad \frac{\partial y}{\partial v} = 1 - u,
\]
and the Jacobian determinant is \(J = v(1 - u) + uv = v\),
with the joint density of (U, V) factorizing into the product of the marginal densities, and the rectangle being \((0, 1) \times (0, \infty)\). Therefore, we have proved that if X, Y are independent standard exponentials, then \(\frac{X}{X + Y}\) and \(X + Y\) are independent, and they are, respectively, uniform and a Gamma. Of course, we already knew that \(X + Y \sim G(2, 1)\) from our mgf proof in Chapter 1. In Chapter 18 we show that this result can also be proved by using Basu's theorem.
Example 4.13 (A Relation Between Gamma and Beta). The previous example generalizes in a nice way. Let X, Y be independent variables, distributed respectively as \(G(\alpha, 1),\; G(\beta, 1)\). Let again \(U = \frac{X}{X + Y},\; V = X + Y\). Then, from our previous example, the Jacobian determinant is still J = v. Therefore, the joint density of U, V is
\[
f_{U,V}(u, v) = \frac{1}{\Gamma(\alpha)\Gamma(\beta)}\, e^{-v} (uv)^{\alpha - 1} (v(1 - u))^{\beta - 1}\, v
= \frac{1}{\Gamma(\alpha)\Gamma(\beta)}\, u^{\alpha - 1}(1 - u)^{\beta - 1}\, e^{-v} v^{\alpha + \beta - 1}, \quad 0 < u < 1,\; v > 0.
\]
Once again, we have factorized the joint density of U and V as the product of the marginal densities, with (U, V) varying in the rectangle \((0, 1) \times (0, \infty)\), the marginal densities being
\[
f_V(v) = \frac{e^{-v} v^{\alpha + \beta - 1}}{\Gamma(\alpha + \beta)}, \quad v > 0; \qquad
f_U(u) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, u^{\alpha - 1}(1 - u)^{\beta - 1}, \quad 0 < u < 1.
\]
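A simulation check of this Gamma–Beta relation (my sketch; the parameter values are arbitrary):

```python
import numpy as np

# For independent X ~ Gamma(a, 1), Y ~ Gamma(b, 1): U = X/(X+Y) is Beta(a, b),
# V = X+Y is Gamma(a+b, 1), and U, V are independent (here: nearly uncorrelated).
rng = np.random.default_rng(0)
a, b = 2.0, 5.0
x, y = rng.gamma(a, size=300_000), rng.gamma(b, size=300_000)
u, v = x / (x + y), x + y
print(round(np.mean(u), 3))                 # Beta mean a/(a+b) = 2/7
print(round(np.mean(v), 3))                 # Gamma mean a+b = 7
print(round(np.corrcoef(u, v)[0, 1], 3))    # ~ 0
```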
4.5 Polar Coordinates in Two Dimensions

Consider the polar transformation
\[
x = r\cos\theta, \quad y = r\sin\theta,
\]
for which
\[
\frac{\partial x}{\partial r} = \cos\theta, \quad \frac{\partial x}{\partial \theta} = -r\sin\theta, \quad
\frac{\partial y}{\partial r} = \sin\theta, \quad \frac{\partial y}{\partial \theta} = r\cos\theta,
\]
so that the Jacobian determinant equals r, with \((r, \theta)\) belonging to a suitable rectangle, which depends on the exact set of values (x, y) on which the original joint density f(x, y) is strictly positive. But, in any case, we have established that the joint density of \((r, \theta)\) factorizes into the product form on a rectangle, and so in any spherically symmetric situation, the polar coordinates r and \(\theta\) are independent, a very convenient fact. Always, in a spherically symmetric case, r will have the density \(c\, r\, g(r)\) on some interval and for some suitable normalizing constant c, and \(\theta\) will have a uniform density on some interval.
Now consider three specific choices of the original density function. First consider the uniform case:
\[
f(x, y) = \frac{1}{\pi}, \quad 0 < x^2 + y^2 < 1.
\]
Then \(g(r) = \frac{1}{\pi},\; 0 < r < 1\). So in this case, r has the density \(2r,\; 0 < r < 1\), and \(\theta\) has the uniform density \(\frac{1}{2\pi},\; -\pi < \theta < \pi\).
Next consider the case of two independent standard normals. Indeed, in this case,
the joint density is spherically symmetric, namely,
\[
f(x, y) = \frac{1}{2\pi}\, e^{-(x^2 + y^2)/2}, \quad -\infty < x, y < \infty.
\]
Thus, \(g(r) = \frac{1}{2\pi} e^{-r^2/2},\; r > 0\). Therefore, in this case r has the Weibull density \(r\, e^{-r^2/2},\; r > 0\), and \(\theta\), again, is uniform on \((-\pi, \pi)\).

Finally, consider the case of two independent folded standard normals, that is, each of X, Y has the density \(\sqrt{\frac{2}{\pi}}\, e^{-x^2/2},\; x > 0\). In this case, r varies on \((0, \infty)\), but \(\theta\) varies on \((0, \frac{\pi}{2})\). Thus, r and \(\theta\) are still independent, but this time \(\theta\) is uniform on \((0, \frac{\pi}{2})\), whereas r still has the same Weibull density \(r\, e^{-r^2/2},\; r > 0\).
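The independence of r and \(\theta\) for two standard normals is easy to see in a simulation (a sketch, not from the text):

```python
import numpy as np

# For independent standard normals X, Y: r = sqrt(X^2 + Y^2) has density
# r e^{-r^2/2} (so r^2 is exponential with mean 2), theta is uniform on (-pi, pi],
# and r and theta are independent.
rng = np.random.default_rng(0)
x, y = rng.standard_normal(400_000), rng.standard_normal(400_000)
r, theta = np.hypot(x, y), np.arctan2(y, x)
print(round(np.mean(r**2), 2))                # E r^2 = 2
print(round(np.mean(theta), 2))               # uniform on (-pi, pi]: mean 0
print(round(np.corrcoef(r, theta)[0, 1], 3))  # ~ 0
```

This factorization is also the idea behind the Box–Muller method for generating normal variables.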
Example 4.15 (Usefulness of the Polar Transformation). Suppose .X; Y / are jointly
uniform in the unit circle. We use the joint density of .r; / to find the answers to a
number of questions.
First, by using the polar transformation, the required probabilities and expectations reduce to simple one-dimensional integrals in r and \(\theta\). It would have been a much more tedious calculation to do this using the original rectangular coordinates.
Example 4.16 (Product of n Uniforms). Let \(X_1, X_2, \ldots, X_n\) be independent U[0, 1] variables, and suppose we want to find the density of the product \(U = U_n = \prod_{i=1}^n X_i\). According to our general discussion, we have to choose n − 1 other functions, and then apply the Jacobian theorem. Define
\[
u_1 = x_1, \quad u_2 = x_1 x_2, \quad u_3 = x_1 x_2 x_3, \quad \ldots, \quad u_n = x_1 x_2 \cdots x_n.
\]
The Jacobian theorem gives the joint density
\[
f_{U_1, U_2, \ldots, U_n}(u_1, u_2, \ldots, u_n) = \frac{1}{\prod_{i=1}^{n-1} u_i}, \quad 0 < u_n < u_{n-1} < \cdots < u_1 < 1,
\]
and integrating out \(u_1, \ldots, u_{n-1}\) yields the density of the product,
\[
f_{U_n}(u) = \frac{(-\log u)^{n-1}}{(n-1)!}, \quad 0 < u < 1.
\]
This example illustrates that applying the Jacobian theorem needs careful manipulation with multiple integrals, and skills in using the Jacobian technique are very important in deriving distributions of functions of many variables.
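A shortcut check of this example (my sketch): \(-\log U_n = \sum_i(-\log X_i)\) is a sum of n independent standard exponentials, i.e. Gamma(n, 1), which is exactly what the density of the product above says.

```python
import numpy as np

# Product of n independent U[0,1] variables; -log of the product is Gamma(n, 1).
rng = np.random.default_rng(0)
n = 5
u = np.prod(rng.uniform(size=(300_000, n)), axis=1)
print(round(np.mean(u), 4))             # E(prod) = (1/2)^n = 0.03125
print(round(np.mean(-np.log(u)), 3))    # Gamma(n, 1) mean = n = 5
```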
4.6 n-Dimensional Polar and Helmert's Transformation

The n-dimensional polar transformation is defined by
\[
x_1 = \rho\cos\theta_1, \quad x_2 = \rho\sin\theta_1\cos\theta_2, \quad \ldots, \quad
x_n = \rho\sin\theta_1\sin\theta_2\cdots\sin\theta_{n-1}.
\]
The transformation has the useful property that \(x_1^2 + x_2^2 + \cdots + x_n^2 = \rho^2\) for every \((x_1, x_2, \ldots, x_n) \in \mathbb{R}^n\), that is, \(\rho\) is the length of the vector \(x = (x_1, x_2, \ldots, x_n)\). The Jacobian determinant of the transformation equals
\[
\rho^{n-1} \sin^{n-2}\theta_1\, \sin^{n-3}\theta_2 \cdots \sin\theta_{n-2},
\]
a function of \(\rho\) and the angles alone.
In particular, if \(X_1, X_2, \ldots, X_n\) have a spherically symmetric joint density
\[
f(x_1, x_2, \ldots, x_n) = g\!\left(\sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}\right)
\]
for some function g, then the joint density of \((\rho, \theta_1, \theta_2, \ldots, \theta_{n-1})\) equals \(g(\rho)\) multiplied by the Jacobian determinant above, which factorizes into a function of \(\rho\) and a function of the angles; in particular, \(\rho\) is independent of \((\theta_1, \ldots, \theta_{n-1})\).
For example, suppose \(X_1, \ldots, X_n\) are independent standard normals, so that
\[
f(x_1, x_2, \ldots, x_n) = \frac{1}{(2\pi)^{n/2}}\, e^{-\frac{1}{2}\sum_{i=1}^n x_i^2}, \quad -\infty < x_i < \infty,\; i = 1, 2, \ldots, n.
\]
This is spherically symmetric, with
\[
g(\rho) = \frac{1}{(2\pi)^{n/2}}\, e^{-\rho^2/2}.
\]
Therefore, from our general theorem above, \(\rho = \sqrt{\sum_{i=1}^n X_i^2}\) has the density \(c\,\rho^{n-1} e^{-\rho^2/2}\) for some normalizing constant c. Making the transformation \(W = \rho^2\), we get from the general formula for the density of a monotone transformation in one dimension (see Chapter 1) that W has the density
\[
f_W(w) = \frac{e^{-w/2}\, w^{n/2 - 1}}{2^{n/2}\, \Gamma(\frac{n}{2})}, \quad w > 0,
\]
the chi-square density with n degrees of freedom.

Next, suppose a point is picked uniformly from the n-dimensional unit sphere, so that
\[
g(\rho) = \frac{\Gamma(\frac{n}{2} + 1)}{\pi^{n/2}}, \quad 0 < \rho < 1.
\]
Hence, by our general theorem above, \(\rho\) has the density \(c\,\rho^{n-1}\) for some normalizing constant c. The normalizing constant is easily evaluated:
\[
1 = \int_0^1 c\,\rho^{n-1}\, d\rho = \frac{c}{n} \;\Rightarrow\; c = n.
\]
Thus, the length of an n-dimensional vector picked at random from the unit sphere has the density \(n\rho^{n-1},\; 0 < \rho < 1\). As a consequence, the expected length of an n-dimensional vector picked at random from the unit sphere equals
\[
E(\rho) = \int_0^1 \rho \cdot n\rho^{n-1}\, d\rho = \frac{n}{n+1},
\]
which is very close to one for large n. So, one can expect that a point chosen at
random from a high-dimensional sphere would be very close to the boundary, rather
than the center. Once again, we see the curse of dimensionality in action.
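The \(\frac{n}{n+1}\) result, and the concentration near the boundary, can be checked by sampling uniformly from the unit ball (a sketch; the direction-times-radius sampler is one standard construction):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ball(n, size):
    # Uniform in the n-dimensional unit ball: uniform direction on the sphere,
    # radius with density n rho^{n-1}, i.e. rho = U^{1/n}.
    z = rng.standard_normal((size, n))
    direc = z / np.linalg.norm(z, axis=1, keepdims=True)
    r = rng.uniform(size=(size, 1)) ** (1 / n)
    return direc * r

for n in (3, 10, 50):
    pts = sample_ball(n, 100_000)
    print(n, round(np.mean(np.linalg.norm(pts, axis=1)), 3), round(n / (n + 1), 3))
```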
Transformation to polar coordinates also results in some striking formulas and
properties for general spherically symmetric distributions. They are collected to-
gether in the following theorem. We do not prove this theorem in the text, as each
part only requires a transformation to the n-dimensional polar coordinates, and then
straightforward calculations.
in particular, if n = 3, then each \(|X_i| \sim U[0, 1]\), and each \(X_i \sim U[-1, 1]\).
Simpler proofs for the iid normal case are now available; nevertheless, the utility of Helmert's transformation in spherically symmetric situations makes it an important tool. We first need to recall a few definitions and facts from linear algebra.
Verbally, in the first row of the Helmert matrix, every element is \(\frac{1}{\sqrt{n}}\), and in the subsequent rows, say the ith row, every element after the diagonal element in that row is zero.
Two important properties of the Helmert transformation are the following.
\[
\sum_{i=1}^n Y_i^2 = \sum_{i=1}^n X_i^2; \qquad
\sum_{i=2}^n Y_i^2 = \sum_{i=1}^n (X_i - \bar{X})^2,
\]
where \(\bar{X} = \frac{\sum_{i=1}^n X_i}{n}\).

Proof. P is an orthogonal matrix, thus \(\sum_{i=1}^n Y_i^2 = \sum_{i=1}^n X_i^2\). Also,
\[
\sum_{i=2}^n Y_i^2 = \sum_{i=1}^n Y_i^2 - Y_1^2 = \sum_{i=1}^n X_i^2 - n\bar{X}^2,
\]
by definition of \(Y_1\), because the first row of P has all entries equal to \(\frac{1}{\sqrt{n}}\).

These two properties lead to the following two important results.
The result in Theorem 4.7 leads to one of the mainstays of statistical methodology, namely the t confidence interval for the mean of a normal distribution, when the variance \(\sigma^2\) is unknown. In Section 1.13, we described how to construct a confidence interval for \(\mu\) when we know the value of \(\sigma^2\). The interval derived there is \(\bar{X} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\), where \(z_{\alpha/2} = \Phi^{-1}(1 - \frac{\alpha}{2})\). Obviously, this interval cannot be used if we do not know the value of \(\sigma\). However, we can easily remedy this slight problem by simply using part (b) of Theorem 4.7, which says that if \(X_1, \ldots, X_n\) are iid \(N(\mu, \sigma^2)\), then
\[
\frac{\sqrt{n}(\bar{X} - \mu)}{s} \sim t(n - 1).
\]
This result implies, with \(t_{\alpha/2, n-1}\) denoting the \(1 - \frac{\alpha}{2}\) quantile of the t(n − 1) distribution,
\[
\begin{aligned}
& P\!\left(-t_{\alpha/2, n-1} \le \frac{\sqrt{n}(\bar{X} - \mu)}{s} \le t_{\alpha/2, n-1}\right) = 1 - \alpha\\
\Leftrightarrow\; & P\!\left(-t_{\alpha/2, n-1}\, \frac{s}{\sqrt{n}} \le \bar{X} - \mu \le t_{\alpha/2, n-1}\, \frac{s}{\sqrt{n}}\right) = 1 - \alpha\\
\Leftrightarrow\; & P\!\left(\bar{X} - t_{\alpha/2, n-1}\, \frac{s}{\sqrt{n}} \le \mu \le \bar{X} + t_{\alpha/2, n-1}\, \frac{s}{\sqrt{n}}\right) = 1 - \alpha.
\end{aligned}
\]
The interval \(\bar{X} \pm t_{\alpha/2, n-1}\, \frac{s}{\sqrt{n}}\) is called the 100(1 − α)% t confidence interval. It is based on the assumption that \(X_1, \ldots, X_n\) are iid \(N(\mu, \sigma^2)\) for some \(\mu\) and some \(\sigma^2\). In practice, it is often used for very nonnormal or even correlated data. This is unjustified and in fact wrong.
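The t interval is a one-liner to compute; the sketch below uses made-up data (the eight values are illustrative, not from the text):

```python
import numpy as np
from scipy import stats

def t_interval(x, alpha=0.05):
    # Xbar +/- t_{alpha/2, n-1} * s / sqrt(n), with s the sample SD (divide by n-1)
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar, s = x.mean(), x.std(ddof=1)
    t = stats.t.ppf(1 - alpha / 2, df=n - 1)
    half = t * s / np.sqrt(n)
    return xbar - half, xbar + half

x = [5.1, 4.9, 6.2, 5.5, 5.0, 5.8, 4.7, 5.3]
lo, hi = t_interval(x)
print(round(lo, 2), round(hi, 2))
```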
4.7 The Dirichlet Distribution

The Jacobian density formula, when suitably applied to a set of independent Gamma random variables, results in a hugely useful and important density for random variables in a simplex. In the plane, the standard simplex is the triangle with vertices at (0, 0), (0, 1), and (1, 0). In general n dimensions, the standard simplex is the set of all n-dimensional vectors \(x = (x_1, \ldots, x_n)\) such that each \(x_i \ge 0\) and \(\sum_{i=1}^n x_i \le 1\). If we define an additional coordinate \(x_{n+1}\) as \(x_{n+1} = 1 - \sum_{i=1}^n x_i\), then \((x_1, \ldots, x_{n+1})\) forms a vector of proportions adding to one. Thus, the Dirichlet distribution can be used in any situation where an entity has to necessarily fall into one of n + 1 mutually exclusive subclasses, and we want to study the proportion of individuals belonging to the different subclasses. Indeed, when statisticians want
to model an ensemble of fractional variables adding to one, they often first look
at the Dirichlet distribution as their model. See Aitchison (1986). Dirichlet distri-
butions are also immensely important in Bayesian statistics. Fundamental work on
the use of Dirichlet distributions in Bayesian modeling and on calculations using
the Dirichlet distribution has been done in Ferguson (1973), Blackwell (1973), and
Basu and Tiwari (1982).
Let $X_1, X_2, \ldots, X_{n+1}$ be independent Gamma random variables, with $X_i \sim G(\alpha_i, 1)$. Define
$$p_i = \frac{X_i}{\sum_{j=1}^{n+1} X_j}, \quad 1 \leq i \leq n,$$
and denote $p_{n+1} = 1 - \sum_{i=1}^n p_i$. Then, we have the following theorem.

Theorem 4.8. $p = (p_1, p_2, \ldots, p_n)$ has the joint density
$$f(p_1, p_2, \ldots, p_n) = \frac{\Gamma\left(\sum_{i=1}^{n+1} \alpha_i\right)}{\prod_{i=1}^{n+1} \Gamma(\alpha_i)} \prod_{i=1}^{n+1} p_i^{\alpha_i - 1}.$$
4.7 The Dirichlet Distribution 189
Proof. This is proved by using the Jacobian density theorem. The transformation
$$(x_1, x_2, \ldots, x_{n+1}) \rightarrow \left(p_1, p_2, \ldots, p_n, \sum_{j=1}^{n+1} x_j\right)$$
is a one-to-one transformation with the Jacobian determinant $J = \left(\sum_{j=1}^{n+1} x_j\right)^n$. Inasmuch as $X_1, X_2, \ldots, X_{n+1}$ are independent Gamma random variables, applying the Jacobian density theorem, we get the joint density of $(p_1, p_2, \ldots, p_n, s)$, where $s = \sum_{j=1}^{n+1} X_j$, as
$$f_{p_1, \ldots, p_n, s}(p_1, \ldots, p_n, s) = \frac{1}{\prod_{i=1}^{n+1} \Gamma(\alpha_i)}\, e^{-s}\, s^{\sum_{i=1}^{n+1} \alpha_i - 1} \prod_{i=1}^{n+1} p_i^{\alpha_i - 1}.$$
If we now integrate $s$ out (on $(0, \infty)$), we get the joint density of $p_1, p_2, \ldots, p_n$, as stated in the theorem. □
Definition 4.7 (Dirichlet Density). An $n$-dimensional vector $p = (p_1, p_2, \ldots, p_n)$ is said to have the Dirichlet distribution with parameter vector $\alpha = (\alpha_1, \ldots, \alpha_{n+1})$, $\alpha_i > 0$, if it has the joint density
$$f(p_1, p_2, \ldots, p_n) = \frac{\Gamma\left(\sum_{i=1}^{n+1} \alpha_i\right)}{\prod_{i=1}^{n+1} \Gamma(\alpha_i)} \prod_{i=1}^{n+1} p_i^{\alpha_i - 1}, \quad p_i \geq 0,\ \sum_{i=1}^n p_i \leq 1.$$
We write $p \sim \mathcal{D}_n(\alpha)$.
Remark. When $n = 1$, the Dirichlet density reduces to a Beta density with parameters $\alpha_1, \alpha_2$. Simple integrations give the following moment formulas.

Proposition. Let $p \sim \mathcal{D}_n(\alpha)$. Then,
$$E(p_i) = \frac{\alpha_i}{t}, \quad \mathrm{Var}(p_i) = \frac{\alpha_i(t - \alpha_i)}{t^2(t+1)}, \quad \mathrm{Cov}(p_i, p_j) = -\frac{\alpha_i \alpha_j}{t^2(t+1)}, \quad i \neq j,$$
where $t = \sum_{i=1}^{n+1} \alpha_i$.

Thus, notice that the covariances (and hence the correlations) are always negative.
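The Gamma construction of Theorem 4.8 and the moment formulas above are easy to check by simulation. The following stdlib-Python sketch (the parameter values are an illustrative choice, not from the text) builds Dirichlet vectors from independent Gammas and verifies $E(p_i) = \alpha_i/t$ and the negativity of $\mathrm{Cov}(p_1, p_2)$.

```python
import random

# Build Dirichlet vectors by normalizing independent Gamma(alpha_i, 1) variables,
# then compare empirical first moments with alpha_i / t and check Cov(p1, p2) < 0.
random.seed(2)
alpha = [2.0, 3.0, 5.0]   # illustrative (alpha_1, alpha_2, alpha_3); here t = 10
t = sum(alpha)
reps = 50000

p1s, p2s = [], []
for _ in range(reps):
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    p1s.append(g[0] / s)
    p2s.append(g[1] / s)

m1 = sum(p1s) / reps                                # theory: 2/10 = .2
m2 = sum(p2s) / reps                                # theory: 3/10 = .3
cov12 = sum(p * q for p, q in zip(p1s, p2s)) / reps - m1 * m2
print(round(m1, 3), round(m2, 3), cov12 < 0)
```

The theoretical covariance here is $-\alpha_1\alpha_2/(t^2(t+1)) = -6/1100$, so the empirical covariance should come out small and negative.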
A convenient fact about the Dirichlet density is that lower-dimensional marginals
are also Dirichlet distributions. So are the conditional distributions of suitably
renormalized subvectors given the rest.
Theorem 4.9 (Marginal and Conditional Distributions).
(a) Let $p \sim \mathcal{D}_n(\alpha)$. Fix $m < n$, and let $p_m = (p_1, \ldots, p_m)$, and $\alpha_m = (\alpha_1, \ldots, \alpha_m,\, t - \sum_{i=1}^m \alpha_i)$. Then $p_m \sim \mathcal{D}_m(\alpha_m)$. In particular, each $p_i \sim \mathrm{Be}(\alpha_i,\, t - \alpha_i)$.
(b) Let $p \sim \mathcal{D}_n(\alpha)$. Fix $m < n$, and let $q_i = \frac{p_i}{1 - \sum_{j=1}^m p_j}$, $i = m+1, \ldots, n$. Let $\beta_m = (\alpha_{m+1}, \ldots, \alpha_{n+1})$. Then,
$$(q_{m+1}, \ldots, q_n) \mid (p_1, \ldots, p_m) \sim \mathcal{D}_{n-m}(\beta_m).$$
These two results follow in a straightforward manner from the definition of conditional densities and the functional form of a Dirichlet density.
It also follows from the representation of a Dirichlet random vector in terms of independent Gamma variables that sums of a subset of the coordinates must have Beta distributions.

Theorem 4.10 (Sums of Subvectors).
(a) Let $p \sim \mathcal{D}_n(\alpha)$. Fix $m < n$, and let $S_m = S_{m,n} = \sum_{i=1}^m p_i$. Then $S_m \sim \mathrm{Be}\left(\sum_{i=1}^m \alpha_i,\, t - \sum_{i=1}^m \alpha_i\right)$.
(b) More generally, suppose the entire Dirichlet vector $p$ is partitioned into $k$ subvectors,
$$(p_1, \ldots, p_{m_1});\ (p_{m_1+1}, \ldots, p_{m_2});\ \ldots;\ (p_{m_{k-1}+1}, \ldots, p_n).$$
Actually, part (b) of this last theorem brings out a very interesting connection between the normal distribution and the problem of picking a point at random from the boundary of a high-dimensional sphere. If $U_n = (U_{n1}, U_{n2}, \ldots, U_{nn})$ is uniformly distributed on the boundary of the $n$-dimensional unit sphere, and if we take $k < n$, hold $k$ fixed, and let $n \to \infty$, then part (b) leads to the very useful fact that the joint distribution of $(U_{n1}, U_{n2}, \ldots, U_{nk})$ is approximately the same as the joint distribution of $k$ independent $N(0, \frac{1}{n})$ variables. That is, if a point is picked at random from the surface of a high-dimensional sphere, and we then look at a low-dimensional projection of that point, the projection acts as a set of nearly independent normal variables with zero means and variances $\frac{1}{n}$. This is known as Poincaré's lemma. Compare this with Theorem 15.5, where the exact density of a lower-dimensional projection is worked out; Poincaré's lemma can be derived from there. Formally,
$$P\left(\sqrt{n}\, U_{n1} \leq x_1,\ \sqrt{n}\, U_{n2} \leq x_2,\ \ldots,\ \sqrt{n}\, U_{nk} \leq x_k\right) \to \prod_{i=1}^k \Phi(x_i),$$
$$\forall\, (x_1, x_2, \ldots, x_k) \in \mathbb{R}^k.$$
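Poincaré's lemma is easy to see numerically. The sketch below (stdlib Python; the dimension and replication counts are illustrative choices) generates uniform points on the sphere's surface by normalizing a vector of iid standard normals, a standard construction, and checks that $\sqrt{n}\,U_{n1}$ behaves like a standard normal.

```python
import random, math

# A uniform point on the surface of the n-sphere can be generated by normalizing
# an n-vector of iid N(0,1)'s; by Poincare's lemma, sqrt(n) * (first coordinate)
# should then be approximately N(0, 1) when n is large.
random.seed(3)
n, reps = 200, 10000

vals = []
for _ in range(reps):
    z = [random.gauss(0, 1) for _ in range(n)]
    norm = math.sqrt(sum(zi * zi for zi in z))
    vals.append(math.sqrt(n) * z[0] / norm)     # sqrt(n) * U_n1

mean = sum(vals) / reps
var = sum(v * v for v in vals) / reps - mean * mean
frac_below_0 = sum(v <= 0 for v in vals) / reps  # should be near Phi(0) = .5
print(round(mean, 2), round(var, 2), round(frac_below_0, 2))
```

The empirical mean should be near 0, the variance near 1, and half the mass should sit below zero, as the limit $\Phi$ predicts.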
Theorem 4.13.
(a) Volume of the $n$-dimensional unit sphere $= \dfrac{\pi^{n/2}}{\Gamma(\frac{n}{2}+1)}$.
(b) Surface area of the $n$-dimensional unit sphere $= \dfrac{n\,\pi^{n/2}}{\Gamma(\frac{n}{2}+1)}$.
(c) Volume of the $n$-dimensional simplex $= \displaystyle\int_{x_i \geq 0,\, x_1+\cdots+x_n \leq 1} dx_1 \cdots dx_n = \dfrac{1}{n!}$.
(d) $$\int_{x_i \geq 0,\ \sum_{i=1}^n (x_i/c_i)^{\alpha_i} \leq 1} f\left(\sum_{i=1}^n \left(\frac{x_i}{c_i}\right)^{\alpha_i}\right) \prod_{i=1}^n x_i^{p_i - 1}\, dx_1 \cdots dx_n$$
$$= \frac{\prod_{i=1}^n c_i^{p_i}\, \prod_{i=1}^n \Gamma\left(\frac{p_i}{\alpha_i}\right)}{\prod_{i=1}^n \alpha_i\, \Gamma\left(\sum_{i=1}^n \frac{p_i}{\alpha_i}\right)} \int_0^1 f(t)\, t^{\sum_{i=1}^n \frac{p_i}{\alpha_i} - 1}\, dt, \quad (c_i, \alpha_i, p_i > 0).$$
(e) $$\int_{x_i \geq 0,\ \sum_{i=1}^n x_i^{\alpha_i} \leq 1} \frac{\prod_{i=1}^n x_i^{p_i - 1}}{\sum_{i=1}^n x_i^{\alpha_i}}\, dx_1 \cdots dx_n = \frac{\prod_{i=1}^n \Gamma\left(\frac{p_i}{\alpha_i}\right)}{\prod_{i=1}^n \alpha_i \left(\sum_{i=1}^n \frac{p_i}{\alpha_i} - 1\right) \Gamma\left(\sum_{i=1}^n \frac{p_i}{\alpha_i}\right)}, \quad \sum_{i=1}^n \frac{p_i}{\alpha_i} > 1.$$
(f) $$\int_{\sum_{i=1}^n x_i^2 \leq 1} \left(\sum_{i=1}^n p_i x_i\right)^{2m+1} dx_1 \cdots dx_n = 0 \quad \forall\, p_1, \ldots, p_n,\ \forall\, m \geq 1.$$
(g) $$\int_{\sum_{i=1}^n x_i^2 \leq 1} \left(\sum_{i=1}^n p_i x_i\right)^{2m} dx_1 \cdots dx_n = \frac{(2m-1)!}{2^{2m-1}(m-1)!}\, \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2}+m+1\right)} \left(\sum_{i=1}^n p_i^2\right)^m \quad \forall\, p_1, \ldots, p_n,\ \forall\, m \geq 1.$$
(h) $$\int_{\sum_{i=1}^{2n} x_i^2 \leq 1} e^{\sum_{i=1}^{2n} c_i x_i}\, dx_1 \cdots dx_{2n} = \frac{(2\pi)^n\, I_n\left(\sqrt{\sum_{i=1}^{2n} c_i^2}\right)}{\left(\sum_{i=1}^{2n} c_i^2\right)^{n/2}},$$
where $I_n$ denotes the modified Bessel function.
Exercises

Exercise 4.1. Suppose $X \sim U[0,1]$, and $Y$ has the density $2y$, $0 < y < 1$, and that $X, Y$ are independent. Find the density of $XY$ and of $\frac{X}{Y}$.
Exercise 4.2. Suppose $X \sim U[0,1]$, and $Y$ has the density $2y$, $0 < y < 1$, and that $X, Y$ are independent. Find the density of $X + Y$, $X - Y$, and $|X - Y|$.

Exercise 4.3. Suppose $(X, Y)$ have the joint pdf $f(x,y) = c(x+y)e^{-x-y}$, $x, y > 0$.
(a) Are $X, Y$ independent?
(b) Find the normalizing constant $c$.
(c) Find the density of $X + Y$.

Exercise 4.4. Suppose $X, Y$ have the joint density $cxy$, $0 < x < y < 1$.
(a) Are $X, Y$ independent?
(b) Find the normalizing constant $c$.
(c) Find the density of $XY$.
Exercise 4.5 * (A Conditioning Argument). Suppose a fair coin is tossed twice and the number of heads obtained is $N$. Let $X, Y$ be independent $U[0,1]$ variables, independent of $N$. Find the density of $NXY$.

Exercise 4.6. Suppose $X \sim U[0,a]$, $Y \sim U[0,b]$, $Z \sim U[0,c]$, $0 < a < b < c$, and that $X, Y, Z$ are independent. Let $m = \min\{X, Y, Z\}$. Find expressions for $P(m = X)$, $P(m = Y)$, $P(m = Z)$.
Exercise 4.7. Suppose $X, Y$ are independent standard exponential random variables. Find the density of $X - Y$, and of $\frac{XY}{(X+Y)^2}$.
Hint: Use $\frac{Y}{X+Y} = 1 - \frac{X}{X+Y}$, and see the examples in the text.

Exercise 4.8 * (Uniform in a Circle). Suppose $(X, Y)$ are jointly uniform in the unit circle. By transforming to polar coordinates, find the expectations of $\frac{2XY}{X^2+Y^2}$, and of $\frac{XY}{\sqrt{X^2+Y^2}}$.

(a) Find the density of $X^2 + Y^2$, and $P(X^2 + Y^2 \leq 1)$.
(b) Show that $E\left(\sqrt{X^2+Y^2}\right) \approx .765$.
Exercise 4.13. Suppose $X, Y$ are independent standard normal variables. Find the values of $P\left(\frac{X}{Y} < 1\right)$ and of $P(X < Y)$. Why are they not the same?
Exercise 4.19. Prove that the square of a t random variable has an F -distribution.
Exercise 4.28 * (Bivariate Cauchy). Suppose $(X, Y)$ has the joint pdf $f(x,y) = \frac{c}{(1+x^2+y^2)^{3/2}}$, $-\infty < x, y < \infty$.
(a) Find the normalizing constant $c$.
(b) Are $X, Y$ independent?
(c) Find the densities of the polar coordinates $r, \theta$.
(d) Find $P(X^2 + Y^2 \leq 1)$.
Exercise 4.31 (Correlation). Suppose $X, Y$ are jointly uniform in the unit circle. Find the correlation between $XY$ and $X^2 + Y^2$.
Exercise 4.36 * (Integer Part and Fractional Part). Let X be a standard expo-
nential variable. Find the joint distribution of the integer part and the fractional part
of X . Note that they do not have a joint density.
Exercise 4.37 * (Factorization of Chi Square). Suppose $X$ has a chi square distribution with one degree of freedom. Find nonconstant independent random variables $Y, Z$ such that $YZ$ has the same distribution as $X$.
Hint: Look at the text.

has an $F$ distribution.
$$E(p_i^r p_j^s) = \frac{(\alpha_i + r - 1)(\alpha_j + s - 1)}{t^2(t+1)},$$
where $t = \sum_{i=1}^{n+1} \alpha_i$.
References
Aitchison, J. (1986). Statistical Analysis of Compositional Data, Chapman and Hall, New York.
Basu, D. and Tiwari, R. (1982). A Note on Dirichlet Processes, in Statistics and Probability, Essays in Honor of C. R. Rao, 89–103, J. K. Ghosh and G. Kallianpur, Eds., North-Holland, Amsterdam.
Blackwell, D. (1973). Discreteness of Ferguson selections, Ann. Stat., 1, 2, 356–358.
Ferguson, T. (1973). A Bayesian analysis of some nonparametric problems, Ann. Stat., 1, 209–230.
Hinkley, D. (1969). On the ratio of two correlated normal random variables, Biometrika, 56, 3,
635–639.
Chapter 5
Multivariate Normal and Related Distributions
$$f(x_1, x_2, \ldots, x_n) = \frac{1}{(2\pi)^{n/2}\, |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)},$$
where $x = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$.
It follows from the linearity of expectations, and the linearity of covariance, that $E(X_i) = \mu_i$, and $\mathrm{Cov}(X_i, X_j) = \sigma_{ij}$, the $(i,j)$th element of $\Sigma$. The vector of expectations is usually called the mean vector, and the matrix of pairwise covariances is called the covariance matrix. Thus, we have the following facts about the physical meanings of $\mu, \Sigma$:
Example 5.1. Figure 5.1, a simulation of 1000 values from a bivariate normal distribution, shows the elliptical shape of the point cloud, as would be expected from the fact that the exponent in the density function is a quadratic form in the variables. It is also seen in the plot that the center of the point cloud is quite close to the true means of the variables, namely $\mu_1 = 4.5$, $\mu_2 = 4$, which were used for the purpose of the simulation.
An important property of a multivariate normal distribution is the property of
closure under linear transformations; that is, any number of linear functions of X
will also have a multivariate normal distribution. The precise closure property is as
follows.
Fig. 5.1 Simulation of a bivariate normal with means 4.5, 4; variances 1; correlation .75
Corollary (MGF of Multivariate Normal). Let $X \sim N_n(\mu, \Sigma)$. Then the mgf of $X$ exists at all points in $\mathbb{R}^n$ and is given by
$$\psi_{\mu,\Sigma}(t) = e^{t'\mu + \frac{t'\Sigma t}{2}}.$$
This follows on simply observing that the theorem above implies that $t'X \sim N(t'\mu,\, t'\Sigma t)$, and by then using the formula for the mgf of a univariate normal distribution. □
and
$$A\Sigma A' = \begin{pmatrix} 3 & 2 \\ 2 & 12 \end{pmatrix}.$$
In particular, $X_1 - X_2 \sim N(1, 3)$, $X_1 + X_2 + X_3 \sim N(3, 12)$, and $\mathrm{Cov}(X_1 - X_2,\, X_1 + X_2 + X_3) = 2$. Therefore, the correlation between $X_1 - X_2$ and $X_1 + X_2 + X_3$ equals $\frac{2}{\sqrt{3}\sqrt{12}} = \frac{1}{3}$.
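The closure property stated above is easy to confirm by simulation. The sketch below (stdlib Python; the covariance structure and coefficient vector are hypothetical choices for illustration, not the text's example) builds a bivariate normal via a Cholesky-type map and checks that the linear function $c'X$ has variance $c'\Sigma c$.

```python
import random, math

# Build (X1, X2) ~ N2(0, Sigma) with Var(X1)=1, Var(X2)=4, Cov(X1,X2)=1
# from independent standard normals, then check Var(X1 - X2) = 1 + 4 - 2*1 = 3.
random.seed(5)
reps = 60000
vals = []
for _ in range(reps):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    x1 = z1
    x2 = 1.0 * z1 + math.sqrt(3.0) * z2   # Cov(x1, x2) = 1, Var(x2) = 1 + 3 = 4
    vals.append(x1 - x2)                  # c'X with c = (1, -1)

mean = sum(vals) / reps
var = sum(v * v for v in vals) / reps - mean * mean
print(round(mean, 2), round(var, 2))      # mean near 0, variance near 3
```

The same check works for any $c$: by the theorem, $c'X$ is always univariate normal with mean $c'\mu$ and variance $c'\Sigma c$.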
As in the bivariate normal case, zero correlation between a particular pair of vari-
ables implies that the particular pair must be independent. And, as in the bivariate
normal case, all lower-dimensional conditional distributions are also multivariate
normal.
Theorem 5.4. Suppose $X \sim N_n(\mu, \Sigma)$. Then, $X_i, X_j$ are independent if and only if $\sigma_{ij} = 0$. More generally, if $X$ is partitioned as $X = (X_1, X_2)$, where $X_1$ is $k$-dimensional, and $X_2$ is $n-k$ dimensional, and if $\Sigma$ is accordingly partitioned as
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$
then $X_1$ and $X_2$ are independent if and only if $\Sigma_{12}$ is the null matrix.

Proof. The first statement follows immediately because we have proved that all lower-dimensional marginals are also multivariate normal. Therefore, any pair $(X_i, X_j)$, $i \neq j$, is bivariate normal. Now use the bivariate normal property that a zero covariance implies independence.
5.2 Conditional Distributions 203
The second part uses the argument that if $\Sigma_{12}$ is the null matrix, then the joint density of $(X_1, X_2)$ factorizes in a product form, on some calculation, and therefore, $X_1$ and $X_2$ must be independent. Alternatively, the second part also follows from the next theorem given immediately below. □
This involves some tedious manipulations with matrices, and we omit the proof.
See Tong (1990, pp. 33–35) for the details. An important result that follows from
the conditional covariance matrix formula in the above theorem is the following; it
says that once you take out the effect of X2 on X1 , X2 and the residual will actually
become independent.
Corollary. Let $X = (X_1, X_2) \sim N_n(\mu, \Sigma)$. Then, $X_1 - E(X_1 \mid X_2)$ and $X_2$ are independent.
And the conditional covariance matrix is found from the formula in the above theorem as
$$\begin{pmatrix} \frac{7}{16} & \frac{1}{16} \\ \frac{1}{16} & \frac{7}{16} \end{pmatrix}.$$
$$\frac{99.5 - np}{\sqrt{np(1-p)}} = -1.28$$
$$\Rightarrow\ \sqrt{n} = \frac{1.28\sqrt{p(1-p)} + \sqrt{1.64\, p(1-p) + 398\, p}}{2p}$$
(on solving the quadratic equation in $\sqrt{n}$ from the line before). For instance, with $p = .25$, plugging into the last formula gives $\sqrt{n} = 21.1 \Rightarrow n \approx 445$.
Remark. So, for example, if $n = 2$, and the variables are continuous, then finite exchangeability simply means that the joint density $f(x_1, x_2)$ satisfies $f(x_1, x_2) = f(x_2, x_1)$ $\forall\, x_1, x_2$.

An exchangeable multivariate normal sequence has a simple description.

Proof. The theorem follows on observing that if all the variances are equal, and if every pair has the same correlation, then $(X_1, X_2, \ldots, X_n)$ and $(X_{\pi(1)}, X_{\pi(2)}, \ldots, X_{\pi(n)})$ have the same covariance matrix for any permutation $(\pi(1), \pi(2), \ldots, \pi(n))$ of $(1, 2, \ldots, n)$. Because they also have the same mean vector (obviously), and because any multivariate normal distribution is fully determined by its mean vector and its covariance matrix, it follows that $(X_1, X_2, \ldots, X_n)$ and $(X_{\pi(1)}, X_{\pi(2)}, \ldots, X_{\pi(n)})$ have the same multivariate normal distribution, and hence $\{X_1, X_2, \ldots, X_n\}$ is exchangeable. □
$$P(X_1 \leq x_1, \ldots, X_n \leq x_n) = \int_{-\infty}^{\infty} \phi(z) \prod_{i=1}^n \Phi\left(\frac{a_i + \sqrt{\rho}\, z}{\sqrt{1-\rho}}\right) dz.$$

Proof. We use the representation $X_i = \mu + \sigma\left(\sqrt{\rho}\, Z_0 + \sqrt{1-\rho}\, Z_i\right)$, where $Z_0, \ldots, Z_n$ are all independent standard normals. Now, use the iterated expectation technique as
$$P(X_1 \leq x_1, \ldots, X_n \leq x_n) = \int_{-\infty}^{\infty} \phi(z) \prod_{i=1}^n \Phi\left(\frac{a_i + \sqrt{\rho}\, z}{\sqrt{1-\rho}}\right) dz,$$
The probabilities decrease with n for fixed ; this, again, is the curse of dimension-
ality. On the other hand, for fixed n, the probabilities increase with ; this is because
the event under consideration says that the variables, in some sense, act similarly.
The probability of their doing so increases when they have a larger correlation.
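The one-dimensional integral representation above is convenient to compute. The following stdlib-Python sketch (the values $\mu = 0$, $\sigma = 1$, $\rho = .5$, $n = 5$, and a common threshold $a_i = a = 1$ are illustrative assumptions) evaluates the integral by a midpoint Riemann sum and compares it with a direct Monte Carlo using the representation $X_i = \sqrt{\rho}\,Z_0 + \sqrt{1-\rho}\,Z_i$.

```python
import random, math

def Phi(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rho, n, a = 0.5, 5, 1.0

# Midpoint Riemann sum for: integral of phi(z) * Phi((a + sqrt(rho) z)/sqrt(1-rho))^n dz
steps, lo, hi = 4000, -8.0, 8.0
h = (hi - lo) / steps
integral = sum(
    math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    * Phi((a + math.sqrt(rho) * z) / math.sqrt(1 - rho)) ** n * h
    for z in (lo + (i + 0.5) * h for i in range(steps))
)

# Monte Carlo using X_i = sqrt(rho) Z0 + sqrt(1-rho) Z_i with a shared Z0
random.seed(6)
reps = 40000
hits = 0
for _ in range(reps):
    z0 = random.gauss(0, 1)
    if all(math.sqrt(rho) * z0 + math.sqrt(1 - rho) * random.gauss(0, 1) <= a
           for _ in range(n)):
        hits += 1
print(round(integral, 3), round(hits / reps, 3))
```

The two numbers should agree to about two decimal places, illustrating how the shared factor $Z_0$ reduces the $n$-dimensional orthant probability to a single integral.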
Much as in the case of one dimension, structured results are available for functions
of a set of n-dimensional independent random vectors, each distributed as a mul-
tivariate normal. The applications of most of these results are in statistics. A few
major distributional results are collected together in this section.
First, we need some notation. Given $N$ independent random vectors $X_i$, $1 \leq i \leq N$, each $X_i \sim N_n(\mu, \Sigma)$, we define the sample mean vector and the sample covariance matrix as
$$\bar{X} = \frac{1}{N} \sum_{i=1}^N X_i, \qquad S = \frac{1}{N-1} \sum_{i=1}^N (X_i - \bar{X})(X_i - \bar{X})',$$
where the sum in the definition above is vector addition, and for a vector $u$, $uu'$ means a matrix product. Note that in one dimension (i.e., when $n = 1$), $\bar{X}$ is also distributed as a normal, and $\bar{X}, S$ are independently distributed. Moreover, in one dimension, $(N-1)S/\sigma^2$ has a $\chi^2_{N-1}$ distribution. Analogues of all of these results exist in the general multivariate case. This part of the multivariate normal theory is very classic.
We need another definition.
$$f(W) = c\, |W|^{(k-p-1)/2}\, e^{-\frac{1}{2}\,\mathrm{tr}(\Sigma^{-1}W)},$$
$$c = \frac{1}{2^{kp/2}\, |\Sigma|^{k/2}\, \pi^{p(p-1)/4}\, \prod_{i=1}^p \Gamma\left(\frac{k-i+1}{2}\right)}.$$
A series of elegant and highly useful expectation identities for the Wishart distribution was derived in Haff (1977, 1979a,b, 1981). The identities were derived by clever use of the divergence theorem of multidimensional calculus, and they drastically reduce the amount of algebraic effort involved in classic derivations of various Wishart expectations and moments. Although they can be viewed as results in multivariate probability, their main concrete applications are in multivariate statistics. A selection of these moment and expectation formulas is collected together in the result below.
where $m_{ij}$ are the elements of $M$. We mention that the identities are also valid for the case $p = 1$, which corresponds to the chi-square case.

Theorem 5.10 (Wishart Identities). Let $S \sim W_p(k, \Sigma)$. Suppose $f(S), g(S)$ are twice differentiable with respect to each $s_{ij}$. Let $Q$ be a nonrandom real matrix. Then,
(a) If $kp > 4$, $E[f(S)]\,\mathrm{tr}\,\Sigma^{-1} = (k-p-1)\,\mathrm{tr}\left(E[f(S)S^{-1}]\right) + 2\,\mathrm{tr}\left(E\left[\frac{\partial f}{\partial S}\right]\right)$;
(b) If $kp > 4$, $E[g(S)\,\mathrm{tr}(\Sigma^{-1}Q)] = (k-p-1)\,\mathrm{tr}\left(E[g(S)S^{-1}Q]\right) + 2\,\mathrm{tr}\left(E\left[\left(\frac{\partial g}{\partial S}\, Q\right)^{(1/2)}\right]\right)$;
(c) If $kp > 2$, $E[g(S)\,\mathrm{tr}(\Sigma^{-1}S)] = kp\, E[g(S)] + 2\,\mathrm{tr}\left(E\left[\left(\frac{\partial g}{\partial S}\, S\right)^{(1/2)}\right]\right)$;
(d) If $kp > 4$, $E[f(S)\,\mathrm{tr}(S^{-1}Q\Sigma^{-1})] = (k-p-2)\, E[f(S)\,\mathrm{tr}(S^{-2}Q)] - E[f(S)(\mathrm{tr}\,S^{-1})(\mathrm{tr}(S^{-1}Q))] + 2\,\mathrm{tr}\left(E\left[\left(\frac{\partial f}{\partial S}\, S^{-1}Q\right)^{(1/2)}\right]\right)$.
Example 5.8. In identity (b) above, choose $g(S) = 1$ and $Q = I_p$. Then, the identity gives $\mathrm{tr}(\Sigma^{-1}) = (k-p-1)E[\mathrm{tr}\,S^{-1}] \Rightarrow E[\mathrm{tr}\,S^{-1}] = \frac{1}{k-p-1}\,\mathrm{tr}\,\Sigma^{-1}$.
Next, in identity (a), choose $f(S) = \mathrm{tr}(S^{-1})$. Then, the identity gives
Note that we are able to obtain these expectations without doing the hard distributional calculations that accompany classic derivations of such Wishart expectations.
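The first conclusion of Example 5.8 can be checked by simulation. The sketch below (stdlib Python; the choice $p = 2$, $k = 10$, $\Sigma = I_2$ is an illustrative assumption, not from the text) builds $S$ as a sum of $k$ outer products $zz'$ with $z \sim N(0, I_2)$ and checks that the average of $\mathrm{tr}\,S^{-1}$ is near $\mathrm{tr}\,\Sigma^{-1}/(k-p-1) = 2/7$.

```python
import random

# Monte Carlo check of E[tr S^{-1}] = tr(Sigma^{-1})/(k - p - 1) for a Wishart
# matrix S ~ W_2(10, I_2); the right side here is 2/7 = 0.2857...
random.seed(7)
p, k, reps = 2, 10, 20000
total = 0.0
for _ in range(reps):
    s11 = s12 = s22 = 0.0
    for _ in range(k):
        z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
        s11 += z1 * z1
        s12 += z1 * z2
        s22 += z2 * z2
    det = s11 * s22 - s12 * s12
    total += (s11 + s22) / det      # tr(S^{-1}) for a 2x2 matrix
mc = total / reps
print(round(mc, 3))                 # near 2/7
```

This mirrors how the Haff identities are typically used: a simulation or an identity gives a Wishart expectation without any explicit density calculation.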
$$T^2 = N(\bar{X} - \mu)'\, S^{-1}\, (\bar{X} - \mu),$$
assuming that $N > n$, so that $S^{-1}$ exists with probability one. This is an extremely important statistic in multivariate statistical analysis. Its distribution was also worked out in Hotelling (1931).

Theorem 5.11. Let $X_i \sim N_n(\mu, \Sigma)$, $1 \leq i \leq N$, and suppose they are independent. Assume that $N > n$. Then
$$\frac{N-n}{n(N-1)}\, T^2 \sim F(n,\, N-n),$$
Theorem 5.13. Let $X \sim N_n(\mu, \Sigma)$, and let $B_{n \times n}$ be a symmetric matrix of rank $r$. Then $Q = (X-\mu)'B(X-\mu) \sim \chi^2_r$ if and only if $B\Sigma B = B$.
The following theorem is of great use in statistics, and especially so in the area
of linear statistical models.
Once again, see Tong (1990) for proofs of the last two theorems. Here is a pretty application of the Fisher–Cochran theorem.
$$\sum_{i=1}^n (X_i - \mu)^2 = \sum_{i=1}^n (X_i - \bar{X})^2 + n(\bar{X} - \mu)^2$$
$$\Rightarrow\ \sum_{i=1}^n Y_i^2 = \sum_{i=1}^n (Y_i - \bar{Y})^2 + n\bar{Y}^2,$$
and that moreover $\sum_{i=1}^n (Y_i - \bar{Y})^2$ and $\bar{Y}^2$ must be independent. On using the symmetry of the $Y_i$, it follows that $\sum_{i=1}^n (Y_i - \bar{Y})^2$ and $\bar{Y}$ also must be independent, which is the same as saying $\sum_{i=1}^n (X_i - \bar{X})^2$ and $\bar{X}$ must be independent.
2
(b) The linear functions c0 X and the quadratic form X0 BX are independent if and
only if Bc D 0.
212 5 Multivariate Normal and Related Distributions
Pn
nXN YN
i D1 Xi Yi
D qP P : (5.2)
n N 2 n .Yi YN /2
i D1 .Xi X / i D1
$$f_R(r) = \frac{n-2}{\sqrt{2}\,(n-1)\,B(1/2,\, n-1/2)}\,(1-\rho^2)^{(n-1)/2}\,(1-r^2)^{(n-4)/2}\,(1-\rho r)^{3/2-n}$$
$$\times\ {}_2F_1\left(1/2,\ 1/2;\ n-1/2;\ \frac{1+\rho r}{2}\right),$$
$$c = \frac{(n-1)^{(n-1)/2}}{\sqrt{\pi}\,\Gamma\left(\frac{n-1}{2}\right)\, 2^{n/2-1}}.$$
Theorem 5.18. The statistic $S_a^2$ has the noncentral chi-square distribution with $n$ degrees of freedom and noncentrality parameter $\lambda = \frac{n(a-\mu)^2}{\sigma^2}$, with the density function given by a Poisson mixture of ordinary chi squares
$$f_{S_a^2}(x) = \sum_{k=0}^{\infty} \frac{e^{-\lambda/2}\,(\lambda/2)^k}{k!}\, g_{n+2k}(x),$$
where $g_j(x)$ stands for the density of an ordinary chi-square random variable with $j$ degrees of freedom. Furthermore,
$$P(X + s \in C) \geq P(X + t \in C).$$
$$P(X \in C) \geq P(Y \in C).$$
$$P\left(\bigcap_{i=1}^n \{X_i \leq a_i\}\right) \geq \prod_{i=1}^n \Phi\left(\frac{a_i - \mu_i}{\sigma_i}\right);$$
$$P\left(\bigcap_{i=1}^n \{X_i \geq a_i\}\right) \geq \prod_{i=1}^n \Phi\left(\frac{\mu_i - a_i}{\sigma_i}\right).$$
Positive Dependence Inequality. Let $X \sim N_n(\mu, \Sigma)$, and suppose $\sigma_{ij} \geq 0$ for all $i, j$, $i \neq j$. Then,
$$P(X_i \geq c_i,\ i = 1, \ldots, n) \geq \prod_{i=1}^n P(X_i \geq c_i);$$
$$P(|X_i| \leq c_i,\ i = 1, \ldots, n) \geq \prod_{i=1}^n P(|X_i| \leq c_i);$$
$$P(f(X) \geq M_f + t) \leq e^{-\frac{t^2}{2}},$$
Note that it would not be very easy to try to find the variance of $X'X$ directly.
Exercises
Exercise 5.1. Prove that the density of any multivariate normal distribution is uni-
formly bounded.
Exercise 5.2. Suppose $X_1$, $X_1 + X_2$, $X_3 - (X_1 + X_2)$ are jointly multivariate normal. Prove or disprove that $X_1, X_2, X_3$ have a multivariate normal distribution.
Exercise 5.15. Let $X \sim N_n(0, I_n)$. Find the covariance between $c'X$ and $X'BX$, for a general vector $c$ and a general symmetric matrix $B$. When is it zero?
Exercise 5.16. Let $X \sim N_2(0, \Sigma)$. Characterize all symmetric matrices $B$ such that $X'BX$ has a chi square distribution.
Exercise 5.17 * (Noncentral Chi Square MGF). Show that a noncentral chi-square distribution has the mgf $e^{\frac{\lambda t}{1-2t}}\,(1-2t)^{-k/2}$, where $\lambda$ is the noncentrality parameter and $k$ the degrees of freedom.
For what values of $t$ does this formula apply?
Exercise 5.18 * (Normal Approximation to Noncentral Chi Square). Show that if the degrees of freedom $k$ is large, then a noncentral chi-square random variable is approximately normal, with mean $k + \lambda$ and variance $2(k + 2\lambda)$, where $\lambda$ is the noncentrality parameter.
Exercise 5.19 (Noncentral F Distribution). Suppose $X, Y$ are independent random variables, $X$ is noncentral chi square with $m$ degrees of freedom and noncentrality parameter $\lambda$, and $Y$ is an ordinary (or central) chi square with $n$ degrees of freedom. Find the density of $\frac{nX}{mY}$.

Remark. The density of $\frac{nX}{mY}$ is called the noncentral $F$-distribution with $m$ and $n$ degrees of freedom, and noncentrality parameter $\lambda$.

Exercise 5.20 (Noncentral F Mean). Suppose $X$ has a noncentral $F$-distribution with $m$ and $n$ degrees of freedom, and noncentrality parameter $\lambda$. Show that $E(X) = \frac{n(m+\lambda)}{m(n-2)}$.
When is this formula valid?
Exercise 5.21 (Noncentral F-Distribution). If $X$ has a noncentral $t$-distribution, show that $X^2$ has a noncentral $F$-distribution.
Exercise 5.22 * (Application of Anderson's Inequality). Let $X$ have a bivariate normal distribution with means zero, variances $1, 4$, and a correlation of $.5$. Let $Y$ have a bivariate normal distribution with means zero, variances $6, 9$, and a correlation of $\sqrt{\frac{2}{3}}$. Show that $P(X_1^2 + X_2^2 < c) \geq P(Y_1^2 + Y_2^2 < c)$ for all $c > 0$.

Exercise 5.23 *. Let $(X, Y, Z) \sim N_3(0, \Sigma)$. Show that $P((X-1)^2 + (Y-1)^2 + (Z-1)^2 \leq 1) > P((X-2)^2 + (Y-2)^2 + (Z-2)^2 \leq 1)$.
Hint: Use one of the Slepian inequalities.
Exercise 5.24 (Normal Marginals with Nonnormal Joint). Give an example of
a random vector .X; Y; Z/ such that each of X; Y; Z has a normal distribution, but
jointly .X; Y; Z/ is not multivariate normal.
References
Basu, D. (1955). On statistics independent of a complete sufficient statistic, Sankhyá, 15, 377–380.
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability, Springer-Verlag, New York.
Dasgupta, S. (1971). Nonsingularity of the sample covariance matrix, Sankhyá, Ser A, 33, 475–478.
Eaton, M. and Perlman, M. (1973). The nonsingularity of generalized sample covariance matrices,
Ann. Stat., 1, 710–717.
Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in samples
from an indefinitely large population, Biometrika, 10, 507–515.
Ghosh, M. and Sinha, B. (2002). A simple derivation of the Wishart distribution, Amer. Statist., 56,
100–101.
Haff, L. (1977). Minimax estimators for a multinormal precision matrix, J. Mult. Anal., 7, 374–385.
Haff, L. (1979a). Estimation of the inverse covariance matrix, Ann. Stat., 6, 1264–1276.
Haff, L. (1979b). An identity for the Wishart distribution with applications, J. Mult. Anal., 9,
531–544.
Haff, L. (1981). Further identities for the Wishart distribution with applications in regression,
Canad. J. Stat., 9, 215–224.
Hotelling, H. (1931). The generalization of Student’s ratio, Ann. Math. Statist., 2, 360–378.
Mahalanobis, P., Bose, R. and Roy, S. (1937). Normalization of statistical variates and the use of
rectangular coordinates in the theory of sampling distributions, Sankhyá, 3, 1–40.
Olkin, I. and Roy, S. (1954). On multivariate distribution theory, Ann. Math. Statist., 25, 329–339.
Rao, C. R. (1973). Linear Statistical Inference and Its Applications, Wiley, New York.
Tong, Y. (1990). The Multivariate Normal Distribution, Springer-Verlag, New York.
Chapter 6
Finite Sample Theory of Order Statistics
and Extremes
The ordered values of a sample of observations are called the order statistics of the
sample, and the smallest and the largest are called the extremes. Order statistics
and extremes are among the most important functions of a set of random variables
that we study in probability and statistics. There is natural interest in studying the
highs and lows of a sequence, and the other order statistics help in understanding
the concentration of probability in a distribution, or equivalently, the diversity in
the population represented by the distribution. Order statistics are also useful in sta-
tistical inference, where estimates of parameters are often based on some suitable
functions of the order statistics. In particular, the median is of very special importance. There is a well-developed theory of the order statistics of a fixed number $n$ of observations from a fixed distribution, as well as an asymptotic theory where $n$ goes to infinity. We discuss the case of fixed $n$ in this chapter. The distribution theory for order statistics when the observations are from a discrete distribution is complex, both notationally and algebraically, because several observations can be equal. These ties among the sample values make the distribution theory cumbersome. We therefore concentrate on the continuous case.
Principal references for this chapter are the books by David (1980), Reiss (1989),
Galambos (1987), Resnick (2007), and Leadbetter et al. (1983). Specific other ref-
erences are given in the sections.
$$= \sum f(y_1) f(y_2) \cdots f(y_n)$$
Likewise, if we want the marginal density of just $U_{(n)}$, that is, of the sample maximum, then we want to integrate out $u_1, u_2, \ldots, u_{n-1}$, and now the answer is
$$f_n(u_n) = n! \int_0^{u_n} \int_0^{u_{n-1}} \cdots \int_0^{u_2} du_1\, du_2 \cdots du_{n-1} = n\, u_n^{n-1}, \quad 0 < u_n < 1.$$
However, it is useful to note that for the special case of the minimum and the maximum, we could have obtained the densities much more easily and directly. Here is why. First take the maximum. Consider its CDF; for $0 < u < 1$:
$$P(U_{(n)} \leq u) = P\left(\bigcap_{i=1}^n \{X_i \leq u\}\right) = \prod_{i=1}^n P(X_i \leq u) = u^n.$$
Likewise, for the minimum, for $0 < u < 1$, the tail CDF is $P(U_{(1)} > u) = (1-u)^n$, and so
$$f_1(u) = \frac{d}{du}\left[1 - (1-u)^n\right] = n(1-u)^{n-1}, \quad 0 < u < 1.$$
For a general $r$, $1 \leq r \leq n$, the density of $U_{(r)}$ works out to a Beta density:
$$f_r(u) = \frac{n!}{(r-1)!\,(n-r)!}\, u^{r-1}(1-u)^{n-r}, \quad 0 < u < 1,$$
Above we have plotted in Fig. 6.1 the density of the minimum, median, and
maximum in the U Œ0; 1 case when n D 15. The minimum and the maximum clearly
have skewed densities, whereas the density of the median is symmetric about .5.
Example 6.3 (Density of One and Two Order Statistics). The joint density of any subset of the order statistics $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ can be worked out from their joint density, which we derived in the preceding section. The most important case in applications is the joint density of two specific order statistics, say $X_{(r)}$ and $X_{(s)}$, $1 \leq r < s \leq n$, or the density of a specific one, say $X_{(r)}$. A verbal heuristic argument helps in understanding the formula for the joint density of $X_{(r)}$ and $X_{(s)}$, and also the density of a specific one $X_{(r)}$.
First consider the density of just $X_{(r)}$. Fix $u$. To have $X_{(r)} = u$, we must have exactly one observation at $u$, another $r-1$ below $u$, and $n-r$ above $u$. This suggests that the density of $X_{(r)}$ is
$$f_r(u) = n f(u) \binom{n-1}{r-1} (F(u))^{r-1}(1-F(u))^{n-r} = \frac{n!}{(r-1)!\,(n-r)!}\,(F(u))^{r-1}(1-F(u))^{n-r} f(u),$$
$-\infty < u < \infty$. This is in fact the correct formula for the density of $X_{(r)}$.
Next, consider the case of the joint density of two order statistics, $X_{(r)}$ and $X_{(s)}$. Fix $u < v$. Then, to have $X_{(r)} = u$, $X_{(s)} = v$, we must have exactly one
$$f_r(u) = \frac{n!}{(r-1)!\,(n-r)!}\, F^{r-1}(u)(1-F(u))^{n-r} f(u), \quad -\infty < u < \infty.$$
(d) For general $1 \leq r < s \leq n$, $(X_{(r)}, X_{(s)})$ have the joint density
$$f_{r,s}(u,v) = \frac{n!}{(r-1)!\,(s-r-1)!\,(n-s)!}\,(F(u))^{r-1}(F(v)-F(u))^{s-r-1}(1-F(v))^{n-s}\, f(u)f(v),$$
$-\infty < u < v < \infty$.
(e) The minimum and the maximum, $X_{(1)}$ and $X_{(n)}$, have the joint density
$$f_{1,n}(u,v) = n(n-1)(F(v)-F(u))^{n-2}\, f(u)f(v), \quad -\infty < u < v < \infty.$$
Example 6.4 (Moments of Uniform Order Statistics). The general formulas in the above theorem lead to the following moment formulas in the uniform case. In the $U[0,1]$ case,
$$E(U_{(1)}) = \frac{1}{n+1}, \quad E(U_{(n)}) = \frac{n}{n+1},$$
$$\mathrm{Var}(U_{(1)}) = \mathrm{Var}(U_{(n)}) = \frac{n}{(n+1)^2(n+2)}; \qquad 1 - U_{(n)} \stackrel{\mathcal{L}}{=} U_{(1)};$$
$$\mathrm{Cov}(U_{(1)}, U_{(n)}) = \frac{1}{(n+1)^2(n+2)},$$
$$E(W_n) = \frac{n-1}{n+1}, \quad \mathrm{Var}(W_n) = \frac{2(n-1)}{(n+1)^2(n+2)}.$$
For a general order statistic, it follows from the fact that $U_{(r)} \sim \mathrm{Be}(r,\, n-r+1)$ that
$$E(U_{(r)}) = \frac{r}{n+1}; \quad \mathrm{Var}(U_{(r)}) = \frac{r(n-r+1)}{(n+1)^2(n+2)}.$$
Furthermore, it follows from the formula for their joint density that
$$\mathrm{Cov}(U_{(r)}, U_{(s)}) = \frac{r(n-s+1)}{(n+1)^2(n+2)}.$$
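These moment formulas are easy to verify by simulation. The sketch below (stdlib Python; the choice $n = 5$, $r = 2$ is illustrative, not from the text) sorts uniform samples and compares the empirical mean and variance of $U_{(r)}$ with $\frac{r}{n+1}$ and $\frac{r(n-r+1)}{(n+1)^2(n+2)}$.

```python
import random

# Empirical mean and variance of the r-th uniform order statistic for n = 5, r = 2.
# Theory: E = 2/6 = .3333, Var = 2*4/(36*7) = .031746.
random.seed(8)
n, r, reps = 5, 2, 50000
vals = []
for _ in range(reps):
    u = sorted(random.random() for _ in range(n))
    vals.append(u[r - 1])      # U_(r), with 1-based r
mean = sum(vals) / reps
var = sum(v * v for v in vals) / reps - mean * mean
print(round(mean, 3), round(var, 4))
```

The same loop, applied to any $r$, reproduces the Beta$(r, n-r+1)$ moments stated above.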
$0 < u_1 < u_2 < \cdots < u_n < 1$. Also, in particular, the minimum $X_{(1)}$ has the density
$$f_1(u) = n(1-F(u))^{n-1} f(u) = n e^{-(n-1)u}\, e^{-u} = n e^{-nu},$$
$0 < u < \infty$. In other words, we have the quite remarkable result that the minimum of $n$ independent standard exponentials is itself an exponential with mean $\frac{1}{n}$. Also, from the general formula, the maximum $X_{(n)}$ has the density
$$f_n(u) = n(1 - e^{-u})^{n-1} e^{-u} = n \sum_{i=0}^{n-1} \binom{n-1}{i} (-1)^i e^{-(i+1)u}, \quad 0 < u < \infty.$$
As a result,
$$E(X_{(n)}) = n \sum_{i=0}^{n-1} \binom{n-1}{i} (-1)^i \frac{1}{(i+1)^2} = \sum_{i=1}^n (-1)^{i-1} \binom{n}{i} \frac{1}{i},$$
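Both expressions for $E(X_{(n)})$ above equal the harmonic number $H_n = 1 + \frac{1}{2} + \cdots + \frac{1}{n}$, a standard fact about exponential maxima. The exact-arithmetic sketch below (Python, not from the text) checks all three quantities agree for several $n$ using rational arithmetic.

```python
from fractions import Fraction
from math import comb

def expr1(n):
    # n * sum_{i=0}^{n-1} C(n-1, i) (-1)^i / (i+1)^2
    return n * sum(Fraction((-1) ** i * comb(n - 1, i), (i + 1) ** 2)
                   for i in range(n))

def expr2(n):
    # sum_{i=1}^{n} (-1)^{i-1} C(n, i) / i
    return sum(Fraction((-1) ** (i - 1) * comb(n, i), i) for i in range(1, n + 1))

def harmonic(n):
    return sum(Fraction(1, i) for i in range(1, n + 1))

for n in (1, 2, 5, 10):
    assert expr1(n) == expr2(n) == harmonic(n)
print(harmonic(5))   # 137/60
```

This confirms, in particular, why the expected maximum grows only logarithmically in $n$: $H_n \approx \log n$.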
Again, this is a skewed density. It can be shown, either directly or by making use of the general theorem on existence of moments of order statistics (see the next section), that every moment, and in particular the mean and the variance, of $Z_{(n)}$ exists. Except for very small $n$, closed-form formulas for the mean or variance are not possible. For small $n$, integration tricks do produce exact formulas. For example,
$$E(Z_{(n)}) = \frac{1}{\sqrt{\pi}}, \ \text{if } n = 2; \qquad E(Z_{(n)}) = \frac{3}{2\sqrt{\pi}}, \ \text{if } n = 3.$$
Such formulas are available for $n \leq 5$; see David (1980).
We tabulate the expected value of the maximum for some values of n to illustrate
the slow growth.
n E.Z.n/ /
2 .56
5 1.16
10 1.54
20 1.87
30 2.04
50 2.25
100 2.51
500 3.04
1000 3.24
10000 3.85
The density of Z.n/ is plotted in Fig. 6.2 for three values of n. We can see that the
density is shifting to the right, and at the same time getting more peaked. Theoretical
asymptotic (i.e., as n ! 1) justifications for these visual findings are possible, and
we show some of them in a later chapter.
Uniform order statistics play a very special role in the theory of order statistics,
because many problems about order statistics of samples from a general density can
be reduced, by a simple and common technique, to the case of uniform order statis-
tics. It is thus especially important to understand and study uniform order statistics.
The technique that reduces problems in the general continuous case to the case of a uniform distribution on $[0,1]$ is the quantile transformation. We describe the exact correspondence below.
(c) $F(X_{(i)}) \stackrel{\mathcal{L}}{=} U_{(i)}$.
(d) $F^{-1}(U_{(i)}) \stackrel{\mathcal{L}}{=} X_{(i)}$.
(e) $(F(X_{(1)}), F(X_{(2)}), \ldots, F(X_{(n)})) \stackrel{\mathcal{L}}{=} (U_{(1)}, U_{(2)}, \ldots, U_{(n)})$.
(f) $(F^{-1}(U_{(1)}), F^{-1}(U_{(2)}), \ldots, F^{-1}(U_{(n)})) \stackrel{\mathcal{L}}{=} (X_{(1)}, X_{(2)}, \ldots, X_{(n)})$.
(g) $(Y_{(1)}, Y_{(2)}, \ldots, Y_{(n)}) \stackrel{\mathcal{L}}{=} (g(F^{-1}(U_{(1)})), g(F^{-1}(U_{(2)})), \ldots, g(F^{-1}(U_{(n)})))$.
Remark. We are already familiar with parts (a) and (b); they are restated here only
to provide the context. The parts that we need to focus on are the last two parts.
They say that any question about the set of order statistics X.1/ ; X.2/ ; : : : ; X.n/ of
a sample from a general continuous distribution can be rephrased in terms of the
set of order statistics from the U Œ0; 1 distribution. For this, all we need to do is to
substitute F 1 .U.i / / in place of X.i /, where F 1 is the quantile function of F .
So, at least in principle, as long as we know how to work skillfully with the joint
distribution of the uniform order statistics, we can answer questions about any set of
order statistics from a general continuous distribution, because the latter is simply a
transformation of the set of order statistics of the uniform. This has proved to be a
very useful technique in the theory of order statistics.
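As a quick illustration of this reduction, here is a short Python sketch (ours, not from the text; the helper names are our own): it builds the order statistics of an Exp(1) sample in two ways, directly and through part (f) by applying the exponential quantile function $F^{-1}(u) = -\log(1-u)$ to sorted uniforms, and checks that both constructions give the same mean for the sample maximum, which for Exp(1) equals $1 + 1/2 + \cdots + 1/n$.

```python
import math
import random

random.seed(1)

def exp_order_stats_via_quantiles(n):
    """Exp(1) order statistics via part (f): sort uniforms, then apply
    the exponential quantile function F^{-1}(u) = -log(1 - u)."""
    u_sorted = sorted(random.random() for _ in range(n))
    return [-math.log(1.0 - u) for u in u_sorted]

def exp_order_stats_direct(n):
    """Exp(1) order statistics by sorting exponential draws directly."""
    return sorted(random.expovariate(1.0) for _ in range(n))

# The two constructions have the same joint law; compare the mean of the
# sample maximum, which for Exp(1) equals 1 + 1/2 + ... + 1/n.
n, reps = 10, 20000
m1 = sum(exp_order_stats_via_quantiles(n)[-1] for _ in range(reps)) / reps
m2 = sum(exp_order_stats_direct(n)[-1] for _ in range(reps)) / reps
harmonic = sum(1.0 / i for i in range(1, n + 1))
print(m1, m2, harmonic)
```

The same two-line trick works for any continuous $F$ whose quantile function is available in closed form.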
As a corollary of part (f) of the above theorem, we have the following connection
between order statistics of a general continuous CDF and uniform order statistics.
Corollary. Let X.1/ ; X.2/ ; : : : ; X.n/ be the order statistics of a sample from a gen-
eral continuous CDF F , and U.1/ ; U.2/ ; : : : ; U.n/ the uniform order statistics. Then,
for any $1 \le r_1 < r_2 < \cdots < r_k \le n$,
$$P(X_{(r_1)} \le F^{-1}(u_1), \ldots, X_{(r_k)} \le F^{-1}(u_k)) = P(U_{(r_1)} \le u_1, \ldots, U_{(r_k)} \le u_k)$$
for all $u_1, \ldots, u_k$.
Several important applications of this quantile transformation method are given
below.
Proposition. Let $X_{(1)}, \ldots, X_{(n)}$ be the order statistics of a sample from a general continuous CDF $F$. Then, for any $1 \le r < s \le n$, $\mathrm{Cov}(X_{(r)}, X_{(s)}) \ge 0$, whenever the covariance exists.

Proof. We use the fact that if $g(x_1, x_2, \ldots, x_n)$ and $h(x_1, x_2, \ldots, x_n)$ are two functions such that they are coordinatewise nondecreasing in each $x_i$, then $\mathrm{Cov}(g(U_{(1)}, \ldots, U_{(n)}), h(U_{(1)}, \ldots, U_{(n)})) \ge 0$. By the quantile transformation theorem, $\mathrm{Cov}(X_{(r)}, X_{(s)}) = \mathrm{Cov}(F^{-1}(U_{(r)}), F^{-1}(U_{(s)})) \ge 0$, as $F^{-1}(U_{(r)})$ is a nondecreasing function of $U_{(r)}$, and $F^{-1}(U_{(s)})$ is a nondecreasing function of $U_{(s)}$, and hence they are also coordinatewise nondecreasing in each order statistic $U_{(1)}, U_{(2)}, \ldots, U_{(n)}$. □
This proposition was first proved in Bickel (1967), but by using a different
method. The next application is to existence of moments of order statistics.
Corollary. Suppose $F$ is a continuous CDF such that $E_F(|X|^k) < \infty$, for some given $k$. Then $E(|X_{(i)}|^k) < \infty$ for all $i$, $1 \le i \le n$. □
Aside from just the existence of the moment, explicit bounds are always useful.
Here is a concrete bound (see Reiss (1989)); approximations for moments of order
statistics for certain distributions are derived in Hall (1978).
Proposition. (a) For all $r \le n$, $E(|X_{(r)}|^k) \le \frac{n!}{(r-1)!\,(n-r)!}\, E_F(|X|^k)$;
(b) $E(|X_{(r)}|^k) < \infty \;\Rightarrow\; |F^{-1}(p)|^k\, p^r (1-p)^{n-r+1} \le C < \infty \ \forall p$;
(c) $|F^{-1}(p)|^k\, p^r (1-p)^{n-r+1} \le C < \infty \ \forall p \;\Rightarrow\; E(|X_{(s)}|^m) < \infty$, if $1 + \frac{mr}{k} \le s \le n - \frac{(n-r+1)m}{k}$.
Example 6.7 (Nonexistence of Every Moment of Every Order Statistic). Consider the continuous CDF $F(x) = 1 - \frac{1}{\log x},\ x \ge e$. Setting $1 - \frac{1}{\log x} = p$, we get the quantile function $F^{-1}(p) = e^{\frac{1}{1-p}}$. Fix any $n, k$, and $r \le n$. Consider what happens when $p \to 1$:
$$|F^{-1}(p)|^k\, p^r (1-p)^{n-r+1} = e^{\frac{k}{1-p}}\, p^r (1-p)^{n-r+1} \ge C e^{\frac{k}{1-p}} (1-p)^{n-r+1} = C e^{ky} y^{-(n-r+1)},$$
writing $y$ for $\frac{1}{1-p}$. For any $k > 0$, as $y \to \infty\ (p \to 1)$, $e^{ky} y^{-(n-r+1)} \to \infty$. Thus, the necessary condition of the proposition above is violated, and it follows that for any $r, n, k$, $E(|X_{(r)}|^k) = \infty$.
Remark. The preceding example and the proposition show that existence of
moments of order statistics depends on the tail of the underlying CDF (or, equiva-
lently, the tail of the density). If the tail is so thin that the density has a finite mgf
in some neighborhood of zero, then all order statistics will have all moments finite.
Evaluating them in closed form is generally impossible, however. If the tail of the
underlying density is heavy, then existence of moments of the order statistics, and
232 6 Finite Sample Theory of Order Statistics and Extremes
especially the minimum and the maximum, may be a problem. It is possible for
some central order statistics to have a few finite moments, and the minimum or
the maximum to have none. In other words, depending on the tail, anything can
happen. An especially interesting case is the case of a Cauchy density, notorious for
its troublesome tail. The next result describes what happens in that case.
Proposition. Let $X_1, X_2, \ldots, X_n$ be independent $C(\mu, \sigma)$ variables. Then,
(a) for all $n$, $E(|X_{(n)}|) = E(|X_{(1)}|) = \infty$;
(b) for $n \ge 3$, $E(|X_{(n-1)}|)$ and $E(|X_{(2)}|)$ are finite;
(c) for $n \ge 5$, $E(|X_{(n-2)}|^2)$ and $E(|X_{(3)}|^2)$ are finite;
(d) in general, $E(|X_{(r)}|^k) < \infty$ if and only if $k < \min\{r, n+1-r\}$.
Example 6.8 (Cauchy Order Statistics). From the above proposition we see that the truly problematic order statistics in the Cauchy case are the two extreme ones, the minimum and the maximum. Every other order statistic has a finite expectation for $n \ge 3$, and all but the two most extreme order statistics from each tail even have a finite variance, as long as $n \ge 5$. The table below lists the mean of $X_{(n-1)}$ and $X_{(n-2)}$ for some values of $n$.
n      E(X_(n-1))    E(X_(n-2))
5 1.17 .08
10 2.98 1.28
20 6.26 3.03
30 9.48 4.67
50 15.87 7.90
100 31.81 15.88
250 79.56 39.78
500 159.15 79.57
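Entries of this kind are straightforward to sanity-check by simulation. The Python sketch below (ours, not from the text) estimates $E(X_{(n-2)})$ for $n = 10$ by Monte Carlo; by part (d) of the proposition, $X_{(n-2)}$ has a finite second moment for $n \ge 5$, so the average of replicates is well behaved.

```python
import math
import random

random.seed(2)

def std_cauchy():
    """Standard C(0,1) variate via the probability integral transform."""
    return math.tan(math.pi * (random.random() - 0.5))

# Monte Carlo estimate of E(X_(n-2)) for a standard Cauchy sample, n = 10.
n, reps = 10, 100000
total = 0.0
for _ in range(reps):
    xs = sorted(std_cauchy() for _ in range(n))
    total += xs[n - 3]            # X_(n-2), the third largest value
est = total / reps
print(est)                        # the table lists 1.28 for n = 10
```

Estimating $E(X_{(n-1)})$ the same way is noticeably harder, since its variance is infinite by part (d); that contrast is itself a nice illustration of the proposition.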
Example 6.9 (Mode of Cauchy Sample Maximum). Although the sample maximum
X.n/ never has a finite expectation in the Cauchy case, it always has a unimodal
density (see a general result in the exercises). So it is interesting to see what the
modal values are for various n. The table below lists the mode of X.n/ for some
values of n.
n      Mode of X_(n)
5 .87
10 1.72
20 3.33
30 4.93
50 8.12
100 16.07
250 39.98
500 79.76
By comparing the entries in this table with the previous table, we see that the
mode of X.n/ is quite close to the mean of X.n2/ . It would be interesting to find a
theoretical result in this regard.
6.4 Spacings
Definition 6.6. Let $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ be the order statistics of a sample of $n$ observations $X_1, X_2, \ldots, X_n$. Then $W_i = X_{(i+1)} - X_{(i)},\ 1 \le i \le n-1$, are called the spacings of the sample, or the spacings of the order statistics.
The spacings of an exponential sample have the characteristic property that the spac-
ings are all independent exponentials as well. Here is the precise result.
Theorem 6.5. Let $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ be the order statistics from an $\mathrm{Exp}(\lambda)$ distribution. Then $W_0 = X_{(1)}, W_1, \ldots, W_{n-1}$ are independent, with $W_i \sim \mathrm{Exp}\!\left(\frac{\lambda}{n-i}\right),\ i = 0, 1, \ldots, n-1$.
Proof. The proof follows on transforming to the set of spacings from the set of order statistics, and by applying the Jacobian density theorem. The transformation $(u_1, u_2, \ldots, u_n) \to (w_0, w_1, \ldots, w_{n-1})$, where $w_0 = u_1,\ w_1 = u_2 - u_1,\ \ldots,\ w_{n-1} = u_n - u_{n-1}$, is one to one, with the inverse transformation $u_1 = w_0,\ u_2 = w_0 + w_1,\ u_3 = w_0 + w_1 + w_2,\ \ldots,\ u_n = w_0 + w_1 + \cdots + w_{n-1}$. The Jacobian matrix is triangular, and has determinant one. From our general theorem, the order statistics $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ have the joint density
$$f_{1,2,\ldots,n}(u_1, u_2, \ldots, u_n) = n!\, f(u_1) f(u_2) \cdots f(u_n)\, I_{\{0 < u_1 < u_2 < \cdots < u_n < \infty\}}.$$
This leads, in particular, to
$$E(X_{(r)}) = \lambda \sum_{i=1}^{r} \frac{1}{n-i+1}; \qquad \mathrm{Var}(X_{(r)}) = \lambda^2 \sum_{i=1}^{r} \frac{1}{(n-i+1)^2}.$$
Furthermore, by using the same representation, for $1 \le r < s \le n$, $\mathrm{Cov}(X_{(r)}, X_{(s)}) = \lambda^2 \sum_{i=1}^{r} \frac{1}{(n-i+1)^2}$, and therefore the correlation
$$\rho_{X_{(r)}, X_{(s)}} = \sqrt{\frac{\sum_{i=1}^{r} (n-i+1)^{-2}}{\sum_{i=1}^{s} (n-i+1)^{-2}}}.$$
In particular,
$$\rho_{X_{(1)}, X_{(n)}} = \frac{1/n}{\sqrt{\sum_{i=1}^{n} 1/i^2}} \approx \frac{\sqrt{6}}{\pi n}.$$
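These closed forms are easy to sanity-check numerically. The sketch below (ours, written with the convention used in this text that $\mathrm{Exp}(\lambda)$ has mean $\lambda$) compares the formula for $E(X_{(r)})$ with a Monte Carlo average.

```python
import random

random.seed(3)

# Convention as in the text: Exp(lam) has mean lam (so rate 1/lam).
lam, n, r = 2.0, 8, 3

# Closed form from above: E(X_(r)) = lam * sum_{i=1}^{r} 1/(n - i + 1)
exact = lam * sum(1.0 / (n - i + 1) for i in range(1, r + 1))

# Monte Carlo check: average the rth order statistic over many samples.
reps = 100000
total = 0.0
for _ in range(reps):
    xs = sorted(random.expovariate(1.0 / lam) for _ in range(n))
    total += xs[r - 1]
approx = total / reps
print(exact, approx)
```

Note that Python's `random.expovariate` is parameterized by the rate, hence the `1.0 / lam`.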
The results on exponential spacings lead to some highly useful and neat represen-
tations for the spacings and the order statistics of a uniform distribution. The next
result describes the most important properties of uniform spacings and order statis-
tics. Numerous other properties of a more special nature are known. David (1980)
and Reiss (1989) are the best references for such additional properties of the uniform
order statistics.
Proof. For part (a), use the fact that the negative of the logarithm of a $U[0,1]$ variable is standard exponential, and then use the result that the exponential spacings are themselves independent exponentials. That $V_1, V_2^2, \ldots, V_{n-1}^{n-1}$ are also uniformly distributed follows from looking at the joint density of $U_{(i)}, U_{(i+1)}$ for any given $i$. It follows trivially from the density of $V_n$ that $V_n^n \sim U[0,1]$.
For parts (b) and (c), first consider the joint density of the uniform order statistics, and then transform to the variables $W_i,\ i = 0, \ldots, n-1$. This is a one-to-one transformation, and so we can apply the Jacobian density theorem. The Jacobian theorem easily gives the joint density of the $W_i,\ i = 0, \ldots, n-1$, and we simply recognize it to be the density of a Dirichlet with the parameter vector having each coordinate equal to one. Finally, use the representation of a Dirichlet random vector in the form of ratios of Gammas (see Chapter 4). □
Remark. Part (d) of this theorem, representing uniform order statistics in terms of independent exponentials, is one of the most useful results in the theory of order statistics.
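To make that representation concrete, here is a short Python sketch (ours, not from the text): it generates uniform order statistics as partial sums of iid standard exponentials divided by the full sum, and checks the construction against $E(U_{(i)}) = \frac{i}{n+1}$.

```python
import random

random.seed(4)

def uniform_order_stats(n):
    """(U_(1),...,U_(n)) via the representation U_(i) = S_i / S_{n+1},
    where S_i = X_1 + ... + X_i and the X_j are iid standard exponentials."""
    xs = [random.expovariate(1.0) for _ in range(n + 1)]
    total = sum(xs)
    out, partial = [], 0.0
    for x in xs[:n]:
        partial += x
        out.append(partial / total)
    return out

# Check the construction against E(U_(i)) = i/(n+1).
n, reps = 5, 100000
sums = [0.0] * n
for _ in range(reps):
    for i, u in enumerate(uniform_order_stats(n)):
        sums[i] += u
means = [s / reps for s in sums]
print(means)   # close to 1/6, 2/6, 3/6, 4/6, 5/6
```

No sorting is needed at all, which is one reason the representation is so convenient in both theory and computation.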
6.5 Conditional Distributions and Markov Property

The conditional distributions of a subset of the order statistics given another subset
satisfy some really structured properties. An illustration of such a result is that if
we know that the sample maximum X.n/ D x, then the rest of the order statistics
would act like the order statistics of a sample of size n 1 from the original CDF,
but truncated on the right at that specific value x. Another prominent property of
the conditional distributions is the Markov property. Again, a lot is known about
the conditional distributions of order statistics, but we present the most significant
236 6 Finite Sample Theory of Order Statistics and Extremes
and easy to state results. The best references for reading more about the conditional
distributions are still David (1980) and Reiss (1989). Each of the following theorems
follows on straightforward calculations, and we omit the calculations.
$$f_{X_{(1)}, \ldots, X_{(n-1)} \mid X_{(n)} = x}(u_1, \ldots, u_{n-1}) = (n-1)!\, \prod_{i=1}^{n-1} \frac{f(u_i)}{F(x)}, \qquad u_1 < \cdots < u_{n-1} < x.$$
Remark. A similar and transparent result holds for the conditional distribution of $X_{(2)}, X_{(3)}, \ldots, X_{(n)}$ given $X_{(1)} = x$.
$$f_{X_{(2)}, \ldots, X_{(n-1)} \mid X_{(1)} = x,\, X_{(n)} = y}(u_2, \ldots, u_{n-1}) = (n-2)!\, \prod_{i=2}^{n-1} \frac{f(u_i)}{F(y) - F(x)}, \qquad x < u_2 < \cdots < u_{n-1} < y.$$
$$E(n\bar{X} \mid X_{(n)} = x) = x + \sum_{i=1}^{n-1} \frac{ix}{n},$$
because, given $X_{(n)} = x$, $X_{(1)}, X_{(2)}, \ldots, X_{(n-1)}$ act like the order statistics of a sample of size $n-1$ from the $U[0,x]$ distribution. Now summing the series, we get
$$E(n\bar{X} \mid X_{(n)} = x) = x + \frac{(n-1)x}{2} = \frac{n+1}{2}\,x \;\Rightarrow\; E(\bar{X} \mid X_{(n)} = x) = \frac{n+1}{2n}\,x.$$
Example 6.12 (Maximum Given the First Half). Suppose $X_1, X_2, \ldots, X_{2n}$ are independent standard exponentials. We want to find $E(X_{(2n)} \mid X_{(1)} = x_1, \ldots, X_{(n)} = x_n)$. By the theorem on the Markov property, this conditional expectation equals $E(X_{(2n)} \mid X_{(n)} = x_n)$. Now, we further use the representation that
$$(X_{(n)}, X_{(2n)}) \stackrel{\mathcal{L}}{=} \left( \sum_{i=1}^{n} \frac{X_i}{2n-i+1},\; \sum_{i=1}^{2n} \frac{X_i}{2n-i+1} \right).$$
Therefore,
$$E(X_{(2n)} \mid X_{(n)} = x_n) = E\left( \sum_{i=1}^{n} \frac{X_i}{2n-i+1} + \sum_{i=n+1}^{2n} \frac{X_i}{2n-i+1} \;\Big|\; \sum_{i=1}^{n} \frac{X_i}{2n-i+1} = x_n \right)$$
$$= x_n + E\left( \sum_{i=n+1}^{2n} \frac{X_i}{2n-i+1} \;\Big|\; \sum_{i=1}^{n} \frac{X_i}{2n-i+1} = x_n \right) = x_n + E\left( \sum_{i=n+1}^{2n} \frac{X_i}{2n-i+1} \right) = x_n + \sum_{i=n+1}^{2n} \frac{1}{2n-i+1},$$
the next-to-last equality holding because the second sum is independent of the first. For example, in a sample of size 4, $E(X_{(4)} \mid X_{(1)} = x, X_{(2)} = y) = E(X_{(4)} \mid X_{(2)} = y) = y + \sum_{i=3}^{4} \frac{1}{5-i} = y + \frac{3}{2}$.
6.6 Some Applications

Order statistics and the related theory have many interesting and important applications in statistics, in modeling of empirical phenomena, for example, climate characteristics, and in probability theory itself. We touch on a small number of applications in this section for purposes of reference. For further reading on the vast literature on applications of order statistics, we recommend, among numerous possibilities, Lehmann (1975), Shorack and Wellner (1986), David (1980), Reiss (1989), Martynov (1992), Galambos (1987), Falk et al. (1994), Coles (2001), Embrechts et al. (2008), and DasGupta (2008).
6.6.1 Records
Record values and their timings are used for the purposes of tracking changes in some process, such as temperature, and for preparation for extremal events, such as protection against floods. They are also interesting in their own right.
Let X1 ; X2 ; : : : ; be an infinite sequence of independent observations from a con-
tinuous CDF F . We first give some essential definitions.
Example 6.13. The values 1.46, .28, 2.20, .72, 2.33, .67, .42, .85, .66, .67, 1.54, .76, 1.22, 1.72, .33 are 15 simulated values from a standard exponential distribution. The record values are 1.46, 2.20, 2.33, and the record times are $T_1 = 1$, $T_2 = 3$, $T_3 = 5$. Thus, there are three records at time $n = 15$. We notice that no records were observed after the fifth observation in the sequence. In fact, in general, it becomes increasingly difficult to obtain a record as time passes; justification for this statement is given in the following theorem.
The following theorem summarizes a number of key results about record values,
times, and number of records; this theorem is a superb example of the power of the
quantile transformation method, because the results for a general continuous CDF
F can be obtained from the U Œ0; 1 case by making a quantile transformation. The
details are worked out in Port (1993, pp. 502–509).
(b) If $N$ denotes the number of records observed up to time $n$, then
$$E(N) = \sum_{i=1}^{n} \frac{1}{i}; \qquad \mathrm{Var}(N) = \sum_{i=1}^{n} \frac{i-1}{i^2}.$$
(d) The $r$th record value has the density
$$f_r(x) = \frac{[-\log(1 - F(x))]^{r-1}}{(r-1)!}\, f(x), \qquad -\infty < x < \infty.$$
(e) The first $n$ record values, $X_{T_1}, X_{T_2}, \ldots, X_{T_n}$, have the joint density
$$f_{12\cdots n}(x_1, x_2, \ldots, x_n) = f(x_n) \prod_{i=1}^{n-1} \frac{f(x_i)}{1 - F(x_i)}\; I_{\{x_1 < x_2 < \cdots < x_n\}}.$$
(f) Fix a sequence of reals $t_1 < t_2 < t_3 < \cdots < t_k$, and for any given real $t$, let $M(t)$ be the total number of record values that are $\le t$. Then the increments $M(t_1), M(t_2) - M(t_1), \ldots, M(t_k) - M(t_{k-1})$ are independent, with $M(t) \sim \mathrm{Poi}(-\log(1 - F(t)))$.
Remark. From part (a) of the theorem, we learn that if indeed the sequence of observations keeps coming from the same CDF, then obtaining a record becomes harder as time passes; $P(Z_i = 1) \to 0$. We learn from part (b) that both the mean and the variance of the number of records observed until time $n$ are of the order of $\log n$. The number of records observed until time $n$ is well approximated by a Poisson distribution with mean $\log n$, or a normal distribution with mean and variance equal to $\log n$. We learn from parts (c) and (d) that the interarrival times of the record values do not depend on $F$, but the magnitudes of the record values do. Part (f) is
another example of the Poisson distribution providing an approximation in an interesting problem. It is interesting to note the connection between part (b) and part (f). In part (f), if we take $t = F^{-1}(1 - \frac{1}{n})$, then heuristically, $N_n$, the number of records observed up to time $n$, satisfies $N_n \approx M(X_{(n)}) \approx M(F^{-1}(1 - \frac{1}{n})) \sim \mathrm{Poi}(-\log(1 - F(F^{-1}(1 - \frac{1}{n})))) = \mathrm{Poi}(\log n)$, which is what we mentioned in the paragraph above.
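A quick simulation makes the $\log n$ growth visible. The Python sketch below (ours, not from the text) counts records in repeated iid sequences and compares the average count with $E(N) = \sum_{i=1}^{n} 1/i$, which is close to $\log n$ plus Euler's constant.

```python
import random

random.seed(6)

def count_records(xs):
    """Number of upper records; the first observation always counts as a record."""
    count, best = 0, float("-inf")
    for x in xs:
        if x > best:
            count, best = count + 1, x
    return count

# Average record count over many iid U[0,1] sequences of length n,
# compared with the exact harmonic-sum formula E(N) = sum_{i=1}^n 1/i.
n, reps = 200, 10000
total = sum(count_records([random.random() for _ in range(n)]) for _ in range(reps))
approx = total / reps
exact = sum(1.0 / i for i in range(1, n + 1))
print(approx, exact)
```

Since record counts are distribution-free for continuous $F$, using uniforms here loses no generality.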
Fig. 6.3 Density of the third record value for, top to bottom, the N(0,1), double exponential (0,1), and C(0,1) cases
Fig. 6.4 PMF of the interarrival time between the second and third records
Next, consider the distribution of the gap between the arrival times of the
second and the third record. Note the long right tail, akin to an exponential den-
sity in Fig. 6.4.
The empirical CDF Fn .x/, defined in Section 6.1.2, is a tool of tremendous impor-
tance in statistics and probability. The reason for its effectiveness as a tool is that if
sample observations arise from some CDF F , then the empirical CDF Fn will be
very close to F for large n. So, we can get a very good idea of what the true F is
by looking at $F_n$. Furthermore, because $F_n \approx F$, it can be expected that if $T(F)$ is a nice functional of $F$, then the empirical version $T(F_n)$ would be close to $T(F)$.
Perhaps the simplest example of this is the mean $T(F) = E_F(X)$. The empirical version then is $T(F_n) = E_{F_n}(X) = \frac{\sum_{i=1}^{n} X_i}{n}$, because $F_n$ assigns the equal probability $\frac{1}{n}$ to each of the observation values $X_1, \ldots, X_n$. This means that the mean of the
sample values should be close to the expected value under the true F . And, this is
indeed true under simple conditions, and we have already seen some evidence for
it in the form of the central limit theorem. We provide some basic properties and
applications of the empirical CDF in this section.
(c) Assume that $F$ is continuous. For any given $n$, and $\alpha$, $0 < \alpha < 1$, there exist positive constants $D_n = D_n(\alpha)$, independent of $F$, such that whatever be $F$, $P(\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \le D_n) \ge 1 - \alpha$.
Remark. Part (b), the DKW inequality, namely $P(\sup_{x} |F_n(x) - F(x)| > \epsilon) \le 2 e^{-2n\epsilon^2}$ for all $\epsilon > 0$, was first proved in Dvoretzky et al. (1956), but in a weaker form. The inequality stated here is proved in Massart (1990). Furthermore, the constant 2 in the inequality is the best possible choice of the constant; that is, the inequality is false with any other constant $C < 2$. The inequality says that
uniformly in x, for large n, the empirical CDF is arbitrarily close to the true CDF
with a very high probability, and the probability of the contrary is sub-Gaussian. We
show more precise consequences of this in a later chapter. Part (c) is important for
statisticians, as we show in our next example.
Example 6.15 (Confidence Band for a Continuous CDF). This example is another important application of the quantile transformation method. Imagine a hypothetical sequence of independent $U[0,1]$ variables, $U_1, U_2, \ldots$, and let $G_n$ denote the empirical CDF of this sequence of uniform random variables; that is,
$$G_n(t) = \frac{\#\{i : U_i \le t\}}{n}.$$
By the quantile transformation,
$$\Delta_n = \sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \stackrel{\mathcal{L}}{=} \sup_{x \in \mathbb{R}} |G_n(F(x)) - F(x)| = \sup_{0 < t < 1} |G_n(t) - t|,$$
which shows that as long as $F$ is a continuous CDF, so that the quantile transformation can be applied, for any $n$, the distribution of $\Delta_n$ is the same for all $F$. This common distribution is just the distribution of $\sup_{0<t<1} |G_n(t) - t|$. Consequently, if $D_n$ is such that $P(\sup_{0<t<1} |G_n(t) - t| > D_n) \le \alpha$, then $D_n$ also satisfies (the apparently stronger statement)
$$P(F_n(x) - D_n \le F(x) \le F_n(x) + D_n \ \ \forall x \in \mathbb{R}) \ge 1 - \alpha, \quad \text{whatever be } F.$$
The probability statement above provides the assurance that with probability $1 - \alpha$ or more, the true CDF $F(x)$, as a function, is caught between the pair of functions $F_n(x) \pm D_n$. As a consequence, the band $F_n(x) - D_n \le F(x) \le F_n(x) + D_n,\ x \in \mathbb{R}$, is called a $100(1-\alpha)\%$ confidence band for $F$. This is of great use in statistics, because statisticians often consider the true CDF $F$ to be unknown.
The constants Dn have been computed and tabulated for small and moderate n.
We tabulate the values of Dn for some selected n for easy reference and use.
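Because $\Delta_n$ is distribution-free, the constants $D_n$ can be obtained by simulating uniform samples alone. The Python sketch below (ours, not from the text) estimates $D_n$ for $n = 50$, $\alpha = .05$ as an empirical quantile; the classically tabulated Kolmogorov–Smirnov value for this case is about 0.19.

```python
import random

random.seed(7)

def sup_dist_uniform(n):
    """sup_t |G_n(t) - t| for the empirical CDF G_n of n U[0,1] draws.
    The supremum is attained at a jump point of G_n."""
    u = sorted(random.random() for _ in range(n))
    d = 0.0
    for i, ui in enumerate(u):
        d = max(d, (i + 1) / n - ui, ui - i / n)
    return d

# Estimate D_n for n = 50, alpha = .05 as the empirical 95th percentile of
# the distribution-free statistic sup_t |G_n(t) - t|.
n, alpha, reps = 50, 0.05, 20000
stats = sorted(sup_dist_uniform(n) for _ in range(reps))
Dn = stats[int((1 - alpha) * reps) - 1]
print(Dn)
```

The same code, with $n$ changed, reproduces the whole table of $D_n$ values to Monte Carlo accuracy.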
6.7 Distribution of the Multinomial Maximum
see Chapter 2. As a result, we can compute $P(\max\{f_1, f_2, \ldots, f_k\} \ge x)$ exactly whenever we can compute the coefficient of $t^n$ in the expansion of $\left(\sum_{j=0}^{x-1} \frac{t^j}{j!}\right)^k$. This is possible to do by using symbolic software; see Ethier (1982) and DasGupta (2009).
244 6 Finite Sample Theory of Order Statistics and Extremes
Example 6.16 (Maximum Frequency in Die Throws). Suppose a fair six-sided die
is rolled 30 times. Should we be surprised if one of the six faces appears 10
times? The usual probability calculation to quantify the surprise is to calculate
$P(\max\{f_1, f_2, \ldots, f_6\} \ge 10)$, namely the P-value, where $f_1, f_2, \ldots, f_6$ are the frequencies of the six faces in the 30 rolls. Because of our Poissonization result, we can compute this probability. From the table of exact probabilities below, we can see that it would not be very surprising if some face appeared 10 times in 30 rolls of a fair die; after all, $P(\max\{f_1, f_2, \ldots, f_6\} \ge 10) = .1176$, not a very small number. Similarly, it would not be very surprising if some face appeared 15 times in 50 rolls of a fair die, as can be seen in the table below.
$P(\max\{f_1, f_2, \ldots, f_k\} \ge x)$, $k = 6$
x      n = 30     n = 50
8 .6014 1
9 .2942 1
10 .1176 .9888
11 .0404 .8663
12 .0122 .6122
13 .0032 .3578
14 .00076 .1816
15 .00016 .0827
16 .00003 .0344
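The coefficient extraction needs only exact rational polynomial arithmetic, so a full symbolic package is not strictly required. Here is a self-contained Python sketch (ours, not from the text) that reproduces two of the table entries for $n = 30$ using $P(\max \le x-1) = \frac{n!}{k^n}\,[t^n]\left(\sum_{j=0}^{x-1} \frac{t^j}{j!}\right)^k$.

```python
from fractions import Fraction
from math import factorial

def prob_max_at_least(n, k, x):
    """P(max{f_1,...,f_k} >= x) for n equiprobable multinomial trials over
    k cells, via 1 - (n!/k^n) [t^n] (sum_{j=0}^{x-1} t^j/j!)^k."""
    base = [Fraction(1, factorial(j)) for j in range(x)]   # truncated exp series
    poly = [Fraction(1)]
    for _ in range(k):                                     # raise to the kth power
        new = [Fraction(0)] * (n + 1)                      # truncate beyond t^n
        for i, a in enumerate(poly):
            if a:
                for j, b in enumerate(base):
                    if i + j <= n:
                        new[i + j] += a * b
        poly = new
    return 1 - Fraction(factorial(n)) * poly[n] / k**n

p10 = prob_max_at_least(30, 6, 10)
p8 = prob_max_at_least(30, 6, 8)
print(float(p10), float(p8))   # the table gives .1176 and .6014
```

Everything is exact until the final `float`, so the answers are free of rounding error.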
Exercises
Exercise 6.8 * (Waiting Time). Peter, Paul, and Mary went to a bank to do some
business. Two counters were open, and Peter and Paul went first. Each of Peter,
Paul, and Mary will take, independently, an Exp($\lambda$) amount of time to finish their
business, from the moment they arrive at the counter.
(a) What is the density of the epoch of the last departure?
(b) What is the probability that Mary will be the last to finish?
(c) What is the density of the total time taken by Mary from arrival to finishing her
business?
$$(n-1)\phi(x) = x\Phi(x).$$
Exercise 6.25 (Conditional Expectation Given the Order Statistics). Let g.x1 ;
x2 ; : : : ; xn / be a general real-valued function of n variables. Let X1 ; X2 ; : : : ; Xn be
independent observations from a common CDF $F$. Find as clean an expression as possible for $E(g(X_1, X_2, \ldots, X_n) \mid X_{(1)}, X_{(2)}, \ldots, X_{(n)})$.
Exercise 6.26. Derive a formula for the expected value of the rth record when the
sample observations are from an exponential density.
Exercise 6.28. Let Fn .x/ be the empirical CDF of n observations from a CDF F .
Show that
References
Bickel, P. (1967). Some contributions to order statistics, Proc. Fifth Berkeley Symp., I, 575–591,
L. Le Cam and J. Neyman Eds., University of California Press, Berkeley.
Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values, Springer, New York.
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability, Springer, New York.
DasGupta, A. (2009). Exact tail probabilities and percentiles of the multinomial maximum, Tech-
nical Report, Purdue University.
David, H. (1980). Order Statistics, Wiley, New York.
Dvoretzky, A., Kiefer, J., and Wolfowitz, J. (1956). Asymptotic minimax character of the sample
distribution function, Ann. Math. Statist., 27, 3, 642–669.
Embrechts, P., Klüppelberg, C., and Mikosch, T. (2008). Modelling Extremal Events: For Insurance
and Finance, Springer, New York.
Ethier, S. (1982). Testing for favorable numbers on a roulette wheel, J. Amer. Statist. Assoc., 77,
660–665.
Falk, M., Hüsler, J. and Reiss, R. (1994). Laws of Small Numbers, Extremes, and Rare Events,
Birkhauser, Basel.
Galambos, J. (1987). Asymptotic Theory of Extreme Order Statistics, Academic Press, New York.
Hall, P. (1978). Some asymptotic expansions of moments of order statistics, Stoch. Proc. Appl., 7,
265–275.
Hall, P. (1979). On the rate of convergence of normal extremes, J. Appl. Prob., 16, 2, 433–439.
Leadbetter, M., Lindgren, G., and Rootzén, H. (1983). Extremes and Related Properties of Random
Sequences and Processes, Springer, New York.
Chapter 7
Essential Asymptotics and Applications

Some basic definitions, notation, and concepts are put together in this section.

Definition 7.1. Let $a_n$ be a sequence of real numbers. We write $a_n = o(1)$ if $\lim_{n\to\infty} a_n = 0$. We write $a_n = O(1)$ if there exists $K < \infty$ such that $|a_n| \le K$ for all $n \ge 1$. More generally, if $a_n, b_n > 0$ are two sequences of real numbers, we write $a_n = o(b_n)$ if $\frac{a_n}{b_n} = o(1)$; we write $a_n = O(b_n)$ if $\frac{a_n}{b_n} = O(1)$.
Remark. Note that the definition allows the possibility that a sequence $a_n$ which is $O(1)$ is also $o(1)$. The converse is always true; that is, $a_n = o(1) \Rightarrow a_n = O(1)$.
Definition 7.2. Let $a_n, b_n$ be two real sequences. We write $a_n \sim b_n$ if $\frac{a_n}{b_n} \to 1$, as $n \to \infty$. We write $a_n \asymp b_n$ if $0 < \liminf \frac{a_n}{b_n} \le \limsup \frac{a_n}{b_n} < \infty$.
Example 7.1. Let $a_n = \frac{n}{n+1}$. Then $|a_n| \le 1$ for all $n \ge 1$; so $a_n = O(1)$. But $a_n \to 1$ as $n \to \infty$; so $a_n$ is not $o(1)$.
However, suppose $a_n = \frac{1}{n}$. Then, again, $|a_n| \le 1$ for all $n \ge 1$; so $a_n = O(1)$. But this time $a_n \to 0$ as $n \to \infty$; so $a_n$ is both $O(1)$ and $o(1)$. In this case, $a_n = O(1)$ is a weaker statement than $a_n = o(1)$.
Next, suppose $a_n = n$. Then $|a_n| = n \to \infty$ as $n \to \infty$; so $a_n$ is not $O(1)$, and therefore also cannot be $o(1)$.
Example 7.2. Let $c_n = \log n$, and $a_n = \frac{c_n}{c_{n+k}}$, where $k \ge 1$ is a fixed positive integer. Then,
$$a_n = \frac{\log n}{\log(n+k)} = \frac{\log n}{\log n + \log\!\left(1 + \frac{k}{n}\right)} = \frac{1}{1 + \frac{\log(1 + k/n)}{\log n}} \to \frac{1}{1+0} = 1.$$
Example 7.3. As an example of a two-term asymptotic expansion, consider $a_n = \frac{1}{\sqrt{n+1}}$. Then,
$$a_n = \frac{1}{\sqrt{n}} + \left(\frac{1}{\sqrt{n+1}} - \frac{1}{\sqrt{n}}\right) = \frac{1}{\sqrt{n}} + \frac{\sqrt{n} - \sqrt{n+1}}{\sqrt{n}\sqrt{n+1}} = \frac{1}{\sqrt{n}} - \frac{1}{\sqrt{n}\,\sqrt{n+1}\,(\sqrt{n} + \sqrt{n+1})}$$
$$= \frac{1}{\sqrt{n}} - \frac{1}{\sqrt{n}\cdot\sqrt{n}\,(1 + o(1))\cdot(2\sqrt{n} + o(1))} = \frac{1}{\sqrt{n}} - \frac{1}{2n^{3/2}(1 + o(1))}$$
$$= \frac{1}{\sqrt{n}} - \frac{1}{2n^{3/2}}\,(1 + o(1)) = \frac{1}{\sqrt{n}} - \frac{1}{2n^{3/2}} + o\!\left(\frac{1}{n^{3/2}}\right).$$
Example 7.4. For $n \ge 1$, consider the simple discrete random variables $X_n$ with the pmf $P(X_n = \frac{1}{n}) = 1 - \frac{1}{n}$, $P(X_n = n) = \frac{1}{n}$. Then, for large $n$, $X_n$ is close to zero with a large probability. Although for any given $n$, $X_n$ is never equal to zero, the probability of it being far from zero is very small for large $n$. For example, $P(X_n > .001) \le .001$ if $n \ge 1000$. More formally, for any given $\epsilon > 0$, $P(X_n > \epsilon) \le P(X_n > \frac{1}{n}) = \frac{1}{n}$, if we take $n$ to be so large that $\frac{1}{n} < \epsilon$. As a consequence, $P(|X_n| > \epsilon) = P(X_n > \epsilon) \to 0$, as $n \to \infty$. This example motivates the following definition.
Notation. If $X_n \stackrel{P}{\Rightarrow} 0$, then we write $X_n = o_p(1)$. More generally, if $a_n X_n \stackrel{P}{\Rightarrow} 0$ for some positive sequence $a_n$, then we write $X_n = o_p(\frac{1}{a_n})$. If $X_n$ is bounded in probability, then we write $X_n = O_p(1)$. If $a_n X_n = O_p(1)$, we write $X_n = O_p(\frac{1}{a_n})$.
Proposition. Suppose $X_n = o_p(1)$. Then $X_n = O_p(1)$. The converse is, in general, not true.
Proof. If $X_n = o_p(1)$, then by definition of convergence in probability, given $c > 0$, $P(|X_n| > c) \to 0$ as $n \to \infty$. Thus, given any fixed $\epsilon > 0$, for all large $n$, say $n \ge n_0(\epsilon)$, $P(|X_n| > 1) < \epsilon$. Next find constants $c_1, c_2, \ldots, c_{n_0}$ such that $P(|X_i| > c_i) < \epsilon,\ i = 1, 2, \ldots, n_0$. Choose $k = \max\{1, c_1, c_2, \ldots, c_{n_0}\}$. Then $P(|X_n| > k) < \epsilon$ for all $n \ge 1$. Therefore, $X_n = O_p(1)$.
To see that the converse is, in general, not true, let $X \sim N(0,1)$, and define $X_n \equiv X$ for all $n \ge 1$. Then $X_n = O_p(1)$. But $P(|X_n| > 1) = P(|X| > 1)$, which is a fixed positive number, and so $P(|X_n| > 1)$ does not converge to zero. □
Definition 7.6. Let $\{X_n, X\}$ be defined on the same probability space. We say that $X_n$ converges almost surely to $X$ (or $X_n$ converges to $X$ with probability 1) if $P(\omega : X_n(\omega) \to X(\omega)) = 1$. We write $X_n \stackrel{a.s.}{\to} X$ or $X_n \stackrel{a.s.}{\Rightarrow} X$.
Remark. If the limit $X$ is a finite constant $c$ with probability one, we write $X_n \stackrel{a.s.}{\Rightarrow} c$. If $P(\omega : X_n(\omega) \to \infty) = 1$, we write $X_n \stackrel{a.s.}{\to} \infty$. Almost sure convergence is a stronger mode of convergence than convergence in probability. In fact, a characterization of almost sure convergence is that for any given $\epsilon > 0$,
$$\lim_{m \to \infty} P(|X_n - X| \le \epsilon \ \ \forall\, n \ge m) = 1.$$
It is clear from this characterization that almost sure convergence is stronger than
convergence in probability. However, the following relationships hold.
Theorem 7.1. (a) Let $X_n \stackrel{P}{\Rightarrow} X$. Then there is a subsequence $X_{n_k}$ such that $X_{n_k} \stackrel{a.s.}{\to} X$.
(b) Suppose $X_n$ is a monotone nondecreasing sequence and $X_n \stackrel{P}{\Rightarrow} X$. Then $X_n \stackrel{a.s.}{\to} X$.
Example 7.6 (Pattern in Coin Tossing). For iid Bernoulli trials with a success probability $p = \frac{1}{2}$, let $T_n$ denote the number of times in the first $n$ trials that a success is followed by a failure. Denoting
$$I_i = I\{i\text{th trial is a success and } (i+1)\text{th trial is a failure}\},$$
we have $T_n = \sum_{i=1}^{n-1} I_i$, and therefore $E(T_n) = \frac{n-1}{4}$, and
$$\mathrm{Var}(T_n) = \sum_{i=1}^{n-1} \mathrm{Var}(I_i) + 2\sum_{i=1}^{n-2} \mathrm{Cov}(I_i, I_{i+1}) = \frac{3(n-1)}{16} - \frac{2(n-2)}{16} = \frac{n+1}{16}.$$
It now follows by an application of Chebyshev's inequality that $\frac{T_n}{n} \stackrel{P}{\Rightarrow} \frac{1}{4}$.
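A simulation confirms the limit. The Python sketch below (ours, not from the text) estimates $E(T_n)/n$ by averaging the success-then-failure count over repeated coin-toss sequences.

```python
import random

random.seed(9)

def count_sf(trials):
    """Count indices i with a success at trial i followed by a failure at i+1."""
    return sum(1 for a, b in zip(trials, trials[1:]) if a and not b)

# Average T_n / n over many fair-coin sequences; E(T_n)/n = (n-1)/(4n) ~ 1/4.
n, reps = 1000, 1000
total = 0
for _ in range(reps):
    trials = [random.random() < 0.5 for _ in range(n)]
    total += count_sf(trials)
mean_ratio = total / (reps * n)
print(mean_ratio)   # near 1/4, in line with T_n / n -> 1/4 in probability
```
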
Example 7.9 (Some Counterexamples). Let $X_n$ be the sequence of two point random variables with pmf $P(X_n = 0) = 1 - \frac{1}{n}$, $P(X_n = n) = \frac{1}{n}$. Then $X_n$ converges in probability to zero. But $E(|X_n|) = 1$ for all $n$, and hence $X_n$ does not converge in $L_1$ to zero. In fact, it does not converge to zero in $L_p$ for any $p \ge 1$.
Now take the same sequence $X_n$ as above, and assume moreover that they are independent. Take an $\epsilon > 0$, and positive integers $m, k$. Then,
$$P(|X_n| < \epsilon \ \ \forall\, m \le n \le m+k) = P(X_n = 0 \ \ \forall\, m \le n \le m+k) = \prod_{n=m}^{m+k} \left(1 - \frac{1}{n}\right) = \frac{m-1}{m+k}.$$
$$P(|X_n - X| > \epsilon) = P(|X_n - X|^p > \epsilon^p) \le \frac{E(|X_n - X|^p)}{\epsilon^p} \to 0, \ \text{by hypothesis.} \;\Box$$
$$X_n \text{ converges in } L_p \text{ to } X \;\Rightarrow\; X_n \text{ converges in } L_1 \text{ to } X.$$
The definitions and the treatment in the previous section are for general sequences of
random variables. Averages and sums are sequences of special importance in appli-
cations. The classic laws of large numbers, which characterize the long run behavior
of averages, are given in this section. Truly, the behavior of averages and sums in
complete generality is very subtle, and is beyond the scope of this book. Specialized
treatments are available in Feller (1971), Révész (1968), and Kesten (1972).
A very useful tool for establishing almost sure convergence is stated first.
Lemma (Borel–Cantelli). Let $\{A_n\}$ be a sequence of events on a common sample space. (a) If $\sum_{n=1}^{\infty} P(A_n) < \infty$, then with probability one, only finitely many of the events $A_n$ occur. (b) If the $A_n$ are pairwise independent and $\sum_{n=1}^{\infty} P(A_n) = \infty$, then with probability one, infinitely many of the events $A_n$ occur.
Proof. We prove the first statement. In order that infinitely many among the events $A_n,\ n \ge 1$, occur, it is necessary and sufficient that given any $m$, there is at least one event among $A_m, A_{m+1}, \ldots$ that occurs. In other words,
$$P(A_n \text{ occur infinitely often}) = P\left(\bigcap_{m=1}^{\infty} \bigcup_{n=m}^{\infty} A_n\right) = \lim_{m \to \infty} P\left(\bigcup_{n=m}^{\infty} A_n\right) \le \lim_{m \to \infty} \sum_{n=m}^{\infty} P(A_n) = 0,$$
because, by assumption, $\sum_{n=1}^{\infty} P(A_n) < \infty$. □
Remark. Although pairwise independence suffices for the conclusion of the second
part of the Borel–Cantelli lemma, common applications involve cases where the An
are mutually independent.
The next example gives an application of the Borel–Cantelli lemma.
Example 7.10 (Tail Runs of Arbitrary Length). Most of us do not feel it likely that in tossing a coin repeatedly we will see many tails or many heads in succession. Problems of this kind were discussed in Chapter 1. This example shows that in some sense, that intuition is wrong.
Consider a sequence of independent Bernoulli trials in which success occurs with
probability p and failure with probability q D 1 p. Suppose p > q, so that
successes are more likely than failures. Consider a hypothetical long uninterrupted
run of $m$ failures, say $FF\ldots F$, for some fixed $m$. Break up the Bernoulli trials into nonoverlapping blocks of $m$ trials, and consider $A_n$ to be the event that the $n$th block consists of only failures. The probability of each $A_n$ is $q^m$, which is free of $n$. Therefore, $\sum_{n=1}^{\infty} P(A_n) = \infty$, and it follows from the second part of the Borel–Cantelli lemma that no matter how large $p$ may be, as long as $p < 1$, a string of consecutive failures of any given arbitrary length $m$ reappears infinitely many times
in the sequence of Bernoulli trials. In particular, if we keep tossing an ordinary coin,
then with certainty, we will see 1000 tails (or, 10,000 tails) in succession, and we
will see this occur again and again, infinitely many times, as our coin tosses continue
indefinitely.
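The block argument is easy to see numerically. The Python sketch below (ours, not from the text) splits trials into nonoverlapping blocks of length $m$ and counts all-failure blocks; the count grows in proportion to $q^m$ times the number of blocks, in line with the divergence of $\sum_n P(A_n)$.

```python
import random

random.seed(10)

p, m = 0.7, 2               # success probability p; look for m failures in a row
q = 1 - p
blocks = 10000
hits = 0
for _ in range(blocks):
    block = [random.random() < p for _ in range(m)]
    if not any(block):      # the whole block consists of failures
        hits += 1
print(hits, q**m * blocks)  # observed vs. expected number of all-failure blocks
```

Doubling the number of blocks roughly doubles the count, so the all-failure pattern keeps recurring, exactly as the lemma predicts.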
Here is another important application of the Borel–Cantelli lemma.
Example 7.11 (Almost Sure Convergence of Binomial Proportion). Let X1 ; X2 ; : : :
be an infinite sequence of independent Ber.p/ random variables, where 0 < p < 1.
Let
$$\bar{X}_n = \frac{S_n}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}.$$
It follows from the Borel–Cantelli lemma that the binomial sample proportion XNn
converges almost surely to p.
In fact, the convergence of the sample mean XNn to E.X1 / (i.e., the common mean
of the Xi ) holds in general. The general results, due to Khintchine and Kolmogorov,
are known as the laws of large numbers, stated below.
Theorem 7.3 (Weak Law of Large Numbers). Suppose $X_1, X_2, \ldots$ are independent and identically distributed (iid) random variables (defined on a common sample space $\Omega$), such that $E(|X_1|) < \infty$, and $E(X_1) = \mu$. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then $\bar{X}_n \stackrel{P}{\Rightarrow} \mu$.
Theorem 7.4 (Strong Law of Large Numbers). Suppose $X_1, X_2, \ldots$ are independent and identically distributed random variables (defined on a common sample space $\Omega$). Then $\bar{X}_n$ has an a.s. (almost sure) limit iff $E(|X_1|) < \infty$, in which case $\bar{X}_n \stackrel{a.s.}{\Rightarrow} \mu = E(X_1)$.
Remark. It is not very simple to prove either of the two laws of large numbers in the
generality stated above. We prove the weak law in Chapter 8, and the strong law in
Chapter 14. If the Xi have a finite variance, then Markov’s inequality easily leads to
the weak law. If Xi have four finite moments, then a careful argument on the lines
of our special binomial proportion example above does lead to the strong law. Once
again, the Borel–Cantelli lemma does the trick.
It is extremely interesting that existence of an expectation is not necessary for the WLLN (weak law of large numbers) to hold. That is, it is possible that $E(|X|) = \infty$, and yet $\bar{X}_n \stackrel{P}{\Rightarrow} a$ for some real number $a$. We describe this more precisely shortly. The SLLN (strong law of large numbers) already tells us that if $X_1, X_2, \ldots$ are independent with a common CDF $F$ (that is, iid), then the sample mean $\bar{X}_n$ does not have any almost sure limit if $E_F(|X|) = \infty$. An obvious question is what happens to $\bar{X}_n$ in such a case. A great deal of deep work has been done on this question, and there are book-length treatments of this issue. The following theorem gives a few key results only for easy reference.
Definition 7.8. Let $x$ be a real number. The positive and negative parts of $x$ are defined as
$$x^+ = \max\{x, 0\}; \qquad x^- = \max\{-x, 0\}.$$
That is, $x^+ = x$ when $x \ge 0$, and $0$ when $x \le 0$. On the other hand, $x^- = 0$ when $x \ge 0$, and $-x$ when $x \le 0$. Consequently, for any real number $x$,
$$x^+, x^- \ge 0; \qquad x = x^+ - x^-; \qquad |x| = x^+ + x^-.$$
Example 7.12. Let F be the CDF of the standard Cauchy distribution. Due to the symmetry, we get E(X⁺) = E(X⁻) = (1/π) ∫₀^∞ x/(1 + x²) dx = ∞. Therefore, from part (a) of the above theorem, with probability one, lim sup |X̄_n| = ∞ (i.e., the sequence of sample means cannot remain bounded). Also, from the statement of the strong law itself, the sequence will not settle down near any fixed real number.
The four simulated plots in Fig. 7.1 help illustrate these phenomena. In each plot, 1000 standard Cauchy values were simulated, and the sequence of means X̄_j = (1/j) Σ_{i=1}^j X_i was plotted, for j = 1, 2, …, 1000.
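These phenomena are easy to reproduce in a quick simulation. The following sketch (Python; not from the text, and the seed and the sample size of 1000 are arbitrary choices) tracks the running means of standard Cauchy draws:

```python
import numpy as np

# Running means of standard Cauchy observations, as in Fig. 7.1.
# The seed and sample size are arbitrary choices for illustration.
rng = np.random.default_rng(7)
x = rng.standard_cauchy(1000)
running_means = np.cumsum(x) / np.arange(1, 1001)

# The sequence of means keeps making large excursions instead of settling down.
print(np.abs(running_means).max())
```

Re-running with different seeds shows the same qualitative behavior: each X̄_j is again standard Cauchy distributed, so the path never stabilizes.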
Now we consider the possibility of a WLLN when the expectation does not exist. The answer is that the tail of F must not be too heavy. Here is the precise result.
[Fig. 7.1: Four simulated plots of the running means X̄_j, j = 1, …, 1000, of standard Cauchy observations; the sequences make repeated large excursions and do not settle down.]
In particular, if F is symmetric, x(1 − F(x)) → 0 as x → ∞, and ∫₀^∞ (1 − F(x)) dx = ∞, then E_F(|X|) = ∞, whereas X̄ →P 0.
Remark. See Feller (1971, p. 235) for a proof. It should be noted that the two conditions x(1 − F(x)) → 0 and ∫₀^∞ (1 − F(x)) dx = ∞ are not inconsistent. It is easy to find an F that satisfies both conditions.
We close this section with an important result on the uniform closeness of the
empirical CDF to the underlying CDF in the iid case.
Proof. The main idea of the proof is to discretize the problem and exploit Kolmogorov's SLLN.
Fix m and define a_0, a_1, …, a_m, a_{m+1} by the relationships [a_i, a_{i+1}) = {x : i/m ≤ F(x) < (i+1)/m}, i = 1, 2, …, m − 1, and a_0 = −∞, a_{m+1} = ∞. Now fix an i and look at x ∈ [a_i, a_{i+1}). Then

F_n(x) − F(x) ≤ F_n(a_{i+1}−) − F(a_i) ≤ F_n(a_{i+1}−) − F(a_{i+1}−) + 1/m;

F(x) − F_n(x) ≤ F(a_i) − F_n(a_i) + 1/m.
7.3 Convergence Preservation 259
sup_{x∈ℝ} |F_n(x) − F(x)| ≤ max_{1≤i≤m} { |F(a_i) − F_n(a_i)| + |F(a_i−) − F_n(a_i−)| } + 1/m.
For fixed m, as n → ∞, by the SLLN each of the terms within the absolute value signs above goes almost surely to zero, and so, for any fixed m, almost surely, lim_n sup_{x∈ℝ} |F_n(x) − F(x)| ≤ 1/m. Now let m → ∞, and the result follows. □
We have already seen the importance of being able to deal with transformations of
random variables in Chapters 3 and 4. This section addresses the question of when
convergence properties are preserved if we suitably transform a sequence of random
variables.
The next important theorem gives some frequently useful results that are analogous to corresponding results on convergence of sequences in calculus.
The next result is one of the most useful results on almost sure convergence and
convergence in probability. It says that convergence properties are preserved if we
make smooth transformations. However, the force of the result is partially lost if
we insist on the transformations being smooth everywhere. To give the most useful
version of the result, we need a technical definition.
P(X ∈ C(f)) = 1.

Then,

(a) X_n →P X ⇒ f(X_n) →P f(X);
(b) X_n →a.s. X ⇒ f(X_n) →a.s. f(X).
(1/n) Σ_{i=1}^n (X_i − X̄)² = (1/n) Σ_{i=1}^n X_i² − (X̄)²
(an algebraic identity). Because F has a finite variance, it also possesses a finite second moment, namely, E_F(X²) = μ² + σ² < ∞. By applying the strong law of large numbers to the sequence X_1², X_2², …, we get (1/n) Σ_{i=1}^n X_i² →a.s. E_F(X²) = μ² + σ². By applying the SLLN to the sequence X_1, X_2, …, we get X̄ →a.s. μ, and therefore by the continuous mapping theorem, (X̄)² →a.s. μ². Now, by the theorem on preservation of convergence, we get that (1/n) Σ_{i=1}^n X_i² − (X̄)² →a.s. μ² + σ² − μ² = σ², which finishes the proof.
F_n(x) = P(X_n ≤ x)
      = P(X_n ≤ x, |X_n − X| ≤ ε) + P(X_n ≤ x, |X_n − X| > ε)
      ≤ P(X ≤ x + ε) + P(|X_n − X| > ε).

Now let n → ∞ on both sides of the inequality. Then we get lim sup_n F_n(x) ≤ F(x + ε), because P(|X_n − X| > ε) → 0 by hypothesis. Now, letting ε → 0, we get lim sup_n F_n(x) ≤ F(x), because F(x + ε) → F(x) by right continuity of F.
The proof will be complete if we show that lim inf_n F_n(x) ≥ F(x). This is proved similarly, except we now start with P(X ≤ x − ε) on the left and follow the same steps. It should be mentioned that it is in this part that the continuity of F at x is used. □
Remark. The fact that if a sequence X_n of random variables converges in distribution, then the sequence must be O_p(1), tells us that there must be sequences of random variables that do not converge in distribution to anything. For
264 7 Essential Asymptotics and Applications
example, take X_n ∼ N(n, 1), n ≥ 1. This sequence X_n is not O_p(1), and therefore cannot converge in distribution. The question arises whether the O_p(1) property suffices for convergence. Even that, evidently, is not true; just consider X_{2n−1} ∼ N(0, 1) and X_{2n} ∼ N(1, 1). However, separately, the odd and the even subsequences do converge. That is, there might be a partial converse to the fact that if a sequence X_n converges in distribution, then it must be O_p(1). This is a famous theorem on convergence in distribution, and is stated below.
Example 7.17 (Various Convergence Phenomena Are Possible). This quick exam-
ple shows that a sequence of discrete distributions can converge in distribution to
a discrete distribution, or a continuous distribution, and a sequence of continuous
distributions can converge in distribution to a continuous one, or a discrete one.
A good example of discrete random variables converging in distribution to a discrete random variable is the sequence X_n ∼ Bin(n, 1/n). Although it was not explicitly put in the language of convergence in distribution, we saw in Chapter 6 that X_n converges to a Poisson random variable with mean one. A familiar example of discrete random variables converging in distribution to a continuous random variable is the de Moivre–Laplace central limit theorem (Chapter 1), which says that if X_n ∼ Bin(n, p), then (X_n − np)/√(np(1 − p)) converges to a standard normal variable.
Examples of continuous random variables converging to a continuous random variable are immediately available by using the general central limit theorem (also Chapter 1). For example, if the X_i are independent U[−1, 1] variables, then √n X̄/σ, where σ² = 1/3, converges to a standard normal variable. Finally, as an example
Theorem 7.14 (Pólya's Theorem). Let X_n, n ≥ 1, have CDF F_n, and let X have CDF F. If F is everywhere continuous, and if X_n →L X, then

sup_{x∈ℝ} |F_n(x) − F(x)| → 0

as n → ∞.
A large number of equivalent characterizations of convergence in distribution
are known. Collectively, these conditions are called the portmanteau theorem.
Note that the parts of the theorem are valid for real-valued random variables, or d-dimensional random variables, for any 1 ≤ d < ∞.
Example 7.19 (Weierstrass’s Theorem). Weierstrass’s theorem says that any con-
tinuous function on a closed bounded interval can be uniformly approximated by
polynomials. In other words, given a continuous function f .x/ on a bounded in-
terval, one can find a polynomial p.x/ (of a sufficiently large degree) such that
jp.x/ f .x/j is uniformly small. Consider the case of the unit interval; the case of
a general bounded interval reduces to this case.
Here we show pointwise convergence by using the portmanteau theorem.
Laws of large numbers are needed for establishing uniform approximability.
We give a constructive proof. Toward this, for n ≥ 1, 0 ≤ p ≤ 1, and a given continuous function g : [0, 1] → ℝ, define the sequence of Bernstein polynomials, B_n(p) = Σ_{k=0}^n g(k/n) (n choose k) p^k (1 − p)^{n−k}. Note that we can think of B_n(p) as B_n(p) = E[g(X/n)], where X ∼ Bin(n, p). As n → ∞, X/n →P p, and it follows that X/n →L δ_p, the one-point distribution concentrated at p (we have already seen that convergence in probability implies convergence in distribution). Because g is continuous and hence bounded, it follows from the portmanteau theorem that B_n(p) → g(p), at any p.
It is not hard to establish that B_n(p) − g(p) converges uniformly to zero as n → ∞. Here is a sketch. As above, X denotes a binomial random variable with parameters n and p. We need to use the facts that a continuous function on [0, 1] is also uniformly continuous and bounded. Thus, for any given ε > 0, we can find δ > 0 such that |x − y| < δ ⇒ |g(x) − g(y)| ≤ ε, and also we can find a finite C such that |g(x)| ≤ C for all x. So,

|B_n(p) − g(p)| = |E[g(X/n)] − g(p)|
    ≤ E|g(X/n) − g(p)|
    = E[ |g(X/n) − g(p)| I{|X/n − p| ≤ δ} ] + E[ |g(X/n) − g(p)| I{|X/n − p| > δ} ]
    ≤ ε + 2C P(|X/n − p| > δ).
Now, in the last line, just apply Chebyshev's inequality and bound the function p(1 − p) in Chebyshev's inequality by 1/4. It easily follows that for all large n, the second term 2C P(|X/n − p| > δ) is also ≤ ε, which means that for all large n, uniformly in p, |B_n(p) − g(p)| ≤ 2ε.
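The Bernstein construction is easy to try numerically. The sketch below (Python; not from the text, and the test function g(x) = |x − 1/2| and the evaluation grid are arbitrary choices) computes B_n(p) and its sup-distance to g on a grid:

```python
import math

def bernstein(g, n, p):
    # B_n(p) = sum_k g(k/n) C(n,k) p^k (1-p)^(n-k)
    return sum(g(k / n) * math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1))

g = lambda x: abs(x - 0.5)   # continuous on [0,1], not smooth at 1/2
grid = [i / 200 for i in range(201)]
sup_err = lambda n: max(abs(bernstein(g, n, p) - g(p)) for p in grid)

e20, e200 = sup_err(20), sup_err(200)
print(e20, e200)   # the sup error shrinks as n grows
```

The choice of a kink at 1/2 illustrates that only continuity, not smoothness, is needed for the uniform approximation.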
The most important result on convergence in distribution is the central limit the-
orem, which we have already seen in Chapter 1. The proof of the general case is
given later in this chapter; it requires some additional development.
√n (X̄ − μ) →L N_d(0, Σ).
Remark. If X_i, i ≥ 1, are iid with mean μ and variance σ², then the CLT in one dimension says that (S_n − nμ)/(σ√n) →L N(0, 1), where S_n is the nth partial sum X_1 + ⋯ + X_n. In particular, therefore, (S_n − nμ)/(σ√n) = O_p(1). In other words, in a distributional sense, (S_n − nμ)/(σ√n) stabilizes. If we take a large n, then for most sample points ω, |S_n(ω) − nμ|/(σ√n) will be, for example, less than 4. But as n changes, this collection of good sample points also changes. Indeed, any fixed sample point ω is one of the good sample points for certain values of n, and falls into the category of bad sample points for (many) other values of n. The law of the iterated logarithm says that if we fix ω and look at |S_n(ω) − nμ|/(σ√n) along such unlucky values of n, then (S_n(ω) − nμ)/(σ√n) will not appear to be stable. In fact, it will keep growing with n, although at a slow rate. Here is what the law of the iterated logarithm says.
Theorem 7.18 (Law of the Iterated Logarithm (LIL)). Let X_i, i ≥ 1, be iid with mean μ, variance σ² < ∞, and let S_n = Σ_{i=1}^n X_i, n ≥ 1. Then,

(a) lim sup_n (S_n − nμ)/√(2n log log n) = σ a.s.;
(b) lim inf_n (S_n − nμ)/√(2n log log n) = −σ a.s.;
(c) if finite constants a, τ satisfy

lim sup_n (S_n − na)/√(2n log log n) = τ,

then necessarily a = μ and τ = σ.
where Dg(μ) is the matrix of partial derivatives ((∂g_i/∂x_j))|_{x=μ}.
Proof. For part (a), we use the portmanteau theorem. Denote g(X_n) = Y_n, g(X) = Y, and consider bounded continuous functions f(Y_n). Now, f(Y_n) = f(g(X_n)) = h(X_n), where h(·) is the composition function f(g(·)). Because h is continuous (because f, g are), and h is bounded (because f is), the portmanteau theorem implies that E(h(X_n)) → E(h(X)), that is, E(f(Y_n)) → E(f(Y)). Now the reverse implication in the portmanteau theorem implies that Y_n →L Y.
We prove part (b) for the case d = p = 1. First note that it follows from the assumption c_n → ∞ that X_n − θ = o_p(1). Also, by an application of Taylor's theorem,

g(x_0 + h) = g(x_0) + h g′(x_0) + o(h)

if g is differentiable at x_0. Therefore,

g(X_n) = g(θ) + (X_n − θ) g′(θ) + o_p(X_n − θ).

That the remainder term is o_p(X_n − θ) follows from our observation that X_n − θ = o_p(1). Taking g(θ) to the left and multiplying by c_n, we obtain

c_n (g(X_n) − g(θ)) = g′(θ) c_n (X_n − θ) + o_p(1),

from which the result follows. □
Example 7.22 (A Quadratic Form). Let X_i, i ≥ 1, be iid random variables with finite mean μ and finite variance σ². By the central limit theorem, √n(X̄ − μ)/σ →L Z, where Z ∼ N(0, 1). Therefore, by the continuous mapping theorem, if Q_n = n(X̄ − μ)²/σ², then

Q_n = ( √n(X̄ − μ)/σ )² →L Z².

But Z² ∼ χ²_1. Therefore, Q_n →L χ²_1.
Next, by the WLLN, p̂ →P p, and hence by the continuous mapping theorem for convergence in probability, √(p̂(1 − p̂)) →P √(p(1 − p)). This gives, by Slutsky's theorem,

√n(p̂ − p)/√(p̂(1 − p̂)) →L N(0, 1).

Finally, because the absolute value function is continuous, by the continuous mapping theorem for convergence in distribution,

T_n = | √n(p̂ − p)/√(p̂(1 − p̂)) | →L |Z|,
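The pivot above leads to the familiar Wald confidence interval p̂ ± z_{α/2} √(p̂(1 − p̂)/n). A minimal sketch (Python; not from the text, and the data, 43 successes in n = 100 trials, are hypothetical):

```python
import math

def wald_ci(successes, n, z=1.96):
    # z = z_{alpha/2}; 1.96 corresponds to an approximate 95% interval
    phat = successes / n
    half = z * math.sqrt(phat * (1 - phat) / n)
    return phat - half, phat + half

lo, hi = wald_ci(43, 100)   # hypothetical data
print(lo, hi)
```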
Example 7.24. Suppose X_i, i ≥ 1, are iid with mean μ and variance σ² < ∞. By the central limit theorem, √n(X̄ − μ) →L σZ, where Z ∼ N(0, 1). Consider the function g(x) = x². This is continuously differentiable, in fact at any x, and g′(x) = 2x. If μ ≠ 0, then g′(μ) = 2μ ≠ 0. By the delta theorem, we get that √n(X̄² − μ²) →L N(0, 4μ²σ²). If μ = 0, this last statement is still true, and that means √n X̄² →P 0 if μ = 0.
Example 7.25 (Sample Variance and Standard Deviation). Suppose again X_i, i ≥ 1, are iid with mean μ, variance σ², and E(X_1⁴) < ∞. Also let μ_j = E(X_1 − μ)^j, 1 ≤ j ≤ 4. This example has d = 2, p = 1. Take

X_n = ( X̄, (1/n) Σ_{i=1}^n X_i² )′,  θ = ( EX_1, EX_1² )′,  Σ = [ Var(X_1)  Cov(X_1, X_1²) ; Cov(X_1, X_1²)  Var(X_1²) ].

By the multivariate central limit theorem, √n(X_n − θ) →L N(0, Σ). Now consider the function g(u, v) = v − u². This is once continuously differentiable with respect to each of u, v (in fact at any u, v), and the partial derivatives are g_u = −2u, g_v = 1. Using the delta theorem, with a little bit of matrix algebra, it follows that

√n ( (1/n) Σ_{i=1}^n (X_i − X̄)² − Var(X_1) ) →L N(0, μ_4 − σ⁴).
If we choose s_n² = Σ_{i=1}^n (X_i − X̄)²/(n − 1), then

√n (s_n² − σ²) = √n Σ_{i=1}^n (X_i − X̄)² / ((n − 1)n) + √n ( (1/n) Σ_{i=1}^n (X_i − X̄)² − σ² )
             = √n ( (1/n) Σ_{i=1}^n (X_i − X̄)² − σ² ) + o_p(1),

which also converges in law to N(0, μ_4 − σ⁴) by Slutsky's theorem. By another use of the delta theorem, this time with d = p = 1 and with the function g(u) = √u, one gets

√n (s_n − σ) →L N( 0, (μ_4 − σ⁴)/(4σ²) ).
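The limit law for s_n can be checked by Monte Carlo. A sketch (Python, not from the text): for N(0, 1) data, μ_4 = 3 and σ = 1, so the limit variance of √n(s_n − σ) should be (μ_4 − σ⁴)/(4σ²) = 1/2; the seed, n = 500, and 4000 replicates are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 4000
x = rng.normal(size=(reps, n))

s = x.std(axis=1, ddof=1)          # s_n for each replicate
stat = np.sqrt(n) * (s - 1.0)      # sqrt(n)(s_n - sigma) with sigma = 1
print(stat.var())                  # should be close to 1/2
```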
Example 7.26 (Sample Correlation). Another use of the delta theorem is the derivation of the limiting distribution of the sample correlation coefficient r for iid bivariate data (X_i, Y_i). We have

r_n = [ (1/n) Σ X_iY_i − X̄ Ȳ ] / [ √( (1/n) Σ (X_i − X̄)² ) √( (1/n) Σ (Y_i − Ȳ)² ) ].
By taking

T_n = ( X̄, Ȳ, (1/n) Σ X_i², (1/n) Σ Y_i², (1/n) Σ X_iY_i ),
θ = ( EX_1, EY_1, EX_1², EY_1², E(X_1Y_1) ),

an application of the delta theorem shows that √n(r_n − ρ) →L N(0, v²) for some v > 0. It is not possible to write a clean formula for v in general. If (X_i, Y_i) are iid N_2(μ_X, μ_Y, σ_X², σ_Y², ρ), then the calculation of v² can be done in closed form, and

√n (r_n − ρ) →L N(0, (1 − ρ²)²).
P( T_n − z_{α/2} σ(θ)/√n ≤ θ ≤ T_n + z_{α/2} σ(θ)/√n ) ≈ 1 − α,

where α is some specified number in (0, 1) and z_{α/2} = Φ⁻¹(1 − α/2). Finally, plugging in T_n in place of θ in σ(θ), a confidence interval for θ is T_n ± z_{α/2} σ(T_n)/√n. The delta theorem provides an alternative solution that is sometimes preferred. By the delta theorem, if g(·) is once differentiable at θ with g′(θ) ≠ 0, then

√n ( g(T_n) − g(θ) ) →L N(0, [g′(θ)]² σ²(θ)).
7.5 Preservation of Convergence and Statistical Applications 273
Therefore, if we set

[g′(θ)]² σ²(θ) = k²

for some constant k, then √n(g(T_n) − g(θ)) →L N(0, k²), and this produces a confidence interval for g(θ):

g(T_n) ± z_{α/2} k/√n.

The reason that this one is sometimes preferred to the first confidence interval, namely T_n ± z_{α/2} σ(T_n)/√n, is that no additional plug-in is necessary to estimate the penultimate variance function in this second confidence interval. The penultimate variance function is already a constant k² by choice in this second method. The transformation g(T_n) obtained from its defining property

[g′(θ)]² σ²(θ) = k²

is called a variance-stabilizing transformation (VST).
Example 7.28 (Fisher's z). Suppose (X_i, Y_i), i = 1, …, n, are iid bivariate normal with parameters μ_X, μ_Y, σ_X², σ_Y², ρ. Then, as we saw above, √n(r_n − ρ) →L N(0, (1 − ρ²)²), r_n being the sample correlation coefficient. Therefore,

g(ρ) = ∫ 1/(1 − ρ²) dρ = (1/2) log( (1 + ρ)/(1 − ρ) ) = arctanh(ρ)

provides a variance-stabilizing transformation for r_n. This is the famous arctanh transformation of Fisher, popularly known as Fisher's z. By the delta theorem, √n(arctanh(r_n) − arctanh(ρ)) converges in distribution to the N(0, 1) distribution.
Confidence intervals for ρ are computed from Fisher's z as

tanh( arctanh(r_n) ± z_{α/2}/√n ).
The arctanh transformation z = arctanh(r_n) attains approximate normality much more quickly than r_n itself.
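A minimal sketch of the interval (Python; not from the text, and the values r = 0.6 and n = 100 are hypothetical):

```python
import math

def fisher_ci(r, n, z=1.96):
    # tanh(arctanh(r) +/- z_{alpha/2}/sqrt(n)); z = 1.96 for approx. 95%
    c = math.atanh(r)
    h = z / math.sqrt(n)
    return math.tanh(c - h), math.tanh(c + h)

lo, hi = fisher_ci(0.6, 100)   # hypothetical data
print(lo, hi)                  # interval lies inside (-1, 1) by construction
```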
Example 7.29 (An Unusual VST). Here is a nonregular example on variance stabilization. Suppose we have iid observations X_1, X_2, … from the U[0, θ] distribution. Then the usual estimate of θ is the sample maximum X_(n), and n(θ − X_(n)) →L Exp(θ). The asymptotic variance function in the distribution of the sample maximum is therefore simply θ², and therefore a VST is

g(θ) = ∫ (1/θ) dθ = log θ.

So, g(X_(n)) = log X_(n) is a variance-stabilizing transformation of X_(n). In fact, n(log θ − log X_(n)) →L Exp(1). However, the interesting fact is that for every n, the distribution of n(log θ − log X_(n)) is exactly a standard exponential. There is no nontrivial example such as this in the regular cases (although N(θ, 1) is a trivial example).
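The exactness claim is easy to confirm directly: P(n(log θ − log X_(n)) > t) = P(X_(n) < θe^{−t/n}) = (e^{−t/n})^n = e^{−t} for every n. A simulation sketch (Python; not from the text, and θ = 2, n = 25, and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 2.0, 25, 200_000
xmax = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)   # X_(n) per replicate
w = n * (np.log(theta) - np.log(xmax))

for t in (0.5, 1.0, 2.0):
    print(t, (w > t).mean(), np.exp(-t))   # empirical vs. exact Exp(1) survival
```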
Therefore, E(|X_n − X|^p) → 0. □
Remark. Sometimes we do not need the full force of the result that E(|X_n − X|^p) → 0; all we want is that E(X_n^p) converges to E(X^p). In that case, the conditions in the previous theorem can be relaxed, and in fact, from a statistical point of view, the relaxed condition is much more natural. The following result gives the relaxed conditions.
Theorem 7.22. Suppose X_n, X, n ≥ 1, are defined on a common sample space Ω, that X_n →L X, and that for some given p ≥ 1, |X_n|^p is uniformly integrable. Then E(X_n^k) → E(X^k) for all 1 ≤ k ≤ p.
Remark. To apply these last two theorems, we have to verify that for the appropriate sequence X_n, and for the relevant p, |X_n|^p is uniformly integrable. Direct verification of uniform integrability from the definition is often cumbersome. But simple sufficient conditions are available, and these are often satisfied in many applications. The next result lists a few useful sufficient conditions.
Theorem 7.23 (Sufficient Conditions for Uniform Integrability).

(a) Suppose for some δ > 0, sup_n E|X_n|^{1+δ} < ∞. Then {X_n} is uniformly integrable.
(b) If |X_n| ≤ Y, n ≥ 1, and E(Y) < ∞, then {X_n} is uniformly integrable.
(c) If |X_n| ≤ Y_n, n ≥ 1, and Y_n is uniformly integrable, then {X_n} is uniformly integrable.
(d) If X_n, n ≥ 1, are identically distributed and E(|X_1|) < ∞, then {X_n} is uniformly integrable.
(e) If {X_n} and {Y_n} are uniformly integrable, then {X_n + Y_n} is uniformly integrable.
(f) If {X_n} is uniformly integrable and |Y_n| ≤ M < ∞, then {X_nY_n} is uniformly integrable.
See Chow and Teicher (1988, p. 94) for further details on the various parts of
this theorem.
Example 7.30 (Sample Maximum). We saw in Chapter 6 that if X_1, X_2, … are iid, and if E(|X_1|^k) < ∞, then any order statistic X_(r) satisfies

E(|X_(r)|^k) ≤ [ n! / ((r − 1)!(n − r)!) ] E(|X_1|^k).

In particular, for the sample maximum X_(n) of n observations,

E(|X_(n)|) ≤ n E(|X_1|) ⇒ E( |X_(n)|/n ) ≤ E(|X_1|).

By itself, this does not ensure that |X_(n)|/n is uniformly integrable. However, if we also assume that E(X_1²) < ∞, then the same argument gives E(|X_(n)|²) ≤ n E(X_1²), so that sup_n E( |X_(n)|/n )² < ∞, which is enough to conclude that |X_(n)|/n is uniformly integrable.
7.6 Convergence of Moments 277
However, we do not need the existence of E(X_1²) for this conclusion. Note that

|X_(n)| ≤ Σ_{i=1}^n |X_i| ⇒ |X_(n)|/n ≤ (1/n) Σ_{i=1}^n |X_i|.

If E(|X_1|) < ∞, then in fact (1/n) Σ_{i=1}^n |X_i| is uniformly integrable, and as a consequence, |X_(n)|/n is also uniformly integrable under just the condition E(|X_1|) < ∞.
Remark. See Feller (1971, pp. 227–228 and p. 251) for the previous two theorems.
Basically, part (b) is the primary result here, because if the conditions in (a) or (c)
hold, then the mgf exists in an interval containing zero. However, (a) and (c) are
useful special sufficient conditions.
for the function f(x) = x^k on (0, 1]. Therefore, as n → ∞, E(X_n^k), which is the upper Riemann sum, converges to ∫₀¹ x^k dx, which is the kth moment of a random variable X having the uniform distribution on the unit interval. Because k is arbitrary, it follows from part (a) of the above theorem that the discrete uniform distribution on {1/n, 2/n, …, 1} converges to the uniform distribution on the unit interval.
t(x_1, x_2, …, x_d) ≈ t(μ_1, …, μ_d) + Σ_{i=1}^d (x_i − μ_i) t_i(μ_1, …, μ_d),

and

t(x_1, x_2, …, x_d) ≈ t(μ_1, …, μ_d) + Σ_{i=1}^d (x_i − μ_i) t_i(μ_1, …, μ_d) + (1/2) Σ_{1≤i,j≤d} (x_i − μ_i)(x_j − μ_j) t_{ij}(μ_1, …, μ_d).

If we formally take an expectation on both sides, we get the first- and second-order approximations to E[T_d(X_1, X_2, …, X_d)]:

E[T_d(X_1, X_2, …, X_d)] ≈ T_d(μ_1, …, μ_d),

and

E[T_d(X_1, X_2, …, X_d)] ≈ T_d(μ_1, …, μ_d) + (1/2) Σ_{1≤i,j≤d} t_{ij}(μ_1, …, μ_d) σ_{ij},

where σ_{ij} = Cov(X_i, X_j). Next, starting from the first-order Taylor approximation

t(x_1, x_2, …, x_d) ≈ t(μ_1, …, μ_d) + Σ_{i=1}^d (x_i − μ_i) t_i(μ_1, …, μ_d),

by formally taking the variance of both sides, we get the first-order variance approximation

Var(T_d(X_1, X_2, …, X_d)) ≈ Var( Σ_{i=1}^d (X_i − μ_i) t_i(μ_1, …, μ_d) ) = Σ_{1≤i,j≤d} t_i(μ_1, …, μ_d) t_j(μ_1, …, μ_d) σ_{ij}.
The second-order variance approximation takes more work. By using the second-order Taylor approximation for t(x_1, x_2, …, x_d), the second-order variance approximation is

Var(T_d(X_1, X_2, …, X_d)) ≈ Var( Σ_i (X_i − μ_i) t_i(μ_1, …, μ_d) )
    + (1/4) Var( Σ_{i,j} (X_i − μ_i)(X_j − μ_j) t_{ij}(μ_1, …, μ_d) )
    + Cov( Σ_i (X_i − μ_i) t_i(μ_1, …, μ_d), Σ_{j,k} (X_j − μ_j)(X_k − μ_k) t_{jk}(μ_1, …, μ_d) ),

where the middle term expands as

(1/4) Σ_{i,j,k,l} t_{ij}(μ_1, …, μ_d) t_{kl}(μ_1, …, μ_d) Cov( (X_i − μ_i)(X_j − μ_j), (X_k − μ_k)(X_l − μ_l) ).

In one dimension (d = 1), with mean μ and variance σ², this reduces to

Var(T(X)) ≈ [t′(μ)]² σ² + t′(μ) t″(μ) E(X − μ)³ + (1/4) [t″(μ)]² [ E(X − μ)⁴ − σ⁴ ].
Example 7.32. Let X, Y be two jointly distributed random variables, with means μ_1, μ_2, variances σ_1², σ_2², and covariance σ_12. We work out the second-order approximation to the expectation of T(X, Y) = XY. Writing t for T as above, the various relevant partial derivatives are t_x = y, t_y = x, t_xx = t_yy = 0, t_xy = 1. Plugging into the general formula for the second-order approximation to the mean, we get E(XY) ≈ μ_1μ_2 + (1/2)[σ_12 + σ_21] = μ_1μ_2 + σ_12. Thus, in this case, the second-order approximation reproduces the exact mean of XY.
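That the approximation is exact here is just the identity E(XY) = μ_1μ_2 + Cov(X, Y). A quick check on a small discrete joint distribution (Python; not from the text, and the pmf is made up):

```python
# A made-up joint pmf on four points; any valid pmf would do.
pmf = {(0, 1): 0.2, (1, 0): 0.3, (1, 2): 0.4, (2, 2): 0.1}

mu1 = sum(p * x for (x, y), p in pmf.items())
mu2 = sum(p * y for (x, y), p in pmf.items())
exy = sum(p * x * y for (x, y), p in pmf.items())
cov = sum(p * (x - mu1) * (y - mu2) for (x, y), p in pmf.items())

# The second-order approximation mu1*mu2 + cov reproduces E(XY) exactly.
print(exy, mu1 * mu2 + cov)
```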
Example 7.33 (A Multidimensional Example). Let X = (X_1, X_2, …, X_d) have mean vector μ and covariance matrix Σ. Assume that μ is not the null vector. We find a second-order approximation to E(||X||). Denoting T(x_1, …, x_d) = ||x||, the successive partial derivatives are

t_i(μ) = μ_i/||μ||,  t_ii(μ) = 1/||μ|| − μ_i²/||μ||³,  t_ij(μ) = −μ_iμ_j/||μ||³  (i ≠ j).

Plugging these into the general formula for the second-order approximation of the expectation, on some algebra, we get the approximation

E(||X||) ≈ ||μ|| + tr(Σ)/(2||μ||) − Σ_i μ_i² σ_ii/(2||μ||³) − Σ_{i≠j} μ_iμ_j σ_ij/(2||μ||³)
        = ||μ|| + (1/(2||μ||)) [ tr(Σ) − μ′Σμ/(μ′μ) ].
The ratio μ′Σμ/(μ′μ) varies between the minimum and the maximum eigenvalue of Σ, whereas tr(Σ) equals the sum of all the eigenvalues. Thus, tr(Σ) − μ′Σμ/(μ′μ) ≥ 0, Σ being a nonnegative definite (nnd) matrix, which implies that the approximation ||μ|| + (1/(2||μ||))[tr(Σ) − μ′Σμ/(μ′μ)] is ≥ ||μ||. This is consistent with the bound E(||X||) ≥ ||μ||, as is implied by Jensen's inequality.
The second-order variance approximation is difficult to work out in this example. However, the first-order approximation is easily worked out, and gives

Var(||X||) ≈ μ′Σμ/(μ′μ).
one can make substantial algebraic simplification towards calculating the variance
of m2 . Indeed,
The expectation of each term above can be found by using the independence of
the Xi and the zero mean assumption, and interestingly, in fact the variance of m2
can be thus found exactly for any n, namely,
Var(m_2) = (1/n³) [ (n − 1)² (μ_4 − σ⁴) + 2(n − 1) σ⁴ ].

The approximate methods would have produced the answer

Var(m_2) ≈ (μ_4 − σ⁴)/n.
It is useful to know that the approximate methods would likewise produce the gen-
eral first-order variance approximation
Theorem 7.26. Suppose X_1, X_2, … are iid observations with a finite fourth moment. Let E(X_1) = μ and Var(X_1) = σ². Let g be a scalar function with four uniformly bounded derivatives. Then

(a) E[g(X̄)] = g(μ) + g″(μ)σ²/(2n) + O(n⁻²);
(b) Var[g(X̄)] = [g′(μ)]²σ²/n + O(n⁻²).
See Bickel and Doksum (2007) for a proof of this theorem.
Remark. Again, actually the set A is not completely arbitrary. We do need the
restriction that A be a Borel set, a concept in measure theory. However, we make no
further mention of this qualification.
The relation between total variation distance and densities when the random vari-
ables X; Y are continuous is described by the following result.
7.7 Convergence of Densities and Scheffé’s Theorem 283
Example 7.36 (Total Variation Distance Between Two Normals). Total variation
distance is usually hard to find in closed analytical form. The absolute value
sign makes closed-form calculations difficult. It is, however, possible to write a
closed-form formula for the total variation distance between two arbitrary normal
distributions in one dimension. No such formula would be possible in higher dimen-
sions.
Let X ∼ N(μ_1, σ_1²), Y ∼ N(μ_2, σ_2²). We use the result that d_TV(X, Y) = (1/2) ∫_{−∞}^∞ |f(x) − g(x)| dx, where f, g are the densities of X, Y. To evaluate the integral of |f(x) − g(x)|, we need to find the set of all values of x for which f(x) ≥ g(x). We assume that σ_1 > σ_2, and use the notation

c = σ_1/σ_2,  Δ = (μ_1 − μ_2)/σ_2,

A = [ −cΔ − √( 2(c² − 1) log c + Δ² ) ] / (c² − 1),  B = [ −cΔ + √( 2(c² − 1) log c + Δ² ) ] / (c² − 1).
The case σ_1 = σ_2 is commented on below.
With this notation, by making a change of variable,

∫_{−∞}^∞ |f(x) − g(x)| dx = ∫_{−∞}^∞ |φ(z) − cφ(Δ + cz)| dz,
In particular, if f_n, f are all density functions, and if f_n(x) → f(x) for all x, then ∫_{−∞}^∞ |f_n(x) − f(x)| dx → 0.
Now note that min{f_n(x), f(x)} → f(x) for all x, as f_n(x) → f(x) for all x, and min{f_n(x), f(x)} ≤ f(x). Therefore, by the dominated convergence theorem (see the previous section), ∫_{−∞}^∞ min{f_n(x), f(x)} dx → ∫_{−∞}^∞ f(x) dx. The pointwise algebraic identity |f_n(x) − f(x)| = f_n(x) + f(x) − 2 min{f_n(x), f(x)} now gives that

∫_{−∞}^∞ |f_n(x) − f(x)| dx → ∫_{−∞}^∞ f(x) dx + ∫_{−∞}^∞ f(x) dx − 2 ∫_{−∞}^∞ f(x) dx = 0,
Remark. This is an easily stated result covering many examples. But better results
are available. Feller (1971) is an excellent reference for some of the better results,
which, however, involve more complex concepts.
It follows that P(X_n ≤ x) → Φ((x − μ)/σ) for any fixed x, and so X_n converges in distribution to another normal, namely to X ∼ N(μ, σ²). Now, either by direct verification or from Ibragimov's theorem, we have that X_n also converges to X in total variation. The converse is also true. That is, if X_n ∼ N(μ_n, σ_n²), n ≥ 1, then X_n can either converge to a one-point distribution or to another normal distribution, say N(μ, σ²), in which case μ_n → μ, σ_n → σ, and convergence in total variation also holds. Conversely, if μ_n → μ, σ_n → σ > 0, then X_n converges in total variation to X ∼ N(μ, σ²).
Exercises
Exercise 7.1. (a) Show that X_n → c in quadratic mean (i.e., E(X_n − c)² → 0) if and only if E(X_n − c) and Var(X_n) both converge to zero.
(b) Show by an example (different from the text) that convergence in probability does not necessarily imply almost sure convergence.
Exercise 7.2. (a) Suppose E|X_n − c|^α → 0, where 0 < α < 1. Does X_n necessarily converge in probability to c?
(b) Suppose a_n(X_n − θ) →L N(0, 1). Under what condition on a_n can we conclude that X_n →P θ?
(c) o_p(1) + O_p(1) = ?
(d) o_p(1) O_p(1) = ?
(e) o_p(1) + o_p(1) O_p(1) = ?
(f) Suppose X_n →L X. Then, o_p(1) X_n = ?
Exercise 7.3 (Monte Carlo). Consider the purely mathematical problem of finding a definite integral ∫ f(x) dx for some (possibly complicated) function f(x). Show that the SLLN provides a method for approximately finding the value of the integral by using appropriate averages (1/n) Σ_{i=1}^n f(X_i). Numerical analysts call this Monte Carlo integration.
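A sketch of the method (Python, not from the text): estimate ∫₀¹ e^x dx = e − 1 by averaging f over uniform draws; the seed and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0.0, 1.0, size=1_000_000)   # X_i ~ U[0, 1]
estimate = np.exp(x).mean()                 # (1/n) sum f(X_i)
print(estimate, np.e - 1)                   # estimate vs. exact value
```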
as n ! 1.
Exercise 7.10. * Let X_n be any sequence of random variables. Prove that one can always find a sequence of numbers c_n such that X_n/c_n →a.s. 0.
Exercise 7.11 (Sample Maximum). Let X_i, i ≥ 1, be an iid sequence, and X_(n) the maximum of X_1, …, X_n. Let ξ(F) = sup{x : F(x) < 1}, where F is the common CDF of the X_i. Prove that X_(n) →a.s. ξ(F).
Exercise 7.15. Suppose X_i are iid standard exponential. Show that lim sup_n X_n/log n = 1 with probability 1.
Exercise 7.17. Let X_n ∼ Bin(n, p). Show that (X_n/n)² and X_n(X_n − 1)/(n(n − 1)) both converge in probability to p². Do they also converge almost surely?
Exercise 7.18. Suppose X_1, …, X_n are iid standard exponential variables, and let S_n = X_1 + ⋯ + X_n. Apply the Chernoff–Bernstein inequality (see Chapter 1) to show that for c > 1,

P(S_n > cn) ≤ e^{−n(c − 1 − ln c)},

and hence that P(S_n > cn) → 0 exponentially fast.
G_n = (X_1 X_2 ⋯ X_n)^{1/n}.

Find c such that G_n →P c.
Exercise 7.25. Give an example of a sequence {X_n} and an X such that X_n →L X, E(X_n) → E(X), but E(|X_n|) does not converge to E(|X|).
Exercise 7.26. Suppose X_n has a normal distribution with mean μ_n and variance σ_n². Let μ_n → μ and σ_n → σ as n → ∞. What is the limiting distribution of X_n?
Exercise 7.27 (Delta Theorem). Suppose X_1, X_2, … are iid with mean μ and variance σ², a finite fourth moment, and let Z ∼ N(0, 1).

(a) Show that √n(X̄² − μ²) →L 2μσZ.
(b) Show that √n(e^{X̄} − e^μ) →L σe^μ Z.
(c) Show that √n( log( (1/n) Σ_{i=1}^n (X_i − X̄)² ) − log σ² ) →L (1/σ²)(EX_1⁴)^{1/2} Z.
Exercise 7.28 (Asymptotic Variance and True Variance). Let X_1, X_2, … be iid observations from a CDF F with four finite moments. For each of the following cases, find the exact variance of m_2 = (1/n) Σ_{i=1}^n (X_i − X̄)² by using the formula in the text, and also find the asymptotic variance by using the formula in the text. Check when the true variance is larger than the asymptotic variance.

(a) F = N(μ, σ²).
(b) F = Exp(λ).
(c) F = Poi(λ).
Exercise 7.29 (All Distributions as Limits of Discrete). Show that any distribu-
tion on Rd is the limit in distribution of distributions on Rd that are purely discrete
with finitely many values.
Exercise 7.30 (Conceptual). Suppose X_n →L X, and also Y_n →L X. Does this mean that X_n − Y_n converges in distribution to (the point mass at) zero?
Exercise 7.31. (a) Suppose a_n(X_n − μ) →L N(0, σ²); what can be said about the limiting distribution of |X_n| when μ ≠ 0? When μ = 0?
(b) * Suppose X_i are iid Bernoulli(p); what can be said about the limiting distribution of the sample variance s² when p = 1/2? When p ≠ 1/2?
Exercise 7.32 (Delta Theorem). Suppose X_1, X_2, … are iid Poi(λ). Find the limiting distribution of e^{−X̄}.
Remark. It is meant that on suitable centering and norming, you will get a nonde-
generate limiting distribution.
Exercise 7.33 (Delta Theorem). Suppose X_1, X_2, … are iid N(μ, 1). Find the limiting distribution of Φ(X̄), where Φ, as usual, is the standard normal CDF.
Exercise 7.34 * (Delta Theorem with Lack of Smoothness). Suppose X_1, X_2, … are iid N(μ, 1). Find the limiting distribution of |X̄| when
(a) μ ≠ 0.
(b) μ = 0.
Exercise 7.35 (Delta Theorem). For each F below, find the limiting distributions of X̄/s and s/X̄:
(i) F = U[0, 1]; (ii) F = Exp(λ); (iii) F = χ²(p).
Exercise 7.36 * (Delta Theorem). Suppose X_1, X_2, … are iid N(μ, σ²). Let

b_1 = [ (1/n) Σ (X_i − X̄)³ ] / [ (1/n) Σ (X_i − X̄)² ]^{3/2}  and  b_2 = [ (1/n) Σ (X_i − X̄)⁴ ] / [ (1/n) Σ (X_i − X̄)² ]²

be the sample skewness and kurtosis coefficients. Find the joint limiting distribution of (b_1, b_2).
Exercise 7.37 * (Slutsky). Let X_n, Y_m be independent Poisson with means n and m, m, n ≥ 1. Find the limiting distribution of (X_n − Y_m − (n − m))/√(X_n + Y_m) as n, m → ∞.
Exercise 7.38 (Approximation of Mean and Variance). Let X ∼ Bin(n, p). Find the first- and the second-order approximations to the mean and variance of X/(n − X).
Exercise 7.39 (Approximation of Mean and Variance). Let X ∼ Poi(λ). Find the first- and the second-order approximations to the mean and variance of e^{−X}. Compare to the exact mean and variance by consideration of the mgf of X.
Exercise 7.40 (Approximation of Mean and Variance). Let X_1, …, X_n be iid N(μ, σ²). Find the first- and the second-order approximations to the mean and variance of Φ(X̄).
Exercise 7.41 * (Approximation of Mean and Variance). Let X ∼ Bin(n, p). Find the first- and the second-order approximations to the mean and variance of arcsin(√(X/n)).
(a) N(μ, σ²).
(b) Exp(λ).
(c) U[μ − 1, μ + 1].
Exercise 7.46 * (Coefficient of Variation). Given a set of positive iid random variables X_1, X_2, …, X_n, the coefficient of variation (CV) is defined as CV = s/X̄. Find a second-order approximation to its mean, and a first-order approximation to its variance, in terms of suitable moments of the distribution of the X_i. Make a note of how many finite moments you need for each approximation to make sense.
Exercise 7.49. For each of the following cases, evaluate the total variation distance
between the indicated distributions:
(a) N(0, 1) and C(0, 1).
(b) N(0, 1) and N(0, 10⁴).
(c) C(0, 1) and C(0, 10⁴).
Exercise 7.50 (Plotting the Variation Distance). Calculate and plot (as a function of μ) d_TV(X, Y) if X ∼ N(0, 1), Y ∼ N(μ, 1).
Exercise 7.51 (Convergence of Densities). Let Z ∼ N(0, 1) and Y be independent of Z. Let X_n = Z + Y/n, n ≥ 1.

(a) Prove by direct calculation that the density of X_n converges pointwise to the standard normal density in each of the following cases:
(i) Y ∼ N(0, 1);
(ii) Y ∼ U[0, 1];
(iii) Y ∼ Exp(1).
(b) Hence, or by using Ibragimov's theorem, prove that X_n → Z in total variation.
Exercise 7.52. Show that dT V .X; Y / P .X ¤ Y /:
p
Exercise 7.53. Suppose X1 ; X2 ; : : : are iid Exp.1/. Does n.XN 1/ converge to
standard normal in total variation? Prove or disprove.
Exercise 7.54 * (Minimization of Variation Distance). Let X U Œa; a and
Y N.0; 1/. Find a that minimizes dT V .X; Y /.
References
Ash, R. (1973). Real Analysis and Probability, Academic Press, New York.
Bhattacharya, R. and Rao, R. (1986). Normal Approximation and Asymptotic Expansions, Wiley,
New York.
Bickel, P. and Doksum, K. (2007). Mathematical Statistics: Basic Ideas and Selected Topics, Prentice Hall, Upper Saddle River, NJ.
Breiman, L. (1968). Probability, Addison-Wesley, Reading, MA.
Chow, Y. and Teicher, H. (1988). Probability Theory, 3rd ed., Springer, New York.
Cramér, H. (1946). Mathematical Methods of Statistics, Princeton University Press, Princeton, NJ.
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability, Springer, New York.
Feller, W. (1971). An Introduction to Probability Theory and Its Applications, Wiley, New York.
Ferguson, T. (1996). A Course in Large Sample Theory, Chapman and Hall, New York.
Hall, P. (1997). Bootstrap and Edgeworth Expansions, Springer, New York.
Hall, P. and Heyde, C. (1980). Martingale Limit Theory and Its Applications, Academic Press,
New York.
Kesten, H. (1972). Sums of independent random variables, Ann. Math. Statist., 43, 701–732.
Lehmann, E. (1999). Elements of Large Sample Theory, Springer, New York.
Petrov, V. (1975). Limit Theorems of Probability Theory, Oxford University Press, London.
Port, S. (1994). Theoretical Probability for Applications, Wiley, New York.
Reiss, R. (1989). Approximate Distribution of Order Statistics, Springer-Verlag, New York.
Révész, P. (1968). The Laws of Large Numbers, Academic Press, New York.
Sen, P. K. and Singer, J. (1993). Large Sample Methods in Statistics, Chapman and Hall, New York.
Serfling, R. (1980). Approximation Theorems of Mathematical Statistics, Wiley, New York.
van der Vaart, A. W. (1998). Asymptotic Statistics, Cambridge University Press, Cambridge, UK.
Chapter 8
Characteristic Functions and Applications
Proposition.
(a) For any real-valued random variable $X$, $\phi(t)$ exists for any $t$; $|\phi(t)| \le 1$, and $\phi(0) = 1$.
(b) $\phi_{-X}(t) = \phi_X(-t) = \overline{\phi_X(t)}$.
(c) For any real-valued random variables $X, Y$ such that $X, Y$ are independent, $\phi_{X+Y}(t) = \phi_X(t)\,\phi_Y(t)$.
(d) If $Y = a + bX$, then $\phi_Y(t) = e^{ita}\,\phi_X(bt)$.
(e) If $\phi_X(t) = \phi_Y(t)\ \forall t$, then $X, Y$ have the same distribution; that is, a cf determines the distribution.
(f) A random variable $X$ has a distribution symmetric about zero if and only if its characteristic function is real and even.
(g) The cf of any real-valued random variable $X$ is continuous, and even uniformly continuous on the whole real line.
Proof. $|\phi(t)| = |E(e^{itX})| \le E(|e^{itX}|) = E(1) = 1$; it is obvious that $\phi(0) = 1$. Next, $\phi_{-X}(t) = E(e^{it(-X)}) = E(e^{-itX}) = E(\cos tX) - iE(\sin tX) = \phi_X(-t) = \overline{\phi_X(t)}$. If $X, Y$ are independent, $\phi_{X+Y}(t) = E(e^{it(X+Y)}) = E(e^{itX}e^{itY}) = E(e^{itX})E(e^{itY}) = \phi_X(t)\,\phi_Y(t)$. Part (d) is obvious. Part (e) is proved later. For part (f), $X$ has a distribution symmetric about zero if and only if $X$ and $-X$ have the same distribution, if and only if $\phi_X(t) = \phi_{-X}(t) = \overline{\phi_X(t)}$. Part (g) can be proved by using simple inequalities on the exponential function and, at the same time, the dominated convergence theorem. We leave this as a short exercise for the reader. □
Distribution | Density/pmf | cf
Point mass at $a$ | — | $e^{ita}$
Bernoulli($p$) | $p^x(1-p)^{1-x}$ | $1 - p + pe^{it}$
Binomial($n, p$) | $\binom{n}{x}p^x(1-p)^{n-x}$ | $(1 - p + pe^{it})^n$
Poisson($\lambda$) | $e^{-\lambda}\frac{\lambda^x}{x!}$ | $e^{\lambda(e^{it}-1)}$
Geometric($p$) | $p(1-p)^{x-1}$ | $\frac{pe^{it}}{1-(1-p)e^{it}}$
Beta($\alpha, \alpha$) | $\frac{\Gamma(2\alpha)}{[\Gamma(\alpha)]^2}x^{\alpha-1}(1-x)^{\alpha-1}$ | $\Gamma(\alpha+\tfrac12)\left(\frac{4}{t}\right)^{\alpha-\frac12}J_{\alpha-\frac12}(t/2)\,[\cos(t/2)+i\sin(t/2)],\ t>0$
Exponential($\lambda$) | $\frac{1}{\lambda}e^{-x/\lambda}$ | $\frac{1}{1-i\lambda t}$
Gamma($\alpha, \lambda$) | $\frac{e^{-x/\lambda}x^{\alpha-1}}{\lambda^{\alpha}\Gamma(\alpha)}$ | $\left[\frac{1}{1-i\lambda t}\right]^{\alpha}$
Double exponential($\lambda$) | $\frac{1}{2\lambda}e^{-|x|/\lambda}$ | $\frac{1}{1+\lambda^2 t^2}$
Normal($\mu, \sigma^2$) | $\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)}$ | $e^{it\mu - t^2\sigma^2/2}$
Cauchy($\mu, \sigma^2$) | $\frac{1}{\pi\sigma\left[1+\frac{(x-\mu)^2}{\sigma^2}\right]}$ | $e^{it\mu - \sigma|t|}$
$t$ ($a$ degrees of freedom) | $\frac{\Gamma(\frac{a+1}{2})}{\sqrt{a\pi}\,\Gamma(\frac{a}{2})\left(1+\frac{x^2}{a}\right)^{(a+1)/2}}$ | $\frac{a^{a/4}}{2^{a/2-1}\Gamma(\frac{a}{2})}\,|t|^{a/2}K_{\frac{a}{2}}(\sqrt{a}\,|t|)$
Multivariate normal $N_d(\mu, \Sigma)$ | $\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}e^{-\frac12(x-\mu)'\Sigma^{-1}(x-\mu)}$ | $e^{it'\mu - \frac12 t'\Sigma t}$
Example 8.1 (Binomial, Normal, Poisson). In this example we work out the cf of general binomial, normal, and Poisson distributions.
If $X \sim \text{Ber}(p)$, then immediately from definition its cf is $\phi_X(t) = 1 - p + pe^{it}$. Because the general binomial random variable can be written as the sum of $n$ iid Bernoulli variables, the cf of the $\text{Bin}(n, p)$ distribution is $(1 - p + pe^{it})^n$.
For the normal case, first consider the standard normal case. By virtue of symmetry, from part (f) of the above theorem (or, directly from definition), if $Z \sim N(0, 1)$, then its cf is $\phi_Z(t) = \int_{-\infty}^{\infty}(\cos tz)\phi(z)\,dz$. Therefore,
$$\frac{d}{dt}\phi_Z(t) = \frac{d}{dt}\int_{-\infty}^{\infty}(\cos tz)\phi(z)\,dz = \int_{-\infty}^{\infty}\frac{d}{dt}\cos tz\ \phi(z)\,dz = \int_{-\infty}^{\infty}(\sin tz)(-z\phi(z))\,dz$$
$$= \int_{-\infty}^{\infty}(\sin tz)\,\frac{d}{dz}\phi(z)\,dz = -t\int_{-\infty}^{\infty}(\cos tz)\phi(z)\,dz = -t\,\phi_Z(t),$$
where the interchange of the derivative and the integral is permitted by the dominated convergence theorem, and integration by parts has been used in the final step of the calculation. Because $\phi_Z(0)$ must be one, $\frac{d}{dt}\phi_Z(t) = -t\,\phi_Z(t) \Rightarrow \phi_Z(t) = e^{-\frac{t^2}{2}}$.
If $X \sim N(\mu, \sigma^2)$, by part (d) of the above theorem, its cf equals $e^{it\mu - \frac{t^2\sigma^2}{2}}$. If $X \sim \text{Poi}(\lambda)$, then
$$\phi_X(t) = \sum_{x=0}^{\infty}e^{-\lambda}\frac{\lambda^x}{x!}e^{itx} = e^{-\lambda}\sum_{x=0}^{\infty}\frac{(\lambda e^{it})^x}{x!} = e^{-\lambda}e^{\lambda e^{it}} = e^{\lambda(e^{it}-1)}.$$
The power series representation of the exponential function on $\mathbb{C}$ has been used in summing the series $\sum_{x=0}^{\infty}\frac{(\lambda e^{it})^x}{x!}$.
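The closed form $e^{\lambda(e^{it}-1)}$ can be checked numerically against a truncation of the defining series; a minimal sketch:

```python
import cmath, math

def poisson_cf_series(t, lam, terms=200):
    # Truncated sum of e^{-lam} lam^x e^{itx} / x!  (terms decay super-fast)
    total = 0 + 0j
    log_term = -lam   # log of e^{-lam} lam^x / x! at x = 0
    for x in range(terms):
        total += math.exp(log_term) * cmath.exp(1j * t * x)
        log_term += math.log(lam) - math.log(x + 1)
    return total

def poisson_cf_closed(t, lam):
    return cmath.exp(lam * (cmath.exp(1j * t) - 1))

t, lam = 0.7, 3.0
print(abs(poisson_cf_series(t, lam) - poisson_cf_closed(t, lam)))  # essentially 0
```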
Example 8.2 (Exponential, Double Exponential, and Cauchy). Let $X$ have the standard exponential density. Then its cf is $\phi_X(t) = \int_0^{\infty}e^{itx}e^{-x}\,dx$. One can use methods of integration of complex-valued functions, and get the answer. Alternatively, one can separately find the real integrals $\int_0^{\infty}(\cos tx)e^{-x}\,dx$ and $\int_0^{\infty}(\sin tx)e^{-x}\,dx$, and thus obtain the cf as $\phi_X(t) = \int_0^{\infty}(\cos tx)e^{-x}\,dx + i\int_0^{\infty}(\sin tx)e^{-x}\,dx$. Each integral can be evaluated by various methods. For example, one can take the power series expansion $\sum_{k=0}^{\infty}(-1)^k\frac{(tx)^{2k}}{(2k)!}$ for $\cos(tx)$ and integrate $\int_0^{\infty}(\cos tx)e^{-x}\,dx$ term by term, and then sum the series, and likewise for the integral $\int_0^{\infty}(\sin tx)e^{-x}\,dx$. Or, one can evaluate the integrals by repeated integration by parts. Any of these gives the formula $\phi_X(t) = \frac{1}{1-it}$.
If $X$ has the standard double exponential density, then simply use the fact that $X \stackrel{\mathcal{L}}{=} X_1 - X_2$, where $X_1, X_2$ are iid standard exponential. Then, by parts (b) and (c) of the above theorem,
$$\phi_X(t) = \frac{1}{1-it}\cdot\frac{1}{1+it} = \frac{1}{1+t^2}.$$
Note the very interesting outcome that the cf of the standard double exponential density is the renormalized standard Cauchy density. By a theorem (the inversion theorem in the next section) that we have not discussed yet, this implies that the cf of the standard Cauchy density is $e^{-|t|}$, the renormalized standard double exponential density.
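The formula $\phi_X(t) = \frac{1}{1+t^2}$ is easy to confirm numerically: by symmetry the cf reduces to $\int_0^{\infty}(\cos tx)e^{-x}\,dx$, which a simple Simpson rule evaluates; a sketch (the truncation point 60 is an arbitrary choice, the tail beyond it being negligible):

```python
import math

def de_cf(t, upper=60.0, n=20000):
    # Simpson's rule for \int_0^upper cos(tx) e^{-x} dx; equals the double
    # exponential cf since the density is even.
    h = upper / n
    s = 1.0 + math.cos(t * upper) * math.exp(-upper)   # endpoint terms
    for k in range(1, n):
        x = k * h
        s += (4 if k % 2 else 2) * math.cos(t * x) * math.exp(-x)
    return s * h / 3

for t in (0.0, 1.0, 2.5):
    print(t, de_cf(t), 1 / (1 + t * t))   # the two columns agree
```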
Example 8.3 (Uniform Distribution in the n-Dimensional Unit Ball). This example well illustrates the value of a skilled calculation. We work out the cf of the uniform distribution in $B_n$, the $n$-dimensional unit ball, with the constant density
$$f(x) = \frac{1}{\text{Vol}(B_n)}I_{\{x\in B_n\}} = \frac{\Gamma(\frac{n}{2}+1)}{\pi^{n/2}}I_{\{x\in B_n\}}.$$
We write $f(x) = c_n I_{\{x\in B_n\}}$. Before we start the derivation, it is important to note that the constants in the calculation do not have to be explicitly carried along the steps, and a final constant can be found at the end by simply forcing the cf to equal one at $t = 0$.
First note that by virtue of symmetry (around the origin) of the uniform density in the ball, the cf equals $\phi(t) = c_n\int_{B_n}\cos(t'x)\,dx$. Let $P$ be an orthogonal matrix such that $Pt = ||t||e_1$, where $e_1$ is the first $n$-dimensional basis unit vector $(1, 0, \ldots, 0)'$. Because $||Px|| = ||x||$ ($P$ being an orthogonal matrix), and because $|\det P| = 1$, we get
$$\phi(t) = c_n\int_{B_n}\cos(||t||x_1)\,dx.$$
Now make the $n$-dimensional polar transformation $(x_1, x_2, \ldots, x_n) \to (\rho, \theta_1, \ldots, \theta_{n-1})$ (consult Chapter 4), which we recall has the Jacobian $\rho^{n-1}\sin^{n-2}\theta_1\sin^{n-3}\theta_2\cdots\sin\theta_{n-2}$. Thus,
$$\phi(t) = c_n\int_0^1\int_0^{\pi}\cdots\int_0^{\pi}\int_0^{2\pi}\rho^{n-1}\cos(||t||\rho\cos\theta_1)(\sin\theta_1)^{n-2}\sin^{n-3}\theta_2\cdots\sin\theta_{n-2}\,d\theta_{n-1}\cdots d\theta_1\,d\rho = k_{1n}\int_0^1\int_0^{\pi}\rho^{n-1}\cos(||t||\rho\cos\theta_1)(\sin\theta_1)^{n-2}\,d\theta_1\,d\rho$$
(it is not important to know right now what the constant $k_{1n}$ is)
$$= k_{2n}\int_0^1\int_{-1}^{1}\rho^{n-1}\cos(||t||\rho u)(1-u^2)^{\frac{n-3}{2}}\,du\,d\rho$$
(on making the change of variable $\cos\theta_1 = u$, a monotone transformation on $(0, \pi)$)
$$= k_{3n}\int_0^1\rho^{n-1}(||t||\rho)^{1-\frac{n}{2}}J_{\frac{n}{2}-1}(||t||\rho)\,d\rho$$
(on using the known formula for the integral in the above line)
$$= k_{4n}\,\frac{J_{\frac{n}{2}}(||t||)}{||t||^{\frac{n}{2}}}.$$
Now, at this final step use the fact that $\lim_{u\to 0}\frac{J_{\frac{n}{2}}(u)}{u^{\frac{n}{2}}} = \frac{1}{2^{\frac{n}{2}}\Gamma(\frac{n}{2}+1)}$, and this gives the formula that
$$\phi(t) = 2^{\frac{n}{2}}\,\Gamma\!\left(\frac{n}{2}+1\right)\frac{J_{\frac{n}{2}}(||t||)}{||t||^{\frac{n}{2}}}.$$
Given the cf of a random variable, it is possible to recover the CDF at its points of
continuity; additionally, if the cf is absolutely integrable, then the random variable
must have a density, and the density too can be recovered from the cf. The first result
on the recovery of the CDF leads to the distribution determining property of a cf.
Theorem 8.1 (Inversion Theorems). Let $X$ have the CDF $F$ and cf $\phi(t)$.
(a) For any $-\infty < a < b < \infty$, such that $a, b$ are both continuity points of $F$,
$$F(b) - F(a) = \lim_{T\to\infty}\frac{1}{2\pi}\int_{-T}^{T}\frac{e^{-iat} - e^{-ibt}}{it}\,\phi(t)\,dt;$$
(b) If $\int_{-\infty}^{\infty}|\phi(t)|\,dt < \infty$, then $X$ has a density, say $f(x)$, and
$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty}e^{-itx}\phi(t)\,dt.$$
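Part (b) can be illustrated numerically: inverting the standard normal cf $e^{-t^2/2}$ recovers the density $\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$. A sketch, where the inversion integral reduces to $\frac{1}{2\pi}\int\cos(tx)e^{-t^2/2}\,dt$ because this cf is real and even (the cutoff $T = 10$ is an arbitrary choice; the tail beyond it is negligible):

```python
import math

def invert_normal_cf(x, T=10.0, n=20000):
    # (1/2pi) * Simpson approximation of \int_{-T}^{T} cos(tx) e^{-t^2/2} dt
    h = 2 * T / n
    s = 2 * math.cos(T * x) * math.exp(-T * T / 2)   # endpoint terms (even integrand)
    for k in range(1, n):
        t = -T + k * h
        s += (4 if k % 2 else 2) * math.cos(t * x) * math.exp(-t * t / 2)
    return s * h / 3 / (2 * math.pi)

for x in (0.0, 1.0):
    exact = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    print(x, invert_normal_cf(x), exact)   # the two columns agree
```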
at continuity points $b$. To make the proof completely rigorous from here, some measure theory is needed; but we have basically shown that the CDF can be written as the integral of the function $\frac{1}{2\pi}\int_{-\infty}^{\infty}e^{-itx}\phi(t)\,dt$, and so $X$ must have this function as its density. It is bounded and continuous for the same reason that a cf is bounded and continuous.
For part (c), notice that $|\phi(t)|^2$ is the cf of $X - Y$, where $X, Y$ are iid with cf $\phi(t)$. But the density of $X - Y$, by the density convolution formula, is $g(x) = \int_{-\infty}^{\infty}f(y)f(x+y)\,dy$. If $|\phi(t)|$ is integrable, the inversion formula of part (b) applies to this cf, namely, $|\phi(t)|^2$. If we do apply the inversion formula, we get that $\frac{1}{2\pi}\int_{-\infty}^{\infty}|\phi(t)|^2\,dt = g(0)$. But $g(0) = \int_{-\infty}^{\infty}|f(x)|^2\,dx$. Therefore, we must have $\frac{1}{2\pi}\int_{-\infty}^{\infty}|\phi(t)|^2\,dt = \int_{-\infty}^{\infty}|f(x)|^2\,dx$. We leave the converse part as an easy exercise. □
The inversion theorem leads to the distribution determining property.
Theorem 8.2 (Distribution Determining Property). Let X; Y have character-
istic functions X .t/; Y .t/. If X .t/ D Y .t/ 8t, then X; Y have the same
distribution.
The proof uses part (a) of the inversion theorem, which applies to any type of
random variables. Applying part (a) separately to X and Y , one gets that the CDFs
FX ; FY must be the same at all points b that are continuity points of both FX ; FY .
This forces FX and FY to be equal everywhere, because one can approach an arbi-
trary point from above through common continuity points of $F_X, F_Y$. □
Remark. It is important to note that the assumption that X .t/ D Y .t/ 8t cannot
really be relaxed in this theorem, because it is actually possible for random variables
X; Y with different distributions to have identical characteristic functions at a lot of
points, for example, over arbitrarily long intervals. In fact, we show such an example
later.
We had stated in Chapter 1 that the sum of any number of independent Cauchy
random variables also has a Cauchy distribution. Aided by cfs, we can now prove it.
Example 8.4 (Sum of Cauchys). Let $X_i,\ 1 \le i \le n$, be independent, and suppose $X_i \sim C(\mu_i, \sigma_i^2)$. Let $S_n = X_1 + \cdots + X_n$. Then, the cf of $S_n$ is
$$\phi_{S_n}(t) = \prod_{i=1}^{n}\phi_{X_i}(t) = \prod_{i=1}^{n}e^{it\mu_i - \sigma_i|t|} = e^{it\sum_{i=1}^{n}\mu_i - |t|\sum_{i=1}^{n}\sigma_i}.$$
This coincides with the cf of a $C\left(\sum_{i=1}^{n}\mu_i, \left(\sum_{i=1}^{n}\sigma_i\right)^2\right)$ distribution; therefore, by the distribution determining property of cfs, one has
$$S_n \sim C\left(\sum_{i=1}^{n}\mu_i, \left(\sum_{i=1}^{n}\sigma_i\right)^2\right) \Leftrightarrow \bar X_n = \frac{S_n}{n} \sim C\left(\frac{\sum_{i=1}^{n}\mu_i}{n}, \left(\frac{\sum_{i=1}^{n}\sigma_i}{n}\right)^2\right).$$
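The cf algebra in this example is a one-liner to verify: with $\phi_{\mu,\sigma}(t) = e^{it\mu - \sigma|t|}$, the product over components equals the cf with summed parameters. A sketch (the parameter triples are arbitrary illustrative choices):

```python
import cmath

def cauchy_cf(t, mu, sigma):
    # cf of C(mu, sigma^2): e^{it*mu - sigma*|t|}
    return cmath.exp(1j * t * mu - sigma * abs(t))

params = [(0.0, 1.0), (2.0, 0.5), (-1.0, 3.0)]   # hypothetical (mu_i, sigma_i)
t = 1.3
prod = 1 + 0j
for mu, sigma in params:
    prod *= cauchy_cf(t, mu, sigma)
mu_sum = sum(m for m, s in params)
sigma_sum = sum(s for m, s in params)
print(abs(prod - cauchy_cf(t, mu_sum, sigma_sum)))  # 0 up to rounding
```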
Analogous to the inversion formula for the density case, there is a corresponding
inversion formula for integer-valued random variables, and actually more generally,
for lattice-valued random variables. We need two definitions for presenting those
results.
Example 8.5. This example helps illustrate the concepts of an atom and a lattice. If $X$ has a density, then it cannot have any atoms. It also cannot be a lattice random variable, because any random variable with a density assigns zero probability to any countable set. Now consider a mixture distribution such as $p\delta_0 + (1-p)N(0, 1),\ 0 < p < 1$. This distribution has one atom, namely the value $x = 0$. Consider a Poisson distribution. This has all nonnegative integers as its atoms, and it is also a lattice distribution, with $a = 0,\ h = 1$. Moreover, this distribution is purely atomic. Now take $Y \sim \text{Bin}(n, p)$ and let $X = \frac{Y}{n}$. Then $X$ has the atoms $0, \frac{1}{n}, \frac{2}{n}, \ldots, 1$. This is also a lattice distribution, with $a = 0,\ h = \frac{1}{n}$. This distribution is also purely atomic. Lastly, let $Z \sim N(0, 1)$ and let $X = 1 + 2\lfloor Z\rfloor$. Then, the atoms of $X$ are the odd integers $\pm 1, \pm 3, \ldots$, and again, $X$ has a lattice distribution, with $a = 1$ and $h = 2$.
(c) Given a random variable $X$ with cf $\phi(t)$, let $A$ be the (countable) set of all the atoms of $X$. Then
$$\sum_{x\in A}[P(X = x)]^2 = \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T}|\phi(t)|^2\,dt.$$
Corollary. Suppose $X$ has a cf $\phi(t)$ that is square integrable. Then $X$ cannot have any atoms. In particular, if $\phi(t)$ is absolutely integrable, then $X$ cannot have any atoms, and in fact must have a density.
Theorem 8.4 (Esseen's Lemma). (a) Let $F, G$ be two CDFs, of which $G$ has a uniformly bounded density $g$; $g \le K < \infty$. Let $F, G$ have cfs $\phi, \gamma$. Then for any $T > 0$, and $b > \frac{1}{2\pi}$, there exists a finite positive constant $C = C(b)$ such that
$$\sup_x |F(x) - G(x)| \le b\int_{-T}^{T}\frac{|\phi(t) - \gamma(t)|}{|t|}\,dt + \frac{CK}{T}.$$
(b) (Reversal) Let $F, G$ be any two CDFs, with cfs $\phi, \gamma$. Then
$$\sup_x |F(x) - G(x)| \ge \frac{1}{2\pi}\left|\int_{-\infty}^{\infty}[\phi(t) - \gamma(t)]\,\psi(t)\,dt\right|,$$
where $\psi$ is a suitable weight function (the precise choice is given in the reference below).
See Chapter 5 in Petrov (1975, pp. 142–147 and pp. 186–187) for each part of this theorem.
Unlike mgfs, characteristic functions need not be differentiable, even once. We have already seen such an example, namely the cf of the standard Cauchy distribution, which is $e^{-|t|}$, and is therefore continuous but not differentiable. The tail of the Cauchy distribution is causing the lack of differentiability of its cf. If the tail tapers off rapidly, then the cf will be differentiable, and could even be infinitely differentiable. Thus, thin tails of the distribution go hand in hand with smoothness of the cf. Conversely, erratic tails of the cf go hand in hand with a CDF that is not sufficiently smooth. Inasmuch as a thin tail of the CDF is helpful for existence of moments, these three attributes are linked together, namely,
 Does the CDF $F$ have a thin tail?
 Is the cf differentiable a sufficient number of times?
 Does $F$ have a sufficient number of finite moments, and if so, how does one recover them directly from the cf?
Conversely, these two attributes are linked together:
 Does the cf taper off rapidly?
 Is the CDF $F$ very smooth, for example, differentiable a (large) number of times?
These issues, together with practical applications in the form of Taylor expan-
sions for the cf are discussed next. It should be remarked that Taylor expansions for
the cf and its logarithm form hugely useful tools in various problems in asymptotics.
For example, the entire area of Edgeworth expansions in statistics uses these Taylor
expansions as the most fundamental tool.
Proof. We outline the proof of parts (a) and (e) of this theorem. See Port (1994, pp. 658–663 and p. 670) for the remaining parts. For part (a), for notational simplicity, assume that $X$ has a density $f$, and formally differentiate $\phi(t)$ $m$ times inside the integral sign. We get the expression $i^m\int_{-\infty}^{\infty}x^m e^{itx}f(x)\,dx$. The absolute value of the integrand, $|x^m e^{itx}f(x)|$, is bounded by $|x^m f(x)|$, which is integrable, because $m \le k$ and the $k$th moment exists by hypothesis. This means that the dominated convergence theorem is applicable, and that at any $t$ the cf is $m$ times differentiable, with $\phi^{(m)}(t) = i^m\int_{-\infty}^{\infty}x^m e^{itx}f(x)\,dx$. Putting $t = 0$, $\phi^{(m)}(0) = i^m E(X^m)$.
For part (e), suppose $X$ has a density $f$ and that $f'$ exists and is absolutely integrable. Then, by integration by parts, $\phi(t) = -\frac{1}{it}\int_{-\infty}^{\infty}e^{itx}f'(x)\,dx$. Now apply the Riemann–Lebesgue lemma, namely part (d), to conclude that $|\phi(t)| = o\left(\frac{1}{|t|}\right)$ as $t \to \pm\infty$. For general $n$, use this same argument by doing repeated integration by parts. □
The practically useful Taylor expansions for the cf and its logarithm are given
next. For completeness, we recall the definition of cumulants of a random variable.
(a) If $E(|X|^n) < \infty$, then $\phi(t)$ admits the Taylor expansion
$$\phi(t) = 1 + \sum_{j=1}^{n}\frac{(it)^j}{j!}E(X^j) + o(|t|^n),$$
as $t \to 0$.
(b) If $E(|X|^n) < \infty$, then $\log\phi(t)$ admits the Taylor expansion
$$\log\phi(t) = \sum_{j=1}^{n}\frac{(it)^j}{j!}\kappa_j + o(|t|^n),$$
as $t \to 0$. See Port (1994, p. 660) for a derivation of the expansion.
is where we are not quite sure what the limiting distribution of Zn is. But, still, if
we can calculate the cf of Zn and obtain a pointwise limit for it, say .t/, and if we
can also establish that this function .t/ is indeed a valid cf (it is not automatically
guaranteed), then the limiting distribution will be whatever distribution has .t/ as
its cf. Characteristic functions thus make a particularly effective tool in asymptotic
theory. In fact, we later give a proof of the CLT in the iid case, without making
restrictive assumptions such as the existence of the mgf, by using characteristic
functions.
Theorem 8.7. (a) Let $X_n, n \ge 1$, and $X$ have cfs $\phi_n(t), \phi(t)$. Then $X_n \stackrel{\mathcal{L}}{\Rightarrow} X$ if and only if $\phi_n(t) \to \phi(t)\ \forall t$.
(b) (Lévy) Let $X_n, n \ge 1$ have cfs $\phi_n(t)$, and suppose $\phi_n(t)$ converges pointwise to (some function) $\phi(t)$. If $\phi$ is continuous at zero, then $\phi(t)$ must be a cf, in which case $X_n \stackrel{\mathcal{L}}{\Rightarrow} X$, where $X$ has $\phi(t)$ as its cf.
Proof. We prove only part (a). Let $F_n, F$ denote the CDFs of $X_n, X$, and let $B$ be the set of continuity points of $F$. Note that the complement of $B$ is at most countable. Suppose $\phi_n(t) \to \phi(t)\ \forall t$. By the inversion theorem, for any $a, b \in B$ with $a < b$,
$$F(b) - F(a) = \lim_{T\to\infty}\frac{1}{2\pi}\int_{-T}^{T}\frac{e^{-iat}-e^{-ibt}}{it}\,\phi(t)\,dt = \lim_{T\to\infty}\frac{1}{2\pi}\int_{-T}^{T}\frac{e^{-iat}-e^{-ibt}}{it}\left[\lim_{n}\phi_n(t)\right]dt = \lim_{T\to\infty}\lim_{n}\frac{1}{2\pi}\int_{-T}^{T}\frac{e^{-iat}-e^{-ibt}}{it}\,\phi_n(t)\,dt,$$
where the interchange of the limit and the integral is justified by the dominated convergence theorem. This implies that $X_n \stackrel{\mathcal{L}}{\Rightarrow} X$.
Conversely, suppose $X_n \stackrel{\mathcal{L}}{\Rightarrow} X$. Then, by the portmanteau theorem, for any $t$, each of $E[\cos(tX_n)], E[\sin(tX_n)]$ converges to $E[\cos(tX)], E[\sin(tX)]$, and so $\phi_n(t) \to \phi(t)\ \forall t$. □
8.5 Proof of the CLT and the WLLN
Perhaps the most major application of part (a) of the continuity theorem is in sup-
plying a proof of the CLT in the iid case, without making any assumptions other
than what the CLT says. Characteristic functions also provide a very efficient proof
of the weak law of large numbers and the Cramér–Wold theorem. These three proofs
are provided below.
Theorem 8.8 (CLT in the IID Case). Let $X_i, i \ge 1$ be iid variables with mean $\mu$ and variance $\sigma^2\ (< \infty)$. Let $Z_n = \frac{\sqrt{n}(\bar X - \mu)}{\sigma}$. Let $Z \sim N(0, 1)$. Then $Z_n \stackrel{\mathcal{L}}{\Rightarrow} Z$.
Proof. Without any loss of generality, we may assume that $\mu = 0, \sigma = 1$. Let $\phi(t)$ denote the cf of the $X_i$. Then, the cf of $Z_n$ is
$$\psi_n(t) = \left[\phi\!\left(\frac{t}{\sqrt{n}}\right)\right]^n = \left[1 - \frac{t^2}{2n} + o\!\left(\frac{1}{n}\right)\right]^n \to e^{-\frac{t^2}{2}},$$
and hence, by part (a) of the continuity theorem, $Z_n \stackrel{\mathcal{L}}{\Rightarrow} Z$. □
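The convergence of cfs in this proof can be watched numerically for a concrete case: for iid signs $X_i = \pm 1$ with equal probability (mean 0, variance 1), the cf of the standardized sum is exactly $[\cos(t/\sqrt{n})]^n$. A minimal sketch:

```python
import math

def rademacher_sum_cf(t, n):
    # cf of (X_1 + ... + X_n)/sqrt(n) for X_i = +-1 equally likely:
    # each X_i has cf cos(t), so the standardized sum has cf cos(t/sqrt(n))^n
    return math.cos(t / math.sqrt(n)) ** n

t = 1.0
for n in (10, 100, 10000):
    print(n, rademacher_sum_cf(t, n), math.exp(-t * t / 2))   # columns converge
```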
Theorem 8.9 (WLLN). Suppose $X_i, i \ge 1$ are iid, with $E(|X_i|) < \infty$. Let $E(X_i) = \mu$. Let $S_n = X_1 + \cdots + X_n,\ n \ge 1$. Then $\bar X = \frac{S_n}{n} \stackrel{P}{\to} \mu$.
Proof. We may assume without loss of generality that $\mu = 0$. Because $E(|X_i|) < \infty$, the cf of $X_i$ admits the Taylor expansion $\phi(t) = 1 + o(|t|),\ t \to 0$. Now, the cf of $\bar X$ is
$$\psi_n(t) = \left[\phi\!\left(\frac{t}{n}\right)\right]^n = \left[1 + o\!\left(\frac{1}{n}\right)\right]^n \to 1\ \forall t.$$
Therefore, $\bar X$ converges in distribution to (the point mass at) zero, and so, also converges in probability to zero. □
$$\sum_{i=1}^{n}\sum_{j=1}^{n}\omega_i\bar{\omega}_j\,\phi(t_i - t_j) \ge 0.$$
We can now conclude more. If $\phi_1, \phi_2$ are both cfs, then so must be their product $\phi_1\phi_2$, by the convolution theorem for cfs. Applied to the present situation, this means that if $\phi(t)$ is a cf, so must be $\phi(t)\overline{\phi(t)} = |\phi(t)|^2$. In fact, this is the cf of $X_1 - X_2$, with $X_1, X_2$ being iid with the cf $\phi$. Recall that this is just the symmetrization of $X_1$.
To conclude this example, because $\bar\phi$ is a cf whenever $\phi$ is, the special convex combination $\frac12\phi + \frac12\bar\phi$ is also a cf. But $\frac12\phi + \frac12\bar\phi$ is simply the real part of $\phi$. Therefore, if $\phi$ is any cf, then so is its real part $\Re(\phi)$. There is a simple interpretation for it. For example, if $X$ has a density $f(x)$ and the cf $\phi$, then $\Re(\phi)$ is just the cf of the mixture density $\frac12 f(x) + \frac12 f(-x)$.
The following sufficient condition is among the most practically useful methods
for constructing valid characteristic functions. We also show some examples of its
applications.
Theorem 8.12 (Pólya's Criterion). Let $\phi$ be a real and even function. Suppose $\phi(0) = 1$, and suppose that $\phi(t)$ is nonincreasing and convex for $t > 0$, and converges to zero as $t \to \infty$. Then $\phi$ is the characteristic function of some distribution having a density.
See Feller (1971, p. 509) for a proof.
Example 8.7 (Stable Distributions). Let $\phi(t) = e^{-|t|^\alpha},\ 0 < \alpha \le 1$. Then, by simple calculus, $\phi(t)$ satisfies the convexity condition in Pólya's criterion, and the other conditions are obviously satisfied. Therefore, for $0 < \alpha \le 1$, $e^{-|t|^\alpha}$ is a valid characteristic function. In fact, these are the characteristic functions of some very special distributions. Distributions $F$ with the property that for any $n$, if $X_1, \ldots, X_n$ are iid with CDF $F$, then their sum $S_n = X_1 + \cdots + X_n$ is distributed as $a_n X_1 + b_n$ for some sequences $a_n, b_n$, are called stable distributions. It turns out that the sequence $a_n$ must be of the form $n^{1/\alpha}$ for some $\alpha \in (0, 2]$; $\alpha$ is called the index of the stable distribution. A symmetric stable law has cfs of the form $e^{-c|t|^\alpha},\ c > 0$. Therefore, we have arrived here at the cfs of symmetric stable laws of index $\alpha \le 1$. Pólya's criterion breaks down if $1 < \alpha \le 2$. So, although $e^{-c|t|^\alpha},\ c > 0$ are also valid cfs for $1 < \alpha \le 2$, it cannot be proved by using Pólya's criterion. Note that the case $\alpha = 2$ corresponds to mean zero normal distributions, and $\alpha = 1$ corresponds to centered Cauchy distributions. The case $\alpha = \frac12$ arises as the limiting distribution of $\frac{\tau_r}{r^2}$, where $\tau_r$ is the time of the $r$th return to the starting point zero of a simple symmetric random walk; see Chapter 11. It can be proved that stable distributions must be continuous, and have a density. However, except for $\alpha = \frac12, 1$, and $2$, a stable density cannot be written in a simple form using elementary functions. Nevertheless, stable distributions are widely used in modeling variables in economics, finance, extreme event modeling, and generally, whenever densities with heavy tails are needed.
Example 8.8 (An Example Where the Inversion Theorem Fails). The inversion theorem says that if the cf is absolutely integrable, then the density can be found by using the inversion formula. This example shows that the cf need not be absolutely integrable for the distribution to possess a density. For this, consider the function $\phi(t) = \frac{1}{1+|t|}$. Then $\phi$ clearly satisfies every condition in Pólya's criterion, and therefore $\phi$ is a cf corresponding to a density. However, at the same time, $\int_{-\infty}^{\infty}\frac{1}{1+|t|}\,dt = \infty$, and therefore the density cannot be found by using the inversion theorem.
Example 8.9 (Two Characteristic Functions That Coincide in an Interval). We remarked earlier that unlike mgfs, cfs of different distributions can coincide over nonempty intervals. We give such an example now. Towards this, define
$$\psi_1(t) = \left(1 - \frac{|t|}{T}\right)I_{\{|t|\le T\}};$$
$$\psi_2(t) = \left(1 - \frac{|t|}{T}\right)I_{\{|t|\le \frac{T}{2}\}} + \frac{T}{4|t|}I_{\{|t| > \frac{T}{2}\}}.$$
Each of $\psi_1, \psi_2$ is a cf by Pólya's criterion. Note that, however, $\psi_1(t) = \psi_2(t)\ \forall t \in [-\frac{T}{2}, \frac{T}{2}]$. Because $T$ is arbitrary, this shows that two different cfs can coincide on arbitrarily large intervals. Also note that at the same time this example provides a cf with a bounded support.
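The two functions are simple to evaluate; a quick numerical check (with the arbitrary choice $T = 2$) that they agree on $[-T/2, T/2]$ and differ outside:

```python
def psi1(t, T):
    # triangular cf: (1 - |t|/T) on [-T, T], zero outside
    return max(0.0, 1 - abs(t) / T)

def psi2(t, T):
    # agrees with psi1 up to |t| = T/2, then switches to the convex tail T/(4|t|)
    return 1 - abs(t) / T if abs(t) <= T / 2 else T / (4 * abs(t))

T = 2.0
inside = [0.1 * k for k in range(-10, 11)]   # grid on [-T/2, T/2] = [-1, 1]
print(all(abs(psi1(t, T) - psi2(t, T)) < 1e-12 for t in inside))  # True
print(psi1(1.5, T), psi2(1.5, T))            # differ outside: 0.25 vs 1/3
```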
Perhaps a little explanation is useful: mgfs can be extended into analytic func-
tions defined on C, and there is a theorem in complex analysis that two analytic
functions cannot coincide on any subset of C that has a limit point. Thus, two dif-
ferent mgfs cannot coincide on a nonempty real interval. However, unlike mgfs,
characteristic functions are not necessarily analytic. That leaves an opening for find-
ing nonanalytic cfs which do coincide over nonempty intervals.
$$\Delta_n \le \frac{C\beta_3}{\sigma^3\sqrt{n}}.$$
The proof of the Berry–Esseen theorem uses two technical inequalities on characteristic functions. One is Esseen's lemma, and the other a lemma that further bounds a term in Esseen's lemma itself. It is the second lemma that needs elaborate arguments involving Taylor expansions for the characteristic function of $\frac{\sqrt{n}(\bar X - \mu)}{\sigma}$, and its logarithm. A detailed proof of this second lemma can be seen in Petrov (1975, pp. 142–147).
Lemma. Let $X_1, X_2, \ldots, X_n$ be iid random variables with mean $\mu$, variance $\sigma^2$, and $\beta_3 = E(|X_1 - \mu|^3) < \infty$. Let $L_n = \frac{\beta_3}{\sigma^3\sqrt{n}}$, and let $\zeta_n(t)$ denote the characteristic function of $Z_n = \frac{\sqrt{n}(\bar X - \mu)}{\sigma}$. Then,
$$\left|\zeta_n(t) - e^{-\frac{t^2}{2}}\right| \le 16\,L_n|t|^3 e^{-\frac{t^2}{3}}\quad \forall t \text{ such that } |t| \le \frac{1}{4L_n}.$$
4Ln
on using the simple inequality that the standard normal density $\phi(x) \le \frac{1}{\sqrt{2\pi}}\ \forall x$. Now, in the first term on the right, use the pointwise bound for $|\zeta_n(t) - e^{-\frac{t^2}{2}}|$ from the above lemma, and integrate. Some algebra then gives the Berry–Esseen theorem with a new universal constant $C$. □
Remark. The universal constant $C$ may be taken as $C = 0.8$. Fourier analytic proofs give the best available constant; direct proofs have so far not succeeded in producing good constants. The constant $C$ cannot be taken to be smaller than $\frac{3+\sqrt{10}}{6\sqrt{2\pi}} \approx 0.41$. Better values of the constant $C$ can be found for specific types of the underlying CDF, for example, if it is known that the samples are iid from an exponential distribution. However, no systematic studies in this direction seem to have been done. Also for some specific underlying CDFs $F$, better rates of convergence in the CLT may be possible. For example, under suitable additional conditions on $F$, one can have $\Delta_n = \frac{C(F)}{n} + o\left(\frac{1}{n}\right)$.
The main use of the Berry–Esseen theorem is that it establishes the facts that, in general, the rate of convergence in the central limit theorem for sums is $O\left(\frac{1}{\sqrt{n}}\right)$, and that the accuracy of the CLT approximation is linked to the third moment. It need not, necessarily, give accurate practical bounds on the error of the CLT in specific applications. We take an example below.
This is less than or equal to the prescribed $\epsilon = 0.005$ iff $pq > 0.4784$, which does not hold for any $p,\ 0 < p < 1$. Even for $p = .5$, the Berry–Esseen bound is less than or equal to $\epsilon = 0.005$ only when $n > 25{,}000$, which is a very large sample size.
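For iid Bernoulli($p$) (so $\mu = p$, $\sigma^2 = pq$, and $\beta_3 = E|X_1 - p|^3 = pq(p^2+q^2)$), the Berry–Esseen bound with $C = 0.8$ reduces to $0.8(p^2+q^2)/\sqrt{npq}$, and the smallest $n$ making it $\le 0.005$ at $p = .5$ can be found directly; a sketch:

```python
import math

def be_bound(n, p):
    # Berry-Esseen bound 0.8 * beta3 / (sigma^3 * sqrt(n)) for iid Bernoulli(p):
    # beta3 / sigma^3 = (p^2 + q^2) / sqrt(p*q)
    q = 1 - p
    return 0.8 * (p * p + q * q) / math.sqrt(n * p * q)

p, eps = 0.5, 0.005
n = 1
while be_bound(n, p) > eps:
    n += 1
print(n)   # 25600 mathematically (up to boundary rounding); matches "n > 25,000"
```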
Various refinements of the Berry–Esseen theorem are available. They replace the
third absolute central moment ˇ3 by an expectation of some more general function.
Petrov (1975) and van der Vaart et al. (1996) are good references for more general
Berry–Esseen type bounds on the error of the CLT for sums. Here is an important
refinement that does not assume the existence of the third moment, or that the vari-
ables are iid. The zero mean assumption in the next theorem is not a restriction,
because we can make the means zero by writing Xi i in place of Xi .
Theorem 8.14. Let $X_1, \ldots, X_n$ be independent with $E(X_i) = 0,\ \text{Var}(X_i) = \sigma_i^2$, and let $B_n = \sum_{i=1}^{n}\sigma_i^2$. Let $g: \mathbb{R} \to \mathbb{R}^+$ be such that
(a) $g$ is even;
(b) $g(x)$ and $\frac{x}{g(x)}$ are nondecreasing for $x > 0$;
(c) $E(X_i^2 g(X_i)) < \infty$ for each $i = 1, 2, \ldots, n$.
Then,
$$\sup_x\left|P\!\left(\frac{\bar X - E(\bar X)}{\sqrt{\text{Var}(\bar X)}} \le x\right) - \Phi(x)\right| \le C\,\frac{\sum_{i=1}^{n}E(X_i^2 g(X_i))}{g(\sqrt{B_n})\,B_n},$$
Corollary. Let $X_1, \ldots, X_n$ be iid with mean $\mu$, variance $\sigma^2$, and suppose for some $\delta,\ 0 < \delta \le 1$, $E(|X_1|^{2+\delta}) < \infty$. Then,
$$\Delta_n = \sup_x\left|P\!\left(\frac{\sqrt{n}(\bar X - \mu)}{\sigma} \le x\right) - \Phi(x)\right| \le C\,\frac{E|X_1 - \mu|^{2+\delta}}{\sigma^{2+\delta}\,n^{\delta/2}},$$
The central limit theorem generalizes far beyond the iid situation. The general rule is
that if we add a large number of independent summands, none of which dominates
the rest, then the sum should still behave as does a normally distributed random
variable. There are several theorems in this regard, of which the Lindeberg–Feller
theorem is usually considered to be the last word. A weaker result, which is easier
to apply in many problems, is Lyapounov’s theorem, which we present first.
Theorem 8.15 (Lyapounov's Theorem). Suppose $\{X_n\}$ is a sequence of independent variables, with $E(X_i) = \mu_i,\ \text{Var}(X_i) = \sigma_i^2$. Let $s_n^2 = \sum_{i=1}^{n}\sigma_i^2$. If for some $\delta > 0$,
$$\frac{1}{s_n^{2+\delta}}\sum_{j=1}^{n}E|X_j - \mu_j|^{2+\delta} \to 0 \quad \text{as } n \to \infty,$$
then
$$\frac{\sum_{i=1}^{n}(X_i - \mu_i)}{s_n} \stackrel{\mathcal{L}}{\Rightarrow} N(0, 1).$$
Sketch of Proof of Lyapounov's Theorem. We explain the idea of the proof when the condition of the theorem holds with $\delta = 1$. Here is the idea, first in nontechnical terms.
Assume without any loss of generality that $\mu_i = 0,\ \sigma_i^2 = 1\ \forall i \ge 1$. Denote $\frac{\sum_{i=1}^{n}X_i}{s_n}$ by $Z_n$ and its characteristic function by $\psi_n(t)$. The idea is to approximate the logarithm of $\psi_n(t)$ and to show that it is approximately equal to $-\frac{t^2}{2}$ for each fixed $t$. This means that $\psi_n(t)$ itself is approximately equal to $e^{-\frac{t^2}{2}}$, and hence by the continuity theorem for characteristic functions, $Z_n \stackrel{\mathcal{L}}{\Rightarrow} N(0, 1)$.
Let the cumulants of $Z_n$ be denoted by $\kappa_{j,n}$. Note that $\kappa_{1,n} = 0,\ \kappa_{2,n} = 1$. We approximate the logarithm of $\psi_n(t)$ by using our previously given Taylor expansion:
$$\log\psi_n(t) \approx \sum_{j=1}^{3}\kappa_{j,n}\frac{(it)^j}{j!} = -\frac{t^2}{2} - \kappa_{3,n}\frac{it^3}{6}.$$
Now, $\kappa_{3,n} = E(Z_n^3) = \frac{1}{s_n^3}E\left[\left(\sum_{i=1}^{n}X_i\right)^3\right]$. By an expansion of $\left(\sum_{i=1}^{n}X_i\right)^3$, and on using the independence and zero mean property of the $X_i$, and on using the triangular inequality, one gets $|\kappa_{3,n}| \le \frac{\sum_{i=1}^{n}E(|X_i|^3)}{s_n^3}$, which by the Lyapounov condition goes to zero, when $\delta = 1$. Thus, $\log\psi_n(t) \approx -\frac{t^2}{2}$, which is what we need. □
As remarked earlier, a central limit theorem for independent but not iid
summands holds under conditions weaker than the Lyapounov condition. The
Lindeberg–Feller theorem is usually considered the best possible result on this,
in the sense that the conditions of the theorem are not only sufficient, but also
necessary under some natural additional restrictions. However, the conditions of the
Lindeberg–Feller theorem involve calculations with the moments of more compli-
cated functions of the summands. Thus, in applications, using the Lindeberg–Feller
theorem gives a CLT under weaker conditions than Lyapounov’s theorem, but typi-
cally at the expense of more cumbersome calculations to verify the Lindeberg–Feller
conditions.
Theorem 8.16 (Lindeberg–Feller Theorem). With the same notation as in Lyapounov's theorem, assume that
$$\forall \delta > 0,\quad \frac{1}{s_n^2}\sum_{j=1}^{n}E\!\left((X_j - \mu_j)^2 I_{\{|X_j - \mu_j| > \delta s_n\}}\right) \to 0.$$
Then,
$$\frac{\sum_{i=1}^{n}(X_i - \mu_i)}{s_n} \stackrel{\mathcal{L}}{\Rightarrow} N(0, 1).$$
Proof. We give a characteristic function proof. Merely for notational simplicity, we assume that each $X_i$ has a density $f_i$. This assumption is not necessary, and the proof goes through verbatim in general, with integrals replaced by the expectation notation.
Denote the cf of $X_i$ by $\phi_i$, and without any loss of generality, we assume that $\mu_i = 0\ \forall i$, so that we have to show that
$$\frac{\sum_{i=1}^{n}X_i}{s_n} \stackrel{\mathcal{L}}{\Rightarrow} N(0, 1) \Leftrightarrow \prod_{k=1}^{n}\phi_k\!\left(\frac{t}{s_n}\right) \to e^{-t^2/2}.$$
$$\Leftrightarrow \sum_{k=1}^{n}\log\phi_k\!\left(\frac{t}{s_n}\right) \to -\frac{t^2}{2} \Leftrightarrow \sum_{k=1}^{n}\left[\phi_k\!\left(\frac{t}{s_n}\right) - 1\right] + \frac{t^2}{2} \to 0,$$
by using a two-term Taylor expansion for the logarithm of each $\phi_k$ (see the section on characteristic functions).
Now, because each $X_i$ has mean zero, the quantity on the left in this latest expression can be rewritten as
$$\sum_{k=1}^{n}\int_{-\infty}^{\infty}\left(e^{i\frac{t}{s_n}x} - 1 - i\frac{t}{s_n}x + \frac{t^2}{2s_n^2}x^2\right)f_k(x)\,dx$$
$$= \sum_{k=1}^{n}\int_{|x|\le\delta s_n}\left(e^{i\frac{t}{s_n}x} - 1 - i\frac{t}{s_n}x + \frac{t^2}{2s_n^2}x^2\right)f_k(x)\,dx + \sum_{k=1}^{n}\int_{|x|>\delta s_n}\left(e^{i\frac{t}{s_n}x} - 1 - i\frac{t}{s_n}x + \frac{t^2}{2s_n^2}x^2\right)f_k(x)\,dx.$$
We bound each term in this expression. First, we bound the integrand. In the first term, that is, when $|x| \le \delta s_n$, the integrand is bounded above by $\left|\frac{t}{s_n}x\right|^3 \le \delta|t|^3 x^2/s_n^2$. In the second term, that is, when $|x| > \delta s_n$, the integrand is bounded above by $t^2\frac{x^2}{s_n^2}$. Therefore,
$$\left|\sum_{k=1}^{n}\int_{|x|\le\delta s_n}\left(e^{i\frac{t}{s_n}x} - 1 - i\frac{t}{s_n}x + \frac{t^2}{2s_n^2}x^2\right)f_k(x)\,dx + \sum_{k=1}^{n}\int_{|x|>\delta s_n}\left(e^{i\frac{t}{s_n}x} - 1 - i\frac{t}{s_n}x + \frac{t^2}{2s_n^2}x^2\right)f_k(x)\,dx\right|$$
$$\le \frac{\delta|t|^3\sum_{k=1}^{n}\int_{|x|\le\delta s_n}x^2 f_k(x)\,dx}{s_n^2} + \frac{t^2}{s_n^2}\sum_{k=1}^{n}\int_{|x|>\delta s_n}x^2 f_k(x)\,dx.$$
314 8 Characteristic Functions and Applications
Hold δ fixed and let n → ∞. Then the second term → 0 by the hypothesis of the Lindeberg–Feller theorem. Now notice that the expression we started with, namely

  Σ_{k=1}^n ∫_{−∞}^{∞} [e^{itx/s_n} − 1 − i(t/s_n)x + (t²/(2s_n²))x²] f_k(x) dx,

has no δ in it, and so, now letting δ → 0, even the first term is handled, and we conclude that

  Σ_{k=1}^n ∫_{−∞}^{∞} [e^{itx/s_n} − 1 − i(t/s_n)x + (t²/(2s_n²))x²] f_k(x) dx → 0.
  r_n = max{|c_1|, …, |c_n|} / √(Σ_{i=1}^n c_i²) → 0, as n → ∞.

We want to show that Σ_{i=1}^n c_i U_i / √(Σ_{i=1}^n c_i²) ⇒ N(0, 1) in distribution. We do this by verifying the Lindeberg–Feller condition.

Denote X_i = c_i U_i and s_n = √(Σ_{i=1}^n c_i²). We need to show that for any δ > 0,

  (1/s_n²) Σ_{j=1}^n E[X_j² I{|X_j| > δ s_n}] → 0.

Fix j; then

  E[X_j² I{|X_j| > δ s_n}] = c_j² E[U_j² I{|U_j| > δ s_n/|c_j|}]
  ≤ c_j² E[U_j² I{|U_j| > δ s_n/max{|c_1|, …, |c_n|}}] = c_j² E[U_j² I{|U_j| > δ/r_n}],

and E[U_1² I{|U_1| > δ/r_n}] → 0 because r_n → 0. But this implies that (1/s_n²) Σ_{j=1}^n E[X_j² I{|X_j| > δ s_n}] goes to zero too, because Σ_{j=1}^n c_j²/s_n² = 1. This completes the verification of the Lindeberg–Feller condition.
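The verification above can be mirrored numerically. The sketch below is an illustration, not from the text: it takes U_i uniform on [−1, 1], for which E[U² I{|U| > a}] = (1 − a³)/3 when 0 ≤ a ≤ 1 and zero otherwise, and uses the made-up weights c_i = √i. Because the U_i are bounded, the Lindeberg ratio drops to exactly zero once δ s_n exceeds max_i |c_i|.

```python
import math

def lindeberg_ratio(c, delta):
    # Illustrative check with U_i ~ Uniform[-1, 1] (an assumption, not from the text):
    # Var(U) = 1/3 and E[U^2 * 1{|U| > a}] = (1 - a^3)/3 for 0 <= a <= 1, else 0.
    s_n = math.sqrt(sum(ci * ci for ci in c) / 3.0)  # sd of sum_i c_i U_i
    total = 0.0
    for ci in c:
        a = delta * s_n / abs(ci)          # threshold on |U| in the i-th term
        if a < 1.0:
            total += ci * ci * (1.0 - a ** 3) / 3.0
    return total / s_n ** 2

# the weights c_i = sqrt(i) satisfy r_n = max|c_i| / s_n -> 0
for n in (10, 100, 1000):
    c = [math.sqrt(i) for i in range(1, n + 1)]
    print(n, lindeberg_ratio(c, delta=0.1))
```

The printed ratios decrease with n; for n = 1000 the threshold exceeds the support bound for every term, so the ratio is exactly zero, illustrating how bounded, uniformly negligible summands satisfy the condition.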
Part (b) is attributed to Feller; see Port (1994, p. 704). It shows that if variances
exist, then one cannot do away with the Lindeberg–Feller condition, except in un-
usual cases where the summands are not uniformly negligible relative to the sum.
Infinitely divisible distributions were introduced by de Finetti in 1929 and the most
fundamental results were developed by Kolmogorov, Lévy, and Khintchine in the
thirties. The area has undergone tremendous growth and a massive literature now
exists. Stable distributions form a subclass of infinitely divisible distributions and
are quite extensively used in modeling heavy tail data in various applied fields.
The origin of infinite divisibility and stable laws was in connection with characterizing possible limit distributions of centered and normed partial sums of independent random variables. If X_1, X_2, … is an iid sequence with mean μ and a finite variance σ², then we know that with a_n = nμ and b_n = σ√n, (S_n − a_n)/b_n ⇒ N(0, 1) in distribution. But if X_1, X_2, … are iid standard Cauchy, then with a_n = 0 and b_n = n, (S_n − a_n)/b_n ⇒ C(0, 1). It is natural to ask what are all the possible limit laws for
suitably centered and normed partial sums of iid random variables. One can remove
the iid assumption and keep just independence, and ask the same question. It turns
out that stable and infinitely divisible distributions arise as the class of all possible
limits in these cases. Precise statements are given in the theorems below. But, first
we need to define infinite divisibility and stable laws.
Example 8.13. Let X be N(μ, σ²). For any n > 1, let X_{1n}, …, X_{nn} be iid N(μ/n, σ²/n). Then X has the same distribution as X_{1n} + ⋯ + X_{nn}, and so X is infinitely divisible.
Example 8.14. Let X have a Poisson distribution with mean λ. For a given n, let X_{1n}, …, X_{nn} be iid Poisson variables with mean λ/n. Then X has the same distribution as X_{1n} + ⋯ + X_{nn}, and so X is infinitely divisible.
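Example 8.14 can be checked mechanically: convolving the pmf of n iid Poi(λ/n) variables recovers the Poi(λ) pmf exactly, up to rounding. The stdlib-only sketch below uses the illustrative values λ = 2 and n = 8 (assumptions, not from the text).

```python
from math import exp, factorial

def poisson_pmf(mu, kmax):
    # pmf of Poi(mu) on {0, 1, ..., kmax}
    return [exp(-mu) * mu ** k / factorial(k) for k in range(kmax + 1)]

def convolve(p, q):
    # pmf of the sum of two independent nonnegative integer-valued variables
    r = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj
    return r

lam, n, kmax = 2.0, 8, 20          # illustrative values
piece = poisson_pmf(lam / n, kmax)
total = piece
for _ in range(n - 1):
    total = convolve(total, piece)

target = poisson_pmf(lam, kmax)
err = max(abs(total[k] - target[k]) for k in range(kmax + 1))
print("max pmf discrepancy:", err)
```

The discrepancy is at the level of floating-point rounding, in line with the exact distributional identity.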
Example 8.15. Let X have the continuous U[0, 1] distribution. Then X is not infinitely divisible. For if it is, then for any n there exist iid random variables X_{1n}, …, X_{nn} with some distribution F_n such that X has the same distribution as X_{1n} + ⋯ + X_{nn}. Evidently, the supremum of the support of F_n is at most 1/n. This implies Var(X_{1n}) ≤ 1/n², and hence Var(X) ≤ 1/n, a contradiction.
In fact, a random variable X with a bounded support cannot be infinitely divisi-
ble, and the uniform case proof applies in general.
The following important property of the class of infinitely divisible distributions
describes the connection of infinite divisibility to possible weak limits of partial
sums of triangular arrays of independent random variables.
Theorem 8.18. A random variable X is infinitely divisible if and only if, for each n, there is an iid sequence Z_{1n}, …, Z_{nn} such that Σ_{i=1}^n Z_{in} ⇒ X in distribution.
See Feller (1971, p. 303) for its proof. The result above allows triangular arrays
of independent random variables, with possibly different common distributions Hn
for the different rows. An important special case is that of just one iid sequence X_1, X_2, … with a common cdf H. If the partial sums S_n = Σ_{i=1}^n X_i, possibly after
suitable centering and norming, converge in distribution to some random variable
Z, then Z belongs to a subclass of the class of infinitely divisible distributions. This
class is the so-called stable family. We first give a more direct definition of a stable
distribution that better explains the reason for the name stable.
Definition 8.6. A random variable X, or its CDF F, is said to be stable if for every n ≥ 1 there exist constants b_n and a_n such that S_n = X_1 + X_2 + ⋯ + X_n and b_n X_1 + a_n have the same distribution, where X_1, X_2, …, X_n are iid with the common distribution F.
It turns out that b_n has to be n^{1/α} for some 0 < α ≤ 2. The constant α is said to be the index of the stable distribution F. The case α = 2 corresponds to the normal case, and α = 1 corresponds to the Cauchy case. Generally, stable distributions are
heavy tailed. For instance, the only stable laws with a finite variance are the normal
distributions. The following theorem is often useful.
8.10 Some Useful Inequalities 317
Theorem 8.19. If X is stable with an index 0 < α < 2, then for any p > 0, E|X|^p < ∞ if and only if 0 < p < α.

Thus, stable laws with an index α ≤ 1 cannot even have a finite mean; see Feller (1971, pp. 578–579) for a proof of the above theorem.
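The stability property in Definition 8.6 is easy to verify at the level of characteristic functions for the two classical cases: the standard Cauchy has cf φ(t) = e^{−|t|}, so φ(t)^n is the cf of nX_1 (index α = 1), while the standard normal has cf φ(t) = e^{−t²/2}, so φ(t)^n is the cf of √n X_1 (index α = 2); by symmetry, a_n = 0 in both cases. A small numerical check (illustrative, not from the text):

```python
import math

def cauchy_cf(t):
    return math.exp(-abs(t))       # cf of the standard Cauchy

def normal_cf(t):
    return math.exp(-t * t / 2)    # cf of the standard normal

n = 7
for t in (0.3, 1.0, 2.5):
    # Cauchy (alpha = 1): cf of S_n equals the cf of b_n * X_1 with b_n = n
    assert math.isclose(cauchy_cf(t) ** n, cauchy_cf(n * t))
    # Normal (alpha = 2): cf of S_n equals the cf of b_n * X_1 with b_n = sqrt(n)
    assert math.isclose(normal_cf(t) ** n, normal_cf(math.sqrt(n) * t))
print("cf-level stability verified for alpha = 1 and alpha = 2")
```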
Stable distributions are necessarily absolutely continuous, and therefore have densities. However, except for α = 1/2, 1, 3/2, and 2, it is not possible to write a stable density analytically using elementary functions. See Chapter 11 and Chapter 18 for examples of situations where the stable law with α = 1/2 naturally arises. There are various infinite series and integral representations for them. The general stable distribution is parametrized by four parameters: a location parameter μ, a scale parameter σ, a skewness parameter β, and the index parameter, which we have called α. For instance, for the standard normal distribution, which is stable, μ = 0, σ = 1, β = 0, and α = 2. By varying μ, σ, β, and α, one can fit a lot of heavy-tailed
data by using a stable law. However, fitting the four parameters from observed data
is a very hard statistical problem. Feller (1971) is an excellent reference for more
details on stable and infinitely divisible distributions. Infinite divisibility and stable
laws are also treated in more detail in DasGupta (2008).
Here is the result connecting stable laws to limits of partial sums of iid random
variables; see Feller (1971, p. 172).
Theorem 8.20. Let X_1, X_2, … be iid with the common cdf H. Suppose for constant sequences {a_n}, {b_n}, (S_n − a_n)/b_n ⇒ Z ~ F in distribution. Then F is stable.
References to the inequalities below are given in DasGupta (2008), Chapter 34.
Bikelis Nonuniform Inequality. Given independent random variables X_1, …, X_n with mean zero and finite third absolute moments, and F_n(x) = P(Σ_{i=1}^n X_i / B_n ≤ x),

  |F_n(x) − Φ(x)| ≤ A Σ_{i=1}^n E|X_i|³ / (B_n³ (1 + |x|)³)

for all real x, where B_n² = Σ_{i=1}^n Var(X_i) and A is a universal constant.
Exercises
Exercise 8.1 (Symmetrization of U[0, 1]). Let X, Y be iid U[0, 1]. Calculate the characteristic function of X − Y. Is X − Y uniformly distributed on [−1, 1]?
Exercise 8.4. Calculate the cf of the standard logistic distribution with the CDF F(x) = 1/(1 + e^{−x}), −∞ < x < ∞.
Exercise 8.7. For each of the Bin(n, p), Geo(p), and Poi(λ) distributions, use the cf formula given in the text and show that there exists t ≠ 0 at which |ψ(t)| = 1.
Remark. This is true of any lattice distribution.
Exercise 8.8 * (Random Sums). Suppose X_i, i ≥ 1, are iid, and N ~ Poi(λ) is independent of {X_1, X_2, …}. Let S_n = X_1 + ⋯ + X_n, n ≥ 1, and S_0 = 0. Derive a formula for the cf of S_N.
Exercise 8.9 * (Characteristic Function of Products).
(a) Write a general expression for the cf of XY if X, Y are independent.
(b) Find the cf of the product of two independent standard normal variables.
Exercise 8.10 * (Sum of Dice Rolls). Let X_i, 1 ≤ i ≤ n, be iid integer-valued random variables with cf ψ(t). Let S_n = X_1 + ⋯ + X_n, n ≥ 1. By using the inversion formula for integer-valued random variables, derive an expression for P(S_n = k). Hence, derive the pmf of the sum of n independent rolls of a fair die.
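The inversion formula in Exercise 8.10, P(S_n = k) = (1/2π) ∫_{−π}^{π} ψ(t)^n e^{−itk} dt, can be evaluated numerically. The sketch below is illustrative (the grid size m is an arbitrary choice); a midpoint rule over a full period is exact, up to rounding, for trigonometric polynomials of degree below m.

```python
import cmath
import math

def die_cf(t):
    # cf of a single fair-die roll
    return sum(cmath.exp(1j * t * j) for j in range(1, 7)) / 6.0

def pmf_sum_of_dice(n, k, m=20000):
    # P(S_n = k) = (1/2pi) * integral over (-pi, pi) of psi(t)^n e^{-itk} dt,
    # evaluated by a midpoint rule with m points (an arbitrary choice)
    h = 2 * math.pi / m
    total = 0.0
    for i in range(m):
        t = -math.pi + (i + 0.5) * h
        total += (die_cf(t) ** n * cmath.exp(-1j * t * k)).real
    return total * h / (2 * math.pi)

print(pmf_sum_of_dice(2, 7))   # exact answer is 6/36
```

For two dice, the computed value of P(S_2 = 7) agrees with the exact 6/36 to near machine precision.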
Exercise 8.11 * (A Characterization of Normal Distribution). Let X, Y be iid random variables. Give a characteristic function proof that if X + Y and X − Y are independent, then X, Y must be normal.
Exercise 8.12 (Characteristic Function of Compact Support). Consider the example ψ(t) = (1 − |t|) I{|t| ≤ 1} given in the text. Find the density function corresponding to this cf.
Exercise 8.13 (Practice with the Inversion Formula). For each of the following cfs, find the corresponding density function:

  (cosh t)^{−1};  (cosh t)^{−2};  t(sinh t)^{−1};  (−1)^n H_{2n}(t) e^{−t²/2}.
Exercise 8.18. Suppose X_i ~ U[−a_i, a_i] independently, i ≥ 1, with a_i ≤ a < ∞ for all i.
(a) Give a condition on {a_i} such that the Lindeberg–Feller condition holds.
(b) Give an {a_i} such that the Lindeberg–Feller condition does not hold.
(c) * Prove that the Lindeberg–Feller condition holds if and only if Σ_{i=1}^∞ a_i² = ∞.
Exercise 8.19. Let X_i ~ Poi(λ_i) independently, i ≥ 1. Find a sufficient condition on {λ_i} so that the Lindeberg–Feller condition holds. Next, find a sufficient condition for Lyapounov's condition to hold with δ = 1.
Exercise 8.20 (Lindeberg–Feller for Independent Bernoullis). Let X_{ni} ~ Bin(1, t_{ni}) independently, 1 ≤ i ≤ n, and suppose Σ_{i=1}^n t_{ni}(1 − t_{ni}) → ∞. Show that

  (Σ_{i=1}^n X_{ni} − Σ_{i=1}^n t_{ni}) / √(Σ_{i=1}^n t_{ni}(1 − t_{ni})) ⇒ N(0, 1) in distribution.
Exercise 8.21 (Lindeberg–Feller for Independent Exponentials). Let X_i ~ Exp(λ_i) independently, i ≥ 1, and suppose (max_{1≤i≤n} λ_i)/(Σ_{i=1}^n λ_i) → 0. Show that, on standardization to zero mean and unit variance, X̄ converges in distribution to N(0, 1).
Exercise 8.22 (Poisson Limit in Bernoulli Case). Let X_{ni} ~ Bin(1, t_{ni}) independently, 1 ≤ i ≤ n, and suppose Σ_{i=1}^n t_{ni} → λ, 0 < λ < ∞. Where does Σ_{i=1}^n X_{ni} converge in law?
Hint: Look at characteristic functions.
Exercise 8.23. Verify for which of the following cases the Lindeberg–Feller condition holds:
(a) P(X_n = ±1/n) = 1/2.
(b) P(X_n = ±n) = 1/2.
(c) P(X_n = ±2^n) = 1/2.
(d) P(X_n = ±2^n) = 1/2^{n+1}, P(X_n = ±1) = 1/2 − 1/2^{n+1}.
Exercise 8.24. Let X ~ Bin(n, p). Use the Berry–Esseen theorem to bound P(X ≤ k) in terms of the standard normal CDF Φ uniformly in k. Is it possible to give any nontrivial bounds that are uniform in both k and p?
Exercise 8.25. Suppose X_1, …, X_20 are the scores in 20 independent rolls of a fair die. Obtain an upper and a lower bound on P(X_1 + X_2 + ⋯ + X_20 ≤ 75) by using the Berry–Esseen inequality.
References
Berry, A. (1941). The accuracy of the Gaussian approximation to the sum of independent variates,
Trans. Amer. Math. Soc., 49, 122–136.
Bhattacharya, R.N., and Rao, R.R. (1986). Normal Approximation and Asymptotic Expansions,
John Wiley, New York.
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability, Springer, New York.
Esseen, C. (1945). Fourier analysis of distribution functions: A mathematical study, Acta Math.,
77, 1–125.
Feller, W. (1971). An Introduction to Probability Theory and Its Applications, Vol II, Wiley,
New York.
Hall, P. (1992). The Bootstrap and the Edgeworth Expansion, Springer, New York.
Petrov, V. (1975). Limit Theorems of Probability Theory, Oxford University Press, London.
Port, S. (1994). Theoretical Probability for Applications, Wiley, New York.
Serfling, R. (1980). Approximation Theorems of Mathematical Statistics, Wiley, New York.
van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics, Springer-Verlag, New York.
Chapter 9
Asymptotics of Extremes and Order Statistics
Example 9.1 (Uniform Case). As a simple motivational example, consider the case of iid U[0, 1] observations U_1, U_2, …. Fix n and p, 0 < p < 1, and assume for simplicity that k = np is an integer. Writing P(U_{(k)} ≤ p + z√(pq/n)) as the probability that at least k of the U_i fall below p + z√(pq/n), and applying the CLT to this binomial count, this probability is approximately

  1 − Φ(−z) = Φ(z),

where q = 1 − p.
The steps of this argument can be made rigorous, and this shows that in the U[0, 1] case, √n(U_{(k)} − p)/√(pq) converges in distribution to N(0, 1). This is an important example, because the case of a general continuous CDF will follow from the uniform case by making the very convenient quantile transformation (see Chapter 1).

If the observations are iid according to some general continuous CDF F, with a density f, then, by the quantile transformation, we can write X_{(k)} = F^{−1}(U_{(k)}) in law. The transformation u → F^{−1}(u) is a continuously differentiable function at u = p, with the derivative 1/f(F^{−1}(p)), provided 0 < f(F^{−1}(p)) < ∞. By the delta theorem, it then follows that √n(X_{(k)} − F^{−1}(p)) ⇒ N(0, pq/[f(F^{−1}(p))]²) in distribution. This is an
important result. However, the best result is quite a bit stronger. We do not have
to assume that k D kn is exactly equal to np for some fixed p, and we do not
require the CDF F to have a density everywhere. In other words, F need not be
differentiable at all x; all we need is enough local smoothness at the limiting value
of k_n/n. Here is the exact theorem. Note that this theorem applies if we take k = ⌊np⌋, the integer part of np, for some fixed p.
(a)

  lim_{n→∞} P_F(√n(X_{k_n:n} − ξ_p) ≤ t) = P(N(0, p(1 − p)/[F′(ξ_p−)]²) ≤ t)

for t < 0, provided the left derivative F′(ξ_p−) exists and is > 0.
(b)

  lim_{n→∞} P_F(√n(X_{k_n:n} − ξ_p) ≤ t) = P(N(0, p(1 − p)/[F′(ξ_p+)]²) ≤ t)

for t > 0, provided the right derivative F′(ξ_p+) exists and is > 0.
(c) If F′(ξ_p−) = F′(ξ_p+) = f(ξ_p), say, with f(ξ_p) > 0, then

  √n(X_{k_n:n} − ξ_p) ⇒ N(0, p(1 − p)/f²(ξ_p)) in distribution.
Remark. Part (c) of the theorem is the most ubiquitous version and the one most used in applications. The same results hold with X_{k_n:n} replaced by F_n^{−1}(p).

By the previous theorem, P_F(√n(X_{⌊n/2⌋:n} − 1/2) ≤ t) can be approximated by P(N(0, 1/4) ≤ t) when t < 0, and by P(N(0, 1/16) ≤ t) when t > 0. Separate approximations are necessary because F changes slope at x = 1/2.
Here is an important statistical application.
Example 9.4 (Confidence Interval for a Quantile). Suppose X_1, X_2, …, X_n are iid observations from some CDF F, and to keep things simple, assume that F has a density f. Suppose we wish to estimate ξ_p = F^{−1}(p) for some fixed p, 0 < p < 1. Let k = k_n = ⌊np⌋, and let ξ̂_p = X_{k:n}. Then, from the above theorem,

  √n(ξ̂_p − ξ_p) ⇒ N(0, p(1 − p)/[f(ξ_p)]²) in distribution.

The interval has a simplistic appeal and is computed much more easily than an exact interval based on order statistics.
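The asymptotic variance p(1 − p)/f²(ξ_p) is easy to evaluate numerically. The sketch below is illustrative (the helper name and its parameters are my own, and it relies on Python's statistics.NormalDist); for the median of a standard normal it returns (1/4)/φ(0)² = π/2.

```python
import math
from statistics import NormalDist

def asymptotic_quantile_variance(p, quantile_fn, pdf_fn):
    # variance in the limit law of sqrt(n) * (X_(k) - F^{-1}(p)) with k ~ np
    x = quantile_fn(p)
    return p * (1 - p) / pdf_fn(x) ** 2

nd = NormalDist()
v = asymptotic_quantile_variance(0.5, nd.inv_cdf, nd.pdf)
print(v, math.pi / 2)   # the two printed values agree
```

The value π/2 for the normal median is the familiar source of the 2/π asymptotic relative efficiency of the median versus the mean.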
Just as a single central order statistic is asymptotically normal under very mild
conditions, any fixed number of central order statistics are jointly asymptotically
normal, under similar conditions. Furthermore, any two of them are positively corre-
lated, and there is a simple explicit description of the asymptotic covariance matrix.
We present that result next; a detailed proof can be found in Serfling (1980) and
Reiss (1989).
Theorem 9.2 (Several Order Statistics). Let X_{k_i:n}, 1 ≤ i ≤ r, be r specified order statistics, and suppose for some 0 < q_1 < q_2 < ⋯ < q_r < 1, √n(k_i/n − q_i) → 0 as n → ∞. Then

  √n[(X_{k_1:n}, X_{k_2:n}, …, X_{k_r:n}) − (ξ_{q_1}, ξ_{q_2}, …, ξ_{q_r})] ⇒ N_r(0, Σ) in distribution,

where, for i ≤ j,

  σ_ij = q_i(1 − q_j) / (F′(ξ_{q_i}) F′(ξ_{q_j})).
Remark. Note that σ_ij > 0 for i ≠ j in the above theorem. However, as we proved in Chapter 6, what is true for any finite sample is that if X_i, 1 ≤ i ≤ n, are iid from any CDF F then, provided the covariance exists, Cov(X_{i:n}, X_{j:n}) ≥ 0 for any i, j and any n ≥ 2.
An important consequence of the joint asymptotic normality of a finite number
of order statistics is that linear combinations of a finite number of order statistics
will be asymptotically univariate normal. A precise statement is as follows.
Example 9.5 (The Interquartile Range). The 25th and the 75th percentiles of a set
of sample observations are called the first and the third quartile of the sample. The
difference between them gives information about the spread in the distribution from
which the sample values are coming. The difference is called the interquartile range,
and we denote it as IQR. In statistics, suitable multiples of the IQR are often used as
measures of spread, and are then compared to traditional measures of spread, such
as the sample standard deviation s.
Let k_1 = ⌊n/4⌋ and k_2 = ⌊3n/4⌋. Then IQR = X_{k_2:n} − X_{k_1:n}. It follows on some calculation from the above corollary that

  √n(IQR − (ξ_{3/4} − ξ_{1/4})) ⇒ N(0, (1/16)[3/f²(ξ_{3/4}) + 3/f²(ξ_{1/4}) − 2/(f(ξ_{1/4}) f(ξ_{3/4}))]) in distribution.
Here, the notation f means the derivative of F at the particular point. In most cases,
f is simply the density of F .
Specializing to the case when F is the CDF of N(μ, σ²), on some algebra and computation, for normally distributed iid observations,

  √n(IQR − 1.35σ) ⇒ N(0, 2.48σ²) in distribution
  ⟹ √n(IQR/1.35 − σ) ⇒ N(0, (2.48/1.35²)σ²) = N(0, 1.36σ²).

On the other hand, √n(s − σ) ⇒ N(0, .5σ²) (we have previously solved this problem in general for any distribution with four finite moments). The ratio of the asymptotic variances, namely .5/1.36 = .37, is the ARE of the IQR-based estimate relative to s. Thus, for normal data, one is better off using s. For populations with
thicker tails, IQR-based estimates can be more efficient. DasGupta and Haff (2006)
work out the general asymptotic theory for comparison between estimates based on
IQR and s.
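The constants 1.35, 2.48, 1.36, and .37 quoted above can be reproduced from the quartile formula: for N(0, 1) the density at both quartiles equals φ(z) with z = Φ^{−1}(3/4). A short check, assuming Python's statistics.NormalDist:

```python
from statistics import NormalDist

nd = NormalDist()
z = nd.inv_cdf(0.75)            # xi_{3/4} for N(0, 1), about 0.6745
phi = nd.pdf(z)                 # density at both quartiles, by symmetry
iqr_const = 2 * z               # population IQR of N(0, 1), about 1.35
var_iqr = (3 / phi ** 2 + 3 / phi ** 2 - 2 / phi ** 2) / 16  # about 2.48
var_scaled = var_iqr / iqr_const ** 2                        # about 1.36
are = 0.5 / var_scaled                                       # about 0.37
print(iqr_const, var_iqr, var_scaled, are)
```

The printed numbers match the rounded constants in the text, confirming that for normal data the IQR-based scale estimate is markedly less efficient than s.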
328 9 Asymptotics of Extremes and Order Statistics
9.2 Extremes
The asymptotic theory of sample extremes for iid observations is completely dif-
ferent from that of central order statistics. For one thing, the limiting distributions
of extremes are never normal. Sample extremes are becoming increasingly useful
in various statistical problems, such as financial modeling, climate studies, multiple
testing, and disaster planning. As such, the asymptotic theory of extremes is gain-
ing in importance. General references for this section are Galambos (1987), Reiss
(1989), Sen and Singer (1993), and DasGupta (2008).
We start with a familiar easy example that illustrates the different kind of asymp-
totics that extremes have, compared to central order statistics.
Example 9.6. Let U_1, …, U_n be iid U[0, 1]. Then

  P(n(1 − U_{n:n}) > t) = P(1 − U_{n:n} > t/n)
  = P(U_{n:n} < 1 − t/n) = (1 − t/n)^n, if 0 < t < n.

So P(n(1 − U_{n:n}) > t) → e^{−t} for every t > 0, which implies that n(1 − U_{n:n}) ⇒ Exp(1) in distribution. Notice two key things: the limit is nonnormal, and the norming constant is n, not √n. The norming constant in general depends on the tail of the underlying CDF F.
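The convergence (1 − t/n)^n → e^{−t} can also be seen numerically; the illustrative check below uses n = 10^6 (an arbitrary choice):

```python
import math

def tail_prob(n, t):
    # P(n(1 - U_{n:n}) > t) = (1 - t/n)^n, valid for 0 < t < n
    return (1 - t / n) ** n

for t in (0.5, 1.0, 3.0):
    print(t, tail_prob(10 ** 6, t), math.exp(-t))
```

For each t, the exact finite-n probability and the Exp(1) limit agree to several decimal places.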
It turns out that if X_1, X_2, … are iid from some F, then the limit distribution of X_{n:n}, if it exists at all, can be only one of three types. Characterizations are
available and were obtained rigorously by Gnedenko (1943), although some of his
results were previously known to Frechet, Fisher, Tippett, and von Mises.
The usual characterization result, called the convergence of types theorem, is
somewhat awkward to state and can be difficult to verify. Therefore, we first present
more easily verifiable sufficient conditions. We make the assumption until further
notified that F is continuous.
  F^{(j)}(x) = (−1)^{j−1} H_{j−1}(x) f(x), j ≥ 2, and f(x)/(1 − F(x)) ∼ x, as x → ∞,

where H_j(x) is the j-th Hermite polynomial and is of degree j (see Chapter 12). Therefore, for every j, F^{(j)}(x)/F^{(j−1)}(x) ∼ −x. Thus, F = Φ is a CDF of the exponential type.
Example 9.9. All t distributions, including therefore the Cauchy, are of polynomial type. Consider the t-distribution with α degrees of freedom and with median zero. Then it is easily seen that x^α(1 − F(x)) has a finite nonzero limit. Hence, a t_α-distribution is of the polynomial type of order α.
We now present our sufficient conditions for weak convergence of the maximum
to three different types of limit distributions. The first three theorems below are
proved in de Haan (2006, pp. 15–19); also see Sen and Singer (1993). The first
result handles cases such as the uniform on a bounded interval.
Theorem 9.3. Suppose X_1, X_2, … are iid from a CDF with m-th order terminal contact at θ(F) < ∞. Then, for suitable a_n, b_n,

  (X_{n:n} − a_n)/b_n ⇒ G in distribution,

where

  G(t) = e^{−(−t)^{m+1}}, t ≤ 0;  G(t) = 1, t > 0.
Fig. 9.1 True and asymptotic density of maximum in the N(0, 1) case; n = 50, 100
See de Haan (2006, pp. 11–12) or Galambos (1987) for a proof. The distribution with the CDF e^{−e^{−t}} is generally known as the Gumbel distribution or the extreme value distribution.
Example 9.12 (Sample Maximum in Normal Case). The density of the Gumbel distribution is g(t) = e^{−t} e^{−e^{−t}}, −∞ < t < ∞. This distribution has mean m = γ (the Euler constant) and variance v² = π²/6. The asymptotic distribution gives us a formal approximation for the density of X_{n:n}:

  f̂_n(x) = √(2 log n) g(√(2 log n) [x − √(2 log n) + (log log n + log 4π)/(2√(2 log n))]).
Of course, the true density of X_{n:n} is nφ(x)Φ^{n−1}(x). The asymptotic and the true
density are plotted in Fig. 9.1 for n D 50; 100. The asymptotic density is more
peaked at the center, and although it is fairly accurate at the upper tail, it is badly
inaccurate at the center and the lower tail. Its lower tail dies too quickly. Hall (1979)
shows that the rate of convergence of the asymptotic distribution is extremely slow,
in a uniform sense.
Formal approximations to the mean and variance of X_{n:n} are also obtained from the asymptotic distribution. Uniform integrability arguments are required to make these formal approximations rigorous.

  E(X_{n:n}) ≈ √(2 log n) − (log log n + log 4π)/(2√(2 log n)) + γ/√(2 log n),
  Var(X_{n:n}) ≈ π²/(12 log n).
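The quality of the mean approximation can be judged by integrating the exact density nφ(x)Φ^{n−1}(x) numerically. The sketch below is illustrative (the truncation at ±10 and the grid size are arbitrary choices, and it uses Python's statistics.NormalDist); for n = 50 and 100 the exact and approximate means agree to within about 0.06.

```python
import math
from statistics import NormalDist

nd = NormalDist()
EULER_GAMMA = 0.5772156649015329

def exact_max_mean(n, lo=-10.0, hi=10.0, m=4000):
    # E(X_{n:n}) = integral of x * n * phi(x) * Phi(x)^(n-1) dx, midpoint rule
    h = (hi - lo) / m
    total = 0.0
    for i in range(m):
        x = lo + (i + 0.5) * h
        total += x * n * nd.pdf(x) * nd.cdf(x) ** (n - 1)
    return total * h

def approx_max_mean(n):
    # the formal approximation stated in the text
    b = math.sqrt(2 * math.log(n))
    return b - (math.log(math.log(n)) + math.log(4 * math.pi)) / (2 * b) \
             + EULER_GAMMA / b

for n in (50, 100):
    print(n, exact_max_mean(n), approx_max_mean(n))
```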
The moment approximations are not as inaccurate as the density approximation.
We evaluated the exact means of XnWn in Chapter 6 for selected values of n. We
reproduce those values with the approximate mean given by the above formula for
comparison.
We now present the famous trichotomy result of asymptotics for the sample max-
imum in the iid case. Either the sample maximum XnWn cannot be centered and
normalized in any way at all to have a nondegenerate limit distribution, or, it can be,
in which case the limit distribution must be one of exactly three types. Which type
applies to a specific example depends on the support and the upper tail behavior of
the underlying CDF. The three possible types of limit distributions are the following.
  G_{1,r}(x) = e^{−x^{−r}}, x > 0;  G_{1,r}(x) = 0, x ≤ 0;
  G_{2,r}(x) = e^{−(−x)^r}, x ≤ 0;  G_{2,r}(x) = 1, x > 0;
  G_3(x) = e^{−e^{−x}}, −∞ < x < ∞,

where r > 0.
To identify which type applies in a specific case, we need a few definitions that
are related to the upper tail behavior of the CDF F .
Definition 9.5. Suppose X_1, X_2, … are iid with CDF F. We say that F is in the domain of maximal attraction of a CDF G, and write F ∈ D(G), if for some a_n, b_n,

  (X_{n:n} − a_n)/b_n ⇒ G in distribution.
We now state the three main theorems that make up the trichotomy result. See
Chapter 1 in de Haan (2006) for a proof of these three theorems.
Theorem 9.8. F ∈ D(G_{2,r}) iff θ(F) < ∞ and F_e ∈ D(G_{1,r}), where F_e is the CDF F_e(x) = F(θ(F) − 1/x), x > 0. In this case, a_n may be chosen to be θ(F), and b_n such that 1 − F(θ(F) − b_n) ∼ 1/n.
Theorem 9.9. F ∈ D(G_3) iff there is a function u(t) > 0 such that

  lim_{t→θ(F)} [1 − F(t + x u(t))]/[1 − F(t)] = e^{−x}, for all x.
Definition 9.6. Given a CDF F, the mean residual life is the function L(t) = E_F(X − t | X > t).
One specific important result covering some important special cases with un-
bounded upper end points for the support of F is the following.
Example 9.13. Let X ~ Exp(λ). Then, trivially, L(t) is constant. Any constant function is obviously slowly varying, and so is of regular variation. Also, obviously, lim_{t→∞} L(t)/t = 0. Therefore, for any exponential distribution, the CDF F ∈ D(G_3).

For the standard normal,

  L(t) = 1/R(t) − t ∼ 1/t,

where R(t) is the Mills ratio. Therefore, L(t) is of regular variation at t = ∞, and also, obviously, lim_{t→∞} L(t)/t = 0. It follows that Φ ∈ D(G_3), and as a consequence the CDF of any normal distribution also belongs to D(G_3).
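For the Exp(1) case the Gumbel convergence can be written in closed form, P(X_{n:n} − log n ≤ x) = (1 − e^{−x}/n)^n → e^{−e^{−x}}, and checked numerically (an illustrative sketch; n = 10^6 is an arbitrary choice):

```python
import math

def centered_max_cdf(n, x):
    # P(X_{n:n} - log n <= x) for iid Exp(1): F(x + log n)^n = (1 - e^{-x}/n)^n
    return (1 - math.exp(-x) / n) ** n

def gumbel_cdf(x):
    return math.exp(-math.exp(-x))

for x in (-1.0, 0.0, 2.0):
    print(x, centered_max_cdf(10 ** 6, x), gumbel_cdf(x))
```

The two columns agree closely, consistent with Exp(λ) ∈ D(G_3).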
The three different types of distributions that can at all arise as limit distributions of
centered and normalized sample extremes can be usefully unified in a single one-
parameter family of distributions, called the Fisher–Tippett distributions. We do so
in this section, together with some additional simplifications.
Suppose for some real sequences a_n, b_n we have the convergence result that (X_{(n)} − a_n)/b_n ⇒ G in distribution. Then, from the definition of convergence in distribution, we have that at each continuity point x of G,

  P((X_{(n)} − a_n)/b_n ≤ x) → G(x) ⟺ P(X_{(n)} ≤ a_n + b_n x) → G(x)
  ⟺ F^n(a_n + b_n x) → G(x)
  ⟺ n log F(a_n + b_n x) → log G(x)
  ⟺ n log[1 − (1 − F(a_n + b_n x))] → log G(x)
  ⟺ n[1 − F(a_n + b_n x)] → −log G(x)
  ⟺ 1/(n[1 − F(a_n + b_n x)]) → −1/log G(x).

Now, if we consider the case where F is strictly monotone, and let U(t) denote the inverse function of 1/(1 − F(x)), then the last convergence assertion above is equivalent to

  (U(nx) − a_n)/b_n → G^{−1}(e^{−1/x}), x > 0.
We can appreciate the role of this function U.t/ in determining the limit distribu-
tion of the maximum in a given case. Not only that, when this is combined with the
Fisher–Tippett result that the possible choices of G are a very precisely defined one-
parameter family, the statement takes an even more aesthetically pleasing form. The
Fisher–Tippett result and a set of equivalent characterizations for a given member
of the Fisher–Tippett family to be the correct limiting distribution in a specific prob-
lem are given below. See pp. 6–8 in de Haan (2006) for a proof of the Fisher–Tippett
theorem.
Theorem 9.11 (Fisher–Tippett Theorem). Let X_1, X_2, … be iid with the common CDF F. Suppose for some sequences a_n, b_n and some CDF G, (X_{(n)} − a_n)/b_n ⇒ G in distribution. Then G must be a location-scale shift of some member of the one-parameter family

  {G_γ : G_γ(x) = e^{−(1 + γx)^{−1/γ}}, 1 + γx > 0, γ ∈ ℝ};

in the above, G_0 is defined to be G_0(x) = e^{−e^{−x}}, −∞ < x < ∞.
We can reconcile the Fisher–Tippett theorem with the trichotomy result we have previously described. Indeed,

(a) For γ > 0, using the particular location-scale shift G_γ((x − 1)/γ), and denoting 1/γ as α, we end up with G_{1,α}.
(b) For γ = 0, we directly arrive at the Gumbel law G_3.
(c) For γ < 0, using the particular location-scale shift G_γ(−(x + 1)/γ), and denoting −1/γ as α, we end up with G_{2,α}.
Theorem 9.12. Let X_1, X_2, … be iid with the common CDF F. Then the following are all equivalent.

(a) For some real sequences a_n, b_n and some real γ, (X_{(n)} − a_n)/b_n ⇒ G_γ in distribution.
(b) F^n(a_n + b_n x) → G_γ(x) for all real x.
(c) 1/(n[1 − F(a_n + b_n x)]) → (1 + γx)^{1/γ} for all real x.
(d) For some positive function a(t), and any x > 0, (U(tx) − U(t))/a(t) → (x^γ − 1)/γ as t → ∞.

Here, for γ = 0, (x^γ − 1)/γ is defined to be log x. Furthermore, if the condition in this part holds with a specific function a(t), then in part (a) one may choose a_n = U(n) and b_n = a(n).
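The reconciliation in (a)–(c) above, and the γ → 0 limit, can be verified numerically. The sketch below is illustrative; the particular values of α, γ, and x are arbitrary choices.

```python
import math

def G(gamma, x):
    # the Fisher-Tippett family; gamma = 0 is the Gumbel CDF e^{-e^{-x}}
    if gamma == 0:
        return math.exp(-math.exp(-x))
    base = 1 + gamma * x
    if base <= 0:
        return 0.0 if gamma > 0 else 1.0
    return math.exp(-base ** (-1.0 / gamma))

alpha = 2.0
for x in (0.5, 1.0, 3.0):
    # (a): gamma = 1/alpha with the shift (x - 1)/gamma gives the type G_{1,alpha}
    assert math.isclose(G(1 / alpha, (x - 1) * alpha), math.exp(-x ** -alpha))
for x in (-2.0, -0.5):
    # (c): gamma = -1/alpha with the shift -(x + 1)/gamma gives the type G_{2,alpha}
    assert math.isclose(G(-1 / alpha, alpha * (x + 1)), math.exp(-(-x) ** alpha))
# as gamma -> 0, G_gamma approaches the Gumbel CDF G_3
assert abs(G(1e-9, 1.3) - G(0, 1.3)) < 1e-6
print("Fisher-Tippett reconciliation verified")
```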
Exercises
Exercise 9.1 (Sample Maximum). Let X_i, i ≥ 1, be an iid sequence, and X_{(n)} the maximum of X_1, …, X_n. Let θ(F) = sup{x : F(x) < 1}, where F is the common CDF of the X_i. Prove that X_{(n)} → θ(F) almost surely.
Exercise 9.3 (Asymptotic Relative Efficiency of Median). For each of the fol-
lowing cases, derive the asymptotic relative efficiency of the sample median with
respect to the sample mean:
(a) Double exponential(μ, 1).
(b) U[μ − σ, μ + σ].
(c) Beta(α, α).
Exercise 9.4 * (Percentiles with Large and Small Variance). Consider the stan-
dard normal, standard double exponential, and the standard Cauchy densities. For
0 < p < 1, find expressions for the variance in the limiting normal distribution of
Fn1 .p/, and plot them as functions of p. Which percentiles are the most variable,
and which the least?
Exercise 9.5 (Interquartile Range). Find the limiting distribution of the in-
terquartile range for sampling from the normal, double exponential, and Cauchy
distributions.
Exercise 9.6 (Quartile Ratio). Find the limiting distribution of the quartile ratio defined as X_{⌊3n/4⌋:n}/X_{⌊n/4⌋:n} for the exponential, Pareto, and uniform distributions.
Exercise 9.7 * (Best Linear Combination). Suppose X_1, X_2, … are iid with density f(x − θ). For each of the following cases, find the estimate of the form

  p X_{⌊nα_1⌋:n} + p X_{⌊n−nα_1⌋:n} + (1 − 2p) X_{⌊n/2⌋:n}

which has the smallest asymptotic variance:

(a) f(x) = (1/√(2π)) e^{−x²/2};
(b) f(x) = (1/2) e^{−|x|};
(c) f(x) = 1/(π(1 + x²)).
Exercise 9.8 * (Poisson Median Oscillates). Suppose Xi are iid from a Poisson
distribution with mean 1. How would the sample median behave asymptotically?
Specifically, does it converge in probability to some number? Does it converge in
distribution on any centering and norming?
Exercise 9.9 * (Position of the Mean Among Order Statistics). Let X_i be iid standard Cauchy. Let N_n be defined as the number of observations among X_1, …, X_n that are less than or equal to X̄. Let p_n = N_n/n. Show that p_n converges in distribution to the U[0, 1] distribution.
Hint: Use the Glivenko–Cantelli theorem.
Exercise 9.10 * (Position of the Mean Among Order Statistics). Let X_i be iid standard normal. Let N_n be defined as the number of observations among X_1, …, X_n that are less than or equal to X̄. Let p_n = N_n/n. Show that, on suitable centering and norming, p_n converges to a normal distribution, and find the variance of this limiting normal distribution.
Exercise 9.11 (Domain of Attraction for Sample Maximum). In what domain of (maximal) attraction are the following distributions? (a) F = t_α, α > 0; (b) F = χ²_k; (c) F = (1 − ε)N(0, 1) + εC(0, 1).
Exercise 9.12 (Maximum of the Absolute Values). Let X_i be iid standard normal, and let M_n = max_{1≤i≤n} |X_i|, n ≥ 1. Find constants a_n, b_n and a CDF G such that (M_n − a_n)/b_n ⇒ G in distribution.
Exercise 9.13 (Limiting Distribution of the Second Largest).
(a) Let X_i be iid U[0, 1] random variables. Find constants a_n, b_n and a CDF G such that (X_{n−1:n} − a_n)/b_n ⇒ G in distribution;
(b) * Let X_i be iid standard Cauchy random variables. Find constants a_n, b_n and a CDF G such that (X_{n−1:n} − a_n)/b_n ⇒ G in distribution.
Exercise 9.14. Let X_1, X_2, … be iid Exp(1) samples.
(a) Find a sequence a_n such that X_{1:n} − a_n → 0 almost surely.
(b) * Find a sequence c_n such that X_{1:n}/c_n → 1 almost surely.
Hint: Use without proof that X_{1:n} and X_{n:n} are asymptotically independent, and the result in the text on the limiting distribution of X_{n:n} in the N(0, 1) case. Finally, calculate in closed form the appropriate convolution density.
Hint: They do. Either use direct calculations or use uniform integrability arguments.
Exercise 9.21. Suppose X_i, i ≥ 1, are iid U[0, 1]. Does E[n(1 − X_{n:n})]^k have a limit for all k?
References
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability, Springer, New York.
DasGupta, A. and Haff, L. (2006). Asymptotic expansions for correlations between different mea-
sures of spread, JSPI, 136, 2197–2212.
DasGupta, A. and Rubin, H. (2005). Estimation of binomial parameters when N; p are both un-
known, JSPI, Felicitation Volume for Herman Chernoff, 130, 391–404.
David, H.A. (1980). Order Statistics, Wiley, New York.
de Haan, L. (2006). Extreme Value Theory: An Introduction, Springer, New York.
Galambos, J. (1987). Asymptotic Theory of Extreme Order Statistics, Wiley, New York.
Gnedenko, B.V. (1943). Sur la distribution limite du terme maximum d'une série aléatoire, Annals of Math., 44, 423–453.
Hall, P. (1979). On the rate of convergence of normal extremes, J. Appl. Prob., 16(2), 433–439.
Reiss, R. (1989). Approximate Distributions of Order Statistics, Springer, New York.
Sen, P.K. and Singer, J. (1993). Large Sample Methods in Statistics, Chapman and Hall, New York.
Serfling, R. (1980). Approximation Theorems of Mathematical Statistics, Wiley, New York.
Chapter 10
Markov Chains and Applications
The set S is called the state space of the chain. If S is a finite set, the chain is called a finite state Markov chain. X_0 is called the initial state. Without loss of generality, we can denote the elements of S as 1, 2, …, although in some examples we may use the original labeling of the states to avoid confusion.
Definition 10.2. The distribution of the initial state X_0 is called the initial distribution of the chain. We denote the pmf of the initial distribution as μ_i = P(X_0 = i).
Definition 10.4. Let {X_n} be a stationary Markov chain. Then the probabilities p_ij = P(X_{n+1} = j | X_n = i) are called the one-step transition probabilities, or simply transition probabilities. The matrix P = ((p_ij)) is called the transition probability matrix.
Definition 10.5. Let {X_n} be a stationary Markov chain. Then the probabilities p_ij(n) = P(X_{n+m} = j | X_m = i) = P(X_n = j | X_0 = i) are called the n-step transition probabilities, and the matrix P(n) = ((p_ij(n))) is called the n-step transition probability matrix.
Remark. If the state space of the chain is finite and has, say, t elements, then the transition probability matrix P is a t × t matrix. Note that Σ_{j∈S} p_ij is always one. A matrix with this property is called a stochastic matrix.
Definition 10.6. A t × t square matrix P is called a stochastic matrix if for each i, Σ_{j=1}^t p_ij = 1. It is called doubly stochastic or bistochastic if, in addition, for every j, Σ_{i=1}^t p_ij = 1. Thus, a transition probability matrix is always a stochastic matrix.
Markov chains are widely used as models for discrete time sequences that exhibit
local dependence. Part of the reason for this popularity of Markov chains as a model
is that a coherent, complete, and elegant theory exists for how a chain evolves. We
describe below examples from numerous fields where a Markov chain is either the
correct model, or is chosen as a model.
10.2 Examples and Various Applications as a Model
Example 10.1 (Weather Pattern). Suppose that in some particular city, any day is either dry or wet. If it is dry on some day, it remains dry the next day with probability α, and will be wet with the residual probability 1 − α. On the other hand, if it is wet on some day, it remains wet the next day with probability β, and becomes dry with probability 1 − β. Let X_0, X_1, … be the sequence of states of the weather, with X_0 being the state on the initial day (on which observation starts). Then {X_n} is a two-state stationary Markov chain with the transition probability matrix

P = [   α    1 − α
      1 − β    β   ].
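As a quick numerical sketch (not from the text; the values α = β = 0.8 anticipate the numbers used later in Example 10.9), multi-step transition probabilities for this two-state chain can be obtained by powering P:

```python
import numpy as np

# Two-state weather chain: state 1 = dry, state 2 = wet.
# alpha = beta = 0.8 are illustrative values, matching Example 10.9 below.
alpha, beta = 0.8, 0.8
P = np.array([[alpha, 1 - alpha],
              [1 - beta, beta]])

# Rows of a stochastic matrix must sum to one.
assert np.allclose(P.sum(axis=1), 1.0)

# Probability that day 3 is dry given day 0 is dry: entry (1,1) of P^3.
P3 = np.linalg.matrix_power(P, 3)
print(P3[0, 0])   # 0.608, the value quoted in Example 10.9
```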
Example 10.3 (An Urn Model Example). Two balls, say A, B, are initially in urn 1, and two others, say C, D, are in urn 2. In each successive trial, one ball is chosen at random from urn 1, and one independently and also at random from urn 2, and these balls switch urns. We let X_n denote the vector of locations of the four balls A, B, C, D, in that order of the balls, after the nth trial. Thus, X_10 = 1122 means that after the tenth trial, A, B are located in urn 1, and C, D in urn 2, and so on. Note that X_0 = 1122. Two of the four balls are always in urn 1, and two in urn 2. Thus, the possible number of states is (4 choose 2) = 6. They are 1122, 1212, 1221, 2112, 2121, 2211. Then {X_n} is a six-state stationary Markov chain. What are the transition probabilities?
For notational convenience, denote the above six states as 1, 2, …, 6, respectively. For the state of the chain to move from state 1 to state 2 in one trial, B, C have to switch their urns. This will happen with probability .5 × .5 = .25. As another example, for the state of the chain to move from state 1 to state 6, all four balls must switch their urns. This is not possible. Therefore, this transition probability is zero. Also, note that if the chain is in some state now, it cannot remain in that same
state after the next trial. Thus, all diagonal elements in the transition probability
matrix must be zero. Indeed, the transition probability matrix is
P = [  0   .25  .25  .25  .25   0
      .25   0   .25  .25   0   .25
      .25  .25   0    0   .25  .25
      .25  .25   0    0   .25  .25
      .25   0   .25  .25   0   .25
       0   .25  .25  .25  .25   0  ].
Notice that there are really three distinct rows in P, each occurring twice. It is easy to argue that that is how it should be in this particular urn experiment. Also note the very interesting fact that in each row, as well as in each column, there are two zeroes, and the nonzero entries obviously add to one. This is an example of a transition probability matrix that is doubly stochastic. Markov chains with a doubly stochastic transition probability matrix show a unified long run behavior. By definition, initially the chain is in state 1, and so P(X_0 = 1) = 1 and P(X_0 = i) = 0 for all i ≠ 1. However, after many trials, the state of the chain would be any of the six possible states with essentially an equal probability; that is, P(X_n = i) ≈ 1/6 for each possible state i for large n. This unifying long run behavior of Markov chains with a doubly stochastic transition probability matrix is a significant result of wide applications in Markov chain theory.
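This long run claim is easy to check numerically; the following sketch (illustrative, using NumPy) verifies that P is doubly stochastic and that the rows of P^n approach the uniform vector (1/6, …, 1/6):

```python
import numpy as np

# Transition matrix of the ball-switching urn chain (Example 10.3).
P = np.array([
    [0,   .25, .25, .25, .25, 0  ],
    [.25, 0,   .25, .25, 0,   .25],
    [.25, .25, 0,   0,   .25, .25],
    [.25, .25, 0,   0,   .25, .25],
    [.25, 0,   .25, .25, 0,   .25],
    [0,   .25, .25, .25, .25, 0  ],
])

# Doubly stochastic: both row sums and column sums equal one.
assert np.allclose(P.sum(axis=0), 1) and np.allclose(P.sum(axis=1), 1)

# After many trials, every row of P^n is nearly uniform on the six states.
Pn = np.linalg.matrix_power(P, 20)
print(Pn[0])   # each entry close to 1/6
```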
Example 10.4 (Urn Model II: Ehrenfest Model). This example has wide applications in the theory of heat transfer. The mathematical model is that we initially have m balls, some in one urn, say urn I, and the rest in another urn, say urn II. At each subsequent time n = 1, 2, …, one ball among the m balls is selected at random. If the ball is in urn I, with probability α it is transferred to urn II, and with probability 1 − α it continues to stay in urn I. If the ball is in urn II, with probability β it is transferred to urn I, and with probability 1 − β it continues to stay in urn II.
Let X_0 be the number of balls initially in urn I, and X_n the number of balls in urn I after time n. Then {X_n} is a stationary Markov chain with state space S = {0, 1, …, m}. If there are, say, i balls in urn I at a particular time, then at the next instant urn I could lose one ball, gain one ball, or neither lose nor gain any ball. It loses a ball if one of its i balls gets selected for possible transfer, and then the transfer actually happens; so p_{i,i−1} = (i/m)α. Using this simple argument, we get the one-step transition probabilities as

p_{i,i−1} = (i/m)α;   p_{i,i+1} = ((m − i)/m)β;   p_{ii} = 1 − (i/m)α − ((m − i)/m)β,

and p_ij = 0 if j ≠ i − 1, i, i + 1.
As a specific example, suppose m = 7, and α = β = 1/2. Then the transition matrix on the state space S = {0, 1, …, 7} can be worked out by using the formulas given just above, and it is
P = [ 1/2   1/2    0     0     0     0     0     0
      1/14  1/2   3/7    0     0     0     0     0
       0    1/7   1/2   5/14   0     0     0     0
       0     0    3/14  1/2   2/7    0     0     0
       0     0     0    2/7   1/2   3/14   0     0
       0     0     0     0    5/14  1/2   1/7    0
       0     0     0     0     0    3/7   1/2   1/14
       0     0     0     0     0     0    1/2   1/2  ].
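The formulas for p_{i,i−1}, p_{i,i+1}, and p_{ii} mechanize easily; the following sketch (an illustration, using exact rational arithmetic) rebuilds the matrix above for m = 7, α = β = 1/2:

```python
from fractions import Fraction

# Ehrenfest chain: p_{i,i-1} = (i/m)alpha, p_{i,i+1} = ((m-i)/m)beta,
# p_{ii} = 1 - (i/m)alpha - ((m-i)/m)beta, all other entries zero.
def ehrenfest(m, alpha, beta):
    P = [[Fraction(0)] * (m + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        down = Fraction(i, m) * alpha      # urn I loses a ball
        up = Fraction(m - i, m) * beta     # urn I gains a ball
        if i > 0:
            P[i][i - 1] = down
        if i < m:
            P[i][i + 1] = up
        P[i][i] = 1 - down - up
    return P

P = ehrenfest(7, Fraction(1, 2), Fraction(1, 2))
print(P[1][:3])   # [Fraction(1, 14), Fraction(1, 2), Fraction(3, 7)]
```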
X_2 = Z_0 + Z_1 + Z_2 − 2k,  if Z_0 ≥ k, Z_0 + Z_1 ≥ 2k;
    = Z_2,                    if Z_0 ≥ k, Z_0 + Z_1 < 2k;
    = Z_1 + Z_2 − k,          if Z_0 < k, Z_1 ≥ k;
    = Z_2,                    if Z_0 < k, Z_1 < k,

and so on.
If Z_i, i ≥ 0, are iid, then {X_n} forms a stationary Markov chain. The state space of this chain is {0, 1, 2, …}. What is the transition probability matrix?
For simplicity, take k = 1. For example, P(X_2 = 1 | X_1 = 0) = P(Z_1 = 1 | Z_0 = 0) = P(Z_1 = 1) = p_1 (say). On the other hand, as another example, P(X_2 = 2 | X_1 = 4) = 0. If we denote the common mass function of the Z_i by P(Z_i = j) = p_j, j ≥ 0, then the transition probability matrix is

P = [ p_0  p_1  p_2  p_3  ⋯
      p_0  p_1  p_2  p_3  ⋯
       0   p_0  p_1  p_2  ⋯
       0    0   p_0  p_1  ⋯
       0    0    0   p_0  ⋯
       ⋮                     ].
Example 10.7 (An Example from Genetics). Many traits in organisms, for example,
humans, are determined by genes. For example, eye color in humans is determined
by a pair of genes. Genes can come in various forms or versions, which are called
alleles. An offspring receives one allele from each parent. A parent contributes one
of his or her alleles with equal probability to an offspring, and the parents make their
contributions independently. Certain alleles dominate over others. For example, the
allele for blue eye color is dominated by the allele for brown eye color. The allele
for blue color would be called recessive, and the allele for brown eye color would
be called dominant. If we denote these as b, B respectively, then a person may have
the pair of alleles BB, Bb, or bb. They are called the dominant, hybrid, and recessive
genotypes. We denote them as d, h, r, respectively. Consider now the sequence of
genotypes of descendants of an initial individual, and denote the sequence as fXn g;
for any n; Xn must be one of d, h, r (we may call them 1, 2, 3).
Consider now a person with an unknown genotype (X_0) mating with a known hybrid. Suppose he has genotype d. He will necessarily contribute B to the offspring. Therefore, the offspring can only have genotype d or h, and not r. It will be d if the offspring also gets the B allele from the mother, and it will be h if the offspring gets b from the mother. The chance of each is 1/2. Therefore, the transition probabilities are P(X_1 = d | X_0 = d) = P(X_1 = h | X_0 = d) = 1/2, and P(X_1 = r | X_0 = d) = 0.
Suppose X_0 = h. Then the father contributes B or b with probability 1/2 each, and so does the mother, who was assumed to be a hybrid. So the probabilities that X_1 = d, h, r are, respectively, 1/4, 1/2, 1/4.
If X_0 = r, then X_1 can only be h or r, with probability 1/2 each. So, if we assume this same mating scheme over generations, then {X_n} forms a three-state stationary Markov chain with the transition probability matrix

P = [ .5   .5   0
      .25  .5  .25
       0   .5   .5 ].
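A small simulation (a sketch, not part of the text) makes the long run behavior of this genotype chain visible; the empirical fractions of time spent in d, h, r settle near (1/4, 1/2, 1/4):

```python
import numpy as np

# Genotype chain of Example 10.7: states d, h, r (coded 0, 1, 2),
# repeated mating with a hybrid.
P = np.array([[.5,  .5,  0],
              [.25, .5,  .25],
              [0,   .5,  .5]])

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility
state, counts = 0, np.zeros(3)   # start from genotype d
for _ in range(10000):
    state = rng.choice(3, p=P[state])
    counts[state] += 1
print(counts / counts.sum())     # long-run fractions near (1/4, 1/2, 1/4)
```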
10.3 Chapman–Kolmogorov Equation
Example 10.8 (Simple Random Walk). Consider a particle starting at the origin at time zero, and making independent movements of one step to the right or one step to the left at each successive time instant 1, 2, …. Assume that the particle moves to the right at any particular time with probability p, and to the left with probability q = 1 − p. The mathematical formulation is that the successive movements are iid random variables X_1, X_2, …, with common pmf P(X_i = 1) = p, P(X_i = −1) = q. The particle's location after the nth step has been taken is denoted as S_n = X_0 + X_1 + ⋯ + X_n = X_1 + ⋯ + X_n, assuming that X_0 = 0 with probability one. At each time the particle can move by just one unit; thus {S_n} is a stationary Markov chain with state space S = Z = {…, −2, −1, 0, 1, 2, …}, and with the transition probabilities p_{i,i+1} = p, p_{i,i−1} = q, and p_ij = 0 otherwise.
Proof. A verbal proof is actually the most easily understood. In order to get to state j from state i in m + n steps, the chain must go to some state k ∈ S in m steps, and then travel from that k to state j in the next n steps. By adding over all possible k ∈ S, we get the Chapman–Kolmogorov equation.
An extremely important corollary is the following result.
Corollary. Let P(n) denote the n-step transition probability matrix. Then, for all n ≥ 2, P(n) = P^n, where P^n denotes the usual nth power of P.
A further important consequence is that we can now write an explicit formula for
the pmf of the state of the chain at a given time n.
Proposition. Let {X_n} be a stationary Markov chain with the state space S, and one-step transition probability matrix P. Fix n ≥ 1. Then, μ_n(i) = P(X_n = i) = Σ_{k∈S} p_{ki}(n) P(X_0 = k). In matrix notation, if μ = (μ_1, μ_2, …) denotes the row vector of the initial probabilities P(X_0 = k), k = 1, 2, …, and if μ_n denotes the row vector of probabilities P(X_n = i), i = 1, 2, …, then μ_n = μP^n.
This is an important formula, because it lays out how to explicitly find the distribution of X_n from the initial distribution and the one-step transition matrix P.
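A sketch of the formula μ_n = μP^n in use (the chain is the two-state weather chain with α = β = .8; the initial distribution below is a made-up example):

```python
import numpy as np

# Marginal pmf of X_n from the initial distribution mu and P, via mu_n = mu P^n.
P = np.array([[0.8, 0.2],
              [0.2, 0.8]])
mu = np.array([0.3, 0.7])   # hypothetical initial distribution

n = 4
mu_n = mu @ np.linalg.matrix_power(P, n)
print(mu_n)   # pmf of X_4; the entries sum to one
```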
Example 10.9 (Weather Pattern). Consider once again the weather pattern example
with the one-step transition probability matrix

P = [   α    1 − α
      1 − β    β   ].

Next suppose that we want to know what the probability is that Saturday and Sunday will both be dry if Wednesday is dry. In notation, we now want to find

P(X_3 = 1, X_4 = 1 | X_0 = 1)
= P(X_3 = 1 | X_0 = 1) P(X_4 = 1 | X_3 = 1, X_0 = 1)
= P(X_3 = 1 | X_0 = 1) P(X_4 = 1 | X_3 = 1) = .608 × .8
= .4864.
Therefore,

P(it will be dry one week from the initial day) = .513997 μ_1 + .486003 μ_2 = .5 + .013997(μ_1 − μ_2);
P(it will be dry two weeks from the initial day) = .500392 μ_1 + .499608 μ_2 = .5 + .000392(μ_1 − μ_2);
P(it will be dry three weeks from the initial day) = .5 + .000011(μ_1 − μ_2);
P(it will be dry four weeks from the initial day) = .5.
by direct computation,
P^2 = [ .66  .1   .24
        .06  .82  .12
        .16  .18  .66 ].
Hence, the probabilities that a Labor voter in this election will vote Labor, Conservative, or Independent two elections from now are 66%, 10%, and 24%, respectively.
We also see from P 2 that a Conservative voter will vote Conservative two elections
from now with 82% probability and has a chance of just 6% to switch to Labor, and
so on.
Example 10.11 (Hopping Mosquito). Consider again the hopping mosquito example previously introduced. The goal of this example is to find the n-step transition probability matrix P^n for a general n. We describe a general method for finding P^n using a linear algebra technique, known as diagonalization of a matrix. If a square matrix P (not necessarily symmetric) of order t × t has t distinct eigenvalues, say δ_1, …, δ_t, which are complex numbers in general, and if u_1, …, u_t are a set of t eigenvectors of P corresponding to the eigenvalues δ_1, …, δ_t, then define a matrix U as U = (u_1, …, u_t); that is, U has u_1, …, u_t as its t columns. Then, U has the property that U^{−1}PU = L, where L is the diagonal matrix with the diagonal elements δ_1, …, δ_t. Now, just note that

U^{−1}PU = L ⇒ P = ULU^{−1} ⇒ P^n = UL^nU^{−1} for all n ≥ 2.

Therefore, we only need to compute the eigenvalues of P, and the matrix U of a set of eigenvectors. As long as the eigenvalues are distinct, the n-step transition matrix will be provided by the unified formula P^n = UL^nU^{−1}.
The eigenvalues of our P are

δ_1 = −i/2,  δ_2 = i/2,  δ_3 = 1;

note that they are distinct. The eigenvectors (one set) turn out to be

u_1 = (−1 − i, (i − 1)/2, 1)′,  u_2 = (i − 1, −(i + 1)/2, 1)′,  u_3 = (1, 1, 1)′.

Therefore,

U = [  −1 − i       i − 1      1
      (i − 1)/2  −(i + 1)/2    1
         1           1         1  ],

U^{−1} = [  (3i − 1)/10   −(2i + 1)/5   (3 + i)/10
           −(1 + 3i)/10    (2i − 1)/5   (3 − i)/10
               1/5            2/5          2/5     ].

This leads to

P^n = U [ (−i/2)^n     0       0
             0       (i/2)^n   0
             0         0       1 ] U^{−1},

with U, U^{−1} as above.
10.4 Communicating Classes
For example,

p_11(n) = (−1 − i) · ((3i − 1)/10) · (−i/2)^n + (i − 1) · (−(1 + 3i)/10) · (i/2)^n + 1/5
        = 1/5 + ((2 − i)/5)(−i/2)^n + ((2 + i)/5)(i/2)^n;

this is the probability that the mosquito will be back on the forehead after n moves if it started at the forehead. If we take n = 2, we get, on doing the complex multiplication, p_11(2) = 0. We can logically verify that p_11(2) must be zero by just looking at the one-step transition matrix P. However, if we take n = 3, then the formula will give p_11(3) = 1/4 > 0. Indeed, if we take n = 3, we get
P^3 = [ 1/4  1/4  1/2
        1/4  3/8  3/8
        1/8  1/2  3/8 ].
We notice that every element in P^3 is strictly positive. That is, no matter where the mosquito was initially seated, by the time it has made three moves, we cannot rule out any location for where it will be: it can now be anywhere. In fact, this property of a transition probability matrix is so important in Markov chain theory that it has a name. It is the first definition in our next section.
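The diagonalization recipe P^n = UL^nU^{−1} can be checked numerically; the sketch below (using NumPy's eigendecomposition rather than the hand computation) reproduces p_11(2) = 0 and p_11(3) = 1/4:

```python
import numpy as np

# n-step transition matrix via diagonalization, P^n = U L^n U^{-1},
# for the hopping-mosquito chain of Example 10.11.
P = np.array([[0, 1, 0],
              [0, .5, .5],
              [.5, 0, .5]])

delta, U = np.linalg.eig(P)     # eigenvalues delta, eigenvectors as columns of U

def Pn(n):
    L = np.diag(delta ** n)
    # tiny imaginary round-off is discarded; P^n is real
    return (U @ L @ np.linalg.inv(U)).real

print(round(Pn(2)[0, 0], 10))   # p_11(2) = 0, as in the text
print(round(Pn(3)[0, 0], 10))   # p_11(3) = 1/4, as in the text
```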
Definition 10.7. Let {X_n} be a stationary Markov chain with transition probability matrix P. It is called a regular chain if there exists a universal n_0 > 0 such that p_ij(n_0) > 0 for all i, j ∈ S.
So, what we just saw in the last example is that the mosquito is engaged in move-
ments according to a regular Markov chain.
A weaker property is that of irreducibility.
Definition 10.8. Let {X_n} be a stationary Markov chain with transition probability matrix P. It is called an irreducible chain if for any i, j ∈ S, i ≠ j, there exists n_0 > 0, possibly depending on i, j, such that p_ij(n_0) > 0.
Irreducibility means that it is possible to travel from any state to any other state,
however many steps it might take, depending on which two states are involved.
Another terminology also commonly used is that of communicating.
Definition 10.9. Let {X_n} be a stationary Markov chain with transition probability matrix P. Let i, j be two specific states. We say that i communicates with j (i ↔ j) if there exists n_0 > 0, possibly depending on i, j, such that p_ij(n_0) > 0, and also, there exists n_1 > 0, possibly depending on i, j, such that p_ji(n_1) > 0.
In words, a pair of specific states i, j are communicating states if it is possible to travel back and forth between i and j, however many steps it might take, depending on i, j, and possibly even depending on the direction of the journey, that is, whether the direction is from i to j, or from j to i.
By convention, we say that i ↔ i. Thus, ↔ defines an equivalence relation on the state space S:

i ↔ i;  i ↔ j ⇒ j ↔ i;  i ↔ j, j ↔ k ⇒ i ↔ k.

Therefore, like all equivalence relations, ↔ partitions the state space S into mutually exclusive subsets of S, say C_1, C_2, …. These partitioning subsets C_1, C_2, … are called the communicating classes of the chain.
Here is an example to help illustrate the notion.
Example 10.12 (Identifying Communicating Classes). Consider the one-step transition matrix

P = [ .75  .25   0    0    0   0
       0    0    1    0    0   0
      .25   0    0   .25  .5   0
       0    0    0   .75  .25  0
       0    0    0    0    0   1
       0    0    0    0    1   0 ].
We can observe more interesting things about the chain from the transition matrix. Consider, for example, state 5. If you are in state 5, then your transitions would have to be 5 → 6 → 5 → 6 → ⋯. So, starting at 5, you can return to 5 only at times n = 2k, k ≥ 1. In such a case, we call the state periodic with period equal to two. Likewise, state 6 is also periodic with period equal to two. An exercise asks you to show that all states within the same communicating class always have the same period.
It is useful to have a formal definition, because there is an element of subtlety about
the exact meaning of the period of a state.
Definition 10.10. A state i ∈ S is said to have period d (> 1) if the greatest common divisor of all positive integers n for which p_ii(n) > 0 is the given number d. If a state i has no period d > 1, it is called an aperiodic state. If every state of a chain is aperiodic, the chain itself is called an aperiodic chain.
Example 10.13 (Computing the Period). Consider the hopping mosquito example again. First, let us look at state 2. Because p_22 = 1/2 > 0, we can go to 2 from 2 in any number of steps: p_22(n) > 0 for all n ≥ 1. So the set of integers n for which p_22(n) > 0 is {1, 2, 3, 4, …}, and the gcd (greatest common divisor) of these integers is one. So 2 is an aperiodic state. Because {1, 2, 3} is a communicating class, we then must have that 1 is also an aperiodic state. Let us see it. Note that in fact we cannot go to 1 from 1 in one or two steps. But we can go from 1 to 2, then from 2 to 3, and then from 3 to 1. That takes three steps. But we can also go from 1 to 1 in n steps for any n > 3, because once we go from 1 to 2, we can stay there with a positive probability for any number of times, and then go from 2 to 3, and from 3 to 1. So the set of integers n for which p_11(n) > 0 is {3, 4, 5, 6, …}, and we now see that 1 is an aperiodic state. Similarly, one can verify that 3 is also an aperiodic state.
Remark. It is important to note the subtle point that just because a state i has period d, it does not mean that p_ii(d) > 0. Suppose, for example, that we can travel from i back to i in steps 6, 9, 12, 15, 18, …, which have gcd equal to 3, and yet p_ii(3) is not greater than zero.
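The gcd definition of the period translates directly into a computation; this sketch checks p_ii(n) > 0 up to a cutoff (an assumption: for chains whose first returns occur beyond the cutoff, max_n must be raised):

```python
from math import gcd
import numpy as np

# Period of state i: gcd of all n <= max_n with p_ii(n) > 0.
def period(P, i, max_n=50):
    g = 0
    Q = np.eye(len(P))
    for n in range(1, max_n + 1):
        Q = Q @ P                  # Q now holds the n-step matrix P^n
        if Q[i, i] > 1e-12:
            g = gcd(g, n)
    return g

# Hopping-mosquito chain: every state turns out to be aperiodic.
P = np.array([[0, 1, 0],
              [0, .5, .5],
              [.5, 0, .5]])
print([period(P, i) for i in range(3)])   # [1, 1, 1]
```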
A final definition for now is that of an absorbing state. Absorption means that
once you have gotten there, you will remain there forever. The formal definition is
as follows.
Definition 10.11. A state i ∈ S is called an absorbing state if p_ij(n) = 0 for all n and for all j ≠ i. Equivalently, i ∈ S is an absorbing state if p_ii = 1; that is, the singleton set {i} is a closed class.
Remark. Plainly, if a chain has an absorbing state, then it cannot be regular, and cannot even be irreducible. Absorption is fundamentally interesting in gambling scenarios. A gambler may decide to quit the game as soon as his net fortune becomes zero. If we let X_n denote the gambler's net fortune after the nth play, then zero will be an absorbing state for the chain {X_n}. For chains that have absorbing states, the time taken to get absorbed is of basic interest.
The problem of the gambler's ruin is a classic and entertaining example in the theory of probability. It is an example of a Markov chain with absorbing states. Answers to numerous interesting questions about the gambler's ruin have long been worked out; the material is classic. We provide an introductory exposition to this interesting problem.
Imagine a gambler who goes to a casino with $a in his pocket. He will play a
game that pays him one dollar if he wins the game, or has him pay one dollar to the
house if he loses the game. He will play repeatedly until he either goes broke, or his
total fortune increases from the initial amount a he entered with to a prespecified
larger amount b (b > a). The idea is that he is forced to quit if he goes broke, and
he leaves of his own choice if he wins handsomely and is happy to quit. We can ask
numerous interesting questions. But let us just ask what is the probability that he
will leave because he goes broke.
This is really a simple random walk problem again. Let the gambler's initial fortune be S_0 = a. Then, the gambler's net fortune after n plays is S_n = S_0 + X_1 + X_2 + ⋯ + X_n, where the X_i are iid with the distribution P(X_i = 1) = p, P(X_i = −1) = q = 1 − p. We make the realistic assumption that p < q, that is, p < 1/2; the game is favorable to the house and unfavorable to the player. Let p_a denote the probability that the player will leave broke if he started with $a as his initial fortune. In the following argument, we hold b fixed, and consider p_a as a function of a, with a varying between 0 and the fixed b: 0 ≤ a ≤ b. Note that p_0 = 1 and p_b = 0. Then, p_a satisfies the recurrence relation

p_a = p p_{a+1} + (1 − p) p_{a−1},  0 < a < b.
The argument is that if the player wins the very first time, which would happen
with probability p, then he can eventually go broke with probability paC1 , because
the first win increases his fortune by one dollar from a to a C 1; but, if the player
loses the very first time, which would happen with probability 1 p, then he can
eventually go broke with probability pa1 , because the first loss will decrease his
fortune by one dollar from a to a 1.
Rewrite the above equation in the form

p_{a+1} − p_a = ((1 − p)/p)(p_a − p_{a−1}).
However, we can find p_1 explicitly by using the last equation with the choice a = b − 1, which gives

0 = p_b = 1 + (p_1 − 1) · [((1 − p)/p)^b − 1] / [((1 − p)/p) − 1].
Substituting the expression we get for p_1 from here into the formula for p_{a+1}, we have

p_{a+1} = [(q/p)^b − (q/p)^{a+1}] / [(q/p)^b − 1].

This last formula actually gives an expression for p_x for a general x ≤ b; we can use it with x = a in order to write the final formula

p_a = [(q/p)^b − (q/p)^a] / [(q/p)^b − 1].
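The closed-form ruin probability is easy to evaluate and to sanity-check against the recurrence p_a = p p_{a+1} + (1 − p) p_{a−1}; the numbers below (a = 10, b = 20, p = 0.48) are illustrative only:

```python
# Ruin probability p_a = ((q/p)^b - (q/p)^a) / ((q/p)^b - 1), valid for p != 1/2.
def ruin_prob(a, b, p):
    r = (1 - p) / p   # q/p
    return (r ** b - r ** a) / (r ** b - 1)

# Boundary conditions: certain ruin from 0, no ruin once b is reached.
assert ruin_prob(0, 20, 0.48) == 1.0
assert ruin_prob(20, 20, 0.48) == 0.0

print(ruin_prob(10, 20, 0.48))   # ruin probability starting with $10
```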
Example 10.14 (Simple Random Walk). Let X_i, i ≥ 1, be iid random variables, with P(X_i = ±1) = 1/2, and let S_n = X_0 + Σ_{i=1}^n X_i, n ≥ 0, with the understanding that X_0 = 0. Then {S_n}, n ≥ 0, is a stationary Markov chain with initial state zero, and state space S = {…, −2, −1, 0, 1, 2, …}.
A graph of the first 50 steps of a simulated random walk is given in Fig. 10.1. By carefully reading the plot, we see that the first passage to zero, the initial state, occurs at T_0 = 4. We can also see from the graph that the walk returns to zero a total of nine times within these first 50 steps. The first passage to j = 5 occurs at T_{0,5} = 9. The first passage to the set D = {…, −9, −6, −3, 3, 6, 9, …} occurs at T_{0,D} = 7.

[Fig. 10.1. The first 50 steps of a simulated simple random walk.]

The walk goes up to a maximum of 6 at the tenth step. So, we can say that T_{0,7} > 50; in fact, we can make a stronger statement about T_{0,7} by looking at where the walk is at time n = 50. The reader is asked to find the best statement we can make about T_{0,7} based on the graph.
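First passage times like T_0 are easy to read off a simulated path; the following sketch (an illustrative simulation, not the path of Figure 10.1) generates 50 steps and finds the first return to zero:

```python
import random

# Simulate 50 steps of the symmetric simple random walk and locate the
# first passage time back to zero, if it occurs within the horizon.
random.seed(7)            # arbitrary seed, for reproducibility
S, walk = 0, []
for _ in range(50):
    S += random.choice([-1, 1])
    walk.append(S)

T0 = next((n + 1 for n, s in enumerate(walk) if s == 0), None)
print(T0)   # first return to zero; necessarily an even number if it occurred
```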
Example 10.15 (Infinite Expected First Passage Times). Consider the three-state
Markov chain with state space S D f1; 2; 3g and transition probability matrix
P = [ x  y  z
      p  q  0
      0  0  1 ],

where x + y + z = p + q = 1.
First consider the recurrence time T_1. Note that for the chain to return at all to state 1, having started at 1, it cannot ever land in state 3, because 3 is an absorbing state. So, if T_1 = t, then the chain spends t − 1 time instants in state 2, and then returns to 1. In other words, P(T_1 = 1) = x, and for t > 1, P(T_1 = t) = y q^{t−2} p. From here, we can compute P(T_1 < ∞). Indeed,

P(T_1 < ∞) = x + (py/q^2) Σ_{t=2}^∞ q^t = x + (py/q^2)(q^2/p) = x + y = 1 − z.
Likewise,

E_1(T_1 1_{T_1 < ∞}) = x + (py/q^2) Σ_{t=2}^∞ t q^t = x + (py/q^2) · (2q^2 − q^3)/p^2 = x + y(1 + p)/p.
Remark. Note that if a stationary chain returns to its original state i (at least) once
with probability one, it will then also return infinitely often with probability one. So,
we could also think of recurrence and transience of a state in terms of the following
questions.
(a) Is P_i(X_n = i for some n ≥ 1) = 1?
(b) Is P_i(X_n = i for some n ≥ 1) < 1?
Here is another way to think about it. Consider our previously defined recurrence time T_i (still with the understanding that the initial state is i). We can think of recurrence in terms of whether P_i(T_i < ∞) = 1.
Needless to say, just because P_i(T_i < ∞) = 1, it does not follow that its expectation E_i(T_i) < ∞. It is a key question in Markov chain theory whether E_i(T_i) < ∞ for every state i. Not only is it of practical value to compute E_i(T_i), the finiteness of E_i(T_i) for every state i crucially affects the long run behavior of the chain. If we want to predict where the chain will be after it has run for a long time, our answers will depend on these expected values E_i(T_i), provided they are all finite. The relationship of E_i(T_i) to the limiting value of P(X_n = i) is made clear in the next section. Because of the importance of the issue of finiteness of E_i(T_i), the following are important definitions.
Preview.
(a) You can verify recurrence or transience of a given state i by verifying whether Σ_{n=0}^∞ p_ii(n) = ∞ or < ∞.
(b) You can also try to directly verify whether P_i(T_i < ∞) = 1 or < 1.
(c) Chains with a finite state space are more easily handled as regards settling recurrence or transience issues. For finite chains, there must be at least one recurrent state; that is, not all states can be transient, if the chain has a finite state space.
(d) Recurrence is a class property; that is, states within the same communicating class have the same recurrence status. If one of them is recurrent, so are all the others.
(e) To identify exactly which communicating classes have the recurrence property, you can identify which of the communicating classes are closed.
(f) Even if a state i is recurrent, E_i(T_i) can be infinite: the state i can be null recurrent. However, if the state space is finite, and if the chain is regular, then you do not have to worry about it. As a matter of fact, for any set D, T_{iD} will be finite with probability one, and even E_i(T_{iD}) will be finite. So, for a finite regular chain, you have a very simple recurrence story; every state is not just recurrent, but even positive recurrent.
(g) For chains with an infinite state space, it is possible that every state is transient, and it is also possible that every state is recurrent, or it could also be something in between. Whether the chain is irreducible is going to be a key factor in sorting out the exact recurrence structure.
Some of the major results on recurrence and transience are now given.
Theorem 10.2. Let {X_n} be a stationary Markov chain. If Σ_{n=0}^∞ p_ii(n) = ∞, then i is a recurrent state, and if Σ_{n=0}^∞ p_ii(n) < ∞, then i is a transient state.
Proof. Introduce the variable V_i = Σ_{n=0}^∞ I_{X_n = i}; thus, V_i is the total number of visits of the chain to state i. Let also p_i = P_i(T_i < ∞). By using the Markov property of {X_n}, it follows that P_i(V_i > m) = p_i^m for any m ≥ 0. Suppose now p_i < 1. Then, by the tailsum formula for expectations,

E_i(V_i) = Σ_{m=0}^∞ P_i(V_i > m) = Σ_{m=0}^∞ p_i^m = 1/(1 − p_i) < ∞.

But also,

E_i(V_i) = E_i[Σ_{n=0}^∞ I_{X_n = i}] = Σ_{n=0}^∞ E[I_{X_n = i}] = Σ_{n=0}^∞ P_i(X_n = i) = Σ_{n=0}^∞ p_ii(n).

So, if p_i < 1, then we must have Σ_{n=0}^∞ p_ii(n) < ∞, which is the same as saying that if Σ_{n=0}^∞ p_ii(n) = ∞, then p_i must be equal to 1, and so i must be a recurrent state. Suppose on the other hand that p_i = 1. Then, for any m, P_i(V_i > m) = 1, and so, with probability one, V_i = ∞. So, E_i(V_i) = ∞, which implies that Σ_{n=0}^∞ p_ii(n) = E_i(V_i) = ∞. So, if p_i = 1, then Σ_{n=0}^∞ p_ii(n) must be ∞, which is the same as saying that if Σ_{n=0}^∞ p_ii(n) < ∞, then p_i < 1, which would mean that i is a transient state. □
The next theorem formalizes the intuition that if you keep coming back to some
state over and over again, and that state communicates with some other state, then
you will be visiting that state over and over again as well. That is, recurrence is a
class property, and that implies that transience is also a class property.
Proof. The theorem is proved if we can show that if i, j both belong to a common communicating class, and i is transient, then j must also be transient. If we can prove this, it follows that if j is recurrent, then i must also be recurrent, for if it were not, it would be transient, and so that would make j transient, which would be a contradiction.
So, suppose i ∈ C, and assume that i is transient. By virtue of the transience of i, we know that Σ_{r=0}^∞ p_ii(r) < ∞, and so Σ_{r=R}^∞ p_ii(r) < ∞ for any fixed R. This is useful to us in the proof.
Now consider another state j ∈ C. Because C is a communicating class, there exist k, n such that p_ij(k) > 0, p_ji(n) > 0. Take such k, n and hold them fixed. Now observe that for any m, we have the inequality

p_jj(n + m + k) ≥ p_ji(n) p_ii(m) p_ij(k).
Theorem 10.4. (a) Let {X_n} be a stationary irreducible Markov chain with a finite state space. Then every state of {X_n} must be recurrent.
(b) For any stationary Markov chain with a finite state space, a communicating class is recurrent if and only if it is closed.
Example 10.16 (Various Illustrations). We revisit some of the chains in our previous examples and examine their recurrence structure.
If 0 < α < 1 and also 0 < β < 1, then clearly the chain is irreducible, and it obviously has a finite state space. And so, each of the two states is recurrent. If α = β = 1, then each state is an absorbing state, and clearly, Σ_{n=0}^∞ p_ii(n) = ∞ for both i = 1, 2. So, each state is recurrent. If α = β = 0, then the chain evolves either as 121212…, or as 212121…. Each state is periodic and recurrent.
In the hopping mosquito example,
P = [  0   1   0
       0  .5  .5
      .5   0  .5 ].
In this case, some elements of P are zero. However, we have previously seen that every element in P^3 is strictly positive. Hence, the chain is again irreducible. Once again, each of the three states is recurrent.
Next consider the chain with the transition matrix
P = [ .75  .25   0    0    0   0
       0    0    1    0    0   0
      .25   0    0   .25  .5   0
       0    0    0   .75  .25  0
       0    0    0    0    0   1
       0    0    0    0    1   0 ].
We have previously proved that the communicating classes of this chain are {1, 2, 3}, {4}, and {5, 6}, of which {5, 6} is the only closed class. Therefore, 5 and 6 are the only recurrent states of this chain.
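For a finite chain, Theorem 10.4(b) gives an algorithm: compute the communicating classes, then keep the closed ones. A sketch (an illustrative NumPy implementation) applied to the chain above recovers exactly states 5 and 6:

```python
import numpy as np

# The 6x6 chain of Examples 10.12 and 10.16 (states 1..6 in the text).
P = np.array([
    [.75, .25, 0,   0,   0,  0],
    [0,   0,   1,   0,   0,  0],
    [.25, 0,   0,   .25, .5, 0],
    [0,   0,   0,   .75, .25, 0],
    [0,   0,   0,   0,   0,  1],
    [0,   0,   0,   0,   1,  0],
])
n = len(P)

# Reachability in at most n steps (the identity allows "stay put").
R = np.linalg.matrix_power(np.eye(n, dtype=int) + (P > 0), n) > 0
mutual = R & R.T   # i and j communicate iff each reaches the other

classes = {frozenset(np.flatnonzero(mutual[i])) for i in range(n)}

def is_closed(c):
    # closed class: no positive transition probability out of the class
    return all(P[i, j] == 0 for i in c for j in set(range(n)) - c)

recurrent = sorted(i + 1 for c in classes if is_closed(c) for i in c)
print(recurrent)   # states 5 and 6, matching the text
```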
A natural human instinct is to want to predict the future. It is not surprising that we often want to know exactly where a Markov chain will be after it has evolved for a fairly long time. Of course, we cannot say with certainty where it will be. But perhaps we can make probabilistic statements. In notation, suppose a stationary Markov chain {X_n} started at some initial state i ∈ S. A natural question is what we can say about P(X_n = j | X_0 = i) for arbitrary j ∈ S, if n is large. Again, a short preview might be useful.
Preview. For chains with a finite state space, the answers are concrete, extremely structured, and furthermore, convergence occurs rapidly. That is, under some reasonable conditions on the chain, regardless of what the initial state $i$ is, $P(X_n = j \mid X_0 = i)$ has a limit $\pi_j$, and $P(X_n = j \mid X_0 = i) \approx \pi_j$ for quite moderate values of $n$. In addition, the marginal probabilities $P(X_n = j)$ are also well approximated by the same $\pi_j$, and there is an explicit formula for determining the limiting probability $\pi_j$ for each $j \in S$. Somewhat different versions of these results
are often presented in different texts, under different sets of conditions on the chain.
360 10 Markov Chains and Applications
Our version balances the ease of understanding the results with the applicability of
the conditions assumed. But, first let us see two illustrative examples.
Example 10.17. Consider first the weather pattern example, and for concreteness,
take the one-step transition probability matrix to be
$$P = \begin{pmatrix} .8 & .2 \\ .2 & .8 \end{pmatrix}.$$
Notice that this chain has an absorbing state; once you are in state 3, you can never leave from there. To be concrete, take $x = .25, y = .75, p = q = .5$. Then, by direct computing,
$$P^{10} = \begin{pmatrix} .400001 & .599999 & 0 \\ .4 & .6 & 0 \\ 0 & 0 & 1 \end{pmatrix}; \quad
P^{20} = \begin{pmatrix} .4 & .6 & 0 \\ .4 & .6 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$
This time it appears that $P^n$ converges to a limiting matrix whose first two rows are the same, but the third row is different. Specifically, the first two rows of $P^n$ seem to be converging to $(.4, .6, 0)$ and the third row is $(0, 0, 1)$, the same as the third row in $P$ itself. Thus, the limiting behavior of $P(X_n = j \mid X_0 = i)$ seems to depend on
the initial state i .
The difference between the two chains in this example is that the first chain is
regular, whereas the second chain has an absorbing state and cannot be regular.
Indeed, regularity of the chain is going to have a decisive effect on the limiting
behavior of $P(X_n = j \mid X_0 = i)$. An important theorem is the following.
Theorem 10.5 (Fundamental Theorem for Finite Markov Chains). Let $\{X_n\}$ be a stationary Markov chain with a finite state space $S$, consisting of $t$ elements. Assume furthermore that $\{X_n\}$ is regular. Then, there exist $\pi_j, j = 1, 2, \ldots, t$ such that
(a) For any initial state $i$, $P(X_n = j \mid X_0 = i) \to \pi_j$, $j = 1, 2, \ldots, t$.
(b) $\pi_1, \pi_2, \ldots, \pi_t$ are the unique solutions of the system of equations $\pi_j = \sum_{i=1}^{t} \pi_i p_{ij}$, $j = 1, 2, \ldots, t$, $\sum_{j=1}^{t} \pi_j = 1$, where $p_{ij}$ denotes the $(i,j)$th element in the one-step transition matrix $P$.
Equivalently, the row vector $\pi = (\pi_1, \pi_2, \ldots, \pi_t)$ is the unique solution of the equations $\pi P = \pi$, $\pi \mathbf{1}' = 1$, where $\mathbf{1}$ is a row vector with each coordinate equal to one.
(c) The chain $\{X_n\}$ is positive recurrent; that is, for any state $i$, the mean recurrence time $\mu_i = E_i(T_i) < \infty$, and furthermore $\mu_i = \frac{1}{\pi_i}$.
The vector $\pi = (\pi_1, \pi_2, \ldots, \pi_t)$ is called the stationary distribution of the regular finite chain $\{X_n\}$. It is also sometimes called the equilibrium distribution or the invariant distribution of the chain. The difference in terminology can be confusing.
Suppose now that a stationary chain has a stationary distribution $\pi$. If we use this $\pi$ as the initial distribution of the chain, then we observe that
$$P(X_1 = j) = \sum_{k \in S} P(X_1 = j \mid X_0 = k)\,\pi_k = \pi_j,$$
by the fact that $\pi$ is a stationary distribution of the chain. Indeed, it now follows easily by induction that for any $n$, $P(X_n = j) = \pi_j, j \in S$. Thus, if a chain
has a stationary distribution, and the chain starts out with that distribution, then at
all subsequent times, the distribution of the state of the chain remains exactly the
same, namely the stationary distribution. This is why a chain that starts out with its
stationary distribution is sometimes described to be in steady-state.
We now give a proof of parts (a) and (b) of the fundamental theorem of Markov
chains. For this, we use a famous result in linear algebra, which we state as a lemma;
see Seneta (1981) for a proof.
(c) There exist left and right eigenvectors of $P$, each having only strictly positive elements, corresponding to the eigenvalue $\lambda_1 = 1$; that is, there exist vectors $\nu$ and $\omega$, both having only strictly positive elements, such that $\nu P = \lambda_1 \nu$ and $P\omega = \lambda_1 \omega$.
(d) The algebraic multiplicity of $\lambda_1$ is one, and the dimension of the set of left as well as right eigenvectors corresponding to $\lambda_1$ equals one.
(e) For any $i$, and any vector $(c_1, c_2, \ldots, c_t)$ with each $c_j > 0$,
$$\lim_{n} \frac{1}{n}\log \sum_j p_{ij}(n)\,c_j = \lim_{n} \frac{1}{n}\log \sum_j p_{ji}(n)\,c_j = \log \lambda_1.$$
We assumed that our chain is regular for the fundamental theorem. An exercise
asks us to show that regularity is not necessary for the existence of a stationary
distribution. Regular chains are of course irreducible. But irreducibility alone is not
enough for the existence of a stationary distribution. More is said of the issue of
existence of a stationary distribution a bit later. For finite chains, irreducibility plus
aperiodicity is enough for the validity of the fundamental theorem because of the
simple reason that such chains are regular in the finite case. It is worth mentioning
this as a formal result.
Theorem 10.6. Let $\{X_n\}$ be a stationary Markov chain with a finite state space $S$. If $\{X_n\}$ is irreducible and aperiodic, then the fundamental theorem holds.
Example 10.18 (Weather Pattern). Consider the two-state Markov chain with the transition probability matrix
$$P = \begin{pmatrix} \alpha & 1-\alpha \\ 1-\beta & \beta \end{pmatrix}.$$
Assume $0 < \alpha, \beta < 1$, so that the chain is regular. The stationary probabilities $\pi_1, \pi_2$ are to be found from the equation
$$(\pi_1, \pi_2)P = (\pi_1, \pi_2)$$
$$\Rightarrow \alpha\pi_1 + (1-\beta)\pi_2 = \pi_1$$
$$\Rightarrow (1-\alpha)\pi_1 = (1-\beta)\pi_2 \Rightarrow \pi_2 = \frac{1-\alpha}{1-\beta}\,\pi_1.$$
Substituting this into $\pi_1 + \pi_2 = 1$ gives $\pi_1\left(1 + \frac{1-\alpha}{1-\beta}\right) = 1$, and so $\pi_1 = \frac{1-\beta}{2-\alpha-\beta}$, which then gives $\pi_2 = 1 - \pi_1 = \frac{1-\alpha}{2-\alpha-\beta}$. For example, if $\alpha = \beta = .8$, then we get $\pi_1 = \pi_2 = \frac{1-.8}{2-.8-.8} = .5$, which is the numerical limit we saw in our example by computing $P^n$ explicitly for large $n$. For general $0 < \alpha, \beta < 1$, each of the states is positive recurrent. For instance, if $\alpha = \beta = .8$, then $E_i(T_i) = \frac{1}{.5} = 2$ for each of $i = 1, 2$.
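As a quick numerical check (a minimal sketch, not from the text, using only the standard library), the closed-form stationary probabilities of the two-state chain can be compared with the limit obtained by iterating the one-step transition matrix:

```python
# Illustrative sketch: verify pi_1 = (1-beta)/(2-alpha-beta) and
# pi_2 = (1-alpha)/(2-alpha-beta) against the limit of the n-step distribution.
alpha, beta = 0.8, 0.8
P = [[alpha, 1 - alpha], [1 - beta, beta]]

def step(dist, P):
    # one application of dist -> dist * P (row vector times matrix)
    return [sum(dist[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]

dist = [1.0, 0.0]        # start deterministically in state 1
for _ in range(200):     # regular chains converge geometrically fast
    dist = step(dist, P)

pi1 = (1 - beta) / (2 - alpha - beta)
pi2 = (1 - alpha) / (2 - alpha - beta)
print(dist, (pi1, pi2))  # dist is essentially (0.5, 0.5), matching the formula
```

Convergence here is governed by the second eigenvalue of $P$ (here $\alpha + \beta - 1 = .6$), so 200 iterations are far more than enough.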
Example 10.19. With the row vector $\pi = (\pi_1, \pi_2, \ldots, \pi_t)$ denoting the vector of stationary probabilities of a chain, $\pi$ satisfies the vector equation $\pi P = \pi$, and taking a transpose on both sides, $P'\pi' = \pi'$. That is, the column vector $\pi'$ is a right eigenvector of $P'$, the transpose of the transition matrix.
For example, consider the voting preferences example with
$$P = \begin{pmatrix} .8 & .05 & .15 \\ .03 & .9 & .07 \\ .1 & .1 & .8 \end{pmatrix}.$$
The transpose of $P$ is
$$P' = \begin{pmatrix} .8 & .03 & .1 \\ .05 & .9 & .1 \\ .15 & .07 & .8 \end{pmatrix}.$$
The three (unit-norm) eigenvectors of $P'$ are
$$\begin{pmatrix} .38566 \\ .74166 \\ .54883 \end{pmatrix}, \quad \begin{pmatrix} .44769 \\ -.81518 \\ .36749 \end{pmatrix}, \quad \begin{pmatrix} .56867 \\ .22308 \\ -.79174 \end{pmatrix}.$$
Of these, the last two cannot be the eigenvector we are looking for, because they
contain negative elements. The first eigenvector contains only nonnegative (actually,
strictly positive) elements, and when normalized to give elements that add to one,
results in the stationary probability vector $\pi = (.2301, .4425, .3274)$. We could
have also obtained it by the method of elimination as in our preceding example, but
the eigenvector method is a general clean method, and is particularly convenient
when the number of states t is not small.
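The eigenvector route is easy to automate. The sketch below (not part of the text, and assuming the numpy library is available) recovers the stationary vector of the voting preferences chain:

```python
# Find the right eigenvector of P' (equivalently, the left eigenvector of P)
# for the eigenvalue 1, and normalize its entries to sum to one.
import numpy as np

P = np.array([[0.80, 0.05, 0.15],
              [0.03, 0.90, 0.07],
              [0.10, 0.10, 0.80]])

eigvals, eigvecs = np.linalg.eig(P.T)      # columns of eigvecs are eigenvectors of P'
i = int(np.argmin(np.abs(eigvals - 1.0)))  # locate the eigenvalue 1
v = np.real(eigvecs[:, i])
pi = v / v.sum()                           # normalize; entries come out strictly positive
print(pi)                                  # approximately [0.2301, 0.4425, 0.3274]
```

Dividing by the sum also repairs an arbitrary sign flip, since numerical eigenvector routines return eigenvectors only up to a scalar multiple.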
Example 10.20 (Ehrenfest Urn). Consider the symmetric version of the Ehrenfest urn model in which a certain number among $m$ balls are initially in urn I, the rest in urn II, and at each successive time one of the $m$ balls is selected completely at random and transferred to the other urn with probability $\frac{1}{2}$ (and left in the same urn with probability $\frac{1}{2}$). The one-step transition probabilities are
$$p_{i,i-1} = \frac{i}{2m}; \quad p_{i,i+1} = \frac{m-i}{2m}; \quad p_{ii} = \frac{1}{2}.$$
A stationary distribution would satisfy the equations
$$\pi_j = \pi_{j-1}\frac{m-j+1}{2m} + \pi_{j+1}\frac{j+1}{2m} + \frac{\pi_j}{2}, \quad 1 \le j \le m-1;$$
$$\pi_0 = \frac{\pi_0}{2} + \frac{\pi_1}{2m}; \quad \pi_m = \frac{\pi_m}{2} + \frac{\pi_{m-1}}{2m}.$$
These are equivalent to the equations
$$\pi_0 = \frac{\pi_1}{m}; \quad \pi_m = \frac{\pi_{m-1}}{m}; \quad \pi_j = \pi_{j-1}\frac{m-j+1}{m} + \pi_{j+1}\frac{j+1}{m}, \quad 1 \le j \le m-1.$$
Starting with $\pi_1$, one can solve these equations by just successive substitution, leaving $\pi_0$ as an undetermined constant, to get $\pi_j = \binom{m}{j}\pi_0$. Now use the fact that
$\sum_{j=0}^{m} \pi_j$ must equal one. This forces $\pi_0 = \frac{1}{2^m}$, and hence, $\pi_j = \frac{\binom{m}{j}}{2^m}$. We now realize that these are exactly the probabilities in a binomial distribution with parameters $m$ and $\frac{1}{2}$. That is, in the symmetric Ehrenfest urn problem, there is a stationary distribution and it is the $\text{Bin}(m, \frac{1}{2})$ distribution. In particular, after the process has evolved for a long time, we would expect close to half the balls to be in each urn. Each state is positive recurrent; that is, the chain is sure to return to its original configuration with a finite expected value for the time it takes to return to that configuration. As a specific example, suppose $m = 10$ and that initially, there were five balls in each urn. Then, the stationary probability $\pi_5 = \frac{\binom{10}{5}}{2^{10}} = \frac{63}{256} = .246$, so that we can expect that after about four transfers, the urns will once again have five balls each.
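A simulation makes the $\text{Bin}(m, \frac{1}{2})$ stationary law visible. The sketch below (illustrative only, not from the text) runs the symmetric Ehrenfest chain for a long time and compares the fraction of time spent in state 5 with the exact value $\binom{10}{5}/2^{10}$:

```python
# Simulate the symmetric Ehrenfest chain with transition probabilities
# p_{i,i-1} = i/(2m), p_{i,i+1} = (m-i)/(2m), p_{ii} = 1/2,
# and compare occupation frequencies with the Bin(m, 1/2) stationary law.
import random
from math import comb

random.seed(1)
m, steps = 10, 200_000
i = 0                       # number of balls in urn I; start with urn I empty
counts = [0] * (m + 1)
for _ in range(steps):
    u = random.random()
    if u < i / (2 * m):                           # move a ball out of urn I
        i -= 1
    elif u < i / (2 * m) + (m - i) / (2 * m):     # move a ball into urn I
        i += 1
    # otherwise stay put, which happens with probability 1/2
    counts[i] += 1

freq5 = counts[5] / steps
exact5 = comb(m, 5) / 2 ** m     # = 63/256 = 0.246...
print(freq5, exact5)
```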
$$\pi_j = \pi_{j-1}\alpha_{j-1} + \pi_{j+1}\beta_{j+1},$$
$$\pi_1 = \pi_0\,\frac{1-\alpha_0}{\beta_1}; \quad \pi_2 = \pi_0\,\frac{(1-\alpha_0)\alpha_1}{\beta_1\beta_2}; \quad \ldots,$$
Consider now the special case when $\alpha_i = \beta_i = \frac{1}{2}$ for all $i \ge 1$. Then, for any $j \ge 1$, $c_j = \frac{1}{2}$, and hence $\sum_{j=1}^{\infty} c_j$ diverges. Therefore, the case of the symmetric random walk does not possess a stationary distribution, in the sense that no stationary distribution exists that is a valid probability distribution.
The stationary distribution of a Markov chain is not just the limit of the n-step
transition probabilities; it also has important interpretations in terms of the marginal
distribution of the state of the chain. Suppose the chain has run for a long time,
and we want to know what the chances are that the chain is now in some state j . It
turns out that the stationary probability j approximates that probability too. The
approximations are valid in a fairly strong sense, made precise below. Even more,
j is approximately equal to the fraction of the time so far that the chain has spent
visiting state j . To describe these results precisely, we need a little notation.
Given a stationary chain $\{X_n\}$, we denote $f_n(j) = P(X_n = j)$. Also let $I_k(j) = I_{\{X_k = j\}}$, and $V_n(j) = \sum_{k=1}^{n} I_k(j)$. Thus, $V_n(j)$ counts the number of times up to time $n$ that the chain has been in state $j$, and $\delta_n(j) = \frac{V_n(j)}{n}$ measures the fraction of times up to time $n$ that the chain has been in state $j$. Then, the following results hold.
Theorem 10.7 (Weak Ergodic Theorem). Let $\{X_n\}$ be a regular Markov chain with a finite state space and the stationary distribution $\pi = (\pi_1, \pi_2, \ldots, \pi_t)$. Then,
(a) Whatever the initial distribution of the chain is, for any $j \in S$, $P(X_n = j) \to \pi_j$ as $n \to \infty$.
(b) For any $\epsilon > 0$, and for any $j \in S$, $P(|\delta_n(j) - \pi_j| > \epsilon) \to 0$ as $n \to \infty$.
(c) More generally, given any bounded function $g$, and any $\epsilon > 0$, $P\left(\left|\frac{1}{n}\sum_{k=1}^{n} g(X_k) - \sum_{j=1}^{t} g(j)\pi_j\right| > \epsilon\right) \to 0$ as $n \to \infty$.
Remark. See Norris (1997, p. 53) for a proof of this theorem. Also see Section
19.3.1 in this text, where an even stronger version is proved. The theorem provides a
basis for estimating the stationary probabilities of a chain by following its trajectory
for a long time. Part (c) of the theorem says that time averages of a general bounded
function will ultimately converge to the state space average of the function with
respect to the stationary distribution. In fact, a stronger convergence result than the
one we state here holds and is commonly called the ergodic theorem for stationary
Markov chains; see Brémaud (1999) or Norris (1997).
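The estimation strategy suggested by the theorem is easy to try out. The sketch below (illustrative only, not from the text) estimates the stationary probabilities of the two-state weather chain from the occupation fractions $\delta_n(j)$ of a single long trajectory:

```python
# Estimate pi_j by the fraction of time delta_n(j) that one long trajectory
# of the chain spends in state j (part (b) of the weak ergodic theorem).
import random

random.seed(7)
P = [[0.8, 0.2], [0.2, 0.8]]
n = 100_000
x = 0                      # start in state 1 (indexed 0 here)
visits = [0, 0]
for _ in range(n):
    x = 0 if random.random() < P[x][0] else 1   # one step of the chain
    visits[x] += 1

delta = [v / n for v in visits]
print(delta)               # both entries near the stationary value 0.5
```

The initial state is quickly forgotten, so no burn-in is needed for a trajectory of this length.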
Exercises
Let $X_n$ be the room in which the cat is at time $n$, and $Y_n$ the room in which the mouse is at time $n$. Assume that the chains $\{X_n\}$ and $\{Y_n\}$ are independent.
Exercise 10.8. A flea is initially located on the top face of a cube, which has six
faces, top and bottom, left and right, and front and back. Every minute it moves
from its current location to one of the other five faces chosen at random.
(a) Find the probability that after four moves it is back to the top face.
(b) Find the probability that after n moves it is on the top face; on the bottom face.
(c) * Find the probability that the next five moves are distinct. This is the same as
the probability that the first six locations of the flea are the six faces of the cube,
each location chosen exactly once.
Exercise 10.10. Let $\{X_n\}$ be a three-state stationary Markov chain with the transition probability matrix
$$P = \begin{pmatrix} 0 & x & 1-x \\ y & 1-y & 0 \\ 1 & 0 & 0 \end{pmatrix}.$$
Exercise 10.11 (An IID Sequence). Let $X_i, i \ge 1$ be iid Poisson random variables with some common mean $\lambda$. Prove or disprove that $\{X_n\}$ is a stationary Markov chain. If it is, describe the transition probability matrix.
How important is the Poisson assumption? What happens if $X_i, i \ge 1$ are independent, but not iid?
Exercise 10.12. Let $\{X_n\}$ be a stationary Markov chain with transition matrix $P$, and $g$ a one-to-one function. Define $Y_n = g(X_n)$. Prove that $\{Y_n\}$ is a Markov chain, and characterize as well as you can the transition probability matrix of $\{Y_n\}$.
Exercise 10.13 * (Loop Chains). Suppose $\{X_n\}$ is a stationary Markov chain with state space $S$ and transition probability matrix $P$.
(a) Let $Y_n = (X_n, X_{n+1})$. Show that $Y_n$ is also a stationary Markov chain.
(b) Find the transition probability matrix of $Y_n$.
(c) How about $Y_n = (X_n, X_{n+1}, X_{n+2})$? Is this also a stationary Markov chain?
(d) How about $Y_n = (X_n, X_{n+1}, \ldots, X_{n+d})$ for a general $d \ge 1$?
Exercise 10.14 (Dice Experiments). Consider the experiment of rolling a fair die
repeatedly. Define
(a) $X_n =$ the number of sixes obtained up to the $n$th roll.
(b) $X_n =$ the number of rolls, at time $n$, since the last six was obtained.
Prove or disprove that each $\{X_n\}$ is a Markov chain, and if they are, obtain the transition probability matrices.
Exercise 10.15. Suppose $\{X_n\}$ is a regular stationary Markov chain with transition probability matrix $P$. Prove that there exists $m \ge 1$ such that every element in $P^n$ is strictly positive for all $n \ge m$.
Exercise 10.17 * (Periodicity and Simple Random Walk). Consider the Markov chain corresponding to the simple random walk with general step probabilities $p, q$, $p + q = 1$.
(a) Identify the periodic states of the chain and the periods.
(b) Find the communicating classes.
(c) Are there any communicating classes that are not closed? If there are, iden-
tify them. If not, prove that there are no communicating classes that are not
closed.
$$X_n = \sum_{i=1}^{n} Z_i; \quad X_n = \max\{Z_1, \ldots, Z_n\}; \quad X_n = \min\{Z_1, \ldots, Z_n\};$$
state of the stock on the $n$th day; thus, $X_n$ takes the values 1, 2, or 3. Assume that $\{X_n\}$ forms a stationary Markov chain with the transition probability matrix
$$P = \begin{pmatrix} 0 & \frac{1}{2} & \frac{1}{2} \\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\ \frac{1}{2} & \frac{3}{8} & \frac{1}{8} \end{pmatrix}.$$
A friend offers you the following bet: if the stock goes up tomorrow, he pays you 15
dollars, and if it goes down you pay him 10 dollars. If it remains the same as where
it closes today, a fair coin will be tossed and he will pay you 10 dollars if a head
shows up, and you will pay him 15 dollars if a tail shows up. Will you accept this
bet? Justify with appropriate calculations.
Exercise 10.32 (Wheel of Fortune Continued). Consider again the Markov chains
corresponding to the wheel of fortune. Prove or disprove that they are irreducible
and aperiodic.
Exercise 10.39 (Use Your Computer: Cat and Mouse). Take the cat and mouse
chain and simulate it to find how long it takes for the cat and the mouse to end up
in the same room. Repeat the simulation and estimate the expected time until the
cat and the mouse end up in the same room. Vary the transition matrix and examine
how the expected value changes.
Exercise 10.40 (Use Your Computer: Ehrenfest Urn). Take the symmetric
Ehrenfest urn; that is, take $\alpha = \beta = .5$. Put all the $m$ balls in the second urn.
Simulate the chain and find how long it takes for the urns to have an equal number
of balls for the first time. Repeat the simulation and estimate the expected time until
both urns have an equal number of balls. Take m D 10; 20.
Exercise 10.41 (Use Your Computer: Gambler’s Ruin). Take the gambler’s ruin
problem with $p = .4$ and $p = .49$. Simulate the chain using $a = 10$, $b = 25$, and find the
proportion of times that the gambler goes broke by repeating the simulation. Com-
pare your empirical proportion with the exact theoretical value of the probability
that the gambler will go broke.
References
Bhattacharya, R.N. and Waymire, E. (2009). Stochastic Processes with Applications, SIAM, Philadelphia.
Brémaud, P. (1999). Markov Chains, Gibbs Fields, Monte Carlo, and Queues, Springer, New York.
Diaconis, P. (1988). Group Representations in Probability and Statistics, IMS, Lecture Notes and
Monographs Series, Hayward, CA.
Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Wiley, New York.
Freedman, D. (1975). Markov Chains, Holden Day, San Francisco.
Isaacson, D. and Madsen, R. (1976). Markov Chains, Theory and Applications, Wiley, New York.
Kemperman, J. (1950). The General One-Dimensional Random Walk with Absorbing Barriers, Thesis, Amsterdam.
Meyn, S. and Tweedie, R. (1993). Markov Chains and Stochastic Stability, Springer, New York.
Norris, J. (1997). Markov Chains, Cambridge University Press, Cambridge, UK.
Seneta, E. (1981). Nonnegative Matrices and Markov Chains, Springer-Verlag, New York.
Stirzaker, D. (1994). Elementary Probability, Cambridge University Press, Cambridge, UK.
Chapter 11
Random Walks
We have already encountered the simple random walk a number of times in the
previous chapters. Random walks occupy an extremely important place in prob-
ability because of their numerous applications, and because of their theoretical
connections in suitable limiting paradigms to other important random processes
in time. Random walks are used to model the value of stocks in economics, the
movement of the molecules of a particle in a liquid medium, animal movements in
ecology, diffusion of bacteria, movement of ions across cells, and numerous other
processes that manifest random movement in time in response to some external
stimuli. Random walks are indirectly of interest in various areas of statistics, such
as sequential statistical analysis and testing of hypotheses. They also help a student
of probability simply to understand randomness itself better.
We present a treatment of the theory of random walks in one or more dimensions
in this chapter, focusing on the asymptotic aspects that relate to the long run probabilistic behavior of a particle performing a random walk. We recommend Feller (1971), Rényi (1970), and Spitzer (2008) for classic treatments of random walks;
Spitzer (2008), in particular, provides a comprehensive coverage of the theory of
random walks with numerous examples in setups far more general than we consider
in this chapter.
or back, or up, or down, choosing one of these six options with an equal probability.
In d dimensions, it chooses one of the 2d options with an equal probability.
We show two simulations of 100 steps of a simple symmetric random walk on
the line; this is just the cubic lattice random walk when d D 1. We use these two
simulated plots to illustrate a number of important and interesting variables con-
nected to a random walk. One example of such an interesting variable is the number
of times the walk returns to its starting position in the first n steps (in the simulated
plots, n D 100).
We now give some notation:
And, for $d = 1$,
$$\xi_n = \sum_{k=1}^{n} I_{\{S_k > 0\}} = \text{number of times in the first } n \text{ steps that the walk takes a positive value}.$$
Example 11.1 (Two Simulated Walks). We refer to the plots in Fig. 11.1 and Fig. 11.2 of the first 100 steps of the two simulated simple random walks. First, note that in both plots, the walk does not at all stay above and below the axis about 50% of the time. In the first simulation, the walk spends most of its time on the negative side, and in the second simulation, it does the reverse. This is borne out by theory, although at first glance it seems counter to the intuition of most people. We give a table providing the values of the various quantities defined above corresponding to the two simulations.

Walk   Returns to zero   Times of the returns           $\xi_{100}$
1      3                 2, 14, 16                      11
2      8                 2, 4, 6, 8, 10, 22, 24, 34     69
11.1 Random Walk on the Cubic Lattice 377
[Figures 11.1 and 11.2: the two simulated 100-step simple random walks; the first stays mostly below the axis, the second mostly above.]
A matter of key interest is whether the random walk returns to its origin, and if so
how many times. More generally, we may ask how many times the random walk
visits a given state x 2 S. These considerations lead to the following fundamental
definition.
We now show some exact and asymptotic distribution theory. Part of the asymptotic
distribution theory has elements of surprise.
It is possible to write combinatorial formulas for $P_{d,n}$. These simplify to simple expressions for $d = 1, 2$. Note first that the walk cannot return to the origin at odd times $2n+1$. For the walk to return to the origin at an even time $2n$, in each of the $d$ coordinates, the walk has to take an equal number of positive and negative steps. We can think of this as a multinomial trial, where the $2n$ times $1, 2, \ldots, 2n$ are thought of as $2n$ balls, and they are dropped into $2d$ cells, with the restriction that pairs of cells receive equal numbers of balls $n_1, n_1, n_2, n_2, \ldots, n_d, n_d$. Thus,
$$P_{d,2n} = \frac{1}{(2d)^{2n}} \sum_{n_1, \ldots, n_d \ge 0,\; n_1 + \cdots + n_d = n} \frac{(2n)!}{(n_1!)^2 \cdots (n_d!)^2} = \frac{\binom{2n}{n}}{(2d)^{2n}} \sum_{n_1, \ldots, n_d \ge 0,\; n_1 + \cdots + n_d = n} \left(\frac{n!}{n_1! \cdots n_d!}\right)^2.$$
In particular,
$$P_{1,2n} = \frac{\binom{2n}{n}}{4^n}; \quad P_{2,2n} = \frac{\left[\binom{2n}{n}\right]^2}{(16)^n}; \quad P_{3,2n} = \frac{\binom{2n}{n}}{(36)^n} \sum_{k,l \ge 0,\; k+l \le n} \left[\frac{n!}{k!\,l!\,(n-k-l)!}\right]^2.$$
Apart from the fact that simply computable exact formulas are always attractive,
these formulas have other very important implications, as we demonstrate shortly.
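The exact formulas are simple to evaluate on a computer. A small sketch (illustrative, not from the text, standard library only):

```python
# Exact return probabilities P_{d,2n} for d = 1, 2, 3, straight from the
# combinatorial formulas above.
from math import comb, factorial

def P1(n):  # d = 1
    return comb(2 * n, n) / 4 ** n

def P2(n):  # d = 2
    return comb(2 * n, n) ** 2 / 16 ** n

def P3(n):  # d = 3: sum of squared trinomial coefficients over k + l <= n
    s = sum((factorial(n) // (factorial(k) * factorial(l) * factorial(n - k - l))) ** 2
            for k in range(n + 1) for l in range(n - k + 1))
    return comb(2 * n, n) * s / 36 ** n

print(P1(1), P2(1), P3(1))   # 0.5, 0.25, and 1/6 (one chance in 2d of an immediate return)
```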
Example 11.2. We give a plot of $P_{d,2n}$ in Fig. 11.3 for $d = 1, 2, 3$ and $n \le 25$. There are two points worth mentioning. The return probabilities for $d = 2$ and $d = 3$ cross each other; as $n \to \infty$, $P_{3,2n} \to 0$ at a faster rate, as is shown later. The second point is that the return probabilities decrease the most when the dimension jumps from one to two.

[Figure 11.3: plot of $P_{d,2n}$ for $d = 1, 2, 3$, $n \le 25$.]
It is possible to draw deeper conclusions about the lattice random walk by using the above exact formulas for $P_{d,n}$. Plainly, by using Stirling's approximation,
$$P_{1,2n} \sim \frac{1}{\sqrt{\pi n}}; \quad P_{2,2n} \sim \frac{1}{\pi n},$$
where recall that the notation $\sim$ means that the ratio converges to one as $n \to \infty$. Establishing the asymptotic order of $P_{d,n}$ for $d > 2$ takes a little more effort. This can be done by other more powerful means, but we take a direct approach. We use two facts:
(a) $\frac{1}{d^n} \sum_{n_1, \ldots, n_d \ge 0,\; n_1 + \cdots + n_d = n} \frac{n!}{n_1! \cdots n_d!} = 1$.
(b) The multinomial coefficient $\frac{n!}{n_1! \cdots n_d!}$ is maximized essentially when $n_1, \ldots, n_d$ are all equal.
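The Stirling asymptotics can be checked numerically (an illustrative sketch, not from the text):

```python
# Check that P_{1,2n} * sqrt(pi*n) -> 1 and P_{2,2n} * pi*n -> 1 as n grows.
from math import comb, pi, sqrt

for n in (10, 100, 1000):
    p1 = comb(2 * n, n) / 4 ** n
    p2 = comb(2 * n, n) ** 2 / 16 ** n
    print(n, p1 * sqrt(pi * n), p2 * pi * n)   # both ratios approach 1 from below
```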
Theorem 11.3. Let $\{X_n\}, n \ge 0$ be a Markov chain with the state space $S$ equal to the set of all integers. Assume that $X_0 = 0$. Let
$$p_{00}^{(k)} = P(X_k = 0); \quad \lambda = P(\text{the chain ever returns to zero}); \quad N = \text{total number of visits of the chain to zero}.$$
Then,
$$\sum_{k=1}^{\infty} p_{00}^{(k)} = \sum_{n=1}^{\infty} \lambda^n = E(N).$$
Corollary 11.1. The Markov chain $\{X_n\}$ returns to zero infinitely often with probability one if and only if $\sum_{k=1}^{\infty} p_{00}^{(k)} = \infty$.
It is now clear why the lattice random walk is recurrent in one dimension. Indeed, we have already established that for the one-dimensional lattice random walk, $p_{00}^{(k)} = 0$ if $k$ is odd, and $p_{00}^{(2n)} = P_{1,2n} \sim \frac{1}{\sqrt{\pi n}}$. Consequently, $\sum_{k=1}^{\infty} p_{00}^{(k)} = \sum_{n=1}^{\infty} P_{1,2n} = \infty$, and therefore, by the general Markov chain theorem we stated above, the chain returns to zero infinitely often with probability one in one dimension.
Exactly the same Markov chain argument also establishes the lattice random
walk’s recurrence in two dimensions, and transience in dimensions higher than two.
This is an elegant and shorter derivation of the recurrence structure of the lattice
random walk than the alternative direct method we provided first in this section. The disadvantage of the shorter proof is that it must appeal to general Markov chain theory.
We just saw that in dimensions three or higher, the probability that the cubic lattice random walk returns to zero infinitely often is zero. In notation, if $d \ge 3$, $P(S_n = 0 \text{ for infinitely many } n) = 0$. But this does not mean that $P(S_n = 0 \text{ for some } n) = 0$ also. This latter probability is the probability that the walk returns to the origin at least once. We know that for $d \ge 3$ it is not 1; but it need not be zero. Indeed, it is something in between. In 1921, Pólya gave a pretty formula for the probability that the cubic lattice random walk returns to the origin at least once.
Theorem 11.4 (Pólya's Formula). Let $S_n$ be the cubic lattice random walk starting at $\mathbf{0}$. Then,
$$P(S_n = \mathbf{0} \text{ for some } n \ge 1) = 1 - \frac{1}{1 + E_d},$$
where
$$E_d = \sum_{n=1}^{\infty} P_{d,2n} = \text{the expected number of returns to the origin}$$
$$= \frac{d}{(2\pi)^d} \int_{-\pi}^{\pi} \cdots \int_{-\pi}^{\pi} \left(d - \sum_{k=1}^{d} \cos\theta_k\right)^{-1} d\theta_1 \cdots d\theta_d \; - \; 1$$
$$= \int_0^{\infty} e^{-t}\left[I_0\!\left(\frac{t}{d}\right)\right]^d dt \; - \; 1,$$
with $I_0$ denoting the modified Bessel function of order zero.
In particular, for $d = 3$,
$$\sum_{n=1}^{\infty} P_{3,2n} = \frac{3}{(2\pi)^3} \int_{-\pi}^{\pi}\!\int_{-\pi}^{\pi}\!\int_{-\pi}^{\pi} \frac{dx\,dy\,dz}{3 - \cos x - \cos y - \cos z} \; - \; 1$$
$$= \frac{\sqrt{6}}{32\pi^3}\,\Gamma\!\left(\frac{1}{24}\right)\Gamma\!\left(\frac{5}{24}\right)\Gamma\!\left(\frac{7}{24}\right)\Gamma\!\left(\frac{11}{24}\right) - 1 = .5164.$$
In addition to Pólya (1921), see Finch (2003, p. 322) for these formulas.
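The $d = 3$ value can be checked by direct simulation. The sketch below (illustrative only, not from the text) estimates the probability of at least one return, which with the expected number of returns equal to $.5164$ is about $.5164/1.5164 \approx .3405$; truncating walks at a finite horizon biases the estimate slightly downward:

```python
# Monte Carlo check of Polya's d = 3 return probability (~0.34), simulating
# cubic lattice walks truncated at 1000 steps.
import random

random.seed(3)
trials, horizon = 4000, 1000
returned = 0
for _ in range(trials):
    x = y = z = 0
    for _ in range(horizon):
        axis = random.randrange(3)          # pick one of the three coordinates
        delta = random.choice((-1, 1))      # step +1 or -1 along it
        if axis == 0:
            x += delta
        elif axis == 1:
            y += delta
        else:
            z += delta
        if x == y == z == 0:                # returned to the origin
            returned += 1
            break

print(returned / trials)    # close to 0.34
```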
11.2 First Passage Time and Arc Sine Law 383
We note that it takes 50 dimensions for the probability of a return to become 1%.
For the simple symmetric random walk on the real line starting at zero, consider the number of steps necessary for the walk to reach a given integer $j$ for the first time:
$$T_j = \min\{n > 0 : S_n = j\}.$$
Also let $T_{j,r}, r \ge 1$ denote the successive visit times of the random walk to the integer $j$; note that the one-dimensional symmetric walk we are considering visits every integer $j$ infinitely often with probability one, and so it is sensible to talk of $T_{j,r}$.
Definition 11.3. $T_j = T_{j,1}$ is called the first passage time to $j$, and the sequence $\{T_{j,r}\}$ the recurrence times of $j$.
For the special case $j = 0$, we denote the recurrence times as $\tau_1, \tau_2, \tau_3, \ldots$, instead of using the more complicated notation $T_{0,1}, T_{0,2}, \ldots$. Our goal is to write down the distribution of $\tau_1$, and from there conclude the asymptotic distribution of $\tau_r$ as $r \to \infty$. It turns out that although the random walk returns to zero infinitely often with probability one, the expected value of $\tau_1$ is infinite! This precludes Gaussian asymptotics for $\tau_r$ as $r \to \infty$. Recall also that $\eta_n$ denotes the number of times in the first $n$ steps that the random walk returns to zero.
Theorem 11.5 (Distribution of Return Times).
(a) $\tau_1, \tau_2 - \tau_1, \tau_3 - \tau_2, \ldots$ are independent and identically distributed.
(b) The generating function of $\tau_1$ is $E(t^{\tau_1}) = 1 - \sqrt{1 - t^2}$, $|t| \le 1$.
(c) $P(\tau_1 = 2k+1) = 0 \;\forall k$; $P(\tau_1 = 2k) = \frac{\binom{2k-2}{k-1}}{k\,2^{2k-1}}$.
(d) $E(\tau_1) = \infty$.
(e) The characteristic function of $\tau_1$ is $\psi_{\tau_1}(t) = 1 - \sqrt{1 - e^{2it}}$.
Detailed Sketch of Proof. Part (a) is a consequence of the observation that each
time the random walk returns to zero, it simply starts over again. A formal proof
establishes that the Markov property is preserved at first passage times.
For part (b), first obtain the generating function
$$G(t) = \sum_{n=1}^{\infty} P_{1,n}t^n = \sum_{n=1}^{\infty} P_{1,2n}t^{2n} = \sum_{n=1}^{\infty} \frac{\binom{2n}{n}}{4^n}\,t^{2n} = \frac{1}{\sqrt{1-t^2}} - 1, \quad |t| < 1.$$
Now, notice that the sequences $P_{1,n}$ and $Q_{1,n} = P(\tau_1 = n)$ satisfy the recursion relation
$$P_{1,2n} = Q_{1,2n} + \sum_{k=1}^{n-1} P_{1,2k}Q_{1,2n-2k},$$
which produces the expression of part (b). Part (c) now comes out of part (b) on differentiation of the generating function $E(t^{\tau_1})$. Part (d) follows on directly showing that $\sum_{k=1}^{\infty} k\,\frac{\binom{2k-2}{k-1}}{k\,2^{2k-1}} = \infty$.
Parts (e) and (f) are connected. The characteristic function formula follows by a direct evaluation of the complex power series $\psi_{\tau_1}(t) = \sum_{n=1}^{\infty} Q_{1,n}e^{itn} = \sum_{n=1}^{\infty} Q_{1,2n}e^{2itn} = \sum_{n=1}^{\infty} e^{2itn}\,\frac{\binom{2n-2}{n-1}}{n\,2^{2n-1}}$. Alternatively, it can also be deduced by the argument that led to the generating function formula of part (b).
For part (f), by the iid property of $\tau_1, \tau_2 - \tau_1, \ldots$, we can represent $\tau_r$ as a sum of iid variables:
$$\tau_r = \tau_1 + (\tau_2 - \tau_1) + \cdots + (\tau_r - \tau_{r-1}).$$
Therefore, by virtue of part (e), we can write down the characteristic function $\psi_{\tau_r}(t)$ of $\tau_r$, and we get
$$\lim_{r \to \infty} \psi_{\tau_r}\!\left(\frac{t}{r^2}\right) = \lim_{r \to \infty} \left[\psi_{\tau_1}\!\left(\frac{t}{r^2}\right)\right]^r = e^{-\sqrt{-2it}}.$$
We now use the inversion technique to produce a density, which, by the continuity theorem for characteristic functions, must be the density of the limiting distribution of $\frac{\tau_r}{r^2}$. The inversion formula gives
$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-itx}\,e^{-\sqrt{-2it}}\,dt = \frac{e^{-\frac{1}{2x}}\,x^{-\frac{3}{2}}}{\sqrt{2\pi}}, \quad x > 0.$$
The CDF corresponding to this density $f$ is $2\left[1 - \Phi\!\left(\frac{1}{\sqrt{x}}\right)\right]$, $x > 0$, which is what part (f) says.
Finally, part (g) is actually a restatement of part (f) because of the identity $P(\tau_r \le n) = P(\eta_n \ge r)$.
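Parts (c) and (d) lend themselves to a quick numerical check (an illustrative sketch, not from the text): the probabilities $P(\tau_1 = 2k)$ sum to one, confirming that a return is certain, while the truncated sums of $2k \cdot P(\tau_1 = 2k)$ grow without bound:

```python
# Return-time probabilities P(tau_1 = 2k) = C(2k-2, k-1)/(k * 2^(2k-1)):
# they sum to 1, but the truncated expectation diverges, so E(tau_1) = infinity.
from math import comb

def p_return(k):
    return comb(2 * k - 2, k - 1) / (k * 2 ** (2 * k - 1))

N = 5000
probs = [p_return(k) for k in range(1, N + 1)]
total = sum(probs)                     # approaches 1 as N grows
partial_mean = sum(2 * k * p for k, p in zip(range(1, N + 1), probs))
print(total, partial_mean)             # total near 1; partial_mean keeps growing with N
```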
Example 11.4 (Returns to Origin and Stable Law of Index $\frac{1}{2}$). There are several interesting things about the density $f(x)$ of the limiting distribution of $\frac{\tau_r}{r^2}$. First, by inspection, we see that it is the density of $\frac{1}{Z^2}$ where $Z \sim N(0,1)$. It clearly does not have a finite expectation (and neither does $\tau_r$ for any $r$). Moreover, the characteristic function of $f$ matches the form of the characteristic function of a one-sided (positive) stable law of index $\alpha = \frac{1}{2}$. This is an example of a density with tails even heavier than that of the absolute value of a Cauchy random variable. Although the recurrence times do not have a finite expectation, either for finite $r$ or asymptotically, it is interesting to know that the median of the asymptotic distribution of $\frac{\tau_r}{r^2}$ is $\left[\frac{1}{\Phi^{-1}(\frac{3}{4})}\right]^2 = 2.195$. So, for large $r$, with probability about 50%, the $r$th return to zero will come within about $2.2r^2$ steps of the random walk.
A plot of the asymptotic density in Fig. 11.4 shows the extremely sharp peak
near zero, accompanied by a very flat tail. The lack of a finite expectation is due to
this tail.
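The identification of the limit law with the law of $\frac{1}{Z^2}$ invites a simulation check (an illustrative sketch, not from the text, standard library only):

```python
# The limit law of tau_r / r^2 is the law of 1/Z^2 for Z ~ N(0,1); compare the
# sample median of 1/Z^2 with the exact median [1/Phi^{-1}(3/4)]^2.
import random
from statistics import NormalDist, median

random.seed(11)
sample = [1.0 / random.gauss(0.0, 1.0) ** 2 for _ in range(100_000)]

med_sim = median(sample)
med_exact = (1.0 / NormalDist().inv_cdf(0.75)) ** 2   # about 2.198
print(med_sim, med_exact)
```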
We next turn to the question of studying the proportion of time that a random walk spends above the horizontal axis. In notation, we are interested in the distribution of $\frac{\xi_n}{n}$, where recall that $\xi_n = \sum_{k=1}^{n} I_{\{S_k > 0\}}$. The steps each have a symmetric distribution, and they are independent; thus each $S_k$ has a symmetric distribution. Intuitively, one might expect that a trajectory of the random walk, meaning the graph of the points $(n, S_n)$, is above and below the axis close to 50% of the time. The arc sine law, which we now present, says that this is not true. It is more likely that the proportion $\frac{\xi_n}{n}$ will be near zero or near one than that it will be near the naively expected value $\frac{1}{2}$. We provide formulas for the exact and asymptotic distribution of $\xi_n$.
p
Example 11.5. The density of the CDF 2 arcsin. x/ is the Beta density f .x/ D
p 1
x.1x/
; 0 < x < 1. The density is unbounded as x ! 0; 1, and has its minimum
(rather than the maximum) at x D 12 . Consider the probability that the random walk
takes a positive value between 45% and 55% of the times in the first n steps. The arc
sine law implies, by just integrating the Beta density, that this probability is :0637
for large n. Now, consider the probability that the random walk takes a positive value
either more than 95% or less than 5% of the times in the first n steps. By integrating
the Beta density, we find this probability to be :2871, more than four times the :0637
value. We can see the tendency of a trajectory to spend most of the time either above
or below the axis, rather than dividing its time on a close to 5050 basis above and
below the axis. This seems counter to intuition largely because we automatically
expect a random variable to concentrate near its expectation, which in this case is
1
2 . The arc sine law of random walks says that this sort of intuitive expectation can
land us in trouble.
We provide some values of the exact distribution of $\frac{\xi_n}{n}$ for certain selected $n$.

n      $P(.45 \le \xi_n/n \le .55)$     $P(\xi_n/n \le .05 \text{ or } \ge .95)$
10     .0606                            .5379
25     .0500                            .4269
50     .0631                            .3518
100    .0698                            .3077
250    .0636                            .3011
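The two limiting probabilities quoted in the example follow directly from the arc sine CDF; a small sketch (illustrative, not from the text):

```python
# The two asymptotic probabilities from Example 11.5, computed from the
# arc sine CDF F(x) = (2/pi) * arcsin(sqrt(x)).
from math import asin, sqrt, pi

def F(x):
    return (2 / pi) * asin(sqrt(x))

middle = F(0.55) - F(0.45)         # positive between 45% and 55% of the time
tails = F(0.05) + (1 - F(0.95))    # positive less than 5% or more than 95% of the time
print(round(middle, 4), round(tails, 4))   # 0.0637 and 0.2871
```

By the symmetry of the Beta$(\frac{1}{2}, \frac{1}{2})$ density around $\frac{1}{2}$, the two tail pieces are equal, which is why `tails` is exactly twice `F(0.05)`.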
11.3 The Local Time 387
Consider the simple symmetric random walk starting at zero, with iid steps having the common distribution $P(X_i = 1) = P(X_i = -1) = \frac{1}{2}$, and fix any integer $x$. The family of random variables
$$\ell(x, n) = \#\{k : 0 \le k \le n,\; S_k = x\}$$
is called the local time of the random walk. Note, therefore, that $\ell(0, n)$ is the same as $\eta_n$. The local time of the random walk answers the following interesting question: how much time does the random walk spend at $x$ up to the time $n$? It turns out that the distribution of $\ell(x, n)$ can be written in a simple closed-form, and therefore we can compute it.
Note that .0; 2n/ D .0; 2n C 1/, because the simple symmetric walk starting
at zero can visit the origin only at even numbered times. Therefore, in the case of
.0; n/, it is enough to know the distribution of .0; 2n/ for a general n.
Theorem 11.7. For any $n \ge 1$ and $0 \le k \le n$,
$$P(\ell(0, 2n) = k) = P(\ell(0, 2n+1) = k) = \frac{\binom{2n-k}{n}}{2^{2n-k}}.$$
Proof. For the proof of this formula, we require a few auxiliary combinatorial facts, which we state together as a lemma.

Lemma. For the simple symmetric random walk, with the same notation as in the previous section,

(a) For any $k \ge 1$, $P(\tau_1 = 2k \mid X_1 = 1) = P(\tau_1 = 2k \mid X_1 = -1) = P(T_1 = 2k-1)$.
(b) For any $k \ge 1$, $T_k$ and $\tau_k - k$ have the same distribution.
(c) For any $n$ and $0 \le k \le n$, $P(M_n = k) = \frac{\binom{n}{\lceil (n+k)/2 \rceil}}{2^n}$.

Part (c) of the lemma is not very easy to prove, and we do not present its proof here. Part (a) is immediate from the symmetry of the distribution of $X_1$ and from the total probability formula
$$P(\tau_1 = 2k) = \frac{1}{2} P(\tau_1 = 2k \mid X_1 = 1) + \frac{1}{2} P(\tau_1 = 2k \mid X_1 = -1)$$
$$= \frac{1}{2} P(T_1 = 2k-1) + \frac{1}{2} P(T_1 = 2k-1) = P(T_1 = 2k-1).$$
Therefore, $\tau_1 - 1$ and $T_1$ have the same distribution. It therefore follows that for any $k \ge 1$, $\tau_k - k$ and $T_k$ have the same distribution. As a consequence, we have the important identity
$$P(\ell(x, n) = k) = \sum_{j=x}^{n-2k} P(\ell(x, n) = k \mid T_x = j)\, P(T_x = j) = \sum_{j=x}^{n-2k} P(\ell(0, n-j) = k-1)\, P(T_x = j),$$
and from here, on some algebra, the distribution of $\ell(x, n)$ works out to the following formula. $\square$
$$P(\ell(x, n) = k) = \begin{cases} \dfrac{\binom{n-k}{\frac{n+x}{2}}}{2^{n-k}} & \text{if } n + x \text{ is even}, \\[8pt] \dfrac{\binom{n-k}{\frac{n+x-1}{2}}}{2^{n-k}} & \text{if } n + x \text{ is odd}. \end{cases}$$
Remark. In the above, combinatorial coefficients $\binom{r}{s}$ are to be taken as zero if $r < s$. It is a truly rewarding consequence of clever combinatorial arguments that in the symmetric case, the distribution of the local time can in fact be fully written down.
Example 11.6 (The Local Time at Zero). Consider the simple symmetric random walk starting at zero. Because zero is the starting point of the random walk, the local time at zero is of special interest. In this example, we want to investigate the local time at zero in some additional detail.

First, what exactly does the distribution look like as a function of $n$? By using our analytical formula,
$$P(\ell(0, 2n) = k) = \frac{\binom{2n-k}{n}}{2^{2n-k}},$$
we can easily compute the pmf of the local time for (small) given $n$. An inspection of the pmf will help us in understanding the distribution. Here is a small table.
$P(\ell(0, 2n) = k)$

k      n = 3     n = 6     n = 10
0      .3125     .2256     .1762
1      .3125     .2256     .1762
2      .25       .2051     .1669
3      .125      .1641     .1484
4                .1094     .1222
5                .0547     .0916
6                .0156     .0611
7                          .0349
8                          .0161
9                          .0054
10                         .0010
We see from this table that the distribution of the local time at zero has a few interesting properties: it is monotone nonincreasing in $k$, and it has its maximum value at $k = 0$ and $k = 1$. Thus the random walk spends more time near its original home than at any other location. Both of these properties can be proved analytically. Also,
from the table of the pmf, we can easily compute the mean of the local time at zero.
For $n = 3, 6, 10$, $E[\ell(0, 2n)]$ equals 1.1875, 1.93, and 2.70. For somewhat larger $n$, say $n = 100$, $E[\ell(0, 2n)]$ equals 10.33. The mean local time grows at the rate of $\sqrt{n}$. In fact, when normalized by $\sqrt{n}$, the local time has a limiting distribution, and the limiting distribution is a very interesting one, namely the absolute value of a standard normal. This can be proved by using Stirling's approximation in the exact formula for the pmf of the local time. Here is the formal result:
$$\frac{\ell(0, n)}{\sqrt{n}} \stackrel{\mathcal{L}}{\Rightarrow} |Z|.$$
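The means quoted above, and the $\sqrt{n}$ growth, can be checked from the exact pmf; a minimal sketch (helper names ours):

```python
from math import comb, sqrt, pi

def local_time_pmf(n, k):
    # P(local time at zero over 2n steps equals k), from Theorem 11.7
    return comb(2 * n - k, n) / 2 ** (2 * n - k)

def mean_local_time(n):
    return sum(k * local_time_pmf(n, k) for k in range(n + 1))

print(round(mean_local_time(3), 4))    # 1.1875
print(round(mean_local_time(100), 2))  # 10.33
# normalized by sqrt(2n), the mean tends to E|Z| = sqrt(2/pi) ~ 0.7979 (slowly)
print(round(mean_local_time(100) / sqrt(2 * 100), 3), round(sqrt(2 / pi), 3))
```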
11.4 Practically Useful Generalizations

Definition 11.4. Let $X_i,\ i \ge 1$ be iid random variables with common CDF $F$, and let $S_n = \sum_{i=1}^n X_i,\ n \ge 1$. Let $x$ be a real number and $S_{n,x} = x + S_n$. Then $\{S_{n,x}\},\ n \ge 1$ is called a random walk with step distribution $F$ starting at $x$.
(b) (Erdős–Kac). If the $X_i$ are independent, have zero mean, and satisfy the conditions of the Lindeberg–Feller theorem, but are not necessarily iid, then the arc sine law holds:
$$P\left(\frac{\pi_n}{n} \le x\right) \to \frac{2}{\pi} \arcsin(\sqrt{x}), \quad \forall x \in [0, 1].$$
Part (a) is also popularly known as Spitzer’s identity. If, in part (a), we choose
k D n, we get
Returning to the random variable $T$, namely the first time the random walk enters negative territory, an interesting general formula for its expectation can be given, for random walks more general than the simple symmetric random walk. We have to be careful in talking about $E(T)$, because it need not be finite. But if it is finite, then it is an interesting number to know. Consider, for example, a person gambling in a casino, and repeatedly playing a specific game. We may assume that the game is (at least slightly) favorable to the house, and unfavorable to the player. So, intuitively, the player already knows that eventually she will be sunk. But how long can she continue without being sunk? If the player expects that she can hang around for a long time, it may well add to the excitement of the game.
11.5 Wald's Identity
We do in fact deal with a more general result, known as Wald’s identity. It was
proved in the context of sequential testing of hypotheses in statistics. To describe the
identity, we need a definition. We caution the reader that although the meaning of the
definition is clear, it is not truly rigorous because of our not using measure-theoretic
terminology.
Definition 11.5. Let $X_1, X_2, \ldots$ be an infinite sequence of random variables defined on a common sample space $\Omega$. A nonnegative integer-valued random variable $N$, also defined on $\Omega$, is called a stopping time if whether or not $N \le n$ for a given $n$ can be determined by only knowing $X_1, X_2, \ldots, X_n$, and if, moreover, $P(N < \infty) = 1$.
Example 11.7. Suppose a fair die is rolled repeatedly (and independently), and let the sequence of rolls be $X_1, X_2, \ldots$. Let $N$ be the first throw at which the sum of the rolls exceeds 10. In notation, $N = \min\{n : S_n = X_1 + \cdots + X_n > 10\}$. Clearly, $N$ cannot be more than 11 and so $P(N < \infty) = 1$, and also, whether the sum has exceeded 10 within the first $n$ rolls can be decided by knowing the values of only the first $n$ rolls. So $N$ is a valid stopping time.
Example 11.8. Suppose $X_1, X_2, \ldots$ are iid $N(0, 1)$, and let $W_n = X_{n:n} - X_{1:n}$ be the range of $X_1, X_2, \ldots, X_n$. Suppose $N$ is the first time the range $W_n$ exceeds five. Because $X_{n:n} \stackrel{a.s.}{\to} \infty$ and $X_{1:n} \stackrel{a.s.}{\to} -\infty$, we have that $W_n \stackrel{a.s.}{\to} \infty$. Therefore, $P(N < \infty) = 1$. Also, evidently, whether $W_n > 5$ can be decided by knowing the values of only $X_1, X_2, \ldots, X_n$. So $N$ is a valid stopping time.
Theorem 11.11 (Wald's Identity). Let $X_i,\ i \ge 1$ be iid random variables, with $E(|X_1|) < \infty$. Let $S_n = X_1 + \cdots + X_n,\ n \ge 1,\ S_0 = 0$, and let $N$ be a stopping time with a finite expectation. Then $E(S_N) = E(N) E(X_1)$.
Proof. The proof is not completely rigorous, and we should treat it as a sketch of the proof. First, note that
$$S_N = \sum_{n=1}^{\infty} S_N I_{\{N=n\}} = \sum_{n=1}^{\infty} S_n I_{\{N=n\}} = \sum_{n=1}^{\infty} S_n \left[ I_{\{N > n-1\}} - I_{\{N > n\}} \right] = \sum_{n=1}^{\infty} X_n I_{\{N > n-1\}}$$
$$\Rightarrow E(S_N) = E\left[ \sum_{n=1}^{\infty} X_n I_{\{N > n-1\}} \right] = \sum_{n=1}^{\infty} E[X_n I_{\{N > n-1\}}] = \sum_{n=1}^{\infty} E[X_n] E[I_{\{N > n-1\}}]$$
$$= \sum_{n=1}^{\infty} E[X_1] E[I_{\{N > n-1\}}] = E(X_1) \sum_{n=1}^{\infty} P(N > n-1) = E(X_1) E(N). \qquad \square$$
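Wald's identity is easy to check by simulation on the die-rolling stopping time of Example 11.7, where $E(X_1) = 3.5$; a minimal sketch (loop structure and sample size are our own choices):

```python
import random

random.seed(1)

def one_trial():
    # roll a fair die until the running sum exceeds 10; N is a stopping time
    s, n = 0, 0
    while s <= 10:
        s += random.randint(1, 6)
        n += 1
    return s, n

trials = [one_trial() for _ in range(50000)]
mean_S = sum(t[0] for t in trials) / len(trials)
mean_N = sum(t[1] for t in trials) / len(trials)
# Wald's identity: E(S_N) = E(N) E(X_1) = 3.5 E(N); the two estimates agree closely
print(round(mean_S, 2), round(3.5 * mean_N, 2))
```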
or drift off to $-\infty$, or it can oscillate. If it oscillates, what can we say about the nature of its oscillation? For example, under certain conditions on the common distribution of the $X_i$, does it come arbitrarily close to any real number over and over again? Clearly, the answer depends on the common distribution of the $X_i$. For example, if the $X_i$ can only take the values $\pm 1$, then obviously the random walk cannot come arbitrarily close to any real number over and over again. The answer in general is that either the random walk does not visit neighborhoods of any number over and over again, or it visits neighborhoods of every number over and over again, or perhaps it visits neighborhoods of certain distinguished numbers over and over again. We give two formal definitions and a theorem for the one-dimensional case first. The case of random walks in general dimensions is treated following the one-dimensional case.
Remark. The recurrent set will be empty when the random walk drifts off to one of $+\infty$ or $-\infty$. For example, an asymmetric simple random walk will do so. The simple symmetric random walk will have a countable lattice set as its recurrent set.
On the other hand, as we later show, a random walk with iid standard normal steps
will have the entire real line as its recurrent set.
Although we shortly present a handy and effective theorem that settles all cases at one time for verifying whether a particular random walk in general dimension has a specific point, say the origin, in its recurrent set, the following intuitively plausible result in the one-dimensional case is worth knowing; see Chung (2001, pp. 279–283) for a proof.
Theorem 11.13. (a) If for some $\epsilon > 0$, $\sum_{n=1}^{\infty} P(|S_n| \le \epsilon) = \infty$, then 0 belongs to the recurrent set.
(b) If for some $\epsilon > 0$, $\sum_{n=1}^{\infty} P(|S_n| \le \epsilon) < \infty$, then 0 does not belong to the recurrent set.
Although part (b) of this theorem easily follows by an application of the Borel–Cantelli lemma, the proof of part (a) is nontrivial; see Chung (1974).
A generalization of this is assigned as a chapter exercise. Let us see a quick example of an application of this theorem.
Example 11.10 (Standard Normal Random Walk). Suppose $X_1, X_2, \ldots$ are iid $N(0, 1)$ and consider the random walk $S_n = \sum_{i=1}^n X_i,\ n \ge 1$. Then $S_n \sim N(0, n)$ and therefore $\frac{S_n}{\sqrt{n}} \sim N(0, 1)$. Fix any $\epsilon > 0$. Then,
$$P(|S_n| \le \epsilon) = P\left( \left| \frac{S_n}{\sqrt{n}} \right| \le \frac{\epsilon}{\sqrt{n}} \right) = 2\Phi\left( \frac{\epsilon}{\sqrt{n}} \right) - 1,$$
where $\Phi$ denotes the standard normal CDF. Now,
$$\Phi\left( \frac{\epsilon}{\sqrt{n}} \right) \approx \Phi(0) + \frac{\epsilon}{\sqrt{n}}\, \phi(0),$$
and therefore $2\Phi(\frac{\epsilon}{\sqrt{n}}) - 1 \approx \frac{2\epsilon \phi(0)}{\sqrt{n}}$. Because $\phi(0) > 0$, and $\sum_{n=1}^{\infty} \frac{1}{\sqrt{n}}$ diverges, by the theorem above, 0 is seen to be a recurrent state for $S_n$. The rough approximation argument we gave can be made rigorous easily, by using lower bounds on the standard normal CDF. In fact, any real number $x$ is a recurrent state for $S_n$ in this case, as is shown shortly.
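The divergence of $\sum_n P(|S_n| \le \epsilon)$ is also visible numerically: the partial sums grow like $\sqrt{N}$. A minimal sketch using the error function (the choice $\epsilon = 0.5$ is arbitrary):

```python
from math import erf, sqrt

def phi_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1 + erf(x / sqrt(2)))

def partial_sum(N, eps=0.5):
    # sum over n = 1..N of P(|S_n| <= eps) = 2*Phi(eps/sqrt(n)) - 1
    return sum(2 * phi_cdf(eps / sqrt(n)) - 1 for n in range(1, N + 1))

# the partial sums grow like sqrt(N), hence diverge: 0 is a recurrent state
print(round(partial_sum(100), 2), round(partial_sum(10000), 2))
```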
For the stable characteristic function $\psi(t) = e^{-|t|^{\alpha}}$, we have, for $|t| \le 1$,
$$1 - \psi(t) = 1 - e^{-|t|^{\alpha}} \ge \frac{|t|^{\alpha}}{2} \ \Rightarrow \ \frac{1}{1 - \psi(t)} \le \frac{2}{|t|^{\alpha}}.$$
But for $0 < \alpha < 1$, $\int_0^1 \frac{2}{t^{\alpha}}\,dt < \infty$, and therefore, by the reverse part of the Chung–Fuchs theorem, one-dimensional stable random walks with index smaller than one are transient.
recurrent in one and two dimensions, and transient in all dimensions higher than two. A number of extremely important general results follow from the Chung–Fuchs theorem. Among them, the following three stand out.
Theorem 11.15 (Three General Results). (a) Suppose $d = 1$, and that $E(|X_i|) < \infty$, with $E(X_i) = 0$. Then the random walk $S_n$ is recurrent.
(b) More generally, suppose $\frac{S_n}{n} \stackrel{P}{\to} 0$. Then $S_n$ is recurrent.
(c) Consider a general $d \ge 3$, and suppose $F$ is not supported on any two-dimensional subset of $\mathbb{R}^d$. Then $S_n$ must be transient.
Part (c) is the famous result that there are no recurrent random walks beyond two
dimensions. Beyond two dimensions, a random walker has too many directions in
which to wander off, and does not return to its original position recurrently, that is,
infinitely often.
Proof. Parts (a) and (c) are in fact not very hard to derive from the Chung–Fuchs theorem. For example, for part (a), finiteness of $E(|X_i|)$ allows us to write a Taylor expansion for $\psi(t)$, the characteristic function of $F$ (as we have already discussed). Furthermore, that $E(X_i) = 0$ allows us to conclude that locally, near the origin, $1 - \psi(t)$ is $o(|t|)$. This leads to divergence of the integral $\int_{-1}^{1} \frac{1}{1 - \psi(t)}\,dt$, and hence recurrence of the random walk by the Chung–Fuchs theorem. Part (c) is also proved by a Taylor expansion, but we do not show the argument, because we have not discussed Taylor expansions of characteristic functions for $d > 1$. Part (b) is harder to prove, but it too follows from the Chung–Fuchs theorem. $\square$
Inequalities are of tremendous value for proving convergence of suitable quantities, for obtaining rates of convergence, and for finding concrete bounds on useful functions and sequences. We collect a number of classic inequalities on the moments and distributions of partial sums in this section. References to each inequality are given in DasGupta (2008).
Exercises
Exercise 11.1 (Simple Random Walk). By evaluating its generating function, calculate the probabilities that the second return to zero of the simple symmetric random walk occurs at the fourth step; at the sixth step.
Exercise 11.2 (Simple Random Walk). For the simple symmetric random walk, find $\lim_{n \to \infty} \frac{E(\gamma_n)}{\sqrt{n}}$, where $\gamma_n$ is the number of returns to zero by step $n$.
Exercise 11.4. Consider two particles starting out at specified integer values $x, y$. At each subsequent time, one of the two particles is selected at random, and moved one unit to the right or one unit to the left, with probabilities $p$ and $q$, $p + q = 1$. Calculate the probability that the two particles will eventually meet.
Exercise 11.5 (On the Local Time). For $n = 10, 20$, and 30, calculate the probability that the simple symmetric random walk spends zero time at the state $x$ for $x = -2, -1, 1, 2, 5, 10$.
Exercise 11.6 (On the Local Time). Find and plot the mass function of the local time $\ell(x, n)$ of the simple symmetric random walk for $x = 5, 10, 15$, with $n = 25$.
Exercise 11.7 (On the Local Time). Calculate the expected value of the local time $\ell(x, n)$ of the simple symmetric random walk for $x = 2, 5, 10, 15$ and $n = 25, 50, 100$. Comment on the patterns that emerge.
Exercise 11.8 (Quartiles of Number of Positive Terms). For the simple symmetric random walk, compute the quartiles of $\pi_n$, the number of positive values among $S_1, \ldots, S_n$, for $n = 5, 8, 10, 15$.
Exercise 11.9 * (Range of a Random Walk). Consider the simple symmetric random walk and let $R_n$ be the number of distinct states visited by the walk up to time $n$ (i.e., $R_n$ is the number of distinct elements in the set of numbers $\{S_0, S_1, \ldots, S_n\}$).
(a) First derive a formula for $P(S_k \ne S_j\ \forall j,\ 0 \le j \le k-1)$ for any given $k \ge 1$.
(b) Hence derive a formula for $E(R_n)$ for any $n$.
(c) Compute this formula for $n = 2, 5, 10, 20$.
Exercise 11.10 * (Range of a Random Walk). In the notation of the previous exercise, show that $\frac{R_n}{n} \stackrel{P}{\to} 0$.
Exercise 11.18 (The Asymmetric Cubic Lattice Random Walk). Consider the cubic lattice random walk in $d$ dimensions, but with the change that a coordinate changes by $\pm 1$ with respective probabilities $p$ and $q$, $p + q = 1$. Derive a formula for $P_{d,2n}$ in this general situation, and then verify whether $\sum_{n=1}^{\infty} P_{d,2n}$ converges or diverges for given $d = 1, 2, 3, \ldots$. Make a conclusion about the recurrence or transience of such an asymmetric random walk for each $d \ge 1$.
Exercise 11.22 (Ratio of Recurrence Times). For the simple symmetric random walk on $\mathbb{Z}$,
(a) Find an explicit formula for $E\left(\frac{\tau_r}{\tau_s}\right),\ 1 \le r < s < \infty$, where $\tau_i$ denotes the time of the $i$th return of the random walk to zero.
(b) Find an explicit answer for $E\left(\frac{\tau_r}{\tau_s}\right),\ 1 \le r < s < \infty$.
Exercise 11.23 * (Number of Positive Terms Given the Last Term). Consider a one-dimensional random walk driven by $F = \Phi$, the standard normal CDF. Derive a formula for $E(\pi_n \mid S_n = c)$, that is, the conditional expectation of the number of positive terms given the value of the last term.
Exercise 11.26 (Application of Wald's Identity). Let $X$ have the negative binomial distribution with parameters $r$ and $p$. Prove by using only Wald's identity that $E(X) = \frac{r}{p}$.
Chapter 12
Brownian Motion and Gaussian Processes
We started this text with discussions of a single random variable. We then proceeded
to two and more generally, a finite number of random variables. In the last chapter,
we treated the random walk, which involved a countably infinite number of random
variables, namely the positions of the random walk $S_n$ at times $n = 0, 1, 2, 3, \ldots$.
The time parameter n for the random walks we discussed in the last chapter belongs
to the set of nonnegative integers, which is a countable set. We now look at a special
continuous time stochastic process, which corresponds to an uncountable family of
random variables, indexed by a time parameter t belonging to a suitable uncountable
time set T . The process we mainly treat in this chapter is Brownian motion, although
some other Gaussian processes are also treated briefly.
Brownian motion is one of the most important continuous-time stochastic processes, and it has earned its special status because of its elegant theoretical properties, its numerous important connections to other continuous-time stochastic processes, its real applications, and its physical origin. If we look at the path of a random walk when we run the clock much faster, and the steps of the walk are also suitably smaller, then the random walk converges to Brownian motion. This is an extremely important connection, and it is made precise later in this chapter.
Brownian motion arises naturally in some form or other in numerous statistical inference problems. It is also widely used as a model for stock market behavior.
The process owes its name to the Scottish botanist Robert Brown, who noticed
under a microscope that pollen particles suspended in fluid engaged in a zigzag and
eccentric motion. It was, however, Albert Einstein who in 1905 gave Brownian motion a formal physical formulation. Einstein showed that Brownian motion of a large particle visible under a microscope could be explained by assuming that the particle gets ceaselessly bombarded by invisible molecules of its surrounding medium.
The theoretical predictions made by Einstein were later experimentally verified by
various physicists, including Jean Baptiste Perrin who was awarded the Nobel prize
in physics for this work. In particular, Einstein’s work led to the determination of
Avogadro’s constant, perhaps the first major use of what statisticians call a moment
estimate. The existence and construction of Brownian motion was first explicitly
established by Norbert Wiener in 1923, which accounts for the other name Wiener
process for a Brownian motion.
We remarked in the introduction that random walks and Brownian motion are interconnected in a suitable asymptotic paradigm. It would be helpful to understand this connection in a conceptual manner before going into technical treatments of Brownian motion.
Consider then the usual simple symmetric random walk defined by $S_0 = 0,\ S_n = X_1 + X_2 + \cdots + X_n,\ n \ge 1$, where the $X_i$ are iid with common distribution $P(X_i = \pm 1) = \frac{1}{2}$. Consider now a random walk that makes its steps at much smaller time intervals, but with jump sizes that are also smaller. Precisely, with the $X_i,\ i \ge 1$ still as above, define
$$S_n(t) = \frac{S_{\lfloor nt \rfloor}}{\sqrt{n}}, \quad 0 \le t \le 1,$$
where $\lfloor x \rfloor$ denotes the integer part of a nonnegative real number $x$. This amounts to joining the points
$$(0, 0),\ \left( \frac{1}{n}, \frac{X_1}{\sqrt{n}} \right),\ \left( \frac{2}{n}, \frac{X_1 + X_2}{\sqrt{n}} \right), \ldots$$
by linear interpolation, thereby obtaining a curve. The simulated plot of $S_n(t)$ for $n = 1000$ in Fig. 12.1 shows the zigzag path of the scaled random walk. We can see that the plot is rather rough, and the function takes the value zero at $t = 0$; that is, $S_n(0) = 0$, while $S_n(1) \ne 0$.
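A path like the one in Fig. 12.1 can be generated in a few lines; the sketch below also checks that $S_n(1)$ has variance close to one, as the central limit theorem predicts (the sample sizes are arbitrary choices of ours):

```python
import random
from math import sqrt

random.seed(7)

def scaled_walk(n):
    # the polygonal process evaluated at the grid points t = k/n:
    # S_n(k/n) = (X_1 + ... + X_k) / sqrt(n)
    path, s = [0.0], 0
    for _ in range(n):
        s += random.choice((-1, 1))
        path.append(s / sqrt(n))
    return path

n = 1000
endpoints = [scaled_walk(n)[-1] for _ in range(2000)]
var = sum(x * x for x in endpoints) / len(endpoints)
print(round(var, 2))  # close to Var W(1) = 1
```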
It turns out that in a suitable precise sense, the graph of $S_n(t)$ on $[0, 1]$ for large $n$ should mimic the graph of a random function called Brownian motion on $[0, 1]$. Brownian motion is a special stochastic process, which is a collection of infinitely many random variables, say $W(t),\ 0 \le t \le 1$, each $W(t)$ for a fixed $t$ being a normally distributed random variable, with other additional properties for their joint distributions. They are introduced formally and analyzed in greater detail in the next sections.
The question arises why the connection between a random walk and Brownian motion is of any use or interest to us. A short nontechnical answer is that because $S_n(t)$ acts like a realization of a Brownian motion, by using known properties of Brownian motion we can approximately describe properties of $S_n(t)$ for large $n$. This is useful, because the stochastic process $S_n(t)$ arises in numerous statistical problems.
[Figure 12.1: simulated path of the scaled random walk $S_n(t)$, $n = 1000$, $0 \le t \le 1$.]

12.2 Basic Definitions
Our principal goal in the subsequent sections is to study Brownian motion and the Brownian bridge, due to their special importance among Gaussian processes.
The Brownian bridge is closely related to Brownian motion, and shares many of
the same properties as Brownian motion. They both arise in many statistical appli-
cations. It should also be understood that the Brownian motion and bridge are of
enormous independent interest in the study of probability theory, regardless of their
connections to problems in statistics.
We caution the reader that it is not possible to make all the statements in this
chapter mathematically rigorous without using measure theory. This is because we
are now dealing with uncountable collections of random variables, and problems of
measure zero sets can easily arise. However, the results are accurate and they can be
practically used without knowing exactly how to fix the measure theory issues.
We first give some general definitions for future use.
$$P(X_{t_n} \le x_{t_n} \mid X_{t_1} = x_{t_1}, \ldots, X_{t_{n-1}} = x_{t_{n-1}}) = P(X_{t_n} \le x_{t_n} \mid X_{t_{n-1}} = x_{t_{n-1}});$$
that is, the distribution of the future values of the process given the entire past depends only on the most recent past.
Remark. Property (iv) actually can be proved to follow from the other three proper-
ties. But it is helpful to include it in the definition to emphasize the importance of the
continuity of Brownian paths. Property (iii) is the celebrated independent increments
property and lies at the heart of numerous further properties of Brownian motion.
We often just omit the word standard when referring to standard Brownian motion.
Definition 12.8. Let $W(t)$ be a standard Wiener process on $[0, 1]$. The process $B(t) = W(t) - tW(1)$ is called a standard Brownian bridge on $[0, 1]$.
Remark. Note that the definition implies that $B(0) = B(1) = 0$ with probability one. Thus, the Brownian bridge on $[0, 1]$ starts and ends at zero; hence the name tied-down Wiener process. The Brownian bridge on $[0, 1]$ can be defined in various other equivalent ways. The definition we adopt here is convenient for many calculations.
Definition 12.9. Let $1 \le d < \infty$, and let $W_i(t),\ 1 \le i \le d$, be independent Brownian motions on $[0, \infty)$. Then $\mathbf{W}_d(t) = (W_1(t), \ldots, W_d(t))$ is called $d$-dimensional Brownian motion.
The two cases are combined into the single formula $\mathrm{Cov}(W(s+1) - W(s),\ W(t+1) - W(t)) = (1 - |s - t|)^+$. The covariance functions of $X_1(t)$, $X_3(t)$, and $X_5(t)$ depend only on $s - t$, and so these processes are stationary.
We show a simple and useful result on characterizing Gaussian processes that are
Markov. It turns out that there is a simple way to tell if a given Gaussian process
is Markov by simply looking at its correlation function. Because we only need to
It is not a priori obvious that an uncountable collection of random variables with the
defining properties of Brownian motion can be constructed on a common probability
space (a measure theory terminology). In other words, that Brownian motion exists
requires a proof. Various proofs of the existence of Brownian motion can be given.
We provide two explicit constructions, of which one is more classic in nature. But
the second construction is also useful.
$$H_{n,k}(t) = 0 \ \text{ if } t \notin \left[ \frac{k-1}{2^n}, \frac{k+1}{2^n} \right], \quad \text{and} \quad H_{n,k}(t) = \begin{cases} 2^{(n-1)/2} & \text{if } t \in \left[ \frac{k-1}{2^n}, \frac{k}{2^n} \right), \\[4pt] -2^{(n-1)/2} & \text{if } t \in \left[ \frac{k}{2^n}, \frac{k+1}{2^n} \right]. \end{cases}$$
Let $S_{n,k}(t)$ be the sequence of Schauder functions defined as $S_{n,k}(t) = \int_0^t H_{n,k}(s)\,ds,\ 0 \le t \le 1,\ n \ge 0,\ k \in I_n$. Then the infinite series $W(t) = \sum_{n=0}^{\infty} \sum_{k \in I_n} Z_{n,k} S_{n,k}(t)$ converges uniformly in $t$ on $[0, 1]$, and the process $W(t)$ is a Brownian motion on $[0, 1]$.
Remark. See Bhattacharya and Waymire (2007, p. 135) for a proof. Both constructions of Brownian motion given above can be heuristically understood by using ideas of Fourier theory. If the sequence $f_0(t) \equiv 1, f_1(t), f_2(t), \ldots$ forms an orthonormal basis of $L^2[0, 1]$, then we can expand a square integrable function, say $w(t)$, as an infinite series $\sum_i c_i f_i(t)$, where $c_i$ equals the inner product $\int_0^1 w(t) f_i(t)\,dt$. Thus, $c_0 = 0$ if the integral of $w(t)$ is zero. The Karhunen–Loève expansion can be heuristically explained as a random orthonormal expansion of $W(t)$. The basis functions $f_i(t)$ chosen do depend on the process $W(t)$, specifically on the covariance function. The inner products $\int_0^1 W(t) f_i(t)\,dt,\ i \ge 1$, form a sequence of iid standard normals. This is very far from a proof, but provides a heuristic context for the expansion. The second construction is based similarly on expansions using a wavelet basis instead of a Fourier basis.
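As a sanity check on the series construction, the squared Schauder functions at a fixed $t$, together with the level-zero contribution, should sum to $\mathrm{Var}\,W(t) = t$. The sketch below assumes the common convention that $I_n$ consists of the odd $k$ in $[0, 2^n]$ for $n \ge 1$, with a level-zero term $Z \cdot t$; both are assumptions of ours, since $I_n$ is defined outside this excerpt:

```python
def schauder(n, k, t):
    # S_{n,k}(t): integral of the Haar function H_{n,k}; a "tent" supported on
    # [(k-1)/2^n, (k+1)/2^n] with peak height 2^(-(n+1)/2) at k/2^n
    left, mid, right = (k - 1) / 2 ** n, k / 2 ** n, (k + 1) / 2 ** n
    if t <= left or t >= right:
        return 0.0
    h = 2 ** ((n - 1) / 2)
    return h * (t - left) if t < mid else h * (right - t)

def truncated_variance(t, levels=12):
    # variance of the truncated series: t^2 from the assumed level-zero term Z*t,
    # plus S_{n,k}(t)^2 from each independent N(0,1) coefficient Z_{n,k}
    v = t * t
    for n in range(1, levels + 1):
        v += sum(schauder(n, k, t) ** 2 for k in range(1, 2 ** n, 2))
    return v

for t in (0.25, 0.5, 0.75):
    print(t, truncated_variance(t))  # matches t (up to rounding) at dyadic points
```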
Distributional properties and formulas are always useful in doing further calculations and for obtaining concrete answers to questions. The most basic distributional properties of the Brownian motion and bridge are given first.
12.3 Basic Distributional Properties
Throughout this chapter, the notation $W(t)$ and $B(t)$ means a (standard) Brownian motion and a standard Brownian bridge, respectively. The qualifier standard is often dropped for brevity.
Proposition. (a) $\mathrm{Cov}(W(s), W(t)) = \min(s, t)$; $\mathrm{Cov}(B(s), B(t)) = \min(s, t) - st$.
(b) (The Markov Property). For any given $n$ and $t_0 < t_1 < \cdots < t_n$, the conditional distribution of $W(t_n)$ given that $W(t_0) = x_0, W(t_1) = x_1, \ldots, W(t_{n-1}) = x_{n-1}$ is the same as the conditional distribution of $W(t_n)$ given $W(t_{n-1}) = x_{n-1}$.
(c) Given $s < t$, the conditional distribution of $W(t)$ given $W(s) = w$ is $N(w, t - s)$.
(d) Given $t_1 < t_2 < \cdots < t_n$, the joint density of $W(t_1), W(t_2), \ldots, W(t_n)$ is given by the function
$$f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i - x_{i-1},\ t_i - t_{i-1}) \quad (x_0 = 0,\ t_0 = 0),$$
where $p(x, t)$ is the density of $N(0, t)$; that is, $p(x, t) = \frac{1}{\sqrt{2\pi t}}\, e^{-\frac{x^2}{2t}}$.
Definition 12.11. A nonnegative random variable $\tau$ is called a stopping time for the process $W(t)$ if for any $s > 0$, whether $\tau \le s$ depends only on the values of $W(t)$ for $t \le s$.
Example 12.2. For $b > 0$, consider the first passage time $T_b = \inf\{t > 0 : W(t) = b\}$. Then, $T_b > s$ if and only if $W(t) < b$ for all $t \le s$. Therefore, $T_b$ is a stopping time for the process $W(t)$.
Theorem 12.3 (Strong Markov Property). If $\tau$ is a stopping time for the process $W(t)$, then $W(\tau + t) - W(\tau)$ is also a Brownian motion on $[0, \infty)$ and is independent of $\{W(s),\ s \le \tau\}$.
See Bhattacharya and Waymire (2007, p. 153) for its proof.
(e) (Maximum Modulus of Brownian Motion). $\sup_{0 \le t \le 1} |W(t)|$ has the CDF
$$G(x) = \frac{4}{\pi} \sum_{m=0}^{\infty} \frac{(-1)^m}{2m+1}\, e^{-(2m+1)^2 \pi^2 / (8x^2)}, \quad x \ge 0.$$
(f) (Maximum of a Brownian Bridge). Let $B(t)$ be a Brownian bridge on $[0, 1]$. Then, $\sup_{0 \le t \le 1} |B(t)|$ has the CDF
$$H(x) = 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2 x^2}, \quad x \ge 0.$$
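Both series converge very fast and are easy to evaluate; the sketch below does so in plain Python. ($H$ is the Kolmogorov distribution that appears in Kolmogorov–Smirnov testing, whose classical 95% point is near 1.36.)

```python
from math import exp, pi

def G(x, terms=50):
    # CDF of sup |W(t)| on [0, 1]
    return (4 / pi) * sum((-1) ** m / (2 * m + 1)
                          * exp(-(2 * m + 1) ** 2 * pi ** 2 / (8 * x * x))
                          for m in range(terms))

def H(x, terms=50):
    # CDF of sup |B(t)| on [0, 1] (the Kolmogorov distribution)
    return 1 - 2 * sum((-1) ** (k - 1) * exp(-2 * k * k * x * x)
                       for k in range(1, terms + 1))

print(round(H(1.36), 4))  # ~0.9505, the classical 95% point used in KS testing
print(round(G(1.96), 4))
```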
$$P(T_b < t) = P(T_b < t,\ W(t) > b) + P(T_b < t,\ W(t) < b),$$
and
$$P(T_b < t,\ W(t) > b) = P(T_b < t,\ W(t) < b).$$
Because $P(T_b < t,\ W(t) > b) = P(W(t) > b)$ (because $W(t) > b$ implies that $T_b < t$), if we accept the second identity above, then we immediately have the desired result $P(M(t) > b) = P(T_b < t) = 2P(W(t) > b)$. Thus, only the second identity needs a proof. This is done by a clever argument.

The event $\{T_b < t,\ W(t) < b\}$ happens if and only if at some point before $t$ the process reaches the level $b$, and then at time $t$ drops to a lower level $l,\ l < b$. However, once at level $b$, the process could as well have taken the path reflected along the level $b$, which would have caused the process to end up at level $b + (b - l) = 2b - l$ at time $t$. We now observe that $2b - l > b$, meaning that corresponding to every path in the event $\{T_b < t,\ W(t) < b\}$, there is a path in the event $\{T_b < t,\ W(t) > b\}$, and so $P(T_b < t,\ W(t) < b)$ must be equal to $P(T_b < t,\ W(t) > b)$.
This is the famous reflection principle for Brownian motion. An analytic proof of the identity $P(T_b < t,\ W(t) < b) = P(T_b < t,\ W(t) > b)$ can be given by using the strong Markov property of Brownian motion.
Note that both parts (b) and (c) of the theorem are simply restatements of part (a). Many of the remaining parts follow from calculations that also use the reflection principle. Detailed proofs can be seen, for example, in Karlin and Taylor (1975, pp. 345–354). $\square$
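The reflection argument can be verified exactly in its discrete form. For the simple symmetric random walk the same reflection gives $P(\max_{k \le n} S_k \ge b) = 2P(S_n > b) + P(S_n = b)$; the exhaustive check below (with $n$ and $b$ chosen arbitrarily by us, $n + b$ even) confirms it:

```python
from itertools import product
from math import comb

n, b = 12, 4
# exhaustively enumerate all 2^n paths of the simple symmetric random walk
count = 0
for steps in product((-1, 1), repeat=n):
    s, m = 0, 0
    for x in steps:
        s += x
        if s > m:
            m = s
    if m >= b:
        count += 1
lhs = count / 2 ** n

# P(S_n > b) and P(S_n = b): S_n = 2j - n, where j is the number of +1 steps
p_gt = sum(comb(n, j) for j in range((n + b) // 2 + 1, n + 1)) / 2 ** n
p_eq = comb(n, (n + b) // 2) / 2 ** n
print(lhs, 2 * p_gt + p_eq)  # the two sides coincide
```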
u
Example 12.4 (Density of Last Zero Before $T$). Consider standard Brownian motion $W(t)$ on $[0, \infty)$ starting at zero and fix a time $T > 0$. We want to find the density of the last zero of $W(t)$ before the time $T$. Formally, let $\tau = \tau_T = \sup\{t < T : W(t) = 0\}$. Then, we want to find the density of $\tau$.
By using part (g) of the previous theorem,
$$P(\tau > s) = P(\text{there is at least one zero of } W \text{ in } (s, T)) = \frac{2}{\pi} \arccos\left( \sqrt{\frac{s}{T}} \right).$$
Theorem 12.5. Let $W(t),\ t \ge 0$ be a Brownian motion on $[0, \infty)$. Then,
(a) (Scaling). For $c > 0$, $X(t) = c^{-\frac{1}{2}} W(ct)$ is a Brownian motion on $[0, \infty)$.
(b) (Time Reciprocity). $X(t) = tW(\frac{1}{t})$, with the value being defined as zero at $t = 0$, is a Brownian motion on $[0, \infty)$.
(c) (Time Reversal). Given $0 < T < \infty$, $X_T(t) = W(T) - W(T - t)$ is a Brownian motion on $[0, T]$.
Proof. Only part (b) requires a proof, the others being obvious. First note that for $s \le t$, the covariance function is
$$\mathrm{Cov}\left( sW\left(\tfrac{1}{s}\right),\ tW\left(\tfrac{1}{t}\right) \right) = st \min\left( \tfrac{1}{s}, \tfrac{1}{t} \right) = st \cdot \tfrac{1}{t} = s = \min\{s, t\}.$$
It is obvious that $X(t) - X(s) \sim N(0, t - s)$. Next, for $s < t < u$, $\mathrm{Cov}(X(t) - X(s),\ X(u) - X(t)) = t - s - t + s = 0$, and the independent increments property holds. The sample paths are continuous (including at $t = 0$) because $W(t)$ has continuous sample paths, and $X(0) = 0$. Thus, all the defining properties of a Brownian motion are satisfied, and hence $X(t)$ must be a Brownian motion. $\square$
u
Proposition. With probability one, $\frac{W(t)}{t} \to 0$ as $t \to \infty$.
The behavior of Brownian motion near t D 0 is quite a bit more subtle, and
we postpone its discussion till later. We next describe a series of classic results that
illustrate the extremely rough nature of the paths of a Brownian motion. The results
essentially tell us that at any instant, it is nearly impossible to predict what a particle
performing a Brownian motion will do next. Here is a simple intuitive explanation
for why the paths of a Brownian motion are so rough.
Take two time instants $s, t,\ s < t$. We then have the simple moment formula $E[(W(t) - W(s))^2] = t - s$. Writing $t = s + h$, we get
$$E[(W(s+h) - W(s))^2] = h \iff E\left[ \left( \frac{W(s+h) - W(s)}{h} \right)^2 \right] = \frac{1}{h}.$$
If the time instants $s, t$ are close together, then $h \approx 0$, and so $\frac{1}{h}$ is large. We can see that the increment quotient $\frac{W(s+h) - W(s)}{h}$ is blowing up in magnitude. Thus, differentiability is going to be a problem. In fact, not only is the path of a Brownian motion guaranteed to be nondifferentiable at any prespecified $t$, it is guaranteed to be nondifferentiable simultaneously at all values of $t$. This is a much stronger roughness property than lack of differentiability at a fixed $t$.
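The blow-up of the difference quotient is easy to see numerically by sampling the $N(0, h)$ increment directly; a small sketch (sample size ours):

```python
import random
from math import sqrt

random.seed(5)

def mean_sq_quotient(h, trials=20000):
    # W(s + h) - W(s) ~ N(0, h), so E[((W(s+h) - W(s)) / h)^2] = 1/h
    return sum((random.gauss(0, sqrt(h)) / h) ** 2 for _ in range(trials)) / trials

for h in (0.1, 0.01, 0.001):
    print(h, round(mean_sq_quotient(h), 1))  # roughly 1/h: 10, 100, 1000
```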
The next theorem is regarded as one of the most classic ones in probability theory.
We first need a few definitions.
Definition 12.12. Let $f$ be a real-valued continuous function defined on some open subset $T$ of $\mathbb{R}$. The upper and the lower Dini right derivatives of $f$ at $t \in T$ are defined as
$$D^+ f(t) = \limsup_{h \downarrow 0} \frac{f(t+h) - f(t)}{h}, \qquad D_+ f(t) = \liminf_{h \downarrow 0} \frac{f(t+h) - f(t)}{h}.$$
(c) (Unbounded Variation). For every $T > 0$, with probability one, $W(t)$ has unbounded total variation as a function of $t$ on $[0, T]$.
(d) With probability one, there is no nonempty time interval on which $W(t)$ is monotone increasing or monotone decreasing.
(e) $P(\text{For all } t > 0,\ D^+ W(t) = \infty \text{ or } D_+ W(t) = -\infty \text{ or both}) = 1$.
(f) (Hölder Continuity). Given any finite $T > 0$ and $0 < \gamma < \frac{1}{2}$, with probability one, $W(t)$ is Hölder continuous on $[0, T]$ of order $\gamma$.
(g) For any $\gamma > \frac{1}{2}$, with probability one, $W(t)$ is nowhere Hölder continuous of order $\gamma$.
(h) (Uniform Continuity in Probability). Given any $\epsilon > 0$ and $0 < T < \infty$,
$$P\left( \sup_{0 \le t, s \le T,\ |t - s| < h} |W(t) - W(s)| > \epsilon \right) \to 0 \ \text{ as } h \to 0.$$
Proof. Each of parts (c) and (d) would follow from part (b), because of results in
real analysis that monotone functions or functions of bounded variation must be
differentiable almost everywhere. Part (e) is a stronger version of the nondifferen-
tiability result in part (b); see Karatzas and Shreve (1991, pp. 106–111) for parts
(e)–(h). Part (b) itself is proved in many standard texts on stochastic processes; the
proof involves quite a bit of calculation. We show here that part (a) is a consequence
of the reflection principle.
Clearly, it is enough to show that for any T > 0, P(sup_{t∈[0,T]} W(t) > 0) = 1.
This will imply that P(inf_{t∈[0,T]} W(t) < 0) = 1, because −W(t) is a Brownian
motion if W(t) is, and hence it will imply all the other statements in part (a). Fix
c > 0. Then,
Remark. It should be noted that the set of points at which the path of a Brownian
motion is Hölder continuous of order 1/2 is not empty, although in some sense such
points are rare.
The oscillation properties of the paths of a Brownian motion are further illus-
trated by the laws of the iterated logarithm for Brownian motion. The path of a
Brownian motion is a random function. Can we construct suitable deterministic
functions, say u.t/ and l.t/, such that for large t the Brownian path W .t/ will be
bounded by the envelopes l.t/; u.t/? What are the tightest such envelope functions?
Similar questions can be asked about small t. The law of the iterated logarithm an-
swers these questions precisely. However, it is important to note that in addition to
the intellectual aspect of just identifying the tightest envelopes, the iterated loga-
rithm laws have other applications.
Theorem 12.7 (LIL). Let f(t) = √(2t log |log t|), t > 0. With probability one,

(a) lim sup_{t→∞} W(t)/f(t) = 1; lim inf_{t→∞} W(t)/f(t) = −1.
(b) lim sup_{t→0} W(t)/f(t) = 1; lim inf_{t→0} W(t)/f(t) = −1.
Remark on Proof: Note that the lim inf statement in part (a) follows from the
lim sup statement because −W(t) is also a Brownian motion if W(t) is. On the
other hand, the two statements in part (b) follow from the corresponding statements
in part (a) by the time reciprocity property that tW(1/t) is also a Brownian motion
if W(t) is. For a proof of part (a), see Karatzas and Shreve (1991), or Bhattacharya
and Waymire (2007, p. 143).
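At any fixed large t, the ratio W(t)/f(t) is a mean-zero normal variable with standard deviation 1/√(2 log log t), which shrinks only very slowly; the LIL asserts that the extreme fluctuations of the ratio over all large t settle exactly at ±1. A minimal numerical sketch of the fixed-t statement (an illustration added here, not part of the text):

```python
import math, random

random.seed(2)
t = 1e6
f = math.sqrt(2 * t * math.log(math.log(t)))   # the LIL envelope f(t)

# across independent paths, W(t)/f(t) ~ N(0, 1/(2 log log t))
ratios = [random.gauss(0.0, math.sqrt(t)) / f for _ in range(50000)]
sd = math.sqrt(sum(r * r for r in ratios) / len(ratios))
print(round(sd, 3), round(1 / math.sqrt(2 * math.log(math.log(t))), 3))
```

Even at t = 10⁶ the marginal standard deviation of the ratio is only about 0.44, so a single time point says little; it is the accumulation of fluctuations over all large t that pushes the lim sup up to 1.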
12.3 Basic Distributional Properties 415
If the expected length is zero, then the length itself must be zero with probability
one. That C0 has no isolated points is entirely nontrivial to prove and we omit the
proof. Finally, by a result in real analysis that any closed set with no isolated points
must be uncountable unless it is empty, we have that C0 is an uncountable set.
Remark. The implication is that the set of times at which Brownian motion returns
to zero is a topologically large set marked by holes, and collectively the holes are big
enough that the zero set, although uncountable, has length zero. Such sets in one di-
mension are commonly called Cantor sets. Corresponding sets in higher dimensions
often go by the name fractals.
Remark. When Xd .t/ exits from U having started at a point inside U , it can exit
through different points on the boundary @U . If it exits at the point y 2 @U , then
g.Xd .// will equal g.y/. The exit point y is determined by chance. If we average
over y, then we get a function that is harmonic inside U and equals g on @U . We
omit the proof of this theorem, and refer the reader to Karatzas and Shreve (1991,
p. 244), and Körner (1986, p. 55).
u(z) = (|z| − r)/(R − r) for d = 1;

u(z) = (log ||z|| − log r)/(log R − log r) for d = 2;

u(z) = (r^(2−d) − ||z||^(2−d))/(r^(2−d) − R^(2−d)) for d > 2.
For now, let us consider the case d = 1. Fix positive numbers r, R and suppose a
one-dimensional Brownian motion starts at a number x between r and R, 0 < r <
x < R < ∞. Then the probability that it will hit the line z = R before hitting
the line z = r is (x − r)/(R − r). The closer the starting point x is to R, the larger is the
probability that it will first hit the line z = R. Furthermore, the probability is a very
simple linear function of x. We revisit the case d > 1 when we discuss recurrence and
transience of d-dimensional Brownian motion in the next section.
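The linear hitting probability is easy to check by simulation. The sketch below is an added illustration; the symmetric ±step walk is a standard discrete stand-in for Brownian motion, and since r, x, R are all multiples of the step size, the walk lands exactly on the two barriers:

```python
import random

random.seed(3)

def hits_R_first(x, r, R, step=0.05):
    # symmetric +-step random walk started at x, stopped at r or R
    while r < x < R:
        x += step if random.random() < 0.5 else -step
    return x >= R

r, R, x = 1.0, 2.0, 1.25
reps = 20000
p_hat = sum(hits_R_first(x, r, R) for _ in range(reps)) / reps
print(round(p_hat, 3), (x - r) / (R - r))  # both should be near 0.25
```

For a symmetric walk this barrier probability is exact (the gambler's ruin identity), which is why the estimate matches (x − r)/(R − r) so closely.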
We observed during our discussion of the lattice random walk (Chapter 11) that it
is recurrent in dimensions d D 1; 2 and transient for d > 2. That is, in one and two
dimensions the lattice random walk returns to any integer value x at least once (and
hence infinitely often) with probability one, but for d > 2, the probability that the
random walk returns at all to its starting point is less than one. For the Brownian
motion, when the dimension is more than one, the correct question is not to ask if
it returns to particular points x. The correct question to ask is if it returns to any
fixed neighborhood of a particular point, however small. The answers are similar
to the case of the lattice random walk; that is, in one dimension, Brownian motion
returns to any point x infinitely often with probability one, and in two dimensions,
Brownian motion returns to any given neighborhood of a point x infinitely often
with probability one. But when d > 2, it diverges off to infinity. We can see this by
using the connection of Brownian motion to the Dirichlet problem on discs. We first
need two definitions.
Definition 12.17. For d > 1, a d-dimensional stochastic process X_d(t), t ≥ 0 is
called neighborhood recurrent if with probability one, it returns to any given ball
B(x, ε) infinitely often.
Definition 12.18. For any d, a d-dimensional stochastic process X_d(t), t ≥ 0 is
called transient if with probability one, ||X_d(t)|| diverges to infinity.
We now show how the connection of the Brownian motion to the solution of the
Dirichlet problem will help us establish that Brownian motion is transient for d > 2.
That is, if we let B be the event that lim_{t→∞} ||W_d(t)|| ≠ ∞, then we show that
P(B) must be zero for d > 2. Indeed, to be specific, take d = 3, pick a point
12.5 The Local Time of Brownian Motion 419
x ∈ R³ with ||x|| > 1, suppose that our Brownian motion is now sitting at the point
x, and ask what the probability is that it will reach the unit ball B₁ before it reaches
the sphere ||z|| = R. Here R > ||x||. We have derived this probability. The Markov
property of Brownian motion gives this probability to be exactly equal to

(1/||x|| − 1/R)/(1 − 1/R).

This clearly converges to 1/||x|| as R → ∞. Imagine now that the process has evolved
for a long time, say T, and that it is now sitting at a very distant x (i.e., ||x|| is large).
The LIL for Brownian motion guarantees that we can pick such a large T and such
a distant x. Then, the probability of ever returning from x to the unit ball would
be the small number δ = 1/||x||. We can make δ arbitrarily small by choosing ||x||
sufficiently large, and what that means is that the probability of the process returning
infinitely often to the unit ball B₁ is zero. The same argument works for B_k, the ball
of radius k, for any k ≥ 1, and therefore, P(B) = P(∪_{k=1}^∞ B_k) = 0; that is, the
process drifts off to infinity with probability one. The same argument works for any
d > 2, not just d = 3. The case of d = 1, 2 is left as a chapter exercise. We then have
the following theorem.
Theorem 12.10. Brownian motion visits every real x infinitely often with prob-
ability one if d = 1, is neighborhood recurrent if d = 2, and is transient if d > 2.
Moreover, by its neighborhood recurrence for d = 2, the graph of a two-dimensional
Brownian path on [0, ∞) is dense in the two-dimensional plane.
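The escape argument above can be illustrated numerically: for d = 3 the probability of hitting the unit ball before the sphere of radius R increases to the eventual return probability 1/||x|| as R → ∞. A small check (added here for illustration; the starting distance is an arbitrary choice):

```python
x_norm = 3.0                 # ||x||, the starting distance from the origin
p = None
for R in [10, 100, 1000, 10000]:
    # P(hit the unit ball before the sphere ||z|| = R), d = 3
    p = (1 / x_norm - 1 / R) / (1 - 1 / R)
    print(R, round(p, 5))
print(round(1 / x_norm, 5))  # the limiting return probability 1/||x||
```

Since 1/||x|| can be made as small as desired by taking ||x|| large, the process started far away has only a small chance of ever coming back, which is the heart of the transience proof.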
For the simple symmetric random walk in one dimension, we derived the distribu-
tion of the local time ℓ(x, n), which is the total time the random walk spends at the
integer x up to the time instant n. It would not be interesting to ask exactly the same
question about Brownian motion, because the number of time points t up to some
time T at which the Brownian motion W(t) equals a given x is zero or infinity. Paul
Lévy gave the following definition for the local time of a Brownian motion. Fix a
set A in the real line and a general time instant T, T > 0. Now ask what is the total
size of the times t up to T at which the Brownian motion has resided in the given
set A. That is, denoting Lebesgue measure on R by λ, look at the following kernel
H(A, T) = λ({0 ≤ s ≤ T : W(s) ∈ A}).

Using this, Lévy formulated the local time of the Brownian motion at a given x as

ℓ(x, T) = lim_{ε↓0} H([x − ε, x + ε], T)/(2ε),

where the limit is supposed to mean a pointwise almost sure limit. It is important to
note that the existence of the almost sure limit is nontrivial.
Instead of the clumsy notation T, we eventually simply use the notation t, and
thereby obtain a new stochastic process ℓ(x, t), indexed simultaneously by two
parameters x and t. We can regard (x, t) together as a vector-valued time parameter,
and call ℓ(x, t) a random field. This is called the local time of one-dimensional
Brownian motion. The local time of Brownian motion is generally regarded to be an
analytically difficult process to study. We give a relatively elementary exposition of
the local time of Brownian motion in this section.
Recall now the previously introduced maximum process of standard Brownian
motion, namely M(t) = sup_{0≤s≤t} W(s). The following major theorem on the dis-
tribution of the local time of Brownian motion at zero was proved by Paul Lévy.
Theorem 12.11. Let W(s), s ≥ 0 be standard Brownian motion starting at zero.
Consider the two stochastic processes {ℓ(0, t), t ≥ 0} and {M(t), t ≥ 0}. These
two processes have the same distribution.
In particular, for any given fixed t and y > 0,

P(ℓ(0, t)/√t ≤ y) = √(2/π) ∫_0^y e^{−z²/2} dz = 2Φ(y) − 1

⟺ P(ℓ(0, t) ≤ y) = √(2/(πt)) ∫_0^y e^{−z²/(2t)} dz.
For a detailed proof of this theorem, we refer to Mörters and Peres (2010, p. 160).
A sketch of the proof can be seen in Révész (2005).
For a general level x, the corresponding result is as follows, and it follows from
the case x = 0 treated above.

Theorem 12.12. P(ℓ(x, t) ≤ y) = 2Φ((y + |x|)/√t) − 1, −∞ < x < ∞, t, y > 0.
It is important to note that if the level x ≠ 0, then the local time ℓ(x, t) can actually
be exactly equal to zero with a positive probability; this probability is simply
the probability that Brownian motion does not reach x within time t, and equals
2Φ(|x|/√t) − 1. This is not the case if the level is zero, in which case the local time
ℓ(0, t) possesses a density function.
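Lévy's occupation-measure definition also suggests a direct simulation check of Theorem 12.11. The sketch below is an added illustration with a crude discretization (the grid size, the window ε, and the tolerances are arbitrary choices); it approximates ℓ(0, 1) by the normalized occupation time of [−ε, ε] and compares P(ℓ(0, 1) ≤ y) with 2Φ(y) − 1:

```python
import math, random

random.seed(5)

def local_time_zero(t=1.0, dt=2e-4, eps=0.03):
    # approximate ell(0, t): occupation time of [-eps, eps] divided by 2*eps
    w, occ, sd = 0.0, 0.0, math.sqrt(dt)
    for _ in range(int(t / dt)):
        if abs(w) <= eps:
            occ += dt
        w += random.gauss(0.0, sd)
    return occ / (2 * eps)

def Phi(z):
    # standard normal CDF
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

y, reps = 0.5, 400
p_hat = sum(local_time_zero() <= y for _ in range(reps)) / reps
print(round(p_hat, 3), round(2 * Phi(y) - 1, 3))  # both should be near 0.38
```

The agreement is only approximate at this resolution, but it shows the content of Lévy's theorem: the local time at zero has the distribution of |W(t)|, hence of M(t).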
The theorem above also says that the local time of Brownian motion grows at
the rate of √t for any level x. The expected value follows easily by evaluating the
integral ∫_0^∞ [1 − P(ℓ(x, t) ≤ y)] dy, and one gets

E[ℓ(x, t)] = 2√t φ(|x|/√t) − 2|x| [1 − Φ(|x|/√t)],

where φ denotes the standard normal density. The limit of this as x → 0 equals
√(2t/π), which agrees with E[ℓ(0, t)]. The ex-
pected local time is plotted in Fig. 12.3.
12.6 Invariance Principle and Statistical Applications 421
We remarked in the first section of this chapter that scaled random walks mimic the
Brownian motion in a suitable asymptotic sense. As a matter of fact, if X1, X2, ...
is any iid sequence of one-dimensional random variables satisfying some relatively
simple conditions, then the sequence of partial sums S_n = Σ_{i=1}^n X_i, n ≥ 1, when
appropriately scaled, mimics Brownian motion in a suitable asymptotic sense. Why
is this useful? This is useful because in many concrete problems of probability and
statistics, suitable functionals of the sequence of partial sums arise as the objects
of direct importance. The invariance principle allows us to conclude that if the se-
quence of partial sums Sn mimics W .t/, then any nice functional of the sequence of
partial sums will also mimic the same functional of W .t/. So, if we can figure out
how to deal with the distribution of the needed functional of the W .t/ process, then
we can use it in practice to approximate the much more complicated distribution of
the original functional of the sequence of partial sums. It is a profoundly useful fact
in the asymptotic theory of probability that all of this is indeed a reality. This sec-
tion treats the invariance principle for the partial sum process of one-dimensional
iid random variables. We recommend Billingsley (1968), Hall and Heyde (1980),
and Csörgo and Révész (1981) for detailed and technical treatments; Erdös and Kac
(1946), Donsker (1951), Komlós et al. (1975, 1976), Major (1978), Whitt (1980),
and Csörgo and Hall (1984) for invariance principles for the partial sum process;
and Pyke (1984) and Csörgo (2002) for lucid reviews. Also, see DasGupta (2008)
for references to various significant extensions, such as the multidimensional and
dependent cases.
Although the invariance principle for partial sums of iid random variables is usu-
ally credited to Donsker (1951), Erdös and Kac (1946) contained the basic idea
behind the invariance principle and also worked out the asymptotic distribution of a
number of key and interesting functionals of the partial sum process. Donsker (1951)
provided the full generalization of the Erdös–Kac technique by providing explicit
embeddings of the discrete sequence S_k/√n, k = 1, 2, ..., n into a continuous-time
stochastic process S_n(t) and by establishing the limiting distribution of a general
continuous functional h(S_n(t)). In order to achieve this, it is necessary to use a
continuous mapping theorem for metric spaces, as consideration of Euclidean spaces
is no longer enough. It is also useful to exploit a property of the Brownian motion
known as the Skorohod embedding theorem. We first describe this necessary
background material.
Define

C[0, 1] = the class of all continuous real-valued functions on [0, 1], and
D[0, 1] = the class of all real-valued functions on [0, 1] that are right continuous
and have a left limit at every point in [0, 1].

Given two functions f, g in either C[0, 1] or D[0, 1], let ρ(f, g) = sup_{0≤t≤1}
|f(t) − g(t)| denote the supremum distance between f and g. We refer to ρ as the
uniform metric. Both C[0, 1] and D[0, 1] are (complete) metric spaces with respect
to the uniform metric ρ.
Suppose X1, X2, ... is an iid sequence of real-valued random variables with mean
zero and variance one. Two common embeddings of the discrete sequence S_k/√n, k =
1, 2, ..., n into a continuous-time process are the following:

S_{n,1}(t) = (1/√n)[S_{⌊nt⌋} + {nt} X_{⌊nt⌋+1}],

and

S_{n,2}(t) = (1/√n) S_{⌊nt⌋},

0 ≤ t ≤ 1. Here, ⌊·⌋ denotes the integer part and {·} the fractional part of a positive
real.
The first one simply continuously interpolates between the values S_k/√n by drawing
straight lines, but the second one is only right continuous, with jumps at the points
t = k/n, k = 1, 2, ..., n. For certain specific applications, the second embedding is
more useful. It is because of these jump discontinuities that Donsker needed to con-
sider weak convergence in D[0, 1]. It led to some additional technical complexities.
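The two embeddings are easy to code. The sketch below (an added illustration; n = 8 is chosen so that the grid points k/n are exact in floating point) builds both S_{n,1} and S_{n,2} and checks that they agree at the grid points:

```python
import math, random

random.seed(6)
n = 8                      # n = 8 keeps the grid points k/n exact in floating point
X = [random.gauss(0.0, 1.0) for _ in range(n)]
S = [0.0]
for x in X:
    S.append(S[-1] + x)    # S[k] = X_1 + ... + X_k, with S[0] = 0

def S_n1(t):
    # polygonal embedding: straight lines between the values S_k / sqrt(n)
    k = min(int(n * t), n - 1)
    frac = n * t - k
    return (S[k] + frac * (S[k + 1] - S[k])) / math.sqrt(n)

def S_n2(t):
    # step embedding: right continuous, with jumps at the points t = k/n
    return S[min(int(n * t), n)] / math.sqrt(n)

# the two embeddings agree at every grid point t = k/n
for k in range(n + 1):
    assert abs(S_n1(k / n) - S_n2(k / n)) < 1e-12
print(round(S_n1(0.5625), 4), round(S_n2(0.5625), 4))  # generally differ off the grid
```

Between grid points the polygonal version interpolates while the step version holds the last value, which is precisely the jump discontinuity that forces the D[0, 1] machinery.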
The main idea from this point on is not difficult. One can produce a version
of S_n(t), say S̃_n(t), such that S̃_n(t) is close to a sequence of Wiener processes
W_n(t). Because S̃_n(t) ≈ W_n(t), if h(·) is a continuous functional with respect to
the uniform metric, then one can expect that h(S̃_n(t)) ≈ h(W_n(t)) = h(W(t)) in
distribution. S̃_n(t) being a version of S_n(t), h(S_n(t)) = h(S̃_n(t)) in distribution,
and so h(S_n(t)) should be close to the fixed Brownian functional h(W(t)) in distri-
bution, which is the question we wanted to answer.
The results leading to Donsker’s theorem are presented below.
We prove the last theorem, assuming the Skorohod embedding theorem. A proof
of the Skorohod embedding theorem may be seen in Csörgo and Révész (1981), or
in Bhattacharya and Waymire (2007, p. 160).
Proof. We treat only the linearly interpolated process S_{n,1}(t), and simply call it
S_n(t). To reduce notational clutter, we write as if the version S̃_n of S_n is S_n itself.
Thus, the S̃_n notation is dropped in the proof of the theorem. Without loss of gener-
ality, we take E(X1) = 0 and Var(X1) = 1. First, by using the Skorohod embedding
theorem, construct a stopping time τ_1 with respect to the process W(t), t ≥ 0 such
that E(τ_1) = 1 and such that W(τ_1) has the same distribution as X1. Using the strong
Markov property of Brownian motion, W(t + τ_1) − W(τ_1) is also a Brownian motion
on [0, ∞), independent of (τ_1, W(τ_1)), and we can now pick a stopping time, say τ_2′,
with respect to this process, with the two properties E(τ_2′) = 1 and W(τ_2′) equal in
distribution to X2. Therefore, if we define τ_2 = τ_1 + τ_2′, then we have obtained a
stopping time with respect to the original Brownian motion, with the properties that
its expectation is 2, and τ_2 − τ_1 and τ_1 are independent. Proceeding in this way, we
can construct an infinite sequence of stopping times 0 = τ_0 ≤ τ_1 ≤ τ_2 ≤ τ_3 ≤ ···,
such that the differences τ_k − τ_{k−1} are iid with mean one, and the two discrete time
processes S_k and W(τ_k) have the same distribution.
Moreover, by the usual SLLN,

τ_n/n = (1/n) Σ_{k=1}^n [τ_k − τ_{k−1}] → 1 almost surely,

and

max_{0≤k≤n} |τ_k − k|/n → 0 in probability.
Set W_n(t) = W(nt)/√n, n ≥ 1. Therefore, in this notation, W(τ_k) = √n W_n(τ_k/n). Now
fix ε > 0 and consider the event B_n = {sup_{0≤t≤1} |S_n(t) − W_n(t)| > ε}. We need to
show that P(B_n) → 0.
Example 12.6 (CLT Follows from Invariance Principle). The central limit theorem
for iid random variables having a finite variance follows as a simple consequence of
Donsker’s invariance principle. Suppose X1 ; X2 ; : : : are iid random variables with
mean zero and variance 1. Let S_k = Σ_{i=1}^k X_i, k ≥ 1. Define the functional h(f) =
f(1) on C[0, 1]. This is obviously a continuous functional on C[0, 1] with respect to
the uniform metric ρ(f, g) = sup_{0≤t≤1} |f(t) − g(t)|. Therefore, with S_n(t) as the
linearly interpolated partial sum process, it follows from the invariance principle that

h(S_n) = S_n(1) = (Σ_{i=1}^n X_i)/√n ⇒ h(W) = W(1) ∼ N(0, 1),
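As a numerical companion to this example (added here, not part of the text), one can simulate S_n(1) for a decidedly non-normal step distribution and check that it behaves like N(0, 1):

```python
import math, random

random.seed(7)

def s_n_at_1(n=400):
    # S_n(1) = (X_1 + ... + X_n)/sqrt(n) for iid uniform(-sqrt(3), sqrt(3)) steps,
    # which have mean 0 and variance 1 but are far from normal individually
    return sum(random.uniform(-math.sqrt(3), math.sqrt(3)) for _ in range(n)) / math.sqrt(n)

draws = [s_n_at_1() for _ in range(5000)]
m = sum(draws) / len(draws)
v = sum((d - m) ** 2 for d in draws) / len(draws)
frac = sum(abs(d) <= 1.96 for d in draws) / len(draws)
print(round(m, 2), round(v, 2), round(frac, 3))  # near 0, 1, and 0.95
```

The evaluation functional f ↦ f(1) is the simplest continuous functional on C[0, 1]; richer functionals of the same process are handled in exactly the same way, which is the power of the invariance principle.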
Example 12.8 (Sums of Powers of Partial Sums). Consider once again iid random
variables X1, X2, ... with zero mean and a unit variance. Fix a positive integer m
and consider the statistic T_{m,n} = n^{−1−m/2} Σ_{k=1}^n S_k^m. By direct integration of the
polygonal curve [S_n(t)]^m, we find that T_{m,n} = ∫_0^1 [S_n(t)]^m dt. This guides us to
the functional h(f) = ∫_0^1 f^m(t) dt. Because [0, 1] is a compact interval, it is easy
to verify that h is a continuous functional on C[0, 1] with respect to the uniform
metric. Indeed, the continuity of h(f) follows from simply the algebraic identity
|x^m − y^m| = |x − y| |x^{m−1} + x^{m−2}y + ··· + y^{m−1}|. It therefore follows from
Donsker's invariance principle that T_{m,n} ⇒ ∫_0^1 W^m(t) dt. At first glance it seems
surprising that a nondegenerate limit distribution for partial sums of S_k^m can exist
with only two moments.
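For m = 2 one can check the convergence at the level of means: E[T_{2,n}] = n^{−2} Σ_{k=1}^n k = (n + 1)/(2n), while E[∫_0^1 W²(t) dt] = ∫_0^1 t dt = 1/2. A small simulation (an added illustration; the sample sizes are arbitrary):

```python
import random

random.seed(8)

def T2n(n=200):
    # T_{2,n} = n^(-2) * sum_{k=1}^n S_k^2 for iid +-1 steps (mean 0, variance 1)
    s, total = 0.0, 0.0
    for _ in range(n):
        s += 1.0 if random.random() < 0.5 else -1.0
        total += s * s
    return total / n ** 2

reps = 4000
mean_hat = sum(T2n() for _ in range(reps)) / reps
# E[T_{2,n}] = (n + 1)/(2n), close to 1/2 = E of the limit integral of W(t)^2
print(round(mean_hat, 3))
```

The full distribution of T_{2,n}, not just its mean, converges to that of ∫_0^1 W²(t) dt, even though the steps have only two finite moments.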
In addition to the weak invariance principle described above, there are also strong
invariance principles. The first strong invariance principle for partial sums was ob-
tained in Strassen (1964). Since then, a large literature has developed, including
for the multidimensional case. Good sources for information are Strassen (1967),
Komlós et al. (1976), Major (1978), Csörgo and Révész (1981), and Einmahl (1987).
It would be helpful to first understand exactly what a strong invariance principle
is meant to achieve. Suppose X1, X2, ... is a zero mean, unit variance iid sequence
of random variables. For n ≥ 1, let S_n denote the partial sum Σ_{i=1}^n X_i, and S_n(t)
the interpolated partial sum process with the special values S_n(k/n) = S_k/√n for each
n and 1 ≤ k ≤ n. In the process of proving Donsker's invariance principle, we
have shown that we can construct (on a common probability space) a process S̃_n(t)
(which is equivalent to the original process S_n(t) in distribution) and a single Wiener
process W(t) such that sup_{0≤t≤1} |S̃_n(t) − (1/√n) W(nt)| → 0 in probability. Therefore,

|S̃_n(1) − W(n)/√n| → 0 in probability

⟹ |S̃_n − W(n)|/√n → 0 in probability.
The strong invariance principle asks if we can find suitable functions g(n) such that
we can make the stronger statement |S̃_n − W(n)|/g(n) → 0 almost surely, and as a next step, what is the
best possible choice of such a function g.
The exact statements of the strong invariance principle results require us to say
that we can construct an equivalent process S̃_n(t) and a Wiener process W(t) on
some probability space such that |S̃_n − W(n)|/g(n) → 0 almost surely for some suitable function g. Due
to the clumsiness in repeatedly having to mention these qualifications, we drop the
S̃_n notation and simply say S_n(t), and we also do not mention that the processes
have all been constructed on some new probability space. The important thing for
applications is that we can use the approximations on the original process itself, by
simply adopting the equivalent process on the new probability space.
Paradoxically, the strong invariance principle does not imply the weak invariance
principle (i.e., Donsker's invariance principle) in general. This is because under the
assumption of just the finiteness of the variance of the X_i, the best possible g(n)
increases faster than √n. On the other hand, if the common distribution of the X_i
satisfies more stringent moment conditions, then we can make g(n) grow a lot more
slowly, and even more slowly than √n. The array of results that is available is bewildering, and
they are all difficult to prove. We prefer to report a few results of great importance,
including in particular the KMT theorem, due to Komlós et al. (1976).
Remark. The KMT theorem is widely regarded as one of the most major advances
in the area of invariance principles and central limit problems. One should note that
the inequality given in the above theorem has a qualitative nature attached to it,
as we can only use the inequality with constants C, K, λ that are known to exist,
depending on the underlying F. Refinements of the version of the inequality given
12.8 Brownian Motion with Drift and Ornstein–Uhlenbeck Process 427
above are available. We refer to Csörgo and Révész (1981) for such refinements and
a detailed general treatment of the strong invariance principle.
We finish with two special processes derived from standard Brownian motion. Both
are important in applications.
Definition 12.19. Let W(t), t ≥ 0 be standard Brownian motion starting at zero.
Fix μ ∈ R and σ > 0. Then the process X(t) = μt + σW(t), t ≥ 0 is called
Brownian motion with drift μ and diffusion coefficient σ². It is clear that it inherits
the major path properties of standard Brownian motion, such as nondifferentiability
at all t with probability one, the independent increments property, and the Markov
property. Also, clearly, for fixed t, X(t) ∼ N(μt, σ²t).
There are, however, also some important differences when a drift is introduced. For
example, unless μ = 0, the reflection principle no longer holds, and consequently
one cannot derive the distribution of the running maximum M(t) = sup_{0≤s≤t} X(s)
by using symmetry arguments. If μ ≥ 0, then it is not meaningful to ask for the
distribution of the maximum over all t > 0. However, if μ < 0, then the process
has a tendency to drift off towards negative values, and in that case the maximum in
fact does have a nontrivial distribution. We derive the distribution of the maximum
when μ < 0 by using a result on a particular first passage time of the process.
Theorem 12.17. Let X(t), t ≥ 0 be Brownian motion starting at zero, with
drift μ < 0 and diffusion coefficient σ². Fix a < 0 < b, and let
Example 12.9 (Probability That Brownian Motion Does Not Hit a Line). Consider
standard Brownian motion W(t) starting at zero on [0, ∞), and consider a straight
line L with the equation y = a + bt, a, b > 0. Because W(0) = 0, a > 0, and paths
of W(t) are continuous, the probability that W(t) does not hit the line L is the same
as P(W(t) < a + bt for all t > 0). However, if we define a new Brownian motion (with
drift) X(t) as X(t) = W(t) − bt, then

P(W(t) < a + bt for all t > 0) = P(sup_{t>0} X(t) < a) = 1 − e^{−2ab},

by our theorem above on the maximum of a Brownian motion with a negative drift.
We notice that the probability that W(t) does not hit L is monotone increasing in
each of a and b, as it should be.
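A crude simulation check of this formula (added here; a discrete time grid misses some crossings, so the hitting probability is slightly underestimated, and the horizon T is an arbitrary cutoff justified by the negative drift of X(t)):

```python
import math, random

random.seed(9)

def hits_line(a=1.0, b=1.0, T=20.0, dt=0.01):
    # does W(t) ever reach the line a + b*t on a discrete grid up to time T?
    w, sd, t = 0.0, math.sqrt(dt), 0.0
    while t < T:
        t += dt
        w += random.gauss(0.0, sd)
        if w >= a + b * t:
            return True
    return False

reps = 2000
p_hit = sum(hits_line() for _ in range(reps)) / reps
# theory: P(never hit) = 1 - exp(-2ab), about 0.865 for a = b = 1
print(round(1 - p_hit, 3), round(1 - math.exp(-2.0), 3))
```

After time T = 20 the process X(t) = W(t) − t sits far below the level a with overwhelming probability, so truncating the infinite horizon introduces only a negligible error here.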
If we consider Brownian motion starting at some number x, with drift μ < 0
and diffusion coefficient σ², then by simple calculations, the conditional distribution
of X(t) given that X(0) = x is N(x + μt, σ²t), which has the density

p_t(x, y) = 1/(σ√(2πt)) · e^{−(y − x − μt)²/(2σ²t)}.
This is called the transition density of the process. The transition density satisfies a
very special partial differential equation, which we now prove.
By direct differentiation,

∂/∂t p_t(x, y) = ((x − y)² − σ²t − μ²t²)/(2√(2π) σ³ t^{5/2}) · e^{−(y − x − μt)²/(2σ²t)};

∂/∂y p_t(x, y) = (x − y + μt)/(√(2π) σ³ t^{3/2}) · e^{−(y − x − μt)²/(2σ²t)};

∂²/∂y² p_t(x, y) = ((x − y + μt)² − σ²t)/(√(2π) σ⁵ t^{5/2}) · e^{−(y − x − μt)²/(2σ²t)}.
On using these three expressions, it follows that the transition density p_t(x, y) sat-
isfies the partial differential equation

∂/∂t p_t(x, y) = −μ ∂/∂y p_t(x, y) + (σ²/2) ∂²/∂y² p_t(x, y).

This is the drift-diffusion equation in one dimension. In the particular case that μ =
0 (no drift) and σ = 1, the equation reduces to the celebrated heat equation

∂/∂t p_t(x, y) = (1/2) ∂²/∂y² p_t(x, y).
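The drift-diffusion equation can be verified numerically by finite differences (an added illustration; the evaluation point (t, x, y) and the parameters μ = −0.5, σ = 1.2 are arbitrary choices):

```python
import math

mu, sigma = -0.5, 1.2

def p(t, x, y):
    # transition density of Brownian motion with drift mu, diffusion coefficient sigma^2
    return math.exp(-(y - x - mu * t) ** 2 / (2 * sigma ** 2 * t)) / (
        sigma * math.sqrt(2 * math.pi * t))

t, x, y, h = 1.0, 0.0, 0.3, 1e-4
dp_dt = (p(t + h, x, y) - p(t - h, x, y)) / (2 * h)
dp_dy = (p(t, x, y + h) - p(t, x, y - h)) / (2 * h)
d2p_dy2 = (p(t, x, y + h) - 2 * p(t, x, y) + p(t, x, y - h)) / h ** 2
rhs = -mu * dp_dy + 0.5 * sigma ** 2 * d2p_dy2
print(round(dp_dt, 6), round(rhs, 6))  # the two sides of the drift-diffusion equation
```

The two printed numbers agree to the accuracy of the finite-difference approximation, which is a quick sanity check on the three derivative formulas above.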
Returning to the drift-diffusion equation for the transition density in general, if we
now take a general function f(x, y) that is twice continuously differentiable in y
and is bounded in absolute value by Ke^{c|y|} for some finite K, c > 0, then integration by parts
in the drift-diffusion equation produces the following expectation identity, which we
state as a theorem.
Theorem 12.19. Let x, μ be any real numbers, and σ > 0. Suppose Y ∼ N(x +
μt, σ²t), and f(x, y) is twice continuously differentiable in y such that for some 0 <
K, c < ∞, |f(x, y)| ≤ Ke^{c|y|} for all y. Then,

∂/∂t E_x[f(x, Y)] = μ E_x[∂/∂y f(x, Y)] + (σ²/2) E_x[∂²/∂y² f(x, Y)].
This identity and a multidimensional version of it have been used in Brown et al.
(2006) to derive various results in statistical decision theory.
The covariance function of standard Brownian motion W(t) is Cov(W(s), W(t)) =
min(s, t). Therefore, if we scale by √t, and let X(t) = W(t)/√t, t > 0, we get that

Cov(X(s), X(t)) = √(min(s, t)/max(s, t)) = √(s/t), if s ≤ t.

Therefore, the covariance is a func-
tion of only the time lag on a logarithmic time scale. This motivates the definition
of the Ornstein–Uhlenbeck process as follows.
Definition 12.20. Let W(t) be standard Brownian motion starting at zero, and let
α > 0 be a fixed constant. Then X(t) = e^{−αt/2} W(e^{αt}), −∞ < t < ∞ is called
the Ornstein–Uhlenbeck process. The most general Ornstein–Uhlenbeck process is
defined as

X(t) = μ + (σ/√α) e^{−αt/2} W(e^{αt}), −∞ < μ < ∞, α, σ > 0.
mean, then the drift drags the process back towards μ, and if the present state of
the process is smaller than μ, then it does the reverse. The parameter α controls this
tendency to return to the grand mean. The third parameter, σ, controls the variability.
Theorem 12.20. Let X(t) be a general Ornstein–Uhlenbeck process. Then X(t)
is a stationary Gaussian process with E[X(t)] = μ and Cov(X(s), X(t)) =
(σ²/α) e^{−(α/2)|s−t|}.
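A short simulation check of Theorem 12.20 (an added illustration; it samples the pair (X(s), X(t)) directly from the defining representation, using the independent-increments property of W, with arbitrary parameter choices):

```python
import math, random

random.seed(11)
mu, sigma, alpha = 1.0, 1.5, 2.0
s, t = 0.3, 0.8
reps = 60000

def X_pair():
    # X(u) = mu + (sigma/sqrt(alpha)) * exp(-alpha*u/2) * W(exp(alpha*u));
    # W(e^{alpha t}) = W(e^{alpha s}) + independent N(0, e^{alpha t} - e^{alpha s})
    ws = random.gauss(0.0, math.sqrt(math.exp(alpha * s)))
    wt = ws + random.gauss(0.0, math.sqrt(math.exp(alpha * t) - math.exp(alpha * s)))
    c = sigma / math.sqrt(alpha)
    return (mu + c * math.exp(-alpha * s / 2) * ws,
            mu + c * math.exp(-alpha * t / 2) * wt)

pairs = [X_pair() for _ in range(reps)]
mx = sum(a for a, _ in pairs) / reps
my = sum(b for _, b in pairs) / reps
cov = sum((a - mx) * (b - my) for a, b in pairs) / reps
print(round(cov, 3), round(sigma ** 2 / alpha * math.exp(-alpha * (t - s) / 2), 3))
```

The empirical covariance depends on s and t only through t − s, which is the stationarity promised by the theorem.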
Exercises
Exercise 12.10. Let W(t), t ≥ 0 be standard Brownian motion on [0, ∞). Let T >
0 be a fixed finite time instant. Find the density of the first zero of W(t) after the
time t = T. Does it have a finite mean?
Exercise 12.20. Let W(t), t ≥ 0 be standard Brownian motion, and 0 < s < t <
u < ∞. Find the conditional expectation of W(t) given W(s) = x, W(u) = y.
Hint: Consider first the conditional expectation of W(t) given W(0) = W(1) = 0.
Exercise 12.30 (Median Local Time). Find the median of the local time ℓ(x, t) of
a standard Brownian motion on [0, ∞) starting at zero.
Caution: For x ≠ 0, the local time has a mixed distribution.
Exercise 12.31 (Monotonicity of the Mean Local Time). Give an analytical proof
that the expected value of the local time ℓ(x, t) of a standard Brownian motion
starting at zero is strictly decreasing in the spatial coordinate x.
invariance principle.
Hint: The terms in the quotient are jointly normal with zero means.
(c) sup_{t>0} W(t)/(a + bt), a, b > 0.
Exercise 12.38. Let X(t) be a general Ornstein–Uhlenbeck process and Y(t) =
∫_0^t X(u) du. Find the correlation between Y(s) and Y(t) for 0 < s < t < ∞, and
find its limit as α → ∞.
Exercise 12.39. Let W(t), t ≥ 0 be standard Brownian motion starting at zero, and
0 < s < t < ∞ two general times. Find an expression for P(W(t) > 0 | W(s) > 0),
and its limit when s is held fixed and t → ∞.
References
Bhattacharya, R.N. and Waymire, E. (2007). A Basic Course in Probability Theory, Springer,
New York.
Bhattacharya, R.N. and Waymire, E. (2009). Stochastic Processes with Applications, SIAM,
Philadelphia.
Billingsley, P. (1968). Convergence of Probability Measures, John Wiley, New York.
Breiman, L. (1992). Probability, Addison-Wesley, New York.
Brown, L. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary value prob-
lems, Ann. Math. Statist., 42, 855–903.
Brown, L., DasGupta, A., Haff, L.R., and Strawderman, W.E. (2006). The heat equation and Stein's
identity: Connections, applications, J. Statist. Plann. Inference, 136, 2254–2278.
Csörgo, M. (2002). A glimpse of the impact of Pal Erdös on probability and statistics, Canad. J.
Statist., 30, 4, 493–556.
Csörgo, M. and Révész, P. (1981). Strong Approximations in Probability and Statistics, Academic
Press, New York.
Csörgo, S. and Hall, P. (1984). The KMT approximations and their applications, Austr. J. Statist.,
26, 2, 189–218.
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability, Springer, New York.
Donsker, M. (1951). An invariance principle for certain probability limit theorems, Mem. Amer.
Math. Soc., 6.
Durrett, R. (2001). Essentials of Stochastic Processes, Springer, New York.
Einmahl, U. (1987). Strong invariance principles for partial sums of independent random vectors,
Ann. Prob., 15, 4, 1419–1440.
Erdös, P. and Kac, M. (1946). On certain limit theorems of the theory of probability, Bull. Amer.
Math. Soc., 52, 292–302.
13.1 Notation
The Poisson process is an example of a stochastic process with pure jumps indexed
by a running label t. We call t the time parameter, and for the purpose of our dis-
cussion here, t belongs to the infinite interval Œ0; 1/. For each t 0, there is a
nonnegative random variable X.t/, which counts how many events have occurred
up to and including time t. As we vary t, we can think of X.t/ as a function. It is a
random function, because each X.t/ is a random variable. Like all functions, X.t/
has a graph. The graph of X.t/ is called a path of X.t/. It is helpful to look at a
typical path of a Poisson process; Fig. 13.1 gives an example.
We notice that the path is a nondecreasing function of the time parameter t, and
that it increases by jumps of size one. The time instants at which these jumps occur
are called the renewal or arrival times of the process. Thus, we have an infinite se-
quence of arrival times, say Y1 ; Y2 ; Y3 ; : : : ; the first arrival occurs exactly at time
Y1 , the second arrival occurs at time Y2 , and so on. We define Y0 to be zero. The
gaps between the arrival times, namely, Y₁ − Y₀, Y₂ − Y₁, Y₃ − Y₂, ..., are called the
interarrival times. Writing Y_n − Y_{n−1} = T_n, we see that the interarrival times and
the arrival times are related by the simple identity

Y_n = (Y_n − Y_{n−1}) + (Y_{n−1} − Y_{n−2}) + ··· + (Y₂ − Y₁) + (Y₁ − Y₀) = T₁ + T₂ + ··· + T_n.
A special property of a Poisson process is that these interarrival times are indepen-
dent. So, for instance, if T3 , the time you had to wait between the second and the
third event, were large, then you would have no right to believe that T4 should be
small, because T3 and T4 are actually independent for a Poisson process.
The Poisson process can be arrived at in a number of ways. All of these apparently
different definitions are actually equivalent. Here are some equivalent definitions of
a Poisson process.
Definition # 1. One possibility is to start with the interarrival times, and assume
that they are iid exponential with some mean μ. Then the number of arrivals up to
and including time t is a Poisson process with a constant arrival rate λ = 1/μ.
Definition # 2. Or, we may assume that the number of arrivals in a general time
interval [t₁, t₂] is a Poisson variable with mean λ(t₂ − t₁) for some fixed positive
number λ, and that the numbers of arrivals over any finite collection of disjoint intervals
are mutually independent Poisson variables. This is equivalent to the first definition
given in the paragraph above.
Definition # 3. A third possibility is to use a neat result due to Alfred Rényi. Rényi
proved that if X(t) satisfies the Poisson property that the number of arrivals X(B)
within any set of times B, not necessarily a set of the form of an interval, is a
Poisson variable with mean λ|B|, where |B| denotes the Lebesgue measure of B,
then X(t) must be a Poisson process. Note that there is no mention of independence
over disjoint intervals in this definition. Independence falls out as a consequence
of Rényi's condition, and Rényi's result is also a perfectly correct definition of a
Poisson process; see Kingman (1993, p. 33).
Definition # 4. From the point of view of physical motivation and its original his-
tory, it is perhaps best to define a Poisson process in terms of some characteristic
physical properties of the process. In other words, if we make a certain number of
specific assumptions about how the process X.t/ behaves, then those assumptions
serve as a definition of a Poisson process. If you do not believe in one or more of
these assumptions in a particular problem, then the Poisson process is not the right
model for that problem. Here are the assumptions.
(a) X(0) = 0.
(b) The rate of arrival of the events remains constant over time, in the sense that there is
a finite positive number λ such that, for any t ≥ 0, conditions (i) and (ii) below hold.
(c) The numbers of events over nonoverlapping time intervals are independent;
that is, given disjoint time intervals [aᵢ, bᵢ], i = 1, 2, …, n,
the random variables X(bᵢ) − X(aᵢ), i = 1, 2, …, n are mutually independent.
(d) More than one event cannot occur at exactly the same time instant.
440 13 Poisson Processes and Applications
Precisely,
(i) As h → 0, P(X(t + h) = X(t)) = 1 − λh + o(h).
(ii) As h → 0, P(X(t + h) = X(t) + 1) = λh + o(h).
(iii) As h → 0, P(X(t + h) > X(t) + 1) = o(h).
The important point is that all of these definitions are equivalent. Depending on
taste, one may choose any of these as the definition of a homogeneous or stationary
Poisson process on the real line.
Starting with these assumptions in our Definition #4 about the physical behavior of
the process, one can use some simple differential equation methods and probability
calculations to establish various important properties of a Poisson process. We go
over some of the most important properties next.
Given k ≥ 0, let p_k(t) = P(X(t) = k), and let f_k(t) = e^{λt} p_k(t). By the total
probability formula, as h → 0,

$p_k(t + h) = p_k(t)(1 - \lambda h) + p_{k-1}(t)\lambda h + o(h),$

so that $p_k'(t) = \lambda[p_{k-1}(t) - p_k(t)]$ for k ≥ 1, t > 0, and when k = 0, $p_0'(t) = -\lambda p_0(t)$, t > 0. Because X(0) = 0, the
last equation immediately gives $p_0(t) = e^{-\lambda t}$. For k ≥ 1, the system of differential
equations $p_k'(t) = \lambda[p_{k-1}(t) - p_k(t)]$ is equivalent to $f_k'(t) = \lambda f_{k-1}(t)$, which solves recursively to

$f_k(t) = \frac{(\lambda t)^k}{k!}, \quad k \ge 1,\ t > 0$

$\Rightarrow\ p_k(t) = P(X(t) = k) = e^{-\lambda t}\,\frac{(\lambda t)^k}{k!}.$
This is true also for k D 0. So, we have proved the following theorem, which
accounts for the name Poisson process.
13.3 Important Properties and Uses as a Statistical Model 441
Theorem 13.1. If X(t), t ≥ 0 is a Poisson process with the constant arrival rate λ,
then for any t > 0, X(t) is distributed as a Poisson with mean λt. More generally,
the number of arrivals in an interval (s, t] is distributed as a Poisson with mean
λ(t − s).
Example 13.1 (A Medical Example). Suppose between the months of May and
October, you catch allergic rhinitis at the constant average rate of once in six weeks.
Assuming that the incidences follow a Poisson process, let us answer some simple
questions.
First, what is the expected total number of times that you will catch allergic
rhinitis between May and October in one year? Take the start of May 1 as t = 0,
and X(t) as the number of fresh incidences up to (and including) time t. Note that
time is being measured in some implicit unit, say weeks. Then, the arrival rate of
the Poisson process for X(t) is λ = 1/6. There are 24 weeks between May and
October, and X(24) is distributed as a Poisson with mean 24λ = 4, which is the
expected total number of times that you will catch allergic rhinitis between May and
October.
Next, what is the probability that you will catch allergic rhinitis at least once
before the start of August and at least once after the start of August? This is the
same as asking what is P(X(12) ≥ 1, X(24) − X(12) ≥ 1). By the property of
independence of X(12) and X(24) − X(12), each of which is distributed as a Poisson with mean 2, this probability equals

$(1 - e^{-2})(1 - e^{-2}) = (1 - e^{-2})^2 = .7476.$
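These Poisson probabilities are easy to verify numerically. The following is a quick sketch (the helper names and the use of Python are ours, not the text's):

```python
from math import exp, factorial

def poisson_pmf(k, mean):
    """P(N = k) for N ~ Poisson(mean)."""
    return exp(-mean) * mean ** k / factorial(k)

def poisson_sf(k, mean):
    """P(N >= k), via the complementary lower tail."""
    return 1.0 - sum(poisson_pmf(j, mean) for j in range(k))

lam = 1 / 6                    # one incidence per six weeks
mean_total = lam * 24          # X(24) ~ Poisson(4)

# Independent increments: each 12-week half-count is Poisson(2)
p_each_half = poisson_sf(1, lam * 12)
p_both = p_each_half ** 2

print(round(mean_total, 2), round(p_both, 4))
```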
A key property of the Poisson process is that the sequence of interarrival times
T₁, T₂, … is iid exponential with mean 1/λ. We do not rigorously prove this here, but
as the simplest illustration of how the exponential density enters into the picture, we
show that T₁ has the exponential density. Indeed, P(T₁ > h) = P(X(h) = 0) =
e^{−λh}, because X(h) has a Poisson distribution with mean λh. It follows that T₁ has
the density f_{T₁}(h) = λe^{−λh}, h > 0.
As a further illustration, consider the joint distribution of T1 and T2 . Here is
a very heuristic explanation for why T1 ; T2 are iid exponentials. Fix two positive
numbers t; u. The event fT1 > t; T2 > ug is just the event that the first arrival time
Y1 is at some time later than t, and counting from Y1 , no new further events occur
for another time interval of length u. But the intervals [0, t] and (Y₁, Y₁ + u) are
nonoverlapping if the first arrival occurs after t, and the probability of zero events
in both of these intervals would then factor as e^{−λt}e^{−λu}. This means P(T₁ > t) =
e^{−λt} and P(T₂ > u) = e^{−λu}, and that T₁, T₂ are iid exponentials with the same
density function.
Because the sum of iid exponentials has a Gamma distribution (see Chapter 4),
it also follows that for a Poisson process, the nth arrival time Yn has a Gamma
distribution. All of these are recorded in the following theorem.
Theorem 13.2. Let X(t), t ≥ 0 be a Poisson process with constant arrival rate λ.
Then,
(a) T₁, T₂, … are iid exponential with the density function f_{Tᵢ}(t) = λe^{−λt}, t > 0.
(b) Let n ≥ 1. Then Yₙ has the Gamma density $f_{Y_n}(y) = \frac{\lambda^n e^{-\lambda y}\, y^{n-1}}{(n-1)!},\ y > 0.$
See Kingman (1993, p. 39) for a rigorous proof of this key theorem.
Example 13.2 (Geiger Counter). Geiger counters, named after Hans Geiger, are
used to detect radiation-emitting particles, such as beta and alpha particles, or
low-energy gamma rays. The counter does so by recording a current pulse when
a radioactive particle or ray hits the counter. Poisson processes are standard models
for counting particle hits on the counter.
Suppose radioactive particles hit such a Geiger counter at the constant average
rate of one hit per 30 seconds. Therefore, the arrival rate of our Poisson process
is λ = 1/30. Let Y₁, Y₂, …, Yₙ be the times of the first n hits on the counter, and
T₁, T₂, …, Tₙ the time gaps between the successive hits. We ask a number of questions
about these arrival and interarrival times in this example.
First, by our previous theorem, Yₙ has the Gamma distribution with density

$f_{Y_n}(y) = \frac{e^{-y/30}\, y^{n-1}}{(30)^n (n-1)!}, \quad y > 0.$
Suppose we want to calculate the probability that the hundredth hit on the Geiger
counter occurs within the first hour. This is equal to P(Y₁₀₀ ≤ 3600), there being
60 × 60 = 3600 seconds in an hour. We can try to integrate the Gamma density
for Yₙ and evaluate P(Y₁₀₀ ≤ 3600). This will require a computer. On the other
hand, the needed probability is also equal to P(X(3600) ≥ 100), where X(3600) ∼
Poi(120), because with t = 3600, λt = 3600/30 = 120. However, this calculation is
also clumsy, because the Poisson mean is such a large number. We can calculate the
probability approximately by using the central limit theorem. Indeed, because Y₁₀₀ is the sum of 100 iid exponentials, each with mean 30 and variance 900,

$P(Y_{100} \le 3600) \approx P\left(Z \le \frac{3600 - 100 \times 30}{\sqrt{100 \times 900}}\right) = P(Z \le 2) = .9772.$
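Both routes, the exact Poisson tail and the central limit approximation, can be compared with a few lines of code (a sketch; the function names are ours):

```python
from math import exp, sqrt, erf

def poisson_cdf(k, mean):
    """P(N <= k) for N ~ Poisson(mean); stable for moderate means."""
    term, total = exp(-mean), 0.0
    for j in range(k + 1):
        total += term
        term *= mean / (j + 1)
    return total

def std_normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

# Exact: P(Y_100 <= 3600) = P(X(3600) >= 100) with X(3600) ~ Poi(120)
exact = 1 - poisson_cdf(99, 120)

# CLT: Y_100 is a sum of 100 iid exponentials with mean 30, variance 900
approx = std_normal_cdf((3600 - 100 * 30) / sqrt(100 * 900))

print(exact, approx)   # the approximation is close to the exact tail
```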
Example 13.3 (Poisson Process and the Beta Distribution). It was mentioned in the
chapter introduction that the Poisson process has a connection to the Beta distribu-
tion. We show this connection in this example.
Suppose customers stream into a drugstore at the constant average rate of 15 per
hour. The pharmacy opens its doors at 8:00 AM and closes at 8:00 PM. Given that
the hundredth customer on a particular day walked in at 2:00 PM, we want to know
what is the probability that the fiftieth customer came before noon. Write n for 100
and m for 50, and let Y_j be the arrival time of the j-th customer on that day. Then,
measuring time in hours from 8:00 AM, we are told that Yₙ = 6, and we want to calculate P(Yₘ < 4 | Yₙ = 6).
We can attack this problem more generally by drawing a connection of a Poisson
process to the Beta distribution. For this, we recall the result from Chapter 4 that if
X, Y are independent positive random variables, with densities G(α, 1) and G(β, 1), respectively,
then U = X/(X + Y) and V = X + Y are independent, and U has the Beta density

$f_U(u) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, u^{\alpha - 1}(1 - u)^{\beta - 1}, \quad 0 < u < 1.$

Apply this result with X = Yₘ and Y = Yₙ − Yₘ, which are independent Gamma variables with parameters m and n − m. Then U = Yₘ/Yₙ has the Beta density

$f_U(u) = \frac{\Gamma(n)}{\Gamma(m)\Gamma(n - m)}\, u^{m-1}(1 - u)^{n-m-1}, \quad 0 < u < 1,$

and therefore

$P(Y_m < 4 \,|\, Y_n = 6) = P\left(U < \frac{4}{6}\right)$ (inasmuch as U = Yₘ/Yₙ and Yₙ are independent)

$= \int_0^{2/3} f_U(u)\, du = \frac{\Gamma(100)}{[\Gamma(50)]^2} \int_0^{2/3} u^{49}(1 - u)^{49}\, du = .9997.$
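For integer parameters the Beta probability can be checked without numerical integration, using the standard identity P(Beta(m, k) ≤ x) = P(Bin(m + k − 1, x) ≥ m); the identity and the helper below are our addition, not part of the text:

```python
from math import comb

def beta_cdf_int(m, k, x):
    """P(U <= x) for U ~ Beta(m, k), integer m, k, via the binomial identity."""
    n = m + k - 1
    return sum(comb(n, j) * x ** j * (1 - x) ** (n - j) for j in range(m, n + 1))

# Given Y_100 = 6, U = Y_50 / Y_100 ~ Beta(50, 50); want P(U < 4/6)
p = beta_cdf_int(50, 50, 2 / 3)
print(p)   # agrees with the value .9997 quoted above to about three decimals
```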
Example 13.4 (Filtered Poisson Process). Suppose the webpage of a certain text is
hit by a visitor according to a Poisson process X.t/ with an average rate of 1.5 per
day. However, 30% of the time, the person visiting the page does not purchase the
book. We assume that customers make their purchase decisions independently, and
independently of the X.t/ process. Let Y .t/ denote the number of copies of the book
sold up to and including the tth day. We assume, by virtue of the Poisson process
assumption, that more than one hit is not made at exactly the same time, and that a
visitor does not purchase more than one book. What kind of a process is the process
Y .t/; t 0?
We can imagine a sequence of iid Bernoulli variables U₁, U₂, …, where Uᵢ = 1
if the i-th visitor to the webpage actually purchases the book. Each Uᵢ is a Bernoulli
with parameter p = .7. Also let X(t) denote the number of hits made on the page
up to time t. Then, $Y(t) = \sum_{i=1}^{X(t)} U_i$, where the sum over an empty set is defined,
as usual, to be zero. Then,

$P(Y(t) = k) = \sum_{x=k}^{\infty} P(Y(t) = k \,|\, X(t) = x)\, P(X(t) = x)$

$= \sum_{x=k}^{\infty} \binom{x}{k} p^k (1-p)^{x-k}\, \frac{e^{-\lambda t}(\lambda t)^x}{x!} = e^{-\lambda t} \left(\frac{p}{1-p}\right)^k \frac{1}{k!} \sum_{x=k}^{\infty} \frac{(\lambda t (1-p))^x}{(x-k)!}$

$= e^{-\lambda t} \left(\frac{p}{1-p}\right)^k \frac{(\lambda t (1-p))^k\, e^{\lambda t (1-p)}}{k!} = e^{-p\lambda t}\, \frac{(p\lambda t)^k}{k!}.$

Therefore, for each t, Y(t) is also a Poisson random variable, but the mean has
changed to pλt.
This is not enough to prove that Y(t) is also a Poisson process. One needs
to show, in addition, the time homogeneity property, namely, regardless of
s, Y(t + s) − Y(s) is a Poisson variable with mean pλt, and the independent
increments property; that is, over disjoint intervals (aᵢ, bᵢ], the Y(bᵢ) − Y(aᵢ) are
independent Poisson variables. Verification of these is just a calculation, and is left as
an exercise.
To summarize, a filtered Poisson process is also a Poisson process.
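The pmf computation above can be replicated numerically by truncating the series; this sketch (our naming and truncation point) checks that the filtered counts match Poisson(pλt):

```python
from math import comb, exp, factorial

lam, p, t = 1.5, 0.7, 3.0      # hit rate per day, purchase probability, horizon

def thinned_pmf(k, terms=150):
    """P(Y(t) = k) by conditioning on X(t) = x and summing the series."""
    return sum(
        comb(x, k) * p ** k * (1 - p) ** (x - k)
        * exp(-lam * t) * (lam * t) ** x / factorial(x)
        for x in range(k, terms)
    )

def poisson_pmf(k, mean):
    return exp(-mean) * mean ** k / factorial(k)

for k in range(10):
    assert abs(thinned_pmf(k) - poisson_pmf(k, p * lam * t)) < 1e-10
print("filtered counts match Poisson(p*lambda*t)")
```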
Example 13.5 (Inspection Paradox). Suppose that buses arrive at a certain bus stop
according to a Poisson process with some constant average rate λ, say once in 30
minutes. Thus, the average gap between any two arrivals of the bus is 30 minutes.
Suppose now that out of habit, you always arrive at the bus stop at some fixed time
t, say 5:00 PM. The term inspection paradox refers to the mathematical fact that
the average time gap between the last arrival of a bus before 5:00 PM and the first
arrival of a bus after 5:00 PM is larger than 30 minutes. It is as if, by simply showing
up at your bus stop at a fixed time, you can make the buses tardier than they are on
average! We derive this peculiar mathematical fact in this example.
We need some notation. Given a fixed time t, let δₜ denote the time that has
elapsed since the last event; in symbols, δₜ = t − Y_{X(t)}. Also let γₜ denote the time
until the next event; in symbols, γₜ = Y_{X(t)+1} − t. The functions δₜ and γₜ are
commonly known as current life and residual life in applications. We then have
P(δₜ > h) = e^{−λh} for 0 ≤ h < t. It is important that we note that P(δₜ = t) = e^{−λt}, and that for h ≥ t, P(δₜ > h) = 0.
Thus, the function P(δₜ > h) has a discontinuity at h = t. Likewise, P(γₜ > h) = e^{−λh} for all h > 0, so that γₜ is exponential with mean 1/λ.
Therefore,

$E[\delta_t + \gamma_t] = \int_0^t h\,(\lambda e^{-\lambda h})\, dh + t e^{-\lambda t} + \frac{1}{\lambda} = \frac{2}{\lambda} - \frac{e^{-\lambda t}}{\lambda},$

which exceeds 1/λ, the mean interarrival time, for every t > 0.
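A small simulation makes the paradox tangible. Below is a sketch with parameters of our choosing (λ = 1/30 per minute, inspection time t = 300); the average straddling gap comes out near 2/λ = 60 minutes, twice the average interarrival time:

```python
import random

random.seed(7)
lam = 1 / 30           # one bus per 30 minutes
t = 300.0              # fixed inspection time, in minutes

gaps = []
for _ in range(4000):
    prev, s = 0.0, random.expovariate(lam)
    while s <= t:                 # run the arrivals past time t
        prev = s
        s += random.expovariate(lam)
    gaps.append(s - prev)         # gap straddling the inspection time

mean_gap = sum(gaps) / len(gaps)
print(mean_gap)   # near 60, not 30
```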
(because it has been assumed that W₁, W₂, … are independent of the X(t) process)

$= e^{-\lambda t} \sum_{x=0}^{\infty} E\left[\prod_{i=1}^{x} s^{W_i}\right] \frac{(\lambda t)^x}{x!} = e^{-\lambda t} \sum_{x=0}^{\infty} \frac{(\lambda t)^x}{x!}\, [G_W(s)]^x = e^{-\lambda t}\, e^{\lambda t G_W(s)},$
where, in the last two lines we have used our assumption that W1 ; W2 ; : : : are iid. We
have thus derived a fully closed-form formula for the generating function of Y .t/.
We can, in principle, derive P .Y .t/ D k/ for any k from this formula. We can also
derive the mean and the variance, for which we need the first two derivatives of this
generating function. The first two derivatives are

$G_Y'(s) = \lambda t\, G_W'(s)\, e^{-\lambda t} e^{\lambda t G_W(s)}$

and

$G_Y''(s) = e^{-\lambda t} e^{\lambda t G_W(s)} \left( [\lambda t\, G_W'(s)]^2 + \lambda t\, G_W''(s) \right).$

If the Wᵢ have a finite mean and variance, then we know from Chapter 1 that
E(W) = G_W'(1) and E[W(W − 1)] = G_W''(1). Plugging these into our latest
expressions for G_Y'(s) and G_Y''(s), and some algebra, we find
Proof. Recall the notation that the arrival times are denoted as Y₁, Y₂, …, and the
interarrival times by T₁, T₂, …, so that Y₁ = T₁, Y₂ = T₁ + T₂, Y₃ =
T₁ + T₂ + T₃, …. Fix 0 < u₁ < u₂ < ⋯ < uₙ < t. We show that

$\frac{\partial^n}{\partial u_1\, \partial u_2 \cdots \partial u_n}\, P(Y_1 \le u_1, Y_2 \le u_2, \ldots, Y_n \le u_n \,|\, X(t) = n) = \frac{n!}{t^n},$
0 < u₁ < u₂ < ⋯ < uₙ < t. This completes our proof because the mixed partial
derivative of the joint CDF gives the joint density in general (see Chapter 3), and the
function of n variables u₁, u₂, …, uₙ given by n!/tⁿ on 0 < u₁ < u₂ < ⋯ < uₙ < t
indeed is the joint density of the order statistics of n iid U[0, t] variables.
For ease of presentation, we show the proof for n = 2; the proof for a general n is
exactly the same. Assume then that n = 2. The key fact to use is that the interarrival
times T₁, T₂, T₃ are iid exponential. Therefore, the joint density of T₁, T₂, T₃ is
λ³e^{−λ(t₁+t₂+t₃)}, t₁, t₂, t₃ > 0. Now make the linear transformation Y₁ = T₁, Y₂ =
T₁ + T₂, Y₃ = T₁ + T₂ + T₃. This is a one-to-one transformation with a Jacobian
equal to one. Therefore, by the Jacobian method (see Chapter 4), the joint density
of Y₁, Y₂, Y₃ is

$f_{Y_1, Y_2, Y_3}(y_1, y_2, y_3) = \lambda^3 e^{-\lambda y_3}, \quad 0 < y_1 < y_2 < y_3.$

Therefore,

$P(Y_1 \le u_1, Y_2 \le u_2 \,|\, X(t) = 2) = \frac{P(Y_1 \le u_1, Y_2 \le u_2, Y_3 > t)}{P(X(t) = 2)}$

$= \frac{\int_0^{u_1} \int_{y_1}^{u_2} \int_t^{\infty} \lambda^3 e^{-\lambda y_3}\, dy_3\, dy_2\, dy_1}{e^{-\lambda t}\, \frac{(\lambda t)^2}{2!}} = \frac{e^{-\lambda t} \lambda^2 \left(u_1 u_2 - \frac{u_1^2}{2}\right)}{e^{-\lambda t}\, \frac{(\lambda t)^2}{2}}.$

Therefore,

$\frac{\partial^2}{\partial u_1\, \partial u_2}\, P(Y_1 \le u_1, Y_2 \le u_2 \,|\, X(t) = 2) = \frac{2}{t^2},$
U[0, 1] variables. Hence, given that X(t) = 9, Y₅, the time of the fifth arrival, has
the Beta(5, 5) density. From the symmetry of the Beta(5, 5) density, it immediately
follows that P(Y₅ ≤ .5 | X(t) = 9) = .5. So, there is a 50% probability that the fifth
e-mail arrived before 12:30.
Next, what is the probability that at least three e-mails arrived after 12:45? This
probability is the same as P(Y₇ > .75 | X(t) = 9). Once again, this follows from
a Beta distribution calculation, because given that X(t) = 9, Y₇ has the Beta(7, 3)
density. Hence, the required probability is
$\int_{.75}^{1} \frac{\Gamma(10)}{\Gamma(7)\Gamma(3)}\, x^6 (1 - x)^2\, dx = .3993.$
Finally, what is the expected gap between the times that the third and the seventh
e-mail arrived? That is, what is E(Y₇ − Y₃ | X(t) = 9)? Because E(Y₇ | X(t) = 9) =
7/10 = .7, and E(Y₃ | X(t) = 9) = 3/10 = .3, we have that E(Y₇ − Y₃ | X(t) = 9) =
.7 − .3 = .4. Hence the expected gap between the arrival of the third and the seventh
e-mail is 24 minutes.
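The Beta(7, 3) tail probability is a polynomial integral, so it can be checked by exact antidifferentiation (the helper below is our sketch):

```python
def beta73_tail(a):
    """P(U > a) for U ~ Beta(7, 3): integrand 252 * x^6 * (1 - x)^2."""
    F = lambda x: x ** 7 / 7 - 2 * x ** 8 / 8 + x ** 9 / 9  # antiderivative of x^6(1-x)^2
    return 252 * (F(1.0) - F(a))

p = beta73_tail(0.75)
print(round(p, 4))   # 0.3993, the value computed in the text
```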
The model in the compound Poisson process example in the previous section can be
regarded as a model for displacement of a particle in a medium subject to collisions
with other particles or molecules of the surrounding medium. Recall that this was
essentially Einstein’s derivation of Brownian motion for the movement of a physical
particle in a fluid or gaseous medium.
How does the model of the compound Poisson process example become useful
in such a context, and what exactly is the connection to Brownian motion? Suppose
that a particle immersed in a medium experiences random collisions with other particles
or the medium's molecules at random times, which are the event times of a
homogeneous Poisson process X(t) with constant rate λ. Assume moreover that at
each collision, our particle moves a distance of a units linearly to the right, or a
units linearly to the left, with equal probability. In other words, the sequence of
displacements caused by the successive collisions forms an iid sequence of random
variables W₁, W₂, …, with P(Wᵢ = ±a) = 1/2. Then, the total displacement of the
particle up to a given time t equals $W(t) = \sum_{i=1}^{X(t)} W_i$, with the empty sum W(0)
being zero.
being zero. If we increase the rate of collisions and simultaneously decrease the
displacement a caused by a single collision in just the right way, then the W .t/ pro-
cess is approximately a one-dimensional Brownian motion. That is the connection
to Brownian motion in this physical model for how a particle moves when immersed
in a medium.
We provide an explanation for the emergence of Brownian motion in such a
model. First, recall from our calculations for the compound Poisson example that
13.4 Linear Poisson Process and Brownian Motion: A Connection 449
the mean of W(t) as defined above is zero for all t, and Var(W(t)) = E[W(t)²] =
λa²t. Therefore, to have any chance of approximating the W(t) process by a Brownian
motion, we should let λa² converge to some finite constant σ². We now look at
the characteristic function of W(t) for any fixed t, and show that if a → 0, λ → ∞
in such a way that λa² → σ², then the characteristic function of W(t) itself converges
to the characteristic function of the N(0, σ²t) distribution. This is one step
towards showing that the W(t) process mimics a Brownian motion with zero drift
and diffusion coefficient σ if λa² → σ².
The characteristic function calculation is exactly similar to the generating function
calculation that we did in our compound Poisson process example. Indeed, the
characteristic function of W(t) equals ψ_{W(t)}(s) = e^{λt[ψ_W(s) − 1]}, where ψ_W(s) denotes
the common characteristic function of W₁, W₂, …, the sequence of displacements
of the particle. Therefore, by a Taylor expansion of ψ_W(s) (see Chapter 8),

$\psi_{W(t)}(s) = e^{\lambda t \left[1 + i s E(W_1) - \frac{s^2}{2} E(W_1^2) - 1 + O(|s|^3 E(|W_1|^3))\right]}.$

Because E(W₁) = 0, E(W₁²) = a², and E(|W₁|³) = a³, the exponent equals −λa²ts²/2 + O(λa³t|s|³), which converges to −σ²ts²/2 when a → 0 and λa² → σ².
$E\left[e^{i u [W(t) - W(s)]}\right] = \sum_{d=0}^{\infty} E\left[e^{i u [W(t) - W(s)]} \,\big|\, D = d\right] \frac{e^{-\lambda (t-s)} (\lambda (t-s))^d}{d!},$

on using the facts that conditional on D = d, where D = X(t) − X(s) is the number of events in (s, t], W(t) − W(s) is the sum of d iid
variables, which are jointly independent of the X(t) process, and then on some
algebra. This shows that the increments W(t) − W(s) are stationary.
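One can watch this limit emerge in simulation. The sketch below uses parameters of our choosing (λ = 1000, a = λ^(−1/2), so λa² = σ² = 1) and checks that W(t) has mean near 0 and variance near σ²t:

```python
import random

random.seed(1)
lam, t = 1000.0, 2.0
a = lam ** -0.5                 # so that lam * a**2 = 1 = sigma^2

def displacement():
    """W(t): +-a displacement at each event of a rate-lam Poisson process."""
    s, w = random.expovariate(lam), 0.0
    while s <= t:
        w += a if random.random() < 0.5 else -a
        s += random.expovariate(lam)
    return w

reps = 600
samples = [displacement() for _ in range(reps)]
mean_w = sum(samples) / reps
var_w = sum(x * x for x in samples) / reps   # E W(t) = 0, so this estimates Var

print(mean_w, var_w)   # near 0 and near sigma^2 * t = 2
```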
We remarked in the introduction that Poisson processes are also quite commonly
used to model a random scatter of points in a planar area or in space, such as trees
in a forest, or galaxies in the universe. In the case of the one-dimensional Poisson
process, there was a random sequence of points too, say Π, and these were
just the arrival times of the events. We then considered how many elements of Π
belonged to a test set, such as an interval [s, t]. The number of elements in any
particular interval was Poisson with mean depending on the length of the interval,
and the numbers of elements in disjoint intervals were independent. In higher dimensions,
say R^d, we similarly have a random countable set of points Π of R^d, and
we consider how many elements of Π belong to a suitable test set, for example, a
d-dimensional rectangle A = ∏_{i=1}^d [aᵢ, bᵢ]. Poisson processes would now have the
properties that the counts over disjoint test sets would be independent, and the count
of any particular test set A, namely N(A) = #{x ∈ Π : x ∈ A}, will have a Poisson
distribution with some appropriate mean depending on the size of A.
Remark. If λ(x) ≡ λ > 0, then μ(A) = λ · vol(A), and in that case the Poisson
process is called stationary or homogeneous with the constant intensity λ. Notice
that this coincides with the definition of a homogeneous Poisson process in one
dimension that was given in the previous sections, because in one dimension volume
would simply correspond to length.
According to the definition, given a set A ⊆ R^d, N(A) ∼ Poi(μ(A)), and therefore,
the probability that a given set A contains no points of the random countable
set Π is P(N(A) = 0) = e^{−μ(A)}. It is clear, therefore, that if we know just these
void probabilities P(N(A) = 0), then the intensity measure and the distribution of
the full process are completely determined. Indeed, one could use this as a definition
of a d-dimensional Poisson process.
Remark. This definition is saying that if the scientist understands the void probabil-
ities, then she understands the entire Poisson process.
It is not obvious that a Poisson process with a given intensity measure exists.
This requires a careful proof, within the rigorous measure-theoretic paradigm of
probability. We only state the existence theorem here. A proof may be seen in many
13.5 Higher-Dimensional Poisson Point Processes 451
places, for instance, Kingman (1993). We do need some restrictions on the intensity
measure for an existence result. The restrictions we impose are not the weak-
est possible, but we choose simplicity of the existence theorem over the greatest
generality.
Example 13.8 (Distance to the Nearest Event Site). Consider a Poisson process Π
in R^d with intensity measure μ, and fix a point x̃ ∈ R^d. We are interested in
D = D(x̃, Π), the Euclidean distance from x̃ to the nearest element of the Poisson
process Π. The survival function of D is easily calculated. Indeed, if B(x̃, r) is the
ball of radius r centered at x̃, then

$P(D > r) = P(N(B(\tilde{x}, r)) = 0) = e^{-\mu(B(\tilde{x}, r))}.$

In particular, if the process is homogeneous with constant intensity λ, then μ(B(x̃, r)) = λπ^{d/2}r^d/Γ(d/2 + 1), the intensity times the volume of the ball. Or, equivalently, $\frac{\lambda \pi^{d/2} D^d}{\Gamma(\frac{d}{2} + 1)}$ has a standard exponential density. In the special case
of two dimensions, this means that λπD² has a standard exponential distribution.
This result is sometimes used to statistically test whether a random countable set in
some specific application is a homogeneous Poisson process.
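Here is a simulation sketch of that test in two dimensions (window size, intensity, and seed are our choices); the scaled squared distance λπD² should average close to 1 if the scatter is homogeneous Poisson:

```python
import math
import random

random.seed(3)
lam, L = 5.0, 4.0                    # intensity; window [-L, L]^2

def poisson(mean):
    """Poisson variate by counting unit-exponential arrivals up to `mean`."""
    n, s = 0, random.expovariate(1.0)
    while s <= mean:
        n += 1
        s += random.expovariate(1.0)
    return n

def nearest_dist_sq():
    n = poisson(lam * (2 * L) ** 2)  # point count in the window
    return min(
        random.uniform(-L, L) ** 2 + random.uniform(-L, L) ** 2
        for _ in range(n)
    )

reps = 800
vals = [lam * math.pi * nearest_dist_sq() for _ in range(reps)]
mean_val = sum(vals) / reps
print(mean_val)   # close to 1, the mean of a standard exponential
```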
Example 13.9 (Multinomial Distribution and Poisson Process). Suppose the intensity
measure μ of a Poisson process is finite; that is, μ(R^d) < ∞. In that
case, the cardinality of Π itself is finite, that is, N(R^d) < ∞ with probability one, because
N(R^d) has a Poisson distribution with the finite mean M = μ(R^d). Suppose
in a particular realization, the total number of events N(R^d) = n for some finite
n. We want to know how these n events are distributed among the members
of a partition A₁, A₂, …, A_k of R^d. It turns out that the joint distribution of
N(A₁), N(A₂), …, N(A_k) is a multinomial distribution.

Indeed, given n₁, n₂, …, n_k ≥ 0 such that $\sum_{i=1}^{k} n_i = n$,

$P(N(A_1) = n_1, \ldots, N(A_k) = n_k \,|\, N(R^d) = n) = \frac{n!}{n_1! \cdots n_k!} \prod_{i=1}^{k} \left(\frac{\mu(A_i)}{M}\right)^{n_i}.$
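The conditional multinomial law is an exact identity of Poisson pmfs and can be confirmed directly (the measure values below are arbitrary choices of ours):

```python
from math import exp, factorial

def poi(k, m):
    return exp(-m) * m ** k / factorial(k)

mus = [0.5, 1.2, 0.3]          # mu(A_1), mu(A_2), mu(A_3)
M = sum(mus)                   # total mass, here 2.0

ns = (2, 3, 1)
n = sum(ns)

joint = 1.0
for k, m in zip(ns, mus):      # independence over the disjoint A_i
    joint *= poi(k, m)
cond = joint / poi(n, M)       # condition on N(R^d) = n

coef = factorial(n)
for k in ns:
    coef //= factorial(k)
multinom = coef * (0.25 ** 2) * (0.6 ** 3) * (0.15 ** 1)   # probs mu_i / M

assert abs(cond - multinom) < 1e-12
print("conditional counts are multinomial")
```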
If a random set … is a Poisson process in Rd with some intensity measure, and if the
points are mapped into some other space by a transformation f , then the mapped
points will often form another Poisson process in the new space with a suitable new
intensity measure. This is useful, because we are often interested in a scatter induced
by an original scatter, and we can view the induced scatter as a mapping of a Poisson
process.
If Π is a Poisson process with intensity measure μ, and f: R^d → R^k is a mapping,
denote the image of Π under f by Π*; that is, Π* = f(Π). Let f^{−1}(A) =
{x ∈ R^d : f(x) ∈ A}. Then, N*(A), the number of points of Π* in A, equals N(f^{−1}(A)), which is a Poisson variable with mean μ*(A) = μ(f^{−1}(A)). For Π* to
be a Poisson process, we also need to show the independence property for disjoint
sets A₁, A₂, …, Aₙ, for any n ≥ 2, and we need to ensure that singleton sets do not
carry positive weight under the new intensity measure μ*. The independence over
disjoint sets is inherited from the independence over disjoint sets for the original
Poisson process Π; the requirement that μ*({y}) should be zero for any singleton
set {y} is usually verifiable on a case-by-case basis in applications. The exact
statement of the mapping theorem then says the following.
Nonhomogeneous Poisson processes are clearly important from the viewpoint of ap-
plications. But an additional fact of mathematical convenience is that the mapping
theorem implies that a nonhomogeneous Poisson process in one dimension can be
transformed into a homogeneous Poisson process. As a consequence, various for-
mulas and results for the nonhomogeneous case can be derived rather painlessly, by
simply applying the mapping technique suitably.
Suppose X(t), t ≥ 0 is a Poisson process on the real line with intensity function
λ(x), and intensity measure μ, so that for any given t, $\mu([0, t]) = \int_0^t \lambda(x)\, dx$. We
denote μ([0, t]) as Λ(t). Thus, Λ(t) is a nonnegative, continuous, and nondecreasing
function for t ≥ 0, with Λ(0) = 0.
We show how this general Poisson process can be transformed to a homogeneous
one by simply changing the clock. For this, think of the Poisson process as a random
countable set and simply apply the mapping theorem. Using the same notation as
in the mapping theorem, consider the particular mapping f(x) = Λ(x). Then, by
the mapping theorem, Π* = Λ(Π) is another Poisson process with the intensity
measure μ*([0, t]) = μ(f^{−1}([0, t])) = Λ(Λ^{−1}(t)) = t. In other words, mapping
the original Poisson process by using the transformation f(x) = Λ(x) converts the
general Poisson process with the intensity measure μ into a homogeneous Poisson
process with constant intensity equal to one. An equivalent way to state this is to
say that if we define a new process Z(t) as Z(t) = X(Λ^{−1}(t)), t ≥ 0, then Z(t)
is a homogeneous Poisson process with constant rate (intensity) equal to one. This
amounts to simply measuring time according to a new clock.
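The clock change gives a practical recipe for simulating a nonhomogeneous process: generate a rate-one process on [0, Λ(T)] and push the arrival times through Λ⁻¹. Below is a sketch with an example intensity of our choosing, λ(x) = 2x, so Λ(t) = t²:

```python
import random

random.seed(11)

Lam = lambda t: t * t          # mean function of lambda(x) = 2x
Lam_inv = lambda s: s ** 0.5

def arrivals(T):
    """Arrival times in [0, T] of the lambda(x) = 2x process."""
    out, s = [], random.expovariate(1.0)
    while s <= Lam(T):
        out.append(Lam_inv(s))
        s += random.expovariate(1.0)
    return out

reps, T = 3000, 1.0
mean_half = sum(
    sum(1 for y in arrivals(T) if y <= 0.5) for _ in range(reps)
) / reps
print(mean_half)   # near Lam(0.5) = 0.25, the expected count in [0, 0.5]
```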
This transformation result is useful in deriving important distributional properties
for a general nonhomogeneous Poisson process on the real line. We first need some
notation. Let Ti;X ; i D 1; 2; : : : denote the sequence of interarrival times for our
Poisson process X.t/ with intensity function .x/, and let Ti;Z denote the sequence
of interarrival times for the transformed process Z.t/ D X.ƒ1 .t//. Let also, Yn;X
be the nth arrival time in the X.t/ process, and Yn;Z the nth arrival time in the Z.t/
process.
Consider the simplest distributional question, namely what is the distribution of
T_{1,X}. By applying the mapping theorem, T_{1,X} = Λ^{−1}(T_{1,Z}), and because T_{1,Z} is a standard exponential, P(T_{1,X} > t) = P(T_{1,Z} > Λ(t)) = e^{−Λ(t)}.
The mapping theorem similarly leads to the main distributional results for the gen-
eral nonhomogeneous Poisson process on the real line, which are collected in the
theorem below.
Theorem 13.5. Let X(t), t ≥ 0 be a Poisson process with the intensity function
λ(x). Let $\Lambda(t) = \int_0^t \lambda(x)\, dx$. Let the interarrival times be Tᵢ, i ≥ 1, and the arrival
times Yᵢ, i ≥ 1. Then,
(a) For n ≥ 1, Yₙ has the density

$f_{Y_n}(t) = \frac{\lambda(t)\, e^{-\Lambda(t)}\, (\Lambda(t))^{n-1}}{(n-1)!}, \quad t > 0.$

(b) For n ≥ 2, the conditional density of Tₙ given Y_{n−1} = w is

$f_{T_n | Y_{n-1} = w}(t) = \lambda(w + t)\, e^{-[\Lambda(w + t) - \Lambda(w)]}, \quad t > 0.$
Example 13.10 (Piecewise Linear Intensity). Suppose printing jobs arrive at a computer
printer according to the piecewise linear periodic intensity function

λ(x) = x, if 0 ≤ x ≤ .1;  = .1, if .1 ≤ x ≤ .5;  = .2(1 − x), if .5 ≤ x ≤ 1.
We assume that the unit of time is a day, so that for the first few hours in the morning
the arrival rate increases steadily, and then for a few hours it remains constant. As the
13.6 One-Dimensional Nonhomogeneous Processes 455
day winds up, the arrival rate decreases steadily. The intensity function is periodic
with a period equal to one.
By direct integration, in the interval [0, 1],

Λ(t) = t²/2, if 0 ≤ t ≤ .1;  = .005 + .1(t − .1), if .1 ≤ t ≤ .5;

= .045 + .2[(t − t²/2) − .375] = .2t − .1t² − .03, if .5 ≤ t ≤ 1.
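The closed forms for Λ(t) can be checked against direct numerical integration of the intensity (the midpoint rule below is our sketch):

```python
def intensity(x):
    if x <= 0.1:
        return x
    if x <= 0.5:
        return 0.1
    return 0.2 * (1 - x)

def Lam_closed(t):
    if t <= 0.1:
        return t * t / 2
    if t <= 0.5:
        return 0.005 + 0.1 * (t - 0.1)
    return 0.2 * t - 0.1 * t * t - 0.03

def Lam_numeric(t, steps=20000):
    h = t / steps                       # midpoint rule
    return h * sum(intensity((i + 0.5) * h) for i in range(steps))

for t in (0.05, 0.1, 0.3, 0.5, 0.75, 1.0):
    assert abs(Lam_closed(t) - Lam_numeric(t)) < 1e-6
print(Lam_closed(1.0))   # 0.07: the expected number of jobs over a full day
```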
The intensity function .x/ and the mean function ƒ.t/ are plotted in Figs. 13.2
and 13.3. Note that ƒ.t/ would have been a linear function if the process were
homogeneous.
For f ≥ 0, let $\Sigma_f = \sum_{x \in \Pi} f(x)$, and let

$\Phi(f) = E\left(e^{-\Sigma_f}\right).$

Definition 13.4. Let Π be a Poisson process in R^d, and f a nonnegative function
on R^d. Then Φ(f) is called the characteristic functional of Π.
The convergence of the sum †f is clearly an issue of importance. A necessary
and sufficient condition for the convergence of †f is given in the theorem below.
The trick is first to prove it for functions that take only a finite number of values
and are of a compact support (i.e., they are zero outside of a compact set), and
then to approximate a general nonnegative function by functions of this type. See
Kingman (1993) for the details; these proofs and the outline of our arguments below
involve notions of integrals with respect to general measures. This is unavoidable,
and some readers may want to skip directly to the statement of the theorem below.
13.7 Campbell’s Theorem and Shot Noise 457
$\prod_{j=1}^{k} E\left[e^{-f_j N(A_j)}\right] = \prod_{j=1}^{k} e^{(e^{-f_j} - 1)\mu(A_j)}$

(because N(A_j) are independent Poisson random variables with means μ(A_j)),

$= e^{\sum_{j=1}^{k} (e^{-f_j} - 1)\mu(A_j)} = e^{\sum_{j=1}^{k} \int_{A_j} (e^{-f(x)} - 1)\, d\mu(x)} = e^{\int_{R^d} (e^{-f(x)} - 1)\, d\mu(x)}.$
Example 13.11 (Shot Noise and Stable Laws). Consider the particular function f:
R → R defined by f(x) = c|x|^{−β} sgn(x), 0 < β < 1. We eventually take
β > 1/2. The integral $\int_R \min(1, f(x))\, dx$ exists as a principal value, by taking the limit
through $\int_{-M}^{M} \min(1, f(x))\, dx$. We evaluate the characteristic function of Σ_f, that
is, we evaluate E(e^{θΣ_f}) for θ = iu, where u is real. We evaluate the characteristic
function by using Campbell's theorem.

For this, we first evaluate I(θ) with θ = iu. Because of the skew-symmetry of
the function f(x), only the cosine part in the characteristic function remains. In
other words, if θ = iu, u > 0, then

$I(\theta) = \int_{-\infty}^{\infty} [1 - \cos(u f(x))]\, dx = 2 \int_0^{\infty} [1 - \cos(u f(x))]\, dx$

$= 2 \int_0^{\infty} [1 - \cos(u c x^{-\beta})]\, dx = \frac{2 (uc)^{1/\beta}}{\beta} \int_0^{\infty} [1 - \cos(y)]\, y^{-\frac{1}{\beta} - 1}\, dy.$

The integral $\int_0^{\infty} [1 - \cos(y)]\, y^{-\frac{1}{\beta} - 1}\, dy$ converges if and only if β > 1/2, in which case
denote $\int_0^{\infty} [1 - \cos(y)]\, y^{-\frac{1}{\beta} - 1}\, dy = k(\beta)$. Thus, for β > 1/2, u > 0, I(θ) = γ(β)u^{1/β}
for some constant γ(β). As a function of u, I(θ) is symmetric in u, and so ultimately
we get that if β > 1/2, then the characteristic function of Σ_f is

$E\left(e^{i u \Sigma_f}\right) = e^{-\gamma |u|^{1/\beta}},$

for some positive constant γ. This is the characteristic function of a stable law with
exponent α = 1/β < 2 (and, of course, α > 0). We thus find here an interesting
manner in which stable distributions arise from consideration of shot noises and
Campbell's theorem.
Exercises
Exercise 13.1 (Poisson Process for Catching a Cold). Suppose that you catch a
cold according to a Poisson process at the constant average rate of once every three months.
(a) Find the probability that between the months of July and October, you will catch
at least four colds.
Exercises 459
(b) Find the probability that between the months of May and July, and also between
the months of July and October, you will catch at least four colds.
(c) * Find the probability that you will catch more colds between the months of
July and October than between the months of May and July.
Exercise 13.2 (Events up to a Random Time). Jen has two phones on her desk.
On one number, she receives internal calls according to a Poisson process at the rate
of one in two hours. On the other number, she receives external calls according to
a Poisson process at the rate of one per hour. Assume that the two processes run
independently.
(a) What is the expected number of external calls by the time the second internal
call arrives?
(b) * What is the distribution of the number of external calls by the time the second
internal call arrives?
Exercise 13.3 (An Example Due to Emanuel Parzen). Certain machine parts in a
factory fail according to a Poisson process at the rate of one in every six weeks. Two
such parts are in the factory’s inventory. The next supply will come in two months
(eight weeks). What is the probability that production will be stopped for a week or
more due to the lack of this particular machine part?
Exercise 13.5 (Spatial Poisson Process). Stars are distributed in a certain part of
the sky according to a three-dimensional Poisson process with constant intensity λ.
Find the mean and the variance of the separation between a particular star and the
star nearest to it.
Exercise 13.7 (An Example Due to Emanuel Parzen: Cry Baby). A baby cries
according to a Poisson process at the constant rate of three times per 30 minutes.
The parents respond only to every third cry. Find the probability that 20 minutes
will elapse between two successive responses of the parents; that 60 minutes will
elapse between two successive responses of the parents.
460 13 Poisson Processes and Applications
Exercise 13.20 * (Inspection Paradox). The city bus arrives at a certain bus stop
at the constant average rate of once per 30 minutes. Suppose that you arrive at the
bus stop at a fixed time t. Find the probability that the time till the next bus is larger
than the time since the last bus.
Exercise 13.22 (A Strong Law). Suppose $X(t)$ is a Poisson process with constant
arrival rate $\lambda$. Show that $\frac{X(t)}{t}$ converges almost surely to $\lambda$ as $t \to \infty$.
Exercise 13.23 (Two Poisson Processes). Suppose $X(t), Y(t), t \ge 0$ are two Pois-
son processes with rates $\lambda_1, \lambda_2$. Assume that the processes run independently.
(a) Prove or disprove: $X(t) + Y(t)$ is also a Poisson process.
(b) * Prove or disprove: $|X(t) - Y(t)|$ is also a Poisson process.
References
Karlin, S. and Taylor, H.M. (1975). A First Course in Stochastic Processes, Academic Press,
New York.
Kingman, J.F.C. (1993). Poisson Processes, Oxford University Press, Oxford, UK.
Lawler, G. (2006). Introduction to Stochastic Processes, Chapman and Hall, New York.
Parzen, E. (1962). Stochastic Processes, Holden-Day, San Francisco.
Port, S. (1994). Theoretical Probability for Applications, Wiley, New York.
Chapter 14
Discrete Time Martingales and Concentration
Inequalities
We start with a simple example, which nevertheless captures the spirit of the idea of
a martingale sequence of random variables.
$E(S_{n+1} \mid S_1, \ldots, S_n) = E(S_n + X_{n+1} \mid S_1, \ldots, S_n) = S_n + E(X_{n+1} \mid S_1, \ldots, S_n)$
$= S_n + E(X_{n+1}) = S_n + 0 = S_n.$
The sequence $\{X_n\}$ is called a martingale adapted to the sequence $\{Y_n\}$ if for each
$n \ge 1$, $X_n$ is a function of $Y_1, \ldots, Y_n$, and $E(X_{n+1} \mid Y_1, Y_2, \ldots, Y_n) = X_n$ with
probability one.
$E(X_{n+1} \mid X_1, X_2, \ldots, X_n)$
$= E\left[(Z_1 + \cdots + Z_n)^2 + 2Z_{n+1}(Z_1 + \cdots + Z_n) + Z_{n+1}^2 \mid X_1, X_2, \ldots, X_n\right] - (n+1)$
$= X_n + n + 2(Z_1 + \cdots + Z_n)E(Z_{n+1} \mid X_1, X_2, \ldots, X_n) + E(Z_{n+1}^2 \mid X_1, X_2, \ldots, X_n) - (n+1)$
$= X_n + n + 0 + 1 - (n+1) = X_n.$
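The martingale property of $X_n = (Z_1 + \cdots + Z_n)^2 - n$ implies in particular that $E(X_n) = E(X_0) = 0$ for every $n$. A short Monte Carlo check, taking the $Z_i$ to be $\pm 1$ with probability $\frac{1}{2}$ each (an illustrative choice with mean zero and variance one):

```python
import random

random.seed(2)

def walk_squared_minus_n(n):
    # One realization of X_n = S_n^2 - n for the +/-1 walk.
    s = sum(random.choice((-1, 1)) for _ in range(n))
    return s * s - n

trials = 100_000
for n in (5, 20, 50):
    est = sum(walk_squared_minus_n(n) for _ in range(trials)) / trials
    # E(X_n) = E(X_0) = 0 for the martingale X_n = S_n^2 - n.
    assert abs(est) < 1.0
```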
Example 14.5 (Matching Problem). Consider the matching problem. For example,
suppose $N$ people, each wearing a hat, have gathered at a party and at the end of
the party, the $N$ hats are returned to them at random. Those that get their own hats
back then leave the room. The remaining hats are distributed among the remaining
guests at random, and so on. The process continues until all the hats have been given
away. Let Xn denote the number of guests still present after the nth round of this
hat returning process.
At each round, we expect one person to get his own hat back and leave the room.
In other words, $E(X_n - X_{n+1}) = 1 \ \forall n$. In fact, with a little calculation, we even
have
$E(X_{n+1} \mid X_1, \ldots, X_n) = E(X_{n+1} - X_n \mid X_1, \ldots, X_n) + X_n = -1 + X_n.$
Example 14.6 (Pólya's Urn). The Pólya urn scheme is defined as follows. Initially,
an urn contains $a$ white and $b$ black balls, a total of $a + b$ balls. One ball is drawn
at random from among all the balls in the urn. It, together with $c$ more balls of its
color, is returned to the urn, so that after the first draw the urn has $a + b + c$ balls.
This process is repeated.
Suppose $X_i$ is the indicator of the event $A_i$ that a white ball is drawn at the $i$th
trial, and for given $n \ge 1$, $S_n = X_1 + \cdots + X_n$, which is the total number of
times that a white ball has been drawn in the first $n$ trials. For the sake of notational
simplicity, we take $c = 1$. Then, the proportion of white balls in the urn just after
the $n$th trial has been completed is $R_n = \frac{a + S_n}{a + b + n}$.
Elementary arguments show that
$P(X_{n+1} = 1 \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{a + x_1 + \cdots + x_n}{a + b + n}.$
Thus,
$E(S_{n+1} \mid S_1, \ldots, S_n) = E(S_{n+1} \mid S_n) = S_n + \frac{a + S_n}{a + b + n}$
$\Rightarrow E(R_{n+1} \mid R_1, \ldots, R_n) = \frac{a}{a + b + n + 1} + \frac{1}{a + b + n + 1}\left[(a + b + n)R_n - a + R_n\right] = R_n.$
We therefore have the interesting result that in the Pólya urn scheme, the sequence
of proportions of white balls in the urn forms a martingale.
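The martingale property forces $E(R_n) = R_0 = a/(a+b)$ for every $n$, which is easy to test by simulation. A sketch with the (assumed, purely illustrative) choices $a = 2$, $b = 3$, $c = 1$:

```python
import random

random.seed(3)

a, b = 2, 3  # initial white and black balls; c = 1 ball added after each draw

def proportion_white_after(n):
    # Run n draws of the Polya urn and return R_n = (white balls)/(total balls).
    white, total = a, a + b
    for _ in range(n):
        if random.random() < white / total:
            white += 1
        total += 1
    return white / total

trials = 100_000
est = sum(proportion_white_after(20) for _ in range(trials)) / trials
# Since {R_n} is a martingale, E(R_n) = R_0 = a/(a+b) = 0.4 for every n.
assert abs(est - 0.4) < 0.01
```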
Example 14.7 (The Wright–Fisher Markov Chain). Consider the stationary Markov
chain $\{X_n\}$ on the state space $\{0, 1, 2, \ldots, N\}$ with the one-step transition
probabilities
$p_{ij} = \binom{N}{j}\left(\frac{i}{N}\right)^j \left(1 - \frac{i}{N}\right)^{N-j}.$
14.1 Illustrative Examples and Applications in Statistics 467
This is the Wright–Fisher chain in population genetics (see Chapter 10). We show
that $X_n$ is a martingale adapted to itself. Indeed, by direct calculation, given $X_n = i$,
the next state $X_{n+1}$ is binomially distributed with parameters $N$ and $\frac{i}{N}$, so that
$E(X_{n+1} \mid X_1, \ldots, X_n) = E(X_{n+1} \mid X_n) = N \cdot \frac{X_n}{N} = X_n.$
Example 14.8 (Likelihood Ratios). Suppose $X_1, X_2, \ldots$ are observed data, and we wish
to decide whether their common density is $f_0$ or $f_1$. Define the likelihood ratio
$\Lambda_n = \prod_{i=1}^{n} \frac{f_1(X_i)}{f_0(X_i)}.$
If $\Lambda_n$ is large for the observed data, then one concludes that the data values come
from a high-density region of $f_1$ and a low-density region of $f_0$, and therefore
concludes that the true $f$ generating the observed data is $f_1$.
Suppose now the null hypothesis is actually true; that is, truly, $X_1, X_2, \ldots$ are iid
with the common density $f_0$. Now,
$E_{f_0}[\Lambda_{n+1} \mid \Lambda_1, \ldots, \Lambda_n] = E_{f_0}\left[\Lambda_n \frac{f_1(X_{n+1})}{f_0(X_{n+1})} \,\Big|\, \Lambda_1, \ldots, \Lambda_n\right]$
$= \Lambda_n E_{f_0}\left[\frac{f_1(X_{n+1})}{f_0(X_{n+1})} \,\Big|\, \Lambda_1, \ldots, \Lambda_n\right] = \Lambda_n E_{f_0}\left[\frac{f_1(X_{n+1})}{f_0(X_{n+1})}\right] = \Lambda_n \cdot 1 = \Lambda_n.$
Therefore, the sequence of likelihood ratios forms a martingale under the null hy-
pothesis (i.e., if the true f is f0 ).
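Since $\{\Lambda_n\}$ is a martingale under $f_0$ with $\Lambda_0 = 1$, we must have $E_{f_0}(\Lambda_n) = 1$ for every $n$. A sketch checking this by simulation, using the (assumed) illustrative pair $f_0 = N(0,1)$, $f_1 = N(1,1)$, for which $f_1(x)/f_0(x) = e^{x - 1/2}$:

```python
import math
import random

random.seed(4)

def lambda_n(n):
    # Likelihood ratio of f1 = N(1,1) against f0 = N(0,1) on data truly from f0.
    lr = 1.0
    for _ in range(n):
        x = random.gauss(0.0, 1.0)  # the data come from f0
        lr *= math.exp(x - 0.5)
    return lr

trials = 200_000
est = sum(lambda_n(3) for _ in range(trials)) / trials
# Martingale property under f0: E(Lambda_n) = E(Lambda_0) = 1 for every n.
assert abs(est - 1.0) < 0.1
```

Note that although the mean stays at 1, individual paths of $\Lambda_n$ drift to zero, a point taken up again in Example 14.23.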
Example 14.9 (Bayes Estimates). Suppose random variables $Y, X_1, X_2, \ldots$ are de-
fined on a common sample space $\Omega$. For given $n \ge 1$, $(X_1, X_2, \ldots, X_n)$ has the
joint conditional distribution $P_{\theta,n}$ given that $Y = \theta$. From a statistical point of
view, $Y$ is supposed to stand for an unknown parameter, which is formally treated
14.2 Stopping Times and Optional Stopping

The optional stopping theorem is one of the most useful results in martingale theory.
It can be explained in gambling terms. Consider a gambler playing a fair game
repeatedly, so that her sequence of fortunes forms a martingale. One might think that
by gaining experience as the game proceeds, and by quitting at a cleverly chosen
opportune time based on the gambler's experience, a fair game could be turned
into a favorable game. The optional stopping theorem says that this is in fact not
possible, if the gambler does not have unlimited time on her hands and the house
has limits on how much she can put up on the table. Mathematical formulation of the
optional stopping theorem requires use of stopping times, which were introduced in
Chapter 11 in the context of random walks. We redefine stopping times and give
additional examples below before introducing optional stopping.
The Wald sequential probability ratio test (SPRT) continues sampling as long as $\Lambda_n$
remains between two specified numbers $a$ and $b$, $a < b$, and stops and decides in
favor of $f_1$ or $f_0$ the first time $\Lambda_n > b$ or $\Lambda_n < a$. If we denote $l = \log a$, $u = \log b$,
then Wald's test waits till the first time
$\log \Lambda_n = \sum_{i=1}^{n} \log \frac{f_1(Z_i)}{f_0(Z_i)} = \sum_{i=1}^{n} X_i \ \text{(say)}$
goes above $u$ or below $l$, and thus the sampling number of Wald's SPRT is a stopping
time.
Example 14.12 (Combining Stopping Times). This example shows a few ways that
we can make new stopping times out of given ones. Suppose $\tau$ is a stopping time
(adapted to some sequence $\{X_n\}$) and $n$ is a prespecified positive integer. Then $\tau_n =
\min(\tau, n)$ is a stopping time (adapted to the same sequence). This is because
$\{\tau_n \le k\} = \{\tau \le k\} \cup \{n \le k\},$
and therefore, $\tau$ being a stopping time adapted to $\{X_n\}$, for any given $k$, deciding
whether $\tau_n \le k$ requires the knowledge of only $X_1, \ldots, X_k$.
Suppose $\tau_1, \tau_2$ are both stopping times, adapted to some sequence $\{X_n\}$. Then
$\tau_1 + \tau_2$ is also a stopping time adapted to the same sequence. To prove this, note
that
Theorem 14.2. (a) Let $\tau$ be a stopping time adapted to some sequence $\{X_n\}$. Then,
for any given $n \ge 1$, $\min(\tau, n)$ is also a stopping time adapted to $\{X_n\}$.
(b) Let $\tau_1, \tau_2$ be stopping times adapted to $\{X_n\}$. Then each of $\tau_1 + \tau_2$, $\min(\tau_1, \tau_2)$,
$\max(\tau_1, \tau_2)$ is a stopping time adapted to $\{X_n\}$.
(c) Let $\{\tau_k, k \ge 1\}$ be a countable family of stopping times, each adapted to $\{X_n\}$.
Let
$\sigma = \inf_k \tau_k; \quad \rho = \sup_k \tau_k; \quad \lambda = \lim_{k \to \infty} \tau_k,$
where $\sigma$, $\rho$, and $\lambda$ are defined pointwise, and it is assumed that the limit exists
almost surely. Then each of $\sigma$, $\rho$, and $\lambda$ is a stopping time adapted to $\{X_n\}$.
The most significant benefit of introducing the concept of stopping times is the
optional stopping theorem. At the expense of using some potentially hard to verify
conditions, stronger versions of our statement of the optional stopping theorem can
be stated. We opt for simplicity of the statement over greater generality,
and refer to more general versions (which are useful!). The main message of the op-
tional stopping theorem is that a gambler cannot convert a fair game into a favorable
one by using clever quitting strategies.
In particular, if $\lim_{n \to \infty} E(X_{\min(\tau, n)}) = E(X_\tau)$,
then
$E(X_\tau) = E(X_0).$
Proof of Theorem. For simplicity, we give the proof only for the case when $\{X_n\}$ is
adapted to itself. The main step involved is to notice the identity
$W_n = X_{\min(\tau, n)} = \sum_{i=0}^{n-1} X_i I_{\{\tau = i\}} + X_n I_{\{\tau \ge n\}}, \qquad (*)$
for all $n \ge 0$. It follows from this identity and the submartingale property of the
$\{X_n\}$ sequence that
$E(W_{n+1} \mid X_0, \ldots, X_n) = \sum_{i=0}^{n} E(X_i I_{\{\tau = i\}} \mid X_0, \ldots, X_n) + E(X_{n+1} I_{\{\tau > n\}} \mid X_0, \ldots, X_n)$
$= \sum_{i=0}^{n} X_i I_{\{\tau = i\}} + I_{\{\tau > n\}}\, E(X_{n+1} \mid X_0, \ldots, X_n)$
$\ge \sum_{i=0}^{n} X_i I_{\{\tau = i\}} + X_n I_{\{\tau > n\}} = X_{\min(\tau, n)} = W_n.$
To complete the proof of the theorem, we need the reverse inequality $E(W_n) \le
E(X_n)$. This too follows from the same identity $(*)$ given at the beginning of the
proof of this theorem, and on using the additional inequality
$X_i I_{\{\tau = i\}} \le E(X_n \mid X_0, \ldots, X_i)\, I_{\{\tau = i\}},$
because $\{X_n\}$ is a submartingale. If this bound on $X_i I_{\{\tau = i\}}$ is plugged into our basic
identity $(*)$ above, the reverse inequality follows.
The remaining claim, when $\{X_n\}$ is in fact a martingale, follows immediately
from the two inequalities $E(X_0) \le E(W_n) \le E(X_n)$. $\square$
Easy examples show that the assertion $E(X_\tau) = E(X_0)$ for a martingale sequence
$\{X_n\}$ cannot hold without some control on the stopping time $\tau$. We first provide
a simple example where the assertion of the optional stopping theorem fails. In
looking for such counterexamples, it is useful to construct the stopping time in a
way that when we stop, the value of the stopped martingale is a constant; that is, $X_\tau$
is a constant.
Example 14.13 (An Example Where the Optional Stopping Theorem Fails). Con-
sider again the gambling example, or what really is the simple symmetric random
walk, with $X_i$ iid having the common distribution $P(X_i = \pm 1) = \frac{1}{2}$, and
$S_n = \sum_{i=1}^{n} X_i$, $n \ge 1$. We define $S_0 = 0$. We know $S_n$ to be a martingale. Consider
now the stopping time
$\tau = \inf\{n > 0 : S_n = 1\}.$
We know from Chapter 11 that the one-dimensional simple symmetric random walk
is recurrent; thus, $P(\tau < \infty) = 1$. Note that $S_\tau = 1$, and so $E(S_\tau) = 1$. However,
$E(S_0) = E(S_n) = 0$. So, the assertion of the optional stopping theorem does not
hold.
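The failure can be seen numerically: for any fixed cap $n$, the bounded stopping time $\min(\tau, n)$ does satisfy $E(S_{\min(\tau, n)}) = 0$, even though $S_\tau = 1$ with probability one. A simulation sketch (the cap 1000 is an arbitrary choice):

```python
import random

random.seed(5)

def stopped_value(n_cap):
    # Run the simple symmetric walk until it first hits +1, but for at most
    # n_cap steps; return S_{min(tau, n_cap)}.
    s = 0
    for _ in range(n_cap):
        s += random.choice((-1, 1))
        if s == 1:
            return 1
    return s

trials = 100_000
est = sum(stopped_value(1000) for _ in range(trials)) / trials
# E(S_{min(tau, n)}) = 0 for every fixed n, even though E(S_tau) = 1:
# the rare, very negative unstopped paths exactly balance the mass at +1.
assert abs(est) < 0.5
```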
What is going on in this example is that we do not have enough control on the
stopping time $\tau$. Although the random walk visits all its states (infinitely often) with
probability one, the recurrence times are infinite on the average. Thus, $\tau$ can be
uncontrollably large. Indeed, the assumption
$\lim_{n \to \infty} E(S_{\min(\tau, n)}) = E(S_\tau)$
does not hold. Roughly speaking, $P(\tau > n)$ goes to zero at the rate $\frac{1}{\sqrt{n}}$, and if the
random walk still has not reached positive territory by time $n$, then it has traveled
to some distance roughly of the order of $\sqrt{n}$. These two now exactly balance
out, so that $E(S_{\min(\tau, n)} I_{\{\tau > n\}})$ does not go to zero. This causes the assumption
$\lim_{n \to \infty} E(S_{\min(\tau, n)}) = E(S_\tau) = 1$ to fail.
Thus, our search for sufficient conditions in the optional stopping theorem should
be directed at finding nice enough conditions ensuring that the stopping time $\tau$
cannot get too large with high probability. The next two theorems provide such a
set of aesthetically attractive sufficient conditions. It is not hard to prove these two
theorems. We refer the reader to Fristedt and Gray (1997, Chapter 24) for proofs of
these two theorems.
(c) For some positive and finite $c$, $|X_{n+1} - X_n| \le c$ for all $n \ge 0$, $E(|X_0|) < \infty$,
and $E(\tau) < \infty$.
(d) For some finite constant $c$, $E(X_n^2) \le c$ for all $n \ge 0$.
Then $E(X_\tau) = E(X_0)$.
Remark. The special structure of the random walk martingale allows us to conclude
the assertion of the optional stopping theorem without requiring the bounded increments
condition $|X_{n+1} - X_n| \le c$, which was included in the all-purpose sufficient
condition in Theorem 14.4.
$E(S_n^2) = \mathrm{Var}(S_n) = \sum_{i=1}^{n} \mathrm{Var}(X_i) = \sum_{i=1}^{n} \frac{1}{i^{2\alpha}} \le \sum_{i=1}^{\infty} \frac{1}{i^{2\alpha}} = \zeta(2\alpha) < \infty,$
where $\zeta(z)$ is the Riemann zeta function $\zeta(z) = \sum_{n=1}^{\infty} \frac{1}{n^z}$, $z > 1$. Therefore, if
$\alpha > \frac{1}{2}$, then $E(S_n^2) \le c = \zeta(2\alpha)$ for all $n$, and hence, by our theorem above, $E(S_\tau) = 0$.
Example 14.15 (The Simple Random Walk). Consider the one-dimensional random
walk with iid steps $X_i$, having the common distribution $P(X_i = 1) = p$, $P(X_i =
-1) = q$, $0 < p < 1$, $p + q = 1$. Then $E(X_i) = p - q = \mu$ (say), and $S_n - n\mu$,
where $S_n = \sum_{i=1}^{n} X_i$, is a martingale. We also have, for any $n$,
$\left|(S_{n+1} - (n+1)\mu) - (S_n - n\mu)\right| = |X_{n+1} - \mu| \le 1 + |\mu| \le 2.$
Furthermore, $E(|S_1|)$ is clearly finite. Therefore, for any stopping time $\tau$ with
a finite expectation, by using our theorem above, the equality $E(S_\tau - \tau\mu) = 0$, or
equivalently, $E(S_\tau) = \mu E(\tau)$, holds. Recall from Chapter 11 that this is a special
case of Wald's identity. Wald's identity is revisited in the next section.
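Wald's identity $E(S_\tau) = \mu E(\tau)$ can be checked by simulation for a walk with drift. The sketch below uses the (assumed, illustrative) values $p = 0.6$ and the hitting time of level 5, for which $S_\tau = 5$ always, so the identity predicts $E(\tau) = 5/\mu = 25$:

```python
import random

random.seed(6)

p = 0.6                 # P(X_i = +1); P(X_i = -1) = 0.4
mu = p - (1 - p)        # E(X_i) = 0.2

def hitting_time(target):
    # tau: first time the walk reaches `target` (finite a.s. since mu > 0).
    s, n = 0, 0
    while s < target:
        s += 1 if random.random() < p else -1
        n += 1
    return n

trials = 50_000
target = 5
mean_tau = sum(hitting_time(target) for _ in range(trials)) / trials
# Wald's identity: E(S_tau) = mu * E(tau); here S_tau = target always.
assert abs(mean_tau - target / mu) < 1.0
```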
We provide a few applications of the optional stopping theorem. The optional stop-
ping theorem also has important applications to martingale inequalities, which are
our topic in the next section.
Perhaps the two best general applications of the optional stopping theorem are
two identities, known as Wald identities. Of these, the first Wald identity is already
known to us; see Chapter 11. We connect that identity to martingale theory and
present a second identity, which was not presented in Chapter 11.
Proof. Both parts of this theorem follow from Theorem 14.5. For part (a), apply
Theorem 14.5 to the martingale sequence $S_n - n\mu$ to conclude that $E(S_\tau - \tau\mu) =
0 \Rightarrow E(S_\tau) = \mu E(\tau)$. For part (b), because $\mu = E(X_1)$ has now been assumed to
be zero, by applying part (a) of this theorem, $E(S_\tau) = 0$. Moreover,
$E\left(S_{n+1}^2 - (n+1)\sigma^2 \mid S_1, \ldots, S_n\right) = S_n^2 + \sigma^2 - (n+1)\sigma^2 = S_n^2 - n\sigma^2;$
that is, $S_n^2 - n\sigma^2$ is a martingale sequence adapted to the $S_n$ sequence. From here,
it follows that $E(S_\tau^2 - \tau\sigma^2) = E(S_1^2 - \sigma^2) = 0$, which means
$E(S_\tau^2) = \sigma^2 E(\tau).$
We know from general random walk theory (Chapter 11) that $P(\tau_x < \infty) = 1$.
Also, obviously $E(|X_1|) = 1 < \infty$, and $\mu = E(X_1) = 0$. Therefore, if $E(\tau_x)$
were finite, Wald's identity $E(S_{\tau_x}) = \mu E(\tau_x)$ would hold. However, $S_{\tau_x} = x$ with
probability one, and hence $E(S_{\tau_x}) = x$. It follows that the equality $x = 0 \cdot E(\tau_x)$
cannot hold for any finite value of $E(\tau_x)$. In other words, for any positive $x$, the
expected hitting time of $x$ must be infinite for the simple symmetric random walk.
The same argument also works for negative $x$.
Example 14.17 (Gambler's Ruin). Now let us revisit the so-called gambler's ruin
problem, wherein the gambler quits when he either goes broke or attains a prespec-
ified amount of fortune (see Chapter 10). In other words, the gambler waits for the
random walk $S_n$ to hit one of the two integers $0, b$, $b > 0$. Suppose $a < b$ is the amount
of money with which the gambler walked in, so that the gambler's sequence of for-
tunes is the random walk $S_n = \sum_{i=1}^{n} X_i + S_0$, where $S_0 = a$, and the steps are
still iid with $P(X_i = \pm 1) = \frac{1}{2}$. Formally, let
$\tau = \inf\{n \ge 0 : S_n = 0 \text{ or } S_n = b\}.$
Then
$a = E(S_0) = E(S_\tau) = 0 \cdot P(S_\tau = 0) + b \cdot P(S_\tau = b);$
note that we have implicitly assumed the validity of the optional stopping theorem
in the last step (which is true in this example; why?). Rearranging terms, we deduce
that $P(S_\tau = 0) = \frac{b-a}{b}$, or equivalently, $P(S_\tau = b) = \frac{a}{b}$.
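A simulation sketch of the ruin probability, with the (assumed, illustrative) values $a = 3$, $b = 10$; the optional stopping argument predicts $P(S_\tau = b) = a/b = 0.3$:

```python
import random

random.seed(7)

a, b = 3, 10  # starting fortune a, target fortune b; fair +/-1 bets

def reaches_target():
    # Run the fair walk started at a until it hits 0 (ruin) or b (success).
    s = a
    while 0 < s < b:
        s += random.choice((-1, 1))
    return s == b

trials = 100_000
est = sum(reaches_target() for _ in range(trials)) / trials
# Optional stopping gives a = E(S_tau) = b * P(S_tau = b), so P(S_tau = b) = a/b.
assert abs(est - a / b) < 0.01
```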
Example 14.18 (Generalized Wald Identity). The two identities of Wald given above
assume only the existence of the first and the second moment of Xi , respectively.
If we make the stronger assumption that the Xi have a finite mgf, then a more
embracing martingale identity can be proved, from which the two Wald identities
given above fall out as special cases. This generalized Wald identity is presented in
this example.
The basic idea is the same as before, which is to think of a suitable martingale
and apply the optional stopping theorem to it. Suppose then that $X_1, X_2, \ldots$ is an
iid sequence with the mgf $\psi(t) = E(e^{tX_i})$, which we assume to be finite in some
nonempty interval containing zero. The martingale that works for our purpose in
this example is
$Z_n = [\psi(t)]^{-n} e^{tS_n}, \quad n \ge 0,$
where, as usual, $S_n = \sum_{i=1}^{n} X_i$, and we take $S_0 = 0$. The number $t$ is fixed, and is
often cleverly chosen in specific applications.
The special normal case of this martingale was seen in Example 14.4. Exactly
the same proof works to show that $Z_n$ as defined above is a martingale in
general, not just in the normal case. Formally, therefore, whenever we have a stopping
time $\tau$ such that the optional stopping theorem is valid for this martingale sequence
$Z_n$, we have the identity
$E\left([\psi(t)]^{-\tau} e^{tS_\tau}\right) = E(Z_0) = 1.$
Once we have this general identity, we can manipulate it for special stopping times
to make useful conclusions in specific applications.
The type I error probability of our test is the probability that the test would reject
$H_0$ if $H_0$ happened to be true. Denoting the type I error probability as $\alpha$, we have
$\alpha = P_{f_0}(S_\tau \ge u)$. We use Wald's generalized identity to approximate $\alpha$. Exact
calculation of $\alpha$ is practically impossible except in stray cases.
14.3 Martingale and Concentration Inequalities 477
$1 = E_{f_0}(e^{tS_\tau}) \approx e^{tu}\alpha + e^{tl}(1 - \alpha)$
$\Rightarrow \alpha \approx \frac{1 - e^{tl}}{e^{tu} - e^{tl}}.$
This is the classic Wald approximation to the type I error probability of the SPRT
(sequential probability ratio test). A similar approximation exists for the type II
error probability of the SPRT, which is the probability that the test will accept $H_0$ if
$H_0$ happens to be false.
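The quality of Wald's approximation can be probed by simulation. The sketch below uses an illustrative (assumed) setup: $f_0 = N(0,1)$, $f_1 = N(1,1)$, so the per-observation log likelihood ratio under $H_0$ is $Z - \frac{1}{2}$ with $Z \sim N(0,1)$, and boundaries $a = 1/9$, $b = 9$. Since $E_{f_0}(e^{S_\tau}) = 1$ exactly, taking $t = 1$ gives the approximation $\alpha \approx 0.1$; because the approximation ignores overshoot of the boundaries, the simulated $\alpha$ is typically somewhat smaller.

```python
import math
import random

random.seed(8)

# SPRT for H0: f0 = N(0,1) vs H1: f1 = N(1,1); log ratio per observation is Z - 1/2.
l, u = math.log(1 / 9), math.log(9)  # stop when log Lambda_n leaves (l, u)

def sprt_rejects_h0():
    s = 0.0
    while l < s < u:
        s += random.gauss(0.0, 1.0) - 0.5  # data generated under H0
    return s >= u

trials = 100_000
alpha_mc = sum(sprt_rejects_h0() for _ in range(trials)) / trials
# Wald's approximation with t = 1:
alpha_wald = (1 - math.exp(l)) / (math.exp(u) - math.exp(l))  # = 0.1 here
assert abs(alpha_mc - alpha_wald) < 0.05
```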
The optional stopping theorem is also the main tool in proving a collection of impor-
tant inequalities involving martingales. To provide a little context for such inequali-
ties, consider the special martingale of a random walk, namely $S_n = \sum_{i=1}^{n} X_i$, $n \ge
1$, where we assume the $X_i$ to be iid mean zero random variables with a fi-
nite variance $\sigma^2$. If we take any fixed $n$, and any fixed $\lambda > 0$, then simply by
Chebyshev's inequality, $P(|S_n| \ge \lambda) \le \frac{E(S_n^2)}{\lambda^2}$. Kolmogorov's inequality (see Chap-
ter 8) makes the stronger assertion $P\left(\max_{1 \le k \le n} |S_k| \ge \lambda\right) \le \frac{E(S_n^2)}{\lambda^2}$. A fundamental
inequality in martingale theory says that such an inequality holds for more general
martingales, and not just the special martingale of a random walk.
(b) Let $\{X_n, n \ge 0\}$ be a martingale adapted to some sequence $\{Y_n\}$, and $\lambda$ any
fixed positive number. Suppose $p \ge 1$ is such that $E(|X_k|^p) < \infty$ for any
$k \ge 0$. Then, for any $n \ge 0$,
$P\left(\max_{0 \le k \le n} |X_k| \ge \lambda\right) \le \frac{E\left[|X_n|^p\, I_{\{\max_{0 \le k \le n} |X_k| \ge \lambda\}}\right]}{\lambda^p} \le \frac{E(|X_n|^p)}{\lambda^p}.$
Proof. Note that the final inequality in part (b) follows from part (a) by use of
Theorem 14.1, because $f(z) = |z|^p$ is a nonnegative convex function, and therefore
if $\{X_n\}$ is a martingale adapted to some sequence $\{Y_n\}$, then for $p \ge 1$, $\{|X_n|^p\}$ is a
nonnegative submartingale (adapted to that same sequence $\{Y_n\}$). The first inequal-
ity in part (b) is proved by partitioning the event $\{\max_{0 \le k \le n} |X_k| \ge \lambda\}$ into disjoint
events of the form $\{|X_0| < \lambda, \ldots, |X_i| < \lambda, |X_{i+1}| \ge \lambda\}$, and then using simple
bounds on each of these partitioning sets. This is left as an exercise.
For proving part (a) of this theorem, define the stopping time
$\tau = \inf\{k \ge 0 : X_k > \lambda\},$
Part (a) of the theorem above assumes the submartingale fXn g to be nonnega-
tive. This assumption is in fact not needed. In addition, the inequality itself can be
somewhat strengthened. The following improved version of the maximal inequality
can be proved by minor modifications of the argument given above; we record the
stronger version, which is important for applications.
Example 14.20 (Sharper Bounds Near Zero). The bounds in Theorems 14.7 and
14.8 are not useful unless $\lambda$ is large, because the upper bounds blow up
as $\lambda \to 0$. However, if we work a little harder, then useful bounds can be derived,
at least in some cases, even when $\lambda$ is near zero. This example illustrates such a
calculation.
Let $\{X_n\}$ be a zero mean martingale, and suppose $\sigma_k^2 = \mathrm{Var}(X_k) < \infty$ for
all $k$. For $n \ge 0$, denote $M_n = \max_{0 \le k \le n} X_k$. Fix a constant $c > 0$; the constant
$c$ is chosen suitably later. By Theorem 14.1, $\{(X_k + c)^2\}$ is a submartingale, and
therefore, by Theorem 14.8,
$P(M_n \ge \lambda) = P(M_n + c \ge \lambda + c) = P\left(\max_{0 \le k \le n} (X_k + c) \ge \lambda + c\right)$
$\le \frac{E(X_n + c)^2}{(\lambda + c)^2} = \frac{c^2 + \sigma_n^2}{c^2 + 2c\lambda + \lambda^2}.$
Therefore,
$P(M_n \ge \lambda) \le \inf_{c > 0} \frac{c^2 + \sigma_n^2}{c^2 + 2c\lambda + \lambda^2}.$
The function $\frac{c^2 + \sigma_n^2}{c^2 + 2c\lambda + \lambda^2}$ is uniquely minimized at the root of the derivative equation
$\frac{c}{c^2 + \sigma_n^2} = \frac{c + \lambda}{c^2 + 2c\lambda + \lambda^2}$
$\Leftrightarrow c^2 + c\,\frac{\lambda^2 - \sigma_n^2}{\lambda} - \sigma_n^2 = 0 \ \Leftrightarrow\ c = \frac{\sigma_n^2}{\lambda}.$
Substituting this value of $c$, we get
$P(M_n \ge \lambda) \le \inf_{c > 0} \frac{c^2 + \sigma_n^2}{c^2 + 2c\lambda + \lambda^2} = \frac{\sigma_n^2}{\lambda^2 + \sigma_n^2},$
for any $\lambda > 0$. Clearly, this bound is strictly smaller than one for any $\lambda > 0$.
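The bound $\sigma_n^2/(\lambda^2 + \sigma_n^2)$ is easy to test numerically. The sketch below uses the $\pm 1$ random walk, a zero mean martingale with $\sigma_n^2 = n$ (the value $n = 25$ and the grid of $\lambda$ values are arbitrary choices):

```python
import random

random.seed(9)

n = 25  # the +/-1 walk S_k has Var(S_n) = n

def max_of_walk():
    # Return M_n = max_{0<=k<=n} S_k for one path of the walk.
    s, m = 0, 0
    for _ in range(n):
        s += random.choice((-1, 1))
        m = max(m, s)
    return m

trials = 50_000
maxima = [max_of_walk() for _ in range(trials)]
for lam in (1, 3, 5, 10):
    p_mc = sum(m >= lam for m in maxima) / trials
    bound = n / (n + lam * lam)  # sigma_n^2 / (sigma_n^2 + lambda^2)
    # Unlike the Chebyshev-type bounds, this stays below one even for small lambda.
    assert p_mc <= bound + 0.01
```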
Example 14.21 (Bounds on the Moments of the Maximum). Here is a clever applica-
tion of Theorem 14.7 to bounding the moments of $M_n = \max_{0 \le k \le n} |X_k|$ in terms
of the same moment of $|X_n|$ for a martingale sequence $\{X_n\}$. The example is a very
nice illustration of the art of putting simple things together to get a pretty end result.
Suppose that $\{X_n, n \ge 0\}$ is a martingale sequence, and $p > 1$ is such that
$E(|X_k|^p) < \infty$ for every $k$. The proof of the result in this example makes use
of Hölder's inequality $E(|XY|) \le (E|X|^\alpha)^{1/\alpha}(E|Y|^\beta)^{1/\beta}$, where $\alpha, \beta > 1$ and
$\frac{1}{\alpha} + \frac{1}{\beta} = 1$ (see Chapter 1).
Proceeding,
$E(M_n^p) = p \int_0^\infty \lambda^{p-1} P(M_n > \lambda)\, d\lambda$
$\le p \int_0^\infty \lambda^{p-2} E\left[|X_n|\, I_{\{M_n \ge \lambda\}}\right] d\lambda.$
There is a useful connection between $E(X_n^2)$ and $E(D_n^2)$ for a general square
integrable martingale $\{X_n\}$, where $D_n^2 = \sum_{i=1}^{n} (X_i - X_{i-1})^2$. The connection,
which we prove below, is the neat identity $E(X_n^2) - E(X_0^2) = E(D_n^2)$, so that if
$X_0 = 0$, then $E(X_n^2)$ and $E(D_n^2)$ are equal. Therefore, we can think of the maximal
inequality and the implied moment bounds in terms of $E(D_n^2)$, because $E(D_n^2)$ and
$E(X_n^2)$ are, after all, equal. It was shown in Burkholder (1973), Davis (1970), and
Burkholder, Davis, and Gundy (1972) that one can bound expectations of far more
general functions of $M_n$ in terms of expectations of the same functions of $D_n$; in
particular, one can bound the $p$th moment of $M_n$ from both directions by multiples
of the $p$th moment of $D_n$ for general $p \ge 1$. In some sense, the moments of $M_n$
and the moments of $D_n$ grow in the same order; if one can control the increments of
the martingale sequence, then one can control the maximum. Three such important
bounds are presented in this section for reference and completeness. But first, we
demonstrate the promised connection between $E(X_n^2)$ and $E(D_n^2)$, an interesting
result in its own right.
Proof.
$E(D_n^2) = \sum_{i=1}^{n} E\left[(X_i - X_{i-1})^2\right] = \sum_{i=1}^{n} E\left[X_i(X_i - X_{i-1}) - X_{i-1}(X_i - X_{i-1})\right]$
$= \sum_{i=1}^{n} E\left(E[X_i(X_i - X_{i-1}) \mid X_0, \ldots, X_{i-1}]\right) - \sum_{i=1}^{n} E\left(E[X_{i-1}(X_i - X_{i-1}) \mid X_0, \ldots, X_{i-1}]\right)$
$= \sum_{i=1}^{n} \left\{E\left(E[X_i^2 \mid X_0, \ldots, X_{i-1}]\right) - E\left(X_{i-1} E[X_i \mid X_0, \ldots, X_{i-1}]\right)\right\}$
$\quad - \sum_{i=1}^{n} E\left(X_{i-1} E[X_i \mid X_0, \ldots, X_{i-1}] - X_{i-1}^2\right)$
$= \sum_{i=1}^{n} \left\{E(X_i^2) - E(X_{i-1}^2)\right\} - \sum_{i=1}^{n} E\left(X_{i-1}^2 - X_{i-1}^2\right)$
$= E(X_n^2) - E(X_0^2). \qquad \square$
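The identity can be verified numerically on any square integrable martingale. The sketch below uses $X_k = S_k^2 - k$ built from the $\pm 1$ walk (so $X_0 = 0$, and $E(X_n^2)$ and $E(D_n^2)$ should agree):

```python
import random

random.seed(10)

def one_path(n):
    # Martingale X_k = S_k^2 - k from the +/-1 walk, with X_0 = 0.
    # Returns (X_n^2, D_n^2) for one path.
    s, x_prev, d2 = 0, 0, 0.0
    for k in range(1, n + 1):
        s += random.choice((-1, 1))
        x = s * s - k
        d2 += (x - x_prev) ** 2
        x_prev = x
    return x_prev * x_prev, d2

trials = 100_000
n = 6
sx2 = sd2 = 0.0
for _ in range(trials):
    x2, d2 = one_path(n)
    sx2 += x2
    sd2 += d2
# The identity: E(X_n^2) - E(X_0^2) = E(D_n^2), with X_0 = 0 here
# (both sides equal 2n(n-1) = 60 for this martingale).
assert abs(sx2 / trials - sd2 / trials) < 5.0
```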
Remark. In view of this result, we can restate part (b) of Theorem 14.7 for the case
$p = 2$ in the following manner.
Theorem 14.9. Let $\{X_n, n \ge 0\}$ be a martingale such that $X_0 = 0$ and $E(X_k^2) <
\infty$ for all $k \ge 1$. Let $\lambda$ be any fixed positive number, and for any $n \ge 1$, $M_n =
\max_{0 \le k \le n} |X_k|$. Then,
$P(M_n \ge \lambda) \le \frac{E(D_n^2)}{\lambda^2}.$
The inequalities of Burkholder, Davis, and Gundy show how to establish bounds
on moments of Mn in terms of the same moments of Dn . To describe some of these
bounds, we first need a little notation.
Remark. Note that, apart from the explicit constants, both parts of Theorem 14.10
follow as special cases of this theorem. To our knowledge, no explicit choices of
$c_p$, $C_p$ are known.
The proof of part (b) is exactly the same as that of part (a), and part (c) is an im-
mediate consequence of parts (a) and (b). So only part (a) requires a proof. For this,
we need a classic convexity lemma, originally used in Hoeffding (1963), and then a
generalized version of it. Here is the first lemma.
Remark. It is important to note that the bound in this lemma depends only on $b - a$
and the mean zero assumption, but not on the individual values of $a, b$.
Since $e^{sx}$ is convex in $x$, for $x \in [a, b]$,
$e^{sx} \le \frac{x - a}{b - a}\, e^{sb} + \frac{b - x}{b - a}\, e^{sa}.$
Taking an expectation,
Suppose $V$ and $Z$ are random variables such that $E(V \mid Z) = 0$ and $f(Z) \le V \le f(Z) + c$
for some function $f(Z)$ of $Z$ and some positive constant $c$. Then, for any $s > 0$,
$E(e^{sV} \mid Z) \le e^{\frac{s^2 c^2}{8}}.$
The generalized Hoeffding lemma has the same proof as Hoeffding's lemma itself.
Refer to the remark that we made just before the proof of Hoeffding's lemma. It is
the generalized Hoeffding lemma that gives us Azuma's inequality.
(because $E(V_n \mid X_0, \ldots, X_{n-1}) = 0$ by the martingale property of $\{X_n\}$, and then
by applying the generalized Hoeffding lemma)
$= e^{-st}\, e^{\frac{s^2 c_n^2}{2}}\, E\left(e^{s \sum_{i=1}^{n-1} V_i}\right) \le \cdots \le e^{-st}\, e^{\frac{s^2}{2} \sum_{i=1}^{n} c_i^2},$
where the infimum over $s$ is easily established by a simple calculus argument. This
proves Azuma's inequality. $\square$
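For the $\pm 1$ random walk, the increments are bounded in absolute value by $c_k = 1$, and Azuma's inequality in the form just derived gives $P(S_n \ge t) \le e^{-t^2/(2n)}$. A quick numerical check of this bound:

```python
import math
import random

random.seed(12)

n = 30

def walk():
    # One realization of S_n for the +/-1 walk (a martingale with |increments| <= 1).
    return sum(random.choice((-1, 1)) for _ in range(n))

trials = 100_000
sums = [walk() for _ in range(trials)]
for t in (5, 10, 15):
    p_mc = sum(s >= t for s in sums) / trials
    bound = math.exp(-t * t / (2 * n))  # Azuma: P(S_n >= t) <= exp(-t^2/(2n))
    assert p_mc <= bound
```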
McDiarmid (1989) and Devroye (1991) use novel martingale techniques to derive
concentration inequalities and variance bounds for potentially complicated func-
tions of independent random variables. The only requirement is that the function
should not change by arbitrarily large amounts if all but one of the coordinates
remain fixed. The first result below says that functions of certain types are concen-
trated near their mean value with a high probability.
Proof. Once again, only part (a) is proved, because (b) is proved exactly analo-
gously, and (c) follows by adding the inequalities in (a) and (b). For notational
convenience, we take $E(f)$ to be zero; this allows us to write $f$ in place of $f - E(f)$
below.
The trick is to decompose $f$ as $f = \sum_{k=1}^{n} V_k$, where $\{V_k\}$ is a martingale
difference sequence that can be bounded in both directions, $Z_k \le V_k \le W_k$,
in a manner so that $W_k - Z_k \le c_k$, $k = 1, 2, \ldots, n$. Then Azuma's inequality
applies, and the inequality of this theorem falls out. Construct the random variables
$V_k, Z_k, W_k$ as follows.
Define
To prove the first fact, we use the martingale decomposition as in Theorem 14.13
to get
$\mathrm{Var}(f) = \mathrm{Var}\left(\sum_{i=1}^{n} V_i\right) = E\left[\left(\sum_{i=1}^{n} V_i\right)^2\right]$
$= \sum_{i=1}^{n} E[V_i^2] + 2 \sum\!\sum_{i < j} E[V_i V_j]$
$= \sum_{i=1}^{n} E[V_i^2] + 2 \sum\!\sum_{i < j} E\left(V_i\, E[V_j \mid X_1, \ldots, X_{j-1}]\right)$
$= \sum_{i=1}^{n} E[V_i^2] + 2 \sum\!\sum_{i < j} E(V_i \cdot 0) = \sum_{i=1}^{n} E[V_i^2].$
The two theorems in this section give useful probability and variance bounds
in many complicated problems in which direct evaluation would be essentially
impossible.
$\left|\sup_x |F_n(x) - F_0(x)| - \sup_x |F_n'(x) - F_0(x)|\right| \le \sup_x |F_n(x) - F_n'(x)| \le \frac{1}{n}.$
$P(|\Delta_n - E(\Delta_n)| \ge t) \le 2e^{-2nt^2}$
$\Rightarrow P(|D_n - E(D_n)| \ge t) \le 2e^{-2t^2},$
where $D_n = \sqrt{n}\,\Delta_n$. This concentration inequality holds for every fixed $n$ and $t > 0$, and we do not need
to deal with the exact distribution of $D_n$ to arrive at this inequality.
Again plugging $c_i = \frac{1}{n}$ in Theorem 14.14, we get
$\mathrm{Var}(\Delta_n) \le \frac{1}{4n} \Rightarrow \mathrm{Var}(D_n) \le \frac{1}{4},$
for all $n \ge 1$. Once again, this is an attractive variance bound that is valid for every
$n$, and we do not need to work with the exact distribution of $D_n$ to arrive at this
bound.
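A simulation sketch of the variance bound for the Kolmogorov–Smirnov statistic, assuming $F_0$ is the Uniform$(0,1)$ cdf and using the standard order-statistic formula for $\sup_x |F_n(x) - x|$; the sample variance of $\Delta_n$ should fall below the $1/(4n)$ bound:

```python
import random

random.seed(11)

n = 50

def ks_uniform():
    # KS statistic sup_x |F_n(x) - x| for a Uniform(0,1) sample of size n,
    # computed from the order statistics.
    xs = sorted(random.random() for _ in range(n))
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

trials = 20_000
vals = [ks_uniform() for _ in range(trials)]
mean = sum(vals) / trials
var = sum((v - mean) ** 2 for v in vals) / trials
# McDiarmid/Devroye bound with c_i = 1/n: Var(Delta_n) <= 1/(4n).
assert var <= 1 / (4 * n)
```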
$T_0 = \inf\{k \ge 0 : X_k \le a\};$
$T_{2n+1} = \inf\{k > T_{2n} : X_k \ge b\}, \quad n \ge 0;$
$T_{2n+2} = \inf\{k > T_{2n+1} : X_k \le a\}, \quad n \ge 0.$
The times T1 ; T3 ; : : : are then the instants of upcrossing, and the times T0 ; T2 ; : : :
are the instants of downcrossing. The upcrossing inequality places a bound on the
expected value of Ua;b;N , the number of upcrossings up to the time N . Note that
this is simply the number of odd labels 2n C 1 for which T2nC1 N .
Theorem 14.15. Let $\{X_n, n \ge 0\}$ be a submartingale. Then, for any $a, b, N$ $(a < b)$,
$E[U_{a,b,N}] \le \frac{E(|X_N|) + |a|}{b - a}.$
With the convention $\tau_{-1} = 0$, we have the telescoping identity
$Y_N - Y_0 = \sum_{i=0}^{N} \left(Y_{\tau_i} - Y_{\tau_{i-1}}\right);$
(iii) The sum over the even labels satisfies the inequality
$E\left[\sum_{0 \le i \le N,\ i\ \mathrm{even}} \left(Y_{\tau_i} - Y_{\tau_{i-1}}\right)\right] \ge 0.$
These facts together yield
$E(Y_N - Y_0) \ge (b - a)\, E[V_{0,\,b-a,\,N}],$
where $Y_n = \max(X_n - a, 0)$, and we let
$T_0', T_1', \ldots$ be the downcrossing and upcrossing times of $(0, b - a)$ for the new $\{Y_n\}$
process. Now define $\tau_i = \min(T_i', N)$. The $\tau_i$ are increasing, that is, $\tau_0 \le \tau_1 \le \cdots$,
because the $T_i'$ are. Note that these $\tau_i$ are stopping times adapted to $\{Y_n\}$.
Now look at the sum over the odd labels, namely $(Y_{\tau_1} - Y_{\tau_0}) + (Y_{\tau_3} - Y_{\tau_2}) + \cdots$.
Break this sum further into two subsets of labels, $i \le V = V_{0,\,b-a,\,N}$, and $i > V$.
For each label $i$ in the first subset, $(Y_{\tau_{2i+1}} - Y_{\tau_{2i}}) \ge b - a$, because $Y_{\tau_{2i+1}} \ge b - a$
and $Y_{\tau_{2i}} \le 0$. Adding over these labels, of which there are $V$ many, we get the sum to
be $\ge (b - a)V$. The labels in the other subset can be seen to give a sum $\ge 0$ (just
think of what $V$ means, and a little thinking shows that the rest of the labels produce
a sum $\ge 0$). So, now adding over the two subsets of labels, we get our claimed
inequality in (ii) above.
The claim in (iii) is automatic by the optional stopping theorem, because for each
individual $i$, we will have $E(Y_{\tau_{i-1}}) \le E(Y_{\tau_i})$ (actually, this is a slightly stronger
demand than what the optional stopping theorem says; but it is true).
As was explained above, this completes the argument for the upcrossing
inequality. $\square$
14.4 Convergence of Martingales

Paul Lévy initiated his study of martingales in his search for laws of large numbers
beyond the case of means in the iid case. It turns out that martingales often con-
verge to a limiting random variable, and even convergence of the means or higher
moments can be arranged, provided that our martingale sequence is not allowed to
fluctuate or grow out of control. To see why some such conditions would be needed,
consider the case of the simple symmetric random walk $S_n = \sum_{i=1}^{n} X_i$, where the
$X_i$ are iid taking the values $\pm 1$ with probability $\frac{1}{2}$ each. We know that the simple
symmetric random walk is recurrent, and therefore it comes back infinitely often
to every integer value $x$ with probability one. So $S_n$, although a martingale, does
not converge to some $S_\infty$. The expected value of $|S_n|$ in the simple symmetric ran-
dom walk case is of the order of $c\sqrt{n}$ for some constant $c$, and $c\sqrt{n}$ diverges as
$n \to \infty$. A famous result in martingale theory says that if we can keep $E(|X_n|)$ in
control (i.e., bounded away from $\infty$), then a martingale sequence $\{X_n\}$ will in fact
converge to some suitable $X_\infty$. Furthermore, some such condition is also essentially
necessary for the martingale to converge. We start with an example.
Example 14.23 (Convergence of the Likelihood Ratio). Consider again the likeli-
hood ratio $\Lambda_n = \prod_{i=1}^{n} \frac{f_1(X_i)}{f_0(X_i)}$, where $f_0, f_1$ are two densities and the sequence
$X_1, X_2, \ldots$ is iid from the density $f_0$. We have seen that $\Lambda_n$ is a martingale (see
Example 14.8).
The likelihood ratio $\Lambda_n$ gives a measure of the support in the first $n$ data values
for the density $f_1$. We know $f_0$ to be the true density from which the data values
are coming; therefore we would like the support for $f_1$ to diminish as more data are
accumulated. Mathematically, we would like $\Lambda_n$ to converge to zero as $n \to \infty$.
We recognize that this is therefore a question about convergence of a martingale
sequence, because ƒn , after all, is a martingale if the true density is f0 .
Does ƒn indeed converge (almost surely) to zero? Indeed, it does, and we can
verify it directly, without using any martingale convergence theorems that we have
not yet encountered. Here is why we can verify the convergence directly.
Assume that $f_0, f_1$ are strictly positive for the same set of $x$ values; that is,
$\{x : f_1(x) > 0\} = \{x : f_0(x) > 0\}$. Since $u \to \log u$ is a strictly concave function
on $(0, \infty)$, by Jensen's inequality,
$m = E_{f_0}\left[\log \frac{f_1(X)}{f_0(X)}\right] < \log E_{f_0}\left[\frac{f_1(X)}{f_0(X)}\right] = \log 1 = 0.$
Therefore, with $Z_i = \log \frac{f_1(X_i)}{f_0(X_i)}$, by the strong law of large numbers,
$\frac{1}{n} \log \Lambda_n = \frac{1}{n} \sum_{i=1}^{n} Z_i \xrightarrow{a.s.} m < 0 \ \Rightarrow\ \log \Lambda_n \xrightarrow{a.s.} -\infty$
$\Rightarrow \Lambda_n \xrightarrow{a.s.} 0.$
So, in this example, the martingale $\Lambda_n$ does converge with probability one to a
limiting random variable $\Lambda_\infty$, and $\Lambda_\infty$ happens to be a constant random variable,
equal to zero. We remark that the martingale $\Lambda_n$ satisfies $E(|\Lambda_n|) = E(\Lambda_n) = 1$,
and so, a fortiori, $\sup_n E(|\Lambda_n|) < \infty$. This has something to do with the fact that
$\Lambda_n$ converges in this example, although the random walk, also a martingale, failed
to converge. This is borne out by the next theorem, a famous result in martingale
theory. The proof of this next theorem requires the use of two basic facts in measure
theory, which we state below.
Theorem 14.16 (Fatou’s Lemma). Let Xn ; n 1 and X be random variables
defined on a common sample space . Suppose each Xn is nonnegative with prob-
a:s:
ability one, and suppose Xn ! X . Then, lim infn E.Xn / E.X /.
Theorem 14.17 (Monotone Convergence Theorem). Let $X_n, n \geq 1$, and $X$ be random variables defined on a common sample space $\Omega$. Suppose each $X_n$ is nonnegative with probability one, that $X_1 \leq X_2 \leq X_3 \leq \cdots$, and $X_n \stackrel{a.s.}{\longrightarrow} X$. Then $E(X_n) \uparrow E(X)$.
Theorem 14.18 (Submartingale Convergence Theorem). (a) Let $\{X_n\}$ be a submartingale such that $\sup_n E(X_n^+) = c < \infty$. Then there exists a random variable $X = X_\infty$, almost surely finite, such that $X_n \stackrel{a.s.}{\longrightarrow} X$.
(b) Let $\{X_n\}$ be a nonnegative supermartingale, or a nonpositive submartingale. Then there exists a random variable $X = X_\infty$, almost surely finite, such that $X_n \stackrel{a.s.}{\longrightarrow} X$.
492 14 Discrete Time Martingales and Concentration Inequalities
Proof. The proof uses the upcrossing inequality, the monotone convergence theo-
rem, and Fatou’s lemma. The key idea is first to show that under the hypothesis
of the theorem, the process fXn g cannot fluctuate indefinitely between two given
numbers a; b; a < b. Then a standard analytical technique of approximation by ra-
tionals, and use of the monotone convergence theorem and Fatou’s lemma produces
the submartingale convergence theorem. Here are the steps of the proof. Let $U_{a,b,N}$ denote the number of upcrossings of the interval $[a, b]$ by $X_1, \ldots, X_N$, let $U_{a,b} = \lim_N U_{a,b,N}$, let $\Theta_{a,b}$ be the event that $X_n$ fluctuates indefinitely between $a$ and $b$, and let $\Theta$ be the event that $\lim_n X_n$ does not exist as an extended real number. By the upcrossing inequality and the monotone convergence theorem,
$$E[U_{a,b,N}] \leq \frac{E(|X_N|) + |a|}{b - a} \;\Rightarrow\; E[U_{a,b}] = \lim_N E[U_{a,b,N}] \leq \frac{\limsup_N E(|X_N|) + |a|}{b - a} < \infty.$$
This means that $U_{a,b}$ must be finite with probability one (i.e., it cannot equal $\infty$ with a positive probability).
Next, note that $\Theta \subseteq \bigcup_{\{a<b;\,a,b\ \text{rational}\}} \Theta_{a,b}$, and because we now have that $P(\Theta_{a,b}) = 0$ for any specific pair $a < b$, $P\left(\bigcup_{\{a<b;\,a,b\ \text{rational}\}} \Theta_{a,b}\right)$ must also be zero. This then implies that $P(\Theta) = 0$, which establishes the existence of an almost sure limit for the sequence $X_n$.
However, a subtle point still remains. The limit, $X$, could be $\infty$ or $-\infty$ with a positive probability. We use Fatou's lemma to rule out that possibility. Indeed, since $E(|X_n|) = 2E(X_n^+) - E(X_n) \leq 2c - E(X_1)$ for a submartingale, by Fatou's lemma,
$$E(|X|) \leq \liminf_n E(|X_n|) < \infty,$$
and so $X$ must be finite with probability one. This finishes the proof of part (a) of the submartingale convergence theorem.
Part (b) is an easy consequence of part (a). For example, if $\{X_n\}$ is a nonpositive submartingale, then $\sup_n E(X_n^+) = 0 < \infty$, and part (a) applies directly.
The basic convergence theorem that we just proved says that an $L_1$ bounded submartingale converges to some random variable $X$. It is a bit disappointing that the apparently strong hypothesis that the submartingale is $L_1$ bounded is not strong enough to ensure convergence of the expectations: $E(X_n)$ need not converge to $E(X)$ in spite of the $L_1$ boundedness assumption. A slightly stronger control on the growth of the submartingale sequence is needed to ensure convergence of expectations, in addition to the convergence of the submartingale itself. For example, $\sup_n E(|X_n|^p) < \infty$ for some $p > 1$ will suffice. A condition of this sort immediately reminds us of uniform integrability. Indeed, if $\sup_n E(|X_n|^p) < \infty$ for some $p > 1$, then $\{X_n\}$ will be uniformly integrable. It turns out that uniform integrability will be enough to assure us of convergence of the expectations in the basic convergence theorem, and it is almost the minimum that we can get away with. Statisticians are often interested in convergence of variances also. That is a stronger demand, and requires a stronger hypothesis. The next theorem records the conclusions on these issues. For reasons of space, this next theorem is not proved. One can see a proof in Fristedt and Gray (1997, p. 480).
Example 14.24 (Pólya's Urn). We previously saw that the proportion of white balls in Pólya's urn, namely $R_n = \frac{a + S_n}{a + b + n}$, forms a martingale (see Example 14.6). This is an example in which the various convergences that we may want come easily. Because $R_n$ is obviously a uniformly bounded sequence, by the theorem stated above, $R_n$ converges almost surely and in $L^2$ (and therefore, in $L^1$) to a limiting random variable $R$, taking values in $[0, 1]$.
Neither the basic (sub)martingale convergence theorem nor the theorem in this section helps us in any way to identify the distribution of $R$. In fact, in this case, $R$ has a nondegenerate distribution, which is a Beta distribution with parameters $a$ and $b$. As a consequence of this, $E(R_n) \to \frac{a}{a+b}$ and $\mathrm{Var}(R_n) \to \frac{ab}{(a+b)^2(a+b+1)}$ as $n \to \infty$. A proof that $R$ has a Beta distribution with parameters $a, b$ is available in DasGupta (2010).
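A quick simulation corroborates the Beta limit; a minimal sketch, with the illustrative choice $a = 3, b = 2$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Polya's urn: a white and b black balls to start; each ball drawn is
# replaced together with one extra ball of the same color.
a, b, n_steps, n_paths = 3, 2, 500, 4000
white = np.full(n_paths, float(a))
total = float(a + b)
for _ in range(n_steps):
    draw_white = rng.random(n_paths) < white / total
    white += draw_white
    total += 1.0

R = white / total   # proportion of white balls, close to its a.s. limit

# The limit is Beta(a, b): mean a/(a+b) = 0.6, variance ab/((a+b)^2(a+b+1)) = 0.04
print(R.mean(), R.var())
```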
Example 14.25 (Bayes Estimates). We saw in Example 14.9 that the sequence of
Bayes estimates (namely, the mean of the posterior distribution of the parameter)
is a martingale adapted to the sequence of data values $\{X_n\}$. Continuing with the same notation as in Example 14.9, $Z_n = E(Y \mid X^{(n)})$ is our martingale sequence. Assume that the prior distribution for the parameter has a finite variance; that is, $E(Y^2) < \infty$. Then, by using Jensen's inequality for conditional expectations,
$$E(Z_n^2) = E\left[\left\{E(Y \mid X^{(n)})\right\}^2\right] \leq E\left[E(Y^2 \mid X^{(n)})\right] = E(Y^2) < \infty.$$
Hence, by the theorem above in this section, the sequence of Bayes estimates $Z_n$ converges to some $Z$ almost surely, and moreover the mean and the variance of $Z_n$ converge to the mean and the variance of $Z$.
A natural followup question is: what exactly is this limiting random variable $Z$? We can only give partial answers in general. For example, for each $n$, $E(Z \mid X^{(n)}) = Z_n$ with probability one. It is tempting to conclude from here that $Z$ is the same as $Y$ with probability one. This will be the case if knowledge of the entire infinite data sequence $X_1, X_2, \ldots$ pins down $Y$ completely, that is, if it is the case that someone who knows the infinite data sequence also knows $Y$ with probability one.
Partial sums of iid random variables are of basic interest in many problems in proba-
bility, such as the study of random walks, and as we know, the sequence of centered
partial sums forms a martingale. On the other hand, the sequence of sample means is
of fundamental interest in statistics; but the sequence of means does not form a mar-
tingale. Interestingly, if we measure time backwards, then the sequence of means
does form a martingale, and then the rich martingale theory once again comes into
play. This motivates the concept of a reverse martingale.
Then, by the exchangeability property of the $\{X_n\}$ sequence, for any given $n$ and any $k, 1 \leq k \leq n$,
$$\bar{X}_n = E(\bar{X}_n \mid S_n, S_{n+1}, \ldots) = \frac{1}{n}\sum_{i=1}^n E(X_i \mid S_n, S_{n+1}, \ldots) = \frac{1}{n}\,n\,E(X_k \mid S_n, S_{n+1}, \ldots) = E(X_k \mid S_n, S_{n+1}, \ldots).$$
Consequently,
$$E(\bar{X}_{n-1} \mid S_n, S_{n+1}, \ldots) = \frac{1}{n-1}\sum_{k=1}^{n-1} E(X_k \mid S_n, S_{n+1}, \ldots) = \frac{1}{n-1}(n-1)\bar{X}_n = \bar{X}_n,$$
which shows that the sequence of sample means is a reverse martingale (adapted to
the sequence of partial sums).
There is a useful convex function theorem for reverse martingales as well, which
is straightforward to prove.
Theorem 14.20 (Second Convex Function Theorem). Let $\{X_n\}$ be a sequence of random variables defined on some sample space $\Omega$, and $f$ a convex function. Let $Z_n = f(X_n)$.
(a) If $\{X_n\}$ is a reverse martingale, then $\{Z_n\}$ is a reverse submartingale.
(b) If $\{X_n\}$ is a reverse submartingale, and $f$ is also nondecreasing, then $\{Z_n\}$ is a reverse submartingale.
(c) If $\{X_{n,m}\}, m = 1, 2, \ldots$, is a countable family of reverse submartingales, defined on the same space $\Omega$ and all adapted to the same sequence, then $\{\sup_m X_{n,m}\}$ is also a reverse submartingale, adapted to the same sequence.
Example 14.27 (A Paradoxical Statistical Consequence). Suppose $Y$ is some real-valued random variable with mean $\mu$, and that we do not know the true value of $\mu$. Thus, we would like to estimate $\mu$. But, suppose that we cannot take any observations on the variable $Y$ (for whatever reason). We can, however, take observations on a completely unrelated random variable $X$, where $E(|X|) < \infty$. Suppose we do take $n$ iid observations on $X$. Call them $X_1, X_2, \ldots, X_n$ and let $\bar{X}_n$ be their mean. Then, by part (a) of the second convex function theorem, $|\bar{X}_n - \mu|$ forms a reverse submartingale, and hence $E(|\bar{X}_n - \mu|)$ is monotone nonincreasing in $n$. In other words, $E(|\bar{X}_{n+1} - \mu|) \leq E(|\bar{X}_n - \mu|)$ for all $n$, and so taking more observations on the useless variable $X$ is going to be beneficial for estimating the mean of $Y$, a comical conclusion.
Note that there is really nothing special about using the absolute difference $|\bar{X}_n - \mu|$ as the criterion for the accuracy of estimation of $\mu$. The standard terminology in statistics for the criterion to be used is a loss function, and loss functions $L(\bar{X}_n, \mu)$ with a convexity property with respect to $\bar{X}_n$ for any fixed $\mu$ will result in the same paradoxical conclusion. One needs to make sure that $E[L(\bar{X}_n, \mu)]$ is finite.
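The monotone decrease of the risk shows up clearly in simulation; a minimal sketch, with the illustrative choices $X \sim N(0,1)$ and $\mu = 0.5$:

```python
import numpy as np

rng = np.random.default_rng(2)

# E|Xbar_n - mu| is nonincreasing in n even though X is unrelated to mu.
mu = 0.5
n_max, n_paths = 50, 100_000
x = rng.normal(0.0, 1.0, size=(n_paths, n_max))
means = np.cumsum(x, axis=1) / np.arange(1, n_max + 1)
risk = np.abs(means - mu).mean(axis=0)   # Monte Carlo estimate of E|Xbar_n - mu|

print(risk[0], risk[-1])   # the risk only goes down as n grows
```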
$m, n \geq 1$. Note that $\bar{X}_k$ and $Y_{m,n}$ are independent for any $m$, $k \leq m$, and any $n$, and that, furthermore, for any fixed $m$, $Y_{m,n}$ converges almost surely to $Y$ (the same $Y$ as above) as $n \to \infty$. These two facts together imply
$$P\left(Y \leq y,\ \max_{1 \leq k \leq m}\bar{X}_k \leq y\right) = P(Y \leq y)\,P\left(\max_{1 \leq k \leq m}\bar{X}_k \leq y\right)$$
$$\Rightarrow\; P(Y \leq y) = P(Y \leq y)\,P(Y \leq y) = [P(Y \leq y)]^2,$$
For an iid mean zero sequence of random variables $Z_1, Z_2, \ldots$ with variance one, the central limit theorem says that for large $n$, $\frac{Z_1 + \cdots + Z_n}{\sqrt{n}}$ is approximately standard normal. Suppose now that we consider a mean zero martingale $\{X_n, n \geq 0\}$ (adapted to some sequence $\{Y_n\}$) with $X_0 = 0$, and write $Z_i = X_i - X_{i-1}, i \geq 1$. Then, obviously we can write
$$X_n = X_n - X_0 = \sum_{i=1}^n (X_i - X_{i-1}) = \sum_{i=1}^n Z_i.$$
The summands $Z_i$ are certainly no longer independent; however, they are uncorrelated (see the chapter exercises). The martingale central limit theorem says that under certain conditions on the growth of the conditional variances $\mathrm{Var}(Z_n \mid Y_0, \ldots, Y_{n-1})$, $\frac{X_n}{\sqrt{n}}$ will still be approximately normally distributed for large $n$.
The area of martingale central limit theorems is a bit confusing due to an over-
whelming variety of central limit theorems, each known as a martingale central
limit theorem. In particular, the normalization of Xn can be deterministic or ran-
dom. Also, there can be a double array of martingales and central limit theorems
for them, analogous to Lyapounov’s central limit theorem for the independent case.
The best source and exposition of martingale central limit theorems is the classic
book by Hall and Heyde (1980). We present two specific martingale central limit
theorems in this section.
First, we need some notation. Let $\{X_n, n \geq 0\}$ be a zero mean martingale adapted to some sequence $\{Y_n\}$, with $X_0 = 0$. Let
$$Z_i = X_i - X_{i-1},\ i \geq 1; \qquad \sigma_j^2 = \mathrm{Var}(Z_j \mid Y_0, \ldots, Y_{j-1}) = E(Z_j^2 \mid Y_0, \ldots, Y_{j-1});$$
$$V_n^2 = \sum_{j=1}^n \sigma_j^2; \qquad s_n^2 = E(V_n^2)$$
(see Section 14.3.2 for the fact that $E(V_n^2)$ and $E(X_n^2)$ are equal if $X_0 = 0$).
The desired result is that $\frac{X_n}{s_n}$ converges in distribution to $N(0,1)$. The question is under what conditions one can prove such an asymptotic normality result. The conditions that we use are very similar to the corresponding Lindeberg–Lévy conditions in the independent case. Here are the two conditions we assume.
(A) Concentration Condition
$$\frac{V_n^2}{s_n^2} = \frac{V_n^2}{E(V_n^2)} \stackrel{P}{\longrightarrow} 1.$$
Under condition (A), the Lindeberg condition (B) is nearly equivalent to the uniform asymptotic negligibility condition that $\frac{\max_{1 \leq j \leq n}\sigma_j^2}{s_n^2} \stackrel{P}{\longrightarrow} 0$. We commonly see such uniform asymptotic negligibility conditions in the independent case central limit theorems. See Hall and Heyde (1980) and Brown (1971) for much additional discussion on the exact role of the Lindeberg condition in martingale central limit theorems. Here is our basic martingale CLT.
Theorem 14.22 (Basic Martingale CLT). Suppose conditions (A) and (B) hold. Then $\frac{X_n}{s_n} \stackrel{\mathcal{L}}{\Longrightarrow} Z$, where $Z \sim N(0,1)$.
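A martingale with genuinely dependent increments illustrates the theorem; a minimal sketch, using the illustrative construction $Z_i = \sqrt{1 + \frac{1}{2}\epsilon_{i-1}}\,\epsilon_i$ with iid signs $\epsilon_i = \pm 1$, for which $\sigma_j^2 = 1 + \frac{1}{2}\epsilon_{j-1}$ and conditions (A) and (B) clearly hold:

```python
import numpy as np

rng = np.random.default_rng(3)

# Martingale increments: mean zero given the past, dependent, with
# conditional variance 1 + 0.5*eps_{i-1}; hence s_n^2 = E(V_n^2) = n.
n, n_paths = 200, 20_000
eps = rng.choice([-1.0, 1.0], size=(n_paths, n + 1))
w = np.sqrt(1.0 + 0.5 * eps[:, :-1])     # predictable weights
X_n = (w * eps[:, 1:]).sum(axis=1)

s_n = np.sqrt(n)
frac = np.mean(np.abs(X_n / s_n) <= 1.96)
print(frac)   # close to 0.95, as the martingale CLT predicts
```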
The proof of the Lindeberg–Lévy theorem for the independent case has to be
suitably adapted to the martingale structure in order to prove this theorem. The
two references above can be consulted for a proof. The Lindeberg condition can be
difficult to verify. The following simpler version of martingale central limit theorems
suffices for many applications. For this, we need the additional notation
$$\tau_t = \inf\left\{n > 0 : \sum_{j=1}^n \sigma_j^2 \geq t\right\}.$$
Exercises
Exercise 14.4. Let $\{U_n\}, \{V_n\}$ be martingales, adapted to the same sequence $\{Y_n\}$. Identify, with proof, which of the following are also submartingales, and for those that are not necessarily submartingales, give a counterexample.
(a) $|U_n - V_n|$.
(b) $U_n^2 + V_n^2$.
(c) $U_n V_n$.
(d) $\min(U_n, V_n)$.
Exercise 14.11 (Branching Process). Let $\{Z_{ij}\}$ be a double array of iid nonnegative integer-valued random variables with mean $\mu$ and variance $\sigma^2 < \infty$. Let $X_0 = 1$ and $X_{n+1} = \sum_{j=1}^{X_n} Z_{nj}$. Show that
(a) $W_n = \frac{X_n}{\mu^n}$ is a martingale.
(b) $\sup_n E(W_n) < \infty$.
(c) Is $\{W_n\}$ uniformly integrable? Prove or disprove it.
Remark. The process $X_n$ is commonly called a branching process and is important in population studies.
Exercise 14.12 (A Time Series Model). Let $Z_0, Z_1, \ldots$ be iid standard normal variables. Let $X_0 = Z_0$, and for $n \geq 1$, $X_n = X_{n-1} + Z_n h_n(X_0, \ldots, X_{n-1})$, where for each $n$, $h_n(x_0, \ldots, x_{n-1})$ is an absolutely bounded function. Show that $\{X_n\}$ is a martingale adapted to some sequence $\{Y_n\}$, and explicitly identify such a sequence $\{Y_n\}$.
Exercise 14.13 (Another Time Series Model). Let $Z_0, Z_1, \ldots$ be a sequence of random variables such that $E(Z_{n+1} \mid Z_0, \ldots, Z_n) = cZ_n + (1 - c)Z_{n-1}, n \geq 1$, where $0 < c < 1$. Let $X_0 = Z_0$, $X_n = \alpha Z_n + Z_{n-1}, n \geq 1$. Show that $\alpha$ may be chosen to make $\{X_n, n \geq 0\}$ a martingale with respect to $\{Z_n\}$.
Exercise 14.14 (Conditional Centering of a General Sequence). Let $Z_0, Z_1, \ldots$ be a general sequence of random variables, not necessarily independent, such that $E(|Z_k|) < \infty$ for all $k$. Let $V_n = \sum_{i=1}^n [Z_i - E(Z_i \mid Z_0, \ldots, Z_{i-1})], n \geq 1$. Show that $\{V_n\}$ is a martingale with respect to the sequence $\{Z_n\}$.
Exercise 14.15 (The Cross-Product Martingale). Let $X_1, X_2, \ldots$ be independent random variables, with $E(|X_i|) < \infty$ and $E(X_i) = 0$ for all $i$. For a fixed $k \geq 1$, let $V_{k,n} = \sum_{1 \leq i_1 < i_2 < \cdots < i_k \leq n} X_{i_1}\cdots X_{i_k}, n \geq k$. Show that $\{V_{k,n}\}$ is a martingale with respect to $\{X_n\}$.
Exercise 14.16 (The Wright–Fisher Markov Chain). Consider the Wright–Fisher Markov chain of Example 14.7. Let
$$V_n = \frac{X_n(N - X_n)}{\left(1 - \frac{1}{N}\right)^n},\ n \geq 0.$$
Show that $\{V_n\}$ is a martingale with respect to $\{X_n\}$.
Exercise 14.19 (Stopping Times). Consider the simple symmetric random walk $S_n$ with $S_0 = 0$. Identify, with proof, which of the following are stopping times, and which among them have a finite expectation.
(a) $\inf\{n > 0 : |S_n| > 5\}$.
(b) $\inf\{n \geq 0 : S_n < S_{n+1}\}$.
(c) $\inf\{n > 0 : |S_n| = 1\}$.
(d) $\inf\{n > 0 : |S_n| > 1\}$.
Exercise 14.21. Suppose $\tau_1, \tau_2$ are both stopping times with respect to some sequence $\{X_n\}$. Is $|\tau_1 - \tau_2|$ necessarily a stopping time with respect to $\{X_n\}$?
Exercise 14.23 (The Random Walk). Consider the asymmetric random walk $S_n = \sum_{i=1}^n X_i$, where $P(X_i = 1) = p$, $P(X_i = -1) = q = 1 - p$, $p > \frac{1}{2}$, and $S_0 = 0$. Let $x$ be a fixed positive integer, and $\tau = \inf\{n > 0 : S_n = x\}$. Show that for $0 < s < 1$, $E(s^\tau) = \left(\frac{1 - \sqrt{1 - 4pqs^2}}{2qs}\right)^x$.
Exercise 14.24 (The Random Walk; continued). For the stopping time $\tau$ of the previous exercise, show that
$$E(\tau) = \frac{x}{p - q} \quad \text{and} \quad \mathrm{Var}(\tau) = \frac{x\left[1 - (p - q)^2\right]}{(p - q)^3}.$$
Exercise 14.25 (Gambler's Ruin). Consider the general random walk $S_n = \sum_{i=1}^n X_i$, where $P(X_i = 1) = p \neq \frac{1}{2}$, $P(X_i = -1) = q = 1 - p$, and $S_0 = 0$. Let $a, b$ be fixed positive integers, and $\tau = \inf\{n > 0 : S_n = b \text{ or } S_n = -a\}$. Show that
$$E(\tau) = \frac{b}{p - q} - \frac{a + b}{p - q}\,\frac{(q/p)^a\left[1 - (q/p)^b\right]}{1 - (q/p)^{a+b}},$$
and that by an application of L'Hospital's rule, this gives the correct formula for $E(\tau)$ even when $p = \frac{1}{2}$.
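The expected duration can be corroborated by simulation; a minimal sketch, with the illustrative choices $p = 0.6, a = 3, b = 4$ (the closed form coded below is the standard gambler's-ruin expected duration for barriers at $b$ and $-a$):

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulate tau = first time the walk hits b or -a, and compare the sample
# mean with E(tau) = b/(p-q) - ((a+b)/(p-q))*(q/p)^a*(1-(q/p)^b)/(1-(q/p)^(a+b)).
p, a, b, n_paths = 0.6, 3, 4, 20_000
q = 1.0 - p
taus = np.empty(n_paths)
for k in range(n_paths):
    s, t = 0, 0
    while -a < s < b:
        s += 1 if rng.random() < p else -1
        t += 1
    taus[k] = t

r = q / p
exact = b / (p - q) - (a + b) / (p - q) * r**a * (1 - r**b) / (1 - r**(a + b))
print(taus.mean(), exact)   # both near 11.16
```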
Let $X_1, X_2, \ldots$ be iid Bernoulli($p$) variables, let $\tau$ be the waiting time until three consecutive 1s are observed, and let
$$V_n = \frac{1}{p^3}\sum_{i=1}^{n-2} I_{\{X_i = X_{i+1} = X_{i+2} = 1\}} + \frac{1}{p^2}I_{\{X_{n-1} = X_n = 1\}} + \frac{1}{p}I_{\{X_n = 1\}},\ n \geq 3.$$
(a) Show that $\{V_n - n\}$ is a martingale with respect to the sequence $\{X_n\}$.
(b) Show that $E(V_{\tau \wedge n}) = E(\tau \wedge n)$ for all $n$.
(c) Hence, show that
$$E(\tau) = E(V_\tau) = \frac{1}{p} + \frac{1}{p^2} + \frac{1}{p^3}.$$
(d) Generalize to the case of the expected waiting time for obtaining r consecu-
tive 1s.
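Part (c) is easy to corroborate by simulation; a minimal sketch, with the illustrative choice $p = \frac{1}{2}$, for which $E(\tau) = 2 + 4 + 8 = 14$:

```python
import numpy as np

rng = np.random.default_rng(5)

# Waiting time for three consecutive 1s in iid Bernoulli(p) trials.
p, n_paths = 0.5, 40_000
waits = np.empty(n_paths)
for k in range(n_paths):
    run, t = 0, 0
    while run < 3:
        t += 1
        run = run + 1 if rng.random() < p else 0
    waits[k] = t

exact = 1 / p + 1 / p**2 + 1 / p**3   # = 14 when p = 1/2
print(waits.mean(), exact)
```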
Exercise 14.28. Let $\{X_n, n \geq 0\}$ be a martingale.
(a) Show that $\lim_{n\to\infty} E(|X_n|)$ exists.
(b) Show that for any stopping time $\tau$, $E(|X_\tau|) \leq \lim_{n\to\infty} E(|X_n|)$.
(c) Show that if $\sup_n E(|X_n|) < \infty$, then $E(|X_\tau|) < \infty$ for any stopping time $\tau$.
Exercise 14.29 (Inequality for Stopped Martingales). Let $\{X_n, n \geq 0\}$ be a martingale, and $\tau$ a stopping time adapted to $\{X_n\}$. Show that
$$E(|X_\tau|) \leq 2\sup_n E(X_n^+) - E(X_1) \leq 3\sup_n E(|X_n|).$$
References
Azuma, K. (1967). Weighted sums of certain dependent random variables, Tohoku Math. J., 19,
357–367.
Brown, B.M. (1971). Martingale central limit theorems, Ann. Math. Statist., 42, 59–66.
Burkholder, D.L. (1973). Distribution function inequalities for martingales, Ann. Prob., 1, 19–42.
Burkholder, D.L., Davis, B., and Gundy, R. F. (1972). Integral inequalities for convex functions
of operators on martingales, Proc. Sixth Berkeley Symp. Math. Statist. Prob., Vol. II, 223–240,
University of California Press, Berkeley.
Chow, Y.S. and Teicher, H. (2003). Probability Theory: Independence, Interchangeability, Martin-
gales, Springer, New York.
Chung, K.L. (1974). A Course in Probability Theory, Academic Press, New York.
DasGupta, A. (2010). Fundamentals of Probability: A First Course, Springer, New York.
Davis, B. (1970). On the integrability of the martingale square function, Israel J. Math., 8,
187–190.
Devroye, L. (1991). Exponential Inequalities in Nonparametric Estimation, Nonparametric Func-
tional Estimation and Related Topics, 31–44, Kluwer Acad. Publ., Dordrecht.
Doob, J.L. (1971). What is a martingale?, Amer. Math. Monthly, 78, 451–463.
Fristedt, B. and Gray, L. (1997). A Modern Approach to Probability Theory, Birkhäuser, Boston.
Hall, P. and Heyde, C. (1980). Martingale Limit Theory and Its Applications, Academic Press,
New York.
Heyde, C. (1972). Martingales: A case for a place in a statistician’s repertoire, Austr. J. Statist., 14,
1–9.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables, J. Amer. Statist. Assoc., 58, 13–30.
Karatzas, I. and Shreve, S. (1991). Brownian Motion and Stochastic Calculus, Springer, New York.
Karlin, S. and Taylor, H.M. (1975). A First Course in Stochastic Processes, Academic Press, New
York.
McDiarmid, C. (1989). On the Method of Bounded Differences, Surveys in Combinatorics, London
Math. Soc. Lecture Notes, 141, 148–188, Cambridge University Press, Cambridge, UK.
Williams, D. (1991). Probability with Martingales, Cambridge University Press, Cambridge, UK.
Chapter 15
Probability Metrics
As we said above, there are numerous metrics and distances on probability mea-
sures. The choice of the metric depends on the need in a specific situation. No single
metric or distance is the best or the most preferable. There is also the very important
issue of analytic tractability and ease of computing. Some of the metrics are more easily bounded, and some less so. Some of them are hard to compute. Our choice of
metrics and distances to cover in this chapter is guided by all these factors, and also
personal preferences. The definitions of the metrics and distances are given below.
However, we must first precisely draw the distinction between metrics and distances.
For probability measures $P, Q$ on a countable set, with mass functions $p, q$, the total variation distance is
$$\rho(P, Q) = \rho(p, q) = \frac{1}{2}\sum_i |p(i) - q(i)|.$$
Basic properties and interrelationships of these distances and metrics are now
studied.
Theorem 15.1. (a) Let $P_n, P$ be probability measures on $\mathbb{R}$ such that $d(P_n, P) \to 0$ as $n \to \infty$. Then $P_n$ converges weakly (i.e., in distribution) to $P$. If the CDF of $P$ is continuous, then the converse is also true.
(b) $\rho(P, Q) = \sup_B |P(B) - Q(B)|$, where the supremum is taken over all (Borel) sets $B$ in $\mathbb{R}^d$.
(c) Given probability measures $P_1, P_2$, $\rho(P_1, P_2)$ satisfies the coupling identity
$$\rho(P_1, P_2) = \inf\{P(X \neq Y) : X \sim P_1,\ Y \sim P_2\},$$
the infimum being over all jointly distributed $(X, Y)$ with the given marginals. Furthermore,
$$\rho(P_1, P_2) = \frac{1}{2}\sup_{h : |h| \leq 1}|E_{P_1}(h) - E_{P_2}(h)|.$$
(d) Let $\mathcal{H}$ denote the family of all functions $h : \mathbb{R} \to \mathbb{R}$ with the Lipschitz norm bounded by one; that is,
$$\mathcal{H} = \left\{h : \sup_{x \neq y}\frac{|h(x) - h(y)|}{|x - y|} \leq 1\right\}.$$
Then,
$$W(P, Q) = \sup\{|E_P(h) - E_Q(h)| : h \in \mathcal{H}\}.$$
(e) Let $P_n, P$ be probability measures on $\mathbb{R}^d$. Then $P_n$ converges weakly to $P$ if and only if $L(P_n, P) \to 0$.
(f) If any of $\rho(P_n, P), W(P_n, P), D(P_n, P), H(P_n, P), K(P_n, P) \to 0$, then $P_n$ converges weakly to $P$.
(g) The following converses of part (f) are true.
(i) $P_n$ converges weakly to $P \Rightarrow \rho(P_n, P) \to 0$ if $P_n, P$ are absolutely continuous and unimodal probability measures on $\mathbb{R}$, or if $P_n, P$ are discrete with mass functions $p_n, p$.
(ii) $P_n$ converges weakly to $P \Rightarrow W(P_n, P) \to 0$ if $P_n, P$ are all supported on a bounded set in $\mathbb{R}$.
(iii) $P_n$ converges weakly to $P \Rightarrow H(P_n, P) \to 0$ if $P_n, P$ are absolutely continuous and unimodal probability measures on $\mathbb{R}$, or if $P_n, P$ are discrete with mass functions $p_n, p$.
Proof. Due to the long nature of the proofs, we only refer to the proofs of parts
(c)–(g). Parts (a) and (b) are proved below. We refer to Gibbs and Su (2002) for
parts (c), (e), and (f). Part (d) is proved in Dudley (2002). Part (i) in (g) is a result
in Ibragimov (1956). Part (ii) in (g) is a consequence of Theorem 2 in Gibbs and
Su (2002), and part (e) in this theorem. The first statement in part (iii) in (g) is a
15.2 Basic Properties of the Metrics 509
We can also switch $P, Q$ in the above argument to show that $\sup_B (Q(B) - P(B)) = \frac{1}{2}\int |f(x) - g(x)|\,dx$, and therefore the statement in part (b) follows.
u
In fact, quite evidently, the inequality generalizes to the case of $k$-variate joint distributions, $\rho(P, Q) \leq \sum_{i=1}^k \rho(P_i, Q_i)$, as long as we make the assumption of independence of the coordinates in the $k$-variate distributions.
Next consider the Kullback–Leibler distance between joint distributions, under the same assumption of independence of the coordinates. Consider the bivariate case for ease. Then,
$$K(P, Q) = \int\int f_1(x_1)f_2(x_2)\log\frac{f_1(x_1)f_2(x_2)}{g_1(x_1)g_2(x_2)}\,dx_1\,dx_2$$
$$= \int\int f_1(x_1)f_2(x_2)\left[\log f_1(x_1) + \log f_2(x_2) - \log g_1(x_1) - \log g_2(x_2)\right]dx_1\,dx_2$$
$$= \int f_1(x_1)\log f_1(x_1)\,dx_1 + \int f_2(x_2)\log f_2(x_2)\,dx_2 - \int f_1(x_1)\log g_1(x_1)\,dx_1 - \int f_2(x_2)\log g_2(x_2)\,dx_2$$
$$= \int f_1(x_1)\log\frac{f_1(x_1)}{g_1(x_1)}\,dx_1 + \int f_2(x_2)\log\frac{f_2(x_2)}{g_2(x_2)}\,dx_2 = K(P_1, Q_1) + K(P_2, Q_2).$$
Once again, the result generalizes to the case of a general $k$; that is, under the assumption of independence of the coordinates of the $k$-variate joint distribution, $K(P, Q) = \sum_{i=1}^k K(P_i, Q_i)$.
A formula connecting H.P; Q/, the Hellinger distance, to H.Pi ; Qi / is also
possible, and is a chapter exercise.
Example 15.2 (Hellinger Distance Between Two Normal Distributions). The total variation distance between two general univariate normal distributions was worked out in Example 7.36. It was, in fact, somewhat involved. In comparison, the Hellinger and the Kullback–Leibler distances between two normal distributions are easy to find. We work out a formula for the Hellinger distance in this example. The Kullback–Leibler case is a chapter exercise.
$$= \frac{1}{(2\pi)^{d/2}}\int_{\mathbb{R}^d} e^{-\frac{1}{2}\left(x - \frac{\delta}{2}\right)'\left(x - \frac{\delta}{2}\right)}\,dx\; e^{-\frac{\delta'\delta}{8}} = e^{-\frac{\delta'\delta}{8}}.$$
This gives $H(P, Q) = \sqrt{2\left(1 - e^{-\frac{\delta'\delta}{8}}\right)}$. For the general case, when $P$ and $Q$ have a common covariance matrix $\Sigma$ and mean difference $\delta$, the same calculation gives $H(P, Q) = \sqrt{2\left(1 - e^{-\frac{\delta'\Sigma^{-1}\delta}{8}}\right)}$.
We recall from Chapter 7 that in the univariate case when the variances are equal, the total variation distance between $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$ has the simple formula $2\Phi\left(\frac{|\mu_2 - \mu_1|}{2\sigma}\right) - 1$. The Hellinger distance and the total variation distance between $N(0,1)$ and $N(\mu, 1)$ distributions in one dimension are plotted in Fig. 15.1 for visual comparison; the similarity of the two shapes is interesting.
Fig. 15.1 Plot of Hellinger and variation distance between $N(0,1)$ and $N(\mu, 1)$
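For $N(0,1)$ versus $N(\mu, 1)$, the closed form is $H = \sqrt{2\left(1 - e^{-\mu^2/8}\right)}$; a quick numerical check (the value $\mu = 1.5$ is an illustrative choice):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# H^2(P, Q) = 2 (1 - int sqrt(p q)); compare numerical integration with the
# closed form for P = N(0,1), Q = N(mu, 1).
mu = 1.5
affinity, _ = quad(lambda x: np.sqrt(norm.pdf(x) * norm.pdf(x, loc=mu)),
                   -np.inf, np.inf)
H_numeric = np.sqrt(2 * (1 - affinity))
H_closed = np.sqrt(2 * (1 - np.exp(-mu**2 / 8)))
print(H_numeric, H_closed)   # agree to many decimal places
```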
The numerator of this expression is a cubic, thus it follows that it can be zero at at most three values of $x_0$, and therefore the mean value theorem implies that the number of roots of the equation $f(x_0) = g(x_0)$ can be at most four. Let $-\infty < x_{1,\mu,\sigma} \leq x_{2,\mu,\sigma} \leq x_{3,\mu,\sigma} \leq x_{4,\mu,\sigma} < \infty$ be the roots of $f(x_0) = g(x_0)$. Then,
$$d(P, Q) = \max_{1 \leq i \leq 4}|F(x_{i,\mu,\sigma}) - G(x_{i,\mu,\sigma})| = d_{\mu,\sigma},\ \text{say},$$
and the best normal approximation to the standard Cauchy distribution is found by minimizing $d_{\mu,\sigma}$ over $(\mu, \sigma)$. This cannot be done in a closed-form analytical manner.
Numerical work gives that the minimum Kolmogorov distance is attained when $\mu = \pm .4749$ and $\sigma = 3.10$, resulting in the best normal approximation $N(\pm .4749, 9.61)$, and the corresponding minimum Kolmogorov distance of $.0373$.
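The numerical minimization is straightforward to reproduce; a sketch (the grid range and the Nelder-Mead starting point are illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import cauchy, norm

# Minimize the Kolmogorov distance between the standard Cauchy CDF and the
# N(mu, sigma^2) CDF; the text reports an optimum near mu = +/-0.4749,
# sigma = 3.10, with minimum distance about 0.0373.
grid = np.linspace(-60, 60, 20001)
F_cauchy = cauchy.cdf(grid)

def kolmogorov(params):
    mu, sigma = params
    return np.max(np.abs(F_cauchy - norm.cdf(grid, mu, sigma)))

res = minimize(kolmogorov, x0=[0.5, 3.0], method="Nelder-Mead")
print(res.x, res.fun)
```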
Fig. 15.2 Standard Cauchy and the best Kolmogorov normal approximation
The best normal approximation is not symmetric around zero, although the standard
Cauchy is. Moreover, the best normal approximation does not approximate the
density well, although the Kolmogorov distance is only :0373. Both are plotted in
Fig. 15.2 for a visual comparison.
Example 15.4 (f -Divergences). f -Divergences have a certain unifying character
with respect to measuring distances between probability measures. A number of
leading metrics and distances that we have defined above are f -divergences with
special choices of the convex function f .
For example, if $f(x) = \frac{|x - 1|}{2}$, then
$$d_f(P, Q) = \int q(x) f\left(\frac{p(x)}{q(x)}\right)dx = \frac{1}{2}\int q(x)\left|\frac{p(x)}{q(x)} - 1\right|dx = \frac{1}{2}\int |p(x) - q(x)|\,dx = \rho(P, Q).$$
If we let $f(x) = -\log x$, then
$$d_f(P, Q) = \int q(x) f\left(\frac{p(x)}{q(x)}\right)dx = -\int q(x)\log\frac{p(x)}{q(x)}\,dx = \int q(x)\log\frac{q(x)}{p(x)}\,dx = K(Q, P).$$
Note that two different choices of $f$ were needed to produce $K(Q, P)$ and $K(P, Q)$; this is because the Kullback–Leibler distance is not symmetric between $P$ and $Q$.
Next, if $f(x) = 2(1 - \sqrt{x})$, then
$$d_f(P, Q) = \int q(x) f\left(\frac{p(x)}{q(x)}\right)dx = 2\int q(x)\left[1 - \sqrt{\frac{p(x)}{q(x)}}\right]dx = 2\left[1 - \int\sqrt{p(x)q(x)}\,dx\right] = H^2(P, Q).$$
Notice that we ended up with $H^2(P, Q)$ rather than $H(P, Q)$ itself. Interestingly, if we choose $f(x) = (1 - \sqrt{x})^2$, then
$$d_f(P, Q) = \int q(x)\left(1 - \sqrt{\frac{p}{q}}\right)^2 dx = \int q(x)\left(1 - 2\sqrt{\frac{p}{q}} + \frac{p}{q}\right)dx = \int\left(q - 2\sqrt{pq} + p\right)dx$$
$$= 1 - 2\int\sqrt{pq}\,dx + 1 = 2\left[1 - \int\sqrt{pq}\,dx\right] = H^2(P, Q).$$
So, two different choices of the function $f$ result in the $f$-divergence being the square of the Hellinger metric.
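For strictly positive discrete mass functions these identities can be checked mechanically; a minimal sketch (the mass functions $p, q$ below are arbitrary illustrative choices):

```python
import numpy as np

# d_f(P, Q) = sum_i q_i f(p_i / q_i) for strictly positive mass functions.
def f_divergence(p, q, f):
    return np.sum(q * f(p / q))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

tv = f_divergence(p, q, lambda x: np.abs(x - 1) / 2)        # rho(P, Q)
kl_qp = f_divergence(p, q, lambda x: -np.log(x))            # K(Q, P)
h2 = f_divergence(p, q, lambda x: 2 * (1 - np.sqrt(x)))     # H^2(P, Q)

print(tv, 0.5 * np.sum(np.abs(p - q)))                      # equal
print(h2, 2 * (1 - np.sum(np.sqrt(p * q))))                 # equal
```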
Now observe that the second derivative of $\log\left(1 - \frac{x}{n}\right)^{n-1} - \log(e^{-x})$ is $-\frac{n-1}{n^2\left(1 - \frac{x}{n}\right)^2} < 0$ for all $x$ in $(-\infty, n)$. Therefore, $\log\left(1 - \frac{x}{n}\right)^{n-1}$ can cut $\log(e^{-x})$ at most twice on $(-\infty, n)$, which means that $\left(1 - \frac{x}{n}\right)^{n-1}$ can cut $e^{-x}$ at most twice on $(-\infty, n)$. Because there is obviously one cut at $x = 0$, there can be at most one more cut on $(-\infty, n)$, and hence at most one more cut on $(0, n)$. In fact, there is such a cut, as can be seen by observing that $\left(1 - \frac{x}{n}\right)^{n-1} > e^{-x}$ for small positive $x$, whereas $\left(1 - \frac{x}{n}\right)^{n-1} < e^{-x}$ at $x = n$. Denote this unique point of cut by $x_n$. Then,
$$2\rho(P_n, P) = \int_0^{x_n}\left[\left(1 - \frac{x}{n}\right)^{n-1} - e^{-x}\right]dx + \int_{x_n}^n\left[e^{-x} - \left(1 - \frac{x}{n}\right)^{n-1}\right]dx + e^{-n}.$$
Thus, the total variation distance between the exact and the limiting distribution of $n(1 - X_{(n)})$ goes to zero at the rate $\frac{1}{n}$ and is asymptotic to $\frac{2e^{-2}}{n}$. Even for $n = 20$, the approximation $\frac{2e^{-2}}{n}$ is extremely accurate. The exact value for $n = 20$ is $\rho(P_n, P) = .01376$ and the approximation is $\frac{2e^{-2}}{n} = .01353$, the error being $.00023$.
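The exact value quoted for $n = 20$ is easy to reproduce by numerical integration; a sketch:

```python
import numpy as np
from scipy.integrate import quad

# rho(P_n, P) = (1/2) [ int_0^n |(1 - x/n)^(n-1) - e^{-x}| dx + e^{-n} ],
# the e^{-n} term being the Exp(1) mass beyond the support (0, n).
n = 20
body, _ = quad(lambda x: np.abs((1 - x / n) ** (n - 1) - np.exp(-x)),
               0, n, limit=200)
rho = 0.5 * (body + np.exp(-n))
print(rho, 2 * np.exp(-2) / n)   # about .01376 and .01353, as in the text
```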
Proof. We prove a few of these inequalities here; see Gibbs and Su (2002), and
Reiss (1989) for the remaining parts.
Because $d(P, Q) = \sup_x |P((-\infty, x]) - Q((-\infty, x])|$, and $\rho(P, Q) = \sup_B |P(B) - Q(B)|$, it is obvious that $d(P, Q) \leq \rho(P, Q)$, because $\rho(P, Q)$ is a supremum over a larger collection of sets.
To prove that $\rho(P, Q) \leq H(P, Q)$, we use
$$(p - q) = (\sqrt{p} - \sqrt{q})(\sqrt{p} + \sqrt{q}),$$
and therefore, by the Cauchy–Schwarz inequality,
$$\left(\int |p - q|\right)^2 \leq \int(\sqrt{p} - \sqrt{q})^2\int(\sqrt{p} + \sqrt{q})^2 = H^2(P, Q)\int(\sqrt{p} + \sqrt{q})^2$$
$$\leq H^2(P, Q)\int 2(p + q) = 4H^2(P, Q)$$
$$\Rightarrow\; \rho^2(P, Q) = \frac{1}{4}\left(\int |p - q|\right)^2 \leq H^2(P, Q),$$
giving the inequality $\rho(P, Q) \leq H(P, Q)$.
We now prove $H(P, Q) \leq \sqrt{K(P, Q)}$, which is the same as $H^2(P, Q) \leq K(P, Q)$. For this, recall that $H^2(P, Q) = 2\left[1 - \int\sqrt{pq}\right]$. We now obtain a suitable lower bound on $\int\sqrt{pq}$, which leads to the desired upper bound on $H^2(P, Q)$. The lower bound on $\int\sqrt{pq}$ is
$$\int\sqrt{pq} = \int p\sqrt{\frac{q}{p}} = E_P\left(\sqrt{\frac{q}{p}}\right) = E_P\left(e^{\frac{1}{2}\log\frac{q}{p}}\right) \geq e^{\frac{1}{2}E_P\left(\log\frac{q}{p}\right)}$$
(by Jensen's inequality applied to the convex function $u \mapsto e^u$)
$$= e^{\frac{1}{2}\int p\log\frac{q}{p}} \geq 1 + \frac{1}{2}\int p\log\frac{q}{p} = 1 - \frac{1}{2}K(P, Q),$$
and therefore $H^2(P, Q) = 2\left[1 - \int\sqrt{pq}\right] \leq K(P, Q)$.
15.3 Metric Inequalities 517
$$2\rho^2(P, Q) = 2(p - r)^2 \leq p\log\frac{p}{r} + (1 - p)\log\frac{1 - p}{1 - r},$$
which is an entropy inequality. Thus, for such two-point distributions supported on the same two points, $2\rho^2(P, Q) \leq K(P, Q)$.
Now take general $P, Q$, and consider the set $B_0 = \{i : p(i) \geq q(i)\}$. Also, consider the special two-point distributions $P^*, Q^*$ with supports on $\{0, 1\}$ and defined by $p^* = P^*(0) = P(B_0)$, and $Q^*(0) = Q(B_0)$. Then, from the proof of part (b) of Theorem 15.1, $\rho(P, Q) = \rho(P^*, Q^*)$. On the other hand, by the finite partition property of general $f$-divergences (see Section 15.1), $K(P, Q) \geq K(P^*, Q^*)$. Therefore,
$$2\rho^2(P, Q) = 2\rho^2(P^*, Q^*) \leq K(P^*, Q^*) \leq K(P, Q).$$
We finally give a proof of part (e) in a special case. The special case we consider is when the two mass functions $\{p_i\}$ and $\{r_i\}$ corresponding to the two distributions $P, Q$ have one cut; that is, there exists $i_0$ such that $p_i \geq r_i$ for $i \leq i_0$ and $p_i \leq r_i$ for $i > i_0$. Then,
$$W(P, Q) = \int |F(x) - G(x)|\,dx = \sum_k\int_k^{k+1}|F(x) - G(x)|\,dx$$
$$= \sum_k\left|\sum_{i \leq k}(p_i - r_i)\right| \geq \left|\sum_{i \leq i_0}(p_i - r_i)\right| = \sum_{i \leq i_0}(p_i - r_i) = \rho(P, Q).$$
Example 15.6 (Total Variation versus Hellinger). The inequalities in parts (a) and (b) of the above theorem say that the planar point $(H(P, Q), \rho(P, Q))$ for any $P$ and $Q$ falls in the convex region bounded by the straight line $y = x$, the parabolic curve $y = \frac{x^2}{2}$, and the rectangle $[0, \sqrt{2}] \times [0, 1]$. In Fig. 15.3, we first provide a plot of the curve of the set of points $(H(P, Q), \rho(P, Q))$ as $P, Q$ run through the family
Fig. 15.3 Plot of total variation versus Hellinger for $P, Q$ in the Poisson family
Fig. 15.4 Plot of total variation versus Hellinger for $P, Q$ univariate normal
of Poisson distributions. This curve is the dark curve inside the region bounded by $y = x$, $y = \frac{x^2}{2}$, $0 \leq x \leq \sqrt{2}$, which are also plotted to give a perspective for where the curve for the Poisson family lies within the admissible region. The curve for the Poisson family was obtained by computing $(H(P, Q), \rho(P, Q))$ at 1,000 randomly chosen pairs of Poisson distributions with means between 0 and 10. Then, the same plot is provided when $P, Q$ run through the family of univariate normal distributions in Fig. 15.4. The curves for the Poisson and the normal case look very similar.
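The Poisson experiment is easy to repeat for the two inequalities that bound the admissible region, $\frac{H^2}{2} \leq \rho \leq H$; a sketch (the support is truncated at 200, which is harmless for means up to 10):

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(6)

# Sample pairs of Poisson distributions and check that (H, rho) always lies
# in the region x^2/2 <= y <= x.
ks = np.arange(200)
ok = True
for _ in range(1000):
    lam1, lam2 = rng.uniform(0.1, 10, size=2)
    p, q = poisson.pmf(ks, lam1), poisson.pmf(ks, lam2)
    rho = 0.5 * np.sum(np.abs(p - q))
    H = np.sqrt(np.maximum(0.0, 2 * (1 - np.sum(np.sqrt(p * q)))))
    ok = ok and (H**2 / 2 - 1e-9 <= rho <= H + 1e-9)
print(ok)   # every sampled pair lands inside the region
```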
15.4 Differential Metrics for Parametric Families 519
The quantity
$$\int\frac{\left(\frac{d}{d\theta}f(x\mid\theta)\right)^2}{f(x\mid\theta)}\,dx = E_\theta\left[\frac{d}{d\theta}\log f(x\mid\theta)\right]^2$$
is called the Fisher information function for the family $\{f(x\mid\theta)\}$ and is usually denoted as $I_f(\theta)$. Thus, subject to the validity of the formal calculations done in the above lines, for $\varphi$ close to $\theta$, $K(P_\theta, P_\varphi) \approx \frac{(\varphi - \theta)^2}{2}I_f(\theta)$. In other words, the Fisher information function $I_f(\theta)$ measures the curvature of the Kullback–Leibler distance at the particular distribution $P_\theta$, and can be used as a measure for how quickly the measure $P_\theta$ is changing if we change $\theta$ slightly.
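The curvature interpretation can be checked in a family where both sides are in closed form; a sketch using the Poisson($\theta$) family, for which $K(P_\theta, P_\varphi) = \theta\log\frac{\theta}{\varphi} + \varphi - \theta$ and $I_f(\theta) = \frac{1}{\theta}$:

```python
import numpy as np

# Compare the exact Kullback-Leibler distance in the Poisson family with the
# quadratic approximation ((phi - theta)^2 / 2) * I(theta), I(theta) = 1/theta.
theta = 4.0
for eps in [0.5, 0.1, 0.02]:
    phi = theta + eps
    K = theta * np.log(theta / phi) + phi - theta   # closed-form Poisson KL
    approx = eps**2 / (2 * theta)
    print(eps, K / approx)   # the ratio tends to 1 as eps -> 0
```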
as the first-order partial derivatives will vanish under the assumptions made on $F$ and $f$ (in the above, when $i = j$, the iterated partial derivatives denote the second-order partial derivative). Now, by a direct differentiation under the integral sign, we get
$$\frac{\partial^2}{\partial\theta_i\,\partial\varphi_j}J(\theta, \varphi)\Big|_{\varphi = \theta} = \int F_{yy}(f(x\mid\theta), f(x\mid\theta))\,\frac{\partial}{\partial\theta_i}f(x\mid\theta)\,\frac{\partial}{\partial\theta_j}f(x\mid\theta)\,dx = g^F_{i,j}(\theta)\ \text{(say)};$$
The integral
$$\int\frac{\partial\log f}{\partial\theta_i}\,\frac{\partial\log f}{\partial\theta_j}\,f(x\mid\theta)\,dx = E_\theta\left[\frac{\partial\log f}{\partial\theta_i}\,\frac{\partial\log f}{\partial\theta_j}\right]$$
is the $(i, j)$th element in the Fisher information matrix corresponding to the regular family of densities $\{f(x\mid\theta)\}$. Thus, apart from the constant multiplier $f''(1)$, the differential metric arising from all of these divergences is the same. This is an interesting unifying phenomenon, and at the same time shows that the Fisher information function has a special role.
The differential metric corresponding to the Fisher information function can be used to produce distances between two members $P_\theta, P_\varphi$ of a parametric family of distributions where the parameter belongs to a manifold in some Euclidean space. The distance between $P_\theta$ and $P_\varphi$ is simply the geodesic distance between $\theta$ and $\varphi$ arising from the differential metric on $\Theta$, the parameter space. The geodesic distance is basically the distance between $\theta$ and $\varphi$ along the shortest curve joining the two points along the manifold $\Theta$. This approach was initiated in Rao (1945), and in special parametric families that we encounter in applications, the geodesic distance works out to neat and interesting distances. Specifically, the geodesic distance often has a connection to the form of a variance-stabilizing transformation (see Chapter 7). The geodesic distance is defined below.
Definition 15.3. Let $\Theta$ be a manifold in a Euclidean space $\mathbb{R}^p$ for some $p, 1 \leq p < \infty$, and $d$ a metric on it. Let $\theta, \varphi \in \Theta$, and let $\mathcal{C}$ be the family of curves $\gamma(t)$ on $[0, 1]$ with $\gamma(t) \in \Theta$ for all $t$, $\gamma(0) = \theta$, and $\gamma(1) = \varphi$. The geodesic distance between $\theta$ and $\varphi$ is the infimum of the lengths of the curves $\gamma(t) \in \mathcal{C}$, with the length of a curve being defined with respect to the metric $d$.
Geodesic curves are usually hard to compute, and need not, in general, be unique. Using calculus-of-variations methods, it can be shown that a geodesic curve is a solution to the following boundary value problem: find a curve $\gamma(t)$ such that $\gamma(0) = \theta$, $\gamma(1) = \eta$, and

$$\frac{d^2 \gamma_k}{dt^2} + \sum_{i,j} \Gamma_{ijk}\, \frac{d\gamma_i}{dt}\, \frac{d\gamma_j}{dt} = 0,$$

where

$$\Gamma_{ijk} = \frac{1}{2}\left[\frac{\partial}{\partial \theta_i}\, g_{jk}(\theta) + \frac{\partial}{\partial \theta_j}\, g_{ki}(\theta) - \frac{\partial}{\partial \theta_k}\, g_{ij}(\theta)\right].$$
The geodesic distances have been calculated in the literature for several well-known
parametric families of distributions. They are always nontrivial to calculate. We
provide a selection of these formulas; see Rao (1987) for these formulas.
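As an illustration of the variance-stabilization connection, consider the Poisson$(\lambda)$ family, where the Fisher information is $g(\lambda) = 1/\lambda$. The parameter space is one-dimensional, so the geodesic is the parameter interval itself, and its length is $\int_{\lambda_1}^{\lambda_2} \sqrt{g(\lambda)}\, d\lambda = 2(\sqrt{\lambda_2} - \sqrt{\lambda_1})$, mirroring the square-root variance-stabilizing transformation. A small numerical sketch (the function names are ours, not from the text):

```python
import math

def fisher_info_poisson(lam):
    # Fisher information of the Poisson(lambda) family: I(lambda) = 1/lambda
    return 1.0 / lam

def geodesic_distance(lam1, lam2, n_steps=100000):
    # In one dimension the geodesic is the parameter interval itself, so the
    # geodesic distance is the integral of sqrt(g(lambda)) d lambda,
    # computed here by the midpoint rule.
    a, b = min(lam1, lam2), max(lam1, lam2)
    h = (b - a) / n_steps
    total = sum(math.sqrt(fisher_info_poisson(a + (i + 0.5) * h))
                for i in range(n_steps))
    return total * h

d_numeric = geodesic_distance(1.0, 4.0)
d_closed = 2.0 * (math.sqrt(4.0) - math.sqrt(1.0))  # 2(sqrt(l2) - sqrt(l1))
```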
Exercises
Exercise 15.1 (Skills Exercise). Compute the Hellinger and the Kullback–Leibler distance between $P$ and $Q$ when (a) $P = U[0,1]$, $Q = \text{Beta}(2,2)$; (b) $P = U[-a,a]$, $Q = N(0,1)$; (c) $P = \text{Exp}(\lambda)$, $Q = \text{Gamma}(\alpha, \lambda)$.
Exercise 15.2 (A Useful Formula). Compute the Kullback–Leibler distance be-
tween two general d -dimensional normal distributions.
Exercise 15.3. Suppose $X_1 \sim P_1$, $X_2 \sim P_2$, $Y_1 \sim Q_1$, $Y_2 \sim Q_2$, and that $X_1, X_2$ are independent and $Y_1, Y_2$ are independent. Let $P, Q$ denote the joint distributions of $(X_1, X_2)$ and $(Y_1, Y_2)$. Show that $1 - \frac{1}{2} H^2(P, Q) = \left(1 - \frac{1}{2} H^2(P_1, Q_1)\right)\left(1 - \frac{1}{2} H^2(P_2, Q_2)\right)$.
Exercise 15.4. Suppose $X_1, X_2, \ldots, X_n$ are iid $U[0,1]$. Compute the Kullback–Leibler distance between $P = P_n$ and $Q$, where $P_n$ is the exact distribution of $n(1 - X_{(n)})$ and $Q$ stands for an exponential distribution with mean one. Hence, find a sequence $c_n \to 0$ such that $\frac{K(P_n, Q)}{c_n} \to 1$ as $n \to \infty$.
Exercise 15.5 * (How Large Is the Class of t Distributions). Consider the class $\mathcal{F}$ of all one-dimensional t densities symmetric about zero; that is,

$$\mathcal{F} = \left\{ f : f(x) = \frac{\Gamma\!\left(\frac{\alpha+1}{2}\right)}{\sqrt{\pi\alpha}\, \Gamma\!\left(\frac{\alpha}{2}\right)\left(1 + \frac{x^2}{\alpha}\right)^{(\alpha+1)/2}},\ \alpha > 0 \right\}.$$
Exercise 15.7 * (Binomial and Poisson). Let $P_n$ be the $\text{Bin}(n, \frac{\lambda}{n})$ distribution and $Q$ a Poisson distribution with mean $\lambda$; here $0 < \lambda < \infty$ is a fixed number independent of $n$. Prove that the total variation distance between $P_n$ and $Q$ converges to zero as $n \to \infty$.
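Exercise 15.7 can be previewed numerically. The following sketch (with our own helper names) computes the total variation distance between $\text{Bin}(n, \lambda/n)$ and Poisson$(\lambda)$ directly from the two pmfs, working on the log scale for the Poisson pmf to avoid overflow:

```python
import math

def binom_pmf(n, p, k):
    # pmf of Bin(n, p) at k
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    # pmf of Poisson(lam) at k, computed on the log scale
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def tv_bin_poisson(n, lam, tail=200):
    # Total variation distance = (1/2) sum_k |P_n({k}) - Q({k})|
    p = lam / n
    s = sum(abs(binom_pmf(n, p, k) - poisson_pmf(lam, k))
            for k in range(0, n + 1))
    # the binomial puts no mass above n; add the (truncated) Poisson tail there
    s += sum(poisson_pmf(lam, k) for k in range(n + 1, n + tail))
    return 0.5 * s

d10 = tv_bin_poisson(10, 2.0)
d100 = tv_bin_poisson(100, 2.0)
d1000 = tv_bin_poisson(1000, 2.0)
```

The distances decrease as $n$ grows, consistent with the convergence claimed in the exercise (and with the classical bound $\le \lambda^2/n$).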
Exercise 15.10 * (An Interesting Plot). Use the formula in the text (Example 7.36)
for the total variation distance between two general univariate normal distributions
to plot the following set.
Exercise 15.18. * Prove that almost sure convergence is not metrized by any metric
on probability measures.
References
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability, Springer, New York.
Diaconis, P. and Saloff-Coste, L. (2006). Separation cut-offs for birth and death chains, Ann. Appl.
Prob., 16, 2098–2122.
Dudley, R. (2002). Real Analysis and Probability, Cambridge University Press, Cambridge, UK.
Gibbs, A. and Su, F. (2002). On choosing and bounding probability metrics, Internat. Statist. Rev.,
70, 419–435.
Ibragimov, I.A. (1956). On the composition of unimodal distributions, Theor. Prob. Appl., 1,
283–288.
526 15 Probability Metrics
Kass, R. and Vos, P. (1997). Geometrical Foundations of Asymptotic Inference, Wiley, New York.
LeCam, L. (1969). Théorie Asymptotique de la Décision Statistique, Les Presses de l'Université de Montréal, Montréal.
Liese, F. and Vajda, I. (1987). Convex Statistical Distances, Teubner, Leipzig.
Rachev, S.T. (1991). Probability Metrics and the Stability of Stochastic Models, Wiley, New York.
Rao, C.R. (1945). Information and accuracy available in the estimation of statistical parameters,
Bull. Calcutta Math. Soc., 37, 81–91.
Rao, C.R. (1987). Differential metrics in probability spaces, in Differential Geometry in Statistical
Inference, S.-I. Amari et al. Eds., IMS Lecture Notes and Monographs Series, Hayward, CA.
Reiss, R. (1989). Approximation Theorems of Order Statistics, Springer-Verlag, New York.
Zolotarev, V.M. (1983). Probability metrics, Theor. Prob. Appl., 28, 2, 264–287.
Chapter 16
Empirical Processes and VC Theory
If $X_1, \ldots, X_n$ happen to be iid, with the common CDF $F$, then $nF_n(t) \sim \text{Bin}(n, F(t))$ for any fixed $t$. The mean of $F_n(t)$ is $F(t)$, and the variance is $\frac{F(t)(1 - F(t))}{n}$.
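These two moment formulas are easy to confirm by simulation. A sketch (the function names are ours), using $U[0,1]$ data so that $F(t) = t$:

```python
import random
import statistics

def empirical_cdf(sample, t):
    # F_n(t) = (number of observations <= t) / n
    return sum(x <= t for x in sample) / len(sample)

random.seed(1)
n, t, reps = 50, 0.3, 20000
vals = [empirical_cdf([random.random() for _ in range(n)], t)
        for _ in range(reps)]
mean_fn = statistics.fmean(vals)
var_fn = statistics.pvariance(vals)
# theory: mean = F(t) = 0.3, variance = F(t)(1 - F(t))/n = 0.3 * 0.7 / 50
```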
The empirical quantile function is defined in terms of the order statistics of the sample values $X_1, X_2, \ldots, X_n$, namely,

$$Q_n(y) = F_n^{-1}(y) = X_{k:n}, \quad \frac{k-1}{n} < y \le \frac{k}{n},$$

which gives rise to the quantile process. In the special case of $U[0,1]$ variables, we of course have $Q(y) = y$. We use the notation $U_n(y)$ for the quantile function $Q_n(y)$ in the special $U[0,1]$ case, which gives us the uniform quantile process $u_n(y) = \sqrt{n}\,[U_n(y) - y],\ 0 < y \le 1$.
Given a real-valued function $f$ on some interval $[a,b] \subseteq \mathbb{R}$, let $\|f\|_\infty = \sup_{a \le x \le b} |f(x)|$. We often call it the uniform norm or the $L^\infty$ norm of $f$. The paths of the empirical process are only right continuous; indeed, $F_n(t)$ jumps at the data values $X_1, \ldots, X_n$. However, at any point $t$, $F_n(t)$ has a left limit. We need a name for functions of these types. Let $[a,b]$ be an interval with $-\infty \le a < b \le \infty$. Then $C[a,b]$ denotes the class of all continuous real-valued functions on $[a,b]$, and $\ell^\infty[a,b]$ denotes the class of all real-valued functions $f$ with $\|f\|_\infty < \infty$. The class of all real-valued functions that are right continuous on $[a,b]$ and have left limits everywhere is denoted by $D[a,b]$; they are commonly known as cadlag functions. An important inclusion property is $C[a,b] \subseteq D[a,b] \subseteq \ell^\infty[a,b]$.
As an estimate of the true CDF $F$, the empirical CDF $F_n$ is uniformly accurate in large samples. Indeed, the Glivenko–Cantelli theorem says that (in the iid case) $\|F_n - F\|_\infty \xrightarrow{a.s.} 0$ as $n \to \infty$ (see Chapter 7). A common test statistic for goodness of fit in statistics is the Kolmogorov–Smirnov statistic $D_n = \sqrt{n}\, \|F_n - F\|_\infty$ (see Chapter 14). Empirical process theory is going to help us in pinning down finer properties of $\|F_n - F\|_\infty$ and the asymptotic distribution of $D_n$.
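Since $F_n$ is a step function that jumps at the order statistics, the supremum in $D_n$ is attained there, which yields a simple exact algorithm. A sketch (the function names are ours):

```python
import math
import random

def ks_statistic(sample, F0):
    # D_n = sqrt(n) * sup_t |F_n(t) - F0(t)|; since F_n is a step function,
    # the supremum is attained at the order statistics.
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        u = F0(x)
        d = max(d, i / n - u, u - (i - 1) / n)
    return math.sqrt(n) * d

def uniform_cdf(t):
    # CDF of U[0,1]
    return min(max(t, 0.0), 1.0)

random.seed(7)
dn = ks_statistic([random.random() for _ in range(200)], uniform_cdf)
```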
Note that $\|F_n - F\|_\infty$ is just the Kolmogorov distance (see Chapter 15) between $F_n$ and $F$. Hence, the Glivenko–Cantelli theorem may be rephrased as $d(F_n, F) \xrightarrow{a.s.} 0$. A simple point worthy of mention is that $F_n$ need not be close to the true CDF $F$ according to other common notions of distance. For example, consider the empirical measure $P_n$ and the true underlying distribution (measure) $P$ that corresponds to the CDF $F$. The total variation distance between $P_n$ and $P$ is $\delta(P_n, P) = \sup_A |P_n(A) - P(A)|$, the supremum being taken over arbitrary (Borel) sets of $\mathbb{R}$. Then, clearly, $\delta(P_n, P)$, the total variation distance between $P_n$ and $P$, cannot converge to zero as $n \to \infty$ in general, because $P_n$ is always supported on a finite set, and $P$ may even give zero probability to all finite sets (e.g., if $P$ has a density). In such a case, $\delta(P_n, P)$ would actually be equal to 1 for every $n$, and therefore would not converge to zero. This argument has nothing to do with the variables being real-valued; the same argument works in $\mathbb{R}^d$ for any $d < \infty$. Thus, the empirical measure need not estimate the true $P$ accurately even in large samples, if our notion of accuracy is too strong. Empirical process theory is about the nature of the deviation of $P_n$ from $P$, especially in large samples, and it quantifies the deviation in very precise terms, using highly powerful mathematical ideas and tools. Of course, the theory has numerous applications.
same flavor as those between the partial sum process and Brownian motion, which were extensively discussed in Chapter 12. The connection between the empirical process and the Brownian bridge is the driving force behind many of the most valuable
results on the asymptotic behavior of the one-dimensional empirical process, and it
is useful to have a preview of it.
Preview. Consider first the uniform empirical process $\alpha_n(t),\ t \in [0,1]$. As we remarked in the previous section, for any fixed $t$, $\alpha_n(t) \xrightarrow{\mathcal{L}} Z_t$, where $Z_t$ is distributed as $N(0, t(1-t))$. This is an immediate consequence of the one-dimensional central limit theorem in the iid case. We can quickly generalize this to the case of any finite number of times $0 \le t_1 < t_2 < \cdots < t_k \le 1$. First note that, by a simple and direct calculation, for any fixed $n$ and a pair of times $t_i, t_j$, $\text{Cov}(\alpha_n(t_i), \alpha_n(t_j)) = \min(t_i, t_j) - t_i t_j$.
Therefore, by the multivariate central limit theorem for the iid case (see Chapter 7), we have the convergence result that

$$(\alpha_n(t_1), \alpha_n(t_2), \ldots, \alpha_n(t_k)) \xrightarrow{\mathcal{L}} (Z_{t_1}, Z_{t_2}, \ldots, Z_{t_k}),$$

where $(Z_{t_1}, Z_{t_2}, \ldots, Z_{t_k})$ has a $k$-dimensional normal distribution with zero means, and covariance matrix with elements $\text{Cov}(Z_{t_i}, Z_{t_j}) = \min(t_i, t_j) - t_i t_j$. Notice that the covariances $\min(t_i, t_j) - t_i t_j$ exactly coincide with the formula for the covariance between $B(t_i)$ and $B(t_j)$, where $B(t)$ is a Brownian bridge on $[0,1]$ (see Chapter 12). In other words, the finite-dimensional distributions of the process $\alpha_n(t)$ converge to the corresponding finite-dimensional distributions of $B(t)$, a Brownian bridge on $[0,1]$.
This would lead one to hope that perhaps $\alpha_n(t)$ converges to $B(t)$ as a process. If true, this would be stronger than convergence of just the finite-dimensional distributions, and would lead to better applications. It was a triumph of probability theory that in 1952, Monroe Donsker proved that indeed $\alpha_n(t)$ converges to $B(t)$ as a process (Donsker (1952)). There were certain technical problems of measurability in Donsker's proof. The problems were initially overcome by Anatoliy Skorohod and Andrei Kolmogorov. In order to avoid bringing in that part of the theory, we state Donsker's theorem in the form obtained in Dudley (1999).
The approximation of the one-dimensional uniform empirical process by a Brow-
nian bridge easily carries over to the general one-dimensional empirical process,
by essentially using the quantile transformation for real-valued random variables
with a continuous CDF. This is seen below. However, the approximation through
a Brownian bridge is actually much stronger than what convergence in distribution
would imply. Starting in the 1970s, a much stronger form of approximation of the
empirical process by a sequence of Brownian bridges was established. This is the
analogue of what we called the strong invariance principle for the partial sum pro-
cess in Chapter 12. These strong approximations for the one-dimensional empirical
16.2 Classic Asymptotic Properties of the Empirical Process 531
process allow one to derive finer asymptotic properties of the empirical process, and
in some problems (but not all) lead to quick solutions of apparently involved weak
convergence problems.
One problematic feature of empirical process theory, even in one dimension, is that the proofs of most of the key results are almost always long and involved. For this reason, we often refer to a source for a proof.
$X_1, X_2, \ldots, X_n$, and $F_0$ is a specified continuous CDF (see Example 14.22). It is defined as $D_n = \sqrt{n}\, \|F_n - F_0\|_\infty = \sqrt{n} \sup_{-\infty < t < \infty} |F_n(t) - F_0(t)|$. The exact distribution of $D_n$ is difficult to find except for quite small $n$; some calculations were done in Kolmogorov (1933). Usually, in applications of the Kolmogorov–Smirnov test, the exact distribution is replaced by its asymptotic distribution, and this can be obtained elegantly by using the invariance principle.
First note that under any continuous CDF $F_0$ on the real line, the quantile transform shows that for any $n$, the distribution of $D_n$ is the same for all $F_0$. Thus, one may assume $F_0$ to be the $U[0,1]$ distribution. Consider now the functional $h(f) = \|f\|_\infty$ on $D[0,1]$. It is obviously continuous with respect to the uniform norm, and therefore, by the invariance principle,

$$D_n = \sqrt{n} \sup_{0<t<1} |F_n(t) - t| = \sup_{0<t<1} |\sqrt{n}(F_n(t) - t)| = \sup_{0<t<1} |\alpha_n(t)| \xrightarrow{\mathcal{L}} \sup_{0<t<1} |B(t)|,$$

where $B(t)$ is a Brownian bridge on $[0,1]$. From Chapter 12 (see Theorem 12.4), $\sup_{0<t<1} |B(t)|$ has the CDF $H(x) = 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2 x^2},\ x \ge 0$. Therefore, we have the result that for any $x > 0$, $P(D_n > x) \to 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2 x^2}$ as $n \to \infty$. This is of tremendous practical utility in statistics.
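The series for $H(x)$ converges very fast, so truncating it after a modest number of terms already gives high accuracy. A sketch (the function name is ours); the classical 5% critical value $x \approx 1.358$ of the Kolmogorov–Smirnov test should satisfy $H(x) \approx 0.95$:

```python
import math

def kolmogorov_cdf(x, terms=100):
    # Limiting CDF of D_n: H(x) = 1 - 2 * sum_{k>=1} (-1)^{k-1} exp(-2 k^2 x^2)
    if x <= 0:
        return 0.0
    return 1.0 - 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
                           for k in range(1, terms + 1))

# The classical 5% critical value of the Kolmogorov-Smirnov test is about 1.358
p95 = kolmogorov_cdf(1.358)
```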
Example 16.2 (Cramér–von Mises Statistic). It is conceivable that the empirical CDF $F_n$ is a moderate distance away from the postulated CDF $F_0$ over large parts of the real line, although it is never too far away. In such a case, the Kolmogorov–Smirnov statistic may fail to detect the falsity of the postulated null hypothesis, but a statistic that measures an average deviation of $F_n$ from $F_0$ may succeed. The Cramér–von Mises statistic $C_n^2 = n \int_{-\infty}^{\infty} (F_n(t) - F_0(t))^2\, dF_0(t)$ is such a statistic, and it is frequently used as an alternative to the Kolmogorov–Smirnov statistic, or as a complementary statistic.
Once again, the exact distribution of $C_n^2$ is difficult to find, but an application of the invariance principle leads to the asymptotic distribution, which is used as an approximation to the exact distribution. As long as $F_0$ is continuous, the distribution of $C_n^2$ is independent of $F_0$ for any $n$, and so once again, as in our previous example, we may take $F_0$ to be the $U[0,1]$ CDF. Then we have $C_n^2 = \int_0^1 [\sqrt{n}(F_n(t) - t)]^2\, dt = \int_0^1 \alpha_n^2(t)\, dt$. The functional $h(f) = \int_0^1 f^2(t)\, dt$ is continuous on $D[0,1]$ with respect to the uniform norm. To see this, use the sequence of inequalities

$$\left| \int_0^1 f^2(t)\, dt - \int_0^1 g^2(t)\, dt \right| \le \int_0^1 |f^2(t) - g^2(t)|\, dt = \int_0^1 |f(t) - g(t)|\, |f(t) + g(t)|\, dt \le \|f - g\|_\infty \int_0^1 (|f(t)| + |g(t)|)\, dt \le \|f - g\|_\infty\, (\|f\|_\infty + \|g\|_\infty).$$
Therefore, by the invariance principle, $C_n^2 \xrightarrow{\mathcal{L}} \int_0^1 B^2(t)\, dt$. It remains to characterize the distribution of $\int_0^1 B^2(t)\, dt$.
For this, we use the Karhunen–Loève expansion of $B(t)$ (see Theorem 12.2), given by $B(t) = \sqrt{2} \sum_{m=1}^{\infty} \frac{\sin(m\pi t)}{m\pi} Z_m$, where $Z_1, Z_2, \ldots$ is an iid $N(0,1)$ sequence. By the orthogonality of the sequence of functions $\sin(m\pi t),\ m \ge 1$, we get

$$\int_0^1 B^2(t)\, dt = \int_0^1 \left[\sqrt{2} \sum_{m=1}^{\infty} \frac{\sin(m\pi t)}{m\pi} Z_m\right]^2 dt = 2 \sum_{m=1}^{\infty} \frac{Z_m^2}{m^2 \pi^2} \int_0^1 \sin^2(m\pi t)\, dt = 2 \sum_{m=1}^{\infty} \frac{Z_m^2}{2 m^2 \pi^2} = \frac{1}{\pi^2} \sum_{m=1}^{\infty} \frac{Z_m^2}{m^2} = Y \quad \text{(say)}.$$
That is, $\int_0^1 B^2(t)\, dt$ is distributed as an infinite linear combination of iid chi-square random variables with one degree of freedom. At this point, the strategy is to find the cf (characteristic function) of $Y$, and to invert it to find a density for $Y$ (see Theorem 8.1). The cf of a single chi-square random variable with one degree of freedom is $(1 - 2it)^{-1/2}$. It follows that the cf of $Y$ is

$$\psi_Y(t) = \prod_{m=1}^{\infty} \left(1 - \frac{2it}{m^2 \pi^2}\right)^{-1/2} = \left[\prod_{m=1}^{\infty} \left(1 - \frac{2it}{m^2 \pi^2}\right)\right]^{-1/2}.$$
We now use the identity that, for a complex number $z$, $\prod_{m=1}^{\infty} \left(1 - \frac{z}{m^2 \pi^2}\right) = \frac{\sin(\sqrt{z})}{\sqrt{z}}$. Using $z = 2it$, we get, with a little algebra, that

$$\psi_Y(t) = \left[\frac{\sin(\sqrt{2it})}{\sqrt{2it}}\right]^{-1/2} = \sqrt{\frac{\sqrt{2it}}{\sin(\sqrt{2it})}}.$$
It is possible to invert this to write the CDF of $Y$ as a convergent infinite series (see part (a) of Theorem 8.1). The CDF has the formula

$$F_Y(y) = 1 - \frac{1}{\pi} \sum_{j=0}^{\infty} (-1)^j \int_{(2j+1)^2 \pi^2}^{(2j+2)^2 \pi^2} \frac{1}{z}\, \sqrt{\frac{-\sqrt{z}}{\sin(\sqrt{z})}}\; e^{-\frac{yz}{2}}\, dz.$$
Possible uses of such a weighted empirical process would include statistical tests of a null hypothesis $H_0 : F = F_0$ by using test statistics such as

$$D_{n,w} = \sup_{-\infty < t < \infty} \left| \frac{\sqrt{n}\,(F_n(t) - F_0(t))}{w(F_0(t))} \right|, \quad \text{or} \quad C_{n,w}^2 = n \int_{-\infty}^{\infty} \frac{[F_n(t) - F_0(t)]^2}{w^2(F_0(t))}\, dF_0(t).$$

The limiting behavior of such statistics is not necessarily what one might expect intuitively. For instance, $D_{n,w}$ does not necessarily converge in law to the supremum of $\frac{|B(t)|}{w(t)}$. The tails of the function $w$ must be such that $\frac{B(t)}{w(t)}$ does not blow up near $t = 0$ or $1$. The specification of the weight function $w$ such that no such disasters occur is a very subtle and nontrivial problem. The following result (Chibisov (1964), O'Reilly (1974)) completely describes the properties required of $w$ so that the weighted empirical process behaves well at the tails.
Theorem 16.2 (Chibisov–O'Reilly). (a) Suppose the function $w$ is nondecreasing in a neighborhood of zero and nonincreasing in a neighborhood of 1. The statistic $D_{n,w}$ has a nontrivial limiting distribution if and only if

$$\int_0^1 \frac{1}{t(1-t)}\, e^{-\epsilon \frac{w^2(t)}{t(1-t)}}\, dt < \infty \quad \text{for some } \epsilon > 0,$$

in which case $D_{n,w} \xrightarrow{\mathcal{L}} \sup_{0<t<1} \frac{|B(t)|}{w(t)}$.

(b) The statistic $C_{n,w}^2$ has a nontrivial limiting distribution if and only if $\int_0^1 \frac{t(1-t)}{w^2(t)}\, dt < \infty$, in which case $C_{n,w}^2 \xrightarrow{\mathcal{L}} \int_0^1 \left(\frac{B(t)}{w(t)}\right)^2 dt$.
Note that $A_n^2$ and $D_n$ have some formal similarity; $D_n$ is the supremum ($L^\infty$ norm) and $A_n$ the $L^2$ norm of the weighted empirical process

$$\frac{\sqrt{n}\,(F_n(t) - F_0(t))}{\sqrt{F_0(t)(1 - F_0(t))}}.$$

However, the theorem of Chibisov and O'Reilly implies that this weighted supremum statistic $D_n$ does not have a nontrivial limiting distribution, whereas $A_n^2$ does, and converges in distribution to $\int_0^1 \frac{B^2(t)}{t(1-t)}\, dt$. It is a chapter exercise to find the distribution of $\int_0^1 \frac{B^2(t)}{t(1-t)}\, dt$. The problem with $D_n$ is that $\frac{1}{\sqrt{t(1-t)}}$ diverges to infinity at 0 and 1 too rapidly to balance the behavior of $B(t)$ near zero and one. If we weight the empirical process by a more modest weight function, the problems disappear and the weighted Kolmogorov–Smirnov statistic has a nontrivial limiting distribution.
As we just saw, the supremum of the standardized empirical process, namely $D_n$, does not have a nontrivial limiting distribution by itself. However, it can be centered and normed suitably to make it have a nontrivial limiting distribution. Very roughly speaking, $D_n$ does not have a nontrivial limiting distribution because it blows up as $n \to \infty$, and it grows at the rate of $\sqrt{2\log\log n}$. So, by centering it (more or less) at $\sqrt{2\log\log n}$, and then norming it, we can obtain a limiting distribution. This can still be used to calculate large-sample approximations to tail probabilities for $D_n$ (i.e., P-values). The following results are due to Jaeschke (1979), Eicker (1979), Csáki (1980), and Einmahl and Mason (1985).
Theorem 16.3.

(a) $\dfrac{D_n}{\sqrt{2\log\log n}} \xrightarrow{P} 1$.

(b) $\forall x$, $P\left(\sqrt{2\log\log n}\left[D_n - \dfrac{2\log\log n + \frac{1}{2}\log\log\log n + \frac{1}{2}\log\frac{1}{4\pi}}{\sqrt{2\log\log n}}\right] \le x\right) \to e^{-2e^{-x}}$.

(c) For any $\epsilon > 0$, with probability 1, for all large $n$, $D_n \le (\log n)^{\frac{1}{2}+\epsilon}$.
Notice that part (a) of this theorem follows from part (b). It is to be noted that
part (a) cannot be strengthened to almost sure convergence.
The preceding discussion and the results show that the weight function $\frac{1}{\sqrt{t(1-t)}}$ grows too rapidly near $t = 0$ and $1$ for the supremum of the weighted empirical process to settle down. A particular special case of part (a) of Theorem 16.2 is sometimes useful in applications, and is given below.
Just as the empirical CDF $F_n$ approximates a true CDF $F$ on the real line, the empirical quantile function $Q_n(y) = F_n^{-1}(y)$ approximates the true quantile function $Q(y) = F^{-1}(y)$. However, we can intuitively see that it is difficult to estimate quantiles of $F$ beyond the range of the data, namely, $X_1, X_2, \ldots, X_n$. So, we have to be careful about the nature of uniform approximability of $Q(y)$ by $Q_n(y)$. In addition, to get convergence results to Brownian bridges akin to the case of the empirical process, we have to properly normalize the quantile process, so that the covariance structure matches that of a Brownian bridge. The theorem below collects the three most important asymptotic properties of the normalized quantile process; a proof can be found in Csörgő (1983). Quantile processes are useful in a number of statistical problems, such as change-point problems, goodness of fit, reliability, and survival analysis.
Theorem 16.5. Suppose $F$ is an absolutely continuous CDF on the real line, with a density $f$. Let $q_n(y) = \sqrt{n}\,[Q_n(y) - Q(y)],\ 0 < y < 1,\ n \ge 1$, be the quantile process, and $\rho_n(y) = \sqrt{n}\, f(Q(y))\,[Q_n(y) - Q(y)],\ 0 < y < 1,\ n \ge 1$, be the standardized quantile process.

(a) (Restricted Glivenko–Cantelli Property). Suppose $f$ has a bounded support, that $f$ is differentiable and bounded away from zero on its support, and that $|f'|$ is bounded on the support of $f$. Then

$$\sup_{0<y<1} |Q_n(y) - Q(y)| \xrightarrow{a.s.} 0.$$
(c) Suppose $F$ has a general support (not necessarily bounded) with a density $f$. Assume that $f$ is strictly positive on its support, that $f$ is differentiable, and that $F(x)(1 - F(x))\, \frac{|f'(x)|}{f^2(x)}$ is uniformly bounded on the support of $f$. Then, on a suitable probability space, one can construct an iid sequence $X_1, X_2, \ldots$ with the CDF $F$, and a sequence of Brownian bridges $B_n(y),\ 0 \le y \le 1$, such that

$$\sup_{\frac{1}{n+1} \le y \le \frac{n}{n+1}} |\rho_n(y) - B_n(y)| \xrightarrow{P} 0.$$
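The restricted Glivenko–Cantelli property in part (a) is easy to watch numerically. A sketch (the function names are ours) for $U[0,1]$ data, where $Q(y) = y$:

```python
import math
import random

def empirical_quantile(sorted_sample, y):
    # Q_n(y) = X_{k:n} for (k-1)/n < y <= k/n
    n = len(sorted_sample)
    k = min(n, max(1, math.ceil(n * y)))
    return sorted_sample[k - 1]

random.seed(5)
n = 20000
xs = sorted(random.random() for _ in range(n))
# For U[0,1], Q(y) = y; check uniform closeness on a grid away from 0 and 1
grid = [i / 1000 for i in range(10, 991)]
sup_err = max(abs(empirical_quantile(xs, y) - y) for y in grid)
```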
We proved in Chapter 12 that the partial sum process for an iid sequence with a
finite variance converges in distribution to a Brownian motion. This is the invariance
principle for the partial sum process (see Section 12.6). In Section 12.7, we showed
that in fact for each fixed n, we can construct a Wiener process such that the partial
sum process is uniformly close to the Wiener process with probability one. This was
called the strong invariance principle for the partial sum process. A parallel strong
approximation theory exists for the empirical process of an iid sequence. The strong
approximation provides a handy tool in many situations for solving a particular
problem. In addition, the error statements in the strong approximation results give a
very precise idea about the accuracy of the Brownian bridge approximation of the
empirical process. The results pin down what can and cannot be done. We remark
in passing that part (a) in Theorem 16.1 and part (c) in Theorem 16.5 are in fact
instances of strong approximations.
Theorem 16.6. On a suitable probability space, one can construct an iid sequence $X_1, X_2, \ldots$ with the common CDF $F$, and a sequence of Brownian bridges $B_n(t)$, such that the empirical process $\beta_n(t) = \sqrt{n}(F_n(t) - F(t)),\ -\infty < t < \infty$, satisfies

(a) Almost surely, $\displaystyle\sup_{-\infty<t<\infty} |\beta_n(t) - B_n(F(t))| = O\!\left(\frac{\log n}{\sqrt{n}}\right)$.

(b) $\forall\, n \ge 1$ and $\forall\, x \in \mathbb{R}$, $P\left(\displaystyle\sup_{-\infty<t<\infty} |\beta_n(t) - B_n(F(t))| > \frac{12\log n + x}{\sqrt{n}}\right) \le 2e^{-x/6}$.

(c) $\forall\, c > 0$, $\displaystyle\sup_{F^{-1}(c/n)\, \le\, t\, \le\, F^{-1}(1 - c/n)} \frac{|\beta_n(t) - B_n(F(t))|}{\sqrt{F(t)(1 - F(t))}} = O_P(1)$.

(d) $\forall\, c > 0$ and $\forall\, 0 < \nu < \frac{1}{2}$, $\displaystyle\sup_{F^{-1}(c/n)\, \le\, t\, \le\, F^{-1}(1 - c/n)} \frac{|\beta_n(t) - B_n(F(t))|}{[F(t)(1 - F(t))]^{\nu}} = O_P\!\left(n^{-(\frac{1}{2} - \nu)}\right)$.
Part (a) of this theorem actually follows from part (b) by clever choices of $x$, and then an application of the Borel–Cantelli lemma. The choice of $x$ is indicated in a chapter exercise. It is important to note that the inequality is valid for all $n$ and all real $x$, and so in a specific application, $n$ and $x$ can be chosen to suit one's need. In particular, $x$ can depend on $n$. Part (b) is called the Komlós–Major–Tusnády (KMT) theorem. See Komlós et al. (1975a, b), and Mason and van Zwet (1987) for a more detailed proof. Part (c) is an $O_P(1)$ result, rather than an $o_P(1)$ one, and reinforces our discussion in the previous section that if we weight the empirical process by $\sqrt{F(1-F)}$, then the invariance principle will fail. But if we weight it by a smaller power of $F(1-F)$, then not only can we recover the invariance principle, but even a strong approximation holds, as in part (d). The KMT rate $\frac{\log n}{\sqrt{n}}$ in part (a) cannot be improved.
16.3 Vapnik–Chervonenkis Theory

We first discuss the plausibility of strong laws more general than the well-known Glivenko–Cantelli theorem, which asserts that in the one-dimensional iid case $\sup_t |F_n(t) - F(t)| \xrightarrow{a.s.} 0$. We need a concept of combinatorial richness of a class of sets $\mathcal{C}$ that will allow us to make statements such as $\sup_{C \in \mathcal{C}} |P_n(C) - P(C)| \xrightarrow{a.s.} 0$. A class of sets for which this property holds is called a Glivenko–Cantelli class. A useful such concept of combinatorial richness is the Vapnik–Chervonenkis dimension of a class of sets. Meaningful asymptotics will exist for classes of sets that have a finite Vapnik–Chervonenkis dimension. It is therefore critical to know what it means, and what good examples of classes of sets with a finite Vapnik–Chervonenkis dimension are. A basic treatment of this is given next.
We often use the notation below. Given a specific set of $n$ elements $x_1, x_2, \ldots, x_n$ of a general set $S$, which need not be distinct, and a specific class $\mathcal{C}$ of subsets of $S$, we let

$$\Delta_{\mathcal{C}}(x_1, x_2, \ldots, x_n) = \text{Card}\left(\{\{x_1, x_2, \ldots, x_n\} \cap C : C \in \mathcal{C}\}\right),$$

where Card denotes cardinality. In words, $\Delta_{\mathcal{C}}(x_1, x_2, \ldots, x_n)$ counts how many distinct subsets of $\{x_1, x_2, \ldots, x_n\}$ can be generated by intersecting $\{x_1, x_2, \ldots, x_n\}$ with the members of $\mathcal{C}$.

That is, $S(n, \mathcal{C}) = \max_{x_1, \ldots, x_n} \Delta_{\mathcal{C}}(x_1, \ldots, x_n)$ is the largest possible number of subsets of some (wisely chosen) set of $n$ points that can be formed by intersecting the set with members of $\mathcal{C}$. Clearly, for any $n$, $S(n, \mathcal{C}) \le 2^n$.
Remark. Sauer’s lemma says that either a class of sets has infinite VC dimension,
or its shattering coefficients grow polynomially. A few other important and useful
properties of the shattering coefficients are listed below; most of them are derived
easily. These properties are useful for generating new classes of VC sets from known
ones by using various Boolean operations.
Example 16.4. Let $\mathcal{C}$ be the class of all left-unbounded closed intervals on the real line; that is, $\mathcal{C} = \{(-\infty, x] : x \in \mathbb{R}\}$. To illustrate the general formula, suppose $n = 2$; what is $S(n, \mathcal{C})$? Clearly, if we pick up the larger one among $x_1, x_2$, we will pick up the smaller one too. Or, we may pick up neither of them, or just the smaller one. So we can pick up three distinct subsets from the power set of $\{x_1, x_2\}$. The same argument shows that the general formula for the shattering coefficients is $S(n, \mathcal{C}) = n + 1$. Consequently, this is a VC class with VC dimension one.
Example 16.5. Although, topologically, there are just as many left-unbounded intervals on the real line as there are arbitrary intervals, in the VC index they act differently. This is interesting. Thus, let $\mathcal{C} = \{(a, b) : a \le b \in \mathbb{R}\}$. Then it is easy to establish the formula $S(n, \mathcal{C}) = 1 + \binom{n+1}{2}$. For $n = 2$, this is equal to four, which is also $2^n$. For $n = 3$, $1 + \binom{4}{2} = 7 < 2^n$. Consequently, this is a VC class with VC dimension two.
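The shattering counts in the last two examples can be verified by brute force. A sketch (the function names are ours) that enumerates the distinct subsets cut out by half-lines and by intervals on a set of distinct points:

```python
from math import comb

def shatter_count_half_lines(points):
    # Distinct subsets cut out by sets (-inf, x], x in R: the empty set
    # plus every "prefix" of the sorted points; there are n + 1 in all.
    pts = sorted(set(points))
    subsets = {frozenset()}
    for i in range(len(pts)):
        subsets.add(frozenset(pts[:i + 1]))
    return len(subsets)

def shatter_count_intervals(points):
    # Distinct subsets cut out by intervals (a, b): the empty set plus every
    # run of consecutive order statistics; 1 + C(n+1, 2) in all.
    pts = sorted(set(points))
    n = len(pts)
    subsets = {frozenset()}
    for i in range(n):
        for j in range(i, n):
            subsets.add(frozenset(pts[i:j + 1]))
    return len(subsets)

pts = [0.3, 1.2, 2.7, 3.1, 4.8]          # n = 5
half = shatter_count_half_lines(pts)     # expected: n + 1 = 6
inter = shatter_count_intervals(pts)     # expected: 1 + C(6, 2) = 16
```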
Example 16.6. The previous example says that on $\mathbb{R}$, the class of all convex sets is a VC class. However, this is far from being true even in two dimensions. Indeed, if we let $\mathcal{C}$ be just the class of convex polygons in the plane, it is clear geometrically that for any $n$, $\mathcal{C}$ can shatter some set of $n$ points (e.g., $n$ points on a circle). So, convex polygons in $\mathbb{R}^2$ have an infinite VC dimension.
More examples of exact values of VC dimensions are given in the chapter ex-
ercises. For actual applications of these ideas to concrete extensions of Donsker’s
principles, it is extremely useful to know what other natural classes of sets in various
spaces are VC classes. The various parts of the following result are available in
Vapnik and Chervonenkis (1971) and Dudley (1978, 1979).
Theorem 16.8. Each of the following classes of sets is a VC class.
(a) The class of all southwest quadrants of $\mathbb{R}^d$; that is, the class of all sets of the form $\prod_{i=1}^d (-\infty, x_i]$.
(b) The class of all closed half-spaces of Rd .
(c) The class of all closed balls of Rd .
(d) The class of all closed rectangles of Rd .
(e) C D ffx 2 Rd W g.x/ 0g W g 2 Gg, where G is a finite-dimensional vector
space of real-valued functions defined on Rd .
(f) Projections of a class of the form in part (e) onto a smaller number of
coordinates.
There are practically useful ways to generate new VC classes from known ones.
By using the various parts of Theorem 16.7, one can obtain the following useful
results.
Theorem 16.9. (a) $\mathcal{C}$ is VC implies $\mathcal{C}^c = \{C^c : C \in \mathcal{C}\}$ is VC.
(b) $\mathcal{C}$ is VC implies $\phi(\mathcal{C})$ is VC for any one-to-one function $\phi$, where $\phi(\mathcal{C})$ is the class of images of sets in $\mathcal{C}$ under $\phi$.
(c) $\mathcal{C}, \mathcal{D}$ are VC implies $\mathcal{C} \cap \mathcal{D}$ and $\mathcal{C} \cup \mathcal{D}$ are VC, where the intersection and union classes are defined as the classes of intersections $C \cap D$ and unions $C \cup D$, $C \in \mathcal{C}, D \in \mathcal{D}$.
(d) $\mathcal{C}$ is VC in some space $S_1$, and $\mathcal{D}$ is VC in some space $S_2$, implies that the Cartesian product class $\mathcal{C} \otimes \mathcal{D}$ is VC in the product space $S_1 \otimes S_2$.
We can now state a general version of the familiar Glivenko–Cantelli theorem. It says that a VC class is a Glivenko–Cantelli class. The following famous theorem of Vapnik and Chervonenkis (1971) on Euclidean spaces is considered a definitive result on the problem.
Theorem 16.10. Let $X_1, X_2, \ldots \stackrel{iid}{\sim} P$, a probability measure on $\mathbb{R}^d$ for some finite $d$. Given any class of (measurable) sets $\mathcal{C}$, for $n \ge 1,\ \epsilon > 0$,

$$P\left(\sup_{C \in \mathcal{C}} |P_n(C) - P(C)| > \epsilon\right) \le 8\, E[\Delta_{\mathcal{C}}(X_1, \ldots, X_n)]\, e^{-n\epsilon^2/32} \le 8\, S(n, \mathcal{C})\, e^{-n\epsilon^2/32}.$$
Remark. This theorem implies that for classes of sets that are of the right complexity, as measured by the VC dimension, the empirical measure converges to the true measure at an essentially exponential rate. This is a sophisticated generalization of the one-dimensional DKW inequality. The first bound of this theorem is harder to implement because it involves computation of a hard expectation, namely $E[\Delta_{\mathcal{C}}(X_1, X_2, \ldots, X_n)]$. It would usually not be possible to find this expectation analytically, although simulating the quantity $\Delta_{\mathcal{C}}(X_1, X_2, \ldots, X_n)$ would be an interesting exercise.
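A small simulation (the names are ours) makes the remark concrete for the half-lines of Example 16.4, where $S(n, \mathcal{C}) = n + 1$ and the supremum reduces to the one-dimensional Kolmogorov deviation; note that the constant 32 in the exponent makes the bound informative only when $n\epsilon^2$ is large:

```python
import math
import random

def sup_dev_half_lines(sample):
    # sup over C = (-inf, t] of |P_n(C) - P(C)| equals sup_t |F_n(t) - F(t)|,
    # attained at the order statistics (uniform data, so F(t) = t)
    xs = sorted(sample)
    n = len(xs)
    return max(max(i / n - x, x - (i - 1) / n)
               for i, x in enumerate(xs, 1))

random.seed(2)
n, eps, reps = 5000, 0.3, 200
exceed = sum(sup_dev_half_lines([random.random() for _ in range(n)]) > eps
             for _ in range(reps)) / reps
# Vapnik-Chervonenkis bound with S(n, C) = n + 1 for half-lines:
vc_bound = 8 * (n + 1) * math.exp(-n * eps * eps / 32)
```

The observed exceedance frequency sits (far) below the bound, as the theorem guarantees.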
The general theorem is given next; see Giné (1996).
$$\frac{\log \Delta_{\mathcal{C}}(X_1, X_2, \ldots, X_n)}{n} \xrightarrow{P} 0,$$

as was desired.
Theorem 16.8 and Theorem 16.9 give us hope for establishing CLTs for suit-
ably normalized versions of supC 2C jPn .C / P .C /j in general spaces and with
general VC classes of sets. It is useful to think of this as an analogue of the
one-dimensional Kolmogorov–Smirnov statistic for real-valued random variables,
namely, supx jFn .x/ F .x/j. Invariance principles allowed us to conclude that the
limiting distribution is related to a Brownian bridge, with real numbers in Œ0; 1 as
the time parameter. Now, however, the setup is much more abstract. The space is
not an Euclidean space, and the time parameter is a set or a function. So the for-
mulation and description of the appropriate CLTs is more involved, and although
suitable Gaussian processes will still emerge as the relevant processes that deter-
mine the asymptotics, they are not Brownian bridges, and they even depend on the
underlying P from which we are sampling. Some of the most profound advances
in the theory of statistics and probability in the twentieth century have taken place
around this problem, resulting along the way in deep mathematical developments
and completely new tools. A short description of this is provided next.
First, some notation and definitions. We recall that the notation $(P_n - P)(f)$ means $\int f\, dP_n - \int f\, dP$. Here, $f$ is supposed to belong to some suitable class of functions $\mathcal{F}$. For example, $\mathcal{F}$ could be the class of indicator functions of the members $C$ of a class of sets $\mathcal{C}$. In that case, $(P_n - P)(f)$ would simply mean
16.4 CLTs for Empirical Measures and Applications

We now discuss what sorts of assumptions on our class of functions $\mathcal{F}$ will ensure that weak convergence occurs (i.e., a CLT holds), and also what are some good applications of such CLTs. There are multiple sets of assumptions on the class of functions $\mathcal{F}$ that ensure a CLT. Here, we describe only two, one of which relates
to the concept of VC classes, and the second of which relates to metric entropy and packing numbers. Inasmuch as we are already familiar with the concept of VC classes, we first state a CLT based on a VC assumption on a suitable class of sets.
Definition 16.8. In the same setup as the previous definition, a bracket is the set of functions sandwiched between two given functions $l, u$; that is, a bracket is the set $\{f : l(s) \le f(s) \le u(s)\ \forall s \in S\}$. It is denoted by $[l, u]$.
Discussion: Clearly, the smaller the radius $\epsilon$ of the balls or the width of the brackets, the greater the number of balls or brackets necessary to cover the function class $\mathcal{F}$. The important thing is to pin down, qualitatively, the rate at which the entropy (with or without bracketing) goes to $\infty$ for a given $\mathcal{F}$, as $\epsilon \to 0$. It turns out, as we show, that for many interesting and useful classes of functions $\mathcal{F}$, this rate is of the order of $(-\log \epsilon)$, and this will, by virtue of some theorems given below, ensure that the class $\mathcal{F}$ is $P$-Donsker.
Example 16.9. As mentioned above, the key to the applicability of the entropy theorems is a good upper bound on the rate of growth of the entropy numbers of the class. Such bounds have been worked out for many intuitively interesting classes. The bounds are sometimes sharp, in the sense that lower bounds can also be obtained that grow at the same rate as the upper bounds. In nearly every case mentioned in this example, the derivation of the upper bound is completely nontrivial. A very good reference is van der Vaart and Wellner (1996), particularly Chapter 2.7 there.
Uniformly Bounded Monotone Functions on $\mathbb{R}$. For this function class $\mathcal{F}$, $\log N_{[\,]}(\epsilon, \mathcal{F}, \|\cdot\| = L_2(P)) \le \frac{K}{\epsilon}$, where $K$ is a universal constant independent of $P$, and so by part (a) of Theorem 16.11, this class is in fact universal $P$-Donsker.
Uniformly Bounded Lipschitz Functions on Bounded Intervals in $\mathbb{R}$. Let $\mathcal{F}$ be the class of real-valued functions on a bounded interval $I$ in $\mathbb{R}$ that are uniformly bounded by a universal constant and are uniformly Lipschitz of some order $\alpha > \frac12$; that is, $|f(x) - f(y)| \le M|x - y|^{\alpha}$, uniformly in $x, y$ and for some finite universal constant $M$. For this class, $\log N_{[\,]}(\epsilon, \mathcal{F}, \|\cdot\| = L_2(P)) \le K\left(\frac{1}{\epsilon}\right)^{1/\alpha}$, where $K$ depends only on the length of $I$, $M$, and $\alpha$, and so this class is also universal $P$-Donsker.
Compact Convex Subsets of a Fixed Compact Set in $\mathbb{R}^d$. Suppose $S$ is a compact set in $\mathbb{R}^d$ for some finite $d$, and let $\mathcal{C}$ be the class of all compact convex subsets of $S$. For any absolutely continuous $P$, this class satisfies $\log N_{[\,]}(\epsilon, \mathcal{C}, \|\cdot\| = L_2(P)) \le K\left(\frac{1}{\epsilon}\right)^{d-1}$, where $K$ depends on $S$, $P$, and $d$. Here it is meant that the function class is the set of indicators of the members of $\mathcal{C}$. Thus, for $d = 2$, $\mathcal{F}$ is $P$-Donsker for any absolutely continuous $P$.
A common implication of all of these applications of the entropy theorems is that in each of these cases, asymptotic goodness-of-fit tests can be constructed by using these function classes.
turns out to be a rather happy coincidence, because empirical processes are, after
all, roughly Gaussian. A highly ingenious technique that has been widely used to
turn this near Gaussianity into useful maximal inequalities for the empirical process
is the symmetrization method. Ultimately, the inequalities on tail probabilities are
aimed at establishing some sort of an exponential decay for the tail, and the mo-
ment inequalities often end up having ties to covering and entropy numbers of the
class F . In this way, maximal inequalities for Gaussian processes, the technique of
symmetrization, and covering numbers come together very elegantly to produce a
collection of sophisticated results on deviations of the empirical from the true. The
treatment below gives a flavor of this part of the theory. We follow Pollard (1989)
and Giné (1996).
Preview. It is helpful to have a summary of the main ideas behind the technical
development of the ultimate inequalities. The path to the inequalities is roughly the
following.
(a) Use existing Gaussian process theory to write down inequalities for the mo-
ments of the maxima of a Gaussian process in terms of its covariance function.
This has connections to covering numbers.
(b) Invent a sequence of suitable new random variables $\eta_1, \eta_2, \ldots$, and from these, a sequence of new processes $Z_n,\ n \ge 1$, such that conditional on the $X_i$, $Z_n$ is a Gaussian process, and such that the tails of $Z_n$ are heavier than the tails of the empirical process. This is the symmetrization part of the story.
(c) Apply the Gaussian process moment inequalities to the Zn process conditional
on the Xi , and then uncondition them. Because Zn has heavier tails than the
empirical process, these unconditioned moment bounds will also be valid for
the empirical process.
(d) Use the moment bounds to bound in turn tail probabilities for the empirical
process.
These steps are sketched below. We need a new definition and a Gaussian process
inequality.
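Step (b) can be illustrated concretely. The sketch below is our own (not from the text): for indicators $f_\theta = 1_{[0,\theta]}$, it builds $Z_n(X, f) = n^{-1/2}\sum_i W_i f(X_i)$ with $W_i$ iid $N(0,1)$ and checks by Monte Carlo that, conditional on the data, $Z_n$ is a mean-zero Gaussian process with covariance $P_n(fg)$.

```python
import random, math

random.seed(7)
n, reps = 200, 4000
X = [random.random() for _ in range(n)]          # iid U[0,1] sample, held fixed
thetas = [0.2, 0.5, 0.8]
F = [[1.0 if x <= th else 0.0 for x in X] for th in thetas]  # f_theta(X_i) values

# Conditional on X, Z_n(f) = n^{-1/2} sum_i W_i f(X_i), W_i iid N(0,1), is a
# mean-zero Gaussian process with covariance Cov(Z_n(f), Z_n(g)) = P_n(f g).
draws = []
for _ in range(reps):
    W = [random.gauss(0.0, 1.0) for _ in range(n)]
    draws.append([sum(w * fv for w, fv in zip(W, Fj)) / math.sqrt(n) for Fj in F])

mc_cov = [[sum(d[j] * d[k] for d in draws) / reps for k in range(3)] for j in range(3)]
pn_cov = [[sum(a * b for a, b in zip(F[j], F[k])) / n for k in range(3)] for j in range(3)]
for j in range(3):
    print(["%.3f vs %.3f" % (mc_cov[j][k], pn_cov[j][k]) for k in range(3)])
```

The Monte Carlo covariances agree with $P_n(fg)$ up to simulation noise, which is all that the Gaussian process lemma needs in order to apply conditionally on the data.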
There are similar inequalities for other moments of $\sup_{t \in T}|Z(t)|$, and those too can be used to bound the tail probabilities.
Theorem 16.14. Let $\mathcal{F}$ be a given class of functions and let, for $n \ge 1$, $\sup_{f \in \mathcal{F}}|\sqrt{n}(P_n - P)(f)| = \Delta_n$. Suppose the family $\mathcal{F}$ has an envelope function $F$ bounded by $M < \infty$. Then, for a suitable universal finite constant $K$,
$$E[\Delta_n^2] \le 2\pi K^2\, E\left(\int_0^M \sqrt{\log D(x, \mathcal{F}, P_n)}\,dx\right)^2,$$
where $D(x, \mathcal{F}, P_n)$ is the largest number $N$ for which we can find functions $f_1, f_2, \ldots, f_N \in \mathcal{F}$ such that $\sqrt{P_n(f_i - f_j)^2} > x$ for any $i, j,\ i \ne j$.
Proof. We give a sketch of its proof. For this proof, we use the notation $E_{\{X\}}$ to denote conditional expectation given $X_1, X_2, \ldots, X_n$. The bulk of the proof consists of showing the following.
(a) Construct a new process $Z_n(X, f),\ f \in \mathcal{F}$, based on a sequence of new random variables described below, such that for any $X_1, X_2, \ldots, X_n$,
$$\left(E_{\{X\}} \sup_{f \in \mathcal{F}} |Z_n(X, f)|^2\right)^{1/2} \le K \int_0^M \sqrt{\log D(x, \mathcal{F}, P_n)}\,dx.$$
(b) Show that $E[\Delta_n^2] \le 2\pi\, E[\sup_{f \in \mathcal{F}} |Z_n(X, f)|^2]$; note that here $E$ means unconditional expectation.
(c) Square the bound in part (a) on the conditional expectation, and take another expectation to get a bound on the unconditional expectation. Then combine it with the bound in part (b), and the theorem falls out. Thus, part (a) and part (b) easily imply the result of the theorem.
To prove parts (a) and (b), we construct a duplicate iid sequence $X_1', X_2', \ldots, X_n'$ from our basic measure $P$, and iid $N(0,1)$ variables $W_1, W_2, \ldots, W_n$, such that the entire collection of variables $\{X_1, \ldots, X_n\}, \{X_1', \ldots, X_n'\}, \{W_1, \ldots, W_n\}$ is mutually independent. Let $\eta_i = \frac{W_i}{|W_i|},\ 1 \le i \le n$; note that the $\eta_i$ are symmetric $\pm 1$ valued random variables, and $\eta_i$ is independent of $|W_i|$. We now define our new process $Z_n$:
$$Z_n(X, f) = n^{-1/2}\sum_{i=1}^n W_i f(X_i) = n^{-1/2}\sum_{i=1}^n \eta_i |W_i| f(X_i).$$
Note that conditionally, given the $X_i$, this is a Gaussian process, and the Gaussian process lemma stated above will apply to it. The duplicate sequence $\{X_1', X_2', \ldots, X_n'\}$ has not so far been used, but is used very cleverly in the steps below:
$$E[\Delta_n^2] = E\left[\sup_f\left|\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n f(X_i) - \int f\,dP\right)\right|^2\right] = E\left[\sup_f\left|\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n f(X_i) - \frac{1}{n}\sum_{i=1}^n Ef(X_i')\right)\right|^2\right]$$
$$= \frac{1}{n}\,E\left[\sup_f\left|\sum_{i=1}^n \left[f(X_i) - Ef(X_i')\right]\right|^2\right] = \frac{1}{n}\,E\left[\sup_f\left|\sum_{i=1}^n \left[f(X_i) - E_{\{X\}}f(X_i')\right]\right|^2\right]$$
(inside the supremum, use the fact that for any $U$, $|E(U)|^2 \le E(U^2)$)
$$\le \frac{1}{n}\,E\left[E_{\{X\}}\sup_f\left|\sum_{i=1}^n \left[f(X_i) - f(X_i')\right]\right|^2\right]$$
(use the fact that the supremum of $E_{\{X\}}$ of a collection is at most $E_{\{X\}}$ of the supremum of the collection)
$$= \frac{1}{n}\,E\left[\sup_f\left|\sum_{i=1}^n \eta_i\left[f(X_i) - f(X_i')\right]\right|^2\right]$$
(because the three sequences $\{\eta_i\}, \{X_i\}, \{X_i'\}$ are mutually independent, and the equality holds for each of the $2^n$ combinations for the vector $(\eta_1, \eta_2, \ldots, \eta_n)$)
$$\le \frac{4}{n}\,E\left[\sup_f\left|\sum_{i=1}^n \eta_i f(X_i)\right|^2\right]$$
(use the triangle inequality $\left|\sum_{i=1}^n \eta_i[f(X_i) - f(X_i')]\right| \le \left|\sum_{i=1}^n \eta_i f(X_i)\right| + \left|\sum_{i=1}^n \eta_i f(X_i')\right|$, and then the fact that $(|a|+|b|)^2 \le 2(a^2+b^2)$).

This implies, by the definition of $Z_n(X, f) = n^{-1/2}\sum_{i=1}^n \eta_i |W_i| f(X_i)$, and the fact that $E(|W_i|) = \sqrt{\frac{2}{\pi}}$,
$$E[\Delta_n^2] \le 4\cdot\frac{\pi}{2}\,E\sup_f|Z_n(X, f)|^2 = 2\pi\,E\sup_f|Z_n(X, f)|^2.$$
Finally, now, evaluate $E\sup_f|Z_n(X, f)|^2$ by first conditioning on $\{X_1, X_2, \ldots, X_n\}$, so that the Gaussian process lemma applies, and we get
$$E_{\{X\}}\sup_f|Z_n(X, f)|^2 \le K^2\left(\int_0^{\delta}\sqrt{\log D(x, \mathcal{F}, P_n)}\,dx\right)^2 \le K^2\left(\int_0^{M}\sqrt{\log D(x, \mathcal{F}, P_n)}\,dx\right)^2.$$
This proves both part (a) and part (b), and that leads to the theorem. □
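The data-dependent packing number $D(x, \mathcal{F}, P_n)$ of Theorem 16.14 can be computed exactly for a toy class. A sketch of our own (not from the text): for $\mathcal{F} = \{1_{[0,\theta]} : \theta \in [0,1]\}$, the squared $L_2(P_n)$ distance between two such indicators is the fraction of sample points falling between the two endpoints, so a maximal $x$-packing spaces the $\theta$'s more than $\lfloor x^2 n\rfloor$ sample points apart.

```python
import math

n = 400  # sample size; for F = {1_[0,theta]} the squared L2(P_n) distance
# between two indicators is (#sample points between the two thetas)/n

def packing_number(x):
    # thetas in a maximal packing are separated by more than floor(x^2 n) points
    need = int(math.floor(x * x * n)) + 1
    return 1 + n // need

for x in [0.05, 0.1, 0.2, 0.4]:
    print(f"x={x:4.2f}  D(x, F, P_n) ~ {packing_number(x)}")

# The entropy integral appearing in Theorem 16.14, via a Riemann sum on (0, 1]:
integral = sum(math.sqrt(math.log(packing_number(i / 200))) for i in range(1, 200)) / 200
print("entropy integral ~", round(integral, 3))
```

Here $D(x, \mathcal{F}, P_n)$ grows like $1/x^2$, so $\sqrt{\log D}$ is integrable near zero and the bound of the theorem is finite.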
16.6 Connection to the Poisson Process

$$\{nG_n(t),\ 0 \le t \le 1\} \overset{\mathcal{L}}{=} \{X(t),\ 0 \le t \le 1 \mid X(1) = n\}.$$
Proof. Denote
$$P(A_1) = P(A_2 \mid X(1) = n) = \frac{P(A_2 \cap \{M + \bar M = n\})}{P(M + \bar M = n)}$$
$$= \frac{1}{P(M + \bar M = n)}\sum_{k=0}^n P(A_2 \cap \{M = n-k\})\,P(\bar M = k)$$
$$\le \frac{\sup_{k \ge 0} P(\bar M = k)}{P(X(1) = n)}\sum_{k=0}^n P(A_2 \cap \{M = n-k\}) \le \frac{\sup_{k \ge 0} P(\bar M = k)}{P(X(1) = n)}\,P(A_2).$$
Now we use the fact that $\lfloor\lambda\rfloor$ is a mode of a Poisson distribution with mean $\lambda$ (see Theorem 1.25). This allows us to conclude
$$\sup_{k \ge 0} P(\bar M = k) \le e^{-n(1-T)}\,\frac{[n(1-T)]^{\lfloor n(1-T)\rfloor}}{\lfloor n(1-T)\rfloor!}.$$
Plugging this into the previous line,
$$P(A_1) \le \frac{1}{e^{-n}n^n/n!}\; e^{-n(1-T)}\,\frac{[n(1-T)]^{\lfloor n(1-T)\rfloor}}{\lfloor n(1-T)\rfloor!}\; P(A_2).$$
Now using Stirling's approximation for the factorials, it follows that
$$\frac{e^n n!}{n^n}\; e^{-n(1-T)}\,\frac{[n(1-T)]^{\lfloor n(1-T)\rfloor}}{\lfloor n(1-T)\rfloor!}$$
has a finite supremum over $n$, and we may take the constant $C$ of the theorem to be this supremum. This completes the proof of this theorem. □
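Both ingredients of this proof are easy to check numerically. The sketch below (our own, not from the text) verifies that $\lfloor\lambda\rfloor$ is a mode of the Poisson($\lambda$) distribution, and that the Stirling ratio above stays bounded in $n$ (it tends to $1/\sqrt{1-T}$).

```python
import math

def pois_pmf(k, lam):
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

# floor(lam) is a mode of Poisson(lam) (Theorem 1.25): no pmf value exceeds it
for lam in [0.7, 3.2, 10.5, 47.9]:
    mode_p = pois_pmf(math.floor(lam), lam)
    assert all(pois_pmf(k, lam) <= mode_p + 1e-12 for k in range(int(4 * lam) + 20))

# The Stirling ratio in the proof stays bounded in n; it tends to 1/sqrt(1-T):
T = 0.5
ratios = []
for n in [10, 100, 1000, 10000]:
    m = math.floor(n * (1 - T))
    log_ratio = (n + math.lgamma(n + 1) - n * math.log(n)
                 - n * (1 - T) + m * math.log(n * (1 - T)) - math.lgamma(m + 1))
    ratios.append(math.exp(log_ratio))
    print(f"n={n:6d}  ratio={ratios[-1]:.4f}")
print("1/sqrt(1-T) =", 1 / math.sqrt(1 - T))
```

The boundedness of this ratio is exactly what lets the constant $C$ of the theorem be taken as its supremum.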
This theorem, together with the following Poisson process inequality, ultimately leads to an exceedance probability for the empirical process. The function $h$ in the theorem below is $h(x) = x\log x - x + 1,\ x > 0$.
See Del Barrio et al. (2007, p. 126) for this Poisson process inequality.
Here is our final exceedance probability for the uniform empirical process. By the usual means of a quantile transform, this can be translated into an exceedance probability for $\beta_n(t)$, the general one-dimensional empirical process. The theorem below establishes exponential bounds on the tail of the supremum of the empirical process. In spirit, this is similar to the DKW inequality (see Theorem 6.13). Processes with such an exponential rate of decay for the tail are called sub-Gaussian. Thus, the theorem below and the DKW inequality establish the sub-Gaussian nature of the one-dimensional empirical process.
Theorem 16.18 (Exceedance Probability for the Empirical Process). Let $0 < T < 1$, and $\alpha > 0$. The uniform empirical process $\alpha_n(t),\ 0 \le t \le 1$ satisfies the inequality
$$P\left(\sup_{0 \le t \le T}|\alpha_n(t)| \ge \alpha\sqrt{T}\right) \le C\,e^{-nT\,h\left(1 + \frac{\alpha}{\sqrt{nT}}\right)} \le C\,e^{-\frac{\alpha^2}{2} + \frac{\alpha^3}{6\sqrt{nT}}}.$$
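The sub-Gaussian tail can also be checked by simulation against the DKW inequality cited above (Theorem 6.13), $P(\sup_t \sqrt{n}|F_n(t) - t| \ge \lambda) \le 2e^{-2\lambda^2}$. A minimal Monte Carlo sketch of our own (not from the text):

```python
import random, math

random.seed(3)
n, reps, lam = 200, 5000, 1.0
exceed = 0
for _ in range(reps):
    U = sorted(random.random() for _ in range(n))
    # sup_t |F_n(t) - t| is attained at the order statistics of the sample
    d = max(max((i + 1) / n - u, u - i / n) for i, u in enumerate(U))
    if math.sqrt(n) * d >= lam:
        exceed += 1
tail = exceed / reps
dkw = 2.0 * math.exp(-2.0 * lam * lam)
print(f"MC estimate of P(sup sqrt(n)|F_n - F| >= {lam}) = {tail:.4f}; DKW bound = {dkw:.4f}")
```

The simulated exceedance probability sits below the exponential DKW bound, consistent with the sub-Gaussian behavior asserted by the theorem.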
Exercises
Exercise 16.1. Simulate iid $U[0,1]$ observations and plot the uniform empirical process $\alpha_n(t)$ for $n = 10, 25, 50$.
Exercise 16.2. Give examples of three functions that are members of $D[0,1]$, but not of $C[0,1]$.
Exercise 16.3. For a given $n$, what is the set of possible values of the Cramér–von Mises statistic $C_n^2$?
Exercise 16.4. Find an expression for $E(C_n^2)$.
Exercise 16.5. By using the Karhunen–Loève expansion for a Brownian bridge, give a proof of the familiar calculus formula $\sum_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6}$.
Exercise 16.8 * (Deviations from the Three Sigma Rule). Let $X_1, X_2, \ldots$ be iid $N(\mu, \sigma^2)$, and let $\bar X_n, s_n$ be the mean and the standard deviation of the first $n$ observations. On suitable centering and normalization, find the limiting distribution of the number of observations among the first $n$ that are between $\bar X_n \pm k s_n$, where $k$ is a fixed general positive number.
Exercise 16.9 (Rate in Strong Approximation). Prove part (a) of Theorem 16.5
by using the KMT inequality in part (b), using x D xn D c log n, and then by
invoking the Borel–Cantelli lemma.
Exercise 16.10 (Strong Approximation of Quantile Process). Show that the
strong approximation result in part (c) of Theorem 16.5 holds for any normal and
any Cauchy distribution.
Exercise 16.11 * (Strong Approximation of Quantile Process). Give an example
of a distribution for which the conditions of part (c) of Theorem 16.5 do not hold.
Hint: Look at distributions with very slow tails.
Exercise 16.12 (Nearest Neighbors Are Close). Let $X_1, X_2, \ldots$ be iid from a distribution $P$ supported on a compact set $C$ in $\mathbb{R}^d$. Let $K$ be a fixed compact set in $\mathbb{R}^d$. Prove that $\sup_{x \in K \cap C}\|X_{n,NN,x} - x\| \overset{a.s.}{\to} 0$, where for a given $x$, $X_{n,NN,x}$ is a nearest neighbor of $x$ among $X_1, X_2, \ldots, X_n$.
Is this result true without the assumption of compact support for $P$?
Exercise 16.13. Consider the weight function $w_\alpha(t) = t^\alpha$ for $0 \le t \le \frac12$, and $w_\alpha(t) = (1-t)^\alpha$ for $\frac12 \le t \le 1$. For what values of $\alpha,\ \alpha \ge 0$, does $w_\alpha(t)$ satisfy part (a) of the Chibisov–O'Reilly theorem? Part (b) of the Chibisov–O'Reilly theorem?
Exercise 16.14. Let $B(t)$ be a Brownian bridge on $[0,1]$. What can you say about $\limsup_{t \to 0}\frac{|B(t)|}{\sqrt{t(1-t)}}$?
Exercise 16.18. Give examples of three nontrivial classes of sets in $\mathbb{R}^d$ that are not VC classes.
Exercise 16.20 *. Is the VC dimension of the class of all ellipsoids of $\mathbb{R}^d$ the same as that of the class of all closed balls of $\mathbb{R}^d$?
Exercise 16.21. Find the VC dimension of the class of all simplices in $\mathbb{R}^d$, where a simplex is a set of the form $C = \{x : x = \sum_{i=1}^{d+1} p_i x_i,\ p_i \ge 0,\ \sum_{i=1}^{d+1} p_i = 1\}$.
Exercise 16.23 * (Huge Classes with Low VC Dimension). Show that it is possi-
ble for a class of sets of integers to have uncountably many sets, and yet to have a
VC dimension of just one.
Exercise 16.26 *. Design a test for testing that sample observations in $\mathbb{R}^2$ are iid from a uniform distribution in the unit square by using suitable VC classes and applying the CLT for empirical measures.
Exercise 16.27. Let $P$ be a distribution on $\mathbb{R}^d$, and $P_n$ the empirical measure. Let $\mathcal{C}$ be a VC class. Show that $E\left[\sup_{C \in \mathcal{C}}|P_n(C) - P(C)|\right] = O\left(\sqrt{\frac{\log n}{n}}\right)$.
(a) Show that the $L_p$ norm of $X$, namely $\|X\|_p = (E[|X|^p])^{1/p}$, is an Orlicz norm for all $p \ge 1$.
(b) Consider $\psi_p(x) = e^{x^p} - 1$ for $p \ge 1$. Show that $\|X\|_p \le p!\,\|X\|_{\psi_1}$.
(c) Show that $\|X\|_2 \le \|X\|_{\psi_2}$.
(d) Obtain the tail bound $P(|X| \ge x) \le \frac{1}{\psi\left(\frac{x}{\|X\|_\psi}\right)}$ for a general such $\psi$.
Exercise 16.34 (Metric Entropy). Suppose $T$ is a compact subset of $[0,1]$. Show that
(a) If $T$ has positive Lebesgue measure, then $N(\epsilon, T, \|\cdot\|)$ is of the exact order $\frac{1}{\epsilon}$, where the norm $\|\cdot\|$ is the usual distance between two reals.
(b) If $T$ has zero Lebesgue measure, then $N(\epsilon, T, \|\cdot\|) = o\left(\frac{1}{\epsilon}\right)$.
Exercise 16.36. Prove the following generalization of the binomial tail inequality in Example 16.7: suppose $W_1, W_2, \ldots, W_n$ are independent Bernoulli variables with parameters $p_1, p_2, \ldots, p_n$. Then, for any $k$, $P\left(\sum_{i=1}^n W_i > k\right) \le \left(\frac{e n \bar p}{k}\right)^k$, where $\bar p = \frac{p_1 + p_2 + \cdots + p_n}{n}$.
References
Alexander, K. (1987). The central limit theorem for empirical processes on Vapnik–Chervonenkis classes, Ann. Prob., 15, 178–203.
Beran, R. and Millar, P. (1986). Confidence sets for a multinomial distribution, Ann. Statist., 14, 431–443.
Chapter 17
Large Deviations
The mean $\mu$ of a random variable $X$ is arguably the most common one-number summary of the distribution of $X$. Although averaging is a primitive concept with some natural appeal, the mean is a useful summary only when the random variable $X$ is concentrated around the mean $\mu$, that is, when probabilities of large deviations from the mean are small. The most basic large deviation inequality is Chebyshev's inequality, which says that if $X$ has a finite variance $\sigma^2$, then $P(|X - \mu| > k\sigma) \le \frac{1}{k^2}$. But, usually, this inequality is not strong enough in specific applications, in the sense that the assurance we seek is much stronger than what Chebyshev's inequality will give us. The theory of large deviations is a massive and powerful mathematical enterprise that gives bounds, usually of an exponential nature, on probabilities of the form $P(f(X_1, X_2, \ldots, X_n) - E[f(X_1, X_2, \ldots, X_n)] > t)$, where $X_1, X_2, \ldots, X_n$ is some set of $n$ random variables, not necessarily independent, $f(X_1, X_2, \ldots, X_n)$ is a suitable function of them, and $t$ is a given positive number. We expect that the probability $P(f(X_1, \ldots, X_n) - E[f(X_1, \ldots, X_n)] > t)$ is small for large $n$ if we know from some result that $f(X_1, \ldots, X_n) - E[f(X_1, \ldots, X_n)] \overset{P}{\Rightarrow} 0$; the theory of large deviations attempts to quantify how small this probability is.
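To see numerically how loose Chebyshev's bound is compared with an exponential tail, one can set it against the exact normal tail. A quick sketch of our own (not from the text):

```python
import math

def chebyshev_bound(k):
    # P(|X - mu| > k*sigma) <= 1/k^2 for any X with finite variance
    return 1.0 / (k * k)

def normal_tail(k):
    # exact P(|Z| > k) for Z ~ N(0,1), via the complementary error function
    return math.erfc(k / math.sqrt(2.0))

for k in [1.5, 2.0, 3.0, 4.0]:
    print(f"k={k}: Chebyshev {chebyshev_bound(k):.4f} vs exact normal {normal_tail(k):.6f}")
```

Already at $k = 3$ the exponential (here, Gaussian) tail is two orders of magnitude smaller than the Chebyshev bound, which is the gap that large deviation theory quantifies in general.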
The basic Chernoff–Bernstein inequality (see Theorem 1.34) is generally regarded as the first and most fundamental large deviation inequality. The Chernoff–Bernstein inequality is for $f(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n X_i$ in the one-dimensional iid case, assuming the existence of an mgf for the common distribution of the $X_i$. Since then, the theory of large deviations has grown by leaps and bounds in every possible manner. The theory covers dependent sequences, higher and even infinite dimensions, nonlinear functions $f$, much more abstract spaces than Euclidean spaces, and in certain cases the existence of an mgf is no longer assumed.
Numerous excellent book-length treatments of the theory of large deviations are
now available. We can only give a flavor of this elegant theory in a few selected
cases. It should be emphasized that large deviations is one area of probability where
beautiful mathematics has smoothly merged with outstanding concrete applications
over a diverse set of problems and areas. It is now seeing applications in emerging
areas of contemporary statistics, such as multiple testing in high-dimensional situations. The importance of large deviations is only likely to grow.
17.1 Large Deviations for Sample Means

Although large deviation theory has been worked out for statistics that are far more general than sample means, and without requiring that the underlying sequence $X_1, X_2, \ldots$ be iid or even independent, for historical importance we start with the case of sample means of iid random variables in one dimension.
We begin with the basic Cramér–Chernoff theorem. This result may be regarded as the starting point of large deviation theory.
Theorem 17.1. Suppose $X_1, X_2, \ldots$ are iid zero mean random variables with an mgf $\psi(z) = E(e^{zX_1})$, assumed to exist for all real $z$. Let $k(z) = \log\psi(z)$ be the cumulant generating function of $X_1$. Then, for fixed $t > 0$,
$$\lim_n \frac{1}{n}\log P(\bar X > t) = -I(t) = \inf_{z \in \mathbb{R}}(k(z) - tz) = -\sup_{z \in \mathbb{R}}(tz - k(z)).$$
The function $I(t)$ is called the rate function corresponding to $F$, the common distribution of the $X_i$. Because we assume the existence of a mean (in fact, the existence of an mgf), by the WLLN $\bar X$ converges in probability to zero. Therefore, we already know that $P(\bar X > t)$ is small for large enough $n$. According to the Cramér–Chernoff theorem, $\lim_n \frac{1}{n}\log P(\bar X > t) = -I(t)$, and so for large $n$, $\frac{1}{n}\log P(\bar X > t) \approx -I(t)$. Therefore, as a first approximation, $P(\bar X > t) \approx e^{-nI(t)}$. In other words, assuming the existence of an mgf, $P(\bar X > t)$ converges to zero at an exponential rate, and $I(t)$ exactly characterizes that exponential rate. This justifies the name rate function for $I(t)$.
Actually, it would be natural to consider $P(\bar X > t)$ itself, rather than its logarithm. But the quantity $P(\bar X > t)$ is a sequence of the form $c_n(t)e^{-nI(t)}$, for some suitable sequence $c_n(t)$, which does not converge to zero at an exponential rate. Pinning down the exact asymptotics of the $c_n$ sequence is a difficult problem; see Exercise 17.13. If we instead look at $\frac{1}{n}\log P(\bar X > t)$, then $-I(t)$ becomes the dominant term, and analysis of the sequence $c_n$ can be avoided.
Proof of the Cramér–Chernoff theorem. For $n \ge 1$, let $S_n = \sum_{i=1}^n X_i$. First note that we may, by a simple translation argument, take $t$ to be zero, and correspondingly the mean $\mu$ of $X_1$ to be $< 0$. This reduces the theorem to showing that $\lim_n \frac{1}{n}\log P(\bar X \ge 0) = -I(0)$. Because $\bar X \ge 0$ if and only if $S_n \ge 0$, we need to show that $\lim_n \frac{1}{n}\log P(S_n \ge 0) = -I(0)$.
If the common CDF $F$ of the $X_i$ is supported on $(-\infty, 0]$, that is, if $P(X_1 \le 0) = 1$, then the theorem falls out easily. This case is left as an exercise. We therefore consider the case where $P(X_1 < 0) > 0,\ P(X_1 > 0) > 0,\ \mu = E(X_1) < 0$.
We now observe a few important facts that are used in the rest of the proof.
(a) $\psi(0) = 1;\ \psi'(0) = \mu < 0;\ \psi''(z) > 0\ \forall\, z \in \mathbb{R}$. In particular, $\psi$ is strictly convex, and $\rho = \inf_{z \in \mathbb{R}}\psi(z)$ is attained at some finite $\tau > 0$, with $\psi'(\tau) = 0$ and $I(0) = -\log\rho$. The theorem therefore reduces to showing
$$\lim_n \frac{1}{n}\log P(S_n \ge 0) = \log\rho.$$
This consists of showing that $\limsup_n \frac{1}{n}\log P(S_n \ge 0) \le \log\rho$ and $\liminf_n \frac{1}{n}\log P(S_n \ge 0) \ge \log\rho$. Of these, the first inequality is nearly immediate. Indeed, for any $z > 0$,
$$P(S_n \ge 0) = P(e^{zS_n} \ge 1) \le E[e^{zS_n}] = \psi(z)^n,$$
so that $\frac{1}{n}\log P(S_n \ge 0) \le \log\psi(z)$ for every $z > 0$, and hence $\limsup_n \frac{1}{n}\log P(S_n \ge 0) \le \log\rho$.
For the lower bound, define the exponentially tilted distribution $\tilde F$ by $d\tilde F(x) = \frac{e^{\tau x}}{\rho}\,dF(x)$. Note that $d\tilde F(x) = \frac{e^{\tau x}}{\rho}\,dF(x) \Rightarrow dF(x) = \rho\,e^{-\tau x}\,d\tilde F(x)$. Let $\tilde X_1, \tilde X_2, \ldots, \tilde X_n$ be iid with distribution $\tilde F$, and let $\tilde S_n = \sum_{i=1}^n \tilde X_i$. Because $\psi'(\tau) = 0$, the $\tilde X_i$ have mean zero. Also, the $\tilde X_i$ have a finite variance. With all these, for any given $n \ge 1$,
$$P(S_n \ge 0) = \int_{(x_1, \ldots, x_n):\, x_1 + \cdots + x_n \ge 0} dF(x_1)\,dF(x_2)\cdots dF(x_n)$$
$$= \rho^n \int_{(x_1, \ldots, x_n):\, x_1 + \cdots + x_n \ge 0} e^{-\tau(x_1 + \cdots + x_n)}\,d\tilde F(x_1)\cdots d\tilde F(x_n) = \rho^n\,E\left[e^{-\tau\tilde S_n}I_{\tilde S_n \ge 0}\right],$$
so that
$$\frac{1}{n}\log P(S_n \ge 0) = \log\rho + \frac{1}{n}\log E\left[e^{-\tau\tilde S_n}I_{\tilde S_n \ge 0}\right].$$
By the CLT, $\frac{\tilde S_n}{\sqrt{n}}$ is approximately a mean zero normal random variable. This implies (with just a little manipulation) that
$$E\left[e^{-\tau\tilde S_n}I_{\tilde S_n \ge 0}\right]$$
is bounded below by $c\,e^{-a\sqrt{n}}$ (with a lot of room to spare) for suitable $a, c > 0$, and so, by taking logarithms,
$$\liminf_n \frac{1}{n}\log E\left[e^{-\tau\tilde S_n}I_{\tilde S_n \ge 0}\right] \ge \liminf_n \frac{\log c - a\sqrt{n}}{n} = 0,$$
which gives $\liminf_n \frac{1}{n}\log P(S_n \ge 0) \ge \log\rho$, completing the proof.
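The tilting identity above doubles as a variance-reduction device. A minimal sketch of our own (not from the text): for $X_i$ iid $N(\mu, 1)$ with $\mu < 0$, $\psi(z) = e^{\mu z + z^2/2}$ is minimized at $\tau = -\mu$ with $\rho = e^{-\mu^2/2}$, and the tilted distribution $\tilde F$ is $N(0,1)$; the identity $P(S_n \ge 0) = \rho^n E[e^{-\tau\tilde S_n}I_{\tilde S_n \ge 0}]$ then yields an importance-sampling estimator that can be checked against the exact value $\Phi(\mu\sqrt{n})$.

```python
import random, math

random.seed(11)
mu, n = -0.5, 10
tau = -mu                                  # psi(z) = e^{mu z + z^2/2} is minimized at tau = -mu
rho_n = math.exp(-n * mu * mu / 2.0)       # rho^n, with rho = psi(tau) = e^{-mu^2/2}

reps, acc = 50000, 0.0
for _ in range(reps):
    s_tilde = sum(random.gauss(0.0, 1.0) for _ in range(n))   # S~_n under the N(0,1) tilt
    if s_tilde >= 0.0:
        acc += math.exp(-tau * s_tilde)
estimate = rho_n * acc / reps              # rho^n * E[e^{-tau S~_n} 1{S~_n >= 0}]

exact = 0.5 * math.erfc(-mu * math.sqrt(n) / math.sqrt(2.0))  # P(S_n >= 0), S_n ~ N(n mu, n)
print(f"tilted importance-sampling estimate {estimate:.5f}  vs exact {exact:.5f}")
```

The tilt moves the mean of the sampling distribution to zero, so that $\{\tilde S_n \ge 0\}$ is a typical event; naive simulation of the original rare event would need far more replicates for the same accuracy.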
What is the practical value of a large deviation result? The practical value is that, in the absence of a large deviation result, we would approximate a probability of the type $P(\bar X > t)$ by a CLT approximation, as that would be almost the only other approximation that we can think of. Indeed, it is common practice to use the CLT approximation in applied work. However, for fixed $t$, the CLT approximation is not going to give an accurate approximation to the true value of $P(\bar X > t)$. The CLT is supposed to be applied for those $t$ that are typical values for $\bar X$, that is, $t$ of the order of $\frac{1}{\sqrt{n}}$, but not for fixed $t$. The exponential tilting technique brilliantly reduces the problem to the case of a typical value, but for a new sequence $\tilde X_1, \tilde X_2, \ldots$. Thus, in comparison, an application of the Cramér–Chernoff theorem is going to produce a more accurate approximation than a straight CLT approximation. Whether it really is more accurate in a given case depends on the value of $t$. See Groeneboom and Oosterhoff (1977, 1980, 1981) for extensive finite sample numerics on the comparative accuracy of the CLT and the large deviation approximations to $P(\bar X > t)$.
We now work out some examples of the rate function $I(t)$ for some common choices of $F$.
Example 17.1 (Rate Function for Normal). Let $X_1, X_2, \ldots$ be iid $N(0, \sigma^2)$. Then $\psi(z) = e^{z^2\sigma^2/2}$ and $k(z) = \log\psi(z) = z^2\sigma^2/2$. Therefore, $k(z) - tz = z^2\sigma^2/2 - tz$ is a strictly convex quadratic in $z$ with a unique minimum at $z_0 = \frac{t}{\sigma^2}$, and a minimum value of $z_0^2\sigma^2/2 - tz_0 = -\frac{t^2}{2\sigma^2}$. Therefore, in this case, the rate function equals $I(t) = \frac{t^2}{2\sigma^2}$. This can be used to form the first approximation $P(\bar X > t) \approx e^{-nt^2/(2\sigma^2)}$, or, equivalently, $P(\bar X > t\sigma) \approx e^{-nt^2/2}$.
Note that in this example, the rate function $I(t)$ turns out to be convex and strictly positive unless $t = 0$. It is also smooth. These turn out to be true in general, as we later show.
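Both the variational formula and the limit can be checked numerically in this example. A sketch of our own (not from the text): compute $I(t) = \sup_z(tz - k(z))$ by a brute-force grid search for the standard normal and compare with $t^2/2$; then compare $\frac{1}{n}\log P(\bar X > t)$ (available in closed form here, since $\bar X \sim N(0, 1/n)$) with $-I(t)$.

```python
import math

def rate_function(t, k, z_lo=-50.0, z_hi=50.0, steps=100001):
    # I(t) = sup_z (t*z - k(z)), by brute-force grid search over [z_lo, z_hi]
    best = -float("inf")
    for i in range(steps):
        z = z_lo + i * (z_hi - z_lo) / (steps - 1)
        best = max(best, t * z - k(z))
    return best

k_normal = lambda z: z * z / 2.0          # cumulant generating function of N(0, 1)
t = 0.8
I = rate_function(t, k_normal)
print("numeric I(0.8) =", round(I, 5), "  t^2/2 =", t * t / 2.0)

# Compare (1/n) log P(Xbar > t) (exact: Xbar ~ N(0, 1/n)) with -I(t):
for n in [10, 100, 1000]:
    p = 0.5 * math.erfc(t * math.sqrt(n) / math.sqrt(2.0))
    print(f"n={n:5d}  (1/n) log P = {math.log(p) / n: .4f}   -I(t) = {-I:.4f}")
```

The normalized log probability converges to $-I(t)$, while the ratio of the two sides of $P(\bar X > t) \approx e^{-nI(t)}$ still varies with $n$ through the subexponential factor $c_n(t)$ discussed above.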
Example 17.2 (Rate Function in Bernoulli Case). Here is an example where the rate function $I(t)$ is related to the Kullback–Leibler distance between two suitable Bernoulli distributions (see Chapter 15 for the definition and properties of the Kullback–Leibler distance). Suppose $X_1, X_2, \ldots$ is an iid sequence of Bernoulli variables with parameter $p$. Then the mgf $\psi(z) = pe^z + q$, where $q = 1 - p$. In the Bernoulli case, the question of deriving the rate function is interesting only for $0 < t < 1$, because $P(\bar X > t) = 1$ if $t < 0$ and $P(\bar X > t) = 0$ if $t \ge 1$. Also, if $t = 0$, then, trivially, $P(\bar X > t) = 1 - (1-p)^n$.
For $0 < t < 1$,
$$\frac{d}{dz}\left[tz - \log\psi(z)\right] = t - \frac{pe^z}{pe^z + q} = \frac{qt - pe^z(1-t)}{pe^z + q},$$
and
$$\frac{d^2}{dz^2}\left[tz - \log\psi(z)\right] = -\frac{pqe^z}{(pe^z + q)^2} < 0\ \ \forall\, z.$$
Therefore, $tz - \log\psi(z)$ has a unique maximum at $z$ given by
$$qt = pe^z(1-t) \Leftrightarrow z = \log\frac{1-p}{p} + \log\frac{t}{1-t},$$
and at this $z$, $pe^z = \frac{qt}{1-t}$, so that the maximum value works out to $I(t) = t\log\frac{t}{p} + (1-t)\log\frac{1-t}{1-p}$. This is exactly the Kullback–Leibler distance $K(P, Q)$ between the Bernoulli distributions $P = \mathrm{Ber}(t)$ and $Q = \mathrm{Ber}(p)$:
$$K(P, Q) = \sum_{x=0}^{1} t^x(1-t)^{1-x}\log\frac{t^x(1-t)^{1-x}}{p^x(1-p)^{1-x}} = E_P\left[X\log\frac{t}{p} + (1-X)\log\frac{1-t}{1-p}\right] = t\log\frac{t}{p} + (1-t)\log\frac{1-t}{1-p}.$$
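The identification of the rate function with the Kullback–Leibler distance is easy to confirm numerically. A sketch of our own (not from the text), computing $\sup_z(tz - \log(pe^z + q))$ by grid search and comparing with the closed-form KL expression:

```python
import math

def k_bern(z, p):
    # cumulant generating function of Bernoulli(p)
    return math.log(p * math.exp(z) + 1.0 - p)

def rate_numeric(t, p):
    # sup_z (t*z - k(z)) over a wide grid
    return max(t * (i / 1000.0) - k_bern(i / 1000.0, p) for i in range(-20000, 20001))

def kl_bern(t, p):
    return t * math.log(t / p) + (1 - t) * math.log((1 - t) / (1 - p))

p = 0.3
for t in [0.4, 0.5, 0.7]:
    print(f"t={t}: numeric I(t) = {rate_numeric(t, p):.5f}   KL = {kl_bern(t, p):.5f}")
```

The two columns agree to grid precision, illustrating that large deviations of $\bar X$ above $t$ are governed by the KL distance between $\mathrm{Ber}(t)$ and $\mathrm{Ber}(p)$.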
Example 17.3 (The Cauchy Case). The Cramér–Chernoff theorem requires the existence of the mgf for the underlying $F$. Suppose $X_1, X_2, \ldots$ are iid $C(0,1)$. In this case, $\bar X$ is also distributed as $C(0,1)$ for all $n$ (see Chapter 8), and therefore, $\lim_n \frac{1}{n}\log P(\bar X > t) = \lim_n \frac{1}{n}\log P(X_1 > t) = 0$. As regards the mgf, $\psi(0) = 1$ and at any $z \ne 0$, $\psi(z) = \infty$. Therefore, formally, $\sup_z[tz - \log\psi(z)] = 0$. That is, formally, the rate function $I(t) \equiv 0$, which gives the correct answer for $\lim_n \frac{1}{n}\log P(\bar X > t)$.
The rate function $I(t)$ satisfies a number of general shape and smoothness properties. We need a few definitions and some notation to describe these properties. Given a CDF $F$, we let $\mu$ be its mean, and let
Theorem 17.2. The rate function $I(t)$ satisfies the following properties:
(a) $I(t) \ge 0$ for all $t$, and equals zero only when $t = \mu$.
(b) $I(t)$ is convex on $\mathbb{R}$. Moreover, it is strictly convex on the interior of $D$.
Proof. Part (a), the convexity part of part (b), and part (c) follow from simple facts in real analysis (such as: the supremum of a collection of continuous functions is lower semicontinuous, and the supremum of a collection of linear functions is convex). Part (e) follows from the properties of strict convexity and differentiability, by simply setting the first derivative of $tz - k(z)$ to zero. Part (d) is a deep property. See p. 121 in den Hollander (2000). □
A likelihood ratio test is a test that rejects $H_0$ when $\Lambda_n > c_n$ for some sequence of numbers $c_n$. The type I error probability of such a test is defined to be the probability of false rejection of $H_0$, namely,
$$\alpha_n = P_0(\Lambda_n > c_n),$$
and the type II error probability is defined to be the probability of false acceptance of $H_0$, namely,
$$\beta_n = P_1(\Lambda_n \le c_n).$$
Then,
By direct differentiation,
$$\frac{d}{dz}\left[zK(P_1, P_0) - \log\int f_1^z f_0^{1-z}\,dx\right] = K(P_1, P_0) - \frac{\int \log\frac{f_1}{f_0}\,f_1^z f_0^{1-z}\,dx}{\int f_1^z f_0^{1-z}\,dx},$$
which equals zero at $z = 1$; writing $\psi_0(z) = \int f_1^z f_0^{1-z}\,dx$, the point $z = 1$ is therefore the maximum of $z \mapsto zK(P_1, P_0) - \log\psi_0(z)$, and the maximum value is $K(P_1, P_0) - \log\psi_0(1) = K(P_1, P_0) - \log 1 = K(P_1, P_0)$, giving eventually,
$$\frac{1}{n}\log\alpha_n \to -K(P_1, P_0).$$
So, as a first approximation, we may write $\alpha_n \approx e^{-nK(P_1, P_0)}$; the larger the distance between $P_0$ and $P_1$, the smaller is the chance that the test will make a type I error. The same analysis can also be done for the type II error probability $\beta_n$.
Theorem 17.3. Suppose $X_1, X_2, \ldots$ are iid zero mean random variables, with an mgf $\psi(z) = E[e^{zX_1}]$, assumed to exist for all real $z$. Let $F, C$ be general (measurable) closed and open sets in $\mathbb{R}$, respectively, and let $I(t) = \sup_{z \in \mathbb{R}}[tz - k(z)]$, where $k(z) = \log\psi(z)$. For any given set $S$, denote $I(S) = \inf_{t \in S}I(t)$. Then,
(a) $\limsup_n \frac{1}{n}\log P(\bar X \in F) \le -I(F)$;
(b) $\liminf_n \frac{1}{n}\log P(\bar X \in C) \ge -I(C)$.
See Dembo and Zeitouni (1998, p. 27) for a proof of this theorem. When we compare the basic Cramér–Chernoff theorem with this more general theorem, we see that in this greater generality, we can no longer assert the existence of a limit for $\frac{1}{n}\log P(\bar X \in F)$ (or for $\frac{1}{n}\log P(\bar X \in C)$). We can assert the existence of a limit for $\frac{1}{n}\log P(\bar X \in S)$ for a general set $S$, provided $\inf_{t \in S^0}I(t) = \inf_{t \in \bar S}I(t)$, where $S^0, \bar S$ denote the interior and the closure of $S$. If this holds, then $\frac{1}{n}\log P(\bar X \in S)$ has a limit, and
$$\lim_n \frac{1}{n}\log P(\bar X \in S) = -\inf_{t \in S^0}I(t) = -\inf_{t \in \bar S}I(t) = -\inf_{t \in S}I(t) = -I(S).$$
This property is aesthetically desirable, but we can ensure it only when $I(t)$ does not have discontinuities on the boundary of the set $S$.
For example, consider singleton sets $S = \{t\}$, for fixed $t \in \mathbb{R}$. Now $P(\bar X \in S) = P(\bar X = t)$ is actually zero if the summands $X_i$ have a continuous distribution (i.e., if the $X_i$ have a density). If we insist on the equality of $\lim_n \frac{1}{n}\log P(\bar X \in S)$ and $-I(S)$, this would force $I(t)$ to be identically equal to $+\infty$. But if $I(t) \equiv \infty$, then the equality $\lim_n \frac{1}{n}\log P(\bar X \in S) = -I(S)$ will fail at other subsets $S$ of $\mathbb{R}$. This explains why the general Cramér–Chernoff theorem is in the form of lower and upper bounds, rather than an exact equality.
17.2 The Gärtner–Ellis Theorem and Markov Chain Large Deviations

For sample means of iid random variables, the mgf is determined from the mgf of the underlying distribution itself. And that underlying mgf determines the large deviation rate function $I(t)$. When we give up the iid assumption, there is no longer one underlying mgf that determines the final rate function, even if there are meaningful large deviation asymptotics in the new problem. Rather, one has a sequence of mgfs, one for each $n$, corresponding to whatever statistics $T_n$ we want to consider; for example, $T_n$ could still be a sequence of sample means, but when the $X_i$ have some dependence among themselves. Or, $T_n$ could be a more complicated statistic with some nonlinear structure.
It turns out that despite these complexities, under some conditions a large deviation rate can be established without imposing the restriction of an iid setup, or requiring that $T_n$ be the sequence of sample means. This greatly expands the scope of applications of large deviation theory, but in exchange for considerably more subtlety in exactly what assumptions are needed for which result to hold. The Gärtner–Ellis theorem, a special case of which we present below, is regarded as a major advance in large deviation theory, due to its wide-reaching applications. Although the assumptions (namely, (a)–(d) below) can fail in simple-looking problems, the Gärtner–Ellis theorem has also been successfully used to find large deviation rates in important non-iid setups, for example, for functionals of the form $T_n = \sum_{i=1}^n f(X_i)$, where $X_1, X_2, \ldots$ forms a Markov chain; see the examples below.
Suppose now $S$ is a set separated from the origin. Then, by the Gärtner–Ellis theorem, $\frac{1}{n}\log P(\bar X \in S)$ is going to be approximately equal to the negative of the minimum of $I(t) = \frac{1}{2}t'\Sigma^{-1}t$ as $t$ ranges over $S$. Now the contours of the function $t'\Sigma^{-1}t$ are ellipsoids centered at the origin. So, to get the limiting value of $\frac{1}{n}\log P(\bar X \in S)$, we keep drawing ellipsoids centered at the origin and with orientation determined by $\Sigma$, until for the first time the ellipsoid is just large enough to touch the set $S$. The point where the ellipsoid touches $S$ will determine the large deviation rate. This is a very elegant geometric connection to the probability problem at hand.
(recall that the notation $i$ stands for the initial state of the chain)
$$k_n(nz) = \log\sum_{x_n}\left[\Pi_z\right]^n_{i, x_n},$$
where $[\Pi_z]^n$ denotes the $n$th power of the matrix $\Pi_z$ and $[\Pi_z]^n_{i, x_n}$ means the $(i, x_n)$ element of it. This formula for $k_n(nz)$ leads to the large deviation rate for the sample mean of our Markov chain.
By part (e) of the Perron–Frobenius theorem (see Chapter 10), we get
$$\lim_n \frac{1}{n}k_n(nz) = \lim_n \frac{1}{n}\log\sum_{x_n}\left[\Pi_z\right]^n_{i, x_n} = \log\lambda_1(z),$$
where $\lambda_1(z)$ is the largest eigenvalue of $\Pi_z$ (see part (a) of the Perron–Frobenius theorem). Therefore, $\Lambda(z) = \lim_n \frac{1}{n}k_n(nz)$ exists for every $z$, and it is differentiable by the implicit function theorem (i.e., basically because $\lambda_1(z)$ is the maximum root of the determinant of $\Pi_z - \lambda I$, and the determinant itself is a smooth function of $z$).
It now follows from the Gärtner–Ellis theorem that the conclusions of both parts (a) and (b) of the theorem hold, with the large deviation rate function being $I(t) = \sup_z[tz - \log\lambda_1(z)]$.
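A concrete two-state sketch (our own illustration; the stay-probability `a` and the functional `f` are hypothetical choices): take state space $\{0, 1\}$, tilted matrix $[\Pi_z]_{xy} = \pi_{xy}e^{zf(y)}$, whose largest eigenvalue for a $2\times 2$ matrix is available in closed form, and compute $I(t) = \sup_z[tz - \log\lambda_1(z)]$ by grid search. At the stationary mean of $f$ the rate is zero, and it grows as $t$ moves away from it.

```python
import math

a = 0.7            # stay-probability of a symmetric two-state chain on {0, 1}
f = (0.0, 1.0)     # T_n/n = sample mean of f(X_i) = fraction of time spent in state 1

def lambda1(z):
    # largest eigenvalue of the tilted matrix [Pi_z]_{xy} = pi_{xy} e^{z f(y)}
    m00, m01 = a * math.exp(z * f[0]), (1 - a) * math.exp(z * f[1])
    m10, m11 = (1 - a) * math.exp(z * f[0]), a * math.exp(z * f[1])
    tr, det = m00 + m11, m00 * m11 - m01 * m10
    return tr / 2.0 + math.sqrt(tr * tr / 4.0 - det)

def rate(t):
    # I(t) = sup_z (t*z - log lambda1(z)), by grid search on [-20, 20]
    return max(t * (i / 500.0) - math.log(lambda1(i / 500.0)) for i in range(-10000, 10001))

for t in [0.5, 0.7, 0.9]:
    print(f"I({t}) = {rate(t):.5f}")
```

By symmetry the stationary distribution is $(1/2, 1/2)$, so the stationary mean of $f$ is $0.5$ and $I(0.5) = 0$, while $I(t) > 0$ for $t$ away from $0.5$, exactly as the Gärtner–Ellis rate function should behave.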
L
is iid with zero mean and a finite variance, then Tn ) N.0; 1/ (see Chapter 7). The
t-statistic is widely used in statistics in situations involving tests of hypotheses about
a population mean. Effect of nonnormality on the t-statistic has been studied by sev-
eral authors, among them, Efron (1969), Cressie (1980), Hall (1987), and Basu and
DasGupta (1991). Logan et al. (1973) showed that if X1 ; X2 ; : : : are iid standard
Cauchy, then Tn still does converge in distribution, but to a bimodal and unbounded
nonstandard density. It was conjectured there that the t-statistic converges in distri-
bution if and only if the underlying CDF, say F , is in the domain of attraction of a
stable law. This conjecture has turned out to be correct. Even more, it is now known
that the t-statistic converges in distribution to a normal distribution if and only if the
underlying CDF is in the domain of attraction of the normal; see Giné et al. (1997).
For example, the t-statistic converges in distribution to N.0; 1/ if F itself is a t-
distribution with two degrees of freedom, which does not have a finite variance.
Hall and Wang (2004) characterize the speed of convergence of the t-statistic to the
standard normal under this weakest possible assumption.
Large deviations for the t-statistic are of interest on a few grounds. One inter-
esting fact is that the large deviation rate function for the t-statistic is not the same
as the rate function for the sample mean, even if we have a conventional finite mgf
situation. Second, in certain multiple testing problems of recent interest in statistics,
large deviations of the t-statistic have become important. The results on the large
deviations of the t-statistic are complex, and the proofs are long. We refer to Shao
(1997) for the derivation of the main result on the rate function of the t-statistic.
The most striking part of this result is that large deviation rates are derived without imposing any moment assumptions at all, in contrast to the Cramér–Chernoff theorem
for the sample mean, which assumes the existence of the mgf itself. The multivari-
ate case, usually known as Hotelling’s T 2 (see Chapter 5), is treated in Dembo and
Shao (2006).
We use the following notation. Given an iid sequence of random variables $X_1, X_2, \ldots$ from a CDF $F$ on $\mathbb{R}$, we let, for $n \geq 2$,
$$S_n = \sum_{i=1}^n X_i, \qquad V_n^2 = \sum_{i=1}^n X_i^2, \qquad s_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2,$$
$$Z_n = \frac{S_n}{V_n}, \qquad T_n = \frac{\sqrt{n}\,\bar{X}}{s_n}.$$
17.3 The t -Statistic 571
Then, the self-normalized statistic $Z_n$ and the t-statistic $T_n$ are related by the identity
$$T_n = Z_n\left(\frac{n-1}{n - Z_n^2}\right)^{1/2}.$$
The function $z \mapsto \frac{z}{\sqrt{n - z^2}}$ is strictly increasing, as can be seen from its first derivative $\frac{n}{(n - z^2)^{3/2}}$, which is strictly positive on $(-\sqrt{n}, \sqrt{n})$. As a result,
$$P(T_n \geq t) = P\!\left(\frac{Z_n}{\sqrt{n - Z_n^2}} \geq \frac{t}{\sqrt{n-1}}\right) = P\!\left(Z_n \geq \frac{t\sqrt{n}}{\sqrt{n - 1 + t^2}}\right).$$
Therefore, the large deviation rate function of the t-statistic can be figured out from
the large deviation rate function of the formally simpler statistic Zn . This is the
approach taken in Shao (1997); this convenient technique was also adopted in Efron
(1969), and is generally a useful trick to remember in dealing with t-statistics.
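As a quick numerical sketch (not part of the text), the code below checks the algebraic identity relating $T_n$ to $Z_n$ and the equivalence of the tail events $\{T_n \geq t\}$ and $\{Z_n \geq t\sqrt{n}/\sqrt{n-1+t^2}\}$ on simulated Gaussian data.

```python
import math
import random

# Illustrative sketch, not from the book: verify the Z_n <-> T_n identity.
random.seed(1)

def stats(xs):
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    Vn = math.sqrt(sum(x * x for x in xs))
    Tn = math.sqrt(n) * xbar / math.sqrt(s2)
    Zn = sum(xs) / Vn
    return Tn, Zn

n, t = 10, 1.5
cutoff = t * math.sqrt(n) / math.sqrt(n - 1 + t * t)
for _ in range(1000):
    xs = [random.gauss(0, 1) for _ in range(n)]
    Tn, Zn = stats(xs)
    # the algebraic identity T_n = Z_n * sqrt((n-1)/(n - Z_n^2))
    assert abs(Tn - Zn * math.sqrt((n - 1) / (n - Zn ** 2))) < 1e-9
    # the two tail events coincide
    assert (Tn >= t) == (Zn >= cutoff)
print("identity and tail equivalence verified")
```

Since $Z_n^2 \leq n$ always (by the Cauchy–Schwarz inequality), the square root in the identity is well defined.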
(b) If E.X1 / does not exist, then the result of part (a) still holds for all x > 0.
Part (a) actually holds with a minor modification if E.X1 / exists and is nonneg-
ative, but we do not show the modification here; see Shao (1997).
Example 17.7 (The Normal Case). Suppose $X_1, X_2, \ldots$ are iid $N(0,1)$. In this case, $E\!\left[e^{ctX_1 - \frac{xt(c^2 + X_1^2)}{2\sqrt{1+x^2}}}\right]$ exists for all $t \geq 0$, and on calculation and algebra, we get
$$E\!\left[e^{ctX_1 - \frac{xt(c^2 + X_1^2)}{2\sqrt{1+x^2}}}\right] = \frac{(1+x^2)^{1/4}\,\exp\!\left(\dfrac{c^2 t\,(t - x\sqrt{1+x^2})}{2(1+x^2 + tx\sqrt{1+x^2})}\right)}{\sqrt{tx + \sqrt{1+x^2}}} = g(x,c,t) \text{ (say)}.$$
To find the limit of $\frac{1}{n}\log P\!\left(\frac{T_n}{\sqrt{n}} \geq x\right)$, we need to find $\sup_{c\geq 0}\inf_{t\geq 0} g(x,c,t)$. By taking the logarithm of $g(x,c,t)$, and differentiating it with respect to $t$, we find that the infimum over $t \geq 0$ occurs at the unique positive root of the following function of $t$:
$$(1+c^2)x - 2(1+c^2)x^3 - (1+c^2)x^5 + \sqrt{1+x^2}\,\{2c^2 + (c^2-2)x^2 - (c^2+4)x^4\}\,t$$
$$+\ \{3c^2 x + (3c^2-1)x^3 - x^5\}\,t^2 + c^2 x^2\sqrt{1+x^2}\,t^3 = 0.$$
572 17 Large Deviations
It turns out that the required root is $t = x\sqrt{1+x^2}$. Plugging this back into the expression for $g(x,c,t)$, we find that $\sup_{c\geq 0}\inf_{t\geq 0} g(x,c,t) = \frac{1}{\sqrt{1+x^2}}$. That is,
$$\lim_n \frac{1}{n}\log P\!\left(\frac{T_n}{\sqrt{n}} \geq x\right) = -\frac{1}{2}\log(1+x^2).$$
We notice that the rate function $\frac{1}{2}\log(1+x^2)$ is smaller than $\frac{x^2}{2}$, which was the rate function for the sample mean in the standard normal case. This is due to the fact that the t-statistic is more dispersed than the mean, and the rate functions of the t-statistic and the mean reflect that.
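As a numerical sketch (not part of the text), the code below compares the two rate functions and forms a crude Monte Carlo estimate of $-\frac{1}{n}\log P(T_n/\sqrt{n} \geq x)$; for moderate $n$ the estimate sits above the limit $\frac{1}{2}\log(1+x^2)$ because of polynomial prefactors, and it decreases toward the limit only slowly as $n$ grows.

```python
import math
import random

# Illustrative sketch, not from the book.
random.seed(2)

def rate_t(x):
    # rate function of the t-statistic in the standard normal case
    return 0.5 * math.log(1 + x * x)

# the t-statistic's rate never exceeds the sample mean's rate x^2/2
for x in [0.5, 1.0, 2.0, 5.0]:
    assert rate_t(x) <= x * x / 2  # since log(1+u) <= u

def t_stat(xs):
    n = len(xs)
    xbar = sum(xs) / n
    s = math.sqrt(sum((v - xbar) ** 2 for v in xs) / (n - 1))
    return math.sqrt(n) * xbar / s

n, x, reps = 20, 0.4, 20000
hits = sum(t_stat([random.gauss(0, 1) for _ in range(n)]) >= x * math.sqrt(n)
           for _ in range(reps))
est = -math.log(hits / reps) / n
print("MC estimate:", round(est, 3), " limiting rate:", round(rate_t(x), 3))
```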
$$P(|f(Z) - E[f(Z)]| > \epsilon) \approx P(|f(Z) - f(0)| > \epsilon) \approx P(|Z|\,|f'(0)| > \epsilon)$$
$$\leq P(M|Z| > \epsilon) = P\!\left(|Z| > \frac{\epsilon}{M}\right) = 2\left[1 - \Phi\!\left(\frac{\epsilon}{M}\right)\right] \leq e^{-\frac{\epsilon^2}{2M^2}},$$
because $1 - \Phi(z) \leq \frac{1}{2}e^{-z^2/2}$ for $z \geq 0$. These heuristics suggest that smooth functions of a normal random variable should be quite tightly concentrated around their mean, as long as they do not change by too much over small intervals. We present a major result to this effect. The result was originally proved in Borell (1975) and Sudakov and Cirelson (1974). The proof below is adapted from Talagrand (1995), and uses the following lemma. We need the notion of outer parallel body for the proof, which is defined first.
Definition 17.4. Let $B \subseteq \mathbb{R}^d$, and let $\epsilon > 0$. Then the $\epsilon$-outer parallel body of $B$ is the set $B_\epsilon = \{x \in \mathbb{R}^d : \inf_{y\in B}\|x - y\| \leq \epsilon\}$.
Also, let $\mathcal{H}$ denote the collection of all half spaces in $\mathbb{R}^d$. The idea of the proof of the result in this section is that if a set $B$ has probability at least $.5$ under a standard $d$-dimensional normal distribution, then with a large probability the normal random vector will lie within the $\epsilon$-outer parallel body of $B$, and that the worst offenders to this rule are half-spaces. Here, the requirement that $B$ have probability at least $.5$ is indispensable.
17.4 Lipschitz Functions and Talagrand’s Inequality 573
$$\sup_{B:\ P(Z\in B) = \alpha} P(Z \notin B_\epsilon) = \sup_{H \in \mathcal{H}:\ P(Z\in H) = \alpha} P(Z \notin H_\epsilon).$$
Theorem 17.6. Let $Z \sim N_d(0, I)$, and let $f(z)$ be a function with Lipschitz norm $M < \infty$; that is,
$$\sup_{x \neq y} \frac{|f(x) - f(y)|}{\|x - y\|} = M < \infty.$$
Let $\mu$ be either $E[f(Z)]$ or the median of $f(Z)$. Then, for any $\epsilon > 0$,
$$P(|f(Z) - \mu| > \epsilon) \leq e^{-\frac{\epsilon^2}{2M^2}}.$$
Proof. We prove the theorem only for the case when $\mu$ is a median of $f(Z)$. Let $B = \{z : f(z) \leq \mu\}$. Fix $\epsilon > 0$, and consider $y \in B_\epsilon$. Pick $z \in B$ such that $\|y - z\| \leq \epsilon$. Then,
(by the lemma). This latest supremum can be proved to be equal to $1 - \Phi(\epsilon)$, leading to, for $x > 0$,
$$P(f(Z) > \mu + M\epsilon) \leq 1 - \Phi(\epsilon) \;\Rightarrow\; P(f(Z) > \mu + x) \leq 1 - \Phi\!\left(\frac{x}{M}\right).$$
Similarly,
$$P(f(Z) < \mu - x) \leq 1 - \Phi\!\left(\frac{x}{M}\right).$$
Adding the two inequalities together,
$$P(|f(Z) - \mu| > x) \leq 2\left[1 - \Phi\!\left(\frac{x}{M}\right)\right] \leq e^{-\frac{x^2}{2M^2}}. \qquad\blacksquare$$
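As an empirical sketch (not part of the text), the code below checks the concentration bound of Theorem 17.6 for the Lipschitz function $f(z) = \max_i z_i$ of a standard $d$-dimensional Gaussian vector; this $f$ has Lipschitz norm $M = 1$, and $\mu$ is taken to be the sample median.

```python
import math
import random

# Illustrative sketch, not from the book: f(z) = max_i z_i is 1-Lipschitz
# in the Euclidean norm, so P(|f(Z) - median| > eps) <= exp(-eps^2/2).
random.seed(3)
d, reps = 10, 20000
samples = sorted(max(random.gauss(0, 1) for _ in range(d))
                 for _ in range(reps))
med = samples[reps // 2]  # Monte Carlo median of f(Z)
for eps in [1.0, 1.5, 2.0]:
    tail = sum(abs(v - med) > eps for v in samples) / reps
    bound = math.exp(-eps * eps / 2)  # M = 1
    assert tail <= bound
    print(eps, round(tail, 4), "<=", round(bound, 4))
```

The empirical tails are far below the bound here, which is expected: the inequality is dimension-free and not tight for any particular $f$.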
the extreme of some stochastic sequence or process. We are then content with good upper bounds on probabilities of the form $P(\sup_{t\in T}|X(t)| > x)$. This has a resemblance to a large deviation probability. A basic treatment of this topic when $X(t)$ is a Gaussian process is provided in this section. As usual, we first need some definitions and notation.
Throughout we take $X(t)$ to be a zero mean one-dimensional Gaussian process; it is referred to as a centered Gaussian process. We also take the indexing set $T$ to be a subset of a Euclidean space, usually the real line. The covariance kernel of $X(t)$ is denoted as $\gamma(s,t)$, and when $X(t)$ is stationary, we drop one of the arguments and simply use the notation $\gamma(t)$. The covariance kernel induces a new metric on $T$, which we define below.
Continuity of $X(t)$ is a key factor in our final goal of writing bounds on probabilities of the form $P(\sup_{t\in T} X(t) > x)$ or $P(\sup_{t\in T}|X(t)| > x)$. This is because continuity and boundedness are connected, and also because a smooth process does not jump around too much, which helps in keeping the supremum of the process under control. We limit our treatment to Gaussian processes with the time set $T$ a subset of a Euclidean space, and often just $[0,1]$. Conditions for the
continuity of a Gaussian process have evolved from simple sufficient conditions to
more complex necessary and sufficient conditions. Obviously, if the time set T is
compact, then continuity of the process implies its boundedness. If T is not com-
pact, this need not be true. A simple example is that of the Wiener process $X(t)$ on $T = [0, \infty)$. We know from Chapter 12 that $X(t)$ is continuous. However, $P(\sup_{t\geq 0} X(t) = \infty,\ \inf_{t\geq 0} X(t) = -\infty) = 1$, so that $X(t)$ is not bounded, and
sufficient conditions for the continuity of a Gaussian process. See Adler (1987) or
Fernique (1974) for a proof.
$$\Delta^2(s,t) = E[X(t) - X(s)]^2 \leq \frac{C}{\left|\log\|s-t\|\right|^{1+\alpha}} \quad \forall\, s, t,\ \|s - t\| \leq h;$$
$$\gamma(0) - \gamma(t) \leq \frac{A}{\left|\log\|t\|\right|^{1+\beta}} \quad \forall\, t,\ \|t\| \leq \delta.$$
Example 17.9 (The Wiener Process). We prove by using the theorem above that paths of a Wiener process are continuous. If we can show that a Wiener process is continuous on $[0,1]$, it follows that it is continuous on all of $[0,\infty)$. For the Wiener process, $\Delta^2(s,t) = E[X(t) - X(s)]^2 = t + s - 2\min(s,t) = |t - s|$. Choose $\psi(u) = \sqrt{u}$. Then, we have $\Delta(s,t) \leq \psi(|t-s|)$. Furthermore, $\int_0^\infty \psi(e^{-x^2})\,dx = \int_0^\infty e^{-x^2/2}\,dx < \infty$. Therefore, by part (a) of the above theorem, the Wiener process is continuous on $[0,\infty)$. Next, as regards part (b) of the theorem, because
17.5 Large Deviations in Continuous Time 577
$x(\log x)^2 \to 0$ as $x \downarrow 0$, the Wiener process also satisfies the sufficient condition in part (b) with, for example, $\alpha = 1$, $C = 1$, and $h = \frac{1}{2}$, and part (b) again shows that the Wiener process is continuous on $[0,1]$, and hence, on $[0,\infty)$.
Example 17.10 (Logarithm Tail of the Maxima and the Landau–Shepp Theorem).
This example illustrates a famous theorem of Landau and Shepp (1970). The
Landau–Shepp theorem says that if X.t/ is a centered and almost surely bounded
Gaussian process, then the tail of its supremum acts like the tail of a suitable
single normal random variable. Precisely, let X.t/ be a centered and bounded Gaus-
sian process on some set T . For instance, if T is a compact interval in R and if
X.t/ satisfies one of the two sufficient conditions in our theorem above, then the
Landau–Shepp theorem applies. Let $\sigma_T^2 = \sup_{t\in T}\operatorname{Var}(X_t)$. The Landau–Shepp theorem says that $\lim_{u\to\infty}\frac{1}{u^2}\log P(M_T > u) = -\frac{1}{2\sigma_T^2}$, where $M_T = \sup_{t\in T} X(t)$.
Let us put this result in context. Take a single univariate normal random variable $X$ with mean zero and variance $\sigma^2$. Then, the distribution of $X$ satisfies the two-sided bounds
$$\left(\frac{\sigma}{\sqrt{2\pi}\,u} - \frac{\sigma^3}{\sqrt{2\pi}\,u^3}\right) e^{-\frac{u^2}{2\sigma^2}} \leq P(X > u) \leq \frac{\sigma}{\sqrt{2\pi}\,u}\, e^{-\frac{u^2}{2\sigma^2}},$$
so that $\frac{1}{u^2}\log P(X > u) \to -\frac{1}{2\sigma^2}$ as $u \to \infty$.
The classic sufficient conditions for the continuity of a Gaussian process have
evolved into conditions on the size of the time set T that control continuity of
the process and magnitude of the supremum at the same time. These conditions
involve the metric entropy of the set T with respect to the canonical metric of T .
The smoother the covariance kernel of the process is near zero, the smaller will be
the metric entropy of T , and the easier it is to control the supremum of the pro-
cess over $T$. Roughly, if the covering numbers $N(\epsilon, T, \Delta)$ of $T$ grow at the rate of $\epsilon^{-\alpha}$ for some positive $\alpha$, then the Gaussian process will be continuous, and various
results on the tail of the supremum of the process can be derived. These metric en-
tropy conditions and their ties to the magnitude of the supremum of the process, as
in the theorem below, are due to Dudley (1967) and Talagrand (1996). The reader
should compare the results below with the main theorem in Section 16.5, where the $L^2$ norm of the supremum of a continuous Gaussian process is bounded by the $\epsilon$-capacity of $T$. We remark that stationarity of the process is not assumed in the
theorem below.
Theorem 17.8. Let $X(t),\ t \in T$ be a real-valued centered Gaussian process. Let $N(\epsilon, T, \Delta)$ be the covering number of $T$ with respect to the canonical metric $\Delta$ of $T$. Assume:
(a) For each $\epsilon > 0$, $N(\epsilon, T, \Delta) < \infty$.
(b) The diameter of $T$ with respect to $\Delta$, that is, $L = \sup_{s,t\in T}\sqrt{E[X(s) - X(t)]^2}$, is finite.
(c) $\int_0^L \sqrt{\log N(\epsilon, T, \Delta)}\,d\epsilon < \infty$.
$$N(\epsilon, T, \Delta) \leq 1 + \frac{1}{2\epsilon^2}$$
$$\Rightarrow\; \log N(\epsilon, T, \Delta) \leq \log\!\left(1 + \frac{1}{2\epsilon^2}\right) \leq 2\epsilon^2 - 2\log\epsilon - \log 2.$$
We can evaluate the integral $\int_0^{.5}\sqrt{2\epsilon^2 - 2\log\epsilon - \log 2}\,d\epsilon$ numerically. We can easily show analytically that the integral is finite. Our theorem above then implies that $X(t)$ is
continuous and bounded on Œ0; 1 (something that we already knew). Moreover, the
theorem also says that
$$E(M) \leq 12\int_0^{.5}\sqrt{\log N(\epsilon, T, \Delta)}\,d\epsilon \leq 12\int_0^{.5}\sqrt{2\epsilon^2 - 2\log\epsilon - \log 2}\,d\epsilon$$
$$= 12 \times .811 = 9.732,$$
whereas the exact value is $E(M) = .799$. So the bounds are not yet practically very useful, but describe exactly how the covering numbers must behave for the tail of the supremum to go down at a Gaussian rate.
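As a numerical sketch (not part of the text), the code below evaluates the entropy integral by a midpoint rule and reproduces the bound $12 \times (\text{integral})$ on $E(M)$; for the Wiener process on $[0,1]$, the exact value $E(M) = \sqrt{2/\pi} \approx .798$ follows from the reflection principle.

```python
import math

# Illustrative sketch, not from the book: midpoint-rule evaluation of
# the entropy integral for the Wiener process on [0,1].
def integrand(e):
    return math.sqrt(2 * e * e - 2 * math.log(e) - math.log(2))

N = 200000
h = 0.5 / N
val = sum(integrand((i + 0.5) * h) for i in range(N)) * h
print("entropy integral ~", round(val, 3))
print("Dudley-type bound ~", round(12 * val, 2))
print("exact E(M) = sqrt(2/pi) ~", round(math.sqrt(2 / math.pi), 3))
```

The integrand has a logarithmic singularity at $\epsilon = 0$ but remains integrable, which is exactly condition (c) of the theorem.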
Exercises
Exercise 17.1. Derive the rate function I.t/ for the sample mean in the exponential
case.
Exercise 17.2. Derive the rate function I.t/ for the sample mean in the Poisson
case.
Exercise 17.3. Derive the rate function $I(t)$ for the sample mean for the three-point distribution giving probability $\frac{1}{3}$ to each of $0, \pm 1$.
Note: You will need a formula for the derivative of the hyperbolic cosine function.
Exercise 17.4 (SLLN from Cramér–Chernoff Theorem). Prove, by using the
Cramér–Chernoff theorem, the SLLN for the mean of an iid sequence under the
conditions of the Cramér–Chernoff theorem.
Exercise 17.5 * (Rate Function for Uncommon Distributions). For each of the
following densities, derive the rate function I.t/ for the sample mean.
(a) $f(x) = \frac{1}{B(\lambda,\alpha)}\, e^{-\lambda x}(1 - e^{-x})^{\alpha-1}$, $x, \lambda, \alpha > 0$.
(b) $f(x) = \frac{x^{\lambda-1}}{\Gamma(\lambda)\,\zeta(\lambda)\,(e^x - 1)}$, $x > 0$, $\lambda > 1$, where $\zeta(\lambda) = \sum_{n=1}^\infty n^{-\lambda}$.
(c) $f(x) = \frac{e^{-\theta^2 x^2/2}}{2\pi(1 - \Phi(\theta))\,e^{\theta^2/2}\,(1 + x^2)}$, $-\infty < x < \infty$, $\theta > 0$.
Exercise 17.6 * (Multivariate Normal). Characterize the rate function $I(t)$ for the statistic $T_n = \|\bar{X}\|$, when $X_1, X_2, \ldots$ are iid $N_d(0, I)$.
Exercise 17.7 * (Uniform Distribution in a Ball). Characterize the rate function $I(t)$ for the statistic $T_n = \|\bar{X}\|$, when $X_1, X_2, \ldots$ are iid uniform in the $d$-dimensional unit ball.
Exercise 17.8 (Numerical Accuracy of Large Deviation Approximation). Let $W_n \sim \chi^2_n$. Do a straight CLT approximation for $P(W_n > (1+t)n)$, and do an approximation using the Cramér–Chernoff theorem. Compare the numerical accuracy of these two approximations for $t = .5, 1, 2$ when $n = 30$.
Exercise 17.9. Prove that the rate function $I(t)$ for the sample mean of an iid sequence in the one-dimensional case cannot be zero at $t \neq \mu$, when the mgf exists for all $z$.
Exercise 17.10. Prove that the rate function $I(t)$ for the sample mean of an iid sequence satisfies $I''(\mu) = \frac{1}{\sigma^2}$.
Exercise 17.11 * (Type II Error Rate of the Likelihood Ratio Test). Find $\lim_n \frac{1}{n}\log \beta_n$ for the likelihood ratio test of Example 17.4.
Exercise 17.15 * (Adjusted Large Deviation Rate When the Population Is t). Suppose $X_1, X_2, \ldots$ are iid from a t-distribution with two degrees of freedom. In this case, $\bar{X}$ converges in probability to zero by the WLLN, but $P(\bar{X} > t)$ does not converge to zero exponentially fast. Find the exact rate at which $P(\bar{X} > t)$ converges to zero.
Exercise 17.16. Suppose $X_n \sim \text{Bin}(n, p)$. Show that for any $a > 0$, $P(X_n < np - a) \leq e^{-\frac{a^2}{2np}}$.
Hint: Use the technique to obtain the upper bound part in the Cramér–Chernoff theorem.
Exercise 17.17. Suppose $X \sim \text{Poi}(\lambda)$. Show that for any $\epsilon > 0$, $P(X < \lambda(1-\epsilon)) \leq e^{-\frac{\lambda\epsilon^2}{2}}$.
Hint: Approximate a Poisson by a suitable binomial. Then use the binomial distribution inequality in the exercise above. Take a limit.
References
Adler, R.J. (1987). An Introduction to Continuity, Extrema, and Related Topics, IMS, Lecture Notes
and Monograph Series, Hayward, CA.
Basu, S. and DasGupta, A. (1991). Robustness of standard confidence intervals for location
parameters under departure from normality, Ann. Stat., 23, 1433–1442.
Borell, C. (1975). Convex functions in d -space, Period. Math. Hungar., 6, 111–136.
Bucklew, J. (2004). Introduction to Rare Event Simulation, Springer, New York.
Cressie, N. (1980). Relaxing assumptions in the one sample t -test, Austr. J. Statist., 22, 143–153.
Dembo, A. and Shao, Q.M. (2006). Large and moderate deviations for Hotelling’s T 2 statistic,
Electron. Comm. Prob., 11, 149–159.
Dembo, A. and Zeitouni, O. (1998). Large Deviations, Techniques and Applications, Jones and
Bartlett, Boston.
den Hollander, F. (2000). Large Deviations, AMS, Providence, RI.
Devroye, L., Gyorfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition,
Springer, New York.
Dubhashi, D. and Panconesi, A. (2009). Concentration of Measure for the Analysis of Randomized
Algorithms, Cambridge University Press, Cambridge, UK.
Dudley, R.M. (1967). The sizes of compact subsets of Hilbert space and continuity of Gaussian
processes, J. Funct. Anal., 1, 290–330.
Efron, B. (1969). Student’s t -test under symmetry conditions, J. Amer. Statist. Assoc., 64,
1278–1302.
Fernique, X. (1974). Des resultats nouveaux sur les processus gaussiens, C. R. Acad. Sci. Paris,
Ser. A, 278, 363–365.
Giné, E., Gotze, F., and Mason, D.M. (1997). When is the Student’s t -statistic asymptotically
standard normal?, Ann. Prob., 25, 1514–1531.
Groeneboom, P. and Oosterhoff, J. (1977). Bahadur efficiency and probabilities of large deviations,
Statist. Neerlandica, 31, 1, 1–24.
Groeneboom, P. and Oosterhoff, J. (1980). Bahadur Efficiency and Small Sample Efficiency :
A Numerical Study, Mathematisch Centrum, Amsterdam.
Groeneboom, P. and Oosterhoff, J. (1981). Bahadur efficiency and small sample efficiency, Inter-
nat. Statist. Rev., 49, 2, 127–141.
Hall, P. (1987). Edgeworth expansions for Student’s t -statistic under minimal moment conditions,
Ann. Prob., 15, 920–931.
Hall, P. and Wang, Q. (2004). Exact convergence rate and leading term in central limit theorem for
Student’s t -statistic, Ann. Prob., 32, 1419–1437.
Landau, H.J. and Shepp, L. (1970). On the supremum of a Gaussian process, Sankhyā Ser. A, 32,
369–378.
Ledoux, M. (2004). Spectral gap, logarithmic Sobolev constant, and geometric bounds, Surveys in
Differential Geometry, IX, 219–240, International. Press, Somerville, MA.
Logan, B.F., Mallows, C.L., Rice, S.O., and Shepp, L. (1973). Limit distributions of self-
normalized sums, Ann. Prob., 1, 788–809.
Lugosi, G. (2004). Concentration Inequalities, Preprint.
McDiarmid, C. (1998). Concentration, Prob. Methods for Algorithmic Discrete Math., 195–248,
Algorithm. Combin., 16, Springer, Berlin.
Shao, Q.M. (1997). Self-normalized large deviations, Ann. Prob., 25, 285–328.
Stroock, D. (1984). An Introduction to the Theory of Large Deviations, Springer, New York.
Sudakov, V.N. and Cirelson, B.S. (1974). Extremal properties of half spaces for spherically invari-
ant measures, Zap. Nauchn. Sem. Leningrad Otdel. Mat. Inst. Steklov, 41, 14–24.
Talagrand, M. (1995). Concentration of measures and isoperimetric inequalities, Inst. Hautes
Etudes Sci. Publ. Math., 81, 73–205.
Talagrand, M. (1996). Majorizing measures: the generic chaining, Ann. Prob., 24, 1049–1103.
Varadhan, S.R.S. (1984). Large Deviations and Applications, SIAM, Philadelphia.
Chapter 18
The Exponential Family and Statistical
Applications
The exponential family is a practically convenient and widely used unified family
of distributions on finite-dimensional Euclidean spaces parametrized by a finite-
dimensional parameter vector. Specialized to the case of the real line, the exponen-
tial family contains as special cases most of the standard discrete and continuous
distributions that we use for practical modeling, such as the normal, Poisson, bi-
nomial, exponential, Gamma, multivariate normal, and so on. The reason for the
special status of the exponential family is that a number of important and useful
calculations in statistics can be done all at one stroke within the framework of the
exponential family. This generality contributes to both convenience and larger-scale
understanding. The exponential family is the usual testing ground for the large spec-
trum of results in parametric statistical theory that require notions of regularity
or Cramér–Rao regularity. In addition, the unified calculations in the exponential
family have an element of mathematical neatness. Distributions in the exponential family have been used in classical statistics for decades. However, the family has recently gained additional importance due to its use in, and appeal to, the machine learning community. A fundamental treatment of the general exponential family is provided in this chapter. Classic expositions are available in Barndorff-Nielsen (1978), Brown (1986), and Lehmann and Casella (1998). An excellent recent treatment is available in Bickel and Doksum (2006).
Exponential families can have any finite number of parameters. For instance, as we show below, a normal distribution with a known mean is in the one-parameter exponential family, whereas a normal distribution with both parameters unknown is in the two-parameter exponential family. A bivariate normal distribution with all parameters unknown is in the five-parameter exponential family. As another example, if we take a normal distribution in which the mean and the variance are functionally related (e.g., the $N(\mu, \mu^2)$ distribution), then the distribution is neither in the one-parameter nor in the two-parameter exponential family, but in a family called a curved exponential family. We start with the one-parameter regular exponential family.
We start with an illustrative example that brings out some of the most important
properties of distributions in an exponential family.
Example 18.1 (Normal Distribution with a Known Mean). Suppose $X \sim N(0, \sigma^2)$. Then the density of $X$ is
$$f(x\,|\,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{x^2}{2\sigma^2}}\, I_{x\in\mathbb{R}}.$$
Writing
$$\eta(\sigma) = -\frac{1}{2\sigma^2}, \quad T(x) = x^2, \quad \psi(\sigma) = \log\sigma, \quad h(x) = \frac{1}{\sqrt{2\pi}}\, I_{x\in\mathbb{R}},$$
we can represent the density in the form $f(x\,|\,\sigma) = e^{\eta(\sigma)T(x) - \psi(\sigma)}h(x)$ for any $\sigma \in \mathbb{R}^+$.
Next, suppose that we have an iid sample $X_1, X_2, \ldots, X_n \sim N(0, \sigma^2)$. Then the joint density of $X_1, X_2, \ldots, X_n$ is
$$f(x_1, x_2, \ldots, x_n\,|\,\sigma) = \frac{1}{\sigma^n (2\pi)^{n/2}}\, e^{-\frac{\sum_{i=1}^n x_i^2}{2\sigma^2}}\, I_{x_1, x_2, \ldots, x_n \in \mathbb{R}}.$$
Now writing
$$\eta(\sigma) = -\frac{1}{2\sigma^2}, \quad T(x_1, x_2, \ldots, x_n) = \sum_{i=1}^n x_i^2, \quad \psi(\sigma) = n\log\sigma,$$
and
$$h(x_1, x_2, \ldots, x_n) = \frac{1}{(2\pi)^{n/2}}\, I_{x_1, x_2, \ldots, x_n \in \mathbb{R}},$$
once again we can represent the joint density in the same general form. Furthermore, writing
$$\eta(\sigma) = -\frac{1}{2\sigma^2}, \quad S(t) = t, \quad \psi(\sigma) = n\log\sigma, \quad h(t) = \frac{t^{\frac{n}{2}-1}}{2^{n/2}\,\Gamma(\frac{n}{2})}\, I_{t>0},$$
once again we are able to write even the density of $T(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n X_i^2$ in that same general form.
Clearly, something very interesting is going on. We started with a basic density in a specific form, namely, $f(x\,|\,\theta) = e^{\eta(\theta)T(x) - \psi(\theta)}h(x)$, and then we found that the joint density, and the density of the relevant one-dimensional statistic $\sum_{i=1}^n X_i^2$ in that joint density, are once again densities of exactly that same general form. It turns out that all of these phenomena are true of the entire family of densities that can be written in that general form, which is the one-parameter exponential family. Let us formally define it, and we then extend the definition to distributions with more than one parameter.
for some real-valued functions $T(x)$, $\psi(\theta)$, and $h(x) \geq 0$. Note that the functions $\eta$, $T$, and $h$ are not unique. For example, in the product $\eta(\theta)T(x)$, we can multiply $T$ by some constant $c$ and divide $\eta$ by it. Similarly, we can play with constants in the function $h$.
The notion of a sufficient statistic is a fundamental one in statistical theory and its
applications. Sufficiency was introduced into the statistical literature by Sir Ronald
A. Fisher (Fisher (1922)). Sufficiency attempts to formalize the notion of no loss
of information. A sufficient statistic is supposed to contain by itself all of the infor-
mation about the unknown parameters of the underlying distribution that the entire
sample could have provided. In that sense, there is nothing to lose by restricting
attention to just a sufficient statistic in one’s inference process. However, the form
of a sufficient statistic is very much dependent on the choice of a particular distri-
bution P for modeling the observable X . Still, reduction to sufficiency in widely
used models usually makes just simple common sense. We come back to the issue
of sufficiency once again later in this chapter.
We now show examples of a few more common distributions that belong to the
one-parameter exponential family.
Writing $\eta(p) = \log\frac{p}{1-p}$, $T(x) = x$, $\psi(p) = -n\log(1-p)$, and $h(x) = \binom{n}{x}\, I_{x\in\{0,1,\ldots,n\}}$, we have represented the pmf $f(x\,|\,p)$ in the one-parameter exponential family form, as long as $p \in (0,1)$. For $p = 0$ or $1$, the distribution becomes a one-point distribution. Consequently, the family of distributions $\{f(x\,|\,p),\ 0 < p < 1\}$ forms a one-parameter exponential family, but if either of the boundary values $p = 0, 1$ is included, the family is not in the exponential family.
$$f(x\,|\,\mu) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2} + \mu x - \frac{\mu^2}{2}}\, I_{x\in\mathbb{R}},$$
$$f(x\,|\,\mu) = \frac{1}{\sqrt{3}\,\pi}\, e^{-\frac{2}{3}\left[(x_1-\mu)^2 + (x_2-\mu)^2 - (x_1-\mu)(x_2-\mu)\right]}\, I_{x_1, x_2\in\mathbb{R}}$$
$$= \frac{1}{\sqrt{3}\,\pi}\, e^{\frac{2\mu}{3}(x_1 + x_2) - \frac{2\mu^2}{3}}\, e^{-\frac{2}{3}\left(x_1^2 + x_2^2 - x_1 x_2\right)}\, I_{x_1, x_2\in\mathbb{R}}.$$
This is in the form of a one-parameter exponential family with the natural sufficient
statistic T .X / D T .X1 ; X2 / D X1 C X2 .
Example 18.5 (Gamma Distribution). Suppose $X$ has the Gamma density
$$f(x) = \frac{e^{-\frac{x}{\lambda}}\, x^{\alpha-1}}{\lambda^\alpha\, \Gamma(\alpha)}\, I_{x>0}.$$
If $\alpha$ is known and $\lambda$ is the parameter, we can write the density as
$$f(x\,|\,\lambda) = e^{-\frac{x}{\lambda} - \alpha\log\lambda}\, \frac{x^{\alpha-1}}{\Gamma(\alpha)}\, I_{x>0},$$
which is in the one-parameter exponential family form. If $\lambda$ is known and $\alpha$ is the parameter, we can write the density as
$$f(x\,|\,\alpha) = e^{\alpha\log x - \alpha\log\lambda - \log\Gamma(\alpha)}\, \frac{e^{-\frac{x}{\lambda}}}{x}\, I_{x>0},$$
which is also in the one-parameter exponential family form. Next, if the mean is kept fixed at one, that is, $\lambda = \frac{1}{\alpha}$, then
$$f(x\,|\,\alpha) = \frac{\alpha^\alpha}{\Gamma(\alpha)}\, e^{-\alpha x + \alpha\log x}\, \frac{1}{x}\, I_{x>0} = e^{\alpha[\log x - x] - [\log\Gamma(\alpha) - \alpha\log\alpha]}\, \frac{1}{x}\, I_{x>0},$$
which is once again in the one-parameter exponential family form with $\eta(\alpha) = \alpha$, $T(x) = \log x - x$, $\psi(\alpha) = \log\Gamma(\alpha) - \alpha\log\alpha$, $h(x) = \frac{1}{x}\, I_{x>0}$.
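As a numerical sketch (not part of the text), the code below checks that the exponential family representation of the mean-one Gamma density, with $\eta(\alpha) = \alpha$, $T(x) = \log x - x$, $\psi(\alpha) = \log\Gamma(\alpha) - \alpha\log\alpha$, and $h(x) = 1/x$, reproduces the usual density formula.

```python
import math

# Illustrative sketch, not from the book: the mean-one Gamma density in
# its usual form and in exponential family form should coincide.
def gamma_pdf(x, a):
    # alpha^alpha x^(alpha-1) e^(-alpha x) / Gamma(alpha)
    return math.exp(a * math.log(a) + (a - 1) * math.log(x) - a * x
                    - math.lgamma(a))

def expfam_pdf(x, a):
    eta, T = a, math.log(x) - x                 # natural parameter, statistic
    psi = math.lgamma(a) - a * math.log(a)      # psi(alpha)
    return math.exp(eta * T - psi) / x          # h(x) = 1/x

for a in [0.5, 1.0, 2.5, 7.0]:
    for x in [0.1, 1.0, 3.0]:
        assert abs(gamma_pdf(x, a) - expfam_pdf(x, a)) < 1e-12
print("representations agree")
```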
588 18 The Exponential Family and Statistical Applications
$$f(x\,|\,\mu) = \frac{e^{-\frac{(x-\mu)^2}{2}}}{\sqrt{2\pi}\left[\Phi(b-\mu) - \Phi(a-\mu)\right]}\, I_{a\leq x\leq b},$$
and we recognize this to be in the exponential family form with $\eta(\mu) = \mu$, $T(x) = x$, $\psi(\mu) = \frac{\mu^2}{2} + \log\left[\Phi(b-\mu) - \Phi(a-\mu)\right]$, and $h(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}\, I_{a\leq x\leq b}$. Thus, the distribution of $W$ truncated to $A = [a,b]$ is still in the one-parameter exponential family. This phenomenon is in fact more general.
Example 18.8 (Some Distributions Not in the Exponential Family). It is clear from the definition of a one-parameter exponential family that if a certain family of distributions $\{P_\theta,\ \theta\in\Theta\}$ belongs to the one-parameter exponential family, then each $P_\theta$ has exactly the same support. Precisely, for any fixed $\theta$, $P_\theta(A) > 0$ if and only if $\int_A h(x)\,dx > 0$, and in the discrete case, $P_\theta(A) > 0$ if and only if $A \cap \mathcal{X} \neq \emptyset$, where $\mathcal{X}$ is the countable set $\mathcal{X} = \{x : h(x) > 0\}$. As a consequence of this common support fact, the so-called irregular distributions whose support depends on the parameter cannot be members of the exponential family. Examples would be the family of $U[0,\theta]$, $U[-\theta,\theta]$ distributions, and so on. Likewise, the shifted exponential density $f(x\,|\,\theta) = e^{\theta - x}\, I_{x>\theta}$ cannot be in the exponential family.
Some other common distributions are also not in the exponential family, but for
other reasons. An important example is the family of Cauchy distributions given by
the location parameter form $f(x\,|\,\theta) = \frac{1}{\pi[1+(x-\theta)^2]}\, I_{x\in\mathbb{R}}$. Suppose that it is. Then,
$$e^{\eta(\theta)T(x)} = \frac{1}{1+(x-\theta)^2} \;\Rightarrow\; \eta(\theta)T(x) = -\log(1+(x-\theta)^2)$$
$$\Rightarrow\; \frac{\eta(\theta)}{c}\log(1+x^2) = \log(1+(x-\theta)^2) \;\Rightarrow\; \eta(\theta) = c\,\frac{\log(1+(x-\theta)^2)}{\log(1+x^2)},$$
where $c = \eta(0)$, on using $\theta = 0$ in the first identity. This means that $\frac{\log(1+(x-\theta)^2)}{\log(1+x^2)}$ must be a constant function of $x$, which is a contradiction. The choice of $\theta = 0$ as the special value of $\theta$ is not important.
where
$$\mathcal{T} = \left\{\eta \in \mathbb{R} : e^{\psi(\eta)} = \int_{\mathbb{R}^d} e^{\eta T(x)}\, h(x)\,dx < \infty\right\},$$
with the integral replaced by a sum over $\mathcal{X}$ in the discrete case, $\mathcal{X}$ being the countable set on which $h(x) > 0$.
For a distribution in the canonical one-parameter exponential family, the parameter $\eta$ is called the natural parameter, and $\mathcal{T}$ is called the natural parameter space. Note that $\mathcal{T}$ describes the largest set of values of $\eta$ for which the density (pmf) can be defined. In a particular application, we may have extraneous knowledge that $\eta$ belongs to some proper subset of $\mathcal{T}$. Thus, $\{P_\eta\}$ with $\eta \in \mathcal{T}$ is called the full canonical one-parameter exponential family. We generally refer to the full family, unless otherwise stated.
590 18 The Exponential Family and Statistical Applications
If we write $\log\frac{p}{1-p} = \eta$, then $\frac{p}{1-p} = e^\eta$, and hence, $p = \frac{e^\eta}{1+e^\eta}$, and $1 - p = \frac{1}{1+e^\eta}$. Therefore, the canonical exponential family form of the binomial distribution is
$$f(x\,|\,\eta) = e^{x\eta - n\log(1+e^\eta)}\binom{n}{x}\, I_{x\in\{0,1,\ldots,n\}}.$$
Written in its canonical form, a density (pmf) in an exponential family has some convexity properties. These convexity properties are useful in manipulations with moments and other functionals of $T(X)$, the natural sufficient statistic appearing in the expression for the density of the distribution.
Theorem 18.1. The natural parameter space $\mathcal{T}$ is convex, and $\psi(\eta)$ is a convex function on $\mathcal{T}$.
Proof. We consider the continuous case only, as the discrete case admits basically the same proof. Let $\eta_1, \eta_2$ be two members of $\mathcal{T}$, and let $0 < \alpha < 1$. We need to show that $\alpha\eta_1 + (1-\alpha)\eta_2$ belongs to $\mathcal{T}$; that is,
$$\int_{\mathbb{R}^d} e^{(\alpha\eta_1 + (1-\alpha)\eta_2)T(x)}\, h(x)\,dx < \infty.$$
18.2 The Canonical Form and Basic Properties 591
But
$$\int_{\mathbb{R}^d} e^{(\alpha\eta_1 + (1-\alpha)\eta_2)T(x)}\, h(x)\,dx = \int_{\mathbb{R}^d} e^{\alpha\eta_1 T(x)}\, e^{(1-\alpha)\eta_2 T(x)}\, h(x)\,dx$$
$$= \int_{\mathbb{R}^d} \left(e^{\eta_1 T(x)}\right)^{\alpha} \left(e^{\eta_2 T(x)}\right)^{1-\alpha} h(x)\,dx$$
$$\leq \left(\int_{\mathbb{R}^d} e^{\eta_1 T(x)}\, h(x)\,dx\right)^{\alpha} \left(\int_{\mathbb{R}^d} e^{\eta_2 T(x)}\, h(x)\,dx\right)^{1-\alpha} < \infty,$$
by Hölder's inequality.
The next result is a very special fact about the canonical exponential family, and is
the source of a large number of closed-form formulas valid for the entire canonical
exponential family. The fact itself is actually a fact in mathematical analysis. Due
to the special form of exponential family densities, the fact in analysis translates to
results for the exponential family, an instance of interplay between mathematics and
statistics and probability.
$$\frac{d^k}{d\eta^k}\, e^{\psi(\eta)} = \sum_{x\in\mathcal{X}} [T(x)]^k\, e^{\eta T(x)}\, h(x).$$
exists, and the limit can be carried inside the integral, to give
$$\lim_{\delta\to 0}\int_{\mathbb{R}^d} \frac{e^{(\eta+\delta)T(x)} - e^{\eta T(x)}}{\delta}\, h(x)\,dx = \int_{\mathbb{R}^d} \lim_{\delta\to 0}\frac{e^{(\eta+\delta)T(x)} - e^{\eta T(x)}}{\delta}\, h(x)\,dx$$
$$= \int_{\mathbb{R}^d} \frac{d}{d\eta}\, e^{\eta T(x)}\, h(x)\,dx = \int_{\mathbb{R}^d} T(x)\, e^{\eta T(x)}\, h(x)\,dx.$$
$$\beta_\eta = \frac{\psi^{(3)}(\eta)}{\left[\psi''(\eta)\right]^{3/2}}; \quad \text{and} \quad \gamma_\eta = \frac{\psi^{(4)}(\eta)}{\left[\psi''(\eta)\right]^{2}}.$$
Proof. Again, we take just the continuous case. Consider the result of the previous theorem that for any $k \geq 1$, $\frac{d^k}{d\eta^k}\, e^{\psi(\eta)} = \int_{\mathbb{R}^d} [T(x)]^k e^{\eta T(x)} h(x)\,dx$. Using this for $k = 1$, we get
$$\psi'(\eta)\, e^{\psi(\eta)} = \int_{\mathbb{R}^d} T(x)\, e^{\eta T(x)}\, h(x)\,dx \;\Rightarrow\; \int_{\mathbb{R}^d} T(x)\, e^{\eta T(x) - \psi(\eta)}\, h(x)\,dx = \psi'(\eta),$$
which gives the result $E_\eta[T(X)] = \psi'(\eta)$.
Similarly,
$$\frac{d^2}{d\eta^2}\, e^{\psi(\eta)} = \int_{\mathbb{R}^d} [T(x)]^2\, e^{\eta T(x)}\, h(x)\,dx \;\Rightarrow\; \left[\psi''(\eta) + \{\psi'(\eta)\}^2\right] e^{\psi(\eta)} = \int_{\mathbb{R}^d} [T(x)]^2\, e^{\eta T(x)}\, h(x)\,dx$$
$$\Rightarrow\; \psi''(\eta) + \{\psi'(\eta)\}^2 = \int_{\mathbb{R}^d} [T(x)]^2\, e^{\eta T(x) - \psi(\eta)}\, h(x)\,dx,$$
which gives $E_\eta[T(X)^2] = \psi''(\eta) + \{\psi'(\eta)\}^2$. Combine this with the already obtained result that $E_\eta[T(X)] = \psi'(\eta)$, and we get $\operatorname{Var}_\eta[T(X)] = E_\eta[T(X)^2] - (E_\eta[T(X)])^2 = \psi''(\eta)$.
The coefficient of skewness is defined as $\beta = \frac{E[T(X) - E\,T(X)]^3}{(\operatorname{Var} T(X))^{3/2}}$. To obtain $E[T(X) - E\,T(X)]^3 = E[T(X)^3] - 3E[T(X)^2]\, E[T(X)] + 2[E\,T(X)]^3$, use the identity $\frac{d^3}{d\eta^3}\, e^{\psi(\eta)} = \int_{\mathbb{R}^d} [T(x)]^3 e^{\eta T(x)} h(x)\,dx$. Then use the fact that the third derivative of $e^{\psi(\eta)}$ is $e^{\psi(\eta)}\left[\psi^{(3)}(\eta) + 3\psi'(\eta)\psi''(\eta) + \{\psi'(\eta)\}^3\right]$. As we did in our proofs for the mean and the variance above, transfer $e^{\psi(\eta)}$ into the integral on the right-hand side and then simplify. This gives $E[T(X) - E\,T(X)]^3 = \psi^{(3)}(\eta)$, and the skewness formula follows. The formula for kurtosis is proved by the same argument, using $k = 4$ in the derivative identity $\frac{d^k}{d\eta^k}\, e^{\psi(\eta)} = \int_{\mathbb{R}^d} [T(x)]^k e^{\eta T(x)} h(x)\,dx$.
Finally, for the mgf formula,
$$M_\eta(t) = E_\eta\!\left[e^{tT(X)}\right] = \int_{\mathbb{R}^d} e^{tT(x)}\, e^{\eta T(x) - \psi(\eta)}\, h(x)\,dx = e^{-\psi(\eta)}\int_{\mathbb{R}^d} e^{(t+\eta)T(x)}\, h(x)\,dx$$
$$= e^{-\psi(\eta)}\, e^{\psi(t+\eta)}\int_{\mathbb{R}^d} e^{(t+\eta)T(x) - \psi(t+\eta)}\, h(x)\,dx = e^{\psi(t+\eta) - \psi(\eta)} \times 1 = e^{\psi(t+\eta) - \psi(\eta)}. \qquad\blacksquare$$
An important consequence of the mean and the variance formulas is the following monotonicity result.
Corollary 18.1. For a nonsingular canonical exponential family, $E_\eta[T(X)]$ is strictly increasing in $\eta$ on $\mathcal{T}^0$.
Proof. From part (a) of Theorem 18.3, the variance of T .X / is the derivative of the
expectation of T .X /, and by nonsingularity, the variance is strictly positive. This
implies that the expectation is strictly increasing.
As a consequence of this strict monotonicity of the mean of T .X / in the natural
parameter, nonsingular canonical exponential families may be reparametrized by
using the mean of T itself as the parameter. This is useful for some purposes.
Example 18.10 (Binomial Distribution). From Example 18.9, in the canonical representation of the binomial distribution, $\psi(\eta) = n\log(1+e^\eta)$. By direct differentiation,
$$\psi'(\eta) = \frac{ne^\eta}{1+e^\eta}; \qquad \psi''(\eta) = \frac{ne^\eta}{(1+e^\eta)^2};$$
$$\psi^{(3)}(\eta) = \frac{ne^\eta(1-e^\eta)}{(1+e^\eta)^3}; \qquad \psi^{(4)}(\eta) = \frac{ne^\eta(e^{2\eta} - 4e^\eta + 1)}{(1+e^\eta)^4}.$$
Now recall from Example 18.9 that the success probability $p$ and the natural parameter $\eta$ are related as $p = \frac{e^\eta}{1+e^\eta}$. Using this, and our general formulas from Theorem 18.3, we can rewrite the mean, variance, skewness, and kurtosis of $X$ as
$$E(X) = np; \quad \operatorname{Var}(X) = np(1-p); \quad \beta_p = \frac{1-2p}{\sqrt{np(1-p)}}; \quad \gamma_p = \frac{1 - 6p(1-p)}{np(1-p)}.$$
For completeness, it is useful to have the mean and the variance formulas in the original parametrization, and they are stated below. The proof follows from an application of Theorem 18.3 and the chain rule.
Theorem 18.4. Let fP ; 2 ‚g be a family of distributions in the one-parameter
exponential family with density (pmf)
The exponential family satisfies a number of important closure properties. For in-
stance, if a d -dimensional random vector X D .X1 ; : : : ; Xd / has a distribution in
the exponential family, then the conditional distribution of any subvector given the
rest is also in the exponential family. There are a number of such closure properties,
of which we discuss only four.
$$f_T(t\,|\,\theta) = e^{\eta(\theta)t - \psi(\theta)}\, \frac{h(T^{-1}(t))}{|T'(T^{-1}(t))|}.$$
This is once again in the one-parameter exponential family form, with the natural sufficient statistic as $T$ itself, and the $\psi$ function unchanged. The $h$ function has changed to a new function $h^*(t) = \frac{h(T^{-1}(t))}{|T'(T^{-1}(t))|}$.
Similarly, in the discrete case, the pmf of $T(X)$ is given by
$$P_\theta(T(X) = t) = \sum_{x:\, T(x) = t} e^{\eta(\theta)T(x) - \psi(\theta)}\, h(x) = e^{\eta(\theta)t - \psi(\theta)}\, h^*(t),$$
where $h^*(t) = \sum_{x:\, T(x) = t} h(x)$.
Next, suppose $X = (X_1, \ldots, X_d)$ has a density (pmf) $f(x\,|\,\theta)$ in the exponential family and $Y_1, Y_2, \ldots, Y_n$ are $n$ iid observations from this density $f(x\,|\,\theta)$. Note that each individual $Y_i$ is a $d$-dimensional vector. The joint density of $Y = (Y_1, Y_2, \ldots, Y_n)$ is
$$f(y\,|\,\theta) = \prod_{i=1}^n f(y_i\,|\,\theta) = \prod_{i=1}^n e^{\eta(\theta)T(y_i) - \psi(\theta)}\, h(y_i) = e^{\eta(\theta)\sum_{i=1}^n T(y_i) - n\psi(\theta)}\prod_{i=1}^n h(y_i).$$
(c) Let $I_0$ be any proper subset of $I = \{1, 2, \ldots, d\}$. Then the joint conditional distribution of $X_i,\ i \in I_0$, given $X_j,\ j \in I - I_0$, also belongs to the one-parameter exponential family.
(d) For given $n \geq 1$, suppose $Y_1, \ldots, Y_n$ are iid with the same distribution as $X$. Then the joint distribution of $(Y_1, \ldots, Y_n)$ also belongs to the one-parameter exponential family.
There is an implicit assumption in this definition that the number of freely varying $\eta$'s is the same as the number of freely varying $\theta$'s, and that these are both equal to the specific $k$ in the context. The formal way to say this is to assume the following.
Assumption. The dimension of $\Theta$, as well as the dimension of the image of $\Theta$ under the map $(\theta_1, \theta_2, \ldots, \theta_k) \to (\eta_1(\theta_1, \ldots, \theta_k), \eta_2(\theta_1, \ldots, \theta_k), \ldots, \eta_k(\theta_1, \ldots, \theta_k))$, is equal to $k$.
There are some important examples where this assumption does not hold. They
are not counted as members of a k-parameter exponential family. The name curved
exponential family is commonly used for them, and this is discussed in the last
section.
The terms canonical form, natural parameter, and natural parameter space mean the same things as in the one-parameter case. Thus, if we parametrize the distributions by using η_1, η_2, ..., η_k as the k parameters, then the vector η = (η_1, η_2, ..., η_k) is called the natural parameter vector, the parametrization f(x|η) = e^{Σ_{i=1}^k η_i T_i(x) − ψ(η)} h(x) is called the canonical form, and the set of all vectors η for which f(x|η) is a valid density (pmf) is called the natural parameter space. The main theorems for the case k = 1 hold for a general k.
Theorem 18.6. The results of Theorems 18.1 and 18.5 hold for the k-parameter exponential family.
The proofs are almost verbatim the same. The moment formulas differ somewhat due to the presence of more than one parameter in the current context.
Theorem 18.7. Suppose X = (X_1, ..., X_d) has a distribution P_η, η ∈ T, belonging to the canonical k-parameter exponential family, with a density (pmf)

f(x|η) = e^{Σ_{i=1}^k η_i T_i(x) − ψ(η)} h(x),

where

T = {η ∈ R^k : ∫_{R^d} e^{Σ_{i=1}^k η_i T_i(x)} h(x) dx < ∞}.
η_1(θ) = −1/(2θ_2^2), η_2(θ) = θ_1/θ_2^2, T_1(x) = x^2, T_2(x) = x,

ψ(θ) = θ_1^2/(2θ_2^2) + log θ_2, h(x) = (1/√(2π)) I_{x∈R};

in the canonical parametrization,

η_1 = −1/(2θ_2^2), η_2 = θ_1/θ_2^2, and ψ(η) = −η_2^2/(4η_1) − (1/2) log(−2η_1).
f(x|θ) = e^{−x/θ_2 + θ_1 log x − θ_1 log θ_2 − log Γ(θ_1)} (1/x) I_{x>0}.

This is in the two-parameter exponential family with η_1(θ) = −1/θ_2, η_2(θ) = θ_1, T_1(x) = x, T_2(x) = log x, ψ(θ) = θ_1 log θ_2 + log Γ(θ_1), and h(x) = (1/x) I_{x>0}. The parameter space in the θ-parametrization is (0, ∞) ⊗ (0, ∞). For the canonical form, use η_1 = −1/θ_2, η_2 = θ_1, and so, the natural parameter space is (−∞, 0) ⊗ (0, ∞). The natural sufficient statistic is (X, log X).
k = d + d + (d^2 − d)/2 = d + d(d + 1)/2 = d(d + 3)/2.
The density of X is

f(x|θ) = (2π)^{−d/2} |Σ|^{−1/2} e^{−(1/2)(x − μ)'Σ^{-1}(x − μ)} I_{x∈R^d}
       = (2π)^{−d/2} |Σ|^{−1/2} e^{−(1/2) x'Σ^{-1}x + μ'Σ^{-1}x − (1/2) μ'Σ^{-1}μ} I_{x∈R^d}
       = (2π)^{−d/2} |Σ|^{−1/2} e^{−(1/2) Σ_{i,j} σ^{ij} x_i x_j + Σ_i (Σ_k σ^{ki} μ_k) x_i − (1/2) μ'Σ^{-1}μ} I_{x∈R^d}
       = (2π)^{−d/2} |Σ|^{−1/2} e^{−(1/2) Σ_i σ^{ii} x_i^2 − Σ_{i<j} σ^{ij} x_i x_j + Σ_i (Σ_k σ^{ki} μ_k) x_i − (1/2) μ'Σ^{-1}μ} I_{x∈R^d},

where σ^{ij} denotes the (i, j)th element of Σ^{-1}.
f(x|θ) = [n! / ((∏_{i=1}^k x_i!)(n − Σ_{i=1}^k x_i)!)] (∏_{i=1}^k p_i^{x_i}) (1 − Σ_{i=1}^k p_i)^{n − Σ_{i=1}^k x_i} I_{x_1,...,x_k ≥ 0, Σ_{i=1}^k x_i ≤ n}

= [n! / ((∏_{i=1}^k x_i!)(n − Σ_{i=1}^k x_i)!)] e^{Σ_{i=1}^k (log p_i) x_i − log(1 − Σ_{i=1}^k p_i)(Σ_{i=1}^k x_i) + n log(1 − Σ_{i=1}^k p_i)} I_{x_1,...,x_k ≥ 0, Σ_{i=1}^k x_i ≤ n}

= [n! / ((∏_{i=1}^k x_i!)(n − Σ_{i=1}^k x_i)!)] e^{Σ_{i=1}^k log(p_i/(1 − Σ_{j=1}^k p_j)) x_i + n log(1 − Σ_{i=1}^k p_i)} I_{x_1,...,x_k ≥ 0, Σ_{i=1}^k x_i ≤ n}.
600 18 The Exponential Family and Statistical Applications
This is in the k-parameter exponential family form with the natural sufficient statistic and natural parameters

T(X) = (X_1, X_2, ..., X_k), η_i = log(p_i / (1 − Σ_{j=1}^k p_j)), 1 ≤ i ≤ k.
as r → ∞. The density of this limiting CDF is f(x) = (1/√(2π)) e^{−1/(2x)} x^{−3/2} I_{x>0}. This is a special inverse Gaussian distribution. The general inverse Gaussian distribution has the density

f(x|θ_1, θ_2) = (θ_2/π)^{1/2} x^{−3/2} e^{−θ_1 x − θ_2/x + 2√(θ_1 θ_2)} I_{x>0};

the parameter space for θ = (θ_1, θ_2) is [0, ∞) ⊗ (0, ∞). Note that the special inverse Gaussian density ascribed to the above corresponds to θ_1 = 0, θ_2 = 1/2. The general inverse Gaussian density f(x|θ_1, θ_2) is the density of the first time that a Wiener process (starting at zero) hits the straight line with the equation y = √(2θ_2) − √(2θ_1) t, t > 0.
It is clear from the formula for f(x|θ_1, θ_2) that it is a member of the two-parameter exponential family with the natural sufficient statistic T(X) = (X, 1/X) and the natural parameter space T = (−∞, 0] ⊗ (−∞, 0). Note that the natural parameter space is not open.
Exponential families under mild conditions on the parameter space Θ have the property that if a function g(T) of the natural sufficient statistic T = T(X) has zero expected value under each θ ∈ Θ, then g(T) itself must be essentially identically equal to zero. A family of distributions that has this property is called a complete family. The completeness property, particularly in conjunction with the property of sufficiency, has had a historically important role in statistical inference. Lehmann (1959), Lehmann and Casella (1998), and Brown (1986) give many applications. However, our motivation for studying the completeness of a full rank exponential family is primarily for presenting a well-known theorem in statistics, which actually is also a very effective and efficient tool for probabilists. This theorem, known as Basu's theorem (Basu (1955)), helps probabilists avoid clumsy distributional calculations. Completeness is required in order to state Basu's theorem.
18.4 Sufficiency and Completeness 601
Example 18.18. Suppose X_1, X_2 are iid N(μ, 1), and μ belongs to some subset Θ of the real line. Let S(X_1, X_2) = X_1 − X_2. Then, under any P_μ, S(X_1, X_2) ~ N(0, 2), a fixed distribution that does not depend on μ. Thus, S(X_1, X_2) = X_1 − X_2 is ancillary, whatever the set of values of μ is.
Example 18.19. Suppose X_1, X_2 are iid U[0, θ], and θ belongs to some subset Θ of (0, ∞). Let S(X_1, X_2) = X_1/X_2. We can write S(X_1, X_2) as

S(X_1, X_2) =_L (θU_1)/(θU_2) = U_1/U_2

(the equality being in law), where U_1, U_2 are iid U[0, 1]. Thus, under any P_θ, S(X_1, X_2) is distributed as the ratio of two independent U[0, 1] variables. This is a fixed distribution that does not depend on θ. Thus, S(X_1, X_2) = X_1/X_2 is ancillary, whatever the set of values of θ is.
Example 18.20. Suppose X_1, X_2, ..., X_n are iid N(μ, 1), and μ belongs to some subset Θ of the real line. Let S(X_1, ..., X_n) = Σ_{i=1}^n (X_i − X̄)^2. We can write S(X_1, ..., X_n) as

S(X_1, ..., X_n) =_L Σ_{i=1}^n (μ + Z_i − [μ + Z̄])^2 = Σ_{i=1}^n (Z_i − Z̄)^2,

where Z_1, ..., Z_n are iid N(0, 1).
There is a more general version of Basu's theorem that applies to arbitrary parametric families of distributions. The intuition is the same as it was in the case of an exponential family, namely, a sufficient statistic, which contains all the information, and an ancillary statistic, which contains no information, must be independent. For this, we need to define what a sufficient statistic means for a general parametric family. Here is Fisher's original definition (Fisher (1922)).
Definition 18.9. Let n ≥ 1 be given, and suppose X = (X_1, ..., X_n) has a joint distribution P_{θ,n} belonging to some family F_n = {P_{θ,n} : θ ∈ Θ}.
The intuition of the factorization theorem is that the only way the parameter θ is tied to the data values X_1, ..., X_n in the likelihood function f(x_1, ..., x_n|θ) is via the statistic T(X_1, ..., X_n), because there is no θ in the function h(x_1, ..., x_n). Therefore, we need only know what T is, not the individual values X_1, ..., X_n.
Here is one example on using the factorization theorem.
f(x_1, ..., x_n|θ) = ∏_{i=1}^n (1/θ) I_{x_i ≤ θ} = (1/θ^n) ∏_{i=1}^n I_{x_i ≤ θ} = (1/θ^n) I_{x_{(n)} ≤ θ};

then, by the factorization theorem, the sample maximum X_{(n)} is sufficient for the U[0, θ] family. The result does make some intuitive sense.
We had previously commented that the sufficient statistic by itself captures all of the information about θ that the full knowledge of X could have provided. On the other hand, an ancillary statistic cannot provide any information about θ, because its distribution does not even involve θ. Basu's theorem says that a statistic which provides all the information, and another that provides no information, must be independent, provided the additional nonempty interior condition holds, in order to ensure completeness of the family F. Thus, the concepts of information, sufficiency, ancillarity, completeness, and independence come together in Basu's theorem. However, our main interest is simply to use Basu's theorem as a convenient tool to arrive quickly at some results that are purely results in the domain of probability. Here are a few such examples.
Example 18.22 (Independence of Mean and Variance for a Normal Sample). Suppose X_1, X_2, ..., X_n are iid N(μ, σ^2) for some μ, σ. It was stated in Chapter 4 that the sample mean X̄ and the sample variance s^2 are independently distributed for any n, and whatever μ and σ are. We now prove it. For this, first we establish the claim that if the result holds for μ = 0, σ = 1, then it holds for all μ, σ. Indeed, fix any μ, σ, and write X_i = μ + σZ_i, 1 ≤ i ≤ n, where Z_1, ..., Z_n are iid N(0, 1). Now,

(X̄, Σ_{i=1}^n (X_i − X̄)^2) =_L (μ + σZ̄, σ^2 Σ_{i=1}^n (Z_i − Z̄)^2).

Therefore, X̄ and Σ_{i=1}^n (X_i − X̄)^2 are independently distributed under (μ, σ) if and only if Z̄ and Σ_{i=1}^n (Z_i − Z̄)^2 are independently distributed. This is a step in getting rid of the parameters μ, σ from consideration.
But, now, we import a parameter! Embed the N(0, 1) distribution into the larger family of distributions {N(μ, 1), μ ∈ R}. Consider now a fictitious sample Y_1, Y_2, ..., Y_n from P_μ = N(μ, 1). The joint density of Y = (Y_1, Y_2, ..., Y_n) is a one-parameter exponential family density with the natural sufficient statistic T(Y) = Σ_{i=1}^n Y_i. By Example 18.20, Σ_{i=1}^n (Y_i − Ȳ)^2 is ancillary. The parameter space for μ obviously has a nonempty interior, thus all the conditions of Basu's theorem are satisfied, and therefore, under each μ, Σ_{i=1}^n Y_i and Σ_{i=1}^n (Y_i − Ȳ)^2 are independently distributed. In particular, they are independently distributed under μ = 0, that is, when the samples are iid N(0, 1), which is what we needed to prove.
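The independence just proved is easy to check empirically. A minimal sketch (assuming NumPy; the seed, the number of replications, and the sample size n = 10 are arbitrary choices) estimates the correlation between X̄ and s² across many simulated normal samples; by the result above it should be essentially zero.

```python
import numpy as np

rng = np.random.default_rng(7)
reps, n = 100_000, 10

# Draw many independent N(0, 1) samples of size n.
X = rng.standard_normal((reps, n))
xbar = X.mean(axis=1)        # sample means
s2 = X.var(axis=1, ddof=1)   # sample variances

# By Basu's theorem, xbar and s2 are independent, so their
# correlation across replications should be close to zero.
corr = np.corrcoef(xbar, s2)[0, 1]
print(round(corr, 3))
```

Zero correlation is of course weaker than independence, but a near-zero estimate is a quick sanity check of the theorem's conclusion.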
is a (vector) ancillary statistic. Next observe that the joint density of X = (X_1, X_2, ..., X_n) is a one-parameter exponential family, with the natural sufficient statistic T(X) = X_1 + ··· + X_n. Because the parameter space (0, ∞) obviously has a nonempty interior, by Basu's theorem, under each θ,

(X_1/(X_1 + ··· + X_n), ..., X_{n−1}/(X_1 + ··· + X_n)) and X_1 + ··· + X_n

are independently distributed.
Example 18.24 (A Covariance Calculation). Suppose X_1, ..., X_n are iid N(0, 1), and let X̄ and M_n denote the mean and the median of the sample set X_1, ..., X_n. By using our old trick of importing a mean parameter μ, we first observe that the difference statistic X̄ − M_n is ancillary. On the other hand, the joint density of X = (X_1, ..., X_n) is of course a one-parameter exponential family with the natural sufficient statistic T(X) = X_1 + ··· + X_n. By Basu's theorem, X_1 + ··· + X_n and X̄ − M_n are independent under each μ, which implies

Cov(X_1 + ··· + X_n, X̄ − M_n) = 0 ⇒ Cov(X̄, X̄ − M_n) = 0
⇒ Cov(X̄, M_n) = Cov(X̄, X̄) = Var(X̄) = 1/n.

We have achieved this result without doing any calculations at all. A direct attack on this problem requires handling the joint distribution of (X̄, M_n).
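The identity Cov(X̄, M_n) = 1/n can be checked by simulation. A minimal sketch, assuming NumPy (seed, n = 10, and the number of replications are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
reps, n = 200_000, 10

X = rng.standard_normal((reps, n))
xbar = X.mean(axis=1)
med = np.median(X, axis=1)

# Basu's theorem gives Cov(mean, median) = Var(mean) = 1/n exactly.
cov = np.cov(xbar, med)[0, 1]
print(round(cov, 3), 1 / n)
```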
Example 18.25 (An Expectation Calculation). Suppose X_1, ..., X_n are iid U[0, 1], and let X_{(1)}, X_{(n)} denote the smallest and the largest order statistics of X_1, ..., X_n. Import a parameter θ > 0, and consider the family of U[0, θ] distributions. We have
shown that the largest order statistic X_{(n)} is sufficient; it is also complete. On the other hand, the quotient X_{(1)}/X_{(n)} is ancillary. To see this, again, write (X_1, ..., X_n) =_L (θU_1, ..., θU_n), where U_1, ..., U_n are iid U[0, 1]. As a consequence, X_{(1)}/X_{(n)} =_L U_{(1)}/U_{(n)}. So, X_{(1)}/X_{(n)} is ancillary. By the general version of Basu's theorem, which works for any family of distributions (not just an exponential family), it follows that X_{(1)}/X_{(n)} and X_{(n)} are independently distributed under each θ. Hence,

E[X_{(1)}] = E[(X_{(1)}/X_{(n)}) X_{(n)}] = E[X_{(1)}/X_{(n)}] E[X_{(n)}]

⇒ E[X_{(1)}/X_{(n)}] = E[X_{(1)}]/E[X_{(n)}] = (1/(n + 1)) / (n/(n + 1)) = 1/n.

Once again, we can get this result by using Basu's theorem without doing any integrations or calculations at all.
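The conclusion E[X_{(1)}/X_{(n)}] = 1/n is again easy to confirm by simulation. A minimal sketch, assuming NumPy (the choices n = 5 and 200,000 replications are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
reps, n = 200_000, 5

U = rng.random((reps, n))
ratio = U.min(axis=1) / U.max(axis=1)

# Basu's theorem gives E[X_(1)/X_(n)] = 1/n = 0.2 for n = 5.
print(round(ratio.mean(), 3))
```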
and,

E[d_n^k] = ∫_0^1 k u^{k−1} (1 − u^d)^n du = n! Γ(k/d + 1) / Γ(n + k/d + 1).

Now, embed the uniform distribution in the unit ball into the family of uniform distributions in balls of radius θ and centered at the origin. Then, D_n is complete and sufficient (akin to Example 18.24), and the ratio d_n/D_n is ancillary. Therefore, once again, by the general version of Basu's theorem, D_n and d_n/D_n are independently distributed under each θ > 0, and so, in particular under θ = 1. Thus, for any k ≥ 1,
18.5 Curved Exponential Family
There are some important examples in which the density (pmf) has the basic exponential family form f(x|θ) = e^{Σ_{i=1}^k η_i(θ)T_i(x) − ψ(θ)} h(x), but the assumption that the dimension of Θ and that of the range space of (η_1(θ), ..., η_k(θ)) are the same is violated; more precisely, the dimension of Θ is some positive integer q strictly less than k. Let us start with an example.
Example 18.27. Suppose X ~ N(μ, μ^2), μ ≠ 0. Writing θ = μ, the density of X is

f(x|θ) = (1/(√(2π)|θ|)) e^{−(x − θ)^2/(2θ^2)} I_{x∈R}
       = (1/√(2π)) e^{−x^2/(2θ^2) + x/θ − 1/2 − log|θ|} I_{x∈R}.

Writing η_1(θ) = −1/(2θ^2), η_2(θ) = 1/θ, T_1(x) = x^2, T_2(x) = x, ψ(θ) = 1/2 + log|θ|, and h(x) = (1/√(2π)) I_{x∈R}, this is in the form f(x|θ) = e^{Σ_{i=1}^k η_i(θ)T_i(x) − ψ(θ)} h(x), with k = 2, although θ ∈ R, which is only one-dimensional. The two functions
η_1(θ) = −1/(2θ^2) and η_2(θ) = 1/θ are related to each other by the identity η_1 = −η_2^2/2, so that a plot of (η_1, η_2) in the plane would be a curve, not a straight line. Distributions of this kind go by the name of curved exponential family. The dimension of the natural sufficient statistic is more than the dimension of Θ for such distributions.
Definition 18.10. Let X = (X_1, ..., X_d) have a distribution P_θ, θ ∈ Θ ⊆ R^q. Suppose P_θ has a density (pmf) of the form

f(x|θ) = e^{Σ_{i=1}^k η_i(θ)T_i(x) − ψ(θ)} h(x),

where k > q. Then, the family {P_θ, θ ∈ Θ} is called a curved exponential family.
Example 18.28 (A Specific Bivariate Normal). Suppose X = (X_1, X_2) has a bivariate normal distribution with zero means, standard deviations equal to one, and a correlation parameter ρ, −1 < ρ < 1. The density of X is

f(x|ρ) = (1/(2π√(1 − ρ^2))) e^{−(1/(2(1−ρ^2)))[x_1^2 + x_2^2 − 2ρx_1x_2]} I_{x_1,x_2∈R}
       = (1/(2π√(1 − ρ^2))) e^{−(1/(2(1−ρ^2)))(x_1^2 + x_2^2) + (ρ/(1−ρ^2)) x_1x_2} I_{x_1,x_2∈R}.
f(x_1, ..., x_n, z_1, ..., z_n|θ) = ∏_{i=1}^n [e^{−θz_i} (θz_i)^{x_i} / x_i!] p(z_1, z_2, ..., z_n) I_{x_1,...,x_n∈N_0} I_{z_1,z_2,...,z_n∈N_1}

= e^{−θ Σ_{i=1}^n z_i + (Σ_{i=1}^n x_i) log θ} [∏_{i=1}^n z_i^{x_i}/x_i!] p(z_1, z_2, ..., z_n) I_{x_1,...,x_n∈N_0} I_{z_1,z_2,...,z_n∈N_1},
where N_0 is the set of nonnegative integers, and N_1 is the set of positive integers. This is in the curved exponential family with

q = 1, k = 2, η_1(θ) = −θ, η_2(θ) = log θ, T_1(x, z) = Σ_{i=1}^n z_i, T_2(x, z) = Σ_{i=1}^n x_i,
and

h(x, z) = [∏_{i=1}^n z_i^{x_i}/x_i!] p(z_1, z_2, ..., z_n) I_{x_1,...,x_n∈N_0} I_{z_1,z_2,...,z_n∈N_1}.
Exercises
Exercise 18.1. Show that the geometric distribution belongs to the one-parameter
exponential family if 0 < p < 1, and write it in the canonical form and by using the
mean parametrization.
Exercise 18.2 (Poisson Distribution). Show that the Poisson distribution belongs
to the one-parameter exponential family if > 0. Write it in the canonical form and
by using the mean parametrization.
Exercise 18.3 (Negative Binomial Distribution). Show that the negative binomial
distribution with parameters r and p belongs to the one-parameter exponential fam-
ily if r is considered fixed and 0 < p < 1. Write it in the canonical form and by
using the mean parametrization.
Exercise 18.5 * (Normal with Equal Mean and Variance). Show that the N(μ, μ) distribution belongs to the one-parameter exponential family if μ > 0. Write it in the canonical form and by using the mean parametrization.
Exercise 18.7 (Beta Distribution). Show that the two-parameter Beta distribution belongs to the two-parameter exponential family if the parameters α, β > 0. Write it in the canonical form and by using the mean parametrization.
Exercise 18.8 * (Poisson Skewness and Kurtosis). Find the skewness and kurtosis of a Poisson distribution by using Theorem 18.3.
Exercise 18.9 * (Gamma Skewness and Kurtosis). Find the skewness and kurtosis of a Gamma distribution, considering α as fixed, by using Theorem 18.3.
Exercise 18.10 * (Distributions with Zero Skewness). Show that the only dis-
tributions in a canonical one-parameter exponential family such that the natural
sufficient statistic has a zero skewness are the normal distributions with a fixed vari-
ance.
the mean and the variance of Y_1, ..., Y_n. Also let r denote the sample correlation coefficient based on the pairs (X_i, Y_i), 1 ≤ i ≤ n. Prove that X̄, Ȳ, s_1^2, s_2^2, r are mutually independent under all μ_1, μ_2, σ_1, σ_2.
Exercise 18.25 (Mixtures of Normal). Show that the mixture distribution 0.5 N(μ, 1) + 0.5 N(μ, 2) does not belong to the one-parameter exponential family. Generalize this result to more general mixtures of normal distributions.
Exercise 18.26 (Double Exponential Distribution). (a) Show that the double exponential distribution with a known value of σ and an unknown mean μ does not belong to the one-parameter exponential family, but the double exponential distribution with a known mean and an unknown σ belongs to the one-parameter exponential family.
(b) Show that the two-parameter double exponential distribution does not belong to the two-parameter exponential family.
The principle underlying all MCMC methods is extremely simple. Given a target distribution π on some state space S from which one wants to simulate, one simply constructs a Markov chain X_n, n = 0, 1, 2, ..., such that X_n has a unique stationary distribution, and that stationary distribution is π. So, if we simply generate successive states X_0, X_1, ..., X_B, X_{B+1}, ..., X_n for some suitably large values of B, n, then we can act as if X_{B+1}, ..., X_n is a dependent sample of size n − B from π. We can then use these n − B values to approximate probabilities, to approximate expectations, or for whatever other purpose we want. The practical part is to devise a Markov chain that has π as its stationary distribution, and it must be reasonably easy to run this chain as a matter of convenient implementation.
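The recipe in this paragraph can be sketched in a few lines. The following is a minimal illustration only, not an algorithm taken from the text: a random-walk Metropolis chain whose stationary distribution is N(0, 1), with arbitrary choices of proposal scale, burn-in B, and run length n (assumes NumPy).

```python
import numpy as np

rng = np.random.default_rng(3)

def metropolis_normal(n=50_000, B=5_000, step=1.0):
    """Random-walk Metropolis chain whose stationary law is N(0, 1)."""
    x = 0.0
    out = np.empty(n)
    for i in range(n):
        prop = x + step * rng.standard_normal()
        # Accept with probability min(1, pi(prop)/pi(x)), pi = N(0,1) density,
        # which on the log scale is 0.5*(x^2 - prop^2).
        if np.log(rng.random()) < 0.5 * (x * x - prop * prop):
            x = prop
        out[i] = x
    return out[B:]          # discard the burn-in portion X_0, ..., X_B

draws = metropolis_normal()
print(round(draws.mean(), 2), round(draws.std(), 2))
```

The retained draws behave like a dependent sample from the target: their mean and standard deviation should be close to 0 and 1.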
There are a number of popular MCMC methods in use. These include the
Hastings algorithm, the Metropolis algorithm, and the Gibbs sampler. Implement-
ing these algorithms is usually a simple matter, and computer codes for many of
them are now publicly available. The difficult part in justifying MCMC methods is
answering the question of how long to let the chain run so as to allow it to get close
to stationarity. This is critical, because the stationary distribution of the chain is the
target distribution. In principle, the speed of convergence of the chain to stationar-
ity is exponential under mild conditions on the chain. General Markov chain theory
implies such exponential convergence. However, concrete statements on the exact
exponential speed, or asserting theoretical bounds on the closeness of the chain to
stationarity can usually be done only on a case-by-case basis, and typically each
new case needs a new creative technique.
We start with a treatment of conventional Monte Carlo and textbook simulation
techniques, and then provide the most popular MCMC algorithms with a selection
of applications. We then provide some of the available general theory on the ques-
tion of speed of convergence of a particular chain to stationarity. This is linked to
choosing the number B, the burn-in period, and the total run length n. These con-
vergence theorems form the most appealing theoretical aspect of MCMC methods.
They are also practically useful, because without a theorem ensuring that the chain
has come very close to its equilibrium state, the user can never be confident that the
simulation output is reliable.
For conventional Monte Carlo and techniques of simulation, some excellent
references are Fishman (1995), Ripley (1987), Robert and Casella (2004), and
Ross (2006). Markov Chain Monte Carlo started off with the two path-breaking
articles: Metropolis et al. (1953), and Hastings (1970). Geman and Geman (1984),
Tanner and Wong (1987), Smith and Roberts (1993), and Gelfand and Smith (1987)
are among the pioneering articles in the statistical literature. Excellent book-length
treatments of MCMC include Robert and Casella (2004), Gamerman (1997), and
Liu (2008). Geyer (1992), Gilks et al. (1995), and Gelman et al. (2003) are ex-
cellent readings on MCMC with an applied focus. Diaconis (2009) is an up-to-
date review. Literature on convergence of an MCMC algorithm has been growing
steadily. Brémaud (1999) is one of the best sources to read about general the-
ory of convergence of MCMC algorithms. Diaconis et al. (2008) is a tour de
force on convergence of the Gibbs sampler, and Diaconis and Saloff-Coste (1998)
on the Metropolis algorithm. Other important references on the difficult issue of
convergence are Diaconis and Stroock (1991), Tierney (1994), Athreya, Doss and Sethuraman (1996), Propp and Wilson (1998), and Rosenthal (2002); Dimakos
(2001) is a useful survey. Various useful modifications of the basic MCMC have
been suggested to address specific important applications. Notable among them
are Green (1995), Besag and Clifford (1991), Fill (1998), Diaconis and Sturmfels
(1998), Higdon (1998), and Kendall and Thönnes (1999), among many others. Var-
ious other specific references are given in the sections.
The ordinary Monte Carlo is a simulation technique for approximating the expectation of a function ψ(X) for a general random variable X, when the exact expectation cannot be found analytically, or by other numerical means, such as quadrature. The idea of Monte Carlo simulation originated around 1940 in connection with secret nuclear weapon projects in which physicists wanted to understand how the physical properties of neutrons would be affected by various possible scenarios following a collision with a nucleus. The Monte Carlo term was picked by Stanislaw Ulam and von Neumann during that time.
The basis for the ordinary Monte Carlo is Kolmogorov's SLLN (see Chapter 7), which says that if X_1, X_2, ... are iid copies of X, the basic underlying random variable, then θ̂ = (1/n) Σ_{i=1}^n ψ(X_i) converges almost surely to θ = E[ψ(X)], as long as we know that the expectation exists. This is because, if X_1, X_2, ... are iid copies of X, then Z_i = ψ(X_i), i = 1, 2, ... are iid copies of Z = ψ(X), and therefore, by the canonical SLLN, (1/n) Σ_{i=1}^n Z_i converges almost surely to E[Z]. Therefore, provided that we can actually do the simulation in practice, we can simply simulate a large number of iid copies of X and approximate the true value of E[ψ(X)] by the sample mean (1/n) Σ_{i=1}^n ψ(X_i). Of course, there will be a Monte Carlo error in this approximation, and if we run the simulation again, we get a different approximated value for E[ψ(X)]. Thus, some reliability measure for the Monte Carlo estimate is necessary. A quick one at hand is the variance of the Monte Carlo estimate,

Var(θ̂) = Var[ψ(X)]/n = (E[ψ^2(X)] − (E[ψ(X)])^2)/n.
However, this involves quantities that we do not know, in particular E[ψ(X)], the very quantity we set out to approximate! However, we can fall back on the sample variance

s_z^2 = (1/(n − 1)) Σ_{i=1}^n (Z_i − Z̄)^2

to estimate this uncomputable exact variance; that is, a reliability measure could be an estimate of Var(θ̂) given by

(1/(n(n − 1))) Σ_{i=1}^n (Z_i − Z̄)^2.
n      Monte Carlo Estimate of θ = 0   95% Confidence Interval
100    0.0714                          0.0714 ± 0.2921
250    0.0232                          0.0232 ± 0.2185
500    0.0712                          0.0712 ± 0.1435
1000   0.0116                          0.0116 ± 0.1005
The Monte Carlo estimate itself oscillates, but the confidence interval gets tighter as the Monte Carlo sample size n increases. Only for n = 1000 do the results of the Monte Carlo simulation approach even barely acceptable accuracy. It is common practice to use n in the several thousands when applying ordinary Monte Carlo for estimating one single θ. If we have to estimate θ for several different choices of ψ, and if the functions ψ have awkward behavior at regions of low density of X, then the Monte Carlo sample size has to be increased. Formal recommendations can be given by using a pilot or guessed value of σ^2, the true variance of ψ(X).
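The estimate-plus-reliability recipe just described can be sketched as follows. This is a hypothetical illustration assuming NumPy; the choice ψ(x) = x with X ~ N(0, 1) is made only so that the true value θ = 0 is known for comparison.

```python
import numpy as np

rng = np.random.default_rng(5)

def monte_carlo_mean(psi, sampler, n):
    """Monte Carlo estimate of E[psi(X)] with a 95% CI half-width."""
    z = psi(sampler(n))
    est = z.mean()
    # Reliability measure: estimated Var(theta_hat) = s_z^2 / n.
    half = 1.96 * z.std(ddof=1) / np.sqrt(n)
    return est, half

est, half = monte_carlo_mean(lambda x: x, rng.standard_normal, 1000)
print(f"{est:.4f} +/- {half:.4f}")
```

With n = 1000 the half-width is roughly 1.96/√1000 ≈ 0.06, matching the order of magnitude of the intervals in the table above.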
Example 19.2 (Estimating the Volume of a Set). Let S be a set in a Euclidean space R^d, and suppose we want to find the volume of S. Unless S has a very specialized shape, an exact volume formula will be difficult or impossible to write (and especially so when d is large). However, Monte Carlo can assist us in this difficult problem. We have to make some assumptions. We assume that S is a bounded set, and that there is an explicit d-dimensional rectangle R such that S ⊆ R. Without loss of generality, we may take R to be [0, 1]^d. Denote the volume of S by θ = Vol(S). Note that θ = ∫_S dx.
There is nothing probabilistic in the problem so far. It appears to be a purely mathematical problem. But we can think of θ probabilistically by writing it as θ = ∫_S f(x) dx, where f(x) = I_{x∈R} is the density of the d-dimensional uniform distribution on R = [0, 1]^d. Therefore, θ = P(X ∈ S), where X is distributed uniformly in R. We now realize the potential of Monte Carlo in estimating θ.
Indeed, let X_1, X_2, ..., X_n be independent uniformly distributed points in R. Then, from our general discussion above, a Monte Carlo estimate of θ is θ̂ = #{i : X_i ∈ S}/n. We can also construct confidence intervals for θ by following the score interval's formula given above. Thus, a potentially very difficult mathematical problem is reduced to simply simulating n uniform random vectors from the rectangle R, which is the same as simulating nd iid U[0, 1] random variables, an extremely simple task. Of course, it is necessary to remember that Monte Carlo is not going to give us the exact value of the volume of S. With luck, it will give a good approximation, which is useful.
Example 19.3 (Monte Carlo Estimate of the Volume of a Cone). This is a specific example of the application of the Monte Carlo method to estimate the volume of a set. Take the case of a right circular cone S of base radius r = 1 and height h = 1. The true volume of S is θ = (1/3)πr^2h = π/3 = 1.047. The apex aperture of S is given by 2 arctan(r/h) = 2(π/4) = π/2. Therefore, S is described in the Cartesian coordinates as

S = {(x, y, z) : 0 ≤ z ≤ 1, x^2 + y^2 ≤ z^2}.
618 19 Simulation and Markov Chain Monte Carlo
n        Monte Carlo Estimate of θ = 1.047   95% Confidence Interval
100      1.08                                1.08 ± 0.343
250      1.168                               1.168 ± 0.224
500      1.016                               1.016 ± 0.152
1000     1.100                               1.100 ± 0.1105
10,000   1.064                               1.064 ± 0.035
Once again, as in Example 19.1, we see that the Monte Carlo estimate itself
oscillates, and the Monte Carlo error is not monotone decreasing in n. However,
the width of the confidence interval consistently decreases as the Monte Carlo sam-
ple size n increases. To get a really good estimate and a tight confidence interval,
we appear to need a Monte Carlo sample size of about 10,000.
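The cone computation above can be sketched in a few lines, assuming NumPy. The bounding box [−1, 1] × [−1, 1] × [0, 1] (of volume 4) used here is our own convenient choice; the text's rectangle [0, 1]^d corresponds to a trivial rescaling.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Sample uniformly from the bounding box [-1,1] x [-1,1] x [0,1] (volume 4).
x = rng.uniform(-1, 1, n)
y = rng.uniform(-1, 1, n)
z = rng.uniform(0, 1, n)

inside = x**2 + y**2 <= z**2       # membership in the cone S
theta_hat = 4 * inside.mean()      # box volume times the hit fraction
print(round(theta_hat, 3))         # true value: pi/3 = 1.047
```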
Example 19.4 (A Bayesian Example). Monte Carlo methods can be used not only to approximate expectations, but also to approximate percentiles of a distribution. Precisely, if X_1, X_2, ... are iid sample observations from a continuous distribution F with a strictly positive density f, then for any α, 0 < α < 1, the sample quantile F_n^{-1}(α) converges almost surely to the corresponding quantile F^{-1}(α) of F. So, as in the case of Monte Carlo estimates for expectations, we can generate a Monte Carlo sample from F and estimate F^{-1}(α) by F_n^{-1}(α).
Suppose X ~ Bin(m, p), and the unknown parameter p is assigned a beta prior density, with parameters α, β. Then the posterior density of p is another Beta, namely Be(x + α, m − x + β) (see Chapter 3). To give a specific example, suppose m = 100, x = 45, α = β = 1 (i.e., p has a U[0, 1] prior). The percentiles of a Beta density do not have closed-form formulas. So, one has to resort to numerical methods to evaluate them.
We first estimate the posterior median by using a Monte Carlo sample for various values of the Monte Carlo sample size n. For comparison, a true value for the posterior median in this case is reported by Mathematica as 0.4507.
19.1 The Ordinary Monte Carlo 619
The estimates oscillate slightly. But even for a Monte Carlo sample size as small as n = 50, the estimate is impressively accurate. But change the problem to Monte Carlo estimation of an extreme quantile, say the 99.9th percentile of the posterior distribution of p. The true value is reported by Mathematica to be 0.6028. The Monte Carlo estimates for various n are reported below. We can see that the quality of the estimation has deteriorated.

n        Monte Carlo Estimate of 99.9th Posterior Percentile = 0.6028
50       0.5662
100      0.5618
250      0.5714
500      0.5735
1000     0.5889
5000     0.5956
10,000   0.5987
The need for a much larger Monte Carlo sample size is clearly seen from the table.
Although we could estimate the posterior median accurately with a Monte Carlo
sample size of 100, for the extreme quantile, we need five to ten thousand Monte
Carlo samples.
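Both quantile estimates in this example can be reproduced with a short sketch, assuming NumPy. As computed in the text, the posterior here is Beta(46, 56); the sample size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)

# Posterior of p given m = 100, x = 45 and a uniform prior: Beta(46, 56).
n = 100_000
draws = rng.beta(46, 56, n)

median_hat = np.quantile(draws, 0.5)     # true posterior median: 0.4507
extreme_hat = np.quantile(draws, 0.999)  # true 99.9th percentile: 0.6028
print(round(median_hat, 3), round(extreme_hat, 3))
```

With n this large both estimates are accurate; rerunning with n = 50 or 100 reproduces the instability of the extreme-quantile estimate seen in the tables.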
Example 19.5 (Computing a Nasty Integral). Monte Carlo methods can be extremely useful in computing the value of a complicated integral that cannot be evaluated in closed form. Monte Carlo is especially useful for this purpose in high dimensions, because methods of numerical integration are hard to implement and quite unreliable when the number of dimensions is even moderately high (as few as four). The basic idea is very simple. Suppose we wish to know the value of the definite integral I = ∫_{R^d} f(x) dx for some (possibly complicated) function f. We are of course assuming we know that the integral exists. This must be verified mathematically before we start on the Monte Carlo journey. Monte Carlo cannot verify the existence of the integral.
The idea now is to use a suitable density function g(x) on R^d and write I as

I = ∫_{R^d} f(x) dx = ∫_{R^d} (f(x)/g(x)) g(x) dx = E_g[f(X)/g(X)].
Now the Monte Carlo method is usable. We simulate iid random vectors X_1, ..., X_n from g, and approximate I by Î = (1/n) Σ_{i=1}^n f(X_i)/g(X_i). The choice of g is obviously not unique. It is usually chosen such that it is easy to simulate from g, and such that f(x)/g(x) can be reliably computed. In other words, the function f(x)/g(x) preferably should not have singularities, or cusps, or too many local maxima or minima. Some preliminary investigation on the choice of g is needed, and in fact, this choice issue is related to a topic known as importance sampling, which we discuss later.
As a concrete example, suppose we want to find the value of

I = ∫_{−∞}^∞ ∫_{−∞}^∞ ∫_{−∞}^∞ ∫_{−∞}^∞ [e^{−x^2−y^2−z^2−w^2} / √(1 + x^2 + y^2 + z^2 + w^2)] dx dy dz dw.

Take

g(x, y, z, w) = (1/π^2) e^{−x^2−y^2−z^2−w^2},

and

f(x, y, z, w) = e^{−x^2−y^2−z^2−w^2} / √(1 + x^2 + y^2 + z^2 + w^2).

Then,

I = ∫_{−∞}^∞ ∫_{−∞}^∞ ∫_{−∞}^∞ ∫_{−∞}^∞ f(x, y, z, w) dx dy dz dw
  = E_g[f(X)/g(X)] = π^2 E_g[1/√(1 + X^2 + Y^2 + Z^2 + W^2)],

and the corresponding Monte Carlo estimate is

Î = (π^2/n) Σ_{i=1}^n 1/√(1 + X_i^2 + Y_i^2 + Z_i^2 + W_i^2).
The table below reports Monte Carlo estimates of I for various values of n. The exact value of I is useful for comparison and can be obtained after transformation to polar coordinates (see Chapter 4) to be

I = π^2 [1 + e√π (Φ(√2) − 1)] = 6.1297.
n        Monte Carlo Estimate of I = 6.1297
50       6.3503
100      6.2117
250      6.2264
500      6.1854
1000     6.1946
5000     6.0806
10,000   6.1044
By the time the Monte Carlo sample size is 10,000, we get fairly accurate estimates
for the value of I .
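A sketch of this calculation, assuming NumPy: sampling from g amounts to drawing four independent N(0, 1/2) coordinates, since g is proportional to e^{−t²} in each coordinate. The sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000

# Sample from g: four independent coordinates with density e^{-t^2}/sqrt(pi),
# i.e. N(0, 1/2); the joint normalizer of g is pi^2.
X = rng.normal(0.0, np.sqrt(0.5), (n, 4))
r2 = (X**2).sum(axis=1)

# I = pi^2 * E_g[ 1 / sqrt(1 + ||X||^2) ].
I_hat = np.pi**2 * (1.0 / np.sqrt(1.0 + r2)).mean()
print(round(I_hat, 3))             # exact value: 6.1297
```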
lim_{N→∞} P(X, Y are coprime) = 6/π^2.

Here, coprime means that X, Y do not have any common factors > 1. So, in principle, we may choose a large N, choose n pairs (X_i, Y_i) independently at random from the discrete set {1, 2, ..., N}, find a Monte Carlo estimate p̂ for p = P(X, Y are coprime), and invert it to form an estimate of π as

π̂ = √(6/p̂).
The table below reports the results of such a Monte Carlo experiment.
N        n      Monte Carlo estimate of π = 3.14159
500      100    3.0252
1000     100    3.0817
1000     250    3.1308
5000     250    3.2225
5000     500    3.2233
10,000   1000   3.1629
10,000   5000   3.1395
This is an interesting example of the application of Monte Carlo where two indices
N; n have to take large values simultaneously. Only when N D 10;000 and n D
5;000, do we come close to matching the second digit after the decimal.
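A sketch of this experiment using only the Python standard library (the two indices N and n are exactly those discussed above):

```python
import math
import random

random.seed(2)

def pi_estimate(N, n):
    """Draw n pairs uniformly from {1,...,N}, estimate p = P(coprime),
    and invert the identity p -> 6/pi^2 to estimate pi."""
    coprime = sum(
        math.gcd(random.randint(1, N), random.randint(1, N)) == 1
        for _ in range(n)
    )
    return math.sqrt(6.0 * n / coprime)

est = pi_estimate(1_000_000, 200_000)
```

With both N and n pushed beyond the values in the table, the estimate typically matches π to two or three decimal places.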
and reject the null hypothesis if p_m is small (say, smaller than 0.01). The program assumes that p_m = P_{H₀}(T_m > t_m) can be computed (and that it does not depend on any unknown parameters not specified by H₀). However, except in some special cases, we do not know the exact null distribution of our test statistic T_m, and so computing the P-value will require some suitable approximation method.
Monte Carlo P -value method simply simulates a lot of datasets .X1 ; : : : ; Xm / under
the null, computes Tm .X1 ; : : : ; Xm / for each generated dataset, and then computes
the percentile rank of our value tm among these synthetic sets of values of Tm . For
example, if we generated n D 99 additional datasets, and we find our value tm to
be the 98th order statistic among these 100 values of Tm , we declare the Monte
Carlo P -value to be 0:02. If Tm .X1 ; : : : ; Xm / has a continuous distribution under
H0 , then the rank of tm among the nC1 values of Tm , say Rm;n , is simply a discrete
uniform random variable on the set f1; 2; : : : ; n C 1g, and therefore, we can estimate
the true value of pm by using nC1R nC1
m;n
. This lets us avoid a potentially impossible
distributional calculation for the exact calculation of pm .
The idea of a Monte Carlo P -value is attractive. But it should be noted that if we
repeat the simulation, we get a different P -value by this method for the same orig-
inal dataset. Second, one can work out at best a case-by-case Monte Carlo sample
size necessary for accurate approximation of the true value of pm . Such calcula-
tions are very likely to need the same difficult calculations that we are trying to
avoid. Also, if the test statistic Tm is hard to compute, then evaluation of a Monte
Carlo P -value may require a prohibitive amount of computing. There will often be
theoretically grounded alternative methods for approximating the true value of pm ,
such as central limit theorems or Edgeworth expansions. So, use of Monte Carlo
P -values need not be the only method available to us, or the best method avail-
able to us in a given problem. Monte Carlo P -values were originally suggested in
Barnard (1963) and Besag and Clifford (1989). They have become quite popular in
certain applied sciences, notably biology. For deeper theoretical studies of Monte
Carlo P -values, one can see Hall and Titterington (1989).
Example 19.7 (Testing for the Center of Cauchy). This is an example where the
calculation of the exact P -value using any reasonable test statistic is not easy.
Suppose X₁, ..., X_m are iid C(θ, 1) and suppose we wish to test H₀: θ = 0 against θ > 0. As a test statistic, the sample mean X̄ is an extremely poor choice in this case. The sample median is reasonable, and the unique maximum likelihood estimate of θ is asymptotically the best, but it is not easy to compute the maximum likelihood estimate in this case. Although the sample median and the maximum likelihood estimate are both asymptotically normal, neither of them has a tractable CDF for given m > 2. Therefore, an exact calculation of the P-value p_m = P_{θ=0}(T_m > t_m) is essentially impossible using either of these two statistics.
With m = 25 and various values of the Monte Carlo sample size n, the following table reports the Monte Carlo P-values with T_m as the sample median. An approximation to the P-value obtained from the asymptotic normality result √m(T_m − θ) ⇒ N(0, π²/4) is p_m ≈ 0.386. The original dataset of size m = 25 was also generated under the null, that is, from C(0, 1).
The Monte Carlo P-values stabilize when the Monte Carlo sample size n is about 100. They closely match the normal approximation p_m ≈ 0.386. However, it is impossible to say which one is more accurate, the Monte Carlo P-value or the normal approximation.
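The recipe of this example can be sketched in a few lines (NumPy assumed; the illustrative dataset below is generated under θ = 2 rather than under the null, so a small P-value is expected):

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_pvalue(data, n_sim=199):
    """Monte Carlo P-value for H0: theta = 0 vs theta > 0 for iid
    C(theta, 1) data, using the sample median as the test statistic T_m."""
    m = len(data)
    t_obs = np.median(data)
    null_stats = np.median(rng.standard_cauchy((n_sim, m)), axis=1)
    # fraction of the n_sim + 1 values of T_m that are >= t_obs
    return (1 + np.sum(null_stats >= t_obs)) / (n_sim + 1)

data = rng.standard_cauchy(25) + 2.0   # theta = 2, far from the null
pval = mc_pvalue(data)
```

The returned quantity is the percentile-rank estimate of p_m described earlier, up to the handling of ties.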
19.1.3 Rao–Blackwellization
Let X, Y be any two random variables such that Var(X) and Var(X | Y) exist. Then the iterated variance formula says that Var(X) = E[Var(X | Y)] + Var(E(X | Y)) (see Chapter 2). As a consequence, Var(E(X | Y)) ≤ Var(X). Therefore, if we define h(Y) = E(X | Y), then E(h(Y)) = E[E(X | Y)] = E(X), and Var(h(Y)) ≤ Var(X). This says that as an estimate of μ = E(X), the conditional expectation h(Y) = E(X | Y) is at least as good an unbiased estimate as X itself. The technique is similar to the well-known technique in statistics of conditioning with respect to a sufficient statistic, which is due to David Blackwell and C. R. Rao. So, h(Y) is called a Rao–Blackwellized Monte Carlo estimate for μ = E(X). For the method to
h(y) = E(e^{tXY} | Y = y) = E(e^{tyX} | Y = y) = E(e^{tyX}) = e^{t²y²/2}.

The Rao–Blackwellized Monte Carlo estimate is μ̂ = (1/n) Σ_{i=1}^n h(Y_i) = (1/n) Σ_{i=1}^n e^{t²Y_i²/2}. Note that this requires simulation of only Y₁, ..., Yₙ, and not the pairs (X_i, Y_i), i = 1, ..., n.
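The start of this example lies on the preceding page, but the identity h(y) = e^{t²y²/2} presumes X ~ N(0, 1) independent of Y. Assuming additionally that Y ~ N(0, 1) and taking t = 0.3 (both illustrative choices, not necessarily the book's), the variance reduction can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
t, n = 0.3, 100_000

x = rng.standard_normal(n)
y = rng.standard_normal(n)
raw = np.exp(t * x * y)              # terms of the plain Monte Carlo estimate
rb = np.exp(t ** 2 * y ** 2 / 2.0)   # Rao-Blackwellized terms h(Y_i)

# both average to theta = E exp(tXY) = (1 - t^2)^(-1/2) under this setup,
# but the Rao-Blackwellized terms have smaller variance
theta = (1.0 - t ** 2) ** -0.5
```

Comparing `raw.var()` with `rb.var()` makes the guaranteed inequality Var(h(Y)) ≤ Var(X) concrete.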
As was commented before, the entire Monte Carlo method is based on the assump-
tion that we can in fact simulate the Monte Carlo sample observations X1 ; : : : ; Xn
from whatever distribution is the relevant one for the given problem. Widely
available commercial software exists for simulating from nearly every common dis-
tribution in one dimension, and many common distributions in higher dimensions.
With the modern high-speed computers that most researchers now generally use,
the efficiency issue of these commercial algorithms has become less important than
before. Still, the fact is that often one has to customize one's own algorithm to a given problem, either because the problem is unusual and commercial software is not available, or because available software is unacceptably slow. We do not intend
to delve into customized simulation algorithms for special problems in this text. We
give a basic description of a few widely used simulation methods, and some eas-
ily applied methods for 25 special distributions, for the purpose of quick reference.
Textbook and more detailed treatments of simulation from standard distributions are
available in Fishman (1995), Robert and Casella (2004), and Ross (2006), among
others. Schmeiser (1994, 2001) provides lucidly written summary accounts.
Quantile Transformation. We are actually already familiar with this method (see Chapter 1 and Chapter 6). Suppose F is a continuous CDF on the real line with the quantile function F^{−1}. Suppose X ~ F. Then U = F(X) ~ U[0, 1]. Therefore, if U ~ U[0, 1], then X = F^{−1}(U) ~ F, and so we can simulate from F by simulating U and setting X = F^{−1}(U).
19.2 Textbook Simulation Techniques 625
Example 19.9. Suppose we want to simulate a value for X ~ Exp(1). The quantile function of the standard exponential distribution is F^{−1}(p) = −log(1 − p), 0 < p < 1. Therefore, to simulate X, we can generate U ~ U[0, 1], and use X = −log(1 − U) (−log U will work too).
As another example, suppose we want to simulate X ~ Be(1/2, 1/2), the Beta distribution with parameters 1/2 each. The density of the Be(1/2, 1/2) distribution is 1/(π√(x(1 − x))), 0 < x < 1. By direct integration, we get that the CDF is F(x) = (2/π) arcsin(√x). Therefore, the quantile function is F^{−1}(p) = sin²(πp/2), and so, to simulate X ~ Be(1/2, 1/2), we generate U ~ U[0, 1], and use X = sin²(πU/2).
The quantile function F^{−1} does not have a closed-form formula if F is a normal, or a Beta, or a Gamma CDF, and so on. In such cases, the use of the quantile transform method by numerically evaluating F^{−1}(U) will cause a slight error. The error may be practically negligible. But, for these distributions, simulated values are usually obtained in practice by using special techniques, rather than the quantile transform technique. For example, simulations for a standard normal variable are obtainable by using the Box–Muller method (see the chapter exercises in Chapter 4).
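Both quantile transformations of Example 19.9 take only a few lines in Python (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(9)
u = rng.uniform(size=200_000)

# Exp(1): F^{-1}(p) = -log(1 - p)
exp_sample = -np.log(1.0 - u)

# Be(1/2, 1/2): F^{-1}(p) = sin^2(pi * p / 2)
arcsine_sample = np.sin(np.pi * u / 2.0) ** 2
```

The sample means of the two arrays should be close to the theoretical means 1 and 1/2, respectively.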
f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1} (1 − x)^{β−1}, 0 < x < 1.

Because

sup_x f(x) = f((α − 1)/(α + β − 2)) = [Γ(α + β)/(Γ(α)Γ(β))] (α − 1)^{α−1} (β − 1)^{β−1}/(α + β − 2)^{α+β−2},

we may take c to be this supremum, and with g the U[0, 1] density the acceptance condition U ≤ f(X)/(c g(X)) reduces to

U ≤ X^{α−1} (1 − X)^{β−1} (α + β − 2)^{α+β−2} / [(α − 1)^{α−1} (β − 1)^{β−1}].
It is to be noted that this accept–reject scheme with g as the U[0, 1] density would not be very efficient if α, β are large. If α, β are large, the Beta density tapers off very rapidly from its mode, and the uniform envelope density would be a poor choice for g. For α, β not too far from 1, the scheme of this example would be reasonably efficient.
An important practical issue about an accept–reject scheme is the acceptance
percentage. One must strive to make this as large as possible in order to increase the
efficiency of the method. This is achieved by choosing c to be the smallest possible
number that one can, as the following result shows.
Proposition. For an accept–reject scheme, the probability that an X ~ g is accepted is 1/c, and it is maximized when c is chosen to be c = sup_x f(x)/g(x).

We have essentially already proved this, because the acceptance probability is

P(U ≤ f(X)/(c g(X))) = ∫_{−∞}^{∞} ∫_0^{f(t)/(c g(t))} g(t) du dt
= ∫_{−∞}^{∞} [f(t)/(c g(t))] g(t) dt = (1/c) ∫_{−∞}^{∞} f(t) dt = 1/c.

Because any c that can be chosen must be at least as large as sup_x f(x)/g(x), obviously 1/c is maximized by choosing c = sup_x f(x)/g(x).
Example 19.12 (Efficiency of Accept–Reject Scheme). In Example 19.8, we used an accept–reject scheme with g as the standard double exponential density to simulate from the standard normal distribution. Because

sup_x f(x)/g(x) = √(2e/π),

the acceptance rate, by our previous proposition, would be √(π/(2e)) = .7602. If we generate 100 X-values from g, we can expect that about 76 of them would be retained, and the others discarded.
Suppose now we use g(x) = 1/(π(1 + x²)), the standard Cauchy density. This density
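The double exponential scheme of Example 19.12 can be sketched as follows (NumPy assumed; the bound f(y)/(c g(y)) = e^{−(|y|−1)²/2} follows by direct algebra from c = √(2e/π)):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_normal_ar(n):
    """Accept-reject for N(0,1) with a standard double exponential envelope."""
    # a double exponential draw: random sign times an Exp(1) draw
    y = rng.exponential(size=n) * rng.choice([-1.0, 1.0], size=n)
    u = rng.uniform(size=n)
    accept = u <= np.exp(-0.5 * (np.abs(y) - 1.0) ** 2)   # u <= f/(c g)
    return y[accept]

draws = sample_normal_ar(100_000)
rate = len(draws) / 100_000    # should be near 1/c = 0.7602
```

The observed acceptance rate matches the theoretical 1/c of the proposition, and the retained draws behave like standard normals.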
There are two different ways to think about importance sampling. The more traditional one is to go back to the primary problem that Monte Carlo wants to solve, namely to approximate the value of an expectation θ = ∫ ψ₀(x) dF₀(x) for some function ψ₀ and some CDF F₀. However, (ψ₀, F₀) is not the only pair (ψ, F) for which ∫ ψ(x) dF(x) equals the specific number θ. Indeed, given any other CDF F₁,

θ = ∫ ψ₀(x) dF₀(x) = ∫ ψ₀(x) (dF₀/dF₁)(x) dF₁(x) = ∫ ℓ(x) ψ₀(x) dF₁(x),

where ℓ(x) = (dF₀/dF₁)(x). If F₀, F₁ have densities f₀, f₁, then ℓ(x) = f₀(x)/f₁(x); if F₀, F₁ have respective pmfs f₀, f₁, then also ℓ(x) = f₀(x)/f₁(x) (if one is continuous and the other discrete, (dF₀/dF₁)(x) need not be defined; the general treatment needs the use of Radon–Nikodym derivatives).
This raises the interesting possibility that we can sample from a general F₁, and subsequently use the usual Monte Carlo estimate

θ̂ = (1/n) Σ_{i=1}^n ℓ(X_i) ψ₀(X_i),
dilemma in greater detail). We also assume that F₀, F₁ both have densities, say f₀, f₁. The presentation given below carries over with only notational change if F₀, F₁ are both discrete with the same support. Suppose then f_i(x) = h_i(x)/c_i, i = 0, 1, where the assumption is that h₀, h₁ are completely known and also computable, but c₀, c₁ are unknown and are not even computable. Then, as we showed above, for any function ψ for which the expectation E_{F₀}[ψ(X)] exists,

θ = E_{F₀}[ψ(X)] = ∫ ψ(x) ℓ(x) f₁(x) dx
= (c₁/c₀) ∫ ψ(x) [h₀(x)/h₁(x)] f₁(x) dx
= (c₁/c₀) E_{F₁}[ψ(X) h₀(X)/h₁(X)].
This is a useful reduction, but we still have to deal with the fact that the ratio c₁/c₀ is not known to us. Fortunately, if we use the special function ψ(x) ≡ 1, the same representation above gives us

1 = (c₁/c₀) E_{F₁}[h₀(X)/h₁(X)]  ⇒  c₁/c₀ = 1/E_{F₁}[h₀(X)/h₁(X)],

and because h₀, h₁ are explicitly known to us, we have a way to get rid of the quotient c₁/c₀ and write the final importance sampling identity

E_{F₀}[ψ(X)] = E_{F₁}[ψ(X) h₀(X)/h₁(X)] / E_{F₁}[h₀(X)/h₁(X)].
We can now use an available Monte Carlo sample X₁, ..., Xₙ from F₁ to find Monte Carlo estimates for θ = E_{F₀}[ψ(X)].
The basic plug-in estimate for θ is the so-called ratio estimate

θ̂ = [Σ_{i=1}^n ψ(X_i) h₀(X_i)/h₁(X_i)] / [Σ_{i=1}^n h₀(X_i)/h₁(X_i)].
If the Monte Carlo sample size is small, this estimate will probably have quite a bit
of bias, and some bias correction would be desirable. In any case, we have at hand
at least a first-order approximation for EF0 Œ.X / based on a Monte Carlo sample
from a general candidate importance sampling distribution F1 . The issue of which
F1 to actually choose has not been addressed yet. All we have done so far is some
algebraic manipulation and then a proposal for using the ratio estimate of survey
theory. The choice issue is addressed below. But first let us see an example.
Note that we did not need to calculate the normalizing constant in the posterior density. We take m = 100, x = 45 for specificity. The following table gives values of this importance sampling estimate and also the value of θ computed by using a numerical integration routine, so that we can assess the accuracy of the importance sampling estimate.
n     Importance sampling estimate of θ = .4532
20    .4444
50    .4476
100   .4558
250   .4537
500   .4529
Even with an importance sampling size of n D 20, the estimation error is less
than 2%. This has partly to do with the choice of the importance sampling distri-
bution. In this example, the importance sampling distribution was chosen to match
the shape of the posterior distribution well. This generally enhances the accuracy of
importance sampling.
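The details of this Bayesian example sit partly outside this excerpt, but the ratio estimate itself is easy to demonstrate on a toy case where the answer is known. In the illustration below (not the book's example), the target F₀ is the standard normal known only through the unnormalized h₀(x) = e^{−x²/2}, the importance sampling distribution F₁ is standard Cauchy, and ψ(x) = x², so that θ = E_{F₀}[X²] = 1:

```python
import numpy as np

rng = np.random.default_rng(4)

h0 = lambda x: np.exp(-0.5 * x ** 2)            # unnormalized target density
h1 = lambda x: 1.0 / (np.pi * (1.0 + x ** 2))   # Cauchy density (fully known)
psi = lambda x: x ** 2

x = rng.standard_cauchy(200_000)
w = h0(x) / h1(x)                               # importance weights h0/h1
theta_hat = np.sum(psi(x) * w) / np.sum(w)      # ratio estimate; true value 1
```

The normalizing constant √(2π) of the target never appears, which is exactly the point of the ratio estimate.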
Let us now study two basic theoretical properties of the importance sampling estimate θ̂ in general. The first issue is whether θ̂ asymptotically estimates θ correctly, and the second issue is what we can say about the amount of the error in general. Fortunately, we can handle both questions by using our asymptotic toolbox from Chapter 7.

Theorem 19.2. Suppose

Var_{F₁}(ψ(X) h₀(X)/h₁(X))  and  Var_{F₁}(h₀(X)/h₁(X))

are finite. Then, as n → ∞,

(a) θ̂ → θ = μ₁/μ₂ almost surely, where μ₁ = E_{F₁}[ψ(X) h₀(X)/h₁(X)] and μ₂ = E_{F₁}[h₀(X)/h₁(X)].
(b) √n(θ̂ − θ) ⇒ N(0, τ²), where τ² is identified in the proof below.

For part (a), write U_i = ψ(X_i) h₀(X_i)/h₁(X_i) and V_i = h₀(X_i)/h₁(X_i), so that θ̂ = Ū/V̄. By the strong law of large numbers, Ū converges almost surely to μ₁ and V̄ converges almost surely to μ₂, and hence θ̂ converges almost surely to θ = μ₁/μ₂.

For part (b), denote σ₁² = Var_{F₁}(U₁), σ₂² = Var_{F₁}(V₁), σ₁₂ = Cov_{F₁}(U₁, V₁), and note that by the multivariate central limit theorem (see Chapter 7),

√n(Ū − μ₁, V̄ − μ₂) ⇒ N₂(0, Σ),

where

Σ = [[σ₁², σ₁₂], [σ₁₂, σ₂²]].

Now define the transformation g(u, v) = u/v, so that θ̂ = g(Ū, V̄). By the delta theorem (see Chapter 7), one has

√n(g(Ū, V̄) − g(μ₁, μ₂)) ⇒ N(0, [∇g(μ₁, μ₂)]′ Σ [∇g(μ₁, μ₂)]).
We now address the question of the optimal choice of the importance sampling distribution. There is no unique way to define what an optimal choice means. We formulate one definition of optimality and provide an optimal importance sampling distribution. The optimal choice would not be practically usable, as we show. However, the solution still gives useful insight.
Theorem 19.3. Consider the importance sampling estimator θ̂ = (1/n) Σ_{i=1}^n ψ(X_i) ℓ(X_i) for θ = ∫ ψ(x) f₀(x) dx, where ℓ(x) = f₀(x)/f₁(x), and X₁, ..., Xₙ are iid observations from F₁. Assume that ψ(x) ≥ 0, and θ > 0. Then, Var_{F₁}(θ̂) is minimized when f₁(x) = ψ(x) f₀(x)/θ.
Proof. Because X₁, ..., Xₙ are iid, so are ψ(X₁)ℓ(X₁), ..., ψ(Xₙ)ℓ(Xₙ), and hence,

Var_{F₁}(θ̂) = (1/n) Var_{F₁}(ψ(X₁)ℓ(X₁)).

Clearly, this is minimized when, with probability one under F₁, ψ(X₁)ℓ(X₁) is a constant, say k. The constant k must be equal to the mean of ψ(X₁)ℓ(X₁), that is,

k = ∫ ψ(x) ℓ(x) f₁(x) dx = ∫ ψ(x) [f₀(x)/f₁(x)] f₁(x) dx = ∫ ψ(x) f₀(x) dx = θ.
This is not usable in practice, because it involves θ, which is precisely the unknown number we want to approximate. However, the theoretically optimal solution suggests that the importance sampling density should follow key properties of the unnormalized function ψ(x)f₀(x). For example, f₁ should have the same shape and tail behavior as ψ(x)f₀(x). Do and Hall (1989) show the advantages of using importance sampling and choosing the correct shape in distribution estimation problems that arise in the bootstrap. This reduces the variance of θ̂, and increases its accuracy.
Standard software for simulating from common univariate and multivariate distribu-
tions is now widely available. Mathematica permits simulation from essentially all
common distributions. However, for the sake of quick simulations when efficiency
is not of primary concern, a few simple rules for simulating from 25 common distri-
butions are listed below. Their justification comes from various well-known results
in standard distribution theory, many of which have been previously derived in this
text itself.
Standard Exponential. To generate X ~ Exp(1), generate U ~ U[0, 1] and use X = −log U.

Gamma with Parameters n and λ. To generate X ~ G(n, λ), generate n independent values X₁, ..., Xₙ from a standard exponential, and use X = λ(X₁ + ⋯ + Xₙ).

Beta with Integer Parameters m, n. To generate X ~ Be(m, n), generate U ~ G(m, 1), V ~ G(n, 1) independently, and use X = U/(U + V).

Dirichlet with General Parameters. To generate (p₁, ..., pₙ) ~ D(α₁, ..., α_{n+1}), generate X₁, ..., X_{n+1} independently with X_j ~ G(α_j, 1), and use

p_i = X_i / Σ_{j=1}^{n+1} X_j, i = 1, 2, ..., n, and p_{n+1} = 1 − Σ_{j=1}^n p_j.
Uniform on the Boundary of the Unit Ball. To generate X from the uniform distribution on the boundary of the d-dimensional unit ball, generate Z₁, ..., Z_d ~ N(0, 1) independently, and use X_i = Z_i/√(Z₁² + ⋯ + Z_d²), i = 1, ..., d.

Uniform Inside the Unit Ball. To generate X from the uniform distribution in the d-dimensional unit ball, use either of the following methods.

(i) Generate Z according to a uniform distribution on the boundary of the d-dimensional unit ball, and independently generate U ~ U[0, 1], and use X = U^{1/d} Z.
(ii) Generate U₁, ..., U_d ~ U[−1, 1] independently, and use X = (U₁, ..., U_d)′ if Σ_{i=1}^d U_i² ≤ 1. Otherwise, discard it and repeat.
Bernoulli with a General Parameter. To generate X ~ Ber(p), generate U ~ U[0, 1] and set X = I_{U > 1−p}.

Binomial with General Parameters. To generate X ~ Bin(n, p), generate X₁, ..., Xₙ ~ Ber(p) independently, and use X = Σ_{i=1}^n X_i.

Geometric with a General Parameter. To generate X ~ Geo(p), generate Y ~ Exp(1) and use X = 1 + ⌊Y/(−log(1 − p))⌋, where ⌊ ⌋ denotes the integer part.
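Two of the recipes above, sketched in Python (NumPy assumed; d = 3 and p = 0.3 are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000

# Uniform inside the unit ball in d = 3 dimensions, method (i): X = U^(1/d) Z
d = 3
z = rng.standard_normal((n, d))
z /= np.linalg.norm(z, axis=1, keepdims=True)      # uniform on the boundary
x = rng.uniform(size=(n, 1)) ** (1.0 / d) * z
mean_norm = np.linalg.norm(x, axis=1).mean()       # theory: d/(d+1) = 0.75

# Geometric(p) from a standard exponential
p = 0.3
geo = 1 + np.floor(rng.exponential(size=n) / (-np.log(1.0 - p)))
```

A quick sanity check: for the uniform ball sample, E‖X‖ = d/(d + 1); for the geometric sample, the mean should be near 1/p.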
The standard simulation techniques are difficult to apply, or even do not apply, when the target distribution is an unconventional one, or even worse, is known only up to a normalizing constant: that is, f(x) = h(x)/c for some explicit function h, but only an implicit normalizing constant c, because c cannot be computed exactly, or even to a high degree of accuracy. This problem often arises in simulating from posterior densities of a parameter θ (perhaps a vector),

π(θ | x) = f(x | θ) π(θ)/m(x),

where f(x | θ) is the likelihood function, π(θ) is the prior density, and m(x) is the marginal density of the observable X induced by (f, π) (see Chapter 3). Thus, m(x) = ∫_Θ f(x | θ) π(θ) dθ, and it serves as the normalizing constant to the function h(θ) = f(x | θ)π(θ). But, if the parameter θ is high-dimensional, and the prior density π(θ) is not a very conveniently chosen one, then m(x) usually cannot be calculated in closed-form, or even to a high degree of numerical approximation.
All the simulation methods discussed in the previous section are useless in such a
Fig. 19.3 Two hundred random points inside the unit circle
Now fix n. Then, from the defining equation for reversibility and Bayes' theorem,

P(Xₙ = j | X_{n+1} = i) = P(X_{n+1} = i | Xₙ = j) P(Xₙ = j)/P(X_{n+1} = i)
= p_{ji} π(j)/π(i) = p_{ij} π(i)/π(i) = p_{ij}
= P(X_{n+1} = j | Xₙ = i).
In other words, in the statement, "The probability that the next state is j given that the current state is i is p_{ij}," we have the liberty to take either n as the current time and n + 1 as the next, or n + 1 as the current time and n as the next. That is, if we run our clock backwards, then it would seem to us that the chain is evolving the same way as it did when the clock ran forward; hence the name reversible chain.
But we get even more. Assume that the chain is regular (see Chapter 10) and that the state space S is finite. By summing the defining equation for reversibility over j, we get, for every i ∈ S,

Σ_{j∈S} p_{ji} π(j) = Σ_{j∈S} p_{ij} π(i) = π(i) Σ_{j∈S} p_{ij} = π(i).
Proof. The proof uses the fact that for a positive recurrent Markov chain on a discrete state space, the proportion of time that the chain spends at a particular state i converges almost surely to the stationary probability of that state i. That is, given n ≥ 1 and i ∈ S, let V_i(n) = Σ_{k=1}^n I_{X_k = i}. Then V_i(n)/n → π(i) almost surely. We can now see intuitively why the strong ergodic theorem is true. With φ denoting the function being averaged in the theorem, we have:

(1/n) Σ_{k=1}^n φ(X_k) = (1/n) Σ_{i∈S} Σ_{k: X_k = i} φ(X_k)
= (1/n) Σ_{i∈S} φ(i) Σ_{k: X_k = i} 1 = (1/n) Σ_{i∈S} φ(i) V_i(n)
= Σ_{i∈S} φ(i) [V_i(n)/n] ≈ Σ_{i∈S} φ(i) π(i) = E_π[φ(X)].
Formally, fix an ε > 0, and find a finite subset of states E ⊆ S such that π(S − E) < ε. Let T denote the number of elements in the set E, and suppose n is large enough that for each i ∈ E, |V_i(n)/n − π(i)| < ε. Such an n can be chosen because E is a finite set, and because we have the almost sure convergence property that V_i(n)/n converges to π(i) for any fixed i, as was mentioned above. Also, note for the sake of the proof below that Σ_{i∈S} V_i(n) = n ⇒ Σ_{i∈S} [V_i(n)/n − π(i)] = 0.
We prove the theorem for functions φ that are bounded, and in that case we may assume that |φ| ≤ 1.
Hence,

|(1/n) Σ_{k=1}^n φ(X_k) − Σ_{i∈S} φ(i) π(i)|
= |Σ_{i∈S} φ(i) V_i(n)/n − Σ_{i∈S} φ(i) π(i)| = |Σ_{i∈S} φ(i) [V_i(n)/n − π(i)]|
≤ Σ_{i∈S} |φ(i)| |V_i(n)/n − π(i)|
= Σ_{i∈E} |φ(i)| |V_i(n)/n − π(i)| + Σ_{i∈S−E} |φ(i)| |V_i(n)/n − π(i)|
≤ Σ_{i∈E} |V_i(n)/n − π(i)| + Σ_{i∈S−E} [V_i(n)/n + π(i)]
= Σ_{i∈E} |V_i(n)/n − π(i)| + Σ_{i∈S−E} [V_i(n)/n − π(i)] + 2 Σ_{i∈S−E} π(i)
≤ Σ_{i∈E} |V_i(n)/n − π(i)| + Σ_{i∈E} |V_i(n)/n − π(i)| + 2 π(S − E)
≤ Tε + Tε + 2ε = 2(T + 1)ε,

where the penultimate inequality uses Σ_{i∈S} [V_i(n)/n − π(i)] = 0, so that Σ_{i∈S−E} [V_i(n)/n − π(i)] = −Σ_{i∈E} [V_i(n)/n − π(i)].
The general principle of any Metropolis algorithm is the following. Suppose the chain is at some state i ∈ S at the current instant. Then, as a first step, one of the states, say j, from the states in S is picked according to some probability distribution for possibly moving to that state j. The state j is commonly called a candidate state. The distribution used to pick this candidate state is called a proposal distribution. Then, as a second step, if j happens to be different from i, then we either move to the candidate state j with some designated probability, or we decline to move, and therefore remain at the current state i. Thus, the entries of the overall transition matrix P have the multiplicative structure

p_{ij} = q_{ij} α_{ij}, i, j ∈ S, j ≠ i,
p_{ii} = 1 − Σ_{j≠i} p_{ij},

where q_{ij} are the proposal probabilities and α_{ij} the acceptance probabilities.
The matrix ((q_{ij})) is chosen to be irreducible, in order that the ultimate transition matrix P = ((p_{ij})) is irreducible. Note that the Metropolis algorithm has a formal similarity to the accept–reject scheme. A candidate state j is picked according to the proposal probabilities q_{ij}, and once picked, is accepted according to the acceptance probabilities α_{ij}. Special choices for q_{ij}, α_{ij} lead to the following well-known algorithms.
Independent Sampling. Choose q_{ij} = π(j) and α_{ij} = 1.

Metropolis–Hastings Algorithm. Choose

q_{ij} = c = constant, and α_{ij} = min{1, π(j)/π(i)}.

Barker's Algorithm. Choose

q_{ij} = c = constant, and α_{ij} = π(j)/(π(i) + π(j)).
It is implicitly assumed that the state space S is finite for the Metropolis–Hastings and the Barker algorithms to apply. Note that the Metropolis–Hastings, independent Metropolis, and the Barker algorithms only require the specification of the target distribution up to normalizing constants. That is, if we only knew that π(k) = h(k)/c, where h(k) is explicit, but the normalizing constant c is not, we can still execute all of these algorithms. It is also worth noting that in the Metropolis–Hastings algorithm, if a state j gets picked as the candidate state, and j happens to be more likely than the current state i, then the chain surely moves to j. This is not the case for Barker's algorithm.
A fourth MCMC algorithm that is especially popular in statistics, and particularly in Bayesian statistics, is the Gibbs sampler, which is treated separately. Note that for each proposed algorithm, one needs to verify two things: that the chain is irreducible and aperiodic, and that the time-reversibility equation indeed holds.
Example 19.15 (Verifying the Assumptions). First consider the case of purely independent sampling. Then, p_{ij} = π(j) and p_{ji} = π(i). So irreducibility holds, because P itself has all entries strictly positive. Also, π(i)p_{ij} = π(i)π(j) = p_{ji}π(j), and so time-reversibility also holds.
Next, consider the Metropolis–Hastings algorithm. Fix i, j. Consider the case that π(j) ≥ π(i). Then, p_{ij} = c, and p_{ji} = c π(i)/π(j). Therefore, every entry in P is again strictly positive, and so the chain is irreducible. Furthermore, p_{ji}π(j) = cπ(i) = p_{ij}π(i), and therefore, time-reversibility holds. For the other case, namely π(i) ≥ π(j), the proof of reversibility is the same. The verification of the assumptions for the Barker algorithm and the independence Metropolis sampler is a chapter exercise.
π(x) = [P(Y = x)/P(Y ≤ n)] I_{1≤x≤n}.

Moreover,

π(j)/π(i) = (1 − p)^{j−i} ≤ 1 iff j ≥ i.
19.4 The Gibbs Sampler 645
p_{ij} = 1/n if j < i;
p_{ij} = (1/n)(1 − p)^{j−i} if j > i;
p_{ii} = 1 − Σ_{j≠i} p_{ij}.
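This truncated geometric chain is easy to run. The sketch below (NumPy assumed, with the illustrative choices n = 10, p = 0.3) compares the long-run state frequencies with the target π:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 10, 0.3

# target: Geo(p) truncated to {1,...,n}; pi(x) proportional to (1-p)^(x-1)
pi = (1.0 - p) ** np.arange(n)
pi /= pi.sum()

steps = 200_000
props = rng.integers(1, n + 1, size=steps)   # uniform proposals over {1,...,n}
us = rng.uniform(size=steps)
counts = np.zeros(n)
x = 1
for j, u in zip(props, us):
    if u <= (1.0 - p) ** (j - x):            # accept w.p. min(1, pi(j)/pi(x))
        x = j
    counts[x - 1] += 1

tv = 0.5 * np.abs(counts / steps - pi).sum() # total variation distance
```

Note that the normalizing constant P(Y ≤ n) of the target is never used by the chain, only the ratio π(j)/π(i).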
The general Metropolis algorithm uses more general forms for the probabilities q_{ij}, α_{ij}, and we provide this general form next.
The Metropolis algorithms of the previous section can be difficult to apply when the dimension of the state space is high, and the likelihood ratios π(y)/π(x), x, y ∈ S, depend on all the coordinates of x, y. The generation of the chain then becomes a genuinely multidimensional problem, at least unwieldy, and possibly undoable.
A collection of special Metropolis algorithms very cleverly reduces the multidimensional nature of the Markov chain generation problem to a sequence of one-dimensional problems. To explain it more precisely, suppose a state x in the state space S is a vector in some m-dimensional space with coordinates (x₁, x₂, ..., x_m). Suppose we are currently in state x. To make a transition to a new state y ∈ S, we change coordinates one at a time, such as (x₁, x₂, ..., x_m) → (y₁, x₂, ..., x_m) → (y₁, y₂, x₃, ..., x_m) → ⋯ → (y₁, y₂, ..., y_m), and each coordinate change is made by using the conditional distribution of that coordinate given the rest of the coordinates. For example, the transition (x₁, x₂, ..., x_m) → (y₁, x₂, ..., x_m) is made by simulating from the conditional distribution π(· | x₂, ..., x_m) of the first coordinate. These conditional distributions of one coordinate given all the rest are called full conditionals. Therefore, as long as we can calculate and also simulate from all
646 19 Simulation and Markov Chain Monte Carlo
π(x) p_{i,xy} = π(x₁, ..., x_{i−1}, x_i, x_{i+1}, ..., x_m) × [π(x₁, ..., x_{i−1}, y_i, x_{i+1}, ..., x_m) / Σ_{y_i} π(x₁, ..., x_{i−1}, y_i, x_{i+1}, ..., x_m)] I_{y_j = x_j ∀ j≠i}

= [π(y₁, ..., y_{i−1}, x_i, y_{i+1}, ..., y_m) π(y₁, ..., y_{i−1}, y_i, y_{i+1}, ..., y_m) / Σ_{y_i} π(y₁, ..., y_{i−1}, y_i, y_{i+1}, ..., y_m)] I_{y_j = x_j ∀ j≠i}

= π(y₁, ..., y_{i−1}, y_i, y_{i+1}, ..., y_m) × [π(y₁, ..., y_{i−1}, x_i, y_{i+1}, ..., y_m) / Σ_{y_i} π(y₁, ..., y_{i−1}, y_i, y_{i+1}, ..., y_m)] I_{y_j = x_j ∀ j≠i}

= π(y) p_{i,yx}.
P̄ = (1/m!) Σ_{S(m)} P_{i₁} P_{i₂} ⋯ P_{i_m},

where Σ_{S(m)} denotes the sum over all m! permutations (i₁, i₂, ..., i_m) of 1, 2, ..., m. Once we have both reversibility and irreducibility, we get the desired convergence result (Theorem 19.4). We state this formally.

Theorem 19.7. Suppose π(x) > 0 for all x ∈ S. Then the random scan and the order-randomized systematic scan Gibbs samplers both have π as the unique stationary distribution of the chain.
as its stationary distribution. We can reverse the order of the systematic scan, that is, start with a fixed x₀, get p₀ from Be(x₀ + α, m − x₀ + β), then x₁ from Bin(m, p₀), and so on, and the convergence result will still hold.
π(θ | x) = c f(x | θ) π(θ),

then π(θ | x) is indeed a probability density with a suitable finite normalizing constant c. In that case, one sometimes proceeds as if π(θ | x) is a bona fide posterior density and does inference with it. What follows is one such example of the application of the Gibbs sampler to a posterior density arising from an improper prior.
Suppose given θ = (μ, σ²), X₁, X₂, ..., Xₙ are iid N(μ, σ²), and that for θ = (μ, σ²) we use the nonnegative function π(θ) dθ = (1/σ²) dμ dσ² as an improper prior. It is important to note that σ² is being treated as the second coordinate of θ, not σ. One may informally interpret such a prior as treating μ and σ² to be independent, and then putting, respectively, the improper prior densities π₁(μ) ≡ 1 and π₂(σ²) = (1/σ²) I_{σ² > 0}.
Our final goal is to simulate observations from the posterior distribution of θ = (μ, σ²) given the data X = (X₁, ..., Xₙ). To apply the Gibbs sampler, we need the full conditionals. Because θ is two-dimensional, there are only two full conditionals to find, namely, the posterior distribution of μ given σ², and the posterior distribution of σ² given μ. Note that when we consider posteriors, the data values are considered as fixed. So X does not enter into this Gibbs picture as another variable.
It turns out that even though the prior is an improper prior, the formal posterior is a bona fide probability density on θ = (μ, σ²) for any n ≥ 2. Furthermore, these two full conditionals are very easy to find. In the following lines, c denotes a generic normalizing constant, and is not intended to be the same number at each occurrence. Consider the conditional of μ given σ². This is formed as
Therefore, by transforming to w = 1/σ², from the Jacobian formula, the conditional of w given μ is

π(w | μ, X) = c w^{n/2 − 1} e^{−(w/2) Σ_{i=1}^n (X_i − μ)²}.

Making one final transformation to v = w Σ_{i=1}^n (X_i − μ)², the conditional of v given μ is

π(v | μ, X) = c v^{n/2 − 1} e^{−v/2};

that is, given μ (and the data), v = Σ_{i=1}^n (X_i − μ)²/σ² is distributed as χ²ₙ.
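Putting the two full conditionals together gives a two-block Gibbs sampler. In the sketch below (NumPy assumed, with simulated data as an illustration), the conditional of μ given σ² is taken to be N(X̄, σ²/n), the standard result under the flat prior on μ; the σ² update inverts the χ²ₙ variable v derived above:

```python
import numpy as np

rng = np.random.default_rng(8)
data = rng.normal(3.0, 1.0, size=50)
n, xbar = len(data), data.mean()

mu, sig2 = xbar, data.var()
mus = []
for it in range(5000):
    mu = rng.normal(xbar, np.sqrt(sig2 / n))    # mu | sig2, X
    v = rng.chisquare(n)                        # v = sum (X_i - mu)^2 / sig2
    sig2 = ((data - mu) ** 2).sum() / v         # solve for sig2 | mu, X
    if it >= 500:                               # discard a burn-in period
        mus.append(mu)

post_mean_mu = np.mean(mus)   # posterior mean of mu is xbar in theory
```

The chain average of the μ-draws should settle near X̄, in line with the theory.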
Example 19.20 (Gibbs Sampling from Dirichlet Distributions). The success of the
Gibbs sampler hinges on one’s ability to write and easily simulate from the one-
dimensional full conditionals. One important case where this works out very nicely
is that of the n-dimensional Dirichlet distribution (see Chapter 4). The Dirichlet distribution arises naturally as a model for data on a simplex. It also arises prominently in Bayesian statistics in numerous situations, for example, as a prior for the parameters of a multinomial distribution, and also in dealing with infinite-dimensional Bayesian problems.
(p₁, p₂, ..., pₙ) =_L (X₁/(X₁ + X₂ + ⋯ + X_{n+1}), X₂/(X₁ + X₂ + ⋯ + X_{n+1}), ..., Xₙ/(X₁ + X₂ + ⋯ + X_{n+1})),

where X₁, ..., X_{n+1} are independent and X_j ~ G(α_j, 1).
So, to simulate from a Dirichlet distribution by using the systematic scan Gibbs sampler, we can proceed as follows.

Step 1. Fix initial values p₂₀, ..., pₙ₀ for p₂, ..., pₙ.
Step 2. Simulate a Beta variable Y₁ from Be(α₁, α_{n+1}) and set p₁₁ = (1 − Σ_{j=2}^n p_{j0})Y₁.
Step 3. Simulate a Beta variable Y₂ from Be(α₂, α_{n+1}) and set p₂₁ = (1 − p₁₁ − Σ_{j=3}^n p_{j0})Y₂, and so on, updating the coordinates one at a time. The resulting chain will not be reversible, but reversibility can be retrieved by using a somewhat more involved updating scheme; see the general principles presented at the beginning of this section.
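A sketch of this systematic scan for a D(α₁, α₂, α₃) target with α = (2, 3, 5) (illustrative values; NumPy assumed), checked against the known marginal mean E p₁ = α₁/(α₁ + α₂ + α₃) = 0.2:

```python
import numpy as np

rng = np.random.default_rng(11)
alpha = np.array([2.0, 3.0, 5.0])    # target D(2, 3); alpha_{n+1} = 5, n = 2
n = len(alpha) - 1
p = np.full(n, 1.0 / (n + 1))        # initial values on the simplex

trace = []
for it in range(20000):
    for i in range(n):               # systematic scan over coordinates
        rest = p.sum() - p[i]        # sum of the other held coordinates
        y = rng.beta(alpha[i], alpha[-1])
        p[i] = (1.0 - rest) * y      # full conditional update
    if it >= 1000:
        trace.append(p[0])

mean_p1 = np.mean(trace)             # should approach 2/(2+3+5) = 0.2
```

Each coordinate update uses only a one-dimensional Beta draw, which is the whole point of reducing the multidimensional problem to full conditionals.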
Example 19.21 (Gibbs Sampling in a Change Point Problem). Suppose for some given n ≥ 2, and 1 ≤ k ≤ n, we have X₁, ..., X_k ~ Poi(λ), and X_{k+1}, ..., Xₙ ~ Poi(θ), where all n observations are mutually independent, and k, λ, θ are unknown parameters. This is an example of a change point problem, where an underlying sequence of Poisson distributed random variables has a level change at some unknown time within the interval of the study. This unknown change point is the parameter k. The problem is unusual in one parameter being integer-valued, and the others continuous. Change point problems are rather difficult from a theoretical perspective.
Suppose the problem is attacked by assuming a prior distribution on $(k, \lambda, \mu)$, with

$$k \sim \text{Unif}\{1, 2, \ldots, n\}; \qquad \lambda \sim G(\alpha, \gamma_1), \quad \mu \sim G(\beta, \gamma_2).$$

The full conditionals are

$$\pi(k \mid \lambda, \mu) = \frac{e^{-(\lambda - \mu)k}\, (\lambda/\mu)^{\sum_{i=1}^{k} x_i}}{\sum_{k=1}^{n} e^{-(\lambda - \mu)k}\, (\lambda/\mu)^{\sum_{i=1}^{k} x_i}}\, I_{1 \leq k \leq n};$$

$$\lambda \mid (k, \mu) \sim G\left(\alpha + \sum_{i=1}^{k} x_i,\ \frac{1}{\frac{1}{\gamma_1} + k}\right);$$

$$\mu \mid (k, \lambda) \sim G\left(\beta + \sum_{i=k+1}^{n} x_i,\ \frac{1}{\frac{1}{\gamma_2} + n - k}\right).$$
Each of these full conditionals is a standard distribution. The full conditional corre-
sponding to k is just a multinomial distribution, and the other two full conditionals
are Gamma distributions. It is easy to simulate from each of them, and so the Gibbs
chain can be generated easily. Special interest lies in estimating the change point
k. Once one has the three-component Gibbs chain, the k-component of it can be
used to approximate the posterior mean of k; there are also methods to estimate the
posterior distribution itself from the output of the Gibbs chain.
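The three full conditionals can be cycled as in the following sketch. The data and hyperparameter values are made up for illustration; the discrete conditional for $k$ is computed on the log scale for numerical stability.

```python
import numpy as np

# Gibbs sampling in the change point model: x_1..x_k ~ Poi(lambda),
# x_{k+1}..x_n ~ Poi(mu).  Data and hyperparameters below are hypothetical.
rng = np.random.default_rng(42)
x = np.concatenate([rng.poisson(2.0, 20), rng.poisson(6.0, 20)])  # change at k = 20
n = len(x)
a, b, g1, g2 = 1.0, 1.0, 1.0, 1.0       # alpha, beta, gamma_1, gamma_2

S = np.concatenate([[0.0], np.cumsum(x)])   # S[k] = x_1 + ... + x_k
lam, mu, k_draws = 3.0, 3.0, []
ks = np.arange(1, n + 1)
for _ in range(6000):
    # discrete full conditional of k, normalized on the log scale
    logw = -ks * (lam - mu) + S[ks] * np.log(lam / mu)
    w = np.exp(logw - logw.max()); w /= w.sum()
    k = rng.choice(ks, p=w)
    # Gamma full conditionals (shape, scale parametrization)
    lam = rng.gamma(a + S[k], 1.0 / (1.0 / g1 + k))
    mu = rng.gamma(b + S[n] - S[k], 1.0 / (1.0 / g2 + n - k))
    k_draws.append(k)

print(np.mean(k_draws[1000:]))              # approximate posterior mean of k
```

With a clear level change in the simulated data, the $k$-component of the chain concentrates near the true change point.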
Theorem 19.4 does ensure that a stationary Markov chain with suitable additional
properties has the target distribution as its unique stationary distribution. In particular, whatever the initial distribution is, for any state $i \in S$, $P(X_n = i) \to \pi(i)$ as $n \to \infty$. However, in practice we can only run the chain for a finite amount
of time, and hence only for finitely many steps. So, a question of obvious practi-
cal importance is how large the value of n should be in a given application for the
true distribution of $X_n$ to closely approximate the target distribution $\pi$, according to some specified measure of distance between distributions. Several probability metrics introduced in Chapter 15 now become useful. A related question is what we can say about the distance between the true distribution of $X_n$ and the target distribution $\pi$ for a given $n$.
Without any doubt, a study of these questions forms the most challenging and
the most sophisticated part of the MCMC theme. A broad general rule is that the
second largest eigenvalue in modulus (SLEM) of our transition probability matrix
P is going to determine the rapidity of the convergence of the chain to stationarity.
The smaller the SLEM is, the faster will be the convergence. However, it is usually
extremely difficult to go further, and find the SLEM explicitly. We can instead try
to give concrete bounds on the SLEM, or directly the distance between the true
distribution of Xn and the target distribution, and each new problem usually requires
ingenious new methods. The area is still flourishing, and a fascinating interplay of
powerful tools from pure mathematics and probability is producing increasingly
sophisticated results.
There are some general theorems on the SLEM and hence speed of conver-
gence, and these are useful to know. They are mostly for reversible chains on
finite-state spaces, although there are some exceptions. Some of these are presented
with illustrative examples in this section. These results are taken primarily from
Dobrushin (1956), Diaconis and Stroock (1991), Diaconis and Saloff-Coste (1996),
and Brémaud (1999). References on convergence of the Gibbs sampler are given
later.
First, we need to choose the specific distances that we use to determine the rapidity of convergence of the chain. Use of the total variation distance is the most common; the separation and the chi-square distances are also used. We redefine these distances here for easy reference; they were introduced and treated in detail in Chapter 15.
Suppose $P^n = ((p_{ij}(n)))_{t \times t}$ is the $n$-step transition probability matrix for our stationary chain on the finite state space $S = \{1, 2, \ldots, t\}$. Suppose that enough conditions on the chain have been assumed so that we know that a unique stationary distribution $\pi$ exists. If the initial distribution of the chain is $\mu$, then the distribution of $X_n$ for fixed $n$ is

$$P(X_n = i) = \sum_{k \in S} P(X_0 = k)\, p_{ki}(n) = \sum_{k \in S} \mu_k\, p_{ki}(n), \quad i \in S.$$
The total variation distance between the true distribution of $X_n$ and the stationary distribution $\pi$ under a general initial distribution $\mu$ is

$$d(P^n_\mu, \pi) = \sup_A \left| P^n_\mu(A) - \pi(A) \right| = \sup_A \left| \sum_{i \in A} \sum_{k \in S} \mu_k\, p_{ki}(n) - \sum_{i \in A} \pi(i) \right|.$$
If we know that for a particular $n_0$, $d(P^{n_0}_\mu, \pi)$ is small, and if we also know that $d(P^n_\mu, \pi)$ is monotone decreasing in $n$, then we can be assured that MCMC output after step $n_0$ will produce sufficiently accurate estimates of probabilities of arbitrary sets under the target distribution $\pi$. This is the motivation for wanting a small value for $d(P^n_\mu, \pi)$.
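The monotone decrease of $d(P^n_\mu, \pi)$ is easy to observe numerically. The small chain below is our own illustration, not one from the text.

```python
import numpy as np

# Total variation distance between the distribution of X_n and pi, for a
# small reversible chain; the sequence decreases monotonically in n.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi = pi / pi.sum()                        # stationary distribution (1/4, 1/2, 1/4)

dist = np.array([1.0, 0.0, 0.0])          # chain started at state 1
tv = []
for n in range(20):
    dist = dist @ P                       # distribution of X_{n+1}
    tv.append(0.5 * np.abs(dist - pi).sum())   # half-L1 formula for TV
print(tv[:3])
```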
19.5 Convergence of MCMC and Bounds on Errors 653
Two other distances that are also used in this context are the separation and the
chi-square distances. The separation distance is defined as
$$D(P^n_\mu, \pi) = \sup_{i \in S} \left[ 1 - \frac{P(X_n = i)}{\pi(i)} \right].$$
The chain is said to be geometrically ergodic if for any $x \in S$, there exist a finite constant $M(x)$ and $0 < r < 1$ such that

$$\sup_A |P^n(x, A) - \pi(A)| \leq M(x)\, r^n \quad \text{for all } n \geq 1.$$

The chain is said to be uniformly ergodic if there exist a fixed finite constant $M$ and $0 < r < 1$ such that for any $x \in S$,

$$\sup_A |P^n(x, A) - \pi(A)| \leq M\, r^n \quad \text{for all } n \geq 1.$$
If the chain has a finite state space and is geometrically ergodic, then it is obvi-
ously uniformly ergodic. Otherwise, and especially so for MCMC chains, uniform
ergodicity usually does not hold, although geometric ergodicity does.
In the case of an irreducible and aperiodic chain with a finite state space, conver-
gence to stationarity occurs geometrically fast. That is,
$$P^n = \begin{pmatrix} \pi \\ \pi \\ \vdots \\ \pi \end{pmatrix} + O\!\left(n^{\alpha_2 - 1}\, c^n\right),$$
where $c$ is the second largest value among the moduli of the eigenvalues of $P$ (called the SLEM), and $\alpha_2$ is the algebraic multiplicity of that particular eigenvalue.
This follows from general spectral decomposition of stochastic matrices, and is also
known as the Perron–Frobenius theorem; see Chapter 6 in Brémaud (1999).
The important point is that irreducibility and aperiodicity guarantee that all eigenvalues except one are strictly less than one in modulus (although they can be complex), and so, in particular, $c < 1$, which causes each row of $P^n$ to converge to the stationary vector $\pi$ at a geometric (i.e., exponentially fast) rate, and therefore,
the chain is geometrically ergodic. If, in addition, the chain is reversible, then the eigenvalues are real, say

$$1 = \lambda_1 > \lambda_2 \geq \cdots \geq \lambda_t > -1.$$
Example 19.23. Consider the three-state stationary Markov chain with the transi-
tion probability matrix
$$P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & .5 & .5 \\ .5 & 0 & .5 \end{pmatrix}.$$
Note that the chain is irreducible and aperiodic (see Example 10.13). But it is not
reversible. Thus, there can be complex eigenvalues. Indeed, the eigenvalues are

$$\lambda_1 = 1; \qquad \lambda_2 = \frac{i}{2}; \qquad \lambda_3 = -\frac{i}{2}.$$

The moduli of the eigenvalues are therefore $1, \frac{1}{2}$, and $\frac{1}{2}$, and so, the SLEM equals $c = \frac{1}{2}$.
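These eigenvalues are easy to confirm numerically:

```python
import numpy as np

# Eigenvalues of the nonreversible chain of Example 19.23: 1, i/2, -i/2,
# so the moduli are 1, 1/2, 1/2 and the SLEM is c = 1/2.
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
moduli = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
print(moduli)
```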
The practical difficulty is that it is usually difficult or impossible to derive expressions for $\lambda_2$ and $\lambda_t$ in closed-form, especially when the number of possible states $t$ is large. Hence, much of the research effort in this field has been directed towards finding effective and explicit upper bounds on $c$. The following theorem gives
some of the most significant bounds currently available on the distance between
$P^n_\mu$ and $\pi$ for some of the distances that we defined above. The bounds in the
theorem below all depend on our ability to evaluate or bound the SLEM c, and they
are all for the reversible finite-state case.
Theorem 19.8. Let $\{X_n\},\ n \geq 0$, be an irreducible, stationary, reversible Markov chain on the finite state space $S = \{1, 2, \ldots, t\}$. Let $\pi$ be the stationary distribution of the chain, and $c$ the SLEM of its transition probability matrix $P$. Then,
(a) For all $n \geq 1$ and any $i \in S$,

$$\sup_A |P^n(i, A) - \pi(A)| \leq \sqrt{\frac{1 - \pi(i)}{\pi(i)}}\ \frac{c^n}{2}.$$
(b) For all $n \geq 1$ and any initial distribution $\mu$,

$$\sup_A |P(X_n \in A) - \pi(A)| \leq \frac{c^n}{2} \sqrt{\chi^2(\mu, \pi)}.$$
For the two-state chain with transition probability matrix $P = \begin{pmatrix} \alpha & 1-\alpha \\ 1-\beta & \beta \end{pmatrix}$, starting at state 1, we have

$$\sup_A |P(X_n \in A \mid X_0 = 1) - \pi(A)| = \frac{1}{2} \sum_{j=1}^{2} |p_{1j}(n) - \pi(j)|$$
(this is a standard formula for the total variation distance; see Section 15.1)
$$= \frac{1}{2}\left(|1 - \alpha| + |\alpha - 1|\right) \frac{|\alpha + \beta - 1|^n}{2 - \alpha - \beta} = \frac{1 - \alpha}{2 - \alpha - \beta}\, |\alpha + \beta - 1|^n.$$
On the other hand, the first bound in part (a) of Theorem 19.8 equals

$$\sqrt{\frac{1 - \pi(1)}{\pi(1)}}\ \frac{c^n}{2} = \sqrt{\frac{\pi(2)}{\pi(1)}}\ \frac{c^n}{2} = \sqrt{\frac{1 - \alpha}{1 - \beta}}\ \frac{|\alpha + \beta - 1|^n}{2}.$$
That is, we get the very interesting conclusion that the relative error of the bound in part (a) of Theorem 19.8 is the quotient of the arithmetic and the geometric mean of $1 - \alpha$ and $1 - \beta$, minus one. If $\alpha = \beta$, then the error is zero, and the bound becomes exact. If $\alpha$ and $\beta$ are very different, the bound becomes inefficient.
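The exact distance and the spectral bound can be compared directly; the particular values of $\alpha$ and $\beta$ below are arbitrary.

```python
import numpy as np

# Exact distance to stationarity for the two-state chain versus the spectral
# bound of Theorem 19.8(a), for one (arbitrary) choice of alpha and beta.
al, be = 0.9, 0.2
P = np.array([[al, 1 - al], [1 - be, be]])
pi = np.array([1 - be, 1 - al]) / (2 - al - be)
c = abs(al + be - 1)                          # the SLEM

n = 10
dist = np.array([1.0, 0.0]) @ np.linalg.matrix_power(P, n)   # start at state 1
exact = 0.5 * np.abs(dist - pi).sum()
formula = (1 - al) / (2 - al - be) * c ** n   # the closed form derived above
bound = np.sqrt((1 - pi[0]) / pi[0]) * c ** n / 2
print(exact, formula, bound)
```

The bound always sits above the exact value by the arithmetic-to-geometric mean ratio of $1-\alpha$ and $1-\beta$.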
Next consider the Metropolis chain on $S = \{1, 2, \ldots, t\}$ with a uniform proposal distribution, which has the transition probabilities

$$p_{ij} = \frac{1}{t} \min\left\{1,\ \frac{\pi(j)}{\pi(i)}\right\}, \quad j \neq i; \qquad p_{ii} = 1 - \sum_{j \neq i} p_{ij}.$$
We label the states in decreasing order of the values of their stationary probabilities. That is, we label the states such that $\pi(1) \geq \pi(2) \geq \cdots \geq \pi(t)$. The first eigenvalue, as always, is $\lambda_1 = 1$, and the remaining eigenvalues are

$$\lambda_k = \frac{1}{t} \sum_{j=k-1}^{t} \frac{\pi(k-1) - \pi(j)}{\pi(k-1)}, \quad k \geq 2.$$

For instance, when $t = 3$, writing $a = \pi(1)$ and $b = \pi(2)$,

$$\lambda_2 = \frac{1}{3} \sum_{j=1}^{3} \frac{\pi(1) - \pi(j)}{\pi(1)} = \frac{1}{3}\left[3 - \sum_{j=1}^{3} \frac{\pi(j)}{\pi(1)}\right] = 1 - \frac{1}{3\pi(1)} = 1 - \frac{1}{3a},$$

because $\sum_{j=1}^{3} \pi(j) = 1$. Likewise,

$$\lambda_3 = \frac{1}{3} \sum_{j=2}^{3} \frac{\pi(2) - \pi(j)}{\pi(2)} = \frac{1}{3}\left[2 - \sum_{j=2}^{3} \frac{\pi(j)}{\pi(2)}\right] = \frac{2}{3} - \frac{1 - \pi(1)}{3\pi(2)} = \frac{2}{3} - \frac{1 - a}{3b}.$$
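The eigenvalue formula above can be verified numerically for a small example of our own choosing:

```python
import numpy as np

# Numerical check of the eigenvalue formula for the Metropolis chain with a
# uniform proposal; pi below is an arbitrary decreasing stationary vector.
pi = np.array([0.4, 0.3, 0.2, 0.1])
t = len(pi)
P = np.zeros((t, t))
for i in range(t):
    for j in range(t):
        if i != j:
            P[i, j] = min(1.0, pi[j] / pi[i]) / t
    P[i, i] = 1.0 - P[i].sum()

eig = np.sort(np.linalg.eigvals(P).real)[::-1]
# lambda_k = (1/t) sum_{j=k-1}^t (pi(k-1) - pi(j)) / pi(k-1), for k >= 2
lam = [1.0] + [np.sum(pi[k - 2] - pi[k - 2:]) / (t * pi[k - 2])
               for k in range(2, t + 1)]
print(eig, sorted(lam, reverse=True))
```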
The bounds in the preceding section are only for the reversible case. There are
some examples in which convergence occurs, although the chain is not reversible.
There are two principal ways to deal with the nonreversible situation. A classic inequality in Dobrushin (1956) provides an upper bound on the SLEM of the transition probability matrix $P$ by using a very simple computable index, known as Dobrushin's coefficient, even if the chain is not reversible. A second method, introduced in Diaconis and Stroock (1991) and Fill (1991), uses a technique of reversal of a nonreversible transition probability matrix, and derives the needed bounds in terms of the SLEM of a reversible matrix $M$ constructed from the original transition probability matrix $P$.
For a transition probability matrix $P$, write its rows as

$$\vec{p}_i = (p_{i1}, \ldots, p_{it}), \quad i \in S,$$

and define

$$\tilde{p}_{ij} = \frac{p_{ji}\, \pi(j)}{\pi(i)}; \qquad \tilde{P} = ((\tilde{p}_{ij})); \qquad M = P\tilde{P}.$$

Dobrushin's coefficient of $P$ is

$$\delta(P) = \max_{i, k \in S} d(\vec{p}_i, \vec{p}_k),$$

where, recall, we are using the notation $d$ to denote total variation distance. Note that $0 \leq \delta(P) \leq 1$, and usually, $0 < \delta(P) < 1$.
Here is our main theorem on handling rates of convergence to stationarity when
the chain is not reversible. These are not the only bounds available, but the specific
bounds in the theorem below have some simplicity about them.
Example 19.26 (Two-State Chain). Consider our two-state case with the transition probability matrix

$$P = \begin{pmatrix} \alpha & 1 - \alpha \\ 1 - \beta & \beta \end{pmatrix}.$$

Then

$$\delta(P) = \frac{1}{2}\left(|\alpha - 1 + \beta| + |1 - \alpha - \beta|\right) = |\alpha + \beta - 1|.$$
We have previously seen that the eigenvalues of $P$ are $1$ and $\alpha + \beta - 1$ (see Example 19.21), so that the SLEM equals $c = |\alpha + \beta - 1|$. Hence, in this case, $c = \delta(P)$.
Example 19.27 (Metropolis–Hastings for Truncated Geometric). It was shown in Example 19.15 that the Metropolis–Hastings algorithm for simulating from the distribution of $Y$ given $Y \leq m$, where $Y \sim \text{Geo}(p)$, has the transition matrix with entries

$$p_{ij} = \frac{1}{m} \ \text{ if } j < i; \qquad p_{ij} = \frac{1}{m}(1 - p)^{j - i} \ \text{ if } j > i; \qquad p_{ii} = 1 - \sum_{j \neq i} p_{ij}.$$
where $q = 1 - p$. The eigenvalues are all in the unit interval $[0, 1]$, and therefore the second largest eigenvalue is also the SLEM. The second largest eigenvalue is the last one, namely,

$$\lambda_m = \frac{m - 1 - \sum_{j=1}^{m-1} q^j}{m} = 1 - \frac{1 - q^m}{m(1 - q)} = 1 - \frac{1 - q^m}{mp}.$$
There is no such general closed-form formula for the Dobrushin coefficient in this
case, but for any given m and p, it is easily computable.
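The closed-form SLEM can be checked against a numerical eigendecomposition; the values of $m$ and $p$ below are arbitrary.

```python
import numpy as np

# Checking the SLEM formula for the truncated geometric Metropolis-Hastings
# chain of Example 19.27, for one (arbitrary) choice of m and p.
m, p = 10, 0.3
q = 1 - p
P = np.zeros((m, m))
for i in range(1, m + 1):
    for j in range(1, m + 1):
        if j < i:
            P[i - 1, j - 1] = 1.0 / m
        elif j > i:
            P[i - 1, j - 1] = q ** (j - i) / m
for i in range(m):
    P[i, i] = 1.0 - P[i].sum()            # diagonal fills the remaining mass

eig = np.sort(np.linalg.eigvals(P).real)[::-1]
formula = 1 - (1 - q ** m) / (m * p)
print(eig[1], formula)                    # second largest eigenvalue = SLEM
```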
Example 19.28 (Dobrushin Bound May Not Be Useful). Consider again the transition matrix of Example 19.23,

$$P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & .5 & .5 \\ .5 & 0 & .5 \end{pmatrix}.$$
In this case, $d(\vec{p}_1, \vec{p}_2) = .5$, $d(\vec{p}_2, \vec{p}_3) = .5$, and $d(\vec{p}_1, \vec{p}_3) = 1$. Therefore, $\delta(P) = 1$, and the results of Theorem 19.9 involving the Dobrushin coefficient are not useful in this example. A closer examination reveals that the Dobrushin coefficient is rendered equal to one because $\vec{p}_1$ and $\vec{p}_3$ are orthogonal. In any such case, the bounds involving $\delta(P)$ would not be informative.
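The coefficient is a one-liner to compute:

```python
import numpy as np

# Dobrushin's coefficient: the maximum total variation distance between
# pairs of rows of P, computed for the matrix of this example.
def dobrushin(P):
    t = P.shape[0]
    return max(0.5 * np.abs(P[i] - P[k]).sum()
               for i in range(t) for k in range(i + 1, t))

P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
print(dobrushin(P))      # rows 1 and 3 are orthogonal, so delta(P) = 1
```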
The bounds of this section on the distance between the exact distribution of the chain at some fixed time $n$ and the stationary distribution $\pi$ have the appealing feature that they apply even to nonreversible chains. As we have seen, certain Gibbs
chains are not reversible. However, in principle, the drift and minorization methods
of this section can apply to them, and geometric convergence to stationarity may be
established by using the methods below.
We consider the case of discrete state spaces below. For more general state spaces (e.g., if $S = \mathbb{R}_+$), the methods of this section need an added assurance of a form of recurrence, known as Harris recurrence. For the special case of a Gibbs chain
on a discrete state space, the Harris recurrence condition will automatically hold
under the conditions that we assume. However, it should be emphasized that Harris
recurrence must be verified before using drift methods if the underlying chain has a
more general state space. We first need the appropriate definitions.
Definition 19.5 (Drift Condition). There exist a function $V: S \to [0, \infty)$ (an energy function), constants $0 < \epsilon < 1$ and $b < \infty$, and a set $R \subseteq S$ such that

$$E[V(X_{n+1}) \mid X_n = x] \leq (1 - \epsilon)\, V(x) + b$$

for all $x \notin R$.

Definition (Minorization Condition). There exist a set $C \subseteq S$, a $\delta > 0$, and a probability distribution $q$ on $S$ such that

$$p_{xy} \geq \delta\, q(y) \quad \forall\ y \in S,\ x \in C.$$
Remark. The set $C$ in the definition of the minorization condition is generally called a small set. The drift condition says that if the chain is sitting at some $x \notin R$, then it will tend to drift towards some state with a smaller energy than $x$. However, the energy function is bounded from below, and so it cannot get lower at a steady rate ad infinitum. So, eventually, the chain will seek shelter in $R$, and stay there.
Convergence will occur geometrically fast; that is, the total variation distance will satisfy a bound of the form $\sup_A |P^n(x, A) - \pi(A)| \leq K \rho^n$ for some $\rho \in (0, 1)$ and some $K$. If we have the minorization condition, we can furthermore place concrete bounds on $\rho$, typically of the form $\rho \leq 1 - \delta$.
It is an unfortunate aspect of the approach via drift and minorization conditions that it is usually very difficult to manufacture the energy function $V$ and the probability distribution $q$, even in simple problems. In addition, if an energy function $V$ satisfying the drift condition exists, it is not unique, and different choices produce different constants $K, \rho$. However, the approach has met with some success, and it is a
main theoretical approach in the convergence area. In particular, the drift approach
has been successfully used for some Gibbs chains for simulating from the posterior
distribution of a vector of means in so-called hierarchical Bayes linear models.
At first glance, the drift and the minorization conditions appear to be rather obscure, and it is not clear how they lead to bounds on the total variation distance between the exact distribution of the chain and the stationary distribution. For the case that the minorization condition $p_{xy} \geq \delta q(y)$ holds for all $x$ and $y$ in the state space $S$, we show the proof of the following result: for any initial state $i$, any set $A$, and all $n \geq 1$, $|P(X_n \in A \mid X_0 = i) - \pi(A)| \leq (1 - \delta)^n$.
Proof. The proof uses a famous technique in probability theory, known as coupling. We manufacture (only for the purpose of the proof) two new chains, $\{Z_n\},\ n \geq 0$, and $\{Y_n\},\ n \geq 0$, such that the two chains will eventually couple. That is, from some random time $T = t$ onwards, we will have $Z_n = Y_n$. It turns out that the chains are so constructed that this coupling time $T \sim \text{Geo}(\delta)$, and it also turns out that $|P(X_n \in A \mid X_0 = i) - \pi(A)| \leq P(T > n)$ for any $A, i$, and $n$, where $X_n$ is our original chain (with $\pi$ as the stationary distribution). Because $P(T > n) = (1 - \delta)^n$, the claimed bound follows.
Here is how the chains $Z_n, Y_n,\ n \geq 0$, are constructed. Define the probability density $r_x(y)$ on the state space $S$ by the defining equation

$$p_{xy} = \delta\, q(y) + (1 - \delta)\, r_x(y);$$

we can define such a density because the minorization condition has been assumed to hold for all $x, y \in S$. Set $Z_0 = i \in S$, $Y_0 \sim \pi$, and choose a Bernoulli variable $B_0 \sim \text{Ber}(\delta)$. If $B_0 = 0$, choose $Y_1 \sim r_{y_0}(y)$ and choose $Z_1 \sim r_{z_0}(y)$, independently of each other. If $B_0 = 1$, choose a single $z \sim q(y)$ and set $Z_1 = Y_1 = z$. This is one sweep. Repeat this sweep, mutually independently, until the $Z$-chain and the $Y$-chain couple, and call the coupling time $T$. Once the chains couple, at all subsequent steps, the two chains remain identical. Because the sweeps are independent and at each sweep the same Bernoulli experiment with success probability $\delta$ is performed, we have $T \sim \text{Geo}(\delta)$.
To complete the proof, we now show that $|P(X_n \in A \mid X_0 = i) - \pi(A)| \leq P(T > n)$ for any $A, i$, and $n$. For this, note the important fact that because $Y_0$ was chosen from the stationary distribution $\pi$, and because the transition probabilities satisfy the mixture representation $p_{xy} = \delta q(y) + (1 - \delta) r_x(y)$, the marginal distribution of $Y_n$ coincides with $\pi$ for every $n$. Furthermore, because the $Z$-chain started at the state $i$, its marginal distribution coincides with the distribution of $X_n$ given that $X_0 = i$. Hence,
$$\begin{aligned}
|P(X_n \in A \mid X_0 = i) - \pi(A)| &= |P(X_n \in A \mid X_0 = i) - P(Y_n \in A)| \\
&= |P(Z_n \in A) - P(Y_n \in A)| \\
&= |P(Z_n \in A, Z_n = Y_n) + P(Z_n \in A, Z_n \neq Y_n) \\
&\quad\ - P(Y_n \in A, Y_n = Z_n) - P(Y_n \in A, Y_n \neq Z_n)| \\
&= |P(Z_n \in A, Z_n = Y_n) + P(Z_n \in A, Z_n \neq Y_n) \\
&\quad\ - P(Z_n \in A, Z_n = Y_n) - P(Y_n \in A, Y_n \neq Z_n)| \\
&= |P(Z_n \in A, Z_n \neq Y_n) - P(Y_n \in A, Z_n \neq Y_n)| \\
&\leq \max\{P(Z_n \in A, Z_n \neq Y_n),\ P(Y_n \in A, Z_n \neq Y_n)\} \\
&\leq P(Z_n \neq Y_n) \leq P(T > n) = (1 - \delta)^n.
\end{aligned}$$
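The coupling construction can be played out in miniature. The chain $P$ below is our own small example; it satisfies the minorization condition with $q$ uniform, and each sweep couples the two chains with probability $\delta$, so the coupling time is indeed $\text{Geo}(\delta)$.

```python
import numpy as np

# The coupling construction in miniature: for this P, the minorization
# p_xy >= delta * q(y) holds for all x, y with q uniform and delta = 0.6,
# so the coupling time T is Geo(0.6) and E(T) = 1/0.6.
rng = np.random.default_rng(0)

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
q = np.full(3, 1 / 3)
delta = 0.6                                  # p_xy >= 0.6 * (1/3) = 0.2 everywhere
R = (P - delta * q) / (1 - delta)            # residual densities r_x(y)

times = []
for _ in range(20000):
    T = 1
    while rng.random() >= delta:             # each sweep couples with prob delta
        T += 1
    times.append(T)
print(np.mean(times))                        # close to 1/delta
```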
Markov chain Monte Carlo methods are not limited to target distributions that
have a discrete set S as their support. They are routinely used for simulating from
continuous distributions with general supports. The basic definitions, results, and
phenomena in these general cases closely match the corresponding results in the
discrete case. For example, we still have Metropolis schemes and the Gibbs sampler
for these general cases, and convergence theory and convergence conditions are
very similar to those for the discrete case. We describe the continuous versions of
the MCMC algorithms and the basic associated theory with examples in this section.
First, we set up the necessary notation and provide the necessary definitions.
The Markov transition kernel $P(x, A)$ is the direct generalization of the probability $P(X_{n+1} \in A \mid X_n = i) = \sum_{j \in A} p_{ij}$ in the discrete state space case. In the discrete case, the diagonal elements $p_{ii}$ can be strictly positive. Likewise, in the general state space case, $P(x, \{x\})$ can be strictly positive. Therefore, in general, a Markov transition kernel $P(x, A)$
19.6 MCMC on General Spaces 663
need not have a density. If it does, the density corresponding to the measure $P(x, \cdot)$ is called a transition density, and we denote it by $p(x, y)$. So, a transition density, if it exists, satisfies the two usual properties:
$$p(x, y) \geq 0; \qquad \int_S p(x, y)\, dy = 1.$$
Caution. In the next definition, a probability measure and its density function have been denoted by using the same notation $\pi$. This is an abuse of notation, but is very common in the literature in the present context.
A probability distribution $\pi$ on $S$ is called stationary for the chain if

$$\pi(A) = \int_S P(x, A)\, \pi(x)\, dx \quad \text{for all } A.$$

This is a direct extension of the corresponding equation $\pi(i) = \sum_{j \in S} p_{ji}\, \pi(j)$ in the discrete case. The distribution of the chain at time $n$, having started from state $x$ at time zero, is still denoted as $P^n(x, A) = P(X_n \in A \mid X_0 = x)$.
The chain is called periodic with period $d > 1$ if there exist disjoint subsets $A_1, A_2, \ldots, A_d$ of the state space $S$ such that the one-step transitions of the chain must be of the form $A_1 \to A_2 \to A_3 \to \cdots \to A_d \to A_1$. The chain is called aperiodic if there is no $d > 1$ such that the chain is periodic of period $d$.
Metropolis chains on general state spaces will usually (but not always) have three helpful properties: reversibility with respect to the target distribution $\pi$ that we have in mind, irreducibility with respect to that same $\pi$, and aperiodicity. This enables us to conclude that the chain indeed has the targeted $\pi$ as its unique stationary distribution, and that convergence takes place as well. These issues are picked up later in this section.
The Metropolis chain in this general case is defined in essentially the same manner as in the discrete case. If the chain is currently at state $x$, a candidate state $y$ is generated according to a proposal distribution, and once a candidate state has been picked, it is accepted or declined according to an acceptance distribution. We need notation for these two things. We assume below that the proposal distribution is continuous; that is, it has a density, and this proposal density is denoted as $q(x, y)$. So, the properties of a proposal density are $q(x, y) \geq 0$ and $\int_S q(x, y)\, dy = 1$. The acceptance probabilities are denoted by $\rho(x, y)$; for each fixed $x$ and each fixed $y$, $0 \leq \rho(x, y) \leq 1$. The overall transition kernel of our Metropolis chain is then

$$P(x, A) = \int_A q(x, y)\, \rho(x, y)\, dy + I_{x \in A}\left[1 - \int_S q(x, y)\, \rho(x, y)\, dy\right].$$
Throughout we make the assumption that the stationary density $\pi(x)$ and the proposal densities $q(x, y)$ are strictly positive for all $x, y$. This is not essential, but it saves us from a lot of tedious fixing if we do not allow them to take the value zero.
Now we proceed to define the special MCMC schemes exactly as we did in the
discrete case.
Independent Sampling. If we could use the target itself as the proposal density, $q(x, y) = \pi(y)$, every candidate would be accepted, and we would simply be sampling independently from $\pi$.

Metropolis–Hastings Scheme. For a general proposal density $q(x, y)$, choose the acceptance probabilities

$$\rho(x, y) = \min\left\{\frac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)},\ 1\right\}.$$

Metropolis Scheme. For a symmetric proposal density, $q(x, y) = q(y, x)$, the acceptance probabilities reduce to

$$\rho(x, y) = \min\left\{\frac{\pi(y)}{\pi(x)},\ 1\right\}.$$
Independent Metropolis Scheme. Choose $q(x, y) = p(y)$, that is, a fixed strictly positive density $p(y)$ independent of $x$, and

$$\rho(x, y) = \min\left\{1,\ \frac{\pi(y)/p(y)}{\pi(x)/p(x)}\right\}.$$
Consider now simulating from a $t$-density with $\alpha$ degrees of freedom, using the proposal distribution $C(x, 1)$, the Cauchy distribution centered at the current state $x$. The Cauchy proposal density is symmetric in $(x, y)$, and so

$$\frac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)} = \frac{\left(1 + \frac{x^2}{\alpha}\right)^{\frac{\alpha+1}{2}} \left(1 + (y - x)^2\right)}{\left(1 + \frac{y^2}{\alpha}\right)^{\frac{\alpha+1}{2}} \left(1 + (x - y)^2\right)} = \left(\frac{\alpha + x^2}{\alpha + y^2}\right)^{\frac{\alpha+1}{2}}.$$
So, finally, the Metropolis scheme for simulating from the $t$-distribution proceeds as follows.

Step 1. If the current value of the chain is $x$, generate a candidate state $y \sim C(x, 1)$.

Step 2. If $|x| \geq |y|$, accept the candidate value $y$; if $|x| < |y|$, perform a Bernoulli experiment with success probability $\left(\frac{\alpha + x^2}{\alpha + y^2}\right)^{\frac{\alpha+1}{2}}$, and accept the candidate value $y$ if the Bernoulli experiment results in a success.
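The two steps translate directly into code; the degrees of freedom and run length below are our own choices.

```python
import numpy as np

# The two-step scheme above, as code: target t-density with alpha degrees
# of freedom, random walk Cauchy C(x, 1) proposal.
rng = np.random.default_rng(1)
alpha = 5.0

x, chain = 0.0, []
for _ in range(50000):
    y = x + rng.standard_cauchy()            # Step 1: candidate from C(x, 1)
    if abs(x) >= abs(y):
        x = y                                # Step 2: certain acceptance
    elif rng.random() < ((alpha + x * x) / (alpha + y * y)) ** ((alpha + 1) / 2):
        x = y                                # Bernoulli acceptance
    chain.append(x)

chain = np.array(chain)
print(chain[10000:].mean(), chain[10000:].var())  # t_5: mean 0, variance 5/3
```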
Thus, in this example, the state space is $S = \mathbb{R}_+^3$. We use a suitable independent Metropolis scheme to simulate from $\pi$. Towards this, choose the proposal density

$$p(y_1, y_2, y_3) = \frac{c}{(1 + y_1^2 + y_2^2 + y_3^2)^2}, \quad y_1, y_2, y_3 > 0.$$

Note that the proposal density is a trivariate spherically symmetric Cauchy density, restricted to the first octant. We can simulate quite easily from this proposal density. For instance, we can first simulate a value for $r = \sqrt{y_1^2 + y_2^2 + y_3^2}$ according to the density

$$g(r) = \frac{4}{\pi}\, \frac{r^2}{(1 + r^2)^2}, \quad 0 < r < \infty,$$

and then generate $(y_1, y_2, y_3)$ as $(y_1, y_2, y_3) = r(u_1, u_2, u_3)$, where $(u_1, u_2, u_3)$ is a point on the boundary of the unit sphere (see Section 19.2.4 on how to do these simulations). When the point $(u_1, u_2, u_3)$ does not fall in the first octant, we have to fix the signs.
We only have to work out the acceptance probabilities of our Metropolis scheme now. We have

$$\frac{\pi(y)/p(y)}{\pi(x)/p(x)} = \frac{(1 + x_1 + x_2 + x_3)^4\, (1 + y_1^2 + y_2^2 + y_3^2)^2}{(1 + y_1 + y_2 + y_3)^4\, (1 + x_1^2 + x_2^2 + x_3^2)^2} = \alpha(x, y),\ \text{say}.$$

Then, the acceptance probability is $\rho(x, y) = \min\{1, \alpha(x, y)\}$. For example, if the current state is $(x_1, x_2, x_3) = (1, 1, 1)$, and the candidate state is $(y_1, y_2, y_3) = (2, 2, 2)$, then $\alpha(x, y) = 1.126$, and therefore $\rho(x, y) = \min\{1, 1.126\} = 1$; that is, the candidate state is definitely accepted.
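A sketch of this sampler follows. We generate the folded trivariate Cauchy proposal via the standard normal-over-chi-square representation rather than the radial decomposition of the text; the target is read off the acceptance ratio above, and all names are ours.

```python
import numpy as np

# Sketch of the independent Metropolis sampler of this example; the target
# pi(x) is proportional to (1 + x1 + x2 + x3)^(-4) on the positive octant
# (as read off the acceptance ratio), and the proposal is the trivariate
# spherically symmetric Cauchy folded into the octant.
rng = np.random.default_rng(3)

def propose():
    z = rng.standard_normal(3)
    return np.abs(z / np.sqrt(rng.chisquare(1)))   # folded 3-d Cauchy

def weight(v):                                     # pi(v) / p(v), up to constants
    return (1 + v.sum()) ** -4 * (1 + (v * v).sum()) ** 2

x, acc = np.ones(3), 0
for _ in range(20000):
    y = propose()
    if rng.random() < min(1.0, weight(y) / weight(x)):
        x, acc = y, acc + 1
print(acc / 20000)                                 # fraction of accepted moves
```

As a spot check, `weight((2,2,2)) / weight((1,1,1))` reproduces the value $1.126$ computed in the text.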
19.6.2 Convergence
The chain is called Harris recurrent if for every $x \in S$ and every set $B$ with $\pi(B) > 0$,

$$P(\exists\ n \text{ such that } X_n \in B \mid X_0 = x) = 1.$$
This is the same as saying that regardless of where the chain starts, any set $B$ with $\pi(B) > 0$ will be visited infinitely often by the chain with probability one. On comparing this with the definition of irreducibility, we find that Harris recurrence is stronger than irreducibility.
(d) In addition, under the sufficient conditions stated above, the random walk Metropolis chain and the independent Metropolis chain are strongly ergodic; that is, for any $\psi: S \to \mathbb{R}$ such that $\int_S |\psi(x)|\, \pi(x)\, dx < \infty$,

$$\frac{1}{n} \sum_{k=1}^{n} \psi(X_k) \stackrel{a.s.}{\to} \int_S \psi(x)\, \pi(x)\, dx,$$

for any initial distribution $\mu$.
tive in any neighborhood of zero, and so it satisfies neither condition (A) nor
condition (B) of part (a) in Theorem 19.11, and the theorem fails to guarantee
supA jP n .x; A/ .A/j ! 0.
The above theorem suggests that if all we care about is simple convergence, then Metropolis chains will usually deliver it. However, for practical use, what we really need are concrete bounds on $\sup_A |P^n(x, A) - \pi(A)|$ for given $n$. For the
discrete state space case, we obtained such bounds by spectral methods, drift and
minorization, and by using Dobrushin’s coefficient. In the general state space case,
we provide a result based on drift and minorization, and another based on the tails
of the target and the proposal density. We need another definition before the next
theorem.
Definition 19.13. A density $\pi(x)$ on the real line is called logconcave in the tails with the associated exponential constant $\alpha > 0$ if for some $x_0$,

$$\frac{\pi(y)}{\pi(x)} \leq e^{-\alpha(y - x)} \ \text{ for } y > x > x_0, \qquad \frac{\pi(y)}{\pi(x)} \leq e^{-\alpha(x - y)} \ \text{ for } y < x < -x_0.$$
(b) Suppose the state space $S \subseteq \mathbb{R}$. Assume the following conditions on $\pi$ and the proposal densities $q(x, y)$:

(i) $\pi(x)$ is logconcave in the tails with the associated exponential constant $\alpha$.

(ii) $q(x, y) = q(|y - x|)$ for some fixed density function $q$, which satisfies the tail condition $q(x) \leq K e^{-\alpha|x|}$, with $\alpha$ as in (i) above.
Then, with $V(x)$ defined as $V(x) = e^{a|x|}$ with $a < \alpha$, the random walk Metropolis chain with the proposal density $q(x, y)$ is geometrically, but not uniformly, ergodic in the following sense:

$$\sup_{f:\ |f(y)| \leq V(y)} \left| E(f(X_n) \mid X_0 = x) - \int_S f(y)\, \pi(y)\, dy \right| \leq M\, V(x)\, r^n$$

for some finite constant $M$, and $0 < r < 1$. This holds for any initial state $x$, and any $n \geq 1$.
Example 19.32 (A Geometrically Ergodic Metropolis Chain). This is an example where we can verify the conditions in part (b) of the above theorem easily. Suppose the target density is $\pi(x) = c e^{-|x|},\ -\infty < x < \infty$. We use a Metropolis sampling scheme with the proposal density $q(x, y) = q(|y - x|)$, where $q$ is the uniform density $q(y) = \frac{1}{2b} I_{|y| \leq b},\ 0 < b < \infty$. Then, $\pi(x)$ is certainly logconcave in the tails. Indeed, taking $x_0$ to be any positive number, we have for $y > x > x_0$, or for $y < x < -x_0$,

$$\frac{\pi(y)}{\pi(x)} = e^{-(|y| - |x|)} = e^{-|y - x|},$$

so that $\pi$ is logconcave in the tails with $\alpha = 1$.
Example 19.33 (Proposal Densities Should Be Heavier Tailed Than Target). Suppose the target density is a standard exponential, and suppose we consider an independent Metropolis sampler with a proposal density that is exponential with mean $\lambda$. Thus, for $y > 0$,

$$\frac{\pi(y)}{p(y)} = \frac{e^{-y}}{\frac{1}{\lambda}\, e^{-y/\lambda}} = \lambda\, e^{-\left(1 - \frac{1}{\lambda}\right) y}.$$
Suppose now that the chain is started at $X_0 = x_0 = 1$. Then the candidate value $y = y_0$, it being from an exponential distribution with a small mean, will likely be smaller than $x_0$, and therefore, from our expression above, the probability $\rho(x_0, y_0)$ that the candidate value will be accepted is small. This cycle will persist, and the chain will mix very slowly. When a candidate value gets accepted, it will tend to be still lower than the initial value $x_0$. Even after many steps, averages such as $\frac{1}{n} \sum_{k=1}^{n} \psi(X_k)$ will produce poor estimates of $E_\pi(\psi(X))$. This example illustrates that it is important that the proposal density not be thinner tailed than the target density when using an independent Metropolis sampler.
Example 19.34 (Failure of the Gibbs Sampler to Converge). Suppose $U, V$ are iid $N(0, 1)$ (normality is not important for this example), and suppose that we wish to simulate from the joint distribution of $(X, Y)$ where $Y = U$, $X = |V|\,\text{sgn}(U) = |V|\,\text{sgn}(Y)$. Therefore, the full conditionals, that of $X \mid Y$ and $Y \mid X$, are such that $P(Y > 0 \mid X > 0) = P(X > 0 \mid Y > 0) = 1$. Suppose the starting values in a Gibbs chain are $x_0 = 1, y_0 = 1$. Then, with probability one, the Gibbs update produces $x_1, y_1$ such that first $x_1 > 0$ (because $x_1$ is simulated from $f(x \mid Y = y_0)$ and so $x_1$ has the same sign as $y_0$), and then, $y_1 > 0$. Then, with probability one, the Gibbs update produces $x_2, y_2$ such that $x_2, y_2 > 0$, and so on. Thus, if we let $A$ be the event that $X > 0, Y > 0$, the estimate of $P(A)$ from a Gibbs sampler, regardless of the length of the chain, is 1. But the true value of $P(A)$ is $P(A) = P(X > 0, Y > 0) = \frac{1}{2}$.
The reason that the Gibbs sampler fails to converge to the correct joint distribu-
tion of .X; Y / in the above example is that the support of the joint distribution is
confined to two disjoint subsets, the first quadrant, and the third quadrant. When-
ever the target distribution has such a topologically disconnected support, the full
conditionals of course preserve it, and so the Markov chain corresponding to Gibbs
sampling fails to be irreducible. Without the irreducibility property, the chain fails
to mix, and gets lost.
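The failure is easy to see in simulation; the chain below never leaves the quadrant it starts in.

```python
import numpy as np

# Example 19.34 simulated: started in the first quadrant, the Gibbs chain
# never leaves it, so the estimate of P(X > 0, Y > 0) is 1, not 1/2.
rng = np.random.default_rng(5)

x, y, n, pos = 1.0, 1.0, 10000, 0
for _ in range(n):
    x = abs(rng.standard_normal()) * np.sign(y)   # X | Y = y  is  |V| sgn(y)
    y = abs(rng.standard_normal()) * np.sign(x)   # Y | X = x  is  |U| sgn(x)
    pos += (x > 0) and (y > 0)
print(pos / n)                                    # prints 1.0
```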
Convergence issues for the Gibbs sampler are usually treated separately for the cases of discrete and continuous state spaces. The Gibbs sampler is a special Metropolis scheme, in which the acceptance probabilities $\rho(x, y)$ are equal to one. So, any general theorem on the convergence of a Metropolis scheme, and any general bound on the error of the approximation, works, in principle, for the Gibbs sampler. For example, in the finite-state space case, if a particular Gibbs chain is
reversible (recall that Gibbs chains are not necessarily reversible and the exact up-
dating scheme matters; see Section 19.4), then all the results in Theorem 19.8 apply.
However, their practical use depends on whether the constant c, that is, the SLEM,
can be evaluated or at least bounded for that specific Gibbs chain. Unfortunately,
this has to be done on a case-by-case basis. Diaconis et al. (2008) have succeeded in some exponential family problems, where $\pi$ is the posterior distribution of a scalar parameter, and the prior has a conjugate character (see Chapter 3 for an explanation of conjugacy).
The case of a continuous-state space is more complicated. There are some general
theorems with conditions on that at least guarantee convergence. However, if we
seek geometric ergodicity over and above just convergence, then typically one has to
fall back on drift and minorization methods, and for Gibbs chains, this approach is
essentially problem specific, and not easy. Construction of the energy function V .x/
corresponding to the drift condition (see Section 19.5.3, and also Theorem 19.12
part (b)) is often too difficult for Gibbs chains. Rosenthal (1995, 1996), Chan (1993),
and Jones and Hobert (2001) are some specific success stories. Uniform ergodicity
usually does not hold for Gibbs chains unless the state space is bounded.
No generally satisfactory error bounds are available in the Gibbs context. The following concrete theorem is based on Athreya et al. (1996) and Tierney (1994).
Theorem 19.13. (a) Suppose the target distribution has the joint density $\pi(x_1, x_2, \ldots, x_m)$ and that it satisfies a suitable positivity condition. Then the systematic scan Gibbs sampler that updates the coordinates in the order $1 \to 2 \to \cdots \to m$ has the convergence property

$$\sup_A |P^n(x, A) - \pi(A)| \to 0 \quad \text{as } n \to \infty,$$

for almost all starting values $x$ (with respect to the target distribution $\pi$).
(b) Suppose that the target distribution has the joint density $\pi(x_1, x_2, \ldots, x_m)$ and that $\Omega = \{x: \pi(x) > 0\}$ is an open connected set in a Euclidean space. Then the random scan Gibbs sampler is reversible, Harris recurrent, has $\pi$ as its unique stationary distribution, and has the convergence property

$$\sup_A |P^n(x, A) - \pi(A)| \to 0 \quad \text{as } n \to \infty.$$
The formal joint posterior density is

$$\pi(\mu, \tau \mid x_1, \ldots, x_m) = c\, \tau^{-\frac{m+1}{2}}\, e^{-\frac{1}{2\tau} \sum_{i=1}^{m} (x_i - \mu)^2}.$$
Even though the formal prior is not a probability density, the formal posterior is integrable for $m \geq 3$. We assume that $m \geq 5$ in order that the drift method fully works out. The full conditionals are derived easily from the joint posterior, and they are:

$$\frac{1}{\tau} \,\Big|\, (\mu, x_1, \ldots, x_m) \sim G\left(\frac{m-1}{2},\ \frac{2}{\sum_{i=1}^{m} (x_i - \mu)^2}\right); \qquad \mu \,\Big|\, (\tau, x_1, \ldots, x_m) \sim N\left(\bar{x},\ \frac{\tau}{m}\right).$$
Consider now the systematic scan Gibbs sampler that updates in the order $(\mu, \tau) \to (\mu, \tau') \to (\mu', \tau')$. The transition density $k((\mu, \tau), (\mu', \tau'))$ of this Gibbs chain is

$$k((\mu, \tau), (\mu', \tau')) = \pi(\tau' \mid \mu, x_1, \ldots, x_m)\, \pi(\mu' \mid \tau', x_1, \ldots, x_m),$$

and therefore

$$E[V(\mu', \tau') \mid (\mu, \tau)] = E[V(\mu', \tau') \mid \mu] = E_{\tau' \mid \mu}\, E[V(\mu', \tau') \mid (\mu, \tau')].$$
Now choose the energy function to be $V(\mu, \tau) = (\mu - \bar{x})^2$. Then, from the above, since $E[V(\mu', \tau') \mid (\mu, \tau')] = E[(\mu' - \bar{x})^2 \mid \tau'] = \frac{\tau'}{m}$,

$$E[V(\mu', \tau') \mid (\mu, \tau)] = E_{\tau' \mid \mu}\left(\frac{\tau'}{m}\right) = \frac{\sum_{i=1}^{m} (x_i - \mu)^2}{m(m - 3)}.$$
Now decompose $\sum_{i=1}^{m} (x_i - \mu)^2$ as $\sum_{i=1}^{m} (x_i - \mu)^2 = m(\mu - \bar{x})^2 + \sum_{i=1}^{m} (x_i - \bar{x})^2$. Therefore,

$$E[V(\mu', \tau') \mid (\mu, \tau)] = \frac{m V(\mu, \tau) + \sum_{i=1}^{m} (x_i - \bar{x})^2}{m(m - 3)} = \frac{V(\mu, \tau)}{m - 3} + \frac{\sum_{i=1}^{m} (x_i - \bar{x})^2}{m(m - 3)} \leq (1 - \epsilon) V(\mu, \tau) + b,$$

where $1 - \epsilon = \frac{1}{m - 3}$, that is, $\epsilon = \frac{m - 4}{m - 3}$, and $b = \frac{\sum_{i=1}^{m} (x_i - \bar{x})^2}{m(m - 3)}$. This establishes the drift condition in this example (see Definition 19.5).
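The Gibbs chain whose drift we just analyzed is itself simple to run. This is a sketch on simulated data; the data, seed, and run length are our own choices.

```python
import numpy as np

# The Gibbs sampler of this example on simulated data: tau plays the role
# of the variance, and 1/tau has the Gamma full conditional derived above.
rng = np.random.default_rng(11)

m = 20                                    # m >= 5, as the drift argument requires
x = rng.normal(10.0, 2.0, size=m)
xbar = x.mean()

mu, tau, mus = 0.0, 1.0, []
for _ in range(20000):
    s = np.sum((x - mu) ** 2)
    tau = 1.0 / rng.gamma((m - 1) / 2, 2.0 / s)   # 1/tau ~ G((m-1)/2, 2/s)
    mu = rng.normal(xbar, np.sqrt(tau / m))       # mu ~ N(xbar, tau/m)
    mus.append(mu)
print(np.mean(mus[2000:]))                # the posterior mean of mu is xbar
```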
The spectral methods as well as the drift methods for studying convergence
rates of MCMC algorithms may sometimes give very conservative answers to
the all-important practical question: how long should the chain be run to make
$\rho(P^n, \pi) \le \epsilon$? The spectral bounds can be difficult to apply if the eigenvalues of
P are intractable, and the drift methods are usually difficult to apply in any case. In
addition the spectral and the drift methods require each new problem to be treated
separately. Therefore, effective and simple omnibus convergence diagnostics are
useful. They can be graphical, or numerical. Fueled by this practical need, a bewil-
dering array of MCMC convergence diagnostics have been proposed over the last
25 years. A major review of methods up to that time is Cowles and Carlin (1996).
A more recent literature summary is Mengersen et al. (2004).
These convergence diagnostics have often been received with skepticism by ac-
tual users, because none of them has been found to be generally superior to the
others, and more importantly, because in a given application, they often give contra-
dictory answers. Although one diagnostic may lead us to conclude that approximate
convergence has already been achieved, another may suggest that we are still very
far from convergence. However, used collectively and in conjunction with spectral or drift methods, they have some practical utility. We provide a brief overview
of a few of the diagnostics that have become popular in the area.
Gelman-Rubin Multiple Chain Method. This is based on Gelman and Rubin
(1992). In this method, m parallel chains, say Xkj ; k D 1; 2; : : : ; 2n; j D
1; 2; : : : ; m are generated by using an MCMC algorithm. The method applies to any
MCMC algorithm for which convergence to a stationary distribution is theoretically
known.
Suppose the target distribution is d-dimensional for some $d \ge 1$, and let
g W Rd ! R be a function of interest. Let
\[ g_{kj} = g(X_{kj}), \qquad \bar g_j = \frac{1}{n}\sum_{k=n+1}^{2n} g_{kj}, \qquad \bar g = \frac{1}{mn}\sum_{j=1}^{m}\sum_{k=n+1}^{2n} g_{kj}, \]
\[ s_j^2 = \frac{1}{n}\sum_{k=n+1}^{2n} (g_{kj} - \bar g_j)^2. \]
Thus, $g_{kj}$ is the value of g based on the kth sweep of the jth run, $\bar g_j$ is their mean, $s_j^2$ is their variance, and $\bar g$ is the overall mean. Now define the between and within variances by
\[ B = \frac{1}{m}\sum_{j=1}^{m} (\bar g_j - \bar g)^2, \qquad W = \frac{1}{m}\sum_{j=1}^{m} s_j^2. \]
The method also computes an estimate $\hat\pi_n$ of the target distribution $\pi$ by pooling the last n values from the m parallel chains. Convergence is monitored by calculating
\[ GR = GR_{g,n} = \frac{n-1}{n} + c_n\,\frac{m+n}{mn}\,\frac{B}{W}, \]
\[ \hat p_k = \frac{1}{m}\sum_{j=1}^{m} I_{X_{kj} \in A}, \qquad k = B+1, \ldots, n, \]
\[ \hat B = \hat B_A = \operatorname{argmin}_B \sum_{k=B+1}^{n} (\hat p_k - \hat p_c)^2 k^2. \]
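The Gelman–Rubin computation described above reduces to a few array operations. A minimal sketch follows; it takes the correction factor $c_n$ to be 1 (an assumption of this sketch, not a statement from the text), and assumes the m parallel chains are stored as the columns of a $2n \times m$ array.

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin diagnostic from m parallel chains of length 2n.

    `chains` has shape (2n, m); only the last n sweeps are used, as in the
    text.  The correction factor c_n is taken to be 1 in this sketch.
    """
    two_n, m = chains.shape
    n = two_n // 2
    g = chains[n:, :]                       # g_{kj}, k = n+1, ..., 2n
    gbar_j = g.mean(axis=0)                 # per-chain means
    s2_j = g.var(axis=0)                    # per-chain variances (1/n version)
    B = np.mean((gbar_j - g.mean()) ** 2)   # between-chain variance
    W = np.mean(s2_j)                       # within-chain variance
    return (n - 1) / n + (m + n) / (m * n) * B / W

rng = np.random.default_rng(0)
mixed = rng.normal(size=(2000, 4))          # 4 chains already in stationarity
print(gelman_rubin(mixed))                  # close to 1 for well-mixed chains
```

Values near 1 indicate that the between-chain and within-chain variability agree; chains stuck in different regions drive $B/W$, and hence the statistic, well above 1.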
Exercises
Exercise 19.1. Find a Monte Carlo estimate for the value of $\theta = \int_0^{10} e^{-\sqrt{x}}\,dx$, and construct a 95% confidence interval for $\theta$. Use a selection of values for the Monte Carlo sample size n. Compare the Monte Carlo estimate to the true value of $\theta$.
Exercise 19.2. * Find a Monte Carlo estimate for the value of $\theta = \int_{-\infty}^{\infty} e^{-x^2}\sin^5(x)\,dx$, and construct a 95% confidence interval for $\theta$. Use a selection of values for the Monte Carlo sample size n. Compare the Monte Carlo estimate to the true value of $\theta$.
and construct a 95% confidence interval for $\theta$. Use a selection of values for the
Monte Carlo sample size n.
Exercise 19.4. With reference to Example 19.1, give a rigorous proof that
$E(\log |X|) = 0$ if $X \sim C(0, 1)$.
Exercise 19.8 * (Monte Carlo Estimate of Surface Area). Devise a Monte Carlo scheme for finding the surface area of the d-dimensional unit ball. Use it to find estimates for the surface area when d = 2, 3, 5, 10, 50. Use a selection of values for the Monte Carlo sample size n. Compare the Monte Carlo estimate to the true value of the surface area of the unit ball, which is $\frac{d\pi^{d/2}}{\Gamma(\frac{d}{2}+1)}$.
Exercise 19.9 (Using the Monte Carlo on a Divergent Integral). The integral $\theta = \int_0^{\infty} \frac{\cos x}{x}\,dx$ diverges. Investigate what happens if you try to find a Monte Carlo estimate for the integral anyway. Use a selection of values for the Monte Carlo sample size n.
Exercise 19.10 * (Using the Monte Carlo on an Infinite Expectation). For any
$m \ge 2$, if $X_1, \ldots, X_m$ are iid standard Cauchy, then $E(X_{(m)} - X_{(m-1)}) = \infty$.
Investigate what happens if you try to find a Monte Carlo estimate for this expecta-
tion anyway; use m D 2; 5; 20, and a selection of values for the Monte Carlo sample
size n.
Exercise 19.15. * Suppose $X \sim N_3(\mu, I)$, and that the mean vector $\mu$ has a prior density
\[ \pi(\mu) = \frac{c}{1 + (|\mu_1| + |\mu_2| + |\mu_3|)^{3.5}}. \]
Find a Monte Carlo estimate for the posterior mean of $\mu_i$, i = 1, 2, 3, if X = (1, 0, 1). Use a selection of values for the Monte Carlo sample size n.
Exercise 19.16 (Monte Carlo Estimate of e). Find a Monte Carlo estimate for e
by using the identity $P(X > 1) = \frac{1}{e}$ if $X \sim \mathrm{Exp}(1)$. Use a selection of values for
the Monte Carlo sample size n, and plot the Monte Carlo estimates against n.
Exercise 19.17 (Monte Carlo P -value). With reference to Example 19.7, calcu-
late the Monte Carlo P -value for the median-based test for the center of a Cauchy
distribution when the original dataset is generated under the alternative $C(\theta, 1)$ with $\theta = 2.5$. Compare with the P-value approximated by using the normal approximation, as in Example 19.7.
Exercise 19.18 * (Monte Carlo P -value). Calculate the Monte Carlo P -value for
the Kolmogorov–Smirnov statistic for testing H0 W F D N.0; 1/, when the original
dataset of size m = 50 is generated under the null; under $F = U[-2, 2]$. Compare
with the P -value approximated by using the Brownian bridge approximation, as in
Chapter 16, Section 16.2.1.
\[ \hat\theta_1 = \frac{c}{n}\sum_{i=1}^{n} \varphi(X_i); \qquad \hat\theta_2 = \frac{Mc}{n}\sum_{i=1}^{n} I_{Y_i \le \varphi(X_i)}. \]
Exercise 19.54 (Metropolis–Hastings Chain). For the special case that n D 4 and
$\alpha = \beta = 2$, find explicitly the transition probabilities of the Metropolis–Hastings
chain for the Beta–Binomial case (Example 19.15), and verify that the chain is irre-
ducible and aperiodic.
Exercise 19.56. Write the following scheme formally as a Metropolis chain: gen-
erate a candidate state j according to the target distribution $\pi$, regardless of the current state i, and accept the candidate state only if the target distribution assigns at least probability $\pi(i)$ to it.
Exercise 19.69 * (Monotonicity of Gibbs Error). Show that the systematic scan
Gibbs chain for simulating an m-dimensional random vector has the property that
the total variation distance $\sup_A |P^n(x, A) - \pi(A)|$ is monotonically nonincreasing
in n.
Show that the chain has a stationary distribution , but that the chain is not
reversible.
Exercise 19.75. Generate a Gibbs chain of length n D 50 using the random scan
Gibbs chain, when the target distribution is a bivariate normal with means 0, stan-
dard deviations 1, and correlation 0:5.
Exercise 19.78. Find the eigenvalues in analytical form for the three-state station-
ary Markov chain with the transition matrix
\[ P = \begin{pmatrix} \alpha & 1-\alpha & 0 \\ 0 & \beta & 1-\beta \\ 1-\gamma & \gamma & 0 \end{pmatrix}. \]
Exercise 19.80. Give a proof that for a reversible chain, the eigenvalues of the transition matrix are all real and lie in the interval $[-1, 1]$.
Note that the stationary distribution of the Ehrenfest chain was worked out in
Example 10.20.
Exercise 19.84. Calculate the Dobrushin coefficient as well as the SLEM for the
nonreversible transition matrix of Example 19.22, and verify that the SLEM is at most $c(P)$.
Exercise 19.85. * Construct an example in which the SLEM and the Dobrushin
coefficient coincide.
Exercise 19.86 * (SLEM versus Dobrushin Coefficients). Prove the better inequality that the SLEM is at most $\min_n\, [c(P^n)]^{1/n}$.
References
Athreya, K., Doss, H., and Sethuraman, J. (1996). On the convergence of the Markov chain simu-
lation method, Ann. Statist., 24, 89–100.
Barnard, G. (1963). Discussion of paper by M.S. Bartlett, JRSS Ser. B, 25, 294.
Besag, J. and Clifford, P. (1989). Generalized Monte Carlo significance tests, Biometrika, 76,
633–642.
Besag, J. and Clifford, P. (1991). Sequential Monte Carlo p-values, Biometrika, 78, 301–304.
Brémaud, P. (1999). Markov Chains, Springer, New York.
Chan, K. (1993). Asymptotic behavior of the Gibbs sampler, J. Amer. Statist. Assoc., 88, 320–326.
Cowles, M. and Carlin, B. (1996). Markov chain Monte Carlo convergence diagnostics: A com-
parative review, J. Amer. Statist. Assoc., 91, 883–904.
Diaconis, P. (2009). The MCMC revolution, Bull. Amer. Math. Soc., 46, 179–205.
Diaconis, P. and Saloff-Coste, L. (1996). Logarithmic Sobolev inequalities for finite Markov
chains, Ann. Appl. Prob., 6, 695–750.
Diaconis, P. and Saloff-Coste, L. (1998). What do we know about the Metropolis algorithm,
J. Comput. System Sci., 57, 20–36.
Diaconis, P. and Stroock, D. (1991). Geometric bounds for eigenvalues of Markov chains, Ann.
Appl. Prob., 1, 36–61.
Diaconis, P. and Sturmfels, B. (1998). Algebraic algorithms for sampling from conditional distri-
butions, Ann. Statist., 26, 363–398.
Diaconis, P., Khare, K., and Saloff-Coste, L. (2008). Gibbs sampling, exponential families, and
orthogonal polynomials, with discussion, Statist. Sci., 23, 2, 151–200.
Dimakos, X.K. (2001). A guide to exact simulation, Internat. Statist. Rev., 69, 27–48.
Do, K.-A. and Hall, P. (1989). On importance resampling for the bootstrap, Biometrika, 78,
161–167.
Dobrushin, R.L. (1956). Central limit theorems for non-stationary Markov chains II, Theor. Prob. Appl., 1, 329–383.
Fill, J. (1991). Eigenvalue bounds on convergence to stationarity for non-reversible Markov chains,
with an application to the exclusion process, Ann. Appl. Prob., 1, 62–87.
Fill, J. (1998). An interruptible algorithm for perfect sampling via Markov chains, Ann. App. Prob.,
8, 131–162.
Fishman, G. S. (1995). Monte Carlo, Concepts, Algorithms, and Applications, Springer, New York.
Gamerman, D. (1997). Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference,
Chapman and Hall, London.
Garren, S. and Smith, R.L. (1993). Convergence diagnostics for Markov chain samplers,
Manuscript.
Gelfand, A. and Smith, A.F.M. (1990). Sampling based approaches to calculating marginal densities, J. Amer. Statist. Assoc., 85, 398–409.
Gelman, A. and Rubin, D. (1992). Inference from iterative simulation using multiple sequences,
with discussion, Statist. Sci., 7, 457–511.
Gelman, A., Carlin, B., Stern, H., and Rubin, D. (2003). Bayesian Data Analysis, Chapman and
Hall/CRC, Boca Raton.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian
restoration of images, IEEE Trans. Pattern Anal. Mach. Intell., 6, 721–741.
Geyer, C. (1992). Practical Markov chain Monte Carlo, with discussion, Statist. Sci., 7, 473–511.
Gilks, W., Richardson, S., and Spiegelhalter, D. (Eds.), (1995). Markov Chain Monte Carlo in
Practice, Chapman and Hall, London.
Glauber, R. (1963). Time dependent statistics of the Ising Model, J. Math. Phys., 4, 294–307.
Green, P.J. (1995). Reversible jump Markov Chain Monte Carlo computation and Bayesian model
determination, Biometrika, 82, 711–732.
Hall, P. and Titterington, D.M. (1989). The effect of simulation order on level accuracy and power
of Monte Carlo tests, JRSS Ser. B, 51, 459–467.
Hastings, W. (1970). Monte Carlo sampling methods using Markov chains and their applications,
Biometrika, 57, 97–109.
Higdon, D. (1998). Auxiliary variables methods for Markov chain Monte Carlo applications,
J. Amer. Statist. Assoc., 93, 585–595.
Jones, G. and Hobert, J. (2001). Honest exploration of intractable probability distributions via
Markov Chain Monte Carlo, Statist. Sci., 16, 312–334.
Kendall, W. and Thönnes, E. (1999). Perfect simulation in stochastic geometry, Patt. Recogn., 32,
1569–1586.
Liu, J. (1995). Eigenanalysis for a Metropolis sampling scheme with comparisons to rejection
sampling and importance sampling, Manuscript.
Liu, J. (2008). Monte Carlo Strategies in Scientific Computing, Springer, New York.
Mengersen, K. and Tweedie, R. (1996). Rates of convergence of Hastings and Metropolis algo-
rithms, Ann. Statist., 24, 101–121.
Mengersen, K., Knight, S., and Robert, C. (2004). MCMC: How do we know when to stop?,
Manuscript.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. (1953). Equations of state
calculations by fast computing machines, J. Chem. Phys., 21, 1087–1092.
Propp, J. and Wilson, D. (1998). How to get a perfectly random sample from a generic Markov chain and generate a random spanning tree of a directed graph, J. Alg., 27, 170–217.
Ripley, B. D. (1987). Stochastic Simulation, Wiley, New York.
Robert, C. and Casella, G. (2004). Monte Carlo Statistical Methods, Springer, New York.
Roberts, G. and Rosenthal, J.S. (2004). General state space Markov chains and MCMC algorithms,
Prob. Surveys, 1, 20–71.
Rosenthal, J. (1995). Minorization conditions and convergence rates for Markov chain Monte
Carlo, J. Amer. Statist. Assoc., 90, 558–566.
Rosenthal, J. (1996). Analysis of the Gibbs sampler for a model related to the James–Stein estima-
tions, Statist. Comput., 6, 269–275.
Rosenthal, J. (2002). Quantitative convergence rates of Markov chains: A simple account, Electr.
Comm. Prob., 7, 123–128.
Ross, S. (2006). Simulation, Academic Press, New York.
Rubin, H. (1976). Some fast methods of generating random variables with pre-assigned distribu-
tions: General acceptance-rejection procedures, Manuscript.
Schmeiser, B. (1994). Modern simulation environments: Statistical issues, Proceedings of the First
IE Research Conference, 139–144.
Schmeiser, B. (2001). Some myths and common errors in simulation experiments, B. Peters et al.
Eds., Proceedings of the Winter Simulation Conference, 39–46.
Smith, A.F.M. and Roberts, G. (1993). Bayesian computation via the Gibbs sampler, with discus-
sion, JRSS Ser. B, 55, 3–23.
Tanner, M. and Wong, W. (1987). The calculation of posterior distributions, with discussions,
J. Amer. Statist. Assoc., 82, 528–550.
Tierney, L. (1994). Markov chains for exploring posterior distributions, with discussion, Ann.
Statist., 22, 1701–1762.
Yu, B. and Mykland, P. (1994). Looking at Markov samplers through CUSUM path plots: A simple
diagnostic idea, Manuscript.
Chapter 20
Useful Tools for Statistics and Machine Learning
had on practice. We recommend Hall (1992) and Shao and Tu (1995) for detailed
theoretical developments of the bootstrap, and Efron and Tibshirani (1993) for
an application-oriented readable exposition. Modern reviews include Hall (2003),
Bickel (2003), Efron (2003), and Lahiri (2006). Lahiri (2003) is a rigorous treat-
ment of the bootstrap for various kinds of dependent data, including problems that
arise in time series and spatial statistics.
Suppose $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} F$, and $T(X_1, X_2, \ldots, X_n; F)$ is a functional, for example, $T(X_1, X_2, \ldots, X_n; F) = \frac{\sqrt{n}(\bar X - \mu)}{\sigma}$, where $\mu = \mu(F) = E_F(X_1)$, and $\sigma^2 = \sigma^2(F) = \mathrm{Var}_F(X_1)$, assumed to be finite. In statistical problems, we frequently need to know something about the sampling distribution of T, for example, $P_F(T(X_1, X_2, \ldots, X_n; F) \le t)$. If we had replicated samples from the population,
resulting in a series of values for the statistic T, then we could form estimates of $P_F(T \le t)$ by counting how many of the $T_i$s are $\le t$. But statistical sampling is
not done that way. Usually, we do not obtain replicated samples; we obtain just one
set of data values of some size n. The intuition of the canonical bootstrap is that
by the Glivenko–Cantelli theorem (see Chapter 7), the empirical CDF Fn should be
very close to the true underlying CDF F , and so, sampling from Fn , which amounts
to simply resampling n values with replacement from the already available data
.X1 ; X2 ; : : :; Xn /, should produce new sets of values that act like samples from F
itself. So, although we did not have replicated datasets to start with, it is as if by re-
sampling from the available dataset we now have the desired replications. There is
a certain element of faith in this idea, unless we have demonstrable proofs that this
simple idea will in fact work, that is, that these resamples lead us to accurate approx-
imations to the true distribution of T . It turns out that such theorems are available,
and have led to the credibility and popularity of the bootstrap as a distribution ap-
proximation tool. To implement the bootstrap, we only need to be able to generate
enough resamples from the original dataset. So, in a sense, the bootstrap replaces a
hard mathematical calculation in probability theory by an omnibus and almost au-
tomated computing exercise. It is the automatic nature of the bootstrap that makes
it so appealing. However, it is also frequently misused in situations where it should
not be used, because it is theoretically unjustifiable in those problems, and will in
fact give incorrect and inaccurate answers.
Suppose for some number B, we draw B resamples of size n from the original sample. Denoting the resamples from the original sample as
\[ (X_{11}^*, X_{12}^*, \ldots, X_{1n}^*),\ (X_{21}^*, X_{22}^*, \ldots, X_{2n}^*),\ \ldots,\ (X_{B1}^*, X_{B2}^*, \ldots, X_{Bn}^*), \]
with corresponding values $T_1^*, T_2^*, \ldots, T_B^*$ for the functional T, one can use simple frequency-based estimates such as $\frac{\#\{j : T_j^* \le t\}}{B}$ to estimate $P_F(T \le t)$. This is the basic idea of the bootstrap.
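The resampling scheme just described can be sketched in a few lines of Python. Here T is taken to be $\sqrt{n}(\bar X - \mu)$, with bootstrap replicates $T^* = \sqrt{n}(\bar X^* - \bar X)$; the function name, seed, and exponential test data are illustrative choices of this sketch, not from the text.

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_cdf_estimate(x, t, B=1000):
    """Estimate P_F(T <= t) for T = sqrt(n)*(xbar - mu) by the bootstrap:
    resample n values with replacement, recompute T* = sqrt(n)*(xbar* - xbar),
    and report the fraction of the B replicates that are <= t."""
    n, xbar = len(x), np.mean(x)
    stars = np.empty(B)
    for b in range(B):
        xs = rng.choice(x, size=n, replace=True)
        stars[b] = np.sqrt(n) * (xs.mean() - xbar)
    return np.mean(stars <= t)

x = rng.exponential(size=50)
print(bootstrap_cdf_estimate(x, t=0.0))   # near 0.5, shifted by the skewness of F
```

The centering by $\bar X$ rather than $\mu$ is the essential bootstrap substitution: under $F_n$, the "true" mean is $\bar X$ itself.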
The formal definition of the bootstrap distribution of a functional is as follows.

Definition 20.1. The bootstrap distribution of T is
\[ H_{boot}(x) = P_{F_n}\big(T(X_1^*, \ldots, X_n^*; F_n) \le x\big), \]
where $(X_1^*, \ldots, X_n^*)$ refers to an iid sample of size n from the empirical CDF $F_n$. It is common to use the notation $P_*$ to denote probabilities under the bootstrap distribution. $P_{F_n}(\cdot)$ corresponds to probability statements over all the $n^n$ possible with-replacement resamples from the original sample $X_1, \ldots, X_n$.
Recalculating T from all $n^n$ resamples is basically impossible unless n is very small; therefore one uses a smaller number B of resamples and recalculates T only B times. Thus $H_{boot}(x)$ is itself estimated by a Monte Carlo, known as the bootstrap Monte Carlo. So the final estimate for $P_F(T(X_1, X_2, \ldots, X_n; F) \le x)$ absorbs errors from two sources: (i) pretending that $(X_{i1}^*, X_{i2}^*, \ldots, X_{in}^*)$ are bona fide samples from F; (ii) estimating the true $H_{boot}(x)$ by a Monte Carlo. By choosing B adequately large, the issue of the Monte Carlo error is generally ignored. The choice of which B would let one ignore the Monte Carlo error is a hard mathematical problem; Hall (1986, 1989) are two key references. It is customary to choose B between 500 and 1000 for variance estimation and a somewhat larger value for estimating quantiles. It is hard to give any general reliable prescriptions for B.
At first glance, the bootstrap idea of resampling from the original sample appears
to be a bit too simple to actually work. One has to have a definition for what one
means by the bootstrap working in a given situation. For estimating the CDF of
a statistic, one should want Hboot .x/ to be numerically close to the true CDF, say
Hn .x/, of T . This would require consideration of metrics on CDFs, a topic we
covered in Chapter 15. For a general metric $\rho$, the definition of the bootstrap working
in a given problem is the following.
Definition 20.2. Let $\rho(F, G)$ be a metric on the space of CDFs on $\mathcal{X}$. For a given functional $T(X_1, X_2, \ldots, X_n; F)$, let
\[ H_n(x) = P_F\big(T(X_1, \ldots, X_n; F) \le x\big), \qquad H_{boot}(x) = P_{F_n}\big(T(X_1^*, \ldots, X_n^*; F_n) \le x\big). \]
We say that the bootstrap is weakly consistent under $\rho$ for T if $\rho(H_n, H_{boot}) \stackrel{P}{\to} 0$ as $n \to \infty$. We say that the bootstrap is strongly consistent under $\rho$ for T if $\rho(H_n, H_{boot}) \stackrel{a.s.}{\to} 0$.
Note that the need for mentioning convergence to zero in probability or a.s. in
this definition is due to the fact that the bootstrap distribution Hboot is a random
CDF. It is a random CDF because as a function it depends on the original sam-
ple .X1 ; X2 ; : : :; Xn /. Thus, the bootstrap uses a random CDF to approximate a
deterministic but unknown CDF, namely the true CDF Hn of the functional T . In
principle, a sequence of random CDFs could very well converge to another random
CDF, or not converge at all! It is remarkable that under certain minimal conditions,
those disasters do not happen, and Hboot and Hn get close as n ! 1.
Example 20.1 (Applying the Bootstrap). How does one apply the bootstrap in practice? Suppose, for example, $T(X_1, X_2, \ldots, X_n; F) = \sqrt{n}(\bar X - \mu)$. In the canonical bootstrap, one draws B resamples, computes $T_i^* = \sqrt{n}(\bar X_i^* - \bar X)$ for each, and estimates $P_F(T \le x)$ by the proportion of the $T_i^*$ values that are $\le x$.
692 20 Useful Tools for Statistics and Machine Learning
This number depends on the original sample set fX1 ; X2 ; : : :; Xn g, the particular
resampled sets $(X_{i1}^*, X_{i2}^*, \ldots, X_{in}^*)$, and the bootstrap Monte Carlo sample size B.
If the bootstrap Monte Carlo is repeated, then for the same B, and of course, the
same original sample set fX1 ; X2 ; : : :; Xn g, the bootstrap estimate will be a differ-
ent number.
We would like the bootstrap estimate to be close to the true value of $P_F(\sqrt{n}(\bar X - \mu) \le x)$; consistency of the bootstrap is about our ability to guarantee that for large n, with an implicit unspoken assumption of a large B.
We start with the case of the sample mean of iid random variables. If $X_1, X_2, \ldots, X_n \stackrel{iid}{\sim} F$ and if $\mathrm{Var}_F(X_i) < \infty$, then $\sqrt{n}(\bar X - \mu)$ has a limiting normal distribution, by the CLT. So a probability such as $P_F(\sqrt{n}(\bar X - \mu) \le x)$ could be approximated by, for example, $\Phi(\frac{x}{s})$, where s is the sample standard deviation. An interesting property of the bootstrap approximation is that even when the CLT approximation $\Phi(\frac{x}{s})$ is
available, the bootstrap approximation may be more accurate. Such results generally
go by the name of higher-order accuracy of the bootstrap.
But first we present two consistency results corresponding to two specific metrics that have earned a special status in this literature. The two metrics are (i) the Kolmogorov metric $K(F, G) = \sup_x |F(x) - G(x)|$, and (ii) the Wasserstein metric.
Proof. We sketch a proof of the strong consistency in K. The proof requires use of
the Berry–Esseen inequality, Polya’s theorem (see Chapter 7 and Chapter 8), and a
strong law known as the Zygmund–Marcinkiewicz strong law, which we state below
without a proof.
We are now ready to sketch the proof of strong consistency of Hboot under the
Kolmogorov metric K.
Applying the Berry–Esseen inequality under $F_n$, the relevant bound is
\[ \frac{4}{5}\,\frac{E_{F_n}|X_1 - \bar X|^3}{\sqrt{n}\,[\mathrm{Var}_{F_n}(X_1)]^{3/2}} = \frac{4}{5}\,\frac{\sum_{i=1}^{n}|X_i - \bar X|^3}{\sqrt{n}\; n s^3} \le \frac{4}{5 n^{3/2} s^3}\, 2^3\!\left[\sum_{i=1}^{n}|X_i|^3 + n|\bar X|^3\right] = \frac{M}{s^3}\left[\frac{1}{n^{3/2}}\sum_{i=1}^{n}|X_i|^3 + \frac{|\bar X|^3}{\sqrt{n}}\right], \]
where $M = \frac{32}{5}$.
Because $s \to \sigma > 0$ and $\bar X \to \mu$ almost surely, it is clear that $|\bar X|^3/(\sqrt{n}\,s^3) \to 0$ almost surely. As regards the first term, let $Y_i = |X_i|^3$ and $\delta = 2/3$. Then the $\{Y_i\}$ are iid and
\[ E|Y_i|^{\delta} = E_F|X_i|^{3 \cdot 2/3} = E_F(X_1^2) < \infty. \]
It now follows from the Zygmund–Marcinkiewicz SLLN that
\[ \frac{1}{n^{3/2}}\sum_{i=1}^{n}|X_i|^3 = n^{-1/\delta}\sum_{i=1}^{n} Y_i \to 0, \quad \text{a.s., as } n \to \infty. \]
Example 20.2 (Practical Accuracy of Bootstrap). How does the bootstrap compare with the CLT approximation in actual applications? The question can only be answered by case-by-case simulation. The results are mixed in the following numerical table. The $X_i$ are iid Exp(1) in this example and $T = \sqrt{n}(\bar X - 1)$, with n = 20. For the bootstrap approximation, B = 250 was used.
Theorem 20.3. Let $\vec X_1, \vec X_2, \ldots, \vec X_n \stackrel{iid}{\sim} F$, and let $\Sigma_{d \times d} = \mathrm{Cov}_F(\vec X_1)$ be finite. Let $T(\vec X_1, \vec X_2, \ldots, \vec X_n; F) = \sqrt{n}(\bar{\vec X} - \vec\mu)$, and for some $m \ge 1$, let $g : \mathbb{R}^d \to \mathbb{R}^m$. If $\nabla g(\vec\mu)$ exists in a neighborhood of $\vec\mu$, $\nabla g(\vec\mu) \ne 0$, and if $\nabla g(\cdot)$ is continuous at $\vec\mu$, then the bootstrap is strongly consistent with respect to the Kolmogorov metric K for $\sqrt{n}\,[g(\bar{\vec X}) - g(\vec\mu)]$. See Shao and Tu (1995, p. 80) for a sketch of a proof of this theorem.
One wants to know how accurate the bootstrap-based estimates are in reality.
This can only be answered on the basis of case-by-case investigation. Some over-
all qualitative phenomena have emerged from these investigations. For instance,
(a) The bootstrap distribution estimate captures information about skewness that
the CLT will miss.
(b) But the bootstrap tends to underestimate the variance of a statistic T .
Here are a few more illustrative examples.
Example 20.3 (Bootstrapping the Sample Variance). Let $X_1, X_2, \ldots$ be iid one-dimensional random variables with the CDF F, and suppose $E_F(X_1^4) < \infty$. Let $\vec Y_i = (X_i, X_i^2)'$. Then with d = 2, $\vec Y_1, \vec Y_2, \ldots, \vec Y_n$ are iid d-dimensional vectors with
\[ \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar X^2 = g(\bar{\vec Y}), \qquad \text{where } g(u, v) = v - u^2. \]
\[ f(r \mid \rho) = \frac{2^{n-3}}{\pi\,(n-3)!}\,(1 - \rho^2)^{(n-1)/2}\,(1 - r^2)^{(n-4)/2} \sum_{k=0}^{\infty} \Gamma^2\!\left(\frac{n+k-1}{2}\right) \frac{(2\rho r)^k}{k!}\,; \]
see Tong (1990). In the table below, we give simulation averages of the estimated
standard deviation of r by using the bootstrap. We used n D 20, and B D 200.
The bootstrap estimate was calculated for 1000 independent simulations; the table
reports the average of the standard deviation estimates over the 1000 simulations.
Except when $\rho$ is large, the bootstrap underestimates the variance, and the CLT estimate is better.
Example 20.6 (The t-Statistic for Poisson Data). Suppose $X_1, \ldots, X_n$ are iid Poi($\lambda$) and let $T_n$ be the t-statistic $T_n = \sqrt{n}(\bar X - \lambda)/s$. In this example n = 20 and B = 200, and for the actual data, $\lambda$ was chosen to be 1. Apart from the bias
and the variance of Tn , in this example we also report percentile estimates for Tn .
The bootstrap percentile estimates are found by calculating Tn for the B resamples
and calculating the corresponding percentile value of the B values of Tn . The bias
and the variance are estimated to be 0.18 and 1.614, respectively. The estimated
percentiles are reported in the table. Note that the 5th and the 95th percentiles are
not equal in absolute value in this table; neither are the 10th and the 90th. Such
potential strong skewness would have remained undetected, if we had simply used
a normal approximation.
Example 20.7 (Bootstrap Failure). In spite of the many consistency theorems in the
previous section, there are instances where the ordinary bootstrap with replacement
sampling from Fn actually does not work. Typically, these are instances where the
functional Tn fails to admit a CLT. Here is a simple example where the ordinary
bootstrap fails to consistently estimate the true distribution of a statistic.
Let $X_1, X_2, \ldots$ be iid $U(0, \theta)$ and let $T_n = n(\theta - X_{(n)})$ and $T_n^* = n(X_{(n)} - X_{(n)}^*)$. The ordinary bootstrap will fail in this example in the sense that the conditional distribution of $T_n^*$ given $(X_1, \ldots, X_n)$ does not converge a.s. to the $\mathrm{Exp}(\theta)$ limit distribution of $T_n$. Let us take, for notational simplicity, $\theta = 1$. Then for $t \ge 0$,
20.1 The Bootstrap 699
\[
\begin{aligned}
P_{F_n}(T_n^* \le t) &\ge P_{F_n}(T_n^* = 0) = P_{F_n}\big(X_{(n)}^* = X_{(n)}\big) \\
&= 1 - P_{F_n}\big(X_{(n)}^* < X_{(n)}\big) = 1 - \left(\frac{n-1}{n}\right)^{n} \xrightarrow{\,n \to \infty\,} 1 - e^{-1}.
\end{aligned}
\]
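The calculation above is easy to corroborate by simulation: the bootstrap maximum equals the sample maximum whenever the resample hits the largest observation at least once, which happens with probability $1 - (1 - 1/n)^n \approx 1 - e^{-1} \approx 0.632$. The seed and sample sizes below are arbitrary choices of this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)

# Empirical check that P_{F_n}(X_(n)* = X_(n)) is near 1 - 1/e for U(0,1) data.
n, B = 200, 4000
x = rng.uniform(size=n)
hits = 0
for _ in range(B):
    xs = rng.choice(x, size=n, replace=True)
    hits += xs.max() == x.max()
print(hits / B, 1 - np.exp(-1))   # both close to 0.632
```

So the bootstrap distribution of $T_n^*$ has an atom of mass about 0.632 at zero, which the continuous exponential limit of $T_n$ cannot match; this is the failure mechanism.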
One question about the use of the bootstrap is whether the bootstrap has any advantages at all when a CLT is already available. To be specific, suppose $T(X_1, \ldots, X_n; F) = \sqrt{n}(\bar X - \mu)$. If $\sigma^2 = \mathrm{Var}_F(X) < \infty$, then $\sqrt{n}(\bar X - \mu) \stackrel{\mathcal{L}}{\Rightarrow} N(0, \sigma^2)$ and $K(H_{boot}, H_n) \stackrel{a.s.}{\longrightarrow} 0$. So two competitive approximations to $P_F(T(X_1, \ldots, X_n; F) \le x)$ are $\Phi(\frac{x}{\hat\sigma})$ and $P_{F_n}(\sqrt{n}(\bar X^* - \bar X) \le x)$. It turns
out that for certain types of statistics, the bootstrap approximation is (theoretically)
more accurate than the approximation provided by the CLT. The CLT, because any
normal distribution is symmetric, cannot capture information about the skewness
in the finite sample distribution of T . The bootstrap approximation does so. So
the bootstrap succeeds in correcting for skewness, just as an Edgeworth expansion
would do (see Chapter 1). This is called Edgeworth correction by the bootstrap and
the property is called second-order accuracy of the bootstrap. It is important to
remember that second-order accuracy is not automatic; it holds for certain types of
T but not for others. It is also important to understand that practical accuracy and
theoretical higher-order accuracy can be different things. The following heuristic
calculation illustrates when second-order accuracy can be anticipated. The first re-
sult on higher-order accuracy of the bootstrap is due to Singh (1981). In addition to
the references we provided in the beginning, Lehmann (1999) gives a very readable
treatment of higher-order accuracy of the bootstrap.
Suppose $X_1, \ldots, X_n \stackrel{iid}{\sim} F$ and $T(X_1, \ldots, X_n; F) = \frac{\sqrt{n}(\bar X - \mu)}{\sigma}$; here $\sigma^2 = \mathrm{Var}_F(X_1) < \infty$. We know that T admits the Edgeworth expansion (see Chapter 1):
\[ P_F(T \le x) = \Phi(x) + \frac{p_1(x \mid F)}{\sqrt{n}}\,\phi(x) + \frac{p_2(x \mid F)}{n}\,\phi(x) + \text{smaller order terms}, \]
and, analogously, the bootstrap distribution admits the expansion
\[ P_*(T^* \le x) = \Phi(x) + \frac{p_1(x \mid F_n)}{\sqrt{n}}\,\phi(x) + \frac{p_2(x \mid F_n)}{n}\,\phi(x) + \text{smaller order terms}, \]
so that
\[ H_n(x) - H_{boot}(x) = \frac{p_1(x \mid F) - p_1(x \mid F_n)}{\sqrt{n}}\,\phi(x) + \frac{p_2(x \mid F) - p_2(x \mid F_n)}{n}\,\phi(x) + \text{smaller order terms}. \]
and the leading term in the bootstrap approximation in this unstandardized case would be $\Phi(\frac{x}{s})$. So in the unstandardized case, the bootstrap approximates the true CDF $H_n(x)$ also at the rate $\frac{1}{\sqrt{n}}$; that is, if one does not standardize, then $H_n(x) - H_{boot}(x) = O_p(\frac{1}{\sqrt{n}})$. We have now lost the second-order accuracy. The following second rule of thumb often applies.
Rule of Thumb. Let $X_1, \ldots, X_n$ be iid from F and let $T(X_1, \ldots, X_n; F)$ be a functional. If $T(X_1, \ldots, X_n; F) \stackrel{\mathcal{L}}{\Rightarrow} N(0, \sigma^2)$, where $\sigma^2$ is independent of F, then second-order accuracy of the bootstrap can usually be expected. In practice, of course, $\sigma$ would usually be unknown, and the accurate approximation could not really be used to construct a confidence interval for $\mu$. Still, the second-order accuracy result is theoretically insightful.
We state a specific result below for the case of standardized and nonstandardized sample means. Let
\[ H_n(x) = P_F\big(\sqrt{n}(\bar X - \mu) \le x\big), \qquad H_{n,0}(x) = P_F\!\left(\frac{\sqrt{n}(\bar X - \mu)}{\sigma} \le x\right), \]
\[ H_{boot}(x) = P_{F_n}\big(\sqrt{n}(\bar X^* - \bar X) \le x\big), \qquad H_{boot,0}(x) = P_{F_n}\!\left(\frac{\sqrt{n}(\bar X^* - \bar X)}{s} \le x\right). \]
Theorem 20.4. Let $X_1, \ldots, X_n \stackrel{iid}{\sim} F$.
(a) If $E_F|X_1|^3 < \infty$, and F is nonlattice, then $K(H_{n,0}, H_{boot,0}) = o_p(\frac{1}{\sqrt{n}})$.
(b) If $E_F|X_1|^3 < \infty$, and F is lattice, then $\sqrt{n}\,K(H_{n,0}, H_{boot,0}) \stackrel{P}{\to} c$, $0 < c < \infty$.
See Shao and Tu (1995, pp. 92–94) for a proof. The constant c in the lattice case equals $\frac{h}{\sigma\sqrt{2\pi}}$, where h is the span of the lattice $\{a + kh,\ k = 0, \pm 1, \pm 2, \ldots\}$ on which the $X_i$ are supported. Note also that part (a) says that higher-order accuracy for the standardized case obtains with three moments; Hall (1988) showed that finiteness of three absolute moments is in fact necessary and sufficient for higher-order accuracy of the bootstrap in the standardized case.
The ordinary bootstrap that resamples observations with replacement from the orig-
inal dataset does not work when the sample observations are dependent. This was
already pointed out in Singh (1981). It took some time before consistent bootstrap
schemes were offered for dependent data. There are consistent schemes that are
meant for specific dependence structures (e.g., stationary autoregression of a known
order) and there are also general bootstrap schemes that work for large classes of
stationary time series without requiring any particular dependence structure. The
model-based schemes are better for the specific models, but can completely fail if
some assumption about the specific model does not hold. Block bootstrap methods
are regarded as the bread and butter of resampling for dependent sequences. These
are general and mostly all-purpose resampling schemes that provide at least consis-
tency for a wide selection of dependent data models.
The basic idea of the block bootstrap method is that if the underlying series is a
stationary process with short-range dependence, then blocks of observations of suit-
able lengths should be approximately independent. Also, the joint distribution of
the variables in different blocks would be about the same, due to stationarity.
(b) (Moving Block Bootstrap (MBB)). In this scheme, the blocks are
\[ B_1 = \{y_1, \ldots, y_h\},\ B_2 = \{y_2, \ldots, y_{h+1}\},\ \ldots,\ B_N = \{y_{n-h+1}, \ldots, y_n\}, \]
where $N = n - h + 1$. One then resamples $B_1^*, \ldots, B_m^*$ from $B_1, \ldots, B_N$, where still $n = mh$.
(c) (Circular Block Bootstrap (CBB)). In this scheme, one periodically extends the observed series as $y_1, y_2, \ldots, y_n, y_1, y_2, \ldots, y_n, \ldots$. Suppose we let $z_i$ be the members of this new series, $i = 1, 2, \ldots$. The blocks are defined as
\[ B_1 = \{z_1, \ldots, z_h\},\ B_2 = \{z_2, \ldots, z_{h+1}\},\ \ldots,\ B_n = \{z_n, \ldots, z_{n+h-1}\}. \]
One then resamples $B_1^*, \ldots, B_m^*$ from $B_1, \ldots, B_n$.
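The MBB and CBB resampling steps can be sketched directly; one draws $m = n/h$ block starting points and concatenates the blocks. The AR(1)-style test series, seed, and the choice $h \approx n^{1/3}$ below are illustrative assumptions of this sketch, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(5)

def moving_block_bootstrap(y, h):
    """One MBB resample: draw m = n // h of the n - h + 1 overlapping blocks
    of length h and concatenate them."""
    n, m = len(y), len(y) // h
    starts = rng.integers(0, n - h + 1, size=m)
    return np.concatenate([y[s:s + h] for s in starts])

def circular_block_bootstrap(y, h):
    """One CBB resample: the series is extended periodically, so every index
    0 .. n-1 can start a block of length h."""
    n, m = len(y), len(y) // h
    z = np.concatenate([y, y[:h]])        # periodic extension
    starts = rng.integers(0, n, size=m)
    return np.concatenate([z[s:s + h] for s in starts])

# Short-range dependent test series; block length of order n^(1/3).
n = 512
e = rng.normal(size=n)
y = np.empty(n); y[0] = e[0]
for t in range(1, n):
    y[t] = 0.6 * y[t - 1] + e[t]
h = int(round(n ** (1 / 3)))
print(moving_block_bootstrap(y, h).shape, circular_block_bootstrap(y, h).shape)
```

Within blocks the serial dependence of the original series is preserved; independence is imposed only across block boundaries, which is what makes these schemes consistent for short-range dependent stationary series.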
We now give some theoretical properties of the three block bootstrap methods
described above. The results below are due to Lahiri (1999). We need a definition
for the result below.
Definition 20.3. Let $Y_n,\ n = 0, \pm 1, \pm 2, \ldots$ be a stationary time series with covariance function $\gamma(k) = \mathrm{Cov}(Y_t, Y_{t+k}),\ k = 0, \pm 1, \pm 2, \ldots$. The spectral density of the series is the function
\[ f(\omega) = \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} \gamma(k)\, e^{-ik\omega}, \qquad -\pi < \omega \le \pi, \]
where $i = \sqrt{-1}$.
20.1 The Bootstrap 703
(b) For the NBB, there exists $c_2 = c_2(f)$ such that
$$\mathrm{Var}(\hat b_n) = \frac{2\pi^2 c_2 h}{n^3} + o(hn^{-3}), \quad n \to \infty,$$
and for the MBB and CBB,
$$\mathrm{Var}(\hat b_n) = \frac{4\pi^2 c_2 h}{3n^3} + o(hn^{-3}), \quad n \to \infty.$$
(c) For each of NBB, MBB, and CBB, there exists $c_3 = c_3(f)$ such that
$$E(\hat\sigma_n^2) = \sigma_n^2 + \frac{c_3}{nh} + o((nh)^{-1}), \quad n \to \infty.$$
(d) For the NBB, there exists $c_4 = c_4(f)$ such that
$$\mathrm{Var}(\hat\sigma_n^2) = \frac{2\pi^2 c_4 h}{n^3} + o(hn^{-3}), \quad n \to \infty,$$
and for the MBB and CBB,
$$\mathrm{Var}(\hat\sigma_n^2) = \frac{4\pi^2 c_4 h}{3n^3} + o(hn^{-3}), \quad n \to \infty.$$
We now use these expansions to derive optimal block sizes. The asymptotic expansions for the bias and the variance are combined to derive mean-squared error optimal block sizes. For example, for estimating $b_n$ by $\hat b_n$, the leading term in the expansion for the mean-squared error is
$$m(h) = \frac{4\pi^2 c_2 h}{3n^3} + \frac{c_1^2}{n^2 h^2}.$$
Similarly, an optimal block length can be derived for estimating $\sigma_n^2$ by $\hat\sigma_n^2$. We state the following optimal block length result of Lahiri (1999) below.
Theorem 20.6. For the MBB and the CBB, the optimal block length for estimating $b_n$ by $\hat b_n$ satisfies
$$h_{\mathrm{opt}} = \left(\frac{3c_1^2}{2\pi^2 c_2}\right)^{1/3} n^{1/3}\,(1 + o(1)),$$
and the optimal block length for estimating $\sigma_n^2$ by $\hat\sigma_n^2$ satisfies
$$h_{\mathrm{opt}} = \left(\frac{3c_3^2}{2\pi^2 c_4}\right)^{1/3} n^{1/3}\,(1 + o(1)).$$
The constants ci depend on the spectral density f of the process, which would be
unknown in a statistical context. So, the optimal block lengths cannot be directly
used. Plug-in estimates for the ci may be substituted. Or, the formulas can be used
to try block lengths proportional to $n^{1/3}$, with flexible proportionality constants.
There are also other methods in the literature on selection of block lengths; see Hall
et al. (1995) and Politis and White (2004).
However, closed-form formulas for maximum likelihood estimates are rare out-
side of the exponential family structure. In such cases, one must understand the
shape and boundedness properties of the likelihood function, and carefully compute
the maximum likelihood estimate numerically for the observed data. Driven by this
need, Fisher gave the well-known scoring method, the first iterative method for nu-
merical calculation of the maximum likelihood estimate. In problems with a small
number of parameters, the scoring method is known to work well, under some con-
ditions. It is awkward to use when the number of parameters is even moderately
large.
The EM algorithm, formally introduced in Dempster et al. (1977) as a general-purpose iterative numerical method for approximating the maximum likelihood estimate, can be applicable, and even successful, when the scoring method is difficult to apply.
to apply. The EM algorithm has become a mainstay of the numerical approximation
of the maximum likelihood estimate, with widespread applications, quite like max-
imum likelihood itself is the mainstay of the estimation paradigm in parametric
inference. The reputation of one seems to fuel the popularity of the other, although
one of them is a principle, and the other a numerical scheme. The standard reference
on the EM algorithm, its various mutations, and practical applications and properties
is McLachlan and Krishnan (2008). Algorithms very similar to the EM algorithm
were previously described in several places, notably Sundberg (1974) for the case of
exponential families. The basic general algorithm is presented in this section with a
description of some of its known properties and known weaknesses.
Underlying each application of the EM algorithm, there is an implicit element of
missing data, say $Z$, and some observed data, say $Y$. If the missing data $Z$ did become available, one would have the complete data $X = (Y, Z)$. Truly, the likelihood function, say $l(\theta; Y)$, depends only on the data $Y$ that we have, and not on the data that we might have had. However, the EM algorithm effectively fills in the missing data $Z$ using the observed data $Y$ and a current value for the parameter $\theta$, thereby producing a fictitious complete data likelihood $l(\theta; X)$. One finds the projection of this fictitious complete data likelihood $l(\theta; X)$ onto the class of functions that depend only on the actual observed data $Y$, which is then maximized over $\theta \in \Theta$ to produce a candidate maximum likelihood estimate $\hat\theta$. This $\hat\theta$ is used as the next current value for $\theta$, and the process is run until convergence within tolerable fluctuation appears to have been achieved. The filling in part corresponds to the E-part of the
algorithm, and the maximization corresponds to the M-part of the algorithm. Be-
cause statistical models are often such that the logarithm of the likelihood function
is a more manipulable function than the likelihood itself, the algorithm works with
the log-likelihood, for which we use the notation L D log l below.
It is important to note that the so-called missing data Z may be really physically
missing in some problems, whereas in other problems the missing data are imag-
inary, deviously thought of so that the complete data likelihood l.; X / becomes
particularly pleasant and receptive to easy global maximization. In those problems
where the missing data are an artificial construct, there would be a choice as to how
to embed the problem into a missing data structure, and part of the art of the method
is to pick a wise embedding.
706 20 Useful Tools for Statistics and Machine Learning
In general, $\operatorname{argmax}_{\theta \in \Theta}\, \hat L_k(\theta, Y)$ may be a set, rather than a single point. In that case, choose any member of that set to be the next stage current value.
(e) Iterate until convergence to satisfaction appears to have been achieved.
Thus, implementation of the EM algorithm is substantially more straightforward when the calculation of the conditional expectation $E_k[L(\theta; X) \mid Y]$ and the maximization of $\hat L_k(\theta, Y)$ can be done in closed-form. These two closed-form calculations may not be possible unless one has an exponential family structure in the complete data $X$. There have been newer versions of the EM algorithm that try to bypass closed-form calculations of these two quantities; for example, the self-evident idea of using Monte Carlo to calculate $E_k[L(\theta; X) \mid Y]$ when it cannot be done in closed-form is one of the newer versions of the EM algorithm, and is often called the Monte Carlo EM algorithm. We now give some illustrative examples.
Example 20.8 (Poisson with Missing Values). Suppose for some fixed $n \ge 1$, complete data $X_1, \ldots, X_n$ are iid $\mathrm{Poi}(\lambda)$, but the data value is actually reported only if $X_i \ge 2$. This sort of missing data can occur if, for example, the $X_i$ are supposed to be counts of minor accidents per week in $n$ locations, but the values do not get reported if there are too few incidents. If the number of recorded values is $m \le n$, then denoting the recorded values as $Y_1, \ldots, Y_m$, the number of unreported zero values as $Z_0$, and the number of unreported values that equal one as $Z_1$, the complete data $X$ can be represented as $(Y_1, \ldots, Y_m, m, Z_0, Z_1)$; the reported values $Y_1, \ldots, Y_m$ are iid from the conditional distribution of a Poisson variable with mean $\lambda$ given that the variable is larger than 1. Therefore, writing $S_y = \sum_{i=1}^m y_i$, the likelihood based on the complete data is
$$l(\lambda; X) = \frac{e^{-m\lambda}\,\lambda^{S_y}}{\prod_{i=1}^m \left[y_i!\,(1 - e^{-\lambda} - \lambda e^{-\lambda})\right]}\,\left(1 - e^{-\lambda} - \lambda e^{-\lambda}\right)^m \left(e^{-\lambda}\right)^{z_0}\left(\lambda e^{-\lambda}\right)^{z_1} = \frac{e^{-n\lambda}\,\lambda^{S_y + z_1}}{\prod_{i=1}^m y_i!}$$
$$\Rightarrow\quad L(\lambda; X) = \log l(\lambda; X) = -n\lambda + (S_y + z_1)\log\lambda - \sum_{i=1}^m \log y_i!.$$
20.2 The EM Algorithm 707
Therefore,
$$E_k[L(\lambda; X) \mid Y] = -n\lambda + S_y \log\lambda + (\log\lambda)(n - m)\frac{\lambda_k}{1 + \lambda_k} - \sum_{i=1}^m \log y_i!,$$
because $Z_1 \mid (Y_1, \ldots, Y_m, m) \sim \mathrm{Bin}\left(n - m, \frac{\lambda_k}{1 + \lambda_k}\right)$. This is the E-step of the problem. For the M-step, we have to maximize over $\lambda > 0$ the function
$$-n\lambda + \left[S_y + (n - m)\frac{\lambda_k}{1 + \lambda_k}\right]\log\lambda,$$
which is maximized at
$$\lambda = \frac{S_y + (n - m)\frac{\lambda_k}{1 + \lambda_k}}{n}.$$
This takes the position of $\lambda_{k+1}$, and the process is iterated until apparent convergence.
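A short simulation of this iteration, assuming the setup of Example 20.8; the data generation, sample size, and starting value below are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
lam_true, n = 3.0, 500
x = rng.poisson(lam_true, size=n)
y = x[x >= 2]                 # reported values only (values 0 and 1 are lost)
m, S_y = len(y), y.sum()

lam = 1.0                     # starting value lambda_0 (arbitrary choice)
for _ in range(200):
    # E-step folded into the M-step: E_k[Z_1] = (n - m) * lam / (1 + lam)
    lam = (S_y + (n - m) * lam / (1.0 + lam)) / n

print(round(lam, 3))          # should be close to lam_true
```

The update map is a contraction here, so the iterates settle quickly on the fixed point, which is the maximum likelihood estimate based on the observed data.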
Example 20.9 (Bivariate Normal with Missing Coordinates). Suppose for some $n \ge 1$, complete data are iid bivariate normal vectors $(X_{1j}, X_{2j}),\; j = 1, 2, \ldots, n \sim N_2(\mu, \Sigma)$. However, for $n_1$ of the $n$ units, only the $X_1$ coordinate is available, and for another $n_2$ distinct units, only the $X_2$ coordinate is available. For the rest of the $m = n - n_1 - n_2$ units, the data on both coordinates are available. We can therefore write the complete data in the canonical form
$$X = \left(Y_1, \ldots, Y_{n_1}, Y_{n_1+1}, \ldots, Y_{n_1+n_2}, (Y_{11}, Y_{21}), \ldots, (Y_{1m}, Y_{2m}), Z_1, \ldots, Z_{n_1}, Z_{n_1+1}, \ldots, Z_{n_1+n_2}\right),$$
where
$$(Y_i, Z_i),\; 1 \le i \le n_1;\quad (Z_i, Y_i),\; n_1 + 1 \le i \le n_1 + n_2;\quad (Y_{1j}, Y_{2j}),\; 1 \le j \le m \;\overset{\text{iid}}{\sim}\; N_2(\mu, \Sigma).$$
As usual, the notation $Z$ is supposed to stand for the missing data. The parameter vector is $\theta = (\mu, \Sigma) = (\mu_1, \mu_2, \sigma_{11}, \sigma_{12}, \sigma_{22})$. The corresponding values in the $k$th iteration of the algorithm are denoted as $\theta_k = (\mu_1^{(k)}, \mu_2^{(k)}, \sigma_{11}^{(k)}, \sigma_{12}^{(k)}, \sigma_{22}^{(k)})$. We also use the following notation in the rest of this example, because they naturally arise in the calculations of the E-step:
$$\Sigma^{-1} = R = \begin{pmatrix} r_{11} & r_{12} \\ r_{12} & r_{22} \end{pmatrix};$$
$$\rho^{(k)} = \frac{\sigma_{12}^{(k)}}{\sqrt{\sigma_{11}^{(k)}\,\sigma_{22}^{(k)}}};\qquad \alpha_i^{(k)} = \mu_2^{(k)} + \frac{\sigma_{12}^{(k)}}{\sigma_{11}^{(k)}}\left(y_i - \mu_1^{(k)}\right);$$
$$\beta_i^{(k)} = \mu_1^{(k)} + \frac{\sigma_{12}^{(k)}}{\sigma_{22}^{(k)}}\left(y_i - \mu_2^{(k)}\right);$$
$$v_1^{(k)} = \sigma_{11}^{(k)}\left[1 - \left(\rho^{(k)}\right)^2\right];\qquad v_2^{(k)} = \sigma_{22}^{(k)}\left[1 - \left(\rho^{(k)}\right)^2\right].$$
$$\cdots + \sum_{i=n_1+1}^{n_1+n_2}\left\{(z_i - \mu_1)^2 r_{11} + (y_i - \mu_2)^2 r_{22} + 2(z_i - \mu_1)(y_i - \mu_2)\,r_{12}\right\}$$
$$\left. + \sum_{j=1}^{m}\left\{(y_{1j} - \mu_1)^2 r_{11} + (y_{2j} - \mu_2)^2 r_{22} + 2(y_{1j} - \mu_1)(y_{2j} - \mu_2)\,r_{12}\right\}\right].$$
These follow from standard bivariate normal conditional expectation formulas (see Chapter 3). Indeed, in our notation introduced above, for $1 \le i \le n_1$,
$$E_k[Z_i - \mu_2 \mid y_i] = \alpha_i^{(k)} - \mu_2;\qquad E_k\left[(Z_i - \mu_2)^2 \mid y_i\right] = \left(\alpha_i^{(k)} - \mu_2\right)^2 + v_2^{(k)}.$$
So, the E-step can indeed be done in closed-form. The M-step now requires maximization of this expression over $(\mu_1, \mu_2, r_{11}, r_{12}, r_{22})$. Although the expression above for $E_k[L(\theta; X) \mid Y]$ is long, on inspection we can see that it has the same
structure as the logarithm of the density of a general bivariate normal distribution.
Therefore, even the M-step can be done in closed-form by using standard formulas
for maximum likelihood estimates of the mean vector and the covariance matrix in
the general multivariate normal case. Alternatively, the M-step can be done from first principles by simply taking the partial derivatives of the expression above with respect to $\mu_1, \mu_2, r_{11}, r_{12}, r_{22}$ and by solving the five equations obtained from setting these partial derivatives equal to zero. We do not show that calculation here.
Example 20.10 (EM in Estimating ABO Allele Proportions). The ABO blood clas-
sification system is perhaps the most clinically important blood typing system for
humans. All humans can be classified into one of the four phenotypes A, B, AB,
and O. ABO blood typing is essential before blood transfusions, because infusion
of an incompatible blood type has fatal consequences. In fact, it was these observed
fatalities during blood transfusions that led to the discovery of the ABO blood types.
The specific blood type is governed by a single gene with three alleles, which
are also usually denoted as A, B, and O. Each individual receives one of these three
alleles from the father, and one from the mother. Alleles A and B dominate over
allele O. Thus, individuals who have one A allele and one O allele will show as phe-
notype A, and so on, although the true genotype is AO. Because A and B dominate
over allele O, an individual can have phenotype O only if she or he receives an O
allele from each parent.
EM is a natural tool for estimating the allele frequencies, that is, the respective
proportions of A, B, and O alleles among all individuals in a sampling population.
We think of the EM algorithm naturally, because although there are only four phe-
notypes A, B, AB, and O, there are six genotypes, AA, AO, BB, BO, AB, and OO.
Because of the dominance property of the A and the B alleles, we cannot phenotyp-
ically distinguish between AA and AO, or between BB and BO. So, we have some
missing data, and EM fits in very naturally.
Writing $X_1, \ldots, X_6$ for the (partly unobserved) counts of the genotypes AA, AO, BB, BO, AB, and OO, and $Y_1, Y_2, Y_3, Y_4$ for the observed counts of the phenotypes A, B, AB, and O among the $n$ sampled individuals, the complete-data likelihood is
$$l(\theta; X) = c\,(p_A)^{2X_1}(2p_A p_O)^{X_2}(p_B)^{2X_3}(2p_B p_O)^{X_4}(2p_A p_B)^{X_5}(p_O)^{2X_6}.$$
For the E-step,
$$\hat X_{1,k} := E_k(X_1 \mid Y) = Y_1\,\frac{p_{A,k}^2}{p_{A,k}^2 + 2p_{A,k}\,p_{O,k}};\qquad \hat X_{3,k} := E_k(X_3 \mid Y) = Y_2\,\frac{p_{B,k}^2}{p_{B,k}^2 + 2p_{B,k}\,p_{O,k}}.$$
This finishes the E-step, and once again, we are fortunate that we can do it in closed-
form.
For the M-step, we have to maximize this with respect to $(p_A, p_B, p_O)$ over the simplex
$$S = \{(p_A, p_B, p_O) : p_A, p_B, p_O \ge 0,\; p_A + p_B + p_O = 1\}.$$
The maximizing values are
$$p_A = \frac{Y_1 + Y_3 + \hat X_{1,k}}{2n};\qquad p_B = \frac{Y_2 + Y_3 + \hat X_{3,k}}{2n};\qquad p_O = \frac{Y_1 + Y_2 + 2Y_4 - \hat X_{1,k} - \hat X_{3,k}}{2n}.$$
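This is the classic gene-counting iteration; a compact sketch in Python, where the phenotype counts $(Y_1, Y_2, Y_3, Y_4)$ for types A, B, AB, O are invented for illustration and are not from the text.

```python
# EM (gene counting) for ABO allele frequencies.
# Phenotype counts: Y1 = type A, Y2 = type B, Y3 = type AB, Y4 = type O.
Y1, Y2, Y3, Y4 = 186, 38, 13, 284        # illustrative data, not from the text
n = Y1 + Y2 + Y3 + Y4

pA, pB, pO = 1/3, 1/3, 1/3               # starting values
for _ in range(100):
    # E-step: expected counts of AA and BB genotypes within phenotypes A and B
    X1 = Y1 * pA**2 / (pA**2 + 2*pA*pO)
    X3 = Y2 * pB**2 / (pB**2 + 2*pB*pO)
    # M-step: closed-form maximizers over the simplex
    pA = (Y1 + Y3 + X1) / (2*n)
    pB = (Y2 + Y3 + X3) / (2*n)
    pO = (Y1 + Y2 + 2*Y4 - X1 - X3) / (2*n)

print(round(pA, 4), round(pB, 4), round(pO, 4))
```

Note that the three updates sum to 1 identically at every iteration, so the iterates never leave the simplex $S$.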
In its search for the global maximum of the likelihood function, the EM algorithm
has some positive properties and some murky properties. The EM algorithm does
not behave erratically. In fact, in each iteration the EM algorithm produces a value of
the likelihood function that is at least as large as the value at the previous iteration.
This is known as monotonicity of the EM algorithm. This property is mathemati-
cally demonstrable, and is proved below. The ideal goal of the EM algorithm is to
ultimately arrive at or very very close to the global maximum value of the likeli-
hood function, and the MLE. In this, the EM algorithm has mixed success. There
are no all-at-one-time theorems which show that iterates of the EM algorithm are
guaranteed to lead eventually to the correct global maximum. Indeed, there cannot
be such a theorem, because there are widely available counterexamples to it. What
is true is that under frequently satisfied conditions, iterates of the EM algorithm will
converge to a point of stationarity of the likelihood function. The starting value $\theta_0$ determines to which stationary point the EM iterates converge. If we are willing
to assume quite a bit more structure, such as that of a multiparameter exponential
family, or a strongly unimodal likelihood function, then convergence to a global
maximum can be assured. However, the EM algorithm has the reputation of con-
verging very slowly to the global maximum, if it does at all. Although it has the
monotonicity property, the ascent to the peak value can be slow. The main reference
for this topic is Wu (1983). The work is nicely summarized in several other places,
specifically McLachlan and Krishnan (2008). The main facts are described with a
classic example below.
Theorem 20.7. Let $l(\theta; y)$ denote the likelihood function on the basis of the observed data $Y = y$, and let $\theta_k,\; k \ge 0$ be the sequence of EM iterates. Then $l(\theta_{k+1}; y) \ge l(\theta_k; y)$ for all $k \ge 0$.

Proof. We recall the notation. The complete data $X = (Y, Z)$, where $Y$ are the actually observed data. The joint density of $(Y, Z)$ under $\theta$ is $f_\theta(y, z)$, and the marginal density of $Y$ under $\theta$ is $g_\theta(y)$. Thus, $l(\theta; y) = g_\theta(y)$ and $L(\theta; y) = \log l(\theta; y) = \log g_\theta(y)$. We also need the Kullback–Leibler distance inequality $E_p(\log p) \ge E_p(\log q)$, where $p, q$ are two densities on some common space (see Chapter 15).
$$\hat L_k(\theta_{k+1}, y) \ge L(\theta_k; y),$$
which is the claim of the theorem. To prove that $\hat L_k(\theta, y) - L(\theta, y)$ is maximized at $\theta = \theta_k$, we observe that
(because if we identify $\frac{f_\theta(y, Z)}{g_\theta(y)}$ as $q$, and $\frac{f_{\theta_k}(y, Z)}{g_{\theta_k}(y)}$ as $p$, then the Kullback–Leibler distance inequality $E_p(\log p) \ge E_p(\log q)$ is exactly the inequality in the last line)
$$(1, 1),\; (1, -1),\; (-1, 1),\; (-1, -1),\; (2, *),\; (-2, *),\; (2, *),\; (-2, *),\; (*, 2),\; (*, -2),\; (*, 2),\; (*, -2),$$
where $*$ denotes a missing coordinate. The data have been deliberately so constructed as to produce a log-likelihood function $L(\theta; y)$ with several local maxima, and two global maxima. The global maxima are $\sigma_1^2 = \sigma_2^2 = \frac{8}{3},\; \rho = \pm\frac{1}{2}$. There are other stationary points of the log-likelihood function $L(\theta; y)$, one of which has $\sigma_1^2 = \sigma_2^2 = \frac{5}{2},\; \rho = 0$. Actually, this point is moreover a saddle point of $L(\theta; y)$; that is, the Hessian matrix of the function $L(\theta; y)$ is indefinite at this particular stationary point. So, the point is not a local maximum, or a local minimum. Coming to what the EM algorithm does in this case, if the initial choice $\theta_0$ has $\rho_0 = 0$, then for all $k \ge 1,\; \rho_k = 0$, and the sequence of EM iterates converges to the saddle point given above. The problem with the application of the EM algorithm in this example is exactly the fact that once the EM reaches $\rho = 0$ at any iteration, it fails to move out of there at any subsequent iteration. It is indefinitely trapped at $\rho = 0$, and can only maximize $L(\theta; y)$ over the submanifold $\{(\sigma_1^2, \sigma_2^2, \rho) : \rho = 0\}$. On that submanifold, the saddle point is the unique maximum; but it is not a global, or even a local, maximum.
Wu (1983) gives the following theorem on the convergence of the EM algorithm,
and this theorem is essentially the best possible that can be said.
Theorem 20.8. Define the map
$$\hat L(\theta, \theta') = E_{\theta'}\left(L(\theta; X) \mid Y\right).$$
Assume that
(a) $\Theta$ is a subset of some Euclidean space $\mathbb{R}^d$.
(b) For any $y$, $L(\theta; y)$ is continuous on $\Theta$ and once partially differentiable with respect to each coordinate of $\theta$ in the interior of $\Theta$.
(c) $L(\theta_0; y) > -\infty$.
(d) The sets $\{\theta : L(\theta; y) \ge c\}$ are compact for all real $c$.
(e) $\hat L(\theta, \theta')$ is jointly continuous on $\Theta \otimes \Theta$.
Then,
(i) The sequence of iterates $\{L(\theta_k; y)\}$ converges to $L(\theta^*; y)$ for some stationary point $\theta^*$ of $L(\theta; y)$.
(ii) Any convergent subsequence of the iterates $\{\theta_k\}$ converges to some stationary point of $L(\theta; y)$.
Suppose, in addition, we assume that
(f) $L(\theta; y)$ is concave on $\Theta$ with a unique stationary point $\hat\theta$.
(g) The gradient vector $\nabla \hat L(\theta, \theta')$ is jointly continuous on $\Theta \otimes \Theta$.
Then the sequence of iterates $\{\theta_k\}$ has only one limit point, and the limit coincides with $\hat\theta$, which is the unique MLE of $\theta$.
Wu (1983) uses general theorems on limit points of iterated point-to-set maps for
proving the above theorem.
20.2.3 Modifications of EM
The basic EM algorithm may not be exactly applicable, or may be inefficient due to slow convergence, in some important problems. The basic EM algorithm rests on two
assumptions: that the E-step can be done in closed-form, and that the M-step can
also be done in closed form. Of these, failure of the second assumption may not be
computationally too damaging in low dimensions, because numerical maximization
algorithms on the complete data likelihood may be easily implementable. For ex-
ample, numerical maximizations that use Newton–Raphson methods in the M-step
are known as the EM gradient algorithm; see McLachlan and Krishnan (2008) and
Lange (1999).
Failure to perform the E-step in closed-form almost certainly necessitates Monte Carlo evaluation of the expectation $E_k(L(\theta; X) \mid Y)$. This is called the Monte Carlo EM algorithm. However, this has to be done for a fine grid of $\theta$ values, because the subsequent M-step requires the full function $\hat L_k(\theta, y)$. This may be time consuming, depending on $d$, the dimension of $\theta$. It is also very important to note that
substitution of Monte Carlo for analytic calculations in the E-step calls for simula-
tion from the conditional distribution of Z (the missing data) given Y (the observed
data). This can be cumbersome, and even very difficult. In such a case, the Gibbs
sampler may be useful, because the Gibbs sampler is specially designed for this (see
Chapter 19). An unfortunate consequence of using Monte Carlo to accomplish the
E-step is that the monotone ascent property germane to the basic EM algorithm is
now usually lost. These ideas are described in McLachlan and Krishnan (2008), Wei
and Tanner (1990), Chan and Ledolter (1995), and Levine and Casella (2001).
The EM is an optimization scheme for finding the maximum value of a function. As such, the idea can also be applied, verbatim, to approximate the maximum of a posterior density, that is, to approximate a posterior mode. This is known as Bayesian EM. The posterior density of the parameter based on the complete data likelihood is $\pi(\theta \mid X) = c\, l(\theta; X)\,\pi(\theta)$, $\pi(\theta)$ being the prior density. Consequently, the E-step has only the trivial modification that we now have the extra term $\log \pi(\theta)$ added to the usual term $\hat L_k(\theta, y)$. The M-step should not be much more complex, unless the prior density is multimodal or somehow badly behaved. To an error of $O(n^{-1})$, the approximation to the mode will also provide an approximation to the mean of the posterior under enough regularity conditions; see DasGupta (2008) and Bickel and Doksum (2006) for such Bayesian asymptotics.
Accumulated user experience shows that the monotone ascent of the EM algo-
rithm is often very slow. Certain modifications of the basic EM algorithm have been
suggested to enhance the speed of practical convergence. Typically, these blend time-tested purely numerical analysis tools, such as one- or two-term Taylor expansions, with the EM algorithm itself. Collectively, these schemes are known as
accelerated EM algorithms. Once again, accelerated EM algorithms do not have
the monotone ascent property, and are also more difficult to code. The methods are
described in detail with many references in McLachlan and Krishnan (2008).
20.3 Kernels and Classification 715
Smoothing noisy data or a rough function in order to extract the main features out of the noisy data is a long-standing and time-tested principle in quantitative science. For example, consider the CDF of a Poisson random variable $Y \sim \mathrm{Poi}(5)$. The CDF of $Y$ of course is not smooth; it is not even continuous. Suppose now we add to $Y$ a small independent Gaussian random variable $Z \sim N(0, (.01)^2)$. The sum $X = Y + Z$ has a continuous distribution with a density, and the CDF of $X$ is not only continuous, but even infinitely differentiable. In this example, we used convolution to smooth a nonsmooth CDF. Convolution is a special case of kernel smoothing, a particular type of smoothing that has found wide applications in statistics, machine learning, and approximation theory. This section provides a basic treatment of the theory and applications of kernels, with examples.
Definition 20.4. A function $K : \mathbb{R}^d \to \mathbb{R}$ is called a kernel if $\int_{\mathbb{R}^d} |K(x)|\,dx < \infty$, and $\int_{\mathbb{R}^d} K(x)\,dx = 1$.

In applications, we often take $K \ge 0$, in which case a kernel is just a probability density function on $\mathbb{R}^d$. Moreover, we also often take $K$ to be symmetric, in the sense that $K(-x) = K(x)$ for all $x \in \mathbb{R}^d$.

Kernels are often used as a smoothing device via the operation of convolution, as defined below.

Definition 20.5. Let $f : \mathbb{R}^d \to \mathbb{R}$ be an $L_1$ function: $\int_{\mathbb{R}^d} |f(x)|\,dx < \infty$. Let $K$ be any kernel on $\mathbb{R}^d$. The convolution of $f$ and $K$ is defined as
$$(f * K)(x) = \int_{\mathbb{R}^d} K(x - y)\,f(y)\,dy.$$
Convolutions with kernels generally have two important properties that lead to the wide acceptance of the principle of kernel smoothing:
(a) The convolution $f * K$ will generally have some extra smoothness in comparison to $f$. The exact nature of the extra smoothness will depend on both $f$ and $K$.
(b) If we smooth $f$ by a sequence of increasingly spiky kernels $K_n$, then the sequence of convolutions $f * K_n$ will converge in some meaningful sense, and often in a very strong sense, to $f$.
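Property (a) can be seen numerically. The following is a small sketch (our own example, not from the text) that smooths a noisy step function by discrete convolution with a Gaussian kernel on a grid.

```python
import numpy as np

# Smooth a noisy step function by discrete convolution with a Gaussian kernel.
x = np.linspace(-3, 3, 601)
dx = x[1] - x[0]
rng = np.random.default_rng(1)
f = (x > 0).astype(float) + 0.1 * rng.normal(size=x.size)  # rough function

h = 0.25                                   # spikiness scale of the kernel
K = np.exp(-0.5 * (x / h) ** 2)
K /= K.sum() * dx                          # normalize so the kernel integrates to 1

f_smooth = np.convolve(f, K, mode="same") * dx

# The smoothed version has much smaller increments (less roughness)
print(np.abs(np.diff(f)).max(), np.abs(np.diff(f_smooth)).max())
```

Shrinking $h$ makes the kernel spikier and $f * K$ closer to $f$, which is exactly the trade-off in property (b).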
These two properties together give us what we want: close approximation of a noisy function by a nicer smooth function. The following basic theorem gives a smoothing and an approximation property of convolutions with kernels. For the rest of this
chapter we use the following standard notation for $L_p$ norms and $L_p$ spaces of functions:
$$\|f\|_p = \left(\int_{\mathbb{R}^d} |f|^p\,dx\right)^{1/p},\quad 0 < p < \infty;\qquad \|f\|_\infty = \sup_{x \in \mathbb{R}^d} |f(x)|;$$
$$L_p\left(\mathbb{R}^d\right) = \left\{f : \|f\|_p < \infty\right\},\quad 0 < p \le \infty.$$
For part (b), the key step is to use the fact that $K_n$ is a kernel for any $n \ge 1$, and hence each $K_n$ integrates to 1 on $\mathbb{R}^d$. Therefore, writing $g(z) = \int_{\mathbb{R}^d} |f(x - z) - f(x)|\,dx$, we get
$$\int_{\mathbb{R}^d} |f * K_n - f|\,dx = \int_{\mathbb{R}^d}\left|\int_{\mathbb{R}^d} K_n(x - y)\,[f(y) - f(x)]\,dy\right| dx$$
$$= \int_{\mathbb{R}^d}\left|\int_{\mathbb{R}^d} K_n(z)\,[f(x - z) - f(x)]\,dz\right| dx \le \int_{\mathbb{R}^d} |K_n(z)|\,g(z)\,dz.$$
Break the integral in the last line into two sets, a ball $B(0, \delta)$ of a suitable radius $\delta$ and the complement $\mathbb{R}^d - B(0, \delta)$. The first integral $\int_{B(0,\delta)} |K_n(z)|\,g(z)\,dz$ becomes small for large $n$, because $g$ is continuous at $z = 0$ and $g(0) = 0$, and because $K_n$ is integrable and indeed, $\|K_n\|_1$ is just a fixed constant. The second integral, over $\mathbb{R}^d - B(0, \delta)$, can also be made small for large $n$, because $f$ is integrable, which forces $g$ to be integrable, and because as $n$ gets large, the integrals $\int_{\mathbb{R}^d - B(0,\delta)} |K_n|$ become small. Putting these together, we get the convergence result of part (b).

Part (c) is an easy consequence of the dominated convergence theorem. For example, by the assumed boundedness of the partial derivatives $\frac{\partial}{\partial x_i} K$,
$$\int_{\mathbb{R}^d} \|\nabla K(x - y)\|\,|f(y)|\,dy \le c \int_{\mathbb{R}^d} |f(y)|\,dy < \infty.$$
The only mathematical requirement of a kernel is that it be integrable, and that $\int_{\mathbb{R}^d} K(x)\,dx$ should equal 1. In practice, we choose our kernels to have several additional properties from the following list.
(a) Nonnegativity. $K(x) \ge 0$ for all $x$.
(b) Isotropic. $K(x) = h(\|x\|)$ for some function $h : \mathbb{R}^+ \to \mathbb{R}$.
(c) Fourier or Positive Definiteness Property. $K$ is isotropic, and $h(t)$ is the characteristic function of some symmetric random variable $X$; that is, $h(t)$ has the representation $h(t) = \int_{\mathbb{R}} e^{itx}\,dF(x) = \int_{\mathbb{R}} \cos(tx)\,dF(x)$ for some CDF $F$ that has the symmetry property $P_F(X \le -x) = P_F(X \ge x)$ for all $x$.
(d) Rapid Decay. $K(x)$ converges rapidly to zero as $\|x\| \to \infty$. One type of rapid decay is that $\int_{\mathbb{R}^d} \left(|x_1|^{\alpha_1} \cdots |x_d|^{\alpha_d}\right) K(x_1, \ldots, x_d)\,dx_1 \cdots dx_d$ should be finite for suitably large $\alpha_1, \ldots, \alpha_d > 0$.
(e) Compact Support. For some compact set $C$, $K(x) = 0$ if $x \notin C$.
(f) Smoothness. $K(x)$ has sufficient smoothness, for example, that it be continuous, or uniformly continuous, or have some derivatives, or that it belong to a suitable Sobolev space.
The choice of the kernel depends on the nature of the problem for which it will be
used. Kernel methods are widely used in statistical density estimation, in classifi-
cation, in simply smoothing an erratic function, and in various other approximate
reconstruction problems. Good kernels for density estimation need not be good for
classification. The following table lists some common kernels in use in statistics,
computer science, and machine learning.
Kernel | Formula
Exponential | $e^{-a\|x\|}$
Laplace | $e^{-a\|x\|_1}$
Spherical | $\left(1 - \frac{3}{2}\|x\| + \frac{1}{2}\|x\|^3\right) I_{\|x\| \le 1}$
Polynomial | $\sum_{i=1}^m c_i \|x\|^{k_i}\, I_{\|x\| \le 1}$
Hermite | $(1 - \|x\|^2)\, e^{-\frac{a}{2}\|x\|^2}$
Wave | $\frac{\sin(c\|x\|)}{c\|x\|}$
Product | $\prod_{i=1}^d (1 - x_i^2)\, I_{|x_i| \le 1}$
Product | $\prod_{i=1}^d \frac{1}{1 + x_i^2}$
Fejér $(d = 1)$ | $\frac{1}{k}\left(\frac{\sin\frac{kx}{2}}{\sin\frac{x}{2}}\right)^2$
de la Vallée-Poussin $(d = 1)$ | $\left(\frac{\sin x}{x}\right)^2$
We plot some of these kernels in one and two dimensions in Fig. 20.2. Note how some are quite flat, others spiky, some unimodal, and the Fejér kernel wavy. The choice would depend on exactly what one wants to achieve in a particular problem. Generally, a flatter kernel would lead to more smoothing.
[Fig. 20.2: plots of several kernels in one and two dimensions. Top panels: Gaussian and Laplace kernels with a = 1. Bottom panels: de la Vallée-Poussin and Fejér kernels with k = 5.]
The basic problem of density estimation is the following. Suppose we have $n$ iid observations $X_1, \ldots, X_n$ from a density $f(x)$ on a Euclidean space $\mathbb{R}^d$. We do not know the density function $f(x)$ and we want to estimate it. In some problems,
" n #
X
ln
1 X
fO.x/ D k .Xi / k .x/;
n
kD0 i D1
The function $K(\cdot)$ is a kernel and $\hat f_K$ is called the kernel density estimator. The scaling factor $h = h_n$ is called the bandwidth; it must be suitably small so the density estimate does not suffer from large biases. Suppose the kernel function is a Gaussian kernel. In such a case, $\hat f_K$ is a mixture of $N(X_i, h_n^2)$ densities. Thus, the kernel density estimate will take normal densities with small widths centered at the data values and then blend them together. This fundamental and seminal idea was initiated in Rosenblatt (1956). Kernel density estimates are by far the most popular nonparametric density estimates. They generally provide the best rates of convergence, as well as provide a great deal of flexibility through the choice of the kernel. However, in an asymptotic sense, the choice of the kernel is of less importance than the choice of the bandwidth.
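A minimal Gaussian-kernel density estimate, written directly from the formula $\hat f_K(x) = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{x - X_i}{h}\right)$; the sample, grid, and bandwidth below are illustrative choices of our own.

```python
import numpy as np

def kde(x_grid, data, h):
    """Kernel density estimate with a standard Gaussian kernel."""
    z = (x_grid[:, None] - data[None, :]) / h          # (grid, n) matrix
    K = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return K.sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(3)
data = rng.normal(0.0, 1.0, size=1000)
x_grid = np.linspace(-4, 4, 801)
fhat = kde(x_grid, data, h=0.3)

# The estimate integrates to (approximately) 1 and peaks near the true mode 0
print(np.trapz(fhat, x_grid))
```

For a standard normal sample this places a small Gaussian bump of width $h$ at each observation and averages them, which is exactly the mixture-of-$N(X_i, h^2)$ description above.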
As in most statistical inference problems, there are two issues: systematic error (in the form of bias) and random error (in the form of variance). To include both of these aspects, we consider the mean squared error (MSE). If $\hat f_n$ is some estimate of the unknown density $f$, we want to consider
Theorem 20.11. Assume that $K$ is nonnegative and integrable. Also assume that $f$ is uniformly bounded by some $M < \infty$. Suppose $h = h_n \to 0$ as $n \to \infty$. Then, for any $x$ that is a continuity point of $f$,
$$E[\hat f_n(x)] \to f(x)\int K(z)\,dz, \quad \text{as } n \to \infty.$$
Proof. Since the $X_i$ are iid and $\hat f_n(x) = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{x - X_i}{h}\right)$, we get
$$E[\hat f_n(x)] = \frac{1}{h}\int_{-\infty}^{\infty} K\left(\frac{x - z}{h}\right) f(z)\,dz = \int_{-\infty}^{\infty} K(z)\,f(x - hz)\,dz.$$
R
In particular, if K.z/ d z D 1, then fOn .x/ is asymptotically unbiased at all conti-
nuity points of f , provided that h D hn ! 0. t
u
Next, we consider the variance of $\hat f_n$. Consistency of the kernel estimate does not follow from asymptotic unbiasedness alone; we need something more. To get a stronger result, we need to assume more than simply $h_n \to 0$. Loosely stated, we want to drive the variance of $\hat f_n$ to zero, and for this we must have more than $h_n \to 0$. Obviously, because $\hat f_n(x)$ is essentially a sample mean,
$$\mathrm{Var}[\hat f_n(x)] = \frac{1}{n}\,\mathrm{Var}\left[\frac{1}{h}\,K\left(\frac{x - X}{h}\right)\right],$$
which implies
$$nh\,\mathrm{Var}[\hat f_n(x)] = \frac{1}{h}\,\mathrm{Var}\left[K\left(\frac{x - X}{h}\right)\right].$$
Suppose $K \in L_2$. We already know that for all continuity points $x$, $E[\hat f_n(x)]$ converges to $f(x)\|K\|_1$; so it follows that
$$h\left(E[\hat f_n(x)]\right)^2 \to 0, \quad h \to 0.$$
Consequently,
$$\hat f_n(x) \xrightarrow{P} f(x), \quad \text{provided } h \to 0 \text{ and } nh \to \infty.$$
Let $k_r = \frac{1}{r!}\int z^r K(z)\,dz$. By decomposing the MSE into variance and the squared bias, we get
$$E[\hat f_n(x) - f(x)]^2 \approx \frac{f(x)}{nh}\,\|K\|_2^2 + h^{2r}\,\left|k_r f^{(r)}(x)\right|^2,$$
and balancing the two terms by taking $h \asymp n^{-1/(2r+1)}$ gives
$$\mathrm{MSE} \asymp n^{-2r/(2r+1)}.$$
We often assume that $r = 2$, in which case the pointwise convergence rate is $\mathrm{MSE} = O(n^{-4/5})$, slower than the parametric $n^{-1}$ rate. Note that it is expected that in a nonparametric setup, we cannot get the parametric $n^{-1/2}$ convergence rate. See DasGupta (2008) for the proofs.
$$\alpha(F, g) = P_F(g(X) \ne Y).$$
where the infimum is taken over all possible functions $g : \mathbb{R}^d \to \{1, 2\}$. The word oracular is supposed to convey the concept that only a person with oracular access to knowledge of $F$ can find the rule $g_0$ that makes $\alpha(F, g_0) = \alpha(F)$.
On the other hand, we have our real-life classification rules $g_n(x_1, \ldots, x_n, y_1, \ldots, y_n, x)$, and they have their conditional error probabilities
$$\alpha_n(F, g_n) = P_F\left(g_n \ne Y \mid (x_1, \ldots, x_n, y_1, \ldots, y_n)\right).$$
fit for some smoothness in the form of the function $f(x)$. A mathematical formulation of such a constrained minimization problem is to minimize
$$\sum_{i=1}^n L(y_i, f(x_i)) + \lambda\,\Theta(f)$$
over some specified function space $\mathcal{F}$, where $L(y, f(x))$ is a loss function that measures the goodness of our fit, $\Theta(f)$ is a real valued functional that measures the roughness of $f$, and $\lambda$ is a tuning constant that reflects the importance that we place on using a smooth function. For example, if we use $\lambda = 0$, then that means that all we care for is a good fit, and smoothness is of no importance to us. The loss function $L$ is typically a function such as $(y - f(x))^2$ or $|y - f(x)|$, but could be more general. The roughness penalty functional $\Theta(f)$ is often something like $\int (f'(x))^2\,dx$, although it too can be more general.
On the face of it, this is an infinite-dimensional optimization problem, because the
function space $\mathcal{F}$ would usually be infinite dimensional, unless we make the choice
of functions too restrictive. Reproducing kernels allow us to transform such an infinite-dimensional problem into finding a function of the form $\sum_{i=1}^{n} c_i K(x_i, x)$,
where $K(x, x')$ is a kernel, associated in a unique way with the function space $\mathcal{F}$.
So, as long as we can identify what this reproducing kernel $K(x, x')$ is, all we have
to do to solve our original infinite-dimensional optimization problem is to find the
$n$ optimal constants $c_1, \dots, c_n$. Such a kernel $K(x, x')$ uniquely associated with the
function space $\mathcal{F}$ can be found, as long as $\mathcal{F}$ has a nice amount of structure. The
structure needed is that of a special kind of Hilbert space. Aronszajn (1950) is the
original reference on the theory of reproducing kernels. We first provide a basic
treatment of Hilbert spaces themselves; this is essential for studying reproducing
kernels. Rudin (1986) is an excellent first exposition on Hilbert spaces.
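As a concrete illustration of this reduction to $n$ constants, with squared error loss and the RKHS norm as the roughness penalty, the minimizer $f(x) = \sum_i c_i K(x_i, x)$ has coefficients solving the $n \times n$ linear system $(K + \lambda I)c = y$. The sketch below is a minimal version of this computation; the Gaussian kernel, the bandwidth, and the tuning constant $\lambda$ are purely illustrative choices, not prescriptions from the text.

```python
import numpy as np

def gaussian_kernel(x, y, h=0.2):
    # an illustrative kernel choice; any reproducing kernel K(x, x') works here
    return np.exp(-(x - y) ** 2 / (2 * h ** 2))

def kernel_ridge_fit(x_train, y_train, lam=1e-3, h=0.2):
    # Gram matrix with entries K(x_i, x_j)
    K = gaussian_kernel(x_train[:, None], x_train[None, :], h)
    # the finite-dimensional problem: solve (K + lam I) c = y for c_1, ..., c_n
    return np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

def kernel_ridge_predict(c, x_train, x_new, h=0.2):
    # evaluate f(x) = sum_i c_i K(x_i, x)
    return gaussian_kernel(x_new[:, None], x_train[None, :], h) @ c

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)
c = kernel_ridge_fit(x, y)
yhat = kernel_ridge_predict(c, x, x)
```

The infinite-dimensional search over $\mathcal{F}$ has been replaced by one linear solve for the vector $c$.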
Definition 20.6. A real vector space $\mathcal{H}$ is called an inner product space if there is a
function $(x, y): \mathcal{H} \otimes \mathcal{H} \to \mathbb{R}$ such that
(a) $(x, x) \ge 0$ for all $x \in \mathcal{H}$, with $(x, x) = 0$ if and only if $x = 0$, the null element
of $\mathcal{H}$.
(b) $(a_1 x_1 + a_2 x_2, y) = a_1(x_1, y) + a_2(x_2, y)$ for all $x_1, x_2, y \in \mathcal{H}$, and all real
numbers $a_1, a_2$.
(c) $(x, y) = (y, x)$ for all $x, y \in \mathcal{H}$.
The function $(x, y)$ is called the inner product of $x$ and $y$, and $\sqrt{(x, x)} = \|x\|_{\mathcal{H}}$
the norm of $x$. The function $d(x, y) = \|x - y\|_{\mathcal{H}}$ is called the distance between $x$
and $y$.
Inner product spaces are the most geometrically natural generalizations of Euclidean spaces, because we can talk about length, distance, and angle on inner
product spaces by considering $\|x\|_{\mathcal{H}}$, $(x, y)$, and $d(x, y)$.
Definition 20.7. Let $\mathcal{H}$ be an inner product space. The angle $\theta$ between $x$ and $y$ is
defined as $\theta = \arccos \frac{(x, y)}{\|x\|_{\mathcal{H}}\,\|y\|_{\mathcal{H}}}$. $x, y$ are called orthogonal if $(x, y) = 0$, so that
$\theta = \frac{\pi}{2}$. For $n$ given vectors $x_1, \dots, x_n$, the linear span of $x_1, \dots, x_n$ is defined to be
20.3 Kernels and Classification 727
the set of all $z \in \mathcal{H}$ of the form $z = \sum_{i=1}^{n} c_i x_i$, where $c_1, \dots, c_n$ are arbitrary real
constants. The projection of a given $y$ onto the linear span of $x_1, \dots, x_n$ is defined to
be $P_{x_1, \dots, x_n}\, y = \sum_{i=1}^{n} c_i x_i$, where $(c_1, \dots, c_n) = \operatorname{argmin}_{c_1, \dots, c_n} d\bigl(y, \sum_{i=1}^{n} c_i x_i\bigr)$.
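Numerically, this minimization reduces to solving the normal equations $Gc = b$, where $G_{ij} = (x_i, x_j)$ is the Gram matrix and $b_i = (x_i, y)$. A minimal sketch in $\mathbb{R}^d$ with the usual inner product (the vectors are illustrative):

```python
import numpy as np

def project_onto_span(y, xs):
    """Project y onto span{x_1, ..., x_n} by solving the normal equations."""
    X = np.stack(xs, axis=1)           # columns are the x_i
    G = X.T @ X                        # Gram matrix of inner products (x_i, x_j)
    b = X.T @ y                        # inner products (x_i, y)
    c = np.linalg.solve(G, b)          # the minimizing coefficients c_i
    return X @ c                       # the projection sum_i c_i x_i

y = np.array([1.0, 2.0, 3.0])
x1 = np.array([1.0, 0.0, 0.0])
x2 = np.array([0.0, 1.0, 0.0])
p = project_onto_span(y, [x1, x2])     # → [1., 2., 0.]
```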
A generalization of part (g) is given in Exercise 20.27. Inner product spaces that
have the property of completeness are Hilbert spaces. The following example
demonstrates what completeness is all about, and why it need not hold for arbi-
trary inner product spaces.
Example 20.13 (An Incomplete Inner Product Space). Consider $\mathcal{H} = C[0, 1]$
equipped with the inner product $(f, g) = \int_0^1 f(x)g(x)\,dx$. Consider the sequence
of functions $f_n \in \mathcal{H}$ defined as follows:
$$f_n(x) = 0 \ \text{for}\ x \in \left[0, \tfrac{1}{2}\right]; \qquad f_n(x) = 1 \ \text{for}\ x \in \left[\tfrac{1}{2} + \tfrac{1}{n},\, 1\right];$$
$$\text{and}\quad f_n(x) = n\left(x - \tfrac{1}{2}\right) \ \text{for}\ x \in \left(\tfrac{1}{2},\, \tfrac{1}{2} + \tfrac{1}{n}\right).$$
Now choose $m, n \to \infty$, and suppose $m < n$. Then, it is clear that the graphs of
$f_m$ and $f_n$ coincide except on $[\frac{1}{2}, \frac{1}{2} + \frac{1}{m}]$, and it follows easily that $d(f_m, f_n) =
\bigl(\int_0^1 (f_m - f_n)^2\bigr)^{1/2} \to 0$. This means that the sequence $f_n \in \mathcal{H}$ is a Cauchy sequence. However, the sequence does not have a continuous limit, i.e., there is no
function $f$, continuous on $[0, 1]$, such that $\int_0^1 (f_n - f)^2 \to 0$. So, here we have an
inner product space which is not complete, in the sense that we can have sequences
which are Cauchy with respect to the distance induced by the inner product, which
nevertheless do not have a limit within the same inner product space. Spaces on
which such Cauchy sequences cannot be found form Hilbert spaces.
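A quick numerical companion to this example (a sketch, using a simple grid approximation of the $L_2[0,1]$ distance) shows both the Cauchy behavior and the convergence of $f_n$ to the discontinuous step function:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 200001)      # fine grid on [0, 1]
dx = x[1] - x[0]

def f(n):
    # the piecewise-linear ramp from the example: 0, then n(x - 1/2), then 1
    return np.clip(n * (x - 0.5), 0.0, 1.0)

def d(u, v):
    # grid approximation of the L2 distance induced by the inner product
    return float(np.sqrt(np.sum((u - v) ** 2) * dx))

step = (x > 0.5).astype(float)         # the discontinuous pointwise limit
print(d(f(10), f(100)))                # small: the sequence is Cauchy
print(d(f(1000), step))                # near 0: the L2 limit is the step function
```

The limit is a step function, not a member of $C[0,1]$, which is exactly the failure of completeness.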
Definition 20.8. Let $\mathcal{H}$ be an inner product space such that every Cauchy sequence
$x_n \in \mathcal{H}$ converges to some $x \in \mathcal{H}$. Then $\mathcal{H}$ is called a Hilbert space with the Hilbert
norm $\|x\|_{\mathcal{H}} = \sqrt{(x, x)}$.
Among the $L_p$ spaces, $L_2[a, b]$ is a Hilbert space; it has an inner product, and the
inner product is complete. Besides $L_2$, the other $L_p$ spaces are not Hilbert spaces,
because their norms $\|f\|_{\mathcal{H}} = \|f\|_p = \bigl(\int |f|^p\bigr)^{1/p}$ are not induced by an inner
product. We already saw that $C[a, b]$ is not a Hilbert space, because the inner product is not complete. The real line with the usual inner product, however, is a Hilbert
space, because a standard theorem in real analysis says that all Cauchy sequences
must converge to some real number. In fact, essentially the same proof shows that
all finite-dimensional Euclidean spaces are Hilbert spaces.
In the finite-dimensional Euclidean space $\mathbb{R}^n$, the standard unit vectors $e_k =
(0, \dots, 0, 1, 0, \dots, 0)$, $k = 1, 2, \dots, n$ form an orthonormal basis, in the sense that
the set $\{e_k\}$ is an orthonormal set of $n$-vectors, and any $x \in \mathbb{R}^n$ may be represented
as $x = \sum_{k=1}^{n} c_k e_k$, where the $c_k$ are real constants. Furthermore, $\|x\|^2 = \sum_{k=1}^{n} c_k^2$.
(a) $\mathcal{H}$ has an orthonormal basis $B$, that is, a set of vectors $\{e_\alpha\}$ of $\mathcal{H}$ such that
$\|e_\alpha\|_{\mathcal{H}} = 1$, $(e_\alpha, e_\beta) = 0$ for all $\alpha, \beta$, $\alpha \ne \beta$, and the linear span of the vectors
in $B$ is dense in $\mathcal{H}$ with respect to the Hilbert norm on $\mathcal{H}$.
(b) Given any $x \in \mathcal{H}$, at most countably many among $(x, e_\alpha)$ are not equal to zero,
and $x$ may be represented in the form $x = \sum_\alpha (x, e_\alpha) e_\alpha$.
(c) $\|x\|_{\mathcal{H}}^2 = \sum_\alpha |(x, e_\alpha)|^2$.
See Rudin (1986) for a proof. It may be shown that all orthonormal bases of a Hilbert
space H have the same cardinality, which is called the dimension of H.
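In the finite-dimensional case, the basis expansion in part (b) and Parseval's identity in part (c) are easy to verify numerically. A minimal sketch in $\mathbb{R}^3$, using an orthonormal basis obtained from a QR decomposition (the particular vectors are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# an orthonormal basis of R^3: the columns of Q from a QR decomposition
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
e = [Q[:, k] for k in range(3)]

x = np.array([2.0, -1.0, 0.5])
coeffs = [float(x @ ek) for ek in e]          # c_k = (x, e_k)

# x = sum_k c_k e_k, and ||x||^2 = sum_k c_k^2 (Parseval)
recon = sum(ck * ek for ck, ek in zip(coeffs, e))
assert np.allclose(recon, x)
assert np.isclose(sum(ck ** 2 for ck in coeffs), x @ x)
```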
With these preliminaries, we can now proceed to the topic of reproducing kernels.
We need a key theorem about Hilbert spaces that plays a central role in the entire
concept of a reproducing kernel Hilbert space. This theorem, a classic in analysis,
gives a representation of continuous linear functionals on a Hilbert space. For com-
pleteness, we first define what is meant by a continuous linear functional.
Definition 20.9. Let $\mathcal{H}$ be a Hilbert space and $\delta: \mathcal{H} \to \mathbb{R}$ a real-valued linear
functional or operator on $\mathcal{H}$; that is, $\delta(ax + by) = a\delta(x) + b\delta(y)$ for all $x, y \in \mathcal{H}$
and all real constants $a, b$. The norm or operator norm of $\delta$ is defined to be
$\|\delta\| = \sup_{x \in \mathcal{H},\, x \ne 0} \frac{|\delta(x)|}{\|x\|_{\mathcal{H}}}$.
Definition 20.10. Let $\mathcal{H}$ be a Hilbert space and $\delta: \mathcal{H} \to \mathbb{R}$ a linear functional on
$\mathcal{H}$. The functional $\delta$ is called continuous if $x_n, x \in \mathcal{H}$, $d(x_n, x) \to 0 \Rightarrow \delta(x_n) \to \delta(x)$.
In general, linear functionals need not be continuous. But they are if they have a
finite operator norm. The following extremely important result says that these two
properties of continuity and boundedness are really the same.
Theorem 20.14. Let $\mathcal{H}$ be a Hilbert space and $\delta: \mathcal{H} \to \mathbb{R}$ a linear functional.
Then $\delta$ is continuous if and only if it is bounded; that is, $\|\delta\| < \infty$. Equivalently, a
linear operator is continuous if and only if there exists a finite real constant $c$ such
that $|\delta(x)| \le c\|x\|_{\mathcal{H}}$ for all $x \in \mathcal{H}$.
Proof. For the "if" part, suppose $\delta$ is a bounded operator. Take $x_n \in \mathcal{H} \to 0$ (the
null element). Then, by definition of operator norm,
$$|\delta(x_n)| \le \|\delta\|\,\|x_n\|_{\mathcal{H}} \to 0,$$
because $\|\delta\| < \infty$. This proves that $\delta$ is continuous at $0$, and therefore continuous
everywhere by linearity. Conversely, if $\|\delta\| = \infty$, find a sequence $x_n \in \mathcal{H}$ such that
$\|x_n\|_{\mathcal{H}} \le 1$, but $|\delta(x_n)| \to \infty$. This is possible, because $\|\delta\|$ is easily shown to be
equal to $\sup\{|\delta(x)| : \|x\|_{\mathcal{H}} \le 1\}$. Now define $z_n = \frac{x_n}{|\delta(x_n)|}$, $n \ge 1$. Then $z_n \to 0$,
but $\delta(z_n)$ does not go to zero, because $|\delta(z_n)|$ is equal to $1$ for all $n$. This proves the
"only if" part of the theorem.
Here is the classic representation theorem for continuous linear functionals on a
Hilbert space that we promised. The theorem says that any continuous linear func-
tional on a Hilbert space H can be recovered as an inner product with a fixed element
of H, associated in a one-to-one way with the particular continuous functional.
730 20 Useful Tools for Statistics and Machine Learning
$$\delta_x(f) = f(x), \qquad f \in \mathcal{H},\ x \in \mathcal{X}.$$
It is common to call $K(x, y)$ the reproducing kernel of the function space $\mathcal{H}$. Note
that, in a deviation from how kernels were defined in the context of smoothing, the
reproducing kernel is a function on $\mathcal{X} \otimes \mathcal{X}$. It is also important to understand that
not all Hilbert function spaces possess a reproducing kernel. The point evaluation
operators must be continuous for the function space to possess a reproducing kernel. In what follows, more is said of this, and about which kernels can at all be the
reproducing kernel of some Hilbert function space.
Some basic properties of a reproducing kernel are given below.
Proposition. Let $K(x, y)$ be the reproducing kernel of a Hilbert function space $\mathcal{H}$.
Then,
(a) $K(x, y) = (K_x, K_y)$ for all $x, y \in \mathcal{X}$.
(b) $K$ is symmetric; that is, $K(x, y) = K(y, x)$ for all $x, y \in \mathcal{X}$.
$\bigl(K(x_i, x_j)\bigr)_{i,j=1}^{n}$
$$K(x, y) = \frac{\sin(x - y)}{x - y} + \frac{\sin(x + y)}{x + y},$$
where $\frac{\sin 0}{0}$ is interpreted as the limit $\lim_{z \to 0} \frac{\sin z}{z} = 1$. Thus, $K(x, x) = 1 + \frac{\sin(2x)}{2x}$
if $x \ne 0$, and $K(0, 0) = 2$. By the characterization theorem, Theorem 20.17, this is
a reproducing kernel of a Hilbert space of functions on $\mathcal{X} = [0, 1]$.
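The nonnegative definiteness that the characterization theorem requires can be checked numerically by forming the matrix $\bigl(K(x_i, x_j)\bigr)$ at a grid of points and inspecting its eigenvalues (a sketch; the grid and its size are illustrative):

```python
import numpy as np

def K(x, y):
    # K(x, y) = sin(x - y)/(x - y) + sin(x + y)/(x + y);
    # np.sinc(z) = sin(pi z)/(pi z) handles the limits at 0
    return np.sinc((x - y) / np.pi) + np.sinc((x + y) / np.pi)

x = np.linspace(0.0, 1.0, 25)
G = K(x[:, None], x[None, :])        # the matrix ((K(x_i, x_j)))
eigs = np.linalg.eigvalsh(G)         # G is symmetric, so eigvalsh applies
print(eigs.min())                    # nonnegative up to rounding error
```

The smallest eigenvalue is nonnegative (up to floating-point noise), consistent with the kernel being nonnegative definite.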
Example 20.15. Suppose $\mathcal{X} = \mathbb{R}^n$ for some $n \ge 1$. Consider the kernel $K(x, y) =
(x'y)^2$, where $x'y$ denotes the usual Euclidean inner product $x_1 y_1 + x_2 y_2 + \cdots +
x_n y_n$. We find the feature maps corresponding to the positive-definite kernel $K$.
Define the map
$$\phi(x) = \bigl(x_1^2,\ x_1 x_2,\ \dots,\ x_1 x_n,\ \dots,\ x_n x_1,\ x_n x_2,\ \dots,\ x_n^2\bigr),$$
$$(\phi(x), \phi(y))_{\mathcal{F}} = \sum_{k=1}^{n^2} (\phi(x))_k (\phi(y))_k = \sum_{i=1}^{n}\sum_{j=1}^{n} (x_i x_j)(y_i y_j)
= \sum_{i=1}^{n}\sum_{j=1}^{n} x_i y_i x_j y_j = (x'y)^2.$$
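This identity is easy to confirm numerically (a quick sketch for $n = 3$ with arbitrary vectors):

```python
import numpy as np

def phi(x):
    # the feature map (x_i x_j), i, j = 1..n, flattened into a vector of length n^2
    return np.outer(x, x).ravel()

rng = np.random.default_rng(2)
x, y = rng.standard_normal(3), rng.standard_normal(3)

lhs = float(phi(x) @ phi(y))      # inner product in the feature space
rhs = float(x @ y) ** 2           # the kernel value K(x, y) = (x'y)^2
assert np.isclose(lhs, rhs)
```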
Then,
(a) The operator $A_K$ has a countable number of nonnegative eigenvalues $\lambda_i$, $i \ge 1$,
and a corresponding sequence of mutually orthonormal eigenfunctions $\psi_i$,
$i \ge 1$, satisfying $A_K(\psi_i) = \lambda_i \psi_i$, $i \ge 1$.
(b) The kernel $K$ admits the representation
$$K(x, y) = \sum_{i=1}^{\infty} \phi_i(x)\phi_i(y),$$
where $\phi_i(x) = \sqrt{\lambda_i}\,\psi_i(x)$, $i \ge 1$.
The theorem covers the two most practically important cases of $\mathcal{X}$ being a rectangle
(possibly unbounded) in a finite-dimensional Euclidean space with $\mu$ as Lebesgue
measure, and $\mathcal{X}$ being a finite set in a finite-dimensional Euclidean space with $\mu$ as
the counting measure. The success of Mercer's theorem in explicitly identifying the
feature map
$$\phi(x) = \bigl(\phi_i(x)\bigr)_{i=1}^{\infty}$$
depends on our ability to find the eigenfunctions and the eigenvalues of the linear
operator $A_K$. In some cases, we can find them explicitly, and in some cases, we are
out of luck. It is worth noting that the linear operator $A_K$ has certain additional
properties (compactness in particular) which allow the representation in Mercer's theorem to hold; see, for example, Theorem 1 on p. 93 in Cheney (2001) for
conditions needed for the spectral decomposition of an operator on a Hilbert space.
Minh, Niyogi and Yao (2006) give some very nice examples of the calculation of
the $\lambda_i$ and $\psi_i$ in Mercer's theorem when the input space is the surface of a sphere in
some finite-dimensional Euclidean space, or a discrete set in a finite-dimensional
Euclidean space.
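When closed-form eigenfunctions are out of reach, the spectrum of $A_K$ can be approximated by discretizing the integral operator on a grid (essentially the Nyström method). The sketch below uses a Gaussian kernel on $[0, 1]$ purely for illustration; the eigenvalue sum is checked against the trace $\int_0^1 K(x, x)\,dx = 1$.

```python
import numpy as np

# Discretize (A_K f)(x) = integral over [0,1] of K(x, y) f(y) dy on a midpoint grid.
m = 200
x = (np.arange(m) + 0.5) / m           # midpoint grid on [0, 1]
w = 1.0 / m                            # quadrature weight

K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)  # illustrative Gaussian kernel
lam = np.linalg.eigvalsh(w * K)[::-1]  # eigenvalues of w*K approximate those of A_K

# sanity checks: eigenvalues are nonnegative, and their sum approximates the
# trace of A_K, which here equals 1 since K(x, x) = 1 on [0, 1]
print(lam[:4])
print(lam.sum())                       # ≈ 1
```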
Theorem 20.15 says that symmetric positive-definite functions and reproducing
kernel Hilbert spaces are in a one-to-one relationship. If we produce a symmetric
positive-definite function, it will correspond to a suitable Hilbert space of functions
with an inner product, and an induced norm. For the sake of applications, it is
useful to know this correspondence for some special kernels. A list of these corre-
spondences is given below for practical use.
$\dots\ L_2[0, b]\}$

$\mathcal{X} = [0, 1]$: kernel $(1-x)(1-y) + xy + (x \wedge y)^3 + \frac{x}{6}(1-y)(x^2 - 2y + y^2)$; space $\{u : u'' \in L_2[0, 1]\}$; squared norm $u^2(0) + u^2(1) + \int_0^1 (u'')^2$.

$\mathcal{X} = \mathbb{R}$: kernel $\frac{\sin(M(x - y))}{M(x - y)}$; space: functions $u \in L_2(\mathbb{R})$ with $\int e^{itx} u(x)\,dx = 0$ for $|t| > M$; usual $L_2$ norm.

$\mathcal{X} = [0, 1]$: kernel $1 + \frac{(-1)^{m-1}}{(2m)!} B_{2m}(|x - y|)$; space $\{u : u^{(m)} \in L_2[0, 1],\ u^{(j)}(0) = u^{(j)}(1)\ \text{for all}\ j < m\}$; squared norm $\bigl(\int_0^1 u\bigr)^2 + \int_0^1 (u^{(m)})^2$.

Note: Above, $B_j$ denotes the $j$th Bernoulli polynomial. For example, $B_2(x) = x^2 - x + \frac{1}{6}$; $B_4(x) = x^4 - 2x^3 + x^2 - \frac{1}{30}$.
Let us now return to the two-group statistical classification problem. Suppose
the covariate vector $X$ is $d$-dimensional multivariate normal under each group,
namely, $X \mid Y = 1 \sim N_d(\mu_1, \Sigma)$, $X \mid Y = 2 \sim N_d(\mu_2, \Sigma)$. Suppose also that
$P(Y = 1) = p$, $P(Y = 2) = 1 - p$. The marginal distribution of $Y$ and the
conditional distribution of $X$ given $Y$ determine the joint distribution $F$ of $(X, Y)$.
A classic result is that the misclassification probability $P_F(g(X) \ne Y)$ is minimized by a linear classification rule that classifies a given $X$ value into group
1 (i.e., sets $g(X) = 1$) if $c'x \ge b$ for a suitable vector $c$ and a suitable real
constant $b$. In the case that $\mu_1, \mu_2$ and $\Sigma$ are known to the user, the vector $c$ has
the formula $c' = (\mu_1 - \mu_2)'\Sigma^{-1}$. Usually, these mean vectors and the covariance
matrix are unknown to the user, in which case $c$ is estimated by using training data
$(x_i, y_i)$, $i = 1, 2, \dots, n$.
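A minimal sketch of estimating this rule from training data follows; the pooled covariance estimate and the midpoint cutoff (which corresponds to equal priors) are illustrative conventions, not the only possible choices.

```python
import numpy as np

def fisher_lda(X1, X2):
    """Estimate Fisher's linear rule from group-1 and group-2 training rows.

    Returns (c, b): classify x into group 1 when c @ x >= b."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # pooled estimate of the common covariance matrix Sigma
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    c = np.linalg.solve(S, m1 - m2)        # c = Sigma^{-1}(mu_1 - mu_2)
    b = c @ (m1 + m2) / 2                  # midpoint cutoff (equal priors)
    return c, b

rng = np.random.default_rng(3)
X1 = rng.standard_normal((200, 2)) + np.array([2.0, 0.0])  # simulated group 1
X2 = rng.standard_normal((200, 2))                         # simulated group 2
c, b = fisher_lda(X1, X2)
acc = ((X1 @ c >= b).mean() + (X2 @ c < b).mean()) / 2     # training accuracy
```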
This is the historically famous Fisher linear classification rule. Geometrically,
the Fisher classification rule takes a suitable hyperplane in Rd , and classifies X
values on one side of the hyperplane to the first group and X values on the other
side of the hyperplane to the second group. For Gaussian data with identical or
nearly identical covariance structures, this idea of linear separation works quite
well. See Fig. 20.3. However, linear separability is too optimistic for many kinds of
data; for example, even in the Gaussian case itself, if the covariance structure of X
is different under the two groups, then linear separation is not going to work well.
On the other hand, linear separation has some advantages.
(a) A linear classification rule is easier to compute.
(b) A linear rule has geometric appeal.
(c) It may be easier to study operating characteristics, such as misclassification
probabilities, of linear rules.
An appealing idea is to map the original input vector $X$ into a feature space, say
some Euclidean space $\mathbb{R}^D$, by using a feature map $\Phi(X)$, and use a linear rule in
the feature space. That is, use classification rules that classify an $X$ value by using a
classification function of the form $\sum_{i=1}^{n} c_i (\Phi(x_i))'(\Phi(X))$. As we remarked, the
dimension of the feature space may be much higher, and in fact the feature space
may even be infinite dimensional, in which case computing these inner products
$(\Phi(x_i))'(\Phi(X))$ is going to be time consuming. However, now, our previous discussion of the theory of reproducing kernel Hilbert spaces is going to help us in
avoiding computation of these very high-dimensional inner products. As we saw
Exercises
Exercise 20.1 (Simple Practical Bootstrap). For $n = 10, 30, 50$, take a random
sample from an $N(0, 1)$ distribution, and bootstrap the sample mean $\bar{X}$ using a bootstrap Monte Carlo size $B = 500$. Construct a histogram and superimpose on it the
exact density of $\bar{X}$. Compare.
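A minimal sketch of the bootstrap Monte Carlo recipe used throughout these exercises, shown for the sample mean (the histogram step is omitted; the statistic and sample are illustrative):

```python
import numpy as np

def bootstrap_statistic(data, stat, B=500, seed=0):
    """Resample the data with replacement B times; return the B statistic values."""
    rng = np.random.default_rng(seed)
    n = len(data)
    return np.array([stat(rng.choice(data, size=n, replace=True))
                     for _ in range(B)])

rng = np.random.default_rng(42)
x = rng.standard_normal(30)                    # a sample of size n = 30
boot_means = bootstrap_statistic(x, np.mean)   # bootstrap distribution of the mean
print(boot_means.std())                        # close to s / sqrt(n)
```

A histogram of `boot_means` is then compared with the exact $N(0, 1/n)$ density of $\bar{X}$.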
Exercise 20.2. For $n = 5, 20, 50$, take a random sample from an Exp(1) density,
and bootstrap the sample mean $\bar{X}$ using a bootstrap Monte Carlo size $B = 500$.
Construct the corresponding histogram and superimpose it on the exact density.
Compare.
Exercise 20.3 * (Bootstrapping a Complicated Statistic). For $n = 15, 30, 60$,
take a random sample from an $N(0, 1)$ distribution, and bootstrap the sample kurtosis coefficient using a bootstrap Monte Carlo size $B = 500$. Next, find the
approximate normal distribution obtained from the delta theorem, and superimpose
the bootstrap histogram on this approximate normal density. Compare.
Exercise 20.4. For n D 20; 40; 75, take a random sample from the standard Cauchy
distribution, and bootstrap the sample median using a bootstrap Monte Carlo size
B D 500. Next, find the approximate normal distribution obtained from Theorem
9.1 and superimpose the bootstrap histogram on this approximate normal density.
Compare.
Exercise 20.5 * (Bootstrapping in an Unusual Situation). For $n = 20, 40, 75$,
take a random sample from the standard Cauchy distribution, and bootstrap the
t-statistic $\frac{\sqrt{n}\,\bar{X}}{s}$ using a bootstrap Monte Carlo size $B = 500$. Plot the bootstrap
histogram. What special features do you notice in this histogram? In particular, comment on whether the true density appears to be unimodal, or bounded.
Exercise 20.6 * (Bootstrap Variance Estimate). For $n = 15, 30, 50$, find a bootstrap estimate of the variance of the sample median for a sample from a Beta(3, 3)
density. Use a bootstrap Monte Carlo size $B = 500$.

Exercise 20.7. For $n = 10, 20, 40$, find a bootstrap estimate of the variance of the
sample mean for a sample from a Beta(3, 3) density. Use a bootstrap Monte Carlo
size $B = 500$. Compare with the known exact value of the variance of the sample
mean.
Exercise 20.8 (Comparing Bootstrap with an Exact Answer). For $n =
15, 30, 50$, find a bootstrap estimate of the probability $P\bigl(\frac{\sqrt{n}\,\bar{X}}{s} \le 1\bigr)$ for samples
from a standard normal distribution. Use a bootstrap Monte Carlo size $B = 500$.
Compare the bootstrap estimate with the exact value of this probability. (Why is the
exact value easily computable?)
Exercise 20.9. * Prove that under appropriate moment conditions, the bootstrap is
consistent for the sample correlation coefficient r between two jointly distributed
variables X; Y .
Exercise 20.11 * (Conceptual). In which of the following cases can you use the
canonical bootstrap justifiably?
(a) Approximating the distribution of the largest order statistic of a sample from a
Beta distribution;
(b) Approximating the distribution of the median of a sample from a Beta
distribution;
(c) Approximating the distribution of the maximum likelihood estimate of $P(X_1 \le \mu)$ for an exponential density with mean $\mu$;
(d) Approximating the distribution of $\frac{1}{n}\sum_{i=1}^{n} \frac{(X_i - M_X)(Y_i - M_Y)}{s_X s_Y}$, where $(X_i, Y_i)$ are
independent samples from a bivariate normal distribution, $M_X$ and $s_X$
are the median and the standard deviation of the $X_i$ values, and $M_Y$ and
$s_Y$ are the median and the standard deviation of the $Y_i$ values;
(e) Approximating the distribution of the sample mean for a sample from a
$t$-distribution with two degrees of freedom.
Exercise 20.12. * Suppose $\bar{X}_n$ is the sample mean of an iid sample from a CDF $F$
with a finite variance, and $\bar{X}_n^*$ is the mean of a bootstrap sample. Consistency of the
bootstrap is a statement about the bootstrap distribution, conditional on the observed
data. What can you say about the unconditional limit distribution of $\sqrt{n}(\bar{X}_n^* - \mu)$,
where $\mu$ is the mean of $F$?
Exercise 20.15 (EM with Data). The following are observations from a bivariate normal distribution, with * indicating a missing value. Use the derivations in
Example 20.9 to find the first six EM iterates for the MLE of the vector of five
parameters. Comment on how close to convergence you are.
Data: (0, .1), (0, 1), (1, *), (2, *), (.5, .75), (*, 3), (*, 2), (.2, .2).
Exercise 20.16 (Conceptual). Consider again the bivariate normal problem with
missing values, except now there are no complete observations:
Data: (0, *), (2, *), (*, 3), (*, 2), (.5, *), (.6, *), (*, 0), (*, 1).
Is the EM algorithm useful now? Explain your answer.
Exercise 20.18 (EM in a Genetics Problem). Consider the ABO blood group
problem worked out in Example 20.10. For the data values YA D 182; YB D
60; YAB D 17; YO D 176 (McLachlan and Krishnan (2008)), find the first four
EM iterates, using the starting values .pA ; pB ; pO / D .:264; :093; :643/.
Exercise 20.21 (Plug-In Density Estimate). Suppose the true model is a $N(\mu, \sigma^2)$
and a parametric plug-in estimate using MLEs of the parameters is used. Derive an
expression for the global error index $E_f \int [f(x) - \hat{f}(x)]^2\,dx$. At what rate does this
converge to zero?
Exercise 20.22 (Choosing the Wrong Model). Suppose the true model is a double
exponential location parameter density, but you thought it was a $N(\mu, 1)$ and used
a parametric plug-in estimate with an MLE. Does $E_f \int |f(x) - \hat{f}(x)|\,dx$ converge
to zero?
Exercise 20.24 (Practical Effect of the Kernel). For the simulated data in the pre-
vious exercise, repeat part (b) with the Epanechnikov and Exponential kernels. Plot
the three density estimates, and write a report.
Exercise 20.25. Suppose H is an inner product space. Prove the parallelogram law.
Exercise 20.26. Suppose $\mathcal{H}$ is an inner product space, and let $x, y \in \mathcal{H}$. Show that
$x = y$ if and only if $(x, z) = (y, z)$ for all $z \in \mathcal{H}$.

Exercise 20.27. Suppose $\mathcal{H}$ is an inner product space, and let $x \in \mathcal{H}$. Show that
$\|x\|_{\mathcal{H}} = \sup\{(x, v) : \|v\|_{\mathcal{H}} = 1\}$.
$$f_0(t) = \frac{1}{\sqrt{2}}, \quad f_n(t) = \cos(nt),\ n = 1, 2, 3, \dots,$$
$$g_n(t) = \sin(nt),\ n = 1, 2, 3, \dots$$
Show that these functions together form an orthonormal set on $\mathcal{H}$; that is, each has
$\mathcal{H}$-norm equal to $1$, and any two distinct functions among them are orthogonal.
$\int_{-\infty}^{\infty} f(t)g(t)e^{-t^2/2}\,dt$.
Find, explicitly, $H_i(t)$, $i = 0, 1, 2, 3, 4$.
Exercise 20.33 (Laguerre Polynomials). Show that for $\alpha > -1$, the polynomials
$$L_{n,\alpha}(t) = \frac{e^t t^{-\alpha}}{n!}\,\frac{d^n}{dt^n}\bigl(e^{-t} t^{n+\alpha}\bigr), \qquad n = 0, 1, 2, \dots$$
form an orthogonal set on $L_2(0, \infty)$ with the inner product $(f, g) = \int_0^{\infty} f(t)g(t)\,e^{-t} t^{\alpha}\,dt$.
Find, explicitly, $L_{i,\alpha}(t)$, $i = 0, 1, 2, 3, 4$.
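The orthogonality can be spot-checked numerically. The sketch below builds the polynomials by the standard three-term recurrence (which matches the Rodrigues formula above in the usual normalization) and approximates the weighted integrals on a truncated grid; the truncation point and grid are illustrative choices.

```python
import numpy as np

def laguerre(n, alpha, t):
    """Generalized Laguerre polynomial L_{n,alpha}(t) via the three-term recurrence."""
    L0, L1 = np.ones_like(t), 1.0 + alpha - t
    if n == 0:
        return L0
    for k in range(1, n):
        L0, L1 = L1, ((2 * k + 1 + alpha - t) * L1 - (k + alpha) * L0) / (k + 1)
    return L1

alpha = 0.5
t = np.linspace(0.0, 80.0, 400001)        # [0, 80] truncates the tail safely
w = np.exp(-t) * t ** alpha               # the weight e^{-t} t^alpha
dt = t[1] - t[0]

def ip(m, n):
    # grid approximation of the inner product (L_m, L_n)
    return float(np.sum(laguerre(m, alpha, t) * laguerre(n, alpha, t) * w) * dt)

print(ip(1, 2))   # ≈ 0: orthogonal
print(ip(3, 4))   # ≈ 0
```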
Exercise 20.34 (Jacobi Polynomials). Show that for $\alpha, \beta > -1$, the polynomials
$$P_n^{(\alpha,\beta)}(t) = \frac{(-1)^n}{2^n n!}\,(1 - t)^{-\alpha}(1 + t)^{-\beta}\,\frac{d^n}{dt^n}\Bigl[(1 - t)^{\alpha+n}(1 + t)^{\beta+n}\Bigr], \qquad n = 0, 1, 2, \dots$$
form an orthogonal set on $L_2(-1, 1)$ with the inner product $(f, g) = \int_{-1}^{1} f(t)g(t)(1 - t)^{\alpha}(1 + t)^{\beta}\,dt$.
Find, explicitly, $P_i^{(\alpha,\beta)}(t)$, $i = 0, 1, 2, 3, 4$.
Exercise 20.35 (Gegenbauer Polynomials). Find, explicitly, the first five Gegenbauer polynomials, defined as $P_i^{(\alpha,\beta)}(t)$, $i = 0, 1, 2, 3, 4$, when $\alpha = \beta$.
Exercise 20.36 (Chebyshev Polynomials). Find, explicitly, the first five Chebyshev polynomials $T_n(x)$, $n = 0, 1, 2, 3, 4$, which are the Jacobi polynomials in the
special case $\alpha = \beta = -\frac{1}{2}$. The polynomials $T_n$, $n = 0, 1, 2, \dots$ form an orthogonal
set on $L_2[-1, 1]$ with the inner product $(f, g) = \int_{-1}^{1} f(t)g(t)(1 - t^2)^{-1/2}\,dt$.
Exercise 20.38 (Bessel's Inequality). Use the formula of the previous exercise to
show that for a general countable orthonormal set $B = \{x_1, x_2, \dots\}$ in an inner
product space $\mathcal{H}$, and any $x \in \mathcal{H}$, one has $\|x\|_{\mathcal{H}}^2 \ge \sum_{i=1}^{\infty} |(x, x_i)|^2$.
Exercise 20.41. Let $I$ be an interval in the real line. Show that the family of all
real-valued continuous functions on $I$, with the norm $\|f\|_{\mathcal{H}} = \sup_{x \in I} |f(x)|$, is
not an inner product space.
Exercise 20.43 (Kernel Plots). Plot the Cauchy and the exponential kernel, as
defined in Section 20.3.2, for some selected values of a. What is the effect of in-
creasing a?
Exercise 20.45. Suppose $f, g \in L_1(\mathbb{R}^d)$. Prove that $\|\,|f| * |g|\,\|_1 = \|f\|_1\,\|g\|_1$.
Exercise 20.47. Consider the linear operator $\delta: L_2[a, b] \to L_2[a, b]$, where $a, b$
are finite real constants, defined as $\delta(f)(x) = \int_a^x f(y)\,dy$. Show that $\|\delta\| \le b - a$.
$$\sum_{i=1}^{n} I_{\{y_i = 1\}} \prod_{j=1}^{d} K_j\!\left(\frac{x_j - x_{i,j}}{h_j}\right) \;\ge\; \sum_{i=1}^{n} I_{\{y_i = 2\}} \prod_{j=1}^{d} K_j\!\left(\frac{x_j - x_{i,j}}{h_j}\right),$$
Exercise 20.58 * (Classification with Fisher's Iris Data). For Fisher's Iris dataset
(e.g., wikipedia.org), form a kernel classification rule for the pairwise cases, setosa
versus versicolor, setosa versus virginica, and versicolor versus virginica, by using
(a) Fisher’s linear classification rule
(b) A kernel classification rule with an exponential kernel, where the parameter of
the kernel is to be chosen by you
(c) A kernel classification rule with the inverse quadratic kernel, where the param-
eter of the kernel is to be chosen by you
Find the empirical error rate for each rule, and write a report.
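A product-kernel classification rule of the kind this exercise asks for can be sketched as follows; the Gaussian component kernels, the common bandwidth, and the simulated stand-in data are all illustrative choices (the Iris data itself must be loaded separately).

```python
import numpy as np

def kernel_classify(x, X1, X2, h=0.5):
    """Classify x into group 1 or 2 by comparing product-kernel sums over the
    two groups' training rows (Gaussian component kernels, common bandwidth h)."""
    def score(X):
        # sum_i prod_j K((x_j - x_{i,j}) / h), vectorized over training rows i
        return float(np.exp(-((x - X) / h) ** 2 / 2).prod(axis=1).sum())
    return 1 if score(X1) >= score(X2) else 2

rng = np.random.default_rng(4)
X1 = rng.standard_normal((100, 4)) + 1.5   # stand-ins for two well-separated classes
X2 = rng.standard_normal((100, 4)) - 1.5
correct = sum(kernel_classify(x, X1, X2) == 1 for x in X1) + \
          sum(kernel_classify(x, X1, X2) == 2 for x in X2)
err = 1 - correct / 200                     # empirical error rate
```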
References
Aizerman, M., Braverman, E., and Rozonoer, L. (1964). Theoretical foundations of the potential
function method in pattern recognition learning, Autom. Remote Control, 25, 821–837.
Aronszajn, N. (1950). Theory of reproducing kernels, Trans. Amer. Math. Soc., 68, 307–404.
Athreya, K. (1987). Bootstrap of the mean in the infinite variance case, Ann. Statist., 15, 724–731.
Berlinet, A. and Thomas-Agnan, C. (2004). Reproducing Kernel Hilbert Spaces in Probability and
Statistics, Kluwer, Boston.
Bickel, P.J. (2003). Unorthodox bootstraps, Invited paper, J. Korean Statist. Soc., 32, 213–224.
Bickel, P.J. and Doksum, K. (2006). Mathematical Statistics, Basic Ideas and Selected Topics,
Prentice Hall, Upper Saddle River, NJ.
Bickel, P.J. and Freedman, D. (1981). Some asymptotic theory for the bootstrap, Ann. Statist., 9,
1196–1217.
Carlstein, E. (1986). The use of subseries values for estimating the variance of a general statistic
from a stationary sequence, Ann. Statist., 14, 1171–1179.
Chan, K. and Ledolter, J. (1995). Monte Carlo estimation for time series models involving counts,
J. Amer. Statist. Assoc., 90, 242–252.
Cheney, W. (2001). Analysis for Applied Mathematics, Springer, New York.
Cheney, W. and Light, W. (2000). A Course in Approximation Theory, Pacific Grove, Brooks/
Cole, CA.
Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and other
Kernel Based Learning Methods, Cambridge Univ. Press, Cambridge, UK.
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability, Springer, New York.
Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the
EM algorithm, JRSS, Ser. B, 39, 1–38.
Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition,
Springer, New York.
Efron, B. (2003). Second thoughts on the bootstrap, Statist. Sci., 18, 135–140.
Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap, Chapman and Hall, London.
Giné, E. and Zinn, J. (1989). Necessary conditions for bootstrap of the mean, Ann. Statist., 17,
684–691.
Hall, P. (1986). On the number of bootstrap simulations required to construct a confidence interval,
Ann. Statist., 14, 1453–1462.
Hall, P. (1988). Rate of convergence in bootstrap approximations, Ann. Prob., 16, 1665–1684.
Hall, P. (1989). On efficient bootstrap simulation, Biometrika, 76, 613–617.
Hall, P. (1990). Asymptotic properties of the bootstrap for heavy-tailed distributions, Ann. Prob.,
18, 1342–1360.
Hall, P. (1992). The Bootstrap and Edgeworth Expansion, Springer, New York.
Hall, P., Horowitz, J. and Jing, B. (1995). On blocking rules for the bootstrap with dependent data,
Biometrika, 82, 561–574.
Hall, P. (2003). A short prehistory of the bootstrap, Statist. Sci., 18, 158–167.
Künsch, H.R. (1989). The Jackknife and the bootstrap for general stationary observations, Ann.
Statist., 17, 1217–1241.
Lahiri, S.N. (1999). Theoretical comparisons of block bootstrap methods, Ann. Statist., 27,
386–404.
Lahiri, S.N. (2003). Resampling Methods for Dependent Data, Springer-Verlag, New York.
Lahiri, S.N. (2006). Bootstrap methods, a review, in Frontiers in Statistics, J. Fan and H. Koul Eds.,
231–256, Imperial College Press, London.
Lange, K. (1999). Numerical Analysis for Statisticians, Springer, New York.
Le Cam, L. and Yang, G. (1990). Asymptotics in Statistics, Some Basic Concepts, Springer, New
York.
Lehmann, E.L. (1999). Elements of Large Sample Theory, Springer, New York.
Lehmann, E.L. and Casella, G. (1998). Theory of Point Estimation, Springer, New York.
Levine, R. and Casella, G. (2001). Implementation of the Monte Carlo EM algorithm, J. Comput.
Graph. Statist., 10, 422–439.
McLachlan, G. and Krishnan, T. (2008). The EM Algorithm and Extensions, Wiley, New York.
Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of
integral equations, Philos. Trans. Royal Soc. London, A, 209, 415–446.
Minh, H., Niyogi, P., and Yao, Y. (2006). Mercer’s theorem, feature maps, and smoothing, Proc.
Comput. Learning Theory, COLT, 154–168.
Murray, G.D. (1977). Discussion of paper by Dempster, Laird, and Rubin (1977), JRSS Ser. B, 39,
27–28.
Politis, D. and Romano, J. (1994). The stationary bootstrap, JASA, 89, 1303–1313.
Politis, D. and White, A. (2004). Automatic block length selection for the dependent bootstrap,
Econ. Rev., 23, 53–70.
Politis, D., Romano, J. and Wolf, M. (1999). Subsampling, Springer, New York.
Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function, Ann.
Math. Statist., 27, 832–835.
Rudin, W. (1986). Real and Complex Analysis, 3rd edition, McGraw-Hill, Columbus, OH.
Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap, Springer, New York.
Singh, K. (1981). On the asymptotic accuracy of Efron’s bootstrap, Ann. Statist., 9, 1187–1195.
Sundberg, R. (1974). Maximum likelihood theory for incomplete data from exponential family,
Scand. J. Statist., 1, 49–58.
Tong, Y. (1990). The Multivariate Normal Distribution, Springer, New York.
Vapnik, V. and Chervonenkis, A. (1964). A note on one class of perceptrons, Autom. Remote Con-
trol, 25.
Vapnik, V. (1995). The Nature of Statistical Learning Theory, Springer, New York.
Wei, G. and Tanner, M. (1990). A Monte Carlo implementation of the EM algorithm, J. Amer.
Statist. Assoc., 85, 699–704.
Wu, C.F.J. (1983). On the convergence properties of the EM algorithm, Ann. Statist., 11, 95–103.
Appendix A
Symbols, Useful Formulas, and Normal Table
$\Omega$  sample space
$P(B \mid A)$  conditional probability
$\{\pi(1), \dots, \pi(n)\}$  permutation of $\{1, \dots, n\}$
$F$  CDF
$\bar{F}$  $1 - F$
$F^{-1}$, $Q$  quantile function
iid, IID  independent and identically distributed
$p(x, y)$; $f(x_1, \dots, x_n)$  joint pmf; joint density
$F(x_1, \dots, x_n)$  joint CDF
$f(y \mid x)$; $E(Y \mid X = x)$  conditional density and expectation
var, Var  variance
$\mathrm{Var}(Y \mid X = x)$  conditional variance
Cov  covariance
$\rho_{X,Y}$  correlation
$G(s)$; $\psi(t)$  generating function; mgf or characteristic function
$\psi(t_1, \dots, t_n)$  joint mgf
$\beta$, $\gamma$  skewness and kurtosis
$\mu_k$  $E(X - \mu)^k$
$m_k$  sample $k$th central moment
$\kappa_r$  $r$th cumulant
$\rho_r$  $r$th standardized cumulant; correlation of lag $r$
$r, \theta$  polar coordinates
$J$  Jacobian
$F_n$  empirical CDF
$F_n^{-1}$  sample quantile function
$X^{(n)}$  sample observation vector $(X_1, \dots, X_n)$
$M_n$  sample median
$X_{(k)}$, $X_{k:n}$  $k$th-order statistic
$W_n$  sample range
IQR  interquartile range
$\stackrel{P}{\Rightarrow}$, $\stackrel{P}{\to}$  convergence in probability
$o_p(1)$  convergence in probability to zero
$O_p(1)$  bounded in probability
$a_n \asymp b_n$  $0 < \liminf \frac{a_n}{b_n} \le \limsup \frac{a_n}{b_n} < \infty$
$a_n \sim b_n$, $a_n \approx b_n$  $\lim \frac{a_n}{b_n} = 1$
$\stackrel{a.s.}{\Rightarrow}$, $\stackrel{a.s.}{\to}$  almost sure convergence
w.p. 1  with probability 1
a.e.  almost everywhere
i.o.  infinitely often
$\stackrel{\mathcal{L}}{\Rightarrow}$, $\stackrel{\mathcal{L}}{\to}$  convergence in distribution
$\stackrel{r}{\Rightarrow}$, $\stackrel{r}{\to}$  convergence in $r$th mean
u.i.  uniformly integrable
LIL  law of iterated logarithm
VST  variance stabilizing transformation
$\delta_x$  point mass at $x$
$P(\{x\})$  probability of the point $x$
$\lambda$  Lebesgue measure
$*$  convolution
$\ll$  absolutely continuous
$\frac{dP}{d\mu}$  Radon–Nikodym derivative
$\otimes$  product measure; Kronecker product
$I(\theta)$  Fisher information function or matrix
$T$  natural parameter space
$\pi(\theta \mid x)$; $\pi(\theta \mid X^{(n)})$  posterior density of $\theta$
LRT  likelihood ratio test
$\Lambda_n$  likelihood ratio
$S$  sample covariance matrix
$T^2$  Hotelling's $T^2$ statistic
MLE  maximum likelihood estimate
$l(\theta, X)$  complete data likelihood in EM
$L(\theta, X)$  $\log l(\theta, X)$
$l(\theta, Y)$  likelihood for observed data
$L(\theta, Y)$  $\log l(\theta, Y)$
$\hat{L}_k(\theta, Y)$  function to be maximized in M-step
$H_{\mathrm{Boot}}$  bootstrap distribution of a statistic
$P_*$  bootstrap measure
$p_{ij}$  transition probabilities in a Markov chain
A.1 Glossary of Symbols 749
$D^+$, $D_+$  Dini derivatives
$\|\cdot\|$  Euclidean norm
$\|\cdot\|_\infty$  supnorm
tr  trace of a matrix
$|A|$  determinant of a matrix
$K(x)$, $K(x, y)$  kernels
A.2 Moments and MGFs of Common Distributions

Discrete Distributions (pmf; mean; variance; skewness; kurtosis; mgf):

Binomial: $\binom{n}{x} p^x (1-p)^{n-x}$, $x = 0, \dots, n$; mean $np$; variance $np(1-p)$; skewness $\frac{1 - 2p}{\sqrt{np(1-p)}}$; kurtosis $\frac{1 - 6p(1-p)}{np(1-p)}$; mgf $(pe^t + 1 - p)^n$.

Poisson: $\frac{e^{-\lambda}\lambda^x}{x!}$, $x = 0, 1, \dots$; mean $\lambda$; variance $\lambda$; skewness $\frac{1}{\sqrt{\lambda}}$; kurtosis $\frac{1}{\lambda}$; mgf $e^{\lambda(e^t - 1)}$.

Geometric: $p(1-p)^{x-1}$, $x = 1, 2, \dots$; mean $\frac{1}{p}$; variance $\frac{1-p}{p^2}$; skewness $\frac{2-p}{\sqrt{1-p}}$; kurtosis $6 + \frac{p^2}{1-p}$; mgf $\frac{pe^t}{1 - (1-p)e^t}$.

Hypergeometric: $\frac{\binom{D}{x}\binom{N-D}{n-x}}{\binom{N}{n}}$; mean $n\frac{D}{N}$; variance $n\frac{D}{N}\bigl(1 - \frac{D}{N}\bigr)\frac{N-n}{N-1}$; skewness, kurtosis, mgf complex.

Benford: $\log_{10}\bigl(1 + \frac{1}{x}\bigr)$, $x = 1, \dots, 9$; mean $3.44$; variance $6.057$; skewness $.796$; kurtosis $2.45$; mgf $\sum_{x=1}^{9} e^{tx} p(x)$.
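The Benford entries are easy to verify directly from the pmf; a quick check of the mean and variance:

```python
import math

x = range(1, 10)
p = {k: math.log10(1 + 1 / k) for k in x}            # the Benford pmf
assert abs(sum(p.values()) - 1) < 1e-12              # it is a valid pmf

mean = sum(k * p[k] for k in x)                      # ≈ 3.440
var = sum(k ** 2 * p[k] for k in x) - mean ** 2      # ≈ 6.057
print(round(mean, 3), round(var, 3))
```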
Continuous Distributions (density; mean; variance; skewness; kurtosis):

Exponential: $\frac{e^{-x/\lambda}}{\lambda}$, $x \ge 0$; mean $\lambda$; variance $\lambda^2$; skewness $2$; kurtosis $6$.

Gamma: $\frac{e^{-x/\lambda} x^{\alpha-1}}{\lambda^{\alpha}\Gamma(\alpha)}$, $x \ge 0$; mean $\lambda\alpha$; variance $\lambda^2\alpha$; skewness $\frac{2}{\sqrt{\alpha}}$; kurtosis $\frac{6}{\alpha}$.

$\chi^2_m$: $\frac{e^{-x/2} x^{m/2-1}}{2^{m/2}\Gamma(\frac{m}{2})}$, $x \ge 0$; mean $m$; variance $2m$; skewness $\sqrt{\frac{8}{m}}$; kurtosis $\frac{12}{m}$.

Normal: $\frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2/(2\sigma^2)}$, $x \in \mathbb{R}$; mean $\mu$; variance $\sigma^2$; skewness $0$; kurtosis $0$.

Lognormal: $\frac{1}{\sqrt{2\pi}\,\sigma x} e^{-\frac{(\log x - \mu)^2}{2\sigma^2}}$, $x > 0$; mean $e^{\mu + \sigma^2/2}$; variance $(e^{\sigma^2} - 1)e^{2\mu + \sigma^2}$; skewness $(e^{\sigma^2} + 2)\sqrt{e^{\sigma^2} - 1}$; kurtosis complex.

Cauchy: $\frac{1}{\pi\sigma\bigl(1 + (x-\mu)^2/\sigma^2\bigr)}$, $x \in \mathbb{R}$; mean, variance, skewness, kurtosis: none.

$t_m$: $\frac{\Gamma(\frac{m+1}{2})}{\sqrt{m\pi}\,\Gamma(\frac{m}{2})}\,\frac{1}{(1 + x^2/m)^{(m+1)/2}}$, $x \in \mathbb{R}$; mean $0$ $(m > 1)$; variance $\frac{m}{m-2}$ $(m > 2)$; skewness $0$ $(m > 3)$; kurtosis $\frac{6}{m-4}$ $(m > 4)$.

F: $\frac{(\beta/\alpha)^{\beta} x^{\alpha-1}}{B(\alpha, \beta)\,(x + \beta/\alpha)^{\alpha+\beta}}$, $x > 0$; mean $\frac{\beta}{\beta - 1}$ $(\beta > 1)$; variance $\frac{\beta^2(\alpha + \beta - 1)}{\alpha(\beta - 2)(\beta - 1)^2}$ $(\beta > 2)$; skewness, kurtosis complex.

Double Exponential: $\frac{e^{-|x-\mu|/\sigma}}{2\sigma}$, $x \in \mathbb{R}$; mean $\mu$; variance $2\sigma^2$; skewness $0$; kurtosis $3$.

Pareto: $\frac{\alpha\sigma^{\alpha}}{x^{\alpha+1}}$, $x > \sigma$; mean $\frac{\alpha\sigma}{\alpha - 1}$ $(\alpha > 1)$; variance $\frac{\alpha\sigma^2}{(\alpha - 1)^2(\alpha - 2)}$ $(\alpha > 2)$; skewness $\frac{2(\alpha + 1)}{\alpha - 3}\sqrt{\frac{\alpha - 2}{\alpha}}$ $(\alpha > 3)$; kurtosis complex.

Gumbel: $\frac{1}{\sigma}\,e^{-\frac{x-\mu}{\sigma}}\,e^{-e^{-(x-\mu)/\sigma}}$, $x \in \mathbb{R}$; mean $\mu + \gamma\sigma$; variance $\frac{\pi^2\sigma^2}{6}$; skewness $\frac{12\sqrt{6}\,\zeta(3)}{\pi^3}$; kurtosis $\frac{12}{5}$.

Note: For the Gumbel distribution, $\gamma \approx .577216$ is the Euler constant, and $\zeta(3)$ is Riemann's zeta function $\zeta(3) = \sum_{n=1}^{\infty} \frac{1}{n^3} \approx 1.20206$.
MGFs of Continuous Distributions

Exponential(λ): f(x) = e^(−x/λ)/λ, x ≥ 0
  MGF (1 − λt)^(−1), t < 1/λ

Gamma(α, λ): f(x) = e^(−x/λ) x^(α−1) / (λ^α Γ(α)), x ≥ 0
  MGF (1 − λt)^(−α), t < 1/λ

Chi-square χ²_m: f(x) = e^(−x/2) x^(m/2−1) / (2^(m/2) Γ(m/2)), x ≥ 0
  MGF (1 − 2t)^(−m/2), t < 1/2

Weibull(λ, β): f(x) = (β/λ)(x/λ)^(β−1) e^(−(x/λ)^β), x > 0
  MGF Σ_{n=0}^∞ ((tλ)^n / n!) Γ(1 + n/β)

Beta(α, β): f(x) = x^(α−1)(1 − x)^(β−1) / B(α, β), 0 ≤ x ≤ 1
  MGF ₁F₁(α; α + β; t)

Normal(μ, σ²): f(x) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)), x ∈ R
  MGF e^(μt + σ²t²/2)

Lognormal: f(x) = (1/(xσ√(2π))) e^(−(log x − μ)²/(2σ²)), x > 0
  MGF: none

Cauchy(μ, σ): f(x) = 1/(πσ(1 + (x − μ)²/σ²)), x ∈ R
  MGF: none

t_m: f(x) = Γ((m+1)/2) / (√(mπ) Γ(m/2)) · (1 + x²/m)^(−(m+1)/2), x ∈ R
  MGF: none

F: f(x) = (β/α)^β x^(α−1) / (B(α, β) (x + β/α)^(α+β)), x > 0
  MGF: none

Double Exp.(μ, σ): f(x) = e^(−|x−μ|/σ)/(2σ), x ∈ R
  MGF e^(μt) / (1 − σ²t²), |t| < 1/σ

Pareto(σ, α): f(x) = ασ^α / x^(α+1), x > σ
  MGF: none

Gumbel(μ, σ): f(x) = (1/σ) e^(−(x−μ)/σ) e^(−e^(−(x−μ)/σ)), x ∈ R
  MGF e^(μt) Γ(1 − σt), t < 1/σ
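Each MGF entry can be checked against the defining integral E[e^(tX)]. A sketch for the exponential row; λ = 2 and t = 0.25 are illustrative choices satisfying the condition t < 1/λ:

```python
from math import exp

# Verify the Exponential(lambda) MGF entry E[e^{tX}] = (1 - lam*t)^(-1)
# by numerically integrating e^{tx} * f(x) over x >= 0 (illustrative lam, t).
lam, t = 2.0, 0.25

h = 0.001  # crude Riemann sum on (0, 200]; the integrand decays like e^{-0.25x}
mgf = sum(exp(t * (i * h)) * exp(-(i * h) / lam) / lam
          for i in range(1, 200000)) * h

assert abs(mgf - 1.0 / (1.0 - lam * t)) < 1e-3
```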
A.3 Normal Table

Standard normal percentiles z_α, where P(Z > z_α) = α:

α        z_α
.25      .675
.2       .84
.1       1.28
.05      1.645
.025     1.96
.02      2.055
.01      2.33
.005     2.575
.001     3.08
.0001    3.72
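The percentiles above can be reproduced by inverting the standard normal CDF, which is expressible through the error function as Φ(z) = (1 + erf(z/√2))/2. A bisection sketch:

```python
from math import erf, sqrt

# Reproduce z_alpha entries by bisecting Phi(z) = 1 - alpha.
def Phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def z_alpha(alpha):
    lo, hi = 0.0, 10.0
    for _ in range(60):  # bisection: interval width shrinks to ~1e-17
        mid = 0.5 * (lo + hi)
        if Phi(mid) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

assert abs(z_alpha(0.025) - 1.96) < 0.005
assert abs(z_alpha(0.05) - 1.645) < 0.005
assert abs(z_alpha(0.1) - 1.28) < 0.005
```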
Standard normal CDF Φ(z); the row gives z to one decimal, and the column adds the second decimal:

z     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
3.5 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998
3.6 0.9998 0.9998 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
3.7 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
3.8 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
3.9 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
4.0 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
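The table entries are Φ(z) rounded to four decimals, and a few can be spot-checked directly with math.erf:

```python
from math import erf, sqrt

# Spot-check entries of the standard normal CDF table:
# Phi(z) = (1 + erf(z / sqrt(2))) / 2.
def Phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# (z, table value) pairs taken from the table above
checks = [(0.00, 0.5000), (1.00, 0.8413), (1.96, 0.9750),
          (2.58, 0.9951), (3.00, 0.9987)]
for z, p in checks:
    assert abs(Phi(z) - p) < 1e-4  # table values are rounded to four decimals
```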
Author Index
M
Madsen, R., 339
Mahalanobis, P., 208
Major, P., 421, 425, 426, 537
Mallows, C.L., 570
Martynov, G., 238
Mason, D.M., 535, 537, 570
Massart, P., 242
McDiarmid, C., 485, 487, 488, 560
McKean, H., 390
McLachlan, G., 705, 711, 714, 739
Mee, R., 157
Mengersen, K., 667, 669, 673

R
Rachev, S.T., 505
Rao, C.R., 20, 62, 69, 157, 210, 505, 519, 520, 522
Rao, R.R., 71, 76, 82, 249, 308
Read, C., 157
Reiss, R., 221, 231, 234, 236, 238, 285, 323, 326, 328, 505, 516
Rényi, A., 375, 380, 389
Resnick, S., 221, 402
Révész, P., 254, 420, 421, 423, 425, 427, 527
Y
Yang, G., 704
Yao, Y., 734
Yor, M., 402
Yu, B., 674

Z
Zabell, S., 67
Zeitouni, O., 560
Zinn, J., 538, 694
Zolotarev, V.M., 505
Zygmund, A., 20
Subject Index
A
ABO allele, 709–711
Acceptance distribution, 664
Accept-reject method
  beta generation, 627–628
  described, 625
  generation, standard normal values, 626–627
  scheme efficiency, 628
Almost surely, 252, 253, 256, 259, 261–262, 267, 287, 288, 314, 410, 461, 470, 491–494, 496, 498, 503, 535, 537, 542, 555, 576, 577, 578, 615, 618, 632, 641, 694, 725
Ancillary
  definition, 601
  statistic, 602, 604, 605
Anderson's inequality, 214, 218
Annulus, Dirichlet problem, 417–418
Aperiodic state, 351
Approximation of moments
  first-order, 278, 279
  scalar function, 282
  second-order, 279, 280
  variance, 281
Arc sine law, 386
ARE. See Asymptotic relative efficiency
Arrival rate, 552, 553
  intensity function, 455
  Poisson process, 439, 441, 442, 446, 460–462
Arrival time
  definition, 438
  independent Poisson process, 445
  interarrival times, 438
Asymptotic independence, 337
Asymptotic relative efficiency (ARE)
  IQR-based estimation, 327
  median, 335
  variance ratio, 325
Asymptotics
  convergence, distribution
    CDF, 262
    CLT, 266
    Cramér–Wold theorem, 263–264
    Helly's theorem, 264
    LIL, 267
    multivariate CLT, 267
    Pólya's theorem, 265
    Portmanteau theorem, 265–266
  densities and Scheffé's theorem, 282–286
  laws, large numbers
    Borel-Cantelli Lemma, 254–256
    Glivenko-Cantelli theorem, 258–259
    strong, 256–257
    weak, 256–258
  moments, convergence (see Convergence of moments)
  notation and convergence, 250–254
  preservation, convergence
    continuous mapping, 260–261
    Delta theorem, 269–272
    multidimension, 260
    sample correlation, 261–262
    sample variance, 261
    Slutsky's theorem, 268–269
    transformations, 259
  variance stabilizing transformations, 272–274
Asymptotics, extremes and order statistics
  application, 325–326
    several, 326–327
    single, 323–325
  convergence, types theorem
    limit distributions, types, 332
    Mills ratio, 333
  distribution theory, 323
Initial distribution, 369, 641, 651, 652, 655, 658, 668
  definition, 340
  Markov chain, 361
  weak ergodic theorem, 366
Inner product space, 726, 727, 728, 740–742
Inspection paradox, 444
Integrated Brownian motion, 432
Intensity, 557, 682
  function, 454, 455, 460
  piecewise linear, 454–456
Interarrival time, 238, 240, 241
  conditional distribution, 453
  Poisson process, 438
  transformed process, 453
Interquartile range (IQR)
  definition, 327
  limit distribution, 335
Invariance principle
  application, 434
  convergence, partial sum process, 423–424
  Donsker's, 424–425, 538
  partial sums, 421
  Skorohod embedding theorem, 422, 423
  and statistics
    Cramér–von Mises statistic, 532–533
    Kolmogorov–Smirnov statistic, 531–532
  strong
    KMT theorem, 425–427
    partial sum process, 530–531, 537
  uniform metric, 422
  weak, 536
Invariant measure, 663
Inverse Gamma density, 59
Inverse quadratic kernel, 736, 744
Inversion of mgf
  moment-generating function, 53
Inversion theorem
  CDF, 298–299
  failure, 307–308
  Plancherel's identity, 298
IQR. See Interquartile range
Irreducible, 569, 581, 640–643, 645, 647, 653–655, 660, 661, 663, 667, 671, 685
  definition, 349
  loop chains, 371
  regular chain, 363
Isolated points, 415, 416
Isotropic, 717, 742, 743
Iterated expectation, 105, 119, 146, 162, 172
  applications, 105–106
  formula, 105, 146, 167, 468
  higher-order, 107
Iterated variance, 119, 623
  applications, 105–106
  formula, 105

J
Jacobian formula
  CDF, 42
  technique, 45
Jacobi polynomials, 741, 742
Jensen's inequality, 52–53, 280, 468, 491, 494, 516
Joint cumulative distribution function, 97, 112, 124, 177, 193, 261, 263, 447
Joint density
  bivariate uniform, 125–126
  continuous random vector, 125
  defined, 123–124
  dimensionality curse, 131–132
  nonuniform, uniform marginals, 128–129
Joint moment-generating function (mgf), 112–114, 121, 156, 597
Joint probability mass function (pmf), 121, 148, 152, 585, 599, 608
  definition, 96, 112
  function expectation, 100

K
Karhunen–Loève expansion, 408, 533, 554
Kernel classification rule, 724–731
Kernels and classification
  annihilator, 743
  definition, 715
  density, 719–723
  estimation
    density estimator, 721
    histograms, 720
    optimal local bandwidth, 723
    plug-in, 720
    series estimation, 720–721
  Fourier, 717
  kernel plots, 742
  product kernels, 743
  smoothing
    definition, 715–716
    density estimation, 718
  statistical classification
    exponential kernel, 736
    Fisher linear, 735
    Gaussian radial kernel, 736
    Hilbert norm, 729
    Hilbert space, 725–727
P
Packing number, 548
Parallelogram law, 727
Pareto density, 60
Partial correlation, 204
Partial sum process
  convergence, Brownian motion, 423–424
  interpolated, 425
  strong invariance principle, 537
Pattern problems
  discrete probability, 26
  recursion relation, 27
  variance formula, 27
Period
  burn-in, 675
  circular block bootstrap (CBB), 702
  computation, 351
  intensity function, 454, 455
Perron–Frobenius theorem, 361–363, 569, 654
Plancherel's identity, 298–299
Plug-in estimate, 212, 630, 704, 720, 740
Poincaré's lemma, 191
Poisson approximation
  applications, 35
  binomial random variable, 34
  confidence intervals, 80–82
Poisson distribution
  characteristic function, 295–296
  confidence interval, 80–81
  moment-generating function, 23–24
Poissonization, 116–118, 121, 243, 244
Poisson point process
  higher-dimensional
    intensity/mean measure, 450
    Mapping theorem, 452–453
    multinomial distribution, 451–452
    Nearest Event site, 451
    Nearest Neighbor, 462
    polar coordinates, 460
Poisson process, 551–553, 557, 636, 682
  Campbell's theorem and shot noise
    characteristic functional, 456–457
    shot effects, 456
    stable laws, 458
  1-D nonhomogeneous processes
    intensity and mean function, plots, 455
    mapping theorem, 453–454
    piecewise linear intensity, 454–456
  higher-dimensional Poisson point processes
    distance, nearest event site, 451
    mapping theorem, 452–453
  stationary/homogeneous, 450
Polar coordinates
  spherical calculations
    definition, 182–183
    dimensionality curse, 184
    joint density, 183–184
    spherically symmetric facts, 185
  two dimensions
    n uniforms product, 181–182
    polar transformation usefulness, 181
    transformation, 180–181
  use, 134–135
Polar transformation
  spherical calculations, 182–185
  use, 181
Pólya's criterion
  characteristic functions, 308
  inversion theorem failure, 307–308
  stable distributions, 307
Pólya's formula, 382–383
Pólya's theorem
  CDF, 265, 509
  return probability, 382–383
Pólya's urn, 466, 493
Polynomial kernel, 736
Portmanteau theorem
  definition, 265
Positive definiteness, 568, 717, 732
Positive recurrent, 356–357, 361, 363, 365, 651
Posterior density
  definition, 147–148
  exponential mean, 148–149
  normal mean, 151–152
  Poisson mean, 150
Posterior mean
  approximation, 651
  binomial, 149–150
Prior density
  definition, 148
  uses, 151
Probability metrics
  differential metrics, 519–522
  metric inequalities, 515–518
  properties
    coupling identity, 508
    f-divergences, 513–514
    Hellinger distance, 510–511
    joint and marginal distribution distances, 509–510
  standard probability, statistics
    f-divergences, 507
    Hellinger metric, 506–507
    Kolmogorov metric, 506
    Kullback–Leibler distance, 507
V
Vapnik–Chervonenkis (VC)
  dimension, 539–541
  subgraph, 545, 546

Z
Zygmund–Marcinkiewicz SLLN, 693–694